Description
Hi,
I have been trying to use the read_fwf() function in pandas 0.16.0. I supplied a list of 9 tuples for the colspec parameter and a list of 9 column names for the names parameter.
Here's some code:
df = pd.read_fwf('c:/6starts.tab',
                 header=None,
                 colspec=[(0, 11), (11, 14), (14, 43), (43, 49), (49, 69), (69, 98), (98, 110), (110, 133), (133, 145)],
                 names=['lotid', 'lottype', 'part', 'qty', 'startdate', 'proc', 'rnum', 'material', 'who'])
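For reference, the parsing I expect treats each tuple as a half-open (start, end) character range on every line. The sketch below illustrates that in plain Python; split_fixed_width is a hypothetical helper (not part of pandas) and the sample record is invented, padded to the first three widths from the call above.

```python
# Hypothetical helper, not part of pandas: slice one line by (start, end) pairs.
def split_fixed_width(line, colspecs):
    """Return the stripped text found in each half-open [start, end) range."""
    return [line[start:end].strip() for start, end in colspecs]


# Invented sample record, padded to widths 11, 3 and 29 characters.
sample = "LOT0001".ljust(11) + "D".ljust(3) + "8SL-EXAMPLE PART".ljust(29)

fields = split_fixed_width(sample, [(0, 11), (11, 14), (14, 43)])
print(fields)          # three fields, one per colspec tuple
print(sample.split())  # whitespace splitting yields a different field count
```

Here the fixed-width slices keep "8SL-EXAMPLE PART" as a single field, whereas splitting on whitespace breaks it in two, so the two approaches can disagree on the number of columns.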
Here's the traceback (sorry about the formatting):
ValueError                                Traceback (most recent call last)
in ()
     10     (110, 133),
     11     (133, 145)],
---> 12     names=['lotid', 'lottype', 'part', 'qty', 'startdate', 'proc', 'rnum', 'material', 'who']
     13     )

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in read_fwf(filepath_or_buffer, colspecs, widths, **kwds)
    499     kwds['colspecs'] = colspecs
    500     kwds['engine'] = 'python-fwf'
--> 501     return _read(filepath_or_buffer, kwds)
    502
    503

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    254     return parser
    255
--> 256     return parser.read()
    257
    258     _parser_defaults = {

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
    713     raise ValueError('skip_footer not supported for iteration')
    714
--> 715     ret = self._engine.read(nrows)
    716
    717     if self.options.get('as_recarray'):

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in read(self, rows)
   1561     content = content[1:]
   1562
-> 1563     alldata = self._rows_to_cols(content)
   1564     data = self._exclude_implicit_index(alldata)
   1565

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in _rows_to_cols(self, content)
   1936     msg = ('Expected %d fields in line %d, saw %d' %
   1937     (col_len, row_num + 1, zip_len))
-> 1938     raise ValueError(msg)
   1939
   1940     if self.usecols:

ValueError: Expected 9 fields in line 1, saw 7
The problem I'm seeing is that pandas does not appear to be using colspec to parse the file. Instead it seems to be using whitespace and is detecting 7 distinct columns based on that.
I have tried specifying delimiter='' to see if that would make a difference, but it doesn't fix it. I have also tried specifying usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8], which seems to prevent the exception, but pandas still reads in only 7 columns and pads the last 2 columns with NaN.
In the snapshot above, the first column is fine, however the 2nd and 3rd columns are merged together (the 'D ' and the '8SL*****' bit). The next snapshot shows some of the desired column widths (highlighted in pink):
I think the issue is in the PythonParser class within the parsers.py file, in the _rows_to_cols() method, but I'm not familiar enough with it yet to attempt any sort of fix.
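One detail may be worth ruling out first: the read_fwf signature shown in the traceback names the parameter colspecs (plural), while the call above passes colspec. The following is a minimal sketch under the assumption that the plural keyword is the one read_fwf acts on; the two-row sample data and its widths (11, 3, 12) are invented for illustration and do not come from 6starts.tab.

```python
import io

import pandas as pd

# Invented sample: each field is padded to a fixed width (11, 3, 12).
rows = [
    "LOT0001".ljust(11) + "D".ljust(3) + "8SL PART A".ljust(12),
    "LOT0002".ljust(11) + "P".ljust(3) + "8SL PART B".ljust(12),
]
data = "\n".join(rows) + "\n"

df = pd.read_fwf(
    io.StringIO(data),
    header=None,
    colspecs=[(0, 11), (11, 14), (14, 26)],  # plural keyword, as in the signature
    names=["lotid", "lottype", "part"],
)

print(df.shape)
print(df["part"].tolist())
```

If the same call with the full 9-tuple list and colspecs= parses the real file into 9 columns, the behaviour described here would come down to the keyword spelling rather than to _rows_to_cols().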
Here is the version information from pandas.show_versions():
INSTALLED VERSIONS
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.16.0
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.1.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
Please let me know if you need any more information.
Thanks,
Adrian.