Reading of fixed width file is detecting incorrect number of columns #10198

@amckiern

Hi,
I have been trying to use the read_fwf() function in pandas 0.16.0. I have supplied a list of 9 tuples for the colspec parameter, and a list of 9 column names for the names parameter.

Here's some code:

import pandas as pd

df = pd.read_fwf('c:/6starts.tab',
    header=None,
    colspec=[(0, 11), (11, 14), (14, 43), (43, 49), (49, 69), (69, 98), (98, 110), (110, 133), (133, 145)],
    names=['lotid', 'lottype', 'part', 'qty', 'startdate', 'proc', 'rnum', 'material', 'who'])

Here's the traceback (sorry about the formatting):


ValueError Traceback (most recent call last)
in ()
10 (110, 133),
11 (133, 145)],
---> 12 names=['lotid', 'lottype', 'part', 'qty', 'startdate', 'proc', 'rnum', 'material', 'who']
13 )

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in read_fwf(filepath_or_buffer, colspecs, widths, **kwds)
499 kwds['colspecs'] = colspecs
500 kwds['engine'] = 'python-fwf'
--> 501 return _read(filepath_or_buffer, kwds)
502
503

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
254 return parser
255
--> 256 return parser.read()
257
258 _parser_defaults = {

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
713 raise ValueError('skip_footer not supported for iteration')
714
--> 715 ret = self._engine.read(nrows)
716
717 if self.options.get('as_recarray'):

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in read(self, rows)
1561 content = content[1:]
1562
-> 1563 alldata = self._rows_to_cols(content)
1564 data = self._exclude_implicit_index(alldata)
1565

C:\Program Files\Anaconda\lib\site-packages\pandas\io\parsers.py in _rows_to_cols(self, content)
1936 msg = ('Expected %d fields in line %d, saw %d' %
1937 (col_len, row_num + 1, zip_len))
-> 1938 raise ValueError(msg)
1939
1940 if self.usecols:

ValueError: Expected 9 fields in line 1, saw 7


The problem I'm seeing is that pandas does not appear to be using colspec to parse the file. Instead it seems to be using whitespace and is detecting 7 distinct columns based on that.
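
To illustrate the symptom, here is a minimal sketch of what whitespace-based inference does when no column specification reaches the parser. It assumes a recent pandas version, where colspecs defaults to 'infer'; the two sample rows are made up, since the contents of 6starts.tab are not reproduced here.

```python
import io
import pandas as pd

# Hypothetical sample rows (not taken from 6starts.tab). The intended layout
# is three fields: chars 0-10, 11-13 and 14-25, where the last field
# contains spaces of its own.
sample = (
    "LOT0000001 D  8SL WIDGET A\n"
    "LOT0000002 E  8SL GADGET B\n"
)

# No colspecs given: column boundaries are guessed from runs of whitespace.
inferred = pd.read_fwf(io.StringIO(sample), header=None)

print(inferred.shape[1])  # number of columns the parser guessed
```

With these rows the guess comes out at five columns rather than the intended three, because every run of spaces is treated as a boundary. That is the same kind of mismatch as the "Expected 9 fields in line 1, saw 7" error.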

I have tried specifying delimiter='' to see if that would make a difference, but it doesn't fix it. I have also tried specifying usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8], which seems to prevent the exception from occurring, but it still only reads in 7 columns and pads the last 2 columns with NaN.

[Screenshot: pandas_1]

In the snapshot above, the first column is fine, however the 2nd and 3rd columns are merged together (the 'D ' and the '8SL*****' bit). The next snapshot shows some of the desired column widths (highlighted in pink):

[Screenshot: pandas_4]

I think the issue is in the PythonParser class within the parsers.py file, in the _rows_to_cols() method, but I'm not familiar enough with it yet to attempt any sort of fix.
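
One thing that may be worth ruling out before digging into _rows_to_cols(): the read_fwf signature shown in the traceback names the parameter colspecs (plural), while the call above passes colspec. The following is a minimal sketch, assuming that spelling is what matters. The sample rows and the three-column layout are hypothetical stand-ins for the real file, and the behaviour has only been reasoned about for recent pandas versions, not 0.16.0.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real 6starts.tab is not reproduced here.
sample = (
    "LOT0000001 D  8SL WIDGET A\n"
    "LOT0000002 E  8SL GADGET B\n"
)

# First three (start, end) pairs from the report, with the last one
# shortened to fit the sample. Each end index is exclusive.
colspecs = [(0, 11), (11, 14), (14, 26)]
names = ['lotid', 'lottype', 'part']

# The keyword is spelled `colspecs`, matching the signature in the traceback.
df = pd.read_fwf(io.StringIO(sample), header=None,
                 colspecs=colspecs, names=names)

print(df)
```

If the plural keyword is honoured, the third field keeps its internal spaces as a single value instead of being split on whitespace.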

Here is the version information from pandas.show_versions():

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.0
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.1.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None

Please let me know if you need any more information.

Thanks,
Adrian.
