Skip to content

Getting error while trying to read csv using pandas Python due to extra column values #26470

@JafferWilson

Description

@JafferWilson

Here is the scenario that I am trying to rid of:
I am trying to read the following type of csv:

para1,para2,para3,para4
1,2,3,4
1,2,3,4,5
1,2,3,4
2,3,4,5,6,7,8,9,0

I am using the following command and getting the following error:

>>> import pandas as pd
>>> df =pd.read_csv("test.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5

I tried to search for the issue and got this thread on SO:
Python Pandas Error tokenizing data

So, I tried. This is not what I was expecting. It is truncating the values.

>>> df =pd.read_csv("test.csv",error_bad_lines=False)
b'Skipping line 3: expected 4 fields, saw 5\nSkipping line 5: expected 4 fields, saw 9\n'
>>> df
para1  para2  para3  para4
0      1      2      3      4
1      1      2      3      4

What I wanted is something like this:
if there are extra values, then take the columns as the integer values with the highest column found in extra. then make the rest of the values as zero(0) till the last column and read the csv.

The output I am expecting is something like this:

>>> df =pd.read_csv("test.csv")
>>> df
   para1  para2  para3  para4    0    1    2    3    4
0      1      2      3      4  NaN  NaN  NaN  NaN  NaN
1      1      2      3      4  5.0  NaN  NaN  NaN  NaN
2      1      2      3      4  NaN  NaN  NaN  NaN  NaN
3      2      3      4      5  6.0  7.0  8.0  9.0  0.0
>>> df = df.fillna(0)
>>> df
   para1  para2  para3  para4    0    1    2    3    4
0      1      2      3      4  0.0  0.0  0.0  0.0  0.0
1      1      2      3      4  5.0  0.0  0.0  0.0  0.0
2      1      2      3      4  0.0  0.0  0.0  0.0  0.0
3      2      3      4      5  6.0  7.0  8.0  9.0  0.0

But please take a note of, I do not want to take care of the column. Instead the program must automatically understand and make the column headers as given above.

Second, please try to avoid suggesting me to write the header. As there can be number of columns where I might not able to write the header but just leave it as it is. so the missing column header will be the number integer as stated above. Do someone have any solution for the query, please let me know?

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Windows
OS-release:
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.0
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.6
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.1
bs4: 4.7.1
html5lib: 0.9999999
sqlalchemy: 1.2.10
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions