
BUG: Occasional "tokenizing data error" when reading in large files with read_csv() #40587

Open
normanius opened this issue Mar 23, 2021 · 6 comments
Labels: Bug, IO CSV (read_csv, to_csv)

Comments


normanius commented Mar 23, 2021

I sometimes receive an Error tokenizing data. C error: ... for tables that can normally be read with read_csv() without any problems.

Attached is the .csv file sample.tar.gz with which I can reproduce the problem.

import pandas as pd
path = "sample.csv"
pd.read_csv(path, sep=";", header=[0,1])

This raises the following exception:

ParserError: Error tokenizing data. C error: Expected 15 fields in line 983050, saw 23

The tables I try to read have 23 columns, as correctly declared in the file header. However, the files contain a few corrupted lines (<0.01% of all lines) in which the data for 8 columns is omitted; for those lines, 8 delimiters are missing.

I'm working with about 100 different files containing 1M to 20M lines each. All files suffer from the same kind of ill-formatted lines. read_csv() handles those lines gracefully most of the time; only for the file provided above does it raise an exception.

I can avoid the exception in any of the following ways (see the example calls after this list):

  • Deleting a couple of unrelated (healthy) lines at the beginning of the document
  • Setting engine="python" (slow)
  • Setting low_memory=False
  • Setting error_bad_lines=False (drops a couple of lines)
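
For illustration, the last three workarounds correspond to calls like the following (a sketch based on the snippet above; error_bad_lines exists in pandas 1.2 but was deprecated in later versions):

pd.read_csv(path, sep=";", header=[0,1], engine="python")        # slow, but succeeds
pd.read_csv(path, sep=";", header=[0,1], low_memory=False)       # succeeds
pd.read_csv(path, sep=";", header=[0,1], error_bad_lines=False)  # drops the bad lines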

In summary, I think read_csv() behaves inconsistently when running with low_memory=True and the C engine.

I first thought the problem was related to issue #11166, but I'm not 100% sure.

sample.tar.gz

I'm running Python 3.8 and pandas 1.2.3. See details below.

Expected Output

No exception for file sample.csv, regardless of the settings for engine and low_memory.

System

INSTALLED VERSIONS
------------------
commit           : f2c8480af2f25efdbd803218b9d87980f416563e
python           : 3.8.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.7.0
Version          : Darwin Kernel Version 18.7.0: Fri Oct 30 12:37:06 PDT 2020; root:xnu-4903.278.44.0.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : None.UTF-8

pandas           : 1.2.3
numpy            : 1.20.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.3.3
setuptools       : 54.1.2
Cython           : None
pytest           : 6.2.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 7.13.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : None
fsspec           : 0.8.5
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : 1.0.0
pyxlsb           : None
s3fs             : None
scipy            : 1.3.2
sqlalchemy       : None
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
normanius added the Bug and Needs Triage labels on Mar 23, 2021

nmay231 commented Mar 25, 2021

@normanius When I read the data using any of the methods that "fixed" the problem, I ended up with data in the wrong columns, e.g. the datetime columns were shifted a couple of columns to the left.

I understand the qualms about the inconsistent behavior with slightly different files, but I would think that inconsistent data is more of an issue.

In any case, I defer to those more knowledgeable about read_csv to address consistency issues.

normanius (Author) commented

Correct, the data is inconsistent. Unfortunately, I can only fix this after the fact, and pandas is my tool of choice here. The problem is actually relatively easy to fix - provided that pandas is able to read the file.

I created this report because the observed behavior of read_csv() occurs only sometimes, which may hint at a possible flaw in the algorithm. But I also understand that read_csv() cannot handle all possible kinds of inconsistencies.

meettaraviya commented

Do we know which version introduced this bug? It's really annoying that we can't get around this

jbrockmendel added the IO CSV (read_csv, to_csv) label on Jun 6, 2021
mroeschke removed the Needs Triage label on Aug 23, 2021

phofl commented Jan 28, 2022

As an explanation of what is going on here:

If low_memory is True, the file is read in chunks. Unfortunately, every chunk determines the number of columns for itself. One of your chunks starts with 15 columns, hence the error when it encounters 23.

One workaround (if you know the number of columns) is to set the names argument. This fixes the expected column count at 23 and does not raise an error.
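
A minimal sketch of that workaround, assuming the file has two header rows and 23 columns; the generated column names are placeholders, not the real ones:

import pandas as pd

names = [f"col{i}" for i in range(23)]  # 23 placeholder column names

df = pd.read_csv(
    "sample.csv",
    sep=";",
    header=None,   # don't infer the header...
    skiprows=2,    # ...skip the two original header rows instead
    names=names,   # fixes the expected number of columns at 23
)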

ronaldgevern commented

I think the problem is obvious - you're running out of memory because you're trying to load so much data into memory at once, and then process it.

You need to either:

  • get a machine with more memory, or
  • re-architect the solution to use a pipelined approach, with a generator or coroutine pipeline that does the processing stepwise over your data (see the sketch after this list).

The problem with the first approach is that it won't scale indefinitely and is expensive. The second way is the right way to do it, but needs more coding.
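
A minimal sketch of such a chunked pipeline, assuming the same semicolon-separated file; process_chunk is a hypothetical placeholder for the per-chunk work:

import pandas as pd

def process_chunk(chunk):
    ...  # hypothetical placeholder: do the per-chunk processing here

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole file into memory at once (usable as a context
# manager since pandas 1.2)
with pd.read_csv("sample.csv", sep=";", chunksize=100_000) as reader:
    for chunk in reader:
        process_chunk(chunk)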

Also, when pandas.parser.CParserError: Error tokenizing data is raised while reading a file written by pandas.to_csv(), it might be because there is a carriage return ('\r') in a column name, in which case to_csv() will actually write the subsequent column names into the first column of the data frame, causing a mismatch between the number of columns in the first X rows. This mismatch is one cause of the CParserError.


phofl commented Jul 12, 2022

This has nothing to do with memory issues. This is an implementation bug that occurs when reading in chunks.
