BUG: Occasional "tokenizing data error" when reading in large files with read_csv() #40587

@normanius

Description

I sometimes receive an Error tokenizing data. C error: ... exception for tables that read_csv() can normally read without any problems.

The attached archive sample.tar.gz contains a .csv file with which I can reproduce the problem.

import pandas as pd
path = "sample.csv"
pd.read_csv(path, sep=";", header=[0,1])

This raises the following exception:

ParserError: Error tokenizing data. C error: Expected 15 fields in line 983050, saw 23
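To inspect the line named in the exception, something like the following sketch may help. The helper names parse_tokenizing_error and read_line are hypothetical, the regular expression only targets the message wording quoted above, and the line number is assumed to be 1-based, which I have not verified against the parser.

```python
import re
from itertools import islice

# Matches the wording quoted above; other pandas versions may phrase it differently.
_PATTERN = re.compile(r"Expected (\d+) fields in line (\d+), saw (\d+)")


def parse_tokenizing_error(message):
    """Return (expected, line_no, saw) as ints, or None if the text does not match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    expected, line_no, saw = (int(group) for group in match.groups())
    return expected, line_no, saw


def read_line(lines, line_no):
    """Return line `line_no` (assumed 1-based) from an iterable of lines, or None."""
    return next(islice(lines, line_no - 1, line_no), None)


msg = "Error tokenizing data. C error: Expected 15 fields in line 983050, saw 23"
print(parse_tokenizing_error(msg))  # (15, 983050, 23)
```

With the real file, passing an open file handle for sample.csv to read_line together with the parsed line number should return the raw text of the reported line without loading the whole file into memory.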

The tables I am trying to read have 23 columns, as declared correctly in the file header. However, the files contain corrupted lines (very few, <0.01% of all lines) in which the data of 8 columns is omitted, so those lines are missing 8 delimiters.
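The short lines can be located independently of pandas by counting delimiters per line. This is only a rough sketch: the function name field_count_report and the tiny in-memory sample are made up for illustration, and it assumes that no field contains a quoted ";".

```python
import io
from collections import Counter


def field_count_report(lines, sep=";"):
    """Count fields per line.

    Returns a histogram {n_fields: n_lines} and the 1-based numbers of all
    lines whose field count differs from that of the first line.
    Assumes the separator never appears inside a quoted field.
    """
    histogram = Counter()
    deviating = []
    expected = None
    for line_no, line in enumerate(lines, start=1):
        n_fields = line.rstrip("\r\n").count(sep) + 1
        if expected is None:
            expected = n_fields  # the first header line defines the width
        histogram[n_fields] += 1
        if n_fields != expected:
            deviating.append(line_no)
    return histogram, deviating


# Hypothetical miniature of the real file: 2 header rows, 4 columns, 1 short line.
sample = io.StringIO("a;b;c;d\nu1;u2;u3;u4\n1;2;3;4\n5;6\n7;8;9;10\n")
histogram, deviating = field_count_report(sample)
print(dict(histogram))  # {4: 4, 2: 1}
print(deviating)        # [4]
```

For the real data, an open file handle can be passed in place of the StringIO object; the function streams line by line, so it should also work for files with millions of lines.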

I'm working with about 100 different files containing 1M to 20M lines. All files suffer from the same kind of ill-formatted lines. read_csv() gracefully handles those lines most of the time. It raises an exception only for the file provided above.

I can avoid the exception as follows:

  • Deleting a couple of unrelated (healthy) lines at the beginning of the document
  • Setting engine="python" (slow)
  • Setting low_memory=False
  • Setting error_bad_lines=False (drops a couple of lines)

In summary, I think read_csv() behaves inconsistently when running with low_memory=True and the C engine.

I first thought that the problem was related to issue #11166, but I'm not 100% sure.

sample.tar.gz

I'm running Python 3.8 and pandas 1.2.3. See details below.

Expected Output

No exception for file sample.csv, regardless of the settings for engine and low_memory.

System

INSTALLED VERSIONS
------------------
commit           : f2c8480af2f25efdbd803218b9d87980f416563e
python           : 3.8.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.7.0
Version          : Darwin Kernel Version 18.7.0: Fri Oct 30 12:37:06 PDT 2020; root:xnu-4903.278.44.0.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : None.UTF-8

pandas           : 1.2.3
numpy            : 1.20.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.3.3
setuptools       : 54.1.2
Cython           : None
pytest           : 6.2.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 7.13.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : None
fsspec           : 0.8.5
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : 1.0.0
pyxlsb           : None
s3fs             : None
scipy            : 1.3.2
sqlalchemy       : None
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
