Description
I sometimes receive an "Error tokenizing data. C error: ..." exception for tables that can normally be read with read_csv() without any problems.
Find attached the .csv file sample.tar.gz for which I can reproduce the problem.
import pandas as pd
path = "sample.csv"
pd.read_csv(path, sep=";", header=[0,1])
This raises the following exception:
ParserError: Error tokenizing data. C error: Expected 15 fields in line 983050, saw 23
The tables I try to read have 23 columns, as declared correctly in the file header. However, the files contain corrupted lines (very few, <0.01% of all lines), where the data of 8 columns are omitted. For those lines, 8 delimiters are missing.
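To locate such lines independently of pandas, the field count of every row can be checked with the standard library. The following is only a minimal sketch: the helper name find_short_rows and the tiny in-memory sample are made up for illustration, and the real files would be opened with open(path, newline="") instead.

```python
import csv
import io


def find_short_rows(lines, expected_fields, delimiter=";"):
    """Return (line_number, field_count) for every row whose field count
    differs from expected_fields. Blank lines are reported with count 0."""
    reader = csv.reader(lines, delimiter=delimiter)
    bad = []
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical sample: 3 declared columns, one row with 2 delimiters missing
sample = io.StringIO("a;b;c\n1;2;3\n4\n5;6;7\n")
bad_rows = find_short_rows(sample, expected_fields=3)
print(bad_rows)
```

For the files described here, expected_fields would be 23, and the reported line numbers can be compared against the line number in the ParserError message.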
I'm working with about 100 different files containing 1M to 20M lines each. All files suffer from the same kind of ill-formatted lines. read_csv() handles those lines gracefully most of the time. Only for the file provided above does it raise an exception.
I can avoid the exception as follows:
- Delete a couple of unrelated (healthy) lines at the beginning of the document
- Set engine="python" (slow)
- Set low_memory=False
- Set error_bad_lines=False (drops a couple of lines)
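The two parser-related workarounds can be sketched as below. This is an illustration under assumptions, not a reproduction: the tiny in-memory table and the helper read_with are made up, and a table this small does not trigger the exception, which so far only shows up with the attached sample.csv. error_bad_lines is left out because it depends on the pandas version (later releases deprecate it in favor of on_bad_lines).

```python
import io

import pandas as pd

# Hypothetical stand-in for sample.csv: 3 declared columns, one short row
csv_text = "a;b;c\n1;2;3\n4\n5;6;7\n"


def read_with(**kwargs):
    """Read the sample with the same separator as in the report,
    plus any extra read_csv keyword arguments."""
    return pd.read_csv(io.StringIO(csv_text), sep=";", **kwargs)


df_default = read_with()                  # C engine, low_memory=True
df_python = read_with(engine="python")    # slower pure-Python parser
df_full = read_with(low_memory=False)     # C engine, file parsed in one pass

for name, df in [("default", df_default), ("python", df_python), ("full", df_full)]:
    # Short rows are padded with NaN instead of raising
    print(name, df.shape, int(df.isna().sum().sum()))
```

On the small sample all three calls are expected to pad the short row with NaN, which is the behavior the report expects for sample.csv as well.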
In summary, I think read_csv() behaves inconsistently when running with low_memory=True and the C engine.
I first thought that the problem was related to issue #11166, but I'm not 100% sure.
I'm running Python 3.8 and pandas 1.2.3. See details below.
Expected Output
No exception for file sample.csv, regardless of the settings for engine and low_memory.
System
INSTALLED VERSIONS
------------------
commit : f2c8480af2f25efdbd803218b9d87980f416563e
python : 3.8.0.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Fri Oct 30 12:37:06 PDT 2020; root:xnu-4903.278.44.0.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.UTF-8
pandas : 1.2.3
numpy : 1.20.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.3.3
setuptools : 54.1.2
Cython : None
pytest : 6.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 1.0.0
pyxlsb : None
s3fs : None
scipy : 1.3.2
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None