BUG: CSV file with carriage return breaks pandas #51141

Open
3 tasks done
pgdr opened this issue Feb 3, 2023 · 4 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

pgdr commented Feb 3, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

[.../pd]$ python3 -m venv e
[.../pd]$ source e/bin/activate
(e) [.../pd]$ pip install pandas
...
Successfully installed numpy-1.24.1 pandas-1.5.3 python-dateutil-2.8.2 pytz-2022.7.1 six-1.16.0
(e) [.../pd]$ wget https://gist.githubusercontent.com/pgdr/e6c6ad236666909426cf841fe2704050/raw/cd0910f56ec3791fe0fd1679e426424fcdb39c24/mwe.csv
...
(e) [.../pd]$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> len(pandas.read_csv("mwe.csv"))
<stdin>:1: DtypeWarning: Columns (1,4,5) have mixed types. Specify dtype option on import or set low_memory=False.
131073
>>>
(e) [.../pd]$ wc mwe.csv
  3   4 195 mwe.csv
(e) [.../pd]$ nano mwe.csv  # delete _any_ character before CR
(e) [.../pd]$ wc mwe.csv
  3   4 194 mwe.csv
(e) [.../pd]$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> len(pandas.read_csv("mwe.csv"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/pd/e/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/scratch/pd/e/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/scratch/pd/e/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/scratch/pd/e/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/scratch/pd/e/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/scratch/pd/e/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Issue Description

The gist linked above (pgdr/e6c6ad236666909426cf841fe2704050) contains a CSV file, mwe.csv, with 3 lines and 195 bytes, one of which is a carriage return.
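
To see which line terminators a file like this actually contains, a small stdlib-only check along the following lines may help. This is a sketch, not part of the original report: the function name `count_line_terminators` and the sample bytes are made up for illustration and are not the contents of mwe.csv.

```python
def count_line_terminators(data: bytes) -> dict:
    """Count CRLF pairs, bare CRs, and bare LFs in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contributes one CR and one LF, so subtract the pairs.
    bare_cr = data.count(b"\r") - crlf
    bare_lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "bare_cr": bare_cr, "bare_lf": bare_lf}


# Hypothetical sample with one stray carriage return inside a line.
sample = b"a,b\n1,2\rx\n3,4\n"
print(count_line_terminators(sample))
```

For the real file, the same function could be applied to `open("mwe.csv", "rb").read()`; reading in binary mode matters, because text mode would translate the carriage return before it can be counted.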

The CSV file is "broken" so it is fine that Pandas doesn't open it correctly.

However, when parsed with read_csv, pandas returns a dataframe with 131073 rows!

In addition, if you delete any one character before the carriage return, read_csv instead raises

Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
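
Until the parser handles this input, one possible workaround is to drop stray carriage returns before the data reaches read_csv. The sketch below is an assumption, not something proposed in this thread: the helper name `strip_bare_cr` and the sample bytes are hypothetical, and it has not been verified against mwe.csv.

```python
import io
import re

# A carriage return that is not the first half of a CRLF pair.
_BARE_CR = re.compile(rb"\r(?!\n)")


def strip_bare_cr(data: bytes, replacement: bytes = b"") -> io.BytesIO:
    """Return a binary buffer with bare CRs replaced; CRLF pairs are kept."""
    cleaned = _BARE_CR.sub(replacement, data)
    return io.BytesIO(cleaned)


# Hypothetical sample: one bare CR in the second line, CRLF at the end.
sample = b"a,b\n1,2\rx\n3,4\r\n"
print(strip_bare_cr(sample).getvalue())
```

The returned buffer is file-like, so it can be passed to pandas.read_csv in place of a path. Whether that avoids the behaviour reported here is untested, and silently dropping bytes changes the data, so this only makes sense when the carriage return is known to be noise.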

Expected Behavior

read_csv should not overflow its buffer on this input.

Installed Versions

INSTALLED VERSIONS

commit : f06c96a
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-137-generic
Version : #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_DK.UTF-8
LOCALE : en_DK.UTF-8

pandas : 2.0.0.dev0+1401.gf06c96a93f
numpy : 1.24.1
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 44.0.0
pip : 20.0.2

@pgdr added the Bug and Needs Triage labels on Feb 3, 2023

rmhowe425 (Contributor) commented

take

@rmhowe425 removed their assignment on Feb 19, 2023

rmhowe425 (Contributor) commented

take

@rmhowe425 removed their assignment on Mar 11, 2023

topper-123 (Contributor) commented

Thanks for the bug report @pgdr.

This example should also fail. Probably something to do with our use of regexes.

A PR would be welcome.

@topper-123 added the IO CSV label and removed the Needs Triage label on May 13, 2023

topper-123 (Contributor) commented

Possibly related to #40587.
