Fix reading text files with carriage return symbols #713

@mozharovsky mozharovsky commented Oct 5, 2020

The new pandas-based text reader fails on files that contain carriage return characters (\r).

It fails with the following error message:

...
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

I found that pandas treats those characters as line terminators, which eventually causes the error. Explicitly specifying the lineterminator fixes the issue, and everything works fine.
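
As a rough illustration of the behaviour described above (not the actual change in this PR), the sketch below reads the same in-memory text twice with `pandas.read_csv`. The sample data, the column name `text`, and the tab separator are assumptions chosen for the example. They are not taken from the dataset reader's real configuration.

```python
import io

import pandas as pd

# Hypothetical sample: the first logical line contains a bare carriage return.
data = "first\rstill first\nsecond\n"

# Default C parser: a bare "\r" is treated as a line terminator,
# so the first logical line is split into two rows.
default = pd.read_csv(io.StringIO(data), header=None, names=["text"], sep="\t")

# Explicit terminator: only "\n" ends a row, so the "\r" stays inside the field.
explicit = pd.read_csv(
    io.StringIO(data),
    header=None,
    names=["text"],
    sep="\t",
    lineterminator="\n",
)

print(len(default), len(explicit))
```

With `lineterminator="\n"` the carriage return is kept as ordinary field content, so callers that do not want it may still need to strip it afterwards.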

Please consider this PR, as this seems to be a common issue.

@mozharovsky
Author

Discussed in #622, fixed in #715. Closing the issue. Thanks @lhoestq, it works now! 👍

@mozharovsky mozharovsky closed this Oct 5, 2020
@mozharovsky mozharovsky deleted the develop-fix-text-dataset-reading branch October 5, 2020 13:49