
BUG: on_bad_lines=callable does not invoke callable for all bad lines #48020

Closed
2 of 3 tasks
indigoviolet opened this issue Aug 10, 2022 · 17 comments · Fixed by #50311
Labels: Bug · IO CSV (read_csv, to_csv) · Needs Info (Clarification about behavior needed to assess issue)

Comments

@indigoviolet

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [29]: import pandas as pd
    ...: pd.__version__
Out[29]: '1.4.3'

In [30]: len(open("bad.csv").readlines())
Out[30]: 3

In [31]: df1 = pd.read_csv("bad.csv", on_bad_lines='warn', engine='python')
Skipping line 3: ',' expected after '"'

In [32]: df2 = pd.read_csv("bad.csv", on_bad_lines=print, engine='python')

In [33]: len(df1), len(df2)
Out[33]: (1, 1)

Issue Description

The data file above has two data rows plus a header. Row 2 is valid; row 3 is bad.

For df1, I'm setting on_bad_lines='warn', and I see a warning for line 3.

For df2, I'm passing on_bad_lines=print, and I see no prints; the bad line is silently skipped.

❯ cat bad.csv
country,founded,id,industry,linkedin_url,locality,name,region,size,website
united states,"",heritage-equine-equipment-llc,farming,linkedin.com/company/heritage-equine-equipment-llc,"",heritage equine equipment llc,"",1-10,heritageequineequip.com
chile,"",contacto-corporación-colina,hospital & health care,linkedin.com/company/contacto-corporación-colina,colina,"contacto \" corporación colina",santiago metropolitan,11-50,corporacioncolina.cl

Expected Behavior

I would expect the bad line to be printed in the second case.
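
For illustration, here is a minimal sketch of the contract I expected; the file name is the one from the reproduction above, and the callable and variable names are only illustrative.

import pandas as pd

collected = []  # hypothetical sink for whatever the parser considers a bad line

# Expectation: the callable receives every line that would otherwise be skipped,
# whatever the reason it is bad (returning None still drops the line).
pd.read_csv("bad.csv", engine="python", on_bad_lines=lambda fields: collected.append(fields))

print(collected)  # expected: the fields of line 3; observed: an empty list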

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : e8093ba
python : 3.9.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-49-generic
Version : #55-Ubuntu SMP Wed Jan 12 17:36:34 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 60.6.0
pip : 22.0.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
/home/venky/dev/instant-science/explore/.venv/lib/python3.9/site-packages/_distutils_hack/init.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

indigoviolet added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Aug 10, 2022
@phofl
Member

phofl commented Aug 10, 2022

Hi, thanks for your report. I can reproduce this too.

Could you try simplifying the csv file? It's hard to see what's going on in there right now.

@mroeschke
Member

This may be working as expected if I am looking at your csv file correctly.

As the docs state:

Specifies what to do upon encountering a bad line (a line with too many fields).

And I think each line has the same number of elements?
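
A quick sanity check along those lines; this is a rough sketch that counts separators per raw line and deliberately ignores quoting.

# naive field count per raw line of bad.csv; for the file shown in the issue this prints
# the same count for every line, so no line has "too many fields" in the raw sense
with open("bad.csv") as f:
    print([line.count(",") + 1 for line in f])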

mroeschke added the IO CSV (read_csv, to_csv) and Needs Info (Clarification about behavior needed to assess issue) labels and removed the Needs Triage label on Aug 10, 2022
@indigoviolet
Author


  1. Which lines are considered bad should not be different between 'warn' and print.

  2. I would expect all skipped lines to be denoted bad, and for the callable to be able to handle all of them.

@indigoviolet
Author

Hi, thanks for your report. I can reproduce this too.

Could you try simplifying the csv file? It's hard to see what's going on in there right now.

Here's a simplified version:

❯ cat bad2.csv
country,name
united states,heritage equine equipment llc
chile,"contacto \" corporación colina"

Setting escapechar='\\' allows the second data row (the chile line) to be read, but the bug as reported (different behavior between 'warn' and print) is still valid.
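
For reference, a minimal sketch of that escapechar workaround against the simplified file:

import pandas as pd

# declaring the escape character lets the python engine parse the embedded \" in the chile row
df = pd.read_csv("bad2.csv", engine="python", escapechar="\\")
print(len(df))  # per the comment above, both data rows are now read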

@baodangpro

baodangpro commented Dec 1, 2022

I agree that this is a bug. I have no confidence passing a callback for on_bad_lines, since bad lines caused by escapechar issues are skipped silently.

@kostyafarber
Contributor

print returns None, so by the description in the docs the bad line will be ignored.

If the function returns None the bad line will be ignored
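
To spell out that contract, here is a minimal sketch; the repair below is hypothetical, and (as discussed further down) the callable is currently only invoked for lines with too many fields.

import pandas as pd

def truncate(fields: list[str]) -> list[str]:
    # hypothetical repair: keep only the expected number of fields
    return fields[:2]

# returning a (possibly repaired) list keeps the line; returning None drops it,
# which is why on_bad_lines=print prints the line (when invoked) and then drops it
df = pd.read_csv("bad2.csv", engine="python", on_bad_lines=truncate)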

@kostyafarber
Contributor

kostyafarber commented Dec 11, 2022

So I think I've identified the issue:

try:
    # assert for mypy, data is Iterator[str] or None, would error in next
    assert self.data is not None
    line = next(self.data)
    # for mypy
    assert isinstance(line, list)
    return line
except csv.Error as e:
    if self.on_bad_lines in (
        self.BadLineHandleMethod.ERROR,
        self.BadLineHandleMethod.WARN,

This bit catches the csv.Error. In the example, the line chile,"contacto \" corporación colina" (read without escapechar='\\') causes this error to be raised and caught:

'\',\' expected after \'"\''

This block only handles the pandas-defined handlers, not user-defined callables, and just returns None, effectively "skipping" the line.

The callable gets triggered in this snippet, but only if a row has more fields than expected:

if (
    max_len > col_len
    and self.index_col is not False  # type: ignore[comparison-overlap]
    and self.usecols is None
):
    footers = self.skipfooter if self.skipfooter else 0
    bad_lines = []
    iter_content = enumerate(content)
    content_len = len(content)
    content = []
    for (i, l) in iter_content:
        actual_len = len(l)
        if actual_len > col_len:
            if callable(self.on_bad_lines):
                new_l = self.on_bad_lines(l)
                if new_l is not None:
                    content.append(new_l)

For example, if I append

japan, bum ba boom, doon

to the example csv

country,name
united states,heritage equine equipment llc
chile,"contacto \" corporación colina"
japan, bum ba boom, doon

The callable will trigger for this line, since max_len > col_len, but not for chile,"contacto \" corporación colina".
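
To illustrate, a minimal sketch, assuming the four-line file above is saved as bad3.csv:

import pandas as pd

seen = []  # rows the callable was actually invoked for

# only the appended "japan" row (too many fields) reaches the callable; the mis-quoted
# "chile" row raises csv.Error and is skipped without the callable ever being called
pd.read_csv("bad3.csv", engine="python", on_bad_lines=lambda fields: seen.append(fields))

print(seen)  # expected to contain only the fields of the "japan" row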

I'm not sure what the best way to go about a fix is. Perhaps we could call the user-defined callable when we catch the error and let it process the line and return it?

I'm not sure what the downstream implications would be if the line wasn't processed correctly, though, or whether we would want this to propagate down the parser.

Happy to work on this and open a PR if we have an idea on the approach for the fix.

@kostyafarber
Contributor

Any ideas?

@Dr-Irv
Contributor

Dr-Irv commented Dec 14, 2022

Given how it's documented, I think this statement is correct:

I would expect all skipped lines to be denoted bad, and for the callable to be able to handle all of them.

In other words, the callable should be called if the line was detected as bad. That is what should get fixed.

@mroeschke do you agree that if the line is bad in any way, the callable should be called, as opposed to just getting called if the number of fields is wrong?

@mroeschke
Member

I think when implementing this feature, I tailored the callable to apply to how a "bad line" was documented at the time:

Specifies what to do upon encountering a bad line (a line with too many fields).

And relied on how too many fields was defined internally.

Not sure why "too many fields" was the definition of a bad line at the time, but I would be open to expanding the definition of what "bad" means in regard to a line.

@kostyafarber
Contributor

@mroeschke that makes sense.

What would the appropriate expansion of the meaning of 'bad line' be to improve the expected behaviour here?

Currently error and warn are implemented to handle any csv.Error. Does csv.Error only catch "a line with too many fields"?
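
A small check suggests it does not; with strict=True (which reproduces the exact message the python engine reports for the chile row), a quoting problem also raises csv.Error:

import csv
import io

row = 'chile,"contacto \\" corporación colina"\n'
try:
    list(csv.reader(io.StringIO(row), strict=True))
except csv.Error as e:
    print(e)  # ',' expected after '"'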

If not, then the documentation may be slightly misleading, as error and warn deal with many different cases, not just "a line with too many fields", while a callable only deals with "a line with too many fields".

A simple fix would be to change the documentation to better reflect what a bad line means and let the user process the line at the csv.Error stage.

Or we keep it as is and just change the documentation to say that the meaning of bad lines differs between user callables and error, warn, etc.

Any thoughts?

@mroeschke
Member

For now I would opt to improve the documentation then. "bad line" is indeed a very nondescript term, and I would be wary of the code changes required to expand its definition without more discussion.

@kostyafarber
Contributor

Okay, I can look at making documentation changes to on_bad_lines. Does this need a separate issue opened and a PR against that? I suppose we can leave this one open for discussion on whether code changes are appropriate down the line.

If we open another issue we can discuss how we want to describe "bad lines" to reflect what's happening there.

Otherwise, I propose something along the lines of telling the user that user-defined callables act on "too many fields", whereas warn and error are triggered by any CSV parsing error. What do you think?

@mroeschke
Member

You can cross-reference this issue when making a PR, and we can leave this issue open to further discuss a broader scope for "bad line".

@paul-theorem

I realize I'm commenting on a closed thread. I do think the clarifying documentation is good, but this probably needs an additional issue / feature request.

The meaning of on_bad_lines is obviously overloaded. It would be nice to get a callback for any skipped line so they can be counted. With a large dataset of millions of rows, the failure of 1 or 2 lines may be a minor annoyance, while the failure of 100k lines may be a serious defect. Without a general way to know the difference, we cannot differentiate programmatically.

It is further confusing because the errors reported by the exception/warning vary depending on the value of engine. For the same file, with the default (unspecified) engine, I see:

Skipping line 339779: expected 150 fields, saw 159

which clearly looks like "too many fields" and would trigger the callback. But the error is different with engine='python':

Skipping line 339779: ',' expected after '"'

which does not trigger the callback; the line is silently skipped without any callback.
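
For what it's worth, a rough sketch of the counting I have in mind (the file name is hypothetical); as described above, it undercounts with engine='python' because lines that fail with a parse error never reach the callable:

import pandas as pd

skipped = []  # every bad line the parser actually hands to the callable

def record_bad(fields: list[str]) -> None:
    skipped.append(fields)  # returning None drops the line after recording it

df = pd.read_csv("huge.csv", engine="python", on_bad_lines=record_bad)
print(len(df), len(skipped))  # len(skipped) undercounts whenever lines are skipped silently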

@jrhamilton

I'm also getting silent skips with callable functions when using on_bad_lines.

I first tried writing the bad lines to a file but got blank files. I then tried on_bad_lines=print like @indigoviolet, and got silent skips.

I also get the same errors as @paul-theorem when removing on_bad_lines:

pandas.errors.ParserError: ',' expected after '"'

However, I get the same errors whether I have engine set to python or not.

@Dr-Irv
Contributor

Dr-Irv commented May 1, 2023

@jrhamilton If you have a reproducible example, it would be best to open a new issue with that example so that the behavior you describe can be investigated. We won't do anything on closed issues.
