
BUG: on_bad_lines=callable does not invoke callable for all bad lines #48020

Closed
2 of 3 tasks
indigoviolet opened this issue Aug 10, 2022 · 17 comments · Fixed by #50311
Labels: Bug · IO CSV (read_csv, to_csv) · Needs Info (Clarification about behavior needed to assess issue)

Comments

@indigoviolet

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [29]: import pandas as pd
    ...: pd.__version__
Out[29]: '1.4.3'

In [30]: len(open("bad.csv").readlines())
Out[30]: 3

In [31]: df1 = pd.read_csv("bad.csv", on_bad_lines='warn', engine='python')
Skipping line 3: ',' expected after '"'

In [32]: df2 = pd.read_csv("bad.csv", on_bad_lines=print, engine='python')

In [33]: len(df1), len(df2)
Out[33]: (1, 1)

Issue Description

The data file above has two data rows plus a header. Row 2 is valid; row 3 is bad.

For df1, I'm setting on_bad_lines='warn', and I see a warning for line 3.

For df2, I'm passing on_bad_lines=print, and I see no prints; the bad line is silently skipped.

❯ cat bad.csv
country,founded,id,industry,linkedin_url,locality,name,region,size,website
united states,"",heritage-equine-equipment-llc,farming,linkedin.com/company/heritage-equine-equipment-llc,"",heritage equine equipment llc,"",1-10,heritageequineequip.com
chile,"",contacto-corporación-colina,hospital & health care,linkedin.com/company/contacto-corporación-colina,colina,"contacto \" corporación colina",santiago metropolitan,11-50,corporacioncolina.cl

Expected Behavior

I would expect the bad line to be printed in the second case.
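
For illustration, here is a minimal sketch of the contract I expected; the file name is the one from the reproduction above, and the callable and variable names are only illustrative.

import pandas as pd

collected = []  # hypothetical sink for whatever the parser considers a bad line

# Expectation: the callable receives every line that would otherwise be skipped,
# whatever the reason it is bad (returning None still drops the line).
pd.read_csv("bad.csv", engine="python", on_bad_lines=lambda fields: collected.append(fields))

print(collected)  # expected: the fields of line 3; observed: an empty list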

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : e8093ba
python : 3.9.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-49-generic
Version : #55-Ubuntu SMP Wed Jan 12 17:36:34 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 60.6.0
pip : 22.0.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
/home/venky/dev/instant-science/explore/.venv/lib/python3.9/site-packages/_distutils_hack/init.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

indigoviolet added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Aug 10, 2022
@phofl
Member

phofl commented Aug 10, 2022

Hi, thanks for your report. I can reproduce this too.

Could you try simplifying the csv file? It's hard to see what's going on in there right now.

@mroeschke
Member

This may be working as expected if I am looking at your csv file correctly.

As the docs state:

Specifies what to do upon encountering a bad line (a line with too many fields).

And I think each line has the same number of elements?
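
A quick sanity check along those lines; this is a rough sketch that counts separators per raw line and deliberately ignores quoting.

# naive field count per raw line of bad.csv; for the file shown in the issue this prints
# the same count for every line, so no line has "too many fields" in the raw sense
with open("bad.csv") as f:
    print([line.count(",") + 1 for line in f])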

mroeschke added the IO CSV (read_csv, to_csv) and Needs Info (Clarification about behavior needed to assess issue) labels and removed the Needs Triage label on Aug 10, 2022
@indigoviolet
Author


  1. Which lines are considered bad should not be different between 'warn' and print.

  2. I would expect all skipped lines to be denoted bad, and for the callable to be able to handle all of them.

@indigoviolet
Author

Hi, thanks for your report. I can reproduce this too.

Could you try simplifying the csv file? It's hard to see what's going on in there right now.

Here's a simplified version:

❯ cat bad2.csv
country,name
united states,heritage equine equipment llc
chile,"contacto \" corporación colina"

Setting escapechar='\\' allows the second data row (the chile line) to be read, but the bug as reported (different behavior between 'warn' and print) is still valid.
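
For reference, a minimal sketch of that escapechar workaround against the simplified file:

import pandas as pd

# declaring the escape character lets the python engine parse the embedded \" in the chile row
df = pd.read_csv("bad2.csv", engine="python", escapechar="\\")
print(len(df))  # per the comment above, both data rows are now read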

@baodangpro

baodangpro commented Dec 1, 2022

I agree that this is a bug. I have no confidence passing a callback for on_bad_lines, since bad lines caused by escapechar issues are skipped silently.

@kostyafarber
Contributor

print returns None, so by the description in the docs the bad line will be ignored.

If the function returns None the bad line will be ignored
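
To spell out that contract, here is a minimal sketch; the repair below is hypothetical, and (as discussed further down) the callable is currently only invoked for lines with too many fields.

import pandas as pd

def truncate(fields: list[str]) -> list[str]:
    # hypothetical repair: keep only the expected number of fields
    return fields[:2]

# returning a (possibly repaired) list keeps the line; returning None drops it,
# which is why on_bad_lines=print prints the line (when invoked) and then drops it
df = pd.read_csv("bad2.csv", engine="python", on_bad_lines=truncate)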

@kostyafarber
Contributor

kostyafarber commented Dec 11, 2022

So I think I've identified the issue:

try:
    # assert for mypy, data is Iterator[str] or None, would error in next
    assert self.data is not None
    line = next(self.data)
    # for mypy
    assert isinstance(line, list)
    return line
except csv.Error as e:
    if self.on_bad_lines in (
        self.BadLineHandleMethod.ERROR,
        self.BadLineHandleMethod.WARN,

This bit catches the csv.Error. In the example, the line chile,"contacto \" corporación colina" (read without escapechar='\\') causes this error to be raised and caught:

'\',\' expected after \'"\''

This block only handles the pandas-defined handlers, not user-defined callables, and just returns None, effectively "skipping" the line.

The callable gets triggered in this snippet, but only if a row has more fields than expected:

if (
    max_len > col_len
    and self.index_col is not False  # type: ignore[comparison-overlap]
    and self.usecols is None
):
    footers = self.skipfooter if self.skipfooter else 0
    bad_lines = []
    iter_content = enumerate(content)
    content_len = len(content)
    content = []
    for (i, l) in iter_content:
        actual_len = len(l)
        if actual_len > col_len:
            if callable(self.on_bad_lines):
                new_l = self.on_bad_lines(l)
                if new_l is not None:
                    content.append(new_l)

For example, if I append

japan, bum ba boom, doon

to the example csv

country,name
united states,heritage equine equipment llc
chile,"contacto \" corporación colina"
japan, bum ba boom, doon

The callable will trigger for this line, since max_len > col_len, but not for chile,"contacto \" corporación colina".
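
To illustrate, a minimal sketch, assuming the four-line file above is saved as bad3.csv:

import pandas as pd

seen = []  # rows the callable was actually invoked for

# only the appended "japan" row (too many fields) reaches the callable; the mis-quoted
# "chile" row raises csv.Error and is skipped without the callable ever being called
pd.read_csv("bad3.csv", engine="python", on_bad_lines=lambda fields: seen.append(fields))

print(seen)  # expected to contain only the fields of the "japan" row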

I'm not sure what the best way to go about a fix is. Perhaps we could call the user-defined callable when we catch the error and let it process the line and return it?

I'm not sure what the downstream implications would be if the line wasn't processed correctly, though, or whether we would want this to propagate down the parser.

Happy to work on this and open a PR if we have an idea on the approach for the fix.

@kostyafarber
Contributor

Any ideas?

@Dr-Irv
Contributor

Dr-Irv commented Dec 14, 2022

Given how it's documented, I think this statement is correct:

I would expect all skipped lines to be denoted bad, and for the callable to be able to handle all of them.

In other words, the callable should be called if the line was detected as bad. That is what should get fixed.

@mroeschke do you agree that if the line is bad in any way, the callable should be called, as opposed to just getting called if the number of fields is wrong?

@mroeschke
Member

I think when implementing this feature, I tailored the callable to apply to how a "bad line" was documented at the time:

Specifies what to do upon encountering a bad line (a line with too many fields).

And relied on how too many fields was defined internally.

Not sure why "too many fields" was the definition of a bad line at the time, but I would be open to expanding the definition of what "bad" means in regard to a line.

@kostyafarber
Contributor

@mroeschke that makes sense.

What would the appropriate expansion of the meaning of 'bad line' be to improve the expected behaviour here?

Currently error and warn are implemented to handle any csv.Error. Does csv.Error only catch "a line with too many fields"?
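
A small check suggests it does not; with strict=True (which reproduces the exact message the python engine reports for the chile row), a quoting problem also raises csv.Error:

import csv
import io

row = 'chile,"contacto \\" corporación colina"\n'
try:
    list(csv.reader(io.StringIO(row), strict=True))
except csv.Error as e:
    print(e)  # ',' expected after '"'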

If not, then the documentation may be slightly misleading, as error and warn deal with many different cases, not just "a line with too many fields", while a callable only deals with "a line with too many fields".

A simple fix would be to change the documentation to better reflect what a bad line means and let the user process the line at the csv.Error stage.

Or we keep it as is and just change the documentation to say that the meaning of bad lines differs between user callables and error, warn, etc.

Any thoughts?

@mroeschke
Member

For now I would opt to improve the documentation then. "bad line" is indeed a very nondescript term, and I would be wary of the code changes required to expand its definition without more discussion.

@kostyafarber
Contributor

Okay, I can look at making documentation changes to on_bad_lines. Does this need a separate issue opened and a PR against that? I suppose we can leave this one open for discussion on whether code changes are appropriate down the line.

If we open another issue we can discuss how we want to describe "bad lines" to reflect what's happening there.

Otherwise, I propose something along the lines of telling the user that user-defined callables act on "too many fields", whereas warn and error are triggered by any CSV parsing error. What do you think?

@mroeschke
Member

You can cross-reference this issue when making a PR, and we can leave this issue open to further discuss a broader scope for "bad line".

@paul-theorem

I realize I'm commenting on a closed thread. I do think the clarifying documentation is good, but this probably needs an additional issue / feature request.

The meaning of on_bad_lines is obviously overloaded. It would be nice to get a callback for any skipped line so they can be counted. With a large dataset of millions of rows, the failure of 1 or 2 lines may be a minor annoyance, while the failure of 100k lines may be a serious defect. Without a general way to know the difference, we cannot differentiate programmatically.

It is further confusing because the errors reported by the exception/warning vary depending on the value of engine. For the same file, with the default (unspecified) engine, I see:

Skipping line 339779: expected 150 fields, saw 159

which clearly looks like "too many fields" and would trigger the callback. But the error is different with engine='python':

Skipping line 339779: ',' expected after '"'

which does not trigger the callback; the line is silently skipped without any callback.
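
For what it's worth, a rough sketch of the counting I have in mind (the file name is hypothetical); as described above, it undercounts with engine='python' because lines that fail with a parse error never reach the callable:

import pandas as pd

skipped = []  # every bad line the parser actually hands to the callable

def record_bad(fields: list[str]) -> None:
    skipped.append(fields)  # returning None drops the line after recording it

df = pd.read_csv("huge.csv", engine="python", on_bad_lines=record_bad)
print(len(df), len(skipped))  # len(skipped) undercounts whenever lines are skipped silently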

@jrhamilton

I'm also getting silent skips with callable functions when using on_bad_lines.

I first tried writing the bad lines to a file but got blank files. I then tried on_bad_lines=print like @indigoviolet, and got silent skips.

I also get the same errors as @paul-theorem when removing on_bad_lines:

pandas.errors.ParserError: ',' expected after '"'

However, I get the same errors whether I have engine set to python or not.

@Dr-Irv
Contributor

Dr-Irv commented May 1, 2023

@jrhamilton If you have a reproducible example, it would be best to open a new issue with that example so that the behavior you describe can be investigated. We won't do anything on closed issues.
