BUG: on_bad_lines=callable does not invoke callable for all bad lines #48020
Comments
Hi, thanks for your report, I can reproduce this too. Could you try simplifying the csv file? It’s hard to see what’s going on in there right now |
This may be working as expected if I am looking at your csv file correctly. As the docs state:
And I think each line has the same number of elements? |
Here's a simplified version:
Setting |
Agree that it should be a bug. No confidence to pass a callback for on_bad_line as some bad lines due to escapechar will be skipped silently. |
So I think I've identified the issue: pandas/pandas/io/parsers/python_parser.py Lines 776 to 786 in eb23512
This bit here catches the
This block does not deal with user defined callables, only the pandas defined ones, and just returns The callable gets triggered in this snippet if the number of rows is not as expected: pandas/pandas/io/parsers/python_parser.py Lines 988 to 1008 in eb23512
For example if I append
to the example csv
The callable will trigger for this line, as
Not sure what the best way to go about a fix is. Perhaps we could call the user-defined callable when we catch the error and let it process the line and return it? I'm not sure what the implications would be downstream if the line wasn't processed correctly, though, or whether we would want this to propagate down the parser. Happy to work on this and open a PR if we have an idea of the approach for the fix. |
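To make the idea above concrete, here is a minimal sketch written against the standard library `csv` module rather than the pandas parser. It is not the code in `python_parser.py`; `iter_rows`, `n_fields`, and `on_bad_line` are hypothetical names chosen for illustration. It assumes each record sits on one physical line, and it routes both kinds of rejection (a `csv.Error` raised while parsing, and too many fields) through the same callable.

```python
import csv


def iter_rows(lines, n_fields, on_bad_line):
    """Yield parsed rows, sending every rejected line to on_bad_line.

    on_bad_line receives the raw text when the csv module cannot parse the
    line, or the list of fields when there are too many of them. It returns
    a replacement list of fields, or None to skip the line.
    """
    for raw in lines:
        try:
            # Parse one physical line at a time so the raw text is still
            # available on failure (assumption: no multi-line quoted fields).
            fields = next(csv.reader([raw], strict=True))
        except csv.Error:
            # Parse failure: the user callable is consulted instead of
            # the line being dropped silently.
            fields = on_bad_line(raw)
        else:
            if len(fields) > n_fields:
                fields = on_bad_line(fields)
        if fields is not None:
            yield fields


# Hypothetical input: line 2 has malformed quoting, line 4 has an extra field.
lines = ["a,b,c", '1,"2"x,3', "4,5,6", "7,8,9,10"]
rejected = []


def collect(bad):
    rejected.append(bad)
    return None  # skip the line


rows = list(iter_rows(lines, n_fields=3, on_bad_line=collect))
print(rows)
print(rejected)
```

One open design question this sketch surfaces is that the callable sees two different argument types (raw text versus a list of fields), which is part of why the downstream implications are unclear.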
Any ideas? |
Given how it's documented, I think this statement is correct:
In other words, the callable should be called if the line was detected as bad. That is what should get fixed. @mroeschke do you agree that if the line is bad in any way, the callable should be called, as opposed to just getting called if the number of fields is wrong? |
I think when implementing this feature at the time, I tailored the callable to apply to what a "bad line" was documented at the time
And relied on how too many fields was defined internally. Not sure why too many fields was the definition of a bad line at the time, but I would be open to expanding the definition of what "bad" means with regard to a line |
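For reference, a small sketch of the case the callable was tailored to, a line with too many fields. The data and the name `record_and_skip` are made up for illustration; the example assumes a pandas version that accepts a callable for on_bad_lines with the python engine (1.4 or later).

```python
import io

import pandas as pd

# Hypothetical data: the second data row has three fields under a
# two-column header, so it counts as "too many fields".
data = "a,b\n1,2\n3,4,5\n6,7\n"

seen = []  # bad lines handed to the callable


def record_and_skip(bad_line):
    """Record the split fields of a bad line and drop it by returning None."""
    seen.append(bad_line)
    return None


df = pd.read_csv(io.StringIO(data), engine="python", on_bad_lines=record_and_skip)

print(seen)
print(df)
```

The callable receives the offending line already split into a list of strings. Returning None skips the line, and returning a list of strings keeps it.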
@mroeschke that makes sense. What would the appropriate expansion of the meaning of 'bad line' be to improve the expected behaviour of here? Currently If not, then the documentation may be slightly misleading, as A simple fix would be to change the documentation to better reflect what a bad line means and let the user process the line at the Or we keep it as is and just change the documentation saying that the meaning of bad lines differ between user callables and Any thoughts? |
For now I would opt to improve the documentation then. "bad line" is indeed a very nondescript term and I would be wary of the code changes required to expand its definition without more discussion |
Okay I can look at making documentation changes to If we open another issue we can discuss how we want to describe "bad lines" to reflect what's happening there. Otherwise, I propose something along the lines of telling the user that user defined callables act on "too many fields" whereas |
You can cross reference this issue when making a PR and we can leave this issue open to further discuss a broader scope for "bad line" |
I realize I'm commenting on a closed thread. I do think the clarifying documentation is good, but this probably needs an additional issue / feature request.
It is further confusing because depending on the value of
Which clearly looks like "too many fields" - and would trigger the callback. But the error is different with
Thus not triggering the callback, and silently skipping without calling back. |
Also getting silent skip on callable functions when using on_bad_lines. First tried writing to file but was getting blank files. Tried Also getting the same errors as @paul-theorem when removing on_bad_lines:
However, I get the same errors whether I have engine set to python or not. |
@jrhamilton If you have a reproducible example, it would be best to open up a new issue with that example so that the behavior you describe is investigated. We won't do anything on closed issues |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The above data file has two rows + header. Row 2 is valid, Row 3 is bad.
For df1, I'm setting on_bad_lines=warn, and I see a warning for line 3.
For d2, I'm passing on_bad_lines=print, and I don't see any prints - the bad line is silently skipped.
Expected Behavior
I would expect the bad line to be printed in the second case.
Installed Versions
INSTALLED VERSIONS
commit : e8093ba
python : 3.9.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-49-generic
Version : #55-Ubuntu SMP Wed Jan 12 17:36:34 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 60.6.0
pip : 22.0.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
/home/venky/dev/instant-science/explore/.venv/lib/python3.9/site-packages/_distutils_hack/init.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")