Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecated usecols with out of bounds indices in read_csv #41130

Merged
merged 10 commits into from May 13, 2021

Conversation

phofl
Copy link
Member

@phofl phofl commented Apr 23, 2021

This currently raises

Traceback (most recent call last): File "/home/developer/.config/JetBrains/PyCharm2021.1/scratches/scratch_4.py", line 539, in pd.read_csv('test.csv', header=0, usecols=[0, 10], engine='python') File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 552, in read_csv return _read(filepath_or_buffer, kwds) File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 471, in _read return parser.read(nrows) File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 998, in read index, columns, col_dict = self._engine.read(nrows) File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/python_parser.py", line 285, in read data, columns = self._exclude_implicit_index(alldata) File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/python_parser.py", line 303, in _exclude_implicit_index names = [names[i] for i in sorted(self._col_indices)] File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/python_parser.py", line 303, in names = [names[i] for i in sorted(self._col_indices)]
IndexError: list index out of range

on master but unfortunately works on 1.2.4, so we can either raise ParserError with 1.2.4 or fix and deprecate then to remove in 2.0, related to #41129

I think fixing and deprecating would be more sensible, but only realised that this works on 1.2.4 after finishing this, so wanted to put up for discussion at least :)

cc @gfyoung

@phofl phofl added the IO CSV read_csv, to_csv label Apr 23, 2021
@@ -797,6 +797,7 @@ I/O
- Bug in :func:`read_excel` raising ``AttributeError`` with ``MultiIndex`` header followed by two empty rows and no index, and bug affecting :func:`read_excel`, :func:`read_csv`, :func:`read_table`, :func:`read_fwf`, and :func:`read_clipboard` where one blank row after a ``MultiIndex`` header with no index would be dropped (:issue:`40442`)
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
- Bug in :func:`read_csv` raising uncontrolled ``ValueError`` when ``usecols`` index is ouf of bounds, now raising ``ParserError`` (:issue:`25623`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not sure what "uncontrolled" means here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not raised on purpose by us but instead raised because we are accessing a non existent list index

@jreback
Copy link
Contributor

jreback commented Apr 26, 2021

unfortunately works on 1.2.4,

what does this mean?

@jreback jreback requested a review from gfyoung April 26, 2021 12:56
@phofl
Copy link
Member Author

phofl commented Apr 26, 2021

This is a regression on master compared to 1.2.x series. So we should probably fix and then deprecate to not change behavior in 1.3

Unfortunately means, if this would not have worked on 1.2.x we could immediately start raising a ParserError without worrying about backwarts compatibility

@jreback jreback added this to the 1.3 milestone Apr 26, 2021

@pytest.mark.parametrize("header", [0, None])
@pytest.mark.parametrize("names", [None, ["a", "b"], ["a", "b", "c"]])
def test_usecols_indices_out_of_bounds(python_parser_only, names, header):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be tested with the CParser too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #41129

columns = [names]
num_original_columns = ncols

return columns, num_original_columns, unnamed_cols

def _handle_usecols(self, columns, usecols_key):
def _handle_usecols(self, columns, usecols_key, num_original_columns):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brief docstring on this new parameter to explain how it differs from columns (and why we couldn't just use columns.length in the logic).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type args here

@gfyoung
Copy link
Member

gfyoung commented Apr 26, 2021

IMO okay with making this change immediately (without deprecation) because ParserError subclasses ValueError.

@phofl
Copy link
Member Author

phofl commented Apr 26, 2021

@gfyoung was not sure, because this is working on 1.2.x without raising an error, it is simply ignoring the indexes out of range, but I would be fine with doin this immediately

@jreback
Copy link
Contributor

jreback commented Apr 26, 2021

@gfyoung was not sure, because this is working on 1.2.x without raising an error, it is simply ignoring the indexes out of range, but I would be fine with doin this immediately

oh maybe i misuderstood. so if this was working on 1.2.x then we should deprecate first

@gfyoung
Copy link
Member

gfyoung commented Apr 26, 2021

because this is working on 1.2.x without raising an error, it is simply ignoring the indexes out of range, but I would be fine with doin this immediately

Oh, I see! The w/o raising an error part puts me in agree with @jreback then.

I would also then advocate for deprecation.

@phofl
Copy link
Member Author

phofl commented Apr 26, 2021

Yeah same on my side. Will mark this as draft until I have fixed the error on master. Then we can switch ParserError with FutureWarning

@phofl phofl marked this pull request as draft April 26, 2021 19:51
@phofl phofl changed the title Bug in read_csv raising list index out of range instead of ParserError Deprecated usecols with out of bounds indices in read_csv May 12, 2021
@phofl phofl marked this pull request as ready for review May 12, 2021 18:05
@phofl
Copy link
Member Author

phofl commented May 12, 2021

After #41244 was merged we can deprecate now for both engines

@jreback jreback added the Deprecate Functionality to remove in pandas label May 13, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small request

columns = [names]
num_original_columns = ncols

return columns, num_original_columns, unnamed_cols

def _handle_usecols(self, columns, usecols_key):
def _handle_usecols(self, columns, usecols_key, num_original_columns):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type args here

@phofl
Copy link
Member Author

phofl commented May 13, 2021

Done. should we use from future import annotations in a follow up?

@jreback jreback merged commit b2628b0 into pandas-dev:master May 13, 2021
@jreback
Copy link
Contributor

jreback commented May 13, 2021

Done. should we use from future import annotations in a follow up?

yes for sure, as that's being done elsewhere in the codebase.

thanks @phofl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants