Deprecated usecols with out of bounds indices in read_csv #41130

phofl · 2021-04-23T22:45:54Z

xref ENH/BUG: usecols does not raise an exception when col index is out of bounds. #25623
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

This currently raises

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2021.1/scratches/scratch_4.py", line 539, in <module>
    pd.read_csv('test.csv', header=0, usecols=[0, 10], engine='python')
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 552, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 471, in _read
    return parser.read(nrows)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 998, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/python_parser.py", line 285, in read
    data, columns = self._exclude_implicit_index(alldata)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/python_parser.py", line 303, in _exclude_implicit_index
    names = [names[i] for i in sorted(self._col_indices)]
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/python_parser.py", line 303, in <listcomp>
    names = [names[i] for i in sorted(self._col_indices)]
IndexError: list index out of range

on master but unfortunately works on 1.2.4, so we can either raise ParserError with 1.2.4, or fix this and deprecate, then remove in 2.0. Related to #41129.

I think fixing and deprecating would be more sensible, but only realised that this works on 1.2.4 after finishing this, so wanted to put up for discussion at least :)
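
The failing expression from the traceback can be reproduced without pandas. This is a minimal sketch, assuming a file with two columns so that usecols=[0, 10] points past the end of the names list. The variable names mirror the traceback, but the values and the select_names helper are made up for illustration.

```python
# Stand-ins for the parser state; the values are assumptions, not pandas internals.
names = ["a", "b"]      # column names parsed from a two-column header
col_indices = {0, 10}   # usecols=[0, 10]; index 10 does not exist


def select_names(names, col_indices):
    # Same expression as python_parser.py line 303 in the traceback above.
    return [names[i] for i in sorted(col_indices)]


try:
    select_names(names, col_indices)
    error = None
except IndexError as exc:
    # Nothing validates the indices first, so the plain list lookup fails.
    error = str(exc)

print(error)  # list index out of range
```

The error comes from ordinary list indexing rather than from a deliberate check in the parser.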

cc @gfyoung

jbrockmendel · 2021-04-25T01:34:57Z

doc/source/whatsnew/v1.3.0.rst

@@ -797,6 +797,7 @@ I/O
 - Bug in :func:`read_excel` raising ``AttributeError`` with ``MultiIndex`` header followed by two empty rows and no index, and bug affecting :func:`read_excel`, :func:`read_csv`, :func:`read_table`, :func:`read_fwf`, and :func:`read_clipboard` where one blank row after a ``MultiIndex`` header with no index would be dropped (:issue:`40442`)
 - Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
 - Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
+- Bug in :func:`read_csv` raising uncontrolled ``ValueError`` when ``usecols`` index is out of bounds, now raising ``ParserError`` (:issue:`25623`)


I'm not sure what "uncontrolled" means here.

Not raised on purpose by us, but instead raised because we are accessing a nonexistent list index.

jreback · 2021-04-26T12:56:11Z

unfortunately works on 1.2.4,

what does this mean?

phofl · 2021-04-26T13:21:32Z

This is a regression on master compared to the 1.2.x series, so we should probably fix it and then deprecate, to avoid changing behavior in 1.3.

"Unfortunately" means that if this had not worked on 1.2.x, we could immediately start raising a ParserError without worrying about backwards compatibility.

gfyoung · 2021-04-26T14:18:24Z

pandas/tests/io/parser/test_python_parser_only.py

+
+@pytest.mark.parametrize("header", [0, None])
+@pytest.mark.parametrize("names", [None, ["a", "b"], ["a", "b", "c"]])
+def test_usecols_indices_out_of_bounds(python_parser_only, names, header):


Can this be tested with the CParser too?

gfyoung · 2021-04-26T14:19:30Z

pandas/io/parsers/python_parser.py

                    columns = [names]
                    num_original_columns = ncols

        return columns, num_original_columns, unnamed_cols

-    def _handle_usecols(self, columns, usecols_key):
+    def _handle_usecols(self, columns, usecols_key, num_original_columns):


Brief docstring on this new parameter to explain how it differs from columns (and why we couldn't just use columns.length in the logic).

can you type args here

gfyoung · 2021-04-26T14:24:34Z

IMO okay with making this change immediately (without deprecation) because ParserError subclasses ValueError.
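
The subclass argument can be sketched as follows. The ParserError class here is a stand-in that mirrors the fact that pandas.errors.ParserError derives from ValueError, and check_usecols is a hypothetical helper, not the pandas implementation.

```python
class ParserError(ValueError):
    """Stand-in for pandas.errors.ParserError, which also derives from ValueError."""


def check_usecols(usecols, num_columns):
    # Hypothetical helper: validate up front and raise a deliberate error.
    out_of_bounds = [i for i in usecols if i >= num_columns]
    if out_of_bounds:
        raise ParserError(f"usecols indices out of bounds: {out_of_bounds}")
    return sorted(usecols)


try:
    check_usecols([0, 10], num_columns=2)
    caught = None
except ValueError as exc:  # existing ValueError handlers still match
    caught = exc

print(type(caught).__name__, "-", caught)
```

This only covers callers that already catch ValueError. It does not help callers whose code currently runs without any error.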

phofl · 2021-04-26T15:10:34Z

@gfyoung was not sure, because this is working on 1.2.x without raising an error, it is simply ignoring the indexes out of range, but I would be fine with doing this immediately

jreback · 2021-04-26T18:40:37Z

@gfyoung was not sure, because this is working on 1.2.x without raising an error, it is simply ignoring the indexes out of range, but I would be fine with doing this immediately

Oh, maybe I misunderstood. So if this was working on 1.2.x, then we should deprecate first.

gfyoung · 2021-04-26T19:47:23Z

because this is working on 1.2.x without raising an error, it is simply ignoring the indexes out of range, but I would be fine with doing this immediately

Oh, I see! The "without raising an error" part puts me in agreement with @jreback then.

I would also then advocate for deprecation.

phofl · 2021-04-26T19:50:20Z

Yeah, same on my side. Will mark this as draft until I have fixed the error on master. Then we can replace the ParserError with a FutureWarning.
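
A rough sketch of the deprecation approach discussed in this thread: keep ignoring out-of-bounds indices as on 1.2.x, but emit a FutureWarning first. The handle_usecols function and the warning text are hypothetical; only the num_original_columns parameter name is borrowed from the diff.

```python
import warnings


def handle_usecols(usecols, num_original_columns):
    # Hypothetical helper, not the pandas implementation.
    missing = [i for i in usecols if i >= num_original_columns]
    if missing:
        warnings.warn(
            f"usecols indices {missing} are out of bounds; "
            "this will raise a ParserError in a future version.",
            FutureWarning,
            stacklevel=2,
        )
    # Keep the old behavior for now: drop the missing indices after warning.
    return [i for i in sorted(usecols) if i < num_original_columns]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = handle_usecols([0, 10], num_original_columns=2)

print(kept)
print([w.category.__name__ for w in caught])
```

Existing code keeps working, and users get a release cycle to fix their usecols before the error becomes mandatory.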

phofl · 2021-05-12T18:07:14Z

After #41244 was merged, we can now deprecate for both engines.

jreback

small request

jreback · 2021-05-13T01:44:50Z

pandas/io/parsers/python_parser.py

                    columns = [names]
                    num_original_columns = ncols

        return columns, num_original_columns, unnamed_cols

-    def _handle_usecols(self, columns, usecols_key):
+    def _handle_usecols(self, columns, usecols_key, num_original_columns):


can you type args here

phofl · 2021-05-13T18:00:31Z

Done. Should we use from __future__ import annotations in a follow-up?

jreback · 2021-05-13T23:26:31Z

Done. Should we use from __future__ import annotations in a follow-up?

yes for sure, as that's being done elsewhere in the codebase.

thanks @phofl

phofl added 3 commits April 23, 2021 23:40

Bug in read_csv raising list index out of range instead of ParserError

9a82d19

Deprecated usecols with out of bounds indices in read_csv with c engine

0287dd9

Adjust test

97158ed

phofl added the IO CSV read_csv, to_csv label Apr 23, 2021

Fix bug for names and add tests

f446e4f

jbrockmendel reviewed Apr 25, 2021

jreback requested a review from gfyoung April 26, 2021 12:56

jreback added this to the 1.3 milestone Apr 26, 2021

gfyoung reviewed Apr 26, 2021

phofl marked this pull request as draft April 26, 2021 19:51

phofl added 4 commits May 12, 2021 19:49

Merge branch 'master' of https://github.com/pandas-dev/pandas into 25…

f5d3a05

…623_python # Conflicts: # doc/source/whatsnew/v1.3.0.rst

Rebase after fixing bug on master

21b496b

Merge branch '25623_c' into 25623_python

92488c0

# Conflicts: # doc/source/whatsnew/v1.3.0.rst

Remove no longer necessary test

41e3310

phofl changed the title ~~Bug in read_csv raising list index out of range instead of ParserError~~ Deprecated usecols with out of bounds indices in read_csv May 12, 2021

phofl marked this pull request as ready for review May 12, 2021 18:05

phofl mentioned this pull request May 12, 2021

Deprecated usecols with out of bounds indices in read_csv with c engine #41129

Closed

4 tasks

jreback added the Deprecate Functionality to remove in pandas label May 13, 2021

jreback requested changes May 13, 2021

jreback mentioned this pull request May 13, 2021

DEPR: log of deprecations in 1.x (to be removed in 2.0) #30228

Closed

Merge branch 'master' of https://github.com/pandas-dev/pandas into 25…

e34631b

…623_python

Type function

5bef676

jreback approved these changes May 13, 2021

jreback merged commit b2628b0 into pandas-dev:master May 13, 2021

phofl mentioned this pull request May 13, 2021

ENH/BUG: usecols does not raise an exception when col index is out of bounds. #25623

Closed

phofl deleted the 25623_python branch May 14, 2021 20:19

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

Deprecated usecols with out of bounds indices in read_csv (pandas-dev…

c54643b

…#41130)

mroeschke mentioned this pull request Nov 2, 2022

DEPR: HDFStore.iteritems, read_csv(use_cols) behavior #49483

Merged
