Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecated usecols with out of bounds indices in read_csv #41130

Merged
merged 10 commits into from
May 13, 2021
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -797,6 +797,7 @@ I/O
- Bug in :func:`read_excel` raising ``AttributeError`` with ``MultiIndex`` header followed by two empty rows and no index, and bug affecting :func:`read_excel`, :func:`read_csv`, :func:`read_table`, :func:`read_fwf`, and :func:`read_clipboard` where one blank row after a ``MultiIndex`` header with no index would be dropped (:issue:`40442`)
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
- Bug in :func:`read_csv` raising uncontrolled ``ValueError`` when ``usecols`` index is ouf of bounds, now raising ``ParserError`` (:issue:`25623`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not sure what "uncontrolled" means here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not raised on purpose by us but instead raised because we are accessing a non existent list index


Period
^^^^^^
Expand Down
23 changes: 17 additions & 6 deletions pandas/io/parsers/python_parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -470,12 +470,14 @@ def _infer_columns(self):
if self.usecols is not None:
# Set _use_cols. We don't store columns because they are
# overwritten.
self._handle_usecols(columns, names)
self._handle_usecols(columns, names, num_original_columns)
else:
num_original_columns = len(names)
columns = [names]
else:
columns = self._handle_usecols(columns, columns[0])
columns = self._handle_usecols(
columns, columns[0], num_original_columns
)
else:
try:
line = self._buffered_line()
Expand All @@ -494,10 +496,12 @@ def _infer_columns(self):
columns = [[f"{self.prefix}{i}" for i in range(ncols)]]
else:
columns = [list(range(ncols))]
columns = self._handle_usecols(columns, columns[0])
columns = self._handle_usecols(
columns, columns[0], num_original_columns
)
else:
if self.usecols is None or len(names) >= num_original_columns:
columns = self._handle_usecols([names], names)
columns = self._handle_usecols([names], names, num_original_columns)
num_original_columns = len(names)
else:
if not callable(self.usecols) and len(names) != len(self.usecols):
Expand All @@ -506,13 +510,13 @@ def _infer_columns(self):
"header fields in the file"
)
# Ignore output but set used columns.
self._handle_usecols([names], names)
self._handle_usecols([names], names, ncols)
columns = [names]
num_original_columns = ncols

return columns, num_original_columns, unnamed_cols

def _handle_usecols(self, columns, usecols_key):
def _handle_usecols(self, columns, usecols_key, num_original_columns):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brief docstring on this new parameter to explain how it differs from columns (and why we couldn't just use columns.length in the logic).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type args here

"""
Sets self._col_indices

Expand All @@ -537,6 +541,13 @@ def _handle_usecols(self, columns, usecols_key):
else:
col_indices.append(col)
else:
missing_usecols = [
col for col in self.usecols if col >= num_original_columns
]
if missing_usecols:
raise ParserError(
f"Usecols indices {missing_usecols} are out of bounds!"
)
col_indices = self.usecols

columns = [
Expand Down
16 changes: 16 additions & 0 deletions pandas/tests/io/parser/test_python_parser_only.py
Original file line number Diff line number Diff line change
Expand Up @@ -312,3 +312,19 @@ def test_malformed_skipfooter(python_parser_only):
msg = "Expected 3 fields in line 4, saw 5"
with pytest.raises(ParserError, match=msg):
parser.read_csv(StringIO(data), header=1, comment="#", skipfooter=1)


@pytest.mark.parametrize("header", [0, None])
@pytest.mark.parametrize("names", [None, ["a", "b"], ["a", "b", "c"]])
def test_usecols_indices_out_of_bounds(python_parser_only, names, header):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be tested with the CParser too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #41129

# GH#25623
if header == 0 and names == ["a", "b", "c"]:
pytest.skip("This case is not valid")
parser = python_parser_only
data = """
a,b
1,2
"""
msg = r"Usecols indices \[2\] are out of bounds!"
with pytest.raises(ParserError, match=msg):
parser.read_csv(StringIO(data), usecols=[0, 2], names=names, header=header)