BUG: read_excel return empty dataframe when using usecols and restored
capability of passing column labels for columns to be read

- [x] closes #18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel', which receives a string containing comma separated
Excel ranges and columns.
Changed the 'usecols' named argument of the 'read_excel' function: it now
receives a list of strings containing column labels, a list of integers
representing column indexes, or a callable. Created and altered tests to
reflect the new usage of these named arguments. The 'index_col' keyword used
to indicate which columns, within the subset selected by 'usecols' or
'usecols_excel', should be the index of the DataFrame read. Now 'index_col'
indicates which columns of the DataFrame will be the index even if that column
is not in the subset of selected columns.
jacksonjos committed Jun 4, 2018
1 parent 4274b84 commit 6c6eede
Showing 5 changed files with 234 additions and 74 deletions.
42 changes: 36 additions & 6 deletions doc/source/io.rst
@@ -2852,23 +2852,53 @@ Parsing Specific Columns

It is often the case that users will insert columns to do temporary computations
in Excel and you may not want to read in those columns. ``read_excel`` takes
a ``usecols`` keyword to allow you to specify a subset of columns to parse.
either a ``usecols`` or ``usecols_excel`` keyword to allow you to specify a
subset of columns to parse. Note that you can not use both ``usecols`` and
``usecols_excel`` named arguments at the same time.

If ``usecols_excel`` is supplied, then it is assumed to indicate a comma
separated list of Excel column letters and column ranges to be parsed.

.. code-block:: python
read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A:E')
read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A,C,E:F')
If ``usecols`` is an integer, then it is assumed to indicate the last column
to be parsed.

.. code-block:: python
read_excel('path_to_file.xls', 'Sheet1', usecols=2)
read_excel('path_to_file.xls', 'Sheet1', usecols_excel=2)
If ``usecols`` is a list of integers, then it is assumed to be the file
column indices to be parsed.

.. code-block:: python
read_excel('path_to_file.xls', 'Sheet1', usecols=[1, 3, 5])
Element order is ignored, so ``usecols_excel=[0, 1]`` is the same as ``[1, 0]``.

If ``usecols`` is a list of strings, then it is assumed that each string
corresponds to a column name provided either by the user in `names` or
inferred from the document header row(s), and those strings define which
columns will be parsed.

.. code-block:: python
read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as
``['joe', 'baz']``.

If `usecols` is a list of integers, then it is assumed to be the file column
indices to be parsed.
If ``usecols`` is callable, the callable function will be evaluated against the
column names, returning names where the callable function evaluates to True.

.. code-block:: python
read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])
read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
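
To make the Excel notation described in this section concrete, the following is a minimal sketch (not the pandas implementation) of how a string such as ``'A,C,E:F'`` can be expanded into zero-based column positions, with ranges inclusive of both sides. The helper names ``excel_col_to_index`` and ``expand_excel_columns`` are hypothetical and exist only for illustration.

```python
def excel_col_to_index(col):
    """Convert an Excel column label such as 'A' or 'AA' to a zero-based index."""
    index = 0
    for ch in col.strip().upper():
        if not ('A' <= ch <= 'Z'):
            raise ValueError("invalid Excel column label: %r" % col)
        # Excel labels behave like a base-26 number whose digits run from 1 to 26.
        index = index * 26 + (ord(ch) - ord('A') + 1)
    if index == 0:
        raise ValueError("empty Excel column label")
    return index - 1


def expand_excel_columns(spec):
    """Expand a spec such as 'A,C,E:F' into a sorted list of zero-based indices."""
    cols = set()
    for part in spec.split(','):
        if ':' in part:
            start, end = part.split(':')
            lo, hi = excel_col_to_index(start), excel_col_to_index(end)
            cols.update(range(lo, hi + 1))  # ranges include both endpoints
        else:
            cols.add(excel_col_to_index(part))
    return sorted(cols)
```

Under these assumptions ``expand_excel_columns('A:E')`` covers the first five columns, and ``'A,C,E:F'`` selects positions 0, 2, 4 and 5.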
Parsing Dates
+++++++++++++
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -1325,6 +1325,7 @@ I/O
- Bug in :func:`DataFrame.to_latex()` where missing space characters caused wrong escaping and produced non-valid latex in some cases (:issue:`20859`)
- Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
- Bug in :func:`DataFrame.to_parquet` where an exception was raised if the write destination is S3 (:issue:`19134`)
- Bug in :func:`read_excel` where passing the ``usecols`` keyword argument as a list of strings returned an empty ``DataFrame`` (:issue:`18273`)
- :class:`Interval` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`)
- :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.24.0.txt
@@ -35,7 +35,7 @@ Datetimelike API Changes
Other API Changes
^^^^^^^^^^^^^^^^^

-
- :func:`read_excel` has gained the keyword argument ``usecols_excel``, which receives a string containing comma separated Excel ranges and columns. The ``usecols`` keyword argument of :func:`read_excel` no longer supports a string containing comma separated Excel ranges and columns, nor an int indicating the first j columns to be read into a ``DataFrame``. The ``usecols`` keyword argument of :func:`read_excel` now also supports a list of strings containing column labels and a callable. (:issue:`18273`)
-
-

100 changes: 83 additions & 17 deletions pandas/io/excel.py
@@ -10,6 +10,8 @@
import abc
import warnings
import numpy as np
import string
import re
from io import UnsupportedOperation

from pandas.core.dtypes.common import (
@@ -85,20 +87,45 @@
Column (0-indexed) to use as the row labels of the DataFrame.
Pass None if there is no such column. If a list is passed,
those columns will be combined into a ``MultiIndex``. If a
subset of data is selected with ``usecols``, index_col
is based on the subset.
subset of data is selected with ``usecols_excel`` or ``usecols``,
index_col is based on the subset.
parse_cols : int or list, default None
.. deprecated:: 0.21.0
Pass in `usecols` instead.
usecols : int or list, default None
usecols : list-like or callable or int, default None
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or string
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid list-like
`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Note that
you can not give both ``usecols`` and ``usecols_excel`` keyword arguments
at the same time.
If callable, the callable function will be evaluated against the column
names, returning names where the callable function evaluates to True. An
example of a valid callable argument would be ``lambda x: x.upper() in
['AAA', 'BBB', 'DDD']``.
.. versionadded:: 0.24.0
Added support for column labels; `usecols_excel` is now the keyword that
receives a comma separated list of Excel columns and ranges.
usecols_excel : string or list, default None
Return a subset of the columns from a spreadsheet specified as Excel column
ranges and columns. Note that you can not use both ``usecols`` and
``usecols_excel`` keyword arguments at the same time.
* If None then parse all columns,
* If int then indicates last column to be parsed
* If list of ints then indicates list of column numbers to be parsed
* If string then indicates comma separated list of Excel column letters and
column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
both sides.
column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
inclusive of both sides.
* If list of strings each string shall be an Excel column letter or column
range (e.g. ["A:E"] or ["A", "C", "E:F"]) to be parsed. Ranges are
inclusive of both sides.
.. versionadded:: 0.24.0
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
dtype : Type name or dict of column -> type, default None
@@ -269,6 +296,17 @@ def _get_default_writer(ext):
return _default_writers[ext]


def _is_excel_columns_notation(columns):
    """Receive a string and check whether it is a comma separated list of
    Excel column indexes and index ranges. An Excel range is a string with
    two column indexes separated by ':'."""
    if isinstance(columns, compat.string_types) and all(
            (x in string.ascii_letters) for x in re.split(r',|:', columns)):
        return True

    return False
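
One detail of the helper above is worth spelling out: ``x in string.ascii_letters`` is a substring test, so a multi-letter column such as ``'AA'`` is rejected, while an empty token (as in ``'A,,B'``) is accepted. The sketch below reproduces that behaviour and shows one possible regex-based alternative. It assumes Python 3 ``str`` in place of ``compat.string_types``, and the names ``original_check`` and ``is_excel_columns_notation_strict`` are hypothetical, not part of pandas.

```python
import re
import string


def original_check(columns):
    """Same logic as the helper in this hunk, using str instead of compat.string_types."""
    return isinstance(columns, str) and all(
        (x in string.ascii_letters) for x in re.split(r',|:', columns))


# One or more letters, optionally followed by ':' and more letters,
# repeated as a comma separated list.
_EXCEL_COLUMNS = re.compile(r'[A-Za-z]+(:[A-Za-z]+)?(,[A-Za-z]+(:[A-Za-z]+)?)*')


def is_excel_columns_notation_strict(columns):
    """Accept only well-formed specs such as 'A:E', 'A,C,E:F' or 'AA:AC'."""
    return (isinstance(columns, str)
            and _EXCEL_COLUMNS.fullmatch(columns) is not None)
```

Whether the stricter behaviour is desirable depends on how column labels that happen to look like Excel letters should be treated; the sketch only illustrates the difference between the two tests.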


def get_writer(engine_name):
try:
return _writers[engine_name]
@@ -286,6 +324,7 @@ def read_excel(io,
names=None,
index_col=None,
usecols=None,
usecols_excel=None,
squeeze=False,
dtype=None,
engine=None,
@@ -311,6 +350,7 @@ def read_excel(io,
header=header,
names=names,
index_col=index_col,
usecols_excel=usecols_excel,
usecols=usecols,
squeeze=squeeze,
dtype=dtype,
@@ -405,6 +445,7 @@ def parse(self,
names=None,
index_col=None,
usecols=None,
usecols_excel=None,
squeeze=False,
converters=None,
true_values=None,
@@ -439,6 +480,7 @@ def parse(self,
header=header,
names=names,
index_col=index_col,
usecols_excel=usecols_excel,
usecols=usecols,
squeeze=squeeze,
converters=converters,
@@ -455,7 +497,7 @@ def parse(self,
convert_float=convert_float,
**kwds)

def _should_parse(self, i, usecols):
def _should_parse(self, i, usecols_excel, usecols):

def _range2cols(areas):
"""
@@ -481,19 +523,20 @@ def _excel2num(x):
cols.append(_excel2num(rng))
return cols

if isinstance(usecols, int):
return i <= usecols
elif isinstance(usecols, compat.string_types):
return i in _range2cols(usecols)
else:
return i in usecols
# check if usecols_excel is a string that indicates a comma separated
# list of Excel column letters and column ranges
if isinstance(usecols_excel, compat.string_types):
return i in _range2cols(usecols_excel)

return True

def _parse_excel(self,
sheet_name=0,
header=0,
names=None,
index_col=None,
usecols=None,
usecols_excel=None,
squeeze=False,
dtype=None,
true_values=None,
@@ -512,6 +555,25 @@ def _parse_excel(self,

_validate_header_arg(header)

if (usecols is not None) and (usecols_excel is not None):
    raise ValueError("Cannot specify both `usecols` and "
                     "`usecols_excel`. Choose one of them.")

# Check whether `usecols` is a string that may be interpreted as Excel
# ranges or columns
elif _is_excel_columns_notation(usecols):
    warnings.warn("The `usecols` keyword argument used to refer to "
                  "Excel ranges and columns as strings was "
                  "renamed to `usecols_excel`.", UserWarning,
                  stacklevel=3)
    usecols_excel = usecols
    usecols = None

elif (usecols_excel is not None) and not _is_excel_columns_notation(
        usecols_excel):
    raise TypeError("`usecols_excel` must be None or a string of "
                    "comma separated Excel ranges and columns.")

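The validation added at the top of ``_parse_excel`` can be exercised in isolation with a small stand-alone sketch. It is an approximation for experimentation, not the pandas code: ``resolve_usecols`` and ``_looks_like_excel_columns`` are hypothetical names, the notation check is a simplified stand-in for ``_is_excel_columns_notation``, and the messages are shortened.

```python
import re
import warnings


def _looks_like_excel_columns(value):
    # Simplified stand-in: a string made of letter tokens separated by ',' or ':'.
    return isinstance(value, str) and all(
        re.fullmatch(r'[A-Za-z]+', token) is not None
        for token in re.split(r',|:', value))


def resolve_usecols(usecols=None, usecols_excel=None):
    """Return the (usecols, usecols_excel) pair after validation."""
    if usecols is not None and usecols_excel is not None:
        raise ValueError("Cannot specify both `usecols` and `usecols_excel`.")
    if _looks_like_excel_columns(usecols):
        # Excel notation passed through `usecols` is redirected with a warning.
        warnings.warn("Excel ranges passed through `usecols` are treated as "
                      "`usecols_excel`.", UserWarning, stacklevel=2)
        return None, usecols
    if usecols_excel is not None and not _looks_like_excel_columns(usecols_excel):
        raise TypeError("`usecols_excel` must be None or a string of "
                        "comma separated Excel ranges and columns.")
    return usecols, usecols_excel
```

With this sketch, ``resolve_usecols(usecols='A:E')`` warns and returns ``(None, 'A:E')``, while ``resolve_usecols(usecols_excel=[0, 1])`` raises ``TypeError``. As far as this hunk shows, the same holds for the committed check: a non-string ``usecols_excel`` is rejected, even though the docstring in this diff also lists list forms.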
if 'chunksize' in kwds:
raise NotImplementedError("chunksize keyword of read_excel "
"is not implemented")
@@ -615,10 +677,13 @@ def _parse_cell(cell_contents, cell_typ):
row = []
for j, (value, typ) in enumerate(zip(sheet.row_values(i),
sheet.row_types(i))):
if usecols is not None and j not in should_parse:
should_parse[j] = self._should_parse(j, usecols)
if ((usecols is not None) or (usecols_excel is not None) or
(j not in should_parse)):
should_parse[j] = self._should_parse(j, usecols_excel,
usecols)

if usecols is None or should_parse[j]:
if (((usecols_excel is None) and (usecols is None)) or
should_parse[j]):
row.append(_parse_cell(value, typ))
data.append(row)
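
In the new condition above the three tests are joined with ``or``, so whenever either keyword is set ``_should_parse`` is evaluated again for every cell and the ``should_parse`` dictionary no longer acts as a cache (the previous condition used ``and``). The sketch below counts predicate calls with and without per-column caching. ``select_cells`` is a hypothetical helper, and a plain membership test stands in for ``_should_parse``.

```python
def select_cells(rows, wanted, cache=True):
    """Keep cells whose column index is in `wanted`; also count predicate calls."""
    calls = 0
    should_parse = {}
    out = []
    for row in rows:
        kept = []
        for j, value in enumerate(row):
            # With caching, the decision for column j is computed only once.
            if not cache or j not in should_parse:
                should_parse[j] = j in wanted  # stands in for _should_parse
                calls += 1
            if should_parse[j]:
                kept.append(value)
        out.append(kept)
    return out, calls
```

Both variants select the same cells. The difference is only in how often the predicate runs: once per column with caching, once per cell without it.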

@@ -674,6 +739,7 @@ def _parse_cell(cell_contents, cell_typ):
dtype=dtype,
true_values=true_values,
false_values=false_values,
usecols=usecols,
skiprows=skiprows,
nrows=nrows,
na_values=na_values,