BUG: read_excel return empty dataframe when using usecols and restored capability of passing column labels for columns to be read

- [x] closes #18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

This commit reimplements usage of 'usecols' as a list of column labels, a list
of ints, or a callable for the read_excel function. The 'usecols' argument as
used in pandas 0.22 is renamed to 'usecols_excel' and enables the feature of
receiving column indexes as a list.
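
As a rough illustration of the forms of `usecols_excel` described above, the sketch below mimics how such a value could decide whether the 0-indexed column `i` is kept. The helper names (`excel_to_index`, `expand_ranges`, `should_parse`) are hypothetical stand-ins written for this note, not the pandas implementation, and `usecols_excel` itself is an argument proposed by this commit.

```python
def excel_to_index(letters):
    """Convert an Excel column label such as 'A' or 'AB' to a 0-based index."""
    letters = letters.strip().upper()
    if not letters:
        raise ValueError("empty column label")
    index = 0
    for ch in letters:
        if not "A" <= ch <= "Z":
            raise ValueError("invalid column label: {!r}".format(letters))
        # Treat the label as a base-26 number with digits A=1 .. Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def expand_ranges(spec):
    """Expand a spec like 'A,C,E:F' into 0-based indexes (ranges inclusive)."""
    cols = []
    for part in spec.split(","):
        if ":" in part:
            start, end = part.split(":")
            cols.extend(range(excel_to_index(start), excel_to_index(end) + 1))
        else:
            cols.append(excel_to_index(part))
    return cols


def should_parse(i, usecols_excel):
    """Return True if 0-based column ``i`` is selected by ``usecols_excel``."""
    if isinstance(usecols_excel, int):
        # An int marks the last column to be parsed.
        return i <= usecols_excel
    if isinstance(usecols_excel, str):
        # A single comma-separated string of letters and ranges.
        return i in expand_ranges(usecols_excel)
    # Assumption of this sketch: an empty list selects nothing.
    if len(usecols_excel) > 0 and all(isinstance(x, str) for x in usecols_excel):
        # A list of strings is joined and handled like the single-string form.
        return i in expand_ranges(",".join(usecols_excel))
    # Otherwise assume a list of integer positions.
    return i in usecols_excel
```

Under these assumptions, `"A,C,E:F"` and `["A", "C", "E:F"]` select the same columns, which is the list-of-strings case the commit sets out to fix.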
jacksonjos committed Apr 30, 2018
1 parent 8ddc0fd commit a747961
Showing 4 changed files with 182 additions and 36 deletions.
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -856,6 +856,7 @@ Other API Changes
- Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`).
- :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`)
- A user-defined-function that is passed to :func:`Series.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, :func:`DataFrame.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here <whatsnew_0230.enhancements.window_raw>`. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`)
+ - Renamed the `usecols` named argument of :func:`read_excel` to `usecols_excel`, which receives a list of index numbers or A1-style column references to select the columns that must be in the DataFrame. The `usecols` argument now selects the columns that must be in the DataFrame using column labels (:issue:`18273`)

.. _whatsnew_0230.deprecations:

@@ -1166,6 +1167,7 @@ I/O
- Bug in :func:`DataFrame.to_latex()` where a ``MultiIndex`` with an empty string as its name would result in incorrect output (:issue:`18669`)
- Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
- Bug in :func:`DataFrame.to_parquet` where an exception was raised if the write destination is S3 (:issue:`19134`)
+ - Bug in :func:`read_excel` where passing the `usecols_excel` named argument as a list of strings was returning an empty DataFrame (:issue:`18273`)
- :class:`Interval` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`)
- :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)
87 changes: 68 additions & 19 deletions pandas/io/excel.py
@@ -85,19 +85,41 @@
Column (0-indexed) to use as the row labels of the DataFrame.
Pass None if there is no such column. If a list is passed,
those columns will be combined into a ``MultiIndex``. If a
- subset of data is selected with ``usecols``, index_col
+ subset of data is selected with ``usecols_excel``, index_col
is based on the subset.
parse_cols : int or list, default None
.. deprecated:: 0.21.0
- Pass in `usecols` instead.
- usecols : int or list, default None
+ Pass in `usecols_excel` instead.
+ usecols : list-like or callable, default None
+ Return a subset of the columns. If list-like, all elements must either
+ be positional (i.e. integer indices into the document columns) or strings
+ that correspond to column names provided either by the user in `names` or
+ inferred from the document header row(s). For example, a valid list-like
+ `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element
+ order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]`` and
+ ``usecols=['foo', 'bar']`` is the same as ``['bar', 'foo']``.
+ To instantiate a DataFrame from ``data`` with element order preserved use
+ ``pd.read_excel(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
+ in ``['foo', 'bar']`` order or
+ ``pd.read_excel(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
+ for ``['bar', 'foo']`` order.
+ If callable, the callable function will be evaluated against the column
+ names, returning names where the callable function evaluates to True. An
+ example of a valid callable argument would be ``lambda x: x.upper() in
+ ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
+ parsing time and lower memory usage.
+ usecols_excel : int or list, default None
* If None then parse all columns,
* If int then indicates last column to be parsed
* If list of ints then indicates list of column numbers to be parsed
* If string then indicates comma separated list of Excel column letters and
- column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
+ column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
+ inclusive of both sides.
+ * If list of strings each string shall be an Excel column letter or column
+ range (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are inclusive of
+ both sides.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
@@ -278,14 +300,14 @@ def get_writer(engine_name):


@Appender(_read_excel_doc)
- @deprecate_kwarg("parse_cols", "usecols")
+ @deprecate_kwarg("parse_cols", "usecols_excel")
@deprecate_kwarg("skip_footer", "skipfooter")
def read_excel(io,
sheet_name=0,
header=0,
names=None,
index_col=None,
- usecols=None,
+ usecols_excel=None,
squeeze=False,
dtype=None,
engine=None,
@@ -320,7 +342,7 @@ def read_excel(io,
header=header,
names=names,
index_col=index_col,
- usecols=usecols,
+ usecols_excel=usecols_excel,
squeeze=squeeze,
dtype=dtype,
converters=converters,
@@ -413,7 +435,7 @@ def parse(self,
header=0,
names=None,
index_col=None,
- usecols=None,
+ usecols_excel=None,
squeeze=False,
converters=None,
true_values=None,
@@ -439,7 +461,7 @@ def parse(self,
header=header,
names=names,
index_col=index_col,
- usecols=usecols,
+ usecols_excel=usecols_excel,
squeeze=squeeze,
converters=converters,
true_values=true_values,
@@ -455,7 +477,7 @@ def parse(self,
convert_float=convert_float,
**kwds)

- def _should_parse(self, i, usecols):
+ def _should_parse(self, i, usecols_excel):

def _range2cols(areas):
"""
@@ -481,18 +503,26 @@ def _excel2num(x):
cols.append(_excel2num(rng))
return cols

- if isinstance(usecols, int):
- return i <= usecols
- elif isinstance(usecols, compat.string_types):
- return i in _range2cols(usecols)
+ if isinstance(usecols_excel, int):
+ return i <= usecols_excel
+ # check if usecols_excel is a string that indicates a comma separated
+ # list of Excel column letters and column ranges
+ elif isinstance(usecols_excel, compat.string_types):
+ return i in _range2cols(usecols_excel)
+ # check if usecols_excel is a list of strings, each one indicating an
+ # Excel column letter or a column range
+ elif all(isinstance(x, compat.string_types) for x in usecols_excel):
+ usecols_excel_str = ",".join(usecols_excel)
+ return i in _range2cols(usecols_excel_str)
else:
- return i in usecols
+ return i in usecols_excel

def _parse_excel(self,
sheetname=0,
header=0,
names=None,
index_col=None,
+ usecols_excel=None,
usecols=None,
squeeze=False,
dtype=None,
@@ -512,6 +542,10 @@ def _parse_excel(self,

_validate_header_arg(header)

+ if (usecols is not None) and (usecols_excel is not None):
+ raise TypeError("Cannot specify both `usecols` and `usecols_excel`"
+ ". Choose one of them.")

if 'chunksize' in kwds:
raise NotImplementedError("chunksize keyword of read_excel "
"is not implemented")
@@ -615,13 +649,27 @@ def _parse_cell(cell_contents, cell_typ):
row = []
for j, (value, typ) in enumerate(zip(sheet.row_values(i),
sheet.row_types(i))):
- if usecols is not None and j not in should_parse:
- should_parse[j] = self._should_parse(j, usecols)
+ if usecols_excel is not None and j not in should_parse:
+ should_parse[j] = self._should_parse(j, usecols_excel)

- if usecols is None or should_parse[j]:
+ if usecols_excel is None or should_parse[j]:
row.append(_parse_cell(value, typ))
data.append(row)

+ # Check if some string in usecols may be interpreted as an Excel
+ # positional column
+ if (usecols is not None) and (not callable(usecols)) and \
+ (not all(isinstance(x, int) for x in usecols)) and \
+ any(isinstance(x, compat.string_types) and x.isalpha()
+ for x in usecols):
+ warnings.warn("The `usecols` named argument used to refer to "
+ "Excel column letters or ranges and int "
+ "positional indexes was renamed to "
+ "`usecols_excel`. Now `usecols` is used to "
+ "pass either a list of only string column labels"
+ " or a list of only integer positional indexes.",
+ UserWarning, stacklevel=3)

if sheet.nrows == 0:
output[asheetname] = DataFrame()
continue
@@ -674,6 +722,7 @@ def _parse_cell(cell_contents, cell_typ):
dtype=dtype,
true_values=true_values,
false_values=false_values,
+ usecols=usecols,
skiprows=skiprows,
nrows=nrows,
na_values=na_values,
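
The two argument checks this file's changes introduce (the mutual-exclusion guard and the rename warning) can be sketched in isolation as follows. This is a minimal standalone approximation: `check_usecols_args` is a hypothetical helper name, plain `str` stands in for `compat.string_types`, and the warning text is shortened.

```python
import warnings


def check_usecols_args(usecols=None, usecols_excel=None):
    """Validate the two column-selection arguments as the diff above does."""
    # The two arguments are alternatives, so passing both is an error.
    if usecols is not None and usecols_excel is not None:
        raise TypeError("Cannot specify both `usecols` and `usecols_excel`. "
                        "Choose one of them.")

    # Nothing more to check for None or a callable.
    if usecols is None or callable(usecols):
        return

    all_ints = all(isinstance(x, int) for x in usecols)
    # Purely alphabetic strings could be old-style Excel column letters.
    looks_like_letters = any(isinstance(x, str) and x.isalpha()
                             for x in usecols)
    if not all_ints and looks_like_letters:
        warnings.warn("`usecols` no longer accepts Excel column letters; "
                      "use `usecols_excel` for those.",
                      UserWarning, stacklevel=2)
```

One consequence of this heuristic is that any purely alphabetic label, such as `'foo'`, also triggers the warning, since `str.isalpha()` cannot tell a column label from a column letter.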
18 changes: 18 additions & 0 deletions pandas/io/parsers.py
@@ -1980,6 +1980,24 @@ def TextParser(*args, **kwds):
parse_dates : boolean, default False
keep_date_col : boolean, default False
date_parser : function, default None
+ usecols : list-like or callable, default None
+ Return a subset of the columns. If list-like, all elements must be strings
+ that correspond to column names provided either by the user in `names`
+ or inferred from the document header row(s). For example, a valid
+ list-like `usecols` parameter would be ['foo', 'bar', 'baz']. Element
+ order is ignored, so ``usecols=['foo', 'bar']`` is the same as
+ ``['bar', 'foo']``.
+ To instantiate a DataFrame from ``data`` with element order preserved
+ use ``pd.read_excel(data, usecols=['foo', 'bar'])[['foo', 'bar']]``
+ for columns in ``['foo', 'bar']`` order or
+ ``pd.read_excel(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
+ for ``['bar', 'foo']`` order.
+ If callable, the callable function will be evaluated against the column
+ names, returning names where the callable function evaluates to True.
+ An example of a valid callable argument would be ``lambda x: x.upper()
+ in ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
+ parsing time and lower memory usage.
skiprows : list of integers
Row numbers to skip
skipfooter : int
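
The `usecols` semantics documented in the docstrings above (element order ignored, callables evaluated against column names) can be illustrated without pandas. The function below is a hypothetical pure-Python model of that selection rule, not the parser's actual code, and it accepts both the integer and string forms described in the `read_excel` docstring.

```python
def select_columns(header, usecols):
    """Return the names in ``header`` chosen by ``usecols``, in document order."""
    if usecols is None:
        return list(header)
    if callable(usecols):
        # Keep every name for which the callable evaluates to True.
        return [name for name in header if usecols(name)]

    usecols = list(usecols)
    if all(isinstance(x, int) for x in usecols):
        wanted = set(usecols)
        return [name for pos, name in enumerate(header) if pos in wanted]
    if all(isinstance(x, str) for x in usecols):
        wanted = set(usecols)
        missing = wanted - set(header)
        if missing:
            raise ValueError("columns not found: {}".format(sorted(missing)))
        return [name for name in header if name in wanted]
    raise ValueError("usecols must be all integers or all strings")


header = ["foo", "bar", "baz"]
# Element order in usecols does not affect the result.
print(select_columns(header, ["bar", "foo"]))
print(select_columns(header, ["foo", "bar"]))
```

Because the result always follows document order, a caller who needs a specific order has to reindex afterwards, which is what the `[['bar', 'foo']]` idiom in the docstring does.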