BUG: read_excel returns empty DataFrame when using usecols; restored capability of passing column labels for columns to be read

- [x] closes #18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel', which receives a string containing comma-separated Excel ranges and columns. Changed the 'usecols' named argument of the 'read_excel' function: it now receives a list of strings containing column labels, a list of integers representing column indexes, or a callable. Created and altered tests to reflect the new usage of these named arguments.

The 'index_col' keyword used to indicate which columns in the subset selected by 'usecols' or 'usecols_excel' should be the index of the DataFrame read. Now 'index_col' indicates which columns of the DataFrame will be the index even if that column is not in the subset of selected columns.
pandas-dev · Jun 4, 2018 · 6c6eede · 6c6eede
1 parent 4274b84
commit 6c6eede

Showing 5 changed files with 234 additions and 74 deletions.
diff --git a/doc/source/io.rst b/doc/source/io.rst
@@ -2852,23 +2852,53 @@ Parsing Specific Columns
 
 It is often the case that users will insert columns to do temporary computations
 in Excel and you may not want to read in those columns. ``read_excel`` takes
-a ``usecols`` keyword to allow you to specify a subset of columns to parse.
+either a ``usecols`` or ``usecols_excel`` keyword to allow you to specify a
+subset of columns to parse. Note that you cannot use both ``usecols`` and
+``usecols_excel`` named arguments at the same time.
+
+If ``usecols_excel`` is supplied, then it is assumed to be a comma-separated
+list of Excel column letters and column ranges to be parsed.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A:E')
+   read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A,C,E:F')
 
 If ``usecols`` is an integer, then it is assumed to indicate the last column
 to be parsed.
 
 .. code-block:: python
 
-   read_excel('path_to_file.xls', 'Sheet1', usecols=2)
+   read_excel('path_to_file.xls', 'Sheet1', usecols_excel=2)
+
+If ``usecols`` is a list of integers, then it is assumed to be the file
+column indices to be parsed.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=[1, 3, 5])
+
+Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
+
+If ``usecols`` is a list of strings, then it is assumed that each string
+corresponds to a column name provided either by the user in ``names`` or
+inferred from the document header row(s), and those strings define which columns
+will be parsed.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
+
+Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as
+``['joe', 'baz']``.
 
-If `usecols` is a list of integers, then it is assumed to be the file column
-indices to be parsed.
+If ``usecols`` is callable, the callable function will be evaluated against the
+column names, returning names where the callable function evaluates to True.
 
 .. code-block:: python
 
-   read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])
+   read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
 
-Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
 
 Parsing Dates
 +++++++++++++

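The Excel notation accepted by ``usecols_excel`` ('A:E', 'A,C,E:F') can be read as a set of 0-based column positions. The following is a minimal, standalone sketch of that mapping, not the code from this commit; the helper names `excel_col_to_index` and `expand_excel_columns` are hypothetical and exist only for illustration.

```python
def excel_col_to_index(col):
    """Convert an Excel column label such as 'A' or 'AA' to a 0-based index.

    Hypothetical helper for illustration; not part of pandas.
    """
    label = col.strip().upper()
    if not label:
        raise ValueError("Empty column label")
    index = 0
    for ch in label:
        if not 'A' <= ch <= 'Z':
            raise ValueError("Invalid column label: {0!r}".format(col))
        # Column labels are a bijective base-26 number: A=1 ... Z=26, AA=27
        index = index * 26 + (ord(ch) - ord('A') + 1)
    return index - 1


def expand_excel_columns(spec):
    """Expand a spec such as 'A,C,E:F' into sorted 0-based column indices.

    Ranges are inclusive of both sides, as the documentation states.
    """
    cols = set()
    for part in spec.split(','):
        if ':' in part:
            start, end = part.split(':')
            cols.update(range(excel_col_to_index(start),
                              excel_col_to_index(end) + 1))
        else:
            cols.add(excel_col_to_index(part))
    return sorted(cols)


print(expand_excel_columns('A:E'))      # [0, 1, 2, 3, 4]
print(expand_excel_columns('A,C,E:F'))  # [0, 2, 4, 5]
```

Because the result is a sorted set of positions, the order in which columns appear in the spec does not affect which columns are read.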
diff --git a/doc/source/whatsnew/v0.23.0.txt b/doc/source/whatsnew/v0.23.0.txt
@@ -1325,6 +1325,7 @@ I/O
 - Bug in :func:`DataFrame.to_latex()` where missing space characters caused wrong escaping and produced non-valid latex in some cases (:issue:`20859`)
 - Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
 - Bug in :func:`DataFrame.to_parquet` where an exception was raised if the write destination is S3 (:issue:`19134`)
+- Bug in :func:`read_excel` where passing the ``usecols`` keyword argument as a list of strings returned an empty ``DataFrame`` (:issue:`18273`)
 - :class:`Interval` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`)
 - :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
 - Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)

diff --git a/doc/source/whatsnew/v0.24.0.txt b/doc/source/whatsnew/v0.24.0.txt
@@ -35,7 +35,7 @@ Datetimelike API Changes
 Other API Changes
 ^^^^^^^^^^^^^^^^^
 
--
+- :func:`read_excel` has gained the keyword argument ``usecols_excel``, which receives a string containing comma-separated Excel ranges and columns. The ``usecols`` keyword argument of :func:`read_excel` has dropped support for a string containing comma-separated Excel ranges and columns and for an int indicating the first j columns to be read into a ``DataFrame``. In addition, the ``usecols`` keyword argument of :func:`read_excel` now accepts a list of strings containing column labels or a callable (:issue:`18273`)
 -
 -
 

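The three forms of ``usecols`` described in the entry above (integer positions, column labels, callable) can be pictured as a filter over the header row. This is a rough sketch under that assumption only; `select_columns` is a hypothetical name, and the real selection in this commit is delegated to the parser rather than done by a function like this.

```python
def select_columns(header, usecols):
    """Return the 0-based positions of the header columns picked by usecols.

    Hypothetical helper for illustration; not part of pandas.
    """
    if usecols is None:
        # No restriction: keep every column
        return list(range(len(header)))
    if callable(usecols):
        # Keep the columns whose name makes the callable return True
        return [i for i, name in enumerate(header) if usecols(name)]

    wanted = list(usecols)
    if all(isinstance(c, int) for c in wanted):
        positions = set(wanted)
        return [i for i in range(len(header)) if i in positions]
    if all(isinstance(c, str) for c in wanted):
        missing = [c for c in wanted if c not in header]
        if missing:
            raise ValueError("Columns not found: {0}".format(missing))
        labels = set(wanted)
        return [i for i, name in enumerate(header) if name in labels]
    raise ValueError("usecols must be all integers or all strings")


header = ['foo', 'bar', 'baz', 'joe']
print(select_columns(header, ['joe', 'baz']))               # [2, 3]
print(select_columns(header, [1, 0]))                       # [0, 1]
print(select_columns(header, lambda x: x.startswith('b')))  # [1, 2]
```

Positions are always returned in document order, which is why ``['baz', 'joe']`` and ``['joe', 'baz']`` select the same columns.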
diff --git a/pandas/io/excel.py b/pandas/io/excel.py
@@ -10,6 +10,8 @@
 import abc
 import warnings
 import numpy as np
+import string
+import re
 from io import UnsupportedOperation
 
 from pandas.core.dtypes.common import (
@@ -85,20 +87,45 @@
     Column (0-indexed) to use as the row labels of the DataFrame.
     Pass None if there is no such column.  If a list is passed,
     those columns will be combined into a ``MultiIndex``.  If a
-    subset of data is selected with ``usecols``, index_col
-    is based on the subset.
+    subset of data is selected with ``usecols_excel`` or ``usecols``,
+    index_col is based on the subset.
 parse_cols : int or list, default None
 
     .. deprecated:: 0.21.0
        Pass in `usecols` instead.
 
-usecols : int or list, default None
+usecols : list-like or callable or int, default None
+    Return a subset of the columns. If list-like, all elements must either
+    be positional (i.e. integer indices into the document columns) or strings
+    that correspond to column names provided either by the user in `names` or
+    inferred from the document header row(s). For example, a valid list-like
+    `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Note that
+    you cannot give both ``usecols`` and ``usecols_excel`` keyword arguments
+    at the same time.
+
+    If callable, the callable function will be evaluated against the column
+    names, returning names where the callable function evaluates to True. An
+    example of a valid callable argument would be ``lambda x: x.upper() in
+    ['AAA', 'BBB', 'DDD']``.
+
+    .. versionadded:: 0.24.0
+    Added support for column labels; `usecols_excel` is now the keyword that
+    receives a comma-separated list of Excel columns and ranges.
+usecols_excel : string or list, default None
+    Return a subset of the columns from a spreadsheet specified as Excel column
+    ranges and columns. Note that you cannot use both ``usecols`` and
+    ``usecols_excel`` keyword arguments at the same time.
+
     * If None then parse all columns,
-    * If int then indicates last column to be parsed
-    * If list of ints then indicates list of column numbers to be parsed
     * If string then indicates comma separated list of Excel column letters and
-      column ranges (e.g. "A:E" or "A,C,E:F").  Ranges are inclusive of
-      both sides.
+      column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
+      inclusive of both sides.
+    * If list of strings, each string shall be an Excel column letter or column
+      range (e.g. ["A:E"] or ["A", "C", "E:F"]) to be parsed. Ranges are
+      inclusive of both sides.
+
+    .. versionadded:: 0.24.0
+
 squeeze : boolean, default False
     If the parsed data only contains one column then return a Series
 dtype : Type name or dict of column -> type, default None
@@ -269,6 +296,17 @@ def _get_default_writer(ext):
     return _default_writers[ext]
 
 
+def _is_excel_columns_notation(columns):
+    """Receives a string and checks whether it is a comma separated list of
+    Excel index columns and index ranges. An Excel range is a string with two
+    column indexes separated by ':'."""
+    if isinstance(columns, compat.string_types) and all(
+       (x in string.ascii_letters) for x in re.split(r',|:', columns)):
+        return True
+
+    return False
+
+
 def get_writer(engine_name):
     try:
         return _writers[engine_name]
@@ -286,6 +324,7 @@ def read_excel(io,
                names=None,
                index_col=None,
                usecols=None,
+               usecols_excel=None,
                squeeze=False,
                dtype=None,
                engine=None,
@@ -311,6 +350,7 @@ def read_excel(io,
         header=header,
         names=names,
         index_col=index_col,
+        usecols_excel=usecols_excel,
         usecols=usecols,
         squeeze=squeeze,
         dtype=dtype,
@@ -405,6 +445,7 @@ def parse(self,
               names=None,
               index_col=None,
               usecols=None,
+              usecols_excel=None,
               squeeze=False,
               converters=None,
               true_values=None,
@@ -439,6 +480,7 @@ def parse(self,
                                  header=header,
                                  names=names,
                                  index_col=index_col,
+                                 usecols_excel=usecols_excel,
                                  usecols=usecols,
                                  squeeze=squeeze,
                                  converters=converters,
@@ -455,7 +497,7 @@ def parse(self,
                                  convert_float=convert_float,
                                  **kwds)
 
-    def _should_parse(self, i, usecols):
+    def _should_parse(self, i, usecols_excel, usecols):
 
         def _range2cols(areas):
             """
@@ -481,19 +523,20 @@ def _excel2num(x):
                     cols.append(_excel2num(rng))
             return cols
 
-        if isinstance(usecols, int):
-            return i <= usecols
-        elif isinstance(usecols, compat.string_types):
-            return i in _range2cols(usecols)
-        else:
-            return i in usecols
+        # check if usecols_excel is a string that indicates a comma separated
+        # list of Excel column letters and column ranges
+        if isinstance(usecols_excel, compat.string_types):
+            return i in _range2cols(usecols_excel)
+
+        return True
 
     def _parse_excel(self,
                      sheet_name=0,
                      header=0,
                      names=None,
                      index_col=None,
                      usecols=None,
+                     usecols_excel=None,
                      squeeze=False,
                      dtype=None,
                      true_values=None,
@@ -512,6 +555,25 @@ def _parse_excel(self,
 
         _validate_header_arg(header)
 
+        if (usecols is not None) and (usecols_excel is not None):
+            raise ValueError("Cannot specify both `usecols` and "
+                             "`usecols_excel`. Choose one of them.")
+
+        # Check if some string in usecols may be interpreted as an Excel
+        # range or positional column
+        elif _is_excel_columns_notation(usecols):
+            warnings.warn("The `usecols` keyword argument used to refer to "
+                          "Excel ranges and columns as strings was "
+                          "renamed to `usecols_excel`.", UserWarning,
+                          stacklevel=3)
+            usecols_excel = usecols
+            usecols = None
+
+        elif (usecols_excel is not None) and not _is_excel_columns_notation(
+                usecols_excel):
+            raise TypeError("`usecols_excel` must be None or a string of "
+                            "comma separated Excel ranges and columns.")
+
         if 'chunksize' in kwds:
             raise NotImplementedError("chunksize keyword of read_excel "
                                       "is not implemented")
@@ -615,10 +677,13 @@ def _parse_cell(cell_contents, cell_typ):
                 row = []
                 for j, (value, typ) in enumerate(zip(sheet.row_values(i),
                                                      sheet.row_types(i))):
-                    if usecols is not None and j not in should_parse:
-                        should_parse[j] = self._should_parse(j, usecols)
+                    if ((usecols is not None) or (usecols_excel is not None) or
+                            (j not in should_parse)):
+                        should_parse[j] = self._should_parse(j, usecols_excel,
+                                                             usecols)
 
-                    if usecols is None or should_parse[j]:
+                    if (((usecols_excel is None) and (usecols is None)) or
+                            should_parse[j]):
                         row.append(_parse_cell(value, typ))
                 data.append(row)
 
@@ -674,6 +739,7 @@ def _parse_cell(cell_contents, cell_typ):
                                     dtype=dtype,
                                     true_values=true_values,
                                     false_values=false_values,
+                                    usecols=usecols,
                                     skiprows=skiprows,
                                     nrows=nrows,
                                     na_values=na_values,
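
A note on the `_is_excel_columns_notation` helper added in this diff: `x in string.ascii_letters` is a substring test against `'abc...zABC...Z'`, so it accepts single letters and consecutive runs such as 'AB', but rejects labels like 'AA' and accepts the empty string. The sketch below reproduces that check with `str` in place of `compat.string_types` and compares it with a stricter, regex-based variant. The names `is_excel_columns_notation_strict` and `_EXCEL_SPEC` are hypothetical, and the stricter check is one possible alternative, not what the commit implements.

```python
import re
import string


def is_excel_columns_notation(columns):
    """Same logic as the helper in the diff, using str for string types."""
    return isinstance(columns, str) and all(
        (x in string.ascii_letters) for x in re.split(r',|:', columns))


# One column label, optionally followed by ':' and a second label,
# repeated with ',' separators. Hypothetical pattern for illustration.
_EXCEL_SPEC = re.compile(
    r'[A-Za-z]+(?::[A-Za-z]+)?(?:,[A-Za-z]+(?::[A-Za-z]+)?)*')


def is_excel_columns_notation_strict(columns):
    """Stricter variant: every piece must be a non-empty run of letters."""
    return (isinstance(columns, str) and
            _EXCEL_SPEC.fullmatch(columns) is not None)


for spec in ['A:E', 'A,C,E:F', 'AA:AC', '']:
    print(repr(spec),
          is_excel_columns_notation(spec),
          is_excel_columns_notation_strict(spec))
```

Neither version can tell an Excel label from a purely alphabetic column name: a string such as 'foo' passes the stricter check, so the ambiguity between labels and Excel notation remains for string input.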