BUG: read_excel return empty dataframe when using usecols #20480

jacksonjos · 2018-03-25T04:53:02Z

closes read_excel return empty dataframe when using usecols #18273
tests added / passed
passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
whatsnew entry

As mentioned in the issue, read_excel returns an empty DataFrame when the usecols argument is a list of strings.
With this change, read_excel interprets lists of strings correctly.

codecov · 2018-03-25T05:32:53Z

Codecov Report

Merging #20480 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #20480   +/-   ##
=======================================
  Coverage   91.89%   91.89%           
=======================================
  Files         153      153           
  Lines       49596    49596           
=======================================
  Hits        45576    45576           
  Misses       4020     4020

| Flag      | Coverage Δ       |
|-----------|------------------|
| #multiple | `90.29% <ø> (ø)` |
| #single   | `41.86% <ø> (ø)` |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 415012f...e257100. Read the comment docs.

jreback · 2018-03-25T13:41:00Z

pandas/io/excel.py

@@ -479,6 +482,9 @@ def _excel2num(x):
            return i <= usecols
        elif isinstance(usecols, compat.string_types):
            return i in _range2cols(usecols)
+        elif all(isinstance(x, compat.string_types) for x in usecols) is True:


you don't need the is True
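
A minimal illustration of the review point above: `all()` already returns a `bool`, so the identity comparison adds nothing. The helper name `is_list_of_str` is hypothetical, and plain `str` stands in for `compat.string_types`.

```python
def is_list_of_str(usecols):
    """Return True when every element of ``usecols`` is a string.

    Hypothetical helper; ``str`` stands in for ``compat.string_types``.
    """
    # all() returns a real bool, so no "is True" comparison is needed.
    return all(isinstance(x, str) for x in usecols)


print(is_list_of_str(["A", "C"]))  # True
print(is_list_of_str(["A", 2]))    # False
```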

jreback · 2018-03-25T13:41:16Z

pandas/io/excel.py

@@ -479,6 +482,9 @@ def _excel2num(x):
            return i <= usecols
        elif isinstance(usecols, compat.string_types):
            return i in _range2cols(usecols)
+        elif all(isinstance(x, compat.string_types) for x in usecols) is True:
+            usecols_str = ",".join(usecols)


can you add a 1-line comment on this case (and the case above to differentiate)
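
To make the two string cases easier to tell apart, here is a rough standalone sketch of the idea in the hunk: a list of strings is joined into the same comma-separated form that a single string already uses. The function names are illustrative stand-ins, not the actual pandas helpers, and the integer `usecols` case is left out.

```python
def excel_col_to_index(col):
    """Convert an Excel column label such as 'A' or 'AB' to a 0-based index."""
    index = 0
    for ch in col.strip().upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def range_to_cols(areas):
    """Expand a spec such as 'A,C,E:F' into a list of 0-based column indices."""
    cols = []
    for part in areas.split(","):
        if ":" in part:
            start, end = part.split(":")
            # Ranges are inclusive of both ends.
            cols.extend(range(excel_col_to_index(start),
                              excel_col_to_index(end) + 1))
        else:
            cols.append(excel_col_to_index(part))
    return cols


def cols_from_usecols(usecols):
    # Case 1: a single string of comma-separated columns and ranges.
    if isinstance(usecols, str):
        return range_to_cols(usecols)
    # Case 2: a list of strings, joined into the same comma-separated form.
    if all(isinstance(x, str) for x in usecols):
        return range_to_cols(",".join(usecols))
    # Otherwise assume a list of integer positions.
    return list(usecols)


print(cols_from_usecols("A,C,E:F"))     # [0, 2, 4, 5]
print(cols_from_usecols(["A", "C:D"]))  # [0, 2, 3]
```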

jreback · 2018-03-25T13:41:28Z

pandas/tests/io/test_excel.py

@@ -179,6 +179,42 @@ def test_usecols_str(self, ext):
        tm.assert_frame_equal(df2, df1, check_names=False)
        tm.assert_frame_equal(df3, df1, check_names=False)

+    def test_usecols_str_list(self, ext):
+


can you add the gh issue as a comment

jreback · 2018-03-25T13:41:51Z

pandas/tests/io/test_excel.py

+    def test_usecols_str_list(self, ext):
+
+        dfref = self.get_csv_refdf('test1')
+


can you parameterize this test

Hi, thank you for your feedback! =)

I'm not sure I've understood. Do you want me to parametrize the values passed to the parse_cols argument inside test_usecols_str_list, since its cases are a copy of test_usecols_str, or do you want me to parametrize something else?

you are testing multiple cases that are pretty similar. the test becomes much simpler if you can parameterize over the possible cases.

Ok, I got it. Do you want me to do the same to the test above (test_usecols_str)?
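
For reference, parametrizing over similar cases could look roughly like the sketch below. `letters_to_positions` is a hypothetical stand-in for the code under test (it is not part of pandas and handles single letters only); the point is the shape of `pytest.mark.parametrize`, not the parsing.

```python
import pytest


def letters_to_positions(usecols):
    """Hypothetical stand-in for the code under test.

    Accepts 'A,C' or ['A', 'C'] (single letters only) and returns
    0-based column positions.
    """
    if isinstance(usecols, str):
        usecols = usecols.split(",")
    return [ord(letter.strip().upper()) - ord("A") for letter in usecols]


@pytest.mark.parametrize("usecols,expected", [
    ("A,B,C", [0, 1, 2]),          # comma-separated string
    (["A", "B", "C"], [0, 1, 2]),  # list of letters
    (["b", "d"], [1, 3]),          # lower-case letters
])
def test_letters_to_positions(usecols, expected):
    # One test body covers every case listed above.
    assert letters_to_positions(usecols) == expected
```

Each tuple becomes its own test case in the pytest report, so a failure points directly at the input that broke.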

Hi @jreback. I implemented all requested changes in the code.
I would like to know your opinion about a suggestion made by @chris-b1 at #20480 (comment).

chris-b1 · 2018-03-25T17:14:41Z

This looks OK, but re-reading it, #18273 is actually two somewhat separate problems and we have a bit of an API tangle here

1. passing a list of Excel column letters doesn't work (this fixes that)
2. passing a list of column names (e.g. column title 'foo' in the spreadsheet) to usecols doesn't work anymore

Number 2 wasn't part of the read_excel documented args, but once worked, because this param used to be called parse_cols (#17774), and if you passed something to usecols it would get passed down to the TextParser logic and do the right thing

pandas/pandas/io/excel.py

Line 529 in 1915ffc

parser = TextParser(data, header=header, index_col=index_col,

Not sure what the solution is. There could be ambiguous cases (e.g. a column titled 'A' in spreadsheet column B) that would be easiest to handle with separate kwargs. Or we could just have usecols do everything and warn in cases of ambiguity.

Not saying you need to address item 2 in this PR, just need a new issue if not.
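
The ambiguity described above can be made concrete with a small sketch. `classify_usecols_entry` is a hypothetical helper, not pandas code, and the 1-to-3-letter pattern is an assumption about what counts as an Excel column reference.

```python
import re

# Assumption: an Excel column reference is 1 to 3 letters, case-insensitive.
_EXCEL_COL = re.compile(r"^[A-Za-z]{1,3}$")


def classify_usecols_entry(entry, header):
    """Classify one usecols string against the sheet's header row.

    Returns 'label', 'excel', or 'ambiguous'. Hypothetical helper.
    """
    is_label = entry in header
    is_excel = bool(_EXCEL_COL.match(entry))
    if is_label and is_excel:
        return "ambiguous"
    if is_label:
        return "label"
    if is_excel:
        return "excel"
    raise ValueError("%r is neither a column label nor an Excel column" % entry)


# A column titled 'A' sits in spreadsheet column B.
header = ["Number", "A"]
print(classify_usecols_entry("Number", header))  # label
print(classify_usecols_entry("C", header))       # excel
print(classify_usecols_entry("A", header))       # ambiguous
```

Note that under this pattern any short alphabetic title, such as 'foo', also matches as a column reference, so the ambiguous set is larger than single letters.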

jacksonjos · 2018-03-25T17:38:41Z

Hi @chris-b1, thank you for your comment.

I tried to test what you said, but I don't get it.

| Day      | Number | Animal |
|----------|--------|--------|
| 10/12/18 | 5      | Tatu   |
| 10/13/18 | 4      | Paca   |

I tried the following:

df_not_ok = pd.read_excel("mock.xlsx", ["Day"])
>> XLRDError: No sheet named <'Day'>

df_not_ok = pd.read_excel("mock.xlsx", usecols=["Day"])
>> Empty DataFrame
>> Columns: []
>> Index: []

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

If someone creates a feature request or issue, I can work on making it possible to pass column names as arguments so that only those columns are read.

chris-b1 · 2018-03-25T17:52:10Z

Right, that doesn't work on 0.22, but it used to, which is what #18273 is about

print(pd.__version__)
# 0.20.3
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})
df.to_excel('tmp.xlsx', index=False)
pd.read_excel('tmp.xlsx', usecols=['bar'])
   bar
0    4
1    5
2    6

pep8speaks · 2018-03-25T20:26:54Z

Hello @jacksonjos! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 09, 2018 at 22:45 Hours UTC

jacksonjos · 2018-03-25T20:36:23Z

Ok, I got it, @chris-b1. Wouldn't it be better to create another issue for choosing the columns to load by column name, using a separate named argument? I think it might be confusing to use usecols for many different possibilities.

jacksonjos · 2018-03-29T03:58:09Z

Is there anything wrong with the requested changes I implemented, or anything else the pull request needs before it can be accepted? I'm asking because I don't understand why it hasn't been accepted yet.

jreback

waiting for @chris-b1 to have a look for this. I am concerned that usecols is now ambiguous, yes?

jreback · 2018-03-30T20:44:07Z

pandas/io/excel.py

-      column ranges (e.g. "A:E" or "A,C,E:F").  Ranges are inclusive of
+      column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
+      inclusive of both sides.
+    * If list of strings each string shall be a Excel column letter or column


chris-b1 · 2018-03-31T01:39:51Z

Yeah, I've been a bit stuck on what to do here. The current (0.22) behavior, where usecols refers only to Excel names for regions, is bad: it broke compatibility with 0.20.3 (untested/undocumented behavior, but working!) and is inconsistent with, e.g., read_csv.

I suppose the only reasonable solution is two keywords. usecols_labels and usecols_?

usecols_excel
usecols_a1 (very Excel specific, but "A1 notation" is a somewhat common term)
excel_columns

jacksonjos · 2018-04-14T14:52:25Z

Hi,

I prefer usecols_labels and usecols_excel, because columns can be referred to both by A1 notation and by numbers. If we used usecols_a1, the name would be misleading about which arguments can be passed to it.

What do you think @chris-b1 and @jreback ?

jreback · 2018-04-14T15:03:57Z

usecols should refer to names just like it does in all other parsers. no problem adding usecols_a1 (or similar).

chris-b1 · 2018-04-16T19:55:29Z

OK, @jacksonjos here's what I propose. Most painful part will be backwards compat & testing. If you don't want to take it all the way, just let me know and I'll finish, would like to get this in for 0.23.

1. Add a new kwarg usecols_excel which replicates the current usecols + your fix here
2. Since we're breaking compat, inspect usecols and if it is string AND parses to an excel region, issue a warning that the arg was renamed. This could raise a false positive for a single letter (e.g. 'A') but don't see a way around it.
3. usecols should get passed to the TextParser object, which will use it in the manner @jreback suggests.

pandas/pandas/io/excel.py

Line 529 in 1915ffc

parser = TextParser(data, header=header, index_col=index_col,
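
The inspection proposed above (checking whether a `usecols` string parses as an Excel region) could be sketched as follows. The function name and the regular expression are assumptions for illustration, not the pandas implementation.

```python
import re

# One column reference: 1 to 3 letters (an assumption for this sketch).
_COL = r"[A-Za-z]{1,3}"
# A comma-separated list of columns or inclusive ranges, e.g. "A:E" or "A,C,E:F".
_RANGE_SPEC = re.compile(
    r"^\s*{c}(\s*:\s*{c})?(\s*,\s*{c}(\s*:\s*{c})?)*\s*$".format(c=_COL)
)


def looks_like_excel_range(value):
    """Return True if ``value`` is a string that parses as an Excel region."""
    return isinstance(value, str) and bool(_RANGE_SPEC.match(value))


print(looks_like_excel_range("A,C,E:F"))  # True
print(looks_like_excel_range("A"))        # True (the false positive noted above)
print(looks_like_excel_range("Number"))   # False
```

The single-letter case shows why the warning cannot be exact: `'A'` is both a valid column reference and a plausible column label.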

jacksonjos · 2018-04-17T21:35:46Z

Hi @chris-b1 and @jreback. Thank you for your answers. I'm already working on this issue and I'll update the pull request in the coming days following your suggestions. If I run into trouble I'll ask for help. Best regards

jacksonjos · 2018-04-23T03:15:36Z

Hi @jreback and @chris-b1, I'm about to finish the new version of this pull request and I would like to know your opinion: what should happen when someone passes both arguments, usecols_excel and usecols, to the read_excel function?

Should I raise an exception, print some warning or do nothing?

Thanks in advance.

chris-b1 · 2018-04-23T13:23:45Z

I would raise.
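
Raising when both selectors are given could look like this minimal sketch. `check_usecols_args` and the message text are hypothetical; `usecols_excel` is the keyword proposed in this thread, not an existing pandas argument.

```python
def check_usecols_args(usecols=None, usecols_excel=None):
    """Raise if both column selectors are given (hypothetical validation helper)."""
    if usecols is not None and usecols_excel is not None:
        raise ValueError(
            "Cannot specify both `usecols` and `usecols_excel`; pass only one."
        )


check_usecols_args(usecols=["foo"])      # fine
check_usecols_args(usecols_excel="A:C")  # fine
try:
    check_usecols_args(usecols=["foo"], usecols_excel="A:C")
except ValueError as err:
    print(err)
```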


jacksonjos · 2018-04-30T03:17:09Z

Hey, guys! I did a new push.

All requested changes are in the commit.
I'm sorry for doing this so late, but the last two weeks were really hard at my job and I was too tired to work on it.

I don't understand why one of the CI tests didn't pass, because I don't see any relationship between the failing tests and the feature I implemented/corrected.

Do you have any idea?
If there is something else I need to do to complete the implementation, just tell me; I have time this week to finish the job.

Thanks!

chris-b1

looks pretty good will try it out tonight, can you also add a note to the excel section in io.rst

chris-b1 · 2018-04-30T15:23:58Z

pandas/io/excel.py

+    example of a valid callable argument would be ``lambda x: x.upper() in
+    ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
+    parsing time and lower memory usage.
+usecols_excel : int or list, default None


string, int, or list

I'd also add a bit of narrative before the bullets, something like:
Columns to parse from the spreadsheet, specified as Excel location references

chris-b1 · 2018-04-30T15:25:02Z

pandas/io/excel.py

                        row.append(_parse_cell(value, typ))
                data.append(row)

+            # Check if some string in usecols may be interpreted as a Excel
+            # positional column
+            if (usecols is not None) and (not callable(usecols)) and \


I'll try it out later, but I think this warning is over-specified. Since usecols only used to work with excel ranges as strings, not lists of strings, I would limit the warning to that case.

chris-b1 · 2018-04-30T15:28:41Z

pandas/tests/io/test_excel.py

+        df1 = self.get_csv_refdf('test1')[['B', 'D']]
+
+        with tm.assert_produces_warning(UserWarning):
+            df2 = self.get_exceldf('test1', ext, 'Sheet1', usecols=['B', 'D'])


As stated above, I'm not sure this should warn, since it didn't previously work. But I would like a test for the warning, something like usecols='B,D'

I can't pass a string to usecols, because doing so throws an error. I have already tested it.
So it doesn't make sense to keep the warning only for a string-type parameter, unless you don't want the error to be thrown.

With that in mind, I wrote a test to check this behavior.
Should I remove the warning, then?

I can enable backward compatibility for usecols if you want.
I could check whether usecols is a string containing Excel column indexes and ranges and, if it is, do something like usecols_excel=usecols, usecols=None, throw the warning I've already written, and add a comment saying this will be removed in pandas 0.24.
What do you think?

The other changes you asked for are done. I'm just waiting to know what to do about this warning before committing again.

Ah, I mistakenly thought we needed to worry about the case like usecols='B'.

Yeah, I agree with what you proposed, handle the back compat case with a warning, we can eventually deprecate that path.
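
The back-compat path agreed on here might be sketched as below. `split_usecols` is a hypothetical shim, and the warning text is illustrative. `UserWarning` is used only because that is the category asserted in the test hunk above; the category for a real deprecation was not settled in this thread.

```python
import warnings


def split_usecols(usecols=None, usecols_excel=None):
    """Return (usecols, usecols_excel) after applying the back-compat rule.

    Hypothetical shim: a plain string passed as ``usecols`` is treated as an
    Excel range, moved to ``usecols_excel``, and reported with a warning.
    """
    if isinstance(usecols, str):
        warnings.warn(
            "Passing Excel ranges via usecols is deprecated; "
            "use usecols_excel instead.",
            UserWarning,
            stacklevel=2,
        )
        return None, usecols
    return usecols, usecols_excel


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(split_usecols(usecols="B,D"))  # (None, 'B,D')
    print(len(caught))                   # 1
```

Lists of labels or integers pass through untouched, so only the old string form triggers the warning.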

jreback

I am confused here. IIUC usecols_excel was only going to accept the column ranges, and usecols everything else? These HAVE to be orthogonal parameters, otherwise this is very confusing.

jreback · 2018-05-02T10:32:51Z

doc/source/whatsnew/v0.23.0.txt

@@ -856,6 +856,7 @@ Other API Changes
 - Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`).
 - :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`)
 - A user-defined-function that is passed to :func:`Series.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, :func:`DataFrame.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here <whatsnew_0230.enhancements.window_raw>`. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`)
+- Changed the named argument `usecols` at :func:`read_excel` to `usecols_excel` that receives a list of index numbers or A1 index to select the columns that must be in the DataFrame, so the `usecols` argument can serve its purpose to select the columns that must be in the DataFrame using column labels (:issue:`18273`)


Please simplify this to make it more readable.

jreback · 2018-05-02T10:33:24Z

doc/source/whatsnew/v0.23.0.txt

@@ -1166,6 +1167,7 @@ I/O
 - Bug in :func:`DataFrame.to_latex()` where a ``MultiIndex`` with an empty string as its name would result in incorrect output (:issue:`18669`)
 - Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
 - Bug in :func:`DataFrame.to_parquet` where an exception was raised if the write destination is S3 (:issue:`19134`)
+- Bug in :func:`read_excel` where `usecols_excel` named argument as a list of strings were returning a empty DataFrame (:issue:`18273`)


how is this a bug? usecols_excel doesn't exist yet

jreback · 2018-05-02T10:34:42Z

pandas/io/excel.py

+    `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element
+    order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]`` and
+    ``usecols=['foo', 'bar']`` is the same as ``['bar', 'foo']``.
+    To instantiate a DataFrame from ``data`` with element order preserved use


this is not needed here, starting with keeping ordering (or you can put in the Notes section if you really want)

jreback · 2018-05-02T10:36:01Z

pandas/io/excel.py

+    example of a valid callable argument would be ``lambda x: x.upper() in
+    ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
+    parsing time and lower memory usage.
+usecols_excel : int or list, default None
    * If None then parse all columns,
    * If int then indicates last column to be parsed


should usecols_excel only accept column ranges?

Yeah, I'd be in favor of letting usecols handle the integer location case and making usecols_excel only handle column ranges

capability of passing column labels for columns to be read

- [x] closes pandas-dev#18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel', which receives a string containing comma-separated Excel ranges and columns. Changed the 'usecols' named argument of 'read_excel': it now receives a list of strings containing column labels, a list of integers representing column indexes, or a callable. Created and altered tests to reflect the new usage of these named arguments.

'index_col' used to indicate which column, within the subset selected by 'usecols' or 'usecols_excel', should be the index of the DataFrame read. Now 'index_col' indicates which columns of the DataFrame will be the index, even if that column is not in the subset of selected columns.

chris-b1 · 2018-06-23T16:06:25Z

I re-read your explanation but I'm honestly still not following why the change to usecols with numerics is necessary, sorry I'm missing something. read_csv is consistent with the old behavior and with engine='python' also uses the TextParser class, why can't we do the same here?

In [6]: pd.read_excel('pandas/pandas/tests/io/data/test1.xlsx').to_csv('tmp.csv')

In [7]: !head tmp.csv
,A,B,C,D
2000-01-03,0.980268513777,3.68573087906,-0.364216805298,-1.15973806169
2000-01-04,1.04791624281,-0.0412318367011,-0.16181208307,0.212549316967
2000-01-05,0.498580885705,0.731167677815,-0.537677223318,1.34627041952
2000-01-06,1.12020151869,1.56762092543,0.00364077397681,0.67525259227
2000-01-07,-0.487094399463,0.571454623474,-1.6116394093,0.103468562917
2000-01-10,0.836648671666,0.246461918642,0.588542635376,1.0627820613
2000-01-11,-0.157160753327,1.34030689438,1.19577795622,-1.09700699751

In [8]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0)
Out[8]:
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [9]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0, engine='python')
Out[9]:
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

jacksonjos · 2018-06-27T04:25:30Z

@chris-b1,

Take a look at the examples below comparing the current pandas behavior to the new one;
maybe that makes it easier to understand.
@jreback, do you get what is happening here?

# CURRENT BEHAVIOR
In [5]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[5]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [6]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
...   index_col=0)
Out[6]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

# Index 0 is not in used_cols, so it's not possible to use it in the index
In [7]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[7]: 
          A         B
0  0.980269  3.685731
1  1.047916 -0.041232
2  0.498581  0.731168
3  1.120202  1.567621
4 -0.487094  0.571455
5  0.836649  0.246462
6 -0.157161  1.340307

# index_col is related to the 0-th col in the cols selected in usecols. In this case, [1, 2].
# So, the date column is not the index_col
In [8]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2],
...     index_col=0)
Out[8]: 
                  B
A                  
 0.980269  3.685731
 1.047916 -0.041232
 0.498581  0.731168
 1.120202  1.567621
-0.487094  0.571455
 0.836649  0.246462
-0.157161  1.340307

In [9]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
...    index_col=1)
Out[9]: 
                   B         C
 3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
 0.731168 2000-01-05 -0.537677
 1.567621 2000-01-06  0.003641
 0.571455 2000-01-07 -1.611639
 0.246462 2000-01-10  0.588543
 1.340307 2000-01-11  1.195778

# PROPOSED BEHAVIOR

In [1]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[1]: 
                   A         C         D
2000-01-03  0.980269 -0.364217 -1.159738
2000-01-04  1.047916 -0.161812  0.212549
2000-01-05  0.498581 -0.537677  1.346270
2000-01-06  1.120202  0.003641  0.675253
2000-01-07 -0.487094 -1.611639  0.103469
2000-01-10  0.836649  0.588543  1.062782
2000-01-11 -0.157161  1.195778 -1.097007

In [2]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3], index_col=0)
Out[2]: 
                   A         C         D
2000-01-03  0.980269 -0.364217 -1.159738
2000-01-04  1.047916 -0.161812  0.212549
2000-01-05  0.498581 -0.537677  1.346270
2000-01-06  1.120202  0.003641  0.675253
2000-01-07 -0.487094 -1.611639  0.103469
2000-01-10  0.836649  0.588543  1.062782
2000-01-11 -0.157161  1.195778 -1.097007

# Index 0 is not in used_cols, but TextParser receives all columns. So, the default index
# is the first column because it does not have a column name (1st column header is
# empty).
In [3]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[3]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

# index_col is the 0-th col in the DataFrame, not the 0-th in usecols. In this case, [1, 2].
# So, the date column is selected to be the index because is the first column in the
# DataFrame.
In [4]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=0)
Out[4]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [5]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=1)
Out[5]: 
                   B         C
 3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
 0.731168 2000-01-05 -0.537677
 1.567621 2000-01-06  0.003641
 0.571455 2000-01-07 -1.611639
 0.246462 2000-01-10  0.588543
 1.340307 2000-01-11  1.195778


I'm also dumping the behavior of csv you used as an example because I saw that it
diverges from current and proposed behavior of read_excel. So, maybe we can settle,
or agree about how read_excel is supposed to behave.

In [11]: pd.read_csv('tmp.csv', usecols=[0, 2, 3])
Out[11]: 
   Unnamed: 0         B         C
0  2000-01-03  3.685731 -0.364217
1  2000-01-04 -0.041232 -0.161812
2  2000-01-05  0.731168 -0.537677
3  2000-01-06  1.567621  0.003641
4  2000-01-07  0.571455 -1.611639
5  2000-01-10  0.246462  0.588543
6  2000-01-11  1.340307  1.195778

In [12]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0)
Out[12]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [13]: pd.read_csv('tmp.csv', usecols=[1, 2])
Out[13]: 
          A         B
0  0.980269  3.685731
1  1.047916 -0.041232
2  0.498581  0.731168
3  1.120202  1.567621
4 -0.487094  0.571455
5  0.836649  0.246462
6 -0.157161  1.340307

In [14]: pd.read_csv('tmp.csv', usecols=[1, 2], index_col=0)
Out[14]: 
                  B
A                  
 0.980269  3.685731
 1.047916 -0.041232
 0.498581  0.731168
 1.120202  1.567621
-0.487094  0.571455
 0.836649  0.246462
-0.157161  1.340307

What do you think?
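
To separate the two readings of `index_col` being compared, here is a small pure-Python sketch that works on header names only. `resolve_columns` is a hypothetical helper. With `relative_to_subset=True` it mirrors the read_csv transcripts above; with `False` it is only one possible reading of "index_col counts within the whole sheet" and does not try to reproduce the proposed read_excel transcripts, which also treat the unnamed first column as an implicit index.

```python
def resolve_columns(header, usecols, index_col, relative_to_subset=True):
    """Return (index_name, data_names) for the given selection.

    relative_to_subset=True:  index_col counts within the usecols subset.
    relative_to_subset=False: index_col counts within the full header.
    """
    selected = sorted(usecols)  # element order in usecols is ignored
    index_pos = selected[index_col] if relative_to_subset else index_col
    data = [header[c] for c in selected if c != index_pos]
    return header[index_pos], data


header = ["Unnamed: 0", "A", "B", "C", "D"]

# Subset-relative, as in the read_csv transcripts.
print(resolve_columns(header, [0, 2, 3], 0))  # ('Unnamed: 0', ['B', 'C'])
print(resolve_columns(header, [1, 2], 0))     # ('A', ['B'])

# Sheet-relative: the index may come from a column outside usecols.
print(resolve_columns(header, [1, 2], 0, relative_to_subset=False))
# ('Unnamed: 0', ['A', 'B'])
```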

jreback · 2018-07-04T00:22:33Z

@jacksonjos can you show the proposed behavior right next to the existing one, and use separators between them so it's clear?

jreback · 2018-09-25T16:31:54Z

can you rebase / update

jacksonjos · 2018-10-27T14:36:07Z

Hi @jreback,

Unfortunately, I don't have time to work on this project this semester. I tried hard to contribute to it last semester, but my efforts weren't enough to finish the job. So I advise you to ask someone else to continue this work.

Best regards

jreback · 2018-10-28T03:30:48Z

@gfyoung if you wouldn't mind rebasing this

gfyoung · 2018-10-28T23:16:25Z

@jreback : Before I do, would like to get a better understanding of this issue (see here).

The idea is that we read the Excel file, get the data, and then let the TextParser handle the reading and parsing. We shouldn't be doing a lot of work that is already defined in parsers.py.

In doing so, we identified several bugs:

* index_col=None was not being respected
* usecols behavior was inconsistent with that of read_csv for list of strings and callable inputs
* usecols was not being validated as proper Excel column names when passed as a string.

Closes gh-18273. Closes gh-20480.

jreback · 2018-11-08T13:19:35Z

closing in favor of #23544

jreback requested changes Mar 25, 2018

jreback added API Design IO Excel read_excel, to_excel labels Mar 25, 2018

jreback requested a review from chris-b1 March 25, 2018 13:43

jacksonjos force-pushed the use_cols_empty_df_bug branch from 171c9b6 to fca0ae0 Compare March 25, 2018 15:54

jacksonjos force-pushed the use_cols_empty_df_bug branch 2 times, most recently from 4aa25c2 to 445d94a Compare March 25, 2018 20:26

jacksonjos force-pushed the use_cols_empty_df_bug branch 2 times, most recently from 02c4867 to beb5b2c Compare March 25, 2018 20:31

chris-b1 mentioned this pull request Mar 26, 2018

read_excel return empty dataframe when using usecols #18273

Closed

jreback requested changes Mar 30, 2018

jacksonjos force-pushed the use_cols_empty_df_bug branch 2 times, most recently from a747961 to 6143a0f Compare April 30, 2018 02:13

chris-b1 suggested changes Apr 30, 2018

jreback requested changes May 2, 2018

jacksonjos force-pushed the use_cols_empty_df_bug branch from 6c6eede to e257100 Compare June 9, 2018 22:45

gfyoung mentioned this pull request Nov 7, 2018

BUG: Delegate more of Excel parsing to CSV #23544

Merged

jreback closed this Nov 8, 2018

gfyoung mentioned this pull request Nov 23, 2018

TST: Test unnamed columns with index_col for Excel #23874

Merged
