Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_excel return empty dataframe when using usecols #18273

Closed
SenWang opened this issue Nov 14, 2017 · 32 comments
Closed

read_excel return empty dataframe when using usecols #18273

SenWang opened this issue Nov 14, 2017 · 32 comments
Labels
Bug good first issue IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@SenWang
Copy link

SenWang commented Nov 14, 2017

In [3]: data = pd.read_excel("A.xlsx")

In [4]: data
Out[4]:
   A  B
0  1  2
1  3  4

In [5]: data1 = pd.read_excel("A.xlsx",usecols=['B'])

In [6]: data1
Out[6]:
Empty DataFrame
Columns: []
Index: []

In [7]: pd.__version__
Out[7]: '0.21.0'

Problem description

Having a excel file name A.xlsx(or A.xls) with column A,B
read_excel return empty dataframe if usecols used

@chris-b1
Copy link
Contributor

Guessing this is due to a conflict in the two different kind of specs we accept for usecols - columns labels, or Excel column letters (A, B, C, ...). A workaround is to select the Excel column completely ('B:B') but this should work.

df = pd.DataFrame({'A': [1, 3], 'B': [3, 4]})
df.to_excel('tmp.xlsx', index=False)

pd.read_excel('tmp.xlsx', usecols=['B'])
Out[86]: 
Empty DataFrame
Columns: []
Index: []


pd.read_excel('tmp.xlsx', usecols='B:B')
Out[88]: 
   B
0  3
1  4

@chris-b1 chris-b1 added this to the Next Major Release milestone Nov 14, 2017
@jhall6226
Copy link

jhall6226 commented Nov 20, 2017

I'm interested in contributing and thought this looked like a good place to take a first step.

Reviewing pandas/io/excel.py, it looks like the change needs to be made in the _should_parse function of the ExcelFile class. Specifically, here:

pandas/pandas/io/excel.py

Lines 355 to 360 in 1915ffc

if isinstance(usecols, int):
return i <= usecols
elif isinstance(usecols, compat.string_types):
return i in _range2cols(usecols)
else:
return i in usecols

It looks like the current implementation checks for an integer first (i.e. a max number of columns to use), a string second (i.e. assuming a comma separated list of column names in a single string), and assumes a list (technically any container that implements the "in" operator) otherwise. When a list is assumed for usecols, the check for the column index (i) assumes that it is a list of integers.

The simplest way to implement the requested functionality would be to add a new conditional to check whether the first element is a string and, if so, concatenate the list into a single string like case 2 and re-use the _range2cols function to convert to numeric values before returning the comparison:

if isinstance(usecols, list) and isinstance(usecols[0], compat.string_types):
    return i in _range2cols(', '.join(usecols))

Additionally, there would probably need to be another check (if there isn't already?) to handle the behavior for an empty list. Should this be assumed to mean the same thing as None? It doesn't make sense to read a sheet and not return any data.

If we take this a little further, we could add support for mixed lists of integers and strings (if desired) by doing something like:

def _list2cols(area_list):
    parsed_list = list()
    for e in area_list:
        if isinstance(e, int):
            parsed_list.append(e)
        elif isinstance(e, compat.string_types):
            parsed_list.extend(_range2cols(e))
        else:
            pass # Assuming other types should not be considered

    return parsed_list

if isinstance(usecols, list):
    return i in _list2cols(usecols)

Interested in feedback on which direction should be taken.

@chris-b1
Copy link
Contributor

What I think would be easiest here would be to only have _should_parse handle the case when usecols is an Excel column specification (e.g. 'A,B,D:E'), and in all other cases pass through usecols to TextParser here.

parser = TextParser(data, header=header, index_col=index_col,

It already has logic to handle column names/locations, and will raise in the mixed case.

from pandas.io.parsers import TextParser
TextParser([['a', 'b', 'c'],
            [1, 2, 3]], usecols=['b']).read()

Out[81]: 
   b
0  2

@jhall6226
Copy link

jhall6226 commented Nov 20, 2017

Ok. So in that case, you would remove the preprocessing steps here.

if usecols is not None and j not in should_parse:

Then, you would call a function like the current _should_parse to convert the usecols value (which would only have to be done once vice the rows x columns amount of times that it is currently done) into something readable by TextParser and pass that new value where you referenced.

Is that the correct interpretation?

@chris-b1
Copy link
Contributor

chris-b1 commented Nov 20, 2017 via email

@JeroenDelcour
Copy link

Any progress on this? I couldn't find the mentioned pull request.

@Vonatzki
Copy link

Vonatzki commented Jan 9, 2018

Wow, after 2 years of using this, I stumbled upon this after I updated my pandas.

This is scary since most of my scripts use the usecol parameter heavily. I guess I have to downgrade for now.

@kuraga
Copy link

kuraga commented Jan 14, 2018

Yeah, the same issue with 0.21.0 and 0.22.0. No issues with 0.20.3.

All columns' names are cyrillic complex (not A, B, etc.)

@linlinzhao
Copy link

linlinzhao commented Jan 15, 2018

I have a set of excel files, all of which have a few same columns I need to read, but other columns are different. The old usecols with specified column names works nicely.

Do I have to downgrade Pandas? Currently I have 0.22.0

@ldacey
Copy link

ldacey commented Jan 24, 2018

Same issue here. usecols is returning an empty dataframe after upgrading.

@weifei0228
Copy link

Same issue here.usecols is returning an empty dataframe with 0.22.0

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Feb 1, 2018
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Feb 1, 2018
@jreback jreback modified the milestones: 0.23.0, Next Major Release Feb 1, 2018
@Bravico
Copy link

Bravico commented Mar 19, 2018

Same issue in empty dataframe returning with 0.22.0

@LISHITING
Copy link

For me
Instead of typing usecols=['B'], try usecols='B'

@jacksonjos
Copy link

I'm going to work on this issue to solve it.

If anyone is already working on it, please, tell me.

jacksonjos added a commit to jacksonjos/pandas that referenced this issue Mar 25, 2018
closes pandas-dev#18273
tests added/passed
passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
whatsnew entry

As mentioned read_excel returns an empty DataFrame when usecols argument is a list of strings.
Now lists of strings are correctly interpreted by read_excel function.
@jreback jreback removed this from the Next Major Release milestone Mar 25, 2018
jacksonjos added a commit to jacksonjos/pandas that referenced this issue Jun 4, 2018
capability of passing column labels for columns to be read

- [x] closes pandas-dev#18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel' that receives a string containing comma separated Excel
ranges and columns.
Changed 'usecols' named argument, now it receives a list of strings containing
column labels or a list of integers representing column indexes or a callable
for 'read_excel' function. Created and altered tests to reflect the new usage
of these named arguments. 'index_col' keyword used to indicated which columns
in the subset of selected columns by 'usecols' or 'usecols_excel' that should
be the index of the DataFrame read. Now 'index_col' indicates which columns of
the DataFrame will be the index even if that column is not in the subset of the
selected columns.
@jacksonjos
Copy link

@chris-b1, may you take a look at issue #20480, please?

@jreback requested your review there again and without you it can't be closed.
Thank you for your attention.

jacksonjos added a commit to jacksonjos/pandas that referenced this issue Jun 9, 2018
capability of passing column labels for columns to be read

- [x] closes pandas-dev#18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel' that receives a string containing comma separated Excel
ranges and columns.
Changed 'usecols' named argument, now it receives a list of strings containing
column labels or a list of integers representing column indexes or a callable
for 'read_excel' function. Created and altered tests to reflect the new usage
of these named arguments. 'index_col' keyword used to indicated which columns
in the subset of selected columns by 'usecols' or 'usecols_excel' that should
be the index of the DataFrame read. Now 'index_col' indicates which columns of
the DataFrame will be the index even if that column is not in the subset of the
selected columns.
@Teckloon
Copy link

Teckloon commented Jul 3, 2018

Thanks for resolving the read_excel return empty dataframe.

@gfyoung
Copy link
Member

gfyoung commented Oct 28, 2018

Throwing my two cents in:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Supporting lists of strings is not technically addressed in the documentation, so I'm a little hesitant to call this a bug as of the current version of pandas (0.23.4).

That being said, this issue does bring up a lot of questions re: how to handle usecols for read_excel, in particular, why its handling is so different from usecols for CSV:

  • Column ranges (e.g. A:C) - that's totally fine with me. That's special to Excel.
  • "If int then indicates last column to be parsed" - We don't support this for CSV. I don't see why we support this for Excel?
  • List-like support for Excel is pretty bad. We don't support values like usecols=[0, 1, 2] or usecols=['A', 'B', 'C'], which I think we should.
  • We don't support callables for usecols, though I don't see why we shouldn't.

What do you guys think?

@WillAyd
Copy link
Member

WillAyd commented Oct 28, 2018

My personal belief is that usecols should operate the same way that it does in read_csv and excel-specific behavior should be handled in a separate parameter (something to the effect of use_range).

It would require some deprecations from the current state but I think that logical separation would clarify any ambiguity between range "A:A" in Excel and a column named "A".

@gfyoung
Copy link
Member

gfyoung commented Oct 28, 2018

My personal belief is that usecols should operate the same way that it does in read_csv and excel-specific behavior should be handled in a separate parameter (something to the effect of use_range).

@WillAyd : Not sure I fully agree with adding a new parameter, but at least we have consensus that usecols should be widely consistent across the board. 👍

@jreback @chris-b1 : Thoughts?

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified two major bugs:

* index_col=None was not being respected
* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 8, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 8, 2018
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 8, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 8, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 10, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 10, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 11, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
jreback pushed a commit that referenced this issue Nov 11, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes gh-18273.
Closes gh-20480.
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this issue Nov 14, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
@shm007g
Copy link

shm007g commented Feb 1, 2019

Problem still exists in pandas==0.23.4.
I just want to read specific columns of excel file with column names, i don' t know what column number display in excel.

@jreback
Copy link
Contributor

jreback commented Feb 1, 2019

this was closed in 0.24.0

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug good first issue IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.