Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: read_excel return empty dataframe when using usecols #20480

Closed

Conversation

jacksonjos
Copy link

@jacksonjos jacksonjos commented Mar 25, 2018

As mentioned read_excel returns an empty DataFrame when usecols argument is a list of strings.
Now lists of strings are correctly interpreted by read_excel function.

@codecov
Copy link

codecov bot commented Mar 25, 2018

Codecov Report

Merging #20480 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #20480   +/-   ##
=======================================
  Coverage   91.89%   91.89%           
=======================================
  Files         153      153           
  Lines       49596    49596           
=======================================
  Hits        45576    45576           
  Misses       4020     4020
Flag Coverage Δ
#multiple 90.29% <ø> (ø) ⬆️
#single 41.86% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 415012f...e257100. Read the comment docs.

@@ -479,6 +482,9 @@ def _excel2num(x):
return i <= usecols
elif isinstance(usecols, compat.string_types):
return i in _range2cols(usecols)
elif all(isinstance(x, compat.string_types) for x in usecols) is True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the is True

@@ -479,6 +482,9 @@ def _excel2num(x):
return i <= usecols
elif isinstance(usecols, compat.string_types):
return i in _range2cols(usecols)
elif all(isinstance(x, compat.string_types) for x in usecols) is True:
usecols_str = ",".join(usecols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a 1-line comment on this case (and the case above to differentiate)

@@ -179,6 +179,42 @@ def test_usecols_str(self, ext):
tm.assert_frame_equal(df2, df1, check_names=False)
tm.assert_frame_equal(df3, df1, check_names=False)

def test_usecols_str_list(self, ext):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the gh issue as a comment

def test_usecols_str_list(self, ext):

dfref = self.get_csv_refdf('test1')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this test

Copy link
Author

@jacksonjos jacksonjos Mar 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for your feedback! =)

I'm not sure I've got it right. Do you want me to parametrize the objects passed as argument to parse_cols argument to each inside of test_usecols_str_list test because its tests are a copy of test_usecols_str or you want me to parametrize something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are testing multiple cases that are pretty similar. the test becomes much simpler if you can parameterize over the possible cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I got it. Do you want me to do the same to the test above (test_usecols_str)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jreback. I implemented all requested changes in the code.
I would like to know your opinion about a suggestion made by @chris-b1 at #20480 (comment).

@chris-b1
Copy link
Contributor

This looks OK, but re-reading it, #18273 is actually two somewhat separate problems and we have a bit of an API tangle here

  1. passing a list of Excel column letters doesn't work (this fixes that)
  2. passing a list column names (e.g. column title 'foo' in the spreadsheet) to usecols doesn't work anymore

Number 2 wasn't part of the read_excel documented args, but once worked, because this param used to be called parse_cols (#17774), and if you passed something to usecols it would get passed down to the TextParser logic and do the right thing

parser = TextParser(data, header=header, index_col=index_col,

Not sure what the solution is, there could be ambiguous cases (column titled'A' in spreadsheet column B), that would be easiest to handle with separate kwargs. Or could just have usecols do everything and warn in cases of ambiguity.

Not saying you need to address item 2 in this PR, just need a new issue if not.

@jacksonjos
Copy link
Author

Hi @chris-b1, thank you for your comment.

I tried to test what you said, but I don't get it.

| Day      | Number | Animal |
|----------|--------|--------|
| 10/12/18 | 5      | Tatu   |
| 10/13/18 | 4      | Paca   |

I tried the following:

df_not_ok = pd.read_excel("mock.xlsx", ["Day"])
>> XLRDError: No sheet named <'Day'>

df_not_ok = pd.read_excel("mock.xlsx", usecols=["Day"])
>> Empty DataFrame
>> Columns: []
>> Index: []
pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

If someone create feature or issue I may can work to make possible to pass col names as arguments to read just them.

@chris-b1
Copy link
Contributor

Right, that doesn't work on 0.22, but it used to, which is what #18273 is about

print(pd.__version__)
# 0.20.3
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})
df.to_excel('tmp.xlsx', index=False)
pd.read_excel('tmp.xlsx', usecols=['bar'])
   bar
0    4
1    5
2    6

@jacksonjos jacksonjos force-pushed the use_cols_empty_df_bug branch 2 times, most recently from 4aa25c2 to 445d94a Compare March 25, 2018 20:26
@pep8speaks
Copy link

pep8speaks commented Mar 25, 2018

Hello @jacksonjos! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 09, 2018 at 22:45 Hours UTC

@jacksonjos jacksonjos force-pushed the use_cols_empty_df_bug branch 2 times, most recently from 02c4867 to beb5b2c Compare March 25, 2018 20:31
@jacksonjos
Copy link
Author

jacksonjos commented Mar 25, 2018

Ok, I got it, @chris-b1. It isn't better to create another issue to implement the possibility to choose columns you want to load by column name using another named argument? I think that might be confusing use usecols for a lot of different possibilities.

@jacksonjos
Copy link
Author

Is there anything wrong with the requested changes that I implemented to make the pull requested fulfill the requirements to be accepted? I'm asking because I don't understand why this pull requested it wasn't accepted yet.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waiting for @chris-b1 to have a look for this. I am concerned that usecols is now ambiguous, yes?

column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
inclusive of both sides.
* If list of strings each string shall be a Excel column letter or column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a -> an

@chris-b1
Copy link
Contributor

Yeah, I've been a bit stuck on what to do here. Current (0.22) behavior where usecols refers only to excel names for regions is bad, it broke with (untested/undocumented behavior but working!) 0.20.3 and is inconsistent with, e.g. read_csv.

I suppose the only reasonable solution is two keywords. usecols_labels and usecols_?

  • usecols_excel
  • usecols_a1 (very Excel specific, but "A1 notation" is a somewhat common term)
  • excel_columns

@jacksonjos
Copy link
Author

jacksonjos commented Apr 14, 2018

Hi,

I prefer to use usecols_lables and usecols_excel because it's possible refer columns by A1 notation and by numbers. If we decide to use usecols_a1 the named argument would cause misunderstanding about the arguments can be passed to it.

What do you think @chris-b1 and @jreback ?

@jreback
Copy link
Contributor

jreback commented Apr 14, 2018

usecols should refer to names just like it does in all other parsers. no problem adding usecols_a1 (or similar).

@chris-b1
Copy link
Contributor

OK, @jacksonjos here's what I propose. Most painful part will be backwards compat & testing. If you don't want to take it all the way, just let me know and I'll finish, would like to get this in for 0.23.

  1. Add a new kwarg usecols_excel which replicates the current usecols + your fix here
  2. Since we're breaking compat, inspect usecols and if it is string AND parses to an excel region, issue a warning that the arg was renamed. This could raise a false positive for a single letter (e.g. 'A') but don't see a way around it.
  3. usecols should get passed to the TextParser object, which will use it in the manner @jreback suggests.
    parser = TextParser(data, header=header, index_col=index_col,

@jacksonjos
Copy link
Author

jacksonjos commented Apr 17, 2018 via email

@jacksonjos
Copy link
Author

Hi @jreback and @chris-b1, I'm about to finish the new version of this pull request and I would like to know your opinion about what to do when someone pass both arguments, usecols_excel and usecols, to read_excel function?

Should I raise an exception, print some warning or do nothing?

Thanks in advance.

@chris-b1
Copy link
Contributor

chris-b1 commented Apr 23, 2018 via email

@jacksonjos jacksonjos force-pushed the use_cols_empty_df_bug branch 2 times, most recently from a747961 to 6143a0f Compare April 30, 2018 02:13
@jacksonjos
Copy link
Author

Hey, guys! I did a new push.

All requested changes are in the commit.
I'm sorry for doing this too late, but last two weeks were really had at my job and I was so tired to work on it.

I didn't understand why one of CI tests didn't pass because I do not see any relationship between the tests that did not pass and the feature I implemented/corrected.

Do you have any idea?
If there is something else I need to do to completed the implementation just tell me that this week I have time to complete the jog if there is anything else to do about this issue.

Thanks!

Copy link
Contributor

@chris-b1 chris-b1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good will try it out tonight, can you also add a note to the excel section in io.rst

example of a valid callable argument would be ``lambda x: x.upper() in
['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
parsing time and lower memory usage.
usecols_excel : int or list, default None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string, int, or list

I'd also a bit of narrative before the bullets, something like:
Columns to parse from the spreadsheet, specified as Excel location references

row.append(_parse_cell(value, typ))
data.append(row)

# Check if some string in usecols may be interpreted as a Excel
# positional column
if (usecols is not None) and (not callable(usecols)) and \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try it out later, but I think this warning is over-specified. Since usecols only used to work with excel ranges as strings, not lists of strings, I would limit the warning to that case.

df1 = self.get_csv_refdf('test1')[['B', 'D']]

with tm.assert_produces_warning(UserWarning):
df2 = self.get_exceldf('test1', ext, 'Sheet1', usecols=['B', 'D'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As state above I'm not sure this should warn, since it didn't previously work. But I would like a test for the warning, something like usecols='B,D'

Copy link
Author

@jacksonjos jacksonjos May 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't pass a string to usecols because if I do it throws an error. I have already tested it.
So, does not make sense maintain the warning just to a string type parameter unless you do not want the error be thrown.

Considering this I wrote a test to check this behavior.
Do I should remove the warning, then?

I may enable backward compatibility for usecols if you want.
I could check if usecols contains a string which contains Excel index columns and ranges and if ti does I would I could do something like usecols_excel=usecols, usecols=None, throw the warning I've already written and place a comment that this would be removed in 0.24 pandas version.
What do you think?

The other changes you asked me are done. I'm just waiting to know what to do about this warning to commit again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I mistakenly thought we needed to worry about the case like usecols='B'.

Yeah, I agree with what you proposed, handle the back compat case with a warning, we can eventually deprecate that path.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused here. IIUC usecols_excel was only going to accep the column ranges, and usecols for everything else? these HAVE to be orthogonal parametes, otherwise this is very confusing.

@@ -856,6 +856,7 @@ Other API Changes
- Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`).
- :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`)
- A user-defined-function that is passed to :func:`Series.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, :func:`DataFrame.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here <whatsnew_0230.enhancements.window_raw>`. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`)
- Changed the named argument `usecols` at :func:`read_excel` to `usecols_excel` that receives a list of index numbers or A1 index to select the columns that must be in the DataFrame, so the `usecols` argument can serve its purpose to select the columns that must be in the DataFrame using column labels (:issue:`18273`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please simplify this to make it more readable.

@@ -1166,6 +1167,7 @@ I/O
- Bug in :func:`DataFrame.to_latex()` where a ``MultiIndex`` with an empty string as its name would result in incorrect output (:issue:`18669`)
- Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
- Bug in :func:`DataFrame.to_parquet` where an exception was raised if the write destination is S3 (:issue:`19134`)
- Bug in :func:`read_excel` where `usecols_excel` named argument as a list of strings were returning a empty DataFrame (:issue:`18273`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this a bug? usecols_excel doesn't exist yet

`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element
order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]`` and
``usecols=['foo', 'bar']`` is the same as ``['bar', 'foo']``.
To instantiate a DataFrame from ``data`` with element order preserved use
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed here, starting with keeping ordering (or you can put in the Notes section if you really want)

example of a valid callable argument would be ``lambda x: x.upper() in
['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
parsing time and lower memory usage.
usecols_excel : int or list, default None
* If None then parse all columns,
* If int then indicates last column to be parsed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldusecols_excel only accept column ranges?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd be in favor of letting usecols handle the integer location case and making usecols_excel only handle column ranges

capability of passing column labels for columns to be read

- [x] closes pandas-dev#18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel' that receives a string containing comma separated Excel
ranges and columns.
Changed 'usecols' named argument, now it receives a list of strings containing
column labels or a list of integers representing column indexes or a callable
for 'read_excel' function. Created and altered tests to reflect the new usage
of these named arguments. 'index_col' keyword used to indicated which columns
in the subset of selected columns by 'usecols' or 'usecols_excel' that should
be the index of the DataFrame read. Now 'index_col' indicates which columns of
the DataFrame will be the index even if that column is not in the subset of the
selected columns.
@chris-b1
Copy link
Contributor

I re-read your explanation but I'm honestly still not following why the change to usecols with numerics is necessary, sorry I'm missing something. read_csv is consistent with the old behavior and with engine='python' also uses the TextParser class, why can't we do the same here?

In [6]: pd.read_excel('pandas/pandas/tests/io/data/test1.xlsx').to_csv('tmp.csv')

In [7]: !head tmp.csv
,A,B,C,D
2000-01-03,0.980268513777,3.68573087906,-0.364216805298,-1.15973806169
2000-01-04,1.04791624281,-0.0412318367011,-0.16181208307,0.212549316967
2000-01-05,0.498580885705,0.731167677815,-0.537677223318,1.34627041952
2000-01-06,1.12020151869,1.56762092543,0.00364077397681,0.67525259227
2000-01-07,-0.487094399463,0.571454623474,-1.6116394093,0.103468562917
2000-01-10,0.836648671666,0.246461918642,0.588542635376,1.0627820613
2000-01-11,-0.157160753327,1.34030689438,1.19577795622,-1.09700699751

In [8]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0)
Out[8]:
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [9]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0, engine='python')
Out[9]:
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

@jacksonjos
Copy link
Author

jacksonjos commented Jun 27, 2018

#@chris-b1,

Take a look at the examples below comparing the current pandas behavior to the new one,
so maybe is easier to understand.
@jreback, do you get it what is happening here?

# CURRENT BEHAVIOR
In [5]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[5]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [6]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
...   index_col=0)
Out[6]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

# Index 0 is not in used_cols, so it's not possible to use it in the index
In [7]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[7]: 
          A         B
0  0.980269  3.685731
1  1.047916 -0.041232
2  0.498581  0.731168
3  1.120202  1.567621
4 -0.487094  0.571455
5  0.836649  0.246462
6 -0.157161  1.340307

# index_col is related to the 0-th col in the cols selected in usecols. In this case, [1, 2].
# So, the date column is not the index_col
In [8]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2],
...     index_col=0)
Out[8]: 
                  B
A                  
 0.980269  3.685731
 1.047916 -0.041232
 0.498581  0.731168
 1.120202  1.567621
-0.487094  0.571455
 0.836649  0.246462
-0.157161  1.340307

In [9]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
...    index_col=1)
Out[9]: 
                   B         C
 3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
 0.731168 2000-01-05 -0.537677
 1.567621 2000-01-06  0.003641
 0.571455 2000-01-07 -1.611639
 0.246462 2000-01-10  0.588543
 1.340307 2000-01-11  1.195778

#PROPOSED BEHAVIOR

In [1]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[1]: 
                   A         C         D
2000-01-03  0.980269 -0.364217 -1.159738
2000-01-04  1.047916 -0.161812  0.212549
2000-01-05  0.498581 -0.537677  1.346270
2000-01-06  1.120202  0.003641  0.675253
2000-01-07 -0.487094 -1.611639  0.103469
2000-01-10  0.836649  0.588543  1.062782
2000-01-11 -0.157161  1.195778 -1.097007

In [2]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3], index_col=0)
Out[2]: 
                   A         C         D
2000-01-03  0.980269 -0.364217 -1.159738
2000-01-04  1.047916 -0.161812  0.212549
2000-01-05  0.498581 -0.537677  1.346270
2000-01-06  1.120202  0.003641  0.675253
2000-01-07 -0.487094 -1.611639  0.103469
2000-01-10  0.836649  0.588543  1.062782
2000-01-11 -0.157161  1.195778 -1.097007

# Index 0 is not in used_cols, but TextParser receives all columns. So, the default index
# is the first column because it does not have a column name (1st column header is
# empty).
In [3]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[3]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

# index_col is the 0-th col in the DataFrame, not the 0-th in usecols. In this case, [1, 2].
# So, the date column is selected to be the index because is the first column in the
# DataFrame.
In [4]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=0)
Out[4]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [5]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=1)
Out[5]: 
                   B         C
 3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
 0.731168 2000-01-05 -0.537677
 1.567621 2000-01-06  0.003641
 0.571455 2000-01-07 -1.611639
 0.246462 2000-01-10  0.588543
 1.340307 2000-01-11  1.195778


I'm also dumping the behavior of csv you used as an example because I saw that it
diverges from current and proposed behavior of read_excel. So, maybe we can settle,
or agree about how read_excel is supposed to behave.

In [11]: pd.read_csv('tmp.csv', usecols=[0, 2, 3])
Out[11]: 
   Unnamed: 0         B         C
0  2000-01-03  3.685731 -0.364217
1  2000-01-04 -0.041232 -0.161812
2  2000-01-05  0.731168 -0.537677
3  2000-01-06  1.567621  0.003641
4  2000-01-07  0.571455 -1.611639
5  2000-01-10  0.246462  0.588543
6  2000-01-11  1.340307  1.195778

In [12]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0)
Out[12]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [13]: pd.read_csv('tmp.csv', usecols=[1, 2])
Out[13]: 
          A         B
0  0.980269  3.685731
1  1.047916 -0.041232
2  0.498581  0.731168
3  1.120202  1.567621
4 -0.487094  0.571455
5  0.836649  0.246462
6 -0.157161  1.340307

In [14]: pd.read_csv('tmp.csv', usecols=[1, 2], index_col=0)
Out[14]: 
                  B
A                  
 0.980269  3.685731
 1.047916 -0.041232
 0.498581  0.731168
 1.120202  1.567621
-0.487094  0.571455
 0.836649  0.246462
-0.157161  1.340307

What do you think?

@jreback
Copy link
Contributor

jreback commented Jul 4, 2018

@jacksonjos can you show the proposed right next to the existing. and use separators between so its clear.

@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

can you rebase / update

@jacksonjos
Copy link
Author

jacksonjos commented Oct 27, 2018 via email

@jreback
Copy link
Contributor

jreback commented Oct 28, 2018

@gfyoung if you wouldn't mind rebasing this

@gfyoung
Copy link
Member

gfyoung commented Oct 28, 2018

@jreback : Before I do, would like to get a better understanding of this issue (see here).

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified two major bugs:

* index_col=None was not being respected
* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 8, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
@jreback
Copy link
Contributor

jreback commented Nov 8, 2018

closing in favor of #23544

@jreback jreback closed this Nov 8, 2018
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 8, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 8, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 10, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 10, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 11, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
jreback pushed a commit that referenced this pull request Nov 11, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes gh-18273.
Closes gh-20480.
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_excel return empty dataframe when using usecols
6 participants