Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: read_excel return empty dataframe when using usecols #20480

Closed

Conversation

Projects
None yet
6 participants
@jacksonjos
Copy link

commented Mar 25, 2018

  • closes #18273
  • tests added / passed
  • passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
  • whatsnew entry

As mentioned read_excel returns an empty DataFrame when usecols argument is a list of strings.
Now lists of strings are correctly interpreted by read_excel function.

@codecov

This comment has been minimized.

Copy link

commented Mar 25, 2018

Codecov Report

Merging #20480 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #20480   +/-   ##
=======================================
  Coverage   91.89%   91.89%           
=======================================
  Files         153      153           
  Lines       49596    49596           
=======================================
  Hits        45576    45576           
  Misses       4020     4020
Flag Coverage Δ
#multiple 90.29% <ø> (ø) ⬆️
#single 41.86% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 415012f...e257100. Read the comment docs.

@@ -479,6 +482,9 @@ def _excel2num(x):
return i <= usecols
elif isinstance(usecols, compat.string_types):
return i in _range2cols(usecols)
elif all(isinstance(x, compat.string_types) for x in usecols) is True:

This comment has been minimized.

Copy link
@jreback

jreback Mar 25, 2018

Contributor

you don't need the is True

@@ -479,6 +482,9 @@ def _excel2num(x):
return i <= usecols
elif isinstance(usecols, compat.string_types):
return i in _range2cols(usecols)
elif all(isinstance(x, compat.string_types) for x in usecols) is True:
usecols_str = ",".join(usecols)

This comment has been minimized.

Copy link
@jreback

jreback Mar 25, 2018

Contributor

can you add a 1-line comment on this case (and the case above to differentiate)

@@ -179,6 +179,42 @@ def test_usecols_str(self, ext):
tm.assert_frame_equal(df2, df1, check_names=False)
tm.assert_frame_equal(df3, df1, check_names=False)

def test_usecols_str_list(self, ext):

This comment has been minimized.

Copy link
@jreback

jreback Mar 25, 2018

Contributor

can you add the gh issue as a comment

def test_usecols_str_list(self, ext):

dfref = self.get_csv_refdf('test1')

This comment has been minimized.

Copy link
@jreback

jreback Mar 25, 2018

Contributor

can you parameterize this test

This comment has been minimized.

Copy link
@jacksonjos

jacksonjos Mar 25, 2018

Author

Hi, thank you for your feedback! =)

I'm not sure I've got it right. Do you want me to parametrize the objects passed as argument to parse_cols argument to each inside of test_usecols_str_list test because its tests are a copy of test_usecols_str or you want me to parametrize something else?

This comment has been minimized.

Copy link
@jreback

jreback Mar 25, 2018

Contributor

you are testing multiple cases that are pretty similar. the test becomes much simpler if you can parameterize over the possible cases.

This comment has been minimized.

Copy link
@jacksonjos

jacksonjos Mar 25, 2018

Author

Ok, I got it. Do you want me to do the same to the test above (test_usecols_str)?

This comment has been minimized.

Copy link
@jacksonjos

jacksonjos Mar 26, 2018

Author

Hi @jreback. I implemented all requested changes in the code.
I would like to know your opinion about a suggestion made by @chris-b1 at #20480 (comment).

@jreback jreback requested a review from chris-b1 Mar 25, 2018

@jacksonjos jacksonjos force-pushed the jacksonjos:use_cols_empty_df_bug branch from 171c9b6 to fca0ae0 Mar 25, 2018

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Mar 25, 2018

This looks OK, but re-reading it, #18273 is actually two somewhat separate problems and we have a bit of an API tangle here

  1. passing a list of Excel column letters doesn't work (this fixes that)
  2. passing a list column names (e.g. column title 'foo' in the spreadsheet) to usecols doesn't work anymore

Number 2 wasn't part of the read_excel documented args, but once worked, because this param used to be called parse_cols (#17774), and if you passed something to usecols it would get passed down to the TextParser logic and do the right thing

parser = TextParser(data, header=header, index_col=index_col,

Not sure what the solution is, there could be ambiguous cases (column titled'A' in spreadsheet column B), that would be easiest to handle with separate kwargs. Or could just have usecols do everything and warn in cases of ambiguity.

Not saying you need to address item 2 in this PR, just need a new issue if not.

@jacksonjos

This comment has been minimized.

Copy link
Author

commented Mar 25, 2018

Hi @chris-b1, thank you for your comment.

I tried to test what you said, but I don't get it.

| Day      | Number | Animal |
|----------|--------|--------|
| 10/12/18 | 5      | Tatu   |
| 10/13/18 | 4      | Paca   |

I tried the following:

df_not_ok = pd.read_excel("mock.xlsx", ["Day"])
>> XLRDError: No sheet named <'Day'>

df_not_ok = pd.read_excel("mock.xlsx", usecols=["Day"])
>> Empty DataFrame
>> Columns: []
>> Index: []
pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

If someone create feature or issue I may can work to make possible to pass col names as arguments to read just them.

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Mar 25, 2018

Right, that doesn't work on 0.22, but it used to, which is what #18273 is about

print(pd.__version__)
# 0.20.3
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})
df.to_excel('tmp.xlsx', index=False)
pd.read_excel('tmp.xlsx', usecols=['bar'])
   bar
0    4
1    5
2    6

@jacksonjos jacksonjos force-pushed the jacksonjos:use_cols_empty_df_bug branch 2 times, most recently from 4aa25c2 to 445d94a Mar 25, 2018

@pep8speaks

This comment has been minimized.

Copy link

commented Mar 25, 2018

Hello @jacksonjos! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 09, 2018 at 22:45 Hours UTC

@jacksonjos jacksonjos force-pushed the jacksonjos:use_cols_empty_df_bug branch 2 times, most recently from 02c4867 to beb5b2c Mar 25, 2018

@jacksonjos

This comment has been minimized.

Copy link
Author

commented Mar 25, 2018

Ok, I got it, @chris-b1. It isn't better to create another issue to implement the possibility to choose columns you want to load by column name using another named argument? I think that might be confusing use usecols for a lot of different possibilities.

@jacksonjos

This comment has been minimized.

Copy link
Author

commented Mar 29, 2018

Is there anything wrong with the requested changes that I implemented to make the pull requested fulfill the requirements to be accepted? I'm asking because I don't understand why this pull requested it wasn't accepted yet.

@jreback
Copy link
Contributor

left a comment

waiting for @chris-b1 to have a look for this. I am concerned that usecols is now ambiguous, yes?

column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
inclusive of both sides.
* If list of strings each string shall be a Excel column letter or column

This comment has been minimized.

Copy link
@jreback

jreback Mar 30, 2018

Contributor

a -> an

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Mar 31, 2018

Yeah, I've been a bit stuck on what to do here. Current (0.22) behavior where usecols refers only to excel names for regions is bad, it broke with (untested/undocumented behavior but working!) 0.20.3 and is inconsistent with, e.g. read_csv.

I suppose the only reasonable solution is two keywords. usecols_labels and usecols_?

  • usecols_excel
  • usecols_a1 (very Excel specific, but "A1 notation" is a somewhat common term)
  • excel_columns
@jacksonjos

This comment has been minimized.

Copy link
Author

commented Apr 14, 2018

Hi,

I prefer to use usecols_lables and usecols_excel because it's possible refer columns by A1 notation and by numbers. If we decide to use usecols_a1 the named argument would cause misunderstanding about the arguments can be passed to it.

What do you think @chris-b1 and @jreback ?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 14, 2018

usecols should refer to names just like it does in all other parsers. no problem adding usecols_a1 (or similar).

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Apr 16, 2018

OK, @jacksonjos here's what I propose. Most painful part will be backwards compat & testing. If you don't want to take it all the way, just let me know and I'll finish, would like to get this in for 0.23.

  1. Add a new kwarg usecols_excel which replicates the current usecols + your fix here
  2. Since we're breaking compat, inspect usecols and if it is string AND parses to an excel region, issue a warning that the arg was renamed. This could raise a false positive for a single letter (e.g. 'A') but don't see a way around it.
  3. usecols should get passed to the TextParser object, which will use it in the manner @jreback suggests.
    parser = TextParser(data, header=header, index_col=index_col,
@jacksonjos

This comment has been minimized.

Copy link
Author

commented Apr 17, 2018

@jacksonjos

This comment has been minimized.

Copy link
Author

commented Apr 23, 2018

Hi @jreback and @chris-b1, I'm about to finish the new version of this pull request and I would like to know your opinion about what to do when someone pass both arguments, usecols_excel and usecols, to read_excel function?

Should I raise an exception, print some warning or do nothing?

Thanks in advance.

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2018

@jacksonjos jacksonjos force-pushed the jacksonjos:use_cols_empty_df_bug branch 2 times, most recently from a747961 to 6143a0f Apr 30, 2018

@jacksonjos

This comment has been minimized.

Copy link
Author

commented Apr 30, 2018

Hey, guys! I did a new push.

All requested changes are in the commit.
I'm sorry for doing this too late, but last two weeks were really had at my job and I was so tired to work on it.

I didn't understand why one of CI tests didn't pass because I do not see any relationship between the tests that did not pass and the feature I implemented/corrected.

Do you have any idea?
If there is something else I need to do to completed the implementation just tell me that this week I have time to complete the jog if there is anything else to do about this issue.

Thanks!

@chris-b1
Copy link
Contributor

left a comment

looks pretty good will try it out tonight, can you also add a note to the excel section in io.rst

example of a valid callable argument would be ``lambda x: x.upper() in
['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
parsing time and lower memory usage.
usecols_excel : int or list, default None

This comment has been minimized.

Copy link
@chris-b1

chris-b1 Apr 30, 2018

Contributor

string, int, or list

I'd also a bit of narrative before the bullets, something like:
Columns to parse from the spreadsheet, specified as Excel location references

row.append(_parse_cell(value, typ))
data.append(row)

# Check if some string in usecols may be interpreted as a Excel
# positional column
if (usecols is not None) and (not callable(usecols)) and \

This comment has been minimized.

Copy link
@chris-b1

chris-b1 Apr 30, 2018

Contributor

I'll try it out later, but I think this warning is over-specified. Since usecols only used to work with excel ranges as strings, not lists of strings, I would limit the warning to that case.

df1 = self.get_csv_refdf('test1')[['B', 'D']]

with tm.assert_produces_warning(UserWarning):
df2 = self.get_exceldf('test1', ext, 'Sheet1', usecols=['B', 'D'])

This comment has been minimized.

Copy link
@chris-b1

chris-b1 Apr 30, 2018

Contributor

As state above I'm not sure this should warn, since it didn't previously work. But I would like a test for the warning, something like usecols='B,D'

This comment has been minimized.

Copy link
@jacksonjos

jacksonjos May 1, 2018

Author

I can't pass a string to usecols because if I do it throws an error. I have already tested it.
So, does not make sense maintain the warning just to a string type parameter unless you do not want the error be thrown.

Considering this I wrote a test to check this behavior.
Do I should remove the warning, then?

I may enable backward compatibility for usecols if you want.
I could check if usecols contains a string which contains Excel index columns and ranges and if ti does I would I could do something like usecols_excel=usecols, usecols=None, throw the warning I've already written and place a comment that this would be removed in 0.24 pandas version.
What do you think?

The other changes you asked me are done. I'm just waiting to know what to do about this warning to commit again.

This comment has been minimized.

Copy link
@chris-b1

chris-b1 May 3, 2018

Contributor

Ah, I mistakenly thought we needed to worry about the case like usecols='B'.

Yeah, I agree with what you proposed, handle the back compat case with a warning, we can eventually deprecate that path.

@jreback
Copy link
Contributor

left a comment

I am confused here. IIUC usecols_excel was only going to accep the column ranges, and usecols for everything else? these HAVE to be orthogonal parametes, otherwise this is very confusing.

@@ -856,6 +856,7 @@ Other API Changes
- Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`).
- :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`)
- A user-defined-function that is passed to :func:`Series.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, :func:`DataFrame.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here <whatsnew_0230.enhancements.window_raw>`. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`)
- Changed the named argument `usecols` at :func:`read_excel` to `usecols_excel` that receives a list of index numbers or A1 index to select the columns that must be in the DataFrame, so the `usecols` argument can serve its purpose to select the columns that must be in the DataFrame using column labels (:issue:`18273`)

This comment has been minimized.

Copy link
@jreback

jreback May 2, 2018

Contributor

Please simplify this to make it more readable.

@@ -1166,6 +1167,7 @@ I/O
- Bug in :func:`DataFrame.to_latex()` where a ``MultiIndex`` with an empty string as its name would result in incorrect output (:issue:`18669`)
- Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
- Bug in :func:`DataFrame.to_parquet` where an exception was raised if the write destination is S3 (:issue:`19134`)
- Bug in :func:`read_excel` where `usecols_excel` named argument as a list of strings were returning a empty DataFrame (:issue:`18273`)

This comment has been minimized.

Copy link
@jreback

jreback May 2, 2018

Contributor

how is this a bug? usecols_excel doesn't exist yet

`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element
order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]`` and
``usecols=['foo', 'bar']`` is the same as ``['bar', 'foo']``.
To instantiate a DataFrame from ``data`` with element order preserved use

This comment has been minimized.

Copy link
@jreback

jreback May 2, 2018

Contributor

this is not needed here, starting with keeping ordering (or you can put in the Notes section if you really want)

example of a valid callable argument would be ``lambda x: x.upper() in
['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
parsing time and lower memory usage.
usecols_excel : int or list, default None
* If None then parse all columns,
* If int then indicates last column to be parsed

This comment has been minimized.

Copy link
@jreback

jreback May 2, 2018

Contributor

shouldusecols_excel only accept column ranges?

This comment has been minimized.

Copy link
@chris-b1

chris-b1 May 3, 2018

Contributor

Yeah, I'd be in favor of letting usecols handle the integer location case and making usecols_excel only handle column ranges

BUG: read_excel return empty dataframe when using usecols and restored
capability of passing column labels for columns to be read

- [x] closes #18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel' that receives a string containing comma separated Excel
ranges and columns.
Changed 'usecols' named argument, now it receives a list of strings containing
column labels or a list of integers representing column indexes or a callable
for 'read_excel' function. Created and altered tests to reflect the new usage
of these named arguments. 'index_col' keyword used to indicated which columns
in the subset of selected columns by 'usecols' or 'usecols_excel' that should
be the index of the DataFrame read. Now 'index_col' indicates which columns of
the DataFrame will be the index even if that column is not in the subset of the
selected columns.

@jacksonjos jacksonjos force-pushed the jacksonjos:use_cols_empty_df_bug branch from 6c6eede to e257100 Jun 9, 2018

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Jun 23, 2018

I re-read your explanation but I'm honestly still not following why the change to usecols with numerics is necessary, sorry I'm missing something. read_csv is consistent with the old behavior and with engine='python' also uses the TextParser class, why can't we do the same here?

In [6]: pd.read_excel('pandas/pandas/tests/io/data/test1.xlsx').to_csv('tmp.csv')

In [7]: !head tmp.csv
,A,B,C,D
2000-01-03,0.980268513777,3.68573087906,-0.364216805298,-1.15973806169
2000-01-04,1.04791624281,-0.0412318367011,-0.16181208307,0.212549316967
2000-01-05,0.498580885705,0.731167677815,-0.537677223318,1.34627041952
2000-01-06,1.12020151869,1.56762092543,0.00364077397681,0.67525259227
2000-01-07,-0.487094399463,0.571454623474,-1.6116394093,0.103468562917
2000-01-10,0.836648671666,0.246461918642,0.588542635376,1.0627820613
2000-01-11,-0.157160753327,1.34030689438,1.19577795622,-1.09700699751

In [8]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0)
Out[8]:
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [9]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0, engine='python')
Out[9]:
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778
@jacksonjos

This comment has been minimized.

Copy link
Author

commented Jun 27, 2018

#@chris-b1,

Take a look at the examples below comparing the current pandas behavior to the new one,
so maybe is easier to understand.
@jreback, do you get it what is happening here?

# CURRENT BEHAVIOR
In [5]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[5]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [6]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
...   index_col=0)
Out[6]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

# Index 0 is not in used_cols, so it's not possible to use it in the index
In [7]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[7]: 
          A         B
0  0.980269  3.685731
1  1.047916 -0.041232
2  0.498581  0.731168
3  1.120202  1.567621
4 -0.487094  0.571455
5  0.836649  0.246462
6 -0.157161  1.340307

# index_col is related to the 0-th col in the cols selected in usecols. In this case, [1, 2].
# So, the date column is not the index_col
In [8]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2],
...     index_col=0)
Out[8]: 
                  B
A                  
 0.980269  3.685731
 1.047916 -0.041232
 0.498581  0.731168
 1.120202  1.567621
-0.487094  0.571455
 0.836649  0.246462
-0.157161  1.340307

In [9]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
...    index_col=1)
Out[9]: 
                   B         C
 3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
 0.731168 2000-01-05 -0.537677
 1.567621 2000-01-06  0.003641
 0.571455 2000-01-07 -1.611639
 0.246462 2000-01-10  0.588543
 1.340307 2000-01-11  1.195778

#PROPOSED BEHAVIOR

In [1]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[1]: 
                   A         C         D
2000-01-03  0.980269 -0.364217 -1.159738
2000-01-04  1.047916 -0.161812  0.212549
2000-01-05  0.498581 -0.537677  1.346270
2000-01-06  1.120202  0.003641  0.675253
2000-01-07 -0.487094 -1.611639  0.103469
2000-01-10  0.836649  0.588543  1.062782
2000-01-11 -0.157161  1.195778 -1.097007

In [2]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3], index_col=0)
Out[2]: 
                   A         C         D
2000-01-03  0.980269 -0.364217 -1.159738
2000-01-04  1.047916 -0.161812  0.212549
2000-01-05  0.498581 -0.537677  1.346270
2000-01-06  1.120202  0.003641  0.675253
2000-01-07 -0.487094 -1.611639  0.103469
2000-01-10  0.836649  0.588543  1.062782
2000-01-11 -0.157161  1.195778 -1.097007

# Index 0 is not in used_cols, but TextParser receives all columns. So, the default index
# is the first column because it does not have a column name (1st column header is
# empty).
In [3]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[3]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

# index_col is the 0-th col in the DataFrame, not the 0-th in usecols. In this case, [1, 2].
# So, the date column is selected to be the index because is the first column in the
# DataFrame.
In [4]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=0)
Out[4]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [5]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=1)
Out[5]: 
                   B         C
 3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
 0.731168 2000-01-05 -0.537677
 1.567621 2000-01-06  0.003641
 0.571455 2000-01-07 -1.611639
 0.246462 2000-01-10  0.588543
 1.340307 2000-01-11  1.195778


I'm also dumping the behavior of csv you used as an example because I saw that it
diverges from current and proposed behavior of read_excel. So, maybe we can settle,
or agree about how read_excel is supposed to behave.

In [11]: pd.read_csv('tmp.csv', usecols=[0, 2, 3])
Out[11]: 
   Unnamed: 0         B         C
0  2000-01-03  3.685731 -0.364217
1  2000-01-04 -0.041232 -0.161812
2  2000-01-05  0.731168 -0.537677
3  2000-01-06  1.567621  0.003641
4  2000-01-07  0.571455 -1.611639
5  2000-01-10  0.246462  0.588543
6  2000-01-11  1.340307  1.195778

In [12]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0)
Out[12]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [13]: pd.read_csv('tmp.csv', usecols=[1, 2])
Out[13]: 
          A         B
0  0.980269  3.685731
1  1.047916 -0.041232
2  0.498581  0.731168
3  1.120202  1.567621
4 -0.487094  0.571455
5  0.836649  0.246462
6 -0.157161  1.340307

In [14]: pd.read_csv('tmp.csv', usecols=[1, 2], index_col=0)
Out[14]: 
                  B
A                  
 0.980269  3.685731
 1.047916 -0.041232
 0.498581  0.731168
 1.120202  1.567621
-0.487094  0.571455
 0.836649  0.246462
-0.157161  1.340307

What do you think?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 4, 2018

@jacksonjos can you show the proposed right next to the existing. and use separators between so its clear.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 25, 2018

can you rebase / update

@jacksonjos

This comment has been minimized.

Copy link
Author

commented Oct 27, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 28, 2018

@gfyoung if you wouldn't mind rebasing this

@gfyoung

This comment has been minimized.

Copy link
Member

commented Oct 28, 2018

@jreback : Before I do, would like to get a better understanding of this issue (see here).

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified two major bugs:

* index_col=None was not being respected
* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 7, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 8, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 8, 2018

closing in favor of #23544

@jreback jreback closed this Nov 8, 2018

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 8, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 8, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 10, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 10, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 11, 2018

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

jreback added a commit that referenced this pull request Nov 11, 2018

BUG: Delegate more of Excel parsing to CSV (#23544)
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes gh-18273.
Closes gh-20480.

JustinZhengBC added a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018

BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

brute4s99 added a commit to brute4s99/pandas that referenced this pull request Nov 19, 2018

BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.