
read_excel gives different data for one and more than one elements in parse_cols setting #15316

Closed
fortooon opened this issue Feb 6, 2017 · 12 comments
Labels: Bug, IO Excel (read_excel, to_excel)

Comments

@fortooon commented Feb 6, 2017

related to #12292

Try to get values from only the first column, which contains empty cells:

import pandas as pd
df = pd.DataFrame([["", 1, 100], [3, 2, 200], ["", 3, 300], ["", "", 400]])
df.to_excel("test_excel.xls", index=False, header=False)
fst_col = pd.read_excel("test_excel.xls", parse_cols=[0], header=None).values
fst_cols = pd.read_excel("test_excel.xls", parse_cols=[0,1], header=None).values
print(fst_col)
print("V.S.")
print(fst_cols)

...[Out]
 [[3]]
 V.S.
 [[ nan   1.]
  [  3.   2.]
  [ nan   3.]
  [ nan   nan]]

The same column reads differently depending on how many columns are requested.

Shouldn't the output for the first column be the same in both cases, for consistency when reading data?
How can I get the full data (including empty values) from the first column using parse_cols=[0]?

Expected Output

[[ nan ]
[  3. ]
[ nan ]
[ nan ]]
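A workaround sketch for the question above (the file name and the .xlsx extension are chosen for illustration, and an Excel writer engine such as openpyxl is assumed to be installed): read the whole sheet first, then select the column, so that no data row ever looks fully blank to the parser.

```python
import pandas as pd

# Build the same sheet as in the report.
df = pd.DataFrame([["", 1, 100], [3, 2, 200], ["", 3, 300], ["", "", 400]])
df.to_excel("test_excel.xlsx", index=False, header=False)

# Read everything, then select the first column: all four rows survive,
# because blank-line skipping never sees an all-empty row.
full = pd.read_excel("test_excel.xlsx", header=None)
fst_col = full[[0]]
```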

pandas: 0.18.1
xlrd: 0.9.4
python: 2.7.7.final.0

@jreback (Contributor) commented Feb 6, 2017

easier to read example

In [7]: pd.read_excel('test_excel.xls',header=None)
Out[7]: 
     0    1    2
0  NaN  1.0  100
1  3.0  2.0  200
2  NaN  3.0  300
3  NaN  NaN  400

In [8]: pd.read_excel?

In [9]: pd.read_excel('test_excel.xls',header=None,parse_cols=[0])
Out[9]: 
   0
0  3

In [10]: pd.read_excel('test_excel.xls',header=None,parse_cols=[0,1])
Out[10]: 
     0    1
0  NaN  1.0
1  3.0  2.0
2  NaN  3.0
3  NaN  NaN

[9] does look buggy.

@jreback (Contributor) commented Feb 6, 2017

@chris-b1 as you seem to be the latent excel expert!

also, IIRC we had an issue about renaming parse_cols -> usecols? (deprecation)

@jreback added the Bug and IO Excel labels Feb 6, 2017
@jreback added this to the 0.20.0 milestone Feb 6, 2017
@chris-b1 (Contributor) commented Feb 6, 2017

xref #4988 for usecols. This does look buggy.

Personally, I never use parse_cols; there isn't much of a performance benefit, so I just read the whole frame and then select the columns I need.

@jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@stanleyguan commented

I'm taking a stab at this (from PyCon Sprint).

@jorisvandenbossche (Member) commented

From discussion on gitter: this is caused by the fact that the TextParser by default skips empty lines. This is also the default behaviour of read_csv, but there this can be controlled by skip_blank_lines:

In [9]: from io import StringIO

In [10]: s = """a
    ...: 1
    ...: 
    ...: 2
    ...: """

In [11]: pd.read_csv(StringIO(s))
Out[11]: 
   a
0  1
1  2

In [12]: pd.read_csv(StringIO(s), skip_blank_lines=False)
Out[12]: 
     a
0  1.0
1  NaN
2  2.0

So possible options to go forward here:

  • regard the current behaviour of read_excel (the one reported here, with the difference between parsing a single or multiple columns if you have missing values) as the desired behaviour (mimicking read_csv), but then add skip_blank_lines as a new keyword arg to have the option to change this
  • decide we want to fix this so that read_excel never skips blank lines. This option could then be added to TextParser to enable read_excel to specify the behaviour.

@jreback @chris-b1 any preferences?

@stanleyguan commented

@jorisvandenbossche Thanks for moving the conversation over.

Some more illustration with TextParser:

In [4]: data = [['foo', 3], ['bar', ''], ['', 2], ['foobar', '']]

In [5]: TextParser(data, header=None).read()
Out[5]: 
        0    1
0     foo  3.0
1     bar  NaN
2     NaN  2.0
3  foobar  NaN

In [6]: TextParser([[row[0]] for row in data], header=None).read()
Out[6]: 
        0
0     foo
1     bar
2  foobar

@chris-b1 (Contributor) commented

It seems like read_csv might do this in a different order: skipping lines only if the entire line is blank, then selecting columns. Is that reasonable here?

pd.read_csv(StringIO("""a,b
,1
2,"""), usecols=['a'])
Out[49]: 
     a
0  NaN
1  2.0

@chris-b1 (Contributor) commented

Actually, since TextParser now supports usecols, you probably could rip most of the column-specific logic out of read_excel and defer it to there, which would replicate read_csv.

from pandas.io.parsers import TextParser

TextParser([['a', 'b'], [1, ''], ['', 2]], usecols='a').read()
Out[54]: 
     a
0  1.0
1  NaN

@jorisvandenbossche (Member) commented

Yes, that looks like a good way to go.
The behaviour of read_csv (skipping a row only when the full row is missing, even when you select only some columns) certainly makes sense.

@stanleyguan commented

Deferring the logic to TextParser does mean that we will do unnecessary parsing for the columns that will be discarded. Is that acceptable?

@chris-b1 (Contributor) commented

Yes, that's fine: because of the way Excel data is structured (row-oriented XML), the whole file has to be parsed anyway. It might add a little overhead, but that is small relative to the overall reading time.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10), columns=list('qwertyuasd'))
df.to_excel('tmp.xlsx', index=False)

%timeit pd.read_excel('tmp.xlsx')
10 loops, best of 3: 111 ms per loop

%timeit pd.read_excel('tmp.xlsx', parse_cols='a')
10 loops, best of 3: 104 ms per loop

@gfyoung (Member) commented Nov 13, 2018

Not sure when this was patched, but we now test parse_cols (renamed usecols) extensively, and this is no longer a bug. 🎉
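A sketch of the fixed behaviour under the modern keyword (parse_cols was deprecated in favour of usecols; an Excel engine such as openpyxl is assumed, and the in-memory buffer is used purely for illustration):

```python
from io import BytesIO
import pandas as pd

# Rebuild the sheet from the original report, in memory.
df = pd.DataFrame([["", 1, 100], [3, 2, 200], ["", 3, 300], ["", "", 400]])
buf = BytesIO()
df.to_excel(buf, index=False, header=False)
buf.seek(0)

# Selecting a single column now keeps all four rows, matching the
# reporter's expected output (blank-line skipping looks at the full row).
out = pd.read_excel(buf, header=None, usecols=[0])
```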

Closing.

@gfyoung closed this as completed Nov 13, 2018