
read_excel gives different data for one and more than one elements in parse_cols setting #15316

Closed
fortooon opened this issue Feb 6, 2017 · 12 comments
Labels: Bug, IO Excel (read_excel, to_excel)

Comments

@fortooon commented Feb 6, 2017

related to #12292

Try to get values from only the first column, which contains empty cells:

import pandas as pd
df = pd.DataFrame([["", 1, 100], [3, 2, 200], ["", 3, 300], ["", "", 400]])
df.to_excel("test_excel.xls", index=False, header=False)
fst_col = pd.read_excel("test_excel.xls", parse_cols=[0], header=None).values
fst_cols = pd.read_excel("test_excel.xls", parse_cols=[0,1], header=None).values
print(fst_col)
print("V.S.")
print(fst_cols)

...[Out]
 [[3]]
 V.S.
 [[ nan   1.]
  [  3.   2.]
  [ nan   3.]
  [ nan   nan]]

The same column reads differently depending on how many columns are requested.

Shouldn't the output for the first column be the same in both cases, for consistency when reading data?
How can I get the full data (including empty values) from the first column using parse_cols=[0]?

Expected Output

[[ nan ]
[  3. ]
[ nan ]
[ nan ]]
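A workaround sketch for the question above (the file name and the .xlsx extension are chosen for illustration, and an Excel writer engine such as openpyxl is assumed to be installed): read the whole sheet first, then select the column, so that no data row ever looks fully blank to the parser.

```python
import pandas as pd

# Build the same sheet as in the report.
df = pd.DataFrame([["", 1, 100], [3, 2, 200], ["", 3, 300], ["", "", 400]])
df.to_excel("test_excel.xlsx", index=False, header=False)

# Read everything, then select the first column: all four rows survive,
# because blank-line skipping never sees an all-empty row.
full = pd.read_excel("test_excel.xlsx", header=None)
fst_col = full[[0]]
```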

pandas: 0.18.1
xlrd: 0.9.4
python: 2.7.7.final.0

@jreback (Contributor) commented Feb 6, 2017

easier to read example

In [7]: pd.read_excel('test_excel.xls',header=None)
Out[7]: 
     0    1    2
0  NaN  1.0  100
1  3.0  2.0  200
2  NaN  3.0  300
3  NaN  NaN  400

In [8]: pd.read_excel?

In [9]: pd.read_excel('test_excel.xls',header=None,parse_cols=[0])
Out[9]: 
   0
0  3

In [10]: pd.read_excel('test_excel.xls',header=None,parse_cols=[0,1])
Out[10]: 
     0    1
0  NaN  1.0
1  3.0  2.0
2  NaN  3.0
3  NaN  NaN

[9] does look buggy.

@jreback (Contributor) commented Feb 6, 2017

@chris-b1 as you seem to be the latent excel expert!

also, IIRC we had an issue about renaming parse_cols -> usecols? (deprecation)

@jreback added the Bug and IO Excel labels Feb 6, 2017
@jreback added this to the 0.20.0 milestone Feb 6, 2017
@chris-b1 (Contributor) commented Feb 6, 2017

xref #4988 for usecols. This does look buggy.

Personally, I never use parse_cols; there isn't much of a performance benefit, so I just read the whole frame and then select the columns I need.

@jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@stanleyguan commented

I'm taking a stab at this (from PyCon Sprint).

@jorisvandenbossche (Member) commented

From discussion on gitter: this is caused by the fact that the TextParser by default skips empty lines. This is also the default behaviour of read_csv, but there this can be controlled by skip_blank_lines:

In [9]: from io import StringIO

In [10]: s = """a
    ...: 1
    ...: 
    ...: 2
    ...: """

In [11]: pd.read_csv(StringIO(s))
Out[11]: 
   a
0  1
1  2

In [12]: pd.read_csv(StringIO(s), skip_blank_lines=False)
Out[12]: 
     a
0  1.0
1  NaN
2  2.0

So possible options to go forward here:

  • regard the current behaviour of read_excel (the one reported here, with the difference between parsing a single or multiple columns if you have missing values) as the desired behaviour (mimicking read_csv), but then add skip_blank_lines as a new keyword arg to have the option to change this
  • decide we want to fix this so that read_excel never skips blank lines. This option could then be added to TextParser to enable read_excel to specify the behaviour.

@jreback @chris-b1 any preferences?

@stanleyguan commented

@jorisvandenbossche Thanks for moving the conversation over.

Some more illustration with TextParser:

In [4]: data = [['foo', 3], ['bar', ''], ['', 2], ['foobar', '']]

In [5]: TextParser(data, header=None).read()
Out[5]: 
        0    1
0     foo  3.0
1     bar  NaN
2     NaN  2.0
3  foobar  NaN

In [6]: TextParser([[row[0]] for row in data], header=None).read()
Out[6]: 
        0
0     foo
1     bar
2  foobar

@chris-b1 (Contributor) commented

It seems like read_csv might do this in a different order: skipping lines only if the entire line is blank, then selecting columns. Is that reasonable here?

pd.read_csv(StringIO("""a,b
,1
2,"""), usecols=['a'])
Out[49]: 
     a
0  NaN
1  2.0

@chris-b1 (Contributor) commented

Actually, since TextParser now supports usecols, you probably could rip most of the column-specific logic out of read_excel and defer it to there, which would replicate read_csv.

from pandas.io.parsers import TextParser

TextParser([['a', 'b'], [1, ''], ['', 2]], usecols='a').read()
Out[54]: 
     a
0  1.0
1  NaN

@jorisvandenbossche (Member) commented

Yes, that looks like a good way to go.
The behaviour of read_csv (skipping a row only when the full row is missing, even when you select only some columns) certainly makes sense.

@stanleyguan commented

Deferring the logic to TextParser does mean that we will do unnecessary parsing for the columns that will be discarded. Is that acceptable?

@chris-b1 (Contributor) commented

Yes, that's fine: because of the way Excel data is structured (row-oriented XML), the whole file has to be parsed anyway. It might add a little overhead, but that is small relative to the overall reading time.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10), columns=list('qwertyuasd'))
df.to_excel('tmp.xlsx', index=False)

%timeit pd.read_excel('tmp.xlsx')
10 loops, best of 3: 111 ms per loop

%timeit pd.read_excel('tmp.xlsx', parse_cols='a')
10 loops, best of 3: 104 ms per loop

@gfyoung (Member) commented Nov 13, 2018

Not sure when this was patched, but we now test parse_cols (renamed usecols) extensively, and this is no longer a bug. 🎉
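A sketch of the fixed behaviour under the modern keyword (parse_cols was deprecated in favour of usecols; an Excel engine such as openpyxl is assumed, and the in-memory buffer is used purely for illustration):

```python
from io import BytesIO
import pandas as pd

# Rebuild the sheet from the original report, in memory.
df = pd.DataFrame([["", 1, 100], [3, 2, 200], ["", 3, 300], ["", "", 400]])
buf = BytesIO()
df.to_excel(buf, index=False, header=False)
buf.seek(0)

# Selecting a single column now keeps all four rows, matching the
# reporter's expected output (blank-line skipping looks at the full row).
out = pd.read_excel(buf, header=None, usecols=[0])
```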

Closing.

@gfyoung closed this as completed Nov 13, 2018