Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_excel with dtype=str converts empty cells to the string 'nan' #20377

Closed
arnau126 opened this issue Mar 16, 2018 · 15 comments

Comments

Projects
None yet
6 participants
@arnau126
Copy link

commented Mar 16, 2018

Code Sample, a copy-pastable example if possible

In [10]: df = pd.DataFrame({'a': ['x', 'y', '', 'z']})

In [11]: df
Out[11]: 
   a
0  x
1  y
2   
3  z

In [12]: df.to_excel('temp.xlsx')

In [13]: df = pd.read_excel('temp.xlsx', dtype=str)

In [14]: df
Out[14]: 
     a
0    x
1    y
2  nan
3    z

In [15]: df.loc[2, 'a']
Out[15]: 'nan'

In [16]: type(df.loc[2, 'a'])
Out[16]: str

Problem description

The empty string of the original dataframe becomes the string 'nan', instead of numpy.nan.

Expected Output

In [14]: df
Out[14]: 
     a
0    x
1    y
2  NaN
3    z

In [15]: df.loc[2, 'a']
Out[15]: nan

In [16]: type(df.loc[2, 'a'])
Out[16]: float

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-36-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: ca_ES.UTF-8
LOCALE: ca_ES.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Mar 16, 2018

Thanks for the report. Mind taking a look to see what's going on?

@arnau126

This comment has been minimized.

Copy link
Author

commented Mar 16, 2018

I've found the function that converts the np.nan to 'nan': pandas.core.dtypes.astype_nansafe.

read_excel uses PythonParser. This parser has a function called self._cast_types, which uses astype_nansafe.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L1570

An example of astype_nansafe:

In [4]: from pandas.core.dtypes.cast import astype_nansafe

In [5]: arr = np.array([np.nan, 'a', 'b'], object)

In [6]: arr
Out[6]: array([nan, 'a', 'b'], dtype=object)

In [7]: astype_nansafe(arr, str)
Out[7]: array(['nan', 'a', 'b'], dtype=object)

I don't know if it's the expected behavior or not.
Should we modify astype_nansafe? or _cast_types?.

I'm not sure how to fix the issue.

@cbertinato

This comment has been minimized.

Copy link
Contributor

commented Mar 17, 2018

Is that what one should expect when using dtype=str? Not specifying dtype produces the expected result. The default value is None and dtype conversion is done by row.

In [16]: df = pd.read_excel('temp.xlsx')

In [17]: df
Out[17]:
     a
0    x
1    y
2  NaN
3    z

In [18]: df.loc[2, 'a']
Out[18]: nan

In [19]: type(df.loc[2, 'a'])
Out[19]: float

In [20]: type(df.loc[1, 'a'])
Out[20]: str

I think that the output for dtype=str is expected behavior.

@nikoskaragiannakis

This comment has been minimized.

Copy link
Contributor

commented Mar 17, 2018

read_csv returns what you would expect

from io import StringIO
import pandas as pd

arr = pd.read_csv(StringIO('col1, col2, col3\n,a,b'), dtype=str)

it will give you

  col1  col2  col3
0  NaN     a     b

so, I guess, what you expect from read_excel seems to be consistent with this.

I guess, if this needs to be changed, the best would be to change astype_unicode in pandas/_libs/lib.pyx.
What do you think?

@nikoskaragiannakis

This comment has been minimized.

Copy link
Contributor

commented Mar 17, 2018

@cbertinato just saw your comment! Good point.

@arnau126

This comment has been minimized.

Copy link
Author

commented Mar 19, 2018

@cbertinato, if I have a column with numbers, but I want to read them as strings I need to specify dtype=str, otherwise they will be read as float.

In [1]: df = pd.DataFrame({'a': ['1', '2', '', '3']})

In [2]: df.to_excel('temp.xlsx')

In [3]: pd.read_excel('temp.xlsx', dtype=str)
Out[3]: 
     a
0   1
1   2
2  nan
3   3

In [5]: pd.read_excel('temp.xlsx')
Out[5]: 
     a
0  1.0
1  2.0
2  NaN
3  3.0

Moreover, as @nikoskaragiannakis says, it should be consistent with read_csv.

@cbertinato

This comment has been minimized.

Copy link
Contributor

commented Mar 19, 2018

I missed the use case for numbers, and @nikoskaragiannakis's comment. It should probably be consistent with read_csv.

@nikoskaragiannakis

This comment has been minimized.

Copy link
Contributor

commented Mar 19, 2018

@cbertinato if the case is as @arnau126 says, then I can make a fix for this.

@nikoskaragiannakis

This comment has been minimized.

Copy link
Contributor

commented Mar 20, 2018

Question: When we use dtype=str do we want np.nan to turn to a string "nan" or to stay as it is?
I made a fix for this issue, but the above functionality is affected.

@arnau126

This comment has been minimized.

Copy link
Author

commented Mar 20, 2018

The purpose of this issue is just change this functionality: currently np.nan turns to a string "nan", and I think that it should stay as it.

@cbertinato

This comment has been minimized.

Copy link
Contributor

commented Mar 20, 2018

As @arnau126 points out, the result from pd.read_excel with dtype=str is inconsistent with that from pd.read_csv. The value of placing a np.nan instead of the string representation is that you can use pd.isna, which does not work for 'nan'. So the thought is to make read_excel consistent with read_csv.

@nikoskaragiannakis

This comment has been minimized.

Copy link
Contributor

commented Mar 20, 2018

@cbertinato @arnau126 maybe my question wasn't very clear.
I suggested changing astype_unicode and astype_str (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L460-L483), which would mean that 'nan' would always become np.nan.

If we only want to change the read_excel function, then we could just substitute all the 'nan' to np.nan before we return the result to the user. Not very elegant, but it would only affect read_excel.

Which one do we want here?

Sorry if I misunderstood what you guys are saying.

@cbertinato

This comment has been minimized.

Copy link
Contributor

commented Mar 20, 2018

I think that the latter would be safer and less likely to break something else. This is clearly only an issue for read_excel and not read_csv.

@nikoskaragiannakis

This comment has been minimized.

Copy link
Contributor

commented Mar 20, 2018

It is clear now, thanks. I'll do the changes.

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

@jreback jreback added this to the 0.23.0 milestone Apr 5, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 8, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 8, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 8, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 8, 2018

@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 15, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 15, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 15, 2018

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Oct 15, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 16, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 16, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 17, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 18, 2018

jreback added a commit that referenced this issue Oct 18, 2018

brute4s99 added a commit to brute4s99/pandas that referenced this issue Nov 19, 2018

@yoonghm

This comment has been minimized.

Copy link

commented Dec 11, 2018

Treating an empty value in Excel as nan has another side effect: integer will be converted to float. Subsequent operation on that columns will have other effects again.

Also read_excel() does not honor empty value handling offered by function in converters:

I have an Excel file temp.xlsx with the following data:

Key1,Key2,Key3,Key4
0,11.0,  Apple  ,1.12
1,12.0,,1.02
2,13.0,  Orange,
3,  Banana  ,0.01

This is the code:

import numpy as np
import pandas as pd

def handle_string(value):
    return value.replace(' ', '')

def handle_integer(value):
    if value == '':
        return 0
    else:
        int(value)

def handle_float(value):
    if value == '':
        return 0.0
    else:
        float(value)

df = pd.read_excel(
        'temp.xlsx',
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")

print('')

df = pd.read_excel(
        'temp.xlsx',
        converters={\
            'Key1' : handle_integer,
            'Key2' : handle_integer,
            'Key3' : handle_string,
            'Key4' : handle_float,
        }
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")

The output:

Key1 Key2 Key3 Key4
0 0 11.0 Apple 1.12
1 1 12.0 NaN 1.02
2 2 13.0 Orange NaN
3 3 NaN Banana 0.01
type(df.loc[3,'Key2']) = <class 'numpy.float64'>
type(df.loc[1,'Key3']) = <class 'float'>
type(df.loc[2,'Key4']) = <class 'numpy.float64'>

Key1 Key2 Key3 Key4
0 None NaN Apple NaN
1 None NaN NaN NaN
2 None NaN Orange 0.0
3 None 0.0 Banana NaN
type(df.loc[3,'Key2']) = <class 'numpy.float64'>
type(df.loc[1,'Key3']) = <class 'float'>
type(df.loc[2,'Key4']) = <class 'numpy.float64'>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.