Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: dtype ignored in pd.read_excel #39898

Closed
mvacher opened this issue Feb 18, 2021 · 6 comments
Closed

BUG: dtype ignored in pd.read_excel #39898

mvacher opened this issue Feb 18, 2021 · 6 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO Excel read_excel, to_excel Usage Question

Comments

@mvacher
Copy link

mvacher commented Feb 18, 2021

I'm trying to read an excel file with pd.read_excel().
The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype.

Example of the excel file
test1.xlsx

When using the dtype or the converters arguments the dtype of the columns remain the same

df = pd.read_excel('xls_test.xlsx',
                   dtype={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object
df.head()
Date	Time
0	2020-03-08 00:00:00	10:00:00
1	2020-03-09 00:00:00	11:00:00
2	2020-03-10 00:00:00	12:00:00
3	2020-03-11 00:00:00	13:00:00
4	2020-03-12 00:00:00	14:00:00

Same thing when using converters

df = pd.read_excel('xls_test.xlsx',
                   converters={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object
df.head()
Date	Time
0	2020-03-08 00:00:00	10:00:00
1	2020-03-09 00:00:00	11:00:00
2	2020-03-10 00:00:00	12:00:00
3	2020-03-11 00:00:00	13:00:00
4	2020-03-12 00:00:00	14:00:00

I have also tried to use other engine but the result is always the same.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 7d32926db8f7541c356066dcadabf854487738de
python           : 3.8.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.2.0
Version          : Darwin Kernel Version 20.2.0: Wed Dec  2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_AU.UTF-8
LOCALE           : en_AU.UTF-8

pandas           : 1.2.2
numpy            : 1.20.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 50.3.1.post20201107
Cython           : 0.29.21
pytest           : 6.1.1
hypothesis       : None
sphinx           : 3.2.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.7
lxml.etree       : 4.6.1
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.8.3
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.6
pandas_gbq       : None
pyarrow          : None
pyxlsb           : 1.0.8
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : 1.3.20
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.52.0

This seems too obvious to be an unreported bug and I apologise if the problem comes from my side.

@mvacher mvacher added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 18, 2021
@attack68
Copy link
Contributor

You can read a little about this here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-types

If you feel this addresses the issue please close or revert back.

@phofl
Copy link
Member

phofl commented Feb 19, 2021

As @attack68 mentioned, we do not support str, we support object dtype or the StringDtype in DataFrames

@phofl phofl added Dtype Conversions Unexpected or buggy dtype conversions Usage Question and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 19, 2021
@jreback
Copy link
Contributor

jreback commented Feb 20, 2021

this is actually a bit trickier

converters are applied first to the raw fields

then dates are parsed by default ; since these are strings already

so this is as designed and documented -

@mvacher
Copy link
Author

mvacher commented Feb 20, 2021

Thanks a lot for your replies, I was not aware of this update and the use of StringDtype.

The correct dtype seems to be applied now:

df = pd.read_excel(test1.xlsx',
                   dtype={'Date':"string",
                          'Time':"string"})
print(df.dtypes)
df.head()
Date    string
Time    string
dtype: object
Date	Time
0	2020-03-08 00:00:00	10:00:00
1	2020-03-09 00:00:00	11:00:00
2	2020-03-10 00:00:00	12:00:00
3	2020-03-11 00:00:00	13:00:00
4	2020-03-12 00:00:00	14:00:00

However it's strange not to have consistent results between read_excel and read_csv.
Here is the results (same file but in csv) using the read_csv

df = pd.read_csv(test1.csv',
                   dtype={'Date':"string",
                          'Time':"string"})
print(df.dtypes)
df.head()
Date    string
Time    string
dtype: object
Date	Time
0	8/3/20	10:00
1	9/3/20	11:00
2	10/3/20	12:00
3	11/3/20	13:00
4	12/3/20	14:00

How can I get the same output using read_excel ? (date and time showing as they are in the original excel file)

@nmartensen
Copy link
Contributor

How can I get the same output using read_excel ?

I don't think you can, due to the way date and time are stored in XLSX. The excel reader engines in read_excel give you the correct time and date values, but they do not need to parse date strings for that (in the most common case). You can unzip your test file and look into the sheet XML to see how the values are stored.

@mroeschke
Copy link
Member

Since this behavior is as designed, appears that there's nothing to address in pandas so closing as a usage question

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO Excel read_excel, to_excel Usage Question
Projects
None yet
Development

No branches or pull requests

6 participants