BUG: dtype ignored in pd.read_excel #39898

mvacher · 2021-02-18T23:52:45Z

I'm trying to read an excel file with pd.read_excel().
The excel file has 2 columns Date and Time and I want to read both columns as str not the excel dtype.

Example of the excel file
test1.xlsx

When using the dtype or the converters arguments the dtype of the columns remain the same

df = pd.read_excel('xls_test.xlsx',
                   dtype={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object

df.head()
Date	Time
0	2020-03-08 00:00:00	10:00:00
1	2020-03-09 00:00:00	11:00:00
2	2020-03-10 00:00:00	12:00:00
3	2020-03-11 00:00:00	13:00:00
4	2020-03-12 00:00:00	14:00:00

Same thing when using converters

df = pd.read_excel('xls_test.xlsx',
                   converters={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object

df.head()
Date	Time
0	2020-03-08 00:00:00	10:00:00
1	2020-03-09 00:00:00	11:00:00
2	2020-03-10 00:00:00	12:00:00
3	2020-03-11 00:00:00	13:00:00
4	2020-03-12 00:00:00	14:00:00

I have also tried to use other engine but the result is always the same.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 7d32926db8f7541c356066dcadabf854487738de
python           : 3.8.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.2.0
Version          : Darwin Kernel Version 20.2.0: Wed Dec  2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_AU.UTF-8
LOCALE           : en_AU.UTF-8

pandas           : 1.2.2
numpy            : 1.20.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 50.3.1.post20201107
Cython           : 0.29.21
pytest           : 6.1.1
hypothesis       : None
sphinx           : 3.2.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.7
lxml.etree       : 4.6.1
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.8.3
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.6
pandas_gbq       : None
pyarrow          : None
pyxlsb           : 1.0.8
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : 1.3.20
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.52.0

This seems too obvious to be an unreported bug and I apologise if the problem comes from my side.

The text was updated successfully, but these errors were encountered:

attack68 · 2021-02-19T06:28:32Z

You can read a little about this here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-types

If you feel this addresses the issue please close or revert back.

phofl · 2021-02-19T20:31:20Z

As @attack68 mentioned, we do not support str, we support object dtype or the StringDtype in DataFrames

jreback · 2021-02-20T03:28:03Z

this is actually a bit trickier

converters are applied first to the raw fields

then dates are parsed by default ; since these are strings already

so this is as designed and documented -

mvacher · 2021-02-20T07:43:04Z

Thanks a lot for your replies, I was not aware of this update and the use of StringDtype.

The correct dtype seems to be applied now:

df = pd.read_excel(test1.xlsx',
                   dtype={'Date':"string",
                          'Time':"string"})
print(df.dtypes)
df.head()

Date    string
Time    string
dtype: object
Date	Time
0	2020-03-08 00:00:00	10:00:00
1	2020-03-09 00:00:00	11:00:00
2	2020-03-10 00:00:00	12:00:00
3	2020-03-11 00:00:00	13:00:00
4	2020-03-12 00:00:00	14:00:00

However it's strange not to have consistent results between read_excel and read_csv.
Here is the results (same file but in csv) using the read_csv

df = pd.read_csv(test1.csv',
                   dtype={'Date':"string",
                          'Time':"string"})
print(df.dtypes)
df.head()

Date    string
Time    string
dtype: object
Date	Time
0	8/3/20	10:00
1	9/3/20	11:00
2	10/3/20	12:00
3	11/3/20	13:00
4	12/3/20	14:00

How can I get the same output using read_excel ? (date and time showing as they are in the original excel file)

nmartensen · 2021-02-22T18:42:39Z

How can I get the same output using read_excel ?

I don't think you can, due to the way date and time are stored in XLSX. The excel reader engines in read_excel give you the correct time and date values, but they do not need to parse date strings for that (in the most common case). You can unzip your test file and look into the sheet XML to see how the values are stored.

mroeschke · 2021-08-15T18:31:13Z

Since this behavior is as designed, appears that there's nothing to address in pandas so closing as a usage question

mvacher added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 18, 2021

phofl added Dtype Conversions Unexpected or buggy dtype conversions Usage Question and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 19, 2021

mroeschke closed this as completed Aug 15, 2021

mroeschke added the IO Excel read_excel, to_excel label Aug 15, 2021

smcclure17 mentioned this issue Nov 15, 2021

Scraper fixes and changes to requirements.txt act-now-coalition/can-scrapers#372

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: dtype ignored in pd.read_excel #39898

BUG: dtype ignored in pd.read_excel #39898

mvacher commented Feb 18, 2021

attack68 commented Feb 19, 2021

phofl commented Feb 19, 2021

jreback commented Feb 20, 2021 •

edited

mvacher commented Feb 20, 2021

nmartensen commented Feb 22, 2021

mroeschke commented Aug 15, 2021

BUG: dtype ignored in pd.read_excel #39898

BUG: dtype ignored in pd.read_excel #39898

Comments

mvacher commented Feb 18, 2021

pd.show_versions()

attack68 commented Feb 19, 2021

phofl commented Feb 19, 2021

jreback commented Feb 20, 2021 • edited

mvacher commented Feb 20, 2021

nmartensen commented Feb 22, 2021

mroeschke commented Aug 15, 2021

jreback commented Feb 20, 2021 •

edited