to_csv() and read_csv() do not preserve dtype for a column with integer values but whose original dtype is object. #27749

Closed
svadali1 opened this issue Aug 5, 2019 · 2 comments
Labels
IO Data IO issues that don't fit into a more specific label Usage Question

Comments


svadali1 commented Aug 5, 2019

Code Sample, a copy-pastable example if possible

import csv

import pandas as pd

data_dict = {'visitor_id': [123, 456],
             'name': ['John Doe', 'Jane Doe']}
data_df = pd.DataFrame(data_dict)
data_df['visitor_id'] = data_df['visitor_id'].astype(str)
# Original dtype for visitor_id is object
print(data_df.dtypes)

# dtype for visitor_id is int64 when the data file is read back using read_csv()
data_df.to_csv('./data_file.csv', index=False)
read_data_df = pd.read_csv('./data_file.csv')
print(read_data_df.dtypes)

# dtype for visitor_id is float64 when the data file is written and read back
# with csv.QUOTE_NONNUMERIC
data_df.to_csv('./data_file.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)
read_data_df = pd.read_csv('./data_file.csv', quoting=csv.QUOTE_NONNUMERIC)
print(read_data_df.dtypes)

Problem description

When a Pandas dataframe has a column with integer values but whose dtype is actually object, using to_csv() and then reading back the csv file using read_csv() does not preserve the column's dtype.
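
A minimal sketch of why the round trip loses the dtype, assuming default `to_csv()` settings (the names `as_int` and `as_str` are only for illustration): an integer column and the same values cast to strings serialize to identical CSV text, so `read_csv()` has nothing in the file that tells the two apart.

```python
import pandas as pd

# The same values, once as integers and once as strings.
as_int = pd.DataFrame({'visitor_id': [123, 456]})
as_str = as_int.astype(str)

# With no path argument, to_csv() returns the CSV text as a string.
int_text = as_int.to_csv(index=False)
str_text = as_str.to_csv(index=False)

# Both frames produce the same text, so the original dtype is not recorded.
print(int_text == str_text)
```

Since the file contents are the same in both cases, any dtype seen after `read_csv()` comes from type inference at read time, not from the original frame.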

I have not seen a similar issue posted before (I may have missed it in my search). I am on Pandas 0.23, and upgrading to pandas 0.25 does not solve the issue.

Expected Output

The dtype for a column should be preserved for columns which have integer values but with dtype object.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: 4.3.1
pip: 19.0.1
setuptools: 40.7.3
Cython: 0.28.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: 0.11.1
xarray: None
IPython: 7.2.0
sphinx: 1.7.4
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.4
lxml.etree: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.18
pymysql: 0.9.3
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.2.1
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@rileyschack

You cannot preserve dtypes with a CSV. This isn't a bug in pandas, but a limitation of the CSV format. Try using Parquet or HDF5 if you want dtypes preserved.
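
If staying with CSV is a requirement, one possible workaround (a sketch, not something proposed in this thread) is to tell `read_csv()` the dtype explicitly through its `dtype` argument rather than relying on inference. The in-memory `buffer` stands in for the data file and is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

data_df = pd.DataFrame({'visitor_id': ['123', '456'],
                        'name': ['John Doe', 'Jane Doe']})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
data_df.to_csv(buffer, index=False)
buffer.seek(0)

# Ask read_csv() to keep visitor_id as strings instead of inferring integers.
read_data_df = pd.read_csv(buffer, dtype={'visitor_id': str})
print(read_data_df.dtypes)
print(read_data_df['visitor_id'].tolist())
```

The values come back as Python strings; whether the column reports `object` or a dedicated string dtype may depend on the pandas version. The caller still has to know the intended dtype up front, because the CSV file itself does not carry it.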

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Aug 8, 2019
@jorisvandenbossche jorisvandenbossche added the IO Data IO issues that don't fit into a more specific label label Aug 8, 2019
@jorisvandenbossche
Member

@rileyschack is correct. CSV is in general not a good format if you want exact type preservation in roundtrips.

There seems to be something strange with the handling of csv.QUOTE_NONNUMERIC giving float vs int. If you want, you can open a specific issue about that. But I am going to close this one, as this type preservation is not something we can guarantee in general with CSV.
