
BUG: uint32 is not being preserved while round-tripping through parquet file #37327

Open
2 of 3 tasks
galipremsagar opened this issue Oct 22, 2020 · 8 comments
Assignees
Labels
Docs IO Parquet parquet, feather

Comments

@galipremsagar

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

In [42]: df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint64")})
In [43]: df.to_parquet('a')
In [44]: pd.read_parquet('a').dtypes
Out[44]:
a    uint64
dtype: object
In [45]: df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
In [46]: df.to_parquet('a')
In [47]: pd.read_parquet('a').dtypes
Out[47]:
a    int64
dtype: object

Problem description

It appears that uint32 is not preserved while round-tripping through a parquet file, whereas uint64 is.

Expected Output

Preserve the uint32 dtype.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : db08276
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-52-generic
Version : #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.1
hypothesis : 5.37.3
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.4
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2

Crosslinking to cudf fuzz-testing for tracking purpose: rapidsai/cudf#6001

@galipremsagar galipremsagar added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2020
@allenmac347

allenmac347 commented Nov 10, 2020

take

Hello, I am a first-time contributor and I'm willing to examine this further!

@allenmac347

@galipremsagar Hi, after further testing it seems that uint32 is preserved when using 'fastparquet' as the engine for to_parquet and read_parquet. However, the closed issue #31896 seems to acknowledge this behavior, and a fix was merged into the main branch that makes pandas interpret the written uint32 data as int64 data. I was wondering whether you think this could be expected behavior, or whether it should still be considered an ongoing issue?

@galipremsagar
Author

@allenmac347 I think it'd still be considered an issue; we'd probably need a fix similar to #31918.

@allenmac347

@jorisvandenbossche Hey, I noticed that in issue #31896, which you fixed, you said that parquet does not seem to be able to store uint32. Would you happen to know more about this, and whether it is an issue with pyarrow or with pandas? Thanks!

@allenmac347

@phofl Hi phofl. I'm currently trying to debug this issue, but it seems like this might be an external problem with pyarrow. Here's some interesting behavior I found:

# The datatype uint32 is preserved here
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='fastparquet')
dataframe = pd.read_parquet('a', engine='fastparquet')

# The datatype uint32 is preserved here
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='fastparquet')
dataframe = pd.read_parquet('a', engine='pyarrow')

# The datatype uint32 is read in as an int64 here
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='pyarrow')
dataframe = pd.read_parquet('a', engine='fastparquet')

I feel like this means there's something wrong with how pyarrow writes uint32 to a file. I was wondering if you have any suggestions? I've tried using the pandas metadata of the parquet file to convert the dataframe back to uint32 after reading it in, but that made a lot of test cases fail.

@phofl
Member

phofl commented Dec 19, 2020

Unfortunately I am not that familiar with this area. Do you know, or could you find out, who implemented the pyarrow engine?

@jorisvandenbossche jorisvandenbossche added Upstream issue Issue related to pandas dependency IO Parquet parquet, feather and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 3, 2021
@jorisvandenbossche
Member

Sorry for the slow reply here. This is not directly related to #31896 (that was a bug in the conversion on our side, specifically for nullable dtypes); it is actually a limitation of pyarrow.

You can specify version="2.0", and then pyarrow will use additional type annotations in the parquet file, in which case it can actually preserve uint32. But by default it indeed does not.

So there is nothing to do on the pandas side about it (apart from maybe better documenting this). A similar issue about this on the pyarrow side is https://issues.apache.org/jira/browse/ARROW-9215

@mroeschke mroeschke added Docs and removed Upstream issue Issue related to pandas dependency labels Aug 13, 2021
@miccoli
Contributor

miccoli commented Mar 18, 2022

While googling for solving this very same problem, I found this very useful thread.

Let me just add that version="2.0" is deprecated now, use instead "2.4" or "2.6".

I understand that it may sound obvious, but adding an explicit link in the pandas docs on where to look for the additional **kwargs to be passed to the underlying engine could be useful. In fact it took me a while to figure out that the relevant docs for pyarrow are in pyarrow.parquet.ParquetWriter.
