BUG: Empty dataframe with valid index object is not being read correctly via parquet reader #37897

Closed
2 of 3 tasks
galipremsagar opened this issue Nov 16, 2020 · 5 comments
Labels
Bug, IO Parquet

Comments

@galipremsagar

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

In[38]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1))
In[39]: df.to_parquet('a')
In[40]: pd.read_parquet('a')
Out[40]: 
Empty DataFrame
Columns: []
Index: []
In[41]: df.to_parquet('a', index=True)
In[42]: pd.read_parquet('a')
Out[42]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In[43]: pd.read_parquet('a').index
Out[43]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

Problem description

When index is None or True, the parquet reader should return the original RangeIndex. Instead it returns an empty index (with the default index=None) or an Int64Index (with index=True).

Expected Output

Reading the parquet file a back should return a DataFrame with the original RangeIndex.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 67a3d42
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-53-generic
Version : #59-Ubuntu SMP Wed Oct 21 09:38:44 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.2
hypothesis : 5.41.1
sphinx : 3.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.4
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2

This could be related to #37896, but it appears the parquet reader is also not able to retrieve any Index at all when index is None. Hence filing this as a separate issue.

@galipremsagar added the Bug and Needs Triage labels Nov 16, 2020
@jreback
Contributor

jreback commented Nov 16, 2020

likely a pyarrow issue
cc @jorisvandenbossche

@jorisvandenbossche added the IO Parquet label and removed the Needs Triage label Nov 16, 2020
@jorisvandenbossche
Member

(thanks for the ping, will take a look later this week)

@jorisvandenbossche
Member

This is indeed a pyarrow issue, but it is not specific to Parquet: the pandas <-> pyarrow roundtrip already fails for this corner case:

In [33]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1))

In [34]: df
Out[34]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [35]: df.shape
Out[35]: (10, 0)

In [36]: table = pa.table(df)

In [37]: table.to_pandas()
Out[37]: 
Empty DataFrame
Columns: []
Index: []

In [38]: table.to_pandas().shape
Out[38]: (0, 0)

I opened https://issues.apache.org/jira/browse/ARROW-10643 for this; contributions to fix it are always welcome!

Since it is an issue on the pyarrow side, closing this.

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 18, 2020
@wence-
Contributor

wence- commented Nov 25, 2022

While the underlying pyarrow issue was fixed, the bug in the original report persists: to_parquet/read_parquet do not roundtrip a RangeIndex correctly. I suspect this is because to_parquet doesn't preserve range indices:

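from io import BytesIO

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq  # import the submodule explicitly

buf = BytesIO()

df = pd.DataFrame({"a": [1, 2, 3]}, index=pd.RangeIndex(0, 3))
df.to_parquet(buf, index=True)
buf.seek(0)  # rewind before reading the file back
table = pq.read_table(buf)

# Direct pandas -> pyarrow conversion, for comparison
pa_table = pa.table(df)

assert table == pa_table  # fails: the comparison is False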
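One possible reading of the mismatch, offered as a sketch rather than a confirmed diagnosis: pyarrow records pandas index information in the `index_columns` entry of the schema's "pandas" metadata. By default a RangeIndex is stored there as a range descriptor, whereas an index forced into the file as a column is referenced by its column name. The snippet below uses only the standard library and two hand-written, trimmed metadata samples (`METADATA_RANGE`, `METADATA_COLUMN`, and `describe_index` are illustrative names, not pyarrow API) to show how the two forms differ.

```python
import json

# Hand-written, trimmed examples of the "pandas" schema metadata.
# Real metadata has more keys; only index_columns matters here.
METADATA_RANGE = json.dumps(
    {"index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}
    ]}
)
METADATA_COLUMN = json.dumps({"index_columns": ["__index_level_0__"]})


def describe_index(pandas_metadata_json: str) -> list:
    """Turn each index_columns entry into a Python range (metadata-only
    RangeIndex) or leave it as the name of a physical index column."""
    meta = json.loads(pandas_metadata_json)
    described = []
    for entry in meta["index_columns"]:
        if isinstance(entry, dict) and entry.get("kind") == "range":
            described.append(
                range(entry["start"], entry["stop"], entry["step"])
            )
        else:
            described.append(entry)
    return described


print(describe_index(METADATA_RANGE))   # [range(0, 3)]
print(describe_index(METADATA_COLUMN))  # ['__index_level_0__']
```

On a real table, the same structure can be inspected via `table.schema.pandas_metadata`. If the two tables in the comparison above differ in this entry, that would be consistent with the suspicion that `index=True` stores the RangeIndex as a plain column.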

@wence-
Contributor

wence- commented Mar 22, 2024

The Parquet case is apache/arrow#40743.
