Support non-default indexes in to_parquet #18581

Closed
dhirschfeld opened this Issue Nov 30, 2017 · 2 comments

Contributor

dhirschfeld commented Nov 30, 2017

Calling to_parquet on a DataFrame with a non-default index results in the error below:

ValueError: parquet does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)

While you can work around this by calling reset_index() as the message says, doing so loses the information about which columns made up the index, so you can't round-trip a DataFrame with a non-default index.
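
To make the round-trip problem concrete, here is a minimal in-memory sketch. It skips the actual parquet write and read and only shows what reset_index() does to the frame. The index name "key" and the variable names are illustrative assumptions, not something taken from the error report.

```python
import pandas as pd

# A DataFrame with a named, non-default index ("key" is a made-up name).
df = pd.DataFrame({"A": list("abc")}, index=pd.Index([2, 3, 4], name="key"))

# The workaround from the error message: the index becomes an ordinary column
# and the frame gets a fresh default RangeIndex.
flat = df.reset_index()

# Nothing in `flat` records that "key" used to be the index.
index_was_lost = (
    list(flat.columns) == ["key", "A"]
    and flat.index.equals(pd.RangeIndex(len(flat)))
)

# Restoring the original frame only works if the caller remembers the name
# of the former index column and re-applies it by hand.
restored = flat.set_index("key")
round_trips = restored.equals(df)
```

The manual set_index("key") step is exactly the out-of-band knowledge that a reader of the file would not have.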

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.5
pip: 9.0.1
setuptools: 37.0.0
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.4.4

Contributor

jreback commented Dec 1, 2017

If I remove the checking code (this is with pyarrow 0.7.1)

In [2]: df = pd.DataFrame({'A': list('abc')}, index=[2, 3, 4])

In [3]: df
Out[3]: 
   A
2  a
3  b
4  c

In [4]: df.to_parquet('foo.parquet')

In [5]: pd.read_parquet('foo.parquet')
Out[5]: 
   A
2  a
3  b
4  c

diff --git a/pandas/io/parquet.py b/pandas/io/parquet.py
index 4a13d2c9d..b7cce3ae7 100644
--- a/pandas/io/parquet.py
+++ b/pandas/io/parquet.py
@@ -147,28 +147,6 @@ def to_parquet(df, path, engine='auto', compression='snappy', **kwargs):
 
     valid_types = {'string', 'unicode'}
 
-    # validate index
-    # --------------
-
-    # validate that we have only a default index
-    # raise on anything else as we don't serialize the index
-
-    if not isinstance(df.index, Int64Index):
-        raise ValueError("parquet does not support serializing {} "
-                         "for the index; you can .reset_index()"
-                         "to make the index into column(s)".format(
-                             type(df.index)))
-
-    if not df.index.equals(RangeIndex.from_range(range(len(df)))):
-        raise ValueError("parquet does not support serializing a "
-                         "non-default index for the index; you "
-                         "can .reset_index() to make the index "
-                         "into column(s)")
-
-    if df.index.name is not None:
-        raise ValueError("parquet does not serialize index meta-data on a "
-                         "default index")
-
     # validate columns
     # ----------------
 

We support pyarrow >= 0.4.1. I don't remember exactly when index support was added (and it had a bug or two), but we could check conditionally (we already have compat code for pyarrow < 0.5.0 and < 0.6.0 for other items). Alternatively, bumping the minimum to 0.6.0 is OK too.
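
A rough sketch of what such a conditional check could look like, using only the standard library. The cutoff version is a placeholder, since the release that first handled indexes correctly is not stated here, and parse_version, engine_supports_index and check_index are hypothetical helper names rather than pandas or pyarrow APIs.

```python
# Placeholder cutoff: the real version that first supported indexes is not
# known from this thread, so treat this value as an assumption.
MIN_INDEX_VERSION = (0, 7, 0)


def parse_version(version):
    """Turn '0.7.1' into (0, 7, 1); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '0.7' so tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def engine_supports_index(version, minimum=MIN_INDEX_VERSION):
    """Return True if the engine version is at or above the cutoff."""
    return parse_version(version) >= minimum


def check_index(index_is_default, engine_version):
    """Raise only for a non-default index on an engine that is too old."""
    if index_is_default or engine_supports_index(engine_version):
        return
    raise ValueError(
        "this engine version does not support serializing a non-default "
        "index; you can .reset_index() to make the index into column(s)"
    )
```

With a check along these lines, the existing error would still be raised for older engines, while newer ones would be allowed to write the index.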

@dhirschfeld would love a PR.

cc @cpcloud @wesm

@jreback jreback added this to the Next Major Release milestone Dec 1, 2017

Contributor

dhirschfeld commented Dec 1, 2017

Seems a simple fix! Will see about putting in a PR shortly...

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 4, 2017

Allow non-default indexes in to_parquet.
...when supported by the underlying engine.
Fixes pandas-dev#18581

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Dec 5, 2017

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 5, 2017

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 6, 2017

Allow non-default indexes in to_parquet.
...when supported by the underlying engine.
Fixes pandas-dev#18581

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 9, 2017

Allow non-default indexes in to_parquet.
...when supported by the underlying engine.
Fixes pandas-dev#18581

@jreback jreback modified the milestones: 0.22.0, 0.21.1 Dec 11, 2017