Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: update feather IO for pyarrow 0.17 / Feather V2 #33422

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Apr 9, 2020

Upcoming pyarrow 0.17 release will include an upgraded feather format.

This PRs updates pandas for that, more specifically ensures the new keywords can be passed through (the basics should keep working out of the box, since the public API did not change), and small update to the tests

@jorisvandenbossche jorisvandenbossche added the IO Parquet parquet, feather label Apr 9, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Apr 9, 2020
@@ -102,8 +104,8 @@ def test_read_columns(self):

def test_unsupported_other(self):

# period
df = pd.DataFrame({"a": pd.period_range("2013", freq="M", periods=3)})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since feather now exactly maps to the Arrow memory, periods are now supported (since Period is supported in the pandas->pyarrow.Table conversion)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that answers my question above, never mind

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a whatsnew note, and a small comment on the docs in io.rst. LGTM otherwise.

* The format will NOT write an ``Index``, or ``MultiIndex`` for the
``DataFrame`` and will raise an error if a non-default one is provided. You
can ``.reset_index()`` to store the index or ``.reset_index(drop=True)`` to
ignore it.
* Duplicate column names and non-string columns names are not supported
* Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
* Non supported types actual Python object types. These will raise a helpful error message
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a word here. Maybe

Suggested change
* Non supported types actual Python object types. These will raise a helpful error message
* object-dtype columns are not supported. This will raise with a helpful error message

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my edification, does this mean that PeriodIndex or Series[Period] is supported? If so, is that a change from the older version?

@@ -2058,18 +2058,24 @@ def to_stata(
writer.write_file()

@deprecate_kwarg(old_arg_name="fname", new_arg_name="path")
def to_feather(self, path) -> None:
def to_feather(self, path, **kwargs) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to make these explicit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to make these explicit?

It might change with the pyarrow version, needing us to each time update if other keywords get added. Passing through kwargs makes this more "future-robust".

But I could make the ones that there are now explicit. However, that also means that we need to check the pyarrow version to give a nice error message to say which keyword is not yet supported with the older pyarrow versions (which is of course not that difficult)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. maybe a whatsnew note, otherwise merge when ready.

@jorisvandenbossche jorisvandenbossche merged commit a00202d into pandas-dev:master Apr 10, 2020
@jorisvandenbossche jorisvandenbossche deleted the feather-update branch April 10, 2020 12:41
@noklam
Copy link

noklam commented Jun 21, 2020

Hi, is feather v2 now supported by pandas? It seems tag 1.0.4 but I cannot find it in the release note, thanks!

the current to_feather() in pandas 1.0.5 seems does not support the compression that was introduced in feather V2 as well.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_feather.html

@jreback
Copy link
Contributor

jreback commented Jun 21, 2020

this is in 1.1

@jorisvandenbossche
Copy link
Member Author

@noklam Feather V2 is already supported by pandas 1.0.4, as long as you have pyarrow>=0.17 installed.
But only the default behaviour, so no additional keywords (that is what this PR is enabling: correctly passing through keywords such as compression)

@noklam
Copy link

noklam commented Jun 21, 2020

Thanks! Got it. Looking forward to release 1.1 😀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants