ENH: DataFrame.to_parquet() returns bytes if path_or_buf not provided #37129

Merged
17 commits merged into pandas-dev:master on Oct 21, 2020

Conversation

arw2019
Member

@arw2019 arw2019 commented Oct 15, 2020

Not sure if this is an API-breaking change. Prior to this patch, path was a required positional argument; here it becomes optional. This makes it consistent with, for example, the csv writer.
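For comparison, the csv writer convention referred to above can be sketched as follows. This is an illustrative snippet only: the frame contents and column names are made up, and it assumes a working pandas installation.

```python
import pandas as pd

# Hypothetical example frame; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# With no path_or_buf argument, to_csv returns the CSV text as a str
# instead of writing to a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

The change in this PR aims for the same shape of API for parquet, except that the returned value is bytes rather than str.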

@arw2019 arw2019 marked this pull request as draft October 15, 2020 06:04
@arw2019 arw2019 marked this pull request as ready for review October 15, 2020 14:17
doc/source/whatsnew/v1.2.0.rst (outdated, resolved)

if path is None:
    path = io.BytesIO()

return impl.write(
Contributor

you also need to return path.getvalue(), no? (if path was None in the first place).

Member Author

yes! fixed that
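The pattern discussed in this thread (write into an in-memory buffer when no target is given, then return the buffer's contents) can be sketched with the standard library alone. The function `write_records` and its arguments are hypothetical and are not part of pandas; this is not the code from the PR.

```python
import io


def write_records(records, buf=None):
    """Hypothetical writer illustrating the 'return bytes if no target' pattern."""
    # Remember whether the caller supplied a target before substituting a buffer.
    target = io.BytesIO() if buf is None else buf

    for record in records:
        target.write(record.encode("utf-8") + b"\n")

    # Only return the contents when we created the buffer ourselves;
    # otherwise the caller already holds the object that was written to.
    if buf is None:
        return target.getvalue()
    return None
```

Calling `write_records(["a", "b"])` returns the written bytes, while passing an explicit file-like object returns None and leaves the data in that object.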

def test_to_bytes_without_path_or_buf_provided(self):
    # GH 37105
    df = pd.DataFrame()
    df.to_parquet()
Contributor

check one of the test dataframes, and test that the round-trip works as well; alternatively you can write it to a file and check that it's the same.

Member Author

Done

pandas/io/parquet.py (resolved)

@@ -194,6 +194,7 @@ Other enhancements
- Added :meth:`Rolling.sem()` and :meth:`Expanding.sem()` to compute the standard error of mean (:issue:`26476`).
- :meth:`Rolling.var()` and :meth:`Rolling.std()` use Kahan summation and Welfords Method to avoid numerical issues (:issue:`37051`)
- :meth:`DataFrame.plot` now recognizes ``xlabel`` and ``ylabel`` arguments for plots of type ``scatter`` and ``hexbin`` (:issue:`37001`)
- :meth:`DataFrame.to_parquet` now writes to ``io.Bytes`` when no ``path`` argument is passed (:issue:`37105`)
Contributor

no, this returns a bytes object

Member Author

right. Fixed

If a string, it will be used as Root Directory path
when writing a partitioned dataset. By file-like object,
we refer to objects with a write() method, such as a file handle
(e.g. via builtin open function) or io.BytesIO. The engine
- fastparquet does not accept file-like objects.
+ fastparquet does not accept file-like objects. If path is None,
+ frame is written to an io.BytesIO object and a bytes object with
Contributor

revise; the io.Bytes is a detail that is not important, it's the return of bytes that is.

Member Author

Fixed

- fastparquet does not accept file-like objects.
+ fastparquet does not accept file-like objects. If path is None,
+ frame is written to an io.BytesIO object and a bytes object with
+ the contents of the buffer is returned.
Contributor

add a versionchanged 1.2 (return bytes)

Member Author

Done both

pandas/io/parquet.py (resolved)
# GH 37105

buf = df_full.to_parquet()

Contributor

assert that buf is bytes

Member Author

Done

@arw2019 arw2019 force-pushed the GH37105 branch 4 times, most recently from 1518ad8 to 0bd0f47 Compare October 20, 2020 06:18
@jreback jreback added this to the 1.2 milestone Oct 20, 2020
@jreback
Contributor

jreback commented Oct 20, 2020

lgtm. can you merge master and ping on green just to make sure.

@jorisvandenbossche if any comments.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good! Small comment on the test


with tm.ensure_clean() as path:
    with open(path, "wb") as f:
        f.write(buf)
Member

I would maybe rather test that you can directly read the bytes again (by wrapping it in a BytesIO?) instead of writing the bytes to a file.

Member Author

Done

@jorisvandenbossche jorisvandenbossche merged commit ac7ca23 into pandas-dev:master Oct 21, 2020
@jorisvandenbossche
Member

Thanks @arw2019 !

Labels
IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: df.to_parquet() should return bytes
3 participants