New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass kwargs from read_parquet() to the underlying engines. #18216

Merged
merged 8 commits into from Nov 14, 2017

Conversation

Projects
None yet
7 participants
@Corni
Contributor

Corni commented Nov 10, 2017

This allows e.g. to specify filters for predicate pushdown to fastparquet.
This is a followup to #18155/#18154

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
Cornelius Riemenschneider
Pass kwargs from read_parquet() to the underlying engines.
This allows e.g. to specify filters for predicate pushdown to fastparquet.
@pep8speaks

This comment has been minimized.

Show comment
Hide comment
@pep8speaks

pep8speaks Nov 10, 2017

Hello @Corni! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 14, 2017 at 13:47 Hours UTC

pep8speaks commented Nov 10, 2017

Hello @Corni! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 14, 2017 at 13:47 Hours UTC

Cornelius Riemenschneider added some commits Nov 10, 2017

Cornelius Riemenschneider
Cornelius Riemenschneider
@Corni

This comment has been minimized.

Show comment
Hide comment
@Corni

Corni Nov 10, 2017

Contributor

@hoffmann Is this what you intended?

Contributor

Corni commented Nov 10, 2017

@hoffmann Is this what you intended?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 10, 2017

Contributor

need tests

Contributor

jreback commented Nov 10, 2017

need tests

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 10, 2017

Contributor

@wesm @cpcloud @martindurant

is there a reasonable API that we could expose here to users to facilitate some sort of row group filttering (w/o resorting to a full query language or multiple functions)?

Contributor

jreback commented Nov 10, 2017

@wesm @cpcloud @martindurant

is there a reasonable API that we could expose here to users to facilitate some sort of row group filttering (w/o resorting to a full query language or multiple functions)?

@jreback jreback added the IO Parquet label Nov 10, 2017

@xhochy

This comment has been minimized.

Show comment
Hide comment
@xhochy

xhochy Nov 10, 2017

Contributor

@jreback https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 sounds reasonable. We should be able to implement that in pyarrow quite easily.

Contributor

xhochy commented Nov 10, 2017

@jreback https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 sounds reasonable. We should be able to implement that in pyarrow quite easily.

@xhochy

This comment has been minimized.

Show comment
Hide comment
Contributor

xhochy commented Nov 10, 2017

Cornelius Riemenschneider
Fix wrong tests, which called read_parquet with a compression paramet…
…er (which only makes sense for writes).
@Corni

This comment has been minimized.

Show comment
Hide comment
@Corni

Corni Nov 13, 2017

Contributor

@jreback How do you imagine tests for this feature, which would not rely on testing backend-specific parameters and their behauviour?

Contributor

Corni commented Nov 13, 2017

@jreback How do you imagine tests for this feature, which would not rely on testing backend-specific parameters and their behauviour?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 13, 2017

Contributor

@Corni actually I would have a couple of tests that pass thru kwargs specifically for the backend (IOW separate tests). just to make sure things are passed thru. e.g. write a file with row_groups and exercise reading with row_groups. (the kwargs for this type of predicate pushdown hopefully will be synchronized at some point, but for now you can have tests for each engine separately).

Contributor

jreback commented Nov 13, 2017

@Corni actually I would have a couple of tests that pass thru kwargs specifically for the backend (IOW separate tests). just to make sure things are passed thru. e.g. write a file with row_groups and exercise reading with row_groups. (the kwargs for this type of predicate pushdown hopefully will be synchronized at some point, but for now you can have tests for each engine separately).

@@ -105,7 +105,7 @@ def test_options_py(df_compat, pa):
with pd.option_context('io.parquet.engine', 'pyarrow'):
df.to_parquet(path)
result = read_parquet(path, compression=None)

This comment has been minimized.

@jreback

jreback Nov 13, 2017

Contributor

is there a reason you are removing the kw?

@jreback

jreback Nov 13, 2017

Contributor

is there a reason you are removing the kw?

This comment has been minimized.

@Corni

Corni Nov 13, 2017

Contributor

Yes, compression only exists for writes, as you specify the compression when writing the data, on read you have to un-compress with whatever algorithm was used when writing the file.
Before my patch, the kw was silently dropped, now it caused exceptions, because neither backend uses it.

@Corni

Corni Nov 13, 2017

Contributor

Yes, compression only exists for writes, as you specify the compression when writing the data, on read you have to un-compress with whatever algorithm was used when writing the file.
Before my patch, the kw was silently dropped, now it caused exceptions, because neither backend uses it.

@codecov

This comment has been minimized.

Show comment
Hide comment
@codecov

codecov bot Nov 13, 2017

Codecov Report

Merging #18216 into master will decrease coverage by 0.06%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18216      +/-   ##
==========================================
- Coverage   91.42%   91.36%   -0.07%     
==========================================
  Files         163      164       +1     
  Lines       50068    49884     -184     
==========================================
- Hits        45777    45575     -202     
- Misses       4291     4309      +18
Flag Coverage Δ
#multiple 89.16% <80%> (-0.06%) ⬇️
#single 39.42% <40%> (-1%) ⬇️
Impacted Files Coverage Δ
pandas/io/parquet.py 65.38% <80%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/clipboard/clipboards.py 24.05% <0%> (-2.54%) ⬇️
pandas/tseries/frequencies.py 94.09% <0%> (-1.92%) ⬇️
pandas/plotting/_converter.py 63.44% <0%> (-1.77%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️
pandas/core/categorical.py 95.75% <0%> (-0.05%) ⬇️
pandas/core/indexes/timedeltas.py 91.14% <0%> (-0.04%) ⬇️
pandas/core/dtypes/concat.py 99.13% <0%> (-0.02%) ⬇️
pandas/core/api.py 100% <0%> (ø) ⬆️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9050e38...1c045a5. Read the comment docs.

codecov bot commented Nov 13, 2017

Codecov Report

Merging #18216 into master will decrease coverage by 0.06%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18216      +/-   ##
==========================================
- Coverage   91.42%   91.36%   -0.07%     
==========================================
  Files         163      164       +1     
  Lines       50068    49884     -184     
==========================================
- Hits        45777    45575     -202     
- Misses       4291     4309      +18
Flag Coverage Δ
#multiple 89.16% <80%> (-0.06%) ⬇️
#single 39.42% <40%> (-1%) ⬇️
Impacted Files Coverage Δ
pandas/io/parquet.py 65.38% <80%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/clipboard/clipboards.py 24.05% <0%> (-2.54%) ⬇️
pandas/tseries/frequencies.py 94.09% <0%> (-1.92%) ⬇️
pandas/plotting/_converter.py 63.44% <0%> (-1.77%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️
pandas/core/categorical.py 95.75% <0%> (-0.05%) ⬇️
pandas/core/indexes/timedeltas.py 91.14% <0%> (-0.04%) ⬇️
pandas/core/dtypes/concat.py 99.13% <0%> (-0.02%) ⬇️
pandas/core/api.py 100% <0%> (ø) ⬆️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9050e38...1c045a5. Read the comment docs.

@codecov

This comment has been minimized.

Show comment
Hide comment
@codecov

codecov bot Nov 13, 2017

Codecov Report

Merging #18216 into master will decrease coverage by 0.06%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18216      +/-   ##
==========================================
- Coverage   91.42%   91.36%   -0.07%     
==========================================
  Files         163      164       +1     
  Lines       50068    49880     -188     
==========================================
- Hits        45777    45571     -206     
- Misses       4291     4309      +18
Flag Coverage Δ
#multiple 89.16% <83.33%> (-0.06%) ⬇️
#single 39.42% <33.33%> (-1%) ⬇️
Impacted Files Coverage Δ
pandas/io/parquet.py 65.38% <83.33%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/clipboard/clipboards.py 24.05% <0%> (-2.54%) ⬇️
pandas/tseries/frequencies.py 94.09% <0%> (-1.92%) ⬇️
pandas/plotting/_converter.py 63.44% <0%> (-1.77%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️
pandas/core/categorical.py 95.75% <0%> (-0.05%) ⬇️
pandas/core/indexes/timedeltas.py 91.14% <0%> (-0.04%) ⬇️
pandas/core/indexes/multi.py 96.38% <0%> (-0.02%) ⬇️
pandas/core/dtypes/concat.py 99.13% <0%> (-0.02%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9050e38...edbd937. Read the comment docs.

codecov bot commented Nov 13, 2017

Codecov Report

Merging #18216 into master will decrease coverage by 0.06%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18216      +/-   ##
==========================================
- Coverage   91.42%   91.36%   -0.07%     
==========================================
  Files         163      164       +1     
  Lines       50068    49880     -188     
==========================================
- Hits        45777    45571     -206     
- Misses       4291     4309      +18
Flag Coverage Δ
#multiple 89.16% <83.33%> (-0.06%) ⬇️
#single 39.42% <33.33%> (-1%) ⬇️
Impacted Files Coverage Δ
pandas/io/parquet.py 65.38% <83.33%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/clipboard/clipboards.py 24.05% <0%> (-2.54%) ⬇️
pandas/tseries/frequencies.py 94.09% <0%> (-1.92%) ⬇️
pandas/plotting/_converter.py 63.44% <0%> (-1.77%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️
pandas/core/categorical.py 95.75% <0%> (-0.05%) ⬇️
pandas/core/indexes/timedeltas.py 91.14% <0%> (-0.04%) ⬇️
pandas/core/indexes/multi.py 96.38% <0%> (-0.02%) ⬇️
pandas/core/dtypes/concat.py 99.13% <0%> (-0.02%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9050e38...edbd937. Read the comment docs.

@Corni

This comment has been minimized.

Show comment
Hide comment
@Corni

Corni Nov 13, 2017

Contributor

I included now I included now a test for filtering rowgroups with fastparquet, though there is no kwargs in pyarrow yet which I could include in a test (see https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html), so I'm not testing the pyarrow yet.

Contributor

Corni commented Nov 13, 2017

I included now I included now a test for filtering rowgroups with fastparquet, though there is no kwargs in pyarrow yet which I could include in a test (see https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html), so I'm not testing the pyarrow yet.

@@ -188,18 +188,21 @@ def check_error_on_write(self, df, engine, exc):
with tm.ensure_clean() as path:
to_parquet(df, path, engine, compression=None)
def check_round_trip(self, df, engine, expected=None, **kwargs):
def check_round_trip(self, df, engine, expected=None,

This comment has been minimized.

@hoffmann

hoffmann Nov 14, 2017

Contributor

I would refactor this helper function to have the following signature:

def check_round_trip(self, df, engine, expected=None, write_kwargs=None, read_kwargs=None)

@hoffmann

hoffmann Nov 14, 2017

Contributor

I would refactor this helper function to have the following signature:

def check_round_trip(self, df, engine, expected=None, write_kwargs=None, read_kwargs=None)

This comment has been minimized.

@Corni

Corni Nov 14, 2017

Contributor

Yeah, this is definitly the way to go.
Tests until now only worked because pyarrow.parquet.write_table ignores extra kwargs, and the fastparquet implementation did not pass kwargs through to the write method - else the tests would have failed on the (read-only) parameter columns already.

@Corni

Corni Nov 14, 2017

Contributor

Yeah, this is definitly the way to go.
Tests until now only worked because pyarrow.parquet.write_table ignores extra kwargs, and the fastparquet implementation did not pass kwargs through to the write method - else the tests would have failed on the (read-only) parameter columns already.

@hoffmann

This comment has been minimized.

Show comment
Hide comment
@hoffmann

hoffmann Nov 14, 2017

Contributor

@Corni thanks, just the minor test issue otherwise like I intended it.

Contributor

hoffmann commented Nov 14, 2017

@Corni thanks, just the minor test issue otherwise like I intended it.

@jreback jreback added this to the 0.21.1 milestone Nov 14, 2017

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 14, 2017

Contributor

lgtm. ping on green.

@wesm @xhochy @martindurant

Contributor

jreback commented Nov 14, 2017

lgtm. ping on green.

@wesm @xhochy @martindurant

@xhochy

xhochy approved these changes Nov 14, 2017

Cornelius Riemenschneider
@martindurant

This comment has been minimized.

Show comment
Hide comment
@martindurant

martindurant commented Nov 14, 2017

+1

@Corni

This comment has been minimized.

Show comment
Hide comment
@Corni

Corni Nov 14, 2017

Contributor

ping :)

Contributor

Corni commented Nov 14, 2017

ping :)

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 14, 2017

Member

Can you update the docstring for this new functionality?

Member

jorisvandenbossche commented Nov 14, 2017

Can you update the docstring for this new functionality?

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 14, 2017

Member

Can you update the docstring for this new functionality?

No matter, it already seems to (incorrectly) have been there ..

Member

jorisvandenbossche commented Nov 14, 2017

Can you update the docstring for this new functionality?

No matter, it already seems to (incorrectly) have been there ..

@jorisvandenbossche jorisvandenbossche merged commit ef4e30b into pandas-dev:master Nov 14, 2017

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche
Member

jorisvandenbossche commented Nov 14, 2017

@Corni Thanks!

@Corni Corni deleted the Corni:read-parquet-improvements branch Nov 14, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017

ENH: Pass kwargs from read_parquet() to the underlying engines. (pand…
…as-dev#18216)

This allows e.g. to specify filters for predicate pushdown to fastparquet.

(cherry picked from commit ef4e30b)

TomAugspurger added a commit that referenced this pull request Dec 11, 2017

ENH: Pass kwargs from read_parquet() to the underlying engines. (#18216)
This allows e.g. to specify filters for predicate pushdown to fastparquet.

(cherry picked from commit ef4e30b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment