Merge pull request #14 from zillow/tz/gather_statistics_kwargs
Resolves #13 Surface the gather_statistics argument
martindurant committed Aug 22, 2019
2 parents 2e419af + c20180d commit e38d720
Showing 2 changed files with 28 additions and 6 deletions.
18 changes: 18 additions & 0 deletions docs/source/quickstart.rst
@@ -54,6 +54,24 @@ Arguments to ``open_parquet``:
be loaded, but partitions containing *at least one* value which passes the filter will be
loaded.

- ``engine`` : 'fastparquet' or 'pyarrow'. Which backend to read with.

- ``gather_statistics`` : bool or None (default). Gather the statistics for
each dataset partition. By default, this will only be done if the _metadata
file is available. Otherwise, statistics will only be gathered if True,
because the footer of every file will be parsed (which is very slow on some
systems).

- see ``dd.read_parquet()`` for the other named parameters that can be passed through.

.. _documentation : http://dask.pydata.org/en/latest/remote-data-services.html

A source so defined will provide the usual methods such as ``discover`` and ``read_partition``.
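
As an illustration of the arguments documented above, here is a minimal usage sketch; the dataset path and the eager/lazy read calls are assumptions for illustration, not part of this commit.

# Minimal sketch (hypothetical dataset path; not from this commit):
import intake

source = intake.open_parquet('s3://mybucket/data.parq',   # hypothetical path
                             engine='fastparquet',        # or 'pyarrow'
                             gather_statistics=True)      # parse every file footer
df = source.to_dask()    # lazy dask DataFrame
# pdf = source.read()    # or read eagerly into pandas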
16 changes: 10 additions & 6 deletions intake_parquet/source.py
@@ -31,6 +31,15 @@ class ParquetSource(base.DataSource):
- engine: 'fastparquet' or 'pyarrow'
Which backend to read with.
- gather_statistics : bool or None (default).
Gather the statistics for each dataset partition. By default,
this will only be done if the _metadata file is available. Otherwise,
statistics will only be gathered if True, because the footer of
every file will be parsed (which is very slow on some systems).
- see dd.read_parquet() for the other named parameters that can be passed through.
"""
container = 'dataframe'
name = 'parquet'
@@ -96,13 +105,8 @@ def _to_dask(self):
"""
import dask.dataframe as dd
urlpath = self._get_cache(self._urlpath)[0]
kw = dict(columns=self._kwargs.get('columns', None),
index=self._kwargs.get('index', None),
engine=self._kwargs.get('engine', 'auto'))
if 'filters' in self._kwargs:
kw['filters'] = self._kwargs['filters']
self._df = dd.read_parquet(urlpath,
storage_options=self._storage_options, **kw)
storage_options=self._storage_options, **self._kwargs)
self._load_metadata()
return self._df

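The effect of the ``_to_dask`` change above is that every keyword stored on the source is now forwarded to ``dask.dataframe.read_parquet`` unchanged, instead of only ``columns``, ``index``, ``engine`` and ``filters``. A rough sketch of the resulting call, with hypothetical values:

# Rough sketch of the kwargs pass-through (all values hypothetical):
import dask.dataframe as dd

kwargs = {'columns': ['value'],
          'index': 'timestamp',
          'engine': 'fastparquet',
          'gather_statistics': None}   # any dd.read_parquet keyword now gets through

df = dd.read_parquet('s3://mybucket/data.parq',        # hypothetical path
                     storage_options={'anon': True},   # hypothetical options
                     **kwargs)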
