GitHub - mrocklin/dask-deltatable: A Delta Lake reader for Dask

Dask-DeltaTable

Read from and write to Delta Lake using the Dask engine.

Installation

dask-deltatable is available on PyPI:

pip install dask-deltatable

And conda-forge:

conda install -c conda-forge dask-deltatable

Features:

- Read the Parquet files from Delta Lake and parallelize with Dask
- Write Dask DataFrames to Delta Lake (limited support)
- Supports multiple filesystems (s3, azurefs, gcsfs)
- Subset of Delta Lake features:
  - Time travel
  - Schema evolution
  - Parquet filters
    - row filter
    - partition filter
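
The difference between the two filter kinds can be illustrated without the library. The sketch below is not dask-deltatable's implementation. It assumes filters are written as (column, op, value) tuples, the convention used by pyarrow's Parquet readers, and it uses a made-up file listing and hypothetical helper names (`matches`, `prune`). A partition filter can discard whole files from partition values alone, whereas a row filter has to inspect the data inside each file.

```python
import operator

# Comparison operators allowed in a (column, op, value) filter tuple.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(partition, conjunction):
    """Return True if every (column, op, value) predicate holds for the partition."""
    return all(OPS[op](partition[col], value) for col, op, value in conjunction)


def prune(files, conjunction):
    """Keep only the files whose partition values satisfy the filter."""
    return [path for path, partition in files if matches(partition, conjunction)]


# Hypothetical file listing: (path, partition values) pairs.
files = [
    ("year=2020/part-0.parquet", {"year": 2020}),
    ("year=2021/part-0.parquet", {"year": 2021}),
    ("year=2022/part-0.parquet", {"year": 2022}),
]

# Only files in partitions with year >= 2021 would need to be opened.
kept = prune(files, [("year", ">=", 2021)])
print(kept)
```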

Not supported

Writing to Delta Lake is still in development.
The optimize API, which runs a bin-packing operation on a Delta table, is not available.

Reading from Delta Lake

import dask_deltatable as ddt

# read delta table
df = ddt.read_deltalake("delta_path")

# with specific version
df = ddt.read_deltalake("delta_path", version=3)

# with specific datetime
df = ddt.read_deltalake("delta_path", datetime="2018-12-19T16:39:57-08:00")

df is a Dask DataFrame that you can work with in the same way you normally would. See the Dask DataFrame documentation for available operations.
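
The datetime value in the example above is an ISO 8601 string with a UTC offset. If the timestamp starts out as a Python `datetime` object, a string of the same shape can be produced with the standard library. This is a minimal sketch; the helper name `to_time_travel_string` is hypothetical, and whether other timestamp formats are accepted is not covered here.

```python
from datetime import datetime, timedelta, timezone


def to_time_travel_string(moment):
    """Format a timezone-aware datetime as an ISO 8601 string with a UTC offset."""
    if moment.tzinfo is None:
        # A naive datetime has no offset, so the resulting string would be ambiguous.
        raise ValueError("moment must be timezone-aware")
    return moment.isoformat(timespec="seconds")


# Fixed UTC-08:00 offset, matching the example timestamp above.
utc_minus_8 = timezone(timedelta(hours=-8))
stamp = to_time_travel_string(datetime(2018, 12, 19, 16, 39, 57, tzinfo=utc_minus_8))
print(stamp)
```

The resulting string can then be passed as the `datetime` argument shown earlier.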

Accessing remote file systems

To read from S3, Azure, GCS, and other remote filesystems, ensure that credentials are properly configured in environment variables or config files. For AWS, you may need ~/.aws/credentials; for gcsfs, GOOGLE_APPLICATION_CREDENTIALS. Refer to your cloud provider's documentation to configure these.

ddt.read_deltalake("s3://bucket_name/delta_path", version=3)
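
When a read fails on a remote path, it can help to know which credential source applies to that path. The sketch below is a standalone helper, not part of dask-deltatable. The function name `credential_hint` is hypothetical, and the scheme names (`s3`, `gs`, `gcs`) are assumed to follow the usual fsspec protocol names.

```python
from urllib.parse import urlsplit

# Assumed mapping from URL scheme to the credential source named in this README.
CREDENTIAL_HINTS = {
    "s3": "~/.aws/credentials",
    "gs": "GOOGLE_APPLICATION_CREDENTIALS",
    "gcs": "GOOGLE_APPLICATION_CREDENTIALS",
}


def credential_hint(path):
    """Return a hint about where credentials are expected, or None for local paths."""
    scheme = urlsplit(path).scheme
    if scheme in ("", "file"):
        return None  # local path, no remote credentials involved
    return CREDENTIAL_HINTS.get(scheme, "see your cloud provider documentation")


print(credential_hint("s3://bucket_name/delta_path"))
print(credential_hint("delta_path"))
```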

Accessing AWS Glue catalog

dask-deltatable can connect to AWS Glue catalog to read the delta table. The method will look for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, and if those are not available, fall back to ~/.aws/credentials.

Example:

ddt.read_deltalake(catalog="glue", database_name="science", table_name="physics")
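
The lookup order described above (environment variables first, then ~/.aws/credentials) can be checked before calling the reader. The following is a rough sketch of that order using only the standard library. It is not the code dask-deltatable runs; the function name `resolve_aws_keys` and the use of the `default` profile are assumptions made for illustration.

```python
import configparser
import os
from pathlib import Path


def resolve_aws_keys(env=None, credentials_file=None, profile="default"):
    """Return (access_key, secret_key, source), or None if nothing is configured."""
    env = os.environ if env is None else env

    # Step 1: environment variables take precedence.
    access = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if access and secret:
        return access, secret, "environment"

    # Step 2: fall back to the shared credentials file.
    path = Path(credentials_file) if credentials_file else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if parser.has_section(profile):
        access = parser[profile].get("aws_access_key_id")
        secret = parser[profile].get("aws_secret_access_key")
        if access and secret:
            return access, secret, str(path)
    return None


# Example with an explicit environment mapping (placeholder values).
found = resolve_aws_keys(env={"AWS_ACCESS_KEY_ID": "id", "AWS_SECRET_ACCESS_KEY": "key"})
print(found[2])
```

If the function returns None, neither source is configured and the Glue lookup would likely fail to authenticate.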

Writing to Delta Lake

To write a Dask DataFrame to Delta Lake, use the to_deltalake method.

import dask.dataframe as dd
import dask_deltatable as ddt

df = dd.read_csv("s3://bucket_name/data.csv")
# do some processing on the dataframe...
ddt.to_deltalake(df, "s3://bucket_name/delta_path")

Writing to Delta Lake is still in development, so be aware that some features may not work.

