Pachyderm: A Containerized, Version-Controlled Data Lake
- Git for Data Science: Pachyderm offers complete version control for even the largest data sets.
- Containerized: Pachyderm is built on Docker and Kubernetes. Since everything in Pachyderm is a container, data scientists can use any languages or libraries they want (e.g., R, Python, OpenCV).
- Ideal for building machine learning pipelines and ETL workflows: Pachyderm versions every output and tracks it directly back to the raw input datasets that created it (aka provenance).
For more details, see what's new about Pachyderm.
You can also refer to our complete developer docs to see tutorials, check out example projects, and learn about advanced features of Pachyderm.
If you'd like to see some examples and learn about core use cases for Pachyderm:
- Use Cases
- Case Studies: Learn how General Fusion uses Pachyderm to power commercial fusion research.
What's new about Pachyderm? (How is it different from Hadoop?)
There are two bold new ideas in Pachyderm:
- Containers as the core processing primitive
- Version Control for data
These ideas lead directly to a system that's more powerful, more flexible, and easier to use than Hadoop.
To process data, you simply create a containerized program that reads from and writes to the local filesystem. You can use any tools you want because everything runs in a container! Pachyderm takes your container and injects data into it, then automatically replicates the container, showing each copy a different chunk of the data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (example: distributed grep).
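As a rough illustration of the kind of program described above, here is a minimal grep-style sketch in Python that only reads from one directory and writes to another. The function name `grep_dir` and the temporary directories are hypothetical stand-ins: in a real pipeline the input and output directories would be whatever paths the platform mounts into the container, which this sketch assumes rather than takes from the Pachyderm docs.

```python
import re
import tempfile
from pathlib import Path


def grep_dir(pattern: str, in_dir: Path, out_dir: Path) -> int:
    """Write lines matching `pattern` from each file under in_dir to a
    same-named file under out_dir. Returns the number of matching lines."""
    regex = re.compile(pattern)
    out_dir.mkdir(parents=True, exist_ok=True)
    total = 0
    for src in sorted(p for p in in_dir.rglob("*") if p.is_file()):
        matches = [ln for ln in src.read_text().splitlines() if regex.search(ln)]
        if matches:
            dst = out_dir / src.relative_to(in_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            dst.write_text("\n".join(matches) + "\n")
            total += len(matches)
    return total


# Local demonstration: temporary directories stand in for the mounted
# input and output paths (an assumption made for this sketch).
with tempfile.TemporaryDirectory() as tmp:
    in_dir, out_dir = Path(tmp, "in"), Path(tmp, "out")
    in_dir.mkdir()
    (in_dir / "a.log").write_text("ok\nerror: disk full\n")
    (in_dir / "b.log").write_text("ok\n")
    count = grep_dir("error", in_dir, out_dir)
    produced = sorted(p.name for p in out_dir.iterdir())
```

Because the program touches nothing but the filesystem, the same code can be run unchanged on any subset of the input files, which is what makes replicating the container across chunks of data straightforward.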
Pachyderm also version-controls all data using a commit-based distributed filesystem (PFS), similar to what Git does with code. Version control for data has far-reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and diffs, collaborate with teammates, and, if anything goes wrong, revert the entire cluster with one click!
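To make the commit model concrete, here is a toy in-memory sketch of commit-based versioning for files: each commit records a parent and a snapshot, so history, diffs, and reverts fall out naturally. `ToyRepo`, `Commit`, and all method names are hypothetical and exist only for illustration; this is not how PFS is implemented, and a real system would not store full copies of every file per commit.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    id: str
    parent: Optional[str]
    files: Dict[str, bytes]  # full snapshot: path -> contents


class ToyRepo:
    """Hypothetical in-memory repo used only to illustrate the idea."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.head: Optional[str] = None

    def commit(self, changes: Dict[str, Optional[bytes]]) -> str:
        """Apply changes on top of head (None deletes a path) and record a commit."""
        files = dict(self.commits[self.head].files) if self.head else {}
        for path, data in changes.items():
            if data is None:
                files.pop(path, None)
            else:
                files[path] = data
        raw = repr((self.head, sorted(files.items()))).encode()
        cid = hashlib.sha256(raw).hexdigest()[:12]
        self.commits[cid] = Commit(cid, self.head, files)
        self.head = cid
        return cid

    def history(self) -> List[str]:
        """Commit ids from head back to the first commit."""
        out, cur = [], self.head
        while cur is not None:
            out.append(cur)
            cur = self.commits[cur].parent
        return out

    def diff(self, old: str, new: str) -> Dict[str, List[str]]:
        a, b = self.commits[old].files, self.commits[new].files
        return {
            "added": sorted(b.keys() - a.keys()),
            "removed": sorted(a.keys() - b.keys()),
            "changed": sorted(p for p in a.keys() & b.keys() if a[p] != b[p]),
        }

    def revert(self, commit_id: str) -> None:
        """Move head back to an earlier commit; later commits stay stored."""
        self.head = self.commits[commit_id].id


repo = ToyRepo()
c1 = repo.commit({"a.csv": b"1,2\n"})
c2 = repo.commit({"a.csv": b"1,2\n3,4\n", "b.csv": b"x\n"})
log_before = repo.history()   # newest first
changes = repo.diff(c1, c2)
repo.revert(c1)               # data is back to the first snapshot
```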
Version control also works hand in hand with our containerized processing engine. Because Pachyderm understands how your data changes, it can run your workload on just the diff as new data is ingested, rather than on the whole dataset. This means that there's no difference between a batched job and a streaming job: the same code will work for both!
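The idea of running a workload only on the diff can be sketched as follows. This is a simplified illustration that assumes per-file granularity and plain dictionaries as snapshots; `run_incremental` and `count_lines` are hypothetical names, not part of Pachyderm.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, bytes]  # path -> file contents


def run_incremental(
    process: Callable[[bytes], int],
    old_in: Snapshot,
    new_in: Snapshot,
    old_out: Dict[str, int],
) -> Dict[str, int]:
    """Run `process` only on files that are new or changed since old_in,
    reusing earlier results for everything else."""
    out: Dict[str, int] = {}
    for path, data in new_in.items():
        if path in old_out and old_in.get(path) == data:
            out[path] = old_out[path]   # unchanged input: reuse result
        else:
            out[path] = process(data)   # new or modified input: recompute
    return out


calls: List[str] = []


def count_lines(data: bytes) -> int:
    calls.append("run")  # record each real invocation
    return len(data.splitlines())


# "Batch" run: there is no previous snapshot, so every file is processed.
v1 = {"a.log": b"x\ny\n", "b.log": b"z\n"}
batch = run_incremental(count_lines, {}, v1, {})
calls_after_batch = len(calls)

# "Streaming" run: one file arrives, and only that file is processed.
v2 = {**v1, "c.log": b"q\n"}
stream = run_incremental(count_lines, v1, v2, batch)
calls_after_stream = len(calls)
```

The user code (`count_lines`) is identical in both runs; only the set of inputs it is shown differs, which is the sense in which a batch job and a streaming job become the same thing.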
Keep up to date and get Pachyderm support via:
To get started, sign the Contributor License Agreement.
Send us PRs; we would love to see what you do! You can also check our GitHub issues for anything labeled "noob-friendly" as a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.
Pachyderm automatically reports anonymized usage metrics. These metrics help us
understand how people are using Pachyderm and make it better. They can be
disabled by setting the env variable
false in the pachd