Reproducible Data Science at Scale!


Pachyderm: Data Versioning, Data Pipelines, and Data Lineage

Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts that do this in an ad hoc fashion and you're looking for a way to "productionize" them, Pachyderm can make this easy for you.


  • Containerized: Pachyderm is built on Docker and Kubernetes. Whatever languages or libraries your pipeline needs, they can run on Pachyderm, which can be deployed on any cloud provider or on-premises.
  • Version Control: Pachyderm version controls your data as it's processed. You can always ask the system how data has changed, see a diff, and, if something doesn't look right, revert.
  • Provenance (aka data lineage): Pachyderm tracks where data comes from, keeping a record of all the code and data that created a result.
  • Parallelization: Pachyderm can efficiently schedule massively parallel workloads.
  • Incremental Processing: Pachyderm understands how your data has changed and is smart enough to only process the new data.
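To make the idea of incremental processing concrete, here is a toy sketch in Python. It is a conceptual illustration only, not Pachyderm's API or its actual implementation: the functions `content_hash` and `process_incrementally` are hypothetical names, and the sketch assumes that "changed" can be detected by comparing content hashes between runs.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Fingerprint a file's contents so that changes can be detected."""
    return hashlib.sha256(data).hexdigest()


def process_incrementally(files, seen, transform):
    """Run `transform` only on files whose content differs from the last run.

    files: dict mapping file name -> bytes
    seen:  dict mapping file name -> hash recorded on a previous run (updated in place)
    Returns a dict of results for the files that were actually processed.
    """
    results = {}
    for name, data in files.items():
        digest = content_hash(data)
        if seen.get(name) == digest:
            continue  # unchanged since the last run, so skip it
        results[name] = transform(data)
        seen[name] = digest
    return results


# Hypothetical example data: the second run changes only b.txt.
seen = {}
first = process_incrementally({"a.txt": b"hello", "b.txt": b"world"}, seen, len)
second = process_incrementally({"a.txt": b"hello", "b.txt": b"world!"}, seen, len)
print(sorted(first))   # both files are new on the first run
print(sorted(second))  # only the changed file is reprocessed
```

The stored hashes stand in for the version history that a data versioning system would keep; a real system would track this per commit rather than in an in-memory dict.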

Getting Started

Install Pachyderm locally or deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete developer docs to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm:


Official Documentation


Keep up to date and get Pachyderm support via:

  • Follow us on Twitter.
  • Join our community Slack channel to get help from the Pachyderm team and other users.


To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! You can also check our GitHub issues for things labeled "help-wanted" as a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Join Us

WE'RE HIRING! Love Docker, Go and distributed systems? Learn more about our team and email us at

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. You can disable them by setting the environment variable METRICS to false in the pachd container.
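As a rough sketch of what that setting looks like in practice, the Python snippet below adds `METRICS=false` to the `pachd` container of a Kubernetes Deployment that has already been loaded into a dict. The helper `disable_metrics` and the sample manifest are hypothetical and only illustrate the standard `spec.template.spec.containers[].env` layout; your actual deployment manifest may be structured or named differently.

```python
import copy


def disable_metrics(deployment: dict, container_name: str = "pachd") -> dict:
    """Return a copy of a Deployment dict with METRICS=false on one container."""
    patched = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    containers = patched["spec"]["template"]["spec"]["containers"]
    for container in containers:
        if container.get("name") != container_name:
            continue
        # Drop any existing METRICS entry, then append the new value.
        env = [e for e in container.get("env", []) if e.get("name") != "METRICS"]
        env.append({"name": "METRICS", "value": "false"})  # env values are strings
        container["env"] = env
    return patched


# Hypothetical, heavily trimmed manifest used only for illustration.
manifest = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "pachd", "env": [{"name": "EXAMPLE", "value": "1"}]},
    ]}}},
}

patched = disable_metrics(manifest)
print(patched["spec"]["template"]["spec"]["containers"][0]["env"])
```

The same change can be made by editing the manifest by hand; the point is only that the variable belongs in the `env` list of the `pachd` container.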