To get started using Pipelines, refer to our Beginner Tutorial or Pipeline docs.
The Pachyderm Pipeline System (PPS) is a parallel, containerized analysis platform.
It is designed to:
- Allow you to write your analysis in any language of your choosing (enabling Accessibility)
- Allow you to compose your analyses
- Allow you to reproduce your input data, your processing step, and your output data (enabling Reproducibility)
- Allow you to understand the Provenance of your data
PPS has two components, Jobs and Pipelines, and understanding each gives you a full picture of PPS.
Jobs are transformations that are only run once.
Broadly, they take the following inputs:
- a transformation image (refer to the pipeline spec for instructions on creating your own image)
- an entry point to run the transformation
- some other configuration options about how to run the job (parallelism, partitioning method, etc.)
- at least one PFS input
    - a Repo containing some data
    - a Commit ID per input repo
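The inputs listed above can be pictured as a small data model. The sketch below is only an illustration of that list, not the actual pipeline spec: the class names, field names, and the image name are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PfsInput:
    """One PFS input: a Repo plus the Commit ID to read from it."""
    repo: str
    commit_id: str


@dataclass
class JobInputs:
    """Illustrative container for a job's inputs (hypothetical, not the real spec)."""
    image: str                  # transformation image
    entry_point: List[str]      # command that runs the transformation
    pfs_inputs: List[PfsInput]  # at least one is required
    parallelism: int = 1        # one example of "other configuration options"

    def __post_init__(self) -> None:
        if not self.pfs_inputs:
            raise ValueError("a job needs at least one PFS input")
        repos = [pfs_input.repo for pfs_input in self.pfs_inputs]
        if len(repos) != len(set(repos)):
            raise ValueError("expected exactly one Commit ID per input repo")


job = JobInputs(
    image="example/line-count:latest",        # hypothetical image name
    entry_point=["python3", "/app/main.py"],  # hypothetical entry point
    pfs_inputs=[PfsInput(repo="data", commit_id="abc123")],
)
```

Refer to the pipeline spec for the real field names and format.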
When creating a job, PPS:
- creates an output Repo with the same name as the job
- uses Kubernetes to spin up containers with the image you specify, in the configuration you specify
- mounts the input Repo at the Commit specified at /pfs/your_repo_name for use by your code on that container
- mounts /pfs/out for writing output, which is connected to the newly created output Repo
- runs the containers with the entry point you provided
- stores the output in a new commit on the new output Repo
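From the point of view of your code, the steps above mean that a transformation only needs to read files from its input mount and write files to /pfs/out. The following is a minimal sketch of such a transformation, assuming a hypothetical input repo named "data" and a made-up task (counting lines per file). The function name and the ".count" suffix are illustrative choices, not part of PPS.

```python
from pathlib import Path


def run_transformation(input_dir: str, output_dir: str) -> int:
    """Write a line count for every file under input_dir into output_dir.

    Returns the number of input files processed.
    """
    src = Path(input_dir)
    dst = Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)
    processed = 0
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        # Mirror the input's relative path under the output directory.
        rel = path.relative_to(src)
        out_path = dst / rel.parent / (rel.name + ".count")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(f"{line_count}\n", encoding="utf-8")
        processed += 1
    return processed


# Inside the job's container, the entry point would call something like:
#     run_transformation("/pfs/data", "/pfs/out")
# where "data" stands in for your input repo's name.
```

Taking the directories as parameters, rather than hard-coding the /pfs paths, lets you exercise the same function against local directories before building it into an image.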
Pipelines are configured once, but run every time new data is present in the form of a new Commit on any of their input Repos. You can think of them as automatically up-to-date long-running jobs.
For detailed instructions on pipelines, refer to the pipeline spec.
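The "configured once, run on every new Commit" behavior can be pictured with a toy in-memory model. This is a conceptual sketch only and says nothing about how PPS is implemented: the `ToyPipeline` class and its methods are hypothetical, and the rule that nothing runs until every input Repo has at least one Commit is an assumption of this model.

```python
from typing import Callable, Dict, List, Tuple


class ToyPipeline:
    """Configured once; runs its transform whenever an input repo gets a commit."""

    def __init__(
        self,
        name: str,
        input_repos: List[str],
        transform: Callable[[Dict[str, str]], str],
    ) -> None:
        self.name = name
        self.input_repos = list(input_repos)
        self.transform = transform
        self.latest: Dict[str, str] = {}  # repo -> most recent commit ID
        self.runs: List[Tuple[Dict[str, str], str]] = []  # (input commits, output)

    def on_commit(self, repo: str, commit_id: str) -> bool:
        """Record a new commit and report whether it triggered a run."""
        if repo not in self.input_repos:
            return False  # not one of this pipeline's inputs
        self.latest[repo] = commit_id
        if len(self.latest) < len(self.input_repos):
            return False  # assumption: wait until every input has a commit
        inputs = dict(self.latest)
        self.runs.append((inputs, self.transform(inputs)))
        return True


def describe(commits: Dict[str, str]) -> str:
    """Stand-in transform: summarize which commits were processed."""
    return "+".join(f"{repo}@{cid}" for repo, cid in sorted(commits.items()))


pipeline = ToyPipeline("join", ["logs", "users"], describe)
pipeline.on_commit("logs", "c1")   # no run yet: "users" has no commit
pipeline.on_commit("users", "c2")  # first run
pipeline.on_commit("logs", "c3")   # runs again with the newer "logs" commit
```

Because each run records the input commits it consumed, the model also hints at how an output can be traced back to the inputs that produced it.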
You'll be using and composing pipelines frequently with PPS, so you will quickly want to understand how your outputs are related to their inputs.
Check out the flush-commit docs for specifics on how to track provenance.
Beyond provenance, your primary triaging tool is pachctl's logs, which let you see the log output per Job or Pipeline and debug any errors.