Pachyderm Pipeline System (PPS)

Get Started

To get started using Pipelines, refer to our Beginner Tutorial or Pipeline docs.

Overview

Pachyderm Pipeline System is a parallel, containerized analysis platform.

It is designed to:

- Let you write your analysis in any language you choose (enabling Accessibility)
- Let you compose your analyses
- Let you reproduce your input data, your processing step, and your output data (enabling Reproducibility)
- Let you understand the Provenance of your data

Components

PPS has two components: Jobs and Pipelines. Understanding each gives you a full picture of PPS.

Jobs

Jobs are transformations that are only run once.

Broadly, they take the following inputs:

- a transformation image (refer to the pipeline spec for instructions on creating your own image)
  - an entry point to run the transformation
  - some other configuration options about how to run the job (parallelism, partitioning method, etc.)
- at least one PFS input Repo containing some data
  - a Commit ID per input repo
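
To make the list above concrete, here is a minimal Python sketch that models a job's inputs as plain data. It is not the pipeline spec format: the class names `JobInput` and `JobRequest`, the field names, and the example image and commit ID are all hypothetical, chosen only to mirror the items listed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobInput:
    """One PFS input: a Repo pinned to a specific Commit (hypothetical model)."""
    repo: str
    commit_id: str


@dataclass
class JobRequest:
    """Hypothetical container for the inputs a job takes; not the real spec."""
    image: str                 # transformation image
    entry_point: List[str]     # command that runs the transformation
    parallelism: int = 1       # one example of an extra configuration option
    inputs: List[JobInput] = field(default_factory=list)

    def validate(self) -> None:
        # A job needs an image, an entry point, and at least one PFS input,
        # and every input repo must be pinned to a Commit ID.
        if not self.image:
            raise ValueError("a transformation image is required")
        if not self.entry_point:
            raise ValueError("an entry point is required")
        if self.parallelism < 1:
            raise ValueError("parallelism must be at least 1")
        if not self.inputs:
            raise ValueError("at least one PFS input Repo is required")
        for job_input in self.inputs:
            if not job_input.commit_id:
                raise ValueError(f"input repo {job_input.repo!r} needs a Commit ID")


# Example values are placeholders, not real images or commits.
job = JobRequest(
    image="example/line-count:latest",
    entry_point=["python3", "/app/line_count.py"],
    parallelism=2,
    inputs=[JobInput(repo="your_repo_name", commit_id="abc123")],
)
job.validate()
```

Calling `validate()` on a request with no inputs, or with an input that has an empty Commit ID, raises `ValueError`, which matches the requirement that every input repo comes with a Commit ID.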

When creating a job, PPS:

- creates an output Repo with the same name as the job
- uses Kubernetes to spin up containers with the image you specify, in the configuration you specify
- mounts the input Repo, at the Commit specified, at /pfs/your_repo_name for use by your code on that container
- mounts /pfs/out for writing output, which is connected to the newly created output Repo
- runs the containers with the entry point you provided
- stores the output in a new commit on the new output Repo
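
From the point of view of your code, the steps above mean reading files under /pfs/your_repo_name and writing results under /pfs/out. The following is a minimal sketch of such a transformation in Python. The line-counting task, the function name `count_lines`, and the output file name `line_count.txt` are assumptions made for illustration; only the two mount paths come from the description above.

```python
import os
import sys


def count_lines(input_dir: str, output_dir: str) -> int:
    """Count lines in every file under input_dir and write the total to output_dir."""
    total = 0
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                total += sum(1 for _ in handle)
    os.makedirs(output_dir, exist_ok=True)
    # Anything written under the output directory ends up in the output commit.
    with open(os.path.join(output_dir, "line_count.txt"), "w", encoding="utf-8") as out:
        out.write(f"{total}\n")
    return total


if __name__ == "__main__":
    # Default to the mount points described above; allow overrides for local runs.
    input_dir = sys.argv[1] if len(sys.argv) > 1 else "/pfs/your_repo_name"
    output_dir = sys.argv[2] if len(sys.argv) > 2 else "/pfs/out"
    if os.path.isdir(input_dir):
        print(count_lines(input_dir, output_dir))
    else:
        print(f"input directory {input_dir} not found; nothing to do")
```

Because the directories are parameters, the same function can be exercised locally against temporary directories before it is packaged into a transformation image.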

Pipeline

Pipelines are configured once, but run every time new data is present in the form of a new Commit on any of their input Repos. You can think of them as automatically up-to-date long-running jobs.
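
The "configured once, run on every new Commit" behavior can be illustrated with a toy model in Python. This is a sketch of the idea only, not how PPS is implemented: the class `ToyPipeline`, its method `on_commit`, and the repo names are hypothetical, and the choice to wait until every input repo has at least one commit is an assumption made here so that each job has a Commit ID per input repo.

```python
from typing import Callable, Dict, List, Optional


class ToyPipeline:
    """Configured once; runs a job whenever an input repo gets a new commit."""

    def __init__(self, input_repos: List[str],
                 run_job: Callable[[Dict[str, str]], str]) -> None:
        self.run_job = run_job
        # Latest commit seen on each input repo (None until one arrives).
        self.heads: Dict[str, Optional[str]] = {repo: None for repo in input_repos}
        self.outputs: List[str] = []

    def on_commit(self, repo: str, commit_id: str) -> bool:
        """Record a new commit; return True if it triggered a job."""
        if repo not in self.heads or self.heads[repo] == commit_id:
            return False  # not an input repo, or nothing new
        self.heads[repo] = commit_id
        if any(head is None for head in self.heads.values()):
            return False  # assumption: wait until every input has a commit
        self.outputs.append(self.run_job(dict(self.heads)))
        return True


def fake_job(heads: Dict[str, str]) -> str:
    # Stand-in for a real job: names the output after its input commits.
    return "out@" + "+".join(heads[repo] for repo in sorted(heads))


pipeline = ToyPipeline(["images", "labels"], fake_job)
pipeline.on_commit("images", "c1")  # no job yet: "labels" has no commit
pipeline.on_commit("labels", "c2")  # runs a job on images@c1 and labels@c2
pipeline.on_commit("images", "c3")  # runs again with the new images commit
```

Each triggering commit behaves like one job submitted against the current commits of all inputs, which is the sense in which a pipeline acts as an automatically up-to-date job.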

For detailed instructions on pipelines, refer to the pipeline spec.

Provenance

You'll be using and composing pipelines frequently with PPS. Before long, you'll want to understand how your outputs relate to their inputs.

Check out the flush-commit docs for specifics on how to track provenance.
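
As a way to picture what provenance means once pipelines are composed, the sketch below walks a small, hand-written provenance record in Python. It does not use or imitate flush-commit; the `PROVENANCE` mapping, the `repo@commit` naming, and the function `upstream` are hypothetical and exist only to show how an output can be traced back to every input it depends on.

```python
from typing import Dict, List, Set

# Hypothetical record: each output commit maps to the commits it was computed from.
PROVENANCE: Dict[str, List[str]] = {
    "report@r1": ["cleaned@c1"],
    "cleaned@c1": ["raw@a1", "lookup@l1"],
}


def upstream(commit: str, provenance: Dict[str, List[str]]) -> Set[str]:
    """Return every commit that the given commit depends on, directly or not."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        for parent in provenance.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


inputs_of_report = upstream("report@r1", PROVENANCE)
```

Here `inputs_of_report` contains `cleaned@c1`, `raw@a1`, and `lookup@l1`: the direct input of the report plus the inputs of the pipeline that produced it.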

Debugging tools

Beyond provenance, your primary triage tool is the logs available through pachctl. They let you see the log output per Job or Pipeline and debug any errors.
