Merge pull request #3282 from pachyderm/atom-pfs-migration-notes
Go back to atom inputs in documentation & examples
jdoliner committed Dec 14, 2018
2 parents 822a2e9 + fd64da3 commit d577f50
Showing 45 changed files with 215 additions and 193 deletions.
2 changes: 1 addition & 1 deletion doc/cookbook/gpus.md
@@ -62,7 +62,7 @@ An example pipeline definition for a GPU enabled Pachyderm Pipeline is as follow
"gpu": 1
},
"inputs": {
"pfs": {
"atom": {
"repo": "data",
"glob": "/*"
}
2 changes: 1 addition & 1 deletion doc/enterprise/stats.md
@@ -28,7 +28,7 @@ As mentioned above, enabling stats collection for a pipeline is as simple as add
"name": "edges"
},
"input": {
"pfs": {
"atom": {
"glob": "/*",
"repo": "images"
}
2 changes: 1 addition & 1 deletion doc/fundamentals/creating_analysis_pipelines.md
@@ -94,7 +94,7 @@ Here's an example pipeline spec:
"cmd": ["/binary", "/pfs/data", "/pfs/out"]
},
"input": {
"pfs": {
"atom": {
"repo": "data",
"glob": "/*"
}
2 changes: 1 addition & 1 deletion doc/fundamentals/creating_services.md
@@ -10,7 +10,7 @@ is an example of Jupyter service:
```json
{
"input": {
"pfs": {
"atom": {
"glob": "/",
"repo": "input"
}
22 changes: 11 additions & 11 deletions doc/fundamentals/distributed_computing.md
@@ -38,30 +38,30 @@ Defining how your data is spread out among workers is arguably the most importan

Instead of confining users to just data-distribution patterns such as Map (split everything as much as possible) and Reduce (_all_ the data must be grouped together), Pachyderm uses [Glob Patterns](https://en.wikipedia.org/wiki/Glob_(programming)) to offer incredible flexibility in defining your data distribution.

Glob patterns are defined by the user for each `pfs` within the `input` of a pipeline, and they tell Pachyderm how to divide the input data into individual "datums" that can be processed independently.
Glob patterns are defined by the user for each `atom` within the `input` of a pipeline, and they tell Pachyderm how to divide the input data into individual "datums" that can be processed independently.

```
"input": {
"pfs": {
"atom": {
"repo": "string",
"glob": "string",
}
}
```

That means you could easily define multiple PFS inputs, one with the data highly distributed and another where it's grouped together. You can then join the datums in these PFS inputs via a cross product or union (as shown above) for combined, distributed processing.
That means you could easily define multiple "atoms", one with the data highly distributed and another where it's grouped together. You can then join the datums in these atoms via a cross product or union (as shown above) for combined, distributed processing.

```
"input": {
"cross" or "union": [
{
"pfs": {
"atom": {
"repo": "string",
"glob": "string",
}
},
{
"pfs": {
"atom": {
"repo": "string",
"glob": "string",
}
@@ -71,21 +71,21 @@ That means you could easily define multiple PFS inputs, one with the data highly
}
```

More information about PFS inputs, unions, and crosses can be found in our [Pipeline Specification](http://docs.pachyderm.io/en/latest/reference/pipeline_spec.html).
More information about "atoms," unions, and crosses can be found in our [Pipeline Specification](http://docs.pachyderm.io/en/latest/reference/pipeline_spec.html).

### Datums

Pachyderm uses the glob pattern to determine how many “datums” a PFS input consists of. Datums are the unit of parallelism in Pachyderm. That is, Pachyderm attempts to process datums in parallel whenever possible.
Pachyderm uses the glob pattern to determine how many “datums” an input atom consists of. Datums are the unit of parallelism in Pachyderm. That is, Pachyderm attempts to process datums in parallel whenever possible.

If you have two workers and define 2 datums, Pachyderm will send one datum to each worker. In a scenario where there are more datums than workers, Pachyderm will queue up extra datums and send them to workers as they finish processing previous datums.
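The queueing behavior described above can be sketched in a few lines of Python. This is a toy model, not Pachyderm's actual scheduler; real workers pull queued datums as they finish, which a round-robin hand-out approximates when datums take similar time to process:

```python
from collections import deque

def distribute(datums, num_workers):
    """Toy model of datum scheduling: each worker takes the next
    queued datum as soon as it is free."""
    queue = deque(datums)
    assignments = {w: [] for w in range(num_workers)}
    worker = 0
    while queue:
        assignments[worker].append(queue.popleft())
        worker = (worker + 1) % num_workers
    return assignments

# Two workers, five datums: extras are queued and handed out in turn.
print(distribute(["d1", "d2", "d3", "d4", "d5"], 2))
# {0: ['d1', 'd3', 'd5'], 1: ['d2', 'd4']}
```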

### Defining Datums via Glob Patterns

Intuitively, you should think of the PFS input repo as a file system where the glob pattern is being applied to the root of the file system. The files and directories that match the glob pattern are considered datums.
Intuitively, you should think of the input atom repo as a file system where the glob pattern is being applied to the root of the file system. The files and directories that match the glob pattern are considered datums.

For example, a glob pattern of just `/` would denote the entire input repo as a single datum. All of the input data would be given to a single worker similar to a typical reduce-style operation.

Another commonly used glob pattern is `/*`. `/*` would define each top level object (file or directory) in the PFS input repo as its own datum. If you have a repo with just 10 files in it and no directory structure, every file would be a datum and could be processed independently. This is similar to a typical map-style operation.
Another commonly used glob pattern is `/*`. `/*` would define each top level object (file or directory) in the input atom repo as its own datum. If you have a repo with just 10 files in it and no directory structure, every file would be a datum and could be processed independently. This is similar to a typical map-style operation.

But Pachyderm can do anything in between too. If you have a directory structure with each state as a directory and a file for each city such as:

@@ -105,7 +105,7 @@ and you need to process all the data for a given state together, `/*` would also

If we instead used the glob pattern `/*/*` for the states example above, each `<city>.json` would be its own datum.

Glob patterns also let you take only a particular directory (or subset of directories) as a PFS input instead of the whole input repo. If we create a pipeline that is specifically only for California, we can use a glob pattern of `/California/*` to only use the data in that directory as input to our pipeline.
Glob patterns also let you take only a particular directory (or subset of directories) as an input atom instead of the whole input repo. If we create a pipeline that is specifically only for California, we can use a glob pattern of `/California/*` to only use the data in that directory as input to our pipeline.
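The glob-to-datum mapping described here can be modeled in Python. This is a simplified sketch of the semantics, not Pachyderm's actual matcher; `datums` is a hypothetical helper that matches path segments one at a time so that `*` never crosses a `/`:

```python
from fnmatch import fnmatchcase

def datums(paths, glob):
    """Return the set of datums a glob pattern produces: the file or
    directory prefixes of `paths` whose segments match the pattern."""
    if glob == "/":
        return {"/"}  # the whole repo is a single datum
    pat_parts = glob.strip("/").split("/")
    found = set()
    for path in paths:
        parts = path.strip("/").split("/")
        if len(parts) < len(pat_parts):
            continue
        prefix = parts[: len(pat_parts)]
        # Match segment by segment so "*" stays within one path level.
        if all(fnmatchcase(p, q) for p, q in zip(prefix, pat_parts)):
            found.add("/" + "/".join(prefix))
    return found

repo = [
    "/California/San-Francisco.json",
    "/California/Sacramento.json",
    "/Vermont/Burlington.json",
]
print(sorted(datums(repo, "/*")))             # ['/California', '/Vermont']
print(sorted(datums(repo, "/*/*")))           # one datum per city file
print(sorted(datums(repo, "/California/*")))  # only California's cities
```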

### Only Processing New Data

@@ -127,7 +127,7 @@ Let's look at our states example with a few different glob patterns to demonstra
...
```

If our glob pattern is `/`, then the entire PFS input is a single datum, which means anytime any file or directory is changed in our input, all the data will be processed from scratch. There are plenty of use cases where this is exactly what we need (e.g. some machine learning training algorithms).
If our glob pattern is `/`, then the entire input atom is a single datum, which means anytime any file or directory is changed in our input, all the data will be processed from scratch. There are plenty of use cases where this is exactly what we need (e.g. some machine learning training algorithms).

If our glob pattern is `/*`, then each state directory is its own datum and we'll only process the ones that have changed. So if we add a new city file, `Sacramento.json`, to the `/California` directory, _only_ the California datum will be reprocessed.
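The incremental behavior above can be sketched as a set difference over datum hashes. This is an illustrative model, not Pachyderm's internal bookkeeping, and the hash values are made up:

```python
def changed_datums(before, after):
    """Datums in `after` that are new or whose content hash changed.
    `before` and `after` map datum path -> content hash."""
    return {d for d, h in after.items() if before.get(d) != h}

before = {"/California": "hash1", "/Vermont": "hash2"}
# Adding /California/Sacramento.json changes only California's hash.
after = {"/California": "hash1b", "/Vermont": "hash2"}
print(changed_datums(before, after))  # {'/California'}
```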

2 changes: 1 addition & 1 deletion doc/fundamentals/getting_data_out_of_pachyderm.md
@@ -43,7 +43,7 @@ edges 026536b547a44a8daa2db9d25bf88b79 754542b89c1c47a5b657e6038
edges 754542b89c1c47a5b657e60381c06c71 <none> 2 minutes ago Less than a second 22.22 KiB
```

In this case, we have one output commit per input commit on `images`. However, this might get more complicated for pipelines with multiple branches, multiple PFS inputs, etc. To confirm which commits correspond to which outputs, we can use `flush-commit`. In particular, we can call `flush-commit` on any one of our commits into `images` to see which output came from this particular commit:
In this case, we have one output commit per input commit on `images`. However, this might get more complicated for pipelines with multiple branches, multiple input atoms, etc. To confirm which commits correspond to which outputs, we can use `flush-commit`. In particular, we can call `flush-commit` on any one of our commits into `images` to see which output came from this particular commit:

```sh
$ pachctl flush-commit images/a9678d2a439648c59636688945f3c6b5
2 changes: 1 addition & 1 deletion doc/fundamentals/incrementality.md
@@ -23,7 +23,7 @@ an example. Suppose we have a pipeline with a single input that looks like this:

```json
{
"pfs": {
"atom": {
"repo": "R",
"glob": "/*",
}
8 changes: 4 additions & 4 deletions doc/getting_started/beginner_tutorial.rst
@@ -101,15 +101,15 @@ Below is the pipeline spec and python code we're using. Let's walk through the d
"image": "pachyderm/opencv"
},
"input": {
"pfs": {
"atom": {
"repo": "images",
"glob": "/*"
}
}
}
Our pipeline spec contains a few simple sections. First is the pipeline ``name``, edges. Then we have the ``transform`` which specifies the docker image we want to use, ``pachyderm/opencv`` (defaults to DockerHub as the registry), and the entry point ``edges.py``. Lastly, we specify the input. Here we only have one PFS input, our images repo with a particular glob pattern.
Our pipeline spec contains a few simple sections. First is the pipeline ``name``, edges. Then we have the ``transform`` which specifies the docker image we want to use, ``pachyderm/opencv`` (defaults to DockerHub as the registry), and the entry point ``edges.py``. Lastly, we specify the input. Here we only have one "atom" input, our images repo with a particular glob pattern.

The glob pattern defines how the input data can be broken up if we wanted to distribute our computation. ``/*`` means that each file can be processed individually, which makes sense for images. Glob patterns are one of the most powerful features of Pachyderm so when you start creating your own pipelines, check out the :doc:`../reference/pipeline_spec`.

@@ -245,13 +245,13 @@ Below is the pipeline spec for this new pipeline:
},
"input": {
"cross": [ {
"pfs": {
"atom": {
"glob": "/",
"repo": "images"
}
},
{
"pfs": {
"atom": {
"glob": "/",
"repo": "edges"
}
2 changes: 1 addition & 1 deletion doc/managing_pachyderm/data_management.md
@@ -26,7 +26,7 @@ ln -s /pfs/input/log.txt /pfs/out/logs/log.txt

Under the hood, Pachyderm is smart enough to recognize that the output file simply symlinks to a file that already exists in Pachyderm, and therefore skips the upload altogether.

Note that if your shuffling pipeline only needs the names of the input files but not their content, you can use [`lazy input`](http://pachyderm.readthedocs.io/en/latest/reference/pipeline_spec.html#pfs-input). That way, your shuffling pipeline can skip both the download and the upload. An example of this type of shuffle pipeline is [here](https://github.com/pachyderm/pachyderm/tree/master/examples/shuffle).
Note that if your shuffling pipeline only needs the names of the input files but not their content, you can use [`lazy input`](http://pachyderm.readthedocs.io/en/latest/reference/pipeline_spec.html#atom-input). That way, your shuffling pipeline can skip both the download and the upload. An example of this type of shuffle pipeline is [here](https://github.com/pachyderm/pachyderm/tree/master/examples/shuffle).
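A minimal sketch of the symlink trick described above, using a temporary directory to stand in for `/pfs` (the paths are illustrative; inside a real pipeline container, `/pfs/<input>` and `/pfs/out` are mounted for you):

```python
import os
import tempfile

# Simulated pipeline filesystem: inputs under pfs/input, outputs under pfs/out.
root = tempfile.mkdtemp()
inp = os.path.join(root, "pfs", "input", "logs")
out = os.path.join(root, "pfs", "out", "by-day")
os.makedirs(inp)
os.makedirs(out)

with open(os.path.join(inp, "log.txt"), "w") as f:
    f.write("2018-12-14 event\n")

# Symlink instead of copying: the output file points at data that already
# exists, which is what lets Pachyderm skip re-uploading the content.
os.symlink(os.path.join(inp, "log.txt"), os.path.join(out, "log.txt"))

print(os.path.islink(os.path.join(out, "log.txt")))  # True
```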

## Garbage collection

62 changes: 42 additions & 20 deletions doc/reference/pipeline_spec.md
@@ -54,7 +54,7 @@ create-pipeline](../pachctl/pachctl_create-pipeline.html) doc.
"datum_tries": int,
"job_timeout": string,
"input": {
<"pfs", "cross", "union", "cron", or "git" see below>
<"atom", "pfs", "cross", "union", "cron", or "git" see below>
},
"output_branch": string,
"egress": {
@@ -79,6 +79,19 @@ create-pipeline](../pachctl/pachctl_create-pipeline.html) doc.
"pod_spec": string
}

------------------------------------
"atom" input
------------------------------------

"atom": {
"name": string,
"repo": string,
"branch": string,
"glob": string,
"lazy" bool,
"empty_files": bool
}

------------------------------------
"pfs" input
------------------------------------
@@ -98,7 +111,7 @@ create-pipeline](../pachctl/pachctl_create-pipeline.html) doc.

"cross" or "union": [
{
"pfs": {
"atom": {
"name": string,
"repo": string,
"branch": string,
@@ -108,7 +121,7 @@ create-pipeline](../pachctl/pachctl_create-pipeline.html) doc.
}
},
{
"pfs": {
"atom": {
"name": string,
"repo": string,
"branch": string,
@@ -155,7 +168,7 @@ In practice, you rarely need to specify all the fields. Most fields either come
"cmd": ["/binary", "/pfs/data", "/pfs/out"]
},
"input": {
"pfs": {
"atom": {
"repo": "data",
"glob": "/*"
}
@@ -342,15 +355,18 @@ these fields be set for any instantiation of the object.

```
{
"pfs": pfs_input,
"atom": atom_input,
"union": [input],
"cross": [input],
"cron": cron_input
}
```

#### PFS Input
PFS inputs are the simplest inputs; they take input from a single branch on a
#### Atom Input

**Note:** Atom inputs are deprecated in Pachyderm 1.8.1+. They have been renamed to PFS inputs. The configuration is the same, but all instances of `atom` should be changed to `pfs`.

Atom inputs are the simplest inputs; they take input from a single branch on a
single repo.

```
@@ -364,20 +380,20 @@
}
```

`input.pfs.name` is the name of the input. An input with name `XXX` will be
`input.atom.name` is the name of the input. An input with name `XXX` will be
visible under the path `/pfs/XXX` when a job runs. Input names must be unique
if the inputs are crossed, but they may be duplicated between `PfsInput`s that are unioned. This is because when `PfsInput`s are unioned, you'll only ever see a datum from one input at a time. Overlapping the names of unioned inputs allows
if the inputs are crossed, but they may be duplicated between `AtomInput`s that are unioned. This is because when `AtomInput`s are unioned, you'll only ever see a datum from one input at a time. Overlapping the names of unioned inputs allows
you to write simpler code since you no longer need to consider which input directory a particular datum comes from. If an input's name is not specified, it defaults to the name of the repo. Therefore, if you have two crossed inputs from the same repo, you'll be required to give at least one of them a unique name.
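The naming rule above can be sketched as a small validation function. `validate_names` is a hypothetical helper for illustration, not part of any Pachyderm tooling:

```python
def validate_names(kind, inputs):
    """Apply the naming rule: crossed inputs need unique names, unioned
    inputs may share one. A missing name defaults to the repo name."""
    names = [i.get("name", i["repo"]) for i in inputs]
    if kind == "cross" and len(set(names)) != len(names):
        raise ValueError("crossed inputs must have unique names: %s" % names)
    return names

# Two inputs on the same repo: fine when unioned, rejected when crossed.
inputs = [{"repo": "data"}, {"repo": "data"}]
print(validate_names("union", inputs))  # ['data', 'data']
try:
    validate_names("cross", inputs)
except ValueError as e:
    print("rejected:", e)
```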

`input.pfs.repo` is the `repo` to be used for the input.
`input.atom.repo` is the `repo` to be used for the input.

`input.pfs.branch` is the `branch` to watch for commits on; it may be left blank, in
`input.atom.branch` is the `branch` to watch for commits on; it may be left blank, in
which case `"master"` will be used.

`input.pfs.glob` is a glob pattern that's used to determine how the input data
`input.atom.glob` is a glob pattern that's used to determine how the input data
is partitioned. It's explained in detail in the next section.

`input.pfs.lazy` controls how the data is exposed to jobs. The default is `false`
`input.atom.lazy` controls how the data is exposed to jobs. The default is `false`
which means the job will eagerly download the data it needs to process and it
will be exposed as normal files on disk. If lazy is set to `true`, data will be
exposed as named pipes instead and no data will be downloaded until the job
@@ -389,10 +405,16 @@ be especially notable if the job only reads a subset of the files that are
available to it. Note that `lazy` currently doesn't support datums that
contain more than 10000 files.

`input.pfs.empty_files` controls how files are exposed to jobs. If true, it will
cause files from this PFS input to be presented as empty files. This is useful in shuffle
`input.atom.empty_files` controls how files are exposed to jobs. If true, it will
cause files from this atom to be presented as empty files. This is useful in shuffle
pipelines where you want to read the names of files and reorganize them using symlinks.

#### PFS Input

**Note:** PFS inputs are only available in versions of Pachyderm 1.8.1+.

PFS inputs are the new name for atom inputs, documented above. The configuration is the same, but all instances of `atom` should be changed to `pfs`.

#### Union Input

Union inputs take the union of other inputs. For example:
@@ -417,7 +439,7 @@ directory. This, of course, only works if your code doesn't need to be
aware of which of the underlying inputs the data comes from.

`input.union` is an array of inputs to union; note that these need not be
`pfs` inputs, they can also be `union` and `cross` inputs, although there's no
`atom` inputs, they can also be `union` and `cross` inputs, although there's no
reason to take a union of unions, since union is associative.

#### Cross Input
@@ -438,7 +460,7 @@ Notice that cross inputs do not take a name and maintain the names of the sub-i
In the above example you would see files under `/pfs/inputA/...` and `/pfs/inputB/...`.

`input.cross` is an array of inputs to cross; note that these need not be
`pfs` inputs, they can also be `union` and `cross` inputs, although there's no
`atom` inputs, they can also be `union` and `cross` inputs, although there's no
reason to take a cross of crosses, since cross products are associative.
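The difference in how union and cross combine datums can be sketched with plain Python lists; the datum names here are made up for illustration:

```python
from itertools import product

# Datums produced by two hypothetical inputs.
a = ["a1", "a2"]          # e.g. repo A with a glob matching 2 files
b = ["b1", "b2", "b3"]    # e.g. repo B with a glob matching 3 files

# Union: each datum is seen on its own, so 2 + 3 = 5 datums total.
union_datums = a + b

# Cross: every pairing of datums, so 2 * 3 = 6 datums total.
cross_datums = list(product(a, b))

print(len(union_datums), len(cross_datums))  # 5 6
```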

#### Cron Input
@@ -461,7 +483,7 @@ satisfied the spec. The time is formatted according to [RFC
```

`input.cron.name` is the name for the input; its semantics are similar to
those of `input.pfs.name`, except that it's not optional.
those of `input.atom.name`, except that it's not optional.

`input.cron.spec` is a cron expression which specifies the schedule on
which to trigger the pipeline. To learn more about how to write schedules
@@ -488,7 +510,7 @@ Git inputs allow you to pull code from a public git URL and execute that code as
`input.git.URL` must be a URL of the form: `https://github.com/foo/bar.git`

`input.git.name` is the name for the input; its semantics are similar to
those of `input.pfs.name`. It is optional.
those of `input.atom.name`. It is optional.

`input.git.branch` is the name of the git branch to use as input

@@ -610,7 +632,7 @@ containers.

## The Input Glob Pattern

Each PFS input needs to specify a [glob pattern](../fundamentals/distributed_computing.html).
Each atom input needs to specify a [glob pattern](../fundamentals/distributed_computing.html).

Pachyderm uses the glob pattern to determine how many "datums" an input
consists of. Datums are the unit of parallelism in Pachyderm. That is,
4 changes: 2 additions & 2 deletions examples/fruit_stand/pipeline.json
@@ -14,7 +14,7 @@
"constant": 1
},
"input": {
"pfs": {
"atom": {
"repo": "data",
"glob": "/*"
}
@@ -37,7 +37,7 @@
"constant": 1
},
"input": {
"pfs": {
"atom": {
"repo": "filter",
"glob": "/*"
}
4 changes: 2 additions & 2 deletions examples/gatk/joint-call.json
@@ -26,13 +26,13 @@
"input": {
"cross": [
{
"pfs": {
"atom": {
"repo": "reference",
"glob": "/"
}
},
{
"pfs": {
"atom": {
"repo": "likelihoods",
"glob": "/"
}