Merge pull request #3282 from pachyderm/atom-pfs-migration-notes
Go back to atom inputs in documentation & examples
jdoliner committed Dec 14, 2018
2 parents 822a2e9 + fd64da3 commit d577f50
Showing 45 changed files with 215 additions and 193 deletions.
2 changes: 1 addition & 1 deletion doc/cookbook/gpus.md
@@ -62,7 +62,7 @@ An example pipeline definition for a GPU enabled Pachyderm Pipeline is as follow
"gpu": 1
},
"inputs": {
"pfs": {
"atom": {
"repo": "data",
"glob": "/*"
}
2 changes: 1 addition & 1 deletion doc/enterprise/stats.md
@@ -28,7 +28,7 @@ As mentioned above, enabling stats collection for a pipeline is as simple as add
"name": "edges"
},
"input": {
"pfs": {
"atom": {
"glob": "/*",
"repo": "images"
}
2 changes: 1 addition & 1 deletion doc/fundamentals/creating_analysis_pipelines.md
@@ -94,7 +94,7 @@ Here's an example pipeline spec:
"cmd": ["/binary", "/pfs/data", "/pfs/out"]
},
"input": {
"pfs": {
"atom": {
"repo": "data",
"glob": "/*"
}
2 changes: 1 addition & 1 deletion doc/fundamentals/creating_services.md
@@ -10,7 +10,7 @@ is an example of Jupyter service:
```json
{
"input": {
"pfs": {
"atom": {
"glob": "/",
"repo": "input"
}
22 changes: 11 additions & 11 deletions doc/fundamentals/distributed_computing.md
@@ -38,30 +38,30 @@ Defining how your data is spread out among workers is arguably the most importan

Instead of confining users to just data-distribution patterns such as Map (split everything as much as possible) and Reduce (_all_ the data must be grouped together), Pachyderm uses [Glob Patterns](https://en.wikipedia.org/wiki/Glob_(programming)) to offer incredible flexibility in defining your data distribution.

Glob patterns are defined by the user for each `pfs` within the `input` of a pipeline, and they tell Pachyderm how to divide the input data into individual "datums" that can be processed independently.
Glob patterns are defined by the user for each `atom` within the `input` of a pipeline, and they tell Pachyderm how to divide the input data into individual "datums" that can be processed independently.

```
"input": {
"pfs": {
"atom": {
"repo": "string",
"glob": "string",
}
}
```

That means you could easily define multiple PFS inputs, one with the data highly distributed and another where it's grouped together. You can then join the datums in these PFS inputs via a cross product or union (as shown above) for combined, distributed processing.
That means you could easily define multiple "atoms", one with the data highly distributed and another where it's grouped together. You can then join the datums in these atoms via a cross product or union (as shown above) for combined, distributed processing.

```
"input": {
"cross" or "union": [
{
"pfs": {
"atom": {
"repo": "string",
"glob": "string",
}
},
{
"pfs": {
"atom": {
"repo": "string",
"glob": "string",
}
@@ -71,21 +71,21 @@ That means you could easily define multiple PFS inputs, one with the data highly
}
```

More information about PFS inputs, unions, and crosses can be found in our [Pipeline Specification](http://docs.pachyderm.io/en/latest/reference/pipeline_spec.html).
More information about "atoms," unions, and crosses can be found in our [Pipeline Specification](http://docs.pachyderm.io/en/latest/reference/pipeline_spec.html).

### Datums

Pachyderm uses the glob pattern to determine how many “datums” a PFS input consists of. Datums are the unit of parallelism in Pachyderm. That is, Pachyderm attempts to process datums in parallel whenever possible.
Pachyderm uses the glob pattern to determine how many “datums” an input atom consists of. Datums are the unit of parallelism in Pachyderm. That is, Pachyderm attempts to process datums in parallel whenever possible.

If you have two workers and define 2 datums, Pachyderm will send one datum to each worker. In a scenario where there are more datums than workers, Pachyderm will queue up extra datums and send them to workers as they finish processing previous datums.
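The queueing behavior described above can be sketched in a few lines of Python. This is a toy model, not Pachyderm's actual scheduler; real workers pull queued datums as they finish, which a round-robin hand-out approximates when datums take similar time to process:

```python
from collections import deque

def distribute(datums, num_workers):
    """Toy model of datum scheduling: each worker takes the next
    queued datum as soon as it is free."""
    queue = deque(datums)
    assignments = {w: [] for w in range(num_workers)}
    worker = 0
    while queue:
        assignments[worker].append(queue.popleft())
        worker = (worker + 1) % num_workers
    return assignments

# Two workers, five datums: extras are queued and handed out in turn.
print(distribute(["d1", "d2", "d3", "d4", "d5"], 2))
# {0: ['d1', 'd3', 'd5'], 1: ['d2', 'd4']}
```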

### Defining Datums via Glob Patterns

Intuitively, you should think of the PFS input repo as a file system where the glob pattern is being applied to the root of the file system. The files and directories that match the glob pattern are considered datums.
Intuitively, you should think of the input atom repo as a file system where the glob pattern is being applied to the root of the file system. The files and directories that match the glob pattern are considered datums.

For example, a glob pattern of just `/` would denote the entire input repo as a single datum. All of the input data would be given to a single worker similar to a typical reduce-style operation.

Another commonly used glob pattern is `/*`. `/*` would define each top level object (file or directory) in the PFS input repo as its own datum. If you have a repo with just 10 files in it and no directory structure, every file would be a datum and could be processed independently. This is similar to a typical map-style operation.
Another commonly used glob pattern is `/*`. `/*` would define each top level object (file or directory) in the input atom repo as its own datum. If you have a repo with just 10 files in it and no directory structure, every file would be a datum and could be processed independently. This is similar to a typical map-style operation.

But Pachyderm can do anything in between too. If you have a directory structure with each state as a directory and a file for each city such as:

@@ -105,7 +105,7 @@ and you need to process all the data for a given state together, `/*` would also

If we instead used the glob pattern `/*/*` for the states example above, each `<city>.json` would be its own datum.

Glob patterns also let you take only a particular directory (or subset of directories) as a PFS input instead of the whole input repo. If we create a pipeline that is specifically only for California, we can use a glob pattern of `/California/*` to only use the data in that directory as input to our pipeline.
Glob patterns also let you take only a particular directory (or subset of directories) as an input atom instead of the whole input repo. If we create a pipeline that is specifically only for California, we can use a glob pattern of `/California/*` to only use the data in that directory as input to our pipeline.
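The glob-to-datum mapping described here can be modeled in Python. This is a simplified sketch of the semantics, not Pachyderm's actual matcher; `datums` is a hypothetical helper that matches path segments one at a time so that `*` never crosses a `/`:

```python
from fnmatch import fnmatchcase

def datums(paths, glob):
    """Return the set of datums a glob pattern produces: the file or
    directory prefixes of `paths` whose segments match the pattern."""
    if glob == "/":
        return {"/"}  # the whole repo is a single datum
    pat_parts = glob.strip("/").split("/")
    found = set()
    for path in paths:
        parts = path.strip("/").split("/")
        if len(parts) < len(pat_parts):
            continue
        prefix = parts[: len(pat_parts)]
        # Match segment by segment so "*" stays within one path level.
        if all(fnmatchcase(p, q) for p, q in zip(prefix, pat_parts)):
            found.add("/" + "/".join(prefix))
    return found

repo = [
    "/California/San-Francisco.json",
    "/California/Sacramento.json",
    "/Vermont/Burlington.json",
]
print(sorted(datums(repo, "/*")))             # ['/California', '/Vermont']
print(sorted(datums(repo, "/*/*")))           # one datum per city file
print(sorted(datums(repo, "/California/*")))  # only California's cities
```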

### Only Processing New Data

@@ -127,7 +127,7 @@ Let's look at our states example with a few different glob patterns to demonstra
...
```

If our glob pattern is `/`, then the entire PFS input is a single datum, which means anytime any file or directory is changed in our input, all the data will be processed from scratch. There are plenty of use cases where this is exactly what we need (e.g. some machine learning training algorithms).
If our glob pattern is `/`, then the entire input atom is a single datum, which means anytime any file or directory is changed in our input, all the data will be processed from scratch. There are plenty of use cases where this is exactly what we need (e.g. some machine learning training algorithms).

If our glob pattern is `/*`, then each state directory is its own datum and we'll only process the ones that have changed. So if we add a new city file, `Sacramento.json`, to the `/California` directory, _only_ the California datum will be reprocessed.
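The incremental behavior above can be sketched as a set difference over datum hashes. This is an illustrative model, not Pachyderm's internal bookkeeping, and the hash values are made up:

```python
def changed_datums(before, after):
    """Datums in `after` that are new or whose content hash changed.
    `before` and `after` map datum path -> content hash."""
    return {d for d, h in after.items() if before.get(d) != h}

before = {"/California": "hash1", "/Vermont": "hash2"}
# Adding /California/Sacramento.json changes only California's hash.
after = {"/California": "hash1b", "/Vermont": "hash2"}
print(changed_datums(before, after))  # {'/California'}
```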

2 changes: 1 addition & 1 deletion doc/fundamentals/getting_data_out_of_pachyderm.md
@@ -43,7 +43,7 @@ edges 026536b547a44a8daa2db9d25bf88b79 754542b89c1c47a5b657e6038
edges 754542b89c1c47a5b657e60381c06c71 <none> 2 minutes ago Less than a second 22.22 KiB
```

In this case, we have one output commit per input commit on `images`. However, this might get more complicated for pipelines with multiple branches, multiple PFS inputs, etc. To confirm which commits correspond to which outputs, we can use `flush-commit`. In particular, we can call `flush-commit` on any one of our commits into `images` to see which output came from this particular commit:
In this case, we have one output commit per input commit on `images`. However, this might get more complicated for pipelines with multiple branches, multiple input atoms, etc. To confirm which commits correspond to which outputs, we can use `flush-commit`. In particular, we can call `flush-commit` on any one of our commits into `images` to see which output came from this particular commit:

```sh
$ pachctl flush-commit images/a9678d2a439648c59636688945f3c6b5
2 changes: 1 addition & 1 deletion doc/fundamentals/incrementality.md
@@ -23,7 +23,7 @@ an example. Suppose we have a pipeline with a single input that looks like this:

```json
{
"pfs": {
"atom": {
"repo": "R",
"glob": "/*",
}
8 changes: 4 additions & 4 deletions doc/getting_started/beginner_tutorial.rst
@@ -101,15 +101,15 @@ Below is the pipeline spec and python code we're using. Let's walk through the d
"image": "pachyderm/opencv"
},
"input": {
"pfs": {
"atom": {
"repo": "images",
"glob": "/*"
}
}
}
Our pipeline spec contains a few simple sections. First is the pipeline ``name``, edges. Then we have the ``transform`` which specifies the docker image we want to use, ``pachyderm/opencv`` (defaults to DockerHub as the registry), and the entry point ``edges.py``. Lastly, we specify the input. Here we only have one PFS input, our images repo with a particular glob pattern.
Our pipeline spec contains a few simple sections. First is the pipeline ``name``, edges. Then we have the ``transform`` which specifies the docker image we want to use, ``pachyderm/opencv`` (defaults to DockerHub as the registry), and the entry point ``edges.py``. Lastly, we specify the input. Here we only have one "atom" input, our images repo with a particular glob pattern.

The glob pattern defines how the input data can be broken up if we wanted to distribute our computation. ``/*`` means that each file can be processed individually, which makes sense for images. Glob patterns are one of the most powerful features of Pachyderm so when you start creating your own pipelines, check out the :doc:`../reference/pipeline_spec`.

@@ -245,13 +245,13 @@ Below is the pipeline spec for this new pipeline:
},
"input": {
"cross": [ {
"pfs": {
"atom": {
"glob": "/",
"repo": "images"
}
},
{
"pfs": {
"atom": {
"glob": "/",
"repo": "edges"
}
2 changes: 1 addition & 1 deletion doc/managing_pachyderm/data_management.md
@@ -26,7 +26,7 @@ ln -s /pfs/input/log.txt /pfs/out/logs/log.txt

Under the hood, Pachyderm is smart enough to recognize that the output file simply symlinks to a file that already exists in Pachyderm, and therefore skips the upload altogether.

Note that if your shuffling pipeline only needs the names of the input files but not their content, you can use [`lazy input`](http://pachyderm.readthedocs.io/en/latest/reference/pipeline_spec.html#pfs-input). That way, your shuffling pipeline can skip both the download and the upload. An example of this type of shuffle pipeline is [here](https://github.com/pachyderm/pachyderm/tree/master/examples/shuffle).
Note that if your shuffling pipeline only needs the names of the input files but not their content, you can use [`lazy input`](http://pachyderm.readthedocs.io/en/latest/reference/pipeline_spec.html#atom-input). That way, your shuffling pipeline can skip both the download and the upload. An example of this type of shuffle pipeline is [here](https://github.com/pachyderm/pachyderm/tree/master/examples/shuffle).
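A minimal sketch of the symlink trick described above, using a temporary directory to stand in for `/pfs` (the paths are illustrative; inside a real pipeline container, `/pfs/<input>` and `/pfs/out` are mounted for you):

```python
import os
import tempfile

# Simulated pipeline filesystem: inputs under pfs/input, outputs under pfs/out.
root = tempfile.mkdtemp()
inp = os.path.join(root, "pfs", "input", "logs")
out = os.path.join(root, "pfs", "out", "by-day")
os.makedirs(inp)
os.makedirs(out)

with open(os.path.join(inp, "log.txt"), "w") as f:
    f.write("2018-12-14 event\n")

# Symlink instead of copying: the output file points at data that already
# exists, which is what lets Pachyderm skip re-uploading the content.
os.symlink(os.path.join(inp, "log.txt"), os.path.join(out, "log.txt"))

print(os.path.islink(os.path.join(out, "log.txt")))  # True
```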

## Garbage collection

62 changes: 42 additions & 20 deletions doc/reference/pipeline_spec.md
@@ -54,7 +54,7 @@ create-pipeline](../pachctl/pachctl_create-pipeline.html) doc.
"datum_tries": int,
"job_timeout": string,
"input": {
<"pfs", "cross", "union", "cron", or "git" see below>
<"atom", "pfs", "cross", "union", "cron", or "git" see below>
},
"output_branch": string,
"egress": {
@@ -79,6 +79,19 @@ create-pipeline](../pachctl/pachctl_create-pipeline.html) doc.
"pod_spec": string
}

------------------------------------
"atom" input
------------------------------------

"atom": {
"name": string,
"repo": string,
"branch": string,
"glob": string,
"lazy" bool,
"empty_files": bool
}

------------------------------------
"pfs" input
------------------------------------
@@ -98,7 +111,7 @@ create-pipeline](../pachctl/pachctl_create-pipeline.html) doc.

"cross" or "union": [
{
"pfs": {
"atom": {
"name": string,
"repo": string,
"branch": string,
@@ -108,7 +121,7 @@ create-pipeline](../pachctl/pachctl_create-pipeline.html) doc.
}
},
{
"pfs": {
"atom": {
"name": string,
"repo": string,
"branch": string,
@@ -155,7 +168,7 @@ In practice, you rarely need to specify all the fields. Most fields either come
"cmd": ["/binary", "/pfs/data", "/pfs/out"]
},
"input": {
"pfs": {
"atom": {
"repo": "data",
"glob": "/*"
}
@@ -342,15 +355,18 @@ these fields be set for any instantiation of the object.

```
{
"pfs": pfs_input,
"atom": atom_input,
"union": [input],
"cross": [input],
"cron": cron_input
}
```

#### PFS Input
PFS inputs are the simplest inputs; they take input from a single branch on a
#### Atom Input

**Note:** Atom inputs are deprecated in Pachyderm 1.8.1+. They have been renamed to PFS inputs. The configuration is the same, but all instances of `atom` should be changed to `pfs`.

Atom inputs are the simplest inputs; they take input from a single branch on a
single repo.

```
@@ -364,20 +380,20 @@
}
```

`input.pfs.name` is the name of the input. An input with name `XXX` will be
`input.atom.name` is the name of the input. An input with name `XXX` will be
visible under the path `/pfs/XXX` when a job runs. Input names must be unique
if the inputs are crossed, but they may be duplicated between `PfsInput`s that are unioned. This is because when `PfsInput`s are unioned, you'll only ever see a datum from one input at a time. Overlapping the names of unioned inputs allows
if the inputs are crossed, but they may be duplicated between `AtomInput`s that are unioned. This is because when `AtomInput`s are unioned, you'll only ever see a datum from one input at a time. Overlapping the names of unioned inputs allows
you to write simpler code since you no longer need to consider which input directory a particular datum comes from. If an input's name is not specified, it defaults to the name of the repo. Therefore, if you have two crossed inputs from the same repo, you'll be required to give at least one of them a unique name.
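The naming rule above can be sketched as a small validation function. `validate_names` is a hypothetical helper for illustration, not part of any Pachyderm tooling:

```python
def validate_names(kind, inputs):
    """Apply the naming rule: crossed inputs need unique names, unioned
    inputs may share one. A missing name defaults to the repo name."""
    names = [i.get("name", i["repo"]) for i in inputs]
    if kind == "cross" and len(set(names)) != len(names):
        raise ValueError("crossed inputs must have unique names: %s" % names)
    return names

# Two inputs on the same repo: fine when unioned, rejected when crossed.
inputs = [{"repo": "data"}, {"repo": "data"}]
print(validate_names("union", inputs))  # ['data', 'data']
try:
    validate_names("cross", inputs)
except ValueError as e:
    print("rejected:", e)
```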

`input.pfs.repo` is the `repo` to be used for the input.
`input.atom.repo` is the `repo` to be used for the input.

`input.pfs.branch` is the `branch` to watch for commits on; it may be left blank, in
`input.atom.branch` is the `branch` to watch for commits on; it may be left blank, in
which case `"master"` will be used.

`input.pfs.glob` is a glob pattern that's used to determine how the input data
`input.atom.glob` is a glob pattern that's used to determine how the input data
is partitioned. It's explained in detail in the next section.

`input.pfs.lazy` controls how the data is exposed to jobs. The default is `false`
`input.atom.lazy` controls how the data is exposed to jobs. The default is `false`
which means the job will eagerly download the data it needs to process and it
will be exposed as normal files on disk. If lazy is set to `true`, data will be
exposed as named pipes instead and no data will be downloaded until the job
@@ -389,10 +405,16 @@ be especially notable if the job only reads a subset of the files that are
available to it. Note that `lazy` currently doesn't support datums that
contain more than 10000 files.

`input.pfs.empty_files` controls how files are exposed to jobs. If true, it will
cause files from this PFS input to be presented as empty files. This is useful in shuffle
`input.atom.empty_files` controls how files are exposed to jobs. If true, it will
cause files from this atom to be presented as empty files. This is useful in shuffle
pipelines where you want to read the names of files and reorganize them using symlinks.

#### PFS Input

**Note:** PFS inputs are only available in versions of Pachyderm 1.8.1+.

PFS inputs are the new name for atom inputs, documented above. The configuration is the same, but all instances of `atom` should be changed to `pfs`.

#### Union Input

Union inputs take the union of other inputs. For example:
@@ -417,7 +439,7 @@ directory. This, of course, only works if your code doesn't need to be
aware of which of the underlying inputs the data comes from.

`input.union` is an array of inputs to union; note that these need not be
`pfs` inputs, they can also be `union` and `cross` inputs, although there's no
`atom` inputs, they can also be `union` and `cross` inputs, although there's no
reason to take a union of unions, since union is associative.

#### Cross Input
@@ -438,7 +460,7 @@ Notice that cross inputs do not take a name and maintain the names of the sub-i
In the above example you would see files under `/pfs/inputA/...` and `/pfs/inputB/...`.

`input.cross` is an array of inputs to cross; note that these need not be
`pfs` inputs, they can also be `union` and `cross` inputs, although there's no
`atom` inputs, they can also be `union` and `cross` inputs, although there's no
reason to take a cross of crosses, since cross products are associative.
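The difference in how union and cross combine datums can be sketched with plain Python lists; the datum names here are made up for illustration:

```python
from itertools import product

# Datums produced by two hypothetical inputs.
a = ["a1", "a2"]          # e.g. repo A with a glob matching 2 files
b = ["b1", "b2", "b3"]    # e.g. repo B with a glob matching 3 files

# Union: each datum is seen on its own, so 2 + 3 = 5 datums total.
union_datums = a + b

# Cross: every pairing of datums, so 2 * 3 = 6 datums total.
cross_datums = list(product(a, b))

print(len(union_datums), len(cross_datums))  # 5 6
```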

#### Cron Input
@@ -461,7 +483,7 @@ satisfied the spec. The time is formatted according to [RFC
```

`input.cron.name` is the name for the input; its semantics are similar to
those of `input.pfs.name`, except that it's not optional.
those of `input.atom.name`, except that it's not optional.

`input.cron.spec` is a cron expression which specifies the schedule on
which to trigger the pipeline. To learn more about how to write schedules
@@ -488,7 +510,7 @@ Git inputs allow you to pull code from a public git URL and execute that code as
`input.git.URL` must be a URL of the form: `https://github.com/foo/bar.git`

`input.git.name` is the name for the input; its semantics are similar to
those of `input.pfs.name`. It is optional.
those of `input.atom.name`. It is optional.

`input.git.branch` is the name of the git branch to use as input

@@ -610,7 +632,7 @@ containers.

## The Input Glob Pattern

Each PFS input needs to specify a [glob pattern](../fundamentals/distributed_computing.html).
Each atom input needs to specify a [glob pattern](../fundamentals/distributed_computing.html).

Pachyderm uses the glob pattern to determine how many "datums" an input
consists of. Datums are the unit of parallelism in Pachyderm. That is,
4 changes: 2 additions & 2 deletions examples/fruit_stand/pipeline.json
@@ -14,7 +14,7 @@
"constant": 1
},
"input": {
"pfs": {
"atom": {
"repo": "data",
"glob": "/*"
}
@@ -37,7 +37,7 @@
"constant": 1
},
"input": {
"pfs": {
"atom": {
"repo": "filter",
"glob": "/*"
}
4 changes: 2 additions & 2 deletions examples/gatk/joint-call.json
@@ -26,13 +26,13 @@
"input": {
"cross": [
{
"pfs": {
"atom": {
"repo": "reference",
"glob": "/"
}
},
{
"pfs": {
"atom": {
"repo": "likelihoods",
"glob": "/"
}