Merge pull request #143 from epigen/master
merge master back into dev.
nsheff committed Jul 8, 2017
2 parents 902ed24 + d6e4602 commit ab7c20a
Showing 46 changed files with 3,771 additions and 3,305 deletions.
1 change: 1 addition & 0 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -13,5 +13,6 @@ install:
script: pytest
branches:
only:
- 0.6-rc2
- dev
- master
7 changes: 3 additions & 4 deletions doc/source/advanced.rst
Original file line number Diff line number Diff line change
Expand Up @@ -29,16 +29,15 @@ Note: to handle different *classes* of input files, like read1 and read2, these
Connecting to multiple pipelines
****************************************

If you have a project that contains samples of different types, then you may need to specify multiple pipeline repositories to your project. Starting in version 0.5, looper can handle a priority list of pipeline directories in the metadata.pipelines_dir
attribute.
If you have a project that contains samples of different types, then you may need to specify multiple pipeline repositories for your project. Starting in version 0.5, looper can handle a priority list of pipelines. Starting with version 0.6, these pointers should point directly at pipeline interface files (instead of at directories, as previously) in the metadata.pipeline_interfaces attribute.

For example:

.. code-block:: yaml
metadata:
pipelines_dir: [pipeline1, pipeline2]
pipeline_interfaces: [pipeline_iface1.yaml, pipeline_iface2.yaml]
In this case, for a given sample, looper will first look in the pipeline1 directory to see if appropriate pipeline exists for this sample type. If it finds one, it will use this pipeline (or set of pipelines, as specified in the protocol_mappings.yaml file). Having submitted a suitable pipeline it will ignore the pipeline2 directory. However if there is no suitable pipeline in the first directory, looper will check the second directory and, if it finds a match, will submit that. If no suitable pipelines are found in any of the directories, the sample will be skipped as usual.
In this case, for a given sample, looper will first look in pipeline_iface1.yaml to see if an appropriate pipeline exists for this sample type. If it finds one, it will use this pipeline (or set of pipelines, as specified in the interface's protocol mapping). Having submitted a suitable pipeline, it will ignore the pipeline_iface2.yaml interface. However, if there is no suitable pipeline in the first interface, looper will check the second and, if it finds a match, will submit that. If no suitable pipeline is found in any of the interfaces, the sample will be skipped as usual.
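To make the lookup concrete, here is a sketch of what an interface's protocol mapping might look like (file, protocol, and pipeline names are hypothetical; see the pipeline interface documentation for the exact schema):

.. code-block:: yaml

    # Hypothetical pipeline_iface1.yaml (sketch)
    protocol_mapping:
      RRBS: rrbs.py          # samples with this protocol go to this pipeline
      ATAC-seq: atacseq.py

A sample whose protocol appears here is submitted from this interface; otherwise looper falls through to the next interface in the list.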

15 changes: 7 additions & 8 deletions doc/source/changelog.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,9 +9,7 @@ Changelog

- Add support for Python 3

- Merges pipeline interface and protocol mappings. This means we now allow direct pointers
to ``pipeline_interface.yaml`` files, increasing flexibility, so this relaxes the specified
folder structure that was previously used for ``pipelines_dir`` (with ``config`` subfolder).
- Merges pipeline interface and protocol mappings. This means we now allow direct pointers to ``pipeline_interface.yaml`` files, increasing flexibility, so this relaxes the specified folder structure that was previously used for ``pipelines_dir`` (with ``config`` subfolder).

- Allow URLs as paths to sample sheets.

Expand All @@ -21,14 +19,15 @@ Changelog

- Changed LOOPERENV environment variable to PEPENV, generalizing it to generic models

- Changed name of ``pipelines_dir`` to ``pipeline_interfaces`` (but maintained backwards compatibility for now).

- Changed name of ``run`` column to ``toggle``, since ``run`` can also refer to a sequencing run.

- Relaxes many constraints (like resources sections, pipelines_dir columns), making project
configuration files useful outside looper. This moves us closer to dividing models from looper,
and improves flexibility.
- Relaxes many constraints (like resources sections, pipelines_dir columns), making project configuration files useful outside looper. This moves us closer to dividing models from looper, and improves flexibility.

- Various small bug fixes and dev improvements.

- Require `setuptools` for installation, and `pandas 0.20.2`. If `numexpr` is installed, version `2.6.2` is required.

- **v0.5** (*2017-03-01*):

Expand All @@ -46,7 +45,7 @@ Changelog

- Complete overhaul of logging and test infrastructure, using the logging and pytest packages

- Fixes
- Changed

- Removes pipelines_dir requirement for models, making it useful outside looper

Expand All @@ -71,7 +70,7 @@ Changelog

- Support for portable and pipeline-independent allocation of computing resources with Looperenv.

- Fixes
- Changed

- Removed requirement to have ``pipelines`` repository installed in order to extend base Sample objects

Expand Down
24 changes: 14 additions & 10 deletions doc/source/cluster-computing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,26 +3,29 @@
Cluster computing
=============================================


By default, looper will build a shell script for each sample and then run each sample serially on the local computer. But where looper really excels is in large projects that require submitting these jobs to a cluster resource manager (like SLURM, SGE, or LSF). Looper handles the interface to the resource manager so that projects and pipelines can be moved to different environments with ease.

To configure looper to use cluster computing, all you have to do is tell looper a few things about your cluster setup: you create a configuration file (`compute_config.yaml`) and point an environment variable (``PEPENV``) to this file, and that's it! Complete, step-by-step instructions and examples are available in the pepenv repository at https://github.com/pepkit/pepenv.
To configure looper to use cluster computing, all you have to do is tell looper a few things about your cluster setup: you create a configuration file (`compute_config.yaml`) and point an environment variable (``PEPENV``) to this file, and that's it!

Following is a brief overview to familiarize you with how this will work. When you're ready to hook looper up to your compute cluster, you should follow the complete, step-by-step instructions and examples in the pepenv repository at https://github.com/pepkit/pepenv.

Compute config overview
PEPENV overview
****************************************

If you're not quite ready to set it up and just want an overview, here is an example ``compute_config.yaml`` file that works with a SLURM environment:
Here is an example ``compute_config.yaml`` file that works with a SLURM environment:

.. code-block:: yaml
compute:
default:
submission_template: pipelines/templates/local_template.sub
submission_template: templates/local_template.sub
submission_command: sh
local:
submission_template: pipelines/templates/local_template.sub
submission_template: templates/local_template.sub
submission_command: sh
slurm:
submission_template: pipelines/templates/slurm_template.sub
submission_template: templates/slurm_template.sub
submission_command: sbatch
partition: queue_name
Expand All @@ -31,12 +34,13 @@ The sub-sections below ``compute`` each define a "compute package" that can be a

There are two or three sub-parameters for a compute package:

- **submission_template** is a (relative or absolute) path to the template submission script. Templates are described in more detail in the `pepenv readme <https://github.com/pepkit/pepenv>`_.
- **submission_template** is a (relative or absolute) path to the template submission script. Template files contain variables that are populated with values for each job you submit. More details are in the `pepenv readme <https://github.com/pepkit/pepenv>`_.
- **submission_command** is the command-line command that `looper` will prepend to the path of the produced submission script to actually run it (`sbatch` for SLURM, `qsub` for SGE, `sh` for localhost, etc.).
- **partition** is necessary only if you need to specify a queue name
- **partition** specifies a queue name (optional).
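For orientation, a minimal SLURM submission template might look like the sketch below. Only ``{CODE}`` is a documented reserved variable; the other variable names are illustrative, and the real templates in the pepenv repository should be treated as authoritative.

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name={JOBNAME}     # illustrative variable name
    #SBATCH --partition={PARTITION}  # filled from the compute package
    #SBATCH --mem={MEM}              # illustrative resource variables
    #SBATCH --time={TIME}

    {CODE}  # reserved: replaced with the command that runs the pipeline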


Submission templates
Resources
****************************************
A template uses variables (encoded like `{VARIABLE}`), which are populated independently for each sample as defined in `pipeline_interface.yaml`. The variable ``{CODE}`` is reserved and refers to the actual python command that will run the pipeline. Otherwise, you can use any variables you define in your `pipeline_interface.yaml`. The `Templates <https://github.com/pepkit/pepenv/tree/master/templates>`__ directory contains example submission templates for `SLURM <https://github.com/pepkit/pepenv/blob/master/templates/slurm_template.sub>`__, `SGE <https://github.com/pepkit/pepenv/blob/master/templates/sge_template.sub>`__, and `local runs <https://github.com/pepkit/pepenv/blob/master/templates/localhost_template.sub>`__. You can also create your own templates, giving looper ultimate flexibility to work with any compute infrastructure in any environment.
You may notice that the compute config file does not specify the resources to request (like memory, CPUs, or time). Yet these are also required in order to submit a job to a cluster. In the looper system, **resources are not handled by the pepenv file** because they are not tied to a particular computing environment; instead, they are variable and specific to a pipeline and a sample. As such, these items are defined in the ``pipeline_interface.yaml`` file (``pipelines`` section) that connects looper to a pipeline. The reason for this is that the pipeline developer is the most likely to know what sort of resources her pipeline requires, so she is in the best position to define the resources requested.
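As an illustrative sketch (key names assumed, not authoritative; consult the pipeline interface documentation for the exact schema), a resources block inside ``pipeline_interface.yaml`` might look like:

.. code-block:: yaml

    pipelines:
      rnaseq.py:            # hypothetical pipeline
        resources:
          default:
            cores: "2"
            mem: "8000"     # memory in MB (assumed convention)
            time: "08:00:00"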

For more information on how to adjust resources, see the :ref:`pipeline interface <pipeline-interface-pipelines>` documentation. If all the different configuration files seem confusing, now would be a good time to review :doc:`who's who in configuration files <config-files>`.
1 change: 1 addition & 0 deletions doc/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -140,6 +140,7 @@
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_static_path = [] # it's empty; suppress warning

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
Expand Down
19 changes: 10 additions & 9 deletions doc/source/config-files.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,28 +2,29 @@
Configuration files
=========================

Looper uses `YAML <http://www.yaml.org/>`_ configuration files to describe a project. Looper is a very modular system, so there are few different YAML files. Here's an explanation of each. Which ones you need to know about will depend on whether you're a pipeline user (running pipelines on your project) or a pipeline developer (building your own pipeline).
Looper uses `YAML <http://www.yaml.org/>`_ configuration files for several purposes. Looper is designed to be organized, modular, and very configurable, so there are several configuration files. We've organized the configuration files so they each handle a different level of infrastructure: environment, project, sample, or pipeline. This makes the system very adaptable and portable, but for a newcomer, it is easy to confuse what the different configuration files are used for. So, here's an explanation of each for you to use as a reference until you are familiar with the whole ecosystem. Which ones you need to know about will depend on whether you're a pipeline user (running pipelines on your project) or a pipeline developer (building your own pipeline).


Pipeline users
*****************

Users (non-developers) of pipelines only need to be aware of one YAML file:
Users (non-developers) of pipelines only need to be aware of one or two YAML files:

- :ref:`project config file <project-config-file>`: This file is specific to each project and contains information about the project's metadata, where the processed files should be saved, and other variables that allow to configure the pipelines specifically for this project.
- :ref:`project config file <project-config-file>`: This file is specific to each project and contains information about the project's metadata, where the processed files should be saved, and other variables that let you configure the pipelines specifically for this project. This file follows the standard looper format (now referred to as ``PEP`` format).

If you are planning to submit jobs to a cluster, then you need to know about a second YAML file:

- looper environment configuration: (in progress). This file tells looper how to use compute resource managers.
- :ref:`PEPENV environment config <cluster-resource-managers>`: This file tells looper how to use compute resource managers, like SLURM. This file doesn't require much editing or maintenance beyond initial setup.

Pipeline developers
*****************
That should be all you need to worry about as a pipeline user. If you need to adjust compute resources or want to develop a pipeline or have more advanced project-level control over pipelines, then you'll need to know about a few others:

If you want to add a new pipeline to looper, then you need to know about a configuration file that coordinates linking your pipeline in to your looper project.
Pipeline developers
**********************

- :doc:`pipeline interface file <pipeline-interface>`: Has two sections: ``protocol_mapping`` tells looper which pipelines exist, and how to map each protocol (sample data type) to its pipelines. ``pipelines`` links looper to the pipelines; describes variables, options and paths that the pipeline needs to know to run; and outlines resource requirements for cluster managers.
If you want to add a new pipeline to looper, tweak the way looper interacts with a pipeline for a given project, or change the default cluster resources requested by a pipeline, then you need to know about the configuration file that links your pipeline into your looper project.

- :doc:`pipeline interface file <pipeline-interface>`: Has two sections: 1) ``protocol_mapping`` tells looper which pipelines exist, and how to map each protocol (sample data type) to its pipelines; 2) ``pipelines`` links looper to the pipelines by describing options, arguments, and compute resources that the pipeline needs to run.
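A minimal sketch of the two sections (all names hypothetical; the pipeline interface page documents the real schema):

.. code-block:: yaml

    protocol_mapping:
      RNA-seq: rnaseq_pipeline
    pipelines:
      rnaseq_pipeline:
        name: rnaseq
        path: path/to/rnaseq.py
        arguments:
          "--sample-name": sample_name  # pass sample attributes as CLI args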

Finally, if you're using Pypiper to develop pipelines, it uses a pipeline-specific configuration file (detailed in the Pypiper documentation):

- `Pypiper pipeline config file <http://pypiper.readthedocs.io/en/latest/advanced.html#pipeline-config-files>`_: Each pipeline may have a configuration file describing where software is, and parameters to use for tasks within the pipeline
- `Pypiper pipeline config file <http://pypiper.readthedocs.io/en/latest/advanced.html#pipeline-config-files>`_: Each pipeline may (optionally) provide a configuration file describing where software is and what parameters to use for tasks within the pipeline. By default, this file is named identically to the pypiper script, with a `.yaml` extension instead of `.py` (so `rna_seq.py` looks for an accompanying `rna_seq.yaml` by default). These files can be changed on a per-project level using the :ref:`pipeline_config section in your project config file <pipeline-config-section>`.
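For example, pointing a project at a custom pipeline config might look like the following sketch in the project config file (paths hypothetical; the pipeline_config section of the project config documentation describes the exact behavior):

.. code-block:: yaml

    pipeline_config:
      rna_seq.py: /path/to/project_specific_rna_seq.yaml  # override for this project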
37 changes: 0 additions & 37 deletions doc/source/connecting-pipelines.rst

This file was deleted.

14 changes: 7 additions & 7 deletions doc/source/define-your-project.rst
Original file line number Diff line number Diff line change
Expand Up @@ -15,13 +15,13 @@ The format is simple and modular, so you only need to define the components you

.. code-block:: yaml
metadata:
sample_annotation: /path/to/sample_annotation.csv
output_dir: /path/to/output/folder
pipelines_dir: /path/to/pipelines/repository
metadata:
sample_annotation: /path/to/sample_annotation.csv
output_dir: /path/to/output/folder
pipeline_interfaces: path/to/pipeline_interface.yaml
The **output_dir** describes where you want to save pipeline results, and **pipelines_dir** describes where your pipeline code is stored. You will also need a second file to describe samples, which is a comma-separated value (``csv``) file containing at least a unique identifier column named ``sample_name``, a column named ``library`` describing the sample type, and some way of specifying an input file. Here's a minimal example of **sample_annotation.csv**:
The **output_dir** key specifies where to save results. The **pipeline_interfaces** key points to your looper-compatible pipelines (described in :doc:`linking the pipeline interface <pipeline-interface>`). The **sample_annotation** key points to another file, which is a comma-separated value (``csv``) file describing samples in the project. Here's a small example of **sample_annotation.csv**:


.. csv-table:: Minimal Sample Annotation Sheet
Expand All @@ -40,8 +40,8 @@ For example, by default, your jobs will run serially on your local computer, whe

Let's go through the more advanced details of both annotation sheets and project config files:

.. include:: sample-annotation-sheet.rst
.. include:: sample-annotation-sheet.rst.inc

.. include:: project-config.rst
.. include:: project-config.rst.inc


4 changes: 2 additions & 2 deletions doc/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -27,10 +27,10 @@ Contents
advanced.rst

.. toctree::
:caption: Developing Pipelines
:caption: Developing and Linking Pipelines
:maxdepth: 2

connecting-pipelines.rst
pipeline-interface.rst
config-files.rst

.. toctree::
Expand Down
3 changes: 0 additions & 3 deletions doc/source/inputs.rst

This file was deleted.
