Merge pull request #103 from epigen/iface
Merging pipeline_interface and protocol_mappings
nsheff committed May 27, 2017
2 parents 15e15cc + d39488e commit bccd053
Showing 11 changed files with 569 additions and 152 deletions.
13 changes: 11 additions & 2 deletions doc/source/changelog.rst
@@ -5,9 +5,18 @@ Changelog

 - New
 
-  - Adds support for implied_column section of the project config file
+  - Add support for implied_column section of the project config file
 
+  - Add support for Python 3
+
+  - Merges pipeline interface and protocol mappings. This means we now allow direct pointers
+    to ``pipeline_interface.yaml`` files, increasing flexibility, so this relaxes the specified
+    folder structure that was previously used for ``pipelines_dir`` (with ``config`` subfolder).
+
+- Changed
+
+  - Changed name of ``run`` column to ``toggle``, since ``run`` can also refer to a sequencing run.
+
-  - Adds support for Python 3
 
 - **v0.5** (*2017-03-01*):

6 changes: 2 additions & 4 deletions doc/source/config-files.rst
@@ -19,11 +19,9 @@ If you are planning to submit jobs to a cluster, then you need to know about a s
 Pipeline developers
 *****************
 
-If you want to add a new pipeline to looper, then there are two YAML files that coordinate linking your pipeline in to your looper project.
+If you want to add a new pipeline to looper, then you need to know about a configuration file that coordinates linking your pipeline into your looper project.
 
-- :doc:`protocol mapping file <protocol-mapping>`: Tell looper which pipelines exist, and how to map each library (sample data type) to its pipelines.
-
-- :doc:`pipeline interface file <pipeline-interface>`: Links looper to the pipelines, describes variables, options and paths that the pipeline needs to know to run, and resource requirements for cluster managers.
+- :doc:`pipeline interface file <pipeline-interface>`: Has two sections: ``protocol_mapping`` tells looper which pipelines exist, and how to map each protocol (sample data type) to its pipelines. ``pipelines`` links looper to the pipelines; describes variables, options and paths that the pipeline needs to know to run; and outlines resource requirements for cluster managers.
 
 
 Finally, if you're using Pypiper to develop pipelines, it uses a pipeline-specific configuration file (detailed in the Pypiper documentation):
31 changes: 24 additions & 7 deletions doc/source/connecting-pipelines.rst
@@ -3,18 +3,35 @@ Connecting pipelines

 .. HINT::
 
-   Pipeline users don't need to worry about this section. This is for those who develop pipelines.
+   Pipeline users don't need to worry about this section. This is for those who develop pipelines, or those who want to use a currently defined looper project to submit to an existing pipeline that isn't already configured for looper.
 
-Looper can connect any pipeline, as long as it runs on the command line and uses text command-line arguments. These pipelines can be simple shell scripts, python scripts, perl scripts, or even pipelines built using a framework. Typically, we use python pipelines built using the `pypiper <https://databio.org/pypiper>`_ package, which provides some additional power to looper, but this is optional.
+Looper can connect samples to any pipeline, as long as it runs on the command line and uses text command-line arguments. These pipelines could be simple shell scripts, python scripts, perl scripts, or even pipelines built using a framework. Typically, we use python pipelines built using the `pypiper <https://databio.org/pypiper>`_ package, which provides some additional power to looper, but this is optional.
 
-Regardless of what pipelines you use, you will need to tell looper how to interface with your pipeline. You do that by specifying a *pipeline interface*, which currently consists of two files:
+Regardless of what pipelines you use, you will need to tell looper how to interface with your pipeline. You do that by specifying a **pipeline interface file**. The **pipeline interface** is a ``yaml`` file with two subsections:
 
-1. **Protocol mappings** - a ``yaml`` file that maps sample **library** to one or more **pipeline scripts**.
-2. **Pipeline interface** - a ``yaml`` file telling ``Looper`` the arguments and resources required by each pipeline script.
+1. ``protocol_mappings`` - maps sample ``protocol`` (aka ``library``) to one or more pipeline scripts.
+2. ``pipelines`` - describes the arguments and resources required by each pipeline script.
 
-Let's go through each one in detail:
-
-.. include:: protocol-mappings.rst
+Let's start with a very simple example. A basic ``pipeline_interface.yaml`` file may look like this:
 
+.. code-block:: yaml
+
+  protocol_mapping:
+    RRBS: rrbs_pipeline
+
+  pipelines:
+    rrbs_pipeline:
+      name: RRBS
+      path: path/to/rrbs.py
+      arguments:
+        "--sample-name": sample_name
+        "--input": data_path
+
+The first section specifies that samples of protocol ``RRBS`` will be mapped to the pipeline specified by key ``rrbs_pipeline``. The second section describes where the pipeline named ``rrbs_pipeline`` is located and what command-line arguments it requires. Pretty simple. Let's go through each of these sections in more detail:
+
+.. include:: protocol-mapping.rst
 
 .. include:: pipeline-interface.rst
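
As an illustrative aside (not part of this commit; the sample values below are invented), the ``arguments`` mapping above would lead looper to compose a command along these lines for each sample:

.. code-block:: bash

   # Hypothetical sample with sample_name "frog_1" and data_path "data/frog_1.fastq"
   # in the annotation sheet: each arguments key is passed verbatim, and each
   # value is replaced by the named sample attribute.
   path/to/rrbs.py --sample-name frog_1 --input data/frog_1.fastq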

35 changes: 18 additions & 17 deletions doc/source/pipeline-interface.rst
@@ -1,15 +1,24 @@

-Pipeline interface YAML
+Pipeline interface section: pipelines
 **************************************************
 
-The pipeline interface file describes how looper knows what arguments to pass to the pipeline and (possibly) what resources to request. For each pipeline (defined by the filename of the script itself), you specify some optional and required variables:
+The ``pipelines`` section specifies to looper which command-line arguments to pass to the pipeline. In addition, if you're using a cluster resource manager, it also specifies which compute resources to request. For each pipeline, you specify variables (some optional and some required). The possible attributes to specify for each pipeline include:
 
+- ``name`` (recommended): Name of the pipeline. This is used to assess pipeline flags (if your pipeline employs them, like pypiper pipelines).
+- ``arguments`` (required): List of key-value pairs of arguments, and attribute sources to pass to the pipeline. The key corresponds verbatim to the string that will be passed on the command line to the pipeline. The value corresponds to an attribute of the sample, which will be derived from the sample_annotation csv file (in other words, it's a column name of your sample annotation sheet).
+- ``path`` (required): Absolute or relative path to the script for this pipeline. Relative paths are considered relative to your **pipeline_interface file**.
+- ``required_input_files`` (optional): A list of sample attributes (annotation sheet column names) that will point to input files that must exist.
+- ``all_input_files`` (optional): A list of sample attributes (annotation sheet column names) that will point to input files that are not required, but if they exist, should be counted in the total size calculation for requesting resources.
+- ``ngs_input_files`` (optional): For pipelines using sequencing data, provide a list of sample attributes (annotation sheet column names) that will point to input files to be used for automatic detection of ``read_length`` and ``read_type`` sample attributes.
+
+- ``looper_args`` (optional): Provide ``True`` or ``False`` to specify if this pipeline understands looper args, which are then automatically added for:
+
+  - ``-C``: config_file (the pipeline config file specified in the project config file; or the default config file, if it exists)
+  - ``-P``: cores (the number of cores specified by the resource package chosen)
+  - ``-M``: mem (the memory limit)
+
+- ``resources`` (recommended): A section outlining how much memory, CPU, and clock time to request, modulated by input file size (details below). If the ``resources`` section is missing, looper will only be able to run the pipeline locally (not submit it to a cluster resource manager). If you provide a ``resources`` section, you must define at least 1 option named 'default' with ``file_size: 0``. Add as many additional resource sets as you want, with any names. Looper will determine which resource package to use based on the ``file_size`` of the input file. It will select the lowest resource package whose ``file_size`` attribute does not exceed the size of the input file.
+
-- **name (recommended)**: Name of the pipeline
-- **arguments (required)**: List of key-value pairs of arguments, and attribute sources to pass to the pipeline (details below).
-- **resources (required)**: A section outlining how much memory, CPU, and clock time to request, modulated by input file size (details below)
-- **required_input_files (optional)**: A list of sample attributes (annotation sheets column names) that will point to input files that must exist.
-- **all_input_files (optional)**: A list of sample attributes (annotation sheets column names) that will point to input files that are not required, but if they exist, should be counted in the total size calculation for requesting resources.
-- **ngs_input_files (optional)**: A list of sample attributes (annotation sheets column names) that will point to input files to be used for automatic detection of read_length and read_type.
 
 Example:

@@ -36,18 +45,10 @@ Example:
       time: "2-00:00:00"
       partition: "longq"
 
-``arguments`` - the key corresponds verbatim to the string that will be passed on the command line to the pipeline. The value corresponds to an attribute of the sample, which will be derived from the sample_annotation csv file (in other words, it's a column name of your sample annotation sheet).
-
-In addition to arguments you specify here, you may include ``looper_args: True`` and then looper will automatically add arguments for:
-
-- **-C**: config_file (the pipeline config file specified in the project config file; or the default config file, if it exists)
-- **-P**: cores (the number of cores specified by the resource package chosen)
-- **-M**: mem (the memory limit)
-
-``resources`` - You must define at least 1 option named 'default' with ``file_size`` = 0. Add as many additional resource sets as you want, with any names.
-The looper will determine which resource package to use based on the ``file_size`` of the input file. It will select the lowest resource package whose ``file_size`` attribute does not exceed the size of the input file.
-
-Example:
+Full example:
 
 .. code-block:: yaml
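
As a sketch of the ``resources`` logic described above (illustrative only; the package names and sizes are invented, and this block is not part of the commit), a minimal section might look like:

.. code-block:: yaml

   resources:
     # Required base package: must be named "default", with file_size 0.
     default:
       file_size: "0"
       cores: "2"
       mem: "6000"
       time: "1-00:00:00"
       partition: "longq"
     # Selected instead of default once the input size reaches this
     # file_size threshold, per the selection rule described above.
     large:
       file_size: "10"
       cores: "4"
       mem: "12000"
       time: "2-00:00:00"
       partition: "longq"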
@@ -1,14 +1,14 @@

-Protocol mapping YAML
-******************************************
+Pipeline interface section: protocol_mapping
+********************************************
 
-The protocol mappings explains how looper should map from a sample protocol (like RNA-seq) to a particular pipeline (like rnaseq.py), or group of pipelines. Here's how to build a pipeline_interface YAML file:
+The ``protocol_mapping`` section explains how looper should map from a sample protocol (like ``RNA-seq``, which is a column in your annotation sheet) to a particular pipeline (like ``rnaseq.py``), or group of pipelines. Here's how to build ``protocol_mapping``:
 
 - **Case 1:** one protocol maps to one pipeline. Example: ``RNA-seq: rnaseq.py``
 
-  Any samples that list "RNA-seq" under ``library`` will be run using the ``rnaseq.py`` pipeline. You can list as many library types as you like in the protocol mappings, mapping to as many pipelines as you provide in your pipelines repository.
+  Any samples that list "RNA-seq" under ``library`` will be run using the ``rnaseq.py`` pipeline. You can list as many library types as you like in the protocol mappings, mapping to as many pipelines as you configure in your ``pipelines`` section.
 
-Example ``protocol_mappings.yaml``:
+Example ``protocol_mappings`` section:
 
 .. code-block:: yaml
@@ -28,15 +28,10 @@ Example ``protocol_mappings.yaml``:

 You can map multiple pipelines to a single protocol if you want samples of a type to kick off more than one pipeline run.
 
-Example ``protocol_mappings.yaml``:
+Example ``protocol_mappings`` section:
 
 .. code-block:: yaml
 
   RRBS: rrbs.py
   WGBS: wgbs.py
-  SMART: >
-    rnaBitSeq.py -f,
-    rnaTopHat.py -f
+  SMART-seq: >
+    rnaBitSeq.py -f,
+    rnaTopHat.py -f
@@ -45,16 +40,14 @@ Example ``protocol_mappings.yaml``:
 - **Case 3:** a protocol runs one pipeline which depends on another. Example: ``WGBSNM: first;second;third;(fourth, fifth)``
 
   .. warning::
-    This feature, pipeline dependency is not implemented yet. This documentation describes a protocol that may be implemented in the future, if it is necessary to have dependency among pipeline submissions.
+    This feature (pipeline dependency) is not implemented yet. This documentation describes a protocol that may be implemented in the future, if it is necessary to have dependency among pipeline submissions.
 
   The basic format for pipelines run simultaneously is: ``PROTOCOL: pipeline1 [, pipeline2, ...]``. Use semi-colons to indicate dependency.
 
-Example ``protocol_mappings.yaml``:
+Example ``protocol_mappings`` section:
 
 .. code-block:: yaml
 
   RRBS: rrbs.py
   WGBS: wgbs.py
   WGBSQC: >
     wgbs.py;
-    (nnm.py, pdr.py)
+    (nnm.py, pdr.py)
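
For completeness (an illustrative sketch, not content from the commit), a set of Case 1 one-to-one mappings would simply list one pipeline per protocol:

.. code-block:: yaml

   # Each protocol (the value in the annotation sheet's library column)
   # maps to exactly one pipeline.
   RRBS: rrbs.py
   WGBS: wgbs.py
   RNA-seq: rnaseq.py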
18 changes: 10 additions & 8 deletions doc/source/usage-and-commands.rst
@@ -2,29 +2,31 @@ Usage and commands
 ******************************
 
 
-Looper doesn't just run pipelines, it can also check and summarize the progress of your jobs, as well as remove all files created by them.
+Looper doesn't just run pipelines; it can also check and summarize the progress of your jobs, as well as remove all files created by them.
 
-Each task is controlled by one of the four main commands ``run``, ``summarize``, ``destroy``, ``check``:
+Each task is controlled by one of the five main commands ``run``, ``summarize``, ``destroy``, ``check``, ``clean``:
 
 
 .. code-block:: bash
 
-  usage: looper [-h] [--version] {run,summarize,destroy,check} ...
+  usage: looper [-h] [-V] {run,summarize,destroy,check,clean} ...
 
-  looper - Loops through samples and submits pipelines for them.
+  looper - Loop through samples and submit pipelines for them.
 
   positional arguments:
-    {run,summarize,destroy,check}
+    {run,summarize,destroy,check,clean}
       run                 Main Looper function: Submit jobs for samples.
       summarize           Summarize statistics of project samples.
       destroy             Remove all files of the project.
-      check               Remove all files of the project.
+      check               Checks flag status of current runs.
+      clean               Runs clean scripts to remove intermediate files of
+                          already processed jobs.
 
   optional arguments:
     -h, --help            show this help message and exit
-    --version             show program's version number and exit
+    -V, --version         show program's version number and exit
 
-  For command line options of each command, type: looper COMMAND -h
+  For subcommand-specific options, type: 'looper <subcommand> -h'
 
   https://github.com/epigen/looper
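
As a usage sketch (an invented invocation, not from the commit; the config filename is hypothetical), the new ``clean`` subcommand fits a typical session like this:

.. code-block:: bash

   looper run project_config.yaml      # submit jobs for all samples
   looper check project_config.yaml    # check flag status of the runs
   looper clean project_config.yaml    # remove intermediate files of finished jobs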
12 changes: 5 additions & 7 deletions looper/loodels.py
@@ -14,18 +14,13 @@
"submit_templates", DEFAULT_PROJECT_COMPUTE_NAME)



class Project(models.Project):
""" Looper-specific NGS Project. """

# Default project/compute environment with looper specificity.
DEFAULT_ENVIRONMENT = resource_filename(
"looper", DEFAULT_PROJECT_COMPUTE_CONFIG)


def __init__(
self, config_file,
default_compute= resource_filename(
"looper", DEFAULT_PROJECT_COMPUTE_CONFIG),
default_compute=None,
*args, **kwargs):
"""
Create a new Project.
@@ -37,6 +32,9 @@ def __init__(
         :param tuple args: additional positional arguments
         :param dict kwargs: additional keyword arguments
         """
+        if not default_compute:
+            default_compute = resource_filename(
+                "looper", DEFAULT_PROJECT_COMPUTE_CONFIG)
         super(Project, self).__init__(
             config_file, default_compute, *args, **kwargs)

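
The change above follows the standard None-sentinel idiom: a default expression in a ``def`` signature is evaluated once, at import time, while a fallback inside the body defers the ``resource_filename`` lookup until a ``Project`` is actually constructed. A self-contained sketch of the idiom (not looper code; all names are invented):

.. code-block:: python

   def expensive_lookup():
       # Stand-in for resource_filename(...): work you want done lazily.
       return "/computed/default/path"

   def configure(path=None):
       # The fallback runs per call, and only when the caller omits the
       # argument; a default in the signature would run once, at import.
       if not path:
           path = expensive_lookup()
       return path

   print(configure())             # -> /computed/default/path
   print(configure("explicit"))   # -> explicit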
