Merge pull request #20 from epigen/dev
Merge dev to master
nsheff committed Jan 23, 2017
2 parents 6d8a9d8 + 2beb2d9 commit 125fb01
Showing 32 changed files with 1,537 additions and 656 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -31,3 +31,6 @@ pipelines/pypiper

#make sure to track .gitignore
!.gitignore

# ignore test results
tests/test/*
24 changes: 13 additions & 11 deletions README.md
@@ -2,37 +2,39 @@

[![Documentation Status](http://readthedocs.org/projects/looper/badge/?version=latest)](http://looper.readthedocs.io/en/latest/?badge=latest)

__`Looper`__ is a pipeline submission engine that parses sample inputs and submits pipelines for each sample. It also has several accompanying scripts that use the same infrastructure to do other processing for projects. Looper was conceived to use [`pypiper`](https://github.com/epigen/pypiper/) pipelines, but does not require this.
__`Looper`__ is a pipeline submission engine that parses sample inputs and submits pipelines for each sample. Looper was conceived to use [pypiper](https://github.com/epigen/pypiper/) pipelines, but does not require this.


# Links

* Public-facing permalink: http://databio.org/looper
* Documentation: [Read the Docs](http://looper.readthedocs.org/) (still under heavy work)
* Documentation: [Read the Docs](http://looper.readthedocs.org/)
* Source code: http://github.com/epigen/looper


# Installing
Looper supports Python 2.7 only and has been tested only on Linux.

```
pip install https://github.com/epigen/looper/zipball/master
```

You will have a `looper` executable and all accessory scripts (from [`scripts/`](scripts/), see below) in your `$PATH`.
To have the `looper` executable in your `$PATH`, add the following line to your .bashrc file:

```
export PATH=$PATH:~/.local/bin
```


# Running pipelines

`Looper` just requires a YAML-format config file passed as an argument, which contains all the required settings. It can, for example, submit each job to SLURM or SGE, or run the jobs locally.

```bash
looper -c metadata/config.yaml
looper run project_config.yaml
```
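
For orientation, a minimal `project_config.yaml` might look roughly like the sketch below. The key names mirror the minimal example in the documentation, but `sample_annotation` and the overall layout are assumptions here, so check the docs for the exact format:

```yaml
# Hypothetical minimal project config -- paths and layout are placeholders.
output_dir: /path/to/results                        # where pipeline results are written
pipelines_dir: /path/to/pipelines/repository        # where your pipeline code lives
sample_annotation: /path/to/sample_annotation.csv   # one row per sample
```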

# Post-pipeline processing (accessory scripts)

Once a pipeline has been run (or is running), you can do some post-processing on the results. In [`scripts/`](scripts/) are __accessory scripts__ that help with monitoring running pipelines, summarizing pipeline outputs, etc. These scripts are not required by the pipelines, but are useful for post-processing. Here are some post-processing scripts:

* [scripts/flagCheck.sh](scripts/flagCheck.sh) - Summarizes status flags to check on the status (running, completed, or failed) of pipelines.
* [scripts/make_trackhubs.py](scripts/make_trackhubs.py) - Builds a track hub. Just pass it your config file.
* [scripts/summarizePipelineStats.R](scripts/summarizePipelineStats.R) - Run this in the output folder and it will aggregate and summarize all key-value pairs reported in the `PIPE_stats` files into tables for each pipeline, plus a combined table across all pipelines run in this folder.
# Looper commands

You can find other examples in [scripts/](scripts/).
Looper can do more than just run your samples through pipelines. Once a pipeline has been run (or is running), you can do post-processing on the results with additional `looper` subcommands, which help with monitoring running pipelines, summarizing pipeline outputs, and more. These include `looper clean`, `looper destroy`, `looper summarize`, and others; you can find details about them in the **Commands** section of the documentation.
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.3
0.4
114 changes: 107 additions & 7 deletions doc/source/advanced.rst
@@ -3,7 +3,7 @@ Advanced features

.. _advanced-derived-columns:

Pointing to flexible data with derived columns
Pointing to data paths flexibly with "derived columns"
****************************************
On your sample sheet, you will need to point to the input file or files for each sample. Of course, you could just add a column with the file path, like ``/path/to/input/file.fastq.gz``. For example:

@@ -21,7 +21,7 @@ On your sample sheet, you will need to point to the input file or files for each
This is common, and it works in a pinch with Looper, but what if the data get moved, or your filesystem changes, or you switch servers or move institutes? Will this data still be there in 2 years? Do you want long file paths cluttering your annotation sheet? What if you have 2 or 3 input files? Do you want to manually manage these unwieldy absolute paths?


``Looper`` makes it really easy to do better: using a ``derived column``, you can use variables to make this flexible. So instead of ``/long/path/to/sample.fastq.gz`` in your table, you just write ``source1`` (or whatever):
``Looper`` makes it really easy to do better: using columns from the sample metadata, you can derive a flexible data path; we call these newly constructed fields a ``derived column``. So instead of ``/long/path/to/sample.fastq.gz`` in your table, you just write ``source1`` (or whatever):

.. csv-table:: Sample Annotation Sheet (good example)
:header: "sample_name", "library", "organism", "time", "file_path"
@@ -32,7 +32,7 @@ This is common, and it works in a pinch with Looper, but what if the data get mo
"frog_0h", "RRBS", "frog", "0", "source1"
"frog_1h", "RRBS", "frog", "1", "source1"

Then, in your config file you specify which columns are derived (in this case, ``file_path``), as well as a string that will construct your path based on other sample attributes encoded using brackets as in ``{sample_attribute}``, like this:
Then, in your config file you specify which sample attributes (similar to the metadata columns) are derived (in this case, ``file_path``), as well as a string that will construct your path based on other sample attributes encoded using brackets as in ``{sample_attribute}``, like this:


.. code-block:: yaml
@@ -42,7 +42,7 @@ Then, in your config file you specify which columns are derived (in this case, `
     source1: /data/lab/project/{sample_name}.fastq
     source2: /path/from/collaborator/weirdNamingScheme_{external_id}.fastq
That's it! The variables will be automatically populated as in the original example. To take this a step further, you'd get the same result with this config file, which substitutes ``{sample_name}`` for other sample attributes, ``{organism}`` and ``{time}``:
That's it! The attributes will be automatically populated as in the original example. To take this a step further, you'd get the same result with this config file, which replaces ``{sample_name}`` with other sample attributes, ``{organism}`` and ``{time}``:

.. code-block:: yaml
@@ -58,7 +58,41 @@ By default, the "data_source" column is considered a derived column. But you can

Think of each sample as belonging to a certain type (for simple experiments, the type will be the same); then define the location of these samples in the project configuration file. As a side bonus, you can easily include samples from different locations, and you can also share the same sample annotation sheet on different environments (i.e. servers or users) by having multiple project config files (or, better yet, by defining a subproject for each environment). The only thing you have to change is the project-level expression describing the location, not any sample attributes (plus, you get to eliminate those annoying long/path/arguments/in/your/sample/annotation/sheet).
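
To put the pieces together, the relevant fragment of a project config might look roughly like the sketch below; the ``derived_columns`` and ``data_sources`` key names are assumed from the surrounding examples, so confirm them against the config-files documentation:

.. code-block:: yaml

   derived_columns: [file_path]

   data_sources:
     source1: /data/lab/project/{sample_name}.fastq
     source2: /path/from/collaborator/weirdNamingScheme_{external_id}.fastq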

Check out the complete working example in the `microtest repository <https://github.com/epigen/microtest/tree/master/config>`_.
Check out the complete working example in the `microtest repository <https://github.com/epigen/microtest/tree/master/config>`__.

.. _cluster-resource-managers:

Using cluster resource managers
****************************************

.. warning:: This is still in progress

Looper uses a template-based system for building scripts. By default, looper will just build a shell script for each sample and run them serially. Compute settings can be changed using an environment script, which you point to with a shell environment variable called ``LOOPERENV``.

Complete instructions for configuring your compute environment are available in the looperenv repository at https://github.com/epigen/looperenv.

For each iteration, ``looper`` will create one or more submission scripts for that sample. The ``compute`` settings specify how these scripts will be both produced and run. This makes it very portable and easy to change cluster management systems, or to just use local compute power like a laptop or standalone server, by changing the two variables in the ``compute`` section.

Example:

.. code-block:: yaml

   compute:
     default:
       submission_template: pipelines/templates/local_template.sub
       submission_command: sh
     slurm:
       submission_template: pipelines/templates/slurm_template.sub
       submission_command: sbatch
       partition: queue_name

There are two sub-parameters in the compute section. First, ``submission_template`` is a (relative or absolute) path to the template submission script. This is a template with variables (encoded like ``{VARIABLE}``), which will be populated independently for each sample as defined in ``pipeline_interface.yaml``. The variable ``{CODE}`` is reserved and refers to the actual python command that will run the pipeline. Otherwise, you can use any variables you define in your ``pipeline_interface.yaml``.

Second, the ``submission_command`` is the command-line command that ``looper`` will prepend to the path of the produced submission script to actually run it (``sbatch`` for SLURM, ``qsub`` for SGE, ``sh`` for localhost, etc.).

The `Templates <https://github.com/epigen/looper/tree/master/templates>`__ directory contains example submission templates for `SLURM <https://github.com/epigen/looper/blob/master/templates/slurm_template.sub>`__, `SGE <https://github.com/epigen/looper/blob/master/templates/sge_template.sub>`__, and `local runs <https://github.com/epigen/looper/blob/master/templates/localhost_template.sub>`__. For a local run, just pass the script to the shell with ``submission_command: sh``. This will cause each sample to run sequentially, as the shell will block until the run is finished and control is returned to ``looper`` for the next iteration.


.. _cluster-resource-managers:

@@ -107,7 +141,6 @@ metadata:

Make sure the ``sample_name`` column of this table matches, and then include any columns you need to point to the data. ``Looper`` will automatically include all of these files as input passed to the pipelines.
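
For reference, the merge table is typically pointed to from the project config, roughly like this (a sketch; the ``merge_table`` key under ``metadata`` is assumed here, so check the config-files documentation for the exact name):

.. code-block:: yaml

   metadata:
     merge_table: /path/to/merge_table.csv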


Note: to handle different *classes* of input files, like read1 and read2, these are *not* merged and should be handled as different derived columns in the main sample annotation sheet.


@@ -117,5 +150,72 @@ In a case of data produced at CeMM by the BSF, three additional columns will all

- ``flowcell`` - the name of the BSF flowcell (should be something like BSFXXX)
- ``lane`` - the lane number in the instrument
- ``BSF_name`` - the name used to describe the sample in the BSF annotation [1]_.
- ``BSF_name`` - the name used to describe the sample in the BSF annotation.



.. _extending-sample-objects:

Extending Sample objects
****************************************

Looper uses object-oriented programming (OOP) under the hood. This means that concepts like a sample to be processed or a project are modeled as objects in Python.

By default we use `generic models <https://github.com/epigen/looper/tree/master/looper/models.py>`__ (see the `API <api.html>`__ for more) to handle samples in Looper, but these can also be reused in other contexts by importing ``looper.models`` or by means of object serialization through YAML files.

Since these models provide useful methods to interact with, update, and store attributes in the objects (most notably *samples*, via the ``Sample`` object), a useful case is during a pipeline run: pipeline scripts can extend ``Sample`` objects with further attributes or methods.

Example:

You want a convenient yet systematic way of specifying many file paths for several samples depending on the type of NGS sample you're handling: a ChIP-seq sample might, at some point during a run, have a peak file at a certain location, while an RNA-seq sample will have a file with transcript quantifications. Both paths exist only for the respective samples and will likely be used during a pipeline run, but also during later analysis.
By working with ``Sample`` objects that are specific to each data type, you can specify the location of such files only once during the whole process and later access them "on the fly".


**To have** ``Looper`` **create a Sample object specific to your data type, simply import the base** ``Sample`` **object from** ``looper.models``, **and create a** ``class`` **that inherits from it that has an** ``__library__`` **attribute:**


.. code-block:: python

   # atacseq.py
   import os

   import pandas as pd

   from looper.models import Sample


   class ATACseqSample(Sample):
       """
       Class to model ATAC-seq samples based on the generic Sample class.

       :param series: Pandas `Series` object.
       :type series: pandas.Series
       """
       __library__ = "ATAC-seq"

       def __init__(self, series):
           if not isinstance(series, pd.Series):
               raise TypeError("Provided object is not a pandas Series.")
           super(ATACseqSample, self).__init__(series)
           self.make_sample_dirs()

       def set_file_paths(self):
           """Sets the paths of all files for this sample."""
           # Inherit paths from Sample by running Sample's set_file_paths()
           super(ATACseqSample, self).set_file_paths()
           self.fastqc = os.path.join(self.paths.sample_root, self.name + ".fastqc.zip")
           self.trimlog = os.path.join(self.paths.sample_root, self.name + ".trimlog.txt")
           self.fastq = os.path.join(self.paths.sample_root, self.name + ".fastq")
           self.trimmed = os.path.join(self.paths.sample_root, self.name + ".trimmed.fastq")
           self.mapped = os.path.join(self.paths.sample_root, self.name + ".bowtie2.bam")
           self.peaks = os.path.join(self.paths.sample_root, self.name + "_peaks.bed")

When ``Looper`` parses your config file and creates ``Sample`` objects, it will:

- check if any pipeline has a class extending ``Sample`` with the ``__library__`` attribute:

  - first by trying to import a ``pipelines`` module and checking that module for pipelines;

  - if the previous fails, it will try appending the provided ``pipelines_dir`` to ``$PATH`` and checking the module files for pipelines;

  - if any of the above is successful, it will match the sample ``library`` with the ``__library__`` attribute of the classes to create extended sample objects.

- if a sample cannot be matched to an extended class, it will be a generic ``Sample`` object.
24 changes: 24 additions & 0 deletions doc/source/changelog.rst
@@ -0,0 +1,24 @@
Changelog
******************************

- **v0.4** (*2017-01-12*):

  - New

    - New command-line interface (CLI) based on sub-commands

    - New subcommand (``looper summarize``) replacing the ``summarizePipelineStats.R`` script

    - New subcommand (``looper check``) replacing the ``flagCheck.sh`` script

    - New command (``looper destroy``) to remove all output of a project

    - Support for portable and pipeline-independent allocation of computing resources with Looperenv.

  - Fixes

    - Removed requirement to have ``pipelines`` repository installed in order to extend base Sample objects

    - Sample attributes are now maintained as provided by the user, by reading them in as strings (to be improved further)

    - Improved serialization of Sample objects
12 changes: 7 additions & 5 deletions doc/source/define-your-project.rst
@@ -10,7 +10,7 @@ The format is simple and modular, so you only need to define the components you
1. **Project config file** - a ``yaml`` file describing input and output file paths and other (optional) project settings
2. **Sample annotation sheet** - a ``csv`` file with 1 row per sample

In the simplest case, ``project_config.yaml`` is just a few lines of ``yaml``. Here's a minimal example **project_config.yaml**:
The first file (**project config**) is just a few lines of ``yaml`` in the simplest case. Here's a minimal example **project_config.yaml**:


.. code-block:: yaml
@@ -21,7 +21,9 @@
pipelines_dir: /path/to/pipelines/repository
The **output_dir** describes where you want to save pipeline results, and **pipelines_dir** describes where your pipeline code is stored. You will also need a second file to describe samples, which is a comma-separated value (``csv``) file containing at least a unique identifier column named ``sample_name``, a column named ``library`` describing the sample type, and some way of specifying an input file. Here's a minimal example of **sample_annotation.csv**:
The **output_dir** describes where you want to save pipeline results, and **pipelines_dir** describes where your pipeline code is stored.

The second file (**sample annotation sheet**) is where you list your samples. It is a comma-separated value (``csv``) file containing at least a few defined columns: a unique identifier column named ``sample_name``; a column named ``library`` describing the sample type (e.g. RNA-seq); and some way of specifying an input file. Here's a minimal example of **sample_annotation.csv**:


.. csv-table:: Minimal Sample Annotation Sheet
@@ -34,11 +36,11 @@ The **output_dir** describes where you want to save pipeline results, and **pipe
"frog_4", "RNA-seq", "frog4.fq.gz"


With those two simple files, you could run looper, and that's fine for just running a quick test on a few files. In practice, you'll probably want to use some of the more advanced features of looper by adding additional information to your configuration ``yaml`` file and your sample annotation ``csv`` file.
With those two simple files, you could run looper, and that's fine for just running a quick test on a few files. You just type: ``looper run path/to/project_config.yaml`` and it will run all your samples through the appropriate pipeline. In practice, you'll probably want to use some of the more advanced features of looper by adding additional information to your configuration ``yaml`` file and your sample annotation ``csv`` file.

For example, by default, your jobs will run serially on your local computer, where you're running ``looper``. If you want to submit to a cluster resource manager (like SLURM or SGE), you just need to specify a ``compute`` section.
For example, by default, your jobs will run serially on your local computer, where you're running ``looper``. If you want to submit to a cluster resource manager (like SLURM or SGE), you just need to add a ``compute`` section to your **project config file**.
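
For instance, a minimal ``compute`` section might look roughly like the sketch below; the exact keys and the template path are placeholders, and the full, environment-based setup is described in :ref:`cluster resource managers <cluster-resource-managers>`:

.. code-block:: yaml

   compute:
     submission_template: templates/slurm_template.sub
     submission_command: sbatch
     partition: queue_name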

Let's go through the more advanced details of both annotation sheets and project config files:
Now, let's go through the more advanced details of both annotation sheets and project config files:

.. include:: sample-annotation-sheet.rst

4 changes: 2 additions & 2 deletions doc/source/faq.rst
@@ -1,8 +1,8 @@
FAQ
=========================

- **Why isn't looper in my path?** Please read this issue report: https://github.com/epigen/looper/issues/8
- **Why isn't looper in my path?** By default, Python installs command-line tools like ``looper`` to ``~/.local/bin``, which may not be in your ``$PATH``. You can add this location to your path by appending it (``export PATH=$PATH:~/.local/bin``). See discussion about this issue here: https://github.com/epigen/looper/issues/8

- How can I run my jobs on a cluster? See :ref:`cluster resource managers <cluster-resource-managers>`

- Which configuration file has which settings? Here's a list: :doc:`config files <config-files>`
- Which configuration file has which settings? Here's a list: :doc:`config files <config-files>`
