Merge branch 'release/v2.1'

milaboratory · Feb 6, 2017 · 7c4dcd4
2 parents 4cb38f2 + e6c7644
commit 7c4dcd4
Showing 47 changed files with 3,227 additions and 522 deletions.
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,4 @@ doc/_build
 .floo
 .flooignore
 out
+test_target
diff --git a/CHANGELOG b/CHANGELOG
@@ -1,4 +1,23 @@
 
+MiXCR 2.1 ( 6 Feb 2017)
+========================
+
+-- Major review of all analysis steps for non-enriched libraries (RNA-Seq, etc.). Efficiency of
+   TCR/IG extraction is substantially improved (according to our benchmarks, it is the highest among
+   all tools for RNA-Seq repertoire extraction known to date; works successfully even for
+   48+48 RNA-Seq data). Zero false-positive alignments and false overlaps detected.
+-- Additional round of alignment for V gene in paired-end reads aligner (improves efficiency and
+   accuracy for some boundary cases; negligible impact on analysis speed).
+-- New action `extendAlignments` to extend TCR alignments that have defined V and J genes but do not
+   fully cover the CDR3 sequence.
+-- Scripting-friendly export format now used by default. Use `-v` to return to column names with 
+   spaces.
+-- Information on the number of deleted nucleotides / size of P-segment for `V`, `D`, and `J` genes
+   is now explicitly exported in the `-defaultAnchorPoints` field (see docs for more info).
+-- Many small fixes and enhancements.
+-- Correct marks for P-segment of J gene in `exportAlignmentsPretty` and `exportClonesPretty`.
+
+
 MiXCR 2.0.4 ( 4 Feb 2017)
 ========================
 

diff --git a/README.md b/README.md
@@ -4,6 +4,40 @@
 
 MiXCR is a universal software for fast and accurate analysis of raw T- or B- cell receptor repertoire sequencing data.
 
+ - Easy to use. Default pipeline can be executed without any additional parameters (see *Usage* section)
+
+ - TCR and IG repertoires
+
+ - The following species are supported *out-of-the-box* using the built-in library:
+   - human
+   - mouse
+   - rat (only TRB and TRA)
+   - *... several new species will be available soon*
+
+- Efficiently extracts repertoires from most (if not *all*) types of TCR/IG-containing raw sequencing data:
+  - data from all specialized RepSeq sample preparation protocols
+  - RNA-Seq
+  - WGS
+  - single-cell data
+  - *etc.*
+
+- Has an optional CDR3 reconstruction step that *recovers the full hypervariable region from several disjoint reads*. It uses algorithms that protect against false-positive assemblies while keeping best-in-class efficiency.
+
+- Assembles clonotypes, applying several *error-correction* algorithms to eliminate artificial diversity arising from PCR and sequencing errors
+
+- Clonotypes can be assembled based on CDR3 sequence (default) as well as any other region, including *full-length* variable sequence (from beginning of FR1 to the end of FR4)
+
+- Provides exhaustive output information for clonotypes and per-read alignments:
+  - nucleotide and amino acid sequences of all immunologically relevant regions (FR1, CDR1, ..., CDR3, etc.)
+  - identified V, D, J, C genes
+  - nucleotide and amino acid mutations in germline regions
+  - variable region topology (number of end V / D / J nucleotide deletions, length of P-segments, number of non-template N nucleotides)
+  - sequencing quality scores for any extracted sequence
+  - several other useful pieces of information
+
+- Completely transparent pipeline: the fate of each individual read can be tracked from the raw fastq entry to the clonotype. Several useful tools are available to evaluate pipeline performance: human-readable alignment visualization, a diff tool for alignment and clonotype files, etc.
+
+
 ## Installation / Download
 
 #### Using Homebrew on Mac OS X or Linux (linuxbrew)
@@ -17,7 +51,7 @@ to upgrade already installed MiXCR to the newest version:
 
 #### Manual install (any OS)
 
-* download latest MiXCR version from [release page](https://github.com/milaboratory/mixcr/releases/latest)
+* download latest stable MiXCR build from [release page](https://github.com/milaboratory/mixcr/releases/latest)
 * unzip the archive
 * add resulting folder to your ``PATH`` variable
   * or add symbolic link for ``mixcr`` script to your ``bin`` folder
@@ -30,20 +64,35 @@ to upgrade already installed MiXCR to the newest version:
 
 ## Usage
 
-Here is a very simple example of analysis of raw human RepSeq data:
+#### Enriched RepSeq Data
+
+Here is a simple usage example that extracts repertoire data (as a list of clonotypes) from raw sequencing data of an enriched RepSeq library:
 
     mixcr align -r log.txt input_R1.fastq.gz input_R2.fastq.gz alignments.vdjca
     mixcr assemble -r log.txt alignments.vdjca clones.clns
     mixcr exportClones clones.clns clones.txt
 
-this sequence of commands will produce a tab-delimited list of clones (`clones.txt`) assembled by their CDR3 sequences with extensive information on their abundancies, V, D and J genes etc.
+This produces a tab-delimited list of clones (`clones.txt`) assembled by their CDR3 sequences, with extensive information on their abundances, V, D and J genes, mutations in germline regions, topology of the VDJ junction, etc.
+
+#### Repertoire extraction from RNA-Seq
 
-For more details see documentation.
+MiXCR is equally effective at extracting repertoire information from non-enriched data such as RNA-Seq or WGS. This example illustrates usage for RNA-Seq:
+
+    mixcr align -p rna-seq -r log.txt input_R1.fastq.gz input_R2.fastq.gz alignments.vdjca
+    mixcr assemblePartial alignments.vdjca alignment_contigs.vdjca
+    mixcr assemble -r log.txt alignment_contigs.vdjca clones.clns
+    mixcr exportClones clones.clns clones.txt
+
+#### Further reading
+
+The MiXCR pipeline is very flexible and can be applied to raw data from a broad spectrum of experimental setups. For a detailed description of MiXCR features and options, please see the documentation.
 
 ## Documentation
 
 Detailed documentation can be found at https://mixcr.readthedocs.io/
 
+If you haven't found the answer to your question in the docs, or have any suggestions concerning new features, feel free to create an issue here on GitHub or write an email to support@milaboratory.com.
+
 ## Build
 
 Dependency:
@@ -63,7 +112,6 @@ To build MiXCR from source:
   ```
   ./build.sh
   ```
-
 
 ## License
 

diff --git a/doc/export.rst b/doc/export.rst
@@ -177,17 +177,29 @@ The following table shows the correspondance between anchor point and positions
 +--------------------------+---------------------+--------------------+
 | VEnd / *PSegmentBegin*   | 10                  | 11                 |
 +--------------------------+---------------------+--------------------+
-| VEndTrimmed              | 11                  | 12                 |
+| Number of 3' V deletions | 11                  | 12                 |
+| (negative value), or     |                     |                    |
+| length of 3' V P-segment |                     |                    |
+| (positive value)         |                     |                    |
 +--------------------------+---------------------+--------------------+
-| DBeginTrimmed            | 12                  | 13                 |
+| Number of 5' D deletions | 12                  | 13                 |
+| (negative value), or     |                     |                    |
+| length of 5' D P-segment |                     |                    |
+| (positive value)         |                     |                    |
 +--------------------------+---------------------+--------------------+
 | DBegin / *PSegmentEnd*   | 13                  | 14                 |
 +--------------------------+---------------------+--------------------+
 | DEnd / *PSegmentBegin*   | 14                  | 15                 |
 +--------------------------+---------------------+--------------------+
-| DEndTrimmed              | 15                  | 16                 |
+| Number of 3' D deletions | 15                  | 16                 |
+| (negative value), or     |                     |                    |
+| length of 3' D P-segment |                     |                    |
+| (positive value)         |                     |                    |
 +--------------------------+---------------------+--------------------+
-| JBeginTrimmed            | 16                  | 17                 |
+| Number of 5' J deletions | 16                  | 17                 |
+| (negative value), or     |                     |                    |
+| length of 5' J P-segment |                     |                    |
+| (positive value)         |                     |                    |
 +--------------------------+---------------------+--------------------+
 | JBegin / *PSegmentEnd*   | 17                  | 18                 |
 +--------------------------+---------------------+--------------------+

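The signed encoding introduced in this table can be read as follows. This is a minimal Java sketch, not MiXCR code: the class `SignedAnchorValue` and its methods are hypothetical names, and treating `0` as "no deletions and no P-segment" is an assumption, since the table only defines negative and positive values.

```java
// Hypothetical helper (not part of MiXCR): interprets one signed value from
// the anchor point positions that the table above describes as
// "number of deletions (negative value), or length of P-segment (positive value)".
final class SignedAnchorValue {
    private SignedAnchorValue() {
    }

    /** Number of deleted germline nucleotides; 0 when the value is not negative. */
    static int deletions(int value) {
        return value < 0 ? -value : 0;
    }

    /** Length of the P-segment; 0 when the value is not positive. */
    static int pSegmentLength(int value) {
        return value > 0 ? value : 0;
    }

    /** Human-readable summary; the handling of 0 is an assumption of this sketch. */
    static String describe(String geneEnd, int value) {
        if (value < 0)
            return geneEnd + ": " + (-value) + " nt deleted";
        if (value > 0)
            return geneEnd + ": P-segment of " + value + " nt";
        return geneEnd + ": no deletions, no P-segment";
    }

    public static void main(String[] args) {
        System.out.println(describe("3' V", -3));
        System.out.println(describe("5' D", 2));
        System.out.println(describe("3' D", 0));
    }
}
```

Keeping both quantities in one signed field works because a gene end cannot have both trimmed nucleotides and a P-segment at the same time, which is what the table's wording implies.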
diff --git a/doc/rnaseq.rst b/doc/rnaseq.rst
@@ -7,7 +7,7 @@
 Processing RNA-seq data
 =======================
 
-The typical MiXCR workflow can be applied for the analysis of RNA-seq samples. Though MiXCR can be used with the default parameters for aligning RNA-seq data, it is recommended to use ``rna-seq`` preset which is specifically tuned to perform well on such type of input:
+The typical MiXCR workflow can be applied for the analysis of RNA-seq samples. It is recommended to use the ``rna-seq`` preset, which is specifically tuned to perform well on this type of input:
 
 ::
 
@@ -35,9 +35,9 @@ Note option ``-OallowPartialAlignments=true`` of the ``align`` command: it will
 +------------------------------+---------------+--------------------------------------------------------------+
 | ``kOffset``                  | ``0``         | Offset taken from ``VEndTrimmed``/``JBeginTrimmed``.         |
 +------------------------------+---------------+--------------------------------------------------------------+
-| ``minimalVJJunctionOverlap`` | ``18``        | Minimal length of the overlapped VJ region: two squences can |
+| ``minimalAssembleOverlap``   | ``18``        | Minimal length of the overlapped VJ region: two sequences    |
 |                              |               | can be potentially merged only if they have at least         |
-|                              |               | ``minimalVJJunctionOverlap`` consequent same nucleotides     |
+|                              |               | ``minimalAssembleOverlap`` consecutive identical nucleotides |
 |                              |               | in the VJJunction region.                                    |
 +------------------------------+---------------+--------------------------------------------------------------+
 
@@ -60,7 +60,7 @@ The algorithm which restores merged sequence from two overlapped alignments has
 +-----------------------------+---------------------+--------------------------------------------------------------+
 | ``partsLayout``             | ``CollinearDirect`` | Relative orientation of paired reads.                        |
 +-----------------------------+---------------------+--------------------------------------------------------------+
-| ``minimalOverlap``          | ``20``              | Minimal length of the overlapped region.                     |
+| ``minimalAssembleOverlap``  | ``20``              | Minimal length of the overlapped region.                     |
 +-----------------------------+---------------------+--------------------------------------------------------------+
 | ``maxQuality``              | ``45``              | Maximal sequence quality that can be assigned in the         |
 |                             |                     | region of overlap.                                           |
@@ -73,5 +73,5 @@ The above parameters can be specified in e.g. the following way:
 
 ::
 
-    mixcr assemblePartial -OmergerParameters.minimalOverlap=15 alignments.vdjca alignmentsRescued.vdjca
+    mixcr assemblePartial -OmergerParameters.minimalAssembleOverlap=15 alignments.vdjca alignmentsRescued.vdjca
 
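The ``minimalAssembleOverlap`` rule documented in this file (two sequences may be merged only if they share at least that many consecutive identical nucleotides in the overlapped region) can be illustrated with a small Java sketch. It is not the MiXCR implementation, which this diff does not show: `OverlapCheck` and its methods are hypothetical names, and the sketch assumes both strings have already been cut to the same overlapping positions.

```java
// Hypothetical illustration (not MiXCR code) of the documented merge condition:
// at least `minimalAssembleOverlap` consecutive identical nucleotides.
final class OverlapCheck {
    private OverlapCheck() {
    }

    /**
     * Length of the longest run of positions i where a and b carry the same
     * nucleotide. Assumes a[i] and b[i] refer to the same position.
     */
    static int longestIdenticalRun(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int best = 0;
        int run = 0;
        for (int i = 0; i < n; i++) {
            // extend the current run on a match, reset it on a mismatch
            run = a.charAt(i) == b.charAt(i) ? run + 1 : 0;
            best = Math.max(best, run);
        }
        return best;
    }

    /** True if the two overlapped parts satisfy the minimal-overlap condition. */
    static boolean mayMerge(String a, String b, int minimalAssembleOverlap) {
        return longestIdenticalRun(a, b) >= minimalAssembleOverlap;
    }

    public static void main(String[] args) {
        // One mismatch at index 4 splits the overlap into runs of 4 and 3.
        System.out.println(longestIdenticalRun("ACGTACGT", "ACGTTCGT"));
        System.out.println(mayMerge("ACGTACGT", "ACGTTCGT", 5));
    }
}
```

Lowering the threshold, as in the `-OmergerParameters.minimalAssembleOverlap=15` example above, accepts shorter identical runs and therefore more candidate merges.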
diff --git a/itests.sh b/itests.sh
@@ -0,0 +1,76 @@
+#!/bin/bash
+
+# "Integration" tests for MiXCR
+# Test standard analysis pipeline results
+
+# Linux readlink -f alternative for Mac OS X
+function readlinkUniversal() {
+    targetFile=$1
+
+    cd "$(dirname "$targetFile")"
+    targetFile=$(basename "$targetFile")
+
+    # iterate down a (possible) chain of symlinks
+    while [ -L "$targetFile" ]
+    do
+        targetFile=$(readlink "$targetFile")
+        cd "$(dirname "$targetFile")"
+        targetFile=$(basename "$targetFile")
+    done
+
+    # compute the canonicalized name by finding the physical path
+    # for the directory we're in and appending the target file.
+    phys_dir=$(pwd -P)
+    result=$phys_dir/$targetFile
+    echo "$result"
+}
+
+os=`uname`
+delta=100
+
+dir=""
+
+case $os in
+    Darwin)
+        dir=$(dirname "$(readlinkUniversal "$0")")
+    ;;
+    Linux)
+        dir="$(dirname "$(readlink -f "$0")")"
+    ;;
+    FreeBSD)
+        dir=$(dirname "$(readlinkUniversal "$0")")    
+    ;;
+    *)
+       echo "Unknown OS."
+       exit 1
+    ;;
+esac
+
+rm -rf "${dir}/test_target"
+mkdir "${dir}/test_target"
+
+cp "${dir}"/src/test/resources/sequences/*.fastq "${dir}/test_target/"
+
+cd "${dir}/test_target/" || exit 1
+
+PATH=${dir}:${PATH}
+
+which mixcr
+
+mixcr -v
+
+function go_assemble {
+  mixcr assemble -r $1.clns.report $1.vdjca $1.clns || exit 1
+  for c in TCR IG TRB TRA TRG TRD IGH IGL IGK ALL
+  do
+    mixcr exportClones -c ${c} -s $1.clns $1.clns.${c}.txt || exit 1
+  done
+}
+
+for s in sample_IGH test;
+do
+  mixcr align -r ${s}_paired.vdjca.report ${s}_R1.fastq ${s}_R2.fastq ${s}_paired.vdjca || exit 1
+  go_assemble ${s}_paired
+  mixcr align -r ${s}_single.vdjca.report ${s}_R1.fastq ${s}_single.vdjca || exit 1
+  go_assemble ${s}_single
+done
diff --git a/pom.xml b/pom.xml
@@ -32,7 +32,7 @@
 
     <groupId>com.milaboratory</groupId>
     <artifactId>mixcr</artifactId>
-    <version>2.0.4</version>
+    <version>2.1</version>
     <packaging>jar</packaging>
     <name>MiXCR</name>
 

diff --git a/repseqio b/repseqio
diff --git a/src/main/java/com/milaboratory/mixcr/basictypes/VDJCAlignmentsFormatter.java b/src/main/java/com/milaboratory/mixcr/basictypes/VDJCAlignmentsFormatter.java
@@ -113,6 +113,12 @@ public boolean accept(SequencePartitioning object) {
             return object.isAvailable(ReferencePoint.VEnd) && object.getPosition(ReferencePoint.VEnd) != object.getPosition(ReferencePoint.VEndTrimmed);
         }
     };
+    public static final Filter<SequencePartitioning> IsJP = new Filter<SequencePartitioning>() {
+        @Override
+        public boolean accept(SequencePartitioning object) {
+            return object.isAvailable(ReferencePoint.JBegin) && object.getPosition(ReferencePoint.JBegin) != object.getPosition(ReferencePoint.JBeginTrimmed);
+        }
+    };
     public static final Filter<SequencePartitioning> IsDPLeft = new Filter<SequencePartitioning>() {
         @Override
         public boolean accept(SequencePartitioning object) {
@@ -128,6 +134,7 @@ public boolean accept(SequencePartitioning object) {
     public static final Filter<SequencePartitioning> NotDPLeft = FilterUtil.not(IsDPLeft);
     public static final Filter<SequencePartitioning> NotDPRight = FilterUtil.not(IsDPRight);
     public static final Filter<SequencePartitioning> NotVP = FilterUtil.not(IsVP);
+    public static final Filter<SequencePartitioning> NotJP = FilterUtil.not(IsJP);
 
 
     public static final PointToDraw[] POINTS_FOR_REARRANGED = new PointToDraw[]{
@@ -153,8 +160,10 @@ public boolean accept(SequencePartitioning object) {
             pd(ReferencePoint.DEnd, "D><DP", IsDPRight),
             pd(ReferencePoint.DEndTrimmed, "DP>", IsDPRight),
 
-            pd(ReferencePoint.JBeginTrimmed, "<J"),
-            pd(ReferencePoint.CDR3End, "CDR3><FR4"),
+            pd(ReferencePoint.JBeginTrimmed, "<J", NotJP),
+            pd(ReferencePoint.JBegin, "JP><J", IsJP),
+            pd(ReferencePoint.JBeginTrimmed, "<JP", IsJP),
+            pd(ReferencePoint.CDR3End.move(-1), "CDR3><FR4").moveMarkerPoint(1),
             pd(ReferencePoint.FR4End, "FR4>", -1),
             pd(ReferencePoint.CBegin, "<C")
     };
@@ -174,7 +183,7 @@ public boolean accept(SequencePartitioning object) {
             pd(ReferencePoint.DBegin, "<D"),
             pd(ReferencePoint.DEnd, "D>", -1),
             pd(ReferencePoint.JBegin, "<J"),
-            pd(ReferencePoint.CDR3End, "CDR3><FR4"),
+            pd(ReferencePoint.CDR3End.move(-1), "CDR3><FR4").moveMarkerPoint(1),
             pd(ReferencePoint.FR4End, "FR4>", -1)
     };
 
@@ -213,6 +222,10 @@ public PointToDraw(ReferencePoint rp, String marker, int markerOffset, Filter<Se
             this.activator = activator;
         }
 
+        public PointToDraw moveMarkerPoint(int offset) {
+            return new PointToDraw(rp, marker, markerOffset + offset, activator);
+        }
+
         public boolean draw(SequencePartitioning partitioning, MultiAlignmentHelper helper, char[] line, boolean overwrite) {
             if (activator != null && !activator.accept(partitioning))
                 return true;

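The J-gene marker selection added in this file can be summarized with a simplified Java sketch. It is a stand-in, not the formatter itself: `JMarkerSketch` is a hypothetical class, positions are plain integers, and representing an unavailable reference point as `-1` is an assumption made here in place of `isAvailable(...)`.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in (not MiXCR code) for the IsJP / NotJP filters and the
// three J-related pd(...) entries in POINTS_FOR_REARRANGED.
final class JMarkerSketch {
    /** Assumption of this sketch: -1 stands for "reference point not available". */
    static final int UNAVAILABLE = -1;

    private JMarkerSketch() {
    }

    /** Mirrors IsJP: JBegin is available and differs from JBeginTrimmed. */
    static boolean isJP(int jBegin, int jBeginTrimmed) {
        return jBegin != UNAVAILABLE && jBegin != jBeginTrimmed;
    }

    /** Markers that would be drawn, written as "label@position". */
    static List<String> jMarkers(int jBegin, int jBeginTrimmed) {
        List<String> markers = new ArrayList<>();
        if (isJP(jBegin, jBeginTrimmed)) {
            // P-segment present: mark its end (start of J) and its start
            markers.add("JP><J@" + jBegin);
            markers.add("<JP@" + jBeginTrimmed);
        } else {
            // no P-segment: a single marker at the trimmed J start
            markers.add("<J@" + jBeginTrimmed);
        }
        return markers;
    }

    public static void main(String[] args) {
        System.out.println(jMarkers(10, 10));
        System.out.println(jMarkers(12, 10));
    }
}
```

Before this change the `<J` marker was drawn unconditionally, so a J P-segment was not distinguished in the pretty output; the `NotJP` / `IsJP` activators make the two cases mutually exclusive.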
diff --git a/src/main/java/com/milaboratory/mixcr/basictypes/VDJCHit.java b/src/main/java/com/milaboratory/mixcr/basictypes/VDJCHit.java
@@ -73,6 +73,12 @@ public VDJCHit(VDJCGene gene, Alignment<NucleotideSequence>[] alignments, GeneFe
         this.score = score;
     }
 
+    public VDJCHit setAlignment(int target, Alignment<NucleotideSequence> alignment) {
+        Alignment<NucleotideSequence>[] newAlignments = alignments.clone();
+        newAlignments[target] = alignment;
+        return new VDJCHit(gene, newAlignments, alignedFeature);
+    }
+
     public int getPosition(int target, ReferencePoint referencePoint) {
         if (alignments[target] == null)
             return -1;
@@ -103,6 +109,10 @@ public Alignment<NucleotideSequence> getAlignment(int target) {
         return alignments[target];
     }
 
+    public Alignment<NucleotideSequence>[] getAlignments() {
+        return alignments.clone();
+    }
+
     public int numberOfTargets() {
         return alignments.length;
     }

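Both methods added to `VDJCHit` follow a copy-based pattern: `setAlignment` returns a new hit built from a cloned array, and `getAlignments` hands out a clone instead of the internal array. The Java sketch below shows the pattern in isolation. `HitSketch` is a hypothetical stand-in that stores `String` values instead of `Alignment<NucleotideSequence>`, and cloning in the constructor is a choice of this sketch, not something the diff shows for `VDJCHit`.

```java
import java.util.Arrays;

// Simplified stand-in for VDJCHit (not MiXCR code): String replaces
// Alignment<NucleotideSequence>, and gene / alignedFeature are omitted.
final class HitSketch {
    private final String[] alignments;

    HitSketch(String[] alignments) {
        // defensive copy on the way in (a choice of this sketch)
        this.alignments = alignments.clone();
    }

    /** Returns a new object with one alignment replaced; this object is unchanged. */
    HitSketch setAlignment(int target, String alignment) {
        String[] newAlignments = alignments.clone();
        newAlignments[target] = alignment;
        return new HitSketch(newAlignments);
    }

    /** Returns a copy, so callers cannot modify the internal array. */
    String[] getAlignments() {
        return alignments.clone();
    }

    public static void main(String[] args) {
        HitSketch original = new HitSketch(new String[]{"aln0", null});
        HitSketch updated = original.setAlignment(1, "aln1");
        System.out.println(Arrays.toString(original.getAlignments())); // [aln0, null]
        System.out.println(Arrays.toString(updated.getAlignments()));  // [aln0, aln1]
    }
}
```

Note that `clone()` on an array is shallow: the array is copied, but the elements are shared, which is sufficient as long as the element type is itself treated as immutable.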
diff --git a/src/main/java/com/milaboratory/mixcr/basictypes/VDJCObject.java b/src/main/java/com/milaboratory/mixcr/basictypes/VDJCObject.java
@@ -81,6 +81,13 @@ public final VDJCHit[] getHits(GeneType type) {
         return hits == null ? new VDJCHit[0] : hits;
     }
 
+    public Chains getTopChain(GeneType gt) {
+        final VDJCHit top = getBestHit(gt);
+        if (top == null)
+            return Chains.EMPTY;
+        return top.getGene().getChains();
+    }
+
     public Chains getAllChains(GeneType geneType) {
         if (allChains == null)
             synchronized ( this ){
@@ -142,6 +149,10 @@ public final NSequenceWithQuality getTarget(int target) {
         return targets[target];
     }
 
+    public final NSequenceWithQuality[] getTargets(){
+        return targets.clone();
+    }
+
     public final VDJCPartitionedSequence getPartitionedTarget(int target) {
         if (partitionedTargets == null) {
             partitionedTargets = new VDJCPartitionedSequence[targets.length];