updated readme file.

khalidm · Apr 26, 2018 · 03455b0
1 parent a669565
commit 03455b0
Showing 1 changed file with 84 additions and 51 deletions.
diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 # hiplexpipe
-
+***
 ## A bioinformatics pipeline for variant calling for [Hi-Plex](http://hiplex.org/) sequencing.
 
 Author: Khalid Mahmood (kmahmood@unimelb.edu.au)
@@ -15,7 +15,7 @@ hiplexpipe is based on the [Ruffus](http://www.ruffus.org.uk/) library for writi
 
 See LICENSE.txt in source repository.
 
-## Installation
+## Installation dependencies
 
 #### External tools dependencies
 
@@ -46,20 +46,12 @@ See LICENSE.txt in source repository.
 
 We recommend using a Python virtual environment. The following is an example of how to set up a `hiplexpipe` virtual environment ready for analysis:
 
-## Getting started
+## Setup: New environment
 
 The following instructions are based on the Melbourne Bioinformatics computing infrastructure.
 
-### Step 1. Setup Python
-    `module load Python/2.7.12-GCC-4.9.3`
-### Step 2. Load software requirements
-    `export DRMAA_LIBRARY_PATH=/usr/local/slurm_drmaa/1.0.7-GCC/lib/libdrmaa.so`
-    Other Software dependancies    
-    `module load BEDTools/2.26.0-vlsci_intel-2015.08.25`
-    `bgzip, samtools, picard`
-
-### Step 3. Installing hiplexpipe.
-    ```
+    module load Python/2.7.12-GCC-4.9.3
+    export DRMAA_LIBRARY_PATH=/usr/local/slurm_drmaa/1.0.7-GCC/lib/libdrmaa.so
     virtualenv --system-site-packages hiplexpipe
     source hiplexpipe/bin/activate
     pip install jupyter
@@ -70,57 +62,98 @@ The following instruction are based on the Melbourne Bioinformatics computing in
     pip install -U https://github.com/khalidm/offtarget/archive/master.zip
     mkdir references
     mkdir coverage
-    ```
+##### Test pipeline works with:
+
+    hiplexpipe --config pipeline.config --use_threads --log_file pipeline.log --jobs 10 --verbose 3 --just_print
+
+## Setup: New gene panel
+
+#### Hi-Plex primer files
+
+You should have two target interval files for every Hi-Plex experiment.
+
+* rover.txt - this contains the amplicon regions and primer sequences.
+* idt.txt - this file contains the primer sequences and their names matching the names in the above rover.txt file.
+
+_Make sure heel sequences are removed from the rover.txt file._
+
+#### Additional interval files
+
+Follow the instructions below to prepare the interval files for the pipeline. (We are working on a tool to automate this task.)
 
-    Test:
+##### Main rover bed file. (rover.bed)
+Each interval in this bed file is the midpoint of each amplicon. This file is used to calculate alignment and coverage statistics.
 
-    ```
-    hiplexpipe --config pipeline.config --use_threads --log_file pipeline.log --jobs 10 --verbose 3
-     --just_print
-     ```
+    cut -f1,2,3,4,5 rover.txt > rover.bed
+or
 
-### Step 4. Preparing pipeline config files
+    awk ' BEGIN{FS="\t";OFS="\t"}; { print $1,int($2+($3-$2)/2),int($3-($3-$2)/2),$4,$5} ' rover.txt > rover.bed
 
+##### Primer coordinates file. (primer.bedpe)
+This file is used to clip primer sequences from the alignments.
+
+    awk ' BEGIN{FS="\t";OFS="\t"}; { print $1,$7,$8,$1,$12,$11} ' rover.txt > primer.bedpe
+
+##### Create intervals for GATK variant calling (rover.gatk.bed)
+This creates a bed file of intervals for GATK variant calling. It differs from rover.bed in that each target is padded by 10 bp and overlapping targets are merged, so its main function is to define the regions in which variants are called.
+
+    cut -f1,2,3 rover.txt | bedtools slop -i - -b 10 -g hg19.genome | bedtools merge -i - > rover.gatk.bed
+
+## New analysis  
+#### Step 1. Load software requirements
+    module load Python/2.7.12-GCC-4.9.3
+    export DRMAA_LIBRARY_PATH=/usr/local/slurm_drmaa/1.0.7-GCC/lib/libdrmaa.so
+    module load BEDTools/2.26.0-vlsci_intel-2015.08.25
+    module load SAMtools
+    module load VCFtools
+
+#### Step 2. Preparing pipeline config files
 I have created a template config file (pipeline_template.config) for all these analyses.
 
-    For a new analysis - create a new directory ( I have created 4gp_analysis for the 4gene panel).
-    Make a copy of pipeline_template.config in the new analysis directory. Make relevant changes to the new config file.
-        - change pipeline_id and add fastq file paths
-        - under the comment "hiplex files" - amend paths to files relevant to the design
-        - see 4gp_analysis/pipeline.config as an example
+1. Create a new directory for the analysis.
+2. Make a copy of pipeline_template.config in the new analysis directory.
+3. Make relevant changes to the new config file:
+  - change pipeline_id
+  - add fastq file paths
+  - under the comment "hiplex files" - amend paths to files relevant to the design
+
+#### Step 3: Create new screen and load modules
+Log in to snowy (HPC).
+
+Run the following commands:
+
+    module load Python/2.7.12-GCC-4.9.3
+    screen -S new_analysis
+    module purge
+    module load vlsci
+    module load Python/2.7.12-GCC-4.9.3
+    module load SAMtools
+    module load VCFtools
+    source hiplexpipe/bin/activate
 
-### Step 5. Run the pipeline - this is an example for the '4gp' analysis
+#### Step 4: Run hiplexpipe
 
-a. log in to snowy
-b. module load Python/2.7.12-GCC-4.9.3
-c. screen -S new_analysis
-d. module purge
-e. module load vlsci
-f. module load Python/2.7.12-GCC-4.9.3
-g. module load SAMtools
-h. module load VCFtools
-i. source hiplexpipe/bin/activate (assuming above Step 3 is performed)
-j. command: `hiplexpipe --config pipeline.config --use_threads --log_file pipeline.log --jobs 50 --verbose 2`
+    hiplexpipe --config pipeline.config --use_threads --log_file pipeline.log --jobs 50 --verbose 2
 
-### Step 6. Generate alignment statistics
+## Generate statistics
 
-a. alignment statistics: from within the virtualenv run the follwing command:
-    `python alignment_stats.py > stats.txt`
+### Alignment statistics
 
-### Step 7. Generate heatmaps for alignment coverage
-    `jupyter nbconvert --ExecutePreprocessor.timeout=6000 --to html --execute coverage_analysis_main.ipynb`
-    This will output 'coverage_analysis_main.html` file.
+From within the virtualenv, run the following command:
 
-## Preparing target region files
+    python alignment_stats.py > stats.txt
 
-You should have two target interval files for every Hi-Plex experiment.
+This will generate a table containing various alignment statistics for each sample.
 
-rover.txt - this contains the amplicon regions and primer sequences.
-idt.txt - this file contains the primer sequences and their names matching the names in the above rover.txt file.
+### Heatmaps for alignment coverage
 
-Follow instructions below to prepare the intervals files for the pipeline. (We are working on a tool to automate this task).
+    jupyter nbconvert --ExecutePreprocessor.timeout=6000 --to html --execute coverage_analysis_main.ipynb
+
+This will output the `coverage_analysis_main.html` file.
+
+### Offtarget
+Generates statistics on which amplicons are mapping to incorrect regions of the genome, or not mapping at all.
+
+Run for a few samples picked at random.
 
-* Main rover bed file. (rover.bed) This file is used to calculate alignment and coverage statistics. cut -f1,2,3,4,5 rover.txt > rover.bed or awk ' BEGIN{FS="\t";OFS="\t"}; { print $1,int($2+($3-$2)/2),int($3-($3-$2)/2),$4,$5} ' rover.txt > rover.bed
-* Primer coordinates file. (primer.bedpe) This file is used to clip primer sequences from the alignments. awk ' BEGIN{FS="\t";OFS="\t"}; { print $1,$7,$8,$1,$12,$11} ' rover.txt > primer.bedpe
-* Create intervals for GATK variant calling
-`cut -f1,2,3 rover.txt | bedtools slop -i - -b 10 -g hg19.genome | bedtools merge -i - > rover.gatk.bed`
+    offtarget --primers <rover.txt> --fastq1 <fastq_read_1> --fastq2 <fastq_read_2> --bam <sorted bam file> --log offtarget.log > output.txt
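
Note on the rover.bed midpoint command: the `awk` one-liner added in this diff collapses each amplicon to its midpoint. The sketch below mirrors that arithmetic in Python for a single row, to make the truncation behaviour easier to check. It is an illustration only, not part of hiplexpipe. The function name `midpoint_row` and the sample row are hypothetical, and it assumes rover.txt is tab-separated with at least five columns, with the start and end coordinates in columns 2 and 3, as the `awk` command implies.

```python
def midpoint_row(line):
    """Mirror the awk midpoint command for one tab-separated rover.txt row.

    Assumes columns 1-3 are chromosome, start and end, and that columns
    4 and 5 are passed through unchanged (as in the awk command).
    """
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    start = float(fields[1])
    end = float(fields[2])
    half = (end - start) / 2
    # awk's int() truncates toward zero, and so does Python's int().
    mid_start = int(start + half)
    mid_end = int(end - half)
    return "\t".join([chrom, str(mid_start), str(mid_end), fields[3], fields[4]])


# Hypothetical row: a 101 bp amplicon with two extra trailing columns.
example = "chr1\t100\t201\tamp1\tACGT\textra"
print(midpoint_row(example))
```

For amplicons of both even and odd length, the two computed coordinates come out equal, so the resulting interval has zero length in bed terms. This may be worth keeping in mind when the file is passed to tools that expect a non-empty interval.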