updated readme file.
khalidm committed Apr 26, 2018
1 parent a669565 commit 03455b0
Showing 1 changed file with 84 additions and 51 deletions.
135 changes: 84 additions & 51 deletions README.md
@@ -1,5 +1,5 @@
# hiplexpipe

***
## A bioinformatics pipeline for variant calling for [Hi-Plex](http://hiplex.org/) sequencing.

Author: Khalid Mahmood (kmahmood@unimelb.edu.au)
@@ -15,7 +15,7 @@ hiplexpipe is based on the [Ruffus](http://www.ruffus.org.uk/) library for writi

See LICENSE.txt in source repository.

## Installation
## Installation dependencies

#### External tools dependencies

@@ -46,20 +46,12 @@ See LICENSE.txt in source repository.

We recommend using a Python virtual environment. The following is an example of how to set up a `hiplexpipe` virtual environment ready for analysis:

## Getting started
## Setup: New environment

The following instructions are based on the Melbourne Bioinformatics computing infrastructure.

### Step 1. Setup Python
`module load Python/2.7.12-GCC-4.9.3`
### Step 2. Load software requirements
`export DRMAA_LIBRARY_PATH=/usr/local/slurm_drmaa/1.0.7-GCC/lib/libdrmaa.so`
Other software dependencies
`module load BEDTools/2.26.0-vlsci_intel-2015.08.25`
`bgzip, samtools, picard`

### Step 3. Installing hiplexpipe.
```
module load Python/2.7.12-GCC-4.9.3
export DRMAA_LIBRARY_PATH=/usr/local/slurm_drmaa/1.0.7-GCC/lib/libdrmaa.so
virtualenv --system-site-packages hiplexpipe
source hiplexpipe/bin/activate
pip install jupyter
@@ -70,57 +62,98 @@ The following instruction are based on the Melbourne Bioinformatics computing in
pip install -U https://github.com/khalidm/offtarget/archive/master.zip
mkdir references
mkdir coverage
```
##### Test pipeline works with:

hiplexpipe --config pipeline.config --use_threads --log_file pipeline.log --jobs 10 --verbose 3 --just_print

## Setup: New gene panel

#### Hi-Plex primer files

You should have two target interval files for every Hi-Plex experiment.

* rover.txt - this contains the amplicon regions and primer sequences.
* idt.txt - this file contains the primer sequences and their names matching the names in the above rover.txt file.

_Make sure heel sequences are removed from the rover.txt file._

#### Additional interval files

Follow the instructions below to prepare the interval files for the pipeline. (We are working on a tool to automate this task.)

Test:
##### Main rover bed file. (rover.bed)
Each interval in this bed file is the midpoint of each amplicon. This file is used to calculate alignment and coverage statistics.

```
hiplexpipe --config pipeline.config --use_threads --log_file pipeline.log --jobs 10 --verbose 3
--just_print
```
cut -f1,2,3,4,5 rover.txt > rover.bed
or

### Step 4. Preparing pipeline config files
awk ' BEGIN{FS="\t";OFS="\t"}; { print $1,int($2+($3-$2)/2),int($3-($3-$2)/2),$4,$5} ' rover.txt > rover.bed
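
To make the midpoint calculation concrete, here is a small worked sketch. The input line is made up for illustration: only columns 1 to 3 (chromosome, start, end) are implied by the commands above, and the values in columns 4 and 5 are placeholders.

```shell
# Hypothetical rover.txt-style line (tab-separated); columns 4 and 5 are placeholders.
sample_line=$(printf 'chr1\t100\t200\tamp1\t0')

# Same awk expression as in the rover.bed command, applied to the single sample line.
midpoint_line=$(printf '%s\n' "$sample_line" | awk 'BEGIN{FS="\t";OFS="\t"} { print $1,int($2+($3-$2)/2),int($3-($3-$2)/2),$4,$5 }')

printf '%s\n' "$midpoint_line"
```

For this amplicon spanning 100 to 200, the start and end both collapse to 150, so the record marks a single midpoint position.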

##### Primer coordinates file. (primer.bedpe)
This file is used to clip primer sequences from the alignments.

awk ' BEGIN{FS="\t";OFS="\t"}; { print $1,$7,$8,$1,$12,$11} ' rover.txt > primer.bedpe
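
A quick sanity check on the resulting file can catch column mix-ups early. The sketch below assumes that primer.bedpe should have six tab-separated columns with start not greater than end in each coordinate pair. That expectation comes from the general BEDPE layout, not from hiplexpipe itself, and the sample records are invented.

```shell
# Invented records standing in for primer.bedpe
# (chrom1, start1, end1, chrom2, start2, end2).
bedpe_file=$(mktemp)
printf 'chr1\t100\t120\tchr1\t180\t200\n' >  "$bedpe_file"
printf 'chr2\t500\t520\tchr2\t610\t590\n' >> "$bedpe_file"   # deliberately reversed pair

# Collect line numbers that do not have 6 columns or whose start exceeds end.
bad_lines=$(awk 'BEGIN{FS="\t"} NF!=6 || $2>$3 || $5>$6 { print NR }' "$bedpe_file")

echo "Suspicious lines: ${bad_lines:-none}"
rm -f "$bedpe_file"
```

To check a real file, point `bedpe_file` at primer.bedpe and drop the `printf` and `rm` lines.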

##### Create intervals for GATK variant calling (gatk.bed)
This creates a BED file of intervals for GATK variant calling. Unlike rover.bed, it pads each target by 10 bp on both sides and merges overlapping targets, so it serves mainly as the target region for variant calling.

cut -f1,2,3 rover.txt | bedtools slop -i - -b 10 -g hg19.genome | bedtools merge -i - > rover.gatk.bed
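
The `-g hg19.genome` argument expects a bedtools genome file: a two-column, tab-separated list of chromosome names and lengths. If one is not already available, it can usually be derived from a FASTA index. The sketch below uses a made-up two-line `.fai` so that it runs anywhere. With a real reference the index would come from `samtools faidx`, and the file name `hg19.fa.fai` is an assumption.

```shell
workdir=$(mktemp -d)

# Made-up FASTA index (name, length, offset, linebases, linewidth); real lengths differ.
printf 'chr1\t1000\t6\t60\t61\n'   >  "$workdir/hg19.fa.fai"
printf 'chr2\t800\t1030\t60\t61\n' >> "$workdir/hg19.fa.fai"

# The genome file is just the first two columns of the index.
cut -f1,2 "$workdir/hg19.fa.fai" > "$workdir/hg19.genome"

genome_contents=$(cat "$workdir/hg19.genome")
printf '%s\n' "$genome_contents"
rm -rf "$workdir"
```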

## New analysis
#### Step 1. Load software requirements
module load Python/2.7.12-GCC-4.9.3
export DRMAA_LIBRARY_PATH=/usr/local/slurm_drmaa/1.0.7-GCC/lib/libdrmaa.so
module load BEDTools/2.26.0-vlsci_intel-2015.08.25
module load SAMtools
module load VCFtools
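
After loading the modules, it can help to confirm that the tools are actually on the `PATH` before starting a long run. This is a minimal sketch: the executable names (`python`, `bedtools`, `samtools`, `vcftools`) are assumptions inferred from the module names above, and `find_missing` is a helper defined here, not part of hiplexpipe.

```shell
# Print the names of any commands that cannot be found on PATH.
find_missing() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    printf '%s' "${missing# }"
}

# Executable names are assumed from the module names.
missing_tools=$(find_missing python bedtools samtools vcftools)
if [ -n "$missing_tools" ]; then
    echo "Not found on PATH: $missing_tools" >&2
else
    echo "All required tools found"
fi

# DRMAA_LIBRARY_PATH should point at an existing library file.
if [ ! -f "${DRMAA_LIBRARY_PATH:-}" ]; then
    echo "DRMAA_LIBRARY_PATH is unset or does not point to a file" >&2
fi
```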

#### Step 2. Preparing pipeline config files
I have created a template config file (pipeline_template.config) for all these analyses.

For a new analysis, create a new directory (I have created 4gp_analysis for the 4-gene panel).
Make a copy of pipeline_template.config in the new analysis directory. Make relevant changes to the new config file.
- change pipeline_id and add fastq file paths
- under the comment "hiplex files" - amend paths to files relevant to the design
- see 4gp_analysis/pipeline.config as an example
1. Create a new directory for the analysis
1. Make a copy of pipeline_template.config in the new analysis directory.
2. Make relevant changes to the new config file.
- change pipeline_id
- add fastq file paths
- under the comment "hiplex files" - amend paths to files relevant to the design
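
The directory and copy steps above can be sketched as follows. The directory name `new_panel_analysis` is a placeholder, and the script assumes `pipeline_template.config` sits in the current directory. Editing `pipeline_id`, the fastq paths and the "hiplex files" paths is still done by hand afterwards.

```shell
# Names below are placeholders; pick ones that match your own panel.
analysis_dir="new_panel_analysis"
template="pipeline_template.config"

mkdir -p "$analysis_dir"
if [ -f "$template" ]; then
    # Copy only if no config exists yet, so earlier edits are not overwritten.
    [ -f "$analysis_dir/pipeline.config" ] || cp "$template" "$analysis_dir/pipeline.config"
else
    echo "Template $template not found in $(pwd)" >&2
fi
```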

#### Step 3: Create new screen and load modules
Log in to snowy (HPC).

Run the following commands:

module load Python/2.7.12-GCC-4.9.3
screen -S new_analysis
module purge
module load vlsci
module load Python/2.7.12-GCC-4.9.3
module load SAMtools
module load VCFtools
source hiplexpipe/bin/activate

### Step 5. Run the pipeline - this is an example for the '4gp' analysis
#### Step 4: Run hiplexpipe

a. log in to snowy
b. module load Python/2.7.12-GCC-4.9.3
c. screen -S new_analysis
d. module purge
e. module load vlsci
f. module load Python/2.7.12-GCC-4.9.3
g. module load SAMtools
h. module load VCFtools
i. source hiplexpipe/bin/activate (assuming above Step 3 is performed)
j. command: `hiplexpipe --config pipeline.config --use_threads --log_file pipeline.log --jobs 50 --verbose 2`
hiplexpipe --config pipeline.config --use_threads --log_file pipeline.log --jobs 50 --verbose 2
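
One way to combine the `--just_print` test from the setup section with the real run is sketched below. It uses only the flags shown in this README. It assumes that `hiplexpipe` returns a non-zero exit code when the dry run fails, which has not been verified here.

```shell
config="pipeline.config"

if ! command -v hiplexpipe >/dev/null 2>&1; then
    run_status="hiplexpipe not found (is the virtualenv active?)"
elif [ ! -f "$config" ]; then
    run_status="missing $config"
elif hiplexpipe --config "$config" --use_threads --log_file pipeline.log --jobs 10 --verbose 3 --just_print; then
    # Dry run succeeded; start the real run with the settings shown above.
    hiplexpipe --config "$config" --use_threads --log_file pipeline.log --jobs 50 --verbose 2
    run_status="finished with exit code $?"
else
    run_status="dry run failed"
fi

echo "$run_status"
```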

### Step 6. Generate alignment statistics
## Generate statistics

a. alignment statistics: from within the virtualenv run the following command:
`python alignment_stats.py > stats.txt`
### Alignment statistics

### Step 7. Generate heatmaps for alignment coverage
`jupyter nbconvert --ExecutePreprocessor.timeout=6000 --to html --execute coverage_analysis_main.ipynb`
This will output the `coverage_analysis_main.html` file.
From within the virtualenv, run the following command:

## Preparing target region files
python alignment_stats.py > stats.txt

You should have two target interval files for every Hi-Plex experiment.
This will generate a table containing various alignment statistics for each sample.

rover.txt - this contains the amplicon regions and primer sequences.
idt.txt - this file contains the primer sequences and their names matching the names in the above rover.txt file.
### Heatmaps for alignment coverage

Follow instructions below to prepare the intervals files for the pipeline. (We are working on a tool to automate this task).
jupyter nbconvert --ExecutePreprocessor.timeout=6000 --to html --execute coverage_analysis_main.ipynb

This will output the `coverage_analysis_main.html` file.

### Offtarget
Generates statistics on which amplicons are mapping to incorrect regions of the genome, or not mapping at all.

Run for a few samples picked at random.

* Main rover bed file. (rover.bed) This file is used to calculate alignment and coverage statistics. cut -f1,2,3,4,5 rover.txt > rover.bed or awk ' BEGIN{FS="\t";OFS="\t"}; { print $1,int($2+($3-$2)/2),int($3-($3-$2)/2),$4,$5} ' rover.txt > rover.bed
* Primer coordinates file. (primer.bedpe) This file is used to clip primer sequences from the alignments. awk ' BEGIN{FS="\t";OFS="\t"}; { print $1,$7,$8,$1,$12,$11} ' rover.txt > primer.bedpe
* Create intervals for GATK variant calling
`cut -f1,2,3 rover.txt | bedtools slop -i - -b 10 -g hg19.genome | bedtools merge -i - > rover.gatk.bed`
offtarget --primers <rover.txt> --fastq1 <fastq_read_1> --fastq2 <fastq_read_2> --bam <sorted bam file> --log offtarget.log > output.txt
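
Picking a few samples at random can be scripted. The sketch below is illustrative only: the sample names and the file naming scheme (`_R1.fastq.gz`, `.sorted.bam`) are hypothetical, and it relies on `shuf` from GNU coreutils. It prints the `offtarget` commands instead of running them, so the selection can be reviewed first.

```shell
# Hypothetical sample names and file naming scheme; adjust to your data.
all_samples="sampleA sampleB sampleC sampleD sampleE"
n_pick=2

# Randomly choose n_pick samples (one name per line).
picked=$(printf '%s\n' $all_samples | shuf -n "$n_pick")

for s in $picked; do
    # Print the command rather than executing it.
    echo "offtarget --primers rover.txt --fastq1 ${s}_R1.fastq.gz --fastq2 ${s}_R2.fastq.gz --bam ${s}.sorted.bam --log ${s}.offtarget.log > ${s}.offtarget.txt"
done
```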
