GitHub - lhon/iterative-mapping-phasing: A bioinformatics workflow for haplotype phasing using PacBio/Cergentis data

Iterative Mapping and Phasing

This repository describes a workflow that takes a Cergentis TLA prepared sample sequenced on a PacBio® RS II and extracts phasing information based on heterozygous SNPs.

Installation

To use the scripts, first clone this repository:

git clone https://github.com/lhon/iterative-mapping-phasing.git

The scripts require BLASR, samtools, and bcftools to be on your PATH. One way to get these tools is to install them using Linuxbrew:

brew install blasr samtools bcftools
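Before running the workflow, it can help to confirm that the three tools are actually visible on the PATH. The following is a minimal sketch, not part of this repository: the function name missing_tools is hypothetical, and it assumes Python 3 for shutil.which (the workflow itself was tested with Python 2.7).

```python
import shutil

# Executable names as installed by the brew command above.
REQUIRED_TOOLS = ["blasr", "samtools", "bcftools"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

This only checks that the executables exist; it does not verify their versions.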

The workflow was tested with the following software versions on Linux:

python=2.7
blasr=2.1
samtools=1.2
bcftools=1.2

HAPCUT was recompiled with a change to allow paired end analysis to work properly; the resulting binaries are included in this repository. The change is documented at https://github.com/lhon/hapcut

Running

First, generate Reads of Insert from the SMRT® cell data. The easiest way is to run the RS_ReadsOfInsert protocol in SMRT Portal with default parameters.

The data from reads_of_insert.fasta can then be aligned and phased using a command similar to this:

/path/to/phase.sh reads_of_insert.fasta hg19.fasta chr17:41,194,312-41,279,500 output_dir/

This performs the following steps:

Iteratively aligns reads_of_insert.fasta to hg19.fasta using map.py and blasr
Determines SNPs in BRCA1 (chr17:41,194,312-41,279,500) using samtools and bcftools
Generates a paired-end formatted SAM file for use with HAPCUT using the pair.py script
Calculates phasing using the high quality SNPs via HAPCUT
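The region argument in the command above is written with thousands separators (chr17:41,194,312-41,279,500). As an illustration of how such a string breaks down into a chromosome and coordinates, here is a small sketch; parse_region is a hypothetical helper and is not taken from phase.sh or the other scripts in this repository.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end).

    Commas used as thousands separators are removed from the coordinates.
    """
    chrom, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if not chrom or start > end:
        raise ValueError("bad region: %r" % region)
    return chrom, start, end


chrom, start, end = parse_region("chr17:41,194,312-41,279,500")
# Treating both coordinates as inclusive gives the region length in bases.
print(chrom, start, end, end - start + 1)
```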

The intermediate files and results are placed in output_dir/. The key output is output_haplotype_file which reports the phasing information. The file format is described at https://github.com/vibansal/hapcut#format-of-input-and-output-files

Notes on the Analysis

Each read typically consists of several segments ligated together, generally from nearby genomic locations on the same allele. To get mapping information for each segment, we first map the entire read, then iteratively repeat the process on the portions of the read that remain unmapped. This is done via map.py.
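The idea of re-mapping unmapped portions can be sketched as follows. This is an illustration under stated assumptions, not the code in map.py: the align callable stands in for a real aligner call, and the min_len and max_rounds parameters are hypothetical choices that do not come from this repository.

```python
def unmapped_gaps(start, end, mapped, min_len=50):
    """Return the parts of [start, end) not covered by `mapped` intervals.

    Gaps shorter than `min_len` are dropped as too short to realign.
    """
    gaps = []
    pos = start
    for m_start, m_end in sorted(mapped):
        if m_start - pos >= min_len:
            gaps.append((pos, m_start))
        pos = max(pos, m_end)
    if end - pos >= min_len:
        gaps.append((pos, end))
    return gaps


def iterative_map(read_len, align, min_len=50, max_rounds=10):
    """Map a read, then keep mapping whatever is left unmapped.

    `align(start, end)` is a hypothetical callable returning the read
    interval (start, end) of the best hit within that sub-read, or None.
    """
    segments = []
    todo = [(0, read_len)]
    for _ in range(max_rounds):
        next_todo = []
        for start, end in todo:
            hit = align(start, end)
            if hit is None:
                continue  # nothing aligns here; give up on this piece
            segments.append(hit)
            # The flanks around the hit are candidates for the next round.
            next_todo.extend(unmapped_gaps(start, end, [hit], min_len))
        if not next_todo:
            break
        todo = next_todo
    return sorted(segments)
```

With a read made of three ligated segments, the first round would find one segment and later rounds would pick up the flanking ones.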

The phasing is performed by HAPCUT, which supports mate-pair style reads. Because the Cergentis/PacBio data can have more than two segments per read, the data needs to be represented in a mate-pair style. This is done via pair.py.
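One conceivable way to express a read with more than two segments as mate pairs is to emit every pairwise combination of its segments. The sketch below shows that scheme only as an assumption for illustration; the pairing actually used by pair.py may differ, and segments_to_pairs and the pair naming are hypothetical.

```python
from itertools import combinations


def segments_to_pairs(read_name, segments):
    """Turn the segments of one read into a list of two-segment pairs.

    Each pair gets its own name so that it can be written out as a
    separate mate pair. A read with fewer than two segments yields
    no pairs.
    """
    pairs = []
    for index, (first, second) in enumerate(combinations(segments, 2)):
        pair_name = "%s/pair%d" % (read_name, index)
        pairs.append((pair_name, first, second))
    return pairs


for pair in segments_to_pairs("read1", ["segA", "segB", "segC"]):
    print(pair)
```

Note that an all-pairs scheme uses each segment in several pairs; pairing only adjacent segments is an alternative that avoids this.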

Analysis of BRCA1 using this workflow was presented at AGBT 2016.
