PDX Mouse Subtraction Pipeline

Authors: Oliver Hampton, Chase Miller, Liu Xi, Maria Cardenas and Komal S. Rathi
Contact: Komal S. Rathi
Organization: DBHi, CHOP
Status: Completed
Date: 2019-07-18


The goal of this repo is to make the Mouse subtraction pipeline from BCM (Wheeler Lab) reproducible.


  1. Create python3 environment:
conda create --name pdx-subtract-env
conda activate pdx-subtract-env
conda install -c bioconda samtools
conda install -c bioconda htslib
conda install -c bioconda sambamba
conda install -c bioconda picard
conda install -c bioconda cufflinks
conda install -c anaconda java-1.7.0-openjdk-cos6-x86_64 # required by rna-seqc
conda install -c bioconda rna-seqc
conda install -c bioconda htseq
conda install -c bioconda star=2.5.3a
conda install -c bioconda trinity=2.5.1 # required by star-fusion
conda install -c bioconda star-fusion=1.1.0
conda install -c bioconda bwa
conda install -c bioconda alignstats
conda config --add channels
conda install defuse
conda install -c bioconda bamutil
  2. Create python2 environment (STAR-Fusion v1.0.1 is python 2.7 compatible):
conda create --name star-fusion-env python=2.7
source activate star-fusion-env
conda install -c bioconda star-fusion
conda install -c bioconda trinity
conda install -c conda-forge -c bioconda samtools bzip2
conda install -c conda-forge configparser
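After activating either environment, it can help to confirm that the expected executables are actually on PATH before starting a long run. The following is a minimal sketch, not part of the original pipeline; `check_tools` is a hypothetical helper name, and which executable each conda package provides is an assumption you should confirm for your install.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline):
# print each executable that is NOT found on PATH, one per line,
# and return 1 if anything is missing, 0 otherwise.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_tools samtools tabix bgzip sambamba bwa STAR cufflinks htseq-count` run inside pdx-subtract-env prints nothing when all of those commands resolve.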

# install some non-standard perl modules:
perl -MCPAN -e shell
install DB_File
install URI::Escape
install Set::IntervalTree
install Carp::Assert
install JSON::XS


# SOAPfuse has to be installed separately as it is not available on conda
tar -xzf SOAPfuse-v1.26.tar.gz
cd SOAPfuse-v1.26

# get SOAPfuse database
cd /mnt/isilon/cbmi/variome/reference/soapfuse_db

# update SOAPfuse config file according to
# add cytoBand file from ucsc and update SOAPfuse config
wget hg19-GRCh37.59/
gunzip cytoBand.txt.gz

# change PA_all_fq_postfix in config file to .fq


# for deFUSE, python 2 is required so use the python2 environment created for STAR-Fusion

# Install via source:

# in the tools directory, download boost
cd tools && wget
tar -zxvf boost_1_68_0.tar.gz
export CPLUS_INCLUDE_PATH=/mnt/isilon/maris_lab/target_nbl_ngs/PPTC-PDX-genomics/mouse_subtraction_pipeline/scripts/dranew-defuse-0f198c242b82/tools/boost_1_68_0
cd tools && make

# download deFUSE reference database
# change perl in to /usr/bin/env perl -d /mnt/isilon/cbmi/variome/reference/defuse_db/hg19/


# get reference files and prepare corresponding index files
gunzip 1000G_phase1.indels.b37.vcf.gz
bgzip 1000G_phase1.indels.b37.vcf
tabix -p vcf 1000G_phase1.indels.b37.vcf.gz

gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.gz
bgzip Mills_and_1000G_gold_standard.indels.b37.vcf
tabix -p vcf Mills_and_1000G_gold_standard.indels.b37.vcf.gz

gunzip dbsnp_138.b37.vcf.gz
bgzip dbsnp_138.b37.vcf
tabix -p vcf dbsnp_138.b37.vcf.gz
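The three gunzip/bgzip/tabix steps follow the same pattern, so they can be wrapped in one function. This is a sketch under the assumption that `gunzip`, `bgzip`, and `tabix` are on PATH; `recompress_and_index` is a hypothetical name, not a script from this repo.

```shell
# Hypothetical helper: re-compress a gzip'd VCF with bgzip and index it.
# Argument: path to a .vcf.gz file. Returns 1 if the file does not exist.
recompress_and_index() {
    local gz="$1"
    local vcf="${gz%.gz}"    # same path without the .gz suffix
    if [ ! -f "$gz" ]; then
        echo "not found: $gz" >&2
        return 1
    fi
    gunzip "$gz" &&          # leaves the plain-text VCF at $vcf
        bgzip "$vcf" &&      # writes $gz again, now block-compressed
        tabix -p vcf "$gz"   # writes the index to $gz.tbi
}
```

It could then be called once per file, for example `recompress_and_index dbsnp_138.b37.vcf.gz`. Deriving both file names from a single argument keeps the compressed file and the indexed file from drifting apart.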

Download reference files:

# run this code to create output directories and download reference data

Prepare reference fasta and gtf:

# Code to prepare reference fasta and gtf (this may be inaccurate, because the reference files were obtained from BCM):
bash scripts/
# make sure all reference fasta files are indexed:
samtools faidx <file.fasta|file.fa>

# make sure the fasta reference used by bwa is indexed:
bwa index protein_coding_canonical.T_chr.fa
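To see whether a reference has already been indexed, one option is to look for the files these commands write: `samtools faidx` writes `<fasta>.fai`, and `bwa index` writes `.amb`, `.ann`, `.bwt`, `.pac`, and `.sa` next to the FASTA. The helper names below are hypothetical, and the check only confirms that the files exist, not that they are current.

```shell
# Hypothetical helper: return 0 if a samtools faidx index (.fai)
# exists next to the FASTA given as $1.
has_faidx() {
    [ -f "$1.fai" ]
}

# Hypothetical helper: return 0 if all five files written by
# `bwa index` exist next to the FASTA given as $1.
has_bwa_index() {
    local ext
    for ext in amb ann bwt pac sa; do
        [ -f "$1.$ext" ] || return 1
    done
    return 0
}
```

For example, `has_bwa_index protein_coding_canonical.T_chr.fa || bwa index protein_coding_canonical.T_chr.fa` would only rebuild the index when a file is missing.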

BCM-specific scripts and software:

1. pindel_0.2.5b5_tdonly
2. ERCCPlot.jar

Steps to run the RNA-pipeline:

The RNA pipeline is divided into four steps:

  1. Snakefile_Phase1: Align PDX RNA-seq data to the hybrid genome, split the alignments into human and mouse BAMs, and create human-specific fastq files.
  2. Snakefile_Phase2: Realign to the human reference, run QC, and run htseq and pindel.
  3. Snakefile_fusions_py2: Run the python2-dependent fusion callers (STAR-Fusion and deFUSE).
  4. Snakefile_soapfuse: Run the python3-dependent fusion caller (SOAPfuse).

Each snakefile has a corresponding bash script to run the pipeline:

# Run phase 1
cd rna-hybrid && bash

# Run phase 2
cd rna-hybrid && bash

# Run python2 based fusion callers
cd rna-hybrid && bash

# Run python3 based fusion callers
cd rna-hybrid && bash
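Before launching the wrapper scripts, a Snakemake dry run of each Snakefile in the order listed above can surface missing inputs early. The sketch below assumes that `snakemake` is installed, that the Snakefiles live in the directory you pass in (rna-hybrid is an assumption), and that a bare dry run is meaningful; the wrapper scripts may pass extra options that would also be needed here. `dry_run_all` is a hypothetical name.

```shell
# Snakefile names as listed in the four steps above, in run order.
SNAKEFILES="Snakefile_Phase1 Snakefile_Phase2 Snakefile_fusions_py2 Snakefile_soapfuse"

# Hypothetical helper: dry-run each Snakefile in order and stop at the
# first one that is missing or fails.
# Argument: directory holding the Snakefiles (assumed: rna-hybrid).
dry_run_all() {
    local dir="$1"
    local sf
    for sf in $SNAKEFILES; do
        if [ ! -f "$dir/$sf" ]; then
            echo "missing: $dir/$sf" >&2
            return 1
        fi
        echo "dry run: $sf"
        # -s selects the Snakefile, -n asks for a dry run (nothing is executed)
        (cd "$dir" && snakemake -s "$sf" -n) || return 1
    done
}
```

Calling `dry_run_all rna-hybrid` from the repository root would then list the jobs each phase plans to run without executing them.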

Steps to run the DNA-pipeline:

cd dna-pipeline && bash
