khanlab_pipeline

This is the implementation of KhanLab NGS Pipeline using Snakemake.

Khanlab pipeline supports the following NGS data types:

HiC
RNAseq
ChIPseq
Pacbio
DNAseq

Installation

The easiest way to get this pipeline is to clone the repository.

git clone git@github.com:hsienchao/khanlab_pipeline.git

This pipeline is available on the NIH Biowulf cluster; contact me if you would like to do a test run. Data from this pipeline can be ported directly into OncoGenomics-DB, an application for visualizing NGS data that is available to NIH users.

Requirements

snakemake 5.13.0
mutt
gnu parallel
SLURM resource management

HiC:

[HiCPro](https://github.com/nservant/HiC-Pro)
[JuiceBox](https://github.com/aidenlab/Juicebox)
[Bowtie2](https://github.com/BenLangmead/bowtie2)

ChIPseq:

BWA
macs
rose
homer
coltron
samtools
bedtools

RNAseq:

STAR
RSEM
samtools
xengsort
HLAminer
seq2HLA

Pacbio:

Isoseq3
cDNA_cupcake
sqanti3

Conventions

Sample names cannot have "/" or "." in them
Fastq files end in ".fastq.gz"
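
The two conventions above can be checked before a run. The following is a minimal sketch, not part of the pipeline itself; the function names are hypothetical.

```python
def is_valid_sample_name(name: str) -> bool:
    """Return True if the name is non-empty and contains neither "/" nor "."."""
    return bool(name) and "/" not in name and "." not in name


def is_valid_fastq_name(filename: str) -> bool:
    """Return True if the file name ends in ".fastq.gz"."""
    return filename.endswith(".fastq.gz")
```

For example, `is_valid_sample_name("RH4_D6_H3K27ac_HiChIP_HKJ22BGX7")` is accepted, while a name such as `RH4.D6` is rejected.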

Running pipeline

Input sample sheet

Sample sheet in YAML format

Required columns

Genome (accepted values: hg19,hg38,mm10)
SampleFiles ( usually Sample_ + Library_ID + _ + FCID )
Other pipeline type specific columns
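
As an illustration of the required columns, the sketch below checks a sample sheet that has already been loaded into a Python dict (for instance with a YAML parser). It assumes the top-level `samples` key used in the examples in this README; the function names are hypothetical and are not part of the pipeline.

```python
ACCEPTED_GENOMES = {"hg19", "hg38", "mm10"}
REQUIRED_COLUMNS = ("Genome", "SampleFiles")


def sample_files_name(library_id: str, fcid: str) -> str:
    """Build the usual SampleFiles value: Sample_ + Library_ID + _ + FCID."""
    return f"Sample_{library_id}_{fcid}"


def check_sample_sheet(sheet: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for sample_id, columns in sheet.get("samples", {}).items():
        columns = columns or {}
        # Every sample needs the columns shared by all pipeline types.
        for col in REQUIRED_COLUMNS:
            if col not in columns:
                errors.append(f"{sample_id}: missing required column {col}")
        # Genome, when present, must be one of the accepted values.
        genome = columns.get("Genome")
        if genome is not None and genome not in ACCEPTED_GENOMES:
            errors.append(f"{sample_id}: unsupported genome {genome}")
    return errors
```

Pipeline-specific columns are deliberately not checked here, since they differ between HiC, ChIPseq, RNAseq and Pacbio.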

Example

See examples in HiC, RNAseq or ChIPseq

Generating a sample sheet

A sample sheet can be generated with the script scripts/sampleToYaml.py. Usage:

python scripts/sampleToYaml.py -s [SAMPLE_ID] -o [OUTPUT_FILE]

Example:

python scripts/sampleToYaml.py -s RH4_Ent6_H3K27ac_HiChIP_HH3JVBGX7 -o RH4_Ent6_H3K27ac_HiChIP_HH3JVBGX7.hic.yaml

Launching the pipeline

1. General launch script:

launch [options]

required options:

        -type|t         <string>        Pipeline type (available options: hic,chipseq,rnaseq,dnaseq)
        -workdir|w      <string>        Working directory where all the results will be stored.
        -sheet|s        <string>        Sample sheet in YAML format
        -genome|g       <string>        Genome version (default: hg38)

optional options:

        -datadir|d      <string>        FASTQ file location (default: /data/khanlab/projects/DATA)
        -dryrun                         Dryrun only
        -dag                            Generate DAG SVG

Example (hg38)

      launch -type hic -w /data/khanlab/projects/HiC/processed_DATA -s /data/khanlab/projects/HiC/processed_DATA/sample_sheets/RH4_D6_H3K27ac_HiChIP_HKJ22BGX7.hic.yaml

Example (hg19)

      launch -type hic -w /data/khanlab/projects/HiC/processed_DATA -s /data/khanlab/projects/HiC/processed_DATA/sample_sheets/RH4_D6_H3K27ac_HiChIP_HKJ22BGX7.hic.yaml -g hg19
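
When the launch script is driven from another program, it can help to assemble the command as an argument list. The sketch below uses only the options listed above; the helper name is hypothetical, and it assumes `launch` is on the PATH or that the call is made from the repository root.

```python
def build_launch_command(pipeline_type, workdir, sheet,
                         genome=None, datadir=None, dryrun=False):
    """Return the launch command as a list suitable for subprocess.run."""
    cmd = ["launch", "-type", pipeline_type, "-w", workdir, "-s", sheet]
    if genome:
        # When omitted, launch falls back to its default genome (hg38).
        cmd += ["-g", genome]
    if datadir:
        cmd += ["-d", datadir]
    if dryrun:
        cmd.append("-dryrun")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; adding `dryrun=True` first is a cheap way to confirm the sample sheet is picked up correctly.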

2. Launch by sample ID (Khanlab automation and regular Khanlab users):

    scripts/automate.sh [sampleID]

This script parses the Khanlab master files to determine the sequencing type automatically. It then retrieves the required columns from the HiC/ChIPseq sample sheets, checks that the FASTQ files are ready, and launches the pipeline.

For RNAseq samples, the genome is defined in the "SampleRef" column of the master file. For ChIPSeq/HiC samples, the genome version is defined in the "Genome" column. Multiple values can be given, separated by commas (e.g. hg19,hg38).
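
A comma-separated genome value has to be split into individual genome versions before anything is launched per genome. The sketch below shows one way to do that; it is an illustration with a hypothetical function name, not the code used by scripts/automate.sh.

```python
def parse_genomes(value: str) -> list:
    """Split a value such as "hg19,hg38" into ["hg19", "hg38"]."""
    # Tolerate stray spaces and empty fields (e.g. a trailing comma).
    return [g.strip() for g in value.split(",") if g.strip()]
```

Each element of the result would then correspond to one pipeline run for that sample.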

The Khanlab data location:

-- FASTQ files: /data/khanlab/projects/DATA
-- Processed data: /data/khanlab/projects/pipeline_production/processed_DATA
-- Sample sheets: /data/khanlab/projects/pipeline_production/sample_sheets
-- ChIPSeq sample sheet: /data/khanlab/projects/ChIP_seq/manage_samples/ChIP_seq_samples.xlsx
-- HiC sample sheet: /data/khanlab/projects/HiC/manage_samples/HiC_sample_sheet.xlsx

Automation workflow
Example 1: process HiC sample

   scripts/automate.sh RH4_Ent6_H3K27ac_HiChIP_HH3JVBGX7

Example 2: process ChIPseq samples

   scripts/automate.sh RH4_D6_BRD4_024_C_HLFMLBGX3

Example 3: process RNAseq samples

   scripts/automate.sh NCI0215tumor_T_C2V4TACXX

HiC

HiC sample_sheet

columns

Genome (hg19,hg38,mm10)
SampleFiles
SpikeIn (optional)
SpikeInGenome (optional)

HiC Example:

samples:
  RH4_D6_H3K27ac_HiChIP_HKJ22BGX7:
    Genome: hg19
    SampleFiles: Sample_RH4_D6_H3K27ac_HiChIP_HKJ22BGX7
    SpikeIn: 'yes'
    SpikeInGenome: mm10

Output

Juicebox hic file: [output dir]/[sample_id].allValidPairs.hic
HiCpro pairs:

Pairs for the reference genome:
- [output dir]/HiCproOUTPUT/hic_results/data/[sample_id]/[sample_id]/[sample_id].allValidPairs
Pairs for the spikeIn genome:
- [output dir]/HiCproAQuAOUTPUT/hic_results/data/[sample_id]/[sample_id]/[sample_id].allValidPairs
Mergestat summary (reference and spike-In):
- mergeStats.txt
Successful flag:
- successful.txt
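
To locate the HiC-Pro results programmatically, the two result roots listed above can be derived from the output directory and the sample ID. This is a minimal sketch with a hypothetical helper name; it stops at the per-sample data directory and assumes the spike-in folder is only present for samples run with SpikeIn set to 'yes'.

```python
import posixpath


def hicpro_data_dirs(output_dir: str, sample_id: str, spike_in: bool = False) -> dict:
    """Return the HiC-Pro data directories for the reference (and spike-in) genome."""
    dirs = {
        "reference": posixpath.join(
            output_dir, "HiCproOUTPUT", "hic_results", "data", sample_id
        )
    }
    if spike_in:
        # Assumption: the spike-in results exist only for spike-in samples.
        dirs["spike_in"] = posixpath.join(
            output_dir, "HiCproAQuAOUTPUT", "hic_results", "data", sample_id
        )
    return dirs
```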

Dependency graph

DAG example with spike-In

ChIPseq

ChIPseq sample_sheet

Required columns

Genome (hg19,hg38,mm10)
SampleFiles
SpikeIn (yes,no)
SpikeInGenome (optional)
LibrarySize (optional)
EnhancePipe (yes, no)
PeakCalling (narrow, broad)
PairedRNA_SAMPLE_ID (optional)

ChIPseq Example:

ChIPSeq with enhancer pipeline (e.g. H3K27ac)

samples:
  RH4_D6_H3K27ac_018_C_HWC77BGXY:
    PairedInput: RH4_Input_001_C_H5TLGBGXX
    Genome: hg38
    SampleFiles: Sample_RH4_D6_H3K27ac_018_C_HWC77BGXY
    SpikeIn: 'yes'
    SpikeInGenome: dm6
    LibrarySize: 250
    EnhancePipe: 'yes'
    PeakCalling: narrow
    PairedRNA_SAMPLE_ID: Rh4_dmso_6h_rz_T_H3YCHBGXX

ChIPSeq without enhancer pipeline (e.g. BRD4)

samples:
  RH4_D6_BRD4_024_C_HLFMLBGX3:
    PairedInput: RH4_Input_001_C_H5TLGBGXX
    Genome: hg38
    SampleFiles: Sample_RH4_D6_BRD4_024_C_HLFMLBGX3
    SpikeIn: 'yes'
    SpikeInGenome: dm6
    LibrarySize: 250
    EnhancePipe: 'no'
    PeakCalling: narrow
    PairedRNA_SAMPLE_ID: RH4_D6_T_HVNVFBGX2
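
The enumerated ChIPseq columns can be validated before launching. The sketch below checks one sample entry (the dict under a sample ID) against the allowed values listed in the column description. The function name is hypothetical, and the rule that SpikeIn 'yes' needs a SpikeInGenome is an assumption rather than documented pipeline behavior. Note that the examples quote 'yes' and 'no': YAML 1.1 parsers such as PyYAML read an unquoted yes as a boolean.

```python
ALLOWED_VALUES = {
    "SpikeIn": {"yes", "no"},
    "EnhancePipe": {"yes", "no"},
    "PeakCalling": {"narrow", "broad"},
}


def check_chipseq_sample(columns: dict) -> list:
    """Return a list of problems for one ChIPseq sample entry."""
    problems = []
    for col, allowed in ALLOWED_VALUES.items():
        value = columns.get(col)
        if value is None:
            problems.append(f"missing {col}")
        elif value not in allowed:
            # An unquoted YAML yes/no shows up here as True/False.
            problems.append(f"{col} must be one of {sorted(allowed)}, got {value!r}")
    # Assumption: a spike-in run cannot proceed without a spike-in genome.
    if columns.get("SpikeIn") == "yes" and not columns.get("SpikeInGenome"):
        problems.append("SpikeIn is 'yes' but SpikeInGenome is not set")
    return problems
```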

Output

BWA Output

BAM: [sample_id].bam
No duplicate BAM: [sample_id].dd.bam
Bigwig/TDF: [sample_id].25.RPM.bw/tdf
SpikeIn normalized bigwig/TDF: [sample_id].25.scaled.bw/tdf

MACS2 Output

Folder name:

MACS_Out_q_[cutoff]: MACS_Out_q_0.01 or MACS_Out_q_0.05

Peaks (no regions in blacklist): [sample_id]_peaks.narrowPeak.nobl.bed
Peak annotation: [sample_id]_peaks.narrowPeak.nobl.bed.annotation.txt
Peak annotation summary: [sample_id]_peaks.narrowPeak.nobl.bed.annotation.summary
Peak summit file (narrow peaks only): [sample_id].narrow_summits.bed
Peaks GREAT format: [sample_id]_peaks.narrowPeak.nobl.GREAT.bed

ROSE Output

Folder name:

MACS_Out_q_[cutoff]/ROSE_out_[stitch distance]: MACS_Out_q_[cutoff]/ROSE_out_12500

All enhancers: [sample_id]_peaks_AllEnhancers.table.txt
Super enhancer BED file: [sample_id]_peaks_AllEnhancers.table.super.bed
Regular enhancer BED file: [sample_id]_peaks_AllEnhancers.table.regular.bed
Super enhancer summit file: [sample_id].super_summits.bed
Regular enhancer summit file: [sample_id].regular_summits.bed
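
The MACS2 and ROSE folder names follow a fixed pattern, so downstream scripts can compute them instead of hard-coding them. The sketch below follows the spelling of the concrete examples (MACS_Out_q_0.01, ROSE_out_12500); the helper names are hypothetical, and the cutoff is passed as a string so that it appears in the folder name exactly as written.

```python
import posixpath


def macs_dir(cutoff: str) -> str:
    """Folder holding MACS2 results for a q-value cutoff, e.g. "0.01"."""
    return f"MACS_Out_q_{cutoff}"


def rose_dir(cutoff: str, stitch_distance: int = 12500) -> str:
    """Folder holding ROSE results, nested inside the MACS2 folder."""
    return posixpath.join(macs_dir(cutoff), f"ROSE_out_{stitch_distance}")


def super_enhancer_bed(sample_id: str, cutoff: str, stitch_distance: int = 12500) -> str:
    """Relative path of the super enhancer BED file for a sample."""
    return posixpath.join(
        rose_dir(cutoff, stitch_distance),
        f"{sample_id}_peaks_AllEnhancers.table.super.bed",
    )
```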

Motif Output

We use homer to predict motifs for:

Peaks summit file (folder: motif_narrow)
Super enhancer summit file (folder: motif_super)
Regular enhancer summit file (folder: motif_regular)

EDEN output

We use EDEN to look for regulated genes in the same TAD for:

Peaks summit file (folder: motif_narrow)
Super enhancer summit file (folder: motif_super)
Regular enhancer summit file (folder: motif_regular)

In the same folder, EDEN generates the following files:

*_TPM[cutoff]_muti-genes.txt: Nearest genes for upstream, downstream and overlapped region of interest (TPM at least cutoff)
*_TPM[cutoff]_max-genes.txt: Max expressed gene around region of interest (TPM at least cutoff)
*_TPM[cutoff]_nearest-genes.txt: Nearest gene around region of interest (TPM at least cutoff)
*_TPM[cutoff]_300k.superloci.max.bed: Max expressed gene around stitched regions (TPM at least cutoff)
*_TPM[cutoff]_300k.superloci.nearest.bed: Nearest gene around stitched regions (TPM at least cutoff)

Coltron output

Coltron output can be found in ROSE_out_12500/coltron

Dependency graph

DAG example with enhancer pipeline

DAG example without enhancer pipeline

RNAseq

RNAseq sample_sheet

Required columns

Genome (hg19,hg38,mm10)
SampleFiles
SampleCaptures (polya, polya_stranded, ribozero, access)
Xenograft (optional)
XenograftGenome (optional)

RNAseq Example:

Regular RNAseq

samples:
  NCI0215tumor_T_C2V4TACXX:
    Genome: hg19
    SampleFiles: Sample_NCI0215tumor_T_C2V4TACXX
    SampleCaptures: ribozero

Xenograft RNAseq

samples:
  RH4_total_RNA_PA58_T_H37TWBGXC:
    Genome: hg19
    SampleFiles: Sample_RH4_total_RNA_PA58_T_H37TWBGXC
    Xenograft: 'yes'
    XenograftGenome: mm10
    SampleCaptures: ribozero
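
An RNAseq entry can be checked in the same spirit as the other sample sheets. The sketch below validates SampleCaptures against the values listed above; the function name is hypothetical, and requiring XenograftGenome whenever Xenograft is 'yes' is an assumption based on the xenograft example, not documented behavior.

```python
SAMPLE_CAPTURES = {"polya", "polya_stranded", "ribozero", "access"}


def check_rnaseq_sample(columns: dict) -> list:
    """Return a list of problems for one RNAseq sample entry."""
    problems = []
    capture = columns.get("SampleCaptures")
    if capture is None:
        problems.append("missing SampleCaptures")
    elif capture not in SAMPLE_CAPTURES:
        problems.append(f"unknown SampleCaptures value {capture!r}")
    # Assumption: xenograft filtering needs to know the host genome.
    if columns.get("Xenograft") == "yes" and not columns.get("XenograftGenome"):
        problems.append("Xenograft is 'yes' but XenograftGenome is not set")
    return problems
```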

Output

STAR Output

Gencode STAR BAM: STAR_hg19_gencode/[sample_id].star.genome.bam
Gencode STAR BAM bigwig: STAR_hg19_gencode/[sample_id].star.genome.bw
UCSC STAR BAM: STAR_hg19_ucsc/[sample_id].star.genome.bam
UCSC STAR BAM bigwig: STAR_hg19_ucsc/[sample_id].star.genome.bw

RSEM Output

Gencode RSEM genes: RSEM_hg19_gencode/[sample_id].hg19.gencode.genes.results
Gencode RSEM isoforms: RSEM_hg19_gencode/[sample_id].hg19.gencode.isoforms.results
UCSC RSEM genes: RSEM_hg19_ucsc/[sample_id].hg19.ucsc.genes.results
UCSC RSEM isoforms: RSEM_hg19_ucsc/[sample_id].hg19.ucsc.isoforms.results

HLA Output

Seq2HLA and HLAminer combined file: HLA/[sample_id].Calls.txt

Xenograft filtered FASTQ (for Xenograft samples)

DATA/classification.tsv: Filtering summary
DATA/[sample_id].filtered_R1.fastq.gz
DATA/[sample_id].filtered_R2.fastq.gz

Dependency graph

DAG example

DAG example for Xenograft samples

Pacbio

Pacbio sample_sheet

columns

Genome (hg19,hg38,mm10)
SampleFiles
Target. Format: chromosome:start-end. Empty if the library is whole transcriptome
Primers. Primer FASTA file, needed only if your primers differ from the standard ones.
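
The Target column uses the chromosome:start-end format, with an empty value meaning whole transcriptome. A minimal parsing sketch follows; the function name is hypothetical and this is not the parser used by the pipeline.

```python
import re

TARGET_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_target(target):
    """Parse "chr5:177086905-177098144" into ("chr5", 177086905, 177098144).

    Returns None for an empty target (whole-transcriptome library).
    """
    text = (target or "").strip()
    if not text:
        return None
    match = TARGET_RE.match(text)
    if match is None:
        raise ValueError(f"Target must look like chromosome:start-end, got {target!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"Target start {start} is greater than end {end}")
    return match["chrom"], start, end
```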

Pacbio Example:

samples:
  RH30:
    Amplified_Sample_Library_Name: RH30
    SampleFiles: Sample_RH30/RH30.ccs.bam
    Target: chr5:177086905-177098144
    Genome: hg19
    Primers: /data/khanlab3/hsienchao/Adam/FGFR4/CS28601_R2/A01/primers.fasta

Output

Sqanti classification file: [output dir]/[sample_id].sqanti_classification.filtered_lite_classification.txt
Final GTF: [output dir]/[sample_id].sqanti_classification.filtered_lite.gtf

Dependency graph

DAG example

DNAseq

Not implemented yet.

Methylseq

Not implemented yet.

For questions or comments, please contact: Hsien-chao Chou (chouh@nih.gov)
