This is the implementation of KhanLab NGS Pipeline using Snakemake.
Khanlab pipeline supports the following NGS data types:
The easiest way to get this pipeline is to clone the repository.
git clone git@github.com:hsienchao/khanlab_pipeline.git
This pipeline is available on the NIH Biowulf cluster; contact me if you would like to do a test run. Output from this pipeline can be loaded directly into OncoGenomics-DB, an application for visualizing NGS data that is available to NIH users.
snakemake 5.13.0
mutt
gnu parallel
SLURM resource management
- [HiCPro](https://github.com/nservant/HiC-Pro)
- [JuiceBox](https://github.com/aidenlab/Juicebox)
- [Bowtie2](https://github.com/BenLangmead/bowtie2)
- BWA
- macs
- rose
- homer
- coltron
- samtools
- bedtools
- STAR
- RSEM
- samtools
- xengsort
- HLAminer
- seq2HLA
- Isoseq3
- cDNA_cupcake
- sqanti3
- Sample names cannot have "/" or "." in them
- Fastq files end in ".fastq.gz"
- Sample sheet in YAML format
- Genome (accepted values: hg19,hg38,mm10)
- SampleFiles (usually Sample_ + Library_ID + _ + FCID)
- Other pipeline type specific columns
See examples in HiC, RNAseq or ChIPseq
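The requirements above can be checked before a run is launched. The following is a minimal sketch, assuming the sample sheet has already been loaded into a Python dict with a top-level `samples` key (the layout used by the YAML examples in this README). The function name `check_sample_sheet` is hypothetical and is not part of the pipeline.

```python
# Genome values listed as accepted in this README.
ACCEPTED_GENOMES = {"hg19", "hg38", "mm10"}


def check_sample_sheet(sheet):
    """Return a list of problems found in a loaded sample sheet (empty if none)."""
    problems = []
    for name, fields in sheet.get("samples", {}).items():
        # Sample names cannot contain "/" or ".".
        if "/" in name or "." in name:
            problems.append(f"{name}: sample name contains '/' or '.'")
        # Genome must be one of the accepted values.
        genome = fields.get("Genome")
        if genome not in ACCEPTED_GENOMES:
            problems.append(f"{name}: unsupported Genome {genome!r}")
        # SampleFiles is a required column.
        if not fields.get("SampleFiles"):
            problems.append(f"{name}: missing SampleFiles")
    return problems


sheet = {
    "samples": {
        "RH4_D6_H3K27ac_HiChIP_HKJ22BGX7": {
            "Genome": "hg19",
            "SampleFiles": "Sample_RH4_D6_H3K27ac_HiChIP_HKJ22BGX7",
        }
    }
}
print(check_sample_sheet(sheet))
```

Pipeline-specific columns are not covered here; this sketch only checks the columns common to all pipeline types.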
- Sample sheet in YAML format
- Sample sheet can be generated using script sampleToYaml.py. Usage:
python scripts/sampleToYaml.py -s [SAMPLE_ID] -o [OUTPUT_FILE]
Example:
python scripts/sampleToYaml.py -s RH4_Ent6_H3K27ac_HiChIP_HH3JVBGX7 -o RH4_Ent6_H3K27ac_HiChIP_HH3JVBGX7.hic.yaml
launch [options]
required options:
-type|t <string> Pipeline type (available options: hic,chipseq,rnaseq,dnaseq)
-workdir|w <string> Working directory where all the results will be stored.
-sheet|s <string> Sample sheet in YAML format
-genome|g <string> Genome version (default: hg38)
optional options:
-datadir|d <string> FASTQ file location (default: /data/khanlab/projects/DATA)
-dryrun Dryrun only
-dag Generate DAG SVG
- Example (hg38)
launch -type hic -w /data/khanlab/projects/HiC/processed_DATA -s /data/khanlab/projects/HiC/processed_DATA/sample_sheets/RH4_D6_H3K27ac_HiChIP_HKJ22BGX7.hic.yaml
- Example (hg19)
launch -type hic -w /data/khanlab/projects/HiC/processed_DATA -s /data/khanlab/projects/HiC/processed_DATA/sample_sheets/RH4_D6_H3K27ac_HiChIP_HKJ22BGX7.hic.yaml -g hg19
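When many samples need to be submitted, the command line can be assembled programmatically. The sketch below builds the argument list from the options documented above (`-type`, `-w`, `-s`, `-g`, `-dryrun`). The helper name `build_launch_command` is hypothetical, and the sketch assumes the `launch` script is on your `PATH`; it only prints the command and does not run it.

```python
import shlex


def build_launch_command(pipeline_type, workdir, sheet, genome=None, dryrun=False):
    """Assemble the argument list for the `launch` script."""
    cmd = ["launch", "-type", pipeline_type, "-w", workdir, "-s", sheet]
    if genome is not None:
        # Omit -g to use the documented default (hg38).
        cmd += ["-g", genome]
    if dryrun:
        cmd.append("-dryrun")
    return cmd


workdir = "/data/khanlab/projects/HiC/processed_DATA"
sheet = workdir + "/sample_sheets/RH4_D6_H3K27ac_HiChIP_HKJ22BGX7.hic.yaml"
cmd = build_launch_command("hic", workdir, sheet, genome="hg19", dryrun=True)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; starting with `-dryrun` is a cautious way to confirm the sample sheet is accepted before submitting real jobs.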
scripts/automate.sh [sampleID]
This script parses the Khanlab master files to determine the sequencing type automatically. It then retrieves the required columns from the HiC/ChIPseq sample sheets, checks whether the FASTQ files are ready, and launches the pipeline.
For RNAseq samples, the genome is defined in the "SampleRef" column of the master file. For ChIPseq/HiC samples, the genome version is defined in the "Genome" column. Multiple values can be given, separated by commas (e.g. hg19,hg38).
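A comma-separated genome field can be handled as in the sketch below. This only illustrates the field format; how `scripts/automate.sh` processes the column internally is not shown in this README, and the function name `split_genomes` is hypothetical.

```python
def split_genomes(value):
    """Split a genome field such as 'hg19,hg38' into a list of genome names."""
    # Strip whitespace and drop empty entries (e.g. from a trailing comma).
    return [g.strip() for g in value.split(",") if g.strip()]


for genome in split_genomes("hg19,hg38"):
    print(genome)
```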
The Khanlab data location:
-- FASTQ files: /data/khanlab/projects/DATA
-- Processed data: /data/khanlab/projects/pipeline_production/processed_DATA
-- Sample sheets: /data/khanlab/projects/pipeline_production/sample_sheets
-- ChIPSeq sample sheet: /data/khanlab/projects/ChIP_seq/manage_samples/ChIP_seq_samples.xlsx
-- HiC sample sheet: /data/khanlab/projects/HiC/manage_samples/HiC_sample_sheet.xlsx
- Example 1: process HiC samples
scripts/automate.sh RH4_Ent6_H3K27ac_HiChIP_HH3JVBGX7
- Example 2: process ChIPseq samples
scripts/automate.sh RH4_D6_BRD4_024_C_HLFMLBGX3
- Example 3: process RNAseq samples
scripts/automate.sh NCI0215tumor_T_C2V4TACXX
- Genome (hg19,hg38,mm10)
- SampleFiles
- SpikeIn (optional)
- SpikeInGenome (optional)
samples:
RH4_D6_H3K27ac_HiChIP_HKJ22BGX7:
Genome: hg19
SampleFiles: Sample_RH4_D6_H3K27ac_HiChIP_HKJ22BGX7
SpikeIn: 'yes'
SpikeInGenome: mm10
- Juicebox hic file: [output dir]/[sample_id].allValidPairs.hic
- HiCpro pairs:
- Pairs for the reference genome:
- [output dir]/HiCproOUTPUT/hic_results/data/[sample_id]/[sample_id]/[sample_id].allValidPair
- Pairs for the spikeIn genome:
- [output dir]/HiCproAQuAOUTPUT/hic_results/data/[sample_id]/[sample_id]/[sample_id].allValidPair
- Mergestat summary (reference and spike-In):
- mergeStats.txt
- Successful flag:
- successful.txt
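The output locations listed above follow a fixed pattern, so downstream scripts can derive them from the output directory and sample ID. A minimal sketch follows; the helper name `hic_outputs` is hypothetical, and only the files whose full paths are given above are included.

```python
from pathlib import PurePosixPath


def hic_outputs(output_dir, sample_id):
    """Return the expected HiC output paths for one sample."""
    out = PurePosixPath(output_dir)
    # Relative location of the pairs file, shared by the reference and spike-in runs.
    pairs = (
        PurePosixPath("hic_results/data")
        / sample_id
        / sample_id
        / f"{sample_id}.allValidPair"
    )
    return {
        "juicebox_hic": out / f"{sample_id}.allValidPairs.hic",
        "reference_pairs": out / "HiCproOUTPUT" / pairs,
        "spikein_pairs": out / "HiCproAQuAOUTPUT" / pairs,
    }


for label, path in hic_outputs("/path/to/output", "RH4_D6_H3K27ac_HiChIP_HKJ22BGX7").items():
    print(label, path)
```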
- Genome (hg19,hg38,mm10)
- SampleFiles
- SpikeIn (yes,no)
- SpikeInGenome (optional)
- LibrarySize (optional)
- EnhancePipe (yes, no)
- PeakCalling (narrow, broad)
- PairedRNA_SAMPLE_ID (optional)
samples:
RH4_D6_H3K27ac_018_C_HWC77BGXY:
PairedInput: RH4_Input_001_C_H5TLGBGXX
Genome: hg38
SampleFiles: Sample_RH4_D6_H3K27ac_018_C_HWC77BGXY
SpikeIn: 'yes'
SpikeInGenome: dm6
LibrarySize: 250
EnhancePipe: 'yes'
PeakCalling: narrow
PairedRNA_SAMPLE_ID: Rh4_dmso_6h_rz_T_H3YCHBGXX
samples:
RH4_D6_BRD4_024_C_HLFMLBGX3:
PairedInput: RH4_Input_001_C_H5TLGBGXX
Genome: hg38
SampleFiles: Sample_RH4_D6_BRD4_024_C_HLFMLBGX3
SpikeIn: 'yes'
SpikeInGenome: dm6
LibrarySize: 250
EnhancePipe: 'no'
PeakCalling: narrow
PairedRNA_SAMPLE_ID: RH4_D6_T_HVNVFBGX2
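Note that `yes` and `no` are quoted in the examples above. YAML 1.1 loaders such as PyYAML read an unquoted `yes` as the boolean `True`, so the quotes keep these values as strings. The sketch below checks the ChIPseq columns that have a fixed set of values; it assumes the per-sample fields have already been loaded into a dict, and the function name `check_chipseq_fields` is hypothetical.

```python
# Allowed values for the ChIPseq columns that take a fixed set of options.
ALLOWED_VALUES = {
    "SpikeIn": {"yes", "no"},
    "EnhancePipe": {"yes", "no"},
    "PeakCalling": {"narrow", "broad"},
}


def check_chipseq_fields(fields):
    """Return a list of problems for one sample's fields (empty if none)."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        value = fields.get(key)
        # Absent columns are skipped; only values that are present are checked.
        if value is not None and value not in allowed:
            problems.append(f"{key}: {value!r} is not one of {sorted(allowed)}")
    return problems


fields = {"SpikeIn": "yes", "EnhancePipe": "no", "PeakCalling": "narrow"}
print(check_chipseq_fields(fields))
```

A value of `True` (the result of an unquoted `yes`) is reported as a problem, which makes a missing pair of quotes easy to spot.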
- BAM: [sample_id].bam
- No duplicate BAM: [sample_id].dd.bam
- Bigwig/TDF: [sample_id].25.RPM.bw/tdf
- SpikeIn normalized bigwig/TDF: [sample_id].25.scaled.bw/tdf
Folder name:
MACS_Out_q_[cutoff], e.g. MACS_Out_q_0.01 or MACS_Out_q_0.05
- Peaks (no regions in blacklist): [sample_id]_peaks.narrowPeak.nobl.bed
- Peak annotation: [sample_id]_peaks.narrowPeak.nobl.bed.annotation.txt
- Peak annotation summary: [sample_id]_peaks.narrowPeak.nobl.bed.annotation.summary
- Peak summit file (narrow peaks only): [sample_id].narrow_summits.bed
- Peaks GREAT format: [sample_id]_peaks.narrowPeak.nobl.GREAT.bed
Folder name:
MACS_Out_q_[cutoff]/ROSE_out_[stitch distance], e.g. MACS_Out_q_[cutoff]/ROSE_out_12500
- All enhancers: [sample_id]_peaks_AllEnhancers.table.txt
- Super enhancer BED file: [sample_id]_peaks_AllEnhancers.table.super.bed
- Regular enhancer BED file: [sample_id]_peaks_AllEnhancers.table.regular.bed
- Super enhancer summit file: [sample_id].super_summits.bed
- Regular enhancer summit file: [sample_id].regular_summits.bed
We use homer to predict motifs for:
- Peaks summit file (folder: motif_narrow)
- Super enhancer summit file (folder: motif_super)
- Regular enhancer summit file (folder: motif_regular)
We use EDEN to look for regulated genes in the same TAD for:
- Peaks summit file (folder: motif_narrow)
- Super enhancer summit file (folder: motif_super)
- Regular enhancer summit file (folder: motif_regular)
In the same folder, EDEN generates the following files:
- *_TPM[cutoff]_muti-genes.txt: Nearest genes for upstream, downstream and overlapped region of interest (TPM at least cutoff)
- *_TPM[cutoff]_max-genes.txt: Max expressed gene around region of interest (TPM at least cutoff)
- *_TPM[cutoff]_nearest-genes.txt: Nearest gene around region of interest (TPM at least cutoff)
- *_TPM[cutoff]_300k.superloci.max.bed: Max expressed gene around stitched regions (TPM at least cutoff)
- *_TPM[cutoff]_300k.superloci.nearest.bed: Nearest gene around stitched regions (TPM at least cutoff)
Coltron output can be found in ROSE_out_12500/coltron
- Genome (hg19,hg38,mm10)
- SampleFiles
- SampleCaptures (polya, polya_stranded, ribozero, access)
- Xenograft (optional)
- XenograftGenome (optional)
samples:
NCI0215tumor_T_C2V4TACXX:
Genome: hg19
SampleFiles: Sample_NCI0215tumor_T_C2V4TACXX
SampleCaptures: ribozero
samples:
RH4_total_RNA_PA58_T_H37TWBGXC:
Genome: hg19
SampleFiles: Sample_RH4_total_RNA_PA58_T_H37TWBGXC
Xenograft: 'yes'
XenograftGenome: mm10
SampleCaptures: ribozero
- Gencode STAR BAM: STAR_hg19_gencode/[sample_id].star.genome.bam
- Gencode STAR BAM bigwig: STAR_hg19_gencode/[sample_id].star.genome.bw
- UCSC STAR BAM: STAR_hg19_ucsc/[sample_id].star.genome.bam
- UCSC STAR BAM bigwig: STAR_hg19_ucsc/[sample_id].star.genome.bw
- Gencode RSEM genes: RSEM_hg19_gencode/[sample_id].hg19.gencode.genes.results
- Gencode RSEM isoforms: RSEM_hg19_gencode/[sample_id].hg19.gencode.isoforms.results
- UCSC RSEM genes: RSEM_hg19_ucsc/[sample_id].hg19.ucsc.genes.results
- UCSC RSEM isoforms: RSEM_hg19_ucsc/[sample_id].hg19.ucsc.isoforms.results
- Seq2HLA and HLAminer combined file: HLA/[sample_id].Calls.txt
- DATA/classification.tsv: Filtering summary
- DATA/[sample_id].filtered_R1.fastq.gz
- DATA/[sample_id].filtered_R2.fastq.gz
- Genome (hg19,hg38,mm10)
- SampleFiles
- Target: region in the format chromosome:start-end. Leave empty if the library is whole transcriptome.
- Primers: primer FASTA file, needed only if your primers differ from the standard ones.
samples:
RH30:
Amplified_Sample_Library_Name: RH30
SampleFiles: Sample_RH30/RH30.ccs.bam
Target: chr5:177086905-177098144
Genome: hg19
Primers: /data/khanlab3/hsienchao/Adam/FGFR4/CS28601_R2/A01/primers.fasta
- Sqanti classification file: [output dir]/[sample_id].sqanti_classification.filtered_lite_classification.txt
- Final GTF: [output dir]/[sample_id].sqanti_classification.filtered_lite.gtf
Not implemented yet.
Not implemented yet.
For questions or comments, please contact: Hsien-chao Chou (chouh@nih.gov)