Illumina Dragen cancer genome and transcriptome analysis automation using Snakemake

tamor

Rapid automated Personal Cancer Genome Report (PCGR) generation using Illumina Dragen + Snakemake.

tl;dr

Data for large-scale tumor analysis projects can be spread over multiple DNA sequencing instrument runs; tamor simplifies the process of analyzing them.

A tab-delimited file is provided by the user to associate tumor and germline sequencing sample IDs with a study subject ID, along with a tissue-of-origin for the tumor. PCGR somatic variant reports (including germline susceptibility sequence variants) are generated using 1) this tab-delimited file, 2) the Illumina sequencer output (BCL or FASTQ), and 3) the Illumina Experiment Manager sample sheet CSVs for the sequencing runs.

Tumor RNA analysis is in development.

Installation

  1. Nota bene: These instructions assume that you already have a Dragen server with software version 3.10 or higher and a working hg38 genome index.

  2. Download this code base:

git clone https://github.com/nodrogluap/tamor

  3. Install all the dependencies via conda or mamba (my preference because it's much, much faster):

mamba env create -f conda_tamor.yml

  4. Due to quirks in the conda dependencies spec, you will need to install the latest version of the Perl zlib library module manually, and the R hg38 genome sequence module:

mamba activate pcgrr
cpanm Compress::Raw::Zlib
R -e 'BiocManager::install("rtracklayer", force=TRUE);BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")'

  5. Download (~22GB) the cancer databases that CPSR and PCGR rely on for annotation of your discovered sequence variants:

BUNDLE=pcgr.databundle.hg38.20220203.tgz
wget http://insilico.hpc.uio.no/pcgr/$BUNDLE
tar zxvf $BUNDLE

Configuration

If the tamor directory is not at the front of your shell's PATH variable, you will need to prepend it so that tamor's ersatz bcftools command is used (place this in your .bashrc if you don't want to do this manually each time):

export PATH=/where/you/have/put/tamor:$PATH
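
Whether the PATH change took effect can be checked by asking which bcftools would actually be run. The following is a minimal sketch and not part of tamor; the function name bcftools_is_from is hypothetical, and it assumes a POSIX system where tamor's bcftools is an executable file in the tamor directory.

```python
import os
import shutil


def bcftools_is_from(tamor_dir, path=None):
    """Return True if the first bcftools found on the search path lives in tamor_dir.

    path is an optional PATH-style string; when None, the PATH environment
    variable of the current process is searched.
    """
    found = shutil.which("bcftools", path=path)
    if found is None:
        return False  # no bcftools on the search path at all
    found_dir = os.path.realpath(os.path.dirname(found))
    return found_dir == os.path.realpath(tamor_dir)
```

Calling bcftools_is_from("/where/you/have/put/tamor") from inside the activated environment should return True once the export line above has been applied.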

Copy the config.yml.sample file to config.yml:

cp config.yml.sample config.yml

This is the file that you can customize for your site-specific settings. By default, the config is set up to write result files in output under the current directory, and expects the input list of paired tumor-normal samples in a file called tumor_dna_paired_germline_dna_samples.tsv, which has five tab-separated columns:

subjectID<tab>tumorSampleName<tab>germlineSampleName<tab>TrueOrFalse_germline_contains_some_tumor<tab>PCGRTissueSiteNumber

The subjectID, tumorSampleName and germlineSampleName are subject to the following rules:

  • All three must CONTAIN NO UNDERSCORES
  • The subjectID must be between 6 and 35 characters long (due to a PCGR naming limitation)
  • tumorSampleName and germlineSampleName must be the exact Sample_Name values you used in your Illumina sequencing sample sheets
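
The naming rules above can be checked before launching a run. The following is a minimal sketch, not part of tamor; the function name check_pairing_row is hypothetical, and it assumes the five-column tab-delimited layout described above. It cannot check that the sample names match your sample sheets.

```python
def check_pairing_row(line):
    """Return a list of problems found in one line of the pairing TSV."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return [f"expected 5 tab-separated columns, found {len(fields)}"]

    subject, tumor, germline = fields[0], fields[1], fields[2]
    problems = []

    # None of the three identifiers may contain an underscore.
    for label, value in (("subjectID", subject),
                         ("tumorSampleName", tumor),
                         ("germlineSampleName", germline)):
        if "_" in value:
            problems.append(f"{label} '{value}' contains an underscore")

    # PCGR naming limitation: 6 to 35 characters for the subject ID.
    if not 6 <= len(subject) <= 35:
        problems.append(f"subjectID '{subject}' must be 6 to 35 characters long")

    return problems
```

An empty list means the row passed these checks; running the function over every line of tumor_dna_paired_germline_dna_samples.tsv gives a quick pre-flight report.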

These sample sheets are the only metadata to which tamor has access. By default, place all the Illumina experiment sample sheets for your project into data/spreadsheets (see the samplesheets_dir setting in config.yml). They must be called runID.csv, where runID is typically the Illumina folder name in the format YYMMDD_machineID_SideFlowCellID.

Tamor can start with either BCL files or FASTQ. If you are starting with BCLs, the full Illumina experiment output folders (which contain the requisite Data/Intensities/BaseCalls subfolder) are expected in data/bcls/runID (see the bcl_dir setting in config.yml). Tamor will perform BCL-to-FASTQ conversion, writing the FASTQ output into data/analysis/primary/sequencerName/runID (see the analysis_dir setting in config.yml; the default sequencer is novaseq6000).

If instead you are providing the FASTQs directly as input to tamor, they must also be in the data/analysis/primary/sequencerName/runID directory, with a corresponding Illumina Experiment Manager sample sheet at data/spreadsheets/runID.csv. This is required because tamor reads the sample sheet to find the correspondence between Sample_Name and Sample_ID for each sequencing library. Analysis for DNA samples also differs from that for RNA samples, so the sample sheet must contain a Sample_Project column: sample projects with names that contain "RNA" will be processed as RNA, and all others are assumed to be DNA. The sample sheet is also used to determine whether Unique Molecular Indices were used to generate the sequencing libraries, which requires different handling in Dragen during downstream genotyping.
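
To make the role of the sample sheet concrete, here is a minimal sketch of reading the [Data] section of an Illumina Experiment Manager sample sheet and classifying each library as DNA or RNA. It is an illustration, not tamor's actual code: the function names are hypothetical, the "RNA" match is assumed to be case-sensitive, and UMI detection is not covered.

```python
import csv
import io


def read_sample_sheet_data(text):
    """Return the non-empty rows of the [Data] section as a list of dicts."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        # Section headers are often padded with commas, e.g. "[Data],,,,"
        if line.strip().strip(",").lower() == "[data]":
            reader = csv.DictReader(io.StringIO("\n".join(lines[i + 1:])))
            return [row for row in reader if row.get("Sample_Name")]
    raise ValueError("no [Data] section found in sample sheet")


def library_type(sample_project):
    """Projects whose name contains "RNA" are RNA; everything else is DNA."""
    return "RNA" if "RNA" in (sample_project or "") else "DNA"


def sample_name_map(text):
    """Map each Sample_Name to its (Sample_ID, library type) pair."""
    return {
        row["Sample_Name"]: (row["Sample_ID"], library_type(row.get("Sample_Project")))
        for row in read_sample_sheet_data(text)
    }
```

Passing the contents of data/spreadsheets/runID.csv to sample_name_map is a quick way to confirm that the names in your pairing file appear in the sheet and are classified as you expect.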

If you provide FASTQ files directly, they must have a modification time later than that of the corresponding Illumina Experiment Manager sample sheet. Otherwise Snakemake will assume that you have made a consequential change to the sample sheet and will try to automatically regenerate all FASTQs for that run, potentially from non-existent BCLs.
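
The timestamp requirement can be verified before running Snakemake. This is a minimal sketch under the assumption that file modification times are what matter; the function name fastqs_not_newer_than_sheet is hypothetical and not part of tamor.

```python
import os


def fastqs_not_newer_than_sheet(sample_sheet, fastq_paths):
    """Return the FASTQ paths whose modification time is not later than the sample sheet's."""
    sheet_mtime = os.path.getmtime(sample_sheet)
    return [p for p in fastq_paths if os.path.getmtime(p) <= sheet_mtime]
```

Any paths returned are candidates for the standard touch command, which updates a file's modification time without changing its contents.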

The fourth column of the paired input sample TSV file is usually False. Use True if your germline sample is from a leukemia, or perhaps from a poor quality histology section of a tumor. This instructs Dragen to still report variants found at low frequency in the germline sample as somatic variants in the tumor analysis output (see the default of 0.05 for tumor_in_normal_tolerance_proportion in config.yml).

For the fifth column, the list of tissue site numbers for the version of PCGR included here is:

                        0 = Any
                        1 = Adrenal Gland
                        2 = Ampulla of Vater
                        3 = Biliary Tract
                        4 = Bladder/Urinary Tract
                        5 = Bone
                        6 = Breast
                        7 = Cervix
                        8 = CNS/Brain
                        9 = Colon/Rectum
                        10 = Esophagus/Stomach
                        11 = Eye
                        12 = Head and Neck
                        13 = Kidney
                        14 = Liver
                        15 = Lung
                        16 = Lymphoid
                        17 = Myeloid
                        18 = Ovary/Fallopian Tube
                        19 = Pancreas
                        20 = Peripheral Nervous System
                        21 = Peritoneum
                        22 = Pleura
                        23 = Prostate
                        24 = Skin
                        25 = Soft Tissue
                        26 = Testis
                        27 = Thymus
                        28 = Thyroid
                        29 = Uterus
                        30 = Vulva/Vagina
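
For scripted preparation of the pairing file, the list above can be held in a small lookup table. The sketch below is illustrative only: the names PCGR_TISSUE_SITES, tissue_site_number and parse_last_two_columns are hypothetical, and it assumes the fourth column is spelled exactly True or False.

```python
# Index in this list = PCGR tissue site number, as listed above.
PCGR_TISSUE_SITES = [
    "Any", "Adrenal Gland", "Ampulla of Vater", "Biliary Tract",
    "Bladder/Urinary Tract", "Bone", "Breast", "Cervix", "CNS/Brain",
    "Colon/Rectum", "Esophagus/Stomach", "Eye", "Head and Neck", "Kidney",
    "Liver", "Lung", "Lymphoid", "Myeloid", "Ovary/Fallopian Tube",
    "Pancreas", "Peripheral Nervous System", "Peritoneum", "Pleura",
    "Prostate", "Skin", "Soft Tissue", "Testis", "Thymus", "Thyroid",
    "Uterus", "Vulva/Vagina",
]


def tissue_site_number(name):
    """Look up a site number by its name (case-insensitive exact match)."""
    wanted = name.strip().lower()
    for number, site in enumerate(PCGR_TISSUE_SITES):
        if site.lower() == wanted:
            return number
    raise ValueError(f"unknown tissue site: {name!r}")


def parse_last_two_columns(contains_tumor, site_number):
    """Validate columns 4 and 5; return (bool, tissue site name)."""
    if contains_tumor not in ("True", "False"):
        raise ValueError(f"fourth column must be True or False, got {contains_tumor!r}")
    number = int(site_number)
    if not 0 <= number < len(PCGR_TISSUE_SITES):
        raise ValueError(f"tissue site number out of range: {number}")
    return contains_tumor == "True", PCGR_TISSUE_SITES[number]
```

For example, tissue_site_number("lung") gives the value to write in the fifth column for a lung tumor.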

Running a paired tumor-normal analysis

Any time you want to use tamor, you must be sure to have the conda/mamba environment loaded:

mamba activate pcgrr

Once the sample pairing file mentioned earlier is ready, you can simply run Snakemake to generate the FASTQs (optionally), BAMs, VCFs, and CPSR/PCGR reports:

snakemake --cores=1

The default outputs are in a directory called data/output/pcgr/subjectID_tumorSampleName_germlineSampleName. The most relevant document may be the self-contained Web page subjectID.pcgr_acmg.grch38.flexdb.html.
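
To locate reports programmatically, the path can be assembled from one row of the pairing file. This is a minimal sketch: the function name pcgr_report_path and the output_dir parameter are hypothetical, and it assumes the HTML report sits directly inside the per-pair directory named above.

```python
from pathlib import PurePosixPath


def pcgr_report_path(subject_id, tumor_sample, germline_sample, output_dir="data/output"):
    """Build the expected location of the PCGR HTML report for one tumor-normal pair."""
    pair_dir = (PurePosixPath(output_dir) / "pcgr"
                / f"{subject_id}_{tumor_sample}_{germline_sample}")
    return pair_dir / f"{subject_id}.pcgr_acmg.grch38.flexdb.html"
```

Because the three IDs are joined with underscores, an ID that itself contained an underscore would make the directory name ambiguous to split back into its parts.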

In a multi-user system, it is imperative to use a queuing system such as slurm to submit only one job at a time to Dragen v4.x. Once slurm is installed and configured on your Dragen system, Snakemake support for slurm is enabled by invoking Snakemake like so:

snakemake --cluster sbatch --cores=2

Screenshot of a sample Personal Cancer Genome Report, FlexDB version

Acknowledgements

This project is being developed in support of the Terry Fox Research Institute's Marathon of Hope Cancer Care Network activities within the Prairie Cancer Research Consortium.
