Rapid automated Personal Cancer Genome Report (PCGR) generation using Illumina Dragen + Snakemake.
Data for large-scale tumor analysis projects can be spread over multiple DNA sequencing instrument runs; tamor
simplifies the process of analyzing them.
A tab-delimited file is provided by the user to associate tumor and germline sequencing sample IDs with a study subject ID, along with a tissue-of-origin for the tumor. PCGR somatic variant reports (including germline susceptibility sequence variants) are generated using 1) this tab-delimited file, 2) the Illumina sequencer output (BCL or FASTQ), and 3) the Illumina Experiment Manager samplesheets CSV for the sequencing runs.
Tumor RNA analysis is in development.
- Nota bene: These instructions assume that you already have a Dragen server with software version 3.10 or higher and a working hg38 genome index.
- Download this code base:
git clone https://github.com/nodrogluap/tamor
- Install all the dependencies via conda or mamba (my preference because it's much, much faster):
mamba env create -f conda_tamor.yml
- Due to quirks in the conda dependencies spec, you will need to install the latest version of the Perl zlib library module manually, and the R hg38 genome sequence module:
mamba activate pcgrr
cpanm Compress::Raw::Zlib
R -e 'BiocManager::install("rtracklayer", force=TRUE);BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")'
- Download (~22GB) the cancer databases that CPSR and PCGR rely on for annotation of your discovered sequence variants:
BUNDLE=pcgr.databundle.hg38.20220203.tgz
wget http://insilico.hpc.uio.no/pcgr/$BUNDLE
tar zxvf $BUNDLE
If the tamor directory is not first in your shell's PATH variable, you will need to prepend it so that tamor's ersatz bcftools command is used (place this in your .bashrc if you don't want to do this manually each time):
export PATH=/where/you/have/put/tamor:$PATH
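To confirm that the PATH change took effect, a small check like the following can help. This is only a sketch: the function name resolves_to_tamor and its parameters are hypothetical and not part of tamor; it relies on Python's standard shutil.which to see which bcftools would be picked up first.

```python
import os
import shutil


def resolves_to_tamor(tamor_dir, command="bcftools", path_env=None):
    """Return True if the first `command` found on the PATH lives in `tamor_dir`.

    Hypothetical helper for illustration; `path_env=None` means the current
    PATH environment variable is searched.
    """
    found = shutil.which(command, path=path_env)
    if found is None:
        # The command is not on the PATH at all.
        return False
    # Compare resolved directories so symlinked or relative entries still match.
    return os.path.realpath(os.path.dirname(found)) == os.path.realpath(tamor_dir)
```

For example, resolves_to_tamor("/where/you/have/put/tamor") should return True once the export line above is in effect.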
Copy the config.yml.sample file to config.yml:
cp config.yml.sample config.yml
This is the file that you can customize for your site-specific settings. By default the config is set up to write result files under the current directory in output, and it expects the input list of paired tumor-normal samples in a file called tumor_dna_paired_germline_dna_samples.tsv, which has 5 columns:
subjectID<tab>tumorSampleName<tab>germlineSampleName<tab>TrueOrFalse_germline_contains_some_tumor<tab>PCGRTissueSiteNumber
The subjectID, tumorSampleName and germlineSampleName must CONTAIN NO UNDERSCORES. In addition:
- The subjectID must be between 6 and 35 characters (due to a PCGR naming limitation).
- tumorSampleName and germlineSampleName must be the exact Sample_Name values you used in your Illumina sequencing sample spreadsheets.
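The naming rules for the pairing file can be checked before starting a run. The following is a minimal sketch, not part of tamor: the function name check_pairing_row is hypothetical, and it assumes the 6 to 35 character limit is inclusive at both ends. It does not verify that the sample names appear in your sample sheets.

```python
def check_pairing_row(line):
    """Return a list of problems found in one line of the pairing TSV.

    Hypothetical helper; an empty list means the naming rules are satisfied.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return [f"expected 5 tab-separated columns, found {len(fields)}"]

    subject, tumor, germline = fields[:3]
    problems = []
    # None of the three identifiers may contain an underscore.
    for label, value in (
        ("subjectID", subject),
        ("tumorSampleName", tumor),
        ("germlineSampleName", germline),
    ):
        if "_" in value:
            problems.append(f"{label} '{value}' contains an underscore")
    # Assumption: the 6..35 length limit is inclusive.
    if not 6 <= len(subject) <= 35:
        problems.append(f"subjectID '{subject}' is not 6 to 35 characters long")
    return problems
```

Running this over every line of tumor_dna_paired_germline_dna_samples.tsv and printing any non-empty result gives a quick pre-flight check.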
The sample sheets are the only metadata to which tamor has access. Place all the Illumina experiment sample sheets for your project into data/spreadsheets (the default location; see the samplesheets_dir setting in config.yml). They must be called runID.csv, where runID is typically the Illumina folder name in the format YYMMDD_machineID_SideFlowCellID.
Tamor can start with either BCL files or FASTQ. If you are starting with BCLs, the full Illumina experiment output folders (which contain the requisite Data/Intensities/Basecalls subfolder) are expected in data/bcls/runID (see the bcl_dir setting in config.yml). Tamor will perform BCL to FASTQ conversion, writing the FASTQ output into data/analysis/primary/sequencer/runID (see the analysis_dir setting in config.yml; the default sequencer is novaseq6000).
If instead you are providing the FASTQs directly as input to tamor, they must also be in the data/analysis/primary/sequencerName/runID directory, with a corresponding Illumina Experiment Manager samplesheet data/spreadsheets/runID.csv. Why? Tamor reads the sample sheet to find the correspondence between Sample_Name and Sample ID for each sequencing library. Analysis for DNA samples also differs from that for RNA samples, so the sample sheet must contain a Sample_Project column. Sample projects with names that contain "RNA" are processed as RNA; all others are assumed to be DNA. The samplesheet is also used to determine whether Unique Molecular Indices were used to generate the sequencing libraries, which requires different handling in Dragen during downstream genotyping.
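The DNA/RNA rule described above can be illustrated with a short sketch. This is not tamor's own parsing code: the function name classify_samples is hypothetical, it assumes the usual Illumina Experiment Manager layout in which sample rows follow a [Data] section line, and it assumes the "RNA" match is case-sensitive. UMI detection is not covered here.

```python
import csv


def classify_samples(samplesheet_text):
    """Map each Sample_Name in a sample sheet to "DNA" or "RNA".

    Hypothetical helper; assumes a [Data] section whose first line is the
    column header containing Sample_Name and Sample_Project.
    """
    lines = samplesheet_text.splitlines()
    start = None
    for i, line in enumerate(lines):
        # Section lines are often padded with trailing commas, e.g. "[Data],,,".
        if line.strip().rstrip(",").strip() == "[Data]":
            start = i + 1
            break
    if start is None:
        raise ValueError("no [Data] section found in sample sheet")

    kinds = {}
    for row in csv.DictReader(lines[start:]):
        name = row.get("Sample_Name")
        if not name:
            continue  # skip blank or padding rows
        project = row.get("Sample_Project") or ""
        # Assumption: the match on "RNA" is case-sensitive.
        kinds[name] = "RNA" if "RNA" in project else "DNA"
    return kinds
```

Calling classify_samples(open("data/spreadsheets/runID.csv").read()) on one of your own sheets shows how each library would be treated under these assumptions.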
If you provide FASTQ files directly, they must be timestamped later than the corresponding Illumina Experiment Manager spreadsheet. Otherwise Snakemake will assume you have made a consequential change to the spreadsheet and will try to automatically regenerate all FASTQs for that run, from potentially non-existent BCLs.
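One way to spot the timestamp problem before running Snakemake is to compare modification times directly. The sketch below is an illustration only; the function name fastqs_older_than_sheet is hypothetical, and files with a timestamp equal to the sample sheet's are flagged too, as a conservative choice.

```python
import os


def fastqs_older_than_sheet(samplesheet_path, fastq_paths):
    """Return the FASTQ paths whose modification time is not later than the sheet's.

    Hypothetical helper; equal timestamps are also reported, to be conservative.
    """
    sheet_mtime = os.path.getmtime(samplesheet_path)
    return [p for p in fastq_paths if os.path.getmtime(p) <= sheet_mtime]
```

Any files it reports can have their timestamps refreshed with the standard touch command.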
The fourth column of the paired input sample TSV file is usually False, unless your germline sample is from a leukemia or perhaps a poor-quality histology section from a tumor, in which case use True. This instructs Dragen to allow variants present at low frequency in the germline sample to still be reported as somatic variants in the tumor analysis output (see the default of 0.05 for tumor_in_normal_tolerance_proportion in config.yml).
For the fifth column, the list of tissue site numbers for the version of PCGR included here is:
0 = Any
1 = Adrenal Gland
2 = Ampulla of Vater
3 = Biliary Tract
4 = Bladder/Urinary Tract
5 = Bone
6 = Breast
7 = Cervix
8 = CNS/Brain
9 = Colon/Rectum
10 = Esophagus/Stomach
11 = Eye
12 = Head and Neck
13 = Kidney
14 = Liver
15 = Lung
16 = Lymphoid
17 = Myeloid
18 = Ovary/Fallopian Tube
19 = Pancreas
20 = Peripheral Nervous System
21 = Peritoneum
22 = Pleura
23 = Prostate
24 = Skin
25 = Soft Tissue
26 = Testis
27 = Thymus
28 = Thyroid
29 = Uterus
30 = Vulva/Vagina
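The fourth and fifth columns can be sanity-checked in the same spirit. This is a minimal sketch with a hypothetical function name, parse_pairing_flags; it assumes the fourth column is spelled exactly True or False, and that the tissue site number must fall in the 0 to 30 range listed above.

```python
def parse_pairing_flags(contains_tumor, tissue_site):
    """Validate columns 4 and 5 of the pairing TSV and return (bool, int).

    Hypothetical helper; raises ValueError on anything unexpected.
    """
    # Assumption: only the exact spellings "True" and "False" are accepted.
    if contains_tumor not in ("True", "False"):
        raise ValueError(f"column 4 must be True or False, got '{contains_tumor}'")
    site = int(tissue_site)  # int() itself raises ValueError for non-numeric text
    if not 0 <= site <= 30:
        raise ValueError(f"tissue site number {site} is outside 0..30")
    return contains_tumor == "True", site
```

For instance, a breast tumor with a clean germline sample would be parse_pairing_flags("False", "6").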
Any time you want to use tamor, you must be sure to have the conda/mamba environment loaded:
mamba activate pcgrr
Once the sample pairing file mentioned earlier is ready, you can simply run Snakemake to generate the FASTQs (optionally), BAMs, VCFs, and CPSR/PCGR reports:
snakemake --cores=1
The default outputs are in a directory called data/output/pcgr/subjectID_tumorSampleName_germlineSampleName. The most relevant document may be the self-contained Web page subjectID.pcgr_acmg.grch38.flexdb.html.
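To locate the report for a given row of the pairing file, the naming pattern above can be assembled programmatically. This is a sketch only: the function name pcgr_report_path is hypothetical, and the output_dir default of data/output is taken from the path quoted above, so adjust it if your config.yml differs.

```python
import posixpath


def pcgr_report_path(subject_id, tumor_name, germline_name, output_dir="data/output"):
    """Build the expected path of the PCGR HTML report for one tumor-normal pair.

    Hypothetical helper; `output_dir` is an assumed default.
    """
    pair_dir = posixpath.join(
        output_dir, "pcgr", f"{subject_id}_{tumor_name}_{germline_name}"
    )
    return posixpath.join(pair_dir, f"{subject_id}.pcgr_acmg.grch38.flexdb.html")
```

Checking whether that file exists for each row of the pairing TSV is a quick way to see which subjects still lack a report.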
On a multi-user system, it is imperative to use a queuing system such as slurm so that only one job at a time is submitted to Dragen v4.x. Once slurm is installed and configured on your Dragen system, Snakemake support for slurm is enabled by invoking Snakemake like so:
snakemake --cluster sbatch --cores=2
This project is being developed in support of the Terry Fox Research Institute's Marathon of Hope Cancer Care Network activities within the Prairie Cancer Research Consortium.