Parallelize DESeq2 #79

drpatelh · 2020-02-07T14:35:11Z

It should be possible to add another parameter to the differential accessibility script specifying the number of cores in order to parallelize DESeq2:
http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#using-parallelization

Probably also worth adding an option just to skip this step e.g. --skip_differential_analysis

drpatelh · 2020-02-15T00:51:28Z

As suggested by @mikelove on Twitter it's worth looking into using limma for the CQN normalised data for speedup. See F1000 paper.

The implementation should be quite trivial. It's just a case of figuring out how! Contributions/thoughts welcome.

mikelove · 2020-02-15T01:07:18Z

Here's another pointer:

https://github.com/kauralasoo/macrophage-gxe-study/blob/5f8c7ce999da89fce5017af3e2cdd39106e68126/ATAC/munge/processPeakCounts.R

Also CC @kauralasoo whose paper that was (the data generation and processing).

kauralasoo · 2020-02-15T10:58:57Z

Thanks @mikelove for cc'ing me. Yes, we ran into the same issue: DESeq2 was a bit too slow when testing for differential accessibility of 300,000 features across 64 samples. In the paper, I decided to use limma voom for differential accessibility analysis, but did not benchmark it against cqn normalisation + lmFit.

Here is the limma voom code: https://github.com/kauralasoo/macrophage-gxe-study/blob/master/ATAC/DA/clusterPeaks.R

I used cqn normalisation for chromatin accessibility QTL analysis, where we tested up to 5000 genetic variants around each feature, as this can be done efficiently with linear model implementations such as MatrixEQTL or QTLtools.

On our dataset cqn seemed to work better than log(TPM), but I vaguely remember other people having some issues with it on other datasets, so I would not dare to recommend it as the default without testing it on a few datasets.

Cqn requires feature GC content as a covariate. I calculated this based on the reference genome and peak coordinates using bedtools nuc:
bedtools nuc -fi ../../../annotations/GRCh38/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa -bed ATAC_consensus_peaks.gff3 > ATAC_consensus_peaks.nuc_content.txt
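
For intuition about this covariate, the per-peak GC fraction can be sketched in a few lines of Python. This is only an illustration and not a replacement for `bedtools nuc`: the function name `gc_fraction` and the toy sequences are hypothetical, and the sketch assumes every base (including any `N`) counts towards the sequence length.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G/C bases in a sequence (0.0 for empty input)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    gc_count = seq.count("G") + seq.count("C")
    # Assumption: all bases, including N, count towards the length.
    return gc_count / len(seq)


# Hypothetical peak sequences, purely for illustration.
peaks = {"peak_1": "GGCCAATT", "peak_2": "ATATATGC"}
gc_by_peak = {name: gc_fraction(seq) for name, seq in peaks.items()}
```

In a real analysis the sequences would come from the reference FASTA at the consensus peak coordinates, which is what the command above does in one step.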

drpatelh · 2020-02-18T15:23:02Z

Thanks @kauralasoo ! cc'ing @macroscian here who is our in-house stats guru 😎

The pipeline currently generates raw read counts (featureCounts) and then processes them with DESeq2 based on a user-specified design. Ideally, the user would have to run the pipeline at the experiment level because the DESeq2 model is fitted once across all samples in the design. Also, it doesn't make sense to create a consensus set of intervals across samples you don't want to compare together.

Unless I've missed something, I'm hoping that the simplest solution would be to implement an independent script that starts from raw read counts and uses limma voom to generate the differential intervals instead. I can then add that into the pipeline behind an optional flag (e.g. --limma_diff_analysis) so it can be used if required. Furthermore, I can set the directory structure up in a way where you can get both outputs by using the above parameter along with -resume.

@mikelove I think we need to parallelise DESeq2 anyway. I am setting up some tests to add this into the pipeline but I was wondering whether you had an idea as to what sort of speed-up can be attained? In the dev version of the pipeline that process is currently labelled to use 6 CPUs:

atacseq/main.nf, line 1316 in 34ed69b:

    label 'process_medium'

as defined here:

atacseq/conf/base.config, lines 28 to 32 in 34ed69b:

    withLabel:process_medium {
        cpus = { check_max( 6 * task.attempt, 'cpus' ) }
        memory = { check_max( 42.GB * task.attempt, 'memory' ) }
        time = { check_max( 8.h * task.attempt, 'time' ) }
    }
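
To make the retry behaviour of the config quoted above concrete: each request is multiplied by `task.attempt` and then passed through `check_max`. The Python sketch below mirrors that logic for the `cpus` line only, assuming `check_max` simply caps the request at a configured maximum. The cap of 16 CPUs is an assumed value for illustration, and `requested_cpus` is a hypothetical helper name, not part of Nextflow.

```python
def requested_cpus(attempt: int, base: int = 6, max_cpus: int = 16) -> int:
    """Mimic cpus = check_max(6 * task.attempt, 'cpus'): scale by attempt, then cap."""
    if attempt < 1:
        raise ValueError("attempt numbering starts at 1")
    # Assumption: the cap is a plain minimum against the configured maximum.
    return min(base * attempt, max_cpus)


# CPUs requested on the first, second and third attempt under these assumptions.
cpus_per_attempt = [requested_cpus(attempt) for attempt in (1, 2, 3)]
```

Under these assumptions the process asks for 6 CPUs on the first attempt, so that is the number of workers DESeq2 would have available unless a retry is triggered.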

I realise this may depend on the input data, but is there an upper limit beyond which additional cores give minimal gain?

mikelove · 2020-02-18T16:08:32Z

There are a number of threads on the Bioc site about the speedup attainable with BPPARAM. The gist is: it works fine on my end (where I usually use 4-8 workers and attain some fractional gain relative to nworkers due to overhead, maybe like 50% of the optimal speedup), and usually what happens "in the wild" is that people end up requesting dozens of cores across different nodes, which gets bogged down in memory transfer, and end up with performance worse than if they had just used parallel=FALSE.

drpatelh · 2020-02-24T14:04:09Z

I'm seeing a significant speed-up in the differential accessibility analysis if I parallelise and allocate 6 cores to DESeq2. A previous run of the nf-core/chipseq pipeline on an in-house dataset with 907450 consensus intervals across 60 samples failed to complete because it exceeded our maximum wall-time limit of 72 hours: DESeq2 model building took ~15 hours and extracting the results for each possible pairwise comparison took ~20 minutes each. With the updated implementation in #84 the model building now takes ~3 hours and extracting pairwise results takes ~2 minutes. This really is quite a big difference and solves the initial subject of this issue to parallelise DESeq2.
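
As a rough worked example, the timings quoted above can be turned into a speedup factor and a parallel efficiency (speedup divided by number of cores). This is a sketch on the approximate figures from this comment only; the helper names `speedup` and `parallel_efficiency` are hypothetical.

```python
def speedup(time_before: float, time_after: float) -> float:
    """Ratio of serial to parallel run time (same units for both)."""
    return time_before / time_after


def parallel_efficiency(speedup_factor: float, n_cores: int) -> float:
    """Fraction of the ideal n_cores-fold speedup that was achieved."""
    return speedup_factor / n_cores


N_CORES = 6  # cores allocated to DESeq2 in the run described above

model_speedup = speedup(15.0, 3.0)     # model building: ~15 h -> ~3 h
results_speedup = speedup(20.0, 2.0)   # per-comparison results: ~20 min -> ~2 min
model_efficiency = parallel_efficiency(model_speedup, N_CORES)

print(f"model building: {model_speedup:.1f}x, efficiency {model_efficiency:.0%}")
print(f"results extraction: {results_speedup:.1f}x")
```

On these figures the model building is about 5x faster on 6 cores, which is above the roughly 50% of optimal speedup mentioned earlier in the thread. The results extraction comes out at about 10x, which is more than the core count, so given that the timings are approximate it probably should not be read as a pure parallelisation effect.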

drpatelh · 2020-02-24T14:08:49Z

Going to close this issue and create a new one for the addition of limma.

drpatelh added the enhancement and good first issue labels Feb 7, 2020

drpatelh removed the good first issue label Feb 18, 2020

This was referenced Feb 24, 2020

Parallelise DESeq2 and minor updates nf-core/chipseq#142

Merged

Parallelise DESeq2 and minor updates #84

Merged

drpatelh changed the title ~~Parallelize DESeq2~~ Implement limma for differential analysis on large datasets Feb 24, 2020

drpatelh changed the title ~~Implement limma for differential analysis on large datasets~~ Parallelize DESeq2 Feb 24, 2020

drpatelh closed this as completed Feb 24, 2020

drpatelh mentioned this issue Feb 24, 2020

Implement limma for differential analysis on large datasets #85

Closed
