Merge pull request #164 from CKComputomics/umitools

Merging this into the umihandling branch so that I can fix the remaining bits there myself more easily. Please wait for the next PR to be opened; I will post a link here.
nf-core · Jan 11, 2024 · 069beb1
2 parents f2541d2 + fcc3ef0

Showing 18 changed files with 795 additions and 47 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,7 +5,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [dev](https://github.com/nf-core/smrnaseq/branch/dev)
 
-- _nothing yet done_
+### Parameters
+
+| Old parameter | New parameter               |
+| ------------- | --------------------------- |
+|               | `--with_umi`                |
+|               | `--umitools_extract_method` |
+|               | `--umitools_bc_pattern`     |
+|               | `--umi_discard_read`        |
+|               | `--save_umi_intermeds`      |
+|               | `--umi_merge_unmapped`      |
+
 
 ## [v2.2.4](https://github.com/nf-core/smrnaseq/releases/tag/2.2.4) - 2023-11-03
 
@@ -64,22 +74,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [[#188](https://github.com/nf-core/smrnaseq/pull/188)] - Dropped TrimGalore in favor of fastp QC and adapter trimming, improved handling of adapters and trimming parameters
 - [[#194](https://github.com/nf-core/smrnaseq/issues/194)] - Added default adapters file for FastP improved miRNA adapter trimming
 
-### Parameters
-
-| Old parameter | New parameter            |
-| ------------- | ------------------------ |
-|               | `--mirgenedb`            |
-|               | `--mirgenedb_species`    |
-|               | `--mirgenedb_gff`        |
-|               | `--mirgenedb_mature`     |
-|               | `--mirgenedb_hairpin`    |
-|               | `--contamination_filter` |
-|               | `--rrna`                 |
-|               | `--trna`                 |
-|               | `--cdna`                 |
-|               | `--ncrna`                |
-|               | `--pirna`                |
-|               | `--other_contamination`  |
 
 ## [v2.0.0](https://github.com/nf-core/smrnaseq/releases/tag/2.0.0) - 2022-05-31 Aqua Zinc Chihuahua
 
@@ -98,20 +92,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Other enhancements & fixes
 
 - [#134](https://github.com/nf-core/smrnaseq/issues/134) - Fixed colSum of zero issues for edgeR_miRBase.R script
+- [#49](https://github.com/nf-core/smrnaseq/issues/49) - Integrated the existing umitools modules into the pipeline and extended the deduplication step.
 - [#55](https://github.com/lpantano/seqcluster/pull/55) - update seqcluster to fix UMI-detecting bug
 
-### Parameters
-
-| Old parameter        | New parameter    |
-| -------------------- | ---------------- |
-| `--conda`            | `--enable_conda` |
-| `--clusterOptions`   |                  |
-| `--publish_dir_mode` |                  |
-
-> **NB:** Parameter has been **updated** if both old and new parameter information is present.
-> **NB:** Parameter has been **added** if just the new parameter information is present.
-> **NB:** Parameter has been **removed** if parameter information isn't present.
-
 ### Software dependencies
 
 Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
@@ -133,6 +116,7 @@ Note, since the pipeline is now using Nextflow DSL2, each process will be run wi
 | `seqkit`             | 0.16.0      | 2.0.0       |
 | `trim-galore`        | 0.6.6       | 0.6.7       |
 | `bioconvert`         | -           | 0.4.3       |
+| `umi_tools`          | -           | 1.1.2       |
 | `htseq`              | -           | -           |
 | `markdown`           | -           | -           |
 | `pymdown-extensions` | -           | -           |

diff --git a/README.md b/README.md
@@ -28,28 +28,30 @@ You can find numerous talks on the nf-core events page from various topics inclu
 ## Pipeline summary
 
 1. Raw read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Adapter trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
+2. UMI barcode extraction ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
+3. Adapter trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
    1. Insert Size calculation
    2. Collapse reads ([`seqcluster`](https://seqcluster.readthedocs.io/mirna_annotation.html#processing-of-reads))
-3. Contamination filtering ([`Bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml))
-4. Alignment against miRBase mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
-5. Alignment against miRBase hairpin
+4. UMI barcode deduplication ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
+5. Contamination filtering ([`Bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml))
+6. Alignment against miRBase mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
+7. Alignment against miRBase hairpin
    1. Unaligned reads from step 3 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
    2. Collapsed reads from step 2.2 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
-6. Post-alignment processing of miRBase hairpin
+8. Post-alignment processing of miRBase hairpin
    1. Basic statistics from step 3 and step 4.1 ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
    2. Analysis on miRBase, or MirGeneDB hairpin counts ([`edgeR`](https://bioconductor.org/packages/release/bioc/html/edgeR.html))
       - TMM normalization and a table of top expression hairpin
       - MDS plot clustering samples
       - Heatmap of sample similarities
    3. miRNA and isomiR annotation from step 4.1 ([`mirtop`](https://github.com/miRTop/mirtop))
-7. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
+9. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
    1. Post-alignment processing of alignment against host reference genome ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
-8. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
-   1. Mapping against reference genome with the mapper module
-   2. Known and novel miRNA discovery with the mirdeep2 module
-9. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
-10. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))
+10. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
+    1. Mapping against reference genome with the mapper module
+    2. Known and novel miRNA discovery with the mirdeep2 module
+11. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
+12. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))
 
 ## Usage
 

diff --git a/conf/modules.config b/conf/modules.config
@@ -77,6 +77,7 @@ process {
 //
 // Read QC and trimming options
 //
+
 process {
     withName: 'MIRTRACE_RUN' {
         publishDir = [
@@ -89,15 +90,15 @@ process {
 
 if (!(params.skip_fastqc)) {
     process {
-        withName: '.*:FASTQC_FASTP:FASTQC_.*' {
+        withName: '.*:FASTQC_UMITOOLS_FASTP:FASTQC_.*' {
             ext.args = '--quiet'
         }
     }
 }
 
 if (!params.skip_fastp) {
     process {
-        withName: 'FASTP' {
+        withName: '.*:FASTQC_UMITOOLS_FASTP:FASTP' {
                     ext.args = [ "",
             params.trim_fastq              ? "" : "--disable_adapter_trimming",
             params.clip_r1 > 0             ? "--trim_front1 ${params.clip_r1}" : "", // Remove bp from the 5' end of read 1.
@@ -142,6 +143,90 @@ if (!params.skip_fastp) {
     }
 }
 
+if (params.with_umi && !params.skip_umi_extract) {
+    process {
+        withName: '.*:FASTQC_UMITOOLS_FASTP:UMITOOLS_EXTRACT' {
+            ext.args   = [
+                    params.umitools_extract_method ? "--extract-method=${params.umitools_extract_method}" : '',
+                    params.umitools_bc_pattern     ? "--bc-pattern='${params.umitools_bc_pattern}'" : '',
+                ].join(' ').trim()
+            publishDir = [
+                [
+                    path: { "${params.outdir}/umitools" },
+                    mode: params.publish_dir_mode,
+                    pattern: "*.log"
+                ],
+                [
+                    path: { "${params.outdir}/umitools" },
+                    mode: params.publish_dir_mode,
+                    pattern: "*.fastq.gz",
+                    enabled: params.save_umi_intermeds
+                ]
+            ]
+        }
+    }
+}
+
+//
+// UMI tools deduplication
+//
+
+if (params.with_umi) {
+    process {
+        withName: '.*:DEDUPLICATE_UMIS:UMITOOLS_DEDUP' {
+            ext.args = { meta.single_end ? '' : '--unpaired-reads=discard --chimeric-pairs=discard' }
+            ext.prefix = { "${meta.id}.umi_dedup.sorted" }
+            publishDir = [
+                [
+                    path: { "${params.outdir}/umi_dedup/umitools" },
+                    mode: params.publish_dir_mode,
+                    pattern: '*.tsv'
+                ],
+                [
+                    path: { "${params.outdir}/umi_dedup" },
+                    mode: params.publish_dir_mode,
+                    pattern: '*.bam',
+                    enabled: (
+                        params.save_umi_intermeds
+                    )
+                ]
+            ]
+        }
+
+        withName: '.*:DEDUPLICATE_UMIS:BAM_SORT_SAMTOOLS:SAMTOOLS_SORT' {
+            ext.prefix = { "${meta.id}.sorted" }
+            publishDir = [
+                path: { "${params.outdir}/umi_dedup" },
+                mode: params.publish_dir_mode,
+                pattern: '*.{bam}',
+                enabled: (
+                    params.save_umi_intermeds
+                )
+            ]
+        }
+
+        withName: '.*:DEDUPLICATE_UMIS:BAM_SORT_SAMTOOLS:SAMTOOLS_INDEX' {
+            ext.prefix = { "${meta.id}.sorted" }
+            publishDir = [
+                path: { "${params.outdir}/umi_dedup" },
+                mode: params.publish_dir_mode,
+                pattern: '*.{bai,csi}',
+                enabled: (
+                    params.save_umi_intermeds
+                )
+            ]
+        }
+
+        withName: '.*:DEDUPLICATE_UMIS:BAM_SORT_SAMTOOLS:BAM_STATS_SAMTOOLS:.*' {
+            publishDir = [
+                path: { "${params.outdir}/umi_dedup/samtools_stats" },
+                mode: params.publish_dir_mode,
+                pattern: '*.{stats,flagstat,idxstats}'
+            ]
+        }
+    }
+}
+
 //
 // Quantification
 //
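
A note for anyone customising the deduplication step: the selectors added above can be reused from a user-supplied config. The fragment below is a minimal sketch, assuming it is passed with `-c` and that Nextflow gives it precedence over the pipeline's `conf/modules.config`. The file name is hypothetical, and `--method=unique` (an option of `umi_tools dedup`) is purely illustrative and not part of this PR.

```groovy
// custom_umi.config (hypothetical file name), supplied via `-c custom_umi.config`
process {
    withName: '.*:DEDUPLICATE_UMIS:UMITOOLS_DEDUP' {
        // Keep the paired-end safeguards from the pipeline default and
        // append an illustrative extra option for `umi_tools dedup`.
        ext.args = {
            def paired = meta.single_end ? '' : '--unpaired-reads=discard --chimeric-pairs=discard'
            "${paired} --method=unique".trim()
        }
    }
}
```

Because `ext.args` is replaced rather than merged, the override repeats the paired-end flags from the default instead of relying on them being inherited.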

diff --git a/docs/output.md b/docs/output.md
@@ -13,6 +13,8 @@ The directories listed below will be created in the results directory after the
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
 - [FastQC](#fastqc) - read quality control
+- [UMI-tools extract](#umi-tools-extract) - UMI barcode extraction
+- [UMI-tools deduplicate](#umi-tools-deduplicate) - read deduplication
 - [FastP](#fastp) - adapter trimming
 - [Bowtie2](#bowtie2) - contamination filtering
 - [Bowtie](#bowtie) - alignment against mature miRNAs and miRNA precursors (hairpins)
@@ -40,6 +42,21 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 
 ![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
 
+## UMI-tools extract
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `umitools/`
+  - `*.fastq.gz`: If `--save_umi_intermeds` is specified, FastQ files **after** UMI extraction will be placed in this directory.
+  - `*.log`: Log file generated by the UMI-tools `extract` command.
+
+</details>
+
+[UMI-tools](https://github.com/CGATOxford/UMI-tools) deduplicates reads based on unique molecular identifiers (UMIs) to address PCR bias. First, the UMI-tools `extract` command removes the UMI barcode from the read sequence and adds it to the read name. Second, after mapping, reads are deduplicated based on their UMI, as described in the [UMI-tools deduplicate](#umi-tools-deduplicate) section.
+
+To facilitate processing of input data which has the UMI barcode already embedded in the read name from the start, `--skip_umi_extract` can be specified in conjunction with `--with_umi`.
+
 ## FastP
 
 [FastP](https://github.com/OpenGene/fastp) is used for removal of adapter contamination and trimming of low quality regions.
@@ -55,6 +72,19 @@ Contains FastQ files with quality and adapter trimmed reads for each sample, alo
 
 FastP can automatically detect adapter sequences when not specified directly by the user - the pipeline also comes with a feature and a supplied miRNA adapters file to ensure adapters auto-detected are more accurate. If there are needs to add more known miRNA adapters to this list, please open a pull request.
 
+## UMI-tools deduplicate
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `umi_dedup/`
+  - `*.tsv`: Results statistics files detailing the UMI deduplication results.
+  - `*.bam`: If `--save_umi_intermeds` is specified, the BAM files produced by UMI deduplication will be placed in this directory, together with their sorted and indexed counterparts.
+  - `samtools_stats/*.{stats,flagstat,idxstats}`: Statistics on the mappings underlying the UMI deduplication.
+</details>
+
+[UMI-tools](https://github.com/CGATOxford/UMI-tools) deduplicates reads based on unique molecular identifiers (UMIs) to address PCR bias. First, the UMI-tools `extract` command removes the UMI barcode from the read sequence and adds it to the read name, as described in the [UMI-tools extract](#umi-tools-extract) section. The reads are then aligned against the full genome of the species and deduplicated based on that alignment. The deduplicated reads are converted back into FastQ format and merged with the reads that remained unmapped in order to reduce potential reference bias. This merging can be disabled by setting `--umi_merge_unmapped false`. The resulting FastQ files are used in the remaining steps of the pipeline.
+
 ## Bowtie2
 
 [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is used to align the reads to user-defined databases of contaminants.
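
For reviewers who want to try the documented UMI options together, the following config fragment is a rough sketch of how the new parameters might be combined. The parameter names come from this PR; the file name and the barcode pattern are placeholders chosen for illustration, so adjust them to the library kit in use.

```groovy
// umi_params.config (hypothetical file name), supplied via `-c umi_params.config`
params {
    with_umi                = true
    // `umi_tools extract` offers the 'string' and 'regex' extract methods
    umitools_extract_method = 'string'
    // Placeholder pattern: eight UMI bases (N); replace with your kit's layout
    umitools_bc_pattern     = 'NNNNNNNN'
    // Publish the extracted FastQ files and the deduplicated BAM files
    save_umi_intermeds      = true
    // Set to false to skip merging deduplicated reads with unmapped reads
    umi_merge_unmapped      = true
}
```

If the UMIs are already embedded in the read names, the documentation above suggests adding `skip_umi_extract = true` alongside `with_umi`, in which case the two `umitools_*` settings would presumably be unnecessary.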

diff --git a/modules/local/mirdeep2_run.nf b/modules/local/mirdeep2_run.nf
@@ -40,4 +40,3 @@ process MIRDEEP2_RUN {
     END_VERSIONS
     """
 }
-
diff --git a/modules/nf-core/modules/cat/cat/main.nf b/modules/nf-core/modules/cat/cat/main.nf
diff --git a/modules/nf-core/modules/cat/cat/meta.yml b/modules/nf-core/modules/cat/cat/meta.yml