Make metatranscriptomics work! #280

Closed
d4straub opened this issue Sep 19, 2019 · 6 comments

I am interested in finding a solution for analyzing environmental metatranscriptomic data.

Problem

The goal is to analyse metatranscriptomic data rather than single-organism transcriptomic data. Common aims are differential transcript abundance and gene expression, as well as pathway abundance and expression. However, this pipeline cannot yet pre-process metatranscriptomic data appropriately.

Here are two common metagenomics experiments where rnaseq could be of assistance:
Case 1: A single reference metagenome is assembled (e.g. by nf-core/mag) and annotated, input into nf-core/rnaseq via --fasta and --gff / --gtf. Metatranscriptomic data (Illumina) is available for several samples. This pipeline could run until a count table is produced and further analysis could be done elsewhere.
Case 2: No metagenome is available. This pipeline could do QC and pre-processed reads are further analyzed elsewhere.

This might be somewhat related to #277, #271 and #227.

Solution

Minor adjustments to QC: add a step for rRNA removal (sufficient for environmental samples) and a step for host sequence removal, e.g. for human gut microbiome analysis.

Possible Implementation

Optional rRNA removal with the popular SortMeRNA. Unfortunately the newest release 3.0.3 is not in bioconda yet. This step could be done after Trim Galore!.

Host sequence removal by adding an optional process that maps sequences first to a genome and only forwards unmapped ones. This step could be added directly after rRNA removal.
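The host-removal idea above (map to the host genome, forward only what does not map) can be sketched in a few lines. This is a minimal illustration, not the pipeline's implementation: it assumes the aligner writes SAM text, and the function names `unmapped_reads` and `to_fastq` are hypothetical. The only fact relied on is that bit `0x4` of the SAM FLAG field marks a read as unmapped and bit `0x8` marks its mate as unmapped.

```python
UNMAPPED = 0x4       # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate of this read is unmapped


def unmapped_reads(sam_lines, require_mate_unmapped=False):
    """Yield (name, sequence, quality) for reads that did not map to the host.

    With require_mate_unmapped=True, a read is only kept when its mate is
    unmapped as well, which is the stricter choice for paired-end data.
    """
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if not flag & UNMAPPED:
            continue  # read mapped to the host: drop it
        if require_mate_unmapped and not flag & MATE_UNMAPPED:
            continue  # mate mapped to the host: drop the pair
        yield fields[0], fields[9], fields[10]


def to_fastq(records):
    """Format (name, sequence, quality) tuples as FASTQ text."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in records)


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6",
        "host_read\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "microbe_read\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tFFFF",
    ]
    print(to_fastq(unmapped_reads(example)), end="")
```

In practice the same filtering is usually delegated to the aligner or to a SAM/BAM toolkit; the sketch only shows which records would be forwarded to the next step.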

Question

Would a pull request containing rRNA removal with SortMeRNA be welcome?

d4straub commented Sep 20, 2019

I was too impatient to wait for an answer and implemented it; see #284.

edit: wording

apeltzer pushed a commit that referenced this issue Oct 2, 2019
* Add SortMeRNA as optional step (not default)

* Update Changelog

* Add dedicated test for rRNA removel in .travis.yml

* Update CHANGELOG

* Make rRNA databases configurable

* Add verification for rRNA database manifest file

* Add remaining paramter

* Resolve SortMeRNA memory issue by keeping it at default

* Optimise file output

* Update docs

* Improve wording, remove typo

d4straub commented Oct 9, 2019

I am currently running the pipeline with rRNA removal on real paired-end metatranscriptomic data, and the sortmerna process does not finish because the allocated CPU and time are insufficient. I will optimize these asap. This should go in before release 1.4, otherwise rRNA removal might not be possible.
A pull request will follow once I have solved the issue in my branch.

apeltzer commented Oct 9, 2019

Ok, planning a release today or tomorrow - please speed up 😆

d4straub commented Oct 9, 2019

Well, with the new settings it has now been running for four hours; unfortunately it takes a while. Hope it works out 😟

@d4straub
It ran for 8 hours and was killed by the walltime threshold. I am now starting a last trial with a 48 h maximum time.
Each sample has ~100 million read pairs and ~20G bases, but I am surprised it takes that long.
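As a rough back-of-the-envelope check on those numbers, the sketch below works out the mean read length and the throughput a job would need in order to finish one such sample within a given walltime. It uses only the figures quoted in this thread (about 100 million read pairs, about 20G bases, 8 h and 48 h limits); the helper name `required_reads_per_second` is made up for illustration, and it says nothing about what SortMeRNA actually achieves.

```python
READ_PAIRS = 100_000_000        # ~100 million read pairs per sample
TOTAL_BASES = 20_000_000_000    # ~20G bases per sample

reads = 2 * READ_PAIRS                  # two reads per pair
mean_read_length = TOTAL_BASES / reads  # bases per read


def required_reads_per_second(n_reads, walltime_hours):
    """Throughput needed to process n_reads within the given walltime."""
    return n_reads / (walltime_hours * 3600)


if __name__ == "__main__":
    print(f"mean read length: {mean_read_length:.0f} bp")
    for hours in (8, 48):
        rate = required_reads_per_second(reads, hours)
        print(f"{hours:>2} h walltime -> at least {rate:,.0f} reads/s")
```

So a sample of this size has reads of about 100 bp, and finishing inside the 8 h limit would need roughly 6,900 reads per second sustained across all cores.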

@d4straub d4straub mentioned this issue Oct 11, 2019
Merged
8 tasks
@d4straub
SortMeRNA finished successfully on the smallest file; it took 8 h with 16 cores, which is not very efficient. Pull request #306 allocates 16 cores and 24 h walltime.

apeltzer added a commit that referenced this issue Oct 15, 2019
* First pass update of relevant files

* Fix lint errors and warnings

* Reorder parameters

* Rename Salmon processes

* Use correct strandedness

* Reorder validate inputs

* Bug fixes

* Update installation.md

Minor edit: the docker container name does not have a hyphen

* Update adding_your_own.md

Minor: Update other singularity-related docs to reference the correct docker image.

* Add empty salmon_multiqc_logs channel

* add new dependcies for salmon

* upgrade to latest dependencies

* deleted -M from the featureCounts command

* Fix template merge

* Major overhaul of Salmon requirements

* Update CHANGELOG

* Add tximport function

* add docs for tximport

* update readme and changelog

* fix variable and correct version to parse gtf

* Close outstanding issues and amend salmon merge

* Update CHANGELOG

* Remove subsamp_filesize_thresh parameter

* Don't extract transcripts if transcript_fasta is provided

* Read transcript_fasta from config genome

* fix typo in 'pseudo_aligner'

* fix typo in object name

* Fix logic for both star index provided + salmon for fasta

* --transcriptome --> --transcript_fasta

* Add in ReadGroups for QualiMap compatibility

* Fix typo

* Fix seqCenter

* HISAT2 seq_center

* Revert "Read transcript_fasta from config genome"

This reverts commit 3aa02d4.

* Change logic to deal with both salmon + alignment and fasta references

* Add --gencode flag to salmon index'

* --transcriptome --> --transcript_fasta

* Only transfer quant.sf files for salon_merge"

* Add separate step to clean featurecounts output to minimize memory needed for merging

* Make salmon_merge also mid_memory

* missing 'into'

* Get clean_featureCounts to work with test data

* Use all quant files into salmon_merge -- Reverts 44b7686

* remove git cruft

* fix mismatch between tx2gene and quant.sf

* Use paste to merge everything

* Use params.gencode to decide on --gencode flag

Co-Authored-By: Harshil Patel <drpatelh@users.noreply.github.com>

* use evaluated $gencode parameter

Co-Authored-By: Harshil Patel <drpatelh@users.noreply.github.com>

* add default value for gencode

* Set params.fc_group_features_type = 'gene_type' if gencode

* Add note about --gencode for usage"

* Add note about --gencode for changelog

* no "markdups" in filename

* Update docs/usage.md

Co-Authored-By: Harshil Patel <drpatelh@users.noreply.github.com>

* Use @drpatelh's description

Co-Authored-By: Harshil Patel <drpatelh@users.noreply.github.com>

* Apply suggestions from code review

Use @drpatelh's suggestions for documentation language

Co-Authored-By: Harshil Patel <drpatelh@users.noreply.github.com>

* Remove reference to PR for changelog

* evaluate params.fc_group_features_type within featureCounts process

* Use unix-fu to merge featurecounts

* Wrap biotype variable in braces

Co-Authored-By: Harshil Patel <drpatelh@users.noreply.github.com>

* use 'bash' for syntax of fasta/gtf to fix markdownlint

* Remove first line of featurecounts files

* Evaluate gene biotype earlier and print in summary

* Remove csvtk from requirements

* add a note about redirection

* Use @drpatelh's wording

Co-Authored-By: Harshil Patel <drpatelh@users.noreply.github.com>

* remove random fenced code

* Use `file` instead of `new File`

As I learned [here](nextflow-io/nextflow#1185), `file` != `new File` and `new File` doesn't know how to handle S3 paths. This leads to weird behavior like creating an `s3:` folder, with all the bucket "subfolders" when a pipeline is run:
```
 Thu 27 Jun - 09:03  ~/code/nf-core/rnaseq   origin ☊ olgabot/salmon-gencode ✔ 28☀ 
  ll --tree s3:
Permissions Size User    Date Modified Git Name
drwxr-xr-x     - olgabot 11 Jun 10:26   -- s3:
drwxr-xr-x     - olgabot 11 Jun 10:26   -- └── olgabot-maca
drwxr-xr-x     - olgabot 11 Jun 10:26   --    └── mini-maca
drwxr-xr-x     - olgabot 11 Jun 10:26   --       └── results
drwxr-xr-x     - olgabot 11 Jun 10:26   --          └── pipeline_info
.rw-r--r--   12k olgabot 11 Jun 16:40   --             ├── pipeline_report.html
.rw-r--r--  2.7k olgabot 11 Jun 16:40   --             └── pipeline_report.txt
```

* Use 7th column for gene namec

* Shorten name of biotype field in summary for brevity

* use 0'th item not 1th

* Add sample name to output

* use tximport for each sample and then merge individually

* properly merge gene counts and tpm files

* actually use the gene counts to merge ..

* Add transcript_id and gene_id to salmon output csv

* Add --gencode flag to usage and summary output

* Don't need salmon RDS files

* UPdate changelog"

* Add gtf_qualimap

* Use format strings to create new files

* Update changelog

* Remove trailing slash

* Moved process "Convert GFF3 to GTF"

Moved process "Convert GFF3 to GTF" before "Making STAR index"

* updated changelog and removed -M flag from the featurecounts command

* updated CHANGELOG.md

* trying to solve env troubles

* trying to solve env troubles

* Add all alignment-based steps into optional section

* Get optional alignment to work"

* Add default value for skipTrimmed

* make sample names homogenous on multiqc and featurecounts

* update changelog

* Add test to skip alignment

* Add note in changelog

* Add note about --skipAlignment to docs

* Fix merge conflicts of dev branch

* Add script to filter gtf on seqnames in genome fasta

* Add filter for genes in genome

* remove pycharm nonsesnse

* Retain all gff features upon conversion to gtf

* Add changelog

* Add another line in changelog

* Initial commit for compressed/gzipped reference

* Save gtf file converted from gff

* Keep exon attributes in conversion to gtf

* Whitespace fixes

* whitespace fixes

* Add gff to test config

* Get fasta.gz and gtf.gz to work

* Add default value for compressedReference

* Add configuration to test gziped files

* Fix logic for hisat2

* Add a bunch more tests to check for edge cases of gzipped references

* Fix typo

Co-Authored-By: Alexander Peltzer <apeltzer@users.noreply.github.com>

* Prefer gtf over gff

* Matrix out the test and test_gz stuff into separate travis sections

* Add summary section about compressed reference

* remove global env variables and separate linting into its own jobs

* Fix nextflow run command on travis

* Add script for testing gzipped references-specific things on Travis CI

* Specify java and python versions for matrix jobs

* Add profile for testing gzipped reference

* Add minimal nextflow version for testing

* Protect  with curly braces

* Hopefully set openjdk8 for everyone??

* Move pipeline and markdown linting into separate matrix

* Don't unzip transcript fasta unless necessary

* Fix logic for salmon_index presence

* Temporarily use my branch on test-datasets for now

* Add explanation of --compressedReference and examples

* Add note about --additional_fasta

* Add line in changelog

* Try explicitly stating openjdk

* Move notice about ignoring GFF to when one is actually ignoring it

* Compress to stdout and use long flag for verbose

* Use genome fasta name

* Change permissions to executable

* Add test for extracting fasta transcripts

* Add proper flags for filter_gtf_for_genes_in_genome.py

* Add output flag

* Split fasta id name by whitespace to get name

* Print out which genome seqnames were found

* Add more log message

* Fix extracting overlapping chromosome names for gtf and fasta

* Add all sub-tests into separate travis matrix builds

* Skip QC in gzipped reference tests with homebrewed references

* remove ruby cruftc

* Set language to java for minimal nextflow version

* Fix path of testing gzipped indices

* Add -k flag'

* Use bash to run script

* Add bash type to fenced code blocks

* Add separate tests for no HiSat2 or STAR indices

* Remove extra line

* Only unzip star and hisat2 references if not skipping alignment

* Add more logic for if we need to decompress genome.fa.gz file

* Fix gunzip command name

* GTF is by default preferred over GTF so make tests check GFF usage

* Move notice about ignoring GFF to when one is actually ignoring it

* Fix typos/misspellings

Co-Authored-By: Alexander Peltzer <apeltzer@users.noreply.github.com>

* Add gzipped gff for testing

* Protect flags in quotes

* Don't unzip gff if gtf present

* use nf-core/test-datasets repo with rnaseq branch

* use subset chrom I branch with gzipped data

* Upping default walltime to 4.h

* Add  SortMeRNA to environment.yml

* Update CHANGELOG.md

* NuclearRNA compatibility (#287)

* compatible with nuclear RNAseq

* change to better name for option

* clean names on multiqc

* update changelog

* add test

* better info in changelog

* better comment in travis

* Remove unrequired file

* MarkdownLint fixes

* Fixing changelog

* [Feature] Optionally output unmapped reads (#288)

* SaveUnaligned in STAR

* Add Usage and docs

* Add switch to salmon

*  Add unmapped for HISAT2

* Add tests for unaligned output

* Add more docs

* Adjusting unmapped output handling

* Fix optional outputs

* Fix optional salmon output too

* :palmface:

* Remove unnecessary string interpolation braces when defining memreqs (#295)

* Fix for skipping biotypeqc (#289)

* Adding functionality fixing #268

* Mixed things up

* Shift biotype_qc bits to variable upon execution

* Add mini testcase

* Using bash syntax

* Fixing statement

* Undo... misread

* Add SortMeRNA as optional step (not default) #280 (#284)

* Add SortMeRNA as optional step (not default)

* Update Changelog

* Add dedicated test for rRNA removel in .travis.yml

* Update CHANGELOG

* Make rRNA databases configurable

* Add verification for rRNA database manifest file

* Add remaining paramter

* Resolve SortMeRNA memory issue by keeping it at default

* Optimise file output

* Update docs

* Improve wording, remove typo

* Adjust mapping percentage (#296)

* Adjust mapping percentage

* Add changelog

* Support skipTrimming, Restore SummarizedExperiment object creation (#297)

* add skip trimming option

* add SE object to the output

* Update travis and changelog, fix channels

* update usage

* restore usage TOC

* restore spacing in TOC

* restore usage.md from dev branch

* Add note about PRs to dev vs master (#298)

* Add note about PRs to dev vs master

* Add "PR to dev" to PR template checklist

* CI Testing Updates (#299)

* Revert to old tests

* Bump Tests to the way they were before

* Basic GitHub Actions testing

* Add CI tests for Github Actions

* Strings

* Fix CI

* Bump Nextflow Version

* Add in comment on GFF

* Of course markdownlint

* Dependency updates [skip ci]

* Sort the list [skip ci]

* Remove ToDo string from README

* Update .github/CONTRIBUTING.md

Co-Authored-By: Maxime Garcia <max.u.garcia@gmail.com>

* Update .github/CONTRIBUTING.md

Co-Authored-By: Maxime Garcia <max.u.garcia@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Maxime Garcia <max.u.garcia@gmail.com>

* Update docs/usage.md

Co-Authored-By: Maxime Garcia <max.u.garcia@gmail.com>

* Mini release fixes (#304)

* Remove ToDo string from README

* Mini fix for branch protection

* Apply suggestions from code review

Co-Authored-By: Phil Ewels <phil.ewels@scilifelab.se>

* Update CHANGELOG.md

Co-Authored-By: Phil Ewels <phil.ewels@scilifelab.se>

* Update CHANGELOG.md

Co-Authored-By: Phil Ewels <phil.ewels@scilifelab.se>

* Apply suggestions from code review

Co-Authored-By: Phil Ewels <phil.ewels@scilifelab.se>

* Apply suggestions from code review

Co-Authored-By: Olga Botvinnik <olga.botvinnik@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Maxime Garcia <max.u.garcia@gmail.com>

* Update .github/CONTRIBUTING.md

* Clean up for v1.4 release (#305)

* Remove ToDo string from README

* Mini fix for branch protection

* Correct statemtn

* Make @ewels happy :-)

* MultiQC finds sortmeRNA now

* Auto-Detect compressed input

* Externalize rrna DBs

* Document the rrna-db file

* Cleaned up Changelog for V1.4

* Break up Readme

* Fix missing channel if RRNA removal not running

* Update main.nf

Co-Authored-By: Olga Botvinnik <olga.botvinnik@gmail.com>

* Add extra test for skipping alignment

* remove hidden warning in salmon_tx2gene

I discovered a warning because how I was getting the input. This fixed it. I went ahead and commit directly here.

* Minor typos fixed [skip ci]

Some minor typos addressed and a bit more structure

* Dev (#306)

* sortmerna requires more time with large read files

* Adjust paired-end sortmerna parameters

* Increase sortmerna cpu & time

* More cores for sortmerna

* Adjust sortmerna resources once more

* Template update for nf-core/tools version 1.8.dev0

* Fix multiqc stuff

* Remove png, include fix by @drpatelh

* TEMPLATE v1.7 PR merged in (#310)

* Remove ToDo string from README

* Template update for nf-core/tools version 1.8.dev0

* Fix multiqc stuff

* Remove png, include fix by @drpatelh

* Apply suggestions from code review

Co-Authored-By: Olga Botvinnik <olga.botvinnik@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Olga Botvinnik <olga.botvinnik@gmail.com>

* Add Memory adjustments

* Mini Memory Adjustments (#313)

* Remove ToDo string from README

* Template update for nf-core/tools version 1.8.dev0

* Fix multiqc stuff

* Remove png, include fix by @drpatelh

* Add Memory adjustments

* Final Dependency updates for PR 1.4

* Let's get RSEM in once we add it :-)

* Package updates for PR 1.4 (#314)

* Remove ToDo string from README

* Template update for nf-core/tools version 1.8.dev0

* Fix multiqc stuff

* Remove png, include fix by @drpatelh

* Add Memory adjustments

* Final Dependency updates for PR 1.4

* Let's get RSEM in once we add it :-)

* Requested changes from review by @drpatelh