Use multiple genomes and annotation for transcriptome mapping #46

drpatelh · 2019-12-09T20:06:45Z

Many thanks for contributing to nf-core/nanoseq!

Please fill in the appropriate checklist below (delete whatever is not relevant). These are the most common things requested on pull requests (PRs).

PR checklist

This comment contains a description of changes (with reason)
If you've fixed a bug or added code that should be tested, add tests!
If necessary, also make a PR on the nf-core/nanoseq branch on the nf-core/test-datasets repo
Ensure the test suite passes (nextflow run . -profile test,docker).
Make sure your code lints (nf-core lint .).
Documentation in docs is updated
CHANGELOG.md is updated
README.md is updated

Learn more about contributing: https://github.com/nf-core/nanoseq/tree/master/.github/CONTRIBUTING.md

Added a transcriptome entry to the samplesheet.csv input. It can be a GTF file or a FASTA file, or it will be resolved automatically if an iGenomes reference is used.
Replaced if statements with when where possible, to make the pipeline easier to read.
Added a gtf2bed conversion process and container.
Fixes #43 (FastQC fails when fastq file is not in gzip format)
Fixes #32 (Transcriptome alignment)
Fixes #31 (Transcriptome aware genome alignment)
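To make the three cases for the transcriptome entry concrete, here is a minimal Python sketch of how a samplesheet value could be classified by file extension. This is not the code in bin/check_samplesheet.py: the function name, the extension lists, and the choice to treat an empty value as "resolve it later, for example from an iGenomes reference" are all assumptions made for illustration.

```python
# Hypothetical extension lists; the pipeline's own rules may differ.
FASTA_EXTS = (".fasta", ".fa", ".fasta.gz", ".fa.gz")
GTF_EXTS = (".gtf", ".gtf.gz")


def classify_transcriptome(entry: str) -> str:
    """Classify a samplesheet transcriptome value (illustrative only).

    Returns one of:
      'none'    - empty value, to be resolved elsewhere (assumed: from iGenomes)
      'fasta'   - transcript sequences in FASTA format
      'gtf'     - annotation in GTF format
      'unknown' - anything else, which a real check would probably reject
    """
    value = entry.strip()
    if not value:
        return "none"
    lower = value.lower()
    # str.endswith accepts a tuple of suffixes.
    if lower.endswith(FASTA_EXTS):
        return "fasta"
    if lower.endswith(GTF_EXTS):
        return "gtf"
    return "unknown"
```

A check like this would let the downstream channels branch on the entry type instead of inspecting file names again in each process.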

I'm not going to have much time to work on this for the rest of the week because I've spent the past few days pretty much glued to getting this solved. Some things left to do:

There may still be some bugs in the logic, so it will need extensive testing with different entries for genome and transcriptome, and with the different --skip flags, to see if the channels are all defined properly.
Tests will also fail because the samplesheet.csv on test-datasets doesn't have a transcriptome entry.
Add documentation.
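One way to organise that testing is to enumerate every combination of genome entry, transcriptome entry and skip flag up front. The sketch below only builds the list of combinations and does not run the pipeline. The example values, including the two flag names, are placeholders chosen for illustration and are not taken from this PR.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Placeholder values for illustration only; substitute the real references and flags.
GENOMES = ["hg19", "genome.fa"]
TRANSCRIPTOMES = ["", "genes.gtf", "transcripts.fa"]
SKIP_FLAGS = ["", "--skip_alignment", "--skip_qc"]


def build_test_matrix(genomes: Sequence[str],
                      transcriptomes: Sequence[str],
                      skip_flags: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Return every (genome, transcriptome, skip flag) combination to test."""
    return list(product(genomes, transcriptomes, skip_flags))


if __name__ == "__main__":
    for genome, transcriptome, flag in build_test_matrix(GENOMES, TRANSCRIPTOMES, SKIP_FLAGS):
        print(f"genome={genome!r} transcriptome={transcriptome!r} flag={flag!r}")
```

Each tuple could then be turned into one samplesheet row plus one command line, so that an undefined channel shows up against a specific combination.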

If someone fancies addressing the remaining points, it's probably wise to update the samplesheet.csv on test-datasets, re-trigger the Travis tests here, merge once they pass (hopefully it will work first time; if not, I can push a commit or two) and then do some heavy testing 👍 😅

lwratten

Looking really good! 👍
Very good timing too! I just started implementing this yesterday - you really saved me a whole lot of headache with this one so thank you very much! 😁
Just a few things here and there from me that might make it a bit more robust/speed up certain processes.

Also, I was thinking it might be good to alter the output subdirectories to reflect whether a sample was aligned to the genome or the transcriptome, as this will not show up in the file name. What do you reckon?

Also, I'll go through and change the samplesheet now so we can get tests passing. Happy to implement some of the suggestions below too after a merge!

bin/check_samplesheet.py

lwratten · 2019-12-10T06:42:30Z

main.nf

+// def fix_channel(ArrayList ch) {
+//
+//     if (ch.size() == 7) {
+//         return [ file(ch[0]), file(ch[1]), ch[2], ch[5], ch[6] ]  // [ fasta, gtf, bed, sizes, is_transcripts ]
+//     } else if (ch.size == 5) {
+//         return [ file(ch[0]), false, false, ch[3], ch[4] ]
+//     }
+// }


Do we still need this?

It should have been removed 🤔

main.nf

drpatelh · 2019-12-10T17:03:49Z

OK. I've tested all of the commands in .travis.yml and it's working now.

Thanks for adding the transcriptome entry to the test-datasets @lwratten! I was sitting here thinking the tests should be failing until I looked 😄 We still do need to resolve #37 too.

I don't think we should complicate things by having genome/transcriptome in the file/directory name. It seems tempting, but everything should be stored in the samplesheet.csv. I've tried to factor in as many possibilities as I can, but it's most likely that people will run the pipeline on a single reference genome/transcriptome. I think properly documenting the exact behaviour of the pipeline is probably more important. Also, please do test the pipeline using separate genome/transcriptome entries in different formats to see if we get the expected behaviour.

drpatelh · 2019-12-10T17:09:24Z

It would be good if you could double-check the logic I have used for minimap2 and graphmap2 for DNA, RNA, etc.
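For anyone reviewing that logic, the following Python sketch shows one plausible way to pick minimap2 options per sample. It is not the code in main.nf. The options themselves (`-ax map-ont`, `-ax splice`, `-uf`, `-k14`, `--junc-bed`) are documented minimap2 options, but the protocol labels, the function name, and the decision to use `map-ont` whenever the target is a set of transcript sequences are assumptions made for illustration.

```python
from typing import List, Optional


def minimap2_args(protocol: str,
                  is_transcripts: bool,
                  stranded: bool = False,
                  junc_bed: Optional[str] = None) -> List[str]:
    """Build minimap2 preset arguments for one sample (illustrative only)."""
    if protocol not in ("DNA", "cDNA", "directRNA"):
        raise ValueError(f"unknown protocol: {protocol}")

    # Unspliced alignment: genomic DNA reads, or any reads mapped to
    # transcript sequences (assumption: no splice-aware preset is needed there).
    if protocol == "DNA" or is_transcripts:
        return ["-ax", "map-ont"]

    # Spliced alignment of cDNA or direct RNA reads against a genome.
    args = ["-ax", "splice"]
    if protocol == "directRNA":
        # Forward-strand-only splice search with a smaller k-mer,
        # as the minimap2 documentation suggests for direct RNA reads.
        args += ["-uf", "-k14"]
    elif stranded:
        args += ["-uf"]
    if junc_bed:
        # Annotated junctions, e.g. a BED file produced by a gtf2bed step.
        args += ["--junc-bed", junc_bed]
    return args
```

For example, `minimap2_args("cDNA", is_transcripts=False, junc_bed="genes.bed")` yields the spliced preset plus the junction file, while any DNA sample falls back to the plain `map-ont` preset.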

lwratten

The logic for the minimap2 and graphmap2 calls looks good!
I will give it a solid test with lots of different reference formats and combos once we're merged! 👍

drpatelh added 30 commits December 6, 2019 10:39

Add transcriptome (09993cf)
Add igenomes logic to main script (03ffe94)
Fix channels (a15d60c)
Raise log warnings (640a218)
Update log to error (2bfa17c)
Update fastq checks (a3c6075)
Exit if fastq extension wrong (aeb85d8)
Update fastq logs (8c44e14)
Tidy up code (2b3d0d2)
Fix remnants (2ceddfd)
Add in basecalling (dd7f8b2)
Fix config bug (88c05cb)
Fix channels (7dd5370)
Add in FastQC (7575d34)
Run through FastQC (ca2174b)
Refactor genome - transcriptome (af2b2c9)
Use transcriptome too (f3af78a)
Update logic (4a6388c)
Update logic (fcbc146)
Add in Guppy againnnnn (ed759f6)
Add in FastQC (2cca379)
Update gtf logic (a604bf1)
Add function to check (23526ba)
Output everything (5a843c2)
Update logic for gtf and fasta (c391b2b)
Add header (0f6390b)
Update outputs (1ad00d1)
Add in FastQC (89031c1)
Add in function again (0ac48a8)
Fix channels (4785b5a)

drpatelh added 14 commits December 9, 2019 11:20

Change process name (32f4707)
Fix channels (b1fdf7e)
Update minimap2 index (12a30c0)
Add graphmap2 index (dab652a)
Fix file staging (f6544bf)
Complete refactor again (29b3367)
Get channels working (9d54070)
Add in graphmap (53480ad)
Add in bedgraph (221f4d0)
Add in BigWig (b6dad66)
Change file extensions (a1ff4c7)
Change channel content (4f8d1b0)
Add in MultiQC (debaa38)
Tidy up (c2b7acc)

drpatelh requested review from lwratten and csawye01 December 9, 2019 20:14

lwratten requested changes Dec 10, 2019

lwratten reviewed Dec 10, 2019

lwratten reviewed Dec 10, 2019

drpatelh added 4 commits December 10, 2019 08:39

Add @lwrattens suggestions (eadc3de)
Fix indents (50b2e77)
Fix and and bug (b8fc84f)
Fix channel not being defined (b333ca6)

This was referenced Dec 10, 2019

FastQC fails when fastq file is not in gzip format #43 (Closed)
Transcriptome alignment #32 (Closed)
Transcriptome aware genome alignment #31 (Closed)

lwratten approved these changes Dec 11, 2019

drpatelh merged commit bb81383 into nf-core:dev Dec 11, 2019
