Lane Merging option? #91

Closed
apeltzer opened this issue Sep 13, 2018 · 19 comments

@apeltzer
Member

We quite often have cases where there is more than one FastQ file per condition, e.g. when samples have been sequenced on more than one lane:

blabla_L001_R1.fastq.gz
blabla_L001_R1.fastq.gz
blabla_L002_R1.fastq.gz
blabla_L002_R1.fastq.gz

These are single-end in this case.
One possibility would be to treat such files as a single sample based on the file name, with an option in RNAseq to specify a lane pattern, for example. Would this be of general interest?

Cheers,
Alex

@ewels
Member

ewels commented Sep 13, 2018

We've talked about the same thing a few times before (SciLifeLab#29, SciLifeLab#98). Although the idea is nice, I'm concerned that it could be easy for such automatic functionality to go wrong silently. We're currently doing this step in our parent master pipeline tool which launches and manages the nextflow runs instead. This is safer for us as we already have the details of which samples are split in our LIMS so can use that directly instead of guessing from filenames.

Having said this - with the right sanity checks in place and no default (or off by default), it should be fine and would certainly be a useful feature for others.
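
As a rough illustration of what such sanity checks could look like, here is a minimal Python sketch. It is not taken from the pipeline: the naming convention `<sample>_L<lane>_R<read>.fastq.gz` and the function name `check_lanes` are assumptions made for this example. It refuses unexpected names, files listed twice, and samples whose R1 and R2 lanes do not match, instead of guessing silently.

```python
import re

# Assumed naming convention (hypothetical): <sample>_L<lane>_R<read>.fastq.gz
NAME_RE = re.compile(r"^(?P<sample>.+)_L(?P<lane>\d{3})_R(?P<read>[12])\.fastq\.gz$")


def check_lanes(filenames):
    """Return {sample: {read: [lanes]}}, raising ValueError on anything suspicious."""
    seen = set()
    layout = {}
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None:
            # Fail loudly rather than guessing how to group an odd file name.
            raise ValueError(f"unexpected file name: {name!r}")
        if name in seen:
            raise ValueError(f"file listed twice: {name!r}")
        seen.add(name)
        reads = layout.setdefault(match["sample"], {})
        reads.setdefault(match["read"], []).append(match["lane"])
    for sample, reads in layout.items():
        # For paired-end data, R1 and R2 must cover exactly the same lanes.
        lane_sets = {tuple(sorted(lanes)) for lanes in reads.values()}
        if len(lane_sets) > 1:
            raise ValueError(f"R1 and R2 lanes differ for sample {sample!r}")
    return layout


layout = check_lanes(["blabla_L001_R1.fastq.gz", "blabla_L002_R1.fastq.gz"])
```

A real implementation would probably also want to report which files were grouped together, so that the user can confirm the result before any merging happens.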

This could potentially even be added to the template repo, as I'm sure many pipelines could benefit from the same thing.

@apeltzer
Member Author

I guess I could have a go and simply make this turned off by default so it doesn't mess with other options in general, try to generalize it and then submit that to our template.

@apeltzer
Member Author

I agree that it can be an issue and that one could as well have a separate "mini pipeline" that simply merged FastQ files together for example. Just a thought but I guess I'll give it a try and check how much effort this takes.

@maxulysse
Member

In Sarek, we are merging such FastQ files, but every sample's path is defined in a TSV file, so I'm guessing that won't apply here.

@apeltzer
Member Author

Yes I saw that but specifying something like the normal --reads option together with a --laneregex or similar would be the intent here :-)
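
To make the idea concrete, here is a small Python sketch of how files matched by `--reads` could be grouped using a lane pattern. The `--laneregex` option is only a proposal in this thread, and the default pattern `_L\d{3}` and the function name `group_by_sample` are assumptions for illustration, not existing pipeline code.

```python
import re
from collections import defaultdict

# Hypothetical default: an Illumina-style lane token such as "_L001".
LANE_PATTERN = r"_L\d{3}"


def group_by_sample(filenames, lane_pattern=LANE_PATTERN):
    """Group FastQ names by what remains once the lane token is removed."""
    lane_re = re.compile(lane_pattern)
    groups = defaultdict(list)
    for name in filenames:
        if lane_re.search(name) is None:
            # A file without a lane token is an error, not a new sample.
            raise ValueError(f"no lane token found in {name!r}")
        key = lane_re.sub("", name, count=1)
        groups[key].append(name)
    return {key: sorted(files) for key, files in groups.items()}


files = [
    "blabla_L002_R1.fastq.gz",
    "blabla_L001_R1.fastq.gz",
    "other_L001_R1.fastq.gz",
]
grouped = group_by_sample(files)
```

Here the two `blabla` lanes end up under the key `blabla_R1.fastq.gz`, while `other` stays a single-file group that needs no merging.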

@apeltzer
Member Author

apeltzer commented Sep 14, 2018

Merge and then map fastq files

However, I'm not sure whether we can merge stats files (e.g. STAR/HISAT2 log files) correctly in such cases. @ewels Does MultiQC handle such things, or can I concatenate stats files for that use case? If that doesn't work I'll just implement the "non-ideal" solution now...

@maxulysse
Member

Not completely unrelated, but I have heard people talking about uBAM
Could it be a solution to consider at some point?

@apeltzer
Member Author

I think so - certainly for things like Sarek and ExoSeq!

@ewels
Member

ewels commented Sep 16, 2018

Non-ideal for now I think. I don't think that mapping split RNA-seq FastQ files will make much difference to the speed in practice: projects typically have quite a lot of samples (more than WGS I'd argue), so they are already parallelised well. And yes, it'll make reporting and everything quite a lot trickier. MultiQC can't really handle these cases well currently.

@lconde-ucl
Member

> I agree that it can be an issue and that one could as well have a separate "mini pipeline" that simply merged FastQ files together for example. Just a thought but I guess I'll give it a try and check how much effort this takes.

Hi, do you know if there is such a "mini pipeline" currently available in nextflow? Thanks!

@ewels
Member

ewels commented Dec 17, 2018

NB: This could be added in with the tsv sample input described in #123

@ewels
Member

ewels commented Dec 17, 2018

Ideal solution:

  • map independent lanes and merge the BAM files

@apeltzer - I'm pretty sure that we can't do this. Usually RNA-seq aligners start by doing non-spliced alignments and building a gene / exon model from this, to be used for a second round of spliced alignments. As such, you want to use as much data as possible in that first step, so the lanes should be merged prior to alignment.
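
Merging prior to alignment can be as simple as concatenating the gzipped lane files, since a sequence of gzip members is itself a valid gzip stream. The Python sketch below shows this under stated assumptions: the helper name `concat_fastq_gz`, the demo file names, and the reads are all made up for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_fastq_gz(lane_files, merged_path):
    """Append gzip-compressed lane files byte-for-byte into one output file."""
    with open(merged_path, "wb") as out:
        for lane_file in lane_files:
            with open(lane_file, "rb") as src:
                shutil.copyfileobj(src, out)


# Tiny demonstration with two made-up lanes holding one read each.
workdir = Path(tempfile.mkdtemp())
lanes = []
for lane, seq in (("L001", "ACGT"), ("L002", "TTGA")):
    path = workdir / f"demo_{lane}_R1.fastq.gz"
    with gzip.open(path, "wt") as handle:
        handle.write(f"@read_{lane}\n{seq}\n+\nIIII\n")
    lanes.append(path)

merged = workdir / "demo_R1.fastq.gz"
concat_fastq_gz(lanes, merged)

# Reading the merged file back yields the records of both lanes in order.
with gzip.open(merged, "rt") as handle:
    merged_lines = handle.read().splitlines()
```

For paired-end data the R1 and R2 files would need to be concatenated in the same lane order, so that read pairs stay in sync.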

@apeltzer
Member Author

apeltzer commented Dec 17, 2018

Yes you're right. For DNA it makes sense to speed up computation (mapping etc) but for RNA-seq alignment it doesn't make sense.

I edited my comment on top...

@apeltzer
Member Author

So, I will close this now. For future use cases, tools for this already exist, and with the upcoming Nextflow modules we can even let users perform merging through optional subworkflows for such specific use cases, e.g. this one here: https://github.com/czbiohub/fastqcat

Doesn't make any sense to implement it here then ;-)

@drpatelh
Member

drpatelh commented Feb 20, 2020

Reopening this issue. I swear it was here but couldn't find it 😅 This has come up again at The Crick so we should probably wait until we have an alternative solution to close 👍 cc @lDesiree

@drpatelh drpatelh reopened this Feb 20, 2020
@ewels ewels added this to Existing pipelines in hackathon-crick-2020 Mar 3, 2020
@apeltzer
Member Author

apeltzer commented Mar 6, 2020

Maybe we should shift this over to the demultiplexing pipeline? What are your thoughts on that?

@drpatelh
Member

drpatelh commented Mar 6, 2020

I think we should still have this functionality in the pipeline at some point because it will also allow users to supply pre-demultiplexed data in this format.

@drpatelh
Member

Functionality to add design file input has now been added in #459 so it should now be relatively straightforward to cat FastQ files within the pipeline using an approach similar to this.
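
As a rough sketch of the design-file approach, the Python example below groups FastQ files by an explicit sample column rather than guessing from file names. The column names `sample` and `fastq_1` and the inline design text are assumptions for this example and may not match the actual design file format added in #459.

```python
import csv
import io
from collections import defaultdict

# Hypothetical design file; the column names are assumptions for this sketch.
DESIGN = """sample,fastq_1
blabla,blabla_L001_R1.fastq.gz
blabla,blabla_L002_R1.fastq.gz
other,other_L001_R1.fastq.gz
"""


def files_per_sample(design_text):
    """Map each sample name to its FastQ files, in design-file order."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(design_text)):
        groups[row["sample"]].append(row["fastq_1"])
    return dict(groups)


def needs_merging(groups):
    """Keep only samples that have more than one FastQ file to concatenate."""
    return {sample: files for sample, files in groups.items() if len(files) > 1}


groups = files_per_sample(DESIGN)
to_merge = needs_merging(groups)
```

Because the sample assignment is stated explicitly, only `blabla` is selected for concatenation and no file name pattern has to be interpreted.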

@drpatelh drpatelh added this to the 1.5 milestone Aug 24, 2020
@drpatelh
Member

Functionality for this has now been added here -> 5b2e4ca
