Automated grouping for paired-end input files #236
Comments
What would the input parameter be when using this approach? |
I was thinking just a regular |
The main problem is that whenever you use a wildcard character, bash expands it to a list of files, thus |
Ah, I had a nasty feeling you were going to say that (in fact I even started trying to test it myself but got distracted with other tasks and thought it better to just reply). Yes, CF cheats its way around this by taking any remaining command line options after flags as positional, so handles any number of expanded filenames. Ok, in that case I guess this idea is on ice until if / when this is added to core NF.. Thanks! |
...I assume that Nextflow isn't able to parse positional command line arguments? Or has any other way of supplying this information on the command line? |
No, it does, and that's the problem. I mean if you enter:
You will get the |
ok, well that would be fine as well.. I'd be happy with this as a way to run the pipeline
Then just take all of the input files from |
I would prefer a more general solution, i.e. one that works for any parameter. Anyhow, if you need this urgently you could implement it in your own script by writing a function that handles the logic you are suggesting. |
Yup, agreed - I guess such a generalised solution will have to wait until a Nextflow parameter can accept multiple values. In the meantime I'll see if I can write something to do it using |
👍 |
I think it's fine to close this now - the solution we now have working (assume PE unless |
Hi, how does |
Hi @mictadlo, You can see It doesn't really help the awkwardness of requiring quotes when running Nextflow, but it does mean that an error is usually thrown immediately when it happens. Before, we were using Code snippet here:
params.reads = "data/*{1,2}.fastq.gz"
params.singleEnd = false
Channel
    .fromFilePairs( params.reads, size: params.singleEnd ? 1 : 2 )
    .ifEmpty { exit 1, "Cannot find any reads matching: ${params.reads}\nNB: Path needs to be enclosed in quotes!\nIf this is single-end data, please specify --singleEnd on the command line." }
    .into { read_files }
I hope this helps! Phil |
In the meantime I found another solution. Which one do you think will work better? Thank you in advance. Michal |
Aha! Yes that's my colleague Rickard who was in that thread, so it's the same pipeline that I pointed to for the example above. So we've tried both and prefer the above solution 😛 The code in the thread you linked to works fine, but the problem is that it almost works too well. I'll illustrate what I mean with an example. Consider the following paired-end data files:
Run with the correct command, it works fine:
nextflow run main.nf --reads "data/*_R{1,2}.fq.gz"
# PE: sample_1
# PE: sample_2
However, it's very easy to make the following mistake:
nextflow run main.nf --reads data/*_R{1,2}.fq.gz
# SE: sample_1_R1
# SE: sample_1_R2
# SE: sample_2_R1
# SE: sample_2_R2
The missing quotes mean that bash expands the filename pattern before it gets to Nextflow, and NF instead gets multiple filenames. The pipeline runs perfectly well, handling each paired-end file as a single-end input. We did this quite a lot of times by accident. Using the above Phil |
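To make the failure mode concrete, here is a rough Python sketch of what the shell does to the unquoted pattern before Nextflow ever starts. The helper names (`expand_braces`, `shell_expand`) and the file list are assumptions for illustration only: the file names are inferred from the sample names in the output above, and the sketch handles a single brace group plus `fnmatch`-style wildcards, not full bash semantics.

```python
import fnmatch


def expand_braces(pattern):
    """Expand one {a,b} group, roughly as bash brace expansion would."""
    start = pattern.find("{")
    end = pattern.find("}", start)
    if start == -1 or end == -1:
        return [pattern]
    head, tail = pattern[:start], pattern[end + 1:]
    return [head + alt + tail for alt in pattern[start + 1:end].split(",")]


def shell_expand(pattern, existing_files):
    """Approximate the argument list produced from an unquoted pattern."""
    args = []
    for sub_pattern in expand_braces(pattern):
        hits = sorted(fnmatch.filter(existing_files, sub_pattern))
        # By default bash passes an unmatched pattern through unchanged.
        args.extend(hits if hits else [sub_pattern])
    return args


# Hypothetical directory contents, inferred from the output shown above.
files = [
    "data/sample_1_R1.fq.gz",
    "data/sample_1_R2.fq.gz",
    "data/sample_2_R1.fq.gz",
    "data/sample_2_R2.fq.gz",
]

quoted = ["data/*_R{1,2}.fq.gz"]                       # one argument, pattern intact
unquoted = shell_expand("data/*_R{1,2}.fq.gz", files)  # four separate arguments

print(len(quoted), len(unquoted))
```

With quotes the program receives a single pattern string it can hand to a pair-aware matcher; without them it receives four plain file names and no longer knows that a pattern was ever involved.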
Motivation
One pain point with our Nextflow pipelines is the method of specifying input files - we use the fromFilePairs Channel factory, e.g. --reads '*_R{1,2}.fastq.gz'
Two points of this usage are annoying:
Essentially the problem is that this isn't using more standard bash-like input pattern norms, which is what users expect.
Alternative Approach
My Cluster Flow workflow tool handles this problem in a slightly more automated manner. Essentially it lines up the input filenames alphabetically and strips out /_R?[1-4]/ from the filename. If there are resulting pairs with identical names it returns them in a group, otherwise things come back as single end. If there are more than two filenames with identical names it raises an error, and if there are a mixture of single-end and paired-end filenames it raises an error (in practice we found that this was usually due to a bash pattern picking up more files than expected).
This default behaviour can then be overridden by using --single or --paired, which returns everything as either a single input file or pairs of alphabetically sorted files. Again, in practice we find that these are basically never required or used.
You can see the code that powers this here: https://github.com/ewels/clusterflow/blob/master/source/CF/Helpers.pm#L343-L392
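The grouping logic described above could be sketched roughly as follows in Python. This is not the Cluster Flow code (that is the Perl linked above), just an illustration under stated assumptions: the function names and the `force` argument are hypothetical stand-ins for the `--single` / `--paired` overrides, and stripping only the last match of the pattern is my assumption so that names such as `sample_1_R1` keep their sample number.

```python
import re
from collections import defaultdict

# Read-number pattern quoted in the issue text.
READ_TAG = re.compile(r"_R?[1-4]")


def strip_read_tag(name):
    """Remove the last read-number tag from a filename (assumption: last match only)."""
    matches = list(READ_TAG.finditer(name))
    if not matches:
        return name
    last = matches[-1]
    return name[:last.start()] + name[last.end():]


def group_reads(filenames, force=None):
    """Return a list of groups, each holding one or two filenames.

    force is hypothetical: None auto-detects, "single" and "paired"
    stand in for the --single / --paired overrides.
    """
    files = sorted(filenames)
    if force == "single":
        return [[f] for f in files]
    if force == "paired":
        if len(files) % 2:
            raise ValueError("odd number of files, cannot pair them")
        return [files[i:i + 2] for i in range(0, len(files), 2)]

    # Auto-detect: files whose names are identical after stripping form a group.
    groups = defaultdict(list)
    for f in files:
        groups[strip_read_tag(f)].append(f)

    sizes = {len(g) for g in groups.values()}
    if any(s > 2 for s in sizes):
        raise ValueError("more than two files share the same stripped name")
    if sizes == {1, 2}:
        raise ValueError("mixture of single-end and paired-end filenames")
    return [groups[key] for key in sorted(groups)]
```

Calling `group_reads` on the raw list of filenames that bash hands over returns pairs when every stripped name occurs twice and singletons when every name is unique; anything else raises an error before the run starts.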
Ups and Downs
The risk of using stronger assumptions about filename patterns is of course that they can be wrong sometimes. I try to mitigate this by clearly logging how many input files were found, how many groups were formed, and whether the run is being treated as paired-end or single-end. This helps the user spot any errors and cancel the pipeline run before it really gets started.
The logic seems to work smoothly 99% of the time in our hands and means that you can be a lot lazier when launching the pipeline - just using regular bash to supply a raw list of input filenames.
So - would something like this be possible in Nextflow? It could of course be a new function / Channel factory in addition to the current fromFilePairs so Nextflow users could continue to use that if they want.