Automated grouping for paired-end input files #236

Closed
ewels opened this issue Nov 2, 2016 · 15 comments

@ewels
Member

ewels commented Nov 2, 2016

Motivation

One annoyance in our Nextflow pipelines is the method of specifying input files: we use the fromFilePairs Channel factory, e.g. --reads '*_R{1,2}.fastq.gz'

Two points of this usage are annoying:

  • The need to use quotes around the pattern to avoid bash glob expansion
  • The requirement to use a star wildcard character

Essentially the problem is that this isn't using more standard bash-like input pattern norms, which is what users expect.

Alternative Approach

My Cluster Flow workflow tool handles this problem in a slightly more automated manner. Essentially, it sorts the input filenames alphabetically and strips /_R?[1-4]/ from each filename. If the stripped names form pairs with identical names, it returns them as groups; otherwise the files come back as single-end. If more than two filenames share an identical stripped name it raises an error, and if there is a mixture of single-end and paired-end filenames it also raises an error (in practice we found that this was usually due to a bash pattern picking up more files than expected).

This default behaviour can then be overridden with --single or --paired, which return everything as either single input files or pairs of alphabetically sorted files. Again, in practice we find that these are basically never required or used.
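
The grouping rules described above can be sketched roughly as follows. This is a minimal Python illustration written for this thread, not a port of the Cluster Flow Perl code: the function name group_reads and the mode argument are hypothetical, and the sketch additionally requires the read tag to sit directly before the file extension (an assumption not stated above) so that names such as sample_1_R1.fq.gz are not mangled.

```python
import re
from collections import defaultdict

# Read tag such as "_R1" or "_2", based on the /_R?[1-4]/ pattern above.
# ASSUMPTION: the lookahead (tag must directly precede the extension) is an
# addition made for this sketch only.
READ_TAG = re.compile(r"_R?[1-4](?=\.)")


def group_reads(filenames, mode="auto"):
    """Group filenames into single-end or paired-end sets.

    mode is "auto" (infer from names), "single" or "paired" (forced).
    Returns (kind, groups), where kind is "single" or "paired".
    """
    files = sorted(filenames)
    if not files:
        raise ValueError("no input files given")

    if mode == "single":
        return "single", [[f] for f in files]

    if mode == "paired":
        if len(files) % 2:
            raise ValueError("paired mode needs an even number of files")
        return "paired", [files[i:i + 2] for i in range(0, len(files), 2)]

    # Auto mode: files whose names are identical once the read tag is
    # stripped belong to the same group.
    groups = defaultdict(list)
    for f in files:
        groups[READ_TAG.sub("", f, count=1)].append(f)

    sizes = {len(g) for g in groups.values()}
    if max(sizes) > 2:
        raise ValueError("more than two files share the same stripped name")
    if sizes == {1, 2}:
        raise ValueError("mixture of single-end and paired-end filenames")

    kind = "paired" if sizes == {2} else "single"
    return kind, list(groups.values())
```

Calling group_reads on the expanded file list would return the detected run type along with the groups, which is the information that gets logged so the user can cancel early if the guess is wrong.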

You can see the code that powers this here: https://github.com/ewels/clusterflow/blob/master/source/CF/Helpers.pm#L343-L392

Ups and Downs

The risk of using stronger assumptions about filename patterns is of course that they can be wrong sometimes. I try to mitigate this by prominently logging how many input files were found, how many groups there are, and whether it's being treated as a paired-end or single-end run. This helps the user spot any errors and cancel the pipeline run before it really gets started.

The logic seems to work smoothly 99% of the time in our hands and means that you can be a lot lazier when launching the pipeline - just using regular bash to supply a raw list of input filenames.

So - would something like this be possible in Nextflow? It could of course be a new function / Channel factory in addition to the current fromFilePairs so Nextflow users could continue to use that if they want.

@pditommaso
Member

pditommaso commented Nov 3, 2016

What would be the input parameter by using this approach?

@ewels
Member Author

ewels commented Nov 3, 2016

I was thinking just a regular params approach, eg. --reads *.fastq going to params.reads. Will that not work?

@pditommaso
Member

The main problem is that as soon as you use a wildcard character, Bash expands it to a list of files, so --reads *.fastq becomes --reads X.fastq Y.fastq Z.fastq, but currently NF is not able to handle multiple values for the same command line option. That's an improvement that I would like to add, but it requires some tricky changes.

@ewels
Member Author

ewels commented Nov 3, 2016

Ah, I had a nasty feeling you were going to say that (in fact I even started trying to test it myself but got distracted with other tasks and thought it better to just reply).

Yes, CF cheats its way around this by taking any remaining command line options after flags as positional, so handles any number of expanded filenames.

Ok, in that case I guess this idea is on ice until if / when this is added to core NF.. Thanks!

@ewels
Member Author

ewels commented Nov 3, 2016

...I assume that Nextflow isn't able to parse positional command line arguments? Or does it have any other way of supplying this information on the command line?

@pditommaso
Member

No, it can, and that's the problem. I mean, if you enter:

nextflow run foo --reads X.fastq Y.fastq Z.fastq

You will get params.reads holding the first value, and the implicit args array holding the remaining positional arguments, i.e. Y.fastq Z.fastq.
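
To make the split concrete, here is a simplified Python model of the behaviour described above: each --name option takes at most one value and everything else ends up in a positional list. It is only an illustration under that assumption, not Nextflow's actual argument parser, and split_cli is a hypothetical name.

```python
def split_cli(argv):
    """Split argv into named params and leftover positional args.

    Each --name consumes at most one following value; every other token is
    treated as positional. Simplified model, NOT Nextflow's real parser.
    """
    params, args = {}, []
    i = 0
    while i < len(argv):
        token = argv[i]
        if token.startswith("--"):
            name = token[2:]
            has_value = i + 1 < len(argv) and not argv[i + 1].startswith("--")
            if has_value:
                params[name] = argv[i + 1]
                i += 2
            else:
                params[name] = True  # option given without a value
                i += 1
        else:
            args.append(token)
            i += 1
    return params, args


# What the program receives once the shell has expanded an unquoted
# "--reads *.fastq" into three filenames.
params, args = split_cli(["--reads", "X.fastq", "Y.fastq", "Z.fastq"])
print(params)  # {'reads': 'X.fastq'}
print(args)    # ['Y.fastq', 'Z.fastq']
```

In this model only the first expanded filename is attached to the option, which is why a script reading just the named parameter would silently miss the other files.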

@ewels
Member Author

ewels commented Nov 3, 2016

ok, well that would be fine as well.. I'd be happy with this as a way to run the pipeline

nextflow run pipeline --foo bar *.fastq.gz

Then just take all of the input files from args..?

@pditommaso
Member

I would prefer a more general solution, i.e. one that works for any parameter. Anyhow, if you need this urgently you could implement it in your own script by writing a function that handles the logic you are suggesting.

@ewels
Member Author

ewels commented Nov 4, 2016

Yup, agreed - I guess such a generalised solution will have to wait until a Nextflow option can accept multiple values. In the meantime I'll see if I can write something to do it using args in a test script. Thanks for getting me started anyway - I would have probably struggled for a while not understanding why my params.inputfiles wasn't working!

@pditommaso
Member

👍

@ewels
Member Author

ewels commented Jun 28, 2017

I think it's fine to close this now - the solution we now have working (assume PE unless --singleEnd is specified) seems to work well and is more explicit 👍

@ewels ewels closed this as completed Jun 28, 2017
@mictadlo

mictadlo commented Jul 10, 2017

Hi, how does --singleEnd work together with fromFilePairs?

@ewels
Member Author

ewels commented Jul 10, 2017

Hi @mictadlo,

You can see --singleEnd at work in our pipeline here. It's an approach suggested by @pditommaso - tell fromFilePairs to expect pairs of files unless --singleEnd is specified, in which case expect single files.

It doesn't really help with the awkwardness of requiring quotes when running Nextflow, but it does mean that an error is usually thrown immediately when the quotes are missing. Before, we were using fromFilePairs with a size of -1 (anything) and were frequently running paired-end data in single-end mode by accident.

Code snippet here:

params.reads = "data/*{1,2}.fastq.gz"
params.singleEnd = false
Channel
    .fromFilePairs( params.reads, size: params.singleEnd ? 1 : 2 )
    .ifEmpty { exit 1, "Cannot find any reads matching: ${params.reads}\nNB: Path needs to be enclosed in quotes!\nIf this is single-end data, please specify --singleEnd on the command line." }
    .into { read_files }

I hope this helps!

Phil

@mictadlo

In the meantime I found another solution. Which one do you think will work better?

Thank you in advance.

Michal

@ewels
Member Author

ewels commented Jul 11, 2017

Aha! Yes, that's my colleague Rickard who was in that thread, so it's the same pipeline that I pointed to for the example above. So we've tried both and prefer the above solution 😛

The code in the thread you linked to works fine, but the problem is that it almost works too well. I'll illustrate what I mean with an example. Consider the following paired-end data files:

sample_1_R1.fq.gz
sample_1_R2.fq.gz
sample_2_R1.fq.gz
sample_2_R2.fq.gz

Run with the correct command, it works fine:

nextflow run main.nf --reads "data/*_R{1,2}.fq.gz"
# PE: sample_1
# PE: sample_2

However, it's very easy to make the following mistake:

nextflow run main.nf --reads data/*_R{1,2}.fq.gz
# SE: sample_1_R1
# SE: sample_1_R2
# SE: sample_2_R1
# SE: sample_2_R2

The missing quotes mean that bash expands the filename pattern before it gets to Nextflow, and NF instead gets multiple filenames. The pipeline runs perfectly well, handling each paired-end file as a single-end input. We did this quite a lot of times by accident. Using the above --singleEnd approach, you need the additional flag, but it means that the pipeline immediately fails if it doesn't get the input that it expects. More verbose and, in our experience, better.
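
The fail-fast idea can be modelled outside Nextflow with a few lines of Python. This is only a sketch of the principle (state the expected group size up front and refuse anything else); it does not reproduce how fromFilePairs works internally, and the function name expect_groups and the .fq.gz naming rule are assumptions made for the illustration.

```python
import re
from collections import defaultdict


def expect_groups(files, single_end=False):
    """Group read files by sample and insist on the expected group size.

    Paired-end (two files per sample) is assumed unless single_end is True.
    Raises ValueError as soon as any sample has the wrong number of files.
    """
    expected = 1 if single_end else 2
    groups = defaultdict(list)
    for f in sorted(files):
        # Sample ID = filename minus an optional _R1/_R2 tag and ".fq.gz".
        # ASSUMPTION: hypothetical naming rule used only in this sketch.
        sample = re.sub(r"(_R[12])?\.fq\.gz$", "", f)
        groups[sample].append(f)

    bad = sorted(s for s, fs in groups.items() if len(fs) != expected)
    if not groups or bad:
        raise ValueError(
            f"expected {expected} file(s) per sample; problem samples: {bad}. "
            "Is the pattern quoted? Is this single-end data?"
        )
    return dict(groups)
```

With all four example files the paired-end check passes, whereas a lone sample_1_R1.fq.gz (roughly what is left when the shell has already expanded the pattern) raises an error straight away instead of being processed as single-end data.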

Phil
