Automated grouping for paired-end input files #236

ewels · 2016-11-02T22:18:04Z

Motivation

One annoyance with our Nextflow pipelines is the method of specifying input files: we use the fromFilePairs Channel factory, e.g. --reads '*_R{1,2}.fastq.gz'

Two points of this usage are annoying:

The need to use quotes around the pattern to avoid bash glob expansion
The requirement to use a star wildcard character

Essentially, the problem is that this doesn't follow the standard bash-style input conventions that users expect.

Alternative Approach

My Cluster Flow workflow tool handles this problem in a slightly more automated manner. Essentially, it sorts the input filenames alphabetically and strips /_R?[1-4]/ from each filename. If the resulting names form pairs with identical names, it returns them as groups; otherwise the files come back as single end. If more than two filenames end up with identical names it raises an error, and if there is a mixture of single-end and paired-end filenames it also raises an error (in practice we found that this was usually due to a bash pattern picking up more files than expected).

This default behaviour can then be overridden by using --single or --paired, which return everything as either single input files or pairs of alphabetically sorted files. Again, in practice we find that these are basically never required or used.

You can see the code that powers this here: https://github.com/ewels/clusterflow/blob/master/source/CF/Helpers.pm#L343-L392
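For readers who don't want to dig through the Perl, the grouping rules described above can be sketched roughly as follows. This is not the Cluster Flow code, just a minimal Python illustration. The names `group_reads` and `strip_read_tag` are made up for this sketch, and one detail is an assumption: the description only gives the pattern `/_R?[1-4]/`, so this sketch removes the last match in each filename, which keeps names like sample_1_R1.fq.gz grouped by sample.

```python
import re
from collections import defaultdict

# Read-number tag taken from the description above, e.g. "_R1" or "_2".
READ_TAG = re.compile(r"_R?[1-4]")


def strip_read_tag(name):
    """Remove the LAST read tag in a filename (assumption, see lead-in)."""
    matches = list(READ_TAG.finditer(name))
    if not matches:
        return name
    last = matches[-1]
    return name[:last.start()] + name[last.end():]


def group_reads(filenames, mode="auto"):
    """Return a list of groups, each a list of one or two filenames.

    mode="single" and mode="paired" mimic the --single / --paired overrides;
    mode="auto" applies the filename-based detection.
    """
    files = sorted(filenames)
    if mode == "single":
        return [[f] for f in files]
    if mode == "paired":
        # Pairs of alphabetically sorted files. An odd file count leaves a
        # final group of one (behaviour not specified in the thread).
        return [files[i:i + 2] for i in range(0, len(files), 2)]

    groups = defaultdict(list)
    for f in files:
        groups[strip_read_tag(f)].append(f)

    sizes = {len(g) for g in groups.values()}
    if max(sizes, default=0) > 2:
        raise ValueError("more than two files share the same stripped name")
    if sizes == {1, 2}:
        raise ValueError("mixture of single-end and paired-end filenames")
    return list(groups.values())
```

Calling `group_reads(["a_R2.fq.gz", "a_R1.fq.gz"])` gives one paired group, while a list with no read tags comes back as one group per file.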

Ups and Downs

The risk of using stronger assumptions about filename patterns is of course that they can be wrong sometimes. I try to mitigate this by clearly logging how many input files were found, how many groups, and whether it's treating it as a paired-end or single-end run. This helps the user spot any errors and cancel the pipeline run before it really gets started.

The logic seems to work smoothly 99% of the time in our hands and means that you can be a lot lazier when launching the pipeline - just using regular bash to supply a raw list of input filenames.

So - would something like this be possible in Nextflow? It could of course be a new function / Channel factory in addition to the current fromFilePairs so Nextflow users could continue to use that if they want.


pditommaso · 2016-11-03T07:59:26Z

What would be the input parameter by using this approach?

ewels · 2016-11-03T13:55:52Z

I was thinking just a regular params approach, eg. --reads *.fastq going to params.reads. Will that not work?

pditommaso · 2016-11-03T14:07:48Z

The main problem is that as long as you use a wildcard character, BASH expands it to a list of files, so --reads *.fastq becomes --reads X.fastq Y.fastq Z.fastq, but currently NF is not able to handle multiple values for the same command line option. That's an improvement that I would like to add, but it requires some tricky changes.

ewels · 2016-11-03T14:14:47Z

Ah, I had a nasty feeling you were going to say that (in fact I even started trying to test it myself but got distracted with other tasks and thought it better to just reply).

Yes, CF cheats its way around this by taking any remaining command line options after flags as positional, so handles any number of expanded filenames.

Ok, in that case I guess this idea is on ice until if / when this is added to core NF.. Thanks!

ewels · 2016-11-03T14:15:12Z

...I assume that Nextflow isn't able to parse positional command line arguments? Or has any other way of supplying this information on the command line?

pditommaso · 2016-11-03T14:20:05Z

No, it can, and that's the problem. I mean, if you enter:

nextflow run foo --reads X.fastq Y.fastq Z.fastq

You will get params.reads holding the first value, and the implicit args array holding the remaining positional arguments, i.e. Y.fastq Z.fastq.
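To make the split concrete, here is a toy Python model of the behaviour just described. It is not Nextflow's actual argument parser. `split_cli` is a hypothetical name, and it assumes every --option consumes exactly one following value, with everything else treated as positional.

```python
def split_cli(argv):
    """Toy model: each --name takes one value; the rest is positional.

    Returns (params, args), loosely mirroring params.* and the args array.
    """
    params, args = {}, []
    i = 0
    while i < len(argv):
        token = argv[i]
        if token.startswith("--"):
            name = token[2:]
            if i + 1 < len(argv):
                # Only the first value after the option is captured.
                params[name] = argv[i + 1]
                i += 2
            else:
                # Assumption for this sketch: a trailing option with no
                # value is stored as True.
                params[name] = True
                i += 1
        else:
            args.append(token)
            i += 1
    return params, args
```

With the shell-expanded command above, `split_cli(["--reads", "X.fastq", "Y.fastq", "Z.fastq"])` returns `({'reads': 'X.fastq'}, ['Y.fastq', 'Z.fastq'])`, which is why a script reading only params.reads sees a single file.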

ewels · 2016-11-03T14:46:23Z

ok, well that would be fine as well.. I'd be happy with this as a way to run the pipeline

nextflow run pipeline --foo bar *.fastq.gz

Then just take all of the input files from args..?

pditommaso · 2016-11-04T16:57:02Z

I would prefer a more general solution ie. that works for any parameter. Anyhow if you need urgently you could implement this in your own script just writing a function handling the logic you are suggesting.

ewels · 2016-11-04T17:02:06Z

Yup, agreed - I guess such a generalised solution will have to wait until a Nextflow parameter can accept multiple values. In the meantime I'll see if I can write something to do it using args in a test script. Thanks for getting me started anyway - I would have probably struggled for a while not understanding why my params.inputfiles wasn't working!

pditommaso · 2016-11-04T17:12:10Z

👍

ewels · 2017-06-28T12:21:00Z

I think it's fine to close this now - the solution we now have working (assume PE unless --singleEnd is specified) seems to work well and is more explicit 👍

mictadlo · 2017-07-10T03:57:31Z

Hi, how does --singleEnd work together with fromFilePairs?

ewels · 2017-07-10T06:39:34Z

Hi @mictadlo,

You can see --singleEnd at work in our pipeline here. It's an approach suggested by @pditommaso - tell fromFilePairs to expect pairs of files unless --singleEnd is specified, in which case expect single files.

It doesn't really help the awkwardness of requiring quotes when running Nextflow, but it does mean that an error is usually thrown immediately when the quotes are forgotten. Before, we were using fromFilePairs with a size of -1 (anything) and were frequently running paired-end data in single-end mode by accident.

Code snippet here:

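// Default input pattern; override with --reads '<pattern>' (quoted)
params.reads = "data/*{1,2}.fastq.gz"
params.singleEnd = false
Channel
    // Expect one file per sample with --singleEnd, otherwise two
    .fromFilePairs( params.reads, size: params.singleEnd ? 1 : 2 )
    // Stop immediately if nothing matched the expected grouping
    .ifEmpty { exit 1, "Cannot find any reads matching: ${params.reads}\nNB: Path needs to be enclosed in quotes!\nIf this is single-end data, please specify --singleEnd on the command line." }
    .into { read_files }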

I hope this helps!

Phil

mictadlo · 2017-07-10T22:13:46Z

In the meantime I found another solution. Which one do you think will work better?

Thank you in advance.

Michal

ewels · 2017-07-11T13:55:32Z

Aha! Yes, that's my colleague Rickard who was in that thread, so it's the same pipeline that I pointed to for the example above. So we've tried both and prefer the above solution 😛

The code in the thread you linked to works fine, but the problem is that it almost works too well. I'll illustrate what I mean with an example. Consider the following paired-end data files:

sample_1_R1.fq.gz
sample_1_R2.fq.gz
sample_2_R1.fq.gz
sample_2_R2.fq.gz

Run with the correct command, it works fine:

nextflow run main.nf --reads "data/*_R{1,2}.fq.gz"
# PE: sample_1
# PE: sample_2

However, it's very easy to make the following mistake:

nextflow run main.nf --reads data/*_R{1,2}.fq.gz
# SE: sample_1_R1
# SE: sample_1_R2
# SE: sample_2_R1
# SE: sample_2_R2

The missing quotes mean that bash expands the filename pattern before it gets to Nextflow, and NF instead gets multiple filenames. The pipeline runs perfectly well, handling each paired-end file as a single-end input. We did this quite a lot of times by accident. Using the above --singleEnd approach, you need the additional flag, but it means that the pipeline immediately fails if it doesn't get the input that it expects. More verbose, but in our experience better.
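The fail-fast idea can be illustrated outside Nextflow with a small Python sketch. It does not describe how fromFilePairs works internally. `check_group_sizes` is a hypothetical helper, and the two dictionaries simply restate the PE and SE groupings from the example above.

```python
def check_group_sizes(groups, single_end=False):
    """Raise unless every sample has the expected number of files.

    Pairs are expected by default; single files only when single_end=True,
    mirroring the --singleEnd flag.
    """
    expected = 1 if single_end else 2
    wrong = {name: files for name, files in groups.items()
             if len(files) != expected}
    if not groups or wrong:
        raise ValueError(
            f"expected {expected} file(s) per sample, got: {wrong or 'no input'}; "
            "is the pattern quoted, or is this single-end data?"
        )
    return groups


# Correct run: two files per sample.
paired = {
    "sample_1": ["sample_1_R1.fq.gz", "sample_1_R2.fq.gz"],
    "sample_2": ["sample_2_R1.fq.gz", "sample_2_R2.fq.gz"],
}

# Mistaken run: every file ended up as its own "sample".
accidental = {
    "sample_1_R1": ["sample_1_R1.fq.gz"],
    "sample_1_R2": ["sample_1_R2.fq.gz"],
    "sample_2_R1": ["sample_2_R1.fq.gz"],
    "sample_2_R2": ["sample_2_R2.fq.gz"],
}

check_group_sizes(paired)  # passes silently
try:
    check_group_sizes(accidental)
except ValueError as err:
    print(f"stopped early: {err}")
```

The accidental grouping is only accepted when single-end mode is requested explicitly with `single_end=True`, which is the trade-off described above: one extra flag in exchange for an immediate error.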

Phil

pditommaso added the enhancement label Nov 4, 2016

ewels closed this as completed Jun 28, 2017

mhoban mentioned this issue Jul 19, 2023

Improve input file handling mhoban/rainbow_bridge#6

Closed
