Avoid error on unknown headers in input.csv #302

Closed
zeehio opened this issue Feb 5, 2024 · 4 comments
Labels
enhancement New feature or request

Comments


zeehio commented Feb 5, 2024

Description of feature

The nf-co.re/rnaseq pipeline accepts and ignores any extra columns in input.csv that it does not require. This is useful because I can reuse the same input.csv, or include additional information I want to use in downstream analyses, without having to generate a separate input.csv just for running the pipeline.

This scrnaseq pipeline is much stricter: it raises an error when any unknown column is found.

I would prefer the scrnaseq pipeline to follow the rnaseq behaviour, in line with the robustness principle that one should "be conservative in what you send, be liberal in what you accept".
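
As a rough illustration of the lenient behaviour requested here, a samplesheet reader can check only for the columns it needs and drop the rest. This is a minimal Python sketch, not code from either pipeline; the `REQUIRED_COLUMNS` set, the `read_samplesheet` helper and the `tissue` column are hypothetical names chosen for the example.

```python
import csv
import io

# Hypothetical set of columns the pipeline needs; not taken from either pipeline.
REQUIRED_COLUMNS = {"sample", "fastq_1", "fastq_2"}


def read_samplesheet(handle):
    """Return rows restricted to the required columns, ignoring any extras."""
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - header
    if missing:
        # Only a missing required column is an error; unknown columns are not.
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return [{name: row[name] for name in sorted(REQUIRED_COLUMNS)} for row in reader]


# Example samplesheet with an extra "tissue" column kept for downstream use.
example = io.StringIO(
    "sample,fastq_1,fastq_2,tissue\n"
    "id1,s1_R1.fastq.gz,s1_R2.fastq.gz,liver\n"
)
rows = read_samplesheet(example)
```

With this approach the extra column never reaches the pipeline logic, while a samplesheet that lacks a required column still fails early.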

Is there any specific reason why scrnaseq is not as liberal in accepting unknown columns in the input.csv file?

Thanks!

grst (Member) commented Mar 7, 2024

Hi,

This issue should be fixed in the development version.
You can give it a try with nextflow run ... -r dev. If it doesn't work, please let me know!

grst closed this as completed Mar 13, 2024

zeehio (Author) commented Apr 2, 2024

Hi @grst, I have now been able to test the dev pipeline. Thanks for the update. Unfortunately, I am still facing validation issues.

I am using a single-end dataset, which has a fastq_1 but no fastq_2.

The input.csv file is similar to:

"sample","fastq_1","fastq_2","strandedness",...
"id1","/path/to/fastq/sample1.fastq.gz","","auto",...

Please note how the fastq_2 column contains empty values.

I'm getting an error validating the 'input' again:

ERROR ~ ERROR: Validation of 'input' file failed!

 -- Check '.nextflow.log' file for details
The following errors have been detected:

* -- Entry 1: Missing required value: fastq_2
* -- Entry 2: Missing required value: fastq_2

Having an empty fastq_2 seems correct to me, judging by the code on the master branch. There, if fastq_2 is empty, the single_end variable is set to "1". You can see this below (specifically line 184, the elif branch with not fastq_2):

sample_info = []  ## [single_end, fastq_1, fastq_2]
if sample and fastq_1 and fastq_2:  ## Paired-end short reads
    sample_info = ["0", fastq_1, fastq_2, expected_cells, seq_center, fastq_barcode, sample_type]
elif sample and fastq_1 and not fastq_2:  ## Single-end short reads
    sample_info = ["1", fastq_1, fastq_2, expected_cells, seq_center, fastq_barcode, sample_type]
else:
    print_error("Invalid combination of columns provided!", "Line", line)
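
The branch logic quoted above can be exercised on its own. The following is a simplified sketch, assuming each row has already been parsed into a dict; `classify_reads` is a hypothetical helper, and the extra fields from the original (`expected_cells`, `seq_center` and so on) are left out for brevity.

```python
import csv
import io


def classify_reads(row):
    """Return "paired" or "single", or raise on an invalid combination."""
    sample, fastq_1, fastq_2 = row["sample"], row["fastq_1"], row["fastq_2"]
    if sample and fastq_1 and fastq_2:
        return "paired"
    if sample and fastq_1 and not fastq_2:
        return "single"
    raise ValueError("Invalid combination of columns provided!")


# A quoted empty field ("") is read by the csv module as an empty string,
# which is falsy, so the row falls into the single-end branch.
example = io.StringIO(
    '"sample","fastq_1","fastq_2"\n'
    '"id1","/path/to/fastq/sample1.fastq.gz",""\n'
)
row = next(csv.DictReader(example))
kind = classify_reads(row)
```

Here `kind` ends up as `"single"`, which mirrors the `"1"` flag set by the original code for an empty fastq_2.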

However, on the dev branch, the input schema used for validation requires that fastq_2 exist and not be empty:

"fastq_1": {
    "type": "string",
    "format": "file-path",
    "exists": true,
    "pattern": "^\\S+\\.f(ast)?q\\.gz$",
    "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
},
"fastq_2": {
    "type": "string",
    "format": "file-path",
    "exists": true,
    "pattern": "^\\S+\\.f(ast)?q\\.gz$",
    "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
},
"expected_cells": {

I'd like the scrnaseq pipeline to accept an input file with a fastq_2 column filled with "" (empty strings), since that's what is generated by the nf-core/fetchngs pipeline when downloading datasets.
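
One way to express "either empty or a valid FastQ path" is to test for the empty string before applying the pattern from the schema. The sketch below does this in Python purely to illustrate the rule; it is not how the pipeline's validation works, and `fastq_2_is_acceptable` is a hypothetical name.

```python
import re

# Same regular expression as the "pattern" entry in the schema above,
# written without the extra JSON escaping.
FASTQ_PATTERN = re.compile(r"^\S+\.f(ast)?q\.gz$")


def fastq_2_is_acceptable(value):
    """Accept an empty string (single-end) or a path matching the FastQ pattern."""
    if value == "":
        return True
    return FASTQ_PATTERN.fullmatch(value) is not None
```

Under this rule an empty fastq_2 passes, while a path containing spaces or lacking the '.fq.gz' or '.fastq.gz' extension is still rejected.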

Thanks, and sorry for the delayed reply.

zeehio (Author) commented Apr 2, 2024

For further ideas, it may be worth checking out the input schema of the rnaseq pipeline:

https://github.com/nf-core/rnaseq/blob/b89fac32650aacc86fcda9ee77e00612a1d77066/assets/schema_input.json#L16-L46

grst (Member) commented Apr 2, 2024

The check is done on purpose. All protocols supported by this pipeline use paired-end data, where R1 contains the UMI/barcode and R2 the actual sequence.

What kind of single-cell data are you dealing with?
