Avoid error on unknown headers in input.csv #302

zeehio · 2024-02-05T09:44:58Z

Description of feature

The nf-co.re/rnaseq pipeline accepts and ignores any extra column in input.csv that is not required by the pipeline. This is useful because I can reuse the input.csv or include additional information I want to use in downstream analyses, without having to generate a specific input.csv just for running the pipeline.

This scrnaseq pipeline is much stricter: it raises an error whenever it finds an unknown column.

I would prefer the scrnaseq pipeline to follow the rnaseq behaviour, in line with the robustness principle: "be conservative in what you send, be liberal in what you accept".
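To illustrate the requested behaviour, here is a minimal Python sketch of a samplesheet reader that keeps the columns it knows about and silently ignores the rest. It is not the code of either pipeline: the names KNOWN_COLUMNS, REQUIRED_COLUMNS and read_samplesheet are hypothetical, and the column set is an assumption chosen for the example.

```python
import csv
import io

# Assumption for this example only: the columns the pipeline cares about.
KNOWN_COLUMNS = ("sample", "fastq_1", "fastq_2")
REQUIRED_COLUMNS = ("sample", "fastq_1")


def read_samplesheet(handle):
    """Return one dict per row, keeping known columns and dropping unknown ones."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []

    # Still fail loudly when a column the pipeline needs is absent.
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("Missing required column(s): " + ", ".join(missing))

    rows = []
    for row in reader:
        # Unknown columns (e.g. "strandedness") are simply not copied over.
        rows.append({c: (row.get(c) or "") for c in KNOWN_COLUMNS})
    return rows


# Hypothetical samplesheet with one extra column the reader does not know.
example = io.StringIO(
    '"sample","fastq_1","fastq_2","strandedness"\n'
    '"id1","/path/to/fastq/sample1.fastq.gz","","auto"\n'
)
rows = read_samplesheet(example)
print(rows)
```

The reader stays strict about what it needs (required columns) and liberal about everything else, which is the split the robustness principle suggests.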

Is there a specific reason why the pipeline is not as liberal in accepting unknown columns in the input.csv file?

Thanks!

grst · 2024-03-07T10:58:16Z

Hi,

this issue should be fixed in the development version.
You can give it a try with nextflow run ... -r dev. If it doesn't work, please let me know!

zeehio · 2024-04-02T15:44:51Z

Hi @grst, I have now been able to test the dev pipeline. Thanks for the update. Unfortunately, I am still facing validation issues:

I am using a single-end dataset, which has a fastq_1 but no fastq_2.

The input.csv file is similar to:

"sample","fastq_1","fastq_2","strandedness",...
"id1","/path/to/fastq/sample1.fastq.gz","","auto",...

Please note how the fastq_2 column contains empty values.

I'm getting an error validating the 'input' again:

ERROR ~ ERROR: Validation of 'input' file failed!

 -- Check '.nextflow.log' file for details
The following errors have been detected:

* -- Entry 1: Missing required value: fastq_2
* -- Entry 2: Missing required value: fastq_2

An empty fastq_2 seems valid to me, judging by the code on the master branch. There, if fastq_2 is empty, the single_end flag is set to "1". You can see this below (specifically line 184, the "not fastq_2" branch):

scrnaseq/bin/check_samplesheet.py

Lines 181 to 187 in 90cb6a4

    
    sample_info = []  ## [single_end, fastq_1, fastq_2]
    if sample and fastq_1 and fastq_2:  ## Paired-end short reads
        sample_info = ["0", fastq_1, fastq_2, expected_cells, seq_center, fastq_barcode, sample_type]
    elif sample and fastq_1 and not fastq_2:  ## Single-end short reads
        sample_info = ["1", fastq_1, fastq_2, expected_cells, seq_center, fastq_barcode, sample_type]
    else:
        print_error("Invalid combination of columns provided!", "Line", line)

However, on the dev branch, the input schema used for validation requires fastq_2 to be present and non-empty:

scrnaseq/assets/schema_input.json

Lines 16 to 30 in 1043441

    
    "fastq_1": {
        "type": "string",
        "format": "file-path",
        "exists": true,
        "pattern": "^\\S+\\.f(ast)?q\\.gz$",
        "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
    },
    "fastq_2": {
        "type": "string",
        "format": "file-path",
        "exists": true,
        "pattern": "^\\S+\\.f(ast)?q\\.gz$",
        "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
    },
    "expected_cells": {

I'd like for the scrnaseq pipeline to accept an input file with a fastq_2 column filled with "" (empty strings), since that's what is generated by the nf-core/fetchngs pipeline when downloading datasets.
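As a rough sketch of the check I have in mind, written in plain Python rather than in the schema language: an empty fastq_2 is accepted, and any non-empty value must match the same pattern as in the schema excerpt above. The function name check_fastq_2 is hypothetical, and the "exists": true file check is deliberately left out of this example.

```python
import re

# Pattern copied from the schema excerpt; the JSON string "\\S" is the regex \S.
FASTQ_PATTERN = re.compile(r"^\S+\.f(ast)?q\.gz$")


def check_fastq_2(value):
    """Accept an empty fastq_2 (single-end) or a path that looks like a gzipped FastQ."""
    if value is None or value == "":
        # Empty string, as found in the fastq_2 column of the samplesheet above.
        return True
    return FASTQ_PATTERN.match(value) is not None


print(check_fastq_2(""))
print(check_fastq_2("/path/to/fastq/sample1.fastq.gz"))
print(check_fastq_2("bad name.fastq.gz"))
```

With this rule, the first two calls pass and the third is rejected because the path contains a space.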

Thanks, and sorry for the delayed reply.

zeehio · 2024-04-02T15:47:57Z

For further ideas, it may be worth checking out the rnaseq pipeline:

https://github.com/nf-core/rnaseq/blob/b89fac32650aacc86fcda9ee77e00612a1d77066/assets/schema_input.json#L16-L46

grst · 2024-04-02T16:28:07Z

The check is deliberate. All protocols supported by this pipeline use paired-end data, where R1 contains the UMI/barcode and R2 the actual sequence.

What kind of single-cell data are you dealing with?

zeehio added the "enhancement" (New feature or request) label on Feb 5, 2024

grst mentioned this issue Mar 5, 2024

Important! Template update for nf-core/tools v2.13.1 #309

Merged

grst closed this as completed Mar 13, 2024
