samplesheet check too stringent for header check #152

askol-lurie · 2023-03-07T21:32:53Z

Description of the bug

I'm starting to use v2.0.0 of nf-core/hic. I used the previous version but always submitted one sample at a time. This time I created a samplesheet and am running into an issue where the pipeline doesn't think the file has a header. It does. The has_header() function of the csv module used in check_samplesheet.py is overly stringent in how it defines headers and seems like it would fail for most samplesheets, as it does for mine.

The following sample sheets will succeed and fail, respectively:

sample,fastq_1,fastq_2
RH41_B6,1,2
SMS_A3,1,2

sample,fastq_1,fastq_2
RH41_B6,1,2
SMS_A3,p1,q2

Command used and terminal output

nextflow run /home/ass6094/bin/nextflow_modules/hic_v2.0.0/main.nf \
  --digestion 'qiagen' \
  --input /projects/b1103/HIC_Macquarrie/hic_round2/samplesheet.csv \
  --outdir $outdir \
  --fasta /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa \
  --bwt2_index /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/ \
  --split_fastq --fastq_chunks_size 10000000 \
  --max_memory 64.GB \
  --bin_size 20000,40000,150000,500000,1000000 \
  --bwt2_opts_end2end '--very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14' \
  --bwt2_opts_trimmed '--very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14' \
  -profile singularity,slurmshort \
  -with-report hic_report.html -with-trace \
  -with-timeline hic_timeline.html \
  -with-dag hic_dag.png -bg \
  -w $scratch

Output:

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/hic v2.0.0
------------------------------------------------------
Core Nextflow options
  runName                      : magical_hamilton
  containerEngine              : singularity
  launchDir                    : /projects/b1103/HIC_Macquarrie/hic_round2
  workDir                      : /scratch/ass6094/hic/nextflow
  projectDir                   : /home/ass6094/bin/nextflow_modules/hic_v2.0.0
  userName                     : ass6094
  profile                      : singularity,slurmshort
  configFiles                  : /home/ass6094/bin/nextflow_modules/hic_v2.0.0/nextflow.config

Input/output options
  input                        : /projects/b1103/HIC_Macquarrie/hic_round2/samplesheet.csv
  outdir                       : /projects/b1103/HIC_Macquarrie/hic_round2/NextflowResults/

Reference genome options
  fasta                        : /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
  bwt2_index                   : /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/

Digestion Hi-C
  digestion                    : qiagen

DNAse Hi-C
  min_cis_dist                 : 0

Alignments
  split_fastq                  : true
  fastq_chunks_size            : 10000000
  bwt2_opts_end2end            : --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14
  bwt2_opts_trimmed            : --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14

Valid Pairs Detection
  max_insert_size              : 0
  min_insert_size              : 0
  max_restriction_fragment_size: 0
  min_restriction_fragment_size: 0

Contact maps
  bin_size                     : 20000,40000,150000,500000,1000000
  ice_filter_high_count_perc   : 0
  res_zoomify                  : null

Downstream Analysis
  res_dist_decay               : 250000
  tads_caller                  : insulation
  res_tads                     : 40000

Max job request options
  max_cpus                     : 14
  max_memory                   : 64.GB
  max_time                     : 10d

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/hic for your analysis please cite:

* The pipeline
  https://doi.org/10.5281/zenodo.2669513

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/hic/blob/master/CITATIONS.md
------------------------------------------------------
WARN: A process with name 'BOWTIE2_ALIGN_TRIMMED' is defined more than once in module script: /home/ass6094/bin/nextflow_modules/hic_v2.0.0/./workflows/../subworkflows/local/./hicpro_mapping.nf -- Make sure to not define the same function as process
[65/1f640c] Submitted process > NFCORE_HIC:HIC:PREPARE_GENOME:GET_RESTRICTION_FRAGMENTS (^GATC)
[11/c26455] Submitted process > NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)
[10/5e7883] Submitted process > NFCORE_HIC:HIC:PREPARE_GENOME:CUSTOM_GETCHROMSIZES (genome.fa)
Error executing process > 'NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)'

Caused by:
  Process `NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)` terminated with an error exit status (1)

Command executed:

  check_samplesheet.py \
      samplesheet.csv \
      samplesheet.valid.csv
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  

Command error:
  WARNING: While bind mounting '/projects/b1103/HIC_Macquarrie/hic_round2:/projects/b1103/HIC_Macquarrie/hic_round2': destination is already in the mount point list
  WARNING: While bind mounting '/home/ass6094/bin/nextflow_modules/hic_v2.0.0/bin:/home/ass6094/bin/nextflow_modules/hic_v2.0.0/bin': destination is already in the mount point list
  WARNING: While bind mounting '/scratch/ass6094/hic/nextflow/11/c26455104fe4b102d7953120fa3a65:/scratch/ass6094/hic/nextflow/11/c26455104fe4b102d7953120fa3a65': destination is already in the mount point list
  WARNING: Skipping mount /hpc/software/singularity/3.8.1/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  [CRITICAL] The given sample sheet does not appear to contain a header.

Relevant files

No response

System information

nextflow version 22.10.5.5840
Hardware: Slurm HPC
Executor: slurm
Container engine: Singularity
OS: Redhat Linux 7.9
Version of nf-core/hic 2.0.0

nservant · 2023-03-23T14:50:16Z

So for sure, this is linked to the csv.Sniffer.has_header function, which returns False. No idea why.

nservant · 2023-03-23T14:51:55Z

I checked whether in both cases, the csv package is able to detect the delimiter, and yes. Both files report ',' as the delimiter ...

Line 60

d = sniffer.sniff(peek)
print(repr(d.delimiter))
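A self-contained version of that check, as a sketch: the two sheets are the ones from the original report, inlined as strings. The names `SHEET_A`, `SHEET_B` and `detect_delimiter` are illustrative and do not come from check_samplesheet.py.

```python
import csv

# The two sample sheets from the original report, inlined for illustration.
SHEET_A = "sample,fastq_1,fastq_2\nRH41_B6,1,2\nSMS_A3,1,2\n"
SHEET_B = "sample,fastq_1,fastq_2\nRH41_B6,1,2\nSMS_A3,p1,q2\n"


def detect_delimiter(text):
    """Return the delimiter that csv.Sniffer guesses for the given CSV text."""
    return csv.Sniffer().sniff(text).delimiter


if __name__ == "__main__":
    for name, sheet in (("A", SHEET_A), ("B", SHEET_B)):
        print(name, repr(detect_delimiter(sheet)))
```

Since the delimiter is found in both cases, the problem has to be in the header heuristic rather than in the dialect detection.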

nservant · 2023-03-23T14:59:33Z

So I think I have the solution for the provided example!
has_header returns False because the two rows don't have the same column types!

In

RH41_B6,1,2
SMS_A3,p1,q2

the 1 and 2 are seen as integers, while p1 and q2 are seen as strings.

python/cpython#87791

nservant · 2023-03-23T15:22:24Z

To continue on that, and still based on the thread in python/cpython#87791:
it seems that the has_header function automatically detects the type of a column based on its content (numbers/letters?).
When two rows have a different column typing pattern, has_header returns False.
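The behaviour described here can be sketched as a simplified re-implementation. This is written from a reading of CPython's `csv.Sniffer.has_header`, not copied from it, and the function names are hypothetical. The assumption is that a cell is either numeric (parseable by `complex`) or else "typed" by its string length, that a column whose kind changes between rows is dropped, and that the first row is then compared against the columns that remain.

```python
def cell_kind(value):
    """Classify one cell: 'numeric' if complex() parses it, else its string length."""
    try:
        complex(value)
        return "numeric"
    except ValueError:
        return len(value)


def consistent_columns(rows):
    """Return {column index: kind} for columns whose kind is identical in every row."""
    kinds = {}
    dropped = set()
    for row in rows:
        for col, value in enumerate(row):
            if col in dropped:
                continue
            kind = cell_kind(value)
            if col not in kinds:
                kinds[col] = kind
            elif kinds[col] != kind:
                # Inconsistent column: it no longer takes part in the header vote.
                del kinds[col]
                dropped.add(col)
    return kinds


def looks_like_header(header, rows):
    """Vote on whether the first row differs from the surviving column kinds."""
    votes = 0
    for col, kind in consistent_columns(rows).items():
        if kind == "numeric":
            try:
                complex(header[col])
                votes -= 1  # header cell is numeric too, so it looks like data
            except ValueError:
                votes += 1
        else:
            votes += 1 if len(header[col]) != kind else -1
    return votes > 0


if __name__ == "__main__":
    header = ["sample", "fastq_1", "fastq_2"]
    rows = [["RH41_B6", "1", "2"], ["SMS_A3", "p1", "q2"]]
    print(consistent_columns(rows), looks_like_header(header, rows))
```

Under these assumptions, the `RH41_B6,1,2` / `SMS_A3,p1,q2` rows leave no consistent column at all, so there is nothing to vote with and the result is False.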

nservant · 2023-03-23T15:23:23Z

sample,fastq_1,fastq_2
101-male-brain,/data/file1_R1.fastq.gz,/data/file1_R2.fastq.gz
12-female-liver,/data/013649718184/file2_R1.fastq.gz,/data/013649718184/file2_R2.fastq.gz

is detected as having no header and crashes,
whereas

sample,fastq_1,fastq_2
101-male-brain,/data/file1_R1.fastq.gz,/data/file1_R2.fastq.gz
120-male-liver,/data/013649718184/file2_R1.fastq.gz,/data/013649718184/file2_R2.fastq.gz

works! That's crazy :)
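The two sheets can be compared directly with the standard library, as a minimal sketch. The sheets are inlined as strings and the constant names are illustrative. The only difference is the second sample name: `12-female-liver` has 15 characters, while `101-male-brain` and `120-male-liver` both have 14, which is likely what tips the heuristic.

```python
import csv

HEADER = "sample,fastq_1,fastq_2\n"
FIRST_ROW = "101-male-brain,/data/file1_R1.fastq.gz,/data/file1_R2.fastq.gz\n"
PATHS = "/data/013649718184/file2_R1.fastq.gz,/data/013649718184/file2_R2.fastq.gz\n"

# Only the sample name in the second data row differs between the two sheets.
SHEET_REJECTED = HEADER + FIRST_ROW + "12-female-liver," + PATHS
SHEET_ACCEPTED = HEADER + FIRST_ROW + "120-male-liver," + PATHS


def header_detected(text):
    """Return what csv.Sniffer.has_header reports for the given CSV text."""
    return csv.Sniffer().has_header(text)


if __name__ == "__main__":
    print("12-female-liver:", header_detected(SHEET_REJECTED))
    print("120-male-liver:", header_detected(SHEET_ACCEPTED))
```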

nservant · 2023-03-23T15:45:38Z

will be fixed in the next version

nf-core/tools#2194
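This thread does not say how the template fix is implemented. Independently of that, one way to avoid the heuristic is to check the first row against the expected column names. A minimal sketch, assuming the required columns are `sample`, `fastq_1` and `fastq_2` as in the sheets above; `REQUIRED_COLUMNS` and `has_expected_header` are hypothetical names, not part of the pipeline.

```python
import csv
import io

# Assumed from the sample sheets shown in this thread.
REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def has_expected_header(text, required=REQUIRED_COLUMNS, delimiter=","):
    """Return True if the first row of `text` names every required column."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    first_row = next(reader, [])
    names = {cell.strip() for cell in first_row}
    return set(required).issubset(names)


if __name__ == "__main__":
    sheet = (
        "sample,fastq_1,fastq_2\n"
        "101-male-brain,/data/file1_R1.fastq.gz,/data/file1_R2.fastq.gz\n"
        "12-female-liver,/data/013649718184/file2_R1.fastq.gz,"
        "/data/013649718184/file2_R2.fastq.gz\n"
    )
    print(has_expected_header(sheet))
```

An explicit check like this does not depend on the data rows, so sample names of different lengths cannot change the outcome.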

maxulysse · 2023-03-23T17:09:15Z

Just because this 12-female-liver -> 120-male-liver in the sample column?

nservant · 2023-04-26T13:31:33Z

yes. But this will be fixed in the next nf-core template

askol-lurie added the bug Something isn't working label Mar 7, 2023

askol-lurie changed the title ~~samplesheet check too too stringent for header check~~ samplesheet check too stringent for header check Mar 9, 2023

cnluzon mentioned this issue May 9, 2023

Certain sample names produce a missing header error on INPUT_CHECK:SAMPLESHEET_CHECK step #163

Closed

AlcaArctica mentioned this issue Oct 11, 2023

samplesheet error #179

Closed

nservant added this to the version-2.2.0 milestone Jan 26, 2024
