
Pipeline fails due to trimming-related removal of all reads from a sample #825

Closed
dmalzl opened this issue May 13, 2022 · 8 comments

@dmalzl

dmalzl commented May 13, 2022

Description of the bug

I am using the rnaseq pipeline to process a large number of samples downloaded from the SRA. Everything ran as expected until the pipeline hit an error saying that Salmon could not detect any reads aligned to the reference transcriptome. On closer inspection of the sample that caused the error, I found that its reads are only 17 bp long. Adapter and quality trimming with Trim Galore imposes a hard length cutoff on trimmed reads, which defaults to 20 bp, thereby effectively removing all reads from the sample before alignment even starts. Consequently, STAR has nothing to align and writes an empty BAM file, which is eventually passed to Salmon and causes the pipeline to fail.

Currently there is no safeguard against this other than removing such samples before processing or skipping trimming altogether. Since I do not want to rerun the pipeline from the start (most of my samples are already aligned), is there another route I can take to make the pipeline finish? I have thought about changing the Salmon processes a bit, but I suspect that would just push the problem to the next stages of the pipeline. I would therefore suggest catching this edge case in one way or another after trimming, to avoid failures in situations like this.

Command used and terminal output

nextflow run nf-core/rnaseq \
        -profile cbe \
        -w /scratch/daniel.malzl/work \
        -qs 500 \
        --input samplesheet_chunkaa.csv \
        --outdir aligned \
        --fasta ref/GRCh38.primary_assembly.genome.fa \
        --gtf ref/gencode.v40.primary_assembly.annotation.gtf \
        --gencode \
        --rseqc_modules 'bam_stat,infer_experiment,inner_distance,read_distribution' \
        --skip_deseq2_qc \
        --skip_stringtie \
        --skip_bigwig \
        -resume

Relevant files

Any FASTQ file with reads shorter than the trimming length cutoff (e.g. SRR5277994)

System information

No response

@dmalzl dmalzl added the bug Something isn't working label May 13, 2022
@drpatelh
Member

Hi @dmalzl !

You could try using this in a config that you pass via -c custom.config:

process.errorStrategy = 'ignore'

That should ignore any failing processes when you -resume, but you will need to be careful that processes are failing only due to the error you expect and not others.
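If the failing tasks all exit with a known status code, a dynamic errorStrategy could narrow this down. A minimal sketch; the exit code checked here is an assumption about how your failing tasks exit, not something the pipeline guarantees:

// custom.config — sketch: ignore only failures with an expected exit code
// and terminate on anything else (exit code 1 is an assumption here)
process.errorStrategy = { task.exitStatus == 1 ? 'ignore' : 'terminate' }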

Is there a way we can detect "empty" files like this? So the files still have content but the sequences are zero length. Are you able to provide an example here for future reference?

@dmalzl
Author

dmalzl commented May 13, 2022

Hmm, I think this is worth trying, but then I would have to check all failures manually at the end, which might be a pain. Thanks for the suggestion anyway.

For me, the FASTQ file Trim Galore produced is simply empty, so one obvious way is to check whether the produced file is empty. Another would be to use the trimming report, which has the following section:

RUN STATISTICS FOR INPUT FILE: SRX2581878.fastq.gz
=============================================
25414891 sequences processed in total
Sequences removed because they became shorter than the length cutoff of 20 bp:  25414891 (100.0%)

and could easily be parsed to see what percentage of the reads were thrown out, then compare that to a threshold (see the sketch below).
A third, but maybe also futile, way would be to check after alignment whether the BAM file is empty.
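For illustration, a hypothetical Groovy sketch of that parsing idea (the regex and the 100% threshold are assumptions based on the report excerpt above, not pipeline code):

// Hypothetical sketch: extract the percentage of reads removed from a
// Trim Galore report and flag the sample if it crosses a threshold
def reportText = new File('SRX2581878.fastq.gz_trimming_report.txt').text
def m = reportText =~ /length cutoff of \d+ bp:\s+(\d+)\s+\(([\d.]+)%\)/
def pctRemoved = m.find() ? m.group(2).toDouble() : 0.0
if (pctRemoved >= 100.0) {
    println 'WARNING: trimming removed all reads; sample should be skipped'
}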

@drpatelh
Member

Can you do ls -la in the work directory for the failing process? I want to make sure that the file is 0 bytes. If that's the case, then this shouldn't be a difficult fix. We could use some logic like here.

@dmalzl
Author

dmalzl commented May 13, 2022

This is the output I get from ls -la in the work directory the trimming process was writing to:

drwxrwxrwx.  2 daniel.malzl pavri.grp   4096 May 12 11:30 .
drwxrwxrwx. 80 daniel.malzl pavri.grp  12288 May 13 11:53 ..
-rwxrwxrwx.  1 daniel.malzl pavri.grp      0 May 12 11:29 .command.begin
-rwxrwxrwx.  1 daniel.malzl pavri.grp   3773 May 12 11:30 .command.err
-rwxrwxrwx.  1 daniel.malzl pavri.grp   4231 May 12 11:30 .command.log
-rwxrwxrwx.  1 daniel.malzl pavri.grp     56 May 12 11:30 .command.out
-rwxrwxrwx.  1 daniel.malzl pavri.grp  11073 May 12 10:05 .command.run
-rwxrwxrwx.  1 daniel.malzl pavri.grp    440 May 12 10:05 .command.sh
-rwxrwxrwx.  1 daniel.malzl pavri.grp    241 May 12 11:30 .command.trace
-rwxrwxrwx.  1 daniel.malzl pavri.grp      1 May 12 11:30 .exitcode
lrwxrwxrwx.  1 daniel.malzl pavri.grp     30 May 12 11:29 SRX2581878.fastq.gz -> SRX2581878_SRR5277994.fastq.gz
-rwxrwxrwx.  1 daniel.malzl pavri.grp   2324 May 12 11:29 SRX2581878.fastq.gz_trimming_report.txt
lrwxrwxrwx.  1 daniel.malzl pavri.grp     77 May 12 11:29 SRX2581878_SRR5277994.fastq.gz -> /scratch-cbe/users/daniel.malzl/het/data/fastq/SRX2581878_SRR5277994.fastq.gz
-rwxrwxrwx.  1 daniel.malzl pavri.grp 192265 May 12 11:29 SRX2581878_trimmed_fastqc.html
-rwxrwxrwx.  1 daniel.malzl pavri.grp  59111 May 12 11:29 SRX2581878_trimmed_fastqc.zip
-rwxrwxrwx.  1 daniel.malzl pavri.grp     20 May 12 11:29 SRX2581878_trimmed.fq.gz
-rwxrwxrwx.  1 daniel.malzl pavri.grp    103 May 12 11:30 versions.yml

The file is not empty, presumably due to the gzip header, but gunzip yields a file of size 0. If 20 bytes is the standard size for the header, I guess we could use logic similar to what your link points to (see the sketch below).
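Something like this, as a sketch (decompressing a little is more robust than trusting the compressed size, since an empty gzip member is around 20 bytes; the file name is taken from the listing above):

// Sketch: decide emptiness by decompressing rather than trusting the
// compressed size; returns true if no bytes come out of the stream
import java.util.zip.GZIPInputStream

def isEffectivelyEmpty = { File fq ->
    new GZIPInputStream(fq.newInputStream()).withStream { it.read() == -1 }
}
assert isEffectivelyEmpty(new File('SRX2581878_trimmed.fq.gz'))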

@drpatelh
Member

Hmmm...thanks. Ok. Maybe it's easier to parse the Cutadapt logs in that case 👍🏽 So something like this instead.

@dmalzl
Author

dmalzl commented May 13, 2022

I think this would be the easiest. Thanks for looking into it. A related question: since I am not really planning to use the output Salmon produces, is there a way to tell the pipeline to ignore errors only in the Salmon processes? Something like this:

process {
    withName: '.*:QUANTIFY_STAR_SALMON:.*' {
        errorStrategy = 'ignore'
    }
}

Do you think that will work?

@dmalzl
Author

dmalzl commented May 13, 2022

Update: seems to work for now

@drpatelh drpatelh added this to the 3.8 milestone May 23, 2022
drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue May 23, 2022
@drpatelh
Member

drpatelh commented May 24, 2022

Fixed in drpatelh@96f6988

Added a new --min_trimmed_reads parameter that lets users configure the minimum number of reads a sample must retain after trimming; samples below this threshold are excluded from downstream processing.

Added some logic and a new process that generates a warning in the MultiQC report if samples fail the --min_trimmed_reads threshold, so failing samples are flagged as a QC issue rather than silently dropped from processing:

[Screenshot: MultiQC report warning listing samples that failed the minimum trimmed reads threshold]
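The new option can then be set like any other pipeline parameter, e.g. in a custom config; the threshold below is purely illustrative:

// custom.config — sketch using the new parameter
params {
    min_trimmed_reads = 10000  // illustrative threshold, not a recommendation
}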
