
Pipeline fails due to trimming-related removal of all reads from a sample #825

Closed
dmalzl opened this issue May 13, 2022 · 8 comments

@dmalzl

dmalzl commented May 13, 2022

Description of the bug

I am using the rnaseq pipeline to process a large number of samples downloaded from the SRA. Everything ran as expected until the pipeline hit an error saying that Salmon could not detect any reads aligned to the reference transcriptome. On closer inspection of the sample that caused the error, I found that its reads are only 17 bp long. Adapter and quality trimming with Trim Galore imposes a hard length cutoff on trimmed reads, which defaults to 20 bp, thereby effectively removing all reads from the sample before alignment even starts. Consequently, STAR has nothing to align and writes an empty BAM file, which is eventually passed to Salmon and causes the pipeline to fail.

Currently there is no safeguard against this other than removing such samples before processing or skipping trimming altogether. Since I do not want to rerun the pipeline from the start (most of my samples are already aligned), is there another route I can take to make the pipeline finish? I have thought about changing the Salmon processes a bit, but I suspect that would just push the problem to the next stages of the pipeline. I would therefore suggest catching this edge case in one way or another after trimming, to avoid failures in situations like this.

Command used and terminal output

nextflow run nf-core/rnaseq \
        -profile cbe \
        -w /scratch/daniel.malzl/work \
        -qs 500 \
        --input samplesheet_chunkaa.csv \
        --outdir aligned \
        --fasta ref/GRCh38.primary_assembly.genome.fa \
        --gtf ref/gencode.v40.primary_assembly.annotation.gtf \
        --gencode \
        --rseqc_modules 'bam_stat,infer_experiment,inner_distance,read_distribution' \
        --skip_deseq2_qc \
        --skip_stringtie \
        --skip_bigwig \
        -resume

Relevant files

Any FASTQ file with reads shorter than the trimming length cutoff (e.g. SRR5277994)

System information

No response

@dmalzl dmalzl added the bug Something isn't working label May 13, 2022
@drpatelh
Member

Hi @dmalzl !

You could try using this in a config that you pass via -c custom.config:

process.errorStrategy = 'ignore'

That should ignore any failing processes when you -resume, but you will need to be careful that processes are failing only due to the error you expect and not others.
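If the failing tasks all exit with a known status code, a dynamic errorStrategy could narrow this down. A minimal sketch; the exit code checked here is an assumption about how your failing tasks exit, not something the pipeline guarantees:

// custom.config — sketch: ignore only failures with an expected exit code
// and terminate on anything else (exit code 1 is an assumption here)
process.errorStrategy = { task.exitStatus == 1 ? 'ignore' : 'terminate' }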

Is there a way we can detect "empty" files like this? So the files still have content but the sequences are zero length. Are you able to provide an example here for future reference?

@dmalzl
Author

dmalzl commented May 13, 2022

Hmm, I think this is worth trying, but then I would have to check all failures manually at the end, which might be a pain. Thanks for the suggestion anyway.

For me, the FASTQ file Trim Galore produced is simply empty, so one obvious way is to check whether the produced file is empty. Another would be to use the trimming report, which has the following section:

RUN STATISTICS FOR INPUT FILE: SRX2581878.fastq.gz
=============================================
25414891 sequences processed in total
Sequences removed because they became shorter than the length cutoff of 20 bp:  25414891 (100.0%)

and could easily be parsed to see what percentage of the reads were thrown out, then compare that to a threshold (see the sketch below).
A third, but maybe also futile, way would be to check after alignment whether the BAM file is empty.
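For illustration, a hypothetical Groovy sketch of that parsing idea (the regex and the 100% threshold are assumptions based on the report excerpt above, not pipeline code):

// Hypothetical sketch: extract the percentage of reads removed from a
// Trim Galore report and flag the sample if it crosses a threshold
def reportText = new File('SRX2581878.fastq.gz_trimming_report.txt').text
def m = reportText =~ /length cutoff of \d+ bp:\s+(\d+)\s+\(([\d.]+)%\)/
def pctRemoved = m.find() ? m.group(2).toDouble() : 0.0
if (pctRemoved >= 100.0) {
    println 'WARNING: trimming removed all reads; sample should be skipped'
}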

@drpatelh
Member

Can you do ls -la in the work directory for the failing process? I want to make sure that the file is 0 bytes. If that's the case, then this shouldn't be a difficult fix. We could use some logic like here.

@dmalzl
Author

dmalzl commented May 13, 2022

This is the output I get from ls -la in the work directory the trimming process was writing to:

drwxrwxrwx.  2 daniel.malzl pavri.grp   4096 May 12 11:30 .
drwxrwxrwx. 80 daniel.malzl pavri.grp  12288 May 13 11:53 ..
-rwxrwxrwx.  1 daniel.malzl pavri.grp      0 May 12 11:29 .command.begin
-rwxrwxrwx.  1 daniel.malzl pavri.grp   3773 May 12 11:30 .command.err
-rwxrwxrwx.  1 daniel.malzl pavri.grp   4231 May 12 11:30 .command.log
-rwxrwxrwx.  1 daniel.malzl pavri.grp     56 May 12 11:30 .command.out
-rwxrwxrwx.  1 daniel.malzl pavri.grp  11073 May 12 10:05 .command.run
-rwxrwxrwx.  1 daniel.malzl pavri.grp    440 May 12 10:05 .command.sh
-rwxrwxrwx.  1 daniel.malzl pavri.grp    241 May 12 11:30 .command.trace
-rwxrwxrwx.  1 daniel.malzl pavri.grp      1 May 12 11:30 .exitcode
lrwxrwxrwx.  1 daniel.malzl pavri.grp     30 May 12 11:29 SRX2581878.fastq.gz -> SRX2581878_SRR5277994.fastq.gz
-rwxrwxrwx.  1 daniel.malzl pavri.grp   2324 May 12 11:29 SRX2581878.fastq.gz_trimming_report.txt
lrwxrwxrwx.  1 daniel.malzl pavri.grp     77 May 12 11:29 SRX2581878_SRR5277994.fastq.gz -> /scratch-cbe/users/daniel.malzl/het/data/fastq/SRX2581878_SRR5277994.fastq.gz
-rwxrwxrwx.  1 daniel.malzl pavri.grp 192265 May 12 11:29 SRX2581878_trimmed_fastqc.html
-rwxrwxrwx.  1 daniel.malzl pavri.grp  59111 May 12 11:29 SRX2581878_trimmed_fastqc.zip
-rwxrwxrwx.  1 daniel.malzl pavri.grp     20 May 12 11:29 SRX2581878_trimmed.fq.gz
-rwxrwxrwx.  1 daniel.malzl pavri.grp    103 May 12 11:30 versions.yml

The file is not empty, presumably due to the gzip header, but gunzip yields a file of size 0. If 20 bytes is the standard size for the header, I guess we could use logic similar to what your link points to (see the sketch below).
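Something like this, as a sketch (decompressing a little is more robust than trusting the compressed size, since an empty gzip member is around 20 bytes; the file name is taken from the listing above):

// Sketch: decide emptiness by decompressing rather than trusting the
// compressed size; returns true if no bytes come out of the stream
import java.util.zip.GZIPInputStream

def isEffectivelyEmpty = { File fq ->
    new GZIPInputStream(fq.newInputStream()).withStream { it.read() == -1 }
}
assert isEffectivelyEmpty(new File('SRX2581878_trimmed.fq.gz'))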

@drpatelh
Member

Hmmm...thanks. Ok. Maybe it's easier to parse the Cutadapt logs in that case 👍🏽 So something like this instead.

@dmalzl
Author

dmalzl commented May 13, 2022

I think this would be the easiest. Thanks for looking into it. A related question: since I am not really planning to use the output Salmon produces, is there a way to tell the pipeline to ignore errors only in the Salmon processes? Something like this:

process {
    withName: '.*:QUANTIFY_STAR_SALMON:.*' {
        errorStrategy = 'ignore'
    }
}

Do you think that will work?

@dmalzl
Author

dmalzl commented May 13, 2022

Update: seems to work for now

@drpatelh drpatelh added this to the 3.8 milestone May 23, 2022
drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue May 23, 2022
@drpatelh
Member

drpatelh commented May 24, 2022

Fixed in drpatelh@96f6988

Added a new --min_trimmed_reads parameter that lets users configure the minimum number of reads a sample must retain after trimming; samples below this threshold are excluded from downstream processing.

Added some logic and a new process that generates a warning in the MultiQC report if samples fail the --min_trimmed_reads threshold, so failing samples are flagged as a QC issue rather than silently dropped from processing:

[Screenshot: MultiQC report warning listing samples that failed the minimum trimmed reads threshold]
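The new option can then be set like any other pipeline parameter, e.g. in a custom config; the threshold below is purely illustrative:

// custom.config — sketch using the new parameter
params {
    min_trimmed_reads = 10000  // illustrative threshold, not a recommendation
}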
