-
It may be that your process expects a specific filename, and the output of the previous process doesn't match it. This is not uncommon in nf-core pipelines, and is usually solved with something like the snippet below in a configuration file:

```groovy
process {
    withName: PICARD_MARKDUPLICATES {
        ext.prefix = { "output_${meta.id}" }
    }
}
```

A first step should be to check the task dir of the failed task and see what input files are linked there, if any. If you can share a minimal reproducible example or a publicly available pipeline for me to check, I can try to reproduce the issue on my side and work on a solution 😄
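To make that first step concrete, here is a sketch of what a correctly staged task dir looks like. The directory and filenames below are mocked up for illustration (in a real run you would use the `work/<hash>` path printed by Nextflow, e.g. `work/83/d89167*` for the failed task further down this thread):

```shell
# Sketch only: mock up a Nextflow task dir to show what correct staging
# looks like. The hash-style path and the sample name are hypothetical.
mkdir -p work/83/d89167_mock
# Nextflow stages each *declared* input as a symlink into the task dir:
ln -sf /dev/null work/83/d89167_mock/sample1_aligned_reads.sam
ls work/83/d89167_mock/
# If the file your command expects is missing from this listing, it was
# never staged -- look at the process input declaration, not the tool call.
```

Alongside the staged inputs, the real task dir also contains `.command.sh` (the exact command executed) and `.command.err` (captured stderr), which are usually the fastest way to see what actually ran.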
-
Hello,
I am running a variant-calling pipeline using GATK4, written as a containerized Nextflow script for analyzing BGI sequencing data, which I submit as a Slurm job to an HPC cluster. I set my input to the reads I want analyzed; the workflow begins and completes step 1, but fails at step 2, saying that the `_aligned_reads.sam` file (the output of step 1) does not exist. The process output/error is below:
```
executor > local (2)
[7b/56c3d5] process > align (1)               [100%] 1 of 1 ✔
[83/d89167] process > markDuplicatesSpark (1) [  0%] 0 of 1
[-        ] process > getMetrics              -
[-        ] process > haplotypeCaller         -
[-        ] process > selectVariants          -
[-        ] process > filterSnps              -
[-        ] process > filterIndels            -
[-        ] process > bqsr                    -
[-        ] process > analyzeCovariates       -
[-        ] process > snpEff                  -
[-        ] process > qc                      -
Error executing process > 'markDuplicatesSpark (1)'

Caused by:
  Process `markDuplicatesSpark (1)` terminated with an error exit status (2)

Command executed:
  mkdir -p /scratch/projects/oleksyk-lab/gatk4/gatk_temp/furious_hamilton/
  gatk --java-options "-Djava.io.tmpdir=/scratch/projects/oleksyk-lab/gatk4/gatk_temp/furious_hamilton/" MarkDuplicatesSpark -I _aligned_reads.sam -M _dedup_metrics.txt -O _sorted_dedup.bam
  rm -r /scratch/projects/oleksyk-lab/gatk4/gatk_temp/furious_hamilton/

Command exit status:
  2

Command output:
  (empty)

Command error:
18:17:56.068 INFO ContextHandler - Started o.s.j.s.ServletContextHandler@51e0f2eb{/api,null,AVAILABLE,@spark}
18:17:56.069 INFO ContextHandler - Started o.s.j.s.ServletContextHandler@aa794a3{/jobs/job/kill,null,AVAILABLE,@spark}
18:17:56.069 INFO ContextHandler - Started o.s.j.s.ServletContextHandler@22cb8e5f{/stages/stage/kill,null,AVAILABLE,@spark}
18:17:56.072 INFO ContextHandler - Started o.s.j.s.ServletContextHandler@5ca8c904{/metrics/json,null,AVAILABLE,@spark}
18:17:56.076 INFO MarkDuplicatesSpark - Spark verbosity set to INFO (see --spark-verbosity argument)
18:17:56.118 INFO GoogleHadoopFileSystemBase - GHFS version: 1.9.4-hadoop3
WARNING 2023-10-09 18:17:56 SamReaderFactory Unable to detect file format from input URL or stream, assuming SAM format.
WARNING 2023-10-09 18:17:56 SamReaderFactory Unable to detect file format from input URL or stream, assuming SAM format.
18:17:56.286 INFO MemoryStore - Block broadcast_0 stored as values in memory (estimated size 1540.3 KiB, free 17.8 GiB)
18:17:56.593 INFO MemoryStore - Block broadcast_0_piece0 stored as bytes in memory (estimated size 68.4 KiB, free 17.8 GiB)
18:17:56.596 INFO BlockManagerInfo - Added broadcast_0_piece0 in memory on hpc-compute-p36.cm.cluster:44093 (size: 68.4 KiB, free: 17.8 GiB)
18:17:56.599 INFO SparkContext - Created broadcast 0 from broadcast at SamSource.java:78
18:17:56.719 INFO MemoryStore - Block broadcast_1 stored as values in memory (estimated size 188.3 KiB, free 17.8 GiB)
18:17:56.741 INFO MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 41.8 KiB, free 17.8 GiB)
18:17:56.742 INFO BlockManagerInfo - Added broadcast_1_piece0 in memory on hpc-compute-p36.cm.cluster:44093 (size: 41.8 KiB, free: 17.8 GiB)
18:17:56.742 INFO SparkContext - Created broadcast 1 from newAPIHadoopFile at SamSource.java:108
18:17:56.833 INFO BlockManagerInfo - Removed broadcast_1_piece0 on hpc-compute-p36.cm.cluster:44093 in memory (size: 41.8 KiB, free: 17.8 GiB)
18:17:56.837 INFO BlockManagerInfo - Removed broadcast_0_piece0 on hpc-compute-p36.cm.cluster:44093 in memory (size: 68.4 KiB, free: 17.8 GiB)
WARNING 2023-10-09 18:17:56 SamReaderFactory Unable to detect file format from input URL or stream, assuming SAM format.
WARNING 2023-10-09 18:17:56 SamReaderFactory Unable to detect file format from input URL or stream, assuming SAM format.
18:17:56.903 INFO MemoryStore - Block broadcast_2 stored as values in memory (estimated size 1540.3 KiB, free 17.8 GiB)
18:17:56.912 INFO MemoryStore - Block broadcast_2_piece0 stored as bytes in memory (estimated size 68.4 KiB, free 17.8 GiB)
18:17:56.913 INFO BlockManagerInfo - Added broadcast_2_piece0 in memory on hpc-compute-p36.cm.cluster:44093 (size: 68.4 KiB, free: 17.8 GiB)
18:17:56.914 INFO SparkContext - Created broadcast 2 from broadcast at SamSource.java:78
18:17:56.917 INFO MemoryStore - Block broadcast_3 stored as values in memory (estimated size 188.3 KiB, free 17.8 GiB)
18:17:56.927 INFO MemoryStore - Block broadcast_3_piece0 stored as bytes in memory (estimated size 41.8 KiB, free 17.8 GiB)
18:17:56.928 INFO BlockManagerInfo - Added broadcast_3_piece0 in memory on hpc-compute-p36.cm.cluster:44093 (size: 41.8 KiB, free: 17.8 GiB)
18:17:56.928 INFO SparkContext - Created broadcast 3 from newAPIHadoopFile at SamSource.java:108
18:17:56.974 INFO BlockManagerInfo - Removed broadcast_2_piece0 on hpc-compute-p36.cm.cluster:44093 in memory (size: 68.4 KiB, free: 17.8 GiB)
18:17:56.977 INFO BlockManagerInfo - Removed broadcast_3_piece0 on hpc-compute-p36.cm.cluster:44093 in memory (size: 41.8 KiB, free: 17.8 GiB)
18:17:56.978 INFO AbstractConnector - Stopped Spark@5cb6966{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
18:17:56.981 INFO SparkUI - Stopped Spark web UI at http://hpc-compute-p36.cm.cluster:4040
18:17:56.989 INFO MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
18:17:57.004 INFO MemoryStore - MemoryStore cleared
18:17:57.004 INFO BlockManager - BlockManager stopped
18:17:57.006 INFO BlockManagerMaster - BlockManagerMaster stopped
18:17:57.008 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
18:17:57.016 INFO SparkContext - Successfully stopped SparkContext
18:17:57.016 INFO MarkDuplicatesSpark - Shutting down engine
[October 9, 2023 at 6:17:57 PM EDT] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 0.06 minutes.
Runtime.totalMemory()=285212672
A USER ERROR has occurred: Failed to load reads from _aligned_reads.sam
Caused by: Input path does not exist: file:_aligned_reads.sam
```
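The leading underscore in `_aligned_reads.sam` hints that a sample-id variable interpolated into the filename resolved to an empty string, and the "Input path does not exist" error means Nextflow never symlinked the file into this task's work dir. Below is a hedged DSL1 sketch of how the align output is normally carried into the next process; all process, channel, and variable names here are hypothetical, since main.nf isn't shown:

```nextflow
// Hypothetical DSL1 sketch -- illustrative names, not the actual pipeline.
process align {
    input:
    tuple val(pair_id), file(reads) from read_pairs_ch

    output:
    // pair_id must be non-empty, or this literally becomes "_aligned_reads.sam"
    tuple val(pair_id), file("${pair_id}_aligned_reads.sam") into aligned_ch

    script:
    """
    bwa mem ref.fa ${reads} > ${pair_id}_aligned_reads.sam
    """
}

process markDuplicatesSpark {
    input:
    // Declaring the file here is what makes Nextflow stage (symlink) it
    // into this task's work dir before the script runs.
    tuple val(pair_id), file(sam) from aligned_ch

    script:
    """
    gatk MarkDuplicatesSpark -I ${sam} \
        -M ${pair_id}_dedup_metrics.txt -O ${pair_id}_sorted_dedup.bam
    """
}
```

If `markDuplicatesSpark` instead hard-codes the filename rather than taking it from an input declaration, the command can reference a file that was never staged, which matches the error above.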
More info on what I'm running is below:
Config File:

```groovy
// Required Parameters
params.reads = "/projects/oleksyk-lab/Kenneth/Golden_Standard/BGI/{E150016531_L01_75_1.fq.gz,E150016531_L01_75_2.fq.gz}"
params.ref = "/projects/oleksyk-lab/Kenneth/Golden_Standard/References/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"
params.outdir = "/scratch/projects/oleksyk-lab/gatk4"
params.snpeff_db = "GRCh38.105"
params.pl = "bgi"
params.pm = "dnbseq"

// Set the Nextflow working directory
// By default this gets set to params.outdir + '/nextflow_work_dir'
workDir = params.outdir + '/nextflow_work_dir'
```
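One thing worth checking in this config: if main.nf builds its input channel with `fromFilePairs` (an assumption, since main.nf isn't shown), the pair id is derived from the `*` wildcard in the glob, and a brace-only pattern like the one above can leave it empty, which would explain filenames like `_aligned_reads.sam`. A sketch with a wildcard pattern and a quick sanity check (the channel name is hypothetical):

```nextflow
// Sketch, assuming main.nf uses fromFilePairs -- verify with .view() below.
params.reads = "/projects/oleksyk-lab/Kenneth/Golden_Standard/BGI/E150016531_L01_75_{1,2}.fq.gz"

Channel
    .fromFilePairs(params.reads, checkIfExists: true)
    .ifEmpty { error "No read pairs matched: ${params.reads}" }
    .view()    // should print a tuple like [E150016531_L01_75, [..._1.fq.gz, ..._2.fq.gz]]
    .set { read_pairs_ch }
```

If `.view()` shows an empty string where the sample id should be, that confirms where the underscore-only filenames come from.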
Slurm Script (DSL1):

```shell
module load bwa
module load GATK
export NXF_VER=22.10.7
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
source activate nf-env
nextflow run main.nf -c goldstandardnextflow.config
```
I cannot find anyone else with this error, and I'm very confused as to why I am receiving it. Any help is greatly appreciated!