MarkDuplicates step fails with UMIs from read names and 4 lanes #802
Comments
Hi @sci-kai! Depending on your setup, it might not be necessary to mark duplicates (see https://nf-co.re/sarek/3.0.2/usage#how-to-handle-umis). As to whether the FASTQ files should be merged beforehand, maybe @lescai can comment. Please also note that the whole UMI subworkflow is currently being overhauled to match the latest recommendations by fgbio.
Thanks Friederike for the hint! I tried running it with "skip_tools = 'baserecalibrator,markduplicates'" but it reports an error at the samblaster step within the CREATE_UMI_CONSENSUS module:
Apologies, I'll try to have a look as soon as possible.
Hi @sci-kai! Apologies for the delayed response, I was on vacation. This issue is a bit older, but it looks like your read files might be unmated (see GregoryFaust/samblaster#37). Could you try name-sorting the read files as suggested in the linked issue?
Hi @FriederikeHanssen.
Reading over the issue again, I am confused now. So in the original run (where MarkDuplicates failed), samblaster ran fine, but now it fails?
Sorry for the confusion and delay. Command line:
Config file:
Error:
Hi! Just looking at this issue again. This might be completely unrelated to the UMI steps, but params should never be provided in a config file; use a params file instead. There is some prioritization magic whereby params set in a config file are not overwritten.
Hi, sorry for the late answer on this topic.
Hi,
@sci-kai Hi! Did you arrive at a solution? For one sample, I was thinking of merging the UMI-consensus BAM files from e.g. 4 lanes and re-doing the workflow using FGBIO_GROUPREADSBYUMI and FGBIO_CALLMOLECULARCONSENSUSREADS.
Description of the bug
I am using Sarek with a configuration that reads UMIs from the FASTQ read names.
The workflow fails at the "MarkDuplicates" step with the error message:
htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: RGHD789:806050
According to the documentation of this error, it can result from reads having the same read name in the BAM file.
I have four lanes for one sample and realized that UMI consensus calling is performed for each lane separately. During that process, the read names are changed to a scheme containing the sample name HD789 and a running number. When the BAMs for all four lanes are merged, different reads end up with the same name (each name occurs for 4 read pairs, since I have 4 lanes), which causes this error in the MarkDuplicates step.
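To make the collision concrete, here is a minimal Python sketch. It does not reproduce what Sarek or fgbio actually do; the naming scheme `<sample>:<counter>`, the lane labels, and the helper names `consensus_names` and `colliding_names` are all assumptions made for illustration. It only shows that a counter restarting for every lane yields repeated names once the lanes are merged, and that a lane-qualified name (one conceivable workaround, not a pipeline feature) would stay unique.

```python
from collections import Counter


def consensus_names(sample, n_reads):
    """Hypothetical naming scheme: '<sample>:<counter>', counter restarts per lane."""
    return [f"{sample}:{i}" for i in range(n_reads)]


def colliding_names(names):
    """Return every name that occurs more than once, with its count."""
    counts = Counter(names)
    return {name: count for name, count in counts.items() if count > 1}


# Assumed lane labels; the real ones come from the samplesheet.
lanes = ["L001", "L002", "L003", "L004"]

# Each lane is processed separately, so each lane produces the same names.
merged = [name for _ in lanes for name in consensus_names("HD789", 3)]
collisions = colliding_names(merged)

# Illustrative alternative: qualify each name with its lane before merging.
merged_unique = [
    f"{lane}:{name}" for lane in lanes for name in consensus_names("HD789", 3)
]
unique_collisions = colliding_names(merged_unique)

print(collisions)
print(unique_collisions)
```

With four lanes, every name in `merged` occurs four times, which matches the "4x" duplication described above, while `merged_unique` contains no repeated names.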
In general, I was wondering whether it is intended to call UMI consensus for each lane separately, or whether it should be called on a merged file, since the same UMI could be distributed across multiple lanes.
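The concern about per-lane consensus calling can be illustrated with a toy Python sketch. It is not fgbio's grouping algorithm: the reads, the use of `(UMI, position)` as the family key, and the helper `group_families` are assumptions for illustration only. It compares how many read families result when grouping is done within each lane versus across all lanes.

```python
from collections import defaultdict

# Toy reads as (lane, UMI, mapping position); all values are made up.
reads = [
    ("L001", "ACGT", 100),
    ("L002", "ACGT", 100),
    ("L003", "ACGT", 100),
    ("L001", "TTGA", 250),
    ("L004", "TTGA", 250),
]


def group_families(reads, per_lane):
    """Group reads into families by (UMI, position), optionally split by lane."""
    families = defaultdict(list)
    for lane, umi, pos in reads:
        key = (lane, umi, pos) if per_lane else (umi, pos)
        families[key].append((lane, umi, pos))
    return dict(families)


per_lane_families = group_families(reads, per_lane=True)
merged_families = group_families(reads, per_lane=False)

print(len(per_lane_families), "families when grouping per lane")
print(len(merged_families), "families when grouping across lanes")
```

In this toy data, reads that share a UMI and position but were sequenced on different lanes form separate single-read families when grouped per lane, and only form one family per molecule when the lanes are grouped together.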
Command used and terminal output
Relevant files
My config file looks like this:
And my samplesheet for this one sample X looks like this:
System information
Nextflow: 22.04.5
Hardware: Desktop
Executor: local
Container engine: Singularity
OS: Ubuntu 22.04
Sarek: 3.0.2