
Out of bound reads #17

Closed
sguizard opened this issue Oct 19, 2022 · 13 comments

@sguizard

Hi Kristoffer,

I need your help with a result that one of the nf-core/isoseq pipeline users obtained.
He contacted me to report an issue with the pipeline:
he was able to run the complete pipeline with minimap2, but not with uLTRA.
I identified the read causing the failure and compared the BAM files from both aligners.

I found the following result:

$> grep -e 'm54224_190921_125133/25952682/ccs' chunk13_aligned*.sam
chunk13_alignedMinmap2.sam:m54224_190921_125133/25952682/ccs	16	chr11_KI270721v1_random	50361	0	530M1D85M81N123M80N113M95N135M96N1325M4S	*	0	0	CTGT...CCCC	*	NM:i:11	ms:i:2025	AS:i:2158	nn:i:0	ts:A:+	tp:A:P	cm:i:713	s1:i:2218	s2:i:2218	de:f:0.0048	rl:i:15
chunk13_alignedULTRA.sam:m54224_190921_125133/25952682/ccs	272	chr11	1995179	0	210=1X39=1X2=1X261=1X14=1D77=1X7=81N4=1X118=80N113=95N135=96N420=1X167=1X112=1X403=1X212=11S	*	0	0	CTGT...CCCC	*	XA:Z:rna46708XC:Z:FSM	NM:i:11
chunk13_alignedULTRA.sam:m54224_190921_125133/25952682/ccs	16	chr1_KI270713v1_random	50361	60	210=1X39=1X2=1X261=1X14=1D77=1X7=81N4=1X118=80N113=95N135=96N420=1X167=1X112=1X403=1X219=1D1=1X2=	*	0	0	CTGT...CCCC	*	XA:Z:XM_006724897.1	XC:Z:FSM	NM:i:13

In the minimap2 alignment, the read is mapped once on the chromosome chr11_KI270721v1_random at POS 50361.
With uLTRA, it's mapped on chromosomes chr11 and chr1_KI270713v1_random at POS 1995179 and 50361.

The lengths of the sequences are:

  • chr11_KI270721v1_random: 100316 bases
  • chr11: 135086622 bases
  • chr1_KI270713v1_random: 40745 bases

As you can see, the third mapping lies outside the chromosome itself.

Do you have an idea of what could cause this?
Is it possible that there is an error in the sequence name resolution? The POS value of the faulty mapping is the same as minimap2's.
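For reference, the out-of-bounds placement can be confirmed directly from the CIGAR string. This is a minimal sketch (a hypothetical helper, not part of either tool): an alignment starting at POS 50361 whose CIGAR consumes the same reference span as the minimap2 hit above ends far past the 40745 bp of chr1_KI270713v1_random, but fits on chr11_KI270721v1_random.

```python
import re

def reference_span(cigar):
    """Number of reference bases consumed by a CIGAR string.
    Ops M, =, X, D, and N advance the reference; S, H, I, and P do not."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "M=XDN")

# CIGAR taken from the minimap2 alignment above
cigar = "530M1D85M81N123M80N113M95N135M96N1325M4S"
pos = 50361                                  # 1-based leftmost mapping position
end = pos + reference_span(cigar) - 1

print(end)          # 53024
print(end > 40745)  # True: past the end of chr1_KI270713v1_random (40745 bp)
print(end <= 100316)  # True: fits on chr11_KI270721v1_random (100316 bp)
```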

@ksahlin
Owner

ksahlin commented Oct 19, 2022

Hi @sguizard

Not that I can think of off the top of my head.

It would be great to get the offending read as well as the genome and gtf files so that I could reproduce it. Would it be possible?

Best,
K

@sguizard
Author

The genome and GTF are from the human genome GRCh38 (downloaded from iGenomes).
@husensofteng Can you, please, provide the chunk13 fasta file which should contain the read m54224_190921_125133/25952682/ccs?

@ksahlin
Owner

ksahlin commented Oct 19, 2022

I have downloaded the archive. Is the relevant gtf annotation in the Genes.gencode folder? (There are two other folders, Genes and Archives (with subfolders), all with gtf files in them.)

As for the genome, is it this one:
[screenshot omitted: 2022-10-19 at 17:19:26]

@husensofteng

Hi @ksahlin,

Thanks for looking into the issue. Please find the corresponding sequence below.

>m54224_190921_125133/25952682/ccs
GGGGAGGGTGAGGGAGGGGGTGGGATGGGTGGGGGGTAACGGGGGAAACTGGGGAAGTGGGGAACCGAGGGGCAACCAGGGGAAGATGGGGTGCTGGAGGAGAGCTTGTGGGAGCCAAGGAGCACCTTGGACATCTGGAGTCTGGCAGGAGTGATGACGGGTGGAGGGGCTAGCTCGAGGCAGGGCTGGTGGGGCCTGAGGCCAGTGAGGAGTGTGGAGTAGGTGCCCAGGCATCGTGCAGACAGGGCGACATCAGCTGGGGACGATGGGCCTGAGCTAGGGCTGGAAAGAAGGGGGAGCCAGGCATTCATCCCGGTCACTTTTGGTTACAGGACGTGGCAGCTGGTTGGACGAGGGGAGCTGGTGGGCAGGGTTTGATCCCAGGGCCTGGGCAACGGAGGTGTAGCTGGCAGCAGCGGGCAGGTGAGGACCCCATCTGCCGGGCAGGTGAGTCCCTTCCCTCCCCAGGCCTCGCTTCCCCAGCCTTCTGAAAGAAGGAGGTTTAGGGGATCGAGGGCTGGCGGGGAGAAGCAGACACCCTCCCAGCAGAGGGGCAGGATGGGGGCAGGAGAGTTAGCAAAGGTGACATCTTCTCGGGGGGAGCCGAGACTGCGCAAGGCTGGGGGGCTATGGGCCCGTTCCAGGCAGAAAGAGCAAGAGGGCAGGGAGGGAGCACAGGGGTGGCCAGCGTAGGGTCCAGCACGTGGGGTGGTACCCCAGGCCTGGGTCAGACAGGGACAAGGCAGGGGACACAGGACAGAGGGGTCCCCAGCTGCCACCTCACCCACCGCAATTCATTTAGTAGCAGGCACAGGGGCAGCTCCGGCACGGCTTTCTCAGGCCTATGCCGGAGCCTCGAGGGCTGGAGAGCGGGAAGACAGGCAGTGCTCGGGGAGTTGCAGCAGGACATCACCAGGAGGGCGAAGCGGCCACGGGAGGGGGGCCCCGGGACATTGCGCAGCAAGGAGGCTGCAGGGGCTCGGCCTGCGGGCGCCGGTCCCACGAGGCACTGCGGCCCAGGGTCTGGTGCGGAGAGGGCCCACAGTGGACTTGGTGACGCTGTATGCCCTCACCGCTCAGCCCCTGGGGCTGGCTTGGCAGACAGTACAGCATCCAGGGGAGTCAAGGGCATGGGGCGAGACCAGACTAGGCGAGGCGGGCGGGGCGGAGTGAATGAGCTCTCAGGAGGGAGGATGGTGCAGGCAGGGGTGAGGAGCGCAGCGGGCGGCGAGCGGGAGGCACTGGCCTCCAGAGCCCGTGGCCAAGGCGGGCCTCGCGGGCGGCGACGGAGCCGGGATCGGTGCCTCAGCGTTCGGGCTGGAGACGAGGCCAGGTCTCCAGCTGGGGTGGACGTGCCCACCAGCTGCCGAAGGCCAAGACGCCAGGTCCGGTGGACGTGACAAGCAGGACATGACATGGTCCGGTGTGACGGCGAGGACAGAGGAGGCGCGTCCGGCCTTCCTGAACACCTTAGGCTGGTGGGGCTGCGGCAAGAAGCGGGTCTGTTTCTTTACTTCCTCCACGGAGTCGGCACACTATGGCTGCCCTCTGGGCTCCCAGAACCCACAACATGAAAGAAATGGTGCTACCCAGCTCAAGCCTGGGCCTTTGAATCCGGACACAAAACCCTCTAGCTTGGAAATGAATATGCTGCACTTTACAACCACTGCACTACCTGACTCAGGAATCGGCTCTCGAAGGTGAAGCGAGAGGAACCAGACCTCATCAGCCCAACATCAAAGACACCATCGGAACAGCAGCGCCCGCAGCACCCACCCCGCACCGCGACTCCATCTTCACGGCCACCCCCTGCGGCGGACGGTTGACCACCAGCCACCACATCATCCCAGAGCTGAGCTCCTCCAGCGGGATGACGCCGTCCCCACCACCTCCCTCTTCTTCTTTTTCATCCTTCTGTCTCTTTGTTTCTGAGCTTTCCTGTCTTTCCTTTTTTCTGAGAGATTCAAAGCCTCCACGACTCTGTTTCCCCCGTCCCTTCT
GAATTTAATTTGCACTAAGTCATTTGCACTGGTTGGAGTTGTGGAGACGGCCTTGAGTCTCGGTGCGAGTGTGCGTGAGTGTGAGCCACCTTGGCAAGTGCCTGCGCAGGGCCCGGCCGCCCTCCATCTGGGCCGGGTGACTGGGCGCCGGCTGTGTGCCCGAGGCCTCACCCTGCCCTCGCCTAGTCTGGAAGCTCCGACCGACATCACGGAGCAGCCTTCAAGCATTCCATTACGCCCCATCTCGCTCTGTGCCCCTCCCCACCAGGGCTTCAGCAGGAGCCCTGGACTCATCATCAATAAACACTGTTACAG

Also, these are the full paths to the reference files:
s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf

@ksahlin
Owner

ksahlin commented Oct 23, 2022

I have found the error. Indeed, the read stays aligned to chr11_KI270721v1_random up until the very end of uLTRA's pipeline, where the SAM files are merged (in the case of parallelization).

The line causing the error seems to be

alignment_outfile = pysam.AlignmentFile( os.path.join(args.outfolder, args.prefix+".sam"), "w", reference_names=list(refs_lengths.keys()), reference_lengths=list(refs_lengths.values()) )

On this line, uLTRA uses the reference sequences in the reference fasta to form a template for the final SAM file. uLTRA uses temporary SAM files to produce the alignments (in the case of multiprocessing).

Now, in the temporary SAM file, reference accession chr11_KI270721v1_random is found on line 33 of the header. When initialising the final SAM file with the command above, sequence chr1_KI270713v1_random occurs on line 33 instead.

The read as mapped with uLTRA already has all fields set, e.g., read.reference_name = "chr11_KI270721v1_random". However, when the record from the temporary SAM file is written to the final one, it is inadvertently updated to read.reference_name = "chr1_KI270713v1_random".

This leads me to believe that pysam, the library uLTRA uses internally, stores the reference as a position in the header. That is, read.reference_name points to a position in a vector storing the reference sequences. This means that if I create a new SAM file using a different template (a different reference sequence vector), reads can get assigned to the wrong reference.
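To illustrate that mechanism with a toy model (this is not pysam's actual implementation; the class and names below are made up): a record that stores its reference as an integer index is only meaningful relative to one particular header, so resolving the same index against a header with a different reference order yields the wrong name.

```python
# Reference lists as they might appear in the two headers; the names at a
# given index differ because the final header holds a different subset/order.
tmp_refs   = ["chr10", "chr11_KI270721v1_random", "chr12"]  # temporary SAM header
final_refs = ["chr10", "chr1_KI270713v1_random", "chr12"]   # final SAM header

class Read:
    """Toy stand-in for a SAM record that stores its reference by index."""
    def __init__(self, tid, pos):
        self.tid = tid  # index into a header's reference list, like pysam's tid
        self.pos = pos

    def reference_name(self, refs):
        return refs[self.tid]

read = Read(tid=1, pos=50361)  # aligned against tmp_refs[1]
print(read.reference_name(tmp_refs))    # chr11_KI270721v1_random
# Resolving the unchanged record under the other header silently re-labels it:
print(read.reference_name(final_refs))  # chr1_KI270713v1_random
```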

I need to figure out why uLTRA uses only a subset of the references in its header (creating the difference), and I will fix this asap. This fix will most likely resolve issue #2 as well.

@ksahlin
Owner

ksahlin commented Oct 23, 2022

Hi @sguizard and @husensofteng ,

I have pushed a fix for this to master. Could you verify that the latest master solves the problem (and does not introduce anything unexpected)? The fix was pretty minimal, so I don't think it should break anything, but it's good to check regardless.

Ideally, it would be good to have a diff, or something that looks at the number of lines that differ in the SAM output between the two versions. There could be other reads for which this reference-name switch occurred but was silent, because the resulting position was not out of bounds.
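One way to do such a check (a sketch with made-up read names and inline SAM lines, not a tool from this repo) is to key each alignment line by (QNAME, POS) and compare the RNAME column between the two outputs:

```python
def rname_map(sam_lines):
    """Map (QNAME, POS) -> RNAME for the alignment lines of a SAM file."""
    out = {}
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header lines and blanks
        f = line.split("\t")
        out[(f[0], f[3])] = f[2]  # columns: QNAME, FLAG, RNAME, POS, ...
    return out

# Minimal fake records standing in for the pre-fix and post-fix SAM files
before = rname_map(["readA\t16\tchr1_KI270713v1_random\t50359\t60\t*\t*\t0\t0\t*\t*"])
after  = rname_map(["readA\t16\tchr11_KI270721v1_random\t50359\t60\t*\t*\t0\t0\t*\t*"])

switched = {k for k in before if before.get(k) != after.get(k)}
print(switched)  # {('readA', '50359')}
```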

@sguizard, if everything works with the latest master, I kindly ask for your expertise in also updating the bioconda version whenever you have time :) I made sure to update the setup.py script to v0.0.4.2.

@ksahlin ksahlin reopened this Oct 23, 2022
@sguizard
Author

Hi @ksahlin, thanks for your investigation.
I'm having a look at this today.
No problem with bioconda, I'll take care of it ;)

@sguizard
Author

I tested the patch with the complete set of reads.
Apart from the corrected alignment for read m54224_190921_125133/25952682/ccs, the remaining differences (marked between **) are located in the XA tag, where the IDs of alternative mappings are swapped.

Before:  m54224_190921_125133/26149707/ccs	0	chr17	1742846	60	1=1X...9=4S	*	0	0	GGG...AGA	*	**XA:Z:NM_000934.3,XM_005256698.2**	XC:Z:FSM	NM:i:4
After: m54224_190921_125133/26149707/ccs	0	chr17	1742846	60	1=1X...9=4S	*	0	0	GGG...AGA	*	**XA:Z:XM_005256698.2,NM_000934.3**	XC:Z:FSM	NM:i:4

Before:  m54224_190921_125133/26018748/ccs	0	chr17	1742852	60	2=1D...2=1S	*	0	0	GGG...GCA	*	**XA:Z:NM_000934.3,XM_005256698.2**	XC:Z:FSM	NM:i:3
After: m54224_190921_125133/26018748/ccs	0	chr17	1742852	60	2=1D...2=1S	*	0	0	GGG...GCA	*	**XA:Z:XM_005256698.2,NM_000934.3**	XC:Z:FSM	NM:i:3

Before:  m54224_190921_125133/25625088/ccs	0	chr17	1742871	60	4S38=...8=1S	*	0	0	GGG...GCA	*	**XA:Z:NM_000934.3,XM_005256698.2**	XC:Z:FSM	NM:i:1
After: m54224_190921_125133/25625088/ccs	0	chr17	1742871	60	4S38=...8=1S	*	0	0	GGG...GCA	*	**XA:Z:XM_005256698.2,NM_000934.3**	XC:Z:FSM	NM:i:1

Before:  m54224_190921_125133/25690218/ccs	0	chr17	1742871	60	4S38=...1144=77S	*	0	0	GGG...AGA	*	**XA:Z:NM_000934.3,XM_005256698.2**	XC:Z:FSM	NM:i:0
After: m54224_190921_125133/25690218/ccs	0	chr17	1742871	60	4S38=...1144=77S	*	0	0	GGG...AGA	*	**XA:Z:XM_005256698.2,NM_000934.3**	XC:Z:FSM	NM:i:0

Before:  m54224_190921_125133/25822164/ccs	0	chr17	1742871	60	4S38=...428=1S	*	0	0	GGG...GCA	*	**XA:Z:NM_000934.3,XM_005256698.2**	XC:Z:FSM	NM:i:7
After: m54224_190921_125133/25822164/ccs	0	chr17	1742871	60	4S38=...428=1S	*	0	0	GGG...GCA	*	**XA:Z:XM_005256698.2,NM_000934.3**	XC:Z:FSM	NM:i:7

Before:  m54224_190921_125133/26018081/ccs	0	chr17	1742871	60	4S38=...579=3S	*	0	0	GGG...AGA	*	**XA:Z:NM_000934.3,XM_005256698.2**	XC:Z:FSM	NM:i:2
After: m54224_190921_125133/26018081/ccs	0	chr17	1742871	60	4S38=...579=3S	*	0	0	GGG...AGA	*	**XA:Z:XM_005256698.2,NM_000934.3**	XC:Z:FSM	NM:i:2

Before:  m54224_190921_125133/26149488/ccs	0	chr17	1742871	60	6S38=...786=3S	*	0	0	GGG...AGA	*	**XA:Z:NM_000934.3,XM_005256698.2**	XC:Z:FSM	NM:i:1
After: m54224_190921_125133/26149488/ccs	0	chr17	1742871	60	6S38=...786=3S	*	0	0	GGG...AGA	*	**XA:Z:XM_005256698.2,NM_000934.3**	XC:Z:FSM	NM:i:1

Before:  m54224_190921_125133/25952682/ccs	16	**chr1_KI270713v1_random**	50359	60	1=1D...1X2=	*	0	0	TCT...CCC	*	XA:Z:XM_006724897.1	XC:Z:FSM	NM:i:14
After: m54224_190921_125133/25952682/ccs	16	**chr11_KI270721v1_random**	50359	60	1=1D...1X2=	*	0	0	TCT...CCC	*	XA:Z:XM_006724897.1	XC:Z:FSM	NM:i:14

So it looks OK to me.
I'll update the bioconda recipe and the nf-core/isoseq pipeline.

@sguizard
Author

@ksahlin do you plan to create a 0.0.4.2 release?

@ksahlin
Owner

ksahlin commented Oct 26, 2022

Yes, I was just waiting for your input. Will do it asap and get back to you here when done.

@ksahlin
Owner

ksahlin commented Oct 26, 2022

I have now uploaded v0.0.4.2 to PyPI and created a release.

I will close this issue. You can always bump this thread when the conda package is updated, and I'll update the README with the conda version.

Thanks for reporting this nasty bug and documenting it so well!

@sguizard
Author

@ksahlin the updated bioconda recipe is now live. I'll take care of the nf-core modules today.

@ksahlin
Owner

ksahlin commented Oct 31, 2022

Super, thanks!
