Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to get actual inserted/deleted sequence #10

Closed
brentp opened this issue Jul 10, 2021 · 10 comments
Closed

how to get actual inserted/deleted sequence #10

brentp opened this issue Jul 10, 2021 · 10 comments

Comments

@brentp
Copy link
Contributor

brentp commented Jul 10, 2021

Hi, given a variant like this one (added some newlines for readability):

chr1    54712   5       .       <INS>   .       PASS
SVMETHOD=DYSGUv1.2.3;SVTYPE=INS;END=54712;CHR2=chr1;GRP=426691;NGRP=1;CT=5to3;CIPOS95=0;CIEND95=0;SVLEN=36;
CONTIGA=ttttttttctttctttctttctttctttctttctttctttcTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTCCTCCTTTTCTTTCCTTTTCTTTCTTTCATTCTTTCTTTCTTTTTTAAGTGGCAGGGTCTCACT;
KIND=extra-regional;GC=28.14;NEXP=28;STRIDE=4;
EXPSEQ=ctttctttctttctttctttctttcTTT;
RPOLY=84;OL=0;SU=4;WR=0;PE=0;SR=0;SC=4;BND=4;LPREC=1;RT=pe 
GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB
1/1:10:34.5:4:0:0:0:4:4:16.5:1:1:3:76:1:7:0.829:1.129:0.935:0.65

what is the actual inserted sequence?
SVLEN says 36. EXPSEQ has len=28, the lowercase sequence in CONTIGA has len=41

I understand that in some cases we can't get the full inserted sequence, but for those cases, can we get the left end of the inserted sequence from CONTIGA and the right end from CONTIGB? how?

Would also be nice to be able to get deleted sequence for DEL:

chr1    10250   2       .       <DEL>   .       PASS    SVMETHOD=DYSGUv1.2.3;SVTYPE=DEL;END=10282;CHR2=chr1;GRP=212728;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=32;CONTIGA=CTAACCCTAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCTCACCCCCACCCCCACCCCCACCCCCACCCCCACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTAacCCCTAACCCTAACCCTAACCCTAACCCTAACC;KIND=extra-regional;GC=56.52;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=120;OL=0;SU=6;WR=2;PE=1;SR=0;SC=2;BND=1;LPREC=1;RT=pe

In this case, the sequence after the lower-case letters in CONTIGA is length 32. Is that the deleted sequence? Or would I look this up in a fasta and contig A is the haplotype with the deletion?

@brentp
Copy link
Contributor Author

brentp commented Jul 12, 2021

Could use some help on this @kcleal .

@kcleal
Copy link
Owner

kcleal commented Jul 12, 2021

Hi brentp,

For the 1st record you have highlighted it is possible to get the insertion sequence by slicing the first 36 bases (svlen=36) from the soft-clipped region i.e:

tttctttctttctttctttctttctttctttctttc

This is not guarenteed to work for all insertions though, as some only have partial mappings (or no mappings). I guess full insertion sequences could be added into the alt allele field, however it is not clear what to do for partial or no-mappings.

The lowercase letters are simply non-reference aligned bases, so any insertion or soft-clipped sequences will be represented as lowercase.

In the second example, there looks like a 2bp insertion sequence in the consensus, but this is not related to the deletion. The actual deletion sequence is quite difficult to extract from the record, but can be sliced from the reference using POS and END; this is obviously a pain to do, so I will try and get that added to the alt allele field I think.

@brentp
Copy link
Contributor Author

brentp commented Jul 12, 2021

Thanks for the reply.
Having the insertion/deletion sequence in the ALT field where possible helps for use with downstream tools. I can certainly add the deletion sequence.
For manta, in cases where it can't assemble the entire insertion, it reports LEFT_SVINSSEQ and RIGHT_SVINSSEQ and I can stitch those together separated by 'N' * 100 so that a tool like paragraph can still be used for genotyping.

@kcleal
Copy link
Owner

kcleal commented Jul 12, 2021

Thanks, I think that is a good solution, I will get this fixed.

@brentp
Copy link
Contributor Author

brentp commented Jul 12, 2021

That would be awesome! And, if it can be assembled, manta puts the insertion sequence in SVINSSEQ in the INFO. Though, of course having it in the ALT would be even better.

@kcleal
Copy link
Owner

kcleal commented Jul 15, 2021

Hi brent, ive added some improvements to insertion reporting. If possible the insertion seq is now written in the ALT field, however this applies only to re-mapped sequences where the whole insertion was likely mapped, and 'within-read' events. Other insertion sequences that have partial mappings (or no mappings) are available in the LEFT_SVINSSEQ and RIGHT_SVINSSEQ fields.
Also of note, there is the option to set the contig verbosity. e.g. --verbosity 0 will hide all contigs in the output, for example (but keep the ALT and insertion sequences)

@brentp
Copy link
Contributor Author

brentp commented Jul 16, 2021

Excellent, Thank you! I am evaluating this now.

@brentp
Copy link
Contributor Author

brentp commented Jul 16, 2021

I can run this and get some INS sequences. For those that do not have it, what does that mean?
E.g.:

1       1413382 35      C       <INS>   .       PASS    SVMETHOD=DYSGUv1.2.6;SVTYPE=INS;END=1413382;CHR2=1;GRP=829;NGR
P=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=133;CONTIGA=TTGTATTTTAGTAGAGACGGGGTTTCTCCATGTTGGTCAGGCTGGTCTCTAACTCCCGACCTCAGGTG
ATCCACCCGCCTCGGCCTCTCAAACTGTTGGGATTACAGGCATGTGCCACCACGCCTGGCtaatgttgtattttagtagagacggg;RIGHT_SVINSSEQ=taatgttgtattttag
tagagacggg;KIND=extra-regional;GC=51.95;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=0;OL=0;SU=9;WR=0;PE=0;SR=2;SC=9;BND=5;LPREC=1;RT
=pe     GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB   0/1:145:26.0:9:0:0:2:9:5:21.11
:1:6:5:52:0:6:0.521:1.46:0.76:0.739

has only RIGHT
and:

1       1362768 33      .       <INS>   .       PASS    SVMETHOD=DYSGUv1.2.6;SVTYPE=INS;END=1362769;CHR2=1;GRP=773;NGR
P=1;CT=3to5;CIPOS95=0;CIEND95=63;SVLEN=31;CONTIGA=tcccctgcatcaccctgccctgccccttcccctccaccaccctgccctgcccccTCCCCTCCATCACC
CTGCCCTGCCCCCTCCCCTCCATCACCCTGCCCTGCCCCCACCCCTCCATCATCCCGCCCGCTCCCCTCTCCACCCCTCC;KIND=extra-regional;GC=73.65;NEXP=0;S
TRIDE=0;EXPSEQ=;RPOLY=31;OL=0;SU=3;WR=0;PE=0;SR=3;SC=1;BND=0;LPREC=1;RT=pe    GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH
10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB   0/1:34:43.0:3:0:0:3:1:0:20.86:1:2:4:0:0:22:0.667:0.36:0.24:0.517

has neither.
Thanks!

@brentp
Copy link
Contributor Author

brentp commented Jul 19, 2021

Here is another one:

chr2  134621015       35526   .       <INS>   .    P
ASS     SVMETHOD=DYSGUv1.2.6;SVTYPE=INS;END=134621065;CHR2=chr2;GRP=432474;NGRP=1;CT=5to3;CIPOS95=0;CIEND95=0;SVLEN=47
;CONTIGA=agttgataaccgaatatagtcaaaataaaattttctgtgcttcaaaaaatatctttaagaaaatgaaaagacaagctacttactgtgaaaaaatAATATCTTTAAGAAA
ATGAAAAGACAAGCTACTTACTGTGAAAAAATAATTGCAAATCATATTTCTGATAAACTACTTGCATCCAGAATATATATCCC;CONTIGB=AGTTGATAACCGAATATAGTCAAAAT
AAAATTTTCTGTGCTTCAAAAAATATCTTTAAGAAAATGAAAAGACAAGCTACTTACTGTGAAAAAATAATatctttaagaaaatgaaaagacaagctacttactgtgaaaaaataat
tgcaaatcatatttctgataaactacttgcatccagaatatatatccc;KIND=extra-regional;GC=25.52;NEXP=0;STRIDE=0;EXPSEQ;RPOLY=3;OL=3;SU=7
;WR=0;PE=0;SR=7;SC=14;BND=0;LPREC=1;RT=pe       GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:O
CN:PROB 0/1:200:60:7:0:0:7:14:0:35.08:1:6:8:0:0:2:0.725:1.5:1.088:0.794

has contiga and contigb, but no SEQ fields.

@kcleal
Copy link
Owner

kcleal commented Jul 19, 2021

Hi brent, thanks for looking at this, I can see why this is confusing! Basically, the ones you have pulled out that do not have insertion sequences called - these are events that were identified from split/supplementary mappings (they have RMS re-mapping score==0, and WR within-read support == 0). These events will take a bit more handling to infer the insertion sequence. I will try and get this fixed in the next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants