Wrong Supplementary Alignments #282

Closed
DarioS opened this issue May 13, 2020 · 11 comments

@DarioS

DarioS commented May 13, 2020

I traced errors from structural variant software back to bwa mem. My colleague, who is a bit of an expert on the SAM format, explained to me:

It seems to occur between an alt contig and a primary alignment in which the edit distance is high. My guess is that the SNVs caused a big gap in the seeds, so bwa looked for alt alignments, but then the Smith-Waterman (S/W) extensions used to produce the final alignments resulted in full extension of all records. That would also explain why I sometimes see split read alignment records that are covered by multiple other records (e.g. 50S50M, 50M50S, and 25S50M25S, where the third record is unnecessary).

An example from my data set that looks wrong is:

A00121:72:HG2WFDSXX:1:1101:1316:17143   65      chr13   18211999        0       150M    chr21   10325744        0       TTTGTGTTTTTATTTTCCCTGTGTTTGCTTTTTCTCATGGGGAACCTGTGTTGCTGCTTTGAAGGTATATTCATACTGGCCTTTCAAATGCCAACTCTTCAAATTACTAGTTAAGGCTTTCAAAATATGTTATTTAAAAAATTATCCTCT  @BCBBAACCCCAACCCA=0CAAAACCA?CCCCCACA9AA@@@BB?<CAAAACABCA.CCCA:;B@AAAAACAAAA:CA@BACCCAABBAACBAC?DBDDAB:CBDB@D8CBDBCCACD=DBBC9CBBBBBDBBDDBC:C:CBDBB=CEBC  SA:Z:chrUn_JTFH01001465v1_decoy,550,+,150M,6,8; MC:Z:150M       MD:Z:6G3C10A0T4A7T21A2C7G13G8G22T35     RG:Z:HG2WFDSXX.1_OSCC_1-N_1     NM:i:12 AS:i:90 XS:i:70 pa:f:0.818      om:i:31
A00121:72:HG2WFDSXX:1:1101:1316:17143   129     chr21   10325744        0       150M    chr13   18211999        0       TTCACTGTATTGGCCAGGACGGTCTTGATCTCTTCACCTTGTGATCCCCTTGCCTTGGCCTCCAAATTTGCTGGGATTACAGGCCTGAGCCAAGATCCATATTTTTTAAATGAAAAAAAATTTCAAAGGTACTCTTCTTGGTACAATAAT  >B>A-BAAAACA@BAAB@B?:@AACCABAACACC:6?ACCAAABAAAAACC9BAC;A@BACAAABB@CCABCA@@BACA?AA@B8CA7BBAABBBAAAAAA*CCCCCA7*AA7B7*B7779C;AAB*C@AB@>BDDBDDAABB@(DCBCA  SA:Z:chrUn_JTFH01001680v1_decoy,312,-,150M,18,6;        MC:Z:150M       MD:Z:19T12C14T5T9G2A47G20G7A6   RG:Z:HG2WFDSXX.1_OSCC_1-N_1     NM:i:9  AS:i:105        XS:i:105        pa:f:0.875      om:i:0
A00121:72:HG2WFDSXX:1:1101:1316:17143   2113    chrUn_JTFH01001465v1_decoy      550     6       150M    chr21   10325744        0       TTTGTGTTTTTATTTTCCCTGTGTTTGCTTTTTCTCATGGGGAACCTGTGTTGCTGCTTTGAAGGTATATTCATACTGGCCTTTCAAATGCCAACTCTTCAAATTACTAGTTAAGGCTTTCAAAATATGTTATTTAAAAAATTATCCTCT  @BCBBAACCCCAACCCA=0CAAAACCA?CCCCCACA9AA@@@BB?<CAAAACABCA.CCCA:;B@AAAAACAAAA:CA@BACCCAABBAACBAC?DBDDAB:CBDB@D8CBDBCCACD=DBBC9CBBBBBDBBDDBC:C:CBDBB=CEBC  SA:Z:chr13,18211999,+,150M,31,12;       XA:Z:chr13,+18211999,150M,12;chrUn_JTFH01001680v1_decoy,+96,150M,9;chrUn_JTFH01001478v1_decoy,+1,41S109M,2;chrUn_JTFH01001430v1_decoy,+108,150M,12;     MC:Z:150M       MD:Z:6G14A3G1A36T25T19A3T35     RG:Z:HG2WFDSXX.1_OSCC_1-N_1     NM:i:8  AS:i:110        XS:i:105
A00121:72:HG2WFDSXX:1:1101:1316:17143   2193    chrUn_JTFH01001680v1_decoy      312     18      150M    chr13   18211999        0       ATTATTGTACCAAGAAGAGTACCTTTGAAATTTTTTTTCATTTAAAAAATATGGATCTTGGCTCAGGCCTGTAATCCCAGCAAATTTGGAGGCCAAGGCAAGGGGATCACAAGGTGAAGAGATCAAGACCGTCCTGGCCAATACAGTGAA  ACBCD(@BBAADDBDDB>@BA@C*BAA;C9777B*7B7AA*7ACCCCC*AAAAAABBBAABB7AC8B@AA?ACAB@@ACBACC@BBAAACAB@A;CAB9CCAAAAABAAACCA?6:CCACAABACCAA@:?B@BAAB@ACAAAAB-A>B>  SA:Z:chr21,10325744,+,150M,0,9; XA:Z:chr21,+10325744,150M,9;chr21,-10270285,150M,9;chrUn_JTFH01001465v1_decoy,-766,150M,8;chrUn_JTFH01001512v1_decoy,+1142,112M1I37M,8;chrUn_JTFH01001724v1_decoy,+607,150M,10;chrUn_JTFH01001478v1_decoy,-176,79M1I70M,10;     MC:Z:150M       MD:Z:14C16G16G4T66A9A19 RG:Z:HG2WFDSXX.1_OSCC_1-N_1     NM:i:6  AS:i:120        XS:i:110

Both the primary alignment and the supplementary one are 150M for a 150-base read.
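For readers less familiar with SAM flags, the FLAG values in the four records above (65, 129, 2113, 2193) can be decoded with a short sketch like the one below. The bit meanings come from the FLAG table in the SAM specification; the `decode_flag` helper and the label strings are illustrative names, not part of bwa or any library.

```python
# Bit meanings follow the FLAG table in the SAM specification.
# The label strings are illustrative, chosen for readability.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "last_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# FLAG values taken from the four records quoted above.
for flag in (65, 129, 2113, 2193):
    print(flag, decode_flag(flag))
```

Records 3 and 4 carry the 0x800 (supplementary) bit and not the 0x100 (secondary) bit, which is why they are being read as parts of a split alignment.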

@lh3
Owner

lh3 commented May 13, 2020

I am not sure what is wrong with this example.

@d-cameron

The supplementary (not secondary) alignment fully overlaps with the primary alignment. The SAM specification's definition states: "A chimeric alignment is represented as a set of linear alignments that do not have large overlaps." As these records fully overlap, they violate the SAM specification.

The issue with these alignments is that they break downstream programs that expect each alignment in a split read alignment to have at least one base that is not aligned to another location in the chimeric alignment.
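The expectation described above can be sketched as a check on read coordinates. This is a minimal illustration, not the logic of any particular SV caller: `read_interval` and `has_unique_base` are hypothetical helper names, and the sketch assumes clipping appears only at the ends of the CIGAR, as the SAM specification requires.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CLIP_OPS = set("SH")           # clipped read bases, not aligned
ALIGNED_READ_OPS = set("MI=X")  # operations that consume aligned read bases


def read_interval(cigar, reverse=False):
    """Half-open [start, end) span of read bases covered by one record.

    Offsets are in the orientation of the original read, so the CIGAR
    is reversed for records on the reverse strand.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if reverse:
        ops.reverse()
    start = 0
    for length, op in ops:
        if op not in CLIP_OPS:
            break
        start += length  # leading clip shifts the start offset
    aligned = sum(length for length, op in ops if op in ALIGNED_READ_OPS)
    return start, start + aligned


def has_unique_base(index, intervals):
    """True if record `index` covers a read base no other record covers."""
    covered = set()
    for j, (s, e) in enumerate(intervals):
        if j != index:
            covered.update(range(s, e))
    start, end = intervals[index]
    return any(pos not in covered for pos in range(start, end))


# The split-read example mentioned earlier in the thread.
split = [read_interval(c) for c in ("50S50M", "50M50S", "25S50M25S")]
print([has_unique_base(i, split) for i in range(3)])  # [True, True, False]

# The reported case: primary and supplementary are both 150M.
full = [read_interval("150M"), read_interval("150M")]
print([has_unique_base(i, full) for i in range(2)])  # [False, False]
```

In the reported records neither alignment has a base of its own, which is the condition that trips tools written around this expectation.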

@lh3
Owner

lh3 commented May 13, 2020

The SAM spec defines chimeric alignment and says a chimeric alignment can be represented by a supplementary alignment. It doesn't say supplementary alignments shall not have large overlaps. Nothing is wrong here.

lh3 closed this as completed May 13, 2020
@d-cameron

Section 1.2 of the SAM specifications states the following:

Chimeric alignment An alignment of a read that cannot be represented as a linear alignment. A chimeric alignment is represented as a set of linear alignments that do not have large overlaps. Typically, one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary” and are distinguished by the supplementary alignment flag. All the SAM records in a chimeric alignment have the same QNAME and the same values for 0x40 and 0x80 flags (see Section 1.4). The decision regarding which linear alignment is representative is arbitrary.

@d-cameron

These alignments typically involve a primary alignment with a high NM. Does bwa produce such alignments because it doesn't find a primary contig seed in part of the read (due to the high error rate), so it looks for split alignments (in ALT contigs?), and then, when it runs S/W, it over-aligns with respect to the seeding?

If you're not planning to actually change bwa, could you please update the documentation so downstream tools have an idea of the circumstances in which bwa will write alignments that violate the specifications?

@lh3
Owner

lh3 commented May 14, 2020

Read again. I don't see that bwa is violating the SAM spec. I was careful when writing the initial version of these sentences.

@lh3
Owner

lh3 commented May 14, 2020

Ok, I see why you said the output is wrong. Bwa outputs supplementary alignments. It doesn't say they are chimeric alignments. A chimeric alignment is represented by supplementary alignments, but supplementary alignments may have other meanings. I intentionally called 0x800 a "supplementary" flag, not a "chimeric" flag.

@d-cameron

So you're saying the alignment isn't a chimeric alignment? Just to be clear, is your argument that

a) The alignment is a chimeric alignment and you consider it a valid alignment.

b) The alignment is not a chimeric alignment, but a different kind of supplementary alignment (an ALT alignment?).

The problem with a) is that it violates the "no large overlaps" requirement of S1.2.

The problems with b) are:

  1. An SA tag is written, and the SA tag is defined as "Other canonical alignments in a chimeric alignment".

  2. S1.4.2 defines the supplementary flag bit as: "Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment."
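As a concrete illustration of point 1, the SA tag on the third record above can be unpacked as follows. The field order (rname, pos, strand, CIGAR, mapQ, NM, each entry terminated by a semicolon) follows the SA definition in the SAM optional fields specification; `SAEntry` and `parse_sa` are hypothetical names used only for this sketch.

```python
from typing import NamedTuple


class SAEntry(NamedTuple):
    rname: str
    pos: int      # 1-based leftmost position, as in the SAM POS field
    strand: str   # "+" or "-"
    cigar: str
    mapq: int
    nm: int


def parse_sa(tag):
    """Parse an SA tag value (with or without the 'SA:Z:' prefix)."""
    value = tag[len("SA:Z:"):] if tag.startswith("SA:Z:") else tag
    entries = []
    for item in value.split(";"):
        if not item:
            continue  # skip the empty piece after the trailing ';'
        rname, pos, strand, cigar, mapq, nm = item.split(",")
        entries.append(SAEntry(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return entries


# SA tag copied from the chrUn_JTFH01001465v1_decoy record above.
for entry in parse_sa("SA:Z:chr13,18211999,+,150M,31,12;"):
    print(entry.rname, entry.pos, entry.strand, entry.cigar)
```

The SA entry reports 150M on chr13 while the record carrying it is itself 150M, so a tool that trusts the SA definition sees two "chimeric" segments claiming the same 150 read bases.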

So I'm failing to see a spec-compliant interpretation of these records. They look very much like a split read alignment. For an hg38 reference with decoys, I'm getting around 1 in 500 reads exhibiting this behaviour, which has forced me to throttle the log messages in my SV caller to prevent multiple gigabytes of warning messages about split reads with multiple alignments starting at the same read base offset. If it were a 1 in a million edge case I wouldn't be so concerned, but it's a non-trivial subset of reads when aligning against hg38.

If these are essentially ALT contig secondary alignments, then I can adjust my code to handle these. What's the intended interpretation of such records? Are they a bug, or is it intentional ALT contig behaviour?

@lh3
Owner

lh3 commented May 14, 2020

> the alignment isn't a chimeric alignment

No, it is not. And a supplementary alignment is not intended to be representing chimeric alignments only. The bwa behavior is intentional and has been documented in README-alt.md since 2014.

@d-cameron

> a supplementary alignment is not intended to be representing chimeric alignments only

It looks to me that the wording of the specifications means that the only supplementary alignment you can represent in a spec-compliant manner is a chimeric alignment.

> The bwa behavior is intentional and has been documented in README-alt.md since 2014.

Thanks for the pointer to the existing documentation. If you could just add a few points of clarification on how a downstream tool is to determine what type of supplementary record bwa is reporting, it would be much appreciated:

  • Does bwa report chimeric alignments between:
    • primary and ALT contigs?
    • ALT and ALT contigs?
  • Can a supplementary ALT hit also be a chimeric alignment (i.e. 1 primary record + 2 split read ALT records)?
  • Do bwa split read alignments have a maximum number of overlapping bases?
  • How are alt contigs identified in the BAM?
    • Are alt contigs (and only alt contigs) guaranteed to have AH:* in their @SQ records?
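If the answer to the last question is yes, a downstream tool could collect ALT contig names from the header roughly as sketched below. The SAM specification defines the AH tag on @SQ lines as marking an alternate locus; whether bwa writes it for all ALT contigs, and only for them, is exactly the open question, so this shows only how the tag would be read. The function name and the header lines are hypothetical.

```python
def alt_contigs_from_header(header_lines):
    """Return the SN of every @SQ line that carries an AH tag."""
    alts = set()
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        # @SQ fields are tab-separated TAG:VALUE pairs; values may contain ':'.
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        if "AH" in fields:
            alts.add(fields["SN"])
    return alts


# Hypothetical header lines; contig names and lengths are made up.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chrA\tLN:1000",
    "@SQ\tSN:chrA_alt1\tLN:500\tAH:*",
]
print(alt_contigs_from_header(header))  # {'chrA_alt1'}
```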

@d-cameron

@lh3 any documentation on how a downstream tool can definitively identify whether an alignment should be considered a chimeric alignment, or a secondary alignment to an ALT contig would be much appreciated.
