Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAM file discussion #300

Open
leoisl opened this issue Oct 17, 2022 · 8 comments
Open

SAM file discussion #300

leoisl opened this issue Oct 17, 2022 · 8 comments

Comments

@leoisl
Copy link
Collaborator

leoisl commented Oct 17, 2022

Intro

In the racon_denovo_p1 branch, we are producing a pseudo SAM file. The pseudo comes from some fields breaking them SAM format. This issue is to discuss the SAM file we produce now, and if there are fields we should add/change/remove. It would be nice to have input from you @iqbal-lab , @mbhall88 , @Danderson123 . This is a snippet of the current SAM we produce:

@PG	ID:pandora	PN:pandora	VN:0.9.2
@CO	LF: left flank sequence, the sequence before the first mapped kmer, soft-clipped
@CO	RF: right flank sequence, the sequence after the last mapped kmer, soft-clipped
@CO	PPCH: Prg Paths of the Cluster of Hits: the PRG path of each hit in considered cluster of hits
@CO	NM: Total number of mismatches in the quasi-alignment
@CO	AS: Alignment score (number of matches)
@CO	nn: Number of ambiguous bases in the quasi-alignment
@CO	cm: Number of minimizers in the quasi-alignment
simulated_read_0[5:97]	0	GC00010897	1{[343, 358)}	255	5S92=3S	*	0	0	TGGCACGGCATGGGGGAGGTCGGCAAGGCCTTGCGCAAGGCTGGTCACGCGAAGCCCAAGGCGGTCAGAAAGGGCAAGCCGGTCGATCCGGC	*	LF:Z:CGATC	RF:Z:TGA	PPCH:Z:1{[343, 358)}->3{[347, 360)[369, 370)[374, 375)}->3{[353, 360)[369, 370)[374, 381)}->1{[376, 391)}->1{[379, 394)}->1{[391, 406)}->1{[399, 414)}->1{[407, 422)}->1{[410, 425)}->1{[415, 430)}->1{[426, 441)}->1{[433, 448)}->	NM:i:0	AS:i:92	nn:i:92	cm:i:12
simulated_read_1[11:97]	0	GC00006032	3{[161, 169)[172, 173)[180, 186)}	255	11S86=3S	*	0	0	TGGCTAATCACCACATTGGCATTTATGGAGCACATCACAATATTTCAATACCATTAAAGCACTGCACCAAAATGAAACACTGCGAC	*	LF:Z:TTCCGCCTCCC	RF:Z:ATT	PPCH:Z:3{[161, 169)[172, 173)[180, 186)}->2{[172, 173)[180, 194)}->1{[186, 201)}->1{[200, 215)}->1{[214, 229)}->1{[218, 233)}->1{[222, 237)}->3{[229, 237)[240, 241)[249, 255)}->1{[250, 265)}->2{[253, 267)[271, 272)}->	NM:i:0	AS:i:86	nn:i:86	cm:i:10
simulated_read_2[8:87]	0	GC00006032	1{[93, 108)}	255	8S79=13S	*	0	0	AGGCAAAGAGAGATGATCTCCTTGTATGTCTGTATTAGCATGATGTTGCTTGTTAATAATTAAAAATATCAACGCGCTT	*	LF:Z:TAAAAAAC	RF:Z:ATATAAGCGCGGG	PPCH:Z:1{[93, 108)}->1{[86, 101)}->1{[82, 97)}->1{[79, 94)}->1{[67, 82)}->1{[66, 81)}->1{[61, 76)}->1{[58, 73)}->1{[52, 67)}->1{[43, 58)}->1{[29, 44)}->	NM:i:0	AS:i:79	nn:i:79	cm:i:11
simulated_read_3[5:95]	0	GC00006032	1{[200, 215)}	255	5S90=5S	*	0	0	AATATTGTGATGTGCTCCATAAATGCCAATGTGGTGATTAGCCAGGGAGGCGGAAAATGTGTTTACACTCCTGAAATAATAAAAAACAGG	*	LF:Z:ATTGA	RF:Z:CAAAG	PPCH:Z:1{[200, 215)}->1{[186, 201)}->2{[172, 173)[180, 194)}->3{[161, 169)[172, 173)[180, 186)}->3{[158, 169)[172, 173)[180, 183)}->3{[144, 145)[152, 153)[156, 169)}->3{[137, 145)[152, 153)[156, 162)}->1{[129, 144)}->1{[125, 140)}->1{[115, 130)}->1{[105, 120)}->	NM:i:0	AS:i:90	nn:i:90	cm:i:11

We first have some header lines saying that pandora created this SAM file, and some explanations about the SAM extra fields we add. In particular, LF, RF and PPCH are pandora-specific SAM extra fields, and can get quite large - I think we will by default remove these, but we can add them back with a pandora --extra-SAM-fields flag, as they can be quite useful for debugging. The other SAM extra fields, NM, AS, nn, and cm are some of the standard ones we can produce - minimap2 also outputs these.

In summary, pandora matches reads and PRG minimisers, then filters these mappings, obtaining clusters of hits. Each cluster of hits can be thought of as a region of the read that maps to a single PRG. Each such cluster of hits -> PRG pair becomes a SAM record in the SAM file. Therefore, a single read can map to several PRGs and have several SAM records (this is expected for long reads, for example).

SAM fields details

Some comments about the SAM fields:
Field 1. QNAME, query name: we include the query name, and in brackets the specific region of the query we quasimap. This is intentionally added as pandora expects to map different regions of a read to different floating loci. This is one crucial difference WRT linear mappers, and as such I think it needs to be well highlighted and put in the query name;
Field 2. FLAG: right now the only flags we set are 0x0 (i.e. read mapped) and 0x4 (i.e. read did not map). @Danderson123 is working on setting flag 0x10: SEQ being reverse complemented, so we know the strand of the quasi-mapping (this is not as trivial as it looks like). As said previously, we can consider our query as being composed of several segments, mapping to potentially different loci. We could thus set all these other flags, which I currently simply ignore, as I have never used any in SAM post-filterings:

0x1: template having multiple segments in sequencing
0x2: each segment properly aligned according to the aligner
0x8: next segment in the template unmapped
0x20: SEQ of the next segment in the template being reverse complemented
0x40: the first segment in the template
0x80: the last segment in the template

These fields could be useful to infer for e.g. the order of genes a read maps to more easily, but it can also be inferred from the region described in QNAME. Should we set these flags?

Besides these flags, these two are currently ignored because we choose the best cluster of hits for each region of the read, and ignore secondary/supplementary clusters of hits:

0x100 secondary alignment
0x800 supplementary alignment

And these last two are ignored because they are not applicable:

0x200 not passing filters, such as platform/vendor quality controls
0x400 PCR or optical duplicate
  1. RNAME: this is just the PRG name the region of the query mapped to;

  2. POS: this is very different from the standard SAM file format, as here we describe the PRG path of the first kmer hit, instead of the position on the linear reference the alignment starts. However, the abstract concept this fields represents is the same: it is where the mapping starts;

  3. MAPQ: this is just being ignored for now, we always output 255, which means not available;

  4. CIGAR: we output a simple CIGAR, composed of soft clipping (S, the region of the read before the first mapped kmer and after the last mapped kmer), equal bases (=, when there is a minimiser kmer hit covering the region), mismatch (X, when there is no minimiser kmer hit covering the region). This is an example of a CIGAR we produce: 5S52=10X30=3S: we had the first kmer matching in the 6th kmer (first 5 bases are soft clipped - 5S), then for 52 bases we had exact matches, then we missed 10 bases, then 30 bases of exact matches and the last 3 bases we did not hit (3S). There is not much to improve with just quasi-mapping here. We could improve on the soft-clipped regions (S) and on the mismatched regions (X), but for that we would need proper alignment, which we are not doing.

  5. SEQ: we output the sequence of the region we quasi-mapped. This is just QNAME in sequence itself;

  6. RNEXT, PNEXT, TLEN, QUAL: we output for these either * or 0, which means not available. Please tell me if you think we should output any of these fields properly;

  7. Extra fields: we output the following extra SAM fields, explained in the header of the SAM file: LF, RF, PPCH, NM, AS, nn, cm.

Conclusion

In general, I am OK with what we are producing right now, I think is very useful for future pandora analyses. I can list two TODOs:

  1. Finish implementing flag 0x10;
  2. Decide if we want to be compliant with the SAM file format.
  • The utility of complying to the standard is that several tools would then be able to process pandora SAM files. For example right now we can't even run samtools to filter/sort/etc the SAM file, as it errors out. I can see two things that has to be changed to comply to the standard:
    1. Our references (PRGs) need a length in the header. This could be: length of the longest path, number of kmers in the PRG, number of minimising kmers in the PRG, etc;
    2. Position (POS) needs to be an integer. If I were to choose, I would choose the length of PRGs being the number of minimising kmers in the PRG and POS being the index of the first minimising kmer the query hits (this is possible as we have a total order of the minimising kmers due to the PRG being a DAG).

After this full explanation of the current implementation of SAM files in pandora, I'd need your input on: what to change, add, remove, and when (i.e. now, or in a later version).

@mbhall88
Copy link
Member

mbhall88 commented Oct 18, 2022

This is brilliant Leandro. I have a couple of questions/suggestions.

Firstly, I am very keen to make it SAM-compliant if we can. As you say, it would be make things nicer for using samtools etc.

For QNAME do you think it might be better to either add the specific region as either a pandora-specific tag, or alternatively it could be encoded in the CIGAR string no (i.e. hard/soft clip)?

We could use the TLEN field too right? It could solve your POS problem? POS could be the position the index of the firt minimizer and TLEN could be the number of minimizers in the alignment or something?

I don't have any opinion on flags.

If we are going to use things like number of minimixers etc. then we should definitely encode w and k in the header.

@leoisl
Copy link
Collaborator Author

leoisl commented Oct 18, 2022

For QNAME do you think it might be better to either add the specific region as either a pandora-specific tag, or alternatively it could be encoded in the CIGAR string no (i.e. hard/soft clip)?

That is true, but the idea of adding the region in brackets in QNAME is not only to give this information, but also to differentiate it from possible further mappings. For example, let's say we have a ONT read ONT_read_1 mapping to 3 different PRGs. Then we would have 3 SAM records starting as:

ONT_read_1 ...
ONT_read_1 ...
ONT_read_1 ...

I think by the standard, one of these records must be the primary alignment and the other two secondary/supplementary. But that is not the information we want to convey, as we are not a linear reference aligner, it is normal for us to map different parts of the reads to different "references". By adding the region to QNAME, we create a unique id of (read, region). However, I do agree that encoding this inside QNAME is definitely not the optimal way to deliver this information, but is good to guarantee unique SAM QNAMEs. I think we can keep this in QNAME but also add it in the CIGAR as clipping as you suggested, as this it makes sense (we actually clipped the ends of the read, and just mapped a region).

We could use the TLEN field too right? It could solve your POS problem? POS could be the position the index of the firt minimizer and TLEN could be the number of minimizers in the alignment or something?

Isn't POS a position on the reference only (i.e. the query and alignment do not matter)?

POS: 1-based leftmost mapping POSition of the first CIGAR operation that “consumes” a reference
base (see table below). The first base in a reference sequence has coordinate 1. POS is set as 0 for
an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and
CIGAR.

This is going to be ill-defined anyway, as it seems to me that the CIGAR string (removing clippings) applied to the sequence REF[POS:] should describe the alignment, but that works only for linear references. We can transform our graph to a linear representation by DAGazing the nodes, but it is still not a linear reference. This might point that we actually need to use GAM (the graph version of SAM? I guess this is the name?) instead of SAM, but in a future version...

My preference to use minimisers in the PRG as length of the reference and POS as the index of the minimiser is because this info is easy to get when writing the SAM file.

If we are going to use things like number of minimixers etc. then we should definitely encode w and k in the header.

Well remembered!

@leoisl
Copy link
Collaborator Author

leoisl commented Oct 18, 2022

In particular, LF, RF and PPCH are pandora-specific SAM extra fields, and can get quite large - I think we will by default remove these, but we can add them back with a pandora --extra-SAM-fields flag, as they can be quite useful for debugging.

Just changed this on 54812b0 : flank size is always 2*kmer_size now, so this field never gets large and can be always included. It would not get very large anyway before, as it was capped at 50 bps, but 50 just looked like a random number. It seems to me 2k on each side should be enough for racon to anchor kmers. And SAM files are actually requiredto run racon, together with these two extra SAM fields, LF and RF, as that is what we use to infer which parts of which reads mapped to each PRG: https://github.com/rmcolq/pandora/blob/3c3dba0819e96b4ce364da5bca19923d063f4b31/src/denovo_discovery/denovo_utils.cpp#L3-L52

@mbhall88
Copy link
Member

I think by the standard, one of these records must be the primary alignment and the other two secondary/supplementary. But that is not the information we want to convey, as we are not a linear reference aligner, it is normal for us to map different parts of the reads to different "references". By adding the region to QNAME, we create a unique id of (read, region). However, I do agree that encoding this inside QNAME is definitely not the optimal way to deliver this information, but is good to guarantee unique SAM QNAMEs. I think we can keep this in QNAME but also add it in the CIGAR as clipping as you suggested, as this it makes sense (we actually clipped the ends of the read, and just mapped a region).

We could just take the minimap2 approach (I feel like I say that a lot 😅) and mark them, arbitrarily, as supplementary, but have the tp:A tag to designate them all as primary? Or we potentially don't even need that if we are saying we only ever output primary alignments.

I'm just being a bit paranoid about an edge case where a read has [ and/or ] characters in it already. But I guess in that case we would just need to ensure when we parse that region out of a read, we do an r-search (from right hand side).

Anyway I'm fine to leave it if you think it's best there. We can always revisit later.

Regarding the POS stuff, how about we do this. Using POS 1{[343, 358)} from your example SAM file, we make POS 343, we make TLEN 15 and then we add a tag for 1?

@leoisl
Copy link
Collaborator Author

leoisl commented Oct 19, 2022

We could just take the minimap2 approach (I feel like I say that a lot ) and mark them, arbitrarily, as supplementary, but have the tp:A tag to designate them all as primary? Or we potentially don't even need that if we are saying we only ever output primary alignments.

I didn't know a single query could have multiple primary alignments. I agree with removing the region from the query name, as it makes more sense semantically. The information will now be encoded in the CIGAR, as the first and last soft clipping events. We could also describe the interval as extra SAM fields if needed, although it seems we don't need this for now.

Regarding the POS stuff, how about we do this. Using POS 1{[343, 358)} from your example SAM file, we make POS 343, we make TLEN 15 and then we add a tag for 1?

This looks ok for me, i.e. having 343 as POS in this example, we are putting as POS the position of the first kmer in the string representation of the PRG, which is a linear representation.

I am not sure however if we need to add the TLEN and tag for 1. With TLEN we can infer the number of minimizers in the alignment (i.e. the size of the aligned path in the PRG), and with this tag for 1, we can infer the exact position of the first alignment (e.g. 1{[343, 358)}). However, we do already have the PPCH extra SAM field that describes exactly the aligned path through the PRG, e.g.: PPCH:Z:1{[343, 358)}->3{[347, 360)[369, 370)[374, 375)}->3{[353, 360)[369, 370)[374, 381)}->1{[376, 391)}->1{[379, 394)}->1{[391, 406)}->1{[399, 414)}->1{[407, 422)}->1{[410, 425)}->1{[415, 430)}->1{[426, 441)}->1{[433, 448)}->. So the TLEN and tag for 1 can be completely inferred from PPCH, and looks like redundant info. But happy to add them if you think having them will be useful anyway.

@leoisl
Copy link
Collaborator Author

leoisl commented Oct 19, 2022

With your suggestions, I inferred that the length of a PRG should be the length of its string representation (as POS is a position in this representation). This is required to compose the SAM's @SQ headers. Besides these, soft clipping does not actually work for our use case: if we soft clip the ends of a long read, in the SEQ field we have to put the entire read (samtools complained about this). Hard clipping works though.

With these changes we now have a compliant SAM file 🎉 ... well at least I can run samtools sort on it

@mbhall88
Copy link
Member

Awesome. Let's keep this open as I am sure we are going to want to make changes as we start using these files more deeply.

@mbhall88
Copy link
Member

FWIW I just used this SAM file for the first time to debug some things and it was super helpful in its current form.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants