-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SAM file discussion #300
Comments
This is brilliant Leandro. I have a couple of questions/suggestions. Firstly, I am very keen to make it SAM-compliant if we can. As you say, it would be make things nicer for using samtools etc. For We could use the TLEN field too right? It could solve your POS problem? POS could be the position the index of the firt minimizer and TLEN could be the number of minimizers in the alignment or something? I don't have any opinion on flags. If we are going to use things like number of minimixers etc. then we should definitely encode w and k in the header. |
That is true, but the idea of adding the region in brackets in
I think by the standard, one of these records must be the primary alignment and the other two secondary/supplementary. But that is not the information we want to convey, as we are not a linear reference aligner, it is normal for us to map different parts of the reads to different "references". By adding the region to
Isn't
This is going to be ill-defined anyway, as it seems to me that the CIGAR string (removing clippings) applied to the sequence My preference to use minimisers in the PRG as length of the reference and POS as the index of the minimiser is because this info is easy to get when writing the SAM file.
Well remembered! |
Just changed this on 54812b0 : flank size is always |
We could just take the minimap2 approach (I feel like I say that a lot 😅) and mark them, arbitrarily, as supplementary, but have the I'm just being a bit paranoid about an edge case where a read has Anyway I'm fine to leave it if you think it's best there. We can always revisit later. Regarding the |
I didn't know a single query could have multiple primary alignments. I agree with removing the region from the query name, as it makes more sense semantically. The information will now be encoded in the CIGAR, as the first and last soft clipping events. We could also describe the interval as extra SAM fields if needed, although it seems we don't need this for now.
This looks ok for me, i.e. having I am not sure however if we need to add the |
With your suggestions, I inferred that the length of a PRG should be the length of its string representation (as With these changes we now have a compliant SAM file 🎉 ... well at least I can run |
Awesome. Let's keep this open as I am sure we are going to want to make changes as we start using these files more deeply. |
FWIW I just used this SAM file for the first time to debug some things and it was super helpful in its current form. |
Intro
In the racon_denovo_p1 branch, we are producing a pseudo SAM file. The pseudo comes from some fields breaking them SAM format. This issue is to discuss the SAM file we produce now, and if there are fields we should add/change/remove. It would be nice to have input from you @iqbal-lab , @mbhall88 , @Danderson123 . This is a snippet of the current SAM we produce:
We first have some header lines saying that
pandora
created this SAM file, and some explanations about the SAM extra fields we add. In particular,LF
,RF
andPPCH
arepandora
-specific SAM extra fields, and can get quite large - I think we will by default remove these, but we can add them back with apandora --extra-SAM-fields
flag, as they can be quite useful for debugging. The other SAM extra fields,NM
,AS
,nn
, andcm
are some of the standard ones we can produce -minimap2
also outputs these.In summary,
pandora
matches reads and PRG minimisers, then filters these mappings, obtaining clusters of hits. Each cluster of hits can be thought of as a region of the read that maps to a single PRG. Each suchcluster of hits
->PRG
pair becomes aSAM
record in theSAM
file. Therefore, a single read can map to several PRGs and have severalSAM
records (this is expected for long reads, for example).SAM fields details
Some comments about the SAM fields:
Field 1.
QNAME
, query name: we include the query name, and in brackets the specific region of the query we quasimap. This is intentionally added aspandora
expects to map different regions of a read to different floating loci. This is one crucial difference WRT linear mappers, and as such I think it needs to be well highlighted and put in the query name;Field 2.
FLAG
: right now the only flags we set are0x0
(i.e. read mapped) and0x4
(i.e. read did not map). @Danderson123 is working on setting flag0x10
: SEQ being reverse complemented, so we know the strand of the quasi-mapping (this is not as trivial as it looks like). As said previously, we can consider our query as being composed of several segments, mapping to potentially different loci. We could thus set all these other flags, which I currently simply ignore, as I have never used any in SAM post-filterings:These fields could be useful to infer for e.g. the order of genes a read maps to more easily, but it can also be inferred from the region described in
QNAME
. Should we set these flags?Besides these flags, these two are currently ignored because we choose the best cluster of hits for each region of the read, and ignore secondary/supplementary clusters of hits:
And these last two are ignored because they are not applicable:
RNAME
: this is just the PRG name the region of the query mapped to;POS
: this is very different from the standard SAM file format, as here we describe the PRG path of the first kmer hit, instead of the position on the linear reference the alignment starts. However, the abstract concept this fields represents is the same: it is where the mapping starts;MAPQ
: this is just being ignored for now, we always output 255, which means not available;CIGAR
: we output a simpleCIGAR
, composed of soft clipping (S
, the region of the read before the first mapped kmer and after the last mapped kmer), equal bases (=
, when there is a minimiser kmer hit covering the region), mismatch (X
, when there is no minimiser kmer hit covering the region). This is an example of aCIGAR
we produce:5S52=10X30=3S
: we had the first kmer matching in the 6th kmer (first 5 bases are soft clipped -5S
), then for 52 bases we had exact matches, then we missed 10 bases, then 30 bases of exact matches and the last 3 bases we did not hit (3S
). There is not much to improve with just quasi-mapping here. We could improve on the soft-clipped regions (S
) and on the mismatched regions (X
), but for that we would need proper alignment, which we are not doing.SEQ
: we output the sequence of the region we quasi-mapped. This is justQNAME
in sequence itself;RNEXT
,PNEXT
,TLEN
,QUAL
: we output for these either*
or0
, which means not available. Please tell me if you think we should output any of these fields properly;Extra fields: we output the following extra SAM fields, explained in the header of the SAM file:
LF
,RF
,PPCH
,NM
,AS
,nn
,cm
.Conclusion
In general, I am OK with what we are producing right now, I think is very useful for future
pandora
analyses. I can list two TODOs:0x10
;SAM
file format.pandora
SAM files. For example right now we can't even runsamtools
to filter/sort/etc the SAM file, as it errors out. I can see two things that has to be changed to comply to the standard:POS
) needs to be an integer. If I were to choose, I would choose the length of PRGs being the number of minimising kmers in the PRG andPOS
being the index of the first minimisingkmer
the query hits (this is possible as we have a total order of the minimising kmers due to the PRG being a DAG).After this full explanation of the current implementation of SAM files in
pandora
, I'd need your input on: what to change, add, remove, and when (i.e. now, or in a later version).The text was updated successfully, but these errors were encountered: