-
Notifications
You must be signed in to change notification settings - Fork 379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for extra DNA/RNA modified bases. #945
Comments
I have also raised a similar feature request in Jbrowse https://github.com/GMOD/jbrowse/issues/1588 |
Yes I would be interested. |
Hi Jim. |
By the way, there is also a PR pending in the HTSlib repo to support modification information parsing via the API. Hopefully this one will also get merged quite soon. |
@a-slide Well the first step would be some sample data to see if the htsjdk (https://github.com/samtools/htsjdk) can parse it, and if the necessary information makes it through to create the desired track. You can probably determine this by loading a test file into IGV, right-clicking the alignment, and selecting "copy read details to clipboard". The htslib PR is not directly relevant for IGV. If no modifications of the htsjdk are required then some description of what the track would look like, maybe with a sketch, would be the next step. And to be clear this is the IGV desktop repo (as opposed to igv.js). |
Yes, it sounds like a good plan. |
And yes IGV desktop would be a good start |
For igv.js we might collaborate with the JBrowse team (@cmdcolin) to write any needed parsers, no reason to do that twice. Or maybe that's not needed. In any event a test BAM file would be useful. |
@jrobinso would be happy to collab but possibly not much actual code could be shared because it likely ends up with app-specific code that parses the MM (where the modifications are, delta encoded) and MP/ML tags (probabilities of modification) I hand-edited some of the SAM files on the hts-specs issue and converted them to BAMs on our "volvox" organism with a ctgA These are tiny data files that have some descriptions seen in samtools/hts-specs#418 |
Hi @jrobinso, I gave it a quick try with @cmdcolin hand crafted reads and I am not quite sure it works straight out of the box. Here is the expected line from the BAM file: and this this the result from what I got via IGV
Looks like the Mn field has an extra bit added and the Ml field is not parsed. |
@a-slide Do you mean "Mm" field? That looks like a formatting error in the IGV output, the " should be a new line. What do you mean by "an extra bit added"? |
Yes sorry Mm. By extra bit I meant the horizontal ruler + location string ( |
@a-slide Any ideas on the desired look? Something like "bisulfite" mode? |
Actually, I think the preview from @cmdcolin in Jbrowse is probably quite close to the desirable look https://user-images.githubusercontent.com/6511937/113356791-dbb89900-9310-11eb-817e-b811997dc87f.png Multiple modifications on the same read are visualised using different color scales. As I mentioned in the parallel JBrowse thread (GMOD/jbrowse-components#1869), there will be cases where multiple modifications might be called at the same position, but I think this could be solved by showing only the one with the highest probability (from the Ml tag). |
I generated a BAM file with Megalodon containing reads aligned on the human chromosome 20 with around 5X of coverage. I checked and it loads into the last version of IGV desktop, but obviously without modifications displayed |
@a-slide Thanks. I haven't gotten into this yet but hope to in the next few weeks. |
We might be able to offer help to implement the feature if you need help |
@a-slide always welcome help. I'm out of the office until April 19 without much time to think about this until then. |
@a-slide OK this if finally at the top of the list, and will be my main activity next week. One small thing that would be helpful is a PDF of the latest proposed spec for these tags. |
Hi @jrobinso, |
@a-slide That would be helpful if not too much bother, but I think I have it figured out from the PR branch, examples, and past discussions. I will parse this in IGV for expediency, but it feels like this should ultimately be done in the HTSJDK. When color coding methylation from bisulfite sequencing we settled on red and blue, I don't remember why but probably no particular reason. Has any thought been given to color coding conventions for other modifications, now that we have a potentially large set? |
Hi @jrobinso, The proposed specification defines the new I completely agree with you, it should ideally be parsed by HTSJDK. There is a PR in the C library samtools/htslib#1132, but I haven't seen anything in the Java one. This will probably follow when it becomes mainstream. I think the Red and Blue codes for bisulfite made sense, as it was quite intuitive and bisulfite seq gives a clear binary choice. With this new spec, there is now a modification likelyhood score for each bases, so instead of representing as red or blue, I would probably suggest to use a light to dark color scale for each modification. I am not aware of a color coding convention for DNA/RNA mods I hope this helps |
TODO * parse and use likelihood scores. * resolve multiple modifications on a single base * add color scheme by modification type * add modificaton info to popup text
@a-slide There is prototype ready for testing on the nightly snapshot page https://software.broadinstitute.org/software/igv/download_snapshot. All the tests in the draft spec have been implemented as unit tests and work correctly. However I have been chasing a problem with a few negative strand reads in your test data, and suspect there's a problem with the data. In both cases I run out of read sequence before all modifications listed in the "Mm" tag are found. In this case I don't think the Mm tag can be trusted, and am unsure what to do. Here's an example, from read name d49230b4-7719-4e32-82ca-fded86550a16, flags=16 (negative read strand) The Mm tag and read sequence are below, as well as the complete SAM record. The modification is "C", however as the read strand is reverse complemented we need to look for "G". The tag then requires at least 4488 "G"s, this is the sum of implied modifications (length of mm) + the skipped Gs. However there are only 4246 in the sequence
|
@a-slide I am stuck until i get some feedback on the negative strand reads in your test data. I'm fairly certain I am parsing them according to the spec, which is straightforward, of course I could be wrong. For example, for the "top-rev" case from @jkbonfield 's test data in orient.txt I compute zero-based read sequence offsets of 4, 5, and 28. Those offsets are relative to the read sequence as reported in the SAM record. This matches the data in orient.txt, which should be counted from bottom -> top
|
@jrobinso I think there was maybe a bug and an updated version of the bam files was uploaded that has a fix...it was something i saw also :) |
the issue was very apparent if I ran |
Great, thanks @cmdcolin looks good now. And the modifications make sense, in the old file they looked rather random. So @a-slide The snapshot build is ready for testing https://software.broadinstitute.org/software/igv/download_snapshot. After loading your data select "Color alignments by > base modification". I also find it helpful to "Group by > read strand". We could make this grouping automatic for this color choice. Some non-trivial changes were required to support this so it will be a major number release. You are the only group I know of that can really test it so I will await your input. |
Hi @jrobinso Thank you very much. I gave a try to the IGV version you pointed me at and it seems to be doing what I was expecting. It is really nice to be able to see the mods directly in IGV. We will be testing more thoroughly at ONT and I will give you the feedbacks I collect. but I am quite impressed. One suggestion I would have is to change the pileup color scheme as well in modification mode so that it shows the percentage of modified reads at given position. @cmdcolin has implemented something similar in JBrowse and GMOD/jbrowse-components#1869 (comment). Thank you @cmdcolin for pointing Jim to the new file. This seems to have fixed the issue indeed. |
@jrobinso I modified nanopolish to write out Mm, Ml tags with the reference as SEQ to test this, it looks good to me |
@jrobinso I have tried loading megalodon mod_mappings output (with SEQ field replaced with reference bases), but I am getting the following error I've run the I am happy to provide example reference and cram files if that would be helpful. Hopefully the provided error text will be helpful to track down the issue.
|
@marcus1487 I think that error has been fixed, coincidentally I ran into myself. It was triggered by a bam file without base qualities. Could you try the latest snapshot? If the problem persists I might need test data. |
@a-slide the coverage track now displays the % of modified reads as in bisulfite mode. |
I can confirm that the bug does appear to be fixed in the latest snapshot. Thanks so much for the fix! For some background on the reason for the empty quality scores. Megalodon |
Yes it works perfectly and with the fix for missing quality string we can now display results obtained from both Guppy (basecall anchored mod call then aligned to genome) or from Megalodon (reference anchored calls). Thank you very much for the hard work implementing this feature |
* Remove deprecated GA4GH classes. * Remove unused "lite" classes * Combine classes PicardAlignment -> SAMAlignment * Fix autoscale & group autoscale repaint problems * mm tag support * Introduce "ByteSubarray" to avoid copying read sequence and quality arrays * Add unit test * rename "isSoftClipped" -> "isSoftClip" for clarity * Add option to "color by base modification". See issue #945. TODO * parse and use likelihood scores. * resolve multiple modifications on a single base * add color scheme by modification type * add modificaton info to popup text * alpha-shade base modifications by likelihood * Draw base modification after base mismatches * base modifications -- parse all cases from spec * Change preference hierarchy behavior -- user settings for generic alignment tracks ("Alignments" tab) will now cascade down to RNA and 3rd gen alignments if there is no specific user setting for those types. * remove unused export * Add popup text for base modifications * Popup text for base modifications * update test * Bug fix -- file name used for track name in "load from server". Fixes #959 * Base modifications -- add stable colors for known modification types * Optionally write relative genome paths in sessions * Handle case of SAM record without base qualities * Add "Flags" to alignment record popup text. See issue #864 * Support base modifications in coverage track. See #945 * Group feature tracks by strand. Fixes #962
@a-slide This thread has gotten pretty long, but I think we can maybe close it and open more targeted issues if they arise. Two dangling threads are left here. (1) multiple modifications per base aren't really handled, the modification with the highest likelihood wins, as you suggested. (2) likelihoods don't factor into the coverage track representation. Both could be handled with some complications, but I don't know how important these are right now, and I would want a good test dataset to work with. So I think these should be new distinct issues if/when they become important. Any guess when the spec will be official? If ONT tools are already producing these files I don't need to wait for an official spec to do a release, there are plenty of IGV features supporting non-spec tags. I suppose checking for both MM and Mm wouldn't be a bad idea. |
Hi @jrobinso,
Regarding the spec I really don't know. This is frankly out of our hands (though we reactivate the issue from time to time). The conversation started nearly 3 years ago, so hopefully this will be released before the end of 2021. And yes ONT Guppy and Megalodon stable versions output BAM with Mm and Ml tags and we have no intentions to drop the support in the future. I agree with you about supporting both upper and lower case. |
@a-slide OK ill release this as is the, probably quite soon unless we find something in testing. There are a number of features in this release to test, not just this one. |
A good question. When I wrote it I chose a half-way house of upper-lower case as it fitted in the user-defined space but not an area where most people venture (user defined tags are nearly always pure lowercase or lower+digit). The intention was to have a "draft tag" area of the specification which telegraphs our intentions and lets people implement the ideas and test them out, before making them concrete and immutable. This was in response to the debacle of barcoding tags which dragged on for so long that groups just started to implement their own, conflicting, ideas. It's better for an official "idea" than none at all, even if that idea may subsequently change due to community testing. That was the intention anyway. In reality it sat as a PR without being merged, so it kind of fell flat. :/ The main problem here really has been lack of knowledge by the specification maintainers (myself included!), which makes it hard to judge whether the specification is sound. It's really the community uptake which demonstrates that, so thank you all concerned. Given it's now picking up speed, I think what's likely to happen now is it'll get merged as a draft tag and announced to the community with a relatively short life span in draft form before being made official with uppercase variants. So I'm hopeful it'll be before the end of summer. As you say, it's not so hard to decide whether to only check the official tag names or to check either case. (Sadly I've already seen another bam file using uppercase MM for something different recently, but frankly that's their problem! The user-defined namespace is there for a reason.) |
* Remove deprecated GA4GH classes. * Remove unused "lite" classes * Combine classes PicardAlignment -> SAMAlignment * Fix autoscale & group autoscale repaint problems * mm tag support * Introduce "ByteSubarray" to avoid copying read sequence and quality arrays * Add unit test * rename "isSoftClipped" -> "isSoftClip" for clarity * Add option to "color by base modification". See issue igvteam#945. TODO * parse and use likelihood scores. * resolve multiple modifications on a single base * add color scheme by modification type * add modificaton info to popup text * alpha-shade base modifications by likelihood * Draw base modification after base mismatches * base modifications -- parse all cases from spec * Change preference hierarchy behavior -- user settings for generic alignment tracks ("Alignments" tab) will now cascade down to RNA and 3rd gen alignments if there is no specific user setting for those types. * remove unused export * Add popup text for base modifications * Popup text for base modifications * update test * Bug fix -- file name used for track name in "load from server". Fixes igvteam#959 * Base modifications -- add stable colors for known modification types * Optionally write relative genome paths in sessions * Handle case of SAM record without base qualities * Add "Flags" to alignment record popup text. See issue igvteam#864 * Support base modifications in coverage track. See igvteam#945 * Group feature tracks by strand. Fixes igvteam#962
This is now released in version 2.10.0 |
Hi @igvteam,
From version 2.1 IGV is able to display CpG methylation based on Bisulfite induced mismatches with the reference sequence. This is a very nice a feature which allows to visualise the CpG methylation at single read level. Since then, epigenetic and epitranscriptomic methods have evolved and there are now more modifications which can be detected at single molecule resolution, in particular with Oxford Nanopore sequencing.
In order to avoid the proliferation of non-standard formats to store these modifications there was a long discussion with the samtools team on how to represent all RNA and DNA modifications within SAM/BAM format (samtools/hts-specs#362), leading to the proposition to use additional flags in the specification to store both modification positions (Mm) and probabilities (Ml) at read level. There can also be multiple modification tracks included within a single read. Here is the link to the Pull Request currently being considered: samtools/hts-specs#418. Hopefully the PR will be merged soon and will be part of the official SAM specification.
Several Oxford Nanopore tools to detect base modifications already implement the specification ahead of the official release, including Megalodon (https://github.com/nanoporetech/megalodon) and Guppy basecaller.
Would you be interested to start a discussion on implementing those new tags in IGV ?
We can provide sample datasets and are happy to see how we could help with the implementation.
The text was updated successfully, but these errors were encountered: