GFF3::Attributes::Variants #15
Comments
I don't have an opinion either way on TAG vs CIGAR. Is the plan to document every single isomiR with a label? How far does one want to go down the rabbit hole of defining every single isomiR that can exist for a miRNA? With very deep RNA-seq and tools such as chimera and miRge, one can get 100s of isomiRs for abundant miRNAs. Do we want to really have a nomenclature for all of them? Is that required for the .gff3 format to work? I am perhaps a bit confused by the intersection of the isomiR data and the .gff3 style. |
Thanks for the thoughts. Well, the idea of the format is to be as unbiased as possible with respect to the tool. There are things that we cannot avoid, like how the tool maps to the miRNAs, but once mapped I think it is good to report every sequence that mapped to any miRNA. It is OK that some tools want to be conservative and don't trust mutations, for instance, or any other variant. But the idea is to have a file that people can reduce if they want, trusting whatever they decide, or to which they can apply any downstream method to create a final list they trust. Or for instance, you can use the ATTRIBUTE For instance, I would like to have the output of 3 tools, and say, OK, I'll merge all 3' variants into one type, and so on ... but if we don't have the full information there, this is impossible. I can think of a case where a tool directly gives an isomiR that is a representation of multiple sequences, and trusts this feature rather than every single sequence. I think that is good, and we can adapt the format to have a label that indicates that. So people can directly use whatever concept the tool defines as an isomiR, for instance: We could have this:
So you report each sequence individually, with the CIGAR/TAG, and FILTER= In that case, every tool shows everything and it is easy for the user to trust what they want to trust. I think it is OK to have a big file with everything; it is like BAM/VCF files, where downstream you decide what to do with the information and how to quantify miRNAs/isomiRs. The tool that goes with this format should help with whatever action we want to apply to the file, like filtering, or even creating a count matrix, merging, things like that. So the same way we have this is mainly my logic here. |
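A minimal sketch of the downstream filtering idea described above, assuming each feature line has the nine tab-separated GFF3 columns and that column 9 holds `key=value` pairs separated by `;`. The attribute key `FILTER`, the values `PASS`/`REJECT` taken from this thread, and the example lines are assumptions for illustration only, since the format is still under discussion:

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('k1=v1;k2=v2') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def keep_trusted(lines, trusted=("PASS",)):
    """Yield only feature lines whose (assumed) FILTER attribute is trusted."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete 9-column GFF3 feature line
        if parse_attributes(columns[8]).get("FILTER") in trusted:
            yield line


# Hypothetical example lines; names and values are made up for illustration.
example = [
    "##gff-version 3",
    "chr1\ttoolA\tisomiR\t10\t31\t.\t+\t.\tName=miR-x-5p;Variant=5AC;FILTER=PASS",
    "chr1\ttoolA\tisomiR\t10\t33\t.\t+\t.\tName=miR-x-5p;Variant=TTe;FILTER=REJECT",
]
kept = list(keep_trusted(example))
```

Because every sequence stays in the file, a user who trusts more than the tool does could call `keep_trusted(example, trusted=("PASS", "REJECT"))` instead of re-running the tool.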
@lpantano |
OK, maybe a stupid question, but what defines a canonical form? I mean, this nomenclature suggests that there is a canonical form (reference) and that the other isoforms are named after it, right? So is miRBase the gold standard here? If that is the case, then it is good to name all possible isomiRs as you suggested. Cheers, |
Thanks for the comment @sinanugur. We are not defining canonical as equal to reference. I know it seems it could be the same, but it is slightly different. The idea is to follow a method similar to a variant calling pipeline, where you map against a reference genome and report variants relative to it. You can use any database as reference: there is MirGeneDB as well, or any other custom one, as long as you put the name in That way all is traceable. So, reference means the reference database used for the analysis. Thanks for contributing! |
Lorena - to that definition, make sure version numbers are part of the database reference, as the value can change (and has changed) with updated versions. I would add that it would be ideal if everyone could settle on a single database (or as few as possible) from which "canonical" sequences are obtained. Also, how are you proposing to incorporate SNPs that appear in some mature miRNAs into this nomenclature? |
Thanks Marc for the comment! I agree that the version is important, so we can add it to the column following some formatting. The header information should also state specifically where the database was taken from. I agree that we should use as few databases as possible, and I think that is what is going to happen. But there are cases where neither of these two databases is good enough for very specific species, and people generate their own custom database based on experiments. That's the main reason we allow flexibility here. And I am sure it would be a minority. About the SNP question: I think that is what we are trying to address here. The CIGAR or TAG system should work, so in principle this will appear in the file as an isomiR, with a PASS or REJECT attribute and a Variant attribute that should be enough to know where the SNPs/mismatches are: 4AC, or whatever system we agree on. For sure, if the sequence has other variants they will appear here as well. But I think that is fine, right? I know that the question could be going in another direction, so please give us an example of what exactly you meant; I am sure we can get some ideas. |
Hi, |
Thanks Thomas for the comments! I agree in general with everything. I am thinking right now that maybe CIGAR plus a TAG that is more general, only mentioning whether the sequence has 3', 5', SNP or addition modifications, would be enough. Maybe we can add an attribute that specifically compares the SEED nts with the reference SEED, just to have a quick view of how different this region is. As for the cutoff for reads, etc., my opinion is that you can always keep the sequence there with a REJECT value to indicate that the tool does not trust this sequence. Actually, this attribute can give the reason why the sequence is rejected, the same logic as in VCF files. So the FILTER attribute can be:
I think the format we are defining should allow all the data to be included, but I don't have anything against removing lines if the tool doesn't give everything; as long as there is enough information to know what is going on, I would be happy. For instance, I can imagine that a tool wants to ignore all the SNPs, to be safe. And I can imagine that the tool won't output all sequences with SNPs because it doesn't want to add all the information. But then there are two options here: what do you do with the counts of these sequences? I imagine 3 scenarios:
In an ideal world, the format should be good enough to know what is going on. So, we can add these rules to these scenarios:
These are my thoughts, so if nobody has strong feelings against this, I think we can adapt these rules to the format. Give developers the freedom to report untrusted isomiRs in different ways, but keep the flexibility to report everything if the tool is designed for that. @gurgese, @mlhack, do you have any ideas for the CIGAR/TAG? It would be awesome to have your input. Thanks! |
Personally, I believe that a field in the output hosting a high-level label (tag) can be useful for many reasons. The CIGAR system is good for representing point variations, but supplementary analysis steps are required to adapt the data for filtering and group-by operations. I agree with including the filtering system proposed by @lpantano and with including all the detected sequences in the output, even those belonging to untrusted classes. |
Hi all, Thanks for all the comments. I'll finish the draft in the next week, and then you can comment to modify whatever I missed or misunderstood. Cheers |
Hi all, after working on the code and the files from different tools, I modified the format slightly for the @gurgese, I hope you can integrate this into the GFF that you are implementing. Thanks
|
cc: @lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC @sinanugur @Bastami @haebhardt
Let's discuss the Variant attribute that will give the information about the type of isomiR. I think the main idea is to get a CIGAR/TAG-like string that can be parsed and gives the full information about the change. Some previous discussions are here: https://github.com/miRTop/incubator/blob/master/isomirs/isomir_naming.md
Does anybody have more ideas for this? For instance, how would you name this isomiR:
This isomiR starts 2 nucleotides before the reference and ends 2 nucleotides before it as well. It has TT as a nucleotide addition and an NT change at position 5, A->C.
I think there are two general ways to describe this, TAG-wise or CIGAR-like; please propose others if you work with different ones. We can use both and define an attribute for each of them as well, as @gurgese just mentioned in another issue.
TAG-wise or similar (it could be more general as well):
miRNA-5p.AAs.5AC.aa.TTe
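To make the "can be parsed" requirement concrete, here is a rough Python sketch that splits the TAG above on dots and decodes only the substitution token, whose meaning the thread states explicitly (`5AC` = position 5, A->C). The function name `parse_tag`, the regular expression, and the assumption that the miRNA name itself contains no dots are all hypothetical. The other tokens (`AAs`, `aa`, `TTe`) are kept but left undecoded because their exact semantics are not settled here:

```python
import re

# Assumed pattern for a substitution token: position, reference nt, observed nt.
SUBSTITUTION = re.compile(r"^(\d+)([ACGTU])([ACGTU])$")


def parse_tag(tag):
    """Split a dot-separated TAG and decode only the substitution tokens."""
    name, *tokens = tag.split(".")
    substitutions = []
    other = []
    for token in tokens:
        match = SUBSTITUTION.match(token)
        if match:
            position, ref_nt, alt_nt = match.groups()
            substitutions.append((int(position), ref_nt, alt_nt))
        else:
            other.append(token)  # left undecoded: semantics not settled yet
    return {"name": name, "substitutions": substitutions, "other": other}


parsed = parse_tag("miRNA-5p.AAs.5AC.aa.TTe")
```

Keeping the undecoded tokens in a separate list means a stricter definition can be added later without changing how the substitution part is read.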
CIGAR-like or similar, or just use the CIGAR exactly as in the BAM file:
AAI2M5C19MAADTT
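For the "use the CIGAR exactly as in the BAM file" option, a small Python sketch of what parsing would look like, using the standard SAM operation letters (`MIDNSHP=X`). The function names and the example string `2I4M1X19M` are made up for illustration and are not a proposed encoding of the isomiR above. Note that the CIGAR-like string in this post is not a valid SAM CIGAR, so a strict parser rejects it:

```python
import re

# Standard SAM CIGAR: one or more <length><operation> pairs.
CIGAR_PAIR = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples for a strict SAM CIGAR."""
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid SAM CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PAIR.findall(cigar)]


def reference_length(ops):
    """Bases consumed on the reference (operations M, D, N, = and X)."""
    return sum(length for length, op in ops if op in "MDN=X")


def query_length(ops):
    """Bases consumed on the read (operations M, I, S, = and X)."""
    return sum(length for length, op in ops if op in "MIS=X")


ops = parse_cigar("2I4M1X19M")  # hypothetical example string
ref_len = reference_length(ops)
read_len = query_length(ops)

try:
    parse_cigar("AAI2M5C19MAADTT")  # the CIGAR-like proposal from this post
    cigar_like_is_valid_sam = True
except ValueError:
    cigar_like_is_valid_sam = False
```

A plain SAM CIGAR only records operation lengths, not which nucleotides were added or changed, which is one reason a CIGAR-like variant that embeds the bases needs its own exact definition.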
Either way we need to define it exactly. So please, propose one example of what you use or would like to have or you are missing, and I'll try to merge them all and propose the final definition that we can discuss further for minor details.
Cheers