Decide on a JSON format for FlyBase import #1779
Hi Kim, I've set the json encoder like so (the same as you have in lib/Canto/Track/Serialise.pm): my $json_encoder = JSON::PP->new()->pretty(1)->canonical(1); which seems to be along the right lines :) I've got a question about the 'optional' stuff e.g. from your example.
|
Hi Gillian. Thanks for having a go at this so quickly.
Either leaving out missing fields or using: "allele_name": null would be OK for the code I wrote that uses the JSON. I'd go for whichever is easier. |
I'll probably go for 'leaving out missing fields', thanks ! |
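As a minimal sketch of why either convention can work for a consumer of the JSON: in Python (used here only for illustration; Canto's own loader is Perl), a missing key and an explicit null can be read through the same code path. The sample entries and the `allele_name_of` helper are made up for this example.

```python
import json


def allele_name_of(allele: dict):
    """Return the allele name, or None if the field is absent or null."""
    # dict.get() returns None for a missing key, and a JSON null is already
    # decoded to None, so both conventions end up as the same value.
    return allele.get("allele_name")


# Hypothetical entries, not taken from the FlyBase file.
with_null = json.loads('{"allele_type": "other", "allele_name": null}')
omitted = json.loads('{"allele_type": "other"}')

print(allele_name_of(with_null) is None)
print(allele_name_of(omitted) is None)
```

Both calls print `True`, so a reader written this way does not need to know which option the exporter chose.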
The example JSON file is now in Git and will be used for testing the loading: https://github.com/pombase/canto/blob/fly-canto-dev/t/data/sessions_from_json_test.json
I've tweaked a few things. I've added an "allele_type" field on the alleles and changed the example genotype to:
"FB<something>": {
  "genotype_name": "...",
  "background": "...",
  "alleles": [
    {
      "allele_uniquename": "FBal0119310"
    },
    {
      "allele_uniquename": "FBal0064432"
    }
  ]
}
The "genotype_name" and "background" are optional in Canto. The Does FlyBase have FB IDs for genotypes? |
we don't have FBids for these. There is an internal database uniquename made up of the symbols of everything in the genotype (with each bit always sorted in the same way to ensure that an individual genotype is only in the database once) - but the symbols aren't always updated when the component allele symbols change (even though the underlying component links are updated). We probably won't be submitting genotypes in most cases (they'll need to be built up during the Canto session), although I can see that for some bulk screens the curators might want to make all possible combinations and submit that in the input file, so we should come up with an ID we can use. Would using combinations of FBal ids be a possibility, something like "genotype_name": "FBal0119310/FBal0064432" or "genotype_name": "FBal0119310, FBal0064432" |
In that case we should maybe concentrate on getting the allele loading working. I'll put the genotypes aside for now.
Yep, I think that would be fine. Perhaps with "genotype: " or something similar as a prefix. |
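One way the suggested identifier could be built is sketched below in Python (illustration only, not Canto or FlyBase code). The function name is hypothetical, the "/" separator is just one of the two options floated above, and sorting the IDs is an assumption borrowed from the way the internal FlyBase uniquename is described, so that the same set of alleles always gives the same string.

```python
def genotype_identifier(allele_ids, prefix="genotype: ", separator="/"):
    """Build a genotype identifier from its component FBal IDs."""
    # Sorting makes the result independent of the order the alleles were
    # listed in. Duplicates are kept, in case the same allele is listed twice.
    return prefix + separator.join(sorted(allele_ids))


print(genotype_identifier(["FBal0119310", "FBal0064432"]))
# genotype: FBal0064432/FBal0119310
```

Because the identifier is built from FBal IDs rather than symbols, it would not go stale when an allele symbol is renamed.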
Sounds good. One thing we will need is chromosomal aberrations - these have their FBab numbers as their FBid space, but the tricky thing is that they will not have a "gene": "...." relationship as they can affect multiple genes. So once we've got allele loading working, moving on to FBab might make sense as the next step. |
Good plan. I'm not sure how to handle aberrations yet so that gives a bit of time to think. |
OK, I think I have my code working for alleles and genes. |
Thanks Gillian. That's excellent. I'll have a go at loading the whole thing and see what problems appear. |
Previously we were loading very slowly, one at a time as each publication was processed. Refs #1779
After a few code tweaks that JSON file loads fine. On my desktop it took 9 minutes for the 460 publications. Next I'll work on loading the alleles. |
Hi. I've made another tweak to the genotype JSON format. I forgot that Canto needs to know the taxon ID of the organism that the genotype is in. So I've added |
The "FBal0119310" allele IDs are stored in Canto as internal identifiers but currently aren't shown to the user. |
Aberrations support is coming along nicely. It would be helpful to have some example aberrations in a JSON file. Would it be possible to add some to the file you created? I've added an example aberration to the JSON format documentation: https://github.com/pombase/canto/wiki/JSON-Import-Format
The allele_type must be "aberration". The "allele_name" (FlyBase symbol) is required. The "allele_description" is optional. The gene field should be null or not included. |
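The rules above could be checked before loading with something like the following Python sketch. It is only an illustration of the stated rules, not the Canto loader; the function name and the sample entries are hypothetical, and the field names are the ones in use at this point in the thread.

```python
def aberration_problems(allele: dict) -> list:
    """Return a list of problems found in one aberration entry (empty if none)."""
    problems = []
    if allele.get("allele_type") != "aberration":
        problems.append('allele_type must be "aberration"')
    if not allele.get("allele_name"):
        problems.append("allele_name (FlyBase symbol) is required")
    if allele.get("gene") is not None:
        problems.append("gene should be null or not included")
    # allele_description is optional, so it is not checked here.
    return problems


# Hypothetical entries, made up for illustration.
ok = {"allele_type": "aberration", "allele_name": "example symbol", "gene": None}
bad = {"allele_type": "aberration", "gene": "FBgn0000000"}

print(aberration_problems(ok))
print(aberration_problems(bad))
```

The first entry yields an empty list; the second reports the missing name and the unexpected gene.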
Hi Kim, |
Thanks Gillian! That was quick! I had a quick look and the changes look great. I'll let you know how loading goes. |
I've changed the loader to cope with aberrations but while doing that I realised that the aberrations will require another field because Canto needs to know the organism of each aberration. The aberration definitions will need to have a "taxon_id" like this:
If the taxon_id is included for the other allele definitions it will be ignored. I did a hacky test by adding "taxon_id": 7227 to all the aberrations in your file and it all loaded OK. Here's an example genotype management page: Should we hide the expression column for Fly-Canto? |
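A quick test of this kind could be scripted roughly as follows. This is a Python sketch under assumptions, not the edit that was actually made: it assumes the allele definitions are held in a dict keyed by FB identifier (the file layout is not shown in this thread), and the function name and sample IDs are hypothetical.

```python
def add_taxon_to_aberrations(alleles: dict, taxon_id: int = 7227) -> int:
    """Set taxon_id on every aberration entry in place; return how many were changed."""
    changed = 0
    for allele in alleles.values():
        if allele.get("allele_type") == "aberration":
            allele["taxon_id"] = taxon_id
            changed += 1
    return changed


# Hypothetical data: placeholder IDs, not real FlyBase records.
alleles = {
    "FBab0000000": {"allele_type": "aberration", "allele_name": "example aberration"},
    "FBal0000000": {"allele_type": "other", "allele_name": "example allele"},
}

print(add_taxon_to_aberrations(alleles))
print(alleles["FBab0000000"]["taxon_id"])
```

Only the aberration entry is touched; the other allele is left without a taxon_id, which matches the note that the field is ignored for non-aberration definitions.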
"I did a hacky test by adding "taxon_id": 7227 to all the aberrations in your file and it all loaded OK. Here's an example genotype management page" - I think that would work
"Do you think we'll need to add longer organism names for the genes or are the gene prefixes enough for you?" - I'm agnostic on this
"Should we hide the expression column for Fly-Canto?" - Yes, please
We can discuss these on Tuesday but looking at your examples: b) Allele lists |
"Do you think we'll need to add longer organism names for the genes or are the gene prefixes enough for you?" - agreed during the meeting that prefixes are enough |
Hi Gillian. Let's simplify the JSON a little bit. I'm not sure why I thought that adding the "allele_" prefix for allele fields made sense:
Let's remove the prefixes before we get to using this in real life:
|
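For an existing file, dropping the prefix could be done mechanically, along the lines of this Python sketch. It is illustrative only: the helper name and sample entry are hypothetical, and which fields were actually renamed (and to what) is not shown in this thread, so the resulting key names are an assumption.

```python
def strip_allele_prefix(allele: dict, prefix: str = "allele_") -> dict:
    """Return a copy of an allele entry with the prefix removed from its keys."""
    # Keys that do not start with the prefix are kept unchanged. If both
    # "allele_x" and "x" were present, the later one in the dict would win.
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in allele.items()
    }


# Hypothetical entry, made up for illustration.
old = {"allele_type": "aberration", "allele_name": "example", "gene": None}
print(strip_allele_prefix(old))
# {'type': 'aberration', 'name': 'example', 'gene': None}
```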
I think we could add allele synonyms to the JSON like this:
Is that OK? |
yep, this all looks OK to me (will try and get you updated file by next week, but preparing for biocurator at the mo !) |
No rush! It's going to take me a while to add support for allele synonyms. |
Following on from: #1826 (comment) and from:
I think it makes sense to add |
OK, I think I've done all the changes I'm supposed to make to the JSON input (!) I have done the following:
a. removed the 'allele_' style prefixes from the labels
b. changed how synonyms are stored, so that they are in "synonyms" as an array:
c. if a synonym used by an author is the same as the valid FBal or FBab symbol, it is no longer included in synonyms (to prevent repetition in the display) so where previously it would have had something like:
now it has:
d. accessory alleles. If an FBal is 'typically' used as an accessory allele, then the type is now "accessory" rather than other. e.g.
e. taxon - I haven't added taxon as we decided that would be a global config file thing rather than an allele-by-allele thing in the end.
f. description. I have added code so that if there is an internal note, this will appear in the 'description' slot (if there is no internal note, the description is still null):
I have made two versions of the file - one with and one without the descriptions. Am attaching the one without as I wasn't sure that the tool is ready for all the descriptions to be filled in yet |
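The synonym rule in (c) can be sketched in Python as below. This is not the FlyBase export code; the function name and the example symbols are invented for illustration, and the exact, case-sensitive string comparison is an assumption.

```python
def clean_synonyms(symbol: str, synonyms: list) -> list:
    """Drop any synonym that is identical to the valid symbol, keeping the original order."""
    # Exact comparison on purpose: symbols that differ only in case
    # are treated as different strings here.
    return [synonym for synonym in synonyms if synonym != symbol]


# Made-up symbols, not real FlyBase alleles.
print(clean_synonyms("abc[1]", ["abc[1]", "abc1", "Abc[1]"]))
# ['abc1', 'Abc[1]']
```

An allele whose only author synonym matches its symbol ends up with an empty "synonyms" array rather than a repeated name.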
Thanks Gillian. I think those descriptions are going to be too long for the description field in the display. I'll try loading them as comments so we can see if that's convenient enough. |
Also update code to support new JSON format. Refs #1779
Hi Gillian. Sorry I didn't read that carefully. So please ignore my comment about loading the descriptions as comments. I've modified the code to match the changes suggested above and I've re-loaded the flybase-test canto instance using the new file you made. Everything seems OK: When you get a chance, could you attach your file that includes the allele descriptions so I can give that a go? |
Comment moved to #1872
|
Hi Kim Showing comments/notes/description as in #1872 seems good. |
Hi Vitor. Here's an example: https://curation.pombase.org/flybase-test/curs/05b4acd2db536c5e If you ever need other sessions you can go to the front page (https://curation.pombase.org/flybase-test/) and type the PMID of one of the loaded publications into the "Start curating using a PubMed ID:" box. It will say the paper is "currently being curated by someone else" but if you ignore that and click "View session" you should be able to start curating. Alternatively you can go to the report pages (https://curation.pombase.org/flybase-test/track) and then browse to a publication via the "All publications" report. To get to those pages you'll need the username "flybase-admin@pombase.org" and the password "flybase-admin". Then if you navigate to a publication page like this: |
Hi Kim, am attaching new json file that has placeholder comments in the allele descriptions slot so that it can be loaded into the main test instance to make it easier for testing |
Thanks Gillian. I've edited the file to change I've loaded that file into the flybase-test instance. The old sessions have been replaced. Here's a new test session: https://curation.pombase.org/flybase-test/curs/1be1b6ac8921b40a |
I'm going to close this because I think everything is done and it's getting long. If there are other things to do we can open other issues. |
We will need to import PMIDs with corresponding genes, alleles and genotypes from FlyBase.
The current JSON export format is quite complex: https://github.com/pombase/pombase-chado/blob/master/data/canto_dump.json
I think the import format can be very simple for Fly-Canto because there is very little to load. Perhaps something like this would be enough as a start? Lots of the fields would be optional:
The "FB<something>" needs discussion.
I've pasted the JSON above into the Wiki: https://github.com/pombase/canto/wiki/JSON-Import-Format
I'll expand that into documentation for the format as we go.
CC @gm119