Decide on a JSON format for FlyBase import #1779

Closed
kimrutherford opened this issue Feb 18, 2019 · 35 comments
Comments

@kimrutherford
Member

kimrutherford commented Feb 18, 2019

We will need to import PMIDs with corresponding genes, alleles and genotypes from FlyBase.

The current JSON export format is quite complex: https://github.com/pombase/pombase-chado/blob/master/data/canto_dump.json

I think the import format can be very simple for Fly-Canto because there is very little to load. Perhaps something like this would be enough as a start? Lots of the fields would be optional:

{
  "PMID:2120045": {
    "genes": [
      "FBgn0004107",
      "FBgn0016131"
    ],
    "alleles": {
      "FBal0119310": {
        "gene": "FBgn0004107",
        "allele_type": "...",
        "allele_name": "...",
        "allele_description": "..."
      },
      "FBal0119685": {
        "gene": "FBgn0016131",
        "allele_type": "...",
        "allele_name": "...",
        "allele_description": "..."
      }
    },
    "genotypes": {
      "FB<something>": {
        "taxon_id": 7227,
        "genotype_name": "...",
        "background": "...",
        "comment": "...",
        "alleles": [
          {
            "uniquename": "FBgn0004107"
          },
          {
            "uniquename": "FBgn0016131"
          }
        ]
      }
    }
  }
}

The "FB<something>" needs discussion.

I've pasted the JSON above into the Wiki: https://github.com/pombase/canto/wiki/JSON-Import-Format
I'll expand that into documentation for the format as we go.
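
As a rough illustration of how a consumer might walk this structure, here is a minimal Python sketch. It is not Canto's loader; the function name `summarise_import` and the trimmed inline sample are hypothetical, and only the key names ("genes", "alleles", "genotypes") come from the draft above.

```python
import json

# Trimmed sample following the draft format above; "..." values are placeholders.
SAMPLE = """
{
  "PMID:2120045": {
    "genes": ["FBgn0004107", "FBgn0016131"],
    "alleles": {
      "FBal0119310": {"gene": "FBgn0004107", "allele_type": "..."},
      "FBal0119685": {"gene": "FBgn0016131", "allele_type": "..."}
    }
  }
}
"""


def summarise_import(text):
    """Return {pmid: (gene_count, allele_count, genotype_count)}.

    Every section is treated as optional, so a publication that only
    lists genes is still accepted.
    """
    data = json.loads(text)
    summary = {}
    for pmid, details in data.items():
        genes = details.get("genes", [])
        alleles = details.get("alleles", {})
        genotypes = details.get("genotypes", {})
        summary[pmid] = (len(genes), len(alleles), len(genotypes))
    return summary


print(summarise_import(SAMPLE))
```

Reading each section with a default keeps the "lots of the fields would be optional" idea cheap to honour on the loading side.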

CC @gm119

kimrutherford added a commit that referenced this issue Feb 18, 2019
Current status: the JSON is parsed, the publication details are fetched
from PubMed and stored and a new session is created for the publication.

Refs #1743
Refs #1779
@gm119
Collaborator

gm119 commented Feb 18, 2019

Hi Kim,
I'm playing with using the JSON::PP Perl module to make JSON from data structures.

I've set the json encoder like so (the same as you have in lib/Canto/Track/Serialise.pm):

my $json_encoder = JSON::PP->new()->pretty(1)->canonical(1);

which seems to be along the right lines :)

I've got a question about the 'optional' stuff e.g.
"allele_type": "...",
"allele_name": "...",
"allele_description": "..."

from your example.
What is the correct thing to happen in the json output if something doesn't have that particular piece of info,
e.g. say FBal0119310 has no allele_name

  • should it just not print out the "allele_name": bit,
    or do you want it printed, but with something to indicate it's empty?
    So far I've managed to make the equivalent of:
    "allele_name" : "",
    "allele_name" : null,

@kimrutherford
Member Author

Hi Gillian. Thanks for having a go at this so quickly.

What is the correct thing to happen in the json output if something doesn't have that particular piece of info

Either leaving out missing fields or using: "allele_name": null would be OK for the code I wrote that uses the JSON. I'd go for whichever is easier.

@gm119
Collaborator

gm119 commented Feb 18, 2019

I'll probably go for 'leaving out missing fields', thanks !
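
For illustration, the two acceptable options (omit the key, or emit null) can be sketched in Python as below. This is an assumption-laden sketch, not the Perl code used on either side: `drop_missing` and `read_allele_name` are hypothetical helper names, and `indent`/`sort_keys` are only rough counterparts of `pretty(1)` and `canonical(1)` in JSON::PP.

```python
import json


def drop_missing(record):
    """Return a copy of record without the keys whose value is None."""
    return {key: value for key, value in record.items() if value is not None}


def read_allele_name(allele):
    """Reader side: an absent key and an explicit null both give None."""
    return allele.get("allele_name")


allele = {"gene": "FBgn0004107", "allele_type": "other", "allele_name": None}

# Sorted keys give a stable, diff-friendly output file.
print(json.dumps(drop_missing(allele), indent=3, sort_keys=True))
```

Because the reader uses `.get()`, it does not matter which of the two conventions the exporter picks.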

kimrutherford added a commit that referenced this issue Feb 19, 2019
@kimrutherford
Member Author

The example JSON file is now in Git and will be used for testing the loading: https://github.com/pombase/canto/blob/fly-canto-dev/t/data/sessions_from_json_test.json

I've tweaked a few things. I've added an "allele_type" field on the alleles and changed the example genotype to:

   "FB<something>": {
     "genotype_name": "...",
     "background": "...",
     "alleles": [
       {
         "allele_uniquename": "FBal0119310"
       },
       {
         "allele_uniquename": "FBal0064432"
       }
     ]
   }

The "genotype_name" and "background" are optional in Canto. The FB<something> can be anything as long as it's unique.

Does FlyBase have FB IDs for genotypes?

@gm119
Collaborator

gm119 commented Feb 19, 2019

Does FlyBase have FB IDs for genotypes?

We don't have FBids for these. There is an internal database uniquename made up of the symbols of everything in the genotype (with each part always sorted in the same way, to ensure that an individual genotype is in the database only once), but the symbols aren't always updated when the component allele symbols change (even though the underlying component links are updated).

We probably won't be submitting genotypes in most cases (they'll need to be built up during the Canto session), although I can see that for some bulk screens the curators might want to make all possible combinations and submit that in the input file, so we should come up with an ID we can use.

Would using combinations of FBal ids be a possibility, something like

"genotype_name": "FBal0119310/FBal0064432"

or

"genotype_name": "FBal0119310, FBal0064432"

@kimrutherford
Member Author

We probably won't be submitting genotypes in most cases

In that case we should maybe concentrate on getting the allele loading working. I'll put the genotypes aside for now.

Would using combinations of FBal ids be a possibility

Yep, I think that would be fine. Perhaps with "genotype: " or something similar as a prefix.
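
A possible way to build such a key, sketched in Python. The exact format is an assumption: the thread only suggests combining FBal IDs and using a "genotype: " style prefix, so the function name `make_genotype_key`, the "/" separator and the sorting step (borrowed from the FlyBase uniquename description above) are illustrative choices, not an agreed convention.

```python
def make_genotype_key(allele_ids, prefix="genotype: ", separator="/"):
    """Build a genotype key from allele IDs that does not depend on input order.

    Repeated IDs are kept, so a genotype carrying two copies of the same
    allele still gets a key that differs from the single-copy one.
    """
    ordered = sorted(allele_ids)
    if not ordered:
        raise ValueError("a genotype needs at least one allele")
    return prefix + separator.join(ordered)


print(make_genotype_key(["FBal0119310", "FBal0064432"]))
# -> genotype: FBal0064432/FBal0119310
```

Sorting means the same set of alleles always yields the same key, which is what makes it usable as the unique "FB<something>" identifier.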

@gm119
Collaborator

gm119 commented Feb 19, 2019

In that case we should maybe concentrate on getting the allele loading working. I'll put the genotypes aside for now.

sounds good.

One thing we will need is chromosomal aberrations. These use FBab numbers as their FBid space, but the tricky thing is that they will not have a "gene": "...." relationship, as they can affect multiple genes. So once we've got allele loading working, moving on to FBab might make sense as the next step.

@kimrutherford
Member Author

So once we've got allele loading working, moving on to FBab might make sense as the next step

Good plan. I'm not sure how to handle aberrations yet so that gives a bit of time to think.

@gm119
Collaborator

gm119 commented Feb 19, 2019

OK, I think I have my code working for alleles and genes.
Attaching a json file for 460 PMIDs (!) - this is the whole phenotype curation 'to-do' list, so is likely to be the biggest size of input file we'd want to use.
Hopefully this is good for testing.
all_phen_fb_2018_06.json.txt

@kimrutherford
Member Author

Thanks Gillian. That's excellent.

I'll have a go at loading the whole thing and see what problems appear.

kimrutherford added a commit that referenced this issue Feb 21, 2019
kimrutherford added a commit that referenced this issue Feb 21, 2019
Previously we were loading very slowly, one at a time as each
publication was processed.

Refs #1779
@kimrutherford
Member Author

After a few code tweaks that JSON file loads fine. On my desktop it took 9 minutes for the 460 publications.

Next I'll work on loading the alleles.

@kimrutherford
Member Author

Hi. I've made another tweak to the genotype JSON format. I forgot that Canto needs to know the taxon ID of the organism that the genotype is in. So I've added "taxon_id": 7227, to the example: https://github.com/pombase/canto/wiki/JSON-Import-Format

@kimrutherford
Member Author

Progress! I've managed to load the alleles from your example file:

[Screenshot: flybase-alleles-1]

Do you think we'll need to add longer organism names for the genes or are the gene prefixes enough for you?:

[Screenshot: flybase-alleles-front-1]

Here's the admin interface. It's mostly not useful for the FlyBase use case, but it's there if you need it:
[Screenshot: flybase-alleles-admin-1]

@kimrutherford
Member Author

The "FBal0119310" allele IDs are stored in Canto as internal identifiers but currently aren't shown to the user.
Would it help if they were?

@kimrutherford
Member Author

Aberrations support is coming along nicely. It would be helpful to have some example aberrations in a JSON file. Would it be possible to add some to the file you created?

I've added an example aberration to the JSON format documentation: https://github.com/pombase/canto/wiki/JSON-Import-Format

      "FBab0036462": {
        "allele_type": "aberration",
        "allele_name": "Df(3R)ED5223"
      }

The allele_type must be "aberration". The "allele_name" (FlyBase symbol) is required. The "allele_description" is optional. The gene field should be null or not included.
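
The rules in the previous paragraph could be checked before loading with something like the following Python sketch. It is only an illustration of those four rules; `check_aberration` is a hypothetical helper, not part of Canto.

```python
def check_aberration(allele_id, record):
    """Return a list of problems found in one aberration record (empty if none)."""
    problems = []
    if record.get("allele_type") != "aberration":
        problems.append(f'{allele_id}: allele_type must be "aberration"')
    if not record.get("allele_name"):
        problems.append(f"{allele_id}: allele_name is required")
    if record.get("gene") is not None:
        problems.append(f"{allele_id}: gene should be null or left out")
    # "allele_description" is optional, so it is not checked.
    return problems


example = {"allele_type": "aberration", "allele_name": "Df(3R)ED5223"}
print(check_aberration("FBab0036462", example))
```

Returning a list rather than raising on the first problem makes it easier to report everything wrong with a large export in one pass.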

@gm119
Collaborator

gm119 commented Mar 14, 2019

Hi Kim,
I modified my code and added aberrations according to your spec (I think!).
The attached file is the same as for all_phen_fb_2018_06.json.txt but with the FBabs
all_phen_fb_2018_06_with_FBab.json.txt

@kimrutherford
Member Author

Thanks Gillian! That was quick!

I had a quick look and the changes look great. I'll let you know how loading goes.

@kimrutherford
Member Author

I've changed the loader to cope with aberrations, but while doing that I realised that the aberrations will require another field because Canto needs to know the organism of each aberration.

The aberration definitions will need to have a "taxon_id" added, like this:

"FBab0037918" : {
    "allele_description" : "Df(2L)Exel7046",
    "allele_name" : "Df(2L)Exel7046",
    "allele_type" : "aberration",
    "taxon_id": 7227
}

If the taxon_id is included for the other allele definitions it will be ignored.

I did a hacky test by adding "taxon_id": 7227 to all the aberrations in your file and it all loaded OK. Here's an example genotype management page:

[Screenshot: flycanto-with-aberrations-1]
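
The taxon_id edit described above could be scripted along these lines. This Python sketch is a guess at the shape of such a one-off fix, not the script that was actually used; `add_taxon_to_aberrations` is a hypothetical name, and it relies on the "allele_type" key as it stood at this point in the thread.

```python
import copy

DMEL_TAXON_ID = 7227  # Drosophila melanogaster, the taxon ID used in this thread


def add_taxon_to_aberrations(data, taxon_id=DMEL_TAXON_ID):
    """Return a copy of data with taxon_id set on every aberration lacking one.

    Other allele records are left alone; the input is not modified.
    """
    result = copy.deepcopy(data)
    for details in result.values():
        for record in details.get("alleles", {}).values():
            if record.get("allele_type") == "aberration":
                record.setdefault("taxon_id", taxon_id)
    return result
```

Using `setdefault` means an aberration that already carries a taxon_id keeps its own value.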

Should we hide the expression column for Fly-Canto?

@vmt25
Collaborator

vmt25 commented Mar 29, 2019

"I did a hacky test by adding "taxon_id": 7227 to all the aberrations in your file and it all loaded OK. Here's an example genotype management page" - I think that would work

"Do you think we'll need to add longer organism names for the genes or are the gene prefixes enough for you?" - I'm agnostic on this

"Should we hide the expression column for Fly-Canto?" - Yes, please

We can discuss these on Tuesday but looking at your examples:
a) Gene list - can the synonyms used in the publication be shown in a second column? If there are several, comma-separated.

b) Allele lists
(I could not see the nice list you uploaded to PMID:29615466 on the test site, but:)
Can alleles be ordered/clustered by gene name?
Can the synonyms used in the publication, now shown in brackets, be moved to a second column? If there are several, comma-separated (i.e. as is).

@vmt25
Collaborator

vmt25 commented Apr 2, 2019

"Do you think we'll need to add longer organism names for the genes or are the gene prefixes enough for you?" - agreed during the meeting that prefixes are enough

@kimrutherford
Member Author

Hi Gillian.

Let's simplify the JSON a little bit. I'm not sure why I thought that adding the "allele_" prefix for allele fields made sense:

      "FBal0119310": {
        "gene": "FBgn0004107",
        "allele_type": "other",
        "allele_name": "Dmel\\Cdk2_UAS.Tag:MYC",
        "allele_description": null
      }

Let's remove the prefixes before we get to using this in real life:

      "FBal0119310": {
        "gene": "FBgn0004107",
        "type": "other",
        "name": "Dmel\\Cdk2_UAS.Tag:MYC",
        "description": null
      }
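
For an existing file, the rename could be done mechanically. The Python sketch below shows one way, assuming the only change is dropping the "allele_" prefix from keys; `strip_allele_prefix` is a hypothetical helper name.

```python
PREFIX = "allele_"


def strip_allele_prefix(record):
    """Return a copy of an allele record with the "allele_" key prefix removed."""
    renamed = {}
    for key, value in record.items():
        new_key = key[len(PREFIX):] if key.startswith(PREFIX) else key
        if new_key in renamed:
            # e.g. a record that has both "type" and "allele_type"
            raise ValueError(f"duplicate key after renaming: {new_key}")
        renamed[new_key] = value
    return renamed


old = {
    "gene": "FBgn0004107",
    "allele_type": "other",
    "allele_name": "Dmel\\Cdk2_UAS.Tag:MYC",
    "allele_description": None,
}
print(strip_allele_prefix(old))
```

Keys without the prefix, such as "gene", pass through unchanged.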

@kimrutherford
Member Author

I think we could add allele synonyms to the JSON like this:

      "FBal0119310": {
        "gene": "FBgn0004107",
        "type": "other",
        "name": "Dmel\\Cdk2_UAS.Tag:MYC",
        "description": null,
        "synonyms": ["synonym_1", "synonym_2"]
      }

Is that OK?

@gm119
Collaborator

gm119 commented Apr 3, 2019

yep, this all looks OK to me (will try and get you updated file by next week, but preparing for biocurator at the mo !)

@kimrutherford
Member Author

will try and get you updated file by next week, but preparing for biocurator at the mo

No rush! It's going to take me a while to add support for allele synonyms.

@kimrutherford
Member Author

Following on from: #1826 (comment)

and from:

The aberration definitions will need to have a "taxon_id" added, like this:

I think it makes sense to add "taxon_id": 7227 to all the alleles in the JSON file, not just the aberrations. Then I'll change the loader to use that taxon ID to set the organism of the single allele genotypes in Canto. If all the genotypes are Dmel that should solve #1826.

@gm119
Collaborator

gm119 commented May 17, 2019

OK, I think I've done all the changes I'm supposed to make to the JSON input (!)

I have done the following:

a. removed the 'allele_' style prefixes from the labels

b. changed how synonyms are stored, so that they are in "synonyms" as an array:

     "FBal0325657" : {
        "description" : null,
        "gene" : "FBgn0027538",
        "name" : "&bgr;4GalNAcTA[UASp.cCa]",
        "synonyms" : [
           "UAS-&bgr;4GalNAcTA",
           "UAS-TA"
        ],
        "type" : "other"
     },

c. if a synonym used by an author is the same as the valid FBal or FBab symbol, it is no longer included in synonyms (to prevent repetition in the display)

so where previously it would have had something like:

  "alleles" : {
     "FBal0013698" : {
        "description" : null,
        "gene" : "FBgn0003074",
        "name" : "Pgi[4]",
        "synonyms" : [
           "Pgi[4]"
        ],
        "type" : "other"
     },

now it has:

  "alleles" : {
     "FBal0013698" : {
        "description" : null,
        "gene" : "FBgn0003074",
        "name" : "Pgi[4]",
        "type" : "other"
     },

d. accessory alleles. If an FBal is 'typically' used as an accessory allele, then the type is now "accessory" rather than "other".

e.g.

     "FBal0147425" : {
        "description" : null,
        "gene" : "FBgn0026367",
        "name" : "Scer\\GAL80[ts.&agr;Tub84B]",
        "synonyms" : [
           "tub-Gal80ts"
        ],
        "type" : "accessory"
     },
     "FBal0218122" : {
        "description" : null,
        "gene" : "FBgn0014445",
        "name" : "Scer\\GAL4[GH146]",
        "synonyms" : [
           "GH146-Gal4"
        ],
        "type" : "accessory"
     },

e. taxon - I haven't added taxon as we decided that would be a global config file thing rather than an allele-by-allele thing in the end.

f. description.

I have added code so that if there is an internal note, this will appear in the 'description' slot (if there is no internal note, the description is still null):

     "FBal0125507" : {
        "description" : "They reference FBrf0167923 for this line. gm160916.",
        "gene" : "FBgn0040505",
        "name" : "Alk[UAS.cLa]",
        "synonyms" : [
           "UAS-dAlk[fl]",
           "UAS-dAlk"
        ],
        "type" : "other"
     },

I have made two versions of the file - one with and one without the descriptions. Am attaching the one without as I wasn't sure that the tool is ready for all the descriptions to be filled in yet
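
Points b, c and f above can be summarised in a short Python sketch. This is not the FlyBase export code (which is Perl); `build_allele_record` is a hypothetical function, and removing repeated synonyms is an extra assumption that the list above does not mention.

```python
def build_allele_record(gene, name, allele_type, author_synonyms=(), internal_note=None):
    """Assemble one allele record in the prefix-free format."""
    record = {
        "description": internal_note,  # stays None when there is no internal note
        "gene": gene,
        "name": name,
        "type": allele_type,
    }
    # Skip synonyms identical to the valid symbol (point c) and, as an
    # assumption of this sketch, any repeats; the original order is kept.
    synonyms = []
    for synonym in author_synonyms:
        if synonym != name and synonym not in synonyms:
            synonyms.append(synonym)
    if synonyms:
        record["synonyms"] = synonyms  # the key is omitted when nothing is left
    return record


print(build_allele_record("FBgn0003074", "Pgi[4]", "other", ["Pgi[4]"]))
```

With the Pgi[4] example the "synonyms" key disappears entirely, matching the "now it has" record shown above.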

@gm119
Collaborator

gm119 commented May 17, 2019

@kimrutherford
Member Author

Thanks Gillian. I think those descriptions are going to be too long for the description field in the display. I'll try loading them as comments so we can see if that's convenient enough.

kimrutherford added a commit that referenced this issue May 21, 2019
Also update code to support new JSON format.

Refs #1779
kimrutherford added a commit that referenced this issue May 21, 2019
Also update code to support new JSON format.

Refs #1779
@kimrutherford
Member Author

I have made two versions of the file - one with and one without the descriptions. Am attaching the one without

Hi Gillian. Sorry I didn't read that carefully. So please ignore my comment about loading the descriptions as comments.

I've modified the code to match the changes suggested above and I've re-loaded the flybase-test canto instance using the new file you made. Everything seems OK:
https://curation.pombase.org/flybase-test/curs/4e78dde4ca76a9c7

When you get a chance, could you attach your file that includes the allele descriptions so I can give that a go?

@kimrutherford
Member Author

kimrutherford commented May 23, 2019

Comment moved to #1872

I've loaded your JSON file with allele descriptions into my local Canto, storing the descriptions as comments. It looks like this. The comment/description pops up if you mouse over the "comment..." link. Do you think that would be convenient enough for seeing the allele descriptions?

@vmt25
Collaborator

vmt25 commented May 23, 2019

Hi Kim

Showing comments/notes/description as in #1872 seems good.
I could not access the test curation @ https://curation.pombase.org/flybase-test/curs/4e78dde4ca76a9c7
Could you provide another link, so that we can play around?

@kimrutherford
Member Author

Hi Vitor.

Here's an example: https://curation.pombase.org/flybase-test/curs/05b4acd2db536c5e

If you ever need other sessions you can go to the front page (https://curation.pombase.org/flybase-test/) and type the PMID of one of the loaded publications into the "Start curating using a PubMed ID:" box. It will say the paper is "currently being curated by someone else" but if you ignore that and click "View session" you should be able to start curating.

Alternatively you can go to the report pages (https://curation.pombase.org/flybase-test/track) and then browse to a publication via the "All publications" report. To get to those pages you'll need to log in: the username is "flybase-admin@pombase.org" and the password is "flybase-admin".
The "All publications" link will take you here:
https://curation.pombase.org/flybase-test/view/list/all_publications?model=track

Then if you navigate to a publication page like this:
https://curation.pombase.org/flybase-test/view/object/pub/440?model=track
you can get to the session for that publication with the "Go to the curation session ..." link.

@gm119
Collaborator

gm119 commented May 24, 2019

Hi Kim, I'm attaching a new JSON file that has placeholder comments in the allele description slot, so that it can be loaded into the main test instance for easier testing.
all_phen_fb_2019_01_09_intnote_placeholder.txt

@kimrutherford
Member Author

Thanks Gillian.

I've edited the file to change "description" : to "comment" : so the descriptions get loaded as comments. We should discuss that plan on the next call.

I've loaded that file into the flybase-test instance. The old sessions have been replaced. Here's a new test session: https://curation.pombase.org/flybase-test/curs/1be1b6ac8921b40a

@kimrutherford
Member Author

I'm going to close this because I think everything is done and it's getting long. If there are other things to do we can open other issues.
