Add more integrity checks on fields in JSON schema 2 #69

Closed
ssarrafan opened this issue May 28, 2021 · 13 comments

@ssarrafan
Collaborator

There are different levels of checks; this issue covers the first two:

  1. Is the ID prefix valid? (e.g. KEGG.KO vs KEGG.ORTHOLOG)
  2. Is the local part of the ID syntactically conformant? (e.g. KEGG:K\d+)

The first is very easy to do using the existing id_prefixes annotated in the schema.
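
A minimal sketch of the first check, in Python. It assumes the allowed prefixes have already been collected from the schema's id_prefixes annotations; the `ALLOWED_PREFIXES` set and the helper names `split_curie` and `has_valid_prefix` are hypothetical and only for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical allow-list. In practice this would be read from the
# id_prefixes entries of the relevant class in the schema YAML.
ALLOWED_PREFIXES = {"KEGG.PATHWAY", "KEGG.REACTION", "KEGG.ORTHOLOGY", "COG"}


def split_curie(curie: str) -> Tuple[str, str]:
    """Split 'PREFIX:local' at the first colon into (prefix, local part)."""
    prefix, sep, local = curie.partition(":")
    if not sep or not prefix or not local:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, local


def has_valid_prefix(curie: str, allowed: Iterable[str] = ALLOWED_PREFIXES) -> bool:
    """Return True if the CURIE's prefix is in the allow-list (case-sensitive)."""
    try:
        prefix, _ = split_curie(curie)
    except ValueError:
        return False
    return prefix in set(allowed)
```

With this allow-list, `has_valid_prefix("KEGG.ORTHOLOGY:K00001")` is true, while an ID written with an unlisted prefix such as `KEGG.KO:K00001` is rejected.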

@ssarrafan
Collaborator Author

Continuation of work discussed (at least on GH) from the May sprint at microbiomedata/nmdc-metadata#308

@ssarrafan
Collaborator Author

Third level of checks handled in microbiomedata/nmdc-metadata#362

@ssarrafan
Collaborator Author

Removing other assignees. @turbomam let me know if this shouldn't be assigned to you.

@turbomam
Member

RE Is the ID prefix valid? (e.g. KEGG.KO vs KEGG.ORTHOLOG)

I see prefix definitions, especially in nmdc-schema/src/schema/annotation.yaml

MAM@MAM-M74 schema % pwd
/Users/MAM/Documents/gitrepos/nmdc-schema/src/schema
MAM@MAM-M74 schema % grep -i kegg *
annotation.yaml:      - KEGG.PATHWAY
annotation.yaml:      - KEGG.REACTION
annotation.yaml:      - KEGG.ORTHOLOGY  ## KO number
core.yaml:      - KEGG.COMPOUND

Anywhere else I should be looking? @wdduncan @cmungall

@turbomam
Member

RE Is the local part of the ID syntactically conformant? (e.g. KEGG:K\d+)

I don't see patterns for the local parts, at least not in nmdc-schema/src/schema/annotation.yaml

  pathway:
    aliases:
      - biological process
      - metabolic pathway
      - signaling pathway
    is_a: functional annotation term
    description: >-
      A pathway is a sequence of steps/reactions carried out by an organism or community of organisms
    slot_usage:
      has_part:
        range: reaction
        multivalued: true
        description: >-
          A pathway can be broken down to a series of reaction step
    id_prefixes:
      - KEGG.PATHWAY
      - COG
    exact_mappings:
      - biolink:Pathway

@turbomam
Member

very rough example for nmdc-schema/src/schema/annotation.yaml from @cmungall

functional annotation term:
    aliases:
      - function
      - functional annotation
    is_a: ontology class
    slot_usage:
      id:
        pattern: "^(KEGG.ORTHOLOG:K\\d+|EC:\\d+\\.ETC)$"
    description: >-
      Abstract grouping class for any term/descriptor that can be applied to a functional unit of a genome (protein, ncRNA, complex).
    abstract: true
    todos:
      - decide if this should be used for product naming
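
The rough pattern above can be exercised with Python's `re` module before it goes into the YAML. This is a sketch under stated assumptions: only the `KEGG.ORTHOLOG` branch is taken from the example (the EC branch is elided there and is left out here), and `LOCAL_PART_PATTERNS`, `build_id_pattern` and `is_conformant` are hypothetical names. One thing it makes visible is that the `.` in the prefix is a regex wildcard unless escaped, so the sketch escapes it.

```python
import re

# Per-prefix local-part patterns. Only the KEGG.ORTHOLOG entry comes from the
# rough example above; further prefixes would be added as extra keys.
LOCAL_PART_PATTERNS = {
    "KEGG.ORTHOLOG": r"K\d+",
}


def build_id_pattern(patterns):
    """Combine the per-prefix patterns into one anchored (A|B|...) alternation."""
    alternatives = [
        # re.escape makes the '.' in the prefix match a literal dot only.
        f"{re.escape(prefix)}:{local}"
        for prefix, local in patterns.items()
    ]
    return re.compile("^(" + "|".join(alternatives) + ")$")


ID_PATTERN = build_id_pattern(LOCAL_PART_PATTERNS)


def is_conformant(curie):
    """Return True if the whole ID matches one of the prefix:local patterns."""
    return ID_PATTERN.match(curie) is not None
```

For example, `is_conformant("KEGG.ORTHOLOG:K00001")` is true, while `KEGG.ORTHOLOG:00001` (missing the leading K) and `KEGGxORTHOLOG:K00001` (no literal dot) are both rejected.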

@turbomam turbomam transferred this issue from microbiomedata/nmdc-metadata Jun 17, 2021
@turbomam
Member

Was microbiomedata/nmdc-metadata issue 360

I will be adding local part patterns to the yaml files in this repo.

@ssarrafan @wdduncan @cmungall

@turbomam
Member

See notes from @cmungall at PR #70, especially

> I suggested the parens to indicate that we need other IDs, e.g (FOO|BAR|...)

@turbomam
Member

turbomam commented Jun 23, 2021

@wdduncan and @cmungall: I can't find patterns for COG or RetroRules in the Bioregistry, or sample usages in MongoDB.

example working query:

> db.raw.functional_annotation_set.find({"has_function": {"$regex": "^pfam", $options: 'i'}})
{ "_id" : ObjectId("6011a09275ead576bdc24c02"), "subject" : "nmdc:Ga0482148_260452_3_287", "has_function" : "PFAM:PF00001", "was_generated_by" : "nmdc:8a43ec3baf8aafe09d96eb7fbf58c916" }
{ "_id" : ObjectId("6011a0d2666867f660864500"), "subject" : "nmdc:Ga0482235_197390_1_279", "has_function" : "PFAM:PF00001", "was_generated_by" : "nmdc:e763e255fa74e2629d7d86e10f838d4b" }
{ "_id" : ObjectId("6011a1113350938c11bd6527"), "subject" : "nmdc:Ga0482263_74753_2_277", "has_function" : "PFAM:PF00001", "was_generated_by" : "nmdc:686818cb31dc45d3d4482847ec007584" }

But neither of these returns any matches:

  • db.raw.functional_annotation_set.find({"has_function": {"$regex": "^cog", $options: 'i'}})
  • db.raw.functional_annotation_set.find({"has_function": {"$regex": "^retrorules:", $options: 'i'}})
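
Rather than probing one prefix at a time, the prefixes actually in use could be tallied in one pass. The sketch below is in Python and assumes the documents have already been exported from the collection as a list of dicts (the export step, e.g. via a MongoDB driver, is not shown). The helper names `prefix_counts` and `matches_prefix` are hypothetical; the sample records are the three returned by the working query above.

```python
import re
from collections import Counter


def prefix_counts(records, field="has_function"):
    """Count the upper-cased CURIE prefixes found in `field` across records."""
    counts = Counter()
    for record in records:
        prefix, sep, _ = record.get(field, "").partition(":")
        if sep:  # skip values that have no 'prefix:' part
            counts[prefix.upper()] += 1
    return counts


def matches_prefix(value, prefix):
    """Case-insensitive start-of-string test, like {"$regex": "^prefix", "$options": "i"}."""
    return re.match(re.escape(prefix), value, flags=re.IGNORECASE) is not None


# The three documents shown in the working query output (ObjectIds omitted).
records = [
    {"subject": "nmdc:Ga0482148_260452_3_287", "has_function": "PFAM:PF00001"},
    {"subject": "nmdc:Ga0482235_197390_1_279", "has_function": "PFAM:PF00001"},
    {"subject": "nmdc:Ga0482263_74753_2_277", "has_function": "PFAM:PF00001"},
]

if __name__ == "__main__":
    print(prefix_counts(records))
```

A prefix that never appears as a key in the resulting counter, as with COG and RetroRules in the queries above, has no sample usages to derive a local-part pattern from.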

turbomam added a commit that referenced this issue Jun 29, 2021
example of finding local part of IDs #69
@turbomam turbomam moved this from In progress to Done in NMDC June 2021 Sprint Jun 29, 2021
@turbomam
Member

@ssarrafan do you have a sense of who raised this concern? Can I close it?

@turbomam
Member

It doesn't seem to be specific to checking the contents of a JSON file against the JSON Schema serialization of the schema.

@turbomam
Member

Is the concern especially about validating KEGG-related CURIEs?

@ssarrafan
Collaborator Author

> @ssarrafan do you have a sense of who raised this concern? Can I close it?

This is from 2021 so I don't remember which meeting this came from. I would say it can probably be closed.

@turbomam turbomam closed this as completed Nov 6, 2023