Linking proteins and humangenes annotation preferred name to identifier #54
Comments
You can link via the OCID to the tables in sciwalker-open-data:ontologies_2020r02. @c-ruttkies may be able to help join from there to another identifier.
Thanks @wetherbeei for the assistance!
However, I am not sure what an OCID is or who mints them. I think they are "Ontology Concept Identifiers" minted by SciWalker, since https://bioregistry.io/ocid:102100007495 redirects to https://www.sciwalker.com/sciwalker/faces/ociddata.xhtml?ocid=102100007495. Any advice @c-ruttkies on how to map OCIDs to other gene nomenclatures, such as ncbigene, ensembl, HGNC, etc., would be much appreciated!
The OntoChem team has added this table to map between OCIDs and other databases: sciwalker-open-data:ontologies_2020r02.ocid_xref
SELECT DISTINCT database FROM sciwalker-open-data.ontologies_2020r02.ocid_xref
1 | RefSeq
Thanks @wetherbeei and @c-ruttkies! Here's a query showing which external databases the human gene OCIDs map to, with the number of mappings per database:
SELECT
database, COUNT(*) AS n_mappings
FROM
`sciwalker-open-data.ontologies_2020r02.human_genes_preflabel` AS human_genes_preflabel
INNER JOIN
`sciwalker-open-data.ontologies_2020r02.ocid_xref` AS ocid_xref
ON
human_genes_preflabel.ocid = ocid_xref.ocid
GROUP BY database
ORDER BY n_mappings DESC
Great to see so many external vocabularies, particularly Ensembl, GeneID, HGNC.
@ibarshai did you want to go ahead and close this issue unless we have any other questions?
And here is an example query to map gene preferred labels to Ensembl gene IDs. It returns 21362 results:
SELECT
DISTINCT human_genes_preflabel.name AS gene_preflabel,
ocid_xref.database_id AS ensembl_id
FROM
`sciwalker-open-data.ontologies_2020r02.human_genes_preflabel` AS human_genes_preflabel
INNER JOIN
`sciwalker-open-data.ontologies_2020r02.ocid_xref` AS ocid_xref
USING
(ocid)
WHERE
ocid_xref.database = "Ensembl"
# exclude ensembl protein mappings
AND ocid_xref.database_id LIKE "ENSG%"
ORDER BY
gene_preflabel,
ensembl_id
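The join pattern in the query above can be tried locally on toy data. The following is a minimal sketch using Python's built-in sqlite3 module rather than BigQuery. The table and column names mirror the thread, but every row is made up for illustration, and the helper names `build_toy_db` and `map_labels` are hypothetical. It also assumes that the `database` value for HGNC mappings is the string "HGNC", which the thread suggests but does not show.

```python
import sqlite3


def build_toy_db():
    """Create in-memory stand-ins for the two tables (all rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE human_genes_preflabel (ocid INTEGER, name TEXT);
        CREATE TABLE ocid_xref (ocid INTEGER, "database" TEXT, database_id TEXT);
        INSERT INTO human_genes_preflabel VALUES (1, 'gene A'), (2, 'gene B');
        INSERT INTO ocid_xref VALUES
          (1, 'Ensembl', 'ENSG00000000001'),
          (1, 'Ensembl', 'ENSP00000000001'),  -- protein ID, should be filtered out
          (1, 'HGNC',    'HGNC:1'),
          (2, 'Ensembl', 'ENSG00000000002');
        """
    )
    return conn


def map_labels(conn, database, id_prefix=""):
    """Map preferred labels to IDs in one external database, joining on ocid."""
    query = """
        SELECT DISTINCT p.name AS gene_preflabel, x.database_id AS external_id
        FROM human_genes_preflabel AS p
        INNER JOIN ocid_xref AS x USING (ocid)
        WHERE x."database" = ?
          AND x.database_id LIKE ?
        ORDER BY gene_preflabel, external_id
    """
    # An empty prefix turns the LIKE pattern into '%', which matches every ID.
    return conn.execute(query, (database, id_prefix + "%")).fetchall()


conn = build_toy_db()
ensembl_rows = map_labels(conn, "Ensembl", "ENSG")
hgnc_rows = map_labels(conn, "HGNC")
print(ensembl_rows)
print(hgnc_rows)
```

Only the `database` value and the ID prefix change between target vocabularies, so the same shape of query should apply to HGNC or GeneID in BigQuery, provided those strings match what the `SELECT DISTINCT database` query returns.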
Note that to map to patent annotations, it's not necessary to go through the preferred-label table first; the annotations can be joined to ocid_xref directly on ocid:
SELECT
DISTINCT ocid_xref.database_id AS ensembl_gene_id,
ocid,
patent_annotations.preferred_name AS patent_annotation_preferred_name,
human_genes_preflabel.name AS gene_preflabel,
patent_annotations.publication_number,
patent_annotations.domain,
-- patent_annotations.source,
-- patent_annotations.confidence,
-- patent_annotations.character_offset_start,
-- patent_annotations.character_offset_end
FROM
`patents-public-data.google_patents_research.annotations` AS patent_annotations
INNER JOIN
`sciwalker-open-data.ontologies_2020r02.ocid_xref` AS ocid_xref
USING
(ocid)
LEFT JOIN
`sciwalker-open-data.ontologies_2020r02.human_genes_preflabel` AS human_genes_preflabel
USING
(ocid)
WHERE
ocid_xref.database = "Ensembl"
# exclude ensembl protein mappings
AND ocid_xref.database_id LIKE "ENSG%"
# restrict to single patent publication for development
AND patent_annotations.publication_number = "JP-3820105-B2"
LIMIT
1000
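The query above uses an INNER JOIN to ocid_xref but a LEFT JOIN to human_genes_preflabel, so annotations whose OCID has no gene preferred label are kept, with an empty label column. Here is a small sketch of that behavior on toy data, again using sqlite3 instead of BigQuery. All rows, including the publication number "XX-1-A", are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE annotations (ocid INTEGER, publication_number TEXT, preferred_name TEXT);
    CREATE TABLE ocid_xref (ocid INTEGER, "database" TEXT, database_id TEXT);
    CREATE TABLE human_genes_preflabel (ocid INTEGER, name TEXT);
    INSERT INTO annotations VALUES (1, 'XX-1-A', 'gene a'), (2, 'XX-1-A', 'gene b');
    INSERT INTO ocid_xref VALUES
      (1, 'Ensembl', 'ENSG00000000001'),
      (2, 'Ensembl', 'ENSG00000000002');
    -- ocid 2 deliberately has no preferred label
    INSERT INTO human_genes_preflabel VALUES (1, 'gene A');
    """
)

rows = conn.execute(
    """
    SELECT a.preferred_name, x.database_id, p.name
    FROM annotations AS a
    INNER JOIN ocid_xref AS x USING (ocid)
    LEFT JOIN human_genes_preflabel AS p USING (ocid)
    WHERE x."database" = 'Ensembl'
      AND x.database_id LIKE 'ENSG%'
    ORDER BY x.database_id
    """
).fetchall()
print(rows)
```

The second row keeps its Ensembl ID while the label comes back as `None` (NULL in SQL). Switching the LEFT JOIN to an INNER JOIN would drop that row instead.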
👋 @wetherbeei I noticed the 2022 release of the "ontologies" at sciwalker-open-data:ontologies_2022r01. Would it be possible to publish ocid_xref for 2022 as well, please?
Also pinging @c-ruttkies, would appreciate your help as well please ^
FYI ocid_xref for 2022 was published, thanks OntoChem/SciWalker. This issue can probably be closed, btw.
I'm looking at annotations in the proteins and humangenes domain in the google_patents_research.annotations table. It appears that the annotations themselves are normalized to their preferred name, but I was wondering if there is any way to link the preferred name of a gene annotation to some kind of unique identifier, such as HGNC, that can be used to ground these annotations?
I see that there is a huge amount of information in the ebi_chembl section, with seemingly promising table names, but I haven't spotted a useful connection by looking through the schemas.