July 2023 update to PrimeKG #11

ayushnoori · 2023-07-14T18:57:56Z

Changes to rebuild PrimeKG and update the knowledge graph to include database releases up to July 2023. Note that 17 scripts in datasets/processing_scripts/ were re-run or updated to build a new version of PrimeKG, while the scripts in datasets/feature_construction/ may be out of date. Primary data sources that were re-run or updated include Bgee, Comparative Toxicogenomics Database, DisGeNET, DrugBank, DrugCentral, NCBI Gene, Gene Ontology, Human Phenotype Ontology, MONDO, Reactome, SIDER, UBERON, and UMLS.

For more information, see primary_data_resources.sh. Changes include the following:

General

Created script to automatically create directory structure, pull data, and run all necessary processing and feature extraction steps.

Fixed broken environment construction script.
Script automatically creates required directories.
Added commands to retrieve gene names, details, and NCBI ID to UniProt ID mapping from www.genenames.org, then output to vocab/gene_names.csv and vocab/gene_map.csv.

Bgee

58405/5257181 gold quality calls with expression rank < 25000 now specify cell type in a particular tissue (e.g., UBERON:0000473 ∩ CL:0000089, which denotes germ line stem cell in testis).
These rows are dropped in bgee.py.
Updated the download URL.
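The row filter described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the code in bgee.py: the column name `anatomical_entity_id`, the helper `drop_cell_type_rows`, and the gene identifiers are assumptions made for the example.

```python
# Minimal sketch of dropping Bgee calls that specify a cell type within a tissue.
# Assumption: such rows carry an intersection symbol in the anatomical entity ID,
# e.g. "UBERON:0000473 ∩ CL:0000089". The column name below is hypothetical.
INTERSECTION = "∩"


def drop_cell_type_rows(rows, id_column="anatomical_entity_id"):
    """Keep only rows whose anatomical entity is a single tissue term."""
    return [row for row in rows if INTERSECTION not in row[id_column]]


# Placeholder gene IDs; only the anatomical entity IDs come from the PR text.
rows = [
    {"gene_id": "GENE_A", "anatomical_entity_id": "UBERON:0000473"},
    {"gene_id": "GENE_B", "anatomical_entity_id": "UBERON:0000473 ∩ CL:0000089"},
]
kept = drop_cell_type_rows(rows)
```

With the two placeholder rows above, only the plain UBERON:0000473 row is kept.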

Comparative Toxicogenomics Database

Updated the download URL.

DisGeNET

No changes needed.

DrugBank

Fixed paths in parsexml_drugbank.py, which now writes its output to a new /parsed subdirectory. Removed extraneous lines in Parsed_feature.ipynb.
✅ Successfully ran drugbank_drug_drug.py and drugbank_drug_protein.py.
⚠️ parsexml_drugbank.py and Parsed_feature.ipynb may need updates.

DrugCentral

Modified drugcentral_queries.txt to work on O2, the Harvard Medical School high-performance computing cluster.
⚠️ drugcentral_feature.Rmd may need updates.

NCBI Gene

No changes needed.

Gene Ontology

Used -L flag to follow redirects. No other changes needed.

Human Phenotype Ontology

Used -L flag to follow redirects. No other changes needed to hpo.py.
Updated hpoa.py to replace old column names with new column names.
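A column rename of this kind might look like the sketch below. The mapping is illustrative only and is not taken from hpoa.py; the exact old and new header names should be checked against the phenotype annotation file actually downloaded. The helper `rename_columns` and the row values are hypothetical.

```python
# Illustrative old -> new header mapping (an assumption, not copied from hpoa.py).
COLUMN_RENAMES = {
    "DatabaseID": "database_id",
    "DiseaseName": "disease_name",
    "HPO_ID": "hpo_id",
}


def rename_columns(rows, renames=COLUMN_RENAMES):
    """Return copies of the rows with old column names replaced by new ones.

    Columns that are not listed in the mapping are passed through unchanged.
    """
    return [
        {renames.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Placeholder values, not real annotations.
old_rows = [
    {"DatabaseID": "EXAMPLE:1", "DiseaseName": "example disease", "HPO_ID": "HP:EXAMPLE"}
]
new_rows = rename_columns(old_rows)
```

Keeping the mapping in one dictionary means a future header change in the source file only requires editing that dictionary.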

MONDO

Added check for NoneType values in external references (line 29).
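A guard of this kind could be sketched as below. The term structure (a dictionary with an `xrefs` key) and the helper `collect_xrefs` are assumptions for illustration and do not reproduce the actual MONDO processing script.

```python
def collect_xrefs(term):
    """Return the external references of a term, tolerating missing values.

    Assumption: a term is a dict whose "xrefs" entry may be absent, None,
    or a list that itself contains None entries.
    """
    xrefs = term.get("xrefs")
    if xrefs is None:
        return []
    return [xref for xref in xrefs if xref is not None]


# Placeholder terms, not real MONDO records.
with_refs = collect_xrefs({"id": "TERM:1", "xrefs": ["DB:1", None, "DB:2"]})
without_refs = collect_xrefs({"id": "TERM:2", "xrefs": None})
```

Returning an empty list instead of None lets downstream loops iterate without a further check.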

Reactome

No changes needed.

SIDER

No changes needed.

UBERON

Checked for NA values and dropped two obsolete terms (UBERON:0039300 and UBERON:0039302) that are not marked as obsolete in the source file.
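The two steps above can be sketched as follows. Only the two term IDs come from the PR text; the term structure, the `name` field, and both helper functions are hypothetical, and the actual UBERON script may work differently.

```python
# Term IDs listed in the PR as obsolete but not flagged as such in the source file.
OBSOLETE_IDS = {"UBERON:0039300", "UBERON:0039302"}


def find_missing_names(terms):
    """Return the IDs of terms whose name is missing (None or empty)."""
    return [term["id"] for term in terms if not term.get("name")]


def drop_obsolete(terms, obsolete_ids=OBSOLETE_IDS):
    """Return the terms whose ID is not in the manually curated obsolete set."""
    return [term for term in terms if term["id"] not in obsolete_ids]


# Placeholder records: the names and the first ID are invented for the example.
terms = [
    {"id": "UBERON:EXAMPLE", "name": "example tissue"},
    {"id": "UBERON:0039300", "name": None},
    {"id": "UBERON:0039302", "name": "placeholder name"},
]
missing = find_missing_names(terms)
cleaned = drop_obsolete(terms)
```

Keeping the manually identified IDs in a named constant makes it easy to review them when the source file is next updated.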

UMLS

UMLS data pulled and paths updated for 2023 data.
⚠️ umls.ipynb may need updates.

payalchandak

Looks good!

ayushnoori added 3 commits July 13, 2023 14:42

Update PrimeKG on 2023.07.10.

dd5b57b

Clean previous notebooks.

b9881f2

Update README with PR details.

91dbb0a

payalchandak approved these changes Jul 16, 2023

payalchandak merged commit 680161b into mims-harvard:main Jul 16, 2023
