July 2023 update to PrimeKG #11

Merged
merged 3 commits into from Jul 16, 2023

ayushnoori

Changes to rebuild PrimeKG and update the knowledge graph to include database releases up to July 2023. Note that 17 scripts in datasets/processing_scripts/ were re-run or updated to build a new version of PrimeKG, while the scripts in datasets/feature_construction/ may be out of date. Primary data sources that were re-run or updated include Bgee, Comparative Toxicogenomics Database, DisGeNET, DrugBank, DrugCentral, NCBI Gene, Gene Ontology, Human Phenotype Ontology, MONDO, Reactome, SIDER, UBERON, and UMLS.

For more information, see primary_data_resources.sh. Changes include the following:

General

Created script to automatically create directory structure, pull data, and run all necessary processing and feature extraction steps.

  • Fixed broken environment construction script.
  • Script automatically creates required directories.
  • Added commands to retrieve gene names, details, and NCBI ID to UniProt ID mapping from www.genenames.org, then output to vocab/gene_names.csv and vocab/gene_map.csv.
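The directory-creation step could look roughly like the sketch below. It is not the script from this PR: the function name `make_data_dirs` and the directory names in `SOURCES` are hypothetical, chosen only to illustrate an idempotent setup step that is safe to re-run.

```python
from pathlib import Path

# Hypothetical layout: these names are illustrative, not the repository's actual list.
SOURCES = ["bgee", "ctd", "drugbank/parsed", "vocab"]


def make_data_dirs(root, sources=SOURCES):
    """Create one subdirectory per data source under root.

    parents=True builds nested paths such as drugbank/parsed, and
    exist_ok=True makes the call a no-op when the directory already exists.
    """
    created = []
    for name in sources:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_data_dirs("data")` twice in a row should leave the tree unchanged the second time, which is what lets a top-level build script run it unconditionally.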

Bgee

  • 58,405 of 5,257,181 gold quality calls with expression rank < 25000 now specify a cell type within a particular tissue (e.g., UBERON:0000473 ∩ CL:0000089, which denotes a germ line stem cell in testis).
  • These rows are dropped in bgee.py.
  • URL updated to here.
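A minimal sketch of the filter described above, assuming each row is a dict keyed by column name. The column names (`Anatomical entity ID`, `Call quality`, `Expression rank`) and the `gold quality` label are assumptions about the Bgee export, and `filter_bgee_rows` is a hypothetical helper, not the code in bgee.py.

```python
def filter_bgee_rows(rows, max_rank=25000):
    """Keep gold quality calls below max_rank that name a single anatomical entity.

    Rows whose anatomical entity ID contains the intersection marker
    (for example "UBERON:0000473 ∩ CL:0000089") are dropped.
    """
    kept = []
    for row in rows:
        if row["Call quality"] != "gold quality":
            continue
        if float(row["Expression rank"]) >= max_rank:
            continue
        if "∩" in row["Anatomical entity ID"]:
            continue
        kept.append(row)
    return kept


# Illustrative rows only; the values are made up for the example.
example_rows = [
    {"Anatomical entity ID": "UBERON:0000473",
     "Call quality": "gold quality", "Expression rank": "1200.5"},
    {"Anatomical entity ID": "UBERON:0000473 ∩ CL:0000089",
     "Call quality": "gold quality", "Expression rank": "300"},
    {"Anatomical entity ID": "UBERON:0002107",
     "Call quality": "silver quality", "Expression rank": "50"},
]
filtered = filter_bgee_rows(example_rows)
```

Matching on the `∩` character keeps the check independent of which ontologies appear on either side of the intersection.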

Comparative Toxicogenomics Database

  • URL updated to here.

DisGeNET

  • No changes needed.

DrugBank

  • Fixed paths in parsexml_drugbank.py. Output to new /parsed subdirectory. Removed extraneous lines in Parsed_feature.ipynb.
  • ✅ Successfully ran drugbank_drug_drug.py and drugbank_drug_protein.py.
  • ⚠️ parsexml_drugbank.py and Parsed_feature.ipynb may need updates.

DrugCentral

  • Modified drugcentral_queries.txt to work on O2, the Harvard Medical School high-performance computing cluster.
  • ⚠️ drugcentral_feature.Rmd may need updates.

NCBI Gene

  • No changes needed.

Gene Ontology

  • Used -L flag to follow redirects. No other changes needed.

Human Phenotype Ontology

  • Used -L flag to follow redirects. No other changes needed to hpo.py.
  • Updated hpoa.py to replace old column names with new column names.
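One way to handle a header rename like this is to normalize column names once, right after reading the file. The sketch below is illustrative only: the mapping in `OLD_TO_NEW` is an assumption about how the annotation file's header changed and should be checked against the downloaded file, and `normalize_header` is a hypothetical helper, not the change made in hpoa.py.

```python
# Assumed mapping from old to new column names; verify against the actual file header.
OLD_TO_NEW = {
    "#DatabaseID": "database_id",
    "DiseaseName": "disease_name",
    "HPO_ID": "hpo_id",
}


def normalize_header(columns, mapping=OLD_TO_NEW):
    """Return the column list with any old-style names replaced by new-style names.

    Names not present in the mapping pass through unchanged, so the
    function accepts either version of the file.
    """
    return [mapping.get(col, col) for col in columns]
```

Because unknown names pass through untouched, downstream code can refer to the new names regardless of which release was downloaded.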

MONDO

  • Added check for NoneType values in external references (line 29).
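The kind of guard described here might look like the following sketch. It assumes each term is a dict with an optional `xrefs` entry holding `PREFIX:ID` strings; that structure and the helper name `extract_xrefs` are hypothetical and do not reproduce the MONDO script in this PR.

```python
def extract_xrefs(term):
    """Return (prefix, identifier) pairs for a term's external references.

    Handles a missing or None xrefs field, as well as None entries
    inside the list, instead of raising on NoneType.
    """
    xrefs = term.get("xrefs")
    if xrefs is None:
        return []
    pairs = []
    for xref in xrefs:
        if xref is None or ":" not in xref:
            continue
        prefix, identifier = xref.split(":", 1)
        pairs.append((prefix, identifier))
    return pairs
```

Returning an empty list rather than None keeps callers free of their own None checks.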

Reactome

  • No changes needed.

SIDER

  • No changes needed.

UBERON

  • Checked for NA values and dropped two obsolete terms (UBERON:0039300 and UBERON:0039302) that are not marked as obsolete in the source file.
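As a rough sketch, the cleanup above can be expressed as an explicit exclusion list plus a missing-value check. The term structure (dicts with `id` and `name` keys) and the helper `clean_uberon_terms` are assumptions for illustration; only the two term IDs come from this PR.

```python
# The two IDs are from the PR description; everything else here is illustrative.
OBSOLETE_IDS = {"UBERON:0039300", "UBERON:0039302"}


def clean_uberon_terms(terms, obsolete_ids=OBSOLETE_IDS):
    """Split terms into those to keep and those with a missing name.

    Terms in obsolete_ids are dropped outright. Terms with a missing or
    empty name are returned separately so they can be inspected by hand.
    """
    kept, missing = [], []
    for term in terms:
        if term.get("id") in obsolete_ids:
            continue
        if not term.get("name"):
            missing.append(term)
            continue
        kept.append(term)
    return kept, missing
```

Hard-coding the IDs is a deliberate choice in this sketch, since the source file gives no flag to filter on for these two terms.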

UMLS

  • UMLS data pulled and paths updated for 2023 data.
  • ⚠️ umls.ipynb may need updates.

@payalchandak payalchandak left a comment

Looks good!

@payalchandak payalchandak merged commit 680161b into mims-harvard:main Jul 16, 2023