# TABLE OF CONTENTS

## 1 USE CASE: COVID-19 
###  &emsp; 1.1 What genes are connected to COVID-19?
####  &emsp; &emsp; 1.1.1 COVID-19 -> Genes (determine directly related) 
####  &emsp; &emsp; 1.1.2 COVID-19 -> All intermediate node types -> Genes
###  &emsp; 1.2 What are the symptoms that are related to COVID-19?
####  &emsp; &emsp; 1.2.1 COVID-19 -> Symptoms (PhenotypicFeature, BiologicalProcess)
###  &emsp; 1.3 Which of the genes related to COVID-19 are related to symptoms of COVID-19? 
####  &emsp; &emsp; 1.3.1 Genes (from 1.1) -> Symptoms (From 1.2.1)
####  &emsp; &emsp; 1.3.2 Genes (from 1.1) -> [Drugs, SequenceVariant, Pathways, MolecularActivity] -> Symptoms (From 1.2.1)
###  &emsp; 1.4 What proteins/genes are in pathways of known COVID-19 related genes? Which of these can be related to symptoms? 
####  &emsp; &emsp; 1.4.1 Genes (from 1.1.1) -> Pathways -> Genes
####  &emsp; &emsp; 1.4.2 COVID-19 Symptoms -> Pathways -> Genes
###  &emsp; 1.5 In what way can co-occurrence data from COHD EHR data (conditions, drugs, and procedures) be used to further identify or establish genes associated with COVID-19? 
####  &emsp; &emsp; 1.5.1 Co-occurrence of related conditions (parent diseases, siblings) and drugs
####  &emsp; &emsp; 1.5.2 Co-occurrence of related drugs and related symptoms 

In [104]:
###### CODE SETUP 

## First get all the functions set up
import pandas as pd
import requests
import difflib
import math

# import itables.interactive
# from itables import show
# import itables.options as opt
# opt.maxBytes = 10000000


## Load BTE
from biothings_explorer.user_query_dispatcher import FindConnection
from biothings_explorer.hint import Hint
ht = Hint()

## Functions that will be used
# Run the Predict query once per intermediate node type and combine the results
def predict_many(input_object, intermediate_node_list, output_type):
    df_list = []
    for inter in intermediate_node_list:
        try: 
            print("Intermediate Node type running:")
            print(inter)
            fc = FindConnection(input_obj=input_object, output_obj=output_type, intermediate_nodes=[inter])
            fc.connect(verbose=False)
            df = fc.display_table_view()
            rows = df.shape[0]
            if(rows > 0):
                df_list.append(df)
        except Exception:  # a bare except would also catch KeyboardInterrupt
            print("FAILED")
    if(len(df_list) > 0):
        return pd.concat(df_list)
    else:
        return None
    
# all intermediate node types

node_type_list = (['Gene', 'SequenceVariant', 'ChemicalSubstance', 'Disease', 
                   'MolecularActivity', 'BiologicalProcess', 'CellularComponent', 
                   'Pathway', 'AnatomicalEntity', 'PhenotypicFeature'])
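
`predict_many` prints `FAILED` without recording which node type failed or why. The sketch below shows one way to keep that information. It uses a hypothetical `run_query` callable as a stand-in for the `FindConnection` call; `collect_results`, `run_query`, and `fake_query` are names invented for this illustration and are not part of BTE.

```python
def collect_results(node_types, run_query):
    """Run run_query(node_type) for each type; return (results, failures).

    run_query is a hypothetical stand-in for the FindConnection call in
    predict_many; it is an assumption of this sketch, not a BTE function.
    """
    results = {}
    failures = {}
    for node_type in node_types:
        try:
            results[node_type] = run_query(node_type)
        except Exception as err:  # record the reason instead of printing "FAILED"
            failures[node_type] = repr(err)
    return results, failures


def fake_query(node_type):
    # Toy query that fails for one node type, to exercise the error path
    if node_type == "ChemicalSubstance":
        raise RuntimeError("API call failed")
    return [node_type + "-row"]


results, failures = collect_results(["Gene", "ChemicalSubstance"], fake_query)
print(results)
print(failures)
```

Returning the failures alongside the results makes it possible to rerun only the node types that did not complete.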

## 1.1 What genes are connected to COVID-19?

### 1.1.1 COVID-19 -> Genes (determine directly related) 

In [19]:
## get COVID-19
covid19 = ht.query("COVID-19")['Disease'][0]
covid19

{'MONDO': 'MONDO:0100096',
 'DOID': 'DOID:0080600',
 'name': 'COVID-19',
 'primary': {'identifier': 'MONDO',
  'cls': 'Disease',
  'value': 'MONDO:0100096'},
 'display': 'MONDO(MONDO:0100096) DOID(DOID:0080600) name(COVID-19)',
 'type': 'Disease'}

In [20]:
fc = FindConnection(input_obj=covid19, output_obj='Gene', intermediate_nodes=None)
fc.connect(verbose=True)
covid19_to_genes = fc.display_table_view()
covid19_to_genes


BTE will find paths that join 'COVID-19' and 'Gene'.                   Paths will have 0 intermediate node.




==== Step #1: Query path planning ====

Because COVID-19 is of type 'Disease', BTE will query our meta-KG for APIs that can take 'Disease' as input and 'Gene' as output

BTE found 8 apis:

API 1. mgi_gene2phenotype(1 API call)
API 2. hetio(1 API call)
API 3. pharos(1 API call)
API 4. scibite(1 API call)
API 5. cord_disease(1 API call)
API 6. biolink(1 API call)
API 7. DISEASES(1 API call)
API 8. scigraph(1 API call)


==== Step #2: Query path execution ====
NOTE: API requests are dispatched in parallel, so the list of APIs below is ordered by query time.

API 5.1: https://biothings.ncats.io/cord_disease/query?fields=associated_with (POST -d q=DOID:0080600&scopes=doid)
API 6.1: https://api.monarchinitiative.org/api/bioentity/disease/MONDO:0100096/genes?rows=200
API 1.1: https://pending.biothings.io/mgigene2phenotype/query?fields=_id&size=300 (POST -d q=DOID:0080600&scopes=mgi

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,output_type,output_name,output_id
0,COVID-19,Disease,related_to,DISEASE,DISEASES API,,Gene,EID2,NCBIGene:163126
1,COVID-19,Disease,related_to,DISEASE,DISEASES API,,Gene,ACE2,NCBIGene:59272
2,COVID-19,Disease,related_to,scigraph,Automat CORD19 Scigraph API,,Gene,ACE2,NCBIGene:59272
3,COVID-19,Disease,related_to,scigraph,Automat CORD19 Scigraph API,,Gene,MARS1,NCBIGene:4141
4,COVID-19,Disease,related_to,scigraph,Automat CORD19 Scigraph API,,Gene,SON,NCBIGene:6651
5,COVID-19,Disease,related_to,scigraph,Automat CORD19 Scigraph API,,Gene,TH,NCBIGene:7054
6,COVID-19,Disease,related_to,scigraph,Automat CORD19 Scigraph API,,Gene,TMPRSS2,NCBIGene:7113
7,COVID-19,Disease,related_to,scigraph,Automat CORD19 Scigraph API,,Gene,POR,NCBIGene:5447
8,COVID-19,Disease,related_to,scigraph,Automat CORD19 Scigraph API,,Gene,CRP,NCBIGene:1401


In [21]:
# Count how many result rows mention each gene, then sort ascending by count
i = list(covid19_to_genes["output_name"])
d = {x:i.count(x) for x in i}
sorted_genes_covid_2_genes = {k: v for k, v in sorted(d.items(), key=lambda item: item[1])}
sorted_genes_covid_2_genes


{'EID2': 1,
 'MARS1': 1,
 'SON': 1,
 'TH': 1,
 'TMPRSS2': 1,
 'POR': 1,
 'CRP': 1,
 'ACE2': 2}
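
The counting cell above calls `list.count` once per element, which is quadratic in the number of rows. A minimal sketch of the same tally with `collections.Counter`, using the gene names from the table above as literal input (the variable names here are illustrative only):

```python
from collections import Counter

# Gene names copied from the output_name column of the table above
output_names = ["EID2", "ACE2", "ACE2", "MARS1", "SON", "TH", "TMPRSS2", "POR", "CRP"]

counts = Counter(output_names)
# Sort ascending by count, as the cell above does; sorted() is stable, so
# genes with equal counts keep their first-seen order
sorted_counts = dict(sorted(counts.items(), key=lambda item: item[1]))
print(sorted_counts)
```

This matters more for the two-step query later on, where the result table has thousands of rows.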

### 1.1.2 COVID-19 -> All intermediate node types -> Genes

In [22]:
covid_allNodes_Genes = predict_many(covid19,node_type_list,'Gene')

Intermediate Node type running:
Gene
Intermediate Node type running:
SequenceVariant
Intermediate Node type running:
ChemicalSubstance
API 2.1 pharos failed
Intermediate Node type running:
Disease
Intermediate Node type running:
MolecularActivity
Intermediate Node type running:
BiologicalProcess
Intermediate Node type running:
CellularComponent
Intermediate Node type running:
Pathway
Intermediate Node type running:
AnatomicalEntity
Intermediate Node type running:
PhenotypicFeature


In [23]:
## Number of result rows across all intermediate node types (a gene can appear in many rows)
len(list(covid_allNodes_Genes["output_name"]))

14066

In [24]:
i = list(covid_allNodes_Genes["output_name"])
d = {x:i.count(x) for x in i}
sorted_genes_covid_2_allNodes_2_genes = {k: v for k, v in sorted(d.items(), key=lambda item: item[1])}
for x in list(reversed(list(sorted_genes_covid_2_allNodes_2_genes)))[0:50]:
    print(str(x) + ": " + str(sorted_genes_covid_2_allNodes_2_genes[x]))

TNF: 44
CYP3A4: 33
CAT: 32
INS: 27
C0017337: 26
C0014442: 26
CYP2D6: 25
AKT1: 25
ANG: 24
IL6: 23
ABCB1: 22
TP53: 19
FOS: 18
C1705556: 18
MAPK1: 18
ACE2: 18
ACE: 18
CYP1A2: 17
C0010762: 17
HIF1A: 17
AR: 17
SQSTM1: 17
APP: 17
TLR9: 16
PPIG: 16
ALB: 16
CDKN1A: 16
TH: 16
C0164786: 16
CD4: 16
SOD1: 15
VEGFA: 15
CYP2C9: 15
IL1B: 15
BAX: 15
LEP: 15
C1705526: 15
EGFR: 15
IFNA1: 15
RELA: 15
CAMP: 15
C0010531: 15
C0030956: 15
SOD2: 14
EPO: 14
MPO: 14
MTOR: 14
MAPK8: 14
CA2: 14
CASP3: 14


In [35]:
## store top 50 genes
top_50_related_genes_covid_2_allNodes_2_genes = list(reversed(list(sorted_genes_covid_2_allNodes_2_genes)))[0:50]
top_50_related_genes_covid_2_allNodes_2_genes = top_50_related_genes_covid_2_allNodes_2_genes + list(sorted_genes_covid_2_genes.keys())
top_50_related_genes_covid_2_allNodes_2_genes = list(dict.fromkeys(top_50_related_genes_covid_2_allNodes_2_genes))
top_50_related_genes_covid_2_allNodes_2_genes

['TNF',
 'CYP3A4',
 'CAT',
 'INS',
 'C0017337',
 'C0014442',
 'CYP2D6',
 'AKT1',
 'ANG',
 'IL6',
 'ABCB1',
 'TP53',
 'FOS',
 'C1705556',
 'MAPK1',
 'ACE2',
 'ACE',
 'CYP1A2',
 'C0010762',
 'HIF1A',
 'AR',
 'SQSTM1',
 'APP',
 'TLR9',
 'PPIG',
 'ALB',
 'CDKN1A',
 'TH',
 'C0164786',
 'CD4',
 'SOD1',
 'VEGFA',
 'CYP2C9',
 'IL1B',
 'BAX',
 'LEP',
 'C1705526',
 'EGFR',
 'IFNA1',
 'RELA',
 'CAMP',
 'C0010531',
 'C0030956',
 'SOD2',
 'EPO',
 'MPO',
 'MTOR',
 'MAPK8',
 'CA2',
 'CASP3',
 'EID2',
 'MARS1',
 'SON',
 'TMPRSS2',
 'POR',
 'CRP']
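
Note that the merged list holds 56 names rather than 50: the top 50 two-step genes plus the six direct hits that were not already present. The merge relies on `dict.fromkeys` to drop duplicates while keeping first-seen order. A small sketch of that idiom, with shortened stand-in lists rather than the real ones:

```python
ranked = ["TNF", "ACE2", "TH"]            # shortened stand-in for the top-50 list
direct = ["EID2", "ACE2", "TH", "CRP"]    # shortened stand-in for the direct hits

# dict keys are unique and keep insertion order, so this removes repeats
# without reordering the ranked genes
merged = list(dict.fromkeys(ranked + direct))
print(merged)
```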

In [36]:
top_genes_pub_counts = {}
for index, row in covid_allNodes_Genes.iterrows():
    if row["output_name"] in top_50_related_genes_covid_2_allNodes_2_genes:
        current_pubcount = 0
        # Each pubmed field is a comma-separated string of IDs; None means no citations
        if row["pred1_pubmed"] is not None:
            current_pubcount = current_pubcount + row["pred1_pubmed"].count(",") + 1
        if row["pred2_pubmed"] is not None:
            current_pubcount = current_pubcount + row["pred2_pubmed"].count(",") + 1
        if row["output_name"] in top_genes_pub_counts:
            top_genes_pub_counts[row["output_name"]] = top_genes_pub_counts[row["output_name"]] + current_pubcount
        else: 
            top_genes_pub_counts[row["output_name"]] = current_pubcount

top_genes_pub_counts
# for x in top_50_related_genes_covid_2_allNodes_2_genes:
    

{'C0014442': 331,
 'C0017337': 196,
 'C0030956': 38,
 'C0010531': 66,
 'ACE': 32,
 'ANG': 78,
 'ACE2': 24,
 'APP': 21,
 'CAMP': 71,
 'MAPK1': 40,
 'RELA': 32,
 'CAT': 98,
 'TMPRSS2': 3,
 'INS': 53,
 'C1705556': 146,
 'MARS1': 5,
 'AKT1': 51,
 'IFNA1': 15,
 'TNF': 114,
 'CD4': 10,
 'EGFR': 8,
 'CASP3': 75,
 'C0164786': 27,
 'C1705526': 31,
 'CYP2D6': 18,
 'FOS': 34,
 'TH': 541,
 'SQSTM1': 45,
 'CDKN1A': 29,
 'ALB': 9,
 'AR': 11,
 'LEP': 20,
 'CA2': 80,
 'MAPK8': 24,
 'MTOR': 21,
 'TP53': 42,
 'MPO': 33,
 'BAX': 30,
 'PPIG': 45,
 'EPO': 26,
 'HIF1A': 205,
 'POR': 15,
 'IL1B': 39,
 'IL6': 58,
 'TLR9': 15,
 'C0010762': 80,
 'CYP1A2': 25,
 'CYP3A4': 69,
 'CYP2C9': 11,
 'ABCB1': 25,
 'VEGFA': 150,
 'CRP': 0,
 'SOD1': 26,
 'SON': 2,
 'SOD2': 40}
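
The publication count above assumes every non-`None` cell is a comma-separated string. If a cell ever holds NaN or an empty string instead, `.count(",") + 1` would either raise or report one publication too many. A hedged sketch of a more defensive counter follows; `count_pubmed_ids` is a hypothetical helper, and unlike the cell above it counts distinct IDs, so repeated IDs are counted once.

```python
import math

def count_pubmed_ids(value):
    """Count distinct IDs in a comma-separated string; 0 for None, NaN, or empty."""
    if value is None:
        return 0
    if isinstance(value, float) and math.isnan(value):
        return 0
    # Split on commas, trim whitespace, and drop empty fragments
    ids = {part.strip() for part in str(value).split(",") if part.strip()}
    return len(ids)

print(count_pubmed_ids("10407773,9732974,21937992"))
print(count_pubmed_ids(None))
print(count_pubmed_ids(float("nan")))
```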

## 1.2 What are the symptoms that are related to COVID-19?

### COVID-19 -> PhenotypicFeature

In [37]:
fc = FindConnection(input_obj=covid19, output_obj='PhenotypicFeature', intermediate_nodes=None)
fc.connect(verbose=False)
covid19_2_phenotypic_feature = fc.display_table_view()
covid19_2_phenotypic_feature

## no results 

In [38]:
## Try the broader coronavirus infection class instead of COVID-19 alone
corona = ht.query("CORONAVINAE INFECTIOUS DISEASE")['Disease'][0]
corona

{'MONDO': 'MONDO:0005719',
 'name': 'Coronavinae infectious disease',
 'MESH': 'D018352',
 'primary': {'identifier': 'MONDO',
  'cls': 'Disease',
  'value': 'MONDO:0005719'},
 'display': 'MONDO(MONDO:0005719) MESH(D018352) name(Coronavinae infectious disease)',
 'type': 'Disease'}

In [39]:
fc = FindConnection(input_obj=corona, output_obj='PhenotypicFeature', intermediate_nodes=None)
fc.connect(verbose=False)
covid19_2_phenotypic_feature = fc.display_table_view()
covid19_2_phenotypic_feature

## no results 

### COVID-19 -> BiologicalProcess

In [40]:
fc = FindConnection(input_obj=covid19, output_obj='BiologicalProcess', intermediate_nodes=None)
fc.connect(verbose=False)
covid19_2_biologicalProcess = fc.display_table_view()
covid19_2_biologicalProcess

In [41]:
# Try the broader coronavirus infection class again
fc = FindConnection(input_obj=corona, output_obj='BiologicalProcess', intermediate_nodes=None)
fc.connect(verbose=False)
covid19_2_biologicalProcess = fc.display_table_view()
covid19_2_biologicalProcess

## Determine symptoms from: http://www.diseasesdatabase.com/relationships.asp?glngUserChoice=60833&bytRel=2&blnBW=0&strBB=LR&blnClassSort=255&Key={A27BEC6F-30C5-4893-BB0F-9FEB5589DEB3}


In [47]:
# Symptoms and signs:
    
# Cough
#    Coughing
# Diarrhoea
#     Loose stools
#     Diarrhea
# Myalgia
#     Myodynia
#     Muscle pain
# Pyrexia
#     Body temperature increased
#     Febrile
#     Fever
#     Hyperthermia
# Taste disturbance
#     Ageusia
#     Dysgeusia
#     Hypogeusia
#     Parageusia


# Haematological abnormalities:
# Lymphocytopenia
#     Lymphopenia
#     Lymphocyte count low (peripheral blood)

# Biochemical abnormalities:
# Lactate dehydrogenase levels raised (plasma or serum)
#     LDH raised

# Cardiac and vascular conditions:
# Myocarditis

# Inflammatory conditions:
# Pneumonia
#     Pneumonitis
#     Pulmonary inflammation


symptom_and_phenotype_list = ['Cough','Coughing','Diarrhoea','Loose stools','Diarrhea','Myalgia','Myodynia',
                              'Muscle pain','Pyrexia','Body temperature increased','Febrile','Fever','Hyperthermia',
                              'Taste disturbance','Ageusia','Dysgeusia','Hypogeusia','Parageusia','Lymphocytopenia',
                              'Lymphopenia','Lymphocyte count low (peripheral blood)',
                              'Lactate dehydrogenase levels raised (plasma or serum)','LDH raised','Myocarditis',
                              'Pneumonia','Pneumonitis','Pulmonary inflammation']



symptom_and_phenotype_list = [x.lower() for x in symptom_and_phenotype_list]
symptom_and_phenotype_list
# symptom_and_phenotype_list 

['cough',
 'coughing',
 'diarrhoea',
 'loose stools',
 'diarrhea',
 'myalgia',
 'myodynia',
 'muscle pain',
 'pyrexia',
 'body temperature increased',
 'febrile',
 'fever',
 'hyperthermia',
 'taste disturbance',
 'ageusia',
 'dysgeusia',
 'hypogeusia',
 'parageusia',
 'lymphocytopenia',
 'lymphopenia',
 'lymphocyte count low (peripheral blood)',
 'lactate dehydrogenase levels raised (plasma or serum)',
 'ldh raised',
 'myocarditis',
 'pneumonia',
 'pneumonitis',
 'pulmonary inflammation']

## 1.2.3 Determine symptoms through the HPO API:
https://hpo.jax.org/webjars/swagger-ui/3.20.9/index.html?url=/api/hpo/docs/#/Search/search

In [48]:
disease_name = 'coronavirus'
r = requests.get('https://hpo.jax.org/api/hpo/search/?q=' + disease_name)
res = r.json()
# print(res)
# pick result, base 0
result_number = 0
disease_id = res['diseases'][result_number]['diseaseId']

r = requests.get("http://hpo.jax.org/api/hpo/disease/" + disease_id)
res = r.json()
# print(res)
HP_IDs = []
for x in res['catTermsMap']:
    for y in x["terms"]:
        print(y["name"].lower())
        print(y["ontologyId"])
        HP_IDs.append(y["ontologyId"])

symptoms = []
for x in HP_IDs: 
    r = requests.get('https://biothings.ncats.io/hpo/phenotype/' + x)
    res = r.json()
    if '_id' in res and 'name' in res:
        symptoms.append(res['name'].lower())
    if('synonym' in res):
        for z in res['synonym']:
            if('EXACT' in z):
                name = z.split('"')[1].lower()
                if name not in symptoms: 
                    symptoms.append(name)
        
        

print(symptoms)

pharyngitis
HP:0025439
fever
HP:0001945
diabetes mellitus
HP:0000819
acute kidney injury
HP:0001919
immunodeficiency
HP:0002721
headache
HP:0002315
respiratory distress
HP:0002098
dyspnea
HP:0002094
respiratory failure requiring assisted ventilation
HP:0004887
acute infectious pneumonia
HP:0011949
cough
HP:0012735
chronic lung disease
HP:0006528
hypoxemia
HP:0012418
myalgia
HP:0003326
['pharyngitis', 'fever', 'hyperthermia', 'pyrexia', 'diabetes mellitus', 'acute kidney injury', 'acute kidney failure', 'acute renal failure', 'immunodeficiency', 'decreased immune function', 'immune deficiency', 'headache', 'headaches', 'respiratory distress', 'breathing difficulties', 'difficulty breathing', 'respiratory difficulties', 'dyspnea', 'abnormal breathing', 'breathing difficulty', 'difficult to breathe', 'dyspnoea', 'trouble breathing', 'respiratory failure requiring assisted ventilation', 'respiratory distress necessitating mechanical ventilation', 'respiratory distress requiring endotrachea
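
The synonym loop above treats any string containing `EXACT` as an exact synonym and takes the text between the first pair of double quotes. A slightly stricter sketch is shown here: it checks that the token right after the quoted name is `EXACT`. The sample lines are hypothetical and only assume the `"name" SCOPE ...` shape implied by the parsing code; the actual strings returned by the API may differ.

```python
import re

def exact_synonyms(synonym_lines):
    """Return lowercased names from lines tagged EXACT, in first-seen order."""
    names = []
    for line in synonym_lines:
        # Capture the quoted name and the scope token that follows it
        match = re.match(r'\s*"([^"]*)"\s+(\w+)', line)
        if match and match.group(2) == "EXACT":
            name = match.group(1).lower()
            if name not in names:
                names.append(name)
    return names

# Hypothetical sample lines, assumed to follow the pattern: "name" SCOPE ...
sample = [
    '"Hyperthermia" EXACT layperson []',
    '"High temperature" RELATED []',
    '"Pyrexia" EXACT []',
]
print(exact_synonyms(sample))
```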

In [49]:
del symptoms[symptoms.index('diabetes mellitus')]
symptom_and_phenotype_list = symptoms

symptom_and_phenotype_list.append('blood coagulation')
symptom_and_phenotype_list.append('coagulation')
symptom_and_phenotype_list.append('blood clotting')
symptom_and_phenotype_list

['pharyngitis',
 'fever',
 'hyperthermia',
 'pyrexia',
 'acute kidney injury',
 'acute kidney failure',
 'acute renal failure',
 'immunodeficiency',
 'decreased immune function',
 'immune deficiency',
 'headache',
 'headaches',
 'respiratory distress',
 'breathing difficulties',
 'difficulty breathing',
 'respiratory difficulties',
 'dyspnea',
 'abnormal breathing',
 'breathing difficulty',
 'difficult to breathe',
 'dyspnoea',
 'trouble breathing',
 'respiratory failure requiring assisted ventilation',
 'respiratory distress necessitating mechanical ventilation',
 'respiratory distress requiring endotracheal intubation',
 'respiratory distress requiring mechanical ventilation',
 'acute infectious pneumonia',
 'cough',
 'coughing',
 'chronic lung disease',
 'hypoxemia',
 'low blood oxygen level',
 'myalgia',
 'muscle ache',
 'muscle pain',
 'blood coagulation',
 'coagulation',
 'blood clotting']

## 1.3 Which of the genes related to COVID-19 are related to symptoms of COVID-19?

### 1.3.1 Genes (from 1.1) -> Symptoms (From 1.2.1)

#### 1.3.1.1 Gene -> Phenotype type "symptoms"

In [50]:
df_list = []
for x in top_50_related_genes_covid_2_allNodes_2_genes: 
#     print(x)
    try: 
        gene = ht.query(x)["Gene"][0]
        fc = FindConnection(input_obj=gene, output_obj='PhenotypicFeature', intermediate_nodes=None)
        fc.connect(verbose=False)
        df = fc.display_table_view()
        rows = df.shape[0]
        if(rows > 0):
            df_list.append(df)
    except Exception:  # e.g. the name did not resolve to a Gene hint
        print(str(x) + " FAILED")
if(len(df_list) > 0):
    top50gene_2_phenotypicFeature = pd.concat(df_list)


C0017337 FAILED
C0014442 FAILED
C1705556 FAILED
C0010762 FAILED
C0164786 FAILED
C1705526 FAILED
C0010531 FAILED
C0030956 FAILED
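
All eight names that fail here have the same shape, a `C` followed by seven digits, which looks like a UMLS-style concept identifier rather than a gene symbol. Assuming that reading is right, they could be separated out before querying. The sketch below is one way to do so; `split_symbols` is a hypothetical helper, and the pattern is an assumption based only on the failures listed above.

```python
import re

# Assumption: names made of "C" followed by exactly seven digits are
# unresolved concept identifiers rather than gene symbols
CUI_PATTERN = re.compile(r"C\d{7}")

def split_symbols(names):
    """Separate likely gene symbols from identifier-like names."""
    symbols = [n for n in names if not CUI_PATTERN.fullmatch(n)]
    identifiers = [n for n in names if CUI_PATTERN.fullmatch(n)]
    return symbols, identifiers

sample = ["TNF", "C0017337", "CAT", "CA2", "C1705556", "CD4"]
symbols, identifiers = split_symbols(sample)
print(symbols)
print(identifiers)
```

Using `fullmatch` keeps real symbols that merely start with `C`, such as `CAT`, `CA2`, and `CD4`.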


In [51]:
top50gene_2_phenotypicFeature.shape

(1400, 9)

In [52]:
## Get names for HP ids
HP_ids = top50gene_2_phenotypicFeature[top50gene_2_phenotypicFeature["output_name"].str.contains("HP:",regex=False)]["output_name"]
HP_ids = list(HP_ids)
HP_ids = list(dict.fromkeys(HP_ids))
len(HP_ids)
HP_dict = {}
for x in HP_ids: 
    HP_ID = x.split(':')[1]
    r = requests.get('https://biothings.ncats.io/hpo/phenotype/HP%3A' + HP_ID)
    res = r.json()
    if '_id' in res and 'name' in res:
        HP_dict[res['_id']] = res['name'].lower()

In [53]:
def get_similar_phen_indices(list1,list2,similarity):
    # Return positions in list1 whose resolved name is similar to any term in list2
    res = []
    for i, name in enumerate(list1):
        lookup = name.lower()
        # Swap an HP identifier for its readable name when HP_dict has one
        if 'HP:' in name and name in HP_dict:
            lookup = HP_dict[name]
        append_i = False
        for j in list2:
            if difflib.SequenceMatcher(None,lookup,j).ratio() > similarity:
                print("Matched similar terms:")
                print(lookup + ' and ' + j)
                append_i = True
        if append_i:
            res.append(i)
    print(len(res))
    return res


In [54]:
phen_indices = get_similar_phen_indices(list(top50gene_2_phenotypicFeature["output_name"]),symptom_and_phenotype_list,0.9)

Matched similar terms:
dyspnea and dyspnea
Matched similar terms:
dyspnea and dyspnoea
Matched similar terms:
fever and fever
Matched similar terms:
dyspnea and dyspnea
Matched similar terms:
dyspnea and dyspnoea
Matched similar terms:
headache and headache
Matched similar terms:
headache and headaches
Matched similar terms:
immunodeficiency and immunodeficiency
Matched similar terms:
immunodeficiency and immune deficiency
Matched similar terms:
myalgia and myalgia
Matched similar terms:
dyspnea and dyspnea
Matched similar terms:
dyspnea and dyspnoea
Matched similar terms:
headache and headache
Matched similar terms:
headache and headaches
Matched similar terms:
fever and fever
Matched similar terms:
dyspnea and dyspnea
Matched similar terms:
dyspnea and dyspnoea
Matched similar terms:
cough and cough
Matched similar terms:
dyspnea and dyspnea
Matched similar terms:
dyspnea and dyspnoea
12
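
The matches above depend on `difflib.SequenceMatcher.ratio()`, which returns 2·M/T, where M is the number of matched characters and T is the combined length of both strings. A short sketch to inspect the scores for a few term pairs (the `similarity` wrapper is just a convenience for this example):

```python
import difflib

def similarity(a, b):
    # ratio() is 2*M/T: M matched characters, T total characters in both strings
    return difflib.SequenceMatcher(None, a, b).ratio()

pairs = [("dyspnea", "dyspnoea"), ("headache", "headaches"), ("cough", "coughing")]
for a, b in pairs:
    print(a, b, round(similarity(a, b), 3))
```

With the 0.9 threshold used here, spelling variants such as "dyspnea"/"dyspnoea" pass, while "cough"/"coughing" does not, so both forms need to be present in the symptom list to be matched.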


In [55]:
# Copy the matched rows so the rename below does not write into a slice of the original frame
phen_top_50 = top50gene_2_phenotypicFeature.iloc[phen_indices,:].copy()
# Replace HP identifiers with readable names where HP_dict has an entry
phen_top_50["output_name"] = phen_top_50["output_name"].map(lambda name: HP_dict.get(name, name))

phen_top_50

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,output_type,output_name,output_id
6,ANG,Gene,related_to,,BioLink API,,PhenotypicFeature,dyspnea,HP:HP:0002094
22,TP53,Gene,related_to,,BioLink API,,PhenotypicFeature,fever,HP:HP:0001945
47,TP53,Gene,related_to,,BioLink API,,PhenotypicFeature,dyspnea,HP:HP:0002094
96,TP53,Gene,related_to,,BioLink API,,PhenotypicFeature,headache,HP:HP:0002315
5,FOS,Gene,related_to,,BioLink API,,PhenotypicFeature,immunodeficiency,HP:HP:0002721
39,FOS,Gene,related_to,,BioLink API,,PhenotypicFeature,myalgia,HP:HP:0003326
50,SQSTM1,Gene,related_to,,BioLink API,,PhenotypicFeature,dyspnea,HP:HP:0002094
43,APP,Gene,related_to,,BioLink API,,PhenotypicFeature,headache,HP:HP:0002315
19,TH,Gene,related_to,,BioLink API,"10407773,9732974,0011551,21937992,20430833,252...",PhenotypicFeature,fever,HP:HP:0001945
24,SOD1,Gene,related_to,,BioLink API,,PhenotypicFeature,dyspnea,HP:HP:0002094


#### 1.3.1.2  Gene -> Bioprocess type "symptoms"

In [56]:
df_list = []
for x in top_50_related_genes_covid_2_allNodes_2_genes: 
#     print(x)
    try: 
        gene = ht.query(x)["Gene"][0]
        fc = FindConnection(input_obj=gene, output_obj='BiologicalProcess', intermediate_nodes=None)
        fc.connect(verbose=False)
        df = fc.display_table_view()
        rows = df.shape[0]
        if(rows > 0):
            df_list.append(df)
    except Exception:  # e.g. the name did not resolve to a Gene hint
        print(str(x) + " FAILED")
if(len(df_list) > 0):
    top50gene_2_bioprocesses = pd.concat(df_list)

C0017337 FAILED
C0014442 FAILED
C1705556 FAILED
C0010762 FAILED
C0164786 FAILED
C1705526 FAILED
C0010531 FAILED
C0030956 FAILED


In [57]:
top50gene_2_bioprocesses.shape

(17958, 9)

In [58]:
## Get names for go ids
go_ids = top50gene_2_bioprocesses[top50gene_2_bioprocesses["output_name"].str.contains("go:",regex=False)]["output_name"]
go_ids = list(go_ids)
go_ids = list(dict.fromkeys(go_ids))
len(go_ids)
go_dict = {}
for x in go_ids: 
    go_ID = x.split(':')[1]
    r = requests.get('https://biothings.ncats.io/go_bp/geneset/GO%3A' + go_ID)
    res = r.json()
    if('name' in res):
        go_dict[res['_id']] = res['name'].lower()

In [59]:
def get_similar_bp_indices(list1,list2,similarity):
    # Return positions in list1 whose resolved name is similar to any term in list2
    res = []
    for i, name in enumerate(list1):
        lookup = name.lower()
        # Swap a GO identifier for its readable name when go_dict has one
        if 'go:' in name and name in go_dict:
            lookup = go_dict[name]
        append_i = False
        for j in list2:
            if difflib.SequenceMatcher(None,lookup,j).ratio() > similarity:
                print("Matched similar terms:")
                print(lookup + ' and ' + j)
                append_i = True
        if append_i:
            res.append(i)
    print(len(res))
    return res

In [60]:
bp_indices = get_similar_bp_indices(list(top50gene_2_bioprocesses["output_name"]),symptom_and_phenotype_list,0.9)

Matched similar terms:
coagulation and coagulation
Matched similar terms:
coagulation and coagulation
Matched similar terms:
blood coagulation and blood coagulation
Matched similar terms:
coagulation and coagulation
Matched similar terms:
coagulation and coagulation
Matched similar terms:
coagulation and coagulation
Matched similar terms:
blood coagulation and blood coagulation
Matched similar terms:
blood coagulation and blood coagulation
Matched similar terms:
blood coagulation and blood coagulation
Matched similar terms:
coagulation and coagulation
Matched similar terms:
coagulation and coagulation
Matched similar terms:
coagulation and coagulation
12


In [72]:
bioprocess_top_50 = top50gene_2_bioprocesses.iloc[bp_indices,:]
bioprocess_top_50

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,output_type,output_name,output_id
1164,TNF,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,COAGULATION,GO:GO:0050817
840,INS,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,COAGULATION,GO:GO:0050817
847,INS,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,BLOOD COAGULATION,GO:GO:0007596
821,AKT1,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,COAGULATION,GO:GO:0050817
358,IL6,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,COAGULATION,GO:GO:0050817
576,MAPK1,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,COAGULATION,GO:GO:0050817
4,ALB,Gene,affects,SEMMED,SEMMED Gene API,75114959215020.0,BiologicalProcess,BLOOD COAGULATION,name:BLOOD COAGULATION
516,VEGFA,Gene,disrupts,SEMMED,SEMMED Gene API,22532265.0,BiologicalProcess,BLOOD COAGULATION,name:BLOOD COAGULATION
615,IFNA1,Gene,functional_association,entrez,MyGene.info API,,BiologicalProcess,BLOOD COAGULATION,GO:GO:0007596
129,EPO,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,COAGULATION,GO:GO:0050817


#### 1.3.1.3  Gene -> Disease type "symptoms" 

In [62]:
df_list = []
for x in top_50_related_genes_covid_2_allNodes_2_genes: 
#     print(x)
    try: 
        gene = ht.query(x)["Gene"][0]
        fc = FindConnection(input_obj=gene, output_obj='Disease', intermediate_nodes=None)
        fc.connect(verbose=False)
        df = fc.display_table_view()
        rows = df.shape[0]
        if(rows > 0):
            df_list.append(df)
    except Exception:  # e.g. the name did not resolve to a Gene hint
        print(str(x) + " FAILED")
if(len(df_list) > 0):
    top50gene_2_diseases = pd.concat(df_list)

top50gene_2_diseases.shape

C0017337 FAILED
C0014442 FAILED
C1705556 FAILED
C0010762 FAILED
C0164786 FAILED
C1705526 FAILED
C0010531 FAILED
C0030956 FAILED


(48223, 9)

In [63]:
def get_similar_disease_indices(list1,list2,similarity):
    # Return positions in list1 whose lowercased name is similar to any term in list2
    res = []
    for i, name in enumerate(list1):
        lookup = name.lower()
        if any(difflib.SequenceMatcher(None,lookup,j).ratio() > similarity for j in list2):
            res.append(i)
    print(len(res))
    return res
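
The three `get_similar_*_indices` functions differ only in which lookup table, if any, resolves identifiers to names. A sketch of how they might be folded into one function with an optional lookup argument is given below. `similar_indices` and its inputs are hypothetical, and it assumes the lookup table is keyed by the raw identifier exactly as it appears in the name list.

```python
import difflib

def similar_indices(names, targets, threshold, id_to_name=None):
    """Indices of names whose resolved, lowercased form is similar to any target."""
    id_to_name = id_to_name or {}
    hits = []
    for index, name in enumerate(names):
        # Use the readable name when the identifier is known, else the lowercased input
        lookup = id_to_name.get(name, name.lower())
        if any(difflib.SequenceMatcher(None, lookup, t).ratio() > threshold
               for t in targets):
            hits.append(index)
    return hits

# Hypothetical inputs for illustration
names = ["HP:0002094", "FEVER", "Seizure"]
lookup_table = {"HP:0002094": "dyspnea"}
print(similar_indices(names, ["dyspnoea", "fever"], 0.9, lookup_table))
```

Passing `HP_dict`, `go_dict`, or nothing would then cover the phenotype, biological process, and disease cases respectively.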


In [64]:
disease_indices = get_similar_disease_indices(list(top50gene_2_diseases["output_name"]),symptom_and_phenotype_list,0.9)

118


In [65]:
# top50gene_2_diseases
relevant_top50gene_2_diseases = top50gene_2_diseases.iloc[disease_indices,:]
relevant_top50gene_2_diseases 

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,output_type,output_name,output_id
1539,TNF,Gene,disrupts,SEMMED,SEMMED Gene API,144323616300807226736189094446,Disease,FEVER,MONDO:C0015967
1540,TNF,Gene,causes,SEMMED,SEMMED Gene API,"10701765,15373964,16460809,1714101,17374708,17...",Disease,FEVER,MONDO:C0015967
1541,TNF,Gene,affects,SEMMED,SEMMED Gene API,"11593333,12879338,15855300,15965498,17967442,1...",Disease,FEVER,MONDO:C0015967
1542,TNF,Gene,related_to,disgenet,mydisease.info API,,Disease,FEVER,MONDO:C0015967
2150,TNF,Gene,causes,SEMMED,SEMMED Gene API,21426732,Disease,COUGHING,MONDO:C0010200
...,...,...,...,...,...,...,...,...,...
1175,CA2,Gene,affects,SEMMED,SEMMED Gene API,1193655631097314022197,Disease,FEVER,MONDO:C0015967
1177,CA2,Gene,affects,SEMMED,SEMMED Gene API,4061632,Disease,ACUTE KIDNEY FAILURE,MONDO:MONDO:0002492
402,CRP,Gene,related_to,DISEASE,DISEASES API,,Disease,PHARYNGITIS,MONDO:MONDO:0002258
403,CRP,Gene,related_to,scigraph,Automat CORD19 Scigraph API,,Disease,PHARYNGITIS,MONDO:MONDO:0002258


In [66]:
i = list(top50gene_2_diseases.iloc[disease_indices,:]["input"])
d = {x:i.count(x) for x in i}
sorted_genes_from_symptoms = {k: v for k, v in sorted(d.items(), key=lambda item: item[1])}
for x in list(reversed(list(sorted_genes_from_symptoms)))[0:50]:
    print(str(x) + ": " + str(sorted_genes_from_symptoms[x]))

TNF: 12
INS: 9
VEGFA: 8
IL6: 8
EPO: 6
LEP: 6
CAMP: 5
BAX: 5
ALB: 5
ACE: 5
TP53: 5
IFNA1: 4
CRP: 3
MPO: 3
EGFR: 3
SOD1: 3
CDKN1A: 3
FOS: 3
CA2: 2
MAPK8: 2
SOD2: 2
IL1B: 2
CYP2C9: 2
PPIG: 2
CYP2D6: 2
CAT: 2
CD4: 1
TH: 1
TLR9: 1
AR: 1
ACE2: 1
AKT1: 1


['disrupts',
 'causes',
 'affects',
 'related_to',
 'causes',
 'causes',
 'related_to',
 'related_to',
 'causes',
 'related_to',
 'related_to',
 'related_to',
 'disrupts',
 'related_to',
 'causes',
 'causes',
 'affects',
 'treats',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'affects',
 'related_to',
 'related_to',
 'related_to',
 'affects',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'affects',
 'related_to',
 'causes',
 'related_to',
 'related_to',
 'related_to',
 'disrupts',
 'related_to',
 'disrupts',
 'causes',
 'related_to',
 'treats',
 'related_to',
 'disrupts',
 'causes',
 'causes',
 'affects',
 'related_to',
 'related_to',
 'related_to',
 'causes',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'prevents',
 'related_to',
 'related_to',
 'disrupts',
 'related_to',
 'causes',
 'related_to',
 'related_to',
 'related_to',
 'related_to',
 'affect

In [96]:
causes_df = relevant_top50gene_2_diseases[relevant_top50gene_2_diseases["pred1"] == "causes"]
i = list(causes_df["input"])
causes_dict = {x:i.count(x) for x in i}
causes_dict

{'TNF': 4,
 'INS': 2,
 'FOS': 1,
 'ACE': 1,
 'PPIG': 1,
 'ALB': 1,
 'CDKN1A': 1,
 'VEGFA': 1,
 'LEP': 1,
 'IFNA1': 1,
 'CAMP': 1,
 'EPO': 3,
 'MAPK8': 1}

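**How to interpret the above**: Of the top genes associated with COVID-19, these are the genes with at least one `causes` edge to a term in the COVID-19 symptom list. Each count is the number of such edges for that gene, so it reflects how often the association is reported by the underlying sources rather than the strength of a causal effect.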
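
The same tally can be broken down by predicate, which shows how much of a gene's signal comes from `causes` edges as opposed to weaker ones such as `related_to`. A minimal sketch using `collections.Counter` on (gene, predicate) pairs taken from the visible rows of the table above (only those rows, so the counts are smaller than in the full result):

```python
from collections import Counter

# (gene, predicate) pairs copied from the rows shown in the table above
edges = [
    ("TNF", "disrupts"), ("TNF", "causes"), ("TNF", "affects"),
    ("TNF", "related_to"), ("TNF", "causes"),
    ("CA2", "affects"), ("CA2", "affects"),
]

by_gene_and_predicate = Counter(edges)
causes_per_gene = Counter(gene for gene, predicate in edges if predicate == "causes")
print(by_gene_and_predicate[("TNF", "causes")])
print(dict(causes_per_gene))
```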

## Assembly of Results

In [73]:
all_gene_connections = pd.concat([bioprocess_top_50,phen_top_50,relevant_top50gene_2_diseases])
all_gene_connections["output_name"] = all_gene_connections["output_name"].str.lower()
all_gene_connections

Unnamed: 0,input,input_type,pred1,pred1_source,pred1_api,pred1_pubmed,output_type,output_name,output_id
1164,TNF,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,coagulation,GO:GO:0050817
840,INS,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,coagulation,GO:GO:0050817
847,INS,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,blood coagulation,GO:GO:0007596
821,AKT1,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,coagulation,GO:GO:0050817
358,IL6,Gene,related_to,Translator Text Mining Provider,CORD Gene API,,BiologicalProcess,coagulation,GO:GO:0050817
...,...,...,...,...,...,...,...,...,...
1175,CA2,Gene,affects,SEMMED,SEMMED Gene API,1193655631097314022197,Disease,fever,MONDO:C0015967
1177,CA2,Gene,affects,SEMMED,SEMMED Gene API,4061632,Disease,acute kidney failure,MONDO:MONDO:0002492
402,CRP,Gene,related_to,DISEASE,DISEASES API,,Disease,pharyngitis,MONDO:MONDO:0002258
403,CRP,Gene,related_to,scigraph,Automat CORD19 Scigraph API,,Disease,pharyngitis,MONDO:MONDO:0002258


In [84]:
# all_gene_connections
top_symptom_pub_counts = {}
for index, row in all_gene_connections.iterrows():
#     if row["input_name"] in top_50_related_genes_covid_2_allNodes_2_genes:
    current_pubcount = 0
    if row["pred1_pubmed"] is not None:
        current_pubcount = current_pubcount + row["pred1_pubmed"].count(",") + 1
    if row["input"] in top_symptom_pub_counts:
        top_symptom_pub_counts[row["input"]] = top_symptom_pub_counts[row["input"]] + current_pubcount
    else: 
        top_symptom_pub_counts[row["input"]] = current_pubcount

top_symptom_pub_counts

{'TNF': 37,
 'INS': 6,
 'AKT1': 1,
 'IL6': 5,
 'MAPK1': 0,
 'ALB': 5,
 'VEGFA': 10,
 'IFNA1': 10,
 'EPO': 6,
 'MAPK8': 2,
 'CRP': 0,
 'ANG': 0,
 'TP53': 4,
 'FOS': 3,
 'SQSTM1': 0,
 'APP': 0,
 'TH': 9,
 'SOD1': 2,
 'MARS1': 0,
 'CAT': 1,
 'CYP2D6': 1,
 'ACE2': 0,
 'ACE': 4,
 'AR': 1,
 'TLR9': 0,
 'PPIG': 3,
 'CDKN1A': 3,
 'CD4': 0,
 'CYP2C9': 1,
 'IL1B': 0,
 'BAX': 5,
 'LEP': 8,
 'EGFR': 1,
 'CAMP': 6,
 'SOD2': 0,
 'MPO': 1,
 'CA2': 4}
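
The tally above counts commas in `pred1_pubmed` and adds one. A minimal sketch of the same idea on toy data, assuming the IDs arrive as a comma-separated string and that a missing value may be either `None` or `NaN`. The helper `count_pubmed_ids`, the toy rows, and the PubMed IDs in them are hypothetical and only illustrate the counting:

```python
import math
from collections import defaultdict

def count_pubmed_ids(value):
    """Count the IDs in a comma-separated string; missing values count as 0."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    # Ignore empty fragments so "" and trailing commas do not inflate the count.
    return len([part for part in str(value).split(",") if part.strip()])

# Hypothetical rows shaped like all_gene_connections (gene symbol + PubMed ID string).
toy_rows = [
    {"input": "TNF", "pred1_pubmed": "111,222,333"},
    {"input": "TNF", "pred1_pubmed": "444"},
    {"input": "INS", "pred1_pubmed": None},
    {"input": "CRP", "pred1_pubmed": float("nan")},
]

# Sum the per-row counts for each gene.
toy_pub_counts = defaultdict(int)
for row in toy_rows:
    toy_pub_counts[row["input"]] += count_pubmed_ids(row["pred1_pubmed"])

print(dict(toy_pub_counts))  # {'TNF': 4, 'INS': 0, 'CRP': 0}
```

Splitting on commas and dropping empty fragments returns 0 for an empty string, whereas `count(",") + 1` would return 1.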

In [74]:
print(sorted_genes_covid_2_allNodes_2_genes['INS'])

# Collect, for each gene, its COVID-19 association counts and the symptoms it is linked to
results_dict = {}
for _, row in all_gene_connections.iterrows():
    gene = row["input"]
    if gene in results_dict:
        results_dict[gene]["symptoms_associated"].append(row["output_name"])
    else:
        results_dict[gene] = {
            "two_step_associations_to_covid" : sorted_genes_covid_2_allNodes_2_genes[gene],
            "direct_associations_to_covid" : sorted_genes_covid_2_genes[gene] if gene in sorted_genes_covid_2_genes else 0,
            "symptoms_associated" : [row["output_name"]]
        }
    
print(results_dict)

27
{'TNF': {'two_step_associations_to_covid': 44, 'direct_associations_to_covid': 0, 'symptoms_associated': ['coagulation', 'fever', 'fever', 'fever', 'fever', 'coughing', 'acute kidney failure', 'acute kidney failure', 'acute kidney failure', 'hypoxemia', 'pharyngitis', 'pharyngitis', 'pharyngitis']}, 'INS': {'two_step_associations_to_covid': 27, 'direct_associations_to_covid': 0, 'symptoms_associated': ['coagulation', 'blood coagulation', 'coughing', 'fever', 'fever', 'chronic lung disease', 'acute kidney failure', 'acute kidney failure', 'acute kidney failure', 'acute kidney failure', 'pharyngitis']}, 'AKT1': {'two_step_associations_to_covid': 25, 'direct_associations_to_covid': 0, 'symptoms_associated': ['coagulation', 'fever']}, 'IL6': {'two_step_associations_to_covid': 23, 'direct_associations_to_covid': 0, 'symptoms_associated': ['coagulation', 'acute kidney failure', 'acute kidney failure', 'acute kidney failure', 'fever', 'fever', 'headache', 'pharyngitis', 'pharyngitis']}, 'M

In [100]:
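def get_connection_normalizing_count(gene):
    """Count every direct (one-hop) connection from a gene to each node type in node_type_list.

    The total is used to normalize the relevance score of well-connected genes.
    """
    count = 0
    input_object = ht.query(gene)['Gene'][0]
    for x in node_type_list:
        fc = FindConnection(input_obj=input_object, output_obj=x, intermediate_nodes=None)
        fc.connect(verbose=False)
        df = fc.display_table_view()
        # Each row of the table view is one connection
        count = count + df.shape[0]
    return count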
        
TNF_count = get_connection_normalizing_count('TNF')
print(TNF_count)

19385


In [110]:
connection_dict = {}
for key in results_dict:
    connection_dict[key] = get_connection_normalizing_count(key)

In [118]:
dataframe_input = []

for key in results_dict:
    connections_count = connection_dict[key]
    direct_assoc = results_dict[key]["direct_associations_to_covid"]
    two_step_assoc = results_dict[key]["two_step_associations_to_covid"]
    symptoms = results_dict[key]["symptoms_associated"]
    causes_count = causes_dict[key] if key in causes_dict else 0
    # Weighted sum of the evidence for this gene, normalized by its total connection count
    relevance_score = ((direct_assoc*10 +
                        two_step_assoc +
                        round(top_genes_pub_counts[key] / 3) +
                        round(top_symptom_pub_counts[key]) +
                        causes_count*20 +
                        len(symptoms)*10)
                       /connections_count)
    current_result = {'gene': key,
                      "direct_disease_assoc": direct_assoc,
                      "two_step_assoc_to_disease": two_step_assoc,
                      "two_step_pub_count": top_genes_pub_counts[key],
                      "disease_symptoms_gene_is_associated_with": symptoms,
                      "symptoms_associated_count": len(symptoms),
                      "disease_symptom_gene_pub_count": top_symptom_pub_counts[key],
                      "causes_symptom_count": causes_count,
                      "relevance_score": relevance_score
                     }
    dataframe_input.append(current_result)
    
final_df = pd.DataFrame(dataframe_input)
final_df = final_df.sort_values(by=['relevance_score'], ascending=False)
final_df

Unnamed: 0,gene,direct_disease_assoc,two_step_assoc_to_disease,two_step_pub_count,disease_symptoms_gene_is_associated_with,symptoms_associated_count,disease_symptom_gene_pub_count,causes_symptom_count,relevance_score
8,EPO,0,14,26,"[coagulation, dyspnea, fever, hypoxemia, acute...",7,6,3,0.073509
21,ACE2,2,18,24,[acute kidney failure],1,0,0,0.066986
35,MPO,0,14,33,"[hypoxemia, acute kidney failure, acute kidney...",3,1,0,0.05045
16,TH,1,16,541,"[fever, pharyngitis]",2,9,0,0.041778
18,MARS1,1,7,5,"[cough, dyspnea]",2,0,0,0.041183
24,TLR9,0,16,15,[pharyngitis],1,0,0,0.04
22,ACE,0,18,32,"[acute kidney failure, acute kidney failure, c...",5,4,1,0.037184
11,ANG,0,24,78,[dyspnea],1,0,0,0.034345
25,PPIG,0,16,45,"[fever, acute kidney failure]",2,3,1,0.029088
10,CRP,1,10,0,"[coagulation, pharyngitis, pharyngitis, acute ...",4,0,0,0.028143
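
The `relevance_score` column is a weighted sum of the evidence columns divided by the gene's connection count. A minimal standalone sketch of that arithmetic, using the weights from the cell above. The function name `relevance_score` and its parameter names are hypothetical, and the connection count of 836 is an assumption chosen for illustration (it is consistent with the ACE2 row shown above, but the notebook does not print it):

```python
def relevance_score(direct_assoc, two_step_assoc, two_step_pub_count,
                    symptom_count, symptom_pub_count, causes_count,
                    connections_count):
    """Weighted evidence sum divided by the gene's total connection count."""
    if connections_count <= 0:
        raise ValueError("connections_count must be positive")
    weighted_sum = (direct_assoc * 10                 # direct disease associations
                    + two_step_assoc                  # two-step associations
                    + round(two_step_pub_count / 3)   # publications, down-weighted
                    + round(symptom_pub_count)        # symptom publications
                    + causes_count * 20               # "causes" edges to symptoms
                    + symptom_count * 10)             # associated symptoms
    return weighted_sum / connections_count

# ACE2 row: 2 direct, 18 two-step, 24 two-step publications,
# 1 symptom, 0 symptom publications, 0 causes; 836 connections is assumed.
ace2_score = relevance_score(2, 18, 24, 1, 0, 0, connections_count=836)
print(round(ace2_score, 6))  # 0.066986
```

For ACE2 the weighted sum is 20 + 18 + 8 + 0 + 0 + 10 = 56. Python's built-in `round` rounds exact halves to the nearest even integer, which can shift the publication term by one for some counts.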


In [112]:
final_df

Unnamed: 0,gene,direct_disease_assoc,two_step_assoc_to_disease,two_step_pub_count,disease_symptoms_gene_is_associated_with,symptoms_associated_count,disease_symptom_gene_pub_count,causes_symptom_count,relevance_score
21,ACE2,2,18,24,[acute kidney failure],1,0,0,0.07177
8,EPO,0,14,26,"[coagulation, dyspnea, fever, hypoxemia, acute...",7,6,3,0.054092
16,TH,1,16,541,"[fever, pharyngitis]",2,9,0,0.040533
35,MPO,0,14,33,"[hypoxemia, acute kidney failure, acute kidney...",3,1,0,0.034234
18,MARS1,1,7,5,"[cough, dyspnea]",2,0,0,0.033791
24,TLR9,0,16,15,[pharyngitis],1,0,0,0.032258
11,ANG,0,24,78,[dyspnea],1,0,0,0.03091
22,ACE,0,18,32,"[acute kidney failure, acute kidney failure, c...",5,4,1,0.026354
25,PPIG,0,16,45,"[fever, acute kidney failure]",2,3,1,0.024371
29,IL1B,0,15,39,"[fever, pharyngitis]",2,0,0,0.019629


## 1.4 What proteins/genes are in pathways of known COVID-19 related genes? Which of these can be related to symptoms? 
### 1.4.1 Genes (from 1.1.1) -> Pathways -> Genes


### 1.4.2 COVID-19 Symptoms -> Pathways -> Genes

In [59]:
for x in symptom_and_phenotype_list:
#     print(x)
    # Query once per term and reuse the result
    phenotype_hits = ht.query(x)['PhenotypicFeature']
    if phenotype_hits:
        print(phenotype_hits)

In [71]:
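# Keep only the Disease hits whose name matches the symptom term exactly
# (the terms in symptom_and_phenotype_list are already lower-case)
disease_symptom_list = []
for x in symptom_and_phenotype_list:
#     print(x)
    res = ht.query(x)['Disease']
    if res:
        for y in res:
            if y['name'].lower() == x:
                disease_symptom_list.append(y)
disease_symptom_list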

[{'MONDO': 'MONDO:0002258',
  'DOID': 'DOID:2275',
  'UMLS': 'C0031350',
  'name': 'pharyngitis',
  'MESH': 'D010612',
  'primary': {'identifier': 'MONDO',
   'cls': 'Disease',
   'value': 'MONDO:0002258'},
  'display': 'MONDO(MONDO:0002258) DOID(DOID:2275) UMLS(C0031350) MESH(D010612) name(pharyngitis)',
  'type': 'Disease'},
 {'MONDO': 'C0015967',
  'UMLS': 'C0015967',
  'name': 'Fever',
  'primary': {'identifier': 'MONDO', 'cls': 'Disease', 'value': 'C0015967'},
  'display': 'MONDO(C0015967) UMLS(C0015967) name(Fever)',
  'type': 'Disease'},
 {'MONDO': 'MONDO:0002492',
  'DOID': 'DOID:3021',
  'UMLS': 'C0022660',
  'name': 'acute kidney failure',
  'MESH': 'D058186',
  'primary': {'identifier': 'MONDO',
   'cls': 'Disease',
   'value': 'MONDO:0002492'},
  'display': 'MONDO(MONDO:0002492) DOID(DOID:3021) UMLS(C0022660) MESH(D058186) name(acute kidney failure)',
  'type': 'Disease'},
 {'MONDO': 'C0018681',
  'UMLS': 'C0018681',
  'name': 'Headache',
  'primary': {'identifier': 'MONDO'
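
The filter in cell In [71] keeps a hint only when its lower-cased name equals the search term, which assumes the term itself is already lower-case. A minimal sketch of a matcher that normalizes both sides, assuming each hint is a dict with a `name` key as in the output above. The helper `exact_name_matches` and the second candidate (including its identifier) are hypothetical:

```python
def exact_name_matches(term, candidates):
    """Return the candidates whose name equals term, ignoring case and outer whitespace."""
    wanted = term.strip().casefold()
    return [c for c in candidates
            if c.get("name", "").strip().casefold() == wanted]

# First entry mirrors the 'Fever' hint above; the second is a made-up near match.
candidates = [
    {"name": "Fever", "UMLS": "C0015967"},
    {"name": "fever of unknown origin", "UMLS": "EXAMPLE:0001"},
]

matches = exact_name_matches("FEVER ", candidates)
print([m["name"] for m in matches])  # ['Fever']
```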

In [72]:
symptom_and_phenotype_list

['pharyngitis',
 'fever',
 'hyperthermia',
 'pyrexia',
 'diabetes mellitus',
 'acute kidney injury',
 'acute kidney failure',
 'acute renal failure',
 'immunodeficiency',
 'decreased immune function',
 'immune deficiency',
 'headache',
 'headaches',
 'respiratory distress',
 'breathing difficulties',
 'difficulty breathing',
 'respiratory difficulties',
 'dyspnea',
 'abnormal breathing',
 'breathing difficulty',
 'difficult to breathe',
 'dyspnoea',
 'trouble breathing',
 'respiratory failure requiring assisted ventilation',
 'respiratory distress necessitating mechanical ventilation',
 'respiratory distress requiring endotracheal intubation',
 'respiratory distress requiring mechanical ventilation',
 'acute infectious pneumonia',
 'cough',
 'coughing',
 'chronic lung disease',
 'hypoxemia',
 'low blood oxygen level',
 'myalgia',
 'muscle ache',
 'muscle pain']