# Workflow1 Module1 Question4

## What subset of conditions are most representative of [conditions]? (find archetypes)

Here's one solution to Question 4 that uses similarity values. Note that some implementations of this module (such as the similarity or ROBOKOP modules) may not need this question directly. We'll start with the results of the enrichment answer to Question 3 for diabetes, reproduced here so we can easily get the identifiers. The enrichment is seeded with a short set of diabetes symptoms chosen for demo purposes; a more complete implementation would use a longer set.

### Question 3: use enrichment to get genetic conditions

In [1]:
import requests
import pandas as pd

def enrichment(type1,identlist,type2,threshhold=None,maxresults=None,numtype1=None,include_descendants=None,rebuild=None):
    # POST the identifiers to the ROBOKOP enrichment endpoint for the given type pair.
    url=f'http://robokop.renci.org/api/simple/enriched/{type1}/{type2}'
    params = { 'threshhold': threshhold, 'max_results': maxresults,
              'num_type1':numtype1, 'identifiers': identlist,
              'include_descendants':include_descendants, 'rebuild': rebuild }
    # Send only the parameters that were actually set.
    params = { k:v for k,v in params.items() if v is not None }
    response=requests.post(url, json = params)
    print( f'Return Status: {response.status_code}' )
    if response.status_code == 200:
        return response.json()
    # Any other status yields an empty result list.
    return []

In [2]:
diabetes_symptoms=['HP:0004904', 'HP:0001988', 'HP:0000833', 'HP:0006279','HP:0000842'] 
diabetes_enriched = enrichment('phenotypic_feature',diabetes_symptoms,'genetic_condition',threshhold=0.01,include_descendants=True)
diabetes_enriched_frame = pd.DataFrame(diabetes_enriched)

Return Status: 200


For the demo, let's just get the first 10 rows:

In [3]:
conditions = diabetes_enriched_frame.iloc[0:10]
conditions

| | id | name | p |
|---|---|---|---|
| 0 | MONDO:0012381 | hyperinsulinism due to INSR deficiency | 3.741608e-09 |
| 1 | MONDO:0011236 | hyperinsulinism due to glucokinase deficiency | 2.30349e-08 |
| 2 | MONDO:0015967 | rare genetic diabetes mellitus | 3.87142e-08 |
| 3 | MONDO:0005803 | hyperinsulinemic hypoglycemia (disease) | 1.952519e-07 |
| 4 | MONDO:0015618 | genetic pancreatic disease | 2.97092e-07 |
| 5 | MONDO:0004993 | carcinoma | 3.382536e-07 |
| 6 | MONDO:0019010 | congenital isolated hyperinsulinism | 5.36354e-07 |
| 7 | MONDO:0001076 | glucose intolerance | 2.038438e-06 |
| 8 | MONDO:0017688 | disorder of glycolysis | 5.615776e-06 |
| 9 | MONDO:0007540 | multiple endocrine neoplasia type 1 | 1.03589e-05 |


## Get pairwise similarities for all of these by phenotype

ROBOKOP has no dedicated service for pairwise similarity. We could go to biolink, but here we'll use ROBOKOP as an illustration, calling its similarity service once for each condition.

In [4]:
def similarity(type1,ident,type2,by_type,threshhold=None,maxresults=None,rebuild=None):
    url=f'http://robokop.renci.org/api/simple/similarity/{type1}/{ident}/{type2}/{by_type}'
    params = { 'threshhold': threshhold, 'max_results': maxresults, 'rebuild': rebuild }
    params = { k:v for k,v in params.items() if v is not None }
    response=requests.get(url, params = params)
    print( 'Return code:',response.status_code )
    return response.json()

In [5]:
sims = {}
for identifier in conditions['id']:
    sims[identifier] = similarity('genetic_condition',identifier,'genetic_condition','phenotypic_feature',threshhold=0.1)

Return code: 200
Return code: 200
Return code: 200
Return code: 200
Return code: 200
Return code: 200
Return code: 200
Return code: 200
Return code: 200
Return code: 200


In [6]:
sims_dict = { s : {s2['id']: s2['similarity'] for s2 in sims[s] } for s in conditions['id'] }
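
To look at the pairwise structure directly, the nested dictionary can be turned into a square table. The following is a minimal sketch on made-up data: `toy_sims` and its identifiers (`C1`, `C2`, `C3`, `OTHER`) and scores are hypothetical stand-ins for `sims_dict`, and treating pairs that were not returned as 0 is an assumption, not something the service guarantees.

```python
import pandas as pd

# Toy stand-in for sims_dict; identifiers and scores are made up for illustration.
toy_sims = {
    'C1': {'C1': 1.0, 'C2': 0.4, 'OTHER': 0.7},
    'C2': {'C2': 1.0, 'C1': 0.4, 'C3': 0.2},
    'C3': {'C3': 1.0, 'C2': 0.2},
}
ids = list(toy_sims)

# Rows are the query conditions, columns are the conditions returned for them.
matrix = pd.DataFrame.from_dict(toy_sims, orient='index')

# Keep only our own conditions; pairs that were not returned become 0 (an assumption).
matrix = matrix.reindex(index=ids, columns=ids).fillna(0.0)
print(matrix)
```

With the real data, passing `sims_dict` and `list(conditions['id'])` in place of the toy values should give a 10 x 10 table.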

Now we have, for each of our top 10 conditions, its similarity (by phenotype) to other conditions. One way to find the most representative condition is to ask which of them is similar to the most other conditions. We could do clustering here, but we'll show a simpler approach: count how many conditions in our set of 10 are within some threshold of each item, and find the most connected item.

In [7]:
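from collections import defaultdict

count = defaultdict(int)
thresh = 0.1
condition_ids = set(conditions['id'])  # build the lookup once instead of once per comparison

for d,v in sims_dict.items():
    for od,s in v.items():
        # count only neighbors from our own set of 10 that exceed the threshold
        if od in condition_ids and s > thresh:
            count[d] += 1

# sort by neighbor count, most connected first
clist = sorted(((v,k) for k,v in count.items()), reverse=True)
clist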

[(6, 'MONDO:0005803'),
 (5, 'MONDO:0019010'),
 (5, 'MONDO:0012381'),
 (4, 'MONDO:0017688'),
 (4, 'MONDO:0011236'),
 (2, 'MONDO:0015967'),
 (2, 'MONDO:0015618'),
 (2, 'MONDO:0007540'),
 (2, 'MONDO:0004993'),
 (2, 'MONDO:0001076')]
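
The counting step can also be checked offline on a small example. This is only a sketch: `count_neighbors` is a hypothetical helper name, and the identifiers `A`, `B`, `C`, `X` and their scores are made up. The toy data lists each condition among its own neighbors, so self-matches are counted; whether the ROBOKOP service does the same is not shown in this notebook.

```python
from collections import defaultdict

def count_neighbors(sims, ids, thresh):
    """For each key of sims, count the members of ids that score above thresh."""
    id_set = set(ids)
    counts = defaultdict(int)
    for cond, neighbors in sims.items():
        for other, score in neighbors.items():
            if other in id_set and score > thresh:
                counts[cond] += 1
    return dict(counts)

# Made-up similarity results; 'X' is outside our set and must be ignored.
toy_sims = {
    'A': {'A': 1.0, 'B': 0.5, 'C': 0.05, 'X': 0.9},
    'B': {'B': 1.0, 'A': 0.5, 'C': 0.3},
    'C': {'C': 1.0, 'B': 0.3},
}

counts = count_neighbors(toy_sims, ['A', 'B', 'C'], 0.1)
ranked = sorted(((n, c) for c, n in counts.items()), reverse=True)
print(ranked)  # [(3, 'B'), (2, 'C'), (2, 'A')]
```

As in the cell above, a condition with no neighbors over the threshold would not appear in the result at all.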

So the member of our list that has the most neighbors is `MONDO:0005803`, or "hyperinsulinemic hypoglycemia (disease)".
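
Rather than looking names up by eye, the ranked identifiers can be joined back to the names from the enrichment table. The sketch below rebuilds a three-row excerpt of `conditions` and `clist` from the outputs shown earlier so that it runs on its own; the `ranked` frame and its `neighbors` column name are illustrative choices.

```python
import pandas as pd

# Three rows copied from the enrichment table shown earlier.
conditions = pd.DataFrame({
    'id': ['MONDO:0012381', 'MONDO:0005803', 'MONDO:0019010'],
    'name': ['hyperinsulinism due to INSR deficiency',
             'hyperinsulinemic hypoglycemia (disease)',
             'congenital isolated hyperinsulinism'],
})

# First three entries of the ranked list shown earlier.
clist = [(6, 'MONDO:0005803'), (5, 'MONDO:0019010'), (5, 'MONDO:0012381')]

# Map each identifier to its name and attach it to the ranking.
names = conditions.set_index('id')['name']
ranked = pd.DataFrame(clist, columns=['neighbors', 'id'])
ranked['name'] = ranked['id'].map(names)
print(ranked)
```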