### Phenotypic Co-occurrence
One measure enabled by the HPOA disease-to-phenotype data is phenotypic co-occurrence.  Phenotypic co-occurrence is of interest because it allows us to hypothesize that there is a dependent relationship between two phenotypes.  Although co-occurrence does not imply causality, it may indicate a causal relationship between two phenotypes and one or more latent variables, or a causal relationship between the two phenotypes themselves.  For example, _progressive muscle weakness_ causes _falls_.  Alternatively, a biological process and/or environmental factor may cause allergies and asthma to co-occur.  The latter case is of interest in Monarch because we can query biological pathways and processes (GO, Reactome, etc.) related to phenotypes by joining gene-disease-phenotype relationships.  In addition, this type of analysis could be useful for weighting gene-to-phenotype associations where phenotypes co-occur in a Mendelian disease for which we have identified a causal gene-to-disease association.  This analysis could also be used to hypothesize pleiotropic effects.

For simplicity, we will treat all phenotypes and diseases as flat (that is, as leaf nodes) in our disease-to-phenotype association data.  In practice we know this is not correct: both phenotype groups and disease groups appear in the association data.

#### Approach
We will query the HPOA association data using the Monarch Neo4j database.  These values may differ from those obtained by querying the raw dataset because equivalent diseases are merged in MONDO.

This analysis assumes we are starting with a phenotype of interest, although a more comprehensive analysis, such as generating a full co-occurrence matrix, may also be useful.
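
As a rough illustration of the more comprehensive option, the sketch below counts co-occurrence for every phenotype pair from a flat list of disease-to-phenotype annotations. The annotation list, the disease identifiers `D1` to `D3`, and the variable names are hypothetical and are not taken from HPOA; storing phenotypes in a set per disease also collapses any duplicate edges.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (disease, phenotype) annotations, including one duplicate edge
annotations = [
    ("D1", "Seizures"), ("D1", "Microcephaly"),
    ("D2", "Seizures"), ("D2", "Microcephaly"), ("D2", "Short stature"),
    ("D2", "Seizures"),  # duplicate edge, ignored by the set below
    ("D3", "Short stature"),
]

# Group the distinct phenotypes annotated to each disease
phenotypes_by_disease = defaultdict(set)
for disease, phenotype in annotations:
    phenotypes_by_disease[disease].add(phenotype)

# Count, for every unordered phenotype pair, the diseases annotated with both
pair_counts = Counter()
for phenotypes in phenotypes_by_disease.values():
    for pair in combinations(sorted(phenotypes), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Sorting each phenotype set before pairing keeps every pair in one canonical order, so `("Microcephaly", "Seizures")` and its reverse are counted together.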


#### About this notebook
This notebook uses _Prominent nasal bridge_ as an example phenotype.  This can be changed in the second cell to analyze different phenotypes.

Dependencies:

* pip install requests
* pip install pandas
* pip install neo4j-driver

In [61]:
import requests

# Query the Monarch database for co-occurrence of prominent nasal bridge and other phenotypes
# This can be performed using a count aggregate function in Cypher

phenotype = "HP:0000426" # Prominent nasal bridge

SCIGRAPH = "https://scigraph-data-dev.monarchinitiative.org/scigraph/"
scigraph_exec = SCIGRAPH + "cypher/execute"
scigraph_resolve = SCIGRAPH + "cypher/resolve"

# The WHERE p1 <> p2 clause is needed because clique merging of diseases
# causes duplicate edges; this still needs to be fixed
cypher_query = """
    MATCH (disease:disease)-[:RO:0002200]->(p1:Node{iri:'%s'}),
          (disease)-[:RO:0002200]->(p2:Node)
    WHERE p1 <> p2
    RETURN p2.label as phenotype, COUNT(DISTINCT(disease)) as disease_count
    ORDER BY disease_count DESC
""" % phenotype

params = {
    'cypherQuery': cypher_query,
    'limit': 10
}

scigraph_req = requests.get(scigraph_exec, params=params)
print(scigraph_req.text) # Default format is ascii text table, but we can get back json

+--------------------------------------------------+
| phenotype                        | disease_count |
+--------------------------------------------------+
| "Global developmental delay"     | 68            |
| "Microcephaly"                   | 68            |
| "Short stature"                  | 67            |
| "Intellectual disability"        | 65            |
| "Seizures"                       | 51            |
| "Micrognathia"                   | 48            |
| "Downslanted palpebral fissures" | 46            |
| "Cryptorchidism"                 | 45            |
| "High palate"                    | 45            |
| "Low-set ears"                   | 43            |
+--------------------------------------------------+
10 rows



#### Normalization
We know that the distribution of phenotypes across diseases is not uniform.  In other words, some phenotypes are annotated to diseases more often than others.  Therefore, we need to normalize this data.  This should not be confused with how frequently patients present with a phenotype, or with expressivity.

Two common methods for normalizing co-occurrence data are Jaccard similarity and cosine similarity.

Given two phenotypes, P1 and P2, we define Jaccard similarity as the count of diseases that contain both P1 AND P2, divided by the count of diseases that contain either P1 OR P2, or:

$$Jaccard(P1,P2) = \frac{\mid \ P1 \cap P2 \ \mid}{\mid\  P1 \mid + \mid P2 \mid - \mid P1 \cap P2 \ \mid }$$

Where 
$$ \mid \  P \mid = \text{Number of diseases annotated to phenotype P} $$



For cosine similarity we will use the Ochiai coefficient, which is defined as the count of diseases that contain both P1 AND P2, divided by the square root of the product of the count of diseases that contain P1 and the count of the diseases that contain P2, or:

$$\text{Ochiai coefficient(P1,P2)} = \frac{\mid P1 \cap P2 \mid}{\sqrt{\mid \ P1 \mid \times \mid P2 \ \mid}}$$
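
As a quick numeric check of the two formulas, the sketch below evaluates both for a hypothetical pair of phenotypes. The function names and the counts (100, 25 and 20 diseases) are illustrative assumptions, not values from HPOA.

```python
import math


def jaccard(p1_count, p2_count, intersection):
    """Diseases with both phenotypes divided by diseases with either one."""
    return intersection / (p1_count + p2_count - intersection)


def ochiai(p1_count, p2_count, intersection):
    """Diseases with both phenotypes divided by the geometric mean of the counts."""
    return intersection / math.sqrt(p1_count * p2_count)


# Hypothetical counts: P1 is annotated to 100 diseases, P2 to 25, and 20 have both
p1_count, p2_count, both = 100, 25, 20

print(jaccard(p1_count, p2_count, both))  # 20 / 105, roughly 0.19
print(ochiai(p1_count, p2_count, both))   # 20 / 50 = 0.4
```

In this example most diseases with P2 also have P1, yet the Jaccard value stays low because the union is dominated by the much larger P1 set; the Ochiai coefficient is less affected by that size difference.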

#### Approach
This can be achieved in pure Cypher, but we may consider creating a function on the server.


In [62]:
# Normalize with jaccard similarity
cypher_query = """
    MATCH (disease:disease)-[:RO:0002200]->(p1:Node{iri:'%s'}),
          (disease)-[:RO:0002200]->(p2:Phenotype)
    WHERE p2 <> p1
    WITH p2, COUNT(DISTINCT(disease)) as co_count
    MATCH (disease:disease)-[:RO:0002200]->(p1:Node{iri:'%s'})
    WITH COUNT(DISTINCT(disease)) as p1_count, p2, co_count
    MATCH (disease:disease)-[:RO:0002200]->(p2)
    WITH COUNT(DISTINCT(disease)) as p2_count, p1_count, p2, co_count
    RETURN p2.label as phenotype, p1_count, p2_count, co_count as intersection,
           toFloat(co_count)/((p1_count + p2_count)-co_count) as jaccard_sim
    ORDER BY jaccard_sim DESC
    """ % (phenotype, phenotype)

params = {
    'cypherQuery': cypher_query,
    'limit': 10
}

scigraph_req = requests.get(scigraph_exec, params=params)
print(scigraph_req.text) # Default format is ascii text table, but we can get back json

+---------------------------------------------------------------------------------------------+
| phenotype                        | p1_count | p2_count | intersection | jaccard_sim         |
+---------------------------------------------------------------------------------------------+
| "Long face"                      | 130      | 117      | 27           | 0.12272727272727273 |
| "Highly arched eyebrow"          | 130      | 103      | 22           | 0.10426540284360189 |
| "Downslanted palpebral fissures" | 130      | 374      | 46           | 0.10043668122270742 |
| "High palate"                    | 130      | 430      | 45           | 0.08737864077669903 |
| "Thin vermilion border"          | 130      | 110      | 19           | 0.08597285067873303 |
| "Short philtrum"                 | 130      | 152      | 22           | 0.08461538461538462 |
| "Posteriorly rotated ears"       | 130      | 186      | 24           | 0.0821917808219178  |
| "Macrotia"                       | 130

In [68]:
# Normalize with ochiai coefficient
ochiai_query = """
    MATCH (disease:disease)-[:RO:0002200]->(p1:Node{iri:'%s'}),
          (disease)-[:RO:0002200]->(p2:Phenotype)
    WHERE p1 <> p2
    WITH DISTINCT p1, p2, disease
    WITH p2, COUNT(DISTINCT(disease)) as co_count
    MATCH (disease:disease)-[:RO:0002200]->(p1:Node{iri:'%s'})
    WITH COUNT(DISTINCT(disease)) as p1_count, p2, co_count
    MATCH (disease:disease)-[:RO:0002200]->(p2)
    WITH COUNT(DISTINCT(disease)) as p2_count, p1_count, p2, co_count
    RETURN p2.label as phenotype, p1_count, p2_count, co_count as intersection,
    toFloat(co_count)/sqrt(p1_count * p2_count) as ochai_coef
    ORDER BY ochai_coef DESC
    """ % (phenotype, phenotype)

params = {
    'cypherQuery': ochiai_query,
    'limit': 10
}

scigraph_req = requests.get(scigraph_exec, params=params)
print(scigraph_req.text) # Default format is ascii text table, but we can get back json

+---------------------------------------------------------------------------------------------+
| phenotype                        | p1_count | p2_count | intersection | ochai_coef          |
+---------------------------------------------------------------------------------------------+
| "Long face"                      | 130      | 117      | 27           | 0.21892691493473396 |
| "Downslanted palpebral fissures" | 130      | 374      | 46           | 0.20861731638973618 |
| "Microcephaly"                   | 130      | 945      | 68           | 0.19400875660494718 |
| "High palate"                    | 130      | 430      | 45           | 0.1903297204970161  |
| "Highly arched eyebrow"          | 130      | 103      | 22           | 0.19012200791583994 |
| "Short stature"                  | 130      | 1187     | 67           | 0.17056022767438492 |
| "Low-set ears"                   | 130      | 490      | 43           | 0.17037220312632037 |
| "Ptosis"                         | 130

#### Results
The original results showed several neurological abnormalities: seizures, global developmental delay, and intellectual disability.  In contrast, after normalization the majority of co-occurring phenotypes are related to morphological abnormalities of the head, with the exception of cryptorchidism.

#### Computing p-values for co-occurrence
Similarity metrics are useful for normalizing and ranking co-occurrence data; however, they do not measure whether a pair of phenotypes co-occurs significantly in our corpus.

To test the null hypothesis that the two phenotypes are independent, we create a 2x2 contingency table and run a one-tailed Fisher's exact test.  The contingency table is structured as:


|     | P1 Present     |  P1 Absent |
|:-------------|:----------------:|:--------:|
|__P2 Present__      |A   | B |
|__P2 Absent__     |C  |D  |

Given the variables in the contingency table, Jaccard similarity can be rewritten as:

$$Jaccard(P1,P2) = \frac{A}{A + B + C}$$

For those coming from a GO term enrichment perspective, this can also be conceptualized as testing for enrichment on the class P1 where our selected dataset is diseases with P2, and our unselected dataset is diseases without P2, or:

|     | P1 Present     |  P1 Absent |
|:-------------|:----------------:|:--------:|
|Sample group (diseases with P2)      |A   | B |
|Background (diseases without P2)     |C  |D  |
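
The cells A to D can be derived from the three counts used for normalization plus the total number of annotated diseases. The sketch below does this for hypothetical counts (the helper name `contingency_table` and the total of 1000 diseases are assumptions for illustration) and confirms that A / (A + B + C) gives the same Jaccard value as the set-based formula.

```python
def contingency_table(p1_count, p2_count, intersection, total_diseases):
    """Build [[A, B], [C, D]] with P2 on the rows and P1 on the columns."""
    a = intersection                  # diseases with both P1 and P2
    b = p2_count - intersection       # diseases with P2 but not P1
    c = p1_count - intersection       # diseases with P1 but not P2
    d = total_diseases - a - b - c    # diseases with neither phenotype
    return [[a, b], [c, d]]


# Hypothetical counts, not taken from HPOA
table = contingency_table(p1_count=100, p2_count=25,
                          intersection=20, total_diseases=1000)
(a, b), (c, d) = table

jaccard_from_table = a / (a + b + c)
jaccard_from_counts = 20 / (100 + 25 - 20)

print(table)                                    # [[20, 5], [80, 895]]
print(jaccard_from_table, jaccard_from_counts)  # both equal 20 / 105
```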



In [83]:
# There are a couple libs that compute the fisher's exact test:
# scipy, https://github.com/brentp/fishers_exact_test,
# but it's easy enough to rewrite for a 2x2 table

# References: https://en.wikipedia.org/wiki/Fisher%27s_exact_test
# This is also a good bio related description:
# http://www.pathwaycommons.org/guide/primers/statistics/fishers_exact_test/#setup

import math


def hyper_geometric(matrix):
    a = matrix[0][0]
    b = matrix[0][1]
    c = matrix[1][0]
    d = matrix[1][1]
    numerator = math.factorial(a + b) * math.factorial(c + d) \
                * math.factorial(a + c) * math.factorial(b + d)
    denominator = math.factorial(a) * math.factorial(b) \
                  * math.factorial(c) * math.factorial(d) \
                  * math.factorial(a + b + c + d)
    return numerator/denominator


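def fisher_exact(matrix, direction="greater"):
    # Work on a copy so the caller's table is not modified in place
    matrix = [list(row) for row in matrix]
    p_value = hyper_geometric(matrix)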
    if direction == "greater":
        while matrix[0][1] > 0 and matrix[1][0] > 0:
            matrix[0][0] += 1
            matrix[0][1] -= 1
            matrix[1][0] -= 1
            matrix[1][1] += 1
            p_value += hyper_geometric(matrix)

    elif direction == "lesser":
        while matrix[0][0] > 0 and matrix[1][1] > 0:
            matrix[0][0] -= 1
            matrix[0][1] += 1
            matrix[1][0] += 1
            matrix[1][1] -= 1
            p_value += hyper_geometric(matrix)
    else:
        raise ValueError("only accepts greater or lesser")
    return p_value
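
As a sanity check on the one-tailed ("greater") case, here is a minimal alternative sketch that sums the same hypergeometric terms using `math.comb` (Python 3.8+) instead of factorials. The function name `fisher_exact_greater` and the small example table are illustrative assumptions; this is not the notebook's implementation, only a compact way to cross-check it.

```python
from math import comb


def fisher_exact_greater(table):
    """One-tailed Fisher's exact test for a 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probability of the observed top-left cell
    and of every larger value that the fixed margins allow.
    """
    (a, b), (c, d) = table
    row1 = a + b          # first row total
    col1 = a + c          # first column total
    total = a + b + c + d
    max_a = min(row1, col1)
    favourable = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, max_a + 1)
    )
    return favourable / comb(total, row1)


# Small hypothetical table: the tail contains a = 3 and a = 4
example = [[3, 1], [1, 3]]
print(fisher_exact_greater(example))  # (16 + 1) / 70
```

Because the function only reads the table, the caller's list is left unchanged and can be reused for other tests.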

In [84]:
import pandas as pd


# Get input for fisher's exact

params = {
    'cypherQuery': ochiai_query,
    'limit': 50
}

scigraph_req = requests.get(scigraph_exec + ".json", params=params)
cooccur_table = scigraph_req.json()

# Get the count of diseases with at least one phenotype annotation

cypher_query = """
    MATCH (disease:disease)-[:RO:0002200]->(:Phenotype)
    RETURN COUNT(DISTINCT(disease)) as count
"""

params = {
    'cypherQuery': cypher_query,
    'limit': 1
}

scigraph_req = requests.get(scigraph_exec + ".json", params=params)
disease_count = scigraph_req.json()[0]['count']

result_table = pd.DataFrame()

for result in cooccur_table:
    row = {}
    row['phenotype'] = result['phenotype']
    row['p1_count'] = result['p1_count']
    row['p2_count'] = result['p2_count']
    row['intersection'] = result['intersection']
    row['ochai_coef'] = result['ochai_coef']
    p1_only = result['p1_count'] - result['intersection']
    p2_only = result['p2_count'] - result['intersection']
    matrix = [[result['intersection'], p2_only],
              [ p1_only, (disease_count - p2_only - p1_only - result['intersection'])]]
    row['p_value'] = fisher_exact(matrix)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    result_table = pd.concat([result_table, pd.DataFrame([row])], ignore_index=True)
    
result_table.sort_values(by=['ochai_coef'], ascending=False).head(10)

Unnamed: 0,intersection,ochai_coef,p1_count,p2_count,p_value,phenotype
0,27.0,0.218927,130.0,117.0,7.294206000000001e-28,Long face
1,46.0,0.208617,130.0,374.0,7.820636e-35,Downslanted palpebral fissures
2,68.0,0.194009,130.0,945.0,3.249306e-38,Microcephaly
3,45.0,0.19033,130.0,430.0,6.6682610000000005e-31,High palate
4,22.0,0.190122,130.0,103.0,5.388294e-22,Highly arched eyebrow
5,67.0,0.17056,130.0,1187.0,6.076037e-31,Short stature
6,43.0,0.170372,130.0,490.0,2.831551e-26,Low-set ears
7,42.0,0.164409,130.0,502.0,8.355355000000001e-25,Ptosis
8,48.0,0.163994,130.0,659.0,6.779898e-26,Micrognathia
9,45.0,0.163178,130.0,585.0,3.819463e-25,Cryptorchidism


#### Correcting p-values

For p-value correction we'll use Bonferroni correction.  To apply Bonferroni correction we multiply the uncorrected p-value by the number of comparisons or hypotheses we are generating.  To determine the number of possible co-occurring phenotype pairs we take the binomial coefficient, or n choose k, where n is the count of all phenotypes with at least one disease annotation and k is two.
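
The arithmetic can be sketched as follows for a hypothetical corpus of 1000 annotated phenotypes. The helper name `bonferroni`, the phenotype count, and the example p-values are assumptions for illustration. Capping the result at 1 is a common convention that this sketch adds so the corrected value remains a valid probability.

```python
from math import comb


def bonferroni(p_value, n_comparisons):
    """Multiply by the number of comparisons, capped at 1."""
    return min(1.0, p_value * n_comparisons)


n_phenotypes = 1000                     # hypothetical phenotype count
n_comparisons = comb(n_phenotypes, 2)   # unordered phenotype pairs

print(n_comparisons)                     # 499500
print(bonferroni(1e-10, n_comparisons))  # still far below 0.05
print(bonferroni(1e-3, n_comparisons))   # capped at 1.0
```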

In [86]:
# Calculate bonferroni correction
# Source: https://en.wikipedia.org/wiki/Bonferroni_correction

cypher_query = """
    MATCH (disease:disease)-[:RO:0002200]->(p:Phenotype)
    RETURN COUNT(DISTINCT(p)) as count
"""

params = {
    'cypherQuery': cypher_query,
    'limit': 1
}

scigraph_req = requests.get(scigraph_exec + ".json", params=params)
phenotype_count = scigraph_req.json()[0]['count']

def n_choose_k(n, k):
    # math.comb (Python 3.8+) avoids building and dividing very large factorials
    return math.comb(n, k)

comparisons = n_choose_k(phenotype_count, 2)

for index, row in result_table.iterrows():
    corrected_pval = row['p_value'] * comparisons
    result_table.loc[index,'corrected_pval'] = corrected_pval
    
result_table.sort_values(by=['ochai_coef'], ascending=False).head(10)

Unnamed: 0,intersection,ochai_coef,p1_count,p2_count,p_value,phenotype,corrected_pval
0,27.0,0.218927,130.0,117.0,7.294206000000001e-28,Long face,2.7361139999999996e-20
1,46.0,0.208617,130.0,374.0,7.820636e-35,Downslanted palpebral fissures,2.933582e-27
2,68.0,0.194009,130.0,945.0,3.249306e-38,Microcephaly,1.2188399999999999e-30
3,45.0,0.19033,130.0,430.0,6.6682610000000005e-31,High palate,2.5013170000000002e-23
4,22.0,0.190122,130.0,103.0,5.388294e-22,Highly arched eyebrow,2.021192e-14
5,67.0,0.17056,130.0,1187.0,6.076037e-31,Short stature,2.2791700000000002e-23
6,43.0,0.170372,130.0,490.0,2.831551e-26,Low-set ears,1.062137e-18
7,42.0,0.164409,130.0,502.0,8.355355000000001e-25,Ptosis,3.13416e-17
8,48.0,0.163994,130.0,659.0,6.779898e-26,Micrognathia,2.543193e-18
9,45.0,0.163178,130.0,585.0,3.819463e-25,Cryptorchidism,1.4327110000000003e-17


In [89]:
# Sanity check: the largest corrected p-values should no longer be significant

result_table.sort_values(by=['corrected_pval'], ascending=False).head(15)

Unnamed: 0,intersection,ochai_coef,p1_count,p2_count,p_value,phenotype,corrected_pval
37,2.0,0.124035,130.0,2.0,0.000141526,Cat cry,5308.753026
44,4.0,0.116941,130.0,9.0,2.336082e-06,Oval face,87.628273
23,4.0,0.143223,130.0,6.0,2.859711e-07,Conspicuously happy disposition,10.727002
35,5.0,0.126592,130.0,12.0,1.665228e-07,Profound global developmental delay,6.246404
40,6.0,0.120727,130.0,19.0,6.170429e-08,Narrow nose,2.314577
46,12.0,0.115524,130.0,83.0,2.375031e-10,Prominent nose,0.008909
48,14.0,0.114501,130.0,115.0,7.570281e-11,Hypoplasia of the maxilla,0.00284
47,14.0,0.115002,130.0,114.0,6.715816e-11,Downturned corners of mouth,0.002519
42,14.0,0.11761,130.0,109.0,3.621595e-11,Thick eyebrow,0.001358
49,17.0,0.114354,130.0,170.0,1.547818e-11,High forehead,0.000581


#### Normalization on phenotypic frequency
For a single disease to phenotype association, the HPOAs provide a frequency field, defined as the frequency of patients that show a particular clinical feature. Examples are Obligate, Frequent, and Occasional.

We can use this data to further weight/normalize phenotypic co-occurrence data.  For example, if two phenotypes both occur "very frequently" in the same disease, this would be weighted higher than if one phenotype occurs very frequently and the other occurs occasionally.  For each disease, we take the intersection to be the minimum of the two phenotypes' frequency weights.  In addition, the total disease count will be adjusted to account for frequency.

As a test we will set the following weights:

| Frequency     | Definition        | Weight |
|:--------------|:------------------|:-------|
| Excluded      | present in 0%     | 0      |
| Very rare     | present in 1-4%   | 0.25   |
| Occasional    | present in 5-29%  | 1.7    |
| Frequent      | present in 30-79% | 5.45   |
| Very frequent | present in 80-99% | 8.95   |
| Obligate      | present in 100%   | 10     |
| Not provided  |                   | 4      |
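
To make the weighted intersection concrete, the sketch below applies the weights from the table above to three hypothetical diseases that are annotated with both phenotypes. The disease identifiers and frequency assignments are invented for illustration; each disease contributes the smaller of the two weights instead of a flat count of 1.

```python
# Weights from the table above, keyed by frequency label
weights = {
    "Excluded": 0,
    "Very rare": 0.25,
    "Occasional": 1.7,
    "Frequent": 5.45,
    "Very frequent": 8.95,
    "Obligate": 10,
    "Not provided": 4,
}

# Hypothetical (disease, frequency of query phenotype, frequency of other phenotype)
observations = [
    ("D1", "Very frequent", "Very frequent"),
    ("D2", "Very frequent", "Occasional"),
    ("D3", "Not provided", "Frequent"),
]

# Each shared disease contributes the minimum of its two frequency weights
weighted_intersection = sum(
    min(weights[freq_p1], weights[freq_p2])
    for _, freq_p1, freq_p2 in observations
)

print(len(observations))      # unweighted intersection: 3 diseases
print(weighted_intersection)  # 8.95 + 1.7 + 4, approximately 14.65
```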

In [104]:
import pandas as pd
# neo4j.v1 is the legacy import path; driver versions 1.7+ expose GraphDatabase at the top level
from neo4j import GraphDatabase

# Frequency weight map
freq_weights = {
    'HP:0040285': 0,
    'HP:0040284': .25,
    'HP:0040283': 1.7,
    'HP:0040282': 5.45,
    'HP:0040281': 8.95,
    'HP:0040280': 10,
    'unknown':    4 # Not sure how to boost this
}

# Result table
result_table = pd.DataFrame()

# Note this query would be a lot shorter if frequencies were edge properties
cypher_query = """
      MATCH (disease:disease)-[:RO:0002200]->(p1:Node{iri:'%s'}),
            (disease)-[:RO:0002200]->(p2:Phenotype)
      WHERE p1 <> p2
      RETURN DISTINCT p1, p2, disease
      """ % phenotype

params = {
    'cypherQuery': cypher_query
}

scigraph_req = requests.get(scigraph_resolve, params=params)
resolved_query = scigraph_req.text # Resolve curies to IRIs

scigraph_bolt = "bolt://neo4j.monarchinitiative.org:443"
driver = GraphDatabase.driver(scigraph_bolt, auth=("neo4j", "neo4j"))

def get_scigraph_results(query):
    with driver.session() as session:
        with session.begin_transaction() as tx:
            for record in tx.run(query):
                yield record
                
solr = 'https://solr-dev.monarchinitiative.org/solr/golr/select/'
res_objects = []
            
for result in get_scigraph_results(resolved_query):
    row = {}
    row['query_phenotype'] = result['p1']['label']
    row['phenotype'] = result['p2']['label']
    row['disease'] = result['disease']['label']
    row['qphenotype_curie'] = result['p1']['iri'].replace("http://purl.obolibrary.org/obo/HP_", "HP:")
    row['phenotype_curie'] = result['p2']['iri'].replace("http://purl.obolibrary.org/obo/HP_", "HP:")
    row['disease_curie'] = result['disease']['iri'].replace("http://purl.obolibrary.org/obo/MONDO_", "MONDO:")
    
    params = {
      "fq": [
        'subject:"{0}"'.format(row['disease_curie']),
        'object:"{0}" OR object:"{1}"'\
            .format(row['qphenotype_curie'],row['phenotype_curie'])
      ],
      "rows": "2",
      "q": "*:*",
      "wt": "json",
      "fl": "object,frequency"
    }
    
    solr_req = requests.get(solr, params=params)
    solr_docs = solr_req.json()
    if solr_docs['response']['numFound'] > 2:
        print(solr_docs)
        raise ValueError("Unexpected number of docs")
    # Default to 'unknown' so a missing doc never reuses a value from the previous row
    freq_p1 = freq_p2 = 'unknown'
    for doc in solr_docs['response']['docs']:
        if doc['object'] == row['qphenotype_curie']:
            freq_p1 = doc.get('frequency', 'unknown')
        elif doc['object'] == row['phenotype_curie']:
            freq_p2 = doc.get('frequency', 'unknown')
    
    row['q_phenotype_frequency'] = freq_weights[freq_p1]
    row['phenotype_frequency'] = freq_weights[freq_p2]
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    result_table = pd.concat([result_table, pd.DataFrame([row])], ignore_index=True)
    
result_table.head()

Unnamed: 0,disease,disease_curie,phenotype,phenotype_curie,phenotype_frequency,q_phenotype_frequency,qphenotype_curie,query_phenotype
0,"X-linked intellectual disability, Cilliers type",MONDO:0015600,Decreased testicular size,HP:0008734,8.95,8.95,HP:0000426,Prominent nasal bridge
1,"X-linked intellectual disability, Cilliers type",MONDO:0015600,"Intellectual disability, mild",HP:0001256,8.95,8.95,HP:0000426,Prominent nasal bridge
2,"X-linked intellectual disability, Cilliers type",MONDO:0015600,Failure to thrive,HP:0001508,8.95,8.95,HP:0000426,Prominent nasal bridge
3,"X-linked intellectual disability, Cilliers type",MONDO:0015600,Small nail,HP:0001792,8.95,8.95,HP:0000426,Prominent nasal bridge
4,"X-linked intellectual disability, Cilliers type",MONDO:0015600,Abnormal facial shape,HP:0001999,8.95,8.95,HP:0000426,Prominent nasal bridge


In [107]:
len(result_table.index)

5119

In [101]:
aggregate_table = pd.DataFrame()

phenotypes = result_table['phenotype'].unique()

for pheno in phenotypes:
    group_by_pheno = result_table[result_table['phenotype'] == pheno]
    intersection = group_by_pheno.loc[:, ['phenotype_frequency', 'q_phenotype_frequency']].min(axis=1).sum()
    row = {
        'phenotype': pheno,
        'phenotype_curie': group_by_pheno.iloc[0]['phenotype_curie'],
        'intersection': intersection
    }
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    aggregate_table = pd.concat([aggregate_table, pd.DataFrame([row])], ignore_index=True)

aggregate_table.sort_values(by=['intersection'], ascending=False).head(10)
    

Unnamed: 0,intersection,phenotype,phenotype_curie
47,298.2,Global developmental delay,HP:0001263
12,283.95,Microcephaly,HP:0000252
19,280.8,Short stature,HP:0004322
60,275.1,Intellectual disability,HP:0001249
27,213.9,Micrognathia,HP:0000347
67,212.2,Downslanted palpebral fissures,HP:0000494
10,185.35,Cryptorchidism,HP:0000028
29,184.05,High palate,HP:0000218
72,180.3,Low-set ears,HP:0000369
92,172.25,Hypertelorism,HP:0000316


The top ten look similar to our original top ten list. However, we still need to normalize this data.  For the next step we will leverage our Solr cache.  Solr/Golr is useful because we can toggle between treating disease-to-phenotype annotations as flat or querying grouping classes (when applicable, such as microcephaly).
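
The weighting applied to each pivot facet in the next cell can be previewed on a single mock facet. The facet below is hypothetical (the `HP:XXXXXXX` value and all counts are placeholders) but follows the field/value/count/pivot shape of a Solr pivot facet; annotations without a frequency value do not appear in the pivot list, so they are recovered by subtraction and given the 'unknown' weight.

```python
# Frequency weight map, as defined earlier in this notebook
freq_weights = {
    'HP:0040285': 0,
    'HP:0040284': .25,
    'HP:0040283': 1.7,
    'HP:0040282': 5.45,
    'HP:0040281': 8.95,
    'HP:0040280': 10,
    'unknown':    4,
}

# Hypothetical pivot facet: 10 annotations, 5 of which carry a frequency
facet = {
    "field": "object",
    "value": "HP:XXXXXXX",  # placeholder phenotype identifier
    "count": 10,
    "pivot": [
        {"field": "frequency", "value": "HP:0040281", "count": 3},
        {"field": "frequency", "value": "HP:0040283", "count": 2},
    ],
}


def weighted_count(facet, weights):
    """Sum annotation counts weighted by frequency class."""
    pivots = facet.get("pivot", [])
    with_frequency = sum(p["count"] for p in pivots)
    # Annotations lacking a frequency get the 'unknown' weight
    total = (facet["count"] - with_frequency) * weights["unknown"]
    total += sum(p["count"] * weights[p["value"]] for p in pivots)
    return total


print(weighted_count(facet, freq_weights))  # 5*4 + 3*8.95 + 2*1.7, about 50.25
```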

In [102]:
import math

# Pull down whole pivot table
solr = 'https://solr-dev.monarchinitiative.org/solr/golr/select/'
params = {
  "facet.pivot": "object,frequency",
  "fq": [
    "subject_category:disease",
    "object_category:phenotype"
  ],
  "rows": "0",
  "q": "*:*",
  "facet.limit": "12000", # Should get all of HPO
  "f.object_closure.facet.prefix": "HP",
  "facet.method": "enum",
  "facet.mincount": "1",
  "facet": "true",
  "wt": "json",
  "facet.sort": "count"
}

solr_req = requests.get(solr, params=params)
pivot_table = solr_req.json()

aggregate_table['p2_count'] = 0
phenotype_ids = result_table['phenotype_curie'].unique()


def calculate_weighted_frequency(facet):
    count = (int(facet['count']) - sum([freq['count'] for freq in facet['pivot']])) * freq_weights['unknown']
    for freq in facet['pivot']:
        count += (freq['count'] * freq_weights[freq['value']])
    return count

for facet in pivot_table['facet_counts']['facet_pivot']['object,frequency']:
    if facet['value'] == phenotype:
        if 'pivot' in facet:
            count = calculate_weighted_frequency(facet)
        else:
            count = int(facet['count']) * freq_weights['unknown']
        aggregate_table['p1_count'] = count
    elif facet['value'] in phenotype_ids:
        if 'pivot' in facet:
            count = calculate_weighted_frequency(facet)
        else:
            count = int(facet['count']) * freq_weights['unknown']
        aggregate_table.loc[(aggregate_table['phenotype_curie'] == facet['value']), "p2_count"] = count

def calculate_jaccard(intersection, count1, count2):
    return intersection / ((count1 + count2) - intersection)

def calcuate_ochiai(intersection, count1, count2):
    return intersection / math.sqrt(count1 * count2)


aggregate_table['jaccard_sim'] = aggregate_table.apply(
        func=lambda row: calculate_jaccard(
                              row['intersection'],
                              row['p1_count'],
                              row['p2_count']),
        axis=1
)
    
aggregate_table['ochiai_coeff'] = aggregate_table.apply(
        func=lambda row: calcuate_ochiai(
                              row['intersection'],
                              row['p1_count'],
                              row['p2_count']),
        axis=1
)

aggregate_table.sort_values(by=['jaccard_sim'], ascending=False).head(15)

Unnamed: 0,intersection,phenotype,phenotype_curie,p2_count,p1_count,jaccard_sim,ochiai_coeff
366,104.2,Long face,HP:0000276,518.7,647.5,0.098117,0.1798
67,212.2,Downslanted palpebral fissures,HP:0000494,1796.65,647.5,0.095074,0.19674
29,184.05,High palate,HP:0000218,1885.75,647.5,0.078346,0.166561
116,68.9,Low anterior hairline,HP:0000294,302.8,647.5,0.078171,0.155604
68,88.05,Thin vermilion border,HP:0000233,571.0,647.5,0.077889,0.144808
16,116.15,Macrotia,HP:0000400,970.85,647.5,0.07732,0.146495
106,96.05,Short philtrum,HP:0000322,720.4,647.5,0.07552,0.140634
71,116.75,Narrow mouth,HP:0000160,1020.75,647.5,0.07525,0.143608
187,99.25,Posteriorly rotated ears,HP:0000358,813.4,647.5,0.07289,0.13676
124,74.65,Highly arched eyebrow,HP:0002553,461.75,647.5,0.072153,0.136523


In [103]:
aggregate_table.sort_values(by=['ochiai_coeff'], ascending=False).head(15)

Unnamed: 0,intersection,phenotype,phenotype_curie,p2_count,p1_count,jaccard_sim,ochiai_coeff
67,212.2,Downslanted palpebral fissures,HP:0000494,1796.65,647.5,0.095074,0.19674
366,104.2,Long face,HP:0000276,518.7,647.5,0.098117,0.1798
29,184.05,High palate,HP:0000218,1885.75,647.5,0.078346,0.166561
12,283.95,Microcephaly,HP:0000252,4489.9,647.5,0.058505,0.166534
116,68.9,Low anterior hairline,HP:0000294,302.8,647.5,0.078171,0.155604
91,164.1,Wide nasal bridge,HP:0000431,1886.15,647.5,0.069254,0.148491
72,180.3,Low-set ears,HP:0000369,2308.3,647.5,0.064961,0.147479
16,116.15,Macrotia,HP:0000400,970.85,647.5,0.07732,0.146495
68,88.05,Thin vermilion border,HP:0000233,571.0,647.5,0.077889,0.144808
27,213.9,Micrognathia,HP:0000347,3391.25,647.5,0.055924,0.144348


For this example, adding frequency data does not seem to affect the results much, with the exception that _low anterior hairline_ is ranked higher in the frequency-aware lists.

Follow up questions:
- What genes are associated with this group of phenotypes?
- What processes and pathways are associated with these genes?
