#### Leveraging glycophenotypes for diagnosis of human diseases: example of Fucosidosis


We hypothesize that expanding the representation of glycophenotypes in the HPO, and using them in disease annotations, could improve phenotype-based comparisons in the Monarch Initiative. Here we test this with phenotypes associated with the disease fucosidosis (MONDO:0009254). First, we generate 100 "simulated" disease phenotype profiles by randomly sampling 10 of the 16 hallmark phenotypes associated with fucosidosis. In a second set, for each profile, we randomly replace two phenotypes with two glycophenotypes known to be associated with fucosidosis. Both sets are compared using the OwlSim algorithm from the Monarch Initiative, which, given a list of phenotypes, ranks candidate diseases by phenotypic similarity.

This notebook walks through the analysis twice: a first run in which the 10 sampled non-glycophenotypes and the 2 glycophenotypes are provided, and a second run in which the entire analysis is randomized.
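
The paired-profile construction described above can be sketched in isolation. This is a minimal illustration, not the notebook's own cell: the function name `make_paired_profiles`, the `seed` argument, and the placeholder phenotype labels are assumptions introduced here so that the sampling can be reproduced without an OwlSim server.

```python
import random


def make_paired_profiles(hallmarks, glyco, n_profiles=100, sample_size=10, seed=None):
    """Build unique (baseline, with_glyco) profile pairs.

    baseline:   sample_size phenotypes drawn from hallmarks.
    with_glyco: baseline with len(glyco) phenotypes swapped out for glyco.
    """
    rng = random.Random(seed)          # private RNG so a seed gives repeatable draws
    keep = sample_size - len(glyco)    # how many baseline phenotypes survive the swap
    seen = set()
    pairs = []
    while len(pairs) < n_profiles:
        baseline = rng.sample(hallmarks, sample_size)
        key = frozenset(baseline)
        if key in seen:                # skip baseline profiles already generated
            continue
        seen.add(key)
        with_glyco = rng.sample(baseline, keep) + list(glyco)
        pairs.append((baseline, with_glyco))
    return pairs


# Placeholder labels stand in for real HPO identifiers (hypothetical data)
hallmarks = ["P{:02d}".format(n) for n in range(16)]
glyco = ["G1", "G2"]
pairs = make_paired_profiles(hallmarks, glyco, n_profiles=5, seed=42)
```

Each pair holds a 10-phenotype baseline and a 10-phenotype variant that shares 8 phenotypes with it, so any difference in rank can be attributed to the two substituted glycophenotypes.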

##### Ontologies, data, and tools

###### Ontologies
This analysis uses the following versions of ontologies:

Human Phenotype Ontology (HPO): http://purl.obolibrary.org/obo/hp/releases/2018-12-21/hp.owl

Monarch Disease Ontology (MONDO): http://purl.obolibrary.org/obo/mondo/releases/2018-12-02/mondo.owl

###### Disease to phenotype annotations

The [HPO disease to phenotype annotation file](http://compbio.charite.de/jenkins/job/hpo.annotations/lastStableBuild/artifact/misc/phenotype_annotation.tab) is not versioned; therefore, a snapshot is provided alongside this notebook:

_HPO annotations_: ../data/phenotype_annotation.tab

Disease identifiers were merged using MONDO, and the annotations were updated with the merged IDs:

_Disease to phenotype annotations_: ../data/mondo_hp.tsv

###### OwlSim
The official Monarch Initiative OwlSim endpoint is available via:
https://monarchinitiative.org/owlsim/

Alternatively, we provide a Dockerfile to run OwlSim locally with the above ontology and annotation files. To run:

    cd owlsim-docker
    docker build ./ -t owlsim-slim
    docker run -d -p 9031:9031 --name owlsim owlsim-slim
    
    # Test that owlsim is running (may take 60 seconds to start)
    # http://localhost:9031/compareAttributeSets?a=HP:0010539&b=HP:0006989&b=HP:0000219&b=HP:0000248
    
    # To stop the container
    docker stop owlsim
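
Because the container may take a minute or so to start, it can help to poll the test URL before running the analysis. The following is a minimal sketch, assuming the test URL shown in the commands above; the helper names `http_ok` and `wait_until_ready`, and the timeout and interval defaults, are hypothetical and not part of the notebook.

```python
import time
import urllib.request

# Test URL copied from the instructions above
TEST_URL = ("http://localhost:9031/compareAttributeSets"
            "?a=HP:0010539&b=HP:0006989&b=HP:0000219&b=HP:0000248")


def http_ok(url):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError, refused connections and socket timeouts are all OSError subclasses
        return False


def wait_until_ready(url, probe=http_ok, timeout=120, interval=5, sleep=time.sleep):
    """Poll url with probe until it succeeds or about timeout seconds have been waited."""
    waited = 0
    while True:
        if probe(url):
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

Calling `wait_until_ready(TEST_URL)` returns `True` once the server answers and `False` if it is still unreachable after the timeout. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.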


##### Running this notebook
This is a Python 3 notebook; it assumes that the environment running Jupyter has the following Python libraries installed:
- requests
- rdflib
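
A quick way to confirm that the environment meets these assumptions is to look the modules up before running the analysis. This is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the names in `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["requests", "rdflib"])
if missing:
    print("Missing libraries: " + ", ".join(missing))
```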



import random
import requests
import csv

# If not using Docker, use the public Monarch endpoint instead:
# owlsim_url = 'https://monarchinitiative.org/owlsim/'
owlsim_url = 'http://localhost:9031/owlsim/'

fucosidosis = 'MONDO:0009254'

# I/O
# Get all disease to phenotype associations
diseases = {
    'MONDO:0009254'
}
d2p = {}
with open('../data/mondo_hp.tsv', 'r') as cache_file:
    reader = csv.reader(cache_file, delimiter='\t', quotechar='\"')
    for row in reader:
        if row[0].startswith('#'): continue
        (mondo_id, phenotype_id) = row[0:2]
        if mondo_id in diseases:
            if mondo_id in d2p:
                d2p[mondo_id].add(phenotype_id)
            else:
                d2p[mondo_id] = {phenotype_id}


output = open('./output/fucosidosis-analysis.tsv', 'w')

headers = [
    "sample_score",
    "sample_rank",
    "glyco_score",
    "glyco_rank",
    "sample",
    "w_glyco"
]

output.write("{}\n".format("\t".join(headers)))

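def get_sim_and_rank(pheno_list, disease, owlsim):
    """Return (score, rank) of `disease` in the OwlSim results for pheno_list.

    Ranks are dense: results with equal integer scores share a rank.
    Both values are 'NA' if the disease is not in the result list.
    """
    search_url = owlsim + "searchByAttributeSet"
    params = {
        'a': list(pheno_list),
        'target': 'MONDO'
    }
    sim_req = requests.post(search_url, data=params)
    sim_req.raise_for_status()  # fail early on HTTP errors
    sim_results = sim_req.json()
    sim_list = sim_results['results']
    rank = 1
    last_score = -1
    sample_rank = 'NA'
    sample_score = 'NA'
    for res in sim_list:
        score = int(res["combinedScore"])
        # Advance the rank whenever the integer score drops,
        # before testing whether this result is the target disease
        if score < last_score:
            rank += 1
        if res["j"]["id"] == disease:
            sample_rank = rank
            sample_score = res["combinedScore"]
            break
        last_score = score

    return sample_score, sample_rank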

# Hallmark features of fucosidosis
phenotypes = [
    'HP:0010864',
    'HP:0001263',
    'HP:0000975',
    'HP:0008430',
    'HP:0100578',
    'HP:0000280',
    'HP:0002808',
    'HP:0000248',
    'HP:0000943',
    'HP:0000821',
    'HP:0011220',
    'HP:0002240',
    'HP:0011276',
    'HP:0000365',
    'HP:0005595',
    'HP:0001508'
]

glyco_phenotypes = [
    'HP:0010471',
    'HP:0003541'
]

pheno_profile = d2p[fucosidosis]

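seen = set()
i = 0
while i < 100:
    # Draw 10 of the 16 hallmark phenotypes; skip profiles already used
    sample = frozenset(random.sample(phenotypes, 10))
    if sample in seen:
        continue
    seen.add(sample)

    sample_sim, sample_rank = get_sim_and_rank(sample, fucosidosis, owlsim_url)
    # Keep 8 of the 10 phenotypes and add the two glycophenotypes.
    # random.sample() requires a sequence, so sort the frozenset first.
    sub_sample = random.sample(sorted(sample), 8)
    w_glyco = sub_sample + glyco_phenotypes
    sim, rank = get_sim_and_rank(w_glyco, fucosidosis, owlsim_url)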

    output.write("{}\t{}\t{}\t{}\t{}\t{}\n".format(
        sample_sim,
        sample_rank,
        sim,
        rank,
        "|".join(sample),
        "|".join(w_glyco),
    ))
    i += 1
output.close()
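
Once the output file has been written, the two sets of ranks can be compared profile by profile. The sketch below is one possible summary, not part of the original analysis: the function name `summarize` and the example rows are made up for illustration, and only the column names come from the `headers` list above.

```python
import csv
import io


def summarize(tsv_file):
    """Count profiles whose fucosidosis rank improved, tied, or worsened."""
    counts = {"improved": 0, "unchanged": 0, "worsened": 0, "skipped": 0}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        try:
            base = int(row["sample_rank"])
            glyco = int(row["glyco_rank"])
        except ValueError:
            # 'NA' means the disease was missing from one of the result lists
            counts["skipped"] += 1
            continue
        if glyco < base:               # a lower rank number is a better rank
            counts["improved"] += 1
        elif glyco == base:
            counts["unchanged"] += 1
        else:
            counts["worsened"] += 1
    return counts


# Made-up rows in the output format, for illustration only
example = io.StringIO(
    "sample_score\tsample_rank\tglyco_score\tglyco_rank\tsample\tw_glyco\n"
    "70\t3\t75\t1\tA|B\tA|G\n"
    "70\t2\t70\t2\tA|C\tC|G\n"
    "NA\tNA\t60\t4\tB|C\tB|G\n"
)
result = summarize(example)
print(result)
```

To apply it to the real results, open `./output/fucosidosis-analysis.tsv` in text mode and pass the file object to `summarize`.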