# Workflow 2, Module 1 (condition similarity)

One approach to this module is to avoid specifying the subquestions too tightly (enrichments, archetypes, and so on) and simply pass the question to ROBOKOP, letting its scoring bring the best answers to the top.  Here we will use the quick service to start with a disease, find its associated genes, follow those genes to the biological processes they take part in, and from there find other genes that share those processes. The answers come out ranked by the score of each path.

For more details, see the "quick" notebook in greengamma/general.

First, we'll load a function that calls the quick service, along with helpers for parsing its output, and define a function that builds the question.  Then we'll create the question, run it, and display the results for one example: Fanconi anemia.

In [1]:
#Load some functions for calling the quick service and parsing its output

import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from gg_functions import parse_answer, quick, expand

The basic machine question created below goes from a disease to a gene, then to a set of biological processes, and finally to a second gene.  Making the biological processes a set allows many processes to connect the first gene to the second within a single answer.

In [2]:
def create_basic_question(disease_id):
    return {
    "machine_question": {
        "nodes": [
            {
                "id": "n0",
                "type": "disease",
                "curie": disease_id
            },
            {
                "id": "n1",
                "type": "gene"
            },
            {
                "id": "n2",
                "type": "biological_process",
                "set": True
            },
            {
                "id": "n3",
                "type": "gene"
            }
        ],
        "edges": [
            {
                "id": "e0",
                "source_id": "n0",
                "target_id": "n1"
            },
            {
                "id": "e1",
                "source_id": "n1",
                "target_id": "n2"
            },
            {
                "id": "e2",
                "source_id": "n2",
                "target_id": "n3"
            }
        ]
    }
}
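
Before sending a question, it can help to confirm that every edge refers to a node that actually exists and to read off the path the edges describe. The helper below is a minimal sketch: `check_question` is a hypothetical name, not part of `gg_functions`, and it assumes only the dictionary layout shown above.

```python
def check_question(question):
    """Return the (source type, target type) pair for each edge.

    Hypothetical helper; assumes the layout
    {"machine_question": {"nodes": [...], "edges": [...]}}.
    Raises ValueError if an edge references an unknown node id.
    """
    mq = question["machine_question"]
    node_types = {node["id"]: node["type"] for node in mq["nodes"]}
    hops = []
    for edge in mq["edges"]:
        for end in ("source_id", "target_id"):
            if edge[end] not in node_types:
                raise ValueError(
                    f"edge {edge['id']} references unknown node {edge[end]}"
                )
        hops.append((node_types[edge["source_id"]], node_types[edge["target_id"]]))
    return hops


# A compact copy of the basic question, written out for illustration.
example = {
    "machine_question": {
        "nodes": [
            {"id": "n0", "type": "disease", "curie": "MONDO:0019391"},
            {"id": "n1", "type": "gene"},
            {"id": "n2", "type": "biological_process", "set": True},
            {"id": "n3", "type": "gene"},
        ],
        "edges": [
            {"id": "e0", "source_id": "n0", "target_id": "n1"},
            {"id": "e1", "source_id": "n1", "target_id": "n2"},
            {"id": "e2", "source_id": "n2", "target_id": "n3"},
        ],
    }
}

hops = check_question(example)
for source_type, target_type in hops:
    print(f"{source_type} -> {target_type}")
```

The printed hops make the shape of the question explicit: disease to gene, gene to biological process, biological process to gene.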

In [3]:
fanconi = 'MONDO:0019391' #fanconi anemia

## What are the Fanconi Genes?

In [13]:
fanconi_expand = expand('disease',fanconi,'gene')
fanconi_genes = parse_answer(fanconi_expand,node_list=['n1'])
print(f"We found {len(fanconi_genes)} genes associated with {fanconi}. Here are the first few:")
fanconi_genes.head()

http://robokop.renci.org:80/api/simple/expand/disease/MONDO:0019391/gene
Return Status: 200
We found 48 genes associated with MONDO:0019391. Here are the first few:


|   | n1 - id | n1 - name | score |
|---|---|---|---|
| 0 | HGNC:3584 | FANCC | 1.777429 |
| 1 | HGNC:20473 | BRIP1 | 1.767911 |
| 2 | HGNC:3588 | FANCG | 1.767064 |
| 3 | HGNC:9820 | RAD51C | 1.717592 |
| 4 | HGNC:3582 | FANCA | 1.627382 |


## What genes are functionally similar to the Fanconi genes?

Let's build a question of the type above and run it.  The max_connectivity option sets a maximum degree for any node in the path; it limits how long the query takes to run and controls the specificity of the results. 1000 is a reasonable general-purpose value.

In [8]:
machine_question = create_basic_question(fanconi)
answer = quick(machine_question,max_connectivity=1000)

Return Status: 200
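
To build some intuition for why a degree cap improves specificity, here is a toy sketch of degree-based pruning. It is not ROBOKOP's implementation, which is not shown in this notebook; `prune_by_degree` and the node names are hypothetical. The idea is that a very highly connected node (a "hub") links almost everything to everything else, so paths through it carry little information.

```python
def prune_by_degree(adjacency, max_connectivity):
    """Drop nodes with more than max_connectivity neighbours.

    Hypothetical illustration only. Degrees are measured on the
    original graph; edges touching a dropped node are removed too.
    """
    keep = {
        node for node, neighbours in adjacency.items()
        if len(neighbours) <= max_connectivity
    }
    return {
        node: sorted(n for n in neighbours if n in keep)
        for node, neighbours in adjacency.items()
        if node in keep
    }


# Toy graph: one specific process shared by two genes, one hub shared by all four.
toy = {
    "gene_a": {"specific_process", "hub_process"},
    "gene_b": {"specific_process", "hub_process"},
    "gene_c": {"hub_process"},
    "gene_d": {"hub_process"},
    "specific_process": {"gene_a", "gene_b"},
    "hub_process": {"gene_a", "gene_b", "gene_c", "gene_d"},
}

pruned = prune_by_degree(toy, max_connectivity=3)
for node, neighbours in pruned.items():
    print(node, neighbours)
```

With the cap set to 3, the hub is removed and only the link between `gene_a` and `gene_b` through the specific process survives. A lower cap gives fewer, more specific paths; a higher cap gives more paths at the cost of run time.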


We don't want the answer to include already-known FA genes from the list above, so let's filter those out of the results:

In [9]:
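df = parse_answer(answer,node_list=['n1','n3'])
# IDs of the genes already associated with FA (found above)
v = fanconi_genes['n1 - id '].values
# Keep only rows whose candidate gene (n3) is not an already-known FA gene
new_df = df[ ~ df['n3 - id '].isin(v) ]

new_df.head(50)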

|   | n1 - id | n1 - name | n3 - id | n3 - name | score |
|---|---|---|---|---|---|
| 8 | HGNC:3436 | ERCC4 | HGNC:3433 | ERCC1 | 1.41432 |
| 11 | HGNC:1058 | BLM | HGNC:12791 | WRN | 1.262387 |
| 15 | HGNC:3436 | ERCC4 | HGNC:10289 | RPA1 | 1.206281 |
| 16 | HGNC:9817 | RAD51 | HGNC:7325 | MSH2 | 1.186826 |
| 21 | HGNC:3436 | ERCC4 | HGNC:3435 | ERCC3 | 1.140517 |
| 23 | HGNC:9817 | RAD51 | HGNC:12791 | WRN | 1.124418 |
| 25 | HGNC:9817 | RAD51 | HGNC:10290 | RPA2 | 1.105026 |
| 26 | HGNC:9817 | RAD51 | HGNC:2927 | DMC1 | 1.093636 |
| 27 | HGNC:9820 | RAD51C | HGNC:9822 | RAD51B | 1.091573 |
| 28 | HGNC:25009 | UBE2T | HGNC:12473 | UBE2B | 1.080033 |
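
The filtering step can be tried in isolation on a small table. The sketch below uses toy rows with placeholder identifiers (`GENE:A`, `GENE:X`, and so on are made up, not ROBOKOP output) and keeps the column names used above, including their trailing space. The final de-duplication step is an optional extra that is not part of the original notebook: the same candidate gene can appear in several rows, once per linking FA gene.

```python
import pandas as pd

# Toy data with placeholder identifiers (not real ROBOKOP output).
known = pd.DataFrame({"n1 - id ": ["GENE:A", "GENE:B"]})
results = pd.DataFrame(
    {
        "n1 - id ": ["GENE:A", "GENE:A", "GENE:B", "GENE:B"],
        "n3 - id ": ["GENE:B", "GENE:X", "GENE:X", "GENE:Y"],
        "score": [1.5, 1.2, 0.9, 0.7],
    }
)

# Same filter as above: drop rows whose candidate (n3) is already a known gene.
known_ids = known["n1 - id "].values
novel = results[~results["n3 - id "].isin(known_ids)]

# Optional extra step (an assumption, not in the original notebook):
# keep only the best-scoring row for each candidate gene.
best = novel.sort_values("score", ascending=False).drop_duplicates("n3 - id ")
print(best)
```

Here the first row is removed because its candidate, `GENE:B`, is already known, and the two `GENE:X` rows collapse to the higher-scoring one.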


## What genes are functionally similar to the FA genes, using functions related to FA?

We can potentially sharpen some of these answers by requiring that the biological processes are themselves connected to the input disease:

In [10]:
def create_complex_question(disease_id):
    return {
    "machine_question": {
        "nodes": [
            {
                "id": "n0",
                "type": "disease",
                "curie": disease_id
            },
            {
                "id": "n1",
                "type": "gene"
            },
            {
                "id": "n2",
                "type": "biological_process",
                "set": True
            },
            {
                "id": "n3",
                "type": "gene"
            }
        ],
        "edges": [
            {
                "id": "e0",
                "source_id": "n0",
                "target_id": "n1"
            },
            {
                "id": "e1",
                "source_id": "n1",
                "target_id": "n2"
            },
            {
                "id": "e2",
                "source_id": "n2",
                "target_id": "n3"
            },
            {
                "id": "e3",
                "source_id": "n0",
                "target_id": "n2"
            }
        ]
    }
}
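
The only difference between the two questions is the extra edge `e3`, which links the disease (`n0`) directly to the set of biological processes (`n2`). As a sketch of that relationship, the complex question can also be derived from the basic one. `add_edge` is a hypothetical helper, and the basic question is written out compactly here so the example runs on its own.

```python
import copy


def add_edge(question, source_id, target_id):
    """Return a copy of the question with one extra edge (hypothetical helper)."""
    new_question = copy.deepcopy(question)  # leave the original untouched
    edges = new_question["machine_question"]["edges"]
    edges.append(
        {"id": f"e{len(edges)}", "source_id": source_id, "target_id": target_id}
    )
    return new_question


# Compact copy of the basic question from earlier in the notebook.
basic = {
    "machine_question": {
        "nodes": [
            {"id": "n0", "type": "disease", "curie": "MONDO:0019391"},
            {"id": "n1", "type": "gene"},
            {"id": "n2", "type": "biological_process", "set": True},
            {"id": "n3", "type": "gene"},
        ],
        "edges": [
            {"id": "e0", "source_id": "n0", "target_id": "n1"},
            {"id": "e1", "source_id": "n1", "target_id": "n2"},
            {"id": "e2", "source_id": "n2", "target_id": "n3"},
        ],
    }
}

# Require the processes to be connected to the disease as well.
complex_question = add_edge(basic, "n0", "n2")
print(complex_question["machine_question"]["edges"][-1])
```

Writing the second question as "basic plus one edge" makes it harder for the two definitions to drift apart if the basic question is edited later.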

In [11]:
machine_question_2 = create_complex_question(fanconi)
answer_2 = quick(machine_question_2,max_connectivity=1000)

Return Status: 200


In [12]:
df_2 = parse_answer(answer_2,node_list=['n1','n3'])
# Same filter as before: remove candidates (n3) that are already-known FA genes
v = fanconi_genes['n1 - id '].values
new_df_2 = df_2[ ~ df_2['n3 - id '].isin(v) ]

new_df_2.head(50)

|   | n1 - id | n1 - name | n3 - id | n3 - name | score |
|---|---|---|---|---|---|
| 0 | HGNC:9817 | RAD51 | HGNC:12791 | WRN | 1.044402 |
| 1 | HGNC:9817 | RAD51 | HGNC:17008 | TOPBP1 | 1.035648 |
| 3 | HGNC:9817 | RAD51 | HGNC:13824 | MMS19 | 0.932042 |
| 6 | HGNC:9817 | RAD51 | HGNC:11102 | SMARCAL1 | 0.867645 |
| 12 | HGNC:9817 | RAD51 | HGNC:882 | ATR | 0.797899 |
| 13 | HGNC:20748 | FANCL | HGNC:882 | ATR | 0.797263 |
| 15 | HGNC:9817 | RAD51 | HGNC:7230 | MRE11 | 0.784628 |
| 16 | HGNC:20748 | FANCL | HGNC:7230 | MRE11 | 0.782028 |
| 17 | HGNC:9817 | RAD51 | HGNC:9816 | RAD50 | 0.777154 |
| 18 | HGNC:9817 | RAD51 | HGNC:12830 | XRCC3 | 0.77466 |


There is a reasonable amount of overlap between this answer and the previous one, but some very highly ranked genes are new to this list.  For instance, the second-ranked result, TOPBP1, does not appear in the first list at all.
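
A comparison like this can be made explicit with set operations. The sketch below uses only the candidate gene names from the rows displayed above, transcribed by hand, so it is illustrative only; running the same logic on the `n3 - name` columns of the full `new_df` and `new_df_2` frames would give the complete picture. The column name `n3 - name` is taken from the displayed output, and `compare_candidates` is a hypothetical helper.

```python
def compare_candidates(first, second):
    """Return (shared, new_in_second), preserving the ranking order of `second`."""
    first_set = set(first)
    shared, new_in_second = [], []
    for name in second:
        target = shared if name in first_set else new_in_second
        if name not in target:  # skip repeats of the same candidate
            target.append(name)
    return shared, new_in_second


# Candidate (n3) names from the displayed rows only, in ranked order.
first_shown = ["ERCC1", "WRN", "RPA1", "MSH2", "ERCC3",
               "WRN", "RPA2", "DMC1", "RAD51B", "UBE2B"]
second_shown = ["WRN", "TOPBP1", "MMS19", "SMARCAL1", "ATR",
                "ATR", "MRE11", "MRE11", "RAD50", "XRCC3"]

shared, new_in_second = compare_candidates(first_shown, second_shown)
print("shared:", shared)
print("new in second list:", new_in_second)
```

Among the displayed rows, WRN is the candidate common to both lists, while TOPBP1 heads the candidates that appear only after the disease-to-process edge is added.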