# Using ROBOKOP's Quick Question Answering Service

The expand, enrich, and similarity services each offer bite-sized approaches to interacting with ROBOKOP. The ROBOKOP "quick" service offers a slightly more complex approach to answering questions. We'll use the following function to call the quick service. Calling the service requires posting a question; the format of the question is discussed below.

In [1]:
robokop_server='robokop.renci.org'

In [2]:
import requests

def quick(question,max_results=None,output_format=None):
    url=f'http://{robokop_server}:80/api/simple/quick/'
    if max_results is not None:
        url += f'?max_results={max_results}'
    if output_format is not None:
        j = '&' if '?' in url else '?'
        url += f'{j}output_format={output_format}'
    response = requests.post(url,json=question)
    print( f"Return Status: {response.status_code}" )
    if response.status_code == 200:
        return response.json()
    return response
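
The `?`/`&` bookkeeping in `quick` can also be delegated to the standard library. The following is a minimal sketch, assuming the same endpoint path and the same two query parameters used above; `build_quick_url` is a hypothetical helper name, not part of ROBOKOP.

```python
from urllib.parse import urlencode

def build_quick_url(server, max_results=None, output_format=None):
    """Build the quick-service URL, adding only the parameters that were given.

    Hypothetical helper: it mirrors the URL logic of quick() above.
    """
    base = f'http://{server}:80/api/simple/quick/'
    params = {}
    if max_results is not None:
        params['max_results'] = max_results
    if output_format is not None:
        params['output_format'] = output_format
    # urlencode joins the pairs with '&' and escapes values where needed
    return f'{base}?{urlencode(params)}' if params else base

print(build_quick_url('robokop.renci.org', max_results=10, output_format='dense'))
```

Building the URL separately from posting also makes it easy to inspect the request before sending it.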

## Question Format and Basic Usage

The question is a Python dictionary. It has a key `machine_question`, whose value is a dictionary containing a list of `nodes` and a list of `edges`. Each node object needs a unique `id`; the examples here use strings such as `n0`. Any node can also have a `curie` that pins it to a specific entity. The edges define connections between the nodes, using the node ids as `source_id` and `target_id`. The following function shows how to construct a single-hop question. It starts from a specified node of `type1` and looks for any connected node of `type2`.

In [3]:
def make_one_step_question(type1, id1, type2,rebuild = None):
    question = {
                'machine_question': {
                    'nodes': [
                        {
                            'id': 'n0',
                            'curie': id1,
                            'type': type1
                        },
                        {
                            'id': 'n1',
                            'type': type2
                        }
                    ],
                    'edges': [
                        {
                            'id': 'e0',
                            'source_id': 'n0',
                            'target_id': 'n1'
                        }
                    ]
                }
            }
    if rebuild is not None and str(rebuild).upper() == 'TRUE':
        question['rebuild'] = 'True'
    return question
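
Because edges refer to nodes by id, a mistyped id is an easy mistake to make when writing questions by hand. The sketch below checks a question dictionary for duplicate node ids and for edges that point at unknown nodes. `check_question` is a hypothetical helper, and it only checks the structure described above; it does not reproduce any validation the server may perform.

```python
def check_question(question):
    """Return a list of problems found in a question dict (empty if none)."""
    problems = []
    mq = question.get('machine_question', {})
    nodes = mq.get('nodes', [])
    edges = mq.get('edges', [])
    node_ids = [n.get('id') for n in nodes]
    if len(set(node_ids)) != len(node_ids):
        problems.append('node ids are not unique')
    for e in edges:
        for end in ('source_id', 'target_id'):
            if e.get(end) not in node_ids:
                problems.append(f"edge {e.get('id')} has unknown {end}: {e.get(end)}")
    return problems

example = {
    'machine_question': {
        'nodes': [{'id': 'n0', 'curie': 'MONDO:0019391', 'type': 'disease'},
                  {'id': 'n1', 'type': 'phenotypic_feature'}],
        # 'n2' is a deliberate mistake: no node has that id
        'edges': [{'id': 'e0', 'source_id': 'n0', 'target_id': 'n2'}],
    }
}
print(check_question(example))
```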

Here, we will specify a one-hop question asking for the phenotypes associated with Fanconi Anemia (MONDO:0019391).  We first construct the question then use it to call the quick service.  In fact, this is how the expand function is implemented internally.  The output is returned in the KG standard answer format.

In [4]:
q = make_one_step_question('disease','MONDO:0019391','phenotypic_feature')
r = quick(q)

Return Status: 200


In [5]:
# Uncomment to see a big long answer
import json
#print( json.dumps(r, indent=4))

The output can be simplified with a function like this:

In [6]:
import pandas as pd

def parse_answer(returnanswer):
    #First, parse out the parts of the kg that we want, names and types
    kg_node_names = { n['id']: n['name'] if 'name' in n else n['id'] for n in returnanswer['knowledge_graph']['nodes'] }
    kg_edge_types = { e['id']: e['type'] for e in returnanswer['knowledge_graph']['edges']}
    kg_edge_sources = { e['id']: e['edge_source'] for e in returnanswer['knowledge_graph']['edges']}
    #now put the answers into a table, looking up node/edge info from the above dicts
    #because of the question, we know that we're interested in the binding for "n1" and "e0"
    #this is quick-n-dirty, so we're also limiting ourselves to the first edge, even if there is more than one
    answers = [ {"result_id": answer['node_bindings']['n1'],
                 "result_name": kg_node_names[answer['node_bindings']['n1']],
                 "type":        kg_edge_types[answer['edge_bindings']['e0'][0]],
                 "source":      kg_edge_sources[answer['edge_bindings']['e0'][0]],
                 "score" :      answer['score']}
              for answer in returnanswer['answers']]
    return pd.DataFrame(answers)


In [7]:
parse_answer(r)

|   | result_id | result_name | score | source | type |
|---|---|---|---|---|---|
| 0 | HP:0005505 | Refractory anemia | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
| 1 | HP:0100615 | Ovarian neoplasm | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
| 2 | HP:0002488 | Acute leukemia | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
| 3 | HP:0002860 | Squamous cell carcinoma | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
| 4 | HP:0004430 | Severe combined immunodeficiency | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
| 5 | HP:0002721 | Immunodeficiency | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
| 6 | HP:0001402 | Hepatocellular carcinoma | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
| 7 | HP:0000286 | Epicanthus | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
| 8 | HP:0003002 | Breast carcinoma | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
| 9 | HP:0004377 | Hematological neoplasm | 1.619721 | biolink.disease_get_phenotype | has_phenotype |
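
`parse_answer` keeps only the first edge bound to `e0`, even when an answer is supported by several. The sketch below is one possible variant that emits a row per supporting edge. It returns plain dictionaries rather than a DataFrame so that it runs without pandas, and the `toy` response uses made-up identifiers and values purely for illustration; `parse_answer_all_edges` is a hypothetical name.

```python
def parse_answer_all_edges(returnanswer, node_key='n1', edge_key='e0'):
    """Return one row (dict) per supporting edge, instead of only the first edge."""
    kg_node_names = {n['id']: n.get('name', n['id'])
                     for n in returnanswer['knowledge_graph']['nodes']}
    kg_edges = {e['id']: e for e in returnanswer['knowledge_graph']['edges']}
    rows = []
    for answer in returnanswer['answers']:
        node_id = answer['node_bindings'][node_key]
        for edge_id in answer['edge_bindings'][edge_key]:
            edge = kg_edges[edge_id]
            rows.append({'result_id': node_id,
                         'result_name': kg_node_names[node_id],
                         'type': edge['type'],
                         'source': edge['edge_source'],
                         'score': answer['score']})
    return rows

# Hand-made response: identifiers, names, and the score are invented for this example.
toy = {
    'knowledge_graph': {
        'nodes': [{'id': 'X:1', 'name': 'disease x'},
                  {'id': 'Y:1', 'name': 'phenotype y'}],
        'edges': [{'id': 'a', 'type': 'has_phenotype', 'edge_source': 'source_a'},
                  {'id': 'b', 'type': 'has_phenotype', 'edge_source': 'source_b'}],
    },
    'answers': [{'node_bindings': {'n0': 'X:1', 'n1': 'Y:1'},
                 'edge_bindings': {'e0': ['a', 'b']},
                 'score': 1.0}],
}
for row in parse_answer_all_edges(toy):
    print(row)
```

Passing the returned list to `pd.DataFrame` gives a table in the same shape as the one above, with extra rows for answers that have more than one supporting edge.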


## Multi-step questions

It's straightforward to generalize this linear query to an N-node path. This function constructs a question from a list of types and a list of identifiers. The types are the types of the nodes that will be traversed along the path, and the identifiers represent fixed elements in the path. The `types` and `curies` lists should have equal length, and free nodes should specify a curie of `None`:

In [8]:
def make_N_step_question(types,curies,rebuild = None):
    question = {
                'machine_question': {
                    'nodes': [],
                    'edges': []
                }
            }
    if rebuild is not None and str(rebuild).upper() == 'TRUE':
        question['rebuild'] = 'True'
    ei = 0
    for i,t in enumerate(types):
        newnode = {'id': f'n{i}', 'type': t}
        if curies[i] is not None:
            newnode['curie'] = curies[i]
        question['machine_question']['nodes'].append(newnode)
        if i > 0:
            question['machine_question']['edges'].append( {'id': f'e{ei}','source_id': f'n{i-1}', 'target_id': f'n{i}'})
            ei += 1
    return question
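
As written, `make_N_step_question` raises an `IndexError` when `curies` is shorter than `types`, and silently ignores the extra entries when it is longer. A small guard, sketched below under the hypothetical name `check_path_inputs`, makes the equal-length requirement explicit before the question is built.

```python
def check_path_inputs(types, curies):
    """Raise ValueError unless types and curies describe the same number of nodes."""
    if len(types) != len(curies):
        raise ValueError(
            f'{len(types)} types but {len(curies)} curies; use None for free nodes')
    return True

# Equal lengths: passes silently
check_path_inputs(['disease', 'gene'], ['MONDO:0019391', None])

# Mismatched lengths: the error message explains what to fix
try:
    check_path_inputs(['disease', 'gene', 'biological_process_or_activity'],
                      ['MONDO:0019391'])
except ValueError as err:
    print(err)
```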

We can recapitulate our previous question with this new function like this:

In [9]:
newq = make_N_step_question(['disease','phenotypic_feature'],['MONDO:0019391',None])
q == newq

True

Now we could expand to a longer query.  The following question will start at the disease MONDO:0019391, go to a gene, and from there to a biological process or activity.

In [10]:
two_step_question = make_N_step_question(['disease','gene','biological_process_or_activity'],['MONDO:0019391',None,None])

In [11]:
two_step_answer = quick(two_step_question)

Return Status: 200


We can extract the node names along the paths with a simple function:

In [12]:
def extract_node_names(returnanswer):
    #make a dict from identifiers (like MONDO:0019391) to names (like "Fanconi Anemia")
    kg_node_names = { n['id']: n['name'] if 'name' in n else n['id'] for n in returnanswer['knowledge_graph']['nodes'] }
    #For every answer, make a dict from the query node ids (like n0) to the node name (like "Fanconi anemia")
    nodes = [
                {q_node: kg_node_names[ q_id ] for q_node, q_id in answer['node_bindings'].items()}
                for answer in returnanswer['answers']
            ]
    #Turn this list of dicts into a table
    return pd.DataFrame(nodes)

In [13]:
two_step_nodes = extract_node_names(two_step_answer)
two_step_nodes

|   | n0 | n1 | n2 |
|---|---|---|---|
| 0 | Fanconi anemia | BRIP1 | response to toxic substance |
| 1 | Fanconi anemia | FANCG | response to radiation |
| 2 | Fanconi anemia | RAD51C | DNA recombination |
| 3 | Fanconi anemia | FANCA | regulation of cell proliferation |
| 4 | Fanconi anemia | FANCA | female gonad development |
| 5 | Fanconi anemia | FANCL | regulation of cell proliferation |
| 6 | Fanconi anemia | BRCA2 | female gonad development |
| 7 | Fanconi anemia | BRCA2 | inner cell mass cell proliferation |
| 8 | Fanconi anemia | BRCA2 | hemopoiesis |
| 9 | Fanconi anemia | PALB2 | inner cell mass cell proliferation |


## Output Format

### Version 0.9, output_format=message

The results above are returned in v0.9 of the Translator standard for graph answers.  In this format, one graph is returned, containing all of the nodes and edges required for any of the answers.  Then the list of answers is given, binding nodes and edges in the question definition to nodes and edges in the shared knowledge graph.  This allows greater efficiency and specificity in relating outputs to question parameters.

The response has three keys, which hold:
1. The JSON-formatted question that was asked (`question_graph`)
2. The shared knowledge graph (`knowledge_graph`)
3. The answers as defined on that graph (`answers`)

In [29]:
print(two_step_answer.keys())

dict_keys(['question_graph', 'knowledge_graph', 'answers'])


The knowledge graph consists of nodes (containing all the information about the entity) and edges (containing information about the relationship between entities).

In [30]:
print(two_step_answer['knowledge_graph'].keys())

dict_keys(['nodes', 'edges'])


Here's what the first two nodes and one of the edges (the one at index 1) look like:

In [31]:
print(json.dumps(two_step_answer['knowledge_graph']['nodes'][:2],indent=4))

[
    {
        "congenital abnormality": true,
        "name": "Fanconi anemia",
        "nutritional or metabolic disease": true,
        "id": "MONDO:0019391",
        "equivalent_identifiers": [
            "MESH:D005199",
            "MEDDRA:10016218",
            "MONDO:0019391",
            "MEDDRA:10055206",
            "UMLS:C0015625",
            "ORPHANET:84",
            "DOID:13636"
        ],
        "type": "disease",
        "rare disease": true,
        "syndromic disease": true
    },
    {
        "name": "UBE2T",
        "location": "1q32.1",
        "locus_group": "protein-coding gene",
        "gene_family": [
            "Ubiquitin conjugating enzymes E2",
            "FA complementation groups"
        ],
        "chromosome": "1",
        "id": "HGNC:25009",
        "equivalent_identifiers": [
            "UniProtKB:Q9NPD8",
            "NCBIGene:29089",
            "HGNC:25009",
            "ENSEMBL:ENSG00000077152",
            "HGNC.SYMBOL:UBE2T",
          

In [33]:
print(json.dumps(two_step_answer['knowledge_graph']['edges'][1],indent=4))

{
    "relation": "PHAROS:gene_involved",
    "target_id": "HGNC:25009",
    "edge_source": "pharos.disease_get_gene",
    "publications": [],
    "id": "4257632",
    "predicate_id": "NCIT:R176",
    "source_database": "pharos",
    "source_id": "MONDO:0019391",
    "type": "disease_to_gene_association",
    "ctime": 1543738834.740142,
    "relation_label": "gene_involved",
    "weight": 0.25832639515782074
}


Here's what the first answer looks like:

In [34]:
print(json.dumps(two_step_answer['answers'][0],indent=4))

{
    "node_bindings": {
        "n0": "MONDO:0019391",
        "n1": "HGNC:20473",
        "n2": "GO:0009636"
    },
    "edge_bindings": {
        "e0": [
            "4257531",
            "6938566",
            "6938600"
        ],
        "e1": [
            "4294578",
            "1094816"
        ]
    },
    "score": 0.3959587237470695
}


The answers are ranked by score, so this is the highest ranking answer, and the score is given.  The `node_bindings` relate the identifiers of the questions (`n0`, `n1`, `n2`) to the identifiers in the knowledge graph (`MONDO:0019391`, `HGNC:20473`, `GO:0009636`) and we could use these identifiers to look up further information about those nodes in the graph.

Similarly, the `e0` edge (connecting `n0` and `n1` in the question) is mapped to several possible edges in the graph, and we can look up each edge's information using its identifier. First, listing the scores confirms that the answers are sorted from highest to lowest; the cell after that retrieves the first edge bound to `e0`:

In [41]:
scores = [a['score'] for a in two_step_answer['answers']]
scores

[0.3959587237470695,
 0.39562714985581743,
 0.3930911628130927,
 0.38680121546632074,
 0.38680121546632074,
 0.38237448152146924,
 0.37742069307012543,
 0.37742069307012543,
 0.37742069307012543,
 0.35967233858255104,
 0.35967233858255104,
 0.3328828490620355,
 0.3328828490620355,
 0.3328828490620355,
 0.3328828490620355,
 0.3328828490620355,
 0.330896351413593,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.29235569718019727,
 0.22921193146495913,
 0.22921193146495913,
 0.22921193146495913,
 0.22921193146495913,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 0.225403541755474,
 

In [35]:
my_edge = list(filter(lambda x: x['id'] == "4257531", two_step_answer['knowledge_graph']['edges']))[0]
print(json.dumps(my_edge,indent=4))

{
    "relation": "PHAROS:gene_involved",
    "target_id": "HGNC:20473",
    "edge_source": "pharos.disease_get_gene",
    "publications": [],
    "id": "4257531",
    "predicate_id": "NCIT:R176",
    "source_database": "pharos",
    "source_id": "MONDO:0019391",
    "type": "disease_to_gene_association",
    "ctime": 1543738834.73799,
    "relation_label": "gene_involved",
    "weight": 0.25832639515782074
}
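
The `filter` call above scans the whole edge list for every lookup. When many bindings need resolving, it can help to index the knowledge graph by id once and reuse the index. The sketch below assumes only the message layout shown above (`knowledge_graph`, `node_bindings`, `edge_bindings`, `score`); `index_by_id` and `resolve_bindings` are hypothetical names, and the toy message is a heavily abridged copy of the output above with most fields left out.

```python
def index_by_id(items):
    """Map each item's 'id' to the item itself."""
    return {item['id']: item for item in items}

def resolve_bindings(message, answer):
    """Replace the ids in one answer's bindings with the full node/edge records."""
    nodes = index_by_id(message['knowledge_graph']['nodes'])
    edges = index_by_id(message['knowledge_graph']['edges'])
    return {
        'nodes': {q: nodes[kg_id] for q, kg_id in answer['node_bindings'].items()},
        'edges': {q: [edges[i] for i in ids]
                  for q, ids in answer['edge_bindings'].items()},
        'score': answer['score'],
    }

# Abridged toy message: one edge and two nodes, with most fields omitted.
toy_message = {
    'knowledge_graph': {
        'nodes': [{'id': 'MONDO:0019391', 'type': 'disease'},
                  {'id': 'HGNC:20473'}],
        'edges': [{'id': '4257531', 'source_id': 'MONDO:0019391',
                   'target_id': 'HGNC:20473',
                   'type': 'disease_to_gene_association'}],
    },
    'answers': [{'node_bindings': {'n0': 'MONDO:0019391', 'n1': 'HGNC:20473'},
                 'edge_bindings': {'e0': ['4257531']},
                 'score': 0.3959587237470695}],
}
resolved = resolve_bindings(toy_message, toy_message['answers'][0])
print(resolved['edges']['e0'][0]['type'])
```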


### Version 0.8, output_format=dense

For backwards compatibility, output can also be returned in v0.8 of the KG format. To return output in this format, add the query parameter `output_format=dense` to the URL; the `quick` function above does this when called with `output_format='dense'`. The `output_format` defaults to `message`, which specifies v0.9.

In [36]:
two_step_answer_dense = quick(two_step_question,output_format='dense')

Return Status: 200


In [38]:
two_step_answer_dense.keys()

dict_keys(['datetime', 'id', 'message', 'response_code', 'result_list'])

In [42]:
two_step_answer_dense['result_list'][0]

{'confidence': 0.3959587237470695,
 'id': '5939821d-f96e-442c-ab37-059ffe1087e4',
 'result_graph': {'edge_list': [{'confidence': 0.25832639515782074,
    'num_publications': None,
    'provided_by': 'pharos.disease_get_gene',
    'publications': '',
    'source_id': 'MONDO:0019391',
    'target_id': 'HGNC:20473',
    'type': 'disease_to_gene_association'},
   {'confidence': 0.4071474314830641,
    'num_publications': None,
    'provided_by': 'biolink.disease_get_gene',
    'publications': 'PMID:20639400',
    'source_id': 'MONDO:0019391',
    'target_id': 'HGNC:20473',
    'type': 'biomarker_for'},
   {'confidence': 0.9998780282217139,
    'num_publications': None,
    'provided_by': 'biolink.disease_get_gene',
    'publications': 'PMID:25980754,PMID:19127258,PMID:16116423,PMID:16153896,PMID:16116424,PMID:27498913,PMID:22006311,PMID:24556621,PMID:26689913,PMID:26681312,PMID:17033622,PMID:26720728,PMID:25186627,PMID:25452441,PMID:26921362,PMID:21964575,PMID:26786923,PMID:21127055,PMID:2
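
In the dense format each result carries its own copy of its edges, so a flat edge table can be built without any id lookups. The sketch below relies only on the keys visible in the output above (`result_list`, `confidence`, `result_graph`, `edge_list`, and the per-edge fields); `dense_edges` is a hypothetical helper, and the toy response is an abridged copy of that output.

```python
def dense_edges(dense_response):
    """Flatten a dense response into one dict per edge, tagged with its result's confidence."""
    rows = []
    for result in dense_response['result_list']:
        for edge in result['result_graph']['edge_list']:
            rows.append({'result_confidence': result['confidence'],
                         'source_id': edge['source_id'],
                         'target_id': edge['target_id'],
                         'type': edge['type'],
                         'provided_by': edge['provided_by']})
    return rows

# Abridged toy response built from the first two edges shown above.
toy_dense = {
    'result_list': [{
        'confidence': 0.3959587237470695,
        'result_graph': {'edge_list': [
            {'source_id': 'MONDO:0019391', 'target_id': 'HGNC:20473',
             'type': 'disease_to_gene_association',
             'provided_by': 'pharos.disease_get_gene'},
            {'source_id': 'MONDO:0019391', 'target_id': 'HGNC:20473',
             'type': 'biomarker_for',
             'provided_by': 'biolink.disease_get_gene'},
        ]},
    }]
}
for row in dense_edges(toy_dense):
    print(row['type'], row['provided_by'])
```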

## Specifying multiple fixed nodes

The examples above have all started from one known node and expanded from it, sometimes in multiple steps. It's also possible to specify more than one node. For instance, if we wanted to look for genes that link Fanconi anemia (MONDO:0019391) and DNA repair (GO:0006281), we could rerun our previous query with the final curie set, like this:

In [14]:
two_step_question_fixed_ends = \
   make_N_step_question(['disease','gene','biological_process_or_activity'],['MONDO:0019391',None,'GO:0006281'])
two_step_answer_fixed_ends = quick(two_step_question_fixed_ends)

Return Status: 200


In [15]:
extract_node_names(two_step_answer_fixed_ends)

|   | n0 | n1 | n2 |
|---|---|---|---|
| 0 | Fanconi anemia | FANCC | DNA repair |
| 1 | Fanconi anemia | FANCG | DNA repair |
| 2 | Fanconi anemia | RAD51C | DNA repair |
| 3 | Fanconi anemia | FANCA | DNA repair |
| 4 | Fanconi anemia | FANCL | DNA repair |
| 5 | Fanconi anemia | SLX4 | DNA repair |
| 6 | Fanconi anemia | RAD51 | DNA repair |
| 7 | Fanconi anemia | ERCC4 | DNA repair |
| 8 | Fanconi anemia | XRCC2 | DNA repair |
| 9 | Fanconi anemia | UBE2T | DNA repair |


## Non-path queries

So far, we've only looked at linear paths. But the question format is more general than that: the nodes and edges can describe an arbitrary graph pattern. For instance, the query above finds all genes that are linked to both Fanconi anemia and DNA repair. But what if we wanted to find entities that are connected to more than two specified entities? Here is a query-generation function for such a star query:

In [16]:
def make_star_question(types,curies,shared_type,rebuild=None):
    """Create a question to find entities of shared_type that are linked to all of the nodes specified in the
    types and curies arrays."""
    question = {
                #'rebuild': 'True',
                'machine_question': {
                    'nodes': [],
                    'edges': []
                }
            }
    if rebuild is not None and str(rebuild).upper() == 'TRUE':
        question['rebuild']='True'
    question['machine_question']['nodes'].append( {'id': 'n0', 'type': shared_type})
    for i,t in enumerate(types):
        newnode = {'id': f'n{i+1}', 'type': t}
        if curies[i] is not None:
            newnode['curie'] = curies[i]
        question['machine_question']['nodes'].append(newnode)
        question['machine_question']['edges'].append( {'id': f'e{i}', 'source_id': 'n0', 'target_id': f'n{i+1}'})
    return question

Suppose I have this set of GO terms, and I'd like to find genes that they all have in common:

* voltage-gated sodium channel activity (GO:0005248)
* muscle contraction (GO:0006936)
* voltage-gated ion channel activity (GO:0005244)
* regulation of ion transmembrane transport (GO:0034765)
* sodium ion transmembrane transport (GO:0035725)
* neuronal action potential (GO:0019228)
* membrane depolarization during action potential (GO:0086010)
* sodium ion transport (GO:0006814)

In [17]:
go_terms=['GO:0005248','GO:0006936','GO:0005244','GO:0034765','GO:0035725','GO:0019228','GO:0086010','GO:0006814']
types = ['biological_process_or_activity' for g in go_terms]
star_q = make_star_question(types,go_terms,'gene')
star_q

{'machine_question': {'edges': [{'id': 'e0',
    'source_id': 'n0',
    'target_id': 'n1'},
   {'id': 'e1', 'source_id': 'n0', 'target_id': 'n2'},
   {'id': 'e2', 'source_id': 'n0', 'target_id': 'n3'},
   {'id': 'e3', 'source_id': 'n0', 'target_id': 'n4'},
   {'id': 'e4', 'source_id': 'n0', 'target_id': 'n5'},
   {'id': 'e5', 'source_id': 'n0', 'target_id': 'n6'},
   {'id': 'e6', 'source_id': 'n0', 'target_id': 'n7'},
   {'id': 'e7', 'source_id': 'n0', 'target_id': 'n8'}],
  'nodes': [{'id': 'n0', 'type': 'gene'},
   {'curie': 'GO:0005248',
    'id': 'n1',
    'type': 'biological_process_or_activity'},
   {'curie': 'GO:0006936',
    'id': 'n2',
    'type': 'biological_process_or_activity'},
   {'curie': 'GO:0005244',
    'id': 'n3',
    'type': 'biological_process_or_activity'},
   {'curie': 'GO:0034765',
    'id': 'n4',
    'type': 'biological_process_or_activity'},
   {'curie': 'GO:0035725',
    'id': 'n5',
    'type': 'biological_process_or_activity'},
   {'curie': 'GO:0019228',
   

In [18]:
common_gene_answer = quick(star_q)

Return Status: 200


In [19]:
extract_node_names(common_gene_answer)

|   | n0 | n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SCN4A | GO:0005248 | muscle contraction | voltage-gated ion channel activity | regulation of ion transmembrane transport | sodium ion transmembrane transport | neuronal action potential | membrane depolarization during action potential | sodium ion transport |
| 1 | SCN7A | GO:0005248 | muscle contraction | voltage-gated ion channel activity | regulation of ion transmembrane transport | sodium ion transmembrane transport | neuronal action potential | membrane depolarization during action potential | sodium ion transport |


We can also have more than one unspecified node. Suppose, for instance, that we wanted to rerun the previous query but also wanted to know which chemicals interact with the genes that we find. We can do another star query in which one of the spokes is unspecified:

In [20]:
go_terms=['GO:0005248','GO:0006936','GO:0005244','GO:0034765','GO:0035725','GO:0019228','GO:0086010','GO:0006814',None]
types = ['biological_process_or_activity' for i in range(8)]+['chemical_substance']
star_q_compound = make_star_question(types,go_terms,'gene',rebuild=True)
star_q_compound

{'machine_question': {'edges': [{'id': 'e0',
    'source_id': 'n0',
    'target_id': 'n1'},
   {'id': 'e1', 'source_id': 'n0', 'target_id': 'n2'},
   {'id': 'e2', 'source_id': 'n0', 'target_id': 'n3'},
   {'id': 'e3', 'source_id': 'n0', 'target_id': 'n4'},
   {'id': 'e4', 'source_id': 'n0', 'target_id': 'n5'},
   {'id': 'e5', 'source_id': 'n0', 'target_id': 'n6'},
   {'id': 'e6', 'source_id': 'n0', 'target_id': 'n7'},
   {'id': 'e7', 'source_id': 'n0', 'target_id': 'n8'},
   {'id': 'e8', 'source_id': 'n0', 'target_id': 'n9'}],
  'nodes': [{'id': 'n0', 'type': 'gene'},
   {'curie': 'GO:0005248',
    'id': 'n1',
    'type': 'biological_process_or_activity'},
   {'curie': 'GO:0006936',
    'id': 'n2',
    'type': 'biological_process_or_activity'},
   {'curie': 'GO:0005244',
    'id': 'n3',
    'type': 'biological_process_or_activity'},
   {'curie': 'GO:0034765',
    'id': 'n4',
    'type': 'biological_process_or_activity'},
   {'curie': 'GO:0035725',
    'id': 'n5',
    'type': 'biologica

**WARNING: This takes a long time to run**

In [23]:
common_gene_compound_answer = quick(star_q_compound)

Return Status: 200


In [24]:
cgc_nodes = extract_node_names(common_gene_compound_answer)
cgc_nodes[['n0','n9']]

|   | n0 | n9 |
|---|---|---|
| 0 | SCN4A | m-cresol |
| 1 | SCN4A | 4-chloro-3,5-dimethylphenol |
| 2 | SCN4A | 4-chlorophenol |
| 3 | SCN7A | ethotoin |
| 4 | SCN7A | benzocaine |
| 5 | SCN7A | tocainide |
| 6 | SCN4A | pilsicainide |
| 7 | SCN4A | lamotrigine |
| 8 | SCN4A | ethotoin |
| 9 | SCN7A | phenazopyridine |


Many other graph shapes are possible; these are just a couple of examples. For instance, we could extend this query with an edge from the gene node (`n0`) to a new disease node (`n10`), plus an edge from the chemical node (`n9`) to that same disease node. We would then be finding genes that share a set of GO terms, and looking for (disease, drug) pairs such that the disease, the drug, and the gene are all interconnected.
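
The extension described above can be sketched as a small helper that adds one node and links it to existing nodes. `add_linked_node` is a hypothetical function, not part of ROBOKOP, and it assumes node and edge ids follow the `n0, n1, ...` and `e0, e1, ...` numbering used by the question builders in this notebook. The toy star question is a shortened stand-in for `star_q_compound`.

```python
import copy

def add_linked_node(question, node_type, linked_to, curie=None):
    """Return a copy of question with one new node and an edge to it from each id in linked_to."""
    new_q = copy.deepcopy(question)  # leave the caller's question untouched
    mq = new_q['machine_question']
    new_id = f"n{len(mq['nodes'])}"  # assumes existing ids are n0..n(k-1)
    node = {'id': new_id, 'type': node_type}
    if curie is not None:
        node['curie'] = curie
    mq['nodes'].append(node)
    for source in linked_to:
        mq['edges'].append({'id': f"e{len(mq['edges'])}",
                            'source_id': source, 'target_id': new_id})
    return new_q

# Shortened star question: a gene hub, one GO term, and one free chemical spoke.
small_star = {
    'machine_question': {
        'nodes': [{'id': 'n0', 'type': 'gene'},
                  {'id': 'n1', 'type': 'biological_process_or_activity',
                   'curie': 'GO:0005248'},
                  {'id': 'n2', 'type': 'chemical_substance'}],
        'edges': [{'id': 'e0', 'source_id': 'n0', 'target_id': 'n1'},
                  {'id': 'e1', 'source_id': 'n0', 'target_id': 'n2'}],
    }
}
# Link a free disease node to both the gene (n0) and the chemical (n2).
triangle_q = add_linked_node(small_star, 'disease', ['n0', 'n2'])
print(triangle_q['machine_question']['edges'][-2:])
```

Applied to `star_q_compound` with `linked_to=['n0', 'n9']`, the same call would add the disease node as `n10`.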

### Controlling the number of results

By default, `quick` will return up to 250 results, but this can be raised or lowered using the `max_results` parameter. We've already rebuilt this question, so we won't pass `rebuild=True` this time, which should make it run faster.

In [27]:
star_q_compound = make_star_question(types,go_terms,'gene')
common_gene_compound_answer_10 = quick(star_q_compound,max_results=10)

Return Status: 200


In [28]:
cgc_nodes_10 = extract_node_names(common_gene_compound_answer_10)
cgc_nodes_10[['n0','n9']]

|   | n0 | n9 |
|---|---|---|
| 0 | SCN4A | m-cresol |
| 1 | SCN4A | 4-chloro-3,5-dimethylphenol |
| 2 | SCN4A | 4-chlorophenol |
| 3 | SCN7A | ethotoin |
| 4 | SCN7A | benzocaine |
| 5 | SCN7A | tocainide |
| 6 | SCN4A | pilsicainide |
| 7 | SCN4A | lamotrigine |
| 8 | SCN4A | ethotoin |
| 9 | SCN7A | phenazopyridine |


## Rebuild and caching

For details on ROBOKOP caching, and guidance on when to use the `rebuild` parameter, see the notebook on the expand service. Briefly, the quick service will only run against the cache unless the `rebuild` parameter is passed, in which case the underlying services are re-queried. The `rebuild` parameter is passed as part of the JSON query, as seen in the `make_star_question` function above.

## Edge Properties

In the expand notebook, we saw how a particular predicate could be chosen to limit the results. This same type information can in fact be added to any edge in a quick query, providing fine control over the results returned.
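
As a rough illustration, the one-step question builder can be extended to attach a type to its edge. This sketch assumes the edge field is named `type`, mirroring the `type` field on knowledge-graph edges; that field name is an assumption, so check the expand notebook or the service documentation before relying on it. The predicate `has_phenotype` is taken from the results table earlier in this notebook, and `make_typed_one_step_question` is a hypothetical name.

```python
def make_typed_one_step_question(type1, id1, type2, edge_type):
    """Build a one-hop question whose single edge is restricted to edge_type.

    Assumption: the question edge accepts a 'type' key for the predicate.
    """
    return {
        'machine_question': {
            'nodes': [
                {'id': 'n0', 'curie': id1, 'type': type1},
                {'id': 'n1', 'type': type2},
            ],
            'edges': [
                {'id': 'e0', 'source_id': 'n0', 'target_id': 'n1',
                 'type': edge_type},  # assumed field name for the predicate
            ],
        }
    }

typed_q = make_typed_one_step_question(
    'disease', 'MONDO:0019391', 'phenotypic_feature', 'has_phenotype')
print(typed_q['machine_question']['edges'])
```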