# Using ROBOKOP's Quick Question Answering Service

The expand, enrich, and similarity services each offer bite-sized approaches to interacting with ROBOKOP.  The ROBOKOP "quick" service offers a slightly more complex approach to answering questions.   We'll be using the following function to call the quick service.  Note that calling the service requires posting a question.  We'll discuss the format of the question below.

In [13]:
robokop_server='robokop.renci.org'

In [14]:
import requests

def quick(question,max_results=None):
    url=f'http://{robokop_server}:80/api/simple/quick/'
    if max_results is not None:
        url += f'?max_results={max_results}'
    response = requests.post(url,json=question)
    print( f"Return Status: {response.status_code}" )
    if response.status_code == 200:
        return response.json()
    return response

## Question Format and Basic Usage

The question is a Python dictionary.  It takes a key `machine_question`, which is a dictionary containing a list of `nodes` and a list of `edges`.   Each node object needs a unique `id`; the examples in this notebook use either strings such as `n0` or plain integers.   Any node can also have a `curie` that pins it to a specific entity.  The edges define connections between the nodes, using the node identifiers as `source_id` and `target_id`. The following function shows how to construct a single-hop question.  It takes a specified node of `type1` and looks for any node of `type2`.

In [15]:
def make_one_step_question(type1, id1, type2,rebuild = None):
    question = {
                'machine_question': {
                    'nodes': [
                        {
                            'id': 'n0',
                            'curie': id1,
                            'type': type1
                        },
                        {
                            'id': 'n1',
                            'type': type2
                        }
                    ],
                    'edges': [
                        {
                            'id': 'e0',
                            'source_id': 'n0',
                            'target_id': 'n1'
                        }
                    ]
                }
            }
    if rebuild is not None and str(rebuild).upper() == 'TRUE':
        question['rebuild'] = 'True'
    return question
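Because edges refer to nodes only by `id`, a mistyped identifier is easy to miss. The following is a minimal sketch of a local sanity check; `check_question` is a hypothetical helper written for this notebook, not part of ROBOKOP, and it only checks the structure described above (unique node ids, edges pointing at existing nodes).

```python
def check_question(question):
    """Return a list of human-readable problems found in a question dict.

    Hypothetical helper: it only checks the structure described in the text,
    not whatever validation the ROBOKOP server itself applies.
    """
    problems = []
    mq = question.get('machine_question', {})
    nodes = mq.get('nodes', [])
    edges = mq.get('edges', [])

    # Node ids must be unique so that edges can refer to them unambiguously.
    node_ids = [node.get('id') for node in nodes]
    if len(set(node_ids)) != len(node_ids):
        problems.append('node ids are not unique')

    # Every edge endpoint must name an existing node.
    for i, edge in enumerate(edges):
        for end in ('source_id', 'target_id'):
            if edge.get(end) not in node_ids:
                problems.append(f'edge {i}: {end}={edge.get(end)!r} is not a node id')
    return problems


# The same shape that make_one_step_question produces, written out by hand.
sample = {
    'machine_question': {
        'nodes': [
            {'id': 'n0', 'curie': 'MONDO:0019391', 'type': 'disease'},
            {'id': 'n1', 'type': 'phenotypic_feature'},
        ],
        'edges': [
            {'id': 'e0', 'source_id': 'n0', 'target_id': 'n1'},
        ],
    }
}

print(check_question(sample))  # an empty list means no structural problems
```

Running a check like this before calling `quick` catches structural mistakes without a round trip to the server.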

Here, we will specify a one-hop question asking for the phenotypes associated with Fanconi Anemia (MONDO:0019391).  We first construct the question then use it to call the quick service.  In fact, this is how the expand function is implemented internally.  The output is returned in the KG standard answer format.

In [16]:
q = make_one_step_question('disease','MONDO:0019391','phenotypic_feature')
r = quick(q)

Return Status: 200


In [17]:
# Uncomment to see a big long answer
#import json
#print( json.dumps(r, indent=4))

The output can be simplified with a function like this:

In [19]:
import pandas as pd

def parse_answer(returnanswer):
    nodes = [answer['nodes'][1] for answer in returnanswer['answers']]
    edges = [answer['edges'][0] for answer in returnanswer['answers']]
    answers = [ {"result_id": node["id"], 
                 "result_name": node["name"], 
                 "type": edge["type"],
                 "source": edge['edge_source']}
              for node,edge in zip(nodes,edges)]
    return pd.DataFrame(answers)

In [20]:
parse_answer(r)

Unnamed: 0,result_id,result_name,source,type
0,HP:0001915,Aplastic anemia,biolink.disease_get_phenotype,has_phenotype
1,HP:0001876,Pancytopenia,biolink.disease_get_phenotype,has_phenotype
2,HP:0003002,Breast carcinoma,biolink.disease_get_phenotype,has_phenotype
3,HP:0004808,Acute myeloid leukemia,biolink.disease_get_phenotype,has_phenotype
4,HP:0001873,Thrombocytopenia,biolink.disease_get_phenotype,has_phenotype
5,HP:0002860,Squamous cell carcinoma,biolink.disease_get_phenotype,has_phenotype
6,HP:0001875,Neutropenia,biolink.disease_get_phenotype,has_phenotype
7,HP:0100615,Ovarian neoplasm,biolink.disease_get_phenotype,has_phenotype
8,HP:0002488,Acute leukemia,biolink.disease_get_phenotype,has_phenotype
9,HP:0002721,Immunodeficiency,biolink.disease_get_phenotype,has_phenotype
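To make clearer which fields `parse_answer` reads, here is a pandas-free sketch of the same extraction run on a hand-written stand-in for the response. The `mock_answer` dictionary is an assumption for illustration: it is trimmed to the keys used above (`answers`, `nodes`, `edges`, `id`, `name`, `type`, `edge_source`), with values copied from the table, and a real response will carry more fields.

```python
# Hand-written stand-in for a quick-service response, trimmed to the fields
# that parse_answer reads. Not real service output.
mock_answer = {
    'answers': [
        {
            'nodes': [
                {'id': 'MONDO:0019391', 'name': 'Fanconi anemia'},
                {'id': 'HP:0001915', 'name': 'Aplastic anemia'},
            ],
            'edges': [
                {'type': 'has_phenotype',
                 'edge_source': 'biolink.disease_get_phenotype'},
            ],
        },
        {
            'nodes': [
                {'id': 'MONDO:0019391', 'name': 'Fanconi anemia'},
                {'id': 'HP:0001876', 'name': 'Pancytopenia'},
            ],
            'edges': [
                {'type': 'has_phenotype',
                 'edge_source': 'biolink.disease_get_phenotype'},
            ],
        },
    ]
}


def answer_rows(returnanswer):
    """Same extraction as parse_answer, returning plain dicts instead of a DataFrame."""
    rows = []
    for answer in returnanswer['answers']:
        node = answer['nodes'][1]   # second node: the result of the one-hop question
        edge = answer['edges'][0]   # the single edge connecting the two nodes
        rows.append({
            'result_id': node['id'],
            'result_name': node['name'],
            'type': edge['type'],
            'source': edge['edge_source'],
        })
    return rows


for row in answer_rows(mock_answer):
    print(row['result_id'], row['result_name'], row['type'])
```

Note that indexing `nodes[1]` and `edges[0]` is only meaningful for one-hop questions; longer paths need something like the per-node extraction used later in this notebook.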


## Multi-step questions

It's straightforward to generalize this linear query to an N-item path.  This function constructs a question from a list of types and a list of identifiers.  The types are the types of the nodes traversed along the path, and the identifiers (curies) pin specific nodes in the path.  The `types` and `curies` lists should have the same length, and a free node should have `None` as its curie:

In [23]:
def make_N_step_question(types,curies,rebuild = None):
    question = {
                'machine_question': {
                    'nodes': [],
                    'edges': []
                }
            }
    if rebuild is not None and str(rebuild).upper() == 'TRUE':
        question['rebuild'] = 'True'
    ei = 0
    for i,t in enumerate(types):
        newnode = {'id': f'n{i}', 'type': t}
        if curies[i] is not None:
            newnode['curie'] = curies[i]
        question['machine_question']['nodes'].append(newnode)
        if i > 0:
            question['machine_question']['edges'].append( {'id': f'e{ei}','source_id': f'n{i-1}', 'target_id': f'n{i}'})
            ei += 1
    return question

We can recapitulate our previous question with this new function like this:

In [24]:
newq = make_N_step_question(['disease','phenotypic_feature'],['MONDO:0019391',None])
q == newq

True

Now we could expand to a longer query.  The following question will start at the disease MONDO:0019391, go to a gene, and from there to a biological process or activity.

In [25]:
two_step_question = make_N_step_question(['disease','gene','biological_process_or_activity'],['MONDO:0019391',None,None])
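Before sending a longer question, it can help to print it as a chain of hops. The sketch below does that for a hand-written copy of the two-step question; `describe_path` is a hypothetical helper for this notebook, and `[?]` marks a free node (one without a `curie`).

```python
def describe_path(question):
    """Render each edge of a question as 'type[curie] -> type[curie]'."""
    mq = question['machine_question']
    by_id = {node['id']: node for node in mq['nodes']}

    def label(node):
        # Free nodes have no curie; show them with a question mark.
        return f"{node['type']}[{node.get('curie', '?')}]"

    return [f"{label(by_id[e['source_id']])} -> {label(by_id[e['target_id']])}"
            for e in mq['edges']]


# Written out by hand to match what make_N_step_question builds for
# types ['disease', 'gene', 'biological_process_or_activity'].
two_step = {
    'machine_question': {
        'nodes': [
            {'id': 'n0', 'type': 'disease', 'curie': 'MONDO:0019391'},
            {'id': 'n1', 'type': 'gene'},
            {'id': 'n2', 'type': 'biological_process_or_activity'},
        ],
        'edges': [
            {'id': 'e0', 'source_id': 'n0', 'target_id': 'n1'},
            {'id': 'e1', 'source_id': 'n1', 'target_id': 'n2'},
        ],
    }
}

for hop in describe_path(two_step):
    print(hop)
# disease[MONDO:0019391] -> gene[?]
# gene[?] -> biological_process_or_activity[?]
```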

In [None]:
two_step_answer = quick(two_step_question)

We can extract the node names along the paths with a simple function:

In [12]:
def extract_node_names(returnanswer):
    nodes = [{f'node_{i}': (node['name'] if 'name' in node else node['id'])  
              for i,node in enumerate(answer['nodes'])} for answer in returnanswer['answers']]
    return pd.DataFrame(nodes)

In [13]:
two_step_nodes = extract_node_names(two_step_answer)
two_step_nodes

Unnamed: 0,node_0,node_1,node_2
0,Fanconi anemia,FANCD2,interstrand cross-link repair
1,Fanconi anemia,FANCA,interstrand cross-link repair
2,Fanconi anemia,FANCM,interstrand cross-link repair
3,Fanconi anemia,FANCG,interstrand cross-link repair
4,Fanconi anemia,FANCF,interstrand cross-link repair
5,Fanconi anemia,FANCB,interstrand cross-link repair
6,Fanconi anemia,ERCC4,interstrand cross-link repair
7,Fanconi anemia,UBE2T,interstrand cross-link repair
8,Fanconi anemia,RFWD3,interstrand cross-link repair
9,Fanconi anemia,RAD51,interstrand cross-link repair


## Specifying multiple fixed nodes

The examples above have all started from one known node and expanded from it, sometimes in multiple steps.  It's also possible to specify more than one node.  For instance, if we wanted to look for genes that link Fanconi Anemia (MONDO:0019391) and DNA repair (GO:0006281), we could rerun our previous query with the final curie set, like this:

In [15]:
two_step_question_fixed_ends = \
   make_N_step_question(['disease','gene','biological_process_or_activity'],['MONDO:0019391',None,'GO:0006281'])
two_step_answer_fixed_ends = quick(two_step_question_fixed_ends)

Return Status: 200


In [16]:
extract_node_names(two_step_answer_fixed_ends)

Unnamed: 0,node_0,node_1,node_2
0,Fanconi anemia,FANCD2,DNA repair
1,Fanconi anemia,FANCA,DNA repair
2,Fanconi anemia,RAD51C,DNA repair
3,Fanconi anemia,FANCG,DNA repair
4,Fanconi anemia,ERCC4,DNA repair
5,Fanconi anemia,RAD51,DNA repair
6,Fanconi anemia,FANCC,DNA repair
7,Fanconi anemia,XRCC2,DNA repair
8,Fanconi anemia,FAN1,DNA repair
9,Fanconi anemia,RAD51D,DNA repair


## Non-path queries

So far, we've only looked at linear paths.  But the question format is more general than that: we can define an arbitrary pattern of nodes and edges. For instance, the query above finds all genes that are linked to both FA and DNA repair.  But what if we wanted to find entities that are connected to more than two specified entities?  Here is a query-generation function for such a star query:

In [17]:
def make_star_question(types,curies,shared_type,rebuild=None):
    """Create a question to find entities of shared_type that are linked to all of the nodes specified in the
    types and curies arrays."""
    question = {
                #'rebuild': 'True',
                'machine_question': {
                    'nodes': [],
                    'edges': []
                }
            }
    if rebuild is not None and str(rebuild).upper() == 'TRUE':
        question['rebuild']='True'
    question['machine_question']['nodes'].append( {'id': 0, 'type': shared_type})
    for i,t in enumerate(types):
        newnode = {'id': i+1, 'type': t}
        if curies[i] is not None:
            newnode['curie'] = curies[i]
        question['machine_question']['nodes'].append(newnode)
        question['machine_question']['edges'].append( {'source_id': 0, 'target_id': i+1})
    return question

Suppose I have this set of GO terms, and I'd like to find genes that they all have in common:

* voltage-gated sodium channel activity (GO:0005248)
* muscle contraction (GO:0006936)
* voltage-gated ion channel activity (GO:0005244)
* regulation of ion transmembrane transport (GO:0034765)
* sodium ion transmembrane transport (GO:0035725)
* neuronal action potential (GO:0019228)
* membrane depolarization during action potential (GO:0086010)
* sodium ion transport (GO:0006814)

In [18]:
go_terms=['GO:0005248','GO:0006936','GO:0005244','GO:0034765','GO:0035725','GO:0019228','GO:0086010','GO:0006814']
types = ['biological_process_or_activity' for g in go_terms]
star_q = make_star_question(types,go_terms,'gene')
star_q

{'machine_question': {'edges': [{'source_id': 0, 'target_id': 1},
   {'source_id': 0, 'target_id': 2},
   {'source_id': 0, 'target_id': 3},
   {'source_id': 0, 'target_id': 4},
   {'source_id': 0, 'target_id': 5},
   {'source_id': 0, 'target_id': 6},
   {'source_id': 0, 'target_id': 7},
   {'source_id': 0, 'target_id': 8}],
  'nodes': [{'id': 0, 'type': 'gene'},
   {'curie': 'GO:0005248', 'id': 1, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0006936', 'id': 2, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0005244', 'id': 3, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0034765', 'id': 4, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0035725', 'id': 5, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0019228', 'id': 6, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0086010', 'id': 7, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0006814',
    'id': 8,
    'type': 'biological_process_or_activit

In [19]:
common_gene_answer = quick(star_q)

Return Status: 200


In [20]:
extract_node_names(common_gene_answer)

Unnamed: 0,node_0,node_1,node_2,node_3,node_4,node_5,node_6,node_7,node_8
0,SCN4A,voltage-gated sodium channel activity,muscle contraction,voltage-gated ion channel activity,regulation of ion transmembrane transport,sodium ion transmembrane transport,neuronal action potential,membrane depolarization during action potential,sodium ion transport
1,SCN7A,voltage-gated sodium channel activity,muscle contraction,voltage-gated ion channel activity,regulation of ion transmembrane transport,sodium ion transmembrane transport,neuronal action potential,membrane depolarization during action potential,sodium ion transport


We can also have more than one unspecified node.  Suppose, for instance, that we wanted to run the previous query but also wanted to know which chemicals interact with the genes we find.  We can do another star query in which one of the spokes is unspecified:

In [21]:
go_terms=['GO:0005248','GO:0006936','GO:0005244','GO:0034765','GO:0035725','GO:0019228','GO:0086010','GO:0006814',None]
types = ['biological_process_or_activity' for i in range(8)]+['chemical_substance']
star_q_compound = make_star_question(types,go_terms,'gene',rebuild=True)
star_q_compound

{'machine_question': {'edges': [{'source_id': 0, 'target_id': 1},
   {'source_id': 0, 'target_id': 2},
   {'source_id': 0, 'target_id': 3},
   {'source_id': 0, 'target_id': 4},
   {'source_id': 0, 'target_id': 5},
   {'source_id': 0, 'target_id': 6},
   {'source_id': 0, 'target_id': 7},
   {'source_id': 0, 'target_id': 8},
   {'source_id': 0, 'target_id': 9}],
  'nodes': [{'id': 0, 'type': 'gene'},
   {'curie': 'GO:0005248', 'id': 1, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0006936', 'id': 2, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0005244', 'id': 3, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0034765', 'id': 4, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0035725', 'id': 5, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0019228', 'id': 6, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0086010', 'id': 7, 'type': 'biological_process_or_activity'},
   {'curie': 'GO:0006814', 'id': 8, 'type': '

**WARNING: This takes a long time to run**

In [22]:
common_gene_compound_answer = quick(star_q_compound)

Return Status: 200


In [28]:
cgc_nodes = extract_node_names(common_gene_compound_answer)
cgc_nodes[['node_0','node_9']]

Unnamed: 0,node_0,node_9
0,SCN4A,lamotrigine
1,SCN4A,Carbamazepine
2,SCN4A,Lidocaine
3,SCN4A,benzocaine
4,SCN4A,phenytoin
5,SCN4A,Tocainide
6,SCN4A,Mexiletine
7,SCN4A,Flecainide
8,SCN7A,benzocaine
9,SCN7A,Nickel


Many other graph shapes are possible; these are just a couple of examples.  For instance, we could extend this query with an edge from node 0 (the gene) to a new disease node (node 10), plus an edge from the chemical node (node 9) to that same disease node. We would then be finding genes that share a set of GO terms, and looking for (disease, drug) pairs such that the disease, the drug, and the gene are all interconnected.
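As a sketch of that extension, the block below rebuilds the gene/GO-term/chemical star by hand and then adds the disease node and its two edges. `add_shared_disease` is a hypothetical helper, the edge directions are an assumption, and the resulting question has not been run against the service here.

```python
import copy

go_terms = ['GO:0005248', 'GO:0006936', 'GO:0005244', 'GO:0034765',
            'GO:0035725', 'GO:0019228', 'GO:0086010', 'GO:0006814']

# Rebuild the star question from the text by hand so this block stands alone:
# node 0 is the shared gene, nodes 1-8 are GO terms, node 9 is a free chemical.
nodes = [{'id': 0, 'type': 'gene'}]
edges = []
for i, curie in enumerate(go_terms, start=1):
    nodes.append({'id': i, 'type': 'biological_process_or_activity', 'curie': curie})
    edges.append({'source_id': 0, 'target_id': i})
nodes.append({'id': 9, 'type': 'chemical_substance'})
edges.append({'source_id': 0, 'target_id': 9})
star = {'machine_question': {'nodes': nodes, 'edges': edges}}


def add_shared_disease(question, gene_id=0, chemical_id=9):
    """Return a copy of question with a free disease node linked to both
    the gene node and the chemical node (hypothetical helper)."""
    q = copy.deepcopy(question)
    mq = q['machine_question']
    disease_id = max(node['id'] for node in mq['nodes']) + 1
    mq['nodes'].append({'id': disease_id, 'type': 'disease'})
    mq['edges'].append({'source_id': gene_id, 'target_id': disease_id})
    mq['edges'].append({'source_id': chemical_id, 'target_id': disease_id})
    return q


triangle_q = add_shared_disease(star)
print(triangle_q['machine_question']['nodes'][-1])   # {'id': 10, 'type': 'disease'}
print(triangle_q['machine_question']['edges'][-2:])
```

Working on a deep copy keeps the original star question unchanged, so both versions can be submitted and compared.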

### Controlling the number of results

By default, `quick` will return up to 250 results, but this limit can be raised or lowered using the `max_results` parameter:

In [29]:
common_gene_compound_answer_10 = quick(star_q_compound,max_results=10)


Return Status: 200


In [30]:
cgc_nodes_10 = extract_node_names(common_gene_compound_answer_10)
cgc_nodes_10[['node_0','node_9']]

Unnamed: 0,node_0,node_9
0,SCN4A,lamotrigine
1,SCN4A,Carbamazepine
2,SCN4A,Lidocaine
3,SCN4A,benzocaine
4,SCN4A,phenytoin
5,SCN4A,Tocainide
6,SCN4A,Mexiletine
7,SCN4A,Flecainide
8,SCN7A,benzocaine
9,SCN7A,Nickel
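The `max_results` argument never enters the question itself; the `quick` function defined at the top of this notebook appends it to the URL as a query-string parameter. The following sketch isolates that URL construction so it can be inspected without contacting the server (`quick_url` is a name introduced here for illustration).

```python
def quick_url(server, max_results=None):
    """Build the quick-service URL the same way the quick() function does."""
    url = f'http://{server}:80/api/simple/quick/'
    if max_results is not None:
        # The limit travels in the query string, not in the posted question.
        url += f'?max_results={max_results}'
    return url


print(quick_url('robokop.renci.org'))
# http://robokop.renci.org:80/api/simple/quick/
print(quick_url('robokop.renci.org', max_results=10))
# http://robokop.renci.org:80/api/simple/quick/?max_results=10
```

Because the check is `is not None`, a value of `0` would still be sent; how the service treats it is not covered in this notebook.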


## Rebuild and caching

For details on ROBOKOP caching, and guidance on when to use the `rebuild` parameter, see the notebook on the expand service.  Briefly, the quick service will only run against the cache unless the `rebuild` parameter is passed, in which case the underlying services are re-queried.   The `rebuild` parameter is passed as part of the JSON query, as seen in the `make_star_question` function above.
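For a question that has already been built, the flag can also be toggled after the fact. This is a small sketch using the same top-level `rebuild` key and string value `'True'` that the question-building functions in this notebook set; `with_rebuild` is a hypothetical helper.

```python
import copy


def with_rebuild(question, rebuild=True):
    """Return a copy of question with the top-level rebuild flag set or removed."""
    q = copy.deepcopy(question)
    if rebuild:
        # Same key and string value as the question builders in this notebook.
        q['rebuild'] = 'True'
    else:
        q.pop('rebuild', None)
    return q


cached_q = {
    'machine_question': {
        'nodes': [
            {'id': 'n0', 'curie': 'MONDO:0019391', 'type': 'disease'},
            {'id': 'n1', 'type': 'phenotypic_feature'},
        ],
        'edges': [{'id': 'e0', 'source_id': 'n0', 'target_id': 'n1'}],
    }
}

fresh_q = with_rebuild(cached_q)
print('rebuild' in cached_q, fresh_q['rebuild'])  # False True
```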

## Edge Properties

In the expand notebook, we saw how a particular predicate could be chosen to limit the results.  The same type information can be added to any edge in a quick query, giving fine control over the results returned.
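As a rough sketch of what that might look like, the block below attaches a predicate to one edge of a one-hop question. The key name `type` on a question edge is an assumption, inferred from the `type` field on the edges returned in answers; check the expand notebook for the exact field the service expects. `set_edge_type` is a hypothetical helper, and `has_phenotype` is the predicate seen in the results above.

```python
import copy


def set_edge_type(question, edge_index, predicate):
    """Return a copy of question whose chosen edge is restricted to predicate.

    Assumption: the service reads the predicate from a 'type' key on the edge.
    """
    q = copy.deepcopy(question)
    q['machine_question']['edges'][edge_index]['type'] = predicate
    return q


one_hop = {
    'machine_question': {
        'nodes': [
            {'id': 'n0', 'curie': 'MONDO:0019391', 'type': 'disease'},
            {'id': 'n1', 'type': 'phenotypic_feature'},
        ],
        'edges': [{'id': 'e0', 'source_id': 'n0', 'target_id': 'n1'}],
    }
}

typed_q = set_edge_type(one_hop, 0, 'has_phenotype')
print(typed_q['machine_question']['edges'][0])
# {'id': 'e0', 'source_id': 'n0', 'target_id': 'n1', 'type': 'has_phenotype'}
```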