# Using ROBOKOP's expand service

The most basic functionality in answering questions is to start with an entity and find other connected entities.
In this context, an entity is defined by a curie-formatted identifier.

ROBOKOP's expand service performs this function.  The user provides an identifier and its biolink-model type, and the type of entities that it wants returned.  ROBOKOP will call out to any sources that it is aware of that can answer the particular question.  If multiple services can provide the information, ROBOKOP will call all of them.  It will then rank the results based on literature co-occurence data.

In [1]:
robokop_server = 'robokop.renci.org'

In [2]:
import requests
import json
import pandas as pd

The following python function shows how to call the ROBOKOP expand service.  For the moment, let's focus only on the arguments `type1`, `identifier`, and `type2`.    The expand service is called when a user has an `identifier` of `type1`, and wants to know what entities of `type2` are connected to it.

In [3]:
def expand(type1,identifier,type2,rebuild=None,output_format=None,predicate=None,direction=None):
    url=f'http://{robokop_server}:80/api/simple/expand/{type1}/{identifier}/{type2}'
    print(url)
    params = {'rebuild': rebuild, 
              'predicate': predicate,
              'output_format': output_format,
              'direction': direction}
    params = { k:v for k,v in params.items() if v is not None }
    response = requests.get(url,params=params)
    print( f'Return Status: {response.status_code}' )
    if response.status_code == 200:
        return response.json()
    return []

## Basic Usage

In [4]:
complex_parts = expand('cellular_component','GO:1990157','chemical_substance')

http://robokop.renci.org:80/api/simple/expand/cellular_component/GO:1990157/chemical_substance
Return Status: 200


In [6]:
x={ name:n['name'] for n in complex_parts['knowledge_graph']['nodes']}
import pd
df = pd.DataFrame(x)
df.head()

SyntaxError: invalid syntax (<ipython-input-6-64d4c6d41b1c>, line 1)

In this example, we have the `disease` Fanconi Anemia defined by the curie identifier `MONDO:0019391`.  We want to know the `phenotypic_feature`s that are associated with it.  We can call the function above like this:

In [4]:
fanconi_phenotypes = expand('disease', 'MONDO:0019391', 'phenotypic_feature')

Return Status: 200


## Service Output

The result that comes back is json in the KG-standard.

Importantly, results are ranked using ROBOKOP's standard ranking algorithm, which is looking at literature co-occurance based on the `omnicorp` repository.

In [5]:
#Uncomment to see a big long json (which uses up lots of room in the github render)
#print(json.dumps(fanconi_phenotypes,indent=4))

This output has plenty of information, but for display purposes, it's sometimes easier to tabularize with the following function:

In [9]:
def parse_answer(returnanswer):
    #First, parse out the parts of the kg that we want, names and types
    kg_node_names = { n['id']: n['name'] if 'name' in n else n['id'] for n in returnanswer['knowledge_graph']['nodes'] }
    kg_edge_types = { e['id']: e['type'] for e in returnanswer['knowledge_graph']['edges']}
    kg_edge_sources = { e['id']: e['edge_source'] for e in returnanswer['knowledge_graph']['edges']}
    #now put the answers into a table, looking up node/edge info from the above dicts
    #because of the question, we know that we're interested in the binding for "n1" and "e0"
    # -By definition, this will always be the case for the expand function
    #this is quick-n-dirty, so we're also limiting ourselves to the first edge, even if there is more than one
    answers = [ {"result_id": answer['node_bindings']['n1'], 
                 "result_name": kg_node_names[answer['node_bindings']['n1']], #if 'name' in node else node['id'], 
                 "type":        kg_edge_types[answer['edge_bindings']['e0'][0]],
                 "source":      kg_edge_sources[answer['edge_bindings']['e0'][0]],
                 "score" :      answer['score']}
              for answer in returnanswer['answers']]
    return pd.DataFrame(answers)

In [7]:
fanconi_pheno_frame = parse_answer(fanconi_phenotypes)
fanconi_pheno_frame

Unnamed: 0,result_id,result_name,score,source,type
0,HP:0005505,Refractory anemia,1.619721,biolink.disease_get_phenotype,has_phenotype
1,HP:0100615,Ovarian neoplasm,1.619721,biolink.disease_get_phenotype,has_phenotype
2,HP:0002488,Acute leukemia,1.619721,biolink.disease_get_phenotype,has_phenotype
3,HP:0002860,Squamous cell carcinoma,1.619721,biolink.disease_get_phenotype,has_phenotype
4,HP:0004430,Severe combined immunodeficiency,1.619721,biolink.disease_get_phenotype,has_phenotype
5,HP:0002721,Immunodeficiency,1.619721,biolink.disease_get_phenotype,has_phenotype
6,HP:0001402,Hepatocellular carcinoma,1.619721,biolink.disease_get_phenotype,has_phenotype
7,HP:0000286,Epicanthus,1.619721,biolink.disease_get_phenotype,has_phenotype
8,HP:0003002,Breast carcinoma,1.619721,biolink.disease_get_phenotype,has_phenotype
9,HP:0004377,Hematological neoplasm,1.619721,biolink.disease_get_phenotype,has_phenotype


In this case, all of the results are coming from biolink's disease to phenotype function.  As mentioned above, results here are ranked by their literature co-occurence with the query term.

If the caller doesn't want to dig around in a json return, they can also ask for a csv-style list:

In [8]:
fanconi_phenotypes_csv = expand('disease', 'MONDO:0019391', 'phenotypic_feature',output_format = 'csv')

Return Status: 200


In [18]:
#fanconi_phenotypes_csv
fanconi_phenotypes_csv.split('\n')

['n0,n1',
 'MONDO:0019391,HP:0005505',
 'MONDO:0019391,HP:0100615',
 'MONDO:0019391,HP:0002488',
 'MONDO:0019391,HP:0002860',
 'MONDO:0019391,HP:0004430',
 'MONDO:0019391,HP:0002721',
 'MONDO:0019391,HP:0001402',
 'MONDO:0019391,HP:0000286',
 'MONDO:0019391,HP:0003002',
 'MONDO:0019391,HP:0004377',
 'MONDO:0019391,HP:0007354',
 'MONDO:0019391,HP:0008064',
 'MONDO:0019391,HP:0006721',
 'MONDO:0019391,HP:0003254',
 'MONDO:0019391,HP:0001915',
 'MONDO:0019391,HP:0000824',
 'MONDO:0019391,HP:0004808',
 'MONDO:0019391,HP:0005560',
 'MONDO:0019391,HP:0100806',
 'MONDO:0019391,HP:0005510',
 'MONDO:0019391,HP:0001972',
 'MONDO:0019391,HP:0001251',
 'MONDO:0019391,HP:0011018',
 'MONDO:0019391,HP:0004431',
 'MONDO:0019391,HP:0004810',
 'MONDO:0019391,HP:0000789',
 'MONDO:0019391,HP:0000953',
 'MONDO:0019391,HP:0001900',
 'MONDO:0019391,HP:0001908',
 'MONDO:0019391,HP:0000135',
 'MONDO:0019391,HP:0100608',
 'MONDO:0019391,HP:0010650',
 'MONDO:0019391,HP:0010972',
 'MONDO:0019391,HP:0001009',
 'MO

## Curie inputs and synonymization

ROBOKOP will perform identifier translations when it can.  This means that for most common input types, there are a range of curie prefixes that will work without the user doing any work.  

For example, Fanconi Anemia is identified as `MONDO:0019339`, which is ROBOKOP's preferred identifier, but that is equivalent to `DOID:13636`, `Orphanet:84`, `NCIT:C62505`, `UMLS:C0015625`, `MeSH:D005199`, and `MedDRA:10055206`.  We can see that calling expand with any of these inputs will produce the same results:

In [11]:
equivalents=['MedDRA:10055206','DOID:13636','UMLS:C0015625','Orphanet:84','NCIT:C62505','MeSH:D005199']
for equivalent_id in equivalents:
    e_result = expand('disease', equivalent_id, 'phenotypic_feature',output_format='csv')
    print(equivalent_id, len(e_result.split('\n')), e_result.split('\n')[1])

Return Status: 200
MedDRA:10055206 251 MONDO:0019391,HP:0005505
Return Status: 200
DOID:13636 251 MONDO:0019391,HP:0005505
Return Status: 200
UMLS:C0015625 251 MONDO:0019391,HP:0005505
Return Status: 200
Orphanet:84 251 MONDO:0019391,HP:0005505
Return Status: 200
NCIT:C62505 251 MONDO:0019391,HP:0005505
Return Status: 200
MeSH:D005199 251 MONDO:0019391,HP:0005505


## Query Types

The `type1` and `type2` arguments are chosen from the [biolink-model](https://biolink.github.io/biolink-model/).  While any type in the model is potentially acceptable, only some types are exposed via ROBOKOP.  The current list of acceptable types is:

* disease_or_phenotypic_feature
    * **phenotypic_feature**
    * **disease**
       * genetic_condition

* **gene**

* **anatomical_entity**
    * **cell**
    * gross_anatomical_structure
    * cellular_component

* **biological_process_or_activity**
    * biological_process
        * pathway
    * molecular_activity

* **chemical_substance**
    * metabolite
    * drug
    
ROBOKOP understands the hierarchical nature of these relationships and can figure out services to call at a different level of the heirarchy.  For instance, suppose an adverse events service returns a mix of diseases and phenotypic features, but the caller only wants diseases.  The user can then ask for diseases, and the service will be called an automatically filtered.  On the other hand, if a user is willing to accept, say, either disease or phenotypic features for a query, then any function that returns both, or either type will automatically get called.

Because of ROBOKOP's caching (see below), some types will return more quickly than others.  These types are in **bold** above.

Note that `genetic_condition` above is not part of the biolink model, but is an additional type that descends from disease in ROBOKOP.

As an example, we'll call expand twice with the same gene, once asking for a disease, and once asking for a genetic condition. There are no translator services that return only genetic condition, but the service knows how to recognize genetic conditions and returns only those diseases that are genetic conditions:

In [19]:
#HGNC:7897 = "NPC1"
expand_result_disease = expand('gene', 'HGNC:7897', 'genetic_condition',output_format='csv',rebuild=True).split('\n')
NPC1_diseases = set([n.split(',')[1] for n in expand_result_disease[1:]]) 
expand_result_gc = expand('gene', 'HGNC:7897', 'genetic_condition',output_format='csv',rebuild=True).split('\n')
NPC1_genetic_conditions = set([n.split(',')[1] for n in expand_result_gc[1:]])

Return Status: 200
Return Status: 200


In [20]:
print('There are',len(NPC1_genetic_conditions),'Genetic Conditions associated with NPC1')
print('There are',len(NPC1_diseases),'Diseases associated with NPC1')
print(len(NPC1_diseases.intersection(NPC1_genetic_conditions)),'of these are in common. In other words, everything in the genetic condition list is also in the disease list')

There are 21 Genetic Conditions associated with NPC1
There are 21 Diseases associated with NPC1
21 of these are in common. In other words, everything in the genetic condition list is also in the disease list


## Caching and Rebuilding

ROBOKOP maintains caches results.  The cache is built both opportunistically (including the results of all previous queries) and proactively (pre-loading data that expected to be heavily used).  By default, expand only looks in its cache.  If a result has not been previously cached, then this call will not return anything (and may return a status code of 500).

If a user wants to force the service to look beyond its local cache, it sends a parameter `rebuild=True`, as seen in the NPC1 examples above.

If a user wants to be sure to retreive all relevant data, they should use `rebuild=True`, but this will be at the expense of performance.  In order to increase performance without sacrificing reliability, certain type pairs are preloaded into the cache.  In this case, there will be no difference in results between calling `rebuild=True` and `rebuild=False`, but calling with `rebuild=True` will be noticeably slower.

Certain pairs of types are preloaded into ROBOKOP's cache, so there is no point in using rebuild for them. The following list will be updated as the preloaded list is modified.  Note that with the data loaded, it doesn't matter which type is the query and which is the resut.  That is, if a row in this table specifies `disease` and `phenotypic_feature`, then there is no reason to use rebuild for `type1='disease' type2='phenotypic_feature'` or `type1='phenotypic_feature' type2='disease'`.

| type | type |
|------|------|
| disease | phenotypic_feature |
| genetic_condition | phenotypic_feature |
| gene    | biological_process_or_activity |
| gene    | disease |
| disease | chemical_substance |
| gene | chemical_substance |
| anatomical_entity | phenotypic_feature |

## Specifying Predicates

The responses can be filtered by a predicate, which should also come from the biolink model.  Here, we look up any chemical that is associated with asthma:

In [6]:
asthma = "MONDO:0004979"

In [None]:
asthma_chemicals = expand('disease',asthma,'chemical_substance',rebuild=True)

There are multiple types that come back from this query: treats, contributes_to, associated_with, etc...

In [22]:
asthma_chem_frame = parse_answer(asthma_chemicals)
asthma_chem_frame

Unnamed: 0,result_id,result_name,score,source,type
0,MESH:D000393,Air Pollutants,1.779961,ctd.disease_to_exposure,affects
1,CHEBI:28177,theophylline,1.775978,hmdb.metabolite_to_disease,related_to
2,MESH:D014028,Tobacco Smoke Pollution,1.716335,ctd.disease_to_exposure,affects
3,MESH:D001335,Vehicle Emissions,1.716335,ctd.disease_to_exposure,affects
4,CHEBI:9449,terbutaline,1.665752,ctd.disease_to_chemical,contributes_to
5,CHEBI:41879,dexamethasone,1.575013,ctd.disease_to_chemical,contributes_to
6,CHEBI:8382,prednisone,1.447142,mychem.get_adverse_events,treats
7,MESH:D052638,Particulate Matter,1.372814,ctd.disease_to_chemical,contributes_to
8,CHEBI:25812,ozone,1.358147,ctd.disease_to_chemical,contributes_to
9,MESH:D004391,Dust,1.354406,ctd.disease_to_exposure,affects


Now we can do an expand setting the predicate to treats, and we'll only return the chemicals above with that predicate type.  Note that we don't need to rebuild, since this will be a subset of the previous query, and so everything will already be in our local cache.

Note that there is one extra subtlety when using predicates.  Edges with predicates have a direction.  That is, the statement (disease)-treats->(drug) is different from (drug)-treats->(disease).   And we are specifically looking for the latter case, where the direction of the edge is towards our input parameter (here the disease).  Without specifying the `direction` parameter, expand will assume that we are looking for outward edges, i.e. cases where our input into expand is the subject.  But here we want inward edges, where the input is the object, so we specify `direction=in`.

In [7]:
asthma_treatment = expand('disease',asthma,'chemical_substance',predicate='treats',direction='in')

http://robokop.renci.org:80/api/simple/expand/disease/MONDO:0004979/chemical_substance
Return Status: 200


In [10]:
asthma_treat_frame = parse_answer(asthma_treatment)
asthma_treat_frame

Unnamed: 0,result_id,result_name,score,source,type
0,CHEBI:8382,prednisone,1.594186,mychem.get_adverse_events,treats
1,CHEBI:28177,theophylline,1.504095,mychem.get_drugcentral,treats
2,CHEBI:3207,budesonide,1.450466,mychem.get_drugcentral,treats
3,CHEBI:3001,beclomethasone,1.441718,ctd.disease_to_chemical,treats
4,CHEBI:9449,terbutaline,1.403697,mychem.get_drugcentral,treats
5,CHEBI:5134,fluticasone,1.376925,mychem.get_adverse_events,treats
6,CHEBI:2549,albuterol,1.290118,ctd.disease_to_chemical,treats
7,CHEBI:8378,prednisolone,1.268903,mychem.get_drugcentral,treats
8,CHEBI:94810,N-[4-oxo-2-(2H-tetrazol-5-yl)-1-benzopyran-8-y...,1.250675,mychem.get_drugcentral,treats
9,CHEBI:10100,zafirlukast,1.182609,mychem.get_drugcentral,treats


## Data sources and allowed hops

ROBOKOP can join many kinds of types, but only those for which it has a configured knowledge source.  For instance, ROBOKOP doesn't currently have a way to join `chemical_substance` and `biological_process_or_activity` directly.  Currently, these are the service hops that are configured in ROBOKOP.  As above, ROBOKOP will traverse the biolink-model hierarchy, so for instance if there is an entry from `metabolite` to `disease` below, then it will also be possible to call `chemical_substance` to `disease`.

The most up-to-date list, and information about which services are configured for each pair of types, can always be found in [this file](https://github.com/NCATS-Gamma/robokop-interfaces/blob/master/greent/rosetta.yml)

| input type | output type |
|------------|------------|
| anatomical_entity | cell |
| anatomical_entity | gene |
| anatomical_entity | phenotypic_feature |
| cell | anatomical_entity |
| cell | biological_process_or_activity |
| biological_process_or_activity | gene |
| biological_process_or_activity | cell |
| chemical_substance | gene |
| chemical_substance | disease |
| chemical_substance | phenotypic_feature |
| disease | phenotypic_feature |
| disease | gene |
| disease | chemical_substance |
| metabolite | pathway |
| gene | anatomical_entity |
| gene | disease |
| gene | chemical_substance |
| gene | pathway |
| gene | biological_process_or_activity |
| pathway | gene |
| pathway | metabolite |
| phenotypic_feature | anatomical_entity |
| phenotypic_feature | disease |