# Python TERMite toolkit - TERMite

We provide a Python library for making calls to our NER engine, TERMite, as well as the TExpress module for defining more complex semantic patterns. The library also enables post-processing of the JSON returned from such requests. This notebook gives you the rundown on how to make a call to TERMite and some of the possible post-processing of the JSON output.

## Install or update Python toolkit

The Python toolkit can simply be installed by running the following command in the terminal:
```
pip3 install termite_toolkit
```
If you already have the toolkit installed, make sure you have the latest version:
```
pip3 install termite_toolkit --upgrade
```

## Example call to TERMite

Making a call to TERMite with the toolkit is easy: simply ```import termite``` from the ```termite_toolkit``` and make a call.

A call is made up of:
* the TERMite API endpoint
* the entities you wish to use for annotation
* a TERMite request
* request execution

Save the TERMite call in a python script and simply run ```python ExampleCall.py``` in the terminal.

Here is some example text to annotate with a TERMite call:

In [4]:
input_text = "The data in Table 2, Row 2 suggest that Telmisartan might be useful to prevent colon cancer (note that Clopidogrel is in both the Drug and Control arm, so we did not investigate Clopidogrel further). Recent cell-based studies reported that Telmisartan exerts anti-tumor effects by activating peroxisome proliferator-activated receptor-γ (Li et al., 2014; Pu, Zhu & Kong, 2016; Wu et al., 2016b). The algorithm presented here provides the first evidence from a randomized clinical trial indicating that Telmisartan may be viable as a repurposed prevention for colon cancer. Phylloquinone (Table 2, Row 4) is a vitamin (vitamin K1) supplement rather than a prescription drug. K vitamins + sorafenib induce apoptosis in human pancreatic cancer cell lines (Wei, Wang & Carr, 2010). A prospective cohort analysis found that individuals who increased their intake of dietary phylloquinone might have a lower risk of cancer than those who did not (Juanola-Falgarona et al., 2014). The data from the randomized trial in Table 2 suggest that vitamin K1 might actually help prevent cancer (OR = 0.27, 95% CI [0.07–0.98]). The potential cancer prevention by vitamin K1 is especially intriguing because one can get more than 1,000% daily value of vitamin K1 by simply eating one cup of cooked kale or spinach (https://www.healthaliciousness.com/articles/food-sources-of-vitamin-k.php)."

Below is an example TERMite call. The API endpoint specified is TERMite's default endpoint. Here we just print the TERMite result to the screen. 

In [12]:
from pprint import pprint
from termite_toolkit import termite

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# specify entities to annotate
entities = "DRUG,INDICATION"

# initialise a request builder
t = termite.TermiteRequestBuilder()

# add items to your TERMite request
t.set_url(termite_home)
t.set_text(input_text)  # this is where we send the text to be annotated
t.set_entities(entities)  # you must specify the vocab names you would like to use for annotation
t.set_subsume(True)
t.set_input_format("txt")
t.set_output_format("json")  # you can try different output formats here e.g. "tsv"
t.set_reject_ambiguous(False)


# once the query object has been built, execute the TERMite request
termite_response = t.execute(display_request=False)

pprint(termite_response, depth = 3)

{'RESP_META': {'CONID': '127.0.0.1/68',
               'ENTITIES_LIMIT': '[DRUG, INDICATION]',
               'HTTP_CODE': '200',
               'INPUT_SIZE': 1373,
               'JSON_PRODUCER': 'EFFICIENT',
               'REQID': 'd70829d1-bc8f-44ce-a2ee-39a19321a43a-16391',
               'RUNTIME_OPTIONS': {'_termitesys.exetermite': 'true',
                                   '_termitesys.exetexpress': 'false',
                                   'rejectAmbig': 'false',
                                   'subsume': 'true'},
               'TERMITE_RUNTIME': 'default',
               'TERMITE_VERS': '6.3.17',
               'Timing_msec_TOTAL': '1'},
 'RESP_PAYLOAD': {'DRUG': [{...}, {...}, {...}, {...}],
                  'INDICATION': [{...}, {...}, {...}, {...}]},


To understand the JSON output of TERMite results [click here](https://help.scibite.com/a/solutions/articles/179705-anatomy-of-a-termite-hit).
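
As a rough illustration of that structure, the sketch below walks a response dict of the shape printed above, where `RESP_PAYLOAD` maps each entity type to a list of hits. The sample data is a hand-written mock, and the per-hit keys (`hitID`, `name`, `hitCount`) are an assumption taken from the dataframe column names shown in this notebook, so check them against your own server's output. The helpers `count_hits_by_type` and `iter_hits` are hypothetical and not part of the toolkit.

```python
# Hand-written mock that mirrors the top-level shape of the printed response.
# The per-hit keys are assumed, not taken from the TERMite API documentation.
sample_response = {
    "RESP_META": {"HTTP_CODE": "200"},
    "RESP_PAYLOAD": {
        "DRUG": [
            {"hitID": "CHEMBL1017", "name": "Telmisartan", "hitCount": 3},
            {"hitID": "CHEMBL1336", "name": "BAY43-9006", "hitCount": 1},
        ],
        "INDICATION": [
            {"hitID": "D010190", "name": "Pancreatic Neoplasms", "hitCount": 1},
        ],
    },
}


def count_hits_by_type(response):
    """Return {entity_type: number of entities hit} for one response dict."""
    payload = response.get("RESP_PAYLOAD", {})
    return {entity_type: len(hits) for entity_type, hits in payload.items()}


def iter_hits(response):
    """Yield (entity_type, hit_dict) pairs from the payload."""
    for entity_type, hits in response.get("RESP_PAYLOAD", {}).items():
        for hit in hits:
            yield entity_type, hit


counts = count_hits_by_type(sample_response)
names = [hit.get("name") for _, hit in iter_hits(sample_response)]
print(counts)
print(names)
```

The same two helpers could be pointed at a real `termite_response` from a single-text call, provided its payload has the shape shown in the printed output.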

Run ```help(termite.TermiteRequestBuilder)``` to see the functions available on ```TermiteRequestBuilder()``` and how they set the runtime options.

Once you are familiar with making a call in Python, you can also annotate files and pass the TERMite options as a Python dict (the available options are listed on your TERMite server homepage), as in the example below:


In [4]:
from pprint import pprint
from termite_toolkit import termite
import sys
import os

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# input file
parentDir = os.path.dirname(os.path.dirname(os.path.abspath("__file__")))  # parent of the current working directory (the quoted "__file__" is resolved against the working directory)
input_file = os.path.join(parentDir, 'sample_scripts/medline_sample.zip')

# TERMite options
options = {"format": "medline.xml", "output": "json", "entities": "DRUG,GENE,INDICATION"}

# TERMite call as JSON result
termite_json_response = termite.annotate_files(termite_home, input_file, options)
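
The options dict above is plain Python, so it can be assembled programmatically. The sketch below is a hypothetical helper (`build_options` is not part of the toolkit) that builds a dict with the same three keys used in the example; the default values are simply those from the example, not documented defaults.

```python
def build_options(entities, input_format="medline.xml", output_format="json"):
    """Assemble a TERMite options dict like the one used in the example.

    `entities` may be a list of vocab names or a comma-separated string.
    """
    if isinstance(entities, str):
        entities = entities.split(",")
    # Strip whitespace, drop empty names, and remove duplicates in order.
    cleaned = list(dict.fromkeys(e.strip() for e in entities if e.strip()))
    if not cleaned:
        raise ValueError("at least one entity type is required")
    return {
        "format": input_format,
        "output": output_format,
        "entities": ",".join(cleaned),
    }


options = build_options(["DRUG", "GENE", "INDICATION"])
print(options)
```

The resulting dict could then be passed as the third argument to `termite.annotate_files`, as in the call above.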

## TERMite toolkit library

The standard JSON output is the richest, but it isn't the most human-friendly.

The TERMite toolkit has many built-in functions for parsing outputs. For example, ```get_entity_hits_from_json()``` takes a JSON TERMite response and returns a summary of the hits with additional filtering rules applied. The returned object is a Python dict indexed by entity ID, with associated frequency counts.

Below is an example of post-processing of the results from our first TERMite example call; we've filtered the TERMite hits so that we're only looking at DRUG hits.

In [5]:
filtered_hits = termite.get_entity_hits_from_json(termite_response, 'DRUG', reject_ambig=False)

pprint(filtered_hits)

{'DRUG$CHEMBL1017': {'doc_count': 1,
                     'doc_id': [''],
                     'hit_count': 3,
                     'id': 'CHEMBL1017',
                     'max_relevance_score': 4,
                     'name': 'Telmisartan',
                     'type': 'DRUG'},
 'DRUG$CHEMBL1336': {'doc_count': 1,
                     'doc_id': [''],
                     'hit_count': 1,
                     'id': 'CHEMBL1336',
                     'max_relevance_score': 1,
                     'name': 'BAY43-9006',
                     'type': 'DRUG'},
 'DRUG$CHEMBL1550': {'doc_count': 1,
                     'doc_id': [''],
                     'hit_count': 6,
                     'id': 'CHEMBL1550',
                     'max_relevance_score': 4,
                     'name': 'Phytomenadione',
                     'type': 'DRUG'},
 'DRUG$CHEMBL1771': {'doc_count': 1,
                     'doc_id': [''],
                     'hit_count': 2,
                     'id': 'CHEMBL1771',
   

We've also added functionality to convert the json and doc.JSONx outputs into a pandas dataframe, either by individual hits or grouped by TERMite ID.

In [6]:
termite.get_termite_dataframe(termite_response, reject_ambig = False).head()

|  | docID | entityType | hitID | name | score | realSynList | totnosyns | nonambigsyns | frag_vector_array | hitCount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Termite_Doc_d70829d1-bc8f-44ce-a2ee-39a19321a4... | DRUG | CHEMBL1017 | Telmisartan | 4 | [Telmisartan, Telmisartan, Telmisartan] | 1 | 1 | [1#le 2, Row 2 suggest that {!Telmisartan!} mi... | 3 |
| 1 | Termite_Doc_d70829d1-bc8f-44ce-a2ee-39a19321a4... | DRUG | CHEMBL1336 | BAY43-9006 | 1 | [sorafenib] | 1 | 1 | [5#K vitamins + {!sorafenib!} induce apoptosis... | 1 |
| 2 | Termite_Doc_d70829d1-bc8f-44ce-a2ee-39a19321a4... | DRUG | CHEMBL1550 | Phytomenadione | 4 | [Phylloquinone, vitamin K1, phylloquinone, vit... | 2 | 2 | [4#{!Phylloquinone!} (Table 2, Row 4) is a vi,... | 6 |
| 3 | Termite_Doc_d70829d1-bc8f-44ce-a2ee-39a19321a4... | DRUG | CHEMBL1771 | Clopidogrel Bisulfate | 1 | [Clopidogrel, Clopidogrel] | 1 | 1 | [1#colon cancer (note that {!Clopidogrel!} is ... | 2 |
| 4 | Termite_Doc_d70829d1-bc8f-44ce-a2ee-39a19321a4... | INDICATION | D010190 | Pancreatic Neoplasms | 1 | [pancreatic cancer] | 1 | 1 | [5#nduce apoptosis in human {!pancreatic cance... | 1 |


In [7]:
termite.all_entities_df(termite_json_response).head()

|  | doc_count | doc_id | hit_count | id | max_relevance_score | name | type |
|---|---|---|---|---|---|---|---|
| DRUG$CHEMBL38 | 1 | [25987350] | 5 | CHEMBL38 | 5 | Tretinoin | DRUG |
| GENE$CD4 | 9 | [25987350, 24793818, 23523417, 22995775, 23021... | 38 | CD4 | 5 | CD4 molecule | GENE |
| GENE$TGFB1 | 7 | [25987350, 22578910, 23403039, 22974822, 23643... | 22 | TGFB1 | 5 | transforming growth factor beta 1 | GENE |
| GENE$IL2RA | 4 | [25987350, 23643066, 26589955, 24411923] | 9 | IL2RA | 4 | interleukin 2 receptor subunit alpha | GENE |
| GENE$FOXP3 | 5 | [25987350, 23523417, 23643066, 26589955, 24411... | 10 | FOXP3 | 5 | forkhead box P3 | GENE |


We've made it easier to identify which VOCabs have hits within the TERMite input, their frequencies, and the most frequent hits:

In [8]:
termite.all_entities(termite_json_response)

['DRUG', 'GENE', 'INDICATION']

In [9]:
termite.entity_freq(termite_json_response)

|  | entityType |
|---|---|
| INDICATION | 3168 |
| DRUG | 573 |
| GENE | 530 |
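
For intuition, a per-type frequency like this can be derived by hand from a summary dict of the shape returned by ```get_entity_hits_from_json()```. The sketch below sums `hit_count` per `type`, using the five sample rows from the `all_entities_df` output above. Whether `entity_freq` itself sums hit counts or counts distinct entities is not stated here, so treat the summing as an assumption; `hits_per_type` is a hypothetical helper, not a toolkit function.

```python
from collections import Counter

# Sample rows copied from the all_entities_df output shown earlier,
# reduced to the two fields this sketch needs.
summary = {
    "DRUG$CHEMBL38": {"type": "DRUG", "hit_count": 5},
    "GENE$CD4": {"type": "GENE", "hit_count": 38},
    "GENE$TGFB1": {"type": "GENE", "hit_count": 22},
    "GENE$IL2RA": {"type": "GENE", "hit_count": 9},
    "GENE$FOXP3": {"type": "GENE", "hit_count": 10},
}


def hits_per_type(summary):
    """Sum hit_count per entity type, most frequent type first."""
    totals = Counter()
    for record in summary.values():
        totals[record["type"]] += record["hit_count"]
    return totals.most_common()


type_totals = hits_per_type(summary)
print(type_totals)
```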


In [16]:
termite.top_hits_df(termite_json_response, selection=5)

|  | id | name | type | hit_count |
|---|---|---|---|---|
| INDICATION$D003424 | D003424 | Crohn Disease | INDICATION | 2084 |
| INDICATION$D003093 | D003093 | Colitis, Ulcerative | INDICATION | 1509 |
| INDICATION$D015212 | D015212 | Inflammatory Bowel Diseases | INDICATION | 1000 |
| INDICATION$D007249 | D007249 | Inflammation | INDICATION | 468 |
| DRUG$CHEMBL1201581 | CHEMBL1201581 | TA-650 | DRUG | 369 |
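
A ranking of this kind can also be sketched in plain Python on a summary dict of the shape returned by ```get_entity_hits_from_json()```. The example below uses three of the DRUG entries from the filtered output earlier in this notebook and sorts them by `hit_count`. The helper `top_hits` is hypothetical and only illustrates the idea; it is not how the toolkit implements the ranking.

```python
# Three entries copied from the filtered DRUG hits shown earlier,
# reduced to the fields this sketch needs.
filtered_sample = {
    "DRUG$CHEMBL1017": {"name": "Telmisartan", "type": "DRUG", "hit_count": 3},
    "DRUG$CHEMBL1336": {"name": "BAY43-9006", "type": "DRUG", "hit_count": 1},
    "DRUG$CHEMBL1550": {"name": "Phytomenadione", "type": "DRUG", "hit_count": 6},
}


def top_hits(summary, n=5):
    """Return the n entries with the highest hit_count as (key, name, count)."""
    ranked = sorted(
        summary.items(),
        key=lambda item: item[1]["hit_count"],
        reverse=True,
    )
    return [(key, rec["name"], rec["hit_count"]) for key, rec in ranked[:n]]


top_two = top_hits(filtered_sample, n=2)
print(top_two)
```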
