# Python TERMite toolkit

We provide a Python library for making calls to TERMite, our named entity recognition (NER) engine, and to TExpress, the module for defining more complex semantic patterns. The library also supports post-processing of the JSON returned by these requests.

## Install Python toolkit

The Python toolkit can be installed by running the following command in the terminal:

```
pip3 install termite_toolkit
```

## Example call to TERMite

Making a call to TERMite with the toolkit is easy: ```import termite``` from ```termite_toolkit```, build a request and execute it.

A call is made up of:
* the TERMite API endpoint
* the entities you wish to use for annotation
* a TERMite request
* request execution

Save the TERMite call in a Python script and run ```python ExampleCall.py``` in the terminal.

Run the next cell. It defines some example text that we can make a TERMite call on.

In [None]:
input_text = "The data in Table 2, Row 2 suggest that Telmisartan might be useful to prevent colon cancer (note that Clopidogrel is in both the Drug and Control arm, so we did not investigate Clopidogrel further). Recent cell-based studies reported that Telmisartan exerts anti-tumor effects by activating peroxisome proliferator-activated receptor-γ (Li et al., 2014; Pu, Zhu & Kong, 2016; Wu et al., 2016b). The algorithm presented here provides the first evidence from a randomized clinical trial indicating that Telmisartan may be viable as a repurposed prevention for colon cancer. Phylloquinone (Table 2, Row 4) is a vitamin (vitamin K1) supplement rather than a prescription drug. K vitamins + sorafenib induce apoptosis in human pancreatic cancer cell lines (Wei, Wang & Carr, 2010). A prospective cohort analysis found that individuals who increased their intake of dietary phylloquinone might have a lower risk of cancer than those who did not (Juanola-Falgarona et al., 2014). The data from the randomized trial in Table 2 suggest that vitamin K1 might actually help prevent cancer (OR = 0.27, 95% CI [0.07–0.98]). The potential cancer prevention by vitamin K1 is especially intriguing because one can get more than 1,000% daily value of vitamin K1 by simply eating one cup of cooked kale or spinach (https://www.healthaliciousness.com/articles/food-sources-of-vitamin-k.php)."

Below is an example TERMite call. The API endpoint specified is TERMite's default endpoint. Here we just print the TERMite result to the screen.

Run the cell to see.

In [None]:
from pprint import pprint
from termite_toolkit import termite

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# specify entities to annotate
entities = "DRUG,INDICATION,HUCELL,GENE,ADVENTMED"

# initialise a request builder
t = termite.TermiteRequestBuilder()

# add items to your TERMite request
t.set_url(termite_home)
t.set_text(input_text)  # this is where we send the text to be annotated
t.set_entities(entities)  # you must specify the vocab names you would like to use for annotation
t.set_subsume(True)
t.set_input_format("txt")
t.set_output_format("json")  # you can try different output formats here e.g. "tsv"
t.set_reject_ambiguous(False)


# once the query object has been built, execute the TERMite request
termite_response = t.execute(display_request=False)

pprint(termite_response)

To understand the JSON output of TERMite results [click here](https://help.scibite.com/a/solutions/articles/179705-anatomy-of-a-termite-hit).
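As a rough orientation, the sketch below walks a hand-written stand-in for a JSON response. It assumes the layout used by the post-processing code later in this document: a `RESP_PAYLOAD` mapping from entity type to a list of hits, each carrying `hitID`, `name`, `hitCount` and `realSynList`. The sample IDs and counts are invented, and the helper name `iter_hits` is hypothetical; real responses contain more fields than shown here.

```python
# Hand-written stand-in for a TERMite JSON response (IDs and counts are invented).
mock_response = {
    "RESP_PAYLOAD": {
        "DRUG": [
            {"hitID": "DRUG_ID_1", "name": "Telmisartan", "hitCount": 3,
             "realSynList": ["Telmisartan"]},
        ],
        "INDICATION": [
            {"hitID": "INDICATION_ID_1", "name": "colon cancer", "hitCount": 2,
             "realSynList": ["colon cancer"]},
        ],
    }
}


def iter_hits(response):
    """Yield (entity_type, hit) pairs from a response shaped like mock_response."""
    # .get() keeps the helper safe on responses without a payload
    for entity_type, hits in response.get("RESP_PAYLOAD", {}).items():
        for hit in hits:
            yield entity_type, hit


for entity_type, hit in iter_hits(mock_response):
    print(entity_type, hit["hitID"], hit["name"], hit["hitCount"])
```

Swapping `mock_response` for a real `termite_response` should work as long as the response follows the same layout.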

Run ```help(termite.TermiteRequestBuilder)``` to view the documentation for ```TermiteRequestBuilder()```, including the available functions and how to use them.

Once you're comfortable making a call in Python, you can also annotate files, passing the TERMite options as a Python dict (the available options are listed on your TERMite server homepage), as in the example below:

In [None]:
from pprint import pprint
from termite_toolkit import termite
import os

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# input file
parentDir = os.path.dirname(os.path.dirname(os.path.abspath("__file__")))  # resolves to the parent of the current working directory
input_file = os.path.join(parentDir, 'sample_scripts/medline_sample.zip')

# TERMite options
options = {"format": "medline.xml", "output": "json", "entities": "DRUG,GENE,INDICATION,HUCELL"}

# TERMite call as JSON result
termite_json_response = termite.annotate_files(termite_home, input_file, options)

pprint(termite_json_response)
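If the entity types are held in a Python list, a small helper can assemble the options dict. This is only a sketch: `build_options` is a hypothetical name, not part of the toolkit, and it uses only the three keys shown in the example above (`format`, `output`, `entities`).

```python
def build_options(entities, input_format="medline.xml", output_format="json"):
    """Return an options dict using the same keys as the example above."""
    if isinstance(entities, str):
        entities = [entities]
    # strip whitespace and drop empty names before joining
    cleaned = [name.strip() for name in entities if name and name.strip()]
    if not cleaned:
        raise ValueError("at least one entity type is required")
    return {
        "format": input_format,
        "output": output_format,
        "entities": ",".join(cleaned),
    }


options = build_options(["DRUG", "GENE", "INDICATION", "HUCELL"])
print(options)
```

The resulting dict can be passed to `termite.annotate_files()` in place of the hand-written one.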

## Example call to TExpress

The toolkit can also be used to make TExpress calls to identify patterns and extract biomedical relationships.

A simple TExpress call is made up of:
* the TERMite API endpoint
* the pattern you wish to search for - this can be created in the TERMite UI
* a TExpress request
* request execution

Below is an example TExpress call with the result printed to the screen. Run the cell to see.

In [None]:
from pprint import pprint
from termite_toolkit import texpress

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# specify the pattern you wish to search for - this can be created in the TERMite UI
pattern = ":(INDICATION):{0,5}:(GENE)"

t = texpress.TexpressRequestBuilder()

# individually add items to your TExpress request
t.set_url(termite_home)
t.set_text("sildenafil citrate macrophage colony stimulating factor influenza")
t.set_subsume(True)
t.set_input_format("txt")
t.set_output_format("json")
t.set_allow_ambiguous(False)
t.set_pattern(pattern)

# execute the request
texpress_response = t.execute(display_request=False)

pprint(texpress_response)

For more information on the TExpress JSON results [click here](https://help.scibite.com/a/solutions/articles/4000021813-anatomy-of-a-texpress-result-server-).

As with TERMite, a TExpress call can be reduced to a dict of options passed to ```annotate_files()```:

In [None]:
from pprint import pprint
from termite_toolkit import texpress
import os

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# input file
parentDir = os.path.dirname(os.path.dirname(os.path.abspath("__file__")))  # resolves to the parent of the current working directory
input_file = os.path.join(parentDir, 'sample_scripts/medline_sample.zip')

# TExpress options
options = {"format": "medline.xml", "output": "json", "pattern": ":(INDICATION):{0,5}:(GENE)",
           "opts": "reverse=false"}

texpress_json_response = texpress.annotate_files(termite_home, input_file, options)

pprint(texpress_json_response)
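Both TExpress examples use the same pattern shape, `:(INDICATION):{0,5}:(GENE)`. The sketch below builds strings of that shape from two entity types and a gap value. The helper name `proximity_pattern` is hypothetical, and the reading of `{0,5}` as a bound on the gap between the two entities is an assumption; patterns are best checked in the TERMite UI.

```python
def proximity_pattern(first, second, max_gap=5):
    """Build a pattern string of the form ':(FIRST):{0,max_gap}:(SECOND)'."""
    if max_gap < 0:
        raise ValueError("max_gap must be non-negative")
    # doubled braces are literal braces in str.format
    return ":({}):{{0,{}}}:({})".format(first, max_gap, second)


pattern = proximity_pattern("INDICATION", "GENE")
print(pattern)  # :(INDICATION):{0,5}:(GENE)
```

The returned string can be used as the `"pattern"` value in the options dict or passed to `set_pattern()`.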

## TERMite toolkit library

The script containing your TERMite (or TExpress) call may include further downstream processing to make the results more human-friendly, as in the example below.

In [None]:
import pandas as pd
import collections

all_entity_hits = []

for entity_type, entity_hits in termite_response['RESP_PAYLOAD'].items():
    for hit in entity_hits:
        hit_info = collections.OrderedDict()
        hit_info['hit_id'] = hit["hitID"]
        hit_info['hit_name'] = hit["name"]
        hit_info['hit_entity'] = entity_type
        hit_info['hit_count'] = hit["hitCount"]
        
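        # 'poor' marks hits that have no non-ambiguous synonyms
        if hit["nonambigsyns"] > 0:
            hit_info['poor'] = 'N'
        else:
            hit_info['poor'] = 'Y'

        # de-duplicate the matched synonyms; sorted for a stable order
        hit_info['actual_hits'] = '; '.join(sorted(set(hit["realSynList"])))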
        
        all_entity_hits.append(hit_info)

infoDF = pd.DataFrame(all_entity_hits)
infoDF.set_index('hit_id', inplace=True)

print(infoDF)

The TERMite toolkit has many built-in functions for parsing outputs. For example, ```get_entity_hits_from_json()``` takes a JSON TERMite response and returns a summary of the hits with additional filtering rules applied. The returned object is a Python dict indexed by entity ID, with associated frequency counts.

Below is an example of post-processing of the results from our first TERMite example call; we've filtered the TERMite hits so that we're only looking at DRUG hits. Run the cell to see:

In [None]:
filtered_hits = termite.get_entity_hits_from_json(termite_response, 'DRUG', reject_ambig=False)

pprint(filtered_hits)
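To make the idea of a summary indexed by entity ID concrete, here is a plain-Python sketch that aggregates hit counts from a hand-written stand-in response. It is not the toolkit's implementation: `summarise_hits` is a hypothetical helper, the sample data is invented, and the real ```get_entity_hits_from_json()``` applies its own filtering rules and may return different fields.

```python
# Hand-written stand-in for a TERMite JSON response (IDs and counts are invented).
mock_response = {
    "RESP_PAYLOAD": {
        "DRUG": [
            {"hitID": "DRUG_ID_1", "name": "Telmisartan", "hitCount": 3},
            {"hitID": "DRUG_ID_2", "name": "Clopidogrel", "hitCount": 2},
        ],
        "INDICATION": [
            {"hitID": "INDICATION_ID_1", "name": "colon cancer", "hitCount": 2},
        ],
    }
}


def summarise_hits(response, entity_types):
    """Map hitID -> {'name', 'type', 'hit_count'} for the requested entity types."""
    # entity_types is a comma-separated string, e.g. "DRUG,GENE"
    wanted = {e.strip() for e in entity_types.split(",") if e.strip()}
    summary = {}
    for entity_type, hits in response.get("RESP_PAYLOAD", {}).items():
        if entity_type not in wanted:
            continue
        for hit in hits:
            entry = summary.setdefault(
                hit["hitID"],
                {"name": hit["name"], "type": entity_type, "hit_count": 0},
            )
            # accumulate counts in case the same ID appears more than once
            entry["hit_count"] += hit["hitCount"]
    return summary


drug_summary = summarise_hits(mock_response, "DRUG")
print(drug_summary)
```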

From TERMite hits on MEDLINE files we can extract and visualise the most frequently occurring entity types and hits.

In [None]:
medline_hits = termite.get_entity_hits_from_json(termite_json_response, 'DRUG,GENE,INDICATION,HUCELL')
medlineDF = pd.DataFrame(medline_hits).T

print(medlineDF)

import matplotlib.pyplot as plt

plt.figure()
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
medlineDF['type'].value_counts().plot.pie()
plt.title("Entity Types")
plt.show()

In [None]:
medlineDF.sort_values(by=['hit_count'], ascending=False, inplace=True)

colours = {'INDICATION': '#1f77b4', 'GENE':'#ff7f0e', 'DRUG':'#2ca02c', 'HUCELL':'#d62728'}

plt.figure();
medlineDF.iloc[0:50, 2].plot(kind='bar', color=medlineDF['type'].apply(lambda x: colours[x]))
plt.title("Top 50 most frequent hits")
plt.show()
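The same two views (entity type frequencies and the most frequent hits) can be computed without pandas or matplotlib, which is handy for a quick check in a script. The sketch below assumes the summary is a dict of dicts keyed by entity ID, with `type` and `hit_count` fields as used in the cells above; the sample data and both helper names are invented for illustration.

```python
from collections import Counter

# Illustrative stand-in for a hit summary keyed by entity ID (values are invented).
mock_hits = {
    "DRUG_ID_1": {"name": "Telmisartan", "type": "DRUG", "hit_count": 3},
    "DRUG_ID_2": {"name": "Clopidogrel", "type": "DRUG", "hit_count": 2},
    "INDICATION_ID_1": {"name": "colon cancer", "type": "INDICATION", "hit_count": 5},
}


def count_entity_types(hits):
    """Count how many distinct entities fall under each entity type."""
    return Counter(info["type"] for info in hits.values())


def top_hits(hits, n=50):
    """Return the n (hit_id, info) pairs with the highest hit_count."""
    ranked = sorted(hits.items(), key=lambda item: item[1]["hit_count"], reverse=True)
    return ranked[:n]


print(count_entity_types(mock_hits))
for hit_id, info in top_hits(mock_hits, n=2):
    print(hit_id, info["name"], info["hit_count"])
```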