# Building Adeft Models

The [Introduction](introduction.ipynb) notebook explains how to use Adeft's pretrained disambiguation models. This notebook is for users who would like to build their own models or simply to better understand the inner workings of Adeft. Here we will go through the steps of creating a model for the shortform `IR`.

## Mining longform expansions from text corpora

The first step in building a model is assembling a corpus of texts containing mentions of the desired shortform. Adeft does not provide tools for text acquisition. We assume users will be able to supply their own texts. For the pretrained models, texts are extracted from the [INDRA Database](https://github.com/indralab/indra_db) (which is not publicly available). For this tutorial we will use a sample of 500 texts from the over 10,000 texts used to build the pretrained model.

In [None]:
import json

with open('data/example_texts.json') as f:
    ir_texts = json.load(f)

Adeft uses an implementation of the [Acromine](http://www.chokkan.org/research/acromine/) algorithm developed by [NaCTeM](http://www.nactem.ac.uk/software/acromine/) to identify longform expansions for a given shortform within a corpus of texts. This is done by searching for defining patterns (DPs) for the shortform within the texts. Statistical co-occurrence frequencies are used to identify the correct expansions corresponding to the defining patterns. For example, in the phrase

> This is done by searching for defining patterns (DPs)...

possible expansions for `DPs`, based on the text preceding the parentheses, are:
* `patterns`
* `defining patterns`
* `for defining patterns`
* `searching for defining patterns`
* etc...

While the appropriate text boundaries for the longform are difficult to determine from a single sentence, the
correct scope can be determined by looking at defining patterns in a large corpus of texts. For example, given many such texts, the Acromine algorithm can determine that `defining patterns` occurs much more frequently than `for defining patterns` and that `patterns` occurs rarely without `defining` preceding it.
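
The idea can be sketched in a few lines of plain Python. This is not Adeft's implementation: the helper names `candidate_longforms` and `simple_score`, the toy texts, and the scoring rule (a candidate's frequency minus the expected frequency of its one-word-longer extensions, loosely modeled on the Acromine approach) are all assumptions made for illustration.

```python
import re
from collections import Counter


def candidate_longforms(text, shortform, max_words=4):
    """Return candidate expansions preceding each '(shortform)' in text."""
    candidates = []
    pattern = re.compile(r'\(\s*%s\s*\)' % re.escape(shortform))
    for match in pattern.finditer(text):
        # Words that appear before the opening parenthesis
        words = re.findall(r'[\w-]+', text[:match.start()].lower())
        for n in range(1, min(max_words, len(words)) + 1):
            candidates.append(' '.join(words[-n:]))
    return candidates


def simple_score(candidate, counts):
    """Frequency of candidate, penalized by its one-word-longer extensions."""
    n_words = len(candidate.split())
    extensions = [count for other, count in counts.items()
                  if other.endswith(' ' + candidate)
                  and len(other.split()) == n_words + 1]
    if not extensions:
        return float(counts[candidate])
    return counts[candidate] - sum(c * c for c in extensions) / sum(extensions)


# Hypothetical toy corpus
toy_texts = [
    'This is done by searching for defining patterns (DPs) in text.',
    'We rely on defining patterns (DPs) to find expansions.',
    'Many defining patterns (DPs) occur in abstracts.',
]

counts = Counter()
for text in toy_texts:
    counts.update(candidate_longforms(text, 'DPs'))

scores = {candidate: simple_score(candidate, counts) for candidate in counts}
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy corpus `patterns` and `defining patterns` are equally frequent, but `patterns` is always preceded by `defining`, so its score is driven down, while `defining patterns` is preceded by a different word each time and keeps most of its frequency.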

Longform expansions are mined from texts with the ``AdeftMiner`` class. The following code shows how to initialize an instance of ``AdeftMiner`` for a given shortform and process a list of texts.

In [None]:
from adeft.discover import AdeftMiner

ir_miner = AdeftMiner('IR')
ir_miner.process_texts(ir_texts)

A score will be produced for each possible longform expansion. Top scoring expansions can be inspected as follows:

In [None]:
ir_miner.top(10)

These "raw" longforms include redundant/overlapping entries (for example, "ionizing radiation" and "to ionizing radiation", and "insulin resistance" and "of insulin resistance"). Adeft analyzes the words in each longform to identify and remove non-relevant prefixes, arriving at an optimized set which can be inspected using the method ``get_longforms``:

In [None]:
ir_miner.get_longforms(cutoff=5)

As shown above, Adeft obtains a good set of longforms for ``IR`` from this corpus.

## Labeling Texts

To build models, users must produce a dictionary that maps longforms to desired identifiers. We call these *grounding maps*. For the pretrained models we use labels consisting of a [Namespace](introduction.ipynb#Name-Spaces) and corresponding ID separated by a colon. An example grounding map is shown below.

In [None]:
grounding_map = {'ionizing radiation': 'MESH:D011839',
                 'insulin resistance': 'MESH:D007333',
                 'ischemia reperfusion': 'MESH:D015427',
                 'insulin receptor': 'HGNC:6091',
                 'irradiation': 'MESH:D011839',
                 'infrared': 'MESH:D007259',
                 'immunoreactive': 'ungrounded'}

Given a grounding map, Adeft can automatically associate identifiers with the texts in the training corpus that contain defining patterns corresponding to one of the longform expansions in the grounding map. This is done with the ``AdeftLabeler`` class. To initialize this object, we need a dictionary mapping each shortform to a grounding map.

In [None]:
grounding_dict = {'IR': grounding_map}

In some cases, it may be useful to train a model for multiple synonymous shortforms. For example, "nanoparticles" can be abbreviated as ``NP`` or ``NPs``, and it is useful to train a single model on texts containing both shortforms. In this case one can create a dictionary linking a grounding map to each shortform:

In [None]:
np_grounding_dict = {"NP":  {"nanoparticle": "MESH:D053758",
                             "nucleus pulposus": "MESH:D000070614",
                             "nucleoprotein": "MESH:D009698",},
                     "NPs": {"nanoparticles": "MESH:D053758",
                             "natriuretic peptides": "FPLX:Natriuretic_peptide",
                             "nurse practitioners": "ungrounded",}}

Given a grounding dictionary for the relevant shortform(s), the ``AdeftLabeler`` is initialized as follows:

In [None]:
from adeft.modeling.label import AdeftLabeler

labeler = AdeftLabeler(grounding_dict)

Texts for model training are labeled using the ``AdeftLabeler.build_from_texts`` method. The output, ``corpus``, is a list of two-element tuples in which the first element is a text and the second element is a label taken from the values of the grounding map.

In [None]:
corpus = labeler.build_from_texts(ir_texts)
print("corpus[0][0]: %s..." % corpus[0][0][0:70])
print("corpus[0][1]: %s" % str(corpus[0][1]))

We use ``zip`` to convert this to two lists: a list of texts and a corresponding list of labels. Of the 500 input texts, 340 contained defining patterns for ``IR``.

In [None]:
texts, labels = zip(*corpus)

print(len(texts))
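
To make the labeling step more concrete, here is a minimal sketch of how texts could be labeled from defining patterns. It is only an approximation of what a labeler has to do, not the code inside ``AdeftLabeler``; the function `label_text`, the toy grounding map, and the toy texts are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy inputs, for illustration only
toy_grounding_map = {'ionizing radiation': 'MESH:D011839',
                     'insulin resistance': 'MESH:D007333'}
toy_texts = [
    'Ionizing radiation (IR) damages DNA.',
    'Obesity promotes insulin resistance (IR) in mice.',
    'The effects of IR were measured after 24 hours.',
]


def label_text(text, shortform, grounding_map):
    """Return labels whose longform appears in a defining pattern in text."""
    labels = set()
    for longform, label in grounding_map.items():
        # Longform is matched case-insensitively, shortform exactly
        pattern = r'\b(?i:%s)\s*\(\s*%s\s*\)' % (re.escape(longform),
                                                re.escape(shortform))
        if re.search(pattern, text):
            labels.add(label)
    return labels


toy_corpus = []
for text in toy_texts:
    for label in sorted(label_text(text, 'IR', toy_grounding_map)):
        toy_corpus.append((text, label))

toy_labels = [label for _, label in toy_corpus]
print(Counter(toy_labels))
```

The third toy text mentions `IR` without a defining pattern, so it receives no label and is left out of the toy corpus. The same effect explains why only a subset of the 500 input texts ends up in ``corpus``.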

## Building Predictive Models

The ``AdeftClassifier`` class can then be used to train a logistic regression model to disambiguate texts that do not contain a defining pattern. The ``AdeftClassifier`` must be initialized with the shortform of interest and a list of the labels to consider as *positive* labels. A positive label is one that is considered relevant to the user's information extraction task. Precision, recall, and F1 scores are calculated using a weighted average of the positive labels. The following code illustrates the initialization of an ``AdeftClassifier``:

In [None]:
%%capture
from adeft.modeling.classify import AdeftClassifier

classifier = AdeftClassifier('IR', ['MESH:D011839', 'HGNC:6091'])

When training models to disambiguate multiple synonymous shortforms (as in the case of ``NP``/``NPs``, above), the ``AdeftClassifier`` is passed a list of shortforms, as in the following example:

In [None]:
np_classifier = AdeftClassifier(['NP', 'NPs'], ['MESH:D053758', 'FPLX:Natriuretic_peptide'])

### Model Details

The AdeftClassifier uses a Logistic Regression model with [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vectorized [n-gram](https://en.wikipedia.org/wiki/N-gram) features. It is implemented with the [Scikit-Learn](https://scikit-learn.org/stable/) Python library. It has three parameters that can be tuned:

* **C : float**

$L_1$ regularization parameter. Following Scikit-learn's Logistic Regression implementation, $C$ is the reciprocal of the $L_1$ penalty $\lambda$, so lower values of $C$ correspond to greater regularization. $L_1$ regularization controls model complexity by adding a multiple, $\lambda$, of the sum of the absolute values of the coefficients to the Logistic Regression objective function. *$L_1$ regularization shrinks regression coefficients toward zero and sets some of them exactly to zero, with higher regularization causing the model to use fewer features.*

* **max_features : int**

Maximum number of TF-IDF vectorized $n$-grams to use as features. The most frequent $n$-grams in the training set are selected, up to this cutoff.

* **ngram_range : tuple of int**

Range of values $n$ for which the model takes $n$-grams as features. Must be a tuple of ints $(a, b)$ with $a \le b$, in which case $a$-grams through $b$-grams are used. When ngram_range is set to (1, 1), only unigrams are used.
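
The effect of ``ngram_range`` and ``max_features`` can be illustrated without Scikit-learn. The sketch below is a simplification under stated assumptions: the helper `extract_ngrams`, whitespace tokenization, and the toy documents are hypothetical, and the TF-IDF weighting itself is omitted.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_range):
    """Return all n-grams with a <= n <= b for ngram_range = (a, b)."""
    a, b = ngram_range
    if not 1 <= a <= b:
        raise ValueError('ngram_range must satisfy 1 <= a <= b')
    return [' '.join(tokens[i:i + n])
            for n in range(a, b + 1)
            for i in range(len(tokens) - n + 1)]


# Hypothetical toy documents, tokenized on whitespace
toy_docs = ['insulin receptor signaling', 'insulin receptor substrate']

ngram_counts = Counter()
for doc in toy_docs:
    ngram_counts.update(extract_ngrams(doc.split(), (1, 2)))

# Keep only the most frequent n-grams, as max_features does
max_features = 3
vocabulary = [ngram for ngram, _ in ngram_counts.most_common(max_features)]
print(vocabulary)
```

With ``ngram_range=(1, 2)`` both unigrams and bigrams are counted, and the cutoff keeps the three n-grams that occur in both toy documents.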

### Training

The ``AdeftClassifier`` has a ``cv`` method which can be used to perform a cross-validated grid search to calculate classification metrics for a variety of parameter values. In this example, we have specified that only ``MESH:D011839`` (Ionizing Radiation) and ``HGNC:6091`` (Insulin Receptor) are to be considered as positive labels. This impacts how the classification metrics are calculated and hence how the training parameters are optimized. ``Adeft`` will report the cross-validated precision, recall, and F1 score. For multilabel classification, ``Adeft`` will take the average of these scores over all positive labels, weighted by the frequency of each label in the test data. The ``cv`` method takes a ``param_grid`` argument as used in Scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). ``param_grid`` must be provided as a dictionary mapping parameter names to lists of values. Cross-validation is performed for each combination of parameters from the lists.
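
The weighted averaging over positive labels can be written out by hand. The following is a sketch of that calculation, not Adeft's own scoring code; the function `weighted_scores` and the toy predictions are hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred, pos_labels):
    """Average per-label precision, recall and F1 over pos_labels,
    weighting each label by its frequency in y_true."""
    support = Counter(y_true)
    total = sum(support[label] for label in pos_labels)
    scores = {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
    if total == 0:
        return scores
    pairs = list(zip(y_true, y_pred))
    for label in pos_labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = support[label] / total
        scores['precision'] += weight * precision
        scores['recall'] += weight * recall
        scores['f1'] += weight * f1
    return scores


# Hypothetical true and predicted labels for six test texts
y_true = ['MESH:D011839'] * 3 + ['HGNC:6091'] + ['MESH:D007333'] * 2
y_pred = ['MESH:D011839', 'MESH:D011839', 'MESH:D007333',
          'HGNC:6091', 'MESH:D007333', 'MESH:D011839']

result = weighted_scores(y_true, y_pred, ['MESH:D011839', 'HGNC:6091'])
print(result)
```

Errors on ``MESH:D007333`` only matter here when they involve a positive label, for example when a text is wrongly assigned to ``MESH:D011839``. This is how the choice of positive labels shapes the reported metrics.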

The following illustrates training with cross-validation using two alternative values for ``max_features``, 100 and 1000:

In [None]:
param_grid = {'C': [10.0], 'max_features': [100, 1000], 'ngram_range': [(1, 2)]}
classifier.cv(texts, labels, param_grid, cv=5)
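
To see which parameter combinations a grid like this one expands to, the lists can be combined with ``itertools.product``. This sketch only enumerates the combinations; the names `toy_grid` and `param_names` are introduced here for illustration.

```python
from itertools import product

# Same grid as in the cell above, copied so the sketch is self-contained
toy_grid = {'C': [10.0], 'max_features': [100, 1000], 'ngram_range': [(1, 2)]}

param_names = sorted(toy_grid)
combinations = [dict(zip(param_names, values))
                for values in product(*(toy_grid[name] for name in param_names))]

for params in combinations:
    print(params)
```

This grid yields two combinations, one per value of ``max_features``. With five folds, each combination is fitted and scored five times.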

The parameter ``cv`` can either be an ``int`` specifying the number of folds, or a cross-validation generator or iterable as taken for the ``cv`` argument of a [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object. A summary of model statistics for the best combination of parameters can be accessed as follows.

In [None]:
classifier.stats

It's also possible to access the underlying [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object to get more detailed information. See the Scikit-learn documentation for more information.

In [None]:
grid_search = classifier.grid_search
print(grid_search.cv_results_)

## Disambiguators

The logistic regression classifier we've produced in the steps above can be combined with the ``grounding_dict`` to build a disambiguator like the one shown in the [Introduction](introduction.ipynb) notebook. These are instantiated as ``AdeftDisambiguator`` objects. An ``AdeftDisambiguator`` first seeks to disambiguate text by searching for defining patterns. The logistic regression model is used only if a defining pattern for the shortform is not found.
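
The two-stage logic described above can be sketched as follows. This is an illustration of the control flow only, not the ``AdeftDisambiguator`` source; `disambiguate_sketch` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for a trained classifier.

```python
import re


def disambiguate_sketch(text, shortform, grounding_map, predict):
    """Use a defining pattern if one is present; otherwise call predict."""
    for longform, label in grounding_map.items():
        # Longform is matched case-insensitively, shortform exactly
        pattern = r'\b(?i:%s)\s*\(\s*%s\s*\)' % (re.escape(longform),
                                                re.escape(shortform))
        if re.search(pattern, text):
            return label, 'defining pattern'
    return predict(text), 'classifier'


def toy_predict(text):
    # Stand-in for a trained model: a single hand-written rule
    return 'MESH:D011839' if 'DNA' in text else 'MESH:D007333'


toy_grounding_map = {'ionizing radiation': 'MESH:D011839',
                     'insulin resistance': 'MESH:D007333'}

with_pattern = disambiguate_sketch(
    'Ionizing radiation (IR) is radiation that carries enough energy.',
    'IR', toy_grounding_map, toy_predict)
without_pattern = disambiguate_sketch(
    'The detrimental effects of IR culminate in oxidative damage to DNA.',
    'IR', toy_grounding_map, toy_predict)
print(with_pattern)
print(without_pattern)
```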

You may recall from the introduction that a disambiguator returns standardized names for each grounding label. These must be explicitly supplied, as in the example below:

In [None]:
names = {'MESH:D011839': 'Radiation, Ionizing',
         'MESH:D007333': 'Insulin Resistance',
         'HGNC:6091': 'INSR',
         'MESH:D015427': 'Reperfusion Injury',
         'MESH:D007259': 'Infrared Rays'}

An ``AdeftDisambiguator`` is instantiated with a classifier, grounding_dict, and dictionary of names as follows:

In [None]:
from adeft.disambiguate import AdeftDisambiguator

my_disambiguator = AdeftDisambiguator(classifier, grounding_dict, names)

We can use the ``info`` method to see statistics for our custom disambiguator just as for the pretrained disambiguators:

In [None]:
print(my_disambiguator.info())

We can then disambiguate the examples from the Introduction notebook:

In [None]:
example1 = ('Ionizing radiation (IR) is radiation that carries enough energy to detach electrons'
            ' from atoms or molecules')
example2 = ('The detrimental effects of IR involve a highly orchestrated series of'
            ' events that are amplified by endogenous signaling and culminating in'
            ' oxidative damage to DNA, lipids, proteins, and many metabolites.')
with open('data/example.txt') as f:
    example3 = f.read()

The first example contains a defining pattern. The logistic regression classifier is used for the other two examples and produces the correct groundings.

In [None]:
my_disambiguator.disambiguate(example1)

In [None]:
my_disambiguator.disambiguate(example2)

In [None]:
my_disambiguator.disambiguate(example3)

## Saving Disambiguators

Disambiguators can be serialized for use at a later time. A disambiguator has three components: a logistic regression model, a grounding dictionary, and a names dictionary. These will be saved to three separate files within a directory with the following structure.

* `<ModelName>`
    - `<ModelName>_grounding_dict.json`
    - `<ModelName>_names.json`
    - `<ModelName>_model.gz`

Models are saved to the user's filesystem using ``AdeftDisambiguator.dump``, which takes two arguments: a string identifying the model (e.g., ``IR``), and a path to a folder for storing models. Because the model identifier is used to create subfolders within the model directory, characters such as "/" should not be used. Also note that some file systems (e.g., the default on macOS) are case-insensitive, so identifiers that differ only in case may refer to the same folder.

In [None]:
my_disambiguator.dump('IR', path='data')

An ``IR`` subfolder is created for the model inside the ``data`` directory:

In [None]:
ls -lh 'data'

The model folder contains three files:

In [None]:
ls -lh 'data/IR/'

The file ``IR_model.gz`` contains the coefficients of the logistic regression model, the $n$-gram features along with their frequencies in the training documents, and other classifier metadata. These are stored within a JSON file which is then compressed with gzip. The grounding and names dictionaries are serialized directly to JSON. When downloaded, ``Adeft``'s pretrained models are stored in this format within a hidden folder in the user's home directory called ``.adeft``.
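
Writing JSON through gzip needs only the standard library. The sketch below shows a round trip with a made-up dictionary; the keys in `toy_model` and the file name are assumptions and do not reflect the actual layout of ``IR_model.gz``.

```python
import gzip
import json
import os
import tempfile

# Hypothetical stand-in for classifier data; the real file holds other fields
toy_model = {'coef': [[0.5, -1.2]],
             'vocabulary': {'insulin': 0, 'radiation': 1}}

with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, 'toy_model.gz')
    # Text mode ('wt'/'rt') lets json read and write str through gzip
    with gzip.open(toy_path, 'wt', encoding='utf-8') as f:
        json.dump(toy_model, f)
    with gzip.open(toy_path, 'rt', encoding='utf-8') as f:
        restored = json.load(f)

print(restored == toy_model)
```

Storing plain numbers and strings rather than a pickled Python object keeps the file readable from other tools once it is decompressed.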

To load custom models, the ``load_disambiguator`` function used in the Introduction notebook can be passed an optional ``path`` argument for a user-specified model folder, as shown below:

In [None]:
from adeft.disambiguate import load_disambiguator

also_my_disambiguator = load_disambiguator('IR', path='data')

print(also_my_disambiguator.info())

The serialized model we have loaded produces the same disambiguation results as the original model:

In [None]:
also_my_disambiguator.disambiguate(example3)

In [None]:
my_disambiguator.disambiguate(example3)

## Adeft Grounding Assistant

Because the task of linking groundings to longforms is not automated, Adeft provides a simple graphical user interface to assist with data entry for grounding.

The function ``adeft.gui.ground_with_gui`` opens a simple web application in the browser that allows users to enter groundings and standardized names for longforms and to choose which labels should be considered positive labels when evaluating classifiers.

In [None]:
from adeft.gui import ground_with_gui

longforms, scores = zip(*ir_miner.get_longforms(cutoff=5))

In [None]:
result = ground_with_gui(longforms, scores)

In [None]:
print(result)

A screenshot of the app after all groundings have been entered is shown below. Users can enter names and groundings in the text boxes and then check off the boxes next to the longforms. Clicking submit gives the checked-off longforms the entered name and grounding. Names and groundings can be deleted by pressing the X button to the right of the grounding column. The labels column on the far right is populated with the unique groundings. Clicking the + button toggles whether a label is considered positive. When the user presses generate, the ``ground_with_gui`` function returns a tuple containing a grounding map, a names dictionary, and a list of positive labels. The web application will then stop running.

<img src="figures/adeft_app.png">

Users may supply an initial grounding map. When building Adeft models, we supply initial grounding maps generated by an imperfect grounding function and then use the GUI to manually review and correct these initial groundings. If a grounding map is supplied, an initial names map and list of positive labels can also be supplied.

In [None]:
result2 = ground_with_gui(longforms, scores, grounding_map=grounding_map, names=names)

In [None]:
print(result2)

## Modifying Groundings without Retraining

It's possible to modify groundings and standardized names without having to retrain the classifier. Suppose, for instance, that you prefer the UniProt grounding for Insulin Receptor to the HGNC grounding, and the protein name *Insulin Receptor* to the HGNC symbol INSR. This can be accomplished with a disambiguator's ``modify_groundings`` method. Users can pass in dictionaries mapping previous groundings to new groundings, and previous groundings to new names. This method does not allow two distinct groundings to be mapped to the same new grounding. Users should retrain the model if this is desired.

In [None]:
my_disambiguator.modify_groundings(new_groundings={'HGNC:6091': 'UP:P06213'},
                                   new_names={'HGNC:6091': 'Insulin Receptor'})

We see below that the model info has successfully changed.

In [None]:
print(my_disambiguator.info())

Disambiguations are now made with the updated grounding and name:

In [None]:
my_disambiguator.disambiguate(example3)
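
The restriction that two distinct groundings cannot be mapped to the same new grounding amounts to requiring that the relabeling does not merge labels. A minimal sketch of such a check is shown below. The function `check_new_groundings` is hypothetical and is not how ``modify_groundings`` is necessarily implemented.

```python
def check_new_groundings(current_labels, new_groundings):
    """Return relabeled groundings, raising ValueError if two labels merge."""
    updated = [new_groundings.get(label, label) for label in current_labels]
    if len(set(updated)) != len(set(current_labels)):
        raise ValueError('two distinct groundings map to the same new grounding')
    return updated


toy_labels = ['MESH:D011839', 'HGNC:6091', 'MESH:D007333']

# Allowed: one label is renamed, all labels stay distinct
relabeled = check_new_groundings(toy_labels, {'HGNC:6091': 'UP:P06213'})
print(relabeled)

# Not allowed: HGNC:6091 would collapse into an existing label
try:
    check_new_groundings(toy_labels, {'HGNC:6091': 'MESH:D007333'})
    merged_allowed = True
except ValueError:
    merged_allowed = False
print(merged_allowed)
```

A trained classifier has one output class per label, so renaming a label is a simple lookup, whereas merging two labels would change what the model was trained to separate.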

## Conclusion

If you've followed along with this notebook, you're ready to build your own disambiguation models, provided that you have access to suitable text corpora. If you believe you've found a bug in Adeft, please submit an issue at https://github.com/indralab/adeft/issues. If you'd like to contribute, see https://github.com/indralab/adeft/CONTRIBUTING.md