# Building Adeft Models

In the [Introduction](introduction.ipynb) notebook, we went over how to use Adeft's pretrained disambiguation models. This notebook is for users who would like to build their own models or simply to better understand the inner workings of Adeft. We will go through the steps of creating a model for the shortform IR.

## Mining longform expansions from text corpora

The first step in building a model is assembling a corpus of texts containing mentions of the desired shortform. Adeft does not provide tools for text acquisition. We assume users will be able to supply their own texts. For the pretrained models, texts are extracted from the [INDRA Database](https://github.com/indralab/indra_db), which is not publicly available. We have built the [Adeft App](https://github.com/indralab/adeft_app) on top of Adeft to build models from content in the INDRA Database. The Adeft App is open source, and users are encouraged to look it over for inspiration or to fork and modify it for their own purposes. For this tutorial we will use a sample of 500 texts from the over 10,000 texts used to build the pretrained model.

In [None]:
import json

with open('data/example_texts.json') as f:
    ir_texts = json.load(f)

Adeft uses an implementation of the [Acromine](http://www.chokkan.org/research/acromine/) algorithm developed by [NaCTeM](http://www.nactem.ac.uk/index.php) to identify longform expansions for a given shortform within a corpus of texts. This is done by searching for defining patterns (DPs) for the shortform within the texts. Statistical co-occurrence frequencies are used to identify the correct expansions corresponding to the defining patterns. Possible expansions for DPs in the sentence before last are
* patterns
* defining patterns
* for defining patterns
* searching for defining patterns
* etc...

A machine cannot tell a priori what the correct expansion is from a single sentence, but by looking at many DPs for DPs within an appropriate corpus of texts it would be able to tell that ***defining patterns*** occurs much more frequently than ***for defining patterns*** and that ***patterns*** occurs rarely without ***defining*** preceding it.
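
The counting intuition above can be sketched in a few lines of plain Python. This is only a rough illustration, not Adeft's Acromine implementation: the helper `candidate_expansions` and the three toy sentences are made up for this example, and raw counts stand in for the actual scoring.

```python
from collections import Counter


def candidate_expansions(text, shortform, max_words=4):
    """Collect word sequences that directly precede '(shortform)' in text.

    Hypothetical helper for illustration; not part of Adeft.
    """
    marker = f'({shortform})'
    candidates = []
    start = 0
    while True:
        idx = text.find(marker, start)
        if idx == -1:
            break
        # Keep at most max_words words before the parenthesized shortform
        words = text[:idx].split()[-max_words:]
        # Every suffix of those words is a possible expansion
        for i in range(len(words)):
            candidates.append(' '.join(words[i:]).lower())
        start = idx + len(marker)
    return candidates


toy_texts = ['We search for defining patterns (DPs) in each text.',
             'Counting defining patterns (DPs) is the first step.',
             'Several defining patterns (DPs) were found.']

counts = Counter()
for toy_text in toy_texts:
    counts.update(candidate_expansions(toy_text, 'DPs'))

# How often 'patterns' appears as a candidate without 'defining' before it
patterns_alone = counts['patterns'] - counts['defining patterns']
print(counts.most_common(3))
print(patterns_alone)
```

In this toy corpus, ***defining patterns*** is a candidate in every sentence, ***for defining patterns*** in only one, and ***patterns*** never appears without ***defining*** preceding it.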

Longform expansions can be mined from texts with the AdeftMiner object. The following cell shows how to initialize an AdeftMiner for a given shortform and process a list of texts.

In [None]:
from adeft.discover import AdeftMiner

ir_miner = AdeftMiner('IR')
ir_miner.process_texts(ir_texts)

A score will be produced for each possible longform expansion. Top scoring expansions can be inspected as follows

In [None]:
ir_miner.top(10)

We see that the top scoring expansions do not immediately give the correct expansions. A method is implemented to extract the best potential longforms.

In [None]:
ir_miner.get_longforms(cutoff=5)

We see the method has done a good job of identifying correct longform expansions.

## Labeling Texts

To build models, users must produce a dictionary mapping longforms to desired grounding labels. We call these grounding maps. For the pretrained models we use labels consisting of a [Name Space](introduction.ipynb#Name-Spaces) and corresponding ID separated by a colon. In the [Adeft App](https://github.com/indralab/adeft_app) we've implemented a simple GUI to assist in building these dictionaries.

In [None]:
grounding_map = {'ionizing radiation': 'MESH:D011839',
                 'insulin resistance': 'MESH:D007333',
                 'ischemia reperfusion': 'MESH:D015427',
                 'insulin receptor': 'HGNC:6091',
                 'irradiation': 'MESH:D011839',
                 'infrared': 'MESH:D011840',
                 'immunoreactive': 'ungrounded'}

Adeft is then able to automatically produce labels for some of the texts by searching for defining patterns matching one of these longform expansions. This is done with an AdeftLabeler object. To initialize this object, we need a dictionary mapping shortforms to grounding maps as created above. Dictionaries containing grounding maps for multiple shortforms may be used to produce models for multiple synonymous shortforms.
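
To make the idea of labeling by defining patterns concrete, here is a minimal sketch that uses a regular expression in place of Adeft's own pattern matching. The function `label_texts` and the toy data are hypothetical and much cruder than what the AdeftLabeler does; the sketch only shows the shape of the input and output.

```python
import re


def label_texts(texts, shortform, grounding_map):
    """Label texts that contain '<longform> (<shortform>)'.

    Simplified stand-in for defining-pattern matching; not Adeft's algorithm.
    """
    corpus = []
    # Try longer longforms first so the most specific match wins
    longforms = sorted(grounding_map, key=len, reverse=True)
    for text in texts:
        for longform in longforms:
            pattern = (re.escape(longform) + r'\s*\('
                       + re.escape(shortform) + r'\)')
            if re.search(pattern, text, flags=re.IGNORECASE):
                corpus.append((text, grounding_map[longform]))
                break
    return corpus


toy_map = {'ionizing radiation': 'MESH:D011839',
           'insulin resistance': 'MESH:D007333'}
toy_texts = ['Ionizing radiation (IR) damages DNA.',
             'Insulin resistance (IR) precedes type 2 diabetes.',
             'IR was measured in all samples.']

toy_corpus = label_texts(toy_texts, 'IR', toy_map)
# Split the (text, label) pairs into parallel sequences
toy_labeled_texts, toy_labels = zip(*toy_corpus)
print(toy_labels)
```

The third toy text has no defining pattern, so it receives no label and is left out of the result.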

In [None]:
grounding_dict = {'IR': grounding_map}

The AdeftLabeler is then initialized as follows

In [None]:
from adeft.modeling.label import AdeftLabeler

labeler = AdeftLabeler(grounding_dict)

The following cell illustrates how to use the AdeftLabeler. It returns a list of two element tuples. Each tuple contains a text as its first element and a label as its second.

In [None]:
corpus = labeler.build_from_texts(ir_texts)
texts, labels = zip(*corpus)

Each label in the output, corpus, is taken from the values of a grounding map. We use an unpacking trick with zip to split corpus into two parallel tuples, one of texts and one of labels. Of the 500 input texts, 340 contained defining patterns.

In [None]:
print(len(texts))

## Building Predictive Models

An AdeftClassifier can then be used to train a logistic regression model to disambiguate texts that do not contain a defining pattern. The AdeftClassifier must be initialized with the shortform of interest and a list of the labels to consider as positive labels. Users can also pass a list of shortforms for models disambiguating multiple synonymous shortforms. The following cell illustrates the initialization of an AdeftClassifier.

In [None]:
%%capture
from adeft.modeling.classify import AdeftClassifier

classifier = AdeftClassifier('IR', ['MESH:D011839', 'HGNC:6091'])

### Model Details

The AdeftClassifier uses a Logistic Regression model with [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vectorized [n-gram](https://en.wikipedia.org/wiki/N-gram) features. It is implemented with the [Scikit-Learn](https://scikit-learn.org/stable/) Python library. It has three parameters that can be tuned:

* **C : float**

$L_1$ regularization parameter. Following Scikit-learn's Logistic Regression implementation, $C$ is the reciprocal of the $L_1$ penalty $\lambda$. Lower values of $C$ correspond to greater regularization. $L_1$ Regularization controls model complexity by adding a multiple, $\lambda$, of the sum of the absolute value of coefficients to the Logistic Regression objective function. *$L_1$ regularization shrinks regression coefficients to zero, with higher regularization causing the model to use fewer features.*

* **max_features : int**

Cutoff for the number of TF-IDF vectorized $n$-grams to use as features. Only the most frequent $n$-grams in the training set, up to this cutoff, are kept.

* **ngram_range : tuple of int**

Range of values $n$ for which the model takes $n$-grams as features. Must be a tuple of ints $(a, b)$ with $a \le b$, in which case $a$-grams through $b$-grams are used. When ngram_range is set to (1, 1) only unigrams are used.
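
To illustrate what ngram_range selects, the sketch below enumerates word $n$-grams from a tokenized sentence in plain Python. The helper `extract_ngrams` is hypothetical; in Adeft the features are produced by Scikit-learn's TF-IDF vectorization, which also handles tokenization and weighting.

```python
def extract_ngrams(tokens, ngram_range):
    """Return all word n-grams with n between a and b inclusive.

    Hypothetical helper for illustration; not part of Adeft.
    """
    a, b = ngram_range
    if not 1 <= a <= b:
        raise ValueError('ngram_range must satisfy 1 <= a <= b')
    ngrams = []
    for n in range(a, b + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams


toy_tokens = 'insulin receptor signaling'.split()
unigrams = extract_ngrams(toy_tokens, (1, 1))
uni_and_bigrams = extract_ngrams(toy_tokens, (1, 2))
print(unigrams)
print(uni_and_bigrams)
```

With (1, 1) only the three single words are produced; with (1, 2) the two bigrams are added to them.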

### Training
The AdeftClassifier has a cv method which can be used to perform a crossvalidated grid search and calculate classification metrics for a variety of parameters. We have declared that MESH:D011839 (Ionizing Radiation) and HGNC:6091 (Insulin Receptor) are to be considered as positive labels. This impacts how the classification metrics are calculated. Adeft will report the crossvalidated precision, recall, and F1 score. When there is more than one positive label, Adeft will take the weighted average of these scores over all positive labels, weighted by the frequency of each label in the test data. The cv method takes a param_grid as used in Scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). This is a dict mapping parameter names to lists of values. Crossvalidation is performed for each combination of parameters from the lists.
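
Two ideas from the paragraph above can be sketched in plain Python: how a parameter grid expands into combinations, and how a frequency-weighted average over positive labels is formed. The grid values, per-label scores, and label counts below are made up for illustration, and `weighted_average` is a hypothetical helper rather than Adeft's own code.

```python
from itertools import product

# A toy grid: every combination of the listed values is tried
toy_grid = {'C': [1.0, 10.0],
            'max_features': [100, 1000],
            'ngram_range': [(1, 2)]}
combos = [dict(zip(toy_grid, values))
          for values in product(*toy_grid.values())]


def weighted_average(scores, support, pos_labels):
    """Average scores over pos_labels, weighted by label frequency."""
    total = sum(support[label] for label in pos_labels)
    if total == 0:
        raise ValueError('no examples with a positive label')
    weighted = sum(scores[label] * support[label] for label in pos_labels)
    return weighted / total


# Made-up per-label F1 scores and test-set label counts
toy_f1 = {'MESH:D011839': 0.9, 'HGNC:6091': 0.6, 'MESH:D007333': 0.8}
toy_support = {'MESH:D011839': 30, 'HGNC:6091': 10, 'MESH:D007333': 20}

avg_f1 = weighted_average(toy_f1, toy_support,
                          ['MESH:D011839', 'HGNC:6091'])
print(len(combos))
print(avg_f1)
```

The toy grid yields 2 x 2 x 1 = 4 combinations. The score for MESH:D007333 does not enter the average because it is not a positive label, and MESH:D011839 counts three times as much as HGNC:6091 because it is three times as frequent.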

Training with crossvalidation is illustrated in the following cell.

In [None]:
param_grid = {'C': [10.0], 'max_features': [1000], 'ngram_range': [(1, 2)]}
classifier.cv(texts, labels, param_grid, cv=5)

The parameter cv can either be an int specifying the number of folds, or a crossvalidation generator or iterable as taken for the cv argument of a [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object. A summary of model statistics for the best combination of parameters can be accessed as follows.

In [None]:
classifier.stats

It's also possible to access the underlying GridSearchCV object to get more detailed information. See the Scikit-learn documentation for more information.

In [None]:
grid_search = classifier.grid_search
print(grid_search.cv_results_)

## Disambiguators

We can use the grounding_dict and classifier we've produced to build a disambiguator like the one seen in the [Introduction](introduction.ipynb) notebook. These are instantiated as AdeftDisambiguator objects. An AdeftDisambiguator first seeks to disambiguate text by searching for defining patterns. The logistic regression model is used if a defining pattern is not found.
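
The two-stage logic described above can be sketched as follows. This is a simplified illustration under stated assumptions: `disambiguate_sketch` and `toy_classifier` are hypothetical, a regular expression stands in for Adeft's defining-pattern search, and the return value is not meant to mirror what an AdeftDisambiguator returns.

```python
import re


def disambiguate_sketch(text, shortform, grounding_map, classify):
    """Use a defining pattern if one is present, else fall back to classify.

    Hypothetical illustration; not the AdeftDisambiguator implementation.
    """
    # Stage 1: look for '<longform> (<shortform>)', longest longform first
    for longform in sorted(grounding_map, key=len, reverse=True):
        pattern = (re.escape(longform) + r'\s*\('
                   + re.escape(shortform) + r'\)')
        if re.search(pattern, text, flags=re.IGNORECASE):
            return grounding_map[longform], 'defining pattern'
    # Stage 2: no defining pattern found, defer to the statistical model
    return classify(text), 'classifier'


def toy_classifier(text):
    """Stand-in for a trained model; keys on a single word."""
    return 'MESH:D011839' if 'DNA' in text else 'ungrounded'


toy_map = {'ionizing radiation': 'MESH:D011839',
           'insulin resistance': 'MESH:D007333'}

with_dp = disambiguate_sketch('Insulin resistance (IR) is common.',
                              'IR', toy_map, toy_classifier)
without_dp = disambiguate_sketch('IR causes oxidative damage to DNA.',
                                 'IR', toy_map, toy_classifier)
print(with_dp)
print(without_dp)
```

The classifier is only consulted for the second text, which mentions IR without defining it.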

You may recall from the introduction that a disambiguator returns standardized names for each grounding label. These must be explicitly supplied.

In [None]:
names = {'MESH:D011839': 'Radiation, Ionizing',
         'MESH:D007333': 'Insulin Resistance',
         'HGNC:6091': 'INSR',
         'MESH:D015427': 'Reperfusion Injury',
         'MESH:D011840': 'Radiation, Nonionizing'}

An AdeftDisambiguator is instantiated with a classifier, grounding_dict, and dictionary of names as follows

In [None]:
from adeft.disambiguate import AdeftDisambiguator

my_disambiguator = AdeftDisambiguator(classifier, grounding_dict, names)

We may use the info method to see statistics for our custom trained disambiguator just as we could for the pretrained disambiguators

In [None]:
print(my_disambiguator.info())

We can then disambiguate the examples from the Introduction notebook.

In [None]:
example1 = ('Ionizing radiation (IR) is radiation that carries enough energy to detach electrons'
            ' from atoms or molecules')
example2 = ('The detrimental effects of IR involve a highly orchestrated series of'
            ' events that are amplified by endogenous signaling and culminating in'
            ' oxidative damage to DNA, lipids, proteins, and many metabolites.')
with open('data/example.txt') as f:
    example3 = f.read()

The first example contains a defining pattern. The logistic regression classifier is used for the other two examples and produces the correct groundings.

In [None]:
my_disambiguator.disambiguate(example1)

In [None]:
my_disambiguator.disambiguate(example2)

In [None]:
my_disambiguator.disambiguate(example3)

## Saving Disambiguators

Disambiguators can be serialized for use at a later time. A disambiguator has three components: a logistic regression model, a grounding dictionary, and a names dictionary. These will be saved to three separate files. Let's create a directory to put our model in.

In [None]:
import os

if not os.path.exists('data/IR'):
    os.makedirs('data/IR')

The AdeftClassifier has a method to serialize its model to a gzipped json file. It takes as argument a path to where the file should be saved.

In [None]:
classifier.dump_model('data/IR/IR_model.gz')

The model has now been saved into the specified file. The file stores the model's coefficients and intercepts, a list of features along with their document frequencies, model statistics, and other metadata. We see that the file is quite lightweight.
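
For readers unfamiliar with gzipped json, the following sketch writes a small dictionary to a gzipped json file and reads it back using only the standard library. The dictionary keys and the helper names are made up for this example and do not reflect the actual layout of the file written by dump_model.

```python
import gzip
import json
import os
import tempfile


def dump_gzipped_json(obj, path):
    """Write obj as json to a gzip-compressed text file."""
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        json.dump(obj, f)


def load_gzipped_json(path):
    """Read json back from a gzip-compressed text file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return json.load(f)


# Hypothetical stand-in for model parameters; not Adeft's file schema
toy_model = {'coef': [[0.5, -1.2]],
             'intercept': [0.1],
             'features': {'insulin': 12, 'radiation': 30}}

with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, 'toy_model.gz')
    dump_gzipped_json(toy_model, toy_path)
    restored = load_gzipped_json(toy_path)

print(restored == toy_model)
```

Storing plain numbers and strings as compressed json keeps the file small and readable from any language with gzip and json support.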

In [None]:
ls -lh 'data/IR/IR_model.gz'

The grounding and names dictionaries should then be serialized into json files

In [None]:
with open('data/IR/IR_grounding_dict.json', 'w') as f:
    json.dump(grounding_dict, f)
    
with open('data/IR/IR_names.json', 'w') as f:
    json.dump(names, f)

Adeft's pretrained models all have directories following the structure

* `<ModelName>`
    - `<ModelName>_grounding_dict.json`
    - `<ModelName>_names.json`
    - `<ModelName>_model.gz`
    
Pretrained models are stored in a folder named .adeft_models in the user's home directory. If you've placed your disambiguator's files in a directory structured this way, you can load it with the load_disambiguator function as follows

In [None]:
from adeft.disambiguate import load_disambiguator

also_my_disambiguator = load_disambiguator('IR', models_path='data')

You simply pass the path to the folder in which the model folder is contained. The default value for models_path is ```$HOME/.adeft_models```. Here we've placed the model folder for a model named IR in the data directory for this notebook. Models are stored in folders on the user's system keyed by folder name. This introduces some subtleties. Some characters such as "/" are not allowed in filenames. Some file systems are not case sensitive. For the pretrained models, we use escape characters to handle these issues.
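
Before loading a custom model, it can help to check that the folder follows the layout listed earlier. The sketch below builds the three expected file paths for a model name and reports which ones are missing. Both helpers are hypothetical conveniences, not part of Adeft, and they do not apply the escaping used for pretrained model names.

```python
import os


def expected_model_files(model_name, models_path):
    """Return the three file paths expected for a model folder."""
    model_dir = os.path.join(models_path, model_name)
    suffixes = ('grounding_dict.json', 'names.json', 'model.gz')
    return [os.path.join(model_dir, f'{model_name}_{suffix}')
            for suffix in suffixes]


def missing_model_files(model_name, models_path):
    """Return the expected files that do not exist on disk."""
    return [path for path in expected_model_files(model_name, models_path)
            if not os.path.isfile(path)]


expected = expected_model_files('IR', 'data')
print(expected)
print(missing_model_files('IR', 'data'))
```

An empty list from `missing_model_files` means all three files are in place.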

In [None]:
print(also_my_disambiguator.info())

We see that the serialized model we have loaded produces the same disambiguation results as the original model

In [None]:
also_my_disambiguator.disambiguate(example3)

In [None]:
my_disambiguator.disambiguate(example3)

## Conclusion

If you've followed along with this notebook, you're ready to build your own disambiguation models provided that you have access to proper text corpora. If you believe you've found a bug in Adeft please submit an issue at https://github.com/indralab/adeft/issues. 