# Building Adeft Models

In the [introduction](introduction.ipynb) notebook, we went over how to use Adeft's pretrained disambiguation models. This notebook is for users who would like to build their own models or who simply want to better understand the inner workings of Adeft. We will go through the steps of creating a model for the shortform IR.

## Mining longform expansions from text corpora

The first step in building a model is assembling a corpus of texts containing mentions of the desired shortform. Adeft does not provide tools for text acquisition; we assume users will be able to supply their own texts. For the pretrained models, texts are extracted from the [INDRA Database](https://github.com/indralab/indra_db), which is not publicly available. We have built the [Adeft App](https://github.com/indralab/adeft_app) on top of Adeft to build models from content in the INDRA Database. The Adeft App is open source, and users are encouraged to look it over for inspiration or to fork and modify it for their own purposes. For this tutorial we will use a sample of 500 texts from the more than 10,000 texts used to build the pretrained model.

In [1]:
import json

with open('data/example_texts.json') as f:
    ir_texts = json.load(f)

Adeft uses an implementation of the [Acromine](http://www.chokkan.org/research/acromine/) algorithm developed by [NaCTeM](http://www.nactem.ac.uk/index.php) to identify longform expansions for a given shortform within a corpus of texts. This is done by searching for defining patterns (DPs) for the shortform within the texts. Statistical co-occurrence frequencies are used to identify the correct expansions corresponding to the defining patterns. Possible expansions for DPs in the sentence before last are:
* patterns
* defining patterns
* for defining patterns
* searching for defining patterns
* etc...

A machine cannot tell a priori what the correct expansion is from a single sentence, but by looking at many DPs within an appropriate corpus of texts it can tell that ***defining patterns*** occurs much more frequently than ***for defining patterns*** and that ***patterns*** rarely occurs without ***defining*** preceding it.
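
The candidate enumeration described above can be sketched in a few lines. This is a toy illustration only, not Adeft's or Acromine's actual implementation: the function `candidate_expansions`, the `max_words` limit, and the three example sentences are all made up for this sketch.

```python
import re
from collections import Counter


def candidate_expansions(sentence, shortform, max_words=4):
    """Return candidate longforms: suffixes of the words preceding '(shortform)'."""
    match = re.search(r'\(\s*' + re.escape(shortform) + r'\s*\)', sentence)
    if match is None:
        return []
    words = sentence[:match.start()].split()
    # Grow the candidate one word at a time, moving leftwards from the parenthesis.
    return [' '.join(words[-n:]) for n in range(1, min(max_words, len(words)) + 1)]


# Hypothetical sentences, each containing a defining pattern for 'DPs'.
sentences = [
    'This is done by searching for defining patterns (DPs) in the texts.',
    'We count defining patterns (DPs) across the corpus.',
    'Many defining patterns (DPs) were found.',
]

counts = Counter()
for sentence in sentences:
    counts.update(candidate_expansions(sentence, 'DPs'))

print(candidate_expansions(sentences[0], 'DPs'))
print(counts.most_common(3))
```

In this toy corpus, raw counts cannot separate "patterns" from "defining patterns" because both appear in every defining pattern. A useful score therefore also has to consider how often a shorter candidate occurs without the longer one in front of it.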

Longform expansions can be mined from texts with the AdeftMiner object. The following cell shows how to initialize an AdeftMiner for a given shortform and process a list of texts.

In [2]:
from adeft.discover import AdeftMiner

ir_miner = AdeftMiner('IR')
ir_miner.process_texts(ir_texts)

A score is produced for each possible longform expansion. The top scoring expansions can be inspected as follows:

In [3]:
ir_miner.top(10)

[('ionizing radiation', 132.3956834532374),
 ('insulin resistance', 125.3089430894309),
 ('ischemia reperfusion', 73.12328767123287),
 ('insulin receptor', 69.90697674418604),
 ('radiation', 48.84269662921349),
 ('to ionizing radiation', 36.46511627906977),
 ('the insulin receptor', 30.176470588235293),
 ('irradiation', 20.782608695652172),
 ('of insulin resistance', 17.263157894736846),
 ('reperfusion', 16.813953488372093)]

We see that the top scoring candidates are not all correct expansions; some are fragments such as "radiation" or padded phrases such as "to ionizing radiation". The get_longforms method extracts the best potential longforms.

In [4]:
ir_miner.get_longforms(cutoff=5)

[('ionizing radiation', 132.3956834532374),
 ('insulin resistance', 125.3089430894309),
 ('ischemia reperfusion', 73.12328767123287),
 ('insulin receptor', 69.90697674418604),
 ('irradiation', 20.782608695652172),
 ('infrared', 15.777777777777779),
 ('immunoreactive', 6.0)]

We see the method has done a good job of identifying correct longform expansions.

## Labeling Texts

To build models, users must produce a dictionary mapping longforms to desired grounding labels. We call these grounding maps. For the pretrained models we use labels consisting of a [Name Space](introduction.ipynb#Name-Spaces) and corresponding ID separated by a colon. In the [Adeft App](https://github.com/indralab/adeft_app) we've implemented a simple GUI to assist in building these dictionaries.

In [5]:
grounding_map = {'ionizing radiation': 'MESH:D011839',
                 'insulin resistance': 'MESH:D007333',
                 'ischemia reperfusion': 'MESH:D015427',
                 'insulin receptor': 'HGNC:6091',
                 'irradiation': 'MESH:D011839',
                 'infrared': 'ungrounded',
                 'immunoreactive': 'ungrounded'}

Adeft is then able to automatically produce labels for some of the texts by searching for defining patterns matching one of these longform expansions. This is done with an AdeftLabeler object. To initialize this object, we need a dictionary mapping shortforms to grounding maps as created above. Dictionaries containing grounding maps for multiple shortforms may be used to produce models for multiple synonymous shortforms.

In [6]:
grounding_dict = {'IR': grounding_map}

The AdeftLabeler is then initialized as follows:

In [7]:
from adeft.modeling.label import AdeftLabeler

labeler = AdeftLabeler(grounding_dict)

The following cell illustrates how to use the AdeftLabeler. It returns a list of two-element tuples. Each tuple contains a text as its first element and a label as its second.

In [8]:
corpus = labeler.build_from_texts(ir_texts)
texts, labels = zip(*corpus)

The output, corpus, is a list of tuples in which the first element is a text and the second element is a label taken from the values of a grounding map. We use an unpacking trick with zip to convert this into two tuples, one of texts and one of labels. Of the 500 input texts, 340 contained defining patterns.

In [9]:
print(len(texts))

340
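
To make the labeling step concrete, here is a minimal sketch of labeling by defining pattern on a toy corpus. It is not the AdeftLabeler implementation: the helper `label_by_defining_pattern`, the reduced grounding map, and the three toy texts are assumptions made for illustration, and the matching is a plain regular expression search.

```python
import re


def label_by_defining_pattern(text, shortform, grounding_map):
    """Return the label of the first longform found directly before '(shortform)'."""
    for longform, label in grounding_map.items():
        pattern = re.escape(longform) + r'\s*\(\s*' + re.escape(shortform) + r'\s*\)'
        if re.search(pattern, text, flags=re.IGNORECASE):
            return label
    return None  # no defining pattern, so the text stays unlabeled


# Hypothetical, reduced grounding map and texts.
toy_map = {'ionizing radiation': 'MESH:D011839', 'insulin receptor': 'HGNC:6091'}
toy_texts = [
    'Ionizing radiation (IR) damages DNA.',
    'The insulin receptor (IR) is a tyrosine kinase.',
    'IR was measured in all samples.',
]

toy_corpus = []
for text in toy_texts:
    label = label_by_defining_pattern(text, 'IR', toy_map)
    if label is not None:
        toy_corpus.append((text, label))

# Same unpacking as in the cell above: zip(*pairs) yields one tuple per position.
labeled_texts, toy_labels = zip(*toy_corpus)
print(len(labeled_texts), toy_labels)
```

As in the real corpus, only texts containing a defining pattern receive a label; the third toy text is dropped and would have to be handled by a trained classifier.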


## Building Predictive Models

An AdeftClassifier can then be used to train a logistic regression model to disambiguate texts that do not contain a defining pattern. The AdeftClassifier must be initialized with the shortform of interest and a list of the labels to consider as positive labels. Users can also pass a list of shortforms for models disambiguating multiple synonymous shortforms. The following cell illustrates the initialization of an AdeftClassifier.

In [10]:
%%capture
from adeft.modeling.classify import AdeftClassifier

classifier = AdeftClassifier('IR', ['MESH:D011839', 'HGNC:6091'])

### Model Details

The AdeftClassifier uses a Logistic Regression model with [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vectorized [n-gram](https://en.wikipedia.org/wiki/N-gram) features. It is implemented with the [Scikit-Learn](https://scikit-learn.org/stable/) Python library. It has three parameters that can be tuned:

* **C : float**

$L_1$ regularization parameter. Following Scikit-learn's Logistic Regression implementation, $C$ is the reciprocal of the $L_1$ penalty $\lambda$. Lower values of $C$ correspond to greater regularization. $L_1$ regularization controls model complexity by adding a multiple, $\lambda$, of the sum of the absolute values of the coefficients to the Logistic Regression objective function. *$L_1$ regularization shrinks regression coefficients toward zero and sets many of them exactly to zero, with higher regularization causing the model to use fewer features.*

* **max_features : int**

Cutoff for the number of TF-IDF vectorized $n$-grams to use as features. Selects the top $n$-grams by frequency in the training set.

* **ngram_range : tuple of int**

Range of values $n$ for which the model takes $n$-grams as features. Must be a tuple of ints $(a, b)$ with $a \le b$, in which case $a$-grams through $b$-grams are used. When ngram_range is set to (1, 1) only unigrams are used.
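
The effect of ngram_range can be illustrated with a small pure-Python sketch. The `ngrams` helper below is hypothetical and works on a pre-split token list; the real features are produced by the TF-IDF vectorizer, which also handles tokenization and weighting.

```python
def ngrams(tokens, ngram_range):
    """List all n-grams of tokens for every n in the inclusive range (a, b)."""
    a, b = ngram_range
    if not 1 <= a <= b:
        raise ValueError('ngram_range must satisfy 1 <= a <= b')
    return [' '.join(tokens[i:i + n])
            for n in range(a, b + 1)
            for i in range(len(tokens) - n + 1)]


tokens = 'insulin receptor signaling'.split()
print(ngrams(tokens, (1, 1)))  # unigrams only
print(ngrams(tokens, (1, 2)))  # unigrams followed by bigrams
```

With (1, 2), a phrase such as "insulin receptor" becomes a feature in its own right, which can help when the individual words are shared between several senses of the shortform.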

### Training
The AdeftClassifier has a cv method which can be used to perform a crossvalidated grid search and calculate classification metrics for a variety of parameters. We have declared that MESH:D011839 (Ionizing Radiation) and HGNC:6091 (Insulin Receptor) are to be considered positive labels. This affects how the classification metrics are calculated. Adeft will report the crossvalidated precision, recall, and F1 score. When there are multiple positive labels, Adeft takes the weighted average of these scores over all positive labels, weighted by the frequency of each label in the test data. The cv method takes a param_grid as used in Scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). This is a dict mapping parameter names to lists of values. Crossvalidation is performed for each combination of parameters from the lists.
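
To see what "each combination of parameters" means, the sketch below expands a parameter grid by hand with the standard library. The grid values are hypothetical and chosen only to show the expansion; a grid with two values per parameter gives 2 × 2 × 2 = 8 candidate settings, each of which would be crossvalidated.

```python
from itertools import product

# Hypothetical grid with two candidate values for each tunable parameter.
example_grid = {
    'C': [1.0, 10.0],
    'max_features': [100, 1000],
    'ngram_range': [(1, 1), (1, 2)],
}

param_names = sorted(example_grid)
combinations = [
    dict(zip(param_names, values))
    for values in product(*(example_grid[name] for name in param_names))
]

print(len(combinations))
print(combinations[0])
```

Because the number of combinations is the product of the list lengths, and each combination is fit once per fold, large grids can become expensive quickly.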

Training with crossvalidation is illustrated in the following cell.

In [11]:
param_grid = {'C': [10.0], 'max_features': [1000], 'ngram_range': [(1, 2)]}
classifier.cv(texts, labels, param_grid, cv=5)

The parameter cv can be either an int specifying the number of folds, or a crossvalidation generator or iterable as accepted by the cv argument of a [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object. A summary of model statistics for the best combination of parameters can be accessed as follows.

In [12]:
classifier.stats

{'label_distribution': {'MESH:D007333': 88,
  'MESH:D011839': 121,
  'HGNC:6091': 70,
  'MESH:D015427': 44,
  'ungrounded': 17},
 'f1': {'mean': 0.9156382549969049, 'std': 0.022090185589664735},
 'precision': {'mean': 0.8887090512018272, 'std': 0.023986053595694345},
 'recall': {'mean': 0.9477732793522268, 'std': 0.032975453755946946}}
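
The weighted averaging over positive labels described in the Training section can be sketched in plain Python for a single test fold. This is an illustrative reimplementation under the assumption that each positive label's score is weighted by how often that label occurs among the true labels; the function name and the toy labels A, B, and C are made up, and Adeft's own code may differ in detail.

```python
from collections import Counter


def weighted_positive_f1(y_true, y_pred, pos_labels):
    """Average per-label F1 over pos_labels, weighted by each label's true frequency."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    weighted_sum, total = 0.0, 0
    for label in pos_labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weighted_sum += support[label] * f1
        total += support[label]
    return weighted_sum / total if total else 0.0


# Toy fold: A and B are positive labels, C is not.
y_true = ['A', 'A', 'B', 'C']
y_pred = ['A', 'B', 'B', 'C']
print(weighted_positive_f1(y_true, y_pred, ['A', 'B']))
```

Note that the correct prediction for the non-positive label C does not contribute to the score; only mistakes that involve a positive label affect it.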

It's also possible to access the underlying GridSearchCV object to get more detailed information. See the Scikit-learn documentation for more information.

In [13]:
grid_search = classifier.grid_search
print(grid_search.cv_results_)

{'mean_fit_time': array([1.05533218]), 'std_fit_time': array([0.01500571]), 'mean_score_time': array([0.30904794]), 'std_score_time': array([0.02238507]), 'param_logit__C': masked_array(data=[10.0],
             mask=[False],
       fill_value='?',
            dtype=object), 'param_tfidf__max_features': masked_array(data=[1000],
             mask=[False],
       fill_value='?',
            dtype=object), 'param_tfidf__ngram_range': masked_array(data=[(1, 2)],
             mask=[False],
       fill_value='?',
            dtype=object), 'params': [{'logit__C': 10.0, 'tfidf__max_features': 1000, 'tfidf__ngram_range': (1, 2)}], 'split0_test_f1': array([0.92011834]), 'split1_test_f1': array([0.93798783]), 'split2_test_f1': array([0.89781103]), 'split3_test_f1': array([0.88319088]), 'split4_test_f1': array([0.93908319]), 'mean_test_f1': array([0.91563825]), 'std_test_f1': array([0.02209019]), 'rank_test_f1': array([1], dtype=int32), 'split0_test_pr': array([0.92260209]), 'split1_test_pr': ar

## Disambiguators

In [14]:
names = {'MESH:D011839': 'Radiation, Ionizing',
         'MESH:D007333': 'Insulin Resistance',
         'HGNC:6091': 'INSR',
         'MESH:D015427': 'Reperfusion Injury'}

In [16]:
from adeft.disambiguate import AdeftDisambiguator

my_disambiguator = AdeftDisambiguator(classifier, grounding_dict, names)

In [17]:
print(my_disambiguator.info())

Disambiguation model for IR

Produces the disambiguations:
	Radiation, Ionizing*	MESH:D011839
	Insulin Resistance	MESH:D007333
	INSR*	HGNC:6091
	Reperfusion Injury	MESH:D015427

Training data had class balance:
	Radiation, Ionizing*	121
	Insulin Resistance	88
	INSR*	70
	Reperfusion Injury	44
	Ungrounded	17

Classification Metrics:
	F1 score:	0.91564
	Precision:	0.88871
	Recall:		0.94777

* Positive labels
See Docstring for explanation



In [18]:
disambs = my_disambiguator.disambiguate(ir_texts)

In [19]:
example1 = ('Ionizing radiation (IR) is radiation that carries enough energy to detach electrons'
            ' from atoms or molecules')

In [20]:
my_disambiguator.disambiguate(example1)

('MESH:D011839',
 'Radiation, Ionizing',
 {'HGNC:6091': 0.0,
  'MESH:D007333': 0.0,
  'MESH:D011839': 1.0,
  'ungrounded': 0.0,
  'MESH:D015427': 0.0})

In [21]:
example2 = 'IR is radiation that carries enough energy to detach electrons from atoms or molecules'

In [22]:
my_disambiguator.disambiguate(example2)

('MESH:D011839',
 'Radiation, Ionizing',
 {'HGNC:6091': 0.029510016355119264,
  'MESH:D007333': 0.08471306232236007,
  'MESH:D011839': 0.6437228878920963,
  'MESH:D015427': 0.05812736438929494,
  'ungrounded': 0.18392666904112942})
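
The returned tuple can be unpacked into a grounding, a name, and a dictionary of predicted probabilities. The sketch below copies the output shown above into a literal so that it runs without a trained model; the confidence threshold of 0.8 is a hypothetical value chosen for illustration, not something Adeft prescribes.

```python
# Output of the previous cell, copied as a literal.
result = ('MESH:D011839',
          'Radiation, Ionizing',
          {'HGNC:6091': 0.029510016355119264,
           'MESH:D007333': 0.08471306232236007,
           'MESH:D011839': 0.6437228878920963,
           'MESH:D015427': 0.05812736438929494,
           'ungrounded': 0.18392666904112942})

grounding, name, probabilities = result

# In this output the reported grounding is the label with the highest probability.
most_likely = max(probabilities, key=probabilities.get)
print(grounding == most_likely)

# Hypothetical downstream rule: only accept sufficiently confident predictions.
threshold = 0.8
is_confident = probabilities[grounding] >= threshold
print(name, is_confident)
```

Without the defining pattern, the model still selects ionizing radiation for this sentence, but with noticeably less certainty than in the first example, so a rule like this would flag it for review.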

In [23]:
with open('data/example.txt') as f:
    example3 = f.read()

In [24]:
my_disambiguator.disambiguate(example3)

('HGNC:6091',
 'INSR',
 {'HGNC:6091': 0.9806451466133977,
  'MESH:D007333': 0.018681316518097377,
  'MESH:D011839': 0.00015954836583725616,
  'MESH:D015427': 0.00020904676603453542,
  'ungrounded': 0.00030494173663304084})