# Building Adeft Models

The [Introduction](introduction.ipynb) notebook explains how to use Adeft's pretrained disambiguation models. This notebook is for users who would like to build their own models or simply to better understand the inner workings of Adeft. Here we will go through the steps of creating a model for the shortform `IR`.

## Mining longform expansions from text corpora

The first step in building a model is assembling a corpus of texts containing mentions of the desired shortform. Adeft does not provide tools for text acquisition. We assume users will be able to supply their own texts. For the pretrained models, texts are extracted from the [INDRA Database](https://github.com/indralab/indra_db) (which is not publicly available). For this tutorial we will use a sample of 500 texts from the over 10,000 texts used to build the pretrained model.

In [1]:
import json

with open('data/example_texts.json') as f:
    ir_texts = json.load(f)

Adeft uses an implementation of the [Acromine](http://www.chokkan.org/research/acromine/) algorithm developed by [NaCTeM](http://www.nactem.ac.uk/software/acromine/) to identify longform expansions for a given shortform within a corpus of texts. This is done by searching for defining patterns (DPs) for the shortform within the texts. Statistical co-occurrence frequencies are used to identify the correct expansions corresponding to the defining patterns. For example, in the phrase

> This is done by searching for defining patterns (DPs)...

possible expansions for `DPs`, based on the text preceding the parentheses, are:
* `patterns`
* `defining patterns`
* `for defining patterns`
* `searching for defining patterns`
* etc...

While the appropriate text boundaries for the longform can in some cases be difficult to determine from a single sentence, the correct scope can be determined by looking at defining patterns in a large corpus of texts. For example, given many such texts, the Acromine algorithm can determine that `defining patterns` occurs much more frequently than `for defining patterns` and that `patterns` occurs rarely without `defining` preceding it.
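
The candidate enumeration described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not Adeft's or Acromine's implementation: the helper `candidate_longforms`, the `max_words` cutoff, and the toy texts are all hypothetical, and raw counts stand in for the actual scoring.

```python
import re
from collections import Counter


def candidate_longforms(text, shortform, max_words=4):
    """Yield the word sequences that directly precede each '(shortform)'.

    Hypothetical helper for illustration; not part of the Adeft API.
    """
    pattern = re.compile(r'\(\s*%s\s*\)' % re.escape(shortform))
    for match in pattern.finditer(text):
        # Tokenize everything before the opening parenthesis
        words = re.findall(r'[\w-]+', text[:match.start()].lower())
        # Emit the last 1, 2, ..., max_words words as candidate expansions
        for n in range(1, min(max_words, len(words)) + 1):
            yield ' '.join(words[-n:])


# Toy corpus invented for this sketch
toy_texts = [
    'This is done by searching for defining patterns (DPs) in text.',
    'We extract defining patterns (DPs) from abstracts.',
    'Several defining patterns (DPs) were found.',
]

candidate_counts = Counter()
for toy_text in toy_texts:
    candidate_counts.update(candidate_longforms(toy_text, 'DPs'))

for candidate, count in candidate_counts.most_common(4):
    print(candidate, count)
```

In this toy corpus `defining patterns` is counted far more often than `for defining patterns`, but it ties with `patterns`, because `patterns` only ever appears nested inside `defining patterns`. Raw counts alone therefore cannot pick the boundary; a score that discounts nested candidates, as described above, is needed.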

In many cases it is also possible to find the correct text boundaries by aligning the characters in prospective longforms with the characters in the associated shortform. In the above example the first characters of the words in the longform match the characters in the shortform: **d**efining **p**atterns (**DP**). Adeft incorporates a sophisticated alignment-based scoring algorithm and combines it with the Acromine-based approach to improve precision and recall beyond what can be achieved with either method alone.
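
To make the alignment idea concrete, here is a deliberately crude check that only compares word initials with the shortform. It is a sketch, not Adeft's alignment-based scoring, which is considerably more flexible; the function name `initials_match` is made up for this example.

```python
def initials_match(longform, shortform):
    """Return True if the first letters of the longform's words spell the shortform.

    Hypothetical helper for illustration; not part of the Adeft API.
    """
    initials = ''.join(word[0] for word in longform.split())
    return initials.lower() == shortform.lower()


for candidate in ['patterns', 'defining patterns', 'for defining patterns']:
    print(candidate, initials_match(candidate, 'DP'))
```

A check this simple rejects legitimate single-word expansions such as `infrared` for `IR`, whose matching characters are not all word initials. That limitation is one reason a scoring approach is preferable to a hard rule.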

Longform expansions are mined from texts with the ``AdeftMiner`` class. The following code shows how to initialize an instance of ``AdeftMiner`` for a given shortform and process a list of texts.

In [2]:
from adeft.discover import AdeftMiner

In [3]:
ir_miner = AdeftMiner('IR')

In [4]:
ir_miner.process_texts(ir_texts)

A score will be produced for each possible longform expansion. Top scoring expansions can be inspected as follows:

In [5]:
ir_miner.top(20, use_alignment_based_scoring=False)

[('ionizing radiation', 150, 0.9704569606801275),
 ('insulin resistance', 131, 0.9685688522478423),
 ('ischemia reperfusion', 76, 0.9467455621301776),
 ('insulin receptor', 86, 0.9451355661881977),
 ('radiation', 178, 0.9228435197225835),
 ('irradiation', 23, 0.8318098720292505),
 ('reperfusion', 85, 0.7979797979797979),
 ('infrared', 18, 0.7869822485207101),
 ('resistance', 136, 0.6855491329479768),
 ('immunoreactive', 7, 0.5555555555555556),
 ('the insulin receptor', 35, 0.4001876348625575),
 ('rate', 5, 0.31034482758620685),
 ('ionising radiation', 22, 0.26829412460101665),
 ('to ionizing radiation', 43, 0.2619368311791019),
 ('immediate release', 3, 0.2),
 ('infrared spectroscopy', 3, 0.2),
 ('an iterative reconstruction', 3, 0.2),
 ('initial decay rate', 3, 0.1724137931034483),
 ('of insulin resistance', 19, 0.12779242967653584),
 ('lung ischemia reperfusion', 11, 0.11715976331360943)]

These "raw" longforms include redundant/overlapping entries (for example, "ionizing radiation" and "to ionizing radiation", and "insulin resistance" and "of insulin resistance"). Adeft analyzes the words in each longform to identify and remove non-relevant prefixes, arriving at an optimized set which can be inspected using the method ``get_longforms``:

In [6]:
ir_miner.get_longforms(use_alignment_based_scoring=False)

[('ionizing radiation', 150, 0.9704569606801275),
 ('insulin resistance', 131, 0.9685688522478423),
 ('insulin receptor', 86, 0.9451355661881977),
 ('ischemia reperfusion', 76, 0.9467455621301776),
 ('irradiation', 23, 0.8318098720292505),
 ('infrared', 18, 0.7869822485207101),
 ('immunoreactive', 7, 0.5555555555555556),
 ('rate', 5, 0.31034482758620685),
 ('an iterative reconstruction', 3, 0.2),
 ('immediate release', 3, 0.2),
 ('infrared spectroscopy', 3, 0.2)]

We see that Adeft obtains a reasonable set of longforms for ``IR`` from this corpus, though there are mistakes among the longform expansions that appear less frequently (for example, ``rate`` and ``an iterative reconstruction``). A better set of longforms can be found by incorporating alignment-based scoring, which is turned on by default.

In [7]:
ir_miner.get_longforms(use_alignment_based_scoring=True)

[('ionizing radiation', 150, 0.9963623875040957),
 ('insulin resistance', 131, 0.9963549849557908),
 ('insulin receptor', 86, 0.996346769663452),
 ('ischemia reperfusion', 76, 0.9963445302899536),
 ('irradiation', 23, 0.9967054552992847),
 ('ionising radiation', 22, 0.9658174307847472),
 ('infrared', 18, 0.9677666980660011),
 ('immunoreactive', 7, 0.9286419275546706),
 ('iterative reconstruction', 4, 0.9995556666481504),
 ('immediate release', 3, 0.9992003998667001),
 ('induced resistance', 3, 0.9919998944107065),
 ('initial decay rate', 3, 0.8490639333335526),
 ('ischemia and reperfusion', 3, 0.7726626658137018),
 ('infrared spectroscopy', 3, 0.7349345289121265),
 ('irregular and or linear opacities', 3, 0.6123805551410642),
 ('indirect revascularization', 2, 1.0),
 ('induced repair', 2, 1.0),
 ('information retrieval', 2, 1.0),
 ('input resistance', 2, 0.9913173258033389),
 ('ischaemia reperfusion', 2, 0.9843241652031548),
 ('ischemic reperfusion', 2, 0.9843241652031548),
 ('immune r

## Labeling Texts

To build models, users must produce a dictionary that maps longforms to desired identifiers. We call these *grounding maps*. For the pretrained models we use labels consisting of a [Namespace](introduction.ipynb#Name-Spaces) and corresponding ID separated by a colon. An example grounding map is shown below.

In [8]:
grounding_map = {'ionizing radiation': 'MESH:D011839',
                 'insulin resistance': 'MESH:D007333',
                 'ischemia reperfusion': 'MESH:D015427',
                 'insulin receptor': 'HGNC:6091',
                 'irradiation': 'MESH:D011839',
                 'infrared': 'MESH:D007259',
                 'immunoreactive': 'ungrounded'}

Given a grounding map, Adeft can automatically associate identifiers with the texts in the training corpus that contain defining patterns corresponding to one of the longform expansions in the grounding map. This is done with the ``AdeftLabeler`` class. To initialize this object, we need a dictionary mapping each shortform to a grounding map.

In [9]:
grounding_dict = {'IR': grounding_map}

In some cases, it may be useful to train a model for multiple synonymous shortforms. For example, "nanoparticles" can be abbreviated as ``NP`` or ``NPs``, and it is useful to train a single model on texts containing both shortforms. In this case one can create a dictionary linking a grounding map to each shortform:

In [5]:
np_grounding_dict = {"NP":  {"nanoparticle": "MESH:D053758",
                             "nucleus pulposus": "MESH:D000070614",
                             "nucleoprotein": "MESH:D009698",},
                     "NPs": {"nanoparticles": "MESH:D053758",
                             "natriuretic peptides": "FPLX:Natriuretic_peptide",
                             "nurse practitioners": "ungrounded",}}

Given a grounding dictionary for the relevant shortform(s), the ``AdeftLabeler`` is initialized as follows:

In [6]:
from adeft.modeling.label import AdeftLabeler

labeler = AdeftLabeler(grounding_dict)


Texts for model training are labeled using the ``AdeftLabeler.build_from_texts`` method. The input should be a list-like of tuples of the form (text, identifier), where each text has a unique identifier. The output, ``corpus``, is a list of three-element tuples: the first element is a text with all defining patterns stripped out, the second is a label taken from the values of the grounding map, and the third is the identifier associated with the text in the input. The identifiers are useful for mapping texts in the output corpus back to texts in the input: texts without defining patterns are filtered out, and those with defining patterns are modified by replacing each defining pattern with only the shortform, which makes it nontrivial to map back without the identifiers.

In [12]:
corpus = labeler.build_from_texts([(text, identifier) for identifier, text in enumerate(ir_texts)])
print("corpus[0][0]: %s..." % corpus[0][0][0:70])
print("corpus[0][1]: %s" % str(corpus[0][1]))

corpus[0][0]: Polycystic ovary syndrome (PCOS) is the most common endocrinopathy aff...
corpus[0][1]: MESH:D007333
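
The labeling step can be pictured with the following self-contained sketch. It assumes a defining pattern is simply a longform followed by the shortform in parentheses, which is much narrower than what Adeft recognizes; `label_text` and the toy inputs are hypothetical and not part of the Adeft API.

```python
import re


def label_text(text, shortform, grounding_map):
    """Return (stripped_text, label), or None if no defining pattern is found.

    Hypothetical helper; a defining pattern is assumed here to be
    'longform (shortform)'.
    """
    for longform, label in grounding_map.items():
        pattern = re.compile(
            r'%s\s*\(\s*%s\s*\)' % (re.escape(longform), re.escape(shortform)),
            re.IGNORECASE,
        )
        if pattern.search(text):
            # Replace the whole defining pattern with only the shortform
            return pattern.sub(shortform, text), label
    return None


toy_map = {'insulin resistance': 'MESH:D007333',
           'ionizing radiation': 'MESH:D011839'}
toy_inputs = ['Insulin resistance (IR) is common in PCOS. IR worsens with age.',
              'IR was measured in all samples.']

toy_corpus = []
for identifier, toy_text in enumerate(toy_inputs):
    labeled = label_text(toy_text, 'IR', toy_map)
    if labeled is not None:  # texts without a defining pattern are dropped
        stripped, label = labeled
        toy_corpus.append((stripped, label, identifier))

print(toy_corpus)
```

The second toy text has no defining pattern and is dropped, so the surviving entry's identifier is the only reliable link back to the input list.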


We use ``zip`` to split this into three parallel sequences: the texts, the corresponding labels, and the identifiers. Of the 500 input texts, 339 contained defining patterns for ``IR``.

In [13]:
texts, labels, identifiers = zip(*corpus)

print(len(texts))

339


## Building Predictive Models

The ``AdeftClassifier`` class can then be used to train a logistic regression model to disambiguate texts that do not contain a defining pattern. The ``AdeftClassifier`` must be initialized with the shortform of interest and a list of the labels to consider as *positive* labels. A positive label is one that is considered relevant to the user's information extraction task. Precision, recall, and F1 scores are calculated using a weighted average of the positive labels. There is an optional argument, `random_state`, which allows users to specify a seed to use in internal random number generators to guarantee consistent results. The following code illustrates the initialization of an ``AdeftClassifier``:

In [14]:
%%capture
from adeft.modeling.classify import AdeftClassifier

classifier = AdeftClassifier('IR', ['MESH:D011839', 'HGNC:6091'], random_state=1)

When training models to disambiguate multiple synonymous shortforms (as in the case of ``NP``/``NPs``, above), the ``AdeftClassifier`` is passed a list of shortforms, as in the following example:

In [15]:
np_classifier = AdeftClassifier(['NP', 'NPs'], ['MESH:D053758', 'FPLX:Natriuretic_peptide'])

### Model Details

The AdeftClassifier uses a Logistic Regression model with [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vectorized [n-gram](https://en.wikipedia.org/wiki/N-gram) features. It is implemented with the [Scikit-Learn](https://scikit-learn.org/stable/) Python library. It has three parameters that can be tuned:

* **C : float**

$L_1$ regularization parameter. Following Scikit-learn's Logistic Regression implementation, $C$ is the reciprocal of the $L_1$ penalty $\lambda$, so lower values of $C$ correspond to greater regularization. $L_1$ regularization controls model complexity by adding a multiple, $\lambda$, of the sum of the absolute values of the coefficients to the Logistic Regression objective function. *$L_1$ regularization shrinks regression coefficients toward zero and sets some exactly to zero, with higher regularization causing the model to use fewer features.*

* **max_features : int**

Cutoff for the number of TF-IDF vectorized $n$-grams to use as features. Selects the top $n$-grams by frequency in the training set.

* **ngram_range : tuple of int**

Range of values $n$ for which the model takes $n$-grams as features. Must be a tuple of ints $(a, b)$ with $a \leq b$; $a$-grams through $b$-grams are used. When ngram_range is set to (1, 1), only unigrams are used.
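
As a quick illustration of what ``ngram_range`` selects, the sketch below enumerates word $n$-grams from a token list. It is not the vectorizer Adeft uses (that is handled by Scikit-learn); `word_ngrams` is a hypothetical helper written only to show which features a given range produces.

```python
def word_ngrams(tokens, ngram_range):
    """List all word n-grams with n between a and b inclusive, for ngram_range=(a, b)."""
    a, b = ngram_range
    if a > b:
        raise ValueError('ngram_range must satisfy a <= b')
    return [' '.join(tokens[i:i + n])
            for n in range(a, b + 1)
            for i in range(len(tokens) - n + 1)]


toy_tokens = ['insulin', 'receptor', 'signaling']
unigrams = word_ngrams(toy_tokens, (1, 1))
uni_and_bigrams = word_ngrams(toy_tokens, (1, 2))
print(unigrams)
print(uni_and_bigrams)
```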

### Training

The ``AdeftClassifier`` has a ``cv`` method which can be used to perform a cross-validated grid search to calculate classification metrics for a variety of parameter values. In this example, we have specified that only ``MESH:D011839`` (Ionizing Radiation) and ``HGNC:6091`` (Insulin Receptor) are to be considered as positive labels. This impacts how the classification metrics are calculated and hence how the training parameters are optimized. ``Adeft`` will report the cross-validated precision, recall, and F1 score. For multiclass classification, ``Adeft`` will take the average of these scores over all positive labels, weighted by the frequency of each label in the test data. The ``cv`` method takes a ``param_grid`` argument as used in Scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). ``param_grid`` must be provided as a dictionary mapping parameter names to lists of values. Cross-validation is performed for each combination of parameters from the lists.

The following illustrates training with cross-validation using two alternative values for ``max_features``, 100 and 1000:

In [16]:
%%capture
param_grid = {'C': [10.0], 'max_features': [100, 1000], 'ngram_range': [(1, 2)]}
classifier.cv(texts, labels, param_grid, cv=5)
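
To see which parameter combinations a grid like the one above expands to, the following sketch takes the Cartesian product of the value lists using only the standard library. The variable names are chosen for this example; the grid search itself is delegated to Scikit-learn, which performs an equivalent expansion internally.

```python
from itertools import product

example_grid = {'C': [10.0], 'max_features': [100, 1000], 'ngram_range': [(1, 2)]}

# One dictionary per combination of values, in a fixed key order
keys = sorted(example_grid)
combinations = [dict(zip(keys, values))
                for values in product(*(example_grid[key] for key in keys))]

for combination in combinations:
    print(combination)
```

With one value for ``C``, two for ``max_features``, and one for ``ngram_range``, the grid contains two combinations, each of which is cross-validated.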

The parameter ``cv`` can either be an ``int`` specifying the number of folds, or a cross-validation generator or iterable as taken for the ``cv`` argument of a [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object. A summary of model statistics for the best combination of parameters can be accessed as follows.

In [17]:
classifier.stats

{'label_distribution': {'MESH:D007333': 87,
  'MESH:D011839': 121,
  'HGNC:6091': 70,
  'MESH:D015427': 44,
  'MESH:D007259': 11,
  'ungrounded': 6},
 'f1': {'mean': 0.920783, 'std': 0.022582},
 'precision': {'mean': 0.903594, 'std': 0.041701},
 'recall': {'mean': 0.942375, 'std': 0.042142},
 'MESH:D007259': {'f1': {'mean': 0.346667, 'std': 0.299333},
  'pr': {'mean': 0.266667, 'std': 0.226078},
  'rc': {'mean': 0.5, 'std': 0.447214}},
 'MESH:D011839': {'f1': {'mean': 0.952425, 'std': 0.028773},
  'pr': {'mean': 0.975, 'std': 0.020412},
  'rc': {'mean': 0.933333, 'std': 0.059259}},
 'HGNC:6091': {'f1': {'mean': 0.866005, 'std': 0.068818},
  'pr': {'mean': 0.885714, 'std': 0.09689},
  'rc': {'mean': 0.852259, 'std': 0.069444}},
 'ungrounded': {'f1': {'mean': 0.133333, 'std': 0.266667},
  'pr': {'mean': 0.2, 'std': 0.4},
  'rc': {'mean': 0.1, 'std': 0.2}},
 'MESH:D007333': {'f1': {'mean': 0.862796, 'std': 0.026696},
  'pr': {'mean': 0.861438, 'std': 0.059853},
  'rc': {'mean': 0.872741, 
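
The frequency-weighted average over positive labels can be approximated by hand from the numbers in the output above. This is a rough sketch: it weights the per-label mean F1 scores by the overall label counts, whereas Adeft weights by label frequency in the test data of each fold, so the result should land near the reported ``f1`` mean without necessarily equaling it. The helper `weighted_average` is hypothetical.

```python
def weighted_average(scores, counts, pos_labels):
    """Average scores over pos_labels, weighting each label by its count."""
    total = sum(counts[label] for label in pos_labels)
    return sum(scores[label] * counts[label] for label in pos_labels) / total


# Values copied from the label distribution and per-label F1 means shown above
label_counts = {'MESH:D011839': 121, 'HGNC:6091': 70}
mean_f1 = {'MESH:D011839': 0.952425, 'HGNC:6091': 0.866005}

approx_f1 = weighted_average(mean_f1, label_counts, ['MESH:D011839', 'HGNC:6091'])
print(round(approx_f1, 4))
```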

It's also possible to access the underlying [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object to get more detailed information. See the Scikit-learn documentation for more information.

In [15]:
grid_search = classifier.grid_search
print(grid_search.cv_results_)

{'mean_fit_time': array([0.97101436, 1.07915239]), 'std_fit_time': array([0.04293427, 0.03734512]), 'mean_score_time': array([2.07902403, 2.12095847]), 'std_score_time': array([0.17432504, 0.18048402]), 'param_logit__C': masked_array(data=[10.0, 10.0],
             mask=[False, False],
       fill_value='?',
            dtype=object), 'param_tfidf__max_features': masked_array(data=[100, 1000],
             mask=[False, False],
       fill_value='?',
            dtype=object), 'param_tfidf__ngram_range': masked_array(data=[(1, 2), (1, 2)],
             mask=[False, False],
       fill_value='?',
            dtype=object), 'params': [{'logit__C': 10.0, 'tfidf__max_features': 100, 'tfidf__ngram_range': (1, 2)}, {'logit__C': 10.0, 'tfidf__max_features': 1000, 'tfidf__ngram_range': (1, 2)}], 'split0_test_f1_weighted': array([0.91355506, 0.89062449]), 'split1_test_f1_weighted': array([0.81527348, 0.86054446]), 'split2_test_f1_weighted': array([0.94736842, 0.96079484]), 'split3_test_f1_weight

### Feature Importances

The ``feature_importances`` method calculates feature importance scores for each label:

In [16]:
fi = classifier.feature_importances()

`classifier.feature_importances` returns a dictionary mapping class labels to lists of (feature, importance score) pairs.

The top ten features for the label insulin receptor (HGNC:6091) are shown below:

In [17]:
fi['HGNC:6091'][0:10]

[('insulin', 2.219766514782094),
 ('igf', 0.86189562095087),
 ('cells', 0.42302066571620284),
 ('tyrosine', 0.37950205050869584),
 ('igf1r', 0.3227184395053713),
 ('mice', 0.21965291247074434),
 ('signaling', 0.18588834332638676),
 ('1b', 0.1727115959409645),
 ('phosphorylation', 0.16725433486791888),
 ('stem', 0.1621870597561652)]

These scores can aid in interpreting how the classifier makes its predictions. A feature importance score is a standardized logistic regression coefficient: it is equal to the change in the linear predictor corresponding to a one standard deviation change in the associated feature value.
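
The definition above amounts to multiplying each coefficient by the standard deviation of its feature. The sketch below shows that arithmetic on invented numbers; it assumes the population standard deviation, and whether Adeft uses the population or sample estimate is not stated here. `standardized_coefficients` is a hypothetical helper.

```python
from statistics import pstdev


def standardized_coefficients(coefficients, feature_columns):
    """Scale each coefficient by the standard deviation of its feature values."""
    return {name: coefficients[name] * pstdev(feature_columns[name])
            for name in coefficients}


# Invented coefficients and per-document feature values for two features
toy_coefficients = {'insulin': 4.0, 'mice': 4.0}
toy_columns = {'insulin': [0.0, 0.5, 1.0, 0.5],
               'mice': [0.1, 0.1, 0.2, 0.2]}

importances = standardized_coefficients(toy_coefficients, toy_columns)
print(importances)
```

Both toy features have the same raw coefficient, but `insulin` varies more across documents, so a one standard deviation change in it moves the linear predictor further and it receives the higher importance.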

## Disambiguators

The logistic regression classifier we've produced in the steps above can be combined with the ``grounding_dict`` to build a disambiguator like the one shown in the [Introduction](introduction.ipynb) notebook. These are instantiated as ``AdeftDisambiguator`` objects. An ``AdeftDisambiguator`` first seeks to disambiguate text by searching for defining patterns. The logistic regression model is used only if a defining pattern for the shortform is not found.
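
The two-stage logic just described can be sketched as follows. This is an illustration only: `disambiguate_sketch` and `toy_classify` are hypothetical, the defining-pattern search is reduced to a single regular expression, and the return value is simplified relative to what an ``AdeftDisambiguator`` returns.

```python
import re


def disambiguate_sketch(text, shortform, grounding_map, classify):
    """Use a defining pattern if one is present; otherwise fall back to classify."""
    for longform, label in grounding_map.items():
        pattern = r'%s\s*\(\s*%s\s*\)' % (re.escape(longform), re.escape(shortform))
        if re.search(pattern, text, flags=re.IGNORECASE):
            return label, 'defining pattern'
    return classify(text), 'classifier'


def toy_classify(text):
    # Stand-in for a trained model: a single keyword rule, for illustration only
    return 'MESH:D011839' if 'DNA' in text else 'MESH:D007333'


toy_grounding_map = {'ionizing radiation': 'MESH:D011839',
                     'insulin resistance': 'MESH:D007333'}

with_pattern = disambiguate_sketch('Ionizing radiation (IR) damages cells.',
                                   'IR', toy_grounding_map, toy_classify)
without_pattern = disambiguate_sketch('IR causes oxidative damage to DNA.',
                                      'IR', toy_grounding_map, toy_classify)
print(with_pattern)
print(without_pattern)
```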

You may recall from the introduction that a disambiguator returns standardized names for each grounding label. These must be explicitly supplied, as in the example below:

In [18]:
names = {'MESH:D011839': 'Radiation, Ionizing',
         'MESH:D007333': 'Insulin Resistance',
         'HGNC:6091': 'INSR',
         'MESH:D015427': 'Reperfusion Injury',
         'MESH:D007259': 'Infrared Rays'}

An ``AdeftDisambiguator`` is instantiated with a classifier, grounding_dict, and dictionary of names as follows:

In [19]:
from adeft.disambiguate import AdeftDisambiguator

my_disambiguator = AdeftDisambiguator(classifier, grounding_dict, names)

We can use the ``info`` method to see statistics for our custom disambiguator just as for the pretrained disambiguators:

In [20]:
print(my_disambiguator.info())

Disambiguation model for IR

Produces the disambiguations:
	INSR*	HGNC:6091
	Infrared Rays	MESH:D007259
	Insulin Resistance	MESH:D007333
	Radiation, Ionizing*	MESH:D011839
	Reperfusion Injury	MESH:D015427

Class level metrics:
--------------------
Grounding          	Count	F1     
Radiation, Ionizing*	121	0.94534
 Insulin Resistance	 88	 0.9034
               INSR*	 70	0.87629
 Reperfusion Injury	 44	0.96725
      Infrared Rays	 11	0.83333
         Ungrounded	  6	    0.0

Weighted Metrics:
-----------------
	F1 score:	0.92009
	Precision:	0.88585
	Recall:		0.95816

* Positive labels
See Docstring for explanation



We can then disambiguate the examples from the Introduction notebook:

In [21]:
example1 = ('Ionizing radiation (IR) is radiation that carries enough energy to detach electrons'
            ' from atoms or molecules')
example2 = ('The detrimental effects of IR involve a highly orchestrated series of'
            ' events that are amplified by endogenous signaling and culminating in'
            ' oxidative damage to DNA, lipids, proteins, and many metabolites.')
with open('data/example.txt') as f:
    example3 = f.read()

The first example contains a defining pattern. The logistic regression classifier is used for the second and third examples and produces the correct groundings.

In [22]:
my_disambiguator.disambiguate(example1)

('MESH:D011839',
 'Radiation, Ionizing',
 {'ungrounded': 0.0,
  'MESH:D015427': 0.0,
  'HGNC:6091': 0.0,
  'MESH:D011839': 1.0,
  'MESH:D007333': 0.0,
  'MESH:D007259': 0.0})

In [23]:
my_disambiguator.disambiguate(example2)

('MESH:D011839',
 'Radiation, Ionizing',
 {'HGNC:6091': 0.01278152179976415,
  'MESH:D007259': 0.004672804991690736,
  'MESH:D007333': 0.007817122899458365,
  'MESH:D011839': 0.9592061778944365,
  'MESH:D015427': 0.011601329274007591,
  'ungrounded': 0.003921043140642753})

In [24]:
my_disambiguator.disambiguate(example3)

('HGNC:6091',
 'INSR',
 {'HGNC:6091': 0.9770414181197636,
  'MESH:D007259': 0.00027064795309301987,
  'MESH:D007333': 0.02213486460196367,
  'MESH:D011839': 0.00015631159768796982,
  'MESH:D015427': 0.00019090079505250734,
  'ungrounded': 0.00020585693243914863})

## Saving Disambiguators

Disambiguators can be serialized for use at a later time. A disambiguator has three components: a logistic regression model, a grounding dictionary, and a names dictionary. These will be saved to three separate files within a directory with the following structure.

* `<ModelName>`
    - `<ModelName>_grounding_dict.json`
    - `<ModelName>_names.json`
    - `<ModelName>_model.gz`

Models are saved to the user's filesystem using ``AdeftDisambiguator.dump``, which takes two arguments: a string identifying the model (e.g., ``IR``), and a path to a folder for storing models. Because the model identifier is used to create subfolders within the model directory, characters such as "/" should not be used. Also note that some file systems (e.g., the default on macOS) are case-insensitive, so model identifiers that differ only in case may collide.

In [25]:
my_disambiguator.dump('IR', path='data')

An ``IR`` subfolder is created for the model inside the ``data`` directory:

In [26]:
ls -lh 'data'

total 4.7M
drwxr-xr-x 5 albertsteppi  160 Jul  8  2019 IR/
-rw-r--r-- 1 albertsteppi 5.7K Jun 26  2019 example.txt
-rw-r--r-- 1 albertsteppi 4.7M Jun 26  2019 example_texts.json


The model folder contains three files:

In [27]:
ls -lh 'data/IR/'

total 32K
-rw-r--r-- 1 albertsteppi 248 Jan 29 17:05 IR_grounding_dict.json
-rw-r--r-- 1 albertsteppi 24K Jan 29 17:05 IR_model.gz
-rw-r--r-- 1 albertsteppi 169 Jan 29 17:05 IR_names.json


The file ``IR_model.gz`` contains the coefficients of the logistic regression model, the $n$-gram features along with their frequencies in the training documents, and other classifier metadata. These are stored within a JSON file which is then compressed with gzip. The grounding and names dictionaries are serialized directly to JSON. When downloaded, ``Adeft``'s pretrained models are stored in this format within a hidden folder in the user's home directory called ``.adeft``.
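
For readers who want to picture the on-disk layout, the sketch below writes three files with the same naming scheme into a temporary directory using only the standard library. It is not ``AdeftDisambiguator.dump``: the function `dump_sketch` and the placeholder model contents are assumptions made for illustration, and the real model file holds the classifier data described above.

```python
import gzip
import json
import os
import tempfile


def dump_sketch(model_name, path, grounding_dict, names, model_info):
    """Write <ModelName>_grounding_dict.json, _names.json and _model.gz under path."""
    folder = os.path.join(path, model_name)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, model_name + '_grounding_dict.json'), 'w') as f:
        json.dump(grounding_dict, f)
    with open(os.path.join(folder, model_name + '_names.json'), 'w') as f:
        json.dump(names, f)
    # JSON text compressed with gzip
    with gzip.open(os.path.join(folder, model_name + '_model.gz'), 'wt') as f:
        json.dump(model_info, f)
    return folder


with tempfile.TemporaryDirectory() as tmp:
    folder = dump_sketch('IR', tmp,
                         {'IR': {'insulin receptor': 'HGNC:6091'}},
                         {'HGNC:6091': 'INSR'},
                         {'placeholder': True})  # stand-in for real model data
    saved_files = sorted(os.listdir(folder))
    with gzip.open(os.path.join(folder, 'IR_model.gz'), 'rt') as f:
        reloaded = json.load(f)

print(saved_files)
```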

To load custom models, the ``load_disambiguator`` function used in the Introduction notebook can be passed an optional ``path`` argument for a user-specified model folder, as shown below:

In [28]:
from adeft.disambiguate import load_disambiguator

also_my_disambiguator = load_disambiguator('IR', path='data')

print(also_my_disambiguator.info())

Disambiguation model for IR

Produces the disambiguations:
	INSR*	HGNC:6091
	Infrared Rays	MESH:D007259
	Insulin Resistance	MESH:D007333
	Radiation, Ionizing*	MESH:D011839
	Reperfusion Injury	MESH:D015427

Class level metrics:
--------------------
Grounding          	Count	F1     
Radiation, Ionizing*	121	0.94534
 Insulin Resistance	 88	 0.9034
               INSR*	 70	0.87629
 Reperfusion Injury	 44	0.96725
      Infrared Rays	 11	0.83333
         Ungrounded	  6	    0.0

Weighted Metrics:
-----------------
	F1 score:	0.92009
	Precision:	0.88585
	Recall:		0.95816

* Positive labels
See Docstring for explanation



The serialized model we have loaded produces the same disambiguation results as the original model:

In [29]:
also_my_disambiguator.disambiguate(example3)

('HGNC:6091',
 'INSR',
 {'HGNC:6091': 0.9770414181197636,
  'MESH:D007259': 0.00027064795309301987,
  'MESH:D007333': 0.02213486460196367,
  'MESH:D011839': 0.00015631159768796982,
  'MESH:D015427': 0.00019090079505250734,
  'ungrounded': 0.00020585693243914863})

In [30]:
my_disambiguator.disambiguate(example3)

('HGNC:6091',
 'INSR',
 {'HGNC:6091': 0.9770414181197636,
  'MESH:D007259': 0.00027064795309301987,
  'MESH:D007333': 0.02213486460196367,
  'MESH:D011839': 0.00015631159768796982,
  'MESH:D015427': 0.00019090079505250734,
  'ungrounded': 0.00020585693243914863})

## Adeft Grounding Assistant

Because the task of linking groundings to longforms is not automated, Adeft provides a simple graphical user interface to assist with data entry for grounding.

The function ``adeft.gui.ground_with_gui`` opens a simple web application in the browser that allows users to enter groundings and standardized names for longforms and to choose which labels should be considered positive labels when evaluating classifiers. The web application requires that port 5000 be free on the user's machine.

In [5]:
from adeft.gui import ground_with_gui

longforms, counts, scores = zip(*ir_miner.get_longforms())

In [None]:
result = ground_with_gui(longforms, counts, identifiers_file='~/scratch/sample_groundings.csv')

In [7]:
print(result)

({'immediate release': 'ungrounded', 'immune reconstitution': 'ungrounded', 'immunoreactive': 'ungrounded', 'indirect revascularization': 'ungrounded', 'induced repair': 'ungrounded', 'induced resistance': 'ungrounded', 'inflammation observed postischemia reperfusion': 'ungrounded', 'inflammatory redox': 'ungrounded', 'information retrieval': 'ungrounded', 'infrared': 'ungrounded', 'infrared radiation': 'ungrounded', 'infrared spectroscopy': 'ungrounded', 'inhibitory rate': 'ungrounded', 'initial decay rate': 'ungrounded', 'initial rate': 'ungrounded', 'initial rate of resistance decrease': 'ungrounded', 'input resistance': 'ungrounded', 'insulin receptor': 'HGNC:99999', 'insulin release': 'ungrounded', 'insulin resistance': 'ungrounded', 'internal': 'ungrounded', 'intrinsic religiosity': 'ungrounded', 'intron retention': 'ungrounded', 'inwardly rectifying k + current': 'ungrounded', 'ionising radiation': 'MESH:D1', 'ionizing': 'MESH:D1', 'ionizing raddition': 'MESH:D1', 'ionizing radi

A screenshot of the app after all groundings have been entered is shown below. Users can enter names and groundings in the text boxes and then check off the boxes next to the longforms. Clicking submit will link the checked longforms to the entered name and grounding. Names and groundings can be deleted by pressing the X button to the right of the grounding column. The labels column on the far right displays the unique groundings. Clicking the ``+`` button toggles whether a label is considered positive. When the user presses the ``generate`` button, the ``ground_with_gui`` function returns a tuple containing a grounding map, a names dictionary, and a list of positive labels. The web application will then stop running.

<img src="figures/adeft_app.png">

Users may supply an initial grounding map. When building Adeft models, the user can supply initial groundings generated by an external grounding process and then use the GUI to manually review and correct these initial groundings. If a grounding map is supplied, an initial names map and list of positive labels can also be supplied.

In [34]:
result2 = ground_with_gui(longforms, scores, grounding_map=grounding_map, names=names)

In [35]:
print(result2)

({'immunoreactive': 'ungrounded', 'infrared': 'MESH:D007259', 'insulin receptor': 'HGNC:6091', 'insulin resistance': 'MESH:D007333', 'ionizing radiation': 'MESH:D011839', 'irradiation': 'MESH:D011839', 'ischemia reperfusion': 'MESH:D015427'}, {'MESH:D007259': 'Infrared Rays', 'HGNC:6091': 'INSR', 'MESH:D007333': 'Insulin Resistance', 'MESH:D011839': 'Radiation, Ionizing', 'MESH:D015427': 'Reperfusion Injury'}, ['HGNC:6091', 'MESH:D011839'])
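
The returned tuple can be unpacked and fed into the earlier steps of this notebook. The sketch below uses a literal copy of the output shown above in place of a live GUI session, so that it runs on its own; the variable names are chosen for this example.

```python
# Literal copy of the tuple printed above, standing in for a GUI result
result_example = (
    {'immunoreactive': 'ungrounded', 'infrared': 'MESH:D007259',
     'insulin receptor': 'HGNC:6091', 'insulin resistance': 'MESH:D007333',
     'ionizing radiation': 'MESH:D011839', 'irradiation': 'MESH:D011839',
     'ischemia reperfusion': 'MESH:D015427'},
    {'MESH:D007259': 'Infrared Rays', 'HGNC:6091': 'INSR',
     'MESH:D007333': 'Insulin Resistance', 'MESH:D011839': 'Radiation, Ionizing',
     'MESH:D015427': 'Reperfusion Injury'},
    ['HGNC:6091', 'MESH:D011839'],
)

gui_grounding_map, gui_names, gui_pos_labels = result_example

# The labeler expects a dictionary mapping each shortform to a grounding map
gui_grounding_dict = {'IR': gui_grounding_map}

# Sanity check: every positive label should have a standardized name
missing_names = [label for label in gui_pos_labels if label not in gui_names]
print(missing_names)
```

The three pieces correspond to the ``grounding_dict``, ``names``, and positive-label list used earlier when building the labeler, classifier, and disambiguator.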


## Modifying Groundings without Retraining

It is possible to modify groundings and standardized names without having to retrain the classifier. Suppose, for instance, that you prefer the UniProt grounding for Insulin Receptor to the HGNC grounding and the protein name *Insulin Receptor* to the HGNC symbol INSR. This can be accomplished with the ``AdeftDisambiguator.modify_groundings`` method. Users can pass in a dictionary mapping previous groundings to new groundings and a dictionary of new names, keyed in the example below by the previous grounding. This method does not allow two distinct groundings to be mapped to the same new grounding; the model should be retrained if this is desired.

In [33]:
my_disambiguator.modify_groundings(new_groundings={'HGNC:6091': 'UP:P06213'},
                                   new_names={'HGNC:6091': 'Insulin Receptor'})

We see below that the model info has successfully changed.

In [34]:
print(my_disambiguator.info())

Disambiguation model for IR

Produces the disambiguations:
	Infrared Rays	MESH:D007259
	Insulin Receptor*	UP:P06213
	Insulin Resistance	MESH:D007333
	Radiation, Ionizing*	MESH:D011839
	Reperfusion Injury	MESH:D015427

Class level metrics:
--------------------
Grounding          	Count	F1     
Radiation, Ionizing*	121	0.94534
 Insulin Resistance	 88	 0.9034
   Insulin Receptor*	 70	       
 Reperfusion Injury	 44	0.96725
      Infrared Rays	 11	0.83333
         Ungrounded	  6	    0.0

Weighted Metrics:
-----------------
	F1 score:	0.92009
	Precision:	0.88585
	Recall:		0.95816

* Positive labels
See Docstring for explanation



Disambiguations are now made with the updated grounding and name:

In [35]:
my_disambiguator.disambiguate(example3)

('UP:P06213',
 'Insulin Receptor',
 {'UP:P06213': 0.9770414181197636,
  'MESH:D007259': 0.00027064795309301987,
  'MESH:D007333': 0.02213486460196367,
  'MESH:D011839': 0.00015631159768796982,
  'MESH:D015427': 0.00019090079505250734,
  'ungrounded': 0.00020585693243914863})
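
The remapping described in this section, including the restriction against merging two groundings, can be pictured with plain dictionaries. This is a sketch of the idea rather than the body of ``modify_groundings``; `remap_groundings` is a hypothetical helper that follows the calling convention used in the example above, with both dictionaries keyed by the previous grounding.

```python
def remap_groundings(grounding_map, names, new_groundings, new_names):
    """Return updated copies of grounding_map and names without touching any model."""
    labels = set(grounding_map.values())
    updated = {label: new_groundings.get(label, label) for label in labels}
    # Merging two labels would change the classes the classifier was trained on
    if len(set(updated.values())) < len(updated):
        raise ValueError('two distinct groundings would be merged; retrain instead')
    new_map = {longform: updated[label] for longform, label in grounding_map.items()}
    new_name_map = {new_groundings.get(label, label): new_names.get(label, name)
                    for label, name in names.items()}
    return new_map, new_name_map


toy_grounding = {'insulin receptor': 'HGNC:6091',
                 'ionizing radiation': 'MESH:D011839',
                 'irradiation': 'MESH:D011839'}
toy_names = {'HGNC:6091': 'INSR', 'MESH:D011839': 'Radiation, Ionizing'}

remapped, renamed = remap_groundings(toy_grounding, toy_names,
                                     new_groundings={'HGNC:6091': 'UP:P06213'},
                                     new_names={'HGNC:6091': 'Insulin Receptor'})
print(remapped)
print(renamed)
```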

## Conclusion

If you've followed along with this notebook and have access to suitable text corpora, you're now ready to build your own disambiguation models. If you believe you've found a bug in Adeft please submit an issue at https://github.com/indralab/adeft/issues. If you'd like to contribute see https://github.com/indralab/adeft/CONTRIBUTING.md. 

If you make use of Adeft in your research please cite the following paper:

Steppi A, Gyori BM, Bachman JA (2020). Adeft: Acromine-based Disambiguation of Entities from Text with applications to the biomedical literature. Journal of Open Source Software, 5(45), 1708, https://doi.org/10.21105/joss.01708