# Building Adeft Models

In the [introduction](introduction.ipynb) notebook, we went over how to use Adeft's pretrained disambiguation models. This notebook is for users who would like to build their own models or who simply want to better understand the inner workings of Adeft. We will go through the steps of creating a model for the shortform IR.

## Mining longform expansions from text corpora

The first step in building a model is assembling a corpus of texts containing mentions of the desired shortform. Adeft does not provide tools for text acquisition; we assume users can supply their own texts. For the pretrained models, texts are extracted from the [INDRA Database](https://github.com/indralab/indra_db), which is not publicly available. We have built the [Adeft App](https://github.com/indralab/adeft_app) on top of Adeft to build models from content in the INDRA Database. The Adeft App is open source, and users are encouraged to look it over for inspiration or to fork and modify it for their own purposes. For this tutorial we will use a sample of 500 texts from the more than 10,000 texts used to build the pretrained model.

In [1]:
import json

with open('data/example_texts.json') as f:
    ir_texts = json.load(f)

Adeft uses an implementation of the [Acromine](http://www.chokkan.org/research/acromine/) algorithm developed by [NaCTeM](http://www.nactem.ac.uk/index.php) to identify longform expansions for a given shortform within a corpus of texts. This is done by searching for defining patterns (DPs) for the shortform within the texts. Statistical co-occurrence frequencies are used to identify the correct expansions corresponding to the defining patterns. For instance, the possible expansions for DPs in the second sentence of this paragraph are
* patterns
* defining patterns
* for defining patterns
* searching for defining patterns
* etc...

A machine cannot tell a priori what the correct expansion is from a single sentence, but by looking at many DPs within an appropriate corpus of texts it can tell that ***defining patterns*** occurs much more frequently than ***for defining patterns*** and that ***patterns*** rarely occurs without ***defining*** preceding it. See the [Appendix](#Appendix) of this notebook for more details on the Acromine algorithm and Adeft's implementation of it.
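
The counting idea described above can be sketched in a few lines of plain Python. This is only an illustration of candidate enumeration, not Adeft's implementation: the function `candidate_expansions`, the `max_words` window, and the toy sentences are all hypothetical, and no Acromine scoring is applied.

```python
import re
from collections import Counter


def candidate_expansions(text, shortform, max_words=4):
    """Yield candidate longforms found before each '(shortform)' in text.

    Hypothetical helper for illustration; not part of Adeft.
    """
    pattern = re.compile(r'\(\s*' + re.escape(shortform) + r'\s*\)')
    for match in pattern.finditer(text):
        # Lowercased words that precede the parenthesized shortform
        preceding = re.findall(r'[\w-]+', text[:match.start()].lower())
        window = preceding[-max_words:]
        # Every suffix of the window is a candidate expansion
        for start in range(len(window)):
            yield ' '.join(window[start:])


# Toy sentences (made up for this sketch)
toy_texts = [
    'This is done by searching for defining patterns (DPs) in the texts.',
    'We count defining patterns (DPs) across the corpus.',
    'Many defining patterns (DPs) share the same longform.',
]

counts = Counter()
for text in toy_texts:
    counts.update(candidate_expansions(text, 'DPs'))

for candidate, count in counts.most_common(3):
    print(candidate, count)
```

In this toy corpus, `defining patterns` is counted in all three sentences while `for defining patterns` is counted only once, and `patterns` never appears without `defining` directly before it. A scoring scheme can exploit exactly these differences.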

Longform expansions can be mined from texts with Adeft's DeftMiner objects. The following cell shows how to initialize a DeftMiner for a given shortform and process a list of texts.

In [74]:
from adeft.discover import DeftMiner

ir_miner = DeftMiner('IR')
ir_miner.process_texts(ir_texts)

A score will be produced for each possible longform expansion. The following shows how to inspect top scoring expansions.

In [75]:
ir_miner.top(10)

[('ionizing radiation', 132.3956834532374),
 ('insulin resistance', 125.3089430894309),
 ('ischemia reperfusion', 73.12328767123287),
 ('insulin receptor', 69.90697674418604),
 ('radiation', 48.84269662921349),
 ('to ionizing radiation', 36.46511627906977),
 ('the insulin receptor', 30.176470588235293),
 ('irradiation', 20.782608695652172),
 ('of insulin resistance', 17.263157894736846),
 ('reperfusion', 16.813953488372093)]

We see that the top-scoring candidates are not all correct expansions: the list includes fragments such as ***radiation*** and over-extended candidates such as ***to ionizing radiation***. The `get_longforms` method extracts the best potential longforms from the scored candidates. For more detail on how this works, see the [Appendix](#Appendix) of this notebook.

We see in the following cell that this method has done a good job of identifying correct longform expansions.

In [76]:
ir_miner.get_longforms(cutoff=5)

[('ionizing radiation', 132.3956834532374),
 ('insulin resistance', 125.3089430894309),
 ('ischemia reperfusion', 73.12328767123287),
 ('insulin receptor', 69.90697674418604),
 ('irradiation', 20.782608695652172),
 ('infrared', 15.777777777777779),
 ('immunoreactive', 6.0)]
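
To build intuition for why such filtering helps, here is a deliberately simplified heuristic applied to the ten top-scoring candidates shown earlier (scores rounded to two decimals). It is not the procedure Adeft uses, which is described in the Appendix; the function `drop_nested` is hypothetical. It walks the candidates in descending score order and drops any candidate that is a word-level suffix of, or has as a word-level suffix, a higher-scoring candidate that was already kept.

```python
def drop_nested(scored_candidates):
    """Drop candidates that nest, as word-level suffixes, with a better one.

    Simplified illustration only; not Adeft's get_longforms algorithm.
    """
    kept = []
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    for longform, score in ranked:
        words = longform.split()
        nested = False
        for kept_longform, _ in kept:
            kept_words = kept_longform.split()
            shorter, longer = sorted((words, kept_words), key=len)
            # True when the shorter word list ends the longer one
            if longer[len(longer) - len(shorter):] == shorter:
                nested = True
                break
        if not nested:
            kept.append((longform, score))
    return kept


# Top ten candidates from the output above, scores rounded
top_candidates = [
    ('ionizing radiation', 132.40),
    ('insulin resistance', 125.31),
    ('ischemia reperfusion', 73.12),
    ('insulin receptor', 69.91),
    ('radiation', 48.84),
    ('to ionizing radiation', 36.47),
    ('the insulin receptor', 30.18),
    ('irradiation', 20.78),
    ('of insulin resistance', 17.26),
    ('reperfusion', 16.81),
]

filtered = drop_nested(top_candidates)
for longform, score in filtered:
    print(longform, score)
```

On this input the heuristic keeps the same five longforms that head the `get_longforms` output. The remaining two, ***infrared*** and ***immunoreactive***, score below the top ten and so were never part of the input here.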

To build models, users must produce a dictionary mapping longforms to desired grounding labels. We call these grounding maps. For the pretrained models we use labels consisting of a [Name Space](introduction.ipynb#Name-Spaces) and corresponding ID separated by a colon. In the [Adeft App](https://github.com/indralab/adeft_app) we've implemented a simple GUI to assist in building these dictionaries.

In [12]:
grounding_map = {'ionizing radiation': 'MESH:D011839',
                 'insulin resistance': 'MESH:D007333',
                 'ischemia reperfusion': 'MESH:D015427',
                 'insulin receptor': 'HGNC:6091',
                 'irradiation': 'MESH:D011839',
                 'infrared': 'ungrounded',
                 'immunoreactive': 'ungrounded'}
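
Before moving on, it can be useful to inspect a grounding map. The sketch below uses only the standard library; `split_label` and `longforms_by_label` are hypothetical helper names, not Adeft functions. It splits each label into name space and ID and groups the longforms that share a label.

```python
from collections import defaultdict

grounding_map = {'ionizing radiation': 'MESH:D011839',
                 'insulin resistance': 'MESH:D007333',
                 'ischemia reperfusion': 'MESH:D015427',
                 'insulin receptor': 'HGNC:6091',
                 'irradiation': 'MESH:D011839',
                 'infrared': 'ungrounded',
                 'immunoreactive': 'ungrounded'}


def split_label(label):
    """Split 'NAMESPACE:ID' into (namespace, id); (None, label) if no colon."""
    namespace, separator, identifier = label.partition(':')
    if not separator:
        return None, label
    return namespace, identifier


# Group longforms by the label they map to
longforms_by_label = defaultdict(list)
for longform, label in grounding_map.items():
    longforms_by_label[label].append(longform)

for label, longforms in longforms_by_label.items():
    print(split_label(label), longforms)
```

Note that several longforms may share one label: here both ***ionizing radiation*** and ***irradiation*** map to `MESH:D011839`, so the model will treat them as a single class.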


Adeft is then able to automatically produce labels for some of the texts by searching for defining patterns matching one of these longform expansions. This is done with the DeftCorpusBuilder object. To initialize this object, we need a dictionary mapping shortforms to grounding maps as created above. Dictionaries containing grounding maps for multiple shortforms may be used to produce models for multiple synonymous shortforms.

In [77]:
grounding_dict = {'IR': grounding_map}

The DeftCorpusBuilder can then be initialized as follows.

In [78]:
from adeft.modeling.corpora import DeftCorpusBuilder

ir_cb = DeftCorpusBuilder(grounding_dict)

The following cell shows how to label some of the data using the DeftCorpusBuilder. 

In [21]:
corpus = ir_cb.build_from_texts(ir_texts)
texts, labels = zip(*corpus)

The output, `corpus`, is a list of tuples. The first element of each tuple is a text and the second is a label taken from the values of a grounding map.
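
The general idea of labeling by defining patterns can be sketched as follows. This is a rough approximation under stated assumptions, not the DeftCorpusBuilder implementation: `label_by_defining_pattern` is a hypothetical function, the texts are made up, matching is a plain regular expression, and texts with no match or with conflicting matches are simply skipped.

```python
import re

# Subset of the grounding map above, repeated so the sketch is self-contained
grounding_map = {'ionizing radiation': 'MESH:D011839',
                 'insulin resistance': 'MESH:D007333',
                 'irradiation': 'MESH:D011839'}


def label_by_defining_pattern(text, shortform, grounding_map):
    """Return the set of labels whose longform directly precedes '(shortform)'."""
    labels = set()
    for longform, label in grounding_map.items():
        pattern = (r'\b' + re.escape(longform)
                   + r'\s*\(\s*' + re.escape(shortform) + r'\s*\)')
        if re.search(pattern, text, flags=re.IGNORECASE):
            labels.add(label)
    return labels


# Made-up example texts
toy_texts = [
    'Cells were exposed to ionizing radiation (IR) for two hours.',
    'Obesity is associated with insulin resistance (IR) in adults.',
    'IR was measured after treatment.',  # no defining pattern
]

toy_corpus = []
for text in toy_texts:
    found = label_by_defining_pattern(text, 'IR', grounding_map)
    if len(found) == 1:  # keep only unambiguous texts
        toy_corpus.append((text, found.pop()))

toy_text_list, toy_labels = zip(*toy_corpus)
print(toy_labels)
```

The third text contains no defining pattern and therefore receives no label, which is why only some of the texts in a corpus can be labeled this way.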

In [59]:
%%capture
from adeft.modeling.classify import DeftClassifier

dc = DeftClassifier(['IR'], ['MESH:D011839', 'HGNC:6091'])
param_grid = {'C': [10.0], 'max_features': [1000]}

dc.cv(texts, labels, param_grid, cv=5)

In [60]:
dc.stats

{'label_distribution': {'MESH:D007333': 88,
  'MESH:D011839': 121,
  'HGNC:6091': 70,
  'MESH:D015427': 44,
  'ungrounded': 17},
 'f1': {'mean': 0.9130277433366194, 'std': 0.02217757288203372},
 'precision': {'mean': 0.8883437944263142, 'std': 0.023475627649764973},
 'recall': {'mean': 0.9426450742240217, 'std': 0.038024603397752606}}
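
The statistics above are reported as a mean and standard deviation, here over the five cross-validation folds requested with `cv=5`. As a rough sketch of what precision, recall, and F1 mean when only some labels are treated as positive, the following pools counts over the positive labels (micro-averaging). The averaging scheme is an assumption made for this illustration and may differ from the one Adeft uses; the function name and the toy predictions are hypothetical.

```python
def positive_label_metrics(y_true, y_pred, pos_labels):
    """Precision, recall, and F1 pooled over the given positive labels."""
    pos = set(pos_labels)
    # Correct predictions of a positive label
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p in pos and t == p)
    predicted_pos = sum(1 for p in y_pred if p in pos)
    actual_pos = sum(1 for t in y_true if t in pos)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up true labels and predictions for six texts
y_true = ['MESH:D011839', 'MESH:D011839', 'HGNC:6091',
          'MESH:D007333', 'MESH:D007333', 'ungrounded']
y_pred = ['MESH:D011839', 'HGNC:6091', 'HGNC:6091',
          'MESH:D011839', 'MESH:D007333', 'ungrounded']

precision, recall, f1 = positive_label_metrics(
    y_true, y_pred, ['MESH:D011839', 'HGNC:6091'])
print(precision, recall, f1)
```

In this toy example a correct prediction of `MESH:D007333` does not contribute to any of the metrics, because that label is not in the positive set.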

In [61]:
names = {'MESH:D011839': 'Radiation, Ionizing',
         'MESH:D007333': 'Insulin Resistance',
         'HGNC:6091': 'INSR',
         'MESH:D015427': 'Reperfusion Injury'}

In [62]:
from adeft.disambiguate import DeftDisambiguator

my_disambiguator = DeftDisambiguator(dc, grounding_dict, names)

In [63]:
print(my_disambiguator.info())

Disambiguation model for IR

Produces the disambiguations:
	Radiation, Ionizing*	MESH:D011839
	Insulin Resistance	MESH:D007333
	INSR*	HGNC:6091
	Reperfusion Injury	MESH:D015427

Training data had class balance:
	Radiation, Ionizing*	121
	Insulin Resistance	88
	INSR*	70
	Reperfusion Injury	44
	Ungrounded	17

Classification Metrics:
	F1 score:	0.91303
	Precision:	0.88834
	Recall:		0.94265

* Positive labels
See Docstring for explanation



In [64]:
disambs = my_disambiguator.disambiguate(ir_texts)

In [65]:
example1 = ('Ionizing radiation (IR) is radiation that carries enough energy to detach electrons'
            ' from atoms or molecules')

In [66]:
my_disambiguator.disambiguate(example1)

('MESH:D011839',
 'Radiation, Ionizing',
 {'MESH:D007333': 0.0,
  'HGNC:6091': 0.0,
  'MESH:D011839': 1.0,
  'MESH:D015427': 0.0,
  'ungrounded': 0.0})

In [55]:
example2 = 'IR is radiation that carries enough energy to detach electrons from atoms or molecules'

In [67]:
my_disambiguator.disambiguate(example2)

('MESH:D011839',
 'Radiation, Ionizing',
 {'HGNC:6091': 0.028012806325265496,
  'MESH:D007333': 0.06782596101579537,
  'MESH:D011839': 0.6747833318711274,
  'MESH:D015427': 0.05476828511747727,
  'ungrounded': 0.17460961567033445})
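
As the two outputs above show, the result for a single text is a tuple of the predicted grounding, its name, and a dictionary of predicted probabilities per label. One possible way to use the probabilities, sketched here with a hypothetical helper `confident_grounding` and an arbitrary threshold, is to accept a prediction only when the model is sufficiently confident.

```python
def confident_grounding(result, threshold=0.9):
    """Return (grounding, name, confidence), or (None, None, confidence)
    when the top probability falls below the threshold."""
    grounding, name, probabilities = result
    confidence = probabilities[grounding]
    if confidence < threshold:
        return None, None, confidence
    return grounding, name, confidence


# Results copied from the two outputs above
result1 = ('MESH:D011839',
           'Radiation, Ionizing',
           {'MESH:D007333': 0.0,
            'HGNC:6091': 0.0,
            'MESH:D011839': 1.0,
            'MESH:D015427': 0.0,
            'ungrounded': 0.0})
result2 = ('MESH:D011839',
           'Radiation, Ionizing',
           {'HGNC:6091': 0.028012806325265496,
            'MESH:D007333': 0.06782596101579537,
            'MESH:D011839': 0.6747833318711274,
            'MESH:D015427': 0.05476828511747727,
            'ungrounded': 0.17460961567033445})

print(confident_grounding(result1))
print(confident_grounding(result2))
```

The first example contains a defining pattern and is accepted with probability 1.0. The second has no defining pattern, so the prediction comes from the classifier alone, and its lower confidence falls below the chosen threshold.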

In [68]:
with open('data/example.txt') as f:
    example3 = f.read()

In [69]:
my_disambiguator.disambiguate(example3)

('HGNC:6091',
 'INSR',
 {'HGNC:6091': 0.9731105195106295,
  'MESH:D007333': 0.026097963632176976,
  'MESH:D011839': 0.0001837286894577103,
  'MESH:D015427': 0.00024853003937873883,
  'ungrounded': 0.00035925812835693017})

## Appendix