# Adeft

Adeft (Acromine based Disambiguation of Entities From Text) is a utility for building models to disambiguate acronyms and other abbreviations of biological terms mentioned in the scientific literature. It uses an implementation of the [Acromine](http://www.chokkan.org/research/acromine/) algorithm developed by [NaCTeM](http://www.nactem.ac.uk/index.php) at the University of Manchester to identify possible longform expansions for shortforms in text corpora. Adeft allows a user to build models to disambiguate shortforms in literature based on the context in which they appear. A growing number of pretrained disambiguation models for shortforms in the biomedical literature are available for download through Adeft.

## Installation

Adeft is available on PyPI and works with Python versions 3.5 and above. It can be installed with the command
```bash
$ pip install adeft
```

Pretrained disambiguation models can be downloaded with the command
```bash
$ python -m adeft.download
```

By default, models are stored in a folder named ``adeft`` within a platform-specific user data location determined by the [appdirs](https://pypi.org/project/appdirs/) Python package. Users may set the environment variable `ADEFT_HOME` in their shell profile to choose an alternative location.

## Loading pretrained models

After downloading existing models, a dictionary listing shortforms with available models can be inspected as follows:


```python
from adeft import available_shortforms

print(available_shortforms)
```

{'HA': 'HA', 'RT': 'RT:RT_S', 'RTs': 'RT:RT_S', 'APC': 'APC:APC_S', 'APCs': 'APC:APC_S', 'GH': 'GH', 'BRK': 'BRK', 'SG': 'SG:SG_S', 'SGs': 'SG:SG_S', 'CHK': 'CHK', 'SPD': 'SPD', 'PC': 'PC', 'AST': 'AST', 'EAG': 'EAG', 'SNS': 'SNS', 'HR': 'HR', 'NOP': 'NOP', 'GT': 'GT', 'TLR': 'TLR', 'CF': 'CF', 'PET': 'PET', 'HEP': 'HEP', 'ERM': 'ERM', 'FES': 'FES', 'PI': 'PI', 'TG': 'TG', 'ER': 'ER', 'HK1': 'HK1', 'MB': 'MB', 'NE': 'NE', 'SLK': 'SLK', 'GARP': 'GARP', 'EMT': 'EMT', 'AHR': 'AHR', 'TIF': 'TIF', 'HC': 'HC:HC_S', 'HCs': 'HC:HC_S', 'SD': 'SD:SD_S', 'SDs': 'SD:SD_S', 'FP': 'FP:FP_S', 'FPs': 'FP:FP_S', 'FMS': 'FMS', 'RD': 'RD', 'PAMP': 'PAMP', 'ARG': 'ARG', 'LAB': 'LAB', 'PM': 'PM:PM_S', 'PMs': 'PM:PM_S', 'TAK': 'TAK', 'ODC': 'ODC', 'GAS': 'GAS', 'TGH': 'TGH', 'BP': 'BP', 'PR': 'PR', 'HK2': 'HK2', 'PAF': 'PAF', 'COT': 'COT', 'SN': 'SN', 'LH': 'LH', 'BAL': 'BAL', 'PRK': 'PRK', 'AD': 'AD', 'NP': 'NP:NP_S', 'NPs': 'NP:NP_S', 'RSE': 'RSE', 'RPE': 'RPE', 'MS': 'MS', 'ARF': 'ARF', 'RB': 'RB:R_B', '

The dictionary maps shortforms to model names. It is possible and often desirable for synonymous shortforms to share a model: for example, NP and NPs (often standing for nanoparticles) use the same model.

```python
print('NP can be disambiguated with the model %s' % available_shortforms['NP'])
print('NPs can be disambiguated with the model %s' % available_shortforms['NPs'])
```

```
NP can be disambiguated with the model NP:NP_S
NPs can be disambiguated with the model NP:NP_S
```
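
Because several shortforms can share one model, it can be convenient to invert the dictionary and list the shortforms each model covers. The following is a minimal sketch that uses a small hand-copied excerpt of the dictionary above in place of the real `available_shortforms`; the helper name `group_by_model` is illustrative and not part of Adeft.

```python
from collections import defaultdict

# Small excerpt copied from the output above; with Adeft installed,
# the real `available_shortforms` dictionary could be passed instead.
shortform_to_model = {
    'HA': 'HA',
    'RT': 'RT:RT_S',
    'RTs': 'RT:RT_S',
    'NP': 'NP:NP_S',
    'NPs': 'NP:NP_S',
}


def group_by_model(mapping):
    """Invert a shortform -> model name mapping into model name -> sorted shortforms."""
    groups = defaultdict(list)
    for shortform, model_name in mapping.items():
        groups[model_name].append(shortform)
    return {model_name: sorted(shortforms) for model_name, shortforms in groups.items()}


models = group_by_model(shortform_to_model)
print(models['NP:NP_S'])
```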


A pretrained disambiguator can be loaded using the ``load_disambiguator`` function, which returns an instance of the ``AdeftDisambiguator`` class:

```python
from adeft.disambiguate import load_disambiguator

ir = load_disambiguator('IR')
```

``AdeftDisambiguator`` has a method, ``info``, that produces a summary of relevant information. Users can see the disambiguations a model can produce, the class balance of labels in the model's training data, and metrics describing the model's cross-validated performance on the training data. Depending on how the model was trained, classification metrics may or may not be available.

```python
print(ir.info())
```

```
Disambiguation model for IR

Produces the disambiguations:
	INSR*	HGNC:6091
	Ile-Arg*	CHEBI:CHEBI:74061
	Infrared Rays*	MESH:D007259
	Insulin Resistance*	MESH:D007333
	Interneurons*	MESH:D007395
	MDAMB468*	EFO:0001216
	REN*	HGNC:9958
	Radiation, Ionizing*	MESH:D011839
	Reperfusion Injury*	MESH:D015427
	Retina*	MESH:D012160
	Rhinitis*	MESH:D012220
	Wounds and Injuries*	MESH:D014947
	retinal ischemia*	DOID:DOID:12510
	root structure	EFO:0000989

Class level metrics:
--------------------
Grounding          	Count	F1     
Radiation, Ionizing*	3296	0.98324
 Insulin Resistance*	1894	0.95075
               INSR*	1512	0.92161
 Reperfusion Injury*	1193	0.94338
         Ungrounded	 784	0.85292
      Infrared Rays*	 304	0.87597
Wounds and Injuries*	  34	    0.0
            Ile-Arg*	   5	    0.2
           Rhinitis*	   4	    0.6
                REN*	   3	    0.2
             Retina*	   2	    0.0
     root structure	   1	    0.0
       Interneurons*	   1	    0.0
           MDAMB468*	   1	    0.0
```

Labels marked with a star are the *positive labels* used to calculate the classification metrics. When there are multiple positive labels, Adeft computes precision, recall, and F1 for each positive label and reports their average, weighted by the frequency of each label in the test data.
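
The weighted average described above can be written out as a short calculation. This is a sketch of the arithmetic only, not Adeft's internal code; the function name `weighted_average` and the two-label numbers are made up for illustration.

```python
def weighted_average(scores, counts):
    """Average per-label scores, weighting each label by its count in the test data."""
    total = sum(counts)
    if total == 0:
        raise ValueError('counts must sum to a positive number')
    return sum(score * count for score, count in zip(scores, counts)) / total


# Two hypothetical positive labels: F1 of 0.9 on 300 test examples
# and F1 of 0.5 on 100 test examples.
overall_f1 = weighted_average([0.9, 0.5], [300, 100])
print(round(overall_f1, 3))
```

With this weighting, frequent labels dominate the overall score, so a rare label with a low F1 has little effect on it.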

## Disambiguation

To disambiguate an instance of an entity shortform, Adeft first searches the provided text for *defining patterns* (DPs) that explicitly define the shortform. A defining pattern (DP) consists of a longform followed by its shortform in parentheses. For example, the preceding two sentences contain defining patterns for `DPs` and `DP`.
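
As a rough illustration of the idea, a defining pattern for a known shortform can be located with a regular expression. This is a simplified sketch, not Adeft's recognizer: the function `find_defining_pattern` is hypothetical, and it only returns the words preceding the parenthesized shortform as a candidate longform, leaving out the step of deciding where the longform actually begins.

```python
import re


def find_defining_pattern(text, shortform, max_words=5):
    """Return up to `max_words` words preceding '(shortform)', or None if absent."""
    # Match the shortform enclosed in parentheses, e.g. "(IR)".
    match = re.search(r'\(\s*%s\s*\)' % re.escape(shortform), text)
    if match is None:
        return None
    # Words before the parenthesis are candidates for the longform.
    preceding_words = text[:match.start()].split()
    return ' '.join(preceding_words[-max_words:])


sentence = 'Ionizing radiation (IR) is radiation that carries enough energy.'
print(find_defining_pattern(sentence, 'IR'))
print(find_defining_pattern('The effects of IR are varied.', 'IR'))
```

The second call finds no parenthesized shortform and returns `None`; this is the situation in which a statistical model is needed.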

In the example below, the given text contains a defining pattern for the entity shortform `IR`, making disambiguation straightforward:

```python
example1 = ('Ionizing radiation (IR) is radiation that carries enough energy to detach electrons'
            ' from atoms or molecules, thereby ionizing them.')

ir.disambiguate(example1)
```

```
('MESH:D011839',
 'Radiation, Ionizing',
 {'MESH:D007333': 0.0,
  'MESH:D015427': 0.0,
  'MESH:D012160': 0.0,
  'EFO:0001216': 0.0,
  'MESH:D012220': 0.0,
  'HGNC:6091': 0.0,
  'ungrounded': 0.0,
  'MESH:D014947': 0.0,
  'MESH:D007259': 0.0,
  'DOID:DOID:12510': 0.0,
  'MESH:D007395': 0.0,
  'HGNC:9958': 0.0,
  'CHEBI:CHEBI:74061': 0.0,
  'MESH:D011839': 1.0,
  'EFO:0000989': 0.0})
```

The `disambiguate` method returns a tuple containing three elements: 1) the normalized grounding for the entity, formatted as a namespace and ID separated by a colon, 2) a standard name for the grounding, and 3) a dictionary mapping possible groundings to confidence scores. Since a defining pattern exists in this instance, Adeft has 100% confidence.
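
The returned tuple can be unpacked directly. The sketch below works on a hand-copied, abbreviated version of the result above rather than calling Adeft, and the variable names are illustrative.

```python
# Abbreviated copy of the result shown above (most zero-confidence entries omitted).
result = (
    'MESH:D011839',
    'Radiation, Ionizing',
    {'MESH:D011839': 1.0, 'MESH:D007333': 0.0, 'CHEBI:CHEBI:74061': 0.0, 'ungrounded': 0.0},
)

grounding, standard_name, confidences = result

# Split on the first colon only, since some IDs contain a colon themselves
# (e.g. 'CHEBI:CHEBI:74061' has namespace 'CHEBI' and ID 'CHEBI:74061').
namespace, identifier = grounding.split(':', 1)

# In the examples in this document, the returned grounding is also the key
# with the highest confidence score.
best = max(confidences, key=confidences.get)

print(namespace, identifier, standard_name, best == grounding)
```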

### Namespaces

Currently available Adeft models ground shortforms to namespaces including the following:
* [HUGO Gene Nomenclature Committee](https://www.genenames.org/) (HGNC)
* [FamPlex](https://github.com/sorgerlab/famplex) (FPLX)
* [Gene Ontology](https://geneontology.org/) (GO)
* [Medical Subject Headings](https://id.nlm.nih.gov/mesh/) (MESH)
* [Chemical Entities of Biological Interest](https://www.ebi.ac.uk/chebi/) (CHEBI)
* [NCI Thesaurus](https://ncithesaurus.nci.nih.gov/ncitbrowser/) (NCIT)
* [UniProt](https://www.uniprot.org/) (UP)
* [InterPro](https://www.ebi.ac.uk/interpro/) (IP)

The 'ungrounded' class refers to the group of entities for which Adeft recognizes a defining pattern but for which the model has no specific grounding.

### Classification models

Adeft uses logistic regression models to disambiguate shortforms in texts where it is unable to find a defining pattern.

```python
example2 = ('The detrimental effects of IR involve a highly orchestrated series of'
            ' events that are amplified by endogenous signaling and culminating in'
            ' oxidative damage to DNA, lipids, proteins, and many metabolites.')

ir.disambiguate(example2)
```

```
('MESH:D011839',
 'Radiation, Ionizing',
 {'CHEBI:CHEBI:74061': 0.0029106450442963733,
  'DOID:DOID:12510': 0.002714648506140114,
  'EFO:0000989': 0.002790792571999619,
  'EFO:0001216': 0.002753434628070166,
  'HGNC:6091': 0.014643292414418663,
  'HGNC:9958': 0.002875775266350069,
  'MESH:D007259': 0.012579601284319343,
  'MESH:D007333': 0.006525654716170285,
  'MESH:D007395': 0.002674687838200803,
  'MESH:D011839': 0.8351663813237419,
  'MESH:D012160': 0.0028453280753981272,
  'MESH:D012220': 0.0028269596663538198,
  'MESH:D014947': 0.0020090024116930397,
  'MESH:D015427': 0.04777629432741509,
  'ungrounded': 0.058907501925432655})
```

Adeft returns the correct grounding for this example. Although this example uses a single sentence as context, the models are trained to disambiguate entities based on abstracts and full texts. In practice, we have found that concatenating all paragraphs of a full text that contain the shortform of interest gives the best performance. We now try to disambiguate based on an entire abstract:

```python
with open('data/example.txt') as f:
    example3 = f.read()

print(example3)
```

Rates of diabetes are reaching epidemic levels. The key problem in both type 1 and type 2 diabetes is dysfunctional insulin signaling, either due to lack of production or due to impaired insulin sensitivity. A key feature of diabetic retinopathy in animal models is degenerate capillary formation. The goal of this present study was to investigate a potential mechanism for retinal endothelial cell apoptosis in response to hyperglycemia. The hypothesis was that hyperglycemia-induced TNFα leads to retinal endothelial cell apoptosis through inhibition of insulin signaling. To test the hypothesis, primary human retinal endothelial cells were grown in normal glucose (5 mM) or high glucose (25 mM) and treated with exogenous TNFα, TNFα siRNA or suppressor of cytokine signaling 3 (SOCS3) siRNA. Cell lysates were processed for Western blotting and ELISA analyses to verify TNFα and SOCS3 knockdown, as well as key pro- and anti-apoptotic factors, IRS-1, and Akt. Data indicate that high glucose cult

In principle, this example could be challenging because the text contains references to both "insulin receptor" (grounding `HGNC:6091`) and "insulin resistance" (grounding `MESH:D007333`), but the shortform `IR` is used exclusively to refer to insulin receptor, as in these two sentences:

>Activation of SOCS3 can lead to insulin resistance in 2 separate ways: increased insulin receptor phosphorylation on tyrosine 960 [42] or through IRS-1 degradation by proteasomes [29]. In these experiments with knockdown of SOCS3, we found that SOCS3 likely inhibits insulin signaling in both ways in retinal endothelial cells, as we found decreased IR Tyr960 phosphorylation and increased total IRS-1 levels following SOCS3 siRNA application.
    
However, the text contains sufficient context for Adeft to yield the correct disambiguation with high confidence:

```python
ir.disambiguate(example3)
```

```
('HGNC:6091',
 'INSR',
 {'CHEBI:CHEBI:74061': 8.653121931414589e-06,
  'DOID:DOID:12510': 1.0470370933297434e-05,
  'EFO:0000989': 6.543495465754423e-06,
  'EFO:0001216': 6.727432817661731e-06,
  'HGNC:6091': 0.9970957622075577,
  'HGNC:9958': 6.628812950508269e-06,
  'MESH:D007259': 6.224876365680733e-06,
  'MESH:D007333': 0.0002892344560520544,
  'MESH:D007395': 6.361983101386227e-06,
  'MESH:D011839': 2.7108083646021005e-06,
  'MESH:D012160': 6.854179224808941e-06,
  'MESH:D012220': 6.8208962358360116e-06,
  'MESH:D014947': 3.936096295151697e-06,
  'MESH:D015427': 0.00042591480328742605,
  'ungrounded': 0.002117156459416963})
```
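
The strategy mentioned earlier, concatenating all paragraphs of a full text that contain the shortform, can be sketched as follows. This is one possible reading of that description, not code from Adeft: the helper `paragraphs_with_shortform` is hypothetical, paragraphs are assumed to be separated by blank lines, and the shortform is matched as a whole word.

```python
import re


def paragraphs_with_shortform(fulltext, shortform):
    """Join the paragraphs of `fulltext` that mention `shortform` as a whole word."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r'\n\s*\n', fulltext)
    # Lookarounds keep 'IR' from matching inside longer tokens such as 'IRS' or 'FIR'.
    pattern = re.compile(r'(?<!\w)%s(?!\w)' % re.escape(shortform))
    selected = [p.strip() for p in paragraphs if pattern.search(p)]
    return '\n'.join(selected)


fulltext = (
    'We found decreased IR phosphorylation.\n\n'
    'IRS-1 levels were measured by ELISA.\n\n'
    'Knockdown restored IR signaling.'
)
context = paragraphs_with_shortform(fulltext, 'IR')
print(context)
```

The resulting string can then be passed to `disambiguate` in place of the whole full text.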

### Batch Disambiguation

The `disambiguate` method also accepts a list of texts, in which case it returns a list of disambiguation results. Disambiguating a list in a single call is somewhat faster than disambiguating each text separately, though the difference only becomes noticeable for large batches; with the three texts used below it amounts to less than a millisecond.

```python
%timeit -n 100 ir.disambiguate([example1, example2, example3])
```

```
1.84 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```python
%timeit -n 100 [ir.disambiguate(text) for text in [example1, example2, example3]]
```

```
2.69 ms ± 74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
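
For a large collection, the texts can be passed to `disambiguate` in chunks so that each call receives a list. The helper below is a hypothetical sketch: it accepts any callable that maps a list of texts to a list of results, and a stand-in function is used in place of `ir.disambiguate` so the example runs without Adeft. The batch sizes are arbitrary.

```python
def disambiguate_in_batches(disambiguate, texts, batch_size=1000):
    """Apply a list-in, list-out `disambiguate` callable to `texts` in chunks."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(disambiguate(batch))
    return results


def fake_disambiguate(batch):
    """Stand-in for `ir.disambiguate`: returns one placeholder string per text."""
    return ['result for ' + text for text in batch]


texts = ['first text', 'second text', 'third text']
results = disambiguate_in_batches(fake_disambiguate, texts, batch_size=2)
print(len(results))
```

Results come back in the same order as the input texts, so they can be paired with `zip(texts, results)`.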


## Conclusion

We've covered how to use Adeft's pretrained disambiguation models. For information on how to build your own models, please see [Model Building](model_building.ipynb).