# Adeft

Adeft (Acromine-based Disambiguation of Entities From Text context) is a utility for building models to disambiguate acronyms and other abbreviations of biological terms mentioned in the scientific literature. It uses an implementation of the [Acromine](http://www.chokkan.org/research/acromine/) algorithm, developed by [NaCTeM](http://www.nactem.ac.uk/index.php) at the University of Manchester, to identify possible longform expansions for shortforms in text corpora. Adeft then allows users to build models that disambiguate shortforms in the literature based on the context in which they appear. A growing number of pretrained disambiguation models are available for download through Adeft.

## Installation

Adeft is available on PyPI and works with Python versions 3.5 and above. It can be installed with the command
```bash
$ pip install adeft
```

Pretrained disambiguation models can be downloaded with the command
```bash
$ python -m adeft.download
```

Models will be stored in the user's home directory in a hidden folder named *.adeft_models*.

To update existing models, the user can run
```bash
$ python -m adeft.download --update
```

## Loading pretrained models

A dictionary listing shortforms with available models can be imported as follows


In [19]:
from adeft import available_shortforms

print(available_shortforms)

{'PC': 'PC', 'EMT': 'EMT', 'SP': 'SP', 'PE': 'PE', 'ROS': 'ROS', 'NP': 'NP:NP_S', 'NPs': 'NP:NP_S', 'MS': 'MS', 'MT': 'MT', 'BP': 'BP', 'GH': 'GH', 'AD': 'AD', 'GT': 'GT', 'DA': 'DA', 'GR': 'GR', 'IR': 'IR', 'HK2': 'HK2', 'ARF': 'ARF', 'CS': 'CS', 'EC': 'EC', 'STD': 'STD', 'PD1': 'PD1', 'TGH': 'TGH', 'PKD': 'PKD', 'RA': 'RA', 'PCP': 'PCP', 'PI': 'PI', 'PS': 'PS', 'PA': 'PA', 'MB': 'MB', 'HA': 'HA', 'AR': 'AR', 'HR': 'HR', 'NE': 'NE', 'UBC': 'UBC', 'GSC': 'GSC', 'AA': 'AA', 'NIS': 'NIS', 'GC': 'GC', 'CM': 'CM', 'RB': 'RB:R_B', 'Rb': 'RB:R_B', 'LH': 'LH', 'ER': 'ER', 'TF': 'TF', 'PGP': 'PGP', 'MCT': 'MCT', 'TG': 'TG'}


The dictionary maps shortforms to model names. It is possible, and often desirable, for synonymous shortforms to share a model. For example, NP and NPs (both often standing for nanoparticles) share a model.

In [20]:
print('NP can be disambiguated with the model %s' % available_shortforms['NP'])
print('NPs can be disambiguated with the model %s' % available_shortforms['NPs'])

NP can be disambiguated with the model NP:NP_S
NPs can be disambiguated with the model NP:NP_S
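
Because several shortforms can point to the same model, it can be handy to invert the dictionary and see which shortforms each model covers. The sketch below works on a small hand-copied subset of the dictionary printed above, so it runs without Adeft installed. `group_by_model` is a hypothetical helper written for this illustration, not part of Adeft.

```python
from collections import defaultdict

# Hand-copied subset of the dictionary shown above (not the full list).
available_subset = {
    'IR': 'IR',
    'NP': 'NP:NP_S',
    'NPs': 'NP:NP_S',
    'RB': 'RB:R_B',
    'Rb': 'RB:R_B',
}


def group_by_model(shortform_to_model):
    """Map each model name to the sorted list of shortforms that use it."""
    groups = defaultdict(list)
    for shortform, model_name in shortform_to_model.items():
        groups[model_name].append(shortform)
    return {model: sorted(shortforms) for model, shortforms in groups.items()}


# Keep only the models that are shared by more than one shortform.
shared = {model: shortforms
          for model, shortforms in group_by_model(available_subset).items()
          if len(shortforms) > 1}
print(shared)
```

The same helper should work on the full `available_shortforms` dictionary, since it only relies on the shortform-to-model mapping.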


A pretrained disambiguator can be loaded as follows

In [21]:
from adeft.disambiguate import load_disambiguator

ir = load_disambiguator('IR')

Disambiguators have an `info` method that produces a summary of relevant information. Users can see the disambiguations a model can produce, the class balance of labels in the model's training data, and metrics describing the model's cross-validated performance on the training data. Depending on how the model was trained, classification metrics may or may not be available.

In [22]:
print(ir.info())

Disambiguation model for IR

Produces the disambiguations:
	Radiation, Ionizing*	MESH:D011839
	Insulin Resistance	MESH:D007333
	INSR*	HGNC:6091
	Reperfusion Injury	MESH:D015427

Training data had class balance:
	Radiation, Ionizing*	2703
	Insulin Resistance	1495
	INSR*	1460
	Reperfusion Injury	881
	Ungrounded	726

Classification Metrics:
	F1 score:	0.9741
	Precision:	0.97399
	Recall:		0.9743

* Positive labels
See Docstring for explanation



Labels marked with a star are the positive labels used when calculating the classification metrics. When there are multiple positive labels, Adeft computes each metric per positive label and takes the average, weighted by the frequency of each label in the test data.
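
To make the weighting concrete, here is a minimal sketch of a frequency-weighted average over positive labels. The per-label F1 scores and test-set counts are invented for illustration and are not Adeft's results, and `weighted_average` is a hypothetical helper rather than Adeft's own implementation.

```python
def weighted_average(scores, counts):
    """Average per-label scores, weighting each label by its test-set count."""
    total = sum(counts[label] for label in scores)
    if total == 0:
        raise ValueError('no test examples for the positive labels')
    return sum(scores[label] * counts[label] for label in scores) / total


# Hypothetical per-label F1 scores and test-set frequencies for two
# positive labels (values invented for illustration only).
f1_by_label = {'MESH:D011839': 0.98, 'HGNC:6091': 0.96}
test_counts = {'MESH:D011839': 300, 'HGNC:6091': 100}

print(round(weighted_average(f1_by_label, test_counts), 4))
```

With these numbers the more frequent label pulls the average toward its own score, which is the effect the weighting is meant to have.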

## Disambiguating

To produce a disambiguation, Adeft first searches a text for defining patterns (DPs). A defining pattern (DP) consists of a longform followed by its shortform in parentheses. DPs for DP and DPs appear in the previous two sentences.
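
As a rough illustration of what a defining pattern looks like to a program, the sketch below uses a regular expression to collect the words that immediately precede a parenthesized shortform. This is a deliberately simplified stand-in and not Adeft's actual recognizer: `find_defining_patterns` is a hypothetical function, the word window is arbitrary, and no attempt is made to decide where the longform really begins.

```python
import re


def find_defining_patterns(text, shortform):
    """Return candidate longform text preceding '(shortform)' occurrences.

    Simplified illustration: captures up to a fixed number of words before
    the parenthesized shortform without locating the true longform boundary.
    """
    max_words = len(shortform) + 2  # arbitrary window chosen for this sketch
    pattern = re.compile(
        r'\b((?:[\w-]+\s+){1,%d})\(%s\)' % (max_words, re.escape(shortform))
    )
    return [match.group(1).strip() for match in pattern.finditer(text)]


sentence = ('Ionizing radiation (IR) is radiation that carries enough energy '
            'to detach electrons from atoms or molecules')
print(find_defining_patterns(sentence, 'IR'))
```

A text that mentions the shortform without a parenthesized definition yields an empty list, which corresponds to the case where no defining pattern is available.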

In [23]:
example1 = ('Ionizing radiation (IR) is radiation that carries enough energy to detach electrons'
            ' from atoms or molecules')

In [24]:
ir.disambiguate(example1)

('MESH:D011839',
 'Radiation, Ionizing',
 {'ungrounded': 0.0,
  'MESH:D011839': 1.0,
  'HGNC:6091': 0.0,
  'MESH:D007333': 0.0,
  'MESH:D015427': 0.0})

The `disambiguate` method returns a tuple of three elements: a grounding consisting of a namespace and an ID separated by a colon, a standard name for the grounding, and a dictionary mapping possible groundings to confidence scores. Since a defining pattern exists in this instance, Adeft has 100% confidence.
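
The returned tuple can be unpacked directly. The sketch below copies the result from the example output above so that it runs without Adeft or its models installed, then ranks the candidate groundings by confidence. The variable names are only illustrative.

```python
# Result copied from the example output above, so this runs without Adeft.
result = (
    'MESH:D011839',
    'Radiation, Ionizing',
    {'ungrounded': 0.0, 'MESH:D011839': 1.0, 'HGNC:6091': 0.0,
     'MESH:D007333': 0.0, 'MESH:D015427': 0.0},
)

# Unpack the three elements of the result tuple.
grounding, standard_name, confidences = result

# Rank the candidate groundings from most to least confident.
ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
top_grounding, top_score = ranked[0]

print(grounding, standard_name)
print(top_grounding, top_score)
```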

### Name Spaces
Currently, pretrained Adeft models ground shortforms to the following namespaces:
* [HUGO Gene Nomenclature Committee](https://www.genenames.org/) (HGNC)
* [FamPlex](https://github.com/sorgerlab/famplex) (FPLX)
* [Gene Ontology](http://geneontology.org/) (GO)
* [Medical Subject Headings](https://meshb.nlm.nih.gov/search) (MESH)
* [Chemical Entities of Biological Interest](https://www.ebi.ac.uk/chebi/) (CHEBI)

The label 'ungrounded' refers to entities for which Adeft recognizes a defining pattern but has no grounding.
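
Since a grounding is a namespace and an ID joined by a colon, it can be split back into its parts when the two are needed separately. The sketch below is one way to do this. `split_grounding` is a hypothetical helper, not an Adeft function; it splits at the first colon only and treats the special label 'ungrounded' as having no namespace.

```python
def split_grounding(grounding):
    """Split 'NAMESPACE:ID' into a (namespace, identifier) pair.

    Returns (None, None) for the special label 'ungrounded'.
    """
    if grounding == 'ungrounded':
        return None, None
    # Split at the first colon only, in case the ID itself contains a colon.
    namespace, _, identifier = grounding.partition(':')
    return namespace, identifier


for label in ['HGNC:6091', 'MESH:D011839', 'ungrounded']:
    print(label, split_grounding(label))
```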

### Classification models

Adeft uses logistic regression models to disambiguate shortforms in texts where it is unable to find a defining pattern.

In [32]:
example2 = 'IR is radiation that carries enough energy to detach electrons from atoms or molecules'
    

In [33]:
ir.disambiguate(example2)

('ungrounded',
 None,
 {'HGNC:6091': 6.170832364054513e-06,
  'MESH:D007333': 8.372841531299755e-06,
  'MESH:D011839': 1.7851753682531277e-05,
  'MESH:D015427': 2.8970274945505494e-06,
  'ungrounded': 0.9999647075449276})

In this case Adeft returns 'ungrounded' rather than a confident grounding because the input text is shorter than ideal. Adeft's models are designed to work with global text context at the abstract or fulltext level. In practice, we've found the best performance when the input is the concatenation of all paragraphs in a fulltext that contain the shortform.
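
One way to assemble such an input is sketched below. It assumes that paragraphs are separated by blank lines and that the shortform should match as a whole word. `paragraphs_with_shortform` is a hypothetical helper, not part of Adeft, and the sample text is made up for illustration.

```python
import re


def paragraphs_with_shortform(fulltext, shortform):
    """Join the paragraphs of a text that mention the shortform.

    Paragraphs are assumed to be separated by blank lines, and the
    shortform must appear as a whole word.
    """
    pattern = re.compile(r'(?<!\w)%s(?!\w)' % re.escape(shortform))
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', fulltext)]
    return '\n'.join(p for p in paragraphs if pattern.search(p))


# Made-up three-paragraph text; only the first and last mention IR.
fulltext = (
    'Ionizing radiation (IR) damages DNA.\n\n'
    'Cells were cultured overnight.\n\n'
    'After IR exposure, repair pathways were activated.'
)
context = paragraphs_with_shortform(fulltext, 'IR')
print(context)
```

The resulting string could then be passed to the disambiguator's `disambiguate` method in place of a single sentence.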

We next disambiguate IR within a longer text.

In [34]:
with open('data/example.txt') as f:
    example3 = f.read()

In [35]:
print(example3)

Rates of diabetes are reaching epidemic levels. The key problem in both type 1 and type 2 diabetes is dysfunctional insulin signaling, either due to lack of production or due to impaired insulin sensitivity. A key feature of diabetic retinopathy in animal models is degenerate capillary formation. The goal of this present study was to investigate a potential mechanism for retinal endothelial cell apoptosis in response to hyperglycemia. The hypothesis was that hyperglycemia-induced TNFα leads to retinal endothelial cell apoptosis through inhibition of insulin signaling. To test the hypothesis, primary human retinal endothelial cells were grown in normal glucose (5 mM) or high glucose (25 mM) and treated with exogenous TNFα, TNFα siRNA or suppressor of cytokine signaling 3 (SOCS3) siRNA. Cell lysates were processed for Western blotting and ELISA analyses to verify TNFα and SOCS3 knockdown, as well as key pro- and anti-apoptotic factors, IRS-1, and Akt. Data indicate that high glucose cult

This example could be tricky since insulin receptor and insulin resistance both appear in the text in close proximity to IR, but enough context is provided for Adeft to produce the correct disambiguation.

In [36]:
ir.disambiguate(example3)

('HGNC:6091',
 'INSR',
 {'HGNC:6091': 0.9997547325118725,
  'MESH:D007333': 0.0002120018198221942,
  'MESH:D011839': 1.7078421061932351e-06,
  'MESH:D015427': 2.8792336629004848e-05,
  'ungrounded': 2.765489570281673e-06})

### Batch Disambiguation
The `disambiguate` method can also take a list of texts as input, in which case it returns a list of disambiguation results. Disambiguating a list of texts is optimized and runs slightly faster than disambiguating each text separately, though the difference is only noticeable for large batches.

In [37]:
%timeit -n 100 ir.disambiguate([example1, example2, example3])

1.5 ms ± 64.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [38]:
%timeit -n 100 [ir.disambiguate(text) for text in [example1, example2, example3]]

1.93 ms ± 55.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Conclusion

We've covered how to use Adeft's pretrained disambiguation models. For information on how to build your own models, please see [Model Building](model_building.ipynb).