# Adeft

Adeft (Acromine-based Disambiguation of Entities From Text context) is a utility for building models to disambiguate acronyms and other abbreviations of biological terms mentioned in the scientific literature. It uses an implementation of the [Acromine](http://www.chokkan.org/research/acromine/) algorithm, developed by [NaCTeM](http://www.nactem.ac.uk/index.php) at the University of Manchester, to identify possible longform expansions for shortforms in text corpora. It then allows models to be built that disambiguate shortforms in the literature based on the context in which they appear. A growing number of pretrained disambiguation models are available for download through Adeft.

## Installation

Adeft is available on PyPI and works with Python 3.5 and above. It can be installed with the command:
```bash
$ pip install adeft
```

Pretrained disambiguation models can be downloaded with the command
```bash
$ python -m adeft.download
```

Models are stored in the user's home directory in a hidden folder named *.adeft_models*.
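To check what has been downloaded, the folder can be inspected directly. The following is a minimal sketch that assumes only the location described above; the helper names `adeft_models_dir` and `list_downloaded_models` are hypothetical and are not part of Adeft.

```python
import os


def adeft_models_dir():
    """Return the expected path of the hidden models folder.

    Assumption: models live in ~/.adeft_models, as described above.
    """
    return os.path.join(os.path.expanduser('~'), '.adeft_models')


def list_downloaded_models(path=None):
    """List entries in the models folder, or return [] if it does not exist."""
    path = adeft_models_dir() if path is None else path
    if not os.path.isdir(path):
        return []
    return sorted(os.listdir(path))


print(adeft_models_dir())
print(list_downloaded_models())
```

An empty list simply means that nothing has been downloaded to that location yet.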

To update existing models, the user can run
```bash
$ python -m adeft.download --update
```

## Loading pretrained models

A dictionary listing the shortforms with available models can be imported as follows:


In [None]:
from adeft import available_shortforms

print(available_shortforms)

The dictionary maps shortforms to model names. It is possible, and often desirable, for synonymous shortforms to share a model. For example, NP and NPs (often standing for nanoparticles) share a model.

In [None]:
print('NP can be disambiguated with the model %s' % available_shortforms['NP'])
print('NPs can be disambiguated with the model %s' % available_shortforms['NPs'])
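Because several shortforms can point to the same model, it can be useful to invert the mapping and see which shortforms each model covers. The sketch below uses a small hypothetical dictionary in place of `available_shortforms`; the model names in it are illustrative assumptions, not values taken from Adeft.

```python
from collections import defaultdict


def group_by_model(shortform_to_model):
    """Invert a shortform -> model name mapping into model name -> shortforms."""
    groups = defaultdict(list)
    for shortform, model in shortform_to_model.items():
        groups[model].append(shortform)
    return {model: sorted(shortforms) for model, shortforms in groups.items()}


# Hypothetical stand-in for adeft.available_shortforms (model names are made up)
example_mapping = {'NP': 'NP', 'NPs': 'NP', 'IR': 'IR'}

print(group_by_model(example_mapping))
# {'NP': ['NP', 'NPs'], 'IR': ['IR']}
```

With Adeft installed, the same function should work when passed `available_shortforms` itself.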

A pretrained disambiguator can be loaded as follows

In [None]:
from adeft.disambiguate import load_disambiguator

ir = load_disambiguator('IR')

Disambiguators have a method, `info`, that produces a summary of relevant information. Users can see the disambiguations a model can produce, the class balance of labels in the model's training data, and metrics describing the model's cross-validated performance on the training data. Depending on how the model was trained, classification metrics may or may not be available.

In [None]:
print(ir.info())

Labels marked with a star are the positive labels used when calculating the classification metrics. When there are multiple positive labels, Adeft computes each metric separately for every positive label and then takes the average, weighting each label by its frequency in the test data.
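The frequency-weighted average can be illustrated with a small worked example. This is a sketch of the arithmetic only, not Adeft's own code; the label names, scores, and counts are made-up values chosen for illustration.

```python
def weighted_average(scores, counts):
    """Average per-label scores, weighting each label by its count."""
    total = sum(counts[label] for label in scores)
    if total == 0:
        raise ValueError('at least one positive label must occur in the test data')
    return sum(scores[label] * counts[label] for label in scores) / total


# Hypothetical per-label F1 scores and test-set frequencies for two positive labels
f1_by_label = {'label_a': 0.75, 'label_b': 0.5}
test_counts = {'label_a': 30, 'label_b': 10}

# (0.75 * 30 + 0.5 * 10) / (30 + 10) = 27.5 / 40
print(weighted_average(f1_by_label, test_counts))
# 0.6875
```

The more frequent label dominates the result, so the combined score sits closer to 0.75 than to 0.5.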

## Disambiguating

To produce a disambiguation, Adeft first searches the text for defining patterns (DPs). A defining pattern (DP) consists of a longform followed by its shortform contained in parentheses. DPs for DP and DPs appear in the previous two sentences.
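The idea of a defining pattern can be sketched with a regular expression that finds a parenthesized shortform and captures a window of words before it. This is a deliberately simplified illustration and not Adeft's implementation: the function name `find_defining_patterns` and the `max_words` parameter are hypothetical, and the captured window is only a candidate that may include extra leading words.

```python
import re


def find_defining_patterns(text, shortform, max_words=4):
    """Return the words (up to max_words) that precede each '(shortform)'.

    Simplified sketch: the window is a candidate longform only. It is not
    scored or trimmed the way a real longform detector would do.
    """
    pattern = re.compile(
        r'\b((?:[\w-]+\s+){1,%d})\(%s\)' % (max_words, re.escape(shortform))
    )
    return [match.group(1).strip() for match in pattern.finditer(text)]


text = 'Ionizing radiation (IR) is radiation that carries enough energy'
print(find_defining_patterns(text, 'IR'))
# ['Ionizing radiation']
```

A text that mentions the shortform without parentheses yields an empty list, which is the situation in which a classifier is needed instead.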

In [None]:
example1 = ('Ionizing radiation (IR) is radiation that carries enough energy to detach electrons'
            ' from atoms or molecules')

In [None]:
ir.disambiguate(example1)

The `disambiguate` method returns a tuple containing three elements: a grounding consisting of a namespace and an ID separated by a colon, a standard name for the grounding, and a dictionary mapping possible groundings to confidence scores. Since a defining pattern exists in this instance, Adeft has 100% confidence.
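The returned tuple can be unpacked and the grounding split into its namespace and ID. The sketch below works on a hypothetical result so that it runs without any downloaded models; the grounding `NS:12345`, the name, and the scores are placeholders, not real Adeft output.

```python
def split_grounding(grounding):
    """Split 'NAMESPACE:ID' at the first colon.

    Labels without a colon, such as 'ungrounded', yield an empty ID.
    """
    namespace, _, identifier = grounding.partition(':')
    return namespace, identifier


# Hypothetical result with the three-element shape described above
result = ('NS:12345', 'example name', {'NS:12345': 0.9, 'ungrounded': 0.1})

grounding, name, confidences = result
namespace, identifier = split_grounding(grounding)
most_likely = max(confidences, key=confidences.get)

print(namespace, identifier, name, most_likely)
# NS 12345 example name NS:12345
```

Splitting at the first colon only keeps IDs intact when they themselves contain a colon.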

### Name Spaces
Pretrained Adeft models currently ground shortforms to the following namespaces:
* [HUGO Gene Nomenclature Committee](https://www.genenames.org/) (HGNC)
* [FamPlex](https://github.com/sorgerlab/famplex) (FPLX)
* [Gene Ontology](http://geneontology.org/) (GO)
* [Medical Subject Headings](https://meshb.nlm.nih.gov/search) (MESH)
* [Chemical Entities of Biological Interest](https://www.ebi.ac.uk/chebi/) (CHEBI)

'ungrounded' refers to entities for which Adeft recognizes a defining pattern but has no grounding.

### Classification models

Adeft uses logistic regression models to disambiguate shortforms in texts where it is unable to find a defining pattern.

In [None]:
example2 = 'IR is radiation that carries enough energy to detach electrons from atoms or molecules'

In [None]:
ir.disambiguate(example2)

In this case Adeft has low confidence in its grounding because the input text is shorter than ideal. Adeft's models are designed to work with global text context at the abstract or full-text level. In practice, we've found the best performance when the input is the concatenation of all paragraphs of a full text that contain the shortform.
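One way to assemble such an input is to keep only the paragraphs that mention the shortform and join them. The following is a minimal sketch under the assumption that the full text is already split into a list of paragraph strings; `shortform_context` is a hypothetical helper, not an Adeft function.

```python
import re


def shortform_context(paragraphs, shortform):
    """Join the paragraphs that contain the shortform as a standalone token."""
    # Lookarounds keep 'IR' from matching inside longer words such as 'FIRST'
    pattern = re.compile(r'(?<!\w)%s(?!\w)' % re.escape(shortform))
    return '\n'.join(p for p in paragraphs if pattern.search(p))


paragraphs = [
    'IR is studied here.',
    'Nothing relevant, although FIRST contains the letters.',
    'Exposure to IR damages DNA.',
]
print(shortform_context(paragraphs, 'IR'))
```

The resulting string could then be passed to a disambiguator's `disambiguate` method.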

We next disambiguate IR within a longer text.

In [None]:
with open('data/example.txt') as f:
    example3 = f.read()

In [None]:
print(example3)

This example could be tricky since insulin receptor and insulin resistance both appear in the text in close proximity to IR, but enough context is provided for Adeft to produce the correct disambiguation.

In [None]:
ir.disambiguate(example3)

### Batch Disambiguation
The disambiguate method is also able to take lists of texts as input. In this case it will return a list of disambiguation results. Disambiguating a list of texts is optimized and will run slightly faster than disambiguating each text separately, though this will only be noticeable when disambiguating large batches of text.

In [None]:
%timeit -n 100 ir.disambiguate([example1, example2, example3])

In [None]:
%timeit -n 100 [ir.disambiguate(text) for text in [example1, example2, example3]]

## Conclusion

We've covered how to use Adeft's pretrained disambiguation models. For information on how to build your own models, please see [Model Building](model_building.ipynb).