# Adeft

Adeft (Acromine-based Disambiguation of Entities From Text context) is a utility for building models to disambiguate acronyms and other abbreviations of biological terms mentioned in the scientific literature. It uses an implementation of the [Acromine](http://www.chokkan.org/research/acromine/) algorithm, developed by [NaCTeM](http://www.nactem.ac.uk/index.php) at the University of Manchester, to identify possible longform expansions for shortforms in text corpora. Adeft then allows models to be built that disambiguate shortforms in the literature based on the context in which they appear. A growing number of pretrained disambiguation models are available for download through Adeft.

## Installation

Adeft is available on PyPI and works with Python 3.5 and above. It can be installed with the command:
```bash
$ pip install adeft
```

Pretrained disambiguation models can be downloaded with the command:
```bash
$ python -m adeft.download
```

Models are stored in the user's home directory in a hidden folder named *.adeft_models*.
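
To check what has been downloaded, the folder can be inspected directly. The snippet below is a minimal sketch that assumes the default location described above; it makes no assumption about how Adeft lays out files inside the folder and simply lists whatever entries it finds.

```python
from pathlib import Path

# Default location stated above: a hidden folder in the user's home directory.
models_dir = Path.home() / '.adeft_models'

if models_dir.is_dir():
    # List the names of the entries in the folder, whatever they are.
    print(sorted(entry.name for entry in models_dir.iterdir()))
else:
    print('No folder at %s; run "python -m adeft.download" first.' % models_dir)
```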

To update existing models, run:
```bash
$ python -m adeft.download --update
```

## Using pretrained models

A dictionary listing the shortforms that have available models can be imported as follows:


```python
from adeft import available_shortforms
print(available_shortforms)
```

```
{'PC': 'PC', 'EMT': 'EMT', 'SP': 'SP', 'PE': 'PE', 'ROS': 'ROS', 'NP': 'NP:NP_S', 'NPs': 'NP:NP_S', 'MS': 'MS', 'MT': 'MT', 'BP': 'BP', 'GH': 'GH', 'AD': 'AD', 'GT': 'GT', 'DA': 'DA', 'GR': 'GR', 'IR': 'IR', 'HK2': 'HK2', 'ARF': 'ARF', 'CS': 'CS', 'EC': 'EC', 'STD': 'STD', 'PD1': 'PD1', 'TGH': 'TGH', 'PKD': 'PKD', 'RA': 'RA', 'PCP': 'PCP', 'PI': 'PI', 'PS': 'PS', 'PA': 'PA', 'MB': 'MB', 'HA': 'HA', 'AR': 'AR', 'HR': 'HR', 'NE': 'NE', 'UBC': 'UBC', 'GSC': 'GSC', 'AA': 'AA', 'NIS': 'NIS', 'GC': 'GC', 'CM': 'CM', 'RB': 'RB:R_B', 'Rb': 'RB:R_B', 'LH': 'LH', 'ER': 'ER', 'TF': 'TF', 'PGP': 'PGP', 'MCT': 'MCT', 'TG': 'TG'}
```


The dictionary maps shortforms to model names. It is possible, and often desirable, for synonymous shortforms to share a model. For example, NP and NPs (which often stand for nanoparticles) share a model.

```python
print('NP can be disambiguated with the model %s' % available_shortforms['NP'])
print('NPs can be disambiguated with the model %s' % available_shortforms['NPs'])
```

```
NP can be disambiguated with the model NP:NP_S
NPs can be disambiguated with the model NP:NP_S
```
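
To see every group of shortforms that share a model, the mapping can be inverted. The sketch below uses a hand-copied excerpt of the dictionary so that it runs without Adeft installed; `group_by_model` is a hypothetical helper written for this example and is not part of Adeft.

```python
from collections import defaultdict

# Excerpt of the dictionary printed above. With Adeft installed,
# `from adeft import available_shortforms` supplies the full mapping.
available_shortforms = {
    'IR': 'IR',
    'NP': 'NP:NP_S',
    'NPs': 'NP:NP_S',
    'RB': 'RB:R_B',
    'Rb': 'RB:R_B',
}


def group_by_model(shortform_to_model):
    """Invert the mapping: model name -> sorted list of its shortforms."""
    groups = defaultdict(list)
    for shortform, model in shortform_to_model.items():
        groups[model].append(shortform)
    return {model: sorted(shortforms) for model, shortforms in groups.items()}


models = group_by_model(available_shortforms)

# Keep only the models that serve more than one shortform.
shared = {model: sfs for model, sfs in models.items() if len(sfs) > 1}
print(shared)
```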


A pretrained disambiguator can be loaded as follows:

```python
from adeft.disambiguate import load_disambiguator
ir = load_disambiguator('IR')
```

```python
print(ir.info())
```

```
Disambiguation model for IR

Produces the disambiguations:
	Radiation, Ionizing*	MESH:D011839
	Insulin Resistance	MESH:D007333
	INSR*	HGNC:6091
	Reperfusion Injury	MESH:D015427

Training data had class balance:
	Radiation, Ionizing*	2703
	Insulin Resistance	1495
	INSR*	1460
	Reperfusion Injury	881
	Ungrounded	726

Classification Metrics:
	F1 score:	0.9741
	Precision:	0.97399
	Recall:		0.9743

Weighted average of metrics over positive labels.
Weighted by number of datapoints for each positive label.
Positive labels denoted with * in above lists.
```
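
The reported metrics are averaged over the positive labels only, weighted by how many datapoints each one has. As a rough illustration, the sketch below computes each label's share of the training data and the weight each positive label would receive, assuming the weight is simply that label's count divided by the total count over the positive labels. The counts are copied from the output above; per-label scores are not reported there, so none are used.

```python
# Class balance copied from the ir.info() output above.
class_balance = {
    'Radiation, Ionizing': 2703,
    'Insulin Resistance': 1495,
    'INSR': 1460,
    'Reperfusion Injury': 881,
    'Ungrounded': 726,
}

# Labels marked with * in the output.
positive_labels = ['Radiation, Ionizing', 'INSR']

# Share of the whole training set held by each label.
total = sum(class_balance.values())
shares = {label: count / total for label, count in class_balance.items()}

# Assumed weighting: count of a positive label over the count of all positive labels.
positive_total = sum(class_balance[label] for label in positive_labels)
weights = {label: class_balance[label] / positive_total for label in positive_labels}

for label in positive_labels:
    print('{}: share of data {:.3f}, weight {:.3f}'.format(
        label, shares[label], weights[label]))
```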



```python
ir.names
```

```
{'MESH:D011839': 'Radiation, Ionizing',
 'MESH:D007333': 'Insulin Resistance',
 'HGNC:6091': 'INSR',
 'MESH:D015427': 'Reperfusion Injury'}
```
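
The keys shown above have the form `NAMESPACE:ID`, here with MeSH and HGNC as namespaces. The sketch below splits them apart and groups the names by namespace. It works on a hand-copied version of the dictionary so that it runs without a loaded model, and `split_grounding` is a hypothetical helper written for this example, not an Adeft function.

```python
# Copied from the ir.names output above.
names = {
    'MESH:D011839': 'Radiation, Ionizing',
    'MESH:D007333': 'Insulin Resistance',
    'HGNC:6091': 'INSR',
    'MESH:D015427': 'Reperfusion Injury',
}


def split_grounding(grounding):
    """Split 'NAMESPACE:ID' at the first colon into (namespace, identifier)."""
    namespace, _, identifier = grounding.partition(':')
    return namespace, identifier


# Group (identifier, name) pairs under their namespace.
by_namespace = {}
for grounding, name in names.items():
    namespace, identifier = split_grounding(grounding)
    by_namespace.setdefault(namespace, []).append((identifier, name))

for namespace, entries in sorted(by_namespace.items()):
    print(namespace, entries)
```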
