Deep Multilingual Normalization

This repository hosts the code for our article "Medical concept normalization in French using multilingual terminologies and contextual embeddings". It was recently reimplemented using edsnlp.

If this method is useful to you, please consider citing our article and/or giving this repository a star:

@article{wajsburt2021medical,
    title = {Medical concept normalization in French using multilingual terminologies and contextual embeddings},
    journal = {Journal of Biomedical Informatics},
    volume = {114},
    pages = {103684},
    year = {2021},
    issn = {1532-0464},
    doi = {10.1016/j.jbi.2021.103684},
    url = {https://www.sciencedirect.com/science/article/pii/S1532046421000137},
    author = {Perceval Wajsbürt and Arnaud Sarfati and Xavier Tannier},
    keywords = {Natural language processing, Information extraction, Medical concept normalization, Multilingual representation},
}

Install

We recommend using poetry to install the dependencies from the lock file.

# Clone the repo
git clone https://github.com/percevalw/mlg_norm.git
cd mlg_norm

# Install the dependencies with poetry (or use pip otherwise)
poetry install
# pip install -e .

Downloading the UMLS

You will need to download a release of the UMLS to run this method. For instance, to replicate our results on the Quaero corpus, you will need the 2014AB version. Here are the steps to load the UMLS:

Download and unzip the 2014ab-1-meta.nlm file (it is actually a zip archive with a different extension), listed under the 2014AB UMLS Full Release Files section at https://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html#2014AB_full
Enter the 2014AB/META folder and decompress the MRCONSO and MRSTY files:
```
gunzip MRCONSO.RRF.*.gz MRSTY.RRF.*.gz
```

Concatenate the multiple MRCONSO files:

cat MRCONSO.RRF.aa MRCONSO.RRF.ab > MRCONSO.RRF

Move MRCONSO.RRF, MRSTY.RRF and resources/sty_groups.tsv to the data/umls/2014AB folder.
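
Once the files are in place, a quick way to check that MRCONSO.RRF was concatenated correctly is to read it back and group the term strings by concept. The sketch below is only an illustration and is not taken from the repository: the function name `load_synonyms` is hypothetical, and the column positions (CUI first, LAT second, STR fifteenth) follow the standard pipe-delimited MRCONSO layout documented by the UMLS.

```python
from collections import defaultdict

# Column positions in the pipe-delimited MRCONSO.RRF layout (0-based).
CUI, LAT, STR = 0, 1, 14


def load_synonyms(lines, languages=("FRE", "ENG")):
    """Group term strings by concept identifier (CUI) for the given languages.

    `lines` is any iterable of MRCONSO rows, e.g. an open file handle.
    Hypothetical helper written for illustration only.
    """
    synonyms = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= STR:
            continue  # skip empty or malformed rows
        if fields[LAT] in languages:
            synonyms[fields[CUI]].add(fields[STR])
    return dict(synonyms)
```

For example, `load_synonyms(open("data/umls/2014AB/MRCONSO.RRF", encoding="utf-8"))` should return one entry per concept that has at least one French or English term.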

Downloading Quaero

Download Quaero in BRAT format, unzip it and move the QUAERO_FrenchMed/corpus folder to data/dataset.

wget https://quaerofrenchmed.limsi.fr/QUAERO_FrenchMed_brat.zip
unzip QUAERO_FrenchMed_brat.zip
mv QUAERO_FrenchMed/corpus data/dataset
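
To inspect the downloaded annotations, the BRAT standoff `.ann` files can be parsed with a few lines of Python. This is a minimal sketch, not the loader used by the repository: `parse_brat_ann` is a hypothetical name, and it assumes that the concept identifiers are stored in `AnnotatorNotes` lines attached to each entity, which should be verified against the corpus files.

```python
def parse_brat_ann(lines):
    """Parse BRAT standoff lines into a dict keyed by entity id (T1, T2, ...).

    Hypothetical helper for illustration; only text-bound entities and
    annotator notes are handled, other annotation types are ignored.
    """
    entities = {}
    notes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t")
        if line.startswith("T") and len(parts) >= 3:
            label, _, offsets = parts[1].partition(" ")
            # Discontinuous entities list several "start end" fragments separated by ";".
            spans = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
            entities[parts[0]] = {"label": label, "spans": spans, "text": parts[2], "note": None}
        elif line.startswith("#") and len(parts) >= 3:
            # Note lines look like "#1<TAB>AnnotatorNotes T1<TAB>note text".
            _, _, target = parts[1].partition(" ")
            notes[target] = parts[2]
    # Attach each note to the entity it refers to, if that entity exists.
    for target, note in notes.items():
        if target in entities:
            entities[target]["note"] = note
    return entities
```

Each entity then carries its label, character spans, surface text and, when present, the note attached to it.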

Train and evaluate a model

Our method is composed of two steps:

Pre-training, to learn multilingual representations and produce similar representations for synonyms of the same concept:
```
python scripts/train.py pretrain --config configs/config.cfg
```
Short classifier training. This probes the pre-trained embeddings and fine-tunes the concept weights.
```
python scripts/train.py train_classifier --config configs/config.cfg
```
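
To give an intuition for what the two steps aim at, the toy sketch below scores a mention vector against one vector per concept and picks the closest one. It is not the repository's implementation: the use of cosine similarity, the function names and the tiny hand-written vectors are all assumptions made for illustration, whereas the real model works with learned contextual embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_concept(mention_vec, concept_vecs):
    """Return the id of the concept whose vector is most similar to the mention.

    `concept_vecs` maps a concept id to its vector. Hypothetical helper.
    """
    return max(concept_vecs, key=lambda cui: cosine(mention_vec, concept_vecs[cui]))


# Made-up 3-dimensional vectors; the ids are placeholders, not real concept identifiers.
concepts = {"CONCEPT_A": [1.0, 0.0, 0.0], "CONCEPT_B": [0.0, 1.0, 0.0]}
mention = [0.9, 0.1, 0.0]
best = nearest_concept(mention, concepts)
```

If pre-training works as intended, synonyms of one concept in different languages end up with nearby vectors, so a lookup of this kind selects the same concept for each of them.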

Finally, you can evaluate the model:

python scripts/evaluate.py evaluate --config configs/config.cfg

Adjust configs/config.cfg to fit your needs.
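
If you want to run pre-training, classifier training and evaluation in one go, the three commands can be chained from a small driver script. This is a convenience sketch only: `build_commands` and `run_all` are hypothetical names that do not exist in the repository, and the script simply replays the commands listed in this section with the config path you pass in.

```python
import subprocess
import sys

# (script, sub-command) pairs, in the order given in this README.
STEPS = [
    ("scripts/train.py", "pretrain"),
    ("scripts/train.py", "train_classifier"),
    ("scripts/evaluate.py", "evaluate"),
]


def build_commands(config="configs/config.cfg", python=sys.executable):
    """Return the command lines for the three steps as argument lists."""
    return [[python, script, command, "--config", config] for script, command in STEPS]


def run_all(config="configs/config.cfg"):
    """Run every step in order and stop at the first one that fails."""
    for cmd in build_commands(config):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```

Calling `run_all()` from the repository root uses the default config. Pass another path to try a modified copy without touching the original file.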
