In [13]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '5'

# Preparation

First, download the medical ontologies needed to build the term dictionaries for BRONCO:
 - Treatments: [OPS](https://www.bfarm.de/DE/Kodiersysteme/Services/Downloads/_node.html) (scroll to "OPS")
 - Medications: [ATC](https://www.wido.de/publikationen-produkte/arzneimittel-klassifikation/)
 - Diagnoses: [ICD10GM](https://www.bfarm.de/DE/Kodiersysteme/Services/Downloads/_node.html)
 
In the config file for BRONCO, `../conf/bronco.yaml`, modify the paths so that they point to the extracted ontologies.
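
Before building the dictionaries, it can help to confirm that the edited paths actually exist. The following is a minimal sketch, not part of xMEN: it assumes the config has already been loaded into a plain nested `dict` and that path entries sit under keys containing the word `path`. The helper name `find_missing_paths` and the keys in `example_config` are hypothetical; the real layout of `conf/bronco.yaml` may differ.

```python
from pathlib import Path


def find_missing_paths(config, base_dir="."):
    """Return (key_trail, path) pairs for path-like entries that do not exist on disk."""
    missing = []

    def walk(node, trail):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, trail + [str(key)])
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, trail + [str(i)])
        elif isinstance(node, str) and any("path" in part.lower() for part in trail):
            # Relative paths are resolved against base_dir; absolute paths are used as-is.
            if not (Path(base_dir) / node).exists():
                missing.append((".".join(trail), node))

    walk(config, [])
    return missing


# Hypothetical config fragment, for illustration only.
example_config = {
    "dict": {
        "ops": {"custom": {"ops_path": "local_files/ops2017"}},
    }
}

for key, path in find_missing_paths(example_config):
    print(f"{key}: {path} not found")
```

An empty result means every path-like entry resolved to an existing file or folder.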

With the ontologies in place, use xMEN to build the term dictionaries. In your terminal, navigate to the xMEN root folder and run:
 - `xmen dict conf/bronco.yaml --code dicts/atc2017_de.py --output temp/ --key atc`
 - `xmen dict conf/bronco.yaml --code dicts/ops2017.py --output temp/ --key ops`
 - `xmen dict conf/bronco.yaml --code dicts/icd10gm2017.py --output temp/ --key icd10gm`
 
Now use these dictionaries to build the indexes. This example uses only SapBERT indexes and skips the N-Gram index:
 - `xmen index conf/bronco.yaml --dict temp/atc.jsonl --output temp/atc_index --sapbert`
 - `xmen index conf/bronco.yaml --dict temp/ops.jsonl --output temp/ops_index --sapbert`
 - `xmen index conf/bronco.yaml --dict temp/icd10gm.jsonl --output temp/icd10gm_index --sapbert`
 
Finally, load the adapted config file and the BRONCO150 dataset using BigBIO:

In [6]:
from xmen.confhelper import load_config
config = load_config("../conf/bronco.yaml")

import datasets
path_to_data = r"../temp/BRONCO150"  # path to your local copy of BRONCO150
ds = datasets.load_dataset(
    path="bigbio/bronco",
    name="bronco_bigbio_kb",
    data_dir=path_to_data,
)

ds

Using custom data configuration bronco_bigbio_kb-data_dir=..%2Ftemp%2FBRONCO150


Downloading and preparing dataset bronco/bronco_bigbio_kb to /home/Ignacio.Rodriguez/.cache/huggingface/datasets/bigbio___bronco/bronco_bigbio_kb-data_dir=..%2Ftemp%2FBRONCO150/1.0.0/cab8fc4a62807688cb5b36df7a24eb7f364314862c4196f6ff2db3813f2fe68b...


Dataset bronco downloaded and prepared to /home/Ignacio.Rodriguez/.cache/huggingface/datasets/bigbio___bronco/bronco_bigbio_kb-data_dir=..%2Ftemp%2FBRONCO150/1.0.0/cab8fc4a62807688cb5b36df7a24eb7f364314862c4196f6ff2db3813f2fe68b. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['id', 'document_id', 'passages', 'entities', 'events', 'coreferences', 'relations'],
        num_rows: 5
    })
})
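
To get a feel for what the loaded documents contain, one can count entity types and the knowledge bases they are normalized to. The sketch below relies on the BigBIO KB schema, in which each document carries an `entities` list and each entity has a `type` and a `normalized` list of `db_name`/`db_id` pairs. The helper `summarize_entities` is hypothetical, and the toy document uses invented placeholder values rather than real BRONCO data.

```python
from collections import Counter


def summarize_entities(documents):
    """Count entity types and target knowledge bases across BigBIO KB documents."""
    type_counts = Counter()
    kb_counts = Counter()
    for doc in documents:
        for entity in doc["entities"]:
            type_counts[entity["type"]] += 1
            for norm in entity["normalized"]:
                kb_counts[norm["db_name"]] += 1
    return type_counts, kb_counts


# Toy document in BigBIO KB shape; all values are invented for illustration.
toy_docs = [
    {
        "id": "0",
        "document_id": "toy",
        "entities": [
            {"id": "e1", "type": "DIAGNOSIS", "text": ["example diagnosis"],
             "offsets": [[0, 17]],
             "normalized": [{"db_name": "ICD10GM", "db_id": "X00"}]},
            {"id": "e2", "type": "MEDICATION", "text": ["example drug"],
             "offsets": [[22, 34]],
             "normalized": [{"db_name": "ATC", "db_id": "Y00"}]},
            {"id": "e3", "type": "DIAGNOSIS", "text": ["another diagnosis"],
             "offsets": [[40, 57]],
             "normalized": []},
        ],
    }
]

type_counts, kb_counts = summarize_entities(toy_docs)
print(type_counts)
print(kb_counts)
```

The same function should work on `ds["train"]`, assuming iterating the split yields one dictionary per document. Entities with an empty `normalized` list have no gold concept and cannot be counted as hits during evaluation.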

# Candidate Generation
We will use `SapBERTLinker`, which retrieves candidates by comparing dense embeddings produced by a Transformer model.

We could have also used `TFIDFNGramLinker` (see `xmen/notebooks/BioASQ_DisTEMIST.ipynb` for an example).
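
The idea behind dense candidate retrieval can be illustrated with a brute-force toy example: embed the mention and every concept name as vectors, then rank concepts by similarity. This is not xMEN's implementation, which queries the prebuilt index instead of scanning all concepts; the vectors, concept IDs, the choice of cosine similarity, and the function names below are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(mention_vec, concept_vecs, k=2):
    """Return the k concepts most similar to the mention, best first."""
    scored = [(cid, cosine(mention_vec, vec)) for cid, vec in concept_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real SapBERT vectors have hundreds of dimensions.
concept_vecs = {
    "C1": [1.0, 0.0, 0.0],
    "C2": [0.8, 0.6, 0.0],
    "C3": [0.0, 0.0, 1.0],
}
mention_vec = [0.9, 0.1, 0.0]

for concept_id, score in top_k_candidates(mention_vec, concept_vecs, k=2):
    print(concept_id, round(score, 3))
```

A larger `k` trades precision for recall: more candidates per mention make it likelier that the correct concept is among them, at the cost of more work for any later re-ranking step.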

In [7]:
from notebook_util import analyze
from xmen.linkers import TFIDFNGramLinker, SapBERTLinker, EnsembleLinker
from xmen.linkers.util import filter_and_apply_threshold
from datasets import DatasetDict

Your CPU supports instructions that this binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib


In [18]:
from pathlib import Path

embedding_model_name = 'cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR'
# SapBERT index built during preparation (here: the OPS index)
index_path = Path("../temp/ops_index/index/sapbert")

# Clear singleton to free up memory
SapBERTLinker.clear()
sapbert_linker = SapBERTLinker(
    embedding_model_name = embedding_model_name,
    index_base_path = index_path,
    k = 1000
)

pred_sapbert = sapbert_linker.predict_batch(ds, batch_size=128)

OutOfMemoryError: CUDA out of memory. Tried to allocate 734.00 MiB (GPU 0; 44.56 GiB total capacity; 0 bytes already allocated; 337.31 MiB free; 0 bytes reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [23]:
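from xmen.evaluation import evaluate_at_k

# Recall for different numbers of candidates (k).
# Note: the dataset loaded above exposes only a `train` split; adjust the split name to match your data.
_ = evaluate_at_k(ds['validation'], pred_sapbert['validation'])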

Perf@1 0.3634558093346574
Perf@2 0.5243296921549155
Perf@4 0.6593843098311817
Perf@8 0.7437934458788481
Perf@16 0.7864945382323734
Perf@32 0.8083416087388282
Perf@64 0.8262164846077458
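
To make the numbers above easier to read, here is a minimal sketch of recall at k on toy data: the fraction of mentions whose gold concept appears among the top k ranked candidates. It is an illustration under simplifying assumptions (one gold concept per mention, a hypothetical helper `recall_at_k`), not the code behind `evaluate_at_k`, whose exact handling of edge cases may differ.

```python
def recall_at_k(gold_ids, ranked_candidates, ks=(1, 2, 4)):
    """Fraction of mentions whose gold concept is within the top-k candidates."""
    total = len(gold_ids)
    results = {}
    for k in ks:
        hits = sum(
            1
            for gold, candidates in zip(gold_ids, ranked_candidates)
            if gold in candidates[:k]
        )
        results[k] = hits / total if total else 0.0
    return results


# Four toy mentions with their gold concept and ranked candidate lists.
gold_ids = ["A", "B", "C", "D"]
ranked = [
    ["A", "X"],            # gold at rank 1
    ["X", "B"],            # gold at rank 2
    ["X", "Y", "Z", "C"],  # gold at rank 4
    ["X", "Y"],            # gold never retrieved
]

for k, value in recall_at_k(gold_ids, ranked).items():
    print(f"Recall@{k} {value}")
```

Recall at k can only grow or stay flat as k increases, which matches the monotonic trend in the output above. The value at the largest k is an upper bound on what any re-ranker working on these candidates can achieve.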
