# Preparation

Make sure to download the corresponding medical ontologies to build the term dictionaries. For each, look for the 2017 version.
 - Treatments: [OPS](https://www.bfarm.de/DE/Kodiersysteme/Services/Downloads/_node.html).
 - Medications: [ATC](https://www.wido.de/publikationen-produkte/arzneimittel-klassifikation/).
 - Diagnoses: [ICD10GM](https://www.bfarm.de/DE/Kodiersysteme/Services/Downloads/_node.html).
 
In the BRONCO config file `../conf/bronco.yaml`, modify the paths so they point to the extracted folders. We assume they are located in `xmen/temp`. Otherwise, change the path here and adjust the terminal commands below accordingly:

In [1]:
base_path = "../temp"

We can already use xMEN to prepare the term dictionaries. <u>This step only has to be performed the first time. </u> 

In your terminal, navigate to the xMEN root folder and run:
 - `xmen dict conf/bronco.yaml --code dicts/atc2017_de.py --output temp/ --key atc`
 - `xmen dict conf/bronco.yaml --code dicts/ops2017.py --output temp/ --key ops`
 - `xmen dict conf/bronco.yaml --code dicts/icd10gm2017.py --output temp/ --key icd10gm`
 
Now use these dictionaries to build the indexes. For this example, we use only the SapBERT indexes and skip the N-gram index:
 - `xmen index conf/bronco.yaml --dict temp/atc.jsonl --output temp/atc_index --sapbert`
 - `xmen index conf/bronco.yaml --dict temp/ops.jsonl --output temp/ops_index --sapbert`
 - `xmen index conf/bronco.yaml --dict temp/icd10gm.jsonl --output temp/icd10gm_index --sapbert`
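
The six commands above follow one pattern per ontology, so they can also be generated from a small table. The sketch below only builds and prints the command strings listed above; the helper name `build_commands` is made up for illustration, and actually running the commands (for example with `subprocess.run`) is left out.

```python
# Dictionary key -> xMEN dictionary script, as used in the commands above.
DICT_SCRIPTS = {
    "atc": "dicts/atc2017_de.py",
    "ops": "dicts/ops2017.py",
    "icd10gm": "dicts/icd10gm2017.py",
}


def build_commands(config="conf/bronco.yaml", out_dir="temp"):
    """Return the `xmen dict` and `xmen index` commands for every ontology."""
    dict_cmds = [
        f"xmen dict {config} --code {script} --output {out_dir}/ --key {key}"
        for key, script in DICT_SCRIPTS.items()
    ]
    index_cmds = [
        f"xmen index {config} --dict {out_dir}/{key}.jsonl "
        f"--output {out_dir}/{key}_index --sapbert"
        for key in DICT_SCRIPTS
    ]
    # The dictionaries have to exist before their indexes can be built.
    return dict_cmds + index_cmds


commands = build_commands()
for cmd in commands:
    print(cmd)
```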
 
Now we can load the BRONCO150 dataset using BigBIO:

In [2]:
import datasets

path_to_data = r"../../BRONCO150" # paste here the path to the local data

bronco = datasets.load_dataset(path = "bigbio/bronco", 
                               name = "bronco_bigbio_kb", 
                               data_dir=path_to_data)

bronco

Found cached dataset bronco (/dhc/home/ignacio.rodriguez/.cache/huggingface/datasets/bigbio___bronco/bronco_bigbio_kb-data_dir=..%2F..%2FBRONCO150/1.0.0/cab8fc4a62807688cb5b36df7a24eb7f364314862c4196f6ff2db3813f2fe68b)


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'document_id', 'passages', 'entities', 'events', 'coreferences', 'relations'],
        num_rows: 5
    })
})

Finally, we have to choose the semantic class we want to work on and restructure the dataset into 5 folds for cross-validation, as originally intended.

In [3]:
label = "DIAGNOSIS" # Choose here TREATMENT, MEDICATION or DIAGNOSIS

In [4]:
label2dict = {
    "TREATMENT": "ops",
    "MEDICATION": "atc",
    "DIAGNOSIS": "icd10gm"
}

def filter_entities(bigbio_entities, valid_entities):
    filtered_entities = []
    for ent in bigbio_entities:
        if ent['type'] in valid_entities:
            filtered_entities.append(ent)
    return filtered_entities

ds = bronco.map(lambda row: {'entities': filter_entities(row['entities'], [label])})

Loading cached processed dataset at /dhc/home/ignacio.rodriguez/.cache/huggingface/datasets/bigbio___bronco/bronco_bigbio_kb-data_dir=..%2F..%2FBRONCO150/1.0.0/cab8fc4a62807688cb5b36df7a24eb7f364314862c4196f6ff2db3813f2fe68b/cache-a26fddd6429a662a.arrow
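
To make the filtering step concrete, the following sketch applies the same logic to a hand-made record. The two toy entities are invented for illustration and carry only the `type` and `text` fields; real BigBIO entities contain more fields.

```python
def filter_entities(bigbio_entities, valid_entities):
    """Keep only the entities whose type is listed in valid_entities."""
    return [ent for ent in bigbio_entities if ent["type"] in valid_entities]


# Toy record (invented for illustration, not taken from BRONCO150).
toy_row = {
    "entities": [
        {"type": "DIAGNOSIS", "text": ["Hepatitis"]},
        {"type": "MEDICATION", "text": ["Aspirin"]},
    ]
}

kept = filter_entities(toy_row["entities"], ["DIAGNOSIS"])
print([ent["text"] for ent in kept])
```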


In [5]:
from datasets import DatasetDict

ground_truth = DatasetDict()
for k in range(5):
    ground_truth[f"k{k+1}"] = ds["train"].select([k])
    
ds = ground_truth
ds

DatasetDict({
    k1: Dataset({
        features: ['id', 'document_id', 'passages', 'entities', 'events', 'coreferences', 'relations'],
        num_rows: 1
    })
    k2: Dataset({
        features: ['id', 'document_id', 'passages', 'entities', 'events', 'coreferences', 'relations'],
        num_rows: 1
    })
    k3: Dataset({
        features: ['id', 'document_id', 'passages', 'entities', 'events', 'coreferences', 'relations'],
        num_rows: 1
    })
    k4: Dataset({
        features: ['id', 'document_id', 'passages', 'entities', 'events', 'coreferences', 'relations'],
        num_rows: 1
    })
    k5: Dataset({
        features: ['id', 'document_id', 'passages', 'entities', 'events', 'coreferences', 'relations'],
        num_rows: 1
    })
})

# Run Candidate Generator
We will use `SapBERTLinker`, which uses a Transformer model to retrieve candidates with dense embeddings.

In [6]:
from xmen.linkers import SapBERTLinker
from xmen.evaluation import evaluate_at_k

Your CPU supports instructions that this binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib


<u>This cell only needs to be run once. On subsequent runs, you can load the result with the cell below this one.</u>

In [7]:
embedding_model_name = 'cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR'

SapBERTLinker.clear()
sapbert_linker = SapBERTLinker(
    model_name = embedding_model_name,
    index_base_path = f"{base_path}/{label2dict[label]}_index/index/sapbert",
    k = 1000
)

candidates = sapbert_linker.predict_batch(ds, batch_size=128)

# Save locally to avoid running it every time
candidates.save_to_disk(f"{base_path}/{label2dict[label]}_index/pred_sapbert")

In [8]:
candidates = datasets.load_from_disk(f"{base_path}/{label2dict[label]}_index/pred_sapbert")

# Recall for different numbers of candidates (k)
_ = evaluate_at_k(ds['k5'], candidates['k5'])

Perf@1 0.6658624849215923
Perf@2 0.7201447527141134
Perf@4 0.7635705669481303
Perf@8 0.7949336550060314
Perf@16 0.8323281061519904
Perf@32 0.8769601930036188
Perf@64 0.8950542822677925
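
The numbers above can be read as recall at k: the share of mentions whose gold concept appears among the first k candidates. The sketch below illustrates that reading on toy data. It is not xMEN's `evaluate_at_k` implementation, and the function name `recall_at_k` and the concept IDs are placeholders.

```python
def recall_at_k(gold_ids, ranked_candidates, ks=(1, 2, 4)):
    """Fraction of mentions whose gold ID is within the top-k candidates."""
    results = {}
    for k in ks:
        hits = sum(
            gold in ranked[:k]
            for gold, ranked in zip(gold_ids, ranked_candidates)
        )
        results[k] = hits / len(gold_ids)
    return results


# Three toy mentions with placeholder concept IDs.
gold = ["A", "B", "C"]
ranked = [
    ["A", "X", "Y", "Z"],  # gold concept at rank 1
    ["X", "Y", "B", "Z"],  # gold concept at rank 3
    ["X", "Y", "Z", "W"],  # gold concept not retrieved
]
scores = recall_at_k(gold, ranked)
print(scores)
```

Recall can only grow or stay flat as k increases, which matches the monotone Perf@k values printed by the notebook.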


# Train Cross-encoder
We use a cross-encoder to jointly encode each mention, along with its context, with every potential candidate. This way, we can learn the best ranking of candidates from the training data.
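
As a rough illustration of the idea, the sketch below pairs one mention in context with each candidate and labels the pair that matches the gold concept. The marker tokens, the character window, the candidate names and the function `build_pairs` are all assumptions made for this example; xMEN's `CrossEncoderReranker.prepare_data` builds its own representation.

```python
def build_pairs(text, start, end, candidates, gold_id, window=20):
    """Pair a marked mention-in-context with every candidate name."""
    left = text[max(0, start - window):start]
    right = text[end:end + window]
    # Hypothetical markers that highlight the mention inside its context.
    context = f"{left}[START] {text[start:end]} [END]{right}"
    return [
        (context, name, 1 if cand_id == gold_id else 0)
        for cand_id, name in candidates
    ]


text = "Patient mit chronischer Hepatitis C."
start = text.index("Hepatitis")
end = start + len("Hepatitis C")
# Placeholder candidate list: (concept ID, concept name).
toy_candidates = [("c1", "Akute Hepatitis"), ("c2", "Chronische Hepatitis C")]

pairs = build_pairs(text, start, end, toy_candidates, gold_id="c2")
for context, name, is_gold in pairs:
    print(is_gold, "|", context, "|", name)
```

A model trained on such pairs scores each (context, candidate) combination, and sorting the scores yields the reranked candidate list.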

In [9]:
from xmen.reranking.cross_encoder import CrossEncoderReranker, CrossEncoderTrainingArgs
from xmen.linkers.util import filter_and_apply_threshold
from xmen.kb import load_kb
from xmen.data.indexed_dataset import IndexedDatasetDict, IndexedDataset

<u>This cell only needs to be run once. On subsequent runs, you can load the result with the cell below this one.</u>

In [10]:
K_RERANKING = 64
candidates = filter_and_apply_threshold(candidates, K_RERANKING, 0.0)
kb = load_kb(f"{base_path}/{label2dict[label]}.jsonl")

cross_enc_ds = CrossEncoderReranker.prepare_data(candidates, ds, kb)

# Save locally to avoid running it every time
cross_enc_ds.save_to_disk(f"{base_path}/{label2dict[label]}_index/cross_encoded_dataset")
cross_enc_ds

Context length: 128


{'k1': [847 items],
 'k2': [771 items],
 'k3': [839 items],
 'k4': [793 items],
 'k5': [830 items]}
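
The call `filter_and_apply_threshold(candidates, K_RERANKING, 0.0)` above limits each mention to its best-scoring candidates. The sketch below shows that kind of filtering under the assumption that candidates whose score reaches the threshold are kept and the list is then truncated to the top k. The function `top_k_above` and the scores are invented, and xMEN's actual function operates on datasets rather than plain lists.

```python
def top_k_above(scored_candidates, k, threshold):
    """Keep the k highest-scoring candidates with score >= threshold."""
    kept = [(cid, score) for cid, score in scored_candidates if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Toy (concept ID, score) pairs, invented for illustration.
toy = [("c1", 0.91), ("c2", 0.40), ("c3", 0.75), ("c4", 0.10)]
top_two = top_k_above(toy, k=2, threshold=0.0)
confident = top_k_above(toy, k=10, threshold=0.5)
print(top_two)
print(confident)
```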

In [11]:
cross_enc_ds = IndexedDatasetDict.load_from_disk(f"{base_path}/{label2dict[label]}_index/cross_encoded_dataset")
cross_enc_ds

{'k5': [830 items],
 'k2': [771 items],
 'k1': [847 items],
 'k3': [839 items],
 'k4': [793 items]}

Now we set the training arguments, train-eval splits and fit the model. Depending on the number of epochs, the training can take several hours.

In [12]:
train_args = CrossEncoderTrainingArgs(
    num_train_epochs = 5,
    model_name = "SCAI-BIO/bio-gottbert-base",
    score_regularization=True,
)

rr = CrossEncoderReranker()
output_dir = f'{base_path}/{label2dict[label]}_index/cross_encoder_training/'

In [13]:
# Choose train and evaluation folds
train_folds = [
    "k1",
    "k2",
    "k3",
    #"k4",
    #"k5",
]
# Concatenate the examples of all training folds into a single list
train = sum([cross_enc_ds[k].dataset for k in train_folds], [])


val_fold = "k4"
val = cross_enc_ds[val_fold].dataset

test_fold = "k5"
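
The cell above fixes one train/validation/test assignment. For a full cross-validation, the roles can be rotated over the five folds. The rotation scheme below (three training folds, then one validation and one test fold) is an assumption for illustration, not something prescribed by BRONCO or xMEN, and `rotate_splits` is a made-up helper.

```python
folds = ["k1", "k2", "k3", "k4", "k5"]


def rotate_splits(folds):
    """Yield (train_folds, val_fold, test_fold) for every rotation."""
    n = len(folds)
    for i in range(n):
        rotated = folds[i:] + folds[:i]
        # The last two folds of each rotation serve as validation and test.
        yield rotated[:n - 2], rotated[n - 2], rotated[n - 1]


splits = list(rotate_splits(folds))
for train_part, val_part, test_part in splits:
    print(train_part, val_part, test_part)
```

The first rotation reproduces the assignment used in this notebook (train on k1 to k3, validate on k4, test on k5), and every fold serves as the test fold exactly once.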

<u>This cell only needs to be run once. On subsequent runs, you can load the result with the cell below this one.</u> (Unless you changed the model or the input data and want to train a new one.)

In [14]:
rr.fit(train_dataset = train,
       val_dataset = val,
       output_dir=output_dir,
       training_args=train_args)

model_name := SCAI-BIO/bio-gottbert-base
num_train_epochs := 5
fp16 := True
label_smoothing := False
score_regularization := True
train_layers := None
softmax_loss := True
random_seed := 42


Some weights of the model checkpoint at SCAI-BIO/bio-gottbert-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at SCAI-BIO/bio-gottbert-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2457 [00:00<?, ?it/s]

2023-06-16 18:36:10 - EntityLinkingEvaluator: Evaluating the model on eval dataset after epoch 0:
2023-06-16 18:42:55 - Accuracy: 0.6658259773013872
2023-06-16 18:42:55 - Accuracy @ 5: 0.8284993694829761
2023-06-16 18:42:55 - Accuracy @ 64: 0.880201765447667
2023-06-16 18:42:55 - Baseline Accuracy: 0.6368221941992434
2023-06-16 18:42:55 - Save model to ../temp/icd10gm_index/cross_encoder_training/


Iteration:   0%|          | 0/2457 [00:00<?, ?it/s]

2023-06-16 19:15:07 - EntityLinkingEvaluator: Evaluating the model on eval dataset after epoch 1:
2023-06-16 19:21:51 - Accuracy: 0.7150063051702396
2023-06-16 19:21:51 - Accuracy @ 5: 0.8272383354350568
2023-06-16 19:21:51 - Accuracy @ 64: 0.880201765447667
2023-06-16 19:21:51 - Baseline Accuracy: 0.6368221941992434
2023-06-16 19:21:51 - Save model to ../temp/icd10gm_index/cross_encoder_training/


Iteration:   0%|          | 0/2457 [00:00<?, ?it/s]

2023-06-16 19:54:02 - EntityLinkingEvaluator: Evaluating the model on eval dataset after epoch 2:
2023-06-16 20:00:46 - Accuracy: 0.7238335435056746
2023-06-16 20:00:46 - Accuracy @ 5: 0.8259773013871374
2023-06-16 20:00:46 - Accuracy @ 64: 0.880201765447667
2023-06-16 20:00:46 - Baseline Accuracy: 0.6368221941992434
2023-06-16 20:00:46 - Save model to ../temp/icd10gm_index/cross_encoder_training/


Iteration:   0%|          | 0/2457 [00:00<?, ?it/s]


# Evaluate Cross-encoder
Now we can load the trained model and evaluate it on the validation and test folds, which were not used for training.

In [15]:
rr = CrossEncoderReranker.load(output_dir, device=0)

2023-06-16 21:18:39 - Use pytorch device: cuda


In [16]:
cross_enc_pred_val = rr.rerank_batch(candidates[val_fold], cross_enc_ds[val_fold])
_ = evaluate_at_k(ds[val_fold], cross_enc_pred_val)

Batches:   0%|          | 0/793 [00:00<?, ?it/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Perf@1 0.7297979797979798
Perf@2 0.7891414141414141
Perf@4 0.8257575757575758
Perf@8 0.8459595959595959
Perf@16 0.8623737373737373
Perf@32 0.8787878787878788
Perf@64 0.8800505050505051


In [17]:
cross_enc_pred_test = rr.rerank_batch(candidates[test_fold], cross_enc_ds[test_fold])
_ = evaluate_at_k(ds[test_fold], cross_enc_pred_test)

Batches:   0%|          | 0/830 [00:00<?, ?it/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Perf@1 0.744270205066345
Perf@2 0.7876960193003619
Perf@4 0.827503015681544
Perf@8 0.8480096501809409
Perf@16 0.8721351025331725
Perf@32 0.8890229191797346
Perf@64 0.8950542822677925
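
Comparing these numbers with the SapBERT-only results for fold k5 shows what the reranker contributes. The values below are copied from the outputs in this notebook. Perf@64 is unchanged, which is consistent with the reranker only reordering the 64 retrieved candidates.

```python
# Perf@k on fold k5, copied from the outputs in this notebook.
sapbert = {1: 0.6658624849215923, 8: 0.7949336550060314, 64: 0.8950542822677925}
reranked = {1: 0.744270205066345, 8: 0.8480096501809409, 64: 0.8950542822677925}

# Absolute gain of the cross-encoder over the candidate generator alone.
gains = {k: reranked[k] - sapbert[k] for k in sapbert}
for k, gain in gains.items():
    print(f"Perf@{k}: {gain:+.4f}")
```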
