# Evaluating the model after fine-tuning

Use this notebook to get a quick insight into how the model performs. It shows the model's output before and after training for a few random examples from the validation dataset, and then runs an evaluation on the entire validation dataset.

## Importing needed modules

In [1]:
from transformers import (
    T5Tokenizer,
    AutoTokenizer,
    T5ForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
import torch
from datasets import load_dataset
from wasabi import msg
import random
import yaml
from pathlib import Path
from os.path import abspath



## Setting home directory

In [2]:
home_dir = Path(abspath("")).parent

msg.info(f"Home directory: {home_dir}")

ℹ Home directory: /home/lgrootde/Generative-re-tests


## Load Config & Dataset

In [3]:
# Load the config
with open(home_dir.joinpath('config/config_T5-L_cdr.yaml')) as f:
    config = yaml.load(f, Loader=yaml.FullLoader)
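
The cells below read a number of keys from this config, so a missing key only surfaces later as a `KeyError`. A minimal sketch of an early sanity check, assuming the config is a nested dict as returned by `yaml.load`. The helper name `missing_keys` and the example values are hypothetical; only the key names are taken from this notebook.

```python
def missing_keys(config, required):
    """Return the dotted key paths from `required` that are absent in `config`."""
    missing = []
    for path in required:
        node = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


# Key names read later in this notebook (not an exhaustive list).
REQUIRED_KEYS = [
    "model_name",
    "dataset_vars.type",
    "dataset_vars.dir",
    "dataset_vars.column_names",
    "max_seq_length",
    "padding",
    "truncation",
    "ignore_pad_token_for_loss",
]

# Placeholder config for illustration only; the values are made up.
example_config = {
    "model_name": "placeholder-model",
    "dataset_vars": {
        "type": "csv",
        "dir": "data/",
        "column_names": ["input", "relations"],
    },
    "max_seq_length": 512,
}

print(missing_keys(example_config, REQUIRED_KEYS))
# ['padding', 'truncation', 'ignore_pad_token_for_loss']
```

In the notebook this could be called as `missing_keys(config, REQUIRED_KEYS)` right after loading the file.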

In [4]:
dataset = load_dataset(
    config['dataset_vars']['type'],
    data_dir=home_dir.joinpath(config['dataset_vars']['dir']),
    column_names=config['dataset_vars']['column_names']
)

eval_dataset = dataset['validation'].select(range(1,501)) # remove first row that contains column names

## Get the random examples

In [5]:
# Gather random examples from the evaluation dataset
amount_examples_to_show = 5
random_examples = []
for i in range(amount_examples_to_show):
    pick = random.randint(0, len(eval_dataset)-1)
    random_examples.append({'Input':eval_dataset[pick]['input'],
                            'Expected output':eval_dataset[pick]['relations']})
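
Note that `random.randint` can pick the same row more than once, and the selection changes on every run. A minimal sketch of a reproducible alternative that samples without replacement, assuming rows can be indexed like `eval_dataset[i]['input']`. The helper name `pick_examples`, the `seed` parameter, and the toy rows are hypothetical.

```python
import random


def pick_examples(rows, k, seed=0):
    """Pick up to k distinct rows, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    k = min(k, len(rows))
    indices = rng.sample(range(len(rows)), k)  # sampling without replacement
    return [
        {"Input": rows[i]["input"], "Expected output": rows[i]["relations"]}
        for i in indices
    ]


# Toy rows standing in for eval_dataset
toy_rows = [{"input": f"text {i}", "relations": f"rel {i}"} for i in range(20)]
examples = pick_examples(toy_rows, k=5, seed=42)
print(len(examples))  # 5
```

Fixing the seed makes the "before" and "after" comparisons repeatable across notebook runs.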

## Performance before training

In [6]:
# Load model and tokenizer
model_name = config['model_name']
device_map = {"": 0}

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    device_map=device_map
) # we specifically use T5ForConditionalGeneration because it has a language modeling head

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on google-t5/t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
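
The warning above says that truncation to 512 tokens should not be relied on implicitly. A minimal sketch of requesting it explicitly through the tokenizer's `truncation` and `max_length` arguments. The helper name `encode_truncated` is hypothetical, and the default of 512 is taken from the warning, not from the config.

```python
def encode_truncated(tokenizer, text, max_length=512):
    """Tokenize `text`, explicitly truncating it to at most `max_length` tokens."""
    return tokenizer(
        text,
        return_tensors="pt",
        truncation=True,        # cut off tokens beyond max_length
        max_length=max_length,
    )
```

In the inference loops this could replace the plain tokenizer call, for example `input_ids = encode_truncated(tokenizer, text).input_ids.to(model.device)`. Truncation drops the end of long abstracts, so relations mentioned only there can no longer be predicted.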


In [7]:
# Model performance before training
inputs = [i["Input"] for i in random_examples]
expected_output = [i["Expected output"] for i in random_examples]

for text, expected in zip(inputs, expected_output):  # `text` avoids shadowing the built-in input()
    # inference
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=128)
    decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

    # print overview
    msg.info("Input:")
    print(text)
    msg.good("Expected output:")
    print(expected)
    msg.info("Actual output:")
    print(decoded_output, "\n\n\n")

ℹ Input:
Cortical motor overactivation in parkinsonian patients with L-dopa-induced peak-dose dyskinesia. We have studied the regional cerebral blood flow (rCBF) changes induced by the execution of a finger-to-thumb opposition motor task in the supplementary and primary motor cortex of two groups of parkinsonian patients on L-dopa medication, the first one without L-dopa induced dyskinesia (n = 23) and the other with moderate peak-dose dyskinesia (n = 15), and of a group of 14 normal subjects. Single photon emission tomography with i.v. 133Xe was used to measure the rCBF changes. The dyskinetic parkinsonian patients exhibited a pattern of response which was markedly different from those of the normal subjects and non-dyskinetic parkinsonian patients, with a significant overactivation in the supplementary motor area and the ipsi- and contralateral primary motor areas. These results are compatible with the hypothesis that an hyperkinetic abnormal involuntary movement, like L

Token indices sequence length is longer than the specified maximum sequence length for this model (562 > 512). Running this sequence through the model will result in indexing errors


ℹ Input:
Intraocular pressure in patients with uveitis treated with fluocinolone acetonide implants. OBJECTIVE: To report the incidence and management of elevated intraocular pressure (IOP) in patients with uveitis treated with the fluocinolone acetonide (FA) intravitreal implant. DESIGN: Pooled data from 3 multicenter, double-masked, randomized, controlled, phase 2b/3 clinical trials evaluating the safety and efficacy of the 0.59-mg or 2.1-mg FA intravitreal implant or standard therapy were analyzed. RESULTS: During the 3-year follow-up, 71.0% of implanted eyes had an IOP increase of 10 mm Hg or more than baseline and 55.1%, 24.7%, and 6.2% of eyes reached an IOP of 30 mm Hg or more, 40 mm Hg or more, and 50 mm Hg or more, respectively. Topical IOP-lowering medication was administered in 74.8% of implanted eyes, and IOP-lowering surgeries, most of which were trabeculectomies (76.2%), were performed on 36.6% of implanted eyes. Intraocular pressure-lowering surgeries were c

## Performance after training

In [9]:
# Load model after training
model = T5ForConditionalGeneration.from_pretrained(
    home_dir.joinpath("results/checkpoint-1200"),
    device_map=device_map,
    local_files_only=True
)

In [10]:
# Model performance after training
inputs = [i["Input"] for i in random_examples]
expected_output = [i["Expected output"] for i in random_examples]

for text, expected in zip(inputs, expected_output):  # `text` avoids shadowing the built-in input()
    # inference
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=128)
    decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

    # print overview
    msg.info("Input:")
    print(text)
    msg.good("Expected output:")
    print(expected)
    msg.info("Actual output:")
    print(decoded_output, "\n\n\n")

ℹ Input:
Cortical motor overactivation in parkinsonian patients with L-dopa-induced peak-dose dyskinesia. We have studied the regional cerebral blood flow (rCBF) changes induced by the execution of a finger-to-thumb opposition motor task in the supplementary and primary motor cortex of two groups of parkinsonian patients on L-dopa medication, the first one without L-dopa induced dyskinesia (n = 23) and the other with moderate peak-dose dyskinesia (n = 15), and of a group of 14 normal subjects. Single photon emission tomography with i.v. 133Xe was used to measure the rCBF changes. The dyskinetic parkinsonian patients exhibited a pattern of response which was markedly different from those of the normal subjects and non-dyskinetic parkinsonian patients, with a significant overactivation in the supplementary motor area and the ipsi- and contralateral primary motor areas. These results are compatible with the hypothesis that an hyperkinetic abnormal involuntary movement, like L

## Evaluation using scores

In [15]:
import sys
sys.path.insert(1, str(home_dir)) # Adds the home directory to the system path so that the project's modules (e.g. helper_functions) can be imported
from helper_functions import (
    postprocess_text,
    split_on_labels,
    handle_coreforents,
    extract_relation_triples,
    get_group,
    map_coferents,
    split_coferents,
    ner_metric,
    re_metric,
)
import numpy as np
import evaluate
import re

### Setting up trainer

In [16]:
training_arguments = Seq2SeqTrainingArguments(
        output_dir=config['output_dir'],
        per_device_train_batch_size=config['per_device_train_batch_size'],
        gradient_accumulation_steps=config['gradient_accumulation_steps'],
        optim=config['optim'],
        save_steps=config['save_steps'],
        logging_steps=config['logging_steps'],
        learning_rate=config['learning_rate'],
        fp16=config['fp16'],
        bf16=config['bf16'],
        max_grad_norm=config['max_grad_norm'],
        max_steps=config['max_steps'],
        warmup_ratio=config['warmup_ratio'],
        group_by_length=config['group_by_length'],
        lr_scheduler_type=config['lr_scheduler_type'],
        predict_with_generate=True,
        save_total_limit=2,
        save_strategy='steps',
        load_best_model_at_end=True,
        do_eval=config['do_eval'],
        evaluation_strategy=config['evaluation_strategy'],
        eval_steps=config['eval_steps'],
        remove_unused_columns=True,
        generation_max_length=152
    )

In [17]:
data_collator = DataCollatorForSeq2Seq(
        tokenizer,
        model=model,
        label_pad_token_id=-100,
        pad_to_multiple_of=8 if config['fp16'] else None,
    )

#### We implement a modified version of the preprocess function here because Sacred's parameter injection is not available in this notebook.

In [18]:
def preprocess_function(examples):
    '''
    Tokenizes a batch of input and target sequences.
    Meant to be used with the dataset.map() function (batched=True).
    '''
    
    text_column = dataset_vars['column_names'][0]
    rel_column = dataset_vars['column_names'][1]

    # Split input and target
    inputs, targets = [], []
    for i in range(len(examples[text_column])):
        if examples[text_column][i] and examples[rel_column][i]: # remove pairs where one is None
            inputs.append(examples[text_column][i])
            targets.append(examples[rel_column][i])

    # Tokenize the input
    model_inputs = tokenizer(
        inputs, 
        max_length=max_seq_length, 
        padding=padding, 
        truncation=truncation, 
        return_tensors='pt'
    )

    # Tokenize the target sequence
    labels = tokenizer(
        text_target=targets, 
        max_length=max_seq_length, 
        padding=padding, 
        truncation=truncation,  
        return_tensors='pt'
    )

    # Replace pad tokens with -100 so they don't contribute to the loss
    if ignore_pad_token_for_loss:
        labels["input_ids"] = [
                    [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
                ]

    # Add tokenized target text to output
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs
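
The pad-token replacement in `preprocess_function` can be illustrated on plain lists. This is a minimal sketch with made-up token ids; it assumes the pad token id is 0, as it is for T5 tokenizers, and relies on -100 being the default `ignore_index` of PyTorch's cross-entropy loss. The helper name `mask_padding` is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace every pad token id in a batch of label sequences with ignore_index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]


# Two padded label sequences; 1 is the end-of-sequence id, 0 is padding.
batch = [[37, 1712, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_padding(batch, pad_token_id=0)
print(masked)
# [[37, 1712, 1, -100, -100], [8, 1, -100, -100, -100]]
```

Only the padding positions change; real tokens, including the end-of-sequence token, still contribute to the loss.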

In [19]:
dataset_vars = config['dataset_vars']
max_seq_length = config['max_seq_length']
padding = config['padding']
truncation = config['truncation']
ignore_pad_token_for_loss = config['ignore_pad_token_for_loss']

eval_dataset = eval_dataset.map(
    preprocess_function,
    batched=True,
    desc="Running tokenizer on train dataset"
)

Running tokenizer on train dataset: 100%|██████████| 500/500 [00:01<00:00, 265.94 examples/s]


In [20]:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
        
    # Replace -100s used for padding as we can't decode them
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, rouge_types=['rouge1', 'rouge2'], references=decoded_labels, use_stemmer=False)
    result.update(re_metric(predictions=decoded_preds, references=decoded_labels, ner_labels=['@CHEMICAL@', '@DISEASE@'], re_labels=['@CID@']))
    result.update(ner_metric(predictions=decoded_preds, references=decoded_labels, ner_labels=['@CHEMICAL@', '@DISEASE@'], re_labels=['@CID@']))
    result = {k: round(v * 100, 4) for k, v in result.items()} # convert all metric values to percentages, rounded to 4 decimal places
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens) # mean length of the generated sequences
    return result
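
The implementation of `re_metric` lives in `helper_functions` and is not shown in this notebook. As a generic illustration of how relation-extraction scores are commonly computed, and not as the repository's implementation, here is a sketch of micro-averaged precision, recall and F1 over sets of relation triples. The function name `triple_prf` and the toy triples are hypothetical.

```python
def triple_prf(predicted, gold):
    """Micro precision, recall and F1 over per-document sets of relation triples."""
    tp = fp = fn = 0
    for pred_set, gold_set in zip(predicted, gold):
        pred_set, gold_set = set(pred_set), set(gold_set)
        tp += len(pred_set & gold_set)  # triples predicted and present in gold
        fp += len(pred_set - gold_set)  # predicted but not in gold
        fn += len(gold_set - pred_set)  # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up triples of the form (chemical, relation label, disease)
gold = [{("chem_a", "CID", "dis_x")}, {("chem_b", "CID", "dis_y")}]
pred = [{("chem_a", "CID", "dis_x"), ("chem_a", "CID", "dis_z")}, set()]
scores = triple_prf(pred, gold)
print(scores)
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Here one of two predicted triples is correct and one of two gold triples is found, so all three scores are 0.5.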

In [21]:
# Load metric
# `metric` is a module-level name, so compute_metrics() can read it without a `global` statement
metric = evaluate.load("rouge")

In [22]:
trainer = Seq2SeqTrainer(
        model=model,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        args=training_arguments,
    )

In [23]:
trainer.evaluate()

NameError: name 're' is not defined