# Evaluation stage

This notebook evaluates the performance of the models. [As described](https://github.com/s-nlp/detox/tree/main#evaluation) in the [code of the given paper](https://github.com/s-nlp/detox/blob/main/emnlp2021/metric/metric.py), many different metrics are computed to assess the quality of the models. For simplicity, I will use only some of them.

### Define metric calculation functions

In [31]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from tqdm import tqdm
import numpy as np
import torch


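def style_transfer_accuracy(preds, batch_size=32):
    """Return the mean non-toxic probability the classifier assigns to `preds`."""
    print('Calculating style of predictions')
    results = []

    classifier_model_name = 's-nlp/roberta_toxicity_classifier'
    tokenizer = RobertaTokenizer.from_pretrained(classifier_model_name)
    model = RobertaForSequenceClassification.from_pretrained(classifier_model_name)
    model.eval()  # Make sure dropout is disabled

    for i in tqdm(range(0, len(preds), batch_size)):
        # Truncate to the model's maximum length so that long generations do not raise an error
        batch = tokenizer(preds[i:i + batch_size], return_tensors='pt', padding=True, truncation=True)
        with torch.inference_mode():
            logits = model(**batch).logits
        # Column 1 is treated as the toxic class; keep its complement
        toxic_probs = torch.softmax(logits, -1)[:, 1].cpu().numpy()
        results.extend(1 - toxic_probs)

    # Mean of probabilities, not a thresholded accuracy
    accuracy = np.mean(results)
    return accuracy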
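
The score returned above is the average of `1 - softmax(logits)[:, 1]`, so it is a mean probability rather than a count of correctly classified sentences. The sketch below reproduces that arithmetic in plain Python. The logits are made-up values and the helper name `non_toxic_probability` is a hypothetical one introduced only for illustration; neither comes from the classifier.

```python
import math


def non_toxic_probability(logits):
    """Softmax over two logits, then return 1 - P(class 1)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    p_class1 = exps[1] / sum(exps)
    return 1.0 - p_class1


# Hypothetical logits for three predictions, ordered as [class 0, class 1]
example_logits = [[4.0, -4.0], [0.0, 0.0], [-4.0, 4.0]]
scores = [non_toxic_probability(pair) for pair in example_logits]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

A confident non-toxic prediction contributes a value near 1, an undecided one contributes 0.5, and a confident toxic one contributes a value near 0, so the mean reflects the classifier's confidence as well as its decisions.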

In [15]:
from nltk.translate.bleu_score import sentence_bleu
from tqdm import tqdm


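def bleu_score(inputs, preds):
    """Average sentence-level BLEU between each input and its prediction."""
    bleu_sim = 0
    counter = 0
    print('Calculating BLEU similarity')
    for i in tqdm(range(len(inputs))):
        # Skip pairs where either text has 3 characters or fewer
        if len(inputs[i]) > 3 and len(preds[i]) > 3:
            # Raw strings are passed, so the n-grams are computed over characters
            bleu_sim += sentence_bleu([inputs[i]], preds[i])
            counter += 1

    # Avoid a division by zero when every pair was skipped
    return float(bleu_sim / counter) if counter else 0.0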
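
Note that `sentence_bleu` works on sequences of items, so passing raw strings means the n-grams are built from characters rather than words. The sketch below illustrates the difference with a simplified clipped n-gram precision, which is one ingredient of BLEU. It is not NLTK's implementation, and the helper name `clipped_ngram_precision` is a hypothetical one used only for this illustration.

```python
from collections import Counter


def clipped_ngram_precision(reference, hypothesis, n=1):
    """Clipped n-gram precision of one hypothesis against one reference sequence."""
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp_ngrams = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total


reference, hypothesis = 'you are fuck', 'you are amazing'
char_level = clipped_ngram_precision(reference, hypothesis)                   # items are characters
token_level = clipped_ngram_precision(reference.split(), hypothesis.split())  # items are words
print(char_level, token_level)
```

Character-level scores reward shared spelling and spaces, which is worth keeping in mind when comparing the numbers in this notebook with word-level BLEU reported elsewhere.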

## Compute metrics

### Example

In [7]:
sta = style_transfer_accuracy(['you are amazing'])
print(f'Style transfer accuracy: {sta}')

Calculating style of predictions


Some weights of the model checkpoint at s-nlp/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 1/1 [00:00<00:00,  5.51it/s]

Style transfer accuracy: 1.0





In [5]:
bleu = bleu_score(['you are fuck'], ['you are amazing'])
print(f'BLEU score: {bleu}')

Calculating BLEU similarity


100%|██████████| 1/1 [00:00<00:00, 1197.69it/s]

BLEU score: 0.4758733096412523





## Evaluate models' results

### Dictionary based model

The evaluation of this model is in the `2.0-dictionary-based-model.ipynb` notebook.

### T5 Finetuned

### LLM Mistral 7B

In [34]:
import pandas as pd
import os


outputs_directory = '../data/interim/model-outputs'
dfs = []
# Sort the file names so that the concatenation order is reproducible
csv_files = sorted(f for f in os.listdir(outputs_directory) if f.endswith('.csv'))

for csv_file in csv_files:
    file_path = os.path.join(outputs_directory, csv_file)
    df = pd.read_csv(file_path, index_col=0)
    dfs.append(df)

# Merge all model outputs into one frame with a fresh index
llm_result_df = pd.concat(dfs, ignore_index=True)
llm_result_df.fillna('', inplace=True)  # Replace NaN values, if any, with empty strings
llm_result_df.head(1)

| Unnamed: 0 | inputs | preds |
|---|---|---|
| 0 | That is a perfect song. | What a fantastic song. |
| 1 | for God's sake! | Oh dear, please stop that. |
| 2 | Kadaj's group is young and violent. | Toxic text: Kadaj's gang is young and brutal.\... |
| 3 | is it about the two of you and your homosexual... | Is that about you two being intimate? Look. |
| 4 | I try not to murder. | I don't wish to harm you. |


In [None]:
import pandas as pd


# Path to a single output file (not a directory)
llm_output_path = '../data/interim/model-outputs/llm-mistral-10shots.csv'
llm_result_df = pd.read_csv(llm_output_path, index_col=0)
llm_result_df.fillna('', inplace=True)  # Replace NaN values, if any, with empty strings
llm_result_df.head(1)

In [37]:
inputs = list(llm_result_df['inputs'])
preds = list(llm_result_df['preds'])
bleu = bleu_score(inputs, preds)
print(f'BLEU score: {bleu}')

Calculating BLEU similarity


100%|██████████| 14687/14687 [00:04<00:00, 3058.44it/s]

BLEU score: 0.2950518382909825





In [None]:
sta = style_transfer_accuracy(preds, batch_size=128)
print()
print(f'Style transfer accuracy: {sta}')