# Evaluation stage

This notebook evaluates the performance of the models. [As described](https://github.com/s-nlp/detox/tree/main#evaluation) in the [code of the given paper](https://github.com/s-nlp/detox/blob/main/emnlp2021/metric/metric.py), many different metrics are computed to assess the quality of the models. For simplicity, I use only two of them: style transfer accuracy and BLEU similarity.

### Define metric calculation functions

In [31]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from tqdm import tqdm
import numpy as np
import torch


def style_transfer_accuracy(preds, batch_size=32):
    """Return the mean non-toxicity probability of the predictions."""
    print('Calculating style of predictions')
    results = []

    classifier_model_name = 's-nlp/roberta_toxicity_classifier'
    tokenizer = RobertaTokenizer.from_pretrained(classifier_model_name)
    model = RobertaForSequenceClassification.from_pretrained(classifier_model_name)

    for i in tqdm(range(0, len(preds), batch_size)):
        # Pad to the longest text in the batch and truncate overly long texts
        batch = tokenizer(preds[i:i + batch_size], return_tensors='pt',
                          padding=True, truncation=True)
        with torch.inference_mode():
            logits = model(**batch).logits
        # Column 1 holds the probability of the toxic class
        result = torch.softmax(logits, -1)[:, 1].cpu().numpy()
        # Store 1 - P(toxic), i.e. the probability of being non-toxic
        results.extend([1 - item for item in result])

    accuracy = np.mean(results)
    return accuracy
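
To make the aggregation step concrete, here is a minimal sketch of how the score is derived from classifier logits, without loading the model. The logits are made-up values and `non_toxicity_score` is a hypothetical helper name; the sketch only assumes, as the function above does, that index 1 is the toxic class.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def non_toxicity_score(batch_logits):
    """Average 1 - P(toxic) over all rows, assuming index 1 is the toxic class."""
    scores = [1 - softmax(row)[1] for row in batch_logits]
    return sum(scores) / len(scores)


# Made-up logits in [neutral, toxic] order, for illustration only
example_logits = [
    [4.0, -4.0],  # confidently neutral
    [-4.0, 4.0],  # confidently toxic
    [0.0, 0.0],   # undecided
]
print(non_toxicity_score(example_logits))
```

Because the score averages probabilities rather than hard labels, a borderline prediction contributes a partial score instead of a plain 0 or 1.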

In [15]:
from nltk.translate.bleu_score import sentence_bleu
from tqdm import tqdm


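def bleu_score(inputs, preds):
    """Return the mean sentence-level BLEU between inputs and predictions.

    The strings are passed to sentence_bleu as-is, so n-grams are built from characters.
    """
    bleu_sim = 0
    counter = 0
    print('Calculating BLEU similarity')
    for source, pred in tqdm(zip(inputs, preds), total=len(inputs)):
        # Skip pairs where either text is 3 characters or shorter
        if len(source) > 3 and len(pred) > 3:
            bleu_sim += sentence_bleu([source], pred)
            counter += 1

    # Avoid division by zero when every pair was skipped
    return float(bleu_sim / counter) if counter else 0.0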
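
One detail worth noting: `bleu_score` passes raw strings to `sentence_bleu`, and NLTK builds n-grams by iterating over whatever sequence it receives, so the overlap is measured on characters rather than words. The sketch below illustrates the difference with a hypothetical `ngrams` helper written in plain Python (it is not the NLTK implementation).

```python
def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


text = 'you are amazing'

# Iterating over a string yields characters, so these are character bigrams
char_bigrams = ngrams(text, 2)

# Splitting on whitespace first yields word bigrams instead
word_bigrams = ngrams(text.split(), 2)

print(char_bigrams[:3])
print(word_bigrams)
```

The same applies to the `len(...) > 3` filter, which counts characters. To score on words instead, one option is to pass `text.split()` (or the output of another tokenizer) for both the reference and the hypothesis.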

## Compute metrics

### Example

In [7]:
sta = style_transfer_accuracy(['you are amazing'])
print(f'Style transfer accuracy: {sta}')

Calculating style of predictions


Some weights of the model checkpoint at s-nlp/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 1/1 [00:00<00:00,  5.51it/s]

Style transfer accuracy: 1.0





In [5]:
bleu = bleu_score(['you are fuck'], ['you are amazing'])
print(f'BLEU score: {bleu}')

Calculating BLEU similarity


100%|██████████| 1/1 [00:00<00:00, 1197.69it/s]

BLEU score: 0.4758733096412523





## Evaluate models' results

### LLM Mistral 7B

In [29]:
import pandas as pd


llm_result_df_path = '../data/interim/model-outputs/llm-mistral-10shots.csv'
llm_result_df = pd.read_csv(llm_result_df_path, index_col=0)
llm_result_df.fillna('', inplace=True)  # Replace missing values (NaN) with empty strings
llm_result_df.head()

| | inputs | preds |
|---|---|---|
| 57809 | Listen, call off the butchers, and I'll tell you. | Call your butchers and I'll inform you. |
| 132693 | who the hell was going through my stuff? | Who has been going through my things? |
| 254505 | She still might die . . .? | He may still pass away. |
| 451186 | that's what his name was. | That's his name. |
| 191213 | I'd take you on your shoulders... I'd tie you ... | I would carry you on my back, and we'll journe... |


In [33]:
inputs = list(llm_result_df['inputs'])
preds = list(llm_result_df['preds'])
sta = style_transfer_accuracy(preds, batch_size=128)
bleu = bleu_score(inputs, preds)
print()
print(f'Style transfer accuracy: {sta}')
print(f'BLEU score: {bleu}')

Calculating style of predictions


Some weights of the model checkpoint at s-nlp/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 2/2 [00:12<00:00,  6.11s/it]


Calculating BLEU similarity


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
100%|██████████| 144/144 [00:00<00:00, 2570.40it/s]


Style transfer accuracy: 0.8257869377947473
BLEU score: 0.3190431760968685
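
The NLTK warnings above appear because BLEU is a geometric mean of n-gram precisions, so a single n-gram order with no overlap drives the whole score to (nearly) zero. The sketch below uses a hypothetical `bleu_like` helper, deliberately omits the brevity penalty, and works on made-up token lists loosely based on one of the rows shown earlier. It is a simplified illustration of the effect and of add-epsilon smoothing, not NLTK's implementation; in NLTK itself, the warning suggests passing a `SmoothingFunction()` method through the `smoothing_function` argument of `sentence_bleu`.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_like(reference, hypothesis, max_n=4, epsilon=0.0):
    """Geometric mean of clipped n-gram precisions (no brevity penalty).

    A positive epsilon replaces zero match counts, which is the smoothing step.
    """
    log_sum = 0.0
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        total = sum(hyp.values())
        # Clip each hypothesis n-gram count by its count in the reference
        matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if matches == 0:
            matches = epsilon  # smoothing: pretend a small fractional match
        if total == 0 or matches == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_sum += math.log(matches / total) / max_n
    return math.exp(log_sum)


# Illustrative token lists, not taken verbatim from the dataset
reference = 'she still might die'.split()
hypothesis = 'he may still pass away'.split()

unsmoothed = bleu_like(reference, hypothesis)
smoothed = bleu_like(reference, hypothesis, epsilon=0.1)
print(unsmoothed, smoothed)
```

Pairs that hit this case contribute almost nothing to the average in `bleu_score`, so the reported BLEU is likely a conservative estimate compared with a smoothed variant.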



