# Evaluation stage

This notebook evaluates the performance of the models. [As described](https://github.com/s-nlp/detox/tree/main#evaluation) in the [code of the given paper](https://github.com/s-nlp/detox/blob/main/emnlp2021/metric/metric.py), many different metrics are computed to assess the quality of the models. For simplicity, I use only two of them: style transfer accuracy and BLEU similarity.

### Define metric calculation functions

In [6]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from tqdm import tqdm
import numpy as np


def style_transfer_accuracy(preds, batch_size=32):
    """Return the share of predictions that the classifier labels as non-toxic."""
    print('Calculating style of predictions')
    results = []

    classifier_model_name = 's-nlp/roberta_toxicity_classifier'
    tokenizer = RobertaTokenizer.from_pretrained(classifier_model_name)
    model = RobertaForSequenceClassification.from_pretrained(classifier_model_name)

    # Step through the predictions in chunks of batch_size, starting at index 0
    for i in tqdm(range(0, len(preds), batch_size)):
        batch = tokenizer(preds[i:i + batch_size], return_tensors='pt', padding=True)
        result = model(**batch)['logits'].argmax(1).float().data.tolist()
        # The classifier predicts 1 for toxic text, so 1 - label scores neutral text as a success
        results.extend([1 - item for item in result])

    accuracy = np.mean(results)
    return accuracy
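
The batching and label inversion in `style_transfer_accuracy` can be sketched without downloading the model. In this sketch `fake_toxicity_labels` is a hypothetical keyword-based stand-in for the RoBERTa classifier, and `batched` and `style_transfer_accuracy_sketch` are illustrative names that are not part of the notebook or of `transformers`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_toxicity_labels(texts):
    # Hypothetical stand-in for the classifier: 1 = toxic, 0 = neutral.
    return [1 if 'stupid' in text else 0 for text in texts]


def style_transfer_accuracy_sketch(preds, batch_size=2):
    results = []
    for batch in batched(preds, batch_size):
        labels = fake_toxicity_labels(batch)
        # Invert the labels so that a neutral prediction counts as a success.
        results.extend(1 - label for label in labels)
    return sum(results) / len(results)


preds = ['you are amazing', 'you are stupid', 'have a nice day']
batches = list(batched(preds, 2))
accuracy = style_transfer_accuracy_sketch(preds)
print(batches)
print(accuracy)
```

The loop has to start at index 0 and advance by `batch_size`; the last batch may be shorter than the others, which the slice handles on its own.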

In [4]:
from nltk.translate.bleu_score import sentence_bleu
from tqdm import tqdm


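def bleu_score(inputs, preds):
    """Average sentence-level BLEU between each input and its prediction."""
    bleu_sim = 0
    counter = 0
    print('Calculating BLEU similarity')
    for i in tqdm(range(len(inputs))):
        # Skip pairs where either string has 3 characters or fewer
        if len(inputs[i]) > 3 and len(preds[i]) > 3:
            # Strings are passed as-is, so NLTK compares character n-grams
            bleu_sim += sentence_bleu([inputs[i]], preds[i])
            counter += 1

    # Avoid dividing by zero when every pair was skipped
    if counter == 0:
        return 0.0
    return float(bleu_sim / counter)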
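
`sentence_bleu` works on token sequences, and `bleu_score` passes raw strings, so every character acts as a token. The sketch below shows what that means for the pair used in the example later in this notebook. It is a simplified, unsmoothed BLEU written for illustration, not NLTK's code, and the helper names `ngram_counts` and `simple_bleu` are assumptions of this sketch.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count n-grams in a string (characters) or a list (tokens)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def simple_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed BLEU with uniform weights over 1..max_n-grams."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(hypothesis))
    return penalty * math.exp(sum(log_precisions) / max_n)


source, prediction = 'you are fuck', 'you are amazing'
char_level = simple_bleu(source, prediction)
word_level = simple_bleu(source.split(), prediction.split())
print(char_level, word_level)
```

Under these assumptions the character-level score for this pair comes out at about 0.476, while the word-level score is 0 because the two sentences share no word trigram. Character-level BLEU therefore rewards shared spelling as well as shared words, which is worth keeping in mind when reading the scores.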

## Compute metrics

### Example

In [7]:
sta = style_transfer_accuracy(['you are amazing'])
print(f'Style transfer accuracy: {sta}')

Calculating style of predictions


Some weights of the model checkpoint at s-nlp/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 1/1 [00:00<00:00,  5.51it/s]

Style transfer accuracy: 1.0





In [5]:
bleu = bleu_score(['you are fuck'], ['you are amazing'])
print(f'BLEU score: {bleu}')

Calculating BLEU similarity

100%|██████████| 1/1 [00:00<00:00, 1197.69it/s]

BLEU score: 0.4758733096412523





### On model results