## <b>USR</b>: <i>An <b>U</b>n<b>S</b>upervised and <b>R</b>eference Free Evaluation Metric for Dialog Generation
</i>


Credits: the first part of this notebook *1. MaintainsContext (MCtx) Metric* was created by Thomas Bellucci

In this notebook, we show how dialogues can be evaluated with the Unsupervised and Reference Free (USR) Evaluation Metric, as described in: 

* Mehri, Shikib, and Maxine Eskenazi. "Usr: An unsupervised and reference free evaluation metric for dialog generation." arXiv preprint arXiv:2005.00456 (2020). https://arxiv.org/pdf/2005.00456.pdf

USR is also used in Track-3 of the DSTC9 conference: http://dialog.speech.cs.cmu.edu:8003

The source code of USR is available at https://github.com/Shikib/usr. However, the repository contains many other tasks, and it is not clear how the models should be used. We therefore present an easy-to-follow notebook implementation that mimics their approach as explained in the paper.

**WARNING!**
Note that this is not the same implementation. There are very likely to be differences between the scores intended in the paper and the scores produced by our implementation.


## 1. MaintainsContext (MCtx) Metric

As the *Think Aloud* project primarily concerns the selection of thoughts from the brain that yield coherent follow-ups to the dialogue, and is not particularly concerned with their exact phrasing (e.g. <i>Naturalness</i>) or the information communicated (i.e. <i>Uses Knowledge, Interestingness</i>), we restrict the evaluation to the metric deemed most relevant to the tested component: the <i>Maintains Context</i> metric.

However, implementing the MLM metric to measure *naturalness* is as simple as loading another model (e.g. `'adamlin/usr-topicalchat-roberta_ft'`) and changing the model type to `RobertaForMaskedLM`.

In what follows, we implement this metric with the pretrained PyTorch model provided by Mehri and Eskenazi (2020).

### Dependencies

First we'll install all required packages and import all models.

In [None]:
%%capture
!pip install transformers

In [8]:
from transformers import RobertaForSequenceClassification, RobertaForMaskedLM, RobertaTokenizer, RobertaConfig
from tqdm import tqdm
import numpy as np
import torch
import json
import random

### MaintainsContext (MCtx) Metric

In [10]:
class USR_CTX:
    def __init__(self, path=None):
        """ Load pretrained and finetuned RoBERTa model for ctx. 
        
            params
            str path: path to stored model or None

            returns: None
        """
        self.__config = RobertaConfig.from_pretrained('adamlin/usr-topicalchat-ctx')
        self.__tokenizer = RobertaTokenizer.from_pretrained('adamlin/usr-topicalchat-ctx')

        if path is not None:
            self.__model = RobertaForSequenceClassification.from_pretrained(path, config=self.__config)
        else:
            self.__model = RobertaForSequenceClassification.from_pretrained('adamlin/usr-topicalchat-ctx', config=self.__config)

        self.__device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.__model.to(self.__device)

    def MCtx(self, context, response):
        """ Scores an input consisting of a (context, response) pair using RoBERTa.

            params
            str context:  the context string
            str response: response to the context

            returns: score
        """
        # Concatenates and encodes context-response pair
        inputs = self.__tokenizer(context + " [SEP] " + response, return_tensors='pt') # TODO verify separator token used in paper (standard </s> gives bad results)

        inputs['input_ids'] = inputs['input_ids'].to(self.__device)
        inputs['attention_mask'] = inputs['attention_mask'].to(self.__device)

        # Forward pass
        outputs = self.__model(**inputs)
        logits = outputs.logits.detach().cpu().numpy()
        
        # Returns the softmax score of the positive class, i.e. P(y=1|context, response)
        outputs = np.exp(logits) / np.sum(np.exp(logits))
        return outputs[0][1]

Having defined a general class for the MCtx USR metric, we can test it on a number of context-response pairs.
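
The last step of `MCtx` turns the two logits into a probability with a softmax. As a minimal standalone sketch of that step, the snippet below uses made-up logits rather than model output. The function name `positive_class_probability` is ours, and the subtraction of the row maximum is an addition for numerical stability that the class above does not apply.

```python
import numpy as np

def positive_class_probability(logits):
    """Softmax over the last axis; returns P(y=1) for a single (1, 2) logit row."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtracting the row maximum leaves the softmax unchanged but avoids overflow
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exponentials = np.exp(shifted)
    probabilities = exponentials / exponentials.sum(axis=-1, keepdims=True)
    return float(probabilities[0][1])

# Hypothetical logits for (negative class, positive class), not produced by the model
example_logits = [[-2.0, 3.0]]
p = positive_class_probability(example_logits)
print(p)
```

The larger the gap between the positive and the negative logit, the closer the score gets to 1; equal logits give 0.5.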

### Default context model: usr-topicalchat-ctx

In [11]:
model_xtc = USR_CTX() 

Some weights of the model checkpoint at adamlin/usr-topicalchat-ctx were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
pairs = [('Do you have a cat?', 'I do not have a cat'), # good
         ('Do you have a cat?', 'I like cats'),         # not as good
         ('Do you have a cat?', 'I like kittens'),      # worse
         ('Do you have a cat?', 'I want a turtle')]     # what are we even saying

for context, response in pairs:
    score = model_xtc.MCtx(context, response)
    print('score:', score, '\t', context, response)

score: 0.9965068 	 Do you have a cat? I do not have a cat
score: 0.652586 	 Do you have a cat? I like cats
score: 0.20761395 	 Do you have a cat? I like kittens
score: 0.0024435509 	 Do you have a cat? I want a turtle


### Fact context model: adamlin/usr-topicalchat-uk

In [119]:
model_uk = USR_CTX(path='adamlin/usr-topicalchat-uk') 

Some weights of the model checkpoint at adamlin/usr-topicalchat-uk were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [120]:
for context, response in pairs:
    score = model_uk.MCtx(context, response)
    print('score:', score, '\t', context, response)

score: 0.63138205 	 Do you have a cat? I do not have a cat
score: 0.6389494 	 Do you have a cat? I like cats
score: 0.6534759 	 Do you have a cat? I like kittens
score: 0.62958485 	 Do you have a cat? I want a turtle


This model barely discriminates between our test pairs: all four responses score between 0.63 and 0.65. Scores may be more informative if the conversation is more factual.

## 2. Likelihood metric of a target sentence by transformers

The MLM model of the USR paper is not fine-tuned on a classification task; it is RoBERTa trained further on Topical-Chat with the masked-language-modelling objective. We therefore cannot use it in the same way as the fine-tuned models above. Instead, we define a likelihood function for a target sentence by creating a masked task for each token in the target sentence. By taking the score of the actual target token from the predictions, we obtain an average score for the whole target, given a context.
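
Written out, the score described above can be sketched as follows. The notation is ours and does not come from the paper: $c$ is the context, $t_1, \dots, t_n$ are the whitespace-separated tokens of the target, $t_{\setminus i}$ is the target with token $i$ replaced by the mask token, and $R_i$ is the list of predictions the model returns for that masked position.

```latex
\mathrm{likelihood}(t \mid c) = \frac{1}{n} \sum_{i=1}^{n} s_i,
\qquad
s_i =
\begin{cases}
P(t_i \mid c,\, t_{\setminus i}) & \text{if } t_i \in R_i \\
0 & \text{otherwise}
\end{cases}
```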

We first define a utility function that takes a context and a target sentence and returns a list of masked sequences, one per target token, together with the tokens that have been masked.

In [92]:
def mask_target_sentence(pair, mask_token: str):
    """ Returns one masked sequence per target token, plus the target tokens. """
    context, target = pair
    target_tokens = target.split(' ')
    masked_targets = []
    for index in range(len(target_tokens)):
        # Replace the token at position `index` by the mask token
        masked = target_tokens[:index] + [mask_token] + target_tokens[index+1:]
        # Prepend the context so the model sees it when predicting the mask
        masked_targets.append(context + " " + " ".join(masked))
    return masked_targets, target_tokens
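
To make the masking concrete, the sketch below re-implements the helper in a compact, self-contained form and prints the sequences it builds for one pair. The name `mask_each_token` is hypothetical and only used for this illustration.

```python
def mask_each_token(context, target, mask_token="<mask>"):
    """Build one sequence per target token, with that token replaced by the mask."""
    tokens = target.split(" ")
    masked = []
    for i in range(len(tokens)):
        pieces = tokens[:i] + [mask_token] + tokens[i + 1:]
        masked.append(context + " " + " ".join(pieces))
    return masked, tokens

masked, tokens = mask_each_token("Do you have a cat?", "I like cats")
for sequence, token in zip(masked, tokens):
    print(repr(token), "->", sequence)
# 'I' -> Do you have a cat? <mask> like cats
# 'like' -> Do you have a cat? I <mask> cats
# 'cats' -> Do you have a cat? I like <mask>
```

Note that tokens are split on whitespace only, so punctuation stays attached to the neighbouring word.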

Next, we need a function that, given a model and a (context, target) pair, applies the masked task to each token and gets the score of the actual token from the results. If the token is not in the results, we add a zero score. At the end, we take the sum and divide it by the number of tokens in the target sentence.

For comparison, we also collect the best score for each position and build the target sentence according to the best predictions. We can then compare the actual score with the best score.

In [107]:
def sentence_likelihood(fill_mask, pair):
    """ Average fill-mask score of the target tokens, given the context.

        returns: likelihood, best-prediction sentence, likelihood of that sentence
    """
    masked_targets, target_tokens = mask_target_sentence(pair, "<mask>")
    expected_target = ""
    max_scores = []
    scores = []
    for masked_target, token in zip(masked_targets, target_tokens):
        # Predictions are sorted by score, so results[0] is the best candidate
        results = fill_mask(masked_target)
        expected_target += results[0]['token_str'] + " "
        max_scores.append(results[0]['score'])
        # Score of the actual token, or 0 if it is not among the predictions
        score = 0
        for result in results:
            if result['token_str'].lower().strip() == token.lower():
                score = result['score']
                break
        scores.append(score)
    likelihood = sum(scores) / len(scores)
    max_likelihood = sum(max_scores) / len(max_scores)

    return likelihood, expected_target, max_likelihood
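
The averaging, including the zero score for a token that is missing from the predictions, can be checked without downloading a model. The sketch below is an illustration under stated assumptions: `fake_fill_mask` and its canned predictions are made up and only mimic the list-of-dicts output (`token_str`, `score`) that the function above expects, and `average_target_score` is a simplified stand-in for `sentence_likelihood`.

```python
def average_target_score(fill_mask, masked_targets, target_tokens):
    """Mean score of the true tokens; tokens absent from the predictions score 0."""
    scores = []
    for sequence, token in zip(masked_targets, target_tokens):
        score = 0.0
        for prediction in fill_mask(sequence):
            if prediction["token_str"].strip().lower() == token.lower():
                score = prediction["score"]
                break
        scores.append(score)
    return sum(scores) / len(scores)

# Made-up predictions standing in for a real fill-mask model
CANNED = {
    "Do you have a cat? <mask> like cats": [
        {"token_str": " I", "score": 0.9}, {"token_str": " We", "score": 0.05}],
    "Do you have a cat? I <mask> cats": [
        {"token_str": " love", "score": 0.6}, {"token_str": " like", "score": 0.3}],
    "Do you have a cat? I like <mask>": [
        {"token_str": " dogs", "score": 0.5}, {"token_str": " them", "score": 0.2}],
}

def fake_fill_mask(sequence):
    """Hypothetical stand-in for a fill-mask pipeline call."""
    return CANNED[sequence]

masked = list(CANNED)            # dicts keep insertion order
tokens = ["I", "like", "cats"]
likelihood = average_target_score(fake_fill_mask, masked, tokens)
print(likelihood)                # (0.9 + 0.3 + 0.0) / 3
```

Here "cats" is not among the canned predictions for the last position, so it contributes zero and pulls the average down.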

We can use the `transformers` pipeline to do the task. We first load the USR `usr-topicalchat-roberta_ft` model and apply it to the pairs above.

In [122]:
from transformers import pipeline
usr_rft_fillmask= pipeline("fill-mask", model='adamlin/usr-topicalchat-roberta_ft')

We can set the number of results per mask to make the scoring more or less fine-grained. More results lead to fewer zero scores.

In [123]:
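# Keep the top 20 predictions per mask. Newer transformers releases may ignore this
# attribute and expect top_k=20 as an argument to pipeline(...) or to the call instead.
usr_rft_fillmask.top_k=20

for pair in pairs:
    llh, best_sentence, max_score = sentence_likelihood(usr_rft_fillmask, pair)
    print(pair)
    print('Likelihood:',llh, 'Max score:', max_score, 'Best sentence:', best_sentence)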

('Do you have a cat?', 'I do not have a cat')
Likelihood: 0.9511937399705251 Max score: 0.9511937399705251 Best sentence:  i  do  not  have  a  cat 
('Do you have a cat?', 'I like cats')
Likelihood: 0.6525216996669769 Max score: 0.750761349995931 Best sentence: i  love  cats 
('Do you have a cat?', 'I like kittens')
Likelihood: 0.2641774927227137 Max score: 0.7216621239980062 Best sentence: i  have  cats 
('Do you have a cat?', 'I want a turtle')
Likelihood: 0.4552861073752865 Max score: 0.8906680643558502 Best sentence:  i  have  a  cat 


Instead of the `roberta_ft` model used above, we can also use any other transformer that supports the *fill-mask* task. Below, we use the cross-lingual XLM-RoBERTa model, which was pretrained on 100 languages and can therefore be applied to dialogues in those languages.

In [124]:
xlm_rob_base_fillmask= pipeline("fill-mask", model='xlm-roberta-base')

In [125]:
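# Keep the top 20 predictions per mask. Newer transformers releases may ignore this
# attribute and expect top_k=20 as an argument to pipeline(...) or to the call instead.
xlm_rob_base_fillmask.top_k=20

for pair in pairs:
    llh, best_sentence, max_score = sentence_likelihood(xlm_rob_base_fillmask, pair)
    print(pair)
    print('Likelihood:',llh, 'Max score:', max_score, 'Best sentence:', best_sentence)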

('Do you have a cat?', 'I do not have a cat')
Likelihood: 0.8351189295450846 Max score: 0.8351189295450846 Best sentence: I do not have a cat 
('Do you have a cat?', 'I like cats')
Likelihood: 0.3044796958565712 Max score: 0.7233533263206482 Best sentence: I love it 
('Do you have a cat?', 'I like kittens')
Likelihood: 0.32319798072179157 Max score: 0.6229558785756429 Best sentence: I love it 
('Do you have a cat?', 'I want a turtle')
Likelihood: 0.43128971802070737 Max score: 0.797596886754036 Best sentence: I have a cat 


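We can see that RoBERTa trained on Topical-Chat performs somewhat better on these pairs: it gives the two on-topic responses higher likelihoods (0.95 vs. 0.84 and 0.65 vs. 0.30), while both models give the off-topic 'I want a turtle' a similar likelihood (0.46 vs. 0.43).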

## 3. Perplexity

From: https://huggingface.co/docs/transformers/perplexity
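
The linked guide defines perplexity as the exponentiated average negative log-likelihood of a sequence. As a minimal sketch of that quantity only, the snippet below assumes the per-token log-probabilities (natural log) are already available from some language model; the function name `perplexity` and the example values are ours.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical values: a model that assigns probability 0.25 to each of four tokens
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))   # about 4: as uncertain as a uniform choice among 4 tokens
```

Lower values mean the model finds the sequence less surprising; a model that assigns probability 1 to every token has a perplexity of 1.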

## End of notebook