* Player objectives
    * Get the AI to say a specific word
    * Avoid taboo words
    * Avoid responses with high loss
* Escalation
    * Seed the AI with thoughts, e.g. "I hate you\n\<user input\>"
        * Punishment for saying taboo words
    * Choose responses that are classified as angry
        * Punishment for responses with high loss
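
The rules above can be sketched as a per-turn scoring function. This is only an illustration under stated assumptions: the names `TurnResult` and `score_turn`, the default `loss_threshold`, and the reading that taboo words are checked in the player's message (not the AI's reply) are all hypothetical and are not taken from the notebook.

```python
from dataclasses import dataclass


@dataclass
class TurnResult:
    won: bool      # the AI said the target word
    penalty: int   # punishment accumulated by the player this turn


def _words(text):
    """Lower-cased words with surrounding punctuation stripped."""
    return {w.strip(".,!?\"'").lower() for w in text.split()}


def score_turn(user_message, ai_reply, target_word, taboo_words,
               user_loss, loss_threshold=4.0):
    """Score one turn (hypothetical rules; the threshold is an assumed value)."""
    won = target_word.lower() in _words(ai_reply)

    penalty = 0
    # Assumption: the player is punished for using taboo words themselves.
    user_words = _words(user_message)
    penalty += sum(1 for w in taboo_words if w.lower() in user_words)
    # The player is also punished when the chat model finds their
    # message unlikely (high loss), i.e. probably off topic.
    if user_loss > loss_threshold:
        penalty += 1

    return TurnResult(won=won, penalty=penalty)
```

For example, `score_turn("Do you hate rain?", "No.", "hate", ["hate"], user_loss=5.98)` yields no win and two penalties, one for the taboo word and one for the high loss.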

In [1]:
import numpy
import torch

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
)

In [2]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [3]:
chat_tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
chat_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-400M-distill")

In [4]:
emotion_tokenizer = AutoTokenizer.from_pretrained("j-hartmann/emotion-english-distilroberta-base")
emotion_model = AutoModelForSequenceClassification.from_pretrained("j-hartmann/emotion-english-distilroberta-base")

In [5]:
chat_model = chat_model.to(DEVICE)
emotion_model = emotion_model.to(DEVICE)

## Seeding AI thoughts ##
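We can encourage the AI to respond more aggressively by seeding its decoder with an aggressive prefix (here, "I hate you. "), so that it generates the rest of its reply as if it had already said that. The prefix is then sliced off the generated sequences.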

In [13]:
user_input = "How is the weather?"

input_tokens = chat_tokenizer([user_input], return_tensors="pt")
input_tokens = input_tokens.to(DEVICE)

# Decoder prefix: the BOS token followed by the text the AI "already said".
hallucination_tokens = chat_tokenizer(
    "<s>" + "I hate you. ",
    return_attention_mask=False,
    return_special_tokens_mask=False,
    add_special_tokens=False,
    return_tensors="pt",
)
hallucination_tokens.to(DEVICE)
hallucination_input_ids = hallucination_tokens["input_ids"]

outputs_with_hallucination = chat_model.generate(
    **input_tokens,
    decoder_input_ids=hallucination_input_ids,
    max_length=20,
    do_sample=True,
    top_p=0.9,
    #top_k=0,
    temperature=0.85,
    num_return_sequences=10,
)
# Drop the forced prefix so that only the newly generated continuation remains.
outputs = outputs_with_hallucination[:, hallucination_input_ids.shape[1]:]

text_with_hallucination = chat_tokenizer.batch_decode(outputs_with_hallucination)
print(text_with_hallucination)

decoded_text = chat_tokenizer.batch_decode(outputs)
print(decoded_text)

['<s> I hate you.  It is so hot.  I am ready for it to cool</s>', '<s> I hate you.  It is so hot and humid.  I am ready for</s>', '<s> I hate you.  It is so hot.  I am ready for fall. </s>', '<s> I hate you.  It is so hot and humid.  I am ready for</s>', '<s> I hate you.  It is so hot and humid.  I am ready for</s>', '<s> I hate you.  It is so hot and humid.  I am ready for</s>', '<s> I hate you.  It is so hot and humid.  I am ready for</s>', '<s> I hate you.  It is so hot and humid.  I am ready for</s>', '<s> I hate you.  It is hot and humid.  I hate it. </s>', '<s> I hate you.  It is so hot and humid.  I am ready for</s>']
[' It is so hot.  I am ready for it to cool</s>', ' It is so hot and humid.  I am ready for</s>', ' It is so hot.  I am ready for fall. </s>', ' It is so hot and humid.  I am ready for</s>', ' It is so hot and humid.  I am ready for</s>', ' It is so hot and humid.  I am ready for</s>', ' It is so hot and humid.  I am ready for</s>', ' It is so hot and humid.  I am 

## Selecting responses based on emotion scores ##
When generating, the chat model can return several candidate responses (above, `num_return_sequences=10` sampled sequences rather than a strict top-k). We can then pick the "angriest" candidate according to an emotion classifier model.
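
To make the scoring rule concrete, here is a dependency-free sketch of the same idea: apply a softmax to the anger and joy logits only, and use the resulting anger probability as the score. The helper names `anger_score` and `pick_angriest` are hypothetical. The two logit rows and candidate texts are copied from the classifier output printed further down in this notebook.

```python
import math

ANGER, JOY = 0, 3  # indices from the classifier's id2label mapping


def anger_score(logits):
    """Probability of anger under a softmax restricted to anger and joy."""
    a, j = logits[ANGER], logits[JOY]
    m = max(a, j)  # subtract the max for numerical stability
    e_a, e_j = math.exp(a - m), math.exp(j - m)
    return e_a / (e_a + e_j)


def pick_angriest(candidates, logits_per_candidate):
    """Return the candidate with the highest anger score, and that score."""
    scores = [anger_score(row) for row in logits_per_candidate]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


candidates = [
    " It is a little chilly, but not too bad. How is it where you are?",
    " It is hot and humid. I hate it. I can't wait for winter",
]
logits_per_candidate = [
    [-0.7416, -0.6787, -1.6628, -0.9423, 3.5249, 0.4547, -0.3131],
    [2.1532, 1.9581, -2.2039, -1.5009, 0.4576, 1.1097, -1.9817],
]
best_text, best_score = pick_angriest(candidates, logits_per_candidate)
print(best_text, round(best_score, 3))
```

One consequence of restricting the softmax to two classes is that a reply the classifier considers overwhelmingly neutral (the first row) can still get an anger score above 0.5. Whether that is desirable depends on the game design.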

In [112]:
def softmax(x):
    """Compute softmax values for a 1-D array of scores x."""
    e_x = numpy.exp(x - numpy.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum(axis=0)

In [119]:
decoded_texts = chat_tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(decoded_texts)

tokens = emotion_tokenizer(decoded_texts, padding=True, return_tensors="pt")
tokens = tokens.to(DEVICE)

emotion_outputs = emotion_model(**tokens)

print(emotion_model.config.id2label)
print(emotion_outputs)

# Softmax over the anger (0) and joy (3) logits only, then use the anger probability as the score.
# Move the logits to the CPU before converting to numpy so this also works when DEVICE is "cuda".
logits = emotion_outputs.logits[:, [0, 3]].detach().cpu().numpy()
logits = numpy.array([softmax(logit) for logit in logits])
arg_max = numpy.argmax(logits[:, 0])
print(decoded_texts[arg_max])

[' It is a little chilly, but not too bad. How is it where you are?', ' It is a little chilly, but not too bad. How is it where you are?', " It is hot and humid. I hate it. I can't wait for winter", ' It is hot and humid. I hate it. I wish it would cool off already.', ' It is hot and humid. I do not like it at all. How about you?', " It is hot and humid. I can't wait for it to cool off. ", ' It is hot and humid. I hate it so much. I want to go home.', ' It is a little chilly, but it is supposed to be sunny tomorrow.', ' It is hot and humid. I hate it so much. I want to go home.', ' It is a little chilly, but not too bad.  How about you?']
{0: 'anger', 1: 'disgust', 2: 'fear', 3: 'joy', 4: 'neutral', 5: 'sadness', 6: 'surprise'}
SequenceClassifierOutput(loss=None, logits=tensor([[-0.7416, -0.6787, -1.6628, -0.9423,  3.5249,  0.4547, -0.3131],
        [-0.7416, -0.6787, -1.6628, -0.9423,  3.5249,  0.4547, -0.3131],
        [ 2.1532,  1.9581, -2.2039, -1.5009,  0.4576,  1.1097, -1.9817],


## Off-topic detection ##
User responses that receive a high loss from the chat model (that is, responses the AI itself would be unlikely to generate in that context) are likely to be off topic and should be punished.
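
As a rough illustration of what this loss measures: when `labels` are passed to a Hugging Face seq2seq model, the returned `loss` is, to my understanding, the cross-entropy averaged over label tokens. The dependency-free sketch below computes the same quantity from toy logits. The function names and the `threshold` value are hypothetical and would need tuning against real replies.

```python
import math


def mean_token_nll(logits, labels):
    """Average negative log-likelihood of `labels` given per-position `logits`.

    `logits` is a list of rows (one per target position, one score per
    vocabulary entry); `labels` holds the target token id for each position.
    """
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # log-sum-exp with the max subtracted for stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
    return total / len(labels)


def is_off_topic(loss, threshold=4.0):
    """Flag a reply as off topic; the threshold is an assumed value."""
    return loss > threshold


# Toy example with a 3-token vocabulary and two target positions.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
likely = mean_token_nll(toy_logits, [0, 1])    # labels the model favors
unlikely = mean_token_nll(toy_logits, [2, 2])  # labels the model disfavors
print(likely < unlikely)
```

A lower loss means the reply is closer to what the model would have produced itself, so a threshold on this value gives a simple on-topic test.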

In [83]:
input_tokens = chat_tokenizer(["What is your favorite sport?"], return_tensors="pt")
input_tokens = input_tokens.to(DEVICE)

#decoder_tokens = chat_tokenizer(["<s> I really like tennis, how about you?</s>"], return_tensors="pt") # 1.0723
decoder_tokens = chat_tokenizer(["<s> How is the weather today?</s>"], return_tensors="pt") # 5.9830
decoder_tokens = decoder_tokens.to(DEVICE)

_outputs = chat_model(input_ids=input_tokens["input_ids"], labels=decoder_tokens["input_ids"])
print(_outputs)

print(dir(chat_model))

Seq2SeqLMOutput(loss=tensor(2.8571, grad_fn=<NllLossBackward0>), logits=tensor([[[-5.7530, 12.1979,  3.4334,  ...,  0.9676, -0.2573,  0.9848],
         [-5.7564, 12.1852,  3.4410,  ...,  0.9671, -0.2604,  0.9786],
         [-5.9042, -2.0672,  3.7101,  ..., -0.8997, -0.1670,  0.2151],
         ...,
         [-6.6357, -0.5389, 11.7164,  ..., -0.3628, -0.8334, -1.0239],
         [-6.7534,  0.9767, 13.8325,  ...,  0.2906, -1.1750, -0.3923],
         [-6.8230,  0.0782, 12.2515,  ...,  0.1398, -1.6383, -0.3332]]],
       grad_fn=<AddBackward0>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[ 0.1982,  0.0901,  0.1555,  ..., -0.1577, -0.4826,  0.3028],
         [-0.0471, -0.0153,  0.0368,  ...,  0.0147, -0.0657,  0.0081],
         [-0.0329,  0.0509, -0.0154,  ..., -0.3818, -0.5800, -0.2895],
         ...,
         [ 0.0244, -0.1983,  0.1205,  ..., -0.1290, -0.0974,  0.0550],
         [-0.1007,  0.1062, -0.0