In [None]:
%matplotlib inline

import collections
import random
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Conditioned neural language models

Last time we saw how to generate random but acceptable text (hopefully) using RNNs and token sampling.
In this topic we'll see how to control what the text should say by **conditioning** our language models on extra information.
Some examples:

* Generating a description given a photo.
* Generating a translation given a source text (in another language).
* Generating a text transcription given a speech recording.
* Generating an answer given a question and a comprehension text.

In all of these tasks, we need a language model that is formally defined as follows:

$P(t_1, t_2, \dots, t_n | c)$

where $c$ is the data you're conditioning your language model on.
This decomposes into the following computable factors:

$P(t_1, t_2, \dots, t_n | c) = P(t_1 | c) P(t_2 | t_1, c) P(t_3 | t_1, t_2, c) \dots P(t_n | t_1, t_2, \dots, t_{n-1}, c)$

So essentially, a conditioned language model is a normal language model whose probability for a given text is also influenced by a second input, such as a question or an image.
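
As a small worked example of the decomposition above, the probability of a whole text is just the product of its per-token conditional probabilities. The numbers below are made up for illustration and do not come from any trained model; `step_probs` is a hypothetical name.

```python
import math

# Made-up values of P(t_i | t_1, ..., t_{i-1}, c) for the five tokens of
# 'I like it . <EDGE>' under some fixed conditioning input c.
step_probs = [0.9, 0.6, 0.8, 0.95, 0.99]

# The probability of the whole text is the product of the factors.
text_prob = math.prod(step_probs)

# Equivalent computation with log probabilities, which avoids numerical
# underflow when texts get long.
text_log_prob = sum(math.log(p) for p in step_probs)

print(text_prob)
print(math.exp(text_log_prob))
```

Changing $c$ changes each factor, which is how the conditioning input influences the probability of the whole text.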

In this topic we'll focus on the case where the conditioning input is a single vector.
The case where the conditioning input is a sequence, as in translation, is the focus of the next topic.

## Where to put the conditioning vector

The **unconditioned neural language models** we saw last time can be illustrated as the following architecture:

![](unconditioned_langmod.png)

So if we are to insert a new input that is to influence all the outputs, where can we put it in the architecture?
There are many ways to do this, but [many scientific papers](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/where-to-put-the-image-in-an-image-caption-generator/A5B0ACFFE8E4AEAA5840DC61F93153F3) (at least the ones that generate descriptions from images) can have their architectures categorised into one or a combination of the following: **init-inject**, **pre-inject**, **par-inject**, and **merge**.

In order to illustrate conditioned generation, we will be generating toy sentiment sentences given a sentiment.
So our language model should generate positive sentences when given the positive sentiment and negative sentences when given the negative sentiment.
The sentiment will be represented with a one-hot vector.
In this case it is entirely possible to train a separate language model for each sentiment, but then you'd be training each model on half of the data set and the common information would not be shared between the two.
Also, conditioned language models are most useful when the conditioning information can take a practically unlimited number of values, such as an image.

In [None]:
train_text_tokens = [
    'I like it .'.split(' '),
    'I hate it .'.split(' '),
    'I don\'t hate it .'.split(' '),
    'I don\'t like it .'.split(' '),
]
sentiments = torch.tensor([
    [1, 0],
    [0, 1],
    [1, 0],
    [0, 1],
], dtype=torch.float32, device=device)

print('sentiments:')
print(sentiments)
print()

num_conds = sentiments.shape[1]

max_len = max(len(text) + 1 for text in train_text_tokens)
print('max_len:', max_len)

vocab = ['<PAD>', '<EDGE>'] + sorted({token for text in train_text_tokens for token in text})
token2index = {t: i for (i, t) in enumerate(vocab)}
pad_index = token2index['<PAD>']
edge_index = token2index['<EDGE>']
print('vocab:', vocab)
print()

train_text_x_indexed_np = np.full((len(train_text_tokens), max_len), pad_index, np.int64)
for i in range(len(train_text_tokens)):
    train_text_x_indexed_np[i, 0] = edge_index
    for j in range(len(train_text_tokens[i])):
        train_text_x_indexed_np[i, j + 1] = token2index[train_text_tokens[i][j]]
train_text_x_indexed = torch.tensor(train_text_x_indexed_np, device=device)

train_text_y_indexed_np = np.full((len(train_text_tokens), max_len), pad_index, np.int64)
for i in range(len(train_text_tokens)):
    for j in range(len(train_text_tokens[i])):
        train_text_y_indexed_np[i, j] = token2index[train_text_tokens[i][j]]
    train_text_y_indexed_np[i, len(train_text_tokens[i])] = edge_index # Add the edge token at the end.
train_text_y_indexed = torch.tensor(train_text_y_indexed_np, device=device)

In [None]:
def sample_generate(model, vocab, edge_index, pad_index, cond, max_len):
    cond_tensor = torch.tensor([cond], dtype=torch.float32, device=device)
    prefix_indexed = torch.tensor([[edge_index] + [pad_index]*max_len], dtype=torch.int64, device=device)
    prefix_prob = 1.0
    prefix_len = 1
    with torch.no_grad():
        for i in range(max_len):
            outputs = torch.softmax(model(cond_tensor, prefix_indexed[:, :i+1]), dim=2)
            token_probs = outputs[0, -1, :].cpu().tolist()
            next_token_index = random.choices(range(len(vocab)), token_probs)[0]
            prefix_indexed[0, i+1] = next_token_index
            prefix_prob *= token_probs[next_token_index]
            prefix_len += 1
            if next_token_index == edge_index:
                break
    text = [vocab[index] for index in prefix_indexed[0, :prefix_len].cpu().tolist()]
    return (text, prefix_prob)

### init-inject

Init-inject is an architecture where the conditioning vector is injected into the RNN as an initial state, that is, the conditioning vector is treated as if it were a state vector.
This means that the conditioned state does not need a learned initial state parameter.
It also means that the conditioning vector needs to have the same size as the RNN state, which we do here by passing the vector through a linear layer.

![](init_inject.png)

Note that in an LSTM you can condition the initial hidden state, the initial cell state, or both (the unconditioned initial state would be initialised as usual).
We will be using the hidden state here.

In [None]:
class Model(torch.nn.Module):

    def __init__(self, num_conds, vocab_size, embedding_size, state_size):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size)
        self.cond_layer = torch.nn.Linear(num_conds, state_size) # Conditioning vector will be made as big as a state vector.
        self.rnn_c0 = torch.nn.Parameter(torch.zeros((state_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.LSTMCell(embedding_size, state_size)
        self.output_layer = torch.nn.Linear(state_size, vocab_size)

    def forward(self, cond, text_x_indexed):
        batch_size = text_x_indexed.shape[0]
        time_steps = text_x_indexed.shape[1]
        
        cond_state = self.cond_layer(cond)
        embedded = self.embedding(text_x_indexed)
        
        state = cond_state # Conditioning vector is the initial hidden state.
        c = self.rnn_c0[None, :].tile((batch_size, 1))
        interm_states = []
        for t in range(time_steps):
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states = torch.stack(interm_states, dim=1)
        return self.output_layer(interm_states)
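
For comparison, a variant that conditions both the initial hidden state and the initial cell state could look like the sketch below. The class name `BothStatesInitInject` and the layer names `cond_to_h0` and `cond_to_c0` are hypothetical, and using a separate linear layer for each state is just one possible design choice.

```python
import torch

class BothStatesInitInject(torch.nn.Module):  # Hypothetical name.

    def __init__(self, num_conds, vocab_size, embedding_size, state_size):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size)
        # One linear layer per initial state, so each gets its own projection.
        self.cond_to_h0 = torch.nn.Linear(num_conds, state_size)
        self.cond_to_c0 = torch.nn.Linear(num_conds, state_size)
        self.rnn_cell = torch.nn.LSTMCell(embedding_size, state_size)
        self.output_layer = torch.nn.Linear(state_size, vocab_size)

    def forward(self, cond, text_x_indexed):
        embedded = self.embedding(text_x_indexed)
        state = self.cond_to_h0(cond)  # Conditioned initial hidden state.
        c = self.cond_to_c0(cond)  # Conditioned initial cell state.
        interm_states = []
        for t in range(text_x_indexed.shape[1]):
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        return self.output_layer(torch.stack(interm_states, dim=1))

# Quick shape check on made-up inputs: 2 texts of 3 tokens, vocabulary of 7.
model_both = BothStatesInitInject(num_conds=2, vocab_size=7, embedding_size=2, state_size=3)
cond = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
text = torch.tensor([[1, 2, 3], [1, 4, 0]])
logits = model_both(cond, text)
print(logits.shape)
```

The rest of this section uses the hidden-state-only `Model` defined above.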

In [None]:
model = Model(num_conds, len(vocab), embedding_size=2, state_size=3)
model.to(device)

optimiser = torch.optim.Adam(model.parameters(), lr=0.1)

print('epoch', 'error')
train_errors = []
for epoch in range(1, 1000+1):
    batch_size = train_text_x_indexed.shape[0]
    time_steps = train_text_x_indexed.shape[1]
    pad_mask = train_text_x_indexed == pad_index
    
    optimiser.zero_grad()
    logits = model(sentiments, train_text_x_indexed)
    train_token_errors = torch.nn.functional.cross_entropy(logits.transpose(1, 2), train_text_y_indexed, reduction='none')
    train_token_errors = torch.masked_fill(train_token_errors, pad_mask, 0.0)
    train_error = train_token_errors.sum()/(~pad_mask).sum()
    train_errors.append(train_error.detach().cpu().tolist())
    train_error.backward()
    optimiser.step()

    if epoch%100 == 0:
        print(epoch, train_errors[-1])
print()

for (name, sentiment) in [('positive', [1, 0]), ('negative', [0, 1])]:
    print(name)
    (text, text_prob) = sample_generate(model, vocab, edge_index, pad_index, sentiment, max_len=10)
    print(text, f'(p={text_prob:.3f})')
    print()

(fig, ax) = plt.subplots(1, 1)
ax.set_xlabel('epoch')
ax.set_ylabel('$E$')
ax.plot(range(1, len(train_errors) + 1), train_errors, color='blue', linestyle='-', linewidth=3)
ax.grid()

It is possible that the above model does not actually make use of the conditioning vector.
This can be checked by comparing the probabilities of 'like' and 'hate' following the prefix 'EDGE I' under each conditioning vector.
If the two probabilities are almost equal regardless of which conditioning vector is used, then the model is ignoring the conditioning vector and must be retrained.

In [None]:
with torch.no_grad():
    prefix_indexed = torch.tensor([[edge_index, token2index['I']]], dtype=torch.int64, device=device)
    for (name, sentiment) in [('positive', [1, 0]), ('negative', [0, 1])]:
        print(name)
        sentiment_tensor = torch.tensor([sentiment], dtype=torch.float32, device=device)
        outputs = torch.softmax(model(sentiment_tensor, prefix_indexed), dim=2)
        token_probs = outputs[0, -1, :].cpu().tolist()
        print('I...')
        print('like:', token_probs[token2index['like']])
        print('hate:', token_probs[token2index['hate']])
        print()

### pre-inject

Pre-inject is when the conditioning vector is used as a first token vector, that is, the conditioning vector is treated as if it were a token embedding.
This means that the first intermediate state of every text does not come from a real token embedding and should be ignored, just like the states of pad tokens.
This is easy to do because we only need to drop the first time step from every sequence of intermediate states.
It also means that the conditioning vector needs to have the same size as a token embedding, which we do here by passing the vector through a linear layer.

![](pre_inject.png)

In [None]:
class Model(torch.nn.Module):

    def __init__(self, num_conds, vocab_size, embedding_size, state_size):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size)
        self.cond_layer = torch.nn.Linear(num_conds, embedding_size) # Conditioning vector will be made as big as an embedding vector.
        self.rnn_c0 = torch.nn.Parameter(torch.zeros((state_size,), dtype=torch.float32))
        self.rnn_s0 = torch.nn.Parameter(torch.zeros((state_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.LSTMCell(embedding_size, state_size)
        self.output_layer = torch.nn.Linear(state_size, vocab_size)

    def forward(self, cond, text_x_indexed):
        batch_size = text_x_indexed.shape[0]
        time_steps = text_x_indexed.shape[1]

        cond_embed = self.cond_layer(cond)
        embedded = self.embedding(text_x_indexed)
        embedded = torch.concat((cond_embed[:, None, :], embedded), dim=1) # Attach the conditioning vector as a first token.
        
        state = self.rnn_s0[None, :].tile((batch_size, 1))
        c = self.rnn_c0[None, :].tile((batch_size, 1))
        interm_states = []
        for t in range(time_steps + 1): # Don't forget the conditioning vector!
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states = torch.stack(interm_states, dim=1)
        interm_states = interm_states[:, 1:, :] # Drop the first state from each text.
        return self.output_layer(interm_states)

### par-inject

Par-inject (parallel-inject) is when a copy of the conditioning vector is concatenated to every token vector before being read by the RNN.
This means that the RNN is being reminded of the conditioning vector with every time step.
It also means that the conditioning vector size can be anything, as the RNN's input size is simply the sum of the conditioning vector size and the embedding size.

![](par_inject.png)

In [None]:
class Model(torch.nn.Module):

    def __init__(self, num_conds, vocab_size, embedding_size, state_size):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size)
        self.rnn_s0 = torch.nn.Parameter(torch.zeros((state_size,), dtype=torch.float32))
        self.rnn_c0 = torch.nn.Parameter(torch.zeros((state_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.LSTMCell(num_conds + embedding_size, state_size) # RNN must include the conditioning vector in its input.
        self.output_layer = torch.nn.Linear(state_size, vocab_size)

    def forward(self, cond, text_x_indexed):
        batch_size = text_x_indexed.shape[0]
        time_steps = text_x_indexed.shape[1]
        
        cond_3d = cond[:, None, :].tile((1, time_steps, 1)) # Replicate the same conditioning vector for every token.
        embedded = self.embedding(text_x_indexed)
        embedded = torch.concat((cond_3d, embedded), dim=2) # Attach the replicated conditioning vector to the embedded words.
        
        state = self.rnn_s0[None, :].tile((batch_size, 1))
        c = self.rnn_c0[None, :].tile((batch_size, 1))
        interm_states = []
        for t in range(time_steps):
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states = torch.stack(interm_states, dim=1)
        return self.output_layer(interm_states)

### merge

Merge is when a copy of the conditioning vector is concatenated to every intermediate state vector before being passed to the output layer.
This means that the RNN never sees the conditioning vector and only has to deal with the tokens, allowing it to dedicate more of its memory to information about the prefix.
It also means that the conditioning vector size can be anything, as the output layer's input size is simply the sum of the conditioning vector size and the state size.

![](merge.png)

In [None]:
class Model(torch.nn.Module):

    def __init__(self, num_conds, vocab_size, embedding_size, state_size):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size)
        self.rnn_s0 = torch.nn.Parameter(torch.zeros((state_size,), dtype=torch.float32))
        self.rnn_c0 = torch.nn.Parameter(torch.zeros((state_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.LSTMCell(embedding_size, state_size)
        self.output_layer = torch.nn.Linear(num_conds + state_size, vocab_size) # Output layer must include the conditioning vector in its input.

    def forward(self, cond, text_x_indexed):
        batch_size = text_x_indexed.shape[0]
        time_steps = text_x_indexed.shape[1]
        
        embedded = self.embedding(text_x_indexed)
        
        state = self.rnn_s0[None, :].tile((batch_size, 1))
        c = self.rnn_c0[None, :].tile((batch_size, 1))
        interm_states = []
        for t in range(time_steps):
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states = torch.stack(interm_states, dim=1)

        cond_3d = cond[:, None, :].tile((1, time_steps, 1)) # Replicate the same conditioning vector for every token.
        interm_states = torch.concat((cond_3d, interm_states), dim=2) # Attach the replicated conditioning vector to the intermediate states.
        
        return self.output_layer(interm_states)

## Generating the most probable text

Up to now we have been focusing on sampling texts, where a text is generated randomly with the probability that the language model assigns to it.
But when generating texts for a particular conditioning vector, such as for a translation, we generally don't want probabilistic outputs but the most probable output according to our model.
Therefore we will need to use a **search algorithm** to find the most probable sequence rather than generating randomly.

### Greedy search

The simplest way to find a high probability sentence is **greedy search**, where the most probable next token for the current prefix is selected repeatedly until the edge token is selected.
So it's exactly like the token sampling algorithm we've been using, but instead of choosing a token randomly, we use `argmax` to choose the index of the token with the greatest probability.

![](greedy_search.png)

In [None]:
def greedy_generate(model, vocab, edge_index, pad_index, cond, max_len):
    cond_tensor = torch.tensor([cond], dtype=torch.float32, device=device)
    prefix_indexed = torch.tensor([[edge_index] + [pad_index]*max_len], dtype=torch.int64, device=device)
    prefix_prob = 1.0
    prefix_len = 1
    with torch.no_grad():
        for i in range(max_len):
            outputs = torch.softmax(model(cond_tensor, prefix_indexed[:, :i+1]), dim=2)
            token_probs = outputs[0, -1, :]
            next_token_index = token_probs.argmax() # Pick the most probable token's index.
            prefix_indexed[0, i+1] = next_token_index
            prefix_prob *= token_probs[next_token_index].cpu().tolist()
            prefix_len += 1
            if next_token_index == edge_index:
                break
    text = [vocab[index] for index in prefix_indexed[0, :prefix_len].cpu().tolist()]
    return (text, prefix_prob)

for (name, sentiment) in [('positive', [1, 0]), ('negative', [0, 1])]:
    print(name)
    (text, text_prob) = greedy_generate(model, vocab, edge_index, pad_index, sentiment, max_len=10)
    print(text, f'(p={text_prob:.3f})')
    print()

Alternatively, instead of choosing the most probable token we can choose the most probable prefix each time.
In greedy search, this is exactly the same as choosing the most probable token, but it will help with understanding the next part.

![](greedy_search_prefixes.png)

As is often the case with greedy strategies, greedy search is not guaranteed to produce the most probable text.
This is because the most probable text is the one with the largest product of token probabilities, and that text might not begin with the most probable token.
It is entirely possible that a high probability first token will only lead to low probability suffixes.
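
A tiny made-up example shows how this can happen. The distributions in `toy_probs` are invented for illustration (they are not from the trained model) and every text is exactly two tokens long to keep things simple.

```python
# Hypothetical next-token distributions, keyed by the prefix generated so far.
toy_probs = {
    (): {'a': 0.6, 'b': 0.4},
    ('a',): {'x': 0.55, 'y': 0.45},
    ('b',): {'x': 0.9, 'y': 0.1},
}

# Greedy search: always take the most probable next token.
prefix = ()
greedy_prob = 1.0
for _ in range(2):
    probs = toy_probs[prefix]
    token = max(probs, key=probs.get)
    greedy_prob *= probs[token]
    prefix += (token,)
greedy_text = prefix

# Exhaustive search: score every possible two-token text.
all_texts = {
    (first, second): toy_probs[()][first] * toy_probs[(first,)][second]
    for first in toy_probs[()]
    for second in toy_probs[(first,)]
}
best_text = max(all_texts, key=all_texts.get)

print('greedy:', greedy_text, greedy_prob)
print('best:  ', best_text, all_texts[best_text])
```

Greedy search commits to `'a'` because it is the most probable first token, but the text starting with the less probable `'b'` ends up with the larger product.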

### Full prefix tree search

If you want to find the actual most probable text, then you'll need to perform a **full prefix tree search**.
This means searching every possible prefix that can be generated (with the edge token only being allowed at the beginning and end) and keeping track of the probabilities of these prefixes.

![](full_prefix_tree_search.png)

We can keep on expanding prefixes for as long as we want, and we can't really eliminate low probability prefixes because they might expand into the most probable full text later on.
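This means that the number of prefixes to expand grows exponentially with the prefix length (with the vocabulary size as the base of the exponent), which is intractable for anything but tiny vocabularies and short texts.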

The important question to ask here is 'when do we stop searching?'.
Can we know that we have found the most probable full text and that no other partial text will ever expand into a more probable full text?
Increasing the length of a prefix can never increase its probability.
This is because the probability of an expanded prefix is the probability of the previous prefix multiplied by the probability of the new token.
Multiplying two probabilities together always results in a probability that is no larger than either of them, so the new prefix's probability cannot be larger than that of the previous prefix.
Therefore, we can stop searching for the maximum probability full text as soon as the most probable partial text has a smaller probability than the most probable full text.
Any future complete texts emerging from the current prefixes cannot be more probable than the prefixes they grow from.

A full prefix tree search works by keeping a list of all the partial texts in the current tree level as well as the best full text found till now.
The partial texts in the tree level are used to construct the partial and complete texts that are one token longer.
The new partial texts will form the new tree level and the best full text is updated if an even more probable complete text is found.
If the best full text till now has a larger probability than the best partial text in the current level, then that best full text is returned.
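
The procedure described above can be sketched in plain Python as follows. This is only a minimal sketch: `toy_next_token_probs` is a made-up stand-in for a trained model (its probabilities are invented), and `full_tree_search` is an illustrative name rather than a function used elsewhere in this topic.

```python
EDGE = '<EDGE>'

def toy_next_token_probs(prefix):
    # Hypothetical stand-in for a trained language model.
    table = {
        (EDGE,): {'a': 0.6, 'b': 0.4, EDGE: 0.0},
        (EDGE, 'a'): {'a': 0.3, 'b': 0.3, EDGE: 0.4},
        (EDGE, 'b'): {'a': 0.05, 'b': 0.05, EDGE: 0.9},
    }
    return table.get(tuple(prefix), {'a': 0.1, 'b': 0.1, EDGE: 0.8})

def full_tree_search(next_token_probs, edge, max_len):
    level = [(1.0, [edge])]  # (probability, prefix) pairs in the current tree level.
    best_full = (0.0, None)  # Best complete text found so far.
    for _ in range(max_len):
        new_level = []
        for (prob, prefix) in level:
            for (token, token_prob) in next_token_probs(prefix).items():
                new_prob = prob * token_prob
                if token == edge:
                    # A complete text: keep it only if it beats the best so far.
                    if new_prob > best_full[0]:
                        best_full = (new_prob, prefix + [token])
                else:
                    new_level.append((new_prob, prefix + [token]))
        # Stop once no partial text can possibly overtake the best full text.
        if not new_level or best_full[0] > max(prob for (prob, _) in new_level):
            break
        level = new_level
    # If max_len ran out first, this is only the best full text seen so far.
    return best_full

(result_prob, result_text) = full_tree_search(toy_next_token_probs, EDGE, max_len=10)
print(result_text, result_prob)
```

Note that `level` contains every partial text of the current length, which is why this approach only works for toy vocabularies like this one.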

### Beam search

The greedy search is the most memory/time efficient generation algorithm but is not likely to give the most probable text.
The full prefix tree search is the least memory/time efficient generation algorithm but is guaranteed to give the most probable text.
An algorithm that lies between these two extremes, both in terms of memory/time use and of confidence that the output is the most probable text, is the **beam search algorithm**.

The beam search algorithm is a clipped full prefix tree search, that is, it only expands the top $k$ prefixes at every tree level.
This $k$ is called the **beam size** or **beam width** and the $k$ prefixes are called the **beam**.
The beam size allows you to balance the amount of memory/time you want to spend with the likelihood of finding the most probable text.
The larger the beam size, the more of the tree will be explored and the more likely the most probable text will be found.

![](beam_search.png)

Since the beam size is fixed, we can put the entire beam into a batch to get the next token predictions faster.
This would not have been possible with a full prefix tree search.

Unlike in greedy search and sampled generation, the prefixes to expand (the beam) can change completely with each expansion.
This means that it is no longer faster to update individual token indexes in a single tensor, so we recreate the tensor with each new beam instead.

In [None]:
def beam_generate(model, vocab, edge_index, cond, max_len, beam_size):
    cond_tensors = torch.tensor([cond], dtype=torch.float32, device=device).tile((beam_size, 1)) # Create enough copies of the conditioning vectors for the whole beam to use in a batch.
    beam_prefixes_indexed = torch.tensor([[edge_index]], dtype=torch.int64, device=device) # It's faster to replace the tensor than to update every element, so no point in adding pad tokens.
    beam_prefixes_probs = np.array([1.0], np.float32)
    
    # Information about the best full text found.
    best_full_prefix_indexed = None
    best_full_prefix_prob = None
    
    with torch.no_grad():
        for _ in range(max_len):
            outputs = torch.softmax(model(cond_tensors[:len(beam_prefixes_indexed), :], beam_prefixes_indexed), dim=2)
            token_probs = outputs[:, -1, :]
            new_prefixes_probs = beam_prefixes_probs[:, None]*token_probs.cpu().numpy()

            # Construct the expanded prefixes using the entire vocabulary and check for a new best full text.
            new_partial_prefixes = [] # This will contain (probability, prefix) tuples.
            for (prefix, probs_group) in zip(beam_prefixes_indexed.cpu().tolist(), new_prefixes_probs.tolist()):
                for (next_token_index, prefix_prob) in enumerate(probs_group):
                    if next_token_index == edge_index:
                        if best_full_prefix_prob is None or prefix_prob > best_full_prefix_prob:
                            best_full_prefix_indexed = prefix + [next_token_index]
                            best_full_prefix_prob = prefix_prob
                    else:
                        new_partial_prefixes.append((prefix_prob, prefix + [next_token_index]))
            
            new_partial_prefixes.sort(reverse=True) # Sort all the partial prefixes by probability.
            (best_partial_prefix_prob, _) = new_partial_prefixes[0]
            if best_full_prefix_prob > best_partial_prefix_prob: # Found the best full prefix that will ever be generated.
                text = [vocab[index] for index in best_full_prefix_indexed]
                return (text, best_full_prefix_prob)
            
            new_beam = new_partial_prefixes[:beam_size] # Take the top beam_size partial prefixes for the next beam.
            beam_prefixes_indexed = torch.tensor([prefix for (prob, prefix) in new_beam], dtype=torch.int64, device=device)
            beam_prefixes_probs = np.array([prob for (prob, prefix) in new_beam], np.float32)

    # A best full text was not found at the given max_len, so return the best partial prefix instead.
    text = [vocab[index] for index in beam_prefixes_indexed[0, :].cpu().tolist()]
    return (text, beam_prefixes_probs[0])

for (name, sentiment) in [('positive', [1, 0]), ('negative', [0, 1])]:
    print(name)
    (text, text_prob) = beam_generate(model, vocab, edge_index, sentiment, max_len=10, beam_size=3)
    print(text, f'(p={text_prob:.3f})')
    print()

In the above code we keep a list of all the partial texts in the new beam and then take the top $k$ most probable items in the list.
This is easier to understand but it takes more time and memory than it should.
A more time and memory efficient approach would be to only add a partial text to the new beam if the new beam has fewer than $k$ items or if the partial text has a bigger probability than the least probable item in the new beam, in which case that least probable item is removed.
This will keep the new beam always at the smallest size needed.
The fastest way to do this is by using a priority queue or heap queue (look up [`heapq`](https://docs.python.org/3/library/heapq.html) in Python).
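
A minimal sketch of this idea using `heapq` is shown below. The helper name `top_k_partial_prefixes` and the candidate list are made up for illustration; the candidates are `(probability, prefix)` tuples like the ones in `new_partial_prefixes` above.

```python
import heapq

def top_k_partial_prefixes(candidates, k):
    """Keep only the k most probable (probability, prefix) pairs."""
    heap = []  # Min-heap: heap[0] is always the least probable kept candidate.
    for candidate in candidates:
        if len(heap) < k:
            heapq.heappush(heap, candidate)
        elif candidate > heap[0]:
            # Remove the least probable kept candidate and add the new one.
            heapq.heapreplace(heap, candidate)
    return sorted(heap, reverse=True)  # Most probable first.

# Made-up candidates for illustration.
candidates = [(0.10, [1, 4]), (0.40, [1, 2]), (0.05, [1, 5]), (0.30, [1, 3]), (0.15, [1, 6])]
new_beam = top_k_partial_prefixes(candidates, k=3)
print(new_beam)
```

The heap never holds more than $k$ items, so the memory used for the new beam no longer grows with the vocabulary size. For a list that is already in memory, `heapq.nlargest(k, candidates)` gives the same result in one call.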

## Evaluating text generation

Checking if the generated text is correct, such as in the case of a translation, is supposed to be done manually by human evaluators.
Doing it automatically and correctly requires a model of how human evaluators evaluate.
Neural networks that predict the score a human evaluator would give for a sentence exist, such as [COMET](https://aclanthology.org/2020.emnlp-main.213/), but it's possible to also use a simple evaluation algorithm that we can interpret.
Such algorithms typically require us to have a set of reference texts as target values, such as 5 possible translations for the same source text.
The evaluation algorithm will then compare the hypothesis (generated text) to all of these reference texts, in order to allow for variation, and return some measure of similarity.

Note that perplexity measures the quality of the model rather than of the generated text and here we want to measure how good an actual generated translation is.
One of the oldest, simplest, and most popular methods for measuring the quality of a hypothesis is called **BLEU** (Bilingual Evaluation Understudy), which basically works by counting the number of n-grams in the hypothesis that are also present in at least one of the references.
**BLEU-1** is when only individual tokens are used, **BLEU-2** is when both individual tokens and bigrams are used, and so on, up to **BLEU-4** usually.
The BLEU score will be between 0 and 1, with 1 being the best score.
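
To make the n-gram counting concrete, here is a rough sketch of the matching step only. It is not a full BLEU implementation: the actual score also combines the different n-gram orders and penalises hypotheses that are too short. The function names `ngrams` and `clipped_precision` are hypothetical.

```python
import collections

def ngrams(tokens, n):
    # All the contiguous n-token subsequences of a text.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, references, n):
    hyp_counts = collections.Counter(ngrams(hypothesis, n))
    # For each n-gram, the most times it occurs in any single reference.
    max_ref_counts = collections.Counter()
    for reference in references:
        for (ngram, count) in collections.Counter(ngrams(reference, n)).items():
            max_ref_counts[ngram] = max(max_ref_counts[ngram], count)
    # A hypothesis n-gram only counts as many times as a reference contains it.
    matches = sum(min(count, max_ref_counts[ngram]) for (ngram, count) in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matches / total if total > 0 else 0.0

references = ['I like it .'.split(' '), 'I don\'t hate it .'.split(' ')]
hypothesis = 'I kind of like it .'.split(' ')

unigram_precision = clipped_precision(hypothesis, references, 1)
bigram_precision = clipped_precision(hypothesis, references, 2)
print(unigram_precision)
print(bigram_precision)
```

Here 4 of the 6 hypothesis tokens and 2 of the 5 hypothesis bigrams appear in a reference.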

It is important to note that BLEU has [the following issues](https://direct.mit.edu/coli/article/44/3/393/1598/A-Structured-Review-of-the-Validity-of-BLEU):

* It was designed for testing translation models and will not work well in other text generation tasks.
* Even on translation models, there is a wide variation of correlations with human evaluators for different data sets, so it's not that reliable.
* It should not be used to evaluate individual generated texts, but only on entire data sets.

NLTK has a function that measures BLEU which defaults to BLEU-4 and expects the sentences to be tokenised.

In [None]:
references = [
    ['I like it .'.split(' '), 'I don\'t hate it .'.split(' ')],
    ['I hate it .'.split(' '), 'I don\'t like it .'.split(' ')],
]
hypotheses = [
    'I kind of like it .'.split(' '),
    'I really hate it .'.split(' '),
]

score = nltk.translate.bleu_score.corpus_bleu(references, hypotheses, smoothing_function=nltk.translate.bleu_score.SmoothingFunction().method1)
print(score)

Another important point is that tokenising the texts differently (using different tokenisers) will result in different BLEU scores, which cannot be compared with each other.
For this reason, implementations such as [sacreBLEU](https://github.com/mjpost/sacrebleu) include a standard tokeniser that is meant to always be used.

## Exercises

### 1) Sentences from words

Redo the language model from the last topic but this time condition it on the sentiment of the texts (don't measure the perplexity).
A one-hot vector representation of the sentiments is provided to condition the language model.
Modify greedy search to avoid generating unknown tokens and use it to generate a positive and negative text.

In [None]:
min_freq = 3

train_df = pd.read_csv('../data_set/sentiment/train.csv')

train_text = train_df['text']
train_class = train_df['class']

categories = ['neg', 'pos']
cat2index = {cat: i for (i, cat) in enumerate(categories)}
cat_onehots = np.eye(len(categories))

nltk.download('punkt')
train_text_tokens = [nltk.word_tokenize(text) for text in train_text]
max_len = max(len(text) for text in train_text_tokens) + 1

frequencies = collections.Counter(token for text in train_text_tokens for token in text)
vocabulary = sorted(frequencies.keys(), key=frequencies.get, reverse=True)
while frequencies[vocabulary[-1]] < min_freq:
    vocabulary.pop()
vocab = ['<PAD>', '<EDGE>', '<UNK>'] + vocabulary
token2index = {token: i for (i, token) in enumerate(vocab)}
pad_index = token2index['<PAD>']
edge_index = token2index['<EDGE>']
unk_index = token2index['<UNK>']

train_text_x_indexed_np = np.full((len(train_text_tokens), max_len), pad_index, np.int64)
for i in range(len(train_text_tokens)):
    train_text_x_indexed_np[i, 0] = edge_index
    for j in range(len(train_text_tokens[i])):
        train_text_x_indexed_np[i, j + 1] = token2index.get(train_text_tokens[i][j], unk_index)
train_text_x_indexed = torch.tensor(train_text_x_indexed_np, device=device)

train_text_y_indexed_np = np.full((len(train_text_tokens), max_len), pad_index, np.int64)
for i in range(len(train_text_tokens)):
    for j in range(len(train_text_tokens[i])):
        train_text_y_indexed_np[i, j] = token2index.get(train_text_tokens[i][j], unk_index)
    train_text_y_indexed_np[i, len(train_text_tokens[i])] = edge_index
train_text_y_indexed = torch.tensor(train_text_y_indexed_np, device=device)

train_class_onehots = torch.tensor(
    cat_onehots[train_class.map(cat2index.get).to_numpy()],
    dtype=torch.float32, device=device
)