# RTML Final 2021

In this exam, we'll have some practical exercises using RNNs and some short answer questions regarding the Transformer/attention
and reinforcement learning.

Consider the AGNews text classification dataset:

In [132]:
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab

train_iter = AG_NEWS(split='train')
tokenizer = get_tokenizer('basic_english')
counter = Counter()

def clean(line):
    line = line.replace('\\', ' ')
    return line

labels = {}
for (label, line) in train_iter:
    if label in labels:
        labels[label] += 1
    else:
        labels[label] = 1
    counter.update(tokenizer(clean(line)))

vocab = Vocab(counter, min_freq=1)

print('Label frequencies:', labels)
print('A few token frequencies:', vocab.freqs.most_common(5))
print('Label meanings: 1: World news, 2: Sports news, 3: Business news, 4: Sci/Tech news')

Label frequencies: {3: 30000, 4: 30000, 2: 30000, 1: 30000}
A few token frequencies: [('.', 225971), ('the', 205040), (',', 165685), ('to', 119817), ('a', 110942)]
Label meanings: 1: World news, 2: Sports news, 3: Business news, 4: Sci/Tech news


Here's how we can get a sequence of tokens for a sentence with the cleaner, tokenizer, and vocabulary:

In [133]:
[vocab[token] for token in tokenizer(clean('Bangkok, or The Big Mango, is one of the great cities of Asia'))]

[4248, 4, 116, 3, 244, 46857, 4, 23, 62, 7, 3, 812, 2009, 7, 989]

Let's make pipelines for processing a news story and a label:

In [134]:
text_pipeline = lambda x: [vocab[token] for token in tokenizer(clean(x))]
label_pipeline = lambda x: int(x) - 1

Here's how to create dataloaders for the training and test datasets:

In [135]:
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, length_list = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        length_list.append(processed_text.shape[0])
        text_list.append(processed_text)
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = pad_sequence(text_list, padding_value=0)
    length_list = torch.tensor(length_list, dtype=torch.int64)
    return label_list.to(device), text_list.to(device), length_list.to(device)

train_iter = AG_NEWS(split='train')
train_dataset = list(train_iter)
test_iter = AG_NEWS(split='test')
test_dataset = list(test_iter)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate_batch)

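Here's how to get a batch from one of these dataloaders. Because the dataloader is wrapped in `enumerate`, the result is a pair (batch index, batch).
The batch itself has three entries: a 1D tensor of labels (8 values between 0 and 3), a 2D tensor representing the stories with dimension T x B (number of tokens x batch size, padded with 0), and a 1D tensor holding the unpadded length of each story.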

In [136]:
batch = next(enumerate(train_dataloader))
print(batch)

(0, (tensor([3, 2, 0, 2, 1, 2, 1, 0], device='cuda:0'), tensor([[ 7830,   342,   119,   369,  1447,   342, 16255, 28640],
        [ 4049, 54733,   949,   170,   130,   219,    90,  1682],
        [12159,  1112,    37,  2424, 13187,   386,  2590,    48],
        [  929,     7,   475,     5,     4,  4233,    14,   119],
        [  884,  1550,   121,  2539,   267, 20173,   263, 15086],
        [   14,   256,     2,    39,  1406,  1367,   203,     8],
        [  406,     4, 12760,   309,    14,   122,    15, 49579],
        [   51,   592,  1041,  5134,  2202, 15144,    16,    14],
        [   15,  2778,     7,    19,    15,   206,     3,    32],
        [  406,  8681,  2533,   309,    54, 11582,  1739,    15],
        [   51,  6621,    12,    13,    49, 11101,    40,    32],
        [   16,    66,  1158,    10,     7,    39,    83,    16],
        [ 1239,     2,  1595,  4977,     3,    78,  3873,     3],
        [ 2939,    14,  1134,     2,   132,    88,     8,  4599],
        [47525, 3823
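The T x B layout described above can be illustrated without PyTorch. The following is a minimal pure-Python sketch of what `pad_sequence` does with its default time-major layout; the helper name `pad_to_time_major` and the toy token indices are assumptions made for illustration only.

```python
def pad_to_time_major(sequences, padding_value=0):
    """Pad variable-length token lists into a T x B grid (one row per time step)."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    grid = [
        [seq[t] if t < len(seq) else padding_value for seq in sequences]
        for t in range(max_len)
    ]
    return grid, lengths

# Three toy "stories" of different lengths (token indices are made up).
stories = [[5, 9, 2], [7], [4, 8]]
grid, lengths = pad_to_time_major(stories)

for row in grid:
    print(row)       # each row holds one time step across the whole batch
print(lengths)       # unpadded length of each story
```

Each column of the grid is one story, so the last real token of story `b` sits in row `lengths[b] - 1`.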

## Question 1, 10 points

The vocabulary is currently too large for a simple one-hot embedding. Let's reduce the vocabulary size
so that we can use one-hot. First, add a step to the `text_pipeline` function that removes any token found in a list of "stop words".
You probably want to remove punctuation ('.', ',', '-', etc.) and articles ("a", "the").

Once you've removed stop words, modify the vocabulary to include only the most frequent 1000 tokens (including 0 for an unknown/infrequent word).

Write your revised code in the cell below and output the 999 top words with their frequencies:
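One possible shape for this step, sketched in pure Python so that it does not depend on a particular torchtext version: stop words are removed as whole tokens after tokenization, and only the most frequent tokens receive an index, with 0 reserved for unknown or infrequent words. The names `STOP_WORDS`, `remove_stop_words`, `build_capped_vocab`, and `encode` are hypothetical, and the stop-word list is only a starting point.

```python
from collections import Counter

# Hypothetical stop-word list; extend it as needed.
STOP_WORDS = {'.', ',', '-', '--', '(', ')', "'", 'a', 'an', 'the'}

def remove_stop_words(tokens):
    """Drop whole tokens that are stop words; substrings of other words are untouched."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

def build_capped_vocab(token_lists, max_tokens=1000):
    """Give indices 1..max_tokens-1 to the most frequent tokens; 0 means unknown."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(remove_stop_words(tokens))
    most_common = counter.most_common(max_tokens - 1)
    stoi = {tok: idx for idx, (tok, _) in enumerate(most_common, start=1)}
    return stoi, most_common

def encode(tokens, stoi):
    """Map tokens to indices, sending anything outside the vocabulary to 0."""
    return [stoi.get(tok, 0) for tok in remove_stop_words(tokens)]

# Toy corpus of already-tokenized lines.
docs = [
    ['the', 'other', 'team', 'won', '.'],
    ['their', 'team', 'lost', ',', 'other'],
    ['team'],
]
stoi, most_common = build_capped_vocab(docs, max_tokens=3)
encoded = encode(['their', 'team', 'the', 'other'], stoi)
```

Because the filter works on tokens, words such as "their" and "other" stay intact even though "the" is a stop word.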

In [6]:
# Place code for Question 1 here

In [12]:
vocab.freqs.most_common(100)

[('.', 225971),
 ('the', 205040),
 (',', 165685),
 ('to', 119817),
 ('a', 110942),
 ('of', 98353),
 ('in', 95930),
 ('and', 69326),
 ('s', 61915),
 ('on', 57279),
 ('for', 50417),
 ('#39', 44316),
 ('(', 41106),
 (')', 40787),
 ('-', 39212),
 ("'", 32235),
 ('that', 28167),
 ('with', 26801),
 ('as', 25324),
 ('at', 24999),
 ('its', 22115),
 ('is', 22083),
 ('new', 21297),
 ('by', 20858),
 ('it', 20476),
 ('said', 20265),
 ('reuters', 19321),
 ('has', 19023),
 ('from', 17807),
 ('an', 16988),
 ('ap', 16152),
 ('his', 14941),
 ('will', 14615),
 ('after', 14496),
 ('was', 13730),
 ('us', 12660),
 ('be', 11777),
 ('over', 11219),
 ('have', 11200),
 ('their', 10529),
 ('&lt', 10208),
 ('are', 9791),
 ('up', 9739),
 ('quot', 9593),
 ('but', 9150),
 ('more', 9123),
 ('first', 9087),
 ('two', 8974),
 ('he', 8920),
 ('world', 8522),
 ('u', 8357),
 ('this', 8246),
 ('--', 7969),
 ('company', 7637),
 ('monday', 7614),
 ('wednesday', 7530),
 ('tuesday', 7454),
 ('thursday', 7343),
 ('oil', 7262),


In [147]:
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
import numpy as np
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = get_tokenizer('basic_english')
counter = Counter()

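def clean(line):
    # Crude stop-word removal by substring replacement, applied before tokenization.
    # NOTE: replace() is case-sensitive and edits substrings, so 'the' is also cut out of
    # words such as 'their' (-> 'ir') and 'other' (-> 'o r'), while a capitalized 'The'
    # survives. This is why 'the', 'ir' and 'r' still appear in the frequency list.
    # Filtering whole tokens after tokenization would avoid both problems.
    line = line.replace('\\', ' ')
    line = line.replace('.', ' ')
    line = line.replace('the', ' ')
    line = line.replace(',', ' ')
    line = line.replace('/', ' ')
    line = line.replace('=', ' ')
    # line = line.replace('a', ' ') # replace these will break a word
    # line = line.replace('s', ' ')
    line = line.replace('#39', ' ')
    line = line.replace(')', ' ')
    line = line.replace('(', ' ')
    line = line.replace('-', ' ')
    line = line.replace("'", ' ')
    # line = line.replace('an', ' ')
    line = line.replace('--', ' ')  # no-op: every '-' was already replaced above
    return line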

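def get_freq(batch):
    # Rebuild the global vocabulary so that collate_batch uses the reduced one
    global vocab
    for _, _line in batch:
        counter.update(tokenizer(clean(_line)))
    # Keep the 999 most frequent tokens plus '<unk>' at index 0 (1000 entries in total).
    # Assumes the counter-based Vocab(counter, max_size=..., specials=...) signature of
    # the torchtext version this notebook already uses.
    vocab = Vocab(counter, max_size=999, specials=('<unk>',))
    return vocab.freqs.most_common(999)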

def collate_batch(batch):
    text_pipeline = lambda x: [vocab[token] for token in tokenizer(clean(x))]
    label_pipeline = lambda x: int(x) - 1
    label_list, text_list, length_list = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        length_list.append(processed_text.shape[0])
        text_list.append(processed_text)
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = pad_sequence(text_list, padding_value=0)
    length_list = torch.tensor(length_list, dtype=torch.int64)
    return label_list.to(device), text_list.to(device), length_list.to(device)

train_iter = AG_NEWS(split='train')
train_dataset = list(train_iter)
train_freq = get_freq(train_dataset)
list_t = [token for token, _ in train_freq]  # the 999 most frequent tokens, without counts
test_iter = AG_NEWS(split='test')
test_dataset = list(test_iter)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate_batch)

In [152]:
batch = next(enumerate(train_dataloader))
print(batch[1][1].shape)

torch.Size([67, 8])


In [146]:
train_freq

[('to', 120681),
 ('a', 112141),
 ('of', 98655),
 ('in', 96429),
 ('and', 69678),
 ('s', 62049),
 ('on', 57694),
 ('for', 50676),
 ('that', 28169),
 ('with', 26813),
 ('the', 26472),
 ('as', 25375),
 ('at', 25070),
 ('its', 22123),
 ('is', 22095),
 ('new', 21427),
 ('by', 20952),
 ('it', 20537),
 ('said', 20267),
 ('reuters', 19332),
 ('has', 19025),
 ('from', 17829),
 ('an', 17047),
 ('ap', 16253),
 ('his', 14942),
 ('will', 14617),
 ('after', 14554),
 ('was', 13731),
 ('us', 13232),
 ('be', 11853),
 ('over', 11356),
 ('have', 11213),
 ('up', 10740),
 ('ir', 10352),
 ('r', 10265),
 ('two', 10226),
 ('&lt', 10209),
 ('first', 9804),
 ('are', 9792),
 ('year', 9772),
 ('quot', 9596),
 ('but', 9184),
 ('more', 9149),
 ('he', 8942),
 ('world', 8628),
 ('u', 8436),
 ('this', 8251),
 ('one', 8109),
 ('company', 7657),
 ('monday', 7616),
 ('oil', 7564),
 ('out', 7556),
 ('wednesday', 7531),
 ('tuesday', 7455),
 ('thursday', 7345),
 ('not', 7061),
 ('1', 7008),
 ('against', 6901),
 ('friday', 

## Question 2, 30 points

Next, let's build a simple RNN for classification of the AGNews dataset. Use a one-hot embedding of the vocabulary
entries and the basic RNN from Lab 10. Use the lengths tensor (the third element in the batch returned by the dataloaders)
to determine which output to apply the loss to.
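The two ideas in this paragraph, one-hot encoding and using the lengths to pick the right output, can be sketched in pure Python as follows. The helper names `one_hot` and `select_last_outputs` are hypothetical, and the strings stand in for real model outputs.

```python
def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def select_last_outputs(outputs, lengths):
    """outputs[t][b] is the model output at time step t for batch item b.

    For each batch item, return the output at its last real (unpadded) token.
    """
    return [outputs[length - 1][b] for b, length in enumerate(lengths)]

# Toy example: 3 time steps, batch of 2, outputs labelled by position.
outputs = [['t0b0', 't0b1'], ['t1b0', 't1b1'], ['t2b0', 't2b1']]
lengths = [3, 1]  # the second story has only one real token
selected = select_last_outputs(outputs, lengths)
```

Outputs produced at padded positions are simply never selected, so they do not contribute to the loss.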

Place your training code below, and plot the training and test accuracy as a
function of epoch. Finally, output a confusion matrix for the test set.

*Do not spend a lot of time on the training! A few minutes is enough. The point is to show that the model is
learning, not to get the best possible performance.*

In [7]:
# Place code for Question 2 here

In [153]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

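class originalRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(originalRNN, self).__init__()

        self.hidden_size = hidden_size

        # The new hidden state is computed from [input, previous hidden state]
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        # Class scores are read from the hidden state
        self.i2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # input: B x input_size (one-hot rows), hidden: B x hidden_size
        combined = torch.cat((input, hidden), 1)
        hidden = torch.tanh(self.i2h(combined))  # F.tanh is deprecated
        output = self.i2o(hidden)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self, batch_size=1):
        return torch.zeros(batch_size, self.hidden_size, device=device)

n_letters = len(vocab)  # one-hot input size = vocabulary size (1000 once the vocabulary is capped as in Question 1)
n_hidden = 128
n_categories = 4
model = originalRNN(n_letters, n_hidden, n_categories).to(device)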

In [154]:
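# Class names in label order (label_pipeline maps labels 1..4 to indices 0..3)
all_categories = ['World', 'Sports', 'Business', 'Sci/Tech']

def categoryFromOutput(output):
    # output: 1 x n_categories log-probabilities; return (name, index) of the top class
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i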

In [155]:
learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn
criterion = nn.NLLLoss()

def train(category_tensor, line_tensor, lengths, model):
    # line_tensor: T x B token indices, lengths: B unpadded story lengths
    batch_size = line_tensor.size(1)
    hidden = torch.zeros(batch_size, model.hidden_size, device=line_tensor.device)

    model.zero_grad()

    outputs = []
    for i in range(line_tensor.size(0)):
        # One-hot encode the i-th token of every story in the batch
        one_hot = F.one_hot(line_tensor[i], num_classes=n_letters).float()
        output, hidden = model(one_hot, hidden)
        outputs.append(output)

    # Apply the loss to the output at the last real token of each story
    outputs = torch.stack(outputs)  # T x B x n_categories
    last = outputs[lengths - 1, torch.arange(batch_size, device=lengths.device)]
    loss = criterion(last, category_tensor)
    loss.backward()

    # Add parameters' gradients to their values, multiplied by learning rate
    for p in model.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return last, loss.item()

In [156]:
import time
import math

n_iters = 10     # number of epochs
print_every = 1  # print once every this many epochs

# Keep track of losses for plotting
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    current_loss, n_correct, n_seen = 0.0, 0, 0
    # Each batch is (labels, stories, lengths), as returned by collate_batch
    for category_tensor, line_tensor, lengths in train_dataloader:
        output, loss = train(category_tensor, line_tensor, lengths, model)
        current_loss += loss
        n_correct += (output.argmax(dim=1) == category_tensor).sum().item()
        n_seen += category_tensor.size(0)

    # Add the mean loss of this epoch to the list of losses
    all_losses.append(current_loss / len(train_dataloader))

    # Print epoch number, progress, elapsed time, mean loss and training accuracy
    if iter % print_every == 0:
        print('%d %d%% (%s) %.4f acc %.3f' % (iter, iter / n_iters * 100, timeSince(start), all_losses[-1], n_correct / n_seen))

In [None]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)

In [None]:
# Keep track of correct guesses in a confusion matrix (rows: true class, columns: predicted class)
confusion = torch.zeros(n_categories, n_categories)

# Just return the outputs for a batch, read at the last real token of each story
def evaluate(line_tensor, lengths, model):
    batch_size = line_tensor.size(1)
    hidden = torch.zeros(batch_size, model.hidden_size, device=line_tensor.device)
    outputs = []
    for i in range(line_tensor.size(0)):
        one_hot = F.one_hot(line_tensor[i], num_classes=n_letters).float()
        output, hidden = model(one_hot, hidden)
        outputs.append(output)
    outputs = torch.stack(outputs)  # T x B x n_categories
    return outputs[lengths - 1, torch.arange(batch_size, device=lengths.device)]

# Go through the test set and record which examples are correctly guessed
with torch.no_grad():
    for category_tensor, line_tensor, lengths in test_dataloader:
        guesses = evaluate(line_tensor, lengths, model).argmax(dim=1)
        for category_i, guess_i in zip(category_tensor.tolist(), guesses.tolist()):
            confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes: one labelled tick per class
ax.set_xticks(range(n_categories))
ax.set_yticks(range(n_categories))
ax.set_xticklabels(all_categories, rotation=90)
ax.set_yticklabels(all_categories)

plt.show()

## Question 3, 10 points

Next, replace the SRNN from Question 2 with a single-layer LSTM. Give the same output (training and testing accuracy as a function of epoch, as well as confusion
matrix for the test set). Comment on the differences you observe between the two models.
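As background for the comparison, here is a minimal sketch of a single LSTM step in its standard formulation, written with scalar states in pure Python so the gating is easy to follow. The function `lstm_step` and the weight layout are hypothetical and are not how `nn.LSTM` stores its parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar input and state.

    w maps a gate name to (input weight, hidden weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre('input'))    # how much of the candidate to write
    f = sigmoid(pre('forget'))   # how much of the old cell state to keep
    o = sigmoid(pre('output'))   # how much of the cell state to expose
    g = math.tanh(pre('cell'))   # candidate cell value
    c = f * c_prev + i * g       # additive cell update
    h = o * math.tanh(c)
    return h, c

# With all-zero weights every gate equals 0.5 and the candidate is 0,
# so the new cell state is half of the previous one.
zero_weights = {gate: (0.0, 0.0, 0.0) for gate in ('input', 'forget', 'output', 'cell')}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=zero_weights)
```

The additive cell update is the main structural difference from the simple RNN, where the whole hidden state passes through `tanh` at every step.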

In [8]:
# Place code for Question 3 here

In [None]:
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTM, self).__init__()

        self.hidden_size = hidden_size
        # Single-layer LSTM; expects input of shape T x B x input_size
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=1)
        # Map the selected LSTM output to class scores
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, lengths):
        # input: T x B x input_size one-hot sequence, lengths: B unpadded story lengths
        lstm_out, _ = self.lstm(input)  # hidden and cell states start at zero by default
        # Read the output at the last real token of each story
        batch_idx = torch.arange(input.size(1), device=lengths.device)
        last = lstm_out[lengths - 1, batch_idx]
        return self.softmax(self.h2o(last))

n_hidden = 128
lstm_model = LSTM(n_letters, n_hidden, n_categories)

## Question 4, 10 points

Explain how you could use the Transformer model to perform the same task you explored in Questions 2 and 3.
How would attention be useful for this text classification task? Give a precise and detailed answer. Be sure to discuss what
parts of the original Transformer you would use and what you would have to remove.

*Write your answer here.*

First, the Transformer sees the whole line at once rather than processing it word by word like an RNN or LSTM. Self-attention captures the relationship between every pair of words in the line. The attention weights indicate how much each word contributes to classifying the line, which helps classification because informative words receive high weight and noise from unnecessary words is suppressed.

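We can use only the encoder part of the Transformer for text classification. The TransformerEncoder consists of multiple TransformerEncoderLayer blocks, each combining multi-head self-attention with a feed-forward network, and we keep the token embedding and positional encoding at its input. Unlike language modeling, classification does not need a causal attention mask, since every token may attend to every other token; we only need a padding mask so that attention ignores the padded positions in each batch. We remove the decoder entirely, including its masked self-attention and encoder-decoder attention, because we are not generating an output sequence. Instead of predicting output words, we pool the encoder outputs into one vector per story (for example, by averaging over the non-padded positions) and send it to a final Linear layer with 4 outputs, followed by a log-Softmax function.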
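To make the role of attention concrete, here is a minimal pure-Python sketch of scaled dot-product self-attention followed by mean pooling. It is a simplification under stated assumptions: queries, keys, and values are the raw token vectors (no learned projections, a single head, no padding mask), and the names `self_attention` and `mean_pool` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    out, weights = [], []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all token vectors
        out.append([sum(wj * v[c] for wj, v in zip(w, x)) for c in range(d)])
    return out, weights

def mean_pool(vectors):
    """Average the token vectors into a single sentence vector."""
    n = len(vectors)
    return [sum(v[c] for v in vectors) / n for c in range(len(vectors[0]))]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
attended, weights = self_attention(tokens)
sentence_vector = mean_pool(attended)  # would feed the final Linear layer
```

Every row of `weights` sums to 1 and shows how strongly one token attends to each of the others, which is the quantity the answer above refers to.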

## Question 5, 10 points

In Lab 13, you implemented a DQN model for tic-tac-toe. Your method learned to play against a fairly dumb `expert_action` opponent, however. Also,
DQN has proven to be less stable than other methods such as Double DQN, also discussed in Lab 13.

Explain below how you would apply double DQN and self-play to improve your tic-tac-toe agent.
Provide pseudocode for the algorithm below.

*Write your explanation and pseudocode here.*

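With double DQN, we keep a copy of the DQN as a target network. When computing the target for a transition, the policy (online) network selects the best next action with argmax, and the target network supplies the Q-value of that action, instead of taking the maximum over the target network's own Q-values as in plain DQN. Decoupling action selection from action evaluation reduces the overestimation of Q-values caused by the max operator, which otherwise produces overly large targets and unstable updates, especially at the beginning of training. The target network is not trained directly; its weights are copied from the policy network every fixed number of steps.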
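The difference between the two targets can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the Lab 13 code: the function names are hypothetical, the Q-values are plain lists over the next state's actions, and illegal moves are assumed to have been filtered out already.

```python
def dqn_target(reward, done, next_q_target, gamma=0.99):
    """Plain DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)

def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]

# Toy Q-values for three possible next actions.
next_q_online = [0.2, 0.9, 0.1]
next_q_target = [0.5, 0.3, 0.8]

plain = dqn_target(0.0, False, next_q_target, gamma=0.9)
double = double_dqn_target(0.0, False, next_q_online, next_q_target, gamma=0.9)
```

In this toy case the double DQN target is smaller than the plain one, because the action preferred by the online network is not the one the target network rates highest.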

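Self-play means the agent plays against itself instead of against a fixed opponent: the same network chooses the moves for both X and O, with the board encoded from the point of view of the player to move. Every move of every game then produces training data, so no game is wasted, and the opponent gets stronger as the agent improves. This removes the need for the `expert_action` opponent.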
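A self-play episode can be sketched as follows. This is a minimal sketch under assumptions: `choose_action` is a placeholder that plays random legal moves where a real agent would use an epsilon-greedy policy over its Q-network, the reward scheme (+1 win, -1 loss, 0 otherwise) is one possible choice, and all names are hypothetical rather than taken from Lab 13.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return +1 or -1 if that player has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def legal_actions(board):
    return [i for i, v in enumerate(board) if v == 0]

def choose_action(board, player, epsilon, rng):
    """Placeholder policy: a uniformly random legal move (epsilon is unused here).
    A real agent would take argmax Q(state, a) with probability 1 - epsilon."""
    return rng.choice(legal_actions(board))

def self_play_episode(rng, epsilon=0.1):
    """Play one game in which the same policy controls both sides.

    Returns (transitions, winner). Each transition is
    (state, action, reward, next_state, done), with states written from the
    mover's point of view (own marks are +1, opponent marks are -1).
    """
    board, player = [0] * 9, 1
    transitions, pending = [], None
    while True:
        state = [v * player for v in board]
        action = choose_action(board, player, epsilon, rng)
        board[action] = player
        w = winner(board)
        done = w != 0 or not legal_actions(board)
        if pending is not None:
            # The previous mover now sees the result of the opponent's reply
            prev_state, prev_action = pending
            next_state = [v * -player for v in board]
            prev_reward = -1.0 if w == player else 0.0
            transitions.append((prev_state, prev_action, prev_reward, next_state, done))
        if done:
            reward = 1.0 if w == player else 0.0
            transitions.append((state, action, reward, [v * player for v in board], True))
            return transitions, w
        pending = (state, action)
        player = -player

transitions, result = self_play_episode(random.Random(0))
```

Every move by either side yields exactly one transition, so both the winning and the losing side of each game end up in the replay buffer used for the double DQN updates.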

## Question 6, 30 points

Based on your existing DQN implementation, implement the double DQN and self-play training method
you just described. After some training (don't spend too much time on training -- again, we just want to see that the model can
learn), show the result you playing a game against your learned agent.

In [9]:
# Code for training and playing goes here
