# 7-2. **RNN with Attention**

In this lab, you will learn how to implement an attention mechanism in an RNN-based seq2seq model. The code is adapted from https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html.

In [None]:
import re
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Download Dataset
The dataset comes from https://github.com/Microsoft/BotBuilder-PersonalityChat/tree/master/CSharp/Datasets (the repository has since updated the dataset).


In [None]:
!wget https://raw.githubusercontent.com/kimtwan/NLP_lecture/master/data/qna_chitchat_the_friend.tsv

--2023-10-17 01:52:51--  https://raw.githubusercontent.com/kimtwan/NLP_lecture/master/data/qna_chitchat_the_friend.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 59806 (58K) [text/plain]
Saving to: ‘qna_chitchat_the_friend.tsv’


2023-10-17 01:52:52 (6.08 MB/s) - ‘qna_chitchat_the_friend.tsv’ saved [59806/59806]



In [None]:
df_friend = pd.read_csv('qna_chitchat_the_friend.tsv', sep='\t')
df_friend.sample(10)

Unnamed: 0,Question,Answer,Source,Metadata
453,Good night to you,Nighty night!,qna_chitchat_the_friend,editorial:chitchat
191,Are you asexual?,I'm digital.,qna_chitchat_the_friend,editorial:chitchat
304,What do you like to sing best?,"La la la, tra la la. I'm awesome at this.",qna_chitchat_the_friend,editorial:chitchat
390,What makes you think that?,I'm afraid I didn't follow that.,qna_chitchat_the_friend,editorial:chitchat
521,I hate everything about you,I'm a work in progress.,qna_chitchat_the_friend,editorial:chitchat
437,Later alligator,Bye.,qna_chitchat_the_friend,editorial:chitchat
253,Are you busy?,I'm here!,qna_chitchat_the_friend,editorial:chitchat
401,"Yes, that's right",Cool!,qna_chitchat_the_friend,editorial:chitchat
417,I thank you,You're very welcome.,qna_chitchat_the_friend,editorial:chitchat
57,Who created you?,People made me out of code and a dash of ingen...,qna_chitchat_the_friend,editorial:chitchat


In [None]:
n_data = df_friend.shape[0]

## Preprocessing
This is just a very naive preprocessing step.

In [None]:
question_list = df_friend['Question'].tolist()
answer_list = df_friend['Answer'].tolist()

In [None]:
# These are just common English contractions. There are many edge cases, e.g. "University's working on it."
contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have",
                    "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not",
                    "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did",
                    "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have",
                    "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have",
                    "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                    "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us",
                    "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have",
                    "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
                    "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not",
                    "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have",
                    "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have",
                    "so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                    "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                    "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not",
                    "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have",
                    "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have",
                    "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will",
                    "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have",
                    "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have",
                    "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                    "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}

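def pre_process(sent_list):
    """Lowercase, expand contractions, strip punctuation, and tokenize each sentence."""
    output = []
    for sent in sent_list:
        sent = sent.lower()
        # Plain substring replacement; after lowercasing, only the lowercase keys can match.
        for word, new_word in contraction_dict.items():
            sent = sent.replace(word, new_word)
        # Drop every character that is neither a word character nor whitespace.
        sent = re.sub(r'[^\w\s]','',sent)
        output.append(word_tokenize(sent))
    return output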

input_token_list = pre_process(question_list)
answer_token_list = pre_process(answer_list)
output_token_list = [['<BOS>'] + s for s in answer_token_list]
target_token_list = [s + ['<EOS>'] for s in answer_token_list]
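The substring replacement in `pre_process` depends on dictionary order: `"how'd"` is replaced before the longer `"how'd'y"` is ever checked. The following is a minimal sketch of one possible alternative, assuming a single regular expression with word boundaries and longest-key-first ordering. `toy_contractions` and `expand_contractions` are hypothetical names used for illustration and are not part of the lab code.

```python
import re

# Hypothetical, shortened contraction table for illustration only.
toy_contractions = {"how'd": "how did", "how'd'y": "how do you", "i'm": "i am"}

# Sort keys longest first so that "how'd'y" wins over its prefix "how'd".
sorted_keys = sorted(toy_contractions, key=len, reverse=True)
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in sorted_keys) + r")\b")

def expand_contractions(sent):
    """Lowercase the sentence and expand whole-word contractions in one pass."""
    return pattern.sub(lambda m: toy_contractions[m.group(1)], sent.lower())

# Plain substring replacement, applied in dictionary order, for comparison.
naive = "how'd'y do".replace("how'd", "how did")

print(naive)
print(expand_contractions("How'd'y do"))
```

Note that `\b` does not help for keys that start with an apostrophe, such as `'cause`, so a full replacement would need extra care for those entries.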

In [None]:
MAX_LENGTH = max([len(s) for s in input_token_list] + [len(s) for s in target_token_list])

In [None]:
word_to_ix = {'<BOS>': 0, '<EOS>':1}
for sentence in input_token_list + output_token_list:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
word_list = list(word_to_ix.keys())

In [None]:
def to_index(data, to_ix):
    index_list = []
    for sent in data:
        index_list.append([to_ix[w] for w in sent])
    return index_list

input_index = to_index(input_token_list, word_to_ix)
output_index = to_index(output_token_list, word_to_ix)
target_index = to_index(target_token_list, word_to_ix)
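To make the relationship between the decoder input and the target concrete, here is a small sketch on a single toy pair. It mirrors the vocabulary construction above, but the sentences and the names `toy_inputs`, `toy_answers`, and `toy_vocab` are made up for illustration.

```python
# One toy question/answer pair, already tokenized.
toy_inputs = [['are', 'you', 'busy']]
toy_answers = [['i', 'am', 'here']]

# Decoder input starts with <BOS>; the target is the same answer ending with <EOS>.
decoder_inputs = [['<BOS>'] + s for s in toy_answers]
targets = [s + ['<EOS>'] for s in toy_answers]

# Build the vocabulary in first-seen order, as in the cell above.
toy_vocab = {'<BOS>': 0, '<EOS>': 1}
for sentence in toy_inputs + decoder_inputs:
    for word in sentence:
        if word not in toy_vocab:
            toy_vocab[word] = len(toy_vocab)

decoder_input_ids = [toy_vocab[w] for w in decoder_inputs[0]]
target_ids = [toy_vocab[w] for w in targets[0]]

print(decoder_input_ids)
print(target_ids)
```

At every position the target is the decoder input shifted left by one token, which is what teacher forcing relies on during training.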

## Model

In [None]:
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Encoder

In [None]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, embedding):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = embedding # nn.Embedding(len(word_to_ix), hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
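The encoder processes one token per call, so every tensor has a sequence length and batch size of 1. The sketch below checks the shapes for a single step with toy sizes (`hidden_size = 4` and `vocab_size = 6` are assumptions for illustration; the lab uses a hidden size of 50).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 4   # toy size for illustration
vocab_size = 6    # hypothetical toy vocabulary size

embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

token = torch.tensor([3])                     # one word index
hidden = torch.zeros(1, 1, hidden_size)       # same shape as initHidden()
embedded = embedding(token).view(1, 1, -1)    # (seq_len=1, batch=1, hidden_size)
output, hidden = gru(embedded, hidden)

print(output.shape, hidden.shape)
# For a single-layer, unidirectional GRU run for one time step,
# the output and the new hidden state hold the same values.
same = torch.allclose(output, hidden)
print(same)
```

This is why the training loop can store either the output or the hidden state of each step as the per-token encoder representation.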

### Decoder

In [None]:
class AttnDecoderRNN(nn.Module):
    ATTN_TYPE_DOT_PRODUCT = 'Dot Product'
    ATTN_TYPE_SCALE_DOT_PRODUCT = 'Scale Dot Product'

    def __init__(self, hidden_size, output_size, embedding, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = embedding # nn.Embedding(len(word_to_ix), hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size*2, self.output_size)

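    def cal_attention(self, hidden, encoder_hiddens, method):
        # hidden: (1, 1, hidden_size); encoder_hiddens: (max_length, hidden_size)
        if method == AttnDecoderRNN.ATTN_TYPE_DOT_PRODUCT:
            # bmm: https://pytorch.org/docs/master/generated/torch.bmm.html
            attn_weights = F.softmax(torch.bmm(hidden, encoder_hiddens.T.unsqueeze(0)),dim=-1)
            attn_output = torch.bmm(attn_weights, encoder_hiddens.unsqueeze(0))
            concat_output = torch.cat((attn_output[0], hidden[0]), 1)
        else:
            # Fail clearly instead of hitting an unbound concat_output below.
            raise ValueError('Unsupported attention method: %s' % method)

        return concat_output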

    def forward(self, input, hidden, encoder_hiddens):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        _, hidden = self.gru(embedded, hidden)

        concat_output = self.cal_attention(hidden, encoder_hiddens, AttnDecoderRNN.ATTN_TYPE_DOT_PRODUCT)

        output = F.log_softmax(self.out(concat_output), dim=1)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
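The tensor shapes inside `cal_attention` are easy to lose track of, so the following sketch repeats the dot-product computation outside the class with toy sizes. The sizes and random values are assumptions for illustration. It also shows a side effect of the zero-padded `encoder_hiddens` buffer: padded rows get a score of 0, which still receives some weight after the softmax.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, max_length, input_length = 4, 6, 3   # toy sizes

hidden = torch.randn(1, 1, hidden_size)            # decoder hidden state
encoder_hiddens = torch.zeros(max_length, hidden_size)
encoder_hiddens[:input_length] = torch.randn(input_length, hidden_size)

# (1, 1, H) x (1, H, max_length) -> (1, 1, max_length)
scores = torch.bmm(hidden, encoder_hiddens.T.unsqueeze(0))
attn_weights = F.softmax(scores, dim=-1)
# (1, 1, max_length) x (1, max_length, H) -> (1, 1, H)
attn_output = torch.bmm(attn_weights, encoder_hiddens.unsqueeze(0))
# Concatenate context vector and hidden state -> (1, 2H)
concat_output = torch.cat((attn_output[0], hidden[0]), 1)

# Weight assigned to the zero-padded positions beyond the real input.
padding_mass = attn_weights[0, 0, input_length:].sum().item()
print(attn_weights.shape, attn_output.shape, concat_output.shape)
print(padding_mass)
```

The `(1, 2H)` result is why the output layer is declared as `nn.Linear(self.hidden_size*2, self.output_size)`.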

### Train Function

In [None]:
def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_hiddens = torch.zeros(max_length, encoder.hidden_size, device=device)
    encoder_hidden = encoder.initHidden()

    for i in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[i], encoder_hidden)
        encoder_hiddens[i] = encoder_hidden[0, 0]

    decoder_input = torch.tensor([[0]], device=device)
    decoder_hidden = encoder_hidden

    loss = 0
    # Teacher forcing: Feed the target as the next input
    for i in range(target_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_hiddens)
        loss += criterion(decoder_output, target_tensor[i])
        decoder_input = target_tensor[i]  # Teacher forcing

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
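The decoder returns log-probabilities (`F.log_softmax`), and training pairs them with `nn.NLLLoss`. As a quick sanity check on that pairing, the sketch below compares it with `nn.CrossEntropyLoss` applied to raw scores for one decoding step. The logits and target are made-up toy values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 5)      # one decoding step over a toy vocabulary of 5 words
target = torch.tensor([2])      # index of the correct next word

# Loss as used in this lab: log_softmax followed by NLLLoss.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
# Equivalent single call on the raw scores.
ce = nn.CrossEntropyLoss()(logits, target)

print(nll.item(), ce.item())
```

The two values agree, so the per-step loss summed in `train` is the usual cross-entropy between the predicted distribution and the target word.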

### Train Iterations Function

In [None]:
import time
import math

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

In [None]:
import random
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)

    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        random_choice_ix = random.choice(range(n_data))
        input_index_r = [[ind] for ind in input_index[random_choice_ix]]
        target_index_r = [[ind] for ind in target_index[random_choice_ix]]

        input_tensor = torch.LongTensor(input_index_r).to(device)
        target_tensor = torch.LongTensor(target_index_r).to(device)

        loss = train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

## Training Process

In [None]:
hidden_size = 50
embedding = nn.Embedding(len(word_to_ix), hidden_size)
encoder1 = EncoderRNN(len(word_to_ix), hidden_size, embedding).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, len(word_to_ix), embedding, dropout_p=0.1).to(device)

trainIters(encoder1, attn_decoder1, 10000, print_every=500)

0m 9s (- 3m 3s) (500 5%) 4.7313
0m 15s (- 2m 15s) (1000 10%) 3.4853
0m 22s (- 2m 5s) (1500 15%) 2.7871
0m 27s (- 1m 50s) (2000 20%) 2.3321
0m 34s (- 1m 42s) (2500 25%) 1.7479
0m 39s (- 1m 33s) (3000 30%) 1.5727
0m 45s (- 1m 25s) (3500 35%) 1.2864
0m 51s (- 1m 17s) (4000 40%) 1.0840
0m 57s (- 1m 10s) (4500 45%) 1.0335
1m 3s (- 1m 3s) (5000 50%) 0.8380
1m 8s (- 0m 56s) (5500 55%) 0.7393
1m 14s (- 0m 49s) (6000 60%) 0.6310
1m 21s (- 0m 43s) (6500 65%) 0.5857
1m 26s (- 0m 37s) (7000 70%) 0.5558
1m 33s (- 0m 31s) (7500 75%) 0.4851
1m 39s (- 0m 24s) (8000 80%) 0.4266
1m 45s (- 0m 18s) (8500 85%) 0.3661
1m 51s (- 0m 12s) (9000 90%) 0.3370
1m 57s (- 0m 6s) (9500 95%) 0.3480
2m 3s (- 0m 0s) (10000 100%) 0.3373


## Evaluation

In [None]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_sent = pre_process([sentence])[0]
        sent_index = [word_to_ix[word] for word in input_sent]  # raises KeyError for out-of-vocabulary words
        input_tensor = torch.LongTensor([[ind] for ind in sent_index]).to(device)
        input_length = input_tensor.size()[0]

        encoder_hiddens = torch.zeros(max_length, encoder.hidden_size, device=device)
        encoder_hidden = encoder.initHidden()

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
            encoder_hiddens[ei] += encoder_hidden[0, 0]

        decoded_words = []
        decoder_input = torch.tensor([[0]], device=device)
        decoder_hidden = encoder_hidden

        for di in range(max_length):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_hiddens)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == 1: # '<EOS>'
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(word_list[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words
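`evaluate` looks every token up in `word_to_ix` directly, so a question containing a word that never appeared in the training data raises a `KeyError`. One possible guard, sketched here under the assumption that unknown words are simply dropped, is to split the tokens before indexing. `split_known` and `toy_word_to_ix` are hypothetical names, not part of the lab code.

```python
# Hypothetical toy vocabulary for illustration.
toy_word_to_ix = {'<BOS>': 0, '<EOS>': 1, 'are': 2, 'you': 3, 'busy': 4}

def split_known(tokens, to_ix):
    """Separate tokens into those present in the vocabulary and those missing."""
    known = [t for t in tokens if t in to_ix]
    unknown = [t for t in tokens if t not in to_ix]
    return known, unknown

known, unknown = split_known(['are', 'you', 'sleepy'], toy_word_to_ix)
indices = [toy_word_to_ix[t] for t in known]

print(indices)
print(unknown)
```

Dropping words changes the input sentence, so another common choice is to reserve a dedicated unknown-word index in the vocabulary; that would require retraining the model with that token included.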

In [None]:
sentence1 = 'Are you in love with me?'
sentence2 = 'Who do you love'
sentence3 = 'Are you busy?'
sentence4 = "You're the best"

print(evaluate(encoder1, attn_decoder1, sentence1, max_length=MAX_LENGTH))
print(evaluate(encoder1, attn_decoder1, sentence2, max_length=MAX_LENGTH))
print(evaluate(encoder1, attn_decoder1, sentence3, max_length=MAX_LENGTH))
print(evaluate(encoder1, attn_decoder1, sentence4, max_length=MAX_LENGTH))

['i', 'hear', 'love', 'is', 'lovely', '<EOS>']
['i', 'hear', 'love', 'is', 'lovely', '<EOS>']
['i', 'am', 'here', '<EOS>']
['thanks', 'you', 'are', 'pretty', 'cool', 'yourself', '<EOS>']


# Exercise
Please change the following **Dot Product** attention into **Scale Dot Product** attention.

**Dot Product:**

![Dot_Product](https://drive.google.com/uc?id=1QtBgCp53e_6A_vzaMFEo89GJbTxnXagJ)

**Scale Dot Product:**

![Scale_Dot_Product](https://drive.google.com/uc?id=1v6n9WChBVfy0mBG2yxK9MUvGKzVGCmOt)
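
For reference, the two scoring functions are commonly written as shown below. This is a sketch that assumes the figures above use the standard definitions; the symbols $h_t$ (current decoder hidden state), $h_s$ (encoder hidden state at source position $s$), and $n$ (hidden size) are notation chosen here and are not taken from the lab.

```latex
\mathrm{score}_{\text{dot}}(h_t, h_s) = h_t^{\top} h_s
\qquad
\mathrm{score}_{\text{scaled}}(h_t, h_s) = \frac{h_t^{\top} h_s}{\sqrt{n}}
```

In both cases the attention weights are the softmax of the scores over all source positions.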


In [None]:
class AttnDecoderRNN(nn.Module):
    ATTN_TYPE_DOT_PRODUCT = 'Dot Product'
    ATTN_TYPE_SCALE_DOT_PRODUCT = 'Scale Dot Product'

    def __init__(self, hidden_size, output_size, embedding, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = embedding
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size*2, self.output_size)


    def cal_attention(self, hidden, encoder_hiddens, method):
        if method == AttnDecoderRNN.ATTN_TYPE_DOT_PRODUCT:
            # bmm: https://pytorch.org/docs/master/generated/torch.bmm.html
            attn_weights = F.softmax(torch.bmm(hidden, encoder_hiddens.T.unsqueeze(0)),dim=-1)
            attn_output = torch.bmm(attn_weights, encoder_hiddens.unsqueeze(0))
            concat_output = torch.cat((attn_output[0], hidden[0]), 1)

        elif method == AttnDecoderRNN.ATTN_TYPE_SCALE_DOT_PRODUCT:
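            # COMPLETE THIS PART - Scale Dot Product calculation method
            # Compute attn_weights, attn_output, and concat_output here,
            # then remove the line below.
            raise NotImplementedError('Scale Dot Product attention is left as the exercise')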

        return concat_output

    def forward(self, input, hidden, encoder_hiddens):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        _, hidden = self.gru(embedded, hidden)

        ## The following attention score calculation method is Dot Product for now
        ## Please change it into Scale Dot Product calculation method
        concat_output = self.cal_attention(hidden, encoder_hiddens, AttnDecoderRNN.ATTN_TYPE_DOT_PRODUCT)
        # concat_output = self.cal_attention(hidden, encoder_hiddens, AttnDecoderRNN.ATTN_TYPE_SCALE_DOT_PRODUCT)

        output = F.log_softmax(self.out(concat_output), dim=1)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
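
To see what the scaling factor changes, the sketch below compares softmax weights computed from raw and scaled dot-product scores. It is a standalone illustration with random toy vectors, assuming the usual scaling by the square root of the hidden size. The names `query` and `keys` are chosen here and do not appear in the lab code.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, length = 50, 5                 # hidden size as in the lab, toy sequence length
query = torch.randn(1, hidden_size)         # stands in for the decoder hidden state
keys = torch.randn(length, hidden_size)     # stands in for the encoder hidden states

raw_scores = query @ keys.T                             # (1, length)
scaled_scores = raw_scores / math.sqrt(hidden_size)     # divide by sqrt(hidden size)

raw_weights = F.softmax(raw_scores, dim=-1)
scaled_weights = F.softmax(scaled_scores, dim=-1)

# Largest attention weight with and without scaling.
print(raw_weights.max().item(), scaled_weights.max().item())
```

Dividing the scores by a value greater than 1 before the softmax can only flatten the distribution, so the largest weight never increases. This keeps attention from collapsing onto a single position when dot products grow with the hidden size.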