# Assignment 4 (100 pts, BONUS: 10 pts) - Neural Models
---

In this assignment, you will learn about text classification and language modeling using RNN and LSTM, and use **pytorch** — a deep learning framework.

## Problem 1. Neural Models (35 pts)
---

### 1. (2 pts) In a news framing classification task, where you have 5 frames and your model predicts each of the frames with equal probability for an article, what is the cross entropy loss of the article in this case?

-> answer

### 2. (2 pts) Suppose during training of your neural model you realize that your training loss remains high. Mention some of the ways you can reduce this **underfitting** of your neural network.

-> answer

### 3. (2 pts) After you do many changes to your neural network, you now realize that your training loss is much lower than your validation loss. Mention some of the ways you can reduce this **overfitting** of your neural network.

-> answer

### 4. (2 pts) What is good about setting a large batch size for training? How about a small batch size?

-> answer

### 5. (4 pts) How can an RNN be used for detecting toxic spans (spans of words containing toxic language) in a social media comment? Specifically, what should be the input to the RNN at each time step t? How many outputs (i.e., $\hat{y}$) are produced given a comment containing $n$ words? What is each $\hat{y}^{(t)}$ a probability distribution over?

-> answer

### 6. (4 pts) How about using RNNs for language modeling? Given a start word token as input at time step 1, what should be the input to the RNN at each time step t > 1? How many outputs are produced? What is each $\hat{y}^{(t)}$ a probability distribution over?

-> answer

### 7. (3 pts) How about using RNNs for frame classification? Given an article containing $n$ words as input, what should be the input to the RNN at each time step t? How many outputs are produced? What is each $\hat{y}^{(t)}$ a probability distribution over?

-> answer

### 8. (2 pts) What is the main advantage of using RNNs for frame classification over feed forward neural network?

-> answer

### 9. (4 pts) What is the disadvantage of RNN when used to classify the sentiment of a very long tweet like this? “I am not sure I want this phone. It’s too big to fit in my back pocket. I put it in and accidentally sat on it and now it’s bent. I’m very disappointed. I’m now the proud owner of bendy iPhone13. Very proud.” What is the appropriate sentiment for this tweet? And what would the RNN classify it as?

-> answer

### 10. How about LSTM? Given this formulation of LSTM: $ f_t = \sigma(W_fx_t + U_fh_{t - 1} + b_f) \text{ (forget gate)}, i_t = \sigma(W_ix_t + U_ih_{t - 1} + b_i) \text{ and } \hat{C}_t = \tanh(W_Cx_t + U_Ch_{t - 1} + b_C) \text{ (input gate)}, C_t = f_t * C_{t - 1} + i_t * \hat{C}_t \text{ (update gate), and } \omicron_t = \sigma(W_{\omicron}x_t + U_{\omicron}h_{t - 1} + b_{\omicron}) \text{ and } h_t = \omicron_t * \tanh(C_t) $

(a) (4 pts) Derive the formulation of $\frac{\partial{J}}{U_C}$ , where J is the loss function, for two time steps $t$ and $t − 1$ in terms of $\frac{\partial{J}}{\partial{h_t}}$, $\frac{\partial{h_t}}{\partial{C_t}}$, $\frac{\partial{C_t}}{\partial{U_C}}$, $\frac{\partial{C_t}}{\partial{C_{t - 1}}}$, $\frac{\partial{C_{t - 1}}}{\partial{U_C}}$, $\frac{\partial{h_t}}{\partial{h_{t - 1}}}$, and $\frac{\partial{h_{t - 1}}}{\partial{U_C}}$

-> answer

(b) (2 pts) Which part of $\frac{\partial{J}}{U_C}$ reduces the effect of the vanishing gradient problem in RNNs?

-> answer

(c)  (2 pts) How does this help classify the correct sentiment of the tweet above?

-> answer

(d) (2 pts) Instead of using the last hidden state of LSTM to classify the tweet, what other ways we can do to improve the performance of this sentiment classification?

-> answer

## Problem 2. LSTM for language modeling (33 pts)
---

In [1]:
import re
import os
import random
import nltk

nltk.download('punkt')

from collections import Counter
import numpy as np
import pandas as pd
import torch
import torchtext
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, TensorDataset, random_split

seed = 42
random.seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
torch.manual_seed(seed)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)

path = '/kaggle/input/cs505-hw4-data/'

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Device: cuda


### 1. (10 pts) Follow the tutorial in [here](https://www.analyticsvidhya.com/blog/2020/08/build-a-natural-language-generation-nlg-system-using-pytorch/) to train a word-level LSTM language modeling. Train the language model on texts from the file prideAndPrejudice.txt. Before using it to train the language model, you need to first sentence-segment, then tokenize, then lower case each line of the file using Spacy. Append start-of-sentence token $\text{'<s>'}$ and end-of-sentence $\text{'</s>'}$ token to each **sentence** and put each sentence in its own line. Use only words that appear more than once in this corpus and assign UNK tokens for the rest; you may also need to pad sentences that are shorter than 5 (see [here](https://github.com/gabrielloye/LSTM_Sentiment-Analysis/blob/master/main.ipynb) in cell 10-12 for adding unknown: UNK token and padding: PAD token to your vocabulary). Train the language model and save the trained model (see [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html)). Generate 10 examples of text from it, starting from $\text{'<s>'}$ token and ending at $\text{'</s>'}$ token.

In [2]:
text = []
with open(path + 'prideAndPrejudice.txt') as f:
    for line in f:
        text.append(line.strip())

text = ' '.join(text)
text_tokens = []
for sentence in nltk.sent_tokenize(text):
    text_tokens.append(['<s>'] + nltk.word_tokenize(sentence) + ['</s>'])


In [3]:
def get_dict(text_tokens, appear_time=1):
    vocab = Counter(sum(text_tokens, []))

    # Removing the words that only appear once
    vocab = {k: v for k, v in vocab.items() if v > appear_time}

    # Sorting the words according to the number of appearances, with the most common word being first
    vocab = sorted(vocab, key=vocab.get, reverse=True)

    # Adding padding and unknown to our vocabulary so that they will be assigned an index
    vocab = ['<pad>', '<unk>'] + vocab

    # Dictionaries to store the word to index mappings and vice versa
    word2idx = {o: i for i, o in enumerate(vocab)}
    idx2word = {i: o for i, o in enumerate(vocab)}

    return word2idx, idx2word


word2idx, idx2word = get_dict(text_tokens)

In [4]:
def createSequences(tokens, sequenceLength=5):
    sequences = []
    if len(tokens) > sequenceLength:
        for i in range(0, len(tokens) - sequenceLength):
            # select sequence of tokens
            sequence = tokens[i:i + sequenceLength]
            # add to the list
            sequences.append(' '.join(sequence))

        return sequences
    else:
        sequence = tokens[:]
        # pad sequence to 5
        for i in range(len(tokens), sequenceLength):
            sequence.append('<pad>')
        return [' '.join(sequence)]


def get_integer_seq(sequence):
    return [
        word2idx[w] if w in word2idx.keys() else word2idx['<unk>']
        for w in sequence.split()
    ]


class SequenceDataset(Dataset):
    def __init__(self, text_tokens, sequenceLength=5):
        seqs = [
            createSequences(tokens, sequenceLength) for tokens in text_tokens
        ]

        # merge list-of-lists into a single list
        seqs = sum(seqs, [])

        # create inputs and targets (x and y)
        x, y = [], []

        for s in seqs:
            x.append(' '.join(s.split()[:-1]))
            y.append(' '.join(s.split()[1:]))

        # convert text sequences to integer sequences
        x = [get_integer_seq(i) for i in x]
        y = [get_integer_seq(i) for i in y]

        # convert lists to numpy arrays
        self.x = np.array(x)
        self.y = np.array(y)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


class WordLSTM(nn.Module):
    def __init__(self,
                 vocab_size=len(word2idx),
                 embed_dim=200,
                 n_hidden=256,
                 n_layers=4,
                 drop_prob=0.3):
        super().__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden

        self.emb_layer = nn.Embedding(vocab_size, embed_dim)

        ## define the LSTM
        self.lstm = nn.LSTM(embed_dim,
                            n_hidden,
                            n_layers,
                            dropout=drop_prob,
                            batch_first=True)

        ## define a dropout layer
        self.dropout = nn.Dropout(drop_prob)

        ## define the fully-connected layer
        self.fc = nn.Linear(n_hidden, vocab_size)

    def forward(self, x, hidden):
        ''' Forward pass through the network. 
            These inputs are x, and the hidden/cell state `hidden`. '''

        ## pass input through embedding layer
        embedded = self.emb_layer(x)

        ## Get the outputs and the new hidden state from the lstm
        if hidden != None:
            lstm_output, hidden = self.lstm(embedded, hidden)
        else:
            lstm_output, hidden = self.lstm(embedded)

        ## pass through a dropout layer
        out = self.dropout(lstm_output)

        #out = out.contiguous().view(-1, self.n_hidden)
        out = out.reshape(-1, self.n_hidden)

        ## put "out" through the fully-connected layer
        out = self.fc(out)

        # return the final output and the hidden state
        return out, hidden

    def init_hidden(self, batch_size):
        ''' initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data

        hidden = (weight.new(self.n_layers, batch_size,
                             self.n_hidden).zero_().to(device),
                  weight.new(self.n_layers, batch_size,
                             self.n_hidden).zero_().to(device))

        return hidden


def train(network, data, epochs=5, clip=1, interval=800, sequenceLength=5):

    network = network.to(device)
    loss_function = nn.CrossEntropyLoss()
    loss_function = loss_function.to(device)
    optimiser = torch.optim.Adam(network.parameters(), lr=1e-4)

    network.train()
    min_loss = np.inf

    for epoch in range(epochs):
        epoch_loss = []
        _ = network.init_hidden(batch_size)

        for i, (x, y) in enumerate(data):
            inputs, targets = x.long().to(device), y.long().to(device)
            output, _ = network(inputs, None)
            loss = loss_function(output, targets.view(-1))

            network.zero_grad()
            loss.backward()
            epoch_loss.append(loss.item())
            nn.utils.clip_grad_norm_(network.parameters(), clip)
            optimiser.step()

            if i % interval == 0:
                print('Epoch: {}/{}, {}%, Loss: {}'.format(
                    epoch + 1, epochs, round(i / len(data) * 100, 2), loss),
                      end='\r')

        mean_epoch_loss = np.mean(epoch_loss)
        print('Epoch: {}/{}, Train Loss: {}'.format(epoch + 1, epochs,
                                                    mean_epoch_loss))

        if mean_epoch_loss < min_loss:
            if not os.path.exists('./models'):
                os.mkdir('./models')
            torch.save(network.state_dict(),
                       './models/bestModel_' + str(sequenceLength) + '.pt')
            print('Best model saved\n')
            min_loss = mean_epoch_loss


In [5]:
batch_size = 32
train_data_5 = DataLoader(SequenceDataset(text_tokens),
                          batch_size=batch_size,
                          shuffle=True)

network = WordLSTM()
print('\033[1mNetwork:\033[0m\n{}'.format(network))

print('\n\033[1mTraining...\033[0m')
train(network, data=train_data_5)


[1mNetwork:[0m
WordLSTM(
  (emb_layer): Embedding(4152, 200)
  (lstm): LSTM(200, 256, num_layers=4, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=4152, bias=True)
)

[1mTraining...[0m
Epoch: 1/5, Train Loss: 6.065429639391079
Best model saved

Epoch: 2/5, Train Loss: 5.610063239419536
Best model saved

Epoch: 3/5, Train Loss: 5.2835640301066595
Best model saved

Epoch: 4/5, Train Loss: 5.067543165000381
Best model saved

Epoch: 5/5, Train Loss: 4.9195671284426545
Best model saved



In [6]:
def top_k_logits(logits, k):
    value, _ = torch.topk(logits, k)
    logits[logits < value[:, [-1]]] = -float('Inf')
    return logits


# predict next token
@torch.no_grad()
def predict(network, word, hidden, top_k, sample=True):
    word = word if word in word2idx else '<unk>'

    # tensor inputs
    x = np.array([[word2idx[word]]])
    inputs = torch.tensor(x).to(device)

    # detach hidden state from history
    hidden = tuple([i.data for i in hidden])

    # get the output of the network
    logits, hidden = network(inputs, hidden)

    # get the token probabilities
    logits = top_k_logits(logits, top_k)

    # apply softmax to convert to probabilities
    probs = F.softmax(logits, dim=-1)

    # sample from the distribution or take the most likely
    idx = torch.multinomial(probs, num_samples=1)

    # return the encoded value of the predicted word and the hidden state
    return idx2word[idx.item()], hidden


# function to generate text
def sample(network, max_step=30, context='<s>', top_k=20, sample=True):
    # push to GPU
    network.to(device)
    network.eval()

    # batch size is 1
    hidden = network.init_hidden(1)
    tokens = context.split()

    # predict subsequent tokens
    for _ in range(max_step):
        token, hidden = predict(network, tokens[-1], hidden, top_k)
        tokens.append(token)
        if token == '<\s>':
            return ' '.join(tokens)

    return ' '.join(tokens)

In [7]:
generate_sent = []
for _ in range(10):
    generate_sent.append(sample(network))

if not os.path.exists('./sentences'):
    os.mkdir('./sentences')

with open('./sentences/generate_sent_1.txt', 'w') as f:
    for sent in generate_sent:
        f.write('{}\n'.format(sent))


### 2. (7 pts) Compute and report the perplexity of the saved model on *test_1.txt* file. Note that the test files are already pre-processed.

In [8]:
def get_prob(network, word, hidden, target):
    word = word if word in word2idx else '<unk>'
    target = target if target in word2idx else '<unk>'

    x = np.array([[word2idx[word]]])
    y = np.array([[word2idx[target]]])

    inputs = torch.tensor(x).to(device)
    hidden = tuple([i.data for i in hidden])

    logits, hidden = network(inputs, hidden)
    probs = F.softmax(logits, dim=-1)
    prob = probs[0, y].item()
    return prob, hidden


@torch.no_grad()
def compute_perplex(network, test_tokens):
    perplexity = []

    network.to(device)
    network.eval()

    for tokens in test_tokens:
        sent_perplex = 0
        hidden = network.init_hidden(1)

        for i in range(len(tokens) - 1):
            prob, hidden = get_prob(network, tokens[i], hidden, tokens[i + 1])
            sent_perplex += -np.log(prob)

        perplexity.append(sent_perplex / len(tokens))

    test_perplexity = np.exp(np.mean(perplexity))
    print('Testing perplexity: {}'.format(test_perplexity))
    return test_perplexity

In [9]:
seqLengthPerplexity = {}
test_tokens = []
with open(path + 'test_1.txt') as f:
    for line in f:
        test_tokens.append(line.strip().split())

seqLengthPerplexity['5'] = compute_perplex(network, test_tokens)

Testing perplexity: 140.56896915784304


### 3. (5 pts) Train the language model as before, but with input sequence lengths of 25 (currently, it’s inputs are of length 5). You may need to pad some of the shorter sentences to length 25. Save your trained model. Generate 10 examples of text from it, starting from $\text{'<s>'}$ token and ending at $\text{'</s>'}$ token. Are there differences from the generated examples from 2.1?

In [10]:
train_data_25 = DataLoader(SequenceDataset(text_tokens, sequenceLength=25),
                           batch_size=batch_size,
                           shuffle=True)

network = WordLSTM()
print('\033[1mNetwork:\033[0m\n{}'.format(network))

print('\n\033[1mTraining...\033[0m')
train(network, data=train_data_25, sequenceLength=25)

print('\n\033[1mGenerating Sentences...\033[0m')
generate_sent = []
for _ in range(10):
    generate_sent.append(sample(network))

with open('./sentences/generate_sent_2.txt', 'w') as f:
    for sent in generate_sent:
        f.write('{}\n'.format(sent))


[1mNetwork:[0m
WordLSTM(
  (emb_layer): Embedding(4152, 200)
  (lstm): LSTM(200, 256, num_layers=4, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=4152, bias=True)
)

[1mTraining...[0m
Epoch: 1/5, Train Loss: 6.026442472838835
Best model saved

Epoch: 2/5, Train Loss: 5.869402631903962
Best model saved

Epoch: 3/5, Train Loss: 5.668060594622424
Best model saved

Epoch: 4/5, Train Loss: 5.214953619522063
Best model saved

Epoch: 5/5, Train Loss: 4.964131011390816
Best model saved


[1mGenerating Sentences...[0m


### 4. (2 pts) Compute and report the perplexity of this saved model on *test_1.txt* file.

In [11]:
seqLengthPerplexity['25'] = compute_perplex(network, test_tokens)

Testing perplexity: 138.0627337030298


### 5. (2 pts) Use the better language model (the one with the lower perplexity on *test_1.txt*) to compute and report the perplexity on *test_2.txt*. Note that the test files are already pre-processed.

In [12]:
test_tokens_2 = []
with open(path + 'test_2.txt') as f:
    for line in f:
        test_tokens_2.append(line.strip().split())

network.load_state_dict(
    torch.load('./models/bestModel_' +
               min(seqLengthPerplexity, key=seqLengthPerplexity.get) + '.pt',
               map_location=device))

_ = compute_perplex(network, test_tokens_2)

Testing perplexity: 226.2459985376151


### 6. (5 pts) Train the better language model as before but start with pre-trained [Glove6B 100d embeddings](https://nlp.stanford.edu/projects/glove/) (see [here](https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76) on how to incorporate pretrained embeddings in your LSTM model). This time, use all your words in the corpus as vocabulary, even those occurring only once in the corpus. Only assign UNK token to words that are not in Glove vocabulary and initialize random vectors in the embedding matrix for the UNK, $\text{'<s>'}$, $\text{'</s>'}$, and PAD tokens using a standard gaussian distribution with σ set to 0.6 (see [numpy.random.normal](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html)). Save your trained model. Generate 10 examples of text from it, starting from $\text{'<s>'}$ token and ending at $\text{'</s>'}$ token. Are there differences from the generated examples from before?

In [13]:
embed_dim = 100
Glove = torchtext.vocab.GloVe(name='6B',
                              dim=embed_dim,
                              cache='./vectorEmbeddings')

pretrained_embed = np.zeros((len(word2idx), embed_dim))
for i, word in enumerate(word2idx.keys()):
    if word in Glove.stoi:
        pretrained_embed[i] = Glove[word]
    else:
        pretrained_embed[i] = np.random.normal(scale=0.6, size=(embed_dim, ))


./vectorEmbeddings/glove.6B.zip: 862MB [02:40, 5.36MB/s]                           
100%|█████████▉| 399999/400000 [00:16<00:00, 24394.56it/s]


In [14]:
network = WordLSTM(embed_dim=embed_dim)
print('\033[1mNetwork:\033[0m\n{}'.format(network))

network.emb_layer.weight.data.copy_(torch.from_numpy(pretrained_embed))
network.emb_layer.weight.requires_grad = True

print('\n\033[1mTraining...\033[0m')
train(network, data=train_data_5)


[1mNetwork:[0m
WordLSTM(
  (emb_layer): Embedding(4152, 100)
  (lstm): LSTM(100, 256, num_layers=4, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=4152, bias=True)
)

[1mTraining...[0m
Epoch: 1/5, Train Loss: 6.096155865177227
Best model saved

Epoch: 2/5, Train Loss: 5.78182130971532
Best model saved

Epoch: 3/5, Train Loss: 5.41578022683502
Best model saved

Epoch: 4/5, Train Loss: 5.217180793300556
Best model saved

Epoch: 5/5, Train Loss: 5.0735737828540195
Best model saved



In [15]:
generate_sent = []
for _ in range(10):
    generate_sent.append(sample(network))

with open('./sentences/generate_sent_3.txt', 'w') as f:
    for sent in generate_sent:
        f.write('{}\n'.format(sent))


### 7. (2 pts) Compute and report the perplexity of this saved model on *test_1.txt* file.

In [16]:
_ = compute_perplex(network, test_tokens)

Testing perplexity: 158.16866387530243


## Problem 3. LSTM for classification (32 pts, BONUS: 10 pts)
---

In [17]:
import pandas as pd
import numpy as np
import re
import os
import random
import nltk
from tqdm import tqdm

nltk.download('punkt')

from collections import Counter
import torch
import torchtext
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, TensorDataset, random_split

seed = 42
random.seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
torch.manual_seed(seed)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)
path = '/kaggle/input/cs505-hw4-data/'

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Device: cuda


In [18]:
df_train, df_test = pd.read_csv(path + 'sentiment-train.csv'), pd.read_csv(
    path + 'sentiment-test.csv')

X_train, Y_train = df_train['text'].to_list(), df_train['sentiment'].to_list()
X_test, Y_test = df_test['text'].to_list(), df_test['sentiment'].to_list()

### 1. (5 pts) Follow the tutorial [here](https://github.com/gabrielloye/LSTM_Sentiment-Analysis/blob/master/main.ipynb) on how to build LSTM model for sentiment classification. Modify the tutorial to train on your tweet sentiment data (sentiment-train.csv) and test on test data (*sentiment-test.csv*) from HW2 (modify the tutorial so that the train data is **not** split into train and validation). Compute and report the accuracy on the test data.

In [19]:
# Defining a function that either shortens sentences or pads sentences with 0 to a fixed length
def pad_input(sentences, sequenceLength):
    features = np.zeros((len(sentences), sequenceLength), dtype=int)
    for i, review in enumerate(sentences):
        if len(review) > 0:
            features[i, :len(review)] = np.array(review)[:sequenceLength]
    return features


def process(train_sentences,
            train_labels,
            test_sentences,
            test_labels,
            padding_length=30):

    # Remove any digits
    for i in range(len(train_sentences)):
        train_sentences[i] = re.sub('\d', '0', train_sentences[i])
    for i in range(len(test_sentences)):
        test_sentences[i] = re.sub('\d', '0', test_sentences[i])

    # Modify URLs to <url>
    for i in range(len(train_sentences)):
        if 'www.' in train_sentences[i] or 'http:' in train_sentences[
                i] or 'https:' in train_sentences[
                    i] or '.com' in train_sentences[i]:
            train_sentences[i] = re.sub(r'([^ ]+(?<=\.[a-z]{3}))', '<url>',
                                        train_sentences[i])
    for i in range(len(test_sentences)):
        if 'www.' in test_sentences[i] or 'http:' in test_sentences[
                i] or 'https:' in test_sentences[
                    i] or '.com' in test_sentences[i]:
            test_sentences[i] = re.sub(r'([^ ]+(?<=\.[a-z]{3}))', '<url>',
                                       test_sentences[i])

    vocab = Counter(
    )  # Dictionary that will map a word to the number of times it appeared in all the training sentences
    train_tokens = []
    for sentence, _ in zip(train_sentences, tqdm(range(len(train_sentences)))):
        # The sentences will be stored as a list of words/tokens
        tokens = []
        for word in nltk.word_tokenize(sentence):  # Tokenizing the words
            vocab.update([word.lower()
                          ])  # Converting all the words to lower case
            tokens.append(word)
        train_tokens.append(tokens)

    # Removing the words that only appear once
    vocab = {k: v for k, v in vocab.items() if v > 1}
    # Sorting the words according to the number of appearances, with the most common word being first
    vocab = sorted(vocab, key=vocab.get, reverse=True)
    # Adding padding and unknown to our vocabulary so that they will be assigned an index
    vocab = ['<pad>', '<unk>'] + vocab
    # Dictionaries to store the word to index mappings and vice versa
    word2idx = {o: i for i, o in enumerate(vocab)}
    idx2word = {i: o for i, o in enumerate(vocab)}

    train_idx = []
    test_idx = []
    for i, tokens in enumerate(train_tokens):
        # Looking up the mapping dictionary and assigning the index to the respective words
        train_idx.append([
            word2idx[word] if word in word2idx else word2idx['<unk>']
            for word in tokens
        ])

    for i, sentence in enumerate(test_sentences):
        # For test sentences, we have to tokenize the sentences as well
        test_idx.append([
            word2idx[word.lower()]
            if word.lower() in word2idx else word2idx['<unk>']
            for word in nltk.word_tokenize(sentence)
        ])

    seq_len = padding_length  # The length that the sentences will be padded/shortened to
    train_idx = pad_input(train_idx, seq_len)
    test_idx = pad_input(test_idx, seq_len)

    # Converting our labels into numpy arrays
    train_labels = np.array(train_labels)
    test_labels = np.array(test_labels)

    train_data = TensorDataset(torch.from_numpy(train_idx),
                               torch.from_numpy(train_labels))
    test_data = TensorDataset(torch.from_numpy(test_idx),
                              torch.from_numpy(test_labels))

    return word2idx, idx2word, train_data, test_data

In [20]:
word2idx, idx2word, train_data, test_data = process(X_train, Y_train, X_test,
                                                    Y_test)


100%|█████████▉| 59999/60000 [00:12<00:00, 4632.79it/s]


In [21]:
batch_size = 32
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size)

In [22]:
class SentimentNet(nn.Module):
    def __init__(self,
                 model,
                 bidirectional,
                 vocab_size=len(word2idx),
                 output_size=1,
                 embedding_dim=400,
                 hidden_dim=512,
                 n_layers=2):

        super(SentimentNet, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if model == 'lstm':
            self.rnn = nn.LSTM(embedding_dim,
                               hidden_dim,
                               n_layers,
                               dropout=0.5,
                               batch_first=True,
                               bidirectional=bidirectional)
        elif model == 'gru':
            self.rnn = nn.GRU(embedding_dim,
                              hidden_dim,
                              n_layers,
                              dropout=0.5,
                              batch_first=True,
                              bidirectional=bidirectional)

        self.dropout = nn.Dropout(0.2)
        if bidirectional:
            self.fc = nn.Linear(hidden_dim * 2, output_size)
        else:
            self.fc = nn.Linear(hidden_dim, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, hidden):
        x = x.long()
        embeds = self.embedding(x)
        out, hidden = self.rnn(embeds, hidden)
        out = out[:, -1, :]
        out = self.dropout(out)
        out = self.fc(out)
        out = self.sigmoid(out)
        return out, hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size,
                             self.hidden_dim).zero_().to(device),
                  weight.new(self.n_layers, batch_size,
                             self.hidden_dim).zero_().to(device))
        return hidden


In [23]:
def train(network,
          train_loader,
          valid_loader=None,
          epochs=5,
          clip=5,
          interval=500,
          validation_accuracy=[],
          verbose=True):

    optimiser = torch.optim.Adam(network.parameters(), lr=0.005)
    loss_function = nn.BCELoss()
    network.to(device)

    loss_min = np.inf
    max_acc = 0
    for epoch in range(epochs):
        network.train()
        train_loss = []
        _ = network.init_hidden(batch_size)

        for i, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.long().to(device), labels.to(device)

            output, _ = network(inputs, None)
            loss = loss_function(output.squeeze(-1), labels.float())
            train_loss.append(loss.item())

            network.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(network.parameters(), clip)
            optimiser.step()

            if i % interval == 0 and verbose:
                print('Epoch: {}/{}, {}%, Loss: {}'.format(
                    epoch + 1, epochs, round(i / len(train_loader) * 100, 2),
                    loss),
                      end='\r')
        epoch_loss = np.mean(train_loss)
        if verbose:
            print('\rEpoch: {}/{}, Training Loss: {}'.format(
                epoch + 1, epochs, epoch_loss))

        if valid_loader != None:
            network.eval()
            valid_loss = []
            num_correct = 0
            for i, (inputs, labels) in enumerate(valid_loader):
                inputs, labels = inputs.long().to(device), labels.to(device)
                output, _ = network(inputs, None)
                loss = loss_function(output.squeeze(-1), labels.float())
                valid_loss.append(loss.item())
                pred = torch.round(output.squeeze(-1))
                correct = pred.eq(labels.float()).cpu().numpy()
                num_correct += np.sum(correct)

            if verbose:
                print('Epoch: {}/{}, Validation Loss: {}'.format(
                    epoch + 1, epochs, np.mean(valid_loss)),
                      end='\r')
            valid_acc = num_correct / len(valid_loader.dataset)
            max_acc = max(max_acc, valid_acc)
            if verbose:
                print('Epoch: {}/{}, Validation Accuracy: {:.3f}%'.format(
                    epoch + 1, epochs, valid_acc * 100),
                      end='\r')

        if epoch_loss < loss_min:
            torch.save(network.state_dict(), 'bestnetwork_sentiment.pt')
            if verbose:
                print('Best network saved\n')
            loss_min = epoch_loss

    if valid_loader != None:
        validation_accuracy.append(max_acc)

In [24]:
def test(network, test_loader):
    loss_function = nn.BCELoss()
    test_loss = []
    num_correct = 0
    _ = network.init_hidden(batch_size)

    network.eval()
    for inputs, labels in test_loader:
        inputs, labels = inputs.long().to(device), labels.to(device)

        output, _ = network(inputs, None)
        loss = loss_function(output.squeeze(-1), labels.float())
        test_loss.append(loss.item())
        pred = torch.round(output.squeeze(-1))

        correct = pred.eq(labels.float()).cpu().numpy()
        num_correct += np.sum(correct)

    print('Testing loss: {:.3f}'.format(np.mean(test_loss)))
    test_acc = num_correct / len(test_loader.dataset)
    print('Testing accuracy: {:.3f}%'.format(test_acc * 100))
    return test_acc


In [25]:
bestNetwork = {}
network = SentimentNet(model='lstm', bidirectional=False)
print('\033[1mNetwork:\033[0m\n{}'.format(network))

print('\n\033[1mTraining...\033[0m')
train(network=network, train_loader=train_loader)

print('\n\033[1mTesting...\033[0m')
bestNetwork[('lstm', False)] = test(network, test_loader)

[1mNetwork:[0m
SentimentNet(
  (embedding): Embedding(19959, 400)
  (rnn): LSTM(400, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

[1mTraining...[0m
Epoch: 1/5, Training Loss: 0.5949169172445933
Best network saved

Epoch: 2/5, Training Loss: 0.4922956688086192
Best network saved

Epoch: 3/5, Training Loss: 0.45235573514302574
Best network saved

Epoch: 4/5, Training Loss: 0.4324088364283244
Best network saved

Epoch: 5/5, Training Loss: 0.42326335916519164
Best network saved


[1mTesting...[0m
Testing loss: 0.509
Testing accuracy: 75.209%


### 2. (2 pts) Modify the model from 3.1 to use GRU. Compute and report the accuracy on the test data.

In [26]:
network = SentimentNet(model='gru', bidirectional=False)
print('\033[1mNetwork:\033[0m\n{}'.format(network))

print('\n\033[1mTraining...\033[0m')
train(network=network, train_loader=train_loader)

print('\n\033[1mTesting...\033[0m')
bestNetwork[('gru', False)] = test(network, test_loader)

[1mNetwork:[0m
SentimentNet(
  (embedding): Embedding(19959, 400)
  (rnn): GRU(400, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

[1mTraining...[0m
Epoch: 1/5, Training Loss: 0.6132331674893697
Best network saved

Epoch: 2/5, Training Loss: 0.5983517466068268
Best network saved

Epoch: 3/5, Training Loss: 0.6192516043821971
Epoch: 4/5, Training Loss: 0.6375154728571574
Epoch: 5/5, Training Loss: 0.629681940428416

[1mTesting...[0m
Testing loss: 0.641
Testing accuracy: 63.788%


### 3. (5 pts) Modify the model from 3.1 to use bidirectional LSTM. Compute and report the accuracy on the test data.

In [27]:
network = SentimentNet(model='lstm', bidirectional=True)
print('\033[1mNetwork:\033[0m\n{}'.format(network))

print('\n\033[1mTraining...\033[0m')
train(network=network, train_loader=train_loader)

print('\n\033[1mTesting...\033[0m')
bestNetwork[('lstm', True)] = test(network, test_loader)

[1mNetwork:[0m
SentimentNet(
  (embedding): Embedding(19959, 400)
  (rnn): LSTM(400, 512, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=1024, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

[1mTraining...[0m
Epoch: 1/5, Training Loss: 0.6184585534413656
Best network saved

Epoch: 2/5, Training Loss: 0.511397004421552
Best network saved

Epoch: 3/5, Training Loss: 0.48436869036356606
Best network saved

Epoch: 4/5, Training Loss: 0.4665720727602641
Best network saved

Epoch: 5/5, Training Loss: 0.4656448189576467
Best network saved


[1mTesting...[0m
Testing loss: 0.611
Testing accuracy: 73.816%


### 4. (2 pts) Modify the model from 3.1 to use bidirectional GRU. Compute and report the accuracy on the test data.

In [28]:
network = SentimentNet(model='gru', bidirectional=True)
print('\033[1mNetwork:\033[0m\n{}'.format(network))

print('\n\033[1mTraining...\033[0m')
train(network=network, train_loader=train_loader)

print('\n\033[1mTesting...\033[0m')
bestNetwork[('gru', True)] = test(network, test_loader)

[1mNetwork:[0m
SentimentNet(
  (embedding): Embedding(19959, 400)
  (rnn): GRU(400, 512, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=1024, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

[1mTraining...[0m
Epoch: 1/5, Training Loss: 0.6442613546848297
Best network saved

Epoch: 2/5, Training Loss: 0.611183275492986
Best network saved

Epoch: 3/5, Training Loss: 0.6187502720673879
Epoch: 4/5, Training Loss: 0.652715049346288
Epoch: 5/5, Training Loss: 0.6816603932221731

[1mTesting...[0m
Testing loss: 0.887
Testing accuracy: 59.889%


### 5. (5 pts) Pick the best model so far and train the model starting from pretrained [GloveTwitter100d](https://nlp.stanford.edu/projects/glove/) (use the same vocabulary as before, just initialize the embedding of the words using Glove embeddings). Compute and report the accuracy on the test data.

In [29]:
embed_dim = 100
Glove = torchtext.vocab.GloVe(name='twitter.27B',
                              dim=embed_dim,
                              cache='./vectorEmbeddings')

pretrained_embed = np.zeros((len(word2idx), embed_dim))
for i, word in enumerate(word2idx.keys()):
    if word in Glove.stoi:
        pretrained_embed[i] = Glove[word]
    else:
        pretrained_embed[i] = np.random.normal(scale=0.6, size=(embed_dim, ))


./vectorEmbeddings/glove.twitter.27B.zip: 1.52GB [04:48, 5.27MB/s]                            
100%|█████████▉| 1193513/1193514 [00:47<00:00, 24975.50it/s]


In [30]:
model, bidirectional = max(bestNetwork, key=bestNetwork.get)
print('\033[1mBest network:\033[0m', model, bidirectional)

network = SentimentNet(model=model,
                       embedding_dim=embed_dim,
                       bidirectional=bidirectional)
print('\033[1mNetwork:\033[0m\n{}'.format(network))

network.embedding.weight.data.copy_(torch.from_numpy(pretrained_embed))
network.embedding.weight.requires_grad = True

print('\n\033[1mTraining...\033[0m')
train(network=network, train_loader=train_loader)

print('\n\033[1mTesting...\033[0m')
_ = test(network, test_loader)

[1mBest network:[0m lstm False
[1mNetwork:[0m
SentimentNet(
  (embedding): Embedding(19959, 100)
  (rnn): LSTM(100, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

[1mTraining...[0m
Epoch: 1/5, Training Loss: 0.6902948512713114
Best network saved

Epoch: 2/5, Training Loss: 0.6298595042228698
Best network saved

Epoch: 3/5, Training Loss: 0.5164655623277028
Best network saved

Epoch: 4/5, Training Loss: 0.4631766457716624
Best network saved

Epoch: 5/5, Training Loss: 0.42771753466924034
Best network saved


[1mTesting...[0m
Testing loss: 0.526
Testing accuracy: 76.602%


### 6. (10 pts) Using your best model so far, conduct a 5-fold (stratified) cross validation on your training data and a grid search to pick the best hidden size (try 128 or 512) and embedding size (try 100 or 400). Compute and report the average accuracies for each of the choice combination.

In [31]:
def splitData(dataset, ratio=0.8, batch_size=32):
    train_size = int(ratio * len(dataset))
    train_data, valid_data = random_split(
        dataset, [train_size, len(dataset) - train_size])

    train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
    valid_loader = DataLoader(valid_data, shuffle=False, batch_size=batch_size)

    return train_loader, valid_loader

In [32]:
bestParams = {}
for hidden_dim in [128, 512]:
    for embedding_dim in [100, 400]:
        print('\033[1mHidden Dim:\033[0m', hidden_dim,
              '\033[1mEmbedding Dim:\033[0m', embedding_dim)

        validation_accuracy = []
        for k in range(5):
            print('\rFold: {}'.format(k + 1), end='')

            train_loader, valid_loader = splitData(train_data)

            network = SentimentNet(embedding_dim=embedding_dim,
                                   hidden_dim=hidden_dim,
                                   model=model,
                                   bidirectional=bidirectional)

            train(network=network,
                  train_loader=train_loader,
                  valid_loader=valid_loader,
                  validation_accuracy=validation_accuracy,
                  verbose=False)

        average_accuracy = np.mean(validation_accuracy)
        print('\r\033[1mAverage validation accuracy:\033[0m {:.3f}%'.format(
            average_accuracy * 100))
        bestParams[(hidden_dim, embedding_dim)] = average_accuracy


[1mHidden Dim:[0m 128 [1mEmbedding Dim:[0m 100
[1mAverage validation accuracy:[0m 76.168%
[1mHidden Dim:[0m 128 [1mEmbedding Dim:[0m 400
[1mAverage validation accuracy:[0m 75.772%
[1mHidden Dim:[0m 512 [1mEmbedding Dim:[0m 100
[1mAverage validation accuracy:[0m 74.240%
[1mHidden Dim:[0m 512 [1mEmbedding Dim:[0m 400
[1mAverage validation accuracy:[0m 75.462%


### 7. (3 pts) Train the model on all your training data using the best combination of hyperparameters you find in 3.6. Compute and report the accuracy on the test data.

In [33]:
batch_size = 32
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size)

In [34]:
hidden_dim, embedding_dim = max(bestParams, key=bestParams.get)

network = SentimentNet(embedding_dim=embedding_dim,
                       hidden_dim=hidden_dim,
                       model=model,
                       bidirectional=bidirectional)
print('\033[1mNetwork:\033[0m\n{}'.format(network))

print('\n\033[1mTraining...\033[0m')
train(network=network, train_loader=train_loader)

print('\n\033[1mTesting...\033[0m')
_ = test(network, test_loader)

[1mNetwork:[0m
SentimentNet(
  (embedding): Embedding(19959, 100)
  (rnn): LSTM(100, 128, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

[1mTraining...[0m
Epoch: 1/5, Training Loss: 0.6691614678064982
Best network saved

Epoch: 2/5, Training Loss: 0.5128658341169358
Best network saved

Epoch: 3/5, Training Loss: 0.4514279352426529
Best network saved

Epoch: 4/5, Training Loss: 0.415068078037103
Best network saved

Epoch: 5/5, Training Loss: 0.38777630302906035
Best network saved


[1mTesting...[0m
Testing loss: 0.479
Testing accuracy: 76.602%


### 8. (BONUS: 10 pts) Train your best model using the best hyperparameters from 3.6 on all the [sentiment140 data](http://help.sentiment140.com/for-students/). Compute and report the accuracy on the test data from HW2 (i.e., *sentiment-test.csv*).

In [35]:
df = pd.read_csv(path + 'training.1600000.processed.noemoticon.csv',
                 encoding='ISO-8859-1',
                 header=None)
df = df.drop(columns=[1, 2, 3, 4])
df.columns = ['Label', 'Text']
df['Label'] = df['Label'].apply(lambda label: 1 if label == 4 else 0)
X_train, Y_train = df['Text'].to_list(), df['Label'].to_list()

In [36]:
word2idx, idx2word, train_data, test_data = process(X_train, Y_train, X_test,
                                                    Y_test)


100%|█████████▉| 1599999/1600000 [05:53<00:00, 4532.18it/s]


In [37]:
batch_size = 32
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size)

In [38]:
network = SentimentNet(vocab_size=len(word2idx),
                       embedding_dim=embedding_dim,
                       hidden_dim=hidden_dim,
                       model=model,
                       bidirectional=bidirectional)
print('\033[1mNetwork\033[0m:\n{}'.format(network))

print('\n\033[1mTraining...\033[0m')
train(network=network, train_loader=train_loader, epochs=1, interval=5000)

print('\n\033[1mTesting...\033[0m')
_ = test(network, test_loader)

[1mNetwork[0m:
SentimentNet(
  (embedding): Embedding(247693, 100)
  (rnn): LSTM(100, 128, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

[1mTraining...[0m
Epoch: 1/1, Training Loss: 0.477465318351686
Best network saved


[1mTesting...[0m
Testing loss: 0.418
Testing accuracy: 81.337%
