# Lab 7: RNNs & Word Embeddings

In [25]:
__author__ = "Ren Yi"
__version__ = "BMSC-GA 4493/BMIN-GA 3007, NYU, Spring 2019"

In [1]:
import re
import os
import time
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

## Goal:
- Understand the mechanics of RNNs in PyTorch
- Train RNN based neural networks on text data
- Basics of word embedding and how to use them

## Problem Setup

### Dataset
Download the two files in the data folder [here](https://drive.google.com/drive/folders/1KBUyfU87zz8eOZwr2ifDi2Z4LBHlSZ28?usp=sharing). Save the folder in the same directory as this notebook.

For the first part, we will be using the [First GOP Debate Twitter Sentiment dataset](https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras/data), which contains tweets posted after the first GOP debate along with their sentiment labels (among other fields).

In [2]:
np.random.seed(1111)

df = pd.read_csv('data/Sentiment.csv')
df.head()

Unnamed: 0,id,candidate,candidate_confidence,relevant_yn,relevant_yn_confidence,sentiment,sentiment_confidence,subject_matter,subject_matter_confidence,candidate_gold,...,relevant_yn_gold,retweet_count,sentiment_gold,subject_matter_gold,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,1,No candidate mentioned,1.0,yes,1.0,Neutral,0.6578,None of the above,1.0,,...,,5,,,RT @NancyLeeGrahn: How did everyone feel about...,,2015-08-07 09:54:46 -0700,629697200650592256,,Quito
1,2,Scott Walker,1.0,yes,1.0,Positive,0.6333,None of the above,1.0,,...,,26,,,RT @ScottWalker: Didn't catch the full #GOPdeb...,,2015-08-07 09:54:46 -0700,629697199560069120,,
2,3,No candidate mentioned,1.0,yes,1.0,Neutral,0.6629,None of the above,0.6629,,...,,27,,,RT @TJMShow: No mention of Tamir Rice and the ...,,2015-08-07 09:54:46 -0700,629697199312482304,,
3,4,No candidate mentioned,1.0,yes,1.0,Positive,1.0,None of the above,0.7039,,...,,138,,,RT @RobGeorge: That Carly Fiorina is trending ...,,2015-08-07 09:54:45 -0700,629697197118861312,Texas,Central Time (US & Canada)
4,5,Donald Trump,1.0,yes,1.0,Positive,0.7045,None of the above,1.0,,...,,156,,,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,,2015-08-07 09:54:45 -0700,629697196967903232,,Arizona


Let's first look at some basic stats of the data

In [3]:
pd.DataFrame(df.groupby('sentiment').count()['text'])

| sentiment | text |
|---|---|
| Negative | 8493 |
| Neutral | 3142 |
| Positive | 2236 |


For simplicity:
- We only use the ```text``` column as the input ```X``` and the ```sentiment``` column as the label ```y```.
- We only look at positive (1) and negative (0) tweets.

In [4]:
df = df[['sentiment', 'text']]
# .copy() avoids pandas' SettingWithCopyWarning when the column is overwritten below
df = df[df['sentiment'] != 'Neutral'].copy()
df['sentiment'] = [1 if s == "Positive" else 0 for s in df['sentiment']]
df.groupby('sentiment').count()

| sentiment | text |
|---|---|
| 0 | 8493 |
| 1 | 2236 |


In [5]:
train_data, test_data = train_test_split(df, test_size=0.10, random_state=42)
train_data.groupby('sentiment').count().apply(lambda x: 100 * x / float(x.sum()))

| sentiment | text (%) |
|---|---|
| 0 | 79.152858 |
| 1 | 20.847142 |
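
The class imbalance matters when reading the accuracy numbers later in this lab: a classifier that always predicts the negative class already scores about 79%. The short check below uses the counts from the filtered dataset; the variable names are only illustrative.

```python
# Class counts after dropping neutral tweets (from the table above)
n_negative = 8493
n_positive = 2236

# Accuracy of a trivial classifier that always predicts "negative" (label 0)
majority_baseline = n_negative / (n_negative + n_positive)
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

An accuracy near this value, paired with an AUC near 0.5, suggests the model has mostly learned to predict the majority class.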


In [17]:
train_X, train_y = train_data['text'], train_data['sentiment']
test_X, test_y = test_data['text'], test_data['sentiment']

### Input representations

#### Build vocabulary
We need to build a vocabulary using words in our training data. Any words in the test set that are not in our vocabulary will be replaced with an ```<UNK>``` token. We will also add a ```<PAD>``` token as padding.

For computational purposes, we'll only take words that appeared more than 3 times.

In [18]:
UNK = "<UNK>"
PAD = "<PAD>"
def build_vocab(sentences, min_count=3, max_vocab=None):
    """
    Build vocabulary from sentences (list of strings)
    """
    # keep track of the number of appearance of each word
    word_count = Counter()
    
    for s in sentences:
        word_count.update(re.findall(r"[\w']+|[.,!?;]", s.lower()))
    
    # keep words seen more than min_count times, then append the special tokens
    vocabulary = [w for w in word_count if word_count[w] > min_count] + [UNK, PAD]
    indices = dict(zip(vocabulary, range(len(vocabulary))))

    return vocabulary, indices
    
vocabulary, vocab_indices = build_vocab(train_X)
print(len(vocabulary))

3113
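
To see what the tokenizer pattern in ```build_vocab``` keeps and drops, here is a minimal sketch on a made-up sentence. The helper name ```tokenize``` is an assumption for illustration; the regular expression is the one used above.

```python
import re
from collections import Counter

# Same pattern as in build_vocab: runs of word characters or apostrophes,
# or a single punctuation mark from the set . , ! ? ;
TOKEN_PATTERN = r"[\w']+|[.,!?;]"

def tokenize(sentence):
    """Lowercase a sentence and split it into tokens (illustrative helper)."""
    return re.findall(TOKEN_PATTERN, sentence.lower())

example = "This is great - let's have a bunch of #GOPDebates!"
tokens = tokenize(example)
print(tokens)

# Counter is what build_vocab uses to decide which words are frequent enough
counts = Counter(tokens)
print(counts["great"])
```

Note that characters outside the pattern, such as ```-``` and ```#```, are silently dropped, so a hashtag and the plain word map to the same token.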


In [24]:
# Training data is a string of words
train_X.iloc[0]

"This is great - let's have a bunch of rich entitled MEN make decisions about #PlannedParenthood #GOPDebates"

#### Word representations
Next, we need to convert each word/token in the sentences into its index in the vocabulary so that PyTorch can use it. We also pad (or truncate) our sentences to a fixed length of 25 tokens so that we can do batch processing. We do this for both the train and test sets.

In [25]:
def sentences_to_padded_index_sequences(words, sentences, pad_length=100):
    padded_sequences = np.zeros((len(sentences), pad_length))
    for i, s in enumerate(sentences):
        indices = np.ones(pad_length) * words['<PAD>']
        # only take the first pad_length tokens
        token_indices = np.array([words[w] if w in words else words['<UNK>'] for w in re.findall(r"[\w']+|[.,!?;]", s.lower())[:pad_length]])
        indices[:len(token_indices)] = token_indices
        padded_sequences[i] = indices
    return padded_sequences

In [26]:
train_X = sentences_to_padded_index_sequences(vocab_indices, train_data['text'], 25)
test_X = sentences_to_padded_index_sequences(vocab_indices, test_data['text'], 25)

In [27]:
# Training data is now an array of indices
train_X[0]

array([  0.00000000e+00,   1.00000000e+00,   2.00000000e+00,
         3.00000000e+00,   4.00000000e+00,   5.00000000e+00,
         6.00000000e+00,   7.00000000e+00,   8.00000000e+00,
         3.11100000e+03,   9.00000000e+00,   1.00000000e+01,
         1.10000000e+01,   1.20000000e+01,   1.30000000e+01,
         1.40000000e+01,   3.11200000e+03,   3.11200000e+03,
         3.11200000e+03,   3.11200000e+03,   3.11200000e+03,
         3.11200000e+03,   3.11200000e+03,   3.11200000e+03,
         3.11200000e+03])
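
The same index-and-pad logic can be sketched on a toy vocabulary in plain Python, which makes the two special cases (unknown words and sentences longer than the pad length) easy to check by hand. The vocabulary and the helper names below are made up for illustration.

```python
def pad_or_truncate(token_ids, pad_id, pad_length):
    """Return a list of exactly pad_length ids: clip long inputs, pad short ones."""
    clipped = token_ids[:pad_length]
    return clipped + [pad_id] * (pad_length - len(clipped))

def encode(tokens, vocab, pad_length):
    """Map tokens to ids (unknown tokens go to <UNK>), then pad or truncate."""
    ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
    return pad_or_truncate(ids, vocab["<PAD>"], pad_length)

# Hypothetical four-entry vocabulary
toy_vocab = {"great": 0, "debate": 1, "<UNK>": 2, "<PAD>": 3}

short = encode(["great", "debate"], toy_vocab, 5)
long = encode(["great", "unseen", "debate", "great", "debate", "great"], toy_vocab, 5)
print(short)
print(long)
```

In the notebook output above, the indices are printed as floats because ```np.zeros``` defaults to ```float64```; they are cast to integers later in ```TweetDataset```.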

## Model Time

In [19]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score
from torch.utils.data import DataLoader, Dataset

### DataLoader

In [20]:
class TweetDataset(Dataset):
    def __init__(self, sentences, labels):
        self.sentences = sentences.astype(int)
        self.labels = np.array(labels).astype(int)
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, key):
        return (torch.LongTensor(self.sentences[key]), self.labels[key])

BATCH_SIZE = 32
train_loader = DataLoader(TweetDataset(train_X, train_y),
                          batch_size=BATCH_SIZE,
                          shuffle=True)
test_loader = DataLoader(TweetDataset(test_X, test_y),
                          batch_size=BATCH_SIZE,
                          shuffle=True)

### Train and validation loop

In [21]:
def train(model, train_loader=train_loader, test_loader=test_loader, 
          learning_rate=0.001, num_epoch=10, print_every=100):
    # Training steps
    start_time = time.time()
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for epoch in range(num_epoch):
        # switch back to training mode; the evaluation step below sets eval mode
        model.train()
        for i, (data, labels) in enumerate(train_loader):
            outputs = model(data)
            model.zero_grad()
            loss = loss_fn(outputs.squeeze(), labels)
            loss.backward()
            optimizer.step()

            # report performance
            if (i + 1) % print_every == 0:
                print('Train set | epoch: {:3d} | {:6d}/{:6d} batches | Loss: {:6.4f}'.format(
                    epoch, i + 1, len(train_loader), loss.item()))

        # Evaluate after every epoch
        correct = 0
        total = 0
        model.eval()

        predictions = []
        truths = []

        with torch.no_grad():
            for i, (data, labels) in enumerate(test_loader):
                outputs = model(data).squeeze()
                # predicted class = index of the larger of the two logits
                pred = outputs.max(1)[1]
                predictions += list(pred.numpy())
                truths += list(labels.numpy())
                total += labels.size(0)
                correct += (pred == labels).sum()
                
            acc = (100 * correct / total)
            auc = roc_auc_score(truths, predictions)
            elapse = time.strftime('%H:%M:%S', time.gmtime(int((time.time() - start_time))))
            print('Test set | Accuracy: {:6.4f} | AUC: {:4.2f} | time elapse: {:>9}'.format(
                acc, auc, elapse))
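
One caveat about the evaluation above: ```roc_auc_score``` is given hard class predictions (0 or 1), whereas AUC is usually computed from continuous scores so that the ranking of examples is preserved. The sketch below is a plain-Python pairwise AUC on made-up labels and scores, meant only to show the difference; it is not the code used in this lab, and the AUC targets quoted in this lab refer to the hard-label version.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of the positive class
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.4, 0.3, 0.45]

# Thresholding at 0.5 turns every prediction into class 0 here
hard = [1 if s > 0.5 else 0 for s in scores]

auc_scores = roc_auc(labels, scores)
auc_hard = roc_auc(labels, hard)
print(auc_scores, auc_hard)
```

With these numbers the scores rank five of the six positive/negative pairs correctly, while the thresholded predictions carry no ranking information at all.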

For this lab, we will be exploring two variants of RNN: vanilla (or Elman) RNN and LSTM (Long Short-Term Memory). In the following code block, please try to define your own model. Here are some hints.

- Each input word is represented by a vector of dimension ```embedding_dim```. Check out ```nn.Embedding``` to see how to initialize embeddings randomly.
- Your model should take the following input parameters
    - ```hidden_dim```: The number of features in the hidden state h of your RNN layer
    - ```output_dim```: Number of output classes
    - ```vocab_size```: Size of your vocabulary
    - ```embedding_dim```: Dimension of word embeddings
- Your model should consist of an RNN layer (you can use either ```nn.RNN``` or ```nn.LSTM```) followed by a linear layer.
- $h_{0}$ (and $c_{0}$ if you use LSTM) should be initialized as a zero vector of dimension ```hidden_dim```. You might want to check out ```nn.Parameter```.
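
For intuition, the vanilla (Elman) recurrence computes $h_t = \tanh(W_{ih} x_t + W_{hh} h_{t-1} + b)$ at each time step, starting from $h_0 = 0$. The sketch below unrolls it in plain Python with made-up weights and tiny dimensions; it only illustrates the arithmetic and is not how ```nn.RNN``` is implemented (PyTorch keeps two separate bias vectors, merged here into a single ```b```).

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def elman_step(x_t, h_prev, W_ih, W_hh, b):
    """One step of h_t = tanh(W_ih x_t + W_hh h_prev + b)."""
    pre = [a + c + d for a, c, d in zip(matvec(W_ih, x_t), matvec(W_hh, h_prev), b)]
    return [math.tanh(p) for p in pre]

# Toy sizes and weights (all values are hypothetical)
embedding_dim, hidden_dim = 3, 2
W_ih = [[0.1, 0.0, -0.2], [0.3, 0.1, 0.0]]   # hidden_dim x embedding_dim
W_hh = [[0.5, 0.0], [0.0, 0.5]]              # hidden_dim x hidden_dim
b = [0.0, 0.0]

# A "sentence" of three word vectors
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

h = [0.0] * hidden_dim          # h_0 is the zero vector
first_h = elman_step(sequence[0], h, W_ih, W_hh, b)
for x_t in sequence:
    h = elman_step(x_t, h, W_ih, W_hh, b)
print(h)  # last hidden state, the input to the final linear layer
```

The last hidden state has ```hidden_dim``` entries regardless of sentence length, which is why a single linear layer can sit on top of it.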

### RNN

In [22]:
class RNN(nn.Module):
    def __init__(self, hidden_dim, output_dim, 
                 vocab_size, embedding_dim, rnn='LSTM'):
        super(RNN, self).__init__()
        
        self.emb = nn.Embedding(vocab_size, embedding_dim, padding_idx=vocab_size-1)
        self.hidden_dim = hidden_dim
        self.rnn_fn = rnn
        assert self.rnn_fn in ['LSTM', 'RNN']
        self.rnn = getattr(nn, rnn)(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def init_hidden(self, batch_size):
        """
        Initialize the hidden state values
        """
        hidden = nn.Parameter(torch.zeros(1, batch_size, self.hidden_dim))
        if self.rnn_fn == 'LSTM':
            c = nn.Parameter(torch.zeros(1, batch_size, self.hidden_dim))
            return hidden, c
        return hidden
        
    def forward(self, x):
        x = self.emb(x)
        
        _, last_hidden = self.rnn(x, self.init_hidden(x.size()[0]))
        if self.rnn_fn == 'LSTM':
            last_hidden = last_hidden[0]
        out = self.fc(last_hidden)
        return out

Run the code block below to check your model performance. Using the parameters provided, you should be able to get about 0.6 AUC using vanilla RNN or about 0.7 AUC using LSTM after 10 training epochs.

In [23]:
torch.manual_seed(111)
rnn_model = RNN(40, 2, len(vocabulary), 50, rnn='RNN')
train(rnn_model)

Train set | epoch:   0 |    100/   302 batches | Loss: 0.5581
Train set | epoch:   0 |    200/   302 batches | Loss: 0.3743
Train set | epoch:   0 |    300/   302 batches | Loss: 0.3921
Test set | Accuracy: 79.0000 | AUC: 0.52 | time elapse:  00:00:03
Train set | epoch:   1 |    100/   302 batches | Loss: 0.4053
Train set | epoch:   1 |    200/   302 batches | Loss: 0.3868
Train set | epoch:   1 |    300/   302 batches | Loss: 0.3857
Test set | Accuracy: 79.0000 | AUC: 0.52 | time elapse:  00:00:06
Train set | epoch:   2 |    100/   302 batches | Loss: 0.3922
Train set | epoch:   2 |    200/   302 batches | Loss: 0.3411
Train set | epoch:   2 |    300/   302 batches | Loss: 0.4422
Test set | Accuracy: 79.0000 | AUC: 0.52 | time elapse:  00:00:09
Train set | epoch:   3 |    100/   302 batches | Loss: 0.5017
Train set | epoch:   3 |    200/   302 batches | Loss: 0.2832
Train set | epoch:   3 |    300/   302 batches | Loss: 0.3281
Test set | Accuracy: 80.0000 | AUC: 0.55 | time elapse:  0

In [24]:
torch.manual_seed(111)
lstm_model = RNN(40, 2, len(vocabulary), 50, rnn='LSTM')
train(lstm_model)

Train set | epoch:   0 |    100/   302 batches | Loss: 0.4903
Train set | epoch:   0 |    200/   302 batches | Loss: 0.6375
Train set | epoch:   0 |    300/   302 batches | Loss: 0.3719
Test set | Accuracy: 79.0000 | AUC: 0.51 | time elapse:  00:00:05
Train set | epoch:   1 |    100/   302 batches | Loss: 0.6755
Train set | epoch:   1 |    200/   302 batches | Loss: 0.4191
Train set | epoch:   1 |    300/   302 batches | Loss: 0.4653
Test set | Accuracy: 82.0000 | AUC: 0.62 | time elapse:  00:00:10
Train set | epoch:   2 |    100/   302 batches | Loss: 0.5240
Train set | epoch:   2 |    200/   302 batches | Loss: 0.3050
Train set | epoch:   2 |    300/   302 batches | Loss: 0.4367
Test set | Accuracy: 82.0000 | AUC: 0.66 | time elapse:  00:00:15
Train set | epoch:   3 |    100/   302 batches | Loss: 0.2876
Train set | epoch:   3 |    200/   302 batches | Loss: 0.3122
Train set | epoch:   3 |    300/   302 batches | Loss: 0.2083
Test set | Accuracy: 82.0000 | AUC: 0.71 | time elapse:  0

### Model predictions

In [135]:
def test_sentence(sentence, model):
    model.eval()
    test_tensor = torch.LongTensor(sentences_to_padded_index_sequences(vocab_indices, [sentence]).astype(int))
    score = model(test_tensor).data.numpy().squeeze()
    label = np.argmax(score)
    
    return ("positive" if label == 1 else "negative", score[label])

In [219]:
test_sentence("Enjoyed the #GOPDebates and am looking forward to the #DemocraticDebates next.", lstm_model)

('positive', 1.3321191)

In [215]:
test_sentence("Donald Trump is a really nasty piece of work. Hope he disappears quickly. #GOPDebate", lstm_model)

('negative', 1.9322957)
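
The second value returned by ```test_sentence``` is the raw output of the final linear layer (a logit), not a probability, which is why it can exceed 1. A softmax over the two logits turns them into class probabilities. The sketch below uses made-up logits, since ```test_sentence``` only returns the larger one.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical [negative, positive] logits for one sentence
logits = [-1.2, 1.3]
probs = softmax(logits)
print(probs)
```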

## Word Embeddings and How to Use Them

When using deep learning methods on NLP tasks, we usually rely on [word embeddings](https://en.wikipedia.org/wiki/Word_embedding). Briefly, a word embedding represents each word, or token, in a vocabulary as a distributed numerical vector. There are many methods for obtaining word embeddings; some of the best known are Word2Vec, GloVe, and ELMo. General-purpose embeddings trained with one of these methods on a massive amount of data are easy to find online. It is usually a good idea to use these pre-trained embeddings to save yourself some time and computing resources.

In this lab, we will be using the [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) developed at Stanford, one of the state-of-the-art word embeddings. Please download the file ```glove.6B.50d.txt``` [here](https://drive.google.com/file/d/1JweINiA5JvTNLTm663LH8OdWssK2Kcid/view?usp=sharing) and place it in a ```glove.6B``` folder next to this notebook, which is the path the loading code below expects.

In [173]:
import numpy as np
from tqdm import tqdm
# load embedding
emb_dim = 50
with open('glove.6B/glove.6B.50d.txt') as f:
    glove_embedding = []
    words = {}
    chars = {}
    idx2words = {}
    ordered_words = []

    for i, line in tqdm(enumerate(f)):
        s = line.split()
        glove_embedding.append(np.asarray(s[1:]))
        
        words[s[0]] = len(words)
        idx2words[i] = s[0]
        ordered_words.append(s[0])
        
# add unknown to word and char
glove_embedding.append(np.random.rand(emb_dim))
words["<UNK>"] = len(words)

# add padding
glove_embedding.append(np.zeros(emb_dim))
words["<PAD>"] = len(words)

chars["<UNK>"] = len(chars)
chars["<PAD>"] = len(chars)

glove_embedding = np.array(glove_embedding).astype(float)

400000it [00:06, 62159.87it/s]


Now we have three variables:
- ```glove_embedding``` of shape [400002, 50] (the 400,000 GloVe tokens plus ```<UNK>``` and ```<PAD>```), consisting of the actual vectors,
- ```words```, a dictionary mapping each token in the vocabulary to its row in ```glove_embedding```, and
- ```idx2words```, a dictionary mapping each row index in ```glove_embedding``` back to its word.
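
Each line of the GloVe text file is a token followed by its vector components, separated by spaces. The sketch below parses two made-up 3-dimensional lines from an in-memory buffer to show the resulting structures; the names ```word_to_row``` and ```vectors``` are illustrative and the numbers are not real GloVe values.

```python
import io

# Two fake lines in the same "token v1 v2 ... vd" layout as the GloVe file
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat -0.4 0.5 0.6\n"
)

word_to_row = {}   # token -> row index
vectors = []       # row index -> list of floats

for line in sample:
    parts = line.split()
    word_to_row[parts[0]] = len(vectors)
    vectors.append([float(v) for v in parts[1:]])

print(word_to_row)
print(vectors[word_to_row["cat"]])
```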

### Word embedding vectors

Now we can play around with these vectors to get a sense of how word embeddings can be used to represent words. Here's how you can look up a word embedding vector.

In [146]:
glove_embedding[words['this']]

array([  5.30740000e-01,   4.01170000e-01,  -4.07850000e-01,
         1.54440000e-01,   4.77820000e-01,   2.07540000e-01,
        -2.69510000e-01,  -3.40230000e-01,  -1.08790000e-01,
         1.05630000e-01,  -1.02890000e-01,   1.08490000e-01,
        -4.96810000e-01,  -2.51280000e-01,   8.40250000e-01,
         3.89490000e-01,   3.22840000e-01,  -2.27970000e-01,
        -4.43420000e-01,  -3.16490000e-01,  -1.24060000e-01,
        -2.81700000e-01,   1.94670000e-01,   5.55130000e-02,
         5.67050000e-01,  -1.74190000e+00,  -9.11450000e-01,
         2.70360000e-01,   4.19270000e-01,   2.02790000e-02,
         4.04050000e+00,  -2.49430000e-01,  -2.04160000e-01,
        -6.27620000e-01,  -5.47830000e-02,  -2.68830000e-01,
         1.84440000e-01,   1.82040000e-01,  -2.35360000e-01,
        -1.61550000e-01,  -2.76550000e-01,   3.55060000e-02,
        -3.82110000e-01,  -7.51340000e-04,  -2.48220000e-01,
         2.81640000e-01,   1.28190000e-01,   2.87620000e-01,
         1.44400000e-01,

### Find similar words

The word embedding vectors can help us find words with similar meanings. Word similarities can be measured by [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). The function below looks up the most similar words to a given word:

In [159]:
def find_nearest(ref_vec, words, embedding, topk=10):
    """
    Finds the top-k words whose vectors are most similar to ref_vec in terms of cosine similarity in the given embedding
    :param ref_vec: reference word vector
    :param words: dict, word to its index in the embedding
    :param embedding: numpy array of shape [V, embedding_dim]
    :param topk: number of top candidates to return
    :return a list of top-k most similar words
    """
    # compute cosine similarities between the reference vector and every row of the embedding
    scored_words = cosine_similarity(ref_vec.reshape(1,-1), embedding)[0]
    
    # sort the words by similarity and return the topk
    sorted_words = np.argsort(-scored_words)
    
    return [(idx2words[w], scored_words[w]) for w in sorted_words[:topk]]

In [160]:
find_nearest(glove_embedding[words['hate']], words, glove_embedding, topk=5)

[('hate', 0.99999999999999978),
 ('hatred', 0.77468372337488278),
 ('shame', 0.74895365817045212),
 ('racist', 0.73715591114403145),
 ('anyone', 0.73647167276271064)]
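
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. A minimal plain-Python version on made-up 2-dimensional vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, different length
c = [0.0, 3.0]   # orthogonal to a

sim_ab = cosine(a, b)
sim_ac = cosine(a, c)
print(sim_ab, sim_ac)
```

This also explains the first row of the output above: a word compared with itself has similarity 1, up to floating-point rounding.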

### Word arithmetic

In [161]:
find_nearest(glove_embedding[words['worse']] - glove_embedding[words['better']] + glove_embedding[words['best']],
            words, glove_embedding, topk=1)

[('worst', 0.81096602138267371)]

In [162]:
find_nearest(glove_embedding[words['king']] - glove_embedding[words['queen']] + glove_embedding[words['woman']],
            words, glove_embedding, topk=1)

[('man', 0.87060674388747061)]
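
The idea behind these analogies is that a relationship such as gender shows up as a roughly constant offset between vectors. The toy sketch below uses hand-made 2-dimensional vectors (first axis "royalty", second axis "gender"); these are assumptions chosen to make the arithmetic exact and are not GloVe values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings: [royalty, gender]
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

# king - queen + woman
query = [k - q + w for k, q, w in zip(toy["king"], toy["queen"], toy["woman"])]

# Pick the word whose vector points in the most similar direction
best = max(toy, key=lambda word: cosine(query, toy[word]))
print(query, best)
```

With real embeddings the match is only approximate, and it is common to exclude the query words themselves from the candidates.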

### Train an LSTM model with GloVe embedding

Complete the code below. Replace the randomly generated embeddings with GloVe embeddings. (Hint: check out ```nn.Embedding.weight```). Using the parameters provided, you should be able to get about 0.75 AUC using GloVe embeddings after 10 training epochs.

In [179]:
# Re-indexing tokens
train_X_glove = sentences_to_padded_index_sequences(words, train_data['text'], 25)
test_X_glove = sentences_to_padded_index_sequences(words, test_data['text'], 25)

train_loader_glove = DataLoader(TweetDataset(train_X_glove, train_y),
                                batch_size=BATCH_SIZE,
                                shuffle=True)
test_loader_glove = DataLoader(TweetDataset(test_X_glove, test_y),
                               batch_size=BATCH_SIZE,
                               shuffle=True)

In [178]:
torch.manual_seed(111)
glove_model = RNN(40, 2, len(glove_embedding), 50, rnn='LSTM')
glove_model.emb.weight.data.copy_(torch.from_numpy(glove_embedding))
train(glove_model, train_loader=train_loader_glove, test_loader=test_loader_glove)

Train set | epoch:   0 |    100/   302 batches | Loss: 0.4807
Train set | epoch:   0 |    200/   302 batches | Loss: 0.3671
Train set | epoch:   0 |    300/   302 batches | Loss: 0.6753
Test set | Accuracy: 80.0000 | AUC: 0.53 | time elapse:  00:00:57
Train set | epoch:   1 |    100/   302 batches | Loss: 0.6563
Train set | epoch:   1 |    200/   302 batches | Loss: 0.2139
Train set | epoch:   1 |    300/   302 batches | Loss: 0.3274
Test set | Accuracy: 80.0000 | AUC: 0.75 | time elapse:  00:01:53
Train set | epoch:   2 |    100/   302 batches | Loss: 0.2582
Train set | epoch:   2 |    200/   302 batches | Loss: 0.1890
Train set | epoch:   2 |    300/   302 batches | Loss: 0.2146
Test set | Accuracy: 84.0000 | AUC: 0.69 | time elapse:  00:02:49
Train set | epoch:   3 |    100/   302 batches | Loss: 0.1729
Train set | epoch:   3 |    200/   302 batches | Loss: 0.1464
Train set | epoch:   3 |    300/   302 batches | Loss: 0.2360
Test set | Accuracy: 84.0000 | AUC: 0.76 | time elapse:  0