## NLP Homework 1

### IMDB Movie Review Sentiment Analysis

#### Data Preprocessing

First, load the libraries and read in the datasets.

In [1]:
import random
import spacy
import string
import os


In [2]:
filepath = "aclImdb"
train_p_name = os.listdir(filepath + "/train/pos")
train_n_name = os.listdir(filepath + "/train/neg")
test_p_name = os.listdir(filepath + "/test/pos")
test_n_name = os.listdir(filepath + "/test/neg")

In [3]:
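def readtxtfile(filepath, filenames):
    # read every file named in `filenames` (relative to `filepath`) into a list of strings
    filelist = []
    for name in filenames:
        # the context manager closes the file even if reading fails
        with open(filepath + name, "r") as file:
            filelist.append(file.read())
    return filelist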

In [4]:
train_pos = readtxtfile(filepath + "/train/pos/", train_p_name)

In [5]:
len(train_pos)

12500
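`os.listdir` returns names in arbitrary order, so which review ends up at which index can differ between machines. The sketch below shows one way to make the order reproducible by sorting the file names. The helper `read_reviews`, the temporary directory, and the two sample files are hypothetical stand-ins so the snippet runs without the dataset, and the UTF-8 encoding is an assumption.

```python
import tempfile
from pathlib import Path

def read_reviews(directory):
    """Return the text of every .txt file in `directory`, in sorted filename order."""
    reviews = []
    for path in sorted(Path(directory).glob("*.txt")):
        # assumption: the review files are UTF-8 encoded
        reviews.append(path.read_text(encoding="utf-8"))
    return reviews

# Tiny stand-in for a folder such as aclImdb/train/pos
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1_8.txt").write_text("Loved it.", encoding="utf-8")
    Path(tmp, "0_9.txt").write_text("Great film.", encoding="utf-8")
    demo_reviews = read_reviews(tmp)

print(demo_reviews)   # ['Great film.', 'Loved it.']
```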

In [6]:
train_pos[5]

'I saw the movie with two grown children. Although it was not as clever as Shrek, I thought it was rather good. In a movie theatre surrounded by children who were on spring break, there was not a sound so I know the children all liked it. There parents also seemed engaged. The death and apparent death of characters brought about the appropriate gasps and comments. Hopefully people realize this movie was made for kids. As such, it was successful although I liked it too. Personally I liked the Scrat!!'

In [7]:
train_neg = readtxtfile(filepath + "/train/neg/", train_n_name)
test_pos = readtxtfile(filepath + "/test/pos/", test_p_name)
test_neg = readtxtfile(filepath + "/test/neg/", test_n_name)

In [8]:
train_label_p = [1] * len(train_pos)
train_label_n = [0] * len(train_neg)
test_label_p = [1] * len(test_pos)
test_label_n = [0] * len(test_neg)

In [9]:
len(train_label_n)

12500

Split the training data into a training set and a validation set. The training set has 10,000 positive and 10,000 negative reviews; the validation set has 2,500 of each.

In [10]:
train_split = 10000

train_data_p = train_pos[:train_split]
train_data_p_label = train_label_p[:train_split]
train_data_n = train_neg[:train_split]
train_data_n_label = train_label_n[:train_split]


val_data_p = train_pos[train_split:]
val_data_p_label = train_label_p[train_split:]
val_data_n = train_neg[train_split:]
val_data_n_label = train_label_n[train_split:]

In [11]:
len(train_data_p)

10000
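The same per-class slicing can be checked on toy lists. This is only an illustrative sketch: `split_by_class` and the five-item lists are hypothetical and stand in for the 12,500 reviews per class.

```python
def split_by_class(pos, neg, n_train):
    """Slice each class at `n_train`, then put positives before negatives."""
    train_x = pos[:n_train] + neg[:n_train]
    train_y = [1] * len(pos[:n_train]) + [0] * len(neg[:n_train])
    val_x = pos[n_train:] + neg[n_train:]
    val_y = [1] * len(pos[n_train:]) + [0] * len(neg[n_train:])
    return train_x, train_y, val_x, val_y

# Toy stand-ins: 5 reviews per class, 4 of each kept for training
pos = ["p0", "p1", "p2", "p3", "p4"]
neg = ["n0", "n1", "n2", "n3", "n4"]
train_x, train_y, val_x, val_y = split_by_class(pos, neg, n_train=4)

print(len(train_x), len(val_x))   # 8 2
print(val_x, val_y)               # ['p4', 'n4'] [1, 0]
```

Because the split takes the first `n_train` items of each class, the result depends on the order in which the files were listed.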

In [12]:
train_data = train_data_p + train_data_n
train_label = train_data_p_label + train_data_n_label
val_data = val_data_p + val_data_n
val_label = val_data_p_label + val_data_n_label

In [13]:
train_data[1]

'Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV\'s "Flamingo Road") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina\'s pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D\'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of "Rosemary\'s Baby" and "The Exorcist"--but what a combination! Based on the best-seller by Jeffrey Konvitz, "The Sentinel" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbe

Combine the test reviews and test labels into the test set.

In [14]:
test_data = test_pos + test_neg
test_label = test_label_p + test_label_n

In [15]:
len(test_data)

25000

#### Tokenize

Tokenize each review with spaCy, lowercasing the tokens and dropping punctuation.

In [16]:
import spacy
import string

# code from Lab 3
# Load English tokenizer, tagger, parser, NER and word vectors
tokenizer = spacy.load('en_core_web_sm')
punctuations = string.punctuation

# lowercase and remove punctuation
def tokenize(sent):
  tokens = tokenizer(sent)
  return [token.text.lower() for token in tokens if (token.text not in punctuations)]

In [17]:
def tokenize_dataset(dataset):
    token_dataset = []
    # we are keeping track of all tokens in dataset 
    # in order to create vocabulary later
    all_tokens = []
    
    for sample in dataset:
        tokens = tokenize(sample)
        token_dataset.append(tokens)
        all_tokens += tokens

    return token_dataset, all_tokens

In [19]:
import pickle as pkl
print ("Tokenizing val data")
val_data_tokens, _ = tokenize_dataset(val_data)
with open("val_data_tokens.p", "wb") as f:
    pkl.dump(val_data_tokens, f)

Tokenizing val data


In [20]:
# test set tokens
print ("Tokenizing test data")
test_data_tokens, _ = tokenize_dataset(test_data)
with open("test_data_tokens.p", "wb") as f:
    pkl.dump(test_data_tokens, f)

# train set tokens
print ("Tokenizing train data")
train_data_tokens, all_train_tokens = tokenize_dataset(train_data)
with open("train_data_tokens.p", "wb") as f:
    pkl.dump(train_data_tokens, f)
with open("all_train_tokens.p", "wb") as f:
    pkl.dump(all_train_tokens, f)

Tokenizing test data
Tokenizing train data


In [21]:
len(train_data_tokens)

20000
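The punctuation filter in `tokenize` is a substring test against `string.punctuation`, so it removes single punctuation characters but keeps multi-character tokens such as `'...'`. The raw reviews also contain HTML line breaks, which is why `'/><br'` shows up in the vocabulary listing. The sketch below illustrates the same lowercase-and-filter idea with a plain regular expression instead of spaCy, and additionally strips `<br />` tags. It is a simplified stand-in, not the spaCy tokenizer: `simple_tokenize` and both patterns are assumptions, and contractions such as "n't" are split differently.

```python
import re
import string

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
TOKEN = re.compile(r"\w+|[^\w\s]")   # a run of word characters, or one symbol

def simple_tokenize(text):
    """Lowercase, drop HTML line breaks, and remove single punctuation characters."""
    text = BR_TAG.sub(" ", text)
    tokens = TOKEN.findall(text.lower())
    return [tok for tok in tokens if tok not in string.punctuation]

demo_tokens = simple_tokenize("Great movie!<br /><br />Loved it, really.")
print(demo_tokens)   # ['great', 'movie', 'loved', 'it', 'really']
```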

Build the vocabulary from the training set.

In [22]:
from collections import Counter

max_vocab_size = 10000
# reserve index 0 for pad and 1 for unk
PAD_IDX = 0
UNK_IDX = 1

def build_vocab(all_tokens):
    # Returns:
    # token2id: dictionary where keys represent tokens and corresponding values represent indices
    # id2token: list of tokens, where id2token[i] returns the token that corresponds to index i
    token_counter = Counter(all_tokens)
    vocab, count = zip(*token_counter.most_common(max_vocab_size))
    id2token = list(vocab)
    token2id = dict(zip(vocab, range(2,2+len(vocab)))) 
    id2token = ['<pad>', '<unk>'] + id2token
    token2id['<pad>'] = PAD_IDX 
    token2id['<unk>'] = UNK_IDX
    return token2id, id2token

token2id, id2token = build_vocab(all_train_tokens)

In [32]:
token2id

{'the': 2,
 'and': 3,
 'a': 4,
 'of': 5,
 'to': 6,
 'is': 7,
 'it': 8,
 'in': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 "'s": 13,
 '/><br': 14,
 'was': 15,
 'as': 16,
 'for': 17,
 'with': 18,
 'movie': 19,
 'but': 20,
 'film': 21,
 'you': 22,
 'on': 23,
 "n't": 24,
 'not': 25,
 'are': 26,
 'he': 27,
 'his': 28,
 'have': 29,
 'be': 30,
 'one': 31,
 'at': 32,
 'all': 33,
 'they': 34,
 'by': 35,
 'an': 36,
 'who': 37,
 'from': 38,
 'so': 39,
 'like': 40,
 'her': 41,
 'there': 42,
 'or': 43,
 'just': 44,
 'about': 45,
 'do': 46,
 'has': 47,
 'out': 48,
 'what': 49,
 'if': 50,
 'some': 51,
 'good': 52,
 'she': 53,
 'very': 54,
 'when': 55,
 'more': 56,
 'up': 57,
 'would': 58,
 'no': 59,
 'even': 60,
 'time': 61,
 'can': 62,
 'my': 63,
 'which': 64,
 'only': 65,
 'really': 66,
 'story': 67,
 'their': 68,
 'had': 69,
 'see': 70,
 'were': 71,
 'we': 72,
 'me': 73,
 'did': 74,
 'does': 75,
 'well': 76,
 '...': 77,
 'than': 78,
 'much': 79,
 'could': 80,
 'get': 81,
 'been': 82,
 'into': 83,
 'pe

In [23]:
# check the dictionary 
random_token_id = random.randint(0, len(id2token)-1)
random_token = id2token[random_token_id]

print ("Token id {} ; token {}".format(random_token_id, id2token[random_token_id]))
print ("Token {}; token id {}".format(random_token, token2id[random_token]))

Token id 9061 ; token anchors
Token anchors; token id 9061
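To see the indexing scheme at a glance, here is a small sketch of the same idea on a seven-token corpus. `build_toy_vocab` is a hypothetical variant of `build_vocab` that takes the size cap as an argument; the two special tokens occupy ids 0 and 1, and the most frequent tokens follow from id 2.

```python
from collections import Counter

PAD_IDX, UNK_IDX = 0, 1

def build_toy_vocab(all_tokens, max_vocab_size):
    """Keep the `max_vocab_size` most frequent tokens, after <pad> and <unk>."""
    most_common = Counter(all_tokens).most_common(max_vocab_size)
    id2token = ["<pad>", "<unk>"] + [tok for tok, _ in most_common]
    token2id = {tok: idx for idx, tok in enumerate(id2token)}
    return token2id, id2token

tokens = ["the", "movie", "the", "was", "good", "the", "movie"]
toy_token2id, toy_id2token = build_toy_vocab(tokens, max_vocab_size=2)

print(toy_id2token)   # ['<pad>', '<unk>', 'the', 'movie']
print(toy_token2id)   # {'<pad>': 0, '<unk>': 1, 'the': 2, 'movie': 3}
```

With `max_vocab_size = 10000`, and assuming the training set has at least 10,000 distinct tokens, `len(id2token)` is 10,002.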


In [24]:
# convert token to id in the dataset
def token2index_dataset(tokens_data):
    indices_data = []
    for tokens in tokens_data:
        index_list = [token2id[token] if token in token2id else UNK_IDX for token in tokens]
        indices_data.append(index_list)
    return indices_data

train_data_indices = token2index_dataset(train_data_tokens)
val_data_indices = token2index_dataset(val_data_tokens)
test_data_indices = token2index_dataset(test_data_tokens)

# double checking
print ("Train dataset size is {}".format(len(train_data_indices)))
print ("Val dataset size is {}".format(len(val_data_indices)))
print ("Test dataset size is {}".format(len(test_data_indices)))

Train dataset size is 20000
Val dataset size is 5000
Test dataset size is 25000
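The lookup with an unknown-token fallback can be illustrated on a toy vocabulary. In this sketch `tokens_to_indices` and `toy_token2id` are hypothetical; `dict.get` with a default is an equivalent way to express the conditional used in `token2index_dataset`.

```python
UNK_IDX = 1
toy_token2id = {"<pad>": 0, "<unk>": 1, "the": 2, "movie": 3}

def tokens_to_indices(tokens, token2id):
    """Map tokens to ids; anything outside the vocabulary becomes UNK_IDX."""
    return [token2id.get(tok, UNK_IDX) for tok in tokens]

demo_indices = tokens_to_indices(["the", "movie", "was", "good"], toy_token2id)
print(demo_indices)   # [2, 3, 1, 1]
```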


#### Build PyTorch Dataloader

In [26]:
MAX_SENTENCE_LENGTH = 200

import numpy as np
import torch
from torch.utils.data import Dataset

class NewsGroupDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch
    Note that this class inherits torch.utils.data.Dataset
    """
    
    def __init__(self, data_list, target_list):
        """
        @param data_list: list of reviews, each a list of token indices 
        @param target_list: list of sentiment labels (1 = positive, 0 = negative) 

        """
        self.data_list = data_list
        self.target_list = target_list
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)
        
    def __getitem__(self, key):
        """
        Triggered when you call dataset[i]
        """
        
        token_idx = self.data_list[key][:MAX_SENTENCE_LENGTH]
        label = self.target_list[key]
        return [token_idx, len(token_idx), label]

def newsgroup_collate_func(batch):
    """
    Customized function for DataLoader that pads every sample in the batch to 
    MAX_SENTENCE_LENGTH so that all data have the same length
    """
    data_list = []
    label_list = []
    length_list = []
    #print("collate batch: ", batch[0][0])
    #batch[0][0] = batch[0][0][:MAX_SENTENCE_LENGTH]
    for datum in batch:
        label_list.append(datum[2])
        length_list.append(datum[1])
    # padding
    for datum in batch:
        padded_vec = np.pad(np.array(datum[0]), 
                                pad_width=((0,MAX_SENTENCE_LENGTH-datum[1])), 
                                mode="constant", constant_values=0)
        data_list.append(padded_vec)
    return [torch.from_numpy(np.array(data_list)), torch.LongTensor(length_list), torch.LongTensor(label_list)]

# create pytorch dataloader
#train_loader = NewsGroupDataset(train_data_indices, train_targets)
#val_loader = NewsGroupDataset(val_data_indices, val_targets)
#test_loader = NewsGroupDataset(test_data_indices, test_targets)

BATCH_SIZE = 32
train_dataset = NewsGroupDataset(train_data_indices, train_label)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=newsgroup_collate_func,
                                           shuffle=True)

val_dataset = NewsGroupDataset(val_data_indices, val_label)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=newsgroup_collate_func,
                                           shuffle=True)

test_dataset = NewsGroupDataset(test_data_indices, test_label)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=newsgroup_collate_func,
                                           shuffle=False)

#for i, (data, lengths, labels) in enumerate(train_loader):
#    print (data)
#    print (labels)
#    break
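The truncate-then-pad logic shared by `__getitem__` and the collate function can be sketched in plain Python, without NumPy or PyTorch. `truncate_and_pad` and `MAX_LEN` are hypothetical names, and the short maximum length is chosen only to keep the output readable.

```python
MAX_LEN = 5   # stand-in for MAX_SENTENCE_LENGTH
PAD_IDX = 0

def truncate_and_pad(indices, max_len=MAX_LEN, pad_idx=PAD_IDX):
    """Return (sequence of exactly max_len ids, number of real tokens kept)."""
    kept = indices[:max_len]
    return kept + [pad_idx] * (max_len - len(kept)), len(kept)

batch = [[4, 9, 2], [7, 3, 8, 5, 6, 2, 11]]
padded = [truncate_and_pad(seq) for seq in batch]
print(padded)
# [([4, 9, 2, 0, 0], 3), ([7, 3, 8, 5, 6], 5)]
```

The stored length counts only the tokens that survive truncation, which is the value the model later divides by.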

#### Bag-of-words Model

In [27]:
# First import torch related libraries
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfWords(nn.Module):
    """
    BagOfWords classification model
    """
    def __init__(self, vocab_size, emb_dim):
        """
        @param vocab_size: size of the vocabulary. 
        @param emb_dim: size of the word embedding
        """
        super(BagOfWords, self).__init__()
        # pay attention to padding_idx 
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
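        # NOTE: this layer has 20 output units, but the IMDB labels are only 0 and 1,
        # so 2 outputs would suffice
        self.linear = nn.Linear(emb_dim,20)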
    
    def forward(self, data, length):
        """
        
        @param data: matrix of size (batch_size, max_sentence_length). Each row in data represents a 
            review that is represented using n-gram index. Note that they are padded to have same length.
        @param length: an int tensor of size (batch_size), which represents the non-trivial (excludes padding)
            length of each review in the data.
        """
        out = self.embed(data)
        out = torch.sum(out, dim=1)
        out /= length.view(length.size()[0],1).expand_as(out).float()
     
        # return logits
        out = self.linear(out.float())
        return out

emb_dim = 100 # embedding size; larger values such as 200 or 500 can also be tried
model = BagOfWords(len(id2token), emb_dim)
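The forward pass sums the embeddings over all positions and divides by the unpadded length. Because `padding_idx=0` keeps the padding row at zero, padded positions add nothing to the sum, so the result is the mean over the real tokens. The following plain-Python sketch mirrors that arithmetic; the 2-dimensional `embedding` table and `mean_embedding` are hypothetical and only illustrate the computation.

```python
# Hypothetical 2-dimensional embedding table; row 0 is the padding row (all zeros)
embedding = [
    [0.0, 0.0],   # <pad>
    [0.5, 0.5],   # <unk>
    [1.0, 3.0],   # token id 2
    [3.0, 1.0],   # token id 3
]

def mean_embedding(padded_indices, length):
    """Sum the embeddings of all positions, then divide by the unpadded length."""
    dim = len(embedding[0])
    summed = [sum(embedding[idx][d] for idx in padded_indices) for d in range(dim)]
    return [value / length for value in summed]

demo_mean = mean_embedding([2, 3, 0, 0], length=2)
print(demo_mean)   # [2.0, 2.0]
```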

In [28]:
learning_rate = 0.01
num_epochs = 10 # number of epochs to train

# Criterion and Optimizer
criterion = torch.nn.CrossEntropyLoss()  
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Function for testing the model
def test_model(loader, model):
    """
    Helper function that tests the model's performance on a dataset
    @param: loader - data loader for the dataset to test against
    @return: accuracy in percent
    """
    correct = 0
    total = 0
    model.eval()
    # gradients are not needed for evaluation
    with torch.no_grad():
        for data, lengths, labels in loader:
            data_batch, length_batch, label_batch = data, lengths, labels
            outputs = F.softmax(model(data_batch, length_batch), dim=1)
            predicted = outputs.max(1, keepdim=True)[1]

            total += labels.size(0)
            correct += predicted.eq(labels.view_as(predicted)).sum().item()
    return (100 * correct / total)

for epoch in range(num_epochs):
    for i, (data, lengths, labels) in enumerate(train_loader):
        model.train()
        data_batch, length_batch, label_batch = data, lengths, labels
        optimizer.zero_grad()
        outputs = model(data_batch, length_batch)
        loss = criterion(outputs, label_batch)
        loss.backward()
        optimizer.step()
        # validate every 100 iterations
        if i > 0 and i % 100 == 0:
            # validate
            val_acc = test_model(val_loader, model)
            print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format( 
                       epoch+1, num_epochs, i+1, len(train_loader), val_acc))


Epoch: [1/10], Step: [101/625], Validation Acc: 75.5
Epoch: [1/10], Step: [201/625], Validation Acc: 83.28
Epoch: [1/10], Step: [301/625], Validation Acc: 85.1
Epoch: [1/10], Step: [401/625], Validation Acc: 84.82
Epoch: [1/10], Step: [501/625], Validation Acc: 86.32
Epoch: [1/10], Step: [601/625], Validation Acc: 86.72
Epoch: [2/10], Step: [101/625], Validation Acc: 86.52
Epoch: [2/10], Step: [201/625], Validation Acc: 86.58
Epoch: [2/10], Step: [301/625], Validation Acc: 86.28
Epoch: [2/10], Step: [401/625], Validation Acc: 86.86
Epoch: [2/10], Step: [501/625], Validation Acc: 86.78
Epoch: [2/10], Step: [601/625], Validation Acc: 86.58
Epoch: [3/10], Step: [101/625], Validation Acc: 86.58
Epoch: [3/10], Step: [201/625], Validation Acc: 86.56
Epoch: [3/10], Step: [301/625], Validation Acc: 86.36
Epoch: [3/10], Step: [401/625], Validation Acc: 85.66
Epoch: [3/10], Step: [501/625], Validation Acc: 86.04
Epoch: [3/10], Step: [601/625], Validation Acc: 85.58
Epoch: [4/10], Step: [101/625]

In [29]:
print ("After training for {} epochs".format(num_epochs))
print ("Val Acc {}".format(test_model(val_loader, model)))
print ("Test Acc {}".format(test_model(test_loader, model)))

After training for 10 epochs
Val Acc 82.72
Test Acc 80.36
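The accuracy reported by `test_model` is the percentage of reviews whose highest-scoring class matches the label. The sketch below reproduces that computation in plain Python on made-up logits; `softmax`, `argmax`, `accuracy`, and the sample values are all hypothetical. It also checks that applying softmax does not change the predicted class, since softmax preserves the ordering of the logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(batch_logits, labels):
    """Percentage of rows whose highest-scoring class equals the label."""
    correct = sum(argmax(row) == label for row, label in zip(batch_logits, labels))
    return 100 * correct / len(labels)

batch_logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.2], [-0.2, 0.4]]
labels = [0, 1, 1, 1]

# softmax is monotonic, so it never changes which class wins
same_prediction = all(argmax(row) == argmax(softmax(row)) for row in batch_logits)
demo_acc = accuracy(batch_logits, labels)
print(same_prediction, demo_acc)   # True 75.0
```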


### Hyperparameter Tuning

#### Try different Tokenize Scheme