# COMP5046 Assignment 1
*Make sure you change the file name with your unikey.*

# Readme

**Run the cells in which the following comment is written.**

`#**********************************RUN THIS CELL**********************************#`

***Visualising the comparison of different results is a good way to justify your decision.***

# 1 - Data Preprocessing

## 1.1. Download Dataset

In [1]:
#**********************************RUN THIS CELL**********************************#

# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

id = '1vF3FqgBC1Y-RPefeVmY8zetdZG1jmHzT'
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('imdb_train.csv')

id = '1XhaV8YMuQeSwozQww8PeyiWMJfia13G6'
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('imdb_test.csv')

import pandas as pd
df_train = pd.read_csv("imdb_train.csv")
df_test = pd.read_csv("imdb_test.csv")

reviews_train = df_train['review'].tolist()
sentiments_train = df_train['sentiment'].tolist()
reviews_test = df_test['review'].tolist()
sentiments_test = df_test['sentiment'].tolist()

print("Training set number:",len(reviews_train))
print("Testing set number:",len(reviews_test))

Training set number: 25000
Testing set number: 25000


## 1.2. Import Libraries

In [0]:
#**********************************RUN THIS CELL**********************************#

# Import libraries for pre processing
import re
import nltk
from nltk.corpus import stopwords as sw
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import numpy as np
from sklearn.preprocessing import LabelEncoder
from itertools import chain

# Import libraries for model
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score
from torch.autograd import Variable
import torch.nn.functional as F

# You can enable GPU here (cuda); or just CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## 1.3. Preprocess data

*Justification:*

**PREPROCESSING:**

**CLEAN DATA**

*To clean the data, HTML tags must be removed from the text before processing. The IMDB data contains only `<br />` tags, so I replaced every `<br />` tag in the training and testing data sets with a space.*

*A regular expression describes a search pattern in text. I used a regular expression to strip all punctuation, numbers and special characters: anything that is not an ASCII letter or whitespace is replaced with a space.*
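As a minimal sketch of the cleaning step described above, the snippet below applies the same two operations to one made-up review. The function name `clean_review` and the sample sentence are illustrative assumptions, not part of the data set.

```python
import re

def clean_review(text):
    """Replace <br /> tags and every non-letter, non-whitespace character with a space."""
    text = text.replace("<br />", " ")
    return re.sub(r"[^a-zA-Z\s]", " ", text)

# Hypothetical sample review, used only for illustration
sample = "Great film!<br />10/10, would watch again."
cleaned = clean_review(sample)

# Splitting on whitespace shows which tokens survive the cleaning
print(cleaned.split())
```

Digits and punctuation become spaces rather than being deleted outright, so neighbouring words are never glued together.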

**CASE FOLDING**

*Case folding is done so that text can be compared irrespective of its case. I converted all text to lower case so that, for example, "Good" and "good" are treated as the same token.*

**TOKENIZATION**

*Tokenization is the process of breaking sentences into tokens such as words or phrases. Each token is treated as a unit for later processing. I used these tokens as the input to the model.*

**STOP WORDS REMOVAL**

*Stop words are words that contribute little to the meaning or context of a sentence, for example "a", "an" and "the". They increase the size of the data set without improving precision or recall, so I removed them as a preprocessing step, using the English stop word list from NLTK.*

**LEMMATIZATION**

*Lemmatisation is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. It converts each word to its base form, for example "geese" to "goose". This makes the vocabulary more consistent and improves recall.*

**LABEL ENCODING**

*Label encoding converts categorical text labels into numbers that the model can work with. Categorical data must be encoded as numbers before it is used in a model, so I used label encoding on the training and testing labels: "neg" becomes 0 and "pos" becomes 1.*
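The "neg" to 0 and "pos" to 1 mapping follows from the fact that scikit-learn's `LabelEncoder` numbers the distinct labels in sorted order. The sketch below reproduces that behaviour in plain Python; `fit_label_mapping` and the sample labels are hypothetical names used only for illustration.

```python
def fit_label_mapping(labels):
    """Number the distinct labels in sorted order, as LabelEncoder does."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

# Hypothetical labels in the same format as the sentiment column
sample_labels = ["pos", "neg", "neg", "pos"]
mapping = fit_label_mapping(sample_labels)
encoded = [mapping[label] for label in sample_labels]

print(mapping)
print(encoded)
```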

In [3]:
#**********************************RUN THIS CELL**********************************#

# PREPROCESSING
def cleanData(x):

  # Replace all the <br /> tags with a space.
  # (Only <br /> tags are handled, as there are no other tags in the data set)
  x = x.replace("<br />", " ")

  # Replace everything that is not a-z, A-Z, or whitespace with a space
  # This removes numbers, punctuation, accented characters and underscores as well
  x = re.sub(r'[^a-zA-Z\s]', ' ', x)
  return x

# CLEAN DATA

# Remove punctuation and HTML tags for training data set
reviews_train = [cleanData(s) for s in reviews_train]

# Remove punctuation and HTML tags for testing data set
reviews_test = [cleanData(s) for s in reviews_test]

# CASE FOLDING

# Convert training data set to lower case 
reviews_train = [s.lower() for s in reviews_train]

# Convert testing data set to lower case 
reviews_test = [s.lower() for s in reviews_test]
#--------------------------------------------------------------------#

# TOKENIZATION
# Tokenization is used for splitting your data set into tokens.

nltk.download('punkt')

# Tokenize the training data set
reviews_train = [word_tokenize(s) for s in reviews_train]

# Tokenize the testing data set
reviews_test = [word_tokenize(s) for s in reviews_test]
#--------------------------------------------------------------------#

# STOP WORDS REMOVAL
# Stop words are the most common words used in any natural language. 
# These words add little to the meaning of the sentence,
# so they are removed from the data set.

nltk.download('stopwords')
# A set gives constant-time membership tests in the filters below
stop_words = set(sw.words("english"))

# Remove stop words from training data set
reviews_train_ns=[]
for tokens in reviews_train:
    filtered_sentence1 = [w1 for w1 in tokens if not w1 in stop_words]
    reviews_train_ns.append(filtered_sentence1)

# Remove stop words from testing data set
reviews_test_ns=[]
for tokens in reviews_test:
    filtered_sentence2 = [w2 for w2 in tokens if not w2 in stop_words]
    reviews_test_ns.append(filtered_sentence2)
#--------------------------------------------------------------------#

# LEMMATIZATION
# Lemmatisation is the process of grouping together the inflected forms of a word so they 
# can be analysed as a single item, identified by the word's lemma, or dictionary form.

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Lemmatize training data set
reviews_train_le = []
for tokens in reviews_train_ns:
    lemma_sentence1 = [lemmatizer.lemmatize(w1) for w1 in tokens ]
    reviews_train_le.append(lemma_sentence1)

# Lemmatize testing data set
reviews_test_le = []
for tokens in reviews_test_ns:
    lemma_sentence2 = [lemmatizer.lemmatize(w2) for w2 in tokens ]
    reviews_test_le.append(lemma_sentence2)
#--------------------------------------------------------------------#

#LABEL ENCODING

# Encoding the given labels by label encoder.
#  neg = 0, pos = 1

labels = np.unique(sentiments_train)

lEnc = LabelEncoder()
lEnc.fit(labels)

# Encoding training labels
sentiments_train_encoded = lEnc.transform(sentiments_train)

# Encoding testing labels
sentiments_test_encoded = lEnc.transform(sentiments_test)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


# 2 - Model Implementation

## 2.1. Word Embeddings

*Justification:*

**WORD2VEC**

*Word2Vec is used for a better representation of words. It captures similarities between words and places similar words close together in vector space. Every word is represented by a vector.*

**CONTINUOUS BAG OF WORDS (CBOW)**

*CBOW predicts the center word from context words. CBOW does not work well with the infrequent words as they do not appear frequently in the context words.*

**SKIPGRAM MODEL**

*Skip-gram predicts the context words given the center word. It works better for infrequent words and works well even on a small amount of data. Hence, I used skip-gram instead of CBOW so that the model is also trained on rare words.*
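The difference between the two objectives is easiest to see in the training pairs they produce. The following is a minimal sketch on a made-up three-token sentence; the helper names and the `window` parameter are assumptions for illustration (the notebook's own loop uses one word on each side of the target over the flattened corpus).

```python
def skipgram_pairs(tokens, window=1):
    """One (center, context) pair per neighbouring word: the center word is the input."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """One (context_list, center) pair per position: the context words are the input."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

# Hypothetical token list
tokens = ["movie", "truly", "great"]
sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)
print(sg)
print(cb)
```

Skip-gram yields a separate training example for every (center, context) combination, so each occurrence of a rare word contributes several updates to its own vector.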

### 2.1.1. Data Preprocessing for Word Embeddings

*Justification:*

**PREPROCESSING FOR WORD EMBEDDING MODEL**

*The training input is a list of lists (one token list per review). To build embeddings for every word, I flattened it into a single list with the `chain` function and extracted the unique words by forming a set. I appended `<PAD>` to the list of unique words because the reviews are later padded with `<PAD>` for the sequence model. I also appended `<OOV>`, which is needed to handle out-of-vocabulary words.*

In [0]:
#**********************************RUN THIS CELL**********************************#

training_data = []
testing_data = []

# Flatten the per-review token lists into a single list for the training data set
training_data = list(chain.from_iterable(reviews_train_le))

# Getting unique words from a list and then sorting the list
training_set_list = list(set(training_data))
training_set_list.sort()

# Appending <PAD> to get its word embedding
training_set_list.append('<PAD>')

# Appending the <OOV> to get word embeddings for out of vocabulary words
training_set_list.append('<OOV>')

# Make a dictionary so that each unique word can be looked up by its index
training_word_dict = {w: i for i, w in enumerate(training_set_list)}

### 2.1.2. Build Word Embeddings Model

In [0]:
# Create skipgrams
skip_grams = []

for i in range(1, len(training_data) - 1):
  # Context size = 2
  target = training_word_dict[training_data[i]]
  context = [training_word_dict[training_data[i - 1]], training_word_dict[training_data[i + 1]]]

  # skipgrams - (target, context[0]), (target, context[1])..
  for w in context:
      skip_grams.append([target, w])

# Prepare random batches from skip-gram
def prepare_batch(data, size):
  random_inputs = []
  random_labels = []
  random_index = np.random.choice(range(len(data)), size, replace=False)

  for i in random_index:
      input_temp = [0]*vocab_size
      input_temp[data[i][0]] = 1
      random_inputs.append(input_temp)  # target
      random_labels.append(data[i][1])  # context word

  return np.array(random_inputs), np.array(random_labels)
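`prepare_batch` builds a dense one-hot vector of length `vocab_size` for every sample, which for a vocabulary of 65520 words and a batch of 1000 means 65,520,000 entries per batch. Multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, which is why index-based lookups (for example PyTorch's `nn.Embedding`) are a common alternative. The sketch below checks this equivalence in plain Python on a hypothetical 3-word, 2-dimensional weight matrix; the helper names are assumptions.

```python
def one_hot(index, size):
    """Dense one-hot vector with a 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def vec_mat(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix))) for c in range(cols)]

# Hypothetical weights: 3 words, 2-dimensional embeddings
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

projected = vec_mat(one_hot(2, 3), weights)  # full matrix product
looked_up = weights[2]                       # direct row lookup
print(projected, looked_up)
```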

*Justification:*

**HYPERPARAMETERS:**

**Learning Rate: 0.03**

*- With a learning rate of 0.01 the model learned very slowly and the loss kept increasing. With a learning rate above 0.1 the model trained very quickly and overfitted. I therefore chose 0.03 as a compromise, so that the model neither gets stuck nor converges too quickly.*

**Batch Size: 1000**

*- The batch size is 1000 because the data set is very large and the model should see a large share of it during training.*

**Embedding Size: 34**

*- 34, to match the 34 characters in the character array of the character embedding model, so that the word embeddings can serve as targets for that model's 34-dimensional output.*

In [0]:
#**********************************RUN THIS CELL**********************************#

#HYPERPARAMETERS
vocab_size = len(training_set_list) # no of unique words in training data set
learning_rate_wordEmb = 0.03 # learning rate
batchSize_wordEmb = 1000 # batch size
embedding_size = 34 # embedding size

In [0]:
#**********************************RUN THIS CELL**********************************#

# Skip Gram Model
class SkipGramModel(nn.Module):
    def __init__(self):
        super(SkipGramModel, self).__init__()
        self.linear1 = nn.Linear(vocab_size, embedding_size, bias = False)
        self.linear2 = nn.Linear(embedding_size, vocab_size, bias = False)

    def forward(self, x):
        hidden = self.linear1(x) #z1
        out = self.linear2(F.relu(hidden)) #zout
        return out

### 2.1.3. Train Word Embeddings Model

In [0]:
skip_gram_model = SkipGramModel()

# Cross-entropy loss treats context-word prediction as a multi-class
# classification over the whole vocabulary.
criterion = nn.CrossEntropyLoss()

# Adam adapts the step size of each parameter individually
optimiser = optim.Adam(skip_gram_model.parameters(), lr = learning_rate_wordEmb)

# Train the model
for epoch in range(1500):

    # Make batches
    inputs,labels = prepare_batch(skip_grams, batchSize_wordEmb)

    # Convert input and labels into torch
    inputs_torch = torch.from_numpy(inputs).float()
    labels_torch = torch.from_numpy(labels)
    
    # Train the model
    skip_gram_model.train()

    # Zero grad
    optimiser.zero_grad()

    # Forward propagation
    outputs = skip_gram_model(inputs_torch)
  
    # Calculate loss
    loss = criterion(outputs, labels_torch)

    # Back propagation
    loss.backward()
    optimiser.step()

    if epoch % 100 == 99:
        print('Epoch: %d, loss: %.4f' %(epoch + 1, loss.item()))

Epoch: 100, loss: 8.9823
Epoch: 200, loss: 8.7778
Epoch: 300, loss: 8.6844
Epoch: 400, loss: 8.8544
Epoch: 500, loss: 9.0072
Epoch: 600, loss: 8.8038
Epoch: 700, loss: 8.9635
Epoch: 800, loss: 9.0254
Epoch: 900, loss: 9.0590
Epoch: 1000, loss: 8.8086
Epoch: 1100, loss: 9.0261
Epoch: 1200, loss: 8.8473
Epoch: 1300, loss: 8.9149
Epoch: 1400, loss: 8.7768
Epoch: 1500, loss: 8.7186
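A useful reference point for reading the losses above is the cross-entropy of a model that assigns the same probability to every word, which equals the natural logarithm of the vocabulary size. The sketch below computes that baseline, assuming the vocabulary size of 65520 reported for the saved `SkipGramModel` later in this notebook.

```python
import math

# Vocabulary size taken from the in_features of the saved SkipGramModel
vocab_size = 65520

# Cross-entropy of a uniform prediction over the vocabulary
uniform_loss = math.log(vocab_size)
print(round(uniform_loss, 2))  # 11.09
```

The reported losses (roughly 8.7 to 9.1) are below this baseline but show no clear downward trend after the first 100 iterations. Note that each iteration labelled "Epoch" processes a single random batch of 1000 skip-gram pairs, not a full pass over the data.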


### 2.1.4. Save Word Embeddings Model

In [0]:
# Mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
# Save the Skip Gram Model
torch.save(skip_gram_model, '/content/gdrive/My Drive/Models/word_embedding_model.pt')


In [0]:
#**********************************RUN THIS CELL**********************************#

# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

id = '1jGiRUbEJmaChHgUf15WwIg3DPh80z_ll'
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('word_embedding_model.pt')

### 2.1.5. Load Word Embeddings Model

In [8]:
#**********************************RUN THIS CELL**********************************#

# Load the model
wordEmbeddingModel = torch.load('word_embedding_model.pt')

# Switch the model to evaluation mode (this does not compute any metrics)
wordEmbeddingModel.eval()

SkipGramModel(
  (linear1): Linear(in_features=65520, out_features=34, bias=False)
  (linear2): Linear(in_features=34, out_features=65520, bias=False)
)

In [0]:
#**********************************RUN THIS CELL**********************************#

# Use the output layer's weight matrix (vocab_size x embedding_size) as the word embeddings

weight2 = wordEmbeddingModel.linear2.weight
word_embeddings = weight2.detach().numpy()  

## 2.2. Character Embeddings

### 2.2.1. Data Preprocessing for Character Embeddings

*Justification:*

**PREPROCESSING FOR CHARACTER EMBEDDING MODEL**

*For preprocessing, I padded every word in the list of unique training words with "#" so that every word has the same length, which lets the words be batched together for the model.*

*I also added 'P', 'A', 'D', 'O', 'V', '<' and '>' to the character array, because the word list contains `<PAD>` to handle padding and `<OOV>` to handle out-of-vocabulary words.*
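To make the padding and the character vocabulary concrete, here is a minimal sketch that pads one word and one-hot encodes it over the same 34 characters. The helper names `pad_word` and `one_hot_word` and the target length of 5 are illustrative assumptions.

```python
def pad_word(word, length, pad_char="#"):
    """Truncate or right-pad a word so that it has exactly `length` characters."""
    return word[:length].ljust(length, pad_char)

# 26 lowercase letters, the padding character, and the characters of <PAD> and <OOV>
alphabet = list("abcdefghijklmnopqrstuvwxyz") + ["#", "<", "P", "A", "D", ">", "O", "V"]
char_index = {ch: i for i, ch in enumerate(alphabet)}

def one_hot_word(word):
    """Return one one-hot row per character of the word."""
    rows = []
    for ch in word:
        row = [0] * len(alphabet)
        row[char_index[ch]] = 1
        rows.append(row)
    return rows

padded = pad_word("cat", 5)
encoded = one_hot_word(padded)
print(padded, len(encoded), len(encoded[0]))
```

Each word becomes a matrix of shape (word length, 34), which is the per-word input shape the Bi-LSTM expects.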

In [0]:
#**********************************RUN THIS CELL**********************************#

# Function to pad all the words to equal length
# Append # to the end of each word (longer words are truncated)

def add_padding_to_words(corpus, seq_length):
    output = []
    for word in corpus:
        if len(word) > seq_length:
            output.append(word[:seq_length])
        else:
            for j in range(seq_length-len(word)):
                word = word + "#"
            output.append(word)
    return output

In [0]:
#**********************************RUN THIS CELL**********************************#

# Get the length of words having maximum characters
maxword_length_train = len(max(training_set_list, key=len))

# Add paddings to all your words
train_char_pad = add_padding_to_words(training_set_list, maxword_length_train)

### 2.2.2. Build Character Embeddings Model

In [0]:
#**********************************RUN THIS CELL**********************************#

# Character vocabulary: lowercase letters plus the padding character '#'
# The characters in <PAD> and <OOV> are also included because those tokens are in our dictionary
char_arr = ['a', 'b', 'c', 'd', 'e', 'f', 'g',
            'h', 'i', 'j', 'k', 'l', 'm', 'n',
            'o', 'p', 'q', 'r', 's', 't', 'u',
            'v', 'w', 'x', 'y', 'z', '#', '<', 
            'P', 'A', 'D', '>', 'O', 'V']

# Create a dictionary for above char_arr
char_dict = {n: i for i, n in enumerate(char_arr)}
# Get the dictionary length
charDict_len = len(char_dict)

# Get one-hot encoding for every word
def encode_words(seq_data):
  input_batch = []
    
  for seq in seq_data:
    input_data = [char_dict[n] for n in seq]
    input_batch.append(np.eye(charDict_len)[input_data])
  return input_batch

def generate_batch(input_embeddings, label, batch_size):
  idx = np.random.choice(input_embeddings.shape[0], size=batch_size, replace = False)
  return input_embeddings[idx,:,:],label[idx, :]

*Justification:*

**HYPERPARAMETERS:**

**Learning Rate: 0.05**

*- With a learning rate of 0.01 the model learned very slowly and the loss kept increasing. With a learning rate above 0.1 the model trained very quickly and overfitted. I therefore chose 0.05 as a compromise, so that the model neither gets stuck nor converges too quickly.*

**Batch Size: 1000**

*- The batch size is 1000 because the data set is very large and the model should see a large share of it during training.*

**Hidden size: 20**

*- 20 hidden units per LSTM direction, so that the character-based embedding is not too big.*

**n_input, n_class: 34**

*- length of character array*

In [0]:
#**********************************RUN THIS CELL**********************************#

# HYPERPARAMETERS

learning_rate_charEmb = 0.05 # Learning Rate
n_hidden = 20 # Hidden size (units per LSTM direction)
n_input = charDict_len # Number of inputs = 34
n_class = charDict_len # Number of classes = 34
batchSize_charEmb = 1000 # Batch Size

In [14]:
#**********************************RUN THIS CELL**********************************#

# Bi-LSTM Model for character based word embeddings

class CharacterEmbedding_Model(nn.Module):
  def __init__(self):
    super(CharacterEmbedding_Model, self).__init__()
    self.lstm = nn.LSTM(n_input, n_hidden, batch_first =True,bidirectional=True, dropout=0.2)
    self.linear = nn.Linear(n_hidden*2, n_class)

  def forward(self, sentence):
    
    # h_n of shape (num_layers * num_directions, batch, hidden_size)
    # tensor containing the hidden state for t = seq_len.
    lstm_out, (h_n,c_n) = self.lstm(sentence)

    # Concatenate the last hidden state from two directions
    hidden_out =torch.cat((h_n[0,:,:],h_n[1,:,:]),1)
    z = self.linear(hidden_out)
    log_output = F.log_softmax(z, dim=1)
    return log_output,hidden_out

character_embedding_model = CharacterEmbedding_Model()

# Loss function and optimizer
# MSE loss is the mean of the squared differences between the model output and the target word embeddings
criterion = nn.MSELoss()

# Adam adapts the step size of each parameter individually
optimizer = optim.Adam(character_embedding_model.parameters(), lr=learning_rate_charEmb)

# Prepare input by encoding the words
input_batch = encode_words(train_char_pad)

# Target will be the word embeddings from the skip gram model
target_batch = word_embeddings

# Convert input into tensors and target to tensors
input_batch_torch = torch.from_numpy(np.array(input_batch)).float()
target_batch_torch = torch.from_numpy(np.array(target_batch)).float()


### 2.2.3. Train Character Embeddings Model

In [0]:
# Train the model

for epoch in range(5000):  
    
  character_embedding_model.train()

  # Make batches
  batch_input, batch_output = generate_batch(input_batch_torch, target_batch_torch, batchSize_charEmb)

  # Forward + Backward + Optimize
  outputs,_ = character_embedding_model(batch_input)

  # Calculate loss
  loss = criterion(outputs, batch_output)

  # Back propagation
  loss.backward()
  optimizer.step()

  # Zero Grad
  optimizer.zero_grad()

  if epoch % 100 == 99:
      print('Epoch: %d, loss: %.4f' %(epoch + 1, loss.item()))    

print('Finished Training')

Epoch: 100, loss: 7.7940
Epoch: 200, loss: 8.0034
Epoch: 300, loss: 7.8183
Epoch: 400, loss: 8.1443
Epoch: 500, loss: 8.0240
Epoch: 600, loss: 8.0723
Epoch: 700, loss: 7.9003
Epoch: 800, loss: 7.7201
Epoch: 900, loss: 7.7944
Epoch: 1000, loss: 8.5158
Epoch: 1100, loss: 7.8755
Epoch: 1200, loss: 8.0678
Epoch: 1300, loss: 7.6818
Epoch: 1400, loss: 7.5123
Epoch: 1500, loss: 7.9060
Epoch: 1600, loss: 8.2075
Epoch: 1700, loss: 7.7961
Epoch: 1800, loss: 8.0386
Epoch: 1900, loss: 7.6458
Epoch: 2000, loss: 8.0211
Epoch: 2100, loss: 7.9764
Epoch: 2200, loss: 7.8107
Epoch: 2300, loss: 7.5797
Epoch: 2400, loss: 7.8151
Epoch: 2500, loss: 7.9793
Epoch: 2600, loss: 7.8212
Epoch: 2700, loss: 7.6934
Epoch: 2800, loss: 7.6424
Epoch: 2900, loss: 7.3485
Epoch: 3000, loss: 7.7182
Epoch: 3100, loss: 7.8717
Epoch: 3200, loss: 7.7502
Epoch: 3300, loss: 7.7979
Epoch: 3400, loss: 7.7241
Epoch: 3500, loss: 7.6360
Epoch: 3600, loss: 7.7643
Epoch: 3700, loss: 7.9334
Epoch: 3800, loss: 7.9570
Epoch: 3900, loss: 7.

### 2.2.4. Save Character Embeddings Model

In [0]:
# Mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
# Save the Bi-LSTM Model
torch.save(character_embedding_model, '/content/gdrive/My Drive/Models/character_embedding_model.pt')


In [0]:
#**********************************RUN THIS CELL**********************************#

# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

id = '1-0TXvRdSIF1vOgwMFj4Q4__zCdmF5VyN'
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('character_embedding_model.pt')

### 2.2.5. Load Character Embeddings Model

In [16]:
#**********************************RUN THIS CELL**********************************#

# Load the model
characterEmbeddingModel = torch.load('character_embedding_model.pt')

# Switch the model to evaluation mode (this does not compute any metrics)
characterEmbeddingModel.eval()

CharacterEmbedding_Model(
  (lstm): LSTM(34, 20, batch_first=True, dropout=0.2, bidirectional=True)
  (linear): Linear(in_features=40, out_features=34, bias=True)
)

In [0]:
#**********************************RUN THIS CELL**********************************#

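# Extract the concatenated final hidden states (the character embeddings)
# from the loaded, trained model rather than the freshly constructed one
with torch.no_grad():
    _, character_embeddings = characterEmbeddingModel(input_batch_torch)
char_embed = character_embeddings.detach().numpy()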

## 2.3. Sequence model

### 2.3.1. Apply/Import Word Embedding and Character Embedding Model

In [0]:
#**********************************RUN THIS CELL**********************************#

# Concatenate word embeddings and character embedding
concatenated_embeddings = np.concatenate((word_embeddings, char_embed), axis=1)

In [0]:
#**********************************RUN THIS CELL**********************************#

# Map the final embeddings to data sets

def map_embeddings(input_set):
  mapped_embeddings = []
  for review in input_set:
    embedding_list = []
    for word in review:

      try:
        index = int(training_word_dict[word])
        embedding_list.append(concatenated_embeddings[index])

      except KeyError:
        # Handle Out of Vocabulary words
        index = int(training_word_dict['<OOV>'])
        embedding_list.append(concatenated_embeddings[index]) 

    mapped_embeddings.append(embedding_list)
  return mapped_embeddings
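The out-of-vocabulary fallback in `map_embeddings` can also be expressed with `dict.get`, which returns a default when the key is missing. The sketch below shows that lookup on a hypothetical four-entry vocabulary; `lookup_indices` and `toy_vocab` are illustrative names, not part of the notebook.

```python
def lookup_indices(tokens, word_to_index, oov_token="<OOV>"):
    """Map each token to its index, falling back to the <OOV> index for unknown words."""
    oov_index = word_to_index[oov_token]
    return [word_to_index.get(token, oov_index) for token in tokens]

# Hypothetical vocabulary with the two special tokens appended at the end
toy_vocab = {"good": 0, "movie": 1, "<PAD>": 2, "<OOV>": 3}

indices = lookup_indices(["good", "zzz", "movie", "<PAD>"], toy_vocab)
print(indices)
```

The unknown token "zzz" is mapped to the `<OOV>` index, while `<PAD>` is found in the vocabulary like any other word.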


In [0]:
#**********************************RUN THIS CELL**********************************#

# Function to add <PAD> to all the lists to make them of equal length

def add_padding_to_lists(corpus, seq_length):
    output = []
    for sentence in corpus:
        if len(sentence)>seq_length:
            output.append(sentence[:seq_length])
        else:
            for j in range(seq_length-len(sentence)):
                sentence.append("<PAD>")
            output.append(sentence)
    return output

# Generate batch function
def make_batch(mapped_embeddings, label, batch_size):
    idx = np.random.choice(len(mapped_embeddings), size = batch_size, replace = False)
    list_onereview = []
    for ele in idx:
      list_onereview = list_onereview + [mapped_embeddings[ele]]
    return list_onereview, label[idx]

### 2.3.2. Build Sequence Model

*Justification:*

**PREPROCESSING FOR SEQUENCE TO SEQUENCE MODEL**

*For preprocessing, I padded every token list in the training reviews by appending `<PAD>` at the end, so that every list in the training data has equal length and the reviews can be batched together.*

In [0]:
#**********************************RUN THIS CELL**********************************#

# Get the length of list having maximum elements
list_length_train = [len(s) for s in reviews_train_ns]
maxlength_list_train = max(list_length_train)

# Add paddings to all your lists
train_word_pad = add_padding_to_lists(reviews_train_le, maxlength_list_train)

# Map the padded training reviews to the final (concatenated) embeddings
mapped_train_embeddings = map_embeddings(train_word_pad)

*Justification:*

**HYPERPARAMETERS:**

**input_n: 74**

*- Total Embedding Size of word and characters*

**class_n: 2**

*- Number of distinct labels*

**Learning Rate: 0.01**

*- With a learning rate greater than 0.01, the model trained very quickly and overfitted. I therefore chose 0.01 so that the model neither gets stuck nor converges too quickly.*

**Batch Size: 500**

*- The batch size is 500 because the data set is very large and the model should see a large share of it during training.*

**Hidden size: 10**

*- 10 hidden units, for proper training of the model on every review*
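The value of 74 for `input_n` follows directly from the two embedding models. As a short worked check, assuming the sizes set earlier in this notebook (word embedding size 34, character Bi-LSTM hidden size 20 in each of 2 directions):

```python
word_embedding_size = 34   # embedding_size of the skip-gram model
char_hidden_size = 20      # n_hidden of the character Bi-LSTM
char_directions = 2        # forward and backward hidden states are concatenated

char_embedding_size = char_hidden_size * char_directions
input_n = word_embedding_size + char_embedding_size
print(char_embedding_size, input_n)  # 40 74
```

These match the shapes printed for the saved models: `Linear(in_features=40, ...)` in the character model and `LSTM(74, 10, ...)` in the sequence model.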



In [0]:
#**********************************RUN THIS CELL**********************************#

# HYPERPARAMETERS

input_n = len(mapped_train_embeddings[0][0]) # Number of input: Embedding size
class_n = len(list(set(sentiments_train)))
hidden_n = 10 # Hidden size of the LSTM
batchSize_seqtoSeq = 500 # Batch Size
learningRate_seqtoSeq = 0.01 # Learning Rate

In [0]:
#**********************************RUN THIS CELL**********************************#

# Sequence to Sequence Sentiment Analysis Model (M to 1)

class SeqNet(nn.Module):
    def __init__(self):
        super(SeqNet, self).__init__()
        self.lstm = nn.LSTM(input_n, hidden_n, batch_first =True)
        self.linear = nn.Linear(hidden_n, class_n)

    def forward(self, x):        

        # lstm layer
        x,_ = self.lstm(x)

        # linear layer
        x = self.linear(x[:,-1,:])
        
        # softmax layer
        x = F.log_softmax(x, dim=1)
              
        return x

### 2.3.3. Train Sequence Model

In [0]:
seq_net = SeqNet()

# Cross-entropy loss is the standard choice for classifying each review
# into one of the two sentiment classes.
criterion = nn.CrossEntropyLoss()

# Adam adapts the step size of each parameter individually
optimizer = optim.Adam(seq_net.parameters(), lr=learningRate_seqtoSeq)

# Train the model
for epoch in range(1000):

    # Make batches
    input_batch, target_batch = make_batch(mapped_train_embeddings, sentiments_train_encoded, batchSize_seqtoSeq)

    # Convert your final embeddings to an array 
    input_batch_array = np.asarray(input_batch)

    # Convert input and labels into torch
    input_batch_torch = torch.from_numpy(input_batch_array).float()
    target_batch_torch = torch.from_numpy(target_batch).view(-1)

    # Forward propagation
    seq_net.train()
    outputs = seq_net(input_batch_torch) 

    # Calculate loss
    loss = criterion(outputs, target_batch_torch)

    # Back propagation
    loss.backward()
    optimizer.step()

    # Zero Grad
    optimizer.zero_grad()

    if epoch % 100 == 99:
      print('Epoch: %d, loss: %.4f' %(epoch + 1, loss.item()))

print('Finished Training')

Epoch: 100, loss: 0.6924
Epoch: 200, loss: 0.6952
Epoch: 300, loss: 0.6931
Epoch: 400, loss: 0.6908
Epoch: 500, loss: 0.6933
Epoch: 600, loss: 0.6948
Epoch: 700, loss: 0.6928
Epoch: 800, loss: 0.6930
Epoch: 900, loss: 0.6934
Epoch: 1000, loss: 0.6929
Finished Training
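The loss values above can be compared against the cross-entropy of a classifier that assigns probability 0.5 to each of the two classes. A short worked check:

```python
import math

# Cross-entropy of predicting probability 0.5 for the true class
chance_loss = -math.log(0.5)
print(round(chance_loss, 4))  # 0.6931
```

The training loss stays within about 0.003 of this value throughout, which suggests the model has not learned to separate the classes. This is consistent with the test report in Section 3.1, where every sampled review is predicted as class 0. One plausible contributor, not verified here, is that reviews are right-padded to the maximum length while the classifier reads only the last time step, which for most reviews is a `<PAD>` position.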


### 2.3.4. Save Sequence Model

In [0]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
# Save the Sequence to Sequence Model
torch.save(seq_net, '/content/gdrive/My Drive/Models/sequence2sequence_model.pt')


### 2.3.5. Load Sequence Model

In [0]:
#**********************************RUN THIS CELL**********************************#

# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

id = '1ipDYLgOsUjPFwj2tCn-awiDhl_m7B3rH'
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('sequence2sequence_model.pt')

In [25]:
#**********************************RUN THIS CELL**********************************#

# Load the model
seqtoseqmodel = torch.load('sequence2sequence_model.pt')

# Switch the model to evaluation mode (this does not compute any metrics)
seqtoseqmodel.eval()

SeqNet(
  (lstm): LSTM(74, 10, batch_first=True)
  (linear): Linear(in_features=10, out_features=2, bias=True)
)

# 3 - Evaluation

(*Please show your empirical evidence*)

## 3.1. Performance Evaluation


You are required to provide the table with precision, recall, f1 of test set.

In [0]:
#**********************************RUN THIS CELL**********************************#

# Get the length of list having maximum elements
list_length_test = [len(s) for s in reviews_test_ns]
maxlength_list_test = max(list_length_test)

# Add paddings to all your lists
test_word_pad = add_padding_to_lists(reviews_test_le, maxlength_list_test)

In [0]:
#**********************************RUN THIS CELL**********************************#

# Map the padded testing reviews to the final (concatenated) embeddings
mapped_test_embeddings = map_embeddings(test_word_pad)

In [0]:
#**********************************RUN THIS CELL**********************************#

# Draw a random sample of 5000 reviews (without replacement) from the testing data set
test_input, test_labels = make_batch(mapped_test_embeddings, sentiments_test_encoded, 5000)

# Convert your final embeddings to an array 
test_input_array = np.asarray(test_input)

In [0]:
#**********************************RUN THIS CELL**********************************#

# Get the predicted log-probabilities for the sampled test reviews
with torch.no_grad():
    outputs = seqtoseqmodel(torch.from_numpy(test_input_array).float()) 

# Pick the class with the highest log-probability for each review
_, predicted = torch.max(outputs, 1)

In [30]:
#**********************************RUN THIS CELL**********************************#

# Calculate the precision, recall, F1-score and support

from sklearn.metrics import classification_report
print(classification_report(test_labels, predicted.cpu().numpy(),digits=4))

              precision    recall  f1-score   support

           0     0.5058    1.0000    0.6718      2529
           1     0.0000    0.0000    0.0000      2471

    accuracy                         0.5058      5000
   macro avg     0.2529    0.5000    0.3359      5000
weighted avg     0.2558    0.5058    0.3398      5000





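The F1 score is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives. It is more informative than accuracy here: the accuracy of 0.5058 hides the fact that the model predicts class 0 for every sampled review (recall 1.0000 for class 0 and 0.0000 for class 1), which the macro-averaged F1 of 0.3359 makes visible.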
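The figures in the report can be reproduced by hand from the support counts, since a recall of 1.0 for class 0 and 0.0 for class 1 means all 5000 sampled reviews were predicted as class 0. The sketch below is a plain-Python check; `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from counts, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Class 0: all 5000 reviews predicted as 0; 2529 are truly 0 and 2471 are truly 1
p0, r0, f0 = precision_recall_f1(tp=2529, fp=2471, fn=0)

# Class 1: never predicted, so there are no true or false positives
p1, r1, f1_1 = precision_recall_f1(tp=0, fp=0, fn=2471)

macro_f1 = (f0 + f1_1) / 2
print(round(p0, 4), round(r0, 4), round(f0, 4), round(macro_f1, 4))
# 0.5058 1.0 0.6718 0.3359
```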

## 3.2. Hyperparameter Testing
*You are required to draw a graph(y-axis: f1, x-axis: epoch) for test set and explain the optimal number of epochs based on the learning rate you have already chosen.*

In [0]:
# Please comment your code

## Object Oriented Programming codes here

*You can use multiple code snippets. Just add more if needed* 

In [0]:
# If you used OOP style, use this section