In [1]:
import pandas as pd
import numpy as np
import re
import gensim.downloader as w2vapi
from gensim.test.utils import datapath
from gensim import utils
import gensim.models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import SubsetRandomSampler
import torch.nn as nn
import torch.nn.functional as F
import warnings
warnings.filterwarnings('ignore')



#### Package versions:

Python: 3.9.7

Gensim: 4.3.0

PyTorch: 1.13.1

In [2]:
print(gensim.__version__)

4.3.0


In [3]:
print(torch.__version__)

1.13.1


#### Note: I have not included the data.tsv file in my submission as it was stated in the HW description that it is not required.

However, I have included these 6 extra files:

custom_word2vec.model, fnn_model_avg.pt, fnn_model_concat.pt, rnn_model.pt, gru_model.pt, lstm_model.pt.

The .pt files are the best outcomes of my models based on validation loss. If needed, they can be deleted before execution, as this notebook will automatically generate them again when executed.

## 1. Dataset Generation

In [4]:
data = pd.read_csv('data.tsv',on_bad_lines='skip',sep='\t')
data = data[['review_body','star_rating']]
data = data.dropna()

data['rating_class'] = 0
data.loc[(data['star_rating']==1) | (data['star_rating']=='1') | (data['star_rating']==2) | (data['star_rating']=='2'),'rating_class'] = 0
data.loc[(data['star_rating']==3) | (data['star_rating']=='3'),'rating_class'] = 1
data.loc[(data['star_rating']==4) | (data['star_rating']=='4') | (data['star_rating']==5) | (data['star_rating']=='5'),'rating_class'] = 2

class_1_20k = data.loc[data['rating_class']==0].sample(n=20000, random_state=47)
class_2_20k = data.loc[data['rating_class']==1].sample(n=20000, random_state=47)
class_3_20k = data.loc[data['rating_class']==2].sample(n=20000, random_state=47) #47

df = pd.concat([class_1_20k, class_2_20k, class_3_20k], ignore_index=True)

The above creates a dataset of 60,000 reviews, with 20,000 reviews of each created category:

Rating class 0: rating 1, rating 2

Rating class 1: rating 3

Rating class 2: rating 4, rating 5

## 2. Word Embedding

### (a) Load pre-trained word embeddings and experiment with some words

In [5]:
w2v = w2vapi.load('word2vec-google-news-300')

In [6]:
# w2vapi.load('word2vec-google-news-300',return_path=True)

In [7]:
print(w2v.similarity('important','essential'))

0.65639


In [8]:
print(w2v.similarity('return','replace'))

0.22030458


In [9]:
print(w2v.similarity('expensive','costly'))

0.73467165


In [10]:
print(w2v.similarity('product','item'))

0.25702554


In [11]:
print(w2v.similarity('service','customer'))

0.4600251


In [12]:
print(w2v.similarity('satisfied','pleased') - w2v.similarity('dissatisfied','disappointed'))

0.053421497


The above 6 examples are meant to show semantic similarity between words, using the pretrained Google word embeddings.

Examples chosen are of words which are likely to appear in customer reviews, to make a comprehensive comparison with the model trained from our own dataset, in 2(b) below.

### (b) Train a Word2Vec model using your own dataset

In [13]:
class ReviewsCorpus:
    
    def __iter__(self):
        for index, row in df.iterrows():
            row_sents = str(row['review_body'])
            row_sents = row_sents.replace(',','')
            row_sents = row_sents.split('.')
            for sent in row_sents:
                yield utils.simple_preprocess(sent)

In [14]:
sentences = ReviewsCorpus()
w2v_custom_model = gensim.models.Word2Vec(sentences=sentences, vector_size=300, window=13, min_count=9)
w2v_custom_model.save('custom_word2vec.model')

In [15]:
print(w2v_custom_model.wv.similarity('important','essential'))

0.18706413


In [16]:
print(w2v_custom_model.wv.similarity('return','replace'))

0.5946355


In [17]:
print(w2v_custom_model.wv.similarity('expensive','costly'))

0.6565095


In [18]:
print(w2v_custom_model.wv.similarity('product','item'))

0.7072196


In [19]:
print(w2v_custom_model.wv.similarity('service','customer'))

0.8479623


In [20]:
print(w2v_custom_model.wv.similarity('satisfied','pleased') - w2v_custom_model.wv.similarity('dissatisfied','disappointed'))

0.27745116


#### As seen above, the custom model accurately depicts similarity between words whose context is more similar in our dataset, compared to Word2Vec pretrained embeddings.

(Return, replace), (product, item), (service, customer) have greater relation between each other when we consider our dataset of reviews, hence their semantic similarity is higher.

(Important, essential), (expensive, costly) are not strictly related to our dataset, hence their similarities are better computed from the pre-trained embeddings, which are trained over a much larger corpus.

Also, the similarity between pleased and satisified, and between disappointed and dissatisfied is shown better in manual trained compared to pre-trained. In short Pleased - Disappointed + Satisifed = Dissatisifed is shown better in manual trained. This is because the above words are again more likely to appear in customer reviews.

To conclude, the custom model accurately depicts semantic similarity between words whose context is more similar in our dataset, compared to word2vec pretrained embeddings. But general words' semantic similarity is better in the pre-trained model.

## 3. Simple models

In [21]:
X_avg = []
for i in range(df.shape[0]):
    curr_review = df.iloc[i]['review_body']
    curr_review = curr_review.replace(',','') #removing commas and periods is necessary because otherwise words will not be recognized by the Word2Vec model
    curr_review = curr_review.replace('.','')
    curr_review = curr_review.split()
    curr_vect = []
    for word in curr_review:
        if word in w2v:
            curr_vect.append(w2v[word])
    if len(curr_vect)==0:
        curr_vect = np.zeros((300,), dtype=float)
#         curr_vect = np.array(curr_vect)
    else:
        curr_vect = np.array(curr_vect)
        curr_vect = np.mean(curr_vect,axis=0)
    X_avg.append(curr_vect)
X_avg = np.array(X_avg)

Note: Even though no data pre-processing or cleaning was conducted on data, full stops and commas were removed.

This was necessary because the Word2Vec model would not be able to recognize words if they are attached behind a full stop or comma. Eg. Word embedding would not exist for 'happy,' or 'happy.', but will exist for 'happy'.

Hence, to pass only the word to the Word2Vec model, the commas and full stops are removed from the dataset.

In [22]:
X_avg.shape

(60000, 300)

In [23]:
X_train, X_test, Y_train, Y_test = train_test_split(X_avg, df['rating_class'], test_size=0.2, random_state=45)

### Perceptron

In [24]:
percep = Perceptron(penalty='elasticnet',alpha=0.00001, random_state=168)
percep = percep.fit(X_train, Y_train)

In [25]:
print('Accuracy when using Word2Vec features:',str(percep.score(X_test,Y_test)))

Accuracy when using Word2Vec features: 0.5851666666666666


Accuracy when using TF-IDF features: 0.6384166666666666

### SVM

In [26]:
lin_svc = LinearSVC(penalty='l1', dual=False,C=0.3)
lin_svc = lin_svc.fit(X_train, Y_train)

In [27]:
print('Accuracy when using Word2Vec features:',str(lin_svc.score(X_test,Y_test)))

Accuracy when using Word2Vec features: 0.6538333333333334


Accuracy when using TF-IDF features: 0.7143333333333334

#### It is seen that the models performed better when using TF-IDF features, compared to Word2Vec features.
This can be because TF-IDF vectorizer would consider only words from our corpus of reviews, while the Word2Vec pre-trained embeddings are trained on a large corpus, which contains general words of all kinds.

Some words in reviews tend to be common, and TF-IDF enables the functionality which shows how important a word is in a document based on its frequency in a corpus. This may have led to a better performance than Word2Vec.

#### Note: The accuracies for both of the above models when computed using TF-IDF features is a part of HW1. The random seeds for dataset generation, as well as model hyperparameters are the same. The only difference is that HW1 used TF-IDF features. I have only reported those numbers here, as was told by Professor during one of the lectures.

## 4. Feedforward Neural Networks

In [28]:
X_train, X_test, Y_train, Y_test = train_test_split(pd.DataFrame(df['review_body']), pd.DataFrame(df['rating_class']), test_size=0.2, random_state=45)

In [29]:
def accuracy(model, dataloader):
    prediction_list = []
    actual_list = []
    for i, batch in enumerate(dataloader):
        outputs = model(batch[0])
        _, predicted = torch.max(outputs.data, 1) 
        prediction_list.append(int(predicted[0]))
        actual_list.append(int(batch[1][0]))
    
    total=0
    for i in range(len(prediction_list)):
        if prediction_list[i]==actual_list[i]:
            total+=1
    return float(total)/len(prediction_list)

### (a) Taking the average of all Word2Vec vectors

In [30]:
len(X_train)

48000

In [31]:
len(X_test)

12000

In [32]:
class TrainDataAvg:
    
    def __init__(self, reviews, ratings):
        self.reviews = reviews
        self.ratings = ratings
        
    def __len__(self):
        return len(self.reviews)
    
    def __getitem__(self, i):
        curr_review = self.reviews.iloc[i]['review_body']
        curr_review = curr_review.replace(',','')
        curr_review = curr_review.replace('.','')
        curr_review = curr_review.split()
        curr_vect = []
        for word in curr_review:
            if word in w2v:
                curr_vect.append(w2v[word])
        if len(curr_vect)==0:
            curr_vect = np.zeros((300,), dtype=np.float32)
    #         curr_vect = np.array(curr_vect)
        else:
            curr_vect = np.array(curr_vect)
            curr_vect = np.mean(curr_vect,axis=0)
        
        curr_vect = torch.from_numpy(curr_vect)
        
        label = self.ratings.iloc[i]['rating_class']
        
        return curr_vect, label
    


class TestDataAvg:
    
    def __init__(self, reviews, ratings):
        self.reviews = reviews
        self.ratings = ratings
        
    def __len__(self):
        return len(self.reviews)
    
    def __getitem__(self, i):
        curr_review = self.reviews.iloc[i]['review_body']
        curr_review = curr_review.replace(',','')
        curr_review = curr_review.replace('.','')
        curr_review = curr_review.split()
        curr_vect = []
        for word in curr_review:
            if word in w2v:
                curr_vect.append(w2v[word])
        if len(curr_vect)==0:
            curr_vect = np.zeros((300,), dtype=np.float32)
    #         curr_vect = np.array(curr_vect)
        else:
            curr_vect = np.array(curr_vect)
            curr_vect = np.mean(curr_vect,axis=0)
        
        curr_vect = torch.from_numpy(curr_vect)
        
        label = self.ratings.iloc[i]['rating_class']
        
        return curr_vect, label

In [33]:
train_data_avg = TrainDataAvg(X_train,Y_train)
test_data_avg = TestDataAvg(X_test,Y_test)

In [34]:
batch_size=100
validation_size=0.2

num_train = len(train_data_avg)
inds = list(range(num_train))
np.random.shuffle(inds)
split = int(np.floor(validation_size*num_train))
train_idx, valid_idx = inds[split:], inds[:split]

train_sampler = SubsetRandomSampler(train_idx)
validn_sampler = SubsetRandomSampler(valid_idx)

train_loader = DataLoader(train_data_avg, batch_size=batch_size, sampler=train_sampler)
validn_loader = DataLoader(train_data_avg, batch_size=batch_size, sampler=validn_sampler)

In [35]:
class FFNetAvg(nn.Module):
    
    def __init__(self):
        super(FFNetAvg, self).__init__()
        
        hidden_1 = 100
        hidden_2 = 10
        
        self.fc1 = nn.Linear(300, hidden_1)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
        self.fc3 = nn.Linear(hidden_2, 3)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self,x):
        x = x.to(torch.float32)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        
        x = self.fc3(x)
        
        return x

fnn_model_avg = FFNetAvg()
print(fnn_model_avg)

FFNetAvg(
  (fc1): Linear(in_features=300, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [36]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(fnn_model_avg.parameters(), lr=0.0005)

In [37]:
epochs = 50

validn_min_loss = np.Inf

for epoch in range(epochs):
    train_loss = 0.0
    validn_loss = 0.0
    
    fnn_model_avg.train()
    for rev, rev_class in train_loader:
        optimizer.zero_grad()
        output = fnn_model_avg(rev)
        
        loss = criterion(output,rev_class)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()*rev.size(0)
        
    fnn_model_avg.eval()
    for rev, rev_class in validn_loader:
        output = fnn_model_avg(rev)
        
        loss = criterion(output,rev_class)
        
        validn_loss += loss.item()*rev.size(0)
    
    train_loss = train_loss/(len(train_loader)*batch_size)
    validn_loss = validn_loss/(len(validn_loader)*batch_size)
    
    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(epoch+1, train_loss, validn_loss))
    
    if validn_loss <= validn_min_loss:
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model.'.format(validn_min_loss, validn_loss))
        torch.save(fnn_model_avg.state_dict(), 'fnn_model_avg.pt')
        validn_min_loss = validn_loss

Epoch: 1 	Training Loss: 0.981316 	Validation Loss: 0.891588
Validation loss decreased (inf --> 0.891588).  Saving model.
Epoch: 2 	Training Loss: 0.875733 	Validation Loss: 0.848071
Validation loss decreased (0.891588 --> 0.848071).  Saving model.
Epoch: 3 	Training Loss: 0.844206 	Validation Loss: 0.829830
Validation loss decreased (0.848071 --> 0.829830).  Saving model.
Epoch: 4 	Training Loss: 0.829030 	Validation Loss: 0.816238
Validation loss decreased (0.829830 --> 0.816238).  Saving model.
Epoch: 5 	Training Loss: 0.819646 	Validation Loss: 0.811278
Validation loss decreased (0.816238 --> 0.811278).  Saving model.
Epoch: 6 	Training Loss: 0.810481 	Validation Loss: 0.806048
Validation loss decreased (0.811278 --> 0.806048).  Saving model.
Epoch: 7 	Training Loss: 0.806280 	Validation Loss: 0.803247
Validation loss decreased (0.806048 --> 0.803247).  Saving model.
Epoch: 8 	Training Loss: 0.801907 	Validation Loss: 0.801237
Validation loss decreased (0.803247 --> 0.801237).  Sav

In [38]:
test_loader = DataLoader(test_data_avg, batch_size=1)

In [39]:
fnn_model_avg = FFNetAvg()
fnn_model_avg.load_state_dict(torch.load('fnn_model_avg.pt'))

<All keys matched successfully>

In [40]:
print('Accuracy of FNN using average Word2Vec vectors:',str(accuracy(fnn_model_avg, test_loader)))

Accuracy of FNN using average Word2Vec vectors: 0.6505


### (b) Concatenating the first 10 Word2Vec vectors

In [41]:
class TrainDataConcat:
    
    def __init__(self, reviews, ratings):
        self.reviews = reviews
        self.ratings = ratings
        
    def __len__(self):
        return len(self.reviews)
    
    def __getitem__(self, i):
        curr_review = self.reviews.iloc[i]['review_body']
        curr_review = curr_review.replace(',','')
        curr_review = curr_review.replace('.','')
        curr_review = curr_review.split()
        curr_vect = []
        word_count = 0
        for word in curr_review:
            if word_count==10:
                break
            if word in w2v:
                word_count+=1
                curr_vect.append(w2v[word])

        while word_count<10:
            curr_vect.append(np.zeros((300,), dtype=np.float32))
            word_count+=1

        if len(curr_vect)==0:
            curr_vect = np.zeros((3000,), dtype=np.float32)
        else:
            curr_vect = np.array(curr_vect)
            curr_vect = curr_vect.flatten()

        curr_vect = torch.from_numpy(curr_vect)
        
        label = self.ratings.iloc[i]['rating_class']
        
        return curr_vect, label
    


class TestDataConcat:
    
    def __init__(self, reviews, ratings):
        self.reviews = reviews
        self.ratings = ratings
        
    def __len__(self):
        return len(self.reviews)
    
    def __getitem__(self, i):
        curr_review = self.reviews.iloc[i]['review_body']
        curr_review = curr_review.replace(',','')
        curr_review = curr_review.replace('.','')
        curr_review = curr_review.split()
        curr_vect = []
        word_count = 0
        for word in curr_review:
            if word_count==10:
                break
            if word in w2v:
                word_count+=1
                curr_vect.append(w2v[word])

        while word_count<10:
            curr_vect.append(np.zeros((300,),dtype=np.float32))
            word_count+=1

        if len(curr_vect)==0:
            curr_vect = np.zeros((3000,),dtype=np.float32)
        else:
            curr_vect = np.array(curr_vect)
            curr_vect = curr_vect.flatten()

        curr_vect = torch.from_numpy(curr_vect)
        
        label = self.ratings.iloc[i]['rating_class']
        
        return curr_vect, label

In [42]:
train_data_concat = TrainDataConcat(X_train,Y_train)
test_data_concat = TestDataConcat(X_test,Y_test)

In [43]:
batch_size=100
validation_size=0.2

num_train = len(train_data_concat)
inds = list(range(num_train))
np.random.shuffle(inds)
split = int(np.floor(validation_size*num_train))
train_idx, valid_idx = inds[split:], inds[:split]

train_sampler = SubsetRandomSampler(train_idx)
validn_sampler = SubsetRandomSampler(valid_idx)

train_loader = DataLoader(train_data_concat, batch_size=batch_size, sampler=train_sampler)
validn_loader = DataLoader(train_data_concat, batch_size=batch_size, sampler=validn_sampler)

In [44]:
class FFNetConcat(nn.Module):
    
    def __init__(self):
        super(FFNetConcat, self).__init__()
        
        hidden_1 = 100
        hidden_2 = 10
        
        self.fc1 = nn.Linear(3000, hidden_1)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
        self.fc3 = nn.Linear(hidden_2, 3)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self,x):
#         x = x.to(torch.float32)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        
        x = self.fc3(x)
        
        return x

fnn_model_concat = FFNetConcat()
print(fnn_model_concat)

FFNetConcat(
  (fc1): Linear(in_features=3000, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [45]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(fnn_model_concat.parameters(), lr=0.00005)

In [46]:
epochs = 50

validn_min_loss = np.Inf

for epoch in range(epochs):
    train_loss = 0.0
    validn_loss = 0.0
    
    fnn_model_concat.train()
    for rev, rev_class in train_loader:
        optimizer.zero_grad()
        output = fnn_model_concat(rev)
        
        loss = criterion(output,rev_class)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()*rev.size(0)
        
    fnn_model_concat.eval()
    for rev, rev_class in validn_loader:
        output = fnn_model_concat(rev)
        
        loss = criterion(output,rev_class)
        
        validn_loss += loss.item()*rev.size(0)
    
    train_loss = train_loss/(len(train_loader)*batch_size)
    validn_loss = validn_loss/(len(validn_loader)*batch_size)
    
    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(epoch+1, train_loss, validn_loss))
    
    if validn_loss <= validn_min_loss:
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model.'.format(validn_min_loss, validn_loss))
        torch.save(fnn_model_concat.state_dict(), 'fnn_model_concat.pt')
        validn_min_loss = validn_loss

Epoch: 1 	Training Loss: 1.094918 	Validation Loss: 1.073728
Validation loss decreased (inf --> 1.073728).  Saving model.
Epoch: 2 	Training Loss: 1.051941 	Validation Loss: 1.025497
Validation loss decreased (1.073728 --> 1.025497).  Saving model.
Epoch: 3 	Training Loss: 1.006022 	Validation Loss: 0.976844
Validation loss decreased (1.025497 --> 0.976844).  Saving model.
Epoch: 4 	Training Loss: 0.963434 	Validation Loss: 0.946867
Validation loss decreased (0.976844 --> 0.946867).  Saving model.
Epoch: 5 	Training Loss: 0.933874 	Validation Loss: 0.926890
Validation loss decreased (0.946867 --> 0.926890).  Saving model.
Epoch: 6 	Training Loss: 0.916045 	Validation Loss: 0.917052
Validation loss decreased (0.926890 --> 0.917052).  Saving model.
Epoch: 7 	Training Loss: 0.903279 	Validation Loss: 0.910322
Validation loss decreased (0.917052 --> 0.910322).  Saving model.
Epoch: 8 	Training Loss: 0.893837 	Validation Loss: 0.905126
Validation loss decreased (0.910322 --> 0.905126).  Sav

In [47]:
test_loader = DataLoader(test_data_concat, batch_size=1)

In [48]:
fnn_model_concat = FFNetConcat()
fnn_model_concat.load_state_dict(torch.load('fnn_model_concat.pt'))

<All keys matched successfully>

In [49]:
print('Accuracy of FNN using first 10 concatenatead Word2Vec vectors:',str(accuracy(fnn_model_concat, test_loader)))

Accuracy of FNN using first 10 concatenatead Word2Vec vectors: 0.5769166666666666


#### Considering average Word2Vec vectors, the Feedforward Neural Network outperformed the single perceptron, and gave comparable results to the Support Vector Machine.

It outperforms single perceptron due to it's larger network of hidden layers and nodes, which enables better learning across epochs to classify the data.
The linear SVM gave similar results, which might mean that the data was linearly separable, hence a comparable accuracy.

#### The first 10 concatenated Word2Vec FNN gave much inferior results compared to SVM, and slightly inferior or similar results to the single perceptron.

This could be the case as the first 10 words being concatenated do not necessarily have all the information needed to conclude the sentiment of the review, resulting in likely wrong classifications. Also, the number of input features in this case become too large, which could sometimes inhibit efficient learning.

## 5. Recurrent Neural Networks

### (a) Simple RNN cell

In [50]:
class TrainDataRNN:
    
    def __init__(self, reviews, ratings):
        self.reviews = reviews
        self.ratings = ratings
        
    def __len__(self):
        return len(self.reviews)
    
    def __getitem__(self, i):
        curr_review = self.reviews.iloc[i]['review_body']
        curr_review = curr_review.replace(',','')
        curr_review = curr_review.replace('.','')
        curr_review = curr_review.split()
        curr_vect = []
        word_count = 0
        for word in curr_review:
            if word_count==20:
                break
            if word in w2v:
                word_count+=1
                curr_vect.append(w2v[word])

        while word_count<20:
            curr_vect.append(np.zeros((300,), dtype=np.float32))
            word_count+=1

        if len(curr_vect)==0:
            curr_vect = np.zeros((20,300,), dtype=np.float32)
        else:
            curr_vect = np.array(curr_vect)
#             curr_vect = curr_vect.flatten()

        curr_vect = torch.from_numpy(curr_vect)
        
        label = self.ratings.iloc[i]['rating_class']
        
        return curr_vect, label
    


class TestDataRNN:
    
    def __init__(self, reviews, ratings):
        self.reviews = reviews
        self.ratings = ratings
        
    def __len__(self):
        return len(self.reviews)
    
    def __getitem__(self, i):
        curr_review = self.reviews.iloc[i]['review_body']
        curr_review = curr_review.replace(',','')
        curr_review = curr_review.replace('.','')
        curr_review = curr_review.split()
        curr_vect = []
        word_count = 0
        for word in curr_review:
            if word_count==20:
                break
            if word in w2v:
                word_count+=1
                curr_vect.append(w2v[word])

        while word_count<20:
            curr_vect.append(np.zeros((300,),dtype=np.float32))
            word_count+=1

        if len(curr_vect)==0:
            curr_vect = np.zeros((20,300,),dtype=np.float32)
        else:
            curr_vect = np.array(curr_vect)
#             curr_vect = curr_vect.flatten()

        curr_vect = torch.from_numpy(curr_vect)
        
        label = self.ratings.iloc[i]['rating_class']
        
        return curr_vect, label

In [51]:
train_data_rnn = TrainDataRNN(X_train,Y_train)
test_data_rnn = TestDataRNN(X_test,Y_test)

In [52]:
batch_size=100
validation_size=0.2

num_train = len(train_data_rnn)
inds = list(range(num_train))
np.random.shuffle(inds)
split = int(np.floor(validation_size*num_train))
train_idx, valid_idx = inds[split:], inds[:split]

train_sampler = SubsetRandomSampler(train_idx)
validn_sampler = SubsetRandomSampler(valid_idx)

train_loader = DataLoader(train_data_rnn, batch_size=batch_size, sampler=train_sampler)
validn_loader = DataLoader(train_data_rnn, batch_size=batch_size, sampler=validn_sampler)

In [53]:
class RNNModel(nn.Module):
    
    def __init__(self):
        super(RNNModel, self).__init__()
        
        self.rnn = nn.RNN(input_size=300, hidden_size=20, num_layers=1, dropout=0.2, batch_first=True, nonlinearity='tanh')
#         self.fc1 = nn.Linear(20,10)
        self.fc = nn.Linear(20,3)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self,x):

        x = self.rnn(x)
        x = self.fc(x[0][:,-1,:])
        
        return x

rnn_model = RNNModel()
print(rnn_model)

RNNModel(
  (rnn): RNN(300, 20, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=20, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [54]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn_model.parameters(), lr=0.0005) #0.0005

In [55]:
epochs = 50

validn_min_loss = np.Inf

for epoch in range(epochs):
    train_loss = 0.0
    validn_loss = 0.0
    
    rnn_model.train()
    for rev, rev_class in train_loader:
        optimizer.zero_grad()
        output = rnn_model(rev)
        loss = criterion(output,rev_class)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()*rev.size(0)
        
    rnn_model.eval()
    for rev, rev_class in validn_loader:
        output = rnn_model(rev)
        
        loss = criterion(output,rev_class)
        
        validn_loss += loss.item()*rev.size(0)
    
    train_loss = train_loss/(len(train_loader)*batch_size)
    validn_loss = validn_loss/(len(validn_loader)*batch_size)
    
    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(epoch+1, train_loss, validn_loss))
    
    if validn_loss <= validn_min_loss:
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model.'.format(validn_min_loss, validn_loss))
        torch.save(rnn_model.state_dict(), 'rnn_model.pt')
        validn_min_loss = validn_loss

Epoch: 1 	Training Loss: 1.089949 	Validation Loss: 1.012769
Validation loss decreased (inf --> 1.012769).  Saving model.
Epoch: 2 	Training Loss: 0.960192 	Validation Loss: 0.941047
Validation loss decreased (1.012769 --> 0.941047).  Saving model.
Epoch: 3 	Training Loss: 0.923711 	Validation Loss: 0.913140
Validation loss decreased (0.941047 --> 0.913140).  Saving model.
Epoch: 4 	Training Loss: 0.902098 	Validation Loss: 0.899408
Validation loss decreased (0.913140 --> 0.899408).  Saving model.
Epoch: 5 	Training Loss: 0.891008 	Validation Loss: 0.892349
Validation loss decreased (0.899408 --> 0.892349).  Saving model.
Epoch: 6 	Training Loss: 0.882429 	Validation Loss: 0.907064
Epoch: 7 	Training Loss: 0.876262 	Validation Loss: 0.882344
Validation loss decreased (0.892349 --> 0.882344).  Saving model.
Epoch: 8 	Training Loss: 0.870632 	Validation Loss: 0.877208
Validation loss decreased (0.882344 --> 0.877208).  Saving model.
Epoch: 9 	Training Loss: 0.865073 	Validation Loss: 0.8

In [56]:
test_loader = DataLoader(test_data_rnn, batch_size=1)

In [57]:
rnn_model = RNNModel()
rnn_model.load_state_dict(torch.load('rnn_model.pt'))

<All keys matched successfully>

In [58]:
print('Accuracy of RNN model:',str(accuracy(rnn_model, test_loader)))

Accuracy of RNN model: 0.63375


#### The RNN model performed better when compared to the FNN where first 10 words are concatenated.

This is because RNN takes into account sequential data. Hence, the sequence of the first 20 words (for RNN) is taken into account across time steps, rather than simply passing the concatenated words as input.

#### However, the RNN model did not perform as well as the FNN where average of word embeddings was taken.

This can be because the average computed word vectors take the entire review into account, while RNN is only taking the first 20 words. Hence, even though RNN takes sequence into account, as only 20 words are passed to it, they may not always be enough to make an accurate classification.

### (b) GRU

In [59]:
class GRUModel(nn.Module):
    
    def __init__(self):
        super(GRUModel, self).__init__()
        
        self.gru = nn.GRU(input_size=300, hidden_size=20, dropout=0.2, batch_first=True, num_layers=1)
        self.fc = nn.Linear(20,3)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self,x):
        x = self.gru(x)
        x = self.fc(x[0][:,-1,:])
        
        return x

gru_model = GRUModel()
print(gru_model)

GRUModel(
  (gru): GRU(300, 20, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=20, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [60]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(gru_model.parameters(), lr=0.0005) #0.0005

In [61]:
epochs = 50

validn_min_loss = np.Inf

for epoch in range(epochs):
    train_loss = 0.0
    validn_loss = 0.0
    
    gru_model.train()
    for rev, rev_class in train_loader:
        optimizer.zero_grad()
        output = gru_model(rev)
        
        loss = criterion(output,rev_class)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()*rev.size(0)
        
    gru_model.eval()
    for rev, rev_class in validn_loader:
        output = gru_model(rev)
        
        loss = criterion(output,rev_class)
        
        validn_loss += loss.item()*rev.size(0)
    
    train_loss = train_loss/(len(train_loader)*batch_size)
    validn_loss = validn_loss/(len(validn_loader)*batch_size)
    
    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(epoch+1, train_loss, validn_loss))
    
    if validn_loss <= validn_min_loss:
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model.'.format(validn_min_loss, validn_loss))
        torch.save(gru_model.state_dict(), 'gru_model.pt')
        validn_min_loss = validn_loss

Epoch: 1 	Training Loss: 1.038516 	Validation Loss: 0.933581
Validation loss decreased (inf --> 0.933581).  Saving model.
Epoch: 2 	Training Loss: 0.903905 	Validation Loss: 0.888318
Validation loss decreased (0.933581 --> 0.888318).  Saving model.
Epoch: 3 	Training Loss: 0.867890 	Validation Loss: 0.867774
Validation loss decreased (0.888318 --> 0.867774).  Saving model.
Epoch: 4 	Training Loss: 0.834430 	Validation Loss: 0.830153
Validation loss decreased (0.867774 --> 0.830153).  Saving model.
Epoch: 5 	Training Loss: 0.808288 	Validation Loss: 0.810311
Validation loss decreased (0.830153 --> 0.810311).  Saving model.
Epoch: 6 	Training Loss: 0.788979 	Validation Loss: 0.804769
Validation loss decreased (0.810311 --> 0.804769).  Saving model.
Epoch: 7 	Training Loss: 0.776055 	Validation Loss: 0.793745
Validation loss decreased (0.804769 --> 0.793745).  Saving model.
Epoch: 8 	Training Loss: 0.766220 	Validation Loss: 0.797796
Epoch: 9 	Training Loss: 0.756490 	Validation Loss: 0.7

In [62]:
test_loader = DataLoader(test_data_rnn, batch_size=1)

In [63]:
gru_model = GRUModel()
gru_model.load_state_dict(torch.load('gru_model.pt'))

<All keys matched successfully>

In [64]:
print('Accuracy of GRU model:',str(accuracy(gru_model, test_loader)))

Accuracy of GRU model: 0.6644166666666667


### (c) LSTM

In [65]:
class LSTMModel(nn.Module):
    
    def __init__(self):
        super(LSTMModel, self).__init__()
        
        self.lstm = nn.LSTM(input_size=300, hidden_size=20, batch_first=True, num_layers=1)
        self.fc = nn.Linear(20,3)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self,x):
        x = self.lstm(x)
        x = self.fc(x[0][:,-1,:])
        return x

lstm_model = LSTMModel()
print(lstm_model)

LSTMModel(
  (lstm): LSTM(300, 20, batch_first=True)
  (fc): Linear(in_features=20, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [66]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=0.0005)

In [67]:
epochs = 50

validn_min_loss = np.Inf

for epoch in range(epochs):
    train_loss = 0.0
    validn_loss = 0.0
    
    lstm_model.train()
    for rev, rev_class in train_loader:
        optimizer.zero_grad()
        output = lstm_model(rev)
        
        loss = criterion(output,rev_class)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()*rev.size(0)
        
    lstm_model.eval()
    for rev, rev_class in validn_loader:
        output = lstm_model(rev)
        
        loss = criterion(output,rev_class)
        
        validn_loss += loss.item()*rev.size(0)
    
    train_loss = train_loss/(len(train_loader)*batch_size)
    validn_loss = validn_loss/(len(validn_loader)*batch_size)
    
    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(epoch+1, train_loss, validn_loss))
    
    if validn_loss <= validn_min_loss:
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model.'.format(validn_min_loss, validn_loss))
        torch.save(lstm_model.state_dict(), 'lstm_model.pt')
        validn_min_loss = validn_loss

Epoch: 1 	Training Loss: 1.054660 	Validation Loss: 0.962084
Validation loss decreased (inf --> 0.962084).  Saving model.
Epoch: 2 	Training Loss: 0.916386 	Validation Loss: 0.890870
Validation loss decreased (0.962084 --> 0.890870).  Saving model.
Epoch: 3 	Training Loss: 0.865927 	Validation Loss: 0.856398
Validation loss decreased (0.890870 --> 0.856398).  Saving model.
Epoch: 4 	Training Loss: 0.839467 	Validation Loss: 0.839902
Validation loss decreased (0.856398 --> 0.839902).  Saving model.
Epoch: 5 	Training Loss: 0.819968 	Validation Loss: 0.834164
Validation loss decreased (0.839902 --> 0.834164).  Saving model.
Epoch: 6 	Training Loss: 0.805430 	Validation Loss: 0.818035
Validation loss decreased (0.834164 --> 0.818035).  Saving model.
Epoch: 7 	Training Loss: 0.792433 	Validation Loss: 0.811700
Validation loss decreased (0.818035 --> 0.811700).  Saving model.
Epoch: 8 	Training Loss: 0.778912 	Validation Loss: 0.803525
Validation loss decreased (0.811700 --> 0.803525).  Sav

In [68]:
test_loader = DataLoader(test_data_rnn, batch_size=1)

In [69]:
lstm_model = LSTMModel()
lstm_model.load_state_dict(torch.load('lstm_model.pt'))

<All keys matched successfully>

In [70]:
print('Accuracy of LSTM model:',str(accuracy(lstm_model, test_loader)))

Accuracy of LSTM model: 0.66225


#### Both the GRU and the LSTM model show a better performance than regular RNN.

This is because both GRU and LSTM compute long-term dependencies better than the regular RNN.
Hence, the context of a review is better taken into account on these networks.

#### Also, when comparing GRU and LSTM, GRU gave slightly better results across MOST of my model runs.

This can be because GRU is known to perform better than LSTM when the dataset is relatively small.

If the dataset had even longer sequences, and a larger input size, then LSTM would significantly perform better than GRU due to its superior ability to encode long-term sequences.

#### Both GRU and LSTM also give better results than both the Feedforward Networks, the one with average Word2Vec vectors, and the one with first 10 words concatenated.

Since GRU and LSTM are taking 20 words sequentially and are known to maintain long-term dependencies well, they outperform both types of FNNs. Even if average word vectors were taken, the data would not be as informative as 20 words taken as input sequentially.

### References:

[1] https://www.kaggle.com/code/mishra1993/pytorch-multi-layer-perceptron-mnist/notebook - Mentioned in HW description PDF.