# Sentiment Analysis - Amazon Reviews - 2

## Binary and Ternary Sentiment Analysis
## Data generation -> Word Embedding -> Data Preprocessing -> TF-IDF -> Simple Model (Perceptron, SVM) -> Feedforward Neural Networks -> Recurrent Neural Networks

## Libraries

## Note: Summary of Accuracies is present at the end of the notebook along with some obervations

In [137]:
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet', quiet=True)
import re
from bs4 import BeautifulSoup

In [2]:
import gensim.downloader as api
from gensim import utils
import gensim.models

In [3]:
import contractions
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [6]:
import torch
from torch.utils import data
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

In [351]:
#! pip install bs4 # in case you don't have it installed

# Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Kitchen_v1_00.tsv.gz

# Task 1 : Dataset Generation

## Read Data

In [355]:
 data=pd.read_csv('amazon_reviews_us_Kitchen_v1_00.tsv', sep="\t", error_bad_lines=False, warn_bad_lines=False)

## Keep Reviews and Ratings

In [356]:
df=data[["review_body","star_rating"]]

## Labelling Reviews:

In [357]:
# statistics of all the reviews
df=df.dropna()
df_grouped = df.groupby('star_rating')
df_grouped.size()

star_rating
1.0     426870
2.0     241939
3.0     349539
4.0     731701
5.0    3124595
dtype: int64

In [358]:
df_subset = df.groupby('star_rating', group_keys=False).apply(lambda grp: grp.sample(n=50000))

In [359]:
df_subset.groupby('star_rating').size()

star_rating
1.0    50000
2.0    50000
3.0    50000
4.0    50000
5.0    50000
dtype: int64

In [59]:
df_labeled=df_subset.copy()
df_labeled['star_rating'][(df_labeled['star_rating'] <= 2 )] = 0
df_labeled['star_rating'][(df_labeled['star_rating'] > 3 )] = 1
df_labeled['star_rating'][(df_labeled['star_rating'] == 3 )] = 2

In [361]:
df_labeled.sample(5)

Unnamed: 0,review_body,star_rating
2854151,Exactly what I thought I was purchasing. I mea...,1.0
2413003,The Top keeps opening for a shaker bottle. Def...,0.0
4779547,This has a bitter taste. It is no good.,0.0
2715603,I haven't used this yet as I am waiting for my...,2.0
753248,Authentic shape. Used black and yellow sprinkl...,1.0


In [362]:
#statistics of just the 2 classes used
df_labeled.groupby('star_rating').count()

Unnamed: 0_level_0,review_body
star_rating,Unnamed: 1_level_1
0.0,100000
1.0,100000
2.0,50000


In [392]:
df_labeled.reset_index(inplace=True)

In [3]:
df_labeled=df_labeled[['review_body','star_rating']]

 ## Save dataset


In [384]:
# df_labeled.to_csv('dataset_balanced.csv')

In [189]:
# df_labeled=pd.read_csv('dataset_balanced.csv', usecols=['review_body','star_rating'])

 # Task 2 : Word Embedding


Refer https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html

 ## Task 2A


In [11]:
# !pip install gensim
wv = api.load('word2vec-google-news-300')

In [191]:
print("Semantic Similarity between Good and Evil: ",wv.similarity("good", "evil"))
print("Semantic Similarity between Good and Excellent: ",wv.similarity("good", "excellent"))
print("We can see that Good is much more similary semantically to Excellent than Evil")

Semantic Similarity between Good and Evil:  0.20598836
Semantic Similarity between Good and Excellent:  0.6442929
We can see that Good is much more similary semantically to Excellent than Evil


In [192]:
print(wv.most_similar('king', topn=5))

[('kings', 0.7138045430183411), ('queen', 0.6510956883430481), ('monarch', 0.6413194537162781), ('crown_prince', 0.6204220056533813), ('prince', 0.6159993410110474)]


 ## Task 2B


In [6]:
class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        for review in df_labeled["review_body"]:
            yield utils.simple_preprocess(review)

In [7]:
sentences = MyCorpus()
model_w2v = gensim.models.Word2Vec(sentences=sentences, min_count=10, window=11, vector_size=300)

In [12]:
#save model
# model_w2v.save('Word2Vec_model.bin')
# model_w2v = gensim.models.Word2Vec.load('Word2Vec_model.bin')

In [196]:
print("Semantic Similarity between Good and Evil: ",model_w2v.wv.similarity("good", "evil"))
print("Semantic Similarity between Good and Excellent: ",model_w2v.wv.similarity("good", "excellent"))
print("We can see that the similarities have reduced in magnitude and Good and Evil has become even more smaller while there is a small impact on Excellent. After experimenting with other values, I feel the pre-build model worked better as it gave much higher similairties for similar items")

Semantic Similarity between Good and Evil:  0.0873883
Semantic Similarity between Good and Excellent:  0.5899097
We can see that the similarities have reduced in magnitude and Good and Evil has become even more smaller while there is a small impact on Excellent. After experimenting with other values, I feel the pre-build model worked better as it gave much higher similairties for similar items


 # Task 3 : Simple models


## Data Cleaning

### Convert the all reviews into the lower case.

In [198]:
df_labeled['review_body']=df_labeled['review_body'].str.lower()

### remove the HTML and URLs from the reviews

In [199]:
def remove_html_tags(text):
    clean = re.compile('<.*?>')
    text=str(text)
    text = re.sub(clean, '', text)
    return re.sub(r"\S*http\S+", "", text)

df_labeled['review_body']=df_labeled['review_body'].apply(remove_html_tags)

### perform contractions on the reviews.

In [200]:
def contractionfunction(s):
    s=contractions.fix(s)
    return s
df_labeled['review_body']=df_labeled['review_body'].apply(contractionfunction)

### remove non-alphabetical characters

In [201]:
def remove_non_alpha(text):
    clean = re.compile('[^a-zA-Z]+')
    text=str(text)
    return re.sub(clean, ' ', text)

df_labeled['review_body']=df_labeled['review_body'].apply(remove_non_alpha)

### Remove the extra spaces between the words

In [202]:
def remove_extra_space(text):
     return re.sub(' +', ' ', str(text.strip()))

df_labeled['review_body']=df_labeled['review_body'].apply(remove_extra_space)

 ### average character length after cleaning



In [203]:
char_len_after = sum(df_labeled["review_body"].str.len())/df_labeled.shape[0]
print(char_len_after)

325.634672


# Pre-processing

### remove the stop words 

In [204]:
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
stopwords_set = set(stopwords.words("english"))                  


### perform lemmatization  

In [205]:
wnl=WordNetLemmatizer()


def lemmatize_remove_stopwords(text):
    text = ' '.join([wnl.lemmatize(word) for word in nltk.word_tokenize(text) if  word not in stopwords_set])
    return text

df_labeled['review_body']=df_labeled['review_body'].apply(lemmatize_remove_stopwords)

 ### average character length after preprocessing



In [206]:
char_len_after_prep = sum(df_labeled["review_body"].str.len())/df_labeled.shape[0]
print(char_len_after_prep)

198.415288


In [207]:
df_labeled.replace("", float("NaN"), inplace=True)
df_labeled.dropna(inplace=True)

 ## sample review



In [208]:
df_labeled.sample(5)

Unnamed: 0,review_body,star_rating
214276,second one gave first gift friend use everyday...,1.0
150795,efficient chopping onion tomato vegetable draw...,1.0
124344,plastic really flimsy warp easily filter best ...,2.0
58434,gingerbread bit stale,0.0
171666,hour review figure said said main purpose writ...,1.0


In [7]:
#save
# df_labeled.to_csv('df_labeled_pre.csv')
# df_labeled=pd.read_csv('df_labeled_pre.csv', usecols=['review_body','star_rating'])

 ## Dataframe preperation



In [13]:
list_mymodel=[]
list_prebuilt=[]
empty_word2vec=[]
for i in range(len(df_labeled)):
    count_mymodel=0
    count_prebuilt=0
    list_temp_mymodel=np.zeros([300])
    list_temp_prebuilt=np.zeros([300])
    for word in nltk.word_tokenize(df_labeled.iloc[i,0]):
        
        if word in model_w2v.wv:
            word_emb_mymodel = np.asarray(model_w2v.wv[word])
            count_mymodel+=1
            list_temp_mymodel+=word_emb_mymodel
        
        if word in wv:
            word_emb_prebuilt = np.asarray(wv[word])
            count_prebuilt+=1
            list_temp_prebuilt+=word_emb_prebuilt
         
    if count_mymodel!=0:     
        list_mymodel.append(np.append(list_temp_mymodel/count_mymodel,df_labeled.iloc[i,-1]))
    if count_prebuilt!=0:
        list_prebuilt.append(np.append(list_temp_prebuilt/count_prebuilt,df_labeled.iloc[i,-1]))
    if count_mymodel==0 or count_prebuilt==0:
        empty_word2vec.append(i)

In [14]:
df_labeled.drop(empty_word2vec, inplace=True)
df_labeled.reset_index(drop=True, inplace=True)

In [15]:
df_binary=df_labeled[df_labeled["star_rating"]!=2.0]

In [16]:
df_mymodel_ternary = pd.DataFrame(data=list_mymodel)
df_prebuilt_ternary = pd.DataFrame(data=list_prebuilt)

In [17]:
# Drop Infinite values
df_prebuilt_ternary.replace([np.inf, -np.inf], np.nan, inplace=True)
df_mymodel_ternary.replace([np.inf, -np.inf], np.nan, inplace=True)

# Drop null values
df_prebuilt_ternary.dropna(inplace=True)
df_mymodel_ternary.dropna(inplace=True)

In [18]:
df_mymodel_binary=df_mymodel_ternary[df_mymodel_ternary[300]!=2.0]
df_prebuilt_binary=df_prebuilt_ternary[df_prebuilt_ternary[300]!=2.0]

##  train-test split

In [19]:
#tf-idf
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(df_binary["review_body"], df_binary["star_rating"], test_size=0.2, random_state=42)

In [20]:
#mymodel
X_train_mymodel_binary, X_test_mymodel_binary, y_train_mymodel_binary, y_test_mymodel_binary = train_test_split(df_mymodel_binary.iloc[:,:-1].values, df_mymodel_binary.iloc[:,-1].values, test_size=0.2, random_state=42)

In [21]:
#prebuilt
X_train_prebuilt, X_test_prebuilt, y_train_prebuilt, y_test_prebuilt = train_test_split(df_prebuilt_binary.iloc[:,:-1].values, df_prebuilt_binary.iloc[:,-1].values, test_size=0.2, random_state=42)

In [22]:
#mymodel_ternary
X_train_mymodel_ternary, X_test_mymodel_ternary, y_train_mymodel_ternary, y_test_mymodel_ternary = train_test_split(df_mymodel_ternary.iloc[:,:-1].values, df_mymodel_ternary.iloc[:,-1].values, test_size=0.2, random_state=42)

In [23]:
#prebuilt_ternary
X_train_prebuilt_ternary, X_test_prebuilt_ternary, y_train_prebuilt_ternary, y_test_prebuilt_ternary = train_test_split(df_prebuilt_ternary.iloc[:,:-1].values, df_prebuilt_ternary.iloc[:,-1].values, test_size=0.2, random_state=42)

##  TF-IDF feature extraction

In [24]:
vectorizer = TfidfVectorizer()
X_train_tfidf= vectorizer.fit_transform(X_train_tfidf)
X_test_tfidf=vectorizer.transform(X_test_tfidf)

# Perceptron

In [25]:
#tfidf
clf_tfidf = Perceptron(tol=1e-3, random_state=0)
clf_tfidf.fit(X_train_tfidf, y_train_tfidf)
y_pred_tfidf=clf_tfidf.predict(X_test_tfidf)

#mymodel
clf_mymodel = Perceptron(tol=1e-3, random_state=0)
clf_mymodel.fit(X_train_mymodel_binary, y_train_mymodel_binary)
y_pred_mymodel_binary=clf_mymodel.predict(X_test_mymodel_binary)

#prebuilt
clf_prebuilt = Perceptron(tol=1e-3, random_state=0)
clf_prebuilt.fit(X_train_prebuilt, y_train_prebuilt)
y_pred_prebuilt=clf_prebuilt.predict(X_test_prebuilt)

In [28]:
print("----------Perceptron----------")

print("----------TF-IDF----------")
cl_report_tfidf=classification_report(y_test_tfidf, y_pred_tfidf, output_dict=True)
print(classification_report(y_test_tfidf, y_pred_tfidf))

print("----------My Model----------")
cl_report_mymodel=classification_report(y_test_mymodel_binary, y_pred_mymodel_binary, output_dict=True)
print(classification_report(y_test_mymodel_binary, y_pred_mymodel_binary))

print("----------Pre-Built----------")
cl_report_prebuilt=classification_report(y_test_prebuilt, y_pred_prebuilt, output_dict=True)
print(classification_report(y_test_prebuilt, y_pred_prebuilt))

----------Perceptron----------
----------TF-IDF----------
              precision    recall  f1-score   support

         0.0       0.82      0.84      0.83     19973
         1.0       0.84      0.81      0.82     19982

    accuracy                           0.83     39955
   macro avg       0.83      0.83      0.83     39955
weighted avg       0.83      0.83      0.83     39955

----------My Model----------
              precision    recall  f1-score   support

         0.0       0.77      0.82      0.80     20004
         1.0       0.81      0.76      0.78     19953

    accuracy                           0.79     39957
   macro avg       0.79      0.79      0.79     39957
weighted avg       0.79      0.79      0.79     39957

----------Pre-Built----------
              precision    recall  f1-score   support

         0.0       0.64      0.95      0.77     19965
         1.0       0.90      0.47      0.62     20002

    accuracy                           0.71     39967
   macro av

In [136]:
#tfidf
clf_tfidf = LinearSVC(random_state=0, tol=1e-5, max_iter=100)
clf_tfidf.fit(X_train_tfidf, y_train_tfidf)
y_pred_tfidf=clf_tfidf.predict(X_test_tfidf)

# #mymodel
clf_mymodel = LinearSVC(random_state=0, tol=1e-5)
clf_mymodel.fit(X_train_mymodel_binary, y_train_mymodel_binary)
y_pred_mymodel_binary=clf_mymodel.predict(X_test_mymodel_binary)

# #prebuilt
clf_prebuilt = LinearSVC(random_state=0, tol=1e-5)
clf_prebuilt.fit(X_train_prebuilt, y_train_prebuilt)
y_pred_prebuilt=clf_prebuilt.predict(X_test_prebuilt)

In [135]:
print("----------SVM----------")

print("----------TF-IDF----------")
cl_report_tfidf=classification_report(y_test_tfidf, y_pred_tfidf, output_dict=True)
print(classification_report(y_test_tfidf, y_pred_tfidf))

print("----------My Model----------")
cl_report_mymodel=classification_report(y_test_mymodel_binary, y_pred_mymodel_binary, output_dict=True)
print(classification_report(y_test_mymodel_binary, y_pred_mymodel_binary))

print("----------Pre-Built----------")
cl_report_prebuilt=classification_report(y_test_prebuilt, y_pred_prebuilt, output_dict=True)
print(classification_report(y_test_prebuilt, y_pred_prebuilt))

----------SVM----------
----------TF-IDF----------
              precision    recall  f1-score   support

         0.0       0.87      0.88      0.87     19973
         1.0       0.87      0.87      0.87     19982

    accuracy                           0.87     39955
   macro avg       0.87      0.87      0.87     39955
weighted avg       0.87      0.87      0.87     39955

----------My Model----------
              precision    recall  f1-score   support

         0.0       0.83      0.85      0.84     20004
         1.0       0.85      0.83      0.84     19953

    accuracy                           0.84     39957
   macro avg       0.84      0.84      0.84     39957
weighted avg       0.84      0.84      0.84     39957

----------Pre-Built----------
              precision    recall  f1-score   support

         0.0       0.81      0.84      0.82     19965
         1.0       0.83      0.80      0.81     20002

    accuracy                           0.82     39967
   macro avg      

# Task 4 : Feedforward Neural Networks

## Task 4A

In [30]:
class myDataset(data.Dataset):
    def __init__(self, features, labels):
        self.features=features
        self.labels=labels
        self.len = features.shape[0]
    def __len__(self):
        return self.len
    def __getitem__(self, index):
        row=self.features[index,:]
        row_label=self.labels[index]
        return row,row_label

In [47]:
def train_model(training_generator, model, n_epochs, lr):
    # specify loss function (categorical cross-entropy)
    criterion = nn.CrossEntropyLoss()

# specify optimizer (stochastic gradient descent) and learning rate = 0.01
    optimizer = torch.optim.SGD(model.parameters(), lr)
    loss_values = []
    for epoch in range(n_epochs):
        train_loss = 0.0
        running_loss = 0.0

        model.train() # prep model for training
        for local_data, target in training_generator:

            optimizer.zero_grad()

            output = model(local_data)

            loss = criterion(output, target.type(torch.LongTensor))

            loss.backward()

            optimizer.step()

            train_loss += loss.item()*local_data.size(0)
        
        train_loss = train_loss/len(training_generator.dataset)

        print('Epoch: {} \tTraining Loss: {:.6f}'.format(
            epoch+1, 
            train_loss,
            ))
        loss_values.append(train_loss)
    return model

### Binary model - My Model

In [48]:

# define the NN architecture
class Net_bin(nn.Module):
    def __init__(self, output_size):
        super(Net_bin, self).__init__()
        # number of hidden nodes in each layer (512)
        hidden_1 = 50
        hidden_2 = 10
      
        self.fc1 = nn.Linear(300, hidden_1)
        # linear layer (n_hidden -> hidden_2)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
        # linear layer (n_hidden -> 10)
        self.fc3 = nn.Linear(hidden_2, output_size)
        # dropout layer (p=0.2)
        # dropout prevents overfitting of data
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        # flatten image input
        x = x.type(torch.FloatTensor)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc2(x))
        # add dropout layer
        x = self.dropout(x)
        # add output layer
        x = self.fc3(x)
        return x

# initialize the NN
model_binary = Net_bin(output_size=2)
print(model_binary)


Net_bin(
  (fc1): Linear(in_features=300, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [49]:
training_set=myDataset(X_train_mymodel_binary, y_train_mymodel_binary)
training_generator = data.DataLoader(training_set, batch_size=512, shuffle=True)

In [34]:
model_binary=train_model(training_generator, model_binary, n_epochs=20, lr=0.01)

Epoch: 1 	Training Loss: 0.682409
Epoch: 2 	Training Loss: 0.613176
Epoch: 3 	Training Loss: 0.524642
Epoch: 4 	Training Loss: 0.481360
Epoch: 5 	Training Loss: 0.457034
Epoch: 6 	Training Loss: 0.439227
Epoch: 7 	Training Loss: 0.426178
Epoch: 8 	Training Loss: 0.415912
Epoch: 9 	Training Loss: 0.409563
Epoch: 10 	Training Loss: 0.402036
Epoch: 11 	Training Loss: 0.398236
Epoch: 12 	Training Loss: 0.393365
Epoch: 13 	Training Loss: 0.390367
Epoch: 14 	Training Loss: 0.387317
Epoch: 15 	Training Loss: 0.384438
Epoch: 16 	Training Loss: 0.382616
Epoch: 17 	Training Loss: 0.381207
Epoch: 18 	Training Loss: 0.379443
Epoch: 19 	Training Loss: 0.377797
Epoch: 20 	Training Loss: 0.376551


In [35]:
_, predictions = torch.max(model_binary(torch.Tensor(X_test_mymodel_binary)),1)
predictions=predictions.numpy()

In [36]:
cl_report_prebuilt=classification_report(y_test_mymodel_binary, predictions, output_dict=True)
print(classification_report(y_test_mymodel_binary, predictions))

              precision    recall  f1-score   support

         0.0       0.84      0.83      0.84     20004
         1.0       0.83      0.84      0.84     19953

    accuracy                           0.84     39957
   macro avg       0.84      0.84      0.84     39957
weighted avg       0.84      0.84      0.84     39957



### Binary model - Prebuilt

In [55]:
# initialize the NN
model_binary = Net_bin(output_size=2)
print(model_binary)

Net_bin(
  (fc1): Linear(in_features=300, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [56]:
training_set=myDataset(X_train_prebuilt, y_train_prebuilt)
training_generator = data.DataLoader(training_set, batch_size=512, shuffle=True)

In [58]:
model_binary=train_model(training_generator, model_binary, n_epochs=20, lr=0.01)

Epoch: 1 	Training Loss: 0.693409
Epoch: 2 	Training Loss: 0.693032
Epoch: 3 	Training Loss: 0.692660
Epoch: 4 	Training Loss: 0.691964
Epoch: 5 	Training Loss: 0.691203
Epoch: 6 	Training Loss: 0.689850
Epoch: 7 	Training Loss: 0.688258
Epoch: 8 	Training Loss: 0.685967
Epoch: 9 	Training Loss: 0.682848
Epoch: 10 	Training Loss: 0.678540
Epoch: 11 	Training Loss: 0.672079
Epoch: 12 	Training Loss: 0.663336
Epoch: 13 	Training Loss: 0.650485
Epoch: 14 	Training Loss: 0.633019
Epoch: 15 	Training Loss: 0.611667
Epoch: 16 	Training Loss: 0.587206
Epoch: 17 	Training Loss: 0.563649
Epoch: 18 	Training Loss: 0.544071
Epoch: 19 	Training Loss: 0.528583
Epoch: 20 	Training Loss: 0.514886


In [59]:
_, predictions = torch.max(model_binary(torch.Tensor(X_test_prebuilt)),1)
predictions=predictions.numpy()

In [60]:
cl_report_prebuilt=classification_report(y_test_prebuilt, predictions, output_dict=True)
print(classification_report(y_test_prebuilt, predictions))

              precision    recall  f1-score   support

         0.0       0.75      0.77      0.76     19965
         1.0       0.77      0.75      0.76     20002

    accuracy                           0.76     39967
   macro avg       0.76      0.76      0.76     39967
weighted avg       0.76      0.76      0.76     39967



### Ternary model - My Model

In [76]:
# initialize the NN
model_ternary = Net_bin(output_size=3)
print(model_ternary)

Net_bin(
  (fc1): Linear(in_features=300, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [77]:
training_set=myDataset(X_train_mymodel_ternary, y_train_mymodel_ternary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [78]:
model_ternary=train_model(training_generator, model_ternary, n_epochs=50, lr=0.01)

Epoch: 1 	Training Loss: 0.883107
Epoch: 2 	Training Loss: 0.786406
Epoch: 3 	Training Loss: 0.769730
Epoch: 4 	Training Loss: 0.758912
Epoch: 5 	Training Loss: 0.752620
Epoch: 6 	Training Loss: 0.746356
Epoch: 7 	Training Loss: 0.742579
Epoch: 8 	Training Loss: 0.738241
Epoch: 9 	Training Loss: 0.734611
Epoch: 10 	Training Loss: 0.731602
Epoch: 11 	Training Loss: 0.729157
Epoch: 12 	Training Loss: 0.726867
Epoch: 13 	Training Loss: 0.723501
Epoch: 14 	Training Loss: 0.723181
Epoch: 15 	Training Loss: 0.719598
Epoch: 16 	Training Loss: 0.718858
Epoch: 17 	Training Loss: 0.716493
Epoch: 18 	Training Loss: 0.716961
Epoch: 19 	Training Loss: 0.714315
Epoch: 20 	Training Loss: 0.712818
Epoch: 21 	Training Loss: 0.711524
Epoch: 22 	Training Loss: 0.710509
Epoch: 23 	Training Loss: 0.710020
Epoch: 24 	Training Loss: 0.708535
Epoch: 25 	Training Loss: 0.708730
Epoch: 26 	Training Loss: 0.707423
Epoch: 27 	Training Loss: 0.706796
Epoch: 28 	Training Loss: 0.705185
Epoch: 29 	Training Loss: 0.7

In [79]:
_, predictions = torch.max(model_ternary(torch.Tensor(X_test_mymodel_ternary)),1)
predictions=predictions.numpy()

In [80]:
cl_report_prebuilt=classification_report(y_test_mymodel_ternary, predictions, output_dict=True)
print(classification_report(y_test_mymodel_ternary, predictions))

              precision    recall  f1-score   support

         0.0       0.69      0.83      0.75     19892
         1.0       0.73      0.83      0.77     20153
         2.0       0.44      0.14      0.21      9902

    accuracy                           0.69     49947
   macro avg       0.62      0.60      0.58     49947
weighted avg       0.65      0.69      0.65     49947



### Ternary model - Prebuilt

In [81]:
# initialize the NN
model_ternary = Net_bin(output_size=3)
print(model_ternary)

Net_bin(
  (fc1): Linear(in_features=300, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [82]:
training_set=myDataset(X_train_prebuilt_ternary, y_train_prebuilt_ternary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [83]:
model_ternary=train_model(training_generator, model_ternary, n_epochs=50, lr=0.01)

Epoch: 1 	Training Loss: 1.057546
Epoch: 2 	Training Loss: 1.053613
Epoch: 3 	Training Loss: 1.045289
Epoch: 4 	Training Loss: 0.993860
Epoch: 5 	Training Loss: 0.922670
Epoch: 6 	Training Loss: 0.879874
Epoch: 7 	Training Loss: 0.859903
Epoch: 8 	Training Loss: 0.847216
Epoch: 9 	Training Loss: 0.837847
Epoch: 10 	Training Loss: 0.830215
Epoch: 11 	Training Loss: 0.825294
Epoch: 12 	Training Loss: 0.822504
Epoch: 13 	Training Loss: 0.819849
Epoch: 14 	Training Loss: 0.817067
Epoch: 15 	Training Loss: 0.814475
Epoch: 16 	Training Loss: 0.813459
Epoch: 17 	Training Loss: 0.811938
Epoch: 18 	Training Loss: 0.811020
Epoch: 19 	Training Loss: 0.808682
Epoch: 20 	Training Loss: 0.808509
Epoch: 21 	Training Loss: 0.806614
Epoch: 22 	Training Loss: 0.805619
Epoch: 23 	Training Loss: 0.804396
Epoch: 24 	Training Loss: 0.803214
Epoch: 25 	Training Loss: 0.801834
Epoch: 26 	Training Loss: 0.799392
Epoch: 27 	Training Loss: 0.798673
Epoch: 28 	Training Loss: 0.796322
Epoch: 29 	Training Loss: 0.7

In [84]:
_, predictions = torch.max(model_ternary(torch.Tensor(X_test_prebuilt_ternary)),1)
predictions=predictions.numpy()

In [85]:
cl_report_prebuilt=classification_report(y_test_prebuilt_ternary, predictions, output_dict=True)
print(classification_report(y_test_prebuilt_ternary, predictions))

              precision    recall  f1-score   support

         0.0       0.66      0.82      0.73     19875
         1.0       0.71      0.79      0.75     20101
         2.0       0.42      0.12      0.19      9984

    accuracy                           0.67     49960
   macro avg       0.60      0.58      0.55     49960
weighted avg       0.63      0.67      0.63     49960



## Task 4b

### Data Preprocessing

In [101]:
list_mymodel=[]
temp_tensor=torch.ones(10,300)
for i in range(len(df_labeled)):
    count=0
    for word in nltk.word_tokenize(df_labeled.iloc[i,0]):

            
        try:
            word_emb = np.asarray(model_w2v.wv[word])
            if count==0:
                list_temp=word_emb.reshape(1,300)
            else:
                list_temp=np.vstack((list_temp,word_emb))
            count+=1
        except KeyError:
            pass

        
        
        if count==10:
            break

    list_mymodel.append(pad_sequence([temp_tensor,torch.tensor(list_temp)], True)[1].T)


In [102]:
list_train_mymodel_ternary = df_labeled.sample(frac = 0.8).index
list_test_mymodel_ternary = df_labeled.drop(list_train_mymodel_ternary).index

In [86]:
list_prebuilt=[]
temp_tensor=torch.ones(10,300)
for i in range(len(df_labeled)):
    count=0
    for word in nltk.word_tokenize(df_labeled.iloc[i,0]):

            
        try:
            word_emb = np.asarray(wv[word])
            if count==0:
                list_temp=word_emb.reshape(1,300)
            else:
                list_temp=np.vstack((list_temp,word_emb))
            count+=1
        except KeyError:
            pass

        
        
        if count==10:
            break

    list_prebuilt.append(pad_sequence([temp_tensor,torch.tensor(list_temp)], True)[1].T)


In [87]:
list_train_prebuilt_ternary = df_labeled.sample(frac = 0.8).index
list_test_prebuilt_ternary = df_labeled.drop(list_train_prebuilt_ternary).index

In [88]:
from torch.nn.utils.rnn import pad_sequence
class myDataset(data.Dataset):
    def __init__(self, list_model, df_labels, features_index_list):
        self.features_index_list=features_index_list
        self.df_labels=df_labels
        self.list_model=list_model
        self.len = len(self.features_index_list)
    def __len__(self):
        return self.len
    def __getitem__(self, index):
        row=self.list_model[self.features_index_list[index]]
        row_label=self.df_labels[self.features_index_list[index]]
        return row,row_label

### Ternary Model - My Model

In [121]:
# define the NN architecture
class Net_ter(nn.Module):
    def __init__(self, output_size):
        super(Net_ter, self).__init__()
        hidden_1 = 50
        hidden_2 = 10
      
        self.fc1 = nn.Linear(300*10, hidden_1)
        # linear layer (n_hidden -> hidden_2)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
        # linear layer (n_hidden -> 10)
        self.fc3 = nn.Linear(hidden_2, output_size)
        # dropout layer (p=0.2)
        # dropout prevents overfitting of data
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        # flatten image input
        x = x.type(torch.FloatTensor).reshape(-1,300*10)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc2(x))
        # add dropout layer
        x = self.dropout(x)
        # add output layer
        x = self.fc3(x)
        return x

# initialize the NN
model_ternary = Net_ter(output_size=3)
print(model_ternary)


Net_ter(
  (fc1): Linear(in_features=3000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [122]:
training_set=myDataset(list_mymodel, df_labeled.iloc[:,-1], list_train_mymodel_ternary)
training_generator = data.DataLoader(training_set, batch_size=512, shuffle=True)

In [123]:
model_ternary=train_model(training_generator, model_ternary, n_epochs=20, lr=0.01)

Epoch: 1 	Training Loss: 1.002745
Epoch: 2 	Training Loss: 0.929227
Epoch: 3 	Training Loss: 0.901097
Epoch: 4 	Training Loss: 0.883688
Epoch: 5 	Training Loss: 0.869640
Epoch: 6 	Training Loss: 0.859142
Epoch: 7 	Training Loss: 0.850714
Epoch: 8 	Training Loss: 0.843251
Epoch: 9 	Training Loss: 0.836320
Epoch: 10 	Training Loss: 0.830435
Epoch: 11 	Training Loss: 0.825203
Epoch: 12 	Training Loss: 0.817989
Epoch: 13 	Training Loss: 0.814520
Epoch: 14 	Training Loss: 0.810111
Epoch: 15 	Training Loss: 0.804028
Epoch: 16 	Training Loss: 0.800475
Epoch: 17 	Training Loss: 0.797352
Epoch: 18 	Training Loss: 0.791817
Epoch: 19 	Training Loss: 0.786255
Epoch: 20 	Training Loss: 0.783807


In [124]:
test_mymodel_ternary=[list_mymodel[i] for i in list_test_mymodel_ternary]

In [125]:
y_pred_ter=[]
for test_batch in test_mymodel_ternary:
    _, predictions = torch.max(model_ternary(test_batch),1)
    predictions=predictions.numpy()
    y_pred_ter.append(predictions)

In [126]:
cl_report_ter=classification_report(df_labeled.iloc[list_test_mymodel_ternary,-1], y_pred_ter, output_dict=True)
print(classification_report(df_labeled.iloc[list_test_mymodel_ternary,-1], y_pred_ter))

              precision    recall  f1-score   support

         0.0       0.63      0.75      0.69     19990
         1.0       0.65      0.76      0.70     19933
         2.0       0.42      0.13      0.20     10021

    accuracy                           0.63     49944
   macro avg       0.57      0.55      0.53     49944
weighted avg       0.60      0.63      0.59     49944



### Ternary Model - Prebuilt

In [127]:
training_set=myDataset(list_prebuilt, df_labeled.iloc[:,-1], list_train_prebuilt_ternary)
training_generator = data.DataLoader(training_set, batch_size=128, shuffle=True)

In [128]:
# initialize the NN
model_ternary = Net_ter(output_size=3)
print(model_ternary)

Net_ter(
  (fc1): Linear(in_features=3000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [129]:
model_ternary=train_model(training_generator, model_ternary, n_epochs=20, lr=0.01)

Epoch: 1 	Training Loss: 1.032913
Epoch: 2 	Training Loss: 0.947315
Epoch: 3 	Training Loss: 0.908351
Epoch: 4 	Training Loss: 0.889106
Epoch: 5 	Training Loss: 0.877799
Epoch: 6 	Training Loss: 0.867469
Epoch: 7 	Training Loss: 0.860476
Epoch: 8 	Training Loss: 0.854139
Epoch: 9 	Training Loss: 0.848138
Epoch: 10 	Training Loss: 0.843173
Epoch: 11 	Training Loss: 0.838382
Epoch: 12 	Training Loss: 0.833776
Epoch: 13 	Training Loss: 0.828013
Epoch: 14 	Training Loss: 0.824496
Epoch: 15 	Training Loss: 0.819308
Epoch: 16 	Training Loss: 0.814039
Epoch: 17 	Training Loss: 0.810289
Epoch: 18 	Training Loss: 0.804642
Epoch: 19 	Training Loss: 0.799210
Epoch: 20 	Training Loss: 0.793725


In [130]:
test_prebuilt_ternary=[list_prebuilt[i] for i in list_test_prebuilt_ternary]

In [131]:
y_pred_ter=[]
for test_batch in test_prebuilt_ternary:
    _, predictions = torch.max(model_ternary(test_batch),1)
    predictions=predictions.numpy()
    y_pred_ter.append(predictions)

In [132]:
cl_report_ter=classification_report(df_labeled.iloc[list_test_prebuilt_ternary,-1], y_pred_ter, output_dict=True)
print(classification_report(df_labeled.iloc[list_test_prebuilt_ternary,-1], y_pred_ter))

              precision    recall  f1-score   support

         0.0       0.62      0.75      0.68     19957
         1.0       0.65      0.74      0.69     20052
         2.0       0.41      0.13      0.19      9935

    accuracy                           0.62     49944
   macro avg       0.56      0.54      0.52     49944
weighted avg       0.59      0.62      0.59     49944



### Binary Model - My Model

In [58]:
df_binary_indexed=df_labeled.drop(df_labeled[df_labeled['star_rating']==2].index, axis=0)

In [59]:
list_train_mymodel_binary = df_binary_indexed.sample(frac = 0.8).index
list_test_mymodel_binary = df_binary_indexed.drop(list_train_mymodel_binary).index

In [60]:
training_set=myDataset(list_mymodel, df_binary_indexed.iloc[:,-1], list_train_mymodel_binary)
training_generator = data.DataLoader(training_set, batch_size=512, shuffle=True)

In [61]:
# initialize the NN
model_binary = Net_ter(output_size=2)
print(model_binary)


Net_bin(
  (fc1): Linear(in_features=3000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [62]:
model_binary=train_model(training_generator, model_binary, n_epochs=20)

Epoch: 1 	Training Loss: 0.645827
Epoch: 2 	Training Loss: 0.546711
Epoch: 3 	Training Loss: 0.515771
Epoch: 4 	Training Loss: 0.496976
Epoch: 5 	Training Loss: 0.483101
Epoch: 6 	Training Loss: 0.474836
Epoch: 7 	Training Loss: 0.465672
Epoch: 8 	Training Loss: 0.459892
Epoch: 9 	Training Loss: 0.453065
Epoch: 10 	Training Loss: 0.448516
Epoch: 11 	Training Loss: 0.443470
Epoch: 12 	Training Loss: 0.438487
Epoch: 13 	Training Loss: 0.433792
Epoch: 14 	Training Loss: 0.428711
Epoch: 15 	Training Loss: 0.424195
Epoch: 16 	Training Loss: 0.420590
Epoch: 17 	Training Loss: 0.415587
Epoch: 18 	Training Loss: 0.411432
Epoch: 19 	Training Loss: 0.407033
Epoch: 20 	Training Loss: 0.403376


In [63]:
test_mymodel_binary=[list_mymodel[i] for i in list_test_mymodel_binary]

In [64]:
y_pred_bin=[]
for test_batch in test_mymodel_binary:
    _, predictions = torch.max(model_binary(test_batch),1)
    predictions=predictions.numpy()
    y_pred_bin.append(predictions)

In [65]:
cl_report_bin=classification_report(df_binary_indexed.loc[list_test_mymodel_binary]['star_rating'], y_pred_bin, output_dict=True)
print(classification_report(df_binary_indexed.loc[list_test_mymodel_binary]['star_rating'], y_pred_bin))

              precision    recall  f1-score   support

         0.0       0.77      0.79      0.78     19879
         1.0       0.79      0.77      0.78     20076

    accuracy                           0.78     39955
   macro avg       0.78      0.78      0.78     39955
weighted avg       0.78      0.78      0.78     39955



### Binary Model - Prebuilt

In [368]:
df_binary_indexed=df_labeled.drop(df_labeled[df_labeled['star_rating']==2].index, axis=0)

In [370]:
list_train_prebuilt_binary = df_binary_indexed.sample(frac = 0.8).index
list_test_prebuilt_binary = df_binary_indexed.drop(list_train_prebuilt_binary).index

In [372]:
training_set=myDataset(list_prebuilt, df_binary_indexed.iloc[:,-1], list_train_prebuilt_binary)
training_generator = data.DataLoader(training_set, batch_size=512, shuffle=True)

In [373]:
# initialize the NN
model_binary = Net_ter(output_size=2)
print(model_binary)


Net_bin(
  (fc1): Linear(in_features=3000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [374]:
model_binary=train_model(training_generator, model_binary, n_epochs=20)

Epoch: 1 	Training Loss: 0.692747
Epoch: 2 	Training Loss: 0.689540
Epoch: 3 	Training Loss: 0.680256
Epoch: 4 	Training Loss: 0.661752
Epoch: 5 	Training Loss: 0.630659
Epoch: 6 	Training Loss: 0.595139
Epoch: 7 	Training Loss: 0.567392
Epoch: 8 	Training Loss: 0.550157
Epoch: 9 	Training Loss: 0.539487
Epoch: 10 	Training Loss: 0.530948
Epoch: 11 	Training Loss: 0.523500
Epoch: 12 	Training Loss: 0.518405
Epoch: 13 	Training Loss: 0.513158
Epoch: 14 	Training Loss: 0.509051
Epoch: 15 	Training Loss: 0.504877
Epoch: 16 	Training Loss: 0.500806
Epoch: 17 	Training Loss: 0.498415
Epoch: 18 	Training Loss: 0.495142
Epoch: 19 	Training Loss: 0.492861
Epoch: 20 	Training Loss: 0.490396


In [377]:
test_prebuilt_binary=[list_prebuilt[i] for i in list_test_prebuilt_binary]

In [378]:
y_pred_bin=[]
for test_batch in test_prebuilt_binary:
    _, predictions = torch.max(model_binary(test_batch),1)
    predictions=predictions.numpy()
    y_pred_bin.append(predictions)

In [379]:
cl_report_bin=classification_report(df_binary_indexed.loc[list_test_prebuilt_binary]['star_rating'], y_pred_bin, output_dict=True)
print(classification_report(df_binary_indexed.loc[list_test_prebuilt_binary]['star_rating'], y_pred_bin))

              precision    recall  f1-score   support

         0.0       0.75      0.79      0.77     19967
         1.0       0.77      0.74      0.76     19988

    accuracy                           0.76     39955
   macro avg       0.76      0.76      0.76     39955
weighted avg       0.76      0.76      0.76     39955



# Task 5 : Recurrent Neural Networks

## Task 5a

In [157]:
class myDataset(data.Dataset):
    def __init__(self, features, labels):
        self.features=features
        self.labels=labels
        self.len = features.shape[0]
    def __len__(self):
        return self.len
    def __getitem__(self, index):
#         row_tensor=self.features[index]
  
        
        
        list_rnn=[]
        count=0
        temp_tensor=torch.ones(50,300)
        for word in nltk.word_tokenize(self.features[index]):
            
            try:
                word_emb = np.asarray(model_w2v.wv[word])
                if count==0:
                    list_rnn=word_emb
                else:
                    list_rnn=np.append([list_rnn],word_emb)
                count+=1
            except KeyError:
                pass



            if count==50:
                break

        
        if len(list_rnn)!=0:
            return pad_sequence([temp_tensor,torch.tensor(list_rnn.reshape(-1,300))], True)[1],self.labels[index], count-1
#         else:
#             return torch.zeros(50,300),self.labels[index]

In [8]:
class prebuiltDataset(data.Dataset):
    def __init__(self, features, labels):
        self.features=features
        self.labels=labels
        self.len = features.shape[0]
    def __len__(self):
        return self.len
    def __getitem__(self, index):
#         row_tensor=self.features[index]
  
        
        
        list_rnn=[]
        count=0
        temp_tensor=torch.ones(50,300)
        for word in nltk.word_tokenize(self.features[index]):
            
            try:
                word_emb = np.asarray(wv[word])
                if count==0:
                    list_rnn=word_emb
                else:
                    list_rnn=np.append([list_rnn],word_emb)
                count+=1
            except KeyError:
                pass



            if count==50:
                break

        
        if len(list_rnn)!=0:
            return pad_sequence([temp_tensor,torch.tensor(list_rnn.reshape(-1,300))], True)[1],self.labels[index], count-1
#         else:
#             return torch.zeros(50,300),self.labels[index]

### Ternary Model - My Model

In [27]:
#rnn_ternary
X_train_rnn_ternary, X_test_rnn_ternary, y_train_rnn_ternary, y_test_rnn_ternary = train_test_split(df_labeled.iloc[:,0].values, df_labeled.iloc[:,-1].values, test_size=0.2, random_state=42)

In [28]:
import torch.nn as nn

class Model_RNN(nn.Module):
    
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        
        super(Model_RNN, self).__init__()

        # Defining some parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        #Defining the layers
        # RNN Layer
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)   
        
        #dense layer
        self.fc = nn.Linear(hidden_dim, output_size)
        
        self.softmax = nn.LogSoftmax(dim=1)
        #activation function
#         self.act = nn.Sigmoid()
        
    def forward(self, x, x_length):
   
        batch_size = x.size(0)

        #Initializing hidden state for first input using method defined below
        hidden = self.init_hidden(batch_size)

        # Passing in the input and hidden state into the model and obtaining outputs
        out, hidden = self.rnn(x, hidden)

        out = out[torch.arange(batch_size),x_length,:].reshape(1,batch_size,self.hidden_dim)
        dense_outputs=self.fc(out)
        

        output = self.softmax(dense_outputs)
        return output
    
    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we'll use in the forward pass
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim)
         # We'll send the tensor holding the hidden state to the device we specified earlier as well
        return hidden

In [29]:
# Instantiate the model with hyperparameters
model = Model_RNN(input_size=300, output_size=3, hidden_dim=50, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
# model = model.to(device)

# Define hyperparameters
# n_epochs = 2
lr=0.0001

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [30]:
def train(model, iterator, optimizer, criterion):
    
    #initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    #set the model in training phase
    model.train()  
    for x_input, y_label, x_length in iterator:
        
        #resets the gradients after every batch
        optimizer.zero_grad()   
        
        output= model(x_input, x_length).squeeze()
  
        #compute the loss
        loss = criterion(output, y_label.long())        
        
        #compute the binary accuracy
#         acc = binary_accuracy(output, batch.label)   
        
        #backpropage the loss and compute the gradients
        loss.backward()     
        
        #clip gradient, to prevent from exploding
        nn.utils.clip_grad_norm_(model.parameters(), 0.7)
        
        #update the weights
        optimizer.step()      
        
        #loss and accuracy
        epoch_loss += loss.item()  
#         epoch_acc += acc.item()    
        

        

    return epoch_loss / len(iterator)
# , epoch_acc / len(iterator)

In [222]:
training_set=myDataset(X_train_rnn_ternary, y_train_rnn_ternary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [279]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
#     train_loss, train_acc = train(model, training_generator, optimizer, criterion)
    train_loss= train(model, training_generator, optimizer, criterion)
    
#     #evaluate the model
#     valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
#     #save the best model
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'Epoch: {epoch}\tTrain Loss: {train_loss:.3f}') 
#           | Train Acc: {train_acc*100:.2f}%')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 0	Train Loss: 0.678
Epoch: 1	Train Loss: 0.479
Epoch: 2	Train Loss: 0.450
Epoch: 3	Train Loss: 0.435
Epoch: 4	Train Loss: 0.423
Epoch: 5	Train Loss: 0.414
Epoch: 6	Train Loss: 0.407
Epoch: 7	Train Loss: 0.399
Epoch: 8	Train Loss: 0.393
Epoch: 9	Train Loss: 0.388
Epoch: 10	Train Loss: 0.381
Epoch: 11	Train Loss: 0.376
Epoch: 12	Train Loss: 0.371
Epoch: 13	Train Loss: 0.367
Epoch: 14	Train Loss: 0.363
Epoch: 15	Train Loss: 0.360
Epoch: 16	Train Loss: 0.355
Epoch: 17	Train Loss: 0.352
Epoch: 18	Train Loss: 0.349
Epoch: 19	Train Loss: 0.346


In [280]:
testing_set=myDataset(X_test_rnn_ternary, y_test_rnn_ternary)
testing_generator = data.DataLoader(testing_set, batch_size=64, shuffle=False)

In [281]:
y_pred=[]
y_true=[]
for test_batch in testing_generator:
    pred=model(test_batch[0], test_batch[2]).squeeze().topk(1)[1].T[0].numpy()
    y_pred=np.append(y_pred,pred)
    y_true=np.append(y_true,test_batch[1].numpy())

In [282]:
from sklearn.metrics import classification_report
cl_report_bin=classification_report(y_pred, y_true, output_dict=True)
print(classification_report(y_pred, y_true))

              precision    recall  f1-score   support

         0.0       0.85      0.69      0.76     24334
         1.0       0.87      0.71      0.78     24742
         2.0       0.02      0.27      0.04       868

    accuracy                           0.69     49944
   macro avg       0.58      0.56      0.53     49944
weighted avg       0.84      0.69      0.76     49944



### Ternary Model - Prebuilt Model

In [31]:
training_set=prebuiltDataset(X_train_rnn_ternary, y_train_rnn_ternary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [29]:
# Instantiate the model with hyperparameters
model = Model_RNN(input_size=300, output_size=3, hidden_dim=50, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
# model = model.to(device)

# Define hyperparameters
# n_epochs = 2
lr=0.0001

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [33]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
#     train_loss, train_acc = train(model, training_generator, optimizer, criterion)
    train_loss= train(model, training_generator, optimizer, criterion)
    
#     #evaluate the model
#     valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
#     #save the best model
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'Epoch: {epoch}\tTrain Loss: {train_loss:.3f}') 
#           | Train Acc: {train_acc*100:.2f}%')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 0	Train Loss: 0.912
Epoch: 1	Train Loss: 0.835
Epoch: 2	Train Loss: 0.800
Epoch: 3	Train Loss: 0.790
Epoch: 4	Train Loss: 0.783
Epoch: 5	Train Loss: 0.776
Epoch: 6	Train Loss: 0.770
Epoch: 7	Train Loss: 0.765
Epoch: 8	Train Loss: 0.761
Epoch: 9	Train Loss: 0.757
Epoch: 10	Train Loss: 0.754
Epoch: 11	Train Loss: 0.750
Epoch: 12	Train Loss: 0.745
Epoch: 13	Train Loss: 0.741
Epoch: 14	Train Loss: 0.737
Epoch: 15	Train Loss: 0.734
Epoch: 16	Train Loss: 0.730
Epoch: 17	Train Loss: 0.726
Epoch: 18	Train Loss: 0.722
Epoch: 19	Train Loss: 0.719


In [34]:
testing_set=prebuiltDataset(X_test_rnn_ternary, y_test_rnn_ternary)
testing_generator = data.DataLoader(testing_set, batch_size=64, shuffle=False)

In [35]:
y_pred=[]
y_true=[]
for test_batch in testing_generator:
    pred=model(test_batch[0], test_batch[2]).squeeze().topk(1)[1].T[0].numpy()
    y_pred=np.append(y_pred,pred)
    y_true=np.append(y_true,test_batch[1].numpy())

In [36]:
from sklearn.metrics import classification_report
cl_report_bin=classification_report(y_pred, y_true, output_dict=True)
print(classification_report(y_pred, y_true))

              precision    recall  f1-score   support

         0.0       0.82      0.69      0.75     23299
         1.0       0.79      0.74      0.77     21569
         2.0       0.22      0.42      0.29      5076

    accuracy                           0.69     49944
   macro avg       0.61      0.62      0.60     49944
weighted avg       0.75      0.69      0.71     49944



### Binary Model - My Model

In [37]:
#rnn_ternary
X_train_rnn_binary, X_test_rnn_binary, y_train_rnn_binary, y_test_rnn_binary = train_test_split(df_binary.iloc[:,0].values, df_binary.iloc[:,-1].values, test_size=0.2, random_state=42)

In [38]:
# Instantiate the model with hyperparameters
model = Model_RNN(input_size=300, output_size=2, hidden_dim=50, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
# model = model.to(device)

# Define hyperparameters
# n_epochs = 2
lr=0.0001

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [244]:
training_set=myDataset(X_train_rnn_binary, y_train_rnn_binary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [246]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
#     train_loss, train_acc = train(model, training_generator, optimizer, criterion)
    train_loss= train(model, training_generator, optimizer, criterion)
    
#     #evaluate the model
#     valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
#     #save the best model
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'Epoch: {epoch}\tTrain Loss: {train_loss:.3f}') 
#           | Train Acc: {train_acc*100:.2f}%')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 0	Train Loss: 0.537
Epoch: 1	Train Loss: 0.425
Epoch: 2	Train Loss: 0.406
Epoch: 3	Train Loss: 0.393
Epoch: 4	Train Loss: 0.383
Epoch: 5	Train Loss: 0.377
Epoch: 6	Train Loss: 0.370
Epoch: 7	Train Loss: 0.364
Epoch: 8	Train Loss: 0.359
Epoch: 9	Train Loss: 0.356
Epoch: 10	Train Loss: 0.351
Epoch: 11	Train Loss: 0.348
Epoch: 12	Train Loss: 0.344
Epoch: 13	Train Loss: 0.341
Epoch: 14	Train Loss: 0.339
Epoch: 15	Train Loss: 0.335
Epoch: 16	Train Loss: 0.334
Epoch: 17	Train Loss: 0.330
Epoch: 18	Train Loss: 0.328
Epoch: 19	Train Loss: 0.326


In [247]:
testing_set=myDataset(X_test_rnn_binary, y_test_rnn_binary)
testing_generator = data.DataLoader(testing_set, batch_size=64, shuffle=False)

In [250]:
y_pred=[]
y_true=[]
for test_batch in testing_generator:
    pred=model(test_batch[0], test_batch[2]).squeeze().topk(1)[1].T[0].numpy()
    y_pred=np.append(y_pred,pred)
    y_true=np.append(y_true,test_batch[1].numpy())

In [251]:
from sklearn.metrics import classification_report
cl_report_bin=classification_report(y_pred, y_true, output_dict=True)
print(classification_report(y_pred, y_true))

              precision    recall  f1-score   support

         0.0       0.83      0.86      0.85     19207
         1.0       0.87      0.84      0.85     20748

    accuracy                           0.85     39955
   macro avg       0.85      0.85      0.85     39955
weighted avg       0.85      0.85      0.85     39955



### Binary Model - Prebuilt

In [40]:
training_set=prebuiltDataset(X_train_rnn_binary, y_train_rnn_binary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [41]:
# Instantiate the model with hyperparameters
model = Model_RNN(input_size=300, output_size=2, hidden_dim=50, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
# model = model.to(device)

# Define hyperparameters
# n_epochs = 2
lr=0.0001

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [42]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
#     train_loss, train_acc = train(model, training_generator, optimizer, criterion)
    train_loss= train(model, training_generator, optimizer, criterion)
    
#     #evaluate the model
#     valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
#     #save the best model
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'Epoch: {epoch}\tTrain Loss: {train_loss:.3f}') 
#           | Train Acc: {train_acc*100:.2f}%')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 0	Train Loss: 0.482
Epoch: 1	Train Loss: 0.427
Epoch: 2	Train Loss: 0.414
Epoch: 3	Train Loss: 0.401
Epoch: 4	Train Loss: 0.396
Epoch: 5	Train Loss: 0.390
Epoch: 6	Train Loss: 0.387
Epoch: 7	Train Loss: 0.382
Epoch: 8	Train Loss: 0.377
Epoch: 9	Train Loss: 0.371
Epoch: 10	Train Loss: 0.367
Epoch: 11	Train Loss: 0.361
Epoch: 12	Train Loss: 0.356
Epoch: 13	Train Loss: 0.351
Epoch: 14	Train Loss: 0.346
Epoch: 15	Train Loss: 0.342
Epoch: 16	Train Loss: 0.337
Epoch: 17	Train Loss: 0.333
Epoch: 18	Train Loss: 0.330
Epoch: 19	Train Loss: 0.327


In [43]:
testing_set=prebuiltDataset(X_test_rnn_binary, y_test_rnn_binary)
testing_generator = data.DataLoader(testing_set, batch_size=64, shuffle=False)

In [44]:
y_pred=[]
y_true=[]
for test_batch in testing_generator:
    pred=model(test_batch[0], test_batch[2]).squeeze().topk(1)[1].T[0].numpy()
    y_pred=np.append(y_pred,pred)
    y_true=np.append(y_true,test_batch[1].numpy())

In [45]:
from sklearn.metrics import classification_report
cl_report_bin=classification_report(y_pred, y_true, output_dict=True)
print(classification_report(y_pred, y_true))

              precision    recall  f1-score   support

         0.0       0.85      0.87      0.86     19555
         1.0       0.87      0.85      0.86     20400

    accuracy                           0.86     39955
   macro avg       0.86      0.86      0.86     39955
weighted avg       0.86      0.86      0.86     39955



## Task 5b - GRU

### Ternary Model - My Model

In [46]:
import torch.nn as nn

class Model_GRU(nn.Module):
    
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        
        super(Model_GRU, self).__init__()

        # Defining some parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        #Defining the layers
        # RNN Layer
        self.gru = nn.GRU(input_size, hidden_dim, n_layers, batch_first=True)   
        
        #dense layer
        self.fc = nn.Linear(hidden_dim, output_size)
        
        self.softmax = nn.LogSoftmax(dim=1)
        #activation function
#         self.act = nn.Sigmoid()
        
    def forward(self, x, x_length):
   
        batch_size = x.size(0)

        #Initializing hidden state for first input using method defined below
        hidden = self.init_hidden(batch_size)

        # Passing in the input and hidden state into the model and obtaining outputs
        out, hidden = self.gru(x, hidden)

        out = out[torch.arange(batch_size),x_length,:].reshape(1,batch_size,self.hidden_dim)
        dense_outputs=self.fc(out)
        

        output = self.softmax(dense_outputs)
        return output
    
    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we'll use in the forward pass
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim)
         # We'll send the tensor holding the hidden state to the device we specified earlier as well
        return hidden

In [47]:
def train(model, iterator, optimizer, criterion):
    
    #initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    #set the model in training phase
    model.train()  
    for x_input, y_label, x_length in iterator:
        
        #resets the gradients after every batch
        optimizer.zero_grad()   
        
        output= model(x_input, x_length).squeeze()
  
        #compute the loss
        loss = criterion(output, y_label.long())        
        
        #compute the binary accuracy
#         acc = binary_accuracy(output, batch.label)   
        
        #backpropage the loss and compute the gradients
        loss.backward()     
        
        #clip gradient, to prevent from exploding
        nn.utils.clip_grad_norm_(model.parameters(), 0.7)
        
        #update the weights
        optimizer.step()      
        
        #loss and accuracy
        epoch_loss += loss.item()  
#         epoch_acc += acc.item()    
        

        

    return epoch_loss / len(iterator)
# , epoch_acc / len(iterator)

In [260]:
# Instantiate the model with hyperparameters
model = Model_GRU(input_size=300, output_size=3, hidden_dim=50, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
# model = model.to(device)

# Define hyperparameters
# n_epochs = 2
lr=0.0001

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [258]:
training_set=myDataset(X_train_rnn_ternary, y_train_rnn_ternary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [262]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
#     train_loss, train_acc = train(model, training_generator, optimizer, criterion)
    train_loss= train(model, training_generator, optimizer, criterion)
    
#     #evaluate the model
#     valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
#     #save the best model
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'Epoch: {epoch}\tTrain Loss: {train_loss:.3f}') 
#           | Train Acc: {train_acc*100:.2f}%')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

	Train Loss: 0.895
	Train Loss: 0.751
	Train Loss: 0.711
	Train Loss: 0.690
	Train Loss: 0.677
	Train Loss: 0.667
	Train Loss: 0.658
	Train Loss: 0.651
	Train Loss: 0.645
	Train Loss: 0.639
	Train Loss: 0.634
	Train Loss: 0.630
	Train Loss: 0.626
	Train Loss: 0.622
	Train Loss: 0.618
	Train Loss: 0.614
	Train Loss: 0.611
	Train Loss: 0.608
	Train Loss: 0.605
	Train Loss: 0.602


In [263]:
testing_set=myDataset(X_test_rnn_ternary, y_test_rnn_ternary)
testing_generator = data.DataLoader(testing_set, batch_size=64, shuffle=False)

In [264]:
y_pred=[]
y_true=[]
for test_batch in testing_generator:
    pred=model(test_batch[0], test_batch[2]).squeeze().topk(1)[1].T[0].numpy()
    y_pred=np.append(y_pred,pred)
    y_true=np.append(y_true,test_batch[1].numpy())

In [265]:
from sklearn.metrics import classification_report
cl_report_bin=classification_report(y_pred, y_true, output_dict=True)
print(classification_report(y_pred, y_true))

              precision    recall  f1-score   support

         0.0       0.81      0.74      0.78     21579
         1.0       0.82      0.78      0.80     21304
         2.0       0.32      0.45      0.37      7061

    accuracy                           0.72     49944
   macro avg       0.65      0.66      0.65     49944
weighted avg       0.75      0.72      0.73     49944



### Ternary Model - Prebuilt

In [49]:
training_set=prebuiltDataset(X_train_rnn_ternary, y_train_rnn_ternary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [50]:
# Instantiate the model with hyperparameters
model = Model_GRU(input_size=300, output_size=3, hidden_dim=50, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
# model = model.to(device)

# Define hyperparameters
# n_epochs = 2
lr=0.0001

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [52]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
#     train_loss, train_acc = train(model, training_generator, optimizer, criterion)
    train_loss= train(model, training_generator, optimizer, criterion)
    
#     #evaluate the model
#     valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
#     #save the best model
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'Epoch: {epoch}\tTrain Loss: {train_loss:.3f}') 
#           | Train Acc: {train_acc*100:.2f}%')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 0	Train Loss: 0.895
Epoch: 1	Train Loss: 0.794
Epoch: 2	Train Loss: 0.744
Epoch: 3	Train Loss: 0.724
Epoch: 4	Train Loss: 0.711
Epoch: 5	Train Loss: 0.700
Epoch: 6	Train Loss: 0.691
Epoch: 7	Train Loss: 0.684
Epoch: 8	Train Loss: 0.678
Epoch: 9	Train Loss: 0.673
Epoch: 10	Train Loss: 0.669
Epoch: 11	Train Loss: 0.665
Epoch: 12	Train Loss: 0.661
Epoch: 13	Train Loss: 0.657
Epoch: 14	Train Loss: 0.655
Epoch: 15	Train Loss: 0.652
Epoch: 16	Train Loss: 0.649
Epoch: 17	Train Loss: 0.647
Epoch: 18	Train Loss: 0.645
Epoch: 19	Train Loss: 0.643


In [53]:
testing_set=prebuiltDataset(X_test_rnn_ternary, y_test_rnn_ternary)
testing_generator = data.DataLoader(testing_set, batch_size=64, shuffle=False)

In [54]:
y_pred=[]
y_true=[]
for test_batch in testing_generator:
    pred=model(test_batch[0], test_batch[2]).squeeze().topk(1)[1].T[0].numpy()
    y_pred=np.append(y_pred,pred)
    y_true=np.append(y_true,test_batch[1].numpy())

In [55]:
from sklearn.metrics import classification_report
cl_report_bin=classification_report(y_pred, y_true, output_dict=True)
print(classification_report(y_pred, y_true))

              precision    recall  f1-score   support

         0.0       0.84      0.72      0.78     22858
         1.0       0.81      0.79      0.80     20765
         2.0       0.29      0.45      0.35      6321

    accuracy                           0.72     49944
   macro avg       0.64      0.66      0.64     49944
weighted avg       0.76      0.72      0.73     49944



### Binary Model - My Model

In [270]:
training_set=myDataset(X_train_rnn_binary, y_train_rnn_binary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [271]:
# Instantiate the model with hyperparameters
model = Model_GRU(input_size=300, output_size=2, hidden_dim=50, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
# model = model.to(device)

# Define hyperparameters
# n_epochs = 2
lr=0.0001

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [272]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
#     train_loss, train_acc = train(model, training_generator, optimizer, criterion)
    train_loss= train(model, training_generator, optimizer, criterion)
    
#     #evaluate the model
#     valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
#     #save the best model
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'Epoch: {epoch}\tTrain Loss: {train_loss:.3f}') 
#           | Train Acc: {train_acc*100:.2f}%')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 0	Train Loss: 0.470
Epoch: 1	Train Loss: 0.354
Epoch: 2	Train Loss: 0.330
Epoch: 3	Train Loss: 0.316
Epoch: 4	Train Loss: 0.303
Epoch: 5	Train Loss: 0.294
Epoch: 6	Train Loss: 0.286
Epoch: 7	Train Loss: 0.280
Epoch: 8	Train Loss: 0.274
Epoch: 9	Train Loss: 0.268
Epoch: 10	Train Loss: 0.264
Epoch: 11	Train Loss: 0.259
Epoch: 12	Train Loss: 0.255
Epoch: 13	Train Loss: 0.250
Epoch: 14	Train Loss: 0.247
Epoch: 15	Train Loss: 0.243
Epoch: 16	Train Loss: 0.240
Epoch: 17	Train Loss: 0.237
Epoch: 18	Train Loss: 0.234
Epoch: 19	Train Loss: 0.231


In [273]:
testing_set=myDataset(X_test_rnn_binary, y_test_rnn_binary)
testing_generator = data.DataLoader(testing_set, batch_size=64, shuffle=False)

In [274]:
y_pred=[]
y_true=[]
for test_batch in testing_generator:
    pred=model(test_batch[0], test_batch[2]).squeeze().topk(1)[1].T[0].numpy()
    y_pred=np.append(y_pred,pred)
    y_true=np.append(y_true,test_batch[1].numpy())

In [275]:
from sklearn.metrics import classification_report
cl_report_bin=classification_report(y_pred, y_true, output_dict=True)
print(classification_report(y_pred, y_true))

              precision    recall  f1-score   support

         0.0       0.88      0.88      0.88     19864
         1.0       0.88      0.88      0.88     20091

    accuracy                           0.88     39955
   macro avg       0.88      0.88      0.88     39955
weighted avg       0.88      0.88      0.88     39955



### Binary Model - Prebuilt

In [56]:
training_set=prebuiltDataset(X_train_rnn_binary, y_train_rnn_binary)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)

In [57]:
# Instantiate the model with hyperparameters
model = Model_GRU(input_size=300, output_size=2, hidden_dim=50, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
# model = model.to(device)

# Define hyperparameters
# n_epochs = 2
lr=0.0001

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [58]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
#     train_loss, train_acc = train(model, training_generator, optimizer, criterion)
    train_loss= train(model, training_generator, optimizer, criterion)
    
#     #evaluate the model
#     valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
#     #save the best model
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'Epoch: {epoch}\tTrain Loss: {train_loss:.3f}') 
#           | Train Acc: {train_acc*100:.2f}%')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 0	Train Loss: 0.458
Epoch: 1	Train Loss: 0.387
Epoch: 2	Train Loss: 0.361
Epoch: 3	Train Loss: 0.343
Epoch: 4	Train Loss: 0.331
Epoch: 5	Train Loss: 0.321
Epoch: 6	Train Loss: 0.314
Epoch: 7	Train Loss: 0.307
Epoch: 8	Train Loss: 0.303
Epoch: 9	Train Loss: 0.299
Epoch: 10	Train Loss: 0.295
Epoch: 11	Train Loss: 0.291
Epoch: 12	Train Loss: 0.288
Epoch: 13	Train Loss: 0.285
Epoch: 14	Train Loss: 0.282
Epoch: 15	Train Loss: 0.280
Epoch: 16	Train Loss: 0.277
Epoch: 17	Train Loss: 0.275
Epoch: 18	Train Loss: 0.273
Epoch: 19	Train Loss: 0.271


In [59]:
testing_set=prebuiltDataset(X_test_rnn_binary, y_test_rnn_binary)
testing_generator = data.DataLoader(testing_set, batch_size=64, shuffle=False)

In [60]:
y_pred=[]
y_true=[]
for test_batch in testing_generator:
    pred=model(test_batch[0], test_batch[2]).squeeze().topk(1)[1].T[0].numpy()
    y_pred=np.append(y_pred,pred)
    y_true=np.append(y_true,test_batch[1].numpy())

In [61]:
from sklearn.metrics import classification_report
cl_report_bin=classification_report(y_pred, y_true, output_dict=True)
print(classification_report(y_pred, y_true))

              precision    recall  f1-score   support

         0.0       0.87      0.88      0.88     19581
         1.0       0.89      0.87      0.88     20374

    accuracy                           0.88     39955
   macro avg       0.88      0.88      0.88     39955
weighted avg       0.88      0.88      0.88     39955



## Summary of Accuracies - Binary:

### Simple Model (task 3)- 

#### SVM
Tf-Idf - 0.87

My Model - 0.84

Prebuilt Model - 0.82 

#### Perceptron
Tf-Idf - 0.83

My Model - 0.79

Prebuilt Model - 0.71

### Feed-Forward Neural Network (task 4)

#### Input Type 1 (task 4A)
My Model - 0.84

Prebuilt Model - 0.76

#### Input Type 2 (task 4B)
My Model - 0.78

Prebuilt Model - 0.76

### RNN (task 5A)-
My Model - 0.85

Prebuilt Model - 0.86

### GRU (task 5B)-
My Model - 0.88

Prebuilt Model - 0.88

## Summary of Accuracies - Ternary:

### Feed-Forward Neural Network (task 4)

#### Input Type 1 (task 4A)
My Model - 0.69

Prebuilt Model - 0.67

#### Input Type 2 (task 4B)
My Model - 0.63

Prebuilt Model - 0.62

### RNN (task 5A)-
My Model - 0.69

Prebuilt Model - 0.69

### GRU (task 5B)-
My Model - 0.72

Prebuilt Model - 0.72

# Observations

It can be concluded that TF-IDF feature type works the best. Among the rest of the two, the model which we made works better than the prebuilt model. This is mainly because our model is generated from the dataset which is used in this testing while the prebuilt model is more generic.

SVM performed best for binary classification task. But Feed-Forward Neural Network with same word embeddings input as SVM also gave promising results.

RNN and GRU gave promising results for Binary Classification problem

Accuracies were consistently low for ternary classification. This is due to the class imbalance as class 2 was only half as compared to class 0 or 1. 

If there would have been more computational resources, the grid-search could be applied to find out better hyperparamters and give better result. But due to less RAM and Memory, my laptop was crashing again and again.