
### Report

The Large Movie Review Dataset, developed by the [Stanford NLP](https://huggingface.co/stanfordnlp) research group, is designed for binary sentiment classification. [This dataset](https://huggingface.co/datasets/stanfordnlp/imdb) provides 25,000 movie reviews each for training and testing, along with additional unlabeled data. It has two features:

*   **label:**  Indicates sentiment with two unique values: **0** for negative and **1** for positive.
*   **text:** Contains the text of the movie review. The dataset does not identify which movie a review refers to.

Other observations and main challenges:
* The test and train sets have the same distribution for the feature *label*.
* There are 96 duplicates in the train set and 199 duplicates in the test set; both were removed.
* Both splits are organized identically, as the outputs of `dfTrain.describe()` and `dfTest.describe()` and the row inspection show: each is sorted by label, with negative reviews first and positive reviews last. They were therefore combined into one dataset, **df**, and shuffled. In the combined dataset, the two sentiment classes differ by only **0.2%**.
* Given this negligible difference, df was treated as a balanced dataset with 49,705 tuples.
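
The balance check described in the list above can be sketched as follows. This is a minimal illustration on made-up labels; the helper name `class_balance_gap` is hypothetical and does not appear in the notebook.

```python
from collections import Counter

def class_balance_gap(labels):
    """Return the absolute difference between the shares of class 0 and class 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return abs(counts[0] - counts[1]) / total

# Toy labels (assumed data, not the IMDB labels): 501 negatives, 499 positives.
toy_labels = [0] * 501 + [1] * 499
gap = class_balance_gap(toy_labels)
print(f"Class share gap: {gap:.1%}")
```

On the real data, the same quantity could presumably be read from `df['label'].value_counts(normalize=True)`.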


Next, a preprocessing function cleans each review in *text* for the classification task. Because the raw reviews vary widely in casing, punctuation, and vocabulary, the function applies the same sequence of steps to every review. Each review is first converted to lowercase and split into word tokens. Tokens that are English stopwords or that are not purely alphabetic are removed, and the remaining tokens are reduced to their base forms by lemmatization. The cleaned tokens are then used to build a bag-of-words representation over trigrams: the vocabulary is fitted on the train split only and then used to transform the test split.

Finally, two classifiers were trained: a linear model fitted with stochastic gradient descent (SGD) and a multilayer perceptron (MLP). The main challenge at this stage was tuning the MLP's hyperparameters. The performance of the two models is compared in the **Comparison** section.






### Code

#### Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from scipy import stats
from sklearn import metrics
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve, auc

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

#### Import Dataset

In [None]:
splits = {'train': 'plain_text/train-00000-of-00001.parquet', 'test': 'plain_text/test-00000-of-00001.parquet', 'unsupervised': 'plain_text/unsupervised-00000-of-00001.parquet'}
dfTrain = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["train"])
dfTest = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["test"])

#### Data Wrangling

In [None]:
dfTrain.describe()

Unnamed: 0,label
count,25000.0
mean,0.5
std,0.50001
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


In [None]:
dfTest.describe()

Unnamed: 0,label
count,25000.0
mean,0.5
std,0.50001
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


In [None]:
dfTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    25000 non-null  object
 1   label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [None]:
dfTest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    25000 non-null  object
 1   label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [None]:
nums = [5,12500,12501]
for n in nums:
  if n == 12501:
    print(f'The {n} tuple in Train \n',dfTrain.iloc[[n]])
    break
  print(f'First {n} tuples in Train \n',dfTrain.head(n))
  print(f'Last {n} tuples in Train \n',dfTrain.tail(n))

First 5 tuples in Train 
                                                 text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0
Last 5 tuples in Train 
                                                     text  label
24995  A hit at the time but now better categorised a...      1
24996  I love this movie like no other. Another time ...      1
24997  This film and it's sequel Barry Mckenzie holds...      1
24998  'The Adventures Of Barry McKenzie' started lif...      1
24999  The story centers around Barry McKenzie who mu...      1
First 12500 tuples in Train 
                                                     text  label
0      I rented I AM CURIOUS-YELLOW from my video sto...      0
1      "I Am Curious: Yellow" is a risible and 

In [None]:
nums = [5,12500,12501]
for n in nums:
  if n == 12501:
    print(f'The {n} tuple in Test \n',dfTest.iloc[[n]])
    break
  print(f'First {n} tuples in Test \n',dfTest.head(n))
  print(f'Last {n} tuples in Test \n',dfTest.tail(n))

First 5 tuples in Test 
                                                 text  label
0  I love sci-fi and am willing to put up with a ...      0
1  Worth the entertainment value of a rental, esp...      0
2  its a totally average film with a few semi-alr...      0
3  STAR RATING: ***** Saturday Night **** Friday ...      0
4  First off let me say, If you haven't enjoyed a...      0
Last 5 tuples in Test 
                                                     text  label
24995  Just got around to seeing Monster Man yesterda...      1
24996  I got this as part of a competition prize. I w...      1
24997  I got Monster Man in a box set of three films ...      1
24998  Five minutes in, i started to feel how naff th...      1
24999  I caught this movie on the Sci-Fi channel rece...      1
First 12500 tuples in Test 
                                                     text  label
0      I love sci-fi and am willing to put up with a ...      0
1      Worth the entertainment value of a rental, 

In [None]:
dfTrain.isnull().sum()

Unnamed: 0,0
text,0
label,0


In [None]:
dfTest.isnull().sum()

Unnamed: 0,0
text,0
label,0


In [None]:
print(f'Train duplicates: {dfTrain.duplicated().sum()}')
print(f'Test duplicates: {dfTest.duplicated().sum()}')

dfTrain.drop_duplicates(inplace=True)
dfTest.drop_duplicates(inplace=True)

print(f'Train duplicates: {dfTrain.duplicated().sum()}')
print(f'Test duplicates: {dfTest.duplicated().sum()}')

Train duplicates: 96
Test duplicates: 199
Train duplicates: 0
Test duplicates: 0


In [None]:
print(f'Train shape: {dfTrain.shape}')
print(f'Test shape: {dfTest.shape}')

print('Difference: ', dfTrain.shape[0] - dfTest.shape[0])

Train shape: (24904, 2)
Test shape: (24801, 2)
Difference:  103


In [None]:
df = pd.concat([dfTrain, dfTest], axis=0)
df.shape

(49705, 2)

In [None]:
df = df.sample(frac=1, random_state=33).reset_index(drop=True)
df.head()

Unnamed: 0,text,label
0,"The 40 Year Old Virgin, is about Andy Stitzer,...",1
1,Worst movie ever made!!! Please see the Real m...,0
2,The murders in Opera are not actual murders as...,1
3,Boasting an all-star cast so impressive that i...,1
4,To those who say that this movie deserves anyt...,1


#### Split into test and train

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.37, random_state=33)

#### Preprocessing

In [None]:
# Build these once; recreating them for every review is needlessly slow.
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
  """Lowercase, tokenize, drop stopwords and non-alphabetic tokens, then lemmatize."""
  tokens = word_tokenize(text.lower())
  tokens = [token for token in tokens if token.isalpha() and token not in STOP_WORDS]
  return ' '.join(LEMMATIZER.lemmatize(token) for token in tokens)

In [None]:
X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

#### BoW

In [None]:
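# ngram_range=(3, 3) keeps word trigrams only (no unigrams or bigrams).
vectorizer = CountVectorizer(ngram_range=(3, 3))
# Fit the vocabulary on the train split only, then reuse it for the test split.
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)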
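
To make the fit/transform split concrete, here is a pure-Python sketch of what a trigram-only bag of words does. It is an illustration under simplifying assumptions (whitespace tokenization, dense lists), not a reimplementation of `CountVectorizer`; the names `trigrams`, `build_vocab`, and `to_counts` are hypothetical.

```python
from collections import Counter

def trigrams(text):
    """Return the word trigrams of a whitespace-tokenized string."""
    words = text.split()
    return [' '.join(words[i:i + 3]) for i in range(len(words) - 2)]

def build_vocab(docs):
    """Map each trigram seen in the training documents to a column index."""
    vocab = sorted({tg for doc in docs for tg in trigrams(doc)})
    return {tg: idx for idx, tg in enumerate(vocab)}

def to_counts(doc, vocab):
    """Count the trigrams of one document, ignoring those outside the vocabulary."""
    row = [0] * len(vocab)
    for tg, n in Counter(trigrams(doc)).items():
        if tg in vocab:
            row[vocab[tg]] = n
    return row

# Toy, already-preprocessed reviews (assumed data for illustration).
train_docs = ['great movie loved every minute', 'bad movie hated every minute']
test_doc = 'loved every minute great fun'

vocab = build_vocab(train_docs)        # fitted on the train split only
test_row = to_counts(test_doc, vocab)  # unseen test trigrams are dropped
print(len(vocab), sum(test_row))
```

Only one of the three trigrams in the toy test review was seen during fitting, so the other two contribute nothing to its feature vector. Trigram features are sparse in this sense, which may limit how much signal reaches the classifiers.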

#### SGD

In [None]:
sgd_model = SGDClassifier(loss='log_loss', random_state=33, max_iter=100, tol=1e-4)
sgd_model.fit(X_train_bow, y_train)

In [None]:
y_pred = sgd_model.predict(X_test_bow)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
crSGD = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(crSGD)


Accuracy: 0.7167636343863847

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.59      0.68      9269
           1       0.67      0.85      0.75      9122

    accuracy                           0.72     18391
   macro avg       0.73      0.72      0.71     18391
weighted avg       0.73      0.72      0.71     18391



In [None]:
def createRocCurve(y_test, y_pred, model_name=None):
  # y_pred holds hard 0/1 labels here, so the curve has one interior point;
  # pass probabilities or decision scores to trace a full curve.
  fpr, tpr, thresholds = roc_curve(y_test, y_pred)
  roc_auc = auc(fpr, tpr)
  trace = go.Scatter(x=fpr, y=tpr, mode='lines',
                  name='ROC curve (area = %0.8f)' % roc_auc,
                  line=dict(color='darkorange', width=2))
  diagonal_trace = go.Scatter(x=[0, 1], y=[0, 1], mode='lines',
                            name='Random Classifier',
                            line=dict(color='navy',
                                      width=2, dash='dash'))
  # Name the model in the title only when one is given, so the function can be
  # reused for any classifier.
  title = 'Receiver Operating Characteristic'
  if model_name:
    title += f' of {model_name}'
  layout = go.Layout(title=title,
                    xaxis_title='FPR',
                    yaxis_title='TPR',
                    xaxis=dict(range=[0, 1]),
                    yaxis=dict(range=[0, 1.05]),
                    showlegend=True)
  fig = go.Figure(data=[trace, diagonal_trace], layout=layout)
  fig.show()

createRocCurve(y_test, y_pred, 'SGD')
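
Because `roc_curve` receives hard 0/1 predictions here, the resulting area reduces to the mean of the two per-class recalls. The sketch below illustrates this on toy data with a rank-based AUC written in plain Python; `rank_auc` and the toy values are assumptions for illustration only. In the notebook, continuous scores could presumably be obtained with `sgd_model.predict_proba(X_test_bow)[:, 1]`.

```python
def rank_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (assumed): true labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]
hard = [int(p >= 0.5) for p in proba]  # thresholded labels, like predict()

auc_scores = rank_auc(y_true, proba)   # uses the full ranking
auc_hard = rank_auc(y_true, hard)      # uses only the 0/1 decisions

# With hard labels, the AUC equals the mean of the per-class recalls.
tpr = sum(h == 1 for y, h in zip(y_true, hard) if y == 1) / y_true.count(1)
tnr = sum(h == 0 for y, h in zip(y_true, hard) if y == 0) / y_true.count(0)
print(round(auc_scores, 4), round(auc_hard, 4), round((tpr + tnr) / 2, 4))
```

The score-based value differs from the label-based one because thresholding discards the ranking information that the ROC curve is meant to summarize.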

In [None]:
y_pred[:5]

array([1, 1, 0, 1, 1])

#### MLP


In [None]:
mlp = MLPClassifier(
    hidden_layer_sizes=(5, 5, 5, 5),
    max_iter=100,
    activation='relu',
    solver='adam',
    random_state=33,
    tol=1e-4,
    early_stopping=True,
    verbose=True
)

mlp.fit(X_train_bow, y_train)

y_pred = mlp.predict(X_test_bow)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Iteration 1, loss = 0.89367299
Validation score: 0.504151
Iteration 2, loss = 0.59949441
Validation score: 0.686462
Iteration 3, loss = 0.20530266
Validation score: 0.743934
Iteration 4, loss = 0.07048060
Validation score: 0.759898
Iteration 5, loss = 0.03208655
Validation score: 0.725096
Iteration 6, loss = 0.01254654
Validation score: 0.739144
Iteration 7, loss = 0.00592192
Validation score: 0.745849
Iteration 8, loss = 0.00409110
Validation score: 0.735313
Iteration 9, loss = 0.00319179
Validation score: 0.726373
Iteration 10, loss = 0.00254402
Validation score: 0.720307
Iteration 11, loss = 0.00224476
Validation score: 0.720626
Iteration 12, loss = 0.00205752
Validation score: 0.716156
Iteration 13, loss = 0.00193462
Validation score: 0.713602
Iteration 14, loss = 0.00184771
Validation score: 0.709770
Iteration 15, loss = 0.00178297
Validation score: 0.707535
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Accuracy: 0.7627100212060247

C

In [None]:
y_pred[:5]

array([1, 1, 0, 1, 0])

In [None]:
createRocCurve(y_test,y_pred)
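
The MLP training log above shows the validation score peaking at iteration 4 and training stopping after iteration 15. The following sketch replays those validation scores through a simple patience rule. It is an illustration, not scikit-learn's code: the function name `early_stop_epoch` is hypothetical, and the strict `>` comparison against the patience of 10 is an assumption chosen because it is consistent with the log.

```python
def early_stop_epoch(val_scores, tol=1e-4, patience=10):
    """Return (stop_epoch, best_epoch) for patience-based early stopping.

    An epoch counts as "no improvement" when its score is below best + tol;
    training stops once more than `patience` such epochs occur in a row.
    """
    best, best_epoch, stale = float('-inf'), None, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score < best + tol:
            stale += 1
        else:
            stale = 0
        if score > best:
            best, best_epoch = score, epoch
        if stale > patience:
            return epoch, best_epoch
    return None, best_epoch

# Validation scores copied from the training log above.
val_scores = [0.504151, 0.686462, 0.743934, 0.759898, 0.725096,
              0.739144, 0.745849, 0.735313, 0.726373, 0.720307,
              0.720626, 0.716156, 0.713602, 0.709770, 0.707535]

stop_epoch, best_epoch = early_stop_epoch(val_scores)
print(stop_epoch, best_epoch)
```

The training loss keeps falling toward zero while the validation score declines after iteration 4, which suggests the network starts to overfit the trigram features early.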

#### Comparison

In [None]:
metrics_data = {
    'Model': ['MLP', 'SGD'],
    'Accuracy': [0.7627100212060247, 0.7167636343863847],
    'Precision (Class 0)': [0.81, 0.80],
    'Recall (Class 0)': [0.69, 0.59],
    'F1-score (Class 0)': [0.75, 0.68],
    'Precision (Class 1)': [0.73, 0.67],
    'Recall (Class 1)': [0.84, 0.85],
    'F1-score (Class 1)': [0.78, 0.75],
    'Area under ROC': [0.76330075, 0.71781085]
}
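
The figures in `metrics_data` above are transcribed by hand from the two classification reports, which is easy to get wrong. As a hedged alternative, the per-class values could be derived directly from the labels. The sketch below uses plain Python on toy labels; `per_class_metrics` is a hypothetical helper, and in the notebook `classification_report(y_test, y_pred, output_dict=True)` would likely serve the same purpose.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'precision': precision, 'recall': recall, 'f1': f1}

# Toy labels (assumed data, not the notebook's predictions).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# One flat row per model, keyed like the columns of the comparison table.
row = {f'{name} (Class {c})': round(value, 2)
       for c in (0, 1)
       for name, value in per_class_metrics(y_true, y_pred, c).items()}
print(row)
```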

In [None]:
df_metrics = pd.DataFrame(metrics_data)

In [None]:
df_metrics.head()

Unnamed: 0,Model,Accuracy,Precision (Class 0),Recall (Class 0),F1-score (Class 0),Precision (Class 1),Recall (Class 1),F1-score (Class 1),Area under ROC
0,MLP,0.76271,0.81,0.69,0.75,0.73,0.84,0.78,0.763301
1,SGD,0.716764,0.8,0.59,0.68,0.67,0.85,0.75,0.717811


In [None]:
df_transposed = df_metrics.set_index('Model').T.reset_index()
df_transposed = df_transposed.rename(columns={'index': 'Metric'})
df_transposed['Comparison'] = df_transposed.apply(lambda row: 'A' if row['MLP'] > row['SGD'] else 'Z', axis=1)

print(df_transposed)

Model               Metric       MLP       SGD Comparison
0                 Accuracy  0.762710  0.716764          A
1      Precision (Class 0)  0.810000  0.800000          A
2         Recall (Class 0)  0.690000  0.590000          A
3       F1-score (Class 0)  0.750000  0.680000          A
4      Precision (Class 1)  0.730000  0.670000          A
5         Recall (Class 1)  0.840000  0.850000          Z
6       F1-score (Class 1)  0.780000  0.750000          A
7           Area under ROC  0.763301  0.717811          A


For brevity, the *Comparison* column is produced by the following expression, which marks a metric with 'A' when MLP scores higher than SGD and with 'Z' otherwise:

```
lambda row: 'A' if row['MLP'] > row['SGD'] else 'Z', axis=1
```
The column shows that SGD outperforms the MLP classifier on only one metric, recall for positive reviews, and by just 0.01.
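
As a possible alternative to the 'A'/'Z' flag, a signed difference shows both which model is ahead and by how much. The sketch below uses plain dictionaries with the metric values copied from the comparison table above; the variable names are illustrative.

```python
# Metric values copied from the comparison table above.
mlp = {'Accuracy': 0.762710, 'Precision (Class 0)': 0.81,
       'Recall (Class 0)': 0.69, 'F1-score (Class 0)': 0.75,
       'Precision (Class 1)': 0.73, 'Recall (Class 1)': 0.84,
       'F1-score (Class 1)': 0.78, 'Area under ROC': 0.763301}
sgd = {'Accuracy': 0.716764, 'Precision (Class 0)': 0.80,
       'Recall (Class 0)': 0.59, 'F1-score (Class 0)': 0.68,
       'Precision (Class 1)': 0.67, 'Recall (Class 1)': 0.85,
       'F1-score (Class 1)': 0.75, 'Area under ROC': 0.717811}

# Positive values favor MLP, negative values favor SGD.
diff = {metric: round(mlp[metric] - sgd[metric], 4) for metric in mlp}
sgd_better = [metric for metric, d in diff.items() if d < 0]

for metric, d in diff.items():
    print(f'{metric:<22} {d:+.4f}')
print('SGD ahead on:', sgd_better)
```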