## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files are placed in the same folder as this notebook. The reviews are loaded into a pandas DataFrame, and we print the beginning of the first few reviews.

In [11]:
import numpy as np
import pandas as pd

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels == 'positive').astype(np.int_)  # 1 = positive, 0 = negative

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets are used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Splitting the data into train (80%), validation (16%), and test (4%) sets:
# the second call splits the held-out 20% again in an 80/20 ratio
X_train, X_temp, Y_train, Y_temp = train_test_split(reviews, Y, test_size=0.2, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.2, random_state=42)

# Creating a Bag-of-Words representation
vectorizer = CountVectorizer(max_features=10000)
X_train_bow = vectorizer.fit_transform(X_train.values.ravel())
X_val_bow = vectorizer.transform(X_val.values.ravel())
X_test_bow = vectorizer.transform(X_test.values.ravel())

print("Train set shape:", X_train_bow.shape)
print("Validation set shape:", X_val_bow.shape)
print("Test set shape:", X_test_bow.shape)

Train set shape: (20000, 10000)
Validation set shape: (4000, 10000)
Test set shape: (1000, 10000)
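
The set sizes above follow from chaining two calls to `train_test_split`. A minimal sketch on 100 dummy samples (the names `X_toy` and `y_toy` are hypothetical stand-ins, not the IMDb data) shows how the two `test_size` values combine:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 dummy samples stand in for the reviews (hypothetical)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100) % 2

# First split: 80% train, 20% held out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
# Second split: the held-out 20% is divided 80/20 into validation and test
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.2, random_state=0)

sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)  # (80, 16, 4)
```

If a larger test set is preferred, setting `test_size=0.5` in the second call would give an 80/10/10 split instead.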


**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [13]:
word = vectorizer.get_feature_names_out()[136]

print("Vector representation of a single word: ", vectorizer.transform([word]).nonzero())
print("\nVector representation of a whole review: ", X_train_bow[0].nonzero())

Vector representation of a single word:  (array([0], dtype=int32), array([136], dtype=int32))

Vector representation of a whole review:  (array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0], dtype=int32), array([8962, 4391, 6190, 5433, 5717,   16,  846, 7845, 9090, 5576,  509,
       6233, 5828, 4642, 3751, 6775, 9668, 5873, 4729, 4716, 8972, 8960,
       4086, 6100,  786, 3175, 5882, 9964, 4330, 5430, 9005, 7902, 8617,
       9856, 1624, 8468,  279, 2769, 2025, 5872, 9780, 2503, 1972, 6500,
       4056, 1056, 5362, 3758, 4123, 4169, 4369, 9790, 2601, 2472, 4476,
       7642,  33

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

model = MLPClassifier(max_iter=1000, random_state=42)

param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (150,), (100, 50)],  # note: (100, 50) has two hidden layers
    'activation': ['relu', 'tanh'],
    'alpha': [0.0001, 0.001, 0.01],  
    'learning_rate_init': [0.001, 0.01],
}

# grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)
# grid_search.fit(X_train_bow, Y_train)
# best_model = grid_search.best_estimator_
# Y_pred_best_model = best_model.predict(X_test_bow)
# print("Best MLP Hyperparameters:", grid_search.best_params_)

# To save computation time, using best parameters from a previous run:
best_params = {
    'hidden_layer_sizes': (100,),
    'activation': 'relu',
    'alpha': 0.0001,
    'learning_rate_init': 0.001
}

model = MLPClassifier(max_iter=1000, random_state=42, **best_params)
model.fit(X_train_bow, Y_train.values.ravel())  # 1-D labels avoid the DataConversionWarning
Y_valid_pred = model.predict(X_val_bow)

accuracy = accuracy_score(Y_val, Y_valid_pred)
print("MLP Accuracy: ", accuracy)


MLP Accuracy:  0.869


In [18]:
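# Second approach: MLPClassifier with its default hyperparameters instead of the grid-search results.
# The defaults (relu, alpha=0.0001, learning_rate_init=0.001) coincide with best_params above,
# so the validation accuracy is identical.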

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Creating a neural network classifier
model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)

# Training on the training data
model.fit(X_train_bow, Y_train.values.ravel())  # 1-D labels avoid the DataConversionWarning

# Evaluating on the validation set
Y_valid_pred = model.predict(X_val_bow)

accuracy = accuracy_score(Y_val, Y_valid_pred)
print("Validation accuracy: ", accuracy)


Validation accuracy:  0.869
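
As an alternative to a cross-validated grid search, hyperparameters can also be compared directly on the held-out validation set. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the IMDb reviews, and the candidate layer sizes are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (hypothetical, not the IMDb reviews)
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Candidate sizes for the single hidden layer (arbitrary values for illustration)
candidate_sizes = [(5,), (20,), (50,)]
val_scores = {}
for size in candidate_sizes:
    clf = MLPClassifier(hidden_layer_sizes=size, max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    # Score each candidate on the validation set, never on the test set
    val_scores[size] = accuracy_score(y_va, clf.predict(X_va))

# Keep the configuration with the highest validation accuracy
best_size = max(val_scores, key=val_scores.get)
print(val_scores)
print("Best hidden layer size:", best_size)
```

The same loop could iterate over `alpha` or `learning_rate_init`; the test set stays untouched until the final evaluation.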


**(d)** Test your sentiment-classifier on the test set.

In [16]:
# Evaluating on the test set
Y_test_pred = model.predict(X_test_bow)
accuracy = accuracy_score(Y_test, Y_test_pred)
print(f"Test accuracy: {accuracy:.4f}")

Test accuracy: 0.8640


**(e)** Use the classifier to classify a few sentences you write yourselves. 

In [None]:
# Classifying a few self-written reviews
new_review = [
    "This movie was absolutely wonderful and touching.",
    "I hated every minute of this film.",
    "An average film with a few good moments.",
]
new_review_bow = vectorizer.transform(new_review)
new_review_pred = model.predict(new_review_bow)

for review, pred in zip(new_review, new_review_pred):
    label = "positive" if pred == 1 else "negative"
    print(f"Review: {review}\nPredicted sentiment: {label}\n")

Review: This movie was absolutely wonderful and touching.
Predicted sentiment: positive

Review: I hated every minute of this film.
Predicted sentiment: negative

Review: An average film with a few good moments.
Predicted sentiment: positive
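
Two things are worth checking when classifying your own sentences: words outside the fitted vocabulary are ignored by `CountVectorizer.transform`, and `predict_proba` shows how confident the classifier is. The following is a minimal sketch on a hypothetical four-sentence training set (the texts, labels and the small hidden layer are assumptions for illustration, not the notebook's model):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Hypothetical miniature training set (1 = positive, 0 = negative)
toy_texts = [
    "wonderful touching movie",
    "great acting wonderful story",
    "terrible boring film",
    "hated this boring movie",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)

toy_model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
toy_model.fit(toy_X, toy_labels)

# Words never seen during fitting are silently dropped by transform
unseen = toy_vectorizer.transform(["mediocre screenplay"])
print(unseen.nnz)  # 0: no known words, so the vector is all zeros

# predict_proba returns one row per sentence: [P(negative), P(positive)]
probs = toy_model.predict_proba(
    toy_vectorizer.transform(["wonderful story", "boring film"])
)
print(probs)
```

A probability close to 0.5 suggests the prediction for that sentence should be read with caution, which may be the case for mixed reviews such as the "average film" example.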

