## Sentiment Analysis

In this exercise we use the IMDb dataset to perform sentiment analysis. The code below assumes that the data is placed in the same folder as this notebook. The reviews are loaded as a pandas DataFrame, and we print the beginning of the first few reviews.

In [71]:
import numpy as np
import pandas as pd

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())
print(labels.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...
          0
0  positive
1  negative
2  positive
3  negative
4  positive


**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets will be used to train your model and tune hyperparameters; the test set will be saved for testing. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [72]:
from sklearn.feature_extraction.text import CountVectorizer

# create the transform
vectorizer = CountVectorizer(max_features=10000)

revs_array = reviews[0].values

# Fit and transform the text data
bag = vectorizer.fit_transform(revs_array)

**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [73]:
bag

# Inspecting the output below, which is the data-type of the representation of the reviews,
# we see that it is a sparse matrix. Let's take a closer look below.

<25000x10000 sparse matrix of type '<class 'numpy.int64'>'
	with 3156666 stored elements in Compressed Sparse Row format>

In [74]:

# convert to a DataFrame for visualization
df = pd.DataFrame(bag.toarray(), columns=vectorizer.get_feature_names_out())

print(df.tail())

       aaron  abandon  abandoned  abc  abilities  ability  able  aboard  \
24995      0        0          0    0          0        0     0       0   
24996      0        0          0    0          0        0     1       0   
24997      0        0          0    0          0        0     0       0   
24998      0        0          0    0          0        0     0       0   
24999      0        0          0    0          0        0     0       0   

       abominable  abomination  ...  zhang  zizek  zodiac  zombi  zombie  \
24995           0            0  ...      0      0       0      0       0   
24996           0            0  ...      0      0       0      0       0   
24997           0            0  ...      0      0       0      0       2   
24998           0            0  ...      0      0       0      0       0   
24999           0            0  ...      0      0       0      0       0   

       zombies  zone  zoom  zorro  zu  
24995        0     0     0      0   0  
24996       

We see that the bag-of-words representation stores each review as a row and each word of the vocabulary (the 10,000 most frequent words) as a column. A single word is therefore represented by a column index, and a whole review by a row vector of integer counts: the value under each word's column is the number of times that word occurs in the review. This is why the matrix is so sparse: there are 10,000 words in our "vocabulary" and each review uses only a very small percentage of them.
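To put a number on that sparsity, here is a small back-of-the-envelope calculation. It only uses the figures printed in the `In [73]` output (25000 rows, 10000 columns, 3156666 stored elements); the variable names are my own.

```python
# Figures copied from the sparse-matrix summary printed above.
n_reviews = 25000
vocab_size = 10000
stored_elements = 3156666

# Fraction of matrix cells that hold a nonzero count.
density = stored_elements / (n_reviews * vocab_size)

# Average number of distinct vocabulary words per review.
avg_distinct_words = stored_elements / n_reviews

print(f"Density: {density:.2%}")                              # Density: 1.26%
print(f"Distinct words per review: {avg_distinct_words:.0f}")  # Distinct words per review: 126
```

So on average a review touches roughly 126 of the 10,000 columns, which is why a sparse format is the sensible storage choice here.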

For example, we see in the tail of the DataFrame that review number 24997 contains the word "zombie" twice.

In [75]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

X = bag
Y = to_categorical(Y, 2)
# splitting data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
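
In this notebook the validation set is carved out later by the `validation_split` argument of `model.fit`, and the vectorizer was fitted on all 25,000 reviews before splitting, so the vocabulary selection has seen the test reviews. A stricter variant, sketched below on made-up data, would split the raw text into three explicit sets first and fit the vectorizer on the training part only. The names `texts`, `targets` and the `*_toy` variables are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the reviews and their 0/1 labels.
texts = [f"review number {i} was {'great' if i % 2 else 'awful'}" for i in range(10)]
targets = [i % 2 for i in range(10)]

# First hold out 20% as the test set ...
texts_trainval, texts_test, y_trainval, y_test = train_test_split(
    texts, targets, test_size=0.2, random_state=42)

# ... then take 25% of the remainder as validation (20% of the total).
texts_train, texts_val, y_train, y_val = train_test_split(
    texts_trainval, y_trainval, test_size=0.25, random_state=42)

# Fit the vocabulary on the training texts only, then reuse it unchanged.
toy_vec = CountVectorizer(max_features=10000)
X_train_toy = toy_vec.fit_transform(texts_train)
X_val_toy = toy_vec.transform(texts_val)
X_test_toy = toy_vec.transform(texts_test)

print(X_train_toy.shape, X_val_toy.shape, X_test_toy.shape)
```

With explicit arrays like these, the validation data could be passed to Keras through `validation_data=(X_val, Y_val)` instead of `validation_split`.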

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 

In [108]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
from numpy.random import seed, randint

seed(0)
tf.random.set_seed(0)

model = Sequential() #initialize neural network
model.add(Dense(units = 30, activation = 'sigmoid', input_dim = 10000)) #add the first hidden layer
model.add(Dense(units = 2, activation = 'softmax')) #output layer

sgd = optimizers.SGD(learning_rate = 0.03)
model.compile(loss = 'categorical_crossentropy', optimizer = sgd, metrics = ['accuracy'])

history = model.fit(X_train, Y_train, epochs = 20, batch_size = 50, validation_split = 0.2, verbose = 1)

Epoch 1/20
240/240 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.6118 - loss: 0.6500 - val_accuracy: 0.7423 - val_loss: 0.5583
Epoch 2/20
240/240 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.7490 - loss: 0.5434 - val_accuracy: 0.7890 - val_loss: 0.4900
Epoch 3/20
240/240 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.7926 - loss: 0.4818 - val_accuracy: 0.8097 - val_loss: 0.4445
Epoch 4/20
240/240 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8179 - loss: 0.4384 - val_accuracy: 0.8283 - val_loss: 0.4142
Epoch 5/20
240/240 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8334 - loss: 0.4063 - val_accuracy: 0.8337 - val_loss: 0.3926
Epoch 6/20
240/240 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8452 - loss: 0.3810 - val_accuracy: 0.8440 - val_loss: 0.3761
Epoch 7/20
240/240 

I experimented with the learning rate, activation function, number of epochs and batch size until I reached 91% accuracy on the training set and 87% accuracy on the validation set. I mostly kept my eye on the validation accuracy, and this was the highest I managed to get across all experiments.

At first I was only looking at the training accuracy and realized that, given enough epochs, it would eventually reach 100% and stay there. Once I started watching the validation accuracy instead, I noticed that it plateaued far sooner, which made me reduce the number of epochs.

I then realized that if my learning rate was too high, there were considerable fluctuations (both up and down) from one epoch to the next. Since I was training for so few epochs, I decided to reduce the learning rate so that each epoch would have a better chance of improving the accuracy. This worked out well for the hyperparameters I ended up going with.
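The "stop once validation stops improving" rule I applied by hand can be written down explicitly. Below is a minimal, framework-free sketch of a patience-based stopping rule; the function name and the list of validation losses are hypothetical, not values from my runs. Keras offers the same idea as the `tf.keras.callbacks.EarlyStopping` callback (with `monitor`, `patience` and `restore_best_weights` arguments), which could be passed to `model.fit` via `callbacks=[...]`.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops after `patience` consecutive epochs in which the
    validation loss fails to improve on the best value by more than
    `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch

    # Ran out of epochs before the patience was exhausted.
    return best_epoch, len(val_losses)


# Hypothetical validation losses: improving at first, then drifting upward.
example_val_losses = [0.56, 0.49, 0.44, 0.41, 0.39, 0.40, 0.41, 0.42, 0.43]
best, stop = early_stopping_epoch(example_val_losses, patience=3)
print(f"Best epoch: {best}, training would stop after epoch: {stop}")
```

In this made-up run the best validation loss is at epoch 5 and training stops after epoch 8, which is the same kind of decision I made manually when cutting the epoch count.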

**(d)** Test your sentiment-classifier on the test set.

In [110]:
loss, accuracy = model.evaluate(X_test, Y_test)

print(f'Test Loss: {loss}')
print(f'Test Accuracy: {accuracy}')

157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 583us/step - accuracy: 0.8599 - loss: 0.3350
Test Loss: 0.3363731801509857
Test Accuracy: 0.8611999750137329


The test accuracy (about 86%) is very close to the validation accuracy (about 87%), which is good!

It suggests that the model generalizes well to unseen data and that the train/validation/test split was reasonable, since there is no considerable difference between the test and validation accuracy.
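To get a feeling for how much of a gap would actually be noteworthy, here is a rough sketch of a 95% confidence interval for the test accuracy, using the normal approximation to the binomial. The accuracy is the value printed above; the test-set size of 5,000 follows from holding out 20% of the 25,000 reviews. Treat this as an approximation, not an exact interval.

```python
import math

test_accuracy = 0.8612   # rounded value printed by model.evaluate above
n_test = 5000            # 20% of the 25,000 reviews

# Standard error of a proportion under the normal approximation.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# 1.96 standard errors on each side gives an approximate 95% interval.
margin = 1.96 * standard_error
ci_low, ci_high = test_accuracy - margin, test_accuracy + margin

print(f"Standard error: {standard_error:.4f}")
print(f"Approx. 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

The interval is about one percentage point wide on each side, so a difference of roughly one point between validation and test accuracy is within what sampling noise alone could produce.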

Since I couldn't bring my validation accuracy any higher, I also didn't expect the test accuracy to go higher than what we got in the previous step.

I am happy with these results.

**(e)** Use the classifier to classify a few sentences you write yourselves. 

In [143]:
# Preprocess the new text
new_text = [
    "This movie was terrible! The acting was awful.",
    "Amazing movie!",
    "Cate Blanchett",
    "I have seen more boobs in this movie than in my life",
    "I love the authenticity of the actors",
    "a lot of violence",
    "too much violence",
    "i hated this movie",
    "Adam Sandler is ugly and acts bad",
    "I actually cried",
    "beautiful animation",
    "This studio only does shitty crappy movies",
    "terrible",
    "This studio only does terrible movies",
    "I hate this movie",
    "I hated this movie",
    "I hated hated hated hated this movie"]

# Vectorize the text using the same vectorizer used during training
new_text_vectorized = vectorizer.transform(new_text)

# Make predictions
predictions = model.predict(new_text_vectorized)

# Loop through each prediction
for i, pred in enumerate(predictions):
    # Determine sentiment label
    print(pred)
    sentiment = "positive" if pred[0] < 0.5 else "negative"
    # Print the original text and the predicted sentiment
    print(f"Original Text: {new_text[i]}")
    print(f"Predicted Sentiment: {sentiment}\n")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
[0.86603284 0.13396718]
Original Text: This movie was terrible! The acting was awful.
Predicted Sentiment: negative

[0.28333795 0.71666205]
Original Text: Amazing movie!
Predicted Sentiment: positive

[0.43251127 0.56748873]
Original Text: Cate Blanchett
Predicted Sentiment: positive

[0.29950377 0.7004962 ]
Original Text: I have seen more boobs in this movie than in my life
Predicted Sentiment: positive

[0.3316189 0.6683811]
Original Text: I love the authenticity of the actors
Predicted Sentiment: positive

[0.3597052  0.64029485]
Original Text: a lot of violence
Predicted Sentiment: positive

[0.5069899  0.49301013]
Original Text: too much violence
Predicted Sentiment: negative

[0.46559173 0.5344083 ]
Original Text: i hated this movie
Predicted Sentiment: positive

[0.64204276 0.35795733]
Original Text: Adam Sandler is ugly and acts bad
Predicted Sentiment: negative

[0.39344728 0.6065528 ]
Original Text: I ac

By printing out the predicted sentiment alongside the model's confidence for each of these reviews, we can conclude that the "Bag of Words" method might not be the best choice for this exercise, perhaps because it takes emphasis away from the words that actually matter for the sentiment of the review. An average person would immediately recognize "I hated this movie" and "This studio only does shitty crappy movies" as negative reviews, but the model, despite being a bit "on edge" (only about 53% sure in the case of "i hated this movie"), still classified them as positive.
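The label and the confidence can also be read off the softmax output in one step with `argmax`. The sketch below reuses three of the probability rows printed above; it assumes column 0 is "negative" and column 1 is "positive", which follows from how `Y` was built and one-hot encoded earlier.

```python
import numpy as np

# Probability rows copied from the prediction output above.
probs = np.array([
    [0.86603284, 0.13396718],  # "This movie was terrible! The acting was awful."
    [0.28333795, 0.71666205],  # "Amazing movie!"
    [0.46559173, 0.5344083],   # "i hated this movie"
])

# Assumed column order: index 0 = negative, index 1 = positive.
class_names = np.array(["negative", "positive"])

predicted_index = probs.argmax(axis=1)   # most probable class per row
confidence = probs.max(axis=1)           # probability of that class

for name, conf in zip(class_names[predicted_index], confidence):
    print(f"{name}: {conf:.1%}")
```

For the third sentence this prints a confidence of 53.4%, which shows how close to a coin flip that prediction really was.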

From testing with my own sentences, I also realized that the number of times a word appears in a review greatly influences the output of this model ("I hated this movie" is classified as positive, but "I hated hated hated hated this movie" is classified as negative).
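The reason is visible in the input vectors themselves. The sketch below rebuilds the counts for both sentences in plain Python; `bag_of_words` is a hypothetical helper that mimics the default `CountVectorizer` tokenization (lowercasing, tokens of at least two word characters), not the vectorizer used above.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count tokens roughly the way CountVectorizer does by default."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


once = bag_of_words("I hated this movie")
four_times = bag_of_words("I hated hated hated hated this movie")

print(once)        # 'hated', 'this' and 'movie' each counted once
print(four_times)  # 'hated' counted four times

# The two inputs differ only in the count of a single word.
difference = four_times - once
print(difference)
```

Because each hidden unit starts from a weighted sum of these counts, the weight attached to "hated" contributes four times as much in the second sentence, which is enough to push the prediction over to negative.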

One way to make this model more robust would be to remove neutral words from the bag of words (for example "movie", "this", "is"), that is, the words that are needed for a sentence to make sense but carry no sentiment. The remaining words would then carry far more meaning for the model, perhaps making it perform a lot better (the sentence "I hated this movie" would be reduced to just "hated" and be far more likely to be classified as negative).

I would also like to note that my hand-written reviews may have been hard for the model because they are drastically shorter than the reviews we see on IMDb, so this type of input might be a bit foreign to our model.

This exercise/assignment was very interesting! :)