## Sentiment Analysis

In this exercise we perform a sentiment analysis on the IMDb dataset. The code below assumes that the data files are placed in the same folder as this notebook. The reviews are loaded as a pandas DataFrame, and we print the beginning of the first few reviews.

In [49]:
import numpy as np
import pandas as pd

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


In [50]:
# Renaming for convenience
reviews = reviews.rename(columns={0: 'review'})
y = Y

In [51]:
reviews.head()

                                              review
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


In [52]:
labels.head()

          0
0  positive
1  negative
2  positive
3  negative
4  positive


In [53]:
# 1 -> positive, 0 -> negative
y.head()

   0
0  1
1  0
2  1
3  0
4  1


**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets will be used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [54]:
from sklearn.feature_extraction.text import CountVectorizer

vecotrizer = CountVectorizer(max_features=10_000)
X = vecotrizer.fit_transform(reviews["review"]).toarray()

In [55]:
from sklearn.model_selection import train_test_split

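# First split: hold out 20% of all reviews as the test set.
X_, X_test, y_, y_test = train_test_split(X, y, test_size=.2, random_state=42)
# Second split: 25% of the remaining 80% (20% of the total) becomes the validation set.
X_train, X_validation, y_train, y_validation = train_test_split(X_, y_, test_size=.25, random_state=42)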

In [56]:
print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_validation.shape}")
print(f"Test set shape: {X_test.shape}")

Training set shape: (15000, 10000)
Validation set shape: (5000, 10000)
Test set shape: (5000, 10000)
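
The shapes above follow from applying the two `test_size` fractions one after the other. A small arithmetic check, assuming 25,000 reviews in total (the sum of the three printed row counts):

```python
# Sizes implied by two consecutive splits with test_size=.2 and test_size=.25.
n_total = 25_000                        # 15000 + 5000 + 5000 from the printed shapes

n_test = round(n_total * 0.2)           # first split: 20% of everything
n_rest = n_total - n_test               # 80% left for training + validation
n_validation = round(n_rest * 0.25)     # second split: 25% of the remainder
n_train = n_rest - n_validation

print(n_train, n_validation, n_test)
print(n_train / n_total, n_validation / n_total, n_test / n_total)
```

So the second fraction of 0.25 corresponds to 20% of the full dataset, giving a 60/20/20 split.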


**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [57]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [59]:
# Each row of the matrix is one review and has 10_000 columns.
# Each column index maps to one word of the vocabulary, and the entry is the
# number of times that word occurs in the review (0 -> word is not present).
# [0,0,0] -> the words with index 0, 1 and 2 are not present
# [1,0,2] -> the word with index 0 (aaron) occurs once and the word with index 2 (abandoned) twice,
#            while the word with index 1 (abandon) is not present
vecotrizer.get_feature_names_out()

array(['aaron', 'abandon', 'abandoned', ..., 'zoom', 'zorro', 'zu'],
      dtype=object)
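
To make the representation concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. The two documents and the helper `to_count_vector` are invented for illustration; this is not how `CountVectorizer` is implemented internally, and it skips details such as its tokenization rules and the `max_features` cut-off.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
docs = ["the great great movie", "the dull movie"]

# Vocabulary: sorted unique tokens, so every word gets a fixed column index.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    """Return one row of the document-term matrix: one count per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

rows = [to_count_vector(doc) for doc in docs]
print(vocabulary)  # ['dull', 'great', 'movie', 'the']
print(rows)        # [[0, 2, 1, 1], [1, 0, 1, 1]]
```

A single word corresponds to one column, and a whole review is a vector of word counts in which the word order is lost.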

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 
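
One possible way to approach this, sketched on synthetic data rather than the IMDb matrix: scikit-learn's `MLPClassifier` with a single entry in `hidden_layer_sizes` gives a network with one hidden layer, and a small grid over the hidden layer size and the L2 penalty `alpha` can be compared on a validation set. The data, the grid values and the names ending in `_demo` are assumptions made for the example, not tuned choices for this exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a bag-of-words matrix (NOT the IMDb data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 20))
y_demo = (X_demo[:, 0] > X_demo[:, 1]).astype(int)

X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Try a few hyperparameter combinations and record the validation accuracy.
results = {}
for n_hidden in (4, 16):
    for alpha in (1e-4, 1e-2):
        clf = MLPClassifier(
            hidden_layer_sizes=(n_hidden,),  # one hidden layer with n_hidden units
            alpha=alpha,                     # L2 regularization strength
            max_iter=300,
            random_state=42,
        )
        clf.fit(X_tr_demo, y_tr_demo)
        results[(n_hidden, alpha)] = clf.score(X_val_demo, y_val_demo)

best = max(results, key=results.get)
print("best (n_hidden, alpha):", best, "validation accuracy:", results[best])
```

The test set is left untouched during this search; it would be scored once, with the selected configuration, for part (d).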

**(d)** Test your sentiment-classifier on the test set.

**(e)** Use the classifier to classify a few sentences you write yourselves. 
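
When classifying your own sentences, the text has to pass through the vectorizer that was already fitted, using `transform` rather than `fit_transform`, so that the columns match what the network was trained on. A minimal sketch on a tiny invented training set (the `toy_*` names and sentences are hypothetical, and a model trained on six sentences is not expected to be accurate):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Tiny invented training set, only to make the sketch runnable.
toy_reviews = [
    "a wonderful and moving film",
    "great acting and a great story",
    "wonderful story , great cast",
    "a terrible and boring film",
    "boring story and awful acting",
    "awful , terrible , boring",
]
toy_labels = [1, 1, 1, 0, 0, 0]  # 1 -> positive, 0 -> negative

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_reviews)  # learn vocabulary on training text

toy_clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=42)
toy_clf.fit(X_toy, toy_labels)

my_sentences = ["what a wonderful film", "this was terrible and boring"]
X_new = toy_vectorizer.transform(my_sentences)  # reuse the fitted vocabulary
predictions = toy_clf.predict(X_new)

for sentence, label in zip(my_sentences, predictions):
    print(f"{label} <- {sentence}")
```

Words that are not in the fitted vocabulary (here "what", "this", "was") are silently ignored by `transform`.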