## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files are placed in the same folder as this notebook. We load the reviews as a pandas DataFrame and print the beginning of the first few reviews.

In [13]:
import numpy as np
import pandas as pd

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


**(a)** Split the reviews and labels into training, validation, and test sets. The training and validation sets will be used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# `reviews` is a one-column DataFrame; extract that column as a list of strings
review_texts = reviews.iloc[:, 0].tolist()

# Use the binary labels `Y` computed above (1 = positive, 0 = negative) as a flat array
y = Y.iloc[:, 0].to_numpy()

# Split data into train, validation, and test sets (60% train, 20% validation, 20% test)
X_train, X_temp, y_train, y_temp = train_test_split(review_texts, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Initialize CountVectorizer with max_features set to 10,000
vectorizer = CountVectorizer(max_features=10000)

# Fit the vectorizer to the training data and transform the reviews into Bag-of-Words representations
X_train_bow = vectorizer.fit_transform(X_train)
X_val_bow = vectorizer.transform(X_val)
X_test_bow = vectorizer.transform(X_test)

# Output the shape of the transformed datasets to confirm the size
print(f"Training set shape: {X_train_bow.shape}")
print(f"Validation set shape: {X_val_bow.shape}")
print(f"Test set shape: {X_test_bow.shape}")


Training set shape: (15000, 10000)
Validation set shape: (5000, 10000)
Test set shape: (5000, 10000)


**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumed: reconstructed to match the output shown below).
# Single-character tokens such as "I" are dropped by the default tokenizer.
toy_reviews = ["This movie is great", "I loved the movie", "This movie was bad"]

# Fit a separate vectorizer on the toy corpus and build its count matrix
toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(toy_reviews)

# Each column corresponds to one vocabulary word, each row to one review
print("Vocabulary:", toy_vectorizer.get_feature_names_out())
print("Bag-of-Words representation of the reviews:")
print(toy_bow.toarray())


Vocabulary: ['bad' 'great' 'is' 'loved' 'movie' 'the' 'this' 'was']
Bag-of-Words representation of the reviews:
[[0 1 1 0 1 0 1 0]
 [0 0 0 1 1 1 0 0]
 [1 0 0 0 1 0 1 1]]


**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 
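
One possible way to approach (c) is sketched below. It assumes scikit-learn's `MLPClassifier` as the network (the exercise does not prescribe a library) and uses a tiny hypothetical corpus so the block runs on its own. The candidate values for the hidden layer size and for `alpha` are illustrative assumptions, not recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in corpus; in the notebook, use X_train_bow / X_val_bow from part (a)
positive = ["great movie", "loved this film", "wonderful acting", "brilliant and fun"]
negative = ["bad movie", "hated this film", "terrible acting", "boring and dull"]
texts = (positive + negative) * 10
targets = ([1] * len(positive) + [0] * len(negative)) * 10

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, targets, test_size=0.25, random_state=42, stratify=targets
)
vec = CountVectorizer(max_features=10000)
X_tr_bow = vec.fit_transform(X_tr)  # fit the vocabulary on training data only
X_va_bow = vec.transform(X_va)

# Try each combination of hidden layer size and L2 penalty,
# scoring on the validation set (the test set is not touched here)
results = {}
for hidden_units in (4, 16):
    for alpha in (1e-4, 1e-2):
        clf = MLPClassifier(
            hidden_layer_sizes=(hidden_units,),  # a single hidden layer
            alpha=alpha,
            max_iter=500,
            random_state=42,
        )
        clf.fit(X_tr_bow, y_tr)
        results[(hidden_units, alpha)] = clf.score(X_va_bow, y_va)

best_params = max(results, key=results.get)
print("Validation accuracy per setting:", results)
print("Best (hidden_units, alpha):", best_params)
```

Selecting hyperparameters on the validation set keeps the test set untouched, so the accuracy reported in (d) remains an honest estimate.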

**(d)** Test your sentiment classifier on the test set.

**(e)** Use the classifier to classify a few sentences you write yourselves. 
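
Parts (d) and (e) could be handled along the following lines. This is a sketch under stated assumptions: the training and test sentences are hypothetical stand-ins for the IMDb splits, the hidden layer size is arbitrary, and the vectorizer and network are wrapped in a scikit-learn pipeline so that raw strings can be passed in directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data; in the notebook, use the splits from part (a)
train_texts = [
    "great movie", "loved this film", "wonderful acting", "brilliant and fun",
    "bad movie", "hated this film", "terrible acting", "boring and dull",
] * 5
train_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5
test_texts = ["great acting", "boring film"]
test_labels = [1, 0]

# The pipeline vectorizes raw text before passing it to the network
model = make_pipeline(
    CountVectorizer(max_features=10000),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42),
)
model.fit(train_texts, train_labels)

# (d) Accuracy on the held-out test set
test_accuracy = model.score(test_texts, test_labels)
print(f"Test accuracy: {test_accuracy:.2f}")

# (e) Classify sentences written by hand (1 = positive, 0 = negative)
my_sentences = ["what a wonderful film", "this was a terrible waste of time"]
predictions = model.predict(my_sentences)
for sentence, label in zip(my_sentences, predictions):
    print(f"{sentence!r} -> {'positive' if label == 1 else 'negative'}")
```

Words that were not seen during training are ignored by the vectorizer, so hand-written sentences made up mostly of unseen words carry little signal for the classifier.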