# Sentiment Classification of Movie Reviews

This notebook builds a sentiment classifier to predict positive or negative sentiment from Rotten Tomatoes movie reviews using the Sentence Polarity dataset. The workflow covers data loading, text preprocessing, TF-IDF feature extraction, model training with hyperparameter tuning, and evaluation with K-fold cross-validation.

## 1 - Reading in & Preprocessing the Data

### 1.1 Importing Libraries

In [3]:
# Importing necessary libraries
import numpy as np  # for numerical operations
import pandas as pd  # for data manipulation
import random  # for shuffling the data
import nltk
import re  # for handling regular expressions

from nltk.stem import WordNetLemmatizer  # for lemmatizing words
from nltk.corpus import stopwords  # for stop word removal
from nltk.tokenize import word_tokenize  # for tokenizing sentences into words
nltk.download('punkt_tab')  # Downloads the 'punkt' tokenizer table used for tokenization of text into sentences or words

# Downloading necessary NLTK resources
nltk.download('stopwords')  # List of common stop words in English
nltk.download('punkt')  # Pre-trained tokenizer models
nltk.download('wordnet')  # WordNet lemmatizer dataset

# Libraries for text feature extraction and model training
from sklearn.feature_extraction.text import TfidfVectorizer  # Convert text into numerical features (TF-IDF)
from sklearn.linear_model import LogisticRegression  # Logistic regression for classification
from sklearn.svm import LinearSVC  # Support Vector Machines for classification

# Libraries for model evaluation
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix  # For model evaluation metrics
from sklearn.model_selection import KFold, cross_val_score  # For cross-validation

[nltk_data] Downloading package punkt_tab to /Users/annie/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/annie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/annie/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/annie/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 1.2 Read Data Files

The `Sentence Polarity` dataset contains 5,331 positive and 5,331 negative sentences. We'll load this dataset and prepare it for analysis.

In [4]:
# Read the positive and negative sentiment files
df_sent_pos = pd.read_csv('../data/rt-polarity.pos', sep='\t', header=None)  # Positive sentiment sentences
df_sent_neg = pd.read_csv('../data/rt-polarity.neg', sep='\t', header=None)  # Negative sentiment sentences

# Alternatively, use GitHub raw links to read the datasets directly into pandas DataFrames
# df_sent_pos = pd.read_csv('https://raw.githubusercontent.com/chrisvdweth/nus-cs4248x/refs/heads/master/1-foundations/data/corpora/sentence-polarity-dataset/sentence-polarity.pos', sep='\t', header=None)  # Positive sentiment sentences
# df_sent_neg = pd.read_csv('https://raw.githubusercontent.com/chrisvdweth/nus-cs4248x/refs/heads/master/1-foundations/data/corpora/sentence-polarity-dataset/sentence-polarity.neg', sep='\t', header=None)  # Negative sentiment sentences

# Display the first few rows of the positive dataset to understand its structure
print(df_sent_pos.head())

                                                   0
0  the rock is destined to be the 21st century's ...
1  the gorgeously elaborate continuation of " the...
2                     effective but too-tepid biopic
3  if you sometimes like to go to the movies to h...
4  emerges as something rare , an issue movie tha...


### 1.3 Rename Columns

In [5]:
# Rename the column to 'sentence' for clarity
df_sent_pos.rename(columns={0: "sentence"}, inplace=True)
df_sent_neg.rename(columns={0: "sentence"}, inplace=True)

### 1.4 Data Preprocessing

The sentences are preprocessed with a function called `preprocess_text`, which performs the following steps:

1. Converts text to lowercase.
2. Removes punctuation using regular expressions.
3. Removes extra whitespace.
4. Tokenizes sentences into words.
5. Removes stop words.
6. Lemmatizes words.

In [6]:
# Define the preprocessing function
def preprocess_text(sentences):
    # Convert each sentence to lowercase
    sentences = [sentence.lower() for sentence in sentences]

    # Remove punctuation using regex
    sentences = [re.sub(r"[^\w\s]", "", sentence) for sentence in sentences]

    # Remove extra whitespace between words
    sentences = [" ".join(sentence.split()) for sentence in sentences]

    # Tokenize sentences into words
    sentences = [word_tokenize(sentence) for sentence in sentences]

    # Remove stop words
    stop_words = set(stopwords.words('english'))  # Load English stop words
    filtered_sentences = []
    for sentence in sentences:
        filtered_sentence = [word for word in sentence if word not in stop_words]
        filtered_sentences.append(filtered_sentence)

    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentences = []
    for sentence in filtered_sentences:
        lemmatized_sentence = [lemmatizer.lemmatize(word) for word in sentence]
        lemmatized_sentences.append(lemmatized_sentence)

    return [' '.join(sentence) for sentence in lemmatized_sentences]
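
To make the effect of the first three steps concrete, the sketch below applies only the lowercase, punctuation, and whitespace steps to two fragments from the dataset preview. It is a standalone illustration: the helper name `normalize` is hypothetical and not part of the notebook, and tokenization, stop word removal, and lemmatization are left out so that it runs without NLTK resources.

```python
import re

def normalize(sentence):
    """Steps 1-3 only: lowercase, strip punctuation, collapse whitespace."""
    sentence = sentence.lower()
    # Delete every character that is neither a word character nor whitespace
    sentence = re.sub(r"[^\w\s]", "", sentence)
    # Collapse runs of whitespace left behind by the deleted punctuation
    return " ".join(sentence.split())

# Fragments taken from the preview of the positive file (illustration only)
examples = [
    "effective but too-tepid biopic",
    "emerges as something rare , an issue movie",
]
normalized = [normalize(s) for s in examples]
for before, after in zip(examples, normalized):
    print(f"{before!r} -> {after!r}")
```

Because punctuation is deleted rather than replaced by a space, hyphenated words are fused: `too-tepid` becomes `tootepid`. Whether that is desirable depends on the task; replacing punctuation with a space would instead yield two separate tokens.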

### 1.5 Apply Preprocessing

The preprocessing function is applied to both the positive and the negative sentences.

In [7]:
# Preprocess the sentences
pos_preprocessed_sentences = preprocess_text(df_sent_pos['sentence'])
neg_preprocessed_sentences = preprocess_text(df_sent_neg['sentence'])

# Print the first preprocessed negative sentence
print(neg_preprocessed_sentences[0])

simplistic silly tedious


### 1.6 Combine Dataset

The negative and positive sentences are merged into a single list called `sentences`.

In [8]:
# Combine preprocessed positive and negative sentences
sentences = pos_preprocessed_sentences + neg_preprocessed_sentences

### 1.7 Create Labels

Labels (also called targets) distinguish negative sentences (labeled `0`) from positive sentences (labeled `1`).

In [9]:
# Create a list for all labels, in the same order as `sentences` (positives first, then negatives)
polarities = []
polarities.extend([1] * len(df_sent_pos))  # Label positive sentences as 1
polarities.extend([0] * len(df_sent_neg))  # Label negative sentences as 0

### 1.8 Shuffle Data

Randomly shuffle the dataset so that the sentences are in random order, keeping each sentence paired with its label.

In [10]:
# Combine sentences and labels into a single list
combined = list(zip(sentences, polarities))

# Shuffle the combined list
random.shuffle(combined)

# Split the shuffled list back into sentences and labels
sentences[:], polarities[:] = zip(*combined)
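
`random.shuffle` produces a different order on every run, so the split and all scores below change between runs. If reproducibility matters, one option is a seeded generator. The sketch below uses toy data and an arbitrary seed of 42 (both are assumptions for illustration) and shows that each sentence keeps its label after shuffling.

```python
import random

# Toy data (hypothetical, for illustration only)
toy_sentences = ["good film", "bad film", "great cast", "dull plot"]
toy_labels = [1, 0, 1, 0]

# A dedicated, seeded generator makes the shuffle repeatable
rng = random.Random(42)
pairs = list(zip(toy_sentences, toy_labels))
rng.shuffle(pairs)

# Unzip back into two aligned lists
shuffled_sentences, shuffled_labels = map(list, zip(*pairs))

# Shuffling the same pairs with the same seed gives the same order
rng_again = random.Random(42)
pairs_again = list(zip(toy_sentences, toy_labels))
rng_again.shuffle(pairs_again)
print(pairs == pairs_again)
```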

### 1.9 Split Dataset

The dataset is split into 80% for training and 20% for testing.

In [11]:
# Define train-test split ratio
train_test_ratio = 0.8

# Calculate the size of the training set
train_set_size = int(train_test_ratio * len(sentences))

# Split data into training and test sets
X_train, X_test = sentences[:train_set_size], sentences[train_set_size:]
y_train, y_test = polarities[:train_set_size], polarities[train_set_size:]

# Print sizes of training and test sets
print("Size of training set:", len(X_train))
print("Size of test set:", len(X_test))

Size of training set: 8529
Size of test set: 2133
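
Slicing a shuffled list gives a test set that is roughly, but not exactly, balanced between the two classes. As an alternative, scikit-learn's `train_test_split` can shuffle, split, and stratify by label in one call. The sketch below runs on toy data; `random_state=42` is an arbitrary choice, and this is not the split used elsewhere in the notebook.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 negative and 10 positive "sentences"
toy_sentences = [f"review {i}" for i in range(20)]
toy_labels = [0] * 10 + [1] * 10

# stratify keeps the class proportions identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_sentences,
    toy_labels,
    test_size=0.2,
    stratify=toy_labels,
    random_state=42,
)

print("train/test sizes:", len(X_tr), len(X_te))
print("positives in test set:", sum(y_te))
```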


## 2 - Vectorizing Texts, Training Models & Evaluating Their Performance

### 2.1 Transforming Text into Features

To use text data in machine learning models, we need to convert it into a numerical format that algorithms can process. The TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer transforms our preprocessed text into numerical features by measuring how important each term is within a sentence while reducing the weight of commonly occurring words.
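
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand for a three-sentence toy corpus. It assumes the smoothed IDF variant that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The corpus and the helper name `tfidf_rows` are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
corpus = ["good movie", "bad movie", "good good plot"]

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, plus the IDF table."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: terms that occur in fewer documents get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})  # L2-normalize
    return rows, idf

rows, idf = tfidf_rows(corpus)
print(idf)
print(rows[1])
```

In the second sentence, "bad" receives a higher weight than "movie" because "bad" appears in only one document while "movie" appears in two.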

In [12]:
# Import TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the vectorizer with default parameters
tfidf_vectorizer = TfidfVectorizer()

# Transform the training data into a TF-IDF matrix
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Check the number of samples and features
num_samples, num_features = X_train_tfidf.shape
print("#Samples: {}, #Features: {}".format(num_samples, num_features))

#Samples: 8529, #Features: 16513


In [13]:
# Diagnostic check - compare vocabulary size
all_words = ' '.join(X_train).split()
unique_words = set(all_words)
print(f"Unique words in training set: {len(unique_words)}")
print(f"TF-IDF features: {X_train_tfidf.shape[1]}")
print(f"Number of stopwords: {len(stopwords.words('english'))}")

Unique words in training set: 16543
TF-IDF features: 16513
Number of stopwords: 198
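
One likely reason for the gap of 30 between the two counts is that `TfidfVectorizer`'s default `token_pattern`, `(?u)\b\w\w+\b`, keeps only tokens of two or more word characters, whereas `str.split` also keeps single-character tokens. This has not been checked against the notebook's data; the sketch below only shows the difference on a made-up sentence.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Made-up text containing several single-character tokens
text = "plan 9 b movie u 2 hour"

split_tokens = set(text.split())
pattern_tokens = set(re.findall(DEFAULT_TOKEN_PATTERN, text))

# Tokens counted by str.split but ignored by the vectorizer's pattern
dropped = sorted(split_tokens - pattern_tokens)
print(dropped)
```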


### 2.2 Train the Classifier

Logistic Regression is a simple yet effective algorithm for binary classification tasks, such as predicting sentiment polarity.

In [14]:
# Import the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Train the Logistic Regression classifier
logistic_regression_classifier = LogisticRegression().fit(X_train_tfidf, y_train)

### 2.3 Evaluating the Classifier

After training the model, we need to assess its performance on unseen test data. Evaluation involves transforming the test data into the same TF-IDF format as the training data, making predictions, and calculating key metrics.

In [15]:
# Transform the test data into TF-IDF format
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Predict polarities for the test data
y_pred = logistic_regression_classifier.predict(X_test_tfidf)

# Import evaluation metrics
from sklearn.metrics import classification_report, accuracy_score

# Generate and display the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.75      0.75      1052
           1       0.75      0.75      0.75      1081

    accuracy                           0.75      2133
   macro avg       0.75      0.75      0.75      2133
weighted avg       0.75      0.75      0.75      2133



The Logistic Regression model reaches an F1-score of 0.75 on the test set, with precision and recall between 0.74 and 0.75 for both classes. That is well above the roughly 50% expected from random guessing on this balanced dataset, but it leaves room for improvement. The next sections use cross-validation to get a more reliable estimate of performance, and hyperparameter tuning to try to improve it.
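
For reference, the per-class numbers in the report are derived from confusion-matrix counts as sketched below. The counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from the notebook's run.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)  # of everything predicted as this class, how much was right
    recall = tp / (tp + fn)     # of everything truly in this class, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for the positive class
tp, fp, fn = 80, 20, 40
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a plain average would.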

## 3 - Training Using K-Fold Cross Validation

A model's performance on a single train-test split can be misleading: it may score well on one particular split by chance and worse on others. K-fold cross-validation gives a more robust estimate. The training data is divided into K folds; the model is trained on K-1 of them and scored on the remaining one, and this is repeated so that every fold serves once as the validation set. The mean and spread of the K scores indicate how well, and how consistently, the model generalizes.

In [16]:
# Import necessary library
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Perform 10-fold cross-validation on the training data
f1_scores_list = cross_val_score(
    LogisticRegression(),            # Model: Logistic Regression
    X_train_tfidf,                   # Features: TF-IDF transformed training data
    y_train,                         # Labels: Training labels
    cv=10,                           # Number of folds
    scoring="f1"                     # Evaluation metric: F1 score
)

# Display the F1 scores for each fold
print(f"F1 Scores for each fold: {f1_scores_list}")

# Calculate and display the mean and standard deviation of the F1 scores
print("F1 Score (Mean/Average): {:.3f}".format(f1_scores_list.mean()))
print("F1 Score (Standard Deviation): {:.3f}".format(f1_scores_list.std()))

F1 Scores for each fold: [0.74735605 0.73923445 0.76245655 0.78837209 0.74390244 0.77130045
 0.74056604 0.76190476 0.75116279 0.74730539]
F1 Score (Mean/Average): 0.755
F1 Score (Standard Deviation): 0.015


- **F1 Scores Across Folds:** The individual fold scores range from 0.739 to 0.788.
- **Mean F1 Score:** The mean F1 score of 0.755 is in line with the 0.75 measured on the single test split, which suggests that the earlier result was not an artifact of one particular split.
- **Standard Deviation:** The standard deviation of 0.015 is small relative to the mean, indicating that performance is consistent across subsets of the data.
- **Implications:** Had the fold scores varied widely, that would point to problems such as poorly shuffled data or a dataset too small for stable estimates.
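
One caveat about the cell above: the TF-IDF vectorizer was fitted on the whole training set before cross-validation, so each validation fold has already influenced the vocabulary and IDF weights. A common way to avoid this is to wrap the vectorizer and classifier in a pipeline, so that the vectorizer is refitted on the training portion of every fold. The sketch below shows the pattern on a small made-up corpus. On the notebook's data one would pass `X_train` (the preprocessed strings) and `y_train` instead; how much this changes the scores is not something the sketch measures.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical), repeated so every fold contains both classes
toy_texts = [
    "great fun film", "moving lovely story", "charming great cast", "lovely fun ride",
    "dull tedious film", "boring silly story", "tedious weak cast", "silly dull ride",
] * 3
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

# Vectorizer and classifier are fitted together inside each fold
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Pass raw strings: the pipeline handles the TF-IDF step itself
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=4, scoring="f1")
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```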

## 4 - Perform Hyperparameter Tuning

Having validated our model's consistency through K-Fold Cross-Validation, the next step is to push its performance even further through hyperparameter tuning. While our Logistic Regression model achieved a solid mean F1 score of 0.755, the right combination of hyperparameters, such as the n-gram size and choice of classifier, can meaningfully improve accuracy and generalization. In this section, we'll systematically explore different parameter configurations to identify the optimal setup for our sentiment analysis task.

### 4.1 Setting Up the Experiment

To find the optimal model configuration, we need to systematically test every possible combination of our chosen hyperparameters:

- the classifier type (LinearSVC vs. Logistic Regression) and
- the maximum n-gram size (1 to 4), passed to the vectorizer as `ngram_range=(1, n)`

and evaluate each using 10-fold cross-validation. The combination that produces the highest mean F1-score will be selected as our best configuration.
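
To make the n-gram setting concrete: `ngram_range=(1, n)` keeps every word sequence of length 1 up to n, so larger values add features on top of the unigrams rather than replacing them. The sketch below uses the vectorizer's own analyzer on a made-up sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn text into features
analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

# Made-up sentence (illustration only)
ngrams = analyzer("effective tepid biopic")
print(sorted(ngrams))
```

With `ngram_range=(1, 2)`, the three unigrams are joined by the two bigrams "effective tepid" and "tepid biopic", which is why the feature count grows quickly as n increases.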

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Initialize placeholders to store the best configuration
best_score = -1.0
best_classifier = None
best_ngram_size = -1

# Define the hyperparameters to test
classifiers = [LinearSVC(), LogisticRegression(solver="sag")]
ngram_sizes = [1, 2, 3, 4]

# Loop through all combinations of classifiers and n-gram sizes
for classifier in classifiers:
    for n in ngram_sizes:
        # Define the vectorizer with the current n-gram size
        vectorizer = TfidfVectorizer(ngram_range=(1, n))
        X_train_tfidf = vectorizer.fit_transform(X_train)  # Transform training data

        # Perform 10-fold cross-validation
        f1_scores = cross_val_score(classifier, X_train_tfidf, y_train, cv=10, scoring='f1')
        avg_f1_score = f1_scores.mean()  # Calculate average F1-score

        # Print the result for this combination
        print(f"Classifier: {type(classifier).__name__}, n-gram size: {n} => F1-score: {avg_f1_score:.3f}")

        # Save the best configuration
        if avg_f1_score > best_score:
            best_score = avg_f1_score
            best_classifier = classifier
            best_ngram_size = n

# Print the best configuration
print("\nBest Configuration:")
print(f"Classifier: {type(best_classifier).__name__}, Max n-gram size: {best_ngram_size}, F1-score: {best_score:.3f}")

Classifier: LinearSVC, n-gram size: 1 => F1-score: 0.754
Classifier: LinearSVC, n-gram size: 2 => F1-score: 0.766
Classifier: LinearSVC, n-gram size: 3 => F1-score: 0.763
Classifier: LinearSVC, n-gram size: 4 => F1-score: 0.763
Classifier: LogisticRegression, n-gram size: 1 => F1-score: 0.756
Classifier: LogisticRegression, n-gram size: 2 => F1-score: 0.754
Classifier: LogisticRegression, n-gram size: 3 => F1-score: 0.748
Classifier: LogisticRegression, n-gram size: 4 => F1-score: 0.745

Best Configuration:
Classifier: LinearSVC, Max n-gram size: 2, F1-score: 0.766


Across all eight combinations, LinearSVC with unigrams and bigrams (`ngram_range=(1, 2)`) had the highest mean F1-score, 0.766. The margin is small: the scores span only 0.745 to 0.766, which is comparable to the fold-to-fold standard deviation of 0.015 seen in Section 3, so the ranking should be read as indicative rather than decisive.
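
The same kind of search can also be expressed with scikit-learn's `GridSearchCV` over a `Pipeline`, which refits the vectorizer inside each fold and handles the bookkeeping of the best configuration. This is an alternative sketch, not the code that produced the numbers above: it runs on a made-up corpus with a reduced grid, and the step names `tfidf` and `clf` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical), repeated so every fold contains both classes
toy_texts = [
    "great fun film", "moving lovely story", "charming great cast", "lovely fun ride",
    "dull tedious film", "boring silly story", "tedious weak cast", "silly dull ride",
] * 3
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])

# Each key addresses a pipeline step; "clf" swaps the whole classifier
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf": [LinearSVC(), LogisticRegression()],
}

search = GridSearchCV(pipeline, param_grid, cv=4, scoring="f1")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(f"best mean F1: {search.best_score_:.3f}")
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `search.predict` can be used directly afterwards.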

### 4.2 Training the Best Model

After identifying the best combination of parameters, we can train the final model on the entire training dataset and evaluate it using the unseen test data.

In [18]:
from sklearn.metrics import classification_report, accuracy_score

# Use the best configuration to train the final model
final_vectorizer = TfidfVectorizer(ngram_range=(1, best_ngram_size))
X_train_tfidf = final_vectorizer.fit_transform(X_train)
X_test_tfidf = final_vectorizer.transform(X_test)

best_classifier.fit(X_train_tfidf, y_train)
y_pred = best_classifier.predict(X_test_tfidf)

# Evaluate and display results
print("\nFinal Model Results:")
print(classification_report(y_test, y_pred))
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")


Final Model Results:
              precision    recall  f1-score   support

           0       0.75      0.76      0.75      1052
           1       0.76      0.75      0.76      1081

    accuracy                           0.75      2133
   macro avg       0.75      0.75      0.75      2133
weighted avg       0.76      0.75      0.75      2133

Accuracy: 0.755


Training the final model using the optimal configuration:

- classifier type: LinearSVC
- n-gram range of (1, 2)

yielded an accuracy of 75.5% and F1-scores of 0.75–0.76 across the two sentiment classes. This is at most a marginal improvement over the initial Logistic Regression baseline (F1 of 0.75), suggesting that for this dataset and preprocessing pipeline, the choice of classifier and n-gram range has limited impact. Precision and recall are nearly identical for both classes, so the model shows no obvious bias toward either positive or negative sentiment.