1. Research Question: 

Can we accurately predict whether a movie review from Rotten Tomatoes expresses a positive or negative sentiment using machine learning techniques, and how does a deep learning model’s performance compare to traditional feature-based methods?


2. Data Source:

The dataset, titled Rotten Tomatoes Movies Reviews (available on Kaggle), contains short text reviews labeled as either positive or negative. Each observation includes the review text and a corresponding binary sentiment label (0 = negative, 1 = positive). The dataset comprises 10,662 reviews, evenly split between the two classes, and is suitable for supervised text classification.


3. Planned Methods:

Our approach follows the spam classification challenge from homework #3, with an emphasis on feature engineering, pipeline encapsulation, and model selection. In the traditional machine learning portion, we will transform the raw review text into numeric features using scikit-learn's TfidfVectorizer(). We will wrap this vectorization step and the classifier in a scikit-learn Pipeline so that the vectorizer is fit only on the training portion of each cross-validation fold, which keeps vocabulary and document-frequency statistics from leaking out of the validation fold. We will tune hyperparameters using GridSearchCV() with 5-fold cross-validation to find the best-performing traditional model. We plan to compare logistic regression, support vector machines, and XGBoost, tuning hyperparameters such as regularization strength, kernel type, and tree depth. We will evaluate the models using F1-score and accuracy.
For the deep learning component, we will fine-tune a pre-trained transformer model, potentially DistilBERT from Hugging Face, for binary sentiment classification. Such a model can capture contextual relationships in language that TF-IDF features, which ignore word order beyond short n-grams, cannot represent. By comparing model performance and interpretability, we want to see whether deep learning yields a considerable improvement over the classical pipeline-based method.
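
As a minimal sketch of the pipeline idea described above, the following uses a tiny made-up corpus to show how placing TfidfVectorizer inside a Pipeline lets cross-validation refit the vectorizer on each training fold. The names `toy_reviews`, `toy_labels`, and `toy_pipe` and the example sentences are illustrative assumptions, not part of the dataset or the project code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only (not drawn from the dataset)
toy_reviews = [
    "a moving and beautifully acted film",
    "funny , warm and clever",
    "a delightful surprise from start to finish",
    "smart writing and a great cast",
    "one of the best films of the year",
    "dull , slow and lifeless",
    "a tedious mess with no charm",
    "the plot is silly and the acting is worse",
    "boring and far too long",
    "a waste of a talented cast",
]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline for every fold,
# so the vectorizer only ever sees that fold's training text.
scores = cross_val_score(toy_pipe, toy_reviews, toy_labels, cv=5, scoring="accuracy")
print(scores)
```

Fitting the vectorizer once on all reviews before cross-validation would instead let validation-fold text influence the features, which tends to make the cross-validated scores optimistic.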


4. References: 

https://www.kaggle.com/datasets/thedevastator/movie-review-data-set-from-rotten-tomatoes



In [4]:
import pandas as pd

df = pd.read_csv("data_rt.csv")

print("First 5 rows:")
print(df.head())

print("\nColumn names:")
print(df.columns)


First 5 rows:
                                             reviews  labels
0                  simplistic , silly and tedious .        0
1  it's so laddish and juvenile , only teenage bo...       0
2  exploitative and largely devoid of the depth o...       0
3  [garbus] discards the potential for pathologic...       0
4  a visually flashy but narratively opaque and e...       0

Column names:
Index(['reviews', 'labels'], dtype='object')


In [5]:
print(df.isna().sum())
print(df['reviews'].str.len().describe())
print(df['labels'].value_counts(normalize=True))


reviews    0
labels     0
dtype: int64
count    10662.000000
mean       115.156256
std         51.199546
min          5.000000
25%         77.000000
50%        112.000000
75%        150.000000
max        269.000000
Name: reviews, dtype: float64
labels
0    0.5
1    0.5
Name: proportion, dtype: float64


In [6]:
from sklearn.model_selection import train_test_split

X = df['reviews']
y = df['labels']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42
)

import sys
print(sys.executable)



c:\Users\kimin\AppData\Local\Python\pythoncore-3.14-64\python.exe
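
The split above does not stratify on the label, which is why the test set in the classification reports below ends up with 1062 negative and 1071 positive reviews rather than an exact 50/50 split. One possible variation, shown here as a sketch on made-up data rather than as what the notebook runs, is to pass `stratify` to `train_test_split`. The names `toy_X`, `toy_y`, and `test_counts` are hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 50 negative and 50 positive labels
toy_X = [f"review {i}" for i in range(100)]
toy_y = [0] * 50 + [1] * 50

# stratify=toy_y keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

test_counts = Counter(y_te)
print(test_counts)
```

With a dataset this balanced, the difference is small, so stratifying mainly makes the split easier to reason about.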


In [7]:
import sys, xgboost
print(sys.executable)
print("xgboost OK, version:", xgboost.__version__)


c:\Users\kimin\AppData\Local\Python\pythoncore-3.14-64\python.exe
xgboost OK, version: 3.1.2


In [8]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

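# The 'clf' step here is only a placeholder: each grid below swaps in
# its own estimator through the 'clf' key.
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000))
])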

params_grid = [
    # Logistic Regression grid
    {
        'tfidf__ngram_range': [(1,1), (1,2)],
        'tfidf__min_df': [1, 2],
        'tfidf__max_df': [0.9, 1.0],
        'clf': [LogisticRegression(max_iter=1000)],
        'clf__C': [0.1, 1, 10],
        'clf__class_weight': [None, 'balanced']
    },

    # SVM grid
    {
        'tfidf__ngram_range': [(1,1), (1,2)],
        'tfidf__min_df': [1, 2],
        'tfidf__max_df': [0.9, 1.0],
        'clf': [LinearSVC()],
        'clf__C': [0.1, 1, 10]
        # class_weight is not tuned here: LinearSVC supports it, but the classes are balanced
    },

    # XGBoost grid
    {
        'tfidf__ngram_range': [(1,1), (1,2)],
        'tfidf__min_df': [1, 2],
        'tfidf__max_df': [0.9, 1.0],
        'clf': [XGBClassifier(eval_metric='logloss')],
        'clf__max_depth': [3, 5],
        'clf__gamma': [0, 0.25]
    },
]


gs = GridSearchCV(
    pipe,
    params_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=2
)

gs.fit(X_train, y_train)


from sklearn.metrics import accuracy_score, f1_score, classification_report

best = gs.best_estimator_

y_test_pred = best.predict(X_test)

print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("Test F1:", f1_score(y_test, y_test_pred))
print("\nClassification report:\n", classification_report(y_test, y_test_pred))



Fitting 5 folds for each of 104 candidates, totalling 520 fits
Test Accuracy: 0.7744960150023441
Test F1: 0.7757575757575758

Classification report:
               precision    recall  f1-score   support

           0       0.77      0.77      0.77      1062
           1       0.77      0.78      0.78      1071

    accuracy                           0.77      2133
   macro avg       0.77      0.77      0.77      2133
weighted avg       0.77      0.77      0.77      2133
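
The cell above reports test metrics but never shows which of the three model families won the search. A hedged sketch of how that could be inspected is given below on a small made-up corpus with a reduced grid. The names `demo_texts`, `demo_labels`, `demo_gs`, and `best_by_model` are assumptions for illustration; on the real search, the same attributes (`best_params_`, `best_score_`, `cv_results_`) would be read from `gs`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, for illustration only
demo_texts = [
    "a moving and beautifully acted film", "funny , warm and clever",
    "a delightful surprise", "smart writing and a great cast",
    "one of the best films of the year", "charming and full of heart",
    "dull , slow and lifeless", "a tedious mess with no charm",
    "silly plot and worse acting", "boring and far too long",
    "a waste of a talented cast", "flat , loud and forgettable",
]
demo_labels = [1] * 6 + [0] * 6

demo_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
demo_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1]},
    {"clf": [LinearSVC()], "clf__C": [0.1, 1]},
]

demo_gs = GridSearchCV(demo_pipe, demo_grid, scoring="accuracy", cv=3)
demo_gs.fit(demo_texts, demo_labels)

# Class name of the estimator chosen for the 'clf' step
winner = type(demo_gs.best_params_["clf"]).__name__
print("Winning classifier:", winner)
print("Best mean CV score:", demo_gs.best_score_)

# Best mean cross-validated score reached by each model family
best_by_model = {}
for params, score in zip(demo_gs.cv_results_["params"],
                         demo_gs.cv_results_["mean_test_score"]):
    name = type(params["clf"]).__name__
    best_by_model[name] = max(score, best_by_model.get(name, 0.0))
print(best_by_model)
```

Comparing the per-family best scores, rather than only the overall winner, shows how close the runner-up models were.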



In [12]:
from sklearn.metrics import accuracy_score, f1_score, classification_report

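# gs.best_estimator_ is the refit winner of the whole grid,
# which is not necessarily logistic regression.
best_model = gs.best_estimator_

y_test_pred = best_model.predict(X_test)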

print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("Test F1:", f1_score(y_test, y_test_pred))
print("\nClassification report:\n", classification_report(y_test, y_test_pred))


Test Accuracy: 0.7740271917487107
Test F1: 0.7760223048327137

Classification report:
               precision    recall  f1-score   support

           0       0.78      0.77      0.77      1062
           1       0.77      0.78      0.78      1071

    accuracy                           0.77      2133
   macro avg       0.77      0.77      0.77      2133
weighted avg       0.77      0.77      0.77      2133
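
For reference, the accuracy and F1 values printed above are both derived from the confusion-matrix counts. The sketch below works through the arithmetic in plain Python with made-up counts (`tp`, `fp`, `fn`, `tn` are hypothetical and are not the counts behind the reported results).

```python
# Hypothetical confusion-matrix counts, chosen only to make the arithmetic easy
tp, fp, fn, tn = 80, 20, 20, 80

precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(precision, recall, f1, accuracy)
```

Because the classes here are balanced and the errors are roughly symmetric, accuracy and F1 land close to each other, which matches the pattern in the reports above.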

