# 1- Dataset

- Load the cleaned IMDb sample dataset

In [1]:
import pandas as pd

In [2]:
data_path = "Data/imdb_cleaned_sample.csv"

df = pd.read_csv(data_path)

In [3]:
df

Unnamed: 0,id,rating,txt,label,cleaned_review
0,5903,8,I liked this movie sort of reminded me of my m...,1,liked movie sort reminded marriage clean see f...
1,194,8,Perhaps the funniest 'backstage at Hollywood' ...,1,perhaps funniest backstage hollywood movie eve...
2,5211,8,Since their nasty divorce from the Disney Comp...,1,since nasty divorce disney company disney keep...
3,7176,10,OK - you want to test somebody on how comforta...,1,want test somebody comfortable adolescence emb...
4,5754,9,I remember seeing this one when I was seven or...,1,remember seeing one seven eight must found cha...
...,...,...,...,...,...
9995,12250,2,"Upon viewing Tobe Hooper's gem, Crocodile, in ...",0,upon viewing tobe hoopers gem crocodile develo...
9996,1686,1,Imagine that you are asked by your date what m...,0,imagine asked date movie wanted see remember s...
9997,8252,3,Whattt was with the sound? It sounded like it ...,0,whattt sound sounded like dubbedotherwise bad ...
9998,6290,3,Recap: Ron is about to marry Mel. They are dee...,0,recap ron marry mel deeply love certain perfec...


- Check that the required columns exist

In [4]:

assert {'cleaned_review', 'label'}.issubset(df.columns), \
       "The file must contain 'cleaned_review' and 'label' columns"


- Keep only the two required columns and drop empty/short records

In [5]:
df = df[['cleaned_review', 'label']].dropna()
df = df[df['cleaned_review'].str.len() > 3].reset_index(drop=True)
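
To make the effect of this filter concrete, here is a minimal sketch on a made-up frame (the rows are hypothetical, not taken from the IMDb sample). `dropna()` removes rows with a missing value in either column, and the length check removes reviews of three characters or fewer.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "cleaned_review": ["great movie loved", None, "ok", "bad", "terrible plot acting"],
    "label": [1, 0, 1, 0, 0],
})

# Same two steps as in the cell above
toy = toy[["cleaned_review", "label"]].dropna()
toy = toy[toy["cleaned_review"].str.len() > 3].reset_index(drop=True)

print(toy)
print(toy.shape)
```

Note that the threshold is in characters, not words, so a three-letter review such as "bad" is dropped as well.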

- Display first few rows and dataset shape

In [6]:
print(df.head(), df.shape)

                                      cleaned_review  label
0  liked movie sort reminded marriage clean see f...      1
1  perhaps funniest backstage hollywood movie eve...      1
2  since nasty divorce disney company disney keep...      1
3  want test somebody comfortable adolescence emb...      1
4  remember seeing one seven eight must found cha...      1 (10000, 2)


# 2- Feature Extraction

- Use TF-IDF vectorization on the cleaned reviews.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

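In this step, I split the data 80/20 and stratify on the label to preserve class balance. I then fit a TF-IDF vectorizer (unigrams and bigrams, capped at 50,000 features) on the training split only and reuse it to transform the test split.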

In [8]:
X = df['cleaned_review']
y = df['label']


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# TF-IDF over unigrams and bigrams; fitted on the training split only to avoid leakage
tfidf = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(1, 2),
    max_features=50000
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print("Train shape:", X_train_tfidf.shape)
print("Test shape:", X_test_tfidf.shape)


Train shape: (8000, 50000)
Test shape: (2000, 50000)


# 3- Models


In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [10]:
models = {
    "LogisticRegression": LogisticRegression(
        max_iter=1000,
        class_weight="balanced",
        random_state=42
    ),
    "DecisionTree": DecisionTreeClassifier(
        random_state=42  # keep defaults as baseline
    ),
    "RandomForest": RandomForestClassifier(
        n_estimators=200,
        n_jobs=-1,
        random_state=42
    )
}


for name, clf in models.items():
    clf.fit(X_train_tfidf, y_train)
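
A quick way to see how much a fitted model memorizes its training set is to compare training and test accuracy. The sketch below is not part of the original notebook run: it uses synthetic data from `make_classification` (an assumption, not the IMDb features) and a default `DecisionTreeClassifier`, which grows until it fits the training data exactly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Default settings: no depth limit and no pruning
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The same comparison could be run on the fitted `models` with `clf.score(X_train_tfidf, y_train)` and `clf.score(X_test_tfidf, y_test)`; a large gap between the two suggests overfitting.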


# 4- Evaluation

In [11]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

In [13]:
rows = []
reports = {}

for name, clf in models.items():
    y_pred = clf.predict(X_test_tfidf)

    acc = accuracy_score(y_test, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0
    )

    rows.append({
        "Model": name,
        "Accuracy": acc,
        "Precision_w": prec,
        "Recall_w": rec,
        "F1_w": f1
    })
    reports[name] = classification_report(y_test, y_pred, zero_division=0)

metrics_df = pd.DataFrame(rows).sort_values("F1_w", ascending=False).reset_index(drop=True)


metrics_show = metrics_df.copy()
for c in ["Accuracy", "Precision_w", "Recall_w", "F1_w"]:
    metrics_show[c] = metrics_show[c].map(lambda x: f"{x:.3f}")

metrics_show


Unnamed: 0,Model,Accuracy,Precision_w,Recall_w,F1_w
0,LogisticRegression,0.866,0.866,0.866,0.865
1,RandomForest,0.840,0.840,0.840,0.840
2,DecisionTree,0.702,0.702,0.702,0.702
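
The `Accuracy` and `Recall_w` columns are identical because support-weighted recall always equals accuracy. The sketch below shows this on made-up labels (hypothetical values, not results from this notebook) and reproduces the `average="weighted"` F1 by hand from the per-class scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Per-class scores and the number of true samples in each class
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_hat, average=None, zero_division=0
)

# Weighted averages computed by hand: each class counts in proportion to its support
manual_rec_w = np.average(rec, weights=support)
manual_f1_w = np.average(f1, weights=support)

# The same F1 as reported by scikit-learn
_, _, f1_w, _ = precision_recall_fscore_support(
    y_true, y_hat, average="weighted", zero_division=0
)
acc = accuracy_score(y_true, y_hat)

print("weighted recall:", manual_rec_w, "accuracy:", acc)
print("weighted F1 (manual, sklearn):", manual_f1_w, f1_w)
```

With balanced classes, as in this test set (1000 reviews per label), the weighted and macro averages coincide.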


- Print classification reports.

In [14]:
for name, rep in reports.items():
    print(f"\n=== {name} ===")
    print(rep)



=== LogisticRegression ===
              precision    recall  f1-score   support

           0       0.87      0.86      0.86      1000
           1       0.86      0.87      0.87      1000

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000


=== DecisionTree ===
              precision    recall  f1-score   support

           0       0.70      0.71      0.70      1000
           1       0.71      0.69      0.70      1000

    accuracy                           0.70      2000
   macro avg       0.70      0.70      0.70      2000
weighted avg       0.70      0.70      0.70      2000


=== RandomForest ===
              precision    recall  f1-score   support

           0       0.84      0.84      0.84      1000
           1       0.84      0.84      0.84      1000

    accuracy                           0.84      2000
   macro avg       0.84      0.84      0.84      2000
weighted avg       0.84      0.84      0.84      2000

# 5- Reporting

- Logistic Regression performed best (F1 ≈ 0.865), followed by Random Forest (0.840) and Decision Tree (0.702).

- Logistic Regression works well with high-dimensional TF-IDF features and is also the most interpretable (coefficients show which words drive sentiment).

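- Random Forest is robust but less interpretable. A single Decision Tree is easy to visualize but scored much lower here (F1 ≈ 0.702), which is consistent with an unpruned tree overfitting sparse, high-dimensional features; training accuracy was not measured in this notebook, so this remains a likely explanation and not a confirmed one.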

- Recommendation: Use Logistic Regression as the baseline model for both strong accuracy and clear explainability.