# Movie Review Sentiment Analysis and Rating Prediction

In this homework, you will:
1. Load the IMDB movie review dataset using Hugging Face `datasets`
2. Perform sentiment analysis
3. Build an ML model to predict movie ratings


In [5]:
# TODO: Install required packages
%pip install pandas numpy scikit-learn transformers torch datasets



In [13]:
# TODO: Import required libraries
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import re
from transformers import pipeline
import os
import time
from tqdm.notebook import tqdm
tqdm.pandas()
# Add any other libraries you need

## Part 1: Load Dataset

Load the IMDB dataset using the Hugging Face `datasets` library.

In [6]:
# TODO: Load the IMDB dataset
# Hint: Use load_dataset('imdb')
imdb_dataset = load_dataset('imdb')
# Convert to pandas DataFrame for easier manipulation

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Part 2: Data Preprocessing

Clean and prepare the text data

In [7]:
test_df = imdb_dataset['test'].to_pandas()
train_df = imdb_dataset['train'].to_pandas()

In [8]:
test_df.head()

Unnamed: 0,text,label
0,I love sci-fi and am willing to put up with a ...,0
1,"Worth the entertainment value of a rental, esp...",0
2,its a totally average film with a few semi-alr...,0
3,STAR RATING: ***** Saturday Night **** Friday ...,0
4,"First off let me say, If you haven't enjoyed a...",0


In [9]:
# TODO: Create a function to clean text
# Hint: Use regular expressions
def clean_text(text):
    # 1. Replace HTML tags (e.g. <br />) with a space so adjacent words stay separated
    text = re.sub(r'<.*?>', ' ', text)
    # 2. Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # 3. Convert to lowercase
    text = text.lower()
    # 4. Collapse repeated whitespace left behind by the removals
    text = re.sub(r'\s+', ' ', text).strip()
    return text

test_df['clean_text'] = test_df['text'].apply(clean_text)
train_df['clean_text'] = train_df['text'].apply(clean_text)




## Part 3: Advanced Sentiment Analysis

Go beyond binary classification - use a pre-trained model to get continuous sentiment scores

In [10]:
# Check whether a GPU is available to speed things up
import torch
print(torch.cuda.is_available())


False


In [11]:
# No GPU available, so process in batches and save progress to Google Drive along the way
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:

# TODO: Implement advanced sentiment analysis
# 1. Load a pre-trained model (hint: try 'distilbert-base-uncased-finetuned-sst-2-english')
# Create the pipeline using sentiment-analysis and the suggested model
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    max_length=512,
    truncation=True,
    device=0 if torch.cuda.is_available() else -1  # GPU 0 if available, otherwise CPU
)

# 2. Create a function to get continuous sentiment scores
def batch_sentiment_scores(
    texts,
    batch_size=10,
    save_every=10,
    save_path = "/content/drive/MyDrive/sentiment_progress.csv"
):
    scores = []
    if os.path.exists(save_path):
        saved = pd.read_csv(save_path)
        scores = saved["score"].tolist()
        print(f"Resuming from {len(scores)} saved scores...")
    else:
        print("Starting fresh...")

    start_index = len(scores)
    print(f"Processing {len(texts)} texts in batches of {batch_size}...")

    for i in tqdm(range(start_index, len(texts), batch_size)):
        batch_start = time.time()
        batch = texts[i:i+batch_size].tolist()
        results = sentiment_pipeline(batch, truncation=True)

        # Flip the sign if the label is negative
        for r in results:
            score = r["score"]
            if r["label"] == "NEGATIVE":
                score = -score
            scores.append(score)

        # Save progress every N batches or at the end
        if (i // batch_size + 1) % save_every == 0 or i + batch_size >= len(texts):
            pd.DataFrame({
                "row_id": list(range(len(scores))),
                "score": scores
            }).to_csv(save_path, index=False)
            print(f"Progress saved at {save_path} ({len(scores)} scores).")

        print(f"Batch {i // batch_size + 1} completed in {time.time() - batch_start:.2f} sec")

    print("✅ All batches complete. Final results saved.")
    return np.array(scores)

# 3. Apply it to your cleaned text data
test_df["score"] = batch_sentiment_scores(
    test_df["clean_text"],
    batch_size=10,
    save_every=10
)

# Optional smaller sample test
# sample_df = test_df.sample(250, random_state=42).reset_index(drop=True)
# sample_df["score"] = batch_sentiment_scores(sample_df["clean_text"], batch_size=10)


Device set to use cpu


Starting fresh...
Processing 25000 texts in batches of 10...


  0%|          | 0/2500 [00:00<?, ?it/s]

Batch 1 completed in 10.05 sec
Batch 2 completed in 8.42 sec
Batch 3 completed in 10.66 sec
Batch 4 completed in 6.17 sec
Batch 5 completed in 5.72 sec
Batch 6 completed in 6.44 sec
Batch 7 completed in 4.74 sec
Batch 8 completed in 8.03 sec
Batch 9 completed in 5.53 sec
Progress saved at /content/drive/MyDrive/sentiment_progress.csv (100 scores).
Batch 10 completed in 6.85 sec
Batch 11 completed in 3.97 sec
Batch 12 completed in 5.36 sec
Batch 13 completed in 6.33 sec
Batch 14 completed in 6.04 sec
Batch 15 completed in 4.90 sec
Batch 16 completed in 12.71 sec
Batch 17 completed in 6.69 sec
Batch 18 completed in 3.18 sec
Batch 19 completed in 3.90 sec
Progress saved at /content/drive/MyDrive/sentiment_progress.csv (200 scores).
Batch 20 completed in 6.01 sec
Batch 21 completed in 4.02 sec
Batch 22 completed in 4.97 sec
Batch 23 completed in 5.92 sec
Batch 24 completed in 4.23 sec
Batch 25 completed in 4.45 sec
Batch 26 completed in 7.35 sec
Batch 27 completed in 3.81 sec
Batch 28 comp

In [None]:
from google.colab import files
# The progress file was written to Drive, so download it from that path
files.download("/content/drive/MyDrive/sentiment_progress.csv")

In [None]:
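uploaded = files.upload()
# The progress file holds only row_id and score, so attach the scores to
# test_df rather than overwriting it (rows without a saved score stay NaN)
saved_scores = pd.read_csv("sentiment_progress.csv")
test_df.loc[saved_scores["row_id"].to_numpy(), "score"] = saved_scores["score"].to_numpy()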

## Part 4: Feature Engineering

Create rich features for your model

In [None]:
# sample_df only exists if the optional sample run above was uncommented
test_df.head()

In [None]:
# TODO: Create features
# 1. Use your continuous sentiment scores

# 2. Calculate text statistics:
#    - Length
#    - Word count
#    - Average word length
#    - Sentence count
# 3. Any other features you think might help!
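
The text statistics listed above could be computed along these lines. This is a minimal sketch, not a reference solution: the function name `text_stats` and the feature names are hypothetical, and the sentence splitter is a rough heuristic that treats runs of `.`, `!` or `?` as sentence boundaries. Because `clean_text` strips punctuation, sentence counts need the raw `text` column rather than `clean_text`.

```python
import re

def text_stats(text):
    """Return simple length-based statistics for one review (illustrative helper)."""
    words = text.split()
    # Rough sentence split: runs of ., ! or ? end a sentence; empty pieces are dropped
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "char_count": len(text),
        "word_count": len(words),
        # Guard against empty reviews to avoid dividing by zero
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "sentence_count": len(sentences),
    }

example = text_stats("Great film. Loved it!")
print(example)
```

To apply it to the whole frame, one option is `pd.DataFrame(test_df["text"].map(text_stats).tolist(), index=test_df.index)`, which yields one column per statistic.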

## Part 5: Multi-Class Rating Prediction

Instead of binary classification, predict a 5-star rating!

In [None]:
# TODO: Create target variable
# Convert binary labels to 5-star ratings using your features
# Hint: Use sentiment scores and other features to estimate star rating
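
One possible way to derive a star rating from the signed sentiment score is quantile binning. This is only a heuristic proxy, since the dataset itself carries binary labels, and the helper name `scores_to_stars` is hypothetical. Quantile edges are used here instead of equal-width bins because the pipeline reports the confidence of the predicted label, which is at least 0.5 for a two-class model, so signed scores never land in the middle of the [-1, 1] range.

```python
import numpy as np

def scores_to_stars(scores):
    """Map signed sentiment scores to 1-5 stars using quintile edges (heuristic)."""
    scores = np.asarray(scores, dtype=float)
    # Four inner edges split the scores into five roughly equal-sized groups
    edges = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
    # np.digitize returns 0..4, so shift to 1..5
    return np.digitize(scores, edges) + 1

# Evenly spaced placeholder scores, used only to show the mapping
demo_scores = np.linspace(-1, 1, 10)
demo_stars = scores_to_stars(demo_scores)
print(demo_stars.tolist())
```

With real scores the groups will not be perfectly balanced when many reviews share near-identical confidence values, so it is worth checking the resulting class counts before training.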

In [None]:
# TODO: Build and train your model
# 1. Split data into train and test sets
# 2. Choose a model suitable for multi-class classification
# 3. Train the model
# 4. Make predictions
# 5. Evaluate performance
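
The five steps above might look roughly like the following with scikit-learn. The feature matrix here is synthetic stand-in data so the sketch runs on its own; in the notebook, `X` would hold the engineered features and `y` the star ratings. A random forest is just one reasonable choice for multi-class classification, not a prescribed one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (NOT real IMDB features), only so the sketch is runnable
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
# Hypothetical 1-5 star target derived from the first column via quintile bins
y = np.digitize(X[:, 0], np.quantile(X[:, 0], [0.2, 0.4, 0.6, 0.8])) + 1

# 1. Split, keeping the class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# 2-3. Choose and train a multi-class model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# 4. Predict on the held-out rows
y_pred = model.predict(X_test)
# 5. Per-class precision, recall and F1
print(classification_report(y_test, y_pred))
```

Note that if the star rating is itself computed from the sentiment score, a model that also receives that score as a feature will mostly relearn the binning rule, so high accuracy in that setup says little about how well real ratings would be predicted.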

## Part 6: Analysis

Analyze your results and suggest improvements

In [None]:
# TODO: Create visualizations and analyze:
# 1. Confusion matrix for multi-class predictions
# 2. Feature importance
# 3. Error analysis
# 4. Suggest improvements
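
As a starting point for the confusion matrix and error analysis, something like the following could work. The `y_true` and `y_pred` arrays are tiny hand-written placeholders, not results from this notebook; in practice they would be the held-out ratings and the model's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder ratings, purely illustrative
y_true = np.array([1, 2, 3, 4, 5, 5, 3, 1])
y_pred = np.array([1, 2, 4, 4, 5, 4, 3, 2])
labels = [1, 2, 3, 4, 5]

# Rows are true ratings, columns are predicted ratings
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Row-normalize to see how each true rating is spread across predictions
cm_norm = cm / cm.sum(axis=1, keepdims=True)

# Positions of misclassified reviews, useful for reading the raw text of errors
error_idx = np.flatnonzero(y_true != y_pred)
# Ratings are ordinal, so the average distance in stars is also informative
mean_abs_error = np.abs(y_true - y_pred).mean()

print(cm)
print(error_idx, mean_abs_error)
```

For feature importance, tree-based scikit-learn models expose a `feature_importances_` attribute after fitting, which can be paired with the feature column names and sorted.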