# Movie Review Sentiment Analysis and Rating Prediction

In this homework, you will:
1. Load IMDB movie reviews dataset using Hugging Face datasets
2. Perform sentiment analysis
3. Build a ML model to predict movie ratings


In [1]:
# TODO: Install required packages
%pip install pandas numpy scikit-learn transformers torch datasets



In [55]:
# TODO: Import required libraries
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import re
from transformers import pipeline
import os
from tqdm.notebook import tqdm
tqdm.pandas()
# Add any other libraries you need

## Part 1: Load Dataset

Load the IMDB dataset using Hugging Face datasets library

In [3]:
# TODO: Load the IMDB dataset
# Hint: Use load_dataset('imdb')
imdb_dataset = load_dataset('imdb')
# Convert to pandas DataFrame for easier manipulation

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## Part 2: Data Preprocessing

Clean and prepare the text data

In [24]:
test_df = imdb_dataset['test'].to_pandas()
train_df = imdb_dataset['train'].to_pandas()

In [25]:
test_df.head()

Unnamed: 0,text,label
0,I love sci-fi and am willing to put up with a ...,0
1,"Worth the entertainment value of a rental, esp...",0
2,its a totally average film with a few semi-alr...,0
3,STAR RATING: ***** Saturday Night **** Friday ...,0
4,"First off let me say, If you haven't enjoyed a...",0


In [26]:
# TODO: Create a function to clean text
def clean_text(text):
  # 1. Remove HTML tags
  text = re.sub(r'<.*?>', '', text)
  # 2. Remove special characters
  text = re.sub(r'[^a-zA-Z0-9\s]','', text)
  # 3. Convert to lowercase
  text = text.lower()
  return text
test_df['clean_text'] = test_df['text'].apply(clean_text)
train_df['clean_text'] = train_df['text'].apply(clean_text)
# Hint: Use regular expressions



> Add blockquote



## Part 3: Advanced Sentiment Analysis

Go beyond binary classification - use a pre-trained model to get continuous sentiment scores

In [45]:
# check if I can run a gpu to speed things up
import torch
print(torch.cuda.is_available())


False


In [57]:
# no gpu available so we'll batch instead
# TODO: Implement advanced sentiment analysis
# 1. Load a pre-trained model (hint: try 'distilbert-base-uncased-finetuned-sst-2-english')
# create the pipline using sentiment-analysis and the suggested model
import time
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    max_length=512,
    truncation=True,
    device=0
)
# 2. Create a function to get continuous sentiment scores
def batch_sentiment_scores(
    texts,
    batch_size=10,
    save_every=10,
    save_path="sentiment_progress.csv"
    ):
    scores = []
    if os.path.exists(save_path):
      saved = pd.read_csv(save_path)
      scores = saved["score"].tolist()
      print(f"Resuming from {len(scores)} saved scores...")
    else:
      print("Starting fresh...")

    start_index = len(scores)
    print(f"Processing {len(texts)} texts in batches of {batch_size}...")

    for i in tqdm(range(start_index, len(texts), batch_size)):
        batch_start = time.time()

        batch = texts[i:i+batch_size].tolist()
        results = sentiment_pipeline(batch, truncation=True)

        #Flip the sign if the label is negative
        for r in results:
            score = r['score']
            if r['label'] == 'NEGATIVE':
                score = -score
            scores.append(score)

        if (i // batch_size + 1) % save_every == 0 or i + batch_size >= len(texts):
            pd.DataFrame({"score": scores}).to_csv(save_path, index=False)
            print(f"Progress saved at {save_path} ({len(scores)} scores).")

        print(f"Batch {i // batch_size + 1} completed in {time.time() - batch_start:.2f} sec")

    print("✅ All batches complete. Final results saved.")

    return np.array(scores)

# 3. Apply it to your cleaned text data
# test_df["score"] = batch_sentiment_scores(
#     test_df["clean_text"],
#     batch_size=10,
#     save_every=10,
#     save_path="sentiment_progress.csv"
# )


# Smaller sample first
sample_df = test_df.sample(250, random_state=42).reset_index(drop=True)
sample_df['score'] = batch_sentiment_scores(sample_df['clean_text'], batch_size=10)


Starting fresh...
Processing 250 texts in batches of 10...


  0%|          | 0/25 [00:00<?, ?it/s]

Batch 1 completed in 10.60 sec
Batch 2 completed in 9.01 sec
Batch 3 completed in 12.51 sec
Batch 4 completed in 11.91 sec
Batch 5 completed in 12.72 sec
Batch 6 completed in 13.54 sec
Batch 7 completed in 11.75 sec
Batch 8 completed in 15.26 sec
Batch 9 completed in 16.70 sec
Progress saved at sentiment_progress.csv (100 scores).
Batch 10 completed in 14.55 sec
Batch 11 completed in 13.25 sec
Batch 12 completed in 16.77 sec
Batch 13 completed in 7.74 sec
Batch 14 completed in 16.47 sec
Batch 15 completed in 14.16 sec
Batch 16 completed in 5.34 sec
Batch 17 completed in 12.64 sec
Batch 18 completed in 18.68 sec
Batch 19 completed in 10.75 sec
Progress saved at sentiment_progress.csv (200 scores).
Batch 20 completed in 11.32 sec
Batch 21 completed in 11.78 sec
Batch 22 completed in 11.91 sec
Batch 23 completed in 11.30 sec
Batch 24 completed in 12.06 sec
Progress saved at sentiment_progress.csv (250 scores).
Batch 25 completed in 16.72 sec
✅ All batches complete. Final results saved.


## Part 4: Feature Engineering

Create rich features for your model

In [56]:
sample_df.head()

Unnamed: 0,text,label,clean_text,score
0,I could not believe how terrible and boring th...,0,i could not believe how terrible and boring th...,-0.999779
1,I rented Boogie Nights last week and I could t...,1,i rented boogie nights last week and i could t...,0.84676
2,"First off, this movie is not near complete, my...",0,first off this movie is not near complete my g...,-0.999643
3,I watched this mini in the early eighties. Sam...,1,i watched this mini in the early eighties sam ...,0.978252
4,This movie was never intended as a big-budget ...,1,this movie was never intended as a bigbudget f...,-0.803366


In [8]:
# TODO: Create features
# 1. Use your continuous sentiment scores
# 2. Calculate text statistics:
#    - Length
#    - Word count
#    - Average word length
#    - Sentence count
# 3. Any other features you think might help!

## Part 5: Multi-Class Rating Prediction

Instead of binary classification, predict a 5-star rating!

In [9]:
# TODO: Create target variable
# Convert binary labels to 5-star ratings using your features
# Hint: Use sentiment scores and other features to estimate star rating

In [10]:
# TODO: Build and train your model
# 1. Split data into train and test sets
# 2. Choose a model suitable for multi-class classification
# 3. Train the model
# 4. Make predictions
# 5. Evaluate performance

## Part 6: Analysis

Analyze your results and suggest improvements

In [11]:
# TODO: Create visualizations and analyze:
# 1. Confusion matrix for multi-class predictions
# 2. Feature importance
# 3. Error analysis
# 4. Suggest improvements