## Assignment 6.2
# Final Project Dataset Assembly

This notebook assembles and cleans movie review data from The Movie Database (TMDB) API. It retrieves movie IDs and reviews across multiple years, filters the data for quality, and prepares a final cleaned dataset suitable for sentiment analysis or predictive modeling.


Step 1: Importing Libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")  # suppresses all warnings, including version-related ones

import pandas as pd
import requests, json, time, re
from langdetect import detect
import nltk
from nltk.corpus import stopwords

# Download stopwords quietly (no output)
nltk.download("stopwords", quiet=True)

print("✅ All libraries loaded successfully and environment is clean.")



✅ All libraries loaded successfully and environment is clean.


Step 2: TMDB API Setup

In this step, we connect to The Movie Database (TMDB) API, which provides structured access to  
movie information and user reviews. We'll use two API endpoints:
- `/discover/movie` – to list movies by release year  
- `/movie/{id}/reviews` – to retrieve detailed audience reviews

A valid API token is required for authentication.


In [2]:
API_URL = "https://api.themoviedb.org/3"
HEADERS = {
    "accept": "application/json",
    "Authorization": "Bearer YOUR_TMDB_API_KEY"  # 🔑 Replace with your TMDB API token
}

print("✅ TMDB API configured successfully.")


✅ TMDB API configured successfully.
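
To make the two endpoints listed in this step concrete, here is a minimal sketch of how their request URLs can be composed. The helper names `discover_url` and `reviews_url` are hypothetical and not part of the notebook, and the movie ID `12345` is a made-up value. Credentials are deliberately left out of the URL; they would be sent separately, for example through the `Authorization` header configured above.

```python
from urllib.parse import urlencode

API_URL = "https://api.themoviedb.org/3"

def discover_url(year, page=1):
    """Build a /discover/movie URL for one release year and results page."""
    query = urlencode({
        "primary_release_year": year,
        "sort_by": "vote_count.desc",
        "page": page,
    })
    return f"{API_URL}/discover/movie?{query}"

def reviews_url(movie_id, page=1):
    """Build a /movie/{id}/reviews URL for one movie and results page."""
    return f"{API_URL}/movie/{movie_id}/reviews?{urlencode({'page': page})}"

print(discover_url(2018))
print(reviews_url(12345, page=2))  # 12345 is a placeholder ID
```

Building the query string with `urlencode` rather than by hand keeps parameter values correctly escaped if a filter value ever contains special characters.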


Step 3: Fetching Movie Metadata

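This step retrieves movie IDs, titles, and release years from TMDB’s `/discover/movie` endpoint. It authenticates with a v3 API key passed as the `api_key` query parameter (instead of the Bearer header from Step 2), fetches at most two pages per year, and waits 0.05 seconds between requests.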

In [8]:
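import os, time, json, requests

# ✅ Use the short API Key (v3 auth) instead of the long Bearer token
# Read the key from the environment instead of hardcoding it in the notebook
# (the variable name TMDB_API_KEY is an assumption; any name works)
API_KEY = os.environ.get("TMDB_API_KEY", "YOUR_TMDB_V3_API_KEY")  # your v3 key
API_URL = "https://api.themoviedb.org/3"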

def get_movies(years):
    movies = []
    max_pages = 2  # limit for speed: 1–2 pages per year (set once so the summary print works even if years is empty)
    for year in years:
        page, total_pages = 1, 1
        while page <= min(total_pages, max_pages):
            url = f"{API_URL}/discover/movie?primary_release_year={year}&sort_by=vote_count.desc&page={page}&api_key={API_KEY}"
            try:
                res = requests.get(url, timeout=5)
                print(f"Fetching year {year}, page {page} - status: {res.status_code}")
                res.raise_for_status()
                data = res.json()
            except requests.exceptions.RequestException as e:
                print(f"⚠️ Error fetching {year}, page {page}: {e}")
                break

            total_pages = data.get("total_pages", 1)
            for m in data.get("results", []):
                movies.append({
                    "movie_id": m.get("id"),
                    "title": m.get("original_title"),
                    "year": year
                })

            page += 1
            time.sleep(0.05)  # polite delay
    print(f"\n✅ Collected {len(movies)} movies across {len(years)} years (up to {max_pages} pages each).")
    return movies

# ✅ Example run
years = range(2018, 2022)
movies = get_movies(years)




Fetching year 2018, page 1 - status: 200
Fetching year 2018, page 2 - status: 200
Fetching year 2019, page 1 - status: 200
Fetching year 2019, page 2 - status: 200
Fetching year 2020, page 1 - status: 200
Fetching year 2020, page 2 - status: 200
Fetching year 2021, page 1 - status: 200
Fetching year 2021, page 2 - status: 200

✅ Collected 160 movies across 4 years (up to 2 pages each).
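
The paging logic in `get_movies` can be checked without touching the network. The sketch below isolates the same loop condition (`page <= min(total_pages, max_pages)`) and feeds it a fake fetcher. Both `collect_pages` and `fake_fetch` are hypothetical names introduced for illustration, and the assumption of 5 available pages with 20 results each is made up for the test, although 20 results per page is consistent with the 160 movies collected above (4 years × 2 pages).

```python
def collect_pages(fetch_page, max_pages=2):
    """Collect results page by page, stopping at total_pages or max_pages.

    fetch_page is any callable that takes a 1-based page number and returns
    a dict shaped like a TMDB list response ("results", "total_pages").
    """
    results = []
    page, total_pages = 1, 1
    while page <= min(total_pages, max_pages):
        data = fetch_page(page)
        # The real page count is only known after the first response.
        total_pages = data.get("total_pages", 1)
        results.extend(data.get("results", []))
        page += 1
    return results

def fake_fetch(page):
    # Stand-in for a network call: pretend 5 pages exist, 20 items per page.
    start = (page - 1) * 20
    return {
        "total_pages": 5,
        "results": [{"id": start + i} for i in range(20)],
    }

items = collect_pages(fake_fetch, max_pages=2)
print(len(items))  # 40: two pages of 20, even though 5 pages are available
```

Starting with `total_pages = 1` guarantees the first request is always made; the cap only takes effect once the response reports how many pages exist.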


Step 4: Retrieving Movie Reviews

Once movie IDs are collected, the next step is to fetch their user reviews  
via the `/movie/{id}/reviews` endpoint.  
The function below collects up to five pages of reviews per movie  
to stay within TMDB’s API request limits.


In [10]:
def get_reviews(movie_id, title):
    reviews, page = [], 1
    while page <= 5:  # limit to 5 pages per movie
        url = f"{API_URL}/movie/{movie_id}/reviews?page={page}&api_key={API_KEY}"
        try:
            res = requests.get(url, timeout=5)
            print(f"Fetching reviews for {title} - page {page} (status: {res.status_code})")
            res.raise_for_status()
            data = res.json()
        except requests.exceptions.RequestException as e:
            print(f"⚠️ Error fetching reviews for {title}, page {page}: {e}")
            break

        for r in data.get("results", []):
            reviews.append({
                "movie_id": movie_id,
                "title": title,
                "review": r.get("content"),
                "rating": r.get("author_details", {}).get("rating")
            })

        if page >= data.get("total_pages", 1):
            break
        page += 1
        time.sleep(0.05)  # short polite delay
    return reviews

# ✅ Example run: fetch reviews for first 3 movies
sample_movies = movies[:3]
all_reviews = []
for m in sample_movies:
    movie_reviews = get_reviews(m["movie_id"], m["title"])
    all_reviews.extend(movie_reviews)

print(f"\n✅ Collected {len(all_reviews)} total reviews for {len(sample_movies)} movies.")


Fetching reviews for Avengers: Infinity War - page 1 (status: 200)
Fetching reviews for Avengers: Infinity War - page 2 (status: 200)
Fetching reviews for Black Panther - page 1 (status: 200)
Fetching reviews for Black Panther - page 2 (status: 200)
Fetching reviews for Deadpool 2 - page 1 (status: 200)

✅ Collected 62 total reviews for 3 movies.
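
The record-building part of `get_reviews` can also be looked at separately from the HTTP call. The following sketch flattens one response payload into rows using the same fields the notebook reads (`results`, `content`, `author_details`, `rating`). The function name `parse_reviews` and the sample payload are hypothetical; the payload is hand-written for illustration and is not real TMDB data.

```python
def parse_reviews(payload, movie_id, title):
    """Flatten one reviews-response payload into a list of row dicts."""
    rows = []
    for r in payload.get("results", []):
        # author_details may be absent or null; fall back to an empty dict.
        details = r.get("author_details") or {}
        rows.append({
            "movie_id": movie_id,
            "title": title,
            "review": r.get("content"),
            "rating": details.get("rating"),
        })
    return rows

# Hand-written sample payload (not real API output).
sample = {
    "total_pages": 1,
    "results": [
        {"content": "Great fun.", "author_details": {"rating": 8.0}},
        {"content": "No score given.", "author_details": {"rating": None}},
        {"content": "Details missing entirely."},
    ],
}

rows = parse_reviews(sample, movie_id=1, title="Example Movie")
print([row["rating"] for row in rows])  # [8.0, None, None]
```

Reviews without a rating come through as `None`, which is why a later step has to decide what to do with missing ratings.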


Step 5: Assembling the Raw Dataset

This function merges all movie reviews from multiple years into a single DataFrame.  
It removes rows with missing ratings and exports the raw dataset as a CSV file.  
This dataset will later be cleaned and normalized for analysis.


In [11]:
import pandas as pd

def assemble_raw_dataset(years):
    all_reviews = []
    movie_list = get_movies(years)  # uses your Step 3 function
    print(f"\n🎬 Fetching reviews for {len(movie_list)} movies...")

    for movie in movie_list:
        movie_reviews = get_reviews(movie["movie_id"], movie["title"])  # Step 4 function
        all_reviews.extend(movie_reviews)
        time.sleep(0.05)  # consistent short delay between movies

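    # Convert to DataFrame (explicit columns keep the steps below valid even if no reviews were collected)
    df = pd.DataFrame(all_reviews, columns=["movie_id", "title", "review", "rating"])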

    # Drop missing ratings and select key columns
    df = df.dropna(subset=["rating"], axis=0)
    df = df[["title", "review", "rating"]]

    # Export to CSV
    df.to_csv("tmdb_reviews_raw.csv", index=False)
    print("✅ Raw dataset saved as 'tmdb_reviews_raw.csv'")
    print(f"📊 Final dataset shape: {df.shape}")

    return df

# ✅ Example run
YEARS = range(2018, 2022)
raw_df = assemble_raw_dataset(YEARS)
raw_df.head()


Fetching year 2018, page 1 - status: 200
Fetching year 2018, page 2 - status: 200
Fetching year 2019, page 1 - status: 200
Fetching year 2019, page 2 - status: 200
Fetching year 2020, page 1 - status: 200
Fetching year 2020, page 2 - status: 200
Fetching year 2021, page 1 - status: 200
Fetching year 2021, page 2 - status: 200

✅ Collected 160 movies across 4 years (up to 2 pages each).

🎬 Fetching reviews for 160 movies...
Fetching reviews for Avengers: Infinity War - page 1 (status: 200)
Fetching reviews for Avengers: Infinity War - page 2 (status: 200)
Fetching reviews for Black Panther - page 1 (status: 200)
Fetching reviews for Black Panther - page 2 (status: 200)
Fetching reviews for Deadpool 2 - page 1 (status: 200)
Fetching reviews for Bohemian Rhapsody - page 1 (status: 200)
Fetching reviews for Venom - page 1 (status: 200)
Fetching reviews for Spider-Man: Into the Spider-Verse - page 1 (status: 200)
Fetching reviews for Spider-Man: Into the Spider-Verse - page 2 (status: 200)


| | title | review | rating |
|---|---|---|---|
| 1 | Avengers: Infinity War | Amazing. Visually stunning. So much going on... | 10.0 |
| 2 | Avengers: Infinity War | Just a very short, NO SPOILERS review I wanted... | 8.0 |
| 3 | Avengers: Infinity War | The third act turns on a character being an id... | 4.0 |
| 4 | Avengers: Infinity War | Best MCU movie, more than that.... BEST SUPERH... | 9.5 |
| 6 | Avengers: Infinity War | Massive, epic movie. I'm so happy that Marvel ... | 8.0 |
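
The row labels in the preview skip some values (there is no 0 or 5). That is consistent with `dropna` removing rows with missing ratings while keeping the original index labels, although the notebook does not confirm this directly. The toy example below, built on made-up data, illustrates the behavior and shows how `reset_index(drop=True)` would renumber the rows if a contiguous index is preferred.

```python
import pandas as pd

# Made-up rows: two of the four reviews have no rating.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B"],
    "review": ["no score", "good", "bad", "also no score"],
    "rating": [None, 8.0, 3.0, None],
})

# Same filter as in assemble_raw_dataset: drop rows without a rating.
rated = toy.dropna(subset=["rating"])
print(list(rated.index))       # surviving rows keep their original labels

# Optional: renumber from 0 so the index has no gaps.
renumbered = rated.reset_index(drop=True)
print(list(renumbered.index))
```

Because the CSV is written with `index=False`, these gaps do not appear in the exported file; they only show up in the in-memory DataFrame.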
