Project Overview and Goals

# 🎬 Movie Recommender System (MVP)

## 🧠 Project Overview
This is a beginner-friendly **Movie Recommender System** built using content-based filtering. It aims to suggest movies similar to a user's input based on features like genres, keywords, and plot metadata.

The project follows the **Software Development Life Cycle (SDLC)** to demonstrate clean design, modularity, and progressive improvement.

---

## 📌 Goals

### ✅ Minimum Viable Product (MVP)
- Load and explore a movie metadata dataset (e.g., from TMDb or IMDb)
- Preprocess features like genres, overview, cast, and keywords
- Create a content-based recommender using TF-IDF and cosine similarity
- Allow a user to input a movie and receive top 5 similar recommendations
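
The MVP steps above (TF-IDF weighting, cosine similarity, top-N lookup) can be sketched in plain Python to make the arithmetic visible. This is only an illustration, not the notebook's implementation: the notebook plans to use scikit-learn for vectorization and similarity. The function names, toy titles, and toy descriptions below are all made up, and the IDF term uses the smoothed form `ln((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw TF times smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(title, titles, docs, top_n=5):
    """Return the top_n titles most similar to `title`, excluding itself."""
    vectors = tfidf_vectors(docs)
    idx = titles.index(title)  # raises ValueError for an unknown title
    scores = [
        (cosine(vectors[idx], vec), other)
        for i, (vec, other) in enumerate(zip(vectors, titles))
        if i != idx
    ]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scores[:top_n]]

# Hypothetical toy catalogue, for illustration only.
titles = ["Space Quest", "Star Voyage", "Love in Paris", "Paris Romance"]
docs = [
    "space adventure alien crew",
    "space crew alien planet voyage",
    "romance paris love story",
    "love story paris couple",
]
print(recommend("Space Quest", titles, docs, top_n=2))
```

In this toy catalogue, "Star Voyage" ranks first for "Space Quest" because it is the only other entry that shares any terms with it.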

### 🔄 Expansion Ideas
- Add collaborative filtering using user ratings
- Use BERT or sentence transformers for better plot similarity
- Personalize recommendations using user history or favorites
- Build a Streamlit or Flask app interface

---

## 📁 Tech Stack
- **Python**
- **Pandas**, **NumPy** for data handling
- **Scikit-learn** for vectorization and similarity
- **Streamlit** (optional) for frontend
- **TMDb API** or Kaggle dataset as data source

---

> This notebook is designed to be self-contained and modular for easy iteration and expansion.


### Data Loading

Imports

In [1]:
import requests
import os
import time
import json
from dotenv import load_dotenv

Setup

In [4]:
load_dotenv()
API_KEY = os.getenv("TMDB_API_KEY")
BASE_URL = "https://api.themoviedb.org/3"

# Fetched metadata is written here after each page, so an error mid-crawl loses at most one page
OUTPUT_FILE = "movie_metadata.json"
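
Saving progress only helps if a later run can pick it up again. The sketch below is one possible way to do that, not something this notebook already does: `load_progress` and `save_progress` are hypothetical helper names, and the temp-file-then-rename step is an assumption added so that an interrupted write cannot leave a half-written JSON file behind.

```python
import json
import os
import tempfile

def load_progress(path):
    """Return previously saved records, or an empty list if nothing was saved."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def save_progress(records, path):
    """Write records to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

# Possible usage at the start of a crawl (names are illustrative):
#   records = load_progress("movie_metadata.json")
#   seen_ids = {record["id"] for record in records}
#   ... skip any movie whose id is already in seen_ids ...
```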

API Calls (Helper Functions)

In [11]:
def _get(path, api_key, **params):
    # Shared GET helper: time out instead of hanging, and raise on HTTP errors
    # (401, 404, 429, ...) so an error payload is never parsed as movie data.
    response = requests.get(f"{BASE_URL}{path}", params={"api_key": api_key, **params}, timeout=10)
    response.raise_for_status()
    return response.json()

def get_popular_movies(page, api_key=API_KEY):
    return _get("/movie/popular", api_key, page=page)

def get_top_rated_movies(page, api_key=API_KEY):
    return _get("/movie/top_rated", api_key, page=page)

def get_movie_details(movie_id, api_key=API_KEY):
    return _get(f"/movie/{movie_id}", api_key)

def get_movie_credits(movie_id, api_key=API_KEY):
    return _get(f"/movie/{movie_id}/credits", api_key)

def get_movie_keywords(movie_id, api_key=API_KEY):
    return _get(f"/movie/{movie_id}/keywords", api_key)


Metadata - What to store in our DB

In [12]:
def collect_movie_metadata(movie):
    movie_id = movie.get("id")
    try:
        details = get_movie_details(movie_id)
        credits = get_movie_credits(movie_id)
        keywords = get_movie_keywords(movie_id)

        genres = [g['name'] for g in details.get("genres", [])]
        top_cast = [member['name'] for member in credits.get("cast", [])[:5]]
        # Use .get() so a crew entry without a "job" key cannot raise KeyError
        director = next((c['name'] for c in credits.get("crew", []) if c.get('job') == 'Director'), None)
        keyword_list = [kw['name'] for kw in keywords.get("keywords", [])]
        production_companies = [p['name'] for p in details.get("production_companies", [])]

        return {
            "id": movie_id,
            "title": details.get("title"),
            "overview": details.get("overview"),
            "genres": genres,
            "keywords": keyword_list,
            "top_cast": top_cast,
            "director": director,
            "release_year": (details.get("release_date") or "")[:4],  # tolerate a null release_date
            "runtime": details.get("runtime"),
            "budget": details.get("budget"),
            "revenue": details.get("revenue"),
            "popularity": details.get("popularity"),
            "vote_average": details.get("vote_average"),
            "vote_count": details.get("vote_count"),
            "original_language": details.get("original_language"),
            "production_companies": production_companies
        }
    except Exception as e:
        print(f"[ERROR] Skipping movie ID {movie_id}: {e}")
        return None
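
Before TF-IDF can be applied, each record returned above has to become a single text string. The sketch below shows one common way to do that. It is an assumption about how the preprocessing step might look, not code from this notebook: `build_soup` is a hypothetical name, and squashing multi-word names into one token ("Jane Doe" becomes "janedoe") is a design choice that stops two different people who share a first name from looking similar. The example record uses invented values.

```python
def build_soup(record):
    """Flatten one metadata record into a single lowercase string."""
    def squash(names):
        # "Jane Doe" -> "janedoe", so a full name counts as one token.
        return [name.replace(" ", "").lower() for name in names if name]

    parts = [
        (record.get("overview") or "").lower(),
        " ".join(squash(record.get("genres") or [])),
        " ".join(squash(record.get("keywords") or [])),
        " ".join(squash(record.get("top_cast") or [])),
        " ".join(squash([record.get("director")])),
    ]
    # Drop empty parts so missing fields do not leave stray spaces.
    return " ".join(part for part in parts if part)

# Invented record in the same shape as collect_movie_metadata's output.
example = {
    "overview": "Two robots cross a desert.",
    "genres": ["Science Fiction", "Adventure"],
    "keywords": ["robot", "road trip"],
    "top_cast": ["Jane Doe", "John Roe"],
    "director": "Alex Poe",
}
print(build_soup(example))
# two robots cross a desert. sciencefiction adventure robot roadtrip janedoe johnroe alexpoe
```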


Crawling call

In [14]:
RUN_CRAWL = False  # set to True to (re)run the crawl

if RUN_CRAWL:
    all_metadata = []
    pages_to_fetch = 25  # 25 pages × 20 movies = 500 total
    calls = 0

    for page in range(1, pages_to_fetch + 1):
        print(f"Fetching page {page} of top-rated movies")
        page_data = get_top_rated_movies(page)
        
        for movie in page_data.get("results", []):
            metadata = collect_movie_metadata(movie)
            if metadata:
                all_metadata.append(metadata)

            # Track rate-limited calls (approx 3 calls per movie)
            calls += 3
            if calls % 12 == 0:
                time.sleep(1.5)  # Pause after every ~12 API calls
        
        # Save progress after each page
        with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
            json.dump(all_metadata, f, ensure_ascii=False, indent=2)

    print(f"\n✅ Finished fetching {len(all_metadata)} movies.")
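
The loop above paces requests with a fixed sleep and skips any movie whose fetch fails. A possible refinement, sketched here under assumptions, is to retry a failed call with exponential backoff before giving up. `call_with_backoff` is a hypothetical helper, not part of this notebook or of any library, and the retry count and delays are arbitrary illustrative values.

```python
import time

def call_with_backoff(fetch, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(*args); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return fetch(*args)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)

# Possible usage inside collect_movie_metadata (illustrative):
#   details = call_with_backoff(get_movie_details, movie_id)
```

The `sleep` parameter exists only so the waiting can be swapped out, for example in a quick test that should not actually pause.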


### Preprocessing 

### TF-IDF Vectorization

### Similarity Calculation

### Recommender Function

### UI