# Setup Notebook - Movie Recommendation System

**Purpose:** Generate `recommendations_top50.parquet` and verify all data files are ready.

**Run this notebook ONCE before running the Streamlit app.**

---

## Table of Contents
1. [Install Dependencies](#1.-Install-Dependencies)
2. [Import Libraries](#2.-Import-Libraries)
3. [Setup Paths](#3.-Setup-Paths)
4. [Verify Data Files](#4.-Verify-Data-Files)
5. [Generate Recommendations Parquet](#5.-Generate-Recommendations-Parquet)
6. [Test Data Loading](#6.-Test-Data-Loading)
7. [Run Streamlit App](#7.-Run-Streamlit-App)

## 1. Install Dependencies

Install the required packages in your local environment.

In [None]:
# Uncomment the line below if packages are not installed

# !pip install streamlit pandas numpy pyarrow requests python-dotenv tqdm

## 2. Import Libraries

Import all necessary libraries for data processing.

In [1]:
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
from pathlib import Path

print("● All libraries imported successfully!")

● All libraries imported successfully!


## 3. Setup Paths

Define the paths to the data directory and the files inside it.

In [2]:
# Configuration
DATA_DIR = "tmdb_dataset"  # Relative path to data directory
TOP_K = 50  # Number of recommendations to pre-compute per movie

# File paths
SIMILARITY_MATRIX_FILE = os.path.join(DATA_DIR, "similarity_matrix.npy")
MOVIE_INDICES_FILE = os.path.join(DATA_DIR, "movie_indices.csv")
MOVIES_FINAL_FILE = os.path.join(DATA_DIR, "movies_final.csv")
OUTPUT_FILE = os.path.join(DATA_DIR, "recommendations_top50.parquet")

# Display configuration
print("● Configuration:")
print(f"  Data directory: {DATA_DIR}")
print(f"  Top-K recommendations: {TOP_K}")
print(f"  Output file: {OUTPUT_FILE}")
print()
print("● Paths configured successfully!")

● Configuration:
  Data directory: tmdb_dataset
  Top-K recommendations: 50
  Output file: tmdb_dataset\recommendations_top50.parquet

● Paths configured successfully!


## 4. Verify Data Files

Check that all required files exist before proceeding.

In [3]:
print("● Verifying required files...")
print("=" * 80)

required_files = {
    "Similarity Matrix": SIMILARITY_MATRIX_FILE,
    "Movie Indices": MOVIE_INDICES_FILE,
    "Movies Final": MOVIES_FINAL_FILE
}

all_exist = True

for name, filepath in required_files.items():
    exists = os.path.exists(filepath)
    
    if exists:
        # Get file size
        size_mb = os.path.getsize(filepath) / (1024 ** 2)
        status = f"  Found ({size_mb:,.2f} MB)"
    else:
        status = "  MISSING"
        all_exist = False
    
    print(f"{name:20s}: {status}")
    print(f"{'':20s}  Path: {filepath}")
    print()

print("=" * 80)

if all_exist:
    print("● All required files found! Ready to proceed.")
else:
    print("● Some files are missing!")
    print("  Please make sure to:")
    print("  1. Run the complete MovieRecSys.ipynb pipeline first")
    print("  2. Copy all generated files to the tmdb_dataset/ folder")
    raise FileNotFoundError("Missing required data files. Cannot proceed.")

● Verifying required files...
Similarity Matrix   :   Found (2,023.57 MB)
                      Path: tmdb_dataset\similarity_matrix.npy

Movie Indices       :   Found (0.45 MB)
                      Path: tmdb_dataset\movie_indices.csv

Movies Final        :   Found (19.37 MB)
                      Path: tmdb_dataset\movies_final.csv

● All required files found! Ready to proceed.
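
The similarity matrix reported above is roughly 2 GB, so loading it in full needs about that much free RAM. As a hedged alternative, NumPy can memory-map a `.npy` file so that rows are read from disk on demand. The sketch below uses a small temporary file (`demo_matrix.npy`, a made-up name) as a stand-in for the real matrix; it is an illustration, not part of the original pipeline.

```python
import os
import tempfile

import numpy as np

# Stand-in for similarity_matrix.npy: a small random square matrix
# written to a temporary directory (the file name is illustrative only).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_matrix.npy")
    np.save(demo_path, np.random.rand(100, 100).astype(np.float32))

    # mmap_mode="r" maps the file read-only instead of loading it into RAM
    demo_matrix = np.load(demo_path, mmap_mode="r")
    demo_shape = demo_matrix.shape

    # Copy a single row into memory; only that slice is read from disk
    first_row = np.array(demo_matrix[0])

    # Release the mapping before the temporary directory is removed
    del demo_matrix

print(f"Shape: {demo_shape}, first row length: {len(first_row)}")
```

Row-by-row access, as in the generation loop in Section 5, works the same way on a memory-mapped array, at the cost of slower reads from disk.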


## 5. Generate Recommendations Parquet

Precompute the `TOP_K` (50) most similar movies for every movie from the similarity matrix and save them to `recommendations_top50.parquet`.

In [4]:
# Check if parquet file already exists
if os.path.exists(OUTPUT_FILE):
    print(f"{OUTPUT_FILE} already exists!")
    print()

    # Ask user if they want to regenerate
    regenerate = input("Do you want to regenerate? (yes/no): ").strip().lower()

    if regenerate == 'yes':
        os.remove(OUTPUT_FILE)
        print(f"  Deleted old {OUTPUT_FILE}")
        print()
    else:
        print("  Skipping generation. Using existing file.")
        print()

# Generate if file doesn't exist
if not os.path.exists(OUTPUT_FILE):
    print("\nGENERATING RECOMMENDATIONS PARQUET FILE")
    print()
    
    # Step 1: Load data
    print("[1/4] Loading data files...")
    similarity_matrix = np.load(SIMILARITY_MATRIX_FILE)
    print(f"  Loaded {SIMILARITY_MATRIX_FILE}")
    print(f"  Shape: {similarity_matrix.shape}")
    print(f"  Size: {similarity_matrix.nbytes / 1024**2:.2f} MB")
    print()
    
    movie_indices = pd.read_csv(MOVIE_INDICES_FILE)
    print(f"  Loaded {MOVIE_INDICES_FILE}")
    print(f"  Records: {len(movie_indices):,}")
    print()
    
    # Step 2: Validate data
    print("[2/4] Validating data...")
    n_movies = len(similarity_matrix)
    
    assert similarity_matrix.shape[0] == similarity_matrix.shape[1], "Similarity matrix must be square!"
    assert len(movie_indices) == n_movies, f"Index length mismatch! Matrix: {n_movies}, Indices: {len(movie_indices)}"
    
    print(f"  Data validated: {n_movies:,} movies")
    print()
    
    # Step 3: Generate recommendations
    print(f"[3/4] Generating top-{TOP_K} recommendations for each movie...")
    print(f"  This may take 1-3 minutes...")
    print()
    
    recommendations_data = []
    
    for idx in tqdm(range(n_movies), desc="   Processing"):
        # Get similarity scores for this movie
        sim_scores = similarity_matrix[idx]
        
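        # Get top-K indices, explicitly dropping the movie itself
        # (with tied scores it is not guaranteed to sort into position 0)
        order = np.argsort(sim_scores)[::-1]
        top_indices = order[order != idx][:TOP_K]
        top_scores = sim_scores[top_indices]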
        
        # Get movie IDs
        movie_id = movie_indices.iloc[idx]['movie_id']
        recommended_ids = movie_indices.iloc[top_indices]['movie_id'].values
        
        # Store recommendations
        for rank, (rec_id, score) in enumerate(zip(recommended_ids, top_scores), 1):
            recommendations_data.append({
                'movie_id': int(movie_id),
                'rank': rank,
                'recommended_id': int(rec_id),
                'similarity': float(score)
            })
    
    print()
    print(f"  Generated {len(recommendations_data):,} recommendation records")
    print()
    
    # Step 4: Save to Parquet
    print(f"[4/4] Saving to {OUTPUT_FILE}...")
    recommendations_df = pd.DataFrame(recommendations_data)
    recommendations_df.to_parquet(OUTPUT_FILE, compression='snappy', index=False)
    
    file_size = os.path.getsize(OUTPUT_FILE) / 1024**2
    print(f"  Saved successfully!")
    print(f"  File size: {file_size:.2f} MB")
    print(f"  Size reduction vs. full matrix: {similarity_matrix.nbytes / 1024**2 / file_size:.1f}x")
    print()
    
    # Verification
    print("[VERIFICATION] Testing file...")
    test_df = pd.read_parquet(OUTPUT_FILE)
    sample_id = movie_indices.iloc[0]['movie_id']
    sample_recs = test_df[test_df['movie_id'] == sample_id].head(5)
    
    print(f"  Successfully loaded parquet file")
    print(f"  Sample query (movie_id={sample_id}):")
    print(sample_recs.to_string(index=False))
    print()
    
    print("\nRECOMMENDATIONS FILE GENERATED SUCCESSFULLY!")
    
else:
    print("\nRecommendations file already exists. Skipping generation.")

tmdb_dataset\recommendations_top50.parquet already exists!



Do you want to regenerate? (yes/no):  no


  Skipping generation. Using existing file.


Recommendations file already exists. Skipping generation.
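
The generation loop above fully sorts every row of the similarity matrix. As a hedged sketch of a cheaper variant, `np.argpartition` can pick out the top-K candidates first, so only K values need sorting. The helper name `top_k_neighbors` and the 4×4 toy matrix are assumptions made for illustration; they do not come from the original pipeline.

```python
import numpy as np

def top_k_neighbors(sim_row, self_idx, k):
    """Return (indices, scores) of the k most similar items, excluding self_idx."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[self_idx] = -np.inf  # never recommend the movie itself
    k = min(k, len(scores) - 1)
    if k <= 0:
        return np.array([], dtype=int), np.array([], dtype=float)
    # argpartition moves the k largest scores into the last k slots (unordered)
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, highest score first
    ordered = candidates[np.argsort(scores[candidates])[::-1]]
    return ordered, scores[ordered]

# Toy symmetric similarity matrix for four movies
toy = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.7, 0.3, 1.0],
])

idx0, sc0 = top_k_neighbors(toy[0], self_idx=0, k=2)
idx1, sc1 = top_k_neighbors(toy[1], self_idx=1, k=2)
print(idx0, sc0)
print(idx1, sc1)
```

Masking the movie's own score with `-inf` excludes it regardless of ties, which avoids relying on the self-similarity sorting into first place.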


## 6. Test Data Loading

Verify that the files the Streamlit app depends on load correctly and can answer a sample query.

In [5]:
print("Testing data loading...")
print("=" * 80)

try:
    # Load movies
    movies_df = pd.read_csv(MOVIES_FINAL_FILE)
    print(f"● Movies loaded successfully")
    print(f"  Shape: {movies_df.shape}")
    print(f"  Columns: {list(movies_df.columns[:5])}...")
    print()

    # Load recommendations
    recommendations_df = pd.read_parquet(OUTPUT_FILE)
    print(f"● Recommendations loaded successfully")
    print(f"  Shape: {recommendations_df.shape}")
    print(f"  Columns: {list(recommendations_df.columns)}")
    print()

    # Test query
    sample_movie = movies_df.iloc[0]
    sample_movie_id = sample_movie['movie_id']
    sample_title = sample_movie['title']

    print(f"● Sample query: '{sample_title}' (ID: {sample_movie_id})")

    sample_recs = recommendations_df[recommendations_df['movie_id'] == sample_movie_id].head(5)
    print(f"  Found {len(sample_recs)} recommendations")
    print()
    print(sample_recs.to_string(index=False))
    print()

    print("● ALL TESTS PASSED! Data is ready for Streamlit app.")


except Exception as e:
    print(f"● Error during testing: {e}")
    print()
    print("● Please check:")
    print("  1. All required files exist")
    print("  2. Files are not corrupted")
    print("  3. File formats are correct")

Testing data loading...
● Movies loaded successfully
  Shape: (16286, 104)
  Columns: ['movie_id', 'title', 'original_language', 'release_year', 'runtime']...

● Recommendations loaded successfully
  Shape: (814300, 4)
  Columns: ['movie_id', 'rank', 'recommended_id', 'similarity']

● Sample query: 'The Running Man' (ID: 798645)
  Found 5 recommendations

 movie_id  rank  recommended_id  similarity
   798645     1             865    0.836517
   798645     2          822119    0.816022
   798645     3           19959    0.813180
   798645     4         1071585    0.798758
   798645     5          500664    0.788185

● ALL TESTS PASSED! Data is ready for Streamlit app.
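
The sample query above returns only IDs and scores. As a minimal sketch of how an app might turn those into readable results, the lookup can be merged with the movies table on `recommended_id`. The helper `recommend_titles` and the toy DataFrames are hypothetical; only the column names (`movie_id`, `title`, `rank`, `recommended_id`, `similarity`) come from the outputs shown above.

```python
import pandas as pd

# Toy stand-ins for movies_final.csv and recommendations_top50.parquet
toy_movies = pd.DataFrame({
    "movie_id": [10, 20, 30],
    "title": ["Alpha", "Beta", "Gamma"],
})
toy_recs = pd.DataFrame({
    "movie_id":       [10, 10, 20, 20],
    "rank":           [1, 2, 1, 2],
    "recommended_id": [30, 20, 10, 30],
    "similarity":     [0.9, 0.8, 0.8, 0.4],
})

def recommend_titles(movie_id, recs, movies, top_n=5):
    """Hypothetical helper: top_n recommendations for movie_id, with titles."""
    subset = recs[recs["movie_id"] == movie_id].nsmallest(top_n, "rank")
    titles = movies[["movie_id", "title"]].rename(
        columns={"movie_id": "recommended_id"}
    )
    # A left merge keeps every recommendation even if a title is missing
    merged = subset.merge(titles, on="recommended_id", how="left")
    return merged[["rank", "recommended_id", "title", "similarity"]]

result = recommend_titles(10, toy_recs, toy_movies, top_n=2)
print(result.to_string(index=False))
```

With the real files, `toy_recs` and `toy_movies` would be replaced by the `recommendations_df` and `movies_df` loaded in the cell above.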


## 7. Run Streamlit App

### Option A: Run from Terminal 

Open your terminal/command prompt and run:

```bash
streamlit run app.py
```

The app opens in your browser at `http://localhost:8501` (Streamlit's default port).

### Option B: Run from Notebook 

Uncomment and run the cell below to start the app from this notebook.

**Note:** The cell will keep running until you stop it manually.

In [None]:
# Uncomment the line below to run Streamlit from notebook
# Note: This cell will keep running. Press 'Stop' button to terminate.

# !streamlit run app.py