# Plot Embeddings for Rating Prediction (v2)

## Goal
Test whether movie plot/overview text can improve our rating prediction model.

## Strategy
1. Load TMDb data (has plot overviews)
2. Load IMDb data (has ratings)
3. Merge datasets (decision point: an inner join keeps only movies present in both sources; a left join keeps every movie from the left-hand table)
4. Generate embeddings from **overview text only**
5. Apply PCA to reduce dimensions
6. Combine with existing IMDb features
7. Compare models: Base vs Base + Plot PCA
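
The join choice in step 3 can be illustrated on toy data. This is a minimal sketch, not the notebook's actual merge: the `imdb_id` key, the column names, and the toy rows are all assumptions made for illustration. It shows how an inner join drops rated movies that have no overview, while a left join keeps them with a missing overview.

```python
import pandas as pd

# Toy stand-ins for the two sources; `imdb_id` is a hypothetical join key.
imdb_toy = pd.DataFrame({
    "imdb_id": ["tt001", "tt002", "tt003"],
    "rating": [7.1, 5.4, 8.0],
})
tmdb_toy = pd.DataFrame({
    "imdb_id": ["tt001", "tt003", "tt999"],
    "overview": ["A heist goes wrong.", "Two friends travel.", "Unmatched film."],
})

# Inner join: only movies that have both a rating and an overview.
inner = imdb_toy.merge(tmdb_toy, on="imdb_id", how="inner")

# Left join: every rated movie; `overview` is NaN where no match was found.
# indicator=True adds a `_merge` column recording where each row came from.
left = imdb_toy.merge(tmdb_toy, on="imdb_id", how="left", indicator=True)

print(f"inner: {len(inner)} rows, left: {len(left)} rows")
print(left["_merge"].value_counts())
```

With a left join, rows without an overview have no text to embed, so they would need either a placeholder vector or to be excluded from the Base + Plot PCA model. An inner join avoids that but shrinks the dataset for both models.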


## Setup

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import sys
import os

# Add the project root (the parent of the notebook's working directory) to the
# import path so floportop can be imported; notebooks do not define __file__.
sys.path.append(os.path.abspath('..'))
from floportop.preprocessing import VALID_GENRES

# For embeddings
from sentence_transformers import SentenceTransformer

# For PCA and modeling
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Paths
DATA_DIR = Path('../data')
MODELS_DIR = Path('../models')

print("Setup complete!")

## Load Data

Load the cleaned TMDb features and the IMDb data.

In [None]:
# Load cleaned TMDb features
tmdb_features = pd.read_csv(DATA_DIR / 'tmdb_features.csv')
print(f"TMDb features: {len(tmdb_features):,} movies")

# Load IMDb data
imdb = pd.read_csv(DATA_DIR / 'movies_clean.csv')
print(f"IMDb data: {len(imdb):,} movies")
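
Before merging, it can help to check how much usable overview text is available. The sketch below is a hypothetical helper run on toy data: the function name `overview_coverage` and the `overview` column name are assumptions, not confirmed columns of `tmdb_features.csv`.

```python
import pandas as pd

def overview_coverage(df: pd.DataFrame, text_col: str = "overview") -> float:
    """Return the fraction of rows with non-empty text in `text_col`."""
    # Treat missing values and whitespace-only strings as empty.
    text = df[text_col].fillna("").astype(str).str.strip()
    return float((text != "").mean())

# Toy frame: two usable overviews, one missing, one whitespace-only.
toy = pd.DataFrame(
    {"overview": ["A heist goes wrong.", None, "   ", "Two friends travel."]}
)
coverage = overview_coverage(toy)
print(f"Rows with usable overview text: {coverage:.0%}")
```

If the real overview column has a different name, pass it via `text_col`.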