# Genome PCA Laboratory
This notebook builds **dense movie embeddings** from the MovieLens **Genome** (tag relevance) features using **PCA**.

**Goal:** compress the ~1,100 tag relevance dimensions into a smaller set (e.g., 50–200) that can be used as movie features in downstream models.

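**Leakage note:** Genome relevance scores are **movie-level, static features**: each movie has one fixed score per tag, regardless of how you split ratings. The MovieLens README states that the genome was computed from user-contributed content (tags, ratings, and reviews), so the scores are not strictly rating-free. This notebook follows the common practice of treating them as fixed content features, under which PCA embeddings can be used in time-based evaluation.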


## 0) Imports
We keep imports at the top so the notebook's dependencies are visible at a glance.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


## 1) Load Genome data
We load:
- `genome_scores.csv`: (movieId, tagId, relevance)
- `genome_tags.csv`: (tagId, tag)

Only `genome_scores` is required for PCA; `genome_tags` is useful for inspection.


In [None]:
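# Adapt paths to your environment if needed. The filenames are placeholders;
# the stock MovieLens archives ship genome-scores.csv and genome-tags.csv.
paths = {"genome_scores": "genome_scores.csv", "genome_tags": "genome_tags.csv"}
genome_scores = pd.read_csv(paths["genome_scores"])
genome_tags = pd.read_csv(paths["genome_tags"])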

print("genome_scores:", genome_scores.shape)
print("genome_tags:", genome_tags.shape)
genome_scores.head()


## 2) Sanity checks
We verify basic integrity:
- relevance range
- number of movies/tags
- missing values


In [None]:
print("relevance min/max:", genome_scores["relevance"].min(), genome_scores["relevance"].max())
print("n movies:", genome_scores["movieId"].nunique())
print("n tags:", genome_scores["tagId"].nunique())
print("missing relevance:", genome_scores["relevance"].isna().sum())
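
The printed checks above can also be turned into hard failures, so a bad file stops the notebook instead of scrolling past. The sketch below is one way to do that; `check_genome_scores` is a hypothetical helper (not part of the original notebook), and the toy table stands in for the real `genome_scores`.

```python
import pandas as pd

def check_genome_scores(scores: pd.DataFrame) -> dict:
    """Raise if the long-format table violates basic expectations (hypothetical helper)."""
    rel = scores["relevance"]
    if rel.isna().any():
        raise ValueError("relevance contains missing values")
    if not rel.between(0.0, 1.0).all():  # bounds are inclusive
        raise ValueError("relevance outside [0, 1]")
    # Duplicate (movieId, tagId) pairs would make DataFrame.pivot raise later
    if scores.duplicated(subset=["movieId", "tagId"]).any():
        raise ValueError("duplicate (movieId, tagId) pairs")
    n_movies = scores["movieId"].nunique()
    n_tags = scores["tagId"].nunique()
    return {
        "n_movies": n_movies,
        "n_tags": n_tags,
        # Dense means every movie has a score for every tag
        "is_dense": len(scores) == n_movies * n_tags,
    }

# Toy stand-in: 2 movies x 2 tags, fully populated
toy_scores = pd.DataFrame({
    "movieId":   [1, 1, 2, 2],
    "tagId":     [10, 11, 10, 11],
    "relevance": [0.9, 0.1, 0.4, 0.6],
})
summary = check_genome_scores(toy_scores)
print(summary)
```

On the real data, `check_genome_scores(genome_scores)` would be the corresponding call.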


## 3) Pivot to a movie × tag matrix
PCA expects a matrix. We pivot to shape:
- rows: movies
- columns: tagId
- values: relevance

Missing combinations are filled with 0. None are expected, since the Genome file stores a relevance score for every (movie, tag) pair it covers.


In [None]:
genome_mat = (
    genome_scores.pivot(index="movieId", columns="tagId", values="relevance")
    .fillna(0.0)
)

print("Genome matrix shape:", genome_mat.shape)
genome_mat.iloc[:3, :5]
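
Because `fillna(0.0)` is silent, it can help to count how many cells it would fill before applying it. The following is a small sketch on made-up data (the `toy_*` names are illustrative only), where one (movie, tag) pair is deliberately absent.

```python
import pandas as pd

# Toy long-format table: movie 3 has no score for tag 11
toy_scores = pd.DataFrame({
    "movieId":   [1, 1, 2, 2, 3],
    "tagId":     [10, 11, 10, 11, 10],
    "relevance": [0.9, 0.1, 0.4, 0.6, 0.2],
})

# Pivot first, inspect the gaps, and only then fill them
toy_wide = toy_scores.pivot(index="movieId", columns="tagId", values="relevance")
n_filled = int(toy_wide.isna().sum().sum())
toy_mat = toy_wide.fillna(0.0)

print("cells filled with 0:", n_filled)
print(toy_mat)
```

If the count is far from zero on the real data, a filled 0 no longer means "tag not relevant" but "score unknown", which may deserve different handling.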


## 4) Scaling choice
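Genome relevance values are already bounded in [0, 1], and scikit-learn's `PCA` centers each column itself, so the open question is only whether to rescale the tags.

- **Option A (default):** run PCA directly on relevance values. Tags with a wider spread of relevance contribute more to the leading components.
- **Option B:** standardize columns (tags) before PCA. Every tag gets unit variance, which also inflates tags that barely vary.

Both can work in practice. We'll implement **Option A** first (simpler, common for bounded features), and keep **Option B** as an optional toggle.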


In [None]:
USE_STANDARD_SCALER = False  # set True to standardize tag columns before PCA

X = genome_mat.values
if USE_STANDARD_SCALER:
    scaler = StandardScaler(with_mean=True, with_std=True)
    X = scaler.fit_transform(X)

print("X dtype:", X.dtype, "X shape:", X.shape)
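
To make the difference between the two options concrete, here is a minimal sketch on synthetic data (two made-up "tags", not Genome values). It shows that `PCA` subtracts column means on its own, and how standardizing changes which column the first component leans on.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic columns in [0, 1]: one spread out, one nearly constant
wide = rng.uniform(0.0, 1.0, size=500)
narrow = 0.5 + rng.uniform(-0.01, 0.01, size=500)
X_toy = np.column_stack([wide, narrow])

# Option A: raw values. PCA centers the columns but keeps their scales.
pca_raw = PCA(n_components=1).fit(X_toy)
# Option B: rescale every column to unit variance first.
pca_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X_toy))

raw_weights = np.abs(pca_raw.components_[0])
std_weights = np.abs(pca_std.components_[0])

print("PCA mean_ equals column means:", np.allclose(pca_raw.mean_, X_toy.mean(axis=0)))
print("Option A |loadings|:", raw_weights)
print("Option B |loadings|:", std_weights)
```

With raw values the first component is almost entirely the wide column; after standardizing, both columns carry equal weight, including the one that is close to noise.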


## 5) Fit PCA and pick number of components
We fit PCA and inspect explained variance to select a reasonable dimensionality (e.g., 50–200).


In [None]:
# Fit PCA with enough components to inspect explained variance
pca_probe = PCA(n_components=300, random_state=42)
pca_probe.fit(X)

explained = pca_probe.explained_variance_ratio_
cum_explained = np.cumsum(explained)

plt.figure(figsize=(7, 4))
plt.plot(np.arange(1, len(cum_explained) + 1), cum_explained)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("Genome PCA: Cumulative Explained Variance")
plt.grid(True)
plt.show()

for k in [10, 25, 50, 75, 100, 150, 200, 300]:
    print(f"k={k:3d} -> cum explained variance={cum_explained[k-1]:.4f}")
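
Instead of reading the value off the plot, the smallest number of components that reaches a target can be looked up directly from the cumulative curve. The sketch below assumes a hypothetical helper `smallest_k_for` and uses a short made-up variance profile; on the real data you would pass `cum_explained`.

```python
import numpy as np

def smallest_k_for(cum_explained, target):
    """Smallest k with cum_explained[k-1] >= target, or None if the curve never gets there."""
    cum = np.asarray(cum_explained, dtype=float)
    # The cumulative curve is non-decreasing, so a binary search is valid
    idx = int(np.searchsorted(cum, target, side="left"))
    return idx + 1 if idx < len(cum) else None

# Made-up per-component ratios; cumulative values are roughly 0.5, 0.7, 0.8, 0.85
toy_cum = np.cumsum([0.5, 0.2, 0.1, 0.05])
k75 = smallest_k_for(toy_cum, 0.75)
k99 = smallest_k_for(toy_cum, 0.99)

print("k for 75%:", k75)
print("k for 99%:", k99)
```

A `None` result means the probe did not fit enough components to reach the target. scikit-learn's `PCA` also accepts a float `n_components` between 0 and 1 (with the full SVD solver) to select the count by explained variance, which is an alternative to this lookup.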


## 6) Build final PCA embeddings
Choose `N_COMPONENTS` based on the plot (typical choices: 50, 100, 150).
We output a dataframe keyed by `movieId`, ready to merge into ratings.


In [None]:
N_COMPONENTS = 50  # adjust after inspecting explained variance

pca = PCA(n_components=N_COMPONENTS, random_state=42)
Z = pca.fit_transform(X)

genome_pca = pd.DataFrame(
    Z,
    index=genome_mat.index,
    columns=[f"gen_pca_{i}" for i in range(N_COMPONENTS)]
).reset_index()

print("genome_pca:", genome_pca.shape)
genome_pca.head()
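
One way to sanity-check a chosen dimensionality is to measure how much of the centered data is lost when projecting down and back. This is a sketch on synthetic low-rank data, not on the Genome matrix; `relative_reconstruction_error` is a hypothetical helper introduced for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 "movies", 30 "tags", driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 30))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(200, 30))

def relative_reconstruction_error(X, n_components):
    """Squared reconstruction error as a fraction of total centered variance."""
    pca_k = PCA(n_components=n_components, random_state=42)
    Z_k = pca_k.fit_transform(X)
    X_hat = pca_k.inverse_transform(Z_k)  # back to the original feature space
    X_centered = X - X.mean(axis=0)
    return float(np.sum((X - X_hat) ** 2) / np.sum(X_centered ** 2))

err_2 = relative_reconstruction_error(X_toy, 2)
err_3 = relative_reconstruction_error(X_toy, 3)
print(f"k=2 error: {err_2:.4f}")
print(f"k=3 error: {err_3:.6f}")
```

The error drops sharply once the number of components matches the true number of factors. For PCA this quantity equals one minus the cumulative explained variance ratio, so it should agree with the curve from the previous section.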


## 7) Persist embeddings
We save the PCA embeddings for reuse in other notebooks (fast reload).


In [None]:
out_path = f"genome_pca_{N_COMPONENTS}.parquet"
# to_parquet needs a Parquet engine (pyarrow or fastparquet) to be installed
genome_pca.to_parquet(out_path, index=False)
print("Saved:", out_path)


## 8) Optional: Inspect PCA components (tag loadings)
This helps interpret what each component captures semantically.
We list the strongest positive/negative loading tags for a selected component.


In [None]:
COMPONENT_ID = 0  # pick a component to inspect

loadings = pca.components_[COMPONENT_ID]
tag_map = genome_tags.set_index("tagId")["tag"].to_dict()
tag_ids = genome_mat.columns.to_numpy()

top_pos_idx = np.argsort(loadings)[-15:][::-1]
top_neg_idx = np.argsort(loadings)[:15]

print("Top + loadings (tagId, tag, loading):")
for i in top_pos_idx:
    tid = int(tag_ids[i])
    print(tid, tag_map.get(tid, None), float(loadings[i]))

print("\nTop - loadings (tagId, tag, loading):")
for i in top_neg_idx:
    tid = int(tag_ids[i])
    print(tid, tag_map.get(tid, None), float(loadings[i]))
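
If you want to inspect several components, the two loops above can be wrapped in a function. The sketch below uses a hypothetical helper `top_loading_tags` with made-up loadings and tag names. Keep in mind that the sign of a principal component is arbitrary, so the "positive" and "negative" ends describe opposite poles of one axis rather than good versus bad.

```python
import numpy as np

def top_loading_tags(loadings, tag_ids, tag_map, n=3):
    """Return (positive, negative) lists of (tagId, tag, loading), strongest first."""
    order = np.argsort(loadings)  # ascending by loading

    def describe(i):
        tid = int(tag_ids[i])
        return (tid, tag_map.get(tid), float(loadings[i]))

    positive = [describe(i) for i in order[::-1][:n]]
    negative = [describe(i) for i in order[:n]]
    return positive, negative

# Made-up example: five tags and one component's loadings
toy_loadings = np.array([0.1, -0.7, 0.6, 0.0, -0.2])
toy_tag_ids = np.array([1, 2, 3, 4, 5])
toy_tag_map = {1: "funny", 2: "dark", 3: "space", 4: "classic", 5: "slow"}

pos, neg = top_loading_tags(toy_loadings, toy_tag_ids, toy_tag_map, n=2)
print("positive end:", pos)
print("negative end:", neg)
```

With the notebook's variables, the equivalent call would be `top_loading_tags(pca.components_[COMPONENT_ID], tag_ids, tag_map, n=15)`.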


## 9) Optional: Merge PCA embeddings into ratings
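If you already have `ratings` loaded, you can merge movie embeddings by `movieId`.
With `how="left"`, ratings for movies that have no Genome scores keep their rows but get `NaN` in the embedding columns.
This step is safe with time splits because embeddings are static per movie.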


In [None]:
# Example (uncomment if ratings is available in this notebook):
# ratings = pd.read_parquet("train_prepared.parquet")  # or read ratings directly
# ratings = ratings.merge(genome_pca, on="movieId", how="left")
# ratings[["movieId"] + [c for c in genome_pca.columns if c.startswith("gen_pca_")]].head()
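
After a left merge it is worth checking how many ratings actually received an embedding. The following sketch uses small made-up frames (`toy_ratings`, `toy_pca`) in place of the real `ratings` and `genome_pca`; the `has_genome` column is an illustrative addition, not part of the original pipeline.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating":  [4.0, 3.5, 5.0],
})
# Movie 30 has no Genome embedding
toy_pca = pd.DataFrame({
    "movieId":   [10, 20],
    "gen_pca_0": [0.3, -0.1],
    "gen_pca_1": [0.0, 0.2],
})

# validate="many_to_one" raises if movieId is duplicated on the embedding side
merged = toy_ratings.merge(toy_pca, on="movieId", how="left", validate="many_to_one")

pca_cols = [c for c in merged.columns if c.startswith("gen_pca_")]
merged["has_genome"] = merged[pca_cols[0]].notna()
coverage = merged["has_genome"].mean()

print(merged)
print(f"share of ratings with an embedding: {coverage:.2%}")
```

How to handle the uncovered rows (drop them, impute zeros, or pass the indicator to the model) depends on the downstream model and is left open here.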
