<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# SAR Single Node on MovieLens (Python, CPU)

Simple Algorithm for Recommendation (SAR) is a fast and scalable algorithm for personalized recommendations based on user transaction history. It produces easily explainable and interpretable recommendations and handles "cold item" and "semi-cold user" scenarios. SAR is a neighborhood-based algorithm (as discussed in [Recommender Systems by Aggarwal](https://dl.acm.org/citation.cfm?id=2931100)) intended for ranking the top items for each user. More details about SAR can be found in the [deep dive notebook](../02_model_collaborative_filtering/sar_deep_dive.ipynb).

SAR recommends items that are most ***similar*** to the ones that the user already has an existing ***affinity*** for. Two items are ***similar*** if the users that interacted with one item are also likely to have interacted with the other. A user has an ***affinity*** to an item if they have interacted with it in the past.

### Advantages of SAR:
- High accuracy for an algorithm that is easy to train and deploy
- Fast training, requiring only simple counting to construct the matrices used at prediction time
- Fast scoring, involving only multiplication of the similarity matrix with an affinity vector
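
The counting and matrix-multiplication steps described above can be sketched on a toy example. This is a minimal illustration rather than the library's implementation: the 3x4 interaction matrix is made up, and Jaccard is used as just one possible item similarity.

```python
import numpy as np

# Toy user-item affinity matrix (made-up data): 3 users x 4 items, 1 = interacted.
affinity = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# Training is simple counting: entry (i, j) is the number of users
# who interacted with both item i and item j.
cooccurrence = affinity.T @ affinity

# Jaccard similarity: co-occurrence divided by the size of the union.
item_counts = np.diag(cooccurrence)
union = item_counts[:, None] + item_counts[None, :] - cooccurrence
similarity = cooccurrence / union

# Scoring is one matrix product: each user's affinity vector times the similarity matrix.
scores = affinity @ similarity

# Hide items the user has already interacted with, then pick the best remaining item.
scores[affinity > 0] = -np.inf
top_item_per_user = scores.argmax(axis=1)
print(top_item_per_user)
```

Masking already-seen items before ranking mirrors the usual top-k recommendation setting, where only new items are of interest.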

### Notes on using SAR properly:
- Since it does not use item or user features, it can be at a disadvantage against algorithms that do. ***This updated notebook example demonstrates the implementation of a hybrid approach that leverages generative AI to include item features as part of the recommendation results.***
- It's memory-hungry, requiring the creation of an $m \times m$ sparse square matrix (where $m$ is the number of items). This can also be a problem for many matrix factorization algorithms.
- SAR favors an implicit rating scenario and it does not predict ratings.
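
To get a feel for the memory note above, the following back-of-the-envelope sketch estimates the size of an $m \times m$ matrix if it were stored densely as 32-bit floats. This is an upper bound under that assumption; a sparse matrix stores only non-zero entries and is usually much smaller. The function name and the item counts are illustrative.

```python
def dense_similarity_bytes(num_items, bytes_per_value=4):
    """Bytes needed for a dense num_items x num_items matrix (float32 by default)."""
    return num_items * num_items * bytes_per_value

# Illustrative catalog sizes, not measurements from this notebook.
for m in (1_000, 10_000, 100_000):
    print(f"{m:>7} items -> {dense_similarity_bytes(m) / 1e9:.3f} GB")
```

The quadratic growth is the point: ten times more items means a hundred times more cells.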

This notebook provides an example of how to utilize and evaluate SAR in Python on a CPU.

## 🏗️ High-Level Architecture Overview

This notebook demonstrates a **Hybrid Recommendation System** that combines traditional collaborative filtering with modern AI-powered content understanding. The architecture consists of three main components:

```
┌────────────────────────────────────────────────────────────────────────────────┐
│                       HYBRID RECOMMENDATION ARCHITECTURE                       │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │   DATA LAYER    │    │  AI ENRICHMENT  │    │    RECOMMENDATION ENGINE    │ │
│  │                 │    │                 │    │                             │ │
│  │ • MovieLens     │───▶│ • Azure OpenAI  │───▶│ • SAR Collaborative Filter  │ │
│  │ • User ratings  │    │ • Web scraping  │    │ • Azure AI Search Vectors   │ │
│  │ • Movie catalog │    │ • Subgenres     │    │ • Hybrid Scoring            │ │
│  │ • TMDb links    │    │ • Overviews     │    │                             │ │
│  └─────────────────┘    └─────────────────┘    └─────────────────────────────┘ │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘
```

### Pipeline Stages:

| Stage | Component | Purpose |
|-------|-----------|---------|
| **1. Data Ingestion** | MovieLens Dataset | Load user-item interactions and movie metadata |
| **2. Catalog Enrichment** | TMDb + Azure OpenAI | Scrape movie descriptions, generate subgenres via LLM |
| **3. Embedding Generation** | Azure OpenAI Embeddings | Create semantic vectors from genres, subgenres, overviews |
| **4. Vector Indexing** | Azure AI Search | Store embeddings for fast similarity search |
| **5. Collaborative Filtering** | SAR Model | Train on user interactions for personalized predictions |
| **6. Hybrid Fusion** | Combined Scoring | Merge SAR predictions with cold-start content similarity |

### Key Innovation: Solving the Cold-Start Problem

Traditional collaborative filtering (SAR) cannot recommend items that have no user interaction history. This notebook addresses this limitation by:

1. **Content-based fallback**: Using semantic embeddings to find similar items based on content
2. **Hybrid scoring**: Blending SAR predictions with embedding similarity scores
3. **Guaranteed diversity**: Ensuring at least one "cold start" item appears in recommendations
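
The hybrid scoring and guaranteed cold-start slot listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's actual fusion logic: the function names, the weight `alpha`, and the rule of replacing the last slot are all hypothetical, and both score sets are assumed to be scaled to the range [0, 1] beforehand.

```python
def blend_scores(sar_scores, content_scores, alpha=0.7):
    """Weighted blend of two score dicts; an item missing from one source scores 0 there."""
    items = set(sar_scores) | set(content_scores)
    return {
        item: alpha * sar_scores.get(item, 0.0)
        + (1 - alpha) * content_scores.get(item, 0.0)
        for item in items
    }


def top_k_with_cold_item(blended, cold_items, k):
    """Top-k items by blended score, forcing at least one cold-start item when one exists."""
    ranked = sorted(blended, key=blended.get, reverse=True)
    top = ranked[:k]
    if top and not any(item in cold_items for item in top):
        best_cold = next((item for item in ranked if item in cold_items), None)
        if best_cold is not None:
            top[-1] = best_cold  # assumption: the weakest slot gives way
    return top


# Made-up scores: item "d" has no interaction history, so SAR cannot score it.
sar = {"a": 0.9, "b": 0.8, "c": 0.4}
content = {"c": 0.5, "d": 0.9}
print(top_k_with_cold_item(blend_scores(sar, content), cold_items={"d"}, k=2))
```

A larger `alpha` trusts the collaborative signal more; the forced slot keeps new items visible even when their blended score is low.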

# 0 Global Settings and Imports

This section loads the core libraries for:
- **Data manipulation**: `pandas`, `numpy`
- **Recommenders library**: SAR model, evaluation metrics, dataset loaders
- **External integrations**: Azure OpenAI, Azure AI Search (loaded in later cells)

In [None]:
import sys
import logging
import numpy as np
import pandas as pd
from sklearn.preprocessing import minmax_scale

from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.utils.python_utils import binarize
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.models.sar import SAR
from recommenders.evaluation.python_evaluation import (
    map,
    ndcg_at_k,
    precision_at_k,
    recall_at_k,
    rmse,
    mae,
    logloss,
    rsquared,
    exp_var
)
from recommenders.utils.notebook_utils import store_metadata

%load_ext autoreload
%autoreload 2

print(f"System version: {sys.version}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

# 1 Load Data

SAR is intended to be used on interactions with the following schema:
`<User ID>, <Item ID>, <Time>, [<Event Type>], [<Event Weight>]`.

Each row represents a single interaction between a user and an item. These interactions might be different types of events on an e-commerce website, such as a user clicking to view an item, adding it to a shopping basket, following a recommendation link, and so on. Each event type can be assigned a different weight. For example, we might assign a "buy" event a weight of 10, while a "view" event might only have a weight of 1.

The MovieLens dataset consists of well-formatted interactions in which users rate movies (the movie ratings are used as the event weight). We will use it for the rest of the example.
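
For data that arrives as raw events rather than ratings, the weighting idea above can be sketched as follows. The event log, the `event` column name, and the choice to sum weights per user-item pair are assumptions made for illustration; only the weights of 10 for "buy" and 1 for "view" come from the text.

```python
import pandas as pd

# Weights from the example in the text: a "buy" counts ten times as much as a "view".
EVENT_WEIGHTS = {"view": 1.0, "buy": 10.0}

# Made-up event log using the column names this notebook uses for MovieLens.
events = pd.DataFrame({
    "userID": [1, 1, 2],
    "itemID": [10, 10, 20],
    "timestamp": [100, 200, 300],
    "event": ["view", "buy", "view"],
})
events["rating"] = events["event"].map(EVENT_WEIGHTS)

# One simple aggregation choice (an assumption): sum the weights per user-item
# pair and keep the most recent timestamp.
interactions = events.groupby(["userID", "itemID"], as_index=False).agg(
    rating=("rating", "sum"),
    timestamp=("timestamp", "max"),
)
print(interactions)
```

The result has the same `userID`, `itemID`, `rating`, `timestamp` layout as the MovieLens frame loaded below.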

In [2]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, 20m, or latest-small
MOVIELENS_DATA_SIZE = "latest-small"

### 1.1 Download and use the MovieLens Dataset

In [3]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE
)

# Convert the float precision to 32-bit in order to reduce memory consumption 
data["rating"] = data["rating"].astype(np.float32)

data.head()

100%|██████████| 956/956 [00:00<00:00, 5.83kKB/s]


Unnamed: 0,userID,itemID,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### 1.2 Load Movie Catalog (Default Contains Genres Only)

In [4]:
# Load movie information (title, genres)
movies = movielens.load_item_df(
    size=MOVIELENS_DATA_SIZE,
    title_col="title",
    genres_col="genres"
)

movies.head()

100%|██████████| 956/956 [00:00<00:00, 5.13kKB/s]


Unnamed: 0,itemID,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


# 2 AI-Powered Catalog Enrichment 📚

### Why Enrich the Catalog?

The default MovieLens dataset only provides basic genre tags (e.g., "Action|Adventure"). To enable richer semantic similarity, we enhance the catalog with:

| Enhancement | Source | Purpose |
|-------------|--------|---------|
| **TMDb URLs** | links.csv | Connect to external movie database |
| **Plot Overviews** | Web scraping TMDb | Natural language movie descriptions |
| **Subgenres** | Azure OpenAI GPT-4o | Fine-grained classification (e.g., "Cyberpunk", "Heist") |

### Architecture for Catalog Enrichment

```
┌────────────────┐    ┌─────────────────┐    ┌──────────────────┐
│  MovieLens     │    │   TMDb Website  │    │  Azure OpenAI    │
│  links.csv     │───▶│   (Web Scrape)  │───▶│  GPT-4o          │
│                │    │                 │    │                  │
│  movieId →     │    │  Get overview   │    │  Extract:        │
│  tmdbId        │    │  content        │    │  - Clean summary │
│                │    │                 │    │  - Subgenres     │
└────────────────┘    └─────────────────┘    └──────────────────┘
                                                      │
                                                      ▼
                                             ┌──────────────────┐
                                             │  Enriched        │
                                             │  movies_sample   │
                                             │  DataFrame       │
                                             └──────────────────┘
```

### Subgenre Taxonomy

We define a structured subgenre taxonomy, mapped to the main genres, for consistent classification. It is stored in the `SUBGENRES` dictionary defined in section 2.2.

### 2.1 Add TMDb URLs from links.csv

In [5]:
# Load movie links (IMDb and TMDb IDs)
# Note: links.csv is only available in the latest-small

links = movielens.load_links_df(
        size=MOVIELENS_DATA_SIZE,
        movie_col="itemID"
    )

# Join movies with links and create TMDb URL column
movies_with_links = movies.merge(links, on="itemID", how="left")

# Create the TMDb URL column (handle missing tmdbId values)
movies_with_links["tmdburl"] = movies_with_links["tmdbId"].apply(
    lambda x: f"https://www.themoviedb.org/movie/{int(x)}" if pd.notna(x) else None
)

movies_with_links = movies_with_links[["itemID", "title", "genres", "tmdburl"]]

display(movies_with_links.head())



100%|██████████| 956/956 [00:00<00:00, 5.44kKB/s]


Unnamed: 0,itemID,title,genres,tmdburl
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,https://www.themoviedb.org/movie/862
1,2,Jumanji (1995),Adventure|Children|Fantasy,https://www.themoviedb.org/movie/8844
2,3,Grumpier Old Men (1995),Comedy|Romance,https://www.themoviedb.org/movie/15602
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,https://www.themoviedb.org/movie/31357
4,5,Father of the Bride Part II (1995),Comedy,https://www.themoviedb.org/movie/11862


### 2.2 Use GenAI to Extract Subgenres and Summaries

In [6]:

# Subgenres mapped to main genres for more detailed classification
SUBGENRES = {
    "Action": (
        "Martial Arts",
        "Spy",
        "Superhero",
        "Military Action",
        "Disaster",
    ),
    "Adventure": (
        "Exploration",
        "Survival",
        "Treasure Hunt",
        "Jungle",
        "Sea Adventure",
    ),
    "Animation": (
        "Anime",
        "CGI",
        "Stop Motion",
        "Hand Drawn",
        "Claymation",
    ),
    "Children's": (
        "Family",
        "Fairy Tale",
        "Coming of Age",
        "Educational",
        "Puppet",
    ),
    "Comedy": (
        "Romantic Comedy",
        "Slapstick",
        "Satire",
        "Parody",
        "Dark Comedy",
        "Screwball",
    ),
    "Crime": (
        "Heist",
        "Gangster",
        "Detective",
        "Legal Thriller",
        "True Crime",
    ),
    "Documentary": (
        "Nature",
        "Biographical",
        "Historical",
        "Social",
        "Sports Documentary",
    ),
    "Drama": (
        "Melodrama",
        "Psychological",
        "Family Drama",
        "Legal Drama",
        "Political Drama",
        "Medical Drama",
    ),
    "Fantasy": (
        "High Fantasy",
        "Urban Fantasy",
        "Dark Fantasy",
        "Fairy Tale Fantasy",
        "Mythological",
    ),
    "Film-Noir": (
        "Neo-Noir",
        "Tech-Noir",
        "Nordic Noir",
        "Psychological Noir",
    ),
    "Horror": (
        "Slasher",
        "Supernatural",
        "Psychological Horror",
        "Body Horror",
        "Found Footage",
        "Zombie",
        "Vampire",
    ),
    "Musical": (
        "Jukebox Musical",
        "Opera",
        "Dance Film",
        "Concert Film",
        "Backstage Musical",
    ),
    "Mystery": (
        "Whodunit",
        "Cozy Mystery",
        "Noir Mystery",
        "Paranormal Mystery",
        "Locked Room",
    ),
    "Romance": (
        "Period Romance",
        "Contemporary Romance",
        "Tragic Romance",
        "Romantic Drama",
        "Teen Romance",
    ),
    "Sci-Fi": (
        "Space Opera",
        "Cyberpunk",
        "Time Travel",
        "Dystopian",
        "Post-Apocalyptic",
        "Alien",
        "Hard Sci-Fi",
    ),
    "Thriller": (
        "Psychological Thriller",
        "Spy Thriller",
        "Action Thriller",
        "Erotic Thriller",
        "Techno Thriller",
    ),
    "War": (
        "World War I",
        "World War II",
        "Vietnam War",
        "Civil War",
        "Anti-War",
        "Military Drama",
    ),
    "Western": (
        "Spaghetti Western",
        "Revisionist Western",
        "Contemporary Western",
        "Comedy Western",
        "Epic Western",
    ),
}

In [None]:
# =============================================================================
# AI-POWERED MOVIE ENRICHMENT PIPELINE
# This cell demonstrates how to enhance movie metadata using web scraping
# and Azure OpenAI for intelligent content extraction and classification.
# =============================================================================

#!pip install openai requests beautifulsoup4 tqdm

import requests
from bs4 import BeautifulSoup      # HTML parsing library for web scraping
from openai import AzureOpenAI     # Azure's OpenAI client for GPT models
from tqdm import tqdm              # Progress bar for long-running loops
import time
import json

# -----------------------------------------------------------------------------
# AZURE OPENAI CONFIGURATION
# These credentials connect to Azure's hosted OpenAI service.
# In production, use environment variables or Azure Key Vault for secrets.
# -----------------------------------------------------------------------------
AZURE_OPENAI_ENDPOINT = "https://[yourazureopenairesource].openai.azure.com/"
AZURE_OPENAI_API_KEY = "your openai api key here"  # Replace with your key
AZURE_OPENAI_DEPLOYMENT = "gpt-4o"

# Initialize the Azure OpenAI client with API version and credentials
client = AzureOpenAI(
    api_version="2024-02-15-preview",
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
)

# -----------------------------------------------------------------------------
# PREPARE SUBGENRES LIST FOR LLM PROMPT
# Flatten the nested SUBGENRES dict into a single list of "Subgenre (Genre)"
# strings. This gives the LLM a constrained vocabulary for classification.
# -----------------------------------------------------------------------------
ALL_SUBGENRES = [
    f"{subgenre} ({genre})"
    for genre, subgenres in SUBGENRES.items()
    for subgenre in subgenres
]
SUBGENRES_LIST = ", ".join(ALL_SUBGENRES)

# -----------------------------------------------------------------------------
# WEB SCRAPING FUNCTION
# Fetches movie plot summaries from TMDb pages. Uses polite scraping practices:
# - Custom User-Agent header to identify our bot
# - Timeout to prevent hanging on slow responses
# - Graceful error handling with fallback to meta description
# -----------------------------------------------------------------------------
def scrape_tmdb_page(url):
    """Scrape the TMDb page content for movie overview text."""
    if not url:
        return None
    try:
        headers = {"User-Agent": "Mozilla/5.0 (compatible; MovieBot/1.0)"}
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            # Primary target: the overview div contains the plot summary
            overview_div = soup.find("div", class_="overview")
            if overview_div:
                return overview_div.get_text(strip=True)
            # Fallback: meta description often has a short summary
            meta = soup.find("meta", attrs={"name": "description"})
            if meta:
                return meta.get("content", "")
        return None
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

# -----------------------------------------------------------------------------
# LLM-POWERED CONTENT EXTRACTION & CLASSIFICATION
# Uses GPT-4o to:
# 1. Summarize raw scraped content into a clean 2-3 sentence overview
# 2. Classify the movie into relevant subgenres from our predefined taxonomy
# 
# Key techniques demonstrated:
# - System prompt with structured output format (JSON mode)
# - Temperature=0 for deterministic, reproducible results
# - Token limits to control cost and response length
# -----------------------------------------------------------------------------
def extract_overview_and_subgenres(url, title, genres, scraped_content=None):
    """Use Azure OpenAI to extract plot overview and assign subgenres."""
    if not url and not scraped_content:
        return None, None
    
    # Use provided content or scrape fresh
    content = scraped_content or scrape_tmdb_page(url)
    if not content:
        return None, None
    
    try:
        response = client.chat.completions.create(
            model=AZURE_OPENAI_DEPLOYMENT,
            messages=[
                {
                    "role": "system",
                    "content": f"""You are a movie classification assistant. Given movie information, you must:
1. Extract or summarize the plot overview (2-3 sentences max)
2. Assign 1-3 relevant subgenres from this list: {SUBGENRES_LIST}

Respond in JSON format only:
{{"overview": "plot summary here", "subgenres": ["Subgenre1", "Subgenre2"]}}"""
                },
                {
                    "role": "user", 
                    "content": f"""Movie: {title}
Main Genres: {genres}
Content from TMDb: {content[:2000]}

Extract the overview and assign appropriate subgenres."""
                }
            ],
            max_tokens=400,
            temperature=0,  # Deterministic output for reproducibility
            response_format={"type": "json_object"}  # Enforce JSON response
        )
        
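        # Parse the structured JSON response. `message.content` can be None
        # (for example, if the service returns no text for a reply); the
        # resulting AttributeError is caught below and the raw scraped text
        # is returned instead.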
        result = json.loads(response.choices[0].message.content.strip())
        overview = result.get("overview", "")
        subgenres = result.get("subgenres", [])
        
        # Format subgenres as pipe-separated string (matches MovieLens format)
        subgenres_str = "|".join(subgenres) if subgenres else None
        return overview, subgenres_str
        
    except Exception as e:
        print(f"Error with Azure OpenAI for {title}: {e}")
        return content, None  # Graceful fallback: return raw scraped content

# -----------------------------------------------------------------------------
# BATCH PROCESSING WITH RATE LIMITING
# Process a sample of movies to demonstrate the pipeline. In production:
# - Use async/await for parallel processing
# - Implement exponential backoff for API errors
# - Consider caching results to avoid redundant API calls
# -----------------------------------------------------------------------------
MAX_RECORDS = 200  # Limit for demo purposes (API costs add up!)
movies_sample = movies_with_links.head(MAX_RECORDS).copy()

overviews = []
subgenres_list = []

for idx, row in tqdm(movies_sample.iterrows(), total=len(movies_sample), 
                     desc="Extracting overviews & subgenres"):
    overview, subgenres = extract_overview_and_subgenres(
        row["tmdburl"], row["title"], row["genres"]
    )
    overviews.append(overview)
    subgenres_list.append(subgenres)
    time.sleep(0.5)  # Rate limit: 2 requests/sec to respect API limits

# Add enriched columns to our sample DataFrame
movies_sample["overview"] = overviews
movies_sample["subgenres"] = subgenres_list

# Preview the enriched data
display(movies_sample[["itemID", "title", "genres", "subgenres", "overview"]].head(10))

Extracting overviews & subgenres:  18%|█▊        | 35/200 [01:06<06:49,  2.48s/it]

Error with Azure OpenAI for Clueless (1995): 'NoneType' object has no attribute 'strip'


Extracting overviews & subgenres:  74%|███████▎  | 147/200 [06:05<02:37,  2.98s/it]

Error with Azure OpenAI for Kids (1995): 'NoneType' object has no attribute 'strip'


Extracting overviews & subgenres: 100%|██████████| 200/200 [08:20<00:00,  2.50s/it]


Unnamed: 0,itemID,title,genres,subgenres,overview
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,CGI|Family|Urban Fantasy,"Woody, a toy cowboy, feels threatened when Buz..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,Treasure Hunt|Family|High Fantasy,Siblings Judy and Peter discover an enchanted ...
2,3,Grumpier Old Men (1995),Comedy|Romance,Romantic Comedy|Slapstick,A family wedding reignites the feud between ne...
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Romantic Drama|Melodrama,"Four women, Vannah, Bernie, Glo, and Robin, na..."
4,5,Father of the Bride Part II (1995),Comedy,Romantic Comedy|Family,George Banks is shocked to learn that both his...
5,6,Heat (1995),Action|Crime|Thriller,Heist|Action Thriller,Master thief Neil McCauley leads a skilled cre...
6,7,Sabrina (1995),Comedy|Romance,Romantic Comedy,"After returning from school in Paris, Sabrina,..."
7,8,Tom and Huck (1995),Adventure|Children,Coming of Age|Treasure Hunt,Tom Sawyer witnesses a murder and befriends Hu...
8,9,Sudden Death (1995),Action,Action Thriller,A man's daughter is kidnapped during a champio...
9,10,GoldenEye (1995),Action|Adventure|Thriller,Spy (Action)|Action Thriller (Thriller),"James Bond must stop his former ally, Alec Tre..."
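
In the preview above, most rows carry bare subgenre labels, but the GoldenEye row keeps the `(Genre)` qualifier from the prompt vocabulary (`Spy (Action)`). A small post-processing step could make the labels consistent. The sketch below is one possible approach, not part of the original pipeline; `normalize_subgenres` and `KNOWN_SUBGENRES` are hypothetical names, and the set shown is only a small subset of the taxonomy.

```python
import re

# Hypothetical subset of the values in the notebook's SUBGENRES dictionary.
KNOWN_SUBGENRES = {"Spy", "Action Thriller", "Heist", "Romantic Comedy"}


def normalize_subgenres(raw, known=KNOWN_SUBGENRES):
    """Strip a trailing ' (Genre)' qualifier and drop labels outside the taxonomy."""
    if not raw:
        return None
    cleaned = []
    for label in raw.split("|"):
        # Remove a parenthesized suffix such as " (Action)" at the end of the label.
        label = re.sub(r"\s*\([^)]*\)\s*$", "", label).strip()
        if label in known and label not in cleaned:
            cleaned.append(label)
    return "|".join(cleaned) if cleaned else None


print(normalize_subgenres("Spy (Action)|Action Thriller (Thriller)"))
```

With the full taxonomy, `known` could be built by flattening every tuple in `SUBGENRES` into one set.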


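The comments in the batch-processing cell mention exponential backoff for API errors. A minimal sketch of that idea follows. `call_with_backoff` and its parameters are hypothetical names, not part of the notebook or of the OpenAI SDK, and catching every `Exception` is a simplification; a real pipeline would likely retry only on rate-limit or transient network errors.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; random jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Because `extract_overview_and_subgenres` catches exceptions itself, a wrapper like this would need to go around the `client.chat.completions.create` call inside it to have any effect.
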
# 3 Embedding Generation & Vector Indexing 🔢

### What Are Embeddings?

Embeddings are dense numerical vectors that capture the semantic meaning of text. Similar content produces vectors that are close together in high-dimensional space, enabling similarity search without exact keyword matching.
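
To make "close together" concrete, the sketch below computes cosine similarity between made-up three-dimensional vectors. Real embeddings in this notebook have 1536 dimensions, and the vector names and values here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings": two heist films and one musical.
heist_a = [0.9, 0.1, 0.0]
heist_b = [0.8, 0.2, 0.1]
musical = [0.0, 0.2, 0.9]

print(cosine_similarity(heist_a, heist_b))  # close to 1: similar direction
print(cosine_similarity(heist_a, musical))  # close to 0: nearly orthogonal
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common metric for text embeddings.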

### Embedding Pipeline Architecture

```
┌────────────────────────────────────────────────────────────────────────────┐
│                       EMBEDDING GENERATION PIPELINE                        │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│   For each movie:                                                          │
│   ┌──────────────────┐                                                     │
│   │ Combine Text:    │                                                     │
│   │ • Genres         │     ┌─────────────────┐     ┌──────────────────┐    │
│   │ • Subgenres      │────▶│ Azure OpenAI    │────▶│ 1536-dim Vector  │    │
│   │ • Overview       │     │ ada-002         │     │ (embedding)      │    │
│   └──────────────────┘     └─────────────────┘     └──────────────────┘    │
│                                                                            │
│   Example combined text:                                                   │
│   "Genres: Action|Sci-Fi | Subgenres: Cyberpunk|Dystopian |                │
│    Overview: In a dark future, a hacker discovers..."                      │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
```

### Azure AI Search Index Schema

The embeddings are stored in Azure AI Search for fast vector similarity search:

| Field | Type | Purpose |
|-------|------|---------|
| `itemID` | String (Key) | Unique movie identifier |
| `title` | Searchable | Movie title for display |
| `genres` | Searchable | Main genre categories |
| `subgenres` | Searchable | AI-generated subgenres |
| `overview` | Searchable | Plot summary |
| `tmdburl` | String | Link to the movie's TMDb page |
| `embedding` | Vector (1536-dim) | Semantic embedding for similarity |

### Vector Search Configuration

- **Algorithm**: HNSW (Hierarchical Navigable Small World)
- **Dimensions**: 1536 (OpenAI ada-002 output size)
- **Similarity Metric**: Cosine similarity

In [None]:
#!pip install azure-search-documents

# Generate OpenAI embeddings and insert into Azure AI Search
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
)
from azure.core.credentials import AzureKeyCredential

# Azure AI Search configuration
AZURE_SEARCH_ENDPOINT = "https://[yourazuresearchservice].search.windows.net"
AZURE_SEARCH_API_KEY = "your search api key here"
AZURE_SEARCH_INDEX_NAME = "movies-recommendations"

# Azure OpenAI Embedding model configuration
AZURE_OPENAI_EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"  # or your embedding deployment name

def get_embedding(text):
    """Generate embedding for a text using Azure OpenAI"""
    if not text or pd.isna(text):
        return None
    try:
        response = client.embeddings.create(
            model=AZURE_OPENAI_EMBEDDING_DEPLOYMENT,
            input=text[:8000]  # Truncate to 8,000 characters as a rough guard against the model's token limit
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None

def create_combined_text(row):
    """Combine genres, subgenres, and overview into a single text for embedding"""
    parts = []
    if row.get("genres") and pd.notna(row["genres"]):
        parts.append(f"Genres: {row['genres']}")
    if row.get("subgenres") and pd.notna(row["subgenres"]):
        parts.append(f"Subgenres: {row['subgenres']}")
    if row.get("overview") and pd.notna(row["overview"]):
        parts.append(f"Overview: {row['overview']}")
    return " | ".join(parts) if parts else None

# Generate embeddings for each movie
print("Generating embeddings for movies...")
embeddings = []

for idx, row in tqdm(movies_sample.iterrows(), total=len(movies_sample), desc="Generating embeddings"):
    combined_text = create_combined_text(row)
    embedding = get_embedding(combined_text)
    embeddings.append(embedding)
    time.sleep(0.1)  # Rate limiting for embedding API

movies_sample["embedding"] = embeddings

# Filter out rows without embeddings
movies_with_embeddings = movies_sample[movies_sample["embedding"].notna()].copy()
print(f"Generated embeddings for {len(movies_with_embeddings)} movies")

# Create Azure AI Search index
search_index_client = SearchIndexClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    credential=AzureKeyCredential(AZURE_SEARCH_API_KEY)
)

# Define the index schema with vector search
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="hnsw-config")
    ],
    profiles=[
        VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw-config"
        )
    ]
)

index = SearchIndex(
    name=AZURE_SEARCH_INDEX_NAME,
    fields=[
        SimpleField(name="itemID", type=SearchFieldDataType.String, key=True),
        SearchableField(name="title", type=SearchFieldDataType.String),
        SearchableField(name="genres", type=SearchFieldDataType.String),
        SearchableField(name="subgenres", type=SearchFieldDataType.String),
        SearchableField(name="overview", type=SearchFieldDataType.String),
        SimpleField(name="tmdburl", type=SearchFieldDataType.String),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,  # ada-002 embedding dimensions
            vector_search_profile_name="vector-profile"
        )
    ],
    vector_search=vector_search
)

# Create or update the index
try:
    search_index_client.create_or_update_index(index)
    print(f"Index '{AZURE_SEARCH_INDEX_NAME}' created/updated successfully")
except Exception as e:
    print(f"Error creating index: {e}")

# Prepare documents for upload
documents = []
for idx, row in movies_with_embeddings.iterrows():
    doc = {
        "itemID": str(row["itemID"]),
        "title": row["title"] if pd.notna(row["title"]) else "",
        "genres": row["genres"] if pd.notna(row["genres"]) else "",
        "subgenres": row["subgenres"] if pd.notna(row["subgenres"]) else "",
        "overview": row["overview"] if pd.notna(row["overview"]) else "",
        "tmdburl": row["tmdburl"] if pd.notna(row["tmdburl"]) else "",
        "embedding": row["embedding"]
    }
    documents.append(doc)

# Upload documents to Azure AI Search
search_client = SearchClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    index_name=AZURE_SEARCH_INDEX_NAME,
    credential=AzureKeyCredential(AZURE_SEARCH_API_KEY)
)

# Upload in batches of 100
batch_size = 100
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    try:
        result = search_client.upload_documents(documents=batch)
        succeeded = sum(1 for r in result if r.succeeded)
        print(f"Uploaded batch {i//batch_size + 1}: {succeeded}/{len(batch)} documents succeeded")
    except Exception as e:
        print(f"Error uploading batch {i//batch_size + 1}: {e}")

print(f"\nTotal documents uploaded to Azure AI Search: {len(documents)}")


In [66]:
display(movies_with_embeddings[["itemID", "title", "genres", "subgenres", "overview", "embedding"]].head())

Unnamed: 0,itemID,title,genres,subgenres,overview,embedding
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,CGI|Family|Urban Fantasy,"Woody, a toy cowboy, feels threatened when Buz...","[-0.00462139630690217, -0.04702836275100708, -..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,Treasure Hunt|Family|High Fantasy,Siblings Judy and Peter discover an enchanted ...,"[0.022550068795681, -0.041529927402734756, -0...."
2,3,Grumpier Old Men (1995),Comedy|Romance,Romantic Comedy|Slapstick,A family wedding reignites the feud between ne...,"[-0.0032039934303611517, -0.01420772448182106,..."
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Romantic Drama|Melodrama,"Four women, Vannah, Bernie, Glo, and Robin, na...","[-0.019989266991615295, -0.040414195507764816,..."
4,5,Father of the Bride Part II (1995),Comedy,Romantic Comedy|Family,George Banks is shocked to learn that both his...,"[-0.013063669204711914, -0.007676921784877777,..."


### 3.1 Vector Similarity Search Function

The function below enables finding semantically similar movies using the embeddings stored in Azure AI Search. It:

1. Retrieves the embedding vector for the query item from the local DataFrame
2. Executes a vector search using the HNSW algorithm
3. Returns the top-N most similar items with their similarity scores

In [18]:
# Vector similarity search function to find similar movies
from azure.search.documents.models import VectorizedQuery

def find_similar_items(item_id, top_n=5):
    """
    Find the top N most similar items to a given item using cosine similarity search.
    
    Args:
        item_id: The ID of the item to find similar items for
        top_n: Number of similar items to return (default: 5)
    
    Returns:
        DataFrame with similar items and their similarity scores
    """
    # First, get the embedding from our local DataFrame (more reliable)
    item_row = movies_with_embeddings[movies_with_embeddings["itemID"] == item_id]
    
    if item_row.empty:
        print(f"Item {item_id} not found in movies_with_embeddings")
        return None
    
    item_embedding = item_row.iloc[0]["embedding"]
    item_title = item_row.iloc[0]["title"]
    
    if item_embedding is None or (isinstance(item_embedding, list) and len(item_embedding) == 0):
        print(f"No embedding found for item {item_id}")
        return None
    
    # Display source item details
    overview = item_row.iloc[0]["overview"]
    if isinstance(overview, str) and len(overview) > 150:
        overview = overview[:150] + "..."  # truncate long overviews for display
    print("Finding items similar to:")
    print(f"  Title:     {item_title}")
    print(f"  Item ID:   {item_id}")
    print(f"  Genres:    {item_row.iloc[0]['genres']}")
    print(f"  Subgenres: {item_row.iloc[0]['subgenres']}")
    print(f"  Overview:  {overview}")
    print("-" * 60)
    
    # Perform vector search using the item's embedding
    vector_query = VectorizedQuery(
        vector=item_embedding,
        k_nearest_neighbors=top_n + 1,  # +1 to account for the item itself
        fields="embedding"
    )
    
    try:
        results = search_client.search(
            search_text=None,
            vector_queries=[vector_query],
            select=["itemID", "title", "genres", "subgenres", "overview"]
        )
        
        similar_items = []
        for result in results:
            # Skip the query item itself
            if str(result["itemID"]) == str(item_id):
                continue
            
            similar_items.append({
                "itemID": result["itemID"],
                "title": result["title"],
                "genres": result["genres"],
                "subgenres": result["subgenres"],
                "overview": result["overview"][:100] + "..." if result["overview"] and len(result["overview"]) > 100 else result["overview"],
                "similarity_score": result["@search.score"]
            })
            
            if len(similar_items) >= top_n:
                break
        
        return pd.DataFrame(similar_items)
        
    except Exception as e:
        print(f"Error performing similarity search: {e}")
        return None



In [59]:
example_item_id = 32
similar_movies = find_similar_items(example_item_id, top_n=5)

if similar_movies is not None:
    print(f"\nTop 5 movies similar to item {example_item_id}:")
    display(similar_movies)

Finding items similar to:
  Title:     Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
  Item ID:   32
  Genres:    Mystery|Sci-Fi|Thriller
  Subgenres: Time Travel|Dystopian|Psychological Thriller
  Overview:  In 2035, convict James Cole is sent back in time to discover the origin of a deadly virus that wiped out most of the earth's population. Mistakenly se...
------------------------------------------------------------




Top 5 movies similar to item 32:


Unnamed: 0,itemID,title,genres,subgenres,overview,similarity_score
0,170,Hackers (1995),Action|Adventure|Crime|Thriller,Techno Thriller|Action Thriller,"A teenager and his friends, banned from using ...",0.879908
1,172,Johnny Mnemonic (1995),Action|Sci-Fi|Thriller,Cyberpunk|Action Thriller|Dystopian,"In a dystopian 2021, Johnny, a data trafficker...",0.877888
2,22,Copycat (1995),Crime|Drama|Horror|Mystery|Thriller,Psychological Thriller|Detective|Psychological...,An agoraphobic psychologist and a female detec...,0.87323
3,160,Congo (1995),Action|Adventure|Mystery|Sci-Fi,Exploration|Jungle,Eight people embark on an expedition into the ...,0.872568
4,29,"City of Lost Children, The (Cité des enfants ...",Adventure|Drama|Fantasy|Mystery|Sci-Fi,Dark Fantasy|Psychological Noir|Urban Fantasy,A scientist in a surrealist society kidnaps ch...,0.871135


### 3.2 How Vector Similarity Search Works

The `find_similar_items()` function performs the following:

1. **Retrieve source embedding**: Get the embedding vector for the query item from the local DataFrame
2. **Execute vector search**: Send the embedding to Azure AI Search for a k-nearest-neighbor search (requesting `top_n + 1` neighbors, because the query item itself is usually returned)
3. **Return ranked results**: Items are sorted by the `@search.score` that Azure AI Search derives from cosine similarity (higher = more similar)

```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────────┐
│  Query Item     │     │  Azure AI Search     │     │  Similar Items      │
│  (itemID: 32)   │────▶│  Vector Search       │────▶│  Ranked by          │
│                 │     │  (HNSW algorithm)    │     │  Cosine Similarity  │
│  embedding:     │     │                      │     │                     │
│  [0.12, -0.3,   │     │  k=5 nearest         │     │  1. Item 45 (0.92)  │
│   0.87, ...]    │     │  neighbors           │     │  2. Item 78 (0.89)  │
└─────────────────┘     └──────────────────────┘     │  3. Item 12 (0.85)  │
                                                     │  ...                │
                                                     └─────────────────────┘
```
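
To make the nearest-neighbor step concrete, here is a minimal sketch of an exact (brute-force) cosine search over a few made-up 3-dimensional vectors. It is an illustration only: Azure AI Search uses an approximate HNSW index rather than scanning every vector, and `toy_embeddings`, `cosine_similarity`, and `top_n_similar` are hypothetical names that do not appear in the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_similar(query_id, embeddings, top_n=2):
    """Score every other item against the query and keep the best top_n."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id  # skip the query item itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up embeddings, purely for illustration
toy_embeddings = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.9, 0.1, 0.0],
    "C": [0.0, 1.0, 0.0],
    "D": [0.7, 0.7, 0.0],
}
neighbors = top_n_similar("A", toy_embeddings, top_n=2)
print(neighbors)
```

An exact scan like this costs one comparison per indexed item, which is why a service-side approximate index becomes attractive as the catalog grows.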

# 4 Train the SAR Model

### 4.1 Split the Data

We split the full dataset into a `train` and `test` dataset to evaluate performance of the algorithm against a held-out set not seen during training. Because SAR generates recommendations based on user preferences, all users that are in the test set must also exist in the training set. For this case, we can use the provided `python_stratified_split` function which holds out a percentage (in this case 25%) of items from each user, but ensures all users are in both `train` and `test` datasets. Other options are available in the `dataset.python_splitters` module which provide more control over how the split occurs.
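
The idea behind a per-user stratified split can be sketched in plain Python. This is not the `python_stratified_split` implementation; `stratified_split_by_user` is a hypothetical helper, run on toy data, that holds out roughly 25% of each user's interactions so that every user appears in both sets.

```python
import random
from collections import defaultdict

def stratified_split_by_user(interactions, ratio=0.75, seed=42):
    """Split (user, item) pairs so each user keeps about `ratio` of their items in train."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    train, test = [], []
    for user, items in by_user.items():
        items = items[:]          # copy before shuffling
        rng.shuffle(items)
        cut = round(len(items) * ratio)
        train += [(user, i) for i in items[:cut]]
        test += [(user, i) for i in items[cut:]]
    return train, test

# Toy data: two users with eight interactions each
toy = [(u, i) for u in (1, 2) for i in range(8)]
toy_train, toy_test = stratified_split_by_user(toy)
print(len(toy_train), len(toy_test))
```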

In [20]:
train, test = python_stratified_split(data, ratio=0.75, col_user="userID", col_item="itemID", seed=42)

In [21]:
print("""
Train:
Total Ratings: {train_total}
Unique Users: {train_users}
Unique Items: {train_items}

Test:
Total Ratings: {test_total}
Unique Users: {test_users}
Unique Items: {test_items}
""".format(
    train_total=len(train),
    train_users=len(train['userID'].unique()),
    train_items=len(train['itemID'].unique()),
    test_total=len(test),
    test_users=len(test['userID'].unique()),
    test_items=len(test['itemID'].unique()),
))


Train:
Total Ratings: 75630
Unique Users: 610
Unique Items: 8757

Test:
Total Ratings: 25206
Unique Users: 610
Unique Items: 5641



### 4.2 Instantiate the SAR Algorithm

We use the single-node implementation of SAR and specify column names that match our dataset. The timestamp column is optional: it is used here for time decay and can be omitted if your dataset does not contain one.

Other options are specified to control the behavior of the algorithm as described in the [deep dive notebook](../02_model_collaborative_filtering/sar_deep_dive.ipynb).
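
As a rough illustration of what `time_decay_coefficient=30` means, the sketch below assumes the half-life form of time decay described in the deep dive notebook: an event's weight halves for every 30 days between its timestamp and a reference time `t0` (for example, the latest timestamp in the data). The function `time_decayed_affinity` and the sample events are hypothetical and are not part of the library.

```python
def time_decayed_affinity(events, t0, half_life_days=30):
    """Sum ratings weighted by 0.5 ** (age / half-life).

    events: list of (rating, timestamp_in_seconds) for one user-item pair.
    """
    half_life_seconds = half_life_days * 24 * 60 * 60
    return sum(
        rating * 0.5 ** ((t0 - ts) / half_life_seconds)
        for rating, ts in events
    )

day = 24 * 60 * 60
now = 100 * day
# Three identical ratings: today, 30 days ago, and 60 days ago
events = [(4.0, now), (4.0, now - 30 * day), (4.0, now - 60 * day)]
affinity = time_decayed_affinity(events, t0=now)
print(affinity)
```

Under this assumption the three ratings contribute 4, 2, and 1 respectively, so older interactions still count but with progressively less weight.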

In [22]:
logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

model = SAR(
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type="jaccard", 
    time_decay_coefficient=30, 
    timedecay_formula=True,
    normalize=True
)

### 4.3 Train the SAR Model and Get Top-K Recommendations

SAR first computes an item-to-item ***co-occurrence matrix***. Co-occurrence is the number of users for whom two items appear together in the interaction history. Once we have the co-occurrence matrix, we compute an ***item similarity matrix*** by rescaling the co-occurrences with a given metric (Jaccard similarity in this example). 

We also compute an ***affinity matrix*** to capture the strength of the relationship between each user and each item. Affinity is driven by different types of events (like *rating* or *viewing* a movie) and by the time of the event. 

Recommendations are achieved by multiplying the affinity matrix $A$ and the similarity matrix $S$. The result is a ***recommendation score matrix*** $R$. We compute the ***top-k*** results for each user in the `recommend_k_items` function seen below.

A full walkthrough of the SAR algorithm can be found [here](../02_model_collaborative_filtering/sar_deep_dive.ipynb).
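
The steps above can be sketched on a toy dataset in plain Python. This is a simplified illustration, not the library's implementation: it uses binary affinities, ignores time decay and score normalization, and all names (`user_items`, `cooc`, `sim`, `scores`) are made up for the example.

```python
# Toy interaction history: user -> set of items they interacted with
user_items = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
}
items = sorted({i for seen in user_items.values() for i in seen})

# Co-occurrence c_ij: number of users who interacted with both i and j
# (the diagonal c_ii is simply the number of users who interacted with i)
cooc = {
    (i, j): sum(1 for seen in user_items.values() if i in seen and j in seen)
    for i in items for j in items
}

# Jaccard similarity: s_ij = c_ij / (c_ii + c_jj - c_ij)
sim = {
    (i, j): cooc[i, j] / (cooc[i, i] + cooc[j, j] - cooc[i, j])
    for i in items for j in items
}

# Affinity vector for user u1 (1.0 per interaction in this simplified sketch)
affinity = {i: (1.0 if i in user_items["u1"] else 0.0) for i in items}

# Recommendation score for item j: sum over i of affinity[i] * sim[i, j]
scores = {j: sum(affinity[i] * sim[i, j] for i in items) for j in items}

# Remove seen items and rank what is left
unseen = {j: s for j, s in scores.items() if j not in user_items["u1"]}
ranked = sorted(unseen.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Here user `u1` has not seen item `C`, which scores `sim[A, C] + sim[B, C] = 1/3 + 2/3` because both of `u1`'s items co-occur with it.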

In [23]:
with Timer() as train_time:
    model.fit(train)

print("Took {} seconds for training.".format(train_time.interval))

2026-01-18 21:57:05,299 INFO     Collecting user affinity matrix
2026-01-18 21:57:05,308 INFO     Calculating time-decayed affinities
2026-01-18 21:57:05,406 INFO     Creating index columns
2026-01-18 21:57:05,539 INFO     Calculating normalization factors
2026-01-18 21:57:05,604 INFO     Building user affinity sparse matrix
2026-01-18 21:57:05,613 INFO     Calculating item co-occurrence
2026-01-18 21:57:08,429 INFO     Calculating item similarity
2026-01-18 21:57:08,431 INFO     Using jaccard based similarity
2026-01-18 21:57:17,024 INFO     Done training


Took 11.768720834001215 seconds for training.


In [24]:
with Timer() as test_time:
    top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)

print("Took {} seconds for prediction.".format(test_time.interval))

2026-01-18 21:57:23,676 INFO     Calculating recommendation scores
2026-01-18 21:57:26,549 INFO     Removing seen items


Took 3.009700905999125 seconds for prediction.


In [25]:
top_k.head()

Unnamed: 0,userID,itemID,prediction
0,1,2115,3.997127
1,1,1391,3.781131
2,1,2918,3.690191
3,1,1089,3.686759
4,1,2194,3.632493


# 5 Hybrid Recommendation Fusion 🔀

### The Cold-Start Problem

Collaborative filtering algorithms like SAR can only recommend items that have prior user interactions. New items (cold items) have no interaction history and cannot be recommended. This is a critical limitation for:

- Newly added content (movies, products, articles)
- Niche items with few ratings
- Long-tail inventory

### Hybrid Solution Architecture

The `get_hybrid_recommendations()` function combines SAR predictions with embedding-based similarity:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HYBRID RECOMMENDATION PIPELINE                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  User 31                                                                    │
│     │                                                                       │
│     ▼                                                                       │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                     SAR COLLABORATIVE FILTER                         │   │
│  │  • Get top-k predictions based on user's historical interactions     │   │
│  │  • Scores reflect learned user preferences                           │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│     │                                                                       │
│     │ SAR Recommendations                                                   │
│     ▼                                                                       │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                    EMBEDDING SIMILARITY SEARCH                       │   │
│  │  For each SAR item with an embedding:                                │   │
│  │  • Query Azure AI Search for similar items                           │   │
│  │  • These may include "cold" items not in SAR's training data         │   │
│  │  • Score = SAR_score × similarity × 0.8                              │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│     │                                                                       │
│     │ Combined Candidates                                                   │
│     ▼                                                                       │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                      HYBRID SCORING & FUSION                         │   │
│  │  • Deduplicate: Items from both sources get "SAR + Similar to..."    │   │
│  │  • Guarantee: At least 1 cold-start item in final top-k              │   │
│  │  • Sort: Final ranking by hybrid prediction score                    │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│     │                                                                       │
│     ▼                                                                       │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  Final Recommendations (Top 15)                                      │   │
│  │  ┌────────┬────────────────────────────────────────────────────────┐ │   │
│  │  │ Source │ Description                                            │ │   │
│  │  ├────────┼────────────────────────────────────────────────────────┤ │   │
│  │  │ SAR    │ Pure collaborative filtering prediction                │ │   │
│  │  │ Cold   │ Content-similar to a SAR item, but new to the user     │ │   │
│  │  │ Both   │ Recommended by SAR AND found via embedding similarity  │ │   │
│  │  └────────┴────────────────────────────────────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
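
The scoring rule shown in the diagram can be written as a one-line function. The 0.8 factor is the constant used in `get_hybrid_recommendations()`; the name `hybrid_score` and the sample numbers below are illustrative assumptions only.

```python
def hybrid_score(sar_score, similarity, discount=0.8):
    """Score for an item found via embedding similarity to a SAR seed item."""
    return sar_score * similarity * discount

seed_prediction = 3.0  # SAR score of the seed item (illustrative value)
similarity = 0.9       # search score between seed and candidate (illustrative value)
candidate_score = hybrid_score(seed_prediction, similarity)
print(round(candidate_score, 2))
```

As long as the similarity is at most 1, a similar item always scores below its seed, so embedding-derived items tend to sit below the SAR items in a purely score-sorted list. That is the motivation for reserving slots for them in the fusion step.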

### Key Features of the Hybrid Approach

| Feature | Implementation | Benefit |
|---------|----------------|---------|
| **Seed Selection** | Use SAR items that have embeddings | Grounds similarity in user preferences |
| **Overlap Detection** | Track items appearing in both sources | Identify strongly reinforced recommendations |
| **Diversity Guarantee** | Reserve slots for cold-start items | Ensures discovery of new content |
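
The slot-reservation idea behind the diversity guarantee can be sketched independently of pandas. This is a simplified, hypothetical helper (`fuse_with_reserved_slots`, with a single `reserved` parameter and made-up item names and scores), not the notebook's function.

```python
def fuse_with_reserved_slots(sar_items, cold_items, top_k, reserved=1):
    """Merge two lists of (item_id, score), keeping `reserved` slots for cold items.

    Returns at most top_k items, best score first.
    """
    sar_sorted = sorted(sar_items, key=lambda x: x[1], reverse=True)
    cold_sorted = sorted(cold_items, key=lambda x: x[1], reverse=True)
    # Never reserve more slots than there are cold items or total slots
    n_cold = min(reserved, len(cold_sorted), top_k)
    chosen = sar_sorted[: top_k - n_cold] + cold_sorted[:n_cold]
    return sorted(chosen, key=lambda x: x[1], reverse=True)

# Made-up candidates: cold items score lower than every SAR item
sar = [("sar_a", 3.3), ("sar_b", 3.2), ("sar_c", 3.1)]
cold = [("cold_x", 2.3), ("cold_y", 2.2)]
fused = fuse_with_reserved_slots(sar, cold, top_k=3, reserved=1)
print(fused)
```

Without the reserved slot, a plain sort by score would return only the three SAR items in this example.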

In [65]:


def get_hybrid_recommendations(user_id, sar_model, top_k=10, similar_items_per_cold=3, verbose=True):
    """
    Generate hybrid recommendations combining SAR collaborative filtering with 
    embedding-based similarity search.
    
    SAR's top-k recommendations are used directly. The first few of them that have
    an embedding in the Azure AI Search index act as seeds: for each seed, similar
    items are retrieved and added as "Cold Start" candidates.
    Items returned by both SAR and the similarity search are labeled as such.
    
    Args:
        user_id: The user ID to generate recommendations for
        sar_model: Trained SAR model
        top_k: Number of recommendations to return
        similar_items_per_cold: Number of similar items to fetch per seed item
        verbose: Whether to print debug information
    
    Returns:
        DataFrame with recommendations and their sources
    """
    # Get SAR recommendations for the user
    user_df = pd.DataFrame({"userID": [user_id]})
    sar_recs = sar_model.recommend_k_items(user_df, top_k=top_k, remove_seen=True)
    
    # Add source column and movie metadata
    sar_recs["source"] = "SAR"
    sar_recs = sar_recs.merge(
        movies_with_links[["itemID", "title", "genres"]], 
        on="itemID", 
        how="left"
    )
    
    # Check which recommended items have embeddings in our index
    items_with_embeddings_set = set(movies_with_embeddings["itemID"].tolist())
    
    if verbose:
        print(f"\nüìä SAR returned {len(sar_recs)} recommendations")
        sar_items_with_embeddings = [id for id in sar_recs["itemID"].tolist() if id in items_with_embeddings_set]
        print(f"üì¶ SAR items that have embeddings: {len(sar_items_with_embeddings)} of {len(sar_recs)}")
        print(f"   Items with embeddings: {sar_items_with_embeddings[:5]}...")
    
    # For each SAR recommendation that has an embedding, find similar items
    # This can help discover related content the user might also like
    similar_items_list = []
    similar_to_mapping = {}  # Track which items are similar to which seeds
    seeds_used = 0
    max_seeds = 5  # Try up to 5 seeds to find similar items
    
    for _, row in sar_recs.iterrows():
        if seeds_used >= max_seeds:
            break
            
        item_id = row["itemID"]
        if item_id in items_with_embeddings_set:
            seeds_used += 1
            if verbose:
                print(f"\nüå± Using seed #{seeds_used}: '{row['title']}' (ID: {item_id})")
            
            # Find similar items using embeddings (suppress verbose output in find_similar_items)
            similar_df = find_similar_items(item_id, top_n=similar_items_per_cold)
            
            if similar_df is not None and not similar_df.empty:
                if verbose:
                    print(f"   Found {len(similar_df)} similar items")
                
                for _, sim_row in similar_df.iterrows():
                    sim_item_id = int(sim_row["itemID"])
                    if sim_item_id not in similar_to_mapping:
                        similar_to_mapping[sim_item_id] = []
                    similar_to_mapping[sim_item_id].append({
                        "seed_title": row["title"][:20],
                        "seed_itemID": item_id,
                        "seed_prediction": row["prediction"],
                        "similarity_score": sim_row["similarity_score"]
                    })
                
                similar_df["seed_itemID"] = item_id
                similar_df["seed_prediction"] = row["prediction"]
                # Scale similarity score relative to the seed's SAR prediction
                similar_df["prediction"] = similar_df["similarity_score"] * row["prediction"] * 0.8
                similar_items_list.append(similar_df)
            else:
                if verbose:
                    print(f"   No similar items found")
    
    if verbose and seeds_used == 0:
        print(f"\n⚠️  No SAR recommendations have embeddings in the index!")
        print(f"   This means none of user {user_id}'s SAR recommendations are in the first {len(items_with_embeddings_set)} indexed movies.")
    
    # Check for items recommended by BOTH SAR and embedding similarity
    sar_item_ids = set(sar_recs["itemID"].tolist())
    both_sources = sar_item_ids.intersection(set(similar_to_mapping.keys()))
    
    # Update SAR recs source for items also found via similarity
    def update_source(row):
        item_id = row["itemID"]
        if item_id in both_sources:
            similar_seeds = similar_to_mapping[item_id]
            seed_names = [s["seed_title"] for s in similar_seeds]
            return f"SAR + Similar to {', '.join(seed_names)}..."
        return row["source"]
    
    sar_recs["source"] = sar_recs.apply(update_source, axis=1)
    
    # Combine SAR recommendations with embedding-based similar items
    if similar_items_list:
        all_similar = pd.concat(similar_items_list, ignore_index=True)
        
        if verbose:
            print(f"\nüîç Total similar items found: {len(all_similar)}")
        
        # Remove items already in SAR recommendations (they're already marked as both)
        all_similar["itemID"] = all_similar["itemID"].astype(int)
        new_items = all_similar[~all_similar["itemID"].isin(sar_item_ids)].copy()
        
        # Remove duplicates - keep the one with highest similarity score
        new_items = new_items.sort_values("similarity_score", ascending=False).drop_duplicates(subset=["itemID"], keep="first")
        
        if verbose:
            print(f"üÜï New items from cold start (not in SAR): {len(new_items)}")
        
        # For remaining similar items, create source label
        def get_similar_source(item_id):
            if item_id in similar_to_mapping:
                seeds = similar_to_mapping[item_id]
                seed_names = [s["seed_title"] for s in seeds]
                if len(seeds) > 1:
                    return f"Cold Start: Similar to {len(seeds)} items ({', '.join(seed_names)})"
                return f"Cold Start: Similar to {seed_names[0]}"
            return "Cold Start: Embedding Similarity"
        
        new_items["source"] = new_items["itemID"].apply(get_similar_source)
        
        # Prepare similar items for combining: keep the raw search score under a separate name
        new_items = new_items.rename(columns={"similarity_score": "embedding_score"})
        
        # Combine SAR and cold start items
        sar_df = sar_recs[["userID", "itemID", "prediction", "source", "title", "genres"]].copy()
        cold_df = new_items[["itemID", "prediction", "source", "title", "genres", "subgenres", "overview"]].copy()
        cold_df["userID"] = user_id
        
        if verbose:
            print(f"\nüìã Before combining:")
            print(f"   SAR items: {len(sar_df)}")
            print(f"   Cold start items: {len(cold_df)}")
        
        hybrid_recs = pd.concat([sar_df, cold_df], ignore_index=True)
        
        if verbose:
            print(f"   Combined total: {len(hybrid_recs)}")
    else:
        if verbose:
            print(f"\n⚠️  No similar items found from any seeds")
        hybrid_recs = sar_recs[["userID", "itemID", "prediction", "source", "title", "genres"]]
    
    # Count before filtering to top_k
    all_cold_start = len(hybrid_recs[hybrid_recs["source"].str.contains("Cold Start", na=False)])
    all_sar_only = len(hybrid_recs[hybrid_recs["source"] == "SAR"])
    all_both = len(hybrid_recs[hybrid_recs["source"].str.contains("SAR \\+", regex=True, na=False)])
    
    if verbose:
        print(f"\nüìä Before top-k filtering:")
        print(f"   Total items: {len(hybrid_recs)}")
        print(f"   SAR only: {all_sar_only}, Cold Start: {all_cold_start}, Both: {all_both}")
    
    # Ensure at least 1 cold start item is included (if available)
    min_cold_start = 1
    cold_start_items = hybrid_recs[hybrid_recs["source"].str.contains("Cold Start", na=False)].copy()
    non_cold_start_items = hybrid_recs[~hybrid_recs["source"].str.contains("Cold Start", na=False)].copy()
    
    if len(cold_start_items) > 0:
        # Sort each group by prediction
        cold_start_items = cold_start_items.sort_values("prediction", ascending=False)
        non_cold_start_items = non_cold_start_items.sort_values("prediction", ascending=False)
        
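        # Take top cold start items (at least min_cold_start, at most top_k)
        num_cold_to_include = max(min_cold_start, min(len(cold_start_items), similar_items_per_cold))
        num_cold_to_include = min(num_cold_to_include, top_k)  # keeps remaining_slots non-negative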
        top_cold = cold_start_items.head(num_cold_to_include)
        
        # Fill remaining slots with non-cold start items
        remaining_slots = top_k - num_cold_to_include
        top_non_cold = non_cold_start_items.head(remaining_slots)
        
        # Combine and sort by prediction
        hybrid_recs = pd.concat([top_non_cold, top_cold], ignore_index=True)
        hybrid_recs = hybrid_recs.sort_values("prediction", ascending=False).head(top_k)
        
        if verbose:
            print(f"\nüéØ Guaranteed {num_cold_to_include} cold start item(s) in final results")
    else:
        # No cold start items available, just take top-k
        hybrid_recs = hybrid_recs.sort_values("prediction", ascending=False).head(top_k)
    
    hybrid_recs = hybrid_recs.reset_index(drop=True)
    hybrid_recs.index = hybrid_recs.index + 1  # 1-based ranking
    hybrid_recs.index.name = "rank"
    
    # Print summary after top-k
    if verbose:
        cold_start_count = len(hybrid_recs[hybrid_recs["source"].str.contains("Cold Start", na=False)])
        sar_only_count = len(hybrid_recs[hybrid_recs["source"] == "SAR"])
        both_count = len(hybrid_recs[hybrid_recs["source"].str.contains("SAR \\+", regex=True, na=False)])
        
        print(f"\n" + "=" * 60)
        print(f"üìà RECOMMENDATION SUMMARY for User {user_id} (Top {top_k}):")
        print(f"   SAR only:           {sar_only_count}")
        print(f"   Cold Start only:    {cold_start_count}")
        print(f"   Both SAR + Similar: {both_count}")
        print(f"=" * 60)
    
    return hybrid_recs


# Example: Get hybrid recommendations for a user
example_user_id = 31

print(f"Generating hybrid recommendations for user {example_user_id}")
print("=" * 80)
print("Combining SAR collaborative filtering with embedding-based cold item similarity")
print("=" * 80)

hybrid_recommendations = get_hybrid_recommendations(
    user_id=example_user_id,
    sar_model=model,
    top_k=15,
    similar_items_per_cold=3
)

print(f"\nHybrid Recommendations for User {example_user_id}:")
display(hybrid_recommendations)

2026-01-18 22:39:19,632 INFO     Calculating recommendation scores
2026-01-18 22:39:19,637 INFO     Removing seen items

Generating hybrid recommendations for user 31
Combining SAR collaborative filtering with embedding-based cold item similarity

📊 SAR returned 15 recommendations
📦 SAR items that have embeddings: 1 of 15
   Items with embeddings: [32]...

🌱 Using seed #1: 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)' (ID: 32)
Finding items similar to:
  Title:     Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
  Item ID:   32
  Genres:    Mystery|Sci-Fi|Thriller
  Subgenres: Time Travel|Dystopian|Psychological Thriller
  Overview:  In 2035, convict James Cole is sent back in time to discover the origin of a deadly virus that wiped out most of the earth's population. Mistakenly se...
------------------------------------------------------------



   Found 3 similar items

🔍 Total similar items found: 3
🆕 New items from cold start (not in SAR): 3

📋 Before combining:
   SAR items: 15
   Cold start items: 3
   Combined total: 18

📊 Before top-k filtering:
   Total items: 18
   SAR only: 15, Cold Start: 3, Both: 0

🎯 Guaranteed 3 cold start item(s) in final results

📈 RECOMMENDATION SUMMARY for User 31 (Top 15):
   SAR only:           12
   Cold Start only:    3
   Both SAR + Similar: 0

Hybrid Recommendations for User 31:


rank,userID,itemID,prediction,source,title,genres,subgenres,overview
1,31,480,3.336695,SAR,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,,
2,31,32,3.245314,SAR,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,,
3,31,260,3.245064,SAR,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,,
4,31,2115,3.216913,SAR,Indiana Jones and the Temple of Doom (1984),Action|Adventure|Fantasy,,
5,31,1210,3.202786,SAR,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi,,
6,31,2716,3.195978,SAR,Ghostbusters (a.k.a. Ghost Busters) (1984),Action|Comedy|Sci-Fi,,
7,31,1291,3.184828,SAR,Indiana Jones and the Last Crusade (1989),Action|Adventure,,
8,31,2987,3.079197,SAR,Who Framed Roger Rabbit? (1988),Adventure|Animation|Children|Comedy|Crime|Fant...,,
9,31,1573,3.02034,SAR,Face/Off (1997),Action|Crime|Drama|Thriller,,
10,31,1097,3.006843,SAR,E.T. the Extra-Terrestrial (1982),Children|Drama|Sci-Fi,,
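
The summary above (15 SAR items plus 3 cold-start items, trimmed to a top 15 that keeps all 3 cold-start items) can be illustrated with a minimal sketch. This is not the notebook's `get_hybrid_recommendations` implementation; the function name `combine_with_guaranteed_cold`, the item IDs, and the scores below are hypothetical and exist only to show one way of reserving slots for cold-start items.

```python
def combine_with_guaranteed_cold(sar_items, cold_items, top_k):
    """Merge two ranked lists, reserving slots for cold-start items.

    Hypothetical helper: both inputs are lists of (item_id, score),
    ordered best first. Returns a list of (item_id, score, source).
    """
    sar_ids = {item_id for item_id, _ in sar_items}
    # Keep only cold-start items that SAR did not already return.
    new_cold = [(i, s) for i, s in cold_items if i not in sar_ids]
    # Reserve up to top_k slots for cold-start items, fill the rest from SAR.
    n_cold = min(len(new_cold), top_k)
    n_sar = top_k - n_cold
    combined = [(i, s, "SAR") for i, s in sar_items[:n_sar]]
    combined += [(i, s, "Cold Start") for i, s in new_cold[:n_cold]]
    return combined


# Made-up data: 15 SAR items and 3 cold-start items, as in the run above.
sar = [(item_id, 4.0 - 0.05 * rank) for rank, item_id in enumerate(range(100, 115))]
cold = [(900, 0.91), (901, 0.88), (902, 0.85)]

result = combine_with_guaranteed_cold(sar, cold, top_k=15)
counts = {}
for _, _, source in result:
    counts[source] = counts.get(source, 0) + 1
print(counts)  # {'SAR': 12, 'Cold Start': 3}
```

Reserving slots avoids comparing SAR scores with embedding similarity scores directly, since the two are on different scales.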


---

# 6 Model Evaluation 📊

### Evaluating SAR Performance

We evaluate SAR with standard ranking metrics. These metrics cover the **collaborative filtering component only**: the cold-start items surfaced through embedding similarity cannot be scored against the test set because, by definition, cold items have no ground-truth interactions.
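
As a rough illustration of what a ranking metric measures, the sketch below computes precision@k and recall@k for a single user in plain Python. The helper `precision_recall_at_k` and the held-out item set are made up for this example; the library functions used in this notebook average over all users and may differ in details.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for one user.

    recommended: ranked list of item IDs, best first.
    relevant: set of item IDs the user interacted with in the test set.
    """
    top = recommended[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / k                                 # share of the top-k that is relevant
    recall = hits / len(relevant) if relevant else 0.0   # share of relevant items recovered
    return precision, recall


recommended = [480, 32, 260, 2115, 1210]  # illustrative ranked output
relevant = {260, 1210, 1097, 50}          # made-up held-out items
p, r = precision_recall_at_k(recommended, relevant, k=5)
print(f"precision@5={p:.2f} recall@5={r:.2f}")  # precision@5=0.40 recall@5=0.50
```

An item that never appears in the test set can never count as a hit, which is why the cold-start items are left out of this evaluation.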

The metrics below measure how well SAR recovers items the user actually interacted with in the held-out test set:

In [None]:
# Ranking metrics
eval_map = map(test, top_k, col_user="userID", col_item="itemID", col_rating="rating", k=TOP_K)
eval_ndcg = ndcg_at_k(test, top_k, col_user="userID", col_item="itemID", col_rating="rating", k=TOP_K)
eval_precision = precision_at_k(test, top_k, col_user="userID", col_item="itemID", col_rating="rating", k=TOP_K)
eval_recall = recall_at_k(test, top_k, col_user="userID", col_item="itemID", col_rating="rating", k=TOP_K)


In [None]:
# Rating metrics (SAR scores are not predicted ratings, so interpret these values with care)
eval_rmse = rmse(test, top_k, col_user="userID", col_item="itemID", col_rating="rating")
eval_mae = mae(test, top_k, col_user="userID", col_item="itemID", col_rating="rating")
eval_rsquared = rsquared(test, top_k, col_user="userID", col_item="itemID", col_rating="rating")
eval_exp_var = exp_var(test, top_k, col_user="userID", col_item="itemID", col_rating="rating")


In [None]:
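# Binarize the test ratings using the positivity threshold
positivity_threshold = 2
test_bin = test.copy()
test_bin["rating"] = binarize(test_bin["rating"], positivity_threshold)

# Rescale SAR scores to [0, 1] so they can be passed to logloss as probabilities
top_k_prob = top_k.copy()
top_k_prob["prediction"] = minmax_scale(top_k_prob["prediction"].astype(float))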

eval_logloss = logloss(
    test_bin, top_k_prob, col_user="userID", col_item="itemID", col_rating="rating"
)
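
The preparation in the cell above can be sketched in plain Python to show what each step does. The helpers `binarize_ratings`, `scale_to_unit_interval`, and `binary_log_loss` are hypothetical stand-ins, not the library functions, and the sketch assumes that ratings strictly above the threshold count as positive. The ratings and scores are made up.

```python
import math


def binarize_ratings(ratings, threshold):
    """Map ratings to 1.0 (positive) or 0.0 (negative). Assumes strict '>'."""
    return [1.0 if r > threshold else 0.0 for r in ratings]


def scale_to_unit_interval(scores):
    """Min-max scale scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def binary_log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy, clipping probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


ratings = [5.0, 2.0, 4.0, 1.0]  # made-up test ratings
scores = [3.3, 3.0, 3.2, 3.1]   # made-up SAR scores for the same user-item pairs

labels = binarize_ratings(ratings, threshold=2)
probs = scale_to_unit_interval(scores)
loss = binary_log_loss(labels, probs)
print(labels)  # [1.0, 0.0, 1.0, 0.0]
```

Min-max scaled scores are not calibrated probabilities, so the resulting log loss is best read as a relative figure when comparing runs.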


In [None]:
print("Model:\t",
      "Top K:\t%d" % TOP_K,
      "MAP:\t%f" % eval_map,
      "NDCG:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall,
      "RMSE:\t%f" % eval_rmse,
      "MAE:\t%f" % eval_mae,
      "R2:\t%f" % eval_rsquared,
      "Exp var:\t%f" % eval_exp_var,
      "Logloss:\t%f" % eval_logloss,
      sep='\n')

In [None]:
# Now let's look at the results for a specific user
user_id = 54

ground_truth = test[test["userID"] == user_id].sort_values(
    by="rating", ascending=False
)[:TOP_K]
prediction = model.recommend_k_items(
    pd.DataFrame(dict(userID=[user_id])), remove_seen=True
)
df = pd.merge(ground_truth, prediction, on=["userID", "itemID"], how="left")
df.head(10)

Above, we see that one of the highest-rated items from the test set was recovered by the model's top-k recommendations, but the others were not. Offline evaluation is difficult because it can only use the interactions observed in the test set, which may not represent the user's actual preferences across the entire set of items. Changing how the data is split, how the algorithm is applied, or the hyper-parameters may improve these results.

In [None]:
# Record results for tests - ignore this cell
store_metadata("map", eval_map)
store_metadata("ndcg", eval_ndcg)
store_metadata("precision", eval_precision)
store_metadata("recall", eval_recall)
store_metadata("train_time", train_time.interval)
store_metadata("test_time", test_time.interval)