<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# SAR Single Node on MovieLens (Python, CPU)

Simple Algorithm for Recommendation (SAR) is a fast and scalable algorithm for personalized recommendations based on user transaction history. It produces easily explainable and interpretable recommendations and handles "cold item" and "semi-cold user" scenarios. SAR is a kind of neighborhood based algorithm (as discussed in [Recommender Systems by Aggarwal](https://dl.acm.org/citation.cfm?id=2931100)) which is intended for ranking top items for each user. More details about SAR can be found in the [deep dive notebook](../02_model_collaborative_filtering/sar_deep_dive.ipynb). 

SAR recommends items that are most ***similar*** to the ones that the user already has an existing ***affinity*** for. Two items are ***similar*** if the users that interacted with one item are also likely to have interacted with the other. A user has an ***affinity*** to an item if they have interacted with it in the past.

### Advantages of SAR:
- High accuracy for an easy to train and deploy algorithm
- Fast training, only requiring simple counting to construct matrices used at prediction time. 
- Fast scoring, only involving multiplication of the similarity matrix with an affinity vector

### Notes to use SAR properly:
- Since it does not use item or user features, it can be at a disadvantage against algorithms that do.
- It's memory-hungry, requiring the creation of an $mxm$ sparse square matrix (where $m$ is the number of items). This can also be a problem for many matrix factorization algorithms.
- SAR favors an implicit rating scenario and it does not predict ratings.

This notebook provides an example of how to utilize and evaluate SAR in Python on a CPU.

# 0 Global Settings and Imports

In [1]:
import sys
import logging
import numpy as np
import pandas as pd
from sklearn.preprocessing import minmax_scale

from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.utils.python_utils import binarize
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.models.sar import SAR
from recommenders.evaluation.python_evaluation import (
    map,
    ndcg_at_k,
    precision_at_k,
    recall_at_k,
    rmse,
    mae,
    logloss,
    rsquared,
    exp_var
)
from recommenders.utils.notebook_utils import store_metadata

%load_ext autoreload
%autoreload 2

print(f"System version: {sys.version}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

top-level pandera module will be **removed in a future version of pandera**.
If you're using pandera to validate pandas objects, we highly recommend updating
your import:

```
# old import
import pandera as pa

# new import
import pandera.pandas as pa
```

If you're using pandera to validate objects from other compatible libraries
like pyspark or polars, see the supported libraries section of the documentation
for more information on how to import pandera:

https://pandera.readthedocs.io/en/stable/supported_libraries.html


```
```



System version: 3.11.14 | packaged by conda-forge | (main, Oct 22 2025, 22:46:25) [GCC 14.3.0]
NumPy version: 1.26.4
Pandas version: 2.3.3


# 1 Load Data

SAR is intended to be used on interactions with the following schema:
`<User ID>, <Item ID>,<Time>,[<Event Type>], [<Event Weight>]`. 

Each row represents a single interaction between a user and an item. These interactions might be different types of events on an e-commerce website, such as a user clicking to view an item, adding it to a shopping basket, following a recommendation link, and so on. Each event type can be assigned a different weight, for example, we might assign a “buy” event a weight of 10, while a “view” event might only have a weight of 1.

The MovieLens dataset is well formatted interactions of Users providing Ratings to Movies (movie ratings are used as the event weight) - we will use it for the rest of the example.

In [2]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, 20m, or latest-small
MOVIELENS_DATA_SIZE = "latest-small"

### 1.1 Download and use the MovieLens Dataset

In [3]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE
)

# Convert the float precision to 32-bit in order to reduce memory consumption 
data["rating"] = data["rating"].astype(np.float32)

data.head()

100%|██████████| 956/956 [00:00<00:00, 5.83kKB/s]


Unnamed: 0,userID,itemID,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### Default Catalog Contains Genres Only

In [4]:
# Load movie information (title, genres)
movies = movielens.load_item_df(
    size=MOVIELENS_DATA_SIZE,
    title_col="title",
    genres_col="genres"
)

movies.head()

100%|██████████| 956/956 [00:00<00:00, 5.13kKB/s]


Unnamed: 0,itemID,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### Refence online catalogs and Gen AI to extend to sub genres and summaries for embeddings similarity search

Include tmdb url inferred from link.csv

In [5]:
# Load movie links (IMDb and TMDb IDs)
# Note: links.csv is only available in the latest-small

links = movielens.load_links_df(
        size=MOVIELENS_DATA_SIZE,
        movie_col="itemID"
    )

# Join movies with links and create TMDb URL column
movies_with_links = movies.merge(links, left_on="itemID", right_on="itemID", how="left")

# Create the TMDb URL column (handle missing tmdbId values)
movies_with_links["tmdburl"] = movies_with_links["tmdbId"].apply(
    lambda x: f"https://www.themoviedb.org/movie/{int(x)}" if pd.notna(x) else None
)

movies_with_links = movies_with_links[["itemID", "title", "genres", "tmdburl"]]

display(movies_with_links.head())



100%|██████████| 956/956 [00:00<00:00, 5.44kKB/s]


Unnamed: 0,itemID,title,genres,tmdburl
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,https://www.themoviedb.org/movie/862
1,2,Jumanji (1995),Adventure|Children|Fantasy,https://www.themoviedb.org/movie/8844
2,3,Grumpier Old Men (1995),Comedy|Romance,https://www.themoviedb.org/movie/15602
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,https://www.themoviedb.org/movie/31357
4,5,Father of the Bride Part II (1995),Comedy,https://www.themoviedb.org/movie/11862


Leverage Gen AI to extend into new subgenres and descriptive summaries

In [6]:

# Subgenres mapped to main genres for more detailed classification
SUBGENRES = {
    "Action": (
        "Martial Arts",
        "Spy",
        "Superhero",
        "Military Action",
        "Disaster",
    ),
    "Adventure": (
        "Exploration",
        "Survival",
        "Treasure Hunt",
        "Jungle",
        "Sea Adventure",
    ),
    "Animation": (
        "Anime",
        "CGI",
        "Stop Motion",
        "Hand Drawn",
        "Claymation",
    ),
    "Children's": (
        "Family",
        "Fairy Tale",
        "Coming of Age",
        "Educational",
        "Puppet",
    ),
    "Comedy": (
        "Romantic Comedy",
        "Slapstick",
        "Satire",
        "Parody",
        "Dark Comedy",
        "Screwball",
    ),
    "Crime": (
        "Heist",
        "Gangster",
        "Detective",
        "Legal Thriller",
        "True Crime",
    ),
    "Documentary": (
        "Nature",
        "Biographical",
        "Historical",
        "Social",
        "Sports Documentary",
    ),
    "Drama": (
        "Melodrama",
        "Psychological",
        "Family Drama",
        "Legal Drama",
        "Political Drama",
        "Medical Drama",
    ),
    "Fantasy": (
        "High Fantasy",
        "Urban Fantasy",
        "Dark Fantasy",
        "Fairy Tale Fantasy",
        "Mythological",
    ),
    "Film-Noir": (
        "Neo-Noir",
        "Tech-Noir",
        "Nordic Noir",
        "Psychological Noir",
    ),
    "Horror": (
        "Slasher",
        "Supernatural",
        "Psychological Horror",
        "Body Horror",
        "Found Footage",
        "Zombie",
        "Vampire",
    ),
    "Musical": (
        "Jukebox Musical",
        "Opera",
        "Dance Film",
        "Concert Film",
        "Backstage Musical",
    ),
    "Mystery": (
        "Whodunit",
        "Cozy Mystery",
        "Noir Mystery",
        "Paranormal Mystery",
        "Locked Room",
    ),
    "Romance": (
        "Period Romance",
        "Contemporary Romance",
        "Tragic Romance",
        "Romantic Drama",
        "Teen Romance",
    ),
    "Sci-Fi": (
        "Space Opera",
        "Cyberpunk",
        "Time Travel",
        "Dystopian",
        "Post-Apocalyptic",
        "Alien",
        "Hard Sci-Fi",
    ),
    "Thriller": (
        "Psychological Thriller",
        "Spy Thriller",
        "Action Thriller",
        "Erotic Thriller",
        "Techno Thriller",
    ),
    "War": (
        "World War I",
        "World War II",
        "Vietnam War",
        "Civil War",
        "Anti-War",
        "Military Drama",
    ),
    "Western": (
        "Spaghetti Western",
        "Revisionist Western",
        "Contemporary Western",
        "Comedy Western",
        "Epic Western",
    ),
}

In [None]:
#!pip install openai requests beautifulsoup4 tqdm

import requests
from bs4 import BeautifulSoup
from openai import AzureOpenAI
from tqdm import tqdm
import time
import json

# Azure OpenAI configuration
AZURE_OPENAI_ENDPOINT = "https://eusaoaijz.openai.azure.com/"
AZURE_OPENAI_API_KEY = "your openai api key here"
AZURE_OPENAI_DEPLOYMENT = "gpt-4o"  # or your deployment name

client = AzureOpenAI(
    api_version="2024-02-15-preview",
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
)

# Flatten subgenres for the prompt
ALL_SUBGENRES = []
for genre, subgenres in SUBGENRES.items():
    for subgenre in subgenres:
        ALL_SUBGENRES.append(f"{subgenre} ({genre})")

SUBGENRES_LIST = ", ".join(ALL_SUBGENRES)

def scrape_tmdb_page(url):
    """Scrape the TMDb page content"""
    if not url:
        return None
    try:
        headers = {"User-Agent": "Mozilla/5.0 (compatible; MovieBot/1.0)"}
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            # Extract the overview section
            overview_div = soup.find("div", class_="overview")
            if overview_div:
                return overview_div.get_text(strip=True)
            # Fallback: get meta description
            meta = soup.find("meta", attrs={"name": "description"})
            if meta:
                return meta.get("content", "")
        return None
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

def extract_overview_and_subgenres(url, title, genres, scraped_content=None):
    """Use Azure OpenAI to extract plot overview and assign subgenres"""
    if not url and not scraped_content:
        return None, None
    
    content = scraped_content or scrape_tmdb_page(url)
    if not content:
        return None, None
    
    try:
        response = client.chat.completions.create(
            model=AZURE_OPENAI_DEPLOYMENT,
            messages=[
                {
                    "role": "system",
                    "content": f"""You are a movie classification assistant. Given movie information, you must:
1. Extract or summarize the plot overview (2-3 sentences max)
2. Assign 1-3 relevant subgenres from this list: {SUBGENRES_LIST}

Respond in JSON format only:
{{"overview": "plot summary here", "subgenres": ["Subgenre1", "Subgenre2"]}}"""
                },
                {
                    "role": "user", 
                    "content": f"""Movie: {title}
Main Genres: {genres}
Content from TMDb: {content[:2000]}

Extract the overview and assign appropriate subgenres."""
                }
            ],
            max_tokens=400,
            temperature=0,
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content.strip())
        overview = result.get("overview", "")
        subgenres = result.get("subgenres", [])
        
        # Convert subgenres list to pipe-separated string
        subgenres_str = "|".join(subgenres) if subgenres else None
        
        return overview, subgenres_str
        
    except Exception as e:
        print(f"Error with Azure OpenAI for {title}: {e}")
        return content, None  # Return scraped content as fallback

# Process only first x movies for demo
MAX_RECORDS = 200
movies_sample = movies_with_links.head(MAX_RECORDS).copy()

# Process movies with rate limiting
overviews = []
subgenres_list = []

for idx, row in tqdm(movies_sample.iterrows(), total=len(movies_sample), desc="Extracting overviews & subgenres"):
    overview, subgenres = extract_overview_and_subgenres(
        row["tmdburl"], 
        row["title"], 
        row["genres"]
    )
    overviews.append(overview)
    subgenres_list.append(subgenres)
    time.sleep(0.5)  # Rate limiting

movies_sample["overview"] = overviews
movies_sample["subgenres"] = subgenres_list

display(movies_sample[["itemID", "title", "genres", "subgenres", "overview"]].head(10))

# Optionally merge back with full dataset (movies without AI enrichment will have None values)
#movies_with_links = movies_with_links.merge(
#    movies_sample[["itemID", "overview", "subgenres"]], 
#    on="itemID", 
#    how="left"
#)

Extracting overviews & subgenres:  18%|█▊        | 35/200 [01:06<06:49,  2.48s/it]

Error with Azure OpenAI for Clueless (1995): 'NoneType' object has no attribute 'strip'


Extracting overviews & subgenres:  74%|███████▎  | 147/200 [06:05<02:37,  2.98s/it]

Error with Azure OpenAI for Kids (1995): 'NoneType' object has no attribute 'strip'


Extracting overviews & subgenres: 100%|██████████| 200/200 [08:20<00:00,  2.50s/it]


Unnamed: 0,itemID,title,genres,subgenres,overview
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,CGI|Family|Urban Fantasy,"Woody, a toy cowboy, feels threatened when Buz..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,Treasure Hunt|Family|High Fantasy,Siblings Judy and Peter discover an enchanted ...
2,3,Grumpier Old Men (1995),Comedy|Romance,Romantic Comedy|Slapstick,A family wedding reignites the feud between ne...
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Romantic Drama|Melodrama,"Four women, Vannah, Bernie, Glo, and Robin, na..."
4,5,Father of the Bride Part II (1995),Comedy,Romantic Comedy|Family,George Banks is shocked to learn that both his...
5,6,Heat (1995),Action|Crime|Thriller,Heist|Action Thriller,Master thief Neil McCauley leads a skilled cre...
6,7,Sabrina (1995),Comedy|Romance,Romantic Comedy,"After returning from school in Paris, Sabrina,..."
7,8,Tom and Huck (1995),Adventure|Children,Coming of Age|Treasure Hunt,Tom Sawyer witnesses a murder and befriends Hu...
8,9,Sudden Death (1995),Action,Action Thriller,A man's daughter is kidnapped during a champio...
9,10,GoldenEye (1995),Action|Adventure|Thriller,Spy (Action)|Action Thriller (Thriller),"James Bond must stop his former ally, Alec Tre..."


Add new catalog metadata to AI Search as embeddings

In [None]:
#!pip install azure-search-documents

# Generate OpenAI embeddings and insert into Azure AI Search
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
)
from azure.core.credentials import AzureKeyCredential

# Azure AI Search configuration
AZURE_SEARCH_ENDPOINT = "https://productreco.search.windows.net"
AZURE_SEARCH_API_KEY = "your search api key here"
AZURE_SEARCH_INDEX_NAME = "movies-recommendations"

# Azure OpenAI Embedding model configuration
AZURE_OPENAI_EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"  # or your embedding deployment name

def get_embedding(text):
    """Generate embedding for a text using Azure OpenAI"""
    if not text or pd.isna(text):
        return None
    try:
        response = client.embeddings.create(
            model=AZURE_OPENAI_EMBEDDING_DEPLOYMENT,
            input=text[:8000]  # Truncate to max token limit
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None

def create_combined_text(row):
    """Combine genres, subgenres, and overview into a single text for embedding"""
    parts = []
    if row.get("genres") and pd.notna(row["genres"]):
        parts.append(f"Genres: {row['genres']}")
    if row.get("subgenres") and pd.notna(row["subgenres"]):
        parts.append(f"Subgenres: {row['subgenres']}")
    if row.get("overview") and pd.notna(row["overview"]):
        parts.append(f"Overview: {row['overview']}")
    return " | ".join(parts) if parts else None

# Generate embeddings for each movie
print("Generating embeddings for movies...")
embeddings = []

for idx, row in tqdm(movies_sample.iterrows(), total=len(movies_sample), desc="Generating embeddings"):
    combined_text = create_combined_text(row)
    embedding = get_embedding(combined_text)
    embeddings.append(embedding)
    time.sleep(0.1)  # Rate limiting for embedding API

movies_sample["embedding"] = embeddings

# Filter out rows without embeddings
movies_with_embeddings = movies_sample[movies_sample["embedding"].notna()].copy()
print(f"Generated embeddings for {len(movies_with_embeddings)} movies")

# Create Azure AI Search index
search_index_client = SearchIndexClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    credential=AzureKeyCredential(AZURE_SEARCH_API_KEY)
)

# Define the index schema with vector search
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="hnsw-config")
    ],
    profiles=[
        VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw-config"
        )
    ]
)

index = SearchIndex(
    name=AZURE_SEARCH_INDEX_NAME,
    fields=[
        SimpleField(name="itemID", type=SearchFieldDataType.String, key=True),
        SearchableField(name="title", type=SearchFieldDataType.String),
        SearchableField(name="genres", type=SearchFieldDataType.String),
        SearchableField(name="subgenres", type=SearchFieldDataType.String),
        SearchableField(name="overview", type=SearchFieldDataType.String),
        SimpleField(name="tmdburl", type=SearchFieldDataType.String),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,  # ada-002 embedding dimensions
            vector_search_profile_name="vector-profile"
        )
    ],
    vector_search=vector_search
)

# Create or update the index
try:
    search_index_client.create_or_update_index(index)
    print(f"Index '{AZURE_SEARCH_INDEX_NAME}' created/updated successfully")
except Exception as e:
    print(f"Error creating index: {e}")

# Prepare documents for upload
documents = []
for idx, row in movies_with_embeddings.iterrows():
    doc = {
        "itemID": str(row["itemID"]),
        "title": row["title"] if pd.notna(row["title"]) else "",
        "genres": row["genres"] if pd.notna(row["genres"]) else "",
        "subgenres": row["subgenres"] if pd.notna(row["subgenres"]) else "",
        "overview": row["overview"] if pd.notna(row["overview"]) else "",
        "tmdburl": row["tmdburl"] if pd.notna(row["tmdburl"]) else "",
        "embedding": row["embedding"]
    }
    documents.append(doc)

# Upload documents to Azure AI Search
search_client = SearchClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    index_name=AZURE_SEARCH_INDEX_NAME,
    credential=AzureKeyCredential(AZURE_SEARCH_API_KEY)
)

# Upload in batches of 100
batch_size = 100
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    try:
        result = search_client.upload_documents(documents=batch)
        succeeded = sum(1 for r in result if r.succeeded)
        print(f"Uploaded batch {i//batch_size + 1}: {succeeded}/{len(batch)} documents succeeded")
    except Exception as e:
        print(f"Error uploading batch {i//batch_size + 1}: {e}")

print(f"\nTotal documents uploaded to Azure AI Search: {len(documents)}")
display(movies_with_embeddings[["itemID", "title", "genres", "subgenres", "overview", "embedding"]].head())

Collecting azure-search-documents
  Downloading azure_search_documents-11.6.0-py3-none-any.whl.metadata (23 kB)
Collecting azure-core>=1.28.0 (from azure-search-documents)
  Downloading azure_core-1.38.0-py3-none-any.whl.metadata (47 kB)
Collecting azure-common>=1.1 (from azure-search-documents)
  Downloading azure_common-1.1.28-py2.py3-none-any.whl.metadata (5.0 kB)
Collecting isodate>=0.6.0 (from azure-search-documents)
  Downloading isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Downloading azure_search_documents-11.6.0-py3-none-any.whl (307 kB)
Downloading azure_common-1.1.28-py2.py3-none-any.whl (14 kB)
Downloading azure_core-1.38.0-py3-none-any.whl (217 kB)
Downloading isodate-0.7.2-py3-none-any.whl (22 kB)
Installing collected packages: azure-common, isodate, azure-core, azure-search-documents
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4/4[0m [azure-search-documents]-search-documents]
[1A[2KSuccessfully installed azure-common-1.1.28 azure-core-1.38.0 azure

Generating embeddings: 100%|██████████| 200/200 [00:49<00:00,  4.06it/s]


Generated embeddings for 200 movies
Index 'movies-recommendations' created/updated successfully
Uploaded batch 1: 100/100 documents succeeded
Uploaded batch 2: 100/100 documents succeeded

Total documents uploaded to Azure AI Search: 200


Unnamed: 0,itemID,title,genres,subgenres,overview
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,CGI|Family|Urban Fantasy,"Woody, a toy cowboy, feels threatened when Buz..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,Treasure Hunt|Family|High Fantasy,Siblings Judy and Peter discover an enchanted ...
2,3,Grumpier Old Men (1995),Comedy|Romance,Romantic Comedy|Slapstick,A family wedding reignites the feud between ne...
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Romantic Drama|Melodrama,"Four women, Vannah, Bernie, Glo, and Robin, na..."
4,5,Father of the Bride Part II (1995),Comedy,Romantic Comedy|Family,George Banks is shocked to learn that both his...


In [18]:
# Vector similarity search function to find similar movies
from azure.search.documents.models import VectorizedQuery

def find_similar_items(item_id, top_n=5):
    """
    Find the top N most similar items to a given item using cosine similarity search.
    
    Args:
        item_id: The ID of the item to find similar items for
        top_n: Number of similar items to return (default: 5)
    
    Returns:
        DataFrame with similar items and their similarity scores
    """
    # First, get the embedding from our local DataFrame (more reliable)
    item_row = movies_with_embeddings[movies_with_embeddings["itemID"] == item_id]
    
    if item_row.empty:
        print(f"Item {item_id} not found in movies_with_embeddings")
        return None
    
    item_embedding = item_row.iloc[0]["embedding"]
    item_title = item_row.iloc[0]["title"]
    
    if item_embedding is None or (isinstance(item_embedding, list) and len(item_embedding) == 0):
        print(f"No embedding found for item {item_id}")
        return None
    
    # Display source item details
    print(f"Finding items similar to:")
    print(f"  Title:     {item_title}")
    print(f"  Item ID:   {item_id}")
    print(f"  Genres:    {item_row.iloc[0]['genres']}")
    print(f"  Subgenres: {item_row.iloc[0]['subgenres']}")
    print(f"  Overview:  {item_row.iloc[0]['overview'][:150] + '...' if item_row.iloc[0]['overview'] and len(item_row.iloc[0]['overview']) > 150 else item_row.iloc[0]['overview']}")
    print("-" * 60)
    
    # Perform vector search using the item's embedding
    vector_query = VectorizedQuery(
        vector=item_embedding,
        k_nearest_neighbors=top_n + 1,  # +1 to account for the item itself
        fields="embedding"
    )
    
    try:
        results = search_client.search(
            search_text=None,
            vector_queries=[vector_query],
            select=["itemID", "title", "genres", "subgenres", "overview"]
        )
        
        similar_items = []
        for result in results:
            # Skip the query item itself
            if str(result["itemID"]) == str(item_id):
                continue
            
            similar_items.append({
                "itemID": result["itemID"],
                "title": result["title"],
                "genres": result["genres"],
                "subgenres": result["subgenres"],
                "overview": result["overview"][:100] + "..." if result["overview"] and len(result["overview"]) > 100 else result["overview"],
                "similarity_score": result["@search.score"]
            })
            
            if len(similar_items) >= top_n:
                break
        
        return pd.DataFrame(similar_items)
        
    except Exception as e:
        print(f"Error performing similarity search: {e}")
        return None



In [19]:
# Example: Find movies similar to a specific movie
# First, let's see which item IDs are available in the index
print("Available item IDs in index (first 10):")
available_ids = movies_with_embeddings["itemID"].head(10).tolist()
print(available_ids)

# Use the first available item ID from our indexed movies
example_item_id = available_ids[0] if available_ids else 1

similar_movies = find_similar_items(example_item_id, top_n=5)

if similar_movies is not None:
    print(f"\nTop 5 movies similar to item {example_item_id}:")
    display(similar_movies)

Available item IDs in index (first 10):
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Finding items similar to:
  Title:     Toy Story (1995)
  Item ID:   1
  Genres:    Adventure|Animation|Children|Comedy|Fantasy
  Subgenres: CGI|Family|Urban Fantasy
  Overview:  Woody, a toy cowboy, feels threatened when Buzz Lightyear, a new toy, arrives. They must work together to return to their owner after getting separate...
------------------------------------------------------------

Top 5 movies similar to item 1:


Unnamed: 0,itemID,title,genres,subgenres,overview,similarity_score
0,126,"NeverEnding Story III, The (1994)",Adventure|Children|Fantasy,High Fantasy|Family,A young boy must restore order when a group of...,0.904178
1,60,"Indian in the Cupboard, The (1995)",Adventure|Children|Fantasy,Family|Fairy Tale Fantasy,A nine-year-old boy receives a plastic Indian ...,0.902285
2,29,"City of Lost Children, The (CitÃ© des enfants ...",Adventure|Drama|Fantasy|Mystery|Sci-Fi,Dark Fantasy|Psychological Noir|Urban Fantasy,A scientist in a surrealist society kidnaps ch...,0.88878
3,107,Muppet Treasure Island (1996),Adventure|Children|Comedy|Musical,Treasure Hunt|Family|Puppet,Young Jim Hawkins and his friends embark on a ...,0.887038
4,13,Balto (1995),Adventure|Animation|Children,Survival|Hand Drawn|Family,An outcast half-wolf risks his life to prevent...,0.881075


### 1.2 Split the data using the python random splitter provided in utilities:

We split the full dataset into a `train` and `test` dataset to evaluate performance of the algorithm against a held-out set not seen during training. Because SAR generates recommendations based on user preferences, all users that are in the test set must also exist in the training set. For this case, we can use the provided `python_stratified_split` function which holds out a percentage (in this case 25%) of items from each user, but ensures all users are in both `train` and `test` datasets. Other options are available in the `dataset.python_splitters` module which provide more control over how the split occurs.

In [None]:
train, test = python_stratified_split(data, ratio=0.75, col_user="userID", col_item="itemID", seed=42)

In [None]:
print("""
Train:
Total Ratings: {train_total}
Unique Users: {train_users}
Unique Items: {train_items}

Test:
Total Ratings: {test_total}
Unique Users: {test_users}
Unique Items: {test_items}
""".format(
    train_total=len(train),
    train_users=len(train['userID'].unique()),
    train_items=len(train['itemID'].unique()),
    test_total=len(test),
    test_users=len(test['userID'].unique()),
    test_items=len(test['itemID'].unique()),
))

# 2 Train the SAR Model

### 2.1 Instantiate the SAR algorithm and set the index

We will use the single node implementation of SAR and specify the column names to match our dataset (timestamp is an optional column that is used and can be removed if your dataset does not contain it).

Other options are specified to control the behavior of the algorithm as described in the [deep dive notebook](../02_model_collaborative_filtering/sar_deep_dive.ipynb).

In [None]:
logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

model = SAR(
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type="jaccard", 
    time_decay_coefficient=30, 
    timedecay_formula=True,
    normalize=True
)

### 2.2 Train the SAR model on our training data, and get the top-k recommendations for our testing data

SAR first computes an item-to-item ***co-occurence matrix***. Co-occurence represents the number of times two items appear together for any given user. Once we have the co-occurence matrix, we compute an ***item similarity matrix*** by rescaling the cooccurences by a given metric (Jaccard similarity in this example). 

We also compute an ***affinity matrix*** to capture the strength of the relationship between each user and each item. Affinity is driven by different types (like *rating* or *viewing* a movie), and by the time of the event. 

Recommendations are achieved by multiplying the affinity matrix $A$ and the similarity matrix $S$. The result is a ***recommendation score matrix*** $R$. We compute the ***top-k*** results for each user in the `recommend_k_items` function seen below.

A full walkthrough of the SAR algorithm can be found [here](../02_model_collaborative_filtering/sar_deep_dive.ipynb).

In [None]:
with Timer() as train_time:
    model.fit(train)

print("Took {} seconds for training.".format(train_time.interval))

In [None]:
with Timer() as test_time:
    top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)

print("Took {} seconds for prediction.".format(test_time.interval))

In [None]:
top_k.head()

### 2.3. Evaluate how well SAR performs

We evaluate how well SAR performs for a few common ranking metrics provided in the `python_evaluation` module. We will consider the Mean Average Precision (MAP), Normalized Discounted Cumalative Gain (NDCG), Precision, and Recall for the top-k items per user we computed with SAR. User, item and rating column names are specified in each evaluation method.

In [None]:
# Ranking metrics
eval_map = map(test, top_k, col_user="userID", col_item="itemID", col_rating="rating", k=TOP_K)
eval_ndcg = ndcg_at_k(test, top_k, col_user="userID", col_item="itemID", col_rating="rating", k=TOP_K)
eval_precision = precision_at_k(test, top_k, col_user="userID", col_item="itemID", col_rating="rating", k=TOP_K)
eval_recall = recall_at_k(test, top_k, col_user="userID", col_item="itemID", col_rating="rating", k=TOP_K)


In [None]:
# Rating metrics
eval_rmse = rmse(test, top_k, col_user="userID", col_item="itemID", col_rating="rating")
eval_mae = mae(test, top_k, col_user="userID", col_item="itemID", col_rating="rating")
eval_rsquared = rsquared(test, top_k, col_user="userID", col_item="itemID", col_rating="rating")
eval_exp_var = exp_var(test, top_k, col_user="userID", col_item="itemID", col_rating="rating")


In [None]:
positivity_threshold = 2
test_bin = test.copy()
test_bin["rating"] = binarize(test_bin["rating"], positivity_threshold)

top_k_prob = top_k.copy()
top_k_prob["prediction"] = minmax_scale(top_k_prob["prediction"].astype(float))

eval_logloss = logloss(
    test_bin, top_k_prob, col_user="userID", col_item="itemID", col_rating="rating"
)


In [None]:
print("Model:\t",
      "Top K:\t%d" % TOP_K,
      "MAP:\t%f" % eval_map,
      "NDCG:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall,
      "RMSE:\t%f" % eval_rmse,
      "MAE:\t%f" % eval_mae,
      "R2:\t%f" % eval_rsquared,
      "Exp var:\t%f" % eval_exp_var,
      "Logloss:\t%f" % eval_logloss,
      sep='\n')

In [None]:
# Now let's look at the results for a specific user
user_id = 54

ground_truth = test[test["userID"] == user_id].sort_values(
    by="rating", ascending=False
)[:TOP_K]
prediction = model.recommend_k_items(
    pd.DataFrame(dict(userID=[user_id])), remove_seen=True
)
df = pd.merge(ground_truth, prediction, on=["userID", "itemID"], how="left")
df.head(10)

Above, we see that one of the highest rated items from the test set was recovered by the model's top-k recommendations, however the others were not. Offline evaluations are difficult as they can only use what was seen previously in the test set and may not represent the user's actual preferences across the entire set of items. Adjustments to how the data is split, algorithm is used and hyper-parameters can improve the results here. 

In [None]:
# Record results for tests - ignore this cell
store_metadata("map", eval_map)
store_metadata("ndcg", eval_ndcg)
store_metadata("precision", eval_precision)
store_metadata("recall", eval_recall)
store_metadata("train_time", train_time.interval)
store_metadata("test_time", test_time.interval)