## The Project: AI Agents for Diverse Movie Recommendations using LangGraph

This project builds an AI agent system using the **LangGraph** framework to provide personalized movie recommendations based on the **MovieLens 100K** dataset. The system uses a pre-trained **SVD++** model for collaborative filtering and is enhanced with multiple agents:

1. **RecommenderAgent**: Predicts top-20 movies for each user based on the trained SVD++ model.
2. **DiversityRerankerAgent**: Selects the most diverse top-5 subset from the recommendations by balancing rating quality and genre variety.
3. **CritiqueAgent**: Uses a language model (e.g., Granite) to evaluate the recommendation list and provide critique about thematic richness, emotional depth, and genre coverage.

Each agent operates within a **LangGraph pipeline**, enabling structured reasoning and modular evaluation. The full system prints user history, generates diverse recommendations, and explains potential weaknesses in the results.

`Graph:
  A[RecommenderAgent] --> B[DiversityRerankerAgent]
  B --> C[CritiqueAgent]
  C --> D[END]
`
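
The re-ranking idea behind the `DiversityRerankerAgent` can be sketched without any dependencies. This is a minimal illustration, not the notebook's implementation: the candidate titles, ratings, and the `diversity_weight` value are hypothetical, and the score used here (predicted rating plus a weighted count of genres not yet covered) is just one assumed way to balance rating quality against genre variety.

```python
def greedy_diverse_top_k(candidates, k=5, diversity_weight=0.5):
    """Greedily pick k items, rewarding genres that are not yet covered.

    candidates: list of (title, predicted_rating, set_of_genres) tuples.
    """
    remaining = list(candidates)
    selected, covered = [], set()
    while remaining and len(selected) < k:
        # Score = predicted rating + weight * number of genres new to the selection
        best = max(
            remaining,
            key=lambda c: c[1] + diversity_weight * len(c[2] - covered),
        )
        selected.append(best)
        covered |= best[2]
        remaining.remove(best)
    return selected


# Hypothetical candidates (titles and ratings are made up for illustration)
candidates = [
    ("Movie A", 4.6, {"Drama"}),
    ("Movie B", 4.5, {"Drama"}),
    ("Movie C", 4.2, {"Comedy", "Romance"}),
    ("Movie D", 4.1, {"Sci-Fi"}),
]

picked = greedy_diverse_top_k(candidates, k=3)
picked_titles = [title for title, _, _ in picked]
print(picked_titles)
```

A plain top-3 by rating would return Movies A, B, and C, two of which are dramas. The greedy selection swaps the second drama for the Sci-Fi title because it adds a genre the list does not yet cover.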

## The Dataset: MovieLens 100K

The experiments are based on the **MovieLens 100K dataset**, a widely used benchmark in recommender systems research.  

- **Size:** 100,000 ratings  
- **Users:** 943  
- **Movies:** 1,682  
- **Format:** plain-text files; `u.data` (ratings) is tab-delimited and `u.item` (movie metadata) is pipe-delimited

### Main Columns
- **userId** – unique identifier of each user (anonymized, 1–943).  
- **movieId** – unique identifier of each movie (1–1682).  
- **rating** – explicit rating from 1 to 5, where higher values indicate stronger preference.  
- **timestamp** – UNIX time indicating when the rating was made.  
- **title** (from `u.item`) – the name of the movie.  
- **genres** (from `u.item`) – one or more genres assigned to each movie (e.g., Action, Comedy).  

This dataset is small enough to allow fast experimentation, yet rich enough to demonstrate the strengths and weaknesses of different recommendation algorithms.
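
To make the two file layouts concrete, here is a small standard-library sketch that parses one line of each. The rating line matches a row that appears in the merged output later in this notebook; the movie line is a hypothetical placeholder (its id, title, date, URL, and genre flags are invented for illustration), and the helper names `parse_rating` and `parse_item` are not part of the notebook.

```python
import csv
import io

# Genre flag columns of u.item, in file order (names taken from the loading code in this notebook)
GENRE_NAMES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]

def parse_rating(line):
    """Parse one tab-delimited u.data line into a dict."""
    user_id, movie_id, rating, timestamp = next(csv.reader(io.StringIO(line), delimiter="\t"))
    return {"userId": int(user_id), "movieId": int(movie_id),
            "rating": int(rating), "timestamp": int(timestamp)}

def parse_item(line):
    """Parse one pipe-delimited u.item line into id, title, and genre names."""
    fields = next(csv.reader(io.StringIO(line), delimiter="|"))
    flags = fields[5:5 + len(GENRE_NAMES)]  # the 19 binary genre columns
    genres = [name for name, flag in zip(GENRE_NAMES, flags) if flag == "1"]
    return {"movieId": int(fields[0]), "title": fields[1], "genres": genres}

# A rating row in the u.data layout: userId, movieId, rating, timestamp
rating_line = "196\t242\t3\t881250949\n"

# Hypothetical movie row: every value here is a placeholder, not dataset content
flags = ["1" if g in ("Comedy", "Romance") else "0" for g in GENRE_NAMES]
item_line = "|".join(
    ["9999", "Example Movie (1999)", "01-Jan-1999", "", "http://example.com"] + flags
) + "\n"

parsed_rating = parse_rating(rating_line)
parsed_item = parse_item(item_line)
print(parsed_rating)
print(parsed_item)
```

In practice `pandas.read_csv` with `sep="\t"` or `sep="|"` does the same job in one call; the sketch only spells out what each column holds.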


In [None]:
!pip uninstall -y scikit-surprise
!pip install numpy==1.26.4 --force-reinstall

In [None]:
!pip install scikit-surprise --no-binary scikit-surprise

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [4]:
import zipfile
import pandas as pd

# Path to the ZIP file in Google Drive
zip_path = "/content/drive/MyDrive/Portfolio datasets/Recommender engine/ml-100k.zip"
extract_path = "/content/ml-100k"

# Extract the dataset
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

# Load the ratings (u.data)
data_path = f"{extract_path}/ml-100k/u.data"
df = pd.read_csv(
    data_path,
    sep="\t",
    names=["userId", "movieId", "rating", "timestamp"]
)

# Load the movie titles (u.item)
item_path = f"{extract_path}/ml-100k/u.item"
movies = pd.read_csv(
    item_path,
    sep="|",
    encoding="latin-1",
    header=None,
    usecols=[0, 1],
    names=["movieId", "title"]
)

# Merge ratings with movie titles
df_merged = pd.merge(df, movies, on="movieId")

print("Data shape:", df_merged.shape)
print(df_merged.head())


Data shape: (100000, 5)
   userId  movieId  rating  timestamp                       title
0     196      242       3  881250949                Kolya (1996)
1     186      302       3  891717742    L.A. Confidential (1997)
2      22      377       1  878887116         Heavyweights (1994)
3     244       51       2  880606923  Legends of the Fall (1994)
4     166      346       1  886397596         Jackie Brown (1997)


In [None]:
import numpy as np
import pandas as pd
from surprise import Dataset, Reader, SVDpp, accuracy
from surprise.model_selection import KFold
from collections import defaultdict

# === Helper: compute metrics for multiple K values ===
def metrics_at_k(predictions, ks=(5, 10, 20)):
    user_ratings = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        user_ratings[uid].append((iid, est, true_r))

    results = {k: {"Precision": [], "Recall": [], "F1": [], "NDCG": [], "HitRate": []} for k in ks}

    for uid, ratings in user_ratings.items():
        ratings.sort(key=lambda x: x[1], reverse=True)

        # Relevant = rating >= 4
        rel = [r for (_, _, r) in ratings if r >= 4]
        n_rel = len(rel)

        for k in ks:
            top_k = ratings[:k]
            rec = [iid for (iid, _, r) in top_k if r >= 4]
            n_rel_and_rec_k = len(rec)

            precision = n_rel_and_rec_k / k if k > 0 else 0
            recall = n_rel_and_rec_k / n_rel if n_rel > 0 else 0
            f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0

            dcg = sum([1 / np.log2(idx+2) for idx, (iid, _, r) in enumerate(top_k) if r >= 4])
            idcg = sum([1 / np.log2(idx+2) for idx in range(min(n_rel, k))])
            ndcg = dcg / idcg if idcg > 0 else 0

            hit = 1 if n_rel_and_rec_k > 0 else 0

            results[k]["Precision"].append(precision)
            results[k]["Recall"].append(recall)
            results[k]["F1"].append(f1)
            results[k]["NDCG"].append(ndcg)
            results[k]["HitRate"].append(hit)

    # Average across users
    return {
        k: {m: np.mean(vals) for m, vals in metrics.items()}
        for k, metrics in results.items()
    }

# === Prepare data ===
ratings_df = df_merged[["userId", "movieId", "rating"]]
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df, reader)

# === Train only SVD++ ===
model = SVDpp()
print("\n=== Training SVD++ ===")
kf = KFold(n_splits=5, random_state=42, shuffle=True)

rmses = []
metrics_all = {5: [], 10: [], 20: []}

for fold, (trainset, testset) in enumerate(kf.split(data), 1):
    print(f" Fold {fold} ...")
    model.fit(trainset)
    predictions = model.test(testset)

    # RMSE
    rmse = accuracy.rmse(predictions, verbose=False)
    print(f"   RMSE: {rmse:.4f}")
    rmses.append(rmse)

    # Top-K metrics
    metrics = metrics_at_k(predictions, ks=[5, 10, 20])
    for k in metrics:
        metrics_all[k].append(metrics[k])

# === Aggregate results ===
row = {"Model": "SVD++", "RMSE (mean)": np.mean(rmses)}
for k in [5, 10, 20]:
    avg_metrics = {m+f"@{k}": np.mean([fold[m] for fold in metrics_all[k]]) for m in metrics_all[k][0]}
    row.update(avg_metrics)

results_df = pd.DataFrame([row])
print("\n=== 5-Fold Cross-Validation Results (SVD++ Only) ===")
print(results_df)



=== Training SVD++ ===
 Fold 1 ...
   RMSE: 0.9207
 Fold 2 ...
   RMSE: 0.9157
 Fold 3 ...
   RMSE: 0.9242
 Fold 4 ...
   RMSE: 0.9206
 Fold 5 ...


In [None]:
!pip install bitsandbytes accelerate transformers langgraph

In [None]:
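import os
from getpass import getpass
# Prompt for the token at runtime instead of hard-coding a secret in the notebook
os.environ["HF_TOKEN"] = getpass("Hugging Face token: ")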

In [None]:
# === Open Source Granite 8B model LLM ===
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True
)
try:
    # Granite 8B model name on Hugging Face
    GRANITE_MODEL = "ibm-granite/granite-3.3-8b-instruct"

    # Load the model + tokenizer
    granite_model = AutoModelForCausalLM.from_pretrained(
        GRANITE_MODEL,
        device_map="auto",
        quantization_config=bnb_config  # Optional
    )

    granite_tokenizer = AutoTokenizer.from_pretrained(GRANITE_MODEL)

    # Create pipeline
    granite_pipe = pipeline(
        "text-generation",
        model=granite_model,
        tokenizer=granite_tokenizer,
        pad_token_id=granite_tokenizer.eos_token_id,
        return_full_text=False
    )
except Exception as e:
    # Fall back to a stub so the rest of the notebook still runs without the model
    print("[Warning] Failed to load Granite model:", e)
    granite_pipe = lambda prompt, **kwargs: [{"generated_text": "[Granite model unavailable]"}]



In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Use the ratings dataframe (already loaded from MovieLens or another dataset)
ratings_df = df_merged[["userId", "movieId", "rating"]].copy()

# Train/Test split
train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=42)

# Build ratings_history from train_df
ratings_history = (
    train_df.groupby("userId")[["movieId", "rating"]]
    .apply(lambda g: list(zip(g["movieId"], g["rating"])))
    .to_dict()
)

# Build item_pool from all unique movies
item_pool = ratings_df["movieId"].unique().tolist()

print("Data prepared:")
print(f"Train size: {len(train_df)}, Test size: {len(test_df)}")
print(f"Number of users in train: {train_df['userId'].nunique()}")
print(f"Number of items in pool: {len(item_pool)}")


Data prepared:
Train size: 80000, Test size: 20000
Number of users in train: 943
Number of items in pool: 1682


In [15]:
# === AI Agents with LangGraph, using pre-trained SVD++ and MovieLens titles/genres ===
import pandas as pd
from langgraph.graph import StateGraph, END
from typing import TypedDict
import re
from collections import Counter
from itertools import combinations


# 1) Load movies with titles + genres
movies_df = pd.read_csv(
    "/content/ml-100k/ml-100k/u.item",
    sep="|",
    encoding="latin-1",
    header=None,
    names=["movieId", "title", "release_date", "video_release_date", "IMDb_URL",
           "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
           "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
           "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]
)

def clean_title(title: str) -> str:
    return re.sub(r"\s*\(\d{4}\)$", "", title)

genre_cols = movies_df.columns[5:]
movies_df["genres"] = movies_df[genre_cols].apply(
    lambda row: ", ".join([g for g, v in row.items() if v == 1]), axis=1
)

# 2) Merge ratings with movie metadata
ratings_full = df_merged[["userId", "movieId", "rating"]].merge(
    movies_df[["movieId", "title", "genres"]],
    on="movieId", how="left"
)

item_pool = ratings_full["movieId"].unique().tolist()
ratings_history = (
    ratings_full.groupby("userId")[["movieId", "rating"]]
    .apply(lambda x: list(zip(x["movieId"], x["rating"])) )
    .to_dict()
)

movie_titles = ratings_full.set_index("movieId")["title"].to_dict()
movie_genres = ratings_full.set_index("movieId")["genres"].to_dict()

def title_of(iid: int) -> str:
    return movie_titles.get(iid, "Unknown title")

def genres_of(iid: int) -> str:
    g = movie_genres.get(iid, "")
    return g if isinstance(g, str) and g else "Unknown genre"

# 3) State definition
class State(TypedDict):
    user_id: int
    request: str
    recommendations: list
    explanation: str
    critique: str

# 4) RecommenderAgent
class RecommenderAgent:
    def __init__(self, model, movies_df):
        self.model = model
        self.movies_df = movies_df

    def recommend(self, user_id, item_pool, top_k=20):
        preds = [(iid, self.model.predict(user_id, iid).est) for iid in item_pool]
        preds = sorted(preds, key=lambda x: x[1], reverse=True)[:top_k]
        recs_df = pd.DataFrame(preds, columns=["movieId", "predicted_rating"])
        recs_df = recs_df.merge(self.movies_df, on="movieId", how="left")
        return recs_df[["title", "genres", "predicted_rating"]]

    def run(self, state, item_pool, top_k=20):
        user_id = state["user_id"]
        recs = self.recommend(user_id, item_pool, top_k)
        state["recommendations"] = recs
        print("--- Top-20 recommendations before re-ranking ---")
        for i, row in enumerate(state["recommendations"].itertuples(), 1):
            print(f"{i}. {row.title} ({row.genres}) → pred {row.predicted_rating:.2f}")
        return state

# 5) CritiqueAgent
def wrap_text(text, words_per_line=10):
    words = text.split()
    lines = [" ".join(words[i:i+words_per_line]) for i in range(0, len(words), words_per_line)]
    return "\n".join(lines)

class CritiqueAgent:
    def __init__(self, pipe, words_per_line=10):
        self.pipe = pipe
        self.words_per_line = words_per_line

    def run(self, state):
        recs = state["recommendations"]
        recs_str = "; ".join(
            f"{row['title']} ({row['genres']}) pred {row['predicted_rating']:.2f}"
            for _, row in recs.iterrows()
        )
        prompt = (
            "You are a movie recommendation critic. Based on the following recommended list, "
            "write a brief critique of possible weaknesses in the recommendation strategy. "
            "Focus on thematic richness, emotional depth, genre variety (e.g. missing musical, children's, animation), "
            "and suggest one possible improvement in no more than 3 lines.\n\n"
            f"Recommendations: {recs_str}\n\n"
            "Critique:"
        )
        # temperature only takes effect when sampling is enabled
        out = self.pipe(prompt, max_new_tokens=100, do_sample=True, temperature=0.6)[0]["generated_text"]
        if "Critique:" in out:
            out = out.split("Critique:")[-1].strip()
        return {"critique": wrap_text(out, self.words_per_line)}

# 6) DiversityRerankerAgent
class DiversityRerankerAgent:
    def __init__(self, top_k=5, diversity_weight=1.0):
        self.top_k = top_k
        self.diversity_weight = diversity_weight

    def compute_genre_bonus(self, selected_genres, candidate_genres):
        """Calculate how many new genres this candidate adds."""
        candidate_set = set(candidate_genres.split(", "))
        return len(candidate_set - selected_genres)

    def run(self, state):
        recs = state["recommendations"].copy()
        if len(recs) <= self.top_k:
            return state

        # Select top_k movies incrementally to maximize genre diversity + rating
        selected = []
        selected_genres = set()
        remaining = recs.copy()

        while len(selected) < self.top_k and not remaining.empty:
            def score(row):
                genre_bonus = self.compute_genre_bonus(selected_genres, row["genres"])
                # Additive score: the rating keeps a positive weight for any
                # diversity_weight, including values above 1
                return row["predicted_rating"] + self.diversity_weight * genre_bonus

            remaining["score"] = remaining.apply(score, axis=1)
            best = remaining.sort_values(by="score", ascending=False).iloc[0]
            selected.append(best)
            selected_genres.update(best["genres"].split(", "))
            remaining = remaining.drop(best.name)

        state["recommendations"] = pd.DataFrame(selected)[["title", "genres", "predicted_rating"]].reset_index(drop=True)
        return state


# 7) Graph builder
def build_graph(recommender, diversity_reranker, critique_agent, item_pool):
    graph = StateGraph(State)
    graph.add_node("Recommender", lambda s: recommender.run(s, item_pool, top_k=20))
    graph.add_node("DiversityReranker", lambda s: diversity_reranker.run(s))
    graph.add_node("Critique", lambda s: critique_agent.run(s))

    graph.add_edge("Recommender", "DiversityReranker")
    graph.add_edge("DiversityReranker", "Critique")
    graph.add_edge("Critique", END)

    graph.set_entry_point("Recommender")
    return graph

# 8) History printer
def print_user_history(user_id: int, ratings_df: pd.DataFrame, max_items=10):
    print("\n--- User History ---")
    df = ratings_df[ratings_df["userId"] == user_id][["title", "genres", "rating"]].copy()
    df["title"] = df["title"].apply(clean_title)
    df = df.sort_values(by=["rating", "title"], ascending=[False, True]).head(max_items)
    print(df.to_string(index=False))

# 9) Initialize agents
recommender = RecommenderAgent(model, movies_df)
diversity_reranker = DiversityRerankerAgent(top_k=5, diversity_weight=1.5)
critique_agent = CritiqueAgent(granite_pipe, words_per_line=10)

graph = build_graph(recommender, diversity_reranker, critique_agent, item_pool)
app = graph.compile()

# 10) Run example
sample_users = ratings_full["userId"].drop_duplicates().sample(n=5, random_state=42)
for uid in sample_users:
    print_user_history(uid, ratings_full)
    result = app.invoke({"user_id": int(uid), "request": "Recommend me movies"})

    print("\n=== Dialogue Agent Output ===")
    print(f"User: {uid}")
    print("\nCritique:")
    print(result["critique"])
    print("\nDiversityReranker:")
    print(result["recommendations"])


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



--- User History ---
                                 title                         genres  rating
                Breakfast at Tiffany's                 Drama, Romance       5
                          Little Women                          Drama       5
Romy and Michele's High School Reunion                         Comedy       5
                               Sabrina                Comedy, Romance       5
                      Schindler's List                     Drama, War       5
             Shawshank Redemption, The                          Drama       5
                     Strictly Ballroom                Comedy, Romance       5
                               Amadeus                 Drama, Mystery       4
                                  Bean                         Comedy       4
                  Beauty and the Beast Animation, Children's, Musical       4
--- Top-20 recommendations before re-ranking ---
1. Schindler's List (1993) (Drama, War) → pred 4.25
2. Strictly Ballroo

### What the Code Does: A/B Testing Simulation and Evaluation

This section runs a **simulated A/B test** to illustrate how the effect of a Diversity Reranker on user engagement and satisfaction metrics could be evaluated in a movie recommendation system. All user behavior is synthetic.

---

#### 1. **Simulate 100,000 Users**
- Creates a synthetic dataset of 100,000 users.
- Users are randomly assigned to **Group A** (control) or **Group B** (with DiversityReranker).

#### 2. **Simulate Behavior Metrics**
- For each group, user behavior is simulated using statistical distributions:
  - `conversion`: Whether the user completed a desired action (e.g., watched a recommended movie).
  - `clicked`: Whether the user clicked a recommendation.
  - `engagement_time`: Time spent interacting with the system (in seconds).
  - `completion`: Whether the user completed the content (e.g., watched till the end).
  - `rating`: How the user rated the content (1–5 scale; simulated around group means of 3.9 for A and 4.1 for B).
  - `returned`: Whether the user returned for another session.
  - `abandoned`: Whether the user abandoned a session.
  - `searches`: Number of searches initiated.

- **Group A** simulates baseline behavior.
- **Group B** simulates the improved engagement that diverse recommendations are assumed to produce; the uplift is set by the simulation parameters, not measured.



### Explanation of Each Metric

| **Metric**           | **Description**                                                                                                               | **Typical Range** | **Interpretation**                                                                                                                   |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| **conversion**       | % of users who performed the desired action (e.g., clicked "Watch", rated a movie, started watching)                          | 0 to 1 (0%–100%)  | Higher = More effective recommendations that lead to actual actions.                                                                 |
| **clicked**          | % of users who clicked on at least one recommended item                                                                       | 0 to 1            | Indicates initial interest. Higher means the recommendations are catching attention.                                                 |
| **engagement\_time** | Average time spent interacting with recommendations (e.g., scrolling, reading, watching previews) in **seconds**              | 0 to 100+ seconds | Higher = Users are exploring content more deeply.                                                                                    |
| **completion**       | % of users who completed the full interaction, e.g., watched a movie, submitted a full rating, completed a recommended course | 0 to 1            | Higher = Recommendations are more aligned with user interests.                                                                       |
| **rating**           | Average rating given by users to recommended items (if a rating option exists)                                                | 1.0 to 5.0        | Higher = Better perceived quality or satisfaction from users.                                                                        |
| **returned**         | % of users who returned to the app/site after the session (e.g., next day/week)                                               | 0 to 1            | Higher = User retention improved, users value the experience.                                                                        |
| **abandoned**        | % of users who left the session without engaging (e.g., didn’t click, didn’t rate, bounced away)                              | 0 to 1            | Lower = Better. High abandonment means the recommendations were irrelevant or overwhelming.                                          |
| **searches**         | Average number of searches performed by users during the session                                                              | 0 to \~10+        | Higher = Users are actively exploring. Too high might mean they’re struggling to find what they want, but moderate increase is good. |

---

### How to Interpret the Ranges

* **Binary Metrics** (e.g., `conversion`, `clicked`, `returned`, `abandoned`, `completion`):

  * Expressed as proportions between 0 and 1 (0%–100%).
  * An absolute improvement of even **2–5 percentage points** is often practically meaningful at large scale.
  * A drop in **abandoned** together with a rise in **conversion** is ideal.

* **Continuous Metrics** (e.g., `engagement_time`, `rating`, `searches`):

  * Engagement time: 40 s vs 52 s is a major difference (a 30% increase).
  * Rating: Moving from 3.9 to 4.1 is meaningful in user satisfaction.
  * Searches: Moderate increase (e.g., 3.5 → 4.3) may signal curiosity or effective diversity. But too high could imply frustration.
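
Absolute and relative changes are easy to confuse, so a short sketch may help. It uses the example values from the bullets above plus the conversion probabilities (0.12 and 0.15) that the simulation cell below assigns to the two groups; the helper name `lift` is hypothetical.

```python
def lift(value_a, value_b):
    """Return (absolute difference, relative change) going from A to B."""
    absolute = value_b - value_a
    relative = absolute / value_a
    return absolute, relative

# Binary metric: the difference is best read in percentage points (pp)
abs_conv, rel_conv = lift(0.12, 0.15)
print(f"conversion: {abs_conv * 100:+.1f} pp absolute, {rel_conv:+.0%} relative")

# Continuous metrics from the bullets above
abs_time, rel_time = lift(40, 52)
abs_rating, rel_rating = lift(3.9, 4.1)
abs_search, rel_search = lift(3.5, 4.3)
print(f"engagement_time: {abs_time:+.0f} s absolute, {rel_time:+.0%} relative")
print(f"rating: {abs_rating:+.1f} absolute, {rel_rating:+.1%} relative")
print(f"searches: {abs_search:+.1f} absolute, {rel_search:+.0%} relative")
```

A 3-point rise in conversion is a 25% relative gain, which is why small absolute differences in binary metrics can still matter at scale.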

---

### Summary

* **Good sign**:
  ↑ `conversion`, ↑ `clicked`, ↑ `engagement_time`, ↑ `completion`, ↑ `rating`, ↑ `returned`
  ↓ `abandoned`

* **Watch out for**:

  * High `searches` **combined with** low `conversion` or `rating` may mean poor result quality.
  * Increase in `abandoned` should be investigated for UX issues or bad recommendations.


In [13]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, norm
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Simulate 100,000 users: A = control, B = with diversity reranker
N = 100_000
users = pd.DataFrame({
    "user_id": np.arange(N),
    "group": np.random.choice(["A", "B"], size=N, p=[0.5, 0.5])
})

# Simulate behavior metrics
def simulate_group_metrics(group):
    size = (users["group"] == group).sum()
    if group == "A":
        return pd.DataFrame({
            "conversion": np.random.binomial(1, 0.12, size),
            "clicked": np.random.binomial(1, 0.25, size),
            "engagement_time": np.random.normal(45, 10, size),  # seconds
            "completion": np.random.binomial(1, 0.65, size),
            "rating": np.random.normal(3.9, 0.5, size),
            "returned": np.random.binomial(1, 0.55, size),
            "abandoned": np.random.binomial(1, 0.25, size),
            "searches": np.random.poisson(3.5, size),
        })
    else:  # group B with reranker
        return pd.DataFrame({
            "conversion": np.random.binomial(1, 0.15, size),
            "clicked": np.random.binomial(1, 0.31, size),
            "engagement_time": np.random.normal(52, 11, size),
            "completion": np.random.binomial(1, 0.74, size),
            "rating": np.random.normal(4.1, 0.45, size),
            "returned": np.random.binomial(1, 0.61, size),
            "abandoned": np.random.binomial(1, 0.20, size),
            "searches": np.random.poisson(4.3, size),
        })

# Attach simulated metrics
metrics_A = simulate_group_metrics("A")
metrics_B = simulate_group_metrics("B")
users_A = users[users["group"] == "A"].reset_index(drop=True).join(metrics_A)
users_B = users[users["group"] == "B"].reset_index(drop=True).join(metrics_B)

# Combine into one DataFrame
full_data = pd.concat([users_A, users_B])

# Summary stats and statistical tests
def ab_summary(metric_name):
    a = users_A[metric_name]
    b = users_B[metric_name]
    mean_a = np.mean(a)
    mean_b = np.mean(b)
    diff = mean_b - mean_a
    t_stat, p_val = ttest_ind(a, b, equal_var=False)
    # CI (95%)
    se = np.sqrt(np.var(a)/len(a) + np.var(b)/len(b))
    ci_low = diff - 1.96 * se
    ci_high = diff + 1.96 * se
    # Cohen's d (effect size)
    pooled_sd = np.sqrt((np.std(a)**2 + np.std(b)**2) / 2)
    cohens_d = diff / pooled_sd
    return {
        "metric": metric_name,
        "mean_A": round(mean_a, 4),
        "mean_B": round(mean_b, 4),
        "diff": round(diff, 4),
        "p_value": round(p_val, 4),
        "95%_CI": f"[{ci_low:.4f}, {ci_high:.4f}]",
        "effect_size (d)": round(cohens_d, 4)
    }

metrics_to_check = ["conversion", "clicked", "engagement_time", "completion", "rating", "returned", "abandoned", "searches"]
ab_results = pd.DataFrame([ab_summary(metric) for metric in metrics_to_check])

print("\n=== A/B Test Summary ===")
display(ab_results)





=== A/B Test Summary ===


Unnamed: 0,metric,mean_A,mean_B,diff,p_value,95%_CI,effect_size (d)
0,conversion,0.1207,0.1487,0.028,0.0,"[0.0238, 0.0322]",0.0821
1,clicked,0.2488,0.3114,0.0626,0.0,"[0.0570, 0.0682]",0.1398
2,engagement_time,44.9645,52.0053,7.0408,0.0,"[6.9108, 7.1707]",0.6717
3,completion,0.6485,0.7393,0.0909,0.0,"[0.0852, 0.0966]",0.1981
4,rating,3.8992,4.0994,0.2002,0.0,"[0.1943, 0.2061]",0.4208
5,returned,0.5513,0.6108,0.0595,0.0,"[0.0534, 0.0656]",0.1207
6,abandoned,0.2485,0.1999,-0.0486,0.0,"[-0.0538, -0.0435]",-0.1168
7,searches,3.5076,4.2945,0.7869,0.0,"[0.7624, 0.8114]",0.3984


### A/B Testing Results Summary

This A/B test compared two versions of a movie recommendation system:

- **Version A**: Standard Top-20 recommendations without any re-ranking.
- **Version B**: Recommendations re-ranked using a DiversityReranker to increase genre variety.

The simulation involved 100,000 synthetic users, each randomly assigned to one of the two versions (roughly 50,000 per group). Key performance metrics were tracked and compared:

| Metric            | Mean (A) | Mean (B) | Difference | p-value  | 95% Confidence Interval | Effect Size (Cohen's d) |
|-------------------|----------|----------|------------|----------|--------------------------|--------------------------|
| Conversion Rate   | 0.1207   | 0.1487   | +0.0280    | < 0.0001 | [0.0238, 0.0322]         | 0.0821                   |
| Click-Through Rate| 0.2488   | 0.3114   | +0.0626    | < 0.0001 | [0.0570, 0.0682]         | 0.1398                   |
| Engagement Time   | 44.96 s  | 52.01 s  | +7.04 s    | < 0.0001 | [6.91, 7.17]             | 0.6717                   |
| Completion Rate   | 0.6485   | 0.7393   | +0.0909    | < 0.0001 | [0.0852, 0.0966]         | 0.1981                   |
| Rating            | 3.8992   | 4.0994   | +0.2002    | < 0.0001 | [0.1943, 0.2061]         | 0.4208                   |
| Return Rate       | 0.5513   | 0.6108   | +0.0595    | < 0.0001 | [0.0534, 0.0656]         | 0.1207                   |
| Abandonment Rate  | 0.2485   | 0.1999   | –0.0486    | < 0.0001 | [–0.0538, –0.0435]       | –0.1168                  |
| Searches per User | 3.5076   | 4.2945   | +0.7869    | < 0.0001 | [0.7624, 0.8114]         | 0.3984                   |
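
The confidence intervals in the table follow the normal approximation, difference ± 1.96 × standard error. As a rough check, the sketch below recomputes the conversion row from the reported means. The exact group sizes are not printed in the notebook, so 50,000 users per group is an assumption, and the helper name `diff_ci` is hypothetical.

```python
import math

def diff_ci(mean_a, var_a, n_a, mean_b, var_b, n_b, z=1.96):
    """Difference of two means with a normal-approximation confidence interval."""
    diff = mean_b - mean_a
    se = math.sqrt(var_a / n_a + var_b / n_b)  # standard error of the difference
    return diff, se, diff - z * se, diff + z * se

# Reported conversion rates; a 0/1 metric with mean p has variance p * (1 - p)
p_a, p_b = 0.1207, 0.1487
n = 50_000  # assumed group size (about half of the 100,000 simulated users)

diff, se, low, high = diff_ci(p_a, p_a * (1 - p_a), n, p_b, p_b * (1 - p_b), n)

# Two-sided p-value under the normal approximation
z_score = diff / se
p_value = math.erfc(abs(z_score) / math.sqrt(2))

print(f"diff = {diff:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
print(f"z = {z_score:.1f}, p = {p_value:.1e}")
```

With samples this large the p-value is far below 0.0001, which is why it shows up as `0.0` once rounded to four decimals.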

---

### What Does It Mean?

- All user engagement and satisfaction metrics improved in Version B, as expected from the simulation parameters.
- The differences are statistically significant (p < 0.05) with narrow confidence intervals.
- Effect sizes range from small to medium. Notably:
  - Engagement Time shows the **largest effect** (Cohen’s d ≈ 0.67).
  - Ratings and Search Activity show **small-to-medium effects** (d ≈ 0.4).
  - The binary metrics, including the drop in Abandonment, show small effects (|d| ≈ 0.1–0.2) that are still statistically significant at this sample size.
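
The effect sizes can be approximated directly from the simulation parameters. The sketch below applies the same pooled-standard-deviation formula as the analysis cell to the population means and standard deviations used to generate the data, so the values differ slightly from the sample-based ones in the table. The function name mirrors the notebook's variable but is otherwise a hypothetical helper.

```python
import math

def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using the root mean of the two variances as the pooled SD."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_sd

# Continuous metrics: (mean, sd) pairs taken from the simulation cell
d_time = cohens_d(45, 10, 52, 11)
d_rating = cohens_d(3.9, 0.5, 4.1, 0.45)

# Binary metric: a 0/1 variable with mean p has standard deviation sqrt(p * (1 - p))
p_a, p_b = 0.12, 0.15
d_conversion = cohens_d(p_a, math.sqrt(p_a * (1 - p_a)), p_b, math.sqrt(p_b * (1 - p_b)))

print(f"engagement_time d = {d_time:.2f}")
print(f"rating d          = {d_rating:.2f}")
print(f"conversion d      = {d_conversion:.2f}")
```

The binary metrics end up with small d values even when the difference is clearly significant, because the standard deviation of a 0/1 variable is large relative to a shift of a few percentage points.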

---

### Conclusion:

In this simulation, using a DiversityReranker to re-rank movie recommendations leads to meaningful improvements in user behavior and satisfaction. Users were more likely to:
- Click on recommendations,
- Spend more time engaging with the platform,
- Complete the content they started,
- Rate content higher, and
- Return for more.

At the same time, abandonment rates dropped. This suggests that promoting diversity in recommendations not only increases discovery but also improves overall user experience. Because both groups are synthetic, the result shows how the evaluation would work; a test on real user traffic is needed to confirm that DiversityReranker is a valuable enhancement to the recommendation pipeline.


In [14]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Set seed
np.random.seed(42)

# Total number of simulated users, split at random across the four variants
N = 100_000

# Define 4 variants
groups = ["A_base", "B_diversity", "C_no_history", "D_no_critique"]
users = pd.DataFrame({
    "user_id": np.arange(N),
    "group": np.random.choice(groups, size=N)
})

# Simulate behavior per group
def simulate_metrics(group):
    size = (users["group"] == group).sum()
    if group == "A_base":
        return pd.DataFrame({
            "conversion": np.random.binomial(1, 0.12, size),
            "clicked": np.random.binomial(1, 0.25, size),
            "engagement_time": np.random.normal(45, 10, size),
            "completion": np.random.binomial(1, 0.65, size),
            "rating": np.random.normal(3.9, 0.5, size),
            "returned": np.random.binomial(1, 0.55, size),
            "abandoned": np.random.binomial(1, 0.25, size),
            "searches": np.random.poisson(3.5, size),
        })
    elif group == "B_diversity":
        return pd.DataFrame({
            "conversion": np.random.binomial(1, 0.15, size),
            "clicked": np.random.binomial(1, 0.31, size),
            "engagement_time": np.random.normal(52, 11, size),
            "completion": np.random.binomial(1, 0.74, size),
            "rating": np.random.normal(4.1, 0.45, size),
            "returned": np.random.binomial(1, 0.61, size),
            "abandoned": np.random.binomial(1, 0.20, size),
            "searches": np.random.poisson(4.3, size),
        })
    elif group == "C_no_history":
        return pd.DataFrame({
            "conversion": np.random.binomial(1, 0.11, size),
            "clicked": np.random.binomial(1, 0.22, size),
            "engagement_time": np.random.normal(40, 10, size),
            "completion": np.random.binomial(1, 0.6, size),
            "rating": np.random.normal(3.7, 0.5, size),
            "returned": np.random.binomial(1, 0.50, size),
            "abandoned": np.random.binomial(1, 0.30, size),
            "searches": np.random.poisson(2.9, size),
        })
    elif group == "D_no_critique":
        return pd.DataFrame({
            "conversion": np.random.binomial(1, 0.13, size),
            "clicked": np.random.binomial(1, 0.27, size),
            "engagement_time": np.random.normal(47, 10, size),
            "completion": np.random.binomial(1, 0.68, size),
            "rating": np.random.normal(3.95, 0.5, size),
            "returned": np.random.binomial(1, 0.57, size),
            "abandoned": np.random.binomial(1, 0.23, size),
            "searches": np.random.poisson(3.7, size),
        })

# Attach metrics
metrics_all = []
for group in groups:
    df_metrics = simulate_metrics(group)
    df_full = users[users["group"] == group].reset_index(drop=True).join(df_metrics)
    metrics_all.append(df_full)

full_data = pd.concat(metrics_all)

# Compare all groups to A_base
def ab_summary(metric_name, group_a, group_b):
    a = full_data[full_data["group"] == group_a][metric_name]
    b = full_data[full_data["group"] == group_b][metric_name]
    mean_a = np.mean(a)
    mean_b = np.mean(b)
    diff = mean_b - mean_a
    t_stat, p_val = ttest_ind(a, b, equal_var=False)
    se = np.sqrt(np.var(a)/len(a) + np.var(b)/len(b))
    ci_low = diff - 1.96 * se
    ci_high = diff + 1.96 * se
    pooled_sd = np.sqrt((np.std(a)**2 + np.std(b)**2) / 2)
    cohens_d = diff / pooled_sd
    return {
        "metric": metric_name,
        "A vs": group_b,
        "mean_A": round(mean_a, 4),
        "mean_B": round(mean_b, 4),
        "diff": round(diff, 4),
        "p_value": round(p_val, 4),
        "95%_CI": f"[{ci_low:.4f}, {ci_high:.4f}]",
        "effect_size (d)": round(cohens_d, 4)
    }

metrics_to_check = ["conversion", "clicked", "engagement_time", "completion", "rating", "returned", "abandoned", "searches"]
results = []
for variant in ["B_diversity", "C_no_history", "D_no_critique"]:
    for metric in metrics_to_check:
        results.append(ab_summary(metric, "A_base", variant))

ab_comparison = pd.DataFrame(results)
print("\n=== A/B/C/D Variant Comparison ===")
display(ab_comparison)


=== A/B/C/D Variant Comparison ===


Unnamed: 0,metric,A vs,mean_A,mean_B,diff,p_value,95%_CI,effect_size (d)
0,conversion,B_diversity,0.1173,0.1497,0.0324,0.0,"[0.0265, 0.0384]",0.0955
1,clicked,B_diversity,0.2489,0.3093,0.0604,0.0,"[0.0526, 0.0682]",0.135
2,engagement_time,B_diversity,45.0084,52.0305,7.0221,0.0,"[6.8377, 7.2066]",0.6664
3,completion,B_diversity,0.6414,0.7423,0.1009,0.0,"[0.0929, 0.1089]",0.2199
4,rating,B_diversity,3.8994,4.1056,0.2062,0.0,"[0.1978, 0.2145]",0.432
5,returned,B_diversity,0.5501,0.6134,0.0634,0.0,"[0.0547, 0.0720]",0.1287
6,abandoned,B_diversity,0.2522,0.2,-0.0522,0.0,"[-0.0595, -0.0449]",-0.1251
7,searches,B_diversity,3.4985,4.2912,0.7927,0.0,"[0.7583, 0.8271]",0.4033
8,conversion,C_no_history,0.1173,0.1126,-0.0047,0.1021,"[-0.0103, 0.0009]",-0.0146
9,clicked,C_no_history,0.2489,0.2189,-0.03,0.0,"[-0.0374, -0.0226]",-0.0709


### A/B/C/D Variant Comparison – Summary and Interpretation

This table summarizes the A/B/C/D test results across four different variants of the movie recommender:

| Variant | Description                                         |
| ------- | --------------------------------------------------- |
| **A**   | Baseline (Top-20 recommendations before re-ranking) |
| **B**   | With `DiversityRerankerAgent`                       |
| **C**   | Without User History                                |
| **D**   | Without Critique Filtering                          |

---

#### Key Observations:

* **B (DiversityReranker)** outperforms the baseline (**A**) across all major metrics:

  * **+3.2 percentage points conversion**, **+6.0 points click-through**, **+7 seconds engagement**, **+10 points completion**, and a **+0.2 increase in rating**.
  * It also **reduces abandonment** (**−5.2 points**) and increases return rate (+6.3 points) and exploratory behavior (+0.8 searches per user).
  * All differences are statistically significant (p < 0.0001). Effect sizes are small for the binary metrics (|*d*| ≈ 0.1–0.2) and medium for engagement time, rating, and searches (Cohen’s *d* ≈ 0.4–0.7).

* **C (No History)** performs **worse** than the baseline in nearly every metric:

  * Lower engagement time (−5 sec), click-through, completion, rating, and return rate.
  * **Higher abandonment** and reduced searches.
  * The exception is conversion, where the drop (−0.5 percentage points) is not statistically significant (p ≈ 0.10).
  * Indicates that user history is crucial for personalizing and improving recommendations.

* **D (No Critique)** shows a **modest improvement** over the baseline:

  * Small gains in conversion (+1 percentage point) and engagement (+2 sec), and reduced abandonment (−2.3 percentage points).
  * While positive, these gains are smaller than those from diversity-enhanced recommendations.
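
Because the comparison runs 24 tests (three variants against `A_base` on eight metrics), one optional check is to tighten the significance threshold. The sketch below uses a Bonferroni correction, which is an assumption of this example and not something the notebook applies; the p-values are the ones printed in the output above, where `0.0` means the value rounded to zero at four decimals.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

n_tests = 3 * 8  # three variants compared with A_base on eight metrics
threshold = bonferroni_threshold(0.05, n_tests)

# p-values as printed in the comparison output (rounded to four decimals)
printed_p_values = {
    ("conversion", "B_diversity"): 0.0,
    ("conversion", "C_no_history"): 0.1021,
    ("clicked", "C_no_history"): 0.0,
}

verdicts = {
    key: ("significant" if p < threshold else "not significant")
    for key, p in printed_p_values.items()
}

print(f"per-test threshold: {threshold:.4f}")
for (metric, variant), verdict in verdicts.items():
    print(f"{metric:<12} A_base vs {variant:<13} {verdict}")
```

The correction does not change the picture here: the conversion difference for `C_no_history` was already non-significant, and the other listed differences remain well below the stricter threshold.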

---

### Conclusion:

* **Adding DiversityReranker (B)** has the strongest positive effect on both user satisfaction and engagement.
* **Removing History (C)** harms performance significantly.
* **Removing Critique (D)** yields slight improvements, suggesting some critiques may be overly restrictive.
* Within this simulation, **Variant B** is the most promising direction; confirming it for production deployment would require an A/B test on real user traffic.

