# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/13_systemes_recommandation/13_exercices.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '13_exercices.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected, packages are already installed.')

# Chapter 13 - Exercises: Recommender Systems

This notebook contains hands-on exercises on recommender systems, each with a complete solution.

## Exercises

1. **Movie Recommendation with Collaborative Filtering**
2. **Content-Based System with TF-IDF**
3. **Hybrid System (CF + Content-Based)**

Each exercise includes:
- Problem description
- Data and context
- Detailed instructions
- Complete solution

In [None]:
# Common imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

---

## Exercise 1: Movie Recommendation with Collaborative Filtering

### Objective
Implement a movie recommender using **user-based** and **item-based collaborative filtering**, then compare the two.

### Context
You work for a video streaming platform and must recommend movies to users based on their viewing history.

### Instructions

1. Load the MovieLens 100K dataset
2. Build the user-item matrix
3. Implement user-based CF with Pearson similarity
4. Implement item-based CF with cosine similarity
5. Compare performance (RMSE, MAE)
6. Generate top-10 recommendations for 3 users
7. Analyze the diversity of the recommendations

### Solution to Exercise 1

In [None]:
# 1. Load the data
from surprise import Dataset

data = Dataset.load_builtin('ml-100k')
ratings_df = pd.DataFrame(data.raw_ratings, columns=['user_id', 'item_id', 'rating', 'timestamp'])

# Mapping IDs
user_ids = ratings_df['user_id'].unique()
item_ids = ratings_df['item_id'].unique()
user_id_map = {uid: idx for idx, uid in enumerate(user_ids)}
item_id_map = {iid: idx for idx, iid in enumerate(item_ids)}

ratings_df['user_idx'] = ratings_df['user_id'].map(user_id_map)
ratings_df['item_idx'] = ratings_df['item_id'].map(item_id_map)

n_users = len(user_ids)
n_items = len(item_ids)

print(f"Dataset loaded: {len(ratings_df)} ratings, {n_users} users, {n_items} items")

# Train/Test split
train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=42)
print(f"Train: {len(train_df)}, Test: {len(test_df)}")

In [None]:
# 2. Build the user-item matrix (0 means "not rated")
def create_user_item_matrix(df, n_users, n_items):
    matrix = np.zeros((n_users, n_items))
    for _, row in df.iterrows():
        matrix[int(row['user_idx']), int(row['item_idx'])] = row['rating']
    return matrix

train_matrix = create_user_item_matrix(train_df, n_users, n_items)
print(f"Train matrix shape: {train_matrix.shape}")
print(f"Sparsity: {(1 - np.count_nonzero(train_matrix) / train_matrix.size):.2%}")
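
The row-by-row loop above works, but `iterrows` is slow on tens of thousands of ratings. A possible vectorized variant is sketched below on toy data. The function name `create_user_item_matrix_fast` and the toy ratings are illustrative assumptions, not part of the original solution.

```python
import numpy as np

def create_user_item_matrix_fast(user_idx, item_idx, ratings, n_users, n_items):
    """Build a dense user-item matrix with a single fancy-indexing assignment."""
    matrix = np.zeros((n_users, n_items))
    # Each (user, item) pair receives its rating; unrated cells stay at 0.
    matrix[np.asarray(user_idx), np.asarray(item_idx)] = np.asarray(ratings, dtype=float)
    return matrix

# Toy data (hypothetical): 3 users, 3 items, 4 ratings
toy_matrix = create_user_item_matrix_fast(
    user_idx=[0, 0, 1, 2], item_idx=[0, 2, 1, 2], ratings=[5, 3, 4, 1],
    n_users=3, n_items=3,
)
toy_sparsity = 1 - np.count_nonzero(toy_matrix) / toy_matrix.size
print(toy_matrix)
print(f"Sparsity: {toy_sparsity:.2%}")
```

With the notebook's data, the three index and rating arrays would presumably come from `train_df['user_idx'].to_numpy()`, `train_df['item_idx'].to_numpy()` and `train_df['rating'].to_numpy()`.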

In [None]:
# 3. User-Based CF with Pearson correlation
from scipy.stats import pearsonr

def compute_pearson_similarity(matrix):
    """Compute the Pearson similarity between users."""
    n_users = matrix.shape[0]
    similarity = np.zeros((n_users, n_users))
    
    for i in range(n_users):
        for j in range(i, n_users):
            # Items rated by both users
            mask = (matrix[i] > 0) & (matrix[j] > 0)
            if mask.sum() > 1:  # At least 2 items in common
                corr, _ = pearsonr(matrix[i][mask], matrix[j][mask])
                similarity[i, j] = corr if not np.isnan(corr) else 0
                similarity[j, i] = similarity[i, j]
    
    return similarity

print("Computing Pearson similarity (this may take 1-2 minutes)...")
user_similarity_pearson = compute_pearson_similarity(train_matrix)
print(f"Mean similarity: {user_similarity_pearson.mean():.4f}")

# Visualize
plt.figure(figsize=(8, 6))
plt.imshow(user_similarity_pearson[:100, :100], cmap='RdYlGn', vmin=-1, vmax=1, aspect='auto')
plt.colorbar(label='Pearson correlation')
plt.title('User-User Similarity (Pearson) - first 100 users')
plt.xlabel('User Index')
plt.ylabel('User Index')
plt.show()
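
To make the similarity computation concrete, here is a small sketch of the Pearson correlation restricted to co-rated items, written with plain NumPy. The helper name `pearson_on_common_items` and the two toy users are assumptions made for illustration; the notebook itself relies on `scipy.stats.pearsonr`.

```python
import numpy as np

def pearson_on_common_items(ratings_a, ratings_b):
    """Pearson correlation over items both users rated (0 means unrated)."""
    a = np.asarray(ratings_a, dtype=float)
    b = np.asarray(ratings_b, dtype=float)
    mask = (a > 0) & (b > 0)
    if mask.sum() < 2:
        return 0.0  # not enough overlap to define a correlation
    a_centered = a[mask] - a[mask].mean()
    b_centered = b[mask] - b[mask].mean()
    denom = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    if denom == 0:
        return 0.0  # one user gave the same rating to every common item
    return float((a_centered * b_centered).sum() / denom)

# Toy users (hypothetical): items 0, 1 and 4 are rated by both
user_a = [5, 3, 0, 1, 4]
user_b = [4, 2, 5, 0, 5]
sim_ab = pearson_on_common_items(user_a, user_b)
print(f"Pearson similarity: {sim_ab:.4f}")
```

Returning 0 for degenerate cases mirrors the notebook's choice of replacing NaN correlations with 0.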

In [None]:
# User-based prediction function
def predict_user_based_pearson(train_matrix, user_similarity, user_idx, item_idx, k=30):
    users_who_rated = np.where(train_matrix[:, item_idx] > 0)[0]
    
    if len(users_who_rated) == 0:
        return train_matrix[train_matrix > 0].mean()
    
    sims = user_similarity[user_idx, users_who_rated]
    top_k_indices = np.argsort(sims)[-k:][::-1]
    top_k_users = users_who_rated[top_k_indices]
    top_k_sims = sims[top_k_indices]
    
    # Mean-centered prediction
    user_mean = train_matrix[user_idx][train_matrix[user_idx] > 0].mean()
    neighbor_means = np.array([train_matrix[u][train_matrix[u] > 0].mean() for u in top_k_users])
    neighbor_ratings = train_matrix[top_k_users, item_idx]
    
    if np.abs(top_k_sims).sum() == 0:
        return user_mean
    
    prediction = user_mean + np.sum(top_k_sims * (neighbor_ratings - neighbor_means)) / np.sum(np.abs(top_k_sims))
    return np.clip(prediction, 1, 5)

# Evaluation
print("\nEvaluating User-Based CF (Pearson)...")
test_sample = test_df.sample(n=min(1000, len(test_df)), random_state=42)

y_true_ub = []
y_pred_ub = []

for _, row in test_sample.iterrows():
    user_idx = int(row['user_idx'])
    item_idx = int(row['item_idx'])
    true_rating = row['rating']
    pred_rating = predict_user_based_pearson(train_matrix, user_similarity_pearson, user_idx, item_idx, k=30)
    
    y_true_ub.append(true_rating)
    y_pred_ub.append(pred_rating)

rmse_ub = np.sqrt(mean_squared_error(y_true_ub, y_pred_ub))
mae_ub = mean_absolute_error(y_true_ub, y_pred_ub)

print(f"User-Based CF (Pearson):")
print(f"  RMSE: {rmse_ub:.4f}")
print(f"  MAE: {mae_ub:.4f}")
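
The mean-centered prediction used above adds to the user's own mean a similarity-weighted average of how far each neighbor's rating deviates from that neighbor's mean. The sketch below isolates that formula on made-up numbers; the function name `mean_centered_prediction` and the inputs are illustrative assumptions.

```python
import numpy as np

def mean_centered_prediction(user_mean, neighbor_sims, neighbor_ratings, neighbor_means):
    """user_mean + sum(sim * (rating - neighbor_mean)) / sum(|sim|), clipped to [1, 5]."""
    sims = np.asarray(neighbor_sims, dtype=float)
    deviations = np.asarray(neighbor_ratings, dtype=float) - np.asarray(neighbor_means, dtype=float)
    denom = np.abs(sims).sum()
    if denom == 0:
        return float(user_mean)  # no usable neighbor: fall back to the user's mean
    prediction = user_mean + (sims * deviations).sum() / denom
    return float(np.clip(prediction, 1, 5))

# Two hypothetical neighbors: one rated the item above their mean, one below
toy_pred = mean_centered_prediction(
    user_mean=3.5, neighbor_sims=[0.8, 0.4],
    neighbor_ratings=[5, 2], neighbor_means=[4.0, 3.0],
)
print(f"Predicted rating: {toy_pred:.3f}")
```

Centering on each neighbor's mean compensates for users who rate systematically high or low.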

In [None]:
# 4. Item-Based CF with cosine similarity
item_similarity_cosine = cosine_similarity(train_matrix.T + 1e-9)

print(f"\nItem similarity shape: {item_similarity_cosine.shape}")
print(f"Mean similarity: {item_similarity_cosine.mean():.4f}")

# Visualize
plt.figure(figsize=(8, 6))
plt.imshow(item_similarity_cosine[:100, :100], cmap='YlOrRd', vmin=0, vmax=1, aspect='auto')
plt.colorbar(label='Cosine similarity')
plt.title('Item-Item Similarity (Cosine) - first 100 items')
plt.xlabel('Item Index')
plt.ylabel('Item Index')
plt.show()

In [None]:
# Item-based prediction function
def predict_item_based_cosine(train_matrix, item_similarity, user_idx, item_idx, k=30):
    items_rated = np.where(train_matrix[user_idx, :] > 0)[0]
    
    if len(items_rated) == 0:
        return train_matrix[train_matrix > 0].mean()
    
    sims = item_similarity[item_idx, items_rated]
    top_k_indices = np.argsort(sims)[-k:][::-1]
    top_k_items = items_rated[top_k_indices]
    top_k_sims = sims[top_k_indices]
    
    user_ratings = train_matrix[user_idx, top_k_items]
    
    if top_k_sims.sum() == 0:
        return train_matrix[user_idx][train_matrix[user_idx] > 0].mean()
    
    prediction = np.sum(top_k_sims * user_ratings) / top_k_sims.sum()
    return np.clip(prediction, 1, 5)

# Evaluation (same test_sample, hence the same true ratings as y_true_ub)
print("\nEvaluating Item-Based CF (Cosine)...")

y_pred_ib = []

for _, row in test_sample.iterrows():
    user_idx = int(row['user_idx'])
    item_idx = int(row['item_idx'])
    pred_rating = predict_item_based_cosine(train_matrix, item_similarity_cosine, user_idx, item_idx, k=30)
    y_pred_ib.append(pred_rating)

rmse_ib = np.sqrt(mean_squared_error(y_true_ub, y_pred_ib))
mae_ib = mean_absolute_error(y_true_ub, y_pred_ib)

print(f"Item-Based CF (Cosine):")
print(f"  RMSE: {rmse_ib:.4f}")
print(f"  MAE: {mae_ib:.4f}")
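
The item-based evaluation above reuses `y_true_ub`, which is only valid because both loops iterate over the same `test_sample` in the same order. One way to avoid that coupling is a small helper that collects true and predicted ratings together. The sketch below is an assumption-laden illustration: `evaluate_predictor` and the toy samples are hypothetical names and data.

```python
import numpy as np

def evaluate_predictor(predict_fn, samples):
    """Return (RMSE, MAE) for predict_fn over (user_idx, item_idx, true_rating) triples."""
    y_true, y_pred = [], []
    for user_idx, item_idx, true_rating in samples:
        y_true.append(true_rating)
        y_pred.append(predict_fn(user_idx, item_idx))
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))
    return rmse, mae

# Toy check with a constant predictor that always answers 3.0
toy_samples = [(0, 1, 4.0), (1, 2, 2.0), (2, 0, 5.0)]
baseline_rmse, baseline_mae = evaluate_predictor(lambda u, i: 3.0, toy_samples)
print(f"RMSE: {baseline_rmse:.4f}, MAE: {baseline_mae:.4f}")
```

In the notebook, `predict_fn` could be a lambda wrapping `predict_item_based_cosine` or `predict_user_based_pearson` with the matrices already bound.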

In [None]:
# 5. Performance comparison
results_df = pd.DataFrame({
    'Method': ['User-Based (Pearson)', 'Item-Based (Cosine)'],
    'RMSE': [rmse_ub, rmse_ib],
    'MAE': [mae_ub, mae_ib]
})

print("\n=== Method Comparison ===")
print(results_df.to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

methods = results_df.iloc[:, 0]  # first column holds the method names
x_pos = np.arange(len(methods))

axes[0].bar(x_pos, results_df['RMSE'], color=['steelblue', 'coral'])
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(methods, rotation=15, ha='right')
axes[0].set_ylabel('RMSE')
axes[0].set_title('RMSE by Method')
axes[0].grid(axis='y', alpha=0.3)

axes[1].bar(x_pos, results_df['MAE'], color=['steelblue', 'coral'])
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(methods, rotation=15, ha='right')
axes[1].set_ylabel('MAE')
axes[1].set_title('MAE by Method')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# 6. Generate top-10 recommendations
def recommend_top_k_item_based(train_matrix, item_similarity, user_idx, k=10):
    """Generate top-K recommendations with item-based CF."""
    rated_items = set(np.where(train_matrix[user_idx, :] > 0)[0])
    candidate_items = [i for i in range(train_matrix.shape[1]) if i not in rated_items]
    
    predictions = []
    for item_idx in candidate_items:
        pred = predict_item_based_cosine(train_matrix, item_similarity, user_idx, item_idx, k=30)
        predictions.append((item_idx, pred))
    
    predictions.sort(key=lambda x: x[1], reverse=True)
    return predictions[:k]

# Recommendations for 3 users
example_users = [0, 10, 50]

print("\n=== Top-10 Recommendations (Item-Based CF) ===")
for user_idx in example_users:
    print(f"\nUser {user_idx}:")
    top_10 = recommend_top_k_item_based(train_matrix, item_similarity_cosine, user_idx, k=10)
    for rank, (item_idx, score) in enumerate(top_10, 1):
        print(f"  {rank}. Item {item_idx}: score = {score:.2f}")
    
    # Show the user's five highest-rated items for comparison
    user_ratings = train_df[train_df['user_idx'] == user_idx][['item_idx', 'rating']].sort_values('rating', ascending=False).head(5)
    print(f"  Already-rated items (top 5): {user_ratings['item_idx'].tolist()}")

In [None]:
# 7. Analyze the diversity of the recommendations
def compute_diversity(recommendations, item_similarity):
    """Compute intra-list diversity (1 - mean pairwise similarity)."""
    if len(recommendations) < 2:
        return 0
    
    item_indices = [item_idx for item_idx, _ in recommendations]
    total_sim = 0
    count = 0
    
    for i in range(len(item_indices)):
        for j in range(i+1, len(item_indices)):
            total_sim += item_similarity[item_indices[i], item_indices[j]]
            count += 1
    
    avg_sim = total_sim / count if count > 0 else 0
    diversity = 1 - avg_sim
    return diversity

print("\n=== Diversity Analysis ===")
diversities = []

for user_idx in range(min(50, n_users)):
    top_10 = recommend_top_k_item_based(train_matrix, item_similarity_cosine, user_idx, k=10)
    div = compute_diversity(top_10, item_similarity_cosine)
    diversities.append(div)

print(f"Mean diversity: {np.mean(diversities):.4f}")
print(f"Diversity min/max: {np.min(diversities):.4f} / {np.max(diversities):.4f}")

plt.figure(figsize=(8, 5))
plt.hist(diversities, bins=20, color='steelblue', edgecolor='black', alpha=0.7)
plt.xlabel('Diversity')
plt.ylabel('Number of users')
plt.title('Distribution of Recommendation Diversity')
plt.axvline(np.mean(diversities), color='red', linestyle='--', label=f'Mean: {np.mean(diversities):.3f}')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
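
Intra-list diversity is one minus the mean similarity over all distinct pairs in a recommendation list. The nested loop can also be expressed with NumPy indexing, as in this sketch; `intra_list_diversity` and the 3x3 toy similarity matrix are illustrative assumptions.

```python
import numpy as np

def intra_list_diversity(item_indices, item_similarity):
    """1 - mean pairwise similarity over distinct pairs of recommended items."""
    idx = np.asarray(item_indices)
    if idx.size < 2:
        return 0.0
    # Sub-matrix restricted to the recommended items
    sub = item_similarity[np.ix_(idx, idx)]
    # Strict upper triangle = each unordered pair exactly once
    pairwise = sub[np.triu_indices(idx.size, k=1)]
    return float(1 - pairwise.mean())

# Hypothetical similarity matrix for 3 items
toy_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])
div_all = intra_list_diversity([0, 1, 2], toy_sim)
div_pair = intra_list_diversity([0, 1], toy_sim)
print(f"Diversity of [0, 1, 2]: {div_all:.2f}")
print(f"Diversity of [0, 1]: {div_pair:.2f}")
```

A list made of near-duplicates scores close to 0, while a list of unrelated items scores close to 1.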

---

## Exercise 2: Content-Based System with TF-IDF

### Objective
Build a content-based recommender that uses TF-IDF over movie genres and descriptions.

### Context
You must recommend similar movies based solely on their characteristics (genres, descriptions), without using other users' ratings.

### Instructions

1. Create a synthetic movie dataset with genres and descriptions
2. Vectorize the text features with TF-IDF
3. Compute the cosine similarity between movies
4. Implement a recommendation function
5. Recommend the 5 most similar movies for 3 given movies
6. Analyze the system's coverage (% of items recommended at least once)

### Solution to Exercise 2

In [None]:
# 1. Create a synthetic movie dataset
movies_data = {
    'movie_id': list(range(20)),
    'title': [
        'Inception', 'The Matrix', 'Interstellar', 'Blade Runner 2049', 'Tenet',
        'The Notebook', 'Titanic', 'La La Land', 'Pride and Prejudice', 'Before Sunrise',
        'Avengers: Endgame', 'Iron Man', 'Thor', 'Black Panther', 'Spider-Man',
        'The Godfather', 'Goodfellas', 'Scarface', 'Casino', 'The Departed'
    ],
    'genres': [
        'Sci-Fi Thriller Action', 'Sci-Fi Action', 'Sci-Fi Drama Adventure', 'Sci-Fi Thriller', 'Sci-Fi Thriller Action',
        'Romance Drama', 'Romance Drama Disaster', 'Romance Musical Drama', 'Romance Drama Period', 'Romance Drama',
        'Action Superhero Adventure', 'Action Superhero', 'Action Superhero Fantasy', 'Action Superhero Drama', 'Action Superhero',
        'Crime Drama', 'Crime Drama Thriller', 'Crime Drama Thriller', 'Crime Drama', 'Crime Drama Thriller'
    ],
    'description': [
        'A thief who steals corporate secrets through dream-sharing technology',
        'A hacker discovers reality is a simulation',
        'Astronauts travel through a wormhole to save humanity',
        'A blade runner must find and eliminate rogue replicants',
        'A protagonist tries to prevent World War III through time manipulation',
        'A poor young man falls in love with a rich young woman',
        'A seventeen-year-old aristocrat falls in love with a poor artist aboard a ship',
        'A jazz pianist falls for an aspiring actress in Los Angeles',
        'A woman navigates love and society in 19th century England',
        'A romantic trilogy about two people meeting over nine years',
        'Superheroes assemble to defeat a powerful villain and save the universe',
        'A billionaire builds an armor suit to fight evil and terrorism',
        'The Norse god of thunder protects Earth and the Nine Realms',
        'The king of Wakanda fights to protect his nation',
        'A teenager gains spider-like abilities and fights crime',
        'The aging patriarch of an organized crime dynasty transfers control to his son',
        'The story of Henry Hill and his life in the mob',
        'A Cuban refugee builds a drug empire in Miami',
        'A tale of greed, deception, money, power in Las Vegas',
        'An undercover cop and a mole try to identify each other'
    ]
}

movies_df = pd.DataFrame(movies_data)
print(f"Dataset created: {len(movies_df)} movies")
movies_df.head(10)

In [None]:
# 2. Vectorize with TF-IDF
# Combine genres and description into a single text field
movies_df['content'] = movies_df['genres'] + ' ' + movies_df['description']

# TF-IDF vectorization
tfidf = TfidfVectorizer(stop_words='english', max_features=100)
tfidf_matrix = tfidf.fit_transform(movies_df['content'])

print(f"\nTF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Vocabulary (first 20 terms): {list(tfidf.vocabulary_.keys())[:20]}")

# Visualize the TF-IDF matrix
plt.figure(figsize=(12, 8))
plt.imshow(tfidf_matrix.toarray(), cmap='YlOrRd', aspect='auto')
plt.colorbar(label='TF-IDF Score')
plt.xlabel('Feature Index')
plt.ylabel('Movie Index')
plt.title('TF-IDF Matrix of Movies')
plt.show()
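
To see what the vectorizer computes, the following sketch rebuilds TF-IDF by hand with NumPy on three toy documents. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which is understood to correspond to scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`); tokenization here is a plain whitespace split, which is a simplifying assumption. All variable names and documents are illustrative.

```python
import numpy as np

# Toy corpus (hypothetical), already lower-cased and free of stop words
toy_docs = ["crime drama mob", "crime drama thriller", "romance drama"]

vocab = sorted({token for doc in toy_docs for token in doc.split()})
# Term counts: one row per document, one column per vocabulary term
counts = np.array(
    [[doc.split().count(term) for term in vocab] for doc in toy_docs],
    dtype=float,
)

n_docs = len(toy_docs)
doc_freq = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
tfidf_raw = counts * idf
tfidf_manual = tfidf_raw / np.linalg.norm(tfidf_raw, axis=1, keepdims=True)

# Rows are unit vectors, so the dot product is the cosine similarity
toy_cosine = tfidf_manual @ tfidf_manual.T
print(vocab)
print(np.round(toy_cosine, 3))
```

A term that appears in every document ("drama") receives the minimum IDF of 1, so it contributes less to similarity than rarer terms such as "mob" or "thriller".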

In [None]:
# 3. Compute cosine similarity
movie_similarity = cosine_similarity(tfidf_matrix)

print(f"\nMovie similarity matrix shape: {movie_similarity.shape}")
print(f"Mean similarity: {movie_similarity.mean():.4f}")

# Visualize
plt.figure(figsize=(10, 8))
plt.imshow(movie_similarity, cmap='RdYlGn', vmin=0, vmax=1, aspect='auto')
plt.colorbar(label='Cosine similarity')
plt.title('Movie Similarity Matrix (Content-Based)')
plt.xlabel('Movie Index')
plt.ylabel('Movie Index')

# Annotate with titles
plt.xticks(range(len(movies_df)), movies_df['title'], rotation=90, fontsize=8)
plt.yticks(range(len(movies_df)), movies_df['title'], fontsize=8)
plt.tight_layout()
plt.show()

In [None]:
# 4. Recommendation function
def recommend_similar_movies(movie_idx, similarity_matrix, movies_df, k=5):
    """Recommend the K most similar movies."""
    # Similarities between this movie and every movie in the catalog
    sims = similarity_matrix[movie_idx]
    
    # Sort by decreasing similarity and exclude the movie itself explicitly
    # (skipping position 0 would fail if another movie ties at similarity 1.0)
    similar_indices = [idx for idx in np.argsort(sims)[::-1] if idx != movie_idx][:k]
    
    recommendations = []
    for idx in similar_indices:
        recommendations.append({
            'movie_id': movies_df.iloc[idx]['movie_id'],
            'title': movies_df.iloc[idx]['title'],
            'genres': movies_df.iloc[idx]['genres'],
            'similarity': sims[idx]
        })
    
    return recommendations

In [None]:
# 5. Recommendations for 3 movies
example_movies = [0, 5, 10]  # Inception, The Notebook, Avengers: Endgame

print("\n=== Content-Based Recommendations (Top 5) ===")
for movie_idx in example_movies:
    movie_title = movies_df.iloc[movie_idx]['title']
    movie_genres = movies_df.iloc[movie_idx]['genres']
    
    print(f"\n--- Film: {movie_title} ({movie_genres}) ---")
    
    recommendations = recommend_similar_movies(movie_idx, movie_similarity, movies_df, k=5)
    
    for rank, rec in enumerate(recommendations, 1):
        print(f"  {rank}. {rec['title']} ({rec['genres']}) - Similarity: {rec['similarity']:.3f}")

In [None]:
# 6. Analyze coverage
def compute_coverage(movies_df, similarity_matrix, k=5):
    """Compute the fraction of items recommended at least once."""
    recommended_items = set()
    
    for movie_idx in range(len(movies_df)):
        recs = recommend_similar_movies(movie_idx, similarity_matrix, movies_df, k=k)
        for rec in recs:
            recommended_items.add(rec['movie_id'])
    
    coverage = len(recommended_items) / len(movies_df)
    return coverage, recommended_items

coverage, recommended_items = compute_coverage(movies_df, movie_similarity, k=5)

print(f"\n=== Coverage Analysis (K=5) ===")
print(f"Items recommended at least once: {len(recommended_items)} / {len(movies_df)}")
print(f"Coverage: {coverage:.2%}")

# Items never recommended
all_items = set(movies_df['movie_id'])
never_recommended = all_items - recommended_items

if never_recommended:
    print(f"\nMovies never recommended: {never_recommended}")
    for movie_id in never_recommended:
        title = movies_df[movies_df['movie_id'] == movie_id]['title'].values[0]
        print(f"  - {title}")
else:
    print("\nEvery movie is recommended at least once!")
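
Coverage does not depend on how the lists were produced, so it can be computed from any collection of recommendation lists. The sketch below shows this in isolation with plain Python; `catalog_coverage` and the toy lists are hypothetical.

```python
def catalog_coverage(recommendation_lists, n_items):
    """Return (fraction of the catalog recommended at least once, set of recommended ids)."""
    recommended = set()
    for recs in recommendation_lists:
        recommended.update(recs)
    return len(recommended) / n_items, recommended

# Three hypothetical top-2 lists over a catalog of 5 items (ids 0..4)
toy_lists = [[1, 2], [2, 3], [1, 3]]
toy_coverage, toy_recommended = catalog_coverage(toy_lists, n_items=5)
toy_never = set(range(5)) - toy_recommended
print(f"Coverage: {toy_coverage:.0%}, never recommended: {sorted(toy_never)}")
```

The same helper would apply to the item-based CF lists of Exercise 1, which makes the two systems comparable on this metric.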

---

## Exercise 3: Hybrid System (CF + Content-Based)

### Objective
Build a hybrid system that combines collaborative filtering and content-based filtering to improve recommendations.

### Context
You want the benefits of both approaches: CF to capture collective preferences, and content-based filtering to handle cold start and support diversity.

### Instructions

1. Reuse the results of Exercises 1 and 2
2. Implement a weighted combination strategy
3. Test several values of the weight alpha (0.0 to 1.0)
4. Compare performance against CF alone and content-based alone
5. Analyze the cases where the hybrid system does better

### Solution to Exercise 3

In [None]:
# 1. Prepare the scores of both systems
# To keep things simple, we create synthetic content-based scores for MovieLens.
# Normally these would come from real features (e.g. the MovieLens genres);
# this notebook simulates them instead.

# In this exercise we combine:
# - CF scores (item-based) from Exercise 1
# - Simulated content-based scores (derived from random clusters)

print("=== Hybrid System: CF + Content-Based ===")
print("\nWe will combine the scores of the two systems with a weighting.")

In [None]:
# 2. Implement the weighted strategy
def hybrid_recommendation(user_idx, train_matrix, item_similarity_cf, item_similarity_content, alpha=0.5, k=10):
    """
    Hybrid recommendation: score = alpha * CF + (1 - alpha) * Content
    
    Args:
        user_idx: index of the user
        train_matrix: user-item matrix for CF
        item_similarity_cf: item-item similarity for CF
        item_similarity_content: item-item similarity for content-based
        alpha: weight for CF (0 = content only, 1 = CF only)
        k: number of recommendations
    """
    rated_items = set(np.where(train_matrix[user_idx, :] > 0)[0])
    candidate_items = [i for i in range(train_matrix.shape[1]) if i not in rated_items]
    
    predictions = []
    
    for item_idx in candidate_items:
        # Score CF
        score_cf = predict_item_based_cosine(train_matrix, item_similarity_cf, user_idx, item_idx, k=30)
        
        # Content-based score (average of the user's ratings, weighted by content similarity)
        if len(rated_items) > 0:
            rated_items_list = list(rated_items)
            content_sims = item_similarity_content[item_idx, rated_items_list]
            user_ratings_rated = train_matrix[user_idx, rated_items_list]
            
            if content_sims.sum() > 0:
                score_content = np.sum(content_sims * user_ratings_rated) / content_sims.sum()
            else:
                score_content = train_matrix[user_idx][train_matrix[user_idx] > 0].mean()
        else:
            score_content = train_matrix[train_matrix > 0].mean()
        
        # Combine
        hybrid_score = alpha * score_cf + (1 - alpha) * score_content
        predictions.append((item_idx, hybrid_score))
    
    predictions.sort(key=lambda x: x[1], reverse=True)
    return predictions[:k]

print("\nHybrid recommendation function implemented.")

In [None]:
# For this exercise, we simulate a content-based similarity for MovieLens.
# In practice it would come from genres, descriptions, etc.

# Simulate a content similarity matrix from random clusters.
# Because the clusters are random, the matrix carries no real content signal;
# it only demonstrates the mechanics of the combination.
np.random.seed(42)
n_clusters = 10
item_clusters = np.random.randint(0, n_clusters, n_items)

# Similarity = 1 if same cluster, 0.3 otherwise
item_similarity_content_sim = np.zeros((n_items, n_items))
for i in range(n_items):
    for j in range(n_items):
        if item_clusters[i] == item_clusters[j]:
            item_similarity_content_sim[i, j] = 1.0
        else:
            item_similarity_content_sim[i, j] = 0.3

print(f"Simulated content similarity matrix: {item_similarity_content_sim.shape}")
print(f"Mean similarity: {item_similarity_content_sim.mean():.4f}")
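
The double loop above visits every pair of items, which amounts to a few million Python iterations for MovieLens 100K. The same matrix can be obtained by broadcasting, as sketched here on a toy example; the names `toy_clusters` and `toy_similarity`, and the use of `np.random.default_rng`, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
toy_clusters = rng.integers(0, 3, size=6)  # 6 hypothetical items in 3 clusters

# Compare every item's cluster with every other item's cluster via broadcasting:
# shape (6, 1) against shape (1, 6) yields a (6, 6) boolean matrix
same_cluster = toy_clusters[:, None] == toy_clusters[None, :]

# Similarity = 1.0 inside a cluster, 0.3 across clusters
toy_similarity = np.where(same_cluster, 1.0, 0.3)
print(toy_clusters)
print(toy_similarity)
```

Applied to the notebook, `toy_clusters` would be replaced by `item_clusters`, giving an `(n_items, n_items)` matrix in a single step.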

In [None]:
# 3. Test several values of alpha
alpha_values = [0.0, 0.25, 0.5, 0.75, 1.0]

print("\n=== Testing different values of alpha ===")
print("alpha=0.0: Content-Based only")
print("alpha=1.0: CF only")
print("alpha=0.5: 50/50 balance\n")

# Evaluate on a test sample
test_sample_hybrid = test_df.sample(n=min(500, len(test_df)), random_state=42)

results_hybrid = []

for alpha in alpha_values:
    print(f"Evaluating alpha={alpha:.2f}...")
    
    y_true_h = []
    y_pred_h = []
    
    for _, row in test_sample_hybrid.iterrows():
        user_idx = int(row['user_idx'])
        item_idx = int(row['item_idx'])
        true_rating = row['rating']
        
        # Score CF
        score_cf = predict_item_based_cosine(train_matrix, item_similarity_cosine, user_idx, item_idx, k=30)
        
        # Score content
        rated_items = np.where(train_matrix[user_idx, :] > 0)[0]
        if len(rated_items) > 0:
            content_sims = item_similarity_content_sim[item_idx, rated_items]
            user_ratings = train_matrix[user_idx, rated_items]
            score_content = np.sum(content_sims * user_ratings) / content_sims.sum() if content_sims.sum() > 0 else 3.0
        else:
            score_content = 3.0
        
        # Hybrid
        hybrid_score = alpha * score_cf + (1 - alpha) * score_content
        
        y_true_h.append(true_rating)
        y_pred_h.append(hybrid_score)
    
    rmse_h = np.sqrt(mean_squared_error(y_true_h, y_pred_h))
    mae_h = mean_absolute_error(y_true_h, y_pred_h)
    
    results_hybrid.append({
        'alpha': alpha,
        'RMSE': rmse_h,
        'MAE': mae_h
    })
    
    print(f"  RMSE: {rmse_h:.4f}, MAE: {mae_h:.4f}")

results_hybrid_df = pd.DataFrame(results_hybrid)
print("\n=== Hybrid Results ===")
print(results_hybrid_df.to_string(index=False))

In [None]:
# 4. Visualize the results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(results_hybrid_df['alpha'], results_hybrid_df['RMSE'], marker='o', linewidth=2, markersize=8, color='steelblue')
axes[0].set_xlabel('Alpha (CF weight)')
axes[0].set_ylabel('RMSE')
axes[0].set_title('RMSE as a Function of Alpha\n(alpha=0: Content, alpha=1: CF)')
axes[0].grid(alpha=0.3)
axes[0].axvline(results_hybrid_df.loc[results_hybrid_df['RMSE'].idxmin(), 'alpha'], color='red', linestyle='--', label='Best alpha')
axes[0].legend()

axes[1].plot(results_hybrid_df['alpha'], results_hybrid_df['MAE'], marker='s', linewidth=2, markersize=8, color='coral')
axes[1].set_xlabel('Alpha (CF weight)')
axes[1].set_ylabel('MAE')
axes[1].set_title('MAE as a Function of Alpha')
axes[1].grid(alpha=0.3)
axes[1].axvline(results_hybrid_df.loc[results_hybrid_df['MAE'].idxmin(), 'alpha'], color='red', linestyle='--', label='Best alpha')
axes[1].legend()

plt.tight_layout()
plt.show()

best_alpha = results_hybrid_df.loc[results_hybrid_df['RMSE'].idxmin(), 'alpha']
print(f"\nBest alpha (minimum RMSE): {best_alpha}")
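
A blend can beat both of its components when their errors point in opposite directions and partly cancel. The sketch below illustrates this on four made-up ratings; every number in it is hypothetical and chosen only to show the mechanism, not taken from the MovieLens results.

```python
import numpy as np

# Hypothetical true ratings and the scores of two imperfect predictors
y_true = np.array([4.0, 2.0, 5.0, 3.0])
score_cf = np.array([3.5, 2.5, 4.0, 3.5])
score_content = np.array([4.5, 1.0, 4.5, 2.0])

def rmse(y_pred, y_ref):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_pred - y_ref) ** 2)))

# Evaluate score = alpha * CF + (1 - alpha) * Content on a small grid
toy_alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
rmse_by_alpha = {
    alpha: rmse(alpha * score_cf + (1 - alpha) * score_content, y_true)
    for alpha in toy_alphas
}
toy_best_alpha = min(rmse_by_alpha, key=rmse_by_alpha.get)

for alpha, value in rmse_by_alpha.items():
    print(f"alpha={alpha:.2f}: RMSE={value:.4f}")
print(f"Best alpha on this toy data: {toy_best_alpha}")
```

When one predictor is uniformly better and the errors are aligned, the minimum instead falls at alpha = 0 or alpha = 1, which is also a legitimate outcome of the grid search.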

In [None]:
# 5. Analyze the cases where the hybrid system does better
print("\n=== Analysis: When is the hybrid system better? ===")
print("\nThe hybrid system generally outperforms the individual systems in the following cases:")
print("\n1. User Cold Start:")
print("   - New users with few ratings benefit from content-based filtering")
print("   - CF alone cannot make good recommendations without a history")

print("\n2. Item Cold Start:")
print("   - New movies can be recommended thanks to their features (genres, description)")
print("   - CF alone cannot recommend items that have never been rated")

print("\n3. Diversity:")
print("   - Content-based filtering introduces diversity based on item characteristics")
print("   - Avoids the CF filter bubble (always recommending the same popular items)")

print("\n4. Sparsity:")
print("   - When the user-item matrix is very sparse, content-based filtering complements CF")
print("   - Improves the robustness of predictions")

print("\n5. Explainability:")
print("   - Content-based filtering makes recommendations easier to explain")
print("   - 'We recommend this movie because its genres are similar to those of movies you liked'")

print("\n=== Conclusion ===")
print(f"In our tests, the best alpha is {best_alpha}, which means the optimal combination")
print(f"gives a weight of {best_alpha*100:.0f}% to CF and {(1-best_alpha)*100:.0f}% to content-based.")
print("\nThe hybrid system offers the best trade-off between performance and robustness!")

## Conclusion of the Exercises

We explored three approaches to recommender systems:

1. **Collaborative Filtering** (Exercise 1)
   - User-based and item-based
   - Effective but suffers from cold start
   - Item-based is often better than user-based on MovieLens

2. **Content-Based Filtering** (Exercise 2)
   - Based on TF-IDF of text features
   - Solves cold start for new items
   - Can lack diversity (over-specialization, a form of filter bubble)

3. **Hybrid System** (Exercise 3)
   - Combines the advantages of both approaches
   - Better robustness and performance
   - Tuning the alpha parameter matters

**Key points**:
- No approach is universally better
- The choice depends on the context (available data, cold start, scalability)
- Hybrid systems often offer the best trade-off
- Ranking metrics (Precision@K, NDCG) are more relevant than RMSE for top-K recommendations
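
The ranking metrics named in the last point can be sketched in a few lines of plain Python. The version below assumes binary relevance (an item is either relevant or not) and a log2 position discount; the function names and toy lists are illustrative, not part of the exercises.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the DCG of an ideal list."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    # An ideal ranking places all relevant items (up to k) at the top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: 2 of the 3 relevant items appear, at ranks 2 and 5
toy_recs = [10, 20, 30, 40, 50]
toy_relevant = {20, 50, 99}
toy_precision = precision_at_k(toy_recs, toy_relevant, k=5)
toy_ndcg = ndcg_at_k(toy_recs, toy_relevant, k=5)
print(f"Precision@5: {toy_precision:.2f}, NDCG@5: {toy_ndcg:.4f}")
```

Unlike RMSE, both metrics look only at the top of the list, and NDCG additionally rewards placing relevant items earlier.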

**In production**:
- Use a multi-stage pipeline: candidate generation + ranking
- Incorporate contextual features (time, location, device)
- Run A/B tests to validate improvements
- Continuously monitor diversity, coverage, and fairness