# ⚡ Notebook 3: Prompt Optimization & Evaluation

## Quality Metrics, A/B Testing, and Automatic Optimization

In this notebook, we explore systematic techniques for measuring, evaluating, and automatically optimizing the performance of your prompts.

### 🎯 Learning objectives:
- Define objective quality metrics for prompts
- Implement A/B testing systems for prompts
- Build automatic optimization pipelines
- Measure the ROI and business impact of improvements
- Build real-time monitoring dashboards

## 📚 Configuration and Imports

In [None]:
import os
import json
import time
import uuid
import random
import sqlite3
from datetime import datetime, timedelta
from typing import List, Dict, Tuple, Optional, Any
from dataclasses import dataclass, asdict
from concurrent.futures import ThreadPoolExecutor, as_completed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.metrics import accuracy_score, f1_score
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# API configuration
import openai
from anthropic import Anthropic
import google.generativeai as genai

# Client configuration
# Note: the cells below call openai.ChatCompletion, the legacy interface
# that was removed in openai>=1.0, so this notebook assumes openai<1.0.
openai.api_key = os.getenv('OPENAI_API_KEY')
anthropic = Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))
genai.configure(api_key=os.getenv('GOOGLE_API_KEY'))

# Plot configuration
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Configuration complete!")

## 1. 📊 Prompt Quality Metrics

### Comprehensive Metrics Framework

We will implement a complete metrics system that evaluates prompt quality along several dimensions:

- **Accuracy**: Correctness of the response
- **Completeness**: Coverage of the important content
- **Coherence**: Internal logic
- **Usefulness**: Actionable value
- **Clarity**: Clear wording and structure
- **Efficiency**: Quality-to-cost ratio
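
As a small worked example of how these dimensions combine, the sketch below computes a weighted overall score and a quality-to-cost ratio. The weights are the ones used by `PromptMetrics.overall_score` in this notebook; the helper names `weighted_score` and `quality_per_dollar`, the sample scores, and the sample cost are hypothetical and chosen only for illustration.

```python
# Weights copied from PromptMetrics.overall_score in this notebook.
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "coherence": 0.20,
    "usefulness": 0.25,
    "clarity": 0.10,
}


def weighted_score(scores: dict) -> float:
    """Weighted average of per-criterion scores, each expected in [0, 1]."""
    return round(sum(scores[name] * weight for name, weight in WEIGHTS.items()), 3)


def quality_per_dollar(score: float, cost_usd: float) -> float:
    """Quality-to-cost ratio, defined as 0.0 when the cost is zero."""
    return score / cost_usd if cost_usd > 0 else 0.0


# Hypothetical scores and cost, for illustration only.
example_scores = {
    "accuracy": 0.8,
    "completeness": 0.7,
    "coherence": 0.9,
    "usefulness": 0.6,
    "clarity": 1.0,
}
overall = weighted_score(example_scores)
efficiency = quality_per_dollar(overall, cost_usd=0.002)
print(f"overall={overall:.3f}, efficiency={efficiency:.1f}")
```

With these inputs the weighted score is 0.77. Because the efficiency ratio divides by a very small cost, it is mainly useful for ranking prompts against each other rather than as an absolute number.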

In [None]:
@dataclass
class PromptMetrics:
    """Container for the metrics of a single prompt"""
    prompt_id: str
    prompt_text: str
    response: str
    timestamp: datetime
    
    # Quality metrics (0-1)
    accuracy: float = 0.0
    completeness: float = 0.0
    coherence: float = 0.0
    usefulness: float = 0.0
    clarity: float = 0.0
    
    # Technical metrics
    response_time: float = 0.0
    token_count: int = 0
    cost_usd: float = 0.0
    
    # Business metrics
    user_satisfaction: Optional[float] = None
    task_completion: bool = False
    
    def overall_score(self) -> float:
        """Weighted overall score"""
        weights = {
            'accuracy': 0.25,
            'completeness': 0.20,
            'coherence': 0.20,
            'usefulness': 0.25,
            'clarity': 0.10
        }
        
        score = sum(getattr(self, metric) * weight 
                   for metric, weight in weights.items())
        return round(score, 3)
    
    def efficiency_score(self) -> float:
        """Quality-to-cost ratio (0.0 when no cost was recorded)"""
        if self.cost_usd == 0:
            return 0.0
        return self.overall_score() / self.cost_usd

class PromptEvaluator:
    """Automatic prompt evaluation system"""
    
    def __init__(self, model: str = "gpt-3.5-turbo"):
        self.model = model
        self.evaluations_db = "prompt_evaluations.db"
        self.init_database()
    
    def init_database(self):
        """Initialize the database that stores evaluations"""
        conn = sqlite3.connect(self.evaluations_db)
        cursor = conn.cursor()
        
        cursor.execute("""
        CREATE TABLE IF NOT EXISTS evaluations (
            id TEXT PRIMARY KEY,
            prompt_text TEXT,
            response TEXT,
            timestamp DATETIME,
            accuracy REAL,
            completeness REAL,
            coherence REAL,
            usefulness REAL,
            clarity REAL,
            response_time REAL,
            token_count INTEGER,
            cost_usd REAL,
            user_satisfaction REAL,
            task_completion BOOLEAN,
            overall_score REAL
        )
        """)
        
        conn.commit()
        conn.close()
    
    def evaluate_response_quality(self, prompt: str, response: str, context: str = "") -> Dict[str, float]:
        """Evaluate the quality of a response against several criteria"""
        evaluation_prompt = f"""
        Evaluate this prompt-response interaction against 5 specific criteria:
        
        PROMPT: {prompt}
        {f'CONTEXT: {context}' if context else ''}
        RESPONSE: {response}
        
        For each criterion, give a score from 0 to 10:
        
        1. ACCURACY: Is the response factual and correct?
        2. COMPLETENESS: Does the response cover all the important aspects?
        3. COHERENCE: Is the internal logic sound?
        4. USEFULNESS: Is the response actionable and practical?
        5. CLARITY: Is the response clear and well structured?
        
        REQUIRED response format:
        ACCURACY: X/10
        COMPLETENESS: Y/10
        COHERENCE: Z/10
        USEFULNESS: W/10
        CLARITY: V/10
        """
        
        try:
            result = openai.ChatCompletion.create(
                model=self.model,
                messages=[{"role": "user", "content": evaluation_prompt}],
                temperature=0.1
            )
            
            # Parse the scores
            content = result.choices[0].message.content
            scores = {}
            
            import re
            for criterion in ['ACCURACY', 'COMPLETENESS', 'COHERENCE', 'USEFULNESS', 'CLARITY']:
                # Raw f-string so that \s and \d reach the regex engine unchanged
                pattern = rf"{criterion}:\s*(\d+(?:\.\d+)?)/10"
                match = re.search(pattern, content, re.IGNORECASE)
                if match:
                    scores[criterion.lower()] = float(match.group(1)) / 10
                else:
                    scores[criterion.lower()] = 0.5  # Default score when parsing fails
            
            return scores
            
        except Exception as e:
            print(f"Evaluation error: {e}")
            return {k: 0.5 for k in ['accuracy', 'completeness', 'coherence', 'usefulness', 'clarity']}
    
    def calculate_technical_metrics(self, response: str, start_time: float, model: str) -> Dict:
        """Compute technical metrics"""
        end_time = time.time()
        response_time = end_time - start_time
        
        # Rough token estimate: about 1.3 tokens per whitespace-separated word
        token_count = len(response.split()) * 1.3
        
        # Cost estimate (approximate 2024 rates, USD per 1k tokens)
        cost_per_1k_tokens = {
            'gpt-3.5-turbo': 0.002,
            'gpt-4': 0.03,
            'gpt-4-turbo': 0.01,
            'claude-3-sonnet': 0.003,
            'claude-3-opus': 0.015
        }
        
        cost_rate = cost_per_1k_tokens.get(model, 0.002)
        cost_usd = (token_count / 1000) * cost_rate
        
        return {
            'response_time': response_time,
            'token_count': int(token_count),
            'cost_usd': round(cost_usd, 4)
        }
    
    def evaluate_prompt_complete(self, prompt: str, context: str = "", 
                               model: Optional[str] = None) -> PromptMetrics:
        """Run a complete evaluation of a prompt"""
        if model is None:
            model = self.model
        
        prompt_id = str(uuid.uuid4())
        start_time = time.time()
        
        # Generate the response
        try:
            response_obj = openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7
            )
            response = response_obj.choices[0].message.content
        except Exception as e:
            print(f"Generation error: {e}")
            response = "Error during generation"
        
        # Compute technical metrics right after generation, so that
        # response_time does not include the quality-evaluation call
        tech_metrics = self.calculate_technical_metrics(response, start_time, model)
        
        # Evaluate quality
        quality_scores = self.evaluate_response_quality(prompt, response, context)
        
        # Build the metrics object
        metrics = PromptMetrics(
            prompt_id=prompt_id,
            prompt_text=prompt,
            response=response,
            timestamp=datetime.now(),
            **quality_scores,
            **tech_metrics
        )
        
        # Save to the database
        self.save_evaluation(metrics)
        
        return metrics
    
    def save_evaluation(self, metrics: PromptMetrics):
        """Save an evaluation to the database"""
        conn = sqlite3.connect(self.evaluations_db)
        cursor = conn.cursor()
        
        cursor.execute("""
        INSERT INTO evaluations (
            id, prompt_text, response, timestamp, accuracy, completeness,
            coherence, usefulness, clarity, response_time, token_count,
            cost_usd, user_satisfaction, task_completion, overall_score
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            metrics.prompt_id, metrics.prompt_text, metrics.response,
            metrics.timestamp, metrics.accuracy, metrics.completeness,
            metrics.coherence, metrics.usefulness, metrics.clarity,
            metrics.response_time, metrics.token_count, metrics.cost_usd,
            metrics.user_satisfaction, metrics.task_completion, 
            metrics.overall_score()
        ))
        
        conn.commit()
        conn.close()
    
    def get_evaluation_history(self, limit: int = 100) -> pd.DataFrame:
        """Retrieve the evaluation history"""
        conn = sqlite3.connect(self.evaluations_db)
        
        query = """
        SELECT * FROM evaluations 
        ORDER BY timestamp DESC 
        LIMIT ?
        """
        
        df = pd.read_sql_query(query, conn, params=(limit,))
        conn.close()
        
        return df
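
The evaluator above depends on the judge model following the `CRITERION: N/10` format exactly. As a hedged sketch of a slightly more tolerant parser, the snippet below accepts optional spaces around the colon and slash, clamps out-of-range values to [0, 1], and falls back to a default when a criterion is missing. The function name `parse_scores` and the 0.5 default mirror the behavior above but are otherwise assumptions, not part of the notebook's API.

```python
import re

CRITERIA = ("accuracy", "completeness", "coherence", "usefulness", "clarity")


def parse_scores(text: str, default: float = 0.5) -> dict:
    """Extract 'CRITERION: N/10' scores from judge output, normalized to [0, 1]."""
    scores = {}
    for name in CRITERIA:
        # Tolerate spaces around ':' and '/', and ignore letter case.
        pattern = rf"{name}\s*:\s*(\d+(?:\.\d+)?)\s*/\s*10"
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            value = float(match.group(1)) / 10
            scores[name] = min(1.0, max(0.0, value))  # clamp out-of-range scores
        else:
            scores[name] = default  # criterion missing from the judge output
    return scores


# Hypothetical judge output: one loosely formatted line, one missing
# criterion (usefulness), and one out-of-range value.
sample = """ACCURACY: 8/10
COMPLETENESS: 7.5/10
Coherence : 9 / 10
CLARITY: 12/10"""

parsed = parse_scores(sample)
print(parsed)
```

Falling back to a neutral default keeps the pipeline running, but it also hides parsing failures, so counting how often the default is used is a reasonable sanity check.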

### 🎯 Example: Evaluating different prompts

In [None]:
# Initialize the evaluator
evaluator = PromptEvaluator()

# Test different prompts for the same task
task_context = "Creating a marketing strategy for an EdTech startup"

prompts_to_test = [
    {
        'name': 'Basic Prompt',
        'text': "Create a marketing strategy for our EdTech startup."
    },
    {
        'name': 'Structured Prompt',
        'text': """
        As a marketing expert specialized in EdTech:
        
        Create a marketing strategy for our startup, which offers online coding courses.
        
        Include:
        - Target market analysis
        - 3 priority acquisition channels
        - Budget and timeline over 6 months
        - Success metrics
        """
    },
    {
        'name': 'Prompt With Context',
        'text': """
        CONTEXT: French EdTech startup, 6 months old, 500 users, 
        €50k marketing budget, team of 3 people.
        
        ROLE: You are an experienced Growth Hacker specialized in B2C EdTech.
        
        TASK: Design a customer acquisition strategy to double our user base 
        in 4 months while keeping CAC < €100.
        
        FORMAT: Actionable plan with a timeline, a detailed budget, and precise KPIs.
        """
    }
]

# Evaluate each prompt
results = []
print("🔬 Evaluating prompts...\n")

for i, prompt_data in enumerate(prompts_to_test):
    print(f"{i+1}. Evaluating: {prompt_data['name']}")
    
    metrics = evaluator.evaluate_prompt_complete(
        prompt_data['text'], 
        task_context
    )
    
    results.append({
        'name': prompt_data['name'],
        'metrics': metrics
    })
    
    print(f"   Overall score: {metrics.overall_score():.3f}")
    print(f"   Cost: ${metrics.cost_usd:.4f}")
    print(f"   Time: {metrics.response_time:.2f}s\n")

# Compare the results
print("📊 PROMPT COMPARISON:")
print("=" * 50)

for result in sorted(results, key=lambda x: x['metrics'].overall_score(), reverse=True):
    m = result['metrics']
    print(f"\n🏆 {result['name']}")
    print(f"   Overall score: {m.overall_score():.3f}")
    print(f"   Accuracy: {m.accuracy:.2f} | Completeness: {m.completeness:.2f}")
    print(f"   Coherence: {m.coherence:.2f} | Usefulness: {m.usefulness:.2f}")
    print(f"   Clarity: {m.clarity:.2f}")
    print(f"   Efficiency: {m.efficiency_score():.1f} (quality/cost)")

## 2. 🧪 A/B Testing for Prompts

### Statistically Robust A/B Testing Framework

Let's implement an A/B testing system that compares different prompt versions objectively and with statistical rigor.
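
Before the full framework, here is a minimal sketch of the statistics involved, written with the standard library only: Welch's t statistic, the Welch-Satterthwaite degrees of freedom, Cohen's d, and the relative lift. The function name `welch_summary` and the two score lists are hypothetical; the p-value is left out because it needs the t distribution, which the framework gets from SciPy.

```python
import math
from statistics import mean, stdev


def welch_summary(a, b):
    """Summary statistics for comparing variant B against variant A."""
    n_a, n_b = len(a), len(b)
    m_a, m_b = mean(a), mean(b)
    s_a, s_b = stdev(a), stdev(b)  # sample standard deviations (ddof=1)

    # Per-group variance of the mean
    v_a, v_b = s_a**2 / n_a, s_b**2 / n_b

    # Welch's t statistic for the difference B - A
    t_stat = (m_b - m_a) / math.sqrt(v_a + v_b)

    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v_a + v_b) ** 2 / (v_a**2 / (n_a - 1) + v_b**2 / (n_b - 1))

    # Cohen's d with the pooled standard deviation
    pooled = math.sqrt(((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / (n_a + n_b - 2))
    cohens_d = (m_b - m_a) / pooled

    # Relative improvement of B over A, in percent
    lift = (m_b - m_a) / m_a * 100

    return {"t_stat": t_stat, "df": df, "cohens_d": cohens_d, "lift": lift}


# Hypothetical overall scores for two prompt variants.
scores_a = [0.62, 0.70, 0.65, 0.68, 0.60, 0.66]
scores_b = [0.74, 0.71, 0.79, 0.69, 0.77, 0.73]

summary = welch_summary(scores_a, scores_b)
print(summary)
```

Welch's test is a reasonable default here because two prompt variants can easily produce scores with different variances.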

In [None]:
@dataclass
class ABTestConfig:
    """Configuration for an A/B test"""
    test_name: str
    prompt_a: str
    prompt_b: str
    sample_size: int = 50
    significance_level: float = 0.05
    primary_metric: str = 'overall_score'
    secondary_metrics: Optional[List[str]] = None
    test_contexts: Optional[List[str]] = None
    
    def __post_init__(self):
        if self.secondary_metrics is None:
            self.secondary_metrics = ['accuracy', 'usefulness', 'efficiency_score']
        if self.test_contexts is None:
            self.test_contexts = [""]

@dataclass
class ABTestResult:
    """Results of an A/B test"""
    test_name: str
    sample_size_a: int
    sample_size_b: int
    
    # Descriptive statistics
    mean_a: float
    mean_b: float
    std_a: float
    std_b: float
    
    # Statistical tests
    p_value: float
    is_significant: bool
    confidence_interval: Tuple[float, float]
    effect_size: float  # Cohen's d
    
    # Business metrics
    lift_percentage: float
    winner: str  # 'A', 'B', or 'No significant difference'
    power: float  # Statistical power

class PromptABTester:
    """A/B testing system for prompts"""
    
    def __init__(self, evaluator: PromptEvaluator):
        self.evaluator = evaluator
        self.tests_db = "ab_tests.db"
        self.init_tests_database()
    
    def init_tests_database(self):
        """Initialize the database for A/B tests"""
        conn = sqlite3.connect(self.tests_db)
        cursor = conn.cursor()
        
        # Each variant stores one row per sample, so (test_id, variant) is not
        # unique; a surrogate key avoids UNIQUE-constraint errors on insert.
        cursor.execute("""
        CREATE TABLE IF NOT EXISTS ab_tests (
            sample_id INTEGER PRIMARY KEY AUTOINCREMENT,
            test_id TEXT,
            test_name TEXT,
            variant TEXT,
            prompt_text TEXT,
            context TEXT,
            response TEXT,
            overall_score REAL,
            accuracy REAL,
            usefulness REAL,
            efficiency_score REAL,
            timestamp DATETIME
        )
        """)
        
        conn.commit()
        conn.close()
    
    def run_ab_test(self, config: ABTestConfig) -> ABTestResult:
        """Run a complete A/B test"""
        print(f"🧪 Starting A/B test: {config.test_name}")
        print(f"📊 Sample size: {config.sample_size} per variant")
        
        test_id = str(uuid.uuid4())
        results_a = []
        results_b = []
        
        # Generate the samples for each variant
        print("\n🔄 Generating samples...")
        
        # Use ThreadPoolExecutor to run evaluations in parallel
        with ThreadPoolExecutor(max_workers=4) as executor:
            # Submit the tasks for variant A
            futures_a = []
            for i in range(config.sample_size):
                context = random.choice(config.test_contexts)
                future = executor.submit(self._evaluate_variant, 
                                       config.prompt_a, context, 'A', test_id, config.test_name)
                futures_a.append(future)
            
            # Submit the tasks for variant B
            futures_b = []
            for i in range(config.sample_size):
                context = random.choice(config.test_contexts)
                future = executor.submit(self._evaluate_variant, 
                                       config.prompt_b, context, 'B', test_id, config.test_name)
                futures_b.append(future)
            
            # Collect the results for A
            for i, future in enumerate(as_completed(futures_a)):
                try:
                    result = future.result(timeout=60)
                    results_a.append(result)
                    if (i + 1) % 10 == 0:
                        print(f"   Variant A: {i + 1}/{config.sample_size} done")
                except Exception as e:
                    print(f"   Variant A error: {e}")
            
            # Collect the results for B
            for i, future in enumerate(as_completed(futures_b)):
                try:
                    result = future.result(timeout=60)
                    results_b.append(result)
                    if (i + 1) % 10 == 0:
                        print(f"   Variant B: {i + 1}/{config.sample_size} done")
                except Exception as e:
                    print(f"   Variant B error: {e}")
        
        print(f"✅ Collection complete: {len(results_a)} vs {len(results_b)} samples")
        
        # Analyze the results
        return self._analyze_results(config, results_a, results_b)
    
    def _evaluate_variant(self, prompt: str, context: str, variant: str, 
                         test_id: str, test_name: str) -> Dict:
        """Evaluate one variant and save the result to the database"""
        metrics = self.evaluator.evaluate_prompt_complete(prompt, context)
        
        # Save to the database
        conn = sqlite3.connect(self.tests_db)
        cursor = conn.cursor()
        
        cursor.execute("""
        INSERT INTO ab_tests (
            test_id, test_name, variant, prompt_text, context, response,
            overall_score, accuracy, usefulness, efficiency_score, timestamp
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            test_id, test_name, variant, prompt, context, metrics.response,
            metrics.overall_score(), metrics.accuracy, metrics.usefulness,
            metrics.efficiency_score(), datetime.now()
        ))
        
        conn.commit()
        conn.close()
        
        return {
            'overall_score': metrics.overall_score(),
            'accuracy': metrics.accuracy,
            'usefulness': metrics.usefulness,
            'efficiency_score': metrics.efficiency_score()
        }
    
    def _analyze_results(self, config: ABTestConfig, 
                        results_a: List[Dict], results_b: List[Dict]) -> ABTestResult:
        """Statistical analysis of the A/B results"""
        print("\n📊 Statistical analysis...")
        
        # Extract the primary metric
        values_a = [r[config.primary_metric] for r in results_a]
        values_b = [r[config.primary_metric] for r in results_b]
        
        # Descriptive statistics
        mean_a, mean_b = np.mean(values_a), np.mean(values_b)
        std_a, std_b = np.std(values_a, ddof=1), np.std(values_b, ddof=1)
        
        # Welch's t-test (does not assume equal variances)
        t_stat, p_value = stats.ttest_ind(values_a, values_b, equal_var=False)
        
        # Significance
        is_significant = p_value < config.significance_level
        
        # Confidence interval for the difference, using the
        # Welch-Satterthwaite degrees of freedom to match the test above
        diff = mean_b - mean_a
        var_a_n = std_a**2 / len(values_a)
        var_b_n = std_b**2 / len(values_b)
        se_diff = np.sqrt(var_a_n + var_b_n)
        df_denom = var_a_n**2 / (len(values_a) - 1) + var_b_n**2 / (len(values_b) - 1)
        df = ((var_a_n + var_b_n)**2 / df_denom if df_denom > 0 
              else len(values_a) + len(values_b) - 2)
        margin_error = stats.t.ppf(1 - config.significance_level/2, df) * se_diff
        ci = (diff - margin_error, diff + margin_error)
        
        # Effect size (Cohen's d)
        pooled_std = np.sqrt(((len(values_a)-1)*std_a**2 + (len(values_b)-1)*std_b**2) / 
                           (len(values_a) + len(values_b) - 2))
        cohens_d = (mean_b - mean_a) / pooled_std if pooled_std > 0 else 0
        
        # Lift percentage
        lift = ((mean_b - mean_a) / mean_a * 100) if mean_a > 0 else 0
        
        # Determine the winner
        if not is_significant:
            winner = "No significant difference"
        else:
            winner = "B" if mean_b > mean_a else "A"
        
        # Power calculation (approximation)
        effect_size_for_power = abs(cohens_d)
        power = self._calculate_power(len(values_a), effect_size_for_power, config.significance_level)
        
        return ABTestResult(
            test_name=config.test_name,
            sample_size_a=len(values_a),
            sample_size_b=len(values_b),
            mean_a=mean_a,
            mean_b=mean_b,
            std_a=std_a,
            std_b=std_b,
            p_value=p_value,
            is_significant=is_significant,
            confidence_interval=ci,
            effect_size=cohens_d,
            lift_percentage=lift,
            winner=winner,
            power=power
        )
    
    def _calculate_power(self, sample_size: int, effect_size: float, alpha: float) -> float:
        """Compute the statistical power (normal approximation)"""
        # Simplified power calculation for a two-sided, two-sample test
        from scipy.stats import norm
        
        z_alpha = norm.ppf(1 - alpha/2)
        z_beta = effect_size * np.sqrt(sample_size/2) - z_alpha
        power = norm.cdf(z_beta)
        
        return max(0, min(1, power))
    
    def visualize_results(self, result: ABTestResult, save_path: str = None):
        """Visualize the results of the A/B test"""
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
        
        # 1. Comparison of means
        variants = ['Variant A', 'Variant B']
        means = [result.mean_a, result.mean_b]
        stds = [result.std_a, result.std_b]
        
        bars = ax1.bar(variants, means, yerr=stds, capsize=10, 
                      color=['lightblue', 'lightcoral'], alpha=0.7)
        ax1.set_title(f'{result.test_name}\nPerformance Comparison')
        ax1.set_ylabel('Mean Score')
        
        # Annotate the bars with their values
        for bar, mean in zip(bars, means):
            ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                    f'{mean:.3f}', ha='center', va='bottom')
        
        # 2. Score distributions
        # Simulated from the mean and standard deviation for display only;
        # these are not the raw samples
        samples_a = np.random.normal(result.mean_a, result.std_a, 1000)
        samples_b = np.random.normal(result.mean_b, result.std_b, 1000)
        
        ax2.hist(samples_a, bins=30, alpha=0.7, label='Variant A', color='lightblue')
        ax2.hist(samples_b, bins=30, alpha=0.7, label='Variant B', color='lightcoral')
        ax2.set_title('Score Distribution')
        ax2.set_xlabel('Score')
        ax2.set_ylabel('Frequency')
        ax2.legend()
        
        # 3. Confidence interval
        diff = result.mean_b - result.mean_a
        ci_lower, ci_upper = result.confidence_interval
        
        ax3.errorbar([0], [diff], yerr=[[diff - ci_lower], [ci_upper - diff]], 
                    fmt='o', markersize=10, capsize=10, capthick=2)
        ax3.axhline(y=0, color='red', linestyle='--', alpha=0.7)
        ax3.set_title('Difference Between Variants\n(with 95% Confidence Interval)')
        ax3.set_ylabel('Difference (B - A)')
        ax3.set_xlim(-0.5, 0.5)
        ax3.set_xticks([])
        
        # 4. Statistical summary
        ax4.axis('off')
        summary_text = f"""
        📊 A/B TEST RESULTS
        
        🏆 Winner: {result.winner}
        📈 Lift: {result.lift_percentage:+.1f}%
        
        📋 Statistics:
        • p-value: {result.p_value:.4f}
        • Significant: {'✅ Yes' if result.is_significant else '❌ No'}
        • Effect size: {result.effect_size:.3f}
        • Power: {result.power:.1%}
        
        📏 Samples:
        • Variant A: {result.sample_size_a}
        • Variant B: {result.sample_size_b}
        """
        
        ax4.text(0.1, 0.9, summary_text, transform=ax4.transAxes, 
                fontsize=11, verticalalignment='top', fontfamily='monospace')
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches='tight')
        
        plt.show()
        
        return fig
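
The power approximation in `_calculate_power` can also be inverted to estimate how many samples each variant needs before a test is run. The sketch below does this with the standard library; `approx_power` restates the same normal approximation, while `required_sample_size` and its 80% power default are assumptions added for illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def approx_power(n_per_variant: int, effect_size: float, alpha: float = 0.05) -> float:
    """Normal approximation of the power of a two-sided, two-sample test."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(abs(effect_size) * math.sqrt(n_per_variant / 2) - z_alpha)


def required_sample_size(effect_size: float, alpha: float = 0.05,
                         power: float = 0.8) -> int:
    """Smallest per-variant sample size reaching the target power (same approximation)."""
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / abs(effect_size)) ** 2)


# A medium effect (Cohen's d = 0.5) at alpha = 0.05 and 80% power.
n_needed = required_sample_size(0.5)
print(n_needed, round(approx_power(n_needed, 0.5), 3))
```

Under this approximation a medium effect needs about 63 samples per variant, so a run with 20 samples per variant is likely to be underpowered unless the difference between prompts is large.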

### 🎯 Example: A/B test on summarization prompts

In [None]:
# A/B test configuration
ab_config = ABTestConfig(
    test_name="Article Summaries - Simple vs Structured Prompt",
    
    prompt_a="""Summarize this article in 3 main points.""",
    
    prompt_b="""As an experienced analyst:
    
    Analyze this article and produce a structured summary:
    
    🎯 MAIN IDEA (1 sentence)
    📋 KEY POINTS (3 items maximum)
    💡 ACTIONABLE INSIGHTS (1-2 recommendations)
    
    Stay factual and concise.""",
    
    sample_size=20,  # Reduced for the demo
    primary_metric='overall_score',
    
    test_contexts=[
        "Tech article about AI",
        "Business article about startups", 
        "Scientific article about climate",
        "Economics article about markets"
    ]
)

# Run the A/B test
ab_tester = PromptABTester(evaluator)
test_result = ab_tester.run_ab_test(ab_config)

# Display the results
print("\n" + "="*60)
print("🧪 A/B TEST RESULTS")
print("="*60)

print(f"\n📊 Test: {test_result.test_name}")
print(f"\n🏆 Winner: {test_result.winner}")
print(f"📈 Improvement: {test_result.lift_percentage:+.1f}%")

print(f"\n📋 Detailed statistics:")
print(f"  Variant A - Mean: {test_result.mean_a:.3f} (±{test_result.std_a:.3f})")
print(f"  Variant B - Mean: {test_result.mean_b:.3f} (±{test_result.std_b:.3f})")

print(f"\n🔬 Statistical tests:")
print(f"  p-value: {test_result.p_value:.4f}")
print(f"  Significant: {'✅ Yes' if test_result.is_significant else '❌ No (p > 0.05)'}")
print(f"  Effect size: {test_result.effect_size:.3f}")
print(f"  Power: {test_result.power:.1%}")

print(f"\n📏 Confidence intervals (95%):")
ci_lower, ci_upper = test_result.confidence_interval
print(f"  Difference B-A: [{ci_lower:.3f}, {ci_upper:.3f}]")

# Visualize the results
ab_tester.visualize_results(test_result)

## 3. 🤖 Automatic Prompt Optimization

### Genetic Algorithm Optimization Framework

Let's build a system that optimizes prompts automatically, using optimization techniques inspired by natural evolution.
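
To show the evolutionary loop without any API calls, here is a minimal, self-contained sketch. It uses the same ingredients as the framework in this section (tournament selection, uniform crossover, mutation, elitism), but the gene pool, the template, and especially `toy_fitness` are hypothetical stand-ins: in the real optimizer, fitness comes from LLM-based evaluation rather than from matching a known target.

```python
import random

# Hypothetical gene pool: each gene is a slot in the template.
GENE_POOL = {
    "role": ["an analyst", "a teacher", "a journalist"],
    "format": ["a bullet list", "a short paragraph", "a table"],
    "tone": ["neutral", "friendly", "formal"],
}
TEMPLATE = "You are {role}. Answer as {format} in a {tone} tone."

# Toy target standing in for "the prompt the evaluator likes best".
TARGET = {"role": "an analyst", "format": "a bullet list", "tone": "neutral"}


def toy_fitness(individual: dict) -> float:
    """Fraction of genes matching the toy target (placeholder for real evaluation)."""
    return sum(individual[k] == TARGET[k] for k in TARGET) / len(TARGET)


def random_individual(rng: random.Random) -> dict:
    return {name: rng.choice(options) for name, options in GENE_POOL.items()}


def crossover(p1: dict, p2: dict, rng: random.Random) -> dict:
    """Uniform crossover: each gene is taken from one parent at random."""
    return {name: (p1 if rng.random() < 0.5 else p2)[name] for name in GENE_POOL}


def mutate(individual: dict, rng: random.Random, rate: float = 0.3) -> dict:
    """Resample each gene with probability `rate`; returns a new individual."""
    return {name: (rng.choice(GENE_POOL[name]) if rng.random() < rate else value)
            for name, value in individual.items()}


def tournament(population: list, rng: random.Random, size: int = 3) -> dict:
    """Tournament selection: best of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=toy_fitness)


def evolve(generations: int = 15, population_size: int = 12,
           elite_size: int = 2, seed: int = 0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=toy_fitness, reverse=True)
        best_history.append(toy_fitness(population[0]))
        next_population = population[:elite_size]  # elitism: keep the best unchanged
        while len(next_population) < population_size:
            child = crossover(tournament(population, rng),
                              tournament(population, rng), rng)
            next_population.append(mutate(child, rng))
        population = next_population
    return max(population, key=toy_fitness), best_history


best, history = evolve()
print(TEMPLATE.format(**best))
print(history)
```

Because the elite individuals are carried over unchanged, the best fitness per generation never decreases, which is the main reason to combine elitism with a fairly high mutation rate.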

In [None]:
@dataclass
class PromptGene:
    """Represents a 'gene' in a prompt (a modifiable element)"""
    name: str
    type: str  # 'role', 'instruction', 'format', 'example', 'constraint'
    variations: List[str]
    current_value: str = ""
    
    def __post_init__(self):
        if not self.current_value and self.variations:
            self.current_value = random.choice(self.variations)
    
    def mutate(self) -> str:
        """Mutate the gene by picking a new variation"""
        self.current_value = random.choice(self.variations)
        return self.current_value

@dataclass
class PromptChromosome:
    """Represents a complete prompt (chromosome) with its genes"""
    genes: List[PromptGene]
    template: str
    fitness: float = 0.0
    generation: int = 0
    
    def generate_prompt(self) -> str:
        """Generate the complete prompt from the genes"""
        prompt = self.template
        
        for gene in self.genes:
            placeholder = f"{{{gene.name}}}"
            prompt = prompt.replace(placeholder, gene.current_value)
        
        return prompt
    
    def mutate(self, mutation_rate: float = 0.3):
        """Apply random mutations"""
        for gene in self.genes:
            if random.random() < mutation_rate:
                gene.mutate()
    
    def crossover(self, other: 'PromptChromosome') -> 'PromptChromosome':
        """Create an offspring by crossing over with another chromosome"""
        new_genes = []
        
        for i, (gene1, gene2) in enumerate(zip(self.genes, other.genes)):
            # Randomly pick one parent for each gene
            if random.random() < 0.5:
                new_gene = PromptGene(
                    name=gene1.name,
                    type=gene1.type,
                    variations=gene1.variations,
                    current_value=gene1.current_value
                )
            else:
                new_gene = PromptGene(
                    name=gene2.name,
                    type=gene2.type,
                    variations=gene2.variations,
                    current_value=gene2.current_value
                )
            new_genes.append(new_gene)
        
        return PromptChromosome(
            genes=new_genes,
            template=self.template,
            generation=max(self.generation, other.generation) + 1
        )

class GeneticPromptOptimizer:
    """Prompt optimizer based on a genetic algorithm"""
    
    def __init__(self, evaluator: PromptEvaluator):
        self.evaluator = evaluator
        self.population = []
        self.generation = 0
        self.best_fitness_history = []
        self.avg_fitness_history = []
    
    def create_initial_population(self, template: str, genes: List[PromptGene], 
                                population_size: int = 20) -> List[PromptChromosome]:
        """Create the initial population"""
        population = []
        
        for _ in range(population_size):
            # Create copies of the genes with random values
            chromosome_genes = []
            for gene in genes:
                new_gene = PromptGene(
                    name=gene.name,
                    type=gene.type,
                    variations=gene.variations.copy(),
                    current_value=random.choice(gene.variations)
                )
                chromosome_genes.append(new_gene)
            
            chromosome = PromptChromosome(
                genes=chromosome_genes,
                template=template,
                generation=0
            )
            population.append(chromosome)
        
        return population
    
    def evaluate_fitness(self, chromosome: PromptChromosome, 
                        test_contexts: List[str], samples_per_context: int = 3) -> float:
        """Evaluate the fitness of a chromosome"""
        prompt = chromosome.generate_prompt()
        total_score = 0
        total_samples = 0
        
        for context in test_contexts:
            for _ in range(samples_per_context):
                try:
                    metrics = self.evaluator.evaluate_prompt_complete(prompt, context)
                    total_score += metrics.overall_score()
                    total_samples += 1
                except Exception as e:
                    print(f"Evaluation error: {e}")
                    # Penalty for prompts that cause errors
                    total_score += 0.1
                    total_samples += 1
        
        fitness = total_score / total_samples if total_samples > 0 else 0
        chromosome.fitness = fitness
        return fitness
    
    def selection(self, population: List[PromptChromosome], 
                 selection_size: int) -> List[PromptChromosome]:
        """Tournament selection"""
        selected = []
        
        for _ in range(selection_size):
            # Tournament of size 3
            tournament = random.sample(population, min(3, len(population)))
            winner = max(tournament, key=lambda x: x.fitness)
            selected.append(winner)
        
        return selected
    
    def evolve_generation(self, population: List[PromptChromosome],
                         elite_size: int = 2, mutation_rate: float = 0.3) -> List[PromptChromosome]:
        """Evolve the population by one generation"""
        # Sort by fitness, best first
        population.sort(key=lambda x: x.fitness, reverse=True)
        
        # Elitism: carry the best individuals over unchanged
        new_population = population[:elite_size]
        
        # Select parents for reproduction
        parents = self.selection(population, len(population) - elite_size)
        
        # Crossover and mutation until the population is back to full size
        while len(new_population) < len(population):
            parent1 = random.choice(parents)
            parent2 = random.choice(parents)
            
            # Crossover
            child = parent1.crossover(parent2)
            
            # Mutation
            child.mutate(mutation_rate)
            
            new_population.append(child)
        
        return new_population
    
    def optimize(self, template: str, genes: List[PromptGene], 
                test_contexts: List[str], generations: int = 10,
                population_size: int = 20, mutation_rate: float = 0.3) -> Dict:
        """Run the full optimization"""
        print("🧬 Starting genetic optimization")
        print(f"📊 Parameters: {generations} generations, {population_size} individuals")
        
        # Reset the histories so that repeated calls do not mix runs
        self.best_fitness_history = []
        self.avg_fitness_history = []
        
        # Create the initial population
        print("\n🌱 Creating the initial population...")
        self.population = self.create_initial_population(template, genes, population_size)
        
        # Evolve over several generations
        for gen in range(generations):
            print(f"\n🔄 Generation {gen + 1}/{generations}")
            
            # Evaluate the fitness of each individual
            print("   📊 Evaluating fitness...")
            with ThreadPoolExecutor(max_workers=3) as executor:
                futures = []
                for chromosome in self.population:
                    future = executor.submit(self.evaluate_fitness, chromosome, 
                                           test_contexts, 2)  # 2 samples per context
                    futures.append(future)
                
                for future in as_completed(futures):
                    try:
                        # as_completed only yields finished futures, so result() returns immediately
                        future.result()
                    except Exception as e:
                        print(f"      Error: {e}")
            
            # Generation statistics
            fitnesses = [c.fitness for c in self.population]
            best_fitness = max(fitnesses)
            avg_fitness = np.mean(fitnesses)
            
            self.best_fitness_history.append(best_fitness)
            self.avg_fitness_history.append(avg_fitness)
            
            print(f"   🏆 Best fitness: {best_fitness:.3f}")
            print(f"   📈 Average fitness: {avg_fitness:.3f}")
            
            # Evolve (except after the last generation)
            if gen < generations - 1:
                print("   🧬 Evolving...")
                self.population = self.evolve_generation(self.population, 
                                                        elite_size=2, 
                                                        mutation_rate=mutation_rate)
        
        # Find the best individual in the final population
        best_chromosome = max(self.population, key=lambda x: x.fitness)
        
        return {
            'best_chromosome': best_chromosome,
            'best_prompt': best_chromosome.generate_prompt(),
            'best_fitness': best_chromosome.fitness,
            'fitness_history': {
                'best': self.best_fitness_history,
                'average': self.avg_fitness_history
            },
            'final_population': self.population
        }
    
    def visualize_evolution(self, save_path: Optional[str] = None):
        """Plot the fitness history across generations"""
        if not self.best_fitness_history:
            print("No fitness history to plot; run optimize() first.")
            return
        
        plt.figure(figsize=(12, 6))
        
        generations = range(1, len(self.best_fitness_history) + 1)
        
        plt.plot(generations, self.best_fitness_history, 
                'o-', label='Best fitness', linewidth=2, markersize=6)
        plt.plot(generations, self.avg_fitness_history, 
                's-', label='Average fitness', linewidth=2, markersize=4, alpha=0.7)
        
        plt.title('Fitness by Generation', fontsize=14, fontweight='bold')
        plt.xlabel('Generation')
        plt.ylabel('Fitness score')
        plt.legend()
        plt.grid(True, alpha=0.3)
        
        # Annotate the generation with the best score
        best_gen = np.argmax(self.best_fitness_history)
        best_score = self.best_fitness_history[best_gen]
        plt.annotate(f'Max: {best_score:.3f}', 
                    xy=(best_gen + 1, best_score), 
                    xytext=(10, 10), textcoords='offset points',
                    bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7),
                    arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches='tight')
        
        plt.show()

### 🎯 Example: Optimizing a Product Recommendation Prompt

In [None]:
# Define the template and the genes for the optimization
template = """
{role}

{task_instruction}

{context_guide}

{output_format}

{quality_constraint}
"""

# Define the genes (the variable parts of the prompt)
optimization_genes = [
    PromptGene(
        name="role",
        type="role",
        variations=[
            "You are an experienced shopping advisor.",
            "As an e-commerce expert specializing in personalized recommendations:",
            "Act as a data scientist specializing in recommender systems.",
            "You are a retail consultant with 10 years of experience."
        ]
    ),
    PromptGene(
        name="task_instruction",
        type="instruction",
        variations=[
            "Recommend 3 products suited to the given customer profile.",
            "Analyze the customer profile and suggest the 3 products that best match their needs.",
            "Based on the customer data, identify 3 products that will maximize their satisfaction.",
            "Create a personalized recommendation of 3 products by analyzing the customer's preferences."
        ]
    ),
    PromptGene(
        name="context_guide",
        type="constraint",
        variations=[
            "Consider the budget, preferences, and purchase history.",
            "Take into account: available budget, personal style, previous purchases, and stated needs.",
            "Analyze: budget constraints, personal tastes, seasonality, and current trends.",
            "Evaluate budget, preferences, usage context, and value for money."
        ]
    ),
    PromptGene(
        name="output_format",
        type="format",
        variations=[
            "Present each product with its name, price, and the reason for the recommendation.",
            "For each product: Name | Price | Why this choice | Relevance score (/10)",
            "Format: 🏷️ [Name] - [Price] \n💡 [Justification] \n⭐ [Match score] /10",
            "Structure: Product + Price + Key benefits + Recommendation level (Excellent/Good/Fair)"
        ]
    ),
    PromptGene(
        name="quality_constraint",
        type="constraint",
        variations=[
            "Make sure each recommendation is relevant and well justified.",
            "Prioritize relevance and added value for the customer.",
            "Aim for excellence: each recommendation must be a precise fit.",
            "Be precise, relevant, and focused on customer satisfaction."
        ]
    )
]

# Test contexts for the optimization
test_contexts = [
    "Male customer, 30 years old, €200 budget, likes tech and sports",
    "Female customer, 25 years old, €500 budget, passionate about fashion and travel", 
    "Customer (gender unspecified), 45 years old, €300 budget, looking for family gifts",
    "Female customer, 35 years old, €150 budget, eco-conscious and minimalist"
]

# Run the optimization
optimizer = GeneticPromptOptimizer(evaluator)

optimization_result = optimizer.optimize(
    template=template,
    genes=optimization_genes,
    test_contexts=test_contexts,
    generations=6,  # Reduced for the demo
    population_size=12,
    mutation_rate=0.4
)

# Display the results
print("\n" + "="*70)
print("🧬 GENETIC OPTIMIZATION RESULTS")
print("="*70)

print(f"\n🏆 Best fitness reached: {optimization_result['best_fitness']:.3f}")

print("\n✨ OPTIMIZED PROMPT:")
print("="*40)
print(optimization_result['best_prompt'])
print("="*40)

# Gene values of the best individual
print("\n🧬 Gene composition of the best prompt:")
for gene in optimization_result['best_chromosome'].genes:
    print(f"  {gene.name.upper()}: {gene.current_value}")

# Plot the evolution
print(f"\n📈 Fitness evolution over {len(optimizer.best_fitness_history)} generations")
optimizer.visualize_evolution()
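
For a sense of scale, the number of `evaluate_prompt_complete` calls made by the run above can be estimated from its settings. This is a back-of-the-envelope sketch: `count_evaluations` is a hypothetical helper, and it assumes that every individual is scored again in every generation with 2 samples per context, as `optimize` does.

```python
def count_evaluations(generations: int, population_size: int,
                      n_contexts: int, samples_per_context: int) -> int:
    """Number of prompt evaluations if every individual is scored in every generation."""
    return generations * population_size * n_contexts * samples_per_context


# Settings of the demo run: 6 generations, 12 individuals, 4 contexts, 2 samples each
print(count_evaluations(6, 12, 4, 2))  # 576
```

Each evaluation may itself trigger more than one API request, so the actual request count can be higher than this figure.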

## 4. 📈 Real-Time Monitoring Dashboard

### Monitoring and Analytics System

In [None]:
class PromptMonitoringDashboard:
    """Monitoring dashboard for prompt performance"""
    
    def __init__(self, evaluator: PromptEvaluator):
        self.evaluator = evaluator
    
    def generate_performance_report(self, days: int = 7) -> Dict:
        """Build a performance report covering the last N days"""
        # Fetch the evaluation history
        df = self.evaluator.get_evaluation_history(limit=1000)
        
        if df.empty:
            return {'error': 'No data available'}
        
        # Parse the timestamps
        df['timestamp'] = pd.to_datetime(df['timestamp'])
        
        # Keep the requested period, oldest first, so that trends are computed chronologically
        cutoff_date = datetime.now() - timedelta(days=days)
        df_period = df[df['timestamp'] >= cutoff_date].sort_values('timestamp')
        
        if df_period.empty:
            return {'error': f'No data in the last {days} days'}
        
        # Compute the metrics
        report = {
            'period_days': days,
            'total_evaluations': len(df_period),
            'avg_overall_score': df_period['overall_score'].mean(),
            'metrics': {
                'accuracy': {
                    'mean': df_period['accuracy'].mean(),
                    'std': df_period['accuracy'].std(),
                    'trend': self._calculate_trend(df_period, 'accuracy')
                },
                'usefulness': {
                    'mean': df_period['usefulness'].mean(),
                    'std': df_period['usefulness'].std(),
                    'trend': self._calculate_trend(df_period, 'usefulness')
                },
                'coherence': {
                    'mean': df_period['coherence'].mean(),
                    'std': df_period['coherence'].std(),
                    'trend': self._calculate_trend(df_period, 'coherence')
                }
            },
            'performance': {
                'avg_response_time': df_period['response_time'].mean(),
                'total_cost': df_period['cost_usd'].sum(),
                'avg_cost_per_request': df_period['cost_usd'].mean(),
                'cost_efficiency': df_period['overall_score'].sum() / df_period['cost_usd'].sum() if df_period['cost_usd'].sum() > 0 else 0
            },
            'quality_distribution': {
                'excellent': len(df_period[df_period['overall_score'] >= 0.8]),
                'good': len(df_period[(df_period['overall_score'] >= 0.6) & (df_period['overall_score'] < 0.8)]),
                'average': len(df_period[(df_period['overall_score'] >= 0.4) & (df_period['overall_score'] < 0.6)]),
                'poor': len(df_period[df_period['overall_score'] < 0.4])
            }
        }
        
        return report
    
    def _calculate_trend(self, df: pd.DataFrame, metric: str) -> str:
        """Classify a metric's trend by comparing the two halves of the period"""
        if len(df) < 10:
            return "insufficient_data"
        
        # Split into two halves (rows are assumed to be in chronological order)
        mid_point = len(df) // 2
        first_half = df.iloc[:mid_point][metric].mean()
        second_half = df.iloc[mid_point:][metric].mean()
        
        change = (second_half - first_half) / first_half if first_half > 0 else 0
        
        if change > 0.05:
            return "increasing"
        elif change < -0.05:
            return "decreasing"
        else:
            return "stable"
    
    def create_dashboard_visualization(self, report: Dict, save_path: Optional[str] = None):
        """Render the dashboard figure"""
        if 'error' in report:
            print(f"❌ Error: {report['error']}")
            return
        
        fig = plt.figure(figsize=(16, 12))
        
        # 1. Headline metrics (four panels across the top row)
        gs = fig.add_gridspec(3, 4, height_ratios=[1, 1, 1.2], hspace=0.3, wspace=0.3)
        
        # Overall score
        ax1 = fig.add_subplot(gs[0, 0])
        score = report['avg_overall_score']
        color = 'green' if score > 0.7 else 'orange' if score > 0.5 else 'red'
        ax1.pie([score, 1-score], labels=['Score', ''], colors=[color, 'lightgray'], 
               startangle=90, counterclock=False)
        ax1.set_title(f'Overall Score\n{score:.1%}', fontweight='bold')
        
        # Number of evaluations
        ax2 = fig.add_subplot(gs[0, 1])
        ax2.bar(['Evaluations'], [report['total_evaluations']], color='steelblue')
        ax2.set_title(f'Total Evaluations\n{report["total_evaluations"]}', fontweight='bold')
        ax2.set_ylabel('Count')
        
        # Total cost
        ax3 = fig.add_subplot(gs[0, 2])
        cost = report['performance']['total_cost']
        ax3.bar(['Cost'], [cost], color='coral')
        ax3.set_title(f'Total Cost\n${cost:.3f}', fontweight='bold')
        ax3.set_ylabel('USD')
        
        # Efficiency
        ax4 = fig.add_subplot(gs[0, 3])
        efficiency = report['performance']['cost_efficiency']
        ax4.bar(['Efficiency'], [efficiency], color='gold')
        ax4.set_title(f'Efficiency\n{efficiency:.1f}', fontweight='bold')
        ax4.set_ylabel('Score/USD')
        
        # 2. Detailed metrics (bars with trend markers)
        ax5 = fig.add_subplot(gs[1, :])
        metrics_names = list(report['metrics'].keys())
        metrics_values = [report['metrics'][m]['mean'] for m in metrics_names]
        metrics_std = [report['metrics'][m]['std'] for m in metrics_names]
        
        bars = ax5.bar(metrics_names, metrics_values, yerr=metrics_std, 
                      capsize=5, color=['skyblue', 'lightgreen', 'plum'], alpha=0.8)
        
        # Add the trend markers ('❓' when the trend is unknown, e.g. "insufficient_data")
        trend_symbols = {'increasing': '📈', 'decreasing': '📉', 'stable': '➡️'}
        for bar, metric in zip(bars, metrics_names):
            trend_symbol = trend_symbols.get(report['metrics'][metric]['trend'], '❓')
            ax5.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                    trend_symbol, ha='center', va='bottom', fontsize=16)
        
        ax5.set_title('Quality Metrics (with Trends)', fontweight='bold', fontsize=14)
        ax5.set_ylabel('Score (0-1)')
        ax5.set_ylim(0, 1)
        
        # 3. Quality distribution (pie chart)
        ax6 = fig.add_subplot(gs[2, :2])
        quality_dist = report['quality_distribution']
        labels = ['Excellent (≥80%)', 'Good (60-80%)', 'Average (40-60%)', 'Poor (<40%)']
        values = [quality_dist['excellent'], quality_dist['good'], 
                 quality_dist['average'], quality_dist['poor']]
        colors = ['green', 'lightgreen', 'orange', 'red']
        
        wedges, texts, autotexts = ax6.pie(values, labels=labels, colors=colors, 
                                          autopct='%1.1f%%', startangle=90)
        ax6.set_title('Quality Distribution', fontweight='bold', fontsize=14)
        
        # 4. Performance metrics
        ax7 = fig.add_subplot(gs[2, 2:])
        perf_metrics = ['Response Time (s)', 'Average Cost ($)', 'Efficiency']
        perf_values = [
            report['performance']['avg_response_time'],
            report['performance']['avg_cost_per_request'],
            report['performance']['cost_efficiency']
        ]
        
        # Normalize for display
        normalized_values = [
            min(perf_values[0] / 2, 1),  # Capped at 2 s
            min(perf_values[1] / 0.01, 1),  # Capped at $0.01
            min(perf_values[2] / 100, 1)  # Capped at 100
        ]
        
        bars = ax7.barh(perf_metrics, normalized_values, 
                       color=['lightcoral', 'lightsalmon', 'lightseagreen'])
        
        # Label each bar with its actual value
        for bar, value in zip(bars, perf_values):
            ax7.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height()/2,
                    f'{value:.3f}', va='center', fontweight='bold')
        
        ax7.set_title('Performance Metrics', fontweight='bold', fontsize=14)
        ax7.set_xlim(0, 1.2)
        
        # Overall title
        fig.suptitle(f'Prompt Monitoring Dashboard - last {report["period_days"]} days', 
                    fontsize=16, fontweight='bold')
        
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches='tight')
        
        plt.show()
        
        return fig
    
    def generate_alert_system(self, report: Dict) -> List[str]:
        """Generate alerts from the report metrics"""
        alerts = []
        
        if 'error' in report:
            return [f"🚫 System error: {report['error']}"]
        
        # Check the overall score
        if report['avg_overall_score'] < 0.5:
            alerts.append("🔴 CRITICAL: Overall score very low (<50%)")
        elif report['avg_overall_score'] < 0.7:
            alerts.append("🟡 WARNING: Overall score moderate (<70%)")
        
        # Check the trends
        for metric, data in report['metrics'].items():
            if data['trend'] == 'decreasing':
                alerts.append(f"📉 DOWNWARD TREND: {metric} is decreasing")
        
        # Check the costs
        avg_cost = report['performance']['avg_cost_per_request']
        if avg_cost > 0.05:
            alerts.append(f"💰 HIGH COST: ${avg_cost:.4f} per request")
        
        # Check the quality
        poor_ratio = report['quality_distribution']['poor'] / report['total_evaluations']
        if poor_ratio > 0.2:
            alerts.append(f"⚠️ QUALITY: {poor_ratio:.1%} of evaluations scored poorly")
        
        # Check the efficiency (skipped when no cost was recorded: the ratio is then 0 by convention)
        if report['performance']['total_cost'] > 0 and report['performance']['cost_efficiency'] < 10:
            alerts.append("📊 LOW EFFICIENCY: Quality/cost ratio is suboptimal")
        
        if not alerts:
            alerts.append("✅ All metrics are within normal ranges")
        
        return alerts
    
    def export_report(self, report: Dict, filename: Optional[str] = None) -> str:
        """Export the report as JSON"""
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"prompt_report_{timestamp}.json"
        
        # Add metadata
        export_data = {
            'generated_at': datetime.now().isoformat(),
            'report_type': 'prompt_monitoring',
            'data': report
        }
        
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(export_data, f, indent=2, ensure_ascii=False, default=str)
        
        return filename

### 🎯 Generating the Dashboard with Simulated Data

In [None]:
# Generate some test data for the dashboard
print("📊 Generating data for the dashboard...")

# Sample prompts to evaluate so that the history contains data
test_prompts = [
    "Summarize this article in 3 points.",
    "Translate this text into professional English.",
    "Write a sales follow-up email.",
    "Create a project checklist.",
    "Analyze the risks of this strategy."
]

# Evaluate the prompts to populate the history
for i, prompt in enumerate(test_prompts):
    try:
        context = f"Test context {i+1}"
        metrics = evaluator.evaluate_prompt_complete(prompt, context)
        print(f"  ✅ Prompt {i+1} evaluated (score: {metrics.overall_score():.3f})")
    except Exception as e:
        print(f"  ❌ Error on prompt {i+1}: {e}")

# Create the dashboard
dashboard = PromptMonitoringDashboard(evaluator)

# Generate the report
print("\n📈 Generating the performance report...")
performance_report = dashboard.generate_performance_report(days=7)

if 'error' not in performance_report:
    # Print the text report
    print("\n" + "="*60)
    print("📊 PERFORMANCE REPORT - LAST 7 DAYS")
    print("="*60)
    
    print("\n📋 Summary:")
    print(f"  • Total evaluations: {performance_report['total_evaluations']}")
    print(f"  • Average overall score: {performance_report['avg_overall_score']:.1%}")
    print(f"  • Total cost: ${performance_report['performance']['total_cost']:.4f}")
    print(f"  • Efficiency: {performance_report['performance']['cost_efficiency']:.1f}")
    
    print("\n📊 Detailed metrics:")
    trend_emoji = {'increasing': '📈', 'decreasing': '📉', 'stable': '➡️'}
    for metric, data in performance_report['metrics'].items():
        emoji = trend_emoji.get(data['trend'], '❓')
        print(f"  • {metric.capitalize()}: {data['mean']:.3f} (±{data['std']:.3f}) {emoji}")
    
    print("\n🎯 Quality distribution:")
    dist = performance_report['quality_distribution']
    total = performance_report['total_evaluations']
    print(f"  • Excellent: {dist['excellent']} ({dist['excellent']/total:.1%})")
    print(f"  • Good: {dist['good']} ({dist['good']/total:.1%})")
    print(f"  • Average: {dist['average']} ({dist['average']/total:.1%})")
    print(f"  • Poor: {dist['poor']} ({dist['poor']/total:.1%})")
    
    # Generate the alerts
    print("\n🚨 Alerts:")
    alerts = dashboard.generate_alert_system(performance_report)
    for alert in alerts:
        print(f"  {alert}")
    
    # Render the visualization
    print("\n📈 Generating the visual dashboard...")
    dashboard.create_dashboard_visualization(performance_report)
    
    # Export the report
    export_filename = dashboard.export_report(performance_report)
    print(f"\n💾 Report exported: {export_filename}")
    
else:
    print(f"❌ Error: {performance_report['error']}")
    print("ℹ️  Run a few evaluations first to generate data.")

## 5. 🎯 Summary and Best Practices

### Complete Optimization Framework

The following class combines all the techniques covered in this notebook:

In [None]:
class ComprehensivePromptOptimizer:
    """End-to-end prompt optimization framework"""
    
    def __init__(self):
        self.evaluator = PromptEvaluator()
        self.ab_tester = PromptABTester(self.evaluator)
        self.genetic_optimizer = GeneticPromptOptimizer(self.evaluator)
        self.dashboard = PromptMonitoringDashboard(self.evaluator)
    
    def comprehensive_optimization_pipeline(self, 
                                          base_prompt: str,
                                          optimization_goal: str,
                                          test_contexts: List[str]) -> Dict:
        """Run the full optimization pipeline"""
        
        print("🚀 FULL OPTIMIZATION PIPELINE")
        print("="*50)
        
        results = {
            'base_prompt': base_prompt,
            'goal': optimization_goal,
            'stages': {}
        }
        
        # Stage 1: Baseline evaluation
        print("\n📊 Stage 1: Baseline Evaluation")
        baseline_metrics = self.evaluator.evaluate_prompt_complete(base_prompt)
        results['stages']['baseline'] = {
            'metrics': baseline_metrics,
            'score': baseline_metrics.overall_score()
        }
        print(f"   Baseline score: {baseline_metrics.overall_score():.3f}")
        
        # Stage 2: A/B test of the base prompt against an LLM-generated variant
        print("\n🧪 Stage 2: A/B Testing")
        improved_prompt = self._generate_improved_variant(base_prompt, optimization_goal)
        
        ab_config = ABTestConfig(
            test_name=f"Optimization: {optimization_goal}",
            prompt_a=base_prompt,
            prompt_b=improved_prompt,
            sample_size=15,
            test_contexts=test_contexts
        )
        
        ab_result = self.ab_tester.run_ab_test(ab_config)
        results['stages']['ab_test'] = ab_result
        
        best_ab_prompt = improved_prompt if ab_result.winner == 'B' else base_prompt
        print(f"   A/B winner: {ab_result.winner} (lift: {ab_result.lift_percentage:+.1f}%)")
        
        # Stage 3: Genetic optimization
        print("\n🧬 Stage 3: Genetic Optimization")
        
        # Build a template and genes from the best A/B prompt
        genetic_template, genetic_genes = self._create_genetic_components(best_ab_prompt)
        
        genetic_result = self.genetic_optimizer.optimize(
            template=genetic_template,
            genes=genetic_genes,
            test_contexts=test_contexts,
            generations=5,
            population_size=10
        )
        
        results['stages']['genetic'] = genetic_result
        print(f"   Final genetic score: {genetic_result['best_fitness']:.3f}")
        
        # Stage 4: Final validation
        print("\n✅ Stage 4: Final Validation")
        final_prompt = genetic_result['best_prompt']
        final_metrics = self.evaluator.evaluate_prompt_complete(final_prompt)
        
        results['stages']['final'] = {
            'prompt': final_prompt,
            'metrics': final_metrics,
            'score': final_metrics.overall_score()
        }
        
        # Compute the total improvement (guarding against a zero baseline score)
        baseline_score = baseline_metrics.overall_score()
        final_score = final_metrics.overall_score()
        total_improvement = (
            (final_score - baseline_score) / baseline_score * 100
            if baseline_score > 0 else 0.0
        )
        
        results['summary'] = {
            'total_improvement': total_improvement,
            'baseline_score': baseline_score,
            'final_score': final_score,
            'final_prompt': final_prompt
        }
        
        print("\n🏆 FINAL RESULT:")
        print(f"   Total improvement: {total_improvement:+.1f}%")
        print(f"   Baseline score: {baseline_score:.3f}")
        print(f"   Final score: {final_score:.3f}")
        
        return results
    
    def _generate_improved_variant(self, base_prompt: str, goal: str) -> str:
        """Generate an improved variant of the prompt"""
        improvement_prompt = f"""
        Improve this prompt so that it better achieves the stated goal:
        
        ORIGINAL PROMPT: {base_prompt}
        
        IMPROVEMENT GOAL: {goal}
        
        Write an improved version that:
        1. Keeps the original intent
        2. Optimizes for the given goal
        3. Adds structure where needed
        4. Improves clarity
        
        Reply with the improved prompt only.
        """
        
        try:
            # Legacy interface of the openai package (versions before 1.0),
            # consistent with the `openai.api_key` setup at the top of the notebook
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": improvement_prompt}],
                temperature=0.7
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            # Fall back to a simple hand-written variant if the API call fails
            print(f"Error generating variant: {e}")
            return base_prompt + "\n\nBe precise and detailed."
    
    def _create_genetic_components(self, prompt: str) -> Tuple[str, List[PromptGene]]:
        """Build a template and genes from a prompt"""
        # Simple template with placeholders
        template = "{instruction}\n\n{enhancement}\n\n{format}"
        
        # Lowercase only the first character so that the rest of the prompt
        # (proper nouns, acronyms) keeps its casing when a prefix is added
        lowered = prompt[:1].lower() + prompt[1:]
        
        # Generic improvement genes
        genes = [
            PromptGene(
                name="instruction",
                type="instruction",
                variations=[
                    prompt,
                    f"As an expert, {lowered}",
                    f"Analyze methodically and {lowered}",
                    f"Step by step: {lowered}"
                ]
            ),
            PromptGene(
                name="enhancement",
                type="constraint",
                variations=[
                    "",
                    "Be precise and factual.",
                    "Make sure your answer is complete and well structured.",
                    "Aim for excellence and clarity."
                ]
            ),
            PromptGene(
                name="format",
                type="format",
                variations=[
                    "",
                    "Present your answer in a clear and organized way.",
                    "Structure your answer around the main points.",
                    "Use sections and examples where relevant."
                ]
            )
        ]
        
        return template, genes
    
    def generate_optimization_report(self, results: Dict) -> str:
        """Build a full optimization report"""
        report = "\n" + "="*80 + "\n"
        report += "🎯 FULL PROMPT OPTIMIZATION REPORT\n"
        report += "="*80 + "\n"
        
        report += f"\n📋 GOAL: {results['goal']}\n"
        
        report += "\n📊 PERFORMANCE SUMMARY:\n"
        report += f"  • Baseline score     : {results['summary']['baseline_score']:.3f}\n"
        report += f"  • Final score        : {results['summary']['final_score']:.3f}\n"
        report += f"  • Total improvement  : {results['summary']['total_improvement']:+.1f}%\n"
        
        report += "\n🧪 STAGE DETAILS:\n"
        
        if 'ab_test' in results['stages']:
            ab = results['stages']['ab_test']
            report += f"  📈 A/B Test - Winner: {ab.winner} (Lift: {ab.lift_percentage:+.1f}%)\n"
        
        if 'genetic' in results['stages']:
            genetic = results['stages']['genetic']
            report += f"  🧬 Genetic - Final score: {genetic['best_fitness']:.3f}\n"
        
        report += "\n✨ FINAL OPTIMIZED PROMPT:\n"
        report += "─" * 40 + "\n"
        report += results['summary']['final_prompt'] + "\n"
        report += "─" * 40 + "\n"
        
        return report

### 🎯 Example: Full Optimization Pipeline

In [None]:
# Example use of the full pipeline
optimizer = ComprehensivePromptOptimizer()

# Base prompt to optimize
base_prompt = "Explain this technical concept."

# Optimization goal
optimization_goal = "Maximize clarity and usefulness for non-experts"

# Test contexts
test_contexts = [
    "Concept: Artificial Intelligence",
    "Concept: Blockchain", 
    "Concept: Machine Learning",
    "Concept: Cloud Computing"
]

# Run the full pipeline
optimization_results = optimizer.comprehensive_optimization_pipeline(
    base_prompt=base_prompt,
    optimization_goal=optimization_goal,
    test_contexts=test_contexts
)

# Generate and print the report
final_report = optimizer.generate_optimization_report(optimization_results)
print(final_report)

# Plot the genetic evolution if available
if 'genetic' in optimization_results['stages']:
    print("\n📈 Genetic evolution:")
    optimizer.genetic_optimizer.visualize_evolution()

## 📚 Best Practices and Recommendations

### 🎯 Technique Selection Guide

| Situation | Recommended Technique | Reason |
|-----------|----------------------|--------|
| **Single critical prompt** | Simple A/B testing | Fast statistical validation |
| **Continuous optimization** | Monitoring dashboard | Performance tracking |
| **Creative exploration** | Genetic algorithm | Discovery of unexpected solutions |
| **Incremental improvement** | Metrics + A/B | Measured optimization |
| **Production deployment** | Complete pipeline | Maximum robustness |

### 💡 Implementation Tips

#### 🔧 Metrics
- **Start simple**: Use 3-5 metrics at most
- **Align with the business**: Your metrics must reflect your actual objectives
- **Avoid over-optimization**: Optimize for real-world use, not for the metrics themselves

#### 🧪 A/B Testing
- **Sample size**: At least 30 samples per variant
- **Significance**: Wait for p < 0.05 AND an effect size > 0.2
- **Varied context**: Test across several types of usage
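
The sample-size and significance rules above can be combined into a single decision function. The sketch below is one possible reading, not the notebook's own A/B tester: it assumes the effect size is Cohen's d, and it swaps the usual t-test for a permutation test so that it depends only on the standard library. The names `cohens_d`, `permutation_p_value`, and `is_significant_win` are hypothetical.

```python
import random
from statistics import fmean, variance
from typing import List


def cohens_d(a: List[float], b: List[float]) -> float:
    """Standardized mean difference (b minus a) using the pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0  # identical constant samples: no measurable effect
    return (fmean(b) - fmean(a)) / pooled_var ** 0.5


def permutation_p_value(a: List[float], b: List[float],
                        n_permutations: int = 2000, seed: int = 0) -> float:
    """Two-sided p-value for the difference in means, by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(fmean(b) - fmean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[len(a):]) - fmean(pooled[:len(a)]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)


def is_significant_win(a: List[float], b: List[float], min_samples: int = 30,
                       alpha: float = 0.05, min_effect: float = 0.2) -> bool:
    """True only if variant b beats variant a under all three rules."""
    if len(a) < min_samples or len(b) < min_samples:
        return False
    return permutation_p_value(a, b) < alpha and cohens_d(a, b) > min_effect
```

Requiring both conditions guards against two opposite mistakes: a large sample can make a negligible difference statistically significant, while a small sample can show a large effect purely by chance.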

#### 🧬 Genetic Optimization
- **Patience**: Run at least 10-20 generations
- **Diversity**: Maintain variation within the population
- **Cost vs. benefit**: An expensive technique, best reserved for critical cases
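
To make the diversity advice measurable, one option is to track a diversity score per generation. The sketch below assumes the population is a list of prompt strings and uses the mean pairwise Jaccard distance over lowercase tokens; both the metric and the function names are illustrative assumptions, not part of the notebook's genetic optimizer.

```python
from itertools import combinations
from typing import List


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the token overlap ratio: 0.0 = same tokens, 1.0 = disjoint."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty prompts are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def population_diversity(prompts: List[str]) -> float:
    """Mean pairwise Jaccard distance across the whole population."""
    pairs = list(combinations(prompts, 2))
    if not pairs:
        return 0.0  # fewer than two prompts: nothing to compare
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

A score drifting toward 0 suggests the population is collapsing onto near-duplicates, which is typically the moment to raise the mutation rate or inject fresh candidates.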

#### 📊 Monitoring
- **Real time**: Integrate monitoring into your existing workflows
- **Automatic alerts**: Configure alert thresholds
- **Trends**: Watch how values evolve rather than their absolute level
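
As a rough illustration of alerting on trends rather than absolute values, the sketch below compares the mean score of the most recent window with the window just before it. The function name `trend_alert`, the window size, and the `max_drop` threshold are all hypothetical defaults to tune for your own data.

```python
from statistics import fmean
from typing import List, Optional


def trend_alert(scores: List[float], window: int = 5,
                max_drop: float = 0.05) -> Optional[str]:
    """Return an alert message if the recent average fell by more than max_drop."""
    if len(scores) < 2 * window:
        return None  # not enough history to compare two full windows
    previous = fmean(scores[-2 * window:-window])
    recent = fmean(scores[-window:])
    drop = previous - recent
    if drop > max_drop:
        return f"Quality dropped by {drop:.3f} over the last {window} evaluations"
    return None
```

Because only the difference between windows matters, a prompt that scores consistently at 0.6 stays silent, while one that slides from 0.9 to 0.8 raises an alert.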

### ⚡ Performance Optimizations

```python
# Performance best practices
best_practices = {
    'batch_evaluation': "Group evaluations to reduce API costs",
    'caching': "Cache the results of similar evaluations",
    'parallel_processing': "Use ThreadPoolExecutor to parallelize",
    'smart_sampling': "Adapt the sample size to the importance of the test",
    'early_stopping': "Stop the optimization once convergence is reached"
}
```
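
Three of these practices (caching, parallel processing, and early stopping) can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: `expensive_evaluate` is a hypothetical stand-in for an API-backed evaluation, and the convergence rule (no meaningful gain over the last `patience` generations) is one plausible definition among several.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import List, Tuple

stats = {"calls": 0}          # counts real (uncached) evaluations
_stats_lock = threading.Lock()


def expensive_evaluate(prompt: str, context: str) -> float:
    """Hypothetical placeholder for a costly API call returning a 0-1 score."""
    with _stats_lock:
        stats["calls"] += 1
    return min(1.0, len(prompt + context) / 100)


@lru_cache(maxsize=1024)
def cached_evaluate(prompt: str, context: str) -> float:
    """Caching: repeated (prompt, context) pairs skip the costly call."""
    return expensive_evaluate(prompt, context)


def evaluate_batch(pairs: List[Tuple[str, str]], max_workers: int = 4) -> List[float]:
    """Parallel processing: evaluate pairs concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: cached_evaluate(*pair), pairs))


def has_converged(history: List[float], patience: int = 3,
                  min_delta: float = 1e-3) -> bool:
    """Early stopping: True if the last `patience` scores bring no real gain."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    best_recent = max(history[-patience:])
    return best_recent - best_before < min_delta
```

Threads suit this workload because evaluation time is dominated by waiting on network responses; for CPU-bound scoring a process pool would be the more natural choice.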

## 🚀 Hands-On Exercises

### Exercise 1: Create your own business metrics

Implement metrics specific to your domain:

In [None]:
# Example for e-commerce
class ECommercePromptMetrics(PromptMetrics):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Add your business metrics
        self.conversion_potential: float = 0.0
        self.personalization_score: float = 0.0
        self.urgency_factor: float = 0.0

    def business_score(self) -> float:
        """Business-oriented score for e-commerce (weights sum to 1.0)."""
        return (
            self.conversion_potential * 0.4 +
            self.personalization_score * 0.3 +
            self.urgency_factor * 0.3
        )

# Your implementation here
# Define metrics for your own domain

### Exercise 2: Intelligent alert system

Build an alert system that adapts to your usage patterns:

In [None]:
class AdaptiveAlertSystem:
    def __init__(self):
        self.baseline_metrics = {}
        self.alert_thresholds = {}

    def learn_baseline(self, historical_data: pd.DataFrame):
        """Learn normal patterns in order to define the thresholds."""
        # Your implementation here
        pass

    def generate_smart_alerts(self, current_metrics: Dict) -> List[str]:
        """Generate alerts adapted to the context."""
        # Your implementation here
        pass

# Implement your intelligent alert system

## 🎓 Going Further

### 📚 Advanced resources:
- [A/B Testing Guide for AI Systems](https://example.com)
- [Genetic Algorithms in NLP](https://example.com)
- [MLOps for Prompt Engineering](https://example.com)

### 🔬 Advanced techniques to explore:
1. **Bayesian Optimization** for continuous optimization
2. **Multi-Armed Bandits** for balancing exploration and exploitation
3. **Neural Architecture Search** for structure optimization
4. **Reinforcement Learning** for adaptive learning

### 🚀 Recommended integrations:
- **MLflow** for experiment tracking
- **Weights & Biases** for advanced monitoring
- **Apache Airflow** for pipeline orchestration
- **FastAPI** for building optimization APIs

---

**Next notebook**: 04_Systemes_Dynamiques.ipynb - Adaptive templates and self-improving systems