## Section 1: Environment Setup

This section establishes the SageMaker environment configuration. We import necessary libraries and configure the session with appropriate IAM roles and S3 bucket settings.

The configuration is compatible with multiple environments: SageMaker Studio, SageMaker Notebook Instances, and local development setups.

In [None]:
# ============================================================
# Universal SageMaker Configuration
# Compatible with Studio, Notebook Instances, and local setups
# ============================================================

import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Add the project root to the import path
project_root = os.path.abspath('../..')
if project_root not in sys.path:
    sys.path.append(project_root)

# SageMaker imports
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_definition import (
    FeatureDefinition,
    FeatureTypeEnum
)

# Data science imports
import pandas as pd
import numpy as np
import time
from datetime import datetime, timedelta
import json

# Configuration
try:
    from utils.sagemaker_config import get_sagemaker_config
    config = get_sagemaker_config(s3_prefix='lab5-feature-store')
    role = config['role']
    session = config['session']
    bucket = config['bucket']
    region = config['region']
except ImportError:
    print("Using fallback configuration method")
    role = get_execution_role()
    session = sagemaker.Session()
    bucket = session.default_bucket()
    region = session.boto_region_name

print("Configuration complete")
print(f"Region: {region}")
print(f"S3 Bucket: s3://{bucket}")
print(f"IAM Role: {role[:50]}...")

## Section 2: Synthetic Data Generation

In this section, we generate synthetic customer transaction features for a fraud detection use case. The dataset simulates realistic patterns found in financial transactions.

### Feature Schema

The following features will be created to represent customer transaction behavior:

| Feature Name | Data Type | Description |
|-------------|-----------|-------------|
| customer_id | String | Unique customer identifier (primary key) |
| event_time | String | Event timestamp in ISO 8601 format |
| transaction_count_30d | Float | Number of transactions in last 30 days |
| total_amount_30d | Float | Total transaction amount in last 30 days |
| avg_transaction_amount | Float | Average transaction amount |
| merchant_category_mode | String | Most frequent merchant category |
| distance_from_home_avg | Float | Average distance from home address (km) |
| is_high_risk | Integer | Binary risk indicator (0=normal, 1=high risk) |

The synthetic data generation incorporates realistic statistical distributions to mimic actual transaction patterns.

In [None]:
# ============================================================
# Synthetic Data Generation
# ============================================================

def generate_customer_features(n_customers=1000):
    """
    Generate synthetic features for customers.
    
    Args:
        n_customers: Number of customers to generate
        
    Returns:
        DataFrame of customer features
    """
    np.random.seed(42)
    
    # Generate customer IDs
    customer_ids = [f"CUST_{str(i).zfill(6)}" for i in range(1, n_customers + 1)]
    
    # Current timestamp (UTC)
    current_time = datetime.utcnow()
    
    # Generate features
    data = {
        'customer_id': customer_ids,
        'event_time': [
            (current_time - timedelta(minutes=np.random.randint(0, 1440))).strftime('%Y-%m-%dT%H:%M:%SZ')
            for _ in range(n_customers)
        ],
        'transaction_count_30d': np.random.poisson(15, n_customers).astype(float),
        'total_amount_30d': np.random.gamma(shape=2, scale=500, size=n_customers),
        'avg_transaction_amount': np.random.gamma(shape=2, scale=100, size=n_customers),
        'merchant_category_mode': np.random.choice(
            ['retail', 'grocery', 'gas', 'restaurant', 'online'], 
            n_customers
        ),
        'distance_from_home_avg': np.abs(np.random.normal(10, 15, n_customers)),
        'is_high_risk': np.random.choice([0, 1], n_customers, p=[0.95, 0.05])
    }
    
    df = pd.DataFrame(data)
    
    # Give the high-risk customers suspicious patterns (more transactions, farther from home)
    high_risk_indices = df[df['is_high_risk'] == 1].index
    df.loc[high_risk_indices, 'transaction_count_30d'] *= 2
    df.loc[high_risk_indices, 'distance_from_home_avg'] *= 3
    
    return df

# Generate the dataset
print("Generating customer feature dataset...")
customers_df = generate_customer_features(n_customers=1000)

print(f"Dataset created with shape: {customers_df.shape}")
print("\nFirst few rows:")
customers_df.head()

In [None]:
# Analyze dataset characteristics
print("Descriptive statistics:\n")
print(customers_df.describe())

print("\nRisk distribution:")
print(customers_df['is_high_risk'].value_counts())
print(f"\nHigh-risk customer percentage: {customers_df['is_high_risk'].mean()*100:.1f}%")

## Section 3: Feature Group Creation

A Feature Group in SageMaker serves as a logical collection of features that are stored together and share a common schema. The Feature Group provides both online and offline access patterns.

### Creation Process

The process involves four steps:
1. Define the feature schema with appropriate data types
2. Create the Feature Group with both online and offline storage enabled
3. Wait for the infrastructure provisioning (typically 1-2 minutes)
4. Verify the creation status
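
Steps 3 and 4 amount to polling the group's status until it leaves `Creating` and then checking that it ended in `Created`. The following is a minimal sketch, assuming a hypothetical `describe` callable that returns a dict shaped like the `DescribeFeatureGroup` response. The helper name `wait_for_feature_group`, the timeout, and the injectable `sleep` are illustrative choices, not part of the SageMaker SDK.

```python
import time


def wait_for_feature_group(describe, poll_seconds=15, timeout_seconds=600, sleep=time.sleep):
    """Poll `describe()` until the status is no longer "Creating".

    `describe` is any zero-argument callable returning a dict with a
    "FeatureGroupStatus" key (a hypothetical stand-in for the SDK call).
    """
    waited = 0
    status = describe().get("FeatureGroupStatus")
    while status == "Creating":
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still 'Creating' after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
        status = describe().get("FeatureGroupStatus")
    # Any terminal status other than "Created" is treated as a failure.
    if status != "Created":
        raise RuntimeError(f"Feature Group creation ended in status {status!r}")
    return status


# Demo with a fake describe() that reports "Creating" twice, then "Created".
_responses = iter(["Creating", "Creating", "Created"])
final_status = wait_for_feature_group(
    lambda: {"FeatureGroupStatus": next(_responses)},
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final_status)
```

In this notebook the helper could be called as `wait_for_feature_group(customer_feature_group.describe)`.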

### Storage Backends

SageMaker Feature Store provides two storage backends optimized for different access patterns:

**Online Store (DynamoDB)**
- Low-latency access (sub-10ms)
- Stores most recent feature values
- Optimized for real-time inference
- Cost based on read/write request units

**Offline Store (S3 + Glue)**
- Higher latency (batch access)
- Stores complete historical record
- Optimized for model training and analytics
- Cost based on storage volume
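
The difference between the two access patterns can be mimicked in a few lines. The sketch below is a toy in-memory model, not how SageMaker implements either store. The class `ToyFeatureStore` and its methods are hypothetical, and it only illustrates that the online side keeps the most recent record per identifier while the offline side appends every record.

```python
class ToyFeatureStore:
    """Toy model of online (latest value) vs offline (full history) storage."""

    def __init__(self, id_name="customer_id", time_name="event_time"):
        self.id_name = id_name
        self.time_name = time_name
        self.online = {}   # identifier -> latest record only
        self.offline = []  # append-only history of every record

    def put_record(self, record):
        self.offline.append(dict(record))
        key = record[self.id_name]
        current = self.online.get(key)
        # ISO 8601 UTC strings in one fixed format sort chronologically,
        # so plain string comparison is enough here.
        if current is None or record[self.time_name] >= current[self.time_name]:
            self.online[key] = dict(record)

    def get_record(self, identifier):
        return self.online.get(identifier)


store = ToyFeatureStore()
store.put_record({"customer_id": "CUST_000001", "event_time": "2024-01-01T00:00:00Z", "total_amount_30d": 100.0})
store.put_record({"customer_id": "CUST_000001", "event_time": "2024-01-02T00:00:00Z", "total_amount_30d": 250.0})
# A late-arriving, older record joins the history but does not replace the online value.
store.put_record({"customer_id": "CUST_000001", "event_time": "2023-12-31T00:00:00Z", "total_amount_30d": 80.0})

print(store.get_record("CUST_000001")["total_amount_30d"])
print(len(store.offline))
```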


In [None]:
# ============================================================
# Step 1: Define the Feature Group
# ============================================================

# Generate unique Feature Group name with timestamp
feature_group_name = f"customer-features-{int(time.time())}"

print(f"Creating Feature Group: {feature_group_name}")

# Instantiate Feature Group object
customer_feature_group = FeatureGroup(
    name=feature_group_name,
    sagemaker_session=session
)

print("Feature Group object instantiated")

In [None]:
# ============================================================
# Step 2: Load the Feature Definitions
# ============================================================

# Infer feature definitions from DataFrame schema
customer_feature_group.load_feature_definitions(data_frame=customers_df)

print("Feature definitions loaded from DataFrame")
print("\nFeature Group schema:")
for feature_def in customer_feature_group.feature_definitions:
    print(f"  - {feature_def.feature_name:30s} : {feature_def.feature_type}")

### Understanding Feature Definitions

**Feature Definitions** define the Feature Store schema:

- `FeatureTypeEnum.STRING`: Text (customer_id, category, etc.)
- `FeatureTypeEnum.INTEGRAL`: Integers (is_high_risk, count, etc.)
- `FeatureTypeEnum.FRACTIONAL`: Decimal numbers (amounts, ratios, etc.)

**Important notes:**
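- Existing features cannot be changed or removed after creation (new features can only be appended)
- A record identifier and an event-time feature are required (here `customer_id` and `event_time`)
- Types must match the ingested data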
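
To make the type mapping concrete, here is a minimal sketch of how pandas dtype names could be mapped to the three feature types. It is an illustration only, not the SDK's own inference logic: `infer_feature_type` is a hypothetical helper, and the dtype names in `column_dtypes` are assumptions about what pandas would report for the DataFrame generated earlier.

```python
def infer_feature_type(dtype_name):
    """Map a pandas dtype name to a Feature Store type name (illustrative)."""
    name = str(dtype_name).lower()
    if name.startswith(("int", "uint")):
        return "Integral"
    if name.startswith("float"):
        return "Fractional"
    # Everything else (object, string, ...) is stored as text.
    return "String"


# Assumed dtypes for the customer feature columns used in this lab.
column_dtypes = {
    "customer_id": "object",
    "event_time": "object",
    "transaction_count_30d": "float64",
    "total_amount_30d": "float64",
    "avg_transaction_amount": "float64",
    "merchant_category_mode": "object",
    "distance_from_home_avg": "float64",
    "is_high_risk": "int64",
}

schema = {name: infer_feature_type(dtype) for name, dtype in column_dtypes.items()}
for name, feature_type in schema.items():
    print(f"{name:30s} : {feature_type}")
```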


In [None]:
# ============================================================
# Step 3: Create the Feature Group (Online + Offline)
# ============================================================

print("Creating Feature Group (takes 1-2 minutes)...")
print("   - Online Store (DynamoDB): for real-time inference")
print("   - Offline Store (S3): for training and analytics")

# Create the Feature Group with both stores
customer_feature_group.create(
    s3_uri=f"s3://{bucket}/feature-store/customer-features",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True  # Enable Online Store (DynamoDB)
)

print("\nWaiting for Feature Group creation...")
print("   (You can monitor in Console: SageMaker > Feature Store)")

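# Wait until the Feature Group leaves the "Creating" state
status = customer_feature_group.describe().get("FeatureGroupStatus")
while status == "Creating":
    print(f"   Status: {status} ... (waiting 15s)")
    time.sleep(15)
    status = customer_feature_group.describe().get("FeatureGroupStatus")

# Any final status other than "Created" (e.g. "CreateFailed") is an error
if status != "Created":
    failure = customer_feature_group.describe().get("FailureReason")
    raise RuntimeError(f"Feature Group creation failed: {status} ({failure})")

print("\nFeature Group created successfully.")
print(f"   Status: {status}")
print("   Online Store: Enabled (DynamoDB)")
print("   Offline Store: Enabled (S3)")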

---

## Section 4: Data Ingestion

### Ingestion Methods

1. **`FeatureGroup.ingest()`**: Batch ingestion from a DataFrame (used here)
2. **`put_record()`**: Single-record, real-time ingestion
3. **Streaming**: Kinesis Data Streams → Feature Store

### Ingestion Flow

```
DataFrame → Feature Group → Online Store (DynamoDB)
                         └→ Offline Store (S3)
```

**Ingestion latency:**
- Online Store: available immediately (< 1s)
- Offline Store: available after 15-30 minutes (batch)
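
For single-record ingestion, the record is sent as a list of name/value pairs with every value encoded as a string. A minimal sketch of that conversion follows. `to_feature_record` is a hypothetical helper, and the commented-out call assumes the boto3 `sagemaker-featurestore-runtime` client and the `feature_group_name` variable from this notebook.

```python
def to_feature_record(row):
    """Convert a flat dict into the list-of-pairs shape used by PutRecord."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None  # missing values are simply left out
    ]


record = to_feature_record({
    "customer_id": "CUST_000001",
    "event_time": "2024-01-01T00:00:00Z",
    "transaction_count_30d": 25.0,
    "is_high_risk": 0,
    "merchant_category_mode": None,
})
print(record)

# Sending it (requires AWS credentials, so left commented out here):
# import boto3
# runtime = boto3.client("sagemaker-featurestore-runtime")
# runtime.put_record(FeatureGroupName=feature_group_name, Record=record)
```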


In [None]:
# ============================================================
# Ingest Data into the Feature Store
# ============================================================

print("Ingesting data into Feature Store...")
print(f"   Number of records: {len(customers_df)}")

# Ingest the data
customer_feature_group.ingest(
    data_frame=customers_df,
    max_workers=3,  # Parallelism
    wait=True       # Wait for ingestion to complete
)

print("Ingestion complete.")
print("\nData is now available in:")
print("   - Online Store (DynamoDB): Available immediately for inference")
print("   - Offline Store (S3): Available after 15-30 minutes for training")

---

## Section 5: Feature Retrieval from Online Store

### Get Record: Single-Record Retrieval

Useful for **real-time inference**:
- Latency < 10ms
- Returns the **latest version** of each feature
- Looks up by `record_identifier` (`customer_id`)


In [None]:
# ============================================================
# Retrieve a Single Record (Online Store)
# ============================================================

# Wait for Online Store to be ready
print("Waiting for Online Store to be ready (a few seconds)...")
time.sleep(10)

# Retrieve a specific customer
test_customer_id = "CUST_000001"

print(f"Retrieving features for customer: {test_customer_id}")

try:
    # Retrieve the record from Online Store
    record = customer_feature_group.get_record(
        record_identifier_value_as_string=test_customer_id
    )
    
    print("Record retrieved from Online Store:")
    print(json.dumps(record, indent=2))
    
except Exception as e:
    print(f"Warning: Error (normal if ingestion is very recent): {e}")
    print("   Retry in a few seconds...")

### Real-World Use Case: Inference with Feature Store

```python
# In a production inference Lambda (feature_group, extract_features,
# and model are placeholders defined elsewhere):

def predict(customer_id):
    # 1. Fetch features from the Feature Store (< 10ms)
    features = feature_group.get_record(
        record_identifier_value_as_string=customer_id
    )
    
    # 2. Convert to the model's input format
    feature_vector = extract_features(features)
    
    # 3. Predict with the model
    prediction = model.predict(feature_vector)
    
    return prediction
```

**Advantages:**
- No need to recompute features (already calculated)
- Consistency between training and inference
- Low latency (< 10ms)
- Up-to-date features (real-time updates)
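
The conversion step between the Feature Store and the model can be sketched as follows. This is an illustration under stated assumptions: the record is taken to be a list of `{"FeatureName", "ValueAsString"}` pairs, `extract_features` is a hypothetical helper, and `CATEGORIES` assumes the alphabetical ordering that a `LabelEncoder` fitted on the five merchant categories would produce.

```python
# Feature order expected by the model (numeric features first, category last).
FEATURE_ORDER = [
    "transaction_count_30d",
    "total_amount_30d",
    "avg_transaction_amount",
    "distance_from_home_avg",
]
# Assumed encoding: alphabetical order of the five merchant categories.
CATEGORIES = ["gas", "grocery", "online", "restaurant", "retail"]


def extract_features(record):
    """Turn a list of FeatureName/ValueAsString pairs into a numeric vector."""
    values = {item["FeatureName"]: item["ValueAsString"] for item in record}
    numeric = [float(values[name]) for name in FEATURE_ORDER]
    category = values.get("merchant_category_mode")
    # Unknown or missing categories fall back to code 0.
    code = CATEGORIES.index(category) if category in CATEGORIES else 0
    return numeric + [float(code)]


sample_record = [
    {"FeatureName": "customer_id", "ValueAsString": "CUST_000001"},
    {"FeatureName": "transaction_count_30d", "ValueAsString": "25.0"},
    {"FeatureName": "total_amount_30d", "ValueAsString": "2500.0"},
    {"FeatureName": "avg_transaction_amount", "ValueAsString": "100.0"},
    {"FeatureName": "distance_from_home_avg", "ValueAsString": "50.0"},
    {"FeatureName": "merchant_category_mode", "ValueAsString": "online"},
]
feature_vector = extract_features(sample_record)
print(feature_vector)
```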


---

## Section 6: Batch Retrieval from Offline Store

### Offline Store: Athena Queries

The Offline Store uses **AWS Glue** and **Amazon Athena** for SQL queries.

**Use cases:**
- Model training (complete datasets)
- Historical analysis
- Time-travel queries (point-in-time correctness)
- Exploratory feature engineering


In [None]:
# ============================================================
# Prepare an Athena Query for the Offline Store
# ============================================================

# Get the Glue table name
feature_store_table = customer_feature_group.describe().get("OfflineStoreConfig", {}).get("DataCatalogConfig", {}).get("TableName")

print(f"Offline Store table (AWS Glue): {feature_store_table}")
print(f"\nThe Offline Store is available via Athena:")
print(f"   Database: sagemaker_featurestore")
print(f"   Table: {feature_store_table}")
print(f"\nNote: Data takes 15-30 minutes to appear in the Offline Store")
print(f"      (batch processing from DynamoDB to S3)")

In [None]:
# ============================================================
# Example Athena Query (after 15-30 min)
# ============================================================

# SQL query to retrieve the high-risk customers
athena_query = f"""
SELECT 
    customer_id,
    transaction_count_30d,
    total_amount_30d,
    avg_transaction_amount,
    distance_from_home_avg,
    is_high_risk
FROM 
    "{feature_store_table}"
WHERE 
    is_high_risk = 1
ORDER BY 
    total_amount_30d DESC
LIMIT 10
"""

print("Example Athena query for Offline Store:")
print("=" * 60)
print(athena_query)
print("=" * 60)

print("\nTo execute this query (after 15-30 minutes):")
print("""
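# Option 1: Via the SageMaker SDK
query = customer_feature_group.athena_query()
query.run(
    query_string=athena_query,
    output_location=f's3://{bucket}/athena-results/'
)
query.wait()                          # run() only starts the query
query_results = query.as_dataframe()  # load the results as a DataFrame

# Option 2: Via the AWS Console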
# SageMaker → Feature Store → Offline Store → Query with Athena
""")

### Time-Travel Queries

The Feature Store supports **point-in-time queries** to avoid data leakage:

```sql
-- Retrieve the features as they were on 2024-01-01
SELECT *
FROM "customer_features_table"
WHERE event_time <= '2024-01-01T00:00:00Z'
```

**Why does this matter?**
- Avoid **data leakage** (training on features from the future)
- Reproduce the exact training context
- Debug model issues in production
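
A filter on `event_time` alone returns every historical row up to the cutoff. For a training set, one usually wants only the latest row per customer as of that date. The sketch below shows that logic in plain Python on a handful of made-up rows. `latest_as_of` is a hypothetical helper, and the string comparison assumes all timestamps use the same ISO 8601 UTC format.

```python
def latest_as_of(rows, cutoff, id_name="customer_id", time_name="event_time"):
    """Return the most recent row per identifier with event_time <= cutoff."""
    latest = {}
    for row in rows:
        if row[time_name] > cutoff:
            continue  # would leak information from after the cutoff
        key = row[id_name]
        if key not in latest or row[time_name] > latest[key][time_name]:
            latest[key] = row
    return latest


# Made-up history rows, as they might appear in the Offline Store.
rows = [
    {"customer_id": "CUST_000001", "event_time": "2023-12-01T00:00:00Z", "total_amount_30d": 100.0},
    {"customer_id": "CUST_000001", "event_time": "2023-12-20T00:00:00Z", "total_amount_30d": 180.0},
    {"customer_id": "CUST_000001", "event_time": "2024-01-05T00:00:00Z", "total_amount_30d": 900.0},
    {"customer_id": "CUST_000002", "event_time": "2024-01-03T00:00:00Z", "total_amount_30d": 50.0},
]

snapshot = latest_as_of(rows, cutoff="2024-01-01T00:00:00Z")
print(snapshot)
```

Here `CUST_000002` is absent from the snapshot because its only row was recorded after the cutoff.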


---

## Section 7: Model Packaging

Now that we have our features, let's train a simple model and **package it** properly.

### What Is Model Packaging?

**Goal:** create an artifact deployable on SageMaker that contains:
1. The trained model (`.pkl`, `.pth`, etc.)
2. The inference code (`inference.py`)
3. The dependencies (`requirements.txt`)
4. Metadata and configuration

### Structure of a Model Package

```
model.tar.gz
├── model.pkl              # Serialized model
├── code/
│   ├── inference.py       # Custom inference code
│   └── requirements.txt   # Python dependencies
└── metadata.json          # (optional) Metadata
```
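
Before uploading an archive it can be worth checking that the expected files sit at the right paths. The following is a minimal sketch using only the standard library. `missing_members` and the `REQUIRED` set are illustrative names, and the demo builds a throwaway package with placeholder files rather than a real model.

```python
import os
import tarfile
import tempfile

REQUIRED = {"model.pkl", "code/inference.py"}


def missing_members(tar_path, required=REQUIRED):
    """Return the required paths that are absent from the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        # Normalize "./model.pkl" to "model.pkl" so both layouts compare equal.
        names = {
            os.path.normpath(member.name).replace(os.sep, "/")
            for member in tar.getmembers()
            if member.isfile()
        }
    return set(required) - names


# Demo: build a tiny package with placeholder files and check its layout.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = os.path.join(tmp, "model_package")
    os.makedirs(os.path.join(package_dir, "code"))
    for relative in ("model.pkl", os.path.join("code", "inference.py")):
        with open(os.path.join(package_dir, relative), "w") as handle:
            handle.write("placeholder")

    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(package_dir, arcname=".")

    demo_missing = missing_members(archive)
    demo_missing_encoder = missing_members(archive, {"model.pkl", "label_encoder.pkl"})

print(demo_missing)
print(demo_missing_encoder)
```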


In [None]:
# ============================================================
# Train a Simple Model
# ============================================================

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import LabelEncoder
import joblib

print("Training risk detection model...")

# Prepare the numeric features
X = customers_df[[
    'transaction_count_30d',
    'total_amount_30d',
    'avg_transaction_amount',
    'distance_from_home_avg'
]].values

y = customers_df['is_high_risk'].values

# Encode the categorical feature
le = LabelEncoder()
merchant_encoded = le.fit_transform(customers_df['merchant_category_mode'])
X_with_merchant = np.column_stack([X, merchant_encoded])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_with_merchant, y, test_size=0.2, random_state=42, stratify=y
)

# Train
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    class_weight='balanced'  # Important for imbalanced classes
)

model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print("\nModel performance:")
print(classification_report(y_test, y_pred, target_names=['Low Risk', 'High Risk']))
print(f"ROC AUC Score : {roc_auc_score(y_test, y_proba):.4f}")

In [None]:
# ============================================================
# Create the Packaging Directory
# ============================================================

import os
import shutil

# Create the package directory
model_dir = "model_package"
code_dir = os.path.join(model_dir, "code")

# Remove it first if it already exists
if os.path.exists(model_dir):
    shutil.rmtree(model_dir)

os.makedirs(code_dir, exist_ok=True)

print(f"Directory created: {model_dir}/")
print(f"   Structure:")
print(f"   {model_dir}/")
print(f"   ├── model.pkl")
print(f"   ├── label_encoder.pkl")
print(f"   └── code/")
print(f"       ├── inference.py")
print(f"       └── requirements.txt")

In [None]:
# ============================================================
# Save the Model and the Encoder
# ============================================================

# Save the model
model_path = os.path.join(model_dir, "model.pkl")
joblib.dump(model, model_path)
print(f"Model saved: {model_path}")

# Save the label encoder
encoder_path = os.path.join(model_dir, "label_encoder.pkl")
joblib.dump(le, encoder_path)
print(f"Encoder saved: {encoder_path}")

### Creating the Inference Script

The `inference.py` script must define these functions:

1. **`model_fn(model_dir)`**: Load the model from disk
2. **`input_fn(request_body, content_type)`**: Parse incoming requests
3. **`predict_fn(input_data, model)`**: Run the prediction
4. **`output_fn(prediction, accept_type)`**: Format the response
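
The order in which a serving container calls these four functions can be sketched as follows. This is a simplified approximation, not the container's actual code: `handle_request` is a hypothetical dispatcher, the "model" is a toy threshold rule, and real serving stacks add error handling and HTTP plumbing around these calls.

```python
import json


def model_fn(model_dir):
    # Toy "model": flag customers whose average distance exceeds a threshold.
    return {"threshold": 30.0}


def input_fn(request_body, content_type):
    if content_type != "application/json":
        raise ValueError(f"Content type {content_type} not supported")
    return json.loads(request_body)


def predict_fn(input_data, model):
    is_high_risk = input_data["distance_from_home_avg"] > model["threshold"]
    return {"is_high_risk": int(is_high_risk)}


def output_fn(prediction, accept_type):
    if accept_type != "application/json":
        raise ValueError(f"Accept type {accept_type} not supported")
    return json.dumps(prediction), accept_type


def handle_request(model, body, content_type, accept_type):
    """Chain input_fn -> predict_fn -> output_fn for one request."""
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept_type)


# model_fn runs once at startup; handle_request runs once per request.
loaded_model = model_fn("/opt/ml/model")
response_body, response_type = handle_request(
    loaded_model,
    '{"distance_from_home_avg": 50.0}',
    "application/json",
    "application/json",
)
print(response_body, response_type)
```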


In [None]:
# ============================================================
# Create inference.py
# ============================================================

inference_code = '''
"""
Custom inference script for SageMaker
"""
import os
import json
import joblib
import numpy as np

def model_fn(model_dir):
    """
    Load the model from the model directory.
    
    Called once when the endpoint starts.
    
    Args:
        model_dir: Path to the directory containing the artifacts
        
    Returns:
        dict: Dictionary containing the model and the encoder
    """
    print(f"Loading model from {model_dir}")
    
    model = joblib.load(os.path.join(model_dir, "model.pkl"))
    label_encoder = joblib.load(os.path.join(model_dir, "label_encoder.pkl"))
    
    return {
        "model": model,
        "label_encoder": label_encoder
    }

def input_fn(request_body, content_type):
    """
    Parse the input data from the request.
    
    Args:
        request_body: Request body (bytes or string)
        content_type: MIME type of the request
        
    Returns:
        dict: Parsed data
    """
    if content_type == "application/json":
        data = json.loads(request_body)
        return data
    else:
        raise ValueError(f"Content type {content_type} not supported")

def predict_fn(input_data, model_dict):
    """
    Run the prediction.
    
    Args:
        input_data: Input data (dict)
        model_dict: Dict containing model and label_encoder
        
    Returns:
        dict: Prediction result
    """
    model = model_dict["model"]
    label_encoder = model_dict["label_encoder"]
    
    # Extract the features
    transaction_count = input_data.get("transaction_count_30d", 0)
    total_amount = input_data.get("total_amount_30d", 0)
    avg_amount = input_data.get("avg_transaction_amount", 0)
    distance = input_data.get("distance_from_home_avg", 0)
    merchant_category = input_data.get("merchant_category_mode", "retail")
    
    # Encode the category
    try:
        merchant_encoded = label_encoder.transform([merchant_category])[0]
    except ValueError:
        merchant_encoded = 0  # Fallback when the category is unknown
    
    # Build the feature vector
    features = np.array([[
        transaction_count,
        total_amount,
        avg_amount,
        distance,
        merchant_encoded
    ]])
    
    # Predict
    prediction = model.predict(features)[0]
    probability = model.predict_proba(features)[0]
    
    return {
        "is_high_risk": int(prediction),
        "risk_probability": float(probability[1]),
        "confidence": float(max(probability))
    }

def output_fn(prediction, accept_type):
    """
    Format the prediction output.
    
    Args:
        prediction: Result of predict_fn
        accept_type: Expected response MIME type
        
    Returns:
        tuple: (formatted response, content_type)
    """
    if accept_type == "application/json":
        return json.dumps(prediction), accept_type
    else:
        raise ValueError(f"Accept type {accept_type} not supported")
'''

# Write the file
inference_path = os.path.join(code_dir, "inference.py")
with open(inference_path, "w") as f:
    f.write(inference_code)

print(f"Inference script created: {inference_path}")
print(f"\nThe script contains:")
print(f"   - model_fn(): Loads the model at startup")
print(f"   - input_fn(): Parses JSON requests")
print(f"   - predict_fn(): Performs prediction")
print(f"   - output_fn(): Formats the response")

In [None]:
# ============================================================
# Create requirements.txt
# ============================================================

requirements = '''scikit-learn==1.3.0
numpy==1.24.3
joblib==1.3.1
'''

requirements_path = os.path.join(code_dir, "requirements.txt")
with open(requirements_path, "w") as f:
    f.write(requirements)

print(f"Dependencies file created: {requirements_path}")
print(f"\nRequired packages:")
print(requirements)

In [None]:
# ============================================================
# Create model.tar.gz
# ============================================================

import tarfile

# Create the archive
tar_path = "model.tar.gz"

with tarfile.open(tar_path, "w:gz") as tar:
    tar.add(model_dir, arcname=".")

print(f"Package created: {tar_path}")

# Verify contents
print(f"\nContents of {tar_path}:")
with tarfile.open(tar_path, "r:gz") as tar:
    for member in tar.getmembers():
        print(f"   {member.name:40s} ({member.size} bytes)")

In [None]:
# ============================================================
# Upload to S3
# ============================================================

# Upload the model package to S3
model_s3_key = f"lab5-models/risk-detection/model.tar.gz"
model_s3_uri = f"s3://{bucket}/{model_s3_key}"

s3_client = boto3.client('s3')
s3_client.upload_file(tar_path, bucket, model_s3_key)

print(f"Model uploaded to S3:")
print(f"   {model_s3_uri}")

---

## Section 8: Model Registry

### What Is the Model Registry?

The **Model Registry** is a centralized catalog of ML models that provides:

1. **Versioning**: Manage multiple versions of a model
2. **Approval workflow**: PendingManualApproval → Approved or Rejected
3. **Metadata**: Store metrics, datasets, and parameters
4. **Lineage**: Trace the model's origin (data + code)
5. **Deployment**: Source for production deployments

### Model Registry Architecture

```
┌────────────────────────────────────────────────────────┐
│              Model Package Group                        │
│          (ex: "fraud-detection-models")                │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │  Version 1   │  │  Version 2   │  │  Version 3   │ │
│  │  (Approved)  │  │  (Pending)   │  │  (Rejected)  │ │
│  │  AUC: 0.85   │  │  AUC: 0.92   │  │  AUC: 0.80   │ │
│  └──────────────┘  └──────────────┘  └──────────────┘ │
└────────────────────────────────────────────────────────┘
          ↓                  ↓                  ↓
      Production         Staging            Archived
```


In [None]:
# ============================================================
# Create a Model Package Group
# ============================================================

from sagemaker.model import Model

# Name of the Model Package Group
model_package_group_name = "customer-risk-detection-models"

sm_client = boto3.client('sagemaker')

# Create the Model Package Group
try:
    sm_client.create_model_package_group(
        ModelPackageGroupName=model_package_group_name,
        ModelPackageGroupDescription="Customer risk detection models for fraud prevention"
    )
    print(f"Model Package Group created: {model_package_group_name}")
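except sm_client.exceptions.ClientError as e:
    # An existing group may be reported as a ValidationException rather than
    # ResourceInUse, so match on the message and re-raise anything else.
    if "already exists" not in str(e):
        raise
    print(f"Note: Model Package Group already exists: {model_package_group_name}")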

In [None]:
# ============================================================
# Register the Model in the Registry
# ============================================================

from sagemaker.sklearn import SKLearnModel
from sagemaker.model_metrics import ModelMetrics, MetricsSource

# Create the Model object
sklearn_model = SKLearnModel(
    model_data=model_s3_uri,
    role=role,
    entry_point="inference.py",
    source_dir=code_dir,
    framework_version="1.2-1",  # SageMaker scikit-learn container version
    py_version="py3"
)

print("Registering model in Model Registry...")
print(f"   Model Package Group: {model_package_group_name}")
print(f"   Model Data: {model_s3_uri}")

# Create the model metrics (optional but recommended)
# Note: metrics.json must already exist at this S3 location
model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri=f"s3://{bucket}/model-metrics/metrics.json",
        content_type="application/json"
    )
)

# Register in the Model Registry
model_package = sklearn_model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.t2.medium", "ml.m5.large"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name=model_package_group_name,
    approval_status="PendingManualApproval",  # Awaiting approval
    description="Random Forest model for customer risk detection (Lab 5)",
    model_metrics=model_metrics
)

print(f"\nModel registered in Model Registry successfully.")
print(f"   Model Package ARN: {model_package.model_package_arn}")
print(f"   Status: PendingManualApproval")
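
The registration cell points `ModelMetrics` at a `metrics.json` file that the notebook never writes. A minimal sketch of producing such a file is shown below. The key layout (`binary_classification_metrics` with `value` and `standard_deviation` entries) is an assumption based on the model-quality report format seen in SageMaker examples, `build_model_quality_report` is a hypothetical helper, and the numbers are placeholders rather than results from this lab.

```python
import json


def build_model_quality_report(auc, accuracy):
    """Build a metrics dict in an assumed model-quality report layout."""
    return {
        "binary_classification_metrics": {
            "auc": {"value": float(auc), "standard_deviation": "NaN"},
            "accuracy": {"value": float(accuracy), "standard_deviation": "NaN"},
        }
    }


# Placeholder values; in the notebook these would come from
# roc_auc_score(y_test, y_proba) and the accuracy on the test split.
report_json = json.dumps(build_model_quality_report(auc=0.5, accuracy=0.9), indent=2)
print(report_json)

# Uploading it (requires AWS credentials, so left commented out here):
# import boto3
# boto3.client("s3").put_object(
#     Bucket=bucket, Key="model-metrics/metrics.json", Body=report_json
# )
```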

### Approval Workflow

The model moves through these states:

1. **PendingManualApproval**: Awaiting review
2. **Approved**: Validated for production
3. **Rejected**: Declined (do not deploy)

**Who approves?**
- Lead ML engineer
- Product owner
- Compliance team (for sensitive cases)


In [None]:
# ============================================================
# Approve the Model
# ============================================================

print("Approving model for deployment...")

# Update the status to "Approved", reporting the ROC AUC measured on the test split
sm_client.update_model_package(
    ModelPackageArn=model_package.model_package_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription=f"Model validated by ML team. ROC AUC = {roc_auc_score(y_test, y_proba):.2f}, ready for production."
)

print(f"Model approved.")
print(f"   The model can now be deployed to production.")

In [None]:
# ============================================================
# List the Models in the Registry
# ============================================================

print(f"Models registered in group: {model_package_group_name}\n")

response = sm_client.list_model_packages(
    ModelPackageGroupName=model_package_group_name,
    SortBy="CreationTime",
    SortOrder="Descending"
)

for i, package in enumerate(response['ModelPackageSummaryList'], 1):
    print(f"{i}. Version: {package['ModelPackageVersion']}")
    print(f"   ARN: {package['ModelPackageArn']}")
    print(f"   Status: {package['ModelApprovalStatus']}")
    print(f"   Created: {package['CreationTime']}")
    print()

---

## Section 9: Deploy from Model Registry

Now that the model is **approved**, let's deploy it directly from the Registry.

### Benefits of Deploying from the Registry

1. **Traceability**: Know exactly which version is in production
2. **Governance**: Only approved models can be deployed
3. **Easy rollback**: Quickly revert to a previous version
4. **Audit**: Complete log of deployments
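
Picking the version to deploy or roll back to comes down to filtering the registry listing by approval status. The sketch below does this on a list shaped like the `ModelPackageSummaryList` returned by `list_model_packages`. `latest_approved` is a hypothetical helper, and the ARNs are made-up placeholders.

```python
def latest_approved(summaries):
    """Return the ARN of the highest approved version, or None if there is none."""
    approved = [s for s in summaries if s.get("ModelApprovalStatus") == "Approved"]
    if not approved:
        return None
    newest = max(approved, key=lambda s: s["ModelPackageVersion"])
    return newest["ModelPackageArn"]


# Made-up summaries mirroring the three versions in the architecture diagram.
summaries = [
    {"ModelPackageVersion": 1, "ModelApprovalStatus": "Approved",
     "ModelPackageArn": "arn:example:model-package/demo-group/1"},
    {"ModelPackageVersion": 2, "ModelApprovalStatus": "PendingManualApproval",
     "ModelPackageArn": "arn:example:model-package/demo-group/2"},
    {"ModelPackageVersion": 3, "ModelApprovalStatus": "Rejected",
     "ModelPackageArn": "arn:example:model-package/demo-group/3"},
]

deploy_arn = latest_approved(summaries)
print(deploy_arn)
```

With a real registry, `summaries` would be `response['ModelPackageSummaryList']` from the listing cell.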


In [None]:
# ============================================================
# Deploy the Approved Model
# ============================================================

from sagemaker.model import ModelPackage

print("Deploying model from Model Registry...")

# Create a ModelPackage object from the Registry
model_package_predictor = ModelPackage(
    role=role,
    model_package_arn=model_package.model_package_arn,
    sagemaker_session=session
)

# Deploy to an endpoint
endpoint_name = f"risk-detection-{int(time.time())}"

print(f"   Endpoint name: {endpoint_name}")
print(f"   Instance type: ml.t2.medium")
print(f"   Deployment in progress (3-5 minutes)...")

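model_package_predictor.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
    endpoint_name=endpoint_name
)

# ModelPackage.deploy() returns None unless a predictor class is configured,
# so build a Predictor explicitly, with JSON serialization for dict inputs.
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)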

print(f"\nEndpoint deployed successfully.")
print(f"   Endpoint: {endpoint_name}")
print(f"   Status: InService")

In [None]:
# ============================================================
# Test the Endpoint
# ============================================================

# Test data
test_data = {
    "transaction_count_30d": 25.0,
    "total_amount_30d": 2500.0,
    "avg_transaction_amount": 100.0,
    "distance_from_home_avg": 50.0,
    "merchant_category_mode": "online"
}

print("Testing endpoint with sample data...")
print(f"\nInput:")
print(json.dumps(test_data, indent=2))

# Predict
result = predictor.predict(test_data)

print(f"\nPrediction result:")
print(json.dumps(result, indent=2))

if result['is_high_risk'] == 1:
    print(f"\nALERT: HIGH RISK customer")
    print(f"   Risk probability: {result['risk_probability']*100:.1f}%")
else:
    print(f"\nLow risk customer")
    print(f"   Risk probability: {result['risk_probability']*100:.1f}%")

---

## Section 10: Resource Cleanup

**Important:** To avoid unnecessary costs, delete the created resources.

### Resources to Clean Up

1. **Endpoint**: $0.05-0.10/hour (billed continuously)
2. **Feature Group**: $0.025/GB/month (Online Store)
3. **Model Package**: Free (apart from S3 storage)


In [None]:
# ============================================================
# Clean Up the Endpoint
# ============================================================

print("Cleaning up endpoint...")

try:
    predictor.delete_endpoint()
    print(f"Endpoint deleted: {endpoint_name}")
except Exception as e:
    print(f"Warning: Error during deletion: {e}")

In [None]:
# ============================================================
# Clean Up the Feature Group (Optional)
# ============================================================

print("Note: Feature Group deletion is disabled by default.")
print("   WARNING: Deleting it removes the Online Store data (Offline Store files in S3 remain)")

# Uncomment to delete:
# customer_feature_group.delete()
# print(f"Feature Group deleted: {feature_group_name}")

print("\nFeature Group preserved for use in subsequent labs")
print("   To delete manually: Console > SageMaker > Feature Store")

---

## Summary and Key Learnings

### What You Accomplished

1. **Feature Store**:
   - Created a Feature Group with Online + Offline stores
   - Ingested 1000 records
   - Retrieved features for inference (<10ms)
   - Understood the architecture: Online (DynamoDB) vs Offline (S3)

2. **Model Packaging**:
   - Packaged a scikit-learn model with dependencies
   - Created a custom inference script (inference.py)
   - Generated proper model.tar.gz for SageMaker
   - Uploaded to S3

3. **Model Registry**:
   - Created a Model Package Group
   - Registered a model with metadata
   - Implemented an approval workflow
   - Deployed from the Registry

4. **Deployment**:
   - Deployed an endpoint from the Model Registry
   - Tested real-time inference
   - Understood traceability and governance

### Key Concepts

| Concept | Definition | Use Case |
|---------|------------|----------|
| **Feature Store** | Centralized catalog of features | Training + Inference consistency |
| **Online Store** | DynamoDB, <10ms latency | Real-time inference |
| **Offline Store** | S3 + Glue, batch queries | Training, analytics |
| **Model Registry** | Catalog of versioned models | Governance, approval workflow |
| **model.tar.gz** | Package deployable on SageMaker | Model + code + dependencies |
| **inference.py** | Custom inference script | Preprocessing, postprocessing |

### Complete Architecture Built

```
┌─────────────────────────────────────────────────────────┐
│                   Feature Store                         │
│  ┌──────────────┐              ┌──────────────┐        │
│  │Online Store  │              │Offline Store │        │
│  │(DynamoDB)    │              │(S3 + Glue)   │        │
│  └──────────────┘              └──────────────┘        │
└─────────────────────────────────────────────────────────┘
         ↓                                ↓
         |                                |
    Inference                         Training
         |                                |
         ↓                                ↓
┌─────────────────────────────────────────────────────────┐
│                   Model Registry                        │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │  Version 1   │  │  Version 2   │  │  Version 3   │ │
│  │  (Approved)  │  │  (Pending)   │  │  (Approved)  │ │
│  └──────────────┘  └──────────────┘  └──────────────┘ │
└─────────────────────────────────────────────────────────┘
         ↓
         |
    Deployment
         ↓
┌─────────────────────────────────────────────────────────┐
│                   SageMaker Endpoint                    │
│                   (Real-time Inference)                  │
└─────────────────────────────────────────────────────────┘
```

### Next Steps

**Lab 6**: Advanced deployment (Serverless, Async, Multi-Model Endpoints)

**Lab 7**: SageMaker Pipelines for MLOps automation

**Lab 8**: Model Monitor and deployment strategies

---

## Reflection Questions

1. **Why** use a Feature Store instead of computing features each time?
2. **What is** the difference between Online Store and Offline Store?
3. **How** does the Model Registry improve governance?
4. **Why** package the model with inference.py instead of the model alone?

---

## Additional Resources

- [Feature Store Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html)
- [Model Registry Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html)
- [Inference Code Examples](https://github.com/aws/amazon-sagemaker-examples)
- [MLOps Best Practices](https://docs.aws.amazon.com/sagemaker/latest/dg/best-practices.html)

---

## Congratulations!

You have successfully completed Lab 5! You have now worked with:
- SageMaker Feature Store (Online + Offline)
- Model Packaging and inference scripts
- Model Registry and approval workflows
- Deployment from the Registry

**Recommended break: 15 minutes**

Next, move on to **Lab 6** to explore the different endpoint types and advanced deployment strategies!
