# 🤖 ML Pipelines and Feature Stores

Objective: design reproducible data pipelines for ML, integrate feature stores (Feast/Tecton), and apply basic MLOps (versioning, retraining, monitoring).

- Duration: 120–150 min
- Difficulty: High
- Prerequisites: basic ML in Python, batch/streaming pipelines

### 🔄 **ML Pipeline Architecture: Training vs Inference**

**Classic Challenge: Research vs Production Gap**

```
Data Science Notebook (Research):
┌────────────────────────────────┐
│ Jupyter Notebook               │
│ • Pandas local (100K rows)     │
│ • Manual feature engineering   │
│ • sklearn model training       │
│ • Pickle model → disk          │
└────────────────────────────────┘
   ↓ "Works on my machine!"
   ↓ Deploy to production...
   ❌ Features different
   ❌ Data scale breaks
   ❌ Not reproducible
   ❌ Drift undetected

Production Reality:
┌────────────────────────────────┐
│ • Spark (billions of rows)     │
│ • Streaming features           │
│ • Model serving infrastructure│
│ • Monitoring & alerts          │
│ • A/B testing                  │
│ • Retraining automation        │
└────────────────────────────────┘
```

**ML Pipeline Complete Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│                    DATA SOURCES                              │
│  • Transactional DB (PostgreSQL)                            │
│  • Event Stream (Kafka)                                     │
│  • External APIs (Weather, Demographics)                    │
│  • Historical Data Lake (S3/Delta)                          │
└────────────────────┬────────────────────────────────────────┘
                     │
      ┌──────────────┴──────────────┐
      │                             │
      ▼                             ▼
┌──────────────┐            ┌──────────────┐
│   OFFLINE    │            │   ONLINE     │
│   FEATURES   │            │  FEATURES    │
│              │            │              │
│ • Batch ETL  │            │• Streaming   │
│   (Spark)    │            │  (Flink)     │
│ • Daily agg  │            │• Real-time   │
│ • Historical │            │  agg         │
│   30d, 90d   │            │• Last 24h    │
└──────┬───────┘            └──────┬───────┘
       │                           │
       └──────────┬────────────────┘
                  │
                  ▼
       ┌─────────────────┐
       │ FEATURE STORE   │
       │                 │
       │ • Feast/Tecton  │
       │ • Offline Store │
       │   (Delta Lake)  │
       │ • Online Store  │
       │   (Redis/DDB)   │
       └────────┬────────┘
                │
        ┌───────┴────────┐
        │                │
        ▼                ▼
┌──────────────┐  ┌──────────────┐
│   TRAINING   │  │   SERVING    │
│   PIPELINE   │  │   PIPELINE   │
│              │  │              │
│ • Feature    │  │ • Feature    │
│   retrieval  │  │   retrieval  │
│ • Model      │  │   (online)   │
│   training   │  │ • Model      │
│ • Validation │  │   inference  │
│ • Registry   │  │ • Monitoring │
│   (MLflow)   │  │              │
└──────┬───────┘  └──────┬───────┘
       │                 │
       ▼                 ▼
┌──────────────┐  ┌──────────────┐
│ MODEL        │  │  PREDICTIONS │
│ REGISTRY     │  │              │
│              │  │ • Online API │
│ • Versions   │  │   (<100ms)   │
│ • Metadata   │  │ • Batch job  │
│ • Stage      │  │   (daily)    │
│   (staging/  │  │ • A/B test   │
│    prod)     │  │              │
└──────────────┘  └──────────────┘
                         │
                         ▼
                  ┌──────────────┐
                  │  MONITORING  │
                  │              │
                  │ • Data drift │
                  │ • Model perf │
                  │ • Latency    │
                  │ • Alerts     │
                  └──────────────┘
```

**Training Pipeline (Offline):**

```python
from datetime import datetime, timedelta

import mlflow
import mlflow.sklearn
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    avg, col, count, datediff, lit,
    max as spark_max,  # aliased so the Python builtin max() is not shadowed
    sum as spark_sum,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Step 1: Feature Engineering (Spark Batch)
spark = SparkSession.builder.appName("ML_Training").getOrCreate()

# Historical transactions
transactions = spark.read.format("delta").load("s3://datalake/silver/transactions")

# Aggregate features per customer (last 30 days)
training_date = datetime(2025, 10, 30)
lookback_start = training_date - timedelta(days=30)

customer_features = transactions \
    .filter(
        (col("transaction_date") >= lookback_start) & 
        (col("transaction_date") < training_date)
    ) \
    .groupBy("customer_id") \
    .agg(
        count("*").alias("txn_count_30d"),
        spark_sum("amount").alias("total_spent_30d"),
        avg("amount").alias("avg_transaction_30d"),
        spark_max("transaction_date").alias("last_transaction_date")
    ) \
    .withColumn(
        "days_since_last_txn",
        datediff(lit(training_date), col("last_transaction_date"))
    )

# Join with labels (did customer churn in next 7 days?)
labels = spark.read.format("delta").load("s3://datalake/gold/churn_labels") \
    .filter(col("label_date") == training_date)

training_data = customer_features.join(labels, "customer_id")

# Convert to Pandas for sklearn (or use Spark MLlib)
df_train = training_data.toPandas()

# Step 2: Train Model
feature_cols = ["txn_count_30d", "total_spent_30d", "avg_transaction_30d", "days_since_last_txn"]
X_train = df_train[feature_cols]
y_train = df_train["churned"]

mlflow.set_experiment("customer_churn")

with mlflow.start_run(run_name=f"rf_train_{training_date.strftime('%Y%m%d')}"):
    # Log parameters
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("training_date", training_date.isoformat())
    mlflow.log_param("lookback_days", 30)
    
    # Train
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred_proba = model.predict_proba(X_train)[:, 1]
    auc = roc_auc_score(y_train, y_pred_proba)
    
    # Log metrics
    mlflow.log_metric("train_auc", auc)
    mlflow.log_metric("train_samples", len(X_train))
    
    # Log model
    mlflow.sklearn.log_model(
        model, 
        "model",
        registered_model_name="customer_churn_model"
    )
    
    # Log feature importance
    feature_importance = pd.DataFrame({
        'feature': feature_cols,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    mlflow.log_dict(feature_importance.to_dict(), "feature_importance.json")
    
    print(f"Model trained with AUC: {auc:.4f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")
```
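The run above logs only `train_auc`, which is measured on the same rows the model was fitted on and therefore tends to overstate quality. The following is a minimal sketch of a held-out evaluation. It assumes synthetic data in place of `df_train`: the column names match the pipeline, but the generated values, the churn rule, and the 80/20 split are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

feature_cols = ["txn_count_30d", "total_spent_30d", "avg_transaction_30d", "days_since_last_txn"]

# Synthetic stand-in for df_train (hypothetical data, for illustration only)
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "txn_count_30d": rng.integers(1, 40, n),
    "total_spent_30d": rng.gamma(2.0, 200.0, n),
    "days_since_last_txn": rng.integers(0, 30, n),
})
df["avg_transaction_30d"] = df["total_spent_30d"] / df["txn_count_30d"]

# Assumed churn rule: inactive customers are more likely to churn
churn_prob = 1 / (1 + np.exp(-(df["days_since_last_txn"] - 15) / 5))
df["churned"] = (rng.random(n) < churn_prob).astype(int)

# Hold out 20% of the rows; stratify to keep the churn rate similar in both parts
X_train, X_val, y_train, y_val = train_test_split(
    df[feature_cols], df["churned"],
    test_size=0.2, stratify=df["churned"], random_state=42,
)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"train_auc={train_auc:.4f}  val_auc={val_auc:.4f}")
```

In the real pipeline, both values could be logged with `mlflow.log_metric`. For churn, a split by date (train on earlier label dates, validate on later ones) is usually closer to how the model is used than a random split.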

**Inference Pipeline (Online):**

```python
import json
from typing import Dict

import mlflow.sklearn
import redis
from fastapi import FastAPI, HTTPException
from mlflow.tracking import MlflowClient
from pydantic import BaseModel

app = FastAPI(title="Churn Prediction API")

MODEL_NAME = "customer_churn_model"
MODEL_STAGE = "Production"

# Load the sklearn model from the MLflow Registry (Production stage).
# The sklearn flavor is loaded directly because the generic pyfunc wrapper's
# predict() returns class labels by default, not probabilities.
model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}/{MODEL_STAGE}")
model_version = MlflowClient().get_latest_versions(MODEL_NAME, stages=[MODEL_STAGE])[0].version

# Redis acting as the online feature store
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Feature order must match the columns used at training time
FEATURE_COLS = ["txn_count_30d", "total_spent_30d", "avg_transaction_30d", "days_since_last_txn"]

class PredictionRequest(BaseModel):
    customer_id: str

class PredictionResponse(BaseModel):
    customer_id: str
    churn_probability: float
    churn_risk: str  # low, medium, high
    features_used: Dict[str, float]

@app.post("/predict", response_model=PredictionResponse)
async def predict_churn(request: PredictionRequest):
    """
    Real-time churn prediction
    Latency target: <100ms
    """
    customer_id = request.customer_id
    
    # 1. Get features from Redis (online feature store)
    cache_key = f"features:customer:{customer_id}"
    cached_features = redis_client.get(cache_key)
    
    if not cached_features:
        raise HTTPException(404, f"Features not found for customer {customer_id}")
    
    features = json.loads(cached_features)
    
    # 2. Prepare input for model (same column order as training)
    feature_vector = [[features[name] for name in FEATURE_COLS]]
    
    # 3. Predict: probability of the positive class (churned = 1)
    churn_proba = model.predict_proba(feature_vector)[0][1]
    
    # 4. Classify risk
    if churn_proba < 0.3:
        risk = "low"
    elif churn_proba < 0.7:
        risk = "medium"
    else:
        risk = "high"
    
    return PredictionResponse(
        customer_id=customer_id,
        churn_probability=float(churn_proba),
        churn_risk=risk,
        features_used=features
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model_version": model_version}
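Step 2 of the endpoint assumes that every feature key is present in the cached payload and that the values are passed to the model in the training column order. A minimal sketch of making both assumptions explicit is shown below; the helper names `build_feature_vector` and `classify_risk` are hypothetical, while the feature names and the 0.3 / 0.7 thresholds come from the code above.

```python
from typing import Dict, List

# Must match the column order used at training time
FEATURE_ORDER: List[str] = [
    "txn_count_30d",
    "total_spent_30d",
    "avg_transaction_30d",
    "days_since_last_txn",
]


def build_feature_vector(features: Dict[str, float]) -> List[List[float]]:
    """Return a single-row matrix in training column order, failing loudly on gaps."""
    missing = [name for name in FEATURE_ORDER if name not in features]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [[float(features[name]) for name in FEATURE_ORDER]]


def classify_risk(churn_proba: float) -> str:
    """Map a churn probability to the low / medium / high buckets used by the API."""
    if churn_proba < 0.3:
        return "low"
    if churn_proba < 0.7:
        return "medium"
    return "high"


cached = {
    "days_since_last_txn": 2,
    "txn_count_30d": 15,
    "avg_transaction_30d": 83.3,
    "total_spent_30d": 1250.0,
}
print(build_feature_vector(cached))
print(classify_risk(0.42))
```

Keeping these as pure functions makes them easy to unit test without Redis or a loaded model, and a missing key can be turned into an explicit HTTP error instead of an unhandled exception.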

**Batch Inference Pipeline:**

```python
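from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, lit, struct
import mlflow.pyfunc

spark = SparkSession.builder.appName("ML_Batch_Inference").getOrCreate()

# Load model as a Spark UDF.
# Caveat: for an sklearn model logged with default settings, the pyfunc UDF calls
# predict() and returns class labels (0/1), not probabilities. To score probabilities,
# log the model so that its pyfunc predict function is predict_proba (for example via
# the pyfunc_predict_fn argument of mlflow.sklearn.log_model, where available).
model_uri = "models:/customer_churn_model/Production"
predict_udf = mlflow.pyfunc.spark_udf(
    spark, 
    model_uri=model_uri,
    result_type="double"
)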

# Get all active customers
active_customers = spark.read.format("delta").load("s3://datalake/gold/active_customers")

# Get their features
customer_features = spark.read.format("delta").load("s3://datalake/gold/customer_features_daily") \
    .filter(col("feature_date") == current_date())

# Join
customers_with_features = active_customers.join(customer_features, "customer_id")

# Predict in batch (distributed)
predictions = customers_with_features.withColumn(
    "churn_probability",
    predict_udf(struct(
        col("txn_count_30d"),
        col("total_spent_30d"),
        col("avg_transaction_30d"),
        col("days_since_last_txn")
    ))
)

# Save predictions
predictions.select(
    "customer_id",
    "churn_probability",
    lit(current_date()).alias("prediction_date")
).write \
    .format("delta") \
    .mode("append") \
    .partitionBy("prediction_date") \
    .save("s3://datalake/gold/churn_predictions")

# High-risk customers for marketing campaign
high_risk = predictions.filter(col("churn_probability") > 0.7)

high_risk.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://crm-db:5432/marketing") \
    .option("dbtable", "high_churn_risk_customers") \
    .mode("overwrite") \
    .save()

print(f"Scored {predictions.count():,} customers")
print(f"High risk: {high_risk.count():,} customers")
```

**Key Differences: Training vs Serving**

```python
differences = {
    "Data Volume": {
        "Training": "Full historical data (TB-scale, years)",
        "Serving Online": "Single customer features (KB-scale)",
        "Serving Batch": "All active customers (GB-scale, daily)"
    },
    
    "Latency": {
        "Training": "Hours/days OK (run weekly/monthly)",
        "Serving Online": "<100ms required (SLA critical)",
        "Serving Batch": "Minutes/hours OK (overnight jobs)"
    },
    
    "Features": {
        "Training": "Pre-computed offline features (historical accuracy)",
        "Serving Online": "Real-time features from online store (Redis)",
        "Serving Batch": "Daily snapshot features"
    },
    
    "Compute": {
        "Training": "Large Spark cluster (100+ cores)",
        "Serving Online": "Small API server (4-8 cores, low memory)",
        "Serving Batch": "Medium Spark cluster (20-50 cores)"
    },
    
    "Tools": {
        "Training": "Spark, Pandas, sklearn/XGBoost, MLflow",
        "Serving Online": "FastAPI, Redis, model in memory",
        "Serving Batch": "Spark with mlflow.pyfunc.spark_udf"
    },
    
    "Challenges": {
        "Training": """
            • Feature-label leakage (use point-in-time correct features)
            • Class imbalance (use SMOTE, class weights)
            • Reproducibility (seed, versions, data snapshots)
        """,
        "Serving Online": """
            • Feature freshness (stale features in cache)
            • Latency spikes (model load, garbage collection)
            • Feature skew (training vs serving mismatch)
        """,
        "Serving Batch": """
            • Scale (billions of predictions)
            • Resource contention (don't starve other jobs)
            • Failure recovery (checkpoint, retry logic)
        """
    }
}
```

**Automated Retraining with Airflow:**

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from datetime import datetime, timedelta

def check_model_performance(**context):
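    """
    Compare the Staging model against the current Production model and
    promote Staging if it is better.
    Note: train_auc is the only metric the training run logs; a held-out
    validation metric is a more reliable basis for promotion.
    """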
    import mlflow
    
    # Get current production model metrics
    client = mlflow.tracking.MlflowClient()
    prod_versions = client.get_latest_versions("customer_churn_model", stages=["Production"])
    prod_run_id = prod_versions[0].run_id
    prod_auc = mlflow.get_run(prod_run_id).data.metrics["train_auc"]
    
    # Get latest staging model
    staging_versions = client.get_latest_versions("customer_churn_model", stages=["Staging"])
    staging_run_id = staging_versions[0].run_id
    staging_auc = mlflow.get_run(staging_run_id).data.metrics["train_auc"]
    
    print(f"Production AUC: {prod_auc:.4f}")
    print(f"Staging AUC: {staging_auc:.4f}")
    
    # Promote if better by at least 1%
    if staging_auc > prod_auc * 1.01:
        client.transition_model_version_stage(
            name="customer_churn_model",
            version=staging_versions[0].version,
            stage="Production"
        )
        print("✅ New model promoted to Production!")
        return "promote"
    else:
        print("❌ New model not better, keeping current Production")
        return "keep_current"

with DAG(
    "ml_training_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@weekly",  # Sundays at 00:00 (cron "0 0 * * 0")
    catchup=False
) as dag:
    
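    # Step 1: Feature engineering (Spark on EMR)
    # Note: EmrAddStepsOperator submits the step; unless it is configured to wait,
    # pair it with an EmrStepSensor so downstream tasks start only after the step finishes.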
    feature_engineering = EmrAddStepsOperator(
        task_id="feature_engineering",
        job_flow_id="j-XXXXXXXXXXXXX",
        steps=[{
            'Name': 'Feature Engineering',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'spark-submit',
                    '--master', 'yarn',
                    's3://scripts/feature_engineering.py',
                    '--date', '{{ ds }}'
                ]
            }
        }]
    )
    
    # Step 2: Model training
    model_training = EmrAddStepsOperator(
        task_id="model_training",
        job_flow_id="j-XXXXXXXXXXXXX",
        steps=[{
            'Name': 'Model Training',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'spark-submit',
                    '--master', 'yarn',
                    's3://scripts/train_model.py',
                    '--date', '{{ ds }}'
                ]
            }
        }]
    )
    
    # Step 3: Model evaluation & promotion
    evaluate_and_promote = PythonOperator(
        task_id="evaluate_and_promote",
        python_callable=check_model_performance
    )
    
    feature_engineering >> model_training >> evaluate_and_promote
```
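The promotion rule in `check_model_performance` indexes `prod_versions[0]`, which raises an `IndexError` on the first run, when no Production version exists yet. One way to handle that is to isolate the decision in a pure function. The sketch below is an assumption about how this could be structured: the name `should_promote` and the `min_relative_gain` parameter are hypothetical, and the 1% relative threshold is taken from the DAG above.

```python
from typing import Optional


def should_promote(
    candidate_auc: float,
    production_auc: Optional[float],
    min_relative_gain: float = 0.01,
) -> bool:
    """Decide whether the Staging candidate should replace Production.

    production_auc is None when no Production model exists yet, in which
    case the candidate is promoted unconditionally.
    """
    if production_auc is None:
        return True
    return candidate_auc > production_auc * (1.0 + min_relative_gain)


print(should_promote(0.85, 0.80))    # clear improvement
print(should_promote(0.805, 0.80))   # improvement below the 1% threshold
print(should_promote(0.70, None))    # first deployment
```

Inside the Airflow task, `production_auc` would be set to `None` when `get_latest_versions(..., stages=["Production"])` returns an empty list. Separating the rule from the MLflow calls also lets the threshold be tested without a tracking server.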

---
**Author:** Luis J. Raigoso V. (LJRV)

### 🏪 **Feature Stores: Feast, Tecton and Feature Engineering at Scale**

**Problem: Training-Serving Skew**

```
Scenario without a feature store:
┌──────────────────────────────────────┐
│ TRAINING (Data Scientist)            │
│                                      │
│ # feature_engineering_training.py    │
│ df['avg_purchase_30d'] =             │
│   df.groupby('customer_id')          │
│     ['amount']                       │
│     .rolling(30)                     │
│     .mean()                          │
└──────────────────────────────────────┘
       ↓ Model trained with this feature
       ↓ Deploy to production...
┌──────────────────────────────────────┐
│ SERVING (Engineer implements in API) │
│                                      │
│ # feature_api.py                     │
│ avg_purchase = db.query(             │
│   f"SELECT AVG(amount)               │
│    FROM purchases                    │
│    WHERE customer_id={id}            │
│    AND date >= NOW() - 30"           │
│ )                                    │
└──────────────────────────────────────┘
       ❌ Subtle difference in logic!
       ❌ Training: rolling(30) averages the last 30 rows (current row included)
       ❌ Serving: SQL AVG covers the last 30 calendar days (different window and timestamp handling)
       ❌ Model performance degrades silently
```
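The mismatch in the diagram can be reproduced in a few lines. The sketch below uses hypothetical purchase data for one customer (one purchase every 10 days) and computes the "same" 30-day average in two ways: a row-based rolling window, as in the training snippet, and a calendar-based 30-day window, as in the serving SQL. `min_periods=1` is an added assumption so that the row-based window returns a value with fewer than 30 rows.

```python
import pandas as pd

# Hypothetical purchases for a single customer: one purchase every 10 days
purchases = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=12, freq="10D"),
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0,
               70.0, 80.0, 90.0, 100.0, 110.0, 120.0],
}).set_index("date")

# "Training" logic: mean over the last 30 ROWS (here: every purchase so far)
row_based = purchases["amount"].rolling(30, min_periods=1).mean()

# "Serving" logic: mean over the last 30 calendar DAYS
time_based = purchases["amount"].rolling("30D").mean()

comparison = pd.DataFrame({"row_based": row_based, "time_based": time_based})
print(comparison.tail(3))
print("Values differ on the last day:", row_based.iloc[-1] != time_based.iloc[-1])
```

On the last date the row-based window still averages all 12 purchases, while the calendar window only sees the three most recent ones, so a model trained on the first definition receives systematically different inputs at serving time.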

**Feature Store Solution:**

```
┌───────────────────────────────────────────────────────┐
│              FEATURE STORE                             │
│                                                        │
│  Single Source of Truth for Features                  │
│  ┌──────────────┐           ┌──────────────┐         │
│  │   OFFLINE    │           │   ONLINE     │         │
│  │   STORE      │           │   STORE      │         │
│  │              │           │              │         │
│  │ • Delta Lake │           │ • Redis      │         │
│  │ • Historical │           │ • DynamoDB   │         │
│  │   features   │           │ • Cassandra  │         │
│  │ • Training   │           │              │         │
│  │   retrieval  │           │ • Serving    │         │
│  │              │           │   retrieval  │         │
│  │ • Point-in-  │           │ • <10ms      │         │
│  │   time       │           │   latency    │         │
│  │   correct    │           │              │         │
│  └──────┬───────┘           └──────┬───────┘         │
│         │                          │                 │
│         │  SAME DEFINITION         │                 │
│         │  GUARANTEED CONSISTENCY  │                 │
│         │                          │                 │
└─────────┴──────────────────────────┴─────────────────┘
          │                          │
          ▼                          ▼
   ┌─────────────┐           ┌─────────────┐
   │  TRAINING   │           │   SERVING   │
   │             │           │             │
   │ get_        │           │ get_        │
   │ historical_ │           │ online_     │
   │ features()  │           │ features()  │
   └─────────────┘           └─────────────┘
```

**Feast: Open-Source Feature Store**

```yaml
# feast_repo/feature_store.yaml
project: customer_churn
registry: s3://feast-registry/registry.db
provider: aws
online_store:
    type: redis
    connection_string: "redis:6379"
offline_store:
    type: file  # Or "spark", "snowflake", "bigquery"
```

```python
# feast_repo/features.py
from datetime import timedelta

import pandas as pd
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64, String

# Define entity (primary key)
customer = Entity(
    name="customer",
    join_keys=["customer_id"],
    description="Customer entity"
)

# Offline data source (historical features)
customer_transactions_source = FileSource(
    path="s3://datalake/gold/customer_features_daily/",
    timestamp_field="feature_timestamp",
    created_timestamp_column="created_timestamp"
)

# Feature view (collection of features)
customer_transaction_features = FeatureView(
    name="customer_transaction_features",
    entities=[customer],
    ttl=timedelta(days=30),  # Features valid for 30 days
    schema=[
        Field(name="txn_count_30d", dtype=Int64),
        Field(name="total_spent_30d", dtype=Float32),
        Field(name="avg_transaction_30d", dtype=Float32),
        Field(name="days_since_last_txn", dtype=Int64),
        Field(name="favorite_category", dtype=String),
    ],
    online=True,  # Enable online serving
    source=customer_transactions_source,
    tags={"team": "data-science", "pii": "false"}
)

# On-demand features (computed at request time)
from feast import RequestSource
from feast.on_demand_feature_view import on_demand_feature_view

request_source = RequestSource(
    name="request_data",
    schema=[
        Field(name="current_cart_value", dtype=Float32)
    ]
)

# on_demand_feature_view is the decorator; OnDemandFeatureView is the class it produces
@on_demand_feature_view(
    sources=[customer_transaction_features, request_source],
    schema=[
        Field(name="cart_to_avg_ratio", dtype=Float32)
    ]
)
def cart_features(inputs: pd.DataFrame) -> pd.DataFrame:
    """
    Compute ratio of current cart to historical average
    """
    df = pd.DataFrame()
    df["cart_to_avg_ratio"] = (
        inputs["current_cart_value"] / inputs["avg_transaction_30d"]
    )
    return df
```

**Deploy Feature Definitions:**

```bash
# Apply feature definitions to registry
feast apply

# Example output (illustrative; exact messages vary by Feast version):
# Registered entity customer
# Registered feature view customer_transaction_features
# Deploying infrastructure for online store...
# ✅ Deployment successful!
```

**Training: Get Historical Features**

```python
from feast import FeatureStore
import pandas as pd
from datetime import datetime, timedelta

store = FeatureStore(repo_path="feast_repo/")

# Entity dataframe (customers and timestamps for training)
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1004],
    "event_timestamp": [
        datetime(2025, 10, 1),
        datetime(2025, 10, 1),
        datetime(2025, 10, 1),
        datetime(2025, 10, 1)
    ]
})

# Get point-in-time correct features
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_transaction_features:txn_count_30d",
        "customer_transaction_features:total_spent_30d",
        "customer_transaction_features:avg_transaction_30d",
        "customer_transaction_features:days_since_last_txn",
    ]
).to_df()

print(training_df.head())

# Output:
#   customer_id event_timestamp  txn_count_30d  total_spent_30d  avg_transaction_30d  days_since_last_txn
# 0        1001      2025-10-01             15           1250.0                 83.3                    2
# 1        1002      2025-10-01              8            430.0                 53.8                    5
# 2        1003      2025-10-01             25           3200.0                128.0                    1
# 3        1004      2025-10-01              3            120.0                 40.0                   14

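# Point-in-time correct:
# For customer 1001 with event_timestamp 2025-10-01, Feast joins the most recent
# feature row whose feature_timestamp is at or before 2025-10-01 and within the
# view's TTL (30 days), so rows timestamped after the event are never used.
# - No data leakage from the future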
```
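The point-in-time rule can be illustrated without Feast. The sketch below is a conceptual analogue using `pandas.merge_asof`, not Feast's actual implementation; the feature rows and their timestamps are hypothetical, and the 30-day tolerance mirrors the `ttl` of the feature view.

```python
import pandas as pd

# Hypothetical feature snapshots for one customer
feature_rows = pd.DataFrame({
    "customer_id": [1001, 1001, 1001],
    "feature_timestamp": pd.to_datetime(["2025-08-15", "2025-09-20", "2025-10-05"]),
    "txn_count_30d": [9, 15, 22],
})

# Training example: what did we know about customer 1001 on 2025-10-01?
entity_df = pd.DataFrame({
    "customer_id": [1001],
    "event_timestamp": pd.to_datetime(["2025-10-01"]),
})

joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_rows.sort_values("feature_timestamp"),
    left_on="event_timestamp",
    right_on="feature_timestamp",
    by="customer_id",
    direction="backward",             # only rows at or before the event
    tolerance=pd.Timedelta(days=30),  # analogue of the feature view TTL
)

print(joined[["customer_id", "event_timestamp", "feature_timestamp", "txn_count_30d"]])
```

The snapshot from 2025-10-05 is ignored because it lies after the event, and the one from 2025-08-15 loses to the more recent 2025-09-20 row. A naive "latest row per customer" join would have picked the future snapshot and leaked information into training.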

**Materialize Features to Online Store:**

```python
# Materialize features to Redis for online serving
from datetime import datetime

store.materialize(
    start_date=datetime(2025, 10, 1),
    end_date=datetime(2025, 10, 30)
)

# Example output (illustrative):
# Materializing 4 features for customer_transaction_features
# Progress: 100% |████████████████████| 10000/10000 customers
# ✅ Materialization complete!
# Online store: Redis updated with latest features
```

**Serving: Get Online Features**

```python
# Real-time feature retrieval (< 10ms)
features = store.get_online_features(
    features=[
        "customer_transaction_features:txn_count_30d",
        "customer_transaction_features:total_spent_30d",
        "customer_transaction_features:avg_transaction_30d",
        "customer_transaction_features:days_since_last_txn",
    ],
    entity_rows=[
        {"customer_id": 1001},
        {"customer_id": 1002}
    ]
).to_dict()

print(features)

# Output:
# {
#   'customer_id': [1001, 1002],
#   'txn_count_30d': [15, 8],
#   'total_spent_30d': [1250.0, 430.0],
#   'avg_transaction_30d': [83.3, 53.8],
#   'days_since_last_txn': [2, 5]
# }

# Latency: 5-10ms (Redis lookup)
```

**Feature Engineering Patterns:**

```python
# 1. AGGREGATION FEATURES (most common)
"""
Aggregate historical transactions over windows
"""
from pyspark.sql import Window
# Note: sum and max here are the Spark column functions and shadow the Python builtins
from pyspark.sql.functions import col, sum, avg, count, max, datediff, current_date

# Daily feature computation
transactions = spark.read.format("delta").load("s3://datalake/silver/transactions")

# rangeBetween needs a numeric ordering column, so order by epoch seconds
txn_epoch_seconds = col("transaction_date").cast("timestamp").cast("long")

# 7-day, 30-day, 90-day windows
for days in [7, 30, 90]:
    window_spec = Window \
        .partitionBy("customer_id") \
        .orderBy(txn_epoch_seconds) \
        .rangeBetween(-days * 86400, 0)  # seconds
    
    transactions = transactions \
        .withColumn(f"txn_count_{days}d", count("*").over(window_spec)) \
        .withColumn(f"total_spent_{days}d", sum("amount").over(window_spec)) \
        .withColumn(f"avg_transaction_{days}d", avg("amount").over(window_spec))

# 2. RECENCY FEATURES
"""
How recent was last activity
"""
customer_features = transactions \
    .groupBy("customer_id") \
    .agg(
        max("transaction_date").alias("last_transaction_date")
    ) \
    .withColumn(
        "days_since_last_txn",
        datediff(current_date(), col("last_transaction_date"))
    )

# 3. FREQUENCY FEATURES
"""
How often customer transacts
(assumes days_since_first_txn and total_txn_count were computed upstream)
"""
customer_features = customer_features \
    .withColumn(
        "avg_days_between_txn",
        col("days_since_first_txn") / col("total_txn_count")
    )

# 4. RATIO FEATURES
"""
Relative comparisons
"""
customer_features = customer_features \
    .withColumn(
        "spend_acceleration",
        col("total_spent_30d") / (col("total_spent_90d") + 1)  # +1 avoid div/0
    ) \
    .withColumn(
        "high_value_txn_ratio",
        col("high_value_txn_count") / (col("total_txn_count") + 1)
    )

# 5. CATEGORICAL ENCODING
"""
Convert categories to one-hot vectors
(learned embeddings would instead be trained inside the neural network)
"""
from pyspark.ml.feature import StringIndexer, OneHotEncoder

indexer = StringIndexer(inputCol="favorite_category", outputCol="category_idx")
encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_vec")

# 6. TEXT FEATURES
"""
NLP on customer reviews
"""
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

tokenizer = Tokenizer(inputCol="review_text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=1000)
idf = IDF(inputCol="raw_features", outputCol="tfidf_features")
```

**Tecton: Enterprise Feature Store**

```python
# Tecton advantages over Feast:
tecton_features = {
    "1. Real-time Stream Features": """
        Feast: primarily batch materialization (streaming needs push sources you operate yourself)
        Tecton: Native Flink/Spark Streaming
        
        @stream_feature_view(
            source=KafkaStreamSource(topic="transactions"),
            aggregation_interval=timedelta(minutes=1)
        )
        def user_transaction_counts(transactions):
            return f.count(transactions)
    """,
    
    "2. Feature Monitoring Built-in": """
        Tecton tracks:
        - Data quality (nulls, outliers)
        - Data drift (distribution changes)
        - Feature freshness (lag time)
        - Feature usage (which models use which features)
        
        Auto-alerts when features degrade
    """,
    
    "3. Multi-tenant & Enterprise": """
        - RBAC (role-based access control)
        - Cost tracking per team
        - SLA enforcement
        - Audit logs
        
        Feast: Open-source, self-managed
        Tecton: Managed SaaS ($$$)
    """,
    
    "4. Transformation Pushdown": """
        Tecton optimizes transformations:
        - Spark transformations compiled
        - Predicate pushdown
        - Partition pruning
        
        Up to 10x faster than Feast at large scale (workload-dependent)
    """
}

# Tecton example (similar API). Class names follow the Tecton SDK as assumed here
# and may differ between SDK versions.
from datetime import datetime, timedelta

from tecton import BatchSource, Entity, SnowflakeConfig, batch_feature_view
from tecton.types import Field, String, Float64, Int64

customer = Entity(name="customer", join_keys=["customer_id"])

transactions_batch = BatchSource(
    name="transactions",
    batch_config=SnowflakeConfig(
        database="ANALYTICS",
        schema="SILVER",
        table="TRANSACTIONS",
        timestamp_field="transaction_date"
    )
)

@batch_feature_view(
    sources=[transactions_batch],
    entities=[customer],
    mode="spark_sql",
    online=True,
    offline=True,
    feature_start_time=datetime(2020, 1, 1),
    batch_schedule=timedelta(days=1),
    ttl=timedelta(days=30)
)
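def customer_transaction_metrics(transactions):
    # Note: grouping by transaction_date yields per-day aggregates within the last 30 days;
    # a true rolling 30-day value per customer needs a window aggregation instead.
    return f"""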
        SELECT
            customer_id,
            transaction_date as timestamp,
            COUNT(*) as txn_count_30d,
            SUM(amount) as total_spent_30d,
            AVG(amount) as avg_transaction_30d
        FROM {transactions}
        WHERE transaction_date >= CURRENT_DATE - INTERVAL 30 DAYS
        GROUP BY customer_id, transaction_date
    """
```

**Best Practices:**

```python
best_practices = {
    "1. Feature Naming Convention": """
        Pattern: {entity}_{aggregation}_{window}
        
        ✅ Good:
        - customer_txn_count_30d
        - product_view_count_7d
        - session_avg_duration_1h
        
        ❌ Bad:
        - feature1
        - count
        - customer_feature
    """,
    
    "2. Feature Documentation": """
        Every feature needs:
        - Description: What it measures
        - Business logic: How it's computed
        - Owner: Who to contact
        - Dependencies: Source tables
        - SLA: Freshness, availability
        
        Store in catalog (DataHub, Amundsen)
    """,
    
    "3. Feature Versioning": """
        Version features like code:
        - customer_txn_count_30d_v1
        - customer_txn_count_30d_v2 (changed logic)
        
        Old models use v1, new models use v2
        Gradual migration, no breaking changes
    """,
    
    "4. Feature Reusability": """
        Don't duplicate features:
        - Search catalog before creating
        - Reuse across models/teams
        - Contribute back to central store
        
        Example: "customer_txn_count_30d" used by:
        - Churn model
        - Recommendation model  
        - Fraud detection model
    """,
    
    "5. Point-in-Time Correctness": """
        CRITICAL: No data leakage
        
        ❌ Wrong:
        SELECT AVG(amount) FROM transactions
        WHERE customer_id = 123
        -- Uses ALL data including future!
        
        ✅ Correct:
        SELECT AVG(amount) FROM transactions
        WHERE customer_id = 123
          AND transaction_date < :prediction_date
        -- Only past data
    """,
    
    "6. Monitoring & Alerting": """
        Track metrics:
        - Feature freshness (lag time)
        - Null rates (data quality)
        - Distribution shifts (drift)
        - Query latency (performance)
        
        Alert if:
        - Freshness > 2 hours (SLA violation)
        - Null rate > 10% (data issue)
        - P95 distribution shift > 2 std devs
    """
}
```
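Best practice 6 calls for alerts on distribution shifts but does not say how to measure them. One common option is the Population Stability Index (PSI). It is swapped in here as an illustration and is not the metric the list itself names. The function name, the synthetic data, and the 0.1 / 0.25 thresholds (conventional rules of thumb) are assumptions.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample (training) and a current sample (serving)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges from quantiles of the reference sample
    interior_edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]

    # Assign every value to one of `bins` buckets and count per bucket
    expected_counts = np.bincount(
        np.searchsorted(interior_edges, expected, side="right"), minlength=bins
    )
    actual_counts = np.bincount(
        np.searchsorted(interior_edges, actual, side="right"), minlength=bins
    )

    # Convert to proportions; clip to avoid log(0) for empty buckets
    expected_pct = np.clip(expected_counts / len(expected), eps, None)
    actual_pct = np.clip(actual_counts / len(actual), eps, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


rng = np.random.default_rng(0)
training_sample = rng.normal(0.0, 1.0, 5000)   # feature values at training time
stable_sample = rng.normal(0.0, 1.0, 5000)     # same distribution at serving time
shifted_sample = rng.normal(1.0, 1.0, 5000)    # mean shifted by one standard deviation

psi_stable = population_stability_index(training_sample, stable_sample)
psi_shifted = population_stability_index(training_sample, shifted_sample)
print(f"PSI stable:  {psi_stable:.4f}")
print(f"PSI shifted: {psi_shifted:.4f}")
```

A monitoring job could compute this per feature on a schedule and alert when the value crosses the chosen threshold, alongside the freshness and null-rate checks listed above.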

**Cost Optimization:**

```python
# Online store costs can be HIGH
cost_example = {
    "Scenario": "1M customers, 20 features, 8 bytes/feature",
    
    "Redis": {
        "Storage": "1M × 20 × 8 bytes = 160 MB",
        "Cost": "$0.023/GB/hour × 0.16 GB × 730 hours ≈ $2.69/month",
        "Latency": "1-5ms",
        "Best for": "High QPS (>1000 req/s)"
    },
    
    "DynamoDB": {
        "Storage": "1M × 20 × 8 bytes = 160 MB = $0.04/month",
        "Reads": "100K reads/day × $0.25/million = $0.75/month",
        "Cost": "$0.79/month",
        "Latency": "5-20ms",
        "Best for": "Medium QPS (100-1000 req/s)"
    },
    
    "Optimization": """
        1. TTL: Expire stale features (save storage)
        2. Lazy loading: Only materialize requested features
        3. Batch reads: Get features for multiple entities
        4. Caching: Application-level cache (reduce lookups)
    """
}
```
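The arithmetic behind the table can be reproduced with a small calculator. The sketch below uses the unit prices implied by the figures above ($0.023 per GB-hour for Redis, $0.25 per GB-month and $0.25 per million reads for DynamoDB). These are treated as assumptions rather than current cloud pricing, as are the 30-day month for read volume and the function name.

```python
def online_store_costs(
    n_entities: int = 1_000_000,
    n_features: int = 20,
    bytes_per_feature: int = 8,
    reads_per_day: int = 100_000,
    redis_price_gb_hour: float = 0.023,      # assumed, from the example above
    dynamo_price_gb_month: float = 0.25,     # assumed, implied by $0.04 for 160 MB
    dynamo_price_million_reads: float = 0.25,
    hours_per_month: int = 730,
    days_per_month: int = 30,
) -> dict:
    """Rough monthly cost of storing raw feature values (no keys or overhead)."""
    storage_gb = n_entities * n_features * bytes_per_feature / 1e9

    redis_monthly = redis_price_gb_hour * storage_gb * hours_per_month

    dynamo_storage = dynamo_price_gb_month * storage_gb
    dynamo_reads = reads_per_day * days_per_month / 1e6 * dynamo_price_million_reads

    return {
        "storage_gb": storage_gb,
        "redis_monthly": redis_monthly,
        "dynamodb_monthly": dynamo_storage + dynamo_reads,
    }


costs = online_store_costs()
for name, value in costs.items():
    print(f"{name}: {value:.4f}")
```

The estimate covers raw feature bytes only. Real deployments also pay for key names, serialization overhead, replication, and minimum instance sizes, so these numbers are best read as a lower bound.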

---

### 🔧 **MLOps: Versioning, Experiments and Model Registry**

**ML Lifecycle Challenges:**

```
Traditional Software:
Code → Build → Test → Deploy
✅ Deterministic (same input = same output)
✅ Tests catch regressions
✅ Rollback = redeploy old code

Machine Learning:
Code + Data + Hyperparameters → Train → Validate → Deploy
❌ Non-deterministic (randomness, data drift)
❌ Tests don't catch model degradation
❌ Rollback = redeploy old model + old features + old data pipeline
```
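Because an ML rollback has to restore code, data, and hyperparameters together, one option is to record a single fingerprint for that triple with every trained model. The sketch below is only an illustration of the idea: `run_fingerprint` is a hypothetical helper, and the commit hash and dataset version are made-up values.

```python
import hashlib
import json


def run_fingerprint(code_version, data_version, params):
    """Hash the three inputs that together determine a trained model."""
    payload = json.dumps(
        {"code": code_version, "data": data_version, "params": params},
        sort_keys=True,  # key order must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical commit hash and dataset version
a = run_fingerprint("3f2c1ab", "2025-10-30", {"n_estimators": 100, "max_depth": 10})
b = run_fingerprint("3f2c1ab", "2025-10-30", {"max_depth": 10, "n_estimators": 100})
c = run_fingerprint("3f2c1ab", "2025-10-31", {"n_estimators": 100, "max_depth": 10})

print(a == b)  # same inputs, different key order
print(a == c)  # different data version
```

Storing the fingerprint as a tag on the model makes it possible to tell whether two models really came from the same inputs before rolling back.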

**MLOps Stack:**

```
┌─────────────────────────────────────────────────────────┐
│                  ML LIFECYCLE                            │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  1. EXPERIMENTATION                                     │
│     • Jupyter / VS Code                                 │
│     • MLflow Tracking                                   │
│     • Weights & Biases                                  │
│     ↓                                                   │
│  2. VERSIONING                                          │
│     • Data: DVC, Delta Lake versions                    │
│     • Code: Git                                         │
│     • Models: MLflow Model Registry                     │
│     • Features: Feature Store                           │
│     ↓                                                   │
│  3. TRAINING PIPELINE                                   │
│     • Orchestration: Airflow, Kubeflow, Metaflow       │
│     • Compute: Spark, Kubernetes, SageMaker            │
│     • Hyperparameter tuning: Optuna, Ray Tune          │
│     ↓                                                   │
│  4. VALIDATION & TESTING                                │
│     • Model validation: Great Expectations              │
│     • A/B testing framework                             │
│     • Shadow mode deployment                            │
│     ↓                                                   │
│  5. MODEL REGISTRY                                      │
│     • Staging → Production promotion                    │
│     • Metadata: metrics, lineage, dependencies         │
│     • RBAC: who can promote                            │
│     ↓                                                   │
│  6. DEPLOYMENT                                          │
│     • Online: REST API (FastAPI, TorchServe)           │
│     • Batch: Spark jobs                                │
│     • Edge: TensorFlow Lite, ONNX                      │
│     ↓                                                   │
│  7. MONITORING                                          │
│     • Data drift: Evidently, WhyLabs                   │
│     • Model performance: Custom dashboards             │
│     • Infrastructure: Prometheus, Datadog              │
│     ↓                                                   │
│  8. RETRAINING (back to step 3)                        │
│     • Scheduled: weekly/monthly                        │
│     • Triggered: drift detected                        │
│                                                          │
└─────────────────────────────────────────────────────────┘
```

**MLflow: Experiment Tracking**

```python
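import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Assumes X_train, X_test, y_train, y_test (pandas objects) and feature_names
# were prepared earlier, e.g. with train_test_split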

# Set tracking server (or use local)
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("customer_churn_v2")

# Hyperparameter grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

best_auc = 0
best_run_id = None

for n_est in param_grid['n_estimators']:
    for depth in param_grid['max_depth']:
        for min_split in param_grid['min_samples_split']:
            
            # Start MLflow run
            with mlflow.start_run(run_name=f"rf_ne{n_est}_md{depth}_ms{min_split}"):
                
                # Log parameters
                mlflow.log_param("n_estimators", n_est)
                mlflow.log_param("max_depth", depth)
                mlflow.log_param("min_samples_split", min_split)
                mlflow.log_param("model_type", "RandomForest")
                mlflow.log_param("feature_version", "v2.1")
                mlflow.log_param("dataset_version", "2025-10-30")
                
                # Train model
                model = RandomForestClassifier(
                    n_estimators=n_est,
                    max_depth=depth,
                    min_samples_split=min_split,
                    random_state=42
                )
                model.fit(X_train, y_train)
                
                # Predictions
                y_pred = model.predict(X_test)
                y_pred_proba = model.predict_proba(X_test)[:, 1]
                
                # Calculate metrics
                accuracy = accuracy_score(y_test, y_pred)
                f1 = f1_score(y_test, y_pred)
                auc = roc_auc_score(y_test, y_pred_proba)
                
                # Log metrics
                mlflow.log_metric("accuracy", accuracy)
                mlflow.log_metric("f1_score", f1)
                mlflow.log_metric("roc_auc", auc)
                mlflow.log_metric("test_size", len(X_test))
                
                # Log model
                mlflow.sklearn.log_model(
                    model,
                    "model",
                    signature=mlflow.models.infer_signature(X_train, y_train)
                )
                
                # Log artifacts
                feature_importance = pd.DataFrame({
                    'feature': feature_names,
                    'importance': model.feature_importances_
                }).sort_values('importance', ascending=False)
                
                feature_importance.to_csv("feature_importance.csv", index=False)
                mlflow.log_artifact("feature_importance.csv")
                
                # Track best model
                if auc > best_auc:
                    best_auc = auc
                    best_run_id = mlflow.active_run().info.run_id
                
                print(f"Run: ne={n_est}, md={depth}, ms={min_split} → AUC={auc:.4f}")

# Tag only the overall winner (tagging inside the loop would mark every interim best)
mlflow.tracking.MlflowClient().set_tag(best_run_id, "best_model", "true")

print(f"\n🏆 Best model: {best_run_id} with AUC={best_auc:.4f}")
```
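The three nested loops above grow one level per hyperparameter. A minimal sketch of the same enumeration with `itertools.product` follows; `iter_grid` and `run_name` are hypothetical helpers, and the run-name pattern simply mirrors the one used above.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}


def iter_grid(grid):
    """Yield one dict per hyperparameter combination."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def run_name(params):
    """Build the MLflow run name used in the loop above."""
    return "rf_ne{n_estimators}_md{max_depth}_ms{min_samples_split}".format(**params)


combos = list(iter_grid(param_grid))
print(len(combos), "combinations")
print(run_name(combos[0]))
```

Each yielded dict can be passed straight to `RandomForestClassifier(**params, random_state=42)` and logged in one call with `mlflow.log_params(params)`.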

**MLflow Model Registry: Staging → Production**

```python
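import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.metrics import roc_auc_score

# Assumes best_run_id, X_train, X_val, y_val and feature_names from the tracking step.
# Note: recent MLflow releases deprecate model stages in favor of model version
# aliases; the stage-based calls below still work but emit deprecation warnings.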

client = MlflowClient()

# Register best model
model_name = "customer_churn_model"
model_uri = f"runs:/{best_run_id}/model"

model_version = mlflow.register_model(model_uri, model_name)

print(f"Model registered: {model_name} version {model_version.version}")

# Transition to Staging for validation
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging",
    archive_existing_versions=False
)

# Add description and tags
client.update_model_version(
    name=model_name,
    version=model_version.version,
    description=f"""
    Churn prediction model trained on 2025-10-30
    Features: customer_transaction_features v2.1
    Performance (test set): AUC={best_auc:.2f}
    Training data: 100K customers, 30 days lookback
    """
)

client.set_model_version_tag(
    name=model_name,
    version=model_version.version,
    key="validation_status",
    value="pending"
)

# Validation tests
print("\n🧪 Running validation tests...")

# Test 1: Model can load
loaded_model = mlflow.sklearn.load_model(f"models:/{model_name}/Staging")
print("✅ Model loads successfully")

# Test 2: Predictions on validation set
val_predictions = loaded_model.predict(X_val)
val_auc = roc_auc_score(y_val, loaded_model.predict_proba(X_val)[:, 1])
print(f"✅ Validation AUC: {val_auc:.4f}")

# Test 3: No feature drift between training and validation data
from scipy.stats import ks_2samp

drifted_features = []
for feature in feature_names:
    train_dist = X_train[feature]
    val_dist = X_val[feature]
    ks_stat, p_value = ks_2samp(train_dist, val_dist)
    
    if p_value < 0.05:
        drifted_features.append(feature)
        print(f"⚠️ Feature drift detected in {feature} (p={p_value:.4f})")
    else:
        print(f"✅ {feature}: no drift")

# Test 4: Latency test (single-row predictions)
import time
n_calls = 1000
start = time.perf_counter()
for _ in range(n_calls):
    _ = loaded_model.predict(X_val[:1])
latency = (time.perf_counter() - start) / n_calls * 1000  # seconds per call → ms
print(f"✅ Average latency: {latency:.2f}ms")

# If all tests pass (AUC, latency, no drift), promote to Production
if val_auc > 0.85 and latency < 100 and not drifted_features:
    client.transition_model_version_stage(
        name=model_name,
        version=model_version.version,
        stage="Production",
        archive_existing_versions=True  # Archive old Production
    )
    
    client.set_model_version_tag(
        name=model_name,
        version=model_version.version,
        key="validation_status",
        value="passed"
    )
    
    print(f"\n🚀 Model promoted to Production!")
else:
    print(f"\n❌ Model failed validation, staying in Staging")
```
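The validation results above feed a single promote-or-hold decision. A minimal sketch of that gate as a pure function is shown below; it also treats drifted features as a blocker, which is an assumption, and `ValidationResult` and `promotion_decision` are hypothetical names. The thresholds mirror the ones used above.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    val_auc: float
    latency_ms: float
    drifted_features: list = field(default_factory=list)


def promotion_decision(result, min_auc=0.85, max_latency_ms=100.0):
    """Return (promote, reasons); every failed check adds one reason."""
    reasons = []
    if result.val_auc <= min_auc:
        reasons.append(f"AUC {result.val_auc:.3f} is not above {min_auc}")
    if result.latency_ms >= max_latency_ms:
        reasons.append(f"latency {result.latency_ms:.1f}ms is not below {max_latency_ms}ms")
    if result.drifted_features:
        reasons.append("drift in: " + ", ".join(result.drifted_features))
    return (not reasons, reasons)


good = ValidationResult(val_auc=0.89, latency_ms=12.0)
bad = ValidationResult(val_auc=0.81, latency_ms=12.0,
                       drifted_features=["days_since_last_txn"])

print(promotion_decision(good))
print(promotion_decision(bad))
```

Keeping the gate separate from the registry calls makes it unit-testable, and the list of reasons can be stored as a model version tag for the audit trail.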

**Data Versioning with DVC:**

```python
# dvc.yaml - Define data pipeline
"""
stages:
  extract:
    cmd: python src/extract_data.py
    deps:
      - src/extract_data.py
    outs:
      - data/raw/transactions.parquet
  
  features:
    cmd: python src/generate_features.py
    deps:
      - src/generate_features.py
      - data/raw/transactions.parquet
    params:
      - features.lookback_days
      - features.aggregation_windows
    outs:
      - data/features/customer_features.parquet
  
  train:
    cmd: python src/train_model.py
    deps:
      - src/train_model.py
      - data/features/customer_features.parquet
    params:
      - model.n_estimators
      - model.max_depth
    metrics:
      - metrics.json:
          cache: false
    outs:
      - models/model.pkl
"""

# params.yaml - Hyperparameters
"""
features:
  lookback_days: 30
  aggregation_windows: [7, 30, 90]

model:
  n_estimators: 100
  max_depth: 10
  min_samples_split: 5
"""

# Track data version
# transactions.parquet is an output of the 'extract' stage, so DVC already tracks it
# through dvc.lock; `dvc add` is only for files that no stage produces.
"""
$ dvc repro
$ git add dvc.yaml dvc.lock params.yaml
$ git commit -m "Add transactions data v1"
$ git tag -a "data-v1" -m "Initial dataset"

$ dvc push  # Upload to S3/GCS
"""

# Reproduce experiment
"""
$ git checkout data-v1
$ dvc checkout
$ dvc repro --force  # Re-runs every stage; plain `dvc repro` skips up-to-date stages

Output (abridged):
  Running stage 'extract':
  Running stage 'features':
  Running stage 'train':
"""
```
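Data versioning tools of this kind identify a dataset by a hash of its contents, so Git only needs to store a small pointer file. The sketch below illustrates that idea with the standard library; it is not DVC's implementation, and `file_md5` and the sample file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "transactions.csv"

    data.write_text("customer_id,amount\n1,10.0\n")
    v1 = file_md5(data)

    # Appending one row produces a different content hash, i.e. a new data version
    data.write_text("customer_id,amount\n1,10.0\n2,5.5\n")
    v2 = file_md5(data)

print("version changed:", v1 != v2)
```

Committing the pointer (hash) alongside the code is what lets `git checkout` plus `dvc checkout` restore a matching pair of code and data.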

**Continuous Training with Kubeflow:**

```python
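# kubeflow_pipeline.py
# Targets the KFP v1 SDK (kfp<2): func_to_container_op was removed in KFP v2.
from typing import NamedTuple

import kfp
import kfp.compiler
from kfp import dsl
from kfp.components import InputPath, OutputPath, func_to_container_op

# Placeholder image name. Each step runs in its own container, so the image must
# already ship pyspark + Delta, pandas, pyarrow, scikit-learn, joblib and mlflow.
BASE_IMAGE = "myregistry/ml-training:latest"


def extract_data(data_path: OutputPath()):
    """Extract training data; KFP hands the written file to the next step"""
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    
    df = spark.read.format("delta").load("s3://datalake/silver/transactions")
    # Write a single Parquet file (Spark's own writer would create a directory)
    df.toPandas().to_parquet(data_path)


def train_model(
    data_path: InputPath(),
    model_path: OutputPath(),
) -> NamedTuple("Outputs", [("auc", float)]):
    """Train model and return AUC measured on a held-out split"""
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    
    df = pd.read_parquet(data_path)
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    # Score on held-out rows: AUC on the training rows would be inflated
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    
    joblib.dump(model, model_path)
    return (float(auc),)


def deploy_model(model_path: InputPath(), auc: float):
    """Register the model if AUC > threshold"""
    if auc > 0.85:
        import joblib
        import mlflow
        import mlflow.sklearn
        
        # Assumes MLFLOW_TRACKING_URI is set in the container environment
        model = joblib.load(model_path)
        with mlflow.start_run(run_name="kubeflow_training"):
            mlflow.log_metric("roc_auc", auc)
            mlflow.sklearn.log_model(
                model, "model", registered_model_name="churn_model"
            )
        print(f"✅ Model registered with AUC={auc:.4f}")
    else:
        print(f"❌ Model not registered, AUC={auc:.4f} <= 0.85")


extract_data_op = func_to_container_op(extract_data, base_image=BASE_IMAGE)
train_model_op = func_to_container_op(train_model, base_image=BASE_IMAGE)
deploy_model_op = func_to_container_op(deploy_model, base_image=BASE_IMAGE)


@dsl.pipeline(
    name='Customer Churn Training Pipeline',
    description='Extract, train, deploy churn model'
)
def training_pipeline():
    # Extract
    extract_task = extract_data_op()
    
    # Train (input/output names drop the "_path" suffix of the parameters)
    train_task = train_model_op(data=extract_task.outputs['data'])
    
    # Deploy
    deploy_model_op(
        model=train_task.outputs['model'],
        auc=train_task.outputs['auc']
    )


# Compile and run
if __name__ == '__main__':
    kfp.compiler.Compiler().compile(training_pipeline, 'pipeline.yaml')
    
    # Submit to Kubeflow
    client = kfp.Client(host='http://kubeflow.example.com')
    client.create_run_from_pipeline_func(
        training_pipeline,
        arguments={},
        run_name='churn_training_2025_10_30'
    )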
```

**Model Comparison Dashboard:**

```python
# Compare multiple model versions
import mlflow
import pandas as pd
import plotly.express as px

client = mlflow.tracking.MlflowClient()

# Get all versions of model
versions = client.search_model_versions("name='customer_churn_model'")

comparison_data = []
for version in versions:
    run = mlflow.get_run(version.run_id)
    
    comparison_data.append({
        'version': int(version.version),  # the registry returns versions as strings
        'stage': version.current_stage,
        'created': pd.to_datetime(version.creation_timestamp, unit='ms'),
        'auc': run.data.metrics.get('roc_auc', 0),
        'f1': run.data.metrics.get('f1_score', 0),
        'accuracy': run.data.metrics.get('accuracy', 0),
        'n_estimators': run.data.params.get('n_estimators'),  # params are stored as strings
        'max_depth': run.data.params.get('max_depth'),
    })

# Sort numerically so the x-axis reads v1, v2, ..., v10
df_comparison = pd.DataFrame(comparison_data).sort_values('version')

# Plot comparison
fig = px.line(
    df_comparison,
    x='version',
    y=['auc', 'f1', 'accuracy'],
    title='Model Performance Over Versions',
    labels={'value': 'Score', 'variable': 'Metric'}
)
fig.show()

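# Production model details
prod_rows = df_comparison[df_comparison['stage'] == 'Production']
if prod_rows.empty:
    raise SystemExit("No model version is currently in Production")
prod_version = prod_rows.iloc[0]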
print(f"\n📊 Production Model (v{prod_version['version']}):")
print(f"  AUC: {prod_version['auc']:.4f}")
print(f"  F1: {prod_version['f1']:.4f}")
print(f"  Accuracy: {prod_version['accuracy']:.4f}")
print(f"  Hyperparams: n_estimators={prod_version['n_estimators']}, max_depth={prod_version['max_depth']}")
```

**Best Practices:**

```python
mlops_best_practices = {
    "1. Reproducibility": """
        Track EVERYTHING:
        - Code version (git commit hash)
        - Data version (DVC, Delta version)
        - Environment (Docker image, requirements.txt)
        - Random seeds
        - Hyperparameters
        
        Goal: Reproduce exact model 6 months later
    """,
    
    "2. Automated Testing": """
        Unit tests:
        - Feature engineering logic
        - Model preprocessing
        - Prediction postprocessing
        
        Integration tests:
        - End-to-end pipeline
        - Model serving API
        - Feature store integration
        
        Model tests:
        - Minimum accuracy threshold
        - Inference latency < 100ms
        - No feature drift
    """,
    
    "3. Gradual Rollout": """
        Don't deploy to 100% traffic immediately
        
        Stage 1: Shadow mode (30 days)
        - New model runs parallel to old
        - Predictions logged but not used
        - Compare performance
        
        Stage 2: A/B test (7 days)
        - 10% traffic to new model
        - Monitor metrics closely
        - Rollback if issues
        
        Stage 3: Ramp up (14 days)
        - 25% → 50% → 75% → 100%
        - Gradual increase
    """,
    
    "4. Model Governance": """
        RBAC for model promotion:
        - Data Scientists: can register to Staging
        - ML Engineers: can promote to Production
        - Approvals: require 2+ reviewers
        
        Audit trail:
        - Who deployed what when
        - Why model was promoted
        - What tests were run
    """,
    
    "5. Rollback Plan": """
        Always have rollback ready:
        - Keep the 2-3 previous versions archived and ready to redeploy
        - Canary deployment (route % traffic)
        - Circuit breaker (auto-rollback if errors spike)
        
        Example:
        v10 deployed → errors 5% → auto-rollback to v9
    """
}
```
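The A/B and ramp-up stages of the gradual rollout need a stable way to decide which customers see the new model. A minimal sketch of hash-based traffic splitting follows, under the assumption that routing is done per customer ID; `assign_model` and the salt value are hypothetical.

```python
import hashlib


def assign_model(customer_id, new_model_share, salt="churn-rollout"):
    """Deterministically route a customer to the 'new' or 'old' model."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "new" if bucket < new_model_share else "old"


ids = [f"cust_{i}" for i in range(10_000)]

share_10 = sum(assign_model(c, 0.10) == "new" for c in ids) / len(ids)
print(f"{share_10:.1%} of customers routed to the new model at the 10% stage")

# Ramping up only adds customers: everyone on 'new' at 10% stays on 'new' at 25%
stays = all(
    assign_model(c, 0.25) == "new"
    for c in ids
    if assign_model(c, 0.10) == "new"
)
print("assignments stable during ramp-up:", stays)
```

Because the bucket depends only on the salt and the customer ID, a customer gets the same model on every request, and rollback is just setting the share back to 0.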

---

### 🚀 **Production Serving: Online, Batch, and Model Monitoring**

**Serving Strategies:**

```
┌────────────────────────────────────────────────────────────┐
│            MODEL SERVING PATTERNS                          │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  1. ONLINE SERVING (Real-time)                            │
│     ┌─────────────┐                                       │
│     │   Client    │                                       │
│     └──────┬──────┘                                       │
│            │ HTTP Request                                 │
│            ▼                                               │
│     ┌──────────────┐        ┌──────────────┐            │
│     │  Load        │        │   Feature    │            │
│     │  Balancer    │───────▶│   Store      │            │
│     │  (Nginx)     │        │   (Redis)    │            │
│     └──────┬───────┘        └──────────────┘            │
│            │                        │                     │
│            ▼                        │                     │
│     ┌──────────────┐                │                     │
│     │  API Server  │◀───────────────┘                     │
│     │  (FastAPI)   │                                      │
│     │              │                                      │
│     │ • Model in   │                                      │
│     │   memory     │                                      │
│     │ • Features   │                                      │
│     │   from Redis │                                      │
│     │ • <100ms     │                                      │
│     └──────────────┘                                      │
│                                                            │
│  2. BATCH SERVING (Bulk predictions)                      │
│     ┌──────────────┐                                      │
│     │   Scheduler  │                                      │
│     │   (Airflow)  │                                      │
│     └──────┬───────┘                                      │
│            │ Daily @2AM                                   │
│            ▼                                               │
│     ┌──────────────┐        ┌──────────────┐            │
│     │  Spark Job   │───────▶│  Feature     │            │
│     │              │        │  Store       │            │
│     │ • Load model │        │  (Delta)     │            │
│     │ • Get        │        └──────────────┘            │
│     │   features   │                                      │
│     │ • Predict    │                                      │
│     │   millions   │                                      │
│     └──────┬───────┘                                      │
│            │                                               │
│            ▼                                               │
│     ┌──────────────┐                                      │
│     │  Predictions │                                      │
│     │  Delta Table │                                      │
│     └──────────────┘                                      │
│                                                            │
│  3. STREAMING SERVING (Micro-batch)                       │
│     ┌──────────────┐                                      │
│     │    Kafka     │                                      │
│     │   (Events)   │                                      │
│     └──────┬───────┘                                      │
│            │                                               │
│            ▼                                               │
│     ┌──────────────┐        ┌──────────────┐            │
│     │ Flink/Spark  │───────▶│  Feature     │            │
│     │ Streaming    │        │  Store       │            │
│     │              │        └──────────────┘            │
│     │ • Enrich     │                                      │
│     │   with       │                                      │
│     │   features   │                                      │
│     │ • Predict    │                                      │
│     │ • <1s        │                                      │
│     └──────┬───────┘                                      │
│            │                                               │
│            ▼                                               │
│     ┌──────────────┐                                      │
│     │ Output Kafka │                                      │
│     │   Topic      │                                      │
│     └──────────────┘                                      │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

**Online Serving: FastAPI Production Setup**

```python
import json
import os
import time
from contextlib import asynccontextmanager
from typing import List

import mlflow.sklearn
import pandas as pd
import redis
from fastapi import FastAPI, HTTPException, Response
from mlflow.tracking import MlflowClient
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from pydantic import BaseModel, Field

MODEL_NAME = "customer_churn_model"

# Column order must match the data the model was trained on
FEATURE_COLUMNS = [
    "txn_count_30d",
    "total_spent_30d",
    "avg_transaction_30d",
    "days_since_last_txn",
]

# Metrics
PREDICTION_COUNT = Counter('prediction_requests_total', 'Total prediction requests')
PREDICTION_LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency')
PREDICTION_ERRORS = Counter('prediction_errors_total', 'Total prediction errors')

# Globals populated once at startup
model = None
model_version = None
redis_client = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    """
    Load model once at startup (not per request)
    """
    global model, model_version, redis_client
    
    # Load the sklearn flavor: the pyfunc wrapper's predict() returns class
    # labels for sklearn models, and this API needs predict_proba()
    print("Loading model from MLflow...")
    model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}/Production")
    model_version = MlflowClient().get_latest_versions(
        MODEL_NAME, stages=["Production"]
    )[0].version
    print(f"✅ Model v{model_version} loaded")
    
    # Connect to Redis (host comes from REDIS_HOST, falling back to "redis")
    redis_client = redis.Redis(
        host=os.getenv("REDIS_HOST", "redis"),
        port=6379,
        decode_responses=True,
        max_connections=50
    )
    redis_client.ping()  # Fail fast at startup if Redis is unreachable
    print("✅ Redis connected")
    
    yield
    
    # Cleanup on shutdown
    print("Shutting down...")
    redis_client.close()

app = FastAPI(
    title="Churn Prediction API",
    description="Real-time customer churn prediction",
    version="1.0.0",
    lifespan=lifespan
)

class PredictionRequest(BaseModel):
    # Batch up to 100 (on Pydantic v2 use max_length=100 instead of max_items)
    customer_ids: List[str] = Field(..., max_items=100)

class Prediction(BaseModel):
    customer_id: str
    churn_probability: float
    churn_risk: str
    features: dict

class PredictionResponse(BaseModel):
    predictions: List[Prediction]
    latency_ms: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    """
    Batch prediction endpoint
    Latency target: <100ms for 100 customers
    
    Declared with plain `def` so FastAPI runs it in a thread pool:
    the Redis client and the model call are both blocking.
    """
    start_time = time.time()
    
    try:
        PREDICTION_COUNT.inc(len(request.customer_ids))
        
        predictions = []
        
        # Get features from Redis (one round trip for the whole batch)
        pipeline = redis_client.pipeline()
        for customer_id in request.customer_ids:
            pipeline.get(f"features:customer:{customer_id}")
        
        features_raw = pipeline.execute()
        
        # Prepare batch input
        feature_vectors = []
        valid_customers = []
        
        for customer_id, features_str in zip(request.customer_ids, features_raw):
            if not features_str:
                # No features in the online store: return a neutral default
                predictions.append(Prediction(
                    customer_id=customer_id,
                    churn_probability=0.5,  # Default
                    churn_risk="unknown",
                    features={}
                ))
                continue
            
            features = json.loads(features_str)
            feature_vectors.append([features[name] for name in FEATURE_COLUMNS])
            valid_customers.append((customer_id, features))
        
        # Batch prediction (one model call for all customers)
        if feature_vectors:
            X = pd.DataFrame(feature_vectors, columns=FEATURE_COLUMNS)
            churn_probas = model.predict_proba(X)[:, 1]
            
            for (customer_id, features), churn_proba in zip(valid_customers, churn_probas):
                risk = "low" if churn_proba < 0.3 else "medium" if churn_proba < 0.7 else "high"
                
                predictions.append(Prediction(
                    customer_id=customer_id,
                    churn_probability=float(churn_proba),
                    churn_risk=risk,
                    features=features
                ))
        
        latency_ms = (time.time() - start_time) * 1000
        PREDICTION_LATENCY.observe(latency_ms / 1000)
        
        return PredictionResponse(
            predictions=predictions,
            latency_ms=latency_ms
        )
    
    except Exception as e:
        PREDICTION_ERRORS.inc()
        raise HTTPException(500, f"Prediction failed: {str(e)}")

@app.get("/metrics")
def metrics():
    """Prometheus metrics endpoint (plain-text exposition format, not JSON)"""
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.get("/health")
def health():
    """Health check: returns 503 when unhealthy so Kubernetes probes fail"""
    try:
        redis_client.ping()
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"unhealthy: {e}")
    return {
        "status": "healthy",
        "model_version": model_version,
        "redis": "connected"
    }

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
# With several workers each process keeps its own counters; use prometheus_client's
# multiprocess mode if /metrics must aggregate across workers.
```
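The endpoint above appends fallback predictions (customers without features) before the model's predictions, so the response order can differ from the request order. One way to keep results aligned with the request is sketched below with plain dictionaries; `risk_bucket` and `assemble` are hypothetical helpers, and the thresholds and the 0.5 default are the ones used above.

```python
def risk_bucket(probability):
    """Map a churn probability to the risk labels used by the API."""
    if probability < 0.3:
        return "low"
    if probability < 0.7:
        return "medium"
    return "high"


def assemble(customer_ids, scored, default_probability=0.5):
    """Return one result per requested ID, in request order.

    `scored` maps customer_id -> probability for customers that had features.
    """
    results = []
    for customer_id in customer_ids:
        if customer_id in scored:
            probability = scored[customer_id]
            risk = risk_bucket(probability)
        else:
            probability = default_probability
            risk = "unknown"
        results.append({
            "customer_id": customer_id,
            "churn_probability": probability,
            "churn_risk": risk,
        })
    return results


out = assemble(["a", "b", "c"], {"a": 0.82, "c": 0.12})
print([(r["customer_id"], r["churn_risk"]) for r in out])
```

Clients that zip the response with their own request list then never pair a prediction with the wrong customer.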

**Kubernetes Deployment:**

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-prediction-api
spec:
  replicas: 4  # Initial count; the HPA below then scales between 2 and 10
  selector:
    matchLabels:
      app: churn-prediction
  template:
    metadata:
      labels:
        app: churn-prediction
    spec:
      containers:
      - name: api
        image: myregistry/churn-prediction:v1.2.3
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        env:
        - name: MLFLOW_TRACKING_URI
          value: "http://mlflow:5000"
        - name: REDIS_HOST
          value: "redis"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: churn-prediction-service
spec:
  selector:
    app: churn-prediction
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-prediction-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-prediction-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```
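The autoscaler's core rule is `desired = ceil(current_replicas × current_metric / target_metric)`, clamped to the configured bounds. The sketch below applies that rule to the 70% CPU target and the 2 to 10 replica range from the manifest; it is a simplification that leaves out the controller's tolerance band and stabilization windows, and `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=2, max_replicas=10):
    """Scale proportionally to metric pressure, then clamp to the HPA bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# CPU target is 70% average utilization, as in the manifest above
print(desired_replicas(4, 90, 70))   # load above target: scale out
print(desired_replicas(4, 35, 70))   # load at half the target: scale in
print(desired_replicas(8, 140, 70))  # heavy overload: capped at max_replicas
```

When several metrics are configured, as with CPU and memory here, the autoscaler evaluates each one and uses the largest resulting replica count.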

**Batch Serving with Spark:**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, col, current_date, when
import mlflow.pyfunc

spark = SparkSession.builder \
    .appName("ChurnPredictionBatch") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

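# Load model as Spark UDF
# Note: for sklearn models the pyfunc flavor calls predict() by default, which
# returns class labels. To score probabilities, register a model whose pyfunc
# predict returns predict_proba()[:, 1] (e.g. a custom pyfunc wrapper).
model_uri = "models:/customer_churn_model/Production"
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri=model_uri,
    result_type="double"
)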

# Get all active customers (10M customers)
active_customers = spark.read \
    .format("delta") \
    .load("s3://datalake/gold/active_customers") \
    .filter(col("is_active") == True)

print(f"Scoring {active_customers.count():,} customers...")

# Get features
customer_features = spark.read \
    .format("delta") \
    .load("s3://datalake/gold/customer_features_daily") \
    .filter(col("feature_date") == current_date())

# Join
customers_with_features = active_customers \
    .join(customer_features, "customer_id", "inner")

# Batch predict (distributed across cluster)
predictions = customers_with_features.withColumn(
    "churn_probability",
    predict_udf(struct(
        col("txn_count_30d"),
        col("total_spent_30d"),
        col("avg_transaction_30d"),
        col("days_since_last_txn")
    ))
).withColumn(
    "churn_risk",
    when(col("churn_probability") < 0.3, "low")
    .when(col("churn_probability") < 0.7, "medium")
    .otherwise("high")
).withColumn(
    "prediction_date",
    current_date()
)

# Save predictions
predictions.select(
    "customer_id",
    "churn_probability",
    "churn_risk",
    "prediction_date"
).write \
    .format("delta") \
    .mode("append") \
    .partitionBy("prediction_date") \
    .option("mergeSchema", "true") \
    .save("s3://datalake/gold/churn_predictions")

# High-risk customers → CRM
high_risk = predictions.filter(col("churn_risk") == "high")

high_risk.select(
    "customer_id",
    "churn_probability",
    "total_spent_30d",
    "days_since_last_txn"
).write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://crm:5432/marketing") \
    .option("dbtable", "high_churn_risk_customers") \
    .option("user", "crm_user") \
    .option("password", "***") \
    .mode("overwrite") \
    .save()

print(f"✅ Scored {predictions.count():,} customers")
print(f"⚠️ High risk: {high_risk.count():,} customers")

spark.stop()
```

**Model Monitoring: Data Drift Detection**

```python
import pandas as pd
import numpy as np
from scipy.stats import ks_2samp
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
import mlflow

# Reference data (training data distribution)
reference_df = pd.read_parquet("s3://datalake/training/reference_data.parquet")

# Current data (production inputs from the last 7 days); assumes an active SparkSession named `spark`
current_df = spark.read \
    .format("delta") \
    .load("s3://datalake/gold/prediction_inputs") \
    .filter("prediction_date >= current_date() - 7") \
    .toPandas()

# Evidently report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset()
])

report.run(
    reference_data=reference_df,
    current_data=current_df,
    column_mapping=None
)

# Save report
report.save_html("drift_report.html")

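# Extract drift metrics
drift_metrics = report.as_dict()

# Per-column results live in the preset's DataDriftTable metric
# (look it up by name rather than by position)
drift_table = next(
    m for m in drift_metrics['metrics'] if m['metric'] == 'DataDriftTable'
)

# Check for drift
features_with_drift = []
for feature, metrics in drift_table['result']['drift_by_columns'].items():
    if metrics['drift_detected']:
        features_with_drift.append({
            'feature': feature,
            'drift_score': metrics['drift_score'],
            'threshold': metrics.get('stattest_threshold')  # test threshold, not a p-value
        })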

if features_with_drift:
    print("⚠️ DRIFT DETECTED in features:")
    for feature_drift in features_with_drift:
        print(f"  • {feature_drift['feature']}: score={feature_drift['drift_score']:.4f}")
    
    # Log to MLflow
    with mlflow.start_run(run_name="drift_detection"):
        mlflow.log_metric("features_with_drift", len(features_with_drift))
        mlflow.log_artifact("drift_report.html")
    
    # Send alert (send_alert is a placeholder for your Slack/PagerDuty/email integration)
    send_alert(
        title="⚠️ Data Drift Detected",
        message=f"{len(features_with_drift)} features drifted",
        severity="warning"
    )
else:
    print("✅ No drift detected")
```
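The Kolmogorov-Smirnov test used for drift checks compares two samples through the largest gap between their empirical CDFs. A minimal standard-library sketch of that statistic is shown below to make the idea concrete; it computes only the statistic, not the p-value, and `ks_statistic` is a hypothetical helper rather than a replacement for `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # The ECDFs are step functions, so the maximum gap occurs at a data point
    for x in set(ref) | set(cur):
        f_ref = bisect_right(ref, x) / len(ref)  # share of reference values <= x
        f_cur = bisect_right(cur, x) / len(cur)  # share of current values <= x
        gap = max(gap, abs(f_ref - f_cur))
    return gap


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
partial = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
shifted = ks_statistic([1, 2, 3, 4], [11, 12, 13, 14])
print(same, partial, shifted)
```

A statistic near 0 means the production distribution still matches the training reference, while a value near 1 means the two samples barely overlap.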

**Model Performance Monitoring:**

```python
from datetime import datetime, timedelta
import pandas as pd
import plotly.graph_objects as go

# Get predictions and actuals
predictions = spark.read \
    .format("delta") \
    .load("s3://datalake/gold/churn_predictions") \
    .filter("prediction_date >= current_date() - 30")

# Join with actuals (churned or not in next 7 days)
actuals = spark.read \
    .format("delta") \
    .load("s3://datalake/gold/churn_labels") \
    .filter("label_date >= current_date() - 30")

performance = predictions \
    .join(actuals, ["customer_id", "prediction_date"]) \
    .toPandas()

# Calculate metrics over time
from sklearn.metrics import roc_auc_score, precision_score, recall_score

daily_metrics = []

# Normalize to plain dates so the comparison does not depend on the column dtype
prediction_dates = pd.to_datetime(performance['prediction_date']).dt.date

for date in pd.date_range(
    start=datetime.now() - timedelta(days=30),
    end=datetime.now(),
    freq='D'
):
    day_data = performance[prediction_dates == date.date()]

    # AUC is undefined for a day with no rows or with a single class
    if len(day_data) == 0 or day_data['churned'].nunique() < 2:
        continue

    auc = roc_auc_score(day_data['churned'], day_data['churn_probability'])

    # Classify at 0.7 threshold
    predictions_binary = (day_data['churn_probability'] > 0.7).astype(int)
    precision = precision_score(day_data['churned'], predictions_binary, zero_division=0)
    recall = recall_score(day_data['churned'], predictions_binary, zero_division=0)

    daily_metrics.append({
        'date': date,
        'auc': auc,
        'precision': precision,
        'recall': recall,
        'num_predictions': len(day_data)
    })

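df_metrics = pd.DataFrame(daily_metrics)

# Stop early when no day produced metrics; the plot and threshold check need data
if df_metrics.empty:
    raise RuntimeError("No labeled predictions available for the last 30 days")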

# Plot performance over time
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_metrics['date'], y=df_metrics['auc'], name='AUC'))
fig.add_trace(go.Scatter(x=df_metrics['date'], y=df_metrics['precision'], name='Precision'))
fig.add_trace(go.Scatter(x=df_metrics['date'], y=df_metrics['recall'], name='Recall'))

# Add threshold line
fig.add_hline(y=0.80, line_dash="dash", line_color="red", annotation_text="Minimum AUC")

fig.update_layout(
    title='Model Performance Over Last 30 Days',
    xaxis_title='Date',
    yaxis_title='Score',
    yaxis_range=[0, 1]
)

fig.write_html("performance_dashboard.html")

# Check if below threshold
if df_metrics['auc'].iloc[-1] < 0.80:
    print("⚠️ MODEL PERFORMANCE DEGRADED")
    print(f"  Current AUC: {df_metrics['auc'].iloc[-1]:.4f}")
    print(f"  Threshold: 0.80")
    
    # Trigger retraining
    from airflow.api.client.local_client import Client
    client = Client(None, None)
    client.trigger_dag(
        dag_id='ml_training_pipeline',
        conf={'reason': 'performance_degradation'}
    )
    
    print("✅ Retraining triggered")
else:
    print("✅ Model performance healthy")
```

**A/B Testing Framework:**

```python
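# Route traffic between model versions
# get_features() and log_ab_test() are assumed to be defined elsewhere in the service
from datetime import datetime
from fastapi import FastAPI, Request
import random
import mlflow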

app = FastAPI()

# Load multiple model versions
model_v1 = mlflow.pyfunc.load_model("models:/customer_churn_model/1")
model_v2 = mlflow.pyfunc.load_model("models:/customer_churn_model/2")

# A/B test config
AB_TEST_CONFIG = {
    "enabled": True,
    "model_v1_traffic": 0.9,  # 90% traffic
    "model_v2_traffic": 0.1   # 10% traffic (new model)
}

@app.post("/predict")
async def predict(request: Request, customer_id: str):
    # Assign to variant
    if AB_TEST_CONFIG["enabled"]:
        rand = random.random()
        if rand < AB_TEST_CONFIG["model_v2_traffic"]:
            variant = "v2"
            model = model_v2
        else:
            variant = "v1"
            model = model_v1
    else:
        variant = "v1"
        model = model_v1
    
    # Get features and predict
    features = get_features(customer_id)
    prediction = model.predict([features])[0]
    
    # Log for analysis
    log_ab_test(
        customer_id=customer_id,
        variant=variant,
        prediction=prediction,
        timestamp=datetime.now()
    )
    
    return {
        "customer_id": customer_id,
        "churn_probability": float(prediction),
        "model_variant": variant  # For debugging
    }

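# Analyze A/B test results
def analyze_ab_test():
    """
    Compare model v1 vs v2 performance.

    Assumes the A/B logs were already joined with ground-truth labels and
    contain `variant`, `timestamp`, `prediction`, and `churned` columns.
    """
    import pandas as pd
    from scipy.stats import ttest_ind
    from sklearn.metrics import roc_auc_score

    results = spark.read \
        .format("delta") \
        .load("s3://datalake/gold/ab_test_logs") \
        .filter("timestamp >= current_timestamp() - interval 7 days") \
        .toPandas()
    results['day'] = pd.to_datetime(results['timestamp']).dt.date

    def calculate_auc(df):
        return roc_auc_score(df['churned'], df['prediction'])

    def daily_auc(df):
        # One AUC per day; days with a single class are skipped (AUC undefined)
        return [
            calculate_auc(day) for _, day in df.groupby('day')
            if day['churned'].nunique() == 2
        ]

    # Group by variant
    v1_results = results[results['variant'] == 'v1']
    v2_results = results[results['variant'] == 'v2']

    # Compare AUC
    v1_auc = calculate_auc(v1_results)
    v2_auc = calculate_auc(v2_results)

    # A t-test needs a sample per variant, so it compares the daily AUC values
    t_stat, p_value = ttest_ind(
        daily_auc(v1_results), daily_auc(v2_results), equal_var=False
    )

    print(f"Model v1 AUC: {v1_auc:.4f}")
    print(f"Model v2 AUC: {v2_auc:.4f}")
    print(f"Improvement: {(v2_auc - v1_auc) / v1_auc * 100:.2f}%")
    print(f"Statistical significance: p={p_value:.4f}")

    if p_value < 0.05 and v2_auc > v1_auc:
        print("✅ Model v2 is significantly better!")
        print("Recommendation: Ramp up v2 to 100% traffic")
    else:
        print("❌ No significant improvement, keep v1")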
```
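
Per-request random routing can send the same customer to different variants on successive calls, which blurs the comparison between models. One common alternative is to derive the variant from a hash of the customer ID, so the assignment is stable across requests. The sketch below illustrates the idea; `assign_variant` and its `salt` parameter are hypothetical names, not part of the service above.

```python
import hashlib


def assign_variant(customer_id: str, v2_traffic: float = 0.1,
                   salt: str = "churn_ab_v1") -> str:
    """Map a customer to 'v1' or 'v2' deterministically.

    The salted SHA-256 digest is turned into a number in [0, 1); customers
    whose bucket falls below `v2_traffic` are served by the new model.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits -> [0, 1)
    return "v2" if bucket < v2_traffic else "v1"


# The same customer always lands in the same variant
ids = [f"customer_{i}" for i in range(10_000)]
share_v2 = sum(assign_variant(c) == "v2" for c in ids) / len(ids)
print(f"v2 share: {share_v2:.3f}")
```

Changing the salt reshuffles the buckets, which is useful when a new experiment should not inherit the previous split.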

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. Data pipeline for ML: ETL → Feature Engineering → Training → Serving

- Extract: raw sources (transactions, logs, events).
- Transform: cleaning, windowed aggregations, joins.
- Feature Engineering: build features (ratios, lags, embeddings).
- Training: versioned dataset, tracked experiments (MLflow/Weights & Biases).
- Serving: online (real-time API) and offline (batch predictions).
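
The stages above can be sketched as a chain of small functions. This is a toy, in-memory illustration only: the event rows, the feature names, and the threshold "model" are all made up for the example and stand in for real sources, feature logic, and training.

```python
from collections import defaultdict
from statistics import mean

# Toy raw events; in practice these come from a database, logs, or a stream.
RAW_EVENTS = [
    {"customer_id": "a", "amount": 120.0, "churned": 0},
    {"customer_id": "a", "amount": 80.0, "churned": 0},
    {"customer_id": "b", "amount": 15.0, "churned": 1},
    {"customer_id": "b", "amount": None, "churned": 1},  # dirty row
    {"customer_id": "c", "amount": 10.0, "churned": 1},
    {"customer_id": "d", "amount": 200.0, "churned": 0},
]


def extract():
    return list(RAW_EVENTS)


def transform(events):
    """Cleaning: drop rows with a missing amount."""
    return [e for e in events if e["amount"] is not None]


def build_features(events):
    """Feature engineering: per-customer aggregates plus the label."""
    amounts, labels = defaultdict(list), {}
    for e in events:
        amounts[e["customer_id"]].append(e["amount"])
        labels[e["customer_id"]] = e["churned"]
    return {
        cid: {
            "total_spend": sum(a),
            "num_tx": len(a),
            "avg_ticket": sum(a) / len(a),
            "churned": labels[cid],
        }
        for cid, a in amounts.items()
    }


def train(features):
    """Toy 'model': midpoint between the class means of total_spend."""
    churned = [f["total_spend"] for f in features.values() if f["churned"] == 1]
    stayed = [f["total_spend"] for f in features.values() if f["churned"] == 0]
    return {"threshold": (mean(churned) + mean(stayed)) / 2}


def predict(model, feature_row):
    """Serving: the same function backs online calls and batch scoring."""
    return int(feature_row["total_spend"] < model["threshold"])


features = build_features(transform(extract()))
model = train(features)
batch_scores = {cid: predict(model, row) for cid, row in features.items()}
print(model, batch_scores)
```

Keeping `build_features` as the single source of feature logic for both training and serving is the point of the exercise: it is what a feature store formalizes at scale.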

## 2. Feature Store: concept and practice with Feast (local demo)

```python
# Note: install feast if you want to run this (it is not in the default requirements)
feast_demo = r'''
# feature_repo/feature_definitions.py
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
from datetime import timedelta

cliente = Entity(name='cliente_id', join_keys=['cliente_id'])

source = FileSource(
    path='data/features.parquet',
    timestamp_field='event_timestamp'
)

cliente_fv = FeatureView(
    name='cliente_features',
    entities=[cliente],
    ttl=timedelta(days=30),
    schema=[
        Field(name='total_compras', dtype=Float32),
        Field(name='num_transacciones', dtype=Int64),
    ],
    source=source
)
# feast apply
# feast materialize-incremental <end_timestamp>
# store.get_online_features(...)
'''
# Print the snippet as text (printing the list of lines would show a Python list repr)
print(feast_demo)
```
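
To make the `ttl` and `timestamp_field` settings more concrete, here is a conceptual sketch of a point-in-time lookup: for a given entity and moment, return the most recent feature row that is not newer than that moment and not older than the TTL. It is plain Python written for illustration, not Feast's implementation; `FEATURE_ROWS` and `point_in_time_lookup` are hypothetical names.

```python
from datetime import datetime, timedelta

# Hypothetical feature rows, shaped like the parquet source in the demo.
FEATURE_ROWS = [
    {"cliente_id": 1, "event_timestamp": datetime(2024, 1, 1), "total_compras": 100.0},
    {"cliente_id": 1, "event_timestamp": datetime(2024, 1, 20), "total_compras": 250.0},
    {"cliente_id": 2, "event_timestamp": datetime(2023, 11, 1), "total_compras": 40.0},
]


def point_in_time_lookup(rows, cliente_id, as_of, ttl=timedelta(days=30)):
    """Return the latest row with as_of - ttl <= event_timestamp <= as_of, or None."""
    candidates = [
        r for r in rows
        if r["cliente_id"] == cliente_id
        and as_of - ttl <= r["event_timestamp"] <= as_of
    ]
    return max(candidates, key=lambda r: r["event_timestamp"], default=None)


# A label observed on Jan 15 must not see the Jan 20 feature value
row = point_in_time_lookup(FEATURE_ROWS, 1, datetime(2024, 1, 15))
print(row["total_compras"])
# Rows older than the TTL are treated as missing
print(point_in_time_lookup(FEATURE_ROWS, 2, datetime(2024, 1, 15)))
```

Excluding rows newer than the label timestamp is what prevents leakage when building training sets from historical features.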

## 3. Dataset and experiment versioning with MLflow

```python
mlflow_snippet = r'''
import mlflow
mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment('ventas_prediccion')

with mlflow.start_run():
    mlflow.log_param('model', 'xgboost')
    mlflow.log_param('n_estimators', 100)
    mlflow.log_metric('rmse', 0.85)
    mlflow.log_artifact('model.pkl')
'''
print(mlflow_snippet)
```
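
The snippet tracks parameters and metrics, but not which data the run used. One lightweight option, sketched here under the assumption that the dataset fits in memory as a list of dict rows, is to compute a content fingerprint and log it with the run. `dataset_fingerprint` is a hypothetical helper, and the sample rows are invented.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a short, order-independent SHA-256 fingerprint of a list of dict rows."""
    # Serialize each row with sorted keys, then sort rows so ordering does not matter
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8"))
    return digest.hexdigest()[:16]


v1 = [{"id": 1, "ventas": 10.0}, {"id": 2, "ventas": 12.5}]
v1_shuffled = list(reversed(v1))
v2 = v1 + [{"id": 3, "ventas": 9.0}]

print(dataset_fingerprint(v1) == dataset_fingerprint(v1_shuffled))  # same content
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))           # new row added
```

Inside a run, the value could be recorded with `mlflow.log_param('dataset_version', dataset_fingerprint(rows))`, so two experiments can be compared knowing whether they saw the same data.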

## 4. Automated retraining (Airflow + MLflow)

- Weekly DAG: extract new data, generate features, train, evaluate.
- If the metric improves on the baseline → register the model and promote it to production.
- If drift is detected → alert and trigger an off-schedule retraining.
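
The two rules above can be written down as a small decision function that a DAG task might call. This is a sketch under stated assumptions: the metric is higher-is-better (for example AUC), and `Decision`, `decide`, and `min_gain` are hypothetical names rather than Airflow or MLflow APIs.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    retrain_now: bool  # trigger an off-schedule training run
    alert: bool        # notify the team
    promote: bool      # register the candidate and promote it to production


def decide(drift_detected, candidate_metric=None, baseline_metric=None, min_gain=0.0):
    """Apply the promotion and drift rules.

    `candidate_metric` and `baseline_metric` are None when no candidate
    model has been evaluated yet (for example, on a drift-only check).
    """
    promote = (
        candidate_metric is not None
        and baseline_metric is not None
        and candidate_metric - baseline_metric > min_gain
    )
    return Decision(retrain_now=drift_detected, alert=drift_detected, promote=promote)


# Weekly run: the candidate beats the baseline by more than the required gain
print(decide(drift_detected=False, candidate_metric=0.86, baseline_metric=0.84, min_gain=0.01))
# Monitoring run: drift found, no candidate evaluated yet
print(decide(drift_detected=True))
```

A non-zero `min_gain` avoids promoting a model on differences that are within evaluation noise.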

## 5. Online and batch serving

- Online: REST endpoint (FastAPI with the model in memory or loaded from the MLflow Model Registry).
- Batch: weekly Spark job to score the whole catalog or customer base.
- Online feature cache (Redis) for latency < 50 ms.
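
The caching idea in the last bullet can be illustrated with an in-memory stand-in for Redis: features are served from the cache until they expire, and only a miss hits the slower store. `TTLFeatureCache` and the `loader` callback are hypothetical names for this sketch; a production setup would use a Redis client with key expiry instead.

```python
import time


class TTLFeatureCache:
    """Read-through cache: entries expire `ttl_seconds` after being loaded."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._loader = loader    # called on a miss, e.g. a feature store lookup
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (expires_at, features)
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        # Miss or expired: reload and store with a fresh expiry
        self.misses += 1
        features = self._loader(key)
        self._entries[key] = (now + self._ttl, features)
        return features


def slow_lookup(customer_id):
    # Placeholder for the offline or remote feature lookup
    return {"customer_id": customer_id, "total_spend": 42.0}


cache = TTLFeatureCache(ttl_seconds=60, loader=slow_lookup)
cache.get("c1")
cache.get("c1")  # served from memory
print(cache.misses)
```

The TTL is a trade-off: a longer value reduces load on the backing store but serves staler features.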

## 6. Model monitoring

- Data drift: the feature distribution changes (KS test, PSI).
- Concept drift: the X→Y relationship changes (performance degrades).
- Model and business metrics: precision, recall, ROC-AUC, revenue.
- Automatic alerts and rollback when a threshold is crossed.
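
The two data drift measures named above can be computed with the standard library alone. The sketch below uses equal-width bins over the reference range for PSI and the maximum gap between empirical CDFs for the two-sample KS statistic; the function names, bin count, and sample data are assumptions made for the example, and libraries such as SciPy or Evidently provide tested implementations.

```python
import math
from bisect import bisect_right


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index with equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or a division by zero
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


reference = [i / 100 for i in range(100)]        # spread over [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]    # concentrated in [0.5, 1)

print(f"PSI same data:    {psi(reference, reference):.3f}")
print(f"PSI shifted data: {psi(reference, shifted):.3f}")
print(f"KS shifted data:  {ks_statistic(reference, shifted):.3f}")
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the alerting threshold should be calibrated per feature.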

