# Tracking Experiments with MLflow

## Objectives
- Bridge the gap between "Data Science" (building models in a black box) and "MLOps" (tracking models like operational software).
- Integrate **MLflow** into a Scikit-Learn training script to automatically log hyperparameters, metrics, and models.

## Dataset
- We will use a synthetic dataset for a binary classification task (e.g., "Will this server crash in the next 10 minutes?").

## Expected Outcome
- A functional Python script that logs training runs. Because the runs are stored in a SQLite database, you can view them by running `mlflow ui --backend-store-uri sqlite:///mlruns.db` in your terminal.

## Challenge
- Can you modify the logging code to also track the model's signature (the expected input/output schema) using `mlflow.models.signature.infer_signature`?

In [None]:
# !pip install mlflow scikit-learn pandas matplotlib

In [None]:
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.datasets import make_classification

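# Store run metadata in a local SQLite database file (mlruns.db).
# Without this call, MLflow falls back to its default local store (historically a ./mlruns directory).
mlflow.set_tracking_uri("sqlite:///mlruns.db")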
mlflow.set_experiment("Server_Crash_Prediction")

### 1. The Scenario
Training a model usually means trying several hyperparameter combinations. Without a tracking tool like MLflow, the record of what worked best ends up scattered across spreadsheets and messy notebook cells.

In [None]:
# Generate Synthetic SRE Data (Features could be latency, memory util, disk IO)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 2. The MLflow Context Manager
When we wrap our training code in `with mlflow.start_run():`, everything logged inside that block is grouped into a single tracked "Run", and the run is closed automatically when the block exits.

In [None]:
def train_and_log_model(n_estimators, max_depth):
    with mlflow.start_run():
        # 1. Log the parameters we are testing
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        
        # 2. Train the model
        clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        clf.fit(X_train, y_train)
        
        # 3. Evaluate
        predictions = clf.predict(X_test)
        acc = accuracy_score(y_test, predictions)
        prec = precision_score(y_test, predictions)
        rec = recall_score(y_test, predictions)
        
        # 4. Log the metrics
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision", prec)
        mlflow.log_metric("recall", rec)
        
        # 5. Log the actual model artifact!
        mlflow.sklearn.log_model(clf, "random_forest_model")
        
        print(f"Run Complete! | estimators: {n_estimators}, depth: {max_depth} | Acc: {acc:.3f} | Prec: {prec:.3f}")

# Let's simulate a hyperparameter search
train_and_log_model(50, 5)
train_and_log_model(100, 10)
train_and_log_model(200, None)

### 3. Viewing the Results
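Now that we have logged 3 runs, open your terminal, navigate to this directory, and start the UI against the same SQLite database the notebook logged to:

```bash
mlflow ui --backend-store-uri sqlite:///mlruns.db
```

Without `--backend-store-uri`, the UI reads MLflow's default store rather than `mlruns.db`, so these runs may not appear. The command launches a web dashboard at `http://localhost:5000` where you can compare the runs, visualize the metrics, and download the exact `.pkl` model file that produced the best results.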