# 🧪 MLflow Experiment Tracking Setup (Hybrid: Local + S3)

In this project, **MLflow** tracks experiments through a tracking server that runs locally, while large model artifacts are saved to a remote **Amazon S3 bucket**. This hybrid setup is lightweight: no EC2 server is required.

---

## 🗺️ MLflow Setup Overview

| Component         | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| **Tracking server**     | Runs **locally** on your machine (`localhost:5000`)                            |
| **Backend store**       | Uses a **local SQLite database** (`backend.db`) to store metadata (runs, params, metrics) |
| **Artifact store**      | Stores models and other artifacts in an **S3 bucket** (`mlops-churn-analytics-falcon`)    |
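
Each row of the table corresponds to one flag of the `mlflow server` command used in this project. As a rough sketch, the mapping can be written down in Python; `build_server_command` is a hypothetical helper introduced here for illustration and is not part of MLflow.

```python
import shlex


def build_server_command(backend_store_uri, artifact_root, host="127.0.0.1", port=5000):
    """Return the `mlflow server` invocation as an argument list (hypothetical helper)."""
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_store_uri,   # backend store: run metadata
        "--default-artifact-root", artifact_root,   # artifact store: models and files
        "--host", host,                             # tracking server address
        "--port", str(port),
    ]


command = build_server_command(
    backend_store_uri="sqlite:///backend.db",
    artifact_root="s3://mlops-churn-analytics-falcon/mlflow-artifacts",
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command)` to start the server from a script instead of a terminal.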

---

## 🚀 MLflow Server Command Explained

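```bash
mlflow server \
  --backend-store-uri sqlite:///backend.db \
  --default-artifact-root s3://mlops-churn-analytics-falcon/mlflow-artifacts \
  --host 127.0.0.1 \
  --port 5000
```

- `--backend-store-uri sqlite:///backend.db`: stores run metadata (params, metrics, tags) in a SQLite file, created relative to the directory where the server is started.
- `--default-artifact-root s3://...`: sets the artifact location for newly created experiments. Because this is an `s3://` URI, the client that logs artifacts uploads them to S3 directly, so it needs AWS credentials with write access to the bucket and `boto3` installed.
- `--host 127.0.0.1` and `--port 5000`: the server listens only on the local machine, at `http://127.0.0.1:5000`.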

```python
# 1. Set the MLflow tracking URI
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")  # local tracking server
print(f"Tracking URI: {mlflow.get_tracking_uri()}")

# 2. Confirm the connection by listing the experiments on the server
mlflow.search_experiments()
```

```text
Tracking URI: http://127.0.0.1:5000

[<Experiment: artifact_location='s3://mlops-churn-analytics-falcon/mlflow-artifacts/0', creation_time=1751969261156, experiment_id='0', last_update_time=1751969261156, lifecycle_stage='active', name='Default', tags={}>]
```

```python
mlflow.set_experiment("churn-prediction-hybrid")
```

```text
2025/07/08 10:26:56 INFO mlflow.tracking.fluent: Experiment with name 'churn-prediction-hybrid' does not exist. Creating a new experiment.

<Experiment: artifact_location='s3://mlops-churn-analytics-falcon/mlflow-artifacts/1', creation_time=1751970416075, experiment_id='1', last_update_time=1751970416075, lifecycle_stage='active', name='churn-prediction-hybrid', tags={}>
```

```python
import pandas as pd

# Load the preprocessed train and test splits
train_df = pd.read_csv("../data/processed/train.csv")
test_df = pd.read_csv("../data/processed/test.csv")

print("✅ Train shape:", train_df.shape)
print("✅ Test shape:", test_df.shape)

train_df.head()
```

```text
✅ Train shape: (7088, 19)
✅ Test shape: (3039, 19)
```


|   | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Trans_Amt | Total_Trans_Ct | Total_Amt_Chng_Q4_Q1 | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender | Education_Level | Marital_Status | Income_Category | Card_Category | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | 3 | 36 | 2 | 3 | 3 | 6680.0 | 1839 | 7632 | 95 | 0.617 | 0.532 | 0.275 | F | Uneducated | Married | Less than $40K | Blue | 0 |
| 1 | 39 | 1 | 34 | 3 | 1 | 1 | 2884.0 | 2517 | 4809 | 87 | 0.693 | 0.74 | 0.873 | F | Graduate | Single | Unknown | Blue | 0 |
| 2 | 52 | 1 | 36 | 4 | 2 | 2 | 14858.0 | 1594 | 4286 | 72 | 0.51 | 0.636 | 0.107 | M | Unknown | Married | $80K - $120K | Blue | 0 |
| 3 | 34 | 0 | 17 | 4 | 1 | 4 | 2638.0 | 2092 | 1868 | 43 | 0.591 | 0.344 | 0.793 | M | Graduate | Married | $40K - $60K | Blue | 0 |
| 4 | 47 | 5 | 36 | 3 | 1 | 2 | 8896.0 | 1338 | 4252 | 70 | 0.741 | 0.591 | 0.15 | M | Doctorate | Single | Less than $40K | Blue | 0 |


```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Separate features and target
target = "churn"
features = [
    'Customer_Age', 'Dependent_count', 'Months_on_book',
    'Total_Relationship_Count', 'Months_Inactive_12_mon',
    'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
    'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Amt_Chng_Q4_Q1',
    'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
    'Gender', 'Education_Level', 'Marital_Status',
    'Income_Category', 'Card_Category'
]

X_train = train_df[features]
y_train = train_df[target]

X_test = test_df[features]
y_test = test_df[target]

# Define categorical columns
categorical_cols = [
    'Gender', 'Education_Level', 'Marital_Status',
    'Income_Category', 'Card_Category'
]

# Build preprocessing step: one-hot encode categoricals, pass numeric columns through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'
)

# Create full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

# Evaluate
print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("✅ ROC AUC:", roc_auc_score(y_test, y_proba))
```

```text
✅ Accuracy: 0.889108259295821
✅ ROC AUC: 0.9003275796698177

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
```
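
The warning above suggests scaling the data, and the numeric columns in this pipeline are passed through unscaled. One possible adjustment, shown as a minimal sketch, is to standardize the numeric columns with `StandardScaler` alongside the one-hot encoding. `build_scaled_pipeline` and the toy rows are assumptions made for illustration; they are not project code or project data, and the sketch makes no claim about the resulting metrics on the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_scaled_pipeline(numeric_cols, categorical_cols):
    """Logistic regression with standardized numeric and one-hot categorical features."""
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    return Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            ("classifier", LogisticRegression(max_iter=1000)),
        ]
    )


# Made-up toy rows, only to show that the pipeline fits; not project data.
toy_df = pd.DataFrame(
    {
        "Credit_Limit": [1000.0, 2000.0, 15000.0, 1200.0, 9000.0, 30000.0],
        "Total_Trans_Ct": [20, 25, 90, 18, 80, 100],
        "Gender": ["F", "M", "F", "M", "F", "M"],
        "churn": [1, 1, 0, 1, 0, 0],
    }
)
numeric_cols = ["Credit_Limit", "Total_Trans_Ct"]
categorical_cols = ["Gender"]

scaled_pipeline = build_scaled_pipeline(numeric_cols, categorical_cols)
scaled_pipeline.fit(toy_df[numeric_cols + categorical_cols], toy_df["churn"])
toy_predictions = scaled_pipeline.predict(toy_df[numeric_cols + categorical_cols])
print(toy_predictions.shape)
```

In the notebook itself, the numeric column list could be derived as `[c for c in features if c not in categorical_cols]`.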


```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("churn-prediction-hybrid")

with mlflow.start_run():
    # Define params for Random Forest
    params = {
        "n_estimators": 100,
        "max_depth": 10,
        "random_state": 42
    }
    mlflow.log_params(params)

    # Replace classifier in pipeline
    pipeline.set_params(classifier=RandomForestClassifier(**params))

    # Train
    pipeline.fit(X_train, y_train)

    # Predict
    y_pred = pipeline.predict(X_test)
    y_proba = pipeline.predict_proba(X_test)[:, 1]

    # Metrics
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_proba)

    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("roc_auc", roc_auc)

    # Save full pipeline (preprocessing + classifier) to the S3 artifact store
    mlflow.sklearn.log_model(pipeline, artifact_path="model")

    print("✅ Random Forest run logged to MLflow!")
    print("📍 Tracking URI:", mlflow.get_tracking_uri())
    print("📦 Artifacts URI:", mlflow.get_artifact_uri())
```

```text
✅ Random Forest run logged to MLflow!
📍 Tracking URI: http://127.0.0.1:5000
📦 Artifacts URI: s3://mlops-churn-analytics-falcon/mlflow-artifacts/1/6bb43663600e4888b6374bb437ba88e9/artifacts
🏃 View run enthused-finch-482 at: http://127.0.0.1:5000/#/experiments/1/runs/6bb43663600e4888b6374bb437ba88e9
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1
```
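
The artifacts URI printed above appears to follow the pattern `<artifact root>/<experiment_id>/<run_id>/artifacts`. The sketch below reproduces that pattern and splits the result into bucket and key prefix, which can help when looking for a run's files in S3. Both helpers are hypothetical, and the pattern is inferred from this run's output rather than taken from an MLflow guarantee.

```python
from urllib.parse import urlparse


def run_artifact_uri(artifact_root, experiment_id, run_id):
    """Compose the artifact URI pattern observed in the run output (an inferred pattern)."""
    return f"{artifact_root.rstrip('/')}/{experiment_id}/{run_id}/artifacts"


def s3_bucket_and_prefix(uri):
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


uri = run_artifact_uri(
    "s3://mlops-churn-analytics-falcon/mlflow-artifacts",
    experiment_id="1",
    run_id="6bb43663600e4888b6374bb437ba88e9",
)
bucket, prefix = s3_bucket_and_prefix(uri)
print(bucket)
print(prefix)
```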


### 🔐 Registering the Best Model in MLflow

After training and logging multiple models during experiments, we often want to **register the best-performing model** (e.g. with the highest ROC AUC or accuracy).

By registering a model:
- It becomes part of the **Model Registry**, where we can manage different versions.
- We can assign stages like `Staging`, `Production`, or `Archived`.
- It simplifies deployment and collaboration.

#### How it works:
1. We connect to the MLflow Tracking Server using `MlflowClient`.
2. We get the ID of the most recent run (`search_runs` returns the newest run first by default), or choose a specific run manually.
3. We call `mlflow.register_model(...)` to register the model from that run.
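
Step 2 takes the most recent run, which is not necessarily the best-performing one. A minimal sketch of selecting by metric instead is shown here: it asks `search_runs` to sort by a logged metric in descending order and keeps the top result. `best_run_id` is a hypothetical helper, and the default metric name `roc_auc` assumes the metric was logged under that key, as in the Random Forest run.

```python
def best_run_id(client, experiment_id, metric="roc_auc"):
    """Return the ID of the run with the highest logged value of `metric`.

    `client` is expected to be an `MlflowClient` (or any object with a
    compatible `search_runs` method).
    """
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        order_by=[f"metrics.{metric} DESC"],  # highest metric value first
        max_results=1,
    )
    if not runs:
        raise LookupError(f"No runs found in experiment {experiment_id!r}")
    return runs[0].info.run_id


# Usage against the local tracking server (requires the server to be running):
# from mlflow.tracking import MlflowClient
# client = MlflowClient("http://127.0.0.1:5000")
# run_id = best_run_id(client, experiment_id="1")
```

Passing the client in as an argument keeps the helper independent of any particular tracking URI.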

📌 **Note:** The `artifact_path` used in `mlflow.sklearn.log_model(pipeline, artifact_path="model")` **must match** the `"model"` string used in `model_uri=f"runs:/{run_id}/model"`.
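
One way to keep the two strings from drifting apart is to define the artifact path once and build the model URI from it. This is only a sketch; `ARTIFACT_PATH` and `model_uri_for_run` are hypothetical names introduced for illustration.

```python
ARTIFACT_PATH = "model"  # the value passed as artifact_path when logging the model


def model_uri_for_run(run_id, artifact_path=ARTIFACT_PATH):
    """Build the `runs:/<run_id>/<artifact_path>` URI used for registration."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


print(model_uri_for_run("6bb43663600e4888b6374bb437ba88e9"))
```

The same constant could then be used in both the logging call and the registration call.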

```python
import mlflow
from mlflow.tracking import MlflowClient

# Connect to your tracking server
client = MlflowClient("http://127.0.0.1:5000")

# Take the most recent run from the experiment (search_runs returns the newest run first by default)
experiment_id = "1"
run_id = client.search_runs(experiment_ids=[experiment_id])[0].info.run_id

mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",  # must match the artifact_path used in log_model()
    name="churn-randomforest-classifier"
)
```

```text
Registered model 'churn-randomforest-classifier' already exists. Creating a new version of this model...
2025/07/08 10:50:18 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: churn-randomforest-classifier, version 1
Created version '1' of model 'churn-randomforest-classifier'.

<ModelVersion: aliases=[], creation_timestamp=1751971818755, current_stage='None', deployment_job_state=<ModelVersionDeploymentJobState: current_task_name='', job_id='', job_state='DEPLOYMENT_JOB_CONNECTION_STATE_UNSPECIFIED', run_id='', run_state='DEPLOYMENT_JOB_RUN_STATE_UNSPECIFIED'>, description='', last_updated_timestamp=1751971818755, metrics=None, model_id=None, name='churn-randomforest-classifier', params=None, run_id='6bb43663600e4888b6374bb437ba88e9', run_link='', source='models:/m-1afcf28ce54f41ab858e8e12b59d407f', status='READY', status_message=None, tags={}, user_id='', version='1'>
```
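
The registry output above reports `aliases=[]` and `current_stage='None'` for the new version. Recent MLflow releases mark stages as deprecated and recommend model version aliases instead, so a possible next step is to point an alias at version 1. The sketch below assumes an alias named `champion`; both the alias name and the `promote_with_alias` helper are illustrative choices, not something the project defines.

```python
def promote_with_alias(client, model_name, version, alias="champion"):
    """Point `alias` at a model version and return a URI that resolves through the alias.

    `client` is expected to be an `MlflowClient` (or any object with a
    compatible `set_registered_model_alias` method).
    """
    client.set_registered_model_alias(name=model_name, alias=alias, version=str(version))
    return f"models:/{model_name}@{alias}"


# Usage against the local tracking server (requires the server to be running):
# from mlflow.tracking import MlflowClient
# client = MlflowClient("http://127.0.0.1:5000")
# uri = promote_with_alias(client, "churn-randomforest-classifier", version=1)
# model = mlflow.pyfunc.load_model(uri)
```

Loading through the alias means that consumers do not need to change their code when the alias is later moved to a newer version.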