# Lab 2: AutoML with PyCaret and MLflow

Welcome to Lab 2! Now that we have a clean, processed dataset from our ETL pipeline, it's time to train a machine learning model. Instead of manually trying out different models, we'll use an AutoML (Automated Machine Learning) tool, **PyCaret**, to automatically find the best model for our problem. We'll also use **MLflow** to track every experiment.

## Learning Objectives

By the end of this lab, you will be able to:
- Understand the benefits of AutoML.
- Use PyCaret to set up an ML experiment and compare multiple models with a single command.
- Leverage PyCaret's seamless integration with MLflow to automatically log experiments.
- Identify the best model from the experiments and register it in the MLflow Model Registry for deployment.

### 1. Setup: Installing Dependencies

PyCaret has a rich set of features, and installing it with `[full]` ensures all optional dependencies are included. We also need `mlflow` for tracking.

In [None]:
%pip install pycaret[full] mlflow
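
After the install finishes, it can help to confirm which versions ended up in the environment, since PyCaret and MLflow APIs vary between releases. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the two packages this lab depends on.
for name, ver in installed_versions(["pycaret", "mlflow"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If either package shows as not installed, restart the kernel and rerun the install cell before continuing.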

### 2. Loading the Processed Data

Let's start by loading the processed data we created in Lab 1.

In [None]:
import pandas as pd

PROCESSED_DATA_PATH = '../../data/churn_processed.parquet'
df = pd.read_parquet(PROCESSED_DATA_PATH)

df.head()
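
Before handing the data to PyCaret, a quick look at the target column can catch problems early (a missing column, null labels, or heavy class imbalance). The sketch below is one possible check, not part of the lab's pipeline; the helper `check_target` is hypothetical and the `toy` frame is made-up stand-in data so the example runs on its own.

```python
import pandas as pd


def check_target(frame: pd.DataFrame, target: str) -> pd.Series:
    """Validate the target column and return the share of each class."""
    if target not in frame.columns:
        raise KeyError(f"Target column {target!r} not found in {list(frame.columns)}")
    if frame[target].isna().any():
        raise ValueError(f"Target column {target!r} contains missing values")
    return frame[target].value_counts(normalize=True)


# Made-up stand-in data; in the lab you would call check_target(df, "Exited").
toy = pd.DataFrame({"CustomerID": [1, 2, 3, 4], "Exited": [0, 0, 0, 1]})
print(check_target(toy, "Exited"))
```

A strongly skewed class share is worth keeping in mind later, because accuracy alone can look good on imbalanced churn data.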

### 3. Understanding AutoML and PyCaret

**AutoML** automates model selection, hyperparameter tuning, and feature engineering. It helps data scientists build high-performing models quickly, without extensive manual effort.

**PyCaret** is a low-code AutoML library in Python that makes this process incredibly simple. It acts as a wrapper around many ML libraries (like scikit-learn, XGBoost, LightGBM) and provides a consistent and easy-to-use API.

### 4. Setting up the PyCaret Experiment

The first step is to initialize the experiment with the `setup()` function. It splits the data into training and hold-out sets and builds the preprocessing pipeline (for example, imputing missing values and one-hot encoding categorical features; scaling is available through `normalize=True` but is off by default). We will also tell PyCaret to log everything to MLflow by setting `log_experiment=True`.

In [None]:
from pycaret.classification import *
import mlflow

# Set the MLflow tracking URI. PyCaret will use this.
# This assumes you run this notebook from its directory.
mlflow.set_tracking_uri('../../mlruns')

# Initialize the PyCaret environment
exp = setup(
    data=df,
    target='Exited',  # Our target variable
    session_id=123,  # for reproducibility
    log_experiment=True, # Enable MLflow logging
    experiment_name='churn_prediction_automl', # Name of the MLflow experiment
    ignore_features=['CustomerID'] # Ignore irrelevant features
)
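
Because the tracking URI above is a relative path, where the runs end up depends on the notebook's working directory. The following minimal sketch prints the directory that `../../mlruns` points to, which should be the `mlruns` folder in the project root. It assumes MLflow resolves a relative local path against the current working directory, and the helper name `resolve_tracking_dir` is hypothetical.

```python
from pathlib import Path


def resolve_tracking_dir(relative_uri: str, base: Path) -> Path:
    """Resolve a relative tracking path against a base directory."""
    return (base / relative_uri).resolve()


# Where '../../mlruns' lands when the notebook runs from the current directory.
print(resolve_tracking_dir("../../mlruns", Path.cwd()))
```

If the printed path is not inside the project root, the `mlflow ui` command started from the root will not see these runs.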

### 5. Training and Comparing All Models

This is where the magic happens! With a single line of code, PyCaret will train and evaluate over a dozen different classification models using cross-validation and display the results in a leaderboard, sorted by accuracy by default. The call returns the top-ranked model.

In [None]:
best_model = compare_models()
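
To make the idea concrete, here is a rough hand-written analogue of that comparison using plain scikit-learn. It is only a sketch of the general technique (cross-validate several candidates, then rank them), not PyCaret's actual implementation. The data is synthetic, the three candidates are an arbitrary choice, and PyCaret's own defaults (such as the number of folds and the model list) differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs on its own (not the churn dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)

# An arbitrary, small set of candidate models.
candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(random_state=123),
    "rf": RandomForestClassifier(n_estimators=50, random_state=123),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank candidates from best to worst, as a leaderboard would.
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, acc in leaderboard:
    print(f"{name}: {acc:.3f}")

best_name = leaderboard[0][0]
```

PyCaret wraps this loop, adds more metrics and models, and logs each candidate to MLflow, which is what keeps the lab's version to one line.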

### 6. Analyzing Results in the MLflow UI

Because we set `log_experiment=True`, PyCaret has logged every single model it trained as a separate run in MLflow. Let's explore this.

1. **Open a new terminal or command prompt.**
2. **Navigate to the root directory of this project** (`advanced-mlops-tutorial`).
3. **Run the command:** `mlflow ui`
4. **Open your browser** and go to `http://localhost:5000`.

In the MLflow UI, you will find the `churn_prediction_automl` experiment. Click on it to see all the runs. You can sort them by metrics like 'Accuracy' or 'AUC' and click on any run to see the parameters, metrics, and artifacts that PyCaret logged. Plots such as the confusion matrix and feature importance appear as artifacts only if you pass `log_plots=True` to `setup()`.

### 7. Registering the Best Model in MLflow

The final step is to take our best performing model and register it in the **MLflow Model Registry**. The registry is a centralized place to manage the lifecycle of your models (e.g., moving them from Staging to Production).

We will first finalize the model (retrain it on the full dataset) and then find its corresponding run in MLflow to register it.

In [None]:
# Finalize the model (retrains on the full dataset)
final_model = finalize_model(best_model)
print(final_model)

In [None]:
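# Register the model
model_name = "churn-classifier"

# 'mlflow_run_id' is not a documented get_config() variable, so look the run up in MLflow.
# Assumption: finalize_model() logged the most recent run in this experiment.
experiment = mlflow.get_experiment_by_name('churn_prediction_automl')
latest_run = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["attributes.start_time DESC"],
    max_results=1
)
# Assumption: PyCaret stores the fitted pipeline under the 'model' artifact path.
model_uri = f"runs:/{latest_run.iloc[0]['run_id']}/model"

registered_model = mlflow.register_model(
    model_uri=model_uri,
    name=model_name
)

print(f"Model '{model_name}' registered with version {registered_model.version}")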

Now, if you go back to the MLflow UI and click on the **Models** tab, you will see your newly registered `churn-classifier` model. This is the model we will serve in the next lab.

### 8. Conclusion

In this lab, you saw the power of AutoML with PyCaret. You were able to train, evaluate, and select the best model from a wide variety of candidates with just a few lines of code. You also saw how MLflow can be seamlessly integrated to keep track of all your experiments.

In Lab 3, we will take the model we just registered and deploy it as a production-ready API using FastAPI.