# Introduction to MLflow

## 🧠 What is MLflow?

**MLflow** is an open-source platform that helps you manage the full lifecycle of your machine learning models, from development and training through to deployment and tracking.

If you're a developer or data scientist working on machine learning, you often:

- Train many models
- Change and tune parameters
- Forget which settings worked best
- Want to compare experiments
- Deploy models for production
- Work with a team

💡 MLflow makes all of this clean, organized, and automated.

---

## 💡 Why Do We Need MLflow?

| Problem | What MLflow Does |
|--------|------------------|
| You forget the settings (hyperparameters) used for training | MLflow records the parameters of each run, either through explicit `log_param` calls or automatically via autologging |
| You have 5 different models and don’t know which is best | MLflow keeps versions and lets you compare metrics across runs |
| You want to deploy the model somewhere else | MLflow saves and serves models as REST APIs |
| You want to collaborate with your team | Everything is stored and viewable in a web UI |
| You have prompts for LLMs like GPT | Since MLflow 2.22, it supports prompt versioning and tracking! |

---

## 🧩 The 5 Core Components of MLflow

### 1. **MLflow Tracking**
- Logs every experiment
- Tracks:
  - Model type
  - Parameters (e.g., `n_estimators=100`)
  - Metrics (e.g., accuracy, loss)
- Visualized in the MLflow UI

### 2. **MLflow Projects**
- Helps run ML code in a consistent and reproducible way
- Uses configuration files (like `MLproject`) to standardize execution

### 3. **MLflow Models**
- Standard way to save and load models
- Works with many frameworks (scikit-learn, PyTorch, TensorFlow, XGBoost, etc.)

### 4. **MLflow Model Registry**
- Central place to register and manage models
- Tracks:
  - Model name
  - Version numbers (v1, v2, v3…)
  - Aliases (like `@dev`, `@staging`, `@prod`)
  - Description and metadata

### 5. **MLflow Prompts** (NEW in v2.22)
- Manage and version prompts for Generative AI (e.g., GPT-style LLMs)
- Useful for tracking changes and deploying prompt templates

---

#### Example Use Case:
Imagine you're training a model, changing hyperparameters, and testing accuracy.  
MLflow will:

- Log your runs (automatically with autologging, or through explicit logging calls)
- Save your models
- Let you compare results visually
- Help you register the best version and serve it

---
## 🎯 Logging Artifacts in MLflow

In MLflow, you can store **any file or folder** you want inside the **Artifacts** section of each run.

Artifacts are essentially storage folders attached to your experiment runs.  
They keep all the files that help you reproduce or analyze your experiments later.

### 🧩 What Can You Log?

You can log anything you find useful, such as:
- 📁 Source code files (`.py`, `.ipynb`, or entire folders)
- 📊 Datasets used in training/testing
- ⚙️ Configuration files (`config.yaml`, `params.json`)
- 📈 Generated results (plots, reports, metrics)
- 🧮 Model predictions (`predictions.csv`)
- 📝 Notes, logs, or documentation

### 💡 Practical Example

If you have a folder called `src` that contains your source code  
and you want to save it in MLflow under a folder named `code`,  
use this line inside your MLflow run:

```python
mlflow.log_artifacts("src", artifact_path="code")
```
🔹 `"src"` → the local folder you want to upload.  
🔹 `artifact_path="code"` → the folder name that will appear inside the MLflow UI.

After running this code, open your **MLflow Tracking UI** (for example at `http://127.0.0.1:5000`),  
select the corresponding run, and open the **Artifacts** tab.  
There you’ll see a folder named **`code/`**, containing everything that was inside your local `src` folder.

---

In [1]:
pip install mlflow

In [3]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)

print("Hyperparameters: n_estimators=50, random_state=42")
print(f"Model accuracy: {accuracy}")


Hyperparameters: n_estimators=50, random_state=42
Model accuracy: 1.0


In [17]:
import mlflow
import mlflow.sklearn
from mlflow import MlflowClient
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# 1. Load dataset (Iris)
X, y = load_iris(return_X_y=True)

# 2. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 3. Set up MLflow and the Model Registry client
# 3.1 Set MLflow tracking server (make sure it's running locally)
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# 3.2 Set the experiment name (will create one if it doesn't exist)
mlflow.set_experiment("RandomForestClassifier1")

# 3.3 Initialize MLflow client (to manage registry and aliases)
client = MlflowClient()

# 4.  Set model registry name and alias (like @dev or @staging or @prod)
model_name = "SimpleRandomForestModel"
alias_name = "dev"  # You can change it to "staging" or "prod"

# 5. Start an MLflow run
with mlflow.start_run(run_name=f"Run for @{alias_name}"):

    # Train RandomForest model
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate accuracy on the test set
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)

    # Log model hyperparameters and metrics
    mlflow.log_param("n_estimators", 50)
    mlflow.log_metric("accuracy", acc)

    
    # 6. Log and register the model
    # 6.1 Log the trained model to MLflow artifacts
    mlflow.sklearn.log_model(model, "model")

    # 6.2 Build model URI from the run ID
    model_uri = f"runs:/{mlflow.active_run().info.run_id}/model"

    # 6.3 Register the model to the MLflow Model Registry
    result = mlflow.register_model(model_uri=model_uri, name=model_name)
    version = result.version  # Automatically assigned version (e.g., 1, 2, 3...)
    
    # 7. Assign an alias (e.g., @dev) to this version
    client.set_registered_model_alias(
        name=model_name,
        alias=alias_name,
        version=version
    )

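    # Save the dataset as a CSV file and log it under "dataset/" in the run's artifacts
    df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
    df["target"] = y
    df.to_csv("iris_dataset.csv", index=False)
    mlflow.log_artifact("iris_dataset.csv", artifact_path="dataset")

    # Log this notebook under "code/" (the file must exist in the working directory)
    mlflow.log_artifact("mlflow_sample_example.ipynb", artifact_path="code")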

    # Print results
    print(f"✅ Accuracy: {acc:.2f}")
    print(f"📦 Model registered as version: {version}")
    print(f"🔁 Alias @{alias_name} now points to version: {version}")

# 8. No explicit mlflow.end_run() is needed: the `with` block above ends the run on exit


Registered model 'SimpleRandomForestModel' already exists. Creating a new version of this model...
2025/10/20 17:55:28 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: SimpleRandomForestModel, version 4
Created version '4' of model 'SimpleRandomForestModel'.


✅ Accuracy: 1.00
📦 Model registered as version: 4
🔁 Alias @dev now points to version: 4
🏃 View run Run for @dev at: http://127.0.0.1:5000/#/experiments/203598953536007816/runs/49633ceaafed4852aabd5cfbf2355064
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/203598953536007816


In [5]:
from mlflow import MlflowClient

# Define model name and the target alias
model_name = "SimpleRandomForestModel"
target_alias = "dev"

# Manually specify the version you want to assign
version_to_promote = 1  # 🔁 Change this number as needed

# Create the MLflow client
client = MlflowClient()

# Set the alias to the specified version
client.set_registered_model_alias(
    name=model_name,
    alias=target_alias,
    version=version_to_promote
)

print(f"✅ Alias @{target_alias} now points to version {version_to_promote}")


✅ Alias @dev now points to version 1


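## Launching the MLflow UI

In an Anaconda PowerShell prompt, run `mlflow ui` to start the tracking UI (served at `http://127.0.0.1:5000` by default). Start it before running the cells that call `mlflow.set_tracking_uri("http://127.0.0.1:5000")`.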

In [8]:
import mlflow.sklearn

# Define model name and version
model_name = "SimpleRandomForestModel"
version = 1  # Replace with the version you want

# Build the model URI using version
model_uri = f"models:/{model_name}/{version}"

# Load the model from MLflow Registry
loaded_model = mlflow.sklearn.load_model(model_uri)

# Use the model for prediction
# Example input (replace with your own data)
sample_input = [[5.1, 3.5, 1.4, 0.2]]
prediction = loaded_model.predict(sample_input)

print(f"✅ Prediction from version {version}: {prediction}")


✅ Prediction from version 1: [0]
