## Most Important Steps

Full playlist -> [YouTube](https://www.youtube.com/playlist?list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK)

### 1.a Environment preparation

Steps [here](https://youtu.be/IXSiYkP23zo?list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK)

(Optional if you already have a good enough local machine or in-house cluster account)

- Launch an [EC2 machine on AWS](https://eu-central-1.console.aws.amazon.com/ec2/home?region=eu-central-1#Home:) --> see steps

- Now ssh to the EC2 machine (ideally add it to the `.ssh/config` file) --> see steps

    NOTE: If you get a permissions error while accessing the EC2 machine, it means SSH refuses to use your `.pem` key because it’s not private enough. Here the `.pem` file has permissions 0664 → readable and writable by you and your group, and readable by others. That’s insecure, since anyone else on the system could copy your private key. Fix -

    ```sh
    chmod 600 /home/jigar/.ssh/mlops-zoomcamp.pem
    ```

    NOTE: after adding this EC2 machine to the ssh config file we can simply do -

    ```sh
    ssh ec2-mlops-zoomcamp
    ```

- For an IDE experience, connect to the machine over SSH from VS Code
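
The permission problem from the NOTE above can also be checked from Python. This is a minimal sketch, assuming that "private enough" means no access bits at all for group and others (which is what `chmod 600` achieves); the helper name `key_is_private` is made up for illustration, and the check runs on a throwaway temp file rather than a real key.

```python
import os
import stat
import tempfile


def key_is_private(path: str) -> bool:
    """Return True if neither group nor others have any access to the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0o077 masks the group and others permission bits
    return (mode & 0o077) == 0


# Demonstrate on a throwaway file instead of a real private key
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_key = tmp.name

os.chmod(demo_key, 0o664)              # the insecure mode from the NOTE
too_open_ok = key_is_private(demo_key)

os.chmod(demo_key, 0o600)              # same effect as `chmod 600 <key>`
after_fix_ok = key_is_private(demo_key)

os.remove(demo_key)
print(too_open_ok, after_fix_ok)
```

The first check fails for mode 0664 and the second passes for 0600, which matches the behaviour described in the NOTE.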

------------------------------------------------------------------------------------------
- A little _detour_ to make the EC2 terminal pretty, convenient and development ready

    ```sh
    sudo apt update
    # install zsh git curl
    sudo apt install -y zsh git curl 
    sh -c "$(wget -O- https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
    ```

    Type 'y' when asked if you want zsh as your default shell

    Install more packages as needed e.g.

    ```sh
    sudo apt install -y tree
    # usage
    tree -L 3
    ```

    Add useful aliases e.g.

    ```sh
    echo "alias cls='clear; ls -l --color=auto'" >> ~/.zshrc && source ~/.zshrc
    ```

------------------------------------------------------------------------------------------

- This EC2 machine comes with the very latest version of Python, but we want to use an older version that works with the other libraries we intend to use. For this reason we download Anaconda (using `wget`, link from [here](https://www.anaconda.com/download/success)) and install it to create an env with the desired Python version and libraries. --> see steps

    ```sh
    mkdir software; cd software; mkdir anaconda
    wget https://repo.anaconda.com/archive/Anaconda3-2025.06-0-Linux-x86_64.sh
    bash Anaconda3-2025.06-0-Linux-x86_64.sh
    ```

    - Specify the installation path as `/home/ubuntu/software/anaconda/anaconda3`

    - Enter `yes` when prompted with -

        ```
        Do you wish to update your shell profile to automatically initialize conda?
        .
        .
        You can undo this by running `conda init --reverse $SHELL`? [yes|no]
        [no] >>>
        ```

    - This initializes the conda env on every login (notice the `(base) ➜  ~` prompt on every new login). For more details on how this works, see the block in `~/.zshrc` starting with `# >>> conda initialize >>>`

    Create and activate a new conda env -

    ```sh
    conda create -n mlops-zoomcamp-env python==3.13.5
    conda activate mlops-zoomcamp-env
    
    # (Optional) if you have a requirements.txt file
    pip install -r requirements.txt
    ```

- We now install Docker and Docker Compose, as we will need them later for containerization --> see steps

    ```sh
    sudo apt update
    sudo apt install docker.io

    mkdir -p /home/ubuntu/software/docker
    cd /home/ubuntu/software/docker
    # -O (capital) sets the output file name; lowercase -o would write wget's log there instead
    wget https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -O docker-compose
    chmod +x docker-compose
    ```

    Add the Docker Compose directory to `$PATH`

    ```sh
    echo "\n# Path to docker" >> ~/.zshrc && source ~/.zshrc
    echo 'export PATH="${HOME}/software/docker:${PATH}"' >> ~/.zshrc && source ~/.zshrc
    ```

- To run docker we need to use `sudo` every time :(. To avoid this we need to add our user to the _docker group_

    ```sh
    sudo groupadd docker # may report that group 'docker' already exists, which is fine
    sudo usermod -aG docker $USER # add ourselves to the docker group
    ```

    Then log out of the EC2 machine, log back in, and test using `docker run hello-world:latest` --> works w/o sudo! :)

### 1.b MLOps Maturity Model (see [Microsoft Guide](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/mlops-maturity-model))

<img src="2025-09-18_12-17.png" alt="alt text" width="300"/>

<img src="2025-09-18_12-22.png" alt="alt text" width="300"/>

<img src="2025-09-18_12-24.png" alt="alt text" width="300"/>

<img src="2025-09-18_12-28.png" alt="alt text" width="300"/>

<img src="2025-09-18_12-31.png" alt="alt text" width="300"/>

### 2a. Experiment Tracking

<img src="2025-09-18_16-47.png" alt="alt text" width="350"/>

<img src="2025-09-18_16-49.png" alt="alt text" width="400"/>

<img src="2025-09-18_16-51.png" alt="alt text" width="400"/>

MLflow backend settings in the terminal

```sh
# Tell mlflow to store run metadata (params, metrics, tags) in sqlite; artifacts are saved separately on the filesystem
mlflow ui --backend-store-uri sqlite:///mlflow.db
```

MLFlow import in python

```py
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")  # set backend to the same as above 
mlflow.set_experiment("nyc-taxi-experiment")    # name of ml experiment

```

---
Tracking a simple experiment

```py
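from sklearn.linear_model import Lasso
from sklearn.metrics import root_mean_squared_error  # needs scikit-learn >= 1.4

# X_train, y_train, X_val, y_val are assumed to be prepared earlier
with mlflow.start_run():

    mlflow.set_tag("developer", "jigar")    # e.g.

    alpha = 0.1
    mlflow.log_param("alpha", alpha)        # a good param to log
    lr = Lasso(alpha=alpha)
    lr.fit(X_train, y_train)

    y_pred = lr.predict(X_val)
    rmse = root_mean_squared_error(y_val, y_pred)
    mlflow.log_metric("rmse", rmse)         # a good metric to log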
```

---
Tracking an experiment during hyper-parameter optimization: the objective function looks somewhat like this -

```py
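import xgboost as xgb
from hyperopt import STATUS_OK
from sklearn.metrics import root_mean_squared_error

# `train` and `valid` are assumed to be xgb.DMatrix objects built from the
# training and validation sets; y_val holds the validation targets
def objective(params):
    with mlflow.start_run():
        mlflow.set_tag("developer", "jigar")    # good to know who wrote this code
        mlflow.set_tag("model", "xgboost")      # model name
        mlflow.log_params(params)
        booster = xgb.train(
            params=params,
            dtrain=train,
            num_boost_round=1000,
            evals=[(valid, 'validation')],
            early_stopping_rounds=50
        )
        y_pred = booster.predict(valid)
        rmse = root_mean_squared_error(y_val, y_pred)
        mlflow.log_metric("rmse", rmse)         # a good metric to log

    # hyperopt minimizes 'loss', so we return the validation RMSE
    return {'loss': rmse, 'status': STATUS_OK}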
```


Typical search space may look like -

```py
from hyperopt import Trials, fmin, hp, tpe
from hyperopt.pyll import scope

search_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 4, 100, 1)),
    'learning_rate': hp.loguniform('learning_rate', -3, 0),
    'reg_alpha': hp.loguniform('reg_alpha', -5, -1),
    'reg_lambda': hp.loguniform('reg_lambda', -6, -1),
    'min_child_weight': hp.loguniform('min_child_weight', -1, 3),
    'objective': 'reg:squarederror',  # current name for the deprecated 'reg:linear'
    'seed': 42
}

best_result = fmin(
    fn=objective, space=search_space, algo=tpe.suggest,
    max_evals=50, trials=Trials()
)
```

The `hp.*`, `scope`, `fmin`, `tpe` and `Trials` names above all come from the `hyperopt` Python library; check its docs to relate them to the code above.
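
One detail of the search space that is easy to misread: `hp.loguniform(label, low, high)` draws values as `exp(uniform(low, high))`, so the bounds are given in log space. The sketch below only uses the standard library (not hyperopt itself) to convert the bounds from the search space above into the actual value ranges being searched.

```python
import math

# (low, high) exactly as passed to hp.loguniform in the search space above
log_bounds = {
    "learning_rate": (-3, 0),
    "reg_alpha": (-5, -1),
    "reg_lambda": (-6, -1),
    "min_child_weight": (-1, 3),
}


def actual_range(low: float, high: float) -> tuple[float, float]:
    """Convert log-space bounds into the range of values actually sampled."""
    return math.exp(low), math.exp(high)


ranges = {name: actual_range(low, high) for name, (low, high) in log_bounds.items()}

for name, (lower, upper) in ranges.items():
    print(f"{name}: {lower:.4f} to {upper:.4f}")
```

For instance, the learning rate is searched between roughly 0.05 and 1, not between -3 and 0.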


---
Auto logging is a powerful feature that allows you to log metrics, parameters, and models without the need for explicit log statements. All you need to do is to call `mlflow.autolog()` (see [here](https://mlflow.org/docs/latest/ml/tracking/autolog/)) before your training code. 

```py
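# Enable autologging for all supported libraries at once ...
mlflow.autolog()
# ... and switch it off again when it is no longer wanted
mlflow.autolog(disable=True)

# Or enable/disable it per library
mlflow.xgboost.autolog()
mlflow.xgboost.autolog(disable=True)

mlflow.pytorch.autolog()  # also works with Lightning
mlflow.pytorch.autolog(disable=True)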
```

### 2b. Model Management

<img src="2025-09-19_13-42.png" alt="alt text" width="550"/>

We already saw experiment tracking above. Now let's see how we can use MLflow for model saving and versioning.

```py
mlflow.sklearn.log_model(preprocessor, name="preprocessor")  # log a preprocessor
mlflow.sklearn.log_model(clf, name="rf_model")               # log a model
```

So now we can simply download this model (with its preprocessor) and run (or even deploy).

```py
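# Replace the <...> placeholders with real run IDs; the last path segment
# must match the `name` used when the model was logged
preprocessor_uri = "runs:/<preprocessor_run_id>/preprocessor"
model_uri = "runs:/<model_run_id>/rf_model"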

loaded_preprocessor = mlflow.sklearn.load_model(preprocessor_uri)
loaded_model = mlflow.sklearn.load_model(model_uri)

X_test_scaled = loaded_preprocessor.transform(X_test)
preds = loaded_model.predict(X_test_scaled)
```

NOTE - even a complete pipeline (preprocessor + model + postprocessor) can be logged and loaded!

<img src="2025-09-19_15-20.png" alt="alt text" width="550"/>
<img src="2025-09-19_15-16.png" alt="alt text" width="550"/>

### 2c. Model Registry

<img src="2025-09-19_17-06.png" alt="alt text" width="550"/>

The Model Registry contains all the models which are ready for production (furnishing these models is the Data Scientist's job).

* It doesn't involve deployment. It only lists the production-ready models, with stages acting as "labels" assigned to the models!

    For example: 
    - models ready for production (v3, v4) --> staging stage (`@challenger`)
    - models in production (v2) --> production stage (`@champion`)
    - models to be archived for some reason (v1) --> archive stage (`@retired`)
    - if needed, v1 can be rolled back into deployment (back to the production stage)

* The deployment engineer can look here to find out which hyper-parameters were used, the size of the model, its performance, etc. Based on that, they can decide to move the model between the different stages in the model registry

* The Model Registry needs to be complemented with some CI/CD code in order to perform the actual _deployment_ of those models
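
The stage/alias bookkeeping described in the list above can be sketched with a tiny in-memory stand-in. This is a toy model, not the MLflow API: the class `ToyRegistry` and its methods are hypothetical, and it assumes each alias points at exactly one version at a time.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry: maps alias -> version number."""

    def __init__(self):
        self.aliases = {}

    def set_alias(self, alias, version):
        # An alias points at exactly one version; reassigning simply moves it
        self.aliases[alias] = version

    def promote(self, version):
        """Make `version` the champion and mark the previous champion retired."""
        previous = self.aliases.get("champion")
        self.set_alias("champion", version)
        if previous is not None and previous != version:
            self.set_alias("retired", previous)
        # The promoted version is no longer a challenger
        if self.aliases.get("challenger") == version:
            del self.aliases["challenger"]
        return previous


registry = ToyRegistry()
registry.set_alias("champion", 2)     # v2 is in production
registry.set_alias("challenger", 3)   # v3 is ready for production

old_champion = registry.promote(3)    # v3 replaces v2 in production
registry.promote(old_champion)        # rollback: v2 becomes champion again
print(registry.aliases)
```

With MLflow itself, the corresponding operation is `MlflowClient.set_registered_model_alias(name, alias, version)`; the deployment still has to be triggered separately, e.g. by CI/CD code that reacts to the alias change.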

After extensive model experimentation, to choose which model to push into the model registry we need to focus on a few important details of the best-performing models, e.g. run time (training & inference), model size, performance metric, etc. Watch this section [here](https://youtu.be/TKHU7HAvGH8?list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK&t=310)

<img src="2025-09-19_18-37.png" alt="alt text" width="600"/>

We can also inspect our ML experiments from Python.

```py
from mlflow.tracking import MlflowClient
from mlflow.entities import ViewType

MLFLOW_TRACKING_URI = "sqlite:///mlflow.db"
client = MlflowClient(tracking_uri=MLFLOW_TRACKING_URI)

# Gather the top 5 runs (lowest RMSE first) in the experiment with id 1
runs = client.search_runs(
    experiment_ids=["1"],
    filter_string="metrics.rmse < 5.63",
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=5,
    order_by=["metrics.rmse ASC"]
)

# Print id and metric
for run in runs:
    print(f"run id: {run.info.run_id}, rmse: {run.data.metrics['rmse']:.4f}")

```