# Homework - Module 02

The goal of this homework is to get familiar with MLflow, a tool for experiment tracking and model management.


### Q1. Install MLflow

To get started with MLflow we need the MLflow Python package.

For this we've already created a separate Python environment, using [conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-envs), and installed the package with `pip` or `conda`.

Let's check the version with the command `mlflow --version`:

In [1]:
# Check the mlflow version:
!mlflow --version

mlflow, version 2.22.0


### Q2. Download and preprocess the data

We'll use the Green Taxi Trip Records dataset to predict the duration of each trip. 

Let's download the data for January, February and March 2023 in parquet format from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [2]:
# Download January data
!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet
# Download February data
!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-02.parquet
# Download March data
!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-03.parquet

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1393k  100 1393k    0     0   478k      0  0:00:02  0:00:02 --:--:--  478k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1497k  100 1497k    0     0   767k      0  0:00:01  0:00:01 --:--:--  766k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1690k  100 1690k    0     0   451k      0  0:00:03  0:00:03 --:--:--  451k


In [3]:
# List data files
!ls -lh green*

-rw-r--r--@ 1 cm-mboulou-mac  staff   1.4M May 27 21:56 green_tripdata_2023-01.parquet
-rw-r--r--@ 1 cm-mboulou-mac  staff   1.5M May 27 21:56 green_tripdata_2023-02.parquet
-rw-r--r--@ 1 cm-mboulou-mac  staff   1.7M May 27 21:56 green_tripdata_2023-03.parquet


We will then use the script `preprocess_data.py` to preprocess the data.

The script will:

* load the data from the folder `<TAXI_DATA_FOLDER>` (the folder where the data was downloaded),
* fit a `DictVectorizer` on the training set (January 2023 data),
* save the preprocessed datasets and the `DictVectorizer` to disk.
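
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `preprocess_data.py`: the helper names (`dump_pickle`, `load_pickle`, `vectorize_and_save`), the output file names and what each pickle contains are assumptions made for the example.

```python
import os
import pickle


def dump_pickle(obj, filename):
    """Write obj to filename using pickle."""
    with open(filename, "wb") as f_out:
        pickle.dump(obj, f_out)


def load_pickle(filename):
    """Read back an object written by dump_pickle."""
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


def vectorize_and_save(train_dicts, val_dicts, dest_path):
    """Fit a DictVectorizer on the training records only and save the results."""
    # scikit-learn is imported here because only this function needs it.
    from sklearn.feature_extraction import DictVectorizer

    dv = DictVectorizer()
    X_train = dv.fit_transform(train_dicts)  # learns the feature names from training data
    X_val = dv.transform(val_dicts)          # reuses them; unseen features are ignored

    # Hypothetical file layout; the real script may store different objects.
    os.makedirs(dest_path, exist_ok=True)
    dump_pickle(dv, os.path.join(dest_path, "dv.pkl"))
    dump_pickle(X_train, os.path.join(dest_path, "train.pkl"))
    dump_pickle(X_val, os.path.join(dest_path, "val.pkl"))
    return dv, X_train, X_val
```

Fitting on the training set alone keeps information from the validation and test months out of the feature space, which is why the vectorizer is saved and reused rather than refitted.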

We will execute the script using the following command:

```
python preprocess_data.py --raw_data_path <TAXI_DATA_FOLDER> --dest_path ./output
```

In [4]:
# Preprocess the data
!python preprocess_data.py --raw_data_path ./ --dest_path ./output

In [5]:
# List files in the `OUTPUT_FOLDER`
!ls output

dv.pkl    test.pkl  train.pkl val.pkl


`4` files were saved to `OUTPUT_FOLDER`.

### Q3. Train a model with autolog

We will train a `RandomForestRegressor` (from Scikit-Learn) on the taxi dataset. We have prepared the training script `train.py` for this exercise. 

The script will:
* load the datasets produced by the previous step,
* train the model on the training set,
* calculate the RMSE score on the validation set.

We need to modify the script to enable **autologging** with MLflow, execute the script and then launch the MLflow UI (`mlflow ui`) to check that the experiment run was properly tracked. 

Tip 1: don't forget to wrap the training code with a `with mlflow.start_run():` statement as we showed in the videos.

Tip 2: don't modify the hyperparameters of the model to make sure that the training will finish quickly.
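
As a rough sketch of what the modified training code might look like, assuming the datasets are already loaded into arrays: the function names, the experiment name and the hyperparameter values below are placeholders, not the contents of `train.py`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def train_with_autolog(X_train, y_train, X_val, y_val):
    """Train a random forest inside an MLflow run with autologging enabled."""
    # Third-party imports are local so that rmse() works on its own.
    import mlflow
    from sklearn.ensemble import RandomForestRegressor

    mlflow.set_experiment("my-experiment-1")  # assumed name; any name works
    mlflow.sklearn.autolog()                  # must be enabled before fit()

    with mlflow.start_run():
        # Placeholder hyperparameters, not necessarily those used in train.py.
        rf = RandomForestRegressor(max_depth=10, random_state=0)
        rf.fit(X_train, y_train)  # autolog records the estimator's parameters here
        return rmse(y_val, rf.predict(X_val))
```

With autologging on, the run should list every parameter of the estimator, including defaults such as `min_samples_split`, without any explicit `log_param` call.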

In [6]:
# Run the training script
!python train.py

2025/05/27 21:56:58 INFO mlflow.tracking.fluent: Experiment with name 'my-experiment-1' does not exist. Creating a new experiment.


The value of the `min_samples_split` parameter is `2`.

### Q4. Launch the tracking server locally

Now we want to manage the entire lifecycle of our ML model. In this step, we'll launch a tracking server, which also gives us access to the model registry.

We will:

* launch the tracking server on our local machine,
* select a SQLite db for the backend store and a folder called `artifacts` for the artifact store.

We should keep the tracking server running for the next two exercises, which both use it.

To configure the server properly, we need to pass `default-artifact-root` in addition to `backend-store-uri`.

The command to run in our terminal is the following:
`mlflow server --backend-store-uri sqlite:///backend.db --default-artifact-root ./artifacts`
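
For reference, here is a small sketch of how the same command could be assembled and launched from Python, and how a script might then point its MLflow client at the server. The function names are hypothetical, and the address `http://127.0.0.1:5000` assumes the server runs with its default host and port.

```python
import subprocess


def tracking_server_command(backend_store_uri="sqlite:///backend.db",
                            artifact_root="./artifacts"):
    """Build the argument list for `mlflow server`."""
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_store_uri,
        "--default-artifact-root", artifact_root,
    ]


def launch_tracking_server():
    """Start the server as a child process; the caller must terminate it."""
    return subprocess.Popen(tracking_server_command())


def use_tracking_server(tracking_uri="http://127.0.0.1:5000"):
    """Point the MLflow client in the current process at the running server."""
    import mlflow  # imported here because only this function needs it

    mlflow.set_tracking_uri(tracking_uri)
    return mlflow.get_tracking_uri()
```

Scripts that should log to the server need the tracking URI set before they create experiments or runs; otherwise MLflow falls back to its default local store.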

### Q5. Tune model hyperparameters

Now let's try to reduce the validation error by tuning the hyperparameters of the `RandomForestRegressor` using `hyperopt`. We will use the script `hpo.py` for this exercise. 

We need to modify the script `hpo.py` so that the validation RMSE is logged to the tracking server for each run of the hyperparameter optimization. This means adding a few lines of code to the `objective` function. We then run the script without passing any parameters.

After that, we will check everything went well by opening the UI and exploring the runs from the experiment called `random-forest-hyperopt`.

Only the following should be logged:
* the list of hyperparameters that are passed to the `objective` function during the optimization,
* the RMSE obtained on the validation set (February 2023 data).
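
One way the logging inside `objective` might look is sketched below. It is an illustration under assumptions, not the actual `hpo.py`: `make_objective` and `to_int_params` are hypothetical helpers, and the metric name `rmse` is an assumed choice.

```python
def to_int_params(params):
    """Cast whole-number floats (as produced by hp.quniform) to int."""
    return {
        name: int(value) if isinstance(value, float) and value.is_integer() else value
        for name, value in params.items()
    }


def make_objective(X_train, y_train, X_val, y_val):
    """Return an objective(params) function that logs each trial to MLflow."""
    # Third-party imports are local so that to_int_params() works on its own.
    import mlflow
    from hyperopt import STATUS_OK
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    def objective(params):
        params = to_int_params(params)
        with mlflow.start_run():
            mlflow.log_params(params)        # hyperparameters of this trial
            rf = RandomForestRegressor(**params)
            rf.fit(X_train, y_train)
            y_pred = rf.predict(X_val)
            rmse = mean_squared_error(y_val, y_pred) ** 0.5
            mlflow.log_metric("rmse", rmse)  # validation RMSE
        return {"loss": rmse, "status": STATUS_OK}

    return objective
```

Opening one run per call keeps each trial separate in the UI, so the runs can be sorted by the logged metric to find the best one.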

In [7]:
# Run the hyper-parameter tuning script
!python hpo.py

2025/05/27 21:58:44 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-hyperopt' does not exist. Creating a new experiment.
🏃 View run amazing-grouse-40 at: http://127.0.0.1:5000/#/experiments/1/runs/93e810ab1c97483580488df6182d5f2c

🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1                    

🏃 View run unique-boar-752 at: http://127.0.0.1:5000/#/experiments/1/runs/3d94a1c6c2aa46b0b2e2e9b875985de2

🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1                    

🏃 View run resilient-crab-800 at: http://127.0.0.1:5000/#/experiments/1/runs/5dcc2568026746c4b6fbc8773b11d4a6

🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1                    

🏃 View run rumbling-stork-674 at: http://127.0.0.1:5000/#/experiments/1/runs/2cac75a34f9d41a0bd24fa6b08cc7c10

🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1                    

🏃 View run upbeat-cod-595 at: http://127.0.0.1:5000/#/experiments/1/runs/ae562cce02e44ae89326b7639

The best validation RMSE obtained is `5.335`.

### Q6. Promote the best model to the model registry

The results from the hyperparameter optimization are quite good. So, we can assume that we are ready to test some of these models in production. 

In this exercise, we'll promote the best model to the model registry. We have prepared a script called `register_model.py`, which will check the results from the previous step and select the top 5 runs. After that, it will calculate the RMSE of those models on the test set (March 2023 data) and save the results to a new experiment called `random-forest-best-models`.

We will update the script `register_model.py` so that it selects the model with the lowest RMSE on the test set and registers it to the model registry.

Tip 1: we can use the `search_runs` method of the `MlflowClient` to get the model with the lowest RMSE.

Tip 2: to register the model, we can use `mlflow.register_model`, passing the name of the model and the right `model_uri` as a string of the form `"runs:/<RUN_ID>/model"`.
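
The two tips can be combined roughly as in the sketch below. The metric name `test_rmse` and the model name `rf-best-model` are assumptions made for illustration, as are the function names; the real `register_model.py` may differ.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build a model URI of the form runs:/<RUN_ID>/model."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_model(experiment_name="random-forest-best-models",
                        model_name="rf-best-model"):
    """Register the run with the lowest test RMSE in the given experiment."""
    # MLflow is imported here so that model_uri_for_run() works on its own.
    import mlflow
    from mlflow.entities import ViewType
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)

    # Sort ascending by the (assumed) metric name and keep only the top run.
    best_run = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=1,
        order_by=["metrics.test_rmse ASC"],
    )[0]

    return mlflow.register_model(
        model_uri=model_uri_for_run(best_run.info.run_id),
        name=model_name,
    )
```

Sorting on the server side with `order_by` avoids fetching every run and comparing metrics by hand.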

In [8]:
# Register the model
!python register_model.py

2025/05/27 22:02:03 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-best-models' does not exist. Creating a new experiment.
🏃 View run painted-crow-963 at: http://127.0.0.1:5000/#/experiments/2/runs/12a33f17a89b47bd9a85554d38367392
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
🏃 View run sneaky-ram-822 at: http://127.0.0.1:5000/#/experiments/2/runs/1b8786f49406414baceece301e77c591
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
🏃 View run mercurial-mare-277 at: http://127.0.0.1:5000/#/experiments/2/runs/d9a4826644154fbfb96b9d7d1626f75d
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
🏃 View run handsome-shoat-541 at: http://127.0.0.1:5000/#/experiments/2/runs/8a492581667a4018a09ef2eeeb8927ea
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
🏃 View run omniscient-finch-94 at: http://127.0.0.1:5000/#/experiments/2/runs/19f9ee559a674b59aad82323a498da5d
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
Successfully r

The test RMSE of the best model is `5.567`.

---