# MLOps Zoomcamp Homework 2

Data - NYC Taxi dataset:

Green Taxi Trip Records - Jan, Feb & Mar 2023
- https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet
- https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-02.parquet
- https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-03.parquet

In [1]:
! pwd

/workspaces/mlops-zoomcamp/02-experiment-tracking


## Q1. Install MLflow

Check the installed version.

In [2]:
! mlflow --version

mlflow, version 2.22.0


## Q2. Download and preprocess the data

Download the datasets and then execute this command:

```sh
python preprocess_data.py --raw_data_path <TAXI_DATA_FOLDER> --dest_path ./output
```

How many files were saved to the output folder (`./output`)?

In [3]:
# Download the datasets (-nc skips files that already exist locally)
! wget -nc -P taxi_data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet
! wget -nc -P taxi_data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-02.parquet
! wget -nc -P taxi_data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-03.parquet

File ‘taxi_data/green_tripdata_2023-01.parquet’ already there; not retrieving.

File ‘taxi_data/green_tripdata_2023-02.parquet’ already there; not retrieving.

File ‘taxi_data/green_tripdata_2023-03.parquet’ already there; not retrieving.
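The same skip-if-present download can be sketched in Python. This is a minimal sketch rather than part of the homework scripts: the helper names `month_url` and `download_if_missing` are hypothetical, and only the base URL, file-name pattern, and `taxi_data` folder come from the cells above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file-name pattern taken from the dataset links above.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def month_url(year: int, month: int) -> str:
    """Build the URL for one month of Green Taxi trip records."""
    return f"{BASE_URL}/green_tripdata_{year}-{month:02d}.parquet"


def download_if_missing(url: str, folder: str = "taxi_data") -> Path:
    """Download url into folder unless the file already exists (like wget -nc)."""
    target = Path(folder) / url.rsplit("/", 1)[-1]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(target))
    return target


# January to March 2023, as used in this homework.
urls = [month_url(2023, month) for month in (1, 2, 3)]
```

Calling `download_if_missing(url)` for each entry in `urls` would fetch any file that is not yet in `taxi_data`.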



In [4]:
! python homework_scripts/preprocess_data.py --raw_data_path taxi_data --dest_path ./output

In [5]:
! ls ./output

dv.pkl	test.pkl  train.pkl  val.pkl


Four files were saved to `./output`: `dv.pkl`, `train.pkl`, `val.pkl`, and `test.pkl`.
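The count can also be checked programmatically instead of reading the `ls` output. A minimal sketch, assuming the files sit directly inside `./output`; the helper name `list_output_files` is hypothetical.

```python
from pathlib import Path


def list_output_files(dest_path: str = "./output") -> list[str]:
    """Return the sorted names of regular files directly inside dest_path."""
    folder = Path(dest_path)
    if not folder.is_dir():
        # Nothing to count if the preprocessing step has not run yet.
        return []
    return sorted(p.name for p in folder.iterdir() if p.is_file())


files = list_output_files("./output")
print(len(files), files)
```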

## Q3. Train a model with autolog

Train a `RandomForestRegressor` (from scikit-learn) on the taxi dataset, based on the provided training script `train.py`.

The script will:
- load the datasets produced by the previous step,
- train the model on the training set,
- calculate the RMSE score on the validation set.

Your task is to modify the script to enable autologging with MLflow, execute the script and then launch the MLflow UI to check that the experiment run was properly tracked.

What is the value of the `min_samples_split` parameter?

In [6]:
! python homework_scripts/train.py --data_path ./output

2025/05/26 17:17:57 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


Go to the MLflow UI to check the parameters:

```sh
mlflow ui --backend-store-uri sqlite:///mlflow.db
```

`min_samples_split` = 2
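The value 2 is scikit-learn's default for `min_samples_split`. Autologging records the estimator's full parameter set, so defaults show up in the UI even when the script never sets them. The sketch below shows roughly what enabling autologging might look like. It is not the actual `train.py`: the function name `train_with_autolog` and the `max_depth=10, random_state=0` settings are assumptions for illustration, and the tracking URI is the one passed to `mlflow ui` above.

```python
def train_with_autolog(X_train, y_train, X_val, y_val,
                       tracking_uri="sqlite:///mlflow.db"):
    """Fit a random forest with MLflow autologging and return the validation RMSE."""
    # Imported here so this sketch can be loaded without MLflow installed.
    import mlflow
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    # Same backend store that the `mlflow ui` command reads from.
    mlflow.set_tracking_uri(tracking_uri)
    # mlflow.sklearn.autolog() is the scikit-learn-only alternative.
    mlflow.autolog()

    with mlflow.start_run():
        # Assumed hyperparameters; min_samples_split is left at its default.
        model = RandomForestRegressor(max_depth=10, random_state=0)
        model.fit(X_train, y_train)
        # RMSE computed explicitly so it works across scikit-learn versions.
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    return rmse
```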

## Q4. Launch the tracking server locally

To manage the whole experiment lifecycle, including access to the model registry:
- launch the tracking server on your local machine,
- select a SQLite database for the backend store and a folder called `artifacts` for the artifact store.

You should keep the tracking server running to work on the next two exercises that use the server.
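Before running the later scripts, it can help to confirm that the server is reachable. A minimal sketch, assuming the server listens on `http://127.0.0.1:5000` (the address in the run links printed later in this notebook) and exposes a `/health` endpoint, as recent MLflow versions do; the helper name `server_is_up` is hypothetical.

```python
from urllib.request import urlopen


def server_is_up(base_url: str = "http://127.0.0.1:5000", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts are OSError subclasses.
        return False
```

Calling `server_is_up()` from a notebook cell returns `False` when nothing is listening on that address, which is a quick hint that the server needs restarting.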

In addition to backend-store-uri, what else do you need to pass to properly configure the server?

- `default-artifact-root`
- `serve-artifacts`
- `artifacts-only`
- `artifacts-destination`

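The answer is `default-artifact-root`, which tells the server where to store run artifacts such as models. The command below uses a `models` folder for this instead of `artifacts`: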

```sh
mlflow server \
  --backend-store-uri sqlite:////workspaces/mlops-zoomcamp/02-experiment-tracking/mlflow.db \
  --default-artifact-root /workspaces/mlops-zoomcamp/02-experiment-tracking/models \
  --host 0.0.0.0 \
  --port 5000
```
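Note the four slashes in the backend store URI: `sqlite:///` is the prefix, and the fourth slash starts the absolute path. A relative path such as `mlflow.db` needs only three. The sketch below builds both forms and assembles the same `mlflow server` command as an argument list; the helper names `sqlite_uri` and `server_command` are hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def sqlite_uri(db_path: str) -> str:
    """Return a SQLAlchemy-style SQLite URI for a POSIX path."""
    return f"sqlite:///{PurePosixPath(db_path)}"


def server_command(db_path: str, artifact_root: str,
                   host: str = "0.0.0.0", port: int = 5000) -> list[str]:
    """Assemble the `mlflow server` invocation as an argument list."""
    return [
        "mlflow", "server",
        "--backend-store-uri", sqlite_uri(db_path),
        "--default-artifact-root", artifact_root,
        "--host", host,
        "--port", str(port),
    ]


# Paths taken from the command above.
base = "/workspaces/mlops-zoomcamp/02-experiment-tracking"
command = server_command(f"{base}/mlflow.db", f"{base}/models")
print(shlex.join(command))
```

Passing `command` to `subprocess.run` would start the server from Python, although running it in a separate terminal keeps the notebook free for the next exercises.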

## Q5. Tune model hyperparameters

In [3]:
! python homework_scripts/hpo.py

2025/05/27 11:47:21 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-hyperopt' does not exist. Creating a new experiment.
🏃 View run adorable-bear-966 at: http://127.0.0.1:5000/#/experiments/2/runs/43145edfe4f24f1cbf4d5124996b8f8b
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
🏃 View run valuable-jay-172 at: http://127.0.0.1:5000/#/experiments/2/runs/5bd04eea9a87448494be59d09ebf2194
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
🏃 View run overjoyed-fly-407 at: http://127.0.0.1:5000/#/experiments/2/runs/44f62935c36843ffb079a901b6a06854
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
🏃 View run powerful-hen-799 at: http://127.0.0.1:5000/#/experiments/2/runs/c4bc574dcd9a42bf856a013667fed40c
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
🏃 View run judicious-shrike-590 at: http://127.0.0.1:5000/#/experiments/2/runs/97e5a2f0702443b58a8fe

## Q6. Promote the best model to the model registry

In [2]:
! python homework_scripts/register_model.py

2025/05/28 07:19:43 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-best-models' does not exist. Creating a new experiment.
🏃 View run classy-trout-10 at: http://127.0.0.1:5000/#/experiments/3/runs/32f3375eedd94af2b51a7392f830e5ce
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/3
🏃 View run selective-hog-130 at: http://127.0.0.1:5000/#/experiments/3/runs/dd5751aead4248ecae0057d873a96900
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/3
🏃 View run melodic-eel-201 at: http://127.0.0.1:5000/#/experiments/3/runs/caf56de8a5924860b52f927afd99657a
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/3
🏃 View run serious-kit-655 at: http://127.0.0.1:5000/#/experiments/3/runs/d65ae6e8a8fb45f580c5efd33a49549e
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/3
🏃 View run fun-moth-775 at: http://127.0.0.1:5000/#/experiments/3/runs/088031ce98e44e0a8a5f78410e18d9b4
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/3
Best run id: 088031ce98e4