# MLOps Zoomcamp Homework 2

In [23]:
import os
from pathlib import Path

In [24]:
# Define some paths
base_dir = Path(os.getcwd())
scripts_dir = base_dir / "homework"
data_dir = base_dir / "data"
output_dir = base_dir / "output"

## Questions

### Q1

In [25]:
!mlflow --version

mlflow, version 2.3.2


### Q2

In [26]:
# Run the preprocess_data script
os.chdir(scripts_dir)
!python preprocess_data.py --raw_data_path {data_dir} --dest_path {output_dir}
os.chdir(base_dir)

In [33]:
dv_file = output_dir / "dv.pkl"
f"The DictVectorizer file has a size of {dv_file.stat().st_size/(1<<10):.0f} KB"

'The DictVectorizer file has a size of 150 KB'
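
The expression `1 << 10` in the cell above equals 1024, so the reported figure is the size in units of 1024 bytes. A minimal sketch of the same calculation as reusable helpers follows; `file_size_kib` and `describe_size` are hypothetical names used for illustration and are not part of the homework scripts.

```python
from pathlib import Path


def file_size_kib(path: Path) -> float:
    """Return the size of `path` in units of 1024 bytes."""
    return path.stat().st_size / 1024


def describe_size(path: Path) -> str:
    """Format the size rounded to a whole number, as in the cell above."""
    return f"{path.name} has a size of {file_size_kib(path):.0f} KB"
```

In the notebook this could be called as `describe_size(output_dir / "dv.pkl")`.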

### Q3

In [28]:
# Run the train script
os.chdir(scripts_dir)
!python train.py --data_path {output_dir}
os.chdir(base_dir)

2023/06/01 17:14:10 INFO mlflow.tracking.fluent: Experiment with name 'nyc-taxi-experiment-autolog' does not exist. Creating a new experiment.
2023/06/01 17:14:10 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
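
The cells that run the homework scripts call `os.chdir(scripts_dir)` before the script and `os.chdir(base_dir)` after it. If a cell is interrupted between the two calls, the notebook stays in the scripts directory. One possible way to make the switch safer is a small context manager; `working_directory` is a hypothetical helper, sketched here under the assumption that restoring the previous directory is all that is needed.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path: Path):
    """Temporarily switch the current working directory to `path`."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even when the body raises, so the previous directory
        # is always restored.
        os.chdir(previous)
```

A cell would then wrap the script call in `with working_directory(scripts_dir):`. On Python 3.11 and later, the standard library's `contextlib.chdir` offers the same behaviour.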


In [8]:
from IPython.display import Markdown as md

md(f"Then I ran the command `mlflow ui` in the '{base_dir}' directory")

Then I ran the command `mlflow ui` in the '/Users/olivier/Documents/courses/mlops-zoomcamp/02_experiment_tracking' directory

Below is a screenshot of the first parameters of the run in `mlflow ui`:

![mlflow ui parameters](images/q3.png)

The value of `max_depth` is 10.

### Q4

In [29]:
md(f"Before running the following sections, run: <br />`mlflow ui --backend-store-uri sqlite:///mlflow.db --artifacts-destination ./artifacts`<br /> from the '{base_dir}' directory")

Before running the following sections, run: <br />`mlflow ui --backend-store-uri sqlite:///mlflow.db --artifacts-destination ./artifacts`<br /> from the '/Users/olivier/Documents/courses/mlops-zoomcamp/02_experiment_tracking' directory

In [34]:
# Run the hpo script
os.chdir(scripts_dir)
!python hpo.py --data_path {output_dir}
os.chdir(base_dir)

2023/06/01 17:19:12 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-hyperopt' does not exist. Creating a new experiment.
[I 2023-06-01 17:19:12,890] A new study created in memory with name: no-name-bcdaccb4-3232-4d25-a59a-095a1177b19f
[I 2023-06-01 17:19:14,176] Trial 0 finished with value: 2.451379690825458 and parameters: {'n_estimators': 25, 'max_depth': 20, 'min_samples_split': 8, 'min_samples_leaf': 3}. Best is trial 0 with value: 2.451379690825458.
[I 2023-06-01 17:19:14,876] Trial 1 finished with value: 2.4667366020368333 and parameters: {'n_estimators': 16, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 0 with value: 2.451379690825458.
[I 2023-06-01 17:19:15,860] Trial 2 finished with value: 2.449827329704216 and parameters: {'n_estimators': 34, 'max_depth': 15, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 2 with value: 2.449827329704216.
[I 2023-06-01 17:19

Below is a screenshot of the 10 runs for the experiment `random-forest-hyperopt`:

![random-forest-hyperopt runs](images/q4.png)

The best value of the `rmse` metric is 2.45 (shown on the first line).
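
The best value can also be read from the log itself rather than from the screenshot. The sketch below parses the three complete trial lines shown in the output above and picks the lowest value. It covers only the trials visible in the truncated output, and `TRIAL_RE` and `best_trial` are hypothetical names introduced for illustration.

```python
import re

# The three complete trial lines from the output above (colour codes removed).
LOG = """\
[I 2023-06-01 17:19:14,176] Trial 0 finished with value: 2.451379690825458 and parameters: {'n_estimators': 25, 'max_depth': 20, 'min_samples_split': 8, 'min_samples_leaf': 3}. Best is trial 0 with value: 2.451379690825458.
[I 2023-06-01 17:19:14,876] Trial 1 finished with value: 2.4667366020368333 and parameters: {'n_estimators': 16, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 0 with value: 2.451379690825458.
[I 2023-06-01 17:19:15,860] Trial 2 finished with value: 2.449827329704216 and parameters: {'n_estimators': 34, 'max_depth': 15, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 2 with value: 2.449827329704216.
"""

# Captures the trial number and the objective value of each finished trial.
TRIAL_RE = re.compile(r"Trial (\d+) finished with value: ([0-9.]+) and parameters")


def best_trial(log: str) -> tuple[int, float]:
    """Return (trial number, value) for the trial with the lowest value."""
    trials = [(int(number), float(value)) for number, value in TRIAL_RE.findall(log)]
    return min(trials, key=lambda trial: trial[1])


number, rmse = best_trial(LOG)
print(f"Best trial: {number}, rmse = {rmse:.2f}")  # Best trial: 2, rmse = 2.45
```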

### Q5

In [35]:
# Run the register_model script
os.chdir(scripts_dir)
!python register_model.py --data_path {output_dir}
os.chdir(base_dir)

2023/06/01 17:19:23 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-best-models' does not exist. Creating a new experiment.
Experiment name: random-forest-best-models
Best model id: ab8972f5953e487eba63a79bd3e4c271, best model test_rmse: 2.2855
Successfully registered model 'nyc-taxi-regressor-best-rf'.
2023/06/01 17:19:40 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: nyc-taxi-regressor-best-rf, version 1
Created version '1' of model 'nyc-taxi-regressor-best-rf'.


The best model for the experiment `random-forest-best-models` has a `test_rmse` of 2.2855, as reported in the output above.

### Q6

Below is a screenshot of the best model in the registry for the experiment `random-forest-best-models`:

![random-forest-best-models best model](images/q6.png)

The registry shows the **version** of the model and its source run, but neither a source experiment nor a signature.