# Import Libraries

In [1]:
import mlflow

# Answer Questions

### Q1. Install MLflow

In [2]:
!mlflow --version

mlflow, version 2.13.0


### Q2. Download and preprocess the data

We will use the `preprocess_data.py` script to preprocess the data.

The script will:

- load the data from the folder `../data/`.
- fit a DictVectorizer on the training set (January 2023 data - `green_tripdata_2023-01.parquet`).
- save the preprocessed datasets and the DictVectorizer to disk.
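The fit-then-save pattern described above can be sketched as follows. This is a minimal illustration, not the contents of `preprocess_data.py`: the feature names (`PU_DO`, `trip_distance`), the toy records, the target values, and the output file names are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction import DictVectorizer

# Hypothetical records standing in for rows of the taxi data.
train_dicts = [
    {"PU_DO": "74_75", "trip_distance": 1.2},
    {"PU_DO": "41_42", "trip_distance": 3.4},
]
val_dicts = [{"PU_DO": "74_75", "trip_distance": 0.8}]

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)  # fit the vocabulary on the training set only
X_val = dv.transform(val_dicts)          # reuse the fitted vocabulary for other splits


def dump_pickle(obj, filename):
    """Serialize an object to disk with pickle."""
    with open(filename, "wb") as f_out:
        pickle.dump(obj, f_out)


# Assumption: a temporary folder plays the role of --dest_path.
dest_path = Path(tempfile.mkdtemp())
dump_pickle(dv, dest_path / "dv.pkl")
dump_pickle((X_train, [10.0, 20.0]), dest_path / "train.pkl")  # made-up targets
dump_pickle((X_val, [8.0]), dest_path / "val.pkl")
```

Fitting only on the training split keeps the feature columns identical across all splits, so a model trained on `X_train` can score `X_val` directly.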

In [4]:
!python preprocess_data.py data_path --dest_path ./output

### Q3. Train a model with autolog

We use the training script `train.py` for this exercise.

The script will:

- load the datasets produced by the previous step.
- connect to the experiment `random-forest-train` in MLflow (the name reported in the log output below).
- train the model on the training set.
- extract the model parameters and publish to mlflow.
- calculate the RMSE score on the validation set and publish to mlflow.

In [6]:
!python train.py --data_path ./output

2024/05/25 23:24:48 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-train' does not exist. Creating a new experiment.


What is the value of the min_samples_split parameter?

`min_samples_split`: 2

### Q4. Launch the tracking server locally

We will set up a local MLflow tracking server that uses a SQLite database (`mlflow.db`) as its backend store and the `./mlruns` folder as the default artifact root. Run the following command in a terminal:

```bash
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns
```

![mlflow-run](images/model-params.png)

### Q5. Tune model hyperparameters

We will use the `hpo.py` script to tune the hyperparameters of the random forest regressor. For each trial, the script needs to log:

- the model `params`.
- the `RMSE` on the validation set.

In [7]:
!python hpo.py --data_path ./output --num_trials 15

2024/05/25 23:31:32 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-hyperopt' does not exist. Creating a new experiment.

100%|██████████| 15/15 [00:42<00:00,  2.80s/trial, best loss: 5.335419588556921]


![mlflow-run](images/mlflow-run-hpo.png)

### Q6. Promote the best model to the model registry

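We will promote the best model to the model registry by running `register_model.py` with `--top_n 5`. According to the log output, the script creates the experiment `random-forest-best_models` and registers version 1 of the model `random-forest-best_model`.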

In [33]:
!python register_model.py --data_path ./output --top_n 5

2024/05/26 00:05:47 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-best_models' does not exist. Creating a new experiment.
Successfully registered model 'random-forest-best_model'.
2024/05/26 00:06:29 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: random-forest-best_model, version 1
Created version '1' of model 'random-forest-best_model'.


![mlflow-run](images/registered-model.png)