## Experiment Tracking with NYC Taxi Trip Duration Prediction

The module 2 homework of DataTalksClub's MLOps Zoomcamp is to predict trip duration from the NYC taxi data with a regression model (a RandomForestRegressor in the steps below) while tracking the experiments with MLflow.

References: 
- [DTC ML Ops zoomcamp 2024 module 2 homework](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2024/02-experiment-tracking/homework.md)
- [NYC Taxi Dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

### Q1. Install MLflow.

Go to the appropriate directory.  
For example, `cd ~/github/mlops-zoomcamp/notebooks/homework2`  

Create and activate the environment with conda. 
```
conda create -n mlops-zoomcamp-env python=3.9
conda activate mlops-zoomcamp-env
```
Install MLflow and other libraries.  
`pip install mlflow jupyter scikit-learn pandas seaborn hyperopt xgboost fastparquet boto3`

In [2]:
import mlflow

In [3]:
mlflow.__version__

'2.14.3'

What version are you using? 2.14.3

In [None]:
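# Both calls need an argument; the values here are assumptions taken from later steps.
mlflow.set_tracking_uri("sqlite:///mlflow.db")  # assumed: the SQLite backend store from Q4
mlflow.set_experiment("random-forest")  # assumed: the experiment name queried in Q3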

### Q2. Download and preprocess the data.

Download the data for January, February and March 2023 in parquet format.

In [None]:
!wget -P ~/data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet
!wget -P ~/data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-02.parquet
!wget -P ~/data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-03.parquet
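
The same three monthly files can also be fetched from Python instead of `wget`. This is a minimal sketch using only the standard library; the helper names `build_urls` and `download_all` are hypothetical, and the base URL is the one used in the `wget` commands above.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def build_urls(year=2023, months=(1, 2, 3), color="green"):
    """Return one parquet URL per requested month."""
    return [
        f"{BASE_URL}/{color}_tripdata_{year}-{month:02d}.parquet"
        for month in months
    ]


def download_all(dest_dir, urls):
    """Download each URL into dest_dir and return the local paths."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        # Keep the remote file name, e.g. green_tripdata_2023-01.parquet.
        target = dest / url.rsplit("/", 1)[-1]
        urlretrieve(url, str(target))
        paths.append(target)
    return paths
```

Calling `download_all("~/data", build_urls())` would place the files in the same folder as the `wget` commands.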

`preprocess_data.py` will:

- load the data from the folder <TAXI_DATA_FOLDER> (the folder where you have downloaded the data)
- fit a DictVectorizer on the training set (January 2023 data)
- save the preprocessed datasets and the DictVectorizer to disk
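
The save step in the last bullet can be pictured as below. This sketch pickles placeholder objects standing in for the fitted DictVectorizer and the three preprocessed datasets. The function name `save_outputs` and the file names (`dv.pkl`, `train.pkl`, `val.pkl`, `test.pkl`) are assumptions for illustration and may differ from what `preprocess_data.py` actually writes.

```python
import pickle
import tempfile
from pathlib import Path


def save_outputs(dest_path, vectorizer, datasets):
    """Pickle the vectorizer and each dataset into dest_path."""
    dest = Path(dest_path)
    dest.mkdir(parents=True, exist_ok=True)

    # One file for the vectorizer (file name is an assumption).
    with open(dest / "dv.pkl", "wb") as f_out:
        pickle.dump(vectorizer, f_out)

    # One file per split, named after its key.
    for name, data in datasets.items():
        with open(dest / f"{name}.pkl", "wb") as f_out:
            pickle.dump(data, f_out)

    return sorted(p.name for p in dest.iterdir())


# Placeholder objects; the real script would store a fitted DictVectorizer
# and the (features, target) pair for each month.
placeholder_datasets = {
    "train": ([[0.0]], [10.0]),  # January 2023
    "val": ([[0.0]], [12.0]),    # February 2023
    "test": ([[0.0]], [11.0]),   # March 2023
}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_files = save_outputs(tmp_dir, {"placeholder": "vectorizer"}, placeholder_datasets)

print(len(saved_files), saved_files)
```

Three datasets plus one vectorizer give four files, which matches the answer to the question below.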

In [None]:
!python preprocess_data.py --raw_data_path ../../data --dest_path ./output

How many files were saved to `OUTPUT_FOLDER`? 4

### Q3. Train a model with autolog.

Train a RandomForestRegressor (from Scikit-Learn) on the taxi dataset. 

`train.py` will:

- load the datasets produced by the previous step
- train the model on the training set
- calculate the RMSE score on the validation set
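
For reference, RMSE is the square root of the mean squared difference between predictions and targets. The sketch below computes it with the standard library only; `train.py` presumably relies on scikit-learn for this, and the sample durations are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up trip durations in minutes, for illustration only.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(round(rmse(actual, predicted), 3))
```

Because the errors are squared before averaging, a few badly mispredicted trips raise the score more than many small misses.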

Modify `train.py` to enable autologging with MLflow, execute the script and then launch the MLflow UI to check that the experiment run was properly tracked.

In [None]:
!python train.py --data_path ./output

In [None]:
!mlflow ui

In [1]:
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Retrieve the experiment ID from its name.
experiment_name = "random-forest"
experiment = client.get_experiment_by_name(experiment_name)
experiment_id = experiment.experiment_id

# Retrieve information about the runs in the experiment.
runs = client.search_runs(experiment_ids=[experiment_id])
for run in runs:
    run_id = run.info.run_id
    params = client.get_run(run_id).data.params
    print(f"Hyperparameters for run {run_id}: {params}")
    min_samples_split = params.get("min_samples_split")
    print(f"min_samples_split for run {run_id}: {min_samples_split}")

Hyperparameters for run 32a54f6f4f13489fab8e3b1fbb23f67e: {'n_estimators': '100', 'warm_start': 'False', 'min_samples_split': '2', 'bootstrap': 'True', 'random_state': '0', 'max_samples': 'None', 'max_depth': '10', 'min_weight_fraction_leaf': '0.0', 'max_features': '1.0', 'verbose': '0', 'ccp_alpha': '0.0', 'monotonic_cst': 'None', 'oob_score': 'False', 'max_leaf_nodes': 'None', 'criterion': 'squared_error', 'n_jobs': 'None', 'min_impurity_decrease': '0.0', 'min_samples_leaf': '1'}
min_samples_split for run 32a54f6f4f13489fab8e3b1fbb23f67e: 2
Hyperparameters for run b07ca09683234950b192b0c9910d457a: {'n_estimators': '100', 'warm_start': 'False', 'min_samples_split': '2', 'bootstrap': 'True', 'random_state': '0', 'max_samples': 'None', 'max_depth': '10', 'min_weight_fraction_leaf': '0.0', 'max_features': '1.0', 'verbose': '0', 'ccp_alpha': '0.0', 'monotonic_cst': 'None', 'oob_score': 'False', 'max_leaf_nodes': 'None', 'criterion': 'squared_error', 'n_jobs': 'None', 'min_impurity_decreas

What is the value of the min_samples_split parameter? 2

### Q4. Launch the tracking server locally. 

- launch the tracking server on your local machine
- select a SQLite db for the backend store and a folder called artifacts for the artifacts store

```
mlflow ui --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts
```

In addition to `backend-store-uri`, what else do you need to pass to properly configure the server? `default-artifact-root`
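
With both options in hand, the launch command can also be assembled and started from Python. This is a sketch only: `build_server_command` and `launch_server` are hypothetical helpers, and the `./artifacts` path and port 5000 are assumed values. The flags themselves (`--backend-store-uri`, `--default-artifact-root`, `--port`) are standard `mlflow server` options.

```python
import shlex
import subprocess


def build_server_command(backend_store_uri="sqlite:///mlflow.db",
                         artifact_root="./artifacts",
                         port=5000):
    """Return the argument list for a local MLflow tracking server."""
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_store_uri,
        "--default-artifact-root", artifact_root,
        "--port", str(port),
    ]


def launch_server(command):
    """Start the server and block until it is interrupted."""
    subprocess.run(command, check=True)


command = build_server_command()
# Print the equivalent shell command without starting anything.
print(shlex.join(command))
```

Calling `launch_server(command)` would start the server in the foreground; scripts can then point at it with `mlflow.set_tracking_uri("http://127.0.0.1:5000")`.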

### Q5. Tune model hyperparameters

Reduce the validation error by tuning the hyperparameters of the RandomForestRegressor using hyperopt. 

Modify `hpo.py` to log: 

- the list of hyperparameters that are passed to the objective function during the optimization
- the RMSE obtained on the validation set (February 2023 data)

to the tracking server for each run of the hyperparameter optimization.

Wrap the code inside the `objective()` function with `mlflow.start_run()` and log parameters and metrics.

```
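def objective(params):
    # Each hyperopt trial is logged as its own MLflow run.
    with mlflow.start_run():
        mlflow.log_params(params)

        rf = RandomForestRegressor(**params)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        # squared=False returns the RMSE; the argument is deprecated since
        # scikit-learn 1.4 in favour of root_mean_squared_error.
        rmse = mean_squared_error(y_val, y_pred, squared=False)

        mlflow.log_metric("rmse", rmse)

        return {'loss': rmse, 'status': STATUS_OK}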
```

In [None]:
!python hpo.py 

After that, open the UI and explore the runs from the experiment called `random-forest-hyperopt` to answer the question below.
(Note: Don't use autologging for this exercise.)

In [None]:
!mlflow ui

What's the best validation RMSE that you got? 5.335

### Q6. Promote the best model to the model registry. 

Promote the best model to the model registry with `register_model.py`, which will check the results from the previous step and select the top 5 runs. After that, it will calculate the RMSE of those models on the test set (March 2023 data) and save the results to a new experiment called random-forest-best-models.

Update `register_model.py` so that it selects the model with the lowest RMSE on the test set and registers it to the model registry.
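
Stripped of the MLflow calls, the selection described above is a "take the minimum" over the logged test metric. The sketch below uses plain dictionaries with made-up run IDs and RMSE values to show the idea; `select_best_run` is a hypothetical helper, not part of `register_model.py`, and the `val_rmse` key is an assumed metric name.

```python
def select_best_run(runs, metric="test_rmse"):
    """Return the run record with the lowest value for the given metric."""
    scored = [run for run in runs if metric in run["metrics"]]
    if not scored:
        raise ValueError(f"no run has the metric {metric!r}")
    return min(scored, key=lambda run: run["metrics"][metric])


# Made-up records, for illustration only.
candidate_runs = [
    {"run_id": "run-a", "metrics": {"val_rmse": 5.40, "test_rmse": 5.62}},
    {"run_id": "run-b", "metrics": {"val_rmse": 5.36, "test_rmse": 5.59}},
    {"run_id": "run-c", "metrics": {"val_rmse": 5.34, "test_rmse": 5.60}},
]

best = select_best_run(candidate_runs)
print(best["run_id"], best["metrics"]["test_rmse"])
```

In this made-up data the run with the best validation RMSE is not the one with the best test RMSE, which is why the metric used for ordering matters.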

Tip 1: you can use the `search_runs` method of `MlflowClient` to get the model with the lowest RMSE.

```
# Select the model with the lowest test RMSE
experiment = client.get_experiment_by_name(EXPERIMENT_NAME)

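# Order by the test metric and keep only the top result. No threshold
# filter is applied, so the lowest-RMSE run is returned whatever its value.
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=1,
    order_by=["metrics.test_rmse ASC"]
)[0]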

print(f"best_run id: {best_run.info.run_id}, rmse: {best_run.data.metrics['test_rmse']:.4f}")
```

Tip 2: to register the model you can use the method `mlflow.register_model` and you will need to pass the right model_uri in the form of a string that looks like this: `"runs:/<RUN_ID>/model"`, and the name of the model.

```
# Register the best model
run_id = best_run.info.run_id
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri=model_uri, name=EXPERIMENT_NAME)
```

In [None]:
!python register_model.py 

What is the test RMSE of the best model? 5.567