# MLOps Zoomcamp 2024

# Homework 2

# Question 1

I used GitHub Codespaces to open the [repository](https://github.com/moosetim/mlops-zoomcamp-2024) in my local VS Code.

I used the following commands to set up a Conda virtual environment and install MLflow along with other required packages listed in the `requirements.txt` file: 
1. `conda create -n exp-tracking-env python=3.9`
2. `conda init`
3. `conda activate exp-tracking-env`
4. `pip install -r requirements.txt`

The output of `mlflow --version`: `mlflow, version 2.13.0`.


# Question 2

The Green Taxi Trip Records datasets for January, February, and March 2023 (`green_tripdata_2023-*.parquet`) were saved in the folder `02-experiment-tracking/data`. The preprocessing script `preprocess_data.py` was saved in the `homeworks` folder.

The following command was used to execute the script: `python preprocess_data.py --raw_data_path ../02-experiment-tracking/data/ --dest_path ./output`. 

The created `output` folder has **4 files** (`dv.pkl`, `test.pkl`, `train.pkl`, `val.pkl`).
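The file count can also be checked from Python instead of a file browser. The snippet below is a minimal sketch: `list_output_files` is a hypothetical helper name, and the commented example assumes the script is run from the `homeworks` folder so that `./output` matches the `--dest_path` used above.

```python
from pathlib import Path


def list_output_files(dest_path):
    """Return the sorted names of the regular files directly inside dest_path."""
    return sorted(p.name for p in Path(dest_path).iterdir() if p.is_file())


# Example (assumes the current directory is `homeworks`):
# files = list_output_files("./output")
# print(len(files), files)
```

Subfolders are ignored, so the count reflects only the pickled artifacts written by the script.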

# Question 3

    import mlflow

To launch the MLflow UI with a SQLite backend store, run the following command in the terminal: `mlflow ui --backend-store-uri sqlite:///mlflow.db`.

    mlflow.set_tracking_uri("sqlite:///mlflow.db")
    mlflow.set_experiment("hw-2-experiment")  # MLflow will assign all the runs to this experiment

Output:

    <Experiment: artifact_location='/workspaces/mlops-zoomcamp-2024/homeworks/mlruns/1', creation_time=1716504547872, experiment_id='1', last_update_time=1716504547872, lifecycle_stage='active', name='hw-2-experiment', tags={}>

The following lines of code were added to the `train.py` file to enable MLflow autologging:

    import mlflow

    mlflow.set_tracking_uri("sqlite:///mlflow.db")
    mlflow.set_experiment("hw-2-experiment") 

    with mlflow.start_run():
        # Enable auto-logging
        mlflow.autolog()

        # Run the training script
        run_train()

In the MLflow UI, the logged run shows that **the value of the `min_samples_split` parameter is 2**.


# Question 4

Use the following command to launch the MLflow server: `mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts`.

# Question 5

The script `hpo.py` has been updated and can be found in the `homeworks` folder. The main changes were made to the `objective()` function:

    def objective(params):
        with mlflow.start_run(nested=True):

            # Log the parameters to MLflow
            mlflow.log_params(params)

            rf = RandomForestRegressor(**params)
            rf.fit(X_train, y_train)
            y_pred = rf.predict(X_val)

            rmse = root_mean_squared_error(y_val, y_pred)
            # Log RMSE to MLflow
            mlflow.log_metric("rmse", rmse)

        return {'loss': rmse, 'status': STATUS_OK}


To check the best RMSE, launch the MLflow UI and sort the runs by RMSE in ascending order. The lowest RMSE is `5.335419588556921`.
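The same ranking can be done programmatically instead of in the UI. The sketch below is one possible approach, not the code used for this homework: `lowest_rmse` is a hypothetical helper, and it expects a `client` that behaves like `mlflow.tracking.MlflowClient`. The metric name `rmse` matches the `mlflow.log_metric("rmse", rmse)` call above; the experiment name is whatever `hpo.py` sets and is left as a placeholder.

```python
def lowest_rmse(client, experiment_name, metric="rmse"):
    """Return (run_id, value) for the run with the smallest `metric`.

    `client` is expected to behave like mlflow.tracking.MlflowClient.
    """
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        raise ValueError(f"Experiment {experiment_name!r} not found")

    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=[f"metrics.{metric} ASC"],  # smallest value first
        max_results=1,
    )
    if not runs:
        raise ValueError(f"No runs found in experiment {experiment_name!r}")

    best = runs[0]
    return best.info.run_id, best.data.metrics[metric]


# Example (assumption: the runs are stored in the SQLite database used above):
# from mlflow.tracking import MlflowClient
# client = MlflowClient(tracking_uri="sqlite:///mlflow.db")
# print(lowest_rmse(client, "<experiment name set in hpo.py>"))
```

Passing the client in as an argument keeps the helper independent of where the tracking store lives.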

# Question 6

The following changes were made in the `register_model.py` file, which is located in the `homeworks` folder:
1. Additional `print()` statements were added to facilitate debugging.
2. The test RMSE of the model from the best run was printed and then confirmed in the MLflow UI.

The lowest test RMSE was `5.567408012462019`.
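For illustration, the selection step that precedes registration could look like the sketch below. This is not the contents of `register_model.py`: `best_model_uri` is a hypothetical helper, the metric name `test_rmse` and the artifact path `model` are assumptions, and `client` is expected to behave like `mlflow.tracking.MlflowClient`.

```python
def best_model_uri(client, experiment_id, metric="test_rmse", artifact_path="model"):
    """Return (model_uri, metric_value) for the run with the lowest `metric`.

    Assumptions: runs log a metric called `test_rmse` and store the model
    under the artifact path `model`; adjust both to match the actual script.
    """
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        order_by=[f"metrics.{metric} ASC"],  # best (lowest) run first
        max_results=1,
    )
    if not runs:
        raise ValueError(f"No runs found in experiment {experiment_id!r}")

    best = runs[0]
    print(f"Best run: {best.info.run_id}, {metric}: {best.data.metrics[metric]}")
    return f"runs:/{best.info.run_id}/{artifact_path}", best.data.metrics[metric]


# Example (the registered model name is a placeholder):
# import mlflow
# uri, rmse = best_model_uri(client, experiment.experiment_id)
# mlflow.register_model(model_uri=uri, name="<registered model name>")
```

The `print()` call mirrors the debugging output described above and makes it easy to compare the value with the one shown in the MLflow UI.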