# MLflow and scikit-learn Task

In this task, we replace the ridge regression from the demo with a random forest model.


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import mlflow
import mlflow.sklearn


## Data Loading: California Housing Dataset

We load the [California housing dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) for regression.
It consists of $20{,}640$ samples, each describing one California district (block group) by the following $8$ numeric attributes. The target is the district's median house value, expressed in units of $100{,}000$ US dollars.

- `MedInc`: median income in block group
- `HouseAge`: median house age in block group
- `AveRooms`: average number of rooms per household
- `AveBedrms`: average number of bedrooms per household
- `Population`: block group population
- `AveOccup`: average number of household members
- `Latitude`: block group latitude
- `Longitude`: block group longitude


In [None]:
X, y = fetch_california_housing(return_X_y=True, download_if_missing=True)
X.shape, y.shape


In [None]:
# split data into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
X_train.shape


## ML Training

### Model Training and Logging

Let's use random forest regression to predict house prices.
To do so, we want to find a reasonable maximum depth for the individual decision trees in the forest.
We track several runs, each with its maximum depth and resulting mean squared error, using MLflow.

Please note that, for demonstration purposes, we ignore best practices such as cross-validation, feature selection and randomised parameter search.
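
To make the two ingredients above more concrete, here is a minimal sketch of what `max_depth` limits and how the mean squared error is computed. It deliberately uses a single `DecisionTreeRegressor` on synthetic data from `make_regression` rather than the housing data or the random forest from the task; the dataset, the depths tried and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration, not the housing data).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

results = {}
for depth in (2, 5, None):  # None places no limit on the depth
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    y_hat = tree.predict(X_te)

    # mean squared error by hand: the average of the squared residuals
    mse_manual = float(np.mean((y_te - y_hat) ** 2))
    # scikit-learn's helper computes the same quantity for a single target
    mse_sklearn = float(mean_squared_error(y_true=y_te, y_pred=y_hat))

    # get_depth() reports how deep the fitted tree actually is
    results[depth] = (tree.get_depth(), mse_manual, mse_sklearn)
    print(f"max_depth={depth}: fitted depth={tree.get_depth()}, "
          f"test MSE={mse_manual:.1f}")
```

The fitted depth never exceeds `max_depth`, so the parameter acts as a cap on model complexity. Which cap gives the lowest test error depends on the data, which is why several values are tried and compared.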

**Task:** Set up the pipeline factory with a random forest regressor that uses $100$ estimators and the specified `max_depth`.


In [None]:
def create_pipeline(max_depth: int) -> Pipeline:
    return Pipeline(
        steps=[('scaler', StandardScaler()),
               ('model', RandomForestRegressor(  # todo
               ))])


**Task:** Choose reasonable hyperparameters to try, and execute the training process.


In [None]:
max_depths_to_try = []  # todo

for max_depth in max_depths_to_try:
    with mlflow.start_run():
        # build a pipeline with a random forest regression model
        model_pipeline = create_pipeline(max_depth=max_depth)
        model_pipeline.fit(X_train, y_train)

        # calculate the mean squared error on the test data
        y_pred = model_pipeline.predict(X=X_test)
        mse = mean_squared_error(
            y_true=y_test, y_pred=y_pred, multioutput='uniform_average')

        # log parameters, metrics and the model
        mlflow.log_param(key="max_depth", value=max_depth)
        mlflow.log_metric(key="mean_squared_error", value=mse)
        mlflow.sklearn.log_model(
            sk_model=model_pipeline, artifact_path="house_model_forest")

        print(
            f"Model saved in run {mlflow.active_run().info.run_id}. MSE={mse}")


### Assessing the Runs in the MLflow Web-UI

**Task:** Inspect the training runs, together with their parameters and metrics, in MLflow's web UI.
Execute the cell below and open the displayed URL (by default `http://127.0.0.1:5000`) in your web browser.
Interrupt the cell or shut down the notebook to stop the server.


In [None]:
!mlflow ui
