# Data Preparation and Model Training

In this notebook, we load the [Iris plants dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris), extract features, and train a Random Forest classifier.


In [None]:
import matplotlib.pyplot as plt
import mlflow
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


## Data Preparation

We use the [Iris plants dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris) to predict the class of Iris flowers.
It contains 150 samples (50 for each class) with the following 4 numeric, predictive attributes:

- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm

The class names are Iris-Setosa (`0`), Iris-Versicolour (`1`), and Iris-Virginica (`2`).
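
As a quick check of the description above, the following sketch loads the dataset as a `Bunch` object (instead of using `return_X_y=True`) and prints the feature names, the class names, and the number of samples per class. The variable names `iris` and `class_counts` are illustrative only.

```python
import numpy as np
from sklearn import datasets

# Load the full Bunch object to get access to the metadata
iris = datasets.load_iris()

print(iris.feature_names)   # names of the 4 predictive attributes
print(iris.target_names)    # names of the 3 classes

# Count how many samples belong to each class label (0, 1, 2)
class_counts = np.bincount(iris.target)
print(class_counts)
```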


In [None]:
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape


In [None]:
X.min(axis=0), X.max(axis=0)


Let's plot the data to gain some insights.


In [None]:
fig = plt.figure(figsize=(12, 6))

# Sepal Length and Width
ax = fig.add_subplot(121)
ax.set_title('Iris Classification Based on Sepal Length and Width')
ax.set_xlabel('Sepal Length (cm)')
ax.set_ylabel('Sepal Width (cm)')
ax.grid(True, linestyle='-', color='0.5')
ax.scatter(X[:, 0], X[:, 1], s=32, c=y, marker='o')

# Petal Length and Width
ax = fig.add_subplot(122)
ax.set_title('Iris Classification Based on Petal Length and Width')
ax.set_xlabel('Petal Length (cm)')
ax.set_ylabel('Petal Width (cm)')
ax.grid(True, linestyle='-', color='0.5')
ax.scatter(X[:, 2], X[:, 3], s=32, c=y, marker='s')


**Task:** Split the data into train and test sets. Use $60\%$ of the samples for training.
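
One possible way to approach this task is sketched below with `train_test_split`. The fixed `random_state` and the stratified split are assumptions added for reproducibility and balanced classes; the task itself only asks for a 60/40 split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Assumptions: random_state=42 and stratify=y are not required by the task
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
```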


In [None]:
X_train, X_test, y_train, y_test = None  # todo
X_train.shape, X_test.shape


## ML Training

### Model Training and Logging

Let's use a random forest classifier to predict the class of Iris flowers.
To do so, we want to find a reasonable number of estimators (decision trees) for the forest.
We keep track of several tries and their accuracy using MLflow.

Please note that, for demonstration purposes, we ignore best practices such as cross-validation, feature selection, and randomised parameter search.

**Task:** Set up the pipeline factory with a random forest classifier using `max_depth=3` and the specified number of estimators. You may add further pipeline steps.
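
A minimal sketch of such a factory is shown below, assuming a `StandardScaler` as an optional preprocessing step. The step names `"scaler"` and `"classifier"` and the fixed `random_state` are illustrative choices, not requirements of the task.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def create_pipeline(n_estimators: int) -> Pipeline:
    """Build a scaler + random forest pipeline with a fixed tree depth."""
    return Pipeline(
        steps=[
            # Optional step (assumption): standardise the features
            ("scaler", StandardScaler()),
            # Random forest with max_depth=3 and a configurable forest size
            ("classifier", RandomForestClassifier(
                n_estimators=n_estimators, max_depth=3, random_state=42)),
        ])


pipeline = create_pipeline(n_estimators=10)
print(pipeline.named_steps["classifier"].get_params()["n_estimators"])
```

Tree-based models do not depend on feature scaling, so the scaler mainly illustrates how several steps are chained.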


In [None]:
def create_pipeline(n_estimators: int) -> Pipeline:
    return Pipeline(
        steps=[ # todo
               ])


**Task:** Choose reasonable hyperparameters to try, and execute the training process. Log the accuracy and the corresponding parameters. You may add further metrics.


In [None]:
n_estimators_to_try = []  # todo: choose hyperparameters to try
REGISTERED_MODEL_NAME = "iris_forest_classifier"

for n_estimators in n_estimators_to_try:
    with mlflow.start_run():
        # build a pipeline with a random forest classifier
        model_pipeline = create_pipeline(n_estimators=n_estimators)
        model_pipeline.fit(X_train, y_train)

        # calculate metrics using the test data
        y_pred = model_pipeline.predict(X=X_test)
        acc = accuracy_score(y_true=y_test, y_pred=y_pred)

        # todo: log parameters and metrics

        mlflow.sklearn.log_model(
            sk_model=model_pipeline, artifact_path="iris", registered_model_name=REGISTERED_MODEL_NAME)

        print(
            f"Model saved in run {mlflow.active_run().info.run_id}. Acc={acc}")


### Assessing the Runs in the MLflow Web UI

**Task:** Inspect the training runs with their parameters and metrics in MLflow's web UI.
Store the best model in the model registry, and stage it for production (**either in the web UI or using the Python interface**).

Execute the next cell and open the displayed URL in your web browser (the server listens on port 5002).
Interrupt the cell or shut down the notebook to stop the server.


In [None]:
!mlflow ui -p 5002


In [None]:

# only necessary if you did not stage a model for production using the web interface
client = mlflow.MlflowClient()
client.transition_model_version_stage(
    name=REGISTERED_MODEL_NAME,
    version=None,  # todo: choose the model version, if not already done in the web interface
    stage="Production"
)


## Model Deployment

Let's use our production model to generate a Docker image for the model endpoint.

Please note that the first run of this cell might take a few minutes.
In the meantime, you can start with the next tasks.


In [None]:
mlflow.models.build_docker(
    model_uri=f"models:/{REGISTERED_MODEL_NAME}/Production", name="iris_model_api", env_manager="local")
