# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"> **Hopsworks Feature Store Quickstart** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Iris Classification</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/iris/iris_tutorial.ipynb)

**Note**: you may get an error when installing hopsworks on Colab, and it is safe to ignore it.

This quickstart tutorial introduces the Hopsworks Feature Store. You will work with data for iris flower classification.


## 🗒️ This notebook is divided into the following sections:
1. Import libraries and connect to Hopsworks Feature Store
2. Load the Iris flower dataset
3. Create a feature group and upload to the feature store
4. Create a feature view from the feature group
5. Create a training dataset
6. Train a model using XGBoost
7. Save the trained model to Hopsworks
8. Launch a serving instance
9. Model deployment in Hopsworks
10. Send a prediction request to the served model
11. Try out your Model Interactively with a Gradio UI 

![tutorial-flow](../images/03_model.png)

## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
!pip install -U hopsworks --quiet
!pip install -U xgboost --quiet

In [None]:
import joblib
import json
import time
import os

import pandas as pd
from sklearn.preprocessing import LabelEncoder

import xgboost as xgb
from xgboost import plot_importance
from sklearn.metrics import f1_score, confusion_matrix

from matplotlib import pyplot
import seaborn as sns

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

---

## <span style="color:#ff5f27;"> 💽 Loading the Data </span>

In [None]:
# Read the Iris dataset
iris_df = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/iris.csv")

# Display the first 3 rows
iris_df.head(3)

In [None]:
# Display concise summary information
# This includes the data types, non-null counts, and memory usage of each column
iris_df.info()

---

## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

Let's save a feature group (Hive table) called `iris` that contains the iris features and the corresponding `variety` label.

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

In [None]:
# Get or create the 'iris' feature group
iris_fg = fs.get_or_create_feature_group(
    name="iris",
    version=1,
    primary_key=[
        "sepal_length", "sepal_width",
        "petal_length", "petal_width",
    ],
    description="Iris flower dataset",
)
# Insert data into the feature group
iris_fg.insert(iris_df)

---

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

Feature views are used to read features for training and inference.
If the feature view already exists, get it. If not, create the feature view.

In [None]:
# Select features for training data
selected_features = iris_fg.select_all()

# Uncomment this if you would like to view your selected features
# selected_features.show(5)

In [None]:
# Get or create the 'iris' feature view
feature_view = fs.get_or_create_feature_view(
    name="iris",
    version=1,
    description="Read from Iris flower dataset",
    query=selected_features,
)

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent feature view, with an optional on-disk snapshot of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model
* Validation set - the subset of training data used to evaluate hyperparameters when training a model
* Test set - the holdout subset of training data used to evaluate a model

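A training dataset is created with the `feature_view.train_test_split()` method, which returns the features and labels of the train and test sets:

* `X_train` is the train set features
* `X_test` is the test set features
* `y_train` is the train set labels
* `y_test` is the test set labels

The feature view above was created without declaring a label column, so the code below discards the two label return values and separates the `variety` column from the features by hand.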
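
Conceptually, the split shuffles the rows and holds out a fraction of them, after which the label column is separated from the features. The following dependency-free sketch illustrates that idea on a few made-up rows. The helper names `split_rows` and `pop_label`, the seed, and the sample values are assumptions for illustration, and the Hopsworks implementation may differ.

```python
import random


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle rows and hold out a `test_size` fraction as the test set."""
    shuffled = list(rows)  # copy so the caller's list is not mutated
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def pop_label(rows, label):
    """Separate the label column from the feature columns."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [row[label] for row in rows]
    return features, labels


# Made-up rows that mimic some columns of the iris feature group
rows = [
    {
        "sepal_length": 5.0 + i / 10,
        "petal_length": 1.0 + i / 10,
        "variety": "Setosa" if i < 5 else "Virginica",
    }
    for i in range(10)
]

train_rows, test_rows = split_rows(rows, test_size=0.2)
X_train_demo, y_train_demo = pop_label(train_rows, "variety")
X_test_demo, y_test_demo = pop_label(test_rows, "variety")

print(len(X_train_demo), len(X_test_demo))  # 8 2
```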

In [None]:
# Perform a train-test split
X_train, X_test, _, _ = feature_view.train_test_split(
    description='iris tutorial',
    test_size=0.2,
)

In [None]:
# Extract the target variable 'variety' from the training dataset (X_train)
y_train = X_train.pop('variety').to_frame()

# Extract the target variable 'variety' from the testing dataset (X_test)
y_test = X_test.pop('variety').to_frame()

In [None]:
# Display the first 3 rows of the X_train
X_train.head(3)

In [None]:
# Display the first 3 elements of the y_train
y_train[:3]

---

## <span style="color:#ff5f27;"> 🧬 Modeling</span>

Train the XGBoost Classifier model.

In [None]:
# Create an instance of LabelEncoder from scikit-learn
le = LabelEncoder()

# Fit and transform the training target variable 'variety' to numeric labels
y_train_encoded = le.fit_transform(y_train['variety'])

# Transform the testing target variable 'variety' to numeric labels using the previously fitted LabelEncoder
y_test_encoded = le.transform(y_test['variety'])
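
`LabelEncoder` maps each distinct class name to an integer index in sorted order and can map the integers back to names. The sketch below mimics that behaviour in plain Python. `SimpleLabelEncoder` is a hypothetical stand-in written for illustration, not the scikit-learn class, and the variety names are assumed values.

```python
class SimpleLabelEncoder:
    """Illustrative stand-in for sklearn's LabelEncoder (not the real class)."""

    def fit(self, labels):
        # Classes are stored sorted, so the encoding is deterministic
        self.classes_ = sorted(set(labels))
        self._index = {name: i for i, name in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        # Raises KeyError for a label that was not seen during fit
        return [self._index[name] for name in labels]

    def inverse_transform(self, codes):
        return [self.classes_[code] for code in codes]


encoder = SimpleLabelEncoder().fit(["Setosa", "Virginica", "Versicolor", "Setosa"])
codes = encoder.transform(["Virginica", "Setosa", "Versicolor"])
print(codes)                             # [2, 0, 1]
print(encoder.inverse_transform(codes))  # ['Virginica', 'Setosa', 'Versicolor']
```

Fitting on the training labels only and reusing the fitted encoder for the test labels, as the cell above does, keeps the two encodings consistent.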

In [None]:
# Create an instance of the XGBoost classifier
classifier = xgb.XGBClassifier()

# Train the classifier using the training features (X_train) and encoded target variable (y_train_encoded)
classifier.fit(X_train, y_train_encoded)

### <span style='color:#ff5f27'> 📐 Model Validation </span>

In [None]:
# Plot the importance of features using the XGBoost classifier
plot_importance(
    classifier, 
    max_num_features=10, 
    importance_type='weight',
)

In [None]:
# Use the trained XGBoost classifier to make predictions on the testing features
y_pred = classifier.predict(X_test)

# Calculate the F1 score using the true testing labels (y_test) and predicted labels (y_pred)
# The 'macro' average calculates the F1 score for each class and then takes the unweighted mean
f1 = f1_score(
    y_test_encoded, 
    y_pred, 
    average="macro",
)

# Create a dictionary containing the calculated F1 score
metrics = {
    "f1_score": f1,
}

# Print the dictionary containing the F1 score
print(metrics)
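
To make the macro average concrete, here is a minimal sketch that computes it by hand on made-up labels. The function name `macro_f1` and the sample labels are assumptions for illustration; in the notebook itself the value comes from scikit-learn's `f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero here
        # because every class in `classes` occurs in y_true or y_pred
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]

# Per-class F1: 1.0, 2/3 and 0.8, so the macro average is about 0.8222
print(round(macro_f1(y_true_demo, y_pred_demo), 4))  # 0.8222
```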

In [None]:
# Display the true labels of the first 5 examples in the testing set
y_test_encoded[:5]

In [None]:
# Display the predicted labels for the first 5 examples in the testing set
y_pred[:5]

In [None]:
# Calculate and print the confusion matrix using the true labels (y_test) and predicted labels (y_pred)
results = confusion_matrix(
    y_test_encoded, 
    y_pred,
)
print(results)
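
The layout of the matrix is easy to misread, so the following sketch builds one by hand from made-up labels. The helper `confusion_counts` is a hypothetical name used only for illustration; it follows the same convention as scikit-learn's `confusion_matrix`, where rows are true classes and columns are predicted classes.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Return a matrix where entry [i][j] counts true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


cm_demo = confusion_counts([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2], n_classes=3)
for row in cm_demo:
    print(row)
# [2, 0, 0]
# [0, 1, 1]
# [0, 0, 2]
```

Correct predictions sit on the diagonal. The single off-diagonal count in the second row is one example of true class 1 predicted as class 2.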

In [None]:
# Create a DataFrame from the confusion matrix with labels for rows and columns
df_cm = pd.DataFrame(
    results, 
    ['True Setosa', 'True Versicolor', 'True Virginica'],
    ['Pred Setosa', 'Pred Versicolor', 'Pred Virginica'],
)

# Create a heatmap using seaborn with annotations
cm = sns.heatmap(df_cm, annot=True)

# Get the figure from the heatmap and display it
fig = cm.get_figure()
fig.show()

---

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/3.0/user_guides/mlops/registry/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Create input and output schemas based on training features (X_train) and target variable (y_train)
input_schema = Schema(X_train)
output_schema = Schema(y_train_encoded)

# Create a model schema using the input and output schemas
model_schema = ModelSchema(
    input_schema=input_schema, 
    output_schema=output_schema,
)

# Convert the model schema to a dictionary
model_schema.to_dict()

## <span style="color:#ff5f27;">📝 Register model</span>

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

Save the following objects as pickle files locally to a directory that will be uploaded later to the model registry:

 * the model object, `classifier`, saved as `xgboost_iris_model.pkl`
 * the label encoder object, `le` saved as `iris_encoder.pkl`, so that we can reconstruct categorical names 
    from the encoded predictions (numbers) 
    
The model input schema is the same set of features as in the `X_train` DataFrame.

The model output schema is the same label as in the `y_train_encoded` array.

Finally, lazily create the model that will be registered, including all files (artifacts) in the directory (the pickled label encoder object, the pickled model object, and the confusion matrix image), the model's input/output schema, and a sample input row (`input_example`). The model registry is the `mr` object, and because our XGBoost model is served as a Python model, we create a model of type Python with `mr.python.create_model()`. For TensorFlow, there is `mr.tensorflow.create_model()`.

In [None]:
# Create the 'iris_model' directory if it does not already exist
model_dir = "iris_model"
os.makedirs(model_dir, exist_ok=True)

# Save the trained XGBoost classifier to a file in the 'iris_model' directory
joblib.dump(classifier, model_dir + '/xgboost_iris_model.pkl')

# Save the LabelEncoder to a file in the 'iris_model' directory
joblib.dump(le, model_dir + '/iris_encoder.pkl')

# Save the confusion matrix plot as an image file in the 'iris_model' directory
fig.savefig(model_dir + "/confusion_matrix.png")

In [None]:
# Get the model registry
mr = project.get_model_registry()

# Create a Python model in the model registry
iris_model = mr.python.create_model(
    name="xgboost_iris_model", 
    metrics=metrics,
    model_schema=model_schema,
    input_example=X_train.sample(), 
    description="Iris Flower Predictor",
)

# Upload the contents of the 'iris_model' directory and register the model
iris_model.save(model_dir)

---

## <a class="anchor" id="1.5_bullet" style="color:#ff5f27"> 🚀 Model Deployment</a>

Models can be served via KFServing or "default" serving, which means a Docker container exposing a Flask server. For KFServing models, or models written in TensorFlow, you do not need to write a prediction file (see the section below). However, for scikit-learn and XGBoost models using default serving, you do need to write one.

In order to use KFServing, you must have Kubernetes installed and enabled on your cluster.

### <span style="color:#ff5f27;">📎 Predictor script for Python models</span>


Scikit-learn and XGBoost models are deployed as Python models, in which case you need to provide a **Predict** class that implements the **predict** method. The **predict()** method invokes the model on the inputs and returns the prediction as a list.

The **`__init__()`** method runs when the predictor is loaded into memory. It loads the model from the local directory it is materialized to, which is given by the `ARTIFACT_FILES_PATH` environment variable.

The `%%writefile` directive writes the contents of its cell to the given Python file. We will use the `predict_example.py` file to create a deployment for our model.

In [None]:
%%writefile predict_example.py

import os
import joblib

class Predict(object):
    
    def __init__(self):
        # NOTE: env var ARTIFACT_FILES_PATH has the local path to the model artifact files      
        self.model = joblib.load(os.environ["ARTIFACT_FILES_PATH"] + "/xgboost_iris_model.pkl")

    def predict(self, inputs):
        """ Serves a prediction request from a trained model"""
        return self.model.predict(inputs).tolist()

In [None]:
# Get the dataset API for the project
dataset_api = project.get_dataset_api()

# Upload the file "predict_example.py" to the "Models" dataset
# If a file with the same name already exists, overwrite it
uploaded_file_path = dataset_api.upload("predict_example.py", "Models", overwrite=True)

# Construct the full path to the uploaded predictor script
predictor_script_path = os.path.join("/Projects", project.name, uploaded_file_path)

In [None]:
# Deploy the model using the predictor script located at 'predictor_script_path'
deployment = iris_model.deploy(
    name="irisdeployed",
    script_file=predictor_script_path,
)

In [None]:
# Print the name of the deployed model
print("Deployment: " + deployment.name)

# Retrieve and print detailed information about the deployment
deployment.describe()

In [None]:
print("Deployment is warming up...")
time.sleep(45)

The deployment has now been registered. However, to start it you need to run:

In [None]:
# Start the deployment and wait for it to be in a running state for up to 180 seconds
deployment.start(await_running=180)

In [None]:
# Retrieve and print detailed information about the current state of the deployment
deployment.get_state().describe()

In [None]:
# To troubleshoot you can use `get_logs()` method
deployment.get_logs(component='predictor')

### <span style='color:#ff5f27'>🔮 Predicting using deployment</span>

In [None]:
# Use the deployed model to make predictions on the provided input example
predict = deployment.predict(
    inputs=iris_model.input_example,
)
# or deployment.predict({ "instances": [iris_model.input_example] })

print(le.inverse_transform([predict["predictions"][0]]))
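
Judging from the commented alternative above, the request body wraps the input rows in an `instances` list, and the response carries a `predictions` list. The sketch below builds such a payload and decodes a response without contacting a server. The response shown is made up, the helper names are hypothetical, and `CLASS_NAMES` assumes the alphabetical ordering a label encoder would produce for three variety names.

```python
import json

CLASS_NAMES = ["Setosa", "Versicolor", "Virginica"]  # assumed encoder ordering


def build_payload(rows):
    """Wrap feature rows in the {"instances": [...]} request shape."""
    return {"instances": [[float(v) for v in row] for row in rows]}


def decode_response(response):
    """Map each predicted class index back to its name."""
    return [CLASS_NAMES[i] for i in response["predictions"]]


payload = build_payload([[5.1, 3.5, 1.4, 0.2]])
print(json.dumps(payload))  # {"instances": [[5.1, 3.5, 1.4, 0.2]]}

fake_response = {"predictions": [0]}  # made-up response for illustration
print(decode_response(fake_response))  # ['Setosa']
```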

---

## <span style="color:#ff5f27;"> 👾 Try out your Model Interactively </span> 


We will build a user interface with Gradio to allow you to enter the 4 feature values (sepal length/width and petal length/width), producing a prediction of the type of iris flower.

First, we have to install the gradio library.

In [None]:
!pip install gradio --quiet
!pip install typing-extensions

### Run Gradio

Start the Gradio UI. Users enter the 4 feature values and a prediction is returned. We use the label encoder object to transform the number returned to the categorical value (stringified name of the Iris Flower).

In [None]:
import gradio as gr

def iris(sl, sw, pl, pw):
    list_inputs = [sl, sw, pl, pw]
    res = deployment.predict(inputs=[list_inputs])
    return le.inverse_transform([res["predictions"][0]])[0]

demo = gr.Interface(
    fn=iris,
    title="Iris Flower Predictive Analytics",
    description="Experiment with sepal/petal lengths/widths to predict which flower it is.",
    allow_flagging="never",
    inputs=[
        gr.Number(label="sepal length (cm)"),
        gr.Number(label="sepal width (cm)"),
        gr.Number(label="petal length (cm)"),
        gr.Number(label="petal width (cm)"),
    ],
    outputs="text"
)

demo.launch(share=True)

---