# <span style="color:#ff5f27"> 👨🏻‍🏫 Sklearn Transformation Functions Registration</span>

## <span style="color:#ff5f27">🗄️ Table of Contents</span>
- [📝 Imports](#1)
- [💽 Loading Data](#2)
- [🔮 Connecting to Hopsworks Feature Store](#3)
- [🪄 Creating Feature Groups](#4)
- [🖍 Feature View Creation](#5)
- [👩🏻‍🔬 Data Transformation](#6)
- [🧬 Modeling](#7)
- [💾 Saving Model and Transformation Functions in Model Registry](#8)
- [📮 Retrieving Model and Transformation Functions from Model Registry](#9)
- [👨🏻‍⚖️ Batch Prediction](#10)
- [👨🏻‍⚖️ Serving Feature Vector Prediction](#11)

<a name='1'></a>
## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
import pandas as pd
import numpy as np
import os
import joblib

from sklearn.preprocessing import OneHotEncoder, StandardScaler

import xgboost as xgb
from sklearn.metrics import accuracy_score

<a name='2'></a>
## <span style="color:#ff5f27;"> 💽 Loading Data </span>

In [None]:
df_original = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_eu.csv")

# Add a random binary target, for demonstration purposes only
df_original['target'] = np.random.choice([0, 1], size=len(df_original))

df_original.head(3)
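
The `target` column above is drawn at random, so its values change on every run. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for a demo. The seed value `42` and the small stand-in frame `df_demo` are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Stand-in for df_original; the notebook itself loads the data from a CSV.
df_demo = pd.DataFrame(
    {
        "city_name": ["Amsterdam", "Berlin", "Paris"],
        "pm2_5": [10.0, 12.5, 8.3],
    }
)

# A seeded generator yields the same labels on every run.
rng = np.random.default_rng(seed=42)  # the seed value is an arbitrary assumption
df_demo["target"] = rng.integers(0, 2, size=len(df_demo))  # values in {0, 1}
```

With a seeded generator, the train/test split and the reported accuracy become comparable between runs.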

<a name='3'></a>
## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

<a name='4'></a>
## <span style="color:#ff5f27;">🪄 Creating Feature Groups</span>

### <span style="color:#ff5f27;">🛠️ Offline Feature Group</span>

In [None]:
fg_offline = fs.get_or_create_feature_group(
    name='feature_group_offline',
    description='Offline Feature Group',
    version=1,
    primary_key=['city_name', 'date'],
    online_enabled=False,
)    
fg_offline.insert(df_original)

### <span style="color:#ff5f27;">🛠️ Online Feature Group</span>

In [None]:
fg_online = fs.get_or_create_feature_group(
    name='feature_group_online',
    description='Online Feature Group',
    version=1,
    primary_key=['city_name', 'date'],
    online_enabled=True,
)    
fg_online.insert(df_original)

<a name='5'></a>
## <span style="color:#ff5f27;"> 🖍 Feature View Creation</span>

### <span style="color:#ff5f27;">🛠️ Batch Feature View</span>

In [None]:
query = fg_offline.select_except(['date'])
query.read()

In [None]:
feature_view_batch = fs.get_or_create_feature_view(
    name='batch_fv',
    version=1,
    query=query,
    labels=['target']
)

### <span style="color:#ff5f27;">🛠️ Serving Feature View</span>

In [None]:
query = fg_online.select_except(['date'])

feature_view_serving = fs.get_or_create_feature_view(
    name='serving_fv',
    version=1,
    query=query,
    labels=['target']
)

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>


In [None]:
# Create a train-test split dataset
td_version, job = feature_view_batch.create_train_test_split(
    test_size=0.1,
    description='Description of a dataset',
    data_format='csv'
)

### <span style="color:#ff5f27;">🪝 Training Dataset Retrieval</span>

In [None]:
X_train, X_test, y_train, y_test = feature_view_batch.get_train_test_split(
    training_dataset_version=td_version
)

In [None]:
X_train.head(3)

In [None]:
y_train.head(3)

<a name='6'></a>
## <span style="color:#ff5f27;">👩🏻‍🔬 Data Transformation</span>

In [None]:
def transform_all(func):
    """Convert the input to a DataFrame before passing it to `func`."""
    def inner(data, one_hot_encoder, standard_scaler):
        # Batch data already arrives as a DataFrame
        if isinstance(data, pd.DataFrame):
            return func(data, one_hot_encoder, standard_scaler)

        # A list of lists holds several feature vectors;
        # a flat list is a single feature vector, so wrap it as one row
        rows = data if isinstance(data[0], list) else [data]
        data = pd.DataFrame(
            {
                'city_name': [row[0] for row in rows],
                'pm2_5': [row[1] for row in rows],
            }
        )
        return func(data, one_hot_encoder, standard_scaler)
    return inner
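
The decorator above accepts three input shapes: a DataFrame (batch data), a flat list (one serving feature vector), and a list of lists (several feature vectors). A compact sketch of the same normalization idea, assuming each vector is ordered `[city_name, pm2_5]` as in this notebook. `to_frame` is a hypothetical helper name used only for illustration:

```python
import pandas as pd


def to_frame(data):
    """Return a DataFrame with 'city_name' and 'pm2_5' columns.

    Hypothetical helper for illustration; not part of the notebook.
    """
    if isinstance(data, pd.DataFrame):
        return data
    # A flat list is one feature vector; wrap it so both list cases share a path.
    rows = data if isinstance(data[0], list) else [data]
    return pd.DataFrame(rows, columns=["city_name", "pm2_5"])


single = to_frame(["Amsterdam", 11.2])
batch = to_frame([["Amsterdam", 11.2], ["Berlin", 9.8]])
```

Both calls return a frame with the same two columns, so the downstream transformation code does not need to know where the data came from.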

In [None]:
@transform_all
def transform_data(data, one_hot_encoder, standard_scaler):
    # Transform the 'city_name' column using OneHotEncoder
    city_encoded = one_hot_encoder.transform(data[['city_name']])

    # Create a new DataFrame with the encoded values, reusing the input index
    # so that the rows stay aligned when concatenating
    encoded_df = pd.DataFrame(
        city_encoded,
        columns=one_hot_encoder.categories_[0],
        index=data.index,
    )

    # Concatenate the encoded DataFrame with the original DataFrame
    data = pd.concat([data.drop('city_name', axis=1), encoded_df], axis=1)

    # Transform the 'pm2_5' column using StandardScaler
    data['pm2_5'] = standard_scaler.transform(data[['pm2_5']])

    return data
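
One detail worth checking in a transformation like this: `pd.concat(..., axis=1)` aligns rows by index label, not by position. A small sketch of the effect on toy data, assuming an input frame whose index is not `0..n-1` (which can happen after a train/test split). All names and values here are illustrative:

```python
import pandas as pd

# A frame whose index does not start at 0.
left = pd.DataFrame({"pm2_5": [10.0, 20.0]}, index=[5, 9])

# Built without an explicit index, this frame gets the default index 0..1.
right_default = pd.DataFrame({"Amsterdam": [1.0, 0.0]})
# Built with the matching index, its rows line up one to one with `left`.
right_aligned = pd.DataFrame({"Amsterdam": [1.0, 0.0]}, index=left.index)

misaligned = pd.concat([left, right_default], axis=1)
aligned = pd.concat([left, right_aligned], axis=1)

print(misaligned.shape)  # four rows, padded with NaN
print(aligned.shape)     # two rows, no NaN
```

Passing the source frame's index when building the encoded frame keeps the two halves aligned.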

### <span style="color:#ff5f27;"> 👔 Transformer instances fit</span>

In [None]:
# Create an instance of the OneHotEncoder and StandardScaler
# (on scikit-learn < 1.2, pass sparse=False instead of sparse_output=False)
one_hot_encoder = OneHotEncoder(sparse_output=False)
standard_scaler = StandardScaler()

In [None]:
one_hot_encoder.fit(X_train[['city_name']])
standard_scaler.fit(X_train[['pm2_5']])
print('✅ Done!')
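
The transformers are fitted on the training split only, so the categories and scaling statistics they learn come from training data and are then reused unchanged on test, batch, and serving data. A minimal sketch of that pattern on toy data. The frames and values are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame(
    {
        "city_name": ["Amsterdam", "Berlin", "Amsterdam"],
        "pm2_5": [10.0, 20.0, 30.0],
    }
)
test = pd.DataFrame({"city_name": ["Berlin"], "pm2_5": [20.0]})

# Fit on the training data only.
encoder = OneHotEncoder().fit(train[["city_name"]])
scaler = StandardScaler().fit(train[["pm2_5"]])

# Apply the already fitted transformers to unseen data.
# The default encoder output is a sparse matrix; toarray() makes it dense.
test_cities = encoder.transform(test[["city_name"]]).toarray()
test_pm = scaler.transform(test[["pm2_5"]])

print(encoder.categories_[0])  # categories learned from the training data
print(scaler.mean_)            # mean learned from the training data
```

Calling `.toarray()` on the default output sidesteps the encoder's dense-output parameter, whose name differs between scikit-learn versions.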

### <span style="color:#ff5f27;">⛳️ Train Data Transformation</span>

In [None]:
X_train_transformed = transform_data(X_train, one_hot_encoder, standard_scaler)
X_train_transformed.head(3)

### <span style="color:#ff5f27;">⛳️ Test Data Transformation</span>

In [None]:
X_test_transformed = transform_data(X_test, one_hot_encoder, standard_scaler)
X_test_transformed.head(3)

<a name='7'></a>
## <span style="color:#ff5f27;">🧬 Modeling</span>

In [None]:
# Define the XGBoost classifier
xgb_classifier = xgb.XGBClassifier(tree_method='hist', enable_categorical=True)

# Fit the classifier
xgb_classifier.fit(X_train_transformed, y_train)

# Evaluate the model
y_pred = xgb_classifier.predict(X_test_transformed)
accuracy = accuracy_score(y_test, y_pred)
print("👮🏻‍♂️ Accuracy:", accuracy)

## <span style="color:#ff5f27;">🗄 Model Registry</span>

Hopsworks includes a model registry. Besides models, you can also store your fitted transformation functions there, so that the same transformations are available at inference time.
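
Keeping the fitted transformers next to the model matters because inference has to apply exactly the statistics learned at training time. A minimal local sketch of that round trip with `joblib`, using a temporary directory in place of the registry. The file name mirrors the one used later in this notebook, and no Hopsworks calls are involved:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a transformer on toy training values.
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "standard_scaler.pkl")
    joblib.dump(scaler, path)     # serialize the fitted transformer
    restored = joblib.load(path)  # load it back, as done at inference time

# The restored transformer reproduces the original output.
sample = np.array([[2.5]])
same = np.allclose(scaler.transform(sample), restored.transform(sample))
```

The restored object carries the learned mean and scale, so no refitting is needed before prediction.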

In [None]:
mr = project.get_model_registry()

### <span style="color:#ff5f27;">⚙️ Model Schema</span>


In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train_transformed.values)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

<a name='8'></a>
### <span style="color:#ff5f27;">💾 Saving Model and Transformation Functions</span>

In [None]:
model_dir = "model_tf_dir"

# Create the directory if it does not exist yet
os.makedirs(model_dir, exist_ok=True)

# Save the model
joblib.dump(xgb_classifier, os.path.join(model_dir, 'xgb_classifier.pkl'))

# Save the fitted transformers alongside the model
joblib.dump(one_hot_encoder, os.path.join(model_dir, 'one_hot_encoder.pkl'))
joblib.dump(standard_scaler, os.path.join(model_dir, 'standard_scaler.pkl'))

In [None]:
model = mr.python.create_model(
    name="xgb_model",
    metrics={"Accuracy": accuracy}, 
    description="XGB model",
    input_example=X_train_transformed.sample(),
    model_schema=model_schema
)

model.save(model_dir)

<a name='9'></a>
## <span style="color:#ff5f27;"> 📮 Retrieving Model and Transformation Functions from Model Registry </span>

In [None]:
retrieved_model = mr.get_model(
    name="xgb_model",
    version=1
)
saved_model_dir = retrieved_model.download()

In [None]:
retrieved_xgboost_model = joblib.load(saved_model_dir + "/xgb_classifier.pkl")

one_hot_encoder = joblib.load(saved_model_dir + "/one_hot_encoder.pkl")
standard_scaler = joblib.load(saved_model_dir + "/standard_scaler.pkl")

<a name='10'></a>
## <span style="color:#ff5f27;"> 👨🏻‍⚖️ Batch Prediction </span>

In [None]:
feature_view_batch.init_batch_scoring(training_dataset_version=td_version)

batch_data = feature_view_batch.get_batch_data()
batch_data.head(3)

In [None]:
batch_data_transformed = transform_data(batch_data, one_hot_encoder, standard_scaler)
batch_data_transformed.head()

In [None]:
predictions_batch = retrieved_xgboost_model.predict(batch_data_transformed)
predictions_batch[:10]

<a name='11'></a>
## <span style="color:#ff5f27;"> 👨🏻‍⚖️ Serving Feature Vector Prediction</span>

In [None]:
feature_view_serving.init_serving(1)

feature_vector = feature_view_serving.get_feature_vector(
    entry = {
        "city_name": 'Amsterdam',
        "date": '2013-01-01',
    }
)
feature_vector

In [None]:
feature_vector_transformed = transform_data(feature_vector, one_hot_encoder, standard_scaler)
feature_vector_transformed

In [None]:
prediction_feature_vector = retrieved_xgboost_model.predict(feature_vector_transformed)
prediction_feature_vector

In [None]:
feature_vectors = feature_view_serving.get_feature_vectors(
    entry = [
        {"city_name": 'Amsterdam', "date": '2013-01-01'},
        {"city_name": 'Amsterdam', "date": '2013-01-02'},
    ]
)
feature_vectors

In [None]:
feature_vectors_transformed = transform_data(feature_vectors, one_hot_encoder, standard_scaler)
feature_vectors_transformed

In [None]:
prediction_feature_vectors = retrieved_xgboost_model.predict(feature_vectors_transformed)
prediction_feature_vectors

---