# <span style="font-weight:bold; font-size: 3rem; color:#333;">Training Pipeline</span>

## 🗒️ This notebook is divided into the following sections:

1. Select features for the model and create a Feature View with the selected features
2. Create training data using the feature view
3. Train model
4. Evaluate model performance
5. Save model to model registry

### <span style='color:#ff5f27'> 📝 Imports</span>

In [1]:
#pip install xgboost
#pip install scikit-learn

In [2]:
import os
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from xgboost import XGBClassifier, XGBRegressor, plot_importance

# Project helpers (check_nan_float, convert_training_df, ...)
from util.helper import *
import util.helper as helpfun

## <span style="color:#ff5f27;"> Prepare training features</span>

In [3]:
# Get current working directory
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

Current directory: C:\Users\marth\OneDrive - KTH\[Y1] Period 2\ID2223 Scalable Machine Learning and Deep Learning\Projects\Final Pj\ID2223-Final-Project


In [4]:
# Construct path
data_path = os.path.join(current_dir, 'util', 'data', 'feature_eng_stats.csv')
training_df = pd.read_csv(data_path)
#print(training_df)

In [5]:
#check_dataframe_types(training_df)

In [6]:
#Check float Value vs NaN
column_check = ['champion']

for col in column_check:
    check_nan_float(training_df, col)

Are all float values NaN? True
Number of float values: 43
Number of NaN values: 43
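
The output above is consistent with how pandas reads CSV files: empty cells in a text column come back as the float `NaN`, so every float found in an otherwise string-valued column is a missing value. The helper `check_nan_float` lives in `util.helper` and its implementation is not shown here, so the function below is only a hypothetical pure-Python stand-in that runs the same kind of check on a plain list with made-up values.

```python
import math


def summarize_float_nan(values):
    """Count float entries and NaN entries in a list of mixed values.

    Hypothetical stand-in for util.helper.check_nan_float (not the real helper).
    """
    floats = [v for v in values if isinstance(v, float)]
    nans = [v for v in floats if math.isnan(v)]
    return {
        "float_count": len(floats),
        "nan_count": len(nans),
        "all_floats_are_nan": len(floats) == len(nans),
    }


# Made-up sample: two placeholder champion names and two missing entries.
sample_column = ["champ_a", float("nan"), "champ_b", float("nan")]
summary = summarize_float_nan(sample_column)
print(summary)
```

When every float in the column is `NaN`, dropping the rows with a missing `champion` removes all non-string values from that column.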


In [7]:
# Remove rows where champion is NaN
training_df = training_df.dropna(subset=['champion'])

# Verify no more NaN in champion column
print(f"\nRemaining NaN in champion column: {training_df['champion'].isna().sum()}")


Remaining NaN in champion column: 0


In [12]:
# Clean the table and convert string columns to numeric values
training_df = convert_training_df(training_df)
print(training_df)

Note: Some columns were not found in DataFrame: {'oppmates1', 'teammates1', 'teammates4', 'player_id', 'oppmates3', 'region_profile', 'oppmates4', 'oppmates2', 'teammates3', 'teammates2', 'oppmates5'}
      champion  region      date  level  team  result  match_length_mins  \
0         None     NaN  1.736014   14.0   NaN       0              28.97   
1         None     NaN  1.736012    7.0   NaN       0              16.03   
2         None     NaN  1.736010   14.0   NaN       1              21.35   
3         None     NaN  1.736009   18.0   NaN       1              35.12   
4         None     NaN  1.735932   15.0   NaN       0              27.55   
...        ...     ...       ...    ...   ...     ...                ...   
10326     None     NaN  1.734516   16.0   NaN       1              30.30   
10327     None     NaN  1.734514    9.0   NaN       0              15.18   
10328     None     NaN  1.734437   15.0   NaN       1              23.15   
10329     None     NaN  1.734338   15.0

In [11]:
# Get list of object columns and their details
object_cols = training_df.select_dtypes(include=['object']).columns

print("Object columns details:")
print("-" * 50)
for col in object_cols:
    unique_values = training_df[col].nunique()
    sample_values = training_df[col].dropna().sample(min(3, unique_values)).tolist()
    
    print(f"\nColumn: {col}")
    print(f"Unique values count: {unique_values}")
    print(f"Sample values: {sample_values}")
    print(f"Type of values: {training_df[col].apply(type).unique()}")

Object columns details:
--------------------------------------------------


--- 

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

In [None]:
selected_features_new = air_quality_fg_new.select([
    # PM2.5 target and its engineered features
    'pm25',
    'pm25_rolling_mean_3d', 'pm25_rolling_std_3d', 'pm25_rolling_max_3d',
    'pm25_rolling_mean_7d', 'pm25_rolling_std_7d', 'pm25_rolling_max_7d',
    'pm25_lag_1d', 'pm25_lag_2d', 'pm25_lag_3d',
    'pm25_diff_1d', 'pm25_pct_change',
    'pm25_ema_3d', 'pm25_ema_7d',
    'pm25_percentile',
    'pm25_skew_7d',
    'pm25_trend_3d', 'pm25_trend_7d', 'pm25_trend_14d',
    'pm25_volatility_3d','pm25_volatility_7d', 'pm25_volatility_14d',
    'pm25_zscore',
    'day_of_week', 'is_weekend', 'month', 'month_cos', 'month_sin',  
    'city'  # Keep city for joining
]).join(
    weather_fg_new.select([
        # Weather features
        'date', 'temperature_2m_mean', 'precipitation_sum', 'wind_speed_10m_max', 'wind_direction_10m_dominant',
        'temperature_2m_mean_rolling_mean_3d', 'precipitation_sum_rolling_mean_3d', 'wind_speed_10m_max_rolling_mean_3d',
        'temp_wind_interaction', 'temp_precip_interaction', 'temp_wind_precip',
        'wind_direction_sin', 'wind_direction_cos', 'wind_efficiency',
        'high_temp_low_wind', 
        'city'
    ]),
    on=['city']
)

selected_features_new.show(10)

### Feature Views

`Feature Views` are selections of features from different **Feature Groups** that make up the input and output API (or schema) for a model. A **Feature View** can create **Training Data** and can also be used at inference time to retrieve inference data.

A Feature View defines its schema as a query (optionally with filters), together with the model's target feature/label and optional transformation functions (declarative feature encoding).

To create a Feature View, use the `FeatureStore.get_or_create_feature_view()` method.

You can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable.

- `transformation_functions` - declarative feature encoding (not used here).

- `query` - selected features/labels for the model.

In [None]:
feature_view_new = fs.get_or_create_feature_view(
    name='air_quality_fv_improved',
    description="weather features with air quality as the target",
    version=1,
    labels=['pm25'],
    query=selected_features_new,
)

feature_view_new

## <span style="color:#ff5f27;">🪝 Split the training data into train/test data sets </span>

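We use a time-series split here with two cutoff dates: `start_date_validation_data` marks where the validation data starts and `start_date_test_data` marks where the test data starts. Everything before the validation cutoff is used for training.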

In [None]:
start_date_validation_data = "2023-09-19" #60%
start_date_test_data = "2024-04-16" #20%
# Convert string to datetime object
valid_start = datetime.strptime(start_date_validation_data, "%Y-%m-%d")
test_start = datetime.strptime(start_date_test_data, "%Y-%m-%d")

In [None]:
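# Train on data before valid_start and validate on data from valid_start up to
# test_start, so the validation set does not overlap the held-out test period.
X_train, X_validation, y_train, y_validation = feature_view_new.train_test_split(
    test_start=valid_start,
    test_end=test_start,
)

# Second split: everything before test_start versus the held-out test set.
X_train_and_valid, X_test, y_train_and_valid, y_test = feature_view_new.train_test_split(
    test_start=test_start
)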
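
The idea behind the two cutoff dates can be sketched without Hopsworks. The example below assumes the intent is three non-overlapping sets: rows before the validation cutoff, rows between the two cutoffs, and rows from the test cutoff onward. The function `split_by_date` and the sample rows are hypothetical; only the two cutoff dates come from this notebook.

```python
from datetime import datetime


def split_by_date(rows, valid_start, test_start):
    """Split rows into train / validation / test by their 'date' field.

    train:      date <  valid_start
    validation: valid_start <= date < test_start
    test:       date >= test_start
    """
    train, validation, test = [], [], []
    for row in rows:
        if row["date"] < valid_start:
            train.append(row)
        elif row["date"] < test_start:
            validation.append(row)
        else:
            test.append(row)
    return train, validation, test


# Cutoffs taken from the notebook; the rows are made-up examples.
valid_start = datetime.strptime("2023-09-19", "%Y-%m-%d")
test_start = datetime.strptime("2024-04-16", "%Y-%m-%d")
rows = [
    {"date": datetime(2023, 1, 10), "pm25": 12.0},
    {"date": datetime(2023, 9, 19), "pm25": 8.5},
    {"date": datetime(2024, 2, 1), "pm25": 15.2},
    {"date": datetime(2024, 4, 16), "pm25": 9.9},
]
train, validation, test = split_by_date(rows, valid_start, test_start)
print(len(train), len(validation), len(test))
```

Splitting by date rather than at random keeps future observations out of the training set, which matters here because many features are lags and rolling statistics of the target.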

In [None]:
X_train

In [None]:
X_validation

In [None]:
X_test

In [None]:
# Drop the index columns - 'date' (event_time) and 'city' (primary key)

train_features = X_train.drop(['date', 'city'], axis=1)
validation_features = X_validation.drop(['date', 'city'], axis=1)
test_features = X_test.drop(['date', 'city'], axis=1)

validation_features

In [None]:
y_train

In [None]:
y_test

The `Feature View` is now saved in Hopsworks and you can retrieve it using `FeatureStore.get_feature_view(name='...', version=1)`.

---

## <span style="color:#ff5f27;">🧬 Modeling</span>

We will train a regression model to predict `pm25` from the engineered PM2.5 and weather features selected in the feature view above.

In [None]:
xgb_regressor = XGBRegressor(
    max_depth=3,                # Further reduced to limit model complexity
    learning_rate=0.005,        # Even smaller learning rate for better generalization
    n_estimators=3000,          # Increased to compensate for smaller learning rate
    subsample=0.6,              # Further reduced to increase randomization
    colsample_bytree=0.6,       # Further reduced to increase randomization
    min_child_weight=10,        # Increased to require more observations per leaf
    reg_alpha=0.5,              # Increased L1 regularization
    reg_lambda=2.0,             # Increased L2 regularization
    gamma=0.4,                  # Increased minimum loss reduction
    random_state=42,
    eval_metric='rmse',         # Metric tracked on the evaluation sets
    early_stopping_rounds=50,   # More aggressive early stopping
)

# eval_metric and early_stopping_rounds are set on the estimator (supported since
# XGBoost 1.6); passing them to fit() was removed in XGBoost 2.0.
# Early stopping monitors the last entry of eval_set, here the validation set.
xgb_regressor.fit(
    train_features,
    y_train,
    eval_set=[(train_features, y_train), (validation_features, y_validation)],
    verbose=100
)
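
Early stopping with 50 rounds means training halts once the validation RMSE has gone 50 consecutive rounds without a new best value. The sketch below illustrates that rule in plain Python on made-up scores with a patience of 3. It is a simplified illustration, not XGBoost's internal implementation, and `early_stopping_round` is a hypothetical helper.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stopped_round) for a lower-is-better metric.

    Training stops once `patience` consecutive rounds pass without a new
    best score. If that never happens, stopped_round is the last round.
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_scores) - 1


# Made-up validation RMSE per boosting round.
val_rmse = [10.0, 9.0, 8.5, 8.6, 8.7, 8.8, 8.4]
best, stopped = early_stopping_round(val_rmse, patience=3)
print(best, stopped)
```

In this example the final score of 8.4 is never reached, because the rule triggers three rounds after the best score at round 2. A larger patience trades longer training for a lower chance of stopping before a late improvement.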

In [None]:
# Predicting target values on the test set
y_pred = xgb_regressor.predict(test_features)

# Calculating Mean Squared Error (MSE) using sklearn
mse = mean_squared_error(y_test.iloc[:,0], y_pred)
print("MSE:", mse)

# Calculating R squared using sklearn
r2 = r2_score(y_test.iloc[:,0], y_pred)
print("R squared:", r2)
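
For reference, the two metrics above have simple definitions: MSE is the mean of the squared residuals, and R squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. The sketch below recomputes both in plain Python on made-up numbers. The helper names are hypothetical, and `sklearn` remains the tool used in this notebook.

```python
import math


def mean_squared_error_manual(y_true, y_pred):
    """Mean of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted values.
y_true_demo = [10.0, 20.0, 30.0, 40.0]
y_pred_demo = [12.0, 18.0, 33.0, 37.0]

mse_demo = mean_squared_error_manual(y_true_demo, y_pred_demo)
rmse_demo = math.sqrt(mse_demo)
r2_demo = r2_score_manual(y_true_demo, y_pred_demo)
print(mse_demo, rmse_demo, r2_demo)
```

RMSE is in the same units as `pm25`, which makes it easier to interpret than MSE. An R squared of 0 corresponds to always predicting the mean of the actual values.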

In [None]:
def evaluate_model(X, y, set_name):
    predictions = xgb_regressor.predict(X)
    mse = mean_squared_error(y, predictions)
    rmse = np.sqrt(mse)
    r2 = r2_score(y, predictions)
    print(f"\n{set_name} Metrics:")
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"R2: {r2:.4f}")
    return predictions

# Evaluate on all sets
train_pred = evaluate_model(train_features, y_train, "Training")
val_pred = evaluate_model(validation_features, y_validation, "Validation")
test_pred = evaluate_model(test_features, y_test, "Test")

# Create the time series plot
plt.figure(figsize=(15, 5))

# Plot training data and predictions
plt.plot(y_train.index, y_train.values, label='Actual (Train)', alpha=0.5)
plt.plot(y_train.index, train_pred, label='Predicted (Train)', alpha=0.5)

# Plot validation data and predictions
plt.plot(y_validation.index, y_validation.values, label='Actual (Val)', alpha=0.5)
plt.plot(y_validation.index, val_pred, label='Predicted (Val)', alpha=0.5)

# Plot test data and predictions
plt.plot(y_test.index, y_test.values, label='Actual (Test)', alpha=0.5)
plt.plot(y_test.index, test_pred, label='Predicted (Test)', alpha=0.5)

# Customize the plot
plt.title('PM2.5 Predictions vs Actual')
plt.xlabel('Date')
plt.ylabel('PM2.5')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

# Optional: Add feature importance plot
importance_df = pd.DataFrame({
    'feature': train_features.columns,
    'importance': xgb_regressor.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 4))
plt.bar(range(len(importance_df['importance'][:10])), importance_df['importance'][:10])
plt.xticks(range(len(importance_df['feature'][:10])), importance_df['feature'][:10], rotation=45, ha='right')
plt.title('Top 10 Feature Importance')
plt.tight_layout()
plt.show()

In [None]:
#pip install shap

In [None]:
import shap

explainer = shap.Explainer(xgb_regressor)
shap_values = explainer(train_features)
shap.summary_plot(shap_values, train_features)

# Force plot for the first observation
shap.force_plot(
    explainer.expected_value,     # Scalar expected value
    shap_values.values[0, :],     # SHAP values for the first sample
    train_features.iloc[0, :]     # Feature values for the first sample
)

# Dependence plot for a specific feature
#shap.dependence_plot(
#    "pm25_rolling_mean_3d", 
#    shap_values.values, 
#    train_features
#)

# Feature importance DataFrame
shap_importance = pd.DataFrame({
    'feature': train_features.columns,
    'mean_abs_shap': np.abs(shap_values.values).mean(axis=0)
}).sort_values(by='mean_abs_shap', ascending=False)

shap_importance.tail(20)




In [None]:
#top_features = shap_importance['feature'][:20]  # Retain top 20 features

#train_features_trimmed = train_features[top_features]
#validation_features_trimmed = validation_features[top_features]
#test_features_trimmed = test_features[top_features]

#train_features_trimmed

In [None]:
# Predicting target values on the test set
#y_pred = xgb_regressor_trimmed.predict(test_features_trimmed)

# Calculating Mean Squared Error (MSE) using sklearn
#mse_trimmed = mean_squared_error(y_test.iloc[:,0], y_pred)
#print("MSE:", mse)

# Calculating R squared using sklearn
#r2_trimmed = r2_score(y_test.iloc[:,0], y_pred)
#print("R squared:", r2)

In [None]:
#xgb_regressor_trimmed = XGBRegressor(
#    max_depth=3,                # Further reduced to limit model complexity
#    learning_rate=0.005,        # Even smaller learning rate for better generalization
#   n_estimators=3000,          # Increased to compensate for smaller learning rate
#    subsample=0.6,              # Further reduced to increase randomization
#    colsample_bytree=0.6,       # Further reduced to increase randomization
#   min_child_weight=10,        # Increased to require more observations per leaf
#    reg_alpha=0.5,              # Increased L1 regularization
#    reg_lambda=2.0,             # Increased L2 regularization
#    gamma=0.4,                  # Increased minimum loss reduction
#    random_state=42
#)

# Fit with more aggressive early stopping
#xgb_regressor_trimmed.fit(
#    train_features_trimmed, 
#    y_train,
#    eval_set=[(train_features_trimmed, y_train), (validation_features_trimmed, y_validation)],
#    early_stopping_rounds=50,    # More aggressive early stopping
#    eval_metric='rmse',
#    verbose=100
#)

In [None]:
#def evaluate_model(X, y, set_name):
#    predictions = xgb_regressor_trimmed.predict(X)
#    mse = mean_squared_error(y, predictions)
#    rmse = np.sqrt(mse)
#    r2 = r2_score(y, predictions)
#    print(f"\n{set_name} Metrics:")
#    print(f"MSE: {mse:.2f}")
#    print(f"RMSE: {rmse:.2f}")
#   print(f"R2: {r2:.4f}")
#    return predictions

# Evaluate on all sets
#train_pred = evaluate_model(train_features_trimmed, y_train, "Training")
#val_pred = evaluate_model(validation_features_trimmed, y_validation, "Validation")
#test_pred = evaluate_model(test_features_trimmed, y_test, "Test")

# Create the time series plot
#plt.figure(figsize=(15, 5))

# Plot training data and predictions
#plt.plot(y_train.index, y_train.values, label='Actual (Train)', alpha=0.5)
#plt.plot(y_train.index, train_pred, label='Predicted (Train)', alpha=0.5)

# Plot validation data and predictions
#plt.plot(y_validation.index, y_validation.values, label='Actual (Val)', alpha=0.5)
#plt.plot(y_validation.index, val_pred, label='Predicted (Val)', alpha=0.5)

# Plot test data and predictions
#plt.plot(y_test.index, y_test.values, label='Actual (Test)', alpha=0.5)
#plt.plot(y_test.index, test_pred, label='Predicted (Test)', alpha=0.5)

# Customize the plot
#plt.title('PM2.5 Predictions vs Actual')
#plt.xlabel('Date')
#plt.ylabel('PM2.5')
#plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
#plt.tight_layout()
#plt.show()

# Optional: Add feature importance plot
#importance_df = pd.DataFrame({
#    'feature': train_features_trimmed.columns,
#    'importance': xgb_regressor_trimmed.feature_importances_
#}).sort_values('importance', ascending=False)

#plt.figure(figsize=(10, 4))
#plt.bar(range(len(importance_df['importance'][:10])), importance_df['importance'][:10])
#plt.xticks(range(len(importance_df['feature'][:10])), importance_df['feature'][:10], rotation=45, ha='right')
#plt.title('Top 10 Feature Importance')
#plt.tight_layout()
#plt.show()


In [None]:
# Work on a copy so that y_test itself is not modified
df = y_test.copy()
df['predicted_pm25'] = y_pred

In [None]:
df['date'] = X_test['date']
df = df.sort_values(by=['date'])
df.head(5)

In [None]:
# Create directories for the model artifacts and images if they don't exist
model_dir = "air_quality_model_improved"
images_dir = model_dir + "/images"
os.makedirs(images_dir, exist_ok=True)

In [None]:
file_path = images_dir + "/pm25_hindcast.png"
plt = util.plot_air_quality_forecast(city, street, df, file_path, hindcast=True) 
plt.show()

In [None]:
# Plotting feature importances using the plot_importance function from XGBoost
plot_importance(xgb_regressor, max_num_features=10)
feature_importance_path = images_dir + "/feature_importance.png"
plt.savefig(feature_importance_path)
plt.show()

---

## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.
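
Conceptually, generating a schema from examples means recording each column's name and type for the inputs and for the outputs. The sketch below shows that idea in plain Python. It is a hypothetical illustration with made-up values: `infer_columnar_schema` is not part of `hsml`, and the dictionary that `hsml` actually produces may be laid out differently.

```python
def infer_columnar_schema(example_row):
    """Map each column in one example row to a simple type name.

    Hypothetical illustration; hsml's Schema class does its own inference.
    """
    return [
        {"name": name, "type": type(value).__name__}
        for name, value in example_row.items()
    ]


# Made-up example values for a few of the notebook's column names.
example_features = {"temperature_2m_mean": 4.2, "day_of_week": 3, "is_weekend": False}
example_target = {"pm25": 11.0}

schema_sketch = {
    "input_schema": infer_columnar_schema(example_features),
    "output_schema": infer_columnar_schema(example_target),
}
print(schema_sketch)
```

Storing the schema alongside the model lets a consumer check that the columns it sends at inference time match what the model was trained on.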

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Creating input and output schemas using the 'Schema' class for features (X) and target variable (y)
input_schema = Schema(X_train)
output_schema = Schema(y_train)

# Creating a model schema using 'ModelSchema' with the input and output schemas
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

# Converting the model schema to a dictionary representation
schema_dict = model_schema.to_dict()

In [None]:
# Save the trained XGBoost regressor as a JSON file in the model directory
xgb_regressor.save_model(model_dir + "/model.json")

In [None]:
res_dict = { 
        "MSE": str(mse),
        "R squared": str(r2),
    }

In [None]:
mr = project.get_model_registry()

# Creating a Python model in the model registry named 'air_quality_xgboost_model_improved'

aq_model = mr.python.create_model(
    name="air_quality_xgboost_model_improved", 
    metrics= res_dict,
    model_schema=model_schema,
    input_example=X_test.sample().values, 
    description="Air Quality (PM2.5) predictor",
)

# Uploading the model artifacts from the local 'air_quality_model_improved' directory to the model registry
aq_model.save(model_dir)

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 04: Batch Inference</span>

In the following notebook you will use your model for Batch Inference.
