# Training the XGBoost model (scikit-learn API)

Before this notebook is executed, we expect: <br>**(1)** Serialized train and test data tracked with DVC <br>**(2)** Knowledge of their location within the DVCFileSystem (path within the Git repository used for DVC tracking)

Steps covered in this notebook:
1. Retrieve parameters
2. Download training package
3. Initialize and train ```XGBoost``` (Regressor) model
4. Download test package
5. Run initial testing and check metrics
6. Serialize and track model

In [None]:
# Install required packages.
# TODO: Create IBM Cloud Software Configuration for those
!pip install ibm-cos-sdk xgboost ibm_watson_studio_pipelines 'dvc[s3]' # dvc[all] alternatively, however, COS is covered by S3

In [None]:
from ibm_watson_studio_pipelines import WSPipelines
from ibm_watson_machine_learning import APIClient
import ibm_boto3

from botocore.client import Config
from sklearn.model_selection import train_test_split
from dataclasses import dataclass
import numpy as np
import pandas as pd

import pickle
import dvc.api
import io

import logging
import os, types
import warnings

warnings.filterwarnings("ignore")

### 1. Retrieve Parameters
**Note**: If you are running this notebook outside of a Watson Studio Pipeline execution, make sure to set the environment variables that the pipeline environment would otherwise pass to the notebook.
Refer to ```credentials.py```.
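
For a local run, the helper only has to export the variables that the pipeline would otherwise pass. The following is a minimal sketch, not the project's actual credentials module: the function body and all values are hypothetical placeholders, and only the variable names are taken from this notebook.

```python
import os

# Hypothetical placeholder values; replace them with your own settings.
LOCAL_SETTINGS = {
    "CLOUD_API_KEY": "<your-ibm-cloud-api-key>",
    "GIT_REPOSITORY": "<git-url-of-the-dvc-tracked-repository>",
    "train_package_dvc_location": "<path/to/train_package>",
    "test_package_dvc_location": "<path/to/test_package>",
    "MODEL_FILENAME": "<model-file-name>",
}


def set_env_variables_for_credentials(settings=LOCAL_SETTINGS):
    """Export each setting as an environment variable unless it is already set."""
    for name, value in settings.items():
        os.environ.setdefault(name, value)


set_env_variables_for_credentials()
```

Using `setdefault` means that values already provided by the environment take precedence over the local placeholders.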

In [None]:
# For local runs only: put your credentials in the module imported below.
# Comment this cell out when the notebook runs inside a Watson Studio Pipeline.
from credentials2 import set_env_variables_for_credentials
set_env_variables_for_credentials()

In [None]:
CLOUD_API_KEY = os.getenv("CLOUD_API_KEY")
GIT_REPOSITORY = os.getenv("GIT_REPOSITORY")
train_package_dvc_location = os.getenv("train_package_dvc_location") 
test_package_dvc_location = os.getenv("test_package_dvc_location")

# Name of serialized model is passed as pipeline param
MODEL_FILENAME = os.getenv("MODEL_FILENAME")

### 2. Pre-Training: DVC Pull and Deserialize Training Data Package

In [None]:
# TODO: Make pipeline param
repo = GIT_REPOSITORY

In [None]:
# Retrieve dataset from tracking information in git. The repository itself contains the remote storage info and credentials.
train_package = pickle.load(io.BytesIO(dvc.api.read(train_package_dvc_location, repo=repo, mode="rb")))

In [None]:
X_train = train_package['X_train']
y_train = train_package['y_train'] 
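
The deserialization step can be tried without a DVC remote. The sketch below assumes that the package is a pickled dictionary, as the keys used above suggest, and uses plain lists as a stand-in for the real data; `example_package` and its values are made up for illustration.

```python
import io
import pickle

# Stand-in for a training package; plain lists replace the real features and targets.
example_package = {"X_train": [[1.0, 2.0], [3.0, 4.0]], "y_train": [10.0, 20.0]}

# dvc.api.read(..., mode="rb") returns the file content as bytes;
# pickle.dumps produces the same kind of payload here.
raw_bytes = pickle.dumps(example_package)

# Same deserialization step as in the notebook cell.
restored = pickle.load(io.BytesIO(raw_bytes))
X_example, y_example = restored["X_train"], restored["y_train"]
print(len(X_example), len(y_example))
```

`pickle.loads(raw_bytes)` would give the same result without the `io.BytesIO` wrapper. Because unpickling can execute arbitrary code, this pattern is only appropriate for packages from a trusted repository.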

### 3. Initialize and train ```XGBoost``` (Regressor) model

In [None]:
import xgboost as xgb

# Define the hyperparameters for XGBRegressor
params = {
    'objective': 'reg:squarederror',  # Objective function for regression
    'learning_rate': 0.001,           # Learning rate
    'max_depth': 4,                   # Maximum depth of each tree
    'n_estimators': 500,              # Number of trees (boosting rounds)
    'subsample': 0.6,                 # Subsample ratio of the training instances
    'colsample_bytree': 0.6,          # Subsample ratio of columns when constructing each tree
    'gamma': 0.1,                     # Minimum loss reduction required to make a further partition on a leaf node
    'reg_alpha': 0.25,                # L1 regularization term on weights
    'reg_lambda': 0.25,               # L2 regularization term on weights
    'random_state': 42                # Random seed for reproducibility
}

# Create an instance of XGBRegressor
model = xgb.XGBRegressor(**params)

X_train = X_train.apply(pd.to_numeric, errors="coerce")

model.fit(X_train.to_numpy(), y_train.to_numpy())
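
With `errors="coerce"`, every value that cannot be parsed as a number is replaced by `NaN` instead of raising an error, and XGBoost treats `NaN` as a missing value by default. It can therefore be worth counting how many values were coerced before training. A small sketch with a made-up frame (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical frame: one clean column and one column with a non-numeric entry.
df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["4.5", "n/a", "6"]})

# Same conversion as in the training cell.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Number of values per column that could not be parsed and became NaN.
print(numeric.isna().sum())
```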

In [None]:
# R^2 on the last 2,000,000 training rows (a training score, not a held-out estimate)
model.score(X_train.tail(2000000).to_numpy(), y_train.tail(2000000).to_numpy())

### 4. Download test package

In [None]:
# Retrieve test package for brief testing
test_package = pickle.load(io.BytesIO(dvc.api.read(test_package_dvc_location, repo=repo, mode="rb")))

In [None]:
# Make predictions on the testing data
X_test = test_package['X_test']
y_test = test_package['y_test']

### 5. Run initial testing and check metrics

In [None]:
# NOTE: Step no longer necessary
# # Drop columns that were dropped in X_train earlier
# X_test = X_test.drop(dropped_cols, axis=1)

# Convert to ensure numeric data (avoid e.g. Timestamp() data type)
X_test = X_test.apply(pd.to_numeric, errors="coerce")

# Pass a NumPy array, consistent with how the model was fitted
y_pred = model.predict(X_test.to_numpy())

In [None]:
# In-line comparison of the predictions against the known target values
validation_df = pd.DataFrame({'y_pred': y_pred, 'y_validate': y_test})
validation_df

In [None]:
# Misc testing
# See how many predictions are off by more than 1% but no more than 25%,
# measured relative to the predicted value
abs_error = (validation_df['y_pred'] - validation_df['y_validate']).abs()
within_25 = abs_error <= 0.25 * validation_df['y_pred']
beyond_1 = abs_error > 0.01 * validation_df['y_pred']
filtered_df = validation_df[within_25]
filtered_df2 = validation_df[within_25 & beyond_1]

# Percent of predictions whose error falls into that 1-25% band
100 * len(filtered_df2) / len(validation_df)
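
The band check above can also be written as a small reusable function. This is a sketch only: the function name and the default bounds are assumptions, and unlike the cell above it scales the error by the absolute value of the prediction so that negative predictions are handled as well.

```python
def percent_in_error_band(y_true, y_pred, low=0.01, high=0.25):
    """Percent of predictions whose absolute error is > low and <= high,
    measured relative to the magnitude of the prediction."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = 0
    for actual, predicted in zip(y_true, y_pred):
        error = abs(predicted - actual)
        scale = abs(predicted)
        if low * scale < error <= high * scale:
            hits += 1
    return 100 * hits / len(y_true)


# Hypothetical values: errors of 0%, 10% and 50% relative to the prediction.
print(percent_in_error_band([100, 90, 50], [100, 100, 100]))
```

Only the second pair falls into the band here, since the first prediction is exact and the third is off by more than 25%.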

##### Check a few metrics

You may want to set a threshold for some metrics in the Watson Studio Pipeline. If so, make sure to add each metric you want to check to the `training_params` dictionary further below, so that its value is passed on to the pipeline.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error (MSE):', mse)

# Calculate the mean absolute error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE):', mae)

# Calculate the R-squared score (coefficient of determination)
r2 = r2_score(y_test, y_pred)

# Print the R-squared score
print('R-squared Score:', r2)
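
To make such metrics available to the pipeline, they can be added to the dictionary that is later passed to `store_results`. The sketch below uses placeholder values instead of the computed metrics; the key names and the threshold `MIN_R2` are hypothetical, and it assumes that the pipeline accepts plain numeric and boolean values as results.

```python
# Placeholder metric values; in the notebook these come from the cell above.
example_metrics = {"mse": 12.5, "mae": 2.75, "r2": 0.83}

# Hypothetical threshold, chosen only for illustration.
MIN_R2 = 0.8

example_params = {"training_completed": True}
# Cast to plain floats so that NumPy scalars do not end up in the results.
example_params.update({name: float(value) for name, value in example_metrics.items()})
example_params["r2_above_threshold"] = example_metrics["r2"] >= MIN_R2

print(example_params)
```

A pipeline condition could then branch either on the raw metric or on the precomputed flag.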

### 6. Serialize and track model

In [None]:
with open(MODEL_FILENAME, 'wb') as f:
    pickle.dump(model, f)

##### Track Model with DVC

In [None]:
!echo $MODEL_FILENAME

In [None]:
!git clone $GIT_REPOSITORY

In [None]:
!cd dvc-testing && mkdir -p model

In [None]:
!mv $MODEL_FILENAME dvc-testing/model/

In [None]:
!cd dvc-testing && dvc add model/$MODEL_FILENAME

In [None]:
!cd dvc-testing && git add model/$MODEL_FILENAME.dvc model/.gitignore

In [None]:
!cd dvc-testing && git commit -m "New regression model" && git push

In [None]:
!cd dvc-testing && dvc push
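
The shell cells above hard-code the clone directory `dvc-testing`. If the repository URL changes, the directory name can instead be derived from `GIT_REPOSITORY`, since `git clone` by default uses the last path component of the URL without a trailing `.git`. A minimal sketch, with a hypothetical helper name and example URL:

```python
from urllib.parse import urlparse


def repo_dir_from_url(git_url):
    """Return the directory name that `git clone <git_url>` creates by default."""
    path = urlparse(git_url).path.rstrip("/")
    name = path.rsplit("/", 1)[-1]
    return name[:-4] if name.endswith(".git") else name


# Hypothetical URL, used only to show the result.
print(repo_dir_from_url("https://github.com/example-org/dvc-testing.git"))
```

The returned name could then replace the literal directory in the shell cells, for example through IPython's `{variable}` expansion.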

In [None]:
training_params = {}
training_params['training_completed'] = True
training_params['model_filename'] = MODEL_FILENAME

In [None]:
pipelines_client = WSPipelines.from_apikey(apikey=CLOUD_API_KEY)
pipelines_client.store_results(training_params)