# Training the SKLearn model

- Clean data
    - Drop columns not required for training
    - Drop rows with null values where it makes sense (river discharge may be NaN where there is no river, so keep these rows to let the model learn where rivers are)
- Think about whether or not to have separate notebooks for new data retrievals and prep
- Version Control the data
- Train test splitting
- Version control again??
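
The cleaning and splitting steps listed above could look roughly like the following sketch. The column names (`station_id`, `river_discharge`, `precipitation`, `target`), the toy values, and the helper name `clean_and_split` are hypothetical placeholders, not the notebook's actual schema. Only the package keys `X_train`, `y_train`, `X_test`, and `y_test` mirror what is read later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def clean_and_split(df):
    """Clean a raw frame and return (train_package, test_package) dictionaries."""
    # Drop columns not required for training (assumed: an identifier column).
    df = df.drop(columns=["station_id"])

    # Drop rows with nulls only where a null really means "missing".
    # NaN in river_discharge is kept because it encodes "no river here".
    df = df.dropna(subset=["precipitation", "target"])

    X = df.drop(columns=["target"])
    y = df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    return (
        {"X_train": X_train, "y_train": y_train},
        {"X_test": X_test, "y_test": y_test},
    )


# Hypothetical toy data, for illustration only.
toy_df = pd.DataFrame({
    "station_id": [1, 2, 3, 4, 5, 6],
    "river_discharge": [0.4, np.nan, 1.2, np.nan, 0.9, 0.1],
    "precipitation": [3.0, 1.5, np.nan, 0.0, 2.2, 4.1],
    "target": [2.1, 0.2, 0.5, 44.0, 0.6, 1.0],
})
toy_train_package, toy_test_package = clean_and_split(toy_df)
```

Each package could then be pickled and tracked with DVC, which is the form in which this notebook consumes them.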

In [1]:
# Install required packages.
# TODO: Create IBM Cloud Software Configuration for those
# dvc[s3] is sufficient because COS is S3-compatible; dvc[all] works as well.
!pip install ibm-cos-sdk xgboost ibm_watson_studio_pipelines 'dvc[s3]'

zsh:1: /Users/ennmouri/csm/mlops-sustainability-oss/venv/bin/pip: bad interpreter: /Users/ennmouri/csm/mlops-sustainability/venv/bin/python3: no such file or directory



In [2]:
from ibm_watson_studio_pipelines import WSPipelines
from ibm_watson_machine_learning import APIClient
import ibm_boto3

from botocore.client import Config
from sklearn.model_selection import train_test_split
from dataclasses import dataclass
import numpy as np
import pandas as pd

import pickle
import dvc.api
import io

import logging
import os, types
import warnings

warnings.filterwarnings("ignore")

### Setup IBM Cloud and COS Credentials

**Note**: If you are running this notebook outside of a Watson Studio Pipeline execution, set the environment variables that the pipeline environment would otherwise pass to the notebook.
Refer to `credentials.py`.
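
The contents of `credentials.py` are not shown here. A minimal sketch of what it might contain, assuming it only needs to populate the four environment variables this notebook reads. The placeholder values are hypothetical and must be replaced with your own; the two paths are the test values mentioned further down.

```python
import os


def set_env_variables_for_credentials():
    """Populate the environment variables a pipeline run would normally provide."""
    placeholders = {
        "CLOUD_API_KEY": "<your IBM Cloud API key>",
        "GIT_REPOSITORY": "<URL of the git repository holding the DVC tracking files>",
        "train_package_dvc_location": "data/train_package.pkl",
        "test_package_dvc_location": "data/test_package.pkl",
    }
    for name, value in placeholders.items():
        # setdefault keeps any value that is already set, e.g. by a pipeline run
        os.environ.setdefault(name, value)


set_env_variables_for_credentials()
```

Keep a file like this out of version control, since it holds an API key.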

In [3]:
# For local runs only: put your credentials in credentials.py. Comment this cell out when the notebook runs inside a pipeline.
from credentials import set_env_variables_for_credentials
set_env_variables_for_credentials()

In [4]:
CLOUD_API_KEY = os.getenv("CLOUD_API_KEY")
GIT_REPOSITORY = os.getenv("GIT_REPOSITORY")
train_package_dvc_location = os.getenv("train_package_dvc_location") 
test_package_dvc_location = os.getenv("test_package_dvc_location")
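
`os.getenv` returns `None` for a variable that is not set, so a missing value would likely only surface later, when it is first used. A small guard, sketched here with a hypothetical helper name `require_env`, reports all missing variables at once:

```python
import os


def require_env(*names):
    """Return the values of the given environment variables, or raise if any is unset."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return [os.environ[name] for name in names]
```

It could be used as `CLOUD_API_KEY, GIT_REPOSITORY = require_env("CLOUD_API_KEY", "GIT_REPOSITORY")`.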

In [5]:
# For testing
# train_package_dvc_location = "data/train_package.pkl"
# test_package_dvc_location = "data/test_package.pkl"

### 1. Pre-Training: DVC Pull and Deserialize Training Data Package

In [6]:
# TODO: Make pipeline param
repo = GIT_REPOSITORY

In [20]:
# Retrieve the dataset via the DVC tracking information stored in git.
# The repository itself contains the remote storage info and credentials.
train_package = pickle.load(io.BytesIO(dvc.api.read(train_package_dvc_location, repo=repo, mode="rb")))
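
With `mode="rb"`, `dvc.api.read` returns the file contents as `bytes`; wrapping them in `io.BytesIO` gives `pickle.load` the file-like object it expects. The round trip can be sketched without DVC, using an in-memory stand-in for the tracked file (the package contents below are made-up toy values):

```python
import io
import pickle

# Stand-in for the bytes that dvc.api.read(..., mode="rb") would return.
toy_package = {"X_train": [[0.1, 2.0], [0.4, 1.5]], "y_train": [2.1, 0.2]}
raw_bytes = pickle.dumps(toy_package)

# Same pattern as in the cell above: bytes -> file-like object -> pickle.load
restored = pickle.load(io.BytesIO(raw_bytes))

# pickle.loads accepts the bytes directly and gives the same result.
restored_direct = pickle.loads(raw_bytes)
```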

In [21]:
X_train = train_package['X_train']
y_train = train_package['y_train'] 

In [65]:
import xgboost as xgb

# Define the hyperparameters for XGBRegressor
params = {
    'objective': 'reg:squarederror',  # Objective function for regression
    'learning_rate': 0.001,             # Learning rate
    'max_depth': 4,                   # Maximum depth of each tree
    'n_estimators': 500,              # Number of trees (boosting rounds)
    'subsample': 0.6,                 # Subsample ratio of the training instances
    'colsample_bytree': 0.6,          # Subsample ratio of columns when constructing each tree
    'gamma': 0.1,                     # Minimum loss reduction required to make a further partition on a leaf node
    'reg_alpha': 0.25,                 # L1 regularization term on weights
    'reg_lambda': 0.25,                # L2 regularization term on weights
    'random_state': 42                # Random seed for reproducibility
}

# Create an instance of XGBRegressor
model = xgb.XGBRegressor(**params)

# Convert to ensure numeric data (avoid e.g. Timestamp() data type)
X_train = X_train.apply(pd.to_numeric, errors="coerce")

model.fit(X_train.to_numpy(), y_train.to_numpy())

In [66]:
# R² on the last 2,000,000 training rows (a training-set score, not a held-out estimate)
model.score(X_train.tail(2000000).to_numpy(), y_train.tail(2000000).to_numpy())

0.015450673488172195
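
A training-set R² this close to zero may be partly explained by the small learning rate. As a rough back-of-the-envelope check, assume an idealized booster that starts from the mean of the target and in which every tree fits the current residual perfectly, so that only the learning rate shrinks its contribution. Regularization, subsampling, and limited tree depth are ignored, so the real model would be expected to fit less. Each round then leaves a fraction `1 - learning_rate` of the residual:

```python
learning_rate = 0.001   # value from params above
n_estimators = 500      # value from params above

# Fraction of the initial residual left after all boosting rounds (idealized)
residual_fraction = (1 - learning_rate) ** n_estimators

# R² ceiling in this idealized setting: 1 - (remaining residual fraction)^2
r2_ceiling = 1 - residual_fraction ** 2

print(f"residual fraction left: {residual_fraction:.3f}")
print(f"idealized R² ceiling:   {r2_ceiling:.3f}")
```

Even under these optimistic assumptions a large share of the residual remains after 500 rounds, so a larger learning rate or more estimators may be worth trying.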

In [34]:
# Retrieve test package for brief testing
test_package = pickle.load(io.BytesIO(dvc.api.read(test_package_dvc_location, repo=repo, mode="rb")))

In [67]:
# Make predictions on the testing data
X_test = test_package['X_test']

# NOTE: Step no longer necessary
# # Drop columns that were dropped in X_train earlier
# X_test = X_test.drop(dropped_cols, axis=1)

# Convert to ensure numeric data (avoid e.g. Timestamp() data type)
X_test = X_test.apply(pd.to_numeric, errors="coerce")

# Pass a NumPy array to match the input type used in model.fit()
y_pred = model.predict(X_test.to_numpy())

In [68]:
y_test = test_package['y_test']

# Side-by-side comparison of the predictions and the known target values
validation_df = pd.DataFrame({'y_pred': y_pred, 'y_validate': y_test})
validation_df

| | y_pred | y_validate |
|---|---|---|
| 196490 | 9.992476 | 2.078125 |
| 1695950 | 8.182959 | 0.156250 |
| 3171536 | 36.460670 | 0.500000 |
| 3228297 | 48.546841 | 44.000000 |
| 3629990 | 10.985207 | 0.562500 |
| ... | ... | ... |
| 704725 | 5.733423 | 0.625000 |
| 1001508 | 2.487385 | 0.015625 |
| 880876 | 12.835928 | 0.343750 |
| 6754564 | 25.964056 | 0.781250 |


In [69]:
# Misc testing
# Keep predictions whose absolute error is more than 1% but at most 25% of the predicted value.
# Both masks are built from validation_df so that they align with the frame they index.
abs_error = (validation_df['y_pred'] - validation_df['y_validate']).abs()
within_25 = abs_error <= 0.25 * validation_df['y_pred']
beyond_1 = abs_error > 0.01 * validation_df['y_pred']
filtered_df = validation_df[within_25]
filtered_df2 = validation_df[within_25 & beyond_1]

# Display the filtered DataFrame
filtered_df2

| | y_pred | y_validate |
|---|---|---|
| 3228297 | 48.546841 | 44.000000 |
| 198564 | 21.109163 | 24.382812 |
| 4487850 | 12.533396 | 15.312500 |
| 5833072 | 32.801022 | 38.625000 |
| 3860088 | 14.634206 | 11.687500 |
| ... | ... | ... |
| 5496489 | 46.323757 | 56.343750 |
| 1558944 | 20.237976 | 17.140625 |
| 1292283 | 3.220083 | 2.875000 |
| 3498235 | 3.087046 | 3.125000 |


### Check a few metrics

You may want to set a threshold for some metrics in the Watson Studio Pipeline. If so, pass every metric you want to threshold on in the `training_params` dictionary further below.
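
How the threshold is evaluated depends on the pipeline configuration, which is not shown in this notebook. As a purely illustrative sketch, a check like the following could be run before storing results. The threshold value, the helper name `passes_threshold`, and the extra `r2_passed` key are assumptions, not part of the original pipeline.

```python
R2_THRESHOLD = 0.5  # hypothetical threshold; choose one that suits your use case


def passes_threshold(metric_value, threshold):
    """Return True if the metric meets or exceeds the threshold."""
    return metric_value >= threshold


reported_r2 = 0.014722116532971397  # R-squared score reported in this notebook
example_params = {
    "r2_score": reported_r2,
    "r2_passed": passes_threshold(reported_r2, R2_THRESHOLD),  # assumed extra key
}
```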

In [73]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error (MSE):', mse)

# Calculate the mean absolute error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE):', mae)

# Calculate the R-squared score (coefficient of determination)
r2 = r2_score(y_test, y_pred)

# Print the R-squared score
print('R-squared Score:', r2)

Mean Squared Error (MSE): 303661.12
Mean Absolute Error (MAE): 70.42673
R-squared Score: 0.014722116532971397
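
For reference, the three metrics reduce to short formulas. The sketch below recomputes them in plain Python on the first four prediction/target pairs from the comparison table above, so the numbers differ from the full-dataset results. The `_small` variable names are chosen only to avoid clashing with the notebook's own variables.

```python
y_true_small = [2.078125, 0.156250, 0.500000, 44.000000]
y_pred_small = [9.992476, 8.182959, 36.460670, 48.546841]

n = len(y_true_small)
errors = [t - p for t, p in zip(y_true_small, y_pred_small)]

# Mean squared error: average of the squared errors
mse_small = sum(e ** 2 for e in errors) / n

# Mean absolute error: average of the absolute errors
mae_small = sum(abs(e) for e in errors) / n

# R-squared: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true_small) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true_small)
r2_small = 1 - ss_res / ss_tot

print(f"MSE: {mse_small:.3f}, MAE: {mae_small:.3f}, R²: {r2_small:.3f}")
```

Because MSE squares each error, a few large misses dominate it, which is one reason MSE and MAE can differ by orders of magnitude on skewed targets.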


### Serialize Regressor

In [74]:
MODEL_FILENAME = "xgbr.pkl"

# Export the filename so the shell commands (!...) below can read $MODEL_FILENAME
os.environ["MODEL_FILENAME"] = MODEL_FILENAME

with open(MODEL_FILENAME, 'wb') as f:
    pickle.dump(model, f)

### Track Model with DVC

In [77]:
!echo $MODEL_FILENAME

xgbr.pkl


In [78]:
!git clone $GIT_REPOSITORY

fatal: destination path 'dvc-testing' already exists and is not an empty directory.


In [79]:
!cd dvc-testing && mkdir -p model


In [80]:
!mv $MODEL_FILENAME dvc-testing/model/

In [81]:
!cd dvc-testing && dvc add model/$MODEL_FILENAME

zsh:1: /Users/ennmouri/csm/mlops-sustainability-oss/venv/bin/dvc: bad interpreter: /Users/ennmouri/csm/mlops-sustainability/venv/bin/python3: no such file or directory
100% Adding...|████████████████████████████████████████|1/1 [00:00, 68.77file/s]


In [82]:
!cd dvc-testing && git add model/$MODEL_FILENAME.dvc

In [83]:
!cd dvc-testing && git commit -m "New regression model" && git push

[main cb2b453] New regression model
 1 file changed, 4 insertions(+)
 create mode 100644 model/xgbr.pkl.dvc
git: 'credential-manager-core' is not a git command. See 'git --help'.
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 466 bytes | 466.00 KiB/s, done.
Total 4 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
remote: This repository moved. Please use the new location:
remote:   https://github.com/iIias/dvc-testing.git
To https://github.com/iiias/dvc-testing.git
   b49d9fc..cb2b453  main -> main


In [84]:
!cd dvc-testing && dvc push

zsh:1: /Users/ennmouri/csm/mlops-sustainability-oss/venv/bin/dvc: bad interpreter: /Users/ennmouri/csm/mlops-sustainability/venv/bin/python3: no such file or directory
100% Transferring|███████████████████████████████|1/1 [00:00<00:00,  3.04file/s]
1 file pushed                                                                   
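
The shell cells above can also be driven from Python, which stops at the first failing step instead of continuing silently. A minimal sketch, assuming `git` and `dvc` are on the `PATH` and the model file has already been moved into the clone. The helper names `tracking_commands` and `track_model` are hypothetical; the commands and commit message are taken from the cells above.

```python
import subprocess


def tracking_commands(model_filename, commit_message="New regression model"):
    """Return the git/dvc commands used above as argument lists, in order."""
    model_path = f"model/{model_filename}"
    return [
        ["dvc", "add", model_path],               # creates model/<file>.dvc
        ["git", "add", f"{model_path}.dvc"],      # track the pointer file in git
        ["git", "commit", "-m", commit_message],
        ["git", "push"],
        ["dvc", "push"],                          # upload the model to the DVC remote
    ]


def track_model(repo_dir, model_filename):
    """Run each command inside the cloned repository; raise on the first failure."""
    for command in tracking_commands(model_filename):
        subprocess.run(command, cwd=repo_dir, check=True)
```

With the clone from this notebook, the call would be `track_model("dvc-testing", MODEL_FILENAME)`.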


In [85]:
training_params = {
    'training_completed': True,
    'r2_score': r2,
    'model_filename': MODEL_FILENAME,
}

In [None]:
pipelines_client = WSPipelines.from_apikey(apikey=CLOUD_API_KEY)
pipelines_client.store_results(training_params)