<h3>Import and Process Data</h3>

In this example, we use the Breast Cancer data.

Again, this data is very clean and thus does not need any preprocessing besides train/test splitting.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

bc = pd.read_csv('breast-cancer.csv')
bc.head()

features = bc.drop(['ID','Class'], axis=1).values
label = bc['Class'].values

trainX_prc, testX_prc, trainY, testY = train_test_split(features,label,test_size=0.2)
trainX_prc.shape, testX_prc.shape, trainY.shape, testY.shape

((559, 9), (140, 9), (559,), (140,))

<h3> Model Tuning with Grid Search and Cross Validation </h3>

Tuning models in sklearn is straightforward. In general, you need to
1. Set up a parameter grid
2. Create an empty model
3. Create a GridSearchCV object using the created model and the parameter grid. Also remember to set the number of cv folds and the scoring here (for example, 'accuracy' for classification and 'r2' for regression)
4. Train the GridSearchCV object using fit()
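
The four steps above can be sketched on a regression problem, which the rest of this notebook does not cover. This is a minimal sketch under stated assumptions: the data is synthetic (make_regression), and Ridge with its alpha grid is only an illustrative choice. The scoring string sklearn expects for the coefficient of determination is 'r2'.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic data stands in for a real regression dataset (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 1. set up a parameter grid (values chosen for illustration only)
param_grid = [{'alpha': [0.01, 0.1, 1, 10]}]

# 2. create an empty (unfitted) model
ridge = Ridge()

# 3. create the GridSearchCV object with the cv folds and the scoring
search = GridSearchCV(ridge, param_grid, cv=5, scoring='r2')

# 4. train the GridSearchCV object
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search.best_estimator_ is refit on the full training split and can be used for prediction directly.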

<h4> Logistic Regression </h4>

We only need to tune <b>C</b> for logistic regression. 

\<Test the difference between keeping and dropping ID\>

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score

#hyperparameter grid for logistic regression, we only tune C (the inverse of the regularization strength)
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]

#create new model
logistic = LogisticRegression(max_iter=5000)

#perform grid search with 5-fold cross validation
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=5000),
             param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10,
                                50, 100]}],
             return_train_score=True, scoring='accuracy')

Let's look at the best model

In [3]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 0.05}
0.9624195624195625
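
Because the grid search is created with return_train_score=True, the per-candidate training and validation scores are available in cv_results_. The sketch below shows one way to inspect them. It is self-contained and therefore uses synthetic data (make_classification) and a shorter C grid, both of which are assumptions for illustration; on the notebook's own grid_search object only the last three statements would be needed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    [{'C': [0.01, 0.1, 1, 10]}],
    cv=5,
    scoring='accuracy',
    return_train_score=True,
)
search.fit(X, y)

# cv_results_ is a dict of arrays, one entry per candidate; a DataFrame makes it readable
results = pd.DataFrame(search.cv_results_)
summary = results[['param_C', 'mean_train_score', 'mean_test_score',
                   'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

A large gap between mean_train_score and mean_test_score for a candidate is a sign of overfitting at that setting.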


And apply it on the testing data

In [4]:
best_logistic = grid_search.best_estimator_
testY_pred_logistic = best_logistic.predict(testX_prc)
print('logistic regression accuracy: ', accuracy_score(testY, testY_pred_logistic))
print('logistic regression f1: ', f1_score(testY, testY_pred_logistic))

logistic regression accuracy:  0.9785714285714285
logistic regression f1:  0.9655172413793104


<h4> Support Vector Machine </h4>

For an SVM, the tunable hyperparameters are C, the kernel (poly or rbf), coef0 and degree for the poly kernel, and gamma for the rbf kernel. The grid in this section searches only the rbf kernel.

In [5]:
from sklearn.svm import SVC

svc = SVC()

param_grid = [{
    'C': [0.01, 0.1, 1, 10, 100],
    'kernel' : ['rbf'],
    'gamma' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}]

grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid=[{'C': [0.01, 0.1, 1, 10, 100],
                          'gamma': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                          'kernel': ['rbf']}],
             return_train_score=True, scoring='accuracy')

Let's check the best model

In [6]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}
0.9677767052767052


And evaluate it on the test data

In [7]:
best_svc = grid_search.best_estimator_
testY_pred_svc = best_svc.predict(testX_prc)
print('support vector machine accuracy: ', accuracy_score(testY, testY_pred_svc))
print('support vector machine f1: ', f1_score(testY, testY_pred_svc))

support vector machine accuracy:  0.9642857142857143
support vector machine f1:  0.9438202247191012
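
The grid in this section covers only the rbf kernel. To also search the poly kernel with its own coef0 and degree values, param_grid can be a list of dictionaries, one per kernel, so that each kernel is combined only with the parameters that apply to it. The following is a minimal sketch on synthetic data; the specific values in both grids are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

# one dictionary per kernel: gamma is searched for rbf, degree and coef0 for poly
param_grid = [
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
    {'kernel': ['poly'], 'C': [0.1, 1, 10], 'degree': [2, 3], 'coef0': [0, 1]},
]

search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_['params']))  # 9 rbf candidates + 12 poly candidates
```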


<h4> Decision Tree </h4>

We need to tune max_depth, max_features, min_samples_split, and min_samples_leaf

In [8]:
from sklearn.tree import DecisionTreeClassifier

param_grid = [{
    'max_depth': [3,4,5,6,7],
    'max_features' : [3, 5, 7],
    'min_samples_split' : [10, 20, 30, 40],
    'min_samples_leaf' : [10, 20, 30, 40]
}]

dt = DecisionTreeClassifier()

grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid=[{'max_depth': [3, 4, 5, 6, 7],
                          'max_features': [3, 5, 7],
                          'min_samples_leaf': [10, 20, 30, 40],
                          'min_samples_split': [10, 20, 30, 40]}],
             return_train_score=True, scoring='accuracy')

In [9]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'max_depth': 7, 'max_features': 3, 'min_samples_leaf': 20, 'min_samples_split': 40}
0.9499034749034749


In [10]:
best_dt = grid_search.best_estimator_
testY_pred_dt = best_dt.predict(testX_prc)
print('decision tree accuracy: ', accuracy_score(testY, testY_pred_dt))
print('decision tree f1: ', f1_score(testY, testY_pred_dt))

decision tree accuracy:  0.9
decision tree f1:  0.8541666666666667


<h4>Random Forest</h4>

Similar to trees but with n_estimators added

In [11]:
from sklearn.ensemble import RandomForestClassifier

param_grid = [{
    'n_estimators' : [5, 10, 20, 50],
    'max_depth': [3, 4, 5, 6],
    'max_features' : [3, 5, 7],
    'min_samples_split' : [10, 20, 30, 40],
    'min_samples_leaf' : [10, 20, 30, 40]
}]

rf = RandomForestClassifier()

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid=[{'max_depth': [3, 4, 5, 6], 'max_features': [3, 5, 7],
                          'min_samples_leaf': [10, 20, 30, 40],
                          'min_samples_split': [10, 20, 30, 40],
                          'n_estimators': [5, 10, 20, 50]}],
             return_train_score=True, scoring='accuracy')

Best training model:

In [12]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'max_depth': 6, 'max_features': 3, 'min_samples_leaf': 20, 'min_samples_split': 30, 'n_estimators': 10}
0.9696106821106822


In [13]:
best_rf = grid_search.best_estimator_
testY_pred_rf = best_rf.predict(testX_prc)
print('random forest accuracy: ', accuracy_score(testY, testY_pred_rf))
print('random forest f1: ', f1_score(testY, testY_pred_rf))

random forest accuracy:  0.9714285714285714
random forest f1:  0.9545454545454546
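
The random forest grid is much larger than the previous ones, so it is worth counting the candidates before fitting. The sketch below uses sklearn's ParameterGrid to do that for the grid from this section; multiplying by the number of folds gives the number of model fits that GridSearchCV performs (not counting the final refit of the best model).

```python
from sklearn.model_selection import ParameterGrid

# same grid as in the random forest cell above
rf_grid = [{
    'n_estimators': [5, 10, 20, 50],
    'max_depth': [3, 4, 5, 6],
    'max_features': [3, 5, 7],
    'min_samples_split': [10, 20, 30, 40],
    'min_samples_leaf': [10, 20, 30, 40],
}]

n_candidates = len(ParameterGrid(rf_grid))  # 4 * 4 * 3 * 4 * 4
n_fits = n_candidates * 5                   # cv=5 folds per candidate

print(n_candidates, n_fits)  # 768 3840
```

If the search is slow, GridSearchCV accepts n_jobs=-1 to run the fits in parallel on all available cores.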


<h3> Model Tuning for AWS XGBoost </h3>

<h4> Setting up Environment </h4>

Since we are working with the external platform SageMaker, we need a few environment setup steps. This requires importing sagemaker and boto3 to manage the SageMaker session.

The code below creates a new SageMaker session and sets up the storage location; in general, it does not need to be modified.

In [14]:
import sagemaker
import boto3
import os

region = boto3.Session().region_name
role = sagemaker.get_execution_role()
sess = sagemaker.Session()
bucket = sess.default_bucket()                    # Set a default S3 bucket for storing training, validation, and testing data
prefix = 'breast-cancer'                          # the folder (key prefix) inside the S3 bucket where your data is stored

<h4>Preparing Data</h4>

We cannot use sklearn's GridSearchCV to tune AWS models since they are not sklearn models. Instead, we use hyperparameter tuning jobs in SageMaker, which require manually splitting the training data into training and validation sets.

In [15]:
from sklearn.model_selection import train_test_split

#further split training data to training and validation
trainX_prc, validX_prc, trainY, validY = train_test_split(trainX_prc,trainY,test_size=0.3)
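
To see what the two consecutive splits mean for the dataset sizes, the sketch below repeats them on placeholder arrays with the same number of rows and columns as the data above (699 rows, 9 features). The zero-filled features, the dummy labels, and the random_state values are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data with the same shape as the feature matrix above
X = np.zeros((699, 9))
y = np.arange(699) % 2  # dummy binary labels, for illustration only

# first split: 80% for training + validation, 20% for testing
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# second split: 70% of the remainder for training, 30% for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, test_size=0.3, random_state=0)

for name, part in [('train', X_train), ('validation', X_valid), ('test', X_test)]:
    print(name, part.shape[0], round(part.shape[0] / len(X), 3))
```

The final partition is 391 training, 168 validation, and 140 test rows, roughly 56% / 24% / 20% of the data.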

Input data for the SageMaker models used here needs to be a single table with the target in the first column (and, for CSV input, no header row). We will reorganize the training and validation data that way and upload it to the session's bucket.

In [16]:
#reorganize the data in format <label - features>
traindata = np.concatenate([trainY.reshape(-1,1),trainX_prc],axis=1)
validdata = np.concatenate([validY.reshape(-1,1),validX_prc],axis=1)

#generate csv files to upload
pd.DataFrame(traindata).to_csv('train.csv', index=False, header=False)
pd.DataFrame(validdata).to_csv('validation.csv', index=False, header=False)

#upload training and validation data to the s3 bucket
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

<h4>XGBoost Model</h4>

<h5>Define Hyperparameter Grid and Training Job</h5>

Creating a parameter grid in SageMaker is slightly more complicated than in sklearn, but it follows the same general idea. We create a dictionary that consists of hyperparameters and their ranges.

For simplicity, the part where the hyperparameters are defined is extracted into sagemaker_models.py, so please go there if you want to view or edit (optional). Otherwise, we will import everything from the module.

We just need to create a hyperparameter dictionary, as in sklearn, then call the prewritten function to train the tuner.

In [17]:
from sagemaker_models import *

from sagemaker.parameter import CategoricalParameter, ContinuousParameter, IntegerParameter
from sagemaker.tuner import HyperparameterTuner

hyperparameter_ranges = {
    'eta': ContinuousParameter(0, 1), 
    'min_child_weight': ContinuousParameter(1, 10),
    'alpha': ContinuousParameter(0, 2), 
    'max_depth': IntegerParameter(1, 10)
}

xgb_tuner = get_xgb_classifier(region, bucket, prefix, sess, role, hyperparameter_ranges)

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


......................................................................................!
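
The contents of sagemaker_models.py are not shown in this notebook, and the warnings above indicate that it uses names that were renamed in sagemaker>=2. The following is a hypothetical sketch of what a helper like get_xgb_classifier might look like when written against the v2 names. The function name build_xgb_tuner, the objective metric, num_round, the job counts, and the instance type used for training are all assumptions, not the actual implementation. The S3 paths follow the train/ and validation/ folders uploaded earlier.

```python
def build_xgb_tuner(region, bucket, prefix, sess, role, hyperparameter_ranges,
                    max_jobs=10, max_parallel_jobs=2):
    """Hypothetical stand-in for get_xgb_classifier, using sagemaker>=2 names."""
    # imported lazily so that defining the function does not require the SDK
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput
    from sagemaker.tuner import HyperparameterTuner

    # v2 replacement for the renamed get_image_uri
    image_uri = image_uris.retrieve('xgboost', region, version='1.0-1')

    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,              # v2 name for train_instance_count
        instance_type='ml.m4.xlarge',  # v2 name for train_instance_type (value assumed)
        output_path=f's3://{bucket}/{prefix}/output',
        sagemaker_session=sess,
    )
    # fixed (non-tuned) hyperparameters; values are assumptions
    estimator.set_hyperparameters(objective='binary:logistic', num_round=100)

    tuner = HyperparameterTuner(
        estimator,
        objective_metric_name='validation:auc',  # assumed tuning objective
        hyperparameter_ranges=hyperparameter_ranges,
        objective_type='Maximize',
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
    )
    # channels point at the CSV files uploaded to the bucket earlier
    tuner.fit({
        'train': TrainingInput(f's3://{bucket}/{prefix}/train', content_type='csv'),
        'validation': TrainingInput(f's3://{bucket}/{prefix}/validation', content_type='csv'),
    })
    tuner.wait()  # block until all tuning jobs have finished
    return tuner
```

Under these assumptions it would be called like the prewritten helper, for example build_xgb_tuner(region, bucket, prefix, sess, role, hyperparameter_ranges), and requires an AWS account with SageMaker access to run.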


In [18]:
best_xgboost = xgb_tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', model_name='xgboost-cls')

testY_pred_xgb = predict_xgb_cls(best_xgboost, testX_prc)


2023-02-27 19:18:07 Starting - Found matching resource for reuse
2023-02-27 19:18:07 Downloading - Downloading input data
2023-02-27 19:18:07 Training - Training image download completed. Training in progress.
2023-02-27 19:18:07 Uploading - Uploading generated training model
2023-02-27 19:18:07 Completed - Resource reused by training job: sagemaker-xgboost-230227-1911-014-31739c71
-------!

In [19]:
print('xgboost accuracy: ', accuracy_score(testY, testY_pred_xgb))
print('xgboost f1: ', f1_score(testY, testY_pred_xgb))

xgboost accuracy:  0.9714285714285714
xgboost f1:  0.9545454545454546


<h4>Removing XGBoost Endpoint</h4>

In practice, we may want to keep the endpoint and model for later use. However, the free tier account we use allows only one endpoint at a time, so we need to remove the XGBoost endpoint before moving on to the linear learner.

In [20]:
best_xgboost.delete_endpoint(delete_endpoint_config=True)

<h3>Linear Learner</h3>

As with XGBoost, we define the hyperparameter dictionary and then call the tuner function for the linear learner classifier.

In [21]:
hyperparameter_ranges = {
    "l1": ContinuousParameter(1e-7, 1, scaling_type="Auto"),
    "wd": ContinuousParameter(1e-7, 1, scaling_type="Auto"),
    "learning_rate": ContinuousParameter(1e-5, 1, scaling_type="Auto"),
    "mini_batch_size": IntegerParameter(100, 300, scaling_type="Auto"),
}

ll_tuner = get_ll_classifier(region, bucket, prefix, sess, role, hyperparameter_ranges)

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.0-1.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


.................................................................................................!


In [22]:
best_ll = ll_tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', model_name='ll-cls')

testY_pred_ll = predict_ll_cls(best_ll, testX_prc)


2023-02-27 19:30:23 Starting - Found matching resource for reuse
2023-02-27 19:30:23 Downloading - Downloading input data
2023-02-27 19:30:23 Training - Training image download completed. Training in progress.
2023-02-27 19:30:23 Uploading - Uploading generated training model
2023-02-27 19:30:23 Completed - Resource retained for reuse
---------!

In [23]:
print('linear learner accuracy: ', accuracy_score(testY, testY_pred_ll))
print('linear learner f1: ', f1_score(testY, testY_pred_ll))

linear learner accuracy:  0.9642857142857143
linear learner f1:  0.9411764705882352


<h2>Cleaning up</h2>

Please always run these!

In [24]:
# delete linear learner endpoint
best_ll.delete_endpoint(delete_endpoint_config=True)

#create session client to further clean up
sagemaker_client = boto3.client('sagemaker')

# delete models
sagemaker_client.delete_model(ModelName='xgboost-cls')
sagemaker_client.delete_model(ModelName='ll-cls')

# empty the bucket (this deletes every object in it, not the bucket itself)
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'JVKT5JFYRWK4RMFC',
   'HostId': 'Vm8vkBZl7fNBZtvzpnV4redluSKiyidoFMbJ4S0l/ohkgkwL3FWdu2wJ93ntmBiw2Iwg6SI3NkY=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'Vm8vkBZl7fNBZtvzpnV4redluSKiyidoFMbJ4S0l/ohkgkwL3FWdu2wJ93ntmBiw2Iwg6SI3NkY=',
    'x-amz-request-id': 'JVKT5JFYRWK4RMFC',
    'date': 'Mon, 27 Feb 2023 19:35:25 GMT',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3',
    'connection': 'close'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'breast-cancer/output/linear-learner-230227-1922-008-0217bcae/output/model.tar.gz'},
   {'Key': 'breast-cancer/output/sagemaker-xgboost-230227-1911-009-07a3f06e/output/model.tar.gz'},
   {'Key': 'breast-cancer/output/sagemaker-xgboost-230227-1911-003-ee09770b/output/model.tar.gz'},
   {'Key': 'breast-cancer/output/sagemaker-xgboost-230227-1911-001-b20681be/output/model.tar.gz'},
   {'Key': 'breast-cancer/output/sagemaker-xgboost-230227-1911-005