<h3> Loading the data </h3>

In this module, we practice with the obesity data. Each row represents a subject. The data has 16 attributes describing each subject, such as gender, age, height, and other test measurements, plus the obesity level. The target is the obesity level of each subject, encoded as follows:

- ObesityLevel = 1 : Insufficient_Weight
- ObesityLevel = 2 : Normal_Weight
- ObesityLevel = 3 : Overweight_Level_I
- ObesityLevel = 4 : Overweight_Level_II
- ObesityLevel = 5 : Obesity_Type_I
- ObesityLevel = 6 : Obesity_Type_II
- ObesityLevel = 7 : Obesity_Type_III
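
Because the levels are encoded as the ordered integers 1 to 7, the regressors in this module return continuous values rather than level names. The sketch below shows one possible way to map such a value back to a name by rounding and clipping. The helper `to_level_name` is hypothetical and is not part of the original notebook.

```python
# Mapping copied from the list of obesity levels above.
LEVEL_NAMES = {
    1: 'Insufficient_Weight',
    2: 'Normal_Weight',
    3: 'Overweight_Level_I',
    4: 'Overweight_Level_II',
    5: 'Obesity_Type_I',
    6: 'Obesity_Type_II',
    7: 'Obesity_Type_III',
}

def to_level_name(prediction):
    """Round a continuous prediction to the nearest level in 1..7 and return its name.

    Hypothetical helper. Note that Python's round() sends exact halves
    (e.g. 2.5) to the nearest even integer.
    """
    level = int(round(prediction))
    level = min(max(level, 1), 7)   # clip predictions that fall outside 1..7
    return LEVEL_NAMES[level]

print(to_level_name(2.8))   # rounds to level 3
print(to_level_name(7.6))   # rounds to 8, clipped to level 7
```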

We will quickly go through the data processing.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = pd.read_csv('obesity.csv')

features = data.drop('ObesityLevel', axis=1)
label = data['ObesityLevel'].values

trainX, testX, trainY, testY = train_test_split(features,label,test_size=0.3)
trainX.shape, testX.shape, trainY.shape, testY.shape

((1477, 16), (634, 16), (1477,), (634,))

Some descriptive statistics

We use the standard pipeline developed previously. We also generate `testdata` for the AWS models later.

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import numpy as np

num_cols = trainX.columns[(trainX.dtypes==np.int64) | (trainX.dtypes==np.float64)]

num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('standardize', StandardScaler())
])

cat_cols = trainX.columns[trainX.dtypes==object]

cat_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='constant',fill_value='missing')),
    ('encode', OneHotEncoder(handle_unknown='ignore'))  # do not fail on categories unseen during fit
])

full_pipeline = ColumnTransformer([
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])

trainX_prc = full_pipeline.fit_transform(trainX)


testX_prc = full_pipeline.transform(testX)
testdata = np.array(testX_prc)                                               #for AWS
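
To make the numeric branch of the pipeline concrete, here is a minimal NumPy sketch of median imputation followed by standardization on a toy column. The values are made up for illustration. The point is that the median, mean, and standard deviation are learned from the training data only and then reused on the test data, which is what `fit_transform` followed by `transform` does.

```python
import numpy as np

# Toy numeric column with missing values (made-up numbers).
train = np.array([1.0, 2.0, np.nan, 4.0, 9.0])
test = np.array([np.nan, 3.0])

# "fit": statistics come from the training data only.
median = np.nanmedian(train)
train_filled = np.where(np.isnan(train), median, train)
mean, std = train_filled.mean(), train_filled.std()   # population std (ddof=0)

# "transform": the same statistics are applied to both sets.
train_scaled = (train_filled - mean) / std
test_scaled = (np.where(np.isnan(test), median, test) - mean) / std

print('median used for imputation:', median)
print('train mean/std after scaling:', train_scaled.mean(), train_scaled.std())
print('test after scaling:', test_scaled)
```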

<h3>Ridge Regression</h3>

Ridge regression is linear regression with L2 regularization. The regularization strength is controlled by the hyperparameter `alpha`, which we tune with a grid search.

In [3]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = [{'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]
ridge_reg = Ridge()
grid_search = GridSearchCV(ridge_reg, param_grid, cv=5, scoring='r2', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, estimator=Ridge(),
             param_grid=[{'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5,
                                    10, 50, 100]}],
             return_train_score=True, scoring='r2')
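
For intuition on what `alpha` controls, the sketch below solves the ridge problem in closed form on synthetic data and shows that a larger `alpha` shrinks the weights. It is a simplified illustration, not what `Ridge` runs internally: the data, the helper `ridge_weights`, and the omission of the intercept are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # synthetic features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)    # synthetic target with small noise

def ridge_weights(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y. Hypothetical helper, no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in [0.001, 1, 100]:
    w = ridge_weights(X, y, alpha)
    print(alpha, np.round(w, 3), 'norm:', round(float(np.linalg.norm(w)), 3))
```

With a very small `alpha` the solution is close to ordinary least squares, while a large `alpha` pulls all weights toward zero.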

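We can examine the best model. Only the last expression in a cell is displayed, so the output below is the best cross-validated R2 score: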

In [4]:
grid_search.best_params_
grid_search.best_score_

0.9487052303942429

In [5]:
from sklearn.metrics import mean_squared_error, r2_score

best_ridge = grid_search.best_estimator_
testY_pred_ridge = best_ridge.predict(testX_prc)
print('Testing MSE:',mean_squared_error(testY, testY_pred_ridge))
print('Testing R2:',r2_score(testY, testY_pred_ridge))

Testing MSE: 0.1772674680842385
Testing R2: 0.9569792482350228
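
The two metrics are closely related: R2 is one minus the MSE divided by the variance of the true targets. The sketch below computes both by hand on a small made-up example. The helper names `mse` and `r2` are ours, and the numbers are illustrative, not taken from the obesity data.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up targets on the 1..7 scale and slightly noisy predictions.
y_true = [1, 2, 3, 4, 5, 6, 7]
y_pred = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.6]

print('MSE:', mse(y_true, y_pred))
print('R2:', r2(y_true, y_pred))
print('1 - MSE/Var:', 1 - mse(y_true, y_pred) / np.var(y_true))
```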


<h3> Support Vector Regressor </h3>

The hyperparameters we tune here (`C`, `kernel`, and `gamma`) are the same as for SVC.

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

svr = SVR()

param_grid = [{
    'C': [0.01, 0.1, 1, 10, 100],
    'kernel' : ['rbf'],
    'gamma' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}]

grid_search = GridSearchCV(svr, param_grid, cv=5, scoring='r2', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, estimator=SVR(),
             param_grid=[{'C': [0.01, 0.1, 1, 10, 100],
                          'gamma': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                          'kernel': ['rbf']}],
             return_train_score=True, scoring='r2')
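
To see why `gamma` is searched over such a wide range, the sketch below evaluates the RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), for one fixed pair of points. The helper `rbf_kernel` and the two points are made up for illustration.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2). Hypothetical helper for illustration."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x, z = [0.0, 0.0], [1.0, 1.0]   # squared distance between the points is 2

for gamma in [0.001, 0.01, 0.1, 1, 10]:
    print(gamma, rbf_kernel(x, z, gamma))
```

A small `gamma` makes distant points look similar, which gives a smoother model. A large `gamma` makes the similarity drop to almost zero very quickly, so each training point only influences its immediate neighborhood.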

The fine-tuned model (note that the score is now R2 since we are doing regression):

In [7]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
0.9755209421004893


And the testing performance:

In [8]:
best_svr = grid_search.best_estimator_
testY_pred_svr = best_svr.predict(testX_prc)
print('Testing MSE:',mean_squared_error(testY, testY_pred_svr))
print('Testing R2:',r2_score(testY, testY_pred_svr))

Testing MSE: 0.10099345614575474
Testing R2: 0.975490063384507


<h3>Decision Tree Regressor</h3>

The hyperparameters are the same as for DecisionTreeClassifier.

In [9]:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()

param_grid = [{
    'max_depth': [3,4,5,6],
    'max_features' : [5, 7, 9],
    'min_samples_split' : [2, 10, 20, 30, 40],
    'min_samples_leaf' : [1, 10, 20, 30, 40]
}]

grid_search = GridSearchCV(dtr, param_grid, cv=5, scoring='r2', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(),
             param_grid=[{'max_depth': [3, 4, 5, 6], 'max_features': [5, 7, 9],
                          'min_samples_leaf': [1, 10, 20, 30, 40],
                          'min_samples_split': [2, 10, 20, 30, 40]}],
             return_train_score=True, scoring='r2')
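
The main difference from the classifier is the split criterion: a regression tree picks the split that minimizes the squared error around the mean of each side. The sketch below does this for a single feature on made-up data. The helper `best_split` is a simplified illustration under that assumption, not scikit-learn's actual implementation.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the split on one feature that minimizes
    the total squared error around the mean of each side."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue   # no threshold fits between two equal values
        left, right = y[:i], y[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[1]:
            best = (float((x[i - 1] + x[i]) / 2), float(sse))   # midpoint threshold
    return best

# Made-up feature and target with an obvious jump between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]

threshold, sse = best_split(x, y)
print('threshold:', threshold, 'sse:', sse)
```

Each leaf then predicts the mean target of the training samples that reach it, which is why `min_samples_leaf` and `max_depth` control how coarse the predictions are.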

In [10]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'max_depth': 6, 'max_features': 9, 'min_samples_leaf': 1, 'min_samples_split': 10}
0.9085674354564596


In [11]:
best_dt = grid_search.best_estimator_
testY_pred_dt = best_dt.predict(testX_prc)
print('Testing MSE:',mean_squared_error(testY, testY_pred_dt))
print('Testing R2:',r2_score(testY, testY_pred_dt))

Testing MSE: 0.45363254362250405
Testing R2: 0.8899086602713534


<h3>Random Forest Regressor</h3>

The hyperparameters are the same as for RandomForestClassifier.

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()

param_grid = [{
    'n_estimators' : [5, 10, 20, 50],
    'max_depth': [3,4,5],
    'max_features' : [5, 7, 9],
    'min_samples_split' : [2, 10, 20, 30, 40],
    'min_samples_leaf' : [1, 10, 20, 30, 40]
}]

grid_search = GridSearchCV(rfr, param_grid, cv=5, scoring='r2', return_train_score=True)

grid_search.fit(trainX_prc,trainY)
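
This search is noticeably slower than the previous ones. A quick count of the grid, sketched below with the same values as the cell above, shows why: every combination is fitted once per cross-validation fold.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'n_estimators': [5, 10, 20, 50],
    'max_depth': [3, 4, 5],
    'max_features': [5, 7, 9],
    'min_samples_split': [2, 10, 20, 30, 40],
    'min_samples_leaf': [1, 10, 20, 30, 40],
}
cv = 5

candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv

print(len(candidates), 'candidates,', n_fits, 'fits')   # 900 candidates, 4500 fits
```

If the run time becomes a problem, passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel on all available cores.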

In [27]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'max_depth': 5, 'max_features': 9, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
0.9265202665976616


In [28]:
best_rf = grid_search.best_estimator_
testY_pred_rf = best_rf.predict(testX_prc)
print('Testing MSE:',mean_squared_error(testY, testY_pred_rf))
print('Testing R2:',r2_score(testY, testY_pred_rf))

Testing MSE: 0.25168248834212265
Testing R2: 0.9389195887346182


<h2>AWS Models</h2>

XGBoost and LinearLearner work for regression much as they do for classification. We only need to change the objective when setting them up.

In [None]:
import sagemaker
import boto3
import os
from sagemaker_models import *
from sagemaker.parameter import CategoricalParameter, ContinuousParameter, IntegerParameter
from sagemaker.tuner import HyperparameterTuner

region = boto3.Session().region_name
role = sagemaker.get_execution_role()
sess = sagemaker.Session()
bucket = sess.default_bucket()                    # Set a default S3 bucket for storing training, validation, and testing data
prefix = 'obesity'                                # the folder to store your data in the S3 instance

In [None]:
from sklearn.model_selection import train_test_split

# further split the training data into training and validation sets (this overwrites trainX_prc and trainY)
trainX_prc, validX_prc, trainY, validY = train_test_split(trainX_prc,trainY,test_size=0.3)
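
After this second split, the data is divided into three parts. The sketch below works out their sizes from the shapes printed earlier (1477 training and 634 testing rows). It assumes that `train_test_split` rounds the held-out part up, which matches the first split's output. The helper `split_sizes` is ours.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of the two parts, assuming the held-out part is rounded up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 1477 + 634                                  # rows in the full data set
n_train_full, n_test = split_sizes(n_total, 0.3)      # first split
n_train, n_valid = split_sizes(n_train_full, 0.3)     # second split

for name, n in [('train', n_train), ('validation', n_valid), ('test', n_test)]:
    print(name, n, f'{n / n_total:.1%}')
```

So the AWS models are trained on roughly half of the data, while the scikit-learn models above used cross-validation on the full 70% training portion. Keep this in mind when comparing their test scores.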

In [None]:
traindata = np.concatenate([trainY.reshape(-1,1),trainX_prc],axis=1)
validdata = np.concatenate([validY.reshape(-1,1),validX_prc],axis=1)
pd.DataFrame(traindata).to_csv('train.csv', index=False, header=False)
pd.DataFrame(validdata).to_csv('validation.csv', index=False, header=False)
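
The layout of these files matters: for CSV input, SageMaker's built-in algorithms expect the target in the first column and no header row. The sketch below builds the same layout on made-up data and prints it. It uses `np.savetxt` with an in-memory buffer only so the result can be inspected without writing a file.

```python
import io
import numpy as np

X = np.array([[0.5, -1.2], [1.5, 0.3]])   # toy preprocessed features (made up)
y = np.array([3, 6])                      # toy targets (made up)

# Target first, then the features, exactly as for traindata and validdata above.
data = np.concatenate([y.reshape(-1, 1), X], axis=1)

buffer = io.StringIO()
np.savetxt(buffer, data, delimiter=',', fmt='%g')   # no header, no index column
csv_text = buffer.getvalue()
print(csv_text)
```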

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

<h3> XGBoost </h3>

In [None]:
hyperparameter_ranges = {
    'eta': ContinuousParameter(0, 1), 
    'min_child_weight': ContinuousParameter(1, 10),
    'alpha': ContinuousParameter(0, 2), 
    'max_depth': IntegerParameter(1, 10)
}

xgb_tuner = get_xgb_regressor(region, bucket, prefix, sess, role, hyperparameter_ranges)

In [None]:
best_xgboost = xgb_tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', model_name='xgboost-reg')

testY_pred_xgb = predict_xgb_reg(best_xgboost, testX_prc)

In [29]:
from sklearn.metrics import mean_squared_error, r2_score

print('Testing MSE:',mean_squared_error(testY, testY_pred_xgb))
print('Testing R2:',r2_score(testY, testY_pred_xgb))

Testing MSE: 0.10838363262991277
Testing R2: 0.9736965535461803


<h4>Removing Endpoint before Moving on to Linear Learner</h4>

In [None]:
best_xgboost.delete_endpoint(delete_endpoint_config=True)

<h3>Linear Learner</h3>

In [None]:
hyperparameter_ranges = {
    "l1": ContinuousParameter(1e-7, 1, scaling_type="Auto"),
    "wd": ContinuousParameter(1e-7, 1, scaling_type="Auto"),
    "learning_rate": ContinuousParameter(1e-5, 1, scaling_type="Auto"),
    "mini_batch_size": IntegerParameter(100, 500, scaling_type="Auto"),
}

ll_tuner = get_ll_regressor(region, bucket, prefix, sess, role, hyperparameter_ranges)

In [None]:
best_ll = ll_tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', model_name='ll-reg')

testY_pred_ll = predict_ll_reg(best_ll, testX_prc)

In [30]:
from sklearn.metrics import mean_squared_error, r2_score

print('Testing MSE:',mean_squared_error(testY, testY_pred_ll))
print('Testing R2:',r2_score(testY, testY_pred_ll))

Testing MSE: 0.1869729891618434
Testing R2: 0.9546238312059322


<h3>Final Clean up</h3>

In [None]:
best_ll.delete_endpoint(delete_endpoint_config=True)

sagemaker_client = boto3.client('sagemaker')

# delete model
sagemaker_client.delete_model(ModelName='xgboost-reg')
sagemaker_client.delete_model(ModelName='ll-reg')

# empty the bucket (this deletes all objects in it; the bucket itself remains)
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()