### Challenge Lab

**Instructions**: Build a model with a SageMaker built-in algorithm that performs better than the model we trained in Lab 008. Use SageMaker's Automated Model Tuning capability to help you find the best model.

In [1]:
import numpy as np
import pandas as pd

In [2]:
np.set_printoptions(suppress=True, precision=2)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [3]:
df = pd.read_csv("files/003.csv")

In [4]:
df

Unnamed: 0,target,a,b,c,d,e,f,g,h
0,151400,2,2,3,7,37,0,3,0
1,286000,4,4,6,16,64,0,6,0
2,172200,2,2,4,24,16,2,3,0
3,141400,4,4,4,0,32,1,4,0
4,100800,3,3,4,8,8,0,4,0
...,...,...,...,...,...,...,...,...,...
265,219600,0,0,0,0,24,0,0,54
266,139200,0,0,0,6,36,0,12,0
267,181200,0,0,0,18,42,0,6,0
268,1440000,150,150,30,0,0,0,0,0


In [5]:
X = df.drop(['target'], axis = 1)
y = df['target']

In [6]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=0)
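
The two calls above give roughly a 60/20/20 train/validation/test split. As a rough check, the sketch below reproduces the expected row counts for the 269-row dataset. It assumes the held-out portion is rounded up to a whole row, which is consistent with the 54 test predictions shown later in this notebook. The helper name `split_sizes` is hypothetical and not part of any library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up to a whole row.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

n_train_val, n_test = split_sizes(269, 0.2)      # first split: hold out the test set
n_train, n_val = split_sizes(n_train_val, 0.25)  # second split: hold out the validation set

print(n_train, n_val, n_test)
```

Taking 25% of the remaining 80% is what makes the validation set about the same size as the test set.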

In [7]:
import sagemaker
from sagemaker import get_execution_role

session = sagemaker.Session()
role = get_execution_role()
bucket = session.default_bucket()

In [8]:
train_dataset = pd.concat([y_train,X_train], axis = 1)
val_dataset = pd.concat([y_val,X_val], axis = 1)
test_dataset = pd.concat([y_test,X_test], axis = 1)

In [9]:
!mkdir -p tmp

In [10]:
train_dataset.to_csv("tmp/003.train.csv", header=False, index=False)
val_dataset.to_csv("tmp/003.val.csv", header=False, index=False)
test_dataset.to_csv("tmp/003.test.csv", header=False, index=False)
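
The files are written without a header or index, and the target is concatenated first, because the built-in algorithms used here read CSV training data with the label in the first column. The following is a minimal standard-library sketch of that layout, using the first two rows of the dataset shown above. The helper name `to_training_csv` is hypothetical.

```python
import csv
import io

def to_training_csv(labels, features):
    """Serialize rows as headerless CSV with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, row in zip(labels, features):
        writer.writerow([label, *row])
    return buffer.getvalue()

# First two rows of the dataset shown above (target, then columns a..h)
labels = [151400, 286000]
features = [[2, 2, 3, 7, 37, 0, 3, 0], [4, 4, 6, 16, 64, 0, 6, 0]]

csv_text = to_training_csv(labels, features)
print(csv_text)
```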

In [11]:
train_s3_path = session.upload_data("tmp/003.train.csv")
val_s3_path = session.upload_data("tmp/003.val.csv")
test_s3_path = session.upload_data("tmp/003.test.csv")

In [12]:
import boto3
from sagemaker import tuner

# session and role were already created in cell 7, so only the boto3 pieces are new here
region = boto3.Session().region_name
client = boto3.Session().client('sagemaker')

In [13]:
from sagemaker.image_uris import retrieve

container = retrieve('linear-learner', region, version="1")
container

'382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1'

In [14]:
from sagemaker.inputs import TrainingInput
    
s3_input_training = TrainingInput(train_s3_path, content_type="text/csv")
s3_input_validation = TrainingInput(val_s3_path, content_type="text/csv")

In [15]:
estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m5.large',
    sagemaker_session=session)

In [16]:
estimator.set_hyperparameters(
    predictor_type='regressor')

In [17]:
hyperparameter_ranges = {
    'wd': tuner.ContinuousParameter(1e-7, 1),
    'l1': tuner.ContinuousParameter(1e-7,1),
    'mini_batch_size': tuner.IntegerParameter(50,100),
    'learning_rate': tuner.ContinuousParameter(1e-5, 1),
    'use_bias': tuner.CategoricalParameter([True, False])
}
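
Three of these ranges span from 1e-7 (or 1e-5) up to 1, which is several orders of magnitude. The sketch below is a standard-library illustration, not SageMaker code, of why the sampling scale matters for such ranges: drawing uniformly on a linear scale almost never tries small values, while drawing uniformly on a log scale spreads the trials across the magnitudes. The SageMaker SDK's `ContinuousParameter` takes a `scaling_type` argument for this purpose; whether the default picks a logarithmic scale for these particular ranges is not verified here.

```python
import math
import random

def sample_linear(rng, low, high):
    """Draw uniformly between low and high."""
    return rng.uniform(low, high)

def sample_log(rng, low, high):
    """Draw uniformly in log10 space, then map back to the original scale."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)
n = 10_000
low, high = 1e-7, 1.0

linear = [sample_linear(rng, low, high) for _ in range(n)]
logarithmic = [sample_log(rng, low, high) for _ in range(n)]

# Share of samples that fall in the lower four of the seven decades
frac_small_linear = sum(v < 1e-3 for v in linear) / n
frac_small_log = sum(v < 1e-3 for v in logarithmic) / n

print(f"share below 1e-3, linear scale: {frac_small_linear:.3f}")
print(f"share below 1e-3, log scale:    {frac_small_log:.3f}")
```

With only 15 tuning jobs, a linear scale would be unlikely to try any value below 1e-3 for `wd`, `l1`, or `learning_rate`.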

In [18]:
objective_metric_name = 'validation:objective_loss'

In [19]:
hyperparameter_tuner = tuner.HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    objective_type='Minimize',
    max_jobs=15,
    max_parallel_jobs=5)

In [20]:
hyperparameter_tuner.fit(
    {'train': s3_input_training, 'validation': s3_input_validation},
    include_cls_metadata=False,
    wait=False
)

In [21]:
job_name = hyperparameter_tuner.latest_tuning_job.job_name

response = client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=job_name
)

response['HyperParameterTuningJobStatus']

'InProgress'

In [22]:
from pprint import pprint

pprint(response)

{'CreationTime': datetime.datetime(2021, 5, 22, 13, 54, 12, 864000, tzinfo=tzlocal()),
 'HyperParameterTuningJobArn': 'arn:aws:sagemaker:us-east-1:305262579855:hyper-parameter-tuning-job/linear-learner-210522-1354',
 'HyperParameterTuningJobConfig': {'HyperParameterTuningJobObjective': {'MetricName': 'validation:objective_loss',
                                                                        'Type': 'Minimize'},
                                   'ParameterRanges': {'CategoricalParameterRanges': [{'Name': 'use_bias',
                                                                                       'Values': ['True',
                                                                                                  'False']}],
                                                       'ContinuousParameterRanges': [{'MaxValue': '1',
                                                                                      'MinValue': '1e-07',
                                           

In [23]:
tuning_job_name_linear = response['HyperParameterTuningJobName']
%store tuning_job_name_linear

tuning_job_name_linear

Stored 'tuning_job_name_linear' (str)


'linear-learner-210522-1354'

In [24]:
from time import sleep

while response['HyperParameterTuningJobStatus'] == 'InProgress':
    response = client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=job_name
    )

    print(response['HyperParameterTuningJobStatus'])
    sleep(60)

InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
Completed
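
The loop above only checks for `'InProgress'` and has no upper bound on waiting time. A slightly more defensive variant is sketched below. It assumes that `'Completed'`, `'Failed'`, and `'Stopped'` are the terminal statuses of a tuning job, and it takes the status lookup as a function so that it can be run without AWS access. The names `wait_for_terminal_status` and `fake_statuses` are hypothetical.

```python
import time

# Assumption: these are the statuses after which the job no longer changes
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}

def wait_for_terminal_status(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or max_polls is reached."""
    for attempt in range(max_polls):
        status = get_status()
        print(status)
        if status in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job not finished after {max_polls} polls")

# Stand-in for the describe call, so the sketch runs without AWS access
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
```

In this notebook, `get_status` could be a lambda that calls `client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=job_name)` and returns its `'HyperParameterTuningJobStatus'` field.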


In [25]:
predictor = hyperparameter_tuner.deploy(initial_instance_count=1,
                             instance_type='ml.t2.medium')


2021-05-22 14:05:10 Starting - Preparing the instances for training
2021-05-22 14:05:10 Downloading - Downloading input data
2021-05-22 14:05:10 Training - Training image download completed. Training in progress.
2021-05-22 14:05:10 Uploading - Uploading generated training model
2021-05-22 14:05:10 Completed - Training job completed
-------------------!

In [26]:
predictor.serializer = sagemaker.serializers.CSVSerializer()

In [27]:
test_dataset.head(5)

Unnamed: 0,target,a,b,c,d,e,f,g,h
64,86400,0,0,0,24,0,0,0,0
135,302800,4,4,6,14,74,0,6,0
153,605600,60,20,10,4,8,0,0,60
189,2416000,200,240,120,0,0,0,0,0
253,1712000,160,200,32,0,0,0,0,0


In [28]:
X_test = test_dataset.drop(['target'], axis = 1)
y_test = test_dataset[['target']]

In [29]:
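import json

def predict(data, rows=500):
    # Split the data into batches so that no single request is too large
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = []
    for array in split_array:
        # Collect the predictions of every batch instead of keeping only the last response
        response = predictor.predict(array).decode('utf-8')
        predictions.extend(json.loads(response)['predictions'])

    return {'predictions': predictions}

In [30]:
y_pred = predict(X_test.to_numpy())
y_pred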

{'predictions': [{'score': 77791.28125},
  {'score': 260109.375},
  {'score': 622902.625},
  {'score': 2532281.0},
  {'score': 1840484.625},
  {'score': 42007.32421875},
  {'score': 408188.0625},
  {'score': 893421.0},
  {'score': 490653.125},
  {'score': 222828.75},
  {'score': 571656.0},
  {'score': 199911.609375},
  {'score': 580122.125},
  {'score': 565848.625},
  {'score': 109491.203125},
  {'score': 676230.0625},
  {'score': 446710.5},
  {'score': 155582.5625},
  {'score': 747266.75},
  {'score': 795900.375},
  {'score': 1725767.5},
  {'score': 449945.0},
  {'score': 247639.0},
  {'score': 934353.9375},
  {'score': 1034657.5},
  {'score': 1899210.75},
  {'score': 899890.0},
  {'score': 498177.875},
  {'score': 1513038.75},
  {'score': 258664.375},
  {'score': 216674.3125},
  {'score': 670065.75},
  {'score': 1633844.625},
  {'score': 3441092.25},
  {'score': 386748.0625},
  {'score': 494034.125},
  {'score': 193374.03125},
  {'score': 756519.375},
  {'score': 1349834.875},
  {'sc

In [35]:
y_pred['predictions'][0]

{'score': 77791.28125}

In [37]:
y_fin = [x['score'] for x in y_pred['predictions']]
y_fin

[77791.28125,
 260109.375,
 622902.625,
 2532281.0,
 1840484.625,
 42007.32421875,
 408188.0625,
 893421.0,
 490653.125,
 222828.75,
 571656.0,
 199911.609375,
 580122.125,
 565848.625,
 109491.203125,
 676230.0625,
 446710.5,
 155582.5625,
 747266.75,
 795900.375,
 1725767.5,
 449945.0,
 247639.0,
 934353.9375,
 1034657.5,
 1899210.75,
 899890.0,
 498177.875,
 1513038.75,
 258664.375,
 216674.3125,
 670065.75,
 1633844.625,
 3441092.25,
 386748.0625,
 494034.125,
 193374.03125,
 756519.375,
 1349834.875,
 1245805.25,
 52537.6171875,
 336056.75,
 800785.4375,
 144435.84375,
 517328.75,
 218982.40625,
 1452134.5,
 1380363.5,
 258171.875,
 327102.0625,
 1071303.75,
 668486.25,
 5659184.0,
 964447.4375]

In [38]:
from sklearn import metrics

mae_9 = metrics.mean_absolute_error(y_test, y_fin)
mse_9 = metrics.mean_squared_error(y_test, y_fin)
rmse_9 = np.sqrt(mse_9)
r2_9 = metrics.r2_score(y_test, y_fin)

print(f'Mean Absolute Error: {mae_9}')
print(f'Mean Squared Error: {mse_9}')
print(f'Root Mean Squared Error: {rmse_9}')
print(f'Coefficient of Determination: {r2_9}')

Mean Absolute Error: 58249.80085358796
Mean Squared Error: 10635552266.983366
Root Mean Squared Error: 103128.81395121038
Coefficient of Determination: 0.9891436775428604


### Comparison with Lab 8
---

In [39]:
%store -r mae_8
%store -r mse_8
%store -r rmse_8
%store -r r2_8

In [40]:
print(f'Mean Absolute Error Difference: {mae_8-mae_9}')
print(f'Mean Squared Error Difference: {mse_8-mse_9}')
print(f'Root Mean Squared Error Difference: {rmse_8-rmse_9}')
print(f'Coefficient of Determination Difference: {r2_9-r2_8}')

Mean Absolute Error Difference: -24002.218652922114
Mean Squared Error Difference: -6993750405.46458
Root Mean Squared Error Difference: -42781.47050082487
Coefficient of Determination Difference: -0.007138924964167348
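
The differences are easier to judge in relative terms. As a rough sketch, the Lab 8 values can be recovered from the numbers printed in this notebook, since the error differences are computed as Lab 8 minus Lab 9 and the R² difference as Lab 9 minus Lab 8. The literals below are copied from the outputs above, and the `_implied` variable names are hypothetical.

```python
# Values copied from the printed output in this notebook
mae_9 = 58249.80085358796
mae_diff = -24002.218652922114    # printed as mae_8 - mae_9
r2_9 = 0.9891436775428604
r2_diff = -0.007138924964167348   # printed as r2_9 - r2_8

# Invert the two difference formulas to recover the Lab 8 metrics
mae_8_implied = mae_9 + mae_diff
r2_8_implied = r2_9 - r2_diff

mae_ratio = mae_9 / mae_8_implied

print(f"Implied Lab 8 MAE: {mae_8_implied:.2f}")
print(f"Implied Lab 8 R^2: {r2_8_implied:.4f}")
print(f"Lab 9 MAE relative to Lab 8: {mae_ratio:.2f}x")
```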


Unfortunately, automated hyperparameter tuning was not able to find hyperparameter values that beat the metrics of the Lab 8 model. Both Linear Learner and XGBoost were tested. The first Linear Learner run reached an R² of about 97.2%, and switching to XGBoost dropped the score to about 86%. Several further tuning runs with Linear Learner produced the current metrics: compared with Lab 8, the MAE is higher by about 24,002 and R² is lower by about 0.007. Although the built-in algorithm did not surpass the Lab 8 model, narrowing the hyperparameter ranges and running more tuning jobs looks like a promising way to close the remaining gap.

In [41]:
predictor.delete_endpoint()