# Hyperparameter tuning

This lab addresses two challenges of the machine learning lifecycle.  The first is formalizing the logic developed in labs 1 and 2 and making it reusable by wrapping it in Python functions.  The second is finding hyperparameter values for the selected algorithm that optimize model performance.

---

## Feature engineering re-use

Raw data must be converted to features both when a model is trained and when it is used in production.  Just as raw data is transformed before being passed to a training algorithm, it must be transformed in the same way before the model can make predictions at run time.  It is therefore important to make the feature engineering logic reusable and portable.  This can be done as a RESTful microservice, a Python code module, or whatever method the delivery team prefers.  Today we will stop at formalizing the logic as a collection of functions, but bear in mind that this logic needs to be reused at run time as well as during training.
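
As a minimal sketch of that idea, the same function can be applied to a batch of historical records at training time and to a single record at run time.  The function name `add_time_features` and the toy prices are hypothetical and are not part of the lab code.

```python
import pandas as pd

def add_time_features(df):
    """Derive calendar features from a DatetimeIndex (shared by training and inference)."""
    out = df.copy()
    out["HourOfDay"] = out.index.hour
    out["MinOfDay"] = out.index.hour * 60 + out.index.minute
    out["DayOfWeek"] = out.index.dayofweek
    return out

# Training time: a batch of historical records (made-up values)
train_raw = pd.DataFrame(
    {"EndPrice": [10.0, 10.2, 10.1]},
    index=pd.to_datetime(["2017-09-01 08:00", "2017-09-01 08:01", "2017-09-01 08:02"]),
)
train_features = add_time_features(train_raw)

# Run time: a single new record goes through the identical transformation
live_raw = pd.DataFrame({"EndPrice": [10.3]}, index=pd.to_datetime(["2017-09-04 09:30"]))
live_features = add_time_features(live_raw)

# Both frames end up with the same feature columns
print(list(train_features.columns) == list(live_features.columns))
```

Because both code paths call one function, a change to the feature logic cannot silently diverge between training and prediction.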

Review the code below; it is an amalgamation of the code from the previous two notebooks.  Run the cells to create a data set for use with hyperparameter tuning.

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import glob
from datetime import datetime
import boto3

**S3 bucket for training data**

The next cell defines the name of the S3 bucket used to store the training and validation data for hyperparameter optimization and training.  Specify a bucket name below and the S3 bucket will be created on your behalf.

In [None]:
YOUR_BUCKET_NAME = ' < YOUR S3 BUCKET NAME > '

In [None]:
prefix = 'hpo'
s3_hpo_uri = 's3://{}/{}/'.format (YOUR_BUCKET_NAME, prefix)

!aws s3 mb "s3://{YOUR_BUCKET_NAME}"

In [None]:
def read_s3_csv (dates):
    s3 = boto3.resource('s3')
    deutsche_boerse_bucket = 'deutsche-boerse-xetra-pds'
    
    bucket = s3.Bucket(deutsche_boerse_bucket)
    
    dataframes = []
    
    for date in dates:
        objs_count = 0
        csv_objects = bucket.objects.filter(Prefix=date)
        for csv_obj in csv_objects:
            csv_key = csv_obj.key
            if csv_key[-4:] == '.csv':
                objs_count += 1
                csv_body = csv_obj.get()['Body']
                df = pd.read_csv(csv_body)
                dataframes.append(df)
        
        print ("Loaded {} data objects for {}".format (objs_count, date))
    return pd.concat(dataframes)

In [None]:
def build_index(non_empty_days, from_time, to_time):
    date_ranges = []
    for date in non_empty_days:
        yyyy, mm, dd = date.split('-')
        from_hour, from_min = from_time.split(':')
        to_hour, to_min = to_time.split(':')    
        t1 = datetime(int(yyyy), int(mm), int(dd), int(from_hour),int(from_min),0)
        t2 = datetime(int(yyyy), int(mm), int(dd), int(to_hour),int(to_min),0) 
        date_ranges.append(pd.DataFrame({"OrganizedDateTime": pd.date_range(t1, t2, freq='1Min').values}))
    agg = pd.concat(date_ranges, axis=0) 
    agg.index = agg["OrganizedDateTime"]
    return agg

In [None]:
def basic_stock_features(input_df, mnemonic, new_time_index, inplace=False):
    stock = input_df.loc[mnemonic]
    if not inplace:
        stock = input_df.loc[mnemonic].copy()
    
    stock = stock.reindex(new_time_index)
    
    features = ['MinPrice', 'MaxPrice', 'EndPrice', 'StartPrice']
    for f in features:
        # Carry the last known price forward (fillna(method=...) is deprecated in recent pandas)
        stock[f] = stock[f].ffill()
    
    features = ['TradedVolume', 'NumberOfTrades']
    for f in features:
        stock[f] = stock[f].fillna(0.0)
        
    stock['HourOfDay'] = stock.index.hour
    stock['MinOfHour'] = stock.index.minute
    stock['MinOfDay'] = stock.index.hour*60 + stock.index.minute

    stock['DayOfWeek'] = stock.index.dayofweek
    stock['DayOfYear'] = stock.index.dayofyear
    stock['MonthOfYear'] = stock.index.month
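    # DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar() needs pandas 1.1 or later
    stock['WeekOfYear'] = stock.index.isocalendar().week.astype('int64').to_numpy()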
    
    stock['Mnemonic'] = mnemonic
    unwanted_features = ['ISIN', 'SecurityDesc', 'SecurityType', 'Currency', 'SecurityID', 'Date', 'Time', 'CalcTime']
    return stock.drop (unwanted_features, axis=1)
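
To see what the reindexing and filling steps in `basic_stock_features` do, here is a toy sketch on made-up prices (the data and variable names are illustrative only).  Minutes without trades inherit the last known price, while their traded volume is set to zero.

```python
import pandas as pd

# Two trades, at 08:00 and 08:03; minutes 08:01 and 08:02 have no trades
trades = pd.DataFrame(
    {"EndPrice": [10.0, 10.5], "TradedVolume": [100.0, 250.0]},
    index=pd.to_datetime(["2017-09-01 08:00", "2017-09-01 08:03"]),
)

# Complete minute-by-minute index for the interval
full_index = pd.date_range("2017-09-01 08:00", "2017-09-01 08:03", freq="1min")

filled = trades.reindex(full_index)                          # missing minutes become NaN rows
filled["EndPrice"] = filled["EndPrice"].ffill()              # prices are carried forward
filled["TradedVolume"] = filled["TradedVolume"].fillna(0.0)  # no trades means zero volume
print(filled)
```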

In [None]:
def clean_data (df, inplace = False):
    column_filter = ['ISIN', 'Mnemonic', 'SecurityDesc', 'SecurityType', 'Currency', 'SecurityID', 'Date', 'Time', 'StartPrice', 'MaxPrice', 'MinPrice', 'EndPrice', 'TradedVolume', 'NumberOfTrades']
    n_df = df[column_filter]
    if not inplace:
        n_df = n_df.copy ()

    # Remove stray header rows, where the Time column holds the literal string 'Time'
    n_df = n_df[n_df.Time != 'Time'].copy ()
    # we want the dates to be comparable to datetime.strptime()
    n_df["CalcTime"] = pd.to_datetime("1900-01-01 " + n_df["Time"], errors='coerce')
    n_df["CalcDateTime"] = pd.to_datetime(n_df["Date"] + " " + n_df["Time"], errors='coerce')

    # Filter common stock
    # Filter between trading hours 08:00 and 20:00
    # Exclude auctions (rows with TradedVolume == 0)
    only_common_stock = n_df[n_df.SecurityType == 'Common stock']
    time_fmt = "%H:%M"
    opening_hours_str = "08:00"
    closing_hours_str = "20:00"
    opening_hours = datetime.strptime(opening_hours_str, time_fmt)
    closing_hours = datetime.strptime(closing_hours_str, time_fmt)

    cleaned_common_stock = only_common_stock[(only_common_stock.TradedVolume > 0) & \
                      (only_common_stock.CalcTime >= opening_hours) & \
                      (only_common_stock.CalcTime <= closing_hours)]
    
    bymnemonic = cleaned_common_stock[['Mnemonic', 'TradedVolume']].groupby(['Mnemonic']).sum()
    number_of_stocks = 100
    top = bymnemonic.sort_values('TradedVolume', ascending=False).head(number_of_stocks)
    top_k_stocks = list(top.index.values)
    cleaned_common_stock = cleaned_common_stock[cleaned_common_stock.Mnemonic.isin(top_k_stocks)]
    sorted_by_index = cleaned_common_stock.set_index(['Mnemonic', 'CalcDateTime']).sort_index()
    non_empty_days = sorted(list(cleaned_common_stock['Date'].unique()))
    new_datetime_index = build_index(non_empty_days, opening_hours_str, closing_hours_str)["OrganizedDateTime"].values
    
    stocks = []
    for stock in top_k_stocks:
        stock = basic_stock_features(sorted_by_index, stock, new_datetime_index, inplace=True)
        stocks.append(stock)
    # prepared should contain the numeric features for all top k stocks,
    # for all days in the interval on which there were trades (excluding weekends and holidays),
    # for all minutes from 08:00 until 20:00.
    # In minutes without trades, prices are carried forward from the last available minute
    # and TradedVolume and NumberOfTrades are filled with zero.
    prepared = pd.concat(stocks, axis=0).dropna(how='any')
    prepared.Mnemonic = prepared.Mnemonic.astype('category')
    return prepared

In [None]:
def create_xgb_target (df):
    # Target is the next row's MaxPrice; the final row has no successor, so it is forward-filled
    return df.MaxPrice.shift(-1).ffill ()

In [None]:
def create_xgb_features (df, horizon, inplace = False):
    n_df = df
    if not inplace:
        n_df = df.copy ()
    
    for offset in range(1, horizon+1):
        # Lagged copies of each column; the first `offset` rows have no history and are back-filled
        min_price = n_df['MinPrice'].shift (offset).bfill ()
        max_price = n_df['MaxPrice'].shift (offset).bfill ()
        start_price = n_df['StartPrice'].shift (offset).bfill ()
        end_price = n_df['EndPrice'].shift (offset).bfill ()
        trade_vol = n_df['TradedVolume'].shift (offset).bfill ()
        num_trades = n_df['NumberOfTrades'].shift (offset).bfill ()
        
        n_df["h{}_MinPrice".format (offset)] = min_price
        n_df["h{}_MaxPrice".format (offset)] = max_price
        n_df["h{}_StartPrice".format (offset)] = start_price
        n_df["h{}_EndPrice".format (offset)] = end_price
        n_df["h{}_TradeVolume".format (offset)] = trade_vol
        n_df["h{}_NumberOfTrades".format (offset)] = num_trades
        
    return n_df
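
The lag features and the target are both built with `shift`.  The following toy sketch, on made-up prices and with a horizon of 2 rather than 5, shows the effect on a single column: `h1_` and `h2_` hold the values from one and two rows earlier, and `NextMaxPrice` holds the value from the following row.

```python
import pandas as pd

prices = pd.DataFrame({"MaxPrice": [10.0, 11.0, 12.0, 13.0]})

# Lag features: the value from `offset` rows earlier; leading gaps are back-filled
for offset in range(1, 3):
    prices["h{}_MaxPrice".format(offset)] = prices["MaxPrice"].shift(offset).bfill()

# Target: the value from the next row; the trailing gap is forward-filled
prices["NextMaxPrice"] = prices["MaxPrice"].shift(-1).ffill()
print(prices)
```

Note that the filled edge rows repeat neighbouring values rather than carrying real history, so the first and last few rows of each series are only approximations.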

In [None]:
def engineer_date_range (dates):
    unprocessed_df = read_s3_csv (dates)
    print ("Loaded CSV data set from S3")
    
    cleaned_df = clean_data (unprocessed_df, inplace = True)
    print ("Cleaned CSV data set")
     
    xgb_data = create_xgb_features (cleaned_df, 5, inplace=True)
    xgb_data['NextMaxPrice'] = create_xgb_target (xgb_data)
    print ("Engineered CSV data set")
    
    train_data, validate_data = train_test_split (xgb_data, train_size=0.8, test_size=0.2, shuffle=True)

    cols = list(train_data.columns.values)
    cols.remove ('NextMaxPrice')
    cols = ['NextMaxPrice'] + cols

    train_data = pd.get_dummies (train_data[cols])
    validate_data = pd.get_dummies (validate_data[cols])
    print ("Data split for training purposes")
    
    return (train_data, validate_data)
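
`engineer_date_range` calls `pd.get_dummies` on the training and validation splits separately.  This produces matching columns because `clean_data` converts `Mnemonic` to a categorical type, so every known category gets a column even when a split contains no rows for it.  A small sketch on made-up data (the values and variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"Mnemonic": ["BMW", "SAP", "BMW"], "Price": [1.0, 2.0, 3.0]})
df["Mnemonic"] = df["Mnemonic"].astype("category")

# The validation split happens to contain only BMW rows
train, validate = df.iloc[:2], df.iloc[2:]

train_dummies = pd.get_dummies(train)
validate_dummies = pd.get_dummies(validate)

# Both splits still expose a column for every category, including SAP
print(list(train_dummies.columns) == list(validate_dummies.columns))
```

Without the categorical type, a stock missing from one split would be missing from that split's columns, and the CSV files written without headers would no longer line up.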

In [None]:
hpo_data_folder = 'data/hpo'
train_output_folder = hpo_data_folder +'/train'
validate_output_folder = hpo_data_folder +'/validate'
! mkdir -p {train_output_folder}
! mkdir -p {validate_output_folder}

# Earliest possible date is 2017-06-17
from_date = '2017-09-01'
until_date = '2017-11-30'
dates = list(pd.date_range(from_date, until_date, freq='D').strftime('%Y-%m-%d'))

print ("Reading data for dates {} to {}".format (from_date, until_date))
train_df, validate_df = engineer_date_range (dates)

In [None]:
print ("Writing CSV data for dates {} to {}".format (from_date, until_date))

chunk_size = 100000
print ("Writing {} training records".format (train_df.shape[0]))
for chunk_index in range (0, train_df.shape[0], chunk_size):
    train_df[chunk_index:chunk_index + chunk_size].to_csv(train_output_folder + '/{}-{}_{}-{}.csv'.format (from_date, until_date, chunk_index, chunk_index + chunk_size), header=False, index=False)
    print ("Wrote training records {} to {}".format (chunk_index, chunk_index + chunk_size))

print ("Writing {} validation records".format (validate_df.shape[0]))
for chunk_index in range (0, validate_df.shape[0], chunk_size):
    validate_df[chunk_index:chunk_index + chunk_size].to_csv(validate_output_folder + '/{}-{}_{}-{}.csv'.format (from_date, until_date, chunk_index, chunk_index + chunk_size), header=False, index=False)
    print ("Wrote validation records {} to {}".format (chunk_index, chunk_index + chunk_size))
print ("Export to CSV complete")

In [None]:
!aws s3 sync {hpo_data_folder} {s3_hpo_uri}

## Hyperparameter tuning

---

During the previous lab we ran spot checks against multiple machine learning algorithms.  XGBoost looked worth pursuing, so we will run a hyperparameter tuning job on a subset of the overall data set to determine the most effective hyperparameters for the algorithm.  Using the data generated above, create a hyperparameter tuning job so that Amazon SageMaker can search the hyperparameter ranges for the best-performing combination.  In the next lab you will use these settings to train XGBoost on the entire data set.

### Using the AWS Console
1. Upload the train and validate data sets to an S3 bucket in your preferred region where SageMaker is available
1. From the SageMaker console click `Hyperparameter tuning jobs` --> `Create hyperparameter tuning job`
1. Give the tuning job a name, such as 'XGBoost-forecast-01'
1. Select `XGBoost` from the drop down of Amazon SageMaker built-in algorithms, click `Next`
1. For Objective metric select `validation:rmse`
1. Set the following hyperparameters as:
  1. num_round from 10 to 100
  1. eta from 0.2 to 0.9
  1. gamma from 0.1 to 9.0
  1. max_depth from 3 to 10
1. Click `Next`
1. Define two training channels pointing to the train and the validate data sets pushed earlier to S3
1. For the train data set specify
  1. `Channel name` to `train`
  1. `Content type` to `csv`
  1. `Compression type` and `Record wrapper` to `None`
  1. `S3 Data Type` to `S3Prefix`
  1. `S3 Data Distribution` to `Fully replicated`
  1. Specify the `URI` as the S3 train folder in `s3_hpo_uri`
  1. Set `Input mode` to `File`
1. Click `Add channel` and specify the validate data set
  1. `Channel name` to `validation`
  1. `Content type` to `csv`
  1. `Compression type` and `Record wrapper` to `None`
  1. `S3 Data Type` to `S3Prefix`
  1. `S3 Data Distribution` to `Fully replicated`
  1. Specify the `URI` as the S3 validate folder in `s3_hpo_uri`
  1. Set `Input mode` to `File`
1. Specify the `S3 output path` and click `Next`
1. Specify an instance type of `ml.m4.4xlarge`
1. Specify 20 GB of storage
1. Specify a maximum training job count of 6 and a maximum of 2 parallel training jobs
1. Click `Create jobs`

**Note:** Execution of this HPO job will take approximately 20 minutes.

### Using the API

In [None]:
tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "0.9",
          "MinValue": "0.1",
          "Name": "eta"
        },
        {
          "MaxValue": "2",
          "MinValue": "0",
          "Name": "alpha"
        },
        {
          "MaxValue": "9.0",
          "MinValue": "0.1",
          "Name": "gamma"
        },
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "min_child_weight"
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "10",
          "MinValue": "3",
          "Name": "max_depth"
        },
        {
          "MaxValue": "100",
          "MinValue": "10",
          "Name": "num_round"
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 10,
      "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:rmse",
      "Type": "Minimize"
    }
  }

In [None]:
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker import get_execution_role

role = get_execution_role()
training_image = get_image_uri(boto3.Session().region_name, 'xgboost')

s3_input_train = 's3://{}/{}/train/'.format(YOUR_BUCKET_NAME, prefix)
s3_input_validation = 's3://{}/{}/validate/'.format(YOUR_BUCKET_NAME, prefix)
     
training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output".format(YOUR_BUCKET_NAME, prefix)
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.c5.4xlarge",
      "VolumeSizeInGB": 20
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "rmse",
      "objective": "reg:linear",
      "rate_drop": "0.3",
      "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}

In [None]:
import time
tuning_job_name = "forecast-tuning-{}".format (datetime.now ().strftime("%d%H%M"))

smclient = boto3.client ('sagemaker')
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                           HyperParameterTuningJobConfig = tuning_job_config,
                                           TrainingJobDefinition = training_job_definition)
status = smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)['HyperParameterTuningJobStatus']
print(status)
# Poll once a minute until the job reaches a terminal state (a stopped job would otherwise loop forever)
while status not in ('Completed', 'Failed', 'Stopped'):
    time.sleep(60)
    status = smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)['HyperParameterTuningJobStatus']
    print(status)

## Note optimal hyperparameter values

---

Once the hyperparameter tuning job has completed, make a note of the hyperparameters that produced the best-performing model.  You will need to specify these values in the next lab when training on the full data set.
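
The values can also be read programmatically.  The response of `describe_hyper_parameter_tuning_job` contains a `BestTrainingJob` entry once at least one training job has finished.  The sketch below uses a hypothetical helper, `best_hyperparameters`, and a hand-written stand-in response with made-up values so that it runs without an AWS connection.

```python
def best_hyperparameters(tuning_job_description):
    """Return (training job name, tuned hyperparameters, objective value) for the best job."""
    best = tuning_job_description["BestTrainingJob"]
    metric = best["FinalHyperParameterTuningJobObjectiveMetric"]
    return best["TrainingJobName"], best["TunedHyperParameters"], metric["Value"]

# In the notebook, the live response would come from:
#   description = smclient.describe_hyper_parameter_tuning_job(
#       HyperParameterTuningJobName=tuning_job_name)
# The dictionary below is a stand-in with made-up values, not real results.
description = {
    "BestTrainingJob": {
        "TrainingJobName": "example-job-001",
        "TunedHyperParameters": {"eta": "0.3", "max_depth": "6"},
        "FinalHyperParameterTuningJobObjectiveMetric": {
            "MetricName": "validation:rmse",
            "Value": 0.42,
        },
    }
}

name, params, rmse = best_hyperparameters(description)
print(name, params, rmse)
```

The tuned values are returned as strings, which is also the form SageMaker expects when they are passed back in as static hyperparameters.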