## Training
Now we know that most of our features have skewed distributions, some are highly correlated with one another, and some appear to have non-linear relationships with our target variable.  Also, for targeting future prospects, good predictive accuracy is preferred to being able to explain why that prospect was targeted.  Taken together, these aspects make gradient boosted trees a good candidate algorithm.

There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees work by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this, the collection of simple models can outperform large, complex models.  Other Amazon SageMaker notebooks elaborate further on gradient boosted trees and how they differ from similar algorithms.
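
As a rough illustration of that idea, the sketch below boosts depth-1 trees (stumps) on a toy one-feature regression problem.  It is a simplification that assumes squared-error loss, so each new stump is fit to the residuals of the current ensemble; `xgboost` itself uses a regularized objective and many optimizations not shown here.  The function names and the toy data are made up for this example.

```python
# Toy gradient boosting with decision stumps (pure Python, illustrative only).
# Assumes squared-error loss and at least two distinct feature values.

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, num_round=50, eta=0.2):
    """Fit num_round stumps, each one to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate eta
        predictions = [p + eta * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, stumps, eta


def predict(model, x):
    base, stumps, eta = model
    return base + eta * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]  # a non-linear target

model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(f"baseline MSE: {baseline_mse:.2f}, boosted MSE: {boosted_mse:.2f}")
```

The `eta` and `num_round` arguments play the same roles as the hyperparameters of the same names used later in this notebook: a smaller `eta` makes each tree contribute less, so more rounds are needed.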

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

In [None]:
import boto3
import sagemaker
import pandas as pd

In [None]:
%store -r

%store

Because we're training with the CSV file format, we'll create `TrainingInput` objects that the training job can use as pointers to the files in S3 and that also specify the content type as CSV.

In [None]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_path.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=validation_path.format(bucket, prefix), content_type='csv')
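
The training script later in this section reads these files with `header=None` and takes the label from the first column.  A quick local check of that layout can catch a stray header row before a training job is launched.  The sketch below is one way to do it with the standard library; `check_training_csv` and the sample rows are hypothetical, and the 0/1 label check assumes the binary objective used in this notebook.

```python
import csv
import io


def check_training_csv(text):
    """Return (n_rows, n_features) for headerless, all-numeric CSV with a 0/1 label first."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("file is empty")
    width = len(rows[0])
    for line_number, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(
                f"line {line_number}: expected {width} columns, found {len(row)}")
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            raise ValueError(
                f"line {line_number}: non-numeric value (is there a header row?)") from None
        if values[0] not in (0.0, 1.0):
            raise ValueError(f"line {line_number}: label must be 0 or 1")
    return len(rows), width - 1


# Made-up sample: label first, then three numeric features
sample = "0,35,1,0\n1,52,0,1\n0,41,0,0\n"
print(check_training_csv(sample))  # (3, 3)
```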

First we'll need to specify training parameters for the estimator.  These include:
1. The `xgboost` framework container version and the training script to run in it
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` function which specifies:
1. S3 location for input data.  In this case we have both a training and a validation set, which are passed in as separate channels.
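
When a custom training script is used, SageMaker passes the estimator's hyperparameters to the script as command-line arguments of the form `--name value`.  The sketch below shows how `argparse.parse_known_args` handles that, using a hypothetical command line: arguments the parser declares are converted and returned, while undeclared ones are set aside rather than raising an error.  A hyperparameter without a matching `add_argument` call is therefore silently ignored.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse the declared hyperparameters and return (args, unknown_arguments)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--num_round", type=int, default=100)
    return parser.parse_known_args(argv)


# Hypothetical command line; --colsample_bytree is not declared by the parser above
argv = ["--eta", "0.1", "--num_round", "50", "--colsample_bytree", "0.8"]
args, unknown = parse_hyperparameters(argv)

print(args.eta, args.num_round)  # 0.1 50
print(unknown)                   # ['--colsample_bytree', '0.8']
```

Inspecting the second return value, instead of discarding it, is a cheap way to notice hyperparameters that never reach the model.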

In [None]:
%%writefile ./src/training.py
# Training algorithm
import os
import xgboost
import numpy as np
import pandas as pd
import mlflow
import argparse
from sklearn.metrics import roc_auc_score,accuracy_score,precision_score,recall_score

 
def _parse_args():

    parser = argparse.ArgumentParser()
    
    # XGBoost hyperparameters
    parser.add_argument('--max_depth', type=int, default=5, help='Maximum tree depth (default: 5)')
    parser.add_argument('--eta', type=float, default=0.2, help='Learning rate (default: 0.2)')
    parser.add_argument('--gamma', type=float, default=4, help='Minimum loss reduction required to make a further partition (default: 4)')
    parser.add_argument('--min_child_weight', type=int, default=6, help='Minimum sum of instance weight needed in a child (default: 6)')
    parser.add_argument('--subsample', type=float, default=0.8, help='Subsample ratio of training instances (default: 0.8)')
    parser.add_argument('--silent', type=int, default=0, help='Logging mode: 1 suppresses training messages (default: 0)')
    parser.add_argument('--objective', type=str, default='binary:logistic', help='Learning task objective (default: binary:logistic)')
    parser.add_argument('--num_round', type=int, default=100, help='Number of boosting rounds/trees (default: 100)')

    # SageMaker-specific arguments. Defaults are read from environment variables set in the training container.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

    return parser.parse_known_args()

if __name__=="__main__":
    # Process arguments; hyperparameters not declared above are ignored
    args, _ = _parse_args()

    train_data = pd.read_csv(os.path.join(args.train, 'train.csv'), header=None)
    validation_data = pd.read_csv(os.path.join(args.validation, 'validation.csv'), header=None)

    # Labels are in the first column; the remaining columns are features
    train_y = train_data.iloc[:, 0]
    train_X = train_data.iloc[:, 1:]
    validation_y = validation_data.iloc[:, 0]
    validation_X = validation_data.iloc[:, 1:]

    mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
    print(f"logging experiment to {os.environ['MLFLOW_TRACKING_ARN']}")
    mlflow.set_experiment(os.environ['EXPERIMENT_NAME'])

    with mlflow.start_run(run_name="Training") as run:
        mlflow.autolog()

        # Create the DMatrix objects
        dtrain = xgboost.DMatrix(train_X, label=train_y)
        dval = xgboost.DMatrix(validation_X, label=validation_y)
        watchlist = [(dtrain, "train"), (dval, "validation")]

        # Booster parameters. The number of rounds goes to train() as
        # num_boost_round, and the deprecated `silent` flag maps to `verbosity`.
        param_dist = {
            "max_depth": args.max_depth,
            "eta": args.eta,
            "gamma": args.gamma,
            "min_child_weight": args.min_child_weight,
            "subsample": args.subsample,
            "verbosity": 0 if args.silent else 1,
            "objective": str(args.objective),
        }
    
        xgb = xgboost.train(
            params=param_dist,
            dtrain=dtrain,
            evals=watchlist,
            num_boost_round=args.num_round)
    
        predictions = xgb.predict(dval)
    
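        # Threshold the predicted probabilities at 0.5 to get class labels
        rounded_predict = np.round(predictions)

        # Confusion matrix of actual vs. predicted labels
        print(pd.crosstab(index=validation_y, columns=rounded_predict,
                          rownames=['Actuals'], colnames=['Predictions'], margins=True))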
    
        val_accuracy = accuracy_score(validation_y, rounded_predict)
        val_precision = precision_score(validation_y, rounded_predict)
        val_recall = recall_score(validation_y, rounded_predict)
    
        print("Accuracy Model A: %.2f%%" % (val_accuracy * 100.0))            
        print("Precision Model A: %.2f" % (val_precision))
        print("Recall Model A: %.2f" % (val_recall))
        
        # Log additional metrics, next to the default ones logged automatically
        mlflow.log_metric("Accuracy Model A", val_accuracy * 100.0)
        mlflow.log_metric("Precision Model A", val_precision)
        mlflow.log_metric("Recall Model A", val_recall)
            
        val_auc = roc_auc_score(validation_y, predictions)
        
        print("Validation AUC A: %.2f" % (val_auc))
        mlflow.log_metric("Validation AUC A", val_auc)
    
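        # Save the model where SageMaker expects it (SM_MODEL_DIR) so it is uploaded to S3
        model_file_path = os.path.join(args.model_dir, "xgboost_model.bin")
        os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
        xgb.save_model(model_file_path)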
    

In [None]:
%%writefile ./src/requirements.txt
mlflow==2.17.0
sagemaker-mlflow

In [None]:
from sagemaker.xgboost.estimator import XGBoost

xgb = XGBoost(
    entry_point='training.py',
    source_dir='./src',
    framework_version='1.7-1',
    role=sagemaker_role,
    instance_count=1, 
    instance_type='ml.m5.xlarge',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    environment={"MLFLOW_TRACKING_ARN": tracking_server_arn, "EXPERIMENT_NAME": "bank-marketing"},
    keep_alive_period_in_seconds=3600)

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    silent=0,
    num_round=100
)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 