# Demo - Predicting Customer Churn with XGBoost

---


## Contents

1. [Background: business context](#Background) 
1. [Setup: environment variables and dependencies](#Setup)
1. [Data: preprocessing](#Data)
1. [Train: model training](#Train)
  - [Experiment: experiment configuration](#Experiment)
  - [Debugger: debugging configuration](#Debugger)
1. [Compile-Neo: Neo compilation](#Compile)
1. [Host: model hosting for inference](#Host)
  - [Evaluate: model evaluation](#Evaluate)
  - [ModelMonitor: model monitoring](#ModelMonitor)
1. [Extensions](#Extensions)
  - [AutoModelTuning: automatic model tuning](#AutoModelTuning)
  - [CleanUp: clean up the environment](#CleanUp)

---

## Background

_This notebook has been adapted from an [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/)_

Losing customers is costly for any business.  Identifying unhappy customers early on gives you a chance to offer them incentives to stay.  This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.

We use an example of churn that is familiar to all of us–leaving a mobile phone operator.  Seems like I can always find fault with my provider du jour! And if my provider knows that I’m thinking of leaving, it can offer timely incentives–I can always use a phone upgrade or perhaps have a new feature activated–and I might just stick around. Incentives are often much more cost effective than losing and reacquiring a customer.

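<font color=#00BFFF size=3>In short: losing customers is costly for any business. Identifying unhappy customers early gives you a chance to keep them, and incentives are usually more cost effective than losing and reacquiring a customer.</font>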

---

## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:


- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting. <font color=#00BFFF size=3>Make sure the notebook and the S3 bucket are in the same region.</font>
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s). <font color=#00BFFF size=3>The IAM role lets the notebook instance access the data in S3.</font>



In [None]:
import sys
import IPython
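# Install the SageMaker Experiments library
!{sys.executable} -m pip install -U sagemaker-experiments
# Install the SageMaker Debugger library
!{sys.executable} -m pip install -U smdebug
# Restart the kernel (uncomment if the newly installed packages are not picked up)
# IPython.Application.instance().kernel.do_shutdown(True)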

In [None]:
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'sagemaker/demo-xgboost-churn'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

---
## Data

Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes–after all, predicting the future is tricky business! But I’ll also show how to deal with prediction errors.

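<font color=#00BFFF size=3>We train a model on historical usage records covering many attributes of mobile customers, then use it to predict whether a customer will churn.</font>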

In [None]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt ./

churn = pd.read_csv('./churn.txt')
pd.set_option('display.max_columns', 500)
churn

By modern standards, it’s a relatively small dataset, with only 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:

- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
- `Account Length`: the number of days that this account has been active
- `Area Code`: the three-digit area code of the corresponding customer’s phone number
- `Phone`: the remaining seven-digit phone number
- `Int’l Plan`: whether the customer has an international calling plan: yes/no
- `VMail Plan`: whether the customer has a voice mail feature: yes/no
- `VMail Message`: presumably the average number of voice mail messages per month
- `Day Mins`: the total number of calling minutes used during the day
- `Day Calls`: the total number of calls placed during the day
- `Day Charge`: the billed cost of daytime calls
- `Eve Mins`, `Eve Calls`, `Eve Charge`: the minutes used, number of calls, and billed cost for calls placed during the evening
- `Night Mins`, `Night Calls`, `Night Charge`: the minutes used, number of calls, and billed cost for calls placed during nighttime
- `Intl Mins`, `Intl Calls`, `Intl Charge`: the minutes used, number of calls, and billed cost for international calls
- `CustServ Calls`: the number of calls placed to Customer Service
- `Churn?`: whether the customer left the service: true/false

The last attribute, `Churn?`, is known as the target attribute–the attribute that we want the ML model to predict.  Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.

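XGBoost is a gradient boosted decision tree model that supports both classification and regression; this demo uses binary classification.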

Let's begin exploring the data:

In [None]:
# Frequency tables for each categorical feature (a crosstab is a pivot table that counts group frequencies)
for column in churn.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=churn[column], columns='% observations', normalize='columns'))

# Summary statistics for each numeric feature
# pandas has two core data structures: Series (one-dimensional) and DataFrame (two-dimensional).
# describe() returns summary statistics for either one, which helps gauge the range, scale,
# and spread of the data before choosing a model.

# Statistics reported by describe():
# count: number of non-missing values in the column
# mean: mean
# std: standard deviation
# min: minimum
# 25%: first quartile
# 50%: median
# 75%: third quartile
# max: maximum
display(churn.describe())

# Histograms for each numeric feature
%matplotlib inline
hist = churn.hist(bins=30, sharey=True, figsize=(10, 10))

We can see immediately that:
- `State` appears to be quite evenly distributed.
- `Phone` takes on too many unique values to be of any practical use.  It's possible that parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it.
- Most of the numeric features are surprisingly nicely distributed, with many showing bell-like gaussianity.  `VMail Message` is a notable exception, and `Area Code` shows up as a feature we should convert to non-numeric.
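
The `Phone` observation can be checked mechanically: a column whose values are almost all distinct carries no signal a model can reuse. Below is a minimal sketch in plain Python. The four sample rows and the `unique_ratio` helper are made up for illustration and are not part of the notebook.

```python
def unique_ratio(values):
    """Fraction of values that are distinct: 1.0 means every row is unique."""
    return len(set(values)) / len(values)

# Hypothetical sample rows, not taken from churn.txt.
sample = {
    "State": ["OH", "NJ", "OH", "NJ"],
    "Phone": ["382-4657", "371-7191", "358-1921", "375-9999"],
}

ratios = {name: unique_ratio(column) for name, column in sample.items()}
print(ratios)  # {'State': 0.5, 'Phone': 1.0}
```

A ratio close to 1.0 on the full dataset would be a reasonable signal to drop the column, as the next cell does for `Phone`.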

In [None]:
churn = churn.drop('Phone', axis=1)
churn['Area Code'] = churn['Area Code'].astype(object)

Next let's look at the relationship between each of the features and our target variable.

In [None]:
for column in churn.select_dtypes(include=['object']).columns:
    if column != 'Churn?':
        display(pd.crosstab(index=churn[column], columns=churn['Churn?'], normalize='columns'))

for column in churn.select_dtypes(exclude=['object']).columns:
    print(column)
    hist = churn[[column, 'Churn?']].hist(by='Churn?', bins=30)
    plt.show()

In [None]:
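# Correlate only the numeric columns; object columns such as State cannot be correlated
display(churn.select_dtypes(include='number').corr())

# scatter_matrix draws a grid of pairwise scatter plots
pd.plotting.scatter_matrix(churn, figsize=(12, 12))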
plt.show()

We see several features that essentially have 100% correlation with one another.  Including these feature pairs in some machine learning algorithms can create catastrophic problems, while in others it will only introduce minor redundancy and bias.  Let's remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Eve Charge from the pair with Eve Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins:

<font color=#00BFFF size=3>Several features are essentially 100% correlated with one another. Including such feature pairs can cause catastrophic problems in some machine learning algorithms and only minor redundancy and bias in others, so we drop the charge-related features.</font>
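
To see why a charge column adds nothing beyond its minutes column, here is a small sketch that computes the Pearson correlation by hand. It assumes the charge is a flat per-minute rate times the minutes; the `RATE` value and the sample minutes are hypothetical, not read from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Assumed flat per-minute rate; the real tariff is not stated in the data.
RATE = 0.17
day_mins = [120.0, 265.1, 161.6, 243.4, 299.4]
day_charge = [RATE * minutes for minutes in day_mins]

r = pearson(day_mins, day_charge)
print(round(r, 6))  # 1.0
```

Any column that is a positive multiple of another correlates perfectly with it, so keeping both only duplicates information.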


In [None]:
churn = churn.drop(['Day Charge', 'Eve Charge', 'Night Charge', 'Intl Charge'], axis=1)

Now that we've cleaned up our dataset, let's determine which algorithm to use.  As mentioned above, there appear to be some variables where both high and low (but not intermediate) values are predictive of churn.  In order to accommodate this in an algorithm like linear regression, we'd need to generate polynomial (or bucketed) terms.  Instead, let's attempt to model this problem using gradient boosted trees.  Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting, and then host as a real-time prediction endpoint.  XGBoost uses gradient boosted trees which naturally account for non-linear relationships between features and the target variable, as well as accommodating complex interactions between features.

Amazon SageMaker XGBoost can train on data in either a CSV or LibSVM format.  For this example, we'll stick with CSV.  It should:
- Have the target variable in the first column
- Not have a header row
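
The two layout rules above can be sketched with the standard `csv` module. The records, the `churn` key, and the feature names are hypothetical placeholders chosen for this example.

```python
import csv
import io

# Hypothetical records; the target value is stored under the "churn" key.
rows = [
    {"churn": 1, "day_mins": 265.1, "custserv_calls": 1},
    {"churn": 0, "day_mins": 161.6, "custserv_calls": 0},
]
feature_names = ["day_mins", "custserv_calls"]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in rows:
    # Target first, then the features in a fixed order; no header row is written.
    writer.writerow([row["churn"]] + [row[name] for name in feature_names])

csv_text = buffer.getvalue()
print(csv_text)
```

The notebook achieves the same layout with pandas by moving the target column to the front and calling `to_csv(..., header=False, index=False)`.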

But first, let's convert our categorical features into numeric features.

In [None]:
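# Assemble the final dataset: one-hot encode the categorical columns as 0/1 integers
# (recent pandas versions default to booleans, which would be written to CSV as True/False)
model_data = pd.get_dummies(churn, dtype=int)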
model_data = pd.concat([model_data['Churn?_True.'], model_data.drop(['Churn?_False.', 'Churn?_True.'], axis=1)], axis=1)
model_data
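
What `pd.get_dummies` does to each categorical column can be sketched in a few lines of plain Python. The `one_hot` helper and its input values are illustrative only.

```python
def one_hot(values):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical values for a single categorical column.
columns, encoded = one_hot(["yes", "no", "no", "yes"])
print(columns)  # ['no', 'yes']
print(encoded)  # [[0, 1], [1, 0], [1, 0], [0, 1]]
```

A column with many categories, such as `State`, therefore expands into one indicator column per distinct value.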

And now let's split the data into training, validation, and test sets.  This will help prevent us from overfitting the model, and allow us to test the model's accuracy on data it hasn't already seen.

In [None]:
# Split into three datasets: training, validation, and test
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
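
The cut points passed to `np.split` above yield a 70% / 20% / 10% split. A plain-Python sketch of the same idea follows. The `split_indices` helper is hypothetical, and its shuffle will not reproduce the row order that pandas produces for the same seed.

```python
import random

def split_indices(n_rows, train_pct=70, validation_pct=20, seed=1729):
    """Shuffle row indices, then cut them into train / validation / test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    train_end = n_rows * train_pct // 100
    validation_end = n_rows * (train_pct + validation_pct) // 100
    return (indices[:train_end],
            indices[train_end:validation_end],
            indices[validation_end:])

train_idx, validation_idx, test_idx = split_indices(5000)
print(len(train_idx), len(validation_idx), len(test_idx))  # 3500 1000 500
```

Shuffling before cutting matters: without it, any ordering in the source file would leak into the three sets.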

Now we'll upload these files to S3.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

---
## Train

Moving on to training, first we'll need to specify the locations of the XGBoost algorithm containers.

In [None]:
# Retrieve the XGBoost container image URI
container = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, '1.0-1')
display(container)

Then, because we're training with the CSV file format, we'll create `TrainingInput`s that our training function can use as a pointer to the files in S3.

In [None]:
s3_input_train = TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  This is essentially the subsequent models that are trained using the residuals of previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  Smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.
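
The interplay between `eta` and `num_round` can be seen in a deliberately stripped-down sketch. It ignores tree structure entirely: each round "fits" the current residual exactly and then adds only a fraction `eta` of that correction. The `boost_constant` function is an illustration, not XGBoost's algorithm.

```python
def boost_constant(target, eta, num_round):
    """Boost toward a single target value; each round adds eta times the residual."""
    prediction = 0.0
    for _ in range(num_round):
        residual = target - prediction  # what the next round would try to fit
        prediction += eta * residual    # shrink the correction by eta
    return prediction

for eta in (0.1, 0.3, 1.0):
    print(eta, round(boost_constant(10.0, eta, num_round=6), 3))
# 0.1 4.686
# 0.3 8.824
# 1.0 10.0
```

With the same six rounds, a smaller `eta` leaves the model further from the training target, which is why low `eta` values are usually paired with a larger `num_round`.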

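More detail on XGBoost's hyperparameters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).
- `eta` [default=0.3]: step size shrinkage used in each update to prevent overfitting. After each boosting step the weights of the new features are available directly, and `eta` shrinks them to make the boosting process more conservative. Range: [0, 1]
- `gamma` [default=0]: minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm. Range: [0, ∞]
- `max_depth` [default=6]: maximum depth of a tree. Range: [1, ∞]
- `min_child_weight` [default=1]: minimum sum of instance weights needed in a child. If a split would produce a leaf whose instance weight sum is below `min_child_weight`, the tree stops splitting there. In linear regression mode this corresponds to the minimum number of instances required in each node. The larger the value, the more conservative the algorithm. Range: [0, ∞]
- `max_delta_step` [default=0]: maximum delta step allowed for each tree's weight estimation. A value of 0 means no constraint; a positive value makes the update step more conservative. This parameter is usually not needed, but it may help in logistic regression when the classes are extremely imbalanced, where a value between 1 and 10 may help control the update. Range: [0, ∞]
- `subsample` [default=1]: fraction of the training set sampled to build each tree. Setting it to 0.5 means XGBoost randomly draws 50% of the training data to grow each tree, which helps prevent overfitting. Range: (0, 1]
- `colsample_bytree` [default=1]: fraction of features sampled when building each tree. Range: (0, 1]
- `colsample_bylevel` [default=1]: fraction of features sampled for each level of splits within a tree. It is rarely needed, because `subsample` and `colsample_bytree` together do a similar job.
- `lambda` [default=1]: L2 regularization term on weights
- `alpha` [default=0]: L1 regularization term on weights. Useful when there are very many features, where it can speed up the algorithm.
- `n_estimators`: number of boosting iterations, that is, the number of trees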
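
How `gamma`, `lambda`, and `min_child_weight` act on a single candidate split can be sketched with the split gain formula from the XGBoost paper. The helper names and the gradient/hessian sums below are hypothetical, and the acceptance rule is a simplification of what the library does when it prunes.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, following the XGBoost paper's formula."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

def accept_split(g_left, h_left, g_right, h_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Keep a split only if both children are heavy enough and the gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0

# Hypothetical gradient/hessian sums for the two children of one candidate split.
stats = dict(g_left=-3.0, h_left=3.0, g_right=3.0, h_right=3.0)
print(accept_split(**stats, gamma=0.0))                        # True
print(accept_split(**stats, gamma=4.0))                        # False
print(accept_split(**stats, gamma=0.0, min_child_weight=9.0))  # False
```

The same split is accepted with no penalty, rejected once `gamma` exceeds its raw gain of 2.25, and rejected when the children are lighter than `min_child_weight`.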

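AUC: because it accounts for class imbalance, AUC is the most commonly used metric for evaluating this kind of model; higher is better.
- Probabilistically, AUC is the chance that a randomly chosen positive sample scores higher than a randomly chosen negative sample
- An AUC of 0.5 corresponds to random guessing and 1 is the maximum; higher is better
- AUC = 1: a perfect classifier, which predicts perfectly at any threshold. In most practical settings no perfect classifier exists.
- 0.5 < AUC < 1: better than random guessing. With a well-chosen threshold, such a classifier (model) has predictive value
- A useful model therefore has an AUC in (0.5, 1], and the closer to 1 the better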
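
The probabilistic reading of AUC translates directly into code: count the (positive, negative) pairs in which the positive sample gets the higher score. The sketch below uses made-up labels and scores, and it is quadratic in the number of samples, so it is meant for illustration rather than for large datasets.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = pairwise_auc(labels, scores)
print(round(auc, 3))  # 0.778
```

Seven of the nine positive/negative pairs are ordered correctly here, which gives 7/9.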

In [None]:
# Import the Debugger classes
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig

sess = sagemaker.Session()

create_time = strftime("%Y-%m-%d-%H-%M-%S")
save_interval = 5

# Create an experiment so that metrics from multiple runs can be compared
customer_churn_experiment = Experiment.create(experiment_name="demo-customer-churn-experiment-{}".format(create_time),
                                              description="Using xgboost to predict customer churn",
                                              sagemaker_boto_client=boto3.client('sagemaker'))

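# Set the hyperparameters
hyperparas = {"max_depth":5, # maximum tree depth; larger values overfit more easily
              "eta":0.2, # step size shrinkage, similar to a learning rate
              "gamma":4, # minimum loss reduction required to split a node; larger is more conservative
              "min_child_weight":9,
              "subsample":0.8, # fraction of training rows sampled for each tree
              "silent":0, # 1 suppresses training log output; 0 keeps it
              "objective":'binary:logistic',
              "eval_metric":'auc', 
              "num_round":6} # number of boosting rounds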

trial = Trial.create(trial_name="demo-trial-{}-weight-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime()), hyperparas["min_child_weight"]),
                    experiment_name=customer_churn_experiment.experiment_name,
                    sagemaker_boto_client=boto3.client('sagemaker'))

job_name = "demo-customer-churn-job"

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    hyperparameters=hyperparas,
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    # Train on Spot instances
                                    use_spot_instances=True,
                                    max_run=3600,
                                    max_wait=7200,
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess,
                                    # Debugger hook configuration
                                    debugger_hook_config=DebuggerHookConfig(
                                        s3_output_path='s3://{}/{}/debugger'.format(bucket, prefix),  # Required
                                        collection_configs=[
                                            CollectionConfig(
                                                name="metrics",
                                                parameters={
                                                    "save_interval": str(save_interval)
                                                }
                                            ),
                                            CollectionConfig(
                                                name="feature_importance",
                                                parameters={
                                                    "save_interval": str(save_interval)
                                                }
                                            ),
                                            CollectionConfig(
                                                name="full_shap",
                                                parameters={
                                                    "save_interval": str(save_interval)
                                                }
                                            ),
                                            CollectionConfig(
                                                name="average_shap",
                                                parameters={
                                                    "save_interval": str(save_interval)
                                                }
                                            ),
                                        ],
                                    ),

                                    # Debugger rules
                                    rules=[
                                        # Detect whether the loss is decreasing at an adequate rate
                                        Rule.sagemaker(
                                            rule_configs.loss_not_decreasing(),
                                            rule_parameters={
                                                "collection_names": "metrics",
                                                "num_steps": str(save_interval * 2),
                                            },
                                        ),
                                        # Detect overtraining
                                        Rule.sagemaker(rule_configs.overtraining()),
                                        # Detect whether the model is overfitting the training data
                                        Rule.sagemaker(rule_configs.overfit())
                                    ],)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation},
        # Experiment configuration
       experiment_config={"ExperimentName":customer_churn_experiment.experiment_name,
                         "TrialName":trial.trial_name,
                         "TrialComponentDisplayName":"Training"}) 

In [None]:
import time

for _ in range(36):
    job_name = xgb.latest_training_job.name
    client = xgb.sagemaker_session.sagemaker_client
    description = client.describe_training_job(TrainingJobName=job_name)
    training_job_status = description["TrainingJobStatus"]
    rule_job_summary = xgb.latest_training_job.rule_job_summary()
    rule_evaluation_status = rule_job_summary[0]["RuleEvaluationStatus"]
    print("Training job status: {}, Rule Evaluation Status: {}".format(training_job_status, rule_evaluation_status))
    
    if training_job_status in ["Completed", "Failed"]:
        break

    time.sleep(10)

In [None]:
xgb.latest_training_job.rule_job_summary()

In [None]:
# from smdebug.trials import create_trial

# s3_output_path = xgb.latest_job_debugger_artifacts_path()
# trial = create_trial(s3_output_path)

In [None]:
# trial.tensor_names()

In [None]:
# trial.tensor("average_shap/f1").values()

In [None]:
# import matplotlib.pyplot as plt
# import seaborn as sns
# import re


# def get_data(trial, tname):
#     """
#     For the given tensor name, walks though all the iterations
#     for which you have data and fetches the values.
#     Returns the set of steps and the values.
#     """
#     tensor = trial.tensor(tname)
#     steps = tensor.steps()
#     vals = [tensor.value(s) for s in steps]
#     return steps, vals

# def plot_collection(trial, collection_name, regex='.*', figsize=(8, 6)):
#     """
#     Takes a `trial` and a collection name, and 
#     plots all tensors that match the given regex.
#     """
#     fig, ax = plt.subplots(figsize=figsize)
#     sns.despine()

#     tensors = trial.collection(collection_name).tensor_names

#     for tensor_name in sorted(tensors):
#         if re.match(regex, tensor_name):
#             steps, data = get_data(trial, tensor_name)
#             ax.plot(steps, data, label=tensor_name)

#     ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
#     ax.set_xlabel('Iteration')

In [None]:
# plot_collection(trial, "metrics")

In [None]:
# def plot_feature_importance(trial, importance_type="weight"):
#     SUPPORTED_IMPORTANCE_TYPES = ["weight", "gain", "cover", "total_gain", "total_cover"]
#     if importance_type not in SUPPORTED_IMPORTANCE_TYPES:
#         raise ValueError(f"{importance_type} is not one of the supported importance types.")
#     plot_collection(
#         trial,
#         "feature_importance",
#         regex=f"feature_importance/{importance_type}/.*")

In [None]:
# plot_feature_importance(trial)

In [None]:
# plot_feature_importance(trial, importance_type="cover")

In [None]:
# plot_collection(trial,"average_shap")

---
## Compile
[Amazon SageMaker Neo](https://aws.amazon.com/sagemaker/neo/) optimizes models to run up to twice as fast, with no loss in accuracy. When calling the `compile_model()` function, we specify the target instance family (m4) as well as the S3 location where the compiled model will be stored.

In [None]:
compiled_model = xgb
output_path = '/'.join(xgb.output_path.split('/')[:-1])
# Compile the model with Neo to optimize inference performance
compiled_model = xgb.compile_model(target_instance_family='ml_m4', 
                                   input_shape={'data': [1, 69]},
                                   role=role,
                                   framework='xgboost',
                                   framework_version='latest',
                                   output_path=output_path)
compiled_model.name = 'deployed-xgboost-customer-churn'

---
## Host

Now that we've trained the algorithm, let's create a model and deploy it to a hosted endpoint.

In [None]:
from datetime import datetime, timedelta, timezone
from sagemaker.model_monitor import DataCaptureConfig

# Configure data capture for Model Monitor
data_capture_prefix = f'{prefix}/datacapture'
s3_capture_upload_path = f's3://{bucket}/{data_capture_prefix}'

endpoint_name = f"xgb-customer-churn-model-quality-monitor-{datetime.utcnow():%Y-%m-%d-%H%M}"
print("EndpointName =", endpoint_name)

data_capture_config = DataCaptureConfig(
                        enable_capture=True,
                        sampling_percentage=100,
                        destination_s3_uri=s3_capture_upload_path)

# Deploy the model to an endpoint that serves the inference API
xgb_predictor = compiled_model.deploy(
    initial_instance_count = 1, 
    instance_type = 'ml.m4.xlarge',
    endpoint_name=endpoint_name,
    serializer=CSVSerializer(),
    data_capture_config=data_capture_config)

In [None]:
print(s3_capture_upload_path)

### Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model very easily, simply by making an HTTP POST request.  But first, we'll need to set up serializers and deserializers for passing our `test_data` NumPy arrays to the model behind the endpoint.

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array
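
The steps above can be sketched without a live endpoint. In this example `fake_endpoint` is a stand-in with a made-up scoring rule, and `predict_in_batches` is a hypothetical helper; only the batching, CSV encoding, and response parsing mirror what the notebook does.

```python
def fake_endpoint(payload):
    """Stand-in for the real endpoint: returns one score per CSV row as bytes."""
    rows = payload.strip().split("\n")
    # Hypothetical scoring rule, used only so the example produces numbers.
    scores = [min(1.0, float(row.split(",")[0]) / 10.0) for row in rows]
    return ",".join(str(score) for score in scores).encode("utf-8")

def predict_in_batches(rows, batch_size, invoke):
    """Send rows in mini-batches as CSV payloads and collect the parsed scores."""
    predictions = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        payload = "\n".join(",".join(str(value) for value in row) for row in batch)
        response = invoke(payload).decode("utf-8")
        predictions.extend(float(value) for value in response.split(","))
    return predictions

features = [[1, 0.5], [4, 0.1], [9, 0.3], [12, 0.0], [2, 0.9]]
result = predict_in_batches(features, batch_size=2, invoke=fake_endpoint)
print(result)  # [0.1, 0.4, 0.9, 1.0, 0.2]
```

Batching keeps each request well below the endpoint's payload limit while preserving the row order of the predictions.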

In [None]:
def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')

# Prepare prediction inputs from the test data (all columns except the target)
predictions = predict(test_data.to_numpy()[:,1:])

# predictions

There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values.  In this case, we're simply predicting whether the customer churned (`1`) or not (`0`), which produces a simple confusion matrix.

In [None]:
pd.crosstab(index=test_data.iloc[:, 0], columns=np.round(predictions), rownames=['actual'], colnames=['predictions'])

_Note, due to randomized elements of the algorithm, your results may differ slightly._

Of the 48 churners, we've correctly predicted 39 of them (true positives). And, we incorrectly predicted 4 customers would churn who then ended up not doing so (false positives).  There are also 9 customers who ended up churning, that we predicted would not (false negatives).
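
The counts quoted above are enough to compute precision and recall, two summary numbers that are often easier to compare across cutoffs than the full matrix. This short sketch uses only the three counts from the text; your own run may produce slightly different values.

```python
# Counts taken from the confusion matrix described in the text.
true_positives, false_positives, false_negatives = 39, 4, 9

# Share of customers flagged as churners who really churned.
precision = true_positives / (true_positives + false_positives)
# Share of actual churners that the model caught.
recall = true_positives / (true_positives + false_negatives)

print(f"precision={precision:.4f} recall={recall:.4f}")
# precision=0.9070 recall=0.8125
```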

An important point here is that because of the `np.round()` function above we are using a simple threshold (or cutoff) of 0.5.  Our predictions from `xgboost` come out as continuous values between 0 and 1 and we force them into the binary classes that we began with.  However, because a customer that churns is expected to cost the company more than proactively trying to retain a customer who we think might churn, we should consider adjusting this cutoff.  That will almost certainly increase the number of false positives, but it can also be expected to increase the number of true positives and reduce the number of false negatives.

To get a rough intuition here, let's look at the continuous values of our predictions.

In [None]:
plt.hist(predictions)
plt.show()

The continuous valued predictions coming from our model tend to skew toward 0 or 1, but there is sufficient mass between 0.1 and 0.9 that adjusting the cutoff should indeed shift a number of customers' predictions.  For example...

In [None]:
pd.crosstab(index=test_data.iloc[:, 0], columns=np.where(predictions > 0.3, 1, 0))

We can see that changing the cutoff from 0.5 to 0.3 results in 1 more true positive, 3 more false positives, and 1 fewer false negative.  The numbers are small overall here, but that's 6-10% of customers overall that are shifting because of a change to the cutoff.  Was this the right decision?  We may end up retaining 1 extra customer, but we also unnecessarily incentivized 3 more customers who would have stayed.  Determining optimal cutoffs is a key step in properly applying machine learning in a real-world setting.  Let's discuss this more broadly and then apply a specific, hypothetical solution for our current problem.
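
One way to choose a cutoff is to put a price on each kind of outcome and pick the cutoff with the lowest total cost. The sketch below does that on made-up data. The dollar amounts are purely hypothetical assumptions (a missed churner costs 500, an incentive costs 100 whether or not it was needed), and the labels and scores are not from the notebook's model.

```python
def total_cost(labels, scores, cutoff, fn_cost, fp_cost, tp_cost):
    """Sum the cost of every prediction at the given cutoff; true negatives cost nothing."""
    cost = 0.0
    for label, score in zip(labels, scores):
        flagged = score > cutoff
        if flagged and label == 1:
            cost += tp_cost   # churner who receives an incentive
        elif flagged and label == 0:
            cost += fp_cost   # loyal customer who receives an unneeded incentive
        elif label == 1:
            cost += fn_cost   # churner the model missed
    return cost

# Hypothetical labels, scores, and costs.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.92, 0.45, 0.22, 0.61, 0.35, 0.15, 0.05, 0.02]
costs = dict(fn_cost=500.0, fp_cost=100.0, tp_cost=100.0)

cutoffs = [c / 10 for c in range(1, 10)]
best = min(cutoffs, key=lambda c: total_cost(labels, scores, c, **costs))
print(best, total_cost(labels, scores, best, **costs))  # 0.2 500.0
```

Because a missed churner is assumed to cost five times as much as an incentive, the cheapest cutoff ends up well below 0.5.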

---
## Extensions (optional)

### Automatic model tuning (optional)
Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.
For example, suppose that you want to solve a binary classification problem on this churn dataset. Your goal is to maximize the area under the curve (AUC) metric of the algorithm by training an XGBoost model. You don't know which values of the `eta`, `alpha`, `min_child_weight`, and `max_depth` hyperparameters to use to train the best model. To find the best values for these hyperparameters, you can specify ranges of values that Amazon SageMaker hyperparameter tuning searches to find the combination of values that results in the training job that performs the best as measured by the objective metric that you chose. Hyperparameter tuning launches training jobs that use hyperparameter values in the ranges that you specified, and returns the training job with the highest AUC.
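
The idea of searching hyperparameter ranges can be illustrated with a plain random search. This is a sketch of the general concept only, not necessarily the strategy the SageMaker tuner applies, and `toy_validation_auc` is a made-up function standing in for a full training job.

```python
import random

def toy_validation_auc(eta, max_depth):
    """Made-up objective standing in for a training job; peaks at eta=0.2, max_depth=5."""
    return 0.95 - (eta - 0.2) ** 2 - 0.002 * (max_depth - 5) ** 2

def random_search(num_jobs, seed=0):
    """Sample hyperparameters from fixed ranges and keep the best-scoring combination."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_jobs):
        candidate = {"eta": rng.uniform(0.0, 1.0), "max_depth": rng.randint(1, 10)}
        score = toy_validation_auc(**candidate)
        if best is None or score > best[0]:
            best = (score, candidate)
    return best

best_score, best_params = random_search(num_jobs=9)
print(round(best_score, 4), best_params)
```

Each sampled candidate corresponds to one training job, so the number of jobs sets how thoroughly the ranges are explored.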


In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
# Define the search range for each hyperparameter
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                            'min_child_weight': ContinuousParameter(1, 10),
                            'alpha': ContinuousParameter(0, 2),
                            'max_depth': IntegerParameter(1, 10)}

### Available `eval_metric` values

- `rmse`: root mean square error
- `logloss`: negative log-likelihood
- `error`: binary classification error rate, calculated as #(wrong cases)/#(all cases). Predictions with a value larger than 0.5 are treated as positive instances, and the others as negative instances.
- `merror`: multiclass classification error rate, calculated as #(wrong cases)/#(all cases).
- `mlogloss`: multiclass logloss
- `auc`: area under the curve, for ranking evaluation.
- `ndcg`: Normalized Discounted Cumulative Gain
- `map`: Mean Average Precision
- `ndcg@n`, `map@n`: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
- `ndcg-`, `map-`, `ndcg@n-`, `map@n-`: XGBoost evaluates the NDCG and MAP score of a list without any positive samples as 1. Adding "-" to the metric name makes XGBoost evaluate these scores as 0, to be consistent under some conditions.
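
Two of the binary metrics listed above are simple enough to compute by hand, which helps when reading the training logs. The labels and probabilities in this sketch are made up, and the helpers are illustrative rather than XGBoost's own implementation.

```python
import math

def logloss(labels, probabilities):
    """Mean negative log-likelihood of the true labels."""
    total = sum(math.log(p if y == 1 else 1.0 - p)
                for y, p in zip(labels, probabilities))
    return -total / len(labels)

def error_rate(labels, probabilities, threshold=0.5):
    """#(wrong cases) / #(all cases), treating p > threshold as a positive prediction."""
    wrong = sum((p > threshold) != (y == 1) for y, p in zip(labels, probabilities))
    return wrong / len(labels)

# Hypothetical labels and predicted probabilities.
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.6]
print(round(logloss(labels, probabilities), 4), error_rate(labels, probabilities))
# 0.5403 0.5
```

Unlike `error`, `logloss` also penalizes predictions that are on the right side of the threshold but not confident.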


In [None]:
# Use AUC on the validation set as the objective metric
objective_metric_name = 'validation:auc'

In [None]:
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=9,
                            max_parallel_jobs=3)

In [None]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

In [None]:
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

In [None]:
# return the best training job name
tuner.best_training_job()

In [None]:
#  Deploy the best trained or user specified model to an Amazon SageMaker endpoint
tuner_predictor = tuner.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

In [None]:
# Create a serializer
tuner_predictor.serializer = sagemaker.serializers.CSVSerializer()

In [None]:
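# Predict with the tuned model's endpoint (the `predict` helper above targets `xgb_predictor`)
tuned_predictions = np.fromstring(
    tuner_predictor.predict(test_data.to_numpy()[:, 1:]).decode('utf-8'), sep=',')
pd.crosstab(index=test_data.iloc[:, 0], columns=np.round(tuned_predictions), rownames=['actual'], colnames=['predictions'])

In [None]:
# Confusion matrix for the tuned model with a lower cutoff of 0.3
pd.crosstab(index=test_data.iloc[:, 0], columns=np.where(tuned_predictions > 0.3, 1, 0))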

### (Optional) Clean-up

If you're ready to be done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
xgb_predictor.delete_endpoint()

In [None]:
tuner_predictor.delete_endpoint()