# [Module 1.0] Collecting Customer Churn Data
- This notebook is a partial adaptation of the blog post referenced below.
- Reference:
* Visualizing Amazon SageMaker machine learning predictions with Amazon QuickSight
    * https://aws.amazon.com/blogs/machine-learning/making-machine-learning-predictions-in-amazon-quicksight-and-amazon-sagemaker/
    * Git
        * https://github.com/aws-samples/quicksight-sagemaker-integration-blog


_**Using Gradient Boosted Trees to Predict Mobile Customer Departure**_

---


## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Inference Pipeline](#Inference)


---

## Background

_This notebook has been adapted from an [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/)_

Losing customers is costly for any business.  Identifying unhappy customers early on gives you a chance to offer them incentives to stay.  This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.

We use an example of churn that is familiar to all of us–leaving a mobile phone operator.  Seems like I can always find fault with my provider du jour! And if my provider knows that I’m thinking of leaving, it can offer timely incentives–I can always use a phone upgrade or perhaps have a new feature activated–and I might just stick around. Incentives are often much more cost effective than losing and reacquiring a customer.

---

## Setup Bucket and IAM Role



In [20]:
# Set up Bucket
import sagemaker
bucket = sagemaker.Session().default_bucket()
# bucket = '<Your Bucket Name if you want to use it>'

prefix = 'sagemaker/customer-churn'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
from sagemaker.predictor import csv_serializer

---
## Download Data
- The customer churn data for a mobile operator can be downloaded from the link below.

The dataset we use is publicly available and was mentioned in the book [Discovering Knowledge in Data](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets.  Let's download and read that dataset in now:

In [22]:
raw_data_folder = 'raw_data'
churn_data_folder = 'churn_data'

In [23]:
!wget --directory-prefix={raw_data_folder} http://dataminingconsultant.com/DKD2e_data_sets.zip
!unzip -o {raw_data_folder}/DKD2e_data_sets.zip -d {raw_data_folder}

--2020-07-15 03:34:32--  http://dataminingconsultant.com/DKD2e_data_sets.zip
Resolving dataminingconsultant.com (dataminingconsultant.com)... 160.153.91.162
Connecting to dataminingconsultant.com (dataminingconsultant.com)|160.153.91.162|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1003616 (980K) [application/zip]
Saving to: ‘raw_data/DKD2e_data_sets.zip.1’


2020-07-15 03:34:33 (2.28 MB/s) - ‘raw_data/DKD2e_data_sets.zip.1’ saved [1003616/1003616]

Archive:  raw_data/DKD2e_data_sets.zip
 extracting: raw_data/Data sets/adult.zip  
  inflating: raw_data/Data sets/cars.txt  
  inflating: raw_data/Data sets/cars2.txt  
  inflating: raw_data/Data sets/cereals.CSV  
  inflating: raw_data/Data sets/churn.txt  
  inflating: raw_data/Data sets/ClassifyRisk  
  inflating: raw_data/Data sets/ClassifyRisk - Missing.txt  
 extracting: raw_data/Data sets/DKD2e data sets.zip  
  inflating: raw_data/Data sets/nn1.txt  
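
The log above shows `wget` saving the archive as `DKD2e_data_sets.zip.1`, which indicates that an earlier copy already existed and that the `unzip` step read that earlier copy. The following is a minimal sketch of a download step that skips the fetch when the archive is already present. The names `download_once`, `extract_all`, and the `fetch` parameter are hypothetical and are not part of the original notebook.

```python
import os
import urllib.request
import zipfile


def download_once(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download url to dest_path unless the file already exists.

    `fetch` is an injectable (hypothetical) parameter so the function can be
    exercised without network access. Returns True if a download happened.
    """
    if os.path.exists(dest_path):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    fetch(url, dest_path)
    return True


def extract_all(zip_path, target_dir):
    """Extract every member of the archive and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Under these assumptions, calling `download_once` with the URL from the cell above and a path inside `raw_data_folder`, followed by `extract_all`, would roughly mirror the two shell commands without creating a `.zip.1` copy on reruns.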


### Read Churn Data

In [24]:
import pandas as pd
import os

churn_file_name = os.path.join(raw_data_folder, 'Data sets', 'churn.txt')
churn = pd.read_csv(churn_file_name)
pd.set_option('display.max_columns', 500)
churn.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


By modern standards, it’s a relatively small dataset, with only 3,333 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:

- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
- `Account Length`: the number of days that this account has been active
- `Area Code`: the three-digit area code of the corresponding customer’s phone number
- `Phone`: the remaining seven-digit phone number
- `Int'l Plan`: whether the customer has an international calling plan: yes/no
- `VMail Plan`: whether the customer has a voice mail feature: yes/no
- `VMail Message`: presumably the average number of voice mail messages per month
- `Day Mins`: the total number of calling minutes used during the day
- `Day Calls`: the total number of calls placed during the day
- `Day Charge`: the billed cost of daytime calls
- `Eve Mins`, `Eve Calls`, `Eve Charge`: the minutes, number of calls, and billed cost for calls placed during the evening (defined as for the daytime fields)
- `Night Mins`, `Night Calls`, `Night Charge`: the minutes, number of calls, and billed cost for calls placed during nighttime
- `Intl Mins`, `Intl Calls`, `Intl Charge`: the minutes, number of calls, and billed cost for international calls
- `CustServ Calls`: the number of calls placed to Customer Service
- `Churn?`: whether the customer left the service: true/false (stored as a string such as `False.` in the raw file)

The last attribute, `Churn?`, is known as the target attribute–the attribute that we want the ML model to predict.  Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.    
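
As a small illustration of the attribute count and the binary target, the sketch below parses the header and the first record shown in the `churn.head()` output and maps the raw `Churn?` string to 0 or 1. The helper `churn_label` is a hypothetical name, and the positive value `True.` is an assumption, since only `False.` appears in the rows displayed above.

```python
import csv
import io

# Header and first record as displayed in the notebook output above.
# The leading unnamed column there is the DataFrame index, not an attribute,
# so it is left out here.
sample = (
    "State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,"
    "Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,"
    "Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,"
    "CustServ Calls,Churn?\n"
    "KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,"
    "244.7,91,11.01,10.0,3,2.7,1,False.\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
attributes = list(rows[0].keys())


def churn_label(raw_value):
    """Map the raw target string to 1 (churned) or 0 (stayed).

    Assumes the raw values are 'True.' and 'False.'; only 'False.' is
    visible in the rows shown above.
    """
    return 1 if raw_value.strip().rstrip(".").lower() == "true" else 0


labels = [churn_label(row["Churn?"]) for row in rows]
print(len(attributes), labels)
```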


## Split the Original Data into Train, Validation, and Test Sets

In [25]:
train_data, validation_data, test_data = np.split(churn.sample(frac=1, random_state=1729), [int(0.7 * len(churn)), int(0.9 * len(churn))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)

# Use it for batch_transform and realtime inference
test_data.drop('Churn?', axis=1).to_csv('batch_transform_test.csv', header=False, index=False)

# For QuickSight
# test_data.drop('Churn?', axis=1).to_csv('test.csv', header=True, index=False)
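
The cut points passed to `np.split` in the cell above correspond to a 70% / 20% / 10% split. The sketch below reproduces only the arithmetic on row positions, assuming the 3,333 records stated earlier. It uses NumPy's own generator, so it does not reproduce the exact row order of `churn.sample(frac=1, random_state=1729)`.

```python
import numpy as np

n_records = 3333  # number of rows in the churn dataset, per the description above
cut_train = int(0.7 * n_records)       # end of the training slice
cut_validation = int(0.9 * n_records)  # end of the validation slice

# Shuffle row positions, then cut at the same two points as the cell above.
# Note: this generator does not reproduce the row order of
# churn.sample(frac=1, random_state=1729); only the slice sizes match.
shuffled = np.random.default_rng(1729).permutation(n_records)
train_idx, validation_idx, test_idx = np.split(shuffled, [cut_train, cut_validation])

print(len(train_idx), len(validation_idx), len(test_idx))
```

With these cut points the three slices contain 2,333, 666, and 334 rows, and every row lands in exactly one slice.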

Now we'll upload these files to S3.

In [26]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'rawtrain/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'rawvalidation/validation.csv')).upload_file('validation.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'rawtest/batch_transform_test.csv')).upload_file('batch_transform_test.csv')
# boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'rawtest/test.csv')).upload_file('test.csv')
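
The upload cell above builds each S3 object key with `os.path.join`. This is fine on the Linux hosts that notebook instances run on, but `os.path.join` would insert backslashes on Windows. A minimal sketch of the resulting key layout using `posixpath.join`, which always uses `/`; the bucket name `example-bucket` is a placeholder, not the notebook's actual bucket.

```python
import posixpath

bucket = "example-bucket"  # placeholder; the notebook uses the session default bucket
prefix = "sagemaker/customer-churn"

# Local file name -> S3 sub-folder, following the upload cell above.
uploads = {
    "train.csv": "rawtrain",
    "validation.csv": "rawvalidation",
    "batch_transform_test.csv": "rawtest",
}

# posixpath.join always uses '/', which is what S3 keys expect;
# os.path.join would use '\\' on Windows.
keys = {name: posixpath.join(prefix, folder, name) for name, folder in uploads.items()}
uris = {name: "s3://{}/{}".format(bucket, key) for name, key in keys.items()}

for name, uri in uris.items():
    print(name, "->", uri)
```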

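Then, because we're training with the CSV file format, we'll create `s3_input` objects that our training function can use as pointers to the files in S3. Note that `sagemaker.s3_input` is the SageMaker Python SDK v1 name; in SDK v2 the equivalent class is `sagemaker.inputs.TrainingInput`.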

In [48]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/rawtrain/'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/rawvalidation/'.format(bucket, prefix), content_type='csv')
s3_input_test = sagemaker.s3_input(s3_data='s3://{}/{}/rawtest/'.format(bucket, prefix), content_type='csv')

In [49]:
s3_input_train.config

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
   'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/rawtrain/',
   'S3DataDistributionType': 'FullyReplicated'}},
 'ContentType': 'csv'}

In [50]:
s3_input_validation.config

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
   'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/rawvalidation/',
   'S3DataDistributionType': 'FullyReplicated'}},
 'ContentType': 'csv'}

In [51]:
s3_input_test.config

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
   'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/rawtest/',
   'S3DataDistributionType': 'FullyReplicated'}},
 'ContentType': 'csv'}
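
The three `.config` outputs above share one shape and differ only in the S3 prefix. The sketch below rebuilds that shape in plain Python to make the structure explicit. `channel_config` is a hypothetical helper that does not call the SageMaker SDK, and `example-bucket` is a placeholder bucket name.

```python
def channel_config(bucket, prefix, folder, content_type="csv"):
    """Build a dict shaped like the `.config` outputs printed above.

    Hypothetical helper for illustration; it does not call the SageMaker SDK.
    """
    return {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}/{}/".format(bucket, prefix, folder),
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": content_type,
    }


configs = {
    folder: channel_config("example-bucket", "sagemaker/customer-churn", folder)
    for folder in ("rawtrain", "rawvalidation", "rawtest")
}
print(configs["rawtrain"]["DataSource"]["S3DataSource"]["S3Uri"])
```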

In [52]:
%store s3_input_train
%store s3_input_validation
%store s3_input_test
%store bucket
%store prefix

Stored 's3_input_train' (s3_input)
Stored 's3_input_validation' (s3_input)
Stored 's3_input_test' (s3_input)
Stored 'bucket' (str)
Stored 'prefix' (str)


In [40]:
# %store