# Predicting whether to contact a customer because they are at risk of churning

For updates on how SageMaker or AWS behaves compared to the notebook code, please refer to https://livebook.manning.com/#!/book/machine-learning-for-business/chapter-3/v-5/119

# Part 1: Load and examine the data

In the next cell, enter your own storage location (S3 bucket and subfolder).

In [1]:
data_bucket = 'martin-mlforbusiness'
subfolder = 'ch03'
dataset = 'churn_data.csv'

In [2]:
import pandas as pd
import boto3
import sagemaker
import s3fs
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

role = sagemaker.get_execution_role()
s3 = s3fs.S3FileSystem(anon=False)

In [3]:
df = pd.read_csv(f's3://{data_bucket}/{subfolder}/{dataset}')
df.head()

Unnamed: 0,churned,id,customer_code,co_name,total_spend,week_minus_4,week_minus_3,week_minus_2,last_week,4-3_delta,3-2_delta,2-1_delta
0,0,1,1826,Hoffman Martinez and Chandler,68567.34,0.81,0.02,0.74,1.45,-0.79,0.72,0.71
1,0,2,772,Lee Martin and Escobar,74335.27,1.87,1.02,1.29,1.19,-0.85,0.27,-0.1
2,0,3,479,Hobbs Mcdaniel and Baker,48746.22,1.21,0.7,1.04,2.12,-0.51,0.34,1.08
3,0,4,1692,Williams-Harris,64416.7,0.75,2.08,2.4,2.02,1.33,0.32,-0.38
4,0,5,2578,Beck-Snyder,71623.2,2.33,0.66,1.97,1.6,-1.67,1.31,-0.37


## Question 1

**Which columns are categorical features?**

>1. churned, last_week, co_name
>2. id, customer_code, co_name
>3. 4-3_delta, 3-2_delta, 2-1_delta
>4. id, total_spend, last_week

In [4]:
print(f'Number of rows in dataset: {df.shape[0]}')
print(df['churned'].value_counts())

Number of rows in dataset: 2999
0    2833
1     166
Name: churned, dtype: int64


# Part 2: Get the data into the right shape

In [5]:
encoded_data = df.drop(['id', 'customer_code', 'co_name'], axis=1)
encoded_data.head()

Unnamed: 0,churned,total_spend,week_minus_4,week_minus_3,week_minus_2,last_week,4-3_delta,3-2_delta,2-1_delta
0,0,68567.34,0.81,0.02,0.74,1.45,-0.79,0.72,0.71
1,0,74335.27,1.87,1.02,1.29,1.19,-0.85,0.27,-0.1
2,0,48746.22,1.21,0.7,1.04,2.12,-0.51,0.34,1.08
3,0,64416.7,0.75,2.08,2.4,2.02,1.33,0.32,-0.38
4,0,71623.2,2.33,0.66,1.97,1.6,-1.67,1.31,-0.37


## Question 2

**Why are these feature columns removed?**

>1. Categorical data must be trained separately later.
>2. String values can only be processed by models for natural language processing.
>3. These three features add no value to training in this business case.
>4. Categorical data generally cannot be used for model training.

# Part 3: Create training, validation and test data sets

## Open question
**What is the purpose of each set?**

**What sizes are the "test", "val", and "train" sets?**

In [6]:
y = encoded_data['churned']
train_df, test_and_val_data, _, _ = train_test_split(encoded_data, y, test_size=0.3, stratify=y, random_state=0)

y = test_and_val_data['churned']
val_df, test_df, _, _ = train_test_split(test_and_val_data, y, test_size=0.333, stratify=y, random_state=0)

print(train_df.shape, val_df.shape, test_df.shape)
print()
print(f'Number of rows in Train dataset: {train_df.shape[0]}')
print(train_df['churned'].value_counts())
print()
print(f'Number of rows in Validate dataset: {val_df.shape[0]}')
print(val_df['churned'].value_counts())
print()
print(f'Number of rows in Test dataset: {test_df.shape[0]}')
print(test_df['churned'].value_counts())

(2099, 9) (600, 9) (300, 9)

Number of rows in Train dataset: 2099
0    1983
1     116
Name: churned, dtype: int64

Number of rows in Validate dataset: 600
0    567
1     33
Name: churned, dtype: int64

Number of rows in Test dataset: 300
0    283
1     17
Name: churned, dtype: int64
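The row counts above follow from the two `test_size` values. The sketch below reproduces that arithmetic in plain Python, assuming that scikit-learn rounds a fractional test size up and the remaining train size down. The helper `split_sizes` is hypothetical and only illustrates the calculation.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part is rounded up and the first part is
    rounded down, which matches the shapes printed by the notebook.
    """
    n_second = math.ceil(test_size * n_rows)
    n_first = math.floor((1 - test_size) * n_rows)
    return n_first, n_second


total_rows = 2999

# First split: 70% train, 30% held back for validation and test.
n_train, n_rest = split_sizes(total_rows, 0.3)

# Second split: the held-back rows are divided into validation and test.
n_val, n_test = split_sizes(n_rest, 0.333)

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / total_rows:.0%} of the dataset)")
```

The result is roughly a 70/20/10 split between the training, validation, and test sets.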


In [7]:
train_data = train_df.to_csv(None, header=False, index=False).encode()
val_data = val_df.to_csv(None, header=False, index=False).encode()
test_data = test_df.to_csv(None, header=True, index=False).encode()

with s3.open(f'{data_bucket}/{subfolder}/processed/train.csv', 'wb') as f:
    f.write(train_data)

with s3.open(f'{data_bucket}/{subfolder}/processed/val.csv', 'wb') as f:
    f.write(val_data) 
    
with s3.open(f'{data_bucket}/{subfolder}/processed/test.csv', 'wb') as f:
    f.write(test_data) 
    
train_input = sagemaker.inputs.TrainingInput(s3_data=f's3://{data_bucket}/{subfolder}/processed/train.csv', content_type='csv')
val_input = sagemaker.inputs.TrainingInput(s3_data=f's3://{data_bucket}/{subfolder}/processed/val.csv', content_type='csv')    
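The training and validation files are written without a header row, and `churned` is the first column, because the built-in XGBoost container reads CSV input with the label in column 0. The sketch below builds such a payload with the standard library only. The two rows and the reduced column list are taken from the first rows shown in Part 1 and serve purely as an illustration.

```python
import csv
import io

# Two example rows with a reduced set of columns (illustrative only).
rows = [
    {"churned": 0, "total_spend": 68567.34, "last_week": 1.45},
    {"churned": 0, "total_spend": 74335.27, "last_week": 1.19},
]

# The label column must come first; feature order must stay fixed.
columns = ["churned", "total_spend", "last_week"]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in rows:
    # No header row is written, only the values.
    writer.writerow([row[name] for name in columns])

payload = buffer.getvalue().encode()
print(payload.decode())
```

The test file, by contrast, keeps its header because it is read back into a DataFrame in Part 6 rather than passed to the training job.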

# Part 4: Train the model

## Question 3

**What does overfitting mean?**

>1. The trained model keeps getting better.
>2. Too little computing power is available for training the model.
>3. The training set for the model has too many features.
>4. The trained model learns too much from the training data. As a result, its predictions become weaker on real-world data.

### Hyperparameters
**Which values do we use during training?**

**max_depth**
Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit.

**subsample**
Subsample ratio of the training instances. 
Setting it to 0.5 means that XGBoost randomly samples half of the training data before growing each tree. 
This helps to reduce overfitting.

**objective**
You set this hyperparameter to binary:logistic. You use this setting when your target variable is 1 or 0. 
If your target variable is a multiclass variable or a continuous variable, then you use other settings.
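With `binary:logistic`, the model's raw score is passed through the logistic (sigmoid) function, so the prediction is a probability between 0 and 1. The sketch below shows that mapping together with the 0.5 cut-off used later in the notebook. The function names and the raw scores are hypothetical.

```python
import math


def sigmoid(raw_score):
    """Map a raw model score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-raw_score))


def to_class(probability, threshold=0.5):
    """Turn a probability into a 0/1 prediction."""
    return 1 if probability > threshold else 0


# Hypothetical raw scores, only to illustrate the mapping.
for raw in (-3.0, 0.0, 3.0):
    prob = sigmoid(raw)
    print(f"raw={raw:+.1f} -> probability={prob:.3f} -> class={to_class(prob)}")
```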

**eval_metric**
The evaluation metric you are optimizing for. The metric argument auc stands for area under the (ROC) curve.
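One way to read the AUC is as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it by comparing every such pair. It is a plain-Python illustration with made-up labels and scores, not the implementation XGBoost uses.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities, for illustration only.
labels = [0, 0, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80]
print(f"AUC: {auc(labels, scores):.3f}")
```

An AUC of 1.0 means every churner is ranked above every non-churner, while 0.5 is what random scores would achieve.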

**num_round**
How many times you want to let the machine learning model run through the training data (the number of rounds). 
With each loop through the data, the function gets better at separating the dark circles from the light circles. 
After a while, though, the model gets too good: it begins to find patterns in the training data that are not reflected in the real world. This is called overfitting.
To avoid this, you set early stopping rounds.

> So more rounds are always better, right?

**early_stopping_rounds**
The number of consecutive rounds the validation metric may fail to improve before training stops.
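The early stopping rule can be sketched in a few lines of plain Python. The function `early_stopping_round` and the validation AUC values are hypothetical; the sketch assumes a metric where higher is better and only illustrates the idea, not XGBoost's internal code.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stop_round) for a metric where higher is better.

    Training stops once `patience` rounds have passed without a new best
    validation score. If that never happens, the last round is returned.
    """
    best_score = float("-inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_scores) - 1


# Hypothetical validation AUC per round.
val_auc = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.75]
best, stopped = early_stopping_round(val_auc, patience=3)
print(f"best round: {best}, training stopped at round: {stopped}")
```

In the notebook, `num_round=100` is therefore only an upper bound: with `early_stopping_rounds=10`, training can end much earlier.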

**scale_pos_weight**
The scale positive weight is used with imbalanced datasets to make sure 
the model puts enough emphasis on correctly predicting rare classes during training. 
In the current dataset, there are about 17 customers who stay for every customer who churns (2,833 versus 166).
So we set scale_pos_weight to 17 to account for this imbalance. 
This tells XGBoost to focus more on customers who actually churn 
rather than on happy customers who are still happy.
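A common heuristic is to set `scale_pos_weight` to the number of negative examples divided by the number of positive examples. The sketch below applies that heuristic to the class counts printed in Part 1; treat it as a starting point rather than a tuned value.

```python
# Class counts from the value_counts() output in Part 1.
negatives = 2833  # customers who did not churn (churned == 0)
positives = 166   # customers who churned (churned == 1)

# Heuristic: weight each positive by the negative-to-positive ratio.
ratio = negatives / positives
scale_pos_weight = round(ratio)

print(f"negatives per positive: {ratio:.2f}")
print(f"suggested scale_pos_weight: {scale_pos_weight}")
```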

### Hyperparameter tuning in action

**First run, with the following hyperparameters, through to deploy and prediction**
* max_depth=1
* subsample=0.1

**Results of run 1**
* train-auc:
* validation-auc:
* number of rounds:
* accuracy_score:

**Second run, with the following hyperparameters, through to deploy and prediction**
* max_depth=3
* subsample=0.7

**Results of run 2**
* train-auc:
* validation-auc:
* number of rounds:
* accuracy_score:

In [8]:
sess = sagemaker.Session()

from sagemaker import image_uris 
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "latest")

estimator = sagemaker.estimator.Estimator(
                        container, 
                        role,
                        instance_count=1, 
                        instance_type='ml.m4.xlarge',
                        output_path=f's3://{data_bucket}/{subfolder}/output',
                        sagemaker_session=sess)

estimator.set_hyperparameters(
                        max_depth=1,
                        subsample=0.1,
                        objective='binary:logistic',
                        eval_metric='auc',
                        num_round=100,
                        early_stopping_rounds=10,
                        scale_pos_weight=17)

estimator.fit({'train': train_input, 'validation': val_input})

2022-12-07 07:49:11 Starting - Starting the training job...ProfilerReport-1670399351: InProgress
...
2022-12-07 07:49:48 Starting - Preparing the instances for training.........
2022-12-07 07:51:27 Downloading - Downloading input data.........
2022-12-07 07:53:08 Training - Downloading the training image.........
2022-12-07 07:54:36 Training - Training image download completed. Training in progress..
Arguments: train
[2022-12-07:07:54:40:INFO] Running standalone xgboost training.
[2022-12-07:07:54:40:INFO] File size need to be processed in the node: 0.12mb. Available memory size in the node: 8834.91mb
[2022-12-07:07:54:40:INFO] Determined delimiter of CSV input is ','
[07:54:40] S3DistributionType set as FullyReplicated
[07:54:40] 2099x8 matrix with 16792 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2022-12-07:07:54:40:INFO] Determined delimiter of CSV input is ','
[07:54:40] S3Dis

# Part 5: Host the model

In [9]:
endpoint_name = 'customer-churn'

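try:
    # Remove a leftover endpoint with the same name, if one exists.
    sess.delete_endpoint(endpoint_name)
    sess.delete_endpoint_config(endpoint_name)
    print('Warning: Existing endpoint deleted to make way for your new endpoint.')
except Exception:
    # No existing endpoint to delete; nothing to do.
    pass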

In [10]:
predictor = estimator.deploy(
            initial_instance_count=1,
            instance_type='ml.m4.xlarge',
            endpoint_name=endpoint_name)

--------!

In [11]:
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer

predictor.serializer = CSVSerializer()
predictor.deserializer = JSONDeserializer()

# Part 6: Test the model

In [12]:
def get_prediction(row):
    prob = float(predictor.predict(row[1:]))
    return 1 if prob > 0.5 else 0

with s3.open(f'{data_bucket}/{subfolder}/processed/test.csv') as f:
    test_data = pd.read_csv(f)

test_data['prediction'] = test_data.apply(get_prediction, axis=1)
test_data[:10]

Unnamed: 0,churned,total_spend,week_minus_4,week_minus_3,week_minus_2,last_week,4-3_delta,3-2_delta,2-1_delta,prediction
0,0,76897.46,0.56,2.29,1.14,2.23,1.73,-1.15,1.09,0
1,0,19604.63,1.95,2.04,0.82,1.62,0.09,-1.22,0.8,0
2,0,23369.6,1.11,1.54,1.55,1.14,0.43,0.01,-0.41,0
3,1,40709.47,2.4,1.87,0.07,0.61,-0.53,-1.8,0.54,0
4,0,69953.52,2.01,1.2,1.05,1.41,-0.81,-0.15,0.36,0
5,0,71939.07,0.54,1.17,0.21,2.29,0.63,-0.96,2.08,0
6,0,45930.53,0.08,1.43,0.41,1.34,1.35,-1.02,0.93,0
7,0,47080.25,1.54,0.68,0.8,0.54,-0.86,0.12,-0.26,0
8,0,35506.83,1.37,0.93,1.7,0.67,-0.44,0.77,-1.03,0
9,0,39188.12,0.4,1.86,0.1,0.82,1.46,-1.76,0.72,1


In [13]:
print(test_data['churned'].value_counts())
print(test_data['prediction'].value_counts())
print(metrics.accuracy_score(test_data['churned'],test_data['prediction']))

0    283
1     17
Name: churned, dtype: int64
0    266
1     34
Name: prediction, dtype: int64
0.9033333333333333


In [14]:
print(metrics.confusion_matrix(test_data['churned'],test_data['prediction']))

[[260  23]
 [  6  11]]
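The matrix above can be turned into the usual summary metrics. The sketch below assumes the scikit-learn layout, in which rows are actual classes and columns are predicted classes, each ordered 0 then 1. The numbers are copied from the output above.

```python
# Confusion matrix from the cell above: rows = actual, columns = predicted.
cm = [[260, 23],
      [6, 11]]

tn, fp = cm[0]  # actual 0: predicted 0 (true negative), predicted 1 (false positive)
fn, tp = cm[1]  # actual 1: predicted 0 (false negative), predicted 1 (true positive)

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
recall = tp / (tp + fn)        # share of actual churners that were found
precision = tp / (tp + fp)     # share of predicted churners that really churned

print(f"accuracy:  {accuracy:.3f}")
print(f"recall:    {recall:.3f}")
print(f"precision: {precision:.3f}")
```

Because only 17 of the 300 test customers churned, accuracy alone says little here; recall and precision describe how well the rare class is handled.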


## Question 4

**Open question: What do the values in the confusion matrix mean, and what does the matrix tell us about the performance of our model?**


                        |         Predicted Values
                        |    Positives    |   Negatives
    - - - - - - - - - - - - - - - - - - - - - - - - - - - -
             Positives  | True Positives  | False Negatives
     Actual - - - - - - - - - - - - - - - - - - - - - - - -
             Negatives  | False Positives | True Negatives


**Carlos wants to identify every customer who shows potential for switching to a competitor. Which cell should Carlos pay the most attention to when judging the quality of the model?**

                        |         Predicted Values
                        |    Happy (0)    |    Unhappy (1)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - -
             Happy (0)  |      ( A )      |      ( B )      
     Actual - - - - - - - - - - - - - - - - - - - - - - - -
           Unhappy (1)  |      ( C )      |      ( D )      


>1. A
>2. B
>3. C
>4. D


## Remove the Endpoint (optional)
This cell removes the endpoint. Comment it out if you want the endpoint to remain after "Run All".

In [15]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)