# Supervised learning (classification and regression)

### Project description
Beta Bank customers are leaving: little by little, chipping away every month.
The bankers figured out it’s cheaper to save the existing customers rather than
to attract new ones. We need to predict whether a customer will leave the bank soon. You have the
data on clients’ past behavior and termination of contracts with the bank.
Build a model with the maximum possible F1 score. To pass the project, you
need an F1 score of at least 0.59. Check the F1 for the test set.
Additionally, measure the AUCROC metric and compare it with the F1.
Data source: https://www.kaggle.com/barelydedicated/bank-customer-churnmodeling{target="blank"}

### Data description
The data can be found in the /datasets/Churn.csv file. Download the dataset.

Features
- RowNumber: data string index
- CustomerId: unique customer identifier
- Surname: surname
- CreditScore: credit score
- Geography: country of residence
- Gender: gender
- Age: age
- Tenure: period of maturation for a customer’s fixed deposit (years)
- Balance: account balance
- NumOfProducts: number of banking products used by the customer
- HasCrCard: customer has a credit card
- IsActiveMember: customer’s activeness
- EstimatedSalary: estimated salary

Target
- Exited: customer has left

## 1. Download and prepare the data. Explain the procedure.

In [81]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

<div class="alert alert-block alert-success">
<b>Success:</b> It's perfect that all imports are in the first cell!
</div>

In [82]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/Churn.csv')

In [83]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


<div class="alert alert-block alert-info">
<b>Remarks: </b> It's better to use <span style="font-family: monospace"> .head() </span> without <span style="font-family: monospace"> print </span> in the notebook.
</div>

In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [85]:
# Count the missing values in Tenure
df['Tenure'].isna().sum()

909

In [86]:
df['Tenure'].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0., nan])

In [87]:
df['Tenure'].value_counts()

1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: Tenure, dtype: int64

<div class="alert alert-block alert-info">
<b>Remarks: </b> Good that <span style="font-family: monospace"> .unique() </span> was used to view the values. Don't forget about <span style="font-family: monospace"> .value_counts() </span>. This method can be useful to look at the quantity.
</div>

In [88]:
# median() skips missing values by default
df['Tenure'].median()

5.0

In [89]:
# Fill missing Tenure values with the column median (NaN is skipped when computing it)
df['Tenure'] = df['Tenure'].fillna(df['Tenure'].median())

In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The Tenure column had 909 missing values (about 9% of the 10,000 rows). I filled them with the column median, 5.0.

<div class="alert alert-block alert-info">
<b>Remarks: </b> Oops, did you replace them with -1 or with the median? Both solutions are OK.
</div>

<div class="alert alert-block alert-info">
<b>Remarks: </b> Maybe some features are useless for the machine learning model? Try to find and eliminate them. Reducing the number of features speeds up fitting and predicting.
</div>

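I should drop columns that carry no information about whether a customer will leave. RowNumber, CustomerId, and Surname are identifiers rather than customer behavior, so they are candidates for removal.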

In [91]:
df.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [92]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [93]:
columns_to_drop = ['RowNumber', 'CustomerId', 'Surname']

In [94]:
df = df.drop(columns_to_drop, axis=1)

In [95]:
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


<div class="alert alert-block alert-success">
<b>Success (review 2):</b> Good job! These features are really useless.
</div>

Geography and Gender are categorical (text) features, while the rest are numerical. One-hot encoding converts the categorical features into numerical indicator columns so that every model can use them.

In [96]:
df_ohe = pd.get_dummies(df, drop_first=True)

<div class="alert alert-block alert-success">
<b>Success:</b> Good job! Using <span style="font-family: monospace"> drop_first=True </span> was a good idea!
</div>
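To illustrate what `drop_first=True` changes, here is a minimal sketch on a made-up frame. The column names mirror the dataset, but the rows are hypothetical and chosen only to show the resulting columns.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, rows are made up.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 43],
})

# Full encoding: one indicator column per category.
full = pd.get_dummies(toy)

# drop_first=True drops the first category of each encoded column.
# That category is implied when all remaining indicators are 0,
# which avoids perfectly redundant columns.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced frame, a customer from France is represented by zeros in both remaining Geography columns.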

In [97]:
target = df_ohe['Exited']
features = df_ohe.drop(['Exited'] , axis=1)
features_train, features_testvalid, target_train, target_testvalid = train_test_split(features, target, test_size=0.40, random_state=12345)
features_test, features_valid, target_test, target_valid = train_test_split(features_testvalid, target_testvalid, test_size=0.50, random_state=12345)

In [98]:
print('validation set:', (len(features_valid)/len(features))*100, '% for features,', (len(target_valid)/len(target))*100, '% for target')

validation set: 20.0 % for features, 20.0 % for target


In [99]:
print('test set:', (len(features_test)/len(features))*100, '% for features,', (len(target_test)/len(target))*100, '% for target')

test set: 20.0 % for features, 20.0 % for target


In [100]:
print('training set:', (len(features_train)/len(features))*100, '% for features,', (len(target_train)/len(target))*100, '% for target')

training set: 60.0 % for features, 60.0 % for target


I finished splitting up my data into a training (60%), validation (20%), and test (20%) set.

<div class="alert alert-block alert-success">
<b>Success:</b> Splitting was done perfectly!
</div>

<div class="alert alert-block alert-info">
<b>Remarks: </b> What about numerical features scaling?
</div>
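One way to address the scaling remark is sketched below. It assumes `StandardScaler` from scikit-learn and uses small hypothetical frames in place of `features_train` and `features_valid`. The scaler is fitted on the training part only, so no information from validation data leaks into the transformation.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for features_train / features_valid.
train = pd.DataFrame({
    "CreditScore": [619.0, 608.0, 502.0, 699.0],
    "Balance": [0.0, 83807.86, 159660.80, 0.0],
})
valid = pd.DataFrame({
    "CreditScore": [850.0, 645.0],
    "Balance": [125510.82, 113755.78],
})

numeric = ["CreditScore", "Balance"]

# Fit on the training data only, then apply the same transformation everywhere.
scaler = StandardScaler()
scaler.fit(train[numeric])

train_scaled = train.copy()
valid_scaled = valid.copy()
train_scaled[numeric] = scaler.transform(train[numeric])
valid_scaled[numeric] = scaler.transform(valid[numeric])

print(train_scaled)
print(valid_scaled)
```

Scaling matters most for logistic regression. Tree-based models split on thresholds and are largely insensitive to it.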

## 2. Examine the balance of classes. Train the model without taking into account the imbalance. Briefly describe your findings.

In [101]:
target.value_counts()

0    7963
1    2037
Name: Exited, dtype: int64

In [102]:
target_train.value_counts()

0    4804
1    1196
Name: Exited, dtype: int64

<div class="alert alert-block alert-info">
<b>Remarks: </b> Using <span style="font-family: monospace"> .mean() </span> with bool target may confuse other scientists. Better to use <span style="font-family: monospace"> .value_counts() </span>.
</div>

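The counts above show the class balance. There are about four times as many 0s (stayed) as 1s (left): roughly 20% of customers exited, both in the full dataset and in the training set, so the classes are imbalanced.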
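The class shares can also be read directly as proportions. The sketch below rebuilds a series with the same class counts as the full target (7963 zeros and 2037 ones); `target_demo` is a hypothetical stand-in for `target`.

```python
import pandas as pd

# Stand-in target with the class counts reported above.
target_demo = pd.Series([0] * 7963 + [1] * 2037, name="Exited")

# normalize=True returns shares instead of raw counts.
shares = target_demo.value_counts(normalize=True)
positive_share = shares.loc[1]

print(shares)
```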

I will train a model without taking into account the imbalance. Here is one possible model:

In [103]:
model = RandomForestClassifier(random_state=12345)

In [104]:
model.fit(features_train, target_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

In [105]:
predictions = model.predict(features_valid)

In [106]:
score = f1_score(target_valid, predictions)

In [107]:
score

0.5030864197530864

The validation F1 score (0.503) is below the required 0.59.
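As a reminder of what the F1 score measures, the sketch below computes it by hand as the harmonic mean of precision and recall, and compares the result with `f1_score`. The labels and predictions are hypothetical, not taken from the notebook.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions (4 positives, 6 negatives).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)
f1_library = f1_score(y_true, y_pred)

print(f1_manual, f1_library)
```

Because F1 only counts the positive class, a model that mostly predicts 0 on imbalanced data can have high accuracy and still a low F1.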

<div class="alert alert-block alert-info">
<b>Tip: </b>    Your model parameters are OK, but it is better to use default parameters or tune them. Default parameters are often chosen so that the model shows fairly good quality.
</div>

## 3.  Improve the quality of the model, taking into account the imbalance of classes. Train different models and find the best one. Briefly describe your findings.

In [108]:
# Options for handling the class imbalance (too few 1s): class_weight='balanced', upsampling, downsampling, threshold tuning

In [109]:
target = df_ohe['Exited']
features = df_ohe.drop(['Exited'] , axis=1)
features_train, features_testvalid, target_train, target_testvalid = train_test_split(features, target, test_size=0.40, random_state=12345)
features_test, features_valid, target_test, target_valid = train_test_split(features_testvalid, target_testvalid, test_size=0.50, random_state=12345)

Since I want roughly equal numbers of 1s and 0s in the training data, I upsample the minority class (1s) in the training set.

In [110]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    # Shuffle the upsampled data built inside this function (not global variables)
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
    return features_upsampled, target_upsampled

upsampled_features_train, upsampled_target_train = upsample(target=target_train, features=features_train, repeat=4)


In [111]:
upsampled_target_train.value_counts()

0    4804
1    4784
Name: Exited, dtype: int64

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> OK, no worries, but we have a little error here. Upsampling/downsampling should be applied only to the training set (not to the validation or test sets). Applying it to the validation or test set inflates the metrics. Real data is imbalanced, so the validation and test data should resemble real data.
</div>

Possibly, downsampling will be useful.

In [112]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat([target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12345)
    return features_downsampled, target_downsampled

downsampled_features_train, downsampled_target_train = downsample(features_train, target_train, 0.24)

In [113]:
downsampled_target_train.value_counts()

1    1196
0    1153
Name: Exited, dtype: int64

<div class="alert alert-block alert-success">
<b>Success (review 2):</b> Well done! Sometimes downsampling performs better.
</div>
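The repeat factor of 4 and the fraction of 0.24 used above roughly follow from the training-set class counts. The sketch below shows one way to derive them. The rounding choice is an assumption, and the notebook's 0.24 is slightly lower than the exact ratio.

```python
# Training-set class counts reported earlier in the notebook.
n_zeros, n_ones = 4804, 1196

# Upsampling: how many copies of each positive row bring the classes close to 1:1.
repeat = round(n_zeros / n_ones)

# Downsampling: which share of negative rows to keep for a 1:1 balance.
fraction = n_ones / n_zeros

print(repeat, n_ones * repeat)
print(round(fraction, 3), round(n_zeros * fraction))
```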

I will also adjust the classification threshold, using the random forest trained in section 2.

In [114]:
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]


for threshold in np.arange(0, 0.6, 0.02):
    predicted_valid = probabilities_one_valid > threshold
    f1 = f1_score(target_valid, predicted_valid)

    print("Threshold = {:.2f} | F1 = {:.3f}".format(
        threshold, f1))

Threshold = 0.00 | F1 = 0.444
Threshold = 0.02 | F1 = 0.444
Threshold = 0.04 | F1 = 0.444
Threshold = 0.06 | F1 = 0.444
Threshold = 0.08 | F1 = 0.444
Threshold = 0.10 | F1 = 0.527
Threshold = 0.12 | F1 = 0.527
Threshold = 0.14 | F1 = 0.527
Threshold = 0.16 | F1 = 0.527
Threshold = 0.18 | F1 = 0.527
Threshold = 0.20 | F1 = 0.564
Threshold = 0.22 | F1 = 0.564
Threshold = 0.24 | F1 = 0.564
Threshold = 0.26 | F1 = 0.564
Threshold = 0.28 | F1 = 0.564
Threshold = 0.30 | F1 = 0.590
Threshold = 0.32 | F1 = 0.590
Threshold = 0.34 | F1 = 0.590
Threshold = 0.36 | F1 = 0.590
Threshold = 0.38 | F1 = 0.590
Threshold = 0.40 | F1 = 0.553
Threshold = 0.42 | F1 = 0.553
Threshold = 0.44 | F1 = 0.553
Threshold = 0.46 | F1 = 0.553
Threshold = 0.48 | F1 = 0.553
Threshold = 0.50 | F1 = 0.503
Threshold = 0.52 | F1 = 0.503
Threshold = 0.54 | F1 = 0.503
Threshold = 0.56 | F1 = 0.503
Threshold = 0.58 | F1 = 0.503


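Every threshold from 0.30 to 0.38 gives the highest F1, 0.590. The F1 changes in steps, most likely because the forest has only 10 trees and its predicted probabilities move in steps of about 0.1. I use 0.32 from this range in the models below.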


<div class="alert alert-block alert-success">
<b>Success:</b> Well done!
</div>
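As an alternative to looping over a fixed grid, the threshold search can be sketched with `precision_recall_curve`, which evaluates every distinct predicted probability as a candidate threshold. The data here is synthetic (`make_classification`), and the names `model_demo`, `X_valid`, and `y_valid` are hypothetical, so the numbers will not match the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features (an assumption).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

model_demo = RandomForestClassifier(n_estimators=100, random_state=12345)
model_demo.fit(X_train, y_train)
proba_one = model_demo.predict_proba(X_valid)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_valid, proba_one)

# precision and recall have one more element than thresholds; drop the last point.
p, r = precision[:-1], recall[:-1]
f1 = np.divide(2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0)

best = int(np.argmax(f1))
best_threshold, best_f1 = thresholds[best], f1[best]

# Cross-check: predictions are positive when the probability is >= threshold.
check = f1_score(y_valid, proba_one >= best_threshold)
print(best_threshold, best_f1, check)
```

The threshold should be chosen on the validation set only and then applied unchanged to the test set.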

### Logistic Regression using upsampled training data

In [115]:
model = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model.fit(upsampled_features_train, upsampled_target_train)
threshold = 0.32
predicted_proba = model.predict_proba(features_valid)
predictions = (predicted_proba [:,1] >= threshold).astype('int')
score = f1_score(target_valid, predictions)

print('f1 score for Logistic Regression',':', score)

f1 score for Logistic Regression : 0.3895843765648473


The F1 score for Logistic Regression (0.390) is lower than our goal (0.59). We can possibly do better.

### Logistic Regression using downsampled training data

In [116]:
model = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model.fit(downsampled_features_train, downsampled_target_train)
threshold = 0.32
predicted_proba = model.predict_proba(features_valid)
predictions = (predicted_proba [:,1] >= threshold).astype('int')
score = f1_score(target_valid, predictions)

print('f1 score for Logistic Regression',':', score)

f1 score for Logistic Regression : 0.3880299251870324


The F1 score for Logistic Regression (0.388) is lower than our goal (0.59). We can possibly do better.

I need to build a model with the maximum possible F1 score.

### Decision Tree Classifier using upsampled training data

In [117]:
for depth in range(201,402,50):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth, class_weight='balanced')
    model.fit(upsampled_features_train, upsampled_target_train)
    threshold = 0.32
    predicted_proba = model.predict_proba(features_valid)
    predictions = (predicted_proba [:,1] >= threshold).astype('int')
    score = f1_score(target_valid, predictions)
    print('f1 score for Decision Tree Classifier with', depth, 'depth:', score)

f1 score for Decision Tree Classifier with 201 depth: 0.484548825710754
f1 score for Decision Tree Classifier with 251 depth: 0.484548825710754
f1 score for Decision Tree Classifier with 301 depth: 0.484548825710754
f1 score for Decision Tree Classifier with 351 depth: 0.484548825710754
f1 score for Decision Tree Classifier with 401 depth: 0.484548825710754


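The F1 score is the same (0.485) at every depth tested, which suggests the tree stops growing well before depth 201, so max_depth has no effect in this range. The score is lower than our goal (0.59).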
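Whether a `max_depth` limit is ever reached can be checked with `get_depth()`, which is available in recent scikit-learn versions (an assumption about the installed version). The sketch below uses synthetic data, so the depth it reports is not the depth of the notebook's tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the training set (an assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=12345)

# Grow a tree without a depth limit and read its actual depth.
unlimited = DecisionTreeClassifier(random_state=12345).fit(X, y)
actual_depth = unlimited.get_depth()

# Any max_depth above the actual depth yields the same tree.
capped = DecisionTreeClassifier(random_state=12345,
                                max_depth=actual_depth + 100).fit(X, y)

print(actual_depth, capped.get_depth())
```

If the actual depth turns out to be far below 201, searching smaller values such as `range(1, 21)` would make `max_depth` a meaningful hyperparameter.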

### Decision Tree Classifier using downsampled training data

In [118]:
for depth in range(201,602,100):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth, class_weight='balanced')
    model.fit(downsampled_features_train, downsampled_target_train)
    threshold = 0.32
    predicted_proba = model.predict_proba(features_valid)
    predictions = (predicted_proba [:,1] >= threshold).astype('int')
    score = f1_score(target_valid, predictions)
    print('f1 score for Decision Tree Classifier with', depth, 'depth:', score)

f1 score for Decision Tree Classifier with 201 depth: 0.48321048321048327
f1 score for Decision Tree Classifier with 301 depth: 0.48321048321048327
f1 score for Decision Tree Classifier with 401 depth: 0.48321048321048327
f1 score for Decision Tree Classifier with 501 depth: 0.48321048321048327
f1 score for Decision Tree Classifier with 601 depth: 0.48321048321048327


The F1 score is again the same (0.483) at every depth tested, so max_depth has no effect in this range either. The score is lower than our goal (0.59).

### Random Forest Classifier using upsampled training data

In [119]:
%%time
for trees in range(1, 301, 50):
    model = RandomForestClassifier(random_state=12345, class_weight='balanced', n_estimators=trees, max_depth=50)
    model.fit(upsampled_features_train, upsampled_target_train)
    threshold = 0.32
    predicted_proba = model.predict_proba(features_valid)
    predictions = (predicted_proba [:,1] >= threshold).astype('int')
    score = f1_score(target_valid, predictions)
    print(trees,':', score)


1 : 0.5171270718232044
51 : 0.5921450151057401
101 : 0.591321897073663
151 : 0.6092184368737474
201 : 0.6013986013986014
251 : 0.6042296072507553
CPU times: user 10.4 s, sys: 14.3 ms, total: 10.4 s
Wall time: 11 s


The highest F1 value so far, 0.609, comes from the Random Forest Classifier with max_depth=50 and 151 trees.

### Random Forest Classifier using downsampled training data

In [120]:
%%time
for trees in range(1, 601, 50):
    model = RandomForestClassifier(random_state=12345, class_weight='balanced', n_estimators=trees, max_depth=50)
    model.fit(downsampled_features_train, downsampled_target_train)
    threshold = 0.32
    predicted_proba = model.predict_proba(features_valid)
    predictions = (predicted_proba [:,1] >= threshold).astype('int')
    score = f1_score(target_valid, predictions)
    print(trees,':', score)


1 : 0.47960033305578686
51 : 0.49574885546108566
101 : 0.5042567125081859
151 : 0.5084967320261438
201 : 0.5115055884286653
251 : 0.5108338804990151
301 : 0.5091863517060368
351 : 0.5117801047120419
401 : 0.5088062622309198
451 : 0.5085190039318479
501 : 0.5094957432874918
551 : 0.5081645983017635
CPU times: user 15.6 s, sys: 103 ms, total: 15.7 s
Wall time: 16.3 s


None of these scores reaches 0.59. The best model is therefore still the Random Forest Classifier trained on upsampled data with max_depth=50 and 151 trees (validation F1 of 0.609).

## 4. Perform the final testing

I need to check the F1 for the test set and measure the AUC-ROC metric to compare it with the F1. 

In [121]:
model = RandomForestClassifier(random_state=12345, class_weight='balanced', n_estimators=151, max_depth=50)
model.fit(upsampled_features_train, upsampled_target_train)

test_predict= model.predict(features_test)
print('f1:', f1_score(target_test, test_predict))

probabilities_test = model.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
print('AUC-ROC metric:', roc_auc_score(target_test, probabilities_one_test))


f1: 0.6178010471204187
AUC-ROC metric: 0.8376532642950902


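Both AUC-ROC and F1 are better the closer they are to 1. To pass the project, I need an F1 score of at least 0.59. The test F1 of 0.618 exceeds that goal by about 0.03. The AUC-ROC of 0.838 is well above the 0.5 of a random classifier, which shows that the model ranks leaving customers above staying ones most of the time. Note that the test predictions come from model.predict, which uses the default 0.5 threshold rather than the 0.32 used during validation. This is the best model that I could find.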
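The two metrics answer different questions, which the sketch below illustrates on hypothetical labels and probabilities. AUC-ROC is computed from the probabilities and does not depend on a threshold, while F1 changes with the threshold used to turn probabilities into class labels.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
proba = np.array([0.05, 0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80, 0.60])

# AUC-ROC uses the ranking of the probabilities, so there is a single value.
auc = roc_auc_score(y_true, proba)

# F1 needs hard predictions, so it is computed per threshold.
f1_by_threshold = {
    t: f1_score(y_true, (proba >= t).astype(int)) for t in (0.3, 0.5, 0.7)
}

print("AUC-ROC:", auc)
for t, value in f1_by_threshold.items():
    print("threshold", t, "F1", value)
```

This is why a model can have a high AUC-ROC and still a modest F1: the ranking may be good while the chosen threshold is not the best one for F1.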