<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка-данных" data-toc-modified-id="Подготовка-данных-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data preparation</a></span></li><li><span><a href="#Исследование-задачи" data-toc-modified-id="Исследование-задачи-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Exploring the task</a></span></li><li><span><a href="#Борьба-с-дисбалансом" data-toc-modified-id="Борьба-с-дисбалансом-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Handling class imbalance</a></span></li><li><span><a href="#Тестирование-модели" data-toc-modified-id="Тестирование-модели-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model testing</a></span></li></ul></div>

# Customer Churn

Customers have started leaving Beta Bank. Every month. Not many, but noticeably. The bank's marketers have calculated that retaining existing customers is cheaper than acquiring new ones.

We need to predict whether a customer will leave the bank in the near future. You are given historical data on customer behavior and contract terminations.

Build a model with the highest possible *F1* score. To pass the project, the metric must reach 0.59. Check the *F1* score on the test set yourself.

Additionally, measure *AUC-ROC* and compare its value with the *F1* score.

Data source: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

## Data preparation

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report as cr
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler 
#from sklearn.preprocessing import OrdinalEncoder

In [2]:
data = pd.read_csv('/datasets/Churn.csv')
data.info()
data.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [3]:
data.isna().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

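About 9% of customers (909 of 10,000) have a missing `Tenure`. Let's assume they have been with the bank for less than a year and fill the gaps with 0. Filling with the median is another option. We could also prepare several variants and see which one gives the best result, but that is worth doing only once there is a working model.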
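The three options mentioned here can be tried on a small example first. The sketch below uses a made-up `toy` frame (hypothetical values, not the bank data) and assumes only that `tenure` is the single column with gaps.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `tenure` column (hypothetical values, not the real data).
toy = pd.DataFrame({"tenure": [2.0, np.nan, 8.0, 1.0, np.nan, 5.0]})

# Variant 1: treat a missing tenure as "less than a year with the bank".
filled_zero = toy.fillna({"tenure": 0})

# Variant 2: replace a missing tenure with the median of the known values.
filled_median = toy.fillna({"tenure": toy["tenure"].median()})

# Variant 3: drop the rows where tenure is missing.
dropped = toy.dropna(subset=["tenure"])

print(filled_zero["tenure"].tolist())
print(filled_median["tenure"].tolist())
print(len(dropped))
```

Passing a dict to `fillna` limits the fill to the named column, which is a little safer than a bare `fillna(0)` if other columns ever contain gaps.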

In [4]:
data.columns = data.columns.str.lower()

In [5]:
data0 = data

In [6]:
data0 = data0.fillna(0)

In [7]:
data0.corr()

,rownumber,customerid,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
rownumber,1.0,0.004202,0.00584,0.000783,0.000596,-0.009067,0.007246,0.000599,0.012044,-0.005988,-0.016571
customerid,0.004202,1.0,0.005308,0.009497,-0.015747,-0.012419,0.016972,-0.014025,0.001665,0.015271,-0.006248
creditscore,0.00584,0.005308,1.0,-0.003965,0.003087,0.006268,0.012238,-0.005458,0.025651,-0.001384,-0.027094
age,0.000783,0.009497,-0.003965,1.0,-0.007368,0.028308,-0.03068,-0.011721,0.085472,-0.007201,0.285323
tenure,0.000596,-0.015747,0.003087,-0.007368,1.0,-0.005821,0.010106,0.021387,-0.025856,0.011225,-0.013319
balance,-0.009067,-0.012419,0.006268,0.028308,-0.005821,1.0,-0.30418,-0.014858,-0.010084,0.012797,0.118533
numofproducts,0.007246,0.016972,0.012238,-0.03068,0.010106,-0.30418,1.0,0.003183,0.009612,0.014204,-0.04782
hascrcard,0.000599,-0.014025,-0.005458,-0.011721,0.021387,-0.014858,0.003183,1.0,-0.011866,-0.009933,-0.007138
isactivemember,0.012044,0.001665,0.025651,0.085472,-0.025856,-0.010084,0.009612,-0.011866,1.0,-0.011421,-0.156128
estimatedsalary,-0.005988,0.015271,-0.001384,-0.007201,0.011225,0.012797,0.014204,-0.009933,-0.011421,1.0,0.012097


The correlation matrix cannot show this (it covers only the numeric columns), but I think the surname should be dropped. It is a categorical feature that would have to be encoded, and it is (or should be) tied to the customer ID anyway.

In [9]:
test0 = data0.drop(['surname','rownumber'], axis = 1)

In [10]:
test0.index = test0['customerid']
test0 = test0.drop(['customerid'], axis = 1)
test0.head(5)

customerid,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
15634602,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
15647311,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
15619304,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
15701354,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
15737888,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [12]:
test0 = pd.get_dummies(test0, drop_first=True)
test0.head(5)

customerid,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,geography_Germany,geography_Spain,gender_Male
15634602,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
15647311,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
15619304,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
15701354,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
15737888,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0


## Exploring the task

In [13]:
target0 = test0['exited']
features0 = test0.drop('exited', axis=1)

features_train0, features_x0, target_train0, target_x0 = train_test_split(
    features0, target0, test_size=0.4, random_state=12345)

features_test0, features_valid0, target_test0, target_valid0 = train_test_split(
    features_x0, target_x0, test_size=0.5, random_state=12345)

In [14]:
scaler0 = StandardScaler()
scaler0.fit(features_train0)

features_train0 = pd.DataFrame(scaler0.transform(features_train0), columns=features_train0.columns, index=features_train0.index)
features_valid0 = pd.DataFrame(scaler0.transform(features_valid0), columns=features_valid0.columns, index=features_valid0.index)
features_test0 = pd.DataFrame(scaler0.transform(features_test0), columns=features_test0.columns, index=features_test0.index)

In [15]:
features_train0.head(5)

customerid,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,geography_Germany,geography_Spain,gender_Male
15671987,-0.886751,-0.373192,1.104696,1.232271,-0.89156,0.642466,-1.055187,-0.187705,-0.572475,1.728977,0.907278
15815628,0.608663,-0.183385,1.104696,0.600563,-0.89156,-1.556504,-1.055187,-0.333945,-0.572475,-0.578377,-1.102198
15799494,2.052152,0.480939,-0.503694,1.027098,0.830152,-1.556504,0.947699,1.503095,1.746802,-0.578377,0.907278
15711288,-1.457915,-1.417129,0.46134,-1.233163,0.830152,0.642466,-1.055187,-1.071061,-0.572475,-0.578377,0.907278
15699492,0.130961,-1.132419,-0.825373,1.140475,-0.89156,-1.556504,-1.055187,1.524268,1.746802,-0.578377,-1.102198


In [16]:
lr0 = LogisticRegression(solver='liblinear', random_state=12345)
lr0.fit(features_train0, target_train0)
predicted_valid0 = lr0.predict(features_valid0)

In [17]:
print(cr(target_valid0, predicted_valid0))

              precision    recall  f1-score   support

           0       0.81      0.95      0.88      1577
           1       0.52      0.19      0.27       423

    accuracy                           0.79      2000
   macro avg       0.67      0.57      0.58      2000
weighted avg       0.75      0.79      0.75      2000



An F1 of 0.27 for the positive class is very poor, so let's move on.
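To see where such a low number comes from: F1 is the harmonic mean of precision and recall, so the low recall (0.19) drags it down even though precision is moderate. A minimal sketch, where `f1_from` is a helper name introduced here for illustration:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded class-1 precision and recall taken from the report above.
f1_class1 = f1_from(0.52, 0.19)
print(f1_class1)
```

Because the inputs are already rounded to two decimals, the result can differ from the value in the report in the last digit.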

In [18]:
prob_valid0 = lr0.predict_proba(features_valid0)
prob_one_valid0 = prob_valid0[:, 1]

print('AUC-ROC:', roc_auc_score(target_valid0, prob_one_valid0))

AUC-ROC: 0.7386260233168584
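An AUC-ROC of about 0.74 next to an F1 of 0.27 is not a contradiction. AUC-ROC does not depend on a decision threshold: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. F1, by contrast, is computed from hard predictions, which for logistic regression means a 0.5 cutoff. The sketch below illustrates this on made-up scores; `auc_by_pairs` and `f1_at` are hypothetical helper names, not part of the notebook.

```python
from itertools import product

def auc_by_pairs(y_true, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

def f1_at(y_true, scores, threshold):
    """F1 for class 1 when scores >= threshold are predicted as 1."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Made-up scores: positives are ranked fairly well, but all scores sit below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.45, 0.30, 0.40]

print(auc_by_pairs(y_true, scores))  # ranking quality, threshold-free
print(f1_at(y_true, scores, 0.5))    # F1 at the default cutoff
print(f1_at(y_true, scores, 0.3))    # F1 at a lower cutoff
```

On this toy data the ranking is decent while F1 at the 0.5 cutoff is zero, which is the same pattern, in exaggerated form, as in the report above.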


Now let's try a second version of the dataset, filling the gaps with the median.

In [19]:
data1 = data
data1 = data1.fillna(data1['tenure'].median())

In [20]:
test1 = data1.drop(['surname','rownumber'], axis = 1)
test1.index = test1['customerid']
test1 = test1.drop(['customerid'], axis = 1)

#encoder1 = OrdinalEncoder()
#test1[['geography','gender']] = encoder1.fit_transform(test1[['geography','gender']])
test1 = pd.get_dummies(test1, drop_first=True)

target1 = test1['exited']
features1 = test1.drop('exited', axis=1)

features_train1, features_x1, target_train1, target_x1 = train_test_split(
    features1, target1, test_size=0.4, random_state=12345)

features_test1, features_valid1, target_test1, target_valid1 = train_test_split(
    features_x1, target_x1, test_size=0.5, random_state=12345)

scaler1 = StandardScaler()
scaler1.fit(features_train1)

features_train1 = pd.DataFrame(scaler1.transform(features_train1), columns=features_train1.columns, index=features_train1.index)
features_valid1 = pd.DataFrame(scaler1.transform(features_valid1), columns=features_valid1.columns, index=features_valid1.index)
features_test1 = pd.DataFrame(scaler1.transform(features_test1), columns=features_test1.columns, index=features_test1.index)

lr1 = LogisticRegression(solver='liblinear', random_state=12345)
lr1.fit(features_train1, target_train1)
predicted_valid1 = lr1.predict(features_valid1)

print(cr(target_valid1, predicted_valid1))

prob_valid1 = lr1.predict_proba(features_valid1)
prob_one_valid1 = prob_valid1[:, 1]

print('AUC-ROC:', roc_auc_score(target_valid1, prob_one_valid1))

              precision    recall  f1-score   support

           0       0.81      0.95      0.88      1577
           1       0.52      0.19      0.27       423

    accuracy                           0.79      2000
   macro avg       0.67      0.57      0.58      2000
weighted avg       0.75      0.79      0.75      2000

AUC-ROC: 0.7386065351364398


There is essentially no difference: F1 for the positive class is still 0.27 and AUC-ROC barely changes. Finally, let's try simply dropping the rows with missing values.

In [21]:
data2 = data
data2 = data2.dropna(subset=['tenure'])

test2 = data2.drop(['surname','rownumber'], axis = 1)
test2.index = test2['customerid']
test2 = test2.drop(['customerid'], axis = 1)

#encoder2 = OrdinalEncoder()
#test2[['geography','gender']] = encoder2.fit_transform(test2[['geography','gender']])
test2 = pd.get_dummies(test2, drop_first=True)

target2 = test2['exited']
features2 = test2.drop('exited', axis=1)

features_train2, features_x2, target_train2, target_x2 = train_test_split(
    features2, target2, test_size=0.4, random_state=12345)

features_test2, features_valid2, target_test2, target_valid2 = train_test_split(
    features_x2, target_x2, test_size=0.5, random_state=12345)

scaler2 = StandardScaler()
scaler2.fit(features_train2)

features_train2 = pd.DataFrame(scaler2.transform(features_train2), columns=features_train2.columns, index=features_train2.index)
features_valid2 = pd.DataFrame(scaler2.transform(features_valid2), columns=features_valid2.columns, index=features_valid2.index)
features_test2 = pd.DataFrame(scaler2.transform(features_test2), columns=features_test2.columns, index=features_test2.index)

lr2 = LogisticRegression(solver='liblinear', random_state=12345)
lr2.fit(features_train2, target_train2)
predicted_valid2 = lr2.predict(features_valid2)

print(cr(target_valid2, predicted_valid2))

prob_valid2 = lr2.predict_proba(features_valid2)
prob_one_valid2 = prob_valid2[:, 1]

print('AUC-ROC:', roc_auc_score(target_valid2, prob_one_valid2))

              precision    recall  f1-score   support

           0       0.84      0.97      0.90      1468
           1       0.65      0.21      0.32       351

    accuracy                           0.83      1819
   macro avg       0.74      0.59      0.61      1819
weighted avg       0.80      0.83      0.79      1819

AUC-ROC: 0.7809799948764526


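Let's try other models and take the best one on to resampling. Filling with zero and with the median give the same F1 (0.27). Dropping the rows scores a little higher (0.32), but on a different, smaller validation set and at the cost of 9% of the data. So we will use the first dataset (`data0`).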
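Since every model is scored the same way (fit, predict, F1, AUC-ROC), the repeated steps could be wrapped in one helper. This is only a sketch: `evaluate` is a hypothetical name, and the synthetic data from `make_classification` stands in for the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, y_train, X_valid, y_valid):
    """Fit `model` and return (F1, AUC-ROC) for class 1 on the validation split."""
    model.fit(X_train, y_train)
    predicted = model.predict(X_valid)
    prob_one = model.predict_proba(X_valid)[:, 1]
    return f1_score(y_valid, predicted), roc_auc_score(y_valid, prob_one)

# Synthetic imbalanced data (roughly 20% positives); an assumption, not the real data.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

f1, auc = evaluate(LogisticRegression(solver='liblinear', random_state=12345),
                   X_train, y_train, X_valid, y_valid)
print(f'F1 = {f1:.2f}, AUC-ROC = {auc:.2f}')
```

The helper works with any estimator that has `predict_proba`, so the same call could take a `RandomForestClassifier` or a `DecisionTreeClassifier`.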

In [22]:
rfc_best = RandomForestClassifier(max_depth=9, min_samples_leaf=3, min_samples_split=2, n_estimators=30, random_state=12345)
rfc_best.fit(features_train0, target_train0)
rfc_predicted_valid0 = rfc_best.predict(features_valid0)

print(cr(target_valid0, rfc_predicted_valid0))

rfc_prob_valid0 = rfc_best.predict_proba(features_valid0)
rfc_prob_one_valid0 = rfc_prob_valid0[:, 1]

print('AUC-ROC:', roc_auc_score(target_valid0, rfc_prob_one_valid0))

              precision    recall  f1-score   support

           0       0.85      0.98      0.91      1577
           1       0.80      0.37      0.50       423

    accuracy                           0.85      2000
   macro avg       0.83      0.67      0.71      2000
weighted avg       0.84      0.85      0.82      2000

AUC-ROC: 0.8497896026060194


This looks much more promising: the random forest reaches an F1 of 0.50 for the positive class and an AUC-ROC of about 0.85.

In [23]:
from sklearn.calibration import CalibratedClassifierCV  # added later, once AUC-ROC was needed

svc_best = LinearSVC(C=0.02, dual=True, random_state=12345)
svc_best.fit(features_train0, target_train0)
svc_predicted_valid0 = svc_best.predict(features_valid0)

print(cr(target_valid0, svc_predicted_valid0))

# LinearSVC has no predict_proba, so it is wrapped in a calibrator to get probabilities.
# Caveat: the calibrator is fit and scored on the same validation rows,
# so the AUC-ROC printed below is likely optimistic.
clf = CalibratedClassifierCV(base_estimator=svc_best, cv=5)
clf.fit(features_valid0, target_valid0)
svc_prob_valid0 = clf.predict_proba(features_valid0)
svc_prob_one_valid0 = svc_prob_valid0[:, 1]

print('AUC-ROC:', roc_auc_score(target_valid0, svc_prob_one_valid0))

              precision    recall  f1-score   support

           0       0.81      0.97      0.88      1577
           1       0.59      0.14      0.23       423

    accuracy                           0.80      2000
   macro avg       0.70      0.56      0.56      2000
weighted avg       0.76      0.80      0.75      2000

AUC-ROC равна 0.755128014858988


In [24]:
tree = DecisionTreeClassifier

param_tree = {'max_depth': range(1, 10),
              'min_samples_leaf': range(1, 5),
              'min_samples_split': range(2, 5)}
# Note: for classifiers GridSearchCV scores by accuracy unless scoring is set (e.g. scoring='f1').
grid_tree = GridSearchCV(tree(random_state=12345), param_tree, cv=5)
grid_tree.fit(features_train0, target_train0)

grid_tree.best_params_

{'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 4}

In [25]:
tree_best = DecisionTreeClassifier(max_depth=7, min_samples_leaf=1, min_samples_split=4, random_state=12345)
tree_best.fit(features_train0, target_train0)
tree_predicted_valid0 = tree_best.predict(features_valid0)

print(cr(target_valid0, tree_predicted_valid0))

tree_prob_valid0 = tree_best.predict_proba(features_valid0)
tree_prob_one_valid0 = tree_prob_valid0[:, 1]

print('AUC-ROC:', roc_auc_score(target_valid0, tree_prob_one_valid0))

              precision    recall  f1-score   support

           0       0.85      0.96      0.90      1577
           1       0.71      0.39      0.50       423

    accuracy                           0.84      2000
   macro avg       0.78      0.67      0.70      2000
weighted avg       0.82      0.84      0.82      2000

AUC-ROC: 0.8239257890089661


## Handling class imbalance

In [26]:
from sklearn.utils import shuffle

In [27]:
def upsample(features, target, repeat):
    """Repeat the positive-class (target == 1) rows `repeat` times, then shuffle."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    # Shuffle both objects with one permutation so each row stays aligned with its label.
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)

    return features_upsampled, target_upsampled
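A quick sanity check of the idea on a tiny example. The block restates the function in compact form so that it runs on its own. The toy frame is made up: four class-0 rows and one class-1 row, which a repeat factor of 4 should balance to 4 vs 4.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat):
    # Same idea as the notebook's function: repeat the class-1 rows `repeat` times.
    features_up = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
    target_up = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)
    # One shuffle call applies the same permutation to both objects.
    return shuffle(features_up, target_up, random_state=12345)

# Hypothetical toy data: the only positive is the customer aged 38.
toy_features = pd.DataFrame({"age": [25, 31, 47, 52, 38]})
toy_target = pd.Series([0, 0, 0, 0, 1])

features_up, target_up = upsample(toy_features, toy_target, 4)

print(target_up.sum(), len(target_up))                # positives and total rows
print((features_up.index == target_up.index).all())   # rows still match labels
```

Upsampling should be applied to the training split only. Validation and test rows must keep their original class balance, otherwise the metrics stop reflecting real data.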

In [28]:
features_upsampled1, target_upsampled1 = upsample(features_train0, target_train0, 4)

In [29]:
target_upsampled1.sum(), len(target_upsampled1)

(4784, 9588)

In [30]:
# One more check: the same forest, now trained on the upsampled data
rfc_best = RandomForestClassifier(max_depth=9, min_samples_leaf=3, min_samples_split=2, n_estimators=30, random_state=12345)
rfc_best.fit(features_upsampled1, target_upsampled1)
rfc_predicted_up_test = rfc_best.predict(features_test0)

print(cr(target_test0, rfc_predicted_up_test))

rfc_prob_test = rfc_best.predict_proba(features_test0)
rfc_prob_one_test = rfc_prob_test[:, 1]

print('AUC-ROC:', roc_auc_score(target_test0, rfc_prob_one_test))

              precision    recall  f1-score   support

           0       0.92      0.84      0.88      1582
           1       0.54      0.71      0.61       418

    accuracy                           0.81      2000
   macro avg       0.73      0.77      0.74      2000
weighted avg       0.84      0.81      0.82      2000

AUC-ROC: 0.8505268601915086


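F1 for the positive class was 0.50 before upsampling and is now 0.61, so the change is positive. Note that 0.50 was measured on the validation set and 0.61 on the test set, so the comparison is indicative rather than exact.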

In [31]:
for i in range (1, 11):
    features_upsampled1, target_upsampled1 = upsample(features_train0, target_train0, i)
    
    rfc_best.fit(features_upsampled1, target_upsampled1)
    rfc_predicted_up_test = rfc_best.predict(features_test0)

    rfc_prob_test = rfc_best.predict_proba(features_test0)
    rfc_prob_one_test = rfc_prob_test[:, 1]

    print('Repeat factor:', i)
    print(cr(target_test0, rfc_predicted_up_test))
    print('AUC-ROC:', roc_auc_score(target_test0, rfc_prob_one_test))

Repeat factor: 1
              precision    recall  f1-score   support

           0       0.87      0.97      0.92      1582
           1       0.80      0.45      0.57       418

    accuracy                           0.86      2000
   macro avg       0.83      0.71      0.74      2000
weighted avg       0.85      0.86      0.84      2000

AUC-ROC: 0.8518924019622669
Repeat factor: 2
              precision    recall  f1-score   support

           0       0.89      0.94      0.91      1582
           1       0.71      0.55      0.62       418

    accuracy                           0.86      2000
   macro avg       0.80      0.75      0.77      2000
weighted avg       0.85      0.86      0.85      2000

AUC-ROC: 0.8498750899775585
Repeat factor: 3
              precision    recall  f1-score   support

           0       0.90      0.89      0.90      1582
           1       0.61      0.64      0.62       418

    accuracy                 