# Description
Customers have started leaving Beta-Bank («Бета-Банк»). Every month, a few at a time, but noticeably. The bank's marketers calculated that retaining existing customers is cheaper than attracting new ones.
The task is to predict whether a customer will leave the bank in the near future. Historical data on customer behavior and contract terminations is provided.
Build a model with the highest possible F1 score. To pass the project, the metric must reach at least 0.59. Check the F1 score on the test set yourself.
Additionally, measure AUC-ROC and compare its value with the F1 score.

# Data preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

In [2]:
# load the data into the variable df
df = pd.read_csv('/datasets/Churn.csv')

In [3]:
# check the table size, the missing values, and a preview of the table
print('Table size: {}\n\nMissing values:\n{}'
      .format(df.shape, df.isnull().sum()), '\n')
print(df.info())
df.head()

Table size: (10000, 14)

Missing values:
RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOf

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [4]:
# look at the value counts and unique values in the column that has missing values
print('Unique values:\n{}\n{}\n\nShare of missing values: {:.2%}'
      .format(df['Tenure'].value_counts(), df['Tenure']
              .unique(), df['Tenure'].isna().sum() / len(df)))

Unique values:
1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: Tenure, dtype: int64
[ 2.  1.  8.  7.  4.  6.  3. 10.  5.  9.  0. nan]

Share of missing values: 9.09%


In [5]:
# check for duplicate customer IDs: if any existed, the missing Tenure values could in theory be filled from them
df['CustomerId'].duplicated().sum()

0

In [6]:
# drop the rows with missing values (only Tenure has any)
df.dropna(inplace=True)
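
Dropping the rows costs about 9% of the data. As an alternative to consider, not what this notebook does, the gaps could be filled with the median and flagged in a separate column. The sketch below uses a toy frame with illustrative values; the column name `Tenure_missing` is an assumption introduced for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in Tenure; the values are illustrative only.
toy = pd.DataFrame({'Tenure': [2.0, 1.0, np.nan, 8.0, np.nan, 5.0]})

# Keep a flag so a model can still see which rows were originally missing.
toy['Tenure_missing'] = toy['Tenure'].isna().astype(int)

# Fill the gaps with the median of the observed values.
toy['Tenure'] = toy['Tenure'].fillna(toy['Tenure'].median())

print(toy)
```

In a real pipeline the median would normally be computed on the training split only and then reused for the validation and test splits.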

In [7]:
# drop the row id, the customer surname, and the customer id; store the result in df_ohe
# the original dataset is left unchanged
df_ohe = df.drop(['RowNumber', 'Surname', 'CustomerId'], axis=1)
# one-hot encode the categorical features (drop_first avoids a redundant column)
df_ohe = pd.get_dummies(df_ohe, drop_first=True)
df_ohe.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0


In [8]:
# store the training features and the target in separate variables
features = df_ohe.drop(['Exited'], axis=1)
target = df_ohe['Exited']

In [9]:
# split the dataset in a 3:1:1 ratio
# train = 60%, valid and test = 20% each
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12345
)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.5, random_state=12345
)
# check the sizes of the resulting splits
print('Train: {} {}\nValid: {} {}\nTest:  {} {}'
      .format(features_train.shape, target_train.shape, features_valid.shape,
              target_valid.shape, features_test.shape, target_test.shape))

Train: (5454, 11) (5454,)
Valid: (1818, 11) (1818,)
Test:  (1819, 11) (1819,)
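
The split above is purely random, so the share of class 1 can drift slightly between the three parts. A possible variation, shown here as a sketch on synthetic data rather than on the churn table, is to pass `stratify` to `train_test_split` so that each part keeps roughly the same class proportions. The array names are placeholders for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: 1000 rows, about 20% positives.
rng = np.random.RandomState(12345)
X = rng.normal(size=(1000, 3))
y = (rng.uniform(size=1000) < 0.2).astype(int)

# Split off 40% first, then halve it into validation and test,
# stratifying on the target at both steps.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345, stratify=y_rest
)

for name, part in [('train', y_train), ('valid', y_valid), ('test', y_test)]:
    print('{}: n={}, share of class 1={:.3f}'.format(name, len(part), part.mean()))
```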


In [10]:
# copy the splits so that the scaled columns are written to independent frames
# (this avoids pandas' SettingWithCopyWarning on the assignments below)
features_train = features_train.copy()
features_valid = features_valid.copy()
features_test = features_test.copy()
scaler = StandardScaler()
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
# fit the scaler on the training split only, then apply it to all three splits
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])


In [11]:
# check the class balance
target_train.value_counts(normalize=True)

0    0.793546
1    0.206454
Name: Exited, dtype: float64

# Data preparation: summary

At this stage we dropped the 909 rows (9.09%) with missing values in the Tenure column, split the dataset into training, validation, and test sets (60/20/20), and standardized the quantitative features so that features with a large numeric range, such as Balance or EstimatedSalary, are not given undue weight by scale-sensitive models. Note that the target Exited is imbalanced: class 0 makes up about 80% of the training set and class 1 about 20%. Next we compare how the models behave on the imbalanced data and after the classes are balanced.
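
To give the F1 numbers some scale under this kind of imbalance, it can help to see what a constant prediction scores. The sketch below uses a toy 80/20 target, not the real data, and compares "flag everyone" with "flag no one".

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy target with roughly the same balance as Exited: 80% zeros, 20% ones.
target = np.array([0] * 80 + [1] * 20)

# Constant "model" that flags every customer as leaving.
all_ones = np.ones_like(target)
# Constant "model" that never flags anyone.
all_zeros = np.zeros_like(target)

f1_all_ones = f1_score(target, all_ones)
# zero_division=0 silences the warning raised when no positives are predicted.
f1_all_zeros = f1_score(target, all_zeros, zero_division=0)

print('F1, always 1: {:.3f}'.format(f1_all_ones))  # 0.333
print('F1, always 0: {:.3f}'.format(f1_all_zeros))  # 0.000
```

With precision 0.2 and recall 1.0, flagging everyone gives F1 = 2 · 0.2 · 1.0 / 1.2 ≈ 0.33, so a model with F1 near that level adds little over a constant answer.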

# Model exploration

In [12]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

In [13]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 5) # with repeat=5 the classes end up close to 1:1 (about 57:43 in favor of class 1)

In [14]:
print('Class balance after upsampling:\n{}'
      .format(target_upsampled.value_counts(normalize=True)))

Class balance after upsampling:
1    0.565375
0    0.434625
Name: Exited, dtype: float64


In [15]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

In [16]:
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.2) # with fraction=0.2 the classes end up close to 1:1 (about 57:43 in favor of class 1)

In [17]:
print('Class balance after downsampling:\n{}'
      .format(target_downsampled.value_counts(normalize=True)))

Class balance after downsampling:
1    0.565261
0    0.434739
Name: Exited, dtype: float64
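
The values `repeat=5` and `fraction=0.2` can be checked with simple arithmetic. The sketch below assumes the class counts implied by the outputs above (5454 training rows with a 0.206454 share of class 1, that is 1126 ones and 4328 zeros) and approximates the rounding done by `sample(frac=...)` with Python's `round`. The helper names are introduced only for this example.

```python
# Class counts in the training split, derived from the outputs above.
n_ones = 1126
n_zeros = 5454 - n_ones  # 4328


def share_after_upsample(repeat):
    """Share of class 1 after repeating its rows `repeat` times."""
    return n_ones * repeat / (n_ones * repeat + n_zeros)


def share_after_downsample(fraction):
    """Share of class 1 after keeping a `fraction` of the class 0 rows."""
    kept_zeros = round(n_zeros * fraction)
    return n_ones / (n_ones + kept_zeros)


for repeat in (3, 4, 5):
    print('repeat={}: class 1 share = {:.3f}'.format(
        repeat, share_after_upsample(repeat)))

for fraction in (0.2, 0.26, 0.3):
    print('fraction={}: class 1 share = {:.3f}'.format(
        fraction, share_after_downsample(fraction)))
```

Under these assumptions `repeat=5` and `fraction=0.2` both reproduce the 0.565 share shown above, while `repeat=4` or `fraction=0.26` would land closer to an even split.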


# LogisticRegression

In [18]:
# baseline: imbalanced classes
model = LogisticRegression(solver='liblinear', random_state=12345)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
score = f1_score(target_valid, predicted_valid)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
roc_score = roc_auc_score(target_valid, probabilities_one_valid)

print('F1-score: {}\nROC-AUC: {}'.format(score, roc_score))

F1-score: 0.30400000000000005
ROC-AUC: 0.7736191158144302


In [19]:
# balanced classes via the class_weight='balanced' parameter
model = LogisticRegression(class_weight='balanced', solver='liblinear', random_state=12345)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
score = f1_score(target_valid, predicted_valid)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
roc_score = roc_auc_score(target_valid, probabilities_one_valid)

print('F1-score: {}\nROC-AUC: {}'.format(score, roc_score))

F1-score: 0.509731232622799
ROC-AUC: 0.7777884132187896


In [20]:
# balanced classes via the upsampled training set
model = LogisticRegression(solver='liblinear', random_state=12345)
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
score = f1_score(target_valid, predicted_valid)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
roc_score = roc_auc_score(target_valid, probabilities_one_valid)

print('F1-score: {}\nROC-AUC: {}'.format(score, roc_score))

F1-score: 0.4867042707493956
ROC-AUC: 0.7783111860500647


In [21]:
# balanced classes via the downsampled training set
model = LogisticRegression(solver='liblinear', random_state=12345)
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
score = f1_score(target_valid, predicted_valid)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
roc_score = roc_auc_score(target_valid, probabilities_one_valid)

print('F1-score: {}\nROC-AUC: {}'.format(score, roc_score))

F1-score: 0.48878205128205127
ROC-AUC: 0.7786682914348089


# DecisionTreeClassifier

In [22]:
# baseline: imbalanced classes
for depth in range(3, 16, 2):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]
    roc_score = roc_auc_score(target_valid, probabilities_one_valid)
    
    print('Depth: {}\nF1-score: {}\nROC-AUC: {}\n'.format(depth, score, roc_score))

Depth: 3
F1-score: 0.3726708074534161
ROC-AUC: 0.7994383873562604

Depth: 5
F1-score: 0.5140712945590994
ROC-AUC: 0.8471027524725867

Depth: 7
F1-score: 0.5764331210191083
ROC-AUC: 0.8346049843812412

Depth: 9
F1-score: 0.5446153846153846
ROC-AUC: 0.794063399091038

Depth: 11
F1-score: 0.5128205128205129
ROC-AUC: 0.7138168491156119

Depth: 13
F1-score: 0.4825662482566248
ROC-AUC: 0.6805002052435589

Depth: 15
F1-score: 0.47411444141689374
ROC-AUC: 0.6718339570405903



In [23]:
# balanced classes via the class_weight='balanced' parameter
for depth in range(3, 16, 2):
    model = DecisionTreeClassifier(class_weight='balanced', max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]
    roc_score = roc_auc_score(target_valid, probabilities_one_valid)
    
    print('Depth: {}\nF1-score: {}\nROC-AUC: {}\n'.format(depth, score, roc_score))

Depth: 3
F1-score: 0.548936170212766
ROC-AUC: 0.7949267105624042

Depth: 5
F1-score: 0.5735449735449736
ROC-AUC: 0.8396523192522141

Depth: 7
F1-score: 0.5413533834586466
ROC-AUC: 0.8106043732524385

Depth: 9
F1-score: 0.5059978189749182
ROC-AUC: 0.7565295983300722

Depth: 11
F1-score: 0.5123595505617977
ROC-AUC: 0.7135416570794302

Depth: 13
F1-score: 0.4882280049566295
ROC-AUC: 0.6767864933171592

Depth: 15
F1-score: 0.4725848563968668
ROC-AUC: 0.6659481608152311



In [24]:
# balanced classes via the upsampled training set
for depth in range(3, 16, 2):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(features_upsampled, target_upsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]
    roc_score = roc_auc_score(target_valid, probabilities_one_valid)
    
    print('Depth: {}\nF1-score: {}\nROC-AUC: {}\n'.format(depth, score, roc_score))

Depth: 3
F1-score: 0.5644599303135889
ROC-AUC: 0.808227045394721

Depth: 5
F1-score: 0.5628415300546449
ROC-AUC: 0.8525219187235507

Depth: 7
F1-score: 0.5583566760037347
ROC-AUC: 0.820662228006266

Depth: 9
F1-score: 0.5303760848601736
ROC-AUC: 0.7658428699492139

Depth: 11
F1-score: 0.5151832460732985
ROC-AUC: 0.7257439848911288

Depth: 13
F1-score: 0.513953488372093
ROC-AUC: 0.7156511558985894

Depth: 15
F1-score: 0.5126262626262627
ROC-AUC: 0.6970908796389186



In [25]:
# balanced classes via the downsampled training set
for depth in range(3, 16, 2):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(features_downsampled, target_downsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]
    roc_score = roc_auc_score(target_valid, probabilities_one_valid)
    
    print('Depth: {}\nF1-score: {}\nROC-AUC: {}\n'.format(depth, score, roc_score))

Depth: 3
F1-score: 0.5644599303135889
ROC-AUC: 0.7960127527118841

Depth: 5
F1-score: 0.5485519591141397
ROC-AUC: 0.8370393754705416

Depth: 7
F1-score: 0.5516621743036837
ROC-AUC: 0.8186328017862632

Depth: 9
F1-score: 0.5261261261261261
ROC-AUC: 0.7805007574683805

Depth: 11
F1-score: 0.525
ROC-AUC: 0.7395192330701675

Depth: 13
F1-score: 0.5022182786157942
ROC-AUC: 0.7207813244928275

Depth: 15
F1-score: 0.4956369982547993
ROC-AUC: 0.7158766477008118
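
The depth sweeps above print one block per depth, which makes the runs hard to compare at a glance. A possible alternative is to collect the scores into a table and sort it. The sketch below does this on synthetic data; the variable names `rows`, `results`, and `best_depth` are introduced for this example only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, a stand-in for the notebook's splits.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

# One row of scores per depth instead of one printed block per depth.
rows = []
for depth in range(3, 16, 2):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    rows.append({
        'depth': depth,
        'f1': f1_score(y_valid, model.predict(X_valid)),
        'roc_auc': roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
    })

# Sort by validation F1 so the best depth comes first.
results = pd.DataFrame(rows).sort_values('f1', ascending=False)
print(results.to_string(index=False))

best_depth = int(results.iloc[0]['depth'])
print('Best depth by validation F1:', best_depth)
```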



# RandomForestClassifier

In [26]:
# baseline: imbalanced classes
for depth in range(3, 16, 2):
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]
    roc_score = roc_auc_score(target_valid, probabilities_one_valid)
    
    print('Depth: {}\nF1-score: {}\nROC-AUC: {}\n'.format(depth, score, roc_score))

Depth: 3
F1-score: 0.24537037037037032
ROC-AUC: 0.843949548740283

Depth: 5
F1-score: 0.4924242424242424
ROC-AUC: 0.8613685603682972

Depth: 7
F1-score: 0.5402504472271914
ROC-AUC: 0.8683680099842248

Depth: 9
F1-score: 0.5543859649122806
ROC-AUC: 0.8722869654693819

Depth: 11
F1-score: 0.570940170940171
ROC-AUC: 0.8701351294138869

Depth: 13
F1-score: 0.5748299319727891
ROC-AUC: 0.8719519490775085

Depth: 15
F1-score: 0.5841584158415842
ROC-AUC: 0.8661683144441765



In [27]:
# balanced classes via the class_weight='balanced' parameter
for depth in range(3, 16, 2):
    model = RandomForestClassifier(class_weight='balanced', n_estimators=100, max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]
    roc_score = roc_auc_score(target_valid, probabilities_one_valid)
    
    print('Depth: {}\nF1-score: {}\nROC-AUC: {}\n'.format(depth, score, roc_score))

Depth: 3
F1-score: 0.5864197530864198
ROC-AUC: 0.8511671271608097

Depth: 5
F1-score: 0.5995717344753747
ROC-AUC: 0.864318361291249

Depth: 7
F1-score: 0.6252821670428894
ROC-AUC: 0.867819466661267

Depth: 9
F1-score: 0.6332916145181477
ROC-AUC: 0.8695387266063759

Depth: 11
F1-score: 0.6342141863699583
ROC-AUC: 0.8704001973283364

Depth: 13
F1-score: 0.60625
ROC-AUC: 0.8658719537898271

Depth: 15
F1-score: 0.600326264274062
ROC-AUC: 0.8630105088383582



In [28]:
# balanced classes via the upsampled training set
for depth in range(3, 16, 2):
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=12345)
    model.fit(features_upsampled, target_upsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]
    roc_score = roc_auc_score(target_valid, probabilities_one_valid)
    
    print('Depth: {}\nF1-score: {}\nROC-AUC: {}\n'.format(depth, score, roc_score))

Depth: 3
F1-score: 0.5471537807986407
ROC-AUC: 0.8475518953276258

Depth: 5
F1-score: 0.572202166064982
ROC-AUC: 0.8623966189114913

Depth: 7
F1-score: 0.5858778625954199
ROC-AUC: 0.8660928437185347

Depth: 9
F1-score: 0.608515057113188
ROC-AUC: 0.8692294807061851

Depth: 11
F1-score: 0.6214039125431532
ROC-AUC: 0.8650896352923202

Depth: 13
F1-score: 0.6216560509554141
ROC-AUC: 0.8630896610628118

Depth: 15
F1-score: 0.6181318681318682
ROC-AUC: 0.860828300417666



In [29]:
# balanced classes via the downsampled training set
for depth in range(3, 16, 2):
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=12345)
    model.fit(features_downsampled, target_downsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    
    probabilities_valid = model.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]
    roc_score = roc_auc_score(target_valid, probabilities_one_valid)
    
    print('Depth: {}\nF1-score: {}\nROC-AUC: {}\n'.format(depth, score, roc_score))

Depth: 3
F1-score: 0.5314009661835748
ROC-AUC: 0.8450972559948606

Depth: 5
F1-score: 0.5504273504273504
ROC-AUC: 0.8615443519365603

Depth: 7
F1-score: 0.5621716287215411
ROC-AUC: 0.8673611200591985

Depth: 9
F1-score: 0.5654082528533801
ROC-AUC: 0.8688097898416404

Depth: 11
F1-score: 0.5565371024734982
ROC-AUC: 0.8677881739213669

Depth: 13
F1-score: 0.5691202872531419
ROC-AUC: 0.86870670787491

Depth: 15
F1-score: 0.5653333333333332
ROC-AUC: 0.8630758554422675
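
`GridSearchCV` is imported at the top of the notebook but never used. As a sketch of how the manual loops could be replaced, the example below searches over `max_depth` and `class_weight` with F1 as the scoring metric. Note that this swaps the fixed validation split for cross-validation, so its scores would not be directly comparable with the ones above; the grid values, `n_estimators=20`, and the synthetic data are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data, a stand-in for the training split.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=12345
)

# Small illustrative grid; the notebook's own sweep used depths 3 to 15.
param_grid = {
    'max_depth': [3, 5, 7],
    'class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=12345),
    param_grid,
    scoring='f1',  # optimize the same metric the project is graded on
    cv=3,          # 3-fold cross-validation instead of one validation split
)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best cross-validated F1: {:.3f}'.format(search.best_score_))
```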



# Final testing

In [30]:
# evaluate the selected model on the test set
model = RandomForestClassifier(class_weight='balanced', n_estimators=100, max_depth=9, random_state=12345)
model.fit(features_train, target_train)
predicted_test = model.predict(features_test)
score = f1_score(target_test, predicted_test)

# ROC-AUC score
# NOTE: this is computed on the validation split, not the test split;
# pass features_test and target_test here to get the test-set ROC-AUC
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
roc_score = roc_auc_score(target_valid, probabilities_one_valid)

print('F1-score (test): {}\nROC-AUC (valid): {}\n'.format(score, roc_score))

F1-score (test): 0.6063829787234043
ROC-AUC (valid): 0.8695387266063759
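
To report both metrics for the same split, the two computations can be wrapped in one function that takes the features and target together. The sketch below shows the idea on synthetic data; `evaluate` is a hypothetical helper, not part of the notebook, and `n_estimators=50` is chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, features, target):
    """Return (F1, ROC-AUC) for one split, using the same rows for both."""
    predicted = model.predict(features)
    proba_one = model.predict_proba(features)[:, 1]
    return f1_score(target, predicted), roc_auc_score(target, proba_one)


# Synthetic stand-in data; the notebook would pass its own splits instead.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=12345
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y
)

model = RandomForestClassifier(
    class_weight='balanced', n_estimators=50, max_depth=9, random_state=12345
)
model.fit(X_train, y_train)

f1_test, roc_auc_test = evaluate(model, X_test, y_test)
print('F1 (test): {:.3f}\nROC-AUC (test): {:.3f}'.format(f1_test, roc_auc_test))
```

In the notebook this would correspond to calling `evaluate(model, features_test, target_test)` after fitting.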



# Results

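On the test set, the RandomForestClassifier model reached F1 ≈ 0.61 (0.606), above the project target of 0.59. Parameters: class_weight = 'balanced', n_estimators = 100, max_depth = 9. On the validation set this configuration scored F1 = 0.633; max_depth = 11 was marginally higher at 0.634.
ROC-AUC = 0.87. Note that the final cell computes this value on the validation split (it matches the depth-9 validation result), so the test-set ROC-AUC still has to be measured.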