# Отток клиентов

Из «Бета-Банка» стали уходить клиенты. Каждый месяц. Немного, но заметно. Банковские маркетологи посчитали: сохранять текущих клиентов дешевле, чем привлекать новых.

Нужно спрогнозировать, уйдёт клиент из банка в ближайшее время или нет. Вам предоставлены исторические данные о поведении клиентов и расторжении договоров с банком. 

Постройте модель с предельно большим значением *F1*-меры. Чтобы сдать проект успешно, нужно довести метрику до 0.59. Проверьте *F1*-меру на тестовой выборке самостоятельно.

Дополнительно измеряйте *AUC-ROC*, сравнивайте её значение с *F1*-мерой.

Источник данных: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression 
from sklearn.dummy import DummyClassifier
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.utils import shuffle
import numpy as np
from sklearn.preprocessing import OneHotEncoder

## Подготовка данных

In [2]:
df = pd.read_csv('/datasets/Churn.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [4]:
df.head(20)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [5]:
df.corr()

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
RowNumber,1.0,0.004202,0.00584,0.000783,-0.007322,-0.009067,0.007246,0.000599,0.012044,-0.005988,-0.016571
CustomerId,0.004202,1.0,0.005308,0.009497,-0.021418,-0.012419,0.016972,-0.014025,0.001665,0.015271,-0.006248
CreditScore,0.00584,0.005308,1.0,-0.003965,-6.2e-05,0.006268,0.012238,-0.005458,0.025651,-0.001384,-0.027094
Age,0.000783,0.009497,-0.003965,1.0,-0.013134,0.028308,-0.03068,-0.011721,0.085472,-0.007201,0.285323
Tenure,-0.007322,-0.021418,-6.2e-05,-0.013134,1.0,-0.007911,0.011979,0.027232,-0.032178,0.01052,-0.016761
Balance,-0.009067,-0.012419,0.006268,0.028308,-0.007911,1.0,-0.30418,-0.014858,-0.010084,0.012797,0.118533
NumOfProducts,0.007246,0.016972,0.012238,-0.03068,0.011979,-0.30418,1.0,0.003183,0.009612,0.014204,-0.04782
HasCrCard,0.000599,-0.014025,-0.005458,-0.011721,0.027232,-0.014858,0.003183,1.0,-0.011866,-0.009933,-0.007138
IsActiveMember,0.012044,0.001665,0.025651,0.085472,-0.032178,-0.010084,0.009612,-0.011866,1.0,-0.011421,-0.156128
EstimatedSalary,-0.005988,0.015271,-0.001384,-0.007201,0.01052,0.012797,0.014204,-0.009933,-0.011421,1.0,0.012097


На целевой признак в большей степени влияют: баланс счета и возраст, в меньшей степени наличие кридитной карты, при необходимости этот столбец можно не использовать при обучении модели. 

для построения модели прогноза следующие столбцы не требуются для анализа:\
RowNumber — индекс строки в данных (не влияют на уход клиента из банка)\
CustomerId — уникальный идентификатор клиента\
Surname — фамилия

In [6]:
df['Exited'].value_counts(normalize= True )

0    0.7963
1    0.2037
Name: Exited, dtype: float64

баланс классов 1 к 4

In [7]:
#df.groupby('Exited')

вероятно необходима болансировка.

In [8]:
df = df.drop(columns = ['RowNumber', 'CustomerId', 'Surname'])
#df = df.drop(columns = ['HasCrCard'])

Tenure в столбце присутсвую пропуски, думаю, что их стоит заменить на "0". с значением, что это новые клиенты. (возможно, их нужно удалить, поскольку это новые клиенты и они могут помешать обучению модели?) 

In [9]:
df['Tenure'].isna().sum()

909

In [10]:
df.groupby('Tenure')['Exited'].count()

Tenure
0.0     382
1.0     952
2.0     950
3.0     928
4.0     885
5.0     927
6.0     881
7.0     925
8.0     933
9.0     882
10.0    446
Name: Exited, dtype: int64

In [11]:
df['Tenure'] = df['Tenure'].fillna(0)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


Для подготовки признаков применим OE

In [13]:
#encoder = OrdinalEncoder()
#df = pd.DataFrame(encoder.fit_transform(df), columns=df.columns)

In [14]:
df.head(10)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [15]:
features = df.drop(['Exited'], axis=1)
target = df['Exited']

In [16]:
features_train, features_valid1, target_train, target_valid1 = train_test_split(
    features, target, test_size=0.4, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid1, target_valid1, test_size=0.5, random_state=12345)

In [17]:
print('Размер обучающей выборки', features_train.shape[0], f'{  features_train.shape[0]/df.shape[0]:.1%}')
print('Размер  валидационной выборки', features_valid.shape[0], f'{  features_valid.shape[0]/df.shape[0]:.1%}')
print('Размер тестовой выборки', features_test.shape[0], f'{  features_test.shape[0]/df.shape[0]:.1%}')

Размер обучающей выборки 6000 60.0%
Размер  валидационной выборки 2000 20.0%
Размер тестовой выборки 2000 20.0%


применим кодирование OHE из sklearn.

In [18]:
var_categorical = ['Gender', 'Geography']

In [19]:
#ohe_train = OneHotEncoder(handle_unknown='error',sparse=False)
#ohe_train.set_params(drop= 'first')

In [20]:
#features_train_ohe = ohe_train.fit_transform(features_train.loc[:,['Gender','Geography']])

In [21]:
#ohe_train.get_feature_names()

In [22]:
#features_train_temp = features_train.drop([var_categorical], axis=1).reset_index(drop=True)

In [23]:
#array_temp_features_train_ohe = pd.DataFrame(features_train_ohe, columns =['Gender_Male',
#        'Geography_Germany', 'Geography_Spain'])

In [24]:
cat = pd.get_dummies(features_train[var_categorical], drop_first=True)
features_train_ohe=pd.concat([features_train, cat], axis=1)

In [25]:
cat2 = pd.get_dummies(features_valid[var_categorical], drop_first=True)
features_valid_ohe=pd.concat([features_valid, cat2], axis=1)

In [26]:
cat3 = pd.get_dummies(features_test[var_categorical], drop_first=True)
features_test_ohe=pd.concat([features_test, cat3], axis=1)

Большой разброс колличественных признаков требует применение масштабирования

In [27]:
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'IsActiveMember', 'EstimatedSalary', 'NumOfProducts']
scaler = StandardScaler()
scaler.fit(features_train_ohe[numeric])
features_train_ohe[numeric] = scaler.transform(features_train_ohe[numeric])
features_valid_ohe[numeric] = scaler.transform(features_valid_ohe[numeric])
features_test_ohe[numeric] = scaler.transform(features_test_ohe[numeric])

In [28]:
features_train_ohe.head(10)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Gender_Male,Geography_Germany,Geography_Spain
7479,-0.886751,Spain,Male,-0.373192,1.104696,1.232271,-0.89156,1,-1.055187,-0.187705,1,0,1
3411,0.608663,France,Female,-0.183385,1.104696,0.600563,-0.89156,0,-1.055187,-0.333945,0,0,0
6027,2.052152,Germany,Male,0.480939,-0.503694,1.027098,0.830152,0,0.947699,1.503095,1,1,0
1247,-1.457915,France,Male,-1.417129,0.46134,-1.233163,0.830152,1,-1.055187,-1.071061,1,0,0
3716,0.130961,Germany,Female,-1.132419,-0.825373,1.140475,-0.89156,0,-1.055187,1.524268,0,1,0
8741,-0.315586,Spain,Female,-1.512032,1.104696,-1.233163,0.830152,1,0.947699,0.552345,0,0,1
7461,-0.585591,Spain,Male,-0.657902,-0.182016,-1.233163,0.830152,0,0.947699,0.814122,1,0,1
5106,-0.544052,France,Female,-0.657902,-1.147051,0.031212,-0.89156,1,-1.055187,-0.608723,0,0,0
6130,-0.211738,Germany,Female,-0.373192,-0.825373,1.190787,0.830152,0,-1.055187,-0.602263,0,1,0
4955,1.273291,Germany,Male,-0.562998,-1.468729,0.111168,0.830152,1,-1.055187,0.508214,1,1,0


In [29]:
features_train_ohe = features_train_ohe.drop(columns = var_categorical)
features_valid_ohe = features_valid_ohe.drop(columns = var_categorical)
features_test_ohe = features_test_ohe.drop(columns = var_categorical)

Разделим датафрейм на выборки

In [30]:
#features = df.drop(['Exited'], axis=1)
#target = df['Exited']

In [31]:
#features_train, features_valid1, target_train, target_valid1 = train_test_split(
#    features, target, test_size=0.4, random_state=12345)
#features_valid, features_test, target_valid, target_test = train_test_split(
#    features_valid1, target_valid1, test_size=0.5, random_state=12345)

In [32]:
#print('Размер обучающей выборки', features_train.shape[0], f'{  features_train.shape[0]/df.shape[0]:.1%}')
#print('Размер  валидационной выборки', features_valid.shape[0], f'{  features_valid.shape[0]/df.shape[0]:.1%}')
#print('Размер тестовой выборки', features_test.shape[0], f'{  features_test.shape[0]/df.shape[0]:.1%}')

на данном этапе:\
Удалили столбцы, которые не влияют на потерю клиента(фамилию, номер идентификатора)\
Пропуски в столбце с временем сотрудничества заменили на 0, считая пропуски за новых клиентов.\
обнаружили преобладание в целевом признаке значений "0", в 4 раза.\
применили кодирование и масштабирование к признакам
    
    

## Исследование задачи

Посторим различные модели прогноза, сравнивая метрику F1

### Дерево решений

In [33]:
for i in range(1,15):
    model = DecisionTreeClassifier(random_state=12345, max_depth = i)
    model.fit(features_train_ohe, target_train)
    predictions_valid = model.predict(features_valid_ohe)
    probabilities_valid = model.predict_proba(features_valid_ohe)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
    print("max_depth =", i, ": ", end='')
    print(f1_score(target_valid, predictions_valid), auc_roc)

max_depth = 1 : 0.0 0.6925565119556736
max_depth = 2 : 0.5217391304347825 0.7501814673449512
max_depth = 3 : 0.4234875444839857 0.7973440741838507
max_depth = 4 : 0.5528700906344411 0.813428129858032
max_depth = 5 : 0.5406249999999999 0.8221680508592478
max_depth = 6 : 0.5696969696969697 0.8164631712023421
max_depth = 7 : 0.5446153846153846 0.8166264918127983
max_depth = 8 : 0.5440729483282676 0.8097072931725936
max_depth = 9 : 0.5746164574616458 0.7894116828676679
max_depth = 10 : 0.5413744740532961 0.7758220470726293
max_depth = 11 : 0.5194109772423025 0.7307387535612966
max_depth = 12 : 0.5145118733509234 0.7173895619983184
max_depth = 13 : 0.5257731958762887 0.6992745842885573
max_depth = 14 : 0.4852374839537869 0.6814030752666058


Дерево решений без балансировки дало максимальное значение F1 0,55

### Случайный лес

In [34]:
best_est = 0
best_depth = 0
best_result = 0
best_roc_auc = 0
for est in range(10, 151, 10):
    for depth in range (1, 15):
        model = RandomForestClassifier(random_state=12345,
                                       n_estimators=est, 
                                       max_depth=depth)
        model.fit(features_train_ohe, target_train)
        predictions_valid = model.predict(features_valid_ohe)
        result = f1_score(target_valid, predictions_valid)
        probabilities_valid = model.predict_proba(features_valid_ohe)
        probabilities_one_valid = probabilities_valid[:, 1]
        auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
        if result > best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth
            best_roc_auc = auc_roc
print("Метрики качества", best_result, "Количество деревьев:", best_est, "Максимальная глубина:", best_depth)
print(best_roc_auc)

Метрики качества 0.5975975975975975 Количество деревьев: 150 Максимальная глубина: 14
0.8433120210018207


Случайный лес без балансировки дал максимальное значение F1 0,597, (параметры: Количество деревьев: 150 Максимальная глубина: 14) 

### Логистическая регрессия

In [35]:
model = LogisticRegression(random_state=12345, solver='liblinear') 
model.fit(features_train_ohe, target_train) 
predictions_valid = model.predict(features_valid_ohe)
probabilities_valid = model.predict_proba(features_valid_ohe)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(f1_score(target_valid, predictions_valid))
print(auc_roc)

0.33670033670033667
0.7585728198210733


Логистическая регрессия без балансировки дало значение F1 0,33.

### Общий вывод:

Наилучшее значение показателя F1 у случайного леса (параметры: Количество деревьев: 150 Максимальная глубина: 14), но они не удовлетворительные, требуется балансировка. 

## Борьба с дисбалансом

### Модели со взвешанным классом

In [36]:
best_est = 0
best_depth = 0
best_result = 0
best_roc_auc = 0
for est in range(10, 151, 10):
    for depth in range (1, 15):
        model = RandomForestClassifier(random_state=12345,
                                       n_estimators=est, 
                                       max_depth=depth, class_weight ='balanced')
        model.fit(features_train_ohe, target_train)
        predictions_valid = model.predict(features_valid_ohe)
        result = f1_score(target_valid, predictions_valid)
        probabilities_valid = model.predict_proba(features_valid_ohe)
        probabilities_one_valid = probabilities_valid[:, 1]
        auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
        if result > best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth
            best_roc_auc = auc_roc
print("Метрики качества", best_result, "Количество деревьев:", best_est, "Максимальная глубина:", best_depth)
print(best_roc_auc)

Метрики качества 0.6374589266155531 Количество деревьев: 30 Максимальная глубина: 8
0.8530492562863312


In [37]:
model = LogisticRegression(random_state=12345, solver='liblinear', class_weight ='balanced') 
model.fit(features_train_ohe, target_train) 
predictions_valid = model.predict(features_valid_ohe)
probabilities_valid = model.predict_proba(features_valid_ohe)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(f1_score(target_valid, predictions_valid))
print(auc_roc)

0.4888888888888888
0.7634376568936421


In [38]:
for i in range(1,15):
    model = DecisionTreeClassifier(random_state=12345, max_depth = i, class_weight ='balanced')
    model.fit(features_train_ohe, target_train)
    predictions_valid = model.predict(features_valid_ohe)
    probabilities_valid = model.predict_proba(features_valid_ohe)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
    print("max_depth =", i, ": ", end='')
    print(f1_score(target_valid, predictions_valid), auc_roc)

max_depth = 1 : 0.4994903160040775 0.6925565119556736
max_depth = 2 : 0.541015625 0.7501814673449512
max_depth = 3 : 0.541015625 0.7980472601455368
max_depth = 4 : 0.5277777777777778 0.8190853743368881
max_depth = 5 : 0.5963791267305644 0.8310244134068074
max_depth = 6 : 0.5581835383159887 0.7999473744699641
max_depth = 7 : 0.5565565565565564 0.7941313460642758
max_depth = 8 : 0.54296875 0.7789100163925502
max_depth = 9 : 0.5320197044334976 0.7620275044005831
max_depth = 10 : 0.5188770571151984 0.7491176150351744
max_depth = 11 : 0.5281473899692939 0.7417878465270176
max_depth = 12 : 0.5015416238437821 0.7192465778283199
max_depth = 13 : 0.5065217391304349 0.7094450728591389
max_depth = 14 : 0.48437500000000006 0.6883479817806786


Наибольшее значение F1 у случайного леса (0.637)с параметрами = Количество деревьев: 30 Максимальная глубина: 8

### Увеличение числа меньших классов

In [39]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

In [40]:
features_upsampled, target_upsampled = upsample(features_train_ohe, target_train, 40)

In [41]:
best_est = 0
best_depth = 0
best_result = 0
best_roc_auc = 0
for est in range(10, 151, 10):
    for depth in range (1, 15):
        model = RandomForestClassifier(random_state=12345,
                                       n_estimators=est, 
                                       max_depth=depth)
        model.fit(features_upsampled, target_upsampled)
        predictions_valid = model.predict(features_valid_ohe)
        result = f1_score(target_valid, predictions_valid)
        probabilities_valid = model.predict_proba(features_valid_ohe)
        probabilities_one_valid = probabilities_valid[:, 1]
        auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
        if result > best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth
            best_roc_auc = auc_roc
print("Метрики качества", best_result, "Количество деревьев:", best_est, "Максимальная глубина:", best_depth)
print(best_roc_auc)

Метрики качества 0.554954954954955 Количество деревьев: 110 Максимальная глубина: 14
0.8266260986335511


In [42]:
for i in range(1,15):
    model = DecisionTreeClassifier(random_state=12345, max_depth = i)
    model.fit(features_upsampled, target_upsampled)
    predictions_valid = model.predict(features_valid_ohe)
    probabilities_valid = model.predict_proba(features_valid_ohe)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
    print("max_depth =", i, ": ", end='')
    print(f1_score(target_valid, predictions_valid), auc_roc)

max_depth = 1 : 0.3457402812241522 0.6898254284141567
max_depth = 2 : 0.3457402812241522 0.7345374699822766
max_depth = 3 : 0.34876929495202336 0.7820350352953986
max_depth = 4 : 0.39922103213242455 0.7984873184570437
max_depth = 5 : 0.3902439024390244 0.8191246922616274
max_depth = 6 : 0.406850025947068 0.8102615246886322
max_depth = 7 : 0.40855614973262033 0.8065573225098144
max_depth = 8 : 0.4236111111111111 0.795277614793218
max_depth = 9 : 0.42951409718056394 0.7883070004052771
max_depth = 10 : 0.4323308270676692 0.777175491020391
max_depth = 11 : 0.43087248322147653 0.7526758872240941
max_depth = 12 : 0.4334277620396601 0.7428607722040419
max_depth = 13 : 0.44324324324324327 0.7261385563667819
max_depth = 14 : 0.4486133768352366 0.7196722699750181


In [43]:
model = LogisticRegression(random_state=12345, solver='liblinear') 
model.fit(features_upsampled, target_upsampled)
predictions_valid = model.predict(features_valid_ohe)
f1 = f1_score(target_valid, predictions_valid)
probabilities_valid = model.predict_proba(features_valid_ohe)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(f1, auc_roc)

0.3522966708807417 0.765489750119466


При увеличении числа меньших классов наибольшее значение у случайного леса(0,55) но наблюдается ухудшение по всем моделям 

### Уменьшение числа больших классов

In [44]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

In [45]:
features_downsampled, target_downsampled = downsample(features_train_ohe, target_train, 0.25)

In [46]:
best_est = 0
best_depth = 0
best_result = 0
best_roc_auc = 0
for est in range(10, 150, 10):
    for depth in range (1, 15):
        model = RandomForestClassifier(random_state=12345,
                                       n_estimators=est, 
                                       max_depth=depth)
        model.fit(features_downsampled, target_downsampled)
        predictions_valid = model.predict(features_valid_ohe)
        result = f1_score(target_valid, predictions_valid)
        probabilities_valid = model.predict_proba(features_valid_ohe)
        probabilities_one_valid = probabilities_valid[:, 1]
        auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
        if result > best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth
            best_roc_auc = auc_roc
print("Метрики качества", best_result, "Количество деревьев:", best_est, "Максимальная глубина:", best_depth)
print(best_roc_auc)

Метрики качества 0.603448275862069 Количество деревьев: 140 Максимальная глубина: 8
0.8520043068249867


In [47]:
for i in range(1,15):
    model = DecisionTreeClassifier(random_state=12345, max_depth = i)
    model.fit(features_downsampled, target_downsampled)
    predictions_valid = model.predict(features_valid_ohe)
    probabilities_valid = model.predict_proba(features_valid_ohe)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
    print("max_depth =", i, ": ", end='')
    print(f1_score(target_valid, predictions_valid), auc_roc)

max_depth = 1 : 0.5061845861084681 0.7021894035168371
max_depth = 2 : 0.5394495412844036 0.7578892928217568
max_depth = 3 : 0.5555555555555556 0.8027820456208905
max_depth = 4 : 0.5357737104825291 0.8180268148246723
max_depth = 5 : 0.5953109072375127 0.8240212861195628
max_depth = 6 : 0.5741265344664779 0.8164707323417152
max_depth = 7 : 0.5282005371530886 0.7930085168673897
max_depth = 8 : 0.5302325581395348 0.7766991392398938
max_depth = 9 : 0.5380333951762523 0.7695296668864438
max_depth = 10 : 0.5257249766136577 0.7573138901154738
max_depth = 11 : 0.5094170403587445 0.736717346463504
max_depth = 12 : 0.47795414462081126 0.6997660583478004
max_depth = 13 : 0.4982876712328767 0.7064576666928787
max_depth = 14 : 0.48726655348047543 0.7049446827043473


In [48]:
model = LogisticRegression(random_state=12345, solver='liblinear') 
model.fit(features_downsampled, target_downsampled)
predictions_valid = model.predict(features_valid_ohe)
f1_score(target_valid, predictions_valid)
probabilities_valid = model.predict_proba(features_valid_ohe)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(f1, auc_roc)

0.3522966708807417 0.7625272957131364


Наибольшее значение метрики F1 (0,634) у случайного леса без взвешивания классов с параметрами = Количество деревьев: 50 Максимальная глубина: 7 при уменьшении больших классов.

### Изменение порога

In [49]:
#probabilities_valid = model.predict_proba(features_valid)
#probabilities_one_valid = probabilities_valid[:, 1]
#
#for threshold in np.arange(0.3, 1, 0.02):
#    predicted_valid = probabilities_one_valid > threshold
#    F1 = f1_score(target_valid,predicted_valid)
#    print("Порог = {:.2f} | F1 = {:.3f}".format(
#        threshold, F1))

Наибольший результат у порога 0,5

### Общий вывод:

In [50]:
col = ['модель', 'F1', 'auc_roc']
data = [
    ["дерево решений", 0.57, 0.78],
    ["случайный лес", 0.59, 0.84],
    ["регрессия", 0.33, 0.75],
    ["дерево решений баланс", 0.59, 0.83],
    ["случайный лес баланс", 0.637, 0.85],
    ["регрессия баланс", 0.48, 0.76],
    ["дерево решений увеличение",  0.44, 0.72],
    ["случайный лес баланс увеличение", 0.55, 0.82],
    ["регрессия баланс увеличение", 0.35, 0.76],       
    ["дерево решений уменьшение", 0.59, 0.82],
    ["случайный лес уменьшение" , 0.60, 0.85],
    ["регрессия уменьшение", 0.35, 0.76]        
    ]
tab = pd.DataFrame(data = data, columns = col)

In [51]:
tab

Unnamed: 0,модель,F1,auc_roc
0,дерево решений,0.57,0.78
1,случайный лес,0.59,0.84
2,регрессия,0.33,0.75
3,дерево решений баланс,0.59,0.83
4,случайный лес баланс,0.637,0.85
5,регрессия баланс,0.48,0.76
6,дерево решений увеличение,0.44,0.72
7,случайный лес баланс увеличение,0.55,0.82
8,регрессия баланс увеличение,0.35,0.76
9,дерево решений уменьшение,0.59,0.82


Наибольшее значение метрики F1 (0,637) у случайного леса c взвешиванием классов с параметрами = Количество деревьев: 30 Максимальная глубина: 8

In [52]:
model = RandomForestClassifier(random_state=12345,
                                       n_estimators=30, 
                                       max_depth=8, class_weight ='balanced')
model.fit(features_train_ohe, target_train)
predictions_valid = model.predict(features_valid_ohe)
F1 = f1_score(target_valid, predictions_valid)
probabilities_valid = model.predict_proba(features_valid_ohe)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print('для лучшей модели','F1 = ', F1, 'auc_roc = ',  auc_roc)


для лучшей модели F1 =  0.6374589266155531 auc_roc =  0.8530492562863312


## Тестирование модели

In [53]:
predictions_test = model.predict(features_test_ohe)
F1_t = f1_score(target_test, predictions_test)
probabilities_test = model.predict_proba(features_test_ohe)
probabilities_one_test = probabilities_test[:, 1]
auc_roc_t = roc_auc_score(target_test, probabilities_one_test)
print('для лучшей модели на тестовой выборке','F1 = ', F1_t, 'auc_roc = ',  auc_roc_t)

для лучшей модели на тестовой выборке F1 =  0.5971459934138309 auc_roc =  0.8459759156071842


## Общий вывод

В ходе исследования была подобрана модель случайного леса с балансировкой классов с параметрами:Количество деревьев: 30 Максимальная глубина: 8