<span style="color:Green">
    
Additional task:

- Plot the ROC curve for the best model and see what it looks like;
- Plot the PR curve and compare its value and shape with the ROC curve. How do they relate, if at all?

</span>
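
As a rough illustration of the additional task, the sketch below draws a ROC curve and a PR curve side by side. It runs on synthetic data from `make_classification` with about 20% positives, which is only an assumed stand-in for the churn dataset; the names `X_demo`, `y_demo` and `demo_model` are hypothetical and are not part of this notebook.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: roughly 20% positives (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

demo_model = RandomForestClassifier(n_estimators=100, random_state=42)
demo_model.fit(X_tr, y_tr)

# Both curves are built from scores (probability of class 1), not from 0/1 predictions.
proba = demo_model.predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, proba)
precision, recall, _ = precision_recall_curve(y_te, proba)
roc_auc = roc_auc_score(y_te, proba)
avg_precision = average_precision_score(y_te, proba)

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))

ax_roc.plot(fpr, tpr, label=f"AUC-ROC = {roc_auc:.2f}")
ax_roc.plot([0, 1], [0, 1], linestyle="--", label="random model")
ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC curve")
ax_roc.legend()

ax_pr.plot(recall, precision, label=f"average precision = {avg_precision:.2f}")
# For a random model, precision stays near the share of positives at every recall level.
ax_pr.axhline(y_te.mean(), linestyle="--", label="random model")
ax_pr.set(xlabel="Recall", ylabel="Precision", title="PR curve")
ax_pr.legend()
```

The two plots share one axis: recall is the same quantity as the true positive rate. The ROC baseline is always the diagonal, while the PR baseline depends on the class balance, so the PR curve tends to be the stricter view when positives are rare.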

# Customer churn

Customers have started leaving Beta-Bank. Every month. Not many, but noticeably. The bank's marketers have calculated that retaining existing customers is cheaper than acquiring new ones.

We need to predict whether a customer will leave the bank in the near future. You are given historical data on customer behavior and contract terminations.

Build a model with the highest possible *F1* score. To pass the project, you need to bring the metric to at least 0.59. Check the *F1* score on the test set yourself.

Additionally, measure *AUC-ROC* and compare its value with the *F1* score.
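
To make the target metric concrete, here is a small worked example of how *F1* follows from precision and recall. The labels are invented for illustration and have nothing to do with the bank data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels invented for illustration: 4 churners among 10 clients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# TP = 2, FP = 1, FN = 2
toy_precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
toy_recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)
f1_library = f1_score(y_true, y_pred)

print(f1_manual, f1_library)
```

Unlike *AUC-ROC*, which is computed from predicted probabilities over all thresholds, *F1* depends on the hard 0/1 predictions at one particular threshold, so the two metrics can move differently.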

Data source: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling) * _The dataset in this repository has been slightly modified!_

---

Import all the required libraries, models, and functions:

In [218]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.utils import shuffle

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

In [199]:
# Disable pandas' chained-assignment warning so that scaling the numeric features does not trigger it
pd.options.mode.chained_assignment = None

## 1. Data preparation

Load the dataset file and inspect it:

In [200]:
df = pd.read_csv('Churn.csv')

In [201]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [202]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [203]:
df.columns = [col.lower() for col in df.columns]
df

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1


In [204]:
df.isna().sum()

rownumber            0
customerid           0
surname              0
creditscore          0
geography            0
gender               0
age                  0
tenure             909
balance              0
numofproducts        0
hascrcard            0
isactivemember       0
estimatedsalary      0
exited               0
dtype: int64

In [205]:
df = pd.concat([df, pd.get_dummies(df[['gender', 'geography']])], axis=1)
df.drop(columns=['gender', 'geography'], inplace=True)
df

Unnamed: 0,rownumber,customerid,surname,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,gender_Female,gender_Male,geography_France,geography_Germany,geography_Spain
0,1,15634602,Hargrave,619,42,2.0,0.00,1,1,1,101348.88,1,1,0,1,0,0
1,2,15647311,Hill,608,41,1.0,83807.86,1,0,1,112542.58,0,1,0,0,0,1
2,3,15619304,Onio,502,42,8.0,159660.80,3,1,0,113931.57,1,1,0,1,0,0
3,4,15701354,Boni,699,39,1.0,0.00,2,0,0,93826.63,0,1,0,1,0,0
4,5,15737888,Mitchell,850,43,2.0,125510.82,1,1,1,79084.10,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,39,5.0,0.00,2,1,0,96270.64,0,0,1,1,0,0
9996,9997,15569892,Johnstone,516,35,10.0,57369.61,1,1,1,101699.77,0,0,1,1,0,0
9997,9998,15584532,Liu,709,36,7.0,0.00,1,0,1,42085.58,1,1,0,1,0,0
9998,9999,15682355,Sabbatini,772,42,3.0,75075.31,2,1,0,92888.52,1,0,1,0,1,0
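
The full one-hot encoding above keeps a redundant column per feature: `gender_Female` is always `1 - gender_Male`. Tree-based models are not bothered by this, but for a linear model such as logistic regression it introduces perfectly collinear features. A possible alternative, shown here on a toy frame with made-up rows (only the column names follow the notebook), is `drop_first=True`:

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names follow the notebook.
toy = pd.DataFrame({'gender': ['Female', 'Male', 'Female'],
                    'geography': ['France', 'Spain', 'Germany']})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each feature dropped

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category becomes the implicit baseline: a row with `geography_Germany == 0` and `geography_Spain == 0` is a customer from France.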


In [206]:
df.drop(columns=['customerid', 'surname', 'rownumber'], inplace=True)
df

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,gender_Female,gender_Male,geography_France,geography_Germany,geography_Spain
0,619,42,2.0,0.00,1,1,1,101348.88,1,1,0,1,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,1,0,0,0,1
2,502,42,8.0,159660.80,3,1,0,113931.57,1,1,0,1,0,0
3,699,39,1.0,0.00,2,0,0,93826.63,0,1,0,1,0,0
4,850,43,2.0,125510.82,1,1,1,79084.10,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,39,5.0,0.00,2,1,0,96270.64,0,0,1,1,0,0
9996,516,35,10.0,57369.61,1,1,1,101699.77,0,0,1,1,0,0
9997,709,36,7.0,0.00,1,0,1,42085.58,1,1,0,1,0,0
9998,772,42,3.0,75075.31,2,1,0,92888.52,1,0,1,0,1,0


In [207]:
df.tenure.value_counts()

1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: tenure, dtype: int64

Fill the missing `tenure` values with the column mean:

In [208]:
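# Assign the result back instead of calling fillna(inplace=True) on a column accessor,
# which newer pandas versions warn about
df['tenure'] = df['tenure'].fillna(df['tenure'].mean())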
df.count()

creditscore          10000
age                  10000
tenure               10000
balance              10000
numofproducts        10000
hascrcard            10000
isactivemember       10000
estimatedsalary      10000
exited               10000
gender_Female        10000
gender_Male          10000
geography_France     10000
geography_Germany    10000
geography_Spain      10000
dtype: int64
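
One caveat: the mean above is computed over all rows, including those that later end up in the test set, so a little test information leaks into training. A minimal sketch of a stricter variant is to split first and learn the fill value from the training rows only, for example with `SimpleImputer`. The toy frame and the names `toy_train`, `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data with gaps in `tenure` (values invented for illustration).
toy = pd.DataFrame({'tenure': [1.0, np.nan, 3.0, 5.0, np.nan, 7.0, 2.0, 8.0],
                    'age': [42, 41, 39, 43, 35, 36, 50, 29]})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy='mean')
train_filled = imputer.fit_transform(toy_train)  # the mean is learned from training rows only
test_filled = imputer.transform(toy_test)        # and reused unchanged for the test rows

print(imputer.statistics_)  # per-column fill values learned from the training part
```

With `strategy='median'` the same code would fill the gaps with a whole number of years, which may suit an integer-valued column like `tenure` better than a fractional mean.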

## 2. Examine the class balance and train a model without accounting for the imbalance. Briefly describe the conclusions.

In [209]:
df.exited.value_counts() / df.shape[0]

0    0.7963
1    0.2037
Name: exited, dtype: float64
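
With roughly a 4:1 class ratio, accuracy is a misleading yardstick. The sketch below rebuilds a label vector from the shares printed above (7963 zeros and 2037 ones out of 10000) and scores a hypothetical constant model that predicts "stays" for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Labels rebuilt from the shares printed above: 7963 zeros and 2037 ones.
y_all = np.array([0] * 7963 + [1] * 2037)

# Hypothetical constant model: nobody ever leaves.
always_stay = np.zeros_like(y_all)

baseline_accuracy = accuracy_score(y_all, always_stay)
# zero_division=0 silences the warning raised when no positives are predicted.
baseline_f1 = f1_score(y_all, always_stay, zero_division=0)

print(baseline_accuracy, baseline_f1)
```

The constant model is right about 80% of the time yet finds no churners at all, which is why the project is judged on *F1* rather than accuracy.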

In [219]:
np.random.seed(42)

In [220]:
train, test = train_test_split(df, test_size=0.2)

In [221]:
X_train = train.drop(columns=['exited'])
y_train = train['exited']
X_test = test.drop(columns=['exited'])
y_test = test['exited']

In [222]:
X_train.isna().sum()

creditscore          0
age                  0
tenure               0
balance              0
numofproducts        0
hascrcard            0
isactivemember       0
estimatedsalary      0
gender_Female        0
gender_Male          0
geography_France     0
geography_Germany    0
geography_Spain      0
dtype: int64

In [223]:
tree = RandomForestClassifier()
parameters = {'criterion': ["gini", "entropy"],
              'max_depth': range(2, 15, 2),
              'n_estimators' : range(10, 200, 20)}
tree = GridSearchCV(tree, parameters, n_jobs=-1, scoring='f1')
tree.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(2, 15, 2),
                         'n_estimators': range(10, 200, 20)},
             scoring='f1')

In [224]:
tree.best_params_

{'criterion': 'gini', 'max_depth': 14, 'n_estimators': 130}

In [225]:
tree.best_score_

0.5801959617398118

In [226]:
f1_score(y_test, tree.predict(X_test))

0.5700787401574803
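
The task also asks for *AUC-ROC* next to *F1*. A fitted `GridSearchCV` object exposes `predict_proba` of its best estimator, so both metrics can be taken from the same search. The sketch below shows the pattern on synthetic data; `search` and the small parameter grid are assumptions chosen to keep the example fast, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the churn data (assumption): about 20% positives.
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                      {'max_depth': [4, 8]}, scoring='f1', cv=3)
search.fit(X_tr, y_tr)

# F1 uses hard predictions; AUC-ROC uses the predicted probability of class 1.
test_f1 = f1_score(y_te, search.predict(X_te))
test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])

print(test_f1, test_auc)
```

In the notebook the same two lines would apply to `tree`, `X_test` and `y_test`. Passing `random_state` to the estimator, as done here, also makes the search repeatable between runs.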

In [227]:
tree1 = DecisionTreeClassifier()
parameters = {'criterion': ["gini", "entropy"],
              'max_depth': range(5, 25, 1)}
tree1 = GridSearchCV(tree1, parameters, n_jobs=-1, scoring='f1')
tree1.fit(X_train, y_train)

GridSearchCV(estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(5, 25)},
             scoring='f1')

In [228]:
tree1.best_params_

{'criterion': 'entropy', 'max_depth': 8}

In [229]:
tree1.best_score_

0.5645116162271042

In [230]:
f1_score(y_test, tree1.predict(X_test))

0.5641025641025641

In [231]:
logreg = LogisticRegression()
parameters = {'C': [0.1, 0.5, 1.0, 5, 10],
              'max_iter': range(50, 550, 100),
              'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
logreg = GridSearchCV(logreg, parameters, n_jobs=-1, scoring='f1')
logreg.fit(X_train, y_train)





GridSearchCV(estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': [0.1, 0.5, 1.0, 5, 10],
                         'max_iter': range(50, 550, 100),
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                    'saga']},
             scoring='f1')

In [232]:
logreg.best_params_

{'C': 10, 'max_iter': 250, 'solver': 'newton-cg'}

In [233]:
logreg.best_score_

0.32133486123335714

In [234]:
f1_score(y_test, logreg.predict(X_test))

0.2985074626865672

The best algorithm is RandomForest.  
Its F1 score on the test set is 0.57, below the required 0.59.
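
Part of the gap between logistic regression and the tree models may come from feature scale: in the cells above the features reach `LogisticRegression` unscaled, while columns such as `balance` and `estimatedsalary` are several orders of magnitude larger than the binary flags. One common remedy, sketched here on synthetic data with one deliberately inflated column, is to put `StandardScaler` in front of the model in a pipeline. Whether this helps on the churn data is not verified here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption): about 20% positives.
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
# Inflate one column to mimic large-valued features such as balance or salary.
X_demo[:, 0] *= 100_000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

# The scaler is fitted on the training part only and reapplied at predict time.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

scaled_f1 = f1_score(y_te, scaled_logreg.predict(X_te))
print(scaled_f1)
```

A pipeline can be passed to `GridSearchCV` like any estimator; its parameters are then addressed with a step prefix, for example `logisticregression__C`.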

## 3. Improve the model quality by accounting for the class imbalance. Train different models and find the best one. Briefly describe the conclusions.

In [248]:
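# Note: here `train` holds the feature matrix and `test` the target column, not the two data splits
train, test = df.drop(columns=['exited']), df.exited
# Stratify on the target so that both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(train, test, test_size=0.2, stratify=test)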

In [250]:
y_train.value_counts() / df.shape[0]

0    0.637
1    0.163
Name: exited, dtype: float64
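
Note that the shares above are fractions of the whole dataset (they add up to 0.8, the size of the training split), because the counts are divided by `df.shape[0]`. To see the class balance inside the training set itself, `value_counts(normalize=True)` can be used instead. The sketch rebuilds the training labels from the output above (6370 zeros and 1630 ones); `y_train_demo` is a hypothetical name.

```python
import pandas as pd

# Counts rebuilt from the output above: 0.637 and 0.163 of 10000 rows.
y_train_demo = pd.Series([0] * 6370 + [1] * 1630, name='exited')

share_of_full = y_train_demo.value_counts() / 10000               # what the cell above prints
share_within_train = y_train_demo.value_counts(normalize=True)    # balance inside the split

print(share_of_full)
print(share_within_train)
```

Inside the training split the classes come out at about 0.796 and 0.204, matching the full dataset, which is the effect `stratify` is meant to have.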