<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course. Session 2

Author: Yury Kashnitsky, research programmer at Mail.ru Group and senior lecturer at the HSE Faculty of Computer Science. This material is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. It may be used for any purpose (editing, correcting, and building upon it) except commercial ones, provided the author is credited.

# <center> Topic 5. Ensembles of algorithms, random forest
## <center>Practice. Decision trees and random forest in a Kaggle Inclass credit scoring competition

There is no web form for answers here; use the [competition](https://inclass.kaggle.com/c/beeline-credit-scoring-competition-2) leaderboard as your guide. Use this [link](https://www.kaggle.com/t/115237dd8c5e4092a219a0c12bf66fc6) to join.

The task is credit scoring.

Bank client features:
- Age - age (real-valued)
- Income - monthly income (real-valued)
- BalanceToCreditLimit - ratio of the credit card balance to the credit limit (real-valued)
- DIR - Debt-to-income Ratio (real-valued)
- NumLoans - number of loans and credit lines
- NumRealEstateLoans - number of mortgages and real-estate-related loans (natural number)
- NumDependents - number of family members the client supports, excluding the client (natural number)
- Num30-59Delinquencies - number of loan payments 30 to 59 days past due (natural number)
- Num60-89Delinquencies - number of loan payments 60 to 89 days past due (natural number)
- Delinquent90 - whether the client had loan payments more than 90 days past due (binary); present only in the training set

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline

**Load the data.**

In [2]:
train_df = pd.read_csv('../../data/credit_scoring_train.csv', index_col='client_id')
test_df = pd.read_csv('../../data/credit_scoring_test.csv', index_col='client_id')

In [3]:
y = train_df['Delinquent90']
train_df.drop('Delinquent90', axis=1, inplace=True)

In [4]:
train_df.head()

client_id,DIR,Age,NumLoans,NumRealEstateLoans,NumDependents,Num30-59Delinquencies,Num60-89Delinquencies,Income,BalanceToCreditLimit
0,0.496289,49.1,13,0,0.0,2,0,5298.360639,0.387028
1,0.433567,48.0,9,2,2.0,1,0,6008.056256,0.234679
2,2206.731199,55.5,21,1,,1,0,,0.348227
3,886.132793,55.3,3,0,0.0,0,0,,0.97193
4,0.0,52.3,1,0,0.0,0,0,2504.613105,1.00435


**Let's look at the number of missing values in each feature.**

In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 75000 entries, 0 to 74999
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   DIR                    75000 non-null  float64
 1   Age                    75000 non-null  float64
 2   NumLoans               75000 non-null  int64  
 3   NumRealEstateLoans     75000 non-null  int64  
 4   NumDependents          73084 non-null  float64
 5   Num30-59Delinquencies  75000 non-null  int64  
 6   Num60-89Delinquencies  75000 non-null  int64  
 7   Income                 60153 non-null  float64
 8   BalanceToCreditLimit   75000 non-null  float64
dtypes: float64(5), int64(4)
memory usage: 5.7 MB


In [6]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 75000 entries, 75000 to 149999
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   DIR                    75000 non-null  float64
 1   Age                    75000 non-null  float64
 2   NumLoans               75000 non-null  int64  
 3   NumRealEstateLoans     75000 non-null  int64  
 4   NumDependents          72992 non-null  float64
 5   Num30-59Delinquencies  75000 non-null  int64  
 6   Num60-89Delinquencies  75000 non-null  int64  
 7   Income                 60116 non-null  float64
 8   BalanceToCreditLimit   75000 non-null  float64
dtypes: float64(5), int64(4)
memory usage: 5.7 MB
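The `info()` output above shows non-null counts, so the number of missing values has to be derived by subtraction. A minimal sketch of counting them directly is given below; `toy_df` and its values are hypothetical stand-ins for `train_df`, not the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_df = pd.DataFrame({
    "Income": [5298.4, np.nan, 2504.6, np.nan],
    "NumDependents": [0.0, 2.0, np.nan, 0.0],
    "Age": [49.1, 48.0, 55.5, 52.3],
})

# isna() marks missing cells; sum() counts them per column, mean() gives the share
missing_counts = toy_df.isna().sum()
missing_share = toy_df.isna().mean()

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
# Show only the columns that actually contain missing values
print(summary[summary["missing"] > 0])
```

Applied to the real frames, the same two calls would presumably report missing values only in `NumDependents` and `Income`, matching the non-null counts above.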


**Replace the missing values with medians.**

In [7]:
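# Assign the result back: calling fillna(inplace=True) on a selected column
# may not modify the parent frame in recent pandas versions
train_df['NumDependents'] = train_df['NumDependents'].fillna(train_df['NumDependents'].median())
train_df['Income'] = train_df['Income'].fillna(train_df['Income'].median())
test_df['NumDependents'] = test_df['NumDependents'].fillna(test_df['NumDependents'].median())
test_df['Income'] = test_df['Income'].fillna(test_df['Income'].median())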
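The cell above fills each set with its own median. A common alternative, shown here only as a sketch and not as the notebook's method, is to learn the medians on the training set and reuse them for the test set with scikit-learn's `SimpleImputer`. The frames `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for train_df / test_df
toy_train = pd.DataFrame({"Income": [1000.0, np.nan, 3000.0, 5000.0],
                          "NumDependents": [0.0, 1.0, np.nan, 3.0]})
toy_test = pd.DataFrame({"Income": [np.nan, 2000.0],
                         "NumDependents": [2.0, np.nan]})

# Learn per-column medians on the training data only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(toy_train),
                            columns=toy_train.columns, index=toy_train.index)
# Reuse the training medians for the test data
test_filled = pd.DataFrame(imputer.transform(toy_test),
                           columns=toy_test.columns, index=toy_test.index)

print(test_filled)
```

Reusing the training statistics keeps the preprocessing identical for both sets, which matters more when the two distributions differ.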

### Decision tree without parameter tuning

**Train a decision tree with a maximum depth of 3; use `random_state=17` for reproducibility.**

In [8]:
first_tree = DecisionTreeClassifier(max_depth=3, random_state=17)
first_tree.fit(train_df, y)  # Your code here

DecisionTreeClassifier(max_depth=3, random_state=17)

**Make a prediction for the test set.**

In [9]:
first_tree_pred = first_tree.predict(test_df)  # Your code here

**Write the prediction to a file.**

In [10]:
def write_to_submission_file(predicted_labels, out_file,
                             target='Delinquent90', index_label="client_id"):
    # turn predictions into data frame and save as csv file
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(75000, 
                                                  predicted_labels.shape[0] + 75000),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [11]:
write_to_submission_file(first_tree_pred, 'credit_scoring_first_tree.csv')
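The helper above hard-codes the starting id 75000. As a hedged variation, the index could be taken from the test frame instead, so the ids in the file always follow the data. The function `predictions_to_frame` and the toy values below are hypothetical, and the CSV is written to an in-memory buffer only for demonstration.

```python
import io

import numpy as np
import pandas as pd


def predictions_to_frame(predicted, index, target="Delinquent90",
                         index_label="client_id"):
    """Build a submission frame whose index comes from the caller (hypothetical helper)."""
    return pd.DataFrame({target: predicted},
                        index=pd.Index(index, name=index_label))


# Toy stand-ins for test_df.index and a vector of predictions
toy_index = np.arange(75000, 75003)
toy_pred = np.array([0.1, 0.8, 0.05])

frame = predictions_to_frame(toy_pred, toy_index)

# Write to an in-memory buffer instead of a file for this demonstration
buffer = io.StringIO()
frame.to_csv(buffer)
print(buffer.getvalue())
```

With the real data one would presumably pass `test_df.index` as `index` and a file path to `to_csv`.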

**If we predict default probabilities for the test-set clients instead of class labels, the result will be much better.**

In [12]:
first_tree_pred_probs = first_tree.predict_proba(test_df)[:, 1]

In [13]:
write_to_submission_file(first_tree_pred_probs, 'credit_scoring_second_tree.csv')
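The claim about probabilities makes sense if the leaderboard uses a ranking metric such as ROC AUC. That is an assumption: the notebook imports `roc_auc_score` but never states the metric. Hard labels collapse all clients into two groups, while probabilities keep a finer ordering. The sketch below compares the two on a synthetic, imbalanced dataset (not the competition data) with a held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the credit scoring set
X, y_toy = make_classification(n_samples=2000, n_features=9, n_informative=5,
                               weights=[0.93, 0.07], random_state=17)
X_tr, X_val, y_tr, y_val = train_test_split(X, y_toy, test_size=0.3,
                                            stratify=y_toy, random_state=17)

tree = DecisionTreeClassifier(max_depth=3, random_state=17).fit(X_tr, y_tr)

# ROC AUC computed from hard 0/1 labels: only one threshold is represented
auc_labels = roc_auc_score(y_val, tree.predict(X_val))
# ROC AUC computed from the predicted probability of the positive class
auc_probs = roc_auc_score(y_val, tree.predict_proba(X_val)[:, 1])

print(f"ROC AUC from labels:        {auc_labels:.3f}")
print(f"ROC AUC from probabilities: {auc_probs:.3f}")
```

On imbalanced data a shallow tree often predicts the majority class almost everywhere, so the label-based score tends to sit near 0.5, although the exact numbers depend on the data.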

### Decision tree with parameter tuning via GridSearch

**Tune the tree's parameters with `GridSearchCV`, then look at the best parameter combination and the mean score on 5-fold cross-validation. Use `random_state=17` (for reproducibility) and remember to parallelize (`n_jobs=-1`).**

In [14]:
tree_params = {'max_depth': list(range(3, 8)), 
               'min_samples_leaf': list(range(5, 13))}

locally_best_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)  # Your code here
locally_best_tree.fit(train_df, y)  # Your code here

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [3, 4, 5, 6, 7],
                         'min_samples_leaf': [5, 6, 7, 8, 9, 10, 11, 12]})

In [15]:
locally_best_tree.best_params_, round(locally_best_tree.best_score_, 3)

({'max_depth': 5, 'min_samples_leaf': 11}, 0.935)
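The cell above relies on the `GridSearchCV` defaults, so the reported score is the classifier's default metric (accuracy), and neither `random_state` nor `n_jobs` is set. A sketch of a search that follows the task statement more literally is shown below. It assumes ROC AUC as the target metric and uses synthetic data with a reduced grid, so its numbers are not comparable to the output above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for train_df / y
X_toy, y_toy = make_classification(n_samples=1000, n_features=9,
                                   n_informative=5, random_state=17)

# Reduced illustrative grid (the notebook uses wider ranges)
params = {"max_depth": [3, 5, 7], "min_samples_leaf": [5, 10]}

# Explicit 5-fold stratified split with a fixed seed for reproducibility
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

search = GridSearchCV(DecisionTreeClassifier(random_state=17), params,
                      scoring="roc_auc",  # assumed metric, not stated in the notebook
                      cv=skf, n_jobs=-1)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

Because `refit=True` by default, `search.predict_proba` afterwards uses the best tree retrained on all of the data passed to `fit`.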

**Make a prediction for the test set and submit the solution to Kaggle.**

In [16]:
tuned_tree_pred_probs = locally_best_tree.predict_proba(test_df)[:, 1]  # Your code here

In [17]:
write_to_submission_file(tuned_tree_pred_probs, 'credit_scoring_fourth_tree.csv')  # Your code here

### Random forest without parameter tuning

**Train a random forest of trees with unlimited depth; use `random_state=17` for reproducibility.**

In [18]:
first_forest = RandomForestClassifier(random_state=17)  # Your code here
first_forest.fit(train_df, y)  # Your code here

RandomForestClassifier(random_state=17)

In [22]:
first_forest_pred = first_forest.predict_proba(test_df)[:, 1]  # Your code here

**Make a prediction for the test set and submit the solution to Kaggle.**

In [23]:
write_to_submission_file(first_forest_pred, 'credit_scoring_fifth_forest.csv')  # Your code here

### Random forest with parameter tuning

**Tune the forest's `max_features` parameter with `GridSearchCV`, then look at the best parameter combination and the mean score on 5-fold cross-validation. Use `random_state=17` (for reproducibility) and remember to parallelize (`n_jobs=-1`).**

In [27]:
%%time
forest_params = {'max_features': np.linspace(.3, 1, 7)}

locally_best_forest = GridSearchCV(RandomForestClassifier(random_state=17, n_jobs=-1), param_grid=forest_params, cv=5)  # Your code here
locally_best_forest.fit(train_df, y)  # Your code here

CPU times: total: 1min 36s
Wall time: 5min 51s


GridSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1, random_state=17),
             param_grid={'max_features': array([0.3       , 0.41666667, 0.53333333, 0.65      , 0.76666667,
       0.88333333, 1.        ])})

In [28]:
locally_best_forest.best_params_, round(locally_best_forest.best_score_, 3)

({'max_features': 0.3}, 0.934)
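When `max_features` is a float, scikit-learn treats it as a fraction and considers `max(1, int(max_features * n_features))` features at each split. The short sketch below shows what the grid `np.linspace(.3, 1, 7)` means for the 9 features of this dataset; the variable names are illustrative.

```python
import numpy as np

n_features = 9  # number of columns in train_df, per the info() output
fractions = np.linspace(0.3, 1, 7)

# Number of candidate features per split for each fraction in the grid
features_per_split = [max(1, int(frac * n_features)) for frac in fractions]

for frac, k in zip(fractions, features_per_split):
    print(f"max_features={frac:.3f} -> {k} of {n_features} features per split")
```

So the best value found above, `max_features=0.3`, corresponds to 2 candidate features per split, and the grid never tries exactly 8.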

In [29]:
tuned_forest_pred = locally_best_forest.predict_proba(test_df)[:, 1]  # Your code here

In [30]:
write_to_submission_file(tuned_forest_pred, 'credit_scoring_sixth_forest.csv')  # Your code here

**See how the tuned random forest rates the importance of features by their influence on the target. Present the results in a readable form using a `DataFrame`.**

In [31]:
locally_best_forest.best_estimator_.feature_importances_

array([0.17009325, 0.16055994, 0.09322744, 0.03296169, 0.03870902,
       0.06307442, 0.06517408, 0.15081634, 0.22538382])

In [34]:
train_df.columns

Index(['DIR', 'Age', 'NumLoans', 'NumRealEstateLoans', 'NumDependents',
       'Num30-59Delinquencies', 'Num60-89Delinquencies', 'Income',
       'BalanceToCreditLimit'],
      dtype='object')

In [37]:
pd.DataFrame(locally_best_forest.best_estimator_.feature_importances_, index=train_df.columns)  # Your code here

Unnamed: 0,0
DIR,0.170093
Age,0.16056
NumLoans,0.093227
NumRealEstateLoans,0.032962
NumDependents,0.038709
Num30-59Delinquencies,0.063074
Num60-89Delinquencies,0.065174
Income,0.150816
BalanceToCreditLimit,0.225384
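The table above is easier to read when sorted. A small sketch follows, using the importance values and column names printed in the earlier cells; the `importance_df` name and the column labels are illustrative choices.

```python
import pandas as pd

# Column names and importances copied from the outputs above
feature_names = ['DIR', 'Age', 'NumLoans', 'NumRealEstateLoans', 'NumDependents',
                 'Num30-59Delinquencies', 'Num60-89Delinquencies', 'Income',
                 'BalanceToCreditLimit']
importances = [0.17009325, 0.16055994, 0.09322744, 0.03296169, 0.03870902,
               0.06307442, 0.06517408, 0.15081634, 0.22538382]

importance_df = (
    pd.DataFrame({"feature": feature_names, "importance": importances})
    .sort_values("importance", ascending=False)  # most important first
    .reset_index(drop=True)
)
print(importance_df)
```

In this ordering `BalanceToCreditLimit`, `DIR`, `Age`, and `Income` come first, and the importances sum to 1, as impurity-based importances in scikit-learn are normalized.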


**Usually, increasing the number of trees only improves the result. So, as a final step, train a random forest of 300 trees with the best parameters found. This may take a few minutes.**

In [38]:
%%time
final_forest = RandomForestClassifier(n_estimators=300, max_features=0.3, random_state=17, n_jobs=-1)  # Your code here
final_forest.fit(train_df, y)
final_forest_pred = final_forest.predict_proba(test_df)[:, 1]
write_to_submission_file(final_forest_pred, 'credit_scoring_final_forest.csv')

CPU times: total: 2min 3s
Wall time: 18.6 s
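To check how the number of trees affects quality before spending time on a large forest, one could compare cross-validated scores for a few forest sizes. The sketch below does this on synthetic data with ROC AUC as an assumed metric. The sizes and the 3-fold split are chosen only to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for train_df / y
X_toy, y_toy = make_classification(n_samples=600, n_features=9,
                                   n_informative=5, random_state=17)

scores = {}
for n_trees in (10, 50, 150):
    forest = RandomForestClassifier(n_estimators=n_trees, max_features=0.3,
                                    random_state=17, n_jobs=-1)
    # Mean ROC AUC over 3 folds for this forest size
    scores[n_trees] = cross_val_score(forest, X_toy, y_toy, cv=3,
                                      scoring="roc_auc").mean()

for n_trees, score in scores.items():
    print(f"{n_trees:>4} trees: mean ROC AUC = {score:.3f}")
```

The gain from extra trees typically flattens out after some point, while training time keeps growing roughly linearly with `n_estimators`.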


**Make a submission to Kaggle.**

In [45]:
%%time
rf_params_new = {'max_depth': list(range(3, 7)),
                 'max_features': np.linspace(.3, 1, 7),
                 'n_estimators': [300, 400, 500]}

rf_cv = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=17, verbose=4), rf_params_new, cv=7)  # Your code here
rf_cv.fit(train_df, y)  # Your code here

[Verbose training log (`verbose=4`) omitted: repeated joblib `[Parallel(...)]` progress messages and `building tree N of 500` lines. Last reported step: `[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:   29.1s finished`]


GridSearchCV(cv=7,
             estimator=RandomForestClassifier(n_jobs=-1, random_state=17,
                                              verbose=4),
             param_grid={'max_depth': [3, 4, 5, 6],
                         'max_features': array([0.3       , 0.41666667, 0.53333333, 0.65      , 0.76666667,
       0.88333333, 1.        ]),
                         'n_estimators': [300, 400, 500]})

In [46]:
rf_cv.best_params_, round(rf_cv.best_score_, 3)

({'max_depth': 6, 'max_features': 0.7666666666666666, 'n_estimators': 500},
 0.935)

In [47]:
rf_cv_pred = rf_cv.predict_proba(test_df)[:, 1]
write_to_submission_file(rf_cv_pred, 'credit_scoring_rf_cv_final_2_forest.csv')

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   9 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done  82 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 205 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 376 tasks      | elapsed:    1.0s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    1.4s finished
