<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course. Session No. 2
Author: Yury Kashnitsky, research programmer at Mail.ru Group and senior lecturer at the Faculty of Computer Science, HSE. This material is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. It may be used for any purpose (editing, correcting, or building upon it) except commercial ones, provided that the author is credited.

# <center>Topic 10. Boosting
## <center> Part 10. Advanced methods for handling categorical features, and CatBoost

In [62]:
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 100)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

Let us read the data and look at the first few rows. There are quite a few categorical features here.

In [2]:
# df = pd.read_csv("../../data/bank.csv")
df = pd.read_csv('https://github.com/Yorko/mlcourse.ai/raw/main/data/bank.csv')

In [3]:
df.head()

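|   | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | 0 |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | 0 |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | 0 |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | 0 |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | 0 |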


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  duration   4521 non-null   int64 
 12  campaign   4521 non-null   int64 
 13  pdays      4521 non-null   int64 
 14  previous   4521 non-null   int64 
 15  poutcome   4521 non-null   object
 16  y          4521 non-null   int64 
dtypes: int64(8), object(9)
memory usage: 600.6+ KB


There are 9 features with string values in total.

In [5]:
df.columns[df.dtypes == "object"]

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome'],
      dtype='object')

## Without categorical features
Let us first simply ignore the categorical features. We train a random forest and look at ROC AUC on cross-validation and on a held-out set. This will be our baseline.
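
As a side note, the column selection used for this baseline can also be written with `select_dtypes`. The sketch below runs on a made-up toy frame (the values are illustrative, not taken from the bank data) and is only one possible way to express the same step.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the bank data (values are made up).
toy = pd.DataFrame(
    {
        "age": [30, 33, 35],
        "job": ["unemployed", "services", "management"],
        "balance": [1787, 4789, 1350],
        "y": [0, 0, 1],
    }
)

# Keep only the numeric columns, then separate the target from the features.
numeric_only = toy.select_dtypes(include="number")
X_toy, y_toy = numeric_only.drop(columns="y"), numeric_only["y"]

print(X_toy.columns.tolist())
```

Selecting by `include="number"` rather than comparing `dtypes` with `"object"` also skips other non-numeric dtypes, which may or may not be what you want for a given dataset.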

In [6]:
df_no_cat, y = df.loc[:, df.dtypes != "object"].drop("y", axis=1), df["y"]

In [7]:
df_no_cat_part, df_no_cat_valid, y_train_part, y_valid = train_test_split(
    df_no_cat, y, test_size=0.3, stratify=y, random_state=17
)

In [8]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

In [9]:
forest = RandomForestClassifier(random_state=17)

In [10]:
np.mean(
    cross_val_score(forest, df_no_cat_part, y_train_part, cv=skf, scoring="roc_auc")
)

0.8495016786335677

In [11]:
forest.fit(df_no_cat_part, y_train_part)

In [12]:
roc_auc_score(y_valid, forest.predict_proba(df_no_cat_valid)[:, 1])

0.8631508998911164

## LabelEncoder for categorical features
We do the same, but encode the categorical features in the simplest way, with `LabelEncoder`.
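
To see what this kind of encoding produces, here is a small sketch on a made-up toy frame. It uses `OrdinalEncoder`, which is not used in this notebook; it is swapped in here because it applies the same integer coding to several feature columns at once, whereas `LabelEncoder` handles one column at a time.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (illustrative values only).
toy = pd.DataFrame(
    {
        "marital": ["married", "single", "married", "divorced"],
        "loan": ["no", "yes", "no", "no"],
    }
)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

# Each column's categories are sorted, and every value is replaced
# by its position in that sorted list.
print(encoder.categories_)
print(encoded)
```

The resulting integers impose an arbitrary order on the categories. Tree-based models can often work around that, while linear models generally cannot.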

In [13]:
from sklearn.preprocessing import LabelEncoder

In [14]:
label_encoder = LabelEncoder()

In [15]:
df_cat_label_enc = df.copy().drop("y", axis=1)
for col in df.columns[df.dtypes == "object"]:
    df_cat_label_enc[col] = label_encoder.fit_transform(df_cat_label_enc[col])

In [16]:
df_cat_label_enc.shape

(4521, 16)

In [17]:
df_cat_label_enc_part, df_cat_label_enc_valid = train_test_split(
    df_cat_label_enc, test_size=0.3, stratify=y, random_state=17
)

In [18]:
np.mean(
    cross_val_score(
        forest, df_cat_label_enc_part, y_train_part, cv=skf, scoring="roc_auc"
    )
)

0.8922975749958866

In [19]:
forest.fit(df_cat_label_enc_part, y_train_part)

In [20]:
roc_auc_score(y_valid, forest.predict_proba(df_cat_label_enc_valid)[:, 1])

0.908855334230022

## Binarization of categorical features (dummies, OHE)
Now let us do what is usually done by default: binarize the categorical features. Dummy features and One-Hot Encoding are, up to small differences, the same idea: create a separate binary feature for every value of every categorical feature.
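
A minimal illustration on a made-up toy frame shows how many columns this produces: each categorical column is replaced by one column per distinct value, and the numeric column is left as it is.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame(
    {
        "age": [30, 33, 35],
        "marital": ["married", "single", "married"],
        "loan": ["no", "yes", "no"],
    }
)

# One binary column per (feature, value) pair; "age" is passed through unchanged.
toy_dummies = pd.get_dummies(toy, columns=["marital", "loan"])

print(toy_dummies.columns.tolist())
print(toy_dummies.shape)
```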

In [21]:
df_cat_dummies = pd.get_dummies(df, columns=df.columns[df.dtypes == "object"]).drop(
    "y", axis=1
)

In [22]:
df_cat_dummies.shape

(4521, 51)

In [23]:
df_cat_dummies_part, df_cat_dummies_valid = train_test_split(
    df_cat_dummies, test_size=0.3, stratify=y, random_state=17
)

In [24]:
np.mean(
    cross_val_score(
        forest, df_cat_dummies_part, y_train_part, cv=skf, scoring="roc_auc"
    )
)

0.8988421716862304

In [25]:
forest.fit(df_cat_dummies_part, y_train_part)

In [26]:
roc_auc_score(y_valid, forest.predict_proba(df_cat_dummies_valid)[:, 1])

0.9172511155233887

## Pairwise feature interactions
So far we have relied on a random forest with default hyperparameters (we have not tuned them, and will not). We want to go further. A powerful technique for working with categorical features is to account for pairwise feature interactions. Let us build pairwise interactions of all categorical features. One can go further and build interactions of three or more features. Owen Zhang [once built](https://www.youtube.com/watch?v=LgLcfZjNF44) even 7-way interactions. What wouldn't you do to win on Kaggle! :)

In [27]:
df_interact = df.copy()

In [28]:
cat_features = df.columns[df.dtypes == "object"]
for i, col1 in enumerate(cat_features):
    for j, col2 in enumerate(cat_features[i + 1 :]):
        df_interact[col1 + "_" + col2] = df_interact[col1] + "_" + df_interact[col2]

In [29]:
df_interact.shape

(4521, 53)
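
The column count can be checked by hand: 9 categorical features give C(9, 2) = 36 pairs, and 17 original columns plus 36 interaction columns make 53. The sketch below expresses the same construction with `itertools.combinations`. The helper name `add_pairwise_interactions` and the toy frame are made up for illustration.

```python
from itertools import combinations
from math import comb

import pandas as pd


def add_pairwise_interactions(frame, cat_cols):
    """Return a copy of `frame` with one concatenated column per pair of `cat_cols`."""
    out = frame.copy()
    for col1, col2 in combinations(cat_cols, 2):
        out[col1 + "_" + col2] = out[col1] + "_" + out[col2]
    return out


# Toy data (illustrative values only).
toy = pd.DataFrame(
    {
        "job": ["services", "management"],
        "marital": ["married", "single"],
        "loan": ["yes", "no"],
    }
)
toy_interact = add_pairwise_interactions(toy, ["job", "marital", "loan"])

# Number of unordered pairs among 9 categorical features.
n_pairs = comb(9, 2)
print(toy_interact.columns.tolist(), n_pairs)
```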

In [30]:
df_interact.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,job_marital,job_education,job_default,job_housing,job_loan,job_contact,job_month,job_poutcome,marital_education,marital_default,marital_housing,marital_loan,marital_contact,marital_month,marital_poutcome,education_default,education_housing,education_loan,education_contact,education_month,education_poutcome,default_housing,default_loan,default_contact,default_month,default_poutcome,housing_loan,housing_contact,housing_month,housing_poutcome,loan_contact,loan_month,loan_poutcome,contact_month,contact_poutcome,month_poutcome
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,0,unemployed_married,unemployed_primary,unemployed_no,unemployed_no,unemployed_no,unemployed_cellular,unemployed_oct,unemployed_unknown,married_primary,married_no,married_no,married_no,married_cellular,married_oct,married_unknown,primary_no,primary_no,primary_no,primary_cellular,primary_oct,primary_unknown,no_no,no_no,no_cellular,no_oct,no_unknown,no_no,no_cellular,no_oct,no_unknown,no_cellular,no_oct,no_unknown,cellular_oct,cellular_unknown,oct_unknown
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,0,services_married,services_secondary,services_no,services_yes,services_yes,services_cellular,services_may,services_failure,married_secondary,married_no,married_yes,married_yes,married_cellular,married_may,married_failure,secondary_no,secondary_yes,secondary_yes,secondary_cellular,secondary_may,secondary_failure,no_yes,no_yes,no_cellular,no_may,no_failure,yes_yes,yes_cellular,yes_may,yes_failure,yes_cellular,yes_may,yes_failure,cellular_may,cellular_failure,may_failure
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,0,management_single,management_tertiary,management_no,management_yes,management_no,management_cellular,management_apr,management_failure,single_tertiary,single_no,single_yes,single_no,single_cellular,single_apr,single_failure,tertiary_no,tertiary_yes,tertiary_no,tertiary_cellular,tertiary_apr,tertiary_failure,no_yes,no_no,no_cellular,no_apr,no_failure,yes_no,yes_cellular,yes_apr,yes_failure,no_cellular,no_apr,no_failure,cellular_apr,cellular_failure,apr_failure
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,0,management_married,management_tertiary,management_no,management_yes,management_yes,management_unknown,management_jun,management_unknown,married_tertiary,married_no,married_yes,married_yes,married_unknown,married_jun,married_unknown,tertiary_no,tertiary_yes,tertiary_yes,tertiary_unknown,tertiary_jun,tertiary_unknown,no_yes,no_yes,no_unknown,no_jun,no_unknown,yes_yes,yes_unknown,yes_jun,yes_unknown,yes_unknown,yes_jun,yes_unknown,unknown_jun,unknown_unknown,jun_unknown
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,0,blue-collar_married,blue-collar_secondary,blue-collar_no,blue-collar_yes,blue-collar_no,blue-collar_unknown,blue-collar_may,blue-collar_unknown,married_secondary,married_no,married_yes,married_no,married_unknown,married_may,married_unknown,secondary_no,secondary_yes,secondary_no,secondary_unknown,secondary_may,secondary_unknown,no_yes,no_no,no_unknown,no_may,no_unknown,yes_no,yes_unknown,yes_may,yes_unknown,no_unknown,no_may,no_unknown,unknown_may,unknown_unknown,may_unknown


## Binarization of categorical features (dummies, OHE) + pairwise interactions
We end up with as many as 824 binary features, which is a lot for a task of this size. The random forest starts to struggle here, and logistic regression (tried below) does not clearly beat it either.

In [31]:
df_interact_cat_dummies = pd.get_dummies(
    df_interact, columns=df_interact.columns[df_interact.dtypes == "object"]
).drop("y", axis=1)

In [32]:
df_interact_cat_dummies.shape

(4521, 824)

In [33]:
df_interact_cat_dummies_part, df_interact_cat_dummies_valid = train_test_split(
    df_interact_cat_dummies, test_size=0.3, stratify=y, random_state=17
)

In [34]:
np.mean(
    cross_val_score(
        forest, df_interact_cat_dummies_part, y_train_part, cv=skf, scoring="roc_auc"
    )
)

0.833999698931206

In [35]:
forest.fit(df_interact_cat_dummies_part, y_train_part)

In [36]:
roc_auc_score(y_valid, forest.predict_proba(df_interact_cat_dummies_valid)[:, 1])

0.8618112043382651

The random forest struggles with this many features. Logistic regression copes somewhat better on cross-validation, although it does not beat the forest on the held-out set.

In [37]:
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(random_state=17)

In [38]:
np.mean(
    cross_val_score(
        logit, df_interact_cat_dummies_part, y_train_part, cv=skf, scoring="roc_auc"
    )
)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.8609712006693527

In [39]:
logit.fit(df_interact_cat_dummies_part, y_train_part)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [40]:
roc_auc_score(y_valid, logit.predict_proba(df_interact_cat_dummies_valid)[:, 1])

0.8530284591899913
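
The warnings above report that the solver stopped at its iteration limit and suggest scaling the data or raising `max_iter`. The sketch below shows one common way to follow that advice: a pipeline with `StandardScaler` and a larger `max_iter`. It runs on synthetic data from `make_classification`, not on the bank data, so the printed score says nothing about this task. The value `max_iter=1000` is an arbitrary choice made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is put on a much larger scale on purpose.
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=5,
    n_clusters_per_class=1,
    random_state=17,
)
X_demo[:, 0] *= 1000.0

# The scaler is fit inside each CV fold, so no information leaks from the test fold.
scaled_logit = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000, random_state=17)
)
demo_skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
demo_auc = np.mean(
    cross_val_score(scaled_logit, X_demo, y_demo, cv=demo_skf, scoring="roc_auc")
)
print(round(demo_auc, 3))
```

Applying the same pipeline to the notebook's data would change the reported scores, so the numbers above should not be compared with it directly.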

## Mean Target
Now we will encode categorical features with the mean value of the target. This is a very powerful technique, but it has to be used carefully, because it is easy to overfit.
The main idea is to compute, for each value of a categorical feature, the mean of the target, and to replace the categorical feature with these means. The means must be computed out of fold (on cross-validation); otherwise each row's encoding leaks its own target value.
For more detail, I refer you to videos by top Kaggle competitors, who explain the technique first-hand.
- The ["Advanced Machine Learning" specialization](https://www.coursera.org/specializations/aml) on Coursera, [course](https://www.coursera.org/learn/competitive-data-science) "How to Win a Data Science Competition: Learn from Top Kagglers": several videos cover different ways of building features from the target and how to avoid overfitting while doing so. Presented by Dmitry Altukhov.
- A [lecture](https://www.youtube.com/watch?v=g335THJxkto) presenting a solution to the Kaggle BNP Paribas competition, by Stanislav Semenov.

A similar technique [is used](https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/) in CatBoost as well.
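
The difference between naive and out-of-fold mean target encoding can be seen on a tiny made-up example. This is a minimal sketch: the column name `city`, the data, and the use of an unshuffled 3-fold `KFold` are all assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy data (illustrative values only).
toy = pd.DataFrame(
    {"city": ["a", "a", "b", "b", "a", "b"], "y": [1, 0, 1, 1, 0, 0]}
)
glob_mean = toy["y"].mean()

# Naive encoding: every row's own target takes part in its category mean.
naive = toy["city"].map(toy.groupby("city")["y"].mean())

# Out-of-fold encoding: each row is encoded with means computed on the other folds.
oof = pd.Series(glob_mean, index=toy.index, dtype=float)
for fit_idx, enc_idx in KFold(n_splits=3).split(toy):
    fold_means = toy.iloc[fit_idx].groupby("city")["y"].mean()
    encoded_fold = toy["city"].iloc[enc_idx].map(fold_means)
    # Categories missing from the fitting folds fall back to the global mean.
    oof.iloc[enc_idx] = encoded_fold.fillna(glob_mean).to_numpy()

print(pd.DataFrame({"city": toy["city"], "y": toy["y"], "naive": naive, "oof": oof}))
```

On a frame this small the out-of-fold values are very noisy, but the point is that they never use the target of the row being encoded.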

To begin with, let us encode the original categorical features this way.

In [41]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

In [42]:
train_df, y = df.copy(), df["y"]
train_df_part, valid_df, y_train_part, y_valid = train_test_split(
    train_df.drop("y", axis=1), y, test_size=0.3, stratify=y, random_state=17
)

In [43]:
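def mean_target_enc(train_df, y_train, valid_df, skf):
    """Replace categorical columns with mean-target encodings.

    Training rows get out-of-fold means (computed on the other folds of `skf`);
    validation rows get means computed on the whole training part. Categories
    not seen when the means were computed fall back to the global target mean.
    """
    glob_mean = y_train.mean()
    cat_features = train_df.columns[train_df.dtypes == "object"].tolist()

    # Attach the target so that per-category means can be computed with groupby.
    train_df = pd.concat([train_df, pd.Series(y_train, name="y")], axis=1)
    new_train_df = train_df.copy()
    # Work on a copy so that the caller's validation frame is left untouched.
    new_valid_df = valid_df.copy()

    # Initialise with the global mean; each row is overwritten by exactly one fold.
    for col in cat_features:
        new_train_df[col + "_mean_target"] = glob_mean

    for train_idx, valid_idx in skf.split(train_df, y_train):
        train_df_cv = train_df.iloc[train_idx]
        valid_df_cv = train_df.iloc[valid_idx]

        for col in cat_features:
            # Means come from the other folds only, so no row sees its own target.
            means = valid_df_cv[col].map(train_df_cv.groupby(col)["y"].mean())
            col_pos = new_train_df.columns.get_loc(col + "_mean_target")
            new_train_df.iloc[valid_idx, col_pos] = means.fillna(glob_mean).to_numpy()

    new_train_df.drop(cat_features + ["y"], axis=1, inplace=True)

    # Validation rows are encoded with means from the whole training part.
    for col in cat_features:
        means = new_valid_df[col].map(train_df.groupby(col)["y"].mean())
        new_valid_df[col + "_mean_target"] = means.fillna(glob_mean)

    new_valid_df.drop(cat_features, axis=1, inplace=True)

    return new_train_df, new_valid_df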

In [44]:
train_mean_target_part, valid_mean_target = mean_target_enc(
    train_df_part, y_train_part, valid_df, skf
)

In [45]:
np.mean(
    cross_val_score(
        forest, train_mean_target_part, y_train_part, cv=skf, scoring="roc_auc"
    )
)

0.8813941498132323

In [46]:
forest.fit(train_mean_target_part, y_train_part)

In [47]:
roc_auc_score(y_valid, forest.predict_proba(valid_mean_target)[:, 1])

0.9108221780994471

## Mean Target + pairwise interactions

In [48]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

In [49]:
train_df, y = df_interact.drop("y", axis=1).copy(), df_interact["y"]
train_df_part, valid_df, y_train_part, y_valid = train_test_split(
    train_df, y, test_size=0.3, stratify=y, random_state=17
)

In [50]:
train_mean_target_part, valid_mean_target = mean_target_enc(
    train_df_part, y_train_part, valid_df, skf
)

In [51]:
np.mean(
    cross_val_score(
        forest, train_mean_target_part, y_train_part, cv=skf, scoring="roc_auc"
    )
)

0.8864378441723935

In [52]:
forest.fit(train_mean_target_part, y_train_part)

In [53]:
roc_auc_score(y_valid, forest.predict_proba(valid_mean_target)[:, 1])

0.9138538397489271

This time logistic regression does noticeably worse than the random forest.

In [54]:
np.mean(
    cross_val_score(
        logit, train_mean_target_part, y_train_part, cv=skf, scoring="roc_auc"
    )
)

0.7839527094441079

In [55]:
logit.fit(train_mean_target_part, y_train_part)

In [56]:
roc_auc_score(y_valid, logit.predict_proba(valid_mean_target)[:, 1])

0.7496530668887038

## CatBoost
The [CatBoost](https://catboost.yandex) library implements, among other things, exactly this technique of encoding categorical values with the mean of the target. It gives good results specifically when the data contain many important categorical features. One drawback is that it is (for now) slower than XGBoost and LightGBM.
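
The CatBoost documentation describes its target statistics as "ordered": a row is encoded using only the rows that precede it in a random permutation, with a prior added for smoothing. The sketch below is a simplified illustration of that idea and not CatBoost's implementation. The function name `ordered_target_stat`, the single fixed row order, and the prior value of 0.5 are all assumptions made for the example.

```python
import numpy as np


def ordered_target_stat(categories, targets, prior=0.5):
    """Encode each row from the targets of earlier rows with the same category.

    Simplified sketch: the rows are processed in the given order (a real
    implementation would use random permutations), and the statistic is
    (sum of earlier targets + prior) / (number of earlier rows + 1).
    """
    encoded = np.empty(len(categories), dtype=float)
    sums, counts = {}, {}
    for i, (cat, target) in enumerate(zip(categories, targets)):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded[i] = (s + prior) / (n + 1)
        # Update the running statistics only after the row has been encoded.
        sums[cat] = s + target
        counts[cat] = n + 1
    return encoded


demo_encoded = ordered_target_stat(["a", "b", "a", "a", "b"], [1, 0, 0, 1, 1])
print(demo_encoded)
```

Because a row's own target is added to the running sums only after the row is encoded, the encoding never leaks the label of the row itself, which is the same concern that out-of-fold means address.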

In [58]:
pip install catboost

Collecting catboost
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.5


In [59]:
from catboost import CatBoostClassifier

In [60]:
ctb = CatBoostClassifier(random_seed=17)

In [61]:
train_df, y = df.drop("y", axis=1), df["y"]
train_df_part, valid_df, y_train_part, y_valid = train_test_split(
    train_df, y, test_size=0.3, stratify=y, random_state=17
)

In [63]:
cat_features_idx = np.where(train_df_part.dtypes == "object")[0].tolist()

In [64]:
%%time
cv_scores = []
for train_idx, test_idx in skf.split(train_df_part, y_train_part):
    cv_train_df, cv_valid_df = (
        train_df_part.iloc[train_idx, :],
        train_df_part.iloc[test_idx, :],
    )
    y_cv_train, y_cv_valid = y_train_part.iloc[train_idx], y_train_part.iloc[test_idx]

    ctb.fit(cv_train_df, y_cv_train, cat_features=cat_features_idx)

    cv_scores.append(roc_auc_score(y_cv_valid, ctb.predict_proba(cv_valid_df)[:, 1]))

Learning rate set to 0.015316
0:	learn: 0.6762525	total: 68.3ms	remaining: 1m 8s
1:	learn: 0.6618182	total: 82ms	remaining: 40.9s
2:	learn: 0.6502370	total: 88.2ms	remaining: 29.3s
3:	learn: 0.6362366	total: 93.9ms	remaining: 23.4s
4:	learn: 0.6199279	total: 112ms	remaining: 22.3s
5:	learn: 0.6093899	total: 119ms	remaining: 19.7s
6:	learn: 0.5942204	total: 135ms	remaining: 19.2s
7:	learn: 0.5814123	total: 154ms	remaining: 19.1s
8:	learn: 0.5681355	total: 173ms	remaining: 19s
9:	learn: 0.5548194	total: 189ms	remaining: 18.7s
10:	learn: 0.5440275	total: 207ms	remaining: 18.6s
11:	learn: 0.5327116	total: 225ms	remaining: 18.6s
12:	learn: 0.5225683	total: 243ms	remaining: 18.5s
13:	learn: 0.5110561	total: 261ms	remaining: 18.4s
14:	learn: 0.5047516	total: 269ms	remaining: 17.7s
15:	learn: 0.4966251	total: 282ms	remaining: 17.3s
16:	learn: 0.4897499	total: 297ms	remaining: 17.2s
17:	learn: 0.4801462	total: 314ms	remaining: 17.1s
18:	learn: 0.4707871	total: 330ms	remaining: 17.1s
19:	learn: 

In [65]:
np.mean(cv_scores)

0.9028648621209946

In [66]:
%%time
ctb.fit(train_df_part, y_train_part, cat_features=cat_features_idx);

Learning rate set to 0.016847
0:	learn: 0.6738676	total: 21.5ms	remaining: 21.5s
1:	learn: 0.6568722	total: 43.6ms	remaining: 21.7s
2:	learn: 0.6386390	total: 67.2ms	remaining: 22.3s
3:	learn: 0.6226240	total: 81.1ms	remaining: 20.2s
4:	learn: 0.6082315	total: 103ms	remaining: 20.5s
5:	learn: 0.5930647	total: 121ms	remaining: 20.1s
6:	learn: 0.5796675	total: 141ms	remaining: 20.1s
7:	learn: 0.5644235	total: 164ms	remaining: 20.3s
8:	learn: 0.5519232	total: 186ms	remaining: 20.5s
9:	learn: 0.5380081	total: 209ms	remaining: 20.7s
10:	learn: 0.5274532	total: 231ms	remaining: 20.8s
11:	learn: 0.5155046	total: 258ms	remaining: 21.3s
12:	learn: 0.5039250	total: 282ms	remaining: 21.4s
13:	learn: 0.4932726	total: 306ms	remaining: 21.5s
14:	learn: 0.4840909	total: 320ms	remaining: 21s
15:	learn: 0.4746328	total: 341ms	remaining: 21s
16:	learn: 0.4679718	total: 355ms	remaining: 20.5s
17:	learn: 0.4605335	total: 374ms	remaining: 20.4s
18:	learn: 0.4506446	total: 397ms	remaining: 20.5s
19:	learn: 

<catboost.core.CatBoostClassifier at 0x7d608ffbd960>

In [67]:
roc_auc_score(y_valid, ctb.predict_proba(valid_df)[:, 1])

0.9190524989858878
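
For comparison, the scores printed in the cells above can be gathered in one table. The numbers below are copied from those outputs and rounded to four decimals. Note that the Mean Target and CatBoost sections use folds with `random_state=1` while the earlier sections use `random_state=17`, so the cross-validation columns are not strictly comparable.

```python
import pandas as pd

# (features, model, CV ROC AUC, held-out ROC AUC), copied from the outputs above.
results = pd.DataFrame(
    [
        ("No categorical features", "random forest", 0.8495, 0.8632),
        ("LabelEncoder", "random forest", 0.8923, 0.9089),
        ("OHE", "random forest", 0.8988, 0.9173),
        ("OHE + interactions", "random forest", 0.8340, 0.8618),
        ("OHE + interactions", "logistic regression", 0.8610, 0.8530),
        ("Mean target", "random forest", 0.8814, 0.9108),
        ("Mean target + interactions", "random forest", 0.8864, 0.9139),
        ("Mean target + interactions", "logistic regression", 0.7840, 0.7497),
        ("Raw categorical features", "CatBoost", 0.9029, 0.9191),
    ],
    columns=["features", "model", "cv_roc_auc", "holdout_roc_auc"],
)

# Best held-out score first.
summary = results.sort_values("holdout_roc_auc", ascending=False).reset_index(drop=True)
print(summary)
```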