# Seminar plan
- Functionals and metrics
- Cross-validation
- Overfitting and regularization
- Hyperparameters and their optimization
- Linear classifier for binary classification
- Encoding categorical features

# Functionals and metrics

Quick recap

A functional (or loss function) is the function used to train a model: it is what we optimize when fitting the model's parameters (for linear regression, the parameters are the weights).

A metric is a measure of model quality that can be applied to any model (it answers the question of how accurately the model predicts the target variable).

Example: to train a linear regression, we can minimize the MSE functional.

If we have n observations and k features:

$\frac{1}{n}\sum_{i=1}^{n}(\hat y_{i} - y_{i})^{2} \rightarrow \min_{w}$

where $\hat y_{i} = \sum_{j=1}^{k}w_{j}X_{ij}$

And as a metric we can use RMSE:

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat y_{i} - y_{i})^{2}}$
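To make the formulas concrete, here is a minimal NumPy sketch that computes the linear prediction, the MSE (averaged over the n observations) and the RMSE. The toy data and the weights are made up for illustration and are not fitted.

```python
import numpy as np

# Toy data (made up for illustration): n = 4 observations, k = 2 features
X_toy = np.array([[1.0, 2.0],
                  [2.0, 0.0],
                  [3.0, 1.0],
                  [4.0, 3.0]])
y_toy = np.array([5.0, 4.0, 7.0, 11.0])
w = np.array([2.0, 1.0])            # some fixed weights, not the result of training

y_hat = X_toy @ w                   # y_hat_i = sum over j of w_j * X_ij
mse = np.mean((y_hat - y_toy) ** 2) # mean of squared errors
rmse = np.sqrt(mse)                 # same units as the target

print(y_hat, mse, rmse)
```

RMSE is often easier to interpret than MSE because it is expressed in the units of the target.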

The fundamental difference between a functional and a metric is that the metric should reflect our business task or research question, while the functional should be chosen so that it best helps reach that goal (that is, yields the best values of the metric or metrics).

An analogy from studying at HSE: to pass calculus, we can learn the derivatives of various functions, so our functional is the number of derivatives we know. The metric of having passed calculus is the grade we get at the end of the course.

The course grade is a clear metric given to us by the world. Whether cramming derivatives is the best functional for reaching that goal is up to you, as researchers, to decide.

One more thing: although functionals and metrics are different tools in meaning and use, they can be computed in the same way (for example, a linear regression can be trained with the MSE functional, and its quality can also be checked with MSE).

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import load_diabetes 

In [2]:
np.random.seed(42)

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
X, y = load_diabetes(return_X_y = True, as_frame = True)

In [5]:
X

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


In [6]:
y

0      151.0
1       75.0
2      141.0
3      206.0
4      135.0
       ...  
437    178.0
438    104.0
439    132.0
440    220.0
441     57.0
Name: target, Length: 442, dtype: float64

In [7]:
help(load_diabetes)

Help on function load_diabetes in module sklearn.datasets._base:

load_diabetes(*, return_X_y=False, as_frame=False, scaled=True)
    Load and return the diabetes dataset (regression).
    
    Samples total    442
    Dimensionality   10
    Features         real, -.2 < x < .2
    Targets          integer 25 - 346
    
    .. note::
       The meaning of each feature (i.e. `feature_names`) might be unclear
       (especially for `ltg`) as the documentation of the original dataset is
       not explicit. We provide information that seems correct in regard with
       the scientific literature in this field of research.
    
    Read more in the :ref:`User Guide <diabetes_dataset>`.
    
    Parameters
    ----------
    return_X_y : bool, default=False
        If True, returns ``(data, target)`` instead of a Bunch object.
        See below for more information about the `data` and `target` object.
    
        .. versionadded:: 0.18
    
    as_frame : bool, default=False
        If Tr

In [8]:
load_diabetes().DESCR

'.. _diabetes_dataset:\n\nDiabetes dataset\n----------------\n\nTen baseline variables, age, sex, body mass index, average blood\npressure, and six blood serum measurements were obtained for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative measure of disease progression one year after baseline.\n\n**Data Set Characteristics:**\n\n  :Number of Instances: 442\n\n  :Number of Attributes: First 10 columns are numeric predictive values\n\n  :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n\n  :Attribute Information:\n      - age     age in years\n      - sex\n      - bmi     body mass index\n      - bp      average blood pressure\n      - s1      tc, total serum cholesterol\n      - s2      ldl, low-density lipoproteins\n      - s3      hdl, high-density lipoproteins\n      - s4      tch, total cholesterol / HDL\n      - s5      ltg, possibly log of serum triglycerides level\n      - s6      glu, blood sugar

In [9]:
# Split the data into training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

As discussed in the lecture, linear regression can be trained with different functionals (not only MSE, which we covered in the previous seminar) and evaluated with different metrics. Let's code this up.

In [11]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

lr_mse = SGDRegressor(loss = 'squared_error', max_iter = 50000)
lr_mae = SGDRegressor(loss = 'epsilon_insensitive', epsilon = 0, max_iter = 50000)

lr_mse.fit(X_train, y_train)
lr_mae.fit(X_train, y_train)

y_pred_mse = lr_mse.predict(X_test)
y_pred_mae = lr_mae.predict(X_test)
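Why does `loss = 'epsilon_insensitive'` with `epsilon = 0` act as an MAE-style loss? The epsilon-insensitive loss ignores errors smaller than epsilon and penalizes the rest linearly, so with epsilon equal to 0 it reduces to the absolute error. A small sketch (the helper function and the residuals are illustrative, not part of sklearn):

```python
import numpy as np

def epsilon_insensitive(residual, epsilon):
    # Errors smaller than epsilon cost nothing; larger ones grow linearly
    return np.maximum(0.0, np.abs(residual) - epsilon)

residuals = np.array([-3.0, -0.5, 0.0, 2.0])   # made-up prediction errors

loss_eps0 = epsilon_insensitive(residuals, epsilon=0.0)  # identical to |residual|
loss_eps1 = epsilon_insensitive(residuals, epsilon=1.0)  # small errors are ignored

print(loss_eps0)
print(loss_eps1)
```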

In [12]:
print(f'''MSE loss: 
mae={mean_absolute_error(y_test, y_pred_mse)}
mse={mean_squared_error(y_test, y_pred_mse)}
R2={r2_score(y_test, y_pred_mse)}
''')

MSE loss: 
mae=41.68817803775468
mse=2810.0321211676223
R2=0.49182825368762506



In [13]:
print(f'''MAE loss: 
mae={mean_absolute_error(y_test, y_pred_mae)}
mse={mean_squared_error(y_test, y_pred_mae)}
R2={r2_score(y_test, y_pred_mae)}
''')

MAE loss: 
mae=62.881495822505066
mse=5566.60231891419
R2=-0.006675333039866782



As we said earlier, a metric should reflect the real-world goal, so it is often necessary to write your own metrics that better describe your specific situation. In medical tasks (like ours), the cost of an error is high (for example, a patient's disease is progressing quickly and the model misses it). So, to decide whether the model can be used in practice, it makes sense to look at the model's maximum error:

$\text{max error} = \max_{i} |\hat y_{i} - y_{i}|$

In [14]:
def max_error(y_true, y_pred):
    max_error = np.abs(y_true - y_pred).max()
    return max_error

def quantile_error(y_true, y_pred, q = 0.95):
    q_error = np.quantile(np.abs(y_true -  y_pred), q)
    return q_error

# Evaluate the maximum error in both cases

print(f'MSE Loss: {max_error(y_test, y_pred_mse)}')
print(f'MAE Loss: {max_error(y_test, y_pred_mae)}')

MSE Loss: 137.48149670049958
MAE Loss: 175.5306489609794
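As a sanity check, the same quantities can be computed on a tiny hand-made example. The sketch below also compares the result with `sklearn.metrics.max_error`, which implements the same maximum-error metric; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import max_error as sk_max_error

# Made-up true values and predictions
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 260.0, 245.0])

abs_err = np.abs(y_true - y_pred)      # absolute errors: 10, 10, 60, 5
worst = abs_err.max()                  # maximum error
median_err = np.quantile(abs_err, 0.5) # a quantile is less sensitive to a single outlier

print(worst, sk_max_error(y_true, y_pred), median_err)
```

The maximum is driven by a single worst observation, which is why a high quantile (such as the 0.95 quantile above) can be a more stable alternative.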


BTW, sklearn already implements a large number of metrics; you can find the list and usage examples here

https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

# Cross-validation

Once the functional and the metric are chosen, a natural question is: how much can I trust the results (the metric values), and could they be due to chance or coincidence? Cross-validation is a tool for answering this question.
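To show what happens under the hood, here is a hand-written K-fold loop. It is only a sketch: it uses `LinearRegression` instead of `SGDRegressor` (a substitution made here so that the result is deterministic) and the original 10 diabetes features.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X_d, y_d = load_diabetes(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_mse = []
for train_idx, test_idx in kf.split(X_d):
    # Train on 4 folds, evaluate on the held-out fold
    model = LinearRegression().fit(X_d[train_idx], y_d[train_idx])
    pred = model.predict(X_d[test_idx])
    fold_mse.append(mean_squared_error(y_d[test_idx], pred))

# The spread across folds shows how stable the metric is
print(np.round(fold_mse, 1), np.mean(fold_mse), np.std(fold_mse))
```

If the values differ a lot between folds, a single train/test split is not a reliable estimate of quality.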

In [15]:
from sklearn.model_selection import cross_validate

here you can see which parameters this function takes
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

In [16]:
# use cross-validation to check the MSE, MAE and R2 values
# for a linear regression trained with the MSE functional

num_splits = 5

cv_res = cross_validate(lr_mse,
                     X,
                     y,
                     scoring = 'neg_mean_squared_error', # the metric to evaluate
                     cv = num_splits # number of splits or a splitter object
                    )

print(f"test mse errors are {cv_res['test_score']}")
print(f"mean test mse = {cv_res['test_score'].mean()}")

test mse errors are [-2977.50492814 -3034.73906235 -3158.35528509 -2889.26050471
 -3036.66631155]
mean test mse = -3019.3052183677128


In [17]:
# Run cross-validation for several metrics at once

cv_res2 = cross_validate(lr_mse,
                     X,
                     y,
                     scoring = ['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2'],
                     cv = num_splits
                    )
print(f"""test mse errors are {cv_res2['test_neg_mean_squared_error']} 
and  mean mse = {cv_res2['test_neg_mean_squared_error'].mean()}
""")

print(f"""test mae errors are {cv_res2['test_neg_mean_absolute_error']} 
and  mean mae = {cv_res2['test_neg_mean_absolute_error'].mean()}
""")


print(f"""test R2 are {cv_res2['test_r2']} 
and  mean R2 = {cv_res2['test_r2'].mean()}
""")


test mse errors are [-2990.22752516 -3031.84924771 -3164.95551363 -2909.24855817
 -3024.77488843] 
and  mean mse = -3024.2111466198767

test mae errors are [-45.04310209 -44.93584743 -48.08023766 -42.70091223 -43.78796873] 
and  mean mae = -44.90961362991736

test R2 are [0.38640149 0.5221245  0.49430171 0.44546323 0.5325436 ] 
and  mean R2 = 0.47616690571410414
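The error values above are negative because sklearn scorers follow the "greater is better" convention, so error metrics are reported with a flipped sign (hence the `neg_` prefix). A short sketch of converting them back, again with `LinearRegression` as a deterministic stand-in for the model used above:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X_d, y_d = load_diabetes(return_X_y=True)
res = cross_validate(LinearRegression(), X_d, y_d,
                     scoring='neg_mean_squared_error', cv=5)

mse_per_fold = -res['test_score']   # flip the sign back to an ordinary MSE
print(f"MSE = {mse_per_fold.mean():.1f} +/- {mse_per_fold.std():.1f}")
```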



In [18]:
# for those who want to think a bit more

# cross-validation can be run with your own custom metric, but for that
# you need to turn it into a scorer object

from sklearn.metrics import make_scorer

max_error_scorer = make_scorer(max_error, greater_is_better = False)

cv_res3 = cross_validate(lr_mse,
                     X,
                     y,
                     scoring = max_error_scorer,
                     cv = num_splits
                    )
cv_res3['test_score']

array([-138.46105641, -161.08634555, -120.74636425, -131.08006171,
       -135.80441471])
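The same approach works for metrics with extra parameters: `make_scorer` forwards additional keyword arguments to the metric function. Below is a sketch for the quantile error with `q = 0.9`; the scorer name `q90_scorer` and the choice of `LinearRegression` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate

def quantile_error(y_true, y_pred, q=0.95):
    # q-th quantile of the absolute errors
    return np.quantile(np.abs(y_true - y_pred), q)

# greater_is_better=False makes the scorer return the negated error;
# q=0.9 is passed through to quantile_error
q90_scorer = make_scorer(quantile_error, greater_is_better=False, q=0.9)

X_d, y_d = load_diabetes(return_X_y=True)
res = cross_validate(LinearRegression(), X_d, y_d, scoring=q90_scorer, cv=5)

q90_errors = -res['test_score']     # back to positive error values
print(q90_errors)
```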

# A bit of feature engineering

One of the most important sources of improvement in a model's predictive quality is an informative set of features. So, in an attempt to improve our model, let's enrich the feature space with pairwise products of the features.

In [19]:
import copy

cols = copy.deepcopy(X.columns)
print(cols)

for col1 in cols:
    for col2 in cols:
        col_name = col1 + '_x_' + col2
        if col_name not in X.columns:
            X[col_name] = X[col1] * X[col2]
X

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')


Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,...,s6_x_age,s6_x_sex,s6_x_bmi,s6_x_bp,s6_x_s1,s6_x_s2,s6_x_s3,s6_x_s4,s6_x_s5,s6_x_s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,...,-0.000672,-0.000894,-0.001089,-0.000386,0.000780,0.000614,0.000766,0.000046,-0.000351,0.000311
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,...,0.000174,0.004116,0.004746,0.002428,0.000779,0.001767,-0.006861,0.003641,0.006300,0.008502
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930,...,-0.002212,-0.001314,-0.001153,0.000147,0.001182,0.000887,0.000839,0.000067,-0.000074,0.000672
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,...,0.000834,0.000418,0.000109,0.000343,-0.000114,-0.000234,0.000337,-0.000321,-0.000212,0.000088
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,...,-0.000251,0.002082,0.001697,-0.001020,-0.000184,-0.000727,-0.000380,0.000121,0.001492,0.002175
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207,...,0.000301,0.000365,0.000142,0.000431,-0.000041,-0.000018,-0.000207,-0.000019,0.000225,0.000052
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485,...,-0.000245,0.002255,-0.000708,-0.003009,0.002195,0.003522,-0.001276,0.001526,-0.000806,0.001979
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491,...,0.000646,0.000785,-0.000246,0.000268,-0.000578,-0.000214,-0.000387,-0.000172,-0.000726,0.000240
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930,...,0.001179,0.001158,-0.001013,-0.000032,-0.000423,-0.000396,0.000744,-0.000689,-0.001155,0.000672


# Overfitting and regularization

Overfitting is a situation where the model has learned the training set well but shows much lower quality on test data. This can be interpreted as the model having become too specific and having lost its ability to generalize.

For linear regression, one symptom of overfitting is weights that are large in absolute value. This is countered with regularization.

Lasso, or L1 regularization:

$Q_{lasso}(w) = Q(w) + \alpha \sum_{j=1}^{k}|w_{j}|$

Ridge, or L2 regularization:

$Q_{ridge}(w) = Q(w) + \alpha \sum_{j=1}^{k}w_{j}^{2}$


As explained in the lecture, although both kinds of regularization shrink the weights, Lasso differs in that it can drive some weights to exactly 0 (which is equivalent to treating the corresponding feature as uninformative), whereas with Ridge the weights can get arbitrarily close to 0 but, as a rule, never reach it exactly.

The explanation is in the lecture :)
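For intuition, consider the simplest one-dimensional case (an assumption made only for this illustration): $Q(w) = \frac{1}{2}(w - z)^2$, where $z$ is the unregularized weight. Adding $\alpha |w|$ gives the soft-thresholding solution $w = \text{sign}(z)\max(|z| - \alpha, 0)$, while adding $\alpha w^2$ gives $w = z / (1 + 2\alpha)$. A sketch with made-up values of $z$:

```python
import numpy as np

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])   # unregularized weights (made up)
alpha = 0.5

# L1: shift every weight toward zero by alpha and clip at zero
w_lasso = np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

# L2: scale every weight by the same factor; nonzero weights stay nonzero
w_ridge = z / (1.0 + 2.0 * alpha)

print(w_lasso)
print(w_ridge)
```

Weights with $|z| \le \alpha$ become exactly zero under L1, while L2 only rescales them.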


In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [21]:
# alpha is a hyperparameter; let's see how the weights depend on it

from sklearn.linear_model import Lasso

for a in np.arange(0, 1.1, 0.25):
    if a == 0:
        a += 0.00000001
    lasso = Lasso(alpha = a)
    lasso.fit(X_train, y_train)
    y_pred_tr = lasso.predict(X_train)
    y_pred2 = lasso.predict(X_test)
    print('alpha={}'.format(a))
    print('Train MSE:', mean_squared_error(y_train, y_pred_tr))
    print('Test MSE:', mean_squared_error(y_test, y_pred2))
    print(lasso.coef_,'\n')

alpha=1e-08
Train MSE: 2597.5480769975024
Test MSE: 2454.8985668206087
[ 3.01692341e+01 -5.39869580e+00  3.80338527e+02  3.28436246e+02
 -2.71696089e+02  1.60527144e+02 -2.67846555e+02 -6.98661684e+01
  7.43430440e+02  8.61701157e+01  7.60836048e+02  4.29627134e+03
  3.58715360e+02  2.40981458e+03 -7.86704583e+03  1.25999116e+03
  6.33840855e+03  6.88921532e+03  2.36094359e+03 -6.31315627e+02
 -6.70003424e+02 -3.99946028e+04  3.27321507e+01  2.83109437e+03
  2.77309110e+03 -4.94066268e+02  2.47493977e+03 -3.21450318e+03
  9.91064324e+02  1.25924640e+02 -4.52526082e+02  9.29063197e+01
  9.46254435e+02  2.37696974e+03 -1.65647333e+04  7.51696572e+03
  9.34243422e+03  5.85291880e+02  7.51209973e+03  1.70599904e+03
 -6.64202071e+02  3.31973993e+02  8.00438989e+02  1.06377465e+01
  5.18763399e+03 -7.55889482e+03 -1.07937559e+03 -2.24283386e+03
 -1.16749336e+03 -8.94222703e+02 -6.27039849e+03  3.33552500e+03
 -1.05319747e+04  1.33259543e+04  8.72359292e+03 -1.18369430e+04
 -1.64731280e+04 -1

In [22]:
# alpha is a hyperparameter; let's see how the weights depend on it

from sklearn.linear_model import Ridge

for a in np.arange(0, 1.1, 0.25):
    if a == 0:
        a += 0.00000001
    ridge = Ridge(alpha = a)
    ridge.fit(X_train, y_train)

    y_pred_tr = ridge.predict(X_train)
    y_pred2 = ridge.predict(X_test)

    print('alpha={}'.format(a))
    print('Train MSE:', mean_squared_error(y_train, y_pred_tr))
    print('Test MSE:', mean_squared_error(y_test, y_pred2))
    print(ridge.coef_,'\n')

alpha=1e-08
Train MSE: 2524.57033992817
Test MSE: 2922.5959753023385
[ 5.36957704e+01 -2.61206938e+02  4.03334715e+02  3.06415716e+02
 -3.40445539e+04  2.99240885e+04  1.22928169e+04 -2.38516753e+02
  1.19307872e+04  9.78236062e+01  8.10615330e+02  1.44330027e+03
 -3.46880694e+02  9.96003121e+02 -2.72709033e+03 -9.64018429e+02
  4.34764814e+03  4.83574849e+03  9.90872622e+02  1.16959364e+02
  1.44330027e+03 -1.57729235e+00 -4.66438547e+01  1.27799403e+03
  6.54671725e+03 -5.22010134e+03 -3.29295902e+03 -2.74690973e+03
 -8.07974951e+02  1.24856689e+02 -3.46880692e+02 -4.66438541e+01
  8.49779966e+02  1.70677170e+03 -8.00783131e+03  4.35310078e+03
  4.46981897e+03  1.66024324e+03  3.53963672e+03  5.95944292e+01
  9.96003119e+02  1.27799403e+03  1.70677170e+03  1.11084923e+02
  3.47839951e+03 -1.94975386e+03 -1.31374193e+03 -1.07411286e+03
 -3.83359473e+02 -9.65884174e+02 -2.72709032e+03  6.54671726e+03
 -8.00783131e+03  3.47839951e+03  2.02009257e+05 -1.47479705e+05
 -1.04078531e+05 -4.2
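Long coefficient printouts are hard to compare by eye, so it can help to simply count the exact zeros. The sketch below does this for Lasso and Ridge on the original 10 diabetes features (not the enriched feature set used above); the alpha values are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X_d, y_d = load_diabetes(return_X_y=True)

zero_counts = {}
for a in [0.01, 0.1, 1.0]:
    lasso_a = Lasso(alpha=a, max_iter=10_000).fit(X_d, y_d)
    ridge_a = Ridge(alpha=a).fit(X_d, y_d)
    # Count coefficients that are exactly zero in each model
    zero_counts[a] = (int(np.sum(lasso_a.coef_ == 0)),
                      int(np.sum(ridge_a.coef_ == 0)))
    print(f"alpha={a}: exact zeros  Lasso={zero_counts[a][0]}  Ridge={zero_counts[a][1]}")
```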

So which value of alpha is best? And is regularization needed here at all?

To answer this question, we can use cross-validation to try different values of alpha and pick the best one. This process is called hyperparameter optimization. Alpha is a hyperparameter because it cannot be found by minimizing the functional itself, unlike the regression weights (minimizing the regularized functional over alpha would simply push alpha to 0).
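The search itself is just a loop: for every candidate alpha, estimate the quality by cross-validation and keep the best value. A minimal sketch on the original 10 diabetes features, with a logarithmic grid chosen for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X_d, y_d = load_diabetes(return_X_y=True)
alpha_grid = np.logspace(-3, 1, 9)   # candidate values from 0.001 to 10

# Mean cross-validated R2 for every candidate alpha
mean_r2 = [
    cross_val_score(Lasso(alpha=a, max_iter=10_000), X_d, y_d, cv=5, scoring='r2').mean()
    for a in alpha_grid
]

best_alpha = alpha_grid[int(np.argmax(mean_r2))]
print(best_alpha, max(mean_r2))
```

A logarithmic grid is a common choice for regularization strength because the useful values often span several orders of magnitude.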

In [28]:
from sklearn.linear_model import LassoCV

n_alphas = 200
alphas = np.linspace(1e-10, 5, n_alphas)

lasso_cv = LassoCV(alphas = alphas, cv = 5, random_state = 42)
lasso_cv.fit(X, y)

print(f'Optimal alpha value is {lasso_cv.alpha_}')


Optimal alpha value is 0.025125628240201005


In [25]:
# A more general way to use cross-validation to find the best set of hyperparameters


from sklearn.model_selection import GridSearchCV

params = {'alpha':alphas}
#print(params)
cv = GridSearchCV(lasso,
                  params,
                  scoring = 'r2',
                  cv = num_splits
                 )
cv.fit(X, y)

print(cv.best_params_)

{'alpha': 0.025125628240201005}


In [26]:
help(GridSearchCV)

Help on class GridSearchCV in module sklearn.model_selection._search:

class GridSearchCV(BaseSearchCV)
 |  GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
 |  
 |  Exhaustive search over specified parameter values for an estimator.
 |  
 |  Important members are fit, predict.
 |  
 |  GridSearchCV implements a "fit" and a "score" method.
 |  It also implements "score_samples", "predict", "predict_proba",
 |  "decision_function", "transform" and "inverse_transform" if they are
 |  implemented in the estimator used.
 |  
 |  The parameters of the estimator used to apply these methods are optimized
 |  by cross-validated grid-search over a parameter grid.
 |  
 |  Read more in the :ref:`User Guide <grid_search>`.
 |  
 |  Parameters
 |  ----------
 |  estimator : estimator object
 |      This is assumed to implement the scikit-learn estimator interface.
 |      Either est

## Binary classification

### Logistic regression

$y \in \{0, 1\}$

$b(x) = \sigma(\langle w,x \rangle)$,

where $\sigma(z) = \frac{1}{1 + e^{-z}}$

That is, we predict $P(y_i = 1| X_i)$, the probability that the observation belongs to class 1

We train by maximizing the log-likelihood (a flashback from statistics), which is the same as minimizing the negative log-likelihood:

$Q(w) = -\sum_{i=1}^{n}(y_i \log(b(x_i)) + (1 - y_i)\log(1 - b(x_i))) \rightarrow \min_w$
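The sigmoid and the log-likelihood functional are easy to write directly in NumPy. The sketch below assumes labels coded as 0/1, which is what this form of the functional requires; the weights and the data are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(y, p):
    # y holds 0/1 labels, p holds predicted probabilities of class 1
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.array([1.0, -2.0])                      # made-up weights
X_toy = np.array([[0.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 1.0]])
y_toy = np.array([1, 1, 0])

p = sigmoid(X_toy @ w)                         # scores are 0, 2 and -2
loss = neg_log_likelihood(y_toy, p)
print(p, loss)
```

A score of 0 maps to probability 0.5, and the loss grows quickly when the model is confident and wrong.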



In [29]:
import pandas as pd
import numpy as np
import seaborn as sns

In [30]:
np.random.seed(42)

In [31]:
data = pd.read_csv('bike_buyers_clean.csv')

In [32]:
data

Unnamed: 0,ID,Marital Status,Gender,Income,Children,Education,Occupation,Home Owner,Cars,Commute Distance,Region,Age,Purchased Bike
0,12496,Married,Female,40000,1,Bachelors,Skilled Manual,Yes,0,0-1 Miles,Europe,42,No
1,24107,Married,Male,30000,3,Partial College,Clerical,Yes,1,0-1 Miles,Europe,43,No
2,14177,Married,Male,80000,5,Partial College,Professional,No,2,2-5 Miles,Europe,60,No
3,24381,Single,Male,70000,0,Bachelors,Professional,Yes,1,5-10 Miles,Pacific,41,Yes
4,25597,Single,Male,30000,0,Bachelors,Clerical,No,0,0-1 Miles,Europe,36,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,23731,Married,Male,60000,2,High School,Professional,Yes,2,2-5 Miles,North America,54,Yes
996,28672,Single,Male,70000,4,Graduate Degree,Professional,Yes,0,2-5 Miles,North America,35,Yes
997,11809,Married,Male,60000,2,Bachelors,Skilled Manual,Yes,0,0-1 Miles,North America,38,Yes
998,19664,Single,Male,100000,3,Bachelors,Management,No,3,1-2 Miles,North America,38,No


# Data overview

In [33]:
# check the column types in the dataset
data.dtypes

ID                   int64
Marital Status      object
Gender              object
Income               int64
Children             int64
Education           object
Occupation          object
Home Owner          object
Cars                 int64
Commute Distance    object
Region              object
Age                  int64
Purchased Bike      object
dtype: object

In [34]:
# all columns except the target; drop ID without modifying a slice of data in place
X = data.iloc[:,:-1].drop(columns = 'ID')

y = data['Purchased Bike']

In [35]:
X.head()

Unnamed: 0,Marital Status,Gender,Income,Children,Education,Occupation,Home Owner,Cars,Commute Distance,Region,Age
0,Married,Female,40000,1,Bachelors,Skilled Manual,Yes,0,0-1 Miles,Europe,42
1,Married,Male,30000,3,Partial College,Clerical,Yes,1,0-1 Miles,Europe,43
2,Married,Male,80000,5,Partial College,Professional,No,2,2-5 Miles,Europe,60
3,Single,Male,70000,0,Bachelors,Professional,Yes,1,5-10 Miles,Pacific,41
4,Single,Male,30000,0,Bachelors,Clerical,No,0,0-1 Miles,Europe,36


In [36]:
y.head()

0     No
1     No
2     No
3    Yes
4    Yes
Name: Purchased Bike, dtype: object

In [37]:
num_cols = X.columns[X.dtypes == 'int64'].tolist()
cat_cols = X.columns[X.dtypes == 'object']

print(f"We have {len(num_cols)} numeric columns: {', '.join(num_cols)}")
print(f"And {len(cat_cols)} categorical columns: {', '.join(cat_cols)}")

We have 4 numeric columns: Income, Children, Cars, Age
And 7 categorical columns: Marital Status, Gender, Education, Occupation, Home Owner, Commute Distance, Region


In [38]:
for col in cat_cols:
    print(col)
    display(X[col].value_counts(normalize = True))
    print()

Marital Status


Marital Status
Married    0.539
Single     0.461
Name: proportion, dtype: float64


Gender


Gender
Male      0.509
Female    0.491
Name: proportion, dtype: float64


Education


Education
Bachelors              0.306
Partial College        0.265
High School            0.179
Graduate Degree        0.174
Partial High School    0.076
Name: proportion, dtype: float64


Occupation


Occupation
Professional      0.276
Skilled Manual    0.255
Clerical          0.177
Management        0.173
Manual            0.119
Name: proportion, dtype: float64


Home Owner


Home Owner
Yes    0.685
No     0.315
Name: proportion, dtype: float64


Commute Distance


Commute Distance
0-1 Miles     0.366
5-10 Miles    0.192
1-2 Miles     0.169
2-5 Miles     0.162
10+ Miles     0.111
Name: proportion, dtype: float64


Region


Region
North America    0.508
Europe           0.300
Pacific          0.192
Name: proportion, dtype: float64




In [39]:
# we have categorical variables of different kinds!

binary_cols = cat_cols[X[cat_cols].nunique() == 2].tolist()
ordinal_cols = ['Commute Distance', 'Education']
cat_cols = cat_cols.difference(binary_cols + ordinal_cols).tolist()

In [40]:
cat_cols

['Occupation', 'Region']

In [41]:
for col in num_cols:
    print(col)
    display(X[col].describe())
    print()

Income


count      1000.000000
mean      56140.000000
std       31081.609779
min       10000.000000
25%       30000.000000
50%       60000.000000
75%       70000.000000
max      170000.000000
Name: Income, dtype: float64


Children


count    1000.000000
mean        1.908000
std         1.626094
min         0.000000
25%         0.000000
50%         2.000000
75%         3.000000
max         5.000000
Name: Children, dtype: float64


Cars


count    1000.000000
mean        1.452000
std         1.124705
min         0.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         4.000000
Name: Cars, dtype: float64


Age


count    1000.000000
mean       44.190000
std        11.353537
min        25.000000
25%        35.000000
50%        43.000000
75%        52.000000
max        89.000000
Name: Age, dtype: float64




In [42]:
X.describe()

Unnamed: 0,Income,Children,Cars,Age
count,1000.0,1000.0,1000.0,1000.0
mean,56140.0,1.908,1.452,44.19
std,31081.609779,1.626094,1.124705,11.353537
min,10000.0,0.0,0.0,25.0
25%,30000.0,0.0,1.0,35.0
50%,60000.0,2.0,1.0,43.0
75%,70000.0,3.0,2.0,52.0
max,170000.0,5.0,4.0,89.0


In [43]:
# classes are balanced !
y.value_counts(normalize=True)

Purchased Bike
No     0.519
Yes    0.481
Name: proportion, dtype: float64

In [44]:
y.head()

0     No
1     No
2     No
3    Yes
4    Yes
Name: Purchased Bike, dtype: object

In [45]:
# transform y to numeric column
y = (y == 'Yes').astype(int)
y

0      0
1      0
2      0
3      1
4      1
      ..
995    1
996    1
997    1
998    0
999    1
Name: Purchased Bike, Length: 1000, dtype: int64

# Data preparation

## Encoding categorical features

In [46]:
# run if not installed yet

#!pip install category_encoders

In [47]:
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.one_hot import OneHotEncoder
from category_encoders.target_encoder import TargetEncoder

In [48]:
X['Education'].unique()

array(['Bachelors', 'Partial College', 'High School',
       'Partial High School', 'Graduate Degree'], dtype=object)

In [49]:
# Ordinal: from categories to numbers

ord_enc = OrdinalEncoder()
ord_enc.fit_transform(X['Education'])

Unnamed: 0,Education
0,1
1,2
2,2
3,1
4,1
...,...
995,3
996,5
997,1
998,1
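In the output above the codes follow the order in which categories first appear (Bachelors is 1, Partial College is 2, and so on), which does not reflect the level of education. For a truly ordinal feature it is usually better to set the order explicitly. A pandas-only sketch; the ordering itself is an assumption made here, since the dataset does not state it:

```python
import pandas as pd

education = pd.Series(['Bachelors', 'Partial College', 'High School',
                       'Partial High School', 'Graduate Degree'], name='Education')

# Assumed ordering from the lowest to the highest level
education_order = ['Partial High School', 'High School', 'Partial College',
                   'Bachelors', 'Graduate Degree']
education_rank = {level: rank for rank, level in enumerate(education_order, start=1)}

education_encoded = education.map(education_rank)
print(education_encoded.tolist())
```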


In [50]:
# One hot: from k categories to k dummy columns

one_hot_enc = OneHotEncoder()

# all k columns are kept: passing drop = 'first' to fit_transform has no effect here
one_hot_enc.fit_transform(X['Education'])
# * fit -> determine the number of new columns (from the number of categories)
# * transform -> create the new columns
# * fit_transform = fit + transform

# Should we drop one of the columns after this encoding?

Unnamed: 0,Education_1,Education_2,Education_3,Education_4,Education_5
0,1,0,0,0,0
1,0,1,0,0,0
2,0,1,0,0,0
3,1,0,0,0,0
4,1,0,0,0,0
...,...,...,...,...,...
995,0,0,1,0,0
996,0,0,0,0,1
997,1,0,0,0,0
998,1,0,0,0,0
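On the question of dropping a column: the k dummy columns always sum to 1 in every row, so together with an intercept they are perfectly collinear. For an unregularized linear model one column is therefore usually dropped. A sketch with `pandas.get_dummies` (used here instead of `category_encoders`, as an illustration) on a few made-up rows:

```python
import pandas as pd

education = pd.Series(['Bachelors', 'Partial College', 'High School', 'Bachelors'],
                      name='Education')

full = pd.get_dummies(education, prefix='Education')                      # k columns
reduced = pd.get_dummies(education, prefix='Education', drop_first=True)  # k - 1 columns

print(full.sum(axis=1).tolist())   # every row sums to 1
print(reduced.columns.tolist())
```

The dropped category becomes the baseline: a row with zeros in all remaining columns belongs to it.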


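In its basic smoothed form, target encoding computes values with the formula

$$\frac{mean(target)\cdot n_{rows} + \alpha \cdot globalMean}{n_{rows} + \alpha} $$

where $mean(target)$ and $n_{rows}$ are the target mean and the number of rows within the category. Note that `TargetEncoder` from `category_encoders` blends the category mean and the global mean with a somewhat different, sigmoid-based weight, so its output need not match this formula exactly.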
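The formula can be reproduced by hand with a pandas `groupby`. This is a sketch of the formula itself on made-up data, not of the internals of any particular library:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'c'],
    'target':   [1,   1,   0,   0,   0,   1],
})
alpha = 1.0
global_mean = df['target'].mean()              # 3 / 6 = 0.5

# Per-category target mean and number of rows
stats = df.groupby('category')['target'].agg(['mean', 'count'])

# Smoothed mean: small categories are pulled toward the global mean
encoding = (stats['mean'] * stats['count'] + alpha * global_mean) / (stats['count'] + alpha)
df['category_te'] = df['category'].map(encoding)
print(encoding)
```

Category `c` has a single row with target 1, yet its encoded value is 0.75 rather than 1, because the global mean acts as one extra pseudo-observation.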

In [51]:
# target encoding: from k categories to posterior probabilities of y == 1 - P(y==1 | category == c1)

tgt_enc = TargetEncoder(smoothing=1)

# smoothing is the smoothing coefficient alpha: the larger it is, the stronger the regularization

tgt_enc.fit_transform(X['Education'], y)

Unnamed: 0,Education
0,0.552288
1,0.449057
2,0.449057
3,0.552288
4,0.552288
...,...
995,0.441341
996,0.540230
997,0.552288
998,0.552288


In [52]:
# the encoder can be applied to the whole DataFrame at once

tgt_enc = TargetEncoder(cols=['Education', 'Gender', 'Region'])
tgt_enc.fit_transform(X, y)

Unnamed: 0,Marital Status,Gender,Income,Children,Education,Occupation,Home Owner,Cars,Commute Distance,Region,Age
0,Married,0.486762,40000,1,0.552288,Skilled Manual,Yes,0,0-1 Miles,0.493333,42
1,Married,0.475442,30000,3,0.449057,Clerical,Yes,1,0-1 Miles,0.493333,43
2,Married,0.475442,80000,5,0.449057,Professional,No,2,2-5 Miles,0.493333,60
3,Single,0.475442,70000,0,0.552288,Professional,Yes,1,5-10 Miles,0.588542,41
4,Single,0.475442,30000,0,0.552288,Clerical,No,0,0-1 Miles,0.493333,36
...,...,...,...,...,...,...,...,...,...,...,...
995,Married,0.475442,60000,2,0.441341,Professional,Yes,2,2-5 Miles,0.433071,54
996,Single,0.475442,70000,4,0.540230,Professional,Yes,0,2-5 Miles,0.433071,35
997,Married,0.475442,60000,2,0.552288,Skilled Manual,Yes,0,0-1 Miles,0.433071,38
998,Single,0.475442,100000,3,0.552288,Management,No,3,1-2 Miles,0.433071,38


## Scaling numeric features

In [53]:
X['Income']

0       40000
1       30000
2       80000
3       70000
4       30000
        ...  
995     60000
996     70000
997     60000
998    100000
999     60000
Name: Income, Length: 1000, dtype: int64

In [54]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit_transform(X['Income'].values.reshape(-1,1))

array([[-0.51953796],
       [-0.8414326 ],
       [ 0.76804062],
       [ 0.44614598],
       [-0.8414326 ],
       [-1.48522189],
       [ 3.34319779],
       [-0.51953796],
       [-1.16332725],
       [-1.16332725],
       [-0.8414326 ],
       [ 1.08993527],
       [ 3.66509243],
       [-0.51953796],
       [ 0.12425133],
       [-1.48522189],
       [-0.8414326 ],
       [-0.8414326 ],
       [-0.51953796],
       [-1.16332725],
       [-0.51953796],
       [ 0.76804062],
       [-0.51953796],
       [ 0.76804062],
       [-0.51953796],
       [-0.8414326 ],
       [-0.8414326 ],
       [ 1.41182991],
       [ 0.44614598],
       [-1.16332725],
       [-1.16332725],
       [-1.48522189],
       [-1.16332725],
       [ 0.76804062],
       [ 1.08993527],
       [-1.48522189],
       [-1.48522189],
       [-0.8414326 ],
       [-1.16332725],
       [-1.48522189],
       [-0.8414326 ],
       [-0.51953796],
       [-1.48522189],
       [ 3.66509243],
       [-1.16332725],
       [-1

In [55]:
help(StandardScaler)

Help on class StandardScaler in module sklearn.preprocessing._data:

class StandardScaler(sklearn.base.OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  StandardScaler(*, copy=True, with_mean=True, with_std=True)
 |  
 |  Standardize features by removing the mean and scaling to unit variance.
 |  
 |  The standard score of a sample `x` is calculated as:
 |  
 |      z = (x - u) / s
 |  
 |  where `u` is the mean of the training samples or zero if `with_mean=False`,
 |  and `s` is the standard deviation of the training samples or one if
 |  `with_std=False`.
 |  
 |  Centering and scaling happen independently on each feature by computing
 |  the relevant statistics on the samples in the training set. Mean and
 |  standard deviation are then stored to be used on later data using
 |  :meth:`transform`.
 |  
 |  Standardization of a dataset is a common requirement for many
 |  machine learning estimators: they might behave badly if the
 |  individual feat
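The transformation $z = (x - u) / s$ is easy to verify by hand. The sketch below uses only the first five `Income` values shown above, so the numbers differ from the full-column output; note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

income = np.array([40000.0, 30000.0, 80000.0, 70000.0, 30000.0])

# Manual standardization: subtract the mean, divide by the standard deviation
manual = (income - income.mean()) / income.std()

# The same via sklearn (expects a 2D array, hence reshape)
scaled = StandardScaler().fit_transform(income.reshape(-1, 1)).ravel()

print(manual)
print(scaled)
```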

There are two problems:
- the StandardScaler class cannot, by itself, operate on only a subset of DataFrame columns
- sklearn classes return numpy arrays by default rather than pandas DataFrames, which is inconvenient

In [56]:
num_cols

['Income', 'Children', 'Cars', 'Age']

In [57]:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('scaler', StandardScaler(), num_cols)], remainder='passthrough') # 'drop'

In [58]:
ct.fit_transform(X)

array([[-0.5195379574051056, -0.5586728696623785, -1.2916513760469168,
        ..., 'Yes', '0-1 Miles', 'Europe'],
       [-0.8414326026375131, 0.6718841119728166, -0.4020843126537234,
        ..., 'Yes', '0-1 Miles', 'Europe'],
       [0.7680406235245242, 1.9024410936080116, 0.48748275073947, ...,
        'No', '2-5 Miles', 'Europe'],
       ...,
       [0.12425133305970927, 0.05660562115521903, -1.2916513760469168,
        ..., 'Yes', '0-1 Miles', 'North America'],
       [1.4118299139893389, 0.6718841119728166, 1.3770498141326635, ...,
        'No', '1-2 Miles', 'North America'],
       [0.12425133305970927, 0.6718841119728166, 0.48748275073947, ...,
        'Yes', '10+ Miles', 'North America']], dtype=object)

In [59]:
# no convenient implementation, so let's write our own!

from sklearn.base import TransformerMixin

class CustomScaler(TransformerMixin):
    def __init__(self, cols, scaler = None):
        self.cols = cols
        self.scaler = scaler or StandardScaler() 
    def fit(self, X, y = None):
        num_cols = X.copy()[self.cols]
        self.scaler.fit(num_cols)
        return self
    def transform(self, X, y=None):
        X_res = X.copy()
        num_cols_tr = self.scaler.transform(X_res[self.cols])
        for i, col in enumerate(self.cols):
            X_res[col] = num_cols_tr[:,i]
        return X_res

In [60]:
sc = CustomScaler(num_cols)
X2 = sc.fit_transform(X)

In [64]:
X.head()

Unnamed: 0,Marital Status,Gender,Income,Children,Education,Occupation,Home Owner,Cars,Commute Distance,Region,Age
0,Married,Female,40000,1,Bachelors,Skilled Manual,Yes,0,0-1 Miles,Europe,42
1,Married,Male,30000,3,Partial College,Clerical,Yes,1,0-1 Miles,Europe,43
2,Married,Male,80000,5,Partial College,Professional,No,2,2-5 Miles,Europe,60
3,Single,Male,70000,0,Bachelors,Professional,Yes,1,5-10 Miles,Pacific,41
4,Single,Male,30000,0,Bachelors,Clerical,No,0,0-1 Miles,Europe,36


In [63]:
X2.head()

Unnamed: 0,Marital Status,Gender,Income,Children,Education,Occupation,Home Owner,Cars,Commute Distance,Region,Age
0,Married,Female,-0.519538,-0.558673,Bachelors,Skilled Manual,Yes,-1.291651,0-1 Miles,Europe,-0.192988
1,Married,Male,-0.841433,0.671884,Partial College,Clerical,Yes,-0.402084,0-1 Miles,Europe,-0.104866
2,Married,Male,0.768041,1.902441,Partial College,Professional,No,0.487483,2-5 Miles,Europe,1.393214
3,Single,Male,0.446146,-1.173951,Bachelors,Professional,Yes,-0.402084,5-10 Miles,Pacific,-0.28111
4,Single,Male,-0.841433,-1.173951,Bachelors,Clerical,No,-1.291651,0-1 Miles,Europe,-0.721722


# Combining all data transformations into a pipeline

In [65]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

p1 = Pipeline([
    ('ordinal_encoder_', OrdinalEncoder(cols=ordinal_cols + binary_cols + cat_cols)), # bad!!! imposes an arbitrary order on unordered categories
    ('scaler_', CustomScaler(num_cols)),
    ('model_', LogisticRegression())
    ])

p2 = Pipeline([
    ('one_hot_encoder_', OneHotEncoder(cols=ordinal_cols + binary_cols+cat_cols)),
    ('scaler_', CustomScaler(num_cols)),
    ('model_', LogisticRegression())
    ])

p3 = Pipeline([
    ('target_encoder_', TargetEncoder(cols=ordinal_cols + binary_cols+cat_cols)),
    ('scaler_', CustomScaler(num_cols)),
    ('model_', LogisticRegression())
])

p4 = Pipeline([
    ('ordinal_encoder_', OrdinalEncoder(cols=ordinal_cols)),
    ('one_hot_encoder_', OneHotEncoder(cols=binary_cols+cat_cols)),
    ('scaler_', CustomScaler(num_cols)),
    ('model_', LogisticRegression())
    ])

p5 = Pipeline([
    ('ordinal_encoder_', OrdinalEncoder(cols=ordinal_cols)),
    ('one_hot_encoder_', OneHotEncoder(cols=binary_cols)),
    ('target_encoder_', TargetEncoder(cols=cat_cols)),
    ('scaler_', CustomScaler(num_cols)),
    ('model_', LogisticRegression())
])

p6 = Pipeline([
    ('one_hot_encoder_', OneHotEncoder(cols=binary_cols)),
    ('target_encoder_', TargetEncoder(cols=cat_cols + ordinal_cols)),
    ('scaler_', CustomScaler(num_cols)),
    ('model_', LogisticRegression())
])
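What a pipeline does on `fit` and `predict` can be reproduced by hand: every preprocessing step is fitted and applied in order, and the final estimator is fitted on the result. The sketch below illustrates this with only scikit-learn classes on synthetic data (`X_toy`, `y_toy` are hypothetical); the encoders used in the seminar pipelines are left out to keep it self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scaler_", StandardScaler()), ("model_", LogisticRegression())])
pipe.fit(X_toy, y_toy)

# The same steps done by hand
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)              # fit and apply the preprocessing step
model = LogisticRegression().fit(X_scaled, y_toy)   # fit the final estimator

# At prediction time the pipeline only calls transform, never fit
same = np.array_equal(pipe.predict(X_toy), model.predict(scaler.transform(X_toy)))
print(same)
```

The practical benefit is that scaling and encoding statistics are learned only from the data passed to `fit`, which matters once the pipeline is used inside cross-validation.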

In [66]:
# example of working with a pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y)

p1.fit(X_train, y_train)

#print(p1)

y_pred = p1.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.596
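The split above does not fix `random_state`, so the accuracy of a single run will vary from one execution to the next. A minimal sketch of a reproducible split, on hypothetical toy arrays (`X_toy`, `y_toy`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays, used only for illustration
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# The default test_size is 0.25; the same random_state gives the same split
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, random_state=42)

print(a_test.shape)                    # (5, 2)
print(np.array_equal(a_test, b_test))  # True
```

A single hold-out score is also a noisy estimate, which is one reason to average over several folds when comparing pipelines.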


# Comparing classification quality across different data transformation pipelines

There are quite a few metrics for binary classification (they will be covered in detail in the lectures)

For our task we will look at the simplest and most intuitive one: accuracy

$accuracy = \frac{1}{n}\sum_{i=1}^{n} [\hat y_i = y_i]$

That is, the share of correct predictions
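The formula can be checked by hand against `accuracy_score`. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 6 of the 8 predictions match
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_hat = np.array([1, 0, 0, 1, 1, 1, 0, 0])

# Mean of the indicator [y_hat_i == y_i]
manual = np.mean(y_hat == y_true)
print(manual, accuracy_score(y_true, y_hat))  # 0.75 0.75
```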

In [67]:
from sklearn.model_selection import cross_validate, cross_val_score
import warnings

warnings.filterwarnings('ignore')

In [68]:
for i, pipe in enumerate([p1, p2, p3, p4, p5, p6]):
    cv_res = cross_validate(pipe,
                            X,
                            y,
                            cv = 5,
                            scoring = 'accuracy'
                           )
    print(f"Pipeline {i + 1}: mean cv accuracy = {cv_res['test_score'].mean()}")

Pipeline 1: mean cv accuracy = 0.629
Pipeline 2: mean cv accuracy = 0.616
Pipeline 3: mean cv accuracy = 0.629
Pipeline 4: mean cv accuracy = 0.617
Pipeline 5: mean cv accuracy = 0.619
Pipeline 6: mean cv accuracy = 0.615
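To make explicit what `cv = 5` does, the loop below reproduces the cross-validation scores by hand. It is a sketch on synthetic data (`X_toy`, `y_toy` are hypothetical) and relies on the fact that, for a classifier and an integer `cv`, scikit-learn uses `StratifiedKFold` without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression()

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    # A fresh copy of the model is fitted on 4 folds and scored on the 5th
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))

auto_scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
print(np.allclose(manual_scores, auto_scores))
print(f"mean = {auto_scores.mean():.3f}, std = {auto_scores.std():.3f}")
```

The mean accuracies of the six pipelines above differ by at most about 0.014, so it is worth looking at the spread across folds as well before declaring one encoding better than another.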


You can read more about how to define the search space and which other hyperparameter optimization methods exist here:

https://scikit-learn.org/stable/modules/grid_search.html
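As a minimal sketch of what a search space looks like in practice, the example below runs a grid search over the regularization strength `C` of a logistic regression inside a pipeline. The data is synthetic and the grid values are arbitrary assumptions. The step names here deliberately have no trailing underscore, because parameters are addressed as `<step>__<parameter>` and a name such as `model_` would make that syntax ambiguous.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"model__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

Each candidate value is evaluated with the same 5-fold cross-validation used earlier, and `best_score_` is the mean validation accuracy of the best candidate.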