<a href="https://colab.research.google.com/github/psaw/hse-ai24-ml/blob/main/Boostings_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Flight Delays

In [2]:
!pip install catboost lightgbm optuna -q

In [3]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score
import pandas as pd

from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

In [4]:
RANDOM_STATE = 111
DATASET_PATH = 'https://raw.githubusercontent.com/evgpat/edu_stepik_practical_ml/main/datasets/flight_delays_train.csv'

In [5]:
data = pd.read_csv(DATASET_PATH)

X = data.drop('dep_delayed_15min', axis=1)
y = data['dep_delayed_15min'] == 'Y'

X.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732
1,c-4,c-20,c-3,1548,US,PIT,MCO,834
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423


Create a list of the column indices of the categorical features for the boosting models.

## Quiz
How long is the resulting list?
6
(hint: the `DepTime` column is numeric)

In [10]:
cat_features = X.select_dtypes(include=['object']).columns.tolist()
print(f"{len(cat_features)}: {cat_features}")

6: ['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest']
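
The task asks for column numbers, while the cell above collects column names. CatBoost's `cat_features` argument accepts either form when the data is a DataFrame, so the names work as they are. If positional indices are needed, they can be derived from the names. A minimal sketch, assuming a toy frame `X_demo` (a hypothetical stand-in built from the first two rows of `X.head()`) and selecting non-numeric columns rather than `object` columns specifically:

```python
import pandas as pd

# Toy frame with the same column order as the notebook's X (values copied from X.head()).
X_demo = pd.DataFrame({
    "Month": ["c-8", "c-4"],
    "DayofMonth": ["c-21", "c-20"],
    "DayOfWeek": ["c-7", "c-3"],
    "DepTime": [1934, 1548],
    "UniqueCarrier": ["AA", "US"],
    "Origin": ["ATL", "PIT"],
    "Dest": ["DFW", "MCO"],
    "Distance": [732, 834],
})

# Names of the non-numeric (string) columns.
cat_names = X_demo.select_dtypes(exclude=["number"]).columns.tolist()

# Positional index of each categorical column within the frame.
cat_indices = [int(X_demo.columns.get_loc(name)) for name in cat_names]
print(cat_indices)  # [0, 1, 2, 4, 5, 6]
```

Either list has length 6, which matches the quiz answer.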


Split the data into training and test sets.

In [11]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE)

In [12]:
Xtrain.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
41207,c-4,c-18,c-1,1457,CO,EWR,TPA,998
28283,c-11,c-1,c-2,1225,UA,DEN,BOS,1754
34619,c-6,c-16,c-5,1650,YV,IAD,CAE,401
8789,c-5,c-18,c-4,923,AA,SLC,DFW,988
38315,c-2,c-14,c-2,1839,AA,STL,SAN,1558


## Models with default parameters

Train CatBoost with default hyperparameters.

## Quiz
What is the ROC-AUC on the test data? Round the answer to two decimal places.

**Answer:** 0.77

In [None]:
# your code here

model_catboost = CatBoostClassifier(random_seed=RANDOM_STATE, verbose=1)
model_catboost.fit(Xtrain, ytrain, cat_features=cat_features)

y_pred_proba_catboost = model_catboost.predict_proba(Xtest)[:, 1]

roc_auc_catboost = roc_auc_score(ytest, y_pred_proba_catboost)
print(f"ROC-AUC on the test data: {roc_auc_catboost:.2f}")
# ROC-AUC on the test data: 0.77 (20.3s)
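
To make the metric itself concrete: ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all positive/negative pairs. The function name `roc_auc_pairwise` and the toy labels and scores are illustrative and not part of the notebook; the quadratic loop only suits small inputs, and `roc_auc_score` remains the tool for real data.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute ROC-AUC")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy data: two delayed flights (True) and two on-time flights (False).
y_true = [False, False, True, True]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(y_true, scores)
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75.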

Train LightGBM with default hyperparameters.

## Quiz
What is the ROC-AUC on the test data? Round the answer to two decimal places.

**Answer:** 0.73

In [14]:
# Cast string columns to the pandas 'category' dtype so that LightGBM treats them as categorical
for c in X.columns:
    col_type = X[c].dtype
    if col_type == 'object' or col_type.name == 'category':
        Xtrain[c] = Xtrain[c].astype('category')
        Xtest[c] = Xtest[c].astype('category')
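
The loop above converts the training and test frames independently, so each frame derives its own set of categories. As an optional precaution that the notebook does not take, both frames can be cast to one shared `CategoricalDtype`, which guarantees that the same string maps to the same integer code in both. A minimal sketch on toy frames; the names `train`, `test`, and `origin_dtype` are hypothetical:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy frames: the test frame contains an airport ("RDU") unseen in training.
train = pd.DataFrame({"Origin": ["ATL", "PIT", "DEN"]})
test = pd.DataFrame({"Origin": ["DEN", "RDU"]})

# One dtype built from the union of values, in a fixed (sorted) order.
all_values = sorted(set(train["Origin"]) | set(test["Origin"]))
origin_dtype = CategoricalDtype(categories=all_values)

train["Origin"] = train["Origin"].astype(origin_dtype)
test["Origin"] = test["Origin"].astype(origin_dtype)

# Identical strings now share identical codes across both frames.
train_codes = train["Origin"].cat.codes.tolist()
test_codes = test["Origin"].cat.codes.tolist()
print(train_codes, test_codes)  # [0, 2, 1] [1, 3]
```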

In [18]:
# your code here
model_lgbm = LGBMClassifier(random_state=RANDOM_STATE)
model_lgbm.fit(Xtrain, ytrain)

y_pred_proba_lgbm = model_lgbm.predict_proba(Xtest)[:, 1]

roc_auc_lgbm = roc_auc_score(ytest, y_pred_proba_lgbm)
print(f"ROC-AUC on the test data: {roc_auc_lgbm:.2f}")


[LightGBM] [Info] Number of positive: 14346, number of negative: 60654
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000570 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1095
[LightGBM] [Info] Number of data points in the train set: 75000, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.191280 -> initscore=-1.441714
[LightGBM] [Info] Start training from score -1.441714
ROC-AUC on the test data: 0.73


## Optuna

Set aside an additional validation set.

In [19]:
Xtrain_new, Xval, ytrain_new, yval = train_test_split(Xtrain, ytrain, test_size=0.25, random_state=RANDOM_STATE)
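
After this second split the data is divided three ways. The LightGBM log above reports 75000 training rows, which with `test_size=0.25` implies 100000 rows in total. The sketch below replays both splits on a plain index array to show the resulting sizes; `rows` is a hypothetical stand-in for the real DataFrame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_STATE = 111
rows = np.arange(100_000)  # stand-in for the dataset's row indices

# First split: 75% train, 25% test.
train_rows, test_rows = train_test_split(
    rows, test_size=0.25, random_state=RANDOM_STATE
)

# Second split: 25% of the training part becomes the validation set.
train_new_rows, val_rows = train_test_split(
    train_rows, test_size=0.25, random_state=RANDOM_STATE
)

sizes = (len(train_new_rows), len(val_rows), len(test_rows))
print(sizes)  # (56250, 18750, 25000)
```

Hyperparameters are therefore tuned on roughly 56% of the data and validated on about 19%, while the 25% test set stays untouched until the final evaluation.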

Create a function objective_lgbm that tunes the following hyperparameters:

* num_leaves = trial.suggest_int("num_leaves", 10, 100)
* n_estimators = trial.suggest_int("n_estimators", 10, 1000)

Find their optimal values by training LGBM on Xtrain_new, ytrain_new and evaluating quality (ROC-AUC) on Xval.

Use 30 Optuna trials.


In [None]:
# your code here
import optuna
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

def objective_lgbm(trial):
    num_leaves = trial.suggest_int("num_leaves", 10, 100)
    n_estimators = trial.suggest_int("n_estimators", 10, 1000)
    
    model = LGBMClassifier(num_leaves=num_leaves, n_estimators=n_estimators, random_state=RANDOM_STATE)
    model.fit(Xtrain_new, ytrain_new)
    
    y_pred_proba = model.predict_proba(Xval)[:, 1]
    roc_auc = roc_auc_score(yval, y_pred_proba)
    
    return roc_auc

study = optuna.create_study(direction="maximize")
study.optimize(objective_lgbm, n_trials=30)


In [21]:
study.best_params

{'num_leaves': 10, 'n_estimators': 218}

Train a model with the selected hyperparameters on Xtrain, ytrain and evaluate ROC-AUC on the test data.

In [22]:
# your code here
model_lgbm_optuna = LGBMClassifier(num_leaves=study.best_params['num_leaves'], n_estimators=study.best_params['n_estimators'], random_state=RANDOM_STATE)
model_lgbm_optuna.fit(Xtrain, ytrain)

y_pred_proba_lgbm_optuna = model_lgbm_optuna.predict_proba(Xtest)[:, 1]

roc_auc_lgbm_optuna = roc_auc_score(ytest, y_pred_proba_lgbm_optuna)
print(f"ROC-AUC on the test data: {roc_auc_lgbm_optuna:.2f}")


[LightGBM] [Info] Number of positive: 14346, number of negative: 60654
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000395 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1095
[LightGBM] [Info] Number of data points in the train set: 75000, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.191280 -> initscore=-1.441714
[LightGBM] [Info] Start training from score -1.441714
ROC-AUC on the test data: 0.73


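## Quiz

How many trees does the LGBM model have after hyperparameter tuning?

**Answer:** 218
However, this result is not stable: the Optuna study above is created without a fixed seed, so the best parameters can change between runs.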


## Improving the model

* Try tuning other hyperparameters with Optuna as well
* Also tune the hyperparameters of CatBoost (not only LightGBM)

In [23]:
# your code here

params_lgbm = {'boosting_type': 'dart', 'lambda_l1': 0.0011597368484330204, 'lambda_l2': 0.001167729124283521, 'num_leaves': 507, 'bagging_fraction': 0.9964049504013383, 'bagging_freq': 2, 'min_child_samples': 28, 'n_estimators': 426}

model_lgbm_params = LGBMClassifier(**params_lgbm, random_state=RANDOM_STATE)
model_lgbm_params.fit(Xtrain, ytrain)

y_pred_proba_lgbm_params = model_lgbm_params.predict_proba(Xtest)[:, 1]

roc_auc_lgbm_params = roc_auc_score(ytest, y_pred_proba_lgbm_params)
print(f"ROC-AUC on the test data: {roc_auc_lgbm_params:.2f}")


[LightGBM] [Info] Number of positive: 14346, number of negative: 60654
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000373 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1095
[LightGBM] [Info] Number of data points in the train set: 75000, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.191280 -> initscore=-1.441714
[LightGBM] [Info] Start training from score -1.441714
ROC-AUC on the test data: 0.74


## Quiz

Share your results!