## Description

The task is to build a model that predicts the churn-driver segment, i.e. drivers who will stop using the service.

To do this, train a model on the data from train.csv. Then, using your preprocessing and your model, predict for each Id in test.csv whether the driver belongs to the churn-driver segment (1 - belongs, 0 - does not belong).
Note that you must predict membership in the churn segment as such, without reference to a period (week).

## Evaluation

The evaluation metric for this competition is AUC (https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).

The results file must contain:
+ Id – driver identifier;
+ Predicted - class 1 or class 0, as determined by your model.
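
AUC depends only on how the predictions rank the drivers, so thresholding scores into hard 0/1 labels discards ranking information and can lower the metric. The sketch below illustrates this on made-up data (the ids, labels and scores are hypothetical) and shows the `Id`/`Predicted` layout of the results file. Whether the platform accepts real-valued scores in `Predicted` is an assumption that the description above does not confirm.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and model scores for six drivers.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

# AUC on raw scores vs. AUC on labels thresholded at 0.5.
auc_scores = roc_auc_score(y_true, scores)
auc_labels = roc_auc_score(y_true, (scores >= 0.5).astype(int))

# Results file layout: one row per driver.
submission = pd.DataFrame({'Id': np.arange(6), 'Predicted': scores})

print(auc_scores, auc_labels)
print(submission.head())
```

On this toy sample the thresholded labels create ties between positives and negatives, which is why their AUC comes out lower than the AUC of the raw scores.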

## Data

train.csv - the training set<br>
test.csv - the test set<br>
Sample_Submission.csv - a sample submission file in the correct format<br>

The train.csv file contains numerical data on the work of Uklon drivers over 4 weeks.
- Id – driver identifier;
- Week – week number (weeks are consecutive, 0 is the most recent);
- V1 - V22 and P1 - P27 – numerical data on the drivers' work in the corresponding period;
- Target – value of the target label (1 – churn, 0 – not churn).

The test.csv file contains:
- Id – driver identifier;
- Week – week number (weeks are consecutive, 0 is the most recent);
- V1 - V22 and P1 - P27 – data on the drivers' work in the corresponding period.

## Module importing

In [1]:
import pandas as pd
import numpy as np
import phik
import shap
import pickle

import lightgbm as lgb

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline, _fit_transform_one, _transform_one

from sklearn.metrics import confusion_matrix, roc_auc_score, classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.feature_selection import VarianceThreshold

import multiprocessing as mp
from joblib import Parallel, delayed

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [3]:
from pathlib import Path
from tqdm import tqdm
from functools import partial
from typing import List, Dict, Tuple, Union, Optional

In [4]:
import warnings
warnings.filterwarnings('ignore')

## Settings

In [5]:
PATH2DATA = Path('/kaggle/input/techuklon-int20h')
PATH2OUTPUT = Path('/kaggle/working')

In [6]:
ID = 'Id'
WEEK = 'Week'
TARGET = 'target'

service_columns = [ID, WEEK, TARGET]

SEED = 42
METRIC = 'auc'

## Data Loading

In [7]:
df_train = pd.read_csv(PATH2DATA / 'train.csv')
df_train.shape

In [8]:
df_test = pd.read_csv(PATH2DATA / 'test.csv')
df_test.shape

In [9]:
numerical_columns = df_train.drop(columns=service_columns).select_dtypes(include=['float', 'int']).columns.tolist()
assert len(numerical_columns)+3 == df_train.shape[1]

## Simple EDA & Hypothesis

- Hypothesis about what the **P** and **V** features might mean (*P - driver features, V - vehicle features ???*)
- Add an indicator of missing P features (*or simply store the count, e.g. that 6 of the P features are empty*)
- Add an indicator of missing V features
- Look for columns that may carry categorical information (*hours, days, binary columns*)
- Check the feature distributions; some may benefit from a transformation (*`np.log`, `np.exp`, etc.*)
- Think about which drivers should be dropped from the training set (*few features, all zeros*)
- Check the feature columns for outliers or suspicious values
- Check whether the training set overlaps with the test set (*some drivers from train may also be present in test, which might give us something*)
- Check whether every driver has exactly 4 weeks (*check train and test separately*). If a driver has fewer than 4 records, we assume they have only just joined Uklon; such drivers are most likely not prone to churn (*this can be double-checked by computing the target rate for them*)
- Lag features: take the previous weeks and compute the differences between features (*the hypothesis: if a driver's metrics gradually decline or gradually grow, the driver is prone to churn*)
- Ideas for aggregating over weeks (*since we need a prediction for each **Id**, not for each **Id-Week** pair*):
    - just take the mean - the simplest option overall
    - compute `avg`, `min`, `max`, `std`, `sum` - this inflates the number of features about 5 times (one column per statistic), but who cares (~~you only live once~~)
    - simply pivot the weeks into columns, like: `p1_week1`, `p1_week2`, ...., `p54_week1`, `p54_week2`, ...
- Maybe there will be ideas for features to add besides pairwise products
- There may be features that do not change within these 4 weeks, e.g. a city_id, which stays the same for a given driver, or a car_id, the id of their car (*not sure what this gives us, but we may want to be more careful with aggregations over them*)
- Think about a strategy for handling class imbalance (since the metric is ROC-AUC we do not have to worry too much, but I have 2 main ideas so far):
    - Undersample the majority class down to the size of the minority class N times (randomly sampling from the majority class each time). This yields N samples, on which we train N models and average the final scores. The result is a kind of ensemble
    - Play with class weights in the models to add penalties
- Think about how to validate the results. I have not dug deep into the data, but if the test set does not overlap with train, I think StratifiedKFold is the way to go
- Think about whether to run a homogeneity check (for the case when the feature distributions in train do not match the test data). The test set could have been cut from an entirely different period, where some features differ a lot.
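
The undersampling idea from the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, `N_MODELS` is a hypothetical setting, and `LogisticRegression` stands in for whatever model is actually used in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic imbalanced data (illustration only): roughly 10% positives.
n = 2000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 1.8).astype(int)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

N_MODELS = 5  # hypothetical number of undersampled models
scores = np.zeros(n)
for _ in range(N_MODELS):
    # Undersample the majority class down to the minority class size.
    sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, sampled_neg])
    model = LogisticRegression().fit(X[idx], y[idx])
    # Average the predicted probabilities over all models.
    scores += model.predict_proba(X)[:, 1] / N_MODELS

# In-sample AUC, so it is optimistic; a real check needs held-out data.
print(roc_auc_score(y, scores))
```

Each model sees all minority rows but a different random slice of the majority class, so averaging the scores reuses more of the majority data than a single undersampled model would.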

### Target

In [10]:
assert all(df_train.groupby(ID)[TARGET].count() == 4)

In [11]:
sns.countplot(x=df_train[TARGET]);

In [12]:
df_train[TARGET].value_counts(normalize=True)

### Train data

In [13]:
df_train.head()

In [14]:
df_train[numerical_columns].describe()

In [15]:
def simple_na_report(df: pd.DataFrame) -> pd.Series:
    # Share of missing values per column; only columns with at least one NaN are returned
    t = df.isnull().sum() / len(df)
    return t[t > 0]

simple_na_report(df_train)

In [16]:
sns.heatmap(df_train.isnull(), cbar=False);

In [17]:
# for col in categorical_columns:
#     if len(df_train[col].unique()) > 25:
#         continue
#     temp = df_train.groupby([col, TARGET]).count()[ID].reset_index()
#     fig = px.bar(temp, x=col, y=ID, color=TARGET, title=f"{col} - count of id`s",width=1000, height=500)
#     fig.show()
# del temp

In [18]:
# print('Columns with high cardinality:')
# high_cardinality_columns = []
# for col in categorical_columns:
#     unique_cnt = len(df_train[col].unique())
#     if unique_cnt > 25:
#         high_cardinality_columns.append(col)
#         print(f'- {col}: {unique_cnt} unique values')

In [19]:
# print('Columns which contain minor categories:')
# minor_cat_columns = []
# for col in categorical_columns:
#     minor_cnt = len([*filter(lambda x: x < 5e-2, df_train[col].value_counts(normalize=True).values)])
#     if minor_cnt > 0:
#         minor_cat_columns.append(col)
#         print(f'- {col}: {minor_cnt} minor categories')

### Test data

In [20]:
df_test.head()

In [21]:
df_test[numerical_columns].describe()

In [22]:
def simple_na_report(df: pd.DataFrame) -> pd.Series:
    # Share of missing values per column; only columns with at least one NaN are returned
    t = df.isnull().sum() / len(df)
    return t[t > 0]
simple_na_report(df_test)

## Data Preparation

In [32]:
class LogScaler:
    def __init__(self, eps: float = 0.0):
        self.eps = eps

    def transform(self, x) -> pd.DataFrame:
        return np.log10(x + self.eps)

    def inverse_transform(self, x) -> pd.DataFrame:
        return np.power(10, x) - self.eps
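
A quick round trip shows why a positive `eps` matters: with the default `eps=0.0`, a zero input maps to `-inf`, while a positive offset keeps the result finite. The sketch below uses plain NumPy calls that mirror `transform` and `inverse_transform`; the input values are made up.

```python
import numpy as np

x = np.array([0.0, 1.0, 90.0])
eps = 10  # the same offset that NumericalFeaturesGenerator passes to LogScaler

scaled = np.log10(x + eps)             # mirrors LogScaler.transform
restored = np.power(10, scaled) - eps  # mirrors LogScaler.inverse_transform

print(scaled)
print(restored)
```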

In [33]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, feature_names: List[str]):
        self.feature_names = feature_names 
    
    def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
        return self
    
    def transform(self, X: pd.DataFrame, y: Optional[pd.Series] = None) -> pd.DataFrame:
        return X[self.feature_names]

In [34]:
class FeatureDrop(BaseEstimator, TransformerMixin):
    
    def __init__(self, feature_names: List[str]):
        self.feature_names = feature_names 
    
    def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
        return self
    
    def transform(self, X: pd.DataFrame, y: Optional[pd.Series] = None) -> pd.DataFrame:
        return X.drop(columns=self.feature_names)

In [96]:
class NumericalFeaturesGenerator(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self.log_scaler = LogScaler(eps=10)
        
    def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
        return self
    
    def transform(self, X: pd.DataFrame, y: Optional[pd.Series] = None) -> pd.DataFrame:
        X_temp = X.copy()
        
        p=[col for col in numerical_columns if col.startswith('P')]
        v=[col for col in numerical_columns if col.startswith('V')]

        X_temp['P_isnull']=X_temp[[ID, WEEK]+p].isnull().sum(axis=1)
        X_temp['V_isnull']=X_temp[[ID, WEEK]+v].isnull().sum(axis=1)
        
        numerical_columns2=numerical_columns+['P_isnull', 'V_isnull']
        
        X_temp.loc[:, numerical_columns] = self.log_scaler.transform(X_temp[numerical_columns].fillna(0.0))
        
        df1 = X_temp.groupby([ID])[[ID]].max().reset_index(drop=True)

        for col in numerical_columns2:
            df1 = df1.merge(X_temp.groupby([ID])[col].max().reset_index().rename(columns={col:col+'_max'}), on=ID)
            df1 = df1.merge(X_temp.groupby([ID])[col].min().reset_index().rename(columns={col:col+'_min'}), on=ID)
            df1 = df1.merge(X_temp.groupby([ID])[col].mean().reset_index().rename(columns={col:col+'_mean'}), on=ID)
            df1 = df1.merge(X_temp.groupby([ID])[col].std().reset_index().rename(columns={col:col+'_std'}), on=ID)
            df1 = df1.merge(X_temp.groupby([ID])[col].sum().reset_index().rename(columns={col:col+'_sum'}), on=ID)
            df1 = df1.merge(X_temp.groupby([ID])[col].median().reset_index().rename(columns={col:col+'_median'}), on=ID)
            
        df2 = X_temp[X_temp[WEEK] == 0]
        df3 = X_temp[X_temp[WEEK] == 3][numerical_columns2+[ID]]
        
        df2.columns = [col+'_week_0' if col in numerical_columns2 else col for col in df2.columns]
        df3.columns = [col+'_week_3' if col in numerical_columns2 else col for col in df3.columns]
        X_temp=X_temp.set_index(ID)
        for i in range(1, 4):
            t=(X_temp[X_temp[WEEK] == 0][numerical_columns2] - X_temp[X_temp[WEEK] == i][numerical_columns2])
            t.columns=[col+f'_week_0_minus_{i}' for col in t.columns]
            df2=df2.merge(t.reset_index(), on=ID)
            
        for i in range(1, 4):
            t=(X_temp[X_temp[WEEK] == 0][numerical_columns2] / X_temp[X_temp[WEEK] == i][numerical_columns2])
            t.columns=[col+f'_week_0_div_{i}' for col in t.columns]
            df2=df2.merge(t.reset_index(), on=ID)
            
        for i in range(1, 4):
            t=(X_temp[X_temp[WEEK] == 0][numerical_columns2] * X_temp[X_temp[WEEK] == i][numerical_columns2])
            t.columns=[col+f'_week_0_mul_{i}' for col in t.columns]
            df2=df2.merge(t.reset_index(), on=ID)

        df1=df1.merge(df2, on=ID).merge(df3, on=ID).replace([np.inf, -np.inf], np.nan).fillna(0.0)
           
        return df1
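
The per-driver statistics built by the chain of merges above could probably also be computed in one pass with `groupby(...).agg(...)`. The sketch below is an alternative on a toy frame with hypothetical values, not the code this notebook runs; the resulting column order differs from the loop-based version.

```python
import pandas as pd

# Toy long-format frame: two drivers, two weeks each (hypothetical values).
df = pd.DataFrame({
    'Id':   [1, 1, 2, 2],
    'Week': [0, 1, 0, 1],
    'V1':   [1.0, 3.0, 2.0, 2.0],
    'P1':   [0.0, 4.0, 5.0, 1.0],
})

stats_to_compute = ['max', 'min', 'mean', 'std', 'sum', 'median']
agg = df.groupby('Id')[['V1', 'P1']].agg(stats_to_compute)

# Flatten the (column, statistic) MultiIndex into names like 'V1_max'.
agg.columns = [f'{col}_{stat}' for col, stat in agg.columns]
agg = agg.reset_index()

print(agg)
```

A single aggregation avoids repeating the group-by for every column and statistic, which tends to matter once the number of features grows.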

In [97]:
class SimpleImputer(BaseEstimator, TransformerMixin):
    
    def __init__(
        self,
        strategy: str = 'mean',
        fill_value: Optional[Union[str, int, float]] = None,
        missing_values: Optional[Union[str]] = None,
        columns=[],
    ):
        from sklearn.impute import SimpleImputer as SI
        self.strategy = strategy
        self.fill_value = fill_value
        self.missing_values = missing_values
        self.imputer = SI(strategy=self.strategy, fill_value=self.fill_value,)
    
    def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
        self.imputer.fit(X)
        return self
    
    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        idx = X.index
        result = pd.DataFrame(self.imputer.transform(X), columns=X.columns)
        result['id'] = idx
        return result.set_index('id')

In [98]:
numerical_pipeline = Pipeline(steps=[
    ('num_features_generator', NumericalFeaturesGenerator()),
#     ('imputer', SimpleImputer(strategy='constant', missing_values=np.nan, fill_value=0)),
])

In [99]:
# X_train, X_val, y_train, y_val = train_test_split(
#     df_train.drop(columns=[TARGET]), df_train[TARGET], test_size=0.2,
#     random_state=SEED, shuffle=True, stratify=df_train[TARGET]
# )

In [100]:
# test_index = df_train[ID][: int(len(df_train[ID]) * 0.2)]
# train_index = df_train[ID][int(len(df_train[ID]) * 0.2):]

In [145]:
numerical_pipeline.fit(df_train);
X_train = numerical_pipeline.transform(df_train).reset_index(drop=True)
y_train = X_train[TARGET].values
X_train.drop([WEEK, TARGET],axis=1, inplace=True)
X_train = X_train.set_index(ID)
X_train.shape

In [146]:
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2,
    random_state=SEED, shuffle=True, stratify=y_train,
)
X_train.shape, X_val.shape
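
A single hold-out split gives one AUC estimate; the StratifiedKFold validation mentioned in the hypotheses yields one estimate per fold and so a rough idea of their spread. A minimal sketch, assuming synthetic data and a `LogisticRegression` stand-in rather than the notebook's actual features and model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic binary problem (illustration only).
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(size=500) > 1.0).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_aucs = []
for train_idx, val_idx in skf.split(X, y):
    # Fit on the training part of the fold, score the held-out part.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    val_scores = model.predict_proba(X[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[val_idx], val_scores))

print(np.mean(fold_aucs), np.std(fold_aucs))
```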

## Feature Engineering

### 1. KNN Features (optional)
+ it may contain data leak

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import NearestNeighbors
from multiprocessing import Pool
from sklearn.preprocessing import StandardScaler as SS

from tqdm import tqdm

In [None]:
class StandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.scaler = SS()
    
    def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
        self.scaler.fit(X)
        return self
    
    def transform(self, X: pd.DataFrame, y: Optional[pd.Series] = None) -> pd.DataFrame:
        idx = X.index
        result = pd.DataFrame(self.scaler.transform(X), columns=X.columns)
        result['id'] = idx
        return result.set_index('id')

In [None]:
class NearestNeighborsFeats(BaseEstimator, TransformerMixin):
    '''
        This class should implement KNN features extraction 
    '''
    def __init__(self, k_list=[3, 8, 32], metric='cosine', n_jobs=-1, n_classes=None, n_neighbors=None, eps=1e-6):
        self.n_jobs = n_jobs
        self.k_list = k_list
        self.metric = metric
        
        if n_neighbors is None:
            self.n_neighbors = max(k_list) 
        else:
            self.n_neighbors = n_neighbors
            
        self.eps = eps        
        self.n_classes_ = n_classes
    
    def fit(self, X, y):
        '''
            Sets up the train set and self.NN object
        '''
        # Create a NearestNeighbors (NN) object. We will use it in the `transform` method
        self.NN = NearestNeighbors(n_neighbors=max(self.k_list), 
                                      metric=self.metric, 
                                      n_jobs=1, 
                                      algorithm='brute' if self.metric=='cosine' else 'auto')
        self.NN.fit(X)
        
        # Store labels 
        self.y_train = y
        
        # Save how many classes we have
        self.n_classes = np.unique(y).shape[0] if self.n_classes_ is None else self.n_classes_
        
        # Follow the scikit-learn convention so that calls can be chained
        return self
        
    def transform(self, X, y=None):       
        '''
            Produces KNN features for every object of a dataset X
        '''
        test_feats = []
        if self.n_jobs == 1:
            for idx in tqdm(X.index, position=0, leave=True):
                return_dict = self.get_features_for_one(X.loc[[idx]])
                test_feats.append(return_dict)
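        else:
            # n_jobs < 1 (e.g. -1) means "use all cores"; Pool rejects non-positive values
            processes = self.n_jobs if self.n_jobs >= 1 else None
            with Pool(processes=processes) as pool:
                async_results = [
                    pool.apply_async(self.get_features_for_one, (X.loc[[idx]],))
                    for idx in X.index
                ]
                test_feats = [res.get() for res in tqdm(async_results, position=0, leave=True)]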
        return pd.DataFrame(test_feats)
        
        
    def get_features_for_one(self, x):
        '''
            Computes KNN features for a single object `x`
        '''

        NN_output = self.NN.kneighbors(x)
        
        # Vector of size `n_neighbors`
        # Stores indices of the neighbors
        neighs = NN_output[1][0]
        
        # Vector of size `n_neighbors`
        # Stores distances to corresponding neighbors
        neighs_dist = NN_output[0][0] 

        # Vector of size `n_neighbors`
        # Stores labels of corresponding neighbors
        neighs_y = self.y_train.iloc[neighs] 
        
        # We will accumulate the computed features here
        # Eventually it will be a list of lists or np.arrays
        # and we will use np.hstack to concatenate those
        return_dict = {}
        
        
        ''' 
            1. Fraction of objects of every class.
               These are basically a KNN classifier's predictions.
        '''
        for k in self.k_list:
            feats = np.bincount(neighs_y[:k],minlength=self.n_classes)
            feats  = feats / feats.sum()
            
            assert len(feats) == self.n_classes
            for c in range(self.n_classes):
                return_dict[f'knn_f1_k_{k}_n_{c}_{self.metric}'] = feats[c]
        
        
        '''
            2. Minimum distance to objects of each class
               Find the first instance of a class and take its distance as features.
               
               If there are no neighboring objects of some classes, 
               Then set distance to that class to be 999.
        '''
        feats = []
        for c in range(self.n_classes):
            feat = neighs_dist[neighs_y == c][0] if (neighs_y == c).sum() > 0 else 999
            feats.append(feat)
            return_dict[f'knn_f2_n_{c}_{self.metric}'] = feat
        
        assert len(feats) == self.n_classes
        
        '''
            3. Minimum *normalized* distance to objects of each class
               As 2. but we normalize (divide) the distances
               by the distance to the closest neighbor.
               
               If there are no neighboring objects of some classes, 
               Then set distance to that class to be 999.
        '''
        feats = []
        for c in range(self.n_classes):
            feat = neighs_dist[neighs_y == c][0] if (neighs_y == c).sum() > 0 else 999
            if feat!= 999:
                feat = feat / (self.eps + neighs_dist[0])
            feats.append(feat)
            return_dict[f'knn_f3_n_{c}_{self.metric}'] = feat
        
        assert len(feats) == self.n_classes
        
        
        '''
            4. 
               4.1 Distance to Kth neighbor
                   Think of this as of quantiles of a distribution
               4.2 Distance to Kth neighbor normalized by 
                   distance to the first neighbor
        '''
        for k in self.k_list:
            
            feat_41 = neighs_dist[k-1]
            feat_42 = neighs_dist[k-1] / (neighs_dist[0] + self.eps)
            
            return_dict[f'knn_f41_k_{k}_{self.metric}'] = feat_41
            return_dict[f'knn_f42_k_{k}_{self.metric}'] = feat_42
        
        '''
            5. Mean distance to neighbors of each class for each K from `k_list` 
               For each class select the neighbors of that class among K nearest neighbors 
               and compute the average distance to those objects

               If there are no objects of a certain class among K neighbors, set mean distance to 999
        '''
        for k in self.k_list:
            numerator = np.zeros(self.n_classes)
            denominator = np.full(self.n_classes, self.eps)
            t = neighs_y[:k].max() + 1
            numerator[:t] = np.bincount(neighs_y[:k], weights=neighs_dist[:k])
            denominator[:t] = self.eps + np.bincount(neighs_y[:k])
            feats = np.where(numerator>0, numerator/denominator, 999)
            
            assert len(feats) == self.n_classes
            for c in range(self.n_classes):
                return_dict[f'knn_f5_k_{k}_n_{c}_{self.metric}'] = feats[c]
        
        return return_dict
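
To make the first feature family concrete, here is a minimal sketch of the "fraction of each class among the k nearest neighbours" computation on made-up 2D points. It uses `sklearn.neighbors.NearestNeighbors` directly rather than the class above. It also hints at the leak noted at the top of this section: if a row is transformed by a model fitted on that same row, its nearest neighbour is itself and carries its own label, which is why the training features further down are built out-of-fold.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical fitted points: three of class 0 near the origin, two of class 1 far away.
X_fit = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_fit = np.array([0, 0, 0, 1, 1])

nn = NearestNeighbors(n_neighbors=3, metric='minkowski').fit(X_fit)

# Query a new point that was not part of the fitted set.
dist, idx = nn.kneighbors(np.array([[4.5, 5.0]]))

# Labels of the 3 nearest neighbours and the share of each class among them.
neigh_labels = y_fit[idx[0]]
class_fractions = np.bincount(neigh_labels, minlength=2) / len(neigh_labels)

print(dist[0], neigh_labels, class_fractions)
```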

In [None]:
best_features_for_knn = best_50_features_baseline+filtered_features_1std
len(best_features_for_knn)

In [None]:
knnf_cos = NearestNeighborsFeats(n_jobs=1, metric='cosine')
knnf_min = NearestNeighborsFeats(n_jobs=1, metric='minkowski')

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
knnf_cos.fit(X_train, y_train)
knnf_res_cos_test = knnf_cos.transform(X_test)

knnf_res_cos_test[ID] = df_test.loc[:, ID].values

In [None]:
knnf_min.fit(X_train, y_train)
knnf_res_mink_test = knnf_min.transform(X_test)

knnf_res_mink_test[ID] = df_test.loc[:, ID].values

In [None]:
df_test = df_test.merge(knnf_res_mink_test, on=ID, how='left')
df_test = df_test.merge(knnf_res_cos_test, on=ID, how='left')
df_test.shape

In [None]:
skf = StratifiedKFold(n_splits=10, random_state=SEED, shuffle=True)

knn_features_list_cos = []
for train_index, test_index in tqdm(skf.split(X=df_train, y=df_train[TARGET]), position=0, leave=True):
    X_train_fold = df_train.loc[train_index, best_features_for_knn]
    y_train_fold = df_train.loc[train_index, TARGET]

    X_test_fold = df_train.loc[test_index, best_features_for_knn]
    y_test_fold = df_train.loc[test_index, TARGET]
    
    print('Train/Val shapes:')
    print((X_train_fold.shape, y_train_fold.shape), (X_test_fold.shape, y_test_fold.shape))
    
    print('Train/Val bad rates:')
    print(y_train_fold.mean(), y_test_fold.mean())
    
    scaler = StandardScaler()
    scaler.fit(X_train_fold)
    X_train_fold = scaler.transform(X_train_fold)
    X_test_fold = scaler.transform(X_test_fold)
    
    knnf = NearestNeighborsFeats(n_jobs=1, n_neighbors=8, metric='cosine')
    knnf.fit(X_train_fold, y_train_fold)
    knnf_fold_res = knnf.transform(X_test_fold)
    
    knnf_fold_res[ID] = df_train.loc[test_index, ID].values
    
    knn_features_list_cos.append(knnf_fold_res.copy())

In [None]:
skf = StratifiedKFold(n_splits=10, random_state=SEED, shuffle=True)

knn_features_list_mink = []
for train_index, test_index in tqdm(skf.split(X=df_train, y=df_train[TARGET]), position=0, leave=True):
    X_train_fold = df_train.loc[train_index, best_features_for_knn]
    y_train_fold = df_train.loc[train_index, TARGET]

    X_test_fold = df_train.loc[test_index, best_features_for_knn]
    y_test_fold = df_train.loc[test_index, TARGET]

    print('Train/Val shapes:')
    print((X_train_fold.shape, y_train_fold.shape), (X_test_fold.shape, y_test_fold.shape))
    
    print('Train/Val bad rates:')
    print(y_train_fold.mean(), y_test_fold.mean())
    
    scaler = StandardScaler()
    scaler.fit(X_train_fold)
    X_train_fold = scaler.transform(X_train_fold)
    X_test_fold = scaler.transform(X_test_fold)
    
    knnf = NearestNeighborsFeats(n_jobs=1, n_neighbors=8, metric='minkowski')
    knnf.fit(X_train_fold, y_train_fold)
    knnf_fold_res = knnf.transform(X_test_fold)
    
    knnf_fold_res[ID] = df_train.loc[test_index, ID].values
    
    knn_features_list_mink.append(knnf_fold_res.copy())

In [None]:
df_train.shape

In [None]:
knn_features_mink_df = pd.concat(knn_features_list_mink)
knn_features_cos_df = pd.concat(knn_features_list_cos)
knn_features_cos_df.shape, knn_features_mink_df.shape

In [None]:
assert knn_features_cos_df[knn_features_cos_df[ID].notna()].shape[0] == knn_features_mink_df[knn_features_mink_df[ID].notna()].shape[0]

In [None]:
df_train = df_train.merge(knn_features_mink_df, on=ID, how='left')
df_train = df_train.merge(knn_features_cos_df, on=ID, how='left')
df_train.shape

In [None]:
knn_features_columns = knn_features_mink_df.columns.to_list() + knn_features_cos_df.columns.to_list()
knn_features_columns = [col for col in knn_features_columns if col != ID]

In [None]:
def get_correlated_columns(corr_matrix: pd.DataFrame, threshold: float) -> List[str]:
    col_corr = set()
    
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (abs(corr_matrix.iloc[i, j]) >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
    return list(col_corr)

In [None]:
knn_pearson_corr = df_train[knn_features_columns].corr()
knn_pearson_corr_columns = get_correlated_columns(knn_pearson_corr, threshold=0.85)
len(knn_pearson_corr_columns)

In [None]:
knn_features_columns = [col for col in knn_features_columns if col not in knn_pearson_corr_columns]
# base_columns += knn_features_columns

## Feature Selection

In [147]:
base_columns = np.asarray(X_train.columns.tolist())
len(base_columns)

### 1. Remove quasi-constant features

In [148]:
qconstant_filter = VarianceThreshold(threshold=0.00001)
qconstant_filter.fit(X_train[base_columns])
qconstant_columns = [col for col in base_columns if col not in base_columns[qconstant_filter.get_support()]]
len(qconstant_columns)
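
As a small illustration of what this filter drops, the sketch below applies `VarianceThreshold` with the same threshold to a toy frame (hypothetical values). The selector keeps only columns whose variance is strictly above the threshold.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [5.0, 5.0, 5.0, 5.0],    # constant
    'c': [0.0, 0.0, 0.0, 0.001],  # quasi-constant: variance well below 1e-5
})

selector = VarianceThreshold(threshold=1e-5).fit(df)

# get_support() returns a boolean mask over the input columns.
kept = df.columns[selector.get_support()].tolist()
dropped = [col for col in df.columns if col not in kept]

print(kept, dropped)
```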

In [149]:
best_features = [col for col in base_columns if col not in qconstant_columns]
len(best_features)

### 2. Remove highly correlated features (Pearson)

In [150]:
def get_correlated_columns(corr_matrix: pd.DataFrame, threshold: float) -> List[str]:
    col_corr = set()
    
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (abs(corr_matrix.iloc[i, j]) >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
    return list(col_corr)
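
A toy run of this helper shows the behaviour: of two strongly correlated columns, the later one is marked for removal and the earlier one is kept. The frame below is hypothetical, and the function is copied in so that the block runs on its own.

```python
import pandas as pd

def get_correlated_columns(corr_matrix, threshold):
    # Local copy of the helper above so that this block is self-contained.
    col_corr = set()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) >= threshold and corr_matrix.columns[j] not in col_corr:
                col_corr.add(corr_matrix.columns[i])
    return list(col_corr)

toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with x
    'z': [1.0, -1.0, 1.0, -1.0, 1.0],  # uncorrelated with x
})

to_drop = get_correlated_columns(toy.corr(), threshold=0.92)
print(to_drop)
```

Because of the `not in col_corr` check, a column that correlates only with an already dropped column is itself kept, so the outcome depends on column order.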

In [151]:
pearson_corr = X_train[best_features].corr()

In [152]:
f, axs = plt.subplots(1, 1, figsize=(35, 15))

sns.heatmap(pearson_corr, center=0, square=True, cbar_kws={"shrink": .5}, ax=axs)
axs.set_title('Pearson correlation matrix');

In [153]:
pearson_corr_columns = get_correlated_columns(pearson_corr, threshold=0.92)
len(pearson_corr_columns)

In [154]:
best_features = [col for col in best_features if col not in pearson_corr_columns]
len(best_features)

### 3. Remove highly correlated features (Phik)

In [155]:
phik_corr = X_train[best_features].phik_matrix()

In [156]:
f, axs = plt.subplots(1, 1, figsize=(35, 15))

sns.heatmap(phik_corr, center=0, square=True, cbar_kws={"shrink": .5}, ax=axs)
axs.set_title('Phik correlation matrix');

In [157]:
phik_corr_columns = get_correlated_columns(phik_corr, threshold=0.90)
len(phik_corr_columns)

In [158]:
best_features = [col for col in best_features if col not in phik_corr_columns]
len(best_features)

### 4. Homogeneity transformer
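
The idea behind this step is to compare, feature by feature, the distribution seen in one sample with the distribution in another, and to flag features whose distributions differ. A minimal sketch of one such comparison with the two-sample Kolmogorov-Smirnov test from `scipy.stats`, assuming synthetic data: it shows a single test on a single feature, whereas the transformer below repeats several tests on subsamples and votes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# One feature in "train" and two hypothetical "test" variants of it.
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)
test_same = rng.normal(loc=0.0, scale=1.0, size=2000)     # same distribution
test_shifted = rng.normal(loc=0.5, scale=1.0, size=2000)  # mean shifted by 0.5

# H0: both samples are drawn from the same continuous distribution.
_, p_same = stats.ks_2samp(train_feature, test_same)
_, p_shifted = stats.ks_2samp(train_feature, test_shifted)

print(p_same, p_shifted)
```

A very small p-value for the shifted variant suggests the feature is not homogeneous between the two samples and may be a candidate for removal.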

In [159]:
from abc import ABC, abstractmethod, ABCMeta
from scipy import stats
from statsmodels.stats.multitest import multipletests
from sklearn.utils.multiclass import type_of_target
from typing import Iterable, Optional, List, Tuple
import itertools
import operator
import numpy as np
from sklearn.model_selection import KFold


class Transformer(ABC):

    @abstractmethod
    def fit(self, ds):
        pass

    @abstractmethod
    def transform(self, ds):
        pass


class FeatureSelectionTransformer(Transformer):
    def __init__(self):
        self.features_in = []
        self.features_out = []

    def fit(self, ds, ds_test=None):
        self.features_in = ds.columns.tolist()
        self.features_out = self.features_in.copy()
        return self

    def transform(self, ds):
        if len(self.features_out) > 0:
            return ds[self.features_out]
        else:
            return ds

    def __repr__(self):
        return "{0}".format(self.__class__.__name__) + (", Features={0}->{1}".format(len(self.features_in), len(self.features_out)) if len(self.features_in) > 0 and len(self.features_out) > 0 else "")


class HomogeneityThresholdTransformer(FeatureSelectionTransformer):

    def __init__(
            self,
            hypothesis_threshold: float = 0.85,
            p_val_threshold: float = 0.01,
            sample_size: int = 2000,
            n_repeat: int = 20,
            voting_threshold: float = 0.35,
            use_multiple_tests: bool = True,
            multiple_tests_method: str = 'bonferroni',
            check_hypothesis_per_target: bool = False,
            fold_threshold: float = 0.6,
            test_functions_cont: Optional[Iterable] = None,
            test_functions_cat: Optional[Iterable] = None,
            n_splits: int = 5,
            fold_method: ABCMeta = KFold,
            random_seed: int = 42,
    ):
        super(HomogeneityThresholdTransformer, self).__init__()
        self.hypothesis_threshold = hypothesis_threshold
        self.p_val_threshold = p_val_threshold
        self.sample_size = sample_size
        self.n_repeat = n_repeat
        self.voting_threshold = voting_threshold
        self.use_multiple_tests = use_multiple_tests
        self.multiple_tests_method = multiple_tests_method
        self.check_hypothesis_per_target = check_hypothesis_per_target
        self.fold_threshold = fold_threshold

        self.test_functions_cont = {
            HomogeneityThresholdTransformer.__ks_2samp_test,
            HomogeneityThresholdTransformer.__anderson_ksamp_test,
            HomogeneityThresholdTransformer.__mannwhitneyu_test,
        } if test_functions_cont is None else test_functions_cont

        self.test_functions_cat = {
            HomogeneityThresholdTransformer.__chisquare_test
        } if test_functions_cat is None else test_functions_cat
        
        self.n_splits = n_splits
        self.fold_method = fold_method
        self.random_seed = random_seed

    @staticmethod
    def __ks_2samp_test(samples):
        """
        This is a two-sided test for the null hypothesis that 2 independent samples
        are drawn from the same continuous distribution.
        """
        sample1, sample2 = samples[0], samples[1]
        statistic, p_val = stats.ks_2samp(sample1, sample2, mode='asymp', alternative='two-sided')
        return statistic, p_val

    @staticmethod
    def __mannwhitneyu_test(samples):
        """
        H0: The two populations are equal versus
        H1: The two populations are not equal.
        """
        sample1, sample2 = samples[0], samples[1]
        statistic, p_val = stats.mannwhitneyu(sample1, sample2, alternative='two-sided')
        return statistic, p_val

    @staticmethod
    def __anderson_ksamp_test(samples):
        """
        The k-sample Anderson-Darling test is a modification of the
        one-sample Anderson-Darling test. It tests the null hypothesis
        that k-samples are drawn from the same population without having
        to specify the distribution function of that population.
        """
        statistic, _, p_val = stats.anderson_ksamp(samples=samples)
        return statistic, p_val

    @staticmethod
    def __chisquare_test(samples):
        """
        Chi-square test of independence of variables in a contingency table.
        """

        sample1, sample2 = samples[0], samples[1]

        value_counts1 = dict(zip(*np.unique(sample1, return_counts=True)))
        value_counts2 = dict(zip(*np.unique(sample2, return_counts=True)))

        f1, f2 = [], []
        for category_name in value_counts1.keys():
            f1.append(value_counts1.get(category_name))
            f2.append(value_counts2.get(category_name, 0))

        statistic, p_val, _, _ = stats.chi2_contingency([f1, f2])
        return statistic, p_val

    def __is_pass_stat_test(self, test_fn, *args) -> Tuple[bool, bool]:
        if len(args) < 2:
            raise ValueError(f'Tuple `args` must contain at least 2 elements, but it contains {len(args)}')

        # prepare each sample
        samples = [feature_sample.dropna() for feature_sample in args]

        # if some sample has zero-length --> skip test computation
        skip_sampling = not all(list(map(bool, map(len, samples))))
        if skip_sampling:
            # is_pass = False, skip_flag = True
            return False, True

        p_vals = []
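        for repeat_id in range(self.n_repeat):
            # vary the seed per repeat (assumes an int or None seed);
            # a fixed seed would draw the same subsample on every repeat
            repeat_seed = None if self.random_seed is None else self.random_seed + repeat_id
            # get subsample for each sample
            prepared_samples = [feature_sample.sample(self.sample_size, replace=True, random_state=repeat_seed).values
                                for feature_sample in samples]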

            # check that all numbers aren't identical
            identical_numbers_ind = any([len(np.unique(feature_sample)) < 2 for feature_sample in prepared_samples])

            if not identical_numbers_ind:
                # compute test
                _, p_val = test_fn(prepared_samples)
                p_vals.append(p_val)

        # if some sample has identical numbers --> skip test computation
        if len(p_vals) == 0:
            # is_pass = False, skip_flag = True
            return False, True

        if self.use_multiple_tests:
            # true for hypothesis that can be rejected for given p_val_threshold
            reject, _, _, _ = multipletests(p_vals, alpha=self.p_val_threshold, method=self.multiple_tests_method, )
            result = np.mean(reject) < self.hypothesis_threshold
        else:
            result = ((np.array(p_vals) <= self.p_val_threshold).sum() / len(p_vals)) < self.hypothesis_threshold

        # is_pass = True/False, skip_flag = False
        return result, False

    def __feature_test(self, train, test, features_cont, features_cat) -> List[Tuple[str, bool]]:
        def __test(features, test_functions):
            features_out = []
            for feature in tqdm(features, position=0, leave=True, desc='features progress'):
                passed_tests_cnt = 0.0
                not_skipped_test_cnt = 0.0
                for test_fn in test_functions:
                    # save two indicators:
                    # 1. is_pass - train and test sets have same distribution for each test
                    # 2. skip_flag - we want to skip this test
                    is_pass, skip_flag = self.__is_pass_stat_test(
                        test_fn,
                        train[feature],
                        test[feature],
                    )
                    if not skip_flag:
                        passed_tests_cnt += is_pass
                        not_skipped_test_cnt += 1

                # save tuple (feature_name, skip_fold_flag)
                if not_skipped_test_cnt == 0:
                    features_out.append((feature, True))
                elif (passed_tests_cnt / not_skipped_test_cnt) > self.voting_threshold:
                    features_out.append((feature, False))
            return features_out

        features_out_cont = __test(features_cont, self.test_functions_cont)
        features_out_cat = __test(features_cat, self.test_functions_cat)

        return features_out_cont + features_out_cat

    def __feature_test_per_target(self, train, test, features_cont, features_cat, target_col) -> List[Tuple[str, bool]]:
        def __test(features, test_functions):
            features_out = []
            for feature in tqdm(features, position=0, leave=True, desc='features progress'):
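                # build a fresh [passed, not_skipped] counter per target value;
                # dict.fromkeys would share one list object between all keys
                passed_tests_result = {target_val: [0.0, 0.0] for target_val in unique_target_values}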
                for target_val in passed_tests_result.keys():
                    for test_fn in test_functions:
                        # save two indicators:
                        # 1. is_pass - train and test sets have same distribution for each test
                        # 2. skip_flag - we want to skip this test
                        is_pass, skip_flag = self.__is_pass_stat_test(
                            test_fn,
                            train.loc[train[target_col] == target_val, feature],
                            test.loc[test[target_col] == target_val, feature],
                        )
                        if not skip_flag:
                            passed_tests_result[target_val][0] += is_pass
                            passed_tests_result[target_val][1] += 1

                passed_tests_result = np.array(list(passed_tests_result.values()))
                # save tuple (feature_name, skip_fold_flag)
                if any(passed_tests_result[:, 1] == 0):
                    features_out.append((feature, True))
                elif all((passed_tests_result[:, 0] / passed_tests_result[:, 1]) > self.voting_threshold):
                    features_out.append((feature, False))
            return features_out

        unique_target_values = train[target_col].unique()
        features_out_cont = __test(features_cont, self.test_functions_cont)
        features_out_cat = __test(features_cat, self.test_functions_cat)

        return features_out_cont + features_out_cat

    @staticmethod
    def __accumulate(l):
        it = itertools.groupby(l, operator.itemgetter(0))
        for key, subiter in it:
            item_sum = 0
            item_cnt = 0
            for item in subiter:
                item_sum += item[1]
                item_cnt += 1
            yield key, item_sum, item_cnt

    def __check_hypothesis_by_fold(self, ds, features_cont, features_cat, target_col):
        folds_features_out = []
        kf = self.fold_method(n_splits=self.n_splits)
        features_all = features_cont+features_cat
        for train_index, test_index in tqdm(kf.split(ds), position=0, leave=True, desc='kfolds progress'):
            if self.check_hypothesis_per_target:
                folds_features_out.extend(
                    self.__feature_test_per_target(ds.iloc[train_index], ds.iloc[test_index], features_cont, features_cat, target_col)
                )
            else:
                folds_features_out.extend(
                    self.__feature_test(ds.iloc[train_index], ds.iloc[test_index], features_cont, features_cat)
                )

        f_names, f_skipped, f_counts = np.array(list(
            HomogeneityThresholdTransformer.__accumulate(sorted(folds_features_out))
        )).T
        f_skipped = f_skipped.astype(int)
        f_counts = f_counts.astype(int)

        f_counts = f_counts - f_skipped
        f_not_skipped = self.n_splits - f_skipped
        return f_names[((f_counts / f_not_skipped) > self.fold_threshold) | (f_not_skipped == 0)].tolist()

    def __check_hypothesis_by_train(self, ds, ds_test, features_cont, features_cat, target_col):
        train = ds
        test = ds_test

        if self.check_hypothesis_per_target:
            test_features_out = self.__feature_test_per_target(train, test,
                                                               features_cont, features_cat, target_col)
        else:
            test_features_out = self.__feature_test(train, test, features_cont, features_cat)

        f_names, _, _ = np.array(list(
            HomogeneityThresholdTransformer.__accumulate(test_features_out)
        )).T

        return f_names.tolist()

    def fit(self, ds, ds_test=None):
        super(HomogeneityThresholdTransformer, self).fit(ds=ds, ds_test=ds_test)

        self.features_out = []
        features_cont = ds.columns.tolist()
        features_cat = []

        target_col = [TARGET]
        if len(target_col) > 1:
            raise ValueError('Cannot deal with multidimensional target.')
        else:
            target_col = target_col[0]
        features_cont = [col for col in features_cont if col != target_col]

        if type_of_target(ds[target_col]) == 'continuous' and self.check_hypothesis_per_target:
            raise ValueError('It is not possible to check hypothesis for continuous target.')

        folds_features_out = self.__check_hypothesis_by_fold(ds, features_cont, features_cat, target_col)

        if ds_test is not None:
            test_features_out = self.__check_hypothesis_by_train(ds, ds_test, features_cont, features_cat, target_col)
            self.features_out = list(set(test_features_out).intersection(folds_features_out))
        else:
            self.features_out = folds_features_out
            
        self.failed_features = [col for col in features_cont if col not in self.features_out]

        return self
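
The fold-level vote in `__check_hypothesis_by_fold` is easy to lose among the NumPy transposes. The following is a minimal pure-Python sketch of the same bookkeeping: each fold emits `(feature, skipped_flag)` pairs, the pairs are grouped per feature, and a feature is kept when its share of passed folds exceeds `fold_threshold` or when it was never testable. The helper names `accumulate` and `select_features`, the example pairs, and `fold_threshold=0.7` are illustrative assumptions, not values taken from the notebook.

```python
import itertools
import operator


def accumulate(pairs):
    """Group (feature, skipped_flag) pairs by feature name.

    Yields (feature, n_skipped, n_seen). The input must be sorted by feature,
    because itertools.groupby only merges adjacent equal keys.
    """
    for key, group in itertools.groupby(pairs, operator.itemgetter(0)):
        flags = [flag for _, flag in group]
        yield key, sum(flags), len(flags)


def select_features(fold_results, n_splits, fold_threshold):
    """Keep features that passed in enough of the folds where they were tested."""
    kept = []
    for name, n_skipped, n_seen in accumulate(sorted(fold_results)):
        n_passed = n_seen - n_skipped      # appearances that were real passes
        n_tested = n_splits - n_skipped    # folds where the test could run
        if n_tested == 0 or n_passed / n_tested > fold_threshold:
            kept.append(name)
    return kept


# Hypothetical output of three folds: a feature is listed only if it passed
# (flag False) or if every test for it was skipped (flag True).
fold_results = [
    ("P1_max", False), ("V3_sum", False), ("P6_std", True),   # fold 0
    ("P1_max", False), ("P6_std", True),                      # fold 1
    ("P1_max", False), ("V3_sum", False), ("P6_std", True),   # fold 2
]
selected = select_features(fold_results, n_splits=3, fold_threshold=0.7)
print(selected)
```

Here `V3_sum` passes in only two of three folds (about 0.67), so it falls below the assumed threshold, while `P6_std` is kept because it could never be tested.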


In [161]:
htt = HomogeneityThresholdTransformer(check_hypothesis_per_target=True, random_seed=SEED)
X_train_temp = X_train.copy().reset_index(drop=True)[best_features]
X_train_temp[TARGET] = y_train
X_val_temp = X_val.copy().reset_index(drop=True)[best_features]
X_val_temp[TARGET] = y_val
htt.fit(ds=X_train_temp, ds_test=X_val_temp)

In [162]:
best_features = [col for col in htt.features_out if (col != TARGET) and (col in best_features)]
len(best_features)
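
For a single feature and a single statistical test, the decision inside `__is_pass_stat_test` (in the branch without multiple-testing correction) reduces to a vote over the p-values of the repeated subsamples. The sketch below restates that rule in plain Python; the function name and the threshold values are assumptions chosen for illustration, since the transformer's defaults are not shown in this section.

```python
def passes_repeated_test(p_values, p_val_threshold, hypothesis_threshold):
    """Return (is_pass, skip_flag) for the p-values of repeated subsample tests.

    The feature passes when the share of repeats that reject H0
    (p <= p_val_threshold) stays below hypothesis_threshold.
    """
    if not p_values:
        # no repeat produced a usable p-value, so the test is skipped
        return False, True
    rejected = sum(p <= p_val_threshold for p in p_values)
    return rejected / len(p_values) < hypothesis_threshold, False


# Hypothetical p-values from four repeats of one test
stable = passes_repeated_test([0.40, 0.03, 0.22, 0.61], 0.05, 0.5)
shifted = passes_repeated_test([0.01, 0.02, 0.20, 0.04], 0.05, 0.5)
skipped = passes_repeated_test([], 0.05, 0.5)
print(stable, shifted, skipped)
```

With these numbers the first feature rejects H0 in one repeat out of four and passes, the second rejects in three out of four and fails, and the third is flagged as skipped.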

In [163]:
# htt = HomogeneityThresholdTransformer(check_hypothesis_per_target=False, random_seed=SEED)
# X_train_temp = X_train.copy().reset_index(drop=True)
# X_train_temp[TARGET] = y_train
# htt.fit(ds=X_train_temp, ds_test=X_test)

In [164]:
# base_columns = [col for col in htt.features_out if col != TARGET]
# len(base_columns)

### 5. Feature Selection - permutation + top-10

In [165]:
class PermutationImportanceFeatureSelectionTransformer():
    
    def __init__(self, model, n_repeats=10, scoring=roc_auc_score, random_state=42, n_jobs=1):
        super(PermutationImportanceFeatureSelectionTransformer, self).__init__()
        self.model = model
        self.n_repeats = n_repeats
        self.scoring = scoring
        self.seed = random_state
        self.n_jobs = n_jobs
        self.importances = {}
        self.features_in = []
        self.features_out = []

    def fit(self, X, y):
        self.features_in = X.columns.tolist()
        self.features_out = self.features_in.copy()

        y_pred = self.model.predict(X)
        baseline_score = self.scoring(y, y_pred)

        self.features = self.model.feature_name()

        importances = {}
        importances_parallel = Parallel(n_jobs=self.n_jobs)(delayed(self.loop_iter)(X, y, baseline_score) for _ in tqdm(range(self.n_repeats), position=0, leave=True))
            
        for importance in tqdm(importances_parallel):
            for feature in importance:
                if feature not in importances.keys():
                    importances[feature] = [importance[feature]]
                else:
                    importances[feature].append(importance[feature])

        self.features_out = []
        for k, v in tqdm(importances.items()):
            if np.mean(v) - np.std(v) > 0:
                self.features_out.append(k)

        self.importances = {f: (np.mean(v), np.std(v)) for f, v in importances.items() if f in self.features_out}

        return self
    
    def loop_iter(self, X, y, baseline_score):
        importances = dict()
        for feature in self.features:
            X_perm = self.permute_feature(X=X, feature=feature)
            y_perm = self.model.predict(X_perm)
            
            feature_score = self.scoring(y, y_perm)
            importances[feature] = baseline_score - feature_score
        return importances
                
    def permute_feature(self, X, feature):
        X_perm = X.copy()
        X_perm[feature] = np.random.permutation(X_perm[feature].values)
        return X_perm

    def __repr__(self):
        return "{0} by {1}".format(self.__class__.__name__, self.model) + (", Features={0}->{1}".format(len(self.features_in), len(self.features_out)) if len(self.features_in) > 0 and len(self.features_out) > 0 else "")


In [166]:
def get_best_perm_features(r, n_std=2, show_log=False) -> List[str]:
    best_features = list()
    importance = []
    for k, v in r.items():
        importance.append((k, v[0], v[1]))
    importance = sorted(importance, key=lambda x: x[1], reverse=True)

    for feature, mean, std in importance:
        if mean - n_std * std > 0:
            if show_log:
                print(f"{feature:<8}", f"{mean:.5f}", f" +/- {std:.5f}")
            best_features.append(feature)    
    return best_features
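
The idea behind the two cells above is that a feature matters if shuffling its column lowers the score, and it is kept only when the mean drop exceeds `n_std` standard deviations. Below is a small self-contained sketch of that idea. It is not the notebook's implementation: `toy_model`, `accuracy`, the synthetic rows, and the explicitly seeded `random.Random` generator are all assumptions made so the example runs without LightGBM.

```python
import random
import statistics


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_model(rows):
    # Hypothetical stand-in for model.predict: it looks at column "a" only.
    return [int(row["a"] > 0.5) for row in rows]


def permutation_importance(predict, rows, y, features, n_repeats, seed):
    """Mean and population std of the score drop after shuffling each feature."""
    rng = random.Random(seed)
    baseline = accuracy(y, predict(rows))
    drops = {feature: [] for feature in features}
    for _ in range(n_repeats):
        for feature in features:
            shuffled = [row[feature] for row in rows]
            rng.shuffle(shuffled)
            permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
            drops[feature].append(baseline - accuracy(y, predict(permuted)))
    return {f: (statistics.mean(v), statistics.pstdev(v)) for f, v in drops.items()}


data_rng = random.Random(0)
rows = [{"a": data_rng.random(), "b": data_rng.random()} for _ in range(200)]
y = toy_model(rows)  # labels depend on "a" only, so "b" is pure noise

result = permutation_importance(toy_model, rows, y, ["a", "b"], n_repeats=5, seed=1)
kept = [f for f, (mean, std) in result.items() if mean - 2 * std > 0]
print(kept)
```

Shuffling `b` never changes a prediction, so its drop is exactly zero and it is filtered out, whereas shuffling `a` destroys most of the signal.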

In [167]:
fit = lgb.Dataset(
    data=X_train[best_features], label=y_train,
)

val = lgb.Dataset(
    data=X_val[best_features], label=y_val,
    reference=fit,
)

model = lgb.train(
    params={'seed': SEED, 'objective': 'binary', 'is_unbalance': True, 'max_depth': -1, 'feature_fraction': 0.5,  'learning_rate': 0.01,  'zero_as_missing': False, 'boosting_type': 'gbdt', 'metric': [METRIC]},
    train_set=fit,
    num_boost_round=5000,
    valid_sets=(fit, val),
    valid_names=('train', 'val'),
    early_stopping_rounds=25,
    verbose_eval=25,
)

In [168]:
lgb.plot_importance(model, max_num_features=25);

In [169]:
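# rank features by LightGBM importance and keep the top 10 with non-zero importance
importance_pairs = sorted(zip(model.feature_name(), model.feature_importance()), key=lambda x: x[1], reverse=True)
best_10_features_baseline = [name for name, importance in importance_pairs if importance > 0][:10]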

In [170]:
r = PermutationImportanceFeatureSelectionTransformer(model, n_repeats=4, random_state=SEED, n_jobs=4, scoring=roc_auc_score)
r.fit(X_val[best_features], y_val)

In [171]:
filtered_features = get_best_perm_features(r.importances, show_log=True)
len(filtered_features)

In [172]:
best_features = list(set(filtered_features + best_10_features_baseline))
len(best_features)

In [173]:
pd.concat([X_train[best_features].reset_index(drop=True), pd.DataFrame(y_train, columns=[TARGET])], axis=1).corr().style.background_gradient(cmap='coolwarm').set_precision(3)

In [174]:
print('[', end='')
for feature in sorted(best_features):
    print(f"'{feature}'", end=',\n')
print(']')

In [175]:
best_features = ['P10_sum',
'P10_week_0',
'P10_week_0_mul_1',
'P10_week_0_mul_2',
'P10_week_3',
'P11_min',
'P11_week_0_minus_3',
'P11_week_0_mul_2',
'P11_week_0_mul_3',
'P13_max',
'P13_week_0_minus_1',
'P13_week_0_minus_2',
'P13_week_0_minus_3',
'P15_week_0_minus_1',
'P15_week_0_minus_2',
'P16_max',
'P16_min',
'P16_week_0_minus_1',
'P16_week_0_minus_2',
'P16_week_3',
'P17_mean',
'P17_week_0_minus_1',
'P17_week_0_mul_2',
'P18_max',
'P19_max',
'P19_min',
'P19_week_0_minus_1',
'P19_week_0_minus_2',
'P19_week_3',
'P1_max',
'P1_mean',
'P1_min',
'P1_week_0_minus_2',
'P1_week_0_mul_1',
'P21_max',
'P21_std',
'P21_week_0_minus_1',
'P21_week_0_minus_2',
'P22_max',
'P22_week_0_minus_2',
'P25_std',
'P25_week_0',
'P27_min',
'P27_std',
'P2_max',
'P2_week_0_mul_3',
'P3_std',
'P4_max',
'P5_week_0_minus_3',
'P6_mean',
'P6_std',
'P7_median',
'P8_max',
'P8_week_0_minus_3',
'P9_max',
'V10_sum',
'V11_max',
'V14_sum',
'V19_sum',
'V21_max',
'V3_sum',
'V9_max',
]

## Hyperparameter tuning

In [176]:
import optuna
from optuna.samplers import TPESampler

In [180]:
def get_best_params(X_train, y_train, best_features, n_splits, study):
    def optuna_objective(trial):
        params = {
            # fixed parameters
            'objective': 'binary',
            'zero_as_missing': False,
            'verbosity': -1,
            'boosting_type': 'gbdt',
            'is_unbalance': True,
            'seed': SEED,
            'metric': METRIC,
            'max_depth': -1,  # not tuned; a 1..9 integer range was considered
            # tuned parameters
            'learning_rate': trial.suggest_loguniform('learning_rate', low=1e-5, high=0.2),
            # alternative grid: [2, 4, 8, 16, 32, 64, 128, 256, 512]
            'num_leaves': trial.suggest_categorical('num_leaves', [8, 16, 32, 64, 128, 256, 512]),
            'max_bin': trial.suggest_int('max_bin', low=8, high=255),
            'min_child_samples': trial.suggest_int('min_child_samples', low=0, high=20),
            'bagging_fraction': trial.suggest_uniform('bagging_fraction', low=0.01, high=1),
            'bagging_freq': trial.suggest_int('bagging_freq', low=0, high=32),
            'feature_fraction': trial.suggest_uniform('feature_fraction', low=0.01, high=1),
            'lambda_l1': trial.suggest_loguniform('lambda_l1', low=1e-8, high=12.0),
            'lambda_l2': trial.suggest_loguniform('lambda_l2', low=1e-8, high=12.0),
            # 'num_iterations': trial.suggest_int('num_iterations', low=100, high=10000, step=100),
        }

        cv_split = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        folds = cv_split.split(X=X_train[best_features], y=y_train) 

        dtrain = lgb.Dataset(
            data=X_train[best_features], label=y_train,
        )

        lgbcv = lgb.cv(
            params=params,
            train_set=dtrain,
            folds=folds,
            seed=SEED,
            metrics={METRIC},
            verbose_eval=False,                   
            early_stopping_rounds=25,                   
            num_boost_round=5000,
        )
        metric_result = max(lgbcv[METRIC+'-mean'])

#     fit = lgb.Dataset(
#         X_train[best_features], y_train,
#     )

#     val = lgb.Dataset(
#         X_val[best_features], y_val,
#         reference=fit,
#     )

#     model = lgb.train(
#         params=params,
#         train_set=fit,
#         num_boost_round=5000,
#         valid_sets=(fit, val),
#         valid_names=('train', 'val'),
#         early_stopping_rounds=25,
#         verbose_eval=False,
#     )
    
#     metric_result = roc_auc_score(y_val, model.predict(X_val[best_features]))
        return metric_result
    
    study.optimize(optuna_objective, n_trials=400, n_jobs=4, show_progress_bar=True)
    
    best_params = study.best_params.copy()
    best_params['max_depth'] = -1
    best_params['metric'] = METRIC
    best_params['zero_as_missing'] = False
    best_params['verbosity'] = -1
    best_params['boosting_type'] = 'gbdt'
    best_params['seed'] = SEED
    best_params['is_unbalance'] = True
    best_params['objective'] = 'binary'
    
    # return the completed parameter set (tuned + fixed), not the raw study.best_params
    return best_params, study.best_value

In [181]:
n_splits = 3

In [182]:
TPE_sampler = TPESampler(seed=SEED, n_startup_trials=100)
TPE_study = optuna.create_study(direction='maximize', sampler=TPE_sampler, study_name='TPE_study')

# get_best_params returns a (params, best_value) pair, so unpack it
best_params, best_value = get_best_params(X_train, y_train, best_features, n_splits, TPE_study)

In [183]:
TPE_study.best_params

In [184]:
best_params = TPE_study.best_params.copy()
best_params['max_depth'] = -1
best_params['metric'] = METRIC
best_params['zero_as_missing'] = False
best_params['verbosity'] = -1
best_params['boosting_type'] = 'gbdt'
best_params['seed'] = SEED
best_params['is_unbalance'] = True
best_params['objective'] = 'binary'
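
The fixed parameters are re-attached to the tuned ones by hand both inside `get_best_params` and in the cell above. One way to keep the two copies in sync is a single merge helper, sketched below. `FIXED_PARAMS`, `with_fixed_params`, and the example values for the tuned parameters, seed, and metric are hypothetical names and numbers, not part of the notebook.

```python
# Parameters the search does not touch (mirrors the assignments above).
FIXED_PARAMS = {
    "objective": "binary",
    "boosting_type": "gbdt",
    "is_unbalance": True,
    "zero_as_missing": False,
    "max_depth": -1,
    "verbosity": -1,
}


def with_fixed_params(tuned, seed, metric):
    """Return a new dict: tuned values plus the fixed ones; inputs stay unchanged."""
    return {**tuned, **FIXED_PARAMS, "seed": seed, "metric": metric}


tuned = {"learning_rate": 0.03, "num_leaves": 64}  # hypothetical study.best_params
params = with_fixed_params(tuned, seed=42, metric="auc")
print(params)
```

Because the fixed values are unpacked last, they win if a tuned key ever collides with a fixed one, which matches the intent of overwriting them after the search.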

In [185]:
optuna.visualization.plot_optimization_history(TPE_study)

In [186]:
optuna.visualization.plot_slice(TPE_study)

In [187]:
# optuna.visualization.plot_param_importances(TPE_study)

## Model training - CV

### 1. Find best **num_rounds**

In [188]:
def get_num_boost_rounds(X_train, y_train, best_features, best_params, n_splits):
    cv_split = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    folds = cv_split.split(X=X_train[best_features], y=y_train) 

    dtrain = lgb.Dataset(
        data=X_train[best_features], label=y_train,
    )

    lgbcv = lgb.cv(
        params=best_params,
        train_set=dtrain,
        folds=folds,
        seed=SEED,
        metrics={METRIC},
        verbose_eval=False,                   
        early_stopping_rounds=25,                   
        num_boost_round=8000,
    )

    # np.argmax returns a 0-based index, so add 1 to get the number of boosting rounds
    num_boost_round = np.argmax(lgbcv[METRIC+'-mean']) + 1
    print('num_boost_round:', num_boost_round)

    return num_boost_round, lgbcv
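
The `+ 1` in `get_num_boost_rounds` converts a position in the CV score curve into a count of trees. A tiny sketch with a made-up curve shows the off-by-one; the helper name `best_num_rounds` and the scores are assumptions for illustration only.

```python
def best_num_rounds(mean_scores):
    """Number of rounds at which the mean CV score peaks (first peak on ties)."""
    best_index = max(range(len(mean_scores)), key=mean_scores.__getitem__)
    return best_index + 1  # index 0 corresponds to a model with one tree


# Hypothetical per-round mean AUC from cross-validation
cv_auc_mean = [0.71, 0.74, 0.78, 0.80, 0.79, 0.79]
rounds = best_num_rounds(cv_auc_mean)
print(rounds)  # 4
```

Like `np.argmax`, `max` returns the first maximum, so a plateau resolves to the smaller model.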

In [189]:
num_boost_round, num_boost_result = get_num_boost_rounds(X_train, y_train, best_features, best_params, n_splits)

### 2. CV loop

In [190]:
def get_cv_result(X_train, y_train, n_splits, best_features, best_params, num_boost_round):
    
    cv_split = StratifiedKFold(n_splits=n_splits, random_state=SEED, shuffle=True)
    folds = cv_split.split(X=X_train, y=y_train) 

    # per-fold ROC-AUC scores, averaged after the loop
    roc_auc_train_avg = list()
    roc_auc_test_avg = list() 

    for train_index, test_index in tqdm(folds, position=0, leave=True):
        X_train_fold = X_train.reset_index(drop=True).loc[train_index, best_features]
        y_train_fold = y_train[train_index]

        X_test_fold = X_train.reset_index(drop=True).loc[test_index, best_features]
        y_test_fold = y_train[test_index]


        print('Train/Val shapes:')
        print((X_train_fold.shape, y_train_fold.shape), (X_test_fold.shape, y_test_fold.shape))

        print('Train/Val bad rates:')
        print(y_train_fold.mean(), y_test_fold.mean())

        print('Train/Val bad`s count:')
        print(y_train_fold.sum(), y_test_fold.sum())
        print()

        train_fold = lgb.Dataset(
            X_train_fold,
            label=y_train_fold,
        )
        test_fold = lgb.Dataset(
            X_test_fold,
            label=y_test_fold,
            reference=train_fold,
        )

        model = lgb.train(
            params=best_params,
            train_set=train_fold,
            num_boost_round=num_boost_round,
            valid_sets=(train_fold, test_fold),
            valid_names=('train', 'val'),
            verbose_eval=False,
        )

        y_train_pred_fold = model.predict(X_train_fold)
        y_test_pred_fold = model.predict(X_test_fold)

        roc_auc_train_avg.append(roc_auc_score(y_train_fold, y_train_pred_fold))
        roc_auc_test_avg.append(roc_auc_score(y_test_fold, y_test_pred_fold))

        print('Train/Val ROC-AUC:')
        print(roc_auc_score(y_train_fold, y_train_pred_fold), roc_auc_score(y_test_fold, y_test_pred_fold))

        lgb.plot_importance(model, max_num_features=25);
        plt.show()
        print('*'*50)
        print('*'*50)
        print()

    print('Average ROC-AUC Train:', np.mean(roc_auc_train_avg))
    print('Average ROC-AUC Val:', np.mean(roc_auc_test_avg))
    print()

In [191]:
get_cv_result(X_train, y_train, n_splits, best_features, best_params, num_boost_round)

## Model training - on all train data

In [192]:
fit = lgb.Dataset(
    X_train[best_features], y_train,
)

val = lgb.Dataset(
    X_val[best_features], y_val,
    reference=fit,
)

model = lgb.train(
    params=best_params,
    train_set=fit,
    num_boost_round=num_boost_round,
    verbose_eval=False,
)

In [193]:
print(f'ROC-AUC (TRAIN): {roc_auc_score(y_train, model.predict(X_train[best_features]))}')
print(f'ROC-AUC (VAL): {roc_auc_score(y_val, model.predict(X_val[best_features]))}')
y_val_pred = model.predict(X_val[best_features])
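
ROC-AUC depends only on how the scores rank positives against negatives: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as one half. The brute-force sketch below makes that concrete on made-up data; `pairwise_auc` is an illustrative helper, not a replacement for `roc_auc_score`, and its quadratic cost only suits small examples.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.45, 0.6, 0.8]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, [int(s >= 0.5) for s in scores])
print(auc_from_scores, auc_from_labels)
```

In this example the probabilities rank every positive above every negative, while thresholding them into hard 0/1 labels creates ties and lowers the AUC, which is one reason to evaluate the raw `model.predict` output when the metric is AUC.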

In [194]:
from sklearn.metrics import roc_curve

def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

fpr, tpr, thresholds = roc_curve(y_val, y_val_pred)
print(roc_auc_score(y_val, y_val_pred))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print("Threshold value is:", optimal_threshold)
plot_roc_curve(fpr, tpr)
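
`np.argmax(tpr - fpr)` picks the threshold that maximises Youden's J statistic, J = TPR - FPR. The sketch below recomputes that choice by hand on a made-up sample. It treats a score as positive when it is at or above the threshold, which is the convention scikit-learn documents for `roc_curve`. The helper name and the data are assumptions for illustration.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR over the observed scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_j, best_threshold = -1.0, None
    # walk thresholds from high to low, as roc_curve does
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict comparison keeps the first maximum, like np.argmax
            best_j, best_threshold = j, threshold
    return best_threshold, best_j


# Hypothetical validation labels and scores
y_true = [1, 1, 0, 1, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
threshold, j = youden_threshold(y_true, scores)
print(threshold, j)
```

Since the threshold is tuned on the validation set, metrics computed on that same set at this threshold are likely to be somewhat optimistic.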

In [195]:
print('*'*10 + 'TRAIN' + '*'*10)
# roc_curve thresholds mark scores >= threshold as positive, so use >= here as well
print(classification_report(y_train, (model.predict(X_train[best_features]) >= optimal_threshold).astype(int) ))
print()

print('*'*10 + 'VAL' + '*'*10)
print(classification_report(y_val, (model.predict(X_val[best_features]) >= optimal_threshold).astype(int) ))
print()

In [196]:
disp = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_train, (model.predict(X_train[best_features]) >= optimal_threshold).astype(int),),
)
disp.plot(xticks_rotation='vertical')

In [197]:
disp = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_val, (model.predict(X_val[best_features]) >= optimal_threshold).astype(int),),
)
disp.plot(xticks_rotation='vertical')

## Model training - final model

In [198]:
# df_train

In [199]:
numerical_pipeline.fit(df_train);
X_train_full = numerical_pipeline.transform(df_train).reset_index(drop=True)
y_train_full = X_train_full[TARGET].values
X_train_full.drop([WEEK, TARGET],axis=1, inplace=True)
X_train_full = X_train_full.set_index(ID)
X_train_full = X_train_full[best_features]
X_train_full.shape

In [200]:
X_test = numerical_pipeline.transform(df_test).reset_index(drop=True)
X_test.drop([WEEK],axis=1, inplace=True)
X_test = X_test.set_index(ID)
X_test = X_test[best_features]
X_test.shape

In [204]:
assert df_test[[ID]].drop_duplicates().shape[0] == X_test.shape[0]

In [205]:
fit = lgb.Dataset(
    X_train_full, y_train_full,
)

model_full = lgb.train(
    params=best_params,
    train_set=fit,
    num_boost_round=num_boost_round,
    verbose_eval=True,
)

In [206]:
# use a context manager so the file handle is closed after writing
with open(PATH2OUTPUT / 'lgb_class_weight_model_upd.pkl', 'wb') as model_file:
    pickle.dump(model_full, model_file)

## Model interpretation & feature importance

In [207]:
lgb.plot_importance(model_full, max_num_features=30);

In [208]:
explainer = shap.TreeExplainer(model_full)
shap_values_0 = explainer.shap_values(X_train_full)[0]
# shap_values_1 = explainer.shap_values(X_train_full)[1]

In [209]:
shap.summary_plot(shap_values_0, X_train_full,) # axis_color='white'

In [210]:
# shap.summary_plot(shap_values_1, X_train_full,) # axis_color='white'

## Inference

In [211]:
X_test[best_features]

In [212]:
y_test_pred = model_full.predict(X_test[best_features])

In [None]:
# plt.hist(y_train_pred);

In [213]:
plt.hist(y_val_pred);

In [214]:
plt.hist(y_test_pred);

## Save the result

In [215]:
X_test['Predicted'] = y_test_pred

In [216]:
submissions = X_test.reset_index()[[ID, 'Predicted']]

In [217]:
submissions.to_csv(PATH2OUTPUT / 'submission.csv', index=False)
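
Before uploading, it can help to sanity-check the written file against the expected driver ids. The sketch below is one possible check using only the standard library. The column names `Id` and `Predicted` follow the task description, while the function name, the inline CSV strings, and the id values are made up for illustration.

```python
import csv
import io


def validate_submission(csv_text, expected_ids, id_col="Id", pred_col="Predicted"):
    """Return a list of problems found in a submission; empty means it looks fine."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ids = [row[id_col] for row in rows]
    problems = []
    if len(ids) != len(set(ids)):
        problems.append("duplicate ids")
    if set(ids) != {str(i) for i in expected_ids}:
        problems.append("ids differ from the test set")
    if any(not 0.0 <= float(row[pred_col]) <= 1.0 for row in rows):
        problems.append("prediction outside [0, 1]")
    return problems


# Hypothetical file contents
good = "Id,Predicted\n1,0.12\n2,0.87\n"
bad = "Id,Predicted\n1,0.12\n1,1.40\n"

good_report = validate_submission(good, expected_ids=[1, 2])
bad_report = validate_submission(bad, expected_ids=[1, 2])
print(good_report, bad_report)
```

In the notebook the same check could read the text of the saved `submission.csv` and take the expected ids from `df_test`.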