## Część 2 - modelowanie

#### kolejne kroki:
- pobranie danych do modelowania
- przygotowanie zestawów danych train-test w kilku wariantach
- modelowanie z zastosowaniem wybranych klasyfikatorów
- omówienie i podsumowanie wyników

In [1]:
#import pakietów
import numpy as np
import pandas as pd
from IPython.display import HTML
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import pickle
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from math import log

### I Przygotowanie danych do modelowania

#### Import danych

Dane do modelowania przygotowane zostały w 4 wersjach - wersje te różnią się sposobem uzupełnienia brakujących wartości w zakresie 5 zmiennych:

- 'offer_amount', 
- 'offer_period', 
- 'interest_rate', 
- 'fee', 
- 'offer_monthly_obligation'.

Są to następujące warianty:

- 1 - zmienne usunięte ze zbioru danych,
- 2 - braki danych uzupełnione medianą,
- 3 - braki danych uzupełnione przez losowanie z rozkładu normalnego,
- 4 - braki danych uzupełnione przez kopiowanie danych z innych rekordów datasetu.

Przygotowanie modelu wykonywane będzie dla każdego z wariantów niezależnie.
Sposób uzupełnienia wartości brakujacych zostanie oceniony po analizie jakości uzyskanych modeli.

In [4]:
#import danych
col_names = ['ID', 'gender', 'city', 'income', 'birth_date', 'application_date', 'requested_amount', 
             'requested_period', 'financial_obligations', 'employer_name', 'account_bank', 
             'mobile_verification_flag', 'var_5', 'var_1', 'offer_amount', 'offer_period', 'interest_rate', 
             'fee', 'offer_monthly_obligation', 'filled_form_flag', 'device', 'var_2', 'source', 'var_4', 
             'disbursed_flag', 'latitude', 'longitude', 'age']

datasets = {
    1:{'name':'none'},
    2:{'name':'median'},
    3:{'name':'distribution'},
    4:{'name':'observation'}
}

datasets[2]['data'] = pd.read_csv('dataset_inputation_median.csv', delimiter = ';', engine='python', header = None, names = col_names, 
                      index_col = 0, skiprows = 1 )
datasets[3]['data'] = pd.read_csv('dataset_inputation_distribution.csv', delimiter = ';', engine='python', header = None, names = col_names, 
                      index_col = 0, skiprows = 1 )
datasets[4]['data'] = pd.read_csv('dataset_inputation_random_observation.csv', delimiter = ';', engine='python', header = None, names = col_names, 
                      index_col = 0, skiprows = 1 )
datasets[1]['data'] = datasets[2]['data'].copy()

columns_to_drop = ['city', 'birth_date', 'application_date', 'employer_name', 'account_bank']
columns_to_rename = {'ID':'id', 
                     'gender':'cat01', 
                     'income':'num01', 
                     'requested_amount':'num02', 
                     'requested_period':'num03',
                     'financial_obligations':'num04',
                     'mobile_verification_flag':'cat02',
                     'var_5':'cat03',
                     'var_1':'cat04',
                     'offer_amount':'num05',
                     'offer_period':'num06',
                     'interest_rate':'num07',
                     'fee':'num08',
                     'offer_monthly_obligation':'num09',
                     'filled_form_flag':'cat05',
                     'device':'cat06',
                     'var_2':'cat07',
                     'source':'cat08',
                     'var_4':'cat09',
                     'disbursed_flag':'explained',
                     'latitude':'num10',
                     'longitude':'num11',
                     'age':'num12'
}

for item in datasets:
    datasets[item]['data'] = datasets[item]['data'].drop(columns_to_drop, axis=1).rename(columns=columns_to_rename)

datasets[1]['data'] = datasets[1]['data'].drop(['num05','num06','num07','num08','num09'], axis=1)


#### Podział train-test

Dla każdego wariantu danych dzielę dataset na zbiór trenujacy i testowy wg proporcji 3:1.
Klasa pozytywnych obserwacji zmiennej objaśnianej jest mało liczna, dlatego żeby zachować proporcjonalny podział
stosuję opcję 'stratifiy'. 

Mała liczność obserwacji pozytywnych w zmiennej objaśnianej będzie wymagać zastosowania skalowania w procesie modelowania.

In [5]:
licznosc = datasets[1]['data'].shape[0]
licznosc_y = sum(datasets[1]['data']['explained'])
licznosc_y_procent = licznosc_y / licznosc
print(f"licznosc_zbioru = {licznosc}")
print(f"liczność obserwacji pozytywnych dla zmiennej objaśnianej = {licznosc_y}")
print(f"udział obserwacji pozytywnych zmiennej objaśnianej w zbiorze = {licznosc_y_procent}\n")
      
# Train - test split
train_test_split_ratio = 0.25

for item in datasets:
    datasets[item]['X'] = datasets[item]['data'].drop(['explained'], axis=1)
    datasets[item]['y'] = datasets[item]['data']['explained']
    datasets[item]['X_train'], datasets[item]['X_test'], datasets[item]['y_train'], datasets[item]['y_test'] = train_test_split(datasets[item]['X'], datasets[item]['y'], test_size=train_test_split_ratio, random_state=42, stratify=datasets[item]['y'])
    y_train_share = sum(datasets[item]['y_train'])/datasets[item]['X_train'].shape[0]
    y_test_share  = sum(datasets[item]['y_test']) /datasets[item]['X_test'].shape[0]
    print(f"Dataset {item}: train - {y_train_share} 'jedynek', test - {y_test_share} 'jedynek'")

licznosc_zbioru = 87020
liczność obserwacji pozytywnych dla zmiennej objaśnianej = 1273.0
udział obserwacji pozytywnych zmiennej objaśnianej w zbiorze = 0.01462882096069869

Dataset 1: train - 0.014632651497739983 'jedynek', test - 0.01461732934957481 'jedynek'
Dataset 2: train - 0.014632651497739983 'jedynek', test - 0.01461732934957481 'jedynek'
Dataset 3: train - 0.014632651497739983 'jedynek', test - 0.01461732934957481 'jedynek'
Dataset 4: train - 0.014632651497739983 'jedynek', test - 0.01461732934957481 'jedynek'


#### Podejście do zmiennych kategorycznych

Na etapie przygotowania danych zmienne kategoryczne zostały zakodowane wg rosnącego udziału obserwacji pozytywnych w danej klasie. Bardziej adekwatne podejście (szczególnie w celu zastosowania regresji logistycznej) jest zastosowanie współczynników WOE (Weight Of Evidence), czyli skalowanie wg proporcji ln(liczba obserwacji pozytywnych / liczba obserwacji negatywnych). 

Dla każdego ze zbiorów danych wyznaczam wagi WOE do zmiany kodowania zmiennych kategorycznych. Do wyliczenia wag wykorzystuję wyłącznie obserwacje ze zbioru treningowego.

In [6]:
# WOE calculation for categorical
categorical_variables = [name for name in datasets[1]['data'].columns if 'cat' in name]

for item in datasets:
    WOE_dataset = datasets[item]['X_train'][categorical_variables]
    WOE_dataset['positive'] = datasets[item]['y_train']
    WOE_dataset['negative'] = WOE_dataset.apply(lambda x: 1 - x['positive'], axis=1)
    WOE_mapper = pd.DataFrame(columns=['feature_name', 'feature_code', 'WOE'])
    for variable in categorical_variables:
        variable_data = WOE_dataset[[variable, 'positive', 'negative']]
        out = variable_data.groupby([variable]).sum()
        out['WOE'] = out.apply(lambda x: log(x['positive'] / x['negative']) if x['positive'] > 0 and x['negative'] > 0 else -12, axis = 1)
        out['feature_name'] = variable
        out['feature_code'] = out.index
        out = out[['feature_name', 'feature_code', 'WOE']].reset_index(drop=True)
        WOE_mapper = pd.concat([WOE_mapper, out])
    datasets[item]['WOE_mapper'] = WOE_mapper

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


Kodowanie stosuję zarówno do zbioru treningowego, jak i testowego. Wariant danych z kodowaniem WOE oznaczam odrębnie.

In [7]:
# data with WOE

def WOE_mapping(dataset, WOE_mapper):
    output = dataset.copy()
    for variable in categorical_variables:
        mapper = WOE_mapper[WOE_mapper['feature_name'] == variable]
        mapper_dict = {}
        for category in mapper.index:
            mapper_dict[mapper.loc[category]['feature_code']] = mapper.loc[category]['WOE']
        output[variable] = output.apply(lambda x: mapper_dict[x[variable]], axis=1)
    return output

for item in datasets:
    datasets[item]['X_train_woe'] = WOE_mapping(datasets[item]['X_train'], datasets[item]['WOE_mapper'])
    datasets[item]['X_test_woe'] = WOE_mapping(datasets[item]['X_test'], datasets[item]['WOE_mapper'])

#### Standaryzacja

Dane numeryczne w zbiorze danych są różnorodnie skalowane - np. wiek klienta to zmienna z przedziału ok. [0, 100], tymczasem kwota dochodu - już sięga milionów (dochód wyrażony jest w rupiach indyjskich). Zniwelowanie różnic w skali umożliwia zastosowanie standaryzacji zmiennych - stosuję do tego celu wbudowaną funkcjonalność StandardScaler(). 

Standaryzacji poddaję zarówno zbiory ze zwyczajnym kodowaniem zmiennych kategorycznych, jak i z kodowaniem WOE.
Zbiory przeskalowane oznaczam odrębnie.

In [8]:
# data scaled
for item in datasets:
    datasets[item]['scaler'] = StandardScaler()
    datasets[item]['WOE_scaler'] = StandardScaler()
    datasets[item]['scaler'].fit(datasets[item]['X_train'])
    datasets[item]['WOE_scaler'].fit(datasets[item]['X_train_woe'])
    datasets[item]['X_train_scaled'] = datasets[item]['scaler'].transform(datasets[item]['X_train'])
    datasets[item]['X_test_scaled'] = datasets[item]['scaler'].transform(datasets[item]['X_test'])
    datasets[item]['X_train_woe_scaled'] = datasets[item]['WOE_scaler'].transform(datasets[item]['X_train_woe'])
    datasets[item]['X_test_woe_scaled'] = datasets[item]['WOE_scaler'].transform(datasets[item]['X_test_woe'])    

#### Redukcja wymiarowości

Korelacja części zmiennych datasetu ze zmienną objaśnianą może być na tyle niska, że zmienna nie wniesie znaczącego udziału w jakość modelu. Redukcja wymiarowości umozliwia wyłuskanie najistotniejszych zmiennych do budowy modelu. Ograniczenie ilości danych usprawnia proces modelowania kosztem niewielkiej straty jakości modelu. Ma to szczególne znaczenie dla zbiorów ze sporą ilością zmiennych modelowych.

W naszym przypadku redukcja wymiarowości nie jest konieczna z uwagi na niedużą liczbę zmiennych modleowych (poniżej 20). Mimo to wyznaczam zbiór danych z odcięciem części zmiennych - warunkiem jest obniżenie wyjaśnianej wariancji maksymalnie o 5%. Wykorzystuję Principal Component Analysis (jest dostępna wbudowana funkcjonalność w bibliotece sklearn).

Procedurze PCA poddaje się zbiory po regularyzacji, wyliczenie stosuję zarówno do zbioru ze zwyczajnym kodowaniem zmiennych kategorycznych, jak i z kodowaniem WOE. Wersje danych po procedurze PCA odłożone są w osobnych zbiorach.

In [9]:
# PCA

expected_explained_variance_ratio = 0.95

def pca_model(dataset):
    pca = PCA(random_state=42)
    pca.fit(dataset)
    explained_variance = pca.explained_variance_ratio_
    counter = 0
    explained_variance_cumulative = 0
    for variance in explained_variance:
        if explained_variance_cumulative <= expected_explained_variance_ratio:
            explained_variance_cumulative += variance
            counter += 1
    pca = PCA(random_state=42, n_components = counter)
    pca.fit(dataset)
    return pca

for item in datasets:
    datasets[item]['PCA'] = pca_model(datasets[item]['X_train_scaled'])
    datasets[item]['WOE_PCA'] = pca_model(datasets[item]['X_train_woe_scaled'])
    datasets[item]['X_train_pca'] = datasets[item]['PCA'].transform(datasets[item]['X_train_scaled'])
    datasets[item]['X_test_pca'] = datasets[item]['PCA'].transform(datasets[item]['X_test_scaled'])
    datasets[item]['X_train_woe_pca'] = datasets[item]['WOE_PCA'].transform(datasets[item]['X_train_woe_scaled'])
    datasets[item]['X_test_woe_pca'] = datasets[item]['WOE_PCA'].transform(datasets[item]['X_test_woe_scaled'])
    print(f"Dataset {item}: liczba zmiennych przed PCA: {datasets[item]['X_train'].shape[1]}, liczba zmiennych po PCA: {datasets[item]['X_train_pca'].shape[1]}")

Dataset 1: liczba zmiennych przed PCA: 16, liczba zmiennych po PCA: 13
Dataset 2: liczba zmiennych przed PCA: 21, liczba zmiennych po PCA: 16
Dataset 3: liczba zmiennych przed PCA: 21, liczba zmiennych po PCA: 17
Dataset 4: liczba zmiennych przed PCA: 21, liczba zmiennych po PCA: 17


Podumowując - mamy 4 zbiory danych:

    - 1 - usunięte zmienne z brakami,
    - 2 - braki uzupełnione medianą,
    - 3 - braki uzupełnione przez losowanie z rozkładu normalnego,
    - 4 - braki uzupełnione przez kopiowanie danych z innych rekordów datasetu,

    każdy zbiór w 6 wersjach:
    
    - bez skalowania, bez WOE,
    - bez skalowania, z WOE,
    - ze skalowaniem, bez WOE, bez PCA,
    - ze skalowaniem, z WOE, bez PCA,
    - ze skalowaniem, bez WOE, z PCA,
    - ze skalowaniem, z WOE, z PCA.

## II Modelowanie

W modelowaniu przyjąłem następujacy schemat działania - dla danego klasyfikatora:
    
- zdefiniowanie zakresu parametrów klasyfikatora oraz zakresu wariantów danych do modelowania,
- optymalizacja parametrów klasyfikatora z zastosowaniem Grid Search + Cross Validation,
- wybór najlepszych zestawów parametrów,
- powtórzenie modelowania na wybranych zestawach parametrów na pełnym zbiorze treningowym,
- ocena wyników modeli - zestawienie miar dla zbioru testowego i treningowego.

In [10]:
# funkcja zwracająca miary jakości modelu dla zadanego modelu oraz zbioru zmiennych modelowych i obserwacji
def get_measures(model, X, y):
    y_predict = model.predict(X)
    confusion = metrics.confusion_matrix(y, y_predict)
    out = {}
    out['accuracy'] = metrics.accuracy_score(y_predict, y)
    out['precision'] = metrics.precision_score(y_predict, y)
    out['recall'] = metrics.recall_score(y_predict, y)
    out['f1'] = metrics.f1_score(y_predict, y)
    out['auc'] = metrics.roc_auc_score(y, y_predict)
    out['TP'] = confusion[1][1]
    out['FP'] = confusion[1][0]
    out['TN'] = confusion[0][0]
    out['FN'] = confusion[0][1]
    return out

In [11]:
# funkcja realizująca procedurę Grid Search + Cross Validation
def model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=4, n_jobs=7, verbose = 1):
    grid_search = GridSearchCV(classifier, param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, verbose = verbose, refit=True)
    grid_search.fit(X, y)
    with open(prefix + '_grid_search.pickle', 'wb') as file:
        pickle.dump(grid_search, file)
    params_list =  ['param_'+x for x in param_grid]
    params_list += ['params', 'mean_train_score', 'mean_test_score'] 
    result = pd.DataFrame(grid_search.cv_results_)[params_list]
    result.to_csv(prefix + ".csv", index=False, sep=';')
    return grid_search, result  

#### 1. Regresja logistyczna

Pierwszym stosowanym klasyfikatorem jest regresja logistczna.
W procedurze Grid Search sprawdzane będą dwa parametry:
 
- parametr C - parametr sterujący regularyzacją,
- parametr 'class_weight' - parametr sterujący przeważaniem zmiennej objaśnianej.

Konieczność zastosowania przeważania obserwacji pozytywnych dla zmiennej objaśnianej wynika z bardzo małego udzuału tych obserwacji w analizowanym zbiorze danych.

In [None]:
# Regresja logistyczna - pierwszy przebieg

# klasyfikator, zestaw przeszukiwanych parametrów klasyfikatora

classifier = LogisticRegression(C = 1.0, class_weight = 'balanced', random_state = 42, verbose = 1)
C = [10**c for c in range(1, 12, 2)]
weights = [1, 50, 100, 200]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
param_grid = {'C': C, 'class_weight': class_weight}
param_grid

In [None]:
# warianty danych badane per każdy dataset
data_variants = ['', '_pca', '_scaled', '_woe', '_woe_scaled', '_woe_pca']

Przebieg Grid Search:

- per każdy dataset (4 wersje),
- per każdy sposób przygotowania danych (6 wersji),
- z optymalizacją miar: AUC / F-score.

Z uwagi na czasochłonnośc przeliczenia wynik każdego przebiegu podlega serializacji i zapisowi na dysk, 
dodatkowo zbierane są średnie miary uzyskane w procedurze Cross Validation dla danego zestawu parametrów.

In [None]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'scoring', 'param_C', 'param_class_weight', 'params', 'mean_train_score', 'mean_test_score'])

for dataset in datasets:
    for data_variant in data_variants:
        for scoring in ['roc_auc', 'f1']:
            prefix = 'logistic_regression_' + str(dataset) + '_' + data_variant + '_' + scoring
            print("start: " + prefix)
            X = datasets[dataset]['X_train' + data_variant]
            y = datasets[dataset]['y_train']
            grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=4, n_jobs=7, verbose = 1)
            datasets[dataset]['grid_search_logistic_regression_' + data_variant + '_' + scoring] = grid_search
            result['dataset'], result['data_variant'], result['scoring'] = str(dataset), data_variant, scoring
            results = pd.concat([results, result])

results.to_csv('logistic_regression_step_1.csv', index=False, sep=';')

Poniżej zestawienie parametrów dla 10 najlepszych uzyskanych wartości miar AUC i F1.
Jakość uzyskiwanych modeli nie jest wysoka.

Analiza przedstawionych danych pozwala stwierdzić, że:

- różnice miar pomiędzy zbiorem testowym i treningowym są niewielkie (w granicach 1pp) --> model nie jest przeuczony,
- jakość modelu nie zależy praktycznie od parametru C,
- dla maksymalizacji AUC preferowane są zbiory z wagami WOE, dla maksymalizacji F-score - bez wag WOE,
- dla obu miar preferowane są zbiory skalowane,
- jakość modelu mocno zależy od parametru 'class_weight' - optymalna wartośc parametru to ok. 50 lub 'balanced'
- w top 10 nie pojawiają się: 
    dataset 1 (= usunięte zmienne z brakami), 
    warianty danych z PCA

In [12]:
results = pd.read_csv("logistic_regression_step_1.csv", sep=';')
results = results.drop(['params'], axis=1)

results_best_auc = results[results['scoring']=='roc_auc']
results_best_auc = results_best_auc.sort_values(by=['mean_test_score'], ascending=False)
results_best_auc.head(10)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_C,param_class_weight,scoring
626,_woe_scaled,2,0.825199,0.829608,100000000000,"{0: 1, 1: 50}",roc_auc
611,_woe_scaled,2,0.825199,0.829608,100000,"{0: 1, 1: 50}",roc_auc
606,_woe_scaled,2,0.825199,0.829608,1000,"{0: 1, 1: 50}",roc_auc
621,_woe_scaled,2,0.825199,0.829608,1000000000,"{0: 1, 1: 50}",roc_auc
616,_woe_scaled,2,0.825199,0.829608,10000000,"{0: 1, 1: 50}",roc_auc
601,_woe_scaled,2,0.825199,0.829608,10,"{0: 1, 1: 50}",roc_auc
614,_woe_scaled,2,0.825147,0.829591,100000,balanced,roc_auc
619,_woe_scaled,2,0.825147,0.829591,10000000,balanced,roc_auc
624,_woe_scaled,2,0.825147,0.829591,1000000000,balanced,roc_auc
629,_woe_scaled,2,0.825147,0.829591,100000000000,balanced,roc_auc


In [13]:
results_best_f1 = results[results['scoring']=='f1']
results_best_f1 = results_best_f1.sort_values(by=['mean_test_score'], ascending=False)
results_best_f1.head(10)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_C,param_class_weight,scoring
516,_scaled,2,0.087597,0.087875,1000,"{0: 1, 1: 50}",f1
531,_scaled,2,0.087597,0.087875,1000000000,"{0: 1, 1: 50}",f1
511,_scaled,2,0.087597,0.087873,10,"{0: 1, 1: 50}",f1
521,_scaled,2,0.087597,0.087875,100000,"{0: 1, 1: 50}",f1
526,_scaled,2,0.087597,0.087875,10000000,"{0: 1, 1: 50}",f1
536,_scaled,2,0.087597,0.087875,100000000000,"{0: 1, 1: 50}",f1
756,,3,0.087434,0.086708,1000,"{0: 1, 1: 50}",f1
766,,3,0.087433,0.086786,10000000,"{0: 1, 1: 50}",f1
871,_scaled,3,0.087076,0.087602,10,"{0: 1, 1: 50}",f1
876,_scaled,3,0.087076,0.087602,1000,"{0: 1, 1: 50}",f1


In [None]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'param_C', 'param_class_weight', 'params', 'mean_train_score', 'mean_test_score'])
classifier = LogisticRegression(C = 10**5, class_weight = 'balanced', random_state = 42, verbose = 0)
weights = [10, 30, 40, 50, 60, 70, 80, 90]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
param_grid = {'class_weight': class_weight}

for dataset in [2, 3, 4]:
    for data_variant in ['_scaled', '_woe_scaled']:
        prefix = 'logistic_regression_2_' + str(dataset) + '_' + data_variant
        print("start: " + prefix)
        X = datasets[dataset]['X_train' + data_variant]
        y = datasets[dataset]['y_train']
        grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring='roc_auc', cv=4, n_jobs=11, verbose = 0)
        datasets[dataset]['grid_search_logistic_regression_' + data_variant + '_' + scoring] = grid_search
        result['dataset'], result['data_variant'] = str(dataset), data_variant
        results = pd.concat([results, result])

results.to_csv('logistic_regression_step_2.csv', index=False, sep=';')       

In [14]:
results = pd.read_csv("logistic_regression_step_2.csv", sep=';')
results = results.drop(['params'], axis=1)
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(10)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_C,param_class_weight
11,_woe_scaled,2,0.825215,0.829572,,"{0: 1, 1: 40}"
12,_woe_scaled,2,0.825199,0.829608,,"{0: 1, 1: 50}"
13,_woe_scaled,2,0.825174,0.82961,,"{0: 1, 1: 60}"
17,_woe_scaled,2,0.825147,0.829591,,balanced
10,_woe_scaled,2,0.825143,0.829463,,"{0: 1, 1: 30}"
14,_woe_scaled,2,0.825125,0.829584,,"{0: 1, 1: 70}"
49,_woe_scaled,4,0.825055,0.829724,,"{0: 1, 1: 60}"
53,_woe_scaled,4,0.825052,0.829723,,balanced
50,_woe_scaled,4,0.825049,0.82972,,"{0: 1, 1: 70}"
15,_woe_scaled,2,0.825038,0.829544,,"{0: 1, 1: 80}"


Wyniki drugiego przebiegu Grid Search wskazują na:

- sposób przygotowania danych '_woe_scaled' jako zwracający najlepsze wyniki,
- przewagę drugiego datasetu (uzupełnienie braków medianą),
- zakres optywamllnych wartości dla przeważenia klas w zmiennej obserwowanej: 40-80,
- brak przeuczenia - średni wynik na zbiorach uczących i testowych jest prawie taki sam.

Kolejnym krokiem będzie wygenerowanie zestawu modeli z wykorzystaniem całego zbioru uczącego dla:
- wszystkich datasetów,
- wszystkich sposobów przygotowania danych,
- dla różnych wartości parametrów 'class_weight' w zakresie wskazanym przez Grid Search.

Dla każdego modelu zwrócony zostanie zestaw miar zarówno dla zbioru uczącego jak i testowego.

In [None]:
# Regresja logistyczna - modelowanie
dataset_columns = ['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 
                   'accuracy_train', 'accuracy_test', 'precision_train', 'precision_test', 'recall_train', 
                   'recall_test', 'TN_train', 'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 
                   'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 
                        'f1':'f1_train', 'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 
                        'FN':'FN_train'}

test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 
                       'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 
                       'FN':'FN_test'}
                             
weights = range(40, 80, 5)
class_weights = [{0:1, 1:x} for x in weights]
class_weights.append('balanced')

for dataset in datasets:
    for data_variant in data_variants:
        for class_weight in class_weights:
            X_train = datasets[dataset]['X_train' + data_variant]
            y_train = datasets[dataset]['y_train']
            X_test = datasets[dataset]['X_test' + data_variant]
            y_test = datasets[dataset]['y_test']               
            model = LogisticRegression(C = 10**5, class_weight = class_weight, random_state = 42, n_jobs = -1)
            model.fit(X_train, y_train)
            train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
            test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
            train = train.rename(columns = train_columns_rename)
            test  = test.rename(columns = test_columns_rename)
            result = pd.concat([train, test], axis = 1)
            result['dataset'] = dataset
            result['data_variant'] = data_variant
            result['class_weight'] = class_weight[1]
            result = result[dataset_columns]
            model_results = pd.concat([model_results, result])

model_results.to_csv('logistic_regression_modelling.csv', index=False, sep=';')    

In [16]:
results = pd.read_csv("logistic_regression_modelling.csv", sep=';')
results = results.sort_values(by=['auc_train'], ascending=False)
results = results[['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'accuracy_train']]
results.head(10)

Unnamed: 0,dataset,data_variant,class_weight,auc_train,auc_test,f1_train,accuracy_train
133,4,_woe_scaled,75,0.760186,0.749607,0.073845,0.697142
97,3,_woe_scaled,75,0.758953,0.746252,0.073582,0.696744
61,2,_woe_scaled,75,0.758017,0.747078,0.07331,0.695917
25,1,_woe_scaled,75,0.757929,0.747461,0.072992,0.69371
60,2,_woe_scaled,70,0.757703,0.743965,0.075068,0.708511
131,4,_woe_scaled,65,0.757601,0.74464,0.077419,0.723558
134,4,_woe_scaled,a,0.757573,0.744309,0.076255,0.716387
132,4,_woe_scaled,70,0.757005,0.745318,0.075045,0.70917
34,1,_woe_pca,75,0.756693,0.751021,0.072054,0.688225
96,3,_woe_scaled,70,0.756575,0.738725,0.074995,0.709339


Powyżej przedstawiono 10 najlepszych modeli (sortowanie wg AUC na zbiorze treningowym malejąco).
Jakość modeli nie jest zadowalająca - niski f-score, dość niska wartość AUC --> siła predykcyjna modelu jest niska.

#### 2. Drzewa decyzyjne

Kolejna próba wykorzystuje jako klasyfikator drzewa decyzyjne.
Dla klasyfikatorów drzewiastych liniowe przeskalowanie danych jest transparentne i nie wpływa na wynik, dlatego nie badam wszystkich wariantów przygotowania danych jak w przypadku regresji logistycznej.

Procedura optymalizacji modelu jest analogiczna jak w przypadku regresji logistycznej:
- Grid Search + Cross Validation z przeszukaniem szerokiego zakresu parametrów klasyfikatora,
- modelowanie dla najlepiej rokujących zestawów parametrów.

Do optymalizacji uwzględniam:
- maksymalną długość drzewa,
- liczbę zmiennych branych pod uwagę w wyborze najlepszego podziału (parametr ten wprowadza element losowości),
- wagi dla klas zmiennej objaśnianej (z uwagi na nierównomierną liczność klas).

Uwzględniając doświadczenia z modelowania metodą regresji logistycznej pomijam pierwszy dataset w dalszych analizach.

In [17]:
classifier = DecisionTreeClassifier(max_depth=None, max_features=None, random_state=42, class_weight=None)

max_depth = [3, 5, 7, 9, 11, 15, 20]
max_features = [None, 0.5, 0.75]
weights = [30, 40, 50, 60, 70, 80, 90]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
class_weight.append(None)
param_grid = {'max_depth': max_depth, 'max_features': max_features, 'class_weight':class_weight}
param_grid

{'max_depth': [3, 5, 7, 9, 11, 15, 20],
 'max_features': [None, 0.5, 0.75],
 'class_weight': [{0: 1, 1: 30},
  {0: 1, 1: 40},
  {0: 1, 1: 50},
  {0: 1, 1: 60},
  {0: 1, 1: 70},
  {0: 1, 1: 80},
  {0: 1, 1: 90},
  'balanced',
  None]}

In [19]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'params', 'mean_train_score', 'mean_test_score'])

for dataset in [2, 3, 4]:
    for data_variant in ['_pca', '_scaled', '_woe_scaled', '_woe_pca']:
        prefix = 'decision_tree_' + str(dataset) + '_' + data_variant
        print("start: " + prefix)
        X = datasets[dataset]['X_train' + data_variant]
        y = datasets[dataset]['y_train']
        grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring='roc_auc', cv=4, n_jobs=-1, verbose = 0)
        datasets[dataset]['grid_search_decision_tree_' + data_variant + '_' + scoring] = grid_search
        result['dataset'], result['data_variant'] = str(dataset), data_variant
        results = pd.concat([results, result])

results.to_csv('decision_tree_grid_search.csv', index=False, sep=';') 

In [20]:
results = pd.read_csv("decision_tree_grid_search.csv", sep=';')
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(10)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_class_weight,param_max_depth,param_max_features,params
1769,_scaled,4,0.822702,0.845665,"{0: 1, 1: 60}",5,0.75,"{'class_weight': {0: 1, 1: 60}, 'max_depth': 5..."
1077,_scaled,3,0.822354,0.856692,"{0: 1, 1: 90}",7,,"{'class_weight': {0: 1, 1: 90}, 'max_depth': 7..."
1958,_woe_scaled,4,0.820863,0.845625,"{0: 1, 1: 60}",5,0.75,"{'class_weight': {0: 1, 1: 60}, 'max_depth': 5..."
362,_scaled,2,0.819775,0.840669,,5,0.75,"{'class_weight': None, 'max_depth': 5, 'max_fe..."
1748,_scaled,4,0.819454,0.847338,"{0: 1, 1: 50}",5,0.75,"{'class_weight': {0: 1, 1: 50}, 'max_depth': 5..."
321,_scaled,2,0.819181,0.858793,"{0: 1, 1: 90}",7,,"{'class_weight': {0: 1, 1: 90}, 'max_depth': 7..."
278,_scaled,2,0.818528,0.846761,"{0: 1, 1: 70}",5,0.75,"{'class_weight': {0: 1, 1: 70}, 'max_depth': 5..."
1937,_woe_scaled,4,0.818021,0.847523,"{0: 1, 1: 50}",5,0.75,"{'class_weight': {0: 1, 1: 50}, 'max_depth': 5..."
1056,_scaled,3,0.816765,0.862087,"{0: 1, 1: 80}",7,,"{'class_weight': {0: 1, 1: 80}, 'max_depth': 7..."
1853,_scaled,4,0.816258,0.844222,balanced,5,0.75,"{'class_weight': 'balanced', 'max_depth': 5, '..."


Analizując uzyskane wyniki można stwierdzić, że:

- wśród najwyżej ocenianych nie występują zbiory przygotowane z wykorzystaniem PCA,
- różnica pomiędzy średnim wynikiem dla zbiorów testowych i treningowych nie przekracza 3pp --> modele nie są przeuczone,
- optymalna wartośc parametru 'class_weight' znajduje się w granicach 60-90
- optymalna wartość parametru 'max_depth' znajduje się w granicach 5-7
- optymalna wartość parametru 'max_features' znajduje się w granicach 75-100%

Kolejnym krokiem jest modelowanie w zakresach parametrów wskazanych przez Grid Search.

In [None]:
# Drzewa decyzyjne - modelowanie
dataset_columns = ['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 
                   'accuracy_train', 'accuracy_test', 'precision_train', 'precision_test', 'recall_train', 
                   'recall_test', 'TN_train', 'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 
                   'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 
                        'f1':'f1_train', 'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 
                        'FN':'FN_train'}

test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 
                       'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 
                       'FN':'FN_test'}
                             
weights = range(50, 90, 10)
class_weights = [{0:1, 1:x} for x in weights]
max_depths = [5, 6, 7]
max_features_set = [0.75, None]

for dataset in datasets:
    for data_variant in data_variants:
        for max_depth in max_depths:
            for max_features in max_features_set:
                for class_weight in class_weights:
                    X_train = datasets[dataset]['X_train' + data_variant]
                    y_train = datasets[dataset]['y_train']
                    X_test = datasets[dataset]['X_test' + data_variant]
                    y_test = datasets[dataset]['y_test']               
                    model = DecisionTreeClassifier(max_depth=max_depth, max_features=max_features, random_state=42, class_weight=class_weight)
                    model.fit(X_train, y_train)
                    train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
                    test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
                    train = train.rename(columns = train_columns_rename)
                    test  = test.rename(columns = test_columns_rename)
                    result = pd.concat([train, test], axis = 1)
                    result['dataset'] = dataset
                    result['data_variant'] = data_variant
                    result['class_weight'] = class_weight[1]
                    result = result[dataset_columns]
                    model_results = pd.concat([model_results, result])

model_results.to_csv('decision_tree_modelling.csv', index=False, sep=';') 

In [None]:
results = pd.read_csv("decision_tree_modelling.csv", sep=';')
results = results[results['auc_train'] - results['auc_test'] < 0.05]
results = results.sort_values(by=['auc_train'], ascending=False)
results = results[['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'accuracy_train']]
results.head(10)

Dla modeli z najwyższą wartością AUC występuje znaczna różnica pomiędzy AUC dla zbioru testowego i treningowego, co wskazuje na przeuczenie modelu. Dlatego ograniczyłem analizowane wyniki wyłącznie do modeli, dla których różnica w ocenach zbioru testowego i treningowego nie przekracza 5pp AUC.

Najlepszy model wykazuje 81% AUC przy bardzo słabej wartości f-score. Siła predykcyjna tego modelu jest słaba.

#### 3. Random Forest

Następnym klasyfikatorem użytym w modelowaniu są lasy losowe.
Jest to naturalne rozszerzenie klasyfikacji za pomocą drzew.

W optymalizacji korzystam z wypracowanego schematu postępowania, optymalizacji podlegają te same parametry, co w przypadku drzew decyzyjnych, dodatkowo optymalizuję parametr liczby estymatorów.

In [27]:
classifier = RandomForestClassifier(n_estimators = 100, max_depth=None, max_features=None, random_state=42, class_weight=None)

n_estimators = [100, 300, 400, 500, 700]
max_depth = [5, 6, 7, 8, 9]
max_features = [None, 0.75, 0.90]
weights = [40, 60, 80]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
class_weight.append(None)
param_grid = {'n_estimators':n_estimators, 'max_depth': max_depth, 'max_features': max_features, 'class_weight':class_weight}
param_grid

{'n_estimators': [100, 300, 400, 500, 700],
 'max_depth': [5, 6, 7, 8, 9],
 'max_features': [None, 0.75, 0.9],
 'class_weight': [{0: 1, 1: 40},
  {0: 1, 1: 60},
  {0: 1, 1: 80},
  'balanced',
  None]}

In [29]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'params', 'mean_train_score', 'mean_test_score'])

for dataset in [2, 3, 4]:
    for data_variant in ['_scaled', '_woe_scaled']:
        prefix = 'random_forest_' + str(dataset) + '_' + data_variant
        print("start: " + prefix)
        X = datasets[dataset]['X_train' + data_variant]
        y = datasets[dataset]['y_train']
        grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring='roc_auc', cv=3, n_jobs=-1, verbose = 1)
        datasets[dataset]['grid_search_random_forest_' + data_variant] = grid_search
        result['dataset'], result['data_variant'] = str(dataset), data_variant
        results = pd.concat([results, result])

results.to_csv('random_forest_grid_search.csv', index=False, sep=';') 

start: random_forest_2__scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 12.9min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 31.7min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 57.0min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 85.7min finished
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  if sys.path[0] == '':


start: random_forest_2__woe_scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 13.0min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 31.8min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 57.0min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 85.9min finished
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  if sys.path[0] == '':


start: random_forest_3__scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 18.0min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 44.6min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 79.8min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 120.7min finished
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  if sys.path[0] == '':


start: random_forest_3__woe_scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 18.1min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 44.6min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 79.9min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 120.9min finished
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  if sys.path[0] == '':


start: random_forest_4__scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 14.2min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 35.0min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 62.8min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 94.7min finished
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  if sys.path[0] == '':


start: random_forest_4__woe_scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 14.4min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 35.3min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 63.3min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 95.4min finished
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  if sys.path[0] == '':


In [37]:
results = pd.read_csv("random_forest_grid_search.csv", sep=';')
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(20)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_class_weight,param_max_depth,param_max_features,param_n_estimators,params
411,_woe_scaled,2,0.847175,0.912623,"{0: 1, 1: 40}",7,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
37,_scaled,2,0.847089,0.91248,"{0: 1, 1: 40}",7,0.75,400,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
36,_scaled,2,0.847078,0.91237,"{0: 1, 1: 40}",7,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
38,_scaled,2,0.846965,0.912489,"{0: 1, 1: 40}",7,0.75,500,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
412,_woe_scaled,2,0.846962,0.912753,"{0: 1, 1: 40}",7,0.75,400,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
39,_scaled,2,0.846908,0.912417,"{0: 1, 1: 40}",7,0.75,700,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
413,_woe_scaled,2,0.846857,0.912874,"{0: 1, 1: 40}",7,0.75,500,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
414,_woe_scaled,2,0.846743,0.912551,"{0: 1, 1: 40}",7,0.75,700,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
35,_scaled,2,0.84663,0.911803,"{0: 1, 1: 40}",7,0.75,100,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
396,_woe_scaled,2,0.846581,0.892889,"{0: 1, 1: 40}",6,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."


In [38]:
results = pd.read_csv("random_forest_grid_search.csv", sep=';')
results = results[results['mean_train_score'] - results['mean_test_score'] < 0.05]
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(20)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_class_weight,param_max_depth,param_max_features,param_n_estimators,params
396,_woe_scaled,2,0.846581,0.892889,"{0: 1, 1: 40}",6,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
397,_woe_scaled,2,0.846453,0.893218,"{0: 1, 1: 40}",6,0.75,400,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
24,_scaled,2,0.846389,0.892897,"{0: 1, 1: 40}",6,0.75,700,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
23,_scaled,2,0.846363,0.892891,"{0: 1, 1: 40}",6,0.75,500,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
22,_scaled,2,0.846353,0.892844,"{0: 1, 1: 40}",6,0.75,400,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
398,_woe_scaled,2,0.846349,0.893216,"{0: 1, 1: 40}",6,0.75,500,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
20,_scaled,2,0.846344,0.8925,"{0: 1, 1: 40}",6,0.75,100,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
399,_woe_scaled,2,0.846337,0.893024,"{0: 1, 1: 40}",6,0.75,700,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
21,_scaled,2,0.84632,0.892534,"{0: 1, 1: 40}",6,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
395,_woe_scaled,2,0.846244,0.892707,"{0: 1, 1: 40}",6,0.75,100,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."


Wyniki miar jakości modeli uzyskanych w Grid Search w zależności od parametrów klasyfikatora posortowałem malejąco wg średniej wartości AUC dla zbioru testowego. Analizując wyniki zwrócone dla najlepszych 20 szacowań (powyżej) można stwierdzić, że: 
- róznica na AUC pomiędzy zbiorem testowym i treningowym przekracza 5pp, co może wskazywać na przeuczenie modeli,
- występuje wyłącznie dataset 2,
- jako waga dla klasy objaśnianej występuje prawie wyłącznie wartość 40, 
- parametr max_depth występuje w granicach 6-7,
- parametr max_features występuje wyłącznie w wartości 0.75
- dominuje 'class_weight' na poziomie 40
- przeważa ;max_weight' na poziomie 75%,
- jakość modelu jest nieczuła na liczbę klasyfikatorów w badanym zakresie

Po wprowadzeniu ograniczenia w zbiorze wyników - dopuszczalna różnica AUC_train - AUC_test < 5% (wykluczenie modeli przeuczonych; wyniki poniżej):
- wcześniejsze obserwacje pozostają w mocy.

Dziwi brak czułości metody na parametr liczby estymatorów - prawdopodobnie przeszukiwany zestaw zawierał zbyt wysokie wartości. Podobnie wskazana wartość 'class_weight' to najniższa wartość w badanym zestawie. Niestety ze wzgledu na długi czas obliczeń nie byłem w stanie przetworzyć ponownie Grid Search.

W etapie modelowania generuję zestaw modeli z parametrami klasyfikatora w granicach wskazanych jako najlepsze w Grid Search.

In [42]:
# Random forest - modelowanie
dataset_columns = ['dataset', 'data_variant', 'n_estimators', 'max_depth', 'max_features', 'class_weight', 
                   'auc_train', 'auc_test', 'f1_train', 'f1_test', 'accuracy_train', 'accuracy_test', 
                   'precision_train', 'precision_test', 'recall_train', 'recall_test', 'TN_train', 
                   'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 
                        'f1':'f1_train', 'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 
                        'FN':'FN_train'}

test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 
                       'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 
                       'FN':'FN_test'}

n_estimators_set = [10, 50, 100, 300] 
weights = range(20, 60, 5)
class_weights = [{0:1, 1:x} for x in weights]
max_depths = [5, 6, 7, 8]
max_features_set = [0.65, 0.75, 0.85]

for dataset in [2, 3, 4]:
    for data_variant in ['_scaled', '_woe_scaled']:
        for n_estimators in n_estimators_set:
            for max_depth in max_depths:
                for max_features in max_features_set:
                    for class_weight in class_weights:
                        X_train = datasets[dataset]['X_train' + data_variant]
                        y_train = datasets[dataset]['y_train']
                        X_test = datasets[dataset]['X_test' + data_variant]
                        y_test = datasets[dataset]['y_test']               
                        model = RandomForestClassifier(n_estimators = n_estimators, max_depth=max_depth, max_features=max_features, random_state=42, class_weight=class_weight, n_jobs=-1)
                        model.fit(X_train, y_train)
                        train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
                        test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
                        train = train.rename(columns = train_columns_rename)
                        test  = test.rename(columns = test_columns_rename)
                        result = pd.concat([train, test], axis = 1)
                        result['dataset'] = dataset
                        result['data_variant'] = data_variant
                        result['n_estimators'] = n_estimators
                        result['max_depth'] = max_depth
                        result['max_features'] = max_features
                        result['class_weight'] = class_weight[1]
                        result = result[dataset_columns]
                        model_results = pd.concat([model_results, result])

model_results.to_csv('random_forest_modelling.csv', index=False, sep=';') 

In [43]:
results = pd.read_csv("decision_tree_modelling.csv", sep=';')
results = results[results['auc_train'] - results['auc_test'] < 0.05]
results = results.sort_values(by=['auc_train'], ascending=False)
results = results[['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'accuracy_train']]
results.head(10)

Unnamed: 0,dataset,data_variant,class_weight,auc_train,auc_test,f1_train,accuracy_train
330,4,_scaled,60,0.80945,0.760044,0.083704,0.706811
163,2,_woe_scaled,60,0.805169,0.761743,0.078568,0.679062
139,2,_scaled,60,0.804952,0.76158,0.078471,0.678633
164,2,_woe_scaled,70,0.804064,0.754341,0.083418,0.710427
328,4,_scaled,50,0.804004,0.759423,0.092379,0.754018
137,2,_scaled,50,0.803943,0.760031,0.088238,0.735601
257,3,_woe_scaled,50,0.802554,0.754199,0.088472,0.737945
259,3,_woe_scaled,60,0.802378,0.763759,0.078662,0.682709
165,2,_woe_scaled,70,0.802095,0.761473,0.086938,0.730943
161,2,_woe_scaled,50,0.80076,0.762643,0.088462,0.739493


Powyżej ocena jakości uzyskanych modeli - posortowana malejąco wg AUC na zbiorze treningowym. Usunąłem modele, dla których różnica AUC pomiędzy zbiorem testowym i treniongowym przekraczała 5pp (przeuczenie).
Najlepsze rezultaty otrzymano dla datasetów 2 i 4 przy przeważeniu klas dla zmiennej objaśnianej na poziomie ok 60. Podobnie jak dla poprzednio użytych klasyfikatorów jakość najlepszych modeli nie jest zbyt dobra.


#### 4. Gradient Boosting

Kolejnym z badanych klasyfikatorów jest Gradient Boosting. Pierwszym krokiem jest Grid Search, optymalizacji poddaję następujące parametry:

- learning_rate,
- liczba estymatorów
- maksymalną głębokość drzewa,
- maksymalną ilość zmiennych uwzględnianych w danym kroku budowania drzewa.

In [44]:
classifier = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, subsample=1., max_depth=5, max_features=None, 
                                        random_state=42, verbose=1)
    
learning_rate = [0.01, 0.05, 0.1]
n_estimators = [20, 50, 100, 200, 500]
max_depth = [3, 5, 7]
max_features = [None, 0.5, 0.75, 0.9]                            
param_grid = {'learning_rate': learning_rate, 'n_estimators': n_estimators, 'max_depth':max_depth, 'max_features':max_features}
param_grid

{'learning_rate': [0.01, 0.05, 0.1],
 'n_estimators': [20, 50, 100, 200, 500],
 'max_depth': [3, 5, 7],
 'max_features': [None, 0.5, 0.75, 0.9]}

In [45]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'params', 'mean_train_score', 'mean_test_score'])

for dataset in [3, 4]:
    for data_variant in ['_pca', '_scaled', '_woe_scaled', '_woe_pca']:
        prefix = 'gradient_boosting_' + str(dataset) + '_' + data_variant
        print("start: " + prefix)
        X = datasets[dataset]['X_train' + data_variant]
        y = datasets[dataset]['y_train']
        grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring='roc_auc', cv=4, n_jobs=-1, verbose = 1)
        datasets[dataset]['grid_search_gradient_boosting_' + data_variant] = grid_search
        result['dataset'], result['data_variant'] = str(dataset), data_variant
        results = pd.concat([results, result])

results.to_csv('gradient_boosting_grid_search.csv', index=False, sep=';') 

start: gradient_boosting_3__pca
Fitting 4 folds for each of 180 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 12.9min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 36.2min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 67.5min finished


      Iter       Train Loss   Remaining Time 
         1           0.1513            1.70m
         2           0.1500            1.71m
         3           0.1490            1.66m
         4           0.1478            1.69m
         5           0.1469            1.68m
         6           0.1460            1.68m
         7           0.1451            1.69m
         8           0.1441            1.70m
         9           0.1433            1.71m
        10           0.1425            1.71m
        20           0.1356            1.69m
        30           0.1299            1.66m
        40           0.1252            1.63m
        50           0.1214            1.59m
        60           0.1182            1.57m
        70           0.1151            1.52m
        80           0.1129            1.48m
        90           0.1106            1.45m
       100           0.1089            1.40m
       200           0.0941            1.03m
       300           0.0863           40.54s
       40

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  if sys.path[0] == '':


KeyboardInterrupt: 

In [46]:
results = pd.read_csv("gradient_boosting_grid_search.csv", sep=';')
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(20)

Unnamed: 0,dataset,data_variant,param_learning_rate,param_n_estimators,param_max_depth,param_max_features,mean_train_score,mean_test_score
69,2,_scaled,0.05,500,3,0.5,0.916667,0.849107
439,2,_woe_scaled,0.05,500,3,0.9,0.917349,0.849024
29,2,_scaled,0.01,500,5,0.5,0.929351,0.848823
498,2,_woe_scaled,0.1,200,3,0.9,0.910566,0.848539
389,2,_woe_scaled,0.01,500,5,0.5,0.928798,0.848518
138,2,_scaled,0.1,200,3,0.9,0.911476,0.84818
128,2,_scaled,0.1,200,3,0.5,0.909537,0.84811
74,2,_scaled,0.05,500,3,0.75,0.918456,0.848103
394,2,_woe_scaled,0.01,500,5,0.75,0.926236,0.8481
434,2,_woe_scaled,0.05,500,3,0.75,0.918972,0.848044


Procedura nie została zakończona z uwagi na dość długi czas oblcizeń.
W oparciu o wyniki uzyskane z częściowego przetworzenia można wyciągnąć następujace wnioski:
- lekkie przeuczenie modeli (różnica AUC train-test w granicach 6-7 pp)
- lepsze wyniki niż dla pozostałych klasyfikatorów,
- 'learning_rate' zróżnicowany w top 20 --> niewielki wpływ na jakość
- 'n_estimators' - dominuje 500 i 200
- 'max_depth; - dominuje 3 i 5 ("weak learners")
- 'max_features' - zróżnicowany

In [48]:
results = pd.read_csv("gradient_boosting_grid_search.csv", sep=';')
results = results[results['mean_train_score'] - results['mean_test_score'] < 0.05]
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(20)

Unnamed: 0,dataset,data_variant,param_learning_rate,param_n_estimators,param_max_depth,param_max_features,mean_train_score,mean_test_score
487,2,_woe_scaled,0.1,100,3,0.5,0.889047,0.847712
68,2,_scaled,0.05,200,3,0.5,0.889684,0.84755
127,2,_scaled,0.1,100,3,0.5,0.888654,0.847411
433,2,_woe_scaled,0.05,200,3,0.75,0.891054,0.847363
438,2,_woe_scaled,0.05,200,3,0.9,0.890735,0.847318
137,2,_scaled,0.1,100,3,0.9,0.890451,0.847225
428,2,_woe_scaled,0.05,200,3,0.5,0.889247,0.847031
78,2,_scaled,0.05,200,3,0.9,0.891683,0.846907
73,2,_scaled,0.05,200,3,0.75,0.891487,0.846885
423,2,_woe_scaled,0.05,200,3,,0.890987,0.846856


Po wyeliminowaniu modeli przeuczonych ( > 5pp różnicy na AUC pomiędzy train-test) - w oparciu o prezentowane wyniki można podtrzymać wysnute wcześniej wnioski.

Wyniki częsciowego przebiegu Grid Search są wystarczające, zeby przejść do fazy wyznaczania modelu.

In [62]:
# Gradient Boosting - modelowanie
dataset_columns = ['dataset', 'data_variant', 'n_estimators', 'max_depth', 'max_features',  
                   'auc_train', 'auc_test', 'f1_train', 'f1_test', 'accuracy_train', 'accuracy_test', 
                   'precision_train', 'precision_test', 'recall_train', 'recall_test', 'TN_train', 
                   'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 
                        'f1':'f1_train', 'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 
                        'FN':'FN_train'}

test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 
                       'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 
                       'FN':'FN_test'}

n_estimators_set = [100, 300] 
max_depths = [3, 4, 5]
max_features_set = [0.5, 0.75, 0.9]


for dataset in [2, 3, 4]:
    for data_variant in ['_scaled', '_woe_scaled']:
        for n_estimators in n_estimators_set:
            for max_depth in max_depths:
                for max_features in max_features_set:
                    print(f"dataset: {dataset}, data variant: {data_variant}, n_estimators: {n_estimators}, max_depth: {max_depth},max_features: {max_features}")
                    X_train = datasets[dataset]['X_train' + data_variant]
                    y_train = datasets[dataset]['y_train']
                    X_test = datasets[dataset]['X_test' + data_variant]
                    y_test = datasets[dataset]['y_test']               
                    model = GradientBoostingClassifier(learning_rate=0.01, n_estimators=n_estimators, max_depth=max_depth, subsample = 1.,
                                                       max_features=max_features, random_state=42, verbose=0)
                    model.fit(X_train, y_train)
                    train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
                    test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
                    train = train.rename(columns = train_columns_rename)
                    test  = test.rename(columns = test_columns_rename)
                    result = pd.concat([train, test], axis = 1)
                    result['dataset'] = dataset
                    result['data_variant'] = data_variant
                    result['n_estimators'] = n_estimators
                    result['max_depth'] = max_depth
                    result['max_features'] = max_features
                    result = result[dataset_columns]
                    model_results = pd.concat([model_results, result])

model_results.to_csv('gradient_boosting_modelling.csv', index=False, sep=';') 

dataset: 2, data variant: _scaled, n_estimators: 100, max_depth: 3,max_features: 0.5


  'recall', 'true', average, warn_for)
  'recall', 'true', average, warn_for)


dataset: 2, data variant: _scaled, n_estimators: 100, max_depth: 3,max_features: 0.75
dataset: 2, data variant: _scaled, n_estimators: 100, max_depth: 3,max_features: 0.9
dataset: 2, data variant: _scaled, n_estimators: 100, max_depth: 4,max_features: 0.5
dataset: 2, data variant: _scaled, n_estimators: 100, max_depth: 4,max_features: 0.75
dataset: 2, data variant: _scaled, n_estimators: 100, max_depth: 4,max_features: 0.9
dataset: 2, data variant: _scaled, n_estimators: 100, max_depth: 5,max_features: 0.5
dataset: 2, data variant: _scaled, n_estimators: 100, max_depth: 5,max_features: 0.75
dataset: 2, data variant: _scaled, n_estimators: 100, max_depth: 5,max_features: 0.9
dataset: 2, data variant: _scaled, n_estimators: 300, max_depth: 3,max_features: 0.5
dataset: 2, data variant: _scaled, n_estimators: 300, max_depth: 3,max_features: 0.75
dataset: 2, data variant: _scaled, n_estimators: 300, max_depth: 3,max_features: 0.9
dataset: 2, data variant: _scaled, n_estimators: 300, max_dep

dataset: 4, data variant: _woe_scaled, n_estimators: 100, max_depth: 5,max_features: 0.5
dataset: 4, data variant: _woe_scaled, n_estimators: 100, max_depth: 5,max_features: 0.75
dataset: 4, data variant: _woe_scaled, n_estimators: 100, max_depth: 5,max_features: 0.9
dataset: 4, data variant: _woe_scaled, n_estimators: 300, max_depth: 3,max_features: 0.5
dataset: 4, data variant: _woe_scaled, n_estimators: 300, max_depth: 3,max_features: 0.75
dataset: 4, data variant: _woe_scaled, n_estimators: 300, max_depth: 3,max_features: 0.9
dataset: 4, data variant: _woe_scaled, n_estimators: 300, max_depth: 4,max_features: 0.5
dataset: 4, data variant: _woe_scaled, n_estimators: 300, max_depth: 4,max_features: 0.75
dataset: 4, data variant: _woe_scaled, n_estimators: 300, max_depth: 4,max_features: 0.9
dataset: 4, data variant: _woe_scaled, n_estimators: 300, max_depth: 5,max_features: 0.5
dataset: 4, data variant: _woe_scaled, n_estimators: 300, max_depth: 5,max_features: 0.75
dataset: 4, data 

In [65]:
results = pd.read_csv("gradient_boosting_modelling.csv", sep=';')
# results = results[results['auc_train'] - results['auc_test'] < 0.05]
results = results.sort_values(by=['auc_train'], ascending=False)
results = results[['dataset', 'data_variant', 'auc_train', 'auc_test', 'f1_train', 'accuracy_train']]
results.head(10)

Unnamed: 0,dataset,data_variant,auc_train,auc_test,f1_train,accuracy_train
35,2,_woe_scaled,0.503141,0.499977,0.012487,0.985459
70,3,_woe_scaled,0.503141,0.499977,0.012487,0.985459
16,2,_scaled,0.503141,0.499977,0.012487,0.985459
15,2,_scaled,0.502618,0.5,0.010417,0.985444
34,2,_woe_scaled,0.502618,0.499977,0.010417,0.985444
71,3,_woe_scaled,0.502618,0.499977,0.010417,0.985444
17,2,_scaled,0.502618,0.499977,0.010417,0.985444
106,4,_woe_scaled,0.502094,0.5,0.008342,0.985429
105,4,_woe_scaled,0.502094,0.5,0.008342,0.985429
33,2,_woe_scaled,0.502094,0.5,0.008342,0.985429


Metoda skupia się wyłącznie na optymalizacji accuracy, co przy niezrównoważonej próbce prowadzi do 'zdegenerowanego' modelu, który w zasadzie przypisuje wszystkim rekordom zerową zmienną obserwowaną.

Na etapie Grid Searcha optymalizowałem AUC, sam klasyfikator nie daje możliwości wyboru optymalizowanej miary.
Nie ma też mozliwości w samym klasyfikatorze przeważenia klas w zmiennej objaśnianej, co zaskutkowało 'klęską' w poszukiwaniu optymalnego modelu.

#### 5. XGBoost

Ostatnim wypróbowanym klasyfikatorem jest XGBoost.
Przebieg optymalizacji - jak dla pozostałych klasyfikatorów.

In [50]:
classifier = xgboost.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=100, verbosity=1, scale_pos_weight=1)
    
learning_rate = [0.01, 0.1]
n_estimators = [200, 400, 600]
max_depth = [3, 5, 7]
scale_pos_weight = [20, 40, 60]                            
param_grid = {'learning_rate': learning_rate, 'n_estimators': n_estimators, 'max_depth':max_depth, 'scale_pos_weight':scale_pos_weight}
param_grid

{'learning_rate': [0.01, 0.1],
 'n_estimators': [200, 400, 600],
 'max_depth': [3, 5, 7],
 'scale_pos_weight': [20, 40, 60]}

In [None]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'params', 'mean_train_score', 'mean_test_score'])

for dataset in datasets:
    for data_variant in ['_scaled', '_woe_scaled']:
        prefix = 'xgboost_' + str(dataset) + '_' + data_variant
        print("start: " + prefix)
        X = datasets[dataset]['X_train' + data_variant]
        y = datasets[dataset]['y_train']
        grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring='roc_auc', cv=4, n_jobs=-1, verbose = 1)
        datasets[dataset]['grid_search_xgboost_' + data_variant] = grid_search
        result['dataset'], result['data_variant'] = str(dataset), data_variant
        results = pd.concat([results, result])

results.to_csv('xgboost_grid_search.csv', index=False, sep=';') 

In [51]:
results = pd.read_csv("xgboost_grid_search.csv", sep=';')
results = results.drop(['params'], axis=1)
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(20)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_learning_rate,param_max_depth,param_n_estimators,param_scale_pos_weight
0,_woe_scaled,3,0.853271,0.926956,0.1,3,200,20
1,_woe_scaled,2,0.85259,0.914178,0.01,5,600,20
2,_woe_scaled,2,0.852185,0.916303,0.01,5,600,40
3,_woe_scaled,2,0.852183,0.922133,0.1,3,200,20
4,_scaled,2,0.851975,0.921423,0.1,3,200,20
5,_scaled,2,0.851975,0.913451,0.01,5,600,20
6,_woe_scaled,3,0.851967,0.919583,0.01,5,600,20
7,_scaled,1,0.851927,0.914013,0.1,3,200,20
8,_woe_scaled,1,0.851908,0.913781,0.1,3,200,20
9,_scaled,3,0.85188,0.918689,0.01,5,600,20


Wnioski:
- lekkie przetrenowanie modeli,
- wartości AUC porównywalne z otrzymanymi dla Gradient Boostingu
- learning_rate nie wpływa znacząco na jakość modelu
- max depth na poziomie 3-5 ("weak learner")
- liczba estymatorów - preferowana wysoka
- parametr skalujacy zmienną obserwowaną - na poziomie ok. 20.

Poniżej ten sam zestaw z ograniczeniem na AUC_train - AUC_test < 5%.
Analiza tego zestawienia wnosi nowe informacje:
- learning rate - preferowany poziom 0.01,
- scale_pos_weight - bez większego wpływu na model.

In [54]:
results = pd.read_csv("xgboost_grid_search.csv", sep=';')
results = results.drop(['params'], axis=1)
results = results[results['mean_train_score'] - results['mean_test_score'] < 0.05]
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(20)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_learning_rate,param_max_depth,param_n_estimators,param_scale_pos_weight
13,_woe_scaled,2,0.85155,0.90109,0.01,5,400,20
14,_scaled,2,0.85151,0.877264,0.01,3,600,40
16,_scaled,2,0.851333,0.878806,0.01,3,600,60
17,_woe_scaled,2,0.851313,0.874652,0.01,3,600,20
20,_scaled,2,0.851285,0.874665,0.01,3,600,20
21,_woe_scaled,2,0.851274,0.877272,0.01,3,600,40
23,_woe_scaled,2,0.851085,0.87875,0.01,3,600,60
37,_scaled,1,0.850593,0.895698,0.01,5,400,20
40,_woe_scaled,1,0.850402,0.895804,0.01,5,400,40
41,_scaled,3,0.850399,0.875355,0.01,3,600,20


In [55]:
# XGBoost - modelowanie
dataset_columns = ['dataset', 'data_variant', 'n_estimators', 'max_depth',   
                   'auc_train', 'auc_test', 'f1_train', 'f1_test', 'accuracy_train', 'accuracy_test', 
                   'precision_train', 'precision_test', 'recall_train', 'recall_test', 'TN_train', 
                   'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 
                        'f1':'f1_train', 'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 
                        'FN':'FN_train'}

test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 
                       'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 
                       'FN':'FN_test'}

n_estimators_set = [400, 500, 600] 
max_depths = [3, 4, 5]

for dataset in datasets:
    for data_variant in ['_scaled', '_woe_scaled']:
        print(f"start: dataset {dataset} data variant {data_variant}")
        for n_estimators in n_estimators_set:
            for max_depth in max_depths:
                X_train = datasets[dataset]['X_train' + data_variant]
                y_train = datasets[dataset]['y_train']
                X_test = datasets[dataset]['X_test' + data_variant]
                y_test = datasets[dataset]['y_test']               
                model = xgboost.XGBClassifier(n_jobs = 10, max_depth=max_depth, learning_rate=0.01, n_estimators=n_estimators, verbosity=0, scale_pos_weight=50)
                model.fit(X_train, y_train)
                train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
                test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
                train = train.rename(columns = train_columns_rename)
                test  = test.rename(columns = test_columns_rename)
                result = pd.concat([train, test], axis = 1)
                result['dataset'] = dataset
                result['data_variant'] = data_variant
                result['n_estimators'] = n_estimators
                result['max_depth'] = max_depth
                result = result[dataset_columns]
                model_results = pd.concat([model_results, result])

model_results.to_csv('xgboost_modelling.csv', index=False, sep=';') 

start: dataset 1 data variant _scaled


  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


start: dataset 1 data variant _woe_scaled


  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


start: dataset 2 data variant _scaled


  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


start: dataset 2 data variant _woe_scaled


  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


start: dataset 3 data variant _scaled


  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


start: dataset 3 data variant _woe_scaled


  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


start: dataset 4 data variant _scaled


  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


start: dataset 4 data variant _woe_scaled


  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


In [58]:
results = pd.read_csv("xgboost_modelling.csv", sep=';')
results = results[results['auc_train'] - results['auc_test'] < 0.05]
results = results.sort_values(by=['auc_train'], ascending=False)
results = results[['dataset', 'data_variant', 'auc_train', 'auc_test', 'f1_train', 'accuracy_train']]
results.head(10)

Unnamed: 0,dataset,data_variant,auc_train,auc_test,f1_train,accuracy_train
44,3,_scaled,0.830461,0.758772,0.106951,0.785827
53,3,_woe_scaled,0.828045,0.754404,0.106523,0.786149
71,4,_woe_scaled,0.825912,0.749897,0.106004,0.786011
26,2,_scaled,0.825163,0.748335,0.104699,0.782502
35,2,_woe_scaled,0.823859,0.755627,0.104243,0.781966
62,4,_scaled,0.823463,0.752925,0.104882,0.784234
50,3,_woe_scaled,0.820402,0.763251,0.102013,0.777185
8,1,_scaled,0.820384,0.757615,0.102307,0.778166
41,3,_scaled,0.820267,0.758091,0.102211,0.777936
32,2,_woe_scaled,0.819057,0.755912,0.100935,0.774535


#### Podsumowanie

Sumarycznie przeprowadzono przeliczenia dla:

- 4 różnych zestawów danych różniących się sposobem uzupełnienia wartości brakujacych,
- różnych sposobów obróbki danych, z uwzględnieniem redukcji wymiarowości, skalowania, kodowania WOE dla zmiennych kategorycznych,
- 5 różnych klasyfikatorów.

Z uwagi na niepowodzenie w modelowaniu z wykorzystaniem Gradient Boostingu pominę wyniki tego klasyfikatora w zestawieniu.

In [75]:
columns = ['klasyfikator', 'auc_train', 'auc_test', 'f1_train', 'accuracy_train', 'TN_train', 'FN_train', 'FP_train', 'TP_train']
columns_rename = {'auc_train':'AUC_train', 'auc_test':'AUC_test', 'f1_train':'F1', 'accuracy_train':'ACC', 'TN_train':'TN', 'FN_train':'FN', 'FP_train':'FP', 'TP_train':'TP'}
columns_formats = {'AUC_train':'{:5.4f}', 'AUC_test':'{:5.4f}', 'F1':'{:5.4f}', 'ACC':'{:5.4f}', 'TN':'{:5.0f}', 'FN':'{:5.0f}','FP':'{:5.0f}', 'TP':'{:5.0f}' }

logistic_regression = pd.read_csv("logistic_regression_modelling.csv", sep = ';')
decision_tree = pd.read_csv("decision_tree_modelling.csv", sep = ';')
random_forest = pd.read_csv("random_forest_modelling.csv", sep = ';')
xgboost = pd.read_csv("xgboost_modelling.csv", sep = ';')

logistic_regression['klasyfikator'] = 'regresja logistyczna' 
decision_tree['klasyfikator'] = 'drzewo decyzyjne' 
random_forest['klasyfikator'] = 'las losowy' 
xgboost['klasyfikator'] = 'XGBoost' 

logistic_regression = logistic_regression[columns].rename(columns_rename, axis=1)
decision_tree = decision_tree[columns].rename(columns_rename, axis=1)
random_forest = random_forest[columns].rename(columns_rename, axis=1)
xgboost = xgboost[columns].rename(columns_rename, axis=1)

Poniżej zestawiono miary dla najlepszych modeli uzyskanych w każdej z metod.

a. najlepsze modele pod względem zwróconej miary AUC_train:

In [93]:
logistic_regression_1 = logistic_regression.sort_values(by=['AUC_train'], ascending=False)[:1]
decision_tree_1 = decision_tree.sort_values(by=['AUC_train'], ascending=False)[:1]
random_forest_1 = random_forest.sort_values(by=['AUC_train'], ascending=False)[:1]
xgboost_1 = xgboost.sort_values(by=['AUC_train'], ascending=False)[:1]
best_AUC_train = pd.concat([logistic_regression_1, decision_tree_1, random_forest_1, xgboost_1])
best_AUC_train = best_AUC_train.sort_values(by=['AUC_train'], ascending=False)
best_AUC_train.style.format(columns_formats)

Unnamed: 0,klasyfikator,AUC_train,AUC_test,F1,ACC,TN,FN,FP,TP
1143,las losowy,0.8432,0.7512,0.116,0.8027,51545,12765,110,845
44,XGBoost,0.8305,0.7588,0.107,0.7858,50450,13860,118,837
280,drzewo decyzyjne,0.8174,0.7105,0.0961,0.7581,48638,15672,116,839
133,regresja logistyczna,0.7602,0.7496,0.0738,0.6971,44711,19599,167,788


b. najlepsze modele pod względem zwróconej miary AUC_test:

In [92]:
logistic_regression_2 = logistic_regression.sort_values(by=['AUC_test'], ascending=False)[:1]
decision_tree_2 = decision_tree.sort_values(by=['AUC_test'], ascending=False)[:1]
random_forest_2 = random_forest.sort_values(by=['AUC_test'], ascending=False)[:1]
xgboost_2 = xgboost.sort_values(by=['AUC_test'], ascending=False)[:1]
best_AUC_test = pd.concat([logistic_regression_2, decision_tree_2, random_forest_2, xgboost_2])
best_AUC_test = best_AUC_test.sort_values(by=['AUC_test'], ascending=False)
best_AUC_test.style.format(columns_formats)

Unnamed: 0,klasyfikator,AUC_train,AUC_test,F1,ACC,TN,FN,FP,TP
1263,las losowy,0.7908,0.7716,0.088,0.7463,47910,16400,156,799
153,drzewo decyzyjne,0.7907,0.77,0.0856,0.7349,47154,17156,145,810
13,XGBoost,0.7988,0.7673,0.0928,0.76,48802,15508,154,801
70,regresja logistyczna,0.7551,0.7527,0.0718,0.6882,44126,20184,168,787


c. najlepsze modele pod względem zróconej miary F1:

In [94]:
logistic_regression_3 = logistic_regression.sort_values(by=['F1'], ascending=False)[:1]
decision_tree_3 = decision_tree.sort_values(by=['F1'], ascending=False)[:1]
random_forest_3 = random_forest.sort_values(by=['F1'], ascending=False)[:1]
xgboost_3 = xgboost.sort_values(by=['F1'], ascending=False)[:1]
best_F1 = pd.concat([logistic_regression_3, decision_tree_3, random_forest_3, xgboost_3])
best_F1 = best_F1.sort_values(by=['F1'], ascending=False)
best_F1.style.format(columns_formats)

Unnamed: 0,klasyfikator,AUC_train,AUC_test,F1,ACC,TN,FN,FP,TP
2096,las losowy,0.7226,0.5985,0.2612,0.9605,62230,2080,499,456
44,XGBoost,0.8305,0.7588,0.107,0.7858,50450,13860,118,837
377,drzewo decyzyjne,0.7969,0.6979,0.0992,0.7848,50446,13864,182,773
117,regresja logistyczna,0.7267,0.7007,0.0978,0.8334,53805,10505,366,589


Uzyskane modele mają słabe parametry jakościowe i raczej nie będą miały dobrej siły predykcyjnej.
Najlepsze pod względem AUC_train są przetrenowane (za wyjątkiem najsłabszej regresji logistycznej).

Faworyci pod względem AUC_test mają niskie wartości AUC na poziomie nie przekraczającym 80% przy bardzo niskim poziomie F1-score (poniżej 19%).

Metodą, która zwróciła najlepszy wynik jest las losowy. 
Dla modelu maksymalizującego AUC_test - 79% AUC na zbiorze treningowym, 77% AUC na zbiorze testowym oraz prawie 9% F1-score.
Wynik o podobnych parametrach zwróciła również metoda XGBoost.

Trudno uzasadnić uzyskane wyniki bez dobrej znajomości zbioru danych.
Możliwe, że parametry w datasecie nie są dobtymi predyktorami zmiennej objaśnianej.

Prawdopodobnie zachowanie klientów determinującej wzięcie kredytu nie jest opisywane wyłącznie suchymi faktami takimi jak miejsce złożenia wniosku, wiek, zarobki, parametry oferty kredytowej. Zmienne te najwyraźniej mają niewielką siłę predykcyjną w tym zakresie.