## Part 2 - modelling

#### Steps:
- loading the data for modelling
- preparing train-test datasets in several variants
- modelling with the selected classifiers
- discussion and summary of the results

In [1]:
# package imports
import numpy as np
import pandas as pd
from IPython.display import HTML
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import pickle
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from math import log

### I Preparing the data for modelling

#### Data import

The modelling data was prepared in 4 versions, which differ in how missing values were filled in for 5 variables:

- 'offer_amount',
- 'offer_period',
- 'interest_rate',
- 'fee',
- 'offer_monthly_obligation'.

The variants are:

- 1 - the variables are dropped from the dataset,
- 2 - missing values filled with the median,
- 3 - missing values filled by sampling from a normal distribution,
- 4 - missing values filled by copying values from other records of the dataset.

A model is built for each variant independently.
The imputation method is assessed afterwards, based on the quality of the resulting models.

In [4]:
# data import
col_names = ['ID', 'gender', 'city', 'income', 'birth_date', 'application_date', 'requested_amount', 
             'requested_period', 'financial_obligations', 'employer_name', 'account_bank', 
             'mobile_verification_flag', 'var_5', 'var_1', 'offer_amount', 'offer_period', 'interest_rate', 
             'fee', 'offer_monthly_obligation', 'filled_form_flag', 'device', 'var_2', 'source', 'var_4', 
             'disbursed_flag', 'latitude', 'longitude', 'age']

datasets = {
    1:{'name':'none'},
    2:{'name':'median'},
    3:{'name':'distribution'},
    4:{'name':'observation'}
}

datasets[2]['data'] = pd.read_csv('dataset_inputation_median.csv', delimiter = ';', engine='python', header = None, names = col_names, 
                      index_col = 0, skiprows = 1 )
datasets[3]['data'] = pd.read_csv('dataset_inputation_distribution.csv', delimiter = ';', engine='python', header = None, names = col_names, 
                      index_col = 0, skiprows = 1 )
datasets[4]['data'] = pd.read_csv('dataset_inputation_random_observation.csv', delimiter = ';', engine='python', header = None, names = col_names, 
                      index_col = 0, skiprows = 1 )
datasets[1]['data'] = datasets[2]['data'].copy()

columns_to_drop = ['city', 'birth_date', 'application_date', 'employer_name', 'account_bank']
columns_to_rename = {'ID':'id', 
                     'gender':'cat01', 
                     'income':'num01', 
                     'requested_amount':'num02', 
                     'requested_period':'num03',
                     'financial_obligations':'num04',
                     'mobile_verification_flag':'cat02',
                     'var_5':'cat03',
                     'var_1':'cat04',
                     'offer_amount':'num05',
                     'offer_period':'num06',
                     'interest_rate':'num07',
                     'fee':'num08',
                     'offer_monthly_obligation':'num09',
                     'filled_form_flag':'cat05',
                     'device':'cat06',
                     'var_2':'cat07',
                     'source':'cat08',
                     'var_4':'cat09',
                     'disbursed_flag':'explained',
                     'latitude':'num10',
                     'longitude':'num11',
                     'age':'num12'
}

for item in datasets:
    datasets[item]['data'] = datasets[item]['data'].drop(columns_to_drop, axis=1).rename(columns=columns_to_rename)

datasets[1]['data'] = datasets[1]['data'].drop(['num05','num06','num07','num08','num09'], axis=1)


#### Train-test split

For each data variant I split the dataset into a training set and a test set in a 3:1 ratio.
The positive class of the target variable is rare, so I use the 'stratify' option to keep the class proportions the same in both sets.

Because positive observations are so rare, the modelling step will need to compensate for the imbalance by weighting the classes.

In [5]:
licznosc = datasets[1]['data'].shape[0]
licznosc_y = sum(datasets[1]['data']['explained'])
licznosc_y_procent = licznosc_y / licznosc
print(f"dataset size = {licznosc}")
print(f"number of positive observations of the target variable = {licznosc_y}")
print(f"share of positive observations of the target variable in the dataset = {licznosc_y_procent}\n")

# Train - test split
train_test_split_ratio = 0.25

for item in datasets:
    datasets[item]['X'] = datasets[item]['data'].drop(['explained'], axis=1)
    datasets[item]['y'] = datasets[item]['data']['explained']
    # stratify on y so that the rare positive class keeps the same share in both parts
    (datasets[item]['X_train'], datasets[item]['X_test'],
     datasets[item]['y_train'], datasets[item]['y_test']) = train_test_split(
        datasets[item]['X'], datasets[item]['y'], test_size=train_test_split_ratio,
        random_state=42, stratify=datasets[item]['y'])
    y_train_share = sum(datasets[item]['y_train'])/datasets[item]['X_train'].shape[0]
    y_test_share  = sum(datasets[item]['y_test']) /datasets[item]['X_test'].shape[0]
    print(f"Dataset {item}: train - {y_train_share} 'ones', test - {y_test_share} 'ones'")

dataset size = 87020
number of positive observations of the target variable = 1273.0
share of positive observations of the target variable in the dataset = 0.01462882096069869

Dataset 1: train - 0.014632651497739983 'ones', test - 0.01461732934957481 'ones'
Dataset 2: train - 0.014632651497739983 'ones', test - 0.01461732934957481 'ones'
Dataset 3: train - 0.014632651497739983 'ones', test - 0.01461732934957481 'ones'
Dataset 4: train - 0.014632651497739983 'ones', test - 0.01461732934957481 'ones'
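
To illustrate what stratification does, here is a minimal pure-Python sketch. It is not the scikit-learn implementation: it simply splits the indices of each class separately, so both parts keep the same share of positives. The helper name `stratified_split` and the toy labels are assumptions made for this example.

```python
import random

def stratified_split(labels, test_size, seed=42):
    """Split indices into train/test so that each class keeps its share (illustrative helper)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # indices of all observations belonging to this class
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)  # per-class test quota
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

# toy data: 1000 observations, 20 of them positive (2%)
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels, test_size=0.25)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_share, test_share)
```

Without the per-class quota, a purely random 25% sample of such a rare class could easily end up with a noticeably different share of positives than the training part.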


#### Handling categorical variables

During data preparation the categorical variables were encoded by the increasing share of positive observations in each category. A more suitable approach, especially for logistic regression, is to use WOE (Weight of Evidence) coefficients, i.e. to encode each category as ln(number of positive observations / number of negative observations) within that category.

For each dataset I compute WOE weights and use them to re-encode the categorical variables. Only observations from the training set are used to compute the weights.

In [6]:
# WOE calculation for categorical
categorical_variables = [name for name in datasets[1]['data'].columns if 'cat' in name]

for item in datasets:
    # .copy() gives an independent frame, so adding columns does not trigger SettingWithCopyWarning
    WOE_dataset = datasets[item]['X_train'][categorical_variables].copy()
    WOE_dataset['positive'] = datasets[item]['y_train']
    WOE_dataset['negative'] = 1 - WOE_dataset['positive']
    WOE_mapper = pd.DataFrame(columns=['feature_name', 'feature_code', 'WOE'])
    for variable in categorical_variables:
        variable_data = WOE_dataset[[variable, 'positive', 'negative']]
        out = variable_data.groupby([variable]).sum()
        # WOE = ln(positives / negatives); -12 is the fallback when either count is zero
        out['WOE'] = out.apply(lambda x: log(x['positive'] / x['negative']) if x['positive'] > 0 and x['negative'] > 0 else -12, axis = 1)
        out['feature_name'] = variable
        out['feature_code'] = out.index
        out = out[['feature_name', 'feature_code', 'WOE']].reset_index(drop=True)
        WOE_mapper = pd.concat([WOE_mapper, out])
    datasets[item]['WOE_mapper'] = WOE_mapper


I apply the encoding to both the training and the test set. Data variants with WOE encoding are stored under separate keys.

In [7]:
# data with WOE

def WOE_mapping(dataset, WOE_mapper):
    output = dataset.copy()
    for variable in categorical_variables:
        mapper = WOE_mapper[WOE_mapper['feature_name'] == variable]
        mapper_dict = {}
        for category in mapper.index:
            mapper_dict[mapper.loc[category]['feature_code']] = mapper.loc[category]['WOE']
        output[variable] = output.apply(lambda x: mapper_dict[x[variable]], axis=1)
    return output

for item in datasets:
    datasets[item]['X_train_woe'] = WOE_mapping(datasets[item]['X_train'], datasets[item]['WOE_mapper'])
    datasets[item]['X_test_woe'] = WOE_mapping(datasets[item]['X_test'], datasets[item]['WOE_mapper'])
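
The WOE steps above can be sketched on a toy example. This is a simplified pure-Python illustration, not the notebook's pandas code: the helper name `woe_table` and the toy categories are assumptions, while the formula ln(positives / negatives) and the fallback value of -12 follow the cells above.

```python
from collections import defaultdict
from math import log

def woe_table(categories, targets, fallback=-12.0):
    """Return {category: ln(positives / negatives)} computed from training data only."""
    pos, neg = defaultdict(int), defaultdict(int)
    for cat, y in zip(categories, targets):
        if y == 1:
            pos[cat] += 1
        else:
            neg[cat] += 1
    table = {}
    for cat in set(pos) | set(neg):
        if pos[cat] > 0 and neg[cat] > 0:
            table[cat] = log(pos[cat] / neg[cat])
        else:
            table[cat] = fallback  # the logarithm is undefined when a count is zero
    return table

# weights are estimated on the training part only ...
train_cat = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c']
train_y   = [ 1,   0,   0,   0,   1,   1,   0,   0,   0,   0 ]
woe = woe_table(train_cat, train_y)

# ... and then applied unchanged to the test part
test_cat = ['b', 'c', 'a']
test_encoded = [woe[c] for c in test_cat]
print(woe, test_encoded)
```

Estimating the weights on the training set alone keeps information about the test labels out of the features.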

#### Standardization

The numerical variables in the dataset are on very different scales: for example, customer age lies roughly in [0, 100], while income reaches into the millions (it is expressed in Indian rupees). Standardizing the variables evens out these differences; I use the built-in StandardScaler() for this.

I standardize both the sets with the ordinary encoding of categorical variables and the sets with WOE encoding.
The scaled sets are stored under separate keys.

In [8]:
# data scaled
for item in datasets:
    datasets[item]['scaler'] = StandardScaler()
    datasets[item]['WOE_scaler'] = StandardScaler()
    datasets[item]['scaler'].fit(datasets[item]['X_train'])
    datasets[item]['WOE_scaler'].fit(datasets[item]['X_train_woe'])
    datasets[item]['X_train_scaled'] = datasets[item]['scaler'].transform(datasets[item]['X_train'])
    datasets[item]['X_test_scaled'] = datasets[item]['scaler'].transform(datasets[item]['X_test'])
    datasets[item]['X_train_woe_scaled'] = datasets[item]['WOE_scaler'].transform(datasets[item]['X_train_woe'])
    datasets[item]['X_test_woe_scaled'] = datasets[item]['WOE_scaler'].transform(datasets[item]['X_test_woe'])    
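
The fit-on-train, transform-both pattern used above can be sketched for a single feature in plain Python. The helper names and the income values are hypothetical; the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Estimate mean and population standard deviation on the training data only."""
    return fmean(train_values), pstdev(train_values)

def transform(values, mean, std):
    """Standardize values with previously fitted parameters."""
    return [(v - mean) / std for v in values]

train_income = [20_000.0, 40_000.0, 60_000.0]   # hypothetical training values
test_income = [40_000.0, 100_000.0]             # hypothetical test values

mean, std = fit_scaler(train_income)
train_scaled = transform(train_income, mean, std)
test_scaled = transform(test_income, mean, std)  # reuses the training mean and std
print(train_scaled, test_scaled)
```

The test values are scaled with the training parameters, so a test observation can legitimately land far outside the range seen in training.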

#### Dimensionality reduction

Some variables may be so weakly correlated with the target that they contribute little to model quality. Dimensionality reduction makes it possible to extract the most informative components for building the model. Reducing the amount of data speeds up modelling at the cost of a small loss in model quality, which matters most for datasets with many model variables.

In our case dimensionality reduction is not necessary, because the number of model variables is small (21 at most). I nevertheless build a reduced version of each dataset, on the condition that the explained variance drops by at most 5%. I use Principal Component Analysis, which is available in sklearn.

PCA is applied to the standardized sets, both with the ordinary encoding of categorical variables and with WOE encoding. The PCA-transformed data is stored in separate sets.

In [9]:
# PCA

expected_explained_variance_ratio = 0.95

def pca_model(dataset):
    pca = PCA(random_state=42)
    pca.fit(dataset)
    explained_variance = pca.explained_variance_ratio_
    counter = 0
    explained_variance_cumulative = 0
    for variance in explained_variance:
        if explained_variance_cumulative <= expected_explained_variance_ratio:
            explained_variance_cumulative += variance
            counter += 1
    pca = PCA(random_state=42, n_components = counter)
    pca.fit(dataset)
    return pca

for item in datasets:
    datasets[item]['PCA'] = pca_model(datasets[item]['X_train_scaled'])
    datasets[item]['WOE_PCA'] = pca_model(datasets[item]['X_train_woe_scaled'])
    datasets[item]['X_train_pca'] = datasets[item]['PCA'].transform(datasets[item]['X_train_scaled'])
    datasets[item]['X_test_pca'] = datasets[item]['PCA'].transform(datasets[item]['X_test_scaled'])
    datasets[item]['X_train_woe_pca'] = datasets[item]['WOE_PCA'].transform(datasets[item]['X_train_woe_scaled'])
    datasets[item]['X_test_woe_pca'] = datasets[item]['WOE_PCA'].transform(datasets[item]['X_test_woe_scaled'])
    print(f"Dataset {item}: number of variables before PCA: {datasets[item]['X_train'].shape[1]}, number of variables after PCA: {datasets[item]['X_train_pca'].shape[1]}")

Dataset 1: number of variables before PCA: 16, number of variables after PCA: 13
Dataset 2: number of variables before PCA: 21, number of variables after PCA: 16
Dataset 3: number of variables before PCA: 21, number of variables after PCA: 17
Dataset 4: number of variables before PCA: 21, number of variables after PCA: 17
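
The component-counting loop in `pca_model` amounts to finding the first position at which the cumulative explained variance ratio exceeds the threshold. A minimal sketch of that rule, with a hypothetical helper name and made-up ratios:

```python
from itertools import accumulate

def components_needed(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio exceeds the threshold."""
    for count, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative > threshold:
            return count
    return len(explained_variance_ratio)

ratios = [0.50, 0.30, 0.10, 0.06, 0.04]  # hypothetical ratios, sorted in descending order
n_components = components_needed(ratios)
print(n_components)
```

scikit-learn's PCA can also be given a fractional `n_components` such as 0.95 (with `svd_solver='full'`), which is documented to choose the number of components in a single fit; whether it matches the loop above exactly at the boundary has not been checked here.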


To summarize, we have 4 datasets:

- 1 - variables with missing values dropped,
- 2 - missing values filled with the median,
- 3 - missing values filled by sampling from a normal distribution,
- 4 - missing values filled by copying values from other records of the dataset,

each in 6 versions:

- unscaled, without WOE,
- unscaled, with WOE,
- scaled, without WOE, without PCA,
- scaled, with WOE, without PCA,
- scaled, without WOE, with PCA,
- scaled, with WOE, with PCA.

### II Modelling

For each classifier I follow the same procedure:

- define the range of classifier parameters and the data variants to model,
- optimize the classifier parameters using Grid Search + Cross Validation,
- select the best parameter sets,
- refit models with the selected parameter sets on the full training set,
- evaluate the models by comparing metrics on the test and training sets.

In [10]:
# returns quality metrics of a fitted model for the given features X and labels y
def get_measures(model, X, y):
    y_predict = model.predict(X)
    # rows of the confusion matrix are true classes, columns are predicted classes
    confusion = metrics.confusion_matrix(y, y_predict)
    out = {}
    # sklearn metrics take (y_true, y_pred) in that order
    out['accuracy'] = metrics.accuracy_score(y, y_predict)
    out['precision'] = metrics.precision_score(y, y_predict)
    out['recall'] = metrics.recall_score(y, y_predict)
    out['f1'] = metrics.f1_score(y, y_predict)
    # note: this AUC is computed from hard 0/1 predictions, not from predicted probabilities
    out['auc'] = metrics.roc_auc_score(y, y_predict)
    out['TP'] = confusion[1][1]
    out['FP'] = confusion[0][1]
    out['TN'] = confusion[0][0]
    out['FN'] = confusion[1][0]
    return out
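
One detail of `get_measures` is worth illustrating: it passes the output of `model.predict` (hard 0/1 labels) to `roc_auc_score`, whereas the 'roc_auc' scoring used in Grid Search works on continuous scores. The two numbers can differ a lot. The sketch below uses a hand-written pairwise AUC (the helper `auc_from_scores` and the toy data are assumptions, not part of the notebook) to show the effect.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.2, 0.6, 0.3, 0.7, 0.9]          # hypothetical predicted probabilities
labels = [1 if p >= 0.5 else 0 for p in proba]  # hard predictions at a 0.5 threshold

auc_proba = auc_from_scores(y_true, proba)    # ranking quality of the probabilities
auc_labels = auc_from_scores(y_true, labels)  # what a label-based AUC measures
print(auc_proba, auc_labels)
```

With hard labels the value reduces to the mean of sensitivity and specificity at a single threshold, so AUC figures produced by `get_measures` should not be compared directly with the cross-validated 'roc_auc' scores.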

In [11]:
# runs the Grid Search + Cross Validation procedure
def model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=4, n_jobs=7, verbose = 1):
    # return_train_score=True is needed for 'mean_train_score' in cv_results_ (off by default in recent scikit-learn)
    grid_search = GridSearchCV(classifier, param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, verbose = verbose, refit=True, return_train_score=True)
    grid_search.fit(X, y)
    with open(prefix + '_grid_search.pickle', 'wb') as file:
        pickle.dump(grid_search, file)
    params_list =  ['param_'+x for x in param_grid]
    params_list += ['params', 'mean_train_score', 'mean_test_score'] 
    result = pd.DataFrame(grid_search.cv_results_)[params_list]
    result.to_csv(prefix + ".csv", index=False, sep=';')
    return grid_search, result  

#### Logistic regression

The first classifier is logistic regression.
Grid Search explores two parameters:

- C - controls regularization (it is the inverse of the regularization strength, so smaller values mean stronger regularization),
- 'class_weight' - controls the weighting of the target classes.

The positive observations of the target variable need extra weight because their share in the analysed dataset is very small.

In [None]:
# Logistic regression - first pass

# classifier and the grid of classifier parameters to search

classifier = LogisticRegression(C = 1.0, class_weight = 'balanced', random_state = 42, verbose = 1)
C = [10**c for c in range(1, 12, 2)]
weights = [1, 50, 100, 200]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
param_grid = {'C': C, 'class_weight': class_weight}
param_grid
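
For orientation, the 'balanced' option can be translated into explicit weights. scikit-learn documents the heuristic as n_samples / (n_classes * number of samples in the class); the sketch below applies it to the class counts printed earlier for the whole dataset (the training split has nearly the same proportions). The helper name is an assumption made for this example.

```python
def balanced_class_weights(class_counts):
    """Weights following the documented 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {cls: n_samples / (n_classes * count) for cls, count in class_counts.items()}

# 87020 observations in total, 1273 of them positive (numbers from the split cell)
counts = {0: 87020 - 1273, 1: 1273}
weights = balanced_class_weights(counts)
ratio = weights[1] / weights[0]  # how much more a positive observation counts than a negative one
print(weights, ratio)
```

The ratio comes out at roughly 67:1, which places 'balanced' between the explicit weights 50 and 100 in the grid above.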

In [None]:
# data variants examined for each dataset
data_variants = ['', '_pca', '_scaled', '_woe', '_woe_scaled', '_woe_pca']

Grid Search is run:

- for each dataset (4 versions),
- for each data preparation variant (6 versions),
- optimizing either AUC or F-score.

Because the computation is time-consuming, the result of each run is serialized and saved to disk;
the mean scores obtained in Cross Validation for each parameter set are collected as well.

In [None]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'scoring', 'param_C', 'param_class_weight', 'params', 'mean_train_score', 'mean_test_score'])

for dataset in datasets:
    for data_variant in data_variants:
        for scoring in ['roc_auc', 'f1']:
            prefix = 'logistic_regression_' + str(dataset) + '_' + data_variant + '_' + scoring
            print("start: " + prefix)
            X = datasets[dataset]['X_train' + data_variant]
            y = datasets[dataset]['y_train']
            grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=4, n_jobs=7, verbose = 1)
            datasets[dataset]['grid_search_logistic_regression_' + data_variant + '_' + scoring] = grid_search
            result['dataset'], result['data_variant'], result['scoring'] = str(dataset), data_variant, scoring
            results = pd.concat([results, result])

results.to_csv('logistic_regression_step_1.csv', index=False, sep=';')
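
A rough count of model fits shows why this step is time-consuming. The sketch below only does the arithmetic for the grid defined above; it assumes that each Grid Search run fits every parameter combination once per fold and then refits once on the full data (refit=True).

```python
from itertools import product

# the same grid as in the first pass
C = [10**c for c in range(1, 12, 2)]
class_weight = [{0: 1, 1: x} for x in [1, 50, 100, 200]] + ['balanced']
param_grid = {'C': C, 'class_weight': class_weight}

combinations = list(product(*param_grid.values()))
cv = 4
fits_per_run = len(combinations) * cv + 1  # +1 for the final refit on the whole training set

n_datasets, n_variants, n_scorings = 4, 6, 2
n_runs = n_datasets * n_variants * n_scorings
total_fits = n_runs * fits_per_run
print(len(combinations), fits_per_run, n_runs, total_fits)
```

Under these assumptions the first pass amounts to several thousand logistic regression fits, which is why the results are pickled and written to CSV rather than recomputed.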

Below are the parameter sets for the 10 best AUC and F1 scores.
The quality of the models is not high.

The results show that:

- the differences between the mean training and cross-validation test scores are small (within 1pp), so the model is not overfitted,
- model quality is practically independent of C,
- for maximizing AUC the sets with WOE encoding are preferred, for maximizing F-score the sets without WOE,
- for both metrics the scaled sets dominate,
- model quality depends strongly on 'class_weight'; the best setting is a weight of about 50 or 'balanced',
- the top 10 contains neither dataset 1 (variables with missing values dropped) nor the PCA variants.

In [12]:
results = pd.read_csv("logistic_regression_step_1.csv", sep=';')
results = results.drop(['params'], axis=1)

results_best_auc = results[results['scoring']=='roc_auc']
results_best_auc = results_best_auc.sort_values(by=['mean_test_score'], ascending=False)
results_best_auc.head(10)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_C,param_class_weight,scoring
626,_woe_scaled,2,0.825199,0.829608,100000000000,"{0: 1, 1: 50}",roc_auc
611,_woe_scaled,2,0.825199,0.829608,100000,"{0: 1, 1: 50}",roc_auc
606,_woe_scaled,2,0.825199,0.829608,1000,"{0: 1, 1: 50}",roc_auc
621,_woe_scaled,2,0.825199,0.829608,1000000000,"{0: 1, 1: 50}",roc_auc
616,_woe_scaled,2,0.825199,0.829608,10000000,"{0: 1, 1: 50}",roc_auc
601,_woe_scaled,2,0.825199,0.829608,10,"{0: 1, 1: 50}",roc_auc
614,_woe_scaled,2,0.825147,0.829591,100000,balanced,roc_auc
619,_woe_scaled,2,0.825147,0.829591,10000000,balanced,roc_auc
624,_woe_scaled,2,0.825147,0.829591,1000000000,balanced,roc_auc
629,_woe_scaled,2,0.825147,0.829591,100000000000,balanced,roc_auc


In [13]:
results_best_f1 = results[results['scoring']=='f1']
results_best_f1 = results_best_f1.sort_values(by=['mean_test_score'], ascending=False)
results_best_f1.head(10)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_C,param_class_weight,scoring
516,_scaled,2,0.087597,0.087875,1000,"{0: 1, 1: 50}",f1
531,_scaled,2,0.087597,0.087875,1000000000,"{0: 1, 1: 50}",f1
511,_scaled,2,0.087597,0.087873,10,"{0: 1, 1: 50}",f1
521,_scaled,2,0.087597,0.087875,100000,"{0: 1, 1: 50}",f1
526,_scaled,2,0.087597,0.087875,10000000,"{0: 1, 1: 50}",f1
536,_scaled,2,0.087597,0.087875,100000000000,"{0: 1, 1: 50}",f1
756,,3,0.087434,0.086708,1000,"{0: 1, 1: 50}",f1
766,,3,0.087433,0.086786,10000000,"{0: 1, 1: 50}",f1
871,_scaled,3,0.087076,0.087602,10,"{0: 1, 1: 50}",f1
876,_scaled,3,0.087076,0.087602,1000,"{0: 1, 1: 50}",f1


In [None]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'param_C', 'param_class_weight', 'params', 'mean_train_score', 'mean_test_score'])
classifier = LogisticRegression(C = 10**5, class_weight = 'balanced', random_state = 42, verbose = 0)
weights = [10, 30, 40, 50, 60, 70, 80, 90]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
param_grid = {'class_weight': class_weight}

for dataset in [2, 3, 4]:
    for data_variant in ['_scaled', '_woe_scaled']:
        prefix = 'logistic_regression_2_' + str(dataset) + '_' + data_variant
        print("start: " + prefix)
        X = datasets[dataset]['X_train' + data_variant]
        y = datasets[dataset]['y_train']
        grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring='roc_auc', cv=4, n_jobs=11, verbose = 0)
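        # 'scoring' is not defined in this cell; use an explicit key so that first-pass results are not overwritten
        datasets[dataset]['grid_search_logistic_regression_2_' + data_variant + '_roc_auc'] = grid_search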
        result['dataset'], result['data_variant'] = str(dataset), data_variant
        results = pd.concat([results, result])

results.to_csv('logistic_regression_step_2.csv', index=False, sep=';')       

In [14]:
results = pd.read_csv("logistic_regression_step_2.csv", sep=';')
results = results.drop(['params'], axis=1)
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(10)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_C,param_class_weight
11,_woe_scaled,2,0.825215,0.829572,,"{0: 1, 1: 40}"
12,_woe_scaled,2,0.825199,0.829608,,"{0: 1, 1: 50}"
13,_woe_scaled,2,0.825174,0.82961,,"{0: 1, 1: 60}"
17,_woe_scaled,2,0.825147,0.829591,,balanced
10,_woe_scaled,2,0.825143,0.829463,,"{0: 1, 1: 30}"
14,_woe_scaled,2,0.825125,0.829584,,"{0: 1, 1: 70}"
49,_woe_scaled,4,0.825055,0.829724,,"{0: 1, 1: 60}"
53,_woe_scaled,4,0.825052,0.829723,,balanced
50,_woe_scaled,4,0.825049,0.82972,,"{0: 1, 1: 70}"
15,_woe_scaled,2,0.825038,0.829544,,"{0: 1, 1: 80}"


The second Grid Search pass indicates:

- '_woe_scaled' as the data preparation variant giving the best results,
- an advantage of the second dataset (missing values filled with the median),
- an optimal range of 40-80 for the weight of the positive class,
- no overfitting: the mean scores on the training and validation folds are almost the same.

The next step is to fit a set of models on the whole training set for:
- all datasets,
- all data preparation variants,
- different values of 'class_weight' within the range indicated by Grid Search.

For each model a set of metrics is returned for both the training and the test set.

In [None]:
# Logistic regression - modelling
dataset_columns = ['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 
                   'accuracy_train', 'accuracy_test', 'precision_train', 'precision_test', 'recall_train', 
                   'recall_test', 'TN_train', 'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 
                   'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 
                        'f1':'f1_train', 'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 
                        'FN':'FN_train'}

test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 
                       'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 
                       'FN':'FN_test'}
                             
weights = range(40, 80, 5)
class_weights = [{0:1, 1:x} for x in weights]
class_weights.append('balanced')

for dataset in datasets:
    for data_variant in data_variants:
        for class_weight in class_weights:
            X_train = datasets[dataset]['X_train' + data_variant]
            y_train = datasets[dataset]['y_train']
            X_test = datasets[dataset]['X_test' + data_variant]
            y_test = datasets[dataset]['y_test']               
            model = LogisticRegression(C = 10**5, class_weight = class_weight, random_state = 42, n_jobs = -1)
            model.fit(X_train, y_train)
            train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
            test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
            train = train.rename(columns = train_columns_rename)
            test  = test.rename(columns = test_columns_rename)
            result = pd.concat([train, test], axis = 1)
            result['dataset'] = dataset
            result['data_variant'] = data_variant
            # class_weight is either a dict {0: 1, 1: w} or the string 'balanced'
            result['class_weight'] = class_weight if isinstance(class_weight, str) else class_weight[1]
            result = result[dataset_columns]
            model_results = pd.concat([model_results, result])

model_results.to_csv('logistic_regression_modelling.csv', index=False, sep=';')

In [16]:
results = pd.read_csv("logistic_regression_modelling.csv", sep=';')
results = results.sort_values(by=['auc_train'], ascending=False)
results = results[['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'accuracy_train']]
results.head(10)

Unnamed: 0,dataset,data_variant,class_weight,auc_train,auc_test,f1_train,accuracy_train
133,4,_woe_scaled,75,0.760186,0.749607,0.073845,0.697142
97,3,_woe_scaled,75,0.758953,0.746252,0.073582,0.696744
61,2,_woe_scaled,75,0.758017,0.747078,0.07331,0.695917
25,1,_woe_scaled,75,0.757929,0.747461,0.072992,0.69371
60,2,_woe_scaled,70,0.757703,0.743965,0.075068,0.708511
131,4,_woe_scaled,65,0.757601,0.74464,0.077419,0.723558
134,4,_woe_scaled,balanced,0.757573,0.744309,0.076255,0.716387
132,4,_woe_scaled,70,0.757005,0.745318,0.075045,0.70917
34,1,_woe_pca,75,0.756693,0.751021,0.072054,0.688225
96,3,_woe_scaled,70,0.756575,0.738725,0.074995,0.709339


The table above shows the 10 best models (sorted by AUC on the training set, descending).
The quality of the models is not satisfactory: the F-score is low and the AUC is fairly low, so the predictive power of the model is weak. Note that this AUC is computed from hard class predictions, so it is not directly comparable with the 'roc_auc' scores reported by Grid Search.

#### Decision trees

The next attempt uses decision trees as the classifier.
For tree-based classifiers a linear rescaling of individual features is transparent and does not affect the result, so I do not examine all data preparation variants as I did for logistic regression.

The optimization procedure is analogous to the one used for logistic regression:
- Grid Search + Cross Validation over a wide range of classifier parameters,
- modelling for the most promising parameter sets.

The optimization covers:
- the maximum depth of the tree,
- the number of variables considered when looking for the best split (this parameter introduces an element of randomness),
- the weights of the target classes (because the classes are unevenly populated).

Based on the experience from logistic regression, I omit the first dataset from further analyses.
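
The claim that rescaling a feature does not change a tree can be checked on a single split. The sketch below is a simplified decision stump written for this illustration, not scikit-learn's algorithm: it searches thresholds at the observed values and scores them with weighted Gini impurity. The data and helper names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Return the set of indices sent to the left branch by the best single-threshold split."""
    best_score, best_left = None, None
    for threshold in sorted(set(x))[:-1]:
        left = [i for i, v in enumerate(x) if v <= threshold]
        right = [i for i, v in enumerate(x) if v > threshold]
        score = (len(left) * gini([y[i] for i in left])
                 + len(right) * gini([y[i] for i in right])) / len(x)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

income = [12_000, 30_000, 45_000, 80_000, 150_000, 900_000]  # hypothetical raw feature
target = [0, 0, 0, 1, 1, 1]

# standardize the feature (mean 0, population standard deviation 1)
mean = sum(income) / len(income)
std = (sum((v - mean) ** 2 for v in income) / len(income)) ** 0.5
income_scaled = [(v - mean) / std for v in income]

left_raw = best_split(income, target)
left_scaled = best_split(income_scaled, target)
print(left_raw == left_scaled)
```

Standardization keeps the order of the values, so the same observations fall on each side of the split; only the threshold value changes. PCA is different, because it mixes features, which is why the PCA variants are still worth testing with trees.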

In [17]:
classifier = DecisionTreeClassifier(max_depth=None, max_features=None, random_state=42, class_weight=None)

max_depth = [3, 5, 7, 9, 11, 15, 20]
max_features = [None, 0.5, 0.75]
weights = [30, 40, 50, 60, 70, 80, 90]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
class_weight.append(None)
param_grid = {'max_depth': max_depth, 'max_features': max_features, 'class_weight':class_weight}
param_grid

{'max_depth': [3, 5, 7, 9, 11, 15, 20],
 'max_features': [None, 0.5, 0.75],
 'class_weight': [{0: 1, 1: 30},
  {0: 1, 1: 40},
  {0: 1, 1: 50},
  {0: 1, 1: 60},
  {0: 1, 1: 70},
  {0: 1, 1: 80},
  {0: 1, 1: 90},
  'balanced',
  None]}

In [19]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'params', 'mean_train_score', 'mean_test_score'])

for dataset in [2, 3, 4]:
    for data_variant in ['_pca', '_scaled', '_woe_scaled', '_woe_pca']:
        prefix = 'decision_tree_' + str(dataset) + '_' + data_variant
        print("start: " + prefix)
        X = datasets[dataset]['X_train' + data_variant]
        y = datasets[dataset]['y_train']
        grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring='roc_auc', cv=4, n_jobs=-1, verbose = 0)
        # 'scoring' is not defined in this cell; the search above always uses 'roc_auc'
        datasets[dataset]['grid_search_decision_tree_' + data_variant + '_roc_auc'] = grid_search
        result['dataset'], result['data_variant'] = str(dataset), data_variant
        results = pd.concat([results, result])

results.to_csv('decision_tree_grid_search.csv', index=False, sep=';') 

In [20]:
results = pd.read_csv("decision_tree_grid_search.csv", sep=';')
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(10)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_class_weight,param_max_depth,param_max_features,params
1769,_scaled,4,0.822702,0.845665,"{0: 1, 1: 60}",5,0.75,"{'class_weight': {0: 1, 1: 60}, 'max_depth': 5..."
1077,_scaled,3,0.822354,0.856692,"{0: 1, 1: 90}",7,,"{'class_weight': {0: 1, 1: 90}, 'max_depth': 7..."
1958,_woe_scaled,4,0.820863,0.845625,"{0: 1, 1: 60}",5,0.75,"{'class_weight': {0: 1, 1: 60}, 'max_depth': 5..."
362,_scaled,2,0.819775,0.840669,,5,0.75,"{'class_weight': None, 'max_depth': 5, 'max_fe..."
1748,_scaled,4,0.819454,0.847338,"{0: 1, 1: 50}",5,0.75,"{'class_weight': {0: 1, 1: 50}, 'max_depth': 5..."
321,_scaled,2,0.819181,0.858793,"{0: 1, 1: 90}",7,,"{'class_weight': {0: 1, 1: 90}, 'max_depth': 7..."
278,_scaled,2,0.818528,0.846761,"{0: 1, 1: 70}",5,0.75,"{'class_weight': {0: 1, 1: 70}, 'max_depth': 5..."
1937,_woe_scaled,4,0.818021,0.847523,"{0: 1, 1: 50}",5,0.75,"{'class_weight': {0: 1, 1: 50}, 'max_depth': 5..."
1056,_scaled,3,0.816765,0.862087,"{0: 1, 1: 80}",7,,"{'class_weight': {0: 1, 1: 80}, 'max_depth': 7..."
1853,_scaled,4,0.816258,0.844222,balanced,5,0.75,"{'class_weight': 'balanced', 'max_depth': 5, '..."


The results show that:

- datasets prepared with PCA do not appear among the top results,
- the gap between the mean training and validation scores ranges from about 2pp to 4.5pp, so the models are not strongly overfitted,
- the best 'class_weight' values lie mostly in the range 50-90,
- the best 'max_depth' values are 5-7,
- the best 'max_features' values are 75-100%.

The next step is modelling within the parameter ranges indicated by Grid Search.

In [None]:
# Decision trees - modelling
dataset_columns = ['dataset', 'data_variant', 'class_weight', 'max_depth', 'max_features',
                   'auc_train', 'auc_test', 'f1_train', 'f1_test', 
                   'accuracy_train', 'accuracy_test', 'precision_train', 'precision_test', 'recall_train', 
                   'recall_test', 'TN_train', 'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 
                   'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 
                        'f1':'f1_train', 'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 
                        'FN':'FN_train'}

test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 
                       'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 
                       'FN':'FN_test'}

weights = range(50, 90, 10)
class_weights = [{0:1, 1:x} for x in weights]
max_depths = [5, 6, 7]
max_features_set = [0.75, None]

for dataset in datasets:
    for data_variant in data_variants:
        for max_depth in max_depths:
            for max_features in max_features_set:
                for class_weight in class_weights:
                    X_train = datasets[dataset]['X_train' + data_variant]
                    y_train = datasets[dataset]['y_train']
                    X_test = datasets[dataset]['X_test' + data_variant]
                    y_test = datasets[dataset]['y_test']               
                    model = DecisionTreeClassifier(max_depth=max_depth, max_features=max_features, random_state=42, class_weight=class_weight)
                    model.fit(X_train, y_train)
                    train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
                    test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
                    train = train.rename(columns = train_columns_rename)
                    test  = test.rename(columns = test_columns_rename)
                    result = pd.concat([train, test], axis = 1)
                    result['dataset'] = dataset
                    result['data_variant'] = data_variant
                    result['class_weight'] = class_weight[1]
                    # record the tree hyperparameters so that rows can be told apart
                    result['max_depth'] = max_depth
                    result['max_features'] = max_features
                    result = result[dataset_columns]
                    model_results = pd.concat([model_results, result])

model_results.to_csv('decision_tree_modelling.csv', index=False, sep=';')

In [None]:
results = pd.read_csv("decision_tree_modelling.csv", sep=';')
results = results[results['auc_train'] - results['auc_test'] < 0.05]
results = results.sort_values(by=['auc_train'], ascending=False)
results = results[['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'accuracy_train']]
results.head(10)

For the models with the highest AUC there is a considerable gap between the train and test AUC, which indicates overfitting. I therefore restricted the analysis to models whose train-test AUC gap does not exceed 5 pp.

The best model reaches 81% AUC with a very low F-score, so its predictive power is weak.
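
A respectable AUC combined with a poor F-score is typical for strongly imbalanced targets: AUC only measures how well positives are ranked above negatives, while F1 depends on precision at a fixed decision threshold. The sketch below uses made-up scores (not the notebook's data) and computes AUC directly from its pairwise-ranking definition; the helper names are illustrative only.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at_threshold(y_true, scores, threshold=0.5):
    """F1 of the positive class when scores >= threshold are predicted as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: 2 positives and 100 negatives, 10 of which score above the threshold.
y_true = [1, 1] + [0] * 100
scores = [0.9, 0.6] + [0.7] * 10 + [0.2] * 90

auc_value = auc_from_scores(y_true, scores)
f1_value = f1_at_threshold(y_true, scores)
print(round(auc_value, 3), round(f1_value, 3))
```

Here both positives are found, yet the ten false positives outnumber them, so precision and F1 collapse while the ranking quality stays high.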

#### Random Forest

The next classifier used in modelling is the random forest, a natural extension of tree-based classification.

I follow the same optimisation procedure as before and tune the same parameters as for decision trees, plus the number of estimators.

In [27]:
classifier = RandomForestClassifier(n_estimators = 100, max_depth=None, max_features=None, random_state=42, class_weight=None)

n_estimators = [100, 300, 400, 500, 700]
max_depth = [5, 6, 7, 8, 9]
max_features = [None, 0.75, 0.90]
weights = [40, 60, 80]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
class_weight.append(None)
param_grid = {'n_estimators':n_estimators, 'max_depth': max_depth, 'max_features': max_features, 'class_weight':class_weight}
param_grid

{'n_estimators': [100, 300, 400, 500, 700],
 'max_depth': [5, 6, 7, 8, 9],
 'max_features': [None, 0.75, 0.9],
 'class_weight': [{0: 1, 1: 40},
  {0: 1, 1: 60},
  {0: 1, 1: 80},
  'balanced',
  None]}
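
The cost of this grid can be estimated before launching the search. The sketch below simply enumerates the combinations of the grid shown above and multiplies by the 3 folds used in the search; the variable names are illustrative and chosen so that they do not overwrite the notebook's own.

```python
from itertools import product

# Same values as the param_grid printed above.
rf_grid = {
    'n_estimators': [100, 300, 400, 500, 700],
    'max_depth': [5, 6, 7, 8, 9],
    'max_features': [None, 0.75, 0.9],
    'class_weight': [{0: 1, 1: 40}, {0: 1, 1: 60}, {0: 1, 1: 80}, 'balanced', None],
}

# Every candidate is one combination of values, one per parameter.
rf_candidates = list(product(*rf_grid.values()))
rf_cv_folds = 3
rf_total_fits = len(rf_candidates) * rf_cv_folds

print(len(rf_candidates), rf_total_fits)  # 375 1125
```

These figures agree with the search log ("Fitting 3 folds for each of 375 candidates, totalling 1125 fits"), which is why each dataset variant takes well over an hour.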

In [29]:
results = pd.DataFrame(columns = ['dataset', 'data_variant', 'params', 'mean_train_score', 'mean_test_score'])

for dataset in [2, 3, 4]:
    for data_variant in ['_scaled', '_woe_scaled']:
        prefix = 'random_forest_' + str(dataset) + '_' + data_variant
        print("start: " + prefix)
        X = datasets[dataset]['X_train' + data_variant]
        y = datasets[dataset]['y_train']
        grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring='roc_auc', cv=3, n_jobs=-1, verbose = 1)
        datasets[dataset]['grid_search_random_forest_' + data_variant] = grid_search
        result['dataset'], result['data_variant'] = str(dataset), data_variant
        results = pd.concat([results, result])

results.to_csv('random_forest_grid_search.csv', index=False, sep=';') 

start: random_forest_2__scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 12.9min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 31.7min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 57.0min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 85.7min finished


start: random_forest_2__woe_scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 13.0min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 31.8min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 57.0min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 85.9min finished


start: random_forest_3__scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 18.0min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 44.6min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 79.8min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 120.7min finished


start: random_forest_3__woe_scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 18.1min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 44.6min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 79.9min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 120.9min finished


start: random_forest_4__scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 14.2min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 35.0min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 62.8min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 94.7min finished


start: random_forest_4__woe_scaled
Fitting 3 folds for each of 375 candidates, totalling 1125 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 14.4min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 35.3min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 63.3min
[Parallel(n_jobs=-1)]: Done 1125 out of 1125 | elapsed: 95.4min finished


In [37]:
results = pd.read_csv("random_forest_grid_search.csv", sep=';')
# results = results[results['mean_train_score'] - results['mean_test_score'] < 0.05]
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(20)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_class_weight,param_max_depth,param_max_features,param_n_estimators,params
411,_woe_scaled,2,0.847175,0.912623,"{0: 1, 1: 40}",7,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
37,_scaled,2,0.847089,0.91248,"{0: 1, 1: 40}",7,0.75,400,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
36,_scaled,2,0.847078,0.91237,"{0: 1, 1: 40}",7,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
38,_scaled,2,0.846965,0.912489,"{0: 1, 1: 40}",7,0.75,500,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
412,_woe_scaled,2,0.846962,0.912753,"{0: 1, 1: 40}",7,0.75,400,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
39,_scaled,2,0.846908,0.912417,"{0: 1, 1: 40}",7,0.75,700,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
413,_woe_scaled,2,0.846857,0.912874,"{0: 1, 1: 40}",7,0.75,500,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
414,_woe_scaled,2,0.846743,0.912551,"{0: 1, 1: 40}",7,0.75,700,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
35,_scaled,2,0.84663,0.911803,"{0: 1, 1: 40}",7,0.75,100,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 7..."
396,_woe_scaled,2,0.846581,0.892889,"{0: 1, 1: 40}",6,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."


In [38]:
results = pd.read_csv("random_forest_grid_search.csv", sep=';')
results = results[results['mean_train_score'] - results['mean_test_score'] < 0.05]
results = results.sort_values(by=['mean_test_score'], ascending=False)
results.head(20)

Unnamed: 0,data_variant,dataset,mean_test_score,mean_train_score,param_class_weight,param_max_depth,param_max_features,param_n_estimators,params
396,_woe_scaled,2,0.846581,0.892889,"{0: 1, 1: 40}",6,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
397,_woe_scaled,2,0.846453,0.893218,"{0: 1, 1: 40}",6,0.75,400,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
24,_scaled,2,0.846389,0.892897,"{0: 1, 1: 40}",6,0.75,700,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
23,_scaled,2,0.846363,0.892891,"{0: 1, 1: 40}",6,0.75,500,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
22,_scaled,2,0.846353,0.892844,"{0: 1, 1: 40}",6,0.75,400,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
398,_woe_scaled,2,0.846349,0.893216,"{0: 1, 1: 40}",6,0.75,500,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
20,_scaled,2,0.846344,0.8925,"{0: 1, 1: 40}",6,0.75,100,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
399,_woe_scaled,2,0.846337,0.893024,"{0: 1, 1: 40}",6,0.75,700,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
21,_scaled,2,0.84632,0.892534,"{0: 1, 1: 40}",6,0.75,300,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."
395,_woe_scaled,2,0.846244,0.892707,"{0: 1, 1: 40}",6,0.75,100,"{'class_weight': {0: 1, 1: 40}, 'max_depth': 6..."


I sorted the Grid Search results by mean test AUC in descending order. Looking at the top-ranked results (first table above), we can see that:
- the gap between train and test AUC exceeds 5 pp, which may indicate overfitting,
- only dataset 2 appears,
- the weight of the positive class ('class_weight') is almost exclusively 40,
- 'max_depth' lies in the range 6-7,
- 'max_features' is exclusively 0.75 (75%),
- model quality is insensitive to the number of estimators in the tested range.

After restricting the results to AUC_train - AUC_test < 5 pp (which excludes the overfitted models; second table above):
- the remaining observations still hold.

The insensitivity to the number of estimators is surprising; the searched values were probably too high. Likewise, the selected 'class_weight' is the lowest value in the searched set, so the optimum may lie below it. Unfortunately, because of the long computation time, I was not able to rerun Grid Search.
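
The effect of the 5 pp filter can be checked by hand on two rows copied from the result tables above (row 411 with max_depth 7 and row 396 with max_depth 6). This is only a worked arithmetic sketch; the dictionary layout is illustrative.

```python
# Two rows copied from the Grid Search tables above.
rows = [
    {"row": 411, "max_depth": 7, "mean_train_score": 0.912623, "mean_test_score": 0.847175},
    {"row": 396, "max_depth": 6, "mean_train_score": 0.892889, "mean_test_score": 0.846581},
]

MAX_GAP = 0.05  # maximum accepted difference between train and test AUC

for r in rows:
    r["gap"] = r["mean_train_score"] - r["mean_test_score"]
    r["kept"] = r["gap"] < MAX_GAP
    print(r["row"], r["max_depth"], round(r["gap"], 4), r["kept"])
```

The depth-7 model loses about 6.5 pp of AUC between train and test and is filtered out, while the depth-6 model loses about 4.6 pp and stays, at a cost of less than 0.1 pp of test AUC.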

In [None]:
# Random forest - modelling
dataset_columns = ['dataset', 'data_variant', 'n_estimators', 'max_depth', 'max_features', 'class_weight', 
                   'auc_train', 'auc_test', 'f1_train', 'f1_test', 'accuracy_train', 'accuracy_test', 
                   'precision_train', 'precision_test', 'recall_train', 'recall_test', 'TN_train', 
                   'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 
                        'f1':'f1_train', 'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 
                        'FN':'FN_train'}

test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 
                       'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 
                       'FN':'FN_test'}

n_estimators_set = [10, 50, 100, 300] 
weights = range(20, 60, 5)
class_weights = [{0:1, 1:x} for x in weights]
max_depths = [5, 6, 7, 8]
max_features_set = [0.65, 0.75, 0.85]

for dataset in [2, 3, 4]:
    for data_variant in ['_scaled', '_woe_scaled']:
        for n_estimators in n_estimators_set:
            for max_depth in max_depths:
                for max_features in max_features_set:
                    for class_weight in class_weights:
                        X_train = datasets[dataset]['X_train' + data_variant]
                        y_train = datasets[dataset]['y_train']
                        X_test = datasets[dataset]['X_test' + data_variant]
                        y_test = datasets[dataset]['y_test']               
                        model = RandomForestClassifier(n_estimators = n_estimators, max_depth=max_depth, max_features=max_features, random_state=42, class_weight=class_weight, n_jobs=-1)
                        model.fit(X_train, y_train)
                        train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
                        test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
                        train = train.rename(columns = train_columns_rename)
                        test  = test.rename(columns = test_columns_rename)
                        result = pd.concat([train, test], axis = 1)
                        result['dataset'] = dataset
                        result['data_variant'] = data_variant
                        result['n_estimators'] = n_estimators
                        result['max_depth'] = max_depth
                        result['max_features'] = max_features
                        result['class_weight'] = class_weight[1]
                        result = result[dataset_columns]
                        model_results = pd.concat([model_results, result])

model_results.to_csv('random_forest_modelling.csv', index=False, sep=';') 

In [None]:
# Logistic regression - second attempt

# classifier + parameters for the grid search
classifier = LogisticRegression(C = 10**5, class_weight = 'balanced', random_state = 42, verbose = 1)
weights = [30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
param_grid = {'class_weight': class_weight}

results = pd.DataFrame(columns = ['dataset', 'data_variant', 'scoring', 'param_class_weight', 'params', 'mean_train_score', 'mean_test_score'])

for dataset in datasets:
    for data_variant in data_variants:
        for scoring in ['roc_auc', 'f1']:
            prefix = 'logistic_regression_2_' + str(dataset) + '_' + data_variant + '_' + scoring
            print("start: " + prefix)
            X = datasets[dataset]['X_train' + data_variant]
            y = datasets[dataset]['y_train']
            grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=4, n_jobs=7, verbose = 1)
            datasets[dataset]['grid_search_logistic_regression_' + data_variant + '_' + scoring] = grid_search
            result['dataset'], result['data_variant'], result['scoring'] = str(dataset), data_variant, scoring
            results = pd.concat([results, result])

In [None]:
# Logistic regression - third attempt

# classifier + parameters for the grid search
classifier = LogisticRegression(C = 10**5, class_weight = 'balanced', random_state = 42, verbose = 1)
weights = range(20, 100, 5)
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
param_grid = {'class_weight': class_weight}

data_variants = ['', '_scaled', '_woe_scaled']

results = pd.DataFrame(columns = ['dataset', 'data_variant', 'scoring', 'param_class_weight', 'params', 'mean_train_score', 'mean_test_score'])


for dataset in datasets:
    for data_variant in data_variants:
        for scoring in ['roc_auc', 'f1']:
            prefix = 'logistic_regression_3_' + str(dataset) + '_' + data_variant + '_' + scoring
            print("start: " + prefix)
            X = datasets[dataset]['X_train' + data_variant]
            y = datasets[dataset]['y_train']
            grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=4, n_jobs=7, verbose = 1)
            datasets[dataset]['grid_search_logistic_regression_' + data_variant + '_' + scoring] = grid_search
            result['dataset'], result['data_variant'], result['scoring'] = str(dataset), data_variant, scoring
            results = pd.concat([results, result])

In [None]:
# Logistic regression - fourth attempt

# classifier + parameters for the grid search
classifier = LogisticRegression(C = 10**5, class_weight = 'balanced', random_state = 42, verbose = 1)
weights = range(0, 20, 5)
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
param_grid = {'class_weight': class_weight}

data_variants = ['', '_scaled', '_woe_scaled']

results = pd.DataFrame(columns = ['dataset', 'data_variant', 'scoring', 'param_class_weight', 'params', 'mean_train_score', 'mean_test_score'])


for dataset in datasets:
    for data_variant in data_variants:
        for scoring in ['roc_auc', 'f1']:
            prefix = 'logistic_regression_4_' + str(dataset) + '_' + data_variant + '_' + scoring
            print("start: " + prefix)
            X = datasets[dataset]['X_train' + data_variant]
            y = datasets[dataset]['y_train']
            grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=4, n_jobs=7, verbose = 1)
            datasets[dataset]['grid_search_logistic_regression_' + data_variant + '_' + scoring] = grid_search
            result['dataset'], result['data_variant'], result['scoring'] = str(dataset), data_variant, scoring
            results = pd.concat([results, result])

In [None]:
# Logistic regression - fifth attempt

# classifier + parameters for the grid search
classifier = LogisticRegression(C = 10**5, class_weight = 'balanced', random_state = 42, verbose = 1)
weights = range(10, 30, 1)
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')
param_grid = {'class_weight': class_weight}

data_variants = ['', '_scaled', '_woe_scaled']

results = pd.DataFrame(columns = ['dataset', 'data_variant', 'scoring', 'param_class_weight', 'params', 'mean_train_score', 'mean_test_score'])


for dataset in datasets:
    for data_variant in data_variants:
        for scoring in ['roc_auc', 'f1']:
            prefix = 'logistic_regression_5_' + str(dataset) + '_' + data_variant + '_' + scoring
            print("start: " + prefix)
            X = datasets[dataset]['X_train' + data_variant]
            y = datasets[dataset]['y_train']
            grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=4, n_jobs=7, verbose = 1)
            datasets[dataset]['grid_search_logistic_regression_' + data_variant + '_' + scoring] = grid_search
            result['dataset'], result['data_variant'], result['scoring'] = str(dataset), data_variant, scoring
            results = pd.concat([results, result])

In [None]:
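# Scratch check of the result-row layout: fit one model and inspect the column names.
# The same measures are reused for both halves, since only the columns matter here.
m = LogisticRegression(C = 10**5, class_weight = 'balanced', random_state = 42, verbose = 1)
m.fit(datasets[1]['X_train'], datasets[1]['y_train'])
d = get_measures(m, datasets[1]['X_train'], datasets[1]['y_train'])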
d1 = pd.DataFrame.from_dict(d, orient = 'index').transpose()
d2 = pd.DataFrame.from_dict(d, orient = 'index').transpose()
d1 = d1.rename(columns = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 'f1':'f1_train', 'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 'FN':'FN_train'})
d2 = d2.rename(columns = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 'FN':'FN_test'})
df = pd.concat([d1, d2], axis = 1)
df.columns

In [None]:
# Logistic regression - modelling
dataset_columns = ['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 
                   'accuracy_train', 'accuracy_test', 'precision_train', 'precision_test', 'recall_train', 'recall_test',
                   'TN_train', 'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 'f1':'f1_train',
                        'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 'FN':'FN_train'}
test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 'FN':'FN_test'}
                             
weights = range(1, 80, 4)
class_weights = [{0:1, 1:x} for x in weights]
class_weights.append('balanced')

for dataset in datasets:
    for data_variant in data_variants:
        for class_weight in class_weights:
            X_train = datasets[dataset]['X_train' + data_variant]
            y_train = datasets[dataset]['y_train']
            X_test = datasets[dataset]['X_test' + data_variant]
            y_test = datasets[dataset]['y_test']               
            model = LogisticRegression(C = 10**5, class_weight = class_weight, random_state = 42, verbose = 1)
            model.fit(X_train, y_train)
            train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
            test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
            train = train.rename(columns = train_columns_rename)
            test  = test.rename(columns = test_columns_rename)
            result = pd.concat([train, test], axis = 1)
            result['dataset'] = dataset
            result['data_variant'] = data_variant
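            # 'balanced' is a string, so only dict-valued weights are indexed
            result['class_weight'] = class_weight if isinstance(class_weight, str) else class_weight[1]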
            result = result[dataset_columns]
            model_results = pd.concat([model_results, result])


In [None]:
['dataset', 'data_variant', 'class_weight', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 
 'accuracy_train', 'accuracy_test', 'precision_train', 'precision_test', 'recall_train', 'recall_test',
 'TN_train', 'TN_test', 'FN_train', 'FN_test', 'FP_train',  'FP_test', 'TP_train', 'TP_test']

In [None]:
model_results.to_csv('logistic_regression_modelling.csv', index=False, sep=';')

In [None]:
data_variants

In [None]:
# XGBoost - first attempt

# classifier + parameters for the grid search
classifier = xgboost.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=100, verbosity=1, scale_pos_weight=1)

max_depth = [3, 5, 7, 9, 11]
learning_rate = [0.001, 0.01, 0.1, 1.]
n_estimators = [50, 200, 500, 700]
scale_pos_weight = [1, 10, 30, 50]

param_grid = {'max_depth': max_depth, 'learning_rate':learning_rate, 'n_estimators':n_estimators, 'scale_pos_weight':scale_pos_weight}

results = pd.DataFrame(columns = ['dataset', 'data_variant', 'scoring', 'param_max_depth', 'param_learning_rate', 
                                  'param_n_estimators', 'param_scale_pos_weight',  'params', 'mean_train_score', 
                                  'mean_test_score'])

data_variants = ['_scaled', '_woe_scaled']

for dataset in datasets:
    for data_variant in data_variants:
        for scoring in ['roc_auc', 'f1']:
            prefix = 'xgb_1_' + str(dataset) + '_' + data_variant + '_' + scoring
            print("start: " + prefix)
            X = datasets[dataset]['X_train' + data_variant]
            y = datasets[dataset]['y_train']
            grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=4, n_jobs=11, verbose = 1)
            # store under an XGBoost-specific key so the logistic regression searches are not overwritten
            datasets[dataset]['grid_search_xgb_' + data_variant + '_' + scoring] = grid_search
            result['dataset'], result['data_variant'], result['scoring'] = str(dataset), data_variant, scoring
            results = pd.concat([results, result])

In [None]:
# XGBoost - second attempt

# classifier + parameters for the grid search
classifier = xgboost.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=100, verbosity=1, scale_pos_weight=1)

max_depth = [3, 5, 7]
learning_rate = [0.01, 0.1]
n_estimators = [200, 400, 600]
scale_pos_weight = [20, 40, 60]

param_grid = {'max_depth': max_depth, 'learning_rate':learning_rate, 'n_estimators':n_estimators, 'scale_pos_weight':scale_pos_weight}


results = pd.DataFrame(columns = ['dataset', 'data_variant', 'scoring', 'param_max_depth', 'param_learning_rate', 
                                  'param_n_estimators', 'param_scale_pos_weight',  'params', 'mean_train_score', 
                                  'mean_test_score'])


data_variants = ['_scaled', '_woe_scaled']


for dataset in datasets:
    for data_variant in data_variants:
        for scoring in ['roc_auc', 'f1']:
            prefix = 'xgb_2_' + str(dataset) + '_' + data_variant + '_' + scoring
            print("start: " + prefix)
            X = datasets[dataset]['X_train' + data_variant]
            y = datasets[dataset]['y_train']
            grid_search, result = model_grid_search(prefix, classifier, param_grid, X, y, scoring, cv=3, n_jobs=11, verbose = 1)
            # store under an XGBoost-specific key so the logistic regression searches are not overwritten
            datasets[dataset]['grid_search_xgb_' + data_variant + '_' + scoring] = grid_search
            result['dataset'], result['data_variant'], result['scoring'] = str(dataset), data_variant, scoring
            results = pd.concat([results, result])


In [None]:
results.to_csv('gb_2.csv', index=False, sep=';')

In [None]:
# XGBoost - modelling
dataset_columns = ['dataset', 'data_variant', 'param_max_depth', 'param_n_estimators', 'param_scale_pos_weight',
                   'auc_train', 
                   'auc_test', 'f1_train', 'f1_test', 'accuracy_train', 'accuracy_test', 'precision_train', 
                   'precision_test', 'recall_train', 'recall_test', 'TN_train', 'TN_test', 'FN_train', 
                   'FN_test', 'FP_train',  'FP_test', 'TP_train', 'TP_test']

model_results = pd.DataFrame(columns = dataset_columns)

train_columns_rename = {'accuracy':'accuracy_train', 'precision':'precision_train', 'recall':'recall_train', 'f1':'f1_train',
                        'auc':'auc_train', 'TP':'TP_train', 'FP':'FP_train', 'TN':'TN_train', 'FN':'FN_train'}

test_columns_rename = {'accuracy':'accuracy_test', 'precision':'precision_test', 'recall':'recall_test', 'f1':'f1_test', 'auc':'auc_test', 'TP':'TP_test', 'FP':'FP_test', 'TN':'TN_test', 'FN':'FN_test'}
                             
max_depths = [3, 4, 5, 6]
n_estimators_set = [400, 500, 600]
scale_pos_weights = [30, 50, 70]

data_variants = ['_pca', '_scaled', '_woe_scaled', '_woe_pca']

for dataset in datasets:
    for data_variant in data_variants:
        for max_depth in max_depths:
            for n_estimators in n_estimators_set:
                for scale_pos_weight in scale_pos_weights:
                    X_train = datasets[dataset]['X_train' + data_variant]
                    y_train = datasets[dataset]['y_train']
                    X_test = datasets[dataset]['X_test' + data_variant]
                    y_test = datasets[dataset]['y_test']               
                    # use the loop value; a hard-coded 50 would mislabel the recorded results
                    model = xgboost.XGBClassifier(n_jobs=7, max_depth=max_depth, learning_rate=0.01, n_estimators=n_estimators, verbosity=1, scale_pos_weight=scale_pos_weight)
                    model.fit(X_train, y_train)
                    train = pd.DataFrame.from_dict(get_measures(model, X_train, y_train), orient = 'index').transpose()
                    test  = pd.DataFrame.from_dict(get_measures(model, X_test , y_test ), orient = 'index').transpose()
                    train = train.rename(columns = train_columns_rename)
                    test  = test.rename(columns = test_columns_rename)
                    result = pd.concat([train, test], axis = 1)
                    result['dataset'] = dataset
                    result['data_variant'] = data_variant
                    result['param_max_depth'] = max_depth
                    result['param_n_estimators'] = n_estimators
                    result['param_scale_pos_weight'] = scale_pos_weight
                    result = result[dataset_columns]
                    model_results = pd.concat([model_results, result])

In [None]:
model_results.to_csv('xgb_2_modelling.csv', index=False, sep=';')

In [None]:
C = [10**c for c in range(3, 9, 2)]
weights = [1, 20, 50, 100, 200]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')

max_depth = [4, 5, 7,  9]
n_estimators = [10, 100, 200, 400]
max_features = ['sqrt', None]
learning_rate = [0.0001, 0.001, 0.01, 0.1, 1.]

classifiers = {
    'log_reg' : 
    {
        'param_grid':{'C': C, 'class_weight': class_weight}, 
        'classifier':LogisticRegression(C = 1.0, class_weight = 'balanced', random_state = 42, verbose = 1)
    },
    'tree' : 
    {
        'param_grid':{'max_depth':max_depth, 'class_weight':class_weight}, 
        'classifier':DecisionTreeClassifier(random_state=42, splitter='best', max_depth=10, class_weight='balanced')
    },
    'random_forest' : 
    {
        'param_grid':{'max_depth':max_depth, 'class_weight':class_weight, 'n_estimators':n_estimators},
        'classifier':RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42, class_weight='balanced', verbose=1)
    },
    'gradient_boosting' :
    {
        'param_grid':{'learning_rate':learning_rate, 'n_estimators':n_estimators, 'max_features':max_features},
        'classifier':GradientBoostingClassifier(learning_rate = 0.01, n_estimators = 10, random_state=42, max_features=None, verbose=1)
    },
    'xgboost' :
    {
        # XGBClassifier has no max_features parameter, so the grid covers learning_rate and n_estimators only
        'param_grid':{'learning_rate':learning_rate, 'n_estimators':n_estimators},
        'classifier':xgboost.XGBClassifier(learning_rate = 0.01, n_estimators = 10, random_state=42, verbosity=1)
    }
}

In [None]:
data_variants = {
    '1':{'X_train':set_1[0], 'y_train':set_1[2], 'X_test':set_1[1], 'y_test':set_1[3]},
    '2':{'X_train':set_2[0], 'y_train':set_2[2], 'X_test':set_2[1], 'y_test':set_2[3]},
    '3':{'X_train':set_3[0], 'y_train':set_3[2], 'X_test':set_3[1], 'y_test':set_3[3]},
    '4':{'X_train':set_4[0], 'y_train':set_4[2], 'X_test':set_4[1], 'y_test':set_4[3]}
}

In [None]:
for data in data_variants:
    for model in classifiers:
        model_grid_search(
            data + '_' + model, 
            classifiers[model]['classifier'], 
            classifiers[model]['param_grid'], 
            data_variants[data]['X_train'], 
            data_variants[data]['y_train'])

In [None]:
classifiers['log_reg']['classifier'] = LogisticRegression(C = 1000.0, class_weight = {0: 1, 1: 200}, random_state = 42, verbose = 0)
classifiers['tree']['classifier'] = DecisionTreeClassifier(random_state=42, splitter='best', max_depth=6, class_weight={0: 1, 1: 1})
classifiers['random_forest']['classifier'] = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42, class_weight={0: 1, 1: 50})
classifiers['gradient_boosting']['classifier'] = GradientBoostingClassifier(learning_rate = 0.05, n_estimators = 150, random_state=42, max_features=None, verbose=0) 

In [None]:
for item in classifiers:
    classifiers[item]['classifier'].fit(X_train_scaled, y_train)
    train_assessment = get_measures(classifiers[item]['classifier'], X_train_scaled, y_train)
    test_assessment = get_measures(classifiers[item]['classifier'], X_test_scaled, y_test)
    # report the test-set value from test_assessment, not train_assessment
    print(f"{item} AUC_train = {train_assessment[-2]:1.3f} AUC_test = {test_assessment[-2]:1.3f}")

In [None]:
classifiers['log_reg']['classifier'] = LogisticRegression(C = 1000.0, class_weight = {0: 1, 1: 200}, random_state = 42, verbose = 0)
classifiers['tree']['classifier'] = DecisionTreeClassifier(random_state=42, splitter='best', max_depth=6, class_weight={0: 1, 1: 1})
classifiers['random_forest']['classifier'] = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42, class_weight={0: 1, 1: 50})
classifiers['gradient_boosting']['classifier'] = GradientBoostingClassifier(learning_rate = 0.05, n_estimators = 150, random_state=42, max_features=None, verbose=0) 

In [None]:
for item in classifiers:
    classifiers[item]['classifier'].fit(X_train_scaled_pca, y_train)
    train_assessment = get_measures(classifiers[item]['classifier'], X_train_scaled_pca, y_train)
    test_assessment = get_measures(classifiers[item]['classifier'], X_test_scaled_pca, y_test)
    # report the test-set value from test_assessment, not train_assessment
    print(f"{item} AUC_train = {train_assessment[-2]:1.3f} AUC_test = {test_assessment[-2]:1.3f}")

In [None]:
C = [10**c for c in range(-3, 16, 1)]
weights = [1, 2, 5, 10, 20, 30, 40, 50, 70, 100, 120, 150, 200, 250, 300, 400, 500]
class_weight = [{0:1, 1:x} for x in weights]
class_weight.append('balanced')

# 1 - Logistic Regression
log_reg_param_grid = {'C': C, 'class_weight': class_weight}
log_reg_classifier = LogisticRegression(C = 1.0, class_weight = 'balanced', random_state = 42, verbose = 1)
log_reg_grid_search = model_grid_search(log_reg_classifier, log_reg_param_grid, X_train, y_train)

with open('log_reg_grid_search.pickle', 'wb') as file:
    pickle.dump(log_reg_grid_search, file)
    
log_reg_result = pd.DataFrame(log_reg_grid_search.cv_results_)[['params', 'mean_train_score', 'mean_test_score']]
log_reg_result.to_csv("log_reg_result.csv", index=False, sep = ';')

In [None]:
print(str(log_reg_grid_search.best_score_))
HTML(pd.DataFrame(log_reg_grid_search.cv_results_)[['param_C', 'param_class_weight', 'mean_train_score', 'mean_test_score']].to_html())

In [None]:
#2 - DecisionTree
from sklearn.tree import DecisionTreeClassifier

tree_splitter = ['best']
tree_max_depth = [3, 4, 5, 6, 7, 8]
# tree_class_weight = [{0:1, 1:1}, {0:1, 1:50},{0:1, 1:100},{0:1, 1:150}, {0:1, 1:200}, {0:1, 1:250}, {0:1, 1:300}, 'balanced']
tree_class_weight = class_weight

tree_param_grid = {'splitter':tree_splitter, 'max_depth':tree_max_depth, 'class_weight':tree_class_weight}
tree_classifier = DecisionTreeClassifier(random_state=42, splitter='best', max_depth=10, class_weight='balanced')

tree_grid_search = model_grid_search(tree_classifier, tree_param_grid, X_train, y_train)

with open('tree_grid_search.pickle', 'wb') as file:
    pickle.dump(tree_grid_search, file)

tree_result = pd.DataFrame(tree_grid_search.cv_results_)[['param_class_weight', 'param_splitter', 'param_max_depth', 'mean_train_score', 'mean_test_score']]
tree_result.to_csv("tree_result.csv", index=False, sep = ';')
    

In [None]:
#3 - RandomForest

rf_max_depth = [3, 5, 9, 15, 25]
rf_class_weight = [{0:1, 1:1}, {0:1, 1:50}, {0:1, 1:100}, {0:1, 1:300}, 'balanced'] 
rf_n_estimators = [10, 50, 100, 500]
rf_param_grid = {'max_depth':rf_max_depth, 'class_weight':rf_class_weight, 'n_estimators':rf_n_estimators}
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42, class_weight='balanced')

rf_grid_search = model_grid_search(rf_classifier, rf_param_grid, X_train, y_train)

with open('rf_grid_search.pickle', 'wb') as file:
    pickle.dump(rf_grid_search, file)

rf_result = pd.DataFrame(rf_grid_search.cv_results_)[['param_class_weight', 'param_n_estimators', 'param_max_depth', 'mean_train_score', 'mean_test_score']]
rf_result.to_csv("rf_result.csv", index=False, sep = ';')

In [None]:
# 4 - Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier

gb_learning_rate = [0.001, 0.01, 0.1, 1., 10.]
gb_n_estimators = [20, 50, 100, 200, 300, 700]
gb_max_features = ['sqrt', 'log2', None]
gb_param_grid = {'learning_rate':gb_learning_rate, 'n_estimators':gb_n_estimators, 'max_features':gb_max_features}
gb_classifier = GradientBoostingClassifier(learning_rate = 0.01, n_estimators = 10, random_state=42, max_features=None, verbose=1)

gb_grid_search = model_grid_search(gb_classifier, gb_param_grid, X_train, y_train)

with open('gb_grid_search.pickle', 'wb') as file:
    pickle.dump(gb_grid_search, file)                 

gb_result = pd.DataFrame(gb_grid_search.cv_results_)[['param_learning_rate', 'param_n_estimators', 'param_max_features', 'mean_train_score', 'mean_test_score']]
gb_result.to_csv("gb_result.csv", index=False, sep = ';')

In [None]:
# 5 - XGBoost

import xgboost

xgb_learning_rate = [0.001, 0.01, 0.1, 1.]
xgb_n_estimators = [20, 50, 100, 200, 300, 500, 700]
# XGBClassifier has no max_features parameter, so only learning_rate and n_estimators are searched
xgb_param_grid = {'learning_rate':xgb_learning_rate, 'n_estimators':xgb_n_estimators}
xgb_classifier = xgboost.XGBClassifier(learning_rate = 0.01, n_estimators = 10, random_state=42, verbosity=1)

xgb_grid_search = model_grid_search(xgb_classifier, xgb_param_grid, X_train, y_train)

with open('xgb_grid_search.pickle', 'wb') as file:
    pickle.dump(xgb_grid_search, file)

xgb_result = pd.DataFrame(xgb_grid_search.cv_results_)[['param_learning_rate', 'param_n_estimators', 'mean_train_score', 'mean_test_score']]
xgb_result.to_csv("xgb_result.csv", index=False, sep = ';')

In [None]:
print(str(rf_grid_search.best_score_))
HTML(pd.DataFrame(rf_grid_search.cv_results_)[['param_max_depth', 'param_class_weight', 'param_n_estimators', 'mean_train_score', 'mean_test_score']].to_html())

In [None]:
models = []
log_reg_C = [10**c for c in range(0, 10, 1)]
for C in log_reg_C:
    params = [C, weight]
    model = LogisticRegression(C=C, class_weight=weight)
    model.fit(X_train, y_train)
    models.append([params, model, get_measures(model, X_train, y_train)[7:], get_measures(model, X_test, y_test)[7:]])

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
pca = PCA(random_state=42)
pca.fit(X_train_scaled)
explained_variance = pca.explained_variance_ratio_

In [None]:
# cumulative share of variance explained by the first k components
explained_variance_cum = np.cumsum(explained_variance)
explained_variance_cum
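
A cumulative explained-variance curve like this one is often used to choose the number of principal components. The sketch below shows one way to read it off. The 95% threshold and the `example_ratio` vector are assumptions made for illustration (they stand in for `pca.explained_variance_ratio_`); the notebook itself sets `n_components = 14` directly.

```python
import numpy as np

# Hypothetical explained-variance ratios, standing in for pca.explained_variance_ratio_
example_ratio = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

# Cumulative share of variance explained by the first k components
example_cum = np.cumsum(example_ratio)

# Assumed threshold; not taken from the notebook
threshold = 0.95

# Smallest k whose cumulative share reaches the threshold
# (argmax returns the index of the first True in the boolean mask)
n_components_needed = int(np.argmax(example_cum >= threshold)) + 1
print(n_components_needed)
```

scikit-learn can also make this choice itself: passing a float between 0 and 1 as `n_components` to `PCA` keeps just enough components to explain that share of the variance.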

In [None]:
pca = PCA(n_components = 14, random_state=42)
pca.fit(X_train_scaled)
explained_variance = pca.explained_variance_ratio_

In [None]:
# cumulative share of variance explained by the 14 retained components
explained_variance_cum = np.cumsum(explained_variance)
explained_variance_cum

In [None]:
X_train_scaled_pca = pca.transform(X_train_scaled)
X_test_scaled_pca = pca.transform(X_test_scaled)

In [None]:
# 1 - Logistic Regression
log_reg_C = [10**c for c in range(4, 11, 1)]
log_reg_class_weight = [{0:1, 1:1}, {0:1, 1:50},{0:1, 1:100},{0:1, 1:150}, {0:1, 1:200}, {0:1, 1:250}, {0:1, 1:300}, 'balanced'] 
log_reg_param_grid = {'C': log_reg_C, 'class_weight': log_reg_class_weight}
log_reg_grid_search = model_grid_search(LogisticRegression(), log_reg_param_grid, X_train_scaled_pca, y_train)

with open('log_reg_grid_search_pca.pickle', 'wb') as file:
    pickle.dump(log_reg_grid_search, file)

In [None]:
print(str(log_reg_grid_search.best_score_))
HTML(pd.DataFrame(log_reg_grid_search.cv_results_)[['param_C', 'param_class_weight', 'mean_train_score', 'mean_test_score']].to_html())

In [None]:
import matplotlib.pyplot as plt
scores = []
for item in models:
    scores.append(item[2][1])
scores = np.reshape(scores, (15,9))
heatmap(scores, ylabel='C', yticklabels=log_reg_C, xlabel='weight', xticklabels=[item[1] for item in log_reg_class_weight], cmap="viridis")
plt.show()


In [None]:
scores = np.array(results.mean_test_score).reshape(6, 6)
# plot the mean cross-validation scores
heatmap(scores, xlabel='gamma', xticklabels=param_grid['gamma'], ylabel='C', yticklabels=param_grid['C'], cmap="viridis")
plt.show()

In [None]:
def heatmap(values, xlabel, ylabel, xticklabels, yticklabels, cmap=None,
            vmin=None, vmax=None, ax=None, fmt="%0.2f"):
    if ax is None:
        ax = plt.gca()
    # plot the mean cross-validation scores
    img = ax.pcolor(values, cmap=cmap, vmin=vmin, vmax=vmax)
    img.update_scalarmappable()
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_xticks(np.arange(len(xticklabels)) + .5)
    ax.set_yticks(np.arange(len(yticklabels)) + .5)
    ax.set_xticklabels(xticklabels)
    ax.set_yticklabels(yticklabels)
    ax.set_aspect(1)

    for p, color, value in zip(img.get_paths(), img.get_facecolors(),
                               img.get_array()):
        x, y = p.vertices[:-2, :].mean(0)
        if np.mean(color[:3]) > 0.5:
            c = 'k'
        else:
            c = 'w'
        ax.text(x, y, fmt % value, color=c, ha="center", va="center")
    return img

In [None]:
for item in models:
    print(f"C = {item[0][0]}  weight = {item[0][1][1]}  auc_train = {item[2][1]}  auc_test = {item[3][1]}")

In [None]:
from sklearn.linear_model import LogisticRegression

scores = []
for C in [3**c for c in range(-15, 25, 1)]:
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    scores.append([C, get_measures(model, X_train, y_train), get_measures(model, X_test, y_test)])

C = [item[0] for item in scores]
train = [item[1] for item in scores]
test = [item[2] for item in scores]
df_train = pd.DataFrame(train, columns = ['TN_train', 'FN_train', 'FP_train', 'TP_train', 'accuracy_train', 'precision_train', 'recall_train', 'f1_train', 'auc_train'])
df_test = pd.DataFrame(test, columns = ['TN_test', 'FN_test', 'FP_test', 'TP_test', 'accuracy_test', 'precision_test', 'recall_test', 'f1_test', 'auc_test'])
df = pd.concat([pd.DataFrame(C, columns = ['C']), df_train, df_test], axis=1)
df.to_csv("Linear_regression_scores.csv", sep = ';', index=False)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
import xgboost

xgb_scores = []
for max_depth in [3, 4, 5, 7, 9, 11]:
    for learning_rate in [0.01, 0.1, 0.3, 0.5, 1, 3]:
        for n_estimators in [200, 500, 700, 1000]:
            xgb = xgboost.XGBClassifier(max_depth=max_depth, learning_rate=learning_rate, n_estimators=n_estimators)
            xgb.fit(X_train, y_train)
            # evaluate the XGBoost model fitted above (xgb), not a leftover `model` from another cell
            xgb_scores.append([ [max_depth, learning_rate, n_estimators], get_measures(xgb, X_train, y_train), get_measures(xgb, X_test, y_test)])

hiperparams = [item[0] for item in xgb_scores]
train = [item[1] for item in xgb_scores]
test = [item[2] for item in xgb_scores]
df_params = pd.DataFrame(hiperparams, columns=['max_depth', 'learning_rate', 'n_estimators'])
df_train = pd.DataFrame(train, columns = ['TN_train', 'FN_train', 'FP_train', 'TP_train', 'accuracy_train', 'precision_train', 'recall_train', 'f1_train', 'auc_train'])
df_test = pd.DataFrame(test, columns = ['TN_test', 'FN_test', 'FP_test', 'TP_test', 'accuracy_test', 'precision_test', 'recall_test', 'f1_test', 'auc_test'])
df = pd.concat([df_params, df_train, df_test], axis=1)
df.to_csv("xgboost_scores.csv", sep = ';', index=False)

In [None]:
def assess_model(model, X_train, y_train, X_test, y_test):
    # sklearn metrics take (y_true, y_pred): rows of the confusion matrix
    # are true classes, columns are predicted classes
    scores = {}
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    confusion_matrix_train = confusion_matrix(y_train, pred_train)
    confusion_matrix_test = confusion_matrix(y_test, pred_test)
    accuracy_score_train = accuracy_score(y_train, pred_train)
    accuracy_score_test = accuracy_score(y_test, pred_test)
    f1_score_train = f1_score(y_train, pred_train)
    f1_score_test = f1_score(y_test, pred_test)

    scores['train'] = {'TN':confusion_matrix_train[0][0], 'FP':confusion_matrix_train[0][1], 
                       'FN':confusion_matrix_train[1][0], 'TP':confusion_matrix_train[1][1],
                        'accuracy':accuracy_score_train, 'F1_score':f1_score_train}

    scores['test'] = {'TN':confusion_matrix_test[0][0], 'FP':confusion_matrix_test[0][1], 
                       'FN':confusion_matrix_test[1][0], 'TP':confusion_matrix_test[1][1],
                        'accuracy':accuracy_score_test, 'F1_score':f1_score_test}
    return pd.DataFrame(scores)
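
scikit-learn metrics take the true labels first and the predictions second. For `confusion_matrix` the order decides which off-diagonal cell is FP and which is FN. The sketch below illustrates this on made-up labels (`y_true` and `y_pred` are toy values, not notebook data).

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes,
# so ravel() yields TN, FP, FN, TP for a binary problem
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Swapping the arguments transposes the matrix: FP and FN trade places
tn_s, fp_s, fn_s, tp_s = confusion_matrix(y_pred, y_true).ravel()

print(tn, fp, fn, tp)
print(tn_s, fp_s, fn_s, tp_s)
```

Accuracy and F1 are unaffected by the swap, since it only exchanges precision and recall, but the FP and FN counts are.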

In [None]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Bootstrap: draw index labels with replacement, separately for each class
positive = list(X_train[y_train==1].index)
negative = list(X_train[y_train==0].index)

positive_size = 20000
negative_size = 180000

# a single .loc lookup per class replaces the row-by-row filtering
positive_random = X_train.loc[np.random.choice(positive, size=positive_size, replace=True)].assign(y=1)
negative_random = X_train.loc[np.random.choice(negative, size=negative_size, replace=True)].assign(y=0)

X_train_bootstrap = pd.concat([positive_random, negative_random])
X_train_bootstrap = X_train_bootstrap.sample(frac=1).reset_index(drop=True)

y_train_bootstrap = X_train_bootstrap['y']
X_train_bootstrap = X_train_bootstrap.drop(['y'], axis=1)
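
The same class-wise bootstrap can be written with `DataFrame.sample(replace=True)`. This is a minimal sketch on a made-up frame: `toy_X`, `toy_y`, the helper `bootstrap_class`, the seeds and the sample sizes are all assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy training frame, made up for illustration (2 positives, 6 negatives)
toy_X = pd.DataFrame({'income': [10, 20, 30, 40, 50, 60, 70, 80]})
toy_y = pd.Series([1, 0, 0, 1, 0, 0, 0, 0])

def bootstrap_class(X, y, label, size, seed):
    """Draw `size` rows of class `label` with replacement and attach the label."""
    sampled = X[y == label].sample(n=size, replace=True, random_state=seed)
    return sampled.assign(y=label)

# Assumed target sizes with a 1:9 ratio; the notebook uses 20000 and 180000
boot = pd.concat([bootstrap_class(toy_X, toy_y, 1, 5, seed=0),
                  bootstrap_class(toy_X, toy_y, 0, 45, seed=1)])

# Shuffle the rows and drop the old index
boot = boot.sample(frac=1, random_state=2).reset_index(drop=True)

toy_y_boot = boot['y']
toy_X_boot = boot.drop(columns='y')
print(toy_y_boot.value_counts().to_dict())
```

Fixing `random_state` makes the resampled set reproducible between runs, which the `np.random` calls in the cell above do not guarantee unless a seed is set.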

In [None]:
bootstrap_scaler = StandardScaler()
bootstrap_scaler.fit(X_train_bootstrap)
X_train_bootstrap_scaled = bootstrap_scaler.transform(X_train_bootstrap)
X_test_bootstrap_scaled = bootstrap_scaler.transform(X_test)

In [None]:
# data sets:

plain = X_train, y_train, X_test, y_test
scaled = X_train_scaled, y_train, X_test_scaled, y_test
bootstrap = X_train_bootstrap, y_train_bootstrap, X_test, y_test
bootrstrap_scaled = X_train_bootstrap_scaled, y_train_bootstrap, X_test_bootstrap_scaled, y_test

data_sets = {'plain':plain, 'scaled':scaled, 'boot':bootstrap, 'boot_scal':bootrstrap_scaled}

In [None]:
def assess_model_plain(model, X_train, y_train, X_test, y_test):
    # sklearn metrics take (y_true, y_pred): rows of the confusion matrix
    # are true classes, columns are predicted classes
    scores = {}
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    confusion_matrix_train = confusion_matrix(y_train, pred_train)
    confusion_matrix_test = confusion_matrix(y_test, pred_test)
    accuracy_score_train = accuracy_score(y_train, pred_train)
    accuracy_score_test = accuracy_score(y_test, pred_test)
    f1_score_train = f1_score(y_train, pred_train)
    f1_score_test = f1_score(y_test, pred_test)

    scores['train'] = {'TN_train':confusion_matrix_train[0][0], 'FP_train':confusion_matrix_train[0][1], 
                       'FN_train':confusion_matrix_train[1][0], 'TP_train':confusion_matrix_train[1][1],
                        'accuracy_train':accuracy_score_train, 'F1_score_train':f1_score_train}

    scores['test'] = {'TN_test':confusion_matrix_test[0][0], 'FP_test':confusion_matrix_test[0][1], 
                       'FN_test':confusion_matrix_test[1][0], 'TP_test':confusion_matrix_test[1][1],
                        'accuracy_test':accuracy_score_test, 'F1_score_test':f1_score_test}
    scores['train'].update(scores['test'])
    return scores['train']

In [None]:
# Logistic regression

models = []
counter = 1
for data_set in data_sets:
    for C_param in range(-4, 14, 2):
        model = {}
        model['counter'] = counter
        model['data_set'] = data_set
        model['model'] = LogisticRegression(C=10**C_param)
        ds = data_sets[data_set]
        model['model'].fit(ds[0], ds[1])
        model['C'] = C_param
        assessment = assess_model(model['model'], ds[0], ds[1], ds[2], ds[3])
        model.update(assessment)
        models.append(model)
        counter += 1

In [None]:
[x['train']['F1_score'] for x in models]


In [None]:
# Logistic regression on the bootstrap sample
log_reg_bootstrap = LogisticRegression()
log_reg_bootstrap.fit(X_train_bootstrap, y_train_bootstrap)

In [None]:
assess_model(log_reg_bootstrap, X_train, y_train, X_test, y_test)

In [None]:
# Logistic regression
log_reg = LogisticRegression()
log_reg_scaled = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg_scaled.fit(X_train_scaled, y_train)

In [None]:
assess_model(log_reg, X_train, y_train, X_test, y_test)

In [None]:
assess_model(log_reg_scaled, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
X_train.head()

In [None]:
from sklearn.tree import DecisionTreeClassifier
import xgboost

xgb = xgboost.XGBClassifier()
xgb.fit(X_train, y_train)


In [None]:
from sklearn.tree import DecisionTreeClassifier
import xgboost

xgb_bootstrap = xgboost.XGBClassifier()
xgb_bootstrap.fit(X_train_bootstrap, y_train_bootstrap)
assess_model(xgb_bootstrap, X_train, y_train, X_test, y_test)

In [None]:
assess_model(xgb, X_train, y_train, X_test, y_test)

In [None]:
from sklearn.tree import DecisionTreeClassifier
import xgboost

# fit the model on the scaled bootstrap set before assessing it
xgb_bootrstrap_scaled = xgboost.XGBClassifier()
xgb_bootrstrap_scaled.fit(bootrstrap_scaled[0], bootrstrap_scaled[1])
assess_model(xgb_bootrstrap_scaled, bootrstrap_scaled[0], bootrstrap_scaled[1], bootrstrap_scaled[2], bootrstrap_scaled[3])