# Evaluation and Comparison of Boosted ML Models in Behavior-Based Malware Detection


## Notebook: CatBoost Training

***

**What is the objective of this file?**

To train the CatBoost models on the training portion of the statically split Train Split.

## Checklist

- Ensure that you have the dataset files you intend to use (i.e., either the lite or the full version).
    - The notebook reads its datasets from `/Official Development/Dataset/IB` and `/Official Development/Dataset/TB`.
    - You can run the `/Official Development/Dataset [OFFICIAL] Oliveira Dataset Notebook.ipynb` file, or unzip one of the zipped folders in `/Official Development/Dataset/Processed` into the two aforementioned folders.
- Ensure that you have installed the libraries needed to run the training process.
    - The specific versions are listed in the thesis document and in the `.sh` or `.bat` files in the repository's home directory.

# 1. CatB Training Setup

Setting training environment parameters.

## 1.0. Tuning Settings

1. What will the output filename be?
2. Will you train a tuned model?
3. What hyperparameter values will you use?

For no. 3, the value looks like
`{'task_type': 'CPU', 'objective': 'Logloss', 'n_estimators': 50, 'max_depth': 11, 'learning_rate': 0.1, 'l2_leaf_reg': 2, 'grow_policy': 'SymmetricTree', 'bootstrap_type': 'Bayesian', 'boosting_type': 'Ordered', 'auto_class_weights': 'Balanced'}`
which is taken from the results of the tuning notebook associated with this file.

Alternatively, you can point to a file that contains the value. When `TUNED_TRAINING` is `True`, a non-empty file path overrides the manual setup.

**Do not include custom values for `random_state`, `thread_count`, `verbose`, `cat_features`, and `nan_mode`, as these are hardcoded in the notebook.**
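The restriction above can be checked before training. The following is a minimal sketch, not the notebook's own parsing code: it swaps in `ast.literal_eval` for the string replacement used later in the notebook (so `True`/`False` stay booleans rather than strings), and `RESERVED` and `check_hyperparams` are hypothetical names introduced only for illustration.

```python
import ast

# Keys the notebook hardcodes (taken from the note above); hypothetical constant name.
RESERVED = {"random_state", "thread_count", "verbose", "cat_features", "nan_mode"}

def check_hyperparams(text):
    """Parse a Python-dict-style string and reject keys the notebook sets itself."""
    params = ast.literal_eval(text)  # safe evaluation of literals only
    if not isinstance(params, dict):
        raise ValueError("expected a dict literal")
    clash = RESERVED.intersection(params)
    if clash:
        raise ValueError(f"remove hardcoded keys: {sorted(clash)}")
    return params

example = "{'task_type': 'CPU', 'objective': 'Logloss', 'n_estimators': 50, 'max_depth': 11, 'learning_rate': 0.1}"
params = check_hyperparams(example)
print(params["max_depth"], params["learning_rate"])
```

Passing a string that contains, say, `'verbose'` would raise a `ValueError` here instead of failing later when the classifier is constructed with the same keyword twice.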

In [1]:
MODEL_FILENAME = "RYZEN_DEFAULT" # <== Set the prefix for the filename of the output file (don't include file extension)
TUNED_TRAINING = False # <== Set as True if you'll be training a tuned model.

# These are mostly CatBoost defaults, except for the ones mentioned in the paper (e.g., Ordered Boosting, Symmetric Tree). Set to None for the true defaults.
TB_HYPERPARAMS = None # "{'task_type': 'CPU', 'objective':'Logloss', 'grow_policy': 'SymmetricTree', 'bootstrap_type': 'Bayesian', 'boosting_type': 'Ordered'}"
IB_HYPERPARAMS = None # "{'task_type': 'CPU', 'objective':'Logloss', 'grow_policy': 'SymmetricTree', 'bootstrap_type': 'Bayesian', 'boosting_type': 'Ordered'}"

TB_HYPERPARAMS_FILE = "" # <== Pointing a file overrides the value set above.
IB_HYPERPARAMS_FILE = "" # <== Pointing a file overrides the value set above.

STATIC_SPLIT = 0.3 # <== To achieve the 70:30 Static Split
RANDOM_SEED = 1 # <== Must be the same throughout the entire study (acts as a controlled variable), hence let's just settle with 1.

# ⚠️Warning

**Be careful of modifying the code beyond this point as it was designed to run autonomously based on the parameters set above.**

## 1.1. Loading Libraries

In [2]:
#Python Libraries
import time
from datetime import datetime
import json

#Data/Dataset Libraries
import pandas as pd
import numpy as np

#Model Selection
from sklearn.model_selection import train_test_split

#Metrics (for in-training testing only)
from sklearn.metrics import classification_report, accuracy_score, balanced_accuracy_score, f1_score
from sklearn.metrics import precision_score, recall_score, roc_auc_score

#Visualization
from matplotlib import pyplot as plt

#GBDT Models
# import lightgbm
import catboost

#File Writing Library (only needed by the LightGBM notebook; not used for CatBoost, which saves models via save_model)
from joblib import dump, load

## 1.2. Logging and Diagnostics

In [3]:
def parse_hyperparams(raw):
    #Turns a Python-dict-style string into a dict; None passes through (i.e., CatBoost defaults)
    if raw is None:
        return None
    raw = raw.replace('\'', '\"').replace("False", "\"False\"").replace("True", "\"True\"")
    return json.loads(str(raw))

def read_hyperparams(path):
    #The first line of the file holds the hyperparameter string
    with open(path, "r") as f:
        return f.readline()

if TUNED_TRAINING:
    MODEL_FILENAME = "TUNED_" + MODEL_FILENAME
    if len(TB_HYPERPARAMS_FILE) != 0:
        TB_HYPERPARAMS = read_hyperparams(TB_HYPERPARAMS_FILE)
    if len(IB_HYPERPARAMS_FILE) != 0:
        IB_HYPERPARAMS = read_hyperparams(IB_HYPERPARAMS_FILE)

TB_HYPERPARAMS = parse_hyperparams(TB_HYPERPARAMS)
IB_HYPERPARAMS = parse_hyperparams(IB_HYPERPARAMS)
if TUNED_TRAINING:
    print("Parsed TB Hyperparams:", TB_HYPERPARAMS)
    print("Parsed IB Hyperparams:", IB_HYPERPARAMS)

start = end = 0
LOG_FILENAME = "CATB_Training_Log.txt"
def logging(message):
    log = open(LOG_FILENAME, "a")
    log.write(message)
    log.close()
def start_time():
    global start
    start = time.time()
def end_time(process):
    global start
    elapse = time.time()-start
    start = 0
    printout = f"{str(datetime.now())}@{MODEL_FILENAME}: {process} - {round(elapse, 6)}s\n"
    logging(printout)
    return round(elapse, 6)
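
The paired `start_time()` / `end_time()` calls can also be expressed as a context manager, which makes it harder to log a step twice or under the wrong label. This is only a sketch under assumptions: `timed` and `sink` are hypothetical names, the log line format is copied from `end_time`, and a plain list stands in for the log file.

```python
import time
from contextlib import contextmanager
from datetime import datetime

@contextmanager
def timed(process, model_name, sink):
    """Time the enclosed block and hand one formatted log line to `sink`."""
    began = time.perf_counter()
    try:
        yield
    finally:
        elapsed = round(time.perf_counter() - began, 6)
        sink(f"{datetime.now()}@{model_name}: {process} - {elapsed}s\n")

lines = []  # stand-in for appending to CATB_Training_Log.txt
with timed("TB_CATB", "RYZEN_DEFAULT", lines.append):
    sum(range(1000))  # stand-in for the model's fit call

print(lines[0], end="")
```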

## 1.3. Loading Datasets

Note that this notebook reads the files in `/Official Development/Dataset/IB` and `/Official Development/Dataset/TB`.

### 1.3.1. Setting filenames

In [4]:
#Setting filenames of files
TB_Train = "../Dataset/TB/TB_CATB.csv" # <== Location for Time-based Train Split for CatBoost
#TB_Test = "../Dataset/TB/TB_Test_CATB.csv" # <== Location for Time-based Test Split for CatBoost
IB_Train = "../Dataset/IB/IB_CATB.csv" # <== Location for Instance-based Train Split for CatBoost
#IB_Test = "../Dataset/IB/IB_Test_CATB.csv" # <== Location for Instance-based Test Split for CatBoost

### 1.3.2. Loading datasets to DataFrames

In [5]:
#Loading datasets to DataFrames
tb_train = pd.read_csv(TB_Train, low_memory=False).fillna("NaN")
ib_train = pd.read_csv(IB_Train, low_memory=False).fillna("NaN")

print("Dataset Sizes")
print("TB Train Size:", tb_train.shape)
print("IB Train Size:", ib_train.shape)

Dataset Sizes
TB Train Size: (77026, 101)
IB Train Size: (77026, 101)


### 1.3.3. Previewing datasets

In [6]:
#Previewing Time-based Dataset
tb_train.head()

Unnamed: 0,malware,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,1,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,...,LdrGetProcedureAddress,LookupAccountSidW,LdrGetProcedureAddress,LookupAccountSidW,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress
1,1,NtClose,NtOpenKey,NtQueryValueKey,NtClose,NtOpenKey,NtQueryValueKey,NtClose,LdrGetDllHandle,LdrGetProcedureAddress,...,FindResourceExW,LoadResource,FindResourceExW,LoadResource,LdrGetDllHandle,LdrGetProcedureAddress,LoadStringW,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle
2,1,NtClose,NtOpenKey,NtQueryValueKey,NtClose,NtOpenKey,NtQueryValueKey,NtClose,LdrGetDllHandle,LdrGetProcedureAddress,...,FindResourceExW,LoadResource,FindResourceExW,LoadResource,LdrGetDllHandle,LdrGetProcedureAddress,LoadStringW,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle
3,0,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,...,SetFilePointer,NtReadFile,SetFilePointer,NtReadFile,SetFilePointer,NtReadFile,SetFilePointer,NtReadFile,SetFilePointer,NtReadFile
4,1,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,...,LoadResource,FindResourceExW,LoadResource,FindResourceExW,LoadResource,OleInitialize,FindResourceExW,LoadResource,FindResourceExW,LoadResource


In [7]:
#Previewing Instance-based Dataset
ib_train.head()

Unnamed: 0,malware,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,1,LdrLoadDll,LdrGetProcedureAddress,NtProtectVirtualMemory,NtClose,NtOpenKey,NtQueryValueKey,LdrGetDllHandle,GetSystemInfo,NtAllocateVirtualMemory,...,,,,,,,,,,
1,1,NtClose,NtOpenKey,NtQueryValueKey,LdrGetDllHandle,LdrGetProcedureAddress,GetSystemInfo,NtAllocateVirtualMemory,RegOpenKeyExW,FindFirstFileExW,...,,,,,,,,,,
2,1,NtClose,NtOpenKey,NtQueryValueKey,LdrGetDllHandle,LdrGetProcedureAddress,GetSystemInfo,NtAllocateVirtualMemory,RegOpenKeyExW,FindFirstFileExW,...,,,,,,,,,,
3,0,LdrLoadDll,LdrGetProcedureAddress,NtProtectVirtualMemory,GetSystemTimeAsFileTime,SetUnhandledExceptionFilter,GetSystemInfo,NtAllocateVirtualMemory,RegOpenKeyExW,RegQueryValueExW,...,,,,,,,,,,
4,1,LdrLoadDll,LdrGetProcedureAddress,NtProtectVirtualMemory,NtClose,NtOpenKey,NtQueryValueKey,LdrGetDllHandle,GetSystemInfo,NtAllocateVirtualMemory,...,,,,,,,,,,


### 1.3.4. Statically Splitting the Train Split

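Train Split --> Training and Validation Split

However, only the Training Split is used to fit the models. The Validation Split is used only for in-training evaluation and the unofficial performance checks below.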

In [8]:
#Static splitting of Train Split of Time-based
X_tb = tb_train.iloc[:,1:] #All rows, 2nd to last column (the t_0 to t_99 features)
y_tb = tb_train.iloc[:,0] #All rows, first column only (the malware label)
X_tb_training, X_tb_validate, y_tb_training, y_tb_validate = train_test_split(X_tb, y_tb, test_size=STATIC_SPLIT, shuffle=True, random_state=RANDOM_SEED) #Seeded so the split is reproducible

#Static splitting of Train Split of Instance-based
X_ib = ib_train.iloc[:,1:] #All rows, 2nd to last column (the t_0 to t_99 features)
y_ib = ib_train.iloc[:,0] #All rows, first column only (the malware label)
X_ib_training, X_ib_validate, y_ib_training, y_ib_validate = train_test_split(X_ib, y_ib, test_size=STATIC_SPLIT, shuffle=True, random_state=RANDOM_SEED) #Seeded so the split is reproducible
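
The sizes produced by the 70:30 split can be worked out by hand. The sketch below assumes that `train_test_split` rounds the validation share up and gives the remainder to training, which is consistent with the support of 23108 rows shown in the classification reports later in this notebook; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (training rows, validation rows) for a fractional test size."""
    n_validate = math.ceil(test_fraction * n_rows)  # validation share is rounded up
    return n_rows - n_validate, n_validate

# 77026 rows is the dataset size printed in section 1.3.2; 0.3 is STATIC_SPLIT.
n_training, n_validate = split_sizes(77026, 0.3)
print("Training rows:", n_training)
print("Validation rows:", n_validate)
```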

# 2. Model Training

## 2.1. Setting up the Model

In [9]:
def get_indexes():
    #Names of the 100 categorical feature columns (t_0 to t_99)
    return [f"t_{i}" for i in range(100)]

def setup_model(HYPERPARAMS):
    #Custom hyperparameters (if any) are merged with the hardcoded settings below
    custom = HYPERPARAMS if HYPERPARAMS is not None else {}
    return catboost.CatBoostClassifier(**custom, random_state=RANDOM_SEED, thread_count=-1, verbose=1, cat_features=get_indexes(), nan_mode='Min', custom_metric=['Logloss', 'AUC', 'Precision'])

## 2.2. Training on Time-Based Behaviors

### 2.2.1 Training Model

In [10]:
#Training Model
start_time()
tb_catb = setup_model(TB_HYPERPARAMS)
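#Evaluate on the time-based validation split. The log recorded below came from a run that passed the
#instance-based split (X_ib_validate, y_ib_validate) here, so its `test` column does not reflect time-based validation loss.
tb_catb.fit(X_tb_training, y_tb_training, plot=True, eval_set=catboost.Pool(X_tb_validate, label=y_tb_validate, cat_features=get_indexes()))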
end_time("TB_CATB")

#Saving Model as file
tb_catb.save_model("Outputs/"+MODEL_FILENAME+"_TB_CATB.model", format="json")


Learning rate set to 0.084849
0:	learn: 0.5338848	test: 0.7127637	best: 0.7127637 (0)	total: 607ms	remaining: 10m 6s
1:	learn: 0.4170784	test: 0.7259767	best: 0.7127637 (0)	total: 1.18s	remaining: 9m 46s
2:	learn: 0.3117113	test: 0.7042409	best: 0.7042409 (2)	total: 1.89s	remaining: 10m 26s
3:	learn: 0.2370261	test: 0.6785756	best: 0.6785756 (3)	total: 2.51s	remaining: 10m 24s
4:	learn: 0.1877368	test: 0.6998160	best: 0.6785756 (3)	total: 3.07s	remaining: 10m 10s
5:	learn: 0.1469051	test: 0.7029721	best: 0.6785756 (3)	total: 3.74s	remaining: 10m 19s
6:	learn: 0.1169285	test: 0.7287208	best: 0.6785756 (3)	total: 4.49s	remaining: 10m 36s
7:	learn: 0.0926350	test: 0.7606349	best: 0.6785756 (3)	total: 5.29s	remaining: 10m 55s
8:	learn: 0.0768424	test: 0.7707081	best: 0.6785756 (3)	total: 6.14s	remaining: 11m 16s
9:	learn: 0.0663492	test: 0.8033229	best: 0.6785756 (3)	total: 6.98s	remaining: 11m 31s
10:	learn: 0.0576582	test: 0.8314507	best: 0.6785756 (3)	total: 7.86s	remaining: 11m 46s
...

93:	learn: 0.0162317	test: 1.0034322	best: 0.6785756 (3)	total: 1m 12s	remaining: 11m 40s
94:	learn: 0.0161589	test: 1.0103969	best: 0.6785756 (3)	total: 1m 13s	remaining: 11m 40s
95:	learn: 0.0160421	test: 1.0144624	best: 0.6785756 (3)	total: 1m 14s	remaining: 11m 40s
96:	learn: 0.0159442	test: 1.0009450	best: 0.6785756 (3)	total: 1m 15s	remaining: 11m 38s
97:	learn: 0.0157682	test: 0.9697819	best: 0.6785756 (3)	total: 1m 15s	remaining: 11m 36s
98:	learn: 0.0156743	test: 1.0196526	best: 0.6785756 (3)	total: 1m 16s	remaining: 11m 35s
99:	learn: 0.0155992	test: 1.0381927	best: 0.6785756 (3)	total: 1m 17s	remaining: 11m 34s
100:	learn: 0.0155825	test: 1.0260054	best: 0.6785756 (3)	total: 1m 17s	remaining: 11m 33s
101:	learn: 0.0155018	test: 1.0449305	best: 0.6785756 (3)	total: 1m 18s	remaining: 11m 32s
102:	learn: 0.0153890	test: 1.0389781	best: 0.6785756 (3)	total: 1m 19s	remaining: 11m 31s
103:	learn: 0.0153389	test: 1.0410136	best: 0.6785756 (3)	total: 1m 20s	remaining: 11m 30s
...

184:	learn: 0.0116284	test: 1.1191994	best: 0.6785756 (3)	total: 2m 23s	remaining: 10m 33s
185:	learn: 0.0116066	test: 1.1169174	best: 0.6785756 (3)	total: 2m 24s	remaining: 10m 33s
186:	learn: 0.0115694	test: 1.1011322	best: 0.6785756 (3)	total: 2m 25s	remaining: 10m 32s
187:	learn: 0.0115200	test: 1.0943434	best: 0.6785756 (3)	total: 2m 26s	remaining: 10m 31s
188:	learn: 0.0114718	test: 1.1092368	best: 0.6785756 (3)	total: 2m 27s	remaining: 10m 32s
189:	learn: 0.0114370	test: 1.1116867	best: 0.6785756 (3)	total: 2m 28s	remaining: 10m 31s
190:	learn: 0.0113704	test: 1.1034579	best: 0.6785756 (3)	total: 2m 28s	remaining: 10m 31s
191:	learn: 0.0113440	test: 1.1194726	best: 0.6785756 (3)	total: 2m 29s	remaining: 10m 30s
192:	learn: 0.0113227	test: 1.1184380	best: 0.6785756 (3)	total: 2m 30s	remaining: 10m 29s
193:	learn: 0.0113057	test: 1.1181612	best: 0.6785756 (3)	total: 2m 31s	remaining: 10m 28s
194:	learn: 0.0112550	test: 1.1186131	best: 0.6785756 (3)	total: 2m 32s	remaining: 10m 28s

275:	learn: 0.0088019	test: 1.2158942	best: 0.6785756 (3)	total: 3m 38s	remaining: 9m 31s
276:	learn: 0.0087992	test: 1.2189639	best: 0.6785756 (3)	total: 3m 38s	remaining: 9m 31s
277:	learn: 0.0087821	test: 1.2109942	best: 0.6785756 (3)	total: 3m 39s	remaining: 9m 30s
278:	learn: 0.0087576	test: 1.2126901	best: 0.6785756 (3)	total: 3m 40s	remaining: 9m 29s
279:	learn: 0.0087167	test: 1.2138748	best: 0.6785756 (3)	total: 3m 41s	remaining: 9m 28s
280:	learn: 0.0087003	test: 1.2360758	best: 0.6785756 (3)	total: 3m 41s	remaining: 9m 28s
281:	learn: 0.0086812	test: 1.2466763	best: 0.6785756 (3)	total: 3m 42s	remaining: 9m 27s
282:	learn: 0.0086579	test: 1.2507514	best: 0.6785756 (3)	total: 3m 43s	remaining: 9m 26s
283:	learn: 0.0086392	test: 1.2596013	best: 0.6785756 (3)	total: 3m 44s	remaining: 9m 25s
284:	learn: 0.0085878	test: 1.2695296	best: 0.6785756 (3)	total: 3m 45s	remaining: 9m 25s
285:	learn: 0.0085542	test: 1.2706723	best: 0.6785756 (3)	total: 3m 46s	remaining: 9m 24s
...

367:	learn: 0.0070663	test: 1.3625579	best: 0.6785756 (3)	total: 4m 51s	remaining: 8m 20s
368:	learn: 0.0070521	test: 1.3627844	best: 0.6785756 (3)	total: 4m 52s	remaining: 8m 20s
369:	learn: 0.0070065	test: 1.3690009	best: 0.6785756 (3)	total: 4m 53s	remaining: 8m 19s
370:	learn: 0.0070064	test: 1.3690010	best: 0.6785756 (3)	total: 4m 53s	remaining: 8m 18s
371:	learn: 0.0070063	test: 1.3690354	best: 0.6785756 (3)	total: 4m 54s	remaining: 8m 17s
372:	learn: 0.0069940	test: 1.3618751	best: 0.6785756 (3)	total: 4m 55s	remaining: 8m 16s
373:	learn: 0.0069722	test: 1.3698199	best: 0.6785756 (3)	total: 4m 56s	remaining: 8m 15s
374:	learn: 0.0069554	test: 1.3691830	best: 0.6785756 (3)	total: 4m 57s	remaining: 8m 15s
375:	learn: 0.0069337	test: 1.3693596	best: 0.6785756 (3)	total: 4m 57s	remaining: 8m 14s
376:	learn: 0.0069181	test: 1.3748073	best: 0.6785756 (3)	total: 4m 58s	remaining: 8m 13s
377:	learn: 0.0068905	test: 1.3860488	best: 0.6785756 (3)	total: 4m 59s	remaining: 8m 12s
...

459:	learn: 0.0061231	test: 1.4728169	best: 0.6785756 (3)	total: 6m 4s	remaining: 7m 7s
460:	learn: 0.0061230	test: 1.4728207	best: 0.6785756 (3)	total: 6m 5s	remaining: 7m 7s
461:	learn: 0.0061230	test: 1.4728410	best: 0.6785756 (3)	total: 6m 6s	remaining: 7m 6s
462:	learn: 0.0061230	test: 1.4728448	best: 0.6785756 (3)	total: 6m 6s	remaining: 7m 5s
463:	learn: 0.0061230	test: 1.4728486	best: 0.6785756 (3)	total: 6m 7s	remaining: 7m 4s
464:	learn: 0.0061230	test: 1.4728514	best: 0.6785756 (3)	total: 6m 8s	remaining: 7m 3s
465:	learn: 0.0061230	test: 1.4728589	best: 0.6785756 (3)	total: 6m 9s	remaining: 7m 3s
466:	learn: 0.0061230	test: 1.4728589	best: 0.6785756 (3)	total: 6m 10s	remaining: 7m 2s
467:	learn: 0.0061230	test: 1.4728589	best: 0.6785756 (3)	total: 6m 11s	remaining: 7m 1s
468:	learn: 0.0061230	test: 1.4728793	best: 0.6785756 (3)	total: 6m 11s	remaining: 7m
469:	learn: 0.0061086	test: 1.4814084	best: 0.6785756 (3)	total: 6m 12s	remaining: 7m
...

551:	learn: 0.0058863	test: 1.5326376	best: 0.6785756 (3)	total: 7m 18s	remaining: 5m 55s
552:	learn: 0.0058863	test: 1.5326381	best: 0.6785756 (3)	total: 7m 19s	remaining: 5m 55s
553:	learn: 0.0058862	test: 1.5326424	best: 0.6785756 (3)	total: 7m 20s	remaining: 5m 54s
554:	learn: 0.0058862	test: 1.5326273	best: 0.6785756 (3)	total: 7m 20s	remaining: 5m 53s
555:	learn: 0.0058862	test: 1.5326353	best: 0.6785756 (3)	total: 7m 21s	remaining: 5m 52s
556:	learn: 0.0058862	test: 1.5326353	best: 0.6785756 (3)	total: 7m 22s	remaining: 5m 51s
557:	learn: 0.0058862	test: 1.5326277	best: 0.6785756 (3)	total: 7m 23s	remaining: 5m 50s
558:	learn: 0.0058862	test: 1.5326282	best: 0.6785756 (3)	total: 7m 23s	remaining: 5m 50s
559:	learn: 0.0058862	test: 1.5326282	best: 0.6785756 (3)	total: 7m 24s	remaining: 5m 49s
560:	learn: 0.0058862	test: 1.5326304	best: 0.6785756 (3)	total: 7m 25s	remaining: 5m 48s
561:	learn: 0.0058862	test: 1.5326229	best: 0.6785756 (3)	total: 7m 26s	remaining: 5m 47s
...

643:	learn: 0.0058134	test: 1.5416641	best: 0.6785756 (3)	total: 8m 33s	remaining: 4m 43s
644:	learn: 0.0058134	test: 1.5416641	best: 0.6785756 (3)	total: 8m 34s	remaining: 4m 43s
645:	learn: 0.0058134	test: 1.5416646	best: 0.6785756 (3)	total: 8m 35s	remaining: 4m 42s
646:	learn: 0.0058134	test: 1.5416646	best: 0.6785756 (3)	total: 8m 36s	remaining: 4m 41s
647:	learn: 0.0058134	test: 1.5416669	best: 0.6785756 (3)	total: 8m 36s	remaining: 4m 40s
648:	learn: 0.0058134	test: 1.5416674	best: 0.6785756 (3)	total: 8m 37s	remaining: 4m 40s
649:	learn: 0.0058134	test: 1.5416701	best: 0.6785756 (3)	total: 8m 38s	remaining: 4m 39s
650:	learn: 0.0058134	test: 1.5416715	best: 0.6785756 (3)	total: 8m 39s	remaining: 4m 38s
651:	learn: 0.0058134	test: 1.5416715	best: 0.6785756 (3)	total: 8m 40s	remaining: 4m 37s
652:	learn: 0.0058134	test: 1.5416715	best: 0.6785756 (3)	total: 8m 41s	remaining: 4m 36s
653:	learn: 0.0058134	test: 1.5416715	best: 0.6785756 (3)	total: 8m 42s	remaining: 4m 36s
...

735:	learn: 0.0057324	test: 1.5585452	best: 0.6785756 (3)	total: 9m 51s	remaining: 3m 32s
736:	learn: 0.0057324	test: 1.5585452	best: 0.6785756 (3)	total: 9m 51s	remaining: 3m 31s
737:	learn: 0.0057324	test: 1.5585452	best: 0.6785756 (3)	total: 9m 52s	remaining: 3m 30s
738:	learn: 0.0057324	test: 1.5585452	best: 0.6785756 (3)	total: 9m 53s	remaining: 3m 29s
739:	learn: 0.0057324	test: 1.5585452	best: 0.6785756 (3)	total: 9m 54s	remaining: 3m 28s
740:	learn: 0.0057324	test: 1.5585452	best: 0.6785756 (3)	total: 9m 55s	remaining: 3m 28s
741:	learn: 0.0057324	test: 1.5585452	best: 0.6785756 (3)	total: 9m 55s	remaining: 3m 27s
742:	learn: 0.0057324	test: 1.5585452	best: 0.6785756 (3)	total: 9m 56s	remaining: 3m 26s
743:	learn: 0.0057324	test: 1.5585452	best: 0.6785756 (3)	total: 9m 57s	remaining: 3m 25s
744:	learn: 0.0057205	test: 1.5518278	best: 0.6785756 (3)	total: 9m 58s	remaining: 3m 24s
745:	learn: 0.0057205	test: 1.5518110	best: 0.6785756 (3)	total: 9m 59s	remaining: 3m 24s
...

826:	learn: 0.0055583	test: 1.5589133	best: 0.6785756 (3)	total: 11m 3s	remaining: 2m 18s
827:	learn: 0.0055583	test: 1.5589134	best: 0.6785756 (3)	total: 11m 4s	remaining: 2m 18s
828:	learn: 0.0055583	test: 1.5589135	best: 0.6785756 (3)	total: 11m 5s	remaining: 2m 17s
829:	learn: 0.0055582	test: 1.5589136	best: 0.6785756 (3)	total: 11m 5s	remaining: 2m 16s
830:	learn: 0.0055582	test: 1.5589137	best: 0.6785756 (3)	total: 11m 6s	remaining: 2m 15s
831:	learn: 0.0055582	test: 1.5589241	best: 0.6785756 (3)	total: 11m 7s	remaining: 2m 14s
832:	learn: 0.0055582	test: 1.5589241	best: 0.6785756 (3)	total: 11m 8s	remaining: 2m 13s
833:	learn: 0.0055582	test: 1.5589270	best: 0.6785756 (3)	total: 11m 9s	remaining: 2m 13s
834:	learn: 0.0055582	test: 1.5589270	best: 0.6785756 (3)	total: 11m 9s	remaining: 2m 12s
835:	learn: 0.0055582	test: 1.5589270	best: 0.6785756 (3)	total: 11m 10s	remaining: 2m 11s
836:	learn: 0.0055582	test: 1.5589271	best: 0.6785756 (3)	total: 11m 11s	remaining: 2m 10s
...

917:	learn: 0.0055577	test: 1.5592315	best: 0.6785756 (3)	total: 12m 18s	remaining: 1m 5s
918:	learn: 0.0055577	test: 1.5592732	best: 0.6785756 (3)	total: 12m 18s	remaining: 1m 5s
919:	learn: 0.0055577	test: 1.5592940	best: 0.6785756 (3)	total: 12m 19s	remaining: 1m 4s
920:	learn: 0.0055577	test: 1.5593148	best: 0.6785756 (3)	total: 12m 20s	remaining: 1m 3s
921:	learn: 0.0055577	test: 1.5593149	best: 0.6785756 (3)	total: 12m 21s	remaining: 1m 2s
922:	learn: 0.0055577	test: 1.5593149	best: 0.6785756 (3)	total: 12m 22s	remaining: 1m 1s
923:	learn: 0.0055577	test: 1.5593357	best: 0.6785756 (3)	total: 12m 22s	remaining: 1m 1s
924:	learn: 0.0055577	test: 1.5593357	best: 0.6785756 (3)	total: 12m 23s	remaining: 1m
925:	learn: 0.0055577	test: 1.5593565	best: 0.6785756 (3)	total: 12m 24s	remaining: 59.5s
926:	learn: 0.0055577	test: 1.5593565	best: 0.6785756 (3)	total: 12m 25s	remaining: 58.7s
927:	learn: 0.0055577	test: 1.5593773	best: 0.6785756 (3)	total: 12m 25s	remaining: 57.9s
...
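
The `best:` column in the log above tracks the lowest `test` loss seen so far and the iteration where it occurred. The sketch below reproduces that bookkeeping on the first eleven `test` values from the log and shows where a patience-based stop would trigger. `best_and_stop` and the patience of 5 are hypothetical; CatBoost offers its own `early_stopping_rounds` option for `fit`, which this plain-Python stand-in does not call.

```python
def best_and_stop(test_losses, patience):
    """Return (best iteration, best loss, iteration at which to stop or None)."""
    best_value, best_iter = float("inf"), -1
    for i, loss in enumerate(test_losses):
        if loss < best_value:
            best_value, best_iter = loss, i
        elif i - best_iter >= patience:
            return best_iter, best_value, i  # no improvement for `patience` rounds
    return best_iter, best_value, None

# `test` values for iterations 0-10, copied from the log above.
test_losses = [0.7127637, 0.7259767, 0.7042409, 0.6785756, 0.6998160, 0.7029721,
               0.7287208, 0.7606349, 0.7707081, 0.8033229, 0.8314507]

result = best_and_stop(test_losses, patience=5)
print(result)
```

With these values the best iteration is 3, matching the `(3)` shown in the log, and a patience of 5 would stop at iteration 8 rather than running all 1000 iterations.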

### 2.2.2. Checking Performance

The validation split is used here for a quick internal performance check (i.e., not an official result).

In [11]:
print(classification_report(y_tb_validate, tb_catb.predict(X_tb_validate),digits=4))

              precision    recall  f1-score   support

           0     0.9924    0.9726    0.9824     11644
           1     0.9727    0.9924    0.9825     11464

    accuracy                         0.9824     23108
   macro avg     0.9826    0.9825    0.9824     23108
weighted avg     0.9826    0.9824    0.9824     23108
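
The f1-score column can be recomputed from the precision and recall columns as their harmonic mean. The sketch below does this for the two classes in the report above; because the inputs are already rounded to four digits, the results should only be expected to land close to the reported 0.9824 and 0.9825. The helper name `f1` is an assumption for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

class_0 = f1(0.9924, 0.9726)  # row "0" of the report
class_1 = f1(0.9727, 0.9924)  # row "1" of the report
macro = (class_0 + class_1) / 2  # unweighted mean across the two classes

print(round(class_0, 4), round(class_1, 4), round(macro, 4))
```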



### 2.2.3. Preview of the Tree

*How can it be a tree if there is no proof of a tree?*

In [18]:
tb_catb.plot_tree(0, catboost.Pool(X_tb_training, y_tb_training, cat_features=get_indexes(), feature_names=list(X_tb_training.columns)))

ExecutableNotFound: failed to execute WindowsPath('dot'), make sure the Graphviz executables are on your systems' PATH

<graphviz.graphs.Digraph at 0x2913eefa8d0>
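
The error above reports that the Graphviz `dot` executable was not found on `PATH`. A small guard can check for it before calling `plot_tree`. This is a sketch using only the standard library; `graphviz_available` is a hypothetical helper, and installing the Python `graphviz` package alone may not be enough if the Graphviz executables themselves are missing.

```python
import shutil

def graphviz_available():
    """True when the Graphviz `dot` executable can be found on PATH."""
    return shutil.which("dot") is not None

if graphviz_available():
    print("Graphviz found; the tree preview should be able to render.")
else:
    print("Graphviz `dot` is not on PATH; install Graphviz or skip the tree preview.")
```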

## 2.3. Training on Instance-Based Behaviors

### 2.3.1 Training Model

In [13]:
#Training Model
start_time()
ib_catb = setup_model(IB_HYPERPARAMS)
ib_catb.fit(X_ib_training, y_ib_training, plot=True, eval_set=catboost.Pool(X_ib_validate, label=y_ib_validate, cat_features=get_indexes()))
end_time("IB_CATB")

#Saving Model as file
ib_catb.save_model("Outputs/"+MODEL_FILENAME+"_IB_CATB.model", format="json")


Learning rate set to 0.084849
0:	learn: 0.4892824	test: 0.4893239	best: 0.4893239 (0)	total: 307ms	remaining: 5m 6s
1:	learn: 0.3711208	test: 0.3715602	best: 0.3715602 (1)	total: 622ms	remaining: 5m 10s
2:	learn: 0.2741167	test: 0.2729150	best: 0.2729150 (2)	total: 956ms	remaining: 5m 17s
3:	learn: 0.2124167	test: 0.2115845	best: 0.2115845 (3)	total: 1.32s	remaining: 5m 30s
4:	learn: 0.1616190	test: 0.1612046	best: 0.1612046 (4)	total: 1.65s	remaining: 5m 28s
5:	learn: 0.1224350	test: 0.1223475	best: 0.1223475 (5)	total: 1.99s	remaining: 5m 29s
6:	learn: 0.0984494	test: 0.0986290	best: 0.0986290 (6)	total: 2.31s	remaining: 5m 28s
7:	learn: 0.0810318	test: 0.0813501	best: 0.0813501 (7)	total: 2.64s	remaining: 5m 27s
8:	learn: 0.0678077	test: 0.0682850	best: 0.0682850 (8)	total: 2.95s	remaining: 5m 25s
9:	learn: 0.0576938	test: 0.0582791	best: 0.0582791 (9)	total: 3.34s	remaining: 5m 30s
10:	learn: 0.0510003	test: 0.0517115	best: 0.0517115 (10)	total: 3.67s	remaining: 5m 29s
...

93:	learn: 0.0171079	test: 0.0195343	best: 0.0195343 (93)	total: 35.5s	remaining: 5m 41s
94:	learn: 0.0171079	test: 0.0195343	best: 0.0195343 (93)	total: 35.9s	remaining: 5m 41s
95:	learn: 0.0170548	test: 0.0195338	best: 0.0195338 (95)	total: 36.3s	remaining: 5m 41s
96:	learn: 0.0170548	test: 0.0195338	best: 0.0195338 (96)	total: 36.7s	remaining: 5m 41s
97:	learn: 0.0170142	test: 0.0195133	best: 0.0195133 (97)	total: 37s	remaining: 5m 40s
98:	learn: 0.0169648	test: 0.0195017	best: 0.0195017 (98)	total: 37.4s	remaining: 5m 40s
99:	learn: 0.0168643	test: 0.0194556	best: 0.0194556 (99)	total: 37.8s	remaining: 5m 40s
100:	learn: 0.0168643	test: 0.0194556	best: 0.0194556 (100)	total: 38.2s	remaining: 5m 39s
101:	learn: 0.0167739	test: 0.0194107	best: 0.0194107 (101)	total: 38.7s	remaining: 5m 40s
102:	learn: 0.0167739	test: 0.0194107	best: 0.0194107 (102)	total: 39s	remaining: 5m 39s
103:	learn: 0.0167019	test: 0.0193748	best: 0.0193748 (103)	total: 39.3s	remaining: 5m 38s
...

184:	learn: 0.0139735	test: 0.0180720	best: 0.0180720 (184)	total: 1m 10s	remaining: 5m 10s
185:	learn: 0.0139735	test: 0.0180719	best: 0.0180719 (185)	total: 1m 10s	remaining: 5m 9s
186:	learn: 0.0139379	test: 0.0180612	best: 0.0180612 (186)	total: 1m 11s	remaining: 5m 9s
187:	learn: 0.0139379	test: 0.0180612	best: 0.0180612 (187)	total: 1m 11s	remaining: 5m 8s
188:	learn: 0.0139379	test: 0.0180612	best: 0.0180612 (188)	total: 1m 11s	remaining: 5m 8s
189:	learn: 0.0139129	test: 0.0180602	best: 0.0180602 (189)	total: 1m 12s	remaining: 5m 7s
190:	learn: 0.0139129	test: 0.0180602	best: 0.0180602 (190)	total: 1m 12s	remaining: 5m 7s
191:	learn: 0.0139129	test: 0.0180602	best: 0.0180602 (191)	total: 1m 12s	remaining: 5m 6s
192:	learn: 0.0138927	test: 0.0180468	best: 0.0180468 (192)	total: 1m 13s	remaining: 5m 5s
193:	learn: 0.0138371	test: 0.0180469	best: 0.0180468 (192)	total: 1m 13s	remaining: 5m 5s
194:	learn: 0.0138192	test: 0.0180357	best: 0.0180357 (194)	total: 1m 13s	remaining: 5m 5

274:	learn: 0.0129276	test: 0.0177054	best: 0.0177054 (272)	total: 1m 45s	remaining: 4m 37s
275:	learn: 0.0129276	test: 0.0177054	best: 0.0177054 (275)	total: 1m 45s	remaining: 4m 36s
276:	learn: 0.0129276	test: 0.0177054	best: 0.0177054 (276)	total: 1m 45s	remaining: 4m 36s
277:	learn: 0.0128745	test: 0.0176792	best: 0.0176792 (277)	total: 1m 46s	remaining: 4m 35s
278:	learn: 0.0128634	test: 0.0176670	best: 0.0176670 (278)	total: 1m 46s	remaining: 4m 35s
279:	learn: 0.0128000	test: 0.0176492	best: 0.0176492 (279)	total: 1m 46s	remaining: 4m 34s
280:	learn: 0.0127895	test: 0.0176528	best: 0.0176492 (279)	total: 1m 47s	remaining: 4m 34s
281:	learn: 0.0127672	test: 0.0176506	best: 0.0176492 (279)	total: 1m 47s	remaining: 4m 33s
282:	learn: 0.0127672	test: 0.0176506	best: 0.0176492 (279)	total: 1m 47s	remaining: 4m 33s
283:	learn: 0.0127672	test: 0.0176506	best: 0.0176492 (279)	total: 1m 48s	remaining: 4m 32s
284:	learn: 0.0127339	test: 0.0176218	best: 0.0176218 (284)	total: 1m 48s	remain

364:	learn: 0.0110221	test: 0.0171845	best: 0.0171845 (363)	total: 2m 20s	remaining: 4m 5s
365:	learn: 0.0109882	test: 0.0171845	best: 0.0171845 (365)	total: 2m 21s	remaining: 4m 4s
366:	learn: 0.0109882	test: 0.0171845	best: 0.0171845 (366)	total: 2m 21s	remaining: 4m 4s
367:	learn: 0.0109882	test: 0.0171845	best: 0.0171845 (366)	total: 2m 22s	remaining: 4m 4s
368:	learn: 0.0109814	test: 0.0171844	best: 0.0171844 (368)	total: 2m 22s	remaining: 4m 3s
369:	learn: 0.0109499	test: 0.0171855	best: 0.0171844 (368)	total: 2m 23s	remaining: 4m 3s
370:	learn: 0.0108886	test: 0.0171735	best: 0.0171735 (370)	total: 2m 23s	remaining: 4m 3s
371:	learn: 0.0108462	test: 0.0171916	best: 0.0171735 (370)	total: 2m 24s	remaining: 4m 3s
372:	learn: 0.0108245	test: 0.0171951	best: 0.0171735 (370)	total: 2m 24s	remaining: 4m 2s
373:	learn: 0.0107900	test: 0.0171991	best: 0.0171735 (370)	total: 2m 24s	remaining: 4m 2s
374:	learn: 0.0107733	test: 0.0172022	best: 0.0171735 (370)	total: 2m 25s	remaining: 4m 2s

454:	learn: 0.0095788	test: 0.0169453	best: 0.0169453 (454)	total: 3m 3s	remaining: 3m 39s
455:	learn: 0.0095788	test: 0.0169452	best: 0.0169452 (455)	total: 3m 4s	remaining: 3m 39s
456:	learn: 0.0095788	test: 0.0169453	best: 0.0169452 (455)	total: 3m 4s	remaining: 3m 39s
457:	learn: 0.0095671	test: 0.0169462	best: 0.0169452 (455)	total: 3m 4s	remaining: 3m 38s
458:	learn: 0.0095600	test: 0.0169447	best: 0.0169447 (458)	total: 3m 5s	remaining: 3m 38s
459:	learn: 0.0095512	test: 0.0169554	best: 0.0169447 (458)	total: 3m 5s	remaining: 3m 37s
460:	learn: 0.0095278	test: 0.0169457	best: 0.0169447 (458)	total: 3m 6s	remaining: 3m 37s
461:	learn: 0.0095278	test: 0.0169457	best: 0.0169447 (458)	total: 3m 6s	remaining: 3m 37s
462:	learn: 0.0095278	test: 0.0169457	best: 0.0169447 (458)	total: 3m 6s	remaining: 3m 36s
463:	learn: 0.0095121	test: 0.0169556	best: 0.0169447 (458)	total: 3m 7s	remaining: 3m 36s
464:	learn: 0.0094988	test: 0.0169441	best: 0.0169441 (464)	total: 3m 7s	remaining: 3m 35s

544:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 40s	remaining: 3m 3s
545:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 40s	remaining: 3m 3s
546:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 41s	remaining: 3m 3s
547:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 41s	remaining: 3m 2s
548:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 41s	remaining: 3m 2s
549:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 42s	remaining: 3m 1s
550:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 42s	remaining: 3m 1s
551:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 42s	remaining: 3m
552:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 43s	remaining: 3m
553:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 43s	remaining: 3m
554:	learn: 0.0091010	test: 0.0168942	best: 0.0168725 (495)	total: 3m 44s	remaining: 2m 59s
...

634:	learn: 0.0090571	test: 0.0168821	best: 0.0168725 (495)	total: 4m 13s	remaining: 2m 25s
635:	learn: 0.0090571	test: 0.0168821	best: 0.0168725 (495)	total: 4m 13s	remaining: 2m 25s
636:	learn: 0.0090571	test: 0.0168821	best: 0.0168725 (495)	total: 4m 13s	remaining: 2m 24s
637:	learn: 0.0090571	test: 0.0168821	best: 0.0168725 (495)	total: 4m 14s	remaining: 2m 24s
638:	learn: 0.0090571	test: 0.0168821	best: 0.0168725 (495)	total: 4m 14s	remaining: 2m 23s
639:	learn: 0.0090571	test: 0.0168820	best: 0.0168725 (495)	total: 4m 14s	remaining: 2m 23s
640:	learn: 0.0090571	test: 0.0168820	best: 0.0168725 (495)	total: 4m 15s	remaining: 2m 23s
641:	learn: 0.0090571	test: 0.0168819	best: 0.0168725 (495)	total: 4m 15s	remaining: 2m 22s
642:	learn: 0.0090571	test: 0.0168819	best: 0.0168725 (495)	total: 4m 16s	remaining: 2m 22s
643:	learn: 0.0090571	test: 0.0168819	best: 0.0168725 (495)	total: 4m 16s	remaining: 2m 21s
644:	learn: 0.0090571	test: 0.0168819	best: 0.0168725 (495)	total: 4m 16s	remain

724:	learn: 0.0089681	test: 0.0168439	best: 0.0168426 (718)	total: 4m 45s	remaining: 1m 48s
725:	learn: 0.0089532	test: 0.0168485	best: 0.0168426 (718)	total: 4m 45s	remaining: 1m 47s
726:	learn: 0.0089450	test: 0.0168614	best: 0.0168426 (718)	total: 4m 46s	remaining: 1m 47s
727:	learn: 0.0089310	test: 0.0168656	best: 0.0168426 (718)	total: 4m 46s	remaining: 1m 47s
728:	learn: 0.0089310	test: 0.0168656	best: 0.0168426 (718)	total: 4m 47s	remaining: 1m 46s
729:	learn: 0.0089310	test: 0.0168656	best: 0.0168426 (718)	total: 4m 47s	remaining: 1m 46s
730:	learn: 0.0089310	test: 0.0168656	best: 0.0168426 (718)	total: 4m 47s	remaining: 1m 45s
731:	learn: 0.0089310	test: 0.0168656	best: 0.0168426 (718)	total: 4m 48s	remaining: 1m 45s
732:	learn: 0.0089310	test: 0.0168656	best: 0.0168426 (718)	total: 4m 48s	remaining: 1m 45s
733:	learn: 0.0089125	test: 0.0168676	best: 0.0168426 (718)	total: 4m 48s	remaining: 1m 44s
734:	learn: 0.0089001	test: 0.0168625	best: 0.0168426 (718)	total: 4m 49s	remain

814:	learn: 0.0087072	test: 0.0168968	best: 0.0168426 (718)	total: 5m 16s	remaining: 1m 11s
815:	learn: 0.0087072	test: 0.0168969	best: 0.0168426 (718)	total: 5m 17s	remaining: 1m 11s
816:	learn: 0.0087072	test: 0.0168969	best: 0.0168426 (718)	total: 5m 17s	remaining: 1m 11s
817:	learn: 0.0087072	test: 0.0168969	best: 0.0168426 (718)	total: 5m 17s	remaining: 1m 10s
818:	learn: 0.0087072	test: 0.0168969	best: 0.0168426 (718)	total: 5m 18s	remaining: 1m 10s
819:	learn: 0.0087072	test: 0.0168969	best: 0.0168426 (718)	total: 5m 18s	remaining: 1m 9s
820:	learn: 0.0087072	test: 0.0168970	best: 0.0168426 (718)	total: 5m 18s	remaining: 1m 9s
821:	learn: 0.0087072	test: 0.0168970	best: 0.0168426 (718)	total: 5m 19s	remaining: 1m 9s
822:	learn: 0.0087072	test: 0.0168970	best: 0.0168426 (718)	total: 5m 19s	remaining: 1m 8s
823:	learn: 0.0087072	test: 0.0168970	best: 0.0168426 (718)	total: 5m 19s	remaining: 1m 8s
824:	learn: 0.0087071	test: 0.0168970	best: 0.0168426 (718)	total: 5m 20s	remaining: 

905:	learn: 0.0087066	test: 0.0168977	best: 0.0168426 (718)	total: 5m 48s	remaining: 36.2s
906:	learn: 0.0087066	test: 0.0168977	best: 0.0168426 (718)	total: 5m 48s	remaining: 35.8s
907:	learn: 0.0087065	test: 0.0168978	best: 0.0168426 (718)	total: 5m 49s	remaining: 35.4s
908:	learn: 0.0087065	test: 0.0168978	best: 0.0168426 (718)	total: 5m 49s	remaining: 35s
909:	learn: 0.0087065	test: 0.0168978	best: 0.0168426 (718)	total: 5m 49s	remaining: 34.6s
910:	learn: 0.0087065	test: 0.0168978	best: 0.0168426 (718)	total: 5m 50s	remaining: 34.2s
911:	learn: 0.0087065	test: 0.0168978	best: 0.0168426 (718)	total: 5m 50s	remaining: 33.8s
912:	learn: 0.0087065	test: 0.0168978	best: 0.0168426 (718)	total: 5m 51s	remaining: 33.5s
913:	learn: 0.0087065	test: 0.0168978	best: 0.0168426 (718)	total: 5m 51s	remaining: 33.1s
914:	learn: 0.0087065	test: 0.0168978	best: 0.0168426 (718)	total: 5m 51s	remaining: 32.7s
915:	learn: 0.0087065	test: 0.0168978	best: 0.0168426 (718)	total: 5m 52s	remaining: 32.3s
...

996:	learn: 0.0086226	test: 0.0168640	best: 0.0168426 (718)	total: 6m 22s	remaining: 1.15s
997:	learn: 0.0086226	test: 0.0168640	best: 0.0168426 (718)	total: 6m 22s	remaining: 767ms
998:	learn: 0.0086226	test: 0.0168640	best: 0.0168426 (718)	total: 6m 23s	remaining: 384ms
999:	learn: 0.0086152	test: 0.0168495	best: 0.0168426 (718)	total: 6m 23s	remaining: 0us

bestTest = 0.01684262022
bestIteration = 718

Shrink model to first 719 iterations.
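
Each line of the training log follows the same layout, so the values can be pulled out with a regular expression, for example to plot them later. The sketch below parses the final line of the log above and shows why a best iteration of 718 corresponds to keeping 719 trees (iterations are zero-based). `LINE` and `parse_line` are hypothetical names, and the pattern is inferred from the log lines shown here rather than from any documented format.

```python
import re

# Pattern inferred from the log lines above: "<iter>: learn: <v> test: <v> best: <v> (<iter>) ..."
LINE = re.compile(
    r"^(?P<iter>\d+):\s+learn:\s+(?P<learn>[\d.]+)\s+test:\s+(?P<test>[\d.]+)"
    r"\s+best:\s+(?P<best>[\d.]+)\s+\((?P<best_iter>\d+)\)"
)

def parse_line(line):
    """Return the numeric fields of one log line, or None if it does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    return {
        "iteration": int(m["iter"]),
        "learn": float(m["learn"]),
        "test": float(m["test"]),
        "best": float(m["best"]),
        "best_iteration": int(m["best_iter"]),
    }

sample = "999:\tlearn: 0.0086152\ttest: 0.0168495\tbest: 0.0168426 (718)\ttotal: 6m 23s\tremaining: 0us"
row = parse_line(sample)
trees_kept = row["best_iteration"] + 1  # zero-based iteration index, so add one
print(row["iteration"], row["best_iteration"], trees_kept)
```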


### 2.3.2. Checking Performance

The validation split is used here for a quick internal performance check (i.e., not an official result).

In [14]:
print(classification_report(y_ib_validate, ib_catb.predict(X_ib_validate),digits=4))

              precision    recall  f1-score   support

           0     0.9980    0.9922    0.9951     11507
           1     0.9923    0.9980    0.9951     11601

    accuracy                         0.9951     23108
   macro avg     0.9951    0.9951    0.9951     23108
weighted avg     0.9951    0.9951    0.9951     23108



### 2.3.3. Preview of the Tree

*How can it be a tree if there is no proof of a tree?*

In [15]:
ib_catb.plot_tree(0, catboost.Pool(X_ib_training, y_ib_training, cat_features=get_indexes(), feature_names=list(X_ib_training.columns)))

ExecutableNotFound: failed to execute WindowsPath('dot'), make sure the Graphviz executables are on your systems' PATH

<graphviz.graphs.Digraph at 0x2913e763c90>

In [16]:
logging("\n")