# Model Comparison

**Objectives**
- Take note of the results of previous performance tests that used the Holdout Method (i.e., during Model Robustness Test)
- Using the `McNemar Test`, determine whether there are differences between GBDT Models (e.g., LGBM Default vs CatBoost Default), between configurations (e.g., LGBM Default vs LGBM Tuned), and between behavior-types (e.g., Time-based LGBM Tuned vs Time-based CatBoost Tuned)
- Use whichever dataset is appropriate (probably the Test/Holdout Split).
- Take note of the `McNemar Test` results

Assume a `significance level` of **0.05 (5%)**, as mentioned in the RRL relating to Model Comparison (we simply use it as the reference value for the significance level).

<hr>

*Kindly double check the statement(s) that will follow:*

Assume the null hypothesis is *"there is no significant difference between the two models"* (i.e., both models misclassify at the same rate). `<== Modify this accordingly depending on which will be compared (whether GBDT vs GBDT or Default vs Tuned)`

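If the resulting `p-value` is larger than the `significance level`, the null hypothesis is not rejected. Otherwise (`p-value` < `significance level`), the null hypothesis is rejected and the difference between the two models is considered statistically significant.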
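
The decision rule above can be sketched from scratch, taking the null hypothesis to be that both models have the same error rate. This is only a minimal illustration using the standard library, not the notebook's actual pipeline: the helper names `mcnemar_chi2` and `decide` and the discordant counts `b=15`, `c=5` are hypothetical, chosen purely to show the arithmetic of the uncorrected chi-square version of the test.

```python
from math import erfc, sqrt

ALPHA = 0.05  # significance level assumed in this notebook


def mcnemar_chi2(b: int, c: int) -> tuple[float, float]:
    """Uncorrected McNemar test computed from the two discordant counts.

    b: samples model 1 got right and model 2 got wrong
    c: samples model 1 got wrong and model 2 got right
    """
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree, so there is no evidence of a difference
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


def decide(p_value: float, alpha: float = ALPHA) -> str:
    """Apply the decision rule: reject H0 only when p < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical discordant counts, for illustration only.
chi2, p = mcnemar_chi2(b=15, c=5)
print(f"chi2={chi2:.4f}, p={p:.4f}, decision: {decide(p)}")
```

In practice the library calls used later in this notebook (`statsmodels` and `mlxtend`) should be preferred; the sketch only makes the comparison against the significance level explicit.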

In [15]:
import statsmodels.stats.contingency_tables as statsmodels #mcnemar
import mlxtend.evaluate as mlxtend #mcnemar_table, mcnemar
import pandas as pd
import numpy as np
import lightgbm as lgbm
import catboost as catb
from joblib import load

In [16]:
DF_LGBM_TB = pd.read_csv('../Dataset/TB/LGBM_TB_Test.csv', low_memory=False) #<== Point these to the proper Test/Holdout datasets.
DF_LGBM_IB = pd.read_csv('../Dataset/IB/LGBM_IB_Test.csv', low_memory=False)
DF_CATB_TB = pd.read_csv('../Dataset/TB/CATB_TB_Test.csv', low_memory=False) #<== Point these to the proper Test/Holdout datasets.
DF_CATB_IB = pd.read_csv('../Dataset/IB/CATB_IB_Test.csv', low_memory=False)

**Battle Chart:**

**GBDT vs GBDT**
- LGBM TB vs CatBoost TB
- LGBM IB vs CatBoost IB
- Tuned LGBM TB vs Tuned CatBoost TB
- Tuned LGBM IB vs Tuned CatBoost IB

**Default vs Tuned**
- LGBM TB vs Tuned LGBM TB
- LGBM IB vs Tuned LGBM IB
- CatBoost TB vs Tuned CatBoost TB
- CatBoost IB vs Tuned CatBoost IB
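
Every pairing in the chart reduces to the same 2x2 table of agreements and disagreements against the true labels. The sketch below shows, under stated assumptions, how such tables could be built for a list of pairings. The labels and predictions are toy values invented for illustration, and `contingency_table` is a hypothetical pure-Python helper that mirrors the layout this notebook gets from `mlxtend.mcnemar_table` (rows: model 1 correct/wrong, columns: model 2 correct/wrong).

```python
def contingency_table(y_true, pred_a, pred_b):
    """Build the 2x2 McNemar table for two sets of predictions.

    Row 0 / row 1: model A correct / wrong.
    Column 0 / column 1: model B correct / wrong.
    """
    table = [[0, 0], [0, 0]]
    for truth, a, b in zip(y_true, pred_a, pred_b):
        row = 0 if a == truth else 1
        col = 0 if b == truth else 1
        table[row][col] += 1
    return table


# Toy labels and predictions, invented purely for illustration.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
predictions = {
    "LGBM TB":       [0, 0, 1, 0, 0, 1, 1, 1],
    "Tuned LGBM TB": [0, 0, 1, 1, 0, 0, 0, 1],
    "CatBoost TB":   [0, 1, 1, 1, 0, 1, 0, 0],
}

# A subset of the pairings listed in the chart.
pairings = [
    ("LGBM TB", "Tuned LGBM TB"),  # Default vs Tuned
    ("LGBM TB", "CatBoost TB"),    # GBDT vs GBDT
]

tables = {
    pair: contingency_table(y_true, predictions[pair[0]], predictions[pair[1]])
    for pair in pairings
}

for pair, table in tables.items():
    print(pair, table)
```

Both models in a pairing must be evaluated on the same samples in the same order; otherwise the off-diagonal counts are meaningless.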

In [42]:
# SAMPLE USE CASE (LGBM TB vs Tuned LGBM TB)

default_tb = load('../GBDT_Training/Outputs/Results/Demo/LGBM/Train (Default)/DEMO_LGBM_TB.model') # <== Point these to the respective .model files
tuned_tb = load('../GBDT_Training/Outputs/Results/Demo/LGBM/Train (Tuned)/TUNED_DEMO_LGBM_TB.model')

# The correct target (class) labels
y_target = DF_LGBM_TB['malware'] # <=== Collect labels as list

# Class labels predicted by model 1
y_model1 = default_tb.predict(X=DF_LGBM_TB.iloc[:,1:101]) # <=== Collect Model A predictions as list

# Class labels predicted by model 2
y_model2 = tuned_tb.predict(X=DF_LGBM_TB.iloc[:,1:101]) # <=== Collect Model B predictions as list

table = mlxtend.mcnemar_table(y_target=y_target,
                              y_model1=y_model1,
                              y_model2=y_model2)

display(table)

print("statsmodels.mcnemar:")
print(statsmodels.mcnemar(table, exact=False, correction=False))
print("")

chi2, p = mlxtend.mcnemar(table, exact=False, corrected=False)
print("mlxtend.mcnemar (sanity check):")
print(f"pvalue:\t{p}\nchi2:\t{chi2}\n")
print("")



https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


array([[4328,    8],
       [   5,   47]])

statsmodels.mcnemar:
pvalue      0.40538055645894244
statistic   0.6923076923076923

mlxtend.mcnemar (sanity check):
pvalue:	0.40538055645894244
chi2:	0.6923076923076923
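
Reading the output above: the p-value (about 0.405) is larger than the 0.05 significance level, so the null hypothesis is not rejected for this sample comparison. The two discordant cells sum to only 8 + 5 = 13, and a commonly cited rule of thumb (also noted in the mlxtend documentation) suggests the exact binomial version of the test when that sum is small, roughly below 25. The following is a hedged, from-scratch sketch of that exact test using only the standard library; `mcnemar_exact` is a hypothetical helper. Both libraries used above expose the same option through their `exact=True` argument.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact (binomial) McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Contingency table printed above (LGBM TB vs Tuned LGBM TB).
table = [[4328, 8],
         [5, 47]]
b, c = table[0][1], table[1][0]

p_exact = mcnemar_exact(b, c)
print(f"discordant pairs: {b + c}, exact p-value: {p_exact:.4f}")
print("reject H0" if p_exact < 0.05 else "fail to reject H0")
```

The exact p-value is also above 0.05, so the conclusion for this pairing is unchanged.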


