# Balancing Oliveira

The objective of this notebook is to explore the viability of conducting dataset sample balancing through two techniques:

1. Resampling (specifically RandomOverSampler)
2. SMOTE (specifically SMOTEN)

It turns out, there is a dedicated library called "[Imbalanced Learn"](https://imbalanced-learn.org/stable/index.html) (denoted as `imblearn`) which can be used to handle imbalanced datasets such that it can perform the two techniques.

**It is assumed (unless otherwise stated) that other data pre-processing (i.e., removing duplicates, data cleaning, label encoding, and shuffling/renaming of columns) will be done after this. Hence, this notebook only explores time-based behaviors (i.e., with duplicates).**

This notebook will also look into implementing/testing the proposal of Tustin where she suggested to use part of Oliveira for Model Robustness Testing instead of a third-party dataset as originally proposed. The overview of its steps are as follows:

1. Get the raw dataset (Oliveira).
2. Split it into train (70%) and test (30%)
3. The train split will undergo smote/resampling.
4. Then the test split will be used to determine if the model really works well and robustly

*Note that the hyperparameter `random_state` was set to `1` instead of `None` for test repeatability.*

In [1]:
import time
start_time = 0
def start():
    global start_time
    start_time = time.time()
def end():
    print("Elapsed time:", time.time()-start_time, "seconds")

from sklearn.ensemble import HistGradientBoostingClassifier #Nearest to LightGBM and XGBoost
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def hgbt_classifier(dataset, cat):
    X = dataset.iloc[:,1:101]
    y = dataset.iloc[:,101]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
    hgbt = HistGradientBoostingClassifier(loss='log_loss', learning_rate=0.1, max_iter=300, max_leaf_nodes=2, 
                                        max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=255, 
                                        categorical_features=cat, monotonic_cst=None, interaction_cst=None, 
                                        warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, 
                                        n_iter_no_change=10, tol=1e-07, verbose=0, random_state=1, class_weight=None)
    start()
    hgbt.fit(X_train, y_train)
    end()
    y_pred = hgbt.predict(X_test)
    print(classification_report(y_test, y_pred))

# Import Dataset

In [2]:
import pandas as pd

start()
oli = pd.read_csv("oliveira.csv")
end()
print(oli.shape)
oli.head()

Elapsed time: 0.411029577255249 seconds
(43876, 102)


Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,malware
0,071e8c3f8922e186e57548cd4c703a5d,112,274,158,215,274,158,215,298,76,...,71,297,135,171,215,35,208,56,71,1
1,33f8e6d08a6aae939f25a8e0d63dd523,82,208,187,208,172,117,172,117,172,...,81,240,117,71,297,135,171,215,35,1
2,b68abd064e975e1c6d5f25e748663076,16,110,240,117,240,117,240,117,240,...,65,112,123,65,112,123,65,113,112,1
3,72049be7bd30ea61297ea624ae198067,82,208,187,208,172,117,172,117,172,...,208,302,208,302,187,208,302,228,302,1
4,c9b3700a77facf29172f32df6bc77f48,82,240,117,240,117,240,117,240,117,...,209,260,40,209,260,141,260,141,260,1


# Check Labels

In [3]:
oli["malware"].value_counts()

1    42797
0     1079
Name: malware, dtype: int64

# Obtain Features (X)

This includes the hash as part *feature* for now. 

However, for model training/fitting, the hash will not be considered as a feature anymore. 

In [4]:
X = oli.iloc[:,:101]
X.head()

Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,071e8c3f8922e186e57548cd4c703a5d,112,274,158,215,274,158,215,298,76,...,117,71,297,135,171,215,35,208,56,71
1,33f8e6d08a6aae939f25a8e0d63dd523,82,208,187,208,172,117,172,117,172,...,60,81,240,117,71,297,135,171,215,35
2,b68abd064e975e1c6d5f25e748663076,16,110,240,117,240,117,240,117,240,...,123,65,112,123,65,112,123,65,113,112
3,72049be7bd30ea61297ea624ae198067,82,208,187,208,172,117,172,117,172,...,215,208,302,208,302,187,208,302,228,302
4,c9b3700a77facf29172f32df6bc77f48,82,240,117,240,117,240,117,240,117,...,40,209,260,40,209,260,141,260,141,260


# Obtain Labels (y)

In [5]:
#Obtain labels y
y = oli.iloc[:,101]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: malware, dtype: int64

# Baseline Performance

In [6]:
hgbt_classifier(oli, None)

Elapsed time: 1.5124595165252686 seconds
              precision    recall  f1-score   support

           0       0.90      0.45      0.60       329
           1       0.99      1.00      0.99     12834

    accuracy                           0.98     13163
   macro avg       0.94      0.72      0.79     13163
weighted avg       0.98      0.98      0.98     13163



# 1. Resampling

For this, imblearn's RandomOverSampler will be used as a resampling technique.

Further details can be found [here](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html).

In [7]:
from imblearn.over_sampling import RandomOverSampler

start()
ros = RandomOverSampler(random_state=1, sampling_strategy='minority') #random_state is arbitrary; setting a value makes its output repeatable
X_resampled, y_resampled = ros.fit_resample(X, y) #Resampling of minority classes (i.e., Benign/0 for Oliveira)
end()

oli_resampled = pd.concat([X_resampled,y_resampled.copy()], axis=1)
print("X_resampled:", X_resampled.shape)
print("y_resampled:", y_resampled.shape)
print("Oliveira (Resampled):", oli_resampled.shape, "\n")
print("Class Count:\n" + str(oli_resampled['malware'].value_counts()), "\n")
print("Top 20 Most Repeated Samples:\n" + str(oli_resampled[['hash','malware']].value_counts()[0:20]), "\n")
print("Unique Values:", len(oli_resampled['hash'].unique()))
oli_resampled

Elapsed time: 0.8105578422546387 seconds
X_resampled: (85594, 101)
y_resampled: (85594,)
Oliveira (Resampled): (85594, 102) 

Class Count:
1    42797
0    42797
Name: malware, dtype: int64 

Top 20 Most Repeated Samples:
hash                              malware
03384ab6368b68ed16ecb9e6352539af  0          90
0822ec2ba98d291e5bfc836bc3686096  0          90
f78ea80cec007b2c32fb10f9c6c82f39  0          88
075323e77815ee8bcc7854ce23955a15  0          79
79b78bb3d583748040c41ded09555fd3  0          72
bdaaac3fa3f6796825a51ef1c0e5b3fd  0          71
3d8a7a97ea954dd4fe66279df2b445e0  0          70
d0b42a077320d2ab2d2a80bcbcae02cb  0          60
3daa3068ea8bf5d2e65820c42af62227  0          59
7cc90abc007d2efc476930137899cfda  0          58
484b5a0e5782535e1091412d24198afc  0          58
0566db6153dc8f7bdbef9552a6852139  0          57
9976cd22e18868887eacb927616d5e41  0          57
3a056dfba8365bc058ac05634cb22818  0          56
74db62f95ab558326cc79e4001726832  0          56
c6f4acf3988a14867

Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,malware
0,071e8c3f8922e186e57548cd4c703a5d,112,274,158,215,274,158,215,298,76,...,71,297,135,171,215,35,208,56,71,1
1,33f8e6d08a6aae939f25a8e0d63dd523,82,208,187,208,172,117,172,117,172,...,81,240,117,71,297,135,171,215,35,1
2,b68abd064e975e1c6d5f25e748663076,16,110,240,117,240,117,240,117,240,...,65,112,123,65,112,123,65,113,112,1
3,72049be7bd30ea61297ea624ae198067,82,208,187,208,172,117,172,117,172,...,208,302,208,302,187,208,302,228,302,1
4,c9b3700a77facf29172f32df6bc77f48,82,240,117,240,117,240,117,240,117,...,209,260,40,209,260,141,260,141,260,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85589,e1efa1385daabfec74fe877f63f7daf1,286,110,172,240,117,240,117,240,117,...,215,114,215,117,71,25,71,275,260,0
85590,892262603e040080d04a9e5f72e3165b,82,228,16,215,89,208,215,274,158,...,194,82,194,297,194,297,194,82,194,0
85591,0b2a5e7d55bc9ca38c5c736d85b1195f,82,172,117,16,172,117,172,117,198,...,187,215,73,29,73,29,82,240,117,0
85592,1206af04cdc528613fc12471419ebf01,16,172,194,274,158,215,274,158,215,...,158,215,274,158,215,274,158,215,274,0


The data above (`Top 20 Most Repeated Samples`) suggests that the samples that were duplicated in RandomOverSampler were relatively balanced.

Consider the Computation:

| Computation                       | Value |
|-----------------------------------|-------|
| Initial Benign   Samples          | 1079  |
| Benign Samples after Resampling   | 42797 |
| Diff. Before and After Resampling | 41718 |	
		
Distribution of Top 5 Most Resampled Sample:

| Sample                           | Quantity in Resampled | % in Resampled |
|----------------------------------|-----------------------|----------------|
| 03384ab6368b68ed16ecb9e6352539af | 90                    | 0.22%          |
| 0822ec2ba98d291e5bfc836bc3686096 | 90                    | 0.22%          |
| f78ea80cec007b2c32fb10f9c6c82f39 | 88                    | 0.21%          |
| 075323e77815ee8bcc7854ce23955a15 | 79                    | 0.19%          |
| 79b78bb3d583748040c41ded09555fd3 | 72                    | 0.17%          |

In [8]:
hgbt_classifier(oli_resampled, None)

Elapsed time: 2.444812059402466 seconds
              precision    recall  f1-score   support

           0       0.91      0.86      0.88     12827
           1       0.87      0.92      0.89     12852

    accuracy                           0.89     25679
   macro avg       0.89      0.89      0.89     25679
weighted avg       0.89      0.89      0.89     25679



# 2. SMOTE (specifically SMOTEN)

For this, imblearn's SMOTEN will be used as the SMOTE technique. SMOTEN (Synthetic Minority Over-sampling Technique for Nominal) which is described as 

`This method is referred as SMOTEN in [1]. It expects that the data to resample are only made of categorical features.`

Further details can be found [here](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTEN.html).

For this example, it is assumed that the numerical values will act as in-place of categorical data since HGBT does not support categorical data. Apart from that, the hashes are considered still as categorical data despite its irrelevance in the training and prediction process.

In [9]:
from imblearn.over_sampling import SMOTEN

start()
smoten = SMOTEN(random_state=1, sampling_strategy='minority') #random_state is arbitrary; setting a value makes its output repeatable
X_smoten, y_smoten = smoten.fit_resample(X, y) #Resampling of minority classes (i.e., Benign/0 for Oliveira)
end()

oli_smoten = pd.concat([X_smoten,y_smoten.copy()], axis=1)
print("X_smoten", X_smoten.shape)
print("y_smoten", y_smoten.shape)
print("Oliveira (SMOTEN):", oli_smoten.shape, "\n")
print("Class Count:\n" + str(oli_smoten['malware'].value_counts()), "\n")
print("Top 10 Most Repeated Samples:\n" + str(oli_smoten[['hash','malware']].value_counts()[0:20]),"\n")
print("Unique Values:", len(oli_smoten['hash'].unique()))
oli_smoten

Elapsed time: 151.55447340011597 seconds
X_smoten (85594, 101)
y_smoten (85594,)
Oliveira (SMOTEN): (85594, 102) 

Class Count:
1    42797
0    42797
Name: malware, dtype: int64 

Top 10 Most Repeated Samples:
hash                              malware
3cedd98ea184c22ee3b024c72a96e075  0          5965
0fbe9eac4ff5af1a392d92881c70c559  0          3728
0b7e7bc7598abe9cfdc594e17e795cf0  0          1895
125d4cdb14dbe86841037e5bbfc6a0bc  0           895
35dd2f5d51ba224735732424f8ab6398  0           860
05d2a956ac82d30fef9b807e9746b339  0           833
0d77f98fafb6c34a5861e61de06b4b0f  0           718
0d2ab02c993ea29a1989b442bf7150c7  0           622
06b43cb00b61be55b6d100b15edfbc39  0           568
07613e49400add94324002a019d3a9f5  0           566
0bdf4fee812ba46472f51d1ae2ef5e04  0           554
0822ec2ba98d291e5bfc836bc3686096  0           529
3506356c329758e4f703cd2103d7daab  0           505
501e961feebbde040fb836cb5de122c2  0           498
0fca2620b9f96936b7594fc650b1d8ca  0           44

Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,malware
0,071e8c3f8922e186e57548cd4c703a5d,112,274,158,215,274,158,215,298,76,...,71,297,135,171,215,35,208,56,71,1
1,33f8e6d08a6aae939f25a8e0d63dd523,82,208,187,208,172,117,172,117,172,...,81,240,117,71,297,135,171,215,35,1
2,b68abd064e975e1c6d5f25e748663076,16,110,240,117,240,117,240,117,240,...,65,112,123,65,112,123,65,113,112,1
3,72049be7bd30ea61297ea624ae198067,82,208,187,208,172,117,172,117,172,...,208,302,208,302,187,208,302,228,302,1
4,c9b3700a77facf29172f32df6bc77f48,82,240,117,240,117,240,117,240,117,...,209,260,40,209,260,141,260,141,260,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85589,0fbe9eac4ff5af1a392d92881c70c559,286,110,172,240,117,240,117,240,117,...,215,114,215,117,71,25,71,275,260,0
85590,0d77f98fafb6c34a5861e61de06b4b0f,82,172,16,274,158,215,117,158,215,...,117,215,274,158,215,274,158,215,274,0
85591,0b2a5e7d55bc9ca38c5c736d85b1195f,82,208,117,208,172,117,172,208,16,...,31,117,73,215,73,29,82,117,85,0
85592,0d77f98fafb6c34a5861e61de06b4b0f,82,172,16,274,158,215,274,158,215,...,158,215,274,158,215,274,158,215,274,0


The data above (`Top 20 Most Repeated Samples`) suggests that the samples that were duplicated in SMOTEN were skewed on a number of specific samples.

Consider the Computation:

| Computation                       | Value |
|-----------------------------------|-------|
| Initial Benign   Samples          | 1079  |
| Benign Samples after Resampling   | 42797 |
| Diff. Before and After Resampling | 41718 |	
		
Distribution of Top 5 Most Resampled Sample:

| Sample                           | Quantity in Resampled | % in Resampled |
|----------------------------------|-----------------------|----------------|
| 3cedd98ea184c22ee3b024c72a96e075 | 5965                  | 14.30%         |
| 0fbe9eac4ff5af1a392d92881c70c559 | 3728                  | 8.94%          |
| 0b7e7bc7598abe9cfdc594e17e795cf0 | 1895                  | 4.54%          |
| 125d4cdb14dbe86841037e5bbfc6a0bc | 895                   | 2.15%          |
| 35dd2f5d51ba224735732424f8ab6398 | 860                   | 2.06%          |

In [10]:
hgbt_classifier(oli_smoten,None)

Elapsed time: 2.402601480484009 seconds
              precision    recall  f1-score   support

           0       0.94      0.95      0.94     12827
           1       0.95      0.94      0.94     12852

    accuracy                           0.94     25679
   macro avg       0.94      0.94      0.94     25679
weighted avg       0.94      0.94      0.94     25679



# 3. Proposed solution to allow for Model Robustness Testing

This implementation of the proposed solution assumes that the split will occur at data preparation.

Taking note of the overview of steps:

1. Get the raw dataset (Oliveira).
2. Split it into train (70%) and test (30%)
3. The train split will undergo smote/resampling.
4. Then the test split will be used to determine if the model really works well and robustly

The Oliveira dataset will be divided into train and test splits at a 70:30 ratio. 

The train split will undergo balancing (either Resampling or SMOTE) while the test split will be as is.

The train split will be used in training where it will undergo further splitting as part of the cross-validation processes (i.e., holdout and repeated holdout) during training.

## 3.1. Loading Raw Oliveira

In [11]:
import time
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
import sklearn.model_selection as model_selection
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import RandomizedSearchCV

oli = pd.read_csv("oliveira.csv")
print(oli.shape)
oli.head()

(43876, 102)


Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,malware
0,071e8c3f8922e186e57548cd4c703a5d,112,274,158,215,274,158,215,298,76,...,71,297,135,171,215,35,208,56,71,1
1,33f8e6d08a6aae939f25a8e0d63dd523,82,208,187,208,172,117,172,117,172,...,81,240,117,71,297,135,171,215,35,1
2,b68abd064e975e1c6d5f25e748663076,16,110,240,117,240,117,240,117,240,...,65,112,123,65,112,123,65,113,112,1
3,72049be7bd30ea61297ea624ae198067,82,208,187,208,172,117,172,117,172,...,208,302,208,302,187,208,302,228,302,1
4,c9b3700a77facf29172f32df6bc77f48,82,240,117,240,117,240,117,240,117,...,209,260,40,209,260,141,260,141,260,1


## 3.2. Splitting it into train and test at 70:30 ratio.
*Both will be logically named as train_split and test_split respectively.*

In [12]:
X = oli.iloc[:,0:101] #features
y = oli.iloc[:,101] #label

#Splitting to train_split and test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

#Building train_split dataframe.
train_split = pd.concat([X_train,y_train.copy()], axis=1)

#Building test_split dataframe.
test_split = pd.concat([X_test,y_test.copy()], axis=1)

In [13]:
print("train_split shape:", train_split.shape)
print("train_split value_counts:")
print(train_split['malware'].value_counts())
train_split.head()

train_split shape: (30713, 102)
train_split value_counts:
1    29963
0      750
Name: malware, dtype: int64


Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,malware
14432,38d19f81159e526a6d410af78f74ccf9,82,240,117,240,117,240,117,240,117,...,240,89,117,31,117,215,260,192,89,1
39388,a04c5d83384df33c48db5b21501aa336,215,274,158,215,274,158,215,172,117,...,60,81,60,81,172,117,25,172,117,1
23274,1ddbb2bac5055ed0bfad44b24d4442ae,215,274,158,215,274,158,215,172,117,...,60,81,60,81,172,117,25,172,117,1
28198,a6f3478cd06844ac9da84103b117a862,112,274,158,215,274,158,215,298,76,...,71,297,135,171,215,35,208,56,71,1
23969,8fc79157568b69638820ec31e25072ed,240,117,240,117,240,117,240,117,240,...,60,81,60,81,60,81,60,81,60,1


In [14]:
print("test_split shape:", test_split.shape)
print(test_split['malware'].value_counts())
test_split.head()

test_split shape: (13163, 102)
1    12834
0      329
Name: malware, dtype: int64


Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,malware
24659,f0f8ba4c3d750a4ce2deea48152a33d4,215,274,158,215,274,158,215,172,117,...,117,15,240,117,240,117,240,117,172,1
34393,39b2d87c1adb582fbcacc3a56e274d48,286,110,172,240,117,240,117,240,117,...,71,275,260,240,117,141,65,260,141,1
20301,429236cdeb63d68bf48a3b48b0a34612,82,208,187,208,172,117,172,208,16,...,172,117,172,117,208,172,117,100,215,1
20025,46079cbf0bcfe8fab9894b4ec88bece3,112,274,158,215,274,158,215,298,76,...,71,297,135,171,215,35,208,56,71,1
19747,303ceda3f52afa9b69ed4f97fec2c895,82,240,117,240,117,240,117,240,117,...,82,260,141,260,141,260,141,260,141,1


## 3.3. Balancing train_split

*For this normal resampling will be selected as it is more 'balanced' than SMOTEN is.*

In [15]:
X = train_split.iloc[:,0:101]
y = train_split.iloc[:,101]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1) #For use in training and testing model
print("Labels:", y.unique())
print("Label Value Counts:\n" + str(y.value_counts()))

Labels: [1 0]
Label Value Counts:
1    29963
0      750
Name: malware, dtype: int64


In [16]:
ros = RandomOverSampler(random_state=1, sampling_strategy='minority') #random_state is arbitrary; setting a value makes its output repeatable
X_resampled, y_resampled = ros.fit_resample(X, y) #Resampling of minority classes (i.e., Benign/0 for Oliveira)

train_split = pd.concat([X_resampled,y_resampled.copy()], axis=1)
print("X_resampled", X_resampled.shape)
print("y_resampled", y_resampled.shape)
print("Oliveira (Resampling):", train_split.shape, "\n")
print("Class Count:\n" + str(train_split['malware'].value_counts()), "\n")
print("Top 10 Most Repeated Samples:\n" + str(train_split[['hash','malware']].value_counts()[0:20]),"\n")
print("Unique Values:", len(train_split['hash'].unique()))
train_split

X_resampled (59926, 101)
y_resampled (59926,)
Oliveira (Resampling): (59926, 102) 

Class Count:
1    29963
0    29963
Name: malware, dtype: int64 

Top 10 Most Repeated Samples:
hash                              malware
f78ea80cec007b2c32fb10f9c6c82f39  0          91
bdaaac3fa3f6796825a51ef1c0e5b3fd  0          80
03384ab6368b68ed16ecb9e6352539af  0          79
0822ec2ba98d291e5bfc836bc3686096  0          66
ec34862bd722c903bfc1f4788ee2ff72  0          63
22cc743a96c926817719872e07c351cd  0          61
34d0c34237b09bd1c650eedba9bc5fa0  0          60
c6ac1a2aeb8c89a599fbc30924e5f743  0          60
c0344331f17180acd42ef1a30a482f7f  0          59
c11887896f7972ac8d467235ab1b655e  0          58
5dc7d52aeb20e15bf064eadea48ba178  0          58
f2e6e397c0a2a605c791c63466c9d5ae  0          57
32d3f7a71fd8c1da168be1d79868546b  0          56
6a7186b9550640068161ed3bb970a508  0          56
b03995fa325d3e25ce582e052914cf51  0          56
a83d4048ea61b5bcf61dced8fe05394e  0          55
63ca0afe173

Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,malware
0,38d19f81159e526a6d410af78f74ccf9,82,240,117,240,117,240,117,240,117,...,240,89,117,31,117,215,260,192,89,1
1,a04c5d83384df33c48db5b21501aa336,215,274,158,215,274,158,215,172,117,...,60,81,60,81,172,117,25,172,117,1
2,1ddbb2bac5055ed0bfad44b24d4442ae,215,274,158,215,274,158,215,172,117,...,60,81,60,81,172,117,25,172,117,1
3,a6f3478cd06844ac9da84103b117a862,112,274,158,215,274,158,215,298,76,...,71,297,135,171,215,35,208,56,71,1
4,8fc79157568b69638820ec31e25072ed,240,117,240,117,240,117,240,117,240,...,60,81,60,81,60,81,60,81,60,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59921,32dee5d6b7e38027723972192ceede88,82,16,240,117,260,141,65,215,260,...,117,240,117,260,141,65,215,65,260,0
59922,64c2395ed30bb2e13af1bc2ad27809c0,82,16,172,240,117,235,172,117,235,...,297,93,10,199,93,10,264,215,208,0
59923,849a4f51bd705d66e89777461a387bec,82,274,158,215,86,82,37,70,37,...,215,82,240,117,82,240,117,297,8,0
59924,84088a0399f5d33d00295400d09a54f3,208,187,208,93,208,172,117,82,60,...,215,260,275,240,117,31,117,215,260,0


In [17]:
train_split.drop('hash', axis=1, inplace=True)
X = train_split.iloc[:,0:100]
y = train_split.iloc[:,100]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
train_split

Unnamed: 0,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,t_9,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,malware
0,82,240,117,240,117,240,117,240,117,93,...,240,89,117,31,117,215,260,192,89,1
1,215,274,158,215,274,158,215,172,117,172,...,60,81,60,81,172,117,25,172,117,1
2,215,274,158,215,274,158,215,172,117,172,...,60,81,60,81,172,117,25,172,117,1
3,112,274,158,215,274,158,215,298,76,208,...,71,297,135,171,215,35,208,56,71,1
4,240,117,240,117,240,117,240,117,240,117,...,60,81,60,81,60,81,60,81,60,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59921,82,16,240,117,260,141,65,215,260,141,...,117,240,117,260,141,65,215,65,260,0
59922,82,16,172,240,117,235,172,117,235,297,...,297,93,10,199,93,10,264,215,208,0
59923,82,274,158,215,86,82,37,70,37,240,...,215,82,240,117,82,240,117,297,8,0
59924,208,187,208,93,208,172,117,82,60,81,...,215,260,275,240,117,31,117,215,260,0


## 3.4. Training on Train_Split

*A simple HGBT with K-Cross CV for testing and tuning (RandomizedSearchCV) will be executed for this example.*

In [18]:
#All hyperparameters are defaults except for random_state
hgbt = HistGradientBoostingClassifier(loss='log_loss', learning_rate=0.1, max_iter=300, max_leaf_nodes=31, max_depth=None, 
            min_samples_leaf=20, l2_regularization=0.0, max_bins=255, categorical_features=None, 
            monotonic_cst=None, interaction_cst=None, warm_start=False, early_stopping='auto', scoring='loss', 
            validation_fraction=0.1, n_iter_no_change=10, tol=1e-07, verbose=0, random_state=1, class_weight=None)

def kfolds(X,y,model):
    kf = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    sublist = [["accuracy", "f1_score","precision","recall","roc-auc","time"]]
    axis = 0
    for i, (train_index, test_index) in enumerate(kf.split(X,y)):
        print("Fold: " + str(i), end="") 
        start = time.time()
        training_set = np.take(X, train_index, axis)
        training_set_labels = np.take(y, train_index, axis)
        test_set = np.take(X, test_index, axis)
        test_set_labels = np.take(y, test_index, axis)
        model.fit(training_set,training_set_labels)
        m_pred = model.predict(test_set)
        sublist.append([
                        round(accuracy_score(test_set_labels, m_pred),4),
                        round(f1_score(test_set_labels, m_pred, average='weighted'),4),
                        round(precision_score(test_set_labels, m_pred,zero_division=0),4),
                        round(recall_score(test_set_labels, m_pred),4),
                        round(roc_auc_score(test_set_labels, m_pred),4),
                        round(time.time()-start,4)
                        ])
        print(" -", time.time()-start,"seconds")
    print("")
    return sublist

def auto_tune(setup, model, X_train, y_train, X_test, worker=-1):
    auto_tuner = RandomizedSearchCV(model, setup, refit=True, cv=3, verbose=2, n_jobs=worker, error_score=0, random_state=1)
    auto_tuner.fit(X_train,y_train)
    auto_tuner.predict(X_test)
    return auto_tuner.best_params_
    
def model_test(X, y , X_train, X_test, y_train, y_test, model, model_label):
    start = time.time()
    print(model_label, end=" - ")
    model.fit(X_train, y_train)
    print(f"Completed! - {time.time()-start:.4f}s")
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred),"\n")
    #K-Folds prior to tuning.
    for s in kfolds(X,y,hgbt):
        print(s)
    return model

In [19]:
#Default HGBT
hgbt = HistGradientBoostingClassifier(loss='log_loss', learning_rate=0.1, max_iter=300, max_leaf_nodes=31, max_depth=None, 
                                      min_samples_leaf=20, l2_regularization=0.0, max_bins=255, categorical_features=None, 
                                      monotonic_cst=None, interaction_cst=None, warm_start=False, early_stopping='auto', scoring='loss', 
                                      validation_fraction=0.1, n_iter_no_change=10, tol=1e-07, verbose=0, random_state=1, class_weight=None)
hgbt = model_test(X, y , X_train, X_test, y_train, y_test, hgbt, "Default HGBT")

Default HGBT - Completed! - 20.6004s
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      8913
           1       1.00      1.00      1.00      9065

    accuracy                           1.00     17978
   macro avg       1.00      1.00      1.00     17978
weighted avg       1.00      1.00      1.00     17978
 

Fold: 0 - 18.671481609344482 seconds
Fold: 1 - 17.83095121383667 seconds
Fold: 2 - 19.325040817260742 seconds
Fold: 3 - 19.220861434936523 seconds
Fold: 4 - 20.121238708496094 seconds

['accuracy', 'f1_score', 'precision', 'recall', 'roc-auc', 'time']
[0.9964, 0.9964, 0.9973, 0.9955, 0.9964, 18.6715]
[0.9964, 0.9964, 0.9977, 0.9952, 0.9964, 17.831]
[0.9975, 0.9975, 0.9985, 0.9965, 0.9975, 19.325]
[0.9959, 0.9959, 0.997, 0.9948, 0.9959, 19.2209]
[0.9961, 0.9961, 0.9985, 0.9937, 0.9961, 20.1212]


In [20]:
#RandomizedSearchCV Tuning
#As it is only an example, the it will not really 'tune' per se as it will eat time.
#However, it still shows how to execute/implement tuning using RandomizedSearchCV.
params = [
    {
        'loss':['log_loss'],
        'learning_rate':[0.1], #realistically, learning_rate of 1.0 is not ideal
        'max_iter':[100,1000], #default at 100, seems a bit too law considering learning_rate and dataset size
        'max_leaf_nodes':[None], 
        'max_depth':[None], 
        'min_samples_leaf':[10,20,30], 
        'l2_regularization':[0.0,0.1], 
        'max_bins':[255], 
        'categorical_features':[None], 
        'monotonic_cst':[None], 
        'interaction_cst':[None], 
        'warm_start':[True], 
        'early_stopping':['auto'], 
        'scoring':['loss'], 
        'validation_fraction':[0.1], 
        'n_iter_no_change':[10], 
        'tol':[1e-06], 
        'verbose':[0], 
        'random_state':[1],
        'class_weight':[None,'balanced']
    }
]

start = time.time()
hgbt_t = auto_tune(params, HistGradientBoostingClassifier(), X_train, y_train, X_test, worker=1)
print("HGBT Auto Tune Best Params:\n", hgbt_t)
print("Elapsed time:", time.time()-start)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END categorical_features=None, class_weight=balanced, early_stopping=auto, interaction_cst=None, l2_regularization=0.0, learning_rate=0.1, loss=log_loss, max_bins=255, max_depth=None, max_iter=100, max_leaf_nodes=None, min_samples_leaf=20, monotonic_cst=None, n_iter_no_change=10, random_state=1, scoring=loss, tol=1e-06, validation_fraction=0.1, verbose=0, warm_start=True; total time= 1.2min
[CV] END categorical_features=None, class_weight=balanced, early_stopping=auto, interaction_cst=None, l2_regularization=0.0, learning_rate=0.1, loss=log_loss, max_bins=255, max_depth=None, max_iter=100, max_leaf_nodes=None, min_samples_leaf=20, monotonic_cst=None, n_iter_no_change=10, random_state=1, scoring=loss, tol=1e-06, validation_fraction=0.1, verbose=0, warm_start=True; total time= 1.1min
[CV] END categorical_features=None, class_weight=balanced, early_stopping=auto, interaction_cst=None, l2_regularization=0.0, learning_rate=0.

[CV] END categorical_features=None, class_weight=None, early_stopping=auto, interaction_cst=None, l2_regularization=0.0, learning_rate=0.1, loss=log_loss, max_bins=255, max_depth=None, max_iter=1000, max_leaf_nodes=None, min_samples_leaf=20, monotonic_cst=None, n_iter_no_change=10, random_state=1, scoring=loss, tol=1e-06, validation_fraction=0.1, verbose=0, warm_start=True; total time= 1.5min
[CV] END categorical_features=None, class_weight=None, early_stopping=auto, interaction_cst=None, l2_regularization=0.0, learning_rate=0.1, loss=log_loss, max_bins=255, max_depth=None, max_iter=1000, max_leaf_nodes=None, min_samples_leaf=20, monotonic_cst=None, n_iter_no_change=10, random_state=1, scoring=loss, tol=1e-06, validation_fraction=0.1, verbose=0, warm_start=True; total time= 1.5min
[CV] END categorical_features=None, class_weight=None, early_stopping=auto, interaction_cst=None, l2_regularization=0.0, learning_rate=0.1, loss=log_loss, max_bins=255, max_depth=None, max_iter=1000, max_leaf

In [21]:
#Tuned HGBT
hgbt_t = HistGradientBoostingClassifier(**hgbt_t)

hgbt_t = model_test(X, y , X_train, X_test, y_train, y_test, hgbt_t, "Tuned HGBT")

Tuned HGBT - Completed! - 84.9058s
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      8913
           1       1.00      1.00      1.00      9065

    accuracy                           1.00     17978
   macro avg       1.00      1.00      1.00     17978
weighted avg       1.00      1.00      1.00     17978
 

Fold: 0 - 19.253552436828613 seconds
Fold: 1 - 18.075571060180664 seconds
Fold: 2 - 19.26048445701599 seconds
Fold: 3 - 19.619152069091797 seconds
Fold: 4 - 19.44312024116516 seconds

['accuracy', 'f1_score', 'precision', 'recall', 'roc-auc', 'time']
[0.9964, 0.9964, 0.9973, 0.9955, 0.9964, 19.2536]
[0.9964, 0.9964, 0.9977, 0.9952, 0.9964, 18.0756]
[0.9975, 0.9975, 0.9985, 0.9965, 0.9975, 19.2605]
[0.9959, 0.9959, 0.997, 0.9948, 0.9959, 19.6192]
[0.9961, 0.9961, 0.9985, 0.9937, 0.9961, 19.4431]


## 3.5 Model Robustness Test

*Using the recently tuned and trained model as the model and the test_split as the input*.

In [22]:
print("test_split value_counts:")
print(test_split['malware'].value_counts())

test_split value_counts:
1    12834
0      329
Name: malware, dtype: int64


In [23]:
#These two are effectively X_test and y_test
X = test_split.iloc[:,1:101]
y = test_split.iloc[:,101]

y_pred = hgbt_t.predict(X)
print("HGBT Default Model Robustness Test")
print(classification_report(y, y_pred),"\n")

y_pred = hgbt_t.predict(X)
print("HGBT Tuned Model Robustness Test")
print(classification_report(y, y_pred),"\n")

HGBT Default Model Robustness Test
              precision    recall  f1-score   support

           0       0.79      0.72      0.75       329
           1       0.99      1.00      0.99     12834

    accuracy                           0.99     13163
   macro avg       0.89      0.86      0.87     13163
weighted avg       0.99      0.99      0.99     13163
 

HGBT Tuned Model Robustness Test
              precision    recall  f1-score   support

           0       0.79      0.72      0.75       329
           1       0.99      1.00      0.99     12834

    accuracy                           0.99     13163
   macro avg       0.89      0.86      0.87     13163
weighted avg       0.99      0.99      0.99     13163
 

