## Anomaly Detection with XGBoost

This cell loads the dataset and selects the relevant columns. It defines the lists of categorical and numerical columns, one-hot encodes the categorical features, splits the data into stratified training and test sets, and scales the numerical features with a MinMaxScaler fit on the training set.

The chosen features were the 10 most important ones according to a SHAP explanation of an IsolationForest model.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split,StratifiedShuffleSplit, GridSearchCV
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
import xgboost as xgb
from xgboost.sklearn import XGBClassifier


x = pd.read_csv("/content/drive/MyDrive/data/LMD-2023 [1.75M Elements][Labelled]checked.csv",low_memory=False)

cols = ['DestinationPortName','EventID','ThreadID','Computer','EventRecordID','Initiated',
        'ProcessId','SourcePort', 'DestinationPort','Execution_ProcessID','Label']

x = x[cols]

y = x['Label'].copy()
y = np.where(y > 0, 1, 0) # 1 means outlier

x = x.drop(columns=['Label'])


categorical_cols = ['Computer', 'DestinationPortName', 'EventID', 'Initiated']

numerical_cols = ['EventRecordID','Execution_ProcessID', 'ProcessId','ThreadID','SourcePort','DestinationPort']

x_orig = pd.concat([x[categorical_cols], x[numerical_cols]], axis=1)

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

encoder.fit(x[categorical_cols])

x_categorial = pd.DataFrame(encoder.transform(x[categorical_cols]),
                                columns=encoder.get_feature_names_out(categorical_cols),
                                index=x.index)


x = pd.concat([x_categorial, x[numerical_cols]], axis=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1,stratify=y)


x_train_categorial = x_train[x_categorial.columns]
x_test_categorial = x_test[x_categorial.columns]

scaler = MinMaxScaler()
scaler.fit(x_train[numerical_cols])

train_num_scaled = pd.DataFrame(scaler.transform(x_train[numerical_cols]), columns=numerical_cols, index=x_train.index)
test_num_scaled = pd.DataFrame(scaler.transform(x_test[numerical_cols]), columns=numerical_cols, index=x_test.index)


x_train = pd.concat([x_train_categorial, train_num_scaled], axis=1)
x_test = pd.concat([x_test_categorial, test_num_scaled], axis=1)
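The cell above fits the OneHotEncoder on the full dataset before splitting. One possible alternative, sketched here on a small made-up frame, is to wrap the encoder and scaler in a scikit-learn `ColumnTransformer` inside a `Pipeline`, so that both are fit only on the training rows. The toy column values, the `toy_*` names, and the `LogisticRegression` placeholder (standing in for the XGBoost classifier) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for the real data (hypothetical values, similar column kinds)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Computer": rng.choice(["host-a", "host-b", "host-c"], size=n),
    "EventID": rng.choice([1, 3, 5], size=n),
    "SourcePort": rng.integers(1024, 65535, size=n),
    "ThreadID": rng.integers(100, 5000, size=n),
})
toy_y = (toy["EventID"] == 3).astype(int).to_numpy()  # made-up label rule

toy_categorical = ["Computer", "EventID"]
toy_numerical = ["SourcePort", "ThreadID"]

# Encoder and scaler are fit only on whatever data the pipeline is fit on
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), toy_categorical),
    ("num", MinMaxScaler(), toy_numerical),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder classifier
])

tr_x, te_x, tr_y, te_y = train_test_split(
    toy, toy_y, test_size=0.25, random_state=1, stratify=toy_y)

pipe.fit(tr_x, tr_y)
toy_accuracy = pipe.score(te_x, te_y)
print(f"toy accuracy: {toy_accuracy:.3f}")
```

A pipeline like this can also be passed directly to GridSearchCV, in which case the preprocessing is refit inside every cross-validation fold.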





## Tuning Parameters:
To decide on the boosting parameters, we first need to set initial values for the other parameters. Let’s take the following values:

max_depth = 3: This should be between 3 and 10; we start with 3.

min_child_weight = 1: A small value is chosen because this is a highly imbalanced class problem, so leaf nodes can contain small groups.

gamma = 0.2: A small value such as 0.1-0.2 is a reasonable starting point. This will be tuned later anyway.

subsample, colsample_bytree = 0.5: This is a commonly used starting value. Typical values range between 0.5 and 0.9.

scale_pos_weight = 1: Because of high class imbalance.

Please note that all the above are just initial estimates and will be tuned later.
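As a point of comparison for the `scale_pos_weight = 1` starting value, the XGBoost documentation mentions the ratio of negative to positive samples as a typical value to consider for imbalanced data. The sketch below computes that ratio; the helper name `suggested_scale_pos_weight` and the toy label array are hypothetical and not part of this notebook.

```python
import numpy as np

def suggested_scale_pos_weight(labels):
    """Return n_negative / n_positive, a common candidate for scale_pos_weight."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples in labels")
    return n_neg / n_pos

# Hypothetical labels: 95 normal events and 5 outliers
toy_labels = np.array([0] * 95 + [1] * 5)
print(suggested_scale_pos_weight(toy_labels))  # 19.0
```

Applied to the notebook's `y_train`, the same function would give one more candidate value to compare against 1 in a grid search.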

## Tuning the n_estimators parameter:
Let’s use a learning rate of 0.001 here and find the optimal number of trees using cross-validation.


In [None]:
# Define the hyperparameter grid
param_grid = {
    'n_estimators': [i for i in range(100,1000,100)],
}

# Initialize the XGBoost classifier
xgb1 = XGBClassifier(
    n_estimators = 1000,
    learning_rate=0.001,
    max_depth=3,
    subsample=0.5,
    gamma = 0.2,
    colsample_bytree=0.5,
    min_child_weight=1,
    objective= 'binary:logistic',
    scale_pos_weight=1,
    seed=27)


# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=xgb1,
                           param_grid=param_grid,
                           cv=5,  # Use 5-fold cross-validation
                           n_jobs=-1, # Use all available CPU cores
                           scoring = 'f1')

# Fit the grid search object to your training data
grid_search.fit(x_train, y_train)

# Best parameters found
print("Best Parameters:", grid_search.best_params_)

# Best score (mean cross-validated score)
print("Best Score:", grid_search.best_score_)

Best Parameters: {'n_estimators': 900}
Best Score: 0.8707113400355556
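The selected value, 900, is the largest one in the grid `range(100,1000,100)`, so it can help to look at the scores of all candidates rather than only the best one. The sketch below shows one way to read them from `cv_results_`. It runs on synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost classifier; the data, the small grid, and the `X_toy`, `y_toy`, `search`, and `summary` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (hypothetical, not the notebook's dataset)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2],
    class_sep=1.5, random_state=0)

search = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, random_state=27),
    param_grid={"n_estimators": [10, 20, 40]},
    cv=3,
    scoring="f1",
)
search.fit(X_toy, y_toy)

# One row per candidate: mean and spread of the F1 score across folds
columns = ["param_n_estimators", "mean_test_score",
           "std_test_score", "rank_test_score"]
summary = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

If the mean score is still rising at the upper end of the grid, that suggests trying larger values or a higher learning rate.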


## Tune max_depth and min_child_weight
We tune these next, as they have the largest impact on the model outcome.



In [None]:
# Define the hyperparameter grid
param_grid = {
    'max_depth':range(3,10,2),
    'min_child_weight':range(1,6,2)
}

# Initialize the XGBoost classifier
xgb1 = XGBClassifier(
    n_estimators = 900,
    learning_rate=0.001,
    max_depth=3,
    subsample=0.5,
    gamma = 0.2,
    colsample_bytree=0.5,
    min_child_weight=1,
    objective= 'binary:logistic',
    scale_pos_weight=1,
    seed=27)


# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=xgb1,
                           param_grid=param_grid,
                           cv=5,  # Use 5-fold cross-validation
                           n_jobs=-1, # Use all available CPU cores
                           scoring = 'f1')

# Fit the grid search object to your training data
grid_search.fit(x_train, y_train)

# Best parameters found
print("Best Parameters:", grid_search.best_params_)

# Best score (mean cross-validated score)
print("Best Score:", grid_search.best_score_)

Best Parameters: {'max_depth': 9, 'min_child_weight': 1}
Best Score: 0.9546134806529851
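To make the cost of this search explicit, the grid can be enumerated with scikit-learn's `ParameterGrid`. This short sketch assumes the same two ranges and the 5-fold setting used in the cell above; the `grid`, `candidates`, `n_candidates`, and `n_fits` names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    "max_depth": list(range(3, 10, 2)),        # 3, 5, 7, 9
    "min_child_weight": list(range(1, 6, 2)),  # 1, 3, 5
}

candidates = list(ParameterGrid(grid))
n_candidates = len(candidates)
n_fits = n_candidates * 5  # 5-fold cross-validation

print(n_candidates, "candidates,", n_fits, "model fits")
print("largest max_depth tried:", max(c["max_depth"] for c in candidates))
```

The selected `max_depth` of 9 is the largest value in this grid, so a follow-up search over deeper trees may be worth considering.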


## Tune gamma


Now let’s tune the gamma value using the parameters already tuned above.


In [None]:
# Define the hyperparameter grid
param_grid = {
    'gamma':[i / 10.0 for i in range(5)],
}

# Initialize the XGBoost classifier
xgb1 = XGBClassifier(
    n_estimators = 900,
    learning_rate=0.001,
    max_depth=9,
    subsample=0.5,
    gamma = 0.2,
    colsample_bytree=0.5,
    min_child_weight=1,
    objective= 'binary:logistic',
    scale_pos_weight=1,
    seed=27)


# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=xgb1,
                           param_grid=param_grid,
                           cv=5,  # Use 5-fold cross-validation
                           n_jobs=-1, # Use all available CPU cores
                           scoring = 'f1')

# Fit the grid search object to your training data
grid_search.fit(x_train, y_train)

# Best parameters found
print("Best Parameters:", grid_search.best_params_)

# Best score (mean cross-validated score)
print("Best Score:", grid_search.best_score_)

Best Parameters: {'gamma': 0.1}
Best Score: 0.9546804073640821


## Tune subsample and colsample_bytree
The next step is to try different subsample and colsample_bytree values.

In [None]:
# Define the hyperparameter grid
param_grid = {
 'subsample':[i/10.0 for i in range(6,10)],
 'colsample_bytree':[i/10.0 for i in range(6,10)]
}

# Initialize the XGBoost classifier
xgb1 = XGBClassifier(
    n_estimators = 900,
    learning_rate=0.001,
    max_depth=9,
    subsample=0.5,
    gamma = 0.1,
    colsample_bytree=0.5,
    min_child_weight=1,
    objective= 'binary:logistic',
    scale_pos_weight=1,
    seed=27)


# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=xgb1,
                           param_grid=param_grid,
                           cv=5,  # Use 5-fold cross-validation
                           n_jobs=-1, # Use all available CPU cores
                           scoring = 'f1')

# Fit the grid search object to your training data
grid_search.fit(x_train, y_train)

# Best parameters found
print("Best Parameters:", grid_search.best_params_)

# Best score (mean cross-validated score)
print("Best Score:", grid_search.best_score_)

Best Parameters: {'colsample_bytree': 0.9, 'subsample': 0.9}
Best Score: 0.9652468159846409


## Tune threshold

The next step is to select the decision threshold that maximizes the F1 score, using precision_recall_curve from scikit-learn.

In [None]:
from sklearn.metrics import precision_recall_curve, f1_score


xgb1 = XGBClassifier(
    n_estimators = 900,
    learning_rate=0.001,
    max_depth=9,
    subsample=0.9,
    gamma = 0.1,
    colsample_bytree=0.9,
    min_child_weight=1,
    objective= 'binary:logistic',
    scale_pos_weight=1,
    seed=27)


xgb1.fit(x_train, y_train)
probs = xgb1.predict_proba(x_test)[:, 1]   # probabilities for positive class

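precision, recall, thresholds = precision_recall_curve(y_test, probs)
# precision and recall have one more entry than thresholds; drop the final (1, 0) point
precision, recall = precision[:-1], recall[:-1]
# Guard against 0/0 when both precision and recall are zero
f1_scores = 2 * (precision * recall) / np.maximum(precision + recall, 1e-12)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]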

print("Best Threshold:", best_threshold)
print("Best F1 Score:", f1_scores[best_idx])

Best Threshold: 0.28792283
Best F1 Score: 0.970988082841606
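The threshold above is selected on the same test set that is later used for evaluation. A common alternative, sketched here under assumptions, is to pick the threshold on a separate validation split and report the F1 score on the untouched test split. The synthetic data, the `LogisticRegression` stand-in for the XGBoost classifier, and all variable names in this block are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical, not the notebook's dataset)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Split into train / validation / test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Choose the threshold on the validation split only
val_probs = clf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_probs)

# precision/recall have one more entry than thresholds; drop the last point
p, r = precision[:-1], recall[:-1]
denom = p + r
f1_scores = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
best_threshold = thresholds[np.argmax(f1_scores)]

# Evaluate once on the held-out test split
test_probs = clf.predict_proba(X_test)[:, 1]
test_pred = (test_probs >= best_threshold).astype(int)
test_f1 = f1_score(y_test, test_pred)

print("Validation-selected threshold:", best_threshold)
print("Test F1 at that threshold:", test_f1)
```

The comparison uses `>=` because `precision_recall_curve` counts a sample as positive when its score is greater than or equal to the threshold.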


## Full Results:

Here are the full results for the ML approach. The process was repeated 10 times using StratifiedShuffleSplit.

## Final Results:

Cross-Validation Results (10-fold avg):

Accuracy: 0.9954

Precision: 0.9922

Recall: 0.9509

F1 Score: 0.9711

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split,StratifiedShuffleSplit, GridSearchCV
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
import xgboost as xgb
from xgboost.sklearn import XGBClassifier


x = pd.read_csv("/content/drive/MyDrive/data/LMD-2023 [1.75M Elements][Labelled]checked.csv",low_memory=False)

cols = ['DestinationPortName','EventID','ThreadID','Computer','EventRecordID','Initiated',
        'ProcessId','SourcePort', 'DestinationPort','Execution_ProcessID','Label']

x = x[cols]

y = x['Label'].copy()
y = np.where(y > 0, 1, 0) # 1 means outlier

x = x.drop(columns=['Label'])


categorical_cols = ['Computer', 'DestinationPortName', 'EventID', 'Initiated']

numerical_cols = ['EventRecordID','Execution_ProcessID', 'ProcessId','ThreadID','SourcePort','DestinationPort']

skf = StratifiedShuffleSplit(n_splits=10, test_size=0.25, random_state=1)


all_fold_metrics = []
for i, (train_index, test_index) in enumerate(skf.split(x, y)):
    x_train , x_test = x.iloc[train_index].copy(), x.iloc[test_index].copy()
    y_train, y_test = y[train_index].copy(), y[test_index].copy()

    # Keep the raw (un-encoded) frames of the last split for the FP/FN analysis
    if i + 1 == 10:
        x_train_orig = x_train.copy()
        x_test_orig = x_test.copy()

    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    encoder.fit(x_train[categorical_cols])

    x_train_categorial = pd.DataFrame(encoder.transform(x_train[categorical_cols]),
                                    columns=encoder.get_feature_names_out(categorical_cols),
                                    index=x_train.index)
    x_test_categorial = pd.DataFrame(encoder.transform(x_test[categorical_cols]),
                                    columns=encoder.get_feature_names_out(categorical_cols),
                                    index=x_test.index)


    # Scale numerical features with a MinMaxScaler fit on the training split, then apply it to the test split
    scaler = MinMaxScaler()
    scaler.fit(x_train[numerical_cols])

    train_num_scaled = pd.DataFrame(scaler.transform(x_train[numerical_cols]), columns=numerical_cols, index=x_train.index)
    test_num_scaled = pd.DataFrame(scaler.transform(x_test[numerical_cols]), columns=numerical_cols, index=x_test.index)


    # Combine encoded categorical and scaled numerical features
    x_train = pd.concat([x_train_categorial, train_num_scaled], axis=1)
    x_test = pd.concat([x_test_categorial, test_num_scaled], axis=1)

    xgb1 = XGBClassifier(
        n_estimators = 900,
        learning_rate=0.001,
        max_depth=9,
        subsample=0.9,
        gamma = 0.1,
        colsample_bytree=0.9,
        min_child_weight=1,
        objective= 'binary:logistic',
        scale_pos_weight=1,
        seed=27)


    model = xgb1.fit(x_train, y_train)
    probs = xgb1.predict_proba(x_test)[:, 1]   # probabilities for positive class

    y_pred = (probs > 0.29).astype(int)
    acc = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)


    print(f"fold{i+1} results")
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}\n")
    all_fold_metrics.append((acc, precision, recall, f1))

results = np.array(all_fold_metrics)
print("\n Cross-Validation Results (10-fold avg):")
print(f"Accuracy: {results[:,0].mean():.4f}")
print(f"Precision: {results[:,1].mean():.4f}")
print(f"Recall: {results[:,2].mean():.4f}")
print(f"F1 Score: {results[:,3].mean():.4f}")



fold1 results
Accuracy: 0.9955
Precision: 0.9914
Recall: 0.9518
F1 Score: 0.9712

fold2 results
Accuracy: 0.9956
Precision: 0.9920
Recall: 0.9534
F1 Score: 0.9724

fold3 results
Accuracy: 0.9955
Precision: 0.9918
Recall: 0.9526
F1 Score: 0.9718

fold4 results
Accuracy: 0.9954
Precision: 0.9920
Recall: 0.9510
F1 Score: 0.9711

fold5 results
Accuracy: 0.9955
Precision: 0.9926
Recall: 0.9510
F1 Score: 0.9713

fold6 results
Accuracy: 0.9955
Precision: 0.9931
Recall: 0.9503
F1 Score: 0.9712

fold7 results
Accuracy: 0.9956
Precision: 0.9919
Recall: 0.9526
F1 Score: 0.9719

fold8 results
Accuracy: 0.9953
Precision: 0.9935
Recall: 0.9483
F1 Score: 0.9704

fold9 results
Accuracy: 0.9955
Precision: 0.9928
Recall: 0.9510
F1 Score: 0.9714

fold10 results
Accuracy: 0.9954
Precision: 0.9919
Recall: 0.9513
F1 Score: 0.9712


 Cross-Validation Results (10-fold avg):
Accuracy: 0.9955
Precision: 0.9923
Recall: 0.9513
F1 Score: 0.9714
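Alongside the mean, the spread across the 10 splits indicates how stable the result is. The sketch below recomputes the mean F1 from the per-split values printed above and adds the sample standard deviation; the `f1_per_split`, `f1_mean`, and `f1_std` names are illustrative.

```python
import numpy as np

# F1 score of each split, copied from the printed per-split results
f1_per_split = np.array([0.9712, 0.9724, 0.9718, 0.9711, 0.9713,
                         0.9712, 0.9719, 0.9704, 0.9714, 0.9712])

f1_mean = f1_per_split.mean()
f1_std = f1_per_split.std(ddof=1)  # sample standard deviation

print(f"F1: {f1_mean:.4f} +/- {f1_std:.4f}")
```

The same two lines can be applied to each column of the `results` array in the loop cell to report the spread for accuracy, precision, and recall as well.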


Here we analyze the false positives (FP) and false negatives (FN) of the last split:

In [None]:
FN = np.where((y_test == 1) & (y_pred == 0))[0]
FP = np.where((y_test == 0) & (y_pred == 1))[0]

fn_df = x_test_orig.iloc[FN].copy()
fn_counts = fn_df["EventID"].value_counts(normalize= True)
print(fn_counts)



EventID
1     0.589535
5     0.178488
22    0.105233
13    0.076163
11    0.030814
3     0.014535
12    0.002326
8     0.001163
4     0.000581
6     0.000581
15    0.000581
Name: proportion, dtype: float64
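The proportions above show how the false negatives are distributed over EventID values, which also depends on how common each EventID is among the true outliers. A complementary view is the miss rate within each EventID. The sketch below computes both on a small made-up frame; in the notebook the inputs would come from `x_test_orig["EventID"]`, `y_test`, and `y_pred` of the last split. The toy values and the `toy`, `fn_share`, and `miss_rate` names are assumptions.

```python
import pandas as pd

# Hypothetical toy predictions, not taken from the notebook's data
toy = pd.DataFrame({
    "EventID": [1, 1, 1, 1, 1, 1, 3, 3, 5, 5],
    "y_true":  [1, 1, 1, 1, 0, 0, 1, 0, 1, 1],
    "y_pred":  [0, 0, 1, 1, 0, 1, 1, 0, 0, 1],
})

# Rows that are true outliers, and which of them the model missed
positives = toy[toy["y_true"] == 1]
missed = positives["y_pred"] == 0

# Share of all false negatives per EventID (the view printed above)
fn_share = positives.loc[missed, "EventID"].value_counts(normalize=True)

# Miss rate per EventID: false negatives / true outliers with that EventID
miss_rate = missed.groupby(positives["EventID"]).mean()

print(fn_share)
print(miss_rate)
```

In this toy example EventID 1 accounts for two thirds of the false negatives, yet its miss rate (0.5) is the same as for EventID 5, which illustrates why the two views can lead to different conclusions.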


In [None]:
fn_df_1 = fn_df[fn_df['EventID'] == 1]
display(fn_df_1["Computer"].value_counts(normalize= True))

Computer                            proportion
WIN-J23NIGGP1Q6.sysmon_set.local    0.581854
WINDOWS10EVAL.stefania.local        0.41716
LAPTOP-ROPR18AK                     0.000986


In [None]:

fp_df = x_test_orig.iloc[FP].copy()
fp_counts = fp_df["EventID"].value_counts(normalize= True)
print(fp_counts)

EventID
1     0.509091
3     0.461818
5     0.021818
13    0.003636
22    0.003636
Name: proportion, dtype: float64


In [None]:
fp_df_1 = fp_df[fp_df['EventID'] == 1]
display(fp_df_1["Computer"].value_counts(normalize= True))


Computer                            proportion
WIN-J23NIGGP1Q6.sysmon_set.local    0.907143
WINDOWS10EVAL.stefania.local        0.092857


In [None]:
print('FP ThreadID proportion\n')

display(fp_df_1["ThreadID"].value_counts(normalize= True))

print('FN ThreadID proportion\n')

display(fn_df_1["ThreadID"].value_counts(normalize= True))

FP ThreadID proportion

ThreadID    proportion
1292        0.392857
3408        0.278571
1912        0.207143
3016        0.042857
3724        0.028571
3324        0.021429
3184        0.014286
3340        0.007143
836         0.007143


FN ThreadID proportion

ThreadID    proportion
1912        0.133136
1508        0.119329
1292        0.108481
5084        0.088757
3408        0.055227
2992        0.044379
3184        0.036489
2560        0.032544
3280        0.030572
2912        0.025641
