After reusing the existing code to prepare and fit additional supervised models to our training data, I was surprised by the results: every model I tried performed extremely well on the training data as well as the validation data. If the models were overfitting to the training data, their performance on the test data should have been relatively poor. However, they performed well on the test data too. Still, I was concerned because the results seemed too good to be true. I'm fairly confident that we were introducing ["data leakage"](https://towardsdatascience.com/data-leakage-in-machine-learning-how-it-can-be-detected-and-minimize-the-risk-8ef4e3a97562) with the original process.

What we were doing wrong was this: we were passing the full dataset into the normalizeData() function before splitting it into train and test sets. By doing this, we were "leaking" information from the testing set into the training set, because the calculations and aggregations behind the normalization (such as the mean and standard deviation) were run over the full set. As a result, the model gets some information about the distribution of the testing set during training. We were giving the model a peek at the real answers in an indirect way, which effectively lets it overfit to the testing data.
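
To make the effect concrete, here is a minimal sketch on made-up data (the `toy` frame, its single `feature` column, and the shifted test rows are all assumptions for illustration, not our real dataset). It compares statistics calculated over all rows with statistics calculated over the training rows only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: one feature, with the future "test" rows shifted upward.
toy = pd.DataFrame({"feature": np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=800),  # rows that become the training set
    rng.normal(loc=5.0, scale=1.0, size=200),  # rows that become the test set
])})
toy_train, toy_test = toy.iloc[:800], toy.iloc[800:]

# Leaky: statistics calculated over all rows, test rows included.
leaky_mean = toy["feature"].mean()
leaky_std = toy["feature"].std()

# Leak-free: statistics calculated over the training rows only.
clean_mean = toy_train["feature"].mean()
clean_std = toy_train["feature"].std()

leaky_train = (toy_train["feature"] - leaky_mean) / leaky_std
clean_train = (toy_train["feature"] - clean_mean) / clean_std

print(f"mean from all rows:      {leaky_mean:.3f}")
print(f"mean from training rows: {clean_mean:.3f}")
print(f"mean of leaky-normalized training feature: {leaky_train.mean():.3f}")
print(f"mean of clean-normalized training feature: {clean_train.mean():.3f}")
```

In the leaky version, the normalized training values move whenever the test rows change, which is exactly the indirect channel described above.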

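To fix the data leak, we can split the dataset into training and testing sets first, calculate the normalization parameters (means and standard deviations) from the training set only, and then apply those same parameters to both sets. The testing set remains unseen during training, so the test scores give a more honest estimate of how the model will perform on real-world data.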

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as m
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

In [None]:
input_path = "../data/features_encoded.csv"
raw_data = pd.read_csv(input_path, header=0, skiprows=None, index_col=None, delimiter=",")

labels = raw_data['malicious'].apply(lambda x: 1 if x else 0)
features = raw_data.drop('malicious', axis=1)

# Positional split: first 80,000 rows for training, the rest for testing (no shuffling)
train_features = features.iloc[:80000, :]
test_features = features.iloc[80000:, :]
train_labels = labels.iloc[:80000]
test_labels = labels.iloc[80000:]

In [None]:
def calculateNormalizationParams(data):
    means = data.mean()
    stdevs = data.std()
    stdevs[stdevs == 0] = 1  # Replace 0 std to avoid division by zero
    return means, stdevs

def applyNormalization(data, means, stdevs):
    return (data - means) / stdevs

In [None]:
means, stdevs = calculateNormalizationParams(train_features)
normalizedTrainFeatures = applyNormalization(train_features, means, stdevs)
normalizedTestFeatures = applyNormalization(test_features, means, stdevs)
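
For comparison, scikit-learn's `StandardScaler` follows the same fit-on-train, apply-to-both pattern. The sketch below runs on made-up data (the `toy_features` frame and its column names are assumptions), and it is not a drop-in replacement for the functions above: `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the scaled values differ slightly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the feature table.
toy_features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
toy_train, toy_test = toy_features.iloc[:80], toy_features.iloc[80:]

scaler = StandardScaler()
scaled_train = scaler.fit_transform(toy_train)  # learns mean/std from the training rows only
scaled_test = scaler.transform(toy_test)        # reuses the training statistics

# The same calculation by hand; ddof=0 matches StandardScaler's population std.
manual_test = (toy_test - toy_train.mean()) / toy_train.std(ddof=0)
print(np.allclose(scaled_test, manual_test.to_numpy()))
```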

In [None]:
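def acc(data, labels, n, d):
    """Fit a random forest and return (n, d, accuracy, elapsed time).

    Accuracy is measured on the same data the forest was fitted on,
    so it is a training accuracy, not a held-out score.
    """
    t0 = datetime.now()
    rf = RandomForestClassifier(n_estimators=n, max_depth=d, random_state=0).fit(data, labels)
    predictions = rf.predict(data)
    # Elapsed time for fit + predict, truncated to whole seconds
    tn = datetime.now() - t0
    tn = tn - timedelta(microseconds=tn.microseconds)
    return (n, d, m.accuracy_score(labels, predictions), tn)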

In [None]:
n_vector = [50, 100, 250, 500]
d_vector = [2, 3, 5, 10, 13, 20]
scores = [acc(normalizedTrainFeatures, train_labels, n, d) for n in n_vector for d in d_vector]
for score in scores:
    print(f"n = {score[0]}, d = {score[1]}, train accuracy = {score[2]}, t = {score[3]}")
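
The loop above scores each forest on the same rows it was fitted on, so deep forests will look close to perfect regardless of how well they generalize. One possible alternative, sketched here on synthetic data, is to score each setting on held-out folds with `cross_val_score`. The `make_classification` data, the small grid, and the fold count are assumptions chosen to keep the example fast; putting the scaler inside a pipeline means it is refit on the training part of each fold, which avoids the leak discussed earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the training split.
toy_X, toy_y = make_classification(n_samples=300, n_features=8, random_state=0)

cv_results = []
for n in [10, 25]:
    for d in [2, 5]:
        # The scaler is part of the pipeline, so each fold fits it on its own training rows.
        model = make_pipeline(
            StandardScaler(),
            RandomForestClassifier(n_estimators=n, max_depth=d, random_state=0),
        )
        fold_scores = cross_val_score(model, toy_X, toy_y, cv=5, scoring="accuracy")
        cv_results.append((n, d, fold_scores.mean()))

for n, d, mean_acc in cv_results:
    print(f"n = {n}, d = {d}, mean held-out accuracy = {mean_acc:.3f}")
```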

In [None]:
rf = RandomForestClassifier(max_depth=20, random_state=0)
rf.fit(normalizedTrainFeatures, train_labels)
predictions = rf.predict(normalizedTestFeatures)

In [None]:
# Named `accuracy` so the acc() helper defined earlier is not overwritten
accuracy = m.accuracy_score(test_labels, predictions)
prec = m.precision_score(test_labels, predictions)
recall = m.recall_score(test_labels, predictions)
print("Accuracy score:", accuracy)
print("Precision score:", prec)
print("Recall score:", recall)

In [None]:
m.ConfusionMatrixDisplay(m.confusion_matrix(test_labels, predictions)).plot()
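
As a sanity check on how the three scores relate to the confusion matrix, here is a small sketch with made-up labels (the `toy_true` and `toy_pred` arrays are assumptions, not our results). For binary labels, `confusion_matrix(...).ravel()` returns the counts in the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np
import sklearn.metrics as m

# Hypothetical labels and predictions, 1 = malicious.
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

toy_tn, toy_fp, toy_fn, toy_tp = m.confusion_matrix(toy_true, toy_pred).ravel()

toy_accuracy = (toy_tp + toy_tn) / (toy_tp + toy_tn + toy_fp + toy_fn)
toy_precision = toy_tp / (toy_tp + toy_fp)  # of the rows flagged malicious, how many really were
toy_recall = toy_tp / (toy_tp + toy_fn)     # of the truly malicious rows, how many were flagged

print("accuracy: ", toy_accuracy, m.accuracy_score(toy_true, toy_pred))
print("precision:", toy_precision, m.precision_score(toy_true, toy_pred))
print("recall:   ", toy_recall, m.recall_score(toy_true, toy_pred))
```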