# NetML Challenge - Malware Detection of IoT

#### Author: Robert Xing | CNetID: rkxing

First, we load the training data so that we can train our classifiers on the top-level labels (the easy classes).

The dataset, annotation set (labels), and helper script are provided by the NetML Challenge: https://github.com/ACANETS/NetML-Competition2020/tree/master

In [1]:
from utils.helper import get_training_data

DATA_PATH = './data/'

training_set_folder = DATA_PATH + 'training_set'
training_anno_file_top = DATA_PATH + 'training_anno/' + '2_training_anno_top.json.gz'

# Get training data in np.array format
Xtrain, ytrain, class_label_pair, Xtrain_ids = get_training_data(training_set_folder, training_anno_file_top)

len(Xtrain), len(ytrain), len(class_label_pair), len(Xtrain_ids)


Loading training set ...
Reading 2_training_set.json.gz


(387268, 387268, 2, 387268)

Now it is time to split this training set further so that we can benchmark our classifiers.

We also want to do a sanity check on our split to make sure the dimensions make sense.
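
Beyond checking lengths, it can be worth confirming that `stratify` really preserved the class proportions. The following is a minimal sketch on toy labels (the 90/10 imbalance, the array names, and the `class_fractions` helper are assumptions for illustration, not the NetML data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption: 90% class 0, 10% class 1), not the NetML data
rng = np.random.default_rng(0)
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = rng.normal(size=(len(y_toy), 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)


def class_fractions(labels):
    """Hypothetical helper: return a {class: fraction} mapping for a label array."""
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))


full_frac = class_fractions(y_toy)
train_frac = class_fractions(y_tr)
test_frac = class_fractions(y_te)

# With stratification, all three mappings should be (nearly) identical
print(full_frac, train_frac, test_frac)
```

The same helper could be applied to `ytrain`, `y_train`, and `y_test` in this notebook.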

In [2]:
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42
TEST_SIZE = 0.2

X_train, X_test, y_train, y_test = train_test_split(Xtrain, ytrain,
                                                test_size=TEST_SIZE,
                                                random_state=RANDOM_STATE,
                                                stratify=ytrain)

len(X_train), len(X_test), len(y_train), len(y_test)

(309814, 77454, 309814, 77454)

Following the feedback I received on my proposal, we preprocess the data the same way the original challenge does, using `StandardScaler`.
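
As a quick illustration of what this scaling step does, here is a small sketch on synthetic data (the arrays and their sizes are assumptions for illustration). It shows that `StandardScaler` subtracts the training mean and divides by the training standard deviation, and that the test set is transformed with those same training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the train and test features (not the NetML data)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_te = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean/std from the training data
X_te_scaled = scaler.transform(X_te)      # reuses the training mean/std

# Manual z-score using the training statistics only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses
manual_te = (X_te - mu) / sigma

print(np.allclose(X_te_scaled, manual_te))
```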

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Now we are ready to test our first classifier. I've chosen to start with a Random Forest classifier, as it has performed well on similar classification problems in the past.

We use `RandomizedSearchCV` for hyperparameter tuning. We would prefer the exhaustive `GridSearchCV`, but given the scale of the dataset we've opted for a randomized search for computational efficiency.
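
To make the cost difference concrete, the sketch below counts the model fits each strategy would need for the parameter space used in the next cell. It assumes the scikit-learn default of `n_iter=10` for `RandomizedSearchCV` and ignores the final refit on the full training data; `CV_FOLDS` and `N_ITER` are names introduced here for illustration:

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the Random Forest cell below
param_space = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 4, 8],
    'max_features': ['sqrt', 'log2'],
}

CV_FOLDS = 3
N_ITER = 10  # RandomizedSearchCV default number of sampled candidates

n_combinations = len(ParameterGrid(param_space))       # every combination in the grid
grid_fits = n_combinations * CV_FOLDS                   # exhaustive search
random_fits = min(N_ITER, n_combinations) * CV_FOLDS    # randomized search

print(f'{n_combinations} combinations: {grid_fits} fits (grid) vs {random_fits} fits (randomized)')
```

On a dataset with roughly 310,000 training rows, cutting the number of forest fits by more than two thirds is a meaningful saving.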

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestClassifier()

param_space = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 4, 8],
    'max_features': ['sqrt', 'log2'],
}

# n_jobs = -1 to use all available CPU cores
rf_clf = RandomizedSearchCV(rf, param_space, n_jobs=-1, cv=3)
rf_clf.fit(X_train_scaled, y_train)

print("Random Forest Best Parameters:", rf_clf.best_params_)
print("Random Forest Best Estimator:", rf_clf.best_estimator_)

Random Forest Best Parameters: {'n_estimators': 100, 'min_samples_split': 4, 'max_features': 'sqrt', 'max_depth': 20}
Random Forest Best Estimator: RandomForestClassifier(max_depth=20, min_samples_split=4)


Now we compute the metric we use for benchmarking against nPrintML: the balanced accuracy score.
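
For readers unfamiliar with the metric, balanced accuracy is the average of the per-class recalls, so it does not reward a model for simply predicting the majority class. A toy sketch (the labels below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy imbalanced labels: 8 samples of class 0, 2 samples of class 1
y_true = np.array([0] * 8 + [1] * 2)
# A degenerate classifier that always predicts the majority class
y_pred = np.array([0] * 10)

acc = accuracy_score(y_true, y_pred)                 # 8 of 10 correct
bal_acc = balanced_accuracy_score(y_true, y_pred)    # mean of recalls: (1.0 + 0.0) / 2
macro_recall = recall_score(y_true, y_pred, average='macro')  # same quantity

print(f'accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}, macro recall={macro_recall:.2f}')
```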

In [5]:
from sklearn.metrics import balanced_accuracy_score

y_pred = rf_clf.predict(X_test_scaled)
b_acc = balanced_accuracy_score(y_test, y_pred)

print(f'Random Forest Balanced Accuracy: {b_acc:.4f}')

Random Forest Balanced Accuracy: 0.9966


This is already a strong balanced accuracy score, but for the sake of experimentation we'll try some other classifiers as well.

First, however, we make another train-test split with a new random seed to avoid potential information leakage between experiments.

In [6]:
RANDOM_STATE += 1

X_train, X_test, y_train, y_test = train_test_split(Xtrain, ytrain,
                                                test_size=TEST_SIZE,
                                                random_state=RANDOM_STATE,
                                                stratify=ytrain)

len(X_train), len(X_test), len(y_train), len(y_test)

(309814, 77454, 309814, 77454)

In [7]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

For the next model, I've elected to use a Gaussian Naive Bayes classifier, since it should be fast and efficient.

Again, we tune to find the best hyperparameters, but here we use `GridSearchCV`, since Naive Bayes is fast to fit anyway.
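
One possible refinement, not what this notebook does: because the features are scaled before the search, each cross-validation fold is scaled with statistics that include its own validation rows. Wrapping the scaler and the model in a `Pipeline` lets the search refit the scaler inside every fold. The sketch below uses a synthetic dataset, and the step names (`scaler`, `gnb`) and the choice of `scoring='balanced_accuracy'` are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset standing in for the NetML features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is now part of the estimator, so it is refit on each training fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('gnb', GaussianNB()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {'gnb__var_smoothing': [1e-9, 1e-6, 1e-3]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring='balanced_accuracy')
search.fit(X_toy, y_toy)  # raw (unscaled) features go in

print(search.best_params_, round(search.best_score_, 4))
```

Selecting on `balanced_accuracy` during the search would also align model selection with the metric reported afterwards; by default the search scores classifiers with plain accuracy.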

In [8]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

gnb = GaussianNB()

param_space = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1],
}

# n_jobs = -1 to use all available CPU cores
gnb_clf = GridSearchCV(gnb, param_space, n_jobs=-1, cv=3)
gnb_clf.fit(X_train_scaled, y_train)

print("GaussianNB Best Parameters:", gnb_clf.best_params_)
print("GaussianNB Best Estimator:", gnb_clf.best_estimator_)

GaussianNB Best Parameters: {'var_smoothing': 1e-09}
GaussianNB Best Estimator: GaussianNB()


In [9]:
y_pred = gnb_clf.predict(X_test_scaled)
b_acc = balanced_accuracy_score(y_test, y_pred)

print(f'GaussianNB Balanced Accuracy: {b_acc:.4f}')

GaussianNB Balanced Accuracy: 0.6640


This performance is clearly much worse than our Random Forest, which we expected from such a simple model, but we'll try one more.

Finally, for our last model experiment we'll try a Stochastic Gradient Descent Classifier.

In [10]:
# again doing another split for valid comparison

RANDOM_STATE += 1

X_train, X_test, y_train, y_test = train_test_split(Xtrain, ytrain,
                                                test_size=TEST_SIZE,
                                                random_state=RANDOM_STATE,
                                                stratify=ytrain)

len(X_train), len(X_test), len(y_train), len(y_test)

(309814, 77454, 309814, 77454)

In [11]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [12]:
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()

param_space = {
    'alpha': [1e-6, 1e-5, 0.0001, 0.001, 0.01, 0.1],
    'loss': ['hinge', 'log_loss'],
    'max_iter': [1000, 1500, 2000],
}

sgd_clf = GridSearchCV(sgd, param_space, n_jobs=-1, cv=3)
sgd_clf.fit(X_train_scaled, y_train)

print("SGD Best Parameters:", sgd_clf.best_params_)
print("SGD Best Estimator:", sgd_clf.best_estimator_)

SGD Best Parameters: {'alpha': 1e-05, 'loss': 'hinge', 'max_iter': 1500}
SGD Best Estimator: SGDClassifier(alpha=1e-05, max_iter=1500)


In [13]:
y_pred = sgd_clf.predict(X_test_scaled)
b_acc = balanced_accuracy_score(y_test, y_pred)

print(f'SGD Balanced Accuracy: {b_acc:.4f}')

SGD Balanced Accuracy: 0.9800


From these trials, it still seems that our Random Forest Classifier with parameters `{'n_estimators': 100, 'min_samples_split': 4, 'max_features': 'sqrt', 'max_depth': 20}` is our highest-performing model.

Now that we've found our best classifier on the top-level classes, let's try training it on the fine-level classes as well (hard set).

In [14]:
training_set_folder = DATA_PATH + 'training_set'
training_anno_file_fine = DATA_PATH + 'training_anno/' + '2_training_anno_fine.json.gz'

Xtrain, ytrain, class_label_pair, Xtrain_ids = get_training_data(training_set_folder, training_anno_file_fine)

RANDOM_STATE += 1

X_train, X_test, y_train, y_test = train_test_split(Xtrain, ytrain,
                                                test_size=TEST_SIZE,
                                                random_state=RANDOM_STATE,
                                                stratify=ytrain)

len(X_train), len(X_test), len(y_train), len(y_test)


Loading training set ...
Reading 2_training_set.json.gz


(309814, 77454, 309814, 77454)

In [15]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [16]:
# reuse best model and parameters
rf_clf_fine = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_split=4, max_features='sqrt')

rf_clf_fine.fit(X_train_scaled, y_train)

y_pred = rf_clf_fine.predict(X_test_scaled)

b_acc = balanced_accuracy_score(y_test, y_pred)
print(f'Random Forest Balanced Accuracy (Fine): {b_acc:.4f}')

Random Forest Balanced Accuracy (Fine): 0.5718


Since we observe pretty significant performance degradation on the fine-level set, we'll first try another cross-validation to see if we can find better hyperparameters.

***Note that the following cell may take a while to run***

In [17]:
rf = RandomForestClassifier()

param_space = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 20, 30, 40],
    'min_samples_split': [4, 8, 16],
    'max_features': ['sqrt'],
}

rf_clf_fine = RandomizedSearchCV(rf, param_space, n_jobs=-1, cv=5)
rf_clf_fine.fit(X_train_scaled, y_train)

print("Random Forest (Fine) Best Parameters:", rf_clf_fine.best_params_)
print("Random Forest (Fine) Best Estimator:", rf_clf_fine.best_estimator_)



Random Forest (Fine) Best Parameters: {'n_estimators': 500, 'min_samples_split': 16, 'max_features': 'sqrt', 'max_depth': 20}
Random Forest (Fine) Best Estimator: RandomForestClassifier(max_depth=20, min_samples_split=16, n_estimators=500)


In [18]:
y_pred = rf_clf_fine.predict(X_test_scaled)
b_acc = balanced_accuracy_score(y_test, y_pred)

print(f'Random Forest (Fine) Balanced Accuracy: {b_acc:.4f}')

Random Forest (Fine) Balanced Accuracy: 0.5726


Unfortunately, even after this second round of hyperparameter tuning, Random Forest improves only marginally on the fine-level set (balanced accuracy 0.5726 versus 0.5718).
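
A natural next diagnostic, sketched here on made-up labels rather than the NetML predictions, is to look at recall per class, since balanced accuracy is just the mean of those values and a few poorly recognised classes can pull it down sharply:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical three-class labels and predictions for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 2])

# average=None returns one recall value per class, ordered by class label
per_class_recall = recall_score(y_true, y_pred, average=None)
bal_acc = balanced_accuracy_score(y_true, y_pred)
worst_class = int(np.argmin(per_class_recall))

print('per-class recall:', per_class_recall)
print(f'balanced accuracy: {bal_acc:.4f}, weakest class: {worst_class}')
```

Applied to `y_test` and `y_pred` from the fine-level cells, the same two calls would show which malware families the forest confuses most.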

Finally, it's time to test our model on the test and challenge sets from the NetML Challenge and find our best results for both classes of problems.
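
One thing to keep in mind when reading the cells that follow: they create a new `StandardScaler` and call `fit_transform` on each held-out set, rather than reusing the scaler fitted on the training split. The sketch below, on synthetic data with a deliberately shifted held-out set (an assumption for illustration), shows how the two choices can produce different inputs for the model:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: the held-out set is shifted relative to the training set
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_held = rng.normal(loc=3.0, scale=1.0, size=(500, 2))

# Option A: reuse the scaler fitted on the training data
train_scaler = StandardScaler().fit(X_tr)
reused = train_scaler.transform(X_held)

# Option B: fit a fresh scaler on the held-out data itself
refit = StandardScaler().fit_transform(X_held)

# Option A keeps the shift visible to the model; option B removes it
print('reused scaler mean:', reused.mean(axis=0))
print('refit scaler mean: ', refit.mean(axis=0))
```

If the held-out sets are distributed like the training data, the two options give similar features; when they differ, only option A presents the model with inputs on the scale it was trained on.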

First, let's test on the top-level class sets, for which we expect good results.

In [19]:
test_std_folder = DATA_PATH + 'test-std_set'
test_std_anno_file_top = DATA_PATH + 'test-std_anno/' + 'top_submission_test-std_anno.json.gz'

# reuse get_training_data() function to load test-std data
X_test, y_test, _, _ = get_training_data(test_std_folder, test_std_anno_file_top)

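# NOTE: this fits a fresh scaler on the test data. rf_clf was trained on features
# scaled with a scaler fitted on the first training split, so reusing that scaler's
# transform() here would keep the preprocessing consistent.
scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test)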

y_pred = rf_clf.predict(X_test_scaled)
b_acc = balanced_accuracy_score(y_test, y_pred)

print(f'Balanced Accuracy (std, Top): {b_acc:.4f}')


Loading training set ...
Reading 1_test-std_set.json.gz
Balanced Accuracy (std, Top): 0.9837


In [20]:
test_challenge_folder = DATA_PATH + 'test-challenge_set'
test_challenge_anno_file_top = DATA_PATH + 'test-challenge_anno/' + 'top_submission_test-challenge_anno.json.gz'

X_test, y_test, _, _ = get_training_data(test_challenge_folder, test_challenge_anno_file_top)

scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test)

y_pred = rf_clf.predict(X_test_scaled)
b_acc = balanced_accuracy_score(y_test, y_pred)

print(f'Balanced Accuracy (challenge, Top): {b_acc:.4f}')


Loading training set ...
Reading 0_test-challenge_set.json.gz
Balanced Accuracy (challenge, Top): 0.9851


Next, we test again on the fine-level sets using the `rf_clf_fine` model we trained earlier.

In [21]:
test_std_anno_file_fine = DATA_PATH + 'test-std_anno/' + 'fine_submission_test-std_anno.json.gz'

X_test, y_test, _, _ = get_training_data(test_std_folder, test_std_anno_file_fine)

scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test)

y_pred = rf_clf_fine.predict(X_test_scaled)
b_acc = balanced_accuracy_score(y_test, y_pred)

print(f'Balanced Accuracy (std, Fine): {b_acc:.4f}')


Loading training set ...
Reading 1_test-std_set.json.gz
Balanced Accuracy (std, Fine): 0.4883




In [22]:
test_challenge_anno_file_fine = DATA_PATH + 'test-challenge_anno/' + 'fine_submission_test-challenge_anno.json.gz'

X_test, y_test, _, _ = get_training_data(test_challenge_folder, test_challenge_anno_file_fine)

scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test)

y_pred = rf_clf_fine.predict(X_test_scaled)
b_acc = balanced_accuracy_score(y_test, y_pred)

print(f'Balanced Accuracy (challenge, Fine): {b_acc:.4f}')


Loading training set ...
Reading 0_test-challenge_set.json.gz
Balanced Accuracy (challenge, Fine): 0.4619




As expected, our performance on the fine-level sets was rather poor, but we were still able to achieve a high balanced accuracy on the top-level sets, beating the nPrintML benchmark of 92.4 that we originally aimed for.

Averaging our balanced accuracy on the two top-level sets (0.9837 and 0.9851) gives 98.4, around 6 points higher than nPrintML's best result.
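
The averaging can be checked directly from the two top-level scores printed above. The variable names are introduced here for illustration, and the nPrintML figure is the 92.4 quoted in the text:

```python
# Balanced accuracy scores reported by the top-level test cells above
top_std = 0.9837
top_challenge = 0.9851

# nPrintML benchmark as quoted in the text (on a 0-100 scale)
nprintml_benchmark = 92.4

average_top = (top_std + top_challenge) / 2 * 100
margin = average_top - nprintml_benchmark

print(f'average top-level balanced accuracy: {average_top:.2f}')
print(f'margin over nPrintML benchmark: {margin:.2f}')
```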