![cyber_photo](cyber_photo.jpg)

Cyber threats are a growing concern for organizations worldwide. These threats take many forms, including malware, phishing, and denial-of-service (DoS) attacks, all of which can compromise sensitive information and disrupt operations. The increasing sophistication and frequency of these attacks make it imperative for organizations to adopt advanced security measures. Traditional threat detection methods often fall short because they cannot adapt to new and evolving threats. This is where deep learning models come into play.

Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

As a cybersecurity analyst, you identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already undergone preprocessing, and we have a target label, `sus_label`, indicating whether an event is malicious (1) or benign (0).

By successfully developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.


### The Data

| Column     | Description              |
|------------|--------------------------|
|`processId`|The unique identifier for the process that generated the event - int64 |
|`threadId`|ID for the thread spawning the log - int64|
|`parentProcessId`|Label for the process spawning this log - int64|
|`userId`|ID of the user spawning the log - int64|
|`mountNamespace`|Mounting restrictions the process log works within - int64|
|`argsNum`|Number of arguments passed to the event - int64|
|`returnValue`|Value returned from the event log (usually 0) - int64|
|`sus_label`|Binary label marking a suspicious event (1 is suspicious, 0 is not) - int64|

More information on the dataset: [BETH dataset](accreditation.md)

In [14]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as functional
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
from torchmetrics import Accuracy
from sklearn.metrics import accuracy_score  

In [15]:
# Load preprocessed data
train_df = pd.read_csv('labelled_train.csv') 
test_df = pd.read_csv('labelled_test.csv')
val_df = pd.read_csv('labelled_validation.csv')

In [16]:
# View the first 5 rows and the shape of each loaded dataset
train_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,7337,1,100,4026532231,5,0,1
1,381,7337,1,100,4026532231,1,0,1
2,381,7337,1,100,4026532231,0,0,1
3,7347,7347,7341,0,4026531840,2,-2,1
4,7347,7347,7341,0,4026531840,4,0,1


In [17]:
train_df.shape

(763144, 8)

In [18]:
test_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,382,382,1,101,4026532232,3,15,0
1,379,379,1,100,4026532231,3,15,0
2,1,1,0,0,4026531840,4,0,0
3,1,1,0,0,4026531840,4,17,0
4,1,1,0,0,4026531840,2,0,0


In [19]:
test_df.shape

(188967, 8)

In [20]:
val_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,381,1,101,4026532232,3,15,0
1,378,378,1,100,4026532231,3,15,0
2,1,1,0,0,4026531840,4,0,0
3,1,1,0,0,4026531840,4,12,0
4,1,1,0,0,4026531840,2,0,0


In [21]:
val_df.shape

(188967, 8)

Separating Features and Labels + Scaling the data

In [22]:
X_train = train_df.drop('sus_label', axis=1).values #removing the 'sus_label' column to obtain the features / independent variables
y_train = train_df['sus_label'].values #table consisting of the 'sus_label' or dependent variable column alone
X_test = test_df.drop('sus_label', axis=1).values #removing the 'sus_label' column to obtain the features / independent variables
y_test = test_df['sus_label'].values #table consisting of the 'sus_label' or dependent variable column alone
X_val = val_df.drop('sus_label', axis=1).values #removing the 'sus_label' column to obtain the features / independent variables
y_val = val_df['sus_label'].values #table consisting of the 'sus_label' or dependent variable column alone

scaler = StandardScaler() #initializing a StandardScaler instance

X_train = scaler.fit_transform(X_train) #standardize X_train based on its mean and std
X_test = scaler.transform(X_test) #standardize X_test based on X_train mean and std
X_val = scaler.transform(X_val) #standardize X_val based on X_train mean and std
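
To make the scaling step concrete, here is a minimal sketch of the arithmetic that `StandardScaler` performs, written with the standard library only. The values in `train_col` and `test_col` are made up for illustration and are not rows from the BETH files, and `standardize` is a hypothetical helper. The point is that the test column is shifted and scaled with the mean and standard deviation learned from the training column.

```python
from statistics import fmean, pstdev

# Hypothetical toy values for one feature column (not taken from the dataset)
train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
test_col = [3.0, 6.0]

# "fit": learn the statistics from the training column only
mean = fmean(train_col)
std = pstdev(train_col)  # population std (ddof=0), the convention StandardScaler uses

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(x - mean) / std for x in values]

# "transform": reuse the training statistics for every split
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information about the test and validation splits out of the preprocessing step, which is why only `X_train` is passed to `fit_transform`.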

Converting the Data to Tensors (PyTorch uses tensors as its fundamental data structure)

In [23]:
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1) #reshaping y_train into a 2D array
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1) #reshaping y_test into a 2D array
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).view(-1, 1) #reshaping y_val into a 2D array
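
The imports at the top include `TensorDataset` and `DataLoader`, which the notebook does not go on to use. As a hedged sketch of one way they could be applied, the snippet below wraps small random stand-in tensors (hypothetical data with 7 feature columns, not the real splits) and iterates over mini-batches. The names `X_demo` and `y_demo` and the value `batch_size=4` are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in tensors: 10 rows, 7 features, binary labels reshaped to (N, 1)
X_demo = torch.randn(10, 7)
y_demo = torch.randint(0, 2, (10,)).float().view(-1, 1)

# TensorDataset pairs each feature row with its label
dataset = TensorDataset(X_demo, y_demo)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for X_batch, y_batch in loader:
    # Features and labels stay aligned row by row inside each batch
    batch_shapes.append((tuple(X_batch.shape), tuple(y_batch.shape)))

print(batch_shapes)
```

With 10 rows and a batch size of 4, the loader yields two full batches and one smaller final batch.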

Defining our Neural Network Model, Loss Function, and Optimizer

In [24]:
model = nn.Sequential(
    nn.Linear(X_train.shape[1], 128),  # First fully connected layer
    nn.ReLU(),  # ReLU activation
    nn.Linear(128, 64),  # Second fully connected layer
    nn.ReLU(),  # ReLU activation
    nn.Linear(64, 1),  # Third fully connected layer
    nn.Sigmoid()  # Sigmoid activation for binary classification
)
criterion = nn.BCELoss() #loss function: binary cross-entropy, which matches the single sigmoid output and the (N, 1) float targets
optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4) #to adjust model's parameters
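
For a model that ends in a single sigmoid unit, binary cross-entropy is the loss that matches the output. The following is a small worked sketch of that formula in plain Python, using made-up probabilities and labels. `binary_cross_entropy` and `softmax` are hypothetical helpers written for this illustration, not PyTorch functions. The second half shows that a softmax over a single logit is always 1.0, so a softmax-based cross-entropy over one output stays constant.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Hypothetical sigmoid outputs and true labels
probs = [0.9, 0.2, 0.6]
targets = [1, 0, 1]

loss = binary_cross_entropy(probs, targets)
print(loss)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    denom = sum(exps)
    return [e / denom for e in exps]

# With only one logit, the softmax is 1.0 whatever the value is
print(softmax([-3.0]), softmax([5.0]))
```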

Training and Evaluation

In [25]:
num_epoch = 10
for epoch in range(num_epoch):
    model.train()  # Set the model to training mode
    optimizer.zero_grad()  # Clear the gradients
    outputs = model(X_train_tensor)  # Forward pass: compute the model output
    loss = criterion(outputs, y_train_tensor)  # Compute the loss
    loss.backward()  # Backward pass: compute the gradients
    optimizer.step()  # Update the model parameters

# Model Evaluation
model.eval()  # Set the model to evaluation mode
with torch.no_grad():  # Disable gradient calculation for efficiency
    y_predict_train = model(X_train_tensor).round()  # Predict on training data and round the outputs
    y_predict_test = model(X_test_tensor).round()  # Predict on test data and round the outputs
    y_predict_val = model(X_val_tensor).round()  # Predict on validation data and round the outputs

# Calculate accuracy using torchmetrics
accuracy = Accuracy(task="binary")

train_accuracy = accuracy(y_predict_train, y_train_tensor)
test_accuracy = accuracy(y_predict_test, y_test_tensor)
val_accuracy = accuracy(y_predict_val, y_val_tensor)

# Convert the 0-dimensional tensors to Python floats
train_accuracy = train_accuracy.item()
test_accuracy = test_accuracy.item()
val_accuracy = val_accuracy.item()

print("Training accuracy: {0}".format(train_accuracy))
print("Validation accuracy: {0}".format(val_accuracy))
print("Testing accuracy: {0}".format(test_accuracy))

# Calculate the accuracy using sklearn
train_accuracy = accuracy_score(y_train_tensor, y_predict_train)
val_accuracy = accuracy_score(y_val_tensor, y_predict_val)
test_accuracy = accuracy_score(y_test_tensor, y_predict_test)
print(train_accuracy, val_accuracy, test_accuracy)


Training accuracy: 0.9301795959472656
Validation accuracy: 0.9516899585723877
Testing accuracy: 0.09550344944000244
0.9301796253393855 0.9516899776151392 0.09550344769192505
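
Accuracy alone can hide how a classifier treats each class, which matters when the label balance differs between splits. As a minimal sketch, the hypothetical helper `confusion_counts` below derives precision and recall from binary predictions. The labels in `y_true_demo` and `y_pred_demo` are made up for illustration and are not the notebook's predictions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical labels: mostly suspicious events, mostly predicted as benign
y_true_demo = [1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)

accuracy = (tp + tn) / len(y_true_demo)
# Guard against division by zero when a class is never predicted or never present
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(accuracy, precision, recall)
```

In this toy case every positive prediction is correct, yet most suspicious events are missed, which the recall value exposes and the accuracy value does not.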
