![cyber_photo](cyber_photo.jpg)

Cyber threats are a growing concern for organizations worldwide. They take many forms, including malware, phishing, and denial-of-service (DoS) attacks, and they can compromise sensitive information and disrupt operations. As attacks grow more sophisticated and frequent, organizations need more advanced security measures. Traditional threat detection methods often fall short because they cannot adapt to new and evolving threats. This is where deep learning models come into play.

Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

As a cybersecurity analyst, your job is to identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already been preprocessed and includes a target label, `sus_label`, indicating whether an event is malicious (1) or benign (0).

By successfully developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.


### The Data

| Column     | Description              |
|------------|--------------------------|
|`processId`|Unique identifier for the process that generated the event - int64|
|`threadId`|ID of the thread spawning the log - int64|
|`parentProcessId`|Label for the process spawning this log - int64|
|`userId`|ID of the user spawning the log - int64|
|`mountNamespace`|Mounting restrictions the process log works within - int64|
|`argsNum`|Number of arguments passed to the event - int64|
|`returnValue`|Value returned from the event log (usually 0) - int64|
|`sus_label`|Binary label for a suspicious event (1 is suspicious, 0 is not) - int64|

More information on the dataset: [BETH dataset](accreditation.md)

In [26]:
# Run this cell to install torchmetrics. If pip install is unavailable, use sklearn.metrics instead.
!pip install torchmetrics

Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 24.0 -> 24.1.2
[notice] To update, run: python3 -m pip install --upgrade pip


In [27]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as functional
from torch.utils.data import DataLoader, TensorDataset, Dataset
import torch.optim as optim
from torchmetrics import Accuracy
# from sklearn.metrics import accuracy_score  # uncomment to use sklearn

In [28]:
# Load preprocessed data
train_df = pd.read_csv('labelled_train.csv')
test_df = pd.read_csv('labelled_test.csv')
val_df = pd.read_csv('labelled_validation.csv')

# View the first 5 rows of training set
train_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,7337,1,100,4026532231,5,0,1
1,381,7337,1,100,4026532231,1,0,1
2,381,7337,1,100,4026532231,0,0,1
3,7347,7347,7341,0,4026531840,2,-2,1
4,7347,7347,7341,0,4026531840,4,0,1


In [29]:
# Start coding here
# Use as many cells as you need

#### Data preprocessing

In [30]:
# Create a custom dataset class
class TabularDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]

# Separate features and labels for training, testing, and validation sets
X_train = train_df.drop('sus_label', axis=1).values
y_train = train_df['sus_label'].values
X_test = test_df.drop('sus_label', axis=1).values
y_test = test_df['sus_label'].values
X_val = val_df.drop('sus_label', axis=1).values
y_val = val_df['sus_label'].values

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform the training data
X_train = scaler.fit_transform(X_train)

# Transform the test and validation data using the fitted scaler
X_test = scaler.transform(X_test)
X_val = scaler.transform(X_val)

# Convert the numpy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).view(-1, 1)

train_data = TabularDataset(X_train_tensor, y_train_tensor)
val_data = TabularDataset(X_val_tensor, y_val_tensor)

# Shuffle only the training batches; validation order does not affect the metric
dataloader = DataLoader(train_data, batch_size=20, shuffle=True)
val_loader = DataLoader(val_data, batch_size=20, shuffle=False)
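
The scaling step above fits `StandardScaler` on the training split only and reuses those statistics for the other splits. A minimal sketch of the same arithmetic on made-up numbers (the `toy_*` arrays are hypothetical and not taken from the BETH data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature column, not BETH values
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_val = np.array([[4.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from training rows
toy_val_scaled = toy_scaler.transform(toy_val)          # reuses the training mean and std

# Manual equivalent: z = (x - train_mean) / train_std, with population std (ddof=0)
manual_val = (toy_val - toy_train.mean(axis=0)) / toy_train.std(axis=0)
print(toy_val_scaled, manual_val)
```

Fitting on the training split alone keeps information about the validation and test distributions out of the preprocessing step.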

#### Define a neural network classifier

In [31]:
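class NetworkTrafficClassifier(nn.Module):
    """Two-layer feed-forward classifier for the BETH tabular features."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)   # input features -> hidden units
        self.fc2 = nn.Linear(hidden_size, output_size)  # hidden units -> one score per class

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        # Note: the sigmoid bounds each class score to (0, 1). nn.CrossEntropyLoss
        # expects unbounded logits, so this limits how low the training loss can go.
        x = torch.sigmoid(self.fc2(x))
        return x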
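
`nn.CrossEntropyLoss` applies log-softmax internally and expects unnormalized scores (logits). As a hedged sketch, here is a variant that returns raw logits instead of sigmoid outputs. The class name `LogitClassifier`, the `sketch_*` names, and the input size of 7 are illustrative assumptions, not part of the notebook:

```python
import torch
import torch.nn as nn

class LogitClassifier(nn.Module):
    """Hypothetical variant of the classifier that returns raw logits."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # no sigmoid: CrossEntropyLoss handles normalization

# Assumed sizes for illustration only
sketch_model = LogitClassifier(input_size=7, hidden_size=32, output_size=2)
sketch_batch = torch.randn(4, 7)            # 4 made-up samples with 7 features
sketch_logits = sketch_model(sketch_batch)  # shape (4, 2), values unbounded
sketch_labels = torch.tensor([0, 1, 0, 1])  # made-up class indices
sketch_loss = nn.CrossEntropyLoss()(sketch_logits, sketch_labels)
print(sketch_logits.shape, sketch_loss.item())
```

Predictions can still be read off with `torch.argmax`, since the arg-max of the logits equals the arg-max of the softmax probabilities.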


#### Train the classifier

In [32]:
# Derive layer sizes from the training DataFrame loaded above
feature_size = len(train_df.columns) - 1       # every column except sus_label
hidden_size = 32
target_size = train_df['sus_label'].nunique()  # number of distinct label values

model = NetworkTrafficClassifier(input_size=feature_size, hidden_size=hidden_size, output_size=target_size)

lr = 0.05
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

epochs = 20

model.train()
for epoch in range(epochs):
    running_loss, num_processed = 0.0, 0
    for inputs, labels in dataloader:
        optimizer.zero_grad()  # clear gradients from the previous step
        output = model(inputs)
        # CrossEntropyLoss expects class indices of shape (batch,) and dtype long
        loss = criterion(output, labels.long().squeeze(1))
        loss.backward()
        optimizer.step()
        running_loss += loss.item()  # loss.item() is already the mean over the batch
        num_processed += len(inputs)
    # Sum of per-batch means divided by the sample count, i.e. roughly mean loss / batch size
    print(f"Epoch: {epoch+1}, Loss: {running_loss/num_processed}")

Epoch: 1, Loss: 0.01574795674453181
Epoch: 2, Loss: 0.01574655414205907
Epoch: 3, Loss: 0.015746554142020017
Epoch: 4, Loss: 0.015746579479289504
Epoch: 5, Loss: 0.01574655414280106
Epoch: 6, Loss: 0.015746554142020017
Epoch: 7, Loss: 0.015746554142722954
Epoch: 8, Loss: 0.01574655414186381
Epoch: 9, Loss: 0.015746554141590444
Epoch: 10, Loss: 0.01574655414280106
Epoch: 11, Loss: 0.015746554142215277
Epoch: 12, Loss: 0.015746554142566746
Epoch: 13, Loss: 0.01574655414088751
Epoch: 14, Loss: 0.015746554141785704
Epoch: 15, Loss: 0.01574655414229338
Epoch: 16, Loss: 0.015746554141941912
Epoch: 17, Loss: 0.01574655414225433
Epoch: 18, Loss: 0.015746554142566746
Epoch: 19, Loss: 0.01574655414205907
Epoch: 20, Loss: 0.015746554141199926
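
The loss barely moves after the first epoch. One plausible explanation, offered as a back-of-the-envelope check rather than a diagnosis: the final sigmoid confines both class scores to (0, 1), so cross-entropy on those scores cannot fall below log(1 + e^-1), and the printed value divides a sum of per-batch mean losses by the number of samples (batch size 20). The variable names below are illustrative:

```python
import math

# Best case when both class scores are confined to (0, 1):
# true-class score -> 1, other-class score -> 0
floor_per_sample = math.log(1 + math.exp(0.0 - 1.0))

batch_size = 20  # matches the training DataLoader
# The loop sums per-batch *mean* losses and divides by the sample count,
# so the printed figure is roughly the per-sample loss divided by batch_size
floor_as_printed = floor_per_sample / batch_size

print(f"{floor_per_sample:.4f} {floor_as_printed:.5f}")  # 0.3133 0.01566
```

The printed losses of about 0.01575 sit just above this floor, which is consistent with the class scores saturating early in training.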


#### Evaluate the model based on the validation dataset

In [33]:
accuracy_metric = Accuracy(task='binary')

model.eval()
predicted = []

with torch.no_grad():  # gradients are not needed for evaluation
    for inputs, labels in val_loader:
        output = model(inputs)
        predicted_classes = torch.argmax(output, dim=-1)  # index of the higher class score
        predicted.extend(predicted_classes.tolist())
        # Flatten labels from (batch, 1) to (batch,) to match the predictions
        accuracy_metric(predicted_classes, labels.squeeze(1))

val_accuracy = accuracy_metric.compute().item()

In [34]:
val_accuracy

0.9958405494689941
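
For reference, the binary accuracy reported above is the fraction of samples whose arg-max class matches the label. A small sketch on made-up scores (the `toy_*` tensors are hypothetical, not model outputs):

```python
import torch

# Hypothetical two-class scores for four samples
toy_scores = torch.tensor([[0.9, 0.1],
                           [0.2, 0.8],
                           [0.6, 0.4],
                           [0.3, 0.7]])
toy_labels = torch.tensor([0, 1, 1, 1])

toy_pred = torch.argmax(toy_scores, dim=-1)             # tensor([0, 1, 0, 1])
toy_accuracy = (toy_pred == toy_labels).float().mean()  # 3 of 4 predictions match
print(toy_accuracy.item())  # 0.75
```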