# **CDS Project: Part 1**

*Institute of Software Security (E22)*  
*Hamburg University of Technology*  
*SoSe 2023*

## Learning objectives
---

- Use a basic Machine Learning (ML) pipeline with pre-trained models.
- Build your own data loader.
- Load and run a pre-trained ML model.
- Evaluate the performance of an ML model.
- Calculate and interpret performance metrics.

## Materials
---

- Lecture Slides 1, 2, and 3.
- PyTorch Documentation: [Datasets and Data Loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)


## Project Description
---

In this project, you are given an ML model that is pre-trained on a vulnerability dataset. The dataset consists of code samples labeled True or False, depending on the presence or absence of a vulnerability. Your goal is to use the pre-trained model to predict whether the code samples in the validation set contain vulnerabilities and to analyse the results. Please work through the tasks below.

### *Task 1*

Build a data loader for the validation dataset located at the following path: "*data_students/student_dataset.hdf5*". You will use this dataset to validate the performance of the ML model. The dataset is stored in the HDF5 binary data format, which is designed for storing large amounts of data. Make sure that you import and familiarise yourself with the right Python libraries to handle HDF5 files.


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
import h5py
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
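
The cells in this notebook read the HDF5 file directly into NumPy arrays. As a minimal sketch of what a dedicated data loader for Task 1 could look like, the block below wraps the file in a `torch.utils.data.Dataset`. The class name `VulnDataset` and the helper `make_loader` are hypothetical names chosen for illustration. The sketch assumes the file contains the keys `vectors` and `labels` (as printed in Task 2) and that the whole dataset fits in memory.

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader


class VulnDataset(Dataset):
    """Hypothetical wrapper: loads all vectors and labels into memory once."""

    def __init__(self, hdf5_path):
        # Assumption: the file exposes 'vectors' and 'labels' datasets.
        with h5py.File(hdf5_path, "r") as f:
            self.vectors = torch.tensor(f["vectors"][:], dtype=torch.float32)
            self.labels = torch.tensor(f["labels"][:], dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Returns one (vector, label) pair; DataLoader stacks these into batches.
        return self.vectors[idx], self.labels[idx]


def make_loader(hdf5_path, batch_size=64):
    # shuffle=False keeps the sample order stable for evaluation.
    return DataLoader(VulnDataset(hdf5_path), batch_size=batch_size, shuffle=False)
```

A call such as `make_loader(file_path)` would then yield `(vectors, labels)` batches that can be fed to the model one batch at a time.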

### *Task 2*

Generate a table with 10 random samples from the dataset and show their corresponding labels.

In [None]:
!ls "/content/drive/MyDrive/TUHH/CDS/Part-1"

ls: cannot access '/content/drive/MyDrive/TUHH/CDS/Part-1': No such file or directory


In [None]:
# TODO: display 10 random samples from the loaded dataset

file_path = '/content/drive/MyDrive/TUHH/CDS/PART-1/student_dataset.hdf5'

with h5py.File(file_path, 'r') as f:
  # Available keys
  print("Available keys in HDF5 file:", list(f.keys()))

  # Extract datasets
  vectors = f['vectors'][:]
  labels = f['labels'][:]

# Random selection of indices
indices = np.random.choice(len(vectors), size=10, replace=False)

# Sample Data
sample_vectors = vectors[indices]
sample_labels = labels[indices]

# Build a DataFrame for display; pandas shortens the long repr strings
df_samples = pd.DataFrame({
    'Sample Index': indices,
    'Vector (Truncated)': [repr(v[:10]) + '...' for v in sample_vectors],
    'Label': sample_labels
})

# Display the samples
print(df_samples)

Available keys in HDF5 file: ['labels', 'source', 'vectors']
   Sample Index                                 Vector (Truncated)  Label
0           856  array([[ 1.40853599e-01,  6.09922826e-01, -2.8...   True
1           495  array([[-7.24035561e-01,  3.26080024e-01, -9.6...  False
2           682  array([[-3.66267025e-01,  2.06676662e-01, -6.6...  False
3           516  array([[-5.26928425e-01, -5.97019613e-01,  2.7...  False
4           451  array([[ 1.16817743e-01, -1.52544391e+00, -3.1...  False
5           334  array([[-8.82684052e-01,  1.60595679e+00, -2.1...  False
6           132  array([[ 1.37810397e+00, -1.19254696e+00, -3.1...  False
7           467  array([[ 1.77723718e+00,  8.70357990e-01, -1.1...  False
8           632  array([[ 1.17301333e+00, -3.04405779e-01,  1.9...   True
9            20  array([[-8.5193080e-01, -2.7158836e-01, -3.201...  False


### *Task 3*

Inspect the dataset and answer the following questions:
1.  How many samples are in the dataset?
2. How many positive examples (vulnerability-labeled instances) are in the dataset?
3. What is the vulnerable/non-vulnerable ratio?

In [None]:
# TODO: inspect and understand the loaded dataset
with h5py.File(file_path, 'r') as f:
    # Extract labels dataset
    labels = f['labels'][:]
    # Total number of samples
    total_samples = len(labels)

    # Count the number of positive (vulnerable) samples (assuming '1' indicates vulnerability)
    positive_samples = np.sum(labels == 1)

    # Count the number of negative (non-vulnerable) samples
    negative_samples = np.sum(labels == 0)

    # Calculate the vulnerable/non-vulnerable ratio
    vulnerability_ratio = positive_samples / negative_samples

# Display the results
print(f"Total number of samples: {total_samples}")
print(f"Number of vulnerable examples: {positive_samples}")
print(f"Number of non-vulnerable examples: {negative_samples}")
print(f"Vulnerable/Non-vulnerable ratio: {vulnerability_ratio}")

Total number of samples: 1000
Number of vulnerable examples: 283
Number of non-vulnerable examples: 717
Vulnerable/Non-vulnerable ratio: 0.3947001394700139


### *Task 4*

Load and run the following pre-trained neural network model, called `VulnPredictModel`.

``` python
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

```

``` python
from torch import nn

class VulnPredictModel(nn.Module):
    # initialize the model architecture
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    # forward propagation (a method of the class, not nested inside __init__)
    def forward(self, x):
        pred = self.linear_stack(x)
        return pred


# TODO: initialize and load the model
```

In [None]:
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")


class VulnPredictModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    # Forward propagation
    def forward(self, x):
        pred = self.linear_stack(x)
        return pred

# Initialize the model
model = VulnPredictModel()

# Move the model to device
model.to(device)

# Load the pre-trained weights
model_path = '/content/drive/MyDrive/TUHH/CDS/PART-1/model_2023-03-28_20-03.pth'

# Load the model weights into the model
model.load_state_dict(torch.load(model_path, map_location=device))

model.eval()

Using cpu device


VulnPredictModel(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_stack): Sequential(
    (0): Linear(in_features=768, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=1, bias=True)
    (5): Sigmoid()
  )
)
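
Before running the real data through the model, it can help to check which output shape to expect. The sketch below rebuilds only the layer stack with random weights (it does not load the pre-trained file) and pushes a dummy batch through it. The input shape `(batch, 1, 768)` is an assumption inferred from the nested arrays shown in the Task 2 table.

```python
import torch
from torch import nn

# Same layer sizes as linear_stack above; random weights, for shape checking only.
stack = nn.Sequential(
    nn.Linear(768, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
stack.eval()

dummy = torch.randn(5, 1, 768)  # assumed input shape: (batch, 1, 768)
with torch.no_grad():
    out = stack(dummy)

print(out.shape)            # torch.Size([5, 1, 1])
print(out.squeeze().shape)  # torch.Size([5])
```

Under that assumption, each sample yields a single probability, and `squeeze()` is what reduces the output to one value per sample so that it lines up with the label vector.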

### *Task 5*

Make a prediction on the provided dataset and compute the following values:
- True Positives
- True Negatives
- False Positives
- False Negatives

In [None]:
# Convert numpy arrays to PyTorch tensors
vectors_tensor = torch.tensor(vectors, dtype=torch.float32)
labels_tensor = torch.tensor(labels, dtype=torch.float32)

# TODO: make the prediction for all the samples in the validation set.
with torch.no_grad():
    # Move data to the same device as the model
    vectors_tensor = vectors_tensor.to(device)

    # Get predicted probabilities and binarize them.
    # Note: a decision threshold of 0.3 is used here rather than the common 0.5.
    outputs = model(vectors_tensor)
    predictions = (outputs > 0.3).float().squeeze()

    # Move labels to device for comparison
    labels_tensor = labels_tensor.to(device)

    # TODO: compute true positives, true negatives, false positives and false negatives.
    true_positives = ((predictions == 1) & (labels_tensor == 1)).sum().item()
    true_negatives = ((predictions == 0) & (labels_tensor == 0)).sum().item()
    false_positives = ((predictions == 1) & (labels_tensor == 0)).sum().item()
    false_negatives = ((predictions == 0) & (labels_tensor == 1)).sum().item()

print("\nConfusion Matrix:")
print(f"True Positives (TP): {true_positives}")
print(f"True Negatives (TN): {true_negatives}")
print(f"False Positives (FP): {false_positives}")
print(f"False Negatives (FN): {false_negatives}")



Confusion Matrix:
True Positives (TP): 94
True Negatives (TN): 704
False Positives (FP): 13
False Negatives (FN): 189
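
The counts above depend on the decision threshold applied to the model's sigmoid output. As a rough illustration of that dependence, the sketch below wraps the counting logic in a hypothetical helper `confusion_counts` and evaluates it at two thresholds. The probabilities and labels are made-up toy values, not outputs of the pre-trained model.

```python
import torch


def confusion_counts(probs, labels, threshold=0.5):
    """Return (TP, TN, FP, FN) for probabilities binarized at `threshold`."""
    preds = probs.flatten() > threshold
    truth = labels.flatten().bool()
    tp = (preds & truth).sum().item()
    tn = (~preds & ~truth).sum().item()
    fp = (preds & ~truth).sum().item()
    fn = (~preds & truth).sum().item()
    return tp, tn, fp, fn


# Toy data for illustration only.
probs = torch.tensor([0.9, 0.4, 0.2, 0.35, 0.1])
labels = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0])

for t in (0.3, 0.5):
    print(t, confusion_counts(probs, labels, threshold=t))
# 0.3 (2, 2, 1, 0)
# 0.5 (1, 3, 0, 1)
```

On this toy data, lowering the threshold trades one false negative for one false positive, which is the usual direction of the precision/recall trade-off.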


### *Task 6*

Compute the corresponding performance metrics **manually** (do not use PyTorch's predefined metrics):
- Accuracy
- Precision
- Recall
- F1

In [None]:
# TODO: calculate accuracy
accuracy = (true_positives + true_negatives) / (true_positives + true_negatives + false_positives + false_negatives)

# TODO: calculate precision
precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0

# TODO: calculate recall
recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0

# TODO: calculate F1-score
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

# Print the metrics
print("\nPerformance Metrics:")
print(f"Accuracy: {accuracy:.5f} ({(accuracy*100):.1f}%)")
print(f"Precision: {precision:.5f}")
print(f"Recall: {recall:.5f}")
print(f"F1 Score: {f1:.5f}")



Performance Metrics:
Accuracy: 0.79800 (79.8%)
Precision: 0.87850
Recall: 0.33216
F1 Score: 0.48205
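
As a cross-check of the printed values, the four metrics can be worked out by hand from the confusion-matrix counts of Task 5 (TP = 94, TN = 704, FP = 13, FN = 189):

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{94 + 704}{1000} = 0.798 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{94}{107} \approx 0.8785 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{94}{283} \approx 0.3322 \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{188}{390} \approx 0.4821
\end{aligned}
$$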


### *Task 7*

Based on your performance metrics, answer the following questions:

- Explain the impact of accuracy vs. F1 score.
- In this particular problem, which metric should one focus on more?
- Is there a better metric suitable for the use case of vulnerability prediction? Why?


1. Impact of Accuracy vs. F1 Score
While accuracy indicates the overall correctness of the model’s predictions (79.8% in our case), it can be misleading for imbalanced datasets like ours, where only 28.3% of the samples (283 of 1000) are labeled as vulnerable. A trivial model that predicts every sample as non-vulnerable would already reach 71.7% accuracy without detecting a single vulnerability.
The F1 score, on the other hand, is the harmonic mean of precision (how many predicted vulnerabilities were correct) and recall (how many actual vulnerabilities were detected). In this project, the F1 score is much lower (0.48) because recall is only 0.33: the model misses 189 of the 283 vulnerable samples, despite its seemingly good accuracy.

2. Most Important Metric for This Problem
For vulnerability detection, recall and F1 score are more important than accuracy. The reason is simple: missing a true vulnerability (false negative) is much more dangerous than falsely flagging safe code. Therefore, the goal should be to maximize the detection of true vulnerabilities, even if it means accepting some false positives.

3. A Better-Suited Metric for Vulnerability Prediction
In this context, recall is especially critical because we want to minimize the number of vulnerabilities that go undetected. The F1 score is also useful because it balances the need for high recall with maintaining reasonable precision.
Alternatively, for more advanced evaluations in imbalanced classification tasks, threshold-independent metrics such as Precision-Recall AUC (PR-AUC) or ROC-AUC may also be considered, since all results above depend on the chosen decision threshold of 0.3. However, for this project, focusing on F1 score and recall gives a more realistic picture of the model’s effectiveness in detecting security flaws.
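
PR-AUC is commonly approximated by average precision, which can also be computed without a metrics library. The sketch below is one possible manual implementation: it ranks samples by predicted score and averages the precision at each rank where a true positive occurs. The function name `average_precision` is hypothetical, the example scores are toy values, and the sketch assumes there are no tied scores. Applying it to this project would require the raw model outputs rather than the thresholded predictions.

```python
import numpy as np


def average_precision(scores, labels):
    """Mean of the precision values at the ranks of the positive samples."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    # Rank samples from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]

    n_pos = sorted_labels.sum()
    if n_pos == 0:
        return 0.0  # undefined without positives; 0.0 is a convention chosen here

    # Precision after each rank k = (positives seen so far) / k.
    tp_cum = np.cumsum(sorted_labels)
    precision_at_k = tp_cum / np.arange(1, len(sorted_labels) + 1)

    # Average precision only over the ranks that hold a positive sample.
    return float(precision_at_k[sorted_labels].sum() / n_pos)


# Toy example: positives at ranks 1 and 3 -> (1/1 + 2/3) / 2
print(round(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]), 4))  # 0.8333
```

Unlike accuracy or F1, this value does not depend on a single threshold, so it summarizes how well the model ranks vulnerable samples above non-vulnerable ones.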
