# Fine-tuning CodeBERT for vulnerability detection

In this notebook, we provide an example of fine-tuning CodeBERT for vulnerability detection.

Each input is a single function, and the output is a binary label indicating whether that function contains a vulnerability.

## Dataset

We use the publicly available subset of the Devign [1] dataset, which covers only the FFmpeg and QEMU projects, as distributed in the MADE-WIC [2] fused dataset.

The following commands download MADE-WIC, extract the Devign files from it, and delete the rest of the archive.

[1] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks,” in Advances in Neural Information Processing Systems, 2019, vol. 32. [Online]. Available: https://sites.google.com/view/devign

[2] M. Mock, J. Melegati, M. Kretschmann, N. E. Diaz Ferreyra, and B. Russo, “MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Oct. 2024, pp. 2346–2349. doi: 10.1145/3691620.3695348.


In [None]:
!wget -O MADE-WIC.zip "https://zenodo.org/records/13370805/files/MADE-WIC.zip?download=1"
!unzip -j MADE-WIC.zip "MADE-WIC/Dataset/devign/*" -d devign
!rm MADE-WIC.zip

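If `wget` and `unzip` are not available (for example on Windows), the extraction step can be approximated in Python. The following is a minimal sketch using only the standard library; the helper name `extract_flat` is hypothetical, and it mimics `unzip -j` by dropping directory paths from the extracted files.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_flat(zip_path, prefix, dest_dir):
    """Extract files under `prefix` into `dest_dir`, dropping their directory paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and anything outside the requested folder
            if info.is_dir() or not info.filename.startswith(prefix):
                continue
            target = dest / PurePosixPath(info.filename).name
            target.write_bytes(archive.read(info))
            extracted.append(target.name)
    return sorted(extracted)
```

Assuming the archive has already been downloaded, a call such as `extract_flat("MADE-WIC.zip", "MADE-WIC/Dataset/devign/", "devign")` would play the role of the `unzip` command above.
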
## Libraries

In this notebook we use Keras because its high-level layer API makes the code easier to follow. The backend is TensorFlow.

We also use the transformers library to take advantage of the pre-trained models available on Hugging Face.

In [None]:
import tensorflow as tf
from transformers import TFRobertaModel, RobertaTokenizer
from tf_keras import Model
from tf_keras.layers import Dense, Input, Dropout
from tf_keras.regularizers import L2
from tf_keras.metrics import Precision, Recall
from tf_keras.losses import BinaryCrossentropy
from tf_keras.optimizers import AdamW
import tensorflow_datasets as tfds
import pandas as pd

### Checking access to the hardware

Let's check that the configuration is correct and that we have access to a GPU. An empty list means that no GPU was found.

In [None]:
tf.config.list_physical_devices('GPU')

## Model

### Hyperparameters

In [None]:
dropout_prob = 0.1
l2_reg_lambda = 0.2
learning_rate = 2e-5
num_epochs = 10
batch_size = 16
max_length = 512

### Architecture

First we load the pre-trained model:

In [None]:
# If you have enough computing power, you can use "microsoft/codebert-base" instead
checkpoint = "huggingface/CodeBERTa-small-v1"
tokenizer = RobertaTokenizer.from_pretrained(checkpoint)
model = TFRobertaModel.from_pretrained(checkpoint)

Then, we define the input layers and pass them to the loaded model. Besides the token IDs, the model takes an attention mask, which tells it which positions hold real tokens and which are only padding.
From the model output, we take the last-layer hidden state of the first token (`<s>`, which plays the role of BERT's [CLS] token).
We add a dropout layer, which randomly "turns off" some neurons during training, since it usually reduces overfitting.

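To make the role of the attention mask concrete, here is a minimal pure-Python sketch of padding a token sequence to a fixed length. The helper `pad_with_mask`, the token IDs, and the padding ID of 1 are assumptions for illustration; the real tokenizer does this work for us and, unlike this sketch, keeps the closing special token when it truncates.

```python
def pad_with_mask(token_ids, max_length, pad_id=1):
    """Truncate or pad `token_ids` to `max_length` and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token IDs for a short input, padded to length 6
ids, mask = pad_with_mask([0, 713, 16, 2], max_length=6)
print(ids)   # [0, 713, 16, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```
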
In [None]:
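input_ids = Input(shape=(max_length, ), dtype='int32', name='input_ids')
attention_mask = Input(shape=(max_length, ), dtype='int32', name='attention_mask')
# Keep the encoder output under its own name instead of overwriting `model`
encoder_output = model([input_ids, attention_mask])
# Hidden state of the first token from the last layer
embedding = encoder_output.last_hidden_state[:, 0, :]
embedding = Dropout(dropout_prob)(embedding)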

We then connect the resulting embedding to the output layer, which consists of a single neuron with a sigmoid activation, and build the final Keras model.

In [None]:
output = Dense(1,
                kernel_initializer='glorot_normal',
                kernel_regularizer=L2(l2_reg_lambda),
                bias_regularizer=L2(l2_reg_lambda),
                activation='sigmoid')(embedding)

model = Model(inputs=[input_ids, attention_mask], outputs=output)

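As a rough illustration of what this output layer computes, the sketch below evaluates a single sigmoid neuron and an L2 penalty of the form lambda times the sum of squared parameters in plain Python. The function names and the parameter values are hypothetical; Keras performs the equivalent computation on tensors.

```python
import math


def sigmoid_neuron(embedding, weights, bias):
    """Return sigmoid(w . x + b), a value strictly between 0 and 1."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def l2_penalty(values, l2_lambda):
    """Return lambda * sum(v^2), the L2 regularization term added to the loss."""
    return l2_lambda * sum(v * v for v in values)


weights, bias = [0.5, -0.25], 0.0  # hypothetical parameters
print(sigmoid_neuron([1.0, 2.0], weights, bias))  # 0.5, since w . x + b = 0
print(l2_penalty(weights, 0.2))
```

Larger weights increase the penalty, so the regularizer nudges the output layer towards small parameter values.
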
Then, we compile the model, specifying the loss function used during training, the optimizer, and the metrics to monitor.

In [None]:
model.compile(loss=BinaryCrossentropy(),
              optimizer=AdamW(learning_rate),
              metrics=['accuracy', Precision(), Recall()])

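For intuition about the loss, binary cross-entropy can be written in a few lines of plain Python. This is a sketch with a hypothetical helper name and made-up probabilities, not the Keras implementation; the clipping constant `eps` is likewise an assumption.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip to avoid log(0) for fully confident predictions
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```
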
## Preparing the data

Let's now load the dataset.

In [None]:
df = pd.read_csv('devign/complete.csv')

#df.fillna(value='', inplace=True)
#df.replace(to_replace=[None], value='', inplace=True)
dataset = tf.data.Dataset.from_tensor_slices((df['Function'], df['Devign']))
#dataset = dataset.take(1000)
num_samples = len(dataset)

print('Samples in dataset:', num_samples)

Let's take a fraction of the dataset so that training finishes faster. Note that `take` keeps the first 10% of the rows rather than a random sample. If you have powerful hardware, you can skip the next cell.

In [None]:
dataset = dataset.take(int(num_samples * 0.1))
num_samples = len(dataset)

print('Samples in dataset:', num_samples)

Next, we split it into training, validation, and test sets (80%, 10%, and 10%).

In [None]:
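train_size = int(num_samples * 0.8)
validation_size = int(num_samples * 0.1)
train_ds = dataset.take(train_size)
validation_ds = dataset.skip(train_size).take(validation_size)
# The test set gets everything that is left, so rounding never drops a sample
test_ds = dataset.skip(train_size + validation_size)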

print('Samples in train dataset:', len(train_ds))
print('Samples in validation dataset:', len(validation_ds))
print('Samples in test dataset:', len(test_ds))

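The size arithmetic behind an 80/10/10 split can be checked in isolation. The sketch below uses a hypothetical helper, `split_sizes`, in which the test set takes the remainder so that the three sizes always add up to the total.

```python
def split_sizes(num_samples, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes that always add up to num_samples."""
    train = int(num_samples * train_frac)
    validation = int(num_samples * validation_frac)
    # The test split takes whatever is left, so rounding never drops a sample
    test = num_samples - train - validation
    return train, validation, test


print(split_sizes(1000))  # (800, 100, 100)
print(split_sizes(27))    # (21, 2, 4)
```
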
In this implementation, the tokenizer is not part of the model but a separate class, so we have to tokenize the inputs ourselves before feeding them to the model.

In [None]:
def encode_examples(tokenizer, ds):
    # Prepare Input list
    input_ids_list = []
    attention_mask_list = []
    label_list = []

    for code, vulnerable in tfds.as_numpy(ds):
        bert_input = tokenizer.encode_plus(code.decode(),
                                        add_special_tokens=True,
                                        max_length=max_length,
                                        padding='max_length',
                                        return_attention_mask=True,
                                        truncation=True
                                        )
        input_ids_list.append(bert_input['input_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append(vulnerable)

    return { 'input_ids':  tf.convert_to_tensor(input_ids_list),
              'attention_mask': tf.convert_to_tensor(attention_mask_list) }, tf.convert_to_tensor(label_list)

train_ds_encoded, train_labels = encode_examples(tokenizer, train_ds)
validation_ds_encoded, validation_labels = encode_examples(tokenizer, validation_ds)
test_ds_encoded, test_labels = encode_examples(tokenizer, test_ds)

## Training

For training, we provide the training and validation data, the number of epochs, and the batch size.

In [None]:
model.fit(train_ds_encoded,
          train_labels,
          epochs=num_epochs,
          batch_size=batch_size,
          validation_data=(validation_ds_encoded, validation_labels))

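The batch size and the number of epochs together determine how many weight updates the model receives. A small sketch of that arithmetic, assuming a hypothetical training set of 1000 samples and the hyperparameter values defined earlier (batch size 16, 10 epochs):

```python
import math


def steps_per_epoch(num_train_samples, batch_size):
    """Number of batches in one epoch; the last batch may be smaller."""
    return math.ceil(num_train_samples / batch_size)


print(steps_per_epoch(1000, 16))       # 63
print(steps_per_epoch(1000, 16) * 10)  # 630 updates over 10 epochs
```
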
## Inference

Let's use the trained model to predict the existence of vulnerabilities for the functions in the test dataset.

In [None]:
predictions = model.predict(test_ds_encoded)

Now, let's calculate the performance scores. The model outputs a number in the interval [0, 1], so we treat any value above 0.5 as a prediction that the function contains a vulnerability.

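For reference, the scores are derived from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) using the standard definitions below. The function that follows prints -1 when a denominator would be zero.

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
$$

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
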
In [None]:
def calculate_scores(predictions, label):

    if hasattr(label, "ndim") and label.ndim > 1:
        label = label.squeeze()

    tp = 0
    tn = 0
    fp = 0
    fn = 0

    for index in range(len(predictions)):
        prediction = predictions[index] if isinstance(predictions[index], bool) else predictions[index][0] > 0.5

        if(label[index] == True):
            if(prediction == True):
                tp = tp + 1
            else:
                fn = fn + 1
        else:
            if(prediction == False):
                tn = tn + 1
            else:
                fp = fp + 1

    print("TP:", tp)
    print("TN:", tn)
    print("FP:", fp)
    print("FN:", fn)

    precision = tp / (tp + fp) if tp + fp > 0 else -1
    recall = tp / (tp + fn) if tp + fn > 0 else -1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * ((precision * recall) / (precision + recall)) if precision + recall > 0 else -1

    print("\nPrecision:", precision)
    print("Recall:", recall)
    print("Accuracy:", accuracy)
    print("F1:", f1)

In [None]:
calculate_scores(predictions, test_labels)