# Fine-tuning a BERT language model for Sensitive Information Detection

## Table of Contents
* Introduction
* Load training dataset with cudf
* Transform labels into pytorch tensor using dlpack
* Transform text using cudf subword tokenizer
* Split into train and test sets
* Load pretrained model
* Fine-tune the model
* Model evaluation
* Save model file
* Conclusion

## Introduction

Detecting sensitive information in text data is an arduous task that often requires complex regular expressions and heuristics. This notebook illustrates how to fine-tune a language model on a small sample dataset of API responses, each previously labeled as containing up to ten different types of sensitive information. We fine-tune a pretrained BERT model from [HuggingFace](https://github.com/huggingface) with a multi-label classification layer, then save the resulting model file for deployment with the Morpheus framework.

In [1]:
from os import path
import s3fs
import torch
from torch.nn import BCEWithLogitsLoss
from transformers import AutoModelForSequenceClassification, AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.dataset import random_split
from torch.utils.dlpack import from_dlpack
from sklearn.metrics import f1_score, accuracy_score, multilabel_confusion_matrix
from tqdm import trange
import cudf
import cupy
from cudf.utils.hash_vocab_utils import hash_vocab
from cudf.core.subword_tokenizer import SubwordTokenizer

## Load training dataset with cudf

To train our model we begin with a dataframe containing a field with text samples and one column for each of ten labels of sensitive data. The label columns are True or False for the presence of the specific sensitive information type in the text.

In [2]:
df = cudf.read_csv("../datasets/training-data/sid-sample-training-data.csv")

## Transform labels into pytorch tensor using dlpack

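We find all the columns in `df` that are labels for the text data (every column except `data`) and convert them into a PyTorch tensor using DLPack, which lets cuDF hand the data to PyTorch on the GPU without a round trip through host memory.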

In [3]:
label_names = list(df.columns)
label_names.remove('data')
label_names = sorted(label_names)
label_names

['si_address',
 'si_bank_acct',
 'si_credit_card',
 'si_email',
 'si_govt_id',
 'si_name',
 'si_password',
 'si_phone_num',
 'si_secret_keys',
 'si_user']

In [4]:
label2idx = {t: i for i, t in enumerate(label_names)}

In [5]:
labels = from_dlpack(df[label_names].to_dlpack()).type(torch.long)
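
To make the resulting label layout concrete, here is a minimal plain-Python sketch of the same idea: one 0/1 entry per label, ordered by the sorted label names. The two rows and the three-label subset are made up for illustration, and `to_multi_hot` is a hypothetical helper, not part of cuDF or PyTorch.

```python
# Plain-Python illustration of the multi-hot label layout.
# The rows below are invented; the real notebook reads them from the CSV.
label_names = sorted(["si_email", "si_address", "si_phone_num"])
label2idx = {name: i for i, name in enumerate(label_names)}

rows = [
    {"si_address": False, "si_email": True, "si_phone_num": False},
    {"si_address": True, "si_email": True, "si_phone_num": False},
]


def to_multi_hot(row, label_names):
    """Return one 0/1 entry per label, in label_names order."""
    return [int(row[name]) for name in label_names]


label_matrix = [to_multi_hot(row, label_names) for row in rows]
print(label2idx)
print(label_matrix)
```

Each row of `label_matrix` corresponds to one row of the `labels` tensor, and `label2idx` gives the column position of each label.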

## Transform text using cudf subword tokenizer

We define a tokenizer using the pretrained vocabulary from the original BERT-base-uncased model (use the BERT-base-cased vocabulary instead if you fine-tune a cased model). We first create a hash file from the vocabulary, then use the tokenizer to transform the `data` column into two padded tensors for model training, `input_ids` and `attention_mask`.

In [6]:
# create one hash file from bert-base-uncased if needed

#hash_vocab('resources/bert-base-uncased-vocab.txt', 'resources/bert-base-uncased-hash.txt')

In [7]:
# if using mini-bert "google/bert_uncased_L-4_H-256_A-4" use uncased vocabulary
bert_uncased_tokenizer = SubwordTokenizer('resources/bert-base-uncased-hash.txt', do_lower_case=True)

In [8]:
tokenizer_output = bert_uncased_tokenizer(df["data"], max_length=256, max_num_rows=len(df["data"]),
                                          padding='max_length', return_tensors='pt', truncation=True,
                                          add_special_tokens=True)
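
As a rough illustration of what `input_ids` and `attention_mask` contain, the following plain-Python sketch applies greedy longest-match WordPiece splitting, adds the special tokens, and pads to a fixed length. It is not how the cuDF tokenizer is implemented (that one runs on the GPU against the hashed vocabulary); the toy vocabulary, its ids, and the helper names `wordpiece` and `encode` are assumptions made for this example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab, max_length):
    """Return (input_ids, attention_mask), both padded to max_length."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[: max_length - 1] + ["[SEP]"]  # truncate, then close
    input_ids = [vocab[t] for t in tokens]
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    return input_ids + [vocab["[PAD]"]] * pad, attention_mask + [0] * pad


# Toy vocabulary with made-up ids (real BERT ids differ).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "pass": 4, "##word": 5, "reset": 6}

ids, mask = encode("Password reset", toy_vocab, max_length=8)
print(ids)
print(mask)
```

The attention mask is 1 for real tokens and 0 for padding, so the model can ignore the padded positions.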

## Split into train and test sets

Create a PyTorch dataset, split it into training and validation subsets, and load both into PyTorch dataloaders.

In [9]:
# create dataset
dataset = TensorDataset(tokenizer_output["input_ids"].type(torch.long),tokenizer_output["attention_mask"], labels)

# use pytorch random_split to create training and validation data subsets
dataset_size = len(tokenizer_output["input_ids"])
train_size = int(dataset_size * .8) # 80/20 split
training_dataset, validation_dataset = random_split(dataset, (train_size, (dataset_size-train_size)))

# create dataloaders
train_dataloader = DataLoader(dataset=training_dataset, shuffle=True, batch_size=32)
val_dataloader = DataLoader(dataset=validation_dataset, shuffle=False, batch_size=64)
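
The split arithmetic can be sketched in plain Python as follows. `split_indices` is a hypothetical stand-in for `random_split`, and the dataset size of 2,000 is an inference: each per-label confusion matrix later in the notebook sums to 400 validation rows, which is consistent with a 2,000-row dataset, but the notebook does not state the size.

```python
import random


def split_indices(dataset_size, train_fraction=0.8, seed=0):
    """Shuffle row indices and cut them into training and validation parts."""
    indices = list(range(dataset_size))
    random.Random(seed).shuffle(indices)  # seeded for a repeatable split
    train_size = int(dataset_size * train_fraction)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(2000)  # 2000 is an assumed size
print(len(train_idx), len(val_idx))
```

The validation part takes whatever remains after the integer truncation, so the two sizes always sum to the dataset size. For a repeatable split in PyTorch itself, `random_split` accepts a `generator` argument, for example `torch.Generator().manual_seed(0)`.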

## Load pretrained model from huggingface repo or fine-tune a morpheus pretrained model

In [10]:
num_labels = len(label_names)

# load the following model for mini-bert from huggingface
# model = AutoModelForSequenceClassification.from_pretrained("google/bert_uncased_L-4_H-256_A-4", num_labels=num_labels)

model = torch.load('repo_model/sid-minibert-20211021.pth')

In [11]:
model.train()
model.cuda(); # move model to GPU

In [12]:
# find number of gpus
n_gpu = torch.cuda.device_count()

# use DataParallel if you have more than one GPU
if n_gpu > 1:
    model = torch.nn.DataParallel(model)

## Fine-tune model

In [13]:
# using hyperparameters recommended in the original BERT paper
# the optimizer allows us to apply different hyperparameters to specific parameter groups
# apply weight decay to all parameters other than bias, gamma, and beta
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
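# note: AdamW reads the per-group key 'weight_decay'; a key named
# 'weight_decay_rate' is not read and the optimizer default is used instead
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]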

optimizer = AdamW(optimizer_grouped_parameters,lr=2e-5)
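
The grouping above works by substring matching on parameter names, so it helps to check which names actually match. The sketch below applies the same filter to a handful of parameter names in the style used by HuggingFace BERT models; the list is a small hand-written sample (an assumption, not read from the model), and `split_by_decay` is a hypothetical helper.

```python
def split_by_decay(param_names, no_decay):
    """Partition names into (decayed, not_decayed) by substring match."""
    decayed, not_decayed = [], []
    for name in param_names:
        if any(nd in name for nd in no_decay):
            not_decayed.append(name)
        else:
            decayed.append(name)
    return decayed, not_decayed


# Hand-written sample of HuggingFace-style BERT parameter names.
names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.attention.output.LayerNorm.weight",
    "bert.encoder.layer.0.attention.output.LayerNorm.bias",
    "classifier.weight",
]

decayed, not_decayed = split_by_decay(names, ["bias", "gamma", "beta"])
print("decayed:", decayed)
print("not decayed:", not_decayed)

# Adding "LayerNorm.weight" also exempts the layer-norm scale parameters.
decayed_ln, _ = split_by_decay(names, ["bias", "gamma", "beta", "LayerNorm.weight"])
print("decayed with LayerNorm.weight exempted:", decayed_ln)
```

With names like these, `gamma` and `beta` match nothing, so the layer-norm scale (`LayerNorm.weight`) ends up in the decayed group. Printing the real names from `model.named_parameters()` is the reliable way to confirm how a particular checkpoint is grouped.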

In [14]:
# number of training epochs, keep low to avoid overfitting
epochs = 1

# train loop
for _ in trange(epochs, desc="Epoch"):
    # tracking variables
    tr_loss = 0  # running loss
    nb_tr_examples, nb_tr_steps = 0, 0

    # train the data for one epoch
    for batch in train_dataloader:
        # unpack the inputs from dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # clear out the gradients
        optimizer.zero_grad()

        # forward pass
        outputs = model(b_input_ids, attention_mask=b_input_mask)
        logits = outputs[0]
        
        # using binary cross-entropy with logits as loss function
        # assigns independent probabilities to each label
        loss_func = BCEWithLogitsLoss() 
        loss = loss_func(logits.view(-1,num_labels),b_labels.type_as(logits).view(-1,num_labels)) #convert labels to float for calculation 
        if n_gpu > 1:
            loss = loss.mean() # mean() to average on multi-gpu parallel training
        # backward pass
        loss.backward()
        
        # update parameters and take a step using the computed gradient
        optimizer.step()
        
        # update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    print("Train loss: {}".format(tr_loss/nb_tr_steps))

Epoch: 100%|██████████| 1/1 [00:24<00:00, 24.37s/it]

Train loss: 0.0006268636239110492
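
For intuition about the loss used in the training loop, here is a plain-Python sketch of binary cross-entropy with logits. Each label is treated as an independent yes/no decision: the logit is passed through a sigmoid and scored against its 0/1 target, and the per-element losses are averaged (the default reduction of `BCEWithLogitsLoss`). The function name `bce_with_logits` and the sample numbers are made up for this example; the formula is written in the usual numerically stable form.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (sample, label) element.

    Per element, with logit x and target y in {0, 1}:
        loss = max(x, 0) - x * y + log(1 + exp(-|x|))
    which equals -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    """
    total, count = 0.0, 0
    for row_x, row_y in zip(logits, targets):
        for x, y in zip(row_x, row_y):
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


# An undecided logit (0.0) costs log(2) regardless of the target.
print(bce_with_logits([[0.0]], [[1.0]]))

# Confident and correct predictions cost almost nothing.
print(bce_with_logits([[8.0, -8.0]], [[1.0, 0.0]]))
```

Because every label contributes its own term, a sample can be positive for several sensitive-information types at once, which a softmax-based loss would not allow.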





## Model evaluation

We evaluate the model on the 20% of the data held out in the validation set and report two metrics. The `F1 macro accuracy` is the F1 score (the harmonic mean of precision and recall) computed separately for each label and then averaged across the labels. The `flat accuracy` is the fraction of validation samples for which every label is predicted correctly, which is what scikit-learn's `accuracy_score` computes for multi-label input.

In [15]:
# switch the model to eval mode to generate predictions on the validation set
model.eval()

# variables to gather full output
logit_preds,true_labels,pred_labels = [],[],[]

# predict
for batch in val_dataloader:
    # unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    with torch.no_grad():
        # forward pass
        output = model(b_input_ids, attention_mask=b_input_mask)
        b_logit_pred = output[0]
        b_pred_label = torch.sigmoid(b_logit_pred)
        b_logit_pred = b_logit_pred.detach().cpu().numpy()
        b_pred_label = b_pred_label.detach().cpu().numpy()
        b_labels = b_labels.detach().cpu().numpy()
    
    logit_preds.extend(b_logit_pred)
    true_labels.extend(b_labels)
    pred_labels.extend(b_pred_label)

# calculate accuracy, using 0.50 threshold
threshold = 0.50
pred_bools = [pl>threshold for pl in pred_labels]
true_bools = [tl==1 for tl in true_labels]
val_f1_accuracy = f1_score(true_bools,pred_bools,average='macro')*100
val_flat_accuracy = accuracy_score(true_bools, pred_bools)*100

print('F1 Macro Validation Accuracy: ', val_f1_accuracy)
print('Flat Validation Accuracy: ', val_flat_accuracy)

F1 Macro Validation Accuracy:  99.87012987012986
Flat Validation Accuracy:  99.75
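
The two metrics can be reproduced by hand. The sketch below is a plain-Python version of macro F1 and exact-match accuracy for multi-label predictions; the helper names are hypothetical and the four sample rows are made up, while the notebook itself relies on scikit-learn's `f1_score` and `accuracy_score`.

```python
def f1_for_label(true_col, pred_col):
    """F1 = 2*TP / (2*TP + FP + FN) for a single label column."""
    tp = sum(1 for t, p in zip(true_col, pred_col) if t and p)
    fp = sum(1 for t, p in zip(true_col, pred_col) if not t and p)
    fn = sum(1 for t, p in zip(true_col, pred_col) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_rows, pred_rows):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(true_rows[0])
    scores = [
        f1_for_label([r[j] for r in true_rows], [r[j] for r in pred_rows])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels


def subset_accuracy(true_rows, pred_rows):
    """Fraction of rows whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(true_rows, pred_rows) if t == p)
    return exact / len(true_rows)


true_rows = [[1, 0], [0, 1], [1, 1], [0, 0]]
pred_rows = [[1, 0], [0, 1], [1, 0], [0, 0]]  # one missed label in row 3

print(macro_f1(true_rows, pred_rows))
print(subset_accuracy(true_rows, pred_rows))
```

The same arithmetic accounts for the numbers printed above: the per-label confusion matrices show a single false positive for `si_name` (38 true positives, 1 false positive), giving an F1 of 76/77 for that label and 1.0 for the other nine, so the macro average is (9 + 76/77) / 10, or about 99.87%. That one row is also the only validation sample not matched exactly, giving 399/400 = 99.75%.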


In [16]:
# confusion matrix for each label

for label, cf in zip(label_names, multilabel_confusion_matrix(true_bools, pred_bools)):
    print(label)
    print(cf)

si_address
[[370   0]
 [  0  30]]
si_bank_acct
[[354   0]
 [  0  46]]
si_credit_card
[[357   0]
 [  0  43]]
si_email
[[362   0]
 [  0  38]]
si_govt_id
[[361   0]
 [  0  39]]
si_name
[[361   1]
 [  0  38]]
si_password
[[357   0]
 [  0  43]]
si_phone_num
[[355   0]
 [  0  45]]
si_secret_keys
[[365   0]
 [  0  35]]
si_user
[[365   0]
 [  0  35]]


## Save model file

In [17]:
if torch.cuda.device_count() > 1:
    model = model.module

# torch.save(model, output_file)   

## Conclusion

Using a pretrained BERT model (`mini-bert`) from the HuggingFace repo or the Morpheus repo and custom training for multi-label classification, we are able to train a sensitive information detector from our PCAP labeled training dataset.