# PII Data Detection

A first approach to the [PII Data Detection competition](https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/overview) hosted on Kaggle.

The goal of this competition is to develop a model that detects personally identifiable information (PII) in student writing/essays.

This notebook focuses on developing a simple model trained on the competition dataset, which contains approximately 22,000 essays. The model should assign labels to the following seven types of PII:

| Label         | Description                                                                                                                                        |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| NAME_STUDENT  | The full or partial name of a student that is not necessarily the author of the essay. This excludes instructors, authors, and other person names. | 
| EMAIL         | A student’s email address.                                                                                                                         |
| USERNAME      | A student's username on any platform.                                                                                                              |
| ID_NUM        | A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.                     |
| PHONE_NUM     | A phone number associated with a student.                                                                                                          |
| URL_PERSONAL  | A URL that might be used to identify a student.                                                                                                    |
| STREET_ADDRESS | A full or partial street address that is associated with the student, such as their home address.                                                  |

## Metadata

The competition dataset contains the original documents along with their corresponding tokens, which were generated using the spaCy English tokeniser. Each token has a label in BIO format: the first token of a PII entity is labelled with the prefix "B-", the following tokens of the same entity are labelled with the prefix "I-", and non-PII tokens are labelled "O". The fields in the JSON data are detailed in the table below:

| Field               | Description                                                            |
|---------------------|------------------------------------------------------------------------|
| document            | Integer ID of the essay                                                | 
| full_text           | UTF-8 representation of the essay.                                     |
| tokens              | String representation of each token.                                   |
| trailing_whitespace | Boolean value indicating whether each token is followed by whitespace. |
| labels              | Token label in BIO format.                                             |

Download the competition dataset [here](https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/data?select=train.json).
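
As a minimal sketch of how the `tokens`, `trailing_whitespace` and `labels` fields described above fit together, the snippet below rebuilds a text fragment from its tokens and groups BIO tags into entities. The helper names `rebuild_text` and `bio_to_entities` are hypothetical, and the `trailing_whitespace` values in the example are made up for illustration rather than taken from the dataset.

```python
def rebuild_text(tokens, trailing_whitespace):
    """Join tokens, adding a space after each token flagged as followed by whitespace."""
    return "".join(tok + (" " if ws else "") for tok, ws in zip(tokens, trailing_whitespace))


def bio_to_entities(tokens, labels):
    """Group B-/I- tagged tokens into (entity_type, [tokens]) tuples."""
    entities = []
    current = None  # (entity_type, token_list) currently being built
    for tok, lab in zip(tokens, labels):
        prefix, _, ent_type = lab.partition("-")
        if prefix == "I" and current is not None and current[0] == ent_type:
            current[1].append(tok)  # continuation of the open entity
            continue
        if current is not None:
            entities.append(current)  # close the open entity
            current = None
        if prefix in ("B", "I"):  # a stray I- tag also opens an entity
            current = (ent_type, [tok])
    if current is not None:
        entities.append(current)
    return entities


# Illustrative fragment; the whitespace flags are assumed, not read from train.json
tokens = ["-", "Nathalie", "Sylla", "\n\n", "Challenge"]
labels = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O"]
trailing_whitespace = [True, True, False, False, True]

text = rebuild_text(tokens, trailing_whitespace)
entities = bio_to_entities(tokens, labels)
print(repr(text))
print(entities)
```

Grouping by entity rather than by token makes it easier to eyeball whether a predicted name or address is complete.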

## Approach

State-of-the-art methods for named entity recognition use large pretrained transformer-based language models such as DeBERTa. I decided instead to use a much smaller LSTM model, prioritising lower memory usage. A tokenizer was trained from scratch with the byte pair encoding (BPE) algorithm, in the hope of capturing the benefits of subword tokenisation. A pretrained tokenizer was not used so that the vocabulary would stay small and, hopefully, relevant to the essays used in the competition.
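
To make the idea behind byte pair encoding concrete, here is a toy sketch of its merge loop: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. It is a character-level illustration on a made-up four-word corpus, not the byte-level implementation inside the `tokenizers` library; `count_pairs` and `merge_pair` are hypothetical helper names.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: word (as a tuple of characters) -> frequency (made-up numbers)
words = {
    tuple("low"): 5,
    tuple("lower"): 2,
    tuple("newest"): 6,
    tuple("widest"): 3,
}

merges = []
for _ in range(3):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # ties are resolved by first occurrence
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```

Each merge adds one entry to the vocabulary, so the target vocabulary size effectively sets how many merges are learned.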

In [1]:
# Importing packages
import json
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

from tokenizers import ByteLevelBPETokenizer
from sklearn.metrics import f1_score

In [2]:
# Load the competition training data (the train/test split is created further down)
with open("./PII_Data/train.json", 'r', encoding="utf-8") as f:
    json_data = json.load(f)

In [3]:
# Take a look at some tokens and corresponding labels
for i in range(1):
    print(len(json_data[i]["tokens"]))
    print(json_data[i]["tokens"][:50])
    print(json_data[i]["labels"][:50])


753
['Design', 'Thinking', 'for', 'innovation', 'reflexion', '-', 'Avril', '2021', '-', 'Nathalie', 'Sylla', '\n\n', 'Challenge', '&', 'selection', '\n\n', 'The', 'tool', 'I', 'use', 'to', 'help', 'all', 'stakeholders', 'finding', 'their', 'way', 'through', 'the', 'complexity', 'of', 'a', 'project', 'is', 'the', ' ', 'mind', 'map', '.', '\n\n', 'What', 'exactly', 'is', 'a', 'mind', 'map', '?', 'According', 'to', 'the']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUDENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [4]:
# Create train and test splits from data
# random.shuffle(json_data) # shuffle data for randomised train/test split
train_ratio = 0.8
n_train_samples = int(len(json_data) * train_ratio)
train_data = json_data[:n_train_samples]
test_data = json_data[n_train_samples:]

del json_data

# Get raw text for training tokenizer
train_docs = [sample["full_text"] for sample in train_data]


In [5]:
# Train BPE tokenizer
VOCAB_SIZE = 1000
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train_from_iterator(train_docs, vocab_size=VOCAB_SIZE, min_frequency=2)

# Add special tokens
special_tokens = ["<PAD>", "<STARTOFTEXT>", "<ENDOFTEXT>"]
bpe_tokenizer.add_special_tokens(special_tokens)

enc = bpe_tokenizer.encode("This is a test sentence. Hi...")
vocab = bpe_tokenizer.get_vocab()
word_boundary_tokens = [token for token in vocab if token.startswith('Ġ')]
print(enc.ids)
print(enc.tokens)
print(len(word_boundary_tokens), word_boundary_tokens[:50]) # Ġ represents the beginning of a word

# Pad token
pad_enc = bpe_tokenizer.encode("<PAD>").ids[0]

[329, 279, 330, 257, 773, 265, 295, 511, 13, 722, 72, 13, 13, 13]
['Th', 'is', 'Ġis', 'Ġa', 'Ġtest', 'Ġs', 'ent', 'ence', '.', 'ĠH', 'i', '.', '.', '.']
381 ['Ġour', 'Ġle', 'Ġtest', 'Ġhas', 'Ġma', 'ĠSt', 'ĠThis', 'Ġsit', 'Ġident', 'Ġal', 'Ġrequire', 'Ġabout', 'ĠâĢ', 'Ġlearn', 'Ġem', 'Ġa', 'Ġdis', 'Ġpo', 'Ġ1', 'Ġj', 'Ġlot', 'Ġwill', 'Ġused', 'Ġstarted', 'Ġfoc', 'Ġworking', 'Ġknow', 'Ġonly', 'Ġcl', 'Ġtr', 'Ġthat', 'Ġ2', 'Ġdifferent', 'ĠW', 'Ġbeen', 'Ġch', 'Ġdo', 'ĠD', 'Ġup', 'Ġob', 'Ġmore', 'Ġat', 'Ġneed', 'Ġtechn', 'Ġwhen', 'Ġbec', 'Ġli', 'Ġcons', 'Ġwho', 'Ġbe']


In [6]:
# Define the relevant labels within the data
targets = [
    'B-EMAIL', 'B-ID_NUM', 'B-NAME_STUDENT', 'B-PHONE_NUM', 
    'B-STREET_ADDRESS', 'B-URL_PERSONAL', 'B-USERNAME', 'I-ID_NUM', 
    'I-NAME_STUDENT', 'I-PHONE_NUM', 'I-STREET_ADDRESS', 'I-URL_PERSONAL'
]
label_set = targets + ["O"] + ["<PAD>"]
label2id = {l: i for i,l in enumerate(label_set)}
id2label = {v:k for k,v in label2id.items()}
print(id2label)

{0: 'B-EMAIL', 1: 'B-ID_NUM', 2: 'B-NAME_STUDENT', 3: 'B-PHONE_NUM', 4: 'B-STREET_ADDRESS', 5: 'B-URL_PERSONAL', 6: 'B-USERNAME', 7: 'I-ID_NUM', 8: 'I-NAME_STUDENT', 9: 'I-PHONE_NUM', 10: 'I-STREET_ADDRESS', 11: 'I-URL_PERSONAL', 12: 'O', 13: '<PAD>'}


In [7]:
def relabel_tokens(tokenizer, document: dict, debug=False):
    """ Re-tokenise one document with the new vocab, carrying each original label over to its sub-tokens.

    Returns a list of (sub-token, label) pairs and a map from each sub-token to the
    index of the original token. With debug=True, token strings and label names are
    returned instead of ids.
    """
    prev_ws = False
    new_tokens = []
    new_labels = []
    token_map = []
    for idx, (t, l, ws) in enumerate(zip(document["tokens"], document["labels"], document["trailing_whitespace"])):
        # Prepend a space if the previous token was followed by whitespace
        curr = " " + t if prev_ws else t
        # Keep track of whether the next token needs a leading space
        prev_ws = bool(ws)
        # Create new tokens/labels; every sub-token inherits the label of its
        # original token, so a B- tag is repeated across all of its sub-tokens
        enc = tokenizer.encode(curr)
        if debug:
            new_tokens += enc.tokens
            new_labels += [l] * len(enc.tokens)
        else:
            new_tokens += enc.ids
            new_labels += [label2id[l]] * len(enc.tokens)
        # Map new tokens to original token
        token_map.extend([idx] * len(enc.tokens))

    return list(zip(new_tokens, new_labels)), token_map

labelled_tokens, token_map = relabel_tokens(bpe_tokenizer, train_data[0], debug=True)
print(len(labelled_tokens), len(token_map))
print(labelled_tokens[20:70])


1346 1346
[('0', 'O'), ('2', 'O'), ('1', 'O'), ('-', 'O'), ('N', 'B-NAME_STUDENT'), ('at', 'B-NAME_STUDENT'), ('h', 'B-NAME_STUDENT'), ('al', 'B-NAME_STUDENT'), ('ie', 'B-NAME_STUDENT'), ('ĠS', 'I-NAME_STUDENT'), ('y', 'I-NAME_STUDENT'), ('ll', 'I-NAME_STUDENT'), ('a', 'I-NAME_STUDENT'), ('ĊĊ', 'O'), ('Challenge', 'O'), ('Ġ&', 'O'), ('Ġse', 'O'), ('lection', 'O'), ('ĊĊ', 'O'), ('The', 'O'), ('Ġtool', 'O'), ('ĠI', 'O'), ('Ġuse', 'O'), ('Ġto', 'O'), ('Ġhelp', 'O'), ('Ġall', 'O'), ('Ġstakeholders', 'O'), ('Ġfind', 'O'), ('ing', 'O'), ('Ġtheir', 'O'), ('Ġway', 'O'), ('Ġthrough', 'O'), ('Ġthe', 'O'), ('Ġcomple', 'O'), ('x', 'O'), ('ity', 'O'), ('Ġof', 'O'), ('Ġa', 'O'), ('Ġproject', 'O'), ('Ġis', 'O'), ('Ġthe', 'O'), ('ĠĠ', 'O'), ('m', 'O'), ('ind', 'O'), ('Ġmap', 'O'), ('.', 'O'), ('ĊĊ', 'O'), ('W', 'O'), ('hat', 'O'), ('Ġex', 'O')]


In [8]:
def map_predictions(labelled_tokens: list[tuple], token_map: list, original_tokens: list):
    """ Map NER predictions back to original tokens. """
    prev_idx = -1
    predictions = []
    for item, idx in zip(labelled_tokens, token_map):
        if idx != prev_idx:
            token, label = item
            predictions.append((original_tokens[idx], label))
            prev_idx = idx
    return predictions

predictions = map_predictions(labelled_tokens, token_map, train_data[0]["tokens"])
print(predictions == list(zip(train_data[0]["tokens"], train_data[0]["labels"])))
print(len(predictions), predictions[:20])
del predictions

True
753 [('Design', 'O'), ('Thinking', 'O'), ('for', 'O'), ('innovation', 'O'), ('reflexion', 'O'), ('-', 'O'), ('Avril', 'O'), ('2021', 'O'), ('-', 'O'), ('Nathalie', 'B-NAME_STUDENT'), ('Sylla', 'I-NAME_STUDENT'), ('\n\n', 'O'), ('Challenge', 'O'), ('&', 'O'), ('selection', 'O'), ('\n\n', 'O'), ('The', 'O'), ('tool', 'O'), ('I', 'O'), ('use', 'O')]


We now have a way to break the word-level tokens into smaller subword tokens, which removes the need for "unk" tokens when previously unseen words appear during inference. Next, we need to split each document into shorter sequences for training, ideally keeping 'I' tags in the same sequence as their corresponding 'B' tags.

In [9]:
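def gen_sequences(labelled_tokens, context_length, padding):
    """ Split a document's (token, label) pairs into padded sequences of length context_length. """
    train_x = []
    train_y = []
    pad_id = bpe_tokenizer.encode("<PAD>").ids[0]
    pad_label = label2id["<PAD>"]
    # Note: each window starts at a multiple of context_length, so when a sequence
    # stops early the tokens up to the next window start are not included.
    for i in range(0, len(labelled_tokens), context_length):
        label = ""
        sequence = []
        # Fill at least context_length - padding items, then try to end on an 'O'
        # label to avoid separating B and I tags. The check compares against the
        # string 'O', so it only triggers for string labels (debug=True); with
        # integer label ids the sequence always runs to context_length.
        while (len(sequence) < context_length - padding or label != 'O') and len(sequence) + i < len(labelled_tokens):
            idx = len(sequence) + i
            token, label = labelled_tokens[idx]
            sequence.append(labelled_tokens[idx])
            if len(sequence) >= context_length: # give up at this point
                break

        # Pad the sequence up to context_length
        sequence.extend([(pad_id, pad_label)] * (context_length - len(sequence)))
        tokens, labels = zip(*sequence)
        train_x.append(tokens)
        train_y.append(labels)
    return train_x, train_y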

# Now testing this function
CONTEXT_LENGTH = 128
PADDING = 28
x, y = gen_sequences(labelled_tokens, CONTEXT_LENGTH, PADDING)
print(x[5][-40:])

('ions', ',', 'Ġwe', 'Ġcan', 'Ġuse', ':', 'Ġwho', ',', 'Ġwhat', ',', 'ĠĠ', 'w', 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000)
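
One way to chunk a document so that every chunk starts exactly where the previous one ended, and no token is skipped, is sketched below. It is an illustrative alternative under stated assumptions, not the `gen_sequences` function used in this notebook: `chunk_pairs`, `slack`, `outside_label` and `pad_pair` are hypothetical names, and the outside label is passed in explicitly so the same code works with string labels or integer label ids.

```python
def chunk_pairs(pairs, context_length, slack, outside_label, pad_pair):
    """Split (token, label) pairs into padded chunks of length context_length.

    A chunk takes context_length - slack items, then keeps extending until it
    ends on `outside_label` or reaches the hard limit of context_length items.
    """
    assert 0 <= slack < context_length
    chunks = []
    start = 0
    while start < len(pairs):
        end = min(start + context_length - slack, len(pairs))
        # Extend past the soft limit so an entity is not cut in the middle
        while (end < len(pairs)
               and end - start < context_length
               and pairs[end - 1][1] != outside_label):
            end += 1
        chunk = list(pairs[start:end])
        chunk += [pad_pair] * (context_length - len(chunk))
        chunks.append(chunk)
        start = end  # next chunk begins where this one stopped
    return chunks


# Toy document: 12 tokens with one three-token entity at positions 6-8
pairs = ([(i, "O") for i in range(6)]
         + [(6, "B-NAME_STUDENT"), (7, "I-NAME_STUDENT"), (8, "I-NAME_STUDENT")]
         + [(i, "O") for i in range(9, 12)])
PAD = ("<PAD>", "<PAD>")

chunks = chunk_pairs(pairs, context_length=10, slack=3, outside_label="O", pad_pair=PAD)
for chunk in chunks:
    print([tok for tok, _ in chunk])
```

With a soft limit of 7 items, the first chunk would otherwise end on the `B-` tag at position 6, so it is extended until it closes on an "O" token.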


In [10]:
# Now generate training data
CONTEXT_LENGTH = 128
PADDING = 28
train_x = []
train_y = []
for document in train_data:
    labelled_tokens, token_map = relabel_tokens(bpe_tokenizer, document)
    tokens, labels = gen_sequences(labelled_tokens, CONTEXT_LENGTH, PADDING)
    train_x += tokens
    train_y += labels

In [11]:
print(len(train_x), len(train_x[100]))
print(len(train_y), len(train_y[100]))
print(train_x[10])

train_x_tensor = torch.tensor(train_x)
train_y_tensor = torch.tensor(train_y)
dataset = TensorDataset(train_x_tensor, train_y_tensor)
BATCH_SIZE = 32
train_data_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

58109 128
58109 128
(701, 260, 445, 990, 304, 289, 831, 533, 600, 420, 13, 661, 35, 436, 346, 258, 396, 333, 285, 879, 304, 307, 69, 288, 87, 282, 12, 32, 85, 81, 415, 799, 15, 17, 16, 12, 45, 264, 71, 281, 471, 421, 88, 308, 64, 661, 32, 77, 703, 87, 822, 25, 465, 407, 465, 708, 421, 71, 474, 67, 284, 459, 415, 651, 565, 661, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000)


In [21]:
# Define Recurrent Neural Network 
class Net(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, label_set):
        super(Net, self).__init__()
        self.token_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=1, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2, len(label_set))

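    def forward(self, input):
        embeds = self.token_embeddings(input)   # (batch, seq, embedding_dim)
        lstm_out, _ = self.lstm(embeds)         # (batch, seq, 2 * hidden_dim)
        fc_out = self.fc(lstm_out)              # (batch, seq, n_labels)
        # Normalise over the label axis (last dim). With dim=1 the softmax runs
        # across the sequence positions instead, and the loss stalls near
        # ln(128) ≈ 4.85, which is the behaviour seen in the training log below.
        return F.log_softmax(fc_out, dim=-1)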
    
class CustomLoss(nn.Module):
    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, outputs, targets):
        # Reshape targets and outputs
        targets = targets.view(-1)
        outputs = outputs.view(-1, len(label_set))
        # Mask out padding tokens
        mask = (targets != label2id["<PAD>"]).float()
        num_tokens = int(torch.sum(mask))
        outputs = outputs[range(outputs.shape[0]), targets] * mask
        # Calculate and return loss
        return -torch.sum(outputs)/num_tokens
    
# Train Loop
def train(model, optimizer, loss_function, train_data_loader, epochs):
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for tokens, targets in train_data_loader:
            optimizer.zero_grad()
            outputs = model(tokens)
            loss = loss_function(outputs, targets)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f'Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_data_loader)}')
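
As a cross-check, the masked negative log-likelihood computed by `CustomLoss` should match PyTorch's built-in `nn.NLLLoss` with `ignore_index` set to the padding label. The sketch below compares the two on random tensors; the label ids (13 for `<PAD>`, 14 labels in total) mirror the `label2id` mapping printed earlier, while the logits and targets are made up for illustration.

```python
import torch
import torch.nn as nn

PAD_ID = 13       # index of "<PAD>" in the label mapping printed earlier
NUM_LABELS = 14   # 12 PII tags + "O" + "<PAD>"

torch.manual_seed(0)
logits = torch.randn(2, 5, NUM_LABELS)           # (batch, seq, labels), random values
targets = torch.tensor([[12, 2, 8, 13, 13],
                        [12, 12, 12, 12, 13]])   # made-up label ids with padding

# Log-probabilities normalised over the label axis
log_probs = torch.log_softmax(logits, dim=-1)
flat_lp = log_probs.reshape(-1, NUM_LABELS)
flat_t = targets.reshape(-1)

# Manual masked loss: pick the log-probability of each target, ignore padding
mask = (flat_t != PAD_ID).float()
picked = flat_lp[torch.arange(flat_t.numel()), flat_t]
manual = -(picked * mask).sum() / mask.sum()

# Built-in equivalent: mean NLL over the non-ignored targets
builtin = nn.NLLLoss(ignore_index=PAD_ID)(flat_lp, flat_t)

print(manual.item(), builtin.item())
```

Using the built-in loss removes the need for a custom module, at the cost of hiding the masking step that the hand-written version makes explicit.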

In [22]:
# Hyperparameters
vocab_size = bpe_tokenizer.get_vocab_size()
embedding_dim = 100
hidden_dim = 64
epochs = 10

# Initialise and train model
model = Net(vocab_size, embedding_dim, hidden_dim, label_set)
loss_fn = CustomLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
train(model, optimizer, loss_fn, train_data_loader, epochs)

Epoch 1/10, Loss: 4.822322313480965
Epoch 2/10, Loss: 4.821358713547038
Epoch 3/10, Loss: 4.821168584708075
Epoch 4/10, Loss: 4.820999118987684
Epoch 5/10, Loss: 4.820978017630556
Epoch 6/10, Loss: 4.820925838621703
Epoch 7/10, Loss: 4.820893440477649
Epoch 8/10, Loss: 4.820859763328199
Epoch 9/10, Loss: 4.820866241854193
Epoch 10/10, Loss: 4.8207827630547175
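
A loss that barely moves from about 4.82 is worth comparing against two reference values. The pure-Python sketch below uses made-up all-zero logits to show the negative log-likelihood of an uninformative model when the softmax is normalised over the 14 labels versus over the 128 sequence positions. The helper `log_softmax` is written out by hand here so the arithmetic is visible.

```python
import math

SEQ_LEN = 128     # context length used in this notebook
NUM_LABELS = 14   # size of label_set (12 PII tags + "O" + "<PAD>")


def log_softmax(values):
    """Numerically stable log-softmax of a list of floats."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]


# One sequence of uninformative (all-zero) logits: shape (SEQ_LEN, NUM_LABELS)
logits = [[0.0] * NUM_LABELS for _ in range(SEQ_LEN)]

# Normalising over the label axis: one distribution per sequence position
over_labels = [log_softmax(row) for row in logits]

# Normalising over the sequence axis: one distribution per label column
columns = [log_softmax([logits[t][k] for t in range(SEQ_LEN)])
           for k in range(NUM_LABELS)]
over_sequence = [[columns[k][t] for k in range(NUM_LABELS)]
                 for t in range(SEQ_LEN)]

nll_labels = -over_labels[0][0]
nll_sequence = -over_sequence[0][0]
print(round(nll_labels, 3), round(nll_sequence, 3))  # prints 2.639 4.852
```

An untrained tagger normalised over labels would therefore start near ln(14) ≈ 2.64, whereas a value close to ln(128) ≈ 4.85 points to normalisation over the sequence axis.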


In [23]:
# Create test data
test_x = []
test_y = []
for document in test_data:
    labelled_tokens, token_map = relabel_tokens(bpe_tokenizer, document)
    tokens, labels = gen_sequences(labelled_tokens, CONTEXT_LENGTH, PADDING)
    test_x += tokens
    test_y += labels

test_x_tensor = torch.tensor(test_x)

In [24]:
# Make predictions
model.eval()
with torch.no_grad():
    predictions = []
    for inputs in test_x_tensor:
        outputs = model(inputs.unsqueeze(0))        # add a batch dimension
        predicted_labels = outputs.argmax(dim=2)    # most likely label per token
        predictions.append(predicted_labels.squeeze(0).tolist())

print(len(predictions))


13775


In [25]:
# Evaluate predictions on the held-out split
# Flatten the predictions and ground truth labels, excluding padding tokens
flat_predictions = []
flat_labels = []

for pred, true in zip(predictions, test_y):
    for p, t in zip(pred, true):
        if t != label2id["<PAD>"]:  # Exclude padding tokens
            flat_predictions.append(p)
            flat_labels.append(t)

# Micro-averaged F1 over all labels, including 'O' (equivalent to token accuracy here)
f1 = f1_score(flat_labels, flat_predictions, average='micro')

print(f'F1-score: {f1}')

F1-score: 0.16165690460354581
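
Because a micro-averaged F1 over every label, "O" included, equals plain token accuracy, it can be informative to score only the PII labels. The pure-Python sketch below computes a micro-averaged F-beta restricted to a chosen set of labels. The function name `micro_fbeta` and the toy label lists are hypothetical, and `beta` is left as a parameter rather than fixed to any particular value.

```python
def micro_fbeta(y_true, y_pred, positive_labels, beta=1.0):
    """Micro-averaged F-beta counting only the labels in `positive_labels`."""
    positive = set(positive_labels)
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p in positive and p == t:
            tp += 1            # correct PII prediction
        else:
            if p in positive:
                fp += 1        # predicted a PII label that is wrong
            if t in positive:
                fn += 1        # missed (or mislabelled) a true PII token
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


O_ID = 12                      # id of the 'O' label in the mapping printed earlier
PII_IDS = range(12)            # ids 0..11 are the B-/I- PII labels

# Made-up example: one name token found, one missed, one false alarm
y_true = [12, 12, 2, 8, 12, 12]
y_pred = [12, 12, 2, 12, 12, 0]
all_o = [O_ID] * len(y_true)   # baseline that never predicts PII

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
baseline_accuracy = sum(t == p for t, p in zip(y_true, all_o)) / len(y_true)

print(accuracy, micro_fbeta(y_true, y_pred, PII_IDS))
print(baseline_accuracy, micro_fbeta(y_true, all_o, PII_IDS))
```

In this toy case the all-"O" baseline matches the model on accuracy but scores zero once only PII labels count, which is the distinction an accuracy-style metric hides.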


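The model reached a micro-averaged F1 score of 0.162 on the held-out 20% of the data. The tokenisation technique worked as intended: every test token could be encoded, so there was no issue with previously unseen tokens. The score itself should be read with care. It is computed over every non-padding subword token, including the "O" class, which makes it equivalent to token accuracy, and because most tokens in these essays are not PII, a model that predicted "O" everywhere would score far higher than 0.16. The training loss also stays at roughly 4.82 for all ten epochs, close to ln(128) ≈ 4.85, which is the value expected when the log-softmax is normalised over the 128 sequence positions rather than over the labels. These results therefore do not yet show that the model can identify personal information in academic essays. Re-running the training with the softmax taken over the label axis and scoring only the PII labels are the natural next steps, before moving to a more complex model architecture.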