# PII Detection
## Authors
Victor Quagraine, Hutton Amison-Addy, Delali Nsiah Asare

## Problem
Personally Identifiable Information (PII) is any detail that can be used to identify an individual. Our task was to create a model capable of identifying PII in any given document.

## Approach
We treated this as an NLP problem, since the dataset consists of text in which the meaning of each token depends on its context. We selected a Bidirectional Long Short-Term Memory (BiLSTM) model so that each word is interpreted in relation to the words that come before and after it in the document.


# Dependency Modules

In [None]:
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from IPython.display import display, clear_output
import matplotlib.pyplot as plt

# Loading Data

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')
# import os
# for dirname, _, filenames in os.walk('/kaggle/piidata'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))


# Data Preprocessing
Our dataset was provided in JSON format, containing document ids, the documents and their token labels.

In [None]:
# train_df =  pd.read_json("/content/drive/MyDrive/Colab Notebooks/Datasets/pii/train.json")#loading the data
train_df =  pd.read_json("/kaggle/input/piidata/train.json")#loading the data


The dataset is split into a training/validation set and a held-out set that is used for testing after training.

In [None]:
train_df, testing_df = train_test_split(train_df, test_size=0.2, random_state=42)#training set, held out set


In [None]:
train_df.head()# first five items in the training set

Our list_unpack function unpacks the tokens from each document to form a larger dataset with one row per token. It returns the row id, the document number, the token id, the token and its label.

In [None]:
def list_unpack(df: pd.DataFrame) -> pd.DataFrame:
    row_ids=[]
    token=[]
    labels=[]
    document=[]
    token_id=[]
    row_id = 0

    for i in range(0,len(df['tokens'])):
        document += [df['document'].iloc[i]]*len(df['tokens'].iloc[i])
        for j in range(0,len(df['tokens'].iloc[i])):
            token_id.append(j)
            row_ids.append(row_id)
            row_id+=1
        token += df['tokens'].iloc[i]
        labels += df['labels'].iloc[i]

    temp_dict = {
        "row_id": row_ids,
        "document": document,
        "token_id": token_id,
        "token": token,
        "label": labels
    }

    return pd.DataFrame(temp_dict)
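
The same one-row-per-token layout can also be sketched with pandas `explode`. This is only an illustration on a made-up two-document frame, not the notebook's method; the column names mirror the ones used by list_unpack, the label name is hypothetical, and exploding several columns at once assumes pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical two-document frame mirroring the columns used by list_unpack.
docs = pd.DataFrame({
    "document": [7, 9],
    "tokens": [["My", "name", "is", "Ama"], ["Hello", "world"]],
    "labels": [["O", "O", "O", "B-NAME"], ["O", "O"]],
})

# explode() on both list columns yields one row per (token, label) pair.
flat = (
    docs.explode(["tokens", "labels"])
        .rename(columns={"tokens": "token", "labels": "label"})
        .reset_index(drop=True)
)

# Position of each token inside its own document (assumes unique document ids).
flat["token_id"] = flat.groupby("document").cumcount()
flat.insert(0, "row_id", range(len(flat)))
flat = flat[["row_id", "document", "token_id", "token", "label"]]
print(flat)
```

The explicit loop in list_unpack produces the same columns; the sketch is mainly useful as a cross-check on a small sample.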

Unpacking the training dataset tokens

In [None]:
train_data=list_unpack(train_df)

In [None]:
train_data.head()

# Splitting the Dataset into Features and Labels
X holds the tokens used as input to the model, whilst y holds the labels, which are the expected outputs.

In [None]:
X = train_data['token']#tokens
y = train_data['label']#labels
labels= train_data['label'].unique()#unique labels for encoding

In [None]:
X,y

## Creating the encoder
The labels are encoded to be used during the training of the model.

In [None]:
from sklearn.preprocessing import LabelEncoder


# Label Encoding: each unique label is mapped to an integer id
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
print("Label Encoded Labels:", encoded_labels)
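
As a small illustration of what the encoder does, the sketch below fits a `LabelEncoder` on three made-up label names (the real set comes from the dataset), encodes a batch in one call and maps the ids back to names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative label names; the real set comes from train_data['label'].unique().
example_labels = np.array(["O", "B-NAME", "I-NAME"])

encoder = LabelEncoder()
encoder.fit(example_labels)

# classes_ is sorted, so the integer id of a label is its position in this array.
print(encoder.classes_)

# Encode a whole batch of labels in one call, then map the ids back to names.
ids = encoder.transform(["O", "I-NAME", "O"])
names = encoder.inverse_transform(ids)
print(ids, names)
```

Calling `transform` once per batch, rather than once per label, avoids a lot of per-call overhead inside a training loop.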


# Create Train-Validation Split
After preprocessing, the training set is split into a train set and a validation set. The validation set is used to check how well the model generalises after each epoch.

The set is split 80-20, train to validation respectively.
The rows are shuffled before splitting so that neither set is biased by the original order of the tokens.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
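
The split above works at token level, so tokens from one document can appear in both sets. One alternative, shown here only as a hedged sketch and not as what this notebook does, is to split by document with scikit-learn's `GroupShuffleSplit`. The token table below is made up and only mirrors the columns of train_data.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical token table with the same columns as train_data.
tokens = pd.DataFrame({
    "document": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "token": list("abcdefghij"),
    "label": ["O"] * 10,
})

# Split by document id so every token of a document lands on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(tokens, groups=tokens["document"]))

train_part = tokens.iloc[train_idx]
val_part = tokens.iloc[val_idx]
print(sorted(val_part["document"].unique()))
```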

# DataLoader
To load the data into the model, we created our own dataset class based on the PyTorch Dataset class.

In [None]:
class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]



Using the PyTorch DataLoader, we prepared both the training data and the validation data as objects that can be used to train and validate the model.

In [None]:
#Training Dataset
train_dataset = CustomDataset(X_train.values, y_train.values)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)

#Validation Dataset
validation_dataset = CustomDataset(X_val.values, y_val.values)
validation_loader = DataLoader(validation_dataset, batch_size=2048, shuffle=True)
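
Because the tokens and labels are still strings at this point, the DataLoader does not stack them into tensors. The sketch below uses a hypothetical stand-in for CustomDataset and made-up tokens to show what one batch looks like before the tokens are converted to indices.

```python
from torch.utils.data import DataLoader, Dataset

class PairDataset(Dataset):
    """Hypothetical stand-in for CustomDataset holding (token, label) string pairs."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

loader = DataLoader(
    PairDataset(["John", "lives", "in", "Accra"], ["B-NAME", "O", "O", "O"]),
    batch_size=2,
    shuffle=False,
)

# String fields are left as sequences of strings, one sequence per field.
batches = [(list(tokens), list(labels)) for tokens, labels in loader]
print(batches[0])
```

This is why the training loop has to map tokens to vocabulary indices and labels to integer ids itself before calling the model.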


# Definition of BiLSTM Model
The Bidirectional LSTM processes the input in both the forward and the reverse direction. This allows the model to capture the context of each word from the words that come before it and the words that come after it.
##### embed
PyTorch provides an embedding layer, nn.Embedding, which maps each token index in the input to a learned word-embedding vector.

In [None]:
class BiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, num_classes):
        super(BiLSTM, self).__init__()  # Initialises the nn.Module base class
        self.embedding = nn.Embedding(vocab_size, embed_size)  # Maps token indices to dense embedding vectors
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, bidirectional=True, batch_first=True)  # LSTM that traverses the input in both the forward and backward direction
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # Forward and backward hidden states are concatenated, hence hidden_size * 2

    def forward(self, x):
        embeds = self.embedding(x)
        lstm_out, _ = self.lstm(embeds)
        out = self.fc(lstm_out)
        return out
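
A quick way to check the layer layout is to pass a small random batch through it and inspect the output shape. The sketch below repeats the same three layers in a class with a hypothetical name and small made-up sizes, and assumes the input is a batch of token-index sequences.

```python
import torch
import torch.nn as nn

class TinyBiLSTM(nn.Module):
    """Same layer layout as the BiLSTM above, with small illustrative sizes."""
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        lstm_out, _ = self.lstm(self.embedding(x))
        return self.fc(lstm_out)

tiny = TinyBiLSTM(vocab_size=50, embed_size=8, hidden_size=16, num_layers=2, num_classes=5)

# A batch of 3 sequences, each 7 token indices long.
batch = torch.randint(0, 50, (3, 7))
logits = tiny(batch)

# One score per class for every token: (batch, sequence length, classes).
print(logits.shape)
```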

# Vocabulary, Model, GPU device, Loss Function and Optimizer
The vocab is a dictionary that maps each unique token to an identifying index. The device is the processor used for batch training: the GPU if one is available, otherwise the CPU. The criterion is the Cross Entropy loss, and the optimizer is Adam (Adaptive Moment Estimation).

In [None]:
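# Define vocabulary based on unique tokens in the dataset.
# Index 0 is reserved for unknown tokens, matching the vocab.get(token, 0) lookups used later.
vocab = {token: idx for idx, token in enumerate(sorted(set(train_data['token'])), start=1)}
vocab_size = len(vocab) + 1  # +1 for the reserved unknown index
num_classes=len(labels)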

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')# if GPU cores are available use them for batch training else use the cpu 
model = BiLSTM(vocab_size, embed_size=100, hidden_size=100, num_layers=2, num_classes=num_classes).to(device)
criterion = nn.CrossEntropyLoss()#Loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)#Optimizer
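
The token-to-index lookup can be sketched in plain Python. The example below uses made-up tokens and assumes index 0 is set aside for tokens that were never seen when the vocabulary was built; `UNK_IDX` and `encode` are hypothetical names introduced for illustration.

```python
# Illustrative tokens; the notebook builds its vocabulary from train_data['token'].
train_tokens = ["My", "name", "is", "Ama", "name"]

UNK_IDX = 0  # hypothetical reserved index for tokens never seen in training

# Sorting makes the mapping reproducible between runs; ids start at 1.
vocab = {tok: i for i, tok in enumerate(sorted(set(train_tokens)), start=1)}
vocab_size = len(vocab) + 1  # +1 so the embedding table has a row for UNK_IDX

def encode(tokens):
    """Map tokens to integer ids, falling back to UNK_IDX for unknown tokens."""
    return [vocab.get(tok, UNK_IDX) for tok in tokens]

ids = encode(["My", "name", "is", "Kofi"])
print(vocab, ids)
```

Keeping the unknown index separate from every real token means an unseen word is not silently treated as some arbitrary known word.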


# Training loop


In [None]:
# Training Loop
train_loss = []  # mean training loss per epoch
valid_loss = []  # mean validation loss per epoch
accuracies = []
num_epochs = 5
fig, ax = plt.subplots()  # Creates figure and axis objects for plotting the loss, validation loss and accuracy

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for batch in train_loader:
        inputs, batch_labels = batch  # separate name so the array of unique labels is not overwritten
        inputs = torch.tensor([vocab.get(token, 0) for token in inputs]).to(device)  # Converts tokens to indices

        # Encodes the whole batch of labels into class ids in one call
        targets = torch.tensor(label_encoder.transform(list(batch_labels)), dtype=torch.long).to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * inputs.size(0)

    # Appends the epoch loss to the train_loss list
    epoch_loss = running_loss / len(train_loader.dataset)
    train_loss.append(epoch_loss)

    # Performs validation
    model.eval()  # Sets model to evaluation mode
    val_running_loss = 0.0
    correct_predictions = 0
    total_predictions = 0

    with torch.no_grad():
        for val_batch in validation_loader:
            val_inputs, val_labels = val_batch
            val_inputs = torch.tensor([vocab.get(token, 0) for token in val_inputs]).to(device)

            # Encodes labels into numerical values
            encoded_val_labels = [label_encoder.transform([label])[0] for label in val_labels]
            val_labels = torch.tensor(encoded_val_labels).to(device)

            # Forward pass
            val_outputs = model(val_inputs)
            val_loss = criterion(val_outputs, val_labels)
            val_running_loss += val_loss.item() * val_inputs.size(0)

            # Calculate accuracy
            _, predicted = torch.max(val_outputs, 1)
            total_predictions += val_labels.size(0)
            correct_predictions += (predicted == val_labels).sum().item()

        # Calculate validation loss and accuracy
        val_epoch_loss = val_running_loss / len(validation_loader.dataset)
        valid_loss.append(val_epoch_loss)
        accuracy = correct_predictions / total_predictions
        accuracies.append(accuracy)

    # Plot training and validation losses along with accuracy after each epoch
    ax.clear()
    ax.plot(range(1, epoch + 2), train_loss, label='Training Loss')
    ax.plot(range(1, epoch + 2), valid_loss, label='Validation Loss')
    ax.plot(range(1, epoch + 2), accuracies, label='Accuracy')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss / Accuracy')
    ax.set_title('Training and Validation Losses with Accuracy')
    ax.legend()
    clear_output(wait=True)  # Removes the previous epoch's plot once the new one is ready
    display(fig)

plt.close()  # Close the plot after training
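
The loss call in the loop passes raw scores and integer class ids to `nn.CrossEntropyLoss`. The sketch below, with made-up numbers, shows the shapes it expects and how per-token scores for whole sequences could be flattened first; it is an illustration, not part of the training code.

```python
import math
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Two tokens, two classes: raw scores (logits), not probabilities.
logits = torch.tensor([[2.0, 0.0], [0.0, 2.0]])
targets = torch.tensor([0, 1])  # integer class ids, e.g. from a LabelEncoder

loss = criterion(logits, targets)

# Each row contributes -log(softmax(logits)[target]) = log(1 + e^-2); the mean is returned.
expected = math.log(1 + math.exp(-2.0))

# Per-token logits of shape (batch, seq_len, classes) can be flattened to (N, classes).
seq_logits = torch.zeros(3, 7, 2)
seq_targets = torch.zeros(3, 7, dtype=torch.long)
seq_loss = criterion(seq_logits.reshape(-1, 2), seq_targets.reshape(-1))

print(loss.item(), expected, seq_loss.item())
```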


# Test

In [None]:
# test_df =  pd.read_json("/content/drive/MyDrive/Colab Notebooks/Datasets/pii/test.json")#loading the data
test_df=testing_df

In [None]:
test=list_unpack(test_df)

In [None]:
Xtest = test['token']

In [None]:
# Converts words to indices using vocab and creates a tensor with the appropriate data type
input_tensor = torch.tensor([vocab.get(word, 0) for word in Xtest.values], dtype=torch.long).to(device)  # same device as the model

# Ensures the model is in evaluation mode
model.eval()

# Performs forward pass without gradient computation
with torch.no_grad():
    predictions = model(input_tensor)

In [None]:

predicted_indices = torch.argmax(predictions, dim=1)

predicted_labels = label_encoder.classes_[predicted_indices.cpu().numpy()]

In [None]:
test['predicted_labels']=predicted_labels

In [None]:
test_split=test

The confusion matrix for the predictions is shown below. There are 3146 true positive values, which is consistent with the results of the evaluation metrics.

In [None]:
from sklearn.metrics import confusion_matrix
pd.DataFrame(confusion_matrix(test_split['label'], test_split['predicted_labels']))
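
The matrix is easier to read when its rows and columns carry the class names. The sketch below uses a handful of made-up labels and predictions (not results from the model) and passes an explicit `labels` order to `confusion_matrix`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Illustrative labels and predictions, not results from the model.
y_true = ["O", "O", "B-NAME", "B-NAME", "O"]
y_pred = ["O", "B-NAME", "B-NAME", "O", "O"]

class_names = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=class_names)

# Rows are true labels, columns are predicted labels.
cm_df = pd.DataFrame(
    cm,
    index=[f"true {c}" for c in class_names],
    columns=[f"pred {c}" for c in class_names],
)
print(cm_df)
```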

# Analysis
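These are the evaluation metrics for the BiLSTM model. All the evaluation metrics used here lie between 0 and 1, inclusive. Precision, recall and F1 are micro-averaged, which for a single-label multi-class task makes all three equal to accuracy.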

### Accuracy
Accuracy is the number of correct predictions divided by the total number of observations. An accuracy score of 0.99 means that the model's prediction matches the actual value 99% of the time.

### Precision
Precision is the fraction of the model's positive predictions that were actually correct. A precision score of 0.99 indicates that about 99% of the model's positive predictions are correct.

### Recall
Recall answers the question 'How many actual positive values were identified by the model?'. Since the model has high recall, we know that it identifies about 99% of the actual positive values.

### F1
The F1 score is the harmonic mean of the model's precision and recall. A high F1 score of 0.99 shows that the model has both high recall and high precision, which suggests that it generalises well to the held-out data.


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(test_split['label'], test_split['predicted_labels'])
precision = precision_score(test_split['label'], test_split['predicted_labels'], average='micro')
recall = recall_score(test_split['label'], test_split['predicted_labels'], average='micro')
f1 = f1_score(test_split['label'], test_split['predicted_labels'], average='micro')

print(f"Accuracy score: {accuracy}")
print(f"Precision score: {precision}")
print(f"Recall score: {recall}")
print(f"F1 score: {f1}")
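
The choice of `average` matters when one label dominates. The sketch below uses four made-up labels, not the model's output, to compare micro and macro averaging; `zero_division=0` is passed so that a class with no predictions scores 0 without a warning.

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative imbalanced example: every token is predicted as "O".
y_true = ["O", "O", "O", "B-NAME"]
y_pred = ["O", "O", "O", "O"]

acc = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro")
# Macro averaging gives every class the same weight, however rare it is.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(acc, micro_f1, macro_f1)
```

In this toy case the micro-averaged score matches accuracy, while the macro-averaged score drops because the rare class is never predicted.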

# Submission file

In [None]:
df = test_split[['row_id','document', 'token_id', 'predicted_labels']]
df.columns = ['row_id','document','token','label']
df.head()

In [None]:
df.to_csv("/kaggle/working/submissions-mfc.csv", index=False)  # /kaggle/input is read-only; outputs go to /kaggle/working

# Evaluation file

In [None]:
df = test_split[['row_id','document', 'token_id', 'label']]
df.columns = ['row_id','document','token','label']
df.head()

In [None]:
# df.to_csv("/content/drive/MyDrive/Colab Notebooks/Datasets/pii/sub/evaluations.csv", index=False)
df.to_csv("/kaggle/working/evaluations-mfc.csv", index=False)  # separate file so the results file written later does not overwrite it

# Result 

In [None]:
temp = {
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1_score': f1
}
df = pd.DataFrame(temp, index=[0])  # one row holding the four scores
df

In [None]:
# df.to_json("/content/drive/MyDrive/Colab Notebooks/Datasets/pii/sub/results-mfc.json")
df.to_json("/kaggle/working/results-mfc.json")  # /kaggle/input is read-only; outputs go to /kaggle/working

