# PII Data Detection

First approach to the [PII Data Detection competition](https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/overview) posted on Kaggle.

The goal of this competition is to develop a model that detects personally identifiable information (PII) in student writing/essays.

This notebook focuses on developing a simple model trained on the competition dataset, which contains approximately 22,000 essays. The model should assign labels to the following seven types of PII:

| Label         | Description                                                                                                                                        |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| NAME_STUDENT  | The full or partial name of a student that is not necessarily the author of the essay. This excludes instructors, authors, and other person names. | 
| EMAIL         | A student's email address.                                                                                                                         |
| USERNAME      | A student's username on any platform.                                                                                                              |
| ID_NUM        | A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.                     |
| PHONE_NUM     | A phone number associated with a student.                                                                                                          |
| URL_PERSONAL  | A URL that might be used to identify a student.                                                                                                    |
| STREET_ADDRESS | A full or partial street address that is associated with the student, such as their home address.                                                 |

## Metadata

The competition dataset contains the original documents along with their tokens, which were generated using the spaCy English tokeniser. Each token has a corresponding label in BIO format: the first token of a PII entity is labelled with the prefix "B-", any following tokens of the same entity are labelled with the prefix "I-", and non-PII tokens are labelled "O". The fields of the JSON data are detailed in the table below:

| Field               | Description                                                            |
|---------------------|------------------------------------------------------------------------|
| document            | Integer ID of the essay.                                               |
| full_text           | UTF-8 representation of the essay.                                     |
| tokens              | String representation of each token.                                   |
| trailing_whitespace | Boolean value indicating whether each token is followed by whitespace. |
| labels              | Token label in BIO format.                                             |
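
To make the BIO scheme used by the `labels` field concrete, the sketch below groups a list of BIO labels into entity spans. It is only an illustration: the helper name `bio_to_spans` and the toy tokens and labels are assumptions made for this example and are not taken from the competition data.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO labels into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current_type = None  # type of the entity currently being collected
    start = 0
    for i, label in enumerate(labels):
        if label.startswith("I-") and current_type == label[2:]:
            # Continuation of the open entity
            continue
        # Any other label closes the open entity, if there is one
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type = None
        if label.startswith(("B-", "I-")):
            # "B-" starts a new entity; a stray "I-" is leniently treated
            # as a start too (an assumption of this sketch)
            current_type, start = label[2:], i
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Hand-written toy example (hypothetical data)
tokens = ["My", "name", "is", "Jane", "Doe", "."]
labels = ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]

spans = bio_to_spans(labels)
entities = [(etype, " ".join(tokens[s:e])) for etype, s, e in spans]
print(entities)
```

Joining the tokens with a single space is a simplification that is good enough for inspecting entities by eye.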

Download the competition dataset [here](https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/data?select=train.json).
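
The `tokens` and `trailing_whitespace` fields together should be enough to rebuild the essay text. The sketch below shows one way to do this, assuming that a `True` flag stands for exactly one space character after the token. The helper name `rebuild_text` and the toy input are made up for this example.

```python
from typing import List


def rebuild_text(tokens: List[str], trailing_whitespace: List[bool]) -> str:
    """Concatenate tokens, adding one space after each token flagged True."""
    return "".join(
        token + (" " if has_space else "")
        for token, has_space in zip(tokens, trailing_whitespace)
    )


# Hand-written toy example (hypothetical data)
tokens = ["Hello", ",", "world", "!"]
trailing_whitespace = [False, True, False, False]

text = rebuild_text(tokens, trailing_whitespace)
print(text)
```

Comparing the rebuilt string with `full_text` for a few documents is a cheap way to check this assumption before relying on token offsets.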

In [1]:
# Importing packages
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [5]:
# Load the training and test data (the essays are UTF-8 encoded)
with open("./PII_Data/train.json", "r", encoding="utf-8") as f:
    json_train_data = json.load(f)

with open("./PII_Data/test.json", "r", encoding="utf-8") as f:
    json_test_data = json.load(f)

In [14]:
# Inspect the first training document: text preview, tokens, and labels
document_data = json_train_data[0]
print(document_data["full_text"][:80])
print(document_data["tokens"])
print(document_data["labels"])

Design Thinking for innovation reflexion-Avril 2021-Nathalie Sylla

Challenge & 
['Design', 'Thinking', 'for', 'innovation', 'reflexion', '-', 'Avril', '2021', '-', 'Nathalie', 'Sylla', '\n\n', 'Challenge', '&', 'selection', '\n\n', 'The', 'tool', 'I', 'use', 'to', 'help', 'all', 'stakeholders', 'finding', 'their', 'way', 'through', 'the', 'complexity', 'of', 'a', 'project', 'is', 'the', ' ', 'mind', 'map', '.', '\n\n', 'What', 'exactly', 'is', 'a', 'mind', 'map', '?', 'According', 'to', 'the', 'definition', 'of', 'Buzan', 'T.', 'and', 'Buzan', 'B.', '(', '1999', ',', 'Dessine', '-', 'moi', ' ', "l'intelligence", '.', 'Paris', ':', 'Les', 'Éditions', "d'Organisation", '.', ')', ',', 'the', 'mind', 'map', '(', 'or', 'heuristic', 'diagram', ')', 'is', 'a', 'graphic', ' ', 'representation', 'technique', 'that', 'follows', 'the', 'natural', 'functioning', 'of', 'the', 'mind', 'and', 'allows', 'the', 'brain', "'s", ' ', 'potential', 'to', 'be', 'released', '.', 'Cf', 'Annex1', '\n\n', 'This