<a href="https://www.kaggle.com/code/jackren000/piidatadetection-dataanalysis?scriptVersionId=163018064" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Define the objective:
The objective of the analysis for "The Learning Agency Lab - PII Data Detection" Kaggle competition is to develop and optimize an accurate and computationally efficient model capable of detecting personally identifiable information in student writing to facilitate the safe release of educational datasets.

In [1]:
# install the 'seqeval' and 'evaluate' libraries using pip
# seqeval is a Python library for evaluating sequence labeling performance (e.g. NER)
# evaluate provides an interface for computing evaluation metrics for text classification, etc.
!pip install seqeval evaluate -q

In [2]:
#### import libraries
import pandas as pd
import numpy as np
import os
import json
# import argparse library to handle command-line arguments
import argparse
# a Hugging Face Dataset used for natural language processing (NLP)
from datasets import Dataset
from pathlib import Path
# import the 'chain' function to combine multiple iterables into a single iterable
from itertools import chain
# tokenizer that can automatically find the model's required tokenization from the model name
from transformers import AutoTokenizer
# model class that can automatically find a token classification model from the model name
from transformers import AutoModelForTokenClassification
# class that provides an API for feature-complete training in PyTorch
from transformers import Trainer
# class to store hyperparameters for training
from transformers import TrainingArguments
# data collator that dynamically pads the inputs received, used for token classification tasks
from transformers import DataCollatorForTokenClassification
# use `partial` to create a new function with fixed arguments
from functools import partial
# the seqeval library is a Python framework for sequence labeling evaluation
from seqeval.metrics import recall_score, precision_score, f1_score



In [3]:
# define the path/name of the pre-trained model we'll be using for fine tuning
# here, the DeBERTa V3 base model from the Hugging Face Hub
TRAINING_MODEL_PATH = "microsoft/deberta-v3-base"
# define the maximum number of tokens to be used in the model from the 
# input sequence, the length is based on the model's capabilities and the dataset's requirements
TRAINING_MAX_LENGTH = 1024
# the directory where the outputs like the trained model, evaluation reports, 
# and any checkpoints will be saved.
OUTPUT_DIR = "model_output"

### Data Collection:
Obtain the JSON dataset and load it.

In [4]:
# load the JSON file content
with open("/kaggle/input/pii-detection-removal-from-educational-data/train.json", 'r') as file:
    data = json.load(file)

### Data Exploration:
Explore the dataset to get an initial understanding of its structure and content.

In [5]:
#### gain insight from an example of data
# data is a list of dictionaries, one per essay;
# take the first dictionary as an example
example = data[0]
print("### Data Size ###")
print(f"The samples that dataset contains: {len(data)}\n")

print("### Sample Keys ###")
print(f'The keys of the data is: {example.keys()}\n')

print("### Sample Document Number ###")
print(f"The document of this sample is: {example['document']}\n")

print("### Sample Full Text ###")
print(f"The full_text of this sample is: {example['full_text']}\n")

print("### Sample Tokens ###")
print(f"The tokens of this sample is: {example['tokens']}\n")

print("### Sample Trailing Whitespace ###")
print(f"The trailing_whitespace of this sample is: {example['trailing_whitespace']}\n")

print("### Sample Labels ###")
print(f"The labels of this sample is: {example['labels']}")

### Data Size ###
The samples that dataset contains: 6807

### Sample Keys ###
The keys of the data is: dict_keys(['document', 'full_text', 'tokens', 'trailing_whitespace', 'labels'])

### Sample Document Number ###
The document of this sample is: 7

### Sample Full Text ###
The full_text of this sample is: Design Thinking for innovation reflexion-Avril 2021-Nathalie Sylla

Challenge & selection

The tool I use to help all stakeholders finding their way through the complexity of a project is the  mind map.

What exactly is a mind map? According to the definition of Buzan T. and Buzan B. (1999, Dessine-moi  l'intelligence. Paris: Les Éditions d'Organisation.), the mind map (or heuristic diagram) is a graphic  representation technique that follows the natural functioning of the mind and allows the brain's  potential to be released. Cf Annex1

This tool has many advantages:

•  It is accessible to all and does not require significant material investment and can be done  quickly

•  It is 

### Data Cleaning:
Preprocess the data to ensure its quality and consistency. As a first step toward downsizing, split the essays into two subsets: those that contain at least one named entity and those that contain none.

In [6]:
#### split the data into positive and negative samples
positive_samples = [] # samples that contain named entities
negative_samples = [] # samples that do not contain any named entity

# loop over all the dictionaries in the data list
for essay in data:
    # check whether any token label differs from 'O' (the letter O, meaning
    # "outside" any entity), i.e. whether the essay contains a named entity
    if any(label != 'O' for label in essay['labels']):
        positive_samples.append(essay)
    else:
        negative_samples.append(essay)
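
The split above separates the essays but does not yet reduce the dataset. One way the downsizing step could look is sketched below. The keep fraction of 25%, the fixed seed, and the helper name `downsample_negatives` are assumptions chosen for illustration, not values taken from this notebook.

```python
import random

def downsample_negatives(positive, negative, keep_fraction=0.25, seed=42):
    """Keep every positive essay and a random fraction of the negative ones."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    n_keep = int(len(negative) * keep_fraction)
    kept_negative = rng.sample(negative, n_keep)
    combined = positive + kept_negative
    rng.shuffle(combined)  # avoid all positives sitting at the front
    return combined

# toy stand-ins for the essay dictionaries
toy_positive = [{"document": i, "labels": ["B-NAME_STUDENT", "O"]} for i in range(3)]
toy_negative = [{"document": 100 + i, "labels": ["O", "O"]} for i in range(8)]

downsized = downsample_negatives(toy_positive, toy_negative)
print(len(downsized))  # 3 positives + 2 of the 8 negatives
```

With the real data, the same call would take `positive_samples` and `negative_samples` as its two arguments.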

### Data Preprocessing:
Feature Engineering: create new features from existing data to reveal additional insights.  
Data Transformation: transform data into a desired format required by machine learning model.

In [7]:
#### creating label-id mappings for data labels
# using set comprehension, the code collects all unique labels 
# from all essay['labels'] lists in the data list and sorts them into list
all_labels = sorted({label for essay in data for label in essay['labels']})
print('######## All Labels ########')
print(all_labels)

# create a dictionary mapping each label to a unique ID using enumerate
label2id = {label: id for id, label in enumerate(all_labels)}
print('######## Labels to ID ########')
print(label2id)

# reverse the label2id dict to create a mapping from IDs back to labels
# items() returns a view of the (key, value) pairs
id2label = {id: label for label, id in label2id.items()}
print('######## ID to Labels ########')
print(id2label)

######## All Labels ########
['B-EMAIL', 'B-ID_NUM', 'B-NAME_STUDENT', 'B-PHONE_NUM', 'B-STREET_ADDRESS', 'B-URL_PERSONAL', 'B-USERNAME', 'I-ID_NUM', 'I-NAME_STUDENT', 'I-PHONE_NUM', 'I-STREET_ADDRESS', 'I-URL_PERSONAL', 'O']
######## Labels to ID ########
{'B-EMAIL': 0, 'B-ID_NUM': 1, 'B-NAME_STUDENT': 2, 'B-PHONE_NUM': 3, 'B-STREET_ADDRESS': 4, 'B-URL_PERSONAL': 5, 'B-USERNAME': 6, 'I-ID_NUM': 7, 'I-NAME_STUDENT': 8, 'I-PHONE_NUM': 9, 'I-STREET_ADDRESS': 10, 'I-URL_PERSONAL': 11, 'O': 12}
######## ID to Labels ########
{0: 'B-EMAIL', 1: 'B-ID_NUM', 2: 'B-NAME_STUDENT', 3: 'B-PHONE_NUM', 4: 'B-STREET_ADDRESS', 5: 'B-URL_PERSONAL', 6: 'B-USERNAME', 7: 'I-ID_NUM', 8: 'I-NAME_STUDENT', 9: 'I-PHONE_NUM', 10: 'I-STREET_ADDRESS', 11: 'I-URL_PERSONAL', 12: 'O'}


In [8]:
#### define custom tokenize function 
# tokenize original text into desired tokens that ML model can understand
def tokenize(essay, tokenizer, label2id, max_seq_length):
    '''
    Tokenize the input example using the provided tokenizer and assign labels.
    '''
    
    # assign labels to each character 
    char_labels = []
    
    # the original labels are assigned per token (e.g. "hello");
    # spread each label over the token's characters (e.g. "h", "e", ...)
    for token, label, has_whitespace in zip(
        essay['tokens'],
        essay['provided_labels'], 
        essay['trailing_whitespace']
    ):
        char_labels.extend([label] * len(token))
        # assign "O" label for whitespace
        if has_whitespace:
            char_labels.append("O")  
    
    # convert the character labels to a NumPy array so they can be indexed by position
    labels = np.array(char_labels)
    # assign full text to the text variable
    text = essay['full_text']
    # tokenize the text with offset mapping to align tokens with character labels
    tokenized_output = tokenizer(
        text, 
        return_offsets_mapping=True, 
        max_length=max_seq_length, 
        truncation=True)
    
    # initialize the list to store labels for each token
    token_labels = []  
    
    # assign labels to each token based on the character label at the start index
    for start_idx, end_idx in tokenized_output.offset_mapping:
        # special tokens (e.g., CLS at the start) have offset (0, 0); label them "O"
        if start_idx == end_idx == 0:
            token_labels.append(label2id["O"])
        else:
            # adjust start_idx if it points to a whitespace
            adjusted_start_idx = start_idx + text[start_idx].isspace()
            # append the corresponding token label
            token_labels.append(label2id[labels[adjusted_start_idx]])
    
    # add the token labels and the sequence length to the tokenized output
    tokenized_output['labels'] = token_labels
    tokenized_output['sequence_length'] = len(tokenized_output.input_ids)
    
    return tokenized_output
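
To see the alignment logic in isolation, the sketch below replays it on a tiny hand-made example. The text, the labels, and the `offsets` list are invented for illustration. Real offsets come from the tokenizer's `offset_mapping`, and the actual DeBERTa subword boundaries will differ.

```python
# toy word-level input in the same layout as the competition data
tokens = ["Hi", "Ann", "Lee", "!"]
word_labels = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]
trailing_whitespace = [True, True, False, False]
text = "Hi Ann Lee!"

# step 1: spread each word label over its characters; whitespace gets "O"
char_labels = []
for token, label, has_ws in zip(tokens, word_labels, trailing_whitespace):
    char_labels.extend([label] * len(token))
    if has_ws:
        char_labels.append("O")
assert len(char_labels) == len(text)

# step 2: hypothetical subword offsets; (0, 0) marks a special token,
# and a subword may start on the space that precedes a word
offsets = [(0, 0), (0, 2), (2, 5), (5, 6), (6, 10), (10, 11), (0, 0)]

subword_labels = []
for start, end in offsets:
    if start == end == 0:
        subword_labels.append("O")
    else:
        # skip a leading space so the label is read from the first real character
        start += text[start].isspace()
        subword_labels.append(char_labels[start])

for (start, end), label in zip(offsets, subword_labels):
    print(repr(text[start:end]), label)
```

Note that the second piece of "Ann" inherits `B-NAME_STUDENT` rather than an `I-` label, because each subword simply copies the label of the word it starts in.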

### Data Modeling:
Train your models on the dataset, and tune them for better performance.

In [9]:
# use AutoTokenizer.from_pretrained() so that the tokenization
# matches the pretrained model's expectations
tokenizer = AutoTokenizer.from_pretrained(TRAINING_MODEL_PATH)

# prepare the dataset from a dictionary format
# keys become the column names; each list supplies that column's value for every row
processed_dataset = Dataset.from_dict({
    'full_text': [sentence['full_text'] for sentence in data],
    # convert int object into string
    'document': [str(sentence['document']) for sentence in data],
    'tokens': [sentence['tokens'] for sentence in data],
    'trailing_whitespace': [sentence['trailing_whitespace'] for sentence in data],
    # rename 'labels' to 'provided_labels'
    'provided_labels': [sentence['labels'] for sentence in data] 
})

# apply the tokenize function
# map() calls tokenize() on every example (row) of the dataset
processed_dataset = processed_dataset.map(
    # the tokenize() is applied in the dataset
    tokenize, 
    fn_kwargs={
        "tokenizer": tokenizer, 
        "label2id": label2id, 
        "max_seq_length": TRAINING_MAX_LENGTH
    }, 
    num_proc=3 # 3 parallel processes 
)


### Data Interpretation:
Translate the model's findings into insights that are relevant to the objectives.
Use domain knowledge to provide context to the results and identify potential biases or limitations in the analysis.

In [10]:
# take one document as an example
e = processed_dataset[0]

# print out tokens and their provided labels from the original dataset,
# but only those tokens which are labeled as entities (not "O").
for token, label in zip(e["tokens"], e["provided_labels"]):
    # check if the label is not "O" (Outside of an entity)
    if label != "O":
        # Print the token and label pair
        print((token, label))

# print a separator line to distinguish between the two outputs
print("#" * 20)

# now, using the tokenizer to convert token IDs back to tokens, and print them
# with their corresponding labels from the processed dataset.
for token_id, label_id in zip(e["input_ids"], e["labels"]):
    # convert the token ID back to its token string representation
    token = tokenizer.convert_ids_to_tokens([token_id])[0]
    # retrieve the string label from the 'id2label' mapping using the label ID
    label = id2label[label_id]
    # check if the label is not "O"
    if label != "O":
        print((token, label))

('Nathalie', 'B-NAME_STUDENT')
('Sylla', 'I-NAME_STUDENT')
('Nathalie', 'B-NAME_STUDENT')
('Sylla', 'I-NAME_STUDENT')
('Nathalie', 'B-NAME_STUDENT')
('Sylla', 'I-NAME_STUDENT')
####################
('N', 'B-NAME_STUDENT')
('atha', 'B-NAME_STUDENT')
('lie', 'B-NAME_STUDENT')
('▁S', 'I-NAME_STUDENT')
('ylla', 'I-NAME_STUDENT')
('N', 'B-NAME_STUDENT')
('atha', 'B-NAME_STUDENT')
('lie', 'B-NAME_STUDENT')
('▁S', 'I-NAME_STUDENT')
('ylla', 'I-NAME_STUDENT')
('N', 'B-NAME_STUDENT')
('atha', 'B-NAME_STUDENT')
('lie', 'B-NAME_STUDENT')
('▁S', 'I-NAME_STUDENT')
('ylla', 'I-NAME_STUDENT')


### Model Evaluation:

Assess the model using appropriate metrics. Evaluating its performance is crucial to ensure that its predictions are accurate and reliable.

In [11]:
#### define the model evaluation metrics
def compute_metrics(p, all_labels):
    """
    Compute performance metrics for token classification.
    
    Args:
        p (Tuple[np.ndarray, np.ndarray]): A tuple containing the model predictions and labels.
        all_labels (List[str]): A list of all possible labels.
    
    Returns:
        dict: A dictionary containing recall, precision, and F1 score.
    """
    predictions, labels = p
    # convert logits to label IDs
    predictions = np.argmax(predictions, axis=2)

    # convert the index predictions to actual label predictions
    # only take into account non-ignored tokens (-100 in PyTorch signifies ignore index)
    true_predictions = [
        [all_labels[pred] for (pred, label) in zip(prediction_row, label_row) if label != -100]
        for prediction_row, label_row in zip(predictions, labels)
    ]
    
    true_labels = [
        [all_labels[label] for (pred, label) in zip(prediction_row, label_row) if label != -100]
        for prediction_row, label_row in zip(predictions, labels)
    ]
    
    # Calculate recall, precision, and f1 score
    recall = recall_score(true_labels, true_predictions)
    precision = precision_score(true_labels, true_predictions)
    f1 = f1_score(true_labels, true_predictions)  # Directly use the f1_score from seqeval
    
    # Prepare the results dictionary
    results = {
        'recall': recall,
        'precision': precision,
        'f1': f1
    }
    
    return results
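
The function above reports F1, which weights precision and recall equally. The competition is understood to be scored with a micro F-beta that favors recall (beta = 5), but that value is an assumption here and should be checked against the competition's evaluation page. A minimal sketch of how such a score follows from precision and recall, with the helper name `fbeta` being hypothetical:

```python
def fbeta(precision, recall, beta=5.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# the same precision/recall pair, scored two ways
p, r = 0.5, 0.9
print(f"F1 = {fbeta(p, r, beta=1):.3f}")
print(f"F5 = {fbeta(p, r, beta=5):.3f}")
```

If desired, a value such as `fbeta(precision, recall)` could be added to the `results` dictionary alongside the existing three metrics.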

In [12]:
# setting up a machine learning environment for token classification tasks
model = AutoModelForTokenClassification.from_pretrained(
        TRAINING_MODEL_PATH, # the path of pre-trained model
        num_labels = len(all_labels), # the number of labels the model should predict
        id2label = id2label, # the dictionary mapping numerical label IDs to label strings
        label2id = label2id, # the dictionary mapping label strings to numerical IDs
        ignore_mismatched_sizes = True) # allow the new classification head to differ in size from the one stored in the checkpoint

collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=16)

Some weights of DebertaV2ForTokenClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
# the hyperparameters and settings for the training process are specified.
args = TrainingArguments(
    output_dir=OUTPUT_DIR, # directory to save training artifacts, including checkpoints
    fp16=True, # enables mixed-precision training, reducing memory usage and potentially speeding up
    learning_rate=2e-5, # learning rate for optimizer
    num_train_epochs=3, # number of epochs to train for
    per_device_train_batch_size=4, # each GPU/CPU processes 4 examples per batch
    gradient_accumulation_steps=2, # accumulate gradients over 2 batches before each optimizer update
    report_to='none', # do not report to external logging integrations
    evaluation_strategy='no', # disable evaluation during training
    do_eval=False, # do not run evaluation on a validation set
    save_total_limit=1, # the maximum model checkpoints to keep
    logging_steps=20, # the frequency of logging training information
    lr_scheduler_type='cosine', # cosine learning rate schedule
    metric_for_best_model="f1", # the metric to use when comparing model checkpoints (only relevant when evaluation runs)
    greater_is_better=True, # a higher metric score is better for metric_for_best_model
    warmup_ratio=0.1, # fraction of training steps used for linear learning-rate warmup
    weight_decay=0.01 # weight decay coefficient applied by the optimizer
)

# Trainer() is a class that provides an API for training and evaluation
trainer = Trainer(
    model=model, # the model to be trained or fine-tuned
    args=args, # training arguments defined above
    train_dataset=processed_dataset, # the dataset to be used for training, which should be tokenized and formatted properly
    data_collator=collator, # pads inputs and labels to a common length within each batch
    tokenizer=tokenizer, # the tokenizer used to process data
    compute_metrics=partial(compute_metrics, all_labels=all_labels), # a function that computes metrics
)
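
As a rough check on what these settings imply, the arithmetic below estimates the effective batch size and the number of optimizer updates. It assumes a single device and the 6807 training essays reported during data exploration. The way partial batches are rounded is a simplification and may differ slightly from what `Trainer` does internally.

```python
import math

num_samples = 6807      # dataset size printed during data exploration
per_device_batch = 4    # per_device_train_batch_size
grad_accum = 2          # gradient_accumulation_steps
num_epochs = 3          # num_train_epochs
warmup_ratio = 0.1      # warmup_ratio
num_devices = 1         # assumption: training on one GPU

# examples that contribute to a single optimizer update
effective_batch = per_device_batch * grad_accum * num_devices

# batches drawn per epoch, then grouped into optimizer updates
batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * num_epochs

# updates spent ramping the learning rate up before the cosine decay
warmup_updates = math.ceil(total_updates * warmup_ratio)

print(effective_batch, updates_per_epoch, total_updates, warmup_updates)
```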

In [14]:
# # jupyter magic command, display time passing by
# # %%time
# trainer.train()

### Result Interpretation:

Interpret what the metrics say about the model and how it is likely to perform in practical applications.

In [15]:
# # save both the model and tokenizer to ensure that all components needed to 
# # run the model and preprocess input data in the future are kept
# # save the DeBERTa v3 base model
# trainer.save_model("deberta3base_1024")
# # save the corresponding tokenizer for the model
# tokenizer.save_pretrained("deberta3base_1024")

### Model Deployment
Use the fine-tuned model to predict labels for the test data.

In [16]:
#### load the test file
# load the JSON file content
with open("/kaggle/input/pii-detection-removal-from-educational-data/test.json", 'r') as file:
    test_data = json.load(file)

In [17]:
#### prepare the test dataset
# prepare the dataset from a dictionary format
# key will be the column names, value will assign to each row
test_data = Dataset.from_dict({
    'full_text': [essay['full_text'] for essay in test_data],
    'document': [essay['document'] for essay in test_data],
    'tokens': [essay['tokens'] for essay in test_data],
    'trailing_whitespace': [essay['trailing_whitespace'] for essay in test_data]
    
})

In [18]:
#### define the tokenize function for test data
def tokenize(essay, tokenizer):
    '''
    Tokenizes the text data within an essay dictionary for inference
    '''
    # extract the full text from the essay dictionary
    text = essay['full_text']    
    # perform tokenization of the input text
    # (INFERENCE_MAX_LENGTH is a global defined in a later cell, before this function is called)
    tokenized = tokenizer(
        text, 
        return_offsets_mapping=True,
        truncation=True,
        max_length=INFERENCE_MAX_LENGTH)
    
    # initialize the token_map list
    token_map = []
    idx = 0
    # iterate over each token and its corresponding trailing whitespace
    for token, has_whitespace in zip(essay["tokens"], essay["trailing_whitespace"]):
        # map the current token index to all characters in the token
        token_map.extend([idx]*len(token))
        # if there is trailing whitespace, map it to -1
        if has_whitespace:
            token_map.append(-1)
        # increment the token index
        idx += 1
        
    # add the 'token_map' key and store the token_map variable
    tokenized['token_map'] = token_map
    
    # return the tokenized input and the token_map
    return tokenized
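
The `token_map` built above is a character-to-word lookup: position `i` holds the index of the word that character `i` of the text belongs to, or -1 for whitespace. A minimal sketch on an invented three-word example:

```python
# toy word-level input; not taken from the competition data
tokens = ["Hi", "Ann", "!"]
trailing_whitespace = [True, False, False]
text = "Hi Ann!"

token_map = []
for idx, (token, has_ws) in enumerate(zip(tokens, trailing_whitespace)):
    token_map.extend([idx] * len(token))  # every character points at its word
    if has_ws:
        token_map.append(-1)              # whitespace belongs to no word

print(token_map)  # [0, 0, -1, 1, 1, 1, 2]

# a subword that starts at character 3 therefore belongs to word 1
char_position = 3
print(tokens[token_map[char_position]])  # Ann
```

This is what later lets a prediction made on a subword be reported against the index of the original token.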


In [19]:
#### load the tokenizer saved with the fine-tuned model
# define the model path
model_path = '/kaggle/input/piimodel/deberta3base_1024'
INFERENCE_MAX_LENGTH = 2048

tokenizer = AutoTokenizer.from_pretrained(model_path)
test_data = test_data.map(
    tokenize, 
    fn_kwargs={"tokenizer": tokenizer}, 
    num_proc=2) # the number of parallel processes used for tokenization

# load the fine-tuned pre-trained model
model = AutoModelForTokenClassification.from_pretrained(model_path)

# load the data collator used for padding input
collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=16)

# set up arguments for inference (TrainingArguments is reused here only to configure prediction)
args = TrainingArguments(
    ".",  # output directory (a required argument)
    per_device_eval_batch_size=1,  # evaluation batch size per device (e.g., GPU)
    report_to="none",  # disables reporting to any external entity
)

# initializing the Trainer
trainer = Trainer(
    model=model,  # the fine-tuned model used for prediction
    args=args,  # inference arguments set up above
    data_collator=collator,  # responsible for batching and preparing data
    tokenizer=tokenizer,  # tokenizer to be used for pre-processing text data
)

   


In [20]:
# call the predict method of the trainer object on test_data
# extract the .predictions attribute from the result
predictions = trainer.predict(test_data).predictions
# apply a numerically stable softmax over the class axis
shifted = predictions - predictions.max(axis=2, keepdims=True)
pred_softmax = np.exp(shifted) / np.exp(shifted).sum(axis=2, keepdims=True)
# load the model's configuration JSON file
with open(os.path.join(model_path, "config.json")) as config_file:
    config = json.load(config_file)
# id2label is a dictionary extracted from the loaded configuration,
# mapping label IDs (stored as strings in JSON) to their corresponding labels
id2label = config["id2label"]
# find the index of the maximum value along the last axis
preds = predictions.argmax(-1)
# take the argmax over the first 12 classes, i.e. the best non-'O' label ('O' has ID 12)
preds_without_O = pred_softmax[:, :, :12].argmax(-1)
# extract the probabilities of the 'O' class
O_preds = pred_softmax[:, :, 12]
# define the threshold for 'O' class predictions
threshold = 0.9
# decide final predictions based on the threshold:
# if the probability of 'O' is less than the threshold, use preds_without_O, else use preds
preds_final = np.where(O_preds < threshold, preds_without_O, preds)
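
The effect of the threshold is easier to see on a single token. The sketch below uses plain Python and an invented three-class label set (two PII classes plus 'O'), so the logits and the class count are illustrative assumptions, not model output. The helper names `softmax` and `pick_label` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_label(logits, o_index, threshold=0.9):
    """Mirror the rule above for one token."""
    probs = softmax(logits)
    if probs[o_index] < threshold:
        # 'O' is not confident enough: take the best non-'O' class
        candidates = [i for i in range(len(probs)) if i != o_index]
        return max(candidates, key=lambda i: probs[i])
    # otherwise keep the plain argmax over all classes
    return max(range(len(probs)), key=lambda i: probs[i])

labels = ["B-NAME_STUDENT", "I-NAME_STUDENT", "O"]
o_index = 2

uncertain = [1.0, 0.0, 2.5]   # 'O' is the argmax, but below 0.9 probability
confident = [0.0, 0.0, 5.0]   # 'O' is well above 0.9 probability

print(labels[pick_label(uncertain, o_index)])
print(labels[pick_label(confident, o_index)])
```

The design trades precision for recall: a token is only left unflagged when the model is highly confident that it is not PII.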

In [21]:
# initialize lists to store the processed information
triplets = []
document, token, label, token_str = [], [], [], []

# iterate through the predictions and supporting data
for p, token_map, offsets, tokens, doc in zip(preds_final, test_data["token_map"], test_data["offset_mapping"], test_data["tokens"], test_data["document"]):

    for token_pred, (start_idx, end_idx) in zip(p, offsets):
        label_pred = id2label[str(token_pred)]

        # skip if start and end indices are both 0
        if start_idx + end_idx == 0:
            continue

        # adjust start index if token map is -1
        if token_map[start_idx] == -1:
            start_idx += 1

        # skip any whitespace tokens
        while start_idx < len(token_map) and tokens[token_map[start_idx]].isspace():
            start_idx += 1

        # break if start index goes beyond token map length
        if start_idx >= len(token_map):
            break

        token_id = token_map[start_idx]

        # process only non-"O" labels and valid token IDs
        if label_pred != "O" and token_id != -1:
            # include the document so that identical tokens in different essays are not treated as duplicates
            triplet = (doc, label_pred, token_id)

            # add unique triplets to the list and record details
            if triplet not in triplets:
                document.append(doc)
                token.append(token_id)
                label.append(label_pred)
                token_str.append(tokens[token_id])
                triplets.append(triplet)
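
The `str(token_pred)` lookup in the loop is needed because `id2label` was read from a JSON file, and JSON object keys are always strings. A small round-trip sketch with a made-up two-entry mapping shows the effect:

```python
import json

# made-up mapping with integer keys, as it would exist in memory
id2label_in_memory = {0: "B-EMAIL", 12: "O"}

# writing to JSON and reading back turns the integer keys into strings
restored = json.loads(json.dumps(id2label_in_memory))
print(restored)  # {'0': 'B-EMAIL', '12': 'O'}

# one way to get integer keys back, if preferred
id2label_int = {int(key): value for key, value in restored.items()}
print(id2label_int == id2label_in_memory)  # True
```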

In [22]:
# create the submission dataframe from the collected predictions
test_data = pd.DataFrame({
    "document": document,
    "token": token,
    "label": label,
    "token_str": token_str
})
# add row id column
test_data["row_id"] = list(range(len(test_data)))
display(test_data.head(100))

| | document | token | label | token_str | row_id |
|---|---|---|---|---|---|
| 0 | 7 | 9 | B-NAME_STUDENT | Nathalie | 0 |
| 1 | 7 | 10 | I-NAME_STUDENT | Sylla | 1 |
| 2 | 7 | 482 | B-NAME_STUDENT | Nathalie | 2 |
| 3 | 7 | 483 | I-NAME_STUDENT | Sylla | 3 |
| 4 | 7 | 741 | B-NAME_STUDENT | Nathalie | 4 |
| 5 | 7 | 742 | I-NAME_STUDENT | Sylla | 5 |
| 6 | 10 | 0 | B-NAME_STUDENT | Diego | 6 |
| 7 | 10 | 1 | I-NAME_STUDENT | Estrada | 7 |
| 8 | 10 | 464 | B-NAME_STUDENT | Diego | 8 |
| 9 | 10 | 465 | I-NAME_STUDENT | Estrada | 9 |


In [23]:
# save the predictions
test_data[["row_id", "document", "token", "label"]].to_csv("submission.csv", index=False)
print('Completed!')

Completed!
