<a href="https://www.kaggle.com/code/jackren000/piidatadetection-dataanalysis?scriptVersionId=161877243" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Define the objectives:
The objective of the analysis for "The Learning Agency Lab - PII Data Detection" Kaggle competition is to develop and optimize an accurate and computationally efficient model capable of detecting personally identifiable information in student writing to facilitate the safe release of educational datasets.

In [1]:
# install the 'seqeval' and 'evaluate' libraries using pip
!pip install seqeval evaluate -q

In [2]:
#### import libraries
import pandas as pd
import numpy as np
import json
# import argparse library to handle command-line arguments
import argparse
# a Hugging Face Dataset used for natural language processing (NLP)
from datasets import Dataset
from pathlib import Path
# import the 'chain' function to combine multiple iterables into a single iterable
from itertools import chain
# tokenizer that can automatically find the model's required tokenization from the model name
from transformers import AutoTokenizer
# model class that can automatically find a token classification model from the model name
from transformers import AutoModelForTokenClassification
# class that provides an API for feature-complete training in PyTorch
from transformers import Trainer
# class to store hyperparameters for training
from transformers import TrainingArguments
# data collator that dynamically pads the inputs received, used for token classification tasks
from transformers import DataCollatorForTokenClassification



In [3]:
# path/name of the pre-trained model to fine-tune:
# the DeBERTa V3 base model
TRAINING_MODEL_PATH = "microsoft/deberta-v3-base"
# maximum number of tokens kept from each input sequence; longer inputs are
# truncated. The value depends on the model's capabilities and the dataset's requirements
TRAINING_MAX_LENGTH = 1024
# directory where outputs such as the trained model, evaluation reports,
# and checkpoints will be saved
OUTPUT_DIR = "model_output"

### Data Collection:
Obtain the JSON dataset and load it.

In [4]:
# load the JSON file content
with open("/kaggle/input/pii-detection-removal-from-educational-data/train.json", 'r') as file:
    data = json.load(file)

### Data Exploration:
Explore the dataset to get an initial understanding of its structure and content.

In [5]:
#### gain insight from an example
# data is a list of dictionaries; take the first dictionary
# as an example
example = data[0]
print("### Data Size ###")
print(f"The length of the data is: {len(data)}\n")

print("### Sample Keys ###")
print(f'The keys of the data is: {example.keys()}\n')

print("### Sample Document Number ###")
print(f"The document of this sample is: {example['document']}\n")

print("### Sample Full Text ###")
print(f"The full_text of this sample is: {example['full_text']}\n")

print("### Sample Tokens ###")
print(f"The tokens of this sample is: {example['tokens']}\n")

print("### Sample Trailing Whitespace ###")
print(f"The trailing_whitespace of this sample is: {example['trailing_whitespace']}\n")

print("### Sample Labels ###")
print(f"The labels of this sample is: {example['labels']}")

### Data Size ###
The length of the data is: 6807

### Sample Keys ###
The keys of the data is: dict_keys(['document', 'full_text', 'tokens', 'trailing_whitespace', 'labels'])

### Sample Document Number ###
The document of this sample is: 7

### Sample Full Text ###
The full_text of this sample is: Design Thinking for innovation reflexion-Avril 2021-Nathalie Sylla

Challenge & selection

The tool I use to help all stakeholders finding their way through the complexity of a project is the  mind map.

What exactly is a mind map? According to the definition of Buzan T. and Buzan B. (1999, Dessine-moi  l'intelligence. Paris: Les Éditions d'Organisation.), the mind map (or heuristic diagram) is a graphic  representation technique that follows the natural functioning of the mind and allows the brain's  potential to be released. Cf Annex1

This tool has many advantages:

•  It is accessible to all and does not require significant material investment and can be done  quickly

•  It is scalable
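
A natural next exploration step is to count how often each label occurs, since PII tokens are likely to be far rarer than `O` tokens. The sketch below is a minimal illustration on a hypothetical `toy_data` list that mimics the structure of `data`; the helper name `count_labels` is an assumption and is not part of the original notebook.

```python
from collections import Counter

# Hypothetical stand-in for `data`; the real list is loaded from train.json.
toy_data = [
    {"labels": ["O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT"]},
    {"labels": ["O", "O", "O"]},
    {"labels": ["B-EMAIL", "O"]},
]

def count_labels(documents):
    """Count how often each label occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc["labels"])
    return counts

label_counts = count_labels(toy_data)
# Print labels from most to least frequent.
for label, n in label_counts.most_common():
    print(f"{label}: {n}")
```

Calling `count_labels(data)` on the real dataset would give the same kind of summary for all documents.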

### Data Cleaning:
Preprocess the data to ensure its quality and consistency. Downsize the data and organize it into subsets, distinguishing between those with named entities and those without.

In [6]:
#### split the data by whether a document contains PII
positive_samples = [] # documents that contain at least one named entity
negative_samples = [] # documents that do not contain any named entity

# loop over all the dictionaries in the data list
for sentence in data:
    # check whether any label in the document marks a named entity,
    # i.e. differs from 'O' (outside of any entity)
    if any(label != 'O' for label in sentence['labels']):
        positive_samples.append(sentence)
    else:
        negative_samples.append(sentence)
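
The cell above only separates the two subsets; the rest of this notebook still trains on the full `data` list. If one wanted to actually downsize the training set, one possible approach is to keep every positive document and a random fraction of the negative ones. The sketch below shows that idea on toy data. The function name `downsample`, the fraction `0.5`, and the seed are all assumptions chosen for illustration.

```python
import random

def downsample(positives, negatives, negative_fraction=0.5, seed=42):
    """Keep all positive documents plus a random fraction of the negatives."""
    rng = random.Random(seed)  # local generator so the result is reproducible
    n_keep = int(len(negatives) * negative_fraction)
    kept_negatives = rng.sample(negatives, n_keep)
    combined = positives + kept_negatives
    rng.shuffle(combined)  # avoid having all positives at the front
    return combined

# Hypothetical documents standing in for positive_samples / negative_samples.
toy_positive = [{"document": i, "labels": ["B-NAME_STUDENT"]} for i in range(3)]
toy_negative = [{"document": 100 + i, "labels": ["O"]} for i in range(10)]

subset = downsample(toy_positive, toy_negative)
print(len(subset))  # 3 positives + 5 negatives
```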

### Data Preprocessing:
Feature Engineering: create new features from existing data to reveal additional insights.  
Data Transformation: transform data into a desired format required by machine learning model.

In [7]:
#### creating label-id mappings for data labels
# use a set comprehension to flatten all labels from the 'labels' key in each item of 'data' and remove duplicates
all_labels = sorted({label for sentence in data for label in sentence['labels']})
print('######## All Labels ########')
print(all_labels)

# create a dictionary mapping each label to a unique ID using enumerate
label2id = {label: id for id, label in enumerate(all_labels)}
print('######## Labels to ID ########')
print(label2id)

# reverse the label2id dict to create a mapping from IDs back to labels
# items() returns a view of the dictionary's (key, value) pairs
id2label = {id: label for label, id in label2id.items()}
print('######## ID to Labels ########')
print(id2label)

######## All Labels ########
['B-EMAIL', 'B-ID_NUM', 'B-NAME_STUDENT', 'B-PHONE_NUM', 'B-STREET_ADDRESS', 'B-URL_PERSONAL', 'B-USERNAME', 'I-ID_NUM', 'I-NAME_STUDENT', 'I-PHONE_NUM', 'I-STREET_ADDRESS', 'I-URL_PERSONAL', 'O']
######## Labels to ID ########
{'B-EMAIL': 0, 'B-ID_NUM': 1, 'B-NAME_STUDENT': 2, 'B-PHONE_NUM': 3, 'B-STREET_ADDRESS': 4, 'B-URL_PERSONAL': 5, 'B-USERNAME': 6, 'I-ID_NUM': 7, 'I-NAME_STUDENT': 8, 'I-PHONE_NUM': 9, 'I-STREET_ADDRESS': 10, 'I-URL_PERSONAL': 11, 'O': 12}
######## ID to Labels ########
{0: 'B-EMAIL', 1: 'B-ID_NUM', 2: 'B-NAME_STUDENT', 3: 'B-PHONE_NUM', 4: 'B-STREET_ADDRESS', 5: 'B-URL_PERSONAL', 6: 'B-USERNAME', 7: 'I-ID_NUM', 8: 'I-NAME_STUDENT', 9: 'I-PHONE_NUM', 10: 'I-STREET_ADDRESS', 11: 'I-URL_PERSONAL', 12: 'O'}


In [8]:
#### custom function that tokenizes the original text into sub-word tokens the model can process
def tokenize(input_dict, tokenizer, label_map, max_seq_length):
    '''
    Tokenize the input example using the provided tokenizer and assign labels.
    '''
    
    # reconstruct the text from tokens and assign labels, handling whitespace
    reconstructed_text = []
    char_labels = []
    
    for token, label, has_whitespace in zip(
        input_dict["tokens"], 
        input_dict["provided_labels"], 
        input_dict["trailing_whitespace"]
    ):
        reconstructed_text.append(token)
        char_labels.extend([label] * len(token))
        
        if has_whitespace:
            reconstructed_text.append(" ")
            char_labels.append("O")  # assign "O" label for whitespace
    
    # join tokens into a single string and convert character labels to a NumPy array
    text = ''.join(reconstructed_text)
    labels = np.array(char_labels)
    
    # tokenize the text with offset mapping to align tokens with character labels
    tokenized_output = tokenizer(
        text, 
        return_offsets_mapping=True, 
        max_length=max_seq_length, 
        truncation=True)
    
    # initialize the list to store labels for each token
    token_labels = []  
    
    # assign labels to each token based on the character label at the start index
    for start_idx, end_idx in tokenized_output.offset_mapping:
        # handle special tokens (e.g., CLS token at the start)
        if start_idx == 0 and end_idx == 0:
            token_labels.append(label_map["O"])
            # using continue to start new iteration
            continue
        
        # skip leading whitespace in the token
        if text[start_idx].isspace():
            start_idx += 1
        # append the corresponding token label
        token_labels.append(label_map[labels[start_idx]])
    
    # add the token labels and the sequence length to the tokenized output
    tokenized_output['labels'] = token_labels
    tokenized_output['sequence_length'] = len(tokenized_output.input_ids)
    
    return tokenized_output
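
To make the alignment logic of `tokenize()` easier to follow, here is a minimal sketch that reproduces its two steps without a real tokenizer: build one label per character, then look up the label at the first non-space character of each sub-word piece. The `offsets` list is hand-written and hypothetical; a real tokenizer would produce it via `return_offsets_mapping=True`. The helper names are also assumptions.

```python
def char_labels_from_tokens(tokens, labels, trailing_whitespace):
    """Rebuild the text and give every character the label of its source token."""
    pieces, char_labels = [], []
    for token, label, has_space in zip(tokens, labels, trailing_whitespace):
        pieces.append(token)
        char_labels.extend([label] * len(token))
        if has_space:
            pieces.append(" ")
            char_labels.append("O")  # whitespace never belongs to an entity
    return "".join(pieces), char_labels

def labels_for_offsets(text, char_labels, offsets):
    """Label each sub-word piece by the first non-space character it covers."""
    result = []
    for start, end in offsets:
        if start == 0 and end == 0:  # special tokens report the span (0, 0)
            result.append("O")
            continue
        if text[start].isspace():  # skip a leading space inside the piece
            start += 1
        result.append(char_labels[start])
    return result

tokens = ["Nathalie", "Sylla", "wrote"]
labels = ["B-NAME_STUDENT", "I-NAME_STUDENT", "O"]
trailing_whitespace = [True, True, False]
text, char_labels = char_labels_from_tokens(tokens, labels, trailing_whitespace)

# Hypothetical character spans for: [CLS] N atha lie ▁S ylla ▁wrote [SEP]
offsets = [(0, 0), (0, 1), (1, 5), (5, 8), (8, 10), (10, 14), (14, 20), (0, 0)]
piece_labels = labels_for_offsets(text, char_labels, offsets)
print(text)
print(piece_labels)
```

Note that every piece of the first word receives `B-NAME_STUDENT`, because each piece simply inherits the label of the character it starts on.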

### Data Modeling:
Train your models on the dataset, and tune them for better performance.

In [9]:
# use AutoTokenizer.from_pretrained() to ensure that the tokenization
# aligns with the pre-trained model's expectations
tokenizer = AutoTokenizer.from_pretrained(TRAINING_MODEL_PATH)

# prepare the dataset 
processed_dataset = Dataset.from_dict({
    'full_text': [x['full_text'] for x in data],
    'document': [str(x['document']) for x in data],
    'tokens': [x['tokens'] for x in data],
    'trailing_whitespace': [x['trailing_whitespace'] for x in data],
    # rename 'labels' to 'provided_labels'
    'provided_labels': [x['labels'] for x in data] 
})

# apply the tokenize function
# map() applies tokenize() to every example of the dataset
processed_dataset = processed_dataset.map(
    tokenize, # the function applied to each example
    fn_kwargs={
        "tokenizer": tokenizer, 
        "label_map": label2id, 
        "max_seq_length": TRAINING_MAX_LENGTH
    }, 
    num_proc=3 # 3 parallel processes 
)


### Data Interpretation:
Translate the model's findings into insights that are relevant to the objectives.
Use domain knowledge to provide context to the results and identify potential biases or limitations in the analysis.

In [10]:
# take one document as an example
e = processed_dataset[0]

# print out tokens and their provided labels from the original dataset,
# but only those tokens which are labeled as entities (not "O").
for token, label in zip(e["tokens"], e["provided_labels"]):
    # check if the label is not "O" (Outside of an entity)
    if label != "O":
        # Print the token and label pair
        print((token, label))

# print a separator line to distinguish between the two outputs
print("#" * 20)

# now, using the tokenizer to convert token IDs back to tokens, and print them
# with their corresponding labels from the processed dataset.
for token_id, label_id in zip(e["input_ids"], e["labels"]):
    # convert the token ID back to its token string representation
    token = tokenizer.convert_ids_to_tokens([token_id])[0]
    # retrieve the string label from the 'id2label' mapping using the label ID
    label = id2label[label_id]
    # check if the label is not "O"
    if label != "O":
        print((token, label))

('Nathalie', 'B-NAME_STUDENT')
('Sylla', 'I-NAME_STUDENT')
('Nathalie', 'B-NAME_STUDENT')
('Sylla', 'I-NAME_STUDENT')
('Nathalie', 'B-NAME_STUDENT')
('Sylla', 'I-NAME_STUDENT')
####################
('N', 'B-NAME_STUDENT')
('atha', 'B-NAME_STUDENT')
('lie', 'B-NAME_STUDENT')
('▁S', 'I-NAME_STUDENT')
('ylla', 'I-NAME_STUDENT')
('N', 'B-NAME_STUDENT')
('atha', 'B-NAME_STUDENT')
('lie', 'B-NAME_STUDENT')
('▁S', 'I-NAME_STUDENT')
('ylla', 'I-NAME_STUDENT')
('N', 'B-NAME_STUDENT')
('atha', 'B-NAME_STUDENT')
('lie', 'B-NAME_STUDENT')
('▁S', 'I-NAME_STUDENT')
('ylla', 'I-NAME_STUDENT')
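
The second half of the output shows that one original token such as `Nathalie` is split into several sub-word pieces, each carrying a label. To compare labels or predictions against the original tokens, the pieces have to be mapped back to token level. The sketch below shows one possible way: each original token takes the label of the piece that covers its first character. The helper names and the hand-written `offsets` are hypothetical and not part of the original notebook.

```python
def token_char_starts(tokens, trailing_whitespace):
    """Character index at which each original token starts in the rebuilt text."""
    starts, position = [], 0
    for token, has_space in zip(tokens, trailing_whitespace):
        starts.append(position)
        position += len(token) + (1 if has_space else 0)
    return starts

def word_level_labels(tokens, trailing_whitespace, offsets, piece_labels):
    """Give each original token the label of the piece covering its first character."""
    word_labels = []
    for start in token_char_starts(tokens, trailing_whitespace):
        label = "O"  # fallback, e.g. when the token was truncated away
        for (piece_start, piece_end), piece_label in zip(offsets, piece_labels):
            if piece_start <= start < piece_end:
                label = piece_label
                break
        word_labels.append(label)
    return word_labels

tokens = ["Nathalie", "Sylla", "wrote"]
trailing_whitespace = [True, True, False]
# Hypothetical spans and labels for: [CLS] N atha lie ▁S ylla ▁wrote [SEP]
offsets = [(0, 0), (0, 1), (1, 5), (5, 8), (8, 10), (10, 14), (14, 20), (0, 0)]
piece_labels = [
    "O", "B-NAME_STUDENT", "B-NAME_STUDENT", "B-NAME_STUDENT",
    "I-NAME_STUDENT", "I-NAME_STUDENT", "O", "O",
]

word_level = word_level_labels(tokens, trailing_whitespace, offsets, piece_labels)
print(list(zip(tokens, word_level)))
```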


### Model Evaluation:

Assess the model using appropriate metrics.

In [11]:
from seqeval.metrics import recall_score, precision_score, f1_score

def compute_metrics(p, all_labels):
    """
    Compute performance metrics for token classification.
    
    Args:
        p (Tuple[np.ndarray, np.ndarray]): A tuple containing the model predictions and labels.
        all_labels (List[str]): A list of all possible labels.
    
    Returns:
        dict: A dictionary containing recall, precision, and F1 score.
    """
    predictions, labels = p
    # Convert logits to label IDs
    predictions = np.argmax(predictions, axis=2)

    # Convert the index predictions to actual label predictions
    # Only take into account non-ignored tokens (-100 in PyTorch signifies ignore index)
    true_predictions = [
        [all_labels[pred] for (pred, label) in zip(prediction_row, label_row) if label != -100]
        for prediction_row, label_row in zip(predictions, labels)
    ]
    
    true_labels = [
        [all_labels[label] for (pred, label) in zip(prediction_row, label_row) if label != -100]
        for prediction_row, label_row in zip(predictions, labels)
    ]
    
    # Calculate recall, precision, and f1 score
    recall = recall_score(true_labels, true_predictions)
    precision = precision_score(true_labels, true_predictions)
    f1 = f1_score(true_labels, true_predictions)  # Directly use the f1_score from seqeval
    
    # Prepare the results dictionary
    results = {
        'recall': recall,
        'precision': precision,
        'f1': f1
    }
    
    return results
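
`seqeval` scores whole entity spans rather than individual tokens. The following sketch illustrates that idea in plain Python: it extracts `(type, start, end)` spans from BIO tags and compares the predicted spans with the true ones. It is a simplified illustration written for this explanation, not `seqeval`'s implementation, and its handling of edge cases may differ. The function names are assumptions.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from one BIO-tagged sequence."""
    entities, start, current = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current
        if current is not None and not continues:
            entities.append((current, start, i - 1))
            current = None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, entity_type
    return entities

def entity_prf(true_sequences, pred_sequences):
    """Entity-level precision, recall and F1 over a list of sequences."""
    true_set, pred_set = set(), set()
    for doc_id, (true_tags, pred_tags) in enumerate(zip(true_sequences, pred_sequences)):
        true_set |= {(doc_id, *entity) for entity in extract_entities(true_tags)}
        pred_set |= {(doc_id, *entity) for entity in extract_entities(pred_tags)}
    correct = len(true_set & pred_set)  # spans must match in type and boundaries
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [["B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]]
y_pred = [["B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O"]]
precision, recall, f1 = entity_prf(y_true, y_pred)
print(precision, recall, f1)
```

In this toy case the name span is found but the email is missed, so precision is perfect while recall is one half.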

In [12]:
# load pre-trained model
model = AutoModelForTokenClassification.from_pretrained(
    TRAINING_MODEL_PATH, # the path of the pre-trained model
    num_labels=len(all_labels), # the number of labels the model should predict
    id2label=id2label, # dictionary mapping numerical label IDs to label strings
    label2id=label2id, # dictionary mapping label strings to numerical IDs
    # ignore size mismatches between the checkpoint's weights and the model
    # built for num_labels (e.g. in the classification head)
    ignore_mismatched_sizes=True)

# pads inputs and labels in each batch to a common length that is a multiple of 16
collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=16)

Some weights of DebertaV2ForTokenClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
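
The collator created in this cell pads each batch dynamically, and padded label positions are filled with `-100`, which is the value `compute_metrics` filters out. The sketch below mimics that behaviour in plain Python so the effect of `pad_to_multiple_of=16` is visible. It is an illustration only, not the library's code; `pad_batch` and the `pad_token_id=0` placeholder are assumptions.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100, multiple=16):
    """Pad input_ids and labels to the longest sequence, rounded up to `multiple`."""
    longest = max(len(example["input_ids"]) for example in batch)
    target = ((longest + multiple - 1) // multiple) * multiple  # round up
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        n_real = len(example["input_ids"])
        n_pad = target - n_real
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * n_pad)
        padded["attention_mask"].append([1] * n_real + [0] * n_pad)
        padded["labels"].append(example["labels"] + [label_pad_id] * n_pad)
    return padded

# Two hypothetical tokenized examples of different lengths (label 12 is 'O').
toy_batch = [
    {"input_ids": list(range(1, 19)), "labels": [12] * 18},  # 18 tokens
    {"input_ids": list(range(1, 6)), "labels": [12] * 5},    # 5 tokens
]
padded = pad_batch(toy_batch)
print([len(ids) for ids in padded["input_ids"]])  # both padded to 32
```

Padding to a multiple of 16 rather than to `TRAINING_MAX_LENGTH` keeps short batches small while still giving tensor shapes that tend to suit mixed-precision hardware.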


In [13]:
from functools import partial

# the hyperparameters and settings for the training process are specified.
args = TrainingArguments(
    output_dir=OUTPUT_DIR, # the directory where the training outputs will be saved
    fp16=True, # enable mixed-precision training, reducing memory usage and potentially speeding up training
    learning_rate=2e-5, # peak learning rate for the optimizer
    num_train_epochs=3, # number of epochs to train for
    per_device_train_batch_size=4, # each GPU/CPU processes 4 examples per batch
    gradient_accumulation_steps=2, # number of batches whose gradients are accumulated before each optimizer update
    report_to='none', # do not report to external experiment trackers
    evaluation_strategy='no', # disable evaluation during training
    do_eval=False, # do not run evaluation on an eval_dataset
    save_total_limit=1, # the maximum number of model checkpoints to keep
    logging_steps=20, # log training information every 20 steps
    lr_scheduler_type='cosine', # cosine learning rate schedule
    metric_for_best_model="f1", # the metric used when comparing model checkpoints (needs evaluation to be enabled)
    greater_is_better=True, # a higher metric score is better for metric_for_best_model
    warmup_ratio=0.1, # fraction of training steps used for linear learning rate warmup
    weight_decay=0.01 # weight decay coefficient applied by the optimizer
)

# Trainer() is a class that provides an API for training and evaluation
trainer = Trainer(
    model=model, # the model to be trained or fine-tuned
    args=args, # training arguments defined above
    train_dataset=processed_dataset, # the tokenized dataset used for training
    data_collator=collator, # batches examples and pads inputs and labels
    tokenizer=tokenizer, # the tokenizer used to process data
    compute_metrics=partial(compute_metrics, all_labels=all_labels), # metric function, called only when evaluation runs
)
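
The combination of `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` means the learning rate first rises linearly and then decays along half a cosine wave. The sketch below shows that shape in plain Python. It is a simplified illustration of the commonly used formula, not the library's own code, and `total_steps` and `warmup_steps` are hypothetical round numbers.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, base_lr=2e-5):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

total_steps = 1000   # hypothetical number of optimizer updates
warmup_steps = 100   # 10% of total_steps, matching warmup_ratio=0.1

for step in (0, 50, 100, 550, 1000):
    print(step, f"{cosine_lr(step, total_steps, warmup_steps):.2e}")
```

The rate peaks at `learning_rate` when warmup ends and reaches roughly half of it midway through the decay phase.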

In [14]:
%%time
trainer.train()

Step,Training Loss
20,3.454
40,2.9668
60,1.453
80,0.1354
100,0.0108
120,0.0113
140,0.0183
160,0.0123
180,0.0056
200,0.0082


CPU times: user 53min 39s, sys: 13min 12s, total: 1h 6min 52s
Wall time: 1h 6min 50s


TrainOutput(global_step=2553, training_loss=0.06462583896945513, metrics={'train_runtime': 4010.4109, 'train_samples_per_second': 5.092, 'train_steps_per_second': 0.637, 'total_flos': 9461924226456672.0, 'train_loss': 0.06462583896945513, 'epoch': 3.0})
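
The numbers in `TrainOutput` can be cross-checked against the training arguments with a little arithmetic. The sketch below assumes training ran on a single device and that an incomplete group of accumulated batches at the end of an epoch does not count as an extra update; both assumptions are consistent with the reported `global_step`.

```python
import math

num_examples = 6807           # length of the data reported earlier
per_device_batch_size = 4     # per_device_train_batch_size
grad_accum = 2                # gradient_accumulation_steps
epochs = 3                    # num_train_epochs
num_devices = 1               # assumption: a single GPU
train_runtime = 4010.4109     # seconds, from TrainOutput

batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs
effective_batch_size = per_device_batch_size * num_devices * grad_accum

samples_per_second = num_examples * epochs / train_runtime
steps_per_second = total_updates / train_runtime

print(batches_per_epoch, updates_per_epoch, total_updates)  # 1702 851 2553
print(effective_batch_size)                                 # 8
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The computed total of 2553 updates matches `global_step`, and the throughput figures agree with `train_samples_per_second` and `train_steps_per_second` after rounding.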