## 🛑 Wait a second - after this you should also look at the inference notebook
- My inference notebook (containing equally many emojis) is here: I would love an upvote if you use the notebook or learned something new!
- https://www.kaggle.com/code/valentinwerner/893-deberta3base-inference

## 🏟️ Credits (because this baseline did mostly already exist when I joined)

- @Nicholas Broad published the transformer baseline which performs only marginally worse: https://www.kaggle.com/code/nbroad/transformer-ner-baseline-lb-0-854
- @Joseph Josia published the training notebook which I basically copy pasted (which is based itself on nbroad, but yeah): https://www.kaggle.com/code/takanashihumbert/piidd-deberta-model-starter-training



## 💡 What I added
- Downsampling negative samples (samples without labels, but they possibly still work as examples where names should not be tagged as names)
- Adding @moth's external data: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/discussion/469493
- Adding @PJMathematician's external data: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/discussion/470921
- However, I used my cleaned version instead (the punctuation was flawed in the original dataset at the time of this training): https://www.kaggle.com/code/valentinwerner/fix-punctuation-tokenization-external-dataset

Doing this brought the LB score to .888 - Trained in Kaggle Notebook, no tricks or secrets.

- I added emojis because that seems to be the kaggle upvote meta

## 📝 Config & Imports
- 1024 max length has been working well for me. As some samples are longer, you may want to go as high as you can 
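
To pick a max length, it can help to check how many documents would be cut off at a given cap. This is a minimal sketch, assuming `lengths` holds per-document token counts obtained by tokenizing without a `max_length` cap. The helper name `share_truncated` and the example numbers are made up for illustration and are not taken from the competition data.

```python
def share_truncated(lengths, max_length):
    """Return the fraction of documents longer than max_length tokens."""
    if not lengths:
        return 0.0
    too_long = sum(1 for n in lengths if n > max_length)
    return too_long / len(lengths)

# hypothetical token counts, not measured on the competition data
example_lengths = [310, 800, 1024, 1500, 2200]
for cap in (512, 1024, 2048):
    print(cap, share_truncated(example_lengths, cap))
```

If the share at your chosen cap is still large, a higher cap (memory permitting) is probably worth trying.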

In [1]:
TRAINING_MODEL_PATH = "microsoft/deberta-v3-base"
TRAINING_MAX_LENGTH = 1024
OUTPUT_DIR = "output"

In [2]:
!pip install seqeval evaluate -q

In [3]:
import json
import argparse
from itertools import chain
from functools import partial

import torch
from transformers import AutoTokenizer, Trainer, TrainingArguments
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification
import evaluate
from datasets import Dataset, features
import numpy as np



## 🗺️ Data Selection and Label Mapping
- As mentioned before, I additionally use the external datasets (in my cleaned versions)

In [4]:
data = json.load(open("/kaggle/input/pii-detection-removal-from-educational-data/train.json"))

# downsampling of negative examples
p=[] # positive samples (contain relevant labels)
n=[] # negative samples (no PII labels; presumably contain tokens that could be wrongly tagged as entities)
for d in data:
    if any(np.array(d["labels"]) != "O"): p.append(d)
    else: n.append(d)
print("original datapoints: ", len(data))

external = json.load(open("/kaggle/input/fix-punctuation-tokenization-external-dataset/pii_dataset_fixed.json"))
print("external datapoints: ", len(external))

moredata = json.load(open("/kaggle/input/fix-punctuation-tokenization-external-dataset/moredata_dataset_fixed.json"))
print("moredata datapoints: ", len(moredata))

data = moredata+external+p+n[:len(n)//3]
print("combined: ", len(data))

original datapoints:  6807
external datapoints:  4434
moredata datapoints:  2000
combined:  9333
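
The cell above keeps the first third of the negative documents in file order. As a possible variant (not what this notebook does), the negatives could be drawn at random with a fixed seed. This is a minimal sketch on toy data; the function name `downsample_negatives` and the parameters `keep_divisor` and `seed` are hypothetical.

```python
import random

def downsample_negatives(positives, negatives, keep_divisor=3, seed=42):
    """Keep all positives plus a seeded random 1/keep_divisor of the negatives."""
    rng = random.Random(seed)                  # local RNG, reproducible
    n_keep = len(negatives) // keep_divisor    # same count as n[:len(n)//3]
    kept = rng.sample(negatives, n_keep)
    return positives + kept

# toy documents: only the "labels" field matters here
docs = [{"labels": ["O", "B-NAME_STUDENT"]}] * 2 + [{"labels": ["O", "O"]}] * 9
pos = [d for d in docs if any(l != "O" for l in d["labels"])]
neg = [d for d in docs if all(l == "O" for l in d["labels"])]

sampled = downsample_negatives(pos, neg)
print(len(pos), len(neg), len(sampled))
```

Whether random sampling helps compared to taking the first third is untested here; it only removes any dependence on file order.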


In [8]:
data[0].keys()

dict_keys(['document', 'full_text', 'tokens', 'trailing_whitespace', 'labels'])

In [None]:
data[0]['labels']

In [5]:
all_labels = sorted(list(set(chain(*[x["labels"] for x in data]))))
label2id = {l: i for i,l in enumerate(all_labels)}
id2label = {v:k for k,v in label2id.items()}

target = [
    'B-EMAIL', 'B-ID_NUM', 'B-NAME_STUDENT', 'B-PHONE_NUM', 
    'B-STREET_ADDRESS', 'B-URL_PERSONAL', 'B-USERNAME', 'I-ID_NUM', 
    'I-NAME_STUDENT', 'I-PHONE_NUM', 'I-STREET_ADDRESS', 'I-URL_PERSONAL'
]

print(id2label)

{0: 'B-EMAIL', 1: 'B-ID_NUM', 2: 'B-NAME_STUDENT', 3: 'B-PHONE_NUM', 4: 'B-STREET_ADDRESS', 5: 'B-URL_PERSONAL', 6: 'B-USERNAME', 7: 'I-ID_NUM', 8: 'I-NAME_STUDENT', 9: 'I-PHONE_NUM', 10: 'I-STREET_ADDRESS', 11: 'I-URL_PERSONAL', 12: 'O'}


## ♟️ Data Tokenization
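- The tokenization is actually special compared to usual NLP challenges: labels come per provided token, so they have to be mapped onto the model's subword tokens via character offsets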

In [6]:
def tokenize(example, tokenizer, label2id, max_length):

    # rebuild text from tokens
    text = []
    labels = []

    for t, l, ws in zip(
        example["tokens"], example["provided_labels"], example["trailing_whitespace"]
    ):
        text.append(t)
        labels.extend([l] * len(t))

        if ws:
            text.append(" ")
            labels.append("O")

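    # actual tokenization (max_length alone falls back to 'longest_first' truncation; truncation=True would make this explicit)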
    tokenized = tokenizer("".join(text), return_offsets_mapping=True, max_length=max_length)

    labels = np.array(labels)

    text = "".join(text)
    token_labels = []

    for start_idx, end_idx in tokenized.offset_mapping:
        # CLS token
        if start_idx == 0 and end_idx == 0:
            token_labels.append(label2id["O"])
            continue

        # case when token starts with whitespace
        if text[start_idx].isspace():
            start_idx += 1

        token_labels.append(label2id[labels[start_idx]])

    length = len(tokenized.input_ids)

    return {**tokenized, "labels": token_labels, "length": length}
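
The alignment logic in `tokenize` can be tried on a toy example without loading a tokenizer. This is a minimal sketch: the helper names `char_labels` and `align` are made up, and the `offsets` list is hand-written to stand in for a tokenizer's `offset_mapping` (the subword split shown is hypothetical).

```python
def char_labels(tokens, labels, trailing_whitespace):
    """Rebuild the text and give every character the label of its token."""
    text, per_char = [], []
    for tok, lab, ws in zip(tokens, labels, trailing_whitespace):
        text.append(tok)
        per_char.extend([lab] * len(tok))
        if ws:
            text.append(" ")
            per_char.append("O")   # whitespace between tokens is never PII
    return "".join(text), per_char

def align(text, per_char, offsets):
    """Label each subword span with the label of its first non-space character."""
    out = []
    for start, end in offsets:
        if start == 0 and end == 0:    # special token such as [CLS]
            out.append("O")
            continue
        if text[start].isspace():      # span begins with the leading space
            start += 1
        out.append(per_char[start])
    return out

tokens = ["Hi", "Richard", "!"]
labels = ["O", "B-NAME_STUDENT", "O"]
trailing_ws = [True, False, False]
text, per_char = char_labels(tokens, labels, trailing_ws)

# hand-written offsets standing in for tokenizer output (hypothetical split)
offsets = [(0, 0), (0, 2), (2, 6), (6, 10), (10, 11), (0, 0)]
print(text)
print(align(text, per_char, offsets))
```

Note that every subword of a labeled token inherits the same label, which is why a split e-mail address shows `B-EMAIL` on each piece in the output further down.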

In [7]:
tokenizer = AutoTokenizer.from_pretrained(TRAINING_MODEL_PATH)

ds = Dataset.from_dict({
    "full_text": [x["full_text"] for x in data],
    "document": [str(x["document"]) for x in data],
    "tokens": [x["tokens"] for x in data],
    "trailing_whitespace": [x["trailing_whitespace"] for x in data],
    "provided_labels": [x["labels"] for x in data],
})
ds = ds.map(tokenize, fn_kwargs={"tokenizer": tokenizer, "label2id": label2id, "max_length": TRAINING_MAX_LENGTH}, num_proc=3)
# ds = ds.class_encode_column("group")


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


 



In [8]:
x = ds[0]

for t,l in zip(x["tokens"], x["provided_labels"]):
    if l != "O":
        print((t,l))

print("*"*100)

for t, l in zip(tokenizer.convert_ids_to_tokens(x["input_ids"]), x["labels"]):
    if id2label[l] != "O":
        print((t,id2label[l]))

('Richard', 'B-NAME_STUDENT')
('Chang', 'B-NAME_STUDENT')
('gwilliams@yahoo.com', 'B-EMAIL')
('Richard', 'B-NAME_STUDENT')
('Richard', 'B-NAME_STUDENT')
('Richard', 'B-NAME_STUDENT')
('711', 'B-STREET_ADDRESS')
('Golden', 'I-STREET_ADDRESS')
('Overpass', 'I-STREET_ADDRESS')
('West', 'I-STREET_ADDRESS')
('Andreaville', 'I-STREET_ADDRESS')
('OH', 'I-STREET_ADDRESS')
('Richard', 'B-NAME_STUDENT')
('Richard', 'B-NAME_STUDENT')
('Richard', 'B-NAME_STUDENT')
****************************************************************************************************
('▁Richard', 'B-NAME_STUDENT')
('▁Chang', 'B-NAME_STUDENT')
('▁g', 'B-EMAIL')
('william', 'B-EMAIL')
('s', 'B-EMAIL')
('@', 'B-EMAIL')
('yahoo', 'B-EMAIL')
('.', 'B-EMAIL')
('com', 'B-EMAIL')
('▁Richard', 'B-NAME_STUDENT')
('▁Richard', 'B-NAME_STUDENT')
('▁Richard', 'B-NAME_STUDENT')
('▁711', 'B-STREET_ADDRESS')
('▁Golden', 'I-STREET_ADDRESS')
('▁Over', 'I-STREET_ADDRESS')
('pass', 'I-STREET_ADDRESS')
('▁West', 'I-STREET_ADDRESS')
('▁Andr

## 🧮 Competition metrics
- Note that we are not using the normal F1 score: the metric is F-beta with beta = 5, which weights recall much more heavily than precision.
- Although it is early in the competition, there are plenty of discussions already explaining this:
- e.g., here: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/discussion/470024
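
To get a feeling for how strongly beta = 5 favors recall, here is a small standalone sketch of the F-beta formula. The function name `fbeta` and the precision/recall values are made up for illustration.

```python
def fbeta(precision, recall, beta=5.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:          # no true positives at all
        return 0.0
    return (1 + b2) * precision * recall / denom

# two hypothetical models with swapped precision and recall
high_recall = fbeta(precision=0.5, recall=0.9)
high_precision = fbeta(precision=0.9, recall=0.5)
print(round(high_recall, 4), round(high_precision, 4))
```

With beta = 5 the high-recall model scores far better, so missing PII costs much more than flagging a few tokens too many.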

In [9]:
from seqeval.metrics import recall_score, precision_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score

def compute_metrics(p, all_labels):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [all_labels[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [all_labels[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
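    recall = recall_score(true_labels, true_predictions)
    precision = precision_score(true_labels, true_predictions)
    # F-beta with beta=5 (recall is weighted far more than precision)
    beta2 = 5 * 5
    denom = beta2 * precision + recall
    # guard against 0/0 when neither precision nor recall is positive
    f5_score = (1 + beta2) * recall * precision / denom if denom > 0 else 0.0
    
    results = {
        'recall': recall,
        'precision': precision,
        'f1': f5_score  # key stays 'f1' to match metric_for_best_model
    }
    return results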

In [10]:
model = AutoModelForTokenClassification.from_pretrained(
    TRAINING_MODEL_PATH,
    num_labels=len(all_labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)
collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=16)

pytorch_model.bin:   0%|          | 0.00/371M [00:00<?, ?B/s]

Some weights of DebertaV2ForTokenClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
# I decided to use no eval set
# final_ds = ds.train_test_split(test_size=0.2, seed=42) # cannot use stratify_by_column='group'
# final_ds

## 🏋🏻‍♀️ Training
- For the submission model I actually do not hold out an eval set, so that I can train on all data
- Values are not really tuned and go by gut feeling, as this is my first iteration / baseline

In [12]:
# I actually chose to not use any validation set. This is only for the model I use for submission.
args = TrainingArguments(
    output_dir=OUTPUT_DIR, 
    fp16=True,
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    report_to="none",
    evaluation_strategy="no",
    do_eval=False,
    save_total_limit=1,
    logging_steps=20,
    lr_scheduler_type='cosine',
    metric_for_best_model="f1",
    greater_is_better=True,
    warmup_ratio=0.1,
    weight_decay=0.01
)

trainer = Trainer(
    model=model, 
    args=args, 
    train_dataset=ds,
    data_collator=collator, 
    tokenizer=tokenizer,
    compute_metrics=partial(compute_metrics, all_labels=all_labels),
)

In [13]:
%%time
trainer.train()

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 20 | 2.4298 |
| 40 | 1.3488 |
| 60 | 0.4117 |
| 80 | 0.1376 |
| 100 | 0.0842 |
| 120 | 0.0507 |
| 140 | 0.0378 |
| 160 | 0.0313 |
| 180 | 0.0207 |
| 200 | 0.0201 |




CPU times: user 1h 46min 27s, sys: 26min 48s, total: 2h 13min 16s
Wall time: 1h 18min 1s


TrainOutput(global_step=1749, training_loss=0.057461005176021894, metrics={'train_runtime': 4680.5974, 'train_samples_per_second': 5.982, 'train_steps_per_second': 0.374, 'total_flos': 1.2183608904271872e+16, 'train_loss': 0.057461005176021894, 'epoch': 3.0})
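
The logged `global_step=1749` can be roughly reproduced from the training arguments. This is a simplified sketch of the step arithmetic, not the Trainer's actual code. It assumes two GPUs, which the notebook does not state; `n_devices=2` is simply the value that matches the log. The function name `optimizer_steps` is made up.

```python
import math

def optimizer_steps(n_samples, per_device_batch, n_devices, grad_accum, epochs):
    """Approximate number of optimizer updates over the whole training run."""
    # batches the dataloader yields per epoch (last partial batch included)
    batches_per_epoch = math.ceil(n_samples / (per_device_batch * n_devices))
    # one optimizer update per grad_accum batches
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

total = optimizer_steps(
    n_samples=9333,        # "combined" count printed above
    per_device_batch=4,
    n_devices=2,           # assumption, not stated in the notebook
    grad_accum=2,
    epochs=3,
)
warmup = math.ceil(total * 0.1)   # warmup_ratio=0.1
print(total, warmup)
```

Under this assumption the effective batch size is 4 x 2 x 2 = 16 samples per update.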

## 💾 Save models
- You can click on "Save version" (top right) and "Save & Run All (Commit)"
- Then you can use this notebook as input for your inference notebook

In [None]:
trainer.save_model("deberta3base_1024")
tokenizer.save_pretrained("deberta3base_1024")