# This one works now, so best not change it. Keep it as a reference point

# HuggingFace Training Baseline

I wanted to create my own baseline for this competition, and I tried to do so "without peeking" at the kernels published by others. Ideally this can be used for training on a Kaggle kernel. Let's see how good we can get.

This baseline is based on the following notebook by Sylvain Gugger: https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb

I initially started building with RoBERTa - thanks to Chris Deotte for pointing to Longformer :) The evaluation code is from Rob Mulla.

The notebook takes a couple of hours to run, so we'll use W&B to monitor it along the way and keep a record of our experiments.

## Setup

In [91]:
import os

SAMPLE = False # set True for debugging

In [94]:
import wandb
from wandb_creds import *

wandb.login(key=API_KEY)
wandb.init(project="feedback_prize_pytorch", tags=TAGS, entity="feedback_prize_michael_and_wilson")







huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [95]:
# CONFIG  # mk: i put these in hf_config.py

EXP_NUM = 1
task = "ner"
model_checkpoint = "longformer-large-4096-hf"
max_length = 1024
stride = 128
min_tokens = 6
model_path = f'{model_checkpoint.split("/")[-1]}-{EXP_NUM}'
DATA_DIR = 'data'

# TRAINING HYPERPARAMS
BS = 1
GRAD_ACC = 8
LR = 5e-5
WD = 0.01
WARMUP = 0.1
N_EPOCHS = 5

## Data Preprocessing

In [96]:
import pandas as pd

# read train data
train = pd.read_csv(os.path.join(DATA_DIR,'train.csv'))  # mk: i put this in Config of config.py
train.head(1)

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...


In [97]:
# check unique classes
classes = train.discourse_type.unique().tolist()  # mk: i put this in Config of hf_config.py
classes

['Lead',
 'Position',
 'Evidence',
 'Claim',
 'Concluding Statement',
 'Counterclaim',
 'Rebuttal']

In [98]:
# setup label indices
l2i = {}  # mk: i put this in Config of hf_config.py

for i, c in enumerate(classes):  # mk: i put this in hf_functions.py as label_to_index
    l2i[f'B-{c}'] = i
    l2i[f'I-{c}'] = i + len(classes)
l2i['O'] = len(classes) * 2
l2i['Special'] = -100

# inverse mapping  # mk: i put this in hf_functions as index_to label
i2l = {v: k for k, v in l2i.items()}

N_LABELS = len(i2l) - 1  # not accounting for -100  # mk: i put this in hf_functions.py as create_n_labels
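The mapping above is a BIO scheme: every class gets a `B-` (beginning) id in 0..6 and an `I-` (inside) id in 7..13, `O` is 14, and -100 is reserved for special tokens (it is the default ignore index of PyTorch's cross-entropy loss). A minimal sketch that rebuilds it, with the class list from the previous cell hard-coded so it runs on its own:

```python
# Class order as printed by the "check unique classes" cell
classes = ['Lead', 'Position', 'Evidence', 'Claim',
           'Concluding Statement', 'Counterclaim', 'Rebuttal']

l2i = {}
for i, c in enumerate(classes):
    l2i[f'B-{c}'] = i                  # beginning-of-span ids: 0..6
    l2i[f'I-{c}'] = i + len(classes)   # inside-of-span ids: 7..13
l2i['O'] = len(classes) * 2            # outside any span: 14
l2i['Special'] = -100                  # special tokens, ignored by the loss

i2l = {v: k for k, v in l2i.items()}
n_labels = len(i2l) - 1                # -100 is not a real class

print(l2i['B-Evidence'], l2i['I-Evidence'], l2i['O'])
print(i2l[9], n_labels)
```

The fixed offset of 7 between a `B-` id and its `I-` id is what the later helper functions rely on.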

In [99]:
# some helper functions

from pathlib import Path

path = Path('data/train')  # i put this in Config of hf_config.py 

def get_raw_text(ids):
    with open(path/f'{ids}.txt', 'r') as file: data = file.read()
    return data
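`pred2span` further down calls `get_test_text` when `test=True`, but no cell in this notebook defines it. A hypothetical counterpart of `get_raw_text` could look like the sketch below. The `data/test` location, the `test_path` name and the `base` parameter are all assumptions, not something taken from the original code:

```python
from pathlib import Path

# Assumption: test essays are stored as <id>.txt under data/test,
# mirroring the data/train layout that get_raw_text reads from.
test_path = Path('data/test')

def get_test_text(ids, base=None):
    """Hypothetical helper: read the raw text of one test essay."""
    folder = test_path if base is None else Path(base)
    with open(folder / f'{ids}.txt', 'r') as file:
        return file.read()
```

The optional `base` argument is only there so the helper can be pointed at another folder, for example when trying it out on a temporary directory.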

In [100]:
# group training labels by text file

df1 = train.groupby('id')['discourse_type'].apply(list).reset_index(name='classlist')  # mk: moved this to hf_functions.py
df2 = train.groupby('id')['discourse_start'].apply(list).reset_index(name='starts')
df3 = train.groupby('id')['discourse_end'].apply(list).reset_index(name='ends')
df4 = train.groupby('id')['predictionstring'].apply(list).reset_index(name='predictionstrings')

df = pd.merge(df1, df2, how='inner', on='id')
df = pd.merge(df, df3, how='inner', on='id')
df = pd.merge(df, df4, how='inner', on='id')
df['text'] = df['id'].apply(get_raw_text)

df.head()

Unnamed: 0,id,classlist,starts,ends,predictionstrings,text
0,0000D23A521A,"[Position, Evidence, Evidence, Claim, Counterc...","[0.0, 170.0, 358.0, 438.0, 627.0, 722.0, 836.0...","[170.0, 357.0, 438.0, 626.0, 722.0, 836.0, 101...",[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1...,"Some people belive that the so called ""face"" o..."
1,00066EA9880D,"[Lead, Position, Claim, Evidence, Claim, Evide...","[0.0, 456.0, 638.0, 738.0, 1399.0, 1488.0, 231...","[455.0, 592.0, 738.0, 1398.0, 1487.0, 2219.0, ...",[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1...,Driverless cars are exaclty what you would exp...
2,000E6DE9E817,"[Position, Counterclaim, Rebuttal, Evidence, C...","[17.0, 64.0, 158.0, 310.0, 438.0, 551.0, 776.0...","[56.0, 157.0, 309.0, 422.0, 551.0, 775.0, 961....","[2 3 4 5 6 7 8, 10 11 12 13 14 15 16 17 18 19 ...",Dear: Principal\n\nI am arguing against the po...
3,001552828BD0,"[Lead, Evidence, Claim, Claim, Evidence, Claim...","[0.0, 161.0, 872.0, 958.0, 1191.0, 1542.0, 161...","[160.0, 872.0, 957.0, 1190.0, 1541.0, 1612.0, ...",[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1...,Would you be able to give your car up? Having ...
4,0016926B079C,"[Position, Claim, Claim, Claim, Claim, Evidenc...","[0.0, 58.0, 94.0, 206.0, 236.0, 272.0, 542.0, ...","[57.0, 91.0, 150.0, 235.0, 271.0, 542.0, 650.0...","[0 1 2 3 4 5 6 7 8 9, 10 11 12 13 14 15, 16 17...",I think that students would benefit from learn...


In [101]:
# debugging
if SAMPLE: df = df.sample(n=100).reset_index(drop=True)

In [102]:
# we will use HuggingFace datasets
from datasets import Dataset, load_metric

ds = Dataset.from_pandas(df)
datasets = ds.train_test_split(test_size=0.1, shuffle=True, seed=42)
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'classlist', 'starts', 'ends', 'predictionstrings', 'text', '__index_level_0__'],
        num_rows: 14034
    })
    test: Dataset({
        features: ['id', 'classlist', 'starts', 'ends', 'predictionstrings', 'text', '__index_level_0__'],
        num_rows: 1560
    })
})

In [103]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file longformer-large-4096-hf/config.json
Model config LongformerConfig {
  "_name_or_path": "longformer-large-4096-hf",
  "attention_mode": "longformer",
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "ignore_attention_mask": false,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 4098,
  "model_type": "longformer",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  

In [104]:
# Not sure if this is needed, but in case a span of a given class starts without that class's B- token,
# convert the span's first token into the B- token.
# Label ids: 0-6 are B-<class>, 7-13 are I-<class> (B id + 7), 14 is O.

e = [0,7,7,7,1,1,8,8,8,9,9,9,14,4,4,4]

def fix_beginnings(labels):
    for i in range(1,len(labels)):
        curr_lab = labels[i]
        prev_lab = labels[i-1]
        if curr_lab in range(7,14):  # current token is an I- label
            # an I- label must follow the same I- label or its matching B- label
            if prev_lab != curr_lab and prev_lab != curr_lab - 7:
                labels[i] = curr_lab -7
    return labels

fix_beginnings(e)

[0, 7, 7, 7, 1, 1, 8, 8, 8, 2, 9, 9, 14, 4, 4, 4]

In [105]:
# tokenize and add labels
def tokenize_and_align_labels(examples):

    o = tokenizer(examples['text'], truncation=True, padding=True, return_offsets_mapping=True, max_length=max_length, stride=stride, return_overflowing_tokens=True)

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = o["overflow_to_sample_mapping"]

    # The offset mappings give us a map from each token to its character span in the original text. We use
    # them to align the character-level discourse annotations with tokens.
    offset_mapping = o["offset_mapping"]
    
    o["labels"] = []

    for i in range(len(offset_mapping)):
                   
        sample_index = sample_mapping[i]

        labels = [l2i['O'] for _ in range(len(o['input_ids'][i]))]

        for label_start, label_end, label in \
        list(zip(examples['starts'][sample_index], examples['ends'][sample_index], examples['classlist'][sample_index])):
            for j in range(len(labels)):
                token_start = offset_mapping[i][j][0]
                token_end = offset_mapping[i][j][1]
                if token_start == label_start: 
                    labels[j] = l2i[f'B-{label}']    
                if token_start > label_start and token_end <= label_end: 
                    labels[j] = l2i[f'I-{label}']

        for k, input_id in enumerate(o['input_ids'][i]):
            if input_id in [0,1,2]:
                labels[k] = -100

        labels = fix_beginnings(labels)
                   
        o["labels"].append(labels)
        
    return o
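The labelling rule inside `tokenize_and_align_labels` is easier to see on a toy example. This is a sketch only: a regex word splitter stands in for the tokenizer's `offset_mapping`, and the sentence and spans are made up.

```python
import re

text = "Phones distract drivers. Ban them now."
# (start, end, class) character spans, in the style of discourse_start / discourse_end
spans = [(0, 24, 'Position'), (25, 38, 'Claim')]

# Stand-in for offset_mapping: one (start, end) character pair per word
offsets = [(m.start(), m.end()) for m in re.finditer(r'\S+', text)]

labels = ['O'] * len(offsets)
for span_start, span_end, cls in spans:
    for j, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == span_start:
            labels[j] = f'B-{cls}'   # token opens the span
        if tok_start > span_start and tok_end <= span_end:
            labels[j] = f'I-{cls}'   # token lies fully inside the span

for (tok_start, tok_end), lab in zip(offsets, labels):
    print(f'{text[tok_start:tok_end]!r:12} {lab}')
```

A `B-` label is only assigned when a token starts exactly at the annotated start character. When no token does, the span begins with an `I-` label, which is the case `fix_beginnings` repairs.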

In [106]:
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True, batch_size=20000, remove_columns=datasets["train"].column_names)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [107]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping', 'labels'],
        num_rows: 14574
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping', 'labels'],
        num_rows: 1625
    })
})
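The row count grows from 14034 essays to 14574 training features because essays longer than `max_length` are split into overlapping windows (`return_overflowing_tokens=True` with `stride`). The sketch below shows the sliding-window idea on a plain list. It is not the tokenizer's implementation: the real tokenizer also reserves room for special tokens, so exact window boundaries will differ, and `sliding_windows` is a made-up helper name.

```python
def sliding_windows(tokens, window, stride):
    """Split tokens into windows of at most `window` items,
    where consecutive windows share `stride` items."""
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks

tokens = list(range(10))
chunks = sliding_windows(tokens, window=4, stride=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last `stride` items of the previous one, so a discourse span cut by a window boundary still appears with some context in the next window.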

## Model and Training

In [108]:
# we will use auto model for token classification
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=N_LABELS)

loading configuration file longformer-large-4096-hf/config.json
Model config LongformerConfig {
  "_name_or_path": "longformer-large-4096-hf",
  "attention_mode": "longformer",
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14"
  },
  "ignore_attention_mask":

In [109]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    logging_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=LR,
    per_device_train_batch_size=BS,
    per_device_eval_batch_size=BS,
    num_train_epochs=N_EPOCHS,
    weight_decay=WD,
    report_to='wandb', 
    gradient_accumulation_steps=GRAD_ACC,
    warmup_ratio=WARMUP
)

PyTorch: setting up devices
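With `BS = 1` and `GRAD_ACC = 8` the optimizer sees an effective batch of 8 features per update. The arithmetic below reproduces the numbers that show up in the training log further down (checkpoint-1821, 9105 total optimization steps), assuming a single device. The floor division for a trailing partial accumulation group and the rounding of the warmup length are assumptions about how the Trainer counts, not something stated in this notebook.

```python
import math

num_examples = 14574            # rows in tokenized_datasets["train"]
BS, GRAD_ACC, N_EPOCHS, WARMUP = 1, 8, 5, 0.1

effective_batch = BS * GRAD_ACC
batches_per_epoch = math.ceil(num_examples / BS)
# Assumption: a trailing partial accumulation group does not count as an update
updates_per_epoch = batches_per_epoch // GRAD_ACC
total_updates = updates_per_epoch * N_EPOCHS
# Assumption: the warmup length is rounded up to a whole number of updates
warmup_updates = math.ceil(total_updates * WARMUP)

print(effective_batch, updates_per_epoch, total_updates, warmup_updates)
```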


In [110]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [111]:
# this is not the competition metric, but for now this will be better than nothing...

metric = load_metric("seqeval")

In [112]:
import numpy as np

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [i2l[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [i2l[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
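The two list comprehensions in `compute_metrics` drop every position whose gold label is -100 before the ids are turned back into tag strings. A small pure-Python sketch with made-up values and only a subset of the `i2l` mapping:

```python
# Subset of the id-to-label mapping, enough for this toy example
i2l = {0: 'B-Lead', 7: 'I-Lead', 14: 'O'}

predictions = [[0, 0, 7, 14, 14]]     # already argmax-ed class ids
labels = [[-100, 0, 7, 7, -100]]      # -100 marks special tokens

true_predictions = [
    [i2l[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [i2l[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

print(true_predictions)
print(true_labels)
```

Whatever the model predicts at a special-token position never reaches the metric. Note also that seqeval scores whole entities rather than single tokens, so its F1 can sit well below the token accuracy.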

In [113]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics, 
)

In [114]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")  # new addition
device

device(type='cuda')

In [115]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(args.num_train_epochs))

  0%|          | 0/5 [00:00<?, ?it/s]

In [116]:
trainer.train()
# manual training loop kept for reference only (not run; Trainer already does all of this)
# for epoch in range(args.num_train_epochs):
#     for batch in train_dataloader:
#         batch = {k: v.to(device) for k, v in batch.items()}
#         outputs = model(**batch)
#         loss = outputs.loss
#         loss.backward()
#         optimizer.step()
#         lr_scheduler.step()
#         optimizer.zero_grad()
#     progress_bar.update(1)
# wandb.log()  # new additions
# wandb.watch(model)  # new additions
wandb.finish()

The following columns in the training set  don't have a corresponding argument in `LongformerForTokenClassification.forward` and have been ignored: overflow_to_sample_mapping, offset_mapping.
***** Running training *****
  Num examples = 14574
  Num Epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 8
  Total optimization steps = 9105
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Epoch,Training Loss,Validation Loss


The following columns in the evaluation set  don't have a corresponding argument in `LongformerForTokenClassification.forward` and have been ignored: overflow_to_sample_mapping, offset_mapping.
***** Running Evaluation *****
  Num examples = 1625
  Batch size = 1
Saving model checkpoint to longformer-large-4096-hf-finetuned-ner/checkpoint-1821
Configuration saved in longformer-large-4096-hf-finetuned-ner/checkpoint-1821/config.json
Model weights saved in longformer-large-4096-hf-finetuned-ner/checkpoint-1821/pytorch_model.bin
tokenizer config file saved in longformer-large-4096-hf-finetuned-ner/checkpoint-1821/tokenizer_config.json
Special tokens file saved in longformer-large-4096-hf-finetuned-ner/checkpoint-1821/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `LongformerForTokenClassification.forward` and have been ignored: overflow_to_sample_mapping, offset_mapping.
***** Running Evaluation *****
  Num examples = 1625
  Bat





metric,history
eval/accuracy,▁▆▆▆█
eval/f1,▃▁▆▆█
eval/loss,▃▂▁▃█
eval/precision,▃▁▆▆█
eval/recall,▁▂█▇█
eval/runtime,▁█▁▂▁
eval/samples_per_second,█▁▇▇█
eval/steps_per_second,█▁▇▇█
train/epoch,▁▁▃▃▅▅▆▆███
train/global_step,▁▁▃▃▅▅▆▆███

metric,final value
eval/accuracy,0.81786
eval/f1,0.2719
eval/loss,0.7263
eval/precision,0.21681
eval/recall,0.36452
eval/runtime,114.3649
eval/samples_per_second,14.209
eval/steps_per_second,14.209
train/epoch,5.0
train/global_step,9105.0


In [117]:
trainer.save_model(model_path)

Saving model checkpoint to longformer-large-4096-hf-1
Configuration saved in longformer-large-4096-hf-1/config.json
Model weights saved in longformer-large-4096-hf-1/pytorch_model.bin
tokenizer config file saved in longformer-large-4096-hf-1/tokenizer_config.json
Special tokens file saved in longformer-large-4096-hf-1/special_tokens_map.json


## Validation

In [118]:
def tokenize_for_validation(examples):
    o = tokenizer(examples['text'], truncation=True, return_offsets_mapping=True, max_length=4096)
    # The offset mappings give us a map from each token to its character span in the original text. We use them to align the character-level discourse annotations with tokens.
    offset_mapping = o["offset_mapping"]
    o["labels"] = []
    for i in range(len(offset_mapping)):
        labels = [l2i['O'] for _ in range(len(o['input_ids'][i]))]
        for label_start, label_end, label in \
        list(zip(examples['starts'][i], examples['ends'][i], examples['classlist'][i])):
            for j in range(len(labels)):
                token_start = offset_mapping[i][j][0]
                token_end = offset_mapping[i][j][1]
                if token_start == label_start: 
                    labels[j] = l2i[f'B-{label}']    
                if token_start > label_start and token_end <= label_end: 
                    labels[j] = l2i[f'I-{label}']
        for k, input_id in enumerate(o['input_ids'][i]):
            if input_id in [0,1,2]:
                labels[k] = -100
        labels = fix_beginnings(labels)
        o["labels"].append(labels)
    return o

In [119]:
tokenized_val = datasets.map(tokenize_for_validation, batched=True)
tokenized_val

  0%|          | 0/15 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'classlist', 'starts', 'ends', 'predictionstrings', 'text', '__index_level_0__', 'input_ids', 'attention_mask', 'offset_mapping', 'labels'],
        num_rows: 14034
    })
    test: Dataset({
        features: ['id', 'classlist', 'starts', 'ends', 'predictionstrings', 'text', '__index_level_0__', 'input_ids', 'attention_mask', 'offset_mapping', 'labels'],
        num_rows: 1560
    })
})

In [120]:
# ground truth for validation
l = []
for example in tokenized_val['test']:
    for c, p in list(zip(example['classlist'], example['predictionstrings'])):
        l.append({
            'id': example['id'],
            'discourse_type': c,
            'predictionstring': p,
        })
gt_df = pd.DataFrame(l)
gt_df

Unnamed: 0,id,discourse_type,predictionstring
0,7B5F5B33B566,Lead,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18...
1,7B5F5B33B566,Position,43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 5...
2,7B5F5B33B566,Evidence,69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 8...
3,7B5F5B33B566,Claim,166 167 168 169 170 171 172 173 174 175 176 17...
4,7B5F5B33B566,Evidence,180 181 182 183 184 185 186 187 188 189 190 19...
...,...,...,...
14461,B3E4B633261B,Claim,94 95 96 97 98 99 100 101 102 103 104 105 106 ...
14462,B3E4B633261B,Evidence,113 114 115 116 117 118 119 120 121 122 123 12...
14463,B3E4B633261B,Counterclaim,126 127 128 129 130 131 132
14464,B3E4B633261B,Rebuttal,133 134 135 136 137 138 139 140 141 142 143 14...


In [121]:
# visualization with displacy

# import pandas as pd
# import os
# from pathlib import Path
# import spacy
# from spacy import displacy
# from pylab import cm, matplotlib

# this bit throws an error

In [122]:
path = Path('data/train')

colors = {
            'Lead': '#8000ff',
            'Position': '#2b7ff6',
            'Evidence': '#2adddd',
            'Claim': '#80ffb4',
            'Concluding Statement': '#d4dd80',
            'Counterclaim': '#ff8042',
            'Rebuttal': '#ff0000',
            'Other': '#007f00',
         }

def visualize(df, text):
    ents = []
    example = df['id'].loc[0]

    for i, row in df.iterrows():
        ents.append({
                        'start': int(row['discourse_start']), 
                         'end': int(row['discourse_end']), 
                         'label': row['discourse_type']
                    })

    doc2 = {
        "text": text,
        "ents": ents,
        "title": example
    }

    options = {"ents": train.discourse_type.unique().tolist() + ['Other'], "colors": colors}
    displacy.render(doc2, style="ent", options=options, manual=True, jupyter=True)

In [123]:
predictions, labels, _ = trainer.predict(tokenized_val['test'])

The following columns in the test set  don't have a corresponding argument in `LongformerForTokenClassification.forward` and have been ignored: __index_level_0__, offset_mapping, ends, text, classlist, predictionstrings, starts, id.
***** Running Prediction *****
  Num examples = 1560
  Batch size = 1
Input ids are automatically padded from 564 to 1024 to be a multiple of `config.attention_window`: 512


Input ids are automatically padded from 545 to 1024 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 404 to 512 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 652 to 1024 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 362 to 512 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 585 to 1024 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 502 to 512 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 639 to 1024 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 710 to 1024 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 389 to 512 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 242 to 512 to be a multiple of `confi

In [124]:
preds = np.argmax(predictions, axis=-1)
preds.shape

(1560, 4096)

In [125]:
# code that converts our predictions into prediction strings and visualizes them at the same time
# this most likely requires some refactoring
# mk: ummmm yeah....

def get_class(c):
    if c == 14: return 'Other'
    else: return i2l[c][2:]

def pred2span(pred, example, viz=False, test=False):
    example_id = example['id']
    n_tokens = len(example['input_ids'])
    classes = []
    all_span = []
    for i, c in enumerate(pred.tolist()):
        if i == n_tokens-1:
            break
        if i == 0:
            cur_span = example['offset_mapping'][i]
            classes.append(get_class(c))
        elif i > 0 and (c == pred[i-1] or (c-7) == pred[i-1]):
            cur_span[1] = example['offset_mapping'][i][1]
        else:
            all_span.append(cur_span)
            cur_span = example['offset_mapping'][i]
            classes.append(get_class(c))
    all_span.append(cur_span)
    
    if test:
        text = get_test_text(example_id)  # needs a get_test_text helper (test-set counterpart of get_raw_text) to be defined first
    else:
        text = get_raw_text(example_id)  # defined in the helper functions cell above
    
    # map character spans to word (whitespace-split) indices
    predstrings = []
    for span in all_span:
        span_start = span[0]
        span_end = span[1]
        before = text[:span_start]
        token_start = len(before.split())
        if len(before) == 0: token_start = 0
        elif before[-1] != ' ': token_start -= 1
        num_tkns = len(text[span_start:span_end+1].split())
        tkns = [str(x) for x in range(token_start, token_start+num_tkns)]
        predstring = ' '.join(tkns)
        predstrings.append(predstring)
                    
    rows = []
    for c, span, predstring in zip(classes, all_span, predstrings):
        e = {
            'id': example_id,
            'discourse_type': c,
            'predictionstring': predstring,
            'discourse_start': span[0],
            'discourse_end': span[1],
            'discourse': text[span[0]:span[1]+1]
        }
        rows.append(e)


    df = pd.DataFrame(rows)
    df['length'] = df['discourse'].apply(lambda t: len(t.split()))
    
    # short spans are likely to be false positives, we can choose a min number of tokens based on validation
    df = df[df.length > min_tokens].reset_index(drop=True)
    if viz: visualize(df, text)

    return df
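The middle part of `pred2span` turns a character span into the competition's `predictionstring`, i.e. the indices of the whitespace-separated words the span covers. The sketch below isolates that step with the same logic. `span_to_predictionstring` is a name made up for this illustration and the sentence is a toy example.

```python
def span_to_predictionstring(text, span_start, span_end):
    """Convert a character span into space-separated word indices."""
    before = text[:span_start]
    token_start = len(before.split())     # number of words before the span
    if len(before) == 0:
        token_start = 0
    elif before[-1] != ' ':
        token_start -= 1                  # span starts in the middle of a word
    num_tkns = len(text[span_start:span_end + 1].split())
    return ' '.join(str(x) for x in range(token_start, token_start + num_tkns))

text = "Phones distract drivers. Ban them now."
first = span_to_predictionstring(text, 0, 24)
second = span_to_predictionstring(text, 25, 38)
mid_word = span_to_predictionstring(text, 30, 38)   # starts inside "them"
print(first, '|', second, '|', mid_word)
```

One thing to keep in mind: the `before[-1] != ' '` check treats only a space as a word boundary, so a span that starts right after a newline comes out shifted by one word.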

In [126]:
from spacy import displacy
pred2span(preds[0], tokenized_val['test'][0], viz=True)

Unnamed: 0,id,discourse_type,predictionstring,discourse_start,discourse_end,discourse,length
0,7B5F5B33B566,Lead,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18...,0,221,"When people ask for advice\n\n, they sometimes...",43
1,7B5F5B33B566,Position,43 44 45 46 47 48 49,222,264,taking advice from another person can help,7
2,7B5F5B33B566,Claim,58 59 60 61 62 63 64,308,354,"you understand things more clearly and faster,",7
3,7B5F5B33B566,Evidence,69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 8...,376,498,Multiple opinions really is a foundation to a ...,25
4,7B5F5B33B566,Evidence,95 96 97 98 99 100 101 102 103 104 105 106 107...,505,699,to address problems and concerns. If it was ju...,41
5,7B5F5B33B566,Evidence,135 136 137 138 139 140 141 142 143 144 145 14...,700,841,"In the world we live in, we make a lot of impo...",29
6,7B5F5B33B566,Evidence,180 181 182 183 184 185 186 187 188 189 190 19...,927,2197,Abraham\n\nLincoln never saw how African Ameri...,235
7,7B5F5B33B566,Concluding Statement,425 426 427 428 429 430 431 432 433 434 435 43...,2249,2587,"advice is there to help you, opinions are ther...",68


In [127]:
pred2span(preds[1], tokenized_val['test'][1], viz=True)

Unnamed: 0,id,discourse_type,predictionstring,discourse_start,discourse_end,discourse,length
0,3CF52C3ED074,Lead,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18...,0,171,All students do is waste their time and i'm ti...,33
1,3CF52C3ED074,Position,33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 4...,172,323,I agree with the principal that all students s...,24
2,3CF52C3ED074,Other,79 80 81 82 83 84 85 86 87,463,508,"\n\nMy first reason is, It could benefit a lot.",9
3,3CF52C3ED074,Evidence,89 90 91 92 93 94 95 96 97 98 99 100 101 102 1...,509,1110,Doing extracurricular activity can give an ide...,105
4,3CF52C3ED074,Evidence,202 203 204 205 206 207 208 209 210 211 212 21...,1161,1935,"In competitions there is always a reward, and ...",141
5,3CF52C3ED074,Evidence,354 355 356 357 358 359 360 361 362 363 364 36...,1994,2260,Tell your friends and family what you did toda...,46
6,3CF52C3ED074,Concluding Statement,402 403 404 405 406 407 408 409 410 411 412 41...,2277,2698,i strongly agree with the principal decision t...,69


In [128]:
dfs = []
for i in range(len(tokenized_val['test'])):
    dfs.append(pred2span(preds[i], tokenized_val['test'][i]))

pred_df = pd.concat(dfs, axis=0)
pred_df['class'] = pred_df['discourse_type']
pred_df

Unnamed: 0,id,discourse_type,predictionstring,discourse_start,discourse_end,discourse,length,class
0,7B5F5B33B566,Lead,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18...,0,221,"When people ask for advice\n\n, they sometimes...",43,Lead
1,7B5F5B33B566,Position,43 44 45 46 47 48 49,222,264,taking advice from another person can help,7,Position
2,7B5F5B33B566,Claim,58 59 60 61 62 63 64,308,354,"you understand things more clearly and faster,",7,Claim
3,7B5F5B33B566,Evidence,69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 8...,376,498,Multiple opinions really is a foundation to a ...,25,Evidence
4,7B5F5B33B566,Evidence,95 96 97 98 99 100 101 102 103 104 105 106 107...,505,699,to address problems and concerns. If it was ju...,41,Evidence
...,...,...,...,...,...,...,...,...
6,B3E4B633261B,Evidence,72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 8...,396,548,They will see all the people they helped all t...,29,Evidence
7,B3E4B633261B,Evidence,102 103 104 105 106 107 108 109 110 111 112 11...,555,628,"for communities, people, plants an animals an ...",13,Evidence
8,B3E4B633261B,Evidence,116 117 118 119 120 121 122 123 124 125,634,688,this world a better place for the generations ...,10,Evidence
9,B3E4B633261B,Evidence,133 134 135 136 137 138 139 140,727,765,but they will thank you for it latter.,8,Evidence


In [129]:
# source: https://www.kaggle.com/robikscube/student-writing-competition-twitch#Competition-Metric-Code

def calc_overlap(row):
    """
    Calculates the overlap between prediction and
    ground truth and overlap percentages used for determining
    true positives.
    """
    set_pred = set(row.predictionstring_pred.split(" "))
    set_gt = set(row.predictionstring_gt.split(" "))
    # Length of each and intersection
    len_gt = len(set_gt)
    len_pred = len(set_pred)
    inter = len(set_gt.intersection(set_pred))
    overlap_1 = inter / len_gt
    overlap_2 = inter / len_pred
    return [overlap_1, overlap_2]


def score_feedback_comp_micro(pred_df, gt_df):
    """
    A function that scores for the kaggle
        Student Writing Competition

    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = (
        gt_df[["id", "discourse_type", "predictionstring"]]
        .reset_index(drop=True)
        .copy()
    )
    pred_df = pred_df[["id", "class", "predictionstring"]].reset_index(drop=True).copy()
    pred_df["pred_id"] = pred_df.index
    gt_df["gt_id"] = gt_df.index
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(
        gt_df,
        left_on=["id", "class"],
        right_on=["id", "discourse_type"],
        how="outer",
        suffixes=("_pred", "_gt"),
    )
    joined["predictionstring_gt"] = joined["predictionstring_gt"].fillna(" ")
    joined["predictionstring_pred"] = joined["predictionstring_pred"].fillna(" ")

    joined["overlaps"] = joined.apply(calc_overlap, axis=1)

    # 2. If the overlap between the ground truth and prediction is >= 0.5,
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    joined["overlap1"] = joined["overlaps"].apply(lambda x: x[0])
    joined["overlap2"] = joined["overlaps"].apply(lambda x: x[1])

    joined["potential_TP"] = (joined["overlap1"] >= 0.5) & (joined["overlap2"] >= 0.5)
    joined["max_overlap"] = joined[["overlap1", "overlap2"]].max(axis=1)
    tp_pred_ids = (
        joined.query("potential_TP")
        .sort_values("max_overlap", ascending=False)
        .groupby(["id", "predictionstring_gt"])
        .first()["pred_id"]
        .values
    )

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = [p for p in joined["pred_id"].unique() if p not in tp_pred_ids]

    matched_gt_ids = joined.query("potential_TP")["gt_id"].unique()
    unmatched_gt_ids = [c for c in joined["gt_id"].unique() if c not in matched_gt_ids]

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
    # calc microf1
    my_f1_score = TP / (TP + 0.5 * (FP + FN))
    return my_f1_score


def score_feedback_comp(pred_df, gt_df, return_class_scores=False):
    class_scores = {}
    pred_df = pred_df[["id", "class", "predictionstring"]].reset_index(drop=True).copy()
    for discourse_type, gt_subset in gt_df.groupby("discourse_type"):
        pred_subset = (
            pred_df.loc[pred_df["class"] == discourse_type]
            .reset_index(drop=True)
            .copy()
        )
        class_score = score_feedback_comp_micro(pred_subset, gt_subset)
        class_scores[discourse_type] = class_score
    f1 = np.mean([v for v in class_scores.values()])
    if return_class_scores:
        return f1, class_scores
    return f1
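To make the matching rule concrete, here is a small worked example of the two overlap ratios and the F1 formula used above. It mirrors `calc_overlap` and the last line of `score_feedback_comp_micro` on made-up prediction strings and counts. `overlaps` and `f1_from_counts` are illustrative names, not part of the original metric code.

```python
def overlaps(pred_string, gt_string):
    """Return (share of ground truth covered, share of prediction covered)."""
    set_pred = set(pred_string.split(" "))
    set_gt = set(gt_string.split(" "))
    inter = len(set_gt & set_pred)
    return inter / len(set_gt), inter / len(set_pred)

def f1_from_counts(tp, fp, fn):
    """F1 written in terms of true positives, false positives, false negatives."""
    return tp / (tp + 0.5 * (fp + fn))

o1, o2 = overlaps("3 4 5 6", "1 2 3 4 5 6")
is_match = o1 >= 0.5 and o2 >= 0.5        # both ratios must reach 0.5
print(round(o1, 3), o2, is_match)

o3, o4 = overlaps("5 6 7 8 9 10", "1 2 3 4 5 6")
print(round(o3, 3), round(o4, 3), o3 >= 0.5 and o4 >= 0.5)

print(round(f1_from_counts(tp=3, fp=1, fn=2), 4))
```

The first prediction covers 4 of the 6 ground-truth words and lies entirely inside the ground truth, so it counts as a true positive. The second shares only 2 words with the ground truth and is rejected.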

## CV Score

In [130]:
score_feedback_comp(pred_df, gt_df, return_class_scores=True)

(0.622527447496722,
 {'Claim': 0.5571841453344344,
  'Concluding Statement': 0.798907476954592,
  'Counterclaim': 0.4758679085520745,
  'Evidence': 0.7149005943840951,
  'Lead': 0.7856782652546647,
  'Position': 0.6389797253106606,
  'Rebuttal': 0.38617401668653156})
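The overall score is the unweighted mean of the seven per-class F1 scores, which is what `score_feedback_comp` computes with `np.mean`. A quick check using the values printed above:

```python
class_scores = {
    'Claim': 0.5571841453344344,
    'Concluding Statement': 0.798907476954592,
    'Counterclaim': 0.4758679085520745,
    'Evidence': 0.7149005943840951,
    'Lead': 0.7856782652546647,
    'Position': 0.6389797253106606,
    'Rebuttal': 0.38617401668653156,
}

# Macro average: every class counts equally, regardless of how often it occurs
macro_f1 = sum(class_scores.values()) / len(class_scores)
weakest = min(class_scores, key=class_scores.get)
print(round(macro_f1, 4), weakest)
```

Because every class has the same weight, the rarer and harder classes (Rebuttal and Counterclaim here) pull the score down as much as the frequent ones lift it.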