### A Hybrid Approach to Deriving Context in Long Documents

The Longformer architecture, introduced in the paper "Longformer: The Long-Document Transformer" (and referenced in the subsequent paper on the BigBird architecture), brought together new approaches for handling long documents in a resource-efficient manner. The full self-attention used until then (for example, in RoBERTa) was computationally intensive, scaling as $O(n^2)$ in the sequence length $n$.

The paper introduced, among other ideas, the sliding window attention mechanism, which I found particularly interesting because it reduces the cost of attention to linear in the sequence length. It does so by employing a fixed-size window around each token, which gives $O(n \times w)$ scaling for sequence length $n$ and window size $w$. Stacking layers still yields a large receptive field, so the mechanism balances local context against computational efficiency.
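
To make the linear-scaling claim concrete, here is a minimal sketch (not taken from the paper's code) that enumerates which query/key pairs a sliding window allows and compares the count with full attention. The function name `sliding_window_pairs` and the values of `n` and `w` are illustrative assumptions.

```python
def sliding_window_pairs(n, w):
    """Return the (query, key) index pairs allowed by a sliding window.

    Token i may attend to token j only when |i - j| <= w // 2, so each
    row holds at most w + 1 entries (w neighbours plus the token itself).
    """
    half = w // 2
    return [(i, j)
            for i in range(n)
            for j in range(max(0, i - half), min(n, i + half + 1))]


n, w = 1024, 64  # illustrative sequence length and window size
local_pairs = len(sliding_window_pairs(n, w))
full_pairs = n * n

print(f"full attention : {full_pairs:,} pairs")
print(f"sliding window : {local_pairs:,} pairs")
```

Doubling `n` roughly doubles the sliding-window count but quadruples the full-attention count, which is the practical meaning of linear versus quadratic scaling here.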

Longformer's approach emphasizes efficiency on large inputs, which makes it well suited to long-document analysis. Its attention is sparse: most tokens attend only within their local window, while a small number of pre-selected global tokens attend to, and are attended by, every token in the sequence, so the overall cost drops from quadratic to linear in the sequence length. A dilated sliding window can also be used in multi-headed attention, with different dilation configurations per head, which enlarges the receptive field without substantial computational overhead and helps capture both local and long-range context.
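
The effect of dilation can be sketched in a few lines. The helper below is a hypothetical illustration, not Longformer's implementation: it lists the key positions one token attends to for a window of size `w` with gaps of size `d` between attended positions.

```python
def dilated_window(i, n, w, d):
    """Key positions token i attends to with window size w and dilation d.

    With d = 1 this is an ordinary sliding window; with d > 1 the same
    number of keys is spread over a span roughly d times wider.
    """
    half = w // 2
    positions = [i + k * d for k in range(-half, half + 1)]
    return [p for p in positions if 0 <= p < n]


plain = dilated_window(i=10, n=100, w=4, d=1)
dilated = dilated_window(i=10, n=100, w=4, d=3)

print(plain)    # [8, 9, 10, 11, 12]
print(dilated)  # [4, 7, 10, 13, 16]
```

Both calls attend to five positions, so the cost per token is unchanged, yet the dilated window spans 13 positions instead of 5.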

As Longformer increases the capacity to capture long-range dependencies, there is a potential risk of overfitting, especially with smaller datasets. The final Longformer model, as available on Hugging Face, combines the global attention mechanism with the sliding window attention mechanism. Training models with global attention can be more challenging than training models with only local attention. The approach of using local attention with relative positional encoding (which uses memory efficiently) was proposed in the paper "Shortformer: Better Language Modeling using Shorter Inputs". Internally, long documents can be split into multiple shorter groups, which are then used to train the Shortformer model. Although this is not expected to outperform Longformer on long documents, it is computationally more efficient. Hence, I intend to build a grouping feature on top of the Shortformer model and compare its performance with that of a pre-trained longformer-base model.
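
The grouping idea can be illustrated with a small sketch. This is an assumption about how such a feature might look, not the notebook's final implementation (which relies on the tokenizer's own overflow handling further down): `split_into_chunks` is a hypothetical helper that cuts a token list into windows of at most `max_length` items, with `stride` items shared between neighbours.

```python
def split_into_chunks(tokens, max_length, stride):
    """Split tokens into windows of at most max_length items.

    Consecutive windows share `stride` items, so the context around a
    chunk boundary appears in both neighbouring chunks.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break
        start += step
    return chunks


words = list(range(10))  # stand-in for a tokenized essay
for chunk in split_into_chunks(words, max_length=4, stride=1):
    print(chunk)
```

Each chunk is short enough for a standard encoder, and the shared items give the model some context on both sides of every boundary.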

In [None]:
! pip3 install autocorrect -qq
from autocorrect import Speller

In [None]:
import os
import gc
import ast
import time
import wandb
from tqdm import tqdm
from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

import torch
from torch.utils.data import Dataset, DataLoader

from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification



Testing how the autocorrect library behaves. I intend to use it because the dataset consists of essays written by students from the 6th to the 12th grade, and on visual inspection I noticed quite a few unintentional spelling mistakes.

In [None]:
misspell = ["conclution worksing", "consulating coclusion"]
spell = Speller(lang = 'en')
for misspelled_items in misspell:
    print("Before: ", misspelled_items, "\t After: ", spell(misspelled_items))

Before:  conclution worksing 	 After:  conclusion working
Before:  consulating coclusion 	 After:  consulting conclusion


I intend to build the shortformer model on top of the RoBERTa model with the following specifications.

In [None]:
MODEL_NAME = 'roberta-large'
MODEL_PATH = 'model'
RUN_NAME = f"{MODEL_NAME}-4"

MAX_LENGTH = 512
DOC_STRIDE = 128  # Number of tokens shared by consecutive overlapping chunks

config = {'train_batch_size': 4,
          'valid_batch_size': 2,
          'epochs': 5,
          'learning_rates': [2.5e-5, 2.5e-5, 2.5e-6, 2.5e-6, 2.5e-7],
          'max_grad_norm': 10,
          'device': 'cuda' if torch.cuda.is_available() else 'cpu',
          'model_name': MODEL_NAME,
          'max_length': MAX_LENGTH,
          'doc_stride': DOC_STRIDE,
          }

A third-party interface (Weights & Biases) to visualize the training and validation metrics for each of the categories.

In [None]:
try:
    # Read the key from the environment instead of hard-coding a secret in the notebook
    api_key = os.environ.get('WANDB_API_KEY')
    wandb.login(key=api_key)
    wandb.init(project="experimenting", name=RUN_NAME, config=config)
except Exception:
    print('Error in initializing Weights & Biases API')

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mlikithvp21[0m ([33mexperimenting[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
df_initial = pd.read_csv('/kaggle/input/project/train.csv')
print(df_initial.shape)

(144293, 8)


In [None]:
df_initial.head(10)

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
1,423A1CA112E2,1622628000000.0,230.0,312.0,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2,423A1CA112E2,1622628000000.0,313.0,401.0,Some certain areas in the United States ban ph...,Evidence,Evidence 1,60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
3,423A1CA112E2,1622628000000.0,402.0,758.0,"When people have phones, they know about certa...",Evidence,Evidence 2,76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 9...
4,423A1CA112E2,1622628000000.0,759.0,886.0,Driving is one of the way how to get around. P...,Claim,Claim 1,139 140 141 142 143 144 145 146 147 148 149 15...
5,423A1CA112E2,1622628000000.0,887.0,1150.0,That's why there's a thing that's called no te...,Evidence,Evidence 3,163 164 165 166 167 168 169 170 171 172 173 17...
6,423A1CA112E2,1622628000000.0,1151.0,1533.0,Sometimes on the news there is either an accid...,Evidence,Evidence 4,211 212 213 214 215 216 217 218 219 220 221 22...
7,423A1CA112E2,1622628000000.0,1534.0,1602.0,Phones are fine to use and it's also the best ...,Claim,Claim 2,282 283 284 285 286 287 288 289 290 291 292 29...
8,423A1CA112E2,1622628000000.0,1603.0,1890.0,If you go through a problem and you can't find...,Evidence,Evidence 5,297 298 299 300 301 302 303 304 305 306 307 30...
9,423A1CA112E2,1622628000000.0,1891.0,2027.0,The news always updated when people do somethi...,Concluding Statement,Concluding Statement 1,355 356 357 358 359 360 361 362 363 364 365 36...


In [None]:
df_initial['predictionstring'][1]

'45 46 47 48 49 50 51 52 53 54 55 56 57 58 59'

#### Named Entity Recognition:

We shall now convert the data to NER labels and save them.
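
As a toy illustration of this conversion (a sketch with made-up inputs, separate from the real cell that follows), each `predictionstring` lists the word indices of one discourse element: the first index receives a `B-` tag, the remaining indices receive `I-` tags, and every other word stays `O`. The helper name `to_bio` is hypothetical.

```python
def to_bio(num_words, spans):
    """Build BIO labels for a text of num_words words.

    spans: list of (discourse_type, predictionstring) pairs, where
    predictionstring is a space-separated list of word indices.
    """
    labels = ["O"] * num_words
    for discourse_type, predictionstring in spans:
        idx = [int(x) for x in predictionstring.split()]
        labels[idx[0]] = f"B-{discourse_type}"      # first word of the span
        for k in idx[1:]:
            labels[k] = f"I-{discourse_type}"       # remaining words
    return labels


example = to_bio(6, [("Lead", "1 2"), ("Claim", "4 5")])
print(example)
# ['O', 'B-Lead', 'I-Lead', 'O', 'B-Claim', 'I-Claim']
```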

In [None]:
if os.path.isfile("/kaggle/input/train-ner/train_NER.csv"):
    df_texts = pd.read_csv("/kaggle/input/train-ner/train_NER.csv", converters={'entities':ast.literal_eval, 'text_split': ast.literal_eval})
else:
    train_names, train_texts = [], []
    for f in tqdm(list(os.listdir('../input/project/train'))):
        train_names.append(f.replace('.txt', ''))
        # Spell-correct each essay while loading; the context manager closes the file
        with open('../input/project/train/' + f, 'r') as fh:
            train_texts.append(spell(fh.read()))

    # Build the frame once, after all essays have been read
    df_texts = pd.DataFrame({'id': train_names, 'text': train_texts})
    df_texts['text_split'] = df_texts.text.str.split()

    all_entities = []
    for _, row in tqdm(df_texts.iterrows(), total=len(df_texts)):
        total = len(row['text_split'])
        entities = ["O"] * total
        for _, row2 in df_initial[df_initial['id'] == row['id']].iterrows():
            discourse = row2['discourse_type']
            list_ix = [int(x) for x in row2['predictionstring'].split(' ')]
            entities[list_ix[0]] = f"B-{discourse}"
            for k in list_ix[1:]: entities[k] = f"I-{discourse}"
        all_entities.append(entities)
    df_texts['entities'] = all_entities
    df_texts.to_csv('../input/project/train_NER.csv',index=False)

print(df_texts.shape)
df_texts.head()

(15594, 4)


Unnamed: 0,id,text,text_split,entities
0,62C57C524CD2,I think we should be able to play in a sport i...,"[I, think, we, should, be, able, to, play, in,...","[B-Position, I-Position, I-Position, I-Positio..."
1,80667AD3FFD8,Some schools require summer projects for stude...,"[Some, schools, require, summer, projects, for...","[B-Position, I-Position, I-Position, I-Positio..."
2,21868C40B94F,Driverless cars have been argued and talked ab...,"[Driverless, cars, have, been, argued, and, ta...","[B-Lead, I-Lead, I-Lead, I-Lead, I-Lead, I-Lea..."
3,87A6EF3113C6,"The author of ""The Challenge of Exploring Venu...","[The, author, of, ""The, Challenge, of, Explori...","[B-Position, I-Position, I-Position, I-Positio..."
4,24687D08CFDA,"Wow, from the mar really look like humans face...","[Wow,, from, the, mar, really, look, like, hum...","[B-Lead, I-Lead, I-Lead, I-Lead, I-Lead, I-Lea..."


In [None]:
(df_texts['text_split'].str.len() == df_texts['entities'].str.len()).all()

True

In [None]:
output_labels = ['O', 'B-Lead', 'I-Lead', 'B-Position', 'I-Position', 'B-Claim', 'I-Claim', 'B-Counterclaim', 'I-Counterclaim',
          'B-Rebuttal', 'I-Rebuttal', 'B-Evidence', 'I-Evidence', 'B-Concluding Statement', 'I-Concluding Statement']

LABELS_TO_IDS = {v:k for k,v in enumerate(output_labels)}
IDS_TO_LABELS = {k:v for k,v in enumerate(output_labels)}

LABELS_TO_IDS

{'O': 0,
 'B-Lead': 1,
 'I-Lead': 2,
 'B-Position': 3,
 'I-Position': 4,
 'B-Claim': 5,
 'I-Claim': 6,
 'B-Counterclaim': 7,
 'I-Counterclaim': 8,
 'B-Rebuttal': 9,
 'I-Rebuttal': 10,
 'B-Evidence': 11,
 'I-Evidence': 12,
 'B-Concluding Statement': 13,
 'I-Concluding Statement': 14}

In [None]:
IDS = df_initial.id.unique()
print(f'There are {len(IDS)} train texts. We will split 90% 10% for validation.')
np.random.seed(42)
train_idx = np.random.choice(np.arange(len(IDS)),int(0.9*len(IDS)),replace=False)
valid_idx = np.setdiff1d(np.arange(len(IDS)),train_idx)
np.random.seed(None)

df_train = df_texts.loc[df_texts['id'].isin(IDS[train_idx])].reset_index(drop=True)
df_val = df_texts.loc[df_texts['id'].isin(IDS[valid_idx])].reset_index(drop=True)

print(f"FULL Dataset : {df_texts.shape}")
print(f"TRAIN Dataset: {df_train.shape}")
print(f"VALID Dataset: {df_val.shape}")

There are 15594 train texts. We will split 90% 10% for validation.
FULL Dataset : (15594, 4)
TRAIN Dataset: (14034, 4)
VALID Dataset: (1560, 4)


Download the RoBERTa Model with the configurations.

In [None]:
def download_model():
    # exist_ok avoids a FileExistsError when the cell is re-run
    os.makedirs(MODEL_PATH, exist_ok=True)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, add_prefix_space=True)
    tokenizer.save_pretrained(MODEL_PATH)
    config_model = AutoConfig.from_pretrained(MODEL_NAME)
    config_model.num_labels = 15  # 'O' plus B-/I- tags for the 7 discourse types
    config_model.save_pretrained(MODEL_PATH)
    backbone = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, config=config_model)
    # Note: save_pretrained() creates a directory, so 'pytorch_model.bin' here is a folder name
    backbone.save_pretrained(os.path.join(MODEL_PATH, 'pytorch_model.bin'))
    print(f"Model downloaded to {MODEL_PATH}/")
download_model()

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model downloaded to model/


In [None]:
# !rm -rf /kaggle/working/model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

In [None]:
def get_labels(word_ids, word_labels):
    label_ids = []
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)
        else:
            label_ids.append(LABELS_TO_IDS[word_labels[word_idx]])
    return label_ids
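
The label alignment above can be checked on a toy input without loading a tokenizer. The sketch below assumes a hand-written `word_ids` list (the kind a fast tokenizer returns, with `None` for special and padding tokens) and a reduced, hypothetical label map. The value `-100` matters because it is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions do not contribute to training.

```python
TOY_LABELS_TO_IDS = {"O": 0, "B-Lead": 1, "I-Lead": 2}  # reduced map for illustration


def align_labels(word_ids, word_labels, label_map):
    """Give every sub-word token the label id of the word it came from."""
    label_ids = []
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)  # special/padding token: ignored by the loss
        else:
            label_ids.append(label_map[word_labels[word_idx]])
    return label_ids


# Word 1 was split into two sub-word tokens, so its label is repeated.
toy_word_ids = [None, 0, 1, 1, 2, None]
toy_word_labels = ["B-Lead", "I-Lead", "O"]

aligned = align_labels(toy_word_ids, toy_word_labels, TOY_LABELS_TO_IDS)
print(aligned)
# [-100, 1, 2, 2, 0, -100]
```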

In [None]:
def tokenize(df, to_tensor=True, with_labels=True):
    encoded = tokenizer(df['text_split'].tolist(),
                        is_split_into_words=True,
                        return_overflowing_tokens=True,
                        stride=DOC_STRIDE,
                        max_length=MAX_LENGTH,
                        padding="max_length",
                        truncation=True)

    if with_labels:
        encoded['labels'] = []
    encoded['wids'] = []
    n = len(encoded['overflow_to_sample_mapping'])
    for i in range(n):
        text_idx = encoded['overflow_to_sample_mapping'][i]
        word_ids = encoded.word_ids(i)

        if with_labels:
            word_labels = df['entities'].iloc[text_idx]
            label_ids = get_labels(word_ids, word_labels)
            encoded['labels'].append(label_ids)
        encoded['wids'].append([w if w is not None else -1 for w in word_ids])

    if to_tensor:
        encoded = {key: torch.as_tensor(val) for key, val in encoded.items()}
    return encoded

In [None]:
tokenized_train = tokenize(df_train)
tokenized_val = tokenize(df_val)

In [None]:
tokenized_train['overflow_to_sample_mapping'][:10]

tensor([0, 1, 1, 2, 2, 3, 4, 5, 5, 6])

In [None]:
class FeedbackPrizeDataset(Dataset):
    def __init__(self, tokenized_ds):
        self.data = tokenized_ds

    def __getitem__(self, index):
        item = {k: self.data[k][index] for k in self.data.keys()}
        return item

    def __len__(self):
        return len(self.data['input_ids'])

In [None]:
ds_train = FeedbackPrizeDataset(tokenized_train)
dl_train = DataLoader(ds_train, batch_size=config['train_batch_size'],
                      shuffle=True, num_workers=2, pin_memory=True)

ds_val = FeedbackPrizeDataset(tokenized_val)
dl_val = DataLoader(ds_val, batch_size=config['valid_batch_size'],
                    shuffle=False, num_workers=2, pin_memory=True)

I will apply gradient clipping to prevent exploding gradients and log the training metrics to Weights & Biases, a third-party API, for visualization.
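
For intuition, clipping by global norm can be sketched in plain Python. This is a simplified illustration under the assumption that all gradients are flattened into one list; in PyTorch the equivalent operation is performed in place by `torch.nn.utils.clip_grad_norm_`, and it has to run after `loss.backward()` and before `optimizer.step()` to have any effect. The helper name `clip_by_global_norm` is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradients so that their overall L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before: {norm_before:.2f}, norm after: {norm_after:.2f}")
```

Gradients whose norm is already below `max_norm` are left unchanged, so clipping only intervenes on unusually large updates.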

In [None]:
def train(model, optimizer, dl_train, epoch):
    time_start = time.time()
    for g in optimizer.param_groups:
        g['lr'] = config['learning_rates'][epoch]
    lr = optimizer.param_groups[0]['lr']
    epoch_prefix = f"[Epoch {epoch+1:2d} / {config['epochs']:2d}]"
    print(f"{epoch_prefix} Starting epoch {epoch+1:2d} with LR = {lr}")

    model.train()
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for idx, batch in enumerate(dl_train):
        ids = batch['input_ids'].to(config['device'], dtype = torch.long)
        mask = batch['attention_mask'].to(config['device'], dtype = torch.long)
        labels = batch['labels'].to(config['device'], dtype = torch.long)
        loss, tr_logits = model(input_ids=ids, attention_mask=mask, labels=labels,
                               return_dict=False)
        tr_loss += loss.item()
        nb_tr_steps += 1
        nb_tr_examples += labels.size(0)
        loss_step = tr_loss/nb_tr_steps
        if idx % 200 == 0:
            print(f"{epoch_prefix}     Steps: {idx:4d} --> Loss: {loss_step:.4f}")
        flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        active_accuracy = labels.view(-1) != -100 # shape (batch_size * seq_len,)
        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
        wandb.log({'Train Loss (Step)': loss_step, 'Train Accuracy (Step)' : tr_accuracy / nb_tr_steps})
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        # Clip after backward() so the freshly computed gradients are the ones rescaled
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=config['max_grad_norm']
        )
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps

    torch.save(model.state_dict(), f'pytorch_model_e{epoch}.bin')
    torch.cuda.empty_cache()
    gc.collect()

    elapsed = time.time() - time_start

    print(epoch_prefix)
    print(f"{epoch_prefix} Training loss    : {epoch_loss:.4f}")
    print(f"{epoch_prefix} Training accuracy: {tr_accuracy:.4f}")
    print(f"{epoch_prefix} Model saved to pytorch_model_e{epoch}.bin  [{elapsed/60:.2f} mins]")
    wandb.log({'Train Loss (Epoch)': epoch_loss, 'Train Accuracy (Epoch)' : tr_accuracy})
    print(epoch_prefix)

In [None]:
def calc_overlap(row): #Calculate the overlap between prediction and truth
    set_pred = set(row.predictionstring_pred.split(' '))
    set_gt = set(row.predictionstring_gt.split(' '))
    len_gt = len(set_gt)
    len_pred = len(set_pred)
    inter = len(set_gt.intersection(set_pred))
    overlap_1 = inter / len_gt
    overlap_2 = inter/ len_pred
    return [overlap_1, overlap_2]

Evaluation based on the overlap between ground truth and predicted word indices.

1. For each sample, all ground truths and predictions for a given class are compared.
2. If the overlap between the ground truth and prediction is >= 0.5, and the overlap between the prediction and the ground truth >= 0.5, the prediction is a match and considered a true positive. If multiple matches exist, the match with the highest pair of overlaps is taken.
3. Any unmatched ground truths are false negatives and any unmatched predictions are false positives.
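
The matching rule in the steps above can be traced on a small example. This is a simplified sketch that assumes non-empty prediction strings; the helper names `overlaps`, `is_match`, and `micro_f1` are hypothetical and only mirror the logic described here.

```python
def overlaps(pred, gt):
    """Return (share of ground truth covered, share of prediction covered)."""
    pred_set, gt_set = set(pred.split()), set(gt.split())
    inter = len(pred_set & gt_set)
    return inter / len(gt_set), inter / len(pred_set)


def is_match(pred, gt, threshold=0.5):
    """A prediction matches when both overlaps reach the threshold."""
    o_gt, o_pred = overlaps(pred, gt)
    return o_gt >= threshold and o_pred >= threshold


def micro_f1(tp, fp, fn):
    """F1 written in terms of true positives, false positives, false negatives."""
    return tp / (tp + 0.5 * (fp + fn))


gt = "1 2 3 4"
print(overlaps("3 4 5 6", gt), is_match("3 4 5 6", gt))      # (0.5, 0.5) True
print(overlaps("4 5 6 7 8", gt), is_match("4 5 6 7 8", gt))  # (0.25, 0.2) False
print(round(micro_f1(tp=2, fp=1, fn=1), 4))                  # 0.6667
```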

In [None]:
def score_feedback_comp(pred_df, gt_df):
    gt_df = gt_df[['id','discourse_type','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df = pred_df[['id','class','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(gt_df,
                           left_on=['id','class'],
                           right_on=['id','discourse_type'],
                           how='outer',
                           suffixes=('_pred','_gt')
                          )
    joined['predictionstring_gt'] = joined['predictionstring_gt'].fillna(' ')
    joined['predictionstring_pred'] = joined['predictionstring_pred'].fillna(' ')

    joined['overlaps'] = joined.apply(calc_overlap, axis=1)

    # 2. If the overlap between the ground truth and prediction is >= 0.5,
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    # Each entry is already a [overlap_1, overlap_2] list, so index it directly
    joined['overlap1'] = joined['overlaps'].apply(lambda x: x[0])
    joined['overlap2'] = joined['overlaps'].apply(lambda x: x[1])


    joined['potential_TP'] = (joined['overlap1'] >= 0.5) & (joined['overlap2'] >= 0.5)
    joined['max_overlap'] = joined[['overlap1','overlap2']].max(axis=1)
    tp_pred_ids = joined.query('potential_TP') \
        .sort_values('max_overlap', ascending=False) \
        .groupby(['id','predictionstring_gt']).first()['pred_id'].values

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = [p for p in joined['pred_id'].unique() if p not in tp_pred_ids]

    matched_gt_ids = joined.query('potential_TP')['gt_id'].unique()
    unmatched_gt_ids = [c for c in joined['gt_id'].unique() if c not in matched_gt_ids]

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
    #calc microf1
    my_f1_score = TP / (TP + 0.5*(FP+FN))
    return my_f1_score

I dynamically aggregate the predictions across chunks by tracking word-level indices and maintaining accumulated data for each text ID, ensuring
seamless merging of predictions for samples split across different batches.
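
The merging step can be sketched on toy data. This is a simplified variant for illustration, assuming each chunk arrives as a `(text_id, word_ids, labels)` triple with `-1` marking special tokens. It uses a dictionary per text rather than the list bookkeeping in the cell below, but applies the same rule: the first prediction seen for a word is kept.

```python
from collections import defaultdict


def merge_chunks(chunks):
    """Merge per-chunk token predictions into one label list per text.

    chunks: iterable of (text_id, word_ids, labels) triples.
    """
    merged = defaultdict(dict)
    for text_id, word_ids, labels in chunks:
        for word_idx, label in zip(word_ids, labels):
            if word_idx == -1:
                continue  # special or padding token
            # setdefault keeps the earliest prediction for each word
            merged[text_id].setdefault(word_idx, label)
    return {t: [words[i] for i in sorted(words)] for t, words in merged.items()}


toy_chunks = [
    (0, [-1, 0, 0, 1, 2, -1], ["O", "B-Lead", "I-Lead", "I-Lead", "O", "O"]),
    (0, [-1, 2, 3, -1],       ["O", "B-Claim", "I-Claim", "O"]),  # overlaps on word 2
]
print(merge_chunks(toy_chunks))
# {0: ['B-Lead', 'I-Lead', 'O', 'I-Claim']}
```

Word 2 appears in both chunks, and the label from the first chunk wins, which matches the "no prediction from a previous chunk" check used during inference.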

In [None]:
def inference(dl):
    predictions = defaultdict(list) #For text-level data, to help in the merging process by data accumulation of all groups of data
    seen_words_idx = defaultdict(list)

    for batch in dl:
        ids = batch["input_ids"].to(config['device'])
        mask = batch["attention_mask"].to(config['device'])
        with torch.no_grad():  # gradients are not needed at inference time
            outputs = model(ids, attention_mask=mask, return_dict=False)

        del ids, mask

        batch_preds = torch.argmax(outputs[0], axis=-1).cpu().numpy()

        # Go over each prediction, getting the text_id reference
        for k, (chunk_preds, text_id) in enumerate(zip(batch_preds, batch['overflow_to_sample_mapping'].tolist())):

            # The word_ids are absolute references in the original text
            word_ids = batch['wids'][k].numpy()

            # Map from ids to labels
            chunk_preds = [IDS_TO_LABELS[i] for i in chunk_preds]

            for idx, word_idx in enumerate(word_ids):
                if word_idx == -1:
                    pass
                elif word_idx not in seen_words_idx[text_id]:
                    # Add predictions if the word doesn't have a prediction from a previous chunk
                    predictions[text_id].append(chunk_preds[idx])
                    seen_words_idx[text_id].append(word_idx)

    final_predictions = [predictions[k] for k in sorted(predictions.keys())]
    return final_predictions

In [None]:
def get_predictions(df, dl):

    all_labels = inference(dl)
    final_preds = []

    for i in range(len(df)):
        idx = df.id.values[i]
        pred = all_labels[i]
        preds = []
        j = 0

        while j < len(pred):
            cls = pred[j]
            if cls == 'O': pass
            else: cls = cls.replace('B','I')
            end = j + 1
            while end < len(pred) and pred[end] == cls:
                end += 1
            if cls != 'O' and cls != '' and end - j > 7:
                final_preds.append((idx, cls.replace('I-',''),
                                    ' '.join(map(str, list(range(j, end))))))
            j = end

    df_pred = pd.DataFrame(final_preds)
    df_pred.columns = ['id','class','predictionstring']
    return df_pred

In [None]:
def validate(model, df_all, df_val, dl_val, epoch):

    time_start = time.time()

    # Put model in eval mode
    model.eval()

    # Valid targets: needed because df_val has a subset of the columns
    df_valid = df_all.loc[df_all['id'].isin(IDS[valid_idx])]

    # OOF predictions
    oof = get_predictions(df_val, dl_val)

    # Compute F1-score
    f1s = []
    classes = oof['class'].unique()

    epoch_prefix = f"[Epoch {epoch+1:2d} / {config['epochs']:2d}]"
    print(f"{epoch_prefix} Validation F1 scores")

    f1s_log = {}
    for c in classes:
        pred_df = oof.loc[oof['class']==c].copy()
        gt_df = df_valid.loc[df_valid['discourse_type']==c].copy()
        f1 = score_feedback_comp(pred_df, gt_df)
        print(f"{epoch_prefix}   * {c:<10}: {f1:4f}")
        f1s.append(f1)
        f1s_log[f'F1 {c}'] = f1

    elapsed = time.time() - time_start
    print(epoch_prefix)
    print(f'{epoch_prefix} Overall Validation F1: {np.mean(f1s):.4f} [{elapsed:.2f} secs]')
    print(epoch_prefix)
    f1s_log['Overall F1'] = np.mean(f1s)
    wandb.log(f1s_log)

In [None]:
config_model = AutoConfig.from_pretrained('/kaggle/working/model/config.json')
model = AutoModelForTokenClassification.from_pretrained('/kaggle/working/model/pytorch_model.bin',config=config_model)
model.to(config['device']);

In [None]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=config['learning_rates'][0])

# Loop
for epoch in range(config['epochs']):
    print()
    train(model, optimizer, dl_train, epoch)
    validate(model, df_initial, df_val, dl_val, epoch)

print("Final model saved as 'pytorch_model.bin'")
torch.save(model.state_dict(), 'pytorch_model.bin')


[Epoch  1 /  5] Starting epoch  1 with LR = 2.5e-05


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[Epoch  1 /  5]     Steps:    0 --> Loss: 3.9966
[Epoch  1 /  5]     Steps:  200 --> Loss: 1.2659
[Epoch  1 /  5]     Steps:  400 --> Loss: 1.0732
[Epoch  1 /  5]     Steps:  600 --> Loss: 0.9930
[Epoch  1 /  5]     Steps:  800 --> Loss: 0.9400
[Epoch  1 /  5]     Steps: 1000 --> Loss: 0.9127
[Epoch  1 /  5]     Steps: 1200 --> Loss: 0.8890
[Epoch  1 /  5]     Steps: 1400 --> Loss: 0.8706
[Epoch  1 /  5]     Steps: 1600 --> Loss: 0.8537
[Epoch  1 /  5]     Steps: 1800 --> Loss: 0.8343
[Epoch  1 /  5]     Steps: 2000 --> Loss: 0.8235
[Epoch  1 /  5]     Steps: 2200 --> Loss: 0.8131
[Epoch  1 /  5]     Steps: 2400 --> Loss: 0.8015
[Epoch  1 /  5]     Steps: 2600 --> Loss: 0.7933
[Epoch  1 /  5]     Steps: 2800 --> Loss: 0.7857
[Epoch  1 /  5]     Steps: 3000 --> Loss: 0.7865
[Epoch  1 /  5]     Steps: 3200 --> Loss: 0.7808
[Epoch  1 /  5]     Steps: 3400 --> Loss: 0.7765
[Epoch  1 /  5]     Steps: 3600 --> Loss: 0.7731
[Epoch  1 /  5]     Steps: 3800 --> Loss: 0.7674
[Epoch  1 /  5]     



[Epoch  1 /  5] Validation F1 scores
[Epoch  1 /  5]   * Lead      : 0.772864
[Epoch  1 /  5]   * Claim     : 0.511492
[Epoch  1 /  5]   * Evidence  : 0.621083
[Epoch  1 /  5]   * Counterclaim: 0.456814
[Epoch  1 /  5]   * Rebuttal  : 0.359574
[Epoch  1 /  5]   * Concluding Statement: 0.730453
[Epoch  1 /  5]   * Position  : 0.660455
[Epoch  1 /  5]
[Epoch  1 /  5] Overall Validation F1: 0.5875 [147.82 secs]
[Epoch  1 /  5]

[Epoch  2 /  5] Starting epoch  2 with LR = 2.5e-05




[Epoch  2 /  5]     Steps:    0 --> Loss: 0.3913
[Epoch  2 /  5]     Steps:  200 --> Loss: 0.6246
[Epoch  2 /  5]     Steps:  400 --> Loss: 0.6129
[Epoch  2 /  5]     Steps:  600 --> Loss: 0.6122
[Epoch  2 /  5]     Steps:  800 --> Loss: 0.6138
[Epoch  2 /  5]     Steps: 1000 --> Loss: 0.6221
[Epoch  2 /  5]     Steps: 1200 --> Loss: 0.6192
[Epoch  2 /  5]     Steps: 1400 --> Loss: 0.6174
[Epoch  2 /  5]     Steps: 1600 --> Loss: 0.6196
[Epoch  2 /  5]     Steps: 1800 --> Loss: 0.6175
[Epoch  2 /  5]     Steps: 2000 --> Loss: 0.6142
[Epoch  2 /  5]     Steps: 2200 --> Loss: 0.6109
[Epoch  2 /  5]     Steps: 2400 --> Loss: 0.6094
[Epoch  2 /  5]     Steps: 2600 --> Loss: 0.6075
[Epoch  2 /  5]     Steps: 2800 --> Loss: 0.6055
[Epoch  2 /  5]     Steps: 3000 --> Loss: 0.6048
[Epoch  2 /  5]     Steps: 3200 --> Loss: 0.6062
[Epoch  2 /  5]     Steps: 3400 --> Loss: 0.6146
[Epoch  2 /  5]     Steps: 3600 --> Loss: 0.6137
[Epoch  2 /  5]     Steps: 3800 --> Loss: 0.6130
[Epoch  2 /  5]     



[Epoch  2 /  5] Validation F1 scores
[Epoch  2 /  5]   * Position  : 0.641034
[Epoch  2 /  5]   * Lead      : 0.711233
[Epoch  2 /  5]   * Claim     : 0.501585
[Epoch  2 /  5]   * Evidence  : 0.626982
[Epoch  2 /  5]   * Counterclaim: 0.511290
[Epoch  2 /  5]   * Concluding Statement: 0.765943
[Epoch  2 /  5]   * Rebuttal  : 0.410693
[Epoch  2 /  5]
[Epoch  2 /  5] Overall Validation F1: 0.5955 [147.85 secs]
[Epoch  2 /  5]

[Epoch  3 /  5] Starting epoch  3 with LR = 2.5e-06




[Epoch  3 /  5]     Steps:    0 --> Loss: 0.2295
[Epoch  3 /  5]     Steps:  200 --> Loss: 0.5102
[Epoch  3 /  5]     Steps:  400 --> Loss: 0.5046
[Epoch  3 /  5]     Steps:  600 --> Loss: 0.5013
[Epoch  3 /  5]     Steps:  800 --> Loss: 0.5030
[Epoch  3 /  5]     Steps: 1000 --> Loss: 0.4983
[Epoch  3 /  5]     Steps: 1200 --> Loss: 0.4917
[Epoch  3 /  5]     Steps: 1400 --> Loss: 0.4888
[Epoch  3 /  5]     Steps: 1600 --> Loss: 0.4864
[Epoch  3 /  5]     Steps: 1800 --> Loss: 0.4840
[Epoch  3 /  5]     Steps: 2000 --> Loss: 0.4795
[Epoch  3 /  5]     Steps: 2200 --> Loss: 0.4774
[Epoch  3 /  5]     Steps: 2400 --> Loss: 0.4766
[Epoch  3 /  5]     Steps: 2600 --> Loss: 0.4766
[Epoch  3 /  5]     Steps: 2800 --> Loss: 0.4739
[Epoch  3 /  5]     Steps: 3000 --> Loss: 0.4714
[Epoch  3 /  5]     Steps: 3200 --> Loss: 0.4697
[Epoch  3 /  5]     Steps: 3400 --> Loss: 0.4683
[Epoch  3 /  5]     Steps: 3600 --> Loss: 0.4678
[Epoch  3 /  5]     Steps: 3800 --> Loss: 0.4669
[Epoch  3 /  5]     



[Epoch  3 /  5] Validation F1 scores
[Epoch  3 /  5]   * Lead      : 0.787592
[Epoch  3 /  5]   * Claim     : 0.540274
[Epoch  3 /  5]   * Evidence  : 0.675736
[Epoch  3 /  5]   * Counterclaim: 0.534722
[Epoch  3 /  5]   * Rebuttal  : 0.434783
[Epoch  3 /  5]   * Concluding Statement: 0.785690
[Epoch  3 /  5]   * Position  : 0.685390
[Epoch  3 /  5]
[Epoch  3 /  5] Overall Validation F1: 0.6349 [147.09 secs]
[Epoch  3 /  5]

[Epoch  4 /  5] Starting epoch  4 with LR = 2.5e-06




[Epoch  4 /  5]     Steps:    0 --> Loss: 0.4685
[Epoch  4 /  5]     Steps:  200 --> Loss: 0.4425
[Epoch  4 /  5]     Steps:  400 --> Loss: 0.4268
[Epoch  4 /  5]     Steps:  600 --> Loss: 0.4223
[Epoch  4 /  5]     Steps:  800 --> Loss: 0.4240
[Epoch  4 /  5]     Steps: 1000 --> Loss: 0.4218
[Epoch  4 /  5]     Steps: 1200 --> Loss: 0.4228
[Epoch  4 /  5]     Steps: 1400 --> Loss: 0.4270
[Epoch  4 /  5]     Steps: 1600 --> Loss: 0.4234
[Epoch  4 /  5]     Steps: 1800 --> Loss: 0.4219
[Epoch  4 /  5]     Steps: 2000 --> Loss: 0.4204
[Epoch  4 /  5]     Steps: 2200 --> Loss: 0.4196
[Epoch  4 /  5]     Steps: 2400 --> Loss: 0.4198
[Epoch  4 /  5]     Steps: 2600 --> Loss: 0.4213
[Epoch  4 /  5]     Steps: 2800 --> Loss: 0.4216
[Epoch  4 /  5]     Steps: 3000 --> Loss: 0.4219
[Epoch  4 /  5]     Steps: 3200 --> Loss: 0.4230
[Epoch  4 /  5]     Steps: 3400 --> Loss: 0.4227
[Epoch  4 /  5]     Steps: 3600 --> Loss: 0.4226
[Epoch  4 /  5]     Steps: 3800 --> Loss: 0.4224
[Epoch  4 /  5]     

[Epoch  4 /  5] Validation F1 scores
[Epoch  4 /  5]   * Lead      : 0.801067
[Epoch  4 /  5]   * Claim     : 0.542964
[Epoch  4 /  5]   * Evidence  : 0.672311
[Epoch  4 /  5]   * Counterclaim: 0.534653
[Epoch  4 /  5]   * Rebuttal  : 0.435208
[Epoch  4 /  5]   * Concluding Statement: 0.798500
[Epoch  4 /  5]   * Position  : 0.689748
[Epoch  4 /  5]
[Epoch  4 /  5] Overall Validation F1: 0.6392 [147.11 secs]
[Epoch  4 /  5]

[Epoch  5 /  5] Starting epoch  5 with LR = 2.5e-07


[Epoch  5 /  5]     Steps:    0 --> Loss: 0.1559
[Epoch  5 /  5]     Steps:  200 --> Loss: 0.3811
[Epoch  5 /  5]     Steps:  400 --> Loss: 0.3841
[Epoch  5 /  5]     Steps:  600 --> Loss: 0.3870
[Epoch  5 /  5]     Steps:  800 --> Loss: 0.3903
[Epoch  5 /  5]     Steps: 1000 --> Loss: 0.3870
[Epoch  5 /  5]     Steps: 1200 --> Loss: 0.3881
[Epoch  5 /  5]     Steps: 1400 --> Loss: 0.3893
[Epoch  5 /  5]     Steps: 1600 --> Loss: 0.3883
[Epoch  5 /  5]     Steps: 1800 --> Loss: 0.3898
[Epoch  5 /  5]     Steps: 2000 --> Loss: 0.3896
[Epoch  5 /  5]     Steps: 2200 --> Loss: 0.3883
[Epoch  5 /  5]     Steps: 2400 --> Loss: 0.3867
[Epoch  5 /  5]     Steps: 2600 --> Loss: 0.3860
[Epoch  5 /  5]     Steps: 2800 --> Loss: 0.3880
[Epoch  5 /  5]     Steps: 3000 --> Loss: 0.3876
[Epoch  5 /  5]     Steps: 3200 --> Loss: 0.3880
[Epoch  5 /  5]     Steps: 3400 --> Loss: 0.3877
[Epoch  5 /  5]     Steps: 3600 --> Loss: 0.3873
[Epoch  5 /  5]     Steps: 3800 --> Loss: 0.3873
[Epoch  5 /  5]     

[Epoch  5 /  5] Validation F1 scores
[Epoch  5 /  5]   * Lead      : 0.802555
[Epoch  5 /  5]   * Claim     : 0.548313
[Epoch  5 /  5]   * Evidence  : 0.674937
[Epoch  5 /  5]   * Counterclaim: 0.535745
[Epoch  5 /  5]   * Rebuttal  : 0.438679
[Epoch  5 /  5]   * Concluding Statement: 0.795270
[Epoch  5 /  5]   * Position  : 0.691815
[Epoch  5 /  5]
[Epoch  5 /  5] Overall Validation F1: 0.6410 [147.16 secs]
[Epoch  5 /  5]
Final model saved as 'pytorch_model.bin'
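
The overall validation F1 in the log matches the unweighted (macro) mean of the seven per-class scores. This is inferred from the logged numbers rather than from the training code, so treat it as a consistency check, not a statement of how the metric is implemented. A small sketch using the epoch 5 values:

```python
from statistics import mean

# Per-class validation F1 scores copied from the epoch 5 log
f1_epoch5 = {
    "Lead": 0.802555,
    "Claim": 0.548313,
    "Evidence": 0.674937,
    "Counterclaim": 0.535745,
    "Rebuttal": 0.438679,
    "Concluding Statement": 0.795270,
    "Position": 0.691815,
}

# Unweighted mean: every class counts equally, regardless of how often it occurs
macro_f1 = mean(f1_epoch5.values())
print(f"{macro_f1:.4f}")  # prints 0.6410
```

Because each class carries equal weight, the weak Rebuttal and Counterclaim scores pull the overall figure down as much as the strong Lead and Concluding Statement scores lift it.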


Testing the data

In [None]:
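import os
import pandas as pd

TEST_DIR = '../input/project/test'

def load_df_test():
    """Load every test essay into a DataFrame with its id, raw text, and whitespace-split words."""
    test_names, test_texts = [], []
    for fname in os.listdir(TEST_DIR):
        test_names.append(fname.replace('.txt', ''))
        # Context manager closes each file handle after reading
        with open(os.path.join(TEST_DIR, fname), 'r') as fh:
            test_texts.append(fh.read())
    df_test = pd.DataFrame({'id': test_names, 'text': test_texts})
    df_test['text_split'] = df_test.text.str.split()
    return df_test

df_test = load_df_test()
df_test.head()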

Unnamed: 0,id,text,text_split
0,0FB0700DAF44,"During a group project, have you ever asked a ...","[During, a, group, project,, have, you, ever, ..."
1,D72CB1C11673,Making choices in life can be very difficult. ...,"[Making, choices, in, life, can, be, very, dif..."
2,18409261F5C2,80% of Americans believe seeking multiple opin...,"[80%, of, Americans, believe, seeking, multipl..."
3,DF920E0A7337,Have you ever asked more than one person for h...,"[Have, you, ever, asked, more, than, one, pers..."
4,D46BCB48440A,"When people ask for advice,they sometimes talk...","[When, people, ask, for, advice,they, sometime..."


In [None]:
tokenized_test = tokenize(df_test, with_labels=False)  # Same tokenizer as used for training

In [None]:
len(tokenized_test['input_ids'])

10

In [None]:
len(df_test)

5
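
Ten tokenized sequences for five essays is what one would expect if essays longer than 512 tokens are split into several overlapping windows. The sketch below illustrates that splitting in plain Python and records which essay each window came from, so chunks can be regrouped later. The function names and the essay lengths (other than 1304) are hypothetical, special tokens and padding are ignored, and this is not the notebook's `tokenize` function.

```python
from typing import List, Tuple


def split_with_overlap(ids: List[int], max_len: int, stride: int) -> List[List[int]]:
    """Split one token sequence into windows of at most max_len tokens,
    where consecutive windows share `stride` tokens."""
    if max_len <= stride:
        raise ValueError("max_len must be larger than stride")
    step = max_len - stride  # new tokens contributed by each later window
    windows, start = [], 0
    while True:
        windows.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
        start += step
    return windows


def chunk_corpus(docs: List[List[int]], max_len: int = 512,
                 stride: int = 200) -> Tuple[List[List[int]], List[int]]:
    """Chunk every document and remember the source document of each chunk."""
    chunks, sample_map = [], []
    for doc_idx, ids in enumerate(docs):
        for window in split_with_overlap(ids, max_len, stride):
            chunks.append(window)
            sample_map.append(doc_idx)
    return chunks, sample_map


# Hypothetical token counts; only 1304 appears in this notebook.
docs = [list(range(800)), list(range(500)), list(range(1304))]
chunks, sample_map = chunk_corpus(docs)
print(len(chunks), sample_map)
```

The `sample_map` list plays the role of an index from chunk back to essay, which is needed whenever per-chunk predictions have to be merged into per-essay output.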

In [None]:
df_test['text_split'].str.len()

0     635
1     421
2    1056
3     711
4     363
Name: text_split, dtype: int64

In [None]:
n_tokens = len(tokenizer(df_test.iloc[2]['text'])['input_ids'])
n_tokens

Token indices sequence length is longer than the specified maximum sequence length for this model (1304 > 512). Running this sequence through the model will result in indexing errors


1304

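Verify that four 512-token chunks with a stride (overlap) of 200 are the right number to cover 1304 tokens. Each chunk after the first adds $512 - 200 = 312$ new tokens, so three chunks cover at most 1136 tokens and four chunks cover up to 1448:
$512 + 2 \times (512 - 200) = 1136 < 1304 \le 512 + 3 \times (512 - 200) = 1448$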
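
The same count can be computed directly. This is a minimal sketch under two assumptions: the stride is the number of tokens shared by consecutive chunks, and the special tokens added to each chunk are ignored, as in the inequality above. The helper name `count_chunks` is hypothetical.

```python
import math


def count_chunks(n_tokens: int, max_len: int = 512, stride: int = 200) -> int:
    """Number of overlapping windows needed to cover n_tokens.

    The first window covers max_len tokens; every further window
    contributes max_len - stride tokens that were not covered before.
    """
    if n_tokens <= max_len:
        return 1
    step = max_len - stride
    return 1 + math.ceil((n_tokens - max_len) / step)


print(count_chunks(1304))  # prints 4
```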

In [None]:
## Original text:
df_test.iloc[2]['text']

"80% of Americans believe seeking multiple opinions can help them make better choices, and for good reason. Studies have shown the average Americans faring far better in their lives compared to their counterparts because they are listening to other's advice. There are also many myths that have the moral of listening to other people's opinions. For example, Perseus got his achievement of slaying a gorgon because he listened to the Oracle. Another example I have is the fable of Osiris, in which Osiris listens to Sekhmet and becomes the king of he underworld. In all of these stories, the hero listens to other people, and benefited from the people around them being more knowledgeable, more experienced, and giving the hero more choices to consider. Therefore, I believe listening to other's advice can help someone make a better choice because they have more experience compared to you, know the pros and cons of your choice, and gives you multiple perspectives to deliberate over.\n\nFor exampl

In [None]:
## First of the four 512-token chunks (indices 3 to 6) generated for this essay by the tokenization procedure:
tokenizer.decode(tokenized_test['input_ids'][3])

"<s> 80% of Americans believe seeking multiple opinions can help them make better choices, and for good reason. Studies have shown the average Americans faring far better in their lives compared to their counterparts because they are listening to other's advice. There are also many myths that have the moral of listening to other people's opinions. For example, Perseus got his achievement of slaying a gorgon because he listened to the Oracle. Another example I have is the fable of Osiris, in which Osiris listens to Sekhmet and becomes the king of he underworld. In all of these stories, the hero listens to other people, and benefited from the people around them being more knowledgeable, more experienced, and giving the hero more choices to consider. Therefore, I believe listening to other's advice can help someone make a better choice because they have more experience compared to you, know the pros and cons of your choice, and gives you multiple perspectives to deliberate over. For examp

In [None]:
## Last of the four chunks, padded up to 512 tokens:
tokenizer.decode(tokenized_test['input_ids'][6])

"<s> he wouldn't have even considered it, and may have grown to be a nihilistic person who isn't able to change his fate of suffering. Therefore, I believe seeking multiple opinions can help someone make a better choice because it can give them multiple perspectives to consider. In conclusion, little Generic_Name, Generic_Name, Generic_Name, and many more changed their entire lives by listening to other people. If they didn't seek other's advice, they would have fallen into a path that would not allow their potential to bloom. But, thankfully, they chose to listen to other people. As a result, they grew to be figures that were respected around the world. Therefore, I in believe seeking multiple opinions.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pa