# Application of BERT-based models (Devlin et al. 2019) to an English dataset


*   jjzha/jobbert-base-cased (Zhang et al. 2022) (https://huggingface.co/jjzha/jobbert-base-cased)
*   bert-large-uncased (https://huggingface.co/bert-large-uncased)
*   bert-large-uncased-whole-word-masking (https://huggingface.co/bert-large-uncased-whole-word-masking)
*   bert-base-multilingual-cased (https://huggingface.co/bert-base-multilingual-cased)


**The code runs on Google Colab, so the first step is to mount Google Drive and install the libraries the script needs.**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
pip install transformers



In [None]:
pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=d473faba156fa2db8df976a506761afb17edb582072cc29dd661941f9428aac0
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [None]:
pip install pytorch-crf


Collecting pytorch-crf
  Downloading pytorch_crf-0.7.2-py3-none-any.whl (9.5 kB)
Installing collected packages: pytorch-crf
Successfully installed pytorch-crf-0.7.2


**Determine suitable max_length for training set**

In [None]:
import pandas as pd

# Read the training set and group tokens by sentence_id
train_file_path = 'processed_df_answers.csv'
train_set = pd.read_csv(train_file_path)

grouped_data = train_set.groupby('sentence_id').agg({
        'word': list,
        'tag': list
    }).reset_index()

# Count the number of words within each sentence
sentence_lengths = grouped_data['word'].apply(len)

# Calculating the 25th, 50th, and 75th percentiles for sentence lengths
percentiles = sentence_lengths.quantile([0.25, 0.5, 0.75]).to_dict()
percentiles

{0.25: 15.0, 0.5: 21.0, 0.75: 29.0}

This result means:


*   25th Percentile: 25% of the sentences have 15 words or fewer.

*   Median (50th Percentile): Half of the sentences have 21 words or fewer.

*   75th Percentile: 75% of the sentences have 29 words or fewer.

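Sub-word tokenization splits many words into several tokens, and two special tokens are added to every sentence, so the encoded sequences are longer than the word counts above. Taking this into account, we chose max_length = 64, which should leave the model enough context to make accurate predictions.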
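The choice can be sanity-checked by measuring how many sentences still fit once each word is split into sub-tokens and the special tokens are counted. The sketch below is only an illustration under stated assumptions, not the code used for the reported runs: `coverage_at` and the toy `split_into_pieces` tokenizer are hypothetical helpers, and the example sentences are invented.

```python
from typing import Callable, Iterable, List


def split_into_pieces(word: str, piece_len: int = 4) -> List[str]:
    """Toy stand-in for a WordPiece tokenizer (assumption): cut a word into
    chunks of at most `piece_len` characters, marking continuations with '##'."""
    chunks = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [c if i == 0 else "##" + c for i, c in enumerate(chunks)]


def coverage_at(sentences: Iterable[List[str]],
                tokenize: Callable[[str], List[str]],
                max_length: int,
                num_special_tokens: int = 2) -> float:
    """Fraction of sentences whose sub-token count, plus special tokens,
    fits into `max_length` without truncation."""
    lengths = [sum(len(tokenize(w)) for w in words) + num_special_tokens
               for words in sentences]
    if not lengths:
        return 0.0
    return sum(length <= max_length for length in lengths) / len(lengths)


# Hypothetical sentences, already split into words as in the `word` column.
sentences = [
    ["Experience", "with", "Python", "and", "SQL"],
    ["Strong", "communication", "skills", "required"],
]
for max_length in (8, 16, 64):
    print(max_length, coverage_at(sentences, split_into_pieces, max_length))
```

With a real tokenizer, its `tokenize` method could be passed in place of `split_into_pieces`, and the grouped `word` lists from the training set in place of the invented sentences.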


**Setting for Training, Fine-tuning and Evaluation.**
The code is similar to *8-bert-on-vi.ipynb*; see that notebook for the explanation.
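As a rough illustration of the label-alignment step performed by `encode_labels` in the cell below: every sub-token inherits the label of the word it came from, while special and padding positions (word id `None`) receive `O`, and only the leading `[CLS]` of those positions stays in the mask. The `align_labels` helper and the hand-written `word_ids` list are assumptions made for this example; in the notebook the word ids come from the tokenizer's `word_ids()` output.

```python
from typing import List, Optional, Tuple

label_map = {'O': 0, 'B-Skill': 1, 'I-Skill': 2}


def align_labels(word_ids: List[Optional[int]],
                 word_labels: List[int]) -> Tuple[List[int], List[int]]:
    """Return (labels, mask) with one entry per sub-token position."""
    labels, mask = [], []
    seen_word = False
    for word_id in word_ids:
        if word_id is None:
            labels.append(label_map['O'])
            # Only the leading [CLS] stays in the mask; [SEP]/[PAD] do not.
            mask.append(0 if seen_word else 1)
        else:
            labels.append(word_labels[word_id])
            mask.append(1)
            seen_word = True
    return labels, mask


# Hypothetical two-word skill where the first word is split into two
# sub-tokens: [CLS], word 0 (two pieces), word 1, [SEP], [PAD].
word_ids = [None, 0, 0, 1, None, None]
word_labels = [label_map['B-Skill'], label_map['I-Skill']]
labels, mask = align_labels(word_ids, word_labels)
print(labels)
print(mask)
```

Note that a `B-Skill` word split into several pieces yields several consecutive `B-Skill` labels, so evaluation in this setup is carried out at sub-token level rather than word level.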

In [None]:
import torch
import pandas as pd
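# The AdamW shipped with transformers is deprecated in favour of torch.optim.AdamW.
from torch.optim import AdamW
from transformers import BertForTokenClassification, AutoTokenizer, get_linear_schedule_with_warmup, BertPreTrainedModel, BertModel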
from torchcrf import CRF
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from seqeval.metrics import f1_score, classification_report
from tqdm import tqdm
import os
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import train_test_split


class BertForTokenClassificationCRF(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.bert = BertModel(config)
        self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)
        self.classifier = torch.nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(num_tags=self.num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask=None, labels=None, class_weights=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)

        if class_weights is not None and labels is not None:
            # Apply class weights
            weights = class_weights[labels]
            logits = logits * weights.unsqueeze(-1)

        if labels is not None:
            loss = -self.crf(logits, labels, mask=attention_mask.byte(), reduction='mean')
            return loss
        else:
            return self.crf.decode(logits, mask=attention_mask.byte())

label_map = {'O': 0, 'B-Skill': 1, 'I-Skill': 2}

def encode_labels(tokens, text_labels, tokenizer):
    # Tokens read from CSV may be parsed as numbers or NaN; force them to strings.
    if not all(isinstance(token, str) for token in tokens):
        tokens = [str(token) for token in tokens]

    encoded_inputs = tokenizer(tokens, is_split_into_words=True, add_special_tokens=True,
                               max_length=64, truncation=True, padding='max_length',
                               return_attention_mask=True, return_tensors='pt')

    labels = []
    attention_masks = []
    is_first_token = True

    for word_id in encoded_inputs.word_ids():
        if word_id is None:
            # Special tokens and padding get the 'O' label. Only the leading [CLS]
            # is kept in the mask; [SEP] and [PAD] are masked out.
            labels.append(label_map['O'])
            attention_masks.append(1 if is_first_token else 0)
        else:
            # Every sub-token inherits the label of the word it belongs to.
            labels.append(text_labels[word_id])
            attention_masks.append(1)
            is_first_token = False

    # Defensive padding; the tokenizer already pads to 64 positions.
    labels = labels[:64] + [label_map['O']] * (64 - len(labels))
    attention_masks = torch.tensor(attention_masks)

    return encoded_inputs['input_ids'][0], attention_masks, labels

def process_dataframe(df, tokenizer, label_map):
    grouped_data = df.groupby('sentence_id').agg({
        'word': list,
        'tag': list
    }).reset_index()
    grouped_data['label_ids'] = grouped_data['tag'].apply(lambda tags: [label_map[tag] for tag in tags])
    encoded_data = [encode_labels(sentence_tokens, sentence_labels, tokenizer)
                    for sentence_tokens, sentence_labels in zip(grouped_data['word'], grouped_data['label_ids'])]

    input_ids, attention_masks, labels = zip(*encoded_data)
    return torch.stack(input_ids), torch.stack(attention_masks), torch.tensor(labels)

def train_model(model, train_dataloader, dev_dataloader, optimizer, scheduler, device, num_epochs):
    train_losses = []

    for epoch_i in range(num_epochs):
        model.train()
        total_loss = 0
        for batch in tqdm(train_dataloader, total=len(train_dataloader), desc="Training"):
            batch = tuple(t.to(device) for t in batch)
            b_input_ids, b_input_mask, b_labels = batch

            # Reset gradients, then compute the CRF negative log-likelihood.
            optimizer.zero_grad()
            loss = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
            total_loss += loss.item()
            loss.backward()
            # Clip the gradient norm to 1.0 before the parameter update.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()

        avg_train_loss = total_loss / len(train_dataloader)
        train_losses.append(avg_train_loss)
        print(f"Epoch {epoch_i + 1} - Average training loss: {avg_train_loss}")

        dev_f1 = calculate_f1_score(model, dev_dataloader, device)
        print(f"Epoch {epoch_i + 1} - Dev F1 Score: {dev_f1}")

    return model


def evaluate_model(model, dataloader, device):
    model.eval()
    model.to(device)
    predictions, true_labels = [], []
    # Inverse mapping from label id to tag string.
    id2label = {idx: tag for tag, idx in label_map.items()}

    for batch in tqdm(dataloader, desc="Evaluating"):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():
            # Without labels, the model returns the CRF-decoded tag ids per sequence.
            batch_predictions = model(b_input_ids, attention_mask=b_input_mask)

        label_ids = b_labels.to('cpu').numpy()

        for i in range(label_ids.shape[0]):
            input_len = int(b_input_mask[i].sum())
            # Skip [CLS] at position 0. Since [SEP] is not part of the mask,
            # the upper bound also leaves out the last real sub-token.
            prediction = [id2label[p] for p in batch_predictions[i][1:input_len-1]]
            true_label = [id2label[int(l)] for l in label_ids[i][1:input_len-1]]
            predictions.append(prediction)
            true_labels.append(true_label)

    return predictions, true_labels

def calculate_f1_score(model, dataloader, device):
    predictions, true_labels = evaluate_model(model, dataloader, device)
    return f1_score(true_labels, predictions)

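def save_model(model, path):
    directory = os.path.dirname(path)
    # os.makedirs('') raises FileNotFoundError, so only create a directory if the path has one.
    if directory:
        os.makedirs(directory, exist_ok=True)
    torch.save(model.state_dict(), path)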

def load_model(model_class, path, config):
    model = model_class(config)
    model.load_state_dict(torch.load(path))
    return model

def main():
    train_dev_file_path = 'processed_df_answers.csv'
    test_file_path = 'processed_df_testset.csv'

    train_dev_data = pd.read_csv(train_dev_file_path)
    test_set = pd.read_csv(test_file_path)

    unique_sentence_ids = train_dev_data['sentence_id'].unique()
    train_sentence_ids, dev_sentence_ids = train_test_split(unique_sentence_ids, test_size=len(test_set["sentence_id"].unique()), random_state=42)

    train_set = train_dev_data[train_dev_data['sentence_id'].isin(train_sentence_ids)]
    dev_set = train_dev_data[train_dev_data['sentence_id'].isin(dev_sentence_ids)]

    tokenizers = [AutoTokenizer.from_pretrained('jjzha/jobbert-base-cased'),
                  AutoTokenizer.from_pretrained('bert-large-uncased'),
                  AutoTokenizer.from_pretrained('bert-large-uncased-whole-word-masking'),
                  AutoTokenizer.from_pretrained('bert-base-multilingual-cased')]

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    learning_rates = [1e-3, 1e-4, 1e-5]
    num_epochs = 5
    epsilon = 1e-8

    for tokenizer in tokenizers:
        tokenizer_name = tokenizer.name_or_path
        print(f"Tokenizer: {tokenizer_name}")
        train_inputs, train_masks, train_labels = process_dataframe(train_set, tokenizer, label_map)
        class_weights = compute_class_weight('balanced', classes=np.unique(train_labels.numpy()), y=train_labels.numpy().flatten())
        class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)

        test_inputs, test_masks, test_labels = process_dataframe(test_set, tokenizer, label_map)
        dev_inputs, dev_masks, dev_labels = process_dataframe(dev_set, tokenizer, label_map)

        best_f1_score_for_tokenizer = 0
        best_model_info_for_tokenizer = None

        for lr in learning_rates:
            print(f"\nTraining with learning rate: {lr}")

            train_dataset = TensorDataset(train_inputs, train_masks, train_labels)
            train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=16)

            test_dataset = TensorDataset(test_inputs, test_masks, test_labels)
            test_dataloader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=16)

            dev_dataset = TensorDataset(dev_inputs, dev_masks, dev_labels)
            dev_dataloader = DataLoader(dev_dataset, sampler=SequentialSampler(dev_dataset), batch_size=16)

            model = BertForTokenClassificationCRF.from_pretrained(
                tokenizer_name,
                num_labels=len(label_map)
            )
            model.to(device)

            optimizer = AdamW(model.parameters(), lr=lr, eps=epsilon, weight_decay=0.01)
            total_steps = len(train_dataloader) * num_epochs
            scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

            model = train_model(model, train_dataloader, dev_dataloader, optimizer, scheduler, device, num_epochs)

            dev_predictions, dev_true_labels = evaluate_model(model, dev_dataloader, device)
            dev_f1 = f1_score(dev_true_labels, dev_predictions)

            if dev_f1 > best_f1_score_for_tokenizer:
                best_f1_score_for_tokenizer = dev_f1
                best_model_info_for_tokenizer = {
                    "model": model,
                    "learning_rate": lr,
                    "f1_score": dev_f1
                }

        if best_model_info_for_tokenizer:
            model_save_path = f'best_model_{tokenizer_name}.pth'
            save_model(best_model_info_for_tokenizer["model"], model_save_path)
            print(f"\nBest model for {tokenizer_name} saved: {model_save_path} with F1 score: {best_model_info_for_tokenizer['f1_score']}")

            best_model_loaded = load_model(BertForTokenClassificationCRF, model_save_path, best_model_info_for_tokenizer["model"].config)
            best_model_loaded.to(device)
            dev_predictions, dev_true_labels = evaluate_model(best_model_loaded, dev_dataloader, device)
            dev_report = classification_report(dev_true_labels, dev_predictions)
            print(f"\nDev Set Classification Report for {tokenizer_name}:\n{dev_report}")
            test_predictions, test_true_labels = evaluate_model(best_model_loaded, test_dataloader, device)
            test_report = classification_report(test_true_labels, test_predictions)
            print(f"\nTest Set Classification Report for {tokenizer_name}:\n{test_report}")

if __name__ == "__main__":
    main()




Tokenizer: jjzha/jobbert-base-cased

Training with learning rate: 0.001



Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at jjzha/jobbert-base-cased and are newly initialized: ['crf.end_transitions', 'crf.start_transitions', 'classifier.weight', 'classifier.bias', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'crf.transitions']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training: 100%|██████████| 582/582 [01:19<00:00,  7.34it/s]


Epoch 1 - Average training loss: 19.77084841679052


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.93it/s]


Epoch 1 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [01:16<00:00,  7.63it/s]


Epoch 2 - Average training loss: 13.621600712287877


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.81it/s]


Epoch 2 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [01:16<00:00,  7.59it/s]


Epoch 3 - Average training loss: 11.66387256969701


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.47it/s]


Epoch 3 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [01:16<00:00,  7.62it/s]


Epoch 4 - Average training loss: 10.906676457919616


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.45it/s]


Epoch 4 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [01:16<00:00,  7.61it/s]


Epoch 5 - Average training loss: 10.594653413877454


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.71it/s]


Epoch 5 - Dev F1 Score: 0.0


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 22.01it/s]



Training with learning rate: 0.0001


Training: 100%|██████████| 582/582 [01:16<00:00,  7.60it/s]


Epoch 1 - Average training loss: 11.091320294694802


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.50it/s]


Epoch 1 - Dev F1 Score: 0.47783251231527096


Training: 100%|██████████| 582/582 [01:16<00:00,  7.64it/s]


Epoch 2 - Average training loss: 7.7234593058369825


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.27it/s]


Epoch 2 - Dev F1 Score: 0.5148683092608327


Training: 100%|██████████| 582/582 [01:16<00:00,  7.65it/s]


Epoch 3 - Average training loss: 5.056656461401084


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.86it/s]


Epoch 3 - Dev F1 Score: 0.5004240882103478


Training: 100%|██████████| 582/582 [01:16<00:00,  7.63it/s]


Epoch 4 - Average training loss: 3.0804700146835695


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.88it/s]


Epoch 4 - Dev F1 Score: 0.45781119465329995


Training: 100%|██████████| 582/582 [01:15<00:00,  7.66it/s]


Epoch 5 - Average training loss: 1.895519100513655


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.89it/s]


Epoch 5 - Dev F1 Score: 0.456910569105691


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.55it/s]



Training with learning rate: 1e-05


Training: 100%|██████████| 582/582 [01:16<00:00,  7.60it/s]


Epoch 1 - Average training loss: 12.70598436712809


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.99it/s]


Epoch 1 - Dev F1 Score: 0.3358547655068079


Training: 100%|██████████| 582/582 [01:16<00:00,  7.62it/s]


Epoch 2 - Average training loss: 9.891203104015888


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.35it/s]


Epoch 2 - Dev F1 Score: 0.40195280716029286


Training: 100%|██████████| 582/582 [01:16<00:00,  7.63it/s]


Epoch 3 - Average training loss: 8.933021284870266


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.72it/s]


Epoch 3 - Dev F1 Score: 0.4118616144975288


Training: 100%|██████████| 582/582 [01:16<00:00,  7.61it/s]


Epoch 4 - Average training loss: 8.246785907810906


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.81it/s]


Epoch 4 - Dev F1 Score: 0.420032310177706


Training: 100%|██████████| 582/582 [01:16<00:00,  7.60it/s]


Epoch 5 - Average training loss: 7.801893204757848


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.45it/s]


Epoch 5 - Dev F1 Score: 0.421735604217356


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.23it/s]



Best model for jjzha/jobbert-base-cased saved: /content/drive/MyDrive/testBA/bert_en_models/best_model_jjzha/jobbert-base-cased.pth with F1 score: 0.456910569105691


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.47it/s]



Dev Set Classification Report for jjzha/jobbert-base-cased:
              precision    recall  f1-score   support

       Skill       0.43      0.49      0.46       575

   micro avg       0.43      0.49      0.46       575
   macro avg       0.43      0.49      0.46       575
weighted avg       0.43      0.49      0.46       575



Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.55it/s]



Test Set Classification Report for jjzha/jobbert-base-cased:
              precision    recall  f1-score   support

       Skill       0.56      0.44      0.49       904

   micro avg       0.56      0.44      0.49       904
   macro avg       0.56      0.44      0.49       904
weighted avg       0.56      0.44      0.49       904

Tokenizer: bert-large-uncased

Training with learning rate: 0.001



Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['crf.end_transitions', 'crf.start_transitions', 'classifier.weight', 'classifier.bias', 'crf.transitions']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 1 - Average training loss: 17.69840958274107


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.60it/s]


Epoch 1 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 2 - Average training loss: 12.60205442061539


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.66it/s]


Epoch 2 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:40<00:00,  3.63it/s]


Epoch 3 - Average training loss: 11.003057338118143


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.62it/s]


Epoch 3 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 4 - Average training loss: 10.320801300691166


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.60it/s]


Epoch 4 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:40<00:00,  3.61it/s]


Epoch 5 - Average training loss: 10.025995823116236


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.51it/s]


Epoch 5 - Dev F1 Score: 0.0


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.62it/s]



Training with learning rate: 0.0001


Training: 100%|██████████| 582/582 [02:41<00:00,  3.60it/s]


Epoch 1 - Average training loss: 19.54640313931757


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.39it/s]


Epoch 1 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:41<00:00,  3.60it/s]


Epoch 2 - Average training loss: 18.506129844901487


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.51it/s]


Epoch 2 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:41<00:00,  3.61it/s]


Epoch 3 - Average training loss: 17.95831512667469


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.38it/s]


Epoch 3 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 4 - Average training loss: 17.61113301339428


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.38it/s]


Epoch 4 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 5 - Average training loss: 17.500803062596272


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.49it/s]


Epoch 5 - Dev F1 Score: 0.0


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.45it/s]



Training with learning rate: 1e-05


Training: 100%|██████████| 582/582 [02:41<00:00,  3.61it/s]


Epoch 1 - Average training loss: 12.835455861697902


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.30it/s]


Epoch 1 - Dev F1 Score: 0.35975066785396265


Training: 100%|██████████| 582/582 [02:41<00:00,  3.61it/s]


Epoch 2 - Average training loss: 9.57307370135055


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.43it/s]


Epoch 2 - Dev F1 Score: 0.40316901408450706


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 3 - Average training loss: 8.272694834319177


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.46it/s]


Epoch 3 - Dev F1 Score: 0.4242957746478873


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 4 - Average training loss: 7.240819787036922


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.29it/s]


Epoch 4 - Dev F1 Score: 0.4350230414746544


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 5 - Average training loss: 6.442550402326682


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.29it/s]


Epoch 5 - Dev F1 Score: 0.42690582959641254


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.46it/s]



Best model for bert-large-uncased saved: /content/drive/MyDrive/testBA/bert_en_models/best_model_bert-large-uncased.pth with F1 score: 0.42690582959641254


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 10.98it/s]



Dev Set Classification Report for bert-large-uncased:
              precision    recall  f1-score   support

       Skill       0.38      0.49      0.43       490

   micro avg       0.38      0.49      0.43       490
   macro avg       0.38      0.49      0.43       490
weighted avg       0.38      0.49      0.43       490



Evaluating: 100%|██████████| 21/21 [00:01<00:00, 10.75it/s]



Test Set Classification Report for bert-large-uncased:
              precision    recall  f1-score   support

       Skill       0.50      0.43      0.46       801

   micro avg       0.50      0.43      0.46       801
   macro avg       0.50      0.43      0.46       801
weighted avg       0.50      0.43      0.46       801

Tokenizer: bert-large-uncased-whole-word-masking

Training with learning rate: 0.001



Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-large-uncased-whole-word-masking and are newly initialized: ['crf.end_transitions', 'crf.start_transitions', 'classifier.weight', 'classifier.bias', 'crf.transitions']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 1 - Average training loss: 18.364085774241445


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.51it/s]


Epoch 1 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:41<00:00,  3.61it/s]


Epoch 2 - Average training loss: 12.911842928719274


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.56it/s]


Epoch 2 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:41<00:00,  3.61it/s]


Epoch 3 - Average training loss: 11.118298070946919


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.53it/s]


Epoch 3 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:41<00:00,  3.61it/s]


Epoch 4 - Average training loss: 10.336887342413677


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.45it/s]


Epoch 4 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [02:41<00:00,  3.60it/s]


Epoch 5 - Average training loss: 10.019899473157535


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.59it/s]


Epoch 5 - Dev F1 Score: 0.0


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.55it/s]



Training with learning rate: 0.0001


Training: 100%|██████████| 582/582 [02:41<00:00,  3.60it/s]


Epoch 1 - Average training loss: 11.849109766819224


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.18it/s]


Epoch 1 - Dev F1 Score: 0.41477272727272724


Training: 100%|██████████| 582/582 [02:41<00:00,  3.59it/s]


Epoch 2 - Average training loss: 8.913523948684182


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.42it/s]


Epoch 2 - Dev F1 Score: 0.37815975733063706


Training: 100%|██████████| 582/582 [02:41<00:00,  3.61it/s]


Epoch 3 - Average training loss: 6.719029015896656


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.44it/s]


Epoch 3 - Dev F1 Score: 0.4812734082397004


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 4 - Average training loss: 4.530593305518947


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.34it/s]


Epoch 4 - Dev F1 Score: 0.4833659491193738


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 5 - Average training loss: 2.764452708341002


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.32it/s]


Epoch 5 - Dev F1 Score: 0.476


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.40it/s]



Training with learning rate: 1e-05


Training: 100%|██████████| 582/582 [02:40<00:00,  3.63it/s]


Epoch 1 - Average training loss: 11.214448454453773


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.43it/s]


Epoch 1 - Dev F1 Score: 0.37205081669691464


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 2 - Average training loss: 8.722560712971639


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.23it/s]


Epoch 2 - Dev F1 Score: 0.40558139534883725


Training: 100%|██████████| 582/582 [02:40<00:00,  3.63it/s]


Epoch 3 - Average training loss: 7.404626631654825


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.32it/s]


Epoch 3 - Dev F1 Score: 0.43611911623439


Training: 100%|██████████| 582/582 [02:40<00:00,  3.63it/s]


Epoch 4 - Average training loss: 6.118207544805258


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.41it/s]


Epoch 4 - Dev F1 Score: 0.4063079777365492


Training: 100%|██████████| 582/582 [02:40<00:00,  3.62it/s]


Epoch 5 - Average training loss: 5.265544702506967


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.47it/s]


Epoch 5 - Dev F1 Score: 0.4383301707779886


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.46it/s]



Best model for bert-large-uncased-whole-word-masking saved: /content/drive/MyDrive/testBA/bert_en_models/best_model_bert-large-uncased-whole-word-masking.pth with F1 score: 0.476


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 10.71it/s]



Dev Set Classification Report for bert-large-uncased-whole-word-masking:
              precision    recall  f1-score   support

       Skill       0.47      0.49      0.48       490

   micro avg       0.47      0.49      0.48       490
   macro avg       0.47      0.49      0.48       490
weighted avg       0.47      0.49      0.48       490



Evaluating: 100%|██████████| 21/21 [00:01<00:00, 11.01it/s]



Test Set Classification Report for bert-large-uncased-whole-word-masking:
              precision    recall  f1-score   support

       Skill       0.57      0.41      0.47       801

   micro avg       0.57      0.41      0.47       801
   macro avg       0.57      0.41      0.47       801
weighted avg       0.57      0.41      0.47       801

Tokenizer: bert-base-multilingual-cased

Training with learning rate: 0.001



Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['crf.end_transitions', 'crf.start_transitions', 'classifier.weight', 'classifier.bias', 'crf.transitions']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training: 100%|██████████| 582/582 [01:22<00:00,  7.09it/s]


Epoch 1 - Average training loss: 18.371930026516473


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.90it/s]


Epoch 1 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [01:21<00:00,  7.17it/s]


Epoch 2 - Average training loss: 13.648922626914848


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.51it/s]


Epoch 2 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [01:21<00:00,  7.17it/s]


Epoch 3 - Average training loss: 11.80845211133924


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.73it/s]


Epoch 3 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [01:21<00:00,  7.13it/s]


Epoch 4 - Average training loss: 11.024561243778242


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.32it/s]


Epoch 4 - Dev F1 Score: 0.0


Training: 100%|██████████| 582/582 [01:21<00:00,  7.17it/s]


Epoch 5 - Average training loss: 10.73237258462152


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 22.02it/s]


Epoch 5 - Dev F1 Score: 0.0


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 22.05it/s]



Training with learning rate: 0.0001


Training: 100%|██████████| 582/582 [01:21<00:00,  7.11it/s]


Epoch 1 - Average training loss: 12.43404071929119


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.64it/s]


Epoch 1 - Dev F1 Score: 0.4395973154362416


Training: 100%|██████████| 582/582 [01:21<00:00,  7.13it/s]


Epoch 2 - Average training loss: 9.277599620245576


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.53it/s]


Epoch 2 - Dev F1 Score: 0.5039494470774091


Training: 100%|██████████| 582/582 [01:21<00:00,  7.13it/s]


Epoch 3 - Average training loss: 7.067633427090661


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.39it/s]


Epoch 3 - Dev F1 Score: 0.5100887812752221


Training: 100%|██████████| 582/582 [01:21<00:00,  7.13it/s]


Epoch 4 - Average training loss: 4.761372963997097


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.50it/s]


Epoch 4 - Dev F1 Score: 0.5077186963979418


Training: 100%|██████████| 582/582 [01:21<00:00,  7.12it/s]


Epoch 5 - Average training loss: 3.146508716840515


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.46it/s]


Epoch 5 - Dev F1 Score: 0.4971287940935193


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.47it/s]



Training with learning rate: 1e-05


Training: 100%|██████████| 582/582 [01:20<00:00,  7.19it/s]


Epoch 1 - Average training loss: 12.551706657376895


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.52it/s]


Epoch 1 - Dev F1 Score: 0.3901060070671378


Training: 100%|██████████| 582/582 [01:21<00:00,  7.17it/s]


Epoch 2 - Average training loss: 9.910293041635624


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.42it/s]


Epoch 2 - Dev F1 Score: 0.4597156398104265


Training: 100%|██████████| 582/582 [01:22<00:00,  7.09it/s]


Epoch 3 - Average training loss: 8.84016670435155


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.45it/s]


Epoch 3 - Dev F1 Score: 0.4895429899302866


Training: 100%|██████████| 582/582 [01:22<00:00,  7.09it/s]


Epoch 4 - Average training loss: 7.922014176640723


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.25it/s]


Epoch 4 - Dev F1 Score: 0.47709320695102686


Training: 100%|██████████| 582/582 [01:22<00:00,  7.06it/s]


Epoch 5 - Average training loss: 7.315123693230226


Evaluating: 100%|██████████| 21/21 [00:01<00:00, 20.96it/s]


Epoch 5 - Dev F1 Score: 0.46901960784313723


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.39it/s]



Best model for bert-base-multilingual-cased saved: /content/drive/MyDrive/testBA/bert_en_models/best_model_bert-base-multilingual-cased.pth with F1 score: 0.4971287940935193


Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.20it/s]



Dev Set Classification Report for bert-base-multilingual-cased:
              precision    recall  f1-score   support

       Skill       0.50      0.50      0.50       609

   micro avg       0.50      0.50      0.50       609
   macro avg       0.50      0.50      0.50       609
weighted avg       0.50      0.50      0.50       609



Evaluating: 100%|██████████| 21/21 [00:00<00:00, 21.33it/s]



Test Set Classification Report for bert-base-multilingual-cased:
              precision    recall  f1-score   support

       Skill       0.64      0.44      0.52       973

   micro avg       0.64      0.44      0.52       973
   macro avg       0.64      0.44      0.52       973
weighted avg       0.64      0.44      0.52       973
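
**Comparing learning rates and epochs from the log**

For bert-base-multilingual-cased, the F1 score reported with the saved model (0.4971) equals the epoch 5 dev score at learning rate 0.0001, while the highest dev score in the log is 0.5101 at epoch 3 of the same run. The sketch below puts the two selection rules side by side using the dev F1 values copied from the log above. It is only an illustration: the function names `best_by_peak` and `best_by_final` are hypothetical, and how the original training script actually selects the checkpoint is an assumption, not something the log confirms.

```python
# Dev F1 per epoch for bert-base-multilingual-cased, copied from the log above.
dev_f1_history = {
    0.001: [0.0, 0.0, 0.0, 0.0, 0.0],
    0.0001: [0.4395973154362416, 0.5039494470774091, 0.5100887812752221,
             0.5077186963979418, 0.4971287940935193],
    1e-05: [0.3901060070671378, 0.4597156398104265, 0.4895429899302866,
            0.47709320695102686, 0.46901960784313723],
}


def best_by_peak(history):
    """Return (learning_rate, epoch, f1) for the highest dev F1 at any epoch."""
    candidates = (
        (lr, epoch, f1)
        for lr, scores in history.items()
        for epoch, f1 in enumerate(scores, start=1)
    )
    return max(candidates, key=lambda item: item[2])


def best_by_final(history):
    """Return (learning_rate, epoch, f1) using only the last epoch of each run."""
    candidates = ((lr, len(scores), scores[-1]) for lr, scores in history.items())
    return max(candidates, key=lambda item: item[2])


print("Peak epoch: ", best_by_peak(dev_f1_history))
print("Final epoch:", best_by_final(dev_f1_history))
```

Selecting on the peak epoch would require saving a checkpoint whenever the dev F1 improves during training, rather than once after the last epoch. The run with learning rate 0.001 never leaves a dev F1 of 0.0, so it is excluded under either rule.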

