# Application of models based on BERT (Devlin et al. 2019) to a Vietnamese dataset


*   jjzha/jobbert-base-cased (Zhang et al. 2022) (https://huggingface.co/jjzha/jobbert-base-cased)
*   bert-large-uncased (https://huggingface.co/bert-large-uncased)
*   bert-large-uncased-whole-word-masking (https://huggingface.co/bert-large-uncased-whole-word-masking)
*   bert-base-multilingual-cased (https://huggingface.co/bert-base-multilingual-cased)



**The code runs on Google Colab, so the first step is to mount Google Drive and install the libraries the script needs.**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
pip install transformers



In [3]:
pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.6/43.6 kB 1.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=bc8ad0eb169a2202c1e7c898e4b3567d56c4d1421881784d06f3bf3d35b488a4
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [4]:
pip install pytorch-crf


Collecting pytorch-crf
  Downloading pytorch_crf-0.7.2-py3-none-any.whl (9.5 kB)
Installing collected packages: pytorch-crf
Successfully installed pytorch-crf-0.7.2


**Determine a suitable max_length for the training set**

In [None]:
import pandas as pd

# Read the training set and group tokens by sentence_id
train_file_path = 'processed_kaggle_dataset.csv'
train_set = pd.read_csv(train_file_path)

grouped_data = train_set.groupby('sentence_id').agg({
        'word': list,
        'tag': list
    }).reset_index()

# Count the number of words within each sentence
sentence_lengths = grouped_data['word'].apply(len)

# Calculating the 25th, 50th, and 75th percentiles for sentence lengths
percentiles = sentence_lengths.quantile([0.25, 0.5, 0.75]).to_dict()
percentiles



{0.25: 5.0, 0.5: 9.0, 0.75: 14.0}

This result means:


*   25th Percentile: 25% of the sentences have 5 words or fewer.

*   Median (50th Percentile): Half of the sentences have 9 words or fewer.

*   75th Percentile: 75% of the sentences have 14 words or fewer.

Because sub-word tokenization can split one word into several tokens, we chose max_length = 32, more than twice the 75th-percentile sentence length, so that the model has enough context to make accurate predictions.
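The effect of sub-tokenization on sequence length can be checked before fixing max_length. The following is a minimal sketch, not part of the original notebook: `toy_tokenize` is a hypothetical stand-in for a real WordPiece tokenizer, and the two example sentences are made up for illustration.

```python
def subword_length(words, tokenize, num_special_tokens=2):
    """Sequence length after sub-tokenization, including [CLS] and [SEP]."""
    return sum(len(tokenize(word)) for word in words) + num_special_tokens


def coverage(lengths, max_length):
    """Fraction of sequences that fit into max_length without truncation."""
    return sum(length <= max_length for length in lengths) / len(lengths)


def toy_tokenize(word):
    """Hypothetical stand-in for a WordPiece tokenizer: 4-character chunks."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


# Made-up sentences, already split into words
sentences = [["data", "analysis", "skills"], ["teamwork"]]
lengths = [subword_length(sentence, toy_tokenize) for sentence in sentences]
print(lengths)                # [7, 4]
print(coverage(lengths, 32))  # 1.0
print(coverage(lengths, 5))   # 0.5
```

With a Hugging Face tokenizer, `tokenizer.tokenize` could be passed in place of `toy_tokenize`, and the word lists could come from `grouped_data['word']`.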


**Setup for training, fine-tuning and evaluation.**
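The script in the next cell gives every sub-token the label of the word it came from and attends only to [CLS] and the word pieces. A minimal sketch of that alignment step, assuming a hand-written `word_ids` list in the format a fast tokenizer reports (the helper names `align_labels` and `build_mask` are hypothetical):

```python
def align_labels(word_ids, word_labels, pad_label=0):
    """Each sub-token inherits its word's label; special/padding slots get pad_label."""
    return [pad_label if w is None else word_labels[w] for w in word_ids]


def build_mask(word_ids):
    """Attend to the leading [CLS] and to word pieces; mask out [SEP] and padding."""
    return [1 if (w is not None or i == 0) else 0 for i, w in enumerate(word_ids)]


# Assumed layout for 3 words in a window of 8 positions:
# [CLS], word 0, word 1 (two pieces), word 2, [SEP], padding, padding
word_ids = [None, 0, 1, 1, 2, None, None, None]
word_labels = [0, 1, 2]  # O, B-Skill, I-Skill

aligned = align_labels(word_ids, word_labels)
mask = build_mask(word_ids)
print(aligned)  # [0, 0, 1, 1, 2, 0, 0, 0]
print(mask)     # [1, 1, 1, 1, 1, 0, 0, 0]
```

Note that a B-Skill word split into two pieces yields two B-Skill tags, so the number of labelled spans depends on the tokenizer.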



In [5]:
import torch
import pandas as pd
from transformers import AutoTokenizer, get_linear_schedule_with_warmup, BertPreTrainedModel, BertModel
from torch.optim import AdamW  # used instead of the deprecated transformers.AdamW
from torchcrf import CRF
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from seqeval.metrics import f1_score, classification_report
from tqdm import tqdm
import os
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# BERT encoder with a linear classifier and a CRF output layer
class BertForTokenClassificationCRF(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.bert = BertModel(config)
        self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)
        self.classifier = torch.nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(num_tags=self.num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask=None, labels=None, class_weights=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)

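        if class_weights is not None and labels is not None:
            # Scale each token's emission scores by the weight of its gold label
            # to counter the label imbalance (training only)
            weights = class_weights[labels]
            logits = logits * weights.unsqueeze(-1)

        # Boolean mask: a uint8 mask makes torch.where emit a deprecation warning
        mask = attention_mask.bool()
        if labels is not None:
            # The CRF returns a log-likelihood, so negate it to obtain a loss
            loss = -self.crf(logits, labels, mask=mask, reduction='mean')
            return loss
        else:
            # Viterbi decoding returns one list of tag ids per sequence
            return self.crf.decode(logits, mask=mask)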

# Define a label map
label_map = {'O': 0, 'B-Skill': 1, 'I-Skill': 2}

# Tokenize one sentence and align the word-level labels with its sub-tokens
def encode_labels(tokens, text_labels, tokenizer, max_length=32):
    # Non-string values (e.g. numbers parsed by pandas) must be cast to str
    tokens = [str(token) for token in tokens]

    # max_length = 32 was chosen in the previous step
    encoded_inputs = tokenizer(tokens, is_split_into_words=True, add_special_tokens=True,
                               max_length=max_length, truncation=True, padding='max_length',
                               return_attention_mask=True, return_tensors='pt')

    labels = []
    attention_masks = []
    is_first_token = True

    for word_id in encoded_inputs.word_ids():
        if word_id is None:
            # Special tokens and padding carry the 'O' label. Only the leading
            # [CLS] is attended to; [SEP] and padding are masked out.
            labels.append(label_map['O'])
            attention_masks.append(1 if is_first_token else 0)
        else:
            # Every sub-token inherits the label of the word it belongs to
            labels.append(text_labels[word_id])
            attention_masks.append(1)
            is_first_token = False

    # Pad or truncate the labels to exactly max_length entries
    labels = labels[:max_length] + [label_map['O']] * (max_length - len(labels))

    # Convert attention masks to a tensor
    attention_masks = torch.tensor(attention_masks)

    return encoded_inputs['input_ids'][0], attention_masks, labels

# Process the data into a format suitable for training, including tokenization and label encoding
def process_dataframe(df, tokenizer, label_map):
    grouped_data = df.groupby('sentence_id').agg({
        'word': list,
        'tag': list
    }).reset_index()
    grouped_data['label_ids'] = grouped_data['tag'].apply(lambda tags: [label_map[tag] for tag in tags])
    # Encode the data
    encoded_data = [encode_labels(sentence_tokens, sentence_labels, tokenizer)
                    for sentence_tokens, sentence_labels in zip(grouped_data['word'], grouped_data['label_ids'])]

    input_ids, attention_masks, labels = zip(*encoded_data)
    return torch.stack(input_ids), torch.stack(attention_masks), torch.tensor(labels)

# Train the model
def train_model(model, tokenizer, train_dataloader, dev_dataloader, optimizer, scheduler, device, num_epochs, class_weights):
    train_losses = []

    for epoch_i in range(num_epochs):
        model.train()
        total_loss = 0

        for step, batch in enumerate(train_dataloader):
            batch = tuple(t.to(device) for t in batch)
            b_input_ids, b_input_mask, b_labels = batch
            model.zero_grad()
            outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels, class_weights=class_weights)
            loss = outputs[0] if isinstance(outputs, tuple) else outputs
            total_loss += loss.item()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()

        avg_train_loss = total_loss / len(train_dataloader)
        print(f"Epoch {epoch_i + 1}/{num_epochs} - Training loss: {avg_train_loss}")

        dev_predictions, dev_true_labels,_ = evaluate_model(model, dev_dataloader, tokenizer, device)
        dev_f1 = f1_score(dev_true_labels, dev_predictions)
        print(f"Epoch {epoch_i + 1}/{num_epochs} - Dev F1 Score: {dev_f1}")

    return model

# Evaluate the model
def evaluate_model(model, dataloader, tokenizer, device):
    model.eval()
    model.to(device)
    all_predictions, all_true_labels, all_words = [], [], []

    for batch in tqdm(dataloader, desc="Evaluating"):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():
            outputs = model(b_input_ids, attention_mask=b_input_mask)

        batch_predictions = outputs
        label_ids = b_labels.to('cpu').numpy()

        words = [tokenizer.convert_ids_to_tokens(input_id) for input_id in b_input_ids.to('cpu').numpy()]

        # Invert label_map to translate tag ids back to tag names
        id_to_label = {idx: label for label, idx in label_map.items()}
        for i in range(label_ids.shape[0]):
            # The mask covers [CLS] plus the word pieces, so the slice below skips
            # [CLS] and also leaves out the last attended word piece
            input_len = int(b_input_mask[i].sum())
            sentence_predictions = [id_to_label[p] for p in batch_predictions[i][1:input_len-1]]
            sentence_true_labels = [id_to_label[l] for l in label_ids[i][1:input_len-1]]
            sentence_words = words[i][1:input_len-1]

            all_predictions.append(sentence_predictions)
            all_true_labels.append(sentence_true_labels)
            all_words.extend(sentence_words)
    return all_predictions, all_true_labels, all_words

# Save data with predicted labels for comparison
def save_predictions_to_csv(sentence_id, words, true_labels, predictions, file_path):
    df = pd.DataFrame({
        'Sentence_id': sentence_id,
        'Word': words,
        'True_Label': true_labels,
        'Prediction': predictions
    })
    df.to_csv(file_path, index=False)

# Save best model
def save_model(model, path):
    directory = os.path.dirname(path)
    if directory:  # os.makedirs('') raises when the path has no directory part
        os.makedirs(directory, exist_ok=True)
    torch.save(model.state_dict(), path)

# Load the saved model
def load_model(model_class, path, config):
    model = model_class(config)
    model.load_state_dict(torch.load(path))
    return model

def main():
    # Paths to the training data and to the data used for the dev and test sets
    train_file_path = 'processed_kaggle_dataset.csv'
    validation_file_path = 'processed_website_dataset.csv'

    # Load data
    train_set = pd.read_csv(train_file_path)
    validation_data_df = pd.read_csv(validation_file_path)

    # Split the validation dataset into test and dev sets
    unique_sentence_ids = validation_data_df['sentence_id'].unique()
    split_index = len(unique_sentence_ids) // 2
    test_ids, dev_ids = unique_sentence_ids[:split_index], unique_sentence_ids[split_index:]

    test_set = validation_data_df[validation_data_df['sentence_id'].isin(test_ids)]
    dev_set = validation_data_df[validation_data_df['sentence_id'].isin(dev_ids)]

    # Initialize the tokenizers used for experiment
    tokenizers = [AutoTokenizer.from_pretrained('jjzha/jobbert-base-cased'),
                  AutoTokenizer.from_pretrained('bert-large-uncased'),
                  AutoTokenizer.from_pretrained('bert-large-uncased-whole-word-masking'),
                  AutoTokenizer.from_pretrained('bert-base-multilingual-cased')]

    # Device setup
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Training setup
    learning_rates = [1e-3, 1e-4, 1e-5]
    num_epochs = 5
    epsilon = 1e-8

    # Iterate over every tokenizer and every learning rate
    for tokenizer in tokenizers:
        tokenizer_name = tokenizer.name_or_path
        print(f"Tokenizer: {tokenizer_name}")
        train_inputs, train_masks, train_labels = process_dataframe(train_set, tokenizer, label_map)
        class_weights = compute_class_weight('balanced', classes=np.unique(train_labels.numpy()), y=train_labels.numpy().flatten())
        class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)

        test_inputs, test_masks, test_labels = process_dataframe(test_set, tokenizer, label_map)
        dev_inputs, dev_masks, dev_labels = process_dataframe(dev_set, tokenizer, label_map)

        best_f1_score_for_tokenizer = 0
        best_model_info_for_tokenizer = None

        for lr in learning_rates:
            print(f"\nTraining with learning rate: {lr}")

            # Create the DataLoader for the datasets
            train_dataset = TensorDataset(train_inputs, train_masks, train_labels)
            train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=8)

            test_dataset = TensorDataset(test_inputs, test_masks, test_labels)
            test_dataloader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=8)

            dev_dataset = TensorDataset(dev_inputs, dev_masks, dev_labels)
            dev_dataloader = DataLoader(dev_dataset, sampler=SequentialSampler(dev_dataset), batch_size=8)

            # Model initialization
            model = BertForTokenClassificationCRF.from_pretrained(
                tokenizer_name,
                num_labels=len(label_map)
            )
            model.to(device)

            # Optimizer and scheduler setup
            optimizer = AdamW(model.parameters(), lr=lr, eps=epsilon, weight_decay=0.01)
            total_steps = len(train_dataloader) * num_epochs
            scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

            # Train the model
            model = train_model(model, tokenizer, train_dataloader, dev_dataloader, optimizer, scheduler, device, num_epochs, class_weights)

            # Select the learning rate by evaluating the trained model on the development set
            dev_predictions, dev_true_labels,_ = evaluate_model(model, dev_dataloader, tokenizer, device)
            dev_f1 = f1_score(dev_true_labels, dev_predictions)

            # Keep the model with the highest dev F1 score across learning rates
            if dev_f1 > best_f1_score_for_tokenizer:
                best_f1_score_for_tokenizer = dev_f1
                best_model_info_for_tokenizer = {
                    "model": model,
                    "learning_rate": lr,
                    "f1_score": dev_f1
                }

        if best_model_info_for_tokenizer:
            # Save the best model for further use
            model_save_path = f'vi_bert_models_{tokenizer_name}.pth'
            save_model(best_model_info_for_tokenizer["model"], model_save_path)
            print(f"\nBest model for {tokenizer_name} saved: {model_save_path} with F1 score: {best_model_info_for_tokenizer['f1_score']}")

            # Reload the best model
            best_model_loaded = load_model(BertForTokenClassificationCRF, model_save_path, best_model_info_for_tokenizer["model"].config)
            best_model_loaded.to(device)
            # Evaluate best model on the dev set
            dev_predictions, dev_true_labels,_ = evaluate_model(best_model_loaded, dev_dataloader, tokenizer, device)
            dev_report = classification_report(dev_true_labels, dev_predictions)
            print(f"\nDev Set Classification Report for {tokenizer_name}:\n{dev_report}")

            # Evaluate best model on the test set
            test_predictions, test_true_labels, test_words = evaluate_model(best_model_loaded, test_dataloader, tokenizer, device)
            test_report = classification_report(test_true_labels, test_predictions)
            print(f"\nTest Set Classification Report for {tokenizer_name}:\n{test_report}")

            # Save output of evaluation on testset
            gold_labels, predicted_labels, sentence_ids = [], [], []
            for sentence_id, sentence_labels in enumerate(test_true_labels):
                sentence_length = len(sentence_labels)
                gold_labels.extend(sentence_labels)
                predicted_labels.extend(test_predictions[sentence_id])
                sentence_ids.extend([sentence_id] * sentence_length)
            output_csv_path = f'vi_bert_models_{tokenizer_name}.csv'
            save_predictions_to_csv(sentence_ids, test_words, gold_labels, predicted_labels, output_csv_path)
            print(f"Test predictions saved to {output_csv_path}")

if __name__ == "__main__":
    main()



Tokenizer: jjzha/jobbert-base-cased

Training with learning rate: 0.001



Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at jjzha/jobbert-base-cased and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'bert.pooler.dense.weight', 'classifier.bias', 'bert.pooler.dense.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 29.638128128051758


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 44.09it/s]


Epoch 1/5 - Dev F1 Score: 0.0028275212064090478
Epoch 2/5 - Training loss: 25.180097534179687


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.82it/s]


Epoch 2/5 - Dev F1 Score: 0.25070446588677314
Epoch 3/5 - Training loss: 22.75505174255371


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.02it/s]


Epoch 3/5 - Dev F1 Score: 0.0028275212064090478
Epoch 4/5 - Training loss: 21.26473291015625


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.23it/s]


Epoch 4/5 - Dev F1 Score: 0.002828854314002829
Epoch 5/5 - Training loss: 20.553582038879394


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.00it/s]


Epoch 5/5 - Dev F1 Score: 0.002830188679245283


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.23it/s]



Training with learning rate: 0.0001


Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at jjzha/jobbert-base-cased and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'bert.pooler.dense.weight', 'classifier.bias', 'bert.pooler.dense.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 22.581214195251466


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 42.32it/s]


Epoch 1/5 - Dev F1 Score: 0.5971612903225807
Epoch 2/5 - Training loss: 16.56741064834595


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 41.10it/s]


Epoch 2/5 - Dev F1 Score: 0.5129310344827587
Epoch 3/5 - Training loss: 13.142621682167054


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 41.77it/s]


Epoch 3/5 - Dev F1 Score: 0.6426592797783934
Epoch 4/5 - Training loss: 8.102921355247497


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 41.93it/s]


Epoch 4/5 - Dev F1 Score: 0.710843373493976
Epoch 5/5 - Training loss: 4.763851860046387


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 42.38it/s]


Epoch 5/5 - Dev F1 Score: 0.6862425231103861


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.05it/s]



Training with learning rate: 1e-05


Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at jjzha/jobbert-base-cased and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'bert.pooler.dense.weight', 'classifier.bias', 'bert.pooler.dense.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 23.285528594970703


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 41.96it/s]


Epoch 1/5 - Dev F1 Score: 0.4483030781373322
Epoch 2/5 - Training loss: 18.355223770141603


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 42.58it/s]


Epoch 2/5 - Dev F1 Score: 0.5627470355731226
Epoch 3/5 - Training loss: 15.35090172958374


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 42.23it/s]


Epoch 3/5 - Dev F1 Score: 0.5881761006289308
Epoch 4/5 - Training loss: 13.31084612083435


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 42.90it/s]


Epoch 4/5 - Dev F1 Score: 0.6187929717341483
Epoch 5/5 - Training loss: 12.361799564361572


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 41.94it/s]


Epoch 5/5 - Dev F1 Score: 0.61692591616135


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 42.71it/s]



Best model for jjzha/jobbert-base-cased saved: /content/drive/MyDrive/testBA/bert_vi_models/bert_models_jjzha/jobbert-base-cased.pth with F1 score: 0.6862425231103861


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 39.80it/s]



Dev Set Classification Report for jjzha/jobbert-base-cased:
              precision    recall  f1-score   support

       Skill       0.65      0.73      0.69      1727

   micro avg       0.65      0.73      0.69      1727
   macro avg       0.65      0.73      0.69      1727
weighted avg       0.65      0.73      0.69      1727



Evaluating: 100%|██████████| 50/50 [00:01<00:00, 41.47it/s]



Test Set Classification Report for jjzha/jobbert-base-cased:
              precision    recall  f1-score   support

       Skill       0.64      0.71      0.67      2015

   micro avg       0.64      0.71      0.67      2015
   macro avg       0.64      0.71      0.67      2015
weighted avg       0.64      0.71      0.67      2015

Test predictions saved to /content/drive/MyDrive/testBA/bert_vi_models/bert_models_jjzha/jobbert-base-cased.csv
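
The reports above are entity-level: a predicted Skill span counts as correct only when both of its boundaries match the gold span, so the scores can be much lower than token-level accuracy. The following is a simplified sketch of that scoring idea in plain Python. It is not seqeval's implementation, and the helper names `extract_spans` and `span_f1` are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) spans in one BIO-tagged sentence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == kind
        if start is not None and not continues:
            spans.add((start, i - 1, kind))
            start, kind = None, None
        if prefix in ("B", "I") and start is None:
            start, kind = i, label
    return spans


def span_f1(gold_sentences, pred_sentences):
    """Precision, recall and F1 over exactly matching spans."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["O", "B-Skill", "I-Skill", "O", "B-Skill"]]
pred = [["O", "B-Skill", "O", "O", "B-Skill"]]
print(span_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In this made-up example four of five tokens are tagged correctly, yet only one of the two gold spans is recovered exactly, which gives an F1 of 0.5.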
Tokenizer: bert-large-uncased

Training with learning rate: 0.001



Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 29.918480026245117


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.08it/s]


Epoch 1/5 - Dev F1 Score: 0.0024420024420024416
Epoch 2/5 - Training loss: 19.506815826416016


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.59it/s]


Epoch 2/5 - Dev F1 Score: 0.0024420024420024416
Epoch 3/5 - Training loss: 17.69562532043457


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.21it/s]


Epoch 3/5 - Dev F1 Score: 0.0024434941967012825
Epoch 4/5 - Training loss: 16.343211196899414


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.39it/s]


Epoch 4/5 - Dev F1 Score: 0.0024420024420024416
Epoch 5/5 - Training loss: 15.640700828552246


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.27it/s]


Epoch 5/5 - Dev F1 Score: 0.0024420024420024416


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.54it/s]



Training with learning rate: 0.0001


Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 23.697090087890626


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.49it/s]


Epoch 1/5 - Dev F1 Score: 0.026626664166510407
Epoch 2/5 - Training loss: 22.016629806518555


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.47it/s]


Epoch 2/5 - Dev F1 Score: 0.029183598079054306
Epoch 3/5 - Training loss: 21.698614250183105


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.51it/s]


Epoch 3/5 - Dev F1 Score: 0.02812675792237015
Epoch 4/5 - Training loss: 21.488119850158693


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.46it/s]


Epoch 4/5 - Dev F1 Score: 0.0024420024420024416
Epoch 5/5 - Training loss: 21.398471900939942


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.49it/s]


Epoch 5/5 - Dev F1 Score: 0.0024420024420024416


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 16.71it/s]



Training with learning rate: 1e-05


Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 21.939636528015136


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.48it/s]


Epoch 1/5 - Dev F1 Score: 0.33728299807109396
Epoch 2/5 - Training loss: 18.447891227722167


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.39it/s]


Epoch 2/5 - Dev F1 Score: 0.3911269760326364
Epoch 3/5 - Training loss: 15.397011615753174


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.42it/s]


Epoch 3/5 - Dev F1 Score: 0.45801033591731266
Epoch 4/5 - Training loss: 13.965422510147095


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.43it/s]


Epoch 4/5 - Dev F1 Score: 0.48613376835236544
Epoch 5/5 - Training loss: 12.063983253479003


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.54it/s]


Epoch 5/5 - Dev F1 Score: 0.4778927563499529


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 16.70it/s]



Best model for bert-large-uncased saved: /content/drive/MyDrive/testBA/bert_vi_models/bert_models_bert-large-uncased.pth with F1 score: 0.4778927563499529


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.31it/s]



Dev Set Classification Report for bert-large-uncased:
              precision    recall  f1-score   support

       Skill       0.39      0.61      0.48      1243

   micro avg       0.39      0.61      0.48      1243
   macro avg       0.39      0.61      0.48      1243
weighted avg       0.39      0.61      0.48      1243



Evaluating: 100%|██████████| 50/50 [00:02<00:00, 16.74it/s]



Test Set Classification Report for bert-large-uncased:
              precision    recall  f1-score   support

       Skill       0.43      0.60      0.50      1453

   micro avg       0.43      0.60      0.50      1453
   macro avg       0.43      0.60      0.50      1453
weighted avg       0.43      0.60      0.50      1453

Test predictions saved to /content/drive/MyDrive/testBA/bert_vi_models/bert_models_bert-large-uncased.csv
Tokenizer: bert-large-uncased-whole-word-masking

Training with learning rate: 0.001



Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-large-uncased-whole-word-masking and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 29.58187399291992


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.73it/s]


Epoch 1/5 - Dev F1 Score: 0.20790565625682464
Epoch 2/5 - Training loss: 19.860742904663088


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 16.83it/s]


Epoch 2/5 - Dev F1 Score: 0.0024420024420024416
Epoch 3/5 - Training loss: 17.674963439941408


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.21it/s]


Epoch 3/5 - Dev F1 Score: 0.20790565625682464
Epoch 4/5 - Training loss: 16.6103698425293


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.05it/s]


Epoch 4/5 - Dev F1 Score: 0.0024434941967012825
Epoch 5/5 - Training loss: 15.778223793029785


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.16it/s]


Epoch 5/5 - Dev F1 Score: 0.0024434941967012825


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 17.31it/s]



Training with learning rate: 0.0001


Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-large-uncased-whole-word-masking and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 23.60461707305908


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.50it/s]


Epoch 1/5 - Dev F1 Score: 0.0024420024420024416
Epoch 2/5 - Training loss: 22.521517791748046


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.57it/s]


Epoch 2/5 - Dev F1 Score: 0.20790565625682464
Epoch 3/5 - Training loss: 22.094135078430178


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.52it/s]


Epoch 3/5 - Dev F1 Score: 0.0024420024420024416
Epoch 4/5 - Training loss: 21.831853790283205


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.48it/s]


Epoch 4/5 - Dev F1 Score: 0.0024420024420024416
Epoch 5/5 - Training loss: 21.718242698669435


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.51it/s]


Epoch 5/5 - Dev F1 Score: 0.0024420024420024416


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 16.70it/s]



Training with learning rate: 1e-05


Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-large-uncased-whole-word-masking and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 18.46255086517334


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.46it/s]


Epoch 1/5 - Dev F1 Score: 0.5378531073446328
Epoch 2/5 - Training loss: 13.879602458953858


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.52it/s]


Epoch 2/5 - Dev F1 Score: 0.5813077713111947
Epoch 3/5 - Training loss: 10.159353598594665


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.42it/s]


Epoch 3/5 - Dev F1 Score: 0.580949175361831
Epoch 4/5 - Training loss: 8.226711378097534


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.42it/s]


Epoch 4/5 - Dev F1 Score: 0.6363636363636362
Epoch 5/5 - Training loss: 6.652262309074402


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.45it/s]


Epoch 5/5 - Dev F1 Score: 0.6325214899713467


Evaluating: 100%|██████████| 50/50 [00:02<00:00, 16.70it/s]



Best model for bert-large-uncased-whole-word-masking saved: /content/drive/MyDrive/testBA/bert_vi_models/bert_models_bert-large-uncased-whole-word-masking.pth with F1 score: 0.6325214899713467


Evaluating: 100%|██████████| 50/50 [00:03<00:00, 16.30it/s]



Dev Set Classification Report for bert-large-uncased-whole-word-masking:
              precision    recall  f1-score   support

       Skill       0.57      0.71      0.63      1243

   micro avg       0.57      0.71      0.63      1243
   macro avg       0.57      0.71      0.63      1243
weighted avg       0.57      0.71      0.63      1243



Evaluating: 100%|██████████| 50/50 [00:02<00:00, 16.70it/s]



Test Set Classification Report for bert-large-uncased-whole-word-masking:
              precision    recall  f1-score   support

       Skill       0.55      0.68      0.61      1453

   micro avg       0.55      0.68      0.61      1453
   macro avg       0.55      0.68      0.61      1453
weighted avg       0.55      0.68      0.61      1453

Test predictions saved to /content/drive/MyDrive/testBA/bert_vi_models/bert_models_bert-large-uncased-whole-word-masking.csv
Tokenizer: bert-base-multilingual-cased

Training with learning rate: 0.001



Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 18.05266269683838


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 44.75it/s]


Epoch 1/5 - Dev F1 Score: 0.003177124702144559
Epoch 2/5 - Training loss: 15.248228744506836


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.79it/s]


Epoch 2/5 - Dev F1 Score: 0.003177124702144559
Epoch 3/5 - Training loss: 13.988394603729247


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 44.59it/s]


Epoch 3/5 - Dev F1 Score: 0.003177124702144559
Epoch 4/5 - Training loss: 13.199251571655273


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 44.27it/s]


Epoch 4/5 - Dev F1 Score: 0.003177124702144559
Epoch 5/5 - Training loss: 12.828900588989258


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.88it/s]


Epoch 5/5 - Dev F1 Score: 0.003177124702144559


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 44.48it/s]



Training with learning rate: 0.0001


Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 14.094678161621093


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.76it/s]


Epoch 1/5 - Dev F1 Score: 0.4613745338305807
Epoch 2/5 - Training loss: 10.176070594787598


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.76it/s]


Epoch 2/5 - Dev F1 Score: 0.5458823529411766
Epoch 3/5 - Training loss: 8.153024441719054


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 44.23it/s]


Epoch 3/5 - Dev F1 Score: 0.5626710454296661
Epoch 4/5 - Training loss: 5.434909428596496


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 44.24it/s]


Epoch 4/5 - Dev F1 Score: 0.6053215077605322
Epoch 5/5 - Training loss: 3.2367257113456724


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.45it/s]


Epoch 5/5 - Dev F1 Score: 0.6240730176839703


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.79it/s]



Training with learning rate: 1e-05


Some weights of BertForTokenClassificationCRF were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['crf.end_transitions', 'crf.transitions', 'crf.start_transitions', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5 - Training loss: 13.667275535583496


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.20it/s]


Epoch 1/5 - Dev F1 Score: 0.5014749262536873
Epoch 2/5 - Training loss: 8.70182692527771


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.13it/s]


Epoch 2/5 - Dev F1 Score: 0.5661846496106785
Epoch 3/5 - Training loss: 6.9366288118362425


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.81it/s]


Epoch 3/5 - Dev F1 Score: 0.5774155995343422
Epoch 4/5 - Training loss: 6.030453784942627


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.56it/s]


Epoch 4/5 - Dev F1 Score: 0.5870069605568445
Epoch 5/5 - Training loss: 4.981920777320862


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.85it/s]


Epoch 5/5 - Dev F1 Score: 0.5811669555170421


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 43.42it/s]



Best model for bert-base-multilingual-cased saved: /content/drive/MyDrive/testBA/bert_vi_models/bert_models_bert-base-multilingual-cased.pth with F1 score: 0.6240730176839703


Evaluating: 100%|██████████| 50/50 [00:01<00:00, 41.87it/s]



Dev Set Classification Report for bert-base-multilingual-cased:
              precision    recall  f1-score   support

       Skill       0.62      0.63      0.62       864

   micro avg       0.62      0.63      0.62       864
   macro avg       0.62      0.63      0.62       864
weighted avg       0.62      0.63      0.62       864



Evaluating: 100%|██████████| 50/50 [00:01<00:00, 44.18it/s]



Test Set Classification Report for bert-base-multilingual-cased:
              precision    recall  f1-score   support

       Skill       0.63      0.68      0.65      1070

   micro avg       0.63      0.68      0.65      1070
   macro avg       0.63      0.68      0.65      1070
weighted avg       0.63      0.68      0.65      1070

Test predictions saved to /content/drive/MyDrive/testBA/bert_vi_models/bert_models_bert-base-multilingual-cased.csv
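
**Note on how the best model appears to be selected**

The training code is not visible in this output, so what follows is an inference from the log, not a description of the actual implementation. For bert-base-multilingual-cased the saved F1 score (0.6240730176839703) equals the epoch-5 dev score of the 0.0001 run. For bert-large-uncased-whole-word-masking the saved score (0.6325214899713467) equals the epoch-5 dev score, not the higher epoch-4 score (0.6363636363636362). Both facts are consistent with picking the learning rate by its final-epoch dev F1. The sketch below replays the logged bert-base-multilingual-cased scores under that assumption and compares it with picking the best epoch overall; the function names `select_by_final_epoch` and `select_by_best_epoch` are hypothetical.

```python
# Dev F1 per epoch, copied from the log above for bert-base-multilingual-cased.
dev_f1_by_lr = {
    1e-3: [0.003177124702144559] * 5,
    1e-4: [0.4613745338305807, 0.5458823529411766, 0.5626710454296661,
           0.6053215077605322, 0.6240730176839703],
    1e-5: [0.5014749262536873, 0.5661846496106785, 0.5774155995343422,
           0.5870069605568445, 0.5811669555170421],
}


def select_by_final_epoch(history):
    """Return (lr, f1) for the run with the highest last-epoch dev F1.

    Assumption: this mirrors what the notebook seems to do, judging by the log.
    """
    return max(((lr, scores[-1]) for lr, scores in history.items()),
               key=lambda pair: pair[1])


def select_by_best_epoch(history):
    """Return (lr, epoch, f1) for the highest dev F1 at any epoch (1-indexed).

    Hypothetical alternative that would require checkpointing every epoch.
    """
    candidates = [(lr, epoch, f1)
                  for lr, scores in history.items()
                  for epoch, f1 in enumerate(scores, start=1)]
    return max(candidates, key=lambda item: item[2])


final_lr, final_f1 = select_by_final_epoch(dev_f1_by_lr)
best_lr, best_epoch, best_f1 = select_by_best_epoch(dev_f1_by_lr)

print(f"Final-epoch selection: lr={final_lr}, dev F1={final_f1}")
print(f"Best-epoch selection:  lr={best_lr}, epoch={best_epoch}, dev F1={best_f1}")
```

For this model both strategies agree on the 0.0001 run at epoch 5. They would differ for a run such as the 1e-05 one, where epoch 4 (0.5870069605568445) scores higher than epoch 5 (0.5811669555170421).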
