# Fine-tune BERT-based models from Hugging Face on POS-tagging for English and Norwegian

This notebook will guide you through Part 2 of [CS 2731 Homework 4](https://michaelmilleryoder.github.io/cs2731_fall2024/hw4).

Please copy this notebook and name it `{pitt email id}_hw4_bert_pos.ipynb`.

Code for loading and preprocessing the data is provided. You will provide code for training and evaluation using Hugging Face Trainer or PyTorch.

Run all the cells starting from the top, filling in code where needed. The cells you need to complete are marked with `FILL IN`.

You will want to duplicate cells in each section for each language (English or Norwegian) or create separate sections in the notebook for separate languages.

**Note**: Please run this notebook on a GPU server on the CRCD.

The Hugging Face tutorials below are informative. You can use code from them and adapt it to this use case.
* [Token classification (sequence labeling) with Hugging Face](https://huggingface.co/docs/transformers/en/tasks/token_classification)
* [Hugging Face `Trainer` class tutorial](https://huggingface.co/docs/transformers/en/training#train)

# Load data

Here you will be loading the training, dev, and test datasets of English and Norwegian text annotated with POS tags. The data are from the [Universal Dependencies](https://universaldependencies.org/) project.

We will be using the universal part-of-speech tags in the `upos` column.

Note: Norwegian has two written forms, [Bokmål and Nynorsk](https://en.wikipedia.org/wiki/Norwegian_language). This data is in the Bokmål written form.

Here is a link to learn more about the data:
* [Universal Dependencies data format](https://universaldependencies.org/format.html)

In [None]:
# Load datasets
from datasets import load_from_disk

en_data = load_from_disk('data/en_ewt_hf')
print(en_data)
no_data = load_from_disk('data/no_bokmaal_hf')
print(no_data)

In [None]:
# Take a look at the part of speech tags

tags = en_data['train'].features['upos_labels'].feature.names
tags

# Count and explore POS tags in this dataset
For both the English and Norwegian datasets, calculate the following and include the results in your report:

* The 5 most frequent POS tags and how many tokens are tagged with each
* For each of the 5 most frequent POS tags, find the 5 most frequent word types annotated with that tag in the training data

Be sure to report the actual POS tag names in your report, not the IDs.
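
The counts described above can be sketched with `collections.Counter`. This is a minimal sketch on toy data, not the required solution: the `tokens` column name, the toy sentences, and the toy tag list are all assumptions, so check `print(en_data)` for the real column names and use the `tags` list from the cell above.

```python
from collections import Counter, defaultdict

# Toy stand-ins (assumptions): in the notebook these would be
# en_data['train'] and the `tags` list; the `tokens` column name is assumed.
toy_tags = ["NOUN", "VERB", "DET", "PUNCT"]
toy_train = [
    {"tokens": ["The", "dog", "barks", "."], "upos_labels": [2, 0, 1, 3]},
    {"tokens": ["The", "cat", "sleeps", "."], "upos_labels": [2, 0, 1, 3]},
    {"tokens": ["A", "dog", "sleeps"], "upos_labels": [2, 0, 1]},
]

def count_tags_and_words(examples, tag_names):
    """Return (tag counts, per-tag word-type counts), keyed by tag name."""
    tag_counts = Counter()
    words_by_tag = defaultdict(Counter)
    for example in examples:
        for token, tag_id in zip(example["tokens"], example["upos_labels"]):
            tag = tag_names[tag_id]  # map the integer ID to its tag name
            tag_counts[tag] += 1
            words_by_tag[tag][token] += 1
    return tag_counts, words_by_tag

tag_counts, words_by_tag = count_tags_and_words(toy_train, toy_tags)
for tag, count in tag_counts.most_common(5):
    print(tag, count, words_by_tag[tag].most_common(5))
```

Whether to lowercase tokens before counting word types is a design choice; if you do, say so in your report.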

# Tokenization
Fill in code in this section to prepare the input with subword tokenization for BERT. You can follow the process in the [Hugging Face token classification guide](https://huggingface.co/docs/transformers/en/tasks/token_classification).

Here is also where you will decide on which BERT-based pre-trained model you will fine-tune, since you will need to match its tokenization.
Feel free to search Hugging Face for BERT variants or to use recommended ones in Hugging Face documentation. For Norwegian, you'll want a pretrained BERT model that can handle Norwegian (in Bokmål written form).

In [None]:
from transformers import AutoTokenizer

# FILL IN with the name of a BERT-based pretrained model from Hugging Face
pretrained_model =
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)

Subword tokenization will add special tokens such as `[CLS]` which we want the classifier to ignore.

It also splits some words into multiple tokens. We'll have to re-align those to assign just one part-of-speech tag to each word.

Fill in code here to do this alignment, as well as prepare a tokenized version of the dataset. You may adapt code from the [Hugging Face token classification guide](https://huggingface.co/docs/transformers/en/tasks/token_classification).
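
The core of the alignment can be sketched without a tokenizer. The function below assumes you already have the list returned by a fast tokenizer's `word_ids()` (available when tokenizing with `is_split_into_words=True`), where special tokens map to `None`. It follows the common convention of labeling only the first subword of each word and masking the rest with `-100`, the value the evaluation code in this notebook filters out. The function name and toy values are hypothetical.

```python
def align_labels_with_tokens(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)  # special token such as [CLS] or [SEP]
        elif word_id != previous_word_id:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(ignore_index)  # continuation subword of the same word
        previous_word_id = word_id
    return aligned

# Toy example (assumed values): three words, the second split into two subwords.
toy_word_ids = [None, 0, 1, 1, 2, None]
toy_word_labels = [5, 11, 12]
print(align_labels_with_tokens(toy_word_ids, toy_word_labels))
# → [-100, 5, 11, -100, 12, -100]
```

In the notebook, a function like this would typically be applied to every example inside a batched `map` over the dataset, as the Hugging Face guide does.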

In [None]:
# FILL IN

# Prepare evaluation

Evaluation code is provided here.

Source: [Hugging Face token classification guide](https://huggingface.co/docs/transformers/en/tasks/token_classification)

In [None]:
import evaluate
seqeval = evaluate.load('seqeval')

In [None]:
import numpy as np

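# Tag names, indexed by label ID (swap in `no_data` when working on Norwegian)
label_list = en_data['train'].features['upos_labels'].feature.names
# Example: the gold tags of the first training sentence, converted to tag names
labels = en_data['train'][0]['upos_labels']
labels = [label_list[i] for i in labels]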

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# Train (fine-tune) the model

Fill in code here to load your pretrained model and do fine-tuning using the `Trainer` class or PyTorch.
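
When loading a model for token classification, the classification head needs to know how many tags there are, and it helps to store the mapping between label IDs and tag names in the model config. Below is a minimal sketch of building those mappings from a tag list; the toy list is an assumption standing in for `label_list`. In the Hugging Face token classification guide, values like these are passed to `AutoModelForTokenClassification.from_pretrained` as `num_labels`, `id2label`, and `label2id`.

```python
# Toy tag list (assumption): in the notebook, use `label_list` instead.
toy_label_list = ["NOUN", "PUNCT", "VERB"]

# Map label IDs to tag names and back.
id2label = {i: name for i, name in enumerate(toy_label_list)}
label2id = {name: i for i, name in id2label.items()}
num_labels = len(toy_label_list)

print(id2label, label2id, num_labels)
```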

In [None]:
# FILL IN

# Test performance

Fill in code here to evaluate your fine-tuned model's performance on the test set of the tokenized dataset.

You will be reporting accuracy in your report.
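
As a sanity check on the reported number, token-level accuracy can be computed by hand. The sketch below is an assumption about what you might want to verify, not part of the provided evaluation code: it counts matches only at positions whose label is not `-100`, mirroring the filtering in `compute_metrics`. The function name and toy values are hypothetical.

```python
def token_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of non-masked positions where the prediction matches the label."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(predictions, labels):
        for pred, label in zip(pred_seq, label_seq):
            if label == ignore_index:
                continue  # skip special tokens and continuation subwords
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0

# Toy predicted and gold label IDs for two sentences (assumed values).
toy_predictions = [[0, 5, 11, 3, 12, 0], [0, 7, 7, 0]]
toy_labels = [[-100, 5, 11, -100, 12, -100], [-100, 7, 2, -100]]
print(token_accuracy(toy_predictions, toy_labels))  # → 0.8
```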

In [None]:
# FILL IN

# Run on an example sentence

Fill in code here to run your classifier on two example sentences for each of the English and Norwegian models:
1. The provided example sentence
2. An example sentence of your choice (feel free to use a translation service like Google Translate if you don't know Norwegian)

You will likely have to load these models from checkpoints created during training.

Provide the predicted tags for example sentences in your report.
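
Because the model predicts one tag per subword token, the predictions need to be mapped back to words before reporting. The sketch below assumes you have the words of the sentence, the tokenizer's `word_ids()` list, and the per-token predicted label IDs (for example, the argmax over the model's logits). It keeps the prediction for the first subword of each word, matching the first-subword labeling convention. All names and toy values here are hypothetical.

```python
def word_level_tags(words, word_ids, token_pred_ids, tag_names):
    """Pair each word with the tag predicted for its first subword."""
    tagged = []
    seen = set()
    for word_id, pred_id in zip(word_ids, token_pred_ids):
        if word_id is None or word_id in seen:
            continue  # skip special tokens and continuation subwords
        seen.add(word_id)
        tagged.append((words[word_id], tag_names[pred_id]))
    return tagged

# Toy values (assumptions): "Hello" is split into two subwords.
toy_tag_names = ["INTJ", "PUNCT", "PRON", "AUX"]
toy_words = ["Hello", ",", "I", "am"]
toy_word_ids = [None, 0, 0, 1, 2, 3, None]
toy_token_pred_ids = [1, 0, 2, 1, 2, 3, 1]
print(word_level_tags(toy_words, toy_word_ids, toy_token_pred_ids, toy_tag_names))
# → [('Hello', 'INTJ'), (',', 'PUNCT'), ('I', 'PRON'), ('am', 'AUX')]
```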

In [None]:
# FILL IN
en_example = "Hello, I am a student at the University of Pittsburgh."
no_example = "Hallo, jeg er student ved University of Pittsburgh."