# Multilingual POS Tagging with Transformers

This notebook fine-tunes three transformer models for Part-of-Speech (POS) tagging on the Universal Dependencies Urdu dataset and compares their results. The workflow covers data loading, preprocessing, model training, and evaluation.

---

## Workflow Overview

1. **Imports**  
    Essential libraries for data handling, model training, and evaluation are imported, including HuggingFace Transformers, Datasets, and scikit-learn metrics.

2. **Dataset Preparation**  
    - The Universal Dependencies Urdu dataset is loaded.
    - Data splits for training, validation, and testing are created.
    - POS tag labels are extracted for use in model training.

3. **Metric Computation**  
    A custom function computes accuracy and macro-averaged precision, recall, and F1 at the token level, skipping every position labeled `-100` (special and padding tokens).

4. **Model Training and Evaluation**  
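    - Three pretrained checkpoints are fine-tuned and evaluated: XLM-RoBERTa (`FacebookAI/xlm-roberta-base`), Arabic BERT (`asafaya/bert-base-arabic`), and Urdu BERT (`mirfan899/urdu-bert-ner`).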
    - For each model:
      - The tokenizer is loaded and used to tokenize and align labels with tokens.
      - Data is preprocessed and mapped for model input.
      - The model is trained using HuggingFace's Trainer API.
      - Evaluation metrics are computed and displayed for each model.
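
The metric step can be illustrated without any model. The following is a minimal sketch in plain Python, not the notebook's scikit-learn implementation: `flatten_valid` and `macro_f1` are hypothetical helper names and the prediction and label ids are made up. It drops positions labeled `-100`, then computes accuracy and macro-averaged F1 on what remains.

```python
from collections import Counter

IGNORE_INDEX = -100  # label value skipped by the loss and by the metrics


def flatten_valid(pred_ids, label_ids, ignore_index=IGNORE_INDEX):
    """Keep only (prediction, label) pairs whose label is not the ignore index."""
    preds, labels = [], []
    for pred_seq, label_seq in zip(pred_ids, label_ids):
        for pred, label in zip(pred_seq, label_seq):
            if label != ignore_index:
                preds.append(pred)
                labels.append(label)
    return preds, labels


def macro_f1(preds, labels):
    """Unweighted mean of per-class F1 over every class seen in preds or labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, label in zip(preds, labels):
        if pred == label:
            tp[label] += 1
        else:
            fp[pred] += 1
            fn[label] += 1
    scores = []
    for cls in set(preds) | set(labels):
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        scores.append(2 * tp[cls] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up ids for two short sequences; -100 marks special tokens.
pred_ids = [[0, 1, 1, 2], [2, 2, 1, 0]]
label_ids = [[-100, 1, 1, -100], [-100, 2, 0, -100]]

preds, labels = flatten_valid(pred_ids, label_ids)
accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
f1 = macro_f1(preds, labels)
print(accuracy, f1)
```

In this toy case one of four scored tokens is wrong, so accuracy is 0.75, while macro-F1 is lower because the single error wipes out the score of a class that occurs only once.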

---

## Key Points

- **Tokenization and Label Alignment:**  
  Word-level labels are mapped onto subword tokens: every subword inherits the POS tag of the word it came from, while special and padding tokens receive `-100` so that they are ignored during loss computation.

- **Evaluation:**  
  Precision, recall, and F1 are macro-averaged, so every POS tag counts equally regardless of how often it occurs; accuracy is reported alongside them.

- **Reproducibility:**  
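  All models share the same training arguments, and the checkpoint with the best validation macro-F1 is reloaded before the test evaluation. No seed is set explicitly, so the `TrainingArguments` default applies.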
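
The alignment rule can be sketched in isolation. The example below is a hedged illustration rather than the notebook's exact function: `align_labels` is a hypothetical helper, the `word_ids` list stands in for what a fast tokenizer's `word_ids()` would return, and the label ids are made up. By default it labels every subword with its word's tag, as the notebook does; the `label_all_subwords=False` branch shows a common alternative that scores only the first subword of each word.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels, label_all_subwords=True):
    """Map word-level labels onto subword positions.

    word_ids: one entry per subword; None for special or padding tokens,
              otherwise the index of the source word.
    word_labels: one label id per original word.
    """
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special or padding token: never scored
            aligned.append(IGNORE_INDEX)
        elif label_all_subwords or word_idx != previous:
            # Every subword (default) or only the first subword of a word
            aligned.append(word_labels[word_idx])
        else:
            # Later subword of the same word, masked in first-subword mode
            aligned.append(IGNORE_INDEX)
        previous = word_idx
    return aligned


# Hypothetical three-word sentence; the second word splits into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [7, 0, 11]  # made-up label ids

all_subwords = align_labels(word_ids, word_labels)
first_only = align_labels(word_ids, word_labels, label_all_subwords=False)
print(all_subwords)
print(first_only)
```

Labeling every subword means long words contribute several tokens to the token-level metrics, whereas the first-subword variant scores each word exactly once.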

---

## References

- [HuggingFace Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [Universal Dependencies Project](https://universaldependencies.org/)
- [scikit-learn Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)

---

*This notebook provides a template for multilingual sequence labeling tasks using state-of-the-art transformer models.*

In [3]:
# Documentation for Imports
"""
Libraries used throughout the notebook:
- `datasets` for loading the Universal Dependencies data.
- `transformers` for the tokenizer, token-classification model, data collator, and Trainer.
- `sklearn.metrics` for token-level accuracy, precision, recall, and F1-score.
"""

from datasets import load_dataset, Sequence
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score
)


In [4]:
# Documentation for Data Preparation and Metric Computation
"""
This section prepares the Universal Dependencies Urdu dataset for POS tagging.
- Defines the checkpoint names of the three models to compare (XLM-RoBERTa, Arabic BERT, Urdu BERT).
- Loads the dataset and collects the train/validation/test splits.
- Extracts the list of UPOS tag labels and their count.
- Defines `compute_metrics`, which scores predictions with accuracy and macro-averaged precision, recall, and F1-score at the token level, skipping positions labeled -100 (special and padding tokens).
"""



GENERAL_BERT = "FacebookAI/xlm-roberta-base"
ARABIC_BERT = "asafaya/bert-base-arabic"
URDU_BERT = 'mirfan899/urdu-bert-ner'
UR_LANGUAGE = 'ur_udtb'

# 1) Load UD Urdu and get splits
ud = load_dataset("universal_dependencies", UR_LANGUAGE)

splits = {
    "train": ud["train"],
    "validation": ud.get("validation", ud.get("dev")),
    "test": ud["test"],
}

# 2) Extract labels
features = splits["train"].features
label_feature: Sequence = features["upos"]
label_list = label_feature.feature.names  # list of string labels
num_labels = len(label_list)


# 3) Compute metrics without seqeval (token-level classification)
def compute_metrics(p):
    """Score predictions on every position whose label is not -100."""
    preds, labels = p
    pred_ids = preds.argmax(-1)
    # The boolean mask drops special and padding positions and flattens the rest
    mask = labels != -100
    true_labels = labels[mask]
    true_preds = pred_ids[mask]
    accuracy = accuracy_score(true_labels, true_preds)
    precision = precision_score(true_labels, true_preds, average='macro', zero_division=0)
    recall = recall_score(true_labels, true_preds, average='macro', zero_division=0)
    f1 = f1_score(true_labels, true_preds, average='macro', zero_division=0)
    return {
        "accuracy": accuracy,
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
    }


In [5]:
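# 4) Fine-tune and evaluate one checkpoint

def finetune_bert_model(model_name):
    """Fine-tune `model_name` for POS tagging on UD Urdu and print its test metrics."""
    # Give each model its own output and log directories so that one run's
    # checkpoints are not overwritten by the next.
    run_name = model_name.split("/")[-1]
    training_args = TrainingArguments(
        output_dir=f"./pos-urdu-{run_name}",
        eval_strategy="epoch",
        save_strategy="epoch",
        learning_rate=3e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=128,
        num_train_epochs=5,
        logging_dir=f"./logs/{run_name}",
        save_total_limit=2,
        metric_for_best_model="f1_macro",
        load_best_model_at_end=True,
    )

    print(f"Training with model: {model_name}")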
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    
    def tokenize_and_align(examples):
        # No padding here: DataCollatorForTokenClassification pads each batch
        # (inputs and labels) to its longest sequence at training time.
        tokenized = tokenizer(
            examples["tokens"],
            is_split_into_words=True,
            truncation=True,
        )
        aligned_labels = []
        for i, orig_labels in enumerate(examples["upos"]):
            word_ids = tokenized.word_ids(batch_index=i)
            # Special tokens have no source word and get -100; every subword
            # inherits the label of the word it came from.
            label_ids = [
                -100 if word_idx is None else orig_labels[word_idx]
                for word_idx in word_ids
            ]
            aligned_labels.append(label_ids)
        tokenized["labels"] = aligned_labels
        return tokenized
    
    tokenized_splits = {
        split: ds.map(
            tokenize_and_align,
            batched=True,
            remove_columns=ds.column_names,
        )
        for split, ds in splits.items()
    }

    # 5) Data collator and model

    data_collator = DataCollatorForTokenClassification(tokenizer)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        id2label={i: lbl for i, lbl in enumerate(label_list)},
        label2id={lbl: i for i, lbl in enumerate(label_list)},
        ignore_mismatched_sizes=True
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_splits["train"],
        eval_dataset=tokenized_splits["validation"],
        tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    print(f"Evaluation results for {model_name}:")
    print(trainer.evaluate(tokenized_splits["test"]))
    print("\n" + "="*50 + "\n")

In [None]:
model_names = [
    GENERAL_BERT,
    ARABIC_BERT,
    URDU_BERT
]
for model_name in model_names:
    finetune_bert_model(model_name)
