## üìå Project Overview

This project focuses on building a Named Entity Recognition (NER) system for Persian (Farsi) text 

by fine-tuning a transformer-based model on the WikiAnn-fa dataset.

The goal is to automatically identify and extract named entities such as persons (PER), locations 

(LOC), and organizations (ORG) from Persian text.

The model is trained using Hugging Face Transformers and PyTorch, and the final system supports 

both offline inference and deployment as a RESTful API using FastAPI.

This cell installs the required Python libraries for training and evaluating the NER model:
- `datasets` for loading and processing the WikiAnn-fa dataset
- `transformers` for fine-tuning a pretrained Transformer model
- `evaluate` and `seqeval` for computing NER sequence-labeling metrics (e.g., precision, recall, F1)


In [1]:
! pip install evaluate seqeval datasets transformers




[notice] A new release of pip available: 22.3 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


This cell imports the core libraries required for the project, including PyTorch for model training, Hugging Face Transformers for token classification (NER), Datasets for loading the WikiAnn-fa dataset, and evaluation utilities for computing NER metrics. It also includes supporting libraries for numerical operations and configuration handling.


In [2]:
import numpy as np
import pandas as pd
import torch
from torch import nn
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification
import evaluate
import json
import warnings
warnings.filterwarnings('ignore')

  from .autonotebook import tqdm as notebook_tqdm


This cell loads the Persian (Farsi) Named Entity Recognition dataset from WikiAnn using the Hugging Face Datasets library. The dataset provides token-level annotations for entities such as persons, locations, and organizations, and is used for training, validation, and evaluation of the NER model.


In [3]:
dataset = load_dataset('wikiann', 'fa')
print(dataset)
print(dataset['train'][0])

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 20000
    })
})
{'tokens': ['ÿ™ÿ∫€å€åÿ±ŸÖÿ≥€åÿ±', 'ŸÖŸáÿ™ÿ±', '(', 'ÿÆÿ±ŸÖ\u200cÿ¢ÿ®ÿßÿØ', ')'], 'ner_tags': [0, 5, 6, 6, 6], 'langs': ['fa', 'fa', 'fa', 'fa', 'fa'], 'spans': ['LOC: ŸÖŸáÿ™ÿ± ( ÿÆÿ±ŸÖ\u200cÿ¢ÿ®ÿßÿØ )']}


This cell initializes the tokenizer from the pretrained ParsBERT model. The tokenizer is responsible for converting Persian text into token IDs and subword representations that can be processed by the transformer-based NER model.


In [4]:
model_checkpoint = "HooshvareLab/bert-base-parsbert-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

This cell defines a preprocessing function that tokenizes the input tokens and aligns the original NER labels with the subword tokens produced by the tokenizer. Special tokens and non-initial subword pieces are assigned a label of `-100` so they are ignored during loss computation.


In [5]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


This cell applies the tokenization and label-alignment function to the entire dataset. It processes the data in batches and removes the original columns, producing tokenized datasets that are ready to be used for training and evaluation.


In [6]:
tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Map: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 10000/10000 [00:00<00:00, 13397.07 examples/s]


This cell inspects the first sample from the tokenized training dataset to verify that the input tokens, attention masks, and aligned NER labels have been processed correctly.


In [7]:
tokenized_datasets['train'][0]

{'input_ids': [2, 2671, 85815, 61044, 9, 19530, 10, 4],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 0, -100, 5, 6, 6, 6, -100]}

This cell extracts the list of NER label names from the training dataset and computes the total number of unique labels. These labels are later used to configure the token classification model and to correctly interpret the model‚Äôs predictions.


In [8]:
label_list = dataset["train"].features["ner_tags"].feature.names
num_labels = len(label_list)

print("Labels:", label_list)
print("Num labels:", num_labels)

Labels: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
Num labels: 7


This cell initializes a transformer-based token classification model using the pretrained ParsBERT checkpoint. It configures the model with the correct number of NER labels and defines mappings between label IDs and label names, enabling proper training and human-readable predictions.


In [9]:
model = AutoModelForTokenClassification.from_pretrained(
    pretrained_model_name_or_path=model_checkpoint,
    num_labels = num_labels,
    id2label = {i:l for i, l in enumerate(label_list)},
    label2id = {l:i for i, l in enumerate(label_list)}
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at HooshvareLab/bert-base-parsbert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
model

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(100000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-1

This cell creates a data collator specifically designed for token classification tasks. It dynamically pads input sequences and their corresponding labels to the same length within each batch, ensuring efficient and correct batching during training.


In [11]:
data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer,
    padding=True)


In [12]:
device = 'cuda' if torch.cuda.is_available else 'mps' if torch.mps.is_available else 'cpu'
print(device)

cuda


This cell defines the training configuration using Hugging Face `TrainingArguments`. It specifies key hyperparameters such as learning rate, batch size, number of epochs, evaluation and checkpointing strategy, and enables mixed-precision (FP16) training to improve performance and reduce memory usage.


In [15]:
args = TrainingArguments(
    output_dir="./ner-wikiann-fa",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps=50,
    save_total_limit=2,
    report_to="none",
    fp16 = True
)

In [16]:
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

This cell defines the evaluation metric for the NER task using `seqeval`. It converts model outputs into label predictions, filters out ignored tokens (`-100`), and computes standard NER metrics including precision, recall, F1-score, and accuracy.


In [17]:
metric = evaluate.load("seqeval")

def compute_metrics(p):

    predictions, labels = p
    predictions = predictions.argmax(axis=-1)

    true_predictions = [[label_list[p] for (p, l) in zip(pred, lab) if l != -100] for (pred , lab) in zip(predictions, labels)]

    true_labels = [[label_list[l] for (p, l) in zip(pred, lab) if l != -100] for (pred, lab) in zip(predictions, labels)]


    results = metric.compute(
        predictions = true_predictions,
        references = true_labels
    )

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

This cell initializes the Hugging Face `Trainer` with the model, training arguments, datasets, tokenizer, and evaluation function. It then starts the fine-tuning process on the Persian NER dataset, performing training and evaluation at the end of each epoch.


In [None]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()


This cell evaluates the fine-tuned NER model on the validation dataset and reports performance metrics such as precision, recall, F1-score, and accuracy.


In [None]:
trainer.evaluate()


This cell evaluates the trained NER model on different dataset splits. It reports performance metrics for the training set, validation set, and test set, allowing a comprehensive comparison of the model‚Äôs generalization performance.


In [None]:
train_evaluate = trainer.evaluate()
validation_evaluate = trainer.evaluate(eval_dataset=tokenized_datasets['validation'])
test_evaluate = trainer.evaluate(eval_dataset=tokenized_datasets['test'])

This cell aggregates the evaluation results from the training, validation, and test sets into a single dictionary and saves them as a CSV file. This allows the model‚Äôs performance metrics to be easily reviewed, compared, and reused for reporting or documentation.


In [None]:
all_evaluate = {
    'train': train_evaluate,
    'validation': validation_evaluate,
    'test': test_evaluate
}
pd.DataFrame(all_evaluate).to_csv('../data/results.csv')

In [20]:
pd.read_csv('../data/results.csv')

Unnamed: 0.1,Unnamed: 0,train,validation,test
0,eval_loss,0.164885,0.164885,0.17935
1,eval_precision,0.935195,0.935195,0.938316
2,eval_recall,0.942941,0.942941,0.94312
3,eval_f1,0.939052,0.939052,0.940712
4,eval_accuracy,0.972945,0.972945,0.972046
5,eval_runtime,12.0197,11.6282,11.3411
6,eval_samples_per_second,831.967,859.981,881.751
7,eval_steps_per_second,51.998,53.749,55.109
8,epoch,5.0,5.0,5.0


This cell saves the fine-tuned NER model, tokenizer, and label mappings to disk for later inference or deployment. It also packages the saved files into a ZIP archive and uploads them to Google Drive, ensuring the trained model is safely stored and easily transferable.


In [None]:
model_save_path = "ner_model"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

with open(f"{model_save_path}/labels.json", "w", encoding="utf-8") as f:
    json.dump(label_list, f, ensure_ascii=False)


notebook_name = "model_ner_wikiann-fa.ipynb"
!cp "{notebook_name}" ner_model/


!zip -r ner_model.zip ner_model


from google.colab import drive
drive.mount('/content/drive')


!cp ner_model.zip /content/drive/MyDrive/


In [None]:
# Save the current notebook to Google Drive (manually)
from google.colab import files
files.download('ner_model.zip')


In [None]:
files.download('results.csv')

This cell loads the saved NER model, tokenizer, and label mappings from disk and defines an inference function for Persian text. The function tokenizes the input sentence, runs the model in inference mode, aligns predictions with original words, and prints the recognized named entities for each token.


In [None]:
import os

model_dir = '../ner_model'
tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
model = AutoModelForTokenClassification.from_pretrained(model_dir, local_files_only=True)


with open(f"{model_dir}/labels.json", "r", encoding="utf-8") as f:
    label_list = json.load(f)


def predict_ner(sentence):
    tokens = sentence.split()
    inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)
    word_ids = inputs.word_ids()

    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)[0].numpy()


    final_tokens, final_tags = [], []
    prev_word_id = None
    for i, word_id in enumerate(word_ids):
        if word_id is None or word_id == prev_word_id:
            continue
        final_tokens.append(tokens[word_id])
        final_tags.append(label_list[predictions[i]])
        prev_word_id = word_id

    for token, tag in zip(final_tokens, final_tags):
        print(f"{token:10} ‚Üí {tag}")

predict_ner("Ÿæÿ≤ÿ¥⁄©€åÿßŸÜ ÿØ€åÿ±Ÿàÿ≤ ÿ®Ÿá ÿ≥ÿßÿ≤ŸÖÿßŸÜ ŸÖŸÑŸÑ ŸÖÿ™ÿ≠ÿØ ÿØÿ± ŸÜ€åŸà€åŸàÿ±⁄© ÿ±ŸÅÿ™")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Ÿæÿ≤ÿ¥⁄©€åÿßŸÜ    ‚Üí B-PER
ÿØ€åÿ±Ÿàÿ≤      ‚Üí O
ÿ®Ÿá         ‚Üí O
ÿ≥ÿßÿ≤ŸÖÿßŸÜ     ‚Üí B-ORG
ŸÖŸÑŸÑ        ‚Üí I-ORG
ŸÖÿ™ÿ≠ÿØ       ‚Üí I-ORG
ÿØÿ±         ‚Üí O
ŸÜ€åŸà€åŸàÿ±⁄©    ‚Üí B-LOC
ÿ±ŸÅÿ™        ‚Üí O
