**Named entity recognition (NER)**:

Find the entities (such as persons, locations, or organizations) in a sentence. This can be formulated as attributing a label to each token by having one class per entity and one class for “no entity.”

In [None]:
!pip install transformers datasets tokenizers evaluate seqeval -q
!pip install farasapy
!pip install pyarabic
!git clone https://github.com/aub-mind/arabert



## Loading the dataset

You can find the dataset [here](https://huggingface.co/datasets/e-hossam96/conllpp-ner-ar).

In [None]:
from datasets import load_dataset

dataset = load_dataset("e-hossam96/conllpp-ner-ar")


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 10250
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 2383
    })
    test: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 2572
    })
})

In [None]:
dataset["train"][0]

{'tokens': ['الاتحاد',
  'الأوروبي',
  'يرفض',
  'الدعوة',
  'الألمانية',
  'لمقاطعة',
  'لحم',
  'الضأن',
  'البريطاني',
  '.'],
 'ner_tags': [3, 4, 0, 0, 7, 0, 0, 0, 7, 0]}

In [None]:
dataset["train"][0]["tokens"]

['الاتحاد',
 'الأوروبي',
 'يرفض',
 'الدعوة',
 'الألمانية',
 'لمقاطعة',
 'لحم',
 'الضأن',
 'البريطاني',
 '.']

In [None]:
dataset["train"][0]["ner_tags"]

[3, 4, 0, 0, 7, 0, 0, 0, 7, 0]

In [None]:
ner_feature = dataset["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [None]:
label_names = ner_feature.feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [None]:
print(f'The number of labels = {len(label_names)}')

The number of labels = 9


In [None]:
# show first example with its labels
words = dataset["train"][0]["tokens"]
labels = dataset["train"][0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

الاتحاد الأوروبي يرفض الدعوة الألمانية لمقاطعة لحم الضأن البريطاني . 
B-ORG   I-ORG    O    O      B-MISC    O       O   O     B-MISC    O 
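The `B-`/`I-` prefixes above mark where an entity begins and continues. As a minimal sketch (the helper name `bio_to_spans` is hypothetical and not part of the notebook), consecutive tags can be grouped back into whole entities like this:

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then start a new one.
            # A stray I- tag is treated as the beginning of an entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": outside any entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
               'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
tokens = ['الاتحاد', 'الأوروبي', 'يرفض', 'الدعوة', 'الألمانية',
          'لمقاطعة', 'لحم', 'الضأن', 'البريطاني', '.']
ner_tags = [3, 4, 0, 0, 7, 0, 0, 0, 7, 0]

spans = bio_to_spans(tokens, [label_names[t] for t in ner_tags])
print(spans)
```

For the first training example this yields one `ORG` span covering two tokens and two single-token `MISC` spans.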


## Tokenization

You can find the model [here](https://huggingface.co/aubmindlab/bert-base-arabertv02); for more information, see the [AraBERT project](https://huggingface.co/aubmindlab/bert-base-arabert?).

In [None]:
# from transformers import AutoConfig, AutoModelForTokenClassification, AutoTokenizer


In [None]:
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer

model_checkpoint = "aubmindlab/bert-base-arabertv02"
arabert_prep = ArabertPreprocessor(model_name=model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)




In [None]:
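# Note: preprocess() works on a single string. Passing the token list here means
# the list is processed as its string representation, as the output below shows.
preprocessed_text = arabert_prep.preprocess(dataset["train"][0]["tokens"])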
print(f'The preprocessed text : {preprocessed_text}')

The preprocessed text : [ ' الاتحاد ' , ' الأوروبي ' , ' يرفض ' , ' الدعوة ' , ' الألمانية ' , ' لمقاطعة ' , ' لحم ' , ' الضأن ' , ' البريطاني ' , ' . ' ]


In [None]:
inputs = tokenizer(dataset["train"][0]["tokens"], is_split_into_words=True)

In [None]:
inputs

{'input_ids': [2, 948, 2934, 5999, 4508, 4205, 37995, 12786, 792, 460, 4704, 20, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
inputs.tokens()

['[CLS]',
 'الاتحاد',
 'الأوروبي',
 'يرفض',
 'الدعوة',
 'الألمانية',
 'لمقاطعة',
 'لحم',
 'الض',
 '##أن',
 'البريطاني',
 '.',
 '[SEP]']

In [None]:
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, None]

**The sub-token problem: the list of input IDs returned by the tokenizer is longer than the list of labels in our dataset, because the tokenizer adds special tokens and splits some words into several sub-tokens.**

In [None]:
len(inputs.tokens()), len(dataset["train"][0]['ner_tags'])

(13, 10)
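One common way to close this gap is to expand the word-level labels using `word_ids()`: the first sub-token of each word keeps the word's label, and special tokens and continuation sub-tokens receive `-100`, the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The sketch below is an illustration only (the function name `align_labels_with_word_ids` is hypothetical); it reuses the `word_ids` and `ner_tags` values printed above.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels_with_word_ids(word_labels, word_ids):
    """Expand word-level labels to one label per sub-token."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:
            # First sub-token of a word keeps the word's label
            aligned.append(word_labels[word_id])
        else:
            # Continuation sub-token is ignored by the loss
            aligned.append(IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, None]
ner_tags = [3, 4, 0, 0, 7, 0, 0, 0, 7, 0]

aligned = align_labels_with_word_ids(ner_tags, word_ids)
print(aligned)       # [-100, 3, 4, 0, 0, 7, 0, 0, 0, -100, 7, 0, -100]
print(len(aligned))  # 13
```

The aligned list now has 13 entries, one per input ID, and only the 10 real word labels contribute to the loss.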

In [None]:
import torch

This class is designed to take Arabic text, clean and tokenize it using the AraBERT tokenizer, and prepare it for training or inference in an NER task. It adds necessary padding, attention masks, and handles tokenization of words split into multiple subword tokens, while aligning them with the correct NER labels. The processed data is returned in a format that can directly be used for training a transformer model like BERT.

In [None]:
class NERDataset:
    def __init__(self, texts, tags, label_list, model_name, max_length):
        self.texts = texts
        self.tags = tags
        self.label_list = label_list
        self.preprocessor = ArabertPreprocessor(model_name.split("/")[-1])
        self.pad_token_label_id = torch.nn.CrossEntropyLoss().ignore_index
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        textlist = self.texts[item]
        tags = self.tags[item]

        tokens = []
        label_ids = []
        for word, label in zip(textlist, tags):
            clean_word = self.preprocessor.preprocess(word)
            word_tokens = self.tokenizer.tokenize(clean_word)

            if len(word_tokens) > 0:
                tokens.extend(word_tokens)
                # Use the real label id (numerical tag) for the first token of the word,
                # and padding ids for the remaining tokens
                label_ids.extend([label] + [self.pad_token_label_id] * (len(word_tokens) - 1))

        # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
        special_tokens_count = self.tokenizer.num_special_tokens_to_add()
        if len(tokens) > self.max_length - special_tokens_count:
            tokens = tokens[: (self.max_length - special_tokens_count)]
            label_ids = label_ids[: (self.max_length - special_tokens_count)]

        # Add the [SEP] token
        tokens += [self.tokenizer.sep_token]
        label_ids += [self.pad_token_label_id]
        token_type_ids = [0] * len(tokens)

        # Add the [CLS] token
        tokens = [self.tokenizer.cls_token] + tokens
        label_ids = [self.pad_token_label_id] + label_ids
        token_type_ids = [0] + token_type_ids

        input_ids = self.tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
        attention_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding_length = self.max_length - len(input_ids)

        input_ids += [self.tokenizer.pad_token_id] * padding_length
        attention_mask += [0] * padding_length
        token_type_ids += [0] * padding_length
        label_ids += [self.pad_token_label_id] * padding_length

        assert len(input_ids) == self.max_length
        assert len(attention_mask) == self.max_length
        assert len(token_type_ids) == self.max_length
        assert len(label_ids) == self.max_length

        return {
            'input_ids': torch.tensor(input_ids, dtype=torch.long),
            'attention_mask': torch.tensor(attention_mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'labels': torch.tensor(label_ids, dtype=torch.long)
        }
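To make the truncation and padding steps in `__getitem__` easier to follow, here is a standalone sketch of the same idea using plain lists. The function name `add_special_tokens_and_pad` is hypothetical. The `[CLS]` and `[SEP]` IDs (2 and 3) are read off the `input_ids` shown earlier, while `pad_id=0` is an assumption and should be taken from `tokenizer.pad_token_id` in practice.

```python
IGNORE_INDEX = -100


def add_special_tokens_and_pad(token_ids, label_ids, max_length,
                               cls_id, sep_id, pad_id):
    """Truncate, wrap in [CLS]/[SEP], and pad to exactly max_length."""
    budget = max_length - 2  # leave room for [CLS] and [SEP]
    token_ids = token_ids[:budget]
    label_ids = label_ids[:budget]

    input_ids = [cls_id] + token_ids + [sep_id]
    labels = [IGNORE_INDEX] + label_ids + [IGNORE_INDEX]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens

    padding = max_length - len(input_ids)
    input_ids += [pad_id] * padding
    attention_mask += [0] * padding        # 0 marks padding
    labels += [IGNORE_INDEX] * padding     # padding never contributes to the loss
    return input_ids, attention_mask, labels


# Sub-token IDs and aligned labels for the first training example
token_ids = [948, 2934, 5999, 4508, 4205, 37995, 12786, 792, 460, 4704, 20]
label_ids = [3, 4, 0, 0, 7, 0, 0, 0, -100, 7, 0]

input_ids, attention_mask, labels = add_special_tokens_and_pad(
    token_ids, label_ids, max_length=16, cls_id=2, sep_id=3, pad_id=0
)
print(input_ids)
print(attention_mask)
print(labels)
```

All three lists come out with length 16: 13 real positions followed by 3 padding positions.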

This function automates the process of extracting the texts and tags from a specific split of a dataset (e.g., training, testing, or validation) and packaging them into the NERDataset class, which can then be used for training or evaluation in an NER task.

In [None]:
# Example of extracting data from DatasetDict
def create_ner_dataset_from_datasetdict(dataset, split_name, label_list, model_name, max_length):
    # Extract texts and tags from the DatasetDict for a specific split
    texts = dataset[split_name]['tokens']
    tags = dataset[split_name]['ner_tags']

    # Create the NERDataset
    return NERDataset(
        texts=texts,
        tags=tags,
        label_list=label_list,
        model_name=model_name,
        max_length=max_length
    )

In [None]:
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [None]:
# Create the train, validation, and test datasets from the DatasetDict
train_dataset = create_ner_dataset_from_datasetdict(dataset, 'train', label_names, model_checkpoint, 128)
validation_dataset = create_ner_dataset_from_datasetdict(dataset, 'validation', label_names, model_checkpoint, 128)
test_dataset = create_ner_dataset_from_datasetdict(dataset, 'test', label_names, model_checkpoint, 128)

- inv_label_map: Converts numerical label IDs back to human-readable NER labels.
- align_predictions: Takes the model's predicted outputs and the true labels, aligns them, and returns them in a human-readable form (skipping padding tokens). This is useful for evaluation and analysis of the model's performance on the NER task.

In [None]:
import numpy as np
import torch

inv_label_map = {i: label for i, label in enumerate(label_names)}

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)

    batch_size, seq_len = preds.shape

    out_label_list = [[] for _ in range(batch_size)]
    preds_list = [[] for _ in range(batch_size)]

    for i in range(batch_size):
        for j in range(seq_len):
            if label_ids[i, j] != torch.nn.CrossEntropyLoss().ignore_index:
                out_label_list[i].append(inv_label_map[label_ids[i][j]])
                preds_list[i].append(inv_label_map[preds[i][j]])

    return preds_list, out_label_list
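The same decoding logic can be traced on a tiny hand-made batch. The sketch below is a pure-Python illustration with made-up logits and a shortened, hypothetical three-label set; it is not the notebook's `align_predictions`, which operates on NumPy arrays.

```python
IGNORE_INDEX = -100
toy_label_names = ["O", "B-PER", "I-PER"]  # shortened label set for illustration


def decode_predictions(logits, label_ids, names):
    """Take the argmax per token and drop positions labeled IGNORE_INDEX."""
    preds_list, labels_list = [], []
    for seq_logits, seq_labels in zip(logits, label_ids):
        preds, golds = [], []
        for token_logits, gold in zip(seq_logits, seq_labels):
            if gold == IGNORE_INDEX:
                continue  # special token, continuation sub-token, or padding
            best = max(range(len(token_logits)), key=token_logits.__getitem__)
            preds.append(names[best])
            golds.append(names[gold])
        preds_list.append(preds)
        labels_list.append(golds)
    return preds_list, labels_list


# One sequence of five positions, three scores per position (made-up values)
logits = [[
    [0.1, 0.2, 0.0],  # [CLS], ignored
    [0.2, 2.0, 0.1],  # argmax -> B-PER
    [0.3, 0.1, 1.5],  # argmax -> I-PER
    [1.2, 0.4, 0.3],  # argmax -> O
    [0.0, 0.0, 0.0],  # [SEP], ignored
]]
label_ids = [[-100, 1, 2, 0, -100]]

preds, golds = decode_predictions(logits, label_ids, toy_label_names)
print(preds)  # [['B-PER', 'I-PER', 'O']]
print(golds)  # [['B-PER', 'I-PER', 'O']]
```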

- compute_metrics is a function that evaluates the model's performance by comparing the predicted labels and the true labels.
- The function uses the seqeval library to calculate accuracy, precision, recall, and F1-score, which are commonly used metrics for sequence labeling tasks like NER.
- The align_predictions function ensures that predictions and labels are mapped back to their original label names before calculating these metrics.
- Optionally, you can also print a detailed classification report for each label using classification_report.

In [None]:
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

def compute_metrics(p):
    preds_list, out_label_list = align_predictions(p.predictions,p.label_ids)
    #print(classification_report(out_label_list, preds_list,digits=4))
    return {
        "accuracy_score": accuracy_score(out_label_list, preds_list),
        "precision": precision_score(out_label_list, preds_list),
        "recall": recall_score(out_label_list, preds_list),
        "f1": f1_score(out_label_list, preds_list),
    }
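Precision, recall, and F1 for NER are usually computed over whole entities rather than individual tokens, which is why they can be much lower than token accuracy. The following is a simplified, pure-Python sketch of that idea. It is not seqeval's implementation, and the helper names `extract_entities` and `entity_prf` are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO-tagged sequence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if not inside:
            if ent_type is not None:
                entities.add((ent_type, start, i - 1))
            if tag == "O":
                start, ent_type = None, None
            else:
                start, ent_type = i, tag[2:]
    return entities


def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged precision, recall, and F1 over exact entity matches."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        tp += len(true_ents & pred_ents)
        n_pred += len(pred_ents)
        n_true += len(true_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [["B-ORG", "I-ORG", "O", "B-MISC", "O"]]
y_pred = [["B-ORG", "I-ORG", "O", "O", "B-LOC"]]

precision, recall, f1 = entity_prf(y_true, y_pred)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

In this toy case three of five tokens are tagged correctly (token accuracy 0.6), but only one of two gold entities is recovered and only one of two predicted entities is correct, so all three entity-level scores are 0.5.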

## Defining the model

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=9)

model.safetensors:   0%|          | 0.00/543M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
model_path = "/content/drive/MyDrive/NER_Translation/ner_model"

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=model_path + "/Fine-tuned-ar-ner",
    num_train_epochs=2,  # Start with a lower number and adjust as needed
    learning_rate=2e-5,  # Adjust based on initial results
    per_device_eval_batch_size=64,  # Adjust based on memory constraints
    per_device_train_batch_size=32,  # Adjust based on memory constraints
    weight_decay=0.01,
    eval_strategy="epoch",  # Use "epoch" or "steps"
    save_strategy="epoch",  # Matches eval_strategy, as load_best_model_at_end requires
    load_best_model_at_end=True,  # Reloads the best checkpoint when training ends
    disable_tqdm=False,
    push_to_hub=False,
    # warmup_steps=500  # Gradual increase in learning rate
)

In [None]:
training_args.to_dict()

{'output_dir': '/content/drive/MyDrive/NER_Translation/Fine-tuned-ar-ner',
 'overwrite_output_dir': False,
 'do_train': False,
 'do_eval': True,
 'do_predict': False,
 'eval_strategy': 'epoch',
 'prediction_loss_only': False,
 'per_device_train_batch_size': 32,
 'per_device_eval_batch_size': 64,
 'per_gpu_train_batch_size': None,
 'per_gpu_eval_batch_size': None,
 'gradient_accumulation_steps': 1,
 'eval_accumulation_steps': None,
 'eval_delay': 0,
 'torch_empty_cache_steps': None,
 'learning_rate': 2e-05,
 'weight_decay': 0.01,
 'adam_beta1': 0.9,
 'adam_beta2': 0.999,
 'adam_epsilon': 1e-08,
 'max_grad_norm': 1.0,
 'num_train_epochs': 2,
 'max_steps': -1,
 'lr_scheduler_type': 'linear',
 'lr_scheduler_kwargs': {},
 'warmup_ratio': 0.0,
 'warmup_steps': 0,
 'log_level': 'passive',
 'log_on_each_node': True,
 'logging_dir': '/content/drive/MyDrive/NER_Translation/Fine-tuned-ar-ner/runs/Sep13_22-19-50_b7dbb2a13138',
 'logging_strategy': 'steps',
 'logging_first_step': False,
 'logging_s

For details on the metrics, see the [seqeval repository](https://github.com/chakki-works/seqeval).

## Training

In [None]:
from transformers import Trainer

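trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # Note: this run evaluates on the test split after every epoch. The
    # validation_dataset created earlier is the more conventional choice for
    # per-epoch evaluation and best-checkpoint selection.
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)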

In [None]:
import numpy as np

In [None]:
%%time
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy Score | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | No log | 0.198502 | 0.942274 | 0.817041 | 0.826223 | 0.821606 |
| 2 | 0.253200 | 0.189215 | 0.945172 | 0.825741 | 0.841207 | 0.833402 |


CPU times: user 7min 40s, sys: 11 s, total: 7min 51s
Wall time: 8min 47s


TrainOutput(global_step=642, training_loss=0.2287283835009994, metrics={'train_runtime': 526.0964, 'train_samples_per_second': 38.966, 'train_steps_per_second': 1.22, 'total_flos': 1339230628224000.0, 'train_loss': 0.2287283835009994, 'epoch': 2.0})
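The reported `global_step=642` can be checked with a little arithmetic. The sketch below assumes a single device and `gradient_accumulation_steps=1` (as shown in the arguments dump). It also assumes `logging_steps` is at the library default of 500, which the truncated dump above does not confirm.

```python
import math

num_train_examples = 10250        # size of the train split
per_device_train_batch_size = 32
num_train_epochs = 2
logging_steps = 500               # assumption: library default, not shown in the dump

# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch)  # 321
print(total_steps)      # 642

# Under the logging_steps assumption, no training loss is logged before step 500.
print(steps_per_epoch < logging_steps)  # True
```

If the logging assumption holds, it would also explain the "No log" entry for epoch 1: the first epoch ends at step 321, before the first training-loss log at step 500.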

## Test the model

In [None]:
predictions, labels, _ = trainer.predict(test_dataset)
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

In [None]:
import evaluate

metric = evaluate.load("seqeval")
results = metric.compute(predictions=true_predictions, references=true_labels)
results


{'LOC': {'precision': 0.8407148407148407,
  'recall': 0.8739903069466882,
  'f1': 0.8570297029702971,
  'number': 1238},
 'MISC': {'precision': 0.5555555555555556,
  'recall': 0.5748663101604278,
  'f1': 0.5650459921156373,
  'number': 374},
 'ORG': {'precision': 0.8064337215751525,
  'recall': 0.8002201430930105,
  'f1': 0.8033149171270718,
  'number': 1817},
 'PER': {'precision': 0.9104372355430184,
  'recall': 0.9382267441860465,
  'f1': 0.9241231209735148,
  'number': 1376},
 'overall_precision': 0.8257405515832482,
 'overall_recall': 0.8412070759625391,
 'overall_f1': 0.8334020618556701,
 'overall_accuracy': 0.9451717131341394}

## Saving

In [None]:
# Saving The Model
trainer.save_model(model_path +  "/AraBert_NER_model")

In [None]:
id2label = {
    str(i): label for i,label in enumerate(label_names)
}
label2id = {
    label: i for i, label in enumerate(label_names)  # label ids stay integers
}

In [None]:
import json

# Use a context manager so the file handle is closed after reading
with open(model_path + "/AraBert_NER_model" + "/config.json") as f:
    config = json.load(f)

In [None]:
config["id2label"] = id2label
config["label2id"] = label2id

In [None]:
# Use a context manager so the file is flushed and closed after writing
with open(model_path + "/AraBert_NER_model" + "/config.json", "w") as f:
    json.dump(config, f)
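A small sketch of how JSON serialization treats the two label mappings may help when inspecting the saved `config.json`. JSON object keys are always strings, so integer keys in `id2label` come back as strings after a round trip, while integer values in `label2id` are preserved. The variable names below mirror the notebook's, but the snippet runs on its own and writes nothing to disk.

```python
import json

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
               'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {label: i for i, label in enumerate(label_names)}

# Round-trip through JSON, as happens when config.json is written and re-read
serialized = json.dumps({"id2label": id2label, "label2id": label2id})
restored = json.loads(serialized)

print(list(restored["id2label"].items())[:2])  # [('0', 'O'), ('1', 'B-PER')]
print(restored["label2id"]["B-PER"])           # 1
```

As an alternative to editing the file by hand, `from_pretrained` also accepts `id2label` and `label2id` keyword arguments, so the mappings can be set when the model is first created and are then saved with it.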

## Loading the model and using it

In [3]:
from transformers import AutoModelForTokenClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path + "/AraBert_NER_model")
model_fine_tuned = AutoModelForTokenClassification.from_pretrained(model_path + "/AraBert_NER_model")

In [10]:
from transformers import pipeline

nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer, aggregation_strategy="simple")


example = "عبد الرحمن يعيش في القاهرة لكنه من السودانيين"

ner_results = nlp(example)

ner_results


Some weights of BertForTokenClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

Note that the recorded output below uses the generic `LABEL_0`/`LABEL_1` names and per-token `entity` and `index` keys, rather than the grouped `entity_group` entries that `aggregation_strategy="simple"` produces. The preceding warning also reports newly initialized classifier weights for `aubmindlab/bert-base-arabertv02`. The output therefore appears to come from a run that loaded the base checkpoint without aggregation, and it should not be read as the fine-tuned model's predictions.


[{'entity': 'LABEL_1',
  'score': 0.6548905,
  'index': 1,
  'word': 'عبد',
  'start': 0,
  'end': 3},
 {'entity': 'LABEL_1',
  'score': 0.7616616,
  'index': 2,
  'word': 'الرحمن',
  'start': 4,
  'end': 10},
 {'entity': 'LABEL_1',
  'score': 0.5568705,
  'index': 3,
  'word': 'يعيش',
  'start': 11,
  'end': 15},
 {'entity': 'LABEL_1',
  'score': 0.51315325,
  'index': 4,
  'word': 'في',
  'start': 16,
  'end': 18},
 {'entity': 'LABEL_1',
  'score': 0.58255863,
  'index': 5,
  'word': 'القاهرة',
  'start': 19,
  'end': 26},
 {'entity': 'LABEL_1',
  'score': 0.6580443,
  'index': 6,
  'word': 'لكنه',
  'start': 27,
  'end': 31},
 {'entity': 'LABEL_0',
  'score': 0.6935499,
  'index': 7,
  'word': 'من',
  'start': 32,
  'end': 34},
 {'entity': 'LABEL_1',
  'score': 0.6107021,
  'index': 8,
  'word': 'السودانيين',
  'start': 35,
  'end': 45}]