# Named Entity Recognition in Nepali Language

This notebook fine-tunes BERT-style models for Named Entity Recognition (NER) on Nepali text.  
It mainly extracts abusive entities, labelled Profanity, Violence, and General, from Nepali social media text.  
The notebook runs successfully on Kaggle.  

## Installation

In [1]:
%%capture
!python3 -m pip install -U transformers
!python3 -m pip install -U datasets evaluate
!python3 -m pip install -U accelerate
!python3 -m pip install -U seqeval
!python3 -m pip install -U wandb

# Data Preprocessing

## Load NepSA dataset


In [2]:
# Your github repo link here
!git clone https://github.com/oya163/bert-llm.git

Cloning into 'bert-llm'...
remote: Enumerating objects: 89, done.
remote: Counting objects: 100% (89/89), done.
remote: Compressing objects: 100% (73/73), done.
remote: Total 89 (delta 33), reused 52 (delta 12), pack-reused 0
Receiving objects: 100% (89/89), 5.66 MiB | 21.00 MiB/s, done.
Resolving deltas: 100% (33/33), done.


In [3]:
DATA_DIR = 'bert-llm/NepNER/dataset'

In [4]:
import os
from datasets import load_dataset

data_files = {
    "train": "train.txt",
    "validation": "valid.txt",
    "test": "test.txt",
}

load_script = os.path.join(DATA_DIR, "load_ner.py")

raw_datasets = load_dataset(load_script, data_files=data_files)




Inspect the splits, features, and row counts of the loaded dataset

In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2323
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 330
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 280
    })
})

Check a sample of tokens from the train dataset

In [6]:
print(raw_datasets["train"][0]["tokens"])

['भालुनी', 'सावित्री', 'कुकुरनी', 'मिले', 'को', 'रहेछ', 'आजा', 'प्रक्षया', 'थाहा', 'भयो', 'निर्माल', 'बहिनी', 'लाई', 'बलत्कार', 'गर्न', 'लगाउने', 'यनि', 'भलु', 'हरु', 'रहेछ', 'पहिला', 'जाती', 'आन्दोलान', 'गरे', 'को', 'थियो', 'त्यो', 'सबै', 'यनि', 'हरु', 'को', 'नाटक', 'रहेछ', 'आजा', 'बल', 'थाहा', 'भयो', '।']


Check the NER tag IDs of the same sample

In [7]:
print(raw_datasets["train"][0]["ner_tags"])

[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [8]:
ner_feature = raw_datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-PROFANITY', 'I-PROFANITY', 'B-FEEDBACK', 'I-FEEDBACK', 'B-GENERAL', 'I-GENERAL', 'B-VIOLENCE', 'I-VIOLENCE'], id=None), length=-1, id=None)

Check the labels in the dataset

In [9]:
label_names = ner_feature.feature.names
label_names  # GENERAL = a word carrying a slightly negative or positive connotation

['O',
 'B-PROFANITY',
 'I-PROFANITY',
 'B-FEEDBACK',
 'I-FEEDBACK',
 'B-GENERAL',
 'I-GENERAL',
 'B-VIOLENCE',
 'I-VIOLENCE']

Display the tokens alongside their labels

In [10]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]
# Print each token next to its label name
for word, label in zip(words, labels):
    print(word, '\t', label_names[label])

भालुनी 	 B-PROFANITY
सावित्री 	 O
कुकुरनी 	 B-PROFANITY
मिले 	 O
को 	 O
रहेछ 	 O
आजा 	 O
प्रक्षया 	 O
थाहा 	 O
भयो 	 O
निर्माल 	 O
बहिनी 	 O
लाई 	 O
बलत्कार 	 B-VIOLENCE
गर्न 	 O
लगाउने 	 O
यनि 	 O
भलु 	 B-PROFANITY
हरु 	 O
रहेछ 	 O
पहिला 	 O
जाती 	 O
आन्दोलान 	 O
गरे 	 O
को 	 O
थियो 	 O
त्यो 	 O
सबै 	 O
यनि 	 O
हरु 	 O
को 	 O
नाटक 	 O
रहेछ 	 O
आजा 	 O
बल 	 O
थाहा 	 O
भयो 	 O
। 	 O


## Tokenization

In [11]:
from transformers import AutoTokenizer

# model_checkpoint = "NepBERTa/NepBERTa"
model_checkpoint = "Rajan/NepaliBERT"
# model_checkpoint = "Rajan/nepbertaTorch"
# model_checkpoint = "Sakonii/distilbert-base-nepali"
# model_checkpoint = "Sakonii/deberta-base-nepali"
# model_checkpoint = "mrm8488/bert-multi-cased-finetuned-xquadv1"
# model_checkpoint = "xlm-roberta-large"
# model_checkpoint = "nlptown/bert-base-multilingual-uncased-sentiment"
# model_checkpoint = "bert-base-multilingual-uncased"
# model_checkpoint = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [12]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
print(inputs.tokens())

['[CLS]', 'भाल', '##नी', 'सावित', '##री', 'कक', '##रन', '##ी', 'मिल', 'को', 'रह', '##छ', 'आजा', 'पर', '##क', '##ष', '##या', 'थाहा', 'भयो', 'निरमा', '##ल', 'बहिनी', 'लाई', 'बल', '##तका', '##र', 'गर', '##न', 'लगाउन', 'य', '##नि', 'भल', 'हर', 'रह', '##छ', 'पहिला', 'जाती', 'आन', '##दो', '##लान', 'गर', 'को', 'थियो', 'तयो', 'सब', 'य', '##नि', 'हर', 'को', 'नाटक', 'रह', '##छ', 'आजा', 'बल', 'थाहा', 'भयो', '।', '[SEP]']


In [13]:
print(inputs.word_ids())

[None, 0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 5, 6, 7, 7, 7, 7, 8, 9, 10, 10, 11, 12, 13, 13, 13, 14, 14, 15, 16, 16, 17, 18, 19, 19, 20, 21, 22, 22, 22, 23, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 32, 33, 34, 35, 36, 37, None]
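
The two outputs above line up position by position: each entry of `word_ids()` gives the index of the original word that the sub-word token at the same position came from, and `None` marks a special token. As a minimal sketch (the helper name `group_pieces` is hypothetical, and only the first few tokens of the sample are copied in), the pieces can be grouped back by word like this:

```python
from itertools import groupby

# First tokens and word ids copied from the two outputs above
tokens = ['[CLS]', 'भाल', '##नी', 'सावित', '##री', 'कक', '##रन', '##ी', 'मिल']
word_ids = [None, 0, 0, 1, 1, 2, 2, 2, 3]


def group_pieces(tokens, word_ids):
    """Return one list of sub-word pieces per original word, skipping special tokens."""
    pairs = [(w, t) for w, t in zip(word_ids, tokens) if w is not None]
    return [[t for _, t in grp] for _, grp in groupby(pairs, key=lambda p: p[0])]


pieces_per_word = group_pieces(tokens, word_ids)
for index, pieces in enumerate(pieces_per_word):
    print(index, pieces)
```

Because one word can become several tokens, the word-level `ner_tags` list is shorter than the token list, which is why the labels need to be expanded before training.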


## Label Alignment

In [14]:
# Expand the word-level labels so that every sub-word token gets a label
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [15]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[-100, 1, 2, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
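
The `label % 2 == 1` test in `align_labels_with_tokens` only works because of how this dataset orders its labels: every `B-XXX` label sits at an odd ID and its `I-XXX` partner at the next even ID. The sketch below checks that assumption against a name-based lookup; the helper `inside_label_id` is hypothetical and the label list is copied from the dataset features shown earlier.

```python
label_names = ['O', 'B-PROFANITY', 'I-PROFANITY', 'B-FEEDBACK', 'I-FEEDBACK',
               'B-GENERAL', 'I-GENERAL', 'B-VIOLENCE', 'I-VIOLENCE']


def inside_label_id(label_id, names):
    """Map a B-XXX id to its I-XXX id by name; leave O and I-XXX ids unchanged."""
    name = names[label_id]
    if name.startswith("B-"):
        return names.index("I-" + name[2:])
    return label_id


# The arithmetic shortcut (odd id -> id + 1) agrees with the name-based lookup
for label_id, name in enumerate(label_names):
    shortcut = label_id + 1 if label_id % 2 == 1 else label_id
    assert shortcut == inside_label_id(label_id, label_names), name

print([inside_label_id(i, label_names) for i in range(len(label_names))])
```

With a differently ordered label list the shortcut would silently assign wrong labels, so the name-based lookup is the safer choice when reusing this code on another dataset.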


In [16]:
# Helper function to tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [17]:
# Tokenize all the examples from the datasets
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



# Fine Tuning

## Data Collation

Prepare the data collator, which pads the inputs and labels of each batch to a common length for training

In [18]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    1,    2,    0,    0,    1,    2,    2,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    7,
            8,    8,    0,    0,    0,    0,    0,    1,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0, -100],
        [-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    5,    6,    6,    6,    6,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

In [19]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 1, 2, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
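
Comparing the two outputs shows what the collator did to the labels: the shorter sequence was right-padded with `-100` up to the length of the longer one, and `-100` is the index the loss function ignores. A minimal pure-Python sketch of that padding step follows; `pad_labels` is a hypothetical helper and the rows are toy values, not the collator's actual implementation.

```python
def pad_labels(label_rows, pad_id=-100):
    """Right-pad every label list to the length of the longest one."""
    longest = max(len(row) for row in label_rows)
    return [row + [pad_id] * (longest - len(row)) for row in label_rows]


# Toy label sequences (not taken from the dataset)
rows = [[-100, 1, 2, 0, -100], [-100, 5, -100]]
padded = pad_labels(rows)
print(padded)
```

Padding per batch rather than to a global maximum keeps short batches small, which matters for social media text where sentence lengths vary widely.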


## Setup Evaluation

In [20]:
import evaluate
import numpy as np

metric = evaluate.load("seqeval")


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
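
The precision, recall, and F1 returned here are entity-level scores: a predicted entity counts as correct only if its type and both boundaries match, whereas accuracy is counted per token and is dominated by the frequent `O` label. The sketch below illustrates that difference on a single toy sentence. It is a simplified stand-in, not seqeval's implementation, and `extract_spans` and `span_scores` are hypothetical helpers.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence (end exclusive)."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and kind == tag[2:]
        if kind is not None and not continues:
            spans.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and kind is None):
            start, kind = i, tag[2:]
    return spans


def span_scores(true_tags, pred_tags):
    """Entity-level precision, recall, and F1 for one sentence."""
    gold, pred = extract_spans(true_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example: one sub-word of the first entity is missed
true_tags = ["B-PROFANITY", "I-PROFANITY", "O", "O", "B-VIOLENCE", "O"]
pred_tags = ["B-PROFANITY", "O", "O", "O", "B-VIOLENCE", "O"]
token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(token_accuracy, span_scores(true_tags, pred_tags))
```

A single wrong token leaves token accuracy at five out of six while halving the entity-level scores, which is the same pattern seen when a run reports high accuracy and low F1.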


In [21]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [22]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
    # from_tf=True,
)



Some weights of BertForTokenClassification were not initialized from the model checkpoint at Rajan/NepaliBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
model.config.num_labels

9

## Training

In [24]:
# from google.colab import userdata
# from huggingface_hub import login, notebook_login

# # notebook_login()
# login(token=userdata.get('hugging_face'))

import wandb
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
wandb_secret = user_secrets.get_secret("wandb")
wandb.login(key=wandb_secret)

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [25]:
from transformers import TrainingArguments, Trainer

model_name = "nepner"

args = TrainingArguments(
    model_name,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=6,
    weight_decay=0.01,
    push_to_hub=False,
    save_strategy="no"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33moyashi163[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.16.1
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20231217_221848-7si4y9tt[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mkind-moon-33[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/oyashi163/huggingface[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/oyashi163/huggingface/runs/7si4y9tt[0m


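| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 1 | No log | 0.590977 | 0.117904 | 0.11588 | 0.116883 | 0.814536 |
| 2 | 0.615400 | 0.539762 | 0.180974 | 0.167382 | 0.173913 | 0.829344 |
| 3 | 0.615400 | 0.533893 | 0.186747 | 0.266094 | 0.219469 | 0.826876 |
| 4 | 0.440000 | 0.522532 | 0.207012 | 0.266094 | 0.232864 | 0.831811 |
| 5 | 0.440000 | 0.527699 | 0.223368 | 0.27897 | 0.248092 | 0.83465 |
| 6 | 0.358100 | 0.534489 | 0.218409 | 0.300429 | 0.252936 | 0.830207 |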


TrainOutput(global_step=1746, training_loss=0.4497850992698713, metrics={'train_runtime': 111.3768, 'train_samples_per_second': 125.143, 'train_steps_per_second': 15.677, 'total_flos': 172421137645596.0, 'train_loss': 0.4497850992698713, 'epoch': 6.0})
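
The `global_step=1746` in the output can be reproduced from the dataset size. The sketch below assumes the `TrainingArguments` defaults that the cell does not override (a per-device batch size of 8 on a single device, and a training-loss log every 500 steps); under those assumptions it also accounts for the "No log" entry in epoch 1 and the repeated training-loss values.

```python
import math

train_examples = 2323   # num_rows of the train split shown earlier
num_epochs = 6          # num_train_epochs passed to TrainingArguments
batch_size = 8          # assumption: default per_device_train_batch_size, one device
logging_steps = 500     # assumption: default logging interval

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# Most recent logging step that is available when each epoch ends
last_logged = {}
for epoch in range(1, num_epochs + 1):
    step = (epoch * steps_per_epoch) // logging_steps * logging_steps
    last_logged[epoch] = step
    print(epoch, step or "No log")
```

Epochs 2 and 3 both report the loss logged at step 500, and epochs 4 and 5 the one logged at step 1000; a smaller `logging_steps` value would give one fresh training-loss reading per epoch.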

In [26]:
trainer.evaluate()

{'eval_loss': 0.53448885679245,
 'eval_precision': 0.21840873634945399,
 'eval_recall': 0.30042918454935624,
 'eval_f1': 0.2529358626919603,
 'eval_accuracy': 0.8302073050345509,
 'eval_runtime': 0.7805,
 'eval_samples_per_second': 422.816,
 'eval_steps_per_second': 53.813,
 'epoch': 6.0}

## Save the model

In [27]:
saved_model_path='nepner'
trainer.save_model(saved_model_path)

## Evaluation

In [28]:
predictions = trainer.predict(tokenized_datasets["test"])

In [29]:
from tabulate import tabulate

metrics = ['precision', 'recall', 'f1', 'accuracy']
prediction_results = []

for key, val in predictions.metrics.items():
    if any(item in key for item in metrics):
        # Format as a percentage with two decimals to avoid floating-point noise
        prediction_results.append([key, f"{val * 100:.2f}%"])

print(tabulate(prediction_results, headers=['Metric', 'Score']))

Metric          Score
--------------  -------
test_precision  21.97%
test_recall     27.75%
test_f1         24.52%
test_accuracy   83.71%


## Inference

In [30]:
from transformers import pipeline

token_classifier = pipeline(
    "token-classification", model=saved_model_path, aggregation_strategy="simple"
)


In [31]:
results = token_classifier("यो गोबिन्दे लाइ कि देश निकाला गर्नुपर्छ कि मार्नु पर्छ ।")
# Other sentences to try:
# ओली दलाल मुर्दाबाद
# यो गोविन्दे लाई देश निकाला गर्नु पर्छ
# यो पुण्य गौतम जड्या हो जस्तो कस कस लाई लाग्छ ।

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [32]:
prediction_results = []
for each_entity in results:
    prediction_results.append([each_entity['word'], each_entity['entity_group']])

print(tabulate(prediction_results, headers=['Word', 'Predictions']))


Word      Predictions
--------  -------------
##नपरछ    VIOLENCE
मारन परछ  VIOLENCE
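
The first row of this table starts with `##`, the WordPiece marker for a continuation piece, so that entity group begins in the middle of a word: the pieces of one word received different labels. One possible remedy is to let every word take the label of its first piece, which is roughly the idea behind the pipeline's `aggregation_strategy="first"` option. The sketch below shows that idea in plain Python; `merge_wordpieces` is a hypothetical helper, the input is invented, and it is not the pipeline's own code.

```python
def merge_wordpieces(pieces):
    """Merge (piece, label) pairs into (word, label) pairs.

    A piece starting with '##' continues the previous word; the word keeps
    the label of its first piece.
    """
    words = []
    for piece, label in pieces:
        if piece.startswith("##") and words:
            word, first_label = words[-1]
            words[-1] = (word + piece[2:], first_label)
        else:
            words.append((piece, label))
    return words


# Hypothetical token-level predictions, written in Latin script for readability
pieces = [("mar", "B-VIOLENCE"), ("##nu", "I-VIOLENCE"),
          ("par", "O"), ("##chha", "B-VIOLENCE")]
merged = merge_wordpieces(pieces)
print(merged)
```

Whether word-level aggregation actually improves the predictions for this model would need to be checked on the test set.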


## Results

### Rajan/NepaliBERT

| Metric      | Score |
| ----------- | ----------- |
| test_precision | 21.84% |
| test_recall    | 30.04% |
| test_f1        | 25.29% |
| test_accuracy  | 83.02% |

### Sakonii/distilbert-base-nepali

| Metric      | Score |
| ----------- | ----------- |
| test_precision | 24.08% |
| test_recall    | 29.61% |
| test_f1        | 26.56% |
| test_accuracy  | 84.48% |

### bert-base-multilingual-uncased

| Metric      | Score |
| ----------- | ----------- |
| test_precision | 26.54% |
| test_recall    | 33.26% |
| test_f1        | 29.52% |
| test_accuracy  | 82.84% |

### xlm-roberta-large

| Metric      | Score |
| ----------- | ----------- |
| test_precision | 38.26% |
| test_recall    | 44.50% |
| test_f1        | 41.15% |
| test_accuracy  | 86.15% |



