# Training Bangla Name Extractor Model

### Step 1: Clone git repository
Clone the [bangla_person_name_extractor](https://github.com/ibrahim-601/bangla_person_name_extractor) repository and go to the repository folder. We clone first because we are training the model on Colab. If you have already cloned the repository, skip this step.

In [None]:
!git clone https://github.com/ibrahim-601/bangla_person_name_extractor.git

Cloning into 'bangla_person_name_extractor'...
remote: Enumerating objects: 73, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 73 (delta 27), reused 64 (delta 18), pack-reused 0
Unpacking objects: 100% (73/73), 28.92 KiB | 1.45 MiB/s, done.


Go to the cloned directory. Run this cell only if the notebook's working directory is one level above the cloned directory. If you opened this notebook from inside the cloned repository, skip it.

In [9]:
%cd bangla_person_name_extractor

/content/bangla_person_name_extractor
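
Whether to clone, change directory, or do nothing depends on where the notebook is currently running. The check can be sketched in a few lines. This is only an illustration: `repo_action` is a hypothetical helper, not part of the repository, and it assumes the working directory is either the repository itself or its parent.

```python
from pathlib import Path

REPO_NAME = "bangla_person_name_extractor"


def repo_action(cwd: Path, repo_name: str = REPO_NAME) -> str:
    """Return which Step 1 action is still needed for the given directory.

    "none"  -> already inside the repository folder
    "cd"    -> repository exists one level below; only `%cd` is needed
    "clone" -> repository not found; clone first, then `%cd`
    """
    if cwd.name == repo_name:
        return "none"
    if (cwd / repo_name).is_dir():
        return "cd"
    return "clone"


print(repo_action(Path.cwd()))
```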


### Step 2: Environment setup
Install the required packages with pip by running the cell below.

In [1]:
!pip install -r requirements.txt

Collecting spacy-transformers==1.2.5 (from -r requirements.txt (line 3))
  Downloading spacy_transformers-1.2.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (190 kB)
Collecting gradio (from -r requirements.txt (line 5))
  Downloading gradio-3.36.1-py3-none-any.whl (19.8 MB)
Collecting transformers<4.31.0,>=3.4.0 (from spacy-transformers==1.2.5->-r requirements.txt (line 3))
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers==1.2.5->-r requirements.txt (line 3))
  Downloading spacy_alignments-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux201

### Step 3: Download and process data
We call the `download_data()` function from `bangla_person_ner/utils/downloader.py` to download the provided datasets. The data is saved as two files inside the `data_raw` directory.

In [11]:
from bangla_person_ner.utils import downloader

# download the dataset provided for the project
downloader.download_data()

Successfully downloaded data.


After downloading the data, we will clean and reformat the data. To do so, we call `process_text_data()` and `process_jsonl_data()` functions from `bangla_person_ner/preprocessing/raw_data_processing.py` for dataset_1 and dataset_2 respectively.

In [12]:
import os
import bangla_person_ner.config.config as cfg
from bangla_person_ner.preprocessing.raw_data_processing import process_text_data, process_jsonl_data

# process text data (data_1)
data_1 = process_text_data(data_path=cfg.RAW_DATA1_FILE_PATH, save_path=cfg.PROCESSESED_DATA1_PATH)

# process jsonl data (data_2)
data_2 = process_jsonl_data(data_path=cfg.RAW_DATA2_FILE_PATH, save_path=cfg.PROCESSESED_DATA2_PATH)


Data summary:  data_1
------------------------------
Total sentence : 6580
Sentence with person tag: 1776
Sentence without person tag: 4804

Data summary:  data_2
------------------------------
Total sentence : 3494
Sentence with person tag: 1189
Sentence without person tag: 2305
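
The summaries above count how many sentences contain at least one person tag. A minimal sketch of that kind of count follows. The data layout is an assumption made for illustration only: each sentence is taken to be a `(tokens, tags)` pair with tags such as `B-PER`, `I-PER`, and `O`, which may differ from what `process_text_data()` and `process_jsonl_data()` actually return. `has_person` and `summarize` are hypothetical helpers.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (tokens, tags) per sentence, not the repository's format.
Sentence = Tuple[List[str], List[str]]


def has_person(tags: List[str]) -> bool:
    """True if any tag marks a person entity (assumed suffix 'PER' or 'PERSON')."""
    return any(tag.endswith(("PER", "PERSON")) for tag in tags)


def summarize(sentences: List[Sentence]) -> Dict[str, int]:
    """Count sentences with and without a person tag."""
    with_person = sum(1 for _, tags in sentences if has_person(tags))
    return {
        "total": len(sentences),
        "with_person": with_person,
        "without_person": len(sentences) - with_person,
    }


sample = [
    (["মো.", "আলমের", "কাছ", "থেকে"], ["B-PER", "I-PER", "O", "O"]),
    (["কমিটি", "ও", "চুক্তির"], ["O", "O", "O"]),
]
print(summarize(sample))
```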


### Step 4: Split data and convert to spaCy format
Now we split the dataset into train, validation, and test sets, convert each set to spaCy's binary format, and store them. All of this is done by calling `split_and_convert_data()` from `bangla_person_ner/preprocessing/train_data_processing.py` with the processed data from the previous step. The paths for the saved data are defined in `config/config.py`: the train, validation, and test paths are the variables `TAIN_DATA_PATH`, `VALID_DATA_PATH`, and `TEST_DATA_PATH` respectively.

In [13]:
from bangla_person_ner.preprocessing.train_data_processing import split_and_convert_data

# this function accepts tuple of data_1 and data_2
tuple_data = (data_1, data_2)
split_and_convert_data(tuple_data)


Training data summary :  Data with PERSON tag
------------------------------
Number of train samples :  2372
Number of validation samples :  297
Number of test samples :  296
-----------------------------------
Number of total data :  2965

Training data summary :  Data without PERSON tag
------------------------------
Number of train samples :  5687
Number of validation samples :  711
Number of test samples :  711
-----------------------------------
Number of total data :  7109

Training data summary :  All data
------------------------------
Number of train samples :  8059
Number of validation samples :  1008
Number of test samples :  1007
-----------------------------------
Number of total data :  10074

Saving spacy binary format data...
Saved train data at :  /content/bangla_person_name_extractor/dataset/train.spacy
Saved train data at :  /content/bangla_person_name_extractor/dataset/valid.spacy
Saved train data at :  /content/bangla_person_name_extractor/dataset/test.spacy
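
The counts above are consistent with an 80/10/10 split applied separately to the sentences with and without a PERSON tag, then merged. The sketch below illustrates that idea with integer arithmetic. It is not the repository's `split_and_convert_data()` implementation: `split_80_10_10`, the seed, and the rule that validation receives any odd leftover item are assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of `items` and split it into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    # Split the remainder in half; validation gets the extra item if it is odd.
    n_valid = (len(shuffled) - n_train + 1) // 2
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Stand-in data with the same group sizes as reported above.
with_person = list(range(2965))
without_person = list(range(7109))

parts = [split_80_10_10(group) for group in (with_person, without_person)]
train, valid, test = (a + b for a, b in zip(*parts))
print(len(train), len(valid), len(test))  # 8059 1008 1007
```

Splitting each group on its own keeps the proportion of sentences with a PERSON tag roughly the same in all three sets.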


### Step 5: Generate training config of spaCy
We need to generate a config file for training the model with spaCy. The following cell runs the spaCy CLI command that generates the training configuration file.

In [2]:
!python -m spacy init config bangla_person_ner/config/spacy_config.cfg --lang bn --pipeline ner --optimize accuracy --gpu

2023-07-16 07:36:57.923817: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-16 07:37:00.226467: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-16 07:37:00.226947: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355


We will change two parameters in the generated config file:
1. By default, spaCy sets the transformer model to `sagorsarker/bangla-bert-base`. We will change it to `csebuetnlp/banglabert`.
2. We will set `max_epochs` to 50.
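
These two edits can be made by hand in a text editor, or scripted. The sketch below is one possible way to script them, assuming the generated file contains a `[components.transformer.model]` section with a `name` setting and a `[training]` section with a `max_epochs` setting. `patch_spacy_config` is a hypothetical helper and the `sample` text is a shortened stand-in, not the full generated file.

```python
def patch_spacy_config(text: str, model_name: str, max_epochs: int) -> str:
    """Return config text with the transformer name and max_epochs replaced."""
    section = ""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track which section the following settings belong to.
            section = stripped[1:-1]
        elif section == "components.transformer.model" and stripped.startswith("name ="):
            line = f'name = "{model_name}"'
        elif section == "training" and stripped.startswith("max_epochs ="):
            line = f"max_epochs = {max_epochs}"
        out.append(line)
    return "\n".join(out) + "\n"


# Shortened stand-in for the generated config (assumed layout).
sample = """[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "sagorsarker/bangla-bert-base"

[training]
max_epochs = 0
"""

patched = patch_spacy_config(sample, "csebuetnlp/banglabert", 50)
print(patched)
```

To apply it to the real file, read `bangla_person_ner/config/spacy_config.cfg`, pass its text through the function, and write the result back. Only the two matching lines change, so the rest of the file is left as generated.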

### Step 6: Train the model
Now we train the model by running the following cell, which contains the spaCy CLI command for training.

In [3]:
!python -m spacy train bangla_person_ner/config/spacy_config.cfg --output bangla_person_ner/models --gpu-id 0 --paths.train bangla_person_ner/dataset/train.spacy --paths.dev bangla_person_ner/dataset/valid.spacy

2023-07-16 07:38:51.496357: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-16 07:38:53.857585: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-16 07:38:53.858041: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355


### Step 7: Evaluate the model
In this step we evaluate the trained model using the spaCy CLI. spaCy saves two models, `model-best` and `model-last`. We use `model-best` for evaluation and further usage.

In [6]:
!python -m spacy benchmark accuracy bangla_person_ner/models/model-best test.spacy --gpu-id 0

2023-07-16 08:29:35.776537: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-16 08:29:38.110313: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-16 08:29:38.110754: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
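
For NER, spaCy's evaluation reports entity-level precision, recall, and F-score (`ents_p`, `ents_r`, `ents_f`). The sketch below shows how such scores are computed from exact-match spans. It is an illustration of the metric, not spaCy's scorer: `EntitySpan` and `prf` are hypothetical names, and the gold and predicted spans are made-up toy values.

```python
from typing import Dict, Set, Tuple

# (start, end, label); a hypothetical representation for illustration.
EntitySpan = Tuple[int, int, str]


def prf(gold: Set[EntitySpan], pred: Set[EntitySpan]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {(0, 2, "PERSON"), (5, 7, "PERSON")}
pred = {(0, 2, "PERSON"), (3, 4, "PERSON")}
print(prf(gold, pred))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

A predicted span counts as correct only if its start, end, and label all match a gold span, so a name that is found but with the wrong boundaries lowers both precision and recall.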


### Step 8: Make predictions
We import `BanglaPersorNer` from `bangla_person_ner/bangla_person_ner.py` to make predictions. This class contains the code needed to extract Bangla person names.

In [14]:
from bangla_person_ner.bangla_person_ner import BanglaPersorNer

We define a `texts` variable with four Bangla texts. Two of them contain person names and the other two do not. We create a `BanglaPersorNer` object, then for each item of `texts` we call its `extract_person_name()` method with the item and print the returned value.

In [17]:
texts = [
    "মো. আলমের কাছ থেকে ১৫ লাখ টাকা আদায় করা হয়।",
    "ডা. মো. শরিফুল ইসলাম, শহীদ সোহরাওয়ার্দী মেডিকেল, কলেজ ও হাসপাতাল।",
    "এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।",
    "এতিমখানার কর্মকর্তা-শিক্ষার্থীরা কমিটি ও চুক্তির বিরুদ্ধে আন্দোলন শুরু করে।",
]
bpne = BanglaPersorNer()
for text in texts:
    res = bpne.extract_person_name(text)
    print(res)

{'sentence': 'মো. আলমের কাছ থেকে ১৫ লাখ টাকা আদায় করা হয়।', 'extracted_names': [{'name': 'মো. আলমের', 'label': 'PERSON', 'start': 0, 'end': 2}]}
{'sentence': 'ডা. মো. শরিফুল ইসলাম, শহীদ সোহরাওয়ার্দী মেডিকেল, কলেজ ও হাসপাতাল।', 'extracted_names': [{'name': 'ডা.', 'label': 'PERSON', 'start': 0, 'end': 1}, {'name': 'মো. শরিফুল ইসলাম', 'label': 'PERSON', 'start': 1, 'end': 4}]}
{'sentence': "এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।", 'extracted_names': 'কোন নাম খুঁজে পাওয়া যায় নি/No name is found'}
{'sentence': 'এতিমখানার কর্মকর্তা-শিক্ষার্থীরা কমিটি ও চুক্তির বিরুদ্ধে আন্দোলন শুরু করে।', 'extracted_names': 'কোন নাম খুঁজে পাওয়া যায় নি/No name is found'}
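
In the output above, `extracted_names` is a list of dictionaries when names are found and a plain message string when none are found, so downstream code has to handle both shapes. A small sketch of one way to do that follows. `names_from_result` is a hypothetical helper, not part of the repository, and it assumes the result always has the shape shown in the output.

```python
from typing import Any, Dict, List


def names_from_result(result: Dict[str, Any]) -> List[str]:
    """Return the extracted names, or an empty list when none were found."""
    extracted = result.get("extracted_names")
    if isinstance(extracted, list):
        return [item["name"] for item in extracted]
    # A string here is the "no name is found" message.
    return []


found = {
    "sentence": "ডা. মো. শরিফুল ইসলাম, শহীদ সোহরাওয়ার্দী মেডিকেল, কলেজ ও হাসপাতাল।",
    "extracted_names": [
        {"name": "ডা.", "label": "PERSON", "start": 0, "end": 1},
        {"name": "মো. শরিফুল ইসলাম", "label": "PERSON", "start": 1, "end": 4},
    ],
}
not_found = {
    "sentence": "এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।",
    "extracted_names": "কোন নাম খুঁজে পাওয়া যায় নি/No name is found",
}

print(names_from_result(found))      # ['ডা.', 'মো. শরিফুল ইসলাম']
print(names_from_result(not_found))  # []
```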
