# Training Bangla Name Extractor Model

### Step 1: Clone git repository
Clone the [bangla_person_name_extractor](https://github.com/ibrahim-601/bangla_person_name_extractor) repository and change into the repository folder. We clone first because the model is trained on Colab. If you have already cloned the repository, skip this step.

In [4]:
!git clone https://github.com/ibrahim-601/bangla_person_name_extractor.git

Cloning into 'bangla_person_name_extractor'...
remote: Enumerating objects: 73, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 73 (delta 27), reused 64 (delta 18), pack-reused 0
Unpacking objects: 100% (73/73), 28.92 KiB | 1.45 MiB/s, done.


Change into the cloned directory. If you opened this notebook from inside the cloned repository, skip this cell. Run it only if your working directory is one level above the cloned directory.

In [5]:
%cd bangla_person_name_extractor

/content/bangla_person_name_extractor/bangla_person_name_extractor


### Step 2: Environment setup
Install the required packages with pip by running the cell below.

In [2]:
!pip install -r requirements.txt

Collecting spacy-transformers (from -r requirements.txt (line 3))
  Downloading spacy_transformers-1.2.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.8/190.8 kB 9.6 MB/s eta 0:00:00
Collecting transformers<4.31.0,>=3.4.0 (from spacy-transformers->-r requirements.txt (line 3))
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 66.2 MB/s eta 0:00:00
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers->-r requirements.txt (line 3))
  Downloading spacy_alignments-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 61.1 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers<4.31.0,>=3.4.0->spacy-transformers->-r requirements

### Step 3: Download and process data
We call the `download_data()` function from `utils/downloader.py` to download the provided datasets. The data is downloaded as two files inside the `data_raw` directory.

In [3]:
from utils import downloader

# download the dataset provided for the project
downloader.download_data()

Successfully downloaded data.


After downloading the data, we clean and reformat it. To do so, we call `process_text_data()` and `process_jsonl_data()` from `preprocessing/raw_data_processing.py` for dataset_1 and dataset_2 respectively.

In [4]:
import os
import config.config as cfg
from preprocessing.raw_data_processing import process_text_data, process_jsonl_data

# process text data (data_1)
data_1_path = os.path.join(cfg.RAW_DATA_DOWNLOAD_DIR, cfg.RAW_DATA_1_FILE_NAME)
save_data_1_path = os.path.join(cfg.PROCESSESED_DATA_SAVE_DIR, cfg.PROCESSESED_DATA_1_NAME)
data_1 = process_text_data(data_path=data_1_path, save_path=save_data_1_path)

# process jsonl data (data_2)
data_2_path_ = os.path.join(cfg.RAW_DATA_DOWNLOAD_DIR, cfg.RAW_DATA_2_FILE_NAME)
save_data_2_path = os.path.join(cfg.PROCESSESED_DATA_SAVE_DIR, cfg.PROCESSESED_DATA_2_NAME)
data_2 = process_jsonl_data(data_path=data_2_path_, save_path=save_data_2_path)


Data summary:  data_1
------------------------------
Total sentence : 6580
Sentence with person tag: 1776
Sentence without person tag: 4804

Data summary:  data_2
------------------------------
Total sentence : 3494
Sentence with person tag: 1189
Sentence without person tag: 2305
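
The summary above counts sentences with and without a person tag. As a rough illustration of that counting step, the sketch below assumes a hypothetical record layout, a `(sentence, entities)` tuple where `entities` is a list of `(start, end, label)` spans. Both this layout and the `summarize` helper are assumptions for illustration, not the repository's actual data format or API.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (sentence, [(start, end, label), ...]).
Record = Tuple[str, List[Tuple[int, int, str]]]


def summarize(records: List[Record], label: str = "PER") -> Dict[str, int]:
    """Count sentences with and without at least one span carrying `label`."""
    with_tag = sum(
        1 for _, entities in records
        if any(span_label == label for _, _, span_label in entities)
    )
    return {
        "total": len(records),
        "with_person": with_tag,
        "without_person": len(records) - with_tag,
    }


# Two toy records: one with a person span, one without.
sample = [
    ("মো. আলমের কাছ থেকে ১৫ লাখ টাকা আদায় করা হয়।", [(0, 9, "PER")]),
    ("এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।", []),
]
print(summarize(sample))
```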


### Step 4: Split data and convert to spaCy format
Now we split the dataset into train, validation, and test sets, convert them to spaCy's binary format, and store them. All of this is done by calling `split_and_convert_data()` from `preprocessing/train_data_processing.py` with the processed data from the previous step. The save paths are defined in `config/config.py` as the variables `TAIN_DATA_PATH`, `VALID_DATA_PATH`, and `TEST_DATA_PATH` for train, validation, and test respectively.

In [5]:
from preprocessing.train_data_processing import split_and_convert_data

# this function accepts tuple of data_1 and data_2
tuple_data = (data_1, data_2)
split_and_convert_data(tuple_data)


Training data summary :  Data with PERSON tag
------------------------------
Number of train samples :  2372
Number of validation samples :  297
Number of test samples :  296
-----------------------------------
Number of total data :  2965

Training data summary :  Data without PERSON tag
------------------------------
Number of train samples :  5687
Number of validation samples :  711
Number of test samples :  711
-----------------------------------
Number of total data :  7109

Training data summary :  All data
------------------------------
Number of train samples :  8059
Number of validation samples :  1008
Number of test samples :  1007
-----------------------------------
Number of total data :  10074
Saving spacy binary format data...
Saved train data at :  /content/bangla_person_name_extractor/dataset/train.spacy
Saved train data at :  /content/bangla_person_name_extractor/dataset/valid.spacy
Saved train data at :  /content/bangla_person_name_extractor/dataset/test.spacy
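
The counts above are consistent with an 80/10/10 split applied separately to the sentences with and without a PERSON tag and then combined. The repository's actual splitting logic is not shown here, so the following is only a sketch under that assumption. The helper names, the fixed seed, and the rounding rule (any odd leftover sample goes to validation) are hypothetical choices, and the conversion to spaCy's binary format is left out.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_sizes(n: int) -> Tuple[int, int, int]:
    """Return (train, valid, test) sizes for an assumed 80/10/10 split."""
    n_train = n * 8 // 10          # integer arithmetic avoids float rounding
    rest = n - n_train
    n_valid = (rest + 1) // 2      # assumed rule: odd leftover goes to validation
    return n_train, n_valid, rest - n_valid


def split(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of `items` and cut it into train, valid, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, n_valid, _ = split_sizes(len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Split each group on its own, then combine, so both groups keep the same ratio.
with_person = list(range(2965))      # placeholder items, not real sentences
without_person = list(range(7109))
parts_a = split(with_person)
parts_b = split(without_person)
train, valid, test = (a + b for a, b in zip(parts_a, parts_b))
print(len(train), len(valid), len(test))
```

Splitting the two groups separately keeps the share of sentences with a person tag roughly equal across the three sets.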


### Step 5: Generate the spaCy training config
We need to generate a config file for training the model with spaCy. The following cell runs the spaCy CLI command that generates the training configuration file.

In [None]:
!python -m spacy init config config/spacy_config.cfg --lang bn --pipeline ner --optimize accuracy --gpu

ℹ Generated config template specific for your use case
- Language: bn
- Pipeline: ner
- Optimize for: accuracy
- Hardware: GPU
- Transformer: sagorsarker/bangla-bert-base
✔ Auto-filled config with all values
✔ Saved config
config/spacy_config.cfg
You can now add your data and train your pipeline:
python -m spacy train spacy_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


We change two parameters in the generated config file:
1. By default, spaCy sets the transformer model to `sagorsarker/bangla-bert-base`. We change it to `csebuetnlp/banglabert`.
2. We set `max_epochs` to 50.
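
These two changes can be made by hand in a text editor. For a scripted alternative, the sketch below patches the config text with plain string and regex replacement. The `patch_config` helper is hypothetical, and the config fragment is an abbreviated illustration rather than the full generated file, so the key layout shown is an assumption to check against your own `config/spacy_config.cfg`.

```python
import re


def patch_config(
    text: str,
    old_model: str = "sagorsarker/bangla-bert-base",
    new_model: str = "csebuetnlp/banglabert",
    max_epochs: int = 50,
) -> str:
    """Swap the transformer name and set max_epochs in spaCy config text."""
    # Replace only the exact old model name, so unrelated keys are untouched.
    text = text.replace(f'name = "{old_model}"', f'name = "{new_model}"')
    # Rewrite the integer value of the `max_epochs = <int>` line.
    return re.sub(
        r"^(max_epochs\s*=\s*)\d+",
        lambda m: f"{m.group(1)}{max_epochs}",
        text,
        flags=re.MULTILINE,
    )


# Abbreviated, illustrative fragment (assumed layout, not the full file).
fragment = """[components.transformer.model]
name = "sagorsarker/bangla-bert-base"

[training]
max_epochs = 0
"""
patched = patch_config(fragment)
print(patched)

# To apply it to the real file, read it, patch it, and write it back:
# from pathlib import Path
# path = Path("config/spacy_config.cfg")
# path.write_text(patch_config(path.read_text(encoding="utf-8")), encoding="utf-8")
```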

### Step 6: Train the model
Now we train the model by running the following cell, which contains the spaCy CLI training command.

In [6]:
!python -m spacy train config/spacy_config.cfg --output models --gpu-id 0 --paths.train dataset/train.spacy --paths.dev dataset/valid.spacy

✔ Created output directory: models
ℹ Saving to output directory: models
ℹ Using GPU: 0
[2023-07-15 09:59:32,121] [INFO] Set up nlp object from config
[2023-07-15 09:59:32,150] [INFO] Pipeline: ['transformer', 'ner']
[2023-07-15 09:59:32,156] [INFO] Created vocabulary
[2023-07-15 09:59:32,156] [INFO] Finished initializing nlp object
Downloading (…)okenizer_config.json: 100% 119/119 [00:00<00:00, 721kB/s]
Downloading (…)lve/main/config.json: 100% 586/586 [00:00<00:00, 3.82MB/s]
Downloading (…)solve/main/vocab.txt: 100% 528k/528k [00:00<00:00, 3.23MB/s]
Downloading (…)cial_tokens_map.json: 100% 112/112 [00:00<00:00, 748kB/s]
Downloading pytorch_model.bin: 100% 443M/443M [00:01<00:00, 288MB/s]
Some weights of the model checkpoint at csebuetnlp/banglabert were not used when initializing ElectraModel: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_

### Step 7: Evaluate the model
In this step we evaluate the trained model using the spaCy CLI. spaCy saves two models, `model-best` and `model-last`. We use `model-best` for evaluation and further use.

In [7]:
!python -m spacy benchmark accuracy models/model-best dataset/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK     -    
NER P   80.40
NER R   84.15
NER F   82.23
SPEED   3140

          P       R       F
PER   80.40   84.15   82.23
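
The F column is the harmonic mean of precision (P) and recall (R). As a quick sanity check on the reported numbers, the sketch below recomputes it from the P and R values above. The `f_score` helper is written here for illustration and is not part of spaCy or the repository.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the evaluation output above.
print(round(f_score(80.40, 84.15), 2))  # 82.23
```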



### Step 8: Make predictions
The `extract_person_name()` function in `bangla_person_name_extractor.py` makes predictions. We import it in the next cell.

In [2]:
from bangla_person_name_extractor import extract_person_name

We define a `texts` variable with 4 Bangla texts. Two of them contain a person name and the other two do not. We iterate over `texts`, call `extract_person_name()` on each item, and print the returned value.

In [3]:
texts = [
    "এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।",
    "মো. আলমের কাছ থেকে ১৫ লাখ টাকা আদায় করা হয়।",
    "এতিমখানার কর্মকর্তা-শিক্ষার্থীরা কমিটি ও চুক্তির বিরুদ্ধে আন্দোলন শুরু করে।",
    "ডা. মো. শরিফুল ইসলাম, শহীদ সোহরাওয়ার্দী মেডিকেল, কলেজ ও হাসপাতাল।"
]

for text in texts:
    res = extract_person_name(text)
    print(res)

{'sentence': "এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।", 'extracted_names': 'কোন নাম খুঁজে পাওয়া যায় নি/No name is found'}
{'sentence': 'মো. আলমের কাছ থেকে ১৫ লাখ টাকা আদায় করা হয়।', 'extracted_names': [{'name': 'মো. আলমের', 'label': 'PER', 'start': 0, 'end': 2}]}
{'sentence': 'এতিমখানার কর্মকর্তা-শিক্ষার্থীরা কমিটি ও চুক্তির বিরুদ্ধে আন্দোলন শুরু করে।', 'extracted_names': 'কোন নাম খুঁজে পাওয়া যায় নি/No name is found'}
{'sentence': 'ডা. মো. শরিফুল ইসলাম, শহীদ সোহরাওয়ার্দী মেডিকেল, কলেজ ও হাসপাতাল।', 'extracted_names': [{'name': 'ডা. মো.', 'label': 'PER', 'start': 0, 'end': 2}, {'name': 'শরিফুল ইসলাম', 'label': 'PER', 'start': 2, 'end': 4}]}
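
In the output above, `extracted_names` is a message string when no name is found and a list of dictionaries otherwise, so downstream code has to handle both shapes. The sketch below shows one way to reduce a result to a plain list of names. The `names_only` helper is hypothetical and relies only on the result structure visible in the output above.

```python
from typing import Any, Dict, List


def names_only(result: Dict[str, Any]) -> List[str]:
    """Return the extracted name strings, or an empty list if none were found."""
    extracted = result.get("extracted_names")
    if isinstance(extracted, list):
        return [item["name"] for item in extracted]
    # A string here is the "no name is found" message.
    return []


# Results copied from the output above.
found = {
    "sentence": "মো. আলমের কাছ থেকে ১৫ লাখ টাকা আদায় করা হয়।",
    "extracted_names": [{"name": "মো. আলমের", "label": "PER", "start": 0, "end": 2}],
}
not_found = {
    "sentence": "এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।",
    "extracted_names": "কোন নাম খুঁজে পাওয়া যায় নি/No name is found",
}
print(names_only(found))
print(names_only(not_found))
```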
