# Training Bangla Name Extractor Model

### Step 1: Clone git repository
Clone the [bangla_person_name_extractor](https://github.com/ibrahim-601/bangla_person_name_extractor) repository and move into the repository folder. Cloning comes first because this notebook trains the model on Colab. If you have already cloned the repository, skip this step.

In [None]:
!git clone https://github.com/ibrahim-601/bangla_person_name_extractor.git

Change into the cloned directory. Skip this cell if you opened the notebook from inside the cloned repository. Run it if your working directory is one level above the cloned directory.

In [1]:
%cd bangla_person_name_extractor

/content/bangla_person_name_extractor


### Step 2: Environment setup
Install the required packages with pip by running the cell below.

In [2]:
!pip install -r requirements.txt

### Step 3: Download and process data
Call the `download_data()` function from `utils/downloader.py` to download the provided datasets. The data is saved as two files inside the `data_raw` directory.

In [3]:
from utils import downloader

# download the dataset provided for the project
downloader.download_data()

Successfully downloaded data.


After downloading the data, we clean and reformat it by calling `process_text_data()` for dataset_1 and `process_jsonl_data()` for dataset_2. Both functions live in `preprocessing/raw_data_processing.py`.

In [9]:
import os
import config.config as cfg
from preprocessing.raw_data_processing import process_text_data, process_jsonl_data

# process text data (data_1)
data_1_path = os.path.join(cfg.RAW_DATA_DOWNLOAD_DIR, cfg.RAW_DATA_1_FILE_NAME)
save_data_1_path = os.path.join(cfg.PROCESSESED_DATA_SAVE_DIR, cfg.PROCESSESED_DATA_1_NAME)
data_1 = process_text_data(data_path=data_1_path, save_path=save_data_1_path)

# process jsonl data (data_2)
data_2_path_ = os.path.join(cfg.RAW_DATA_DOWNLOAD_DIR, cfg.RAW_DATA_2_FILE_NAME)
save_data_2_path = os.path.join(cfg.PROCESSESED_DATA_SAVE_DIR, cfg.PROCESSESED_DATA_2_NAME)
data_2 = process_jsonl_data(data_path=data_2_path_, save_path=save_data_2_path)


Data summary:  data_1
------------------------------
Total sentence : 6580
Sentence with person tag: 1776
Sentence without person tag: 4804

Data summary:  data_2
------------------------------
Total sentence : 3494
Sentence with person tag: 1189
Sentence without person tag: 2305


### Step 4: Split data and convert to spaCy format
Now we split the dataset into train, validation, and test sets, convert each set to spaCy binary format, and store the results. All of this is done by calling `split_and_convert_data()` from `preprocessing/train_data_processing.py` with the processed data from the previous step. The paths of the saved files are defined in `config/config.py` as the variables `TAIN_DATA_PATH`, `VALID_DATA_PATH`, and `TEST_DATA_PATH` for train, validation, and test respectively.

In [5]:
from preprocessing.train_data_processing import split_and_convert_data

# this function accepts tuple of data_1 and data_2
tuple_data = (data_1, data_2)
split_and_convert_data(tuple_data)

Saving spacy binary format data...

Training data summary :  Data with PERSON tag
------------------------------
Number of train samples :  2372
Number of validation samples :  296
Number of test samples :  297
-----------------------------------
Number of total data :  2965

Training data summary :  Data without PERSON tag
------------------------------
Number of train samples :  5687
Number of validation samples :  711
Number of test samples :  711
-----------------------------------
Number of total data :  7109

Training data summary :  All data
------------------------------
Number of train samples :  8059
Number of validation samples :  1007
Number of test samples :  1008
-----------------------------------
Number of total data :  10074
Saved train data at :  dataset/train.spacy
Saved train data at :  dataset/valid.spacy
Saved train data at :  dataset/test.spacy
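
The counts printed above are consistent with holding out 20% of each group (rounded up) and halving that holdout into validation and test sets, applied separately to sentences with and without a PERSON tag. This is an inference from the numbers, not a confirmed description of `split_and_convert_data()`. A minimal sketch under that assumption, where `split_counts` and `split_group` are hypothetical helpers that do not exist in the repository:

```python
import random


def split_counts(n):
    """Return (train, valid, test) sizes for n samples.

    Assumed rule: hold out ceil(n / 5) samples, then give the
    test set the larger half of the holdout.
    """
    holdout = (n + 4) // 5        # integer form of ceil(n / 5)
    n_test = (holdout + 1) // 2   # integer form of ceil(holdout / 2)
    n_valid = holdout - n_test
    n_train = n - holdout
    return n_train, n_valid, n_test


def split_group(samples, seed=0):
    """Shuffle one group and slice it into train, valid, and test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train, n_valid, _ = split_counts(len(items))
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


# Group sizes taken from the summary printed above.
for label, size in [("with PERSON tag", 2965), ("without PERSON tag", 7109)]:
    print(label, split_counts(size))
```

Splitting each group separately keeps the share of sentences with a person name roughly equal across the three sets.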


### Step 5: Generate the spaCy training config
We need a config file to train the model with spaCy. The following cell generates it with the spaCy CLI.

In [None]:
!python -m spacy init config config/spacy_config.cfg --lang bn --pipeline ner --optimize accuracy --gpu

ℹ Generated config template specific for your use case
- Language: bn
- Pipeline: ner
- Optimize for: accuracy
- Hardware: GPU
- Transformer: sagorsarker/bangla-bert-base
✔ Auto-filled config with all values
✔ Saved config
config/spacy_config.cfg
You can now add your data and train your pipeline:
python -m spacy train spacy_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Before training, we change two parameters in the generated config file:
1. By default, spaCy sets the transformer model to `sagorsarker/bangla-bert-base`. We change it to `csebuetnlp/banglabert`.
2. We set `max_epochs` to 50.
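
The two changes above can be made by hand in a text editor. As an alternative, here is a minimal sketch that rewrites the values in the config text. The helper `set_config_value` is hypothetical, and the section names `[components.transformer.model]` and `[training]` are assumptions about the generated file, so check them against your own `config/spacy_config.cfg` before relying on this.

```python
import re


def set_config_value(cfg_text, section, key, value):
    """Replace `key = ...` inside `[section]` of a config string.

    Raises KeyError when the key is not found in that section.
    """
    lines = cfg_text.splitlines()
    current_section = None
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            current_section = stripped[1:-1]
        elif current_section == section and re.match(
            rf"{re.escape(key)}\s*=", stripped
        ):
            lines[i] = f"{key} = {value}"
            found = True
    if not found:
        raise KeyError(f"{key!r} not found in section [{section}]")
    return "\n".join(lines) + "\n"


# A shortened stand-in for the generated config (assumed layout).
sample_cfg = """[components.transformer.model]
name = "sagorsarker/bangla-bert-base"

[training]
max_epochs = 0
"""

updated_cfg = set_config_value(
    sample_cfg, "components.transformer.model", "name", '"csebuetnlp/banglabert"'
)
updated_cfg = set_config_value(updated_cfg, "training", "max_epochs", "50")
print(updated_cfg)
```

To apply this to the real file, read `config/spacy_config.cfg` into a string, pass it through the helper, and write the result back.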

### Step 6: Train the model
Now we train the model by running the following cell, which contains the spaCy CLI training command.

In [8]:
!python -m spacy train config/spacy_config.cfg --output models --gpu-id 0 --paths.train dataset/train.spacy --paths.dev dataset/valid.spacy

✔ Created output directory: models
ℹ Saving to output directory: models
ℹ Using GPU: 0
[2023-07-14 19:52:14,334] [INFO] Set up nlp object from config
[2023-07-14 19:52:14,364] [INFO] Pipeline: ['transformer', 'ner']
[2023-07-14 19:52:14,370] [INFO] Created vocabulary
[2023-07-14 19:52:14,370] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at csebuetnlp/banglabert were not used when initializing ElectraModel: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that 

### Step 7: Evaluating model
In this step we evaluate the trained model with the spaCy CLI. spaCy saves two models, `model-best` and `model-last`. We use `model-best` for evaluation and for all later steps.

In [10]:
!python -m spacy benchmark accuracy models/model-best dataset/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK     -    
NER P   80.36
NER R   80.56
NER F   80.46
SPEED   3255 


          P       R       F
PER   80.36   80.56   80.46
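
As a quick sanity check on the table above, the F-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded P and R values. `f_score` is a helper written for this check, not part of spaCy or the repository.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported by the benchmark above.
print(round(f_score(80.36, 80.56), 2))
```

Because the inputs are already rounded to two decimals, the result can differ from the reported value in the last digit in general. Here it agrees with the reported 80.46.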



Rename `model-best` to `bn_person_ner`, because that is the model path expected in `config.py`. The prediction code reads the same path from `config.py`.

In [11]:
!mv /content/bangla_person_name_extractor/models/model-best models/bn_person_ner

### Step 8: Make predictions
The `extract_person_name()` function in `bangla_person_name_extractor.py` makes predictions. We import the module in the next cell.

In [12]:
import bangla_person_name_extractor as bpne

We define a `texts` variable with four Bangla texts. Two of them contain a person name and the other two do not. We iterate over `texts`, call `extract_person_name()` on each item, and print the returned value.

In [18]:
texts = [
    "এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।",
    "মো. আলমের কাছ থেকে ১৫ লাখ টাকা আদায় করা হয়।",
    "এতিমখানার কর্মকর্তা-শিক্ষার্থীরা কমিটি ও চুক্তির বিরুদ্ধে আন্দোলন শুরু করে।",
    "ডা. মো. শরিফুল ইসলাম, শহীদ সোহরাওয়ার্দী মেডিকেল, কলেজ ও হাসপাতাল।"
]

for text in texts:
  res = bpne.extract_person_name(text)
  print(res)

{'sentence': "এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।", 'extracted_names': 'কোন নাম খুঁজে পাওয়া যায় নি/No name is found'}
{'sentence': 'মো. আলমের কাছ থেকে ১৫ লাখ টাকা আদায় করা হয়।', 'extracted_names': [{'name': 'মো. আলমের', 'label': 'PER', 'start': 0, 'end': 2}]}
{'sentence': 'এতিমখানার কর্মকর্তা-শিক্ষার্থীরা কমিটি ও চুক্তির বিরুদ্ধে আন্দোলন শুরু করে।', 'extracted_names': 'কোন নাম খুঁজে পাওয়া যায় নি/No name is found'}
{'sentence': 'ডা. মো. শরিফুল ইসলাম, শহীদ সোহরাওয়ার্দী মেডিকেল, কলেজ ও হাসপাতাল।', 'extracted_names': [{'name': 'ডা. মো. শরিফুল ইসলাম', 'label': 'PER', 'start': 0, 'end': 4}]}
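
In the output above, `extracted_names` is a list of dictionaries when names are found and a plain message string otherwise, so calling code has to handle both shapes. The sketch below shows one way to normalize the result into a list of name strings. `names_from_result` is a hypothetical helper, and the result shape is assumed from the printed output only.

```python
def names_from_result(result):
    """Return the extracted names as a list of strings.

    Returns an empty list when `extracted_names` is not a list,
    which covers the "no name found" message string.
    """
    extracted = result.get("extracted_names")
    if isinstance(extracted, list):
        return [item["name"] for item in extracted]
    return []


# Two results copied from the output above.
found = {
    "sentence": "মো. আলমের কাছ থেকে ১৫ লাখ টাকা আদায় করা হয়।",
    "extracted_names": [
        {"name": "মো. আলমের", "label": "PER", "start": 0, "end": 2}
    ],
}
not_found = {
    "sentence": "এ ট্যাবলেটটির নাম হতে পারে 'আইপ্যাড ম্যাক্সি'।",
    "extracted_names": "কোন নাম খুঁজে পাওয়া যায় নি/No name is found",
}

print(names_from_result(found))
print(names_from_result(not_found))
```

The `start` and `end` values in the printed results look like token positions rather than character offsets, since a two-token name spans 0 to 2. Treat that as an observation from this output, not a documented guarantee.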
