
menon92/BanglaNER

BanglaNER

Bangla Named Entity Recognition (NER) using spaCy. The model extracts named entities from Bangla input sentences. The experiments use a single entity label, PER (person).

Five different experiments were run on this data, and the transformer-based model has performed better than the other models so far. Please check the experimental details and F1 scores in the experiment history. The best F1 score is about 0.80.

Bangla NER data is collected from,

Dependency

conda install spacy=3.1
pip install spacy-transformers # needed only if you want to use the transformer pipeline

NOTE: If you just want to test the NER model, please,

Data preparation

  1. Clean the IOB data and remove entries that are in the wrong IOB format
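The cleaning step can be sketched as follows. This is a minimal illustration, not the repository's actual cleaning code: it assumes the stricter convention in which every entity starts with a B- tag, and the helper names `is_valid_iob` and `clean_iob` are hypothetical.

```python
def is_valid_iob(tags):
    """Return True if a sequence of IOB tags is well formed.

    Assumed rule: an I-X tag must directly follow a B-X or I-X tag
    with the same entity type X.
    """
    prev_type = None  # entity type of the previous tag, None when outside
    for tag in tags:
        if tag == "O":
            prev_type = None
        elif tag.startswith("B-") and len(tag) > 2:
            prev_type = tag[2:]
        elif tag.startswith("I-") and len(tag) > 2:
            if prev_type != tag[2:]:
                return False  # I- tag without a matching B-/I- before it
        else:
            return False  # unknown tag such as "PER" or "B-"
    return True


def clean_iob(sentences):
    """Keep only the (tokens, tags) pairs whose tags are valid IOB."""
    return [
        (tokens, tags)
        for tokens, tags in sentences
        if len(tokens) == len(tags) and is_valid_iob(tags)
    ]
```

Sentences with a token/tag length mismatch are dropped as well, since they cannot be aligned later.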

Data conversion command (Optional)

  1. IOB to .spacy data format in spaCy 3.x
  2. Convert BLIOU json format to .spacy data format
python -m spacy convert train.json ./data
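To make the conversion more concrete, here is a small sketch of how IOB-tagged tokens map to character-offset entity spans, which is the form spaCy's `Doc.char_span` works with. It assumes tokens are joined by single spaces and that the tags are already valid IOB; `iob_to_spans` is a hypothetical helper, not part of this repository or of spaCy.

```python
def iob_to_spans(tokens, tags):
    """Join tokens with single spaces and return (text, spans).

    Each span is a (start_char, end_char, label) tuple.
    Assumes the tags are already valid IOB.
    """
    text_parts, spans = [], []
    offset = 0
    current = None  # [start, end, label] of the entity being built
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current:
            current[1] = end  # extend the open entity to this token
        else:
            if current:
                spans.append(tuple(current))
            current = None
        text_parts.append(token)
        offset = end + 1  # +1 for the joining space
    if current:
        spans.append(tuple(current))
    return " ".join(text_parts), spans


text, spans = iob_to_spans(["Kabir", "paid"], ["B-PER", "O"])
print(text, spans)  # Kabir paid [(0, 5, 'PER')]
```

The `spacy convert` command handles this bookkeeping for you; the sketch only shows what the offsets mean.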

Automate data preparation

  1. To automate data preparation, just run,
python utils/convert_to_spacy_json_format.py

This script will generate data/train.json and data/val.json

  2. Convert json data to .spacy data
python -m spacy convert data/train.json ./data
python -m spacy convert data/val.json ./data

# Outputs
✔ Generated output file (8986 documents): data/train.spacy
✔ Generated output file (999 documents): data/val.spacy

The two commands above will generate data/train.spacy and data/val.spacy

Training & inference of the spaCy transition-based model

Training

Create base config file

Go to the link, create a base config file, and save it under ./configs/base_config.cfg

The files required for the NER task are already in,

configs/
├── base_config.cfg     # base NER configuration file downloaded from the spaCy website
├── config.cfg          # used to train the NER pipeline
└── config_pretrain.cfg # used to train only tok2vec separately

Prepare configuration files

Now fill in ./configs/base_config.cfg to produce the full config file ./configs/config.cfg

python -m spacy init fill-config configs/base_config.cfg configs/config.cfg

Start training

python -m spacy train configs/config.cfg \
    --output ./models \
    --paths.train ./data/train.spacy \
    --paths.dev ./data/val.spacy

You should get an F1 score of around 0.66 on the validation data
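For readers unfamiliar with the metric, the following sketch shows how an entity-level F1 score can be computed. It assumes an exact-match definition, where a predicted span counts only if its boundaries and label both match a gold span; `entity_prf` and the sample spans are hypothetical and are not taken from this repository's evaluation.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 over sets of spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)  # spans that match exactly
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Spans are (sentence_id, start_char, end_char, label); values are made up.
gold = {(0, 0, 11, "PER"), (1, 20, 24, "PER"), (2, 37, 49, "PER")}
pred = {(0, 0, 11, "PER"), (1, 20, 24, "PER"), (3, 3, 7, "PER")}
p, r, f1 = entity_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.67 R=0.67 F1=0.67
```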

Inference

For inference, please run,

python test.py

You can also use an already trained model in test.py. Please download the pretrained model from Google Drive (4.4MB) and set the model path in the test.py file

Training and inference of the spaCy transformer pipeline

Training the spaCy transformer model requires a GPU. Please check,

Transformer training and inference guide

You should get an F1 score of around 0.80 on the validation data

Transformer-based model sample output

If you want to use an already trained model, please download the pretrained model from Google Drive (622.8MB) and set the model path in the test.py file

import spacy

nlp = spacy.load("./models_multilingual_bert/model-best")

text_list = [
    "আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম",
    "১০০ টাকা জমা দিয়েছেন কবির",
    "ডিপিডিসির স্পেশাল টাস্কফোর্সের প্রধান মুনীর চৌধুরী জানান",
    "অগ্রণী ব্যাংকের জ্যেষ্ঠ কর্মকর্তা পদে নিয়োগ পরীক্ষার প্রশ্নপত্র ফাঁসের অভিযোগ উঠেছে।",
    "সে আজকে ঢাকা যাবে",
]
for text in text_list:
    doc = nlp(text)

    print(f"Input: {text}")
    for entity in doc.ents:
        print(f"Entity: {entity.text}, Label: {entity.label_}")
    print("---")

# Outputs
    Input: আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম
    Entity: আব্দুর রহিম, Label: PER
    ---
    Input: ১০০ টাকা জমা দিয়েছেন কবির
    Entity: কবির, Label: PER
    ---
    Input: ডিপিডিসির স্পেশাল টাস্কফোর্সের প্রধান মুনীর চৌধুরী জানান
    Entity: মুনীর চৌধুরী, Label: PER
    ---
    Input: অগ্রণী ব্যাংকের জ্যেষ্ঠ কর্মকর্তা পদে নিয়োগ পরীক্ষার প্রশ্নপত্র ফাঁসের অভিযোগ উঠেছে।
    ---
    Input: সে আজকে ঢাকা যাবে
    ---

NOTE: Why use a transformer-based model?

Training the Tok2Vec model

Data format

{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}

python -m spacy pretrain config.cfg ./output_pretrain --paths.raw_text ./data.jsonl
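The raw text file passed to the pretrain command holds one JSON object with a "text" key per line, as in the data format shown above. A minimal sketch for producing such a file from a list of sentences follows; `write_raw_text_jsonl` is a hypothetical helper and the output path is only an example.

```python
import json
from pathlib import Path


def write_raw_text_jsonl(sentences, path):
    """Write one {"text": ...} JSON object per line, skipping blank entries.

    Returns the number of lines written.
    """
    path = Path(path)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for sentence in sentences:
            sentence = sentence.strip()
            if not sentence:
                continue
            # ensure_ascii=False keeps Bangla characters readable in the file
            f.write(json.dumps({"text": sentence}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A call such as `write_raw_text_jsonl(sentences, "./data.jsonl")` would produce a file matching the `--paths.raw_text ./data.jsonl` argument used in the command.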

Initialize the pretrained vector file

    python -m spacy init vectors bn pretrain_vectors/bangla_word2vec_gen4/bangla_word2vec/bnwiki_word2vec.vector pretrain_vectors/bangla_word2vec_gen4/bangla_word2vec_spacy --verbose

NER Data formats

BLIOU data format meaning (the spaCy documentation calls this scheme BILUO)
B = Begin
L = Last
I = Inside
O = Outside
U = Unique
IOB data format meaning
I = Inside
O = Outside
B = Begin
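The relationship between the two tagging schemes can be illustrated with a short conversion sketch. It assumes valid IOB input, and `iob_to_bliou` is a hypothetical helper written for this example; spaCy ships its own converter, `spacy.training.iob_to_biluo`, which is likely the better choice in practice.

```python
def iob_to_bliou(tags):
    """Convert IOB tags to B/L/I/O/U tags. Assumes valid IOB input."""
    result = []
    for i, tag in enumerate(tags):
        if tag == "O":
            result.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"  # entity goes on after this token
        if prefix == "B":
            # a single-token entity becomes U (Unique)
            result.append(f"B-{label}" if continues else f"U-{label}")
        else:
            # the final token of a multi-token entity becomes L (Last)
            result.append(f"I-{label}" if continues else f"L-{label}")
    return result


print(iob_to_bliou(["B-PER", "I-PER", "O", "B-PER"]))
# ['B-PER', 'L-PER', 'O', 'U-PER']
```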
