BanglaNER

Bangla Named Entity Recognition (NER) using spaCy. The model extracts named entities from Bangla input sentences. The experiments use a single entity label, person (PER).

Five different experiments were run on this data, and the transformer-based model has performed better than the other models so far. See the experiment history for details and F1 scores; the best F1 score is about 0.80.

Bangla NER data is collected from,

Dependency

conda install spacy=3.1
pip install spacy-transformers # needed only if you want to use the transformer model

NOTE: If you want to just test the ner model please,

Data preparation

Clean the IOB data and remove any samples that are in the wrong IOB format.
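
The cleaning step above could be sketched as follows. This is a minimal illustration, not the repository's actual cleaning code: the function names `is_valid_iob` and `drop_malformed` are hypothetical, and it assumes the stricter IOB2 convention in which every entity starts with a `B-` tag.

```python
def is_valid_iob(tags):
    """Return True if one sentence's tag sequence follows IOB2 rules (assumed scheme)."""
    previous = "O"
    for tag in tags:
        if tag == "O":
            previous = tag
            continue
        prefix, _, label = tag.partition("-")
        # Only "B-<label>" and "I-<label>" are accepted as entity tags.
        if prefix not in ("B", "I") or not label:
            return False
        # An I- tag must continue an entity with the same label.
        if prefix == "I" and previous not in (f"B-{label}", f"I-{label}"):
            return False
        previous = tag
    return True


def drop_malformed(sentences):
    """Keep only (tokens, tags) pairs with matching lengths and valid tags."""
    return [
        (tokens, tags)
        for tokens, tags in sentences
        if len(tokens) == len(tags) and is_valid_iob(tags)
    ]


samples = [
    (["কবির", "টাকা"], ["B-PER", "O"]),  # well formed
    (["সে", "ঢাকা"], ["I-PER", "O"]),     # I- tag without a preceding B- tag
]
clean = drop_malformed(samples)  # keeps only the first sample
```

Dropping malformed sentences rather than repairing them keeps the sketch simple; whether that is acceptable depends on how much data is lost.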

Data conversion command (Optional)

IOB to .spacy data format in spaCy 3.x:
- python -m spacy convert -c iob -s -n 1 ner-token-per-line.iob ./data
- Example spaCy JSON data
Convert BILUO JSON format to .spacy data format:
- python -m spacy convert train.json ./data

Automate data preparation

To automate data preparation, run:

python utils/convert_to_spacy_json_format.py

This script generates data/train.json and data/val.json.

Convert json data to .spacy data

python -m spacy convert data/train.json ./data
python -m spacy convert data/val.json ./data

# Outputs
✔ Generated output file (8986 documents): data/train.spacy
✔ Generated output file (999 documents): data/val.spacy

The two commands above generate data/train.spacy and data/val.spacy.

Training & inference of spaCy transition-based model

Training

Create base config file

Go to the link, create a base config file, and save it under ./configs/base_config.cfg

The files required for the NER task are already in,

configs/
├── base_config.cfg     # base NER configuration downloaded from the spaCy website
├── config.cfg          # used to train the NER pipeline
└── config_pretrain.cfg # used to train only tok2vec separately

Prepare configuration files

Now fill in ./configs/base_config.cfg to produce the full config file ./configs/config.cfg:

python -m spacy init fill-config configs/base_config.cfg configs/config.cfg

Start training

python -m spacy train configs/config.cfg \
    --output ./models \
    --paths.train ./data/train.spacy \
    --paths.dev ./data/val.spacy

You should get an F1 score of around 0.66 on the validation data.
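
For readers unfamiliar with the metric, the following sketch shows how an entity-level F1 score of this kind is typically computed: a predicted entity counts as correct only when its span and label exactly match a gold entity. The function name `entity_f1` and the `(sentence_id, start, end, label)` tuple layout are assumptions made for illustration; this is not spaCy's scorer code.

```python
def entity_f1(gold_spans, predicted_spans):
    """Entity-level F1 from exact span matches.

    Each span is a hypothetical (sentence_id, start, end, label) tuple.
    """
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 0, 2, "PER"), (1, 4, 5, "PER"), (2, 5, 7, "PER")]
predicted = [(0, 0, 2, "PER"), (1, 4, 5, "PER"), (2, 6, 7, "PER")]
score = entity_f1(gold, predicted)  # 2 of 3 spans match, so precision = recall = F1 = 2/3
```

Because partial overlaps earn no credit, a prediction that captures only part of a multi-token name lowers both precision and recall.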

Inference

For inference, run:

python test.py

You can use an already pretrained model in test.py. Download the pretrained model from Google Drive (4.4MB) and set the model path in test.py.

Training and inference of spaCy transformer pipeline

Training the spaCy transformer model requires a GPU. Please check,

Transformer training and inference guide

You should get an F1 score of around 0.80 on the validation data.

Transformer-based model sample output

If you want to use an already trained model, download the pretrained model from Google Drive (622.8MB) and set the model path in test.py.

import spacy

nlp = spacy.load("./models_multilingual_bert/model-best")

text_list = [
    "আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম",
    "১০০ টাকা জমা দিয়েছেন কবির",
    "ডিপিডিসির স্পেশাল টাস্কফোর্সের প্রধান মুনীর চৌধুরী জানান",
    "অগ্রণী ব্যাংকের জ্যেষ্ঠ কর্মকর্তা পদে নিয়োগ পরীক্ষার প্রশ্নপত্র ফাঁসের অভিযোগ উঠেছে।",
    "সে আজকে ঢাকা যাবে",
]
for text in text_list:
    doc = nlp(text)

    print(f"Input: {text}")
    for entity in doc.ents:
        print(f"Entity: {entity.text}, Label: {entity.label_}")
    print("---")

# Outputs
    Input: আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম
    Entity: আব্দুর রহিম, Label: PER
    ---
    Input: ১০০ টাকা জমা দিয়েছেন কবির
    Entity: কবির, Label: PER
    ---
    Input: ডিপিডিসির স্পেশাল টাস্কফোর্সের প্রধান মুনীর চৌধুরী জানান
    Entity: মুনীর চৌধুরী, Label: PER
    ---
    Input: অগ্রণী ব্যাংকের জ্যেষ্ঠ কর্মকর্তা পদে নিয়োগ পরীক্ষার প্রশ্নপত্র ফাঁসের অভিযোগ উঠেছে।
    ---
    Input: সে আজকে ঢাকা যাবে
    ---

NOTE: Why use a transformer-based model?

Training Tok2Vec model

Data format

{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}

python -m spacy pretrain config.cfg ./output_pretrain --paths.raw_text ./data.jsonl
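
A raw-text file in the JSONL layout shown above could be produced with a few lines of Python. This is a sketch under assumptions: the helper name `write_raw_text_jsonl` is hypothetical, and the input is taken to be a plain list of sentences already held in memory.

```python
import json


def write_raw_text_jsonl(sentences, path):
    """Write one {"text": ...} JSON object per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            text = sentence.strip()
            if not text:
                continue  # skip empty lines
            # ensure_ascii=False keeps Bangla characters readable in the file.
            handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `write_raw_text_jsonl(["সে আজকে ঢাকা যাবে"], "data.jsonl")` would create a one-line file that can be passed to the pretrain command via `--paths.raw_text`.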

Init pretrained vector file

    python -m spacy init vectors bn pretrain_vectors/bangla_word2vec_gen4/bangla_word2vec/bnwiki_word2vec.vector pretrain_vectors/bangla_word2vec_gen4/bangla_word2vec_spacy --verbose

NER Data formats

BILUO data format meaning
B = Begin
L = Last
I = Inside
O = Outside
U = Unique

IOB data format meaning
I = Inside
O = Outside
B = Begin
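
To make the relationship between the two schemes concrete, here is a minimal pure-Python sketch that rewrites an IOB tag sequence as BILUO. It assumes well-formed IOB2 input (every entity starts with `B-`) and is for illustration only; spaCy itself provides `spacy.training.iob_to_biluo`, which should be preferred in practice.

```python
def iob_to_biluo(tags):
    """Convert one sentence's IOB2 tags to BILUO tags (assumes valid input)."""
    biluo = []
    for index, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[index + 1] if index + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            biluo.append(f"B-{label}" if continues else f"U-{label}")
        else:  # prefix == "I"
            biluo.append(f"I-{label}" if continues else f"L-{label}")
    return biluo


converted = iob_to_biluo(["B-PER", "I-PER", "O", "B-PER"])
# converted == ["B-PER", "L-PER", "O", "U-PER"]
```

The two-token name ends with `L-PER`, while the single-token name becomes `U-PER`; those are the two tags that BILUO adds on top of IOB.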
