Bengali Natural Language Processing (BNLP)


Check out the bnlp 4.0.0 dev version and its documentation. bnlp 4.0.0 was redesigned with a proper object-oriented API: you initialize a class once with a trained model and reuse that instance to get results.
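As an illustration of that design (the class and method names below are hypothetical, not the real bnlp 4.0 API), the pattern is: load the model once in the constructor and reuse it on every call, instead of passing a model path to each method as in the 3.x examples below.

```python
# Illustrative sketch only: BengaliTagger is NOT a real bnlp class;
# it just shows the load-once / reuse pattern of the 4.0 design.
class BengaliTagger:
    def __init__(self, model_path):
        # bnlp 4.0 style: the trained model is loaded once, at construction
        self.model_path = model_path

    def tag(self, text):
        # every call reuses the already-loaded model; no reloading from disk
        return [(token, "TAG") for token in text.split()]

tagger = BengaliTagger("bn_pos_model.pkl")  # hypothetical model file name
result = tagger.tag("hello world")
print(result)
```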

BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words and documents, run Bengali POS tagging and named entity recognition, and clean Bengali text for NLP purposes.



Installation

PIP installer

pip install bnlp_toolkit

or upgrade:

pip install -U bnlp_toolkit
  • Python: 3.6, 3.7, 3.8, 3.9
  • OS: Linux, Windows, Mac

Pretrained Model

Download Links

Large models are published in the Hugging Face model hub.

Training Details

  • SentencePiece, Word2Vec, fastText, and GloVe models were trained on the Bengali Wikipedia dump dataset
  • SentencePiece training vocab size = 50000
  • fastText was trained with total words = 20M, vocab size = 1171011, epochs = 50, embedding dimension = 300, and training loss = 0.318668
  • Word2Vec word embedding dimension = 100, min_count = 5, window = 5, epochs = 10
  • To learn about the Bengali GloVe word vectors and their training process, follow this repository
  • The Bengali CRF POS tagger was trained on the nltr dataset, reaching 80% accuracy
  • The Bengali CRF NER tagger was trained on this data, reaching 90% accuracy
  • The Bengali news article doc2vec model was trained with 8 JSON files of this corpus (400013 news articles in total) with epochs = 40, vector_size = 100, min_count = 2
  • The Bengali Wikipedia doc2vec model was trained on the Wikipedia dump dataset: 110448 articles in total, epochs = 40, vector_size = 100, min_count = 2

Tokenization

Basic Tokenizer

from bnlp import BasicTokenizer

basic_tokenizer = BasicTokenizer()
raw_text = "āĻ†āĻŽāĻŋ āĻŦāĻžāĻ‚āĻ˛āĻžā§Ÿ āĻ—āĻžāĻ¨ āĻ—āĻžāĻ‡āĨ¤"
tokens = basic_tokenizer.tokenize(raw_text)
print(tokens)

# output: ["āĻ†āĻŽāĻŋ", "āĻŦāĻžāĻ‚āĻ˛āĻžā§Ÿ", "āĻ—āĻžāĻ¨", "āĻ—āĻžāĻ‡", "āĨ¤"]

NLTK Tokenization

from bnlp import NLTKTokenizer

bnltk = NLTKTokenizer()
text = "āĻ†āĻŽāĻŋ āĻ­āĻžāĻ¤ āĻ–āĻžāĻ‡āĨ¤ āĻ¸ā§‡ āĻŦāĻžāĻœāĻžāĻ°ā§‡ āĻ¯āĻžā§ŸāĨ¤ āĻ¤āĻŋāĻ¨āĻŋ āĻ•āĻŋ āĻ¸āĻ¤ā§āĻ¯āĻŋāĻ‡ āĻ­āĻžāĻ˛ā§‹ āĻŽāĻžāĻ¨ā§āĻˇ?"
word_tokens = bnltk.word_tokenize(text)
sentence_tokens = bnltk.sentence_tokenize(text)
print(word_tokens)
print(sentence_tokens)

# output
# word_token: ["āĻ†āĻŽāĻŋ", "āĻ­āĻžāĻ¤", "āĻ–āĻžāĻ‡", "āĨ¤", "āĻ¸ā§‡", "āĻŦāĻžāĻœāĻžāĻ°ā§‡", "āĻ¯āĻžā§Ÿ", "āĨ¤", "āĻ¤āĻŋāĻ¨āĻŋ", "āĻ•āĻŋ", "āĻ¸āĻ¤ā§āĻ¯āĻŋāĻ‡", "āĻ­āĻžāĻ˛ā§‹", "āĻŽāĻžāĻ¨ā§āĻˇ", "?"]
# sentence_token: ["āĻ†āĻŽāĻŋ āĻ­āĻžāĻ¤ āĻ–āĻžāĻ‡āĨ¤", "āĻ¸ā§‡ āĻŦāĻžāĻœāĻžāĻ°ā§‡ āĻ¯āĻžā§ŸāĨ¤", "āĻ¤āĻŋāĻ¨āĻŋ āĻ•āĻŋ āĻ¸āĻ¤ā§āĻ¯āĻŋāĻ‡ āĻ­āĻžāĻ˛ā§‹ āĻŽāĻžāĻ¨ā§āĻˇ?"]

Bengali SentencePiece Tokenization

Tokenization using a trained model

from bnlp import SentencepieceTokenizer

bsp = SentencepieceTokenizer()
model_path = "./model/bn_spm.model"
input_text = "āĻ†āĻŽāĻŋ āĻ­āĻžāĻ¤ āĻ–āĻžāĻ‡āĨ¤ āĻ¸ā§‡ āĻŦāĻžāĻœāĻžāĻ°ā§‡ āĻ¯āĻžā§ŸāĨ¤"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
text2id = bsp.text2id(model_path, input_text)
print(text2id)
id2text = bsp.id2text(model_path, text2id)
print(id2text)

Training SentencePiece

from bnlp import SentencepieceTokenizer

bsp = SentencepieceTokenizer()
data = "raw_text.txt"
model_prefix = "test"
vocab_size = 5
bsp.train(data, model_prefix, vocab_size)

Word Embedding

Bengali Word2Vec

Generate Vector Using Pretrained Model

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'āĻ—ā§āĻ°āĻžāĻŽ'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)

Find Most Similar Word Using Pretrained Model

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'āĻ—ā§āĻ°āĻžāĻŽ'
similar = bwv.most_similar(model_path, word, topn=10)
print(similar)

Train Bengali Word2Vec with your own data

Train Bengali word2vec with your custom raw data or tokenized sentences.

Custom tokenized sentence format example:

sentences = [['āĻ†āĻŽāĻŋ', 'āĻ­āĻžāĻ¤', 'āĻ–āĻžāĻ‡', 'āĨ¤'], ['āĻ¸ā§‡', 'āĻŦāĻžāĻœāĻžāĻ°ā§‡', 'āĻ¯āĻžā§Ÿ', 'āĨ¤']]
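If your corpus starts as a raw text file (one sentence per line), one quick way to build that list-of-lists format is a line-by-line whitespace split. This is a naive sketch; the tokenizers above produce better tokens:

```python
import os
import tempfile

def file_to_sentences(path):
    """Read a raw text file (one sentence per line) into the
    list-of-token-lists format shown above, using a naive
    whitespace split."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip empty lines
                sentences.append(tokens)
    return sentences

# tiny demo corpus written to a temporary file
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8")
tmp.write("a b c\nd e\n")
tmp.close()
sentences = file_to_sentences(tmp.name)
os.unlink(tmp.name)
print(sentences)  # [['a', 'b', 'c'], ['d', 'e']]
```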

Check the gensim word2vec API for details of the training parameters.

from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
data_file = "raw_text.txt" # or you can pass custom sentence tokens as list of list
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train(data_file, model_name, vector_name, epochs=5)

Resume word2vec training from a pretrained model with the same or a new corpus or tokenized sentences

Check the gensim word2vec API for details of the training parameters.

from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()

trained_model_path = "mytrained_model.model"
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.pretrain(trained_model_path, data_file, model_name, vector_name, epochs=5)

Bengali FastText

To use fastText, you need to install it manually: pip install fasttext==0.9.2

NB: fastText may not work on Windows; it only works on Linux.

Generate Vector Using Pretrained Model

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()
word = "āĻ—ā§āĻ°āĻžāĻŽ"
model_path = "bengali_fasttext_wiki.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)

Train Bengali FastText Model

Check the fastText documentation for details of the training parameters.

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()
data = "raw_text.txt"
model_name = "saved_model.bin"
epoch = 50
bft.train(data, model_name, epoch)

Generate Vector File from Fasttext Binary Model

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()

model_path = "mymodel.bin"
out_vector_name = "myvector.txt"
bft.bin2vec(model_path, out_vector_name)

Bengali GloVe Word Vectors

We trained a GloVe model with Bengali data (Wikipedia + news articles) and published the Bengali GloVe word vectors.
You can download them and use them for your various machine learning purposes.

from bnlp import BengaliGlove
glove_path = "bn_glove.39M.100d.txt"
word = "āĻ—ā§āĻ°āĻžāĻŽ"
bng = BengaliGlove()
res = bng.closest_word(glove_path, word)
print(res)
vec = bng.word2vec(glove_path, word)
print(vec)
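A GloVe vector file is plain text: each line is a word followed by its vector components, and closest-word lookup is essentially a cosine-similarity search over those vectors. A minimal standalone loader and similarity check (using toy 2-d vectors instead of the real 100-d Bengali file):

```python
import io
import math

def load_glove(lines):
    """Parse GloVe text lines of the form 'word v1 v2 ... vN'
    into a dict mapping word -> vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if parts[0]:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy two-word "GloVe file" in memory
sample = io.StringIO("cat 1.0 0.0\ndog 0.9 0.1\n")
vectors = load_glove(sample)
similarity = cosine(vectors["cat"], vectors["dog"])
print(similarity)
```

Finding the closest word to a query is then just taking the maximum of this similarity over the whole vocabulary.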

Document Embedding

Bengali Doc2Vec

Get a document vector from an input document

from bnlp import BengaliDoc2vec

bn_doc2vec = BengaliDoc2vec()

model_path = "bangla_news_article_doc2vec.model" # keep other .npy model files also in same folder
document = "āĻ°āĻžāĻˇā§āĻŸā§āĻ°āĻŦāĻŋāĻ°ā§‹āĻ§ā§€ āĻ“ āĻ‰āĻ¸āĻ•āĻžāĻ¨āĻŋāĻŽā§‚āĻ˛āĻ• āĻŦāĻ•ā§āĻ¤āĻŦā§āĻ¯ āĻĻā§‡āĻ“ā§ŸāĻžāĻ° āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ—ā§‡ āĻ—āĻžāĻœā§€āĻĒā§āĻ°ā§‡āĻ° āĻ—āĻžāĻ›āĻž āĻĨāĻžāĻ¨āĻžā§Ÿ āĻĄāĻŋāĻœāĻŋāĻŸāĻžāĻ˛ āĻ¨āĻŋāĻ°āĻžāĻĒāĻ¤ā§āĻ¤āĻž āĻ†āĻ‡āĻ¨ā§‡ āĻ•āĻ°āĻž āĻŽāĻžāĻŽāĻ˛āĻžā§Ÿ āĻ†āĻ˛ā§‹āĻšāĻŋāĻ¤ ‘āĻļāĻŋāĻļā§āĻŦāĻ•ā§āĻ¤āĻžâ€™ āĻ°āĻĢāĻŋāĻ•ā§āĻ˛ āĻ‡āĻ¸āĻ˛āĻžāĻŽā§‡āĻ° āĻŦāĻŋāĻ°ā§āĻĻā§āĻ§ā§‡ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ āĻ•āĻ°ā§‡āĻ›ā§‡āĻ¨ āĻ†āĻĻāĻžāĻ˛āĻ¤āĨ¤ āĻĢāĻ˛ā§‡ āĻŽāĻžāĻŽāĻ˛āĻžāĻ° āĻ†āĻ¨ā§āĻˇā§āĻ āĻžāĻ¨āĻŋāĻ• āĻŦāĻŋāĻšāĻžāĻ° āĻļā§āĻ°ā§ āĻšāĻ˛ā§‹āĨ¤ āĻ†āĻœ āĻŦā§āĻ§āĻŦāĻžāĻ° (ā§¨ā§Ŧ āĻœāĻžāĻ¨ā§ā§ŸāĻžāĻ°āĻŋ) āĻĸāĻžāĻ•āĻžāĻ° āĻ¸āĻžāĻ‡āĻŦāĻžāĻ° āĻŸā§āĻ°āĻžāĻ‡āĻŦā§āĻ¯ā§āĻ¨āĻžāĻ˛ā§‡āĻ° āĻŦāĻŋāĻšāĻžāĻ°āĻ• āĻ†āĻ¸āĻ¸āĻžāĻŽāĻ› āĻœāĻ—āĻ˛ā§āĻ˛ āĻšā§‹āĻ¸ā§‡āĻ¨ āĻ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ āĻ•āĻ°ā§‡āĻ¨āĨ¤ āĻāĻ° āĻ†āĻ—ā§‡, āĻ°āĻĢāĻŋāĻ•ā§āĻ˛ āĻ‡āĻ¸āĻ˛āĻžāĻŽāĻ•ā§‡ āĻ•āĻžāĻ°āĻžāĻ—āĻžāĻ° āĻĨā§‡āĻ•ā§‡ āĻ†āĻĻāĻžāĻ˛āĻ¤ā§‡ āĻšāĻžāĻœāĻŋāĻ° āĻ•āĻ°āĻž āĻšā§ŸāĨ¤ āĻāĻ°āĻĒāĻ° āĻ¤āĻžāĻ•ā§‡ āĻ¨āĻŋāĻ°ā§āĻĻā§‹āĻˇ āĻĻāĻžāĻŦāĻŋ āĻ•āĻ°ā§‡ āĻ¤āĻžāĻ° āĻ†āĻ‡āĻ¨āĻœā§€āĻŦā§€ āĻļā§‹āĻšā§‡āĻ˛ āĻŽā§‹. āĻĢāĻœāĻ˛ā§‡ āĻ°āĻžāĻŦā§āĻŦāĻŋ āĻ…āĻŦā§āĻ¯āĻžāĻšāĻ¤āĻŋ āĻšā§‡ā§Ÿā§‡ āĻ†āĻŦā§‡āĻĻāĻ¨ āĻ•āĻ°ā§‡āĻ¨āĨ¤ āĻ…āĻ¨ā§āĻ¯āĻĻāĻŋāĻ•ā§‡, āĻ°āĻžāĻˇā§āĻŸā§āĻ°āĻĒāĻ•ā§āĻˇ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ā§‡āĻ° āĻĒāĻ•ā§āĻˇā§‡ āĻļā§āĻ¨āĻžāĻ¨āĻŋ āĻ•āĻ°ā§‡āĻ¨āĨ¤ āĻ‰āĻ­ā§Ÿ āĻĒāĻ•ā§āĻˇā§‡āĻ° āĻļā§āĻ¨āĻžāĻ¨āĻŋ āĻļā§‡āĻˇā§‡ āĻ†āĻĻāĻžāĻ˛āĻ¤ āĻ…āĻŦā§āĻ¯āĻžāĻšāĻ¤āĻŋāĻ° āĻ†āĻŦā§‡āĻĻāĻ¨ āĻ–āĻžāĻ°āĻŋāĻœ āĻ•āĻ°ā§‡ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ā§‡āĻ° āĻŽāĻžāĻ§ā§āĻ¯āĻŽā§‡ āĻŦāĻŋāĻšāĻžāĻ° āĻļā§āĻ°ā§āĻ° āĻ†āĻĻā§‡āĻļ āĻĻā§‡āĻ¨āĨ¤ āĻāĻ•āĻ‡āĻ¸āĻ™ā§āĻ—ā§‡ āĻ¸āĻžāĻ•ā§āĻˇā§āĻ¯āĻ—ā§āĻ°āĻšāĻŖā§‡āĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻ—āĻžāĻŽā§€ ā§¨ā§¨ āĻĢā§‡āĻŦā§āĻ°ā§ā§ŸāĻžāĻ°āĻŋ āĻĻāĻŋāĻ¨ āĻ§āĻžāĻ°ā§āĻ¯ āĻ•āĻ°ā§‡āĻ¨ āĻ†āĻĻāĻžāĻ˛āĻ¤āĨ¤"

vector = bn_doc2vec.get_document_vector(model_path, document)
print(vector)

Find the document similarity between two documents

from bnlp import BengaliDoc2vec

bn_doc2vec = BengaliDoc2vec()

model_path = "bangla_news_article_doc2vec.model" # keep other .npy model files also in same folder
article_1 = "āĻ°āĻžāĻˇā§āĻŸā§āĻ°āĻŦāĻŋāĻ°ā§‹āĻ§ā§€ āĻ“ āĻ‰āĻ¸āĻ•āĻžāĻ¨āĻŋāĻŽā§‚āĻ˛āĻ• āĻŦāĻ•ā§āĻ¤āĻŦā§āĻ¯ āĻĻā§‡āĻ“ā§ŸāĻžāĻ° āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ—ā§‡ āĻ—āĻžāĻœā§€āĻĒā§āĻ°ā§‡āĻ° āĻ—āĻžāĻ›āĻž āĻĨāĻžāĻ¨āĻžā§Ÿ āĻĄāĻŋāĻœāĻŋāĻŸāĻžāĻ˛ āĻ¨āĻŋāĻ°āĻžāĻĒāĻ¤ā§āĻ¤āĻž āĻ†āĻ‡āĻ¨ā§‡ āĻ•āĻ°āĻž āĻŽāĻžāĻŽāĻ˛āĻžā§Ÿ āĻ†āĻ˛ā§‹āĻšāĻŋāĻ¤ ‘āĻļāĻŋāĻļā§āĻŦāĻ•ā§āĻ¤āĻžâ€™ āĻ°āĻĢāĻŋāĻ•ā§āĻ˛ āĻ‡āĻ¸āĻ˛āĻžāĻŽā§‡āĻ° āĻŦāĻŋāĻ°ā§āĻĻā§āĻ§ā§‡ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ āĻ•āĻ°ā§‡āĻ›ā§‡āĻ¨ āĻ†āĻĻāĻžāĻ˛āĻ¤āĨ¤ āĻĢāĻ˛ā§‡ āĻŽāĻžāĻŽāĻ˛āĻžāĻ° āĻ†āĻ¨ā§āĻˇā§āĻ āĻžāĻ¨āĻŋāĻ• āĻŦāĻŋāĻšāĻžāĻ° āĻļā§āĻ°ā§ āĻšāĻ˛ā§‹āĨ¤ āĻ†āĻœ āĻŦā§āĻ§āĻŦāĻžāĻ° (ā§¨ā§Ŧ āĻœāĻžāĻ¨ā§ā§ŸāĻžāĻ°āĻŋ) āĻĸāĻžāĻ•āĻžāĻ° āĻ¸āĻžāĻ‡āĻŦāĻžāĻ° āĻŸā§āĻ°āĻžāĻ‡āĻŦā§āĻ¯ā§āĻ¨āĻžāĻ˛ā§‡āĻ° āĻŦāĻŋāĻšāĻžāĻ°āĻ• āĻ†āĻ¸āĻ¸āĻžāĻŽāĻ› āĻœāĻ—āĻ˛ā§āĻ˛ āĻšā§‹āĻ¸ā§‡āĻ¨ āĻ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ āĻ•āĻ°ā§‡āĻ¨āĨ¤ āĻāĻ° āĻ†āĻ—ā§‡, āĻ°āĻĢāĻŋāĻ•ā§āĻ˛ āĻ‡āĻ¸āĻ˛āĻžāĻŽāĻ•ā§‡ āĻ•āĻžāĻ°āĻžāĻ—āĻžāĻ° āĻĨā§‡āĻ•ā§‡ āĻ†āĻĻāĻžāĻ˛āĻ¤ā§‡ āĻšāĻžāĻœāĻŋāĻ° āĻ•āĻ°āĻž āĻšā§ŸāĨ¤ āĻāĻ°āĻĒāĻ° āĻ¤āĻžāĻ•ā§‡ āĻ¨āĻŋāĻ°ā§āĻĻā§‹āĻˇ āĻĻāĻžāĻŦāĻŋ āĻ•āĻ°ā§‡ āĻ¤āĻžāĻ° āĻ†āĻ‡āĻ¨āĻœā§€āĻŦā§€ āĻļā§‹āĻšā§‡āĻ˛ āĻŽā§‹. āĻĢāĻœāĻ˛ā§‡ āĻ°āĻžāĻŦā§āĻŦāĻŋ āĻ…āĻŦā§āĻ¯āĻžāĻšāĻ¤āĻŋ āĻšā§‡ā§Ÿā§‡ āĻ†āĻŦā§‡āĻĻāĻ¨ āĻ•āĻ°ā§‡āĻ¨āĨ¤ āĻ…āĻ¨ā§āĻ¯āĻĻāĻŋāĻ•ā§‡, āĻ°āĻžāĻˇā§āĻŸā§āĻ°āĻĒāĻ•ā§āĻˇ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ā§‡āĻ° āĻĒāĻ•ā§āĻˇā§‡ āĻļā§āĻ¨āĻžāĻ¨āĻŋ āĻ•āĻ°ā§‡āĻ¨āĨ¤ āĻ‰āĻ­ā§Ÿ āĻĒāĻ•ā§āĻˇā§‡āĻ° āĻļā§āĻ¨āĻžāĻ¨āĻŋ āĻļā§‡āĻˇā§‡ āĻ†āĻĻāĻžāĻ˛āĻ¤ āĻ…āĻŦā§āĻ¯āĻžāĻšāĻ¤āĻŋāĻ° āĻ†āĻŦā§‡āĻĻāĻ¨ āĻ–āĻžāĻ°āĻŋāĻœ āĻ•āĻ°ā§‡ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ā§‡āĻ° āĻŽāĻžāĻ§ā§āĻ¯āĻŽā§‡ āĻŦāĻŋāĻšāĻžāĻ° āĻļā§āĻ°ā§āĻ° āĻ†āĻĻā§‡āĻļ āĻĻā§‡āĻ¨āĨ¤ āĻāĻ•āĻ‡āĻ¸āĻ™ā§āĻ—ā§‡ āĻ¸āĻžāĻ•ā§āĻˇā§āĻ¯āĻ—ā§āĻ°āĻšāĻŖā§‡āĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻ—āĻžāĻŽā§€ ā§¨ā§¨ āĻĢā§‡āĻŦā§āĻ°ā§ā§ŸāĻžāĻ°āĻŋ āĻĻāĻŋāĻ¨ āĻ§āĻžāĻ°ā§āĻ¯ āĻ•āĻ°ā§‡āĻ¨ āĻ†āĻĻāĻžāĻ˛āĻ¤āĨ¤"
article_2 = "āĻ°āĻžāĻˇā§āĻŸā§āĻ°āĻŦāĻŋāĻ°ā§‹āĻ§ā§€ āĻ“ āĻ‰āĻ¸āĻ•āĻžāĻ¨āĻŋāĻŽā§‚āĻ˛āĻ• āĻŦāĻ•ā§āĻ¤āĻŦā§āĻ¯ āĻĻā§‡āĻ“ā§ŸāĻžāĻ° āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ—ā§‡ āĻ—āĻžāĻœā§€āĻĒā§āĻ°ā§‡āĻ° āĻ—āĻžāĻ›āĻž āĻĨāĻžāĻ¨āĻžā§Ÿ āĻĄāĻŋāĻœāĻŋāĻŸāĻžāĻ˛ āĻ¨āĻŋāĻ°āĻžāĻĒāĻ¤ā§āĻ¤āĻž āĻ†āĻ‡āĻ¨ā§‡ āĻ•āĻ°āĻž āĻŽāĻžāĻŽāĻ˛āĻžā§Ÿ āĻ†āĻ˛ā§‹āĻšāĻŋāĻ¤ ‘āĻļāĻŋāĻļā§āĻŦāĻ•ā§āĻ¤āĻžâ€™ āĻ°āĻĢāĻŋāĻ•ā§āĻ˛ āĻ‡āĻ¸āĻ˛āĻžāĻŽā§‡āĻ° āĻŦāĻŋāĻ°ā§āĻĻā§āĻ§ā§‡ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ āĻ•āĻ°ā§‡āĻ›ā§‡āĻ¨ āĻ†āĻĻāĻžāĻ˛āĻ¤āĨ¤ āĻĢāĻ˛ā§‡ āĻŽāĻžāĻŽāĻ˛āĻžāĻ° āĻ†āĻ¨ā§āĻˇā§āĻ āĻžāĻ¨āĻŋāĻ• āĻŦāĻŋāĻšāĻžāĻ° āĻļā§āĻ°ā§ āĻšāĻ˛ā§‹āĨ¤ āĻ†āĻœ āĻŦā§āĻ§āĻŦāĻžāĻ° (ā§¨ā§Ŧ āĻœāĻžāĻ¨ā§ā§ŸāĻžāĻ°āĻŋ) āĻĸāĻžāĻ•āĻžāĻ° āĻ¸āĻžāĻ‡āĻŦāĻžāĻ° āĻŸā§āĻ°āĻžāĻ‡āĻŦā§āĻ¯ā§āĻ¨āĻžāĻ˛ā§‡āĻ° āĻŦāĻŋāĻšāĻžāĻ°āĻ• āĻ†āĻ¸āĻ¸āĻžāĻŽāĻ› āĻœāĻ—āĻ˛ā§āĻ˛ āĻšā§‹āĻ¸ā§‡āĻ¨ āĻ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ āĻ•āĻ°ā§‡āĻ¨āĨ¤ āĻāĻ° āĻ†āĻ—ā§‡, āĻ°āĻĢāĻŋāĻ•ā§āĻ˛ āĻ‡āĻ¸āĻ˛āĻžāĻŽāĻ•ā§‡ āĻ•āĻžāĻ°āĻžāĻ—āĻžāĻ° āĻĨā§‡āĻ•ā§‡ āĻ†āĻĻāĻžāĻ˛āĻ¤ā§‡ āĻšāĻžāĻœāĻŋāĻ° āĻ•āĻ°āĻž āĻšā§ŸāĨ¤ āĻāĻ°āĻĒāĻ° āĻ¤āĻžāĻ•ā§‡ āĻ¨āĻŋāĻ°ā§āĻĻā§‹āĻˇ āĻĻāĻžāĻŦāĻŋ āĻ•āĻ°ā§‡ āĻ¤āĻžāĻ° āĻ†āĻ‡āĻ¨āĻœā§€āĻŦā§€ āĻļā§‹āĻšā§‡āĻ˛ āĻŽā§‹. āĻĢāĻœāĻ˛ā§‡ āĻ°āĻžāĻŦā§āĻŦāĻŋ āĻ…āĻŦā§āĻ¯āĻžāĻšāĻ¤āĻŋ āĻšā§‡ā§Ÿā§‡ āĻ†āĻŦā§‡āĻĻāĻ¨ āĻ•āĻ°ā§‡āĻ¨āĨ¤ āĻ…āĻ¨ā§āĻ¯āĻĻāĻŋāĻ•ā§‡, āĻ°āĻžāĻˇā§āĻŸā§āĻ°āĻĒāĻ•ā§āĻˇ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ā§‡āĻ° āĻĒāĻ•ā§āĻˇā§‡ āĻļā§āĻ¨āĻžāĻ¨āĻŋ āĻ•āĻ°ā§‡āĻ¨āĨ¤ āĻ‰āĻ­ā§Ÿ āĻĒāĻ•ā§āĻˇā§‡āĻ° āĻļā§āĻ¨āĻžāĻ¨āĻŋ āĻļā§‡āĻˇā§‡ āĻ†āĻĻāĻžāĻ˛āĻ¤ āĻ…āĻŦā§āĻ¯āĻžāĻšāĻ¤āĻŋāĻ° āĻ†āĻŦā§‡āĻĻāĻ¨ āĻ–āĻžāĻ°āĻŋāĻœ āĻ•āĻ°ā§‡ āĻ…āĻ­āĻŋāĻ¯ā§‹āĻ— āĻ—āĻ āĻ¨ā§‡āĻ° āĻŽāĻžāĻ§ā§āĻ¯āĻŽā§‡ āĻŦāĻŋāĻšāĻžāĻ° āĻļā§āĻ°ā§āĻ° āĻ†āĻĻā§‡āĻļ āĻĻā§‡āĻ¨āĨ¤ āĻāĻ•āĻ‡āĻ¸āĻ™ā§āĻ—ā§‡ āĻ¸āĻžāĻ•ā§āĻˇā§āĻ¯āĻ—ā§āĻ°āĻšāĻŖā§‡āĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻ—āĻžāĻŽā§€ ā§¨ā§¨ āĻĢā§‡āĻŦā§āĻ°ā§ā§ŸāĻžāĻ°āĻŋ āĻĻāĻŋāĻ¨ āĻ§āĻžāĻ°ā§āĻ¯ āĻ•āĻ°ā§‡āĻ¨ āĻ†āĻĻāĻžāĻ˛āĻ¤āĨ¤"

similarity = bn_doc2vec.get_document_similarity(
  model_path,
  article_1,
  article_2
)
print(similarity)

Train doc2vec with custom text files

from bnlp import BengaliDoc2vec

bn_doc2vec = BengaliDoc2vec()

text_files = "path/myfiles"
checkpoint_path = "msc/logs"

bn_doc2vec.train_doc2vec(
  text_files,
  checkpoint_path=checkpoint_path,
  vector_size=100,
  min_count=2,
  epochs=10
)

# trains doc2vec on your text files and saves the trained model in checkpoint_path

Bengali POS Tagging

Bengali CRF POS Tagging

Find POS Tags Using Pretrained Model

from bnlp import POS
bn_pos = POS()
model_path = "model/bn_pos.pkl"
text = "āĻ†āĻŽāĻŋ āĻ­āĻžāĻ¤ āĻ–āĻžāĻ‡āĨ¤" # or you can pass ['āĻ†āĻŽāĻŋ', 'āĻ­āĻžāĻ¤', 'āĻ–āĻžāĻ‡', 'āĨ¤']
res = bn_pos.tag(model_path, text)
print(res)
# [('āĻ†āĻŽāĻŋ', 'PPR'), ('āĻ­āĻžāĻ¤', 'NC'), ('āĻ–āĻžāĻ‡', 'VM'), ('āĨ¤', 'PU')]

Train POS Tag Model

from bnlp import POS
bn_pos = POS()
model_name = "pos_model.pkl"
train_data = [[('āĻ°āĻĒā§āĻ¤āĻžāĻ¨āĻŋ', 'JJ'), ('āĻĻā§āĻ°āĻŦā§āĻ¯', 'NC'), ('-', 'PU'), ('āĻ¤āĻžāĻœāĻž',  'JJ'), ('āĻ“', 'CCD'), ('āĻļā§āĻ•āĻ¨āĻž', 'JJ'), ('āĻĢāĻ˛', 'NC'), (',', 'PU'), ('āĻ†āĻĢāĻŋāĻŽ', 'NC'), (',', 'PU'), ('āĻĒāĻļā§āĻšāĻ°ā§āĻŽ', 'NC'), ('āĻ“', 'CCD'), ('āĻĒāĻļāĻŽ', 'NC'), ('āĻāĻŦāĻ‚', 'CCD'),('āĻ•āĻžāĻ°ā§āĻĒā§‡āĻŸ', 'NC'), ('ā§ˇ', 'PU')], [('āĻŽāĻžāĻŸāĻŋ', 'NC'), ('āĻĨā§‡āĻ•ā§‡', 'PP'), ('āĻŦā§œāĻœā§‹āĻ°', 'JQ'), ('āĻšāĻžāĻ°', 'JQ'), ('āĻĒāĻžāĻāĻš', 'JQ'), ('āĻĢā§āĻŸ', 'CCL'), ('āĻ‰āĻāĻšā§', 'JJ'), ('āĻšāĻŦā§‡', 'VM'), ('ā§ˇ', 'PU')]]

test_data = [[('āĻ°āĻĒā§āĻ¤āĻžāĻ¨āĻŋ', 'JJ'), ('āĻĻā§āĻ°āĻŦā§āĻ¯', 'NC'), ('-', 'PU'), ('āĻ¤āĻžāĻœāĻž', 'JJ'), ('āĻ“', 'CCD'), ('āĻļā§āĻ•āĻ¨āĻž', 'JJ'), ('āĻĢāĻ˛', 'NC'), (',', 'PU'), ('āĻ†āĻĢāĻŋāĻŽ', 'NC'), (',', 'PU'), ('āĻĒāĻļā§āĻšāĻ°ā§āĻŽ', 'NC'), ('āĻ“', 'CCD'), ('āĻĒāĻļāĻŽ', 'NC'), ('āĻāĻŦāĻ‚', 'CCD'),('āĻ•āĻžāĻ°ā§āĻĒā§‡āĻŸ', 'NC'), ('ā§ˇ', 'PU')], [('āĻŽāĻžāĻŸāĻŋ', 'NC'), ('āĻĨā§‡āĻ•ā§‡', 'PP'), ('āĻŦā§œāĻœā§‹āĻ°', 'JQ'), ('āĻšāĻžāĻ°', 'JQ'), ('āĻĒāĻžāĻāĻš', 'JQ'), ('āĻĢā§āĻŸ', 'CCL'), ('āĻ‰āĻāĻšā§', 'JJ'), ('āĻšāĻŦā§‡', 'VM'), ('ā§ˇ', 'PU')]]

bn_pos.train(model_name, train_data, test_data)
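Under the hood, a CRF tagger predicts each tag from hand-crafted per-token features: the current word, its neighbours, prefixes/suffixes, and position. BNLP's exact feature set may differ; a typical sklearn-crfsuite-style feature function looks like this:

```python
def word_features(sentence, i):
    """Features for the i-th token of a tokenized sentence, in the
    style commonly fed to a CRF sequence tagger (illustrative only;
    the features BNLP actually uses may differ)."""
    word = sentence[i]
    return {
        "word": word,
        "is_first": i == 0,                      # sentence-initial position
        "is_last": i == len(sentence) - 1,       # sentence-final position
        "prefix_2": word[:2],                    # first two characters
        "suffix_2": word[-2:],                   # last two characters
        "prev_word": "" if i == 0 else sentence[i - 1],
        "next_word": "" if i == len(sentence) - 1 else sentence[i + 1],
    }

feats = word_features(["I", "eat", "rice"], 1)
print(feats)
```

Each sentence becomes a list of such feature dicts, paired with its list of gold tags, and the CRF learns tag transitions on top of these per-token features.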

Bengali NER

Bengali CRF NER

Find NER Tags Using Pretrained Model

from bnlp import NER
bn_ner = NER()
model_path = "model/bn_ner.pkl"
text = "āĻ¸ā§‡ āĻĸāĻžāĻ•āĻžā§Ÿ āĻĨāĻžāĻ•ā§‡āĨ¤" # or you can pass ['āĻ¸ā§‡', 'āĻĸāĻžāĻ•āĻžā§Ÿ', 'āĻĨāĻžāĻ•ā§‡', 'āĨ¤']
result = bn_ner.tag(model_path, text)
print(result)
# [('āĻ¸ā§‡', 'O'), ('āĻĸāĻžāĻ•āĻžā§Ÿ', 'S-LOC'), ('āĻĨāĻžāĻ•ā§‡', 'O')]

Train NER Tag Model

from bnlp import NER
bn_ner = NER()
model_name = "ner_model.pkl"
train_data = [[('āĻ¤ā§āĻ°āĻžāĻŖ', 'O'),('āĻ“', 'O'),('āĻ¸āĻŽāĻžāĻœāĻ•āĻ˛ā§āĻ¯āĻžāĻŖ', 'O'),('āĻ¸āĻŽā§āĻĒāĻžāĻĻāĻ•', 'S-PER'),('āĻ¸ā§āĻœāĻŋāĻ¤', 'B-PER'),('āĻ°āĻžā§Ÿ', 'I-PER'),('āĻ¨āĻ¨ā§āĻĻā§€', 'E-PER'),('āĻĒā§āĻ°āĻŽā§āĻ–', 'O'),('āĻ¸āĻ‚āĻŦāĻžāĻĻ', 'O'),('āĻ¸āĻŽā§āĻŽā§‡āĻ˛āĻ¨ā§‡', 'O'),('āĻ‰āĻĒāĻ¸ā§āĻĨāĻŋāĻ¤', 'O'),('āĻ›āĻŋāĻ˛ā§‡āĻ¨', 'O')], [('āĻ¤ā§āĻ°āĻžāĻŖ', 'O'),('āĻ“', 'O'),('āĻ¸āĻŽāĻžāĻœāĻ•āĻ˛ā§āĻ¯āĻžāĻŖ', 'O'),('āĻ¸āĻŽā§āĻĒāĻžāĻĻāĻ•', 'S-PER'),('āĻ¸ā§āĻœāĻŋāĻ¤', 'B-PER'),('āĻ°āĻžā§Ÿ', 'I-PER'),('āĻ¨āĻ¨ā§āĻĻā§€', 'E-PER'),('āĻĒā§āĻ°āĻŽā§āĻ–', 'O'),('āĻ¸āĻ‚āĻŦāĻžāĻĻ', 'O'),('āĻ¸āĻŽā§āĻŽā§‡āĻ˛āĻ¨ā§‡', 'O'),('āĻ‰āĻĒāĻ¸ā§āĻĨāĻŋāĻ¤', 'O'),('āĻ›āĻŋāĻ˛ā§‡āĻ¨', 'O')], [('āĻ¤ā§āĻ°āĻžāĻŖ', 'O'),('āĻ“', 'O'),('āĻ¸āĻŽāĻžāĻœāĻ•āĻ˛ā§āĻ¯āĻžāĻŖ', 'O'),('āĻ¸āĻŽā§āĻĒāĻžāĻĻāĻ•', 'S-PER'),('āĻ¸ā§āĻœāĻŋāĻ¤', 'B-PER'),('āĻ°āĻžā§Ÿ', 'I-PER'),('āĻ¨āĻ¨ā§āĻĻā§€', 'E-PER'),('āĻĒā§āĻ°āĻŽā§āĻ–', 'O'),('āĻ¸āĻ‚āĻŦāĻžāĻĻ', 'O'),('āĻ¸āĻŽā§āĻŽā§‡āĻ˛āĻ¨ā§‡', 'O'),('āĻ‰āĻĒāĻ¸ā§āĻĨāĻŋāĻ¤', 'O'),('āĻ›āĻŋāĻ˛ā§‡āĻ¨', 'O')]]

test_data = [[('āĻ¤ā§āĻ°āĻžāĻŖ', 'O'),('āĻ“', 'O'),('āĻ¸āĻŽāĻžāĻœāĻ•āĻ˛ā§āĻ¯āĻžāĻŖ', 'O'),('āĻ¸āĻŽā§āĻĒāĻžāĻĻāĻ•', 'S-PER'),('āĻ¸ā§āĻœāĻŋāĻ¤', 'B-PER'),('āĻ°āĻžā§Ÿ', 'I-PER'),('āĻ¨āĻ¨ā§āĻĻā§€', 'E-PER'),('āĻĒā§āĻ°āĻŽā§āĻ–', 'O'),('āĻ¸āĻ‚āĻŦāĻžāĻĻ', 'O'),('āĻ¸āĻŽā§āĻŽā§‡āĻ˛āĻ¨ā§‡', 'O'),('āĻ‰āĻĒāĻ¸ā§āĻĨāĻŋāĻ¤', 'O'),('āĻ›āĻŋāĻ˛ā§‡āĻ¨', 'O')], [('āĻ¤ā§āĻ°āĻžāĻŖ', 'O'),('āĻ“', 'O'),('āĻ¸āĻŽāĻžāĻœāĻ•āĻ˛ā§āĻ¯āĻžāĻŖ', 'O'),('āĻ¸āĻŽā§āĻĒāĻžāĻĻāĻ•', 'S-PER'),('āĻ¸ā§āĻœāĻŋāĻ¤', 'B-PER'),('āĻ°āĻžā§Ÿ', 'I-PER'),('āĻ¨āĻ¨ā§āĻĻā§€', 'E-PER'),('āĻĒā§āĻ°āĻŽā§āĻ–', 'O'),('āĻ¸āĻ‚āĻŦāĻžāĻĻ', 'O'),('āĻ¸āĻŽā§āĻŽā§‡āĻ˛āĻ¨ā§‡', 'O'),('āĻ‰āĻĒāĻ¸ā§āĻĨāĻŋāĻ¤', 'O'),('āĻ›āĻŋāĻ˛ā§‡āĻ¨', 'O')], [('āĻ¤ā§āĻ°āĻžāĻŖ', 'O'),('āĻ“', 'O'),('āĻ¸āĻŽāĻžāĻœāĻ•āĻ˛ā§āĻ¯āĻžāĻŖ', 'O'),('āĻ¸āĻŽā§āĻĒāĻžāĻĻāĻ•', 'S-PER'),('āĻ¸ā§āĻœāĻŋāĻ¤', 'B-PER'),('āĻ°āĻžā§Ÿ', 'I-PER'),('āĻ¨āĻ¨ā§āĻĻā§€', 'E-PER'),('āĻĒā§āĻ°āĻŽā§āĻ–', 'O'),('āĻ¸āĻ‚āĻŦāĻžāĻĻ', 'O'),('āĻ¸āĻŽā§āĻŽā§‡āĻ˛āĻ¨ā§‡', 'O'),('āĻ‰āĻĒāĻ¸ā§āĻĨāĻŋāĻ¤', 'O'),('āĻ›āĻŋāĻ˛ā§‡āĻ¨', 'O')]]

bn_ner.train(model_name, train_data, test_data)

Bengali Corpus Class

Stopwords and Punctuations

from bnlp.corpus import stopwords, punctuations, letters, digits

print(stopwords)
print(punctuations)
print(letters)
print(digits)

Remove stopwords from Text

from bnlp.corpus import stopwords
from bnlp.corpus.util import remove_stopwords

raw_text = 'āĻ†āĻŽāĻŋ āĻ­āĻžāĻ¤ āĻ–āĻžāĻ‡āĨ¤'
result = remove_stopwords(raw_text, stopwords)
print(result)
# ['āĻ­āĻžāĻ¤', 'āĻ–āĻžāĻ‡', 'āĨ¤']
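Conceptually, stopword removal is just tokenize-and-filter. A minimal standalone equivalent (a simplified stand-in, not the library's implementation):

```python
def remove_stopwords_simple(text, stopword_set):
    """Whitespace-split the text and drop any token found in the
    stopword set; a simplified sketch of what remove_stopwords does."""
    return [token for token in text.split() if token not in stopword_set]

filtered = remove_stopwords_simple("the cat sat on the mat", {"the", "on"})
print(filtered)  # ['cat', 'sat', 'mat']
```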

Text Cleaning

We adopted various text cleaning routines and code from clean-text and modified them for Bangla. You can now normalize and clean your text using the following method.

from bnlp import CleanText

clean_text = CleanText(
   fix_unicode=True,
   unicode_norm=True,
   unicode_norm_form="NFKC",
   remove_url=False,
   remove_email=False,
   remove_emoji=False,
   remove_number=False,
   remove_digits=False,
   remove_punct=False,
   replace_with_url="<URL>",
   replace_with_email="<EMAIL>",
   replace_with_number="<NUMBER>",
   replace_with_digit="<DIGIT>",
   replace_with_punct="<PUNC>"
)

input_text = "āĻ†āĻŽāĻžāĻ° āĻ¸ā§‹āĻ¨āĻžāĻ° āĻŦāĻžāĻ‚āĻ˛āĻžāĨ¤"
cleaned_text = clean_text(input_text)
print(cleaned_text)

Contributor Guide

Check CONTRIBUTING.md page for details.
