Check out the bnlp 4.0.0 dev version and its documentation. bnlp 4.0.0 was designed with a proper object-oriented approach: you initialize a class once with a trained model and reuse that instance to get results.
BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words and documents, tag Bengali parts of speech, recognize Bengali named entities, and clean Bangla text for Bengali NLP purposes.
- Installation
- Pretrained Model
- Tokenization
- Word Embedding
- Document Embedding
- Bengali POS Tagging
- Bengali NER
- Bengali Corpus Class
- Bangla Text Cleaning
- Contributor Guide
pip install bnlp_toolkit
or upgrade:
pip install -U bnlp_toolkit
- Python: 3.6, 3.7, 3.8, 3.9
- OS: Linux, Windows, Mac
Large models are published in the Hugging Face model hub.
- Bengali SentencePiece
- Bengali Word2Vec
- Bengali FastText
- Bengali GloVe Wordvectors
- Bengali POS Tag model
- Bengali NER model
- Bengali News article Doc2Vec model
- Bangla Wikipedia Doc2Vec model
- SentencePiece, Word2Vec, FastText, and GloVe models were trained on the Bengali Wikipedia dump dataset
- SentencePiece training vocab size = 50000
- FastText was trained with total words = 20M, vocab size = 1171011, epochs = 50, embedding dimension = 300, and training loss = 0.318668
- Word2Vec word embedding dimension = 100, min_count = 5, window = 5, epochs = 10
- To learn about the Bengali GloVe word vectors and their training process, follow this repository
- The Bengali CRF POS tagging model was trained on the nltr dataset and achieves 80% accuracy
- The Bengali CRF NER model was trained on this data and achieves 90% accuracy
- The Bengali news article Doc2Vec model was trained on 8 JSON files of this corpus (400013 news articles in total) with epochs = 40, vector_size = 100, min_count = 2
- The Bengali Wikipedia Doc2Vec model was trained on the Wikipedia dump dataset (110448 articles in total) with epochs = 40, vector_size = 100, min_count = 2
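Since the large models are hosted on the Hugging Face model hub, one convenient way to fetch a model file programmatically is the huggingface_hub client. This is a minimal sketch only; the repo_id and filename below are placeholders, not the actual repository names, so substitute the model you actually need.
from huggingface_hub import hf_hub_download

# placeholder repo_id and filename -- replace with the real repository and file
model_path = hf_hub_download(
    repo_id="<org>/<bengali-model-repo>",
    filename="bn_spm.model",
)
print(model_path)  # local cached path, ready to pass into the bnlp classes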
from bnlp import BasicTokenizer
basic_tokenizer = BasicTokenizer()
raw_text = "আমি বাংলায় গান গাই।"
tokens = basic_tokenizer.tokenize(raw_text)
print(tokens)
# output: ["আমি", "বাংলায়", "গান", "গাই", "।"]
from bnlp import NLTKTokenizer
bnltk = NLTKTokenizer()
text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
word_tokens = bnltk.word_tokenize(text)
sentence_tokens = bnltk.sentence_tokenize(text)
print(word_tokens)
print(sentence_tokens)
# output
# word_token: ["আমি", "ভাত", "খাই", "।", "সে", "বাজারে", "যায়", "।", "তিনি", "কি", "সত্যিই", "ভালো", "মানুষ", "?"]
# sentence_token: ["আমি ভাত খাই।", "সে বাজারে যায়।", "তিনি কি সত্যিই ভালো মানুষ?"]
from bnlp import SentencepieceTokenizer
bsp = SentencepieceTokenizer()
model_path = "./model/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
text2id = bsp.text2id(model_path, input_text)
print(text2id)
id2text = bsp.id2text(model_path, text2id)
print(id2text)
from bnlp import SentencepieceTokenizer
bsp = SentencepieceTokenizer()
data = "raw_text.txt"
model_prefix = "test"
vocab_size = 5
bsp.train(data, model_prefix, vocab_size)
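After training, SentencePiece writes the model files using the given prefix (here test.model and test.vocab). As a quick sanity check, the freshly trained model can be used with the same tokenize API shown above; a minimal sketch:
from bnlp import SentencepieceTokenizer

bsp = SentencepieceTokenizer()
model_path = "test.model"  # produced by the training call above
print(bsp.tokenize(model_path, "আমি ভাত খাই।"))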
from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'গ্রাম'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)
from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'গ্রাম'
similar = bwv.most_similar(model_path, word, topn=10)
print(similar)
Train Bengali Word2Vec with your own raw data or tokenized sentences.
Custom tokenized sentence format example:
sentences = [['আমি', 'ভাত', 'খাই', '।'], ['সে', 'বাজারে', 'যায়', '।']]
Check the gensim Word2Vec API for details of the training parameters.
from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
data_file = "raw_text.txt" # or pass custom sentence tokens as a list of lists
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train(data_file, model_name, vector_name, epochs=5)
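Once training finishes, the saved model can be queried with the same APIs shown earlier. A short usage sketch (assuming the query word appears in your training data often enough to clear min_count):
from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
model_path = "test_model.model"  # saved by the training call above
vector = bwv.generate_word_vector(model_path, "আমি")  # "আমি" assumed to be in the vocabulary
print(vector.shape)
print(bwv.most_similar(model_path, "আমি", topn=5))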
Check the gensim Word2Vec API for details of the training parameters.
from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
trained_model_path = "mytrained_model.model"
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.pretrain(trained_model_path, data_file, model_name, vector_name, epochs=5)
To use FastText, you need to install fasttext manually: pip install fasttext==0.9.2. NB: fasttext may not work on Windows; it is only expected to work on Linux.
from bnlp.embedding.fasttext import BengaliFasttext
bft = BengaliFasttext()
word = "গ্রাম"
model_path = "bengali_fasttext_wiki.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)
Check the fasttext documentation for details of the training parameters.
from bnlp.embedding.fasttext import BengaliFasttext
bft = BengaliFasttext()
data = "raw_text.txt"
model_name = "saved_model.bin"
epoch = 50
bft.train(data, model_name, epoch)
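The resulting binary can be used for word vectors through the same generate_word_vector API shown above; a minimal sketch:
from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()
model_path = "saved_model.bin"  # written by the training call above
vector = bft.generate_word_vector(model_path, "আমি")
print(vector.shape)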
from bnlp.embedding.fasttext import BengaliFasttext
bft = BengaliFasttext()
model_path = "mymodel.bin"
out_vector_name = "myvector.txt"
bft.bin2vec(model_path, out_vector_name)
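Assuming the exported text file follows the standard fastText .vec layout (word2vec text format with a header line), it can be loaded with gensim; a sketch under that assumption:
from gensim.models import KeyedVectors

# assumes myvector.txt is in word2vec text format (standard fastText .vec layout)
kv = KeyedVectors.load_word2vec_format("myvector.txt")
print(kv["আমি"])  # assumes the word is in the vocabulary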
We trained a GloVe model with Bengali data (wiki + news articles) and published the Bengali GloVe word vectors.
You can download and use them for your machine learning purposes.
from bnlp import BengaliGlove
glove_path = "bn_glove.39M.100d.txt"
word = "গ্রাম"
bng = BengaliGlove()
res = bng.closest_word(glove_path, word)
print(res)
vec = bng.word2vec(glove_path, word)
print(vec)
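Because bn_glove.39M.100d.txt is a plain GloVe text file with no header line, it can also be loaded directly with gensim 4.x if you prefer working with KeyedVectors; a minimal sketch:
from gensim.models import KeyedVectors

# no_header=True tells gensim the file is GloVe-style text without a header line
kv = KeyedVectors.load_word2vec_format("bn_glove.39M.100d.txt", binary=False, no_header=True)
print(kv.most_similar("গ্রাম", topn=5))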
from bnlp import BengaliDoc2vec
bn_doc2vec = BengaliDoc2vec()
model_path = "bangla_news_article_doc2vec.model" # keep the other .npy model files in the same folder too
document = "রাষ্ট্রবিরোধী ও উসকানিমূলক বক্তব্য দেওয়ার অভিযোগে গাজীপুরের গাছা থানায় ডিজিটাল নিরাপত্তা আইনে করা মামলায় আলোচিত ‘শিশুবক্তা’ রফিকুল ইসলামের বিরুদ্ধে অভিযোগ গঠন করেছেন আদালত। ফলে মামলার আনুষ্ঠানিক বিচার শুরু হলো। আজ বুধবার (২৬ জানুয়ারি) ঢাকার সাইবার ট্রাইব্যুনালের বিচারক আসসামছ জগলুল হোসেন এ অভিযোগ গঠন করেন। এর আগে, রফিকুল ইসলামকে কারাগার থেকে আদালতে হাজির করা হয়। এরপর তাকে নির্দোষ দাবি করে তার আইনজীবী শোহেল মো. ফজলে রাব্বি অব্যাহতি চেয়ে আবেদন করেন। অন্যদিকে, রাষ্ট্রপক্ষ অভিযোগ গঠনের পক্ষে শুনানি করেন। উভয় পক্ষের শুনানি শেষে আদালত অব্যাহতির আবেদন খারিজ করে অভিযোগ গঠনের মাধ্যমে বিচার শুরুর আদেশ দেন। এ-সংক্রান্ত সাক্ষ্যগ্রহণের জন্য আগামী ২২ ফেব্রুয়ারি দিন ধার্য করেন আদালত।"
vector = bn_doc2vec.get_document_vector(model_path, document)
print(vector)
from bnlp import BengaliDoc2vec
bn_doc2vec = BengaliDoc2vec()
model_path = "bangla_news_article_doc2vec.model" # keep the other .npy model files in the same folder too
article_1 = "রাষ্ট্রবিরোধী ও উসকানিমূলক বক্তব্য দেওয়ার অভিযোগে গাজীপুরের গাছা থানায় ডিজিটাল নিরাপত্তা আইনে করা মামলায় আলোচিত ‘শিশুবক্তা’ রফিকুল ইসলামের বিরুদ্ধে অভিযোগ গঠন করেছেন আদালত। ফলে মামলার আনুষ্ঠানিক বিচার শুরু হলো। আজ বুধবার (২৬ জানুয়ারি) ঢাকার সাইবার ট্রাইব্যুনালের বিচারক আসসামছ জগলুল হোসেন এ অভিযোগ গঠন করেন। এর আগে, রফিকুল ইসলামকে কারাগার থেকে আদালতে হাজির করা হয়। এরপর তাকে নির্দোষ দাবি করে তার আইনজীবী শোহেল মো. ফজলে রাব্বি অব্যাহতি চেয়ে আবেদন করেন। অন্যদিকে, রাষ্ট্রপক্ষ অভিযোগ গঠনের পক্ষে শুনানি করেন। উভয় পক্ষের শুনানি শেষে আদালত অব্যাহতির আবেদন খারিজ করে অভিযোগ গঠনের মাধ্যমে বিচার শুরুর আদেশ দেন। এ-সংক্রান্ত সাক্ষ্যগ্রহণের জন্য আগামী ২২ ফেব্রুয়ারি দিন ধার্য করেন আদালত।"
article_2 = "রাষ্ট্রবিরোধী ও উসকানিমূলক বক্তব্য দেওয়ার অভিযোগে গাজীপুরের গাছা থানায় ডিজিটাল নিরাপত্তা আইনে করা মামলায় আলোচিত ‘শিশুবক্তা’ রফিকুল ইসলামের বিরুদ্ধে অভিযোগ গঠন করেছেন আদালত। ফলে মামলার আনুষ্ঠানিক বিচার শুরু হলো। আজ বুধবার (২৬ জানুয়ারি) ঢাকার সাইবার ট্রাইব্যুনালের বিচারক আসসামছ জগলুল হোসেন এ অভিযোগ গঠন করেন। এর আগে, রফিকুল ইসলামকে কারাগার থেকে আদালতে হাজির করা হয়। এরপর তাকে নির্দোষ দাবি করে তার আইনজীবী শোহেল মো. ফজলে রাব্বি অব্যাহতি চেয়ে আবেদন করেন। অন্যদিকে, রাষ্ট্রপক্ষ অভিযোগ গঠনের পক্ষে শুনানি করেন। উভয় পক্ষের শুনানি শেষে আদালত অব্যাহতির আবেদন খারিজ করে অভিযোগ গঠনের মাধ্যমে বিচার শুরুর আদেশ দেন। এ-সংক্রান্ত সাক্ষ্যগ্রহণের জন্য আগামী ২২ ফেব্রুয়ারি দিন ধার্য করেন আদালত।"
similarity = bn_doc2vec.get_document_similarity(
model_path,
article_1,
article_2
)
print(similarity)
from bnlp import BengaliDoc2vec
bn_doc2vec = BengaliDoc2vec()
text_files = "path/myfiles"
checkpoint_path = "msc/logs"
bn_doc2vec.train_doc2vec(
text_files,
checkpoint_path=checkpoint_path,
vector_size=100,
min_count=2,
epochs=10
)
# trains doc2vec on your text files and saves the trained model to checkpoint_path
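The trained checkpoint can then be used with the same get_document_vector API shown above. The file name below is hypothetical; check checkpoint_path for the actual name your training run produced:
from bnlp import BengaliDoc2vec

bn_doc2vec = BengaliDoc2vec()
model_path = "msc/logs/custom_doc2vec.model"  # hypothetical file name; check checkpoint_path
vector = bn_doc2vec.get_document_vector(model_path, "আমি ভাত খাই।")
print(vector)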
from bnlp import POS
bn_pos = POS()
model_path = "model/bn_pos.pkl"
text = "আমি ভাত খাই।" # or you can pass ['আমি', 'ভাত', 'খাই', '।']
res = bn_pos.tag(model_path, text)
print(res)
# [('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]
from bnlp import POS
bn_pos = POS()
model_name = "pos_model.pkl"
train_data = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'), ('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]
test_data = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'), ('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]
bn_pos.train(model_name, train_data, test_data)
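The newly trained model can be used for tagging with the same tag API shown above; a short usage sketch:
from bnlp import POS

bn_pos = POS()
model_path = "pos_model.pkl"  # saved by the training call above
print(bn_pos.tag(model_path, "আমি ভাত খাই।"))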
from bnlp import NER
bn_ner = NER()
model_path = "model/bn_ner.pkl"
text = "সে ঢাকায় থাকে।" # or you can pass ['সে', 'ঢাকায়', 'থাকে', '।']
result = bn_ner.tag(model_path, text)
print(result)
# [('সে', 'O'), ('ঢাকায়', 'S-LOC'), ('থাকে', 'O')]
from bnlp import NER
bn_ner = NER()
model_name = "ner_model.pkl"
train_data = [[('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')], [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')], [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')]]
test_data = [[('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')], [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')], [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')]]
bn_ner.train(model_name, train_data, test_data)
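As with POS, the freshly trained NER model plugs into the same tag API; a short usage sketch:
from bnlp import NER

bn_ner = NER()
model_path = "ner_model.pkl"  # saved by the training call above
print(bn_ner.tag(model_path, "সে ঢাকায় থাকে।"))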
from bnlp.corpus import stopwords, punctuations, letters, digits
print(stopwords)
print(punctuations)
print(letters)
print(digits)
from bnlp.corpus import stopwords
from bnlp.corpus.util import remove_stopwords
raw_text = 'আমি ভাত খাই।'
result = remove_stopwords(raw_text, stopwords)
print(result)
# ['ভাত', 'খাই', '।']
We adopted various text-cleaning routines and code from clean-text and modified them for Bangla. You can now normalize and clean your text using the following methods.
from bnlp import CleanText
clean_text = CleanText(
fix_unicode=True,
unicode_norm=True,
unicode_norm_form="NFKC",
remove_url=False,
remove_email=False,
remove_emoji=False,
remove_number=False,
remove_digits=False,
remove_punct=False,
replace_with_url="<URL>",
replace_with_email="<EMAIL>",
replace_with_number="<NUMBER>",
replace_with_digit="<DIGIT>",
replace_with_punct="<PUNC>"
)
input_text = "আমার সোনার বাংলা।"
cleaned_text = clean_text(input_text)
print(cleaned_text)
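The example above leaves all removal flags off, so the text passes through largely unchanged. To see the replacement tokens in action, enable the corresponding flags; a minimal sketch (assuming unspecified options keep their defaults):
from bnlp import CleanText

clean_text = CleanText(
    remove_url=True,
    remove_number=True,
    replace_with_url="<URL>",
    replace_with_number="<NUMBER>",
)
print(clean_text("আমার সাইট https://example.com এ 100 টি লেখা আছে।"))
# expected: the URL becomes <URL> and 100 becomes <NUMBER>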
Check the CONTRIBUTING.md page for details.