Bangla2Vec

Language Modelling and Classification in the Bengali Language

Announcement: I will be giving a talk about this work at IEM, Kolkata this Saturday. The event link is here. Hope to see you there!

Bangla2Vec is an open source project for modelling the Bengali language. The models released here can be used for tasks such as classification and translation. All the data and models are open source, so you can train your own model or use the pretrained models for your own tasks.

Releases

  • Trained a skip-gram model on a news dataset: Training Script | Results | Model
  • Trained a skip-gram model on a Wikipedia dataset: Training Script | Results | Model
  • Visualise word embeddings: Script | Create a directory named vis, run the script, then start TensorBoard with tensorboard --logdir=vis
  • Scripts to scrape data from Bengali news websites: GitHub Repo
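
For readers new to skip-gram, the sketch below shows how such a model turns a tokenised sentence into (center, context) training pairs. It is only an illustration of the idea and not the code in the training notebooks. The function name `skipgram_pairs`, the `window` parameter and the sample tokens are assumptions made for this example.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                yield center, tokens[j]


# Hypothetical, already-tokenised sentence (placeholder tokens, not from the dataset).
sentence = ["ami", "mishti", "khete", "bhalobashi"]
pairs = list(skipgram_pairs(sentence, window=1))
for center, context in pairs:
    print(center, "->", context)
```

A skip-gram model is then trained to predict each context word from its center word. A larger window usually captures broader topical similarity, and a smaller one captures more syntactic similarity.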

Results

Words most similar to the word chele (boy)

Father + Girl - Boy = Mother

Odd one out

Bengalis Love Sweets!
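
The analogy and odd-one-out results above can be reproduced in spirit with plain vector arithmetic and cosine similarity. The sketch below uses made-up four-dimensional vectors, not the released embeddings, and English keys for readability. The names `TOY_VECTORS`, `analogy` and `odd_one_out` are assumptions for this example, and the odd-one-out rule shown (lowest similarity to the mean of the unit vectors) is one common approach rather than necessarily the one used in the notebooks.

```python
import math
from typing import Dict, List, Sequence

# Made-up vectors for illustration only; NOT taken from the released models.
# Assumed dimensions: [gender, generation, person-ness, food-ness]
TOY_VECTORS: Dict[str, List[float]] = {
    "boy": [1.0, 0.0, 1.0, 0.0],
    "girl": [-1.0, 0.0, 1.0, 0.0],
    "father": [1.0, 1.0, 1.0, 0.0],
    "mother": [-1.0, 1.0, 1.0, 0.0],
    "sweet": [0.0, 0.0, 0.0, 1.0],
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive: List[str], negative: List[str],
            vectors: Dict[str, List[float]]) -> str:
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def odd_one_out(words: List[str], vectors: Dict[str, List[float]]) -> str:
    """Return the word least similar to the mean of the unit-normalised vectors."""
    units = {}
    for word in words:
        norm = math.sqrt(sum(x * x for x in vectors[word]))
        units[word] = [x / norm for x in vectors[word]]
    dim = len(vectors[words[0]])
    mean = [sum(units[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(units[w], mean))


print(analogy(["father", "girl"], ["boy"], TOY_VECTORS))  # mother
print(odd_one_out(["boy", "girl", "father", "mother", "sweet"], TOY_VECTORS))  # sweet
```

With real embeddings the same two functions would be given the trained word vectors instead of `TOY_VECTORS`.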

Data

Data was scraped from multiple online Bengali news websites.

Data was also collected from a Wikipedia dump.

You can view the data in the data folder.

Examples

  • Classification: A news classifier was built using the trained Bangla2Vec models. It classifies news into 5 categories based on the headlines. The best model achieved a test F1 score of 0.76 after training on just 40k news headlines.
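
The README does not describe the classifier's architecture, so the following is only a minimal baseline sketch of how word vectors can feed a headline classifier: each headline is represented by the average of its word vectors and assigned to the nearest class centroid. The toy vectors, the labels and the function names (`headline_vector`, `fit_centroids`, `predict`) are all hypothetical, and this is not the model that produced the score reported above.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def headline_vector(tokens: List[str],
                    vectors: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of known tokens; None if no token is in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def fit_centroids(examples: List[Tuple[List[str], str]],
                  vectors: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Compute one mean headline vector (centroid) per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for tokens, label in examples:
        vec = headline_vector(tokens, vectors)
        if vec is None:
            continue  # skip headlines with no known words
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(tokens: List[str], centroids: Dict[str, List[float]],
            vectors: Dict[str, List[float]]) -> Optional[str]:
    """Return the label whose centroid is closest (Euclidean) to the headline vector."""
    vec = headline_vector(tokens, vectors)
    if vec is None:
        return None
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Toy 2-d vectors and labels, invented for this example only.
toy_vectors = {
    "goal": [1.0, 0.0], "match": [0.9, 0.1],
    "vote": [0.0, 1.0], "minister": [0.1, 0.9],
}
train = [(["goal", "match"], "sports"), (["vote", "minister"], "politics")]
centroids = fit_centroids(train, toy_vectors)
print(predict(["match"], centroids, toy_vectors))  # sports
```

In practice the averaged vectors would more often be passed to a trained classifier such as logistic regression or a neural network. The nearest-centroid rule is used here only to keep the sketch dependency-free.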

Similar Projects

This project is a sister project of other projects working on IndicNLP. They include:

To find resources for getting started with IndicNLP, or to learn more about it, see our Awesome List of resources.

Future Work

  • Build a word2vec model
  • Visualise the trained embeddings
  • Build a ULMFiT model
  • Get translation data