L3Cube-MahaNLP

Despite being the third most popular language in India, Marathi lacks useful NLP resources. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We have contributed unsupervised and supervised datasets, as well as Transformer models, for Marathi. The supervised datasets cover Marathi sentiment analysis, named entity recognition, and hate speech detection. With this, we at L3Cube-Pune aim to bring Marathi to the forefront of IndicNLP. Our vision is to make Marathi a resource-rich language and promote AI for Maharashtra!

[Update] The library is now available as a Python package:

pip install mahaNLP

Usage examples are provided in this demo Colab.

[Update] We have released a new code-mixed Marathi-English unsupervised dataset, MeCorpus, and the supervised datasets MeSent, MeHate, and MeLID.
[Update] We have released a new multi-domain sentiment analysis dataset, MahaSent-MD, with 60k samples across four diverse domains. A new sentiment analysis model is also released on Hugging Face.

L3Cube-MahaCorpus and Marathi BERT

L3Cube-MahaCorpus is a Marathi monolingual dataset scraped from different internet sources. It expands the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We also present MahaBERT, MahaAlBERT, and MahaRoBERTa, all BERT-based masked language models, and MahaFT, a set of fastText word embeddings. Both the language models and the embeddings are trained on the full Marathi corpus with 752M tokens. The evaluation details are given in our paper.

Dataset Statistics

L3Cube-MahaCorpus(full) = L3Cube-MahaCorpus(news) + L3Cube-MahaCorpus(non-news)

The Full Marathi Corpus incorporates all existing sources.

Dataset	#tokens(M)	#sentences(M)	Link
L3Cube-MahaCorpus (news)	212	17.6	link
L3Cube-MahaCorpus (non-news)	76.4	7.2	link
L3Cube-MahaCorpus (full)	289	24.8	link
Full Marathi Corpus (all sources)	752	57.2	link
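
As a quick sanity check on the table above, the following sketch recomputes the L3Cube-MahaCorpus (full) row from its news and non-news parts. The figures are copied from the table; the dictionary layout and the `combine` helper are illustrative names, not part of any released tooling.

```python
# Token and sentence counts (in millions), copied from the table above.
MAHACORPUS_PARTS = {
    "news": {"tokens": 212.0, "sentences": 17.6},
    "non-news": {"tokens": 76.4, "sentences": 7.2},
}


def combine(parts):
    """Sum token and sentence counts across the given corpus parts."""
    return {
        "tokens": sum(p["tokens"] for p in parts.values()),
        "sentences": sum(p["sentences"] for p in parts.values()),
    }


full = combine(MAHACORPUS_PARTS)
print(f"tokens: {full['tokens']:.1f}M, sentences: {full['sentences']:.1f}M")
```

The sentence counts add up exactly to 24.8M. The token counts add up to 288.4M, slightly below the 289M listed for the full corpus, presumably because the per-part figures are rounded.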

L3Cube-MeCorpus and code-mixed MeBERT

L3Cube-MeCorpus is a first-of-its-kind large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences, released in our paper.

Dataset	#tokens(M)	#sentences(M)	Link
L3Cube-MeCorpus (Roman)	70.9	5	link
L3Cube-MeCorpus (Devanagari)	68.6	5	link
L3Cube-MeCorpus (Roman + Devanagari)	139.5	10	link

Marathi BERT models and Marathi Fast Text model

The full Marathi Corpus is used to train the BERT language models below, which are available on the Hugging Face model hub.

Model	Description	Link
MahaGemma-7B	Gemma-7B	v1
MahaGemma-2B	Gemma-2B	v1
MahaBERT	Base-BERT	v1 , v2 , paper
MahaRoBERTa	RoBERTa	link
MahaAlBERT	AlBERT	v1 , v2
MahaGPT	GPT2	link
MahaFT	Fast Text	bin , vec
MahaTweetBERT	MahaBERT + Tweets	model , paper
MahaSBERT	Sentence-BERT	MahaSBERT-STS , MahaSBERT , paper
IndicSBERT	Sentence-BERT (for cross-language)	IndicSBERT-STS , IndicSBERT , paper
MeBERT	Codemixed Marathi-English BERT (Roman)	me-bert , paper
MeRoBERTa	Codemixed Marathi-English RoBERTa (Roman)	me-roberta , paper
MeBERT-Mixed	Codemixed Marathi-English BERT (Roman + Devanagari)	me-bert-mixed , me-bert-mixed-v2 , paper
MeRoBERTa-Mixed	Codemixed Marathi-English RoBERTa (Roman + Devanagari)	me-roberta-mixed , paper
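
The masked language models listed above can typically be queried through the Hugging Face `transformers` fill-mask pipeline. The sketch below is a minimal illustration, not the project's official usage: the model id `l3cube-pune/marathi-bert-v2` is an assumption (check the links in the table for the exact id), and the helper names and the `{mask}` placeholder convention are hypothetical. `transformers` is imported lazily so the helper for building the masked sentence works without it.

```python
def build_masked_sentence(template: str, mask_token: str) -> str:
    """Replace the hypothetical '{mask}' placeholder with the model's mask token."""
    return template.replace("{mask}", mask_token)


def top_predictions(model_id: str, template: str, k: int = 5):
    """Return the k most likely fillers for the masked position as (token, score) pairs.

    Downloads the model on first use, so this needs network access
    and the `transformers` package installed.
    """
    from transformers import pipeline  # imported lazily; heavy dependency

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token rather than hard-coding "[MASK]".
    sentence = build_masked_sentence(template, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(sentence, top_k=k)]


# Example call (assumed model id; uncomment to run with network access):
# for token, score in top_predictions("l3cube-pune/marathi-bert-v2", "मी {mask} आहे"):
#     print(token, round(score, 3))
```

Reading the mask token from the tokenizer keeps the same helper usable across the BERT-, RoBERTa-, and ALBERT-style models, whose mask tokens differ.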

Supervised Datasets

Dataset	Description	Samples(train, valid, test)	link	model	paper
MahaSQuAD	Marathi Question Answering Dataset	142k (118516, 11873, 11803)	data	MahaSQuAD-BERT	link
MahaNews	Marathi long, medium, and short document classification dataset with 12 target classes	53k (42k, 5k, 5k)	data	MahaNews-All-BERT	link
MahaNER	Marathi Named Entity Recognition dataset with 8 entity classes	25k (21.5k, 1.5k, 2k)	data	MahaNER-BERT	link
MahaSocialNER	Social media based Marathi Named Entity Recognition dataset with 8 entity classes	18k (12k, 1.5k, 2.2k)	data	MahaSocialNER-BERT	link
MahaHate	Marathi Hate Speech Detection dataset with 4-class (hate, offensive, profane, and not) and 2-class (hate and not) labels	4-class: 25k (21.5k, 1.5k, 2k), 2-class: 37500	data	4-class , 2-class	link
MahaSent	Marathi Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0)	18,378 (12114, 1500, 2250); extra(2,514=2355(+1) + 159(-1))	data	MarathiSentiment	link
HateEval-Mr	Another dataset for evaluation of Hate Speech models with two classes - Hate(1) and None(0)	2k samples	data		link
MahaSent-MD	A Multi-domain Marathi Sentiment Analysis dataset (4 domains - Marathi Movie Reviews, TV Subtitles, Generic Tweets, and Political Tweets) with three classes - Positive(1), Negative(-1) and Neutral(0)	60k samples	data	MahaSent-MD	link
MeSent	A code-mixed Marathi-English Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0)	12k samples	data	me-sent-roberta	link
MeHate	A code-mixed Marathi-English Hate speech identification dataset with two classes - Hate(1) and None(0)	2768 samples	data	me-hate-bert	link
MeLID	A code-mixed Marathi-English language identification (LID) dataset with three classes - Marathi, English, and Undefined	12k samples	data	me-lid-bert	link
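
The integer label conventions stated in the table above can be decoded with a small lookup. This is a minimal sketch: the mappings are taken from the table descriptions, while the constant names and the `decode` helper are illustrative and not part of the released datasets or the mahaNLP library.

```python
# Label conventions as described in the table above.
SENTIMENT_LABELS = {1: "Positive", 0: "Neutral", -1: "Negative"}  # MahaSent, MahaSent-MD, MeSent
HATE_LABELS = {1: "Hate", 0: "None"}  # HateEval-Mr, MeHate


def decode(labels, mapping):
    """Translate integer labels to class names, failing loudly on unknown values."""
    try:
        return [mapping[int(label)] for label in labels]
    except KeyError as err:
        raise ValueError(f"unexpected label {err.args[0]!r}") from None


print(decode([1, -1, 0], SENTIMENT_LABELS))  # ['Positive', 'Negative', 'Neutral']
```

Raising on unknown values makes it easier to spot a file whose label column uses a different encoding than the one described here.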

License

L3Cube-MahaCorpus, L3Cube-MahaNER, L3Cube-MahaHate, L3Cube-HateEval-Mr, L3Cube-MahaSent-MD, L3CubeMahaSent, L3Cube-MeCorpus, L3Cube-MeSent, L3Cube-MeHate, and L3Cube-MeLID are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The datasets are released to the community for research purposes only, and the group is not responsible for any misuse of these datasets.

Citing

@article{joshi2022l3cube,
  title={L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library},
  author={Joshi, Raviraj},
  journal={arXiv preprint arXiv:2205.14728},
  year={2022}
}

@inproceedings{joshi-2022-l3cube,
    title = "L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources",
    author = "Joshi, Raviraj",
    booktitle = "Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.wildre-1.17",
    pages = "97--101",
}

Publications

Joshi, Raviraj. "L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources." Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference. 2022.

Mittal, Saloni, et al. "L3Cube-MahaNews: News-Based Short Text and Long Document Classification Datasets in Marathi." International Conference on Speech and Language Technologies for Low-resource Languages. Cham: Springer Nature Switzerland, 2023.

Chavan, Tanmay, et al. "My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks." arXiv preprint arXiv:2306.14030 (2023).

Pingle, Aabha, et al. "L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models." arXiv preprint arXiv:2306.13888 (2023).

Pingle, Aabha, et al. "Robust Sentiment Analysis for Low Resource languages Using Data Augmentation Approaches: A Case Study in Marathi." arXiv preprint arXiv:2310.00734 (2023).

Deode, Samruddhi, et al. "L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT." arXiv preprint arXiv:2304.11434 (2023).

Joshi, Ananya, et al. "L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi." arXiv preprint arXiv:2211.11187 (2022).

Gokhale, Omkar Bhushan, et al. "Spread Love Not Hate: Undermining the Importance of Hateful Pre-training for Hate Speech Detection." I Can't Believe It's Not Better Workshop: Understanding Deep Learning Through Empirical Falsification.

Sabane, Maithili, et al. "Enhancing Low Resource NER using Assisting Language and Transfer Learning." 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC). IEEE, 2023.

Litake, Onkar, et al. "L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models." Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference. 2022.

Litake, Onkar, et al. "Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition." arXiv preprint arXiv:2203.12907 (2022).

Velankar, Abhishek, Hrushikesh Patil, and Raviraj Joshi. "Mono vs multilingual bert for hate speech detection and text classification: A case study in marathi." IAPR Workshop on Artificial Neural Networks in Pattern Recognition. Springer, Cham, 2023.

Patil, Hrushikesh, Abhishek Velankar, and Raviraj Joshi. "L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models." Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022). 2022.

Velankar, Abhishek, et al. "Hate and offensive speech detection in Hindi and Marathi." arXiv preprint arXiv:2110.12200 (2021).

Kulkarni, Atharva, et al. "L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset." Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2021.

Kulkarni, Atharva, et al. "Experimental Evaluation of Deep Learning Models for Marathi Text Classification." Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. Springer, Singapore, 2022.

This project is led by Raviraj Joshi under L3Cube Labs, Pune. For any queries, contact ravirajoshi@gmail.com.
