l3cube-pune/MarathiNLP

L3Cube-MahaNLP

Although Marathi is the third most popular language in India, it lacks useful NLP resources. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We have contributed unsupervised and supervised datasets, as well as Transformer models, for Marathi. The supervised datasets cover Marathi sentiment analysis, named entity recognition, and hate speech detection. With this, we at L3Cube-Pune aim to bring Marathi to the forefront of IndicNLP. Our vision is to make Marathi a resource-rich language and promote AI for Maharashtra!

[Update] The library is now available as a Python package:

pip install mahaNLP

Usage examples are provided in this demo Colab.

[Update] We have released a new code-mixed Marathi-English unsupervised dataset MeCorpus and supervised datasets like MeSent, MeHate, and MeLID.
[Update] We have released a new multi-domain sentiment analysis dataset, MahaSent-MD, with 60k samples across four diverse domains. A new sentiment analysis model is also released on Hugging Face.

L3Cube-MahaCorpus and Marathi BERT

L3Cube-MahaCorpus is a Marathi monolingual dataset scraped from different internet sources. It expands the existing Marathi monolingual corpus by 24.8M sentences and 289M tokens. We also present MahaBERT, MahaAlBERT, and MahaRoBERTa, all BERT-based masked language models, and MahaFT, a set of fastText word embeddings. Both the language models and the embeddings are trained on the full Marathi corpus of 752M tokens. The evaluation details are given in our paper.

Dataset Statistics

L3Cube-MahaCorpus(full) = L3Cube-MahaCorpus(news) + L3Cube-MahaCorpus(non-news)

Full Marathi Corpus incorporates all existing sources.

| Dataset | #tokens (M) | #sentences (M) | Link |
|---|---|---|---|
| L3Cube-MahaCorpus (news) | 212 | 17.6 | link |
| L3Cube-MahaCorpus (non-news) | 76.4 | 7.2 | link |
| L3Cube-MahaCorpus (full) | 289 | 24.8 | link |
| Full Marathi Corpus (all sources) | 752 | 57.2 | link |
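
The relation MahaCorpus(full) = MahaCorpus(news) + MahaCorpus(non-news) can be checked directly against the figures in the table. The snippet below is a minimal sketch: the numbers are copied from the table, while the helper names and the 1M tolerance are illustrative choices, not part of the library. The token counts sum to 288.4M against a reported 289M, which is presumably a rounding difference, so the check allows a small tolerance.

```python
# Figures copied from the table above, in millions of tokens / sentences.
CORPUS_STATS = {
    "news": {"tokens_m": 212.0, "sentences_m": 17.6},
    "non-news": {"tokens_m": 76.4, "sentences_m": 7.2},
    "full": {"tokens_m": 289.0, "sentences_m": 24.8},
}


def combined(parts, field):
    """Sum one field over the named subsets."""
    return sum(CORPUS_STATS[part][field] for part in parts)


def is_consistent(field, tolerance_m=1.0):
    """True if news + non-news matches the reported full figure.

    tolerance_m (in millions) is an assumed allowance for rounding
    in the published table.
    """
    total = combined(["news", "non-news"], field)
    return abs(total - CORPUS_STATS["full"][field]) <= tolerance_m


if __name__ == "__main__":
    for field in ("tokens_m", "sentences_m"):
        total = round(combined(["news", "non-news"], field), 1)
        print(field, total, is_consistent(field))
```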

L3Cube-MeCorpus and code-mixed MeBERT

L3Cube-MeCorpus is a first-of-its-kind large code-mixed Marathi-English (Mr-En) corpus of 10 million social media sentences, released with our paper.

| Dataset | #tokens (M) | #sentences (M) | Link |
|---|---|---|---|
| L3Cube-MeCorpus (Roman) | 70.9 | 5 | link |
| L3Cube-MeCorpus (Devanagari) | 68.6 | 5 | link |
| L3Cube-MeCorpus (Roman + Devanagari) | 139.5 | 10 | link |

Marathi BERT models and Marathi Fast Text model

The full Marathi corpus is used to train BERT language models, which are available on the Hugging Face model hub.

| Model | Description | Link |
|---|---|---|
| MahaGemma-7B | Gemma-7B | v1 |
| MahaGemma-2B | Gemma-2B | v1 |
| MahaBERT | Base-BERT | v1, v2, paper |
| MahaRoBERTa | RoBERTa | link |
| MahaAlBERT | AlBERT | v1, v2 |
| MahaGPT | GPT2 | link |
| MahaFT | Fast Text | bin, vec |
| MahaTweetBERT | MahaBERT + Tweets | model, paper |
| MahaSBERT | Sentence-BERT | MahaSBERT-STS, MahaSBERT, paper |
| IndicSBERT | Sentence-BERT (for cross-language) | IndicSBERT-STS, IndicSBERT, paper |
| MeBERT | Codemixed Marathi-English BERT (Roman) | me-bert, paper |
| MeRoBERTa | Codemixed Marathi-English RoBERTa (Roman) | me-roberta, paper |
| MeBERT-Mixed | Codemixed Marathi-English BERT (Roman + Devanagari) | me-bert-mixed, me-bert-mixed-v2, paper |
| MeRoBERTa-Mixed | Codemixed Marathi-English RoBERTa (Roman + Devanagari) | me-roberta-mixed, paper |
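
Since the masked language models above are hosted on the Hugging Face model hub, they can presumably be queried with the `transformers` fill-mask pipeline. The sketch below shows one way to do that. The hub IDs in `ASSUMED_HUB_IDS` are assumptions and should be checked against the links in the table; `hub_id` and `top_predictions` are hypothetical helper names, and `<mask>` is just a placeholder that gets swapped for each model's own mask token.

```python
# Hub IDs are ASSUMPTIONS for illustration; confirm them against the
# links in the model table before relying on them.
ASSUMED_HUB_IDS = {
    "MahaBERT": "l3cube-pune/marathi-bert-v2",
    "MahaRoBERTa": "l3cube-pune/marathi-roberta",
}


def hub_id(model_name):
    """Look up the (assumed) hub ID for a model name from the table."""
    try:
        return ASSUMED_HUB_IDS[model_name]
    except KeyError:
        raise ValueError(f"No hub ID recorded for {model_name!r}") from None


def top_predictions(model_name, text_with_placeholder, k=3):
    """Return the top-k (token, score) fills for a single '<mask>' placeholder.

    Requires the `transformers` package and downloads the model on first use.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    fill = pipeline("fill-mask", model=hub_id(model_name))
    # BERT-style and RoBERTa-style tokenizers use different mask tokens.
    text = text_with_placeholder.replace("<mask>", fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=k)]
```

A call such as `top_predictions("MahaBERT", "मी <mask> आहे.")` would then return the three highest-scoring candidate tokens with their scores.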

Supervised Datasets

| Dataset | Description | Samples (train, valid, test) | Link | Model | Paper |
|---|---|---|---|---|---|
| MahaSQuAD | Marathi Question Answering Dataset | 142k (118516, 11873, 11803) | data | MahaSQuAD-BERT | link |
| MahaNews | Marathi long, medium, and short document classification dataset with 12 target classes | 53k (42k, 5k, 5k) | data | MahaNews-All-BERT | link |
| MahaNER | Marathi Named Entity Recognition dataset with 8 entity classes | 25k (21.5k, 1.5k, 2k) | data | MahaNER-BERT | link |
| MahaSocialNER | Social media based Marathi Named Entity Recognition dataset with 8 entity classes | 18k (12k, 1.5k, 2.2k) | data | MahaSocialNER-BERT | link |
| MahaHate | Marathi Hate Speech Detection dataset with 4-class (hate, offensive, profane, and not) and 2-class (hate and not) labels | 4-class: 25k (21.5k, 1.5k, 2k), 2-class: 37500 | data | 4-class, 2-class | link |
| MahaSent | Marathi Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0) | 18,378 (12114, 1500, 2250); extra(2,514=2355(+1) + 159(-1)) | data | MarathiSentiment | link |
| HateEval-Mr | Another dataset for evaluation of Hate Speech models with two classes - Hate(1) and None(0) | 2k samples | data | | link |
| MahaSent-MD | A Multi-domain Marathi Sentiment Analysis dataset (4 domains - Marathi Movie Reviews, TV Subtitles, Generic Tweets, and Political Tweets) with three classes - Positive(1), Negative(-1) and Neutral(0) | 60k samples | data | MahaSent-MD | link |
| MeSent | A code-mixed Marathi-English Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0) | 12k samples | data | me-sent-roberta | link |
| MeHate | A code-mixed Marathi-English Hate speech identification dataset with two classes - Hate(1) and None(0) | 2768 samples | data | me-hate-bert | link |
| MeLID | A code-mixed Marathi-English language identification (LID) dataset with three classes - Marathi, English, and Undefined | 12k samples | data | me-lid-bert | link |
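
The sentiment datasets above label samples as Positive(1), Negative(-1), and Neutral(0). Many classifier training setups expect contiguous class indices starting at 0, so a remapping step is often needed before fine-tuning. The snippet below is a minimal sketch of such a step; the particular index order and the helper names are illustrative assumptions, not something the datasets or released models prescribe.

```python
# Label values as described in the dataset table.
SENTIMENT_NAMES = {1: "Positive", -1: "Negative", 0: "Neutral"}

# ASSUMED ordering for training indices; any fixed bijection works,
# as long as the same one is used for training and inference.
LABEL_TO_INDEX = {-1: 0, 0: 1, 1: 2}
INDEX_TO_LABEL = {index: label for label, index in LABEL_TO_INDEX.items()}


def encode(labels):
    """Map dataset labels (-1, 0, 1) to class indices (0, 1, 2)."""
    return [LABEL_TO_INDEX[label] for label in labels]


def decode(indices):
    """Map predicted class indices back to dataset labels."""
    return [INDEX_TO_LABEL[index] for index in indices]


def names(labels):
    """Human-readable names for dataset labels."""
    return [SENTIMENT_NAMES[label] for label in labels]
```

Decoding model predictions with the same mapping keeps them comparable with the labels stored in the dataset files.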

License

L3Cube-MahaCorpus, L3Cube-MahaNER, L3Cube-MahaHate, L3Cube-HateEval-Mr, L3Cube-MahaSent-MD, L3CubeMahaSent, L3Cube-MeCorpus, L3Cube-MeSent, L3Cube-MeHate, and L3Cube-MeLID are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The datasets are released to the community for research purposes only, and the group is not responsible for any misuse of these datasets.

Citing

@article{joshi2022l3cube,
  title={L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library},
  author={Joshi, Raviraj},
  journal={arXiv preprint arXiv:2205.14728},
  year={2022}
}
@inproceedings{joshi-2022-l3cube,
    title = "L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources",
    author = "Joshi, Raviraj",
    booktitle = "Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.wildre-1.17",
    pages = "97--101",
}

Publications

Joshi, Raviraj. "L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources." LREC 2022 Workshop Language Resources and Evaluation Conference 20-25 June 2022. 2022.

Mittal, Saloni, et al. "L3Cube-MahaNews: News-Based Short Text and Long Document Classification Datasets in Marathi." International Conference on Speech and Language Technologies for Low-resource Languages. Cham: Springer Nature Switzerland, 2023.

Chavan, Tanmay, et al. "My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks." arXiv preprint arXiv:2306.14030 (2023).

Pingle, Aabha, et al. "L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models." arXiv preprint arXiv:2306.13888 (2023).

Pingle, Aabha, et al. "Robust Sentiment Analysis for Low Resource languages Using Data Augmentation Approaches: A Case Study in Marathi." arXiv preprint arXiv:2310.00734 (2023).

Deode, Samruddhi, et al. "L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT." arXiv preprint arXiv:2304.11434 (2023).

Joshi, Ananya, et al. "L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi." arXiv preprint arXiv:2211.11187 (2022).

Gokhale, Omkar Bhushan, et al. "Spread Love Not Hate: Undermining the Importance of Hateful Pre-training for Hate Speech Detection." I Can't Believe It's Not Better Workshop: Understanding Deep Learning Through Empirical Falsification.

Sabane, Maithili, et al. "Enhancing Low Resource NER using Assisting Language and Transfer Learning." 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC). IEEE, 2023.

Litake, Onkar, et al. "L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models." Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference. 2022.

Litake, Onkar, et al. "Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition." arXiv preprint arXiv:2203.12907 (2022).

Velankar, Abhishek, Hrushikesh Patil, and Raviraj Joshi. "Mono vs multilingual bert for hate speech detection and text classification: A case study in marathi." IAPR Workshop on Artificial Neural Networks in Pattern Recognition. Springer, Cham, 2023.

Patil, Hrushikesh, Abhishek Velankar, and Raviraj Joshi. "L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models." Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022). 2022.

Velankar, Abhishek, et al. "Hate and offensive speech detection in Hindi and Marathi." arXiv preprint arXiv:2110.12200 (2021).

Kulkarni, Atharva, et al. "L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset." Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2021.

Kulkarni, Atharva, et al. "Experimental Evaluation of Deep Learning Models for Marathi Text Classification." Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. Springer, Singapore, 2022.

This project is led by Raviraj Joshi under L3Cube Labs, Pune. For any queries, contact ravirajoshi@gmail.com.