NLP for Kannada

This repository contains a state-of-the-art tokenizer, language model, and classifier for Kannada, a language spoken predominantly by Kannada people in India, mainly in the state of Karnataka.

Dataset

Results

Language Model

Evaluated on a 20% validation set:

  • Perplexity of language model: ~70
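For readers unfamiliar with the metric, here is a minimal sketch of how a perplexity figure is commonly derived. It assumes the reported value is the exponential of the mean per-token cross-entropy loss in nats, which is the usual convention but is not confirmed by this README. The loss values are made up for illustration.

```python
import math


def perplexity(token_losses):
    """Return exp(mean loss), where each loss is a per-token
    negative log-likelihood measured in nats."""
    if not token_losses:
        raise ValueError("token_losses must not be empty")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical per-token losses, NOT taken from this repository.
example_losses = [4.0, 4.5, 4.25]
print(round(perplexity(example_losses), 1))
```

Under this convention, a mean validation loss of about 4.25 nats corresponds to a perplexity of roughly 70.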

Classifier

  • Accuracy of classification model: ~94%
  • Kappa score of classification model: ~0.90 (reported as ~90, i.e. scaled by 100; kappa itself cannot exceed 1)
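The two classifier metrics can be illustrated with a short sketch. It assumes the kappa score refers to Cohen's kappa, which corrects accuracy for agreement expected by chance; the README does not say which variant was used. The labels below are invented for the example and are not the classes in this repository's dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one class occurs")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for illustration only.
y_true = ["sports", "sports", "politics", "politics", "tech", "tech"]
y_pred = ["sports", "sports", "politics", "tech", "tech", "tech"]
print(accuracy(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

Kappa is lower than accuracy here because part of the raw agreement would be expected by chance alone.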

Pretrained Language Model

Download the pretrained language model from here

Classifier

Download the classifier from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here
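Once the model file is downloaded, loading it could look like the sketch below. It assumes the `sentencepiece` Python package is installed (`pip install sentencepiece`). The function name `tokenize_kannada` and the file name in the usage comment are hypothetical placeholders, since the README does not name the downloaded files.

```python
import os


def tokenize_kannada(text, model_path):
    """Split text into subword pieces using a trained SentencePiece model."""
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"SentencePiece model not found: {model_path}")

    # Imported here so the path check above works without the package.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load(model_path)
    return sp.EncodeAsPieces(text)


# Usage (hypothetical file name; replace with the path of the downloaded model):
# pieces = tokenize_kannada("ಕನ್ನಡ ಒಂದು ಸುಂದರ ಭಾಷೆ", "kannada_tokenizer.model")
# print(pieces)
```

The returned pieces are strings; calling `EncodeAsIds` instead of `EncodeAsPieces` would return the vocabulary indices that a language model typically consumes.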
