goru001/nlp-for-tanglish

NLP for Tanglish (Code mixed Tamil+English)

This repository contains state-of-the-art language models and a classifier for code-mixed Tanglish (Tamil and English), spoken in the Indian subcontinent.

Dataset

  1. Tamil Wikipedia Articles: preprocessed and transliterated versions of this dataset, used for language modeling in this repo, can be downloaded directly from here

  2. Dravidian Codemix HASOC @ FIRE 2020

  3. Dravidian Codemix Sentiment Analysis @ FIRE 2020

Results

Language Model Perplexity (on validation set)

| Architecture/Dataset | Tamil Wikipedia Articles | Vocab size |
|---|---|---|
| ULMFiT | 37.50 | 8000 |
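
For readers unfamiliar with the metric, here is a minimal sketch of how a perplexity figure relates to a language model's validation loss. It assumes the conventional definition (the exponential of the mean per-token negative log-likelihood, in nats); the README does not say exactly how the 37.50 was computed, and the function name and sample losses below are hypothetical.

```python
import math


def perplexity(token_nlls):
    """Perplexity as exp(mean negative log-likelihood per token, in nats).

    Assumption: this is the conventional definition, not necessarily the
    exact computation used to produce the number reported in this README.
    """
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical per-token losses, for illustration only.
example_losses = [3.2, 3.9, 3.6, 3.8]
example_ppl = perplexity(example_losses)

# Inverting the definition: the mean loss implied by a perplexity of 37.50.
loss_for_reported_ppl = math.log(37.50)
```

Under this definition, a perplexity of 37.50 corresponds to a mean validation loss of roughly 3.62 nats per token.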

Classification Metrics

ULMFiT
| Dataset | F1 | Precision | Recall | Notebook to Reproduce results |
|---|---|---|---|---|
| Dravidian Codemix HASOC @ FIRE 2020 | 0.88 | 0.88 | 0.88 | Link |
| Dravidian Codemix Sentiment Analysis @ FIRE 2020 | 0.62 | 0.65 | 0.69 | Link |
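
The README does not state how these scores are averaged across classes. As a rough illustration only, the sketch below computes precision, recall, and F1 per class and then takes a support-weighted average, which is one common choice for multi-class tasks. The function name, the labels, and the toy predictions are all hypothetical and are not taken from the linked notebooks.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all classes.

    Assumption: weighted averaging is used here for illustration; the
    repository's notebooks may average differently.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    support = Counter(y_true)          # true examples per class
    predicted = Counter(y_pred)        # predictions per class
    total = len(y_true)
    p_sum = r_sum = f_sum = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / support[label] if support[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = support[label] / total
        p_sum += weight * precision
        r_sum += weight * recall
        f_sum += weight * f1
    return p_sum, r_sum, f_sum


# Toy example with hypothetical labels.
y_true = ["OFF", "OFF", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT"]
precision, recall, f1 = weighted_prf(y_true, y_pred)
```

With averaging of this kind, the aggregate F1 can fall below both the aggregate precision and the aggregate recall, because each per-class F1 is a harmonic mean. That would be consistent with the sentiment analysis row, though it does not confirm which averaging was used.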

Visualizations

Word Embeddings
| Architecture | Vocab Size | Visualization |
|---|---|---|
| ULMFiT | 8k | Embeddings projection |

Pretrained Models

Language Models

Download pretrained ULMFiT LM with 8k vocab from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here
