# "Building தமிழ் language model"
> "In this notebook I try to build a language model for தமிழ்(Tamil) to help in some basic NLP tasks"
- toc: false
- branch: master
- badges: true
- comments: true
- categories: [nlp, language-model, தமிழ்]
- hide: false

# Introduction

In this post, I will try to build a language model for `தமிழ்` (Tamil). I have already prepared the data for it in a Kaggle notebook [here](https://www.kaggle.com/manimaranp/tamil-wiki-data-extraction). A language model is useful for many downstream tasks, such as text classification and information retrieval.

# Things we need for a language model

- A decent amount of raw text data, more about that [here](https://mani2106.github.io/Blog-Posts/data-cleaning/language-model/2020/04/14/wiki-data-extraction.html)
- A language tokenizer, more about that [here](https://mani2106.github.io/Blog-Posts/nlp/language-model/%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AF%8D/2020/04/14/building-a-tokenizer-for-tamil-with-sentencepiece.html)

<div style="padding: 15px; border: 1px solid transparent; border-color: transparent; margin-bottom: 20px; border-radius: 4px; color: #31708f; background-color: #d9edf7; border-color: #bce8f1;">
This notebook was executed on <b>Kaggle</b>, so the paths used here will need to be changed if you run it in your own environment.
</div>

In [None]:
#hide
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# For language modelling
from fastai import *
from fastai.text import *
from fastai.callbacks import *
from fastai.metrics import *

# For unsupervised tokenization
import sentencepiece as spm
from pathlib import Path

DATA_PATH = Path('/kaggle/working/Tamil-Language-data')

## Set seed for reproducibility

In [None]:
#hide
seed = 42

# python RNG
import random
random.seed(seed)

# pytorch RNGs
import torch
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)

# numpy RNG
import numpy as np
np.random.seed(seed)

## Load text data from csv

In [None]:
LANG_DATA = pd.read_csv(DATA_PATH/'filtered_data.csv.tar.gz', index_col=[0])
LANG_DATA.head()

We have the `url`, `article_id` and `title` as additional information about the text. Let's explore the data and check the average length of the article text.

## Exploration

In [None]:
LANG_DATA.shape

In [None]:
LANG_DATA.info()

We have `131162` articles in total.

### Remove empty articles from the dataframe

In [None]:
LANG_DATA.dropna(axis='rows', inplace=True)
LANG_DATA.info()

### Length of articles

In [None]:
#hide_input
sum(LANG_DATA['text'].map(str).apply(len))/LANG_DATA.shape[0]

The average length of each article is about `1370` characters.
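
Since `len` on a Python string counts characters rather than words, a word count needs a split first. Here is a minimal sketch of the difference, using a made-up two-item list instead of the real dataframe. The helper name `length_stats` is hypothetical, and whitespace splitting is only a rough proxy for Tamil word counts.

```python
def length_stats(texts):
    """Return (average characters, average whitespace-separated words) per text."""
    n = len(texts)
    avg_chars = sum(len(t) for t in texts) / n
    avg_words = sum(len(t.split()) for t in texts) / n
    return avg_chars, avg_words

# Toy sample, not taken from the dataset
sample = ["தமிழ் ஒரு மொழி", "ஆண்"]
avg_chars, avg_words = length_stats(sample)
print(f"average characters: {avg_chars:.1f}, average words: {avg_words:.1f}")
```

On the real data, the same idea would be applied to the `text` column, for example with `.str.split().str.len()` in pandas.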

In [None]:
#hide
LANG_DATA['article_length'] = LANG_DATA['text'].map(str).apply(len)

In [None]:
#hide
LANG_DATA.sort_values(by='article_length', axis='rows', inplace=True)

# article_length was computed with len(), so these are character counts
print('Shortest article is about:', LANG_DATA.iloc[0, 1], 'with', LANG_DATA.iloc[0, -1], 'characters')
print('Longest article is about:', LANG_DATA.iloc[-1, 1], 'with', LANG_DATA.iloc[-1, -1], 'characters')

`ஆண்` means `male` and `புற்று நோய்` means `cancer`.

We had one empty article, I suppose.

# Prepare Text data for Language model

In [None]:
processor = SPProcessor(lang='ta', sp_model=DATA_PATH/'tamil_tok.model', sp_vocab=DATA_PATH/'tamil_tok.vocab')

Set batch size

In [None]:
bs = 16

In [None]:
data_lm = (TextList.from_df(LANG_DATA, path=DATA_PATH, cols='text', processor=processor)
            .split_by_rand_pct(0.1)
            .label_for_lm()
            # We want to do a language model so we label accordingly
            .databunch(bs=bs))
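
`split_by_rand_pct(0.1)` holds out a random 10% of the articles for validation. The idea can be sketched in plain Python as below. This is an illustration under my own assumptions (the helper `random_split` and its `seed` argument are hypothetical), not fastai's actual implementation.

```python
import random

def random_split(items, valid_pct=0.1, seed=42):
    """Shuffle indices and hold out the first `valid_pct` share as validation."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * len(items))
    valid_idxs = set(idxs[:cut])
    train = [x for i, x in enumerate(items) if i not in valid_idxs]
    valid = [x for i, x in enumerate(items) if i in valid_idxs]
    return train, valid

articles = [f"article {i}" for i in range(100)]  # placeholder texts
train, valid = random_split(articles)
print(len(train), len(valid))  # 90 10
```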

In [None]:
# Check if data is loadable
data_lm.sanity_check()

Let's save the language model data to skip the processing above next time.

In [None]:
data_lm.save(DATA_PATH/'data_lm.pkl')

Let's have a look at the tokenized data from the `sentencepiece` tokenizer.

In [None]:
data_lm.show_batch()

- `bos` marks the beginning of a text (here, an article), not of each sentence.
- `eos` marks the end of a text.
- `xx maj` indicates that the next word begins with a capital letter in the original text. More about this can be found [here](https://forums.fast.ai/t/xxbos-is-it-marking-beginning-of-sentence-or-beginning-of-text/43688/4?u=mani) and [here](https://docs.fast.ai/text.transform.html#Tokenizer).

# Train the language model

Initialize model

In [None]:
# To use qrnn
config = awd_lstm_lm_config.copy()
config['qrnn'] = True
config['n_hid'] = 1550
config['n_layers'] = 4

perplexity = Perplexity()

learn = language_model_learner(data_lm, arch=AWD_LSTM, config=config,
                               pretrained=False,
                               metrics=[accuracy, perplexity],
                              ).to_fp16()
# gradient clipping
learn.clip = 0.1
learn.model_dir = DATA_PATH
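
For reference, perplexity is the exponential of the average cross-entropy (negative log-likelihood) per token, so lower is better. Below is a small sketch with made-up probabilities. The function name `perplexity_from_probs` is hypothetical and is not part of fastai.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the correct tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns probability 0.25 to every correct token is as
# uncertain as a uniform choice among 4 tokens, so its perplexity is about 4.
print(perplexity_from_probs([0.25, 0.25, 0.25]))
```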

## Find proper learning rate

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot(suggestion=True)

Get suggested learning rate

In [None]:
min_grad_lr = learn.recorder.min_grad_lr
min_grad_lr
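
As far as I understand, the suggested value is the learning rate at which the recorded loss falls most steeply. Here is a rough sketch of that idea on invented numbers. The helper names are hypothetical, and fastai's own version may differ in details such as loss smoothing and skipped points.

```python
def finite_diff_gradient(values):
    """Slope per step: one-sided differences at the ends, central in between."""
    n = len(values)
    grads = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        grads.append((values[hi] - values[lo]) / (hi - lo))
    return grads

def steepest_descent_lr(lrs, losses):
    """Return the learning rate where the loss curve has its most negative slope."""
    grads = finite_diff_gradient(losses)
    return lrs[grads.index(min(grads))]

# Invented learning-rate sweep: the loss drops fastest around 1e-3
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [5.0, 4.9, 4.0, 3.8, 6.0]
print(steepest_descent_lr(lrs, losses))  # 0.001
```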

## Start training

## Stage - 1

In [None]:
learn.fit_one_cycle(10, min_grad_lr,
                    # Custom momentum range, just an experiment
                    div_factor=10.0, pct_start=0.8, moms=(0.75,0.65),
                    # Lower perplexity is better, so track the minimum explicitly
                    callbacks=[SaveModelCallback(learn, every='improvement', monitor='perplexity', mode='min', name='best_st1'),
                               CSVLogger(learn, filename=DATA_PATH/'history', append=True)])
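
To make `div_factor` and `pct_start` a bit more concrete: as I understand the one-cycle policy, the learning rate starts at `max_lr / div_factor`, rises to `max_lr` over the first `pct_start` share of the iterations, and is annealed down afterwards. The sketch below approximates that shape with cosine annealing. The function names, the example `max_lr` and the `final_div` value are my assumptions, not values from this run.

```python
import math

def cosine_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lrs(max_lr, total_steps, div_factor=10.0, pct_start=0.8, final_div=1e5):
    """Per-step learning rates: warm up to max_lr, then anneal towards max_lr / final_div."""
    warm = int(total_steps * pct_start)   # steps spent increasing the rate
    cool = total_steps - warm             # steps spent decreasing it
    low, final = max_lr / div_factor, max_lr / final_div
    up = [cosine_anneal(low, max_lr, i / warm) for i in range(warm)]
    down = [cosine_anneal(max_lr, final, i / cool) for i in range(cool)]
    return up + down

schedule = one_cycle_lrs(max_lr=1e-2, total_steps=100)  # hypothetical max_lr
print(schedule[0], max(schedule), schedule[-1])
```

With `pct_start=0.8`, most of the run is spent warming up, and only the last 20% of the steps anneal the rate down.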

### Save the intermediate results

In [None]:
learn.load('best_st1');
learn.save('ta-wiki-stage1')
learn.save_encoder('ta-wiki-enc-stage1')

## Stage - 2

In [None]:
learn.load('ta-wiki-stage1');

In [None]:
learn.lr_find(start_lr=1e-15)

In [None]:
learn.recorder.plot(suggestion=True)

In [None]:
min_grad_lr = learn.recorder.min_grad_lr
min_grad_lr

In [None]:
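learn.fit_one_cycle(20, min_grad_lr,
                    # Lower perplexity is better, so track the minimum explicitly;
                    # log to the same history file as stage 1 (CSVLogger adds the .csv suffix)
                    callbacks=[SaveModelCallback(learn, every='improvement', monitor='perplexity', mode='min', name='best_st2'),
                               CSVLogger(learn, filename=DATA_PATH/'history', append=True)])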

In [None]:
learn.load('best_st2');
learn.save('ta-wiki-stage2')
learn.save_encoder('ta-wiki-enc-stage2')
learn.export(DATA_PATH/'ta-lang_mod.pkl')

## Look at the logs

In [None]:
learn.csv_logger.read_logged_file()

We can test it out in a separate notebook.