# "Sanskrit Albert"
> "Training a Language model from scratch on Sanskrit using the HuggingFace library, and how to train your own model too!"

- toc: true 
- comments: true
- categories: [jupyter, NLP, HuggingFace]

In [0]:
!nvidia-smi

Sat May  2 06:43:02 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
import os
import gc
import glob
import torch
import pickle
import joblib
from tqdm.auto import tqdm

HuggingFace recently updated their scripts, and the pip release is not out yet, so we'll build from source.

In [0]:
!pip install tokenizers
#!pip install transformers

In [0]:
!git clone https://github.com/huggingface/transformers
!pip install transformers/.

# Collecting Corpus

I have used the Sanskrit corpus from a Kaggle dataset. Feel free to skip this and use your own dataset. The training data needs to be in a `.txt` file. Evaluation also uses the same dataset.

I need the Kaggle API to download the dataset. You can load your text corpus from anywhere.

You can download a corpus for your language from https://traces1.inria.fr/oscar.

I have used data from there too, and appended it to the corpus from Kaggle.

## Loading from Kaggle

In [0]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/ 
!chmod 600 ~/.kaggle/kaggle.json 

[Kaggle dataset link](https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles)

Thanks to [inltk](https://github.com/goru001/inltk) for the Wikipedia dumps
and [CLTK](https://github.com/cltk/cltk), from which I am currently collecting Sanskrit scraps from open sources.

In [0]:
!mkdir corpus 
#Directory for saving the whole corpus in one place. You can save it anywhere

#From Kaggle
!kaggle datasets download -d disisbig/sanskrit-wikipedia-articles
!unzip /content/sanskrit-wikipedia-articles.zip -d /content/corpus

#From OSCAR corpus
!wget https://traces1.inria.fr/oscar/files/compressed-orig/sa.txt.gz
!gunzip /content/sa.txt.gz

In [0]:
#Reading sample
with open("/content/sa.txt", "r") as fp:
    print(fp.read(1000))

In [0]:
import glob
train_list = glob.glob("/content/corpus/train/train/*.txt")
valid_list = glob.glob("/content/corpus/valid/valid/*.txt")

In [0]:
#Reading and appending all small files to single train and valid files
with open("/content/corpus/train/full.txt", "wb") as outfile:
    for f in train_list:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
            outfile.write(b"\n\n")
    #Append the OSCAR data after the Kaggle files
    with open("/content/sa.txt", "rb") as infile:
        outfile.write(infile.read())

with open("/content/corpus/valid/full_val.txt", "wb") as outfile:
    for f in valid_list:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
            outfile.write(b"\n\n")
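The merge step can also be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `merge_text_files` is a hypothetical name, and the demo runs on throwaway files in a temporary directory rather than on `/content/corpus`. Sorting the paths makes the result reproducible, since `glob` does not guarantee an order.

```python
import tempfile
from pathlib import Path


def merge_text_files(paths, out_path, separator=b"\n\n"):
    """Concatenate files in sorted order, writing `separator` after each one.

    Returns the number of bytes written.
    """
    total = 0
    with open(out_path, "wb") as outfile:
        for path in sorted(Path(p) for p in paths):
            data = path.read_bytes()
            outfile.write(data)
            outfile.write(separator)
            total += len(data) + len(separator)
    return total


# Demo on two tiny throwaway files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a.txt").write_text("first file", encoding="utf-8")
    (tmp / "b.txt").write_text("second file", encoding="utf-8")
    bytes_written = merge_text_files(tmp.glob("*.txt"), tmp / "full.out")
    merged = (tmp / "full.out").read_text(encoding="utf-8")

print(bytes_written)
print(repr(merged))
```

Working in binary mode, as the notebook does, avoids any decoding errors while copying.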

## Tokenizer Training

Directory to save the trained tokenizer and configuration files

In [0]:
!mkdir data_dir

In [0]:
import sentencepiece as spm
from tokenizers import SentencePieceBPETokenizer, BertWordPieceTokenizer

In [0]:
%%time

#The Albert tokenizer uses SentencePiece tokenization, so I have used sentencepiece to train the tokenizer.
#This will take a while
spm.SentencePieceTrainer.Train('--input=/content/corpus/train/full.txt \
                                --model_prefix=m \
                                --vocab_size=32000 \
                                --control_symbols=[CLS],[SEP],[MASK]')

#The with block closes the file, so no explicit close() is needed
with open("m.vocab", encoding="utf-8") as v:
    print(v.read(2000))
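It can be worth confirming that the control symbols actually made it into the vocabulary. The sketch below assumes the usual SentencePiece `.vocab` layout (one `piece<TAB>score` pair per line); `read_spm_vocab` and `missing_symbols` are hypothetical helper names, and the demo uses a tiny hand-written file instead of the real `m.vocab`.

```python
import tempfile
from pathlib import Path


def read_spm_vocab(path):
    """Return the pieces of a SentencePiece .vocab file, in file order."""
    pieces = []
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            if not line.strip():
                continue  # skip blank lines
            piece, _, _score = line.rstrip("\n").partition("\t")
            pieces.append(piece)
    return pieces


def missing_symbols(pieces, required=("[CLS]", "[SEP]", "[MASK]")):
    """List the required symbols that are absent from the vocabulary."""
    present = set(pieces)
    return [symbol for symbol in required if symbol not in present]


# Hand-written sample that deliberately lacks [MASK]
sample = "<unk>\t0\n<s>\t0\n</s>\t0\n[CLS]\t0\n[SEP]\t0\n\u2581\u0938\t-3.5\n"
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "m.vocab"
    vocab_path.write_text(sample, encoding="utf-8")
    pieces = read_spm_vocab(vocab_path)

missing = missing_symbols(pieces)
print(len(pieces), missing)
```

On the real file, you would call `missing_symbols(read_spm_vocab("m.vocab"))` and expect an empty list.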


In [0]:
!mkdir /content/data_dir/
!cp /content/m.model -d /content/data_dir/spiece.model
!cp /content/m.vocab -d /content/data_dir/spiece.vocab

mkdir: cannot create directory ‘/content/data_dir/’: File exists


## Testing Tokenizer

Make sure to check out the Fast Tokenizers from HuggingFace. They are really fast!
You can compare them with sentencepiece.

In [0]:
%time
tokenizer = SentencePieceBPETokenizer()
tokenizer.train("/content/corpus/train/full.txt")

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 10 µs


This is a very beautiful shlok ❤️. Let's just pray for this 🙏.
Do search for the quotes used in this notebook. I am sure you will love them!

In [0]:
txt = "ॐ सर्वेत्र सुखिनः सन्तु| सर्वे सन्तु निरामयाः| सर्वे भद्राणि पश्यन्तु| माँ कश्चिद् दुःख माप्नुयात॥ ॐ शांतिः शांतिः शांतिः ॥"
enc = tokenizer.encode(txt)
print(tokenizer.decode(enc.ids))

ॐ सर्वेत्र सुखिनः सन्तु| सर्वे सन्तु निरामयाः| सर्वे भद्राणि पश्यन्तु| माँ कश्चिद् दुःख माप्नुयात॥ ॐ शांतिः शांतिः शांतिः ॥


The tokenizer seems to work. But since the training script is configured to use the Albert tokenizer, we need to use `spiece.model` and `spiece.vocab` for the training script.

The HuggingFace tokenizer creates the files `['/content/hft/vocab.json', '/content/hft/merges.txt']`, while `AlbertTokenizer` requires a `spiece.model` file. So we'll use the vocab and tokenizer model saved by sentencepiece.
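A quick way to see which of the two file layouts a directory holds is to look for the file names mentioned above. This is only an illustrative sketch: `tokenizer_files_found` is a hypothetical helper, and the demo builds an empty BPE-style directory in a temporary location.

```python
import tempfile
from pathlib import Path


def tokenizer_files_found(directory):
    """Report which tokenizer file layouts are present in `directory`."""
    d = Path(directory)
    return {
        # What AlbertTokenizer looks for
        "sentencepiece": (d / "spiece.model").is_file(),
        # What the BPE tokenizer from `tokenizers` saved in this notebook
        "bpe": (d / "vocab.json").is_file() and (d / "merges.txt").is_file(),
    }


# Demo: a directory that only contains the BPE files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "vocab.json").write_text("{}", encoding="utf-8")
    (Path(tmp) / "merges.txt").write_text("", encoding="utf-8")
    found = tokenizer_files_found(tmp)

print(found)
```

For the notebook's own folders, `/content/hft` should report only `bpe` and `/content/data_dir` only `sentencepiece`.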


In [0]:
!mkdir hft
tokenizer.save("/content/hft")
#we won't be using this

['/content/hft/vocab.json', '/content/hft/merges.txt']

# Huggingface Training

In [0]:
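#Import only the classes used in this notebook
from transformers import AlbertModel, AlbertTokenizer, AutoModel, AutoTokenizer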

In [0]:
#Keep in mind, This is a tokenizer for Albert, unlike the previous one, which is a generic one.
#We'll load it in the form of Albert Tokenizer.
tokenizer = AlbertTokenizer.from_pretrained("/content/data_dir")

In [62]:
op = tokenizer.encode("नैनं छिन्दन्ति शस्त्राणि नैनं दहति पावकः। न चैनं क्लेदयन्त्यापो न शोषयति मारुतः॥")
tokenizer.decode(op)

'[CLS] नैनं छिन्दन्ति शस्त्राणि नैनं दहति पावकः। न चैनं क्लेदयन्त्यापो न शोषयति मारुतः॥[SEP]'

Looks like the tokenizer is working.

## Model-Tokenizer Configuration

This is important.
The training script needs a configuration for the model.

Architecture refers to what the model is going to be used for,\
e.g., `AlbertForMaskedLM` for language modelling, or `AlbertForSequenceClassification` for sequence classification.
Just take a look at the left panel for Model Architectures.



In [0]:
#Checking vocabulary size
vocab_size=tokenizer.vocab_size ; vocab_size

32000

In [0]:
import json

config = {
    "architectures": [
        "AlbertModel"
    ],
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-05,
    "max_position_embeddings": 514,
    "model_type": "albert",
    "num_attention_heads": 12,
    "num_hidden_layers": 6,
    "type_vocab_size": 1,
    "vocab_size": vocab_size
}
with open("/content/data_dir/config.json", 'w') as fp:
    json.dump(config, fp)


#Configuration for tokenizer.
#Note: I have set do_lower_case: False and keep_accents: True

tokenizer_config = {
    "max_len": 512,
    "model_type": "albert",
    "do_lower_case": False,
    "keep_accents": True
}
with open("/content/data_dir/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)
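Before launching a long run, a few cheap consistency checks on the model configuration can save time. The sketch below is an assumption about what is worth checking, not something the training script requires: `check_albert_config` is a hypothetical helper, and the values are copied from the configuration in this notebook together with the `--block_size 256` used for training.

```python
def check_albert_config(config, tokenizer_vocab_size, block_size):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Each attention head gets hidden_size / num_attention_heads dimensions
    if config["hidden_size"] % config["num_attention_heads"] != 0:
        problems.append("hidden_size is not divisible by num_attention_heads")
    # The embedding table must match the tokenizer
    if config["vocab_size"] != tokenizer_vocab_size:
        problems.append("vocab_size does not match the tokenizer")
    # Inputs longer than the position table cannot be embedded
    if block_size > config["max_position_embeddings"]:
        problems.append("block_size exceeds max_position_embeddings")
    return problems


config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "max_position_embeddings": 514,
    "vocab_size": 32000,
}
problems = check_albert_config(config, tokenizer_vocab_size=32000, block_size=256)
print(problems or "config looks consistent")
```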

**Note:** While experimenting with tokenizer training, I found that encoding was done correctly, but when decoding with `{do_lower_case: True, keep_accents: False}`, the decoded sentence was slightly changed.

So, by using the above settings, I got the sentences decoded perfectly.
A reason may be that Sanskrit does not have casing, and words carry suffixes in the form of accents.

You should try the settings which suit your language best.
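One plausible cause can be reproduced with the standard library alone. To my understanding, with `keep_accents=False` the Albert tokenizer normalizes text to NFKD and drops every character with a non-zero Unicode combining class. In Devanagari that includes the virama (U+094D), which joins consonants into conjuncts. The sketch below mimics only that step; `strip_combining_marks` is a hypothetical name, and this is an assumption about the cause rather than the library's own code.

```python
import unicodedata


def strip_combining_marks(text):
    """Decompose to NFKD, then drop characters with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def codepoints(text):
    """Format a string as a list of U+XXXX code points."""
    return [f"U+{ord(ch):04X}" for ch in text]


word = "\u0938\u0928\u094d\u0924\u0941"  # the word "santu" from the shlok above
stripped = strip_combining_marks(word)

print(word, codepoints(word))
print(stripped, codepoints(stripped))
```

The virama disappears while the vowel sign stays, so the conjunct falls apart and the decoded word no longer matches the input.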

In [103]:
torch.cuda.empty_cache()
gc.collect()

157

Creating a small corpus for testing. You can skip this.

In [0]:
#Copy the first 100,000 characters (~250KB); both files are closed by the with block
with open("/content/corpus/train/full.txt", "r") as src, open("/content/corpus/train/tmp.txt", "w") as fp:
    fp.write(src.read(100000))


In [0]:
#Copy the first 10,000,000 characters of the validation file
with open("/content/corpus/valid/full_val.txt", "r") as src, open("/content/corpus/valid/val_val.txt", "w") as fp:
    fp.write(src.read(10000000))
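Cutting by character count can leave a truncated final line, which matters a little when the training script reads the file with `--line_by_line`. An alternative, sketched here under that assumption, is to copy whole lines instead. `head_lines` is a hypothetical helper, and the demo uses a temporary file rather than the real corpus.

```python
import itertools
import tempfile
from pathlib import Path


def head_lines(src_path, dst_path, n_lines):
    """Copy the first `n_lines` lines of src to dst; return how many were written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in itertools.islice(src, n_lines):
            dst.write(line)
            written += 1
    return written


# Demo: keep 3 of 5 lines
with tempfile.TemporaryDirectory() as tmp:
    full = Path(tmp) / "full.txt"
    small = Path(tmp) / "tmp.txt"
    full.write_text("".join(f"line {i}\n" for i in range(5)), encoding="utf-8")
    copied = head_lines(full, small, 3)
    sample = small.read_text(encoding="utf-8")

print(copied)
print(sample)
```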

Checkpointing is very important. This is a directory where the intermediate model and tokenizer will be saved. 

**Note:** You should checkpoint somewhere else, maybe to your Drive, and set
`--save_total_limit 2`

This is the training script. You should experiment with its arguments.
 
`!python /content/transformers/examples/run_language_modeling.py --help`

In [0]:
%load_ext tensorboard
%tensorboard --logdir logs

You see the magic here.

This script can be used to train most models for language modelling.

Another thing: observe that you directly specify `--train_data_file` in `.txt` format. No need to generate any pretraining data, all thanks to the fast tokenizers used for loading the text.

Features are created dynamically when the training script starts.
However, this is limited to GPUs only. I would love to see a TPU version too.

Make sure to change the batch sizes according to the GPU you have. I set them to 16 because of the 8 GB P4.
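The batch size also determines how many optimization steps an epoch takes, and therefore how often `--save_steps` writes a checkpoint. Here is a rough back-of-the-envelope sketch, assuming a single GPU, no gradient accumulation, and one training example per line. The figure of 100,000 lines is made up for illustration, and `training_plan` is a hypothetical helper.

```python
import math


def training_plan(n_examples, batch_size, epochs, save_steps):
    """Estimate step counts for a run with one GPU and no gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Checkpoints written over the run (only the newest few are kept
        # when --save_total_limit is set)
        "checkpoints_written": total_steps // save_steps,
    }


plan = training_plan(n_examples=100_000, batch_size=16, epochs=5, save_steps=500)
print(plan)
```

Doubling the batch size halves the number of steps, so `--save_steps` and `--logging_steps` may need adjusting along with it.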

In [0]:
#To train from scratch
!python /content/transformers/examples/run_language_modeling.py \
        --model_type albert-base-v2 \
        --config_name /content/data_dir/ \
        --tokenizer_name /content/data_dir/ \
        --train_data_file /content/corpus/train/full.txt \
        --eval_data_file /content/corpus/valid/full_val.txt \
        --output_dir /content/data_dir \
        --do_train \
        --do_eval \
        --mlm \
        --line_by_line \
        --save_steps 500 \
        --logging_steps 500 \
        --save_total_limit 2 \
        --evaluate_during_training \
        --num_train_epochs 5 \
        --per_gpu_eval_batch_size 16 \
        --per_gpu_train_batch_size 16 \
        --block_size 256 \
        --seed 108 \
        --should_continue \
        --logging_dir logs \

In [0]:
torch.cuda.empty_cache()
gc.collect()

**Continuing Training**

```
--model_name_or_path      #Refers to the checkpoint directory
--overwrite_output_dir    #This is used to continue from the last checkpoint
```



After a checkpoint, you just need that directory, the corpus files, and the tokenizer. All configs, models, and optimizers are saved in `--output_dir`, except the tokenizer.
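Picking the newest checkpoint by hand is easy to get wrong, because sorting the folder names as strings puts `checkpoint-500` after `checkpoint-1000`. A minimal sketch of a numeric lookup follows, assuming the `checkpoint-<step>` folder naming seen in this notebook. `latest_checkpoint` is a hypothetical helper, and the demo creates empty folders in a temporary directory.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(output_dir).iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demo: two checkpoints plus an unrelated folder
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    (Path(tmp) / "logs").mkdir()
    newest_name = latest_checkpoint(tmp).name

print(newest_name)
```

The returned path is what you would pass to `--model_name_or_path`.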

In [0]:
#To continue from checkpoint
#I have continued from 500 steps here, but you should use the latest saved model
!python /content/transformers/examples/run_language_modeling.py \
        --model_name_or_path /content/data_dir/checkpoint-500 \
        --model_type albert-base-v2 \
        --config_name /content/data_dir/ \
        --tokenizer_name /content/data_dir/ \
        --train_data_file /content/corpus/train/full.txt \
        --eval_data_file /content/corpus/valid/full_val.txt \
        --output_dir /content/data_dir \
        --do_train \
        --do_eval \
        --mlm \
        --line_by_line \
        --save_steps 500 \
        --logging_steps 500 \
        --save_total_limit 2 \
        --num_train_epochs 5 \
        --evaluate_during_training \
        --per_gpu_eval_batch_size 64 \
        --per_gpu_train_batch_size 64 \
        --block_size 256 \
        --seed 108 \
        --should_continue \
        --overwrite_output_dir \

# Saving for Uploading

Since training is complete, we can now upload the model to HuggingFace's
[Models](https://huggingface.co/models)

In [0]:
!mkdir sanskrit_albert

In [48]:
atokenizer = AlbertTokenizer.from_pretrained("/content/data_dir")
atokenizer.save_pretrained("/content/sanskrit_albert")

('/content/sanskrit_albert/spiece.model',
 '/content/sanskrit_albert/special_tokens_map.json',
 '/content/sanskrit_albert/added_tokens.json')

In [49]:
op = atokenizer.encode("क्षमया दयया प्रेम्णा सूनृतेनार्जवेन च। वशीकुर्याज्जगत्सर्वं विनयेन च सेवया॥")
print(atokenizer.decode(op))

[CLS] क्षमया दयया प्रेम्णा सूनृतेनार्जवेन च। वशीकुर्याज्जगत्सर्वं विनयेन च सेवया॥[SEP]


In [0]:
#I am using a checkpoint because there has not been much training
model = AlbertModel.from_pretrained("/content/data_dir/checkpoint-500")
model.save_pretrained("/content/sanskrit_albert")

Now all the files we want are in a separate folder, which is all we need to upload.

### **Tests**

In [0]:
tokenizer = AlbertTokenizer.from_pretrained("/content/sanskrit_albert")

In [0]:
txt = "चरन्मार्गान्विजानाति ।"
op = tokenizer.encode(txt)

In [55]:
op
#See how it's tokenized!

[3, 15, 4280, 1345, 82, 177, 13866, 6, 4]

In [60]:
tokenizer.decode(op[:5]), tokenizer.decode(op[5:])

('[CLS] चरन्मार्गान्', 'विजानाति ।[SEP]')

This is the reason I set `do_lower_case:False`, and `keep_accents:True`

In [0]:
#Add the batch dimension at position 0: the model expects (batch, seq_len)
ps = model(torch.tensor(op).unsqueeze(0))

In [35]:
print(ps[0].shape)

torch.Size([1, 9, 768])


This way you can get the embeddings for a sentence. Check [ReSanskrit](https://resanskrit.com/sanskrit-shlok-popular-quotes-meaning-hindi-english/) for some beautiful shlok quotes. 

## Uploading to Models

[Instructions to upload a model](https://github.com/huggingface/transformers#Quick-tour-of-model-sharing)

In [0]:
!transformers-cli login

Make sure your model name is the name of the folder that will be uploaded.

Thus, my model would be `surajp/sanskrit_albert`,
but I won't upload this as I have already uploaded one.


In [0]:
!transformers-cli upload /content/sanskrit_albert

And it's done! Since I have already uploaded a model, you can load it using `surajp/albert-base-sanskrit`

In [0]:
#this way
tokenizer = AutoTokenizer.from_pretrained("surajp/albert-base-sanskrit")
model = AutoModel.from_pretrained("surajp/albert-base-sanskrit")

In [47]:
enc=tokenizer.encode("अपि स्वर्णमयी लङ्का न मे लक्ष्मण रोचते । जननी जन्मभूमिश्च स्वर्गादपि गरीयसी ॥")
print(tokenizer.decode(enc))

[CLS] अपि स्वर्णमयी लङ्का न मे लक्ष्मण रोचते । जननी जन्मभूमिश्च स्वर्गादपि गरीयसी ॥[SEP]


In [0]:
#Add the batch dimension at position 0: the model expects (batch, seq_len)
ps = model(torch.tensor(enc).unsqueeze(0))

In [46]:
 ps[0].shape

torch.Size([1, 19, 768])

In [0]:
model.save_pretrained("./")

I hope This notebook was helpful.🤗
    #StaySafe

This training covered only a small portion of Sanskrit literature. There is a vast amount of literature out there which I am collecting. This was only a checkpoint for training; I will train more once I collect more data.

I am also training different models for other Indian languages (Gujarati and Hindi for now).

If you know of any resources, please write to me. I'd love to have your contribution.

**`parmarsuraj99@gmail.com`**

I am trying to find out whether the structure of a language has any effect on training (more structured language => faster training), and whether this can be useful for cross-lingual learning.

What are your thoughts about this?