<a href="https://colab.research.google.com/github/parmarsuraj99/suraj-parmar/blob/master/_notebooks/2020-04-13-bert-on-hindi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# "BERT training for Hindi"
> "I am exploring BERT for Indian regional languages, starting with Hindi."

- toc: true
- branch: master
- badges: true
- comments: true
- author: Suraj Parmar
- categories: [fastpages, colab, NLP, Hindi]

In [0]:
!pip install transformers

I am following the [how-to-train](https://huggingface.co/blog/how-to-train) guide from the Hugging Face blog, which explains how to train a model from scratch. The language in the blog is 

I will be experimenting with the different tokenizers available in Hugging Face's `tokenizers` library.

In [0]:
import os
from tokenizers import (
    ByteLevelBPETokenizer,
    CharBPETokenizer,
    SentencePieceBPETokenizer,
    BertWordPieceTokenizer,
)

# **ByteLevelBPETokenizer**

In [0]:
tokenizer = ByteLevelBPETokenizer()
tokenizer.train("Geeta.txt", vocab_size=20000, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
tokenizer_type = "ByteLevelBPETokenizer"

In [7]:
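# Create the output directory (relative to the working directory) if needed.
os.makedirs(tokenizer_type, exist_ok=True)
# Recent `tokenizers` releases write the vocab/merges files with save_model();
# older releases used tokenizer.save(directory, name) for the same purpose.
tokenizer.save_model(tokenizer_type, f"hindi{tokenizer_type}")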

['ByteLevelBPETokenizer/hindiByteLevelBPETokenizer-vocab.json',
 'ByteLevelBPETokenizer/hindiByteLevelBPETokenizer-merges.txt']

In [10]:
# Read both files as UTF-8 and close them automatically.
with open(f"{tokenizer_type}/hindi{tokenizer_type}-vocab.json", encoding="utf-8") as vocab_, \
     open(f"{tokenizer_type}/hindi{tokenizer_type}-merges.txt", encoding="utf-8") as merges_:
    print("VOCAB:", vocab_.read()[2100:2200])
    print("Merges:", merges_.read(100))

VOCAB: ":271,"à¤¨":272,"Ġà¥":273,"à¤µ":274,"à¥ĩ":275,"à¤¸":276,"à¤Ĥ":277,"à¥ģ":278,"à¤¦":279,"à¥ĭ":280,"ĠĠ"
Merges: #version: 0.2 - Trained by `huggingface/tokenizers`
à ¤
à ¥
à¥ į
Ġ à¤
à¤ ¾
à¤ ¤
à¤ °
à¤ ¿
à¤ ¯
č Ċ
à


In [0]:
#Feel free to play with the outputs
outputs = tokenizer.encode("रामायण एक संस्कृत महाकाव्य है जिसकी रचना महर्षि वाल्मीकि ने की थी।")

In [0]:
outputs.ids

[212,
 146,
 246,
 174,
 36,
 73,
 49,
 194,
 118,
 436,
 163,
 40,
 346,
 388,
 98,
 144,
 220,
 366,
 70,
 36,
 69,
 164,
 114,
 258]

This doesn't seem like a good fit. Let's try the other tokenizers.
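
The odd-looking symbols in the byte-level vocabulary above (such as `à¤¨` and `Ġ`) are not encoding errors. A byte-level BPE works on UTF-8 bytes and displays each byte as a printable character. The sketch below reproduces the GPT-2 style byte-to-character table, which I am assuming is the convention `ByteLevelBPETokenizer` follows. The function names are my own and are not part of the library.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte is shifted to a code point starting at 256.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Show text the way a byte-level vocabulary would display it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level("न"))   # the three UTF-8 bytes of one Devanagari letter
print(repr(to_byte_level(" ")))  # how a leading space is displayed
```

Each Devanagari letter takes three UTF-8 bytes, so a single letter such as `न` shows up as the three symbols `à¤¨`, matching the entry with id 272 in the vocabulary output above.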

# **CharBPETokenizer**

In [0]:
tokenizer = CharBPETokenizer()
tokenizer.train("Geeta.txt", vocab_size=20000)
tokenizer_type = "CharBPETokenizer"

In [12]:
# Create the output directory if needed, then write the vocab/merges files.
os.makedirs(tokenizer_type, exist_ok=True)
tokenizer.save_model(tokenizer_type, f"hindi{tokenizer_type}")

['CharBPETokenizer/hindiCharBPETokenizer-vocab.json',
 'CharBPETokenizer/hindiCharBPETokenizer-merges.txt']

In [13]:
# Read both files as UTF-8 and close them automatically.
with open(f"{tokenizer_type}/hindi{tokenizer_type}-vocab.json", encoding="utf-8") as vocab_, \
     open(f"{tokenizer_type}/hindi{tokenizer_type}-merges.txt", encoding="utf-8") as merges_:
    print("VOCAB:", vocab_.read()[2100:2200])
    print("Merges:", merges_.read(100))

VOCAB: w>":227,"ानि</w>":228,"सि</w>":229,"९॥</w>":230,"वाच</w>":231,"दु":232,"०॥</w>":233,"च्छ":234,"मि":2
Merges: #version: 0.2 - Trained by `huggingface/tokenizers`
् य
् त
् र
र ्
ा न
त ्
व ि
स ्
द ्
प ्र
म ा
ि त


In [0]:
#Feel free to play with the outputs
outputs = tokenizer.encode("रामायण एक संस्कृत महाकाव्य है जिसकी रचना महर्षि वाल्मीकि ने की थी।")
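
To make the merges file above easier to read, here is a minimal sketch of how BPE merges can be learned: count adjacent symbol pairs, merge the most frequent pair, and repeat. This is a toy version under my own assumptions (the function name `learn_bpe_merges`, the tie-breaking, and the tiny word list are illustrative), not the library's optimized implementation.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Split each word into code points and mark the last one with "</w>".
    corpus = Counter(tuple(w[:-1]) + (w[-1] + "</w>",) for w in words if w)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


for left, right in learn_bpe_merges(["कर्म", "कर्म", "धर्म"], num_merges=3):
    print(left, right)
```

Each printed line has the same "left right" shape as a line in the merges file. Python iterates a Hindi string by code point, so consonants, vowel signs, and the virama start out as separate symbols, which is why the first merges in the real file join pairs such as `्` and `य`.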

# **SentencePieceBPETokenizer**

In [0]:
tokenizer = SentencePieceBPETokenizer()
tokenizer.train("Geeta.txt", vocab_size=20000)
tokenizer_type = "SentencePieceBPETokenizer"

In [16]:
# Create the output directory if needed, then write the vocab/merges files.
os.makedirs(tokenizer_type, exist_ok=True)
tokenizer.save_model(tokenizer_type, f"hindi{tokenizer_type}")

['SentencePieceBPETokenizer/hindiSentencePieceBPETokenizer-vocab.json',
 'SentencePieceBPETokenizer/hindiSentencePieceBPETokenizer-merges.txt']

In [17]:
# Read both files as UTF-8 and close them automatically.
with open(f"{tokenizer_type}/hindi{tokenizer_type}-vocab.json", encoding="utf-8") as vocab_, \
     open(f"{tokenizer_type}/hindi{tokenizer_type}-merges.txt", encoding="utf-8") as merges_:
    print("VOCAB:", vocab_.read()[2100:2200])
    print("Merges:", merges_.read(100))

VOCAB: ुः":244,"ुण":245,"श्य":246,"▁सं":247,"▁११":248,"तः":249,"▁त्व":250,"ेद":251,"कर्म":252,"गव":253,"▁मह
Merges: #version: 0.2 - Trained by `huggingface/tokenizers`
् य
् त
् र
र ्
▁ ।
▁ स
▁ ॥
ा न
▁ प
् व
▁ त
▁ म



In [0]:
#Feel free to play with the outputs
outputs = tokenizer.encode("रामायण एक संस्कृत महाकाव्य है जिसकी रचना महर्षि वाल्मीकि ने की थी।")
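
The `▁` character in the SentencePiece vocabulary and merges above stands in for a space, so a token such as `▁सं` marks the start of a word. The sketch below imitates that pre-tokenization step in plain Python. It reflects my understanding of the behaviour rather than the library's actual code, and `metaspace_pretokenize` is a name I made up for illustration.

```python
def metaspace_pretokenize(text, marker="\u2581"):
    """Replace spaces with a visible marker and split before each marker."""
    # Add a leading space so the first word also gets a marker.
    if not text.startswith(" "):
        text = " " + text
    text = text.replace(" ", marker)
    pieces = []
    for ch in text:
        if ch == marker or not pieces:
            pieces.append(ch)   # a marker starts a new piece
        else:
            pieces[-1] += ch    # other characters extend the current piece
    return pieces


print(metaspace_pretokenize("रामायण एक संस्कृत महाकाव्य है"))
```

Because the space is kept inside the token, the original text can be recovered by joining the pieces and turning `▁` back into a space.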

# **BertWordPieceTokenizer**

In [0]:
tokenizer = BertWordPieceTokenizer()
tokenizer.train("Geeta.txt", vocab_size=20000)
tokenizer_type = "BertWordPieceTokenizer"

In [21]:
# Create the output directory if needed, then write the vocab file.
os.makedirs(tokenizer_type, exist_ok=True)
tokenizer.save_model(tokenizer_type, f"hindi{tokenizer_type}")

['BertWordPieceTokenizer/hindiBertWordPieceTokenizer-vocab.txt']

In [22]:
# Read the vocab file as UTF-8 and close it automatically.
with open(f"{tokenizer_type}/hindi{tokenizer_type}-vocab.txt", encoding="utf-8") as vocab_:
    print("VOCAB:", vocab_.read()[150:250])

VOCAB: 
२
३
४
५
६
७
८
९
##ि
##न
##य
##म
##द
##भ
##क
##त
##ढ
##ा
##ः
##र
##थ
##ी
##ष
##व
##ो
##ण
##ह
##च
##प


This looks like it works better than the others.

I will explain these tokenizers in detail on this blog. Stay tuned!
Stay safe.
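
The `##` entries in the vocabulary above mark pieces that continue a word rather than start one. Here is a rough sketch of the greedy longest-match-first lookup that WordPiece is usually described with. The tiny vocabulary and the function name `wordpiece_tokenize` are assumptions for illustration, and the real tokenizer also handles normalization and a maximum word length, which I leave out.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "म", "##ह", "##ा"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("महा", toy_vocab))
```

With a vocabulary trained on real text, frequent Hindi words would be single pieces and rarer ones would fall back to a start piece followed by `##` continuations.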