<h1 style="text-align: center;">Tokenizers for the Armenian Language: Training Process</h1>
<h2 style="text-align: center;" >Authors: Naira Maria Barseghyan and Anna Shaljyan</h2>

### Imports and data 

In [3]:
# Import the necessary libraries
import pandas as pd
from BPE_tokenizer import BpeTokenizer
from WordPiece_tokenizer import WordPieceTokenizer
from pathlib import Path

In [39]:
# Change the path according to your system and file location

df = pd.read_json('/Users/nairabarseghyan/Desktop/SPRING2024/GenerativeAI/Project/Data/wiki_arm_newest.json', orient='columns', compression='infer')

In [40]:
# Create the corpus by joining all texts in the "text" column of the dataset

corpus = ' '.join(df['text'].tolist())

# Training the Byte-Pair Encoding Tokenizer

In [41]:
# Initialize the BPE tokenizer loaded from BPE_tokenizer.py

tokenizer = BpeTokenizer()

In [5]:
# Learn the BPE vocabulary from the corpus

tokenizer._learn_bpe_vocab(corpus)

100%|██████████| 2329/2329 [5:41:35<00:00,  8.80s/it]  
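For readers who want to see what a BPE training loop does in principle, here is a minimal sketch. It is not the code in `BPE_tokenizer.py`: the function names `count_pairs`, `merge_pair`, and `learn_bpe_merges` are hypothetical, the toy corpus is invented, and the sketch ignores details that the project tokenizer appears to handle (such as the `<maj>` capitalization marker and explicit space tokens).

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe_merges(text, num_merges):
    """Return the list of merges learned from whitespace-split text."""
    # Start from single characters; each word is a tuple of symbols.
    words = dict(Counter(tuple(word) for word in text.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges

# Toy corpus (an assumption for illustration, not the Wikipedia data)
toy_merges = learn_bpe_merges("հայ հայերեն հայեր", num_merges=3)
print(toy_merges)
```

Each iteration rescans the whole vocabulary for pair counts, which is one reason a naive implementation can take hours on a large corpus.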


In [6]:
# Specify the path where the learned BPE vocabulary will be saved

tokenizer_path = Path('./armenian_bpe_tokenizer.pkl')

In [7]:
tokenizer.save(tokenizer_path)

In [8]:
sample_text = "Հայերեն լեզվով հոդվածի օրինակ"

# Encode the sample text
encoded_ids, encoded_tokens = tokenizer.encode_text(sample_text)
print(f"Encoded IDs: {encoded_ids}")
print(f"Encoded Tokens: {encoded_tokens}")

# Decode the encoded IDs back to text
decoded_text = tokenizer.decode(encoded_ids)
print(f"Decoded Text: {decoded_text}")

Encoded IDs: [5, 6767, 7359, 7, 27975, 6731, 7, 8025, 6757, 12, 7, 8661, 6714]
Encoded Tokens: ['<maj>', 'հայ', 'երեն', ' ', 'լեզվ', 'ով', ' ', 'հոդ', 'ված', 'ի', ' ', 'օրին', 'ակ']
Decoded Text: Հայերեն լեզվով հոդվածի օրինակ
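The output above shows that the BPE tokenizer reproduces the input exactly. A small helper can turn that observation into an automatic check. The sketch below assumes only the two methods used in this notebook, `encode_text` (returning IDs and tokens) and `decode`; the helper name `round_trip_ok` and the `CharTokenizer` stand-in are hypothetical and exist only so the example runs without the trained model.

```python
def round_trip_ok(tokenizer, text):
    """Return True if decoding the encoded IDs reproduces `text` exactly."""
    ids, _tokens = tokenizer.encode_text(text)
    return tokenizer.decode(ids) == text

class CharTokenizer:
    """Hypothetical stand-in with the same two methods as the notebook's tokenizers."""

    def encode_text(self, text):
        tokens = list(text)
        ids = [ord(ch) for ch in tokens]
        return ids, tokens

    def decode(self, ids):
        return "".join(chr(i) for i in ids)

check = round_trip_ok(CharTokenizer(), "Հայերեն լեզվով հոդվածի օրինակ")
print(check)
```

With the trained model, the same call would be `round_trip_ok(tokenizer, sample_text)`.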


# Training the WordPiece Tokenizer

In [43]:
# Initialize the WordPiece tokenizer loaded from WordPiece_tokenizer.py

wp_tokenizer = WordPieceTokenizer()

In [44]:
# Learn the WordPiece vocabulary from the corpus

wp_tokenizer._learn_wordpiece_vocab(corpus)

Learning WordPiece vocab:  20%|█▉        | 4645/23291 [1:31:47<5:16:27,  1.02s/it]
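WordPiece training is commonly described as differing from BPE in how it picks the next merge: instead of the most frequent pair, it prefers the pair whose frequency is high relative to the frequencies of its two parts. The sketch below illustrates that scoring rule on invented counts. Whether `WordPiece_tokenizer.py` uses exactly this score is an assumption; the names `split_word` and `wordpiece_pair_scores` are hypothetical.

```python
from collections import Counter

def split_word(word):
    """Split a word into characters, marking non-initial pieces with '##'."""
    return tuple([word[0]] + ["##" + ch for ch in word[1:]])

def wordpiece_pair_scores(words):
    """Score each adjacent pair as freq(pair) / (freq(first) * freq(second))."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for symbols, freq in words.items():
        for symbol in symbols:
            symbol_freq[symbol] += freq
        for first, second in zip(symbols, symbols[1:]):
            pair_freq[(first, second)] += freq
    return {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# Toy word counts (invented for illustration)
toy_words = {split_word("ab"): 4, split_word("ac"): 1, split_word("dc"): 3}
toy_scores = wordpiece_pair_scores(toy_words)
best_pair = max(toy_scores, key=toy_scores.get)
print(best_pair, toy_scores[best_pair])
```

In this toy example the pair `('a', '##b')` occurs most often, but `('d', '##c')` gets the higher score because `d` never appears without `##c`.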

In [34]:
# Specify the path where the learned WordPiece vocabulary will be saved

wp_tokenizer_path = Path('./armenian_wordpiece_tokenizer.pkl')

In [36]:
wp_tokenizer.save(wp_tokenizer_path)

In [35]:
sample_text = "Հայերեն լեզվով հոդվածի օրինակ"

# Encode the sample text
encoded_ids, encoded_tokens = wp_tokenizer.encode_text(sample_text)
print(f"Encoded IDs: {encoded_ids}")
print(f"Encoded Tokens: {encoded_tokens}")

# Decode the encoded IDs back to text
decoded_text = wp_tokenizer.decode(encoded_ids)
print(f"Decoded Text: {decoded_text}")

Encoded IDs: [0, 14037, 218, 21335, 13, 3862]
Encoded Tokens: ['<unk>', 'լեզվ', '##ով', 'հոդված', '##ի', 'օրինակ']
Decoded Text: <unk> լեզվ ով հոդված ի օրինակ
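The decoded text above keeps a space between each word and its `##` continuation piece. One way to get closer to the original sentence is to glue continuation pieces back onto the preceding token. The sketch below works on the token list printed above; `join_wordpieces` is a hypothetical helper, not a method of `WordPieceTokenizer`. The `<unk>` token stays as it is, because the original word cannot be recovered from the unknown-token ID.

```python
def join_wordpieces(tokens, prefix="##"):
    """Merge '##' continuation pieces into the preceding token and join with spaces."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)

# Token list copied from the encoder output above
tokens = ['<unk>', 'լեզվ', '##ով', 'հոդված', '##ի', 'օրինակ']
joined = join_wordpieces(tokens)
print(joined)
```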


#### End of the training process