# SentencePiece BPE Tokenizer for Armenian (hy)

This notebook covers:
- **Part 1:** Train a SentencePiece **BPE** tokenizer on `corpus.txt`
- **Part 2:** Encode and decode 3 test sentences
- **Part 3:** Analyze the trained vocabulary and the most frequent pieces in the corpus
- **Home Task:** Repeat on **CC-100 Armenian** with a larger vocabulary and compare the average number of tokens per sentence


In [None]:
# Install required libraries (Colab)
!pip -q install sentencepiece datasets


## Part 1 — Training Script


In [None]:
import sentencepiece as spm

# Paths (assumes corpus.txt is in the current working directory)
INPUT_PATH = "corpus.txt"
MODEL_PREFIX = "hy_bpe"

# Train SentencePiece BPE
spm.SentencePieceTrainer.train(
    input=INPUT_PATH,
    model_prefix=MODEL_PREFIX,
    vocab_size=300,
    model_type="bpe",
    character_coverage=1.0
)

# Load model to inspect vocab
sp = spm.SentencePieceProcessor()
sp.load(f"{MODEL_PREFIX}.model")

vocab_size = sp.get_piece_size()
print("Total vocabulary size:", vocab_size)

# First 30 pieces
print("\nFirst 30 vocabulary entries:")
for i in range(min(30, vocab_size)):
    print(f"{i:>4}: {sp.id_to_piece(i)}")

# Last 30 pieces
print("\nLast 30 vocabulary entries:")
start = max(0, vocab_size - 30)
for i in range(start, vocab_size):
    print(f"{i:>4}: {sp.id_to_piece(i)}")


Total vocabulary size: 300

First 30 vocabulary entries:
   0: <unk>
   1: <s>
   2: </s>
   3: ու
   4: ան
   5: այ
   6: եր
   7: ար
   8: ուն
   9: ▁հ
  10: ում
  11: ակ
  12: ութ
  13: ▁է
  14: ությ
  15: են
  16: ություն
  17: ▁Հ
  18: ներ
  19: աս
  20: ▁Հայ
  21: ▁կ
  22: որ
  23: ամ
  24: ական
  25: եւ
  26: ատ
  27: ▁են
  28: ▁մ
  29: ▁հայ

Last 30 vocabulary entries:
 270: Բ
 271: չ
 272: ջ
 273: փ
 274: Կ
 275: Ն
 276: Տ
 277: ձ
 278: Ե
 279: Մ
 280: Գ
 281: Դ
 282: Ծ
 283: Պ
 284: օ
 285: Թ
 286: Լ
 287: Խ
 288: Շ
 289: Ռ
 290: Ս
 291: Վ
 292: Ֆ
 293: ,
 294: Ը
 295: Ի
 296: Ձ
 297: Ղ
 298: Ո
 299: Ջ
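
The first entries above are suggestive: `ութ` (12), `ությ` (14), and `ություն` (16) appear one after another, which is what one would expect if the pieces are listed roughly in the order their merges were learned. The sketch below is a toy illustration of that greedy merge loop, not SentencePiece's actual implementation. The word table `toy_words` and its counts are made up for the example.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(words, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(words, best)
        merges.append(best[0] + best[1])
    return merges

# Hypothetical word frequencies (each word is a tuple of single characters)
toy_words = {
    ("ո", "ւ", "մ"): 5,
    ("ո", "ւ", "ն"): 3,
    ("ա", "ն"): 4,
}

merges = learn_merges(toy_words, 3)
print(merges)
```

Each learned merge becomes a new vocabulary piece, so longer pieces can only appear after the shorter pieces they are built from.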


## Part 2 — Encoding and Decoding Script


In [None]:
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("hy_bpe.model")

sentences = [
    "Հայաստանն ունի հարուստ պատմություն։",
    "Արհեստական բանականությունը արագ զարգանում է։",
    "Ծրագրավորումը կարևոր հմտություն է ապագայի համար։"
]

for idx, s in enumerate(sentences, start=1):
    pieces = sp.encode(s, out_type=str)
    ids = sp.encode(s, out_type=int)
    decoded = sp.decode(ids)

    print(f"\nS{idx}: {s}")
    print("Pieces:", pieces)
    print("IDs   :", ids)
    print("Decoded:", decoded)
    print("Exact match:", decoded == s)



S1: Հայաստանն ունի հարուստ պատմություն։
Pieces: ['▁Հայաստան', 'ն', '▁ունի', '▁հարուստ', '▁պ', 'ատ', 'մ', 'ություն', '։']
IDs   : [35, 236, 60, 221, 95, 26, 242, 16, 246]
Decoded: Հայաստանն ունի հարուստ պատմություն։
Exact match: True

S2: Արհեստական բանականությունը արագ զարգանում է։
Pieces: ['▁Ար', 'հ', 'եստ', 'ական', '▁բ', 'ան', 'ականությունը', '▁արագ', '▁զարգ', 'անում', '▁է', '։']
IDs   : [149, 247, 98, 24, 56, 4, 229, 161, 132, 158, 13, 246]
Decoded: Արհեստական բանականությունը արագ զարգանում է։
Exact match: True

S3: Ծրագրավորումը կարևոր հմտություն է ապագայի համար։
Pieces: ['▁Ծ', 'րագ', 'րա', 'վ', 'որ', 'ումը', '▁կարեւոր', '▁հ', 'մ', 'տ', 'ություն', '▁է', '▁ապագայի', '▁համար', '։']
IDs   : [188, 202, 73, 252, 22, 208, 40, 9, 242, 244, 16, 13, 220, 165, 246]
Decoded: Ծրագրավորումը կարեւոր հմտություն է ապագայի համար։
Exact match: False
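
The mismatch in S3 is most likely a normalization effect rather than a tokenizer bug. The input contains the ligature `և` (U+0587), while the decoded text contains the two letters `եւ`. SentencePiece's default normalization rule (`nmt_nfkc`) is based on Unicode NFKC, and NFKC decomposes U+0587 into U+0565 U+0582. The sketch below reproduces the decomposition with the standard library. `roundtrip_matches` is a hypothetical helper that compares against the normalized input, and plain NFKC is only an approximation of `nmt_nfkc`.

```python
import unicodedata

# "կարևոր" written with an explicit escape so the ligature is unambiguous
original = "կար\u0587որ"
normalized = unicodedata.normalize("NFKC", original)

print([f"U+{ord(c):04X}" for c in original])
print([f"U+{ord(c):04X}" for c in normalized])

def roundtrip_matches(source: str, decoded: str) -> bool:
    """Compare the decoded text with the NFKC-normalized source (hypothetical helper)."""
    return unicodedata.normalize("NFKC", source) == decoded

print("Raw comparison       :", original == normalized)
print("Normalized comparison:", roundtrip_matches(original, normalized))
```

If a byte-exact round trip matters, the trainer also accepts `normalization_rule_name="identity"`, which disables normalization. The cost is that `և` and `եւ` are then treated as different strings.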


## Part 3 — Vocabulary Analysis Script


In [None]:
import collections
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("hy_bpe.model")

def is_armenian_char(ch: str) -> bool:
    # Armenian block: U+0530..U+058F (letters, punctuation, etc.)
    return 0x0530 <= ord(ch) <= 0x058F

vocab = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# SentencePiece uses ▁ to represent a space, and special tokens like <unk>, <s>, </s>
# We'll exclude special tokens from length-based stats.
special = {"<unk>", "<s>", "</s>"}
vocab_non_special = [p for p in vocab if p not in special]

# Remove leading ▁ for length analysis (it is a space marker, not a letter)
def strip_space_marker(piece: str) -> str:
    return piece[1:] if piece.startswith("▁") else piece

clean = [strip_space_marker(p) for p in vocab_non_special if strip_space_marker(p) != ""]

# Count buckets
single_chars = sum(1 for p in clean if len(p) == 1 and is_armenian_char(p))
subword_2_4 = sum(1 for p in clean if 2 <= len(p) <= 4)
full_5_plus = sum(1 for p in clean if len(p) >= 5)

print("Single Armenian characters (len=1):", single_chars)
print("Subword fragments (len 2-4):", subword_2_4)
print("Full words / long pieces (len 5+):", full_5_plus)

# Most frequent token pieces when encoding the entire corpus
with open("corpus.txt", "r", encoding="utf-8") as f:
    text = f.read()

all_pieces = sp.encode(text, out_type=str)
freq = collections.Counter(all_pieces)

print("\nTop 10 most frequent pieces in the corpus:")
for piece, c in freq.most_common(10):
    print(f"{piece!r}: {c}")


Single Armenian characters (len=1): 99
Subword fragments (len 2-4): 156
Full words / long pieces (len 5+): 40

Top 10 most frequent pieces in the corpus:
'։': 93
'▁է': 53
'▁': 41
'ն': 35
'ան': 33
'ի': 33
'▁են': 27
'ը': 25
'ր': 22
'ում': 20
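
The three buckets add up to 99 + 156 + 40 = 295, while the vocabulary has 300 entries. The difference is presumably the three special tokens plus pieces that match none of the three tests, such as the bare `▁` and the non-Armenian `,`. The sketch below assigns every piece to exactly one bucket so that nothing is dropped silently. It works on a plain list of strings, and `sample_vocab` is a small hand-picked sample rather than the real vocabulary.

```python
from collections import Counter

SPECIAL = {"<unk>", "<s>", "</s>"}

def is_armenian_char(ch: str) -> bool:
    # Armenian block: U+0530..U+058F
    return 0x0530 <= ord(ch) <= 0x058F

def bucket(piece: str) -> str:
    """Return exactly one bucket name for a vocabulary piece."""
    if piece in SPECIAL:
        return "special"
    core = piece[1:] if piece.startswith("▁") else piece
    if core == "":
        return "space marker only"
    if len(core) == 1:
        return "single Armenian char" if is_armenian_char(core) else "single other char"
    return "fragment (2-4)" if len(core) <= 4 else "long piece (5+)"

# Hand-picked sample; with a trained model, use
# [sp.id_to_piece(i) for i in range(sp.get_piece_size())] instead.
sample_vocab = ["<unk>", "▁", ",", "ն", "▁հ", "ում", "ություն", "▁Հայաստան"]

counts = Counter(bucket(p) for p in sample_vocab)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Because every piece lands in one bucket, the counts always sum to the vocabulary size, which makes the totals easy to check.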


## Home Task — Large-Scale Training on CC-100 Armenian


In [None]:
# Pin the datasets library to a compatible version.
# This must run before the import below; restart the runtime if datasets was already imported.
!pip install -q datasets==1.18.0

import statistics

import sentencepiece as spm
from datasets import load_dataset

# 1) Load CC-100 Armenian subset
ds = load_dataset("cc100", lang="hy", split="train")

# 2) Take the first 50,000 records and save the non-empty ones to corpus_large.txt
N = 50_000
texts = ds.select(range(N))["text"]

with open("corpus_large.txt", "w", encoding="utf-8") as f:
    for t in texts:
        # One sample per line; blank records are skipped, so fewer than N lines may be written
        t = t.replace("\n", " ").strip()
        if t:
            f.write(t + "\n")

# Count the lines actually written (the file is closed again after counting)
with open("corpus_large.txt", encoding="utf-8") as f:
    print("Saved lines to corpus_large.txt:", sum(1 for _ in f))

# 3) Train SentencePiece BPE with vocab_size=8000
spm.SentencePieceTrainer.train(
    input="corpus_large.txt",
    model_prefix="hy_bpe_large",
    vocab_size=8000,
    model_type="bpe",
    character_coverage=1.0
)

# 4) Encode and decode 5 sentences of your own choice
sp_large = spm.SentencePieceProcessor()
sp_large.load("hy_bpe_large.model")

my_sentences = [
    "Ես սիրում եմ սովորել նոր տեխնոլոգիաներ։",
    "Հայերեն տեքստերի վերլուծությունը հետաքրքիր է։",
    "Երևանում ձմեռը երբեմն շատ ցուրտ է լինում։",
    "Տվյալների գիտությունը կիրառվում է բազմաթիվ ոլորտներում։",
    "Արհեստական բանականությունը փոխում է աշխարհը։"
]

for s in my_sentences:
    ids = sp_large.encode(s, out_type=int)
    decoded = sp_large.decode(ids)
    print("\nSentence:", s)
    print("Num tokens:", len(ids))
    print("Decoded matches:", decoded == s)

# 5) Compare average number of tokens per sentence between small and large models
sp_small = spm.SentencePieceProcessor()
sp_small.load("hy_bpe.model")

def avg_tokens_per_sentence(sp, sentence_list):
    counts = [len(sp.encode(s, out_type=int)) for s in sentence_list]
    return statistics.mean(counts)

# Evaluate both models on the same CC-100 sample.
# Note: these lines are also part of the large model's training data, which may favor it.
sample_size = 2000
sample_sents = [t.replace("\n", " ").strip() for t in texts[:sample_size] if t and t.strip()]

avg_small = avg_tokens_per_sentence(sp_small, sample_sents)
avg_large = avg_tokens_per_sentence(sp_large, sample_sents)

print("\nAverage tokens per sentence on the same sample:")
print("Small model (vocab=300):", avg_small)
print("Large model (vocab=8000):", avg_large)


Saved lines to corpus_large.txt: 44322

Sentence: Ես սիրում եմ սովորել նոր տեխնոլոգիաներ։
Num tokens: 8
Decoded matches: True

Sentence: Հայերեն տեքստերի վերլուծությունը հետաքրքիր է։
Num tokens: 9
Decoded matches: True

Sentence: Երևանում ձմեռը երբեմն շատ ցուրտ է լինում։
Num tokens: 10
Decoded matches: False

Sentence: Տվյալների գիտությունը կիրառվում է բազմաթիվ ոլորտներում։
Num tokens: 11
Decoded matches: True

Sentence: Արհեստական բանականությունը փոխում է աշխարհը։
Num tokens: 12
Decoded matches: True

Average tokens per sentence on the same sample:
Small model (vocab=300): 94.15070093457943
Large model (vocab=8000): 41.68399532710281
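
To make the comparison easier to read, the two averages can be turned into a ratio and a percentage reduction. The sketch below only redoes the arithmetic on the numbers printed above. It says nothing about held-out text, since the evaluation sample overlaps with the large model's training data.

```python
# Averages copied from the output above
avg_small = 94.15070093457943   # vocab_size=300
avg_large = 41.68399532710281   # vocab_size=8000

# How many times more tokens the small model needs per sentence
ratio = avg_small / avg_large

# Relative reduction in token count when switching to the large vocabulary
reduction_pct = 100 * (1 - avg_large / avg_small)

print(f"Small model uses {ratio:.2f}x as many tokens per sentence")
print(f"Token count reduction with the large vocabulary: {reduction_pct:.1f}%")
```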
