# Subword Tokenization

In this exercise, we will learn how to train our own subword tokenizers with two algorithms: BPE and Unigram. We will use `sentencepiece`, a library from Google, to build our tokenizers.

## Ref:
https://github.com/google/sentencepiece/blob/master/python

## Setup

In [1]:
!wget https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/pra-apai-manee-ch1-50.txt
!wget https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/kratoo-40000000-40002000.jsonl

--2025-01-17 22:39:55--  https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/pra-apai-manee-ch1-50.txt
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Knight-H/thai-lm/refs/heads/master/data/pra-apai-manee-ch1-50.txt [following]
--2025-01-17 22:39:56--  https://raw.githubusercontent.com/Knight-H/thai-lm/refs/heads/master/data/pra-apai-manee-ch1-50.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3231076 (3.1M) [application/octet-stream]
Saving to: ‘pra-apai-manee-ch1-50.txt.2’


2025-01-17 22:39:57 (17.3 MB/s) - ‘pra-apai-manee-ch1-50.txt.2’ saved [3231076/32310

## Code

In [2]:
%pip install -q sentencepiece
import sentencepiece as spm
import io
import json

Note: you may need to restart the kernel to use updated packages.


Load data

In [3]:
pantip_text = []
# Each line of the JSONL file is one record with a `title` and a `content` field.
with open("kratoo-40000000-40002000.jsonl", "r", encoding="utf-8") as json_file:
    for json_str in json_file:
        result = json.loads(json_str)
        pantip_text.append(f"{result['title']}\n{result['content']}\n")
# Total number of characters in the Pantip corpus
sum(len(t) for t in pantip_text)

1060318

In [4]:
with open("pra-apai-manee-ch1-50.txt", encoding="utf-8") as f:
    pra_apai_manee_data = f.readlines()

In [5]:
sum([len(t) for t in pra_apai_manee_data])

1100605

In [6]:
pantip_train_text = pantip_text[: int(len(pantip_text) * 0.8)]
pantip_test_text = pantip_text[int(len(pantip_text) * 0.8) :]

pam_train_text = pra_apai_manee_data[: int(len(pra_apai_manee_data) * 0.8)]  # pam = pra_apai_manee
pam_test_text = pra_apai_manee_data[int(len(pra_apai_manee_data) * 0.8) :]

## Run tokenizer training

The Python wrapper provides several APIs for training tokenizers:

1. `spm.SentencePieceTrainer.train(input='input.txt', model_prefix='m', vocab_size=vocab_size, model_type=model_type)`
  <br> This writes the tokenizer files `m.model` and `m.vocab`, which can later be loaded into a `SentencePieceProcessor`.
  <br><br>
2. `spm.SentencePieceTrainer.train(sentence_iterator=iterator, model_writer=obj_with_write_method, vocab_size=vocab_size, model_type=model_type)`
  <br> This method requires a file-like object, e.g. `obj_with_write_method = io.BytesIO()`. Its advantage is that you can run SentencePiece in environments with limited access to the local file system. You still have to save the model bytes if you want to reuse the model; otherwise you will have to train it again.
<br><br>
3.  `spm.SentencePieceTrainer.train('--input=input.txt --model_prefix=m --vocab_size=vocab_size --model_type=model_type')`
<br> Same as no. 1, with the options passed as a single command-line-style string.




### Unigram tokenizer

We start by training a unigram tokenizer. You can use any of the training APIs listed above. Make sure to set `vocab_size` to 1000.
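As one option, the second (in-memory) API can be sketched as below. The helper names `train_in_memory` and `load_from_bytes` are illustrative and not part of SentencePiece, and loading a model through the `model_proto` argument is assumed to be supported by the installed version of the Python wrapper. The import is done lazily only so that the helpers can be defined before the package is installed.

```python
import io


def train_in_memory(sentences, vocab_size=1000, model_type="unigram"):
    """Train a SentencePiece model from an iterable of strings, without files.

    Returns the serialized model as bytes (hypothetical helper).
    """
    import sentencepiece as spm  # lazy import; assumes sentencepiece is installed

    model_buffer = io.BytesIO()  # any object with a write() method works here
    spm.SentencePieceTrainer.train(
        sentence_iterator=iter(sentences),
        model_writer=model_buffer,
        vocab_size=vocab_size,
        model_type=model_type,
    )
    return model_buffer.getvalue()


def load_from_bytes(model_bytes):
    """Build a processor directly from serialized model bytes (hypothetical helper)."""
    import sentencepiece as spm  # lazy import; assumes sentencepiece is installed

    return spm.SentencePieceProcessor(model_proto=model_bytes)
```

The returned bytes can be written to disk later if the model needs to be reused, which avoids retraining.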

In [7]:
both_train = pantip_train_text + pam_train_text
both_test = pantip_test_text + pam_test_text

both_train_corpus = "\n".join(both_train)
both_test_corpus = "\n".join(both_test)

with open("both_train_corpus.txt", "w", encoding="utf-8") as f:
    f.write(both_train_corpus)
with open("both_test_corpus.txt", "w", encoding="utf-8") as f:
    f.write(both_test_corpus)

In [8]:
pam_train_corpus = "\n".join(pam_train_text)
pam_test_corpus = "\n".join(pam_test_text)

with open("pam_train_corpus.txt", "w", encoding="utf-8") as f:
    f.write(pam_train_corpus)
with open("pam_test_corpus.txt", "w", encoding="utf-8") as f:
    f.write(pam_test_corpus)

In [9]:
spm.SentencePieceTrainer.train(
    input="pam_train_corpus.txt", model_prefix="pam_unigram", vocab_size=1000, model_type="unigram"
)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: pam_train_corpus.txt
  input_format: 
  model_prefix: pam_unigram
  model_type: UNIGRAM
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy

### Q1 MCV

How many tokens did you get when tokenizing the following sentence with your unigram tokenizer: <br>
'อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม'

In [10]:
pam_unigram_tokenizer = spm.SentencePieceProcessor(model_file="pam_unigram.model")

In [11]:
len(pam_unigram_tokenizer.encode("อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม", out_type=str))

29

### BPE Tokenizer

Now try training a BPE tokenizer.

In [12]:
spm.SentencePieceTrainer.train(
    input="pam_train_corpus.txt",
    model_prefix="pam_bpe",
    vocab_size=1000,
    model_type="bpe",
)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: pam_train_corpus.txt
  input_format: 
  model_prefix: pam_bpe
  model_type: BPE
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  di

### Q2 MCV

How many tokens did you get when tokenizing the following sentence with your BPE tokenizer: <br>
'อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม'

In [13]:
pam_bpe_tokenizer = spm.SentencePieceProcessor(model_file="pam_bpe.model")

In [14]:
len(pam_bpe_tokenizer.encode("อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม", out_type=str))

28

These are some of the vocabulary entries. Note that "▁" (U+2581) appears in every type of SentencePiece tokenizer: it marks whitespace, which makes it possible to detokenize (rejoin the pieces into the original sentence) without relying on language-specific resources.
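To see why the "▁" marker makes detokenization language-independent, here is a minimal pure-Python sketch. The pieces are written by hand for illustration and are not the output of the tokenizers trained above; the function `detokenize` is a hypothetical helper, not a SentencePiece API.

```python
WHITESPACE_MARK = "\u2581"  # the "▁" meta symbol


def detokenize(pieces):
    """Rebuild text from SentencePiece-style pieces by plain concatenation."""
    text = "".join(pieces).replace(WHITESPACE_MARK, " ")
    # A leading marker stands for the dummy space added in front of the sentence.
    return text[1:] if text.startswith(" ") else text


# Hand-written example pieces (assumed, for illustration only)
pieces = ["\u2581Hello", "\u2581wor", "ld", "."]
print(detokenize(pieces))
```

No dictionary or word segmenter is needed: concatenating the pieces and mapping the marker back to a space is enough.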

In [15]:
unigram_vocabs = [pam_unigram_tokenizer.id_to_piece(id) for id in range(pam_unigram_tokenizer.get_piece_size())]
" | ".join(unigram_vocabs[:500])

'<unk> | <s> | </s> | ▁ | า | เ | น | ม | ย | ก | ร | ว | ด | ส | ง | บ | ค | มา | อ | ล | จะ | ท | ให้ | ห | ไป | ไม่ | แ | ว่า | พ | ุ | ี | ๏ | ฯ | ข | ช | เป็น | พระ | โ | ที่ | ใจ | ▁จะ | จ | ะ | ิ | ต | ก็ | อยู่ | ป | ได้ | ่ | ไ | เข้า | ู | ▁พระ | ้า | ตาม | ใน | ้ | ▁แล้ว | เหมือน | รา | ศ | เจ้า | เห็น | ลา | กัน | ั | หา | นาง | ทรง | ประ | ์ | ยา | ัก | ํา | ซ | าน | ัง | ฉ | องค์ | ัด | แล้ว | อน | ดู | ถ | ด้วย | มี | ▁จึง | นี้ | ่า | ผ | น้อง | แต่ | ทํา | ▁นาง | ▁ให้ | รัก | พี่ | คิด | ลูก | พา | รู้ | การ | กับ | ัน | หน้า | กระ | วน | ออก | ่อ | เขา | ถึง | ระ | ข้า | ับ | พล | นั่ง | ทั้ง | หน | รับ | ษ | กล | วง | ลง | ฝ | กร | พร | ความ | เสีย | ดี | ขึ้น | อง | ่ง | ธ | ▁แต่ | คน | กลับ | ▁ฝ่าย | ้น | อด | ภ | หรือ | ตร | ือ | ฟัง | แม่ | ▁ไม่ | ไว้ | ยัง | ▁เห็น | นา | ขอ | มิ | น้ํา | หล | ดัง | ▁พอ | ▁ทั้ง | ช่วย | สม | นั้น | ริ | ทัพ | ต้อง | วัน | อา | น้อย | รบ | ิน | อย่า | เอา | จน | เรา | สุด | เสียง | ข้าง | หลัง | ตี | ตัว | ละ | สุ | วัง | ทุก | ่น

In [16]:
bpe_vocabs = [pam_bpe_tokenizer.id_to_piece(id) for id in range(pam_bpe_tokenizer.get_piece_size())]
" | ".join(bpe_vocabs[:500])

'<unk> | <s> | </s> | ้า | ่า | อง | ระ | ํา | รา | อย | ่ง | มา | จะ | ัง | ัน | ▁เ | าย | ้ว | ับ | ี่ | ม่ | อน | ให | าม | ้น | ็น | พระ | ีย | าง | กล | ้ง | ัก | หน | ให้ | ไม่ | หล | ่น | ึง | ▁แ | ทั | ตร | าร | ้อง | ไป | ิด | ข้า | ว่า | หม | คร | ือ | ล้ว | เป | เส | ประ | าน | ั่ง | ▁๏ | ▁ฯ | ที่ | อก | เล | ิน | ได | พล | ทร | ัด | นาง | ึก | ได้ | ู่ | ▁จะ | ค์ | ี้ | พร | เป็น | สุ | ทั้ง | อม | ัย | เร | ห็น | ▁จ | ▁พระ | ก็ | ใจ | อา | ื่ | ่าง | ต่ | กร | ิง | วง | วน | ือน | เจ | ู้ | ียง | อยู่ | รร | ตาม | ▁พ | ้วย | าว | ถึง | คล | ั้น | รี | เข | ด้วย | สม | องค์ | สน | าก | ▁แล้ว | เช | ัว | ย์ | ใน | คว | น้ | หมือน | ▁ส | ูก | อบ | กระ | เจ้า | ทรง | ลา | กัน | มี | ่าย | พรา | ิ่ง | เข้า | เห็น | ิต | สง | อด | ณ์ | วย | ้ม | คิด | เม | เก | เด | ▁นาง | วา | ุก | ▁ให้ | ดู | หา | ▁อ | ▁จึง | ทํา | ลง | รัก | เค | แล้ว | ่าน | พี่ | เหมือน | ั่น | ความ | ยง | อย่า | หร | มิ | ืน | ช่ | การ | ัญ | ▁ไม่ | ฝ่าย | ศรี | ้าง | วก | ้อม | ือง | น้อง | ยว | พา | แก |

### User-defined symbols

Another important concept is user-defined symbols. These special symbols are reserved for a special purpose (e.g., the `[MASK]` token used in BERT) and are always tokenized into a single token.

Refer to the documentation for ways to add these special tokens to your tokenizer.

https://github.com/google/sentencepiece/blob/master/python
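As a hedged sketch, the snippet below only assembles the keyword arguments for a training call that reserves a special token through the `user_defined_symbols` training option. The helper `trainer_kwargs_with_symbols` and the model prefix `pam_unigram_masked` are made up for this example.

```python
def trainer_kwargs_with_symbols(corpus_path, model_prefix, symbols,
                                vocab_size=1000, model_type="unigram"):
    """Collect training options, including symbols that must stay single tokens."""
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "model_type": model_type,
        # Each listed symbol is reserved in the vocabulary as one piece.
        "user_defined_symbols": list(symbols),
    }


kwargs = trainer_kwargs_with_symbols(
    "pam_train_corpus.txt", "pam_unigram_masked", ["[MASK]"]
)
print(kwargs["user_defined_symbols"])

# With sentencepiece installed, the options would be passed on like this:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(**kwargs)
```

Note that reserved symbols take up slots in the vocabulary, so they count toward `vocab_size`.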

## Train another tokenizer on another domain

Now train another unigram tokenizer on the Pantip data (`pantip_text`); we will compare it with the unigram tokenizer trained earlier.

In [17]:
pantip_train_corpus = "\n".join(pantip_train_text)
pantip_test_corpus = "\n".join(pantip_test_text)

with open("pantip_train_corpus.txt", "w", encoding="utf-8") as f:
    f.write(pantip_train_corpus)
with open("pantip_test_corpus.txt", "w", encoding="utf-8") as f:
    f.write(pantip_test_corpus)

In [18]:
# Train unigram and BPE tokenizers on the Pantip training corpus
spm.SentencePieceTrainer.train(
    input="pantip_train_corpus.txt", model_prefix="pantip_unigram", vocab_size=1000, model_type="unigram"
)
spm.SentencePieceTrainer.train(
    input="pantip_train_corpus.txt", model_prefix="pantip_bpe", vocab_size=1000, model_type="bpe"
)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: pantip_train_corpus.txt
  input_format: 
  model_prefix: pantip_unigram
  model_type: UNIGRAM
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_p

## Analyse top tokens on different datasets

Use your tokenizers to tokenize the datasets and analyse the most common tokens (try the top 300-400 tokens with `len > 1`). Hint: tokenize your data and count the tokens.
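The hint can be sketched with `collections.Counter`. In this example a whitespace split stands in for a real tokenizer so that the snippet runs on its own; `top_tokens` and its parameters are hypothetical names.

```python
from collections import Counter


def top_tokens(texts, encode, min_len=2, k=5):
    """Count tokens over all texts and return the k most common with len >= min_len."""
    counts = Counter()
    for text in texts:
        counts.update(encode(text))
    frequent = [(tok, n) for tok, n in counts.most_common() if len(tok) >= min_len]
    return frequent[:k]


# Stand-in tokenizer: split on whitespace (an assumption for this toy example)
toy_texts = ["ab ab cd", "ab cd e"]
print(top_tokens(toy_texts, str.split))
```

With a trained tokenizer, `encode` could be something like `lambda t: tokenizer.encode(t, out_type=str)`.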

In [19]:
from collections import defaultdict
import pandas as pd

In [20]:
pantip_unigram_tokenizer = spm.SentencePieceProcessor(model_file="pantip_unigram.model")
pantip_bpe_tokenizer = spm.SentencePieceProcessor(model_file="pantip_bpe.model")

In [21]:
tokenization_stats_df = pd.DataFrame(
    columns=[
        "source",
        "tokenizer",
        "token",
        "count",
    ]
)

In [22]:
# tokenizer trained on pantip -> tokenized on pantip
unigram_token_counts = defaultdict(int)
bpe_token_counts = defaultdict(int)
for text in pantip_train_text + pantip_test_text:
    # unigram
    unigram_tokens = pantip_unigram_tokenizer.encode(text, out_type=str)
    for token in unigram_tokens:
        unigram_token_counts[token] += 1
    # bpe
    bpe_tokens = pantip_bpe_tokenizer.encode(text, out_type=str)
    for token in bpe_tokens:
        bpe_token_counts[token] += 1

rows = []

# unigram
for token, count in unigram_token_counts.items():
    rows.append(
        {
            "source": "pantip",
            "tokenizer": "pantip_unigram",
            "token": token,
            "count": count,
        }
    )

# bpe
for token, count in bpe_token_counts.items():
    rows.append(
        {
            "source": "pantip",
            "tokenizer": "pantip_bpe",
            "token": token,
            "count": count,
        }
    )

# append to the dataframe
tokenization_stats_df = pd.concat([tokenization_stats_df, pd.DataFrame(rows)])

print(f"Number of rows: {len(tokenization_stats_df)}")

Number of rows: 2371


In [23]:
# tokenizer trained on pantip -> tokenized on pam
unigram_token_counts = defaultdict(int)
bpe_token_counts = defaultdict(int)
for text in pam_train_text + pam_test_text:
    # unigram
    unigram_tokens = pantip_unigram_tokenizer.encode(text, out_type=str)
    for token in unigram_tokens:
        unigram_token_counts[token] += 1
    # bpe
    bpe_tokens = pantip_bpe_tokenizer.encode(text, out_type=str)
    for token in bpe_tokens:
        bpe_token_counts[token] += 1

rows = []

# unigram
for token, count in unigram_token_counts.items():
    rows.append(
        {
            "source": "pam",
            "tokenizer": "pantip_unigram",
            "token": token,
            "count": count,
        }
    )

# bpe
for token, count in bpe_token_counts.items():
    rows.append(
        {
            "source": "pam",
            "tokenizer": "pantip_bpe",
            "token": token,
            "count": count,
        }
    )

# append to the dataframe
tokenization_stats_df = pd.concat([tokenization_stats_df, pd.DataFrame(rows)])

print(f"Number of rows: {len(tokenization_stats_df)}")

Number of rows: 3750


In [24]:
# tokenizer trained on pam -> tokenized on pantip
unigram_token_counts = defaultdict(int)
bpe_token_counts = defaultdict(int)
for text in pantip_train_text + pantip_test_text:
    # unigram
    unigram_tokens = pam_unigram_tokenizer.encode(text, out_type=str)
    for token in unigram_tokens:
        unigram_token_counts[token] += 1
    # bpe
    bpe_tokens = pam_bpe_tokenizer.encode(text, out_type=str)
    for token in bpe_tokens:
        bpe_token_counts[token] += 1

rows = []

# unigram
for token, count in unigram_token_counts.items():
    rows.append(
        {
            "source": "pantip",
            "tokenizer": "pam_unigram",
            "token": token,
            "count": count,
        }
    )

# bpe
for token, count in bpe_token_counts.items():
    rows.append(
        {
            "source": "pantip",
            "tokenizer": "pam_bpe",
            "token": token,
            "count": count,
        }
    )

# append to the dataframe
tokenization_stats_df = pd.concat([tokenization_stats_df, pd.DataFrame(rows)])

print(f"Number of rows: {len(tokenization_stats_df)}")

Number of rows: 17810


In [25]:
# tokenizer trained on pam -> tokenized on pam
unigram_token_counts = defaultdict(int)
bpe_token_counts = defaultdict(int)
for text in pam_train_text + pam_test_text:
    # unigram
    unigram_tokens = pam_unigram_tokenizer.encode(text, out_type=str)
    for token in unigram_tokens:
        unigram_token_counts[token] += 1
    # bpe
    bpe_tokens = pam_bpe_tokenizer.encode(text, out_type=str)
    for token in bpe_tokens:
        bpe_token_counts[token] += 1

rows = []

# unigram
for token, count in unigram_token_counts.items():
    rows.append(
        {
            "source": "pam",
            "tokenizer": "pam_unigram",
            "token": token,
            "count": count,
        }
    )

# bpe
for token, count in bpe_token_counts.items():
    rows.append(
        {
            "source": "pam",
            "tokenizer": "pam_bpe",
            "token": token,
            "count": count,
        }
    )

# append to the dataframe
tokenization_stats_df = pd.concat([tokenization_stats_df, pd.DataFrame(rows)])

print(f"Number of rows: {len(tokenization_stats_df)}")

Number of rows: 19806


In [26]:
tokenization_stats_df.to_csv("tokenization_stats.csv", index=False)

In [27]:
%pip install -q duckdb
import duckdb

Note: you may need to restart the kernel to use updated packages.


In [28]:
duckdb.query(
    """
select source, tokenizer, sum(count) as total_count
from tokenization_stats_df
where source = 'pam'
group by source, tokenizer
order by source, total_count
"""
).to_df()

| | source | tokenizer | total_count |
|---|---|---|---|
| 0 | pam | pam_bpe | 445508.0 |
| 1 | pam | pam_unigram | 446143.0 |
| 2 | pam | pantip_bpe | 577579.0 |
| 3 | pam | pantip_unigram | 631336.0 |


In [29]:
duckdb.query(
    """
select source, tokenizer, sum(count) as total_count
from tokenization_stats_df
where source = 'pantip'
group by source, tokenizer
order by source, total_count
"""
).to_df()

| | source | tokenizer | total_count |
|---|---|---|---|
| 0 | pantip | pantip_unigram | 442659.0 |
| 1 | pantip | pantip_bpe | 448553.0 |
| 2 | pantip | pam_bpe | 474871.0 |
| 3 | pantip | pam_unigram | 511583.0 |


In [30]:
# pivot table for token and (source, tokenizer) the value is sum of count
pivot = (
    tokenization_stats_df[
        (tokenization_stats_df["token"] != "▁") & tokenization_stats_df["token"].apply(lambda x: len(x) > 2)
    ]
    .pivot_table(index=["token"], columns=["source", "tokenizer"], values="count", aggfunc="sum", fill_value=0)
    .sort_values(("pam", "pam_unigram"), ascending=False)
    .head(50)
)

pivot["diff_pam_bpe"] = pivot[("pam", "pam_bpe")] - pivot[("pam", "pantip_bpe")]
# Most negative first: tokens the Pantip BPE tokenizer emits on PAM more often than the PAM BPE tokenizer does
pivot.sort_values("diff_pam_bpe", ascending=True).head(20)



Columns are labelled `source / tokenizer`.

| token | pam / pam_bpe | pam / pam_unigram | pam / pantip_bpe | pam / pantip_unigram | pantip / pam_bpe | pantip / pam_unigram | pantip / pantip_bpe | pantip / pantip_unigram | diff_pam_bpe |
|---|---|---|---|---|---|---|---|---|---|
| ทั้ง | 1208 | 753 | 2025 | 2062 | 352 | 335 | 494 | 498 | -817 |
| พระ | 2191 | 2023 | 2829 | 4565 | 144 | 144 | 174 | 174 | -638 |
| ถึง | 903 | 751 | 1427 | 1430 | 687 | 648 | 774 | 775 | -524 |
| ด้วย | 963 | 963 | 1396 | 1396 | 861 | 863 | 946 | 946 | -433 |
| ได้ | 1406 | 1535 | 1808 | 2116 | 3488 | 3769 | 2885 | 3098 | -402 |
| ประ | 1997 | 1182 | 2347 | 2199 | 1270 | 895 | 972 | 503 | -350 |
| ข้า | 737 | 769 | 1046 | 0 | 175 | 186 | 257 | 0 | -309 |
| นี้ | 774 | 945 | 1064 | 1064 | 2135 | 2551 | 1711 | 1629 | -290 |
| แต่ | 775 | 818 | 1061 | 1022 | 967 | 982 | 1083 | 928 | -286 |
| พี่ | 658 | 932 | 936 | 1191 | 282 | 367 | 304 | 369 | -278 |


### To answer
What are some notable differences you see between the two vocabs?

Write your answer below.

In [31]:
"""
1. A tokenizer trained on the same corpus it is applied to gives a more compact encoding (fewer tokens) than a tokenizer trained on a different corpus.
2. On the PAM data, the Pantip tokenizer performs worse on words typical of the PAM domain (พระ ข้า รัก กระ ถึง ด้วย).
"""

'\n1. A tokenizer trained on the same corpus it is applied to gives a more compact encoding (fewer tokens) than a tokenizer trained on a different corpus.\n2. On the PAM data, the Pantip tokenizer performs worse on words typical of the PAM domain (พระ ข้า รัก กระ ถึง ด้วย).\n'

## Using tokenizer across domains

One problem you may face is that your dataset is very specialized. In that case, a tokenizer trained on a general domain may not perform as well as it should on your dataset.

Next, you will apply a tokenizer trained on a general domain (Pantip) to a specialized domain (พระอภัยมณี), and vice versa.

In [32]:
count_by_source_and_tokenizer_df = duckdb.query(
    """
select source, tokenizer, sum(count) as total_count
from tokenization_stats_df
group by source, tokenizer
order by source, total_count
"""
).to_df()
count_by_source_and_tokenizer_df["tokenizer_train_corpus"] = count_by_source_and_tokenizer_df["tokenizer"].apply(
    lambda x: x.split("_")[0]
)
count_by_source_and_tokenizer_df["tokenizer_algo"] = count_by_source_and_tokenizer_df["tokenizer"].apply(
    lambda x: x.split("_")[1]
)

count_by_source_and_tokenizer_df

| | source | tokenizer | total_count | tokenizer_train_corpus | tokenizer_algo |
|---|---|---|---|---|---|
| 0 | pam | pam_bpe | 445508.0 | pam | bpe |
| 1 | pam | pam_unigram | 446143.0 | pam | unigram |
| 2 | pam | pantip_bpe | 577579.0 | pantip | bpe |
| 3 | pam | pantip_unigram | 631336.0 | pantip | unigram |
| 4 | pantip | pantip_unigram | 442659.0 | pantip | unigram |
| 5 | pantip | pantip_bpe | 448553.0 | pantip | bpe |
| 6 | pantip | pam_bpe | 474871.0 | pam | bpe |
| 7 | pantip | pam_unigram | 511583.0 | pam | unigram |


### Q3 MCV

What percentage increase do you observe when tokenizing the whole พระอภัยมณี dataset with a tokenizer trained on Pantip, compared to the one trained on พระอภัยมณี?

In [41]:
token_on_same_domain = (
    duckdb.query(
        """
select total_count
from count_by_source_and_tokenizer_df
where source = 'pam' and tokenizer = 'pam_bpe'
"""
    )
    .to_df()
    .values[0][0]
)
token_on_different_domain = (
    duckdb.query(
        """
select total_count
from count_by_source_and_tokenizer_df
where source = 'pam' and tokenizer = 'pantip_bpe'
"""
    )
    .to_df()
    .values[0][0]
)

print("Token from PAM")
print(f"Token on the same domain: {token_on_same_domain}")
print(f"Token on the different domain: {token_on_different_domain}")
print(f"Increase: {(token_on_different_domain - token_on_same_domain) / token_on_same_domain * 100:.2f}%")

Token from PAM
Token on the same domain: 445508.0
Token on the different domain: 577579.0
Increase: 29.65%
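The same figure can be checked by hand from the BPE totals in the table above. This is only a worked arithmetic sketch; `percent_increase` is a hypothetical helper name.

```python
def percent_increase(baseline, other):
    """Relative increase of `other` over `baseline`, in percent."""
    return (other - baseline) / baseline * 100


# Token totals on the PAM data, copied from the table above
pam_with_pam_bpe = 445508      # tokenizer trained on the same domain
pam_with_pantip_bpe = 577579   # tokenizer trained on the other domain

increase = percent_increase(pam_with_pam_bpe, pam_with_pantip_bpe)
print(f"Increase: {increase:.2f}%")
```

The baseline is always the in-domain tokenizer, so the result reads as "how many more tokens the out-of-domain tokenizer needs".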


### Q4 MCV

What percentage increase do you observe when tokenizing the whole Pantip dataset with a tokenizer trained on พระอภัยมณี, compared to the one trained on Pantip?

In [42]:
token_on_same_domain = (
    duckdb.query(
        """
select total_count
from count_by_source_and_tokenizer_df
where source = 'pantip' and tokenizer = 'pantip_bpe'
"""
    )
    .to_df()
    .values[0][0]
)

token_on_different_domain = (
    duckdb.query(
        """
select total_count
from count_by_source_and_tokenizer_df
where source = 'pantip' and tokenizer = 'pam_bpe'
"""
    )
    .to_df()
    .values[0][0]
)

print("Token from Pantip data")
print(f"Token on the same domain: {token_on_same_domain}")
print(f"Token on the different domain: {token_on_different_domain}")
print(f"Increase: {(token_on_different_domain - token_on_same_domain) / token_on_same_domain * 100:.2f}%")

Token from Pantip data
Token on the same domain: 448553.0
Token on the different domain: 474871.0
Increase: 5.87%


### To answer
Why do you think the number of tokens tokenized by the general tokenizer (the one trained on Pantip) has a higher percentage increase compared to the number of tokens tokenized by the specialized tokenizer? (Hint: we fixed vocab size.)

In [43]:
"""
Because the vocabulary is fixed at 1,000 pieces and the words in the PAM data (พระอภัยมณี) are archaic, the Pantip tokenizer has few pieces that cover them.
As a result, using the PAM tokenizer on Pantip data increases the token count less than using the Pantip tokenizer on PAM data.
"""

'\nBecause the vocabulary is fixed at 1,000 pieces and the words in the PAM data (พระอภัยมณี) are archaic, the Pantip tokenizer has few pieces that cover them.\nAs a result, using the PAM tokenizer on Pantip data increases the token count less than using the Pantip tokenizer on PAM data.\n'

## The effect on language models

Next, we will see the effect of using "cross-domain" tokenizers on language models.

### Setup
We are going to reuse the code from the last assignment.

In [46]:
!pip install -q lightning

In [47]:
import itertools
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
import lightning as L
from tqdm import tqdm
import numpy as np

In [62]:
class TextDataset(Dataset):
    def __init__(self, data, tokenizer, seq_len=128):
        token_ids = [tokenizer.encode(d, add_bos=True, add_eos=True) for d in data]
        flatten_token_ids = list(itertools.chain(*token_ids))
        encoded = torch.LongTensor(flatten_token_ids)

        left_over = len(encoded) % seq_len
        encoded = encoded[: len(encoded) - left_over]
        self.encoded = encoded.view(-1, seq_len)

    def __getitem__(self, idx):
        return self.encoded[idx]

    def __len__(self):
        return len(self.encoded)
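The chunking done in `TextDataset.__init__` can be illustrated with plain lists. This is a simplified sketch without PyTorch; the token ids are made up, and `chunk_token_ids` is a hypothetical helper mirroring the flatten, truncate, and reshape steps.

```python
from itertools import chain


def chunk_token_ids(encoded_docs, seq_len):
    """Flatten per-document id lists and cut them into rows of exactly seq_len ids."""
    flat = list(chain.from_iterable(encoded_docs))
    usable = len(flat) - len(flat) % seq_len  # drop the left-over tail
    return [flat[i:i + seq_len] for i in range(0, usable, seq_len)]


# Made-up ids: 1 = <s>, 2 = </s>, as with add_bos=True and add_eos=True
docs = [[1, 10, 11, 2], [1, 12, 13, 14, 2], [1, 15, 2]]
rows = chunk_token_ids(docs, seq_len=5)
print(rows)
```

Rows can therefore span document boundaries, and the last incomplete row is discarded.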

In [63]:
class LSTM(L.LightningModule):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, learning_rate, criterion):
        super().__init__()

        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.learning_rate = learning_rate
        self.criterion = criterion

    def forward(self, src):
        embedded = self.embedding(src)
        embedded = self.dropout(embedded)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = self.dropout(lstm_out)
        output = self.fc(lstm_out)

        return output

    def training_step(self, batch, batch_idx):
        src = batch[:, :-1]
        target = batch[:, 1:]
        prediction = self(src)
        prediction = prediction.reshape(-1, self.vocab_size)
        target = target.reshape(-1)
        loss = self.criterion(prediction, target)
        self.log("train_loss", loss)
        return loss

    def test_step(self, batch, batch_idx, dataloader_idx=0):
        src = batch[:, :-1]
        target = batch[:, 1:]
        with torch.no_grad():
            prediction = self(src)
        prediction = prediction.reshape(-1, self.vocab_size)
        target = target.reshape(-1)
        loss = self.criterion(prediction, target)
        self.log("test_loss", loss)
        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.learning_rate)

In [64]:
vocab_size = pam_unigram_tokenizer.get_piece_size()
embedding_dim = 200
hidden_dim = 512
num_layers = 3
dropout_rate = 0.2
lr = 1e-3
criterion = nn.CrossEntropyLoss()
train_batch_size = 64
test_batch_size = 128

### Training

<a name="no1"></a>
#### 1. Training on Pantip data with Pantip tokenizer

In [65]:
trainer = L.Trainer(max_epochs=10, deterministic=True)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, pantip_unigram_tokenizer)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size=train_batch_size, shuffle=True)

pantip_test_dataset = TextDataset(pantip_test_text, pantip_unigram_tokenizer)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size=test_batch_size, shuffle=False)

pam_train_dataset = TextDataset(pam_train_text, pantip_unigram_tokenizer)
pam_train_loader = DataLoader(pam_train_dataset, batch_size=train_batch_size, shuffle=True)

pam_test_dataset = TextDataset(pam_test_text, pantip_unigram_tokenizer)
pam_test_loader = DataLoader(pam_test_dataset, batch_size=test_batch_size, shuffle=False)

trainer.fit(model, train_dataloaders=pantip_train_loader)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Total params
25.511    Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode


Epoch 9: 100%|██████████| 44/44 [00:08<00:00,  5.17it/s, v_num=4]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 44/44 [00:08<00:00,  5.05it/s, v_num=4]


In [66]:
test_result = trainer.test(
    model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader, pam_test_loader], verbose=False
)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

/Users/jirayuwat/anaconda3/envs/nlp/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:476: Your `test_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/Users/jirayuwat/anaconda3/envs/nlp/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Testing DataLoader 3: 100%|██████████| 9/9 [00:01<00:00,  8.02it/s]  
Perplexity on Pantip train set is:	39.45305689850137
Perplexity on Pra apai manee train set is:	107.59757736986049
Perplexity on Pantip test set is:	142.8758471391891
Perplexity on Pra apai manee test set is:	146.00030674317318
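The perplexities above are the exponential of the mean cross-entropy loss. As a small sanity-check sketch (pure Python, with a hypothetical `perplexity` helper), a model that assigns uniform probability over a 1000-piece vocabulary should have a perplexity of about 1000.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


vocab_size = 1000
# Log-probability of each of 50 target tokens under a uniform model
uniform_log_probs = [math.log(1 / vocab_size)] * 50
print(round(perplexity(uniform_log_probs), 2))
```

Any value well below the vocabulary size therefore means the model has learned something beyond guessing uniformly.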


<a name="no2"></a>
#### 2. Training on Pantip data with Pra apai manee tokenizer

In [67]:
trainer = L.Trainer(max_epochs=10, deterministic=True)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, pam_unigram_tokenizer)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size=train_batch_size, shuffle=True)

pantip_test_dataset = TextDataset(pantip_test_text, pam_unigram_tokenizer)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size=test_batch_size, shuffle=False)

pam_train_dataset = TextDataset(pam_train_text, pam_unigram_tokenizer)
pam_train_loader = DataLoader(pam_train_dataset, batch_size=train_batch_size, shuffle=True)

pam_test_dataset = TextDataset(pam_test_text, pam_unigram_tokenizer)
pam_test_loader = DataLoader(pam_test_dataset, batch_size=test_batch_size, shuffle=False)

trainer.fit(model, train_dataloaders=pantip_train_loader)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs




  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Total params
25.511    Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode
/Users/jirayuwat/anaconda3/envs/nlp/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.



Epoch 9: 100%|██████████| 51/51 [00:09<00:00,  5.20it/s, v_num=5]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 51/51 [00:10<00:00,  5.07it/s, v_num=5]


In [68]:
test_result = trainer.test(
    model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader, pam_test_loader], verbose=False
)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

/Users/jirayuwat/anaconda3/envs/nlp/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:476: Your `test_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/Users/jirayuwat/anaconda3/envs/nlp/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Testing DataLoader 3: 100%|██████████| 7/7 [00:00<00:00,  7.56it/s]  
Perplexity on Pantip train set is:	19.768709152369343
Perplexity on Pra apai manee train set is:	455.0351549376166
Perplexity on Pantip test set is:	50.90800060010585
Perplexity on Pra apai manee test set is:	377.87146572784667
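
The values printed above are the exponential of the mean test loss. Assuming that loss is the average token-level cross-entropy in nats (the default behaviour of `CrossEntropyLoss`), the relationship can be sketched as follows. The helper `perplexity_from_probs` and the probabilities are hypothetical, made up for illustration, and are not taken from the notebook.

```python
import math

def perplexity_from_probs(true_token_probs):
    """Perplexity = exp(mean negative log-likelihood of the true tokens)."""
    nll = [-math.log(p) for p in true_token_probs]
    mean_nll = sum(nll) / len(nll)
    return math.exp(mean_nll)

# A model that spreads its probability evenly over 4 candidates at every
# step has perplexity 4: it is "as confused as" a uniform 4-way choice.
uniform_ppl = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])

# A model that assigns high probability to the true tokens scores lower.
confident_ppl = perplexity_from_probs([0.9, 0.8, 0.95, 0.85])

print(uniform_ppl, confident_ppl)
```

Read this way, a perplexity of about 20 means the model is roughly as uncertain as a uniform choice among 20 tokens at each step.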


#### To answer

The perplexity numbers should indicate that:
1. Training the LM on Pantip with the Pra apai manee tokenizer (no. [2](#no2)) overfits to Pantip and generalizes poorly to the Pra apai manee dataset.
2. However, using the Pantip tokenizer (no. [1](#no1)) results in much better generalization.

Try to come up with some reasons for the results above. <br>
Hint:
1. Think about "general" vocabs and domain-specific vocabs.
2. What do you think happens to the model when the sequences of token ids become longer?
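
One way to probe the second hint is to measure how many tokens a tokenizer produces per character of text. The sketch below is only an illustration: `tokens_per_char` is a hypothetical helper, and the two toy encoders stand in for real tokenizers so that the example runs on its own.

```python
def tokens_per_char(texts, encode):
    """Average number of tokens produced per input character."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_tokens / total_chars

# Toy stand-ins for real tokenizers (assumptions for illustration only).
def char_level_encode(text):
    return list(text)      # one token per character: very long sequences

def whitespace_encode(text):
    return text.split()    # one token per word: much shorter sequences

sample = ["the cat sat", "on the mat"]
fine = tokens_per_char(sample, char_level_encode)
coarse = tokens_per_char(sample, whitespace_encode)
print(fine, coarse)
```

In the notebook, one would presumably pass `pam_unigram_tokenizer.encode` or `pantip_unigram_tokenizer.encode` as `encode`, assuming both are `SentencePieceProcessor` objects and the text is available as a list of strings. A higher ratio on out-of-domain text means the same sentence becomes a longer id sequence, so the LSTM has to carry dependencies over more steps.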

In [None]:
"""
The PAM tokenizer's vocabulary is less general and less modern than Pantip's.
It relies on longer, more specific subwords (e.g., s## is more general than se##).
As a result, the model's capacity is spread across a wider variety of tokens, which makes it harder to learn dependencies between them and to generalize.
This is a limitation of the PAM tokenizer.
"""


<a name="no3"></a>
#### 3. Training on Pra apai manee data with Pantip tokenizer


In [69]:
trainer = L.Trainer(max_epochs=10, deterministic=True)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, pantip_unigram_tokenizer)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size=train_batch_size, shuffle=True)

pantip_test_dataset = TextDataset(pantip_test_text, pantip_unigram_tokenizer)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size=test_batch_size, shuffle=False)

pam_train_dataset = TextDataset(pam_train_text, pantip_unigram_tokenizer)
pam_train_loader = DataLoader(pam_train_dataset, batch_size=train_batch_size, shuffle=True)

pam_test_dataset = TextDataset(pam_test_text, pantip_unigram_tokenizer)
pam_test_loader = DataLoader(pam_test_dataset, batch_size=test_batch_size, shuffle=False)

trainer.fit(model, train_dataloaders=pam_train_loader)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Total params
25.511    Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode
/Users/jirayuwat/anaconda3/envs/nlp/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=

Epoch 9: 100%|██████████| 66/66 [00:13<00:00,  5.06it/s, v_num=6]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 66/66 [00:13<00:00,  4.99it/s, v_num=6]


In [70]:
test_result = trainer.test(
    model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader, pam_test_loader], verbose=False
)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

/Users/jirayuwat/anaconda3/envs/nlp/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:476: Your `test_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/Users/jirayuwat/anaconda3/envs/nlp/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Testing DataLoader 3: 100%|██████████| 9/9 [00:01<00:00,  8.20it/s]  
Perplexity on Pantip train set is:	2904.773227256836
Perplexity on Pra apai manee train set is:	24.55698071112975
Perplexity on Pantip test set is:	1152.3588205405729
Perplexity on Pra apai manee test set is:	36.7126200093089


<a name="no4"></a>
#### 4. Training on Pra apai manee data with Pra apai manee tokenizer




In [71]:
trainer = L.Trainer(max_epochs=10, deterministic=True)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, pam_unigram_tokenizer)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size=train_batch_size, shuffle=True)

pantip_test_dataset = TextDataset(pantip_test_text, pam_unigram_tokenizer)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size=test_batch_size, shuffle=False)

pam_train_dataset = TextDataset(pam_train_text, pam_unigram_tokenizer)
pam_train_loader = DataLoader(pam_train_dataset, batch_size=train_batch_size, shuffle=True)

pam_test_dataset = TextDataset(pam_test_text, pam_unigram_tokenizer)
pam_test_loader = DataLoader(pam_test_dataset, batch_size=test_batch_size, shuffle=False)

trainer.fit(model, train_dataloaders=pam_train_loader)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Total params
25.511    Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode
/Users/jirayuwat/anaconda3/envs/nlp/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=

Epoch 9: 100%|██████████| 48/48 [00:09<00:00,  4.82it/s, v_num=7]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 48/48 [00:10<00:00,  4.74it/s, v_num=7]


In [72]:
test_result = trainer.test(
    model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader, pam_test_loader], verbose=False
)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

Testing DataLoader 3: 100%|██████████| 7/7 [00:00<00:00,  8.06it/s]  
Perplexity on Pantip train set is:	369.715941589192
Perplexity on Pra apai manee train set is:	77.35313006676817
Perplexity on Pantip test set is:	286.4953979845209
Perplexity on Pra apai manee test set is:	103.6682372873689


#### To answer

The perplexity numbers should indicate that:
1. Both LMs overfit on Pra apai manee data and perform very badly on Pantip data.
2. However, using the Pra apai manee tokenizer (no. [4](#no4)) results in better generalization than using the Pantip tokenizer (no. [3](#no3)).

Try to come up with some reasons for the results above. <br>
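
A caveat when comparing runs that use different tokenizers: perplexity here is per token, so a tokenizer that splits the same text into more pieces can report a lower number without the model being better. A rough way to put two runs on the same footing is to normalize the total loss by character count instead. The sketch below uses hypothetical numbers and a hypothetical helper, `per_char_perplexity`, and assumes the reported loss is the mean cross-entropy per token in nats.

```python
import math

def per_char_perplexity(mean_token_loss, n_tokens, n_chars):
    """Convert a mean per-token loss (nats) into a per-character perplexity."""
    total_nll = mean_token_loss * n_tokens   # total nats spent on the text
    return math.exp(total_nll / n_chars)

# Hypothetical example: the same 1000-character text under two tokenizers.
# Tokenizer A: 250 tokens, mean loss 4.0 -> per-token perplexity exp(4.0)
# Tokenizer B: 500 tokens, mean loss 2.0 -> per-token perplexity exp(2.0)
ppl_a = per_char_perplexity(4.0, n_tokens=250, n_chars=1000)
ppl_b = per_char_perplexity(2.0, n_tokens=500, n_chars=1000)

# Both runs spend 1000 nats in total, so they tie per character even
# though their per-token perplexities differ by a large factor.
print(ppl_a, ppl_b)
```

The per-token numbers reported in this notebook are still useful for comparing datasets under one tokenizer; the normalization only matters when the tokenizer itself changes between runs.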

In [73]:
"""

Training on PAM data with the PAM tokenizer performs better, possibly because:

The tokenizer's vocabulary is better aligned with the training data, unlike in scenario 3.
The PAM tokenizer was trained on PAM data, so it fits that data's linguistic patterns (e.g., poems, which differ from the conversational style of Pantip data).
This lets the model learn more effective representations for these tokens, so it generalizes slightly better to unseen domains like Pantip data.
"""