# Language as an HMM

We try to learn a hidden Markov model (HMM) description of English, using the TinyStories dataset as training text.
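To make "HMM description" concrete, here is a minimal sketch of the generative view: a hidden state evolves as a Markov chain and emits one token per step. The two-state model, the three-word vocabulary, and all probabilities below are hypothetical values chosen for illustration, not parameters learned from the dataset.

```python
import random

# Hypothetical toy model: 2 hidden states, 3-token vocabulary.
vocab = ["the", "girl", "smiled"]
pi = [0.6, 0.4]                          # pi[i]   = P(first state is i)
A = [[0.7, 0.3], [0.4, 0.6]]             # A[i][j] = P(next state j | state i)
B = [[0.8, 0.1, 0.1], [0.1, 0.5, 0.4]]   # B[i][k] = P(token k | state i)

def sample_hmm(length, rng):
    """Sample a hidden state path and the token ids it emits."""
    states, tokens = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        # Emit a token from the current state's emission distribution.
        tokens.append(rng.choices(range(len(vocab)), weights=B[state])[0])
        # Move to the next hidden state.
        state = rng.choices(range(len(A)), weights=A[state])[0]
    return states, tokens

rng = random.Random(0)
states, tokens = sample_hmm(8, rng)
print(states)
print([vocab[k] for k in tokens])
```

Learning an HMM from text inverts this process: only the tokens are observed, and `pi`, `A`, and `B` have to be estimated.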

In [1]:
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")

Generating train split: 2119719 examples

Generating validation split: 21990 examples

In [None]:
test_text = ds["train"][0]["text"]
print(test_text)


One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.


## Training tokenizers

In [13]:
from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast
import os

# Reuse the `ds` object loaded in the first cell; reloading is not needed.

vocab_size = 4096
tok_name = 'custom_tokenizer_{}'.format(vocab_size)

# Create directory if it doesn't exist
os.makedirs(os.path.join('tokenizers', tok_name), exist_ok=True)

def text_iterator():
    """Yield every story from all splits (train and validation)."""
    for split in ds.values():
        for sample_text in split["text"]:
            yield sample_text

tok = ByteLevelBPETokenizer()
tok.train_from_iterator(
    text_iterator(),
    vocab_size=vocab_size,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"]
)
# Save the full tokenizer (creates tokenizer.json)
tok.save(os.path.join('tokenizers', tok_name, "tokenizer.json"))

# Create directory for HuggingFace tokenizer
hf_tok_name = "HF_custom_tokenizer_{}".format(vocab_size)
os.makedirs(os.path.join('tokenizers', hf_tok_name), exist_ok=True)

# Load from the saved tokenizer.json
hf_tok = PreTrainedTokenizerFast(
    tokenizer_file=os.path.join('tokenizers', tok_name, "tokenizer.json"),
    unk_token="<unk>",
    pad_token="<pad>",
    bos_token="<s>",
    eos_token="</s>",
)
hf_tok.save_pretrained(os.path.join('tokenizers', hf_tok_name))

('tokenizers/HF_custom_tokenizer_4096/tokenizer_config.json',
 'tokenizers/HF_custom_tokenizer_4096/special_tokens_map.json',
 'tokenizers/HF_custom_tokenizer_4096/tokenizer.json')
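For intuition about what the training call above does, the following is a toy sketch of the BPE merge loop. It works on characters rather than bytes and is not the `tokenizers` library's implementation; the word counts and the helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical word counts; each word starts as a tuple of single characters.
words = {tuple("share"): 4, tuple("shared"): 2, tuple("shirt"): 3}
merges = []
for _ in range(3):  # a real tokenizer keeps merging until vocab_size is reached
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

Each merge adds one symbol to the vocabulary, so `vocab_size` effectively bounds the number of merges. For the HMM, the resulting token ids serve as the discrete observation alphabet.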

## Application of the Baum-Welch Algorithm
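As a reference point, here is a minimal pure-Python sketch of Baum-Welch (EM) for a discrete HMM, using a scaled forward-backward pass. It assumes a single short observation sequence of token ids. The toy sequence, the state count, and the function names are illustrative assumptions, not the notebook's actual setup. Training on TinyStories would need vectorized code and statistics accumulated over many sequences.

```python
import math
import random

def normalize(row):
    """Scale non-negative numbers so they sum to 1."""
    total = sum(row)
    return [x / total for x in row]

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns alpha, beta, scales, log-likelihood."""
    T, N = len(obs), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(N):
            prior = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(N))
            alpha[t][j] = prior * B[j][obs[t]]
        scale[t] = sum(alpha[t])  # c_t = P(o_t | o_1..o_{t-1})
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N)
            ) / scale[t + 1]
    return alpha, beta, scale, sum(math.log(c) for c in scale)

def baum_welch_step(obs, pi, A, B):
    """One EM update; also returns the log-likelihood under the input parameters."""
    T, N, V = len(obs), len(pi), len(B[0])
    alpha, beta, scale, loglik = forward_backward(obs, pi, A, B)
    # gamma[t][i] = P(state_t = i | obs)
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
    new_A = [[0.0] * N for _ in range(N)]
    for t in range(T - 1):
        for i in range(N):
            for j in range(N):
                # xi[t][i][j] = P(state_t = i, state_{t+1} = j | obs)
                new_A[i][j] += (
                    alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
                )
    new_B = [[0.0] * V for _ in range(N)]
    for t in range(T):
        for i in range(N):
            new_B[i][obs[t]] += gamma[t][i]
    return gamma[0][:], [normalize(r) for r in new_A], [normalize(r) for r in new_B], loglik

# Hypothetical toy data: 2 hidden states, 3 token ids, random positive initialization.
rng = random.Random(0)
N, V = 2, 3
obs = [0, 1, 0, 1, 2, 2, 0, 1, 0, 1, 2, 2]
pi = normalize([rng.random() + 0.1 for _ in range(N)])
A = [normalize([rng.random() + 0.1 for _ in range(N)]) for _ in range(N)]
B = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(N)]

logliks = []
for _ in range(10):
    pi, A, B, ll = baum_welch_step(obs, pi, A, B)
    logliks.append(ll)
print(logliks)
```

The scaling factors keep the forward and backward variables from underflowing on long sequences. Their logs sum to the log-likelihood, which EM should never decrease from one iteration to the next.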