Preprocess and tokenize all datasets, then save the token IDs to disk.

In [6]:
import os

from datasets import load_dataset, load_from_disk
from datasets.formatting.formatting import LazyBatch
from huggingface_hub import login

from special_tokens import special_tokens

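hf_token = os.getenv("HF_TOKEN")
login(hf_token)  # prompts interactively if HF_TOKEN is unset
batch_size = 10_000  # rows passed to each tokenize_* call
processes = 4  # worker processes used by Dataset.map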

In [2]:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

In [3]:
def test_tokens(loaded):
    """Decode and print the first row of a tokenized dataset as a sanity check."""
    token_ids = next(iter(loaded))["tokens"]
    text = tokenizer.decode(token_ids)
    print(text)

Dataset: No Robots

In [4]:
ds_test = load_dataset("HuggingFaceH4/no_robots", split="test").select_columns(["messages"])
ds_train = load_dataset("HuggingFaceH4/no_robots", split="train").select_columns(["messages"])

In [5]:
from chat_template import chat_template


def tokenize_robots(batch: LazyBatch):
    """Render each conversation with the chat template, then encode it."""
    results = [
        tokenizer.encode(chat_template(row)).ids
        for row in batch["messages"]
    ]
    return {"tokens": results}


ds_test.map(
    tokenize_robots,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/robots_test")

ds_train.map(
    tokenize_robots,
    batched=True,
    batch_size=batch_size,
    num_proc=processes,
).select_columns("tokens").save_to_disk("tokenized_data/robots_train")

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/9500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/9500 [00:00<?, ? examples/s]
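
With `batched=True`, the mapped function receives a batch whose columns are lists and must return a dict of lists with one entry per input row. The sketch below mimics that contract in plain Python. `toy_encode` and `map_batched` are hypothetical stand-ins for the real tokenizer and for `Dataset.map`; they are not part of either library and only illustrate the shape of the data.

```python
def toy_encode(text: str) -> list[int]:
    """Hypothetical stand-in for tokenizer.encode(text).ids: one id per word."""
    return [len(word) for word in text.split()]


def tokenize_batch(batch: dict[str, list[str]]) -> dict[str, list[list[int]]]:
    # Same shape as the notebook's tokenize_* functions:
    # one list of token ids per input row.
    return {"tokens": [toy_encode(row) for row in batch["text"]]}


def map_batched(rows: list[str], fn, batch_size: int) -> list[list[int]]:
    """Hypothetical, simplified imitation of Dataset.map(batched=True)."""
    out = []
    for start in range(0, len(rows), batch_size):
        # Each call sees a slice of the column, not a single row.
        batch = {"text": rows[start:start + batch_size]}
        out.extend(fn(batch)["tokens"])
    return out


rows = ["a bb ccc", "dddd", "ee f"]
tokens = map_batched(rows, tokenize_batch, batch_size=2)
print(tokens)
```

The output has one token list per input row regardless of how the rows were grouped into batches, which is why `batch_size` can be tuned freely for throughput.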

In [11]:
test_tokens(load_from_disk("tokenized_data/robots_test"))

aster is a chatbot who answers questions with rhymes.
where did chocolate originate?
chocolate is 4000 years old/mexico is where it was first sold
where was milk chocolate invented?
switzerland was the first to add milk/to make their chocolate smooth as silk
what are some good desserts that use chocolate?
pie, tart, cookies, and cake/chocolate is great to bake



Dataset: Wikipedia summary

In [12]:
splits = load_dataset("jordiclive/wikipedia-summary-dataset", split="train").train_test_split(
    test_size=0.1,
    shuffle=True,
    seed=42
)
ds_test = splits["test"].select_columns(["summary"])
ds_train = splits["train"].select_columns(["summary"])

Repo card metadata block was not found. Setting CardData to empty.
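
The 10% split can be sanity-checked against the row counts that the progress bars in this section report (775001 test rows and 6975006 train rows). The arithmetic below assumes the test size is rounded up and the train split receives the remainder; that rounding rule is an assumption about the library, not something the notebook states.

```python
import math

n_test_reported = 775_001     # from the "Map" progress bar of the test split
n_train_reported = 6_975_006  # from the "Map" progress bar of the train split
n_total = n_test_reported + n_train_reported

# Assumption: test rows = ceil(test_size * total), train rows = the rest.
n_test = math.ceil(0.1 * n_total)
n_train = n_total - n_test

print(n_total, n_test, n_train)
```

Under that assumption the computed sizes match the reported ones, so no rows are lost or duplicated by the split.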


In [13]:
def tokenize_wiki(batch: LazyBatch):
    """Append a newline and the end-of-text marker to each summary, then encode."""
    eot = special_tokens["end_of_text"]
    results = [
        tokenizer.encode(row + "\n" + eot).ids
        for row in batch["summary"]
    ]
    return {"tokens": results}


ds_test.map(
    tokenize_wiki,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/wiki_test")

ds_train.map(
    tokenize_wiki,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/wiki_train")

Map:   0%|          | 0/775001 [00:00<?, ? examples/s]

Saving the dataset (0/2 shards):   0%|          | 0/775001 [00:00<?, ? examples/s]

Map:   0%|          | 0/6975006 [00:00<?, ? examples/s]

Saving the dataset (0/13 shards):   0%|          | 0/6975006 [00:00<?, ? examples/s]
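
Each row here ends with the end-of-text marker. One common reason for doing this (an assumption, since the notebook does not show how the token ids are consumed) is that rows can later be concatenated into a single stream while document boundaries stay recoverable. In the sketch below, `EOT_ID`, `pack`, `unpack`, and the toy ids are all hypothetical.

```python
EOT_ID = 0  # hypothetical id; the real one comes from the trained tokenizer


def pack(rows: list[list[int]]) -> list[int]:
    """Concatenate per-document token lists into a single stream."""
    return [tok for row in rows for tok in row]


def unpack(stream: list[int], eot_id: int = EOT_ID) -> list[list[int]]:
    """Split a stream back into documents, each keeping its trailing EOT."""
    docs, current = [], []
    for tok in stream:
        current.append(tok)
        if tok == eot_id:
            docs.append(current)
            current = []
    if current:  # trailing tokens without an EOT form an unterminated document
        docs.append(current)
    return docs


rows = [[5, 6, EOT_ID], [7, EOT_ID], [8, 9, 10, EOT_ID]]
stream = pack(rows)
print(stream)
print(unpack(stream) == rows)
```

Without the marker, the concatenated stream would carry no information about where one summary ends and the next begins.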

In [14]:
test_tokens(load_from_disk("tokenized_data/wiki_test"))

category:populated places in mcpherson county, nebraska
mcpherson



Dataset: TinyStories

In [4]:
ds_test = load_dataset("roneneldan/TinyStories", split="validation").select_columns(["text"])
ds_train = load_dataset("roneneldan/TinyStories", split="train").select_columns(["text"])

In [7]:
def tokenize_stories(batch: LazyBatch):
    """Append a newline and the end-of-text marker to each story, then encode."""
    eot = special_tokens["end_of_text"]
    results = [
        tokenizer.encode(row + "\n" + eot).ids
        for row in batch["text"]
    ]
    return {"tokens": results}


ds_test.map(
    tokenize_stories,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/stories_test")

ds_train.map(
    tokenize_stories,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/stories_train")

Saving the dataset (0/1 shards):   0%|          | 0/21990 [00:00<?, ? examples/s]

Saving the dataset (0/8 shards):   0%|          | 0/2119719 [00:00<?, ? examples/s]

In [8]:
test_tokens(load_from_disk("tokenized_data/stories_test"))

spot. spot saw the shiny car and said, "wow, kitty, your car is so bright and clean!" kitty smiled and replied, "thank you, spot. i polish it every day."

after playing with the car, kitty and spot felt thirsty. they found a small pond with clear water. they drank the water and felt very happy. they played together all day and became best friends.



Dataset: tiny-textbooks

In [13]:
ds_test = load_dataset("nampdn-ai/tiny-textbooks", split="test").select_columns(["text"])
ds_train = load_dataset("nampdn-ai/tiny-textbooks", split="train").select_columns(["text"])

In [16]:
def tokenize_textbooks(batch: LazyBatch):
    """Append a newline and the end-of-text marker to each text, then encode."""
    eot = special_tokens["end_of_text"]
    results = [
        tokenizer.encode(row + "\n" + eot).ids
        for row in batch["text"]
    ]
    return {"tokens": results}


ds_test.map(
    tokenize_textbooks,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/textbooks_test")

ds_train.map(
    tokenize_textbooks,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/textbooks_train")

Saving the dataset (0/1 shards):   0%|          | 0/21000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/399000 [00:00<?, ? examples/s]

Saving the dataset (0/2 shards):   0%|          | 0/399000 [00:00<?, ? examples/s]

In [17]:
test_tokens(load_from_disk("tokenized_data/textbooks_test"))

watch this video and more exclusive full episodes of gma shows on. carmen (maria isabel lopez) meets jess (emilio garcia) in a ktv bar. they formed an instant bond which led to love, and so they decided to live together. because of their love, carmen had high hopes that jess will treat kat-kat (mara lopez), her daughter, like his own. jess did everything to win kat-kat's approval which led to the unthinkable - a sin that might change their lives forever.. watch full episodes of ‘karelasyon’ on gmanetwork.com/fullepisodes and youtube.com/gmanetwork. hosted by carla abellana, this episode stars maria isabel lopez, mara lopez, emilio garcia

