Preprocess and tokenize all datasets, then save the token IDs to disk.

In [10]:
import os

from datasets import load_dataset, load_from_disk
from datasets.formatting.formatting import LazyBatch
from huggingface_hub import login

from special_tokens import special_tokens

hf_token = os.getenv("HF_TOKEN")
login(hf_token)
batch_size = 10_000
processes = 8
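
When `HF_TOKEN` is unset, `os.getenv` returns `None`, and `login(None)` typically falls back to an interactive prompt, which is inconvenient in unattended runs. A minimal sketch of a fail-fast check follows. The helper `require_env` and the variable `EXAMPLE_TOKEN` are hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a placeholder variable (hypothetical name and value).
os.environ["EXAMPLE_TOKEN"] = "dummy-value"
example_token = require_env("EXAMPLE_TOKEN")
```

In this notebook the same helper could be used as `hf_token = require_env("HF_TOKEN")` before calling `login(hf_token)`.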

In [2]:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

In [3]:
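def test_tokens(loaded):
    """Decode and print the first example of a saved dataset as a sanity check."""
    token_ids = next(iter(loaded))["tokens"]
    # decode() skips special tokens by default, so markers registered as
    # special tokens (for example end-of-text) are not shown in the output.
    text = tokenizer.decode(token_ids)
    print(text)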

Dataset: no robots

In [4]:
ds_test = load_dataset("HuggingFaceH4/no_robots", split="test").select_columns(["messages"])
ds_train = load_dataset("HuggingFaceH4/no_robots", split="train").select_columns(["messages"])

In [5]:
from chat_template import chat_template


def tokenize_robots(batch: LazyBatch):
    """Apply the chat template to each conversation and encode it to token ids."""
    results = [
        tokenizer.encode(chat_template(row)).ids
        for row in batch["messages"]
    ]
    return {"tokens": results}


ds_test.map(
    tokenize_robots,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/robots_test")

ds_train.map(
    tokenize_robots,
    batched=True,
    batch_size=batch_size,
    num_proc=processes,
).select_columns("tokens").save_to_disk("tokenized_data/robots_train")

Map (num_proc=4):   0%|          | 0/500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/500 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/9500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/9500 [00:00<?, ? examples/s]

In [6]:
test_tokens(load_from_disk("tokenized_data/robots_test"))

Aster is a chatbot who answers questions with rhymes.
Where did chocolate originate?
Chocolate is 4000 years old/Mexico is where it was first sold
Where was milk chocolate invented?
Switzerland was the first to add milk/To make their chocolate smooth as silk
What are some good desserts that use chocolate?
Pie, tart, cookies, and cake/Chocolate is great to bake



Dataset: wikipedia summary

In [7]:
splits = load_dataset("jordiclive/wikipedia-summary-dataset", split="train").train_test_split(
    test_size=0.1,
    shuffle=True,
    seed=42
)
ds_test = splits["test"].select_columns(["summary"])
ds_train = splits["train"].select_columns(["summary"])

Repo card metadata block was not found. Setting CardData to empty.


In [11]:
def tokenize_wiki(batch: LazyBatch):
    """Encode each summary followed by a newline and the end-of-text marker."""
    eot = special_tokens["end_of_text"]
    results = [
        tokenizer.encode(row + "\n" + eot).ids
        for row in batch["summary"]
    ]
    return {"tokens": results}


ds_test.map(
    tokenize_wiki,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/wiki_test")

ds_train.map(
    tokenize_wiki,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/wiki_train")

Map (num_proc=8):   0%|          | 0/775001 [00:00<?, ? examples/s]

Saving the dataset (0/2 shards):   0%|          | 0/775001 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/6975006 [00:00<?, ? examples/s]

Saving the dataset (0/13 shards):   0%|          | 0/6975006 [00:00<?, ? examples/s]

In [12]:
test_tokens(load_from_disk("tokenized_data/wiki_test"))

Category:Populated places in McPherson County, Nebraska
McPherson



Dataset: tiny stories

In [13]:
ds_test = load_dataset("roneneldan/TinyStories", split="validation").select_columns(["text"])
ds_train = load_dataset("roneneldan/TinyStories", split="train").select_columns(["text"])

In [14]:
def tokenize_stories(batch: LazyBatch):
    """Encode each story followed by a newline and the end-of-text marker."""
    eot = special_tokens["end_of_text"]
    results = [
        tokenizer.encode(row + "\n" + eot).ids
        for row in batch["text"]
    ]
    return {"tokens": results}


ds_test.map(
    tokenize_stories,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/stories_test")

ds_train.map(
    tokenize_stories,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/stories_train")

Map (num_proc=8):   0%|          | 0/21990 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/21990 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/2119719 [00:00<?, ? examples/s]

Saving the dataset (0/8 shards):   0%|          | 0/2119719 [00:00<?, ? examples/s]

In [15]:
test_tokens(load_from_disk("tokenized_data/stories_test"))

Spot. Spot saw the shiny car and said, "Wow, Kitty, your car is so bright and clean!" Kitty smiled and replied, "Thank you, Spot. I polish it every day."

After playing with the car, Kitty and Spot felt thirsty. They found a small pond with clear water. They drank the water and felt very happy. They played together all day and became best friends.



Dataset: tiny textbooks

In [16]:
ds_test = load_dataset("nampdn-ai/tiny-textbooks", split="test").select_columns(["textbook"])
ds_train = load_dataset("nampdn-ai/tiny-textbooks", split="train").select_columns(["textbook"])

In [17]:
def tokenize_textbooks(batch: LazyBatch):
    """Encode each textbook followed by a newline and the end-of-text marker."""
    eot = special_tokens["end_of_text"]
    results = [
        tokenizer.encode(row + "\n" + eot).ids
        for row in batch["textbook"]
    ]
    return {"tokens": results}


ds_test.map(
    tokenize_textbooks,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/textbooks_test")

ds_train.map(
    tokenize_textbooks,
    batched=True,
    batch_size=batch_size,
    num_proc=processes
).select_columns("tokens").save_to_disk("tokenized_data/textbooks_train")

Map (num_proc=8):   0%|          | 0/21000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/21000 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/399000 [00:00<?, ? examples/s]

Saving the dataset (0/5 shards):   0%|          | 0/399000 [00:00<?, ? examples/s]
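
The `tokenize_wiki`, `tokenize_stories`, and `tokenize_textbooks` functions differ only in the column they read. A possible way to remove the repetition is a small factory that builds the batch function from a column name. This is a sketch, not the notebook's method: `make_tokenize_fn` and `toy_encode` are hypothetical names, and the toy encoder stands in for the real tokenizer so the example runs on its own.

```python
from typing import Callable, Dict, List


def make_tokenize_fn(
    column: str,
    encode: Callable[[str], List[int]],
    eot: str,
) -> Callable[[Dict[str, List[str]]], Dict[str, List[List[int]]]]:
    """Build a batched map function that encodes `column` with an end-of-text suffix."""

    def tokenize(batch: Dict[str, List[str]]) -> Dict[str, List[List[int]]]:
        # Same pattern as the notebook: text, newline, end-of-text marker.
        return {"tokens": [encode(row + "\n" + eot) for row in batch[column]]}

    return tokenize


def toy_encode(text: str) -> List[int]:
    """Stand-in encoder (assumption): maps each character to its code point."""
    return [ord(ch) for ch in text]


tokenize_demo = make_tokenize_fn("text", toy_encode, "<eot>")
demo_output = tokenize_demo({"text": ["hi"]})
```

With the real tokenizer, the call would look like `make_tokenize_fn("summary", lambda t: tokenizer.encode(t).ids, special_tokens["end_of_text"])`, and the returned function can be passed to `map` in place of the per-dataset functions.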

In [18]:
test_tokens(load_from_disk("tokenized_data/textbooks_test"))

Lesson: How to Analyze a Drama Series

Introduction:
In this lesson, we will learn how to analyze a drama series by breaking down its plot, characters, and themes. We will use "Karelasyon" as our example to demonstrate how to apply these analytical tools to a specific work.

Section 1: Plot Analysis

Plot refers to the events and conflicts that make up a story. In "Karelasyon," the plot revolves around Carmen and Jess's relationship and how it affects Carmen's daughter, Kat-kat. 

1. What is the main conflict of the story?
The main conflict is Carmen's hope that Jess will treat Kat-kat like his own daughter.

2. How does the story develop this conflict?
The story develops this conflict through Carmen and Jess's interactions with Kat-kat, showing how their relationship affects her.

3. What is the resolution of the story?
The resolution is not explicitly stated, but it can be inferred that the story ends with Carmen and Jess continuing to live together and care for Kat-kat.

Section 2: 