# Download and GPT-2-tokenize the OpenWebText dataset into 256-token contexts

In [21]:
from datasets import load_dataset, load_from_disk, DatasetDict

ds = load_dataset("openwebtext")
ds

Downloading and preparing dataset openwebtext/plain_text to /home/kdbanman/.cache/huggingface/datasets/openwebtext/plain_text/1.0.0/6f68e85c16ccc770c0dd489f4008852ea9633604995addd0cd76e293aed9e521...



Dataset openwebtext downloaded and prepared to /home/kdbanman/.cache/huggingface/datasets/openwebtext/plain_text/1.0.0/6f68e85c16ccc770c0dd489f4008852ea9633604995addd0cd76e293aed9e521. Subsequent calls will reuse this data.



DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 8013769
    })
})

In [22]:
# Hold out 10% of the documents as a test split; the fixed seed keeps the split reproducible.
raw_datasets = ds["train"].train_test_split(test_size=0.1, seed=42069, shuffle=True)

In [23]:
# Preview the first 200 characters of the first document in each split.
for split in ("train", "test"):
    example = raw_datasets[split][0]
    for key in example:
        print(f"{key}: {example[key][:200]}")

text: Ennio Morricone's Soundtrack to John Carpenter's 'The Thing' Set for Vinyl Reissue

Published Dec 12, 2016

John Carpenter's The Thing. Coming soon from Waxwork Records. pic.twitter.com/MVhwVwMhsK — W
text: MI.Julz.Asus interviewed by GosuGamers

MI.ASUS: john, wootz, julz, harhar, owa

With one million dollar "The International" less than 10 days away, GosuGamers caught up with one of the participating 


In [25]:
from transformers import AutoTokenizer

context_length = 256
tokenizer = AutoTokenizer.from_pretrained("gpt2")

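# Tokenize three sample documents. With truncation plus return_overflowing_tokens,
# each document is split into consecutive chunks of at most context_length tokens
# instead of being cut off after the first chunk.
outputs = tokenizer(
    raw_datasets["train"][:3]["text"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,  # keep the chunks past the first one
    return_length=True,  # report the token count of each chunk
)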

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {outputs['length']}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Input IDs length: 13
Input chunk lengths: [256, 142, 256, 208, 256, 256, 256, 256, 256, 256, 256, 256, 232]
Chunk mapping: [0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
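
The chunk lengths above follow from how truncation with `return_overflowing_tokens=True` splits each document: consecutive windows of at most `context_length` tokens, ending with one shorter window. Below is a minimal pure-Python sketch of that splitting, assuming no overlap between windows (the default stride of 0). `split_into_chunks` is a hypothetical helper, not part of `transformers`, and the per-document token counts (398, 464, 2280) are inferred by summing the chunk lengths printed above.

```python
def split_into_chunks(token_ids, context_length):
    """Split a list of token ids into consecutive windows of at most context_length."""
    return [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]


context_length = 256
# Token counts per document, inferred from the printed chunk lengths (assumption).
doc_token_counts = [398, 464, 2280]

lengths, mapping = [], []
for doc_index, n_tokens in enumerate(doc_token_counts):
    fake_ids = list(range(n_tokens))  # placeholder ids; real ids come from the tokenizer
    for chunk in split_into_chunks(fake_ids, context_length):
        lengths.append(len(chunk))
        mapping.append(doc_index)

print(lengths)
print(mapping)
```

The two printed lists should reproduce the chunk lengths and the chunk mapping shown in the cell output. The last chunk of each document is usually shorter than `context_length`.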


In [27]:
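def tokenize(element):
    """Split a batch of documents into full context_length-token chunks."""
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # Keep only chunks that fill the whole context; the shorter tail of each
    # document (and any document under context_length tokens) is dropped.
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}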


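# The number of chunks differs from the number of input documents, so the
# original "text" column must be removed for the batched map to succeed.
tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)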
tokenized_datasets

Map:   0%|          | 0/7212392 [00:00<?, ? examples/s]

Map:   0%|          | 0/801377 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 28115618
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 3126644
    })
})
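
Because `tokenize` keeps only chunks of exactly `context_length` tokens, every row holds 256 token ids, so the token totals follow directly from the row counts. Here is a short arithmetic sketch using the `num_rows` values printed above; the variable names are illustrative only.

```python
context_length = 256
# Row counts copied from the DatasetDict printed above.
num_rows = {"train": 28115618, "test": 3126644}

# Every row is a full context, so tokens = rows * context_length.
tokens = {split: rows * context_length for split, rows in num_rows.items()}
total = sum(tokens.values())

for split, count in tokens.items():
    print(f"{split}: {count:,} tokens ({count / total:.1%} of total)")
print(f"total: {total:,} tokens")
```

With these row counts the total comes to just under 8 billion tokens across both splits.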

In [28]:
tokenized_datasets.save_to_disk('tokenized-openwebtext')

Saving the dataset (0/58 shards):   0%|          | 0/28115618 [00:00<?, ? examples/s]

Saving the dataset (0/7 shards):   0%|          | 0/3126644 [00:00<?, ? examples/s]
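
To reuse the tokenized data in a later session without repeating the map step, the saved directory can be read back with `load_from_disk` (already imported in the first cell). This is a minimal sketch, assuming the `tokenized-openwebtext` directory written above exists in the working directory. `reload_tokenized` is a hypothetical helper name.

```python
from pathlib import Path

SAVE_DIR = Path("tokenized-openwebtext")  # directory written by save_to_disk above


def reload_tokenized(path=SAVE_DIR):
    """Load the saved DatasetDict back from disk."""
    # Imported lazily so this snippet can be read on a machine without the library.
    from datasets import load_from_disk

    return load_from_disk(str(path))


# Only attempt the reload when the saved directory is actually present.
if SAVE_DIR.is_dir():
    reloaded = reload_tokenized()
    print(reloaded)
```

If the directory is present, the printed `DatasetDict` should show the same `train` and `test` row counts as before saving.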