## 1. Data

This notebook explains the data we feed into the model. The complete preprocessing code is contained in `src/repgpt/process.py`.

### 1.1 Accessing the Data

The original dataset (WebText) used for GPT-2 is not public. However, there is an approximately similar data source, OpenWebText, created by [Aaron Gokaslan](https://twitter.com/SkyLi0n).

We can download the [openwebtext dataset](https://huggingface.co/datasets/Skylion007/openwebtext) with Hugging Face's `datasets` library. Depending on your internet connection, it may take several minutes to download (on my machine, it takes ~8 minutes).

After the initial download, the data will be stored in the local cache:

`~/.cache/huggingface/datasets/downloads/`

In [1]:
from datasets import load_dataset

dataset = load_dataset("openwebtext", num_proc=8)


The data contains about 8M "rows", where each row is a document scraped from a URL shared on Reddit.

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 8013769
    })
})

We create a hold-out set for evaluation using the built-in Hugging Face `train_test_split` method.

In [3]:
split_dataset = dataset["train"].train_test_split(test_size=0.0005, seed=2357, shuffle=True)
split_dataset["val"] = split_dataset.pop("test")

In [4]:
split_dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 8009762
    })
    val: Dataset({
        features: ['text'],
        num_rows: 4007
    })
})
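
The split sizes above can be checked by hand. The sketch below assumes the hold-out size is the requested fraction of the rows, rounded up, which is consistent with the counts printed above; it is not taken from the library's source.

```python
import math

n_rows = 8_013_769      # rows in the original train split
test_fraction = 0.0005  # test_size passed to train_test_split

# Assumption: the hold-out size is the fraction of rows, rounded up.
n_val = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_val

print(f"{n_train=:,} {n_val=:,}")
```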

Here is an example row from the dataset (first 1k characters).

In [5]:
print(split_dataset['train'][11]['text'][:1000])

This is NOT an April Fools joke!

Quite frankly I’m shocked that I’ve made it to day 60 of this project and only now selected a Dark Star. That seems almost criminal.

For many Dark Star is the quintessential Grateful Dead song. Well, song may be a somewhat liberal term here. To say it’s an anything-goes jam held together by a few key riffs might be more accurate, but regardless of the label you use to describe it one this is clear: Dark Star is awesome.

The only real issue I have with this recording is that Keith is virtually absent. I assume he was there that night, but I listened to this on good headphones and could detect nary a hint of piano, keys, anything. This is especially disappointing because we’ve already seen how well Keith integrated with the band (see yesterday’s Bertha post).

This one starts off in a very laid back manner. No noodling here, but the band is certainly not in a rush. Right around the 2 minute mark Jerry seems to be running through scales up and down the 

### 1.2 Tokenizing the data

We can use OpenAI's GPT-2 tokenizer (`n_vocab=50257`) to produce numeric encodings of the text.

In [6]:
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(f"There are {enc.n_vocab:,} tokens in the vocabulary")

enc.encode_ordinary("NOT an April Fools joke!")

There are 50,257 tokens in the vocabulary


[11929, 281, 3035, 376, 10141, 9707, 0]

In [7]:
enc.decode(enc.encode_ordinary("NOT an April Fools joke!"))

'NOT an April Fools joke!'

We can manually inspect what some of the tokens are:

In [8]:
print(f"{'| Index':<5} | {'Token': <15} |")
print("-"*27)
for i in [0, 1, 2, 3, 100, 1_000, 1_001, 1_003, 1_004, 2_000, 2_001, 10_000, 20_000, enc.n_vocab - 3, enc.n_vocab - 2, enc.n_vocab - 1]:
    print(f"| {i:>5} | {enc.decode([i]):<15} |")

| Index | Token           |
---------------------------
|     0 | !               |
|     1 | "               |
|     2 | #               |
|     3 | $               |
|   100 | �               |
|  1000 | ale             |
|  1001 |  Se             |
|  1003 | //              |
|  1004 |  Le             |
|  2000 |  mind           |
|  2001 | aff             |
| 10000 |  pocket         |
| 20000 |  Junior         |
| 50254 |  informants     |
| 50255 |  gazed          |
| 50256 | <|endoftext|>   |
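
The `�` shown for index 100 is not garbled output. GPT-2's tokenizer operates on bytes, so a single token can hold only a fragment of a multi-byte UTF-8 character, and decoding that fragment on its own yields the Unicode replacement character. The sketch below reproduces the effect with plain Python bytes. It does not use tiktoken, and the choice of `é` is only an illustration, not the byte that token 100 maps to.

```python
# "é" is two bytes in UTF-8; neither byte is a valid character on its own.
raw = "é".encode("utf-8")
print(raw)

# Decoding a fragment with errors="replace" gives U+FFFD, rendered as "�".
fragment = raw[:1].decode("utf-8", errors="replace")
whole = raw.decode("utf-8", errors="replace")
print(repr(fragment), repr(whole))
```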


### 1.3 Structuring data for training

The first step is to create a giant array, with each document separated by an `<|endoftext|>` token.

In [9]:
import torch

def process_document(example, enc) -> list[int]:
    # Encode the text, then append the end-of-text token as a document separator.
    ids = enc.encode_ordinary(example["text"])
    ids.append(enc.eot_token)
    return ids

train_tokens = []
for i in range(3):
    train_tokens += process_document(split_dataset['train'][i], enc)

train_tokens = torch.tensor(train_tokens)

print(train_tokens[:10])
print(len(train_tokens))

tensor([ 8585,   262,  1772,    25, 25334,  8120,    17, 43959, 44947,   318])
5627
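
Because the documents are concatenated into one flat sequence, the `<|endoftext|>` token is the only marker of where one document ends and the next begins. The toy sketch below uses made-up token ids (only `EOT = 50256` comes from the vocabulary table above) to show how the boundaries can be recovered. `split_documents` is a hypothetical helper, not part of the repository.

```python
EOT = 50256  # id of <|endoftext|> in the GPT-2 vocabulary

def split_documents(tokens: list[int], eot: int = EOT) -> list[list[int]]:
    """Split a flat token stream back into documents at each EOT token."""
    docs, current = [], []
    for tok in tokens:
        if tok == eot:
            docs.append(current)
            current = []
        else:
            current.append(tok)
    if current:  # trailing tokens with no closing EOT
        docs.append(current)
    return docs

# Three toy "documents" with made-up token ids, each terminated by EOT.
toy_docs = [[5, 6, 7], [8, 9], [10]]
stream = [tok for doc in toy_docs for tok in doc + [EOT]]

print(stream)
print(split_documents(stream))
```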


Then we sample random start positions from this array and build our X and y tensors, each of shape `(batch_size, context_size)`. y holds the same windows as X advanced by one position, so each element of y is the token that follows the corresponding element of X.

In [10]:
def get_batch(data, batch_size, context_size):
    indices = torch.randint(low=0, high=data.shape[0] - context_size, size=(batch_size,))
    X = torch.stack([data[idx:idx + context_size] for idx in indices])
    y = torch.stack([data[idx + 1:idx + context_size + 1] for idx in indices])
    return X, y

X, y = get_batch(train_tokens, batch_size=2, context_size=4)
print(f"{X=}")
print(f"{y=}")

X=tensor([[ 257, 4048, 8066, 1560],
        [ 284, 2652, 1200, 1203]])
y=tensor([[4048, 8066, 1560,  345],
        [2652, 1200, 1203,  290]])
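
The relationship between `X` and `y` holds for every sampled row, not just the ones printed above. The sketch below re-implements the sampling with plain Python lists so that it runs without PyTorch. `get_batch_lists` is a hypothetical stand-in for `get_batch`, and the data is a toy sequence. It checks that each target row is its input row advanced by one position, and that no window runs past the end of the data.

```python
import random

def get_batch_lists(data, batch_size, context_size, rng):
    """List-based stand-in for get_batch: sample windows and their next-token targets."""
    # randrange excludes the upper bound, like the `high` argument of torch.randint.
    starts = [rng.randrange(0, len(data) - context_size) for _ in range(batch_size)]
    X = [data[s : s + context_size] for s in starts]
    y = [data[s + 1 : s + context_size + 1] for s in starts]
    return X, y

data = list(range(100, 120))  # toy "token ids"
X, y = get_batch_lists(data, batch_size=4, context_size=5, rng=random.Random(0))

for x_row, y_row in zip(X, y):
    # Dropping the first input token and the last target token leaves the same sequence.
    assert x_row[1:] == y_row[:-1]
    # Every row is full length, so no window ran off the end of the data.
    assert len(x_row) == len(y_row) == 5

print(X[0], y[0])
```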


* Note the shift between the two tensors, e.g., `X[0, 2] == y[0, 1]`.
* To simplify training, we tokenize all of the data and save it to disk before training starts.
* The preprocessing script is here: `src/repgpt/preprocess.py`.
* Preview of what's to come: later on, we will perform an embedding lookup, which adds a third dimension to the data: `(batch_size, context_size, n_embd)`.
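
One way to store pre-tokenized data on disk is as a flat binary file of 16-bit unsigned integers, which is enough because every GPT-2 token id is below 65,536. The sketch below uses only the standard library, a temporary file, and a handful of token ids. The file name and layout are assumptions for illustration, not a description of what the preprocessing script does.

```python
import os
import tempfile
from array import array

N_VOCAB = 50_257  # GPT-2 vocabulary size, from the tokenizer section
assert N_VOCAB <= 2**16  # every id fits in an unsigned 16-bit integer

tokens = [8585, 262, 1772, 25, 50256]  # a few ids, ending with <|endoftext|>

ids = array("H", tokens)  # "H" = unsigned short
assert ids.itemsize == 2  # 2 bytes on common platforms

path = os.path.join(tempfile.mkdtemp(), "train.bin")  # hypothetical file name
with open(path, "wb") as f:
    ids.tofile(f)

# Read the ids back; the file size tells us how many tokens it holds.
n_tokens = os.path.getsize(path) // ids.itemsize
loaded = array("H")
with open(path, "rb") as f:
    loaded.fromfile(f, n_tokens)

print(n_tokens, loaded.tolist())
```

At two bytes per token, the file size is simply twice the token count, which makes it easy to estimate disk usage for the full dataset.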