
## Download the data

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-02-23 14:28:13--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-02-23 14:28:13 (20.2 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
with open("/content/input.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [None]:
# Check the length of the dataset
print(f"The length of the dataset is {len(text)} characters.")

The length of the dataset is 1115394 characters.


In [None]:
# Check what the dataset looks like
print(text[:500])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


In [None]:
# Collect the unique characters used in the text, in sorted order
characters = sorted(set(text))
num_unique_characters = len(characters)

In [None]:
print("Printing the unique characters in the text: ")
print("".join(characters))

Printing the unique characters in the text: 

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [None]:
print("Number of unique characters:", num_unique_characters)

Number of unique characters: 65


## Preprocess the data

In [None]:
# Let's create a simple mapping between characters and integer indices
stoi = {s: i for i, s in enumerate(characters)}
itos = {i: s for i, s in enumerate(characters)}

In [None]:
# Function for encoding
def encode(string):
    return [stoi[s] for s in string]

In [None]:
# Function for decoding
def decode(index_list):
    return "".join(itos[i] for i in index_list)

In [None]:
# Check that decoding an encoded string returns the original string
decode(encode('hello!'))

'hello!'
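
The round trip above only works for characters that appear in the vocabulary. The following is a minimal sketch of that behaviour on a toy string; `sample_text` and the `sample_*` names are hypothetical stand-ins so that the notebook's own `stoi`, `itos`, `encode`, and `decode` are left untouched.

```python
# Toy corpus (an assumption for illustration; the notebook uses the full Shakespeare text).
sample_text = "hello world!"

# Same construction as in the notebook: sorted unique characters, then two lookup tables.
sample_chars = sorted(set(sample_text))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}


def sample_encode(string):
    """Map each character to its integer index."""
    return [sample_stoi[ch] for ch in string]


def sample_decode(indices):
    """Map each integer index back to its character and join the result."""
    return "".join(sample_itos[i] for i in indices)


# Round trip: decoding the encoding gives back the original string.
round_trip = sample_decode(sample_encode(sample_text))
print(round_trip == sample_text)

# A character outside the vocabulary has no index, so the dict lookup raises KeyError.
try:
    sample_encode("hello?")
    unknown_ok = True
except KeyError as err:
    unknown_ok = False
    print(f"Character not in vocabulary: {err}")
```

With a character-level vocabulary built from the training text itself, this only matters when encoding new text, such as a prompt containing a symbol the dataset never uses.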

In [None]:
import torch
data = torch.tensor(encode(text), dtype=torch.long)

In [None]:
# Check the data
print(f"Datatype: {type(data)}\ndtype: {data.dtype}")

Datatype: <class 'torch.Tensor'>
dtype: torch.int64


In [None]:
# Print the first 500 encoded characters of the data
print(data[:500])

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
        47, 59, 57,  1, 47, 57,  1, 41, 

In [None]:
# Split the data into training (first 90%) and validation (last 10%) sets
n = int(0.9 * len(data))
training_set = data[:n]
val_set = data[n:]
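
As a quick sanity check on the split arithmetic, the sketch below recomputes the two sizes from the dataset length printed earlier (1115394 characters, one token per character). The names `total_tokens`, `n_split`, `train_len`, and `val_len` are hypothetical and used only here.

```python
# Dataset length as printed earlier in this notebook (one token per character).
total_tokens = 1115394

# Same rule as the split: int() truncates, so the training set gets the floor of 90%.
n_split = int(0.9 * total_tokens)

train_len = n_split                  # length of data[:n]
val_len = total_tokens - n_split     # length of data[n:]

print(f"train: {train_len} tokens, validation: {val_len} tokens")
```

Because the split is a plain slice, the validation set is the final tenth of the text rather than a random sample of it.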

In [None]:
block_size = 8
training_set[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

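* Note that this chunk holds 9 tokens (`block_size + 1`) but yields only 8 training examples. Why? Because each example pairs a context with the token that follows it: from 18 we expect 47, from 18, 47 we expect 56, and so on. Within the chunk, the first token only ever serves as context and the ninth only ever serves as a target.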

In [None]:
x = training_set[:block_size]
y = training_set[1:block_size+1]

In [None]:
for i in range(block_size):
    context = x[:i+1]
    target = y[i]
    print(f"Context: {context} --> Target: {target}")

Context: tensor([18]) --> Target: 47
Context: tensor([18, 47]) --> Target: 56
Context: tensor([18, 47, 56]) --> Target: 57
Context: tensor([18, 47, 56, 57]) --> Target: 58
Context: tensor([18, 47, 56, 57, 58]) --> Target: 1
Context: tensor([18, 47, 56, 57, 58,  1]) --> Target: 15
Context: tensor([18, 47, 56, 57, 58,  1, 15]) --> Target: 47
Context: tensor([18, 47, 56, 57, 58,  1, 15, 47]) --> Target: 58


In [None]:
batch_size = 4
block_size = 8

In [None]:
ax = torch.randint(len(data) - block_size, (batch_size,))

In [None]:
[data[a : a+block_size] for a in ax]

[tensor([ 1, 58, 53,  1, 44, 47, 50, 50]),
 tensor([ 1, 56, 53, 63, 39, 50,  1, 40]),
 tensor([47, 52, 39, 52, 41, 43,  1, 57]),
 tensor([ 1, 63, 53, 59, 56,  1, 45, 56])]

In [None]:
torch.stack([data[a : a+block_size] for a in ax])

tensor([[ 1, 58, 53,  1, 44, 47, 50, 50],
        [ 1, 56, 53, 63, 39, 50,  1, 40],
        [47, 52, 39, 52, 41, 43,  1, 57],
        [ 1, 63, 53, 59, 56,  1, 45, 56]])

In [None]:
batch_size = 4
block_size = 8

def get_batch(split):
    data = training_set if split == "train" else val_set
    index_x = torch.randint(len(data) - block_size, (batch_size,))

    x = torch.stack([data[i : i+block_size] for i in index_x])
    y = torch.stack([data[i+1 : i+block_size+1] for i in index_x])

    return x, y

In [None]:
# Let's look at the inputs and the targets
x_ex, y_ex = get_batch("train")
print(f"The input is:\n{x_ex}")
print("------------------------")
print(f"The target is:\n{y_ex}")

The input is:
tensor([[50,  1, 40, 43,  1, 57, 58, 53],
        [43, 41, 46,  1, 61, 47, 58, 46],
        [47, 41, 49,  6,  1, 57, 59, 41],
        [46,  1, 47, 58,  1, 40, 56, 47]])
------------------------
The target is:
tensor([[ 1, 40, 43,  1, 57, 58, 53, 52],
        [41, 46,  1, 61, 47, 58, 46,  1],
        [41, 49,  6,  1, 57, 59, 41, 46],
        [ 1, 47, 58,  1, 40, 56, 47, 52]])
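
Each target row above is the matching input row shifted one position to the right. The sketch below checks that property with plain Python lists so that it runs without PyTorch; `sample_windows`, the stand-in `tokens` list, and the fixed seed are assumptions made for illustration and are not part of the notebook's `get_batch`.

```python
import random


def sample_windows(tokens, block_size, batch_size, rng):
    """Return (inputs, targets) as lists of lists, with targets shifted by one."""
    # The largest valid start leaves room for block_size inputs plus one extra target token.
    max_start = len(tokens) - block_size - 1
    # random.Random.randint includes both endpoints.
    starts = [rng.randint(0, max_start) for _ in range(batch_size)]
    xs = [tokens[s : s + block_size] for s in starts]
    ys = [tokens[s + 1 : s + block_size + 1] for s in starts]
    return xs, ys


# Stand-in token ids 0..99 (hypothetical data, chosen so the shift is easy to see).
tokens = list(range(100))
rng = random.Random(0)  # fixed seed so the sketch is repeatable

xs, ys = sample_windows(tokens, block_size=8, batch_size=4, rng=rng)

# Dropping the first input token and the last target token must leave identical rows.
shifted = all(x[1:] == y[:-1] for x, y in zip(xs, ys))
print(shifted)
```

The upper bound on the start index mirrors `torch.randint(len(data) - block_size, ...)`, whose upper limit is exclusive, so the last target slice always stays inside the data.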
