# Tiny GPT Implementation

Welcome to this walkthrough on how to implement GPT from scratch! 

Much of this notebook is taken from Andrej Karpathy's video ["Let's build GPT"](https://www.youtube.com/watch?v=kCc8FmEb1nY&ab_channel=AndrejKarpathy) and its corresponding resources. It has been reworked into a notebook-first format to aid hands-on learning.

This notebook covers the basic concepts, such as attention and next-token prediction, that are crucial to understanding how GPT works. It does not cover many of the finer details needed to reproduce GPT-level performance: we'll be using a smaller dataset and single-GPU training.

## Dataset
Let's download the dataset we will train on. GPT-2 and later iterations of GPT were trained on large, web-scale datasets that were not publicly released. We'll instead use a much smaller dataset, Tiny Shakespeare, for instructional and practical purposes.

In [1]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-12-24 01:49:08--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-12-24 01:49:08 (23.2 MB/s) - ‘input.txt’ saved [1115394/1115394]
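The `!wget` cell above assumes a shell where `wget` is installed. As a rough alternative, the same file can be fetched with Python's standard library. This is only a sketch: `download_if_missing` and `DATA_URL` are illustrative names, not part of the original notebook, and the URL is copied from the `wget` command.

```python
import os
import urllib.request

# URL copied from the wget command above.
DATA_URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"


def download_if_missing(url: str, path: str) -> str:
    """Download `url` to `path` unless the file already exists; return `path`."""
    if not os.path.exists(path):
        # urlretrieve streams the response body straight into the file at `path`.
        urllib.request.urlretrieve(url, path)
    return path


# Example call (hypothetical helper, needs network access on the first run):
# download_if_missing(DATA_URL, "input.txt")
```

Skipping the download when the file is already present makes the cell safe to re-run without hitting the network again.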



## EDA
Let's take a look at our dataset. First, we need to open it:

In [2]:
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

Now, let's take a look at the length, some example text, and the alphabet we're dealing with.

In [9]:
print("======= Dataset Length =======")
print("Length of the dataset in characters:", len(text))

print("======= Sample Text =======")
print(text[:500])

chars = sorted(set(text))  # unique characters in the dataset, in sorted order
vocab_size = len(chars)
print("======= Alphabet =======")
print("Alphabet:", "".join(chars))
print("Alphabet Size:", vocab_size)

Length of the dataset in characters: 1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor
Alphabet: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Alphabet Size: 65


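It seems that our alphabet consists of uppercase and lowercase letters, plus whitespace (newline and space), the digit 3, and some punctuation and special characters. Typically, the next step is to tokenize the text, that is, to convert its characters into integer IDs that a model can consume.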
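One simple way to tokenize this data is at the character level: assign each character in the alphabet an integer ID and map back and forth. The sketch below assumes that scheme. The names `build_char_tokenizer`, `stoi`, and `itos` are illustrative, and a short sample string stands in for the full `text`. Note that GPT-2 itself uses a subword (byte-pair encoding) tokenizer rather than characters.

```python
def build_char_tokenizer(text):
    """Return (encode, decode) functions for a character-level vocabulary."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer ID
    itos = {i: ch for i, ch in enumerate(chars)}  # integer ID -> character

    def encode(s):
        # Raises KeyError for characters that never appear in `text`.
        return [stoi[ch] for ch in s]

    def decode(ids):
        return "".join(itos[i] for i in ids)

    return encode, decode


sample = "First Citizen:\nSpeak, speak."  # stand-in for the full dataset
encode, decode = build_char_tokenizer(sample)

ids = encode("speak")
print(ids)          # one integer per character
print(decode(ids))  # round-trips back to "speak"
```

With the real dataset, passing `text` instead of `sample` would yield one ID per entry in the alphabet printed above.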