<img src="11.png" width="100%"/>

# Overview 

In the New York Times article "Watch an AI Learn to Write By Reading Nothing But ___", **Aatish Bhatia** describes spending weeks reading AI research articles and training tiny language models on his computer. In the article, he uses "BabyGPT" to show how an AI learns language by reading only the complete works of the reader's choosing: Jane Austen, Shakespeare, the Federalist Papers, Moby-Dick, Star Trek: The Next Generation, or Harry Potter. BabyGPT sees only about 900 thousand words and nothing else.

Before training, the output is gibberish. You can't read it or tell what BabyGPT is trying to produce. Clicking "Generate another response" just produces another pile of gibberish. BabyGPT is ant-sized compared to other models. This example shows how language models start out: with clumps of randomly combined characters. These models need many rounds of training before they generate responses that we can understand.

<img src="1.png" width="100%"/>

In the article, after **250** rounds, we get some English letters. The article points out that there are a lot of "e"s because that's the most common letter in English. Some small words also form, such as "I", "to", and "you". It's still not great, but it's something readable.
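
To see why "e" shows up so early, here is a minimal sketch (not the article's code) that counts letter frequencies in a short sample and then draws letters in proportion to those counts. The sample text and the function names are my own assumptions for illustration.

```python
import random
from collections import Counter

# Hypothetical stand-in for a training corpus; the article uses full books.
SAMPLE_TEXT = (
    "the quick brown fox jumps over the lazy dog while "
    "everyone else here sleeps beneath the evergreen trees"
)

def letter_counts(text):
    """Count how often each alphabetic character appears (case-insensitive)."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

def sample_by_frequency(counts, n, seed=0):
    """Draw n letters, each with probability proportional to its count."""
    rng = random.Random(seed)
    letters = sorted(counts)
    weights = [counts[ch] for ch in letters]
    return "".join(rng.choices(letters, weights=weights, k=n))

counts = letter_counts(SAMPLE_TEXT)
print(counts.most_common(3))            # the most frequent letters in the sample
print(sample_by_frequency(counts, 40))  # unreadable, but heavy on common letters
```

The sampled string is still nonsense, but it already "looks" more like English than uniformly random characters, which is roughly the stage the model is at after a few hundred rounds.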

After **500** rounds, small words start forming. In the Shakespeare example, I see words such as "lover", "him", and "they". There's also some basic grammar.

After **5,000** rounds, the words get bigger. You see words like "charges" and "concludes". The sentences still don't cohere, but you can see the progress, and the grammar is getting better too. This steady improvement happens because BabyGPT is a neural network: each round of training adjusts its parameters so that its predictions get a little better.

Finally, after **30,000** rounds, we get full sentences. They still don't really make sense, but the output looks much more like English.
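
As a much simpler stand-in for what BabyGPT is doing, the sketch below builds a character-level bigram model: it counts which character tends to follow which, then generates text from those counts. This is not a neural network and not the article's method; the tiny corpus and the function names are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each character, which characters follow it."""
    follows = defaultdict(Counter)
    for current, nxt in zip(text, text[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, start, length, seed=0):
    """Generate text by repeatedly sampling a next character."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = follows.get(out[-1])
        if not options:  # dead end: nothing ever followed this character
            break
        chars = list(options)
        weights = [options[c] for c in chars]
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

corpus = "to be or not to be that is the question "
model = train_bigram(corpus)
print(generate(model, "t", 30))
```

A bigram model only looks one character back, so it plateaus quickly. A transformer like BabyGPT looks at a much longer context, which is why it can keep improving over thousands of rounds.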

**Let's explore this for ourselves** </br>
</br>
**nanoGPT**, developed by **Andrej Karpathy**, is the code that the BabyGPT model in this article is built on. It trains generative pre-trained transformers (GPTs). The difference between nanoGPT and ChatGPT is scale: GPT-3 was trained at up to a million times the scale.

# nanoGPT 

I wanted to see how they were able to run these training experiments, so I took a look at the original GitHub repo. Here's a summary of what I learned. Hope it's useful!

**Here's the GitHub link if you want to explore nanoGPT** : https://github.com/karpathy/nanoGPT

### README

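Karpathy gives an overview of what nanoGPT is and how it prioritizes "teeth over education." </br>
</br>
The file **train.py** reproduces GPT-2 (124M) on OpenWebText in about 4 days of training on a single 8×A100 40GB node. The code is a roughly 300-line boilerplate training loop. </br>
</br>
**model.py** is a roughly 300-line GPT model definition that can optionally load the GPT-2 weights from OpenAI. </br>
</br>
To try it out, install the dependencies, then prepare the Shakespeare data, train a character-level model, and sample from it: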

```
pip install torch numpy transformers datasets tiktoken wandb tqdm

python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
python sample.py --out_dir=out-shakespeare-char
```
</br>
This generates a few examples: </br>

```
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

DUKE VINCENTIO:
I thank your eyes against it.

DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?

DUKE VINCENTIO:
If you have done evils of all disposition
To end his power, the day of thrust for a common men
That I leave, to fight with over-liking
Hasting in a roseman.
```
This is the output of a character-level model after about 3 minutes of training on a GPU.
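
A character-level model treats every distinct character as one token. The sketch below shows that idea in plain Python: build a vocabulary from the text, then map characters to integers and back. It's a simplified illustration under my own naming (`build_vocab`, `encode`, `decode`), not a copy of the repo's `prepare.py`.

```python
def build_vocab(text):
    """Return (stoi, itos) lookup tables over the unique characters in text."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
    itos = {i: ch for ch, i in stoi.items()}       # integer id -> character
    return stoi, itos

def encode(text, stoi):
    """Turn a string into a list of integer token ids."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)

text = "DUKE VINCENTIO:\nI thank your eyes against it."
stoi, itos = build_vocab(text)
ids = encode(text, stoi)
print(len(stoi), "unique characters")
print(ids[:10])
print(decode(ids, itos) == text)  # encoding then decoding gives back the text
```

The vocabulary stays tiny this way (one entry per distinct character), which is part of why such a model can train in minutes.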

### reproducing GPT-2

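To reproduce GPT-2, first tokenize the OpenWebText dataset, then start training. The `torchrun` command below uses 8 GPUs on a single node:

```
python data/openwebtext/prepare.py
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
```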

### baselines 

The OpenAI GPT-2 checkpoints give us baselines for OpenWebText. These commands evaluate each checkpoint: </br>
</br>
```
$ python train.py config/eval_gpt2.py
$ python train.py config/eval_gpt2_medium.py
$ python train.py config/eval_gpt2_large.py
$ python train.py config/eval_gpt2_xl.py
```
</br>
Each run reports the loss on the train and val (validation) sets. If you don't know what that means: </br>
If your train loss is low but your val loss is high, that usually means overfitting. If both are high, your model is probably underfitting and needs improvement.

| Loss Type       | What it Measures                   | Used For                         |
|------------------|-------------------------------------|----------------------------------|
| **Train Loss**    | Error on training data              | Shows how well the model learns |
| **Validation Loss** | Error on unseen validation data   | Checks for generalization       |


Running the evaluation commands gives the following losses on train and val:

| model      | params               | train loss          | val loss |
|------------|----------------------|---------------------|----------|
| **gpt2**    | 124M                | 3.11                | 3.12     |
| **gpt2-medium** | 350M            | 2.85                | 2.84     |
| **gpt2-large**  | 774M            | 2.66                | 2.67     |
| **gpt2-xl**     | 1558M           | 2.56                | 2.54     |
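
To put these numbers in context, here's a quick sketch. It assumes the reported loss is a cross-entropy in nats per token (the natural-log convention PyTorch uses by default). Under that assumption, `exp(loss)` is the perplexity, and a model guessing uniformly over GPT-2's 50,257-token vocabulary would score `ln(50257)`. The function names are my own.

```python
import math

GPT2_VOCAB_SIZE = 50257  # size of the GPT-2 BPE vocabulary

def perplexity(loss):
    """Convert a cross-entropy loss in nats per token to perplexity."""
    return math.exp(loss)

def uniform_guess_loss(vocab_size):
    """Loss of a model that assigns equal probability to every token."""
    return math.log(vocab_size)

# Validation losses copied from the table above.
val_losses = {"gpt2": 3.12, "gpt2-medium": 2.84, "gpt2-large": 2.67, "gpt2-xl": 2.54}

print(f"uniform guessing: loss {uniform_guess_loss(GPT2_VOCAB_SIZE):.2f}")
for name, loss in val_losses.items():
    print(f"{name}: val loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```

Read this way, a loss around 3 means the model is, on average, about as uncertain as if it were choosing among a couple dozen equally likely tokens, instead of tens of thousands.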

### finetuning 

Finetuning is very similar to training: you initialize from a pretrained model and train with a smaller learning rate. Here's an example of finetuning that they include: </br>
</br>
```
python train.py config/finetune_shakespeare.py
```
And when you sample from the finetuned model with: </br>
```
python sample.py --out_dir=out-shakespeare
```
The output is: </br>
```
THEODORE:
Thou shalt sell me to the highest bidder: if I die,
I sell thee to the first; if I go mad,
I sell thee to the second; if I
lie, I sell thee to the third; if I slay,
I sell thee to the fourth: so buy or sell,
I tell thee again, thou shalt not sell my
possession.

JULIET:
And if thou steal, thou shalt not sell thyself.

THEODORE:
I do not steal; I sell the stolen goods.

THEODORE:
Thou know'st not what thou sell'st; thou, a woman,
Thou art ever a victim, a thing of no worth:
Thou hast no right, no right, but to be sold.
```
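
The finetuning idea (start from pretrained weights, then take a few small steps on new data) can be illustrated with a toy one-parameter model. This is a sketch under my own assumptions: the data, learning rates, step counts, and function names are made up for illustration and are unrelated to the actual configs in the repo.

```python
def mse_loss(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, learning_rate, steps):
    """Plain gradient descent on the single weight w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0]
pretrain_data = [(x, 2.0 * x) for x in xs]  # "general" data: slope 2.0
finetune_data = [(x, 2.2 * x) for x in xs]  # "new" data: slope 2.2

# Pretraining: many steps from a blank start with a larger learning rate.
w_pretrained = train(0.0, pretrain_data, learning_rate=0.05, steps=200)

# Finetuning: a few steps from the pretrained weight with a smaller learning rate.
w_finetuned = train(w_pretrained, finetune_data, learning_rate=0.01, steps=20)

# Same small budget, but starting from scratch instead of the pretrained weight.
w_scratch = train(0.0, finetune_data, learning_rate=0.01, steps=20)

print("finetuned loss:   ", mse_loss(w_finetuned, finetune_data))
print("from-scratch loss:", mse_loss(w_scratch, finetune_data))
```

With the same small training budget, the run that starts from the pretrained weight ends up with a lower loss on the new data, because it only has to make a small adjustment.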

### sampling / inference

If you want to sample from a model you trained yourself, or from a pre-trained GPT-2 model from OpenAI, you can use the `sample.py` script: </br>
```
python sample.py
```
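
Sampling from a GPT-style model usually means turning the model's output scores (logits) into probabilities and drawing the next token at random, often with a temperature and a top-k cutoff. Below is a minimal pure-Python sketch of that step. The logits are made-up numbers and the function names are hypothetical; the real script works on PyTorch tensors.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index after temperature scaling and optional top-k filtering."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        # Tokens scoring below the k-th best get probability zero.
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    probs = softmax(scaled)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
print([sample_next(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(10)])
```

A lower temperature makes the output more predictable, a higher one makes it more random, and top-k keeps the model from picking very unlikely tokens.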

# Overall Notes

nanoGPT is minimalistic, but it's important for helping people learn how GPT models work under the hood. It makes it easier for me to understand and study the model's architecture, training, and inference. </br>
It also runs on pretty modest hardware, like a single GPU, unlike large-scale models that need massive compute. </br>
And unlike the commercial models, nanoGPT offers a transparent, open-source alternative that breaks down how LLMs work. </br>
It's a cute little starter kit for GPT models.