# DI 725: Transformers and Attention-Based Deep Networks

## An Assignment for Implementing Transformers in PyTorch

The purpose of this notebook is to guide you through the usage of sample code.

This notebook follows the baseline prepared by Andrej Karpathy, with a custom dataset (Don Quixote by Cervantes). This version of the code, called [nanoGPT](https://github.com/karpathy/nanoGPT), is a rewrite of his well-known [minGPT](https://github.com/karpathy/minGPT).

### Author:
* Ümit Mert Çağlar

## Requirements
Install the requirements for your environment; comment out the install commands on later runs.

Dependencies:

- [pytorch](https://pytorch.org)
- [numpy](https://numpy.org/install/)
- `transformers` for Hugging Face transformers (to load GPT-2 checkpoints)
- `datasets` for Hugging Face datasets (to download + preprocess datasets)
- `tiktoken` for OpenAI's fast BPE code
- `wandb` for optional logging
- `tqdm` for progress bars

The fastest way to get started with transformers, apart from following the DI725 labs, is to use a small model and dataset. For this purpose, we will start by training a character-level GPT on Don Quixote by Cervantes. The code will download a single file (2MB) and apply some transformations. Examine the code in [prepare.py](data/don_char/prepare.py).

## Quick Start

Use the following to prepare the Don Quixote novel at the character level:

In [4]:
!python data/don_char/prepare.py

length of dataset in characters: 2,361,834
all the unique characters: 
 !#$%&()*,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyzÁÆÑÚàáæèéëíñóùŒœ—‘’“”•™
vocab size: 105
train has 2,125,650 tokens
val has 236,184 tokens


This creates `train.bin` and `val.bin` in that data directory. Now it is time to train our own GPT. The size of the GPT model depends on the available computational resources. A GPU is advised for heavy workloads; a CPU is sufficient for training lightweight models and for evaluation and inference.
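
The character-level preparation step can be sketched roughly as follows. This is a minimal illustration under assumptions, not the actual contents of `data/don_char/prepare.py`: the short `text` string is a placeholder for the downloaded novel, and the 90/10 split is inferred from the token counts printed above (2,125,650 is 90% of 2,361,834).

```python
# Minimal sketch of character-level data preparation (assumed logic, not the
# actual contents of data/don_char/prepare.py).
# Placeholder text; the real script works on the full downloaded novel.
text = "In a village of La Mancha, the name of which I have no desire to call to mind"

# Build the vocabulary from the unique characters in the text.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)


# Hold out the last 10% of the characters for validation.
split = int(len(text) * 0.9)
train_ids = encode(text[:split])
val_ids = encode(text[split:])

print(f"vocab size: {len(chars)}")
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")
```

Because every character is its own token, the vocabulary stays tiny (105 symbols for this dataset), at the cost of much longer sequences than a subword tokenizer would produce.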

A small-scale GPT with the settings provided in the [config/train_don_char.py](config/train_don_char.py) config file will be trained with the following command:


In [3]:
!python train.py config/train_don_char.py --device=cuda --compile=False



We are training a small GPT with a context size of up to 256 characters, 384 feature channels, and 6 transformer layers with 6 attention heads each. On one RTX 3070 GPU this training run takes about 10 minutes, and the best validation loss is 1.1620. Based on the configuration, model checkpoints are written to the `--out_dir` directory `out-don-char`. Once training finishes, we can sample from the best model by pointing the sampling script at this directory:

In [4]:
!python sample.py --out_dir=out-don-char



This generates a few samples, for example:

```
“I grant all that,” said the governor; “it’s not in a low voice

but not yet forget that there’s none of it the poor in the world; I’ll

like to take special to have been no one to write out the stone of

patience to the village.”

```

It is pretty nice to have a working GPT after a few minutes of character-level training! Better results can possibly be achieved by hyperparameter tuning or by finetuning (transfer learning) from a pre-trained model.
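
To get a feel for how large the model trained above is, the parameter count can be estimated from the configuration alone. The sketch below is a back-of-the-envelope calculation, not code from the repository. It assumes bias-free linear and LayerNorm layers, input and output embeddings that share weights, and a total that leaves out position embeddings. The function name `estimate_params` is made up for this illustration.

```python
def estimate_params(n_layer, n_embd, vocab_size):
    """Rough GPT parameter count, assuming no bias terms and tied input/output
    embeddings; position embeddings are left out of the total."""
    attention = 4 * n_embd * n_embd   # QKV projection (3n^2) + output projection (n^2)
    mlp = 8 * n_embd * n_embd         # n -> 4n and 4n -> n linear layers
    layer_norms = 2 * n_embd          # two LayerNorm weight vectors per block
    per_block = attention + mlp + layer_norms
    final_norm = n_embd               # LayerNorm applied after the last block
    token_embedding = vocab_size * n_embd
    return n_layer * per_block + final_norm + token_embedding


# Settings from config/train_don_char.py: 6 layers, 384 channels, 105 characters.
print(f"{estimate_params(6, 384, 105) / 1e6:.2f}M")  # 10.66M
# A narrower variant with 128 channels.
print(f"{estimate_params(6, 128, 105) / 1e6:.2f}M")  # 1.19M
```

The count is dominated by the `12 * n_embd**2` weights per block, so reducing the embedding size from 384 to 128 shrinks the model by roughly a factor of nine.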


## Quick start with fewer resources

If we are [low on resources](https://www.youtube.com/watch?v=rcXzn6xXdIc), we can run a lighter version of the training. First, we set `--compile=False`, which is currently also required on Windows. On a machine without a GPU we would also set `--device=cpu` (the run below was launched with `--device=cuda`). Since the model that trains in about 10 minutes on an entry-level GPU would take much longer on a CPU, we also shrink the model as follows:

In [5]:
!python train.py config/train_don_char.py --device=cuda --wandb_log=True --wandb_project="gpt2Train" --wandb_run_name="trial1" --out_dir="out-don-small-char" --compile=False --eval_iters=20 --log_interval=50 --block_size=64 --batch_size=12 --n_layer=6 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=1000 --dropout=0.2

Overriding config with config/train_don_char.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-don-char'
eval_interval = 250  # keep frequent because we'll overfit
eval_iters = 200
log_interval = 50  # don't print too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False  # override via command line if you like
wandb_project = 'don-char'
wandb_run_name = 'mini-gpt'

dataset = 'don_char'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256  # context of up to 256 previous characters

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3  # with baby networks can afford to go a bit higher
max_iters = 2000
lr_decay_iters = 2000  # make equal to max_iters usually
min_lr = 1e-4  # learning_rate / 10 usually
beta2 = 0.99  # make a bit bigger because number of tokens per iter is small

wa

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: iedibrahimethemdeveci (iedibrahimethemdeveci-metu-middle-east-technical-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: creating run
wandb: Tracking run with wandb version 0.19.8
wandb: Run data is saved locally in c:\Users\iedev\Desktop\SPRING 2025\DI 725 - Transformers\Assignments\Assignment_1\Trial_Generation\DI725-main\assignment_1\wandb\run-20250403_211417-st80a6v1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run trial1
wandb:  View project at https://wandb.ai/iedibrahimethemdeveci-metu-middle-east-technical-university/gpt2Train
wandb:  View run at https://wandb.ai/iedibrahimethemdeveci-metu-middle-east-technical-university/gpt2Train/runs/st80a6v1


*Here, when running on a CPU instead of a GPU, we must set `--device=cpu` (the command above was run with `--device=cuda`) and turn off PyTorch 2.0 compile with `--compile=False`. At evaluation time we get a noisier but faster estimate (`--eval_iters=20`, down from 200), the context size is only 64 characters instead of 256, and the batch size is only 12 examples per iteration instead of 64. We also use a much smaller Transformer (6 layers, 4 heads, embedding size 128) and train for 2000 iterations. The learning rate is usually decayed over about `max_iters` steps; the command above sets `--lr_decay_iters=1000`, so the decay finishes halfway through the run. For a network this small, regularization can also be relaxed (e.g. `--dropout=0.0`); the command above keeps `--dropout=0.2`. This still runs in about 5 minutes, but reaches a loss of only 1.88 and therefore produces worse samples. It is still good fun:*

In [6]:
!python sample.py --out_dir=out-don-small-char --device=cpu

Overriding: out_dir = out-don-small-char
Overriding: device = cpu
number of parameters: 1.19M
Loading meta from data\don_char\meta.pkl...

and of and ishe am lor ainter, if grutin the be tos the lis aprot,

ors this rert an dor had mee of in ghim othe as the baist, fif ash or rerckid merel

the ome a the ghe thers hides fo dore gresis lote, neasent of pencabes,

and thas the bantanennd to an thas I ma beeecces thald un or tho he gucave

me thid les of to or o matightad the cay by eend

d chis yof hey, bres iw win un hochor the he I thas sanim sacke to the cow

yow, and I o iSancht ant in not lase thase Go, and eerd and le wathir

b
---------------

ma, ind forsevemed hin the trameat dit wan the ouncke nelercene abat the of the ece

cof winth the would somangh thas a slaid nde heat the o a his dand

herghen und tho ward noce a a which me wich ithearl here be on seved the co thar

whe as wa him aredelllt at thime bute but wa

o an no f thist iy thare hee wharm the sa  of theld gooide tha

This generates samples like the following:

```
Sancho nother with this then of everantan has for five he enver any

shal were than as in though they and I knight the sther his a jlage,

and mad priled and squiel a hist to in feet she took and and sersse to her of

Marest and good was pefor rubt some by than lave from his dintat all

pack that he remants to goost ever to him arestiance of it the were to who

which mom, worly gane for he sporen gort he was roosion, and be that

it thou, so so he kniders what the and him of him dest us on shart
```

*Not bad for ~3 minutes on a CPU, for a hint of the right character gestalt. If you're willing to wait longer, feel free to tune the hyperparameters, increase the size of the network, the context length (`--block_size`), the length of training, etc.*
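
One hyperparameter worth understanding before tuning is `--lr_decay_iters`. The sketch below shows a warmup-plus-cosine-decay schedule of the kind nanoGPT-style training loops commonly use; it is an illustration under assumptions rather than a copy of `train.py`. The `learning_rate` and `min_lr` defaults come from the config printed earlier, while `warmup_iters=100` is a hypothetical value chosen for the example.

```python
import math


def get_lr(it, learning_rate=1e-3, min_lr=1e-4, warmup_iters=100, lr_decay_iters=2000):
    """Learning rate at iteration `it`: linear warmup, cosine decay, then a floor."""
    # 1) Linear warmup from near zero up to the peak learning rate.
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) After the decay window, stay at the minimum learning rate.
    if it > lr_decay_iters:
        return min_lr
    # 3) In between, follow a cosine curve from learning_rate down to min_lr.
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)


# Compare a decay window equal to the run length with one that is half as long.
for it in (0, 100, 500, 1000, 1500, 2000):
    full = get_lr(it, lr_decay_iters=2000)
    half = get_lr(it, lr_decay_iters=1000)
    print(f"iter {it:4d}: decay over 2000 -> {full:.6f}, decay over 1000 -> {half:.6f}")
```

With a decay window shorter than `max_iters`, the second half of training runs entirely at `min_lr`, which is why the two values are usually set equal.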

*Finally, on Apple Silicon MacBooks with a recent PyTorch version, make sure to add `--device=mps` (short for "Metal Performance Shaders"); PyTorch then uses the on-chip GPU, which can significantly accelerate training (2-3X) and allow you to use larger networks. See [Issue 28](https://github.com/karpathy/nanoGPT/issues/28) for more.*
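
Choosing between `cuda`, `mps`, and `cpu` can be automated. The helper below is a small sketch and is not part of the repository; `pick_device` is a hypothetical name. It falls back to `cpu` when PyTorch is not installed or when no accelerator is found.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so no accelerator can be used
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not have the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
# Print the training command with the detected device filled in.
print(f"python train.py config/train_don_char.py --device={device} --compile=False")
```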



## Finetuning

Finetuning (transfer learning) is a valuable way to obtain better models by starting from a pre-trained model. Finetuning GPT models is just as simple as training from scratch! We will now download Don Quixote again, but this time we will represent it with tokens (using OpenAI's BPE tokenizer) instead of characters.



In [None]:
!python data/don/prepare.py

Run an example finetuning like:

In [None]:
!python train.py config/finetune_don.py --compile=False

This will load the config parameter overrides in `config/finetune_don.py`. Basically, we initialize from a GPT-2 checkpoint with `init_from` and train as normal, except for a shorter time and with a small learning rate. The model can be changed to any of `{'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}`, and memory use can be reduced by decreasing `block_size` (the context length). The best checkpoint (lowest validation loss) will be in the `out_dir` directory, e.g. in `out-don` by default, per the config file. You can then sample from it with `python sample.py --out_dir=out-don`:
```
* * All creatures that enter the world below may so far as want to observe the rules of their own land, and may obey them under the hand of their lord, and may not follow others below.

* * *

THE PORT COLLIDATES,

- * *

ON the light, and the light to the dark, and the darkness to the light, and the darkness to the darkness, were the present-day laws of monarchy, whose lordship they approved in their faces and hearts. From this moment on, however, they had no other representation to give than that of their master, who, for all that was said or heard, had reached the height of his power.

The king's hand, though at times little more than a finger of his, required no more than a finger of his, and that power was, that of holding his eye, and the other of his, in his own, body.

When this was spoken of, it was a simple and noble quibble, and the subject of this was so as to admit of the few who had any forsemination, and the few who had the most to go on.

The time did not come for a thought of this, and for a moment the very thought of it seemed to fall to the ground.

But that thought did not come to pass; though the king was not speaking of the king, it came to pass that the king, with all his might, and all his cunning, and no other sense, and without any understanding, and without any desire for the utmost of his services, and without any desire to put an end to his own glory, and without any desire to hide his triumph, had found the time to say that this was what he thought on the subject of religion; that it was what he thought, and according as it seemed to him to be as good or better to him than to the other kings, and he was in no sense a king, for it seemed to him he could never have any more power than he had to be; that it was a matter of his will and power; and that it was all a matter of his will, for he was determined that this look and that to which he might have been given to hold it was the best in himself.

And so it was that the king, who was all around him, and all around him; and so
```
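
For orientation, a fine-tuning config of the kind described above might look like the following sketch. The values are illustrative assumptions in the style of nanoGPT config files, not the contents of `config/finetune_don.py`; open that file in the repository for the settings actually used.

```python
# Illustrative fine-tuning overrides (assumed values, NOT the contents of
# config/finetune_don.py; check the file in the repository for the real ones).
out_dir = 'out-don'             # where checkpoints are written
dataset = 'don'                 # assumed to match the data/don directory
init_from = 'gpt2'              # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
always_save_checkpoint = False  # only save when the validation loss improves

# Small batches with gradient accumulation keep memory use low.
batch_size = 1
gradient_accumulation_steps = 32
block_size = 256                # hypothetical shorter context to fit smaller GPUs

# Train briefly with a small, constant learning rate.
max_iters = 20
learning_rate = 3e-5
decay_lr = False

# Number of tokens processed per optimizer step with these settings.
print(batch_size * gradient_accumulation_steps * block_size)
```

A low learning rate and few iterations are typical for fine-tuning because the pre-trained weights are already close to a good solution, and large updates on a small dataset would overwrite what the model learned during pre-training.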

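## Inference and Sampling
Use the script `sample.py` to sample either from pre-trained GPT-2 models released by OpenAI, or from a model you trained yourself. For example, the following command samples from the fine-tuned model in `out-don`, prompted with a question (upstream nanoGPT selects a pre-trained checkpoint such as `gpt2-xl` with `--init_from=gpt2-xl` instead of `--out_dir`):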

In [None]:
!python sample.py --out_dir=out-don --start="Explain the relationship between Don Quixote and Sancho Panza" --num_samples=5 --max_new_tokens=100

If you'd like to sample from a model you trained, use the `--out_dir` to point the code appropriately. You can also prompt the model with some text from a file, e.g.:

In [None]:
!python sample.py --start=FILE:"prompt/fictional.txt" --out_dir="out-don" --num_samples=1 --max_new_tokens=100

In [None]:
!python sample.py --start=FILE:"prompt/positive_review.txt" --out_dir="out-don" --num_samples=1 --max_new_tokens=500
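
The `FILE:` prefix used above tells the script to read the prompt from disk instead of taking it literally. A minimal sketch of that behaviour is shown below; `resolve_prompt` is a hypothetical helper written for illustration, and the actual handling in `sample.py` may differ in detail.

```python
import os
import tempfile


def resolve_prompt(start):
    """Return the prompt text, reading it from disk when `start` begins with 'FILE:'."""
    prefix = "FILE:"
    if start.startswith(prefix):
        with open(start[len(prefix):], "r", encoding="utf-8") as f:
            return f.read()
    return start


# Usage: write a temporary prompt file, then resolve both kinds of prompt.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prompt.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("In a village of La Mancha")
    from_file = resolve_prompt("FILE:" + path)

literal = resolve_prompt("Sancho said")
print(from_file)
print(literal)
```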

I hope you enjoy working with GPT as much as I did!

## Efficiency notes

*For simple model benchmarking and profiling, `bench.py` might be useful. It's identical to what happens in the meat of the training loop of `train.py`, but omits much of the other complexities.*

*Note that the code by default uses [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). At the time of writing (Dec 29, 2022) this makes `torch.compile()` available in the nightly release. The improvement from the one line of code is noticeable, e.g. cutting down iteration time from ~250ms / iter to 135ms / iter. Nice work PyTorch team!*


## Troubleshooting

*Note that by default this repo uses PyTorch 2.0 (i.e. `torch.compile`). This is fairly new and experimental, and not yet available on all platforms (e.g. Windows). If you're running into related error messages, try disabling it by adding the `--compile=False` flag. This will slow down the code but at least it will run.*
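
Flags such as `--compile=False` act as overrides on top of the values set in the config file, as the `Overriding: ...` lines in the output above indicate. The sketch below shows one way such `--key=value` overrides could be parsed. `apply_overrides` is a hypothetical function written for this illustration and is not the repository's configuration code.

```python
from ast import literal_eval


def apply_overrides(config, args):
    """Return a copy of `config` updated from a list of '--key=value' strings."""
    updated = dict(config)
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, raw = arg[2:].split("=", 1)
        if key not in updated:
            raise KeyError(f"unknown config key: {key}")
        try:
            value = literal_eval(raw)  # parses numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            value = raw                # fall back to the raw string, e.g. cpu
        updated[key] = value
    return updated


defaults = {"device": "cuda", "compile": True, "block_size": 256, "dropout": 0.2}
config = apply_overrides(defaults, ["--device=cpu", "--compile=False", "--block_size=64"])
print(config)
```

Rejecting unknown keys catches typos such as `--blocksize=64` early, instead of silently training with the default value.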

*For some context on this repository, GPT, and language modeling, it might be helpful to watch the [Zero To Hero series](https://karpathy.ai/zero-to-hero.html). Specifically, the [GPT video](https://www.youtube.com/watch?v=kCc8FmEb1nY) is popular if you have some prior language modeling context.*

## Acknowledgements

This code is a fork of Andrej Karpathy's introductory [nanoGPT repository](https://github.com/karpathy/nanoGPT), which is an updated form of minGPT.

## Further Experiments

(Optional)

For further experiments, you can, for example, reproduce GPT-2, which is still a capable model, by following the link to Andrej Karpathy's repository.