# `nanoGPT`: GPT-2 Small (125M Params)

## Install / Setup

### First Time Running

We need to install `ngpt` and set up the Shakespeare dataset.

This only needs to be run the first time you run this notebook.

After running

```python
!python3 -m pip install nanoGPT
```

you will need to restart your runtime (Runtime -> Restart runtime).

After this, you should be able to import `ngpt`:

```python
>>> import ngpt
>>> ngpt.__file__
'/content/nanoGPT/src/ngpt/__init__.py'
```

```bash
%%bash

# Check whether ngpt is importable; hide the traceback if it is not.
if python3 -c 'import ngpt; print(ngpt.__file__)' 2> /dev/null; then
    echo "Has ngpt installed. Nothing to do."
else
    echo "Does not have ngpt installed. Installing..."
    git clone 'https://github.com/saforem2/nanoGPT'
    # Prepare the character-level Shakespeare dataset.
    python3 nanoGPT/data/shakespeare_char/prepare.py
    # Editable install of the cloned repository.
    python3 -m pip install -e nanoGPT -vvv
fi
```

```text
/lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT/src/ngpt/__init__.py
Has ngpt installed. Nothing to do.
```
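The same check can be made from Python instead of relying on a shell exit code. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is hypothetical and not part of `ngpt`.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


if is_installed("ngpt"):
    print("Has ngpt installed. Nothing to do.")
else:
    print("Does not have ngpt installed. Run the install cell first.")
```

Unlike `import ngpt`, `find_spec` only locates the package and does not execute it, so the check stays cheap even when the package pulls in heavy dependencies.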


## Post Install

If installed correctly, you should be able to import `ngpt`:

```python
>>> import ngpt
>>> ngpt.__file__
'/path/to/nanoGPT/src/ngpt/__init__.py'
```

```python
%load_ext autoreload
%autoreload 2

import ngpt
from enrich import get_logger
log = get_logger('jupyter')
log.info(ngpt.__file__)
```

```text
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
[2023-11-29 19:41:53][INFO][3434626787.py:7] - /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT/src/ngpt/__init__.py
```


## Build Trainer

Explicitly, we:

1. `setup_torch(...)`
2. Build `cfg: DictConfig = get_config(...)`
3. Instantiate `config: ExperimentConfig = instantiate(cfg)`
4. Build `trainer = Trainer(config)`

```python
import os
from ezpz import setup_torch
from hydra.utils import instantiate
from ngpt.configs import get_config
from ngpt.trainer import Trainer

os.environ['MASTER_PORT'] = '4235'
# 1. Set up torch for DDP with a fixed seed
rank = setup_torch('DDP', seed=1234)
# 2. Build the config from a list of overrides
cfg = get_config(
    [
        'data=owt',              # open web text
        'model=gpt2',            # gpt2 arch.
        'optimizer=gpt2',
        'train=gpt2',
        'train.init_from=gpt2',  # init from GPT2
        'train.max_iters=1000',
        'train.dtype=bfloat16',
    ]
)
# 3. Instantiate the ExperimentConfig
config = instantiate(cfg)
# 4. Build the trainer
trainer = Trainer(config)
```

```text
--------------------------------------------------------------------------

  Local host:   thetagpu24
  Local device: mlx5_0
--------------------------------------------------------------------------

Failed to download font: Source Sans Pro, skipping!
Failed to download font: Titillium WebRoboto Condensed, skipping!
[2023-11-29 19:42:01][INFO][configs.py:263] - Rescaling GAS -> GAS // WORLD_SIZE = 1 // 1
[2023-11-29 19:42:01][INFO][configs.py:398] - Tokens per iteration: 12,288
[2023-11-29 19:42:01][INFO][configs.py:430] - Using <torch.amp.autocast_mode.autocast object at 0x7f6e4c363550>
[2023-11-29 19:42:10][INFO][model.py:290] - num decayed parameter tensors: 50, with 124,318,464 parameters
[2023-11-29 19:42:10][INFO][model.py:291] - num non-decayed parameter tensors: 98, with 121,344 parameters
[2023-11-29 19:42:10][INFO][model.py:297] - using fused AdamW: True
```
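As a rough illustration of the override list passed to `get_config`, the sketch below separates entries that select a whole config group (`train=gpt2`) from entries that set a single field inside one (`train.max_iters=1000`). It assumes that `data`, `model`, `optimizer`, and `train` are Hydra config groups, which the notebook does not state explicitly. `split_overrides` is a hypothetical helper, not part of Hydra or `ngpt`, and it keeps all values as plain strings.

```python
def split_overrides(overrides):
    """Split `key=value` strings into group selections and dotted field overrides."""
    groups, fields = {}, {}
    for item in overrides:
        key, _, value = item.partition("=")
        if "." in key:
            fields[key] = value   # e.g. train.max_iters -> a single field
        else:
            groups[key] = value   # e.g. train -> which config group option to load
    return groups, fields


groups, fields = split_overrides(
    [
        'data=owt',
        'model=gpt2',
        'optimizer=gpt2',
        'train=gpt2',
        'train.init_from=gpt2',
        'train.max_iters=1000',
        'train.dtype=bfloat16',
    ]
)
print("groups:", groups)
print("fields:", fields)
```

Read this way, the four group selections pick the baseline GPT-2 settings, and the three dotted entries adjust individual training fields on top of them.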


```python
query = "What is a supercomputer?"
outputs = trainer.evaluate(query, num_samples=1, display=False)
log.info(f"['prompt']: '{query}'")
log.info("['response']:\n\n" + fr"{outputs['0']['raw']}")
```

```text
[2023-11-29 19:42:25][INFO][1657463709.py:3] - ['prompt']: 'What is a supercomputer?'
[2023-11-29 19:42:25][INFO][1657463709.py:4] - ['response']:

What is a supercomputer? When did you first learn it?

I used to work in an Apple Computer. It was called the "Elgin" Computer. It was the first computer that I had seen on TV. I went to college in 1983. I was at Arizona State University and I studied computer science. I later joined a computer science program at MIT. But my first computer was the Intel Core. It was from 1986. I went to MIT where I got my PhD and was at the lab. When I did graduate
```

## Train Model

Legend:

<div style="text-align:left;">

|  **NAME**  |        **DESCRIPTION**        |
|:----------:|:-----------------------------:|
|   `step`   | Current training step         |
|   `loss`   | Loss value                    |
|   `dt`     | Time per step (in **ms**)     |
|   `sps`    | Samples per second            |
|   `mtps`   | Millions of tokens per second |
|   `mfu`    | Model FLOPs Utilization*      |

*As a percentage of A100 `bfloat16` peak FLOPS

</div>
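To make the `mfu` column concrete, here is a hedged sketch of the FLOPs accounting that upstream nanoGPT uses (following Appendix B of the PaLM paper). Whether this fork computes `mfu` the same way is an assumption, although the result lands close to the value logged at step 100 (`mfu=13.208`). The model shape (12 layers, 12 heads, head dimension 64, block size 1024), the 12 sequences per iteration (12,288 tokens / 1024), and the exclusion of position embeddings from the parameter count are also assumptions not stated in this notebook.

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, block_size,
                 seqs_per_iter, dt_seconds, peak_flops=312e12):
    """Achieved FLOPS divided by peak FLOPS (default: A100 bfloat16 peak)."""
    # Forward + backward cost per token: weights term plus attention term.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_seq = flops_per_token * block_size
    flops_per_iter = flops_per_seq * seqs_per_iter
    return (flops_per_iter / dt_seconds) / peak_flops


# Decayed + non-decayed parameters from the log, minus the (assumed) 1024 x 768
# position-embedding table.
n_params = 124_318_464 + 121_344 - 1024 * 768

mfu = estimate_mfu(
    n_params, n_layer=12, n_head=12, head_dim=64, block_size=1024,
    seqs_per_iter=12,      # assumed: 12,288 tokens per iteration / 1024
    dt_seconds=0.25499,    # dt logged at step 100
)
print(f"mfu = {100 * mfu:.1f}%")  # prints "mfu = 13.2%"
```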

```python
trainer.train(train_iters=1000)
```

```text
[2023-11-29 19:43:53][INFO][trainer.py:516] - step=100 loss=3.119 dt=254.990 sps=3.922 mtps=0.048 mfu=13.208 train_loss=3.126 val_loss=3.102
[2023-11-29 19:44:20][INFO][trainer.py:516] - step=200 loss=2.961 dt=228.112 sps=4.384 mtps=0.054 mfu=13.364 train_loss=3.126 val_loss=
```
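The logged throughput columns can be cross-checked against each other. The sketch below assumes that `sps` is simply the reciprocal of `dt` and that `mtps` is the 12,288 tokens per iteration divided by `dt`. Both are inferences from the logged numbers rather than something confirmed from the trainer source, and `throughput` is a hypothetical helper.

```python
def throughput(dt_ms, tokens_per_iter):
    """Return (iterations per second, millions of tokens per second)."""
    iters_per_sec = 1000.0 / dt_ms
    mtps = tokens_per_iter * iters_per_sec / 1e6
    return iters_per_sec, mtps


# dt values (in ms) taken from the two log lines above.
for step, dt_ms in [(100, 254.990), (200, 228.112)]:
    sps, mtps = throughput(dt_ms, tokens_per_iter=12_288)
    print(f"step={step} sps={sps:.3f} mtps={mtps:.3f}")
# step=100 sps=3.922 mtps=0.048
# step=200 sps=4.384 mtps=0.054
```

Both rows reproduce the logged `sps` and `mtps` values to three decimals, which supports the assumed relationships.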

## Evaluate Model

```python
query = "What is a supercomputer?"
outputs = trainer.evaluate(query, num_samples=1, display=False)
log.info(f"['prompt']: '{query}'")
log.info("['response']:\n\n" + fr"{outputs['0']['raw']}")
```

```text
[2023-11-29 19:49:09][INFO][1657463709.py:3] - ['prompt']: 'What is a supercomputer?'
[2023-11-29 19:49:09][INFO][1657463709.py:4] - ['response']:

What is a supercomputer?

Researchers at MIT and EPFL and the University of Southern California published a paper on July 17 in IEEE Translational Computer Graphics. The paper, titled "Processes for Supercomputers," "is particularly interesting because it shows how computer graphics can be used with supercomputers."

Supercomputers solve many of the processing needs of large graphical full-screen displays, and there is always room to improve one's g
```
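When requesting more than one sample, it may be convenient to collect every response in one place. The sketch below assumes that `trainer.evaluate` returns a dict keyed by the sample index as a string (`'0'`, `'1'`, ...), each entry holding the generated text under `'raw'`. Only `outputs['0']['raw']` is actually shown in this notebook, so the layout for additional samples is an assumption. `format_samples` is a hypothetical helper, and a stand-in dict is used here in place of a real call.

```python
def format_samples(query, outputs):
    """Return the prompt followed by each sample's raw text, in index order."""
    parts = [f"['prompt']: '{query}'"]
    for key in sorted(outputs, key=int):  # keys assumed to be '0', '1', ...
        parts.append(f"['response'][{key}]:\n\n{outputs[key]['raw']}")
    return "\n\n".join(parts)


# Stand-in for trainer.evaluate(query, num_samples=2, display=False)
outputs = {
    '0': {'raw': "What is a supercomputer? (first sample)"},
    '1': {'raw': "What is a supercomputer? (second sample)"},
}
print(format_samples("What is a supercomputer?", outputs))
```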