# `nanoGPT`

## Install / Setup

### First Time Running

We need to install `ngpt` and set up the Shakespeare dataset.

This only needs to be run the first time you run this notebook.

After running

```python
!python3 -m pip install nanoGPT
```

you will need to restart your runtime (Runtime -> Restart runtime).

After this, you should be able to import `ngpt`:

```python
>>> import ngpt
>>> ngpt.__file__
'/content/nanoGPT/src/ngpt/__init__.py'
```


In [1]:
%%bash

# Check whether `ngpt` is already importable (1 = yes, 0 = no)
python3 -c 'import ngpt' 2>/dev/null && HAS_NGPT=1 || HAS_NGPT=0
# Only clone, prepare the dataset, and install when `ngpt` is missing
[ "$HAS_NGPT" -eq 1 ] || git clone https://github.com/saforem2/nanoGPT
[ "$HAS_NGPT" -eq 1 ] || python3 nanoGPT/data/shakespeare_char/prepare.py
[ "$HAS_NGPT" -eq 1 ] || python3 -m pip install -e nanoGPT -vvv
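
The guard in the cell above only needs to answer one question: can `ngpt` already be imported? A minimal Python sketch of the same check, using the standard-library `importlib.util.find_spec`, is shown below. The helper name `is_installed` is hypothetical and is not part of `ngpt`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    # find_spec returns None when no importable module of that name exists
    return importlib.util.find_spec(module_name) is not None


if is_installed("ngpt"):
    print("ngpt is already installed; skipping setup")
else:
    print("ngpt not found; clone the repo, prepare the dataset, and install it")
```

Unlike a bare `import ngpt`, this does not execute the package, so it is safe to run before the runtime has been restarted.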

## Post-Install

In [2]:
%load_ext autoreload
%autoreload 2

import ngpt
ngpt.__file__

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


'/lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT/src/ngpt/__init__.py'

## Build Trainer

In [3]:
import os
# Expose GPUs 0-7 to this process (set before CUDA is initialized)
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

In [4]:
import os
from ezpz import setup_torch
from hydra.utils import instantiate
from ngpt.configs import get_config
from ngpt.trainer import Trainer

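# Set up the distributed (DDP) backend with a fixed seed; the return value is stored as this process's rank
rank = setup_torch('DDP', seed=1234)
# Build the config from a list of Hydra-style overrides
cfg = get_config(
    [
        'model=shakespeare',
        'train=shakespeare',
        'data=shakespeare',
        'train.log_interval=250',
        'train.eval_interval=2000',
        'optimizer=shakespeare',
        'train.dtype=bfloat16',
    ]
)
# Instantiate the objects described by the config, then build the trainer
config = instantiate(cfg)
trainer = Trainer(config)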
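
The list passed to `get_config` mixes two kinds of Hydra-style override: entries such as `model=shakespeare` (no dot in the key), and entries such as `train.log_interval=250` that address a nested value. The sketch below illustrates that distinction in plain Python. It is not how Hydra or `ngpt` parse overrides; `apply_overrides` is a hypothetical helper, and treating every un-dotted key as a config-group choice is a simplifying assumption.

```python
from typing import Any


def apply_overrides(overrides: list[str]) -> tuple[dict[str, str], dict[str, Any]]:
    """Split overrides into group choices and a nested dict of value overrides."""
    groups: dict[str, str] = {}
    values: dict[str, Any] = {}
    for item in overrides:
        key, _, raw = item.partition("=")
        if "." not in key:
            # Assumption: an un-dotted key selects an option from a config group
            groups[key] = raw
            continue
        *parents, leaf = key.split(".")
        node = values
        for part in parents:
            node = node.setdefault(part, {})
        # Keep the illustration simple: only plain integers are converted
        node[leaf] = int(raw) if raw.isdigit() else raw
    return groups, values


groups, values = apply_overrides(
    [
        "model=shakespeare",
        "train=shakespeare",
        "data=shakespeare",
        "train.log_interval=250",
        "train.eval_interval=2000",
        "optimizer=shakespeare",
        "train.dtype=bfloat16",
    ]
)
print(groups)
print(values)
```

Read this way, the cell above selects the `shakespeare` option for four groups and then adjusts three fields under `train`.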

--------------------------------------------------------------------------

  Local host:   thetagpu24
  Local device: mlx5_0
--------------------------------------------------------------------------


2023-11-10 12:36:33.586444: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


RANK: 0 / 0

[2023-11-10 12:36:37][INFO][configs.py:256] - Rescaling GAS -> GAS // WORLD_SIZE = 1 // 1

[2023-11-10 12:36:37][INFO][configs.py:391] - Tokens per iteration: 16,384

[2023-11-10 12:36:37][INFO][configs.py:423] - Using <torch.amp.autocast_mode.autocast object at 0x7f78e8d754b0>

[2023-11-10 12:36:37][INFO][configs.py:429] - Initializing a new model from scratch

[2023-11-10 12:36:37][INFO][trainer.py:144] - Initializing a new model from scratch

[2023-11-10 12:36:38][INFO][model.py:160] - number of parameters: 10.65M

[2023-11-10 12:36:39][INFO][model.py:290] - num decayed parameter tensors: 26, with 10,740,096 parameters

[2023-11-10 12:36:39][INFO][model.py:291] - num non-decayed parameter tensors: 13, with 4,992 parameters

[2023-11-10 12:36:39][INFO][model.py:297] - using fused AdamW: True


In [5]:
trainer.train()

  0%|          | 0/5000 [00:00<?, ?it/s]

[2023-11-10 12:37:19][INFO][trainer.py:473] - step=250 loss=2.077 dt=27.285 sps=36.650 mtps=0.600 mfu=13.657 train_loss=4.299 val_loss=4.291

[2023-11-10 12:37:26][INFO][trainer.py:473] - step=500 loss=1.596 dt=28.565 sps=35.007 mtps=0.574 mfu=13.595 train_loss=4.299 val_loss=4.291

[2023-11-10 12:37:33][INFO][trainer.py:473] - step=750 loss=1.418 dt=27.745 sps=36.042 mtps=0.591 mfu=13.579 train_loss=4.299 val_loss=4.291

[2023-11-10 12:37:40][INFO][trainer.py:473] - step=1000 loss=1.332 dt=27.101 sps=36.899 mtps=0.605 mfu=13.596 train_loss=4.299 val_loss=4.291

[2023-11-10 12:37:47][INFO][trainer.py:473] - step=1250 loss=1.277 dt=27.491 sps=36.376 mtps=0.596 mfu=13.592 train_loss=4.299 val_loss=4.291

[2023-11-10 12:37:54][INFO][trainer.py:473] - step=1500 loss=1.234 dt=27.454 sps=36.425 mtps=0.597 mfu=13.590 train_loss=4.299 val_loss=4.291

[2023-11-10 12:38:01][INFO][trainer.py:473] - step=1750 loss=1.175 dt=27.528 sps=36.327 mtps=0.595 mfu=13.584 train_loss=4.299 val_loss=4.291

[2023-11-10 12:38:08][INFO][trainer.py:473] - step=2000 loss=1.140 dt=26.979 sps=37.066 mtps=0.607 mfu=13.607 train_loss=4.299 val_loss=4.291

[2023-11-10 12:38:12][INFO][trainer.py:394] - Saving checkpoint to: /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT

[2023-11-10 12:38:12][INFO][trainer.py:395] - Saving model to: /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT/model.pth

[2023-11-10 12:38:12][INFO][configs.py:122] - Appending /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT to /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT/src/ngpt/ckpts/checkpoints.log

[2023-11-10 12:38:19][INFO][trainer.py:473] - step=2250 loss=1.121 dt=27.434 sps=36.451 mtps=0.597 mfu=13.605 train_loss=1.050 val_loss=1.474

[2023-11-10 12:38:26][INFO][trainer.py:473] - step=2500 loss=1.067 dt=27.166 sps=36.811 mtps=0.603 mfu=13.616 train_loss=1.050 val_loss=1.474

[2023-11-10 12:38:32][INFO][trainer.py:473] - step=2750 loss=1.034 dt=28.086 sps=35.605 mtps=0.583 mfu=13.581 train_loss=1.050 val_loss=1.474

[2023-11-10 12:38:40][INFO][trainer.py:473] - step=3000 loss=1.009 dt=26.926 sps=37.139 mtps=0.608 mfu=13.607 train_loss=1.050 val_loss=1.474

[2023-11-10 12:38:46][INFO][trainer.py:473] - step=3250 loss=0.940 dt=27.802 sps=35.968 mtps=0.589 mfu=13.586 train_loss=1.050 val_loss=1.474

[2023-11-10 12:38:53][INFO][trainer.py:473] - step=3500 loss=0.947 dt=27.399 sps=36.497 mtps=0.598 mfu=13.588 train_loss=1.050 val_loss=1.474

[2023-11-10 12:39:00][INFO][trainer.py:473] - step=3750 loss=0.885 dt=27.771 sps=36.009 mtps=0.590 mfu=13.571 train_loss=1.050 val_loss=1.474

[2023-11-10 12:39:07][INFO][trainer.py:473] - step=4000 loss=0.866 dt=27.132 sps=36.856 mtps=0.604 mfu=13.587 train_loss=1.050 val_loss=1.474

[2023-11-10 12:39:11][INFO][trainer.py:394] - Saving checkpoint to: /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT

[2023-11-10 12:39:11][INFO][trainer.py:395] - Saving model to: /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT/model.pth

[2023-11-10 12:39:11][INFO][configs.py:122] - Appending /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT to /lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT/src/ngpt/ckpts/checkpoints.log

[2023-11-10 12:39:18][INFO][trainer.py:473] - step=4250 loss=0.847 dt=27.197 sps=36.769 mtps=0.602 mfu=13.598 train_loss=0.696 val_loss=1.637

[2023-11-10 12:39:25][INFO][trainer.py:473] - step=4500 loss=0.835 dt=27.515 sps=36.344 mtps=0.595 mfu=13.593 train_loss=0.696 val_loss=1.637

[2023-11-10 12:39:32][INFO][trainer.py:473] - step=4750 loss=0.822 dt=26.982 sps=37.061 mtps=0.607 mfu=13.615 train_loss=0.696 val_loss=1.637

[2023-11-10 12:39:39][INFO][trainer.py:473] - step=5000 loss=0.808 dt=27.527 sps=36.328 mtps=0.595 mfu=13.607 train_loss=0.696 val_loss=1.637
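
The metrics in the training log appear to be related in a simple way. The sketch below parses three entries copied from the log and checks two relationships that are inferred from the numbers alone, not confirmed against the `ngpt` source: that `sps` is `1000 / dt` (which would make `dt` a per-step time in milliseconds), and that `mtps` is `sps` times the logged 16,384 tokens per iteration, in millions. The helper `parse_metrics` is hypothetical.

```python
import re

TOKENS_PER_ITER = 16_384  # from the "Tokens per iteration" log line

# Three entries copied from the training log (timestamps and prefixes removed)
LOG_LINES = [
    "step=250 loss=2.077 dt=27.285 sps=36.650 mtps=0.600 mfu=13.657 train_loss=4.299 val_loss=4.291",
    "step=2250 loss=1.121 dt=27.434 sps=36.451 mtps=0.597 mfu=13.605 train_loss=1.050 val_loss=1.474",
    "step=4250 loss=0.847 dt=27.197 sps=36.769 mtps=0.602 mfu=13.598 train_loss=0.696 val_loss=1.637",
]


def parse_metrics(line: str) -> dict[str, float]:
    """Turn 'key=value key=value ...' into a dict of floats."""
    return {k: float(v) for k, v in re.findall(r"(\w+)=([-+]?\d+(?:\.\d+)?)", line)}


records = [parse_metrics(line) for line in LOG_LINES]

for rec in records:
    sps_from_dt = 1000.0 / rec["dt"]                     # assumes dt is in milliseconds
    mtps_from_sps = rec["sps"] * TOKENS_PER_ITER / 1e6   # millions of tokens per second
    print(f"step={rec['step']:.0f} sps~{sps_from_dt:.3f} mtps~{mtps_from_sps:.3f}")

# Pick the entry with the lowest reported validation loss
best = min(records, key=lambda rec: rec["val_loss"])
print(f"lowest val_loss among these entries is reported at step {best['step']:.0f}")
```

In the log, `train_loss` and `val_loss` only change after each evaluation (every 2000 steps). Between the two evaluations, `train_loss` falls from 1.050 to 0.696 while `val_loss` rises from 1.474 to 1.637, which suggests the model has started to overfit by the end of the run.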


In [6]:
outputs = trainer.evaluate("what is the meaning of life?", num_samples=1)

[2023-11-10 12:41:38][INFO][trainer.py:513] - [prompt]: "what is the meaning of life?"

[2023-11-10 12:41:38][INFO][trainer.py:514] - > "what is the meaning of life?

JULIET:
If they have free true, come some one of them;
And there be thy consent to be thy friend,
Thy poor brother's life there have met thy death,
Or that thou shalt not be so bold as true.

ROMEO:
What lies thee? then, lords, peace! thou rag of shame!

Nurse:
Hold, O! wolves, hold me to the day!

JULIET:
A horse! and thou! and firmless lady-villain!
Forspeak and desperate gentlemen!
I wasted the traitor, and a prophetess
Comed to see a sail!
Alack, alas! that Bolingbroke! where
Shall Romeo b"

[2023-11-10 12:41:38][INFO][trainer.py:515] - ----------------------------------------------------------------------------------------------------
