<a href="https://colab.research.google.com/github/gkielian/ReaLLMASIC_nanogpt/blob/master/NanoGPT_Quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<div class="mrkdown-google-sans">

# **GPU Quickstart**

</div>

<div class="mrkdown-google-sans">

### **Install NanoGPT GPU Dependencies**

</div>


In [4]:
%cd
!rm -rf nanoGPT_gpu
!git clone https://github.com/ReaLLMASIC/nanoGPT.git nanoGPT_gpu
%cd nanoGPT_gpu

# check branch info
!echo "Cloned repository"
!git branch

!ls

!pip install --upgrade pip
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install numpy transformers datasets tiktoken wandb tqdm tensorboard torchinfo


/root
Cloning into 'nanoGPT_gpu'...
remote: Enumerating objects: 4568, done.
remote: Total 4568 (delta 0), reused 0 (delta 0), pack-reused 4568 (from 1)
Receiving objects: 100% (4568/4568), 38.85 MiB | 28.83 MiB/s, done.
Resolving deltas: 100% (2746/2746), done.
/root/nanoGPT_gpu
Cloned repository
* master
bench.py		  gpt_conf.py	     publications		 statistics_util
colabs			  huggingface_model  quantization		 steering_vector_util
config			  HW		     README.md			 tests
Contributing_Features.md  images	     requirements_cpu.txt	 tf_np_golden_gen
curriculum		  inspect_ckpts.py   run_curriculum_learning.py  train.py
data			  LICENSE	     run_experiments.py		 variations
demos			  model_info_util    run_vizier.py		 visualization_util
documentation		  model.py	     sample.py			 visualize.py
explorations		  modules	     softmax_sweep.py
factorization_util	  monitoring_util    start_tensorboard.sh
Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torchinfo
  Do

<div class="mrkdown-google-sans">

### **Run GPU Training**

</div>


In [6]:
!python3 data/shakespeare_char/prepare.py

length of dataset in characters: 1,115,394
all the unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
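
The token counts above are consistent with character-level tokenization followed by a 90/10 train/validation split. The following is a minimal sketch of that idea, not the contents of `prepare.py`: the helper names (`build_char_vocab`, `encode`, `decode`, `split_sizes`) and the 0.9 split fraction are assumptions made for illustration.

```python
def build_char_vocab(text):
    """Map each unique character to an integer id (sorted for determinism)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of integer token ids."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)


def split_sizes(n_tokens, train_fraction=0.9):
    """Return (train, val) token counts for a simple prefix/suffix split."""
    n_train = int(n_tokens * train_fraction)
    return n_train, n_tokens - n_train


sample = "To be, or not to be"
stoi, itos = build_char_vocab(sample)
ids = encode(sample, stoi)
assert decode(ids, itos) == sample  # encoding round-trips

# Under the assumed 90/10 split, the dataset length printed above
# reproduces the reported train and val token counts.
print(split_sizes(1_115_394))
```

With one token per character, the vocabulary size is simply the number of distinct characters in the text, which is why it is as small as 65 here.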


In [7]:
!python3 train.py --max_iters=2000 --out_dir="out-shakespeare" --max_sample_tokens 100

2024-11-19 00:59:16.550499: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-19 00:59:16.569455: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-19 00:59:16.575230: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-19 00:59:16.589169: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
seed:  1337
seed offset:  0
File data/shakespeare_cha

<div class="mrkdown-google-sans">

### **Run GPU Inference**

Once training is complete, you can run inference to generate tokens based on any set of input tokens as a starting point.

Some sampling parameters, such as **temperature**, help keep the output from getting stuck in repetitive loops.

**temperature** controls how random the sampling is: lower values make the output more deterministic, and higher values make it more random. Because sampling is random, changing random_seed gives different results.

Try a few temperature values between 0.2 and 2.0.
</div>
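
To make the effect of temperature concrete, here is a small standard-library sketch of temperature-scaled sampling. It illustrates the general technique (divide the logits by the temperature, apply softmax, then sample); it is not taken from `sample.py`, and the function names and example logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# A fixed seed makes the draw repeatable; a different seed may pick differently.
print(sample_token(logits, temperature=0.8, rng=random.Random(1337)))
```

At low temperature nearly all of the probability mass sits on the highest-scoring token, while at high temperature the distribution flattens and less likely tokens are drawn more often.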


In [8]:
!python3 sample.py --out_dir="out-shakespeare" --num_samples=1 --temperature 0.8

  checkpoint = torch.load(ckpt_path, map_location=args.device)
sliding window size: None
setting flash attn
sliding window size: None
setting flash attn
sliding window size: None
setting flash attn
sliding window size: None
setting flash attn
sliding window size: None
setting flash attn
sliding window size: None
setting flash attn
number of parameters: 10.65M
Loading meta from out-shakespeare/meta.pkl...
High Level Parameters:
Layer (type:depth-idx)                             Param #
GPT                                                --
├─ModuleDict: 1-1                                  --
│    └─Embedding: 2-1                              24,960
│    └─Dropout: 2-2                                --
│    └─ModuleList: 2-3                             --
│    │    └─Block: 3-1                             1,770,240
│    │    └─Block: 3-2                             1,770,240
│    │    └─Blo
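
The parameter counts in this summary can be reproduced with a short back-of-the-envelope calculation. The sketch below is one decomposition that matches the printed numbers, under stated assumptions: an embedding width of 384 (inferred from 24,960 / 65), six blocks (inferred from the six "setting flash attn" lines), a 4x MLP expansion, no bias terms, and an output head that shares weights with the token embedding. The repository's actual modules may be organized differently.

```python
def linear_params(n_in, n_out, bias=False):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def block_params(n_embd, mlp_ratio=4):
    """Parameter count of one Transformer block under the assumptions above."""
    # Attention: fused query/key/value projection plus output projection.
    attn = linear_params(n_embd, 3 * n_embd) + linear_params(n_embd, n_embd)
    # MLP: expand to mlp_ratio * n_embd, then project back.
    hidden = mlp_ratio * n_embd
    mlp = linear_params(n_embd, hidden) + linear_params(hidden, n_embd)
    # Two normalization layers with a weight vector each and no bias.
    norms = 2 * n_embd
    return attn + mlp + norms


vocab_size, n_embd, n_layer = 65, 384, 6  # n_embd and n_layer are inferred
embedding = vocab_size * n_embd
per_block = block_params(n_embd)
total = embedding + n_layer * per_block + n_embd  # plus a final norm weight

print(f"embedding: {embedding:,}")
print(f"per block: {per_block:,}")
print(f"total:     {total / 1e6:.2f}M")
```

Nearly all of the parameters sit in the blocks; with a 65-character vocabulary the embedding table is a small fraction of the model.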

<div class="mrkdown-google-sans">

# **Modular Arithmetic Exploration**

One of the biggest open questions in Transformer research is how knowledge transfers between domains.

One of our hypotheses is that the Transformer architecture can transfer prior knowledge across isomorphic systems.

If this holds, it could suggest that LLMs learn the distinct types of groups as they model natural language.

</div>

<div class="mrkdown-google-sans">

## Testing Transformer Knowledge Transfer

A convenient way to test this is with smaller models on tasks such as modular arithmetic, which train much faster than even shakespeare_char.

</div>

In [14]:
%cd
%cd nanoGPT_gpu/data/modular_addition
!bash create_examples.sh
!head data/*

/root
/root/nanoGPT_gpu/data/modular_addition
1
2
4
8
16
==> data/base_16.txt <==
dda
ffe
2f1
437
eca
4ae
4c0
880
8f7
c1d

==> data/base_1.txt <==
111111111111100011111111111110001111111111000000
111111111111111011111111111111101111111111111100
110000000000000011111111111111101000000000000000
111100000000000011100000000000001111111000000000
111111111111110011111111111100001111111111000000
111100000000000011111111110000001111111111111100
111100000000000011111111111100000000000000000000
111111110000000011111111000000000000000000000000
111111110000000011111111111111101111111000000000
111111111111000010000000000000001111111111111000

==> data/base_2.txt <==
101110110101
111111110111
010011111000
001011001110
011100110101
001001010111
001000110000
000100010000
000111111110
001110001011

==> data/base_4.txt <==
131322
333323
203310
013031
230322
012223
010300
020200
023331
031013

==> data/base_8.txt <==
515121
717161
207110
403070
614121
402161
404100
010100
017170
411051
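
Reading the samples above, each line appears to hold two operands followed by their sum modulo 16, with every number written in the file's base, least-significant digit first, and base 1 written as a run of ones padded with zeros to 16 characters. This format is inferred from the printed lines only, not from `create_examples.sh`, so treat the sketch below as an illustration; `encode_operand` and `encode_example` are hypothetical names.

```python
DIGITS = "0123456789abcdef"


def encode_operand(value, base, modulus=16):
    """Encode one number in the given base, least-significant digit first."""
    if base == 1:
        # Unary: `value` ones, zero-padded to a fixed width of `modulus`.
        return "1" * value + "0" * (modulus - value)
    # Fixed width: enough digits to represent every value below `modulus`.
    width = 1
    while base ** width < modulus:
        width += 1
    digits = []
    for _ in range(width):
        digits.append(DIGITS[value % base])
        value //= base
    return "".join(digits)


def encode_example(a, b, base, modulus=16):
    """Encode the triple (a, b, (a + b) mod modulus) as one line of text."""
    values = (a, b, (a + b) % modulus)
    return "".join(encode_operand(v, base, modulus) for v in values)


# 13 + 13 = 26, and 26 mod 16 = 10: the first sample line in every file.
for base in (16, 8, 4, 2, 1):
    print(base, encode_example(13, 13, base))

assert encode_example(13, 13, 16) == "dda"
assert encode_example(13, 13, 2) == "101110110101"
```

Every file describes the same addition table; only the surface encoding changes, which is what makes these datasets a test bed for transfer across isomorphic representations.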


In [26]:
%cd
%cd nanoGPT_gpu
%cd data/modular_addition
!python3 prepare.py -i data/base_1.txt

/root
/root/nanoGPT_gpu
/root/nanoGPT_gpu/data/modular_addition
Length of dataset: 12,544
Unique chars: 
01
Vocab size: 3
Train tokens: 11,289
Val tokens: 1,255
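
The sizes reported above can be sanity-checked with a little arithmetic. The sketch assumes, based only on the printed samples, that `base_1.txt` holds one line per ordered pair of operands in 0..15, that each line is three 16-character unary fields plus a newline, and that the split is 90/10. None of this is confirmed from `prepare.py`.

```python
MODULUS = 16
FIELD_WIDTH = 16  # unary field width per number (assumed from the samples)

chars_per_line = 3 * FIELD_WIDTH + 1  # two operands, one result, one newline
n_lines = MODULUS * MODULUS           # assumed: every ordered pair (a, b) once
dataset_len = n_lines * chars_per_line

n_train = int(0.9 * dataset_len)      # assumed 90/10 split
n_val = dataset_len - n_train

# The newline separator counts as a token alongside '0' and '1'.
vocab = sorted({"0", "1", "\n"})

print(dataset_len, n_train, n_val, len(vocab))
assert (dataset_len, n_train, n_val, len(vocab)) == (12544, 11289, 1255, 3)
```

Under the same reading, the `--block_size 147` used in the next cell spans exactly three 49-character lines, and `--max_sample_tokens 196` spans four.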


In [29]:
%cd
%cd nanoGPT_gpu
!python3 train.py --dataset modular_addition --n_layer 3 --n_head 3 --n_embd 192 --block_size 147 --max_sample_tokens 196

/root
/root/nanoGPT_gpu
2024-11-19 01:32:34.713665: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-19 01:32:34.747481: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-19 01:32:34.757204: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-19 01:32:34.783651: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
seed:  1337
seed offset:  0
F