# GPU Training Notebook

This notebook trains our model on a GPU in Google Colab.

**Before running this notebook, follow these steps:**

1. In your Google Drive, go to MyDrive and create a folder `inf265_project_3`.
2. Put the required files in this folder: `tokenizer.py`, `train.py`, `config.py` and `utils.py`.
3. In the upper-right corner, click the down arrow and select `Change runtime type`.
4. Choose `Runtime: Python3` and `Hardware accelerator: T4 GPU`. Do not select the `High-RAM` option.
5. If required, click `Connect`. The bottom status bar should read something like `Connected to Python 3 Google Compute Engine backend (GPU)`.
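
Once connected, a quick check like the following can confirm that the runtime actually exposes a GPU. This is only a minimal sketch: `describe_accelerator` is a hypothetical helper name, and the code assumes PyTorch is installed (it is preinstalled on Colab), falling back to looking for the `nvidia-smi` tool otherwise.

```python
import shutil


def describe_accelerator():
    """Return a short, human-readable description of the available accelerator."""
    try:
        import torch  # preinstalled on Colab; may be missing elsewhere
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        return f"CUDA GPU available: {torch.cuda.get_device_name(0)}"
    if shutil.which("nvidia-smi") is not None:
        return "nvidia-smi found, but PyTorch is missing or cannot see the GPU"
    return "No GPU detected; check the runtime type"


print(describe_accelerator())
```

If the message reports that no GPU was detected, revisit steps 3 and 4 before starting a long training run.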

**Warning:** Colab gives you a limited amount of free compute time every 24 hours. All the time you are connected to a GPU runtime counts towards this quota, whether or not you are training. When you are not training your model, click `Runtime >> Disconnect and delete runtime` so you don't waste your free compute.

## Install Python Libraries

Start by running the following cell to install required libraries:

In [1]:
!pip install datasets tokenizers

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.w

## Imports and Mounting Google Drive

To save the tokenizer, model and optimizer checkpoints, we will mount Google Drive in the next code cell. Make sure you have created a directory `inf265_project_3` in your Google Drive under `MyDrive` and put your Python files there.

We also use the `autoreload` Jupyter extension, which lets us re-import external files without restarting the kernel. This is useful when you need to make small changes to your Python files. You can find the files in the file browser (the folder icon in the left sidebar). Note that you must mount Google Drive before you can access the files from Colab, and that it might take a few seconds after saving before a change is picked up.
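
Conceptually, `autoreload` automates what `importlib.reload` does by hand. The sketch below illustrates the idea outside of Jupyter. Note that `demo_settings` is a hypothetical throwaway module written to a temporary directory purely for illustration; it is not one of the project files.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so the edited source is always re-read in this tiny demo.
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = Path(tmp_dir) / "demo_settings.py"  # hypothetical module
    module_path.write_text("LEARNING_RATE = 1e-4\n")

    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    import demo_settings

    value_before = demo_settings.LEARNING_RATE

    # Simulate editing the file, e.g. in the Colab file browser.
    module_path.write_text("LEARNING_RATE = 3e-4  # edited\n")

    # Without a reload, the already imported module would keep the old value.
    importlib.reload(demo_settings)
    value_after = demo_settings.LEARNING_RATE

    sys.path.remove(tmp_dir)

print(value_before, value_after)
```

With `%autoreload 2` enabled, this kind of reload happens automatically for changed modules before a cell is executed, so edits to the project files take effect without restarting the kernel.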

Run the following cell to mount Google Drive and import the necessary files.

In [2]:
%load_ext autoreload
%autoreload 2

from google.colab import drive
drive.mount('/content/drive')
import sys
sys.path.append('/content/drive/MyDrive/inf265_project_3')

from pathlib import Path
from tokenizer import train_tokenizer
from train import train_model
from config import config
from utils import print_config

# Prepend the Google Drive folder to each filename so that files are saved on Drive
gdrive_base_path = "/content/drive/MyDrive/inf265_project_3/"

if "MyDrive" not in config.tokenizer_filename:  # Only prepend once, even if this cell is re-run
  config.tokenizer_filename = gdrive_base_path + config.tokenizer_filename
  config.model_filename = gdrive_base_path + config.model_filename
  config.optimizer_filename = gdrive_base_path + config.optimizer_filename

print_config(config)

Mounted at /content/drive
Using configuration:
	seed: 0
	dataset: odinhg/gooaq-subset
	split: train
	device: cuda
	vocab_size: 20000
	min_frequency: 5
	unk_token: [UNK]
	sep_token: [SEP]
	end_token: [END]
	pad_token: [PAD]
	tokenizer_filename: /content/drive/MyDrive/inf265_project_3/temp/tokenizer.json
	embed_size: 512
	num_heads: 8
	num_layers: 5
	dropout_p: 0.1
	max_len: 128
	model_train_fraction: 1.0
	batch_size: 128
	dataloader_num_workers: 2
	lr: 0.0001
	num_epochs: 5
	model_filename: /content/drive/MyDrive/inf265_project_3/temp/model.pth
	optimizer_filename: /content/drive/MyDrive/inf265_project_3/temp/optimizer.pth
****************************************************************************************************
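
The cell above guards against prepending the Drive folder twice by checking for `"MyDrive"` in the filename. One possible alternative, sketched here under assumptions, is to treat any absolute path as already resolved. `with_base_path` is a hypothetical helper, and the `SimpleNamespace` object merely stands in for the real `config` from `config.py`.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace

GDRIVE_BASE_PATH = "/content/drive/MyDrive/inf265_project_3/"


def with_base_path(filename, base_path=GDRIVE_BASE_PATH):
    """Prepend base_path to a relative filename; leave absolute paths untouched."""
    path = PurePosixPath(filename)
    if path.is_absolute():
        return str(path)  # already resolved, e.g. when the cell is re-run
    return str(PurePosixPath(base_path) / path)


# Hypothetical stand-in for the real config object.
demo_config = SimpleNamespace(
    tokenizer_filename="temp/tokenizer.json",
    model_filename="temp/model.pth",
    optimizer_filename="temp/optimizer.pth",
)

# Running this loop twice gives the same result as running it once.
for _ in range(2):
    for name in ("tokenizer_filename", "model_filename", "optimizer_filename"):
        setattr(demo_config, name, with_base_path(getattr(demo_config, name)))

print(demo_config.model_filename)
```

Because the helper is idempotent, re-running the cell after `autoreload` picks up a change does not corrupt the paths.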


## Training the Tokenizer

Train and save the tokenizer. This might take a few minutes, but you only need to do it once: the tokenizer is saved to Google Drive and reused on later runs.

In [3]:
if not Path(config.tokenizer_filename).exists():
  tokenizer = train_tokenizer(config)
else:
  print(f"Tokenizer already exists at {config.tokenizer_filename}")

Tokenizer already exists at /content/drive/MyDrive/inf265_project_3/temp/tokenizer.json


## Training Your Model

We use the `train_model` function from `train.py`. It saves a model (and optimizer) checkpoint every 500 training iterations (batches). If you get disconnected or use up your daily compute, you can resume training later from the last checkpoint.
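
The save-and-resume pattern can be illustrated with a framework-free toy loop. This is only a sketch and not the code in `train.py`: `run_training`, its parameters, and the JSON checkpoint file are all hypothetical stand-ins, and the "loss" is a dummy quantity.

```python
import json
import tempfile
from pathlib import Path


def run_training(checkpoint_path, total_steps, checkpoint_every, stop_after=None):
    """Toy training loop that resumes from checkpoint_path if it exists."""
    state = {"step": 0, "loss_sum": 0.0}
    if checkpoint_path.exists():
        state = json.loads(checkpoint_path.read_text())  # resume from the last save

    for step in range(state["step"] + 1, total_steps + 1):
        state["loss_sum"] += 1.0 / step  # stand-in for one real optimisation step
        state["step"] = step
        if step % checkpoint_every == 0 or step == total_steps:
            checkpoint_path.write_text(json.dumps(state))  # periodic checkpoint
        if step == stop_after:
            break  # simulate a Colab disconnect
    return state


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "checkpoint.json"
    interrupted = run_training(checkpoint, total_steps=20, checkpoint_every=5, stop_after=12)
    saved_step = json.loads(checkpoint.read_text())["step"]
    resumed = run_training(checkpoint, total_steps=20, checkpoint_every=5)

print(f"stopped at step {interrupted['step']}, checkpoint held step {saved_step}, "
      f"resumed run finished at step {resumed['step']}")
```

Work done after the last checkpoint is lost and repeated after resuming, which is the price of saving only periodically. In PyTorch, `torch.save` and `torch.load` applied to the `state_dict()` of the model and the optimizer would take the place of the JSON file.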

When you have trained your model for around 3 to 5 epochs, download the model and tokenizer files (`model.pth` and `tokenizer.json`) from Google Drive and put them in your local `temp` folder. You can then use them for inference (text generation).

A single epoch takes roughly 35 to 40 minutes on a T4 GPU (about 36 minutes per epoch in the run recorded in this notebook).
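
The epoch time can be estimated with some quick arithmetic. The sketch below uses the dataset size (859765 examples) and the batch size (128) reported in this notebook's output. It assumes a throughput of roughly 3.07 iterations per second, which is what the recorded T4 run achieved, so your numbers may differ. It also assumes that the final partial batch is kept rather than dropped.

```python
import math

dataset_size = 859_765        # from "Loaded dataset of size 859765"
batch_size = 128              # config.batch_size
num_epochs = 5                # config.num_epochs
iterations_per_second = 3.07  # assumption: throughput observed on a T4

# Number of batches per epoch, keeping the last partial batch.
steps_per_epoch = math.ceil(dataset_size / batch_size)

epoch_minutes = steps_per_epoch / iterations_per_second / 60
total_hours = num_epochs * epoch_minutes / 60

print(f"{steps_per_epoch} iterations per epoch")
print(f"about {epoch_minutes:.1f} minutes per epoch")
print(f"about {total_hours:.1f} hours for {num_epochs} epochs")
```

The computed number of iterations per epoch matches the 6717 shown in the training progress bars, and the full five-epoch run comes to roughly three hours of GPU time, which is worth keeping in mind given the daily quota.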

In [4]:
train_model(config)

Number of parameters in the model: 36,261,920
Loading model and optimizer state dicts...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loaded dataset of size 859765 with columns ['question', 'answer']


W0420 01:32:22.502000 690 torch/_inductor/utils.py:1137] [0/0] Not enough SMs to use max_autotune_gemm mode
[01 | 05] Loss: 1.2781: 100%|██████████| 6717/6717 [37:44<00:00,  2.97it/s]
Mean Epoch Cross-Entropy Loss: 1.2262

[02 | 05] Loss: 1.2165: 100%|██████████| 6717/6717 [36:27<00:00,  3.07it/s]
Mean Epoch Cross-Entropy Loss: 1.1994

[03 | 05] Loss: 1.2713: 100%|██████████| 6717/6717 [36:27<00:00,  3.07it/s]
Mean Epoch Cross-Entropy Loss: 1.1776

[04 | 05] Loss: 1.0990: 100%|██████████| 6717/6717 [36:28<00:00,  3.07it/s]
Mean Epoch Cross-Entropy Loss: 1.1596

[05 | 05] Loss: 1.1821: 100%|██████████| 6717/6717 [36:24<00:00,  3.08it/s]
Mean Epoch Cross-Entropy Loss: 1.1439


OptimizedModule(
  (_orig_mod): TransformerModel(
    (embedding): Embedding(20000, 512)
    (dropout): Dropout(p=0.1, inplace=False)
    (pos_encoder): PositionalEncoding()
    (layers): ModuleList(
      (0-4): 5 x DecoderBlock(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (ln1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (ff): Sequential(
          (0): Linear(in_features=512, out_features=2048, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (ln2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (fc_out): Linear(in_features=512, out_features=20000, bias=True)
  )
)
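
As a sanity check, the parameter count reported during training (36,261,920) can be reproduced from the layer shapes printed above. The sketch below makes two assumptions that are not visible in the printout: the attention module uses a single fused input projection with bias for queries, keys and values (the default for PyTorch's `MultiheadAttention`), and `PositionalEncoding` has no learnable parameters. `linear_params` is a hypothetical helper.

```python
vocab_size = 20_000
embed_size = 512
ff_size = 2048
num_layers = 5


def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


embedding = vocab_size * embed_size

# Assumption: fused Q/K/V input projection (512 -> 3 * 512) plus the output projection.
attention = linear_params(embed_size, 3 * embed_size) + linear_params(embed_size, embed_size)
feed_forward = linear_params(embed_size, ff_size) + linear_params(ff_size, embed_size)
layer_norms = 2 * (2 * embed_size)  # two LayerNorms, each with a weight and a bias vector

decoder_block = attention + feed_forward + layer_norms
output_head = linear_params(embed_size, vocab_size)

total = embedding + num_layers * decoder_block + output_head
print(f"{total:,}")
```

Under these assumptions the total agrees with the logged value. The embedding and the output layer together account for more than half of the parameters, which follows directly from the 20000-token vocabulary.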