# Training GPT-2 with Colab and Google Drive
- badges: true
- comments: true
- categories: [gpt2,colab,drive]




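We'll be using [aitextgen](https://github.com/minimaxir/aitextgen) to train a small GPT-2 model on a text file, saving checkpoints to Google Drive so that training can resume after a Colab session ends.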

In [None]:
#collapse-output
pip install aitextgen



Import the modules and mount Google Drive

In [None]:
from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from aitextgen.TokenDataset import TokenDataset, merge_datasets
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer

mount_gdrive()

In [None]:
!curl https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt > input.txt
!head input.txt

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:


Train the tokenizer

In [None]:
file_name = "input.txt"
project_name = "project_name"

# Uncomment to train on your own file from Google Drive instead of the download above.
# copy_file_from_gdrive(file_name)
train_tokenizer(file_name);

INFO:aitextgen.tokenizers:Saving aitextgen-vocab.json and aitextgen-merges.txt to the current directory. You will need both files to build the GPT2Tokenizer.
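The model config built later sets `vocab_size=5000`, so it is worth checking that the trained vocabulary does not exceed that size. The snippet below is a minimal sketch, assuming `aitextgen-vocab.json` is a plain JSON object mapping tokens to ids; the helper name `load_vocab_size` is hypothetical and not part of aitextgen.

```python
import json


def load_vocab_size(vocab_path="aitextgen-vocab.json"):
    """Count the entries in a vocab file (assumed to be a token -> id JSON object)."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    return len(vocab)
```

In the notebook, `load_vocab_size() <= 5000` should hold before the model is built with that config.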


Training the model should take about 30 minutes on a Colab GPU.

In [None]:
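import os

model = None
config = None

# Try to restore a previous run from Google Drive; missing files are skipped.
# The tokenizer file names match the ones train_tokenizer() reported above.
for name in ["pytorch_model.bin", "config.json", "aitextgen-vocab.json", "aitextgen-merges.txt"]:
    try:
        copy_file_from_gdrive(name, project_name)
    except FileNotFoundError:
        pass

# Resume only if both the weights and the config are available locally.
if os.path.exists("pytorch_model.bin") and os.path.exists("config.json"):
    model = "pytorch_model.bin"
    config = "config.json"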

config = config or build_gpt2_config(
    vocab_size=5000, max_length=200, dropout=0.0, n_embd=256, n_layer=8, n_head=8
)

ai = aitextgen(
    vocab_file="aitextgen-vocab.json",
    merges_file="aitextgen-merges.txt",
    config=config,
    model=model,
    to_gpu=True
)

INFO:aitextgen:Constructing GPT-2 model from provided config.
INFO:aitextgen:Using a custom tokenizer.
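To get a feel for how small this model is, the parameter count can be estimated from the config values. This is a rough sketch, assuming the standard GPT-2 layout (tied input and output embeddings, a 4x-wide MLP in each block) and that `max_length` is used as the number of positions; `gpt2_param_count` is a hypothetical helper, not an aitextgen function.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters for a standard GPT-2 layout with tied embeddings."""
    token_emb = vocab_size * n_embd
    pos_emb = n_positions * n_embd
    # Per block: attention (4*n^2 + 4n), MLP (8*n^2 + 5n), two layer norms (4n).
    per_block = 12 * n_embd ** 2 + 13 * n_embd
    final_ln = 2 * n_embd
    return token_emb + pos_emb + n_layer * per_block + final_ln


print(gpt2_param_count(vocab_size=5000, n_positions=200, n_embd=256, n_layer=8))
```

Under these assumptions the config above comes to roughly 7.6 million parameters. `n_head` does not appear because it only changes how the attention weights are split, not how many there are.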


In [None]:
ai.train(
    file_name,
    line_by_line=False,
    num_steps=10000,
    generate_every=1000,
    save_every=500,
    learning_rate=1e-4,
    batch_size=128,
    save_gdrive=True,
    run_id=project_name
)


INFO:aitextgen.TokenDataset:Encoding 40,000 sets of tokens from input.txt.
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: False, using: 0 TPU cores
INFO:lightning:TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]
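The `ai.train` arguments above determine how often checkpoints and sample text appear during the run. The sketch below works that out, assuming a save or sample happens at every multiple of its interval and that each sequence in a batch holds at most `max_length` tokens; `training_schedule` is a hypothetical helper for illustration only.

```python
def training_schedule(num_steps, save_every, generate_every, batch_size, max_length):
    """Summarise a training run from its step counts and batch shape."""
    return {
        "checkpoints": num_steps // save_every,
        "sample_rounds": num_steps // generate_every,
        "max_tokens_per_step": batch_size * max_length,
    }


schedule = training_schedule(
    num_steps=10000, save_every=500, generate_every=1000, batch_size=128, max_length=200
)
print(schedule)
```

With the values used here, that is 20 checkpoint saves and 10 rounds of sample text over the run.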






Generating examples

In [None]:
ai.generate(
    n=5,
    batch_size=5,
    prompt="Speak:",
    temperature=1.0,
    top_p=0.9,
)
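The `temperature` and `top_p` arguments control how the next token is sampled. The following is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering, the general technique these parameters usually refer to. It is not aitextgen's internal code, and `top_p_filter` is a hypothetical name.

```python
import math


def top_p_filter(logits, temperature=1.0, top_p=0.9):
    """Return sampling probabilities after temperature scaling and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise over the kept tokens; everything else gets probability zero.
    kept_total = sum(probs[i] for i in kept)
    return [probs[i] / kept_total if i in kept else 0.0 for i in range(len(probs))]
```

For a four-token distribution of 0.5, 0.3, 0.15 and 0.05, `top_p=0.9` keeps the first three tokens and drops the least likely one before sampling.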