# NanoGPT in Google Colab!

There are three programming "cells" in this notebook. Each appears in a grayish box. A cell can be run or re-run by hovering over the cell and pressing the play button that appears in the upper left margin.

## STEP 1. Select and upload your data
Run the first cell below, press the "Browse" button that appears below it, and select one or more plaintext files from your computer to upload. These files are the data the character-level NanoGPT model will be trained on.

If you wish to train a model on new data later on, be sure to rerun this cell first.

In [7]:
from google.colab import files

print("Please select one or more plaintext files you want to train the LLM with.")
uploaded = files.upload()

# Read the data into one string.
data = '\n'.join([uploaded[filename].decode('utf-8') for filename in uploaded.keys()])

Please select one or more plaintext files you want to train the LLM with.


Saving bartleby.txt to bartleby.txt
Saving confidence-man.txt to confidence-man.txt
Saving mobydick.txt to mobydick.txt
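
As a small illustration of what the cell above does with the uploads: `files.upload()` returns a dictionary that maps each filename to the file's raw bytes, and the cell decodes each value as UTF-8 and joins the results with newlines. The sketch below imitates that with a hand-made dictionary so it can run outside Colab; the filenames and contents are made up for the example.

```python
# Stand-in for the dict returned by files.upload(): filename -> raw bytes.
# (Hypothetical filenames and contents, for illustration only.)
uploaded = {
    "first.txt": "Call me Ishmael.".encode("utf-8"),
    "second.txt": "I would prefer not to.".encode("utf-8"),
}

# Decode each file as UTF-8 and join everything into one training string.
data = "\n".join(content.decode("utf-8") for content in uploaded.values())

print(len(data), "characters")
print(data)
```

A file that is not valid UTF-8 would raise a `UnicodeDecodeError` at the decode step, so it helps to save your text files as UTF-8 before uploading.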


## STEP 2. Set up the environment and train your model

The next cell installs the code needed to train and run a model, then trains a new model on your data. If you change your dataset by rerunning the cell above, rerun this cell afterwards to retrain the model.

When training completes, a message will tell you so and prompt you to move on to the next cell.

In [11]:
# Install necessary packages.
!pip install torch numpy transformers datasets tiktoken wandb tqdm

# Clone the repo; update it if it already exists (e.g., when re-running this cell).
!git clone https://github.com/hafeild/nanoGPT-colab.git
!cd nanoGPT-colab && git pull

# Get and prepare the data for training.
import importlib
import sys
# caution: path[0] is reserved for script path (or '' in REPL)
sys.path.insert(1, 'nanoGPT-colab/data/google_colab_char')
import prepare
importlib.reload(prepare) # In the event that we're re-running after a git repo update.
prepare.prepareGoogleColab(data)

# Train.
!cd nanoGPT-colab/ && python train.py config/train_google_colab_char.py \
    --device=cpu --compile=False --eval_iters=20 --log_interval=1 \
    --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 \
    --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0

print('\nTRAINING IS COMPLETE! Please run the next cell to generate some output!')

fatal: destination path 'nanoGPT-colab' already exists and is not an empty directory.
Already up to date.
all the unique characters: 
 !"#$%&'()*+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz|Æàáæèéëïö—‘’“”•™﻿
vocab size: 104
train has 1,701,332 tokens
val has 189,037 tokens
Overriding config with config/train_google_colab_char.py:
"""
File:   google_colab_char/prepare.py
Author: Adapted by Henry Feild from Andrej Karpathy's config/train_shakespeare_char.py
Date:   01-Oct-2023
Purpose: Trains a miniature character-level shakespeare model.
"""
out_dir = 'out-google-colab-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'google-colab-char'
wandb_run_name = 'mini-gpt'

dataset = 'google_colab_char'
gradie
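
The log above reports a vocabulary of 104 unique characters and a split of 1,701,332 training tokens and 189,037 validation tokens, which is consistent with every character becoming one token and roughly 90% of the text going to training. The sketch below shows one way such a character-level preparation step can work. It is an assumption modeled on the character-level Shakespeare example in upstream nanoGPT, not a copy of this repo's `prepare.py`, and the helper names `build_vocab`, `encode`, and `decode` are hypothetical.

```python
def build_vocab(text):
    """Map every unique character to an integer id, and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}  # character -> id
    itos = {i: ch for ch, i in stoi.items()}      # id -> character
    return chars, stoi, itos

def encode(text, stoi):
    """Turn a string into a list of integer token ids."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)

sample = "Call me Ishmael."
chars, stoi, itos = build_vocab(sample)
ids = encode(sample, stoi)

# Assumed 90/10 split: the first 90% of tokens train, the rest validate.
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]

print("vocab size:", len(chars))
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")
```

Because the vocabulary is built from the uploaded text itself, a prompt that contains a character missing from the training data could not be encoded by a scheme like this one.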

## STEP 3. Generate text from the model

The cell below allows you to change some settings via a form displayed on the right. You can run this cell as many times as you want after you've trained the model above.
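
One of the settings in the form is the temperature. A common way to apply a temperature is to divide the model's raw scores by it before converting them to probabilities, then draw the next character from those probabilities. The sketch below illustrates that idea in plain Python. It is not taken from `sample.py`; the function names and the scores are made up for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_char(chars, logits, temperature, rng=random):
    """Draw one character according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(chars, weights=probs, k=1)[0]

chars = ["a", "b", "c"]
logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters

for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

print(sample_next_char(chars, logits, 0.8))
```

Lower temperatures concentrate probability on the highest-scoring character, while higher temperatures flatten the distribution and make unlikely characters more common. A temperature of exactly 0 is undefined in this sketch because it divides by the temperature, so if the slider's minimum value causes an error, try a small positive value instead.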

In [12]:
# @title Text generation settings
# @markdown Specify each of the following before running this cell.

# @markdown **Prompt**: The text the model starts from; the model generates the continuation.
prompt = 'If I had a million dollars, '  # @param {type: "string"}

# @markdown **Temperature**: This is the amount of randomness used to select the next character. 0 is least random, higher is more random. The default is 0.8.
temperature = 0.8 # @param {type: "slider", min: 0, max: 2, step: 0.05}

# @markdown **Number of passages**: Use this to select the number of samples you want to generate.
numberOfPassages = 4  # @param {type: "slider", max: 20, min: 1}

# Generate new text.
!cd nanoGPT-colab/ && python sample.py --out_dir='out-google-colab-char' --device='cpu' --start="$prompt" --num_samples=$numberOfPassages --temperature=$temperature



Overriding: out_dir = out-google-colab-char
Overriding: device = cpu
Overriding: start = If I had a million dollars, 
Overriding: num_samples = 4
Overriding: temperature = 0.8
number of parameters: 0.80M
Loading meta from data/google_colab_char/meta.pkl...
If I had a million dollars, the pinivaling
if the toest subers so sow at be this of the stlet of the
dontol in lofgedinht. But gond the onk en to him the nose hus onlaction.


Tight Chib, I That Alabsionseriat Speayly, I Fing, soing the Phot I was Quequecon-----hachan serng a terr Sack
in dending irpoic for witht at antuarss instain Aptered, Ind rownd ther
ways in on havng his of-spare of some the standice apps, in or as the winds on my
man frright what barke an bemptole in blowds at verys plarked as to man kiponse
so uth
---------------
If I had a million dollars, and the come Purojidical,
collower for a myard in that on of berfort abre of untotherstance upppore at time anductions, and
that momedned anything in rappion, me are of or
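
The log above reports `number of parameters: 0.80M`. As a rough check, the sketch below counts the weights implied by the training flags from Step 2 (`--n_layer=4`, `--n_embd=128`, `--block_size=64`) and the vocabulary size of 104. It rests on several assumptions about the model code, which this notebook does not show: a GPT-2-style block with no bias terms, an MLP hidden size of four times `n_embd`, output weights shared with the token embedding, and position embeddings left out of the reported figure. The function name `approx_gpt_params` is hypothetical.

```python
def approx_gpt_params(n_layer, n_embd, vocab_size, block_size):
    """Estimate weight counts for a small GPT under the stated assumptions."""
    # Attention: combined query/key/value projection plus output projection.
    attn = n_embd * 3 * n_embd + n_embd * n_embd
    # MLP: expand to 4 * n_embd, then project back down.
    mlp = n_embd * 4 * n_embd + 4 * n_embd * n_embd
    # Two layer norms per block, weights only (no bias assumed).
    norms = 2 * n_embd
    per_layer = attn + mlp + norms

    token_emb = vocab_size * n_embd  # assumed shared with the output layer
    pos_emb = block_size * n_embd
    final_norm = n_embd

    total = n_layer * per_layer + token_emb + pos_emb + final_norm
    # Assumed reporting convention: position embeddings are not counted.
    return total, total - pos_emb

total, reported = approx_gpt_params(n_layer=4, n_embd=128, vocab_size=104, block_size=64)
print(f"all weights: {total:,}")
print(f"excluding position embeddings: {reported / 1e6:.2f}M")
```

Under these assumptions, the number of attention heads (`--n_head=4`) does not change the count, because the heads only split the same projection matrices into smaller pieces.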