# NanoGPT in Google Colab!

There are three programming "cells" in this notebook. Each appears in a grayish box. A cell can be run or re-run by hovering over the cell and pressing the play button that appears in the upper left margin.

## STEP 1. Select and upload your data
Run the first cell below, press the "Browse" button that appears below it, and select one or more plaintext files from your computer to upload. These files are used to train the character-level NanoGPT model.

If you wish to train a model on new data later on, be sure to rerun this cell first.

In [None]:
from google.colab import files

print("Please select one or more plaintext files you want to train the LLM with.")
uploaded = files.upload()

# Read the data into one string.
data = '\n'.join([uploaded[filename].decode('utf-8') for filename in uploaded.keys()])
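
The cell above assumes every uploaded file is valid UTF-8, so `decode('utf-8')` raises an error on anything else. If that happens, one option is to decode more leniently. The following is a minimal sketch, not part of the three cells in this notebook. The helper name `join_uploads` and the file names in the example are hypothetical, and the example dictionary stands in for the filename-to-bytes mapping returned by `files.upload()`.

```python
def join_uploads(uploaded: dict[str, bytes]) -> str:
    """Join uploaded files into one string, tolerating invalid UTF-8."""
    texts = []
    for filename, raw in uploaded.items():
        # errors="replace" substitutes U+FFFD for undecodable bytes
        # instead of raising UnicodeDecodeError.
        texts.append(raw.decode("utf-8", errors="replace"))
    return "\n".join(texts)


# Stand-in for the dict returned by files.upload() (hypothetical file names).
example = {"a.txt": b"hello", "b.txt": b"caf\xe9"}
joined = join_uploads(example)
print(len(joined), "characters,", len(set(joined)), "distinct characters")
```

Printing the character count and the number of distinct characters is a quick way to confirm the upload looks the way you expect before training.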

## STEP 2. Set up the environment and train your model

This next cell installs all of the code you need to train and run a model, then trains a new model on your data. If you ever change your dataset by rerunning the cell above, you need to rerun this cell afterwards to train on the new data.

When training has completed, you will see a message telling you to move on to the next cell.
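
For intuition about what "preparing" data for a character-level model usually involves, here is a minimal sketch. It is an assumption about the general technique, not a copy of the repository's `prepare.prepareGoogleColab`, which may differ. The function names and the 90/10 train/validation split are illustrative choices.

```python
def build_char_vocab(text):
    """Map each distinct character to an integer ID and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of integer IDs, one per character."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of integer IDs back into a string."""
    return "".join(itos[i] for i in ids)


sample = "hello world"
stoi, itos = build_char_vocab(sample)
ids = encode(sample, stoi)

# Assumed split: first 90% for training, the rest for validation.
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]
print(len(stoi), "distinct characters;", len(train_ids), "train IDs;", len(val_ids), "validation IDs")
```

The model never sees letters directly. It only sees these integer IDs, and generated IDs are mapped back to characters at sampling time.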

In [None]:
# Install necessary packages.
!pip install torch numpy transformers datasets tiktoken wandb tqdm

# Clone the repo; update it if necessary (if re-running this cell).
!git clone https://github.com/hafeild/nanoGPT-colab.git
!cd nanoGPT-colab && git pull

# Get and prepare the data for training.
import importlib
import sys
# caution: path[0] is reserved for script path (or '' in REPL)
sys.path.insert(1, 'nanoGPT-colab/data/google_colab_char')
import prepare
importlib.reload(prepare) # In the event that we're re-running after a git repo update.
prepare.prepareGoogleColab(data)

# Train.
!cd nanoGPT-colab/ && python train.py config/train_google_colab_char.py \
    --device=cpu --compile=False --eval_iters=20 --log_interval=1 \
    --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 \
    --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0

print('\nTRAINING IS COMPLETE! Please run the next cell to generate some output!')
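
The training flags above (`--n_layer=4`, `--n_embd=128`, `--block_size=64`) describe a very small model, which is why it can train on a CPU. As a rough, hedged estimate of its size, the sketch below assumes a standard GPT-2 style block (about `4 * n_embd**2` weights for attention and `8 * n_embd**2` for the feed-forward layers), ignores biases and layer norms, and assumes the output layer shares weights with the token embedding. The `vocab_size` of 65 is a made-up example; the real value depends on how many distinct characters your data contains.

```python
def estimate_gpt_params(n_layer, n_embd, block_size, vocab_size):
    """Rough parameter count for a small GPT-style model."""
    # Each block: ~4*d^2 (attention) + ~8*d^2 (feed-forward) = 12*d^2 weights.
    block_weights = 12 * n_layer * n_embd ** 2
    # One embedding vector per character in the vocabulary.
    token_embedding = vocab_size * n_embd
    # One embedding vector per position in the context window.
    position_embedding = block_size * n_embd
    return block_weights + token_embedding + position_embedding


# vocab_size=65 is a hypothetical value for illustration only.
total = estimate_gpt_params(n_layer=4, n_embd=128, block_size=64, vocab_size=65)
print(f"Roughly {total:,} parameters")
```

The training script's own log output is the authoritative source for the actual count; this estimate is only meant to give a sense of scale.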

## STEP 3. Generate text from the model

The cell below allows you to change some settings via a form displayed on the right. You can run this cell as many times as you want after you've trained the model above.

In [None]:
# @title Text generation settings
# @markdown Specify each of the following before running this cell.

# @markdown **Prompt**: This is the text the model starts with; it then generates the text that follows.
prompt = 'If I had a million dollars, '  # @param {type: "string"}

# @markdown **Temperature**: This is the amount of randomness used to select the next character. 0 is least random, higher is more random. The default is 0.8.
temperature = 0.8 # @param {type: "slider", min: 0, max: 2, step: 0.05}

# @markdown **Number of passages**: Use this to select the number of samples you want to generate.
numberOfPassages = 4  # @param {type: "slider", max: 20, min: 1}

# Generate new text.
!cd nanoGPT-colab/ && python sample.py --out_dir='out-google-colab-char' --device='cpu' --start="$prompt" --num_samples=$numberOfPassages --temperature=$temperature
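
For intuition about the temperature setting above, the sketch below shows the usual technique: divide the model's raw scores (logits) by the temperature, convert them to probabilities with softmax, then sample. This is an illustration under that assumption, not the code in the repository's `sample.py`. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper name.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # This sketch divides by temperature, so it needs a positive value.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters
cool = softmax_with_temperature(logits, 0.5)  # favors the top candidate more
warm = softmax_with_temperature(logits, 2.0)  # spreads probability more evenly

# Sample one candidate index using the "cool" probabilities.
rng = random.Random(0)
choice = rng.choices(range(len(logits)), weights=cool, k=1)[0]
print("cool:", cool)
print("warm:", warm)
print("sampled index:", choice)
```

Lower temperatures concentrate probability on the most likely character, so output becomes more repetitive and predictable; higher temperatures flatten the distribution, so output becomes more varied and more error-prone.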

