<a href="https://colab.research.google.com/github/jkraybill/gpt-2/blob/finetuning/GPT2-finetuning2-774M.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To try out GPT-2, do this:

- Go to the "Runtime" menu, click "Change runtime type", and make sure this is a Python 3 notebook running with GPU hardware acceleration.
- Use the "Files" section on the left to upload a text file called "train.txt" containing all the text you want to train on, and a text file called "val.txt" containing all the text you want to validate on.
- Run the steps below in order.
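
If you only have a single corpus file, one possible way to produce "train.txt" and "val.txt" is sketched below. This is not part of the repository: `split_corpus`, the `corpus.txt` file name, and the 10% validation fraction are all assumptions made for illustration.

```python
import random

def split_corpus(text, val_fraction=0.1, seed=0):
    """Split text into (train, val) on blank-line paragraph boundaries."""
    # Treat each blank-line-separated chunk as one unit so no paragraph is cut in half.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(paragraphs)
    # Hold out at least one paragraph whenever there are two or more.
    n_val = max(1, int(len(paragraphs) * val_fraction)) if len(paragraphs) > 1 else 0
    val, train = paragraphs[:n_val], paragraphs[n_val:]
    return "\n\n".join(train), "\n\n".join(val)

# Example usage (corpus.txt is a hypothetical file name):
# with open("corpus.txt", encoding="utf-8") as f:
#     train_text, val_text = split_corpus(f.read())
# with open("train.txt", "w", encoding="utf-8") as f:
#     f.write(train_text)
# with open("val.txt", "w", encoding="utf-8") as f:
#     f.write(val_text)
```

Note that shuffling discards the original paragraph order. If order matters for your text, holding out a contiguous chunk instead may be a better choice.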

In [0]:
import os
import json
import random
import re

In [0]:
!git clone https://github.com/jkraybill/gpt-2.git

In [0]:
%cd gpt-2

In [0]:
!pip3 install -r requirements.txt

In [0]:
!sh download_model.sh 774M

The step below encodes your corpus into the tokenized "NPZ" format used for GPT-2 training.

In [0]:
!PYTHONPATH=src ./encode.py --in-text ../train.txt --out-npz train.txt.npz --model_name 774M
!PYTHONPATH=src ./encode.py --in-text ../val.txt --out-npz val.txt.npz --model_name 774M

The step below runs training. It stops automatically once validation loss has stopped decreasing for a while, but you may want to interrupt it earlier.

"sample_every" controls how often you get sample output from the trained model.

"save_every" controls how often the model is saved.

"learning_rate" is the optimizer's learning rate, i.e. how large a step each weight update takes. 0.00005 is the rate I've gotten the best results with, but I think most people run with significantly higher rates, so you could try adjusting it.

"layers_to_train" controls the number of layers to fine-tune. Reduce this number if you are getting out-of-memory errors. I hit them in Colab at values larger than 336 for the 774M model.
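
The exact auto-stop rule lives in trainval.py. The block below is only a generic, patience-based sketch of the idea "stop once validation loss stops going down for a while", not the script's actual code; `should_stop` and `patience` are hypothetical names.

```python
def should_stop(val_losses, patience=3):
    """Return True once the last `patience` evaluations failed to beat the earlier best."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    # Best validation loss seen before the most recent `patience` evaluations.
    best_before = min(val_losses[:-patience])
    # Stop only if none of the recent evaluations improved on that best value.
    return min(val_losses[-patience:]) >= best_before
```

A larger `patience` tolerates longer plateaus before giving up, at the cost of extra training time.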


In [0]:
!PYTHONPATH=src ./trainval.py --dataset train.txt.npz --valset val.txt.npz --sample_every=1000 --save_every=25 --learning_rate=0.00005 --stop_after=60000 --model_name=774M --batch_length=512 --layers_to_train=336

The step below copies your trained checkpoint into the model directory so that sample generation uses your fine-tuned weights. If you skip it, you will be sampling from the stock pretrained GPT-2 model, without your fine-tuning.

In [0]:
!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/774M/
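
For reference, a rough Python equivalent of the copy step might look like the sketch below. `copy_checkpoint` is a hypothetical helper, not something the repository provides, and `dirs_exist_ok` assumes Python 3.8 or newer.

```python
import os
import shutil

def copy_checkpoint(run_dir, model_dir):
    """Copy every file and subdirectory from run_dir into model_dir, overwriting duplicates."""
    os.makedirs(model_dir, exist_ok=True)
    for name in os.listdir(run_dir):
        src = os.path.join(run_dir, name)
        dst = os.path.join(model_dir, name)
        if os.path.isdir(src):
            # Merge into an existing directory instead of failing (Python 3.8+).
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)
    return sorted(os.listdir(model_dir))

# Example usage with the paths from the cell above:
# copy_checkpoint("/content/gpt-2/checkpoint/run1", "/content/gpt-2/models/774M")
```

Like the `cp -r` command, this overwrites files with the same name in the model directory, so keep a copy of the original model files if you want to go back to them.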

Run the below step to generate unconditional samples (i.e. "dream mode").

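"top_k" controls how many candidate tokens are considered at each step (the larger, the more "diverse" the output; anything from 1 to about 50 usually works, and I think values around 10 are pretty good).

"temperature" controls how random the sampling is. It is typically set between 0 and 1 (it must be greater than 0), and higher values give more "random" output.

"length" controls the number of tokens in each sample output. Tokens are word pieces, so this is usually somewhat fewer words than the number suggests.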
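
To make "top_k" and "temperature" more concrete, here is a minimal pure-Python sketch of top-k sampling with temperature scaling over a list of scores. It illustrates the general technique only and is not the repository's sampling code; `sample_top_k` and its arguments are hypothetical names.

```python
import math
import random

def sample_top_k(logits, top_k=20, temperature=0.8, rng=random):
    """Sample an index from `logits` using top-k filtering and temperature scaling."""
    # Keep only the indices of the top_k highest-scoring candidates.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k]
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [logits[i] / temperature for i in kept]
    # Softmax weights, subtracting the max for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With `top_k=1` the highest-scoring candidate is always chosen, which is why low "top_k" and low "temperature" settings give repetitive but predictable output.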

This command keeps generating samples until you interrupt it.

In [0]:
!python3 src/generate_unconditional_samples.py --top_k 20 --temperature 0.8 --length=300 --model_name=774M

Run the command below to run in interactive / "completion" mode. You will get a prompt; just type in whatever prompt text you want, and the model will attempt to complete it "nsamples" times.

"top_k", "length", and "temperature" work as specified above.

In [0]:
!python3 src/interactive_conditional_samples.py --top_k 1 --length=30 --temperature 0.1 --nsamples 3 --model_name=774M