<a href="https://colab.research.google.com/github/jkraybill/gpt-2/blob/finetuning/GPT2-finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To try out GPT-2:

- Open the "Runtime" menu, click "Change runtime type", and make sure this is a Python 3 notebook with GPU hardware acceleration.
- Use the "Files" panel on the left to upload a text file named "corpus.txt" containing all the text you want to train on.
- Run the steps below in order.
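
Before running the steps, it can help to confirm that the upload landed where the later cells expect it. The sketch below is a hypothetical helper (`describe_corpus` is not part of the repository) that reports a text file's size and line count. The path `/content/corpus.txt` in the usage comment is an assumption, inferred from the `../corpus.txt` argument that the encoding step uses from inside `/content/gpt-2`.

```python
import os


def describe_corpus(path):
    """Return (size_in_bytes, line_count) for a text file, or raise if it is unusable."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} not found; upload it via the Files panel first")
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"{path} is empty")
    # errors="replace" keeps the count going even if a few bytes are not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        line_count = sum(1 for _ in handle)
    return size, line_count


# Example (assumed Colab location of the uploaded file):
# size, lines = describe_corpus("/content/corpus.txt")
```

A very small corpus will still train, but the model is likely to memorize it, so the size reported here is a useful sanity check before spending GPU time.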

In [0]:
import os
import json
import random
import re

In [2]:
!git clone https://github.com/jkraybill/gpt-2.git

Cloning into 'gpt-2'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 160 (delta 0), reused 0 (delta 0), pack-reused 157
Receiving objects: 100% (160/160), 4.35 MiB | 1.81 MiB/s, done.
Resolving deltas: 100% (81/81), done.


In [3]:
cd gpt-2

/content/gpt-2


In [4]:
!pip3 install -r requirements.txt

Collecting fire>=0.1.3 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/5a/b7/205702f348aab198baecd1d8344a90748cb68f53bdcd1cc30cbc08e47d3e/fire-0.1.3.tar.gz
Collecting regex==2017.4.5 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)
     |████████████████████████████████| 604kB 28.5MB/s
Building wheels for collected packages: fire, regex
  Building wheel for fire (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/2a/1a/4d/6b30377c3051e76559d1185c1dbbfff15aed31f87acdd14c22
  Building wheel for regex (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/75/07/38/3c16b529d50cb4e0cd3dbc7b75cece8a09c132692c74450b01
Successfully built fire regex
ERROR: spacy 2.0.18 has requirement regex==2018.01.10, but you'll have regex 2017.4.5 which is incompat

In [5]:
!sh download_model.sh 117M

Fetching 117M/checkpoint
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    77  100    77    0     0   3850      0 --:--:-- --:--:-- --:--:--  3850
Fetching 117M/encoder.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1017k  100 1017k    0     0  33.1M      0 --:--:-- --:--:-- --:--:-- 33.1M
Fetching 117M/hparams.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    90  100    90    0     0   5625      0 --:

The step below tokenizes your corpus and saves it in the NumPy ".npz" format that the GPT-2 training script reads.



In [7]:
!PYTHONPATH=src ./encode.py --in-text ../corpus.txt --out-npz corpus.txt.npz

Reading files
100% 1/1 [00:00<00:00,  1.59it/s]
Writing corpus.txt.npz
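
Once the encoding step has written `corpus.txt.npz`, you can peek inside it without loading any arrays, because a NumPy `.npz` file is an ordinary zip archive of `.npy` members. The helper below is a hypothetical sketch (`list_npz_members` is not part of the repository), and the exact member names that `encode.py` produces are not assumed here.

```python
import zipfile


def list_npz_members(path):
    """Return (member_name, uncompressed_bytes) pairs for a .npz archive."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Example, run from /content/gpt-2 after the encoding step:
# for name, size in list_npz_members("corpus.txt.npz"):
#     print(name, size)
```

If the archive lists no members, or only a few bytes, the corpus was probably empty or the input path was wrong.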


The next step runs training. I usually get usable results with "stop_after" (the number of training steps) anywhere from 800 to 3000, but you can try going even higher. 800 steps take only a few minutes.

"sample_every" controls how often, in training steps, you get sample output from the model being trained.

"save_every" controls how often, in training steps, the model is saved.

"learning_rate" is the optimizer's learning rate, i.e. how large each weight update is. 0.00005 is the rate I've gotten the best results with, but I think most people are running with significantly higher rates, so you could try adjusting it.

In [8]:
!PYTHONPATH=src ./train.py --dataset corpus.txt.npz --sample_every=200 --save_every=250 --learning_rate=0.00005 --stop_after=800


2019-05-04 08:10:31.678603: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-05-04 08:10:31.679072: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x207a520 executing computations on platform Host. Devices:
2019-05-04 08:10:31.679114: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-04 08:10:31.959328: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-04 08:10:31.959938: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2079e40 executing computations on platform CUDA. Devices:
2019-05-04 08:10:31.959970: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-05-04 08:10:31.960381: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found de

The step below copies your fine-tuned checkpoint into the model directory so that the sampling scripts load it. If you skip this step, you will be sampling from the original pretrained GPT-2 model without your fine-tuning.

In [0]:
!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/
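
The same copy can be done from Python if you want an explicit error when the checkpoint directory is missing. This is a minimal sketch, not the notebook's own method: `copy_checkpoint` is a hypothetical helper, and unlike `cp -r` it copies only the regular files at the top level of the run directory, not subdirectories.

```python
import os
import shutil


def copy_checkpoint(run_dir, model_dir):
    """Copy every regular file from run_dir into model_dir, overwriting same-named files."""
    if not os.path.isdir(run_dir):
        raise FileNotFoundError(f"{run_dir} does not exist; has training saved a checkpoint yet?")
    os.makedirs(model_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(run_dir)):
        source = os.path.join(run_dir, name)
        if os.path.isfile(source):  # skip subdirectories (assumption: they are not needed)
            shutil.copy2(source, os.path.join(model_dir, name))
            copied.append(name)
    return copied


# Example, using the paths from the cell above:
# copy_checkpoint("/content/gpt-2/checkpoint/run1", "/content/gpt-2/models/117M")
```

The returned list of file names makes it easy to confirm that checkpoint files were actually copied before you start sampling.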

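Run the step below to generate unconditional samples (i.e. "dream mode").

"top_k" controls how many candidate tokens are considered at each step (the larger, the more "diverse" the output; anything from 1 to about 50 usually works, and I think values around 10 are pretty good).

"temperature" controls how random the sampling is. Values close to 0 make the model pick its most likely token almost every time, while 1 samples from the model's unmodified probabilities. I stay between 0 and 1; values above 1 are also possible and make the output more random still.

"length" controls the number of tokens (word pieces, not whole words) in each sample output.

This command will run continuously until you stop it.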
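
To make the two sampling knobs concrete, here is a minimal pure-Python sketch of top-k sampling with temperature. It is an illustration only, not the repository's TensorFlow implementation: `sample_top_k` and its arguments are hypothetical names, and `logits` stands for the model's raw scores for each candidate token.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng=random):
    """Pick an index from `logits` using top-k filtering and temperature scaling."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1 in this sketch")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the indices of the top_k highest-scoring candidates.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Dividing by the temperature sharpens (< 1) or flattens (> 1) the distribution.
    scaled = [logits[i] / temperature for i in ranked]
    # Softmax weights; subtracting the maximum avoids overflow in exp().
    peak = max(scaled)
    weights = [math.exp(score - peak) for score in scaled]
    # Draw one surviving candidate in proportion to its weight.
    return rng.choices(ranked, weights=weights, k=1)[0]


# Example: with top_k=1 the highest-scoring index is always chosen.
# sample_top_k([0.1, 2.0, -1.0], top_k=1, temperature=1.0)
```

With `top_k=1` the choice is always the single best candidate regardless of temperature, which is why very small `top_k` values tend to produce repetitive text; raising `top_k` or `temperature` lets lower-scoring candidates through more often.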

In [0]:
!python3 src/generate_unconditional_samples.py --top_k 10 --temperature 1 --length=300

Traceback (most recent call last):
  File "src/generate_unconditional_samples.py", line 7, in <module>
    import tensorflow as tf
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/__init__.py", line 24, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/__init__.py", line 52, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/core/framework/graph_pb2.py", line 6, in <module>
    from google.protobuf import descriptor as _descriptor
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/__init__.py", line 37, in <module>
    __import__('pkg_resources').declare_namespace(__name__)
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 3241, in <module>
    @_call_aside
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 3225, in _call_aside
    f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pkg_resource

Run the command below to start interactive ("completion") mode. At the prompt, type whatever text you want, and the model will attempt to complete it "nsamples" times.

"top_k", "length", and "temperature" work as specified above.

In [11]:
!python3 src/interactive_conditional_samples.py --top_k 10 --length=300 --temperature 1 --nsamples 10

2019-05-04 08:29:30.727211: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-05-04 08:29:30.727495: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x29f6260 executing computations on platform Host. Devices:
2019-05-04 08:29:30.727536: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-04 08:29:30.938981: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-04 08:29:30.939614: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x29f5e40 executing computations on platform CUDA. Devices:
2019-05-04 08:29:30.939669: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-05-04 08:29:30.940136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found de