<a href="https://colab.research.google.com/github/mowillia/phantom_pen/blob/master/gpt2_training_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### GPT-2 Training -- Google Colab


This notebook contains code used to train the models for phantom pen.

[Modified from [ak9250's guide](https://github.com/ak9250/gpt-2-colab/blob/master/GPT_2.ipynb) to training GPT-2 with nshepperd's gpt-2 fork]


**General Note:** This notebook will not run as-is on your (the reader's) computer. Instead, use it as a guide for writing a similar notebook that points to your own text corpora directory and Google Drive.


#### Preparing for Training

1. Ensure that GPU is enabled in Colab. Go to Edit->Notebook Settings-> Hardware Accelerator -> GPU


2. Since Colab resets after 12 hours, copy the current notebook to your Google Drive. File -> Save a copy in Drive.

**Important:** The model saves its training parameters in "checkpoints". Because of the 12-hour reset, make sure to save your model checkpoints before the 12-hour mark and, most importantly, copy those checkpoints to your personal Drive. After Colab resets, you can copy the checkpoints back into Colab and resume training from the previous checkpoint.


3. Clone the repository (mowillia's fork: https://github.com/mowillia/gpt-2)

In [0]:
!git clone https://github.com/mowillia/gpt-2.git

Cloning into 'gpt-2'...
remote: Enumerating objects: 297, done.
remote: Total 297 (delta 0), reused 0 (delta 0), pack-reused 297
Receiving objects: 100% (297/297), 4.40 MiB | 14.71 MiB/s, done.
Resolving deltas: 100% (162/162), done.


4. Change directory to the gpt-2 folder

In [0]:
cd gpt-2

/content/gpt-2


5. Check the GPU status

In [0]:
#check GPU status
!nvidia-smi

Thu Jun 27 15:42:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8    16W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

6. Second check to ensure GPU is being used 

In [0]:
import torch
# Check whether a GPU is available
work_with_gpu = torch.cuda.is_available()
if work_with_gpu:
    print('Using GPU!')
else:
    print('No GPU available, using CPU; Consider using short texts.')

Using GPU!


7. Install the requirements for training

In [0]:
!pip3 install -r requirements.txt

Collecting fire>=0.1.3 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/5a/b7/205702f348aab198baecd1d8344a90748cb68f53bdcd1cc30cbc08e47d3e/fire-0.1.3.tar.gz
Collecting regex==2017.4.5 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)
     |████████████████████████████████| 604kB 9.7MB/s 
Collecting tqdm==4.31.1 (from -r requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/6c/4b/c38b5144cf167c4f52288517436ccafefe9dc01b8d1c190e18a6b154cd4a/tqdm-4.31.1-py2.py3-none-any.whl (48kB)
     |████████████████████████████████| 51kB 23.5MB/s 
Collecting toposort==1.5 (from -r requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels

8. Mount Google Drive so you can save and access checkpoints later. You will have to log in to your Google account.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


9. Download the model data. You have two choices: the 117M model and the 345M model. Phantom Pen uses both models.

In [0]:
!python3 download_model.py 117M

Fetching checkpoint:   0%|                                              | 0.00/77.0 [00:00<?, ?it/s]Fetching checkpoint: 1.00kit [00:00, 566kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 54.7Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 923kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:09, 52.1Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 5.53Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 53.5Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 52.4Mit/s]                                                       


In [0]:
#!python3 download_model.py 345M

Fetching checkpoint:   0%|                                              | 0.00/77.0 [00:00<?, ?it/s]Fetching checkpoint: 1.00kit [00:00, 814kit/s]                                                      
Fetching encoder.json:   0%|                                           | 0.00/1.04M [00:00<?, ?it/s]Fetching encoder.json: 1.04Mit [00:00, 54.1Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 1.02Mit/s]                                                   
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:21, 65.4Mit/s]                                 
Fetching model.ckpt.index: 11.0kit [00:00, 7.13Mit/s]                                               
Fetching model.ckpt.meta: 927kit [00:00, 52.4Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 46.1Mit/s]                                                       


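10. Set the Python I/O encoding to UTF-8. Use the `%env` magic rather than `!export`, because a variable exported inside a `!` command does not persist to later cells.

In [0]:
%env PYTHONIOENCODING=UTF-8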
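
A variable set inside a `!` shell command lives only in that command's temporary subshell, so it has to be set on the notebook process itself to reach later cells. The sketch below shows one way to do that from Python and to confirm that a child Python process picks it up. The helper name `child_stdout_encoding` is illustrative and not part of the repository.

```python
import os
import subprocess
import sys

# Set the variable on the notebook process so that child processes inherit it.
os.environ["PYTHONIOENCODING"] = "UTF-8"


def child_stdout_encoding():
    """Return the stdout encoding reported by a freshly started Python process."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# The child process should report a UTF-8 encoding.
print(child_stdout_encoding())
```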

11. Fetch checkpoints if you have them saved in Google Drive

In [0]:
# Fetch the checkpoints 
!cp -r /content/drive/My\ Drive/checkpoint/ /content/gpt-2/ 

12. Copy the corpora you will use for training from your Google Drive to the content folder of Colab. [Below is the code used to copy from my own directory. Yours will be different.]

In [0]:
# Get full essays
!cp -r /content/drive/My\ Drive/writrly_proj_files/Full_Essays/* /content/


Let's get our train on! To change the dataset the GPT-2 models fine-tune on, point the `--dataset` flag in the cells below at another .txt file. Small datasets work, but do not run the fine-tuning for too long or the model will overfit badly. Roughly, expect interesting results within minutes to hours for datasets in the 1-10s of megabytes range; below that, you may want to stop the run early, since fine-tuning can be very fast.

### Training the 117M Model

Now we will train the model on various corpora. Phantom Pen uses 11 different corpora, and we train a 117M model on each one. We set `run_name` according to the chosen corpus, set `model_name` to `117M` so the program knows which pretrained version of GPT-2 to use, and end training after 1000 steps with `counter_end`.

**Training Tip:** We are fine-tuning the pretrained model parameters. Small data sets (~2 MB in size), such as the ones used in Phantom Pen (excluding the Gutenberg corpus, which is ~20 MB), work, but it is important not to let the training run too long or the model will overfit to the training text. I have found that 1000 steps produces reasonable results.
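
Since the risk of overfitting depends on corpus size, it can help to check how large each file is before choosing a step count. The sketch below is a minimal helper for that. The function names and the 5 MB cutoff are illustrative assumptions, not values taken from Phantom Pen.

```python
import os


def corpus_size_mb(path):
    """Return the size of a file in megabytes (1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1e6


def describe_corpus(path, small_threshold_mb=5.0):
    """Summarize a corpus file; the default threshold is an arbitrary example."""
    size = corpus_size_mb(path)
    if size < small_threshold_mb:
        label = "small (watch for overfitting)"
    else:
        label = "large"
    return "{}: {:.2f} MB, {}".format(os.path.basename(path), size, label)
```

For example, `print(describe_corpus('/content/all_business.txt'))` reports the size of the business corpus once it has been copied into Colab.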

In [0]:
## business essays training - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_business.txt --run_name 'atlantic_business' --model_name '117M' --counter_end 1000

In [0]:
## technology essays training  - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_technology.txt --run_name 'atlantic_technology' --model_name '117M' --counter_end 1000

In [0]:
## science essays training  - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_science.txt --run_name 'atlantic_science' --model_name '117M' --counter_end 1000

In [0]:
## education essays training - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_education.txt --run_name 'atlantic_education' --model_name '117M' --counter_end 1000

In [0]:
## politics essays training - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_politics.txt --run_name 'atlantic_politics' --model_name '117M' --counter_end 1000

In [0]:
## entertainment essays training - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_entertainment.txt --run_name 'atlantic_entertainment' --model_name '117M' --counter_end 1000

In [0]:
## ideas essays training - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_ideas.txt --run_name 'atlantic_ideas' --model_name '117M' --counter_end 1000

In [0]:
## international essays training - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_international.txt --run_name 'atlantic_international' --model_name '117M' --counter_end 1000

In [0]:
## health essays training - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_health.txt --run_name 'atlantic_health' --model_name '117M' --counter_end 1000

In [0]:
## gutenberg training - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_gutenberg.txt --run_name 'gutenberg' --model_name '117M' --counter_end 1000

In [0]:
## short story training - with 117M
!PYTHONPATH=src ./train.py --dataset /content/all_short_stories.txt --run_name 'all_short_stories' --model_name '117M' --counter_end 1000
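
The eleven cells above differ only in the dataset file and the run name, so the commands can also be generated from a single table. The sketch below builds the same command strings using the flags shown above (`--dataset`, `--run_name`, `--model_name`, `--counter_end`). The `CORPORA` table and the `build_train_commands` helper are illustrative and not part of the repository.

```python
import shlex

# Corpus file name -> run name, copied from the training cells above.
CORPORA = {
    "all_business.txt": "atlantic_business",
    "all_technology.txt": "atlantic_technology",
    "all_science.txt": "atlantic_science",
    "all_education.txt": "atlantic_education",
    "all_politics.txt": "atlantic_politics",
    "all_entertainment.txt": "atlantic_entertainment",
    "all_ideas.txt": "atlantic_ideas",
    "all_international.txt": "atlantic_international",
    "all_health.txt": "atlantic_health",
    "all_gutenberg.txt": "gutenberg",
    "all_short_stories.txt": "all_short_stories",
}


def build_train_commands(model_name="117M", steps=1000, suffix="", data_dir="/content"):
    """Return one shell command string per corpus."""
    commands = []
    for filename, run_name in CORPORA.items():
        args = [
            "./train.py",
            "--dataset", "{}/{}".format(data_dir, filename),
            "--run_name", run_name + suffix,
            "--model_name", model_name,
            "--counter_end", str(steps),
        ]
        # Quote each argument so that paths with spaces stay intact.
        commands.append("PYTHONPATH=src " + " ".join(shlex.quote(a) for a in args))
    return commands


for command in build_train_commands():
    print(command)
```

Each string can then be run from the `gpt-2` folder, for example with `subprocess.run(command, shell=True, check=True)`. Calling `build_train_commands("345M", suffix="_345")` yields the run names used in the 345M section.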

### Training the 345M Model

[Repeat previous training with 345M model]


In [0]:
## business essays training - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_business.txt --run_name 'atlantic_business_345' --model_name '345M' --counter_end 1000

In [0]:
## technology essays training  - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_technology.txt --run_name 'atlantic_technology_345' --model_name '345M' --counter_end 1000

In [0]:
## science essays training  - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_science.txt --run_name 'atlantic_science_345' --model_name '345M' --counter_end 1000

In [0]:
## education essays training - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_education.txt --run_name 'atlantic_education_345' --model_name '345M' --counter_end 1000

In [0]:
## politics essays training - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_politics.txt --run_name 'atlantic_politics_345' --model_name '345M' --counter_end 1000

In [0]:
## entertainment essays training - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_entertainment.txt --run_name 'atlantic_entertainment_345' --model_name '345M' --counter_end 1000

In [0]:
## ideas essays training - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_ideas.txt --run_name 'atlantic_ideas_345' --model_name '345M' --counter_end 1000

In [0]:
## international essays training - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_international.txt --run_name 'atlantic_international_345' --model_name '345M' --counter_end 1000

In [0]:
## health essays training - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_health.txt --run_name 'atlantic_health_345' --model_name '345M' --counter_end 1000

In [0]:
## gutenberg training - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_gutenberg.txt --run_name 'gutenberg_345' --model_name '345M' --counter_end 1000

In [0]:
## short story training - with 345M
!PYTHONPATH=src ./train.py --dataset /content/all_short_stories.txt --run_name 'all_short_stories_345' --model_name '345M' --counter_end 1000

### Saving and Loading Checkpoints

After training the models, we need to save their checkpoints to Google Drive, after which we can load them for additional training or for sample generation.

In [0]:
## saves checkpoints
## Note: Saving takes a long time (at least an hour) for the 345M model
!cp -r /content/gpt-2/checkpoint/ /content/drive/My\ Drive/
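
The same backup can be done from Python, which makes it easy to repeat after every training run. This is a minimal sketch that assumes Python 3.8 or later (for `dirs_exist_ok`) and the paths used in this notebook. The `backup_checkpoints` helper is illustrative.

```python
import os
import shutil


def backup_checkpoints(src="/content/gpt-2/checkpoint",
                       dst="/content/drive/My Drive/checkpoint"):
    """Copy the checkpoint tree to `dst`, overwriting files that already exist there."""
    if not os.path.isdir(src):
        raise FileNotFoundError("checkpoint directory not found: {}".format(src))
    # dirs_exist_ok=True lets repeated backups reuse the same destination folder.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Return the run folders now present in the backup.
    return sorted(os.listdir(dst))
```

Swapping the two arguments, as in `backup_checkpoints(src="/content/drive/My Drive/checkpoint", dst="/content/gpt-2/checkpoint")`, restores the saved checkpoints after a Colab reset.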

Load one of the trained models from above for sampling

In [0]:
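## copies checkpoint to model folder so we can use the model for generation
## We are using atlantic_business as an example
## The target folder must exist first; otherwise cp fails with "is not a directory"
!mkdir -p /content/gpt-2/models/atlantic_business/
!cp -r /content/gpt-2/checkpoint/atlantic_business/* /content/gpt-2/models/atlantic_business/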

cp: target '/content/gpt-2/models/117M_NR/' is not a directory
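
The "is not a directory" error appears when the target model folder does not exist yet. The sketch below creates the folder before copying the checkpoint files into it. It also copies `encoder.json`, `hparams.json`, and `vocab.bpe` from the base model folder, on the assumption that the sampling scripts look for those files under `models/<model_name>/`; verify this against your copy of the repository. The `prepare_model_dir` helper is illustrative.

```python
import os
import shutil

# Files the sampling scripts are assumed to read from models/<model_name>/.
BASE_FILES = ("encoder.json", "hparams.json", "vocab.bpe")


def prepare_model_dir(run_name, base_model="117M", root="/content/gpt-2"):
    """Create models/<run_name> and fill it from the checkpoint and base model folders."""
    checkpoint_dir = os.path.join(root, "checkpoint", run_name)
    base_dir = os.path.join(root, "models", base_model)
    target_dir = os.path.join(root, "models", run_name)

    # Create the target folder so that the copies below have somewhere to go.
    os.makedirs(target_dir, exist_ok=True)

    # Copy the fine-tuned checkpoint files.
    for name in os.listdir(checkpoint_dir):
        source = os.path.join(checkpoint_dir, name)
        if os.path.isfile(source):
            shutil.copy2(source, os.path.join(target_dir, name))

    # Copy the tokenizer and hyperparameter files from the pretrained model, if present.
    for name in BASE_FILES:
        source = os.path.join(base_dir, name)
        if os.path.isfile(source):
            shutil.copy2(source, os.path.join(target_dir, name))

    return sorted(os.listdir(target_dir))
```

For the example above, this would be called as `prepare_model_dir("atlantic_business")`.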


Generate conditional samples from the model given a prompt you provide. Change the top-k hyperparameter if desired (default is 40).

In [0]:
!python3 src/interactive_conditional_samples.py --top_k 40 --model_name "atlantic_business"
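
Top-k sampling restricts each generation step to the k most likely next tokens and renormalizes their probabilities before drawing one. The pure-Python sketch below illustrates the idea. It is not the code in `src/`, and the function names and toy logits are illustrative assumptions.

```python
import math
import random


def top_k_probabilities(logits, k):
    """Softmax over the k largest logits; every other token gets probability 0."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of logits")
    # Indices of the k largest logits.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the largest kept logit before exponentiating for numerical stability.
    peak = max(logits[i] for i in keep)
    weights = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(logits))]


def sample_top_k(logits, k, rng=random):
    """Draw one token index from the top-k distribution."""
    probabilities = top_k_probabilities(logits, k)
    return rng.choices(range(len(logits)), weights=probabilities, k=1)[0]


# Toy example: four candidate tokens, keep only the two most likely.
toy_logits = [2.0, 1.0, 0.5, -1.0]
print(top_k_probabilities(toy_logits, 2))
print(sample_top_k(toy_logits, 2))
```

A smaller k makes the output more conservative, while a larger k allows less likely tokens to appear.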

To check flag descriptions, use:

In [0]:
!python3 src/interactive_conditional_samples.py -- --help

W0620 16:30:15.744753 139747760654208 deprecation_wrapper.py:119] From /content/gpt-2/src/model.py:147: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

Type:        function
String form: <function interact_model at 0x7f198d753d08>
File:        /content/gpt-2/src/interactive_conditional_samples.py
Line:        11
Docstring:   Interactively run the model
:model_name=117M : String, which model to use
:seed=None : Integer seed for random number generators, fix seed to reproduce
 results
:nsamples=1 : Number of samples to return total
:batch_size=1 : Number of batches (only affects speed/memory).  Must divide nsamples.
:length=None : Number of tokens in generated text, if None (default), is
 determined by model hyperparameters
:temperature=1 : Float value controlling randomness in boltzmann
 distribution. Lower temperature results in less random completions. As the
 temperature approaches zero, the model will become deterministic and
 repetitive. Higher te

Generate unconditional samples from the model. If you're using 345M, add `--model_name 345M`.

In [0]:
!python3 src/generate_unconditional_samples.py --model_name "345M" | tee /tmp/samples

To check flag descriptions, use:

In [0]:
!python3 src/generate_unconditional_samples.py -- --help