# 04 GPT-2 Model Training With gpt-2-simple (Colab notebook)

This notebook is an edited version of the original Google Colab notebook by [Max Woolf](http://minimaxir.com), creator of the `gpt-2-simple` wrapper for OpenAI's GPT-2. Since the original provides an easy workflow directly from the creator, along with Colab's access to a free GPU, I decided to use it as the base rather than build a new one.

For more about `gpt-2-simple`, you can visit [Max's GitHub repository](https://github.com/minimaxir/gpt-2-simple). You can also read his [blog post](https://minimaxir.com/2019/09/howto-gpt2/) for a link to the original version and more information on how to use this notebook.

As noted, this notebook is designed to run in Google Chrome. It uses Google Drive both to supply the required data files and to store the saved model files, so we will need access to a Drive with enough free space.

Before we begin running the cells, we'll verify that we are using a GPU by going to Runtime --> Change runtime type and making sure that Hardware accelerator is set to `GPU`.

We'll also upload our required text files to Google Drive. We'll need the `NLG_Poe/02_Poe_author_text_only/Prose/` directory (see the link in the repo) and its files uploaded to our Google Drive under `/content/drive/MyDrive/`.

## Imports

Please note: this wrapper uses TensorFlow v1.x

In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
import os
from google.colab import files

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## GPU
We'll verify which GPU is active by running the cell below.

In [None]:
!nvidia-smi

Mon Jan 25 01:53:16 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    23W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Downloading the GPT-2 Pre-Trained Model

We'll be retraining the GPT-2 model utilizing the prose text of Edgar Allan Poe. To start this process, we'll first have to download the base model to our Colab instance.

For this initial test, we'll use the "medium" model with 355 million parameters. We'll specify the base model by entering "355M" as the `model_name` in the next cell. When we run that cell, the model will be downloaded from Google Cloud Storage and saved in the Colaboratory VM at `/models/<model_name>`.

In [None]:
gpt2.download_gpt2(model_name="355M")

Fetching checkpoint: 1.05Mit [00:00, 441Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 133Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 647Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:13, 105Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 454Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 147Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 189Mit/s]                                                       
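
After a download like the one logged above, it can be reassuring to confirm that every file actually landed on the VM. The following is a minimal sketch using only the standard library. The file names are taken from the download log; `missing_model_files` is a hypothetical helper (not part of `gpt-2-simple`), and the default `models_dir="models"` is an assumption about where the download was saved relative to the working directory.

```python
import os

# File names taken from the download log above.
EXPECTED_MODEL_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]


def missing_model_files(model_name, models_dir="models"):
    """Return the expected files that are absent from models_dir/model_name.

    models_dir="models" is an assumption about where the download landed.
    """
    model_path = os.path.join(models_dir, model_name)
    return [
        name
        for name in EXPECTED_MODEL_FILES
        if not os.path.isfile(os.path.join(model_path, name))
    ]


missing = missing_model_files("355M")
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("All expected model files are present.")
```

If anything is reported missing, rerunning the download cell is probably the simplest fix.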


## Mounting Google Drive

Next, we'll connect to our Google Drive by running the cell below. 

In [None]:
gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
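
With the Drive mounted, a quick check that the prose directory is visible can save a failed run later. This is a minimal sketch using only the standard library; `count_text_files` is a hypothetical helper, and the path is the one used in the data preparation cell of this notebook, so adjust it if your upload lives elsewhere.

```python
import os


def count_text_files(directory):
    """Return the number of regular files in `directory`, or None if it is missing."""
    if not os.path.isdir(directory):
        return None
    return sum(
        1
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


# Path as used in the data preparation cell of this notebook.
prose_dir = "/content/drive/MyDrive/NLG_Poe/02_Poe_author_text_only/Prose/"
n_files = count_text_files(prose_dir)
if n_files is None:
    print(f"Directory not found: {prose_dir}")
else:
    print(f"Found {n_files} files in {prose_dir}")
```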


## Preparing Required Data Files for Training
Since we uploaded the `NLG_Poe/02_Poe_author_text_only/Prose/` directory (see the link in the repo) and its files to our Google Drive under `/content/drive/MyDrive/` at the top of this notebook, we'll run the following cell to create our combined data file for retraining GPT-2.


In [None]:
# directory on Google Drive holding the trimmed prose files
prose_dir = '/content/drive/MyDrive/NLG_Poe/02_Poe_author_text_only/Prose/'

# getting a sorted list of all the trimmed prose files in the directory
text_files = sorted(os.listdir(prose_dir))

# setting variable to hold our combined data
raw_text = ''

# reading and adding the data from each trimmed file to our combined data
for file in text_files:
    with open(os.path.join(prose_dir, file), encoding='utf-8') as f:
        raw_text += f.read()
    raw_text += ' '

# set the name for our combined data file to be saved as
file_name = "Poe_combined_trimmed_prose.txt"

# saving the new combined file to our Google Drive
with open(f'/content/drive/MyDrive/{file_name}', 'w', encoding='utf-8') as f:
    f.write(raw_text)

# copying the new combined data file from Google Drive to our Colab VM
gpt2.copy_file_from_gdrive(file_name)
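
Before training, it may help to confirm that the combined file is where we expect and is roughly the size we expect. The sketch below uses only the standard library. `summarize_text_file` is a hypothetical helper, and it assumes the combined file was copied into the current working directory. Note that a whitespace-based word count is not the same as the BPE token count that the training cell reports.

```python
import os


def summarize_text_file(path):
    """Return character and whitespace-separated word counts for a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return {"characters": len(text), "words": len(text.split())}


# Assumes the combined file was copied into the current working directory.
combined_path = "Poe_combined_trimmed_prose.txt"
if os.path.exists(combined_path):
    print(summarize_text_file(combined_path))
else:
    print(f"{combined_path} not found in {os.getcwd()}")
```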

## Finetune GPT-2

Now we'll actually finetune GPT-2 using the `gpt-2-simple` wrapper. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (To have the finetuning run indefinitely, set `steps = -1`.)

The model checkpoints will be saved in `/checkpoint/run1` by default. The checkpoints are saved every 500 steps (can be changed) and when the cell is stopped.

There are several options that we should be aware of. Please reference the [repo](https://github.com/minimaxir/gpt-2-simple).

The parameters in the following cell continue training our model from step 7000 to step 10,000. The cell shows the current loss every 10 steps, saves a model checkpoint every 500 steps, and prints a sample of generated text every 500 steps.

This particular wrapper makes it very easy to continue finetuning on the same data or folding in new data. 

In [None]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='355M',
              steps=3000,
              restore_from='latest', # change to 'fresh' if first training of model
              run_name='run1', 
              print_every=10,
              sample_every=500,
              save_every=500, 
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint checkpoint/run1/model-7000
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-7000


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:03<00:00,  3.43s/it]


dataset has 763693 tokens
Training...
Saving checkpoint/run1/model-7000
 this man? He will be the saviour of us all! He is the
      reason why we are here—he is the one us hero—he is the
      reason we must all fight to the bitter death. Therefore let
      us all hail him with the ardor of our hatred, and
      strive to emulate him in everything we did not already
      consider a laudable art.”
     
    
      That some people took the excess of vulgarity in Jules and
      Favêt quite literally, is rendered difficult by the fact,
      that much of the speech attributed to him by his less
      intimate friends was positively
      disapproved of by his more intimate ones. It was not the
      idiosyncrasy of his speeches, but the idiosyncrasy of his
      character, which led him to incur the displeasure
      of his less intimate associates, and finally to the
      disfavour of his more intimate acquaintances. It was
      his prerogative to vary the sentences of his
      op

Since Google Colab has limited GPU time, we'll run it for no longer than 3000-4000 steps at a time.
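
When planning a session of that size, it can help to work out in advance where the run should end and at which steps checkpoints and samples are likely to appear. The sketch below is plain arithmetic on the parameter values used in this notebook (resuming at step 7000, `steps=3000`, an interval of 500). `planned_steps` is a hypothetical helper, and it assumes events fall on multiples of the interval, which matches the log above saving `model-7000` at the start. The wrapper's exact behavior at the boundaries may differ.

```python
def planned_steps(start_step, steps, every):
    """Return the multiples of `every` from start_step through start_step + steps."""
    end_step = start_step + steps
    # Round start_step up to the nearest multiple of `every`.
    first = -(-start_step // every) * every
    return list(range(first, end_step + 1, every))


checkpoints = planned_steps(start_step=7000, steps=3000, every=500)
print(checkpoints)
# [7000, 7500, 8000, 8500, 9000, 9500, 10000]
```

The last value is the step the run should end on, which is also the step the next continuation run would resume from.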

Once we've reached a stopping point, we'll save the checkpoint file for the finetuned model from the Colab VM to our Google Drive. This will save a `.tar` file named `checkpoint_{the run name we specified}` (in our case, `checkpoint_run1.tar`). Saving the file to our personal Google Drive is important if we desire to access the model for further finetuning or generating text. 

In [None]:
# save finetuned model and support files to personal Google Drive for future access.
gpt2.copy_checkpoint_to_gdrive(run_name='run1')
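
Before relying on the archive in a later session, it can be worth confirming that it exists and contains what we expect. This is a minimal sketch using the standard library `tarfile` module. `list_archive_members` is a hypothetical helper (not part of `gpt-2-simple`), and the archive path is an assumption: the file name described above, placed at the top level of the mounted Drive.

```python
import os
import tarfile


def list_archive_members(tar_path):
    """Return the names of all members in a tar archive."""
    with tarfile.open(tar_path) as tar:
        return tar.getnames()


# Assumed location: top level of the mounted Drive, file name as described above.
archive_path = "/content/drive/MyDrive/checkpoint_run1.tar"
if os.path.exists(archive_path):
    for name in list_archive_members(archive_path):
        print(name)
else:
    print(f"Archive not found: {archive_path}")
```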

That completes the model training and finetuning. In the next notebook we'll do some text generation using our new model.

### LICENSE
Since I'm extensively utilizing the framework and processes created by Max Woolf, please be aware of this licensing information related to said software.

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.