Adapted from [minimaxir's gpt-2 simple library](https://github.com/minimaxir/gpt-2-simple) and [demonstration notebook](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce), themselves built on [nshepperd's original GPT-2 package](https://github.com/nshepperd/gpt-2).

Designed to run in Google Colab.


## Loading GPT-2

Here we'll load in the GPT-2 Simple package, download the 124M parameter "small" model, and mount Google Drive for data and model storage.

In [None]:
# install GPT-2 Simple wrapper and enable drive mounting
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [None]:
# download the 124M parameter GPT-2 small model
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 448Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 106Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 357Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:02, 187Mit/s]                                   
Fetching model.ckpt.index: 1.05Mit [00:00, 356Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 188Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 202Mit/s]                                                       


In [None]:
# mount google drive for data and model storage
gpt2.mount_gdrive()

Mounted at /content/drive


In [None]:
# load dataset -- formatted as a single column CSV
# assumes the file is in the default directory of a mounted Google Drive
file_name = "gpt_training_data.csv"
gpt2.copy_file_from_gdrive(file_name)
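For reference, a single-column CSV of the kind loaded above can be produced with Python's standard `csv` module. This is only a minimal sketch: the `synopses` list, the `synopsis` header and the `example_training_data.csv` file name are hypothetical placeholders, and it assumes the first row is treated as a header. As the samples later in this notebook suggest, gpt-2-simple appears to wrap each row in `<|startoftext|>` and `<|endoftext|>` tokens, so those tokens are not added here.

```python
import csv
import os
import tempfile

# Hypothetical example rows; in practice these would be the real synopses.
synopses = [
    "A shy student joins the school's literature club.",
    'A pilot, nicknamed "Ace", defends Earth from invaders.',
]


def write_single_column_csv(path, rows, header="synopsis"):
    """Write one text per row under a single (assumed) header column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([header])
        for text in rows:
            # csv.writer handles quoting of commas and quotation marks
            writer.writerow([text])


def read_single_column_csv(path):
    """Read the texts back, skipping the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        return [row[0] for row in reader]


# Written to a temporary directory so the sketch does not touch real data.
csv_path = os.path.join(tempfile.gettempdir(), "example_training_data.csv")
write_single_column_csv(csv_path, synopses)
print(read_single_column_csv(csv_path))
```

Letting the `csv` module handle quoting matters here because synopses routinely contain commas and quotation marks.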

## Model Finetuning

Next we'll activate a persistent TensorFlow session for our loaded model and begin retraining on our synopses.

The gpt-2-simple library provides a customizable `finetune` function with plenty of options to play with, but we'll focus on a few key parameters:

* `steps`: the number of training steps to perform. 1000 steps led to overfitting with our GPT-2 Medium implementation, but the smaller model used here is a bit more robust.
* `sample_every`: we'll generate a randomly seeded sample every 200 steps to inspect how the model is learning.
* `learning_rate`: we'll stick to the default learning rate of 1e-4 based on the size of our data.

Finally we'll copy the trained model into Google Drive for generation or future training.
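To make the interplay of `steps`, `print_every`, `sample_every` and `save_every` concrete, here is a small sketch that lists the steps at which each periodic action would fire. It assumes a plain "every N steps" (modulo) schedule; the `schedule` helper is hypothetical and is not part of gpt-2-simple, whose internals may differ in detail.

```python
def schedule(steps, print_every, sample_every, save_every):
    """Return the training steps at which each periodic action would fire,
    assuming a simple modulo-based schedule (an illustration only)."""
    events = {"print": [], "sample": [], "save": []}
    for step in range(1, steps + 1):
        if step % print_every == 0:
            events["print"].append(step)
        if step % sample_every == 0:
            events["sample"].append(step)
        if step % save_every == 0:
            events["save"].append(step)
    return events


events = schedule(steps=1000, print_every=10, sample_every=200, save_every=500)
print(events["sample"])      # [200, 400, 600, 800, 1000]
print(events["save"])        # [500, 1000]
print(len(events["print"]))  # 100
```

With the settings used in this notebook, that means five sample printouts and two checkpoints over the run.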

In [None]:
# persistent TensorFlow session
sess = gpt2.start_tf_sess()

# train on the specified dataset
gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=1000,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=200,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


100%|██████████| 1/1 [00:00<00:00,  9.56it/s]

Loading dataset...





dataset has 1408260 tokens
Training...
[10 | 17.34] loss=3.11 avg=3.11
[20 | 29.80] loss=3.02 avg=3.06
[30 | 42.27] loss=3.02 avg=3.05
[40 | 54.75] loss=3.05 avg=3.05
[50 | 67.23] loss=2.93 avg=3.02
[60 | 79.70] loss=2.83 avg=2.99
[70 | 92.17] loss=2.97 avg=2.99
[80 | 104.63] loss=3.03 avg=2.99
[90 | 117.13] loss=3.00 avg=2.99
[100 | 129.59] loss=2.78 avg=2.97
[110 | 142.08] loss=3.07 avg=2.98
[120 | 154.56] loss=2.78 avg=2.96
[130 | 167.03] loss=2.99 avg=2.97
[140 | 179.50] loss=2.87 avg=2.96
[150 | 191.98] loss=2.81 avg=2.95
[160 | 204.45] loss=2.82 avg=2.94
[170 | 216.92] loss=2.89 avg=2.94
[180 | 229.40] loss=2.84 avg=2.93
[190 | 241.89] loss=2.91 avg=2.93
[200 | 254.36] loss=2.82 avg=2.92
 and have gone through many trials to finally make it through their trials. With time, they can learn to be confident in both themselves and others, and begin to realize why they live their very lives. Will they live up to their expectations, or will they not?<|endoftext|>
<|startoftext|>Special 

In [None]:
# save trained model checkpoint to Google Drive
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

## Text Generation

Next we'll load the model from Google Drive and check out some generations. 

We'll use the built-in generate function and play with some of the settings to get a feel for the model.

In [None]:
# restart TensorFlow session
sess = gpt2.start_tf_sess()

# copy the trained model checkpoint from Google Drive
gpt2.copy_checkpoint_from_gdrive(run_name='run1')

# load model from saved checkpoint
gpt2.load_gpt2(sess, run_name='run1')

Loading checkpoint checkpoint/run1/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-1000


In [None]:
# generate with default parameters
gpt2.generate(sess, run_name='run1')

By: Yuuta Takami and Ayumu Watanabe

<|startoftext|>The story begins with the arrival of the last High School senior to enter the prestigious Genshiken High School. Honoka was quite shy when she first met her classmate, Morinomiya Morino, but after he accepted her to join her class, she began to learn and mature with him. As Morinomiya grew to like her, so too did his admiration of hers. Morinomiya was also made aware of the fact that Morinomiya was the only one who could change the world, and so he began to train with her to help him do so. Throughout the course of his training, Morinomiya learned to work alongside her classmates to improve their respective skills and abilities, and soon they also began to become friends. However, Morinomiya's love for Morinomiya grew and she began to question her own feelings. Morinomiya knew that there was no way that Morinomiya and her classmates could ever be the same person, but she also knew that she was not the only one who could change the wor

### Initial Observations

Note that GPT-2 doesn't stop on its own when it predicts an `<|endoftext|>` token: it keeps predicting tokens until it reaches the requested `length`, so a single sample can run through several end-of-text boundaries and we end up with several independent generations in one output.
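Because a raw sample can contain several texts separated by the `<|endoftext|>` token, it can help to split it into individual pieces before reading. The sketch below does this with plain string operations; `split_generations` is a hypothetical helper, and the `raw` string is a made-up stand-in for real model output.

```python
START = "<|startoftext|>"
END = "<|endoftext|>"


def split_generations(raw):
    """Split one raw sample into individual texts.

    The first and last pieces may be fragments, since a sample can start
    or stop in the middle of a text.
    """
    pieces = raw.split(END)
    cleaned = [piece.replace(START, "").strip() for piece in pieces]
    return [piece for piece in cleaned if piece]


# Made-up stand-in for a raw sample spanning several texts.
raw = (
    "the end of one synopsis.<|endoftext|>\n"
    "<|startoftext|>A second, complete synopsis.<|endoftext|>\n"
    "<|startoftext|>Special"
)

for text in split_generations(raw):
    print(repr(text))
```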

The default generations appear quite reasonable: the model has clearly homed in on Japan-oriented vocabulary and much of the specific vernacular of the dataset. We also noticed some of GPT-2's original training data leaking through, especially when it produces named entities (Beastie Boys, Wizard of Oz, etc.).

This level of fit may be ideal for the purpose of novel story generation, as the model seems to have captured the literary structure of synopses quite well while introducing new external information from the pre-training data. 

### Generator Tuning
 
Let's play around with the generator settings a bit and see how our output changes.

We'll use the `nsamples` and `batch_size` settings here to generate multiple samples in parallel, and introduce the `truncate` parameter to cut each sample off at the end-of-text token.

We also have access to `temperature`, which scales the model's output logits before sampling to adjust the "randomness" of predicted tokens.
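As a rough illustration of what temperature does, the sketch below divides a set of logits by the temperature before applying a softmax. The logits are hypothetical values chosen for the example, and `softmax_with_temperature` is an illustrative helper rather than the library's own sampling code.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # subtract the maximum for numerical stability
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical logits for three candidate tokens

for temperature in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, giving more conservative text; higher temperatures flatten the distribution and make unlikely tokens more probable.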

In [None]:
gpt2.generate(sess,
              length=250,
              temperature=0.7,
              nsamples=5,
              batch_size=5,
              truncate='<|endoftext|>',
              include_prefix=False
              )

Shirayuki Yagami is a young high schooler. One day, an evil organization is attacking the school. When Shirayuki is attacked, a girl named Naru is dragged into the fight. She is a classmate of the student council president and president of the "Naru Clan," a clan that deals with various issues with the human world. She is also the first transfer student at Shirayuki's school. As the students come to join the fight, they learn more about the people facing the human world.
Special thanks to the people who helped me with this project. Many thanks to the main character of the show, his name is Taku and he is one of the most popular characters in Kyoukai no Omoide (To You) and some other series. I hope that I have made some progress, but I don't know how to proceed. Thanks also go to the people who helped me with the music, animation and video. Thanks also go to the many people who helped me with the sound design, as well as the special effects. I hope that this will help others too. And fi

### Custom Generation
Let's try introducing a prefix here to serve as the 'seed' for our generation. We can use this parameter to force generations to begin with a particular sequence.

This will be the basis of our deployed generator, where users can generate from their own custom seed.
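One possible shape for such a seed-driven helper is sketched below. It is only an assumption about how a deployed generator might wrap the library: `generate_from_seed` is a hypothetical function, and it takes the generation function as an argument so the sketch can run here with a stand-in (`fake_generate`) instead of a trained model. The keyword arguments it passes (`prefix`, `truncate`, `return_as_list`, and so on) mirror parameters of `gpt2.generate`.

```python
def generate_from_seed(generate_fn, sess, seed,
                       nsamples=5, length=500, temperature=0.7):
    """Generate several texts that all begin with the user's seed.

    generate_fn is expected to behave like gpt2.generate when called
    with return_as_list=True (an assumption of this sketch).
    """
    texts = generate_fn(
        sess,
        prefix=seed,
        length=length,
        temperature=temperature,
        nsamples=nsamples,
        batch_size=nsamples,
        truncate="<|endoftext|>",
        return_as_list=True,
    )
    # drop empty results and surrounding whitespace
    return [text.strip() for text in texts if text.strip()]


def fake_generate(sess, prefix="", nsamples=1, return_as_list=False, **kwargs):
    """Stand-in for gpt2.generate so the sketch runs without a model."""
    return ["{}, sample {}.".format(prefix, i) for i in range(nsamples)]


samples = generate_from_seed(fake_generate, None, "In the year 20XX", nsamples=2)
print(samples)
```

In the notebook itself, the call would look like `generate_from_seed(gpt2.generate, sess, "In the year 20XX")` with the live session.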

In [None]:
gpt2.generate(sess,
              length=500,
              temperature=0.7,
              nsamples=5,
              batch_size=5,
              truncate='<|endoftext|>',
              prefix='In the year 20XX'
              )

In the year 20XX, a group of aliens attacks Earth with a "Valkyrie Spirit" attack. The war is a stalemate with no new incidents. The Guardian Angels inform humanity of the attack, but it is not long before the Guardian Angels appear again. The Guardian Angels are joined by Jean-Claude Van Damme, a young boy with superhuman powers. Jean-Claude and his friends have to try to track down the Valkyrie Spirit who attacked Earth. But they find out that the Valkyrie Spirit is still in the human world and they are being hunted by the army of the underworld. Jean-Claude and his friends must find a way to stop them before they become the next Guardian Angels.
In the year 20XX, a new race called the "Quilters" exist on Earth. They are a race of beings that has no memories and is unable to remember the past. They are slow, clumsy and clumsy, but at least they have knowledge of what happened to their ancestors when they were young. In order to defeat them, a team of humans named Spriggan and Sprigga

### Bulk Generation
Since we have a few different options here, let's generate a bunch of samples of varying length and temperature and save them to a file to pass to our scorer for curation. We'll rerun this cell a few times with different parameters and download the results to our local machine.
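One way to keep those reruns organized is to build the parameter combinations up front. The sketch below uses `itertools.product` to do so; the specific `lengths` and `temperatures` values and the file-naming pattern are hypothetical choices for illustration, not settings from our runs.

```python
from itertools import product

lengths = [250, 500]            # hypothetical sample lengths
temperatures = [0.7, 0.9, 1.0]  # hypothetical temperatures


def build_runs(lengths, temperatures):
    """Return one settings dict per (length, temperature) combination."""
    runs = []
    for length, temperature in product(lengths, temperatures):
        runs.append({
            "length": length,
            "temperature": temperature,
            # hypothetical naming pattern that records the settings used
            "destination_path": "gpt2_gentext_len{}_temp{}.txt".format(
                length, temperature),
        })
    return runs


runs = build_runs(lengths, temperatures)
for run in runs:
    print(run)
```

Each dict could then be unpacked into a call such as `gpt2.generate_to_file(sess, nsamples=100, batch_size=20, **run)`, so that every output file records the settings that produced it.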

In [None]:
# generate a .txt file path using the current datetime
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

# write the generated samples directly to a file in the working directory
gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=500,
                      temperature=0.9,
                      nsamples=100,
                      batch_size=20
                      )

# download the current file to local machine
files.download(gen_file)
