<a href="https://colab.research.google.com/github/ktornetta/dad_jokes/blob/main/GPT_FINAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Generating Dad Jokes with GPT-2**

**Aray Almen, Kelly Tornetta, Marissa Whitby**



We use Google Colab's free GPU to fine-tune a small GPT-2 model on a database of dad jokes, the Dadabase. After training, we can generate original dad jokes under different parameter settings. Note that this notebook should be run in Google Colab with Google Chrome because it relies on Google-specific features. We hope you enjoy!

## Setup

Before fine-tuning the GPT-2 model, we need to set up the notebook to run all of our code.

### Download Libraries

First, download the necessary libraries. These are:


*   TensorFlow 1.x
    * GPT-2 needs an older release that still ships the `tensorflow.contrib` module, which was removed in TensorFlow 2
*   `files` from Google Colab
    * lets us import the .csv dataset and export the generated jokes as a .txt file
*   GPT-2 from [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple)
    * free Python package for downloading and fine-tuning a GPT-2 model by [OpenAI](https://openai.com/blog/better-language-models/) and [Neil Shepperd](https://github.com/nshepperd/gpt-2)



In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple

TensorFlow 1.x selected.


In [None]:
import gpt_2_simple as gpt2
from google.colab import files

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



### Active GPU

We can verify which GPU is active in Colab. Note that Colab assigns either an NVIDIA T4 or an NVIDIA K80 GPU. If you are training a larger GPT-2 model, the T4 is recommended since it is slightly faster than the older K80. Since we are using the smallest GPT-2 model, the choice of GPU matters little.

If you see the error: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver." then you need to manually select the GPU by choosing Edit -> Notebook settings -> Hardware accelerator -> GPU.
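For a programmatic check, the GPU name can also be read from `nvidia-smi` in Python. This is a minimal sketch, not part of our original notebook: the helper name `active_gpu_name` is our own, and it assumes the standard `--query-gpu=name` and `--format=csv,noheader` options of `nvidia-smi` are available on the VM.

```python
import shutil
import subprocess


def active_gpu_name():
    """Return the GPU name reported by nvidia-smi, or None if unavailable."""
    # nvidia-smi is absent when no NVIDIA driver is installed on the machine.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # A non-zero exit code covers the "couldn't communicate with the driver" case.
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


print(active_gpu_name() or "No GPU found: select one under Notebook settings")
```

If this prints the fallback message, select the GPU under Notebook settings as described above and run it again.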

In [None]:
!nvidia-smi

Wed Dec 16 17:25:58 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8     8W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Download GPT-2

Download the version of GPT-2 that you wish to fine-tune. For the sake of run time and size, we are using the smallest 124M GPT-2 model. If you wish to train a larger model with a large amount of training data, gpt-2-simple can also support the fine-tuning of a 355M model of GPT-2. 

In [None]:
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 242Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 131Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 381Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:02, 174Mit/s]                                   
Fetching model.ckpt.index: 1.05Mit [00:00, 377Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 176Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 186Mit/s]                                                       


### Mount Google Drive

Since we are using Google Colab, we also have access to Google Drive. Mounting our personal Google Drive in the VM allows easy transfer of data into and out of the VM. This means that we only have to fine-tune the GPT-2 model on the Dadabase once, and can later reload the trained model to generate jokes without re-running the fine-tuning. You will have to authorize this for your Google account via a URL.

In [None]:
gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Import Data to Fine-tune

Upload the Dadabase via the left sidebar: Files (folder icon) -> Upload (page with an upward arrow) -> select Dadabase.csv from your local device, or drag and drop it into the file pane. Alternatively, run the command *uploaded = files.upload()* and select the file. Once uploaded, we can refer to the file by name in the notebook.

In [None]:
file_name = "Dadabase.csv"
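To our understanding, gpt-2-simple treats a .csv dataset as a single column of texts, skips the header row, and wraps each row in `<|startoftext|>` and `<|endoftext|>` tokens (the same tokens that appear in the training samples). The sketch below only mimics that wrapping with the standard library so the expected file layout is clear. The function `wrap_jokes` and the two placeholder rows are hypothetical and are not the real Dadabase.

```python
import csv
import io

START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def wrap_jokes(csv_text, has_header=True):
    """Wrap the first column of every CSV row in start/end tokens."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the column name
    # Empty rows (blank lines) are ignored.
    return [START_TOKEN + row[0] + END_TOKEN for row in reader if row]


# Hypothetical stand-in for Dadabase.csv: one header line, one joke per row.
sample_csv = (
    "joke\n"
    '"Example setup one? Example punchline one."\n'
    '"Example setup two? Example punchline two."\n'
)

wrapped = wrap_jokes(sample_csv)
for line in wrapped:
    print(line)
```

Quoting each joke in the CSV keeps commas inside a joke from being split into extra columns.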

## Fine-tune GPT-2

Fine-tune the GPT-2 Model with the Dadabase. This can be done multiple times by saving different checkpoints with different parameters. We have included our final chosen model.

### Parameters

We have selected the following parameters for our final model:
* *model_name = '124M'* - the smallest model of GPT-2
* *steps = 500* - a small number of training steps, since short-form text is prone to overfitting
* *restore_from = 'fresh'* - training from the base GPT-2, choose *'latest'* to restore from a previously trained model
* *run_name = 'dadjokes1'* - saves the model in folder 'dadjokes1' within the folder 'checkpoint' 
* *print_every = 10* - prints every 10 steps during the training process
* *sample_every = 100* - prints example output every 100 steps
* *learning_rate = 1e-4* - default learning rate (not passed in the call below, so the default applies)

Using these parameters, the training process took about 20 minutes. Note that increasing the number of steps increases the training time, with 1000 steps taking about an hour to train.
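The 20-minute figure can be sanity-checked against the `[step | elapsed seconds]` progress lines that training prints (see the output further below). The sketch below parses two of those lines and extrapolates linearly to 500 steps. The helper names are our own, and the estimate ignores pauses for sampling and checkpoint saving, so treat it as a rough lower bound rather than a prediction.

```python
import re

# Matches lines such as "[10 | 28.78] loss=1.74 avg=1.74".
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_log_line(line):
    """Return (step, elapsed_seconds) from one training progress line."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a training progress line: " + repr(line))
    return int(match.group(1)), float(match.group(2))


def estimate_minutes(first_line, last_line, target_steps):
    """Linearly extrapolate total training time in minutes."""
    step_a, secs_a = parse_log_line(first_line)
    step_b, secs_b = parse_log_line(last_line)
    secs_per_step = (secs_b - secs_a) / (step_b - step_a)
    total_secs = secs_b + (target_steps - step_b) * secs_per_step
    return total_secs / 60


# Two lines copied from our training output.
minutes = estimate_minutes(
    "[10 | 28.78] loss=1.74 avg=1.74",
    "[100 | 230.64] loss=0.21 avg=0.80",
    target_steps=500,
)
print(round(minutes, 1))
```

For our run this comes out a little under 19 minutes, in line with the roughly 20 minutes we observed.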


In [None]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=500,
              restore_from='fresh',
              run_name='dadjokes1',
              print_every=10,
              sample_every=100
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


100%|██████████| 1/1 [00:00<00:00, 219.36it/s]

Loading dataset...
dataset has 29571 tokens
Training...





[10 | 28.78] loss=1.74 avg=1.74
[20 | 50.41] loss=1.53 avg=1.63
[30 | 72.27] loss=1.26 avg=1.51
[40 | 94.32] loss=1.13 avg=1.41
[50 | 116.59] loss=0.88 avg=1.30
[60 | 139.06] loss=0.46 avg=1.16
[70 | 161.67] loss=0.38 avg=1.04
[80 | 184.49] loss=0.32 avg=0.95
[90 | 207.58] loss=0.24 avg=0.87
[100 | 230.64] loss=0.21 avg=0.80
text|>
<|endoftext|>
<|startoftext|>Why did the elf cross the road? With a cross<|endoftext|>
<|startoftext|>What’s Fifty Cent’s name in Zimbabwe? Two Hundred Dollars.<|endoftext|>
<|startoftext|>Why did the pig go to the doctor? It needed something to eat.<|endoftext|>
<|startoftext|>What’s Fifty Cent’s favorite song to swing is "Can I Have Three Chairs? There are three Chairs."<|endoftext|>
<|startoftext|>What do you call a bald porcupine? Kudzu.<|endoftext|>
<|startoftext|>What does a scientist do on a roller coaster? It's nobody else's business.<|endoftext|>
<|startoftext|>Why did the apple fall over? Because it was in the wrong place.<|endoftext|>
<|startoftex

### Save Model

Since this model has been checkpointed and saved, we can copy it to our Google Drive using the command below. Later, the trained model can be restored with the command *gpt2.copy_checkpoint_from_gdrive(run_name = 'dadjokes1')*.

In [None]:
gpt2.copy_checkpoint_to_gdrive(run_name='dadjokes1')

## Generate Jokes

Now, it's time to finally generate jokes from our fine-tuned model! If you are using a previously trained model, uncomment and run the code below with the desired model checkpoint.

In [None]:
#gpt2.copy_checkpoint_from_gdrive(run_name = 'dadjokes1')
#sess = gpt2.start_tf_sess()
#gpt2.load_gpt2(sess, run_name = 'dadjokes1')

### Multiple Jokes to .txt

Our final parameters were chosen as:
* *length = 100* - maximum length of each joke (max is 1023)
* *temperature = 1.2* - higher temperature means more original jokes and fewer copies of training jokes (default is 0.7)
* *nsamples = 100* - number of jokes to be generated
* *batch_size = 20* - generates multiple samples in parallel to speed up runtime (max is 20)

We also added the parameters below, which gpt-2-simple uses to handle single-line output:
* *prefix = "<|startoftext|>"* - token that marks the start of each joke
* *truncate = "<|endoftext|>"* - cuts the output off at the end of each joke
* *include_prefix = False* - removes the prefix from the output

To download a .txt file of our output jokes, we set the output file name:
* *gen_file = 'gpt2_jokes.txt'* - can change file name

Additional parameters for output file:
* *destination_path = gen_file* - output .txt path
* *sample_delim = ''* - removes the "====" delimiter between jokes

These parameters were chosen to output jokes with the most originality and highest scores. Details on how we optimized these parameters and scored our jokes are included in our final paper. 
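To make the effect of *prefix*, *truncate*, and *include_prefix* concrete, here is a small sketch of the clean-up they describe: drop the leading start token and cut everything from the first end token onward. It is our own illustration in plain Python (the function `clean_sample` is hypothetical), not the code gpt-2-simple runs internally. The raw string reuses a joke from the training samples above.

```python
def clean_sample(raw, prefix="<|startoftext|>", truncate="<|endoftext|>",
                 include_prefix=False):
    """Strip the start token and cut the text at the first end token."""
    text = raw
    if not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]
    end = text.find(truncate)
    if end != -1:
        text = text[:end]  # discard the end token and anything after it
    return text.strip()


# A raw sample usually runs on into the start of the next joke.
raw = (
    "<|startoftext|>Why did the apple fall over? "
    "Because it was in the wrong place.<|endoftext|>\n"
    "<|startoftext|>Why did"
)
print(clean_sample(raw))
```

Without truncation, each sample would keep going for up to *length* tokens and spill into several partial jokes.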



In [None]:
gen_file = 'gpt2_jokes.txt'

gpt2.generate_to_file(sess, run_name = 'dadjokes1',
                      destination_path=gen_file,
                      length=100,
                      temperature=1.2,
                      nsamples=100,
                      batch_size=20,
                      prefix="<|startoftext|>",
                      truncate="<|endoftext|>",
                      include_prefix=False,
                      sample_delim=''
                      )
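The call above asks for 100 jokes in batches of 20, i.e. 5 generation passes. As far as we know, gpt-2-simple expects *nsamples* to be a multiple of *batch_size*, so it can help to check a combination before starting a long run. The helper `plan_batches` below is a hypothetical convenience of ours, not a library function.

```python
def plan_batches(nsamples, batch_size):
    """Return the number of generation passes needed for nsamples jokes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if nsamples % batch_size != 0:
        raise ValueError("nsamples must be a multiple of batch_size")
    return nsamples // batch_size


print(plan_batches(100, 20))  # settings used for the .txt export
print(plan_batches(10, 1))    # settings used for notebook output
```

A combination such as 100 samples with a batch size of 30 raises an error here instead of failing partway through generation.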

The following command will prompt Google Chrome to download the file.

In [None]:
files.download(gen_file)


### Single Joke to Notebook

To output a single joke or a group of jokes to the notebook, run the following with the desired number of jokes (*nsamples*). If generating more than 10 jokes, increase *batch_size* to improve runtime. The parameters *length = 100* and *temperature = 1.2* can be changed to vary joke length and originality. A higher temperature yields more original jokes but also more nonsensical output, while a lower temperature yields less original jokes. The given parameters gave the most originality with the highest scores. Details on how we scored our jokes are included in our final paper.

In [None]:
gpt2.generate(sess, run_name='dadjokes1',
              length=100,
              temperature=1.2,
              nsamples=10,
              batch_size=1,
              prefix="<|startoftext|>",
              truncate="<|endoftext|>",
              include_prefix=False
              )

Did the orange win the prize? It got the trophy.
What do traditional sausages contain that can hold their shape? Yogscorns.
How did high school introduce me? I was a purely sedentary mouse.
Do you understand the number 15? He takes three seconds to type himself.
Some flowers fight? Bison.
Have you considered dropping by a pig and meeting another pig? Because there's not a lot of money.
I made a playlist for hiking. It has music from band the aphas, and Muse. I call it my Trail Mix.
The word queue is ironic. It's just q with a bunch of silent letters waiting in line.
How did Greek anger crop up in the news? I don't know but it ain't nothing special.
Where does Arnold Schwarzenegger hate the most? Cincinnati.
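The trade-off between originality and nonsense comes from how temperature reshapes the model's next-token probabilities: the scores are divided by the temperature before the softmax, so a higher value flattens the distribution and makes unlikely tokens more probable. The sketch below illustrates this with made-up scores. The `logits` values are hypothetical and do not come from our model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four candidate next tokens.
logits = [3.0, 2.0, 1.0, 0.0]

for temperature in (0.7, 1.2):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At 0.7 the top-scoring token takes a larger share of the probability than at 1.2, which is why lower temperatures tend to reproduce jokes from the training data more closely.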


## Thanks

Thanks to [Max Woolf](https://minimaxir.com/) for his blogs and interactive notebooks on using GPT-2!