# Introduction

The goal of this notebook is to demonstrate the usage of [OpenAI's GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) to generate standup comedy jokes. Specifically, I was interested in its ability to generate longer-form, coherent jokes, as opposed to one-liners.

# Imports

This project uses my fork of [minimaxir's gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), which allows us to use Python lists as inputs into the model. We also install the necessary dependencies for web scraping, data cleaning, and data saving and loading.

In [1]:
import os
import sys
path = os.path.abspath('src/gpt_2_simple')
sys.path.append(path)

In [2]:
%pip install tqdm
%pip install bs4
%pip install regex
!git clone 'https://github.com/keatonconrad/gpt-2-simple' './src/gpt_2_simple'
import re
import requests
from bs4 import BeautifulSoup
import pickle
from tqdm import tqdm
import string
from src.gpt_2_simple import gpt_2_simple as gpt2

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
fatal: destination path './src/gpt_2_simple' already exists and is not an empty directory.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



# Model Preparation

This project uses the 345M parameter model of GPT-2. While other sizes are available, this size was chosen as a balance between speed and quality of results. We download the model here.

In [3]:
model_size = '345M'
if not os.path.isdir(os.path.join('models', model_size)):
    gpt2.download_gpt2(model_name=model_size)

Fetching checkpoint: 1.05Mit [00:00, 866Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 4.97Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 1.61Git/s]                                                   
Fetching model.ckpt.data-00000-of-00001: 1.42Git [01:14, 19.0Mit/s]                                 
Fetching model.ckpt.index: 1.05Mit [00:00, 1.19Git/s]                                               
Fetching model.ckpt.meta: 1.05Mit [00:00, 5.47Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 8.27Mit/s]                                                      


# Data Collection

To collect the data for training, we scrape standup comedy transcripts from [scrapsfromtheloft.com](scrapsfromtheloft.com). We load a list of links from a .txt file. These links are sourced based on my own personal comedy tastes, favoring comedians with longer-form jokes.

In [4]:
def url_to_transcript(url):
    # Scrapes transcript data from scrapsfromtheloft.com
    page = requests.get(url).text # Get all data from URL
    soup = BeautifulSoup(page, 'html.parser') # Read as an HTML document
    text = [p.text for p in soup.find(class_='elementor-widget-theme-post-content').find_all('p')] # Pull out all text from post-content
    return text

with open('./urls.txt') as f:
    urls = f.read().splitlines()

In [5]:
transcripts = [url_to_transcript(u) for u in tqdm(urls)]

100%|██████████| 40/40 [01:47<00:00,  2.69s/it]


## Data Storage and Loading

The transcripts are saved using the pickle module. These transcripts can then be loaded later from the pickled files for speed and to avoid unnecessary web scraping.

In [6]:
# Save transcripts
# !mkdir transcripts

for i in range(len(urls)):
    with open("transcripts/" + str(i) + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

In [7]:
# Load pickled files
transcripts = []
for i in range(len(urls)):
    with open("transcripts/" + str(i) + ".txt", "rb") as file:
        transcripts.append(pickle.load(file))
transcripts = [item for sublist in transcripts for item in sublist]

# Data Cleaning

Since the data is raw text scraped from the web, it should be cleaned for best results before fine-tuning the model.

First, we remove the text between brackets, and the brackets themselves. This is commonly seen in this dataset as "[audience laughing]" or similar. These instances of bracketed text do not usually add to the jokes themselves, so we don't want to train the model with them present.

Then, we remove words with numbers in them, new lines, and all non-printable characters to ensure we are working with only alpha characters. 

Only jokes that are 50 characters or longer at this point are used. This is to ensure we are working with jokes of substantial length, to remove unwanted lines we didn't catch before (stage directions, etc.), and again because I am primarily interested in the model's ability to work with long-form comedic material.

In [8]:
# Data cleaning
printable = set(string.printable)

jokes = []
for paragraph in transcripts:
    cleaned = re.sub(r'\[[^)]*\]', '', paragraph)  # Removes text within []
    cleaned = re.sub('\w*\d\w*', '', cleaned)  # Removes words with numbers in them
    cleaned = re.sub('\n', ' ', cleaned)  # Removes new lines
    cleaned = ''.join(filter(lambda x: x in printable, cleaned))  # Removes all non-printable characters
    if len(cleaned) > 50:  # Removes short paragraphs in favor of longer jokes
        # cleaned = '<|startoftext|> ' + cleaned + ' <|endoftext|>'
        jokes.append(cleaned)

# Dataset Creation and Preparation

First, we shuffle the list of jokes to mitigate any potential bias from the order.

Then, we split the jokes into two datasets: one for training and one for validation. This is to ensure we are evaluating the results on an unseen or "hold-out" set of jokes. I used a split of 0.2 here, meaning 80% of the data will be used for training, and the remaining 20% will be used for validation.

In [9]:
import random
random.shuffle(jokes)

In [10]:
val_split = 0.2

num_train_samples = int(len(jokes) * (1 - val_split))
num_val_samples = int(len(jokes) * val_split)
print('Training samples:', num_train_samples)
print('Validation samples:', num_val_samples)

dataset = jokes[:num_train_samples]
val_dataset = jokes[num_val_samples:]

Training samples: 1701
Validation samples: 425


# Model Fine-Tuning

Now we are ready to fine-tune the model. This Python package makes it easy.

We begin by starting a Tensorflow session. We input our data and iterate over it 501 times (the extra 1 is to ensure we get a final validation reading).

In [11]:
sess = gpt2.start_tf_sess()

In [12]:
gpt2.finetune(
	sess,
	dataset=dataset,
    val_dataset=val_dataset,
    val_every=100,
	model_name=model_size,
	steps=501,  # number of iterations
	restore_from='fresh',  # start from scratch
	run_name='run1_standup',  # directory where trained model will be located
	overwrite=True,
	print_every=10,
	sample_every=200,  # output results every 200 steps
	learning_rate=0.001,
	save_every=800,
)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/345M/model.ckpt
INFO:tensorflow:Restoring parameters from models/345M/model.ckpt


INFO:tensorflow:Restoring parameters from models/345M/model.ckpt
100%|██████████| 1/1 [00:00<00:00, 216.07it/s]

Loading dataset...



100%|██████████| 1/1 [00:00<00:00, 240.94it/s]


Dataset has 335938 tokens
Training...
[10 | 22.62] loss=3.69 avg=3.69
[20 | 36.60] loss=3.59 avg=3.64
[30 | 50.60] loss=3.41 avg=3.56
[40 | 64.60] loss=3.14 avg=3.46
[50 | 78.60] loss=3.20 avg=3.40
[60 | 92.61] loss=2.59 avg=3.27
[70 | 106.70] loss=3.20 avg=3.26
[80 | 120.80] loss=3.35 avg=3.27
[90 | 134.93] loss=3.11 avg=3.25


  0%|          | 0/40 [00:00<?, ?it/s]

[100 | 149.08] loss=3.36 avg=3.26
Calculating validation loss...


100%|██████████| 40/40 [00:29<00:00,  1.34it/s]


[101 | 179.16] validation loss = 3.30
[110 | 193.42] loss=2.70 avg=3.21
[120 | 207.75] loss=3.18 avg=3.21
[130 | 222.15] loss=3.40 avg=3.22
[140 | 236.62] loss=3.23 avg=3.22
[150 | 251.20] loss=3.67 avg=3.25
[160 | 265.85] loss=3.22 avg=3.25
[170 | 280.22] loss=3.62 avg=3.28
[180 | 294.45] loss=2.56 avg=3.23
[190 | 308.58] loss=3.00 avg=3.22
[200 | 322.68] loss=2.78 avg=3.19


  0%|          | 0/40 [00:00<?, ?it/s]

 about? I know that much. Yeah, shes a hero, baby. And Im okay, but you still want me to be with me? Okay. Youre a bitch. <|endoftext|>
<|startoftext|> All this really does is piss everybody off, gives me a bunch of time to come in here and tell you a joke. Yeah, but if you ever tell me a joke, I will be fucking mad at you. Youre like, yeah, thats what you said. Okay. Yeah, but I did it. I did it. Its fine. I did it. I gave it a shot. I did, we did it. It was, wasnt it? I gave it a shot. I dont want you, you got nothing against the show youre doing. Its just, you and I dont want you to tell it that in your show, if you tell me a joke, I will be like, Yeah, you gotta do it and Im not standing around, You know? Yeah, if you told me a joke, you have nothing against the show you say. Im your friend. And youre like, yeah, you read that joke and you still give it a shot, you still give the joke? Youre like, I read it. <|endoftext|>
<|startoftext|> I didnt want to tell you a story right now. 

100%|██████████| 40/40 [00:27<00:00,  1.45it/s]


[201 | 372.06] validation loss = 3.23
[210 | 386.26] loss=3.19 avg=3.19
[220 | 400.47] loss=2.13 avg=3.14
[230 | 414.71] loss=1.69 avg=3.07
[240 | 428.97] loss=2.52 avg=3.04
[250 | 443.27] loss=1.66 avg=2.98
[260 | 457.58] loss=1.83 avg=2.93
[270 | 471.93] loss=2.63 avg=2.92
[280 | 486.33] loss=3.42 avg=2.94
[290 | 500.75] loss=2.90 avg=2.94


  0%|          | 0/40 [00:00<?, ?it/s]

[300 | 515.25] loss=2.13 avg=2.91
Calculating validation loss...


100%|██████████| 40/40 [00:28<00:00,  1.40it/s]


[301 | 543.89] validation loss = 3.10
[310 | 558.70] loss=1.36 avg=2.85
[320 | 573.34] loss=3.47 avg=2.87
[330 | 587.73] loss=2.23 avg=2.85
[340 | 602.03] loss=2.15 avg=2.83
[350 | 616.27] loss=3.37 avg=2.84
[360 | 630.49] loss=1.48 avg=2.80
[370 | 644.69] loss=1.98 avg=2.77
[380 | 658.90] loss=0.90 avg=2.71
[390 | 673.10] loss=3.54 avg=2.74
[400 | 687.31] loss=1.75 avg=2.71


  0%|          | 0/40 [00:00<?, ?it/s]

 And I would say, what do you want me to do? Im like, you know, I want this guy to do one-on-one with me to really please you, but he did all these things for you, you know,  to his house. So <|endoftext|>
<|startoftext|> Thank you for coming out so really, you know, your favorite thing that I ever did is I went to the gym the other day. I thought it was a little thing that I would do in order to have to get rid of a little guy named Matt Stamci. Like, I was hoping for him to get rid of this little guy named Rory, which is a little old, you know, a little old man. I just thought that was a little bit my chance to tell me that Rory was, you know, his little boy. I was hoping for Rory to get rid of this little boy and get some respect from the people of England, where they are so fond of the English. I was hoping that they would just do that, you know, to keep him in the fucking basement so that my little boy could talk to my friend Rory. So I was hoping that Rory would just, like, go ou

100%|██████████| 40/40 [00:27<00:00,  1.44it/s]


[401 | 734.23] validation loss = 2.96
[410 | 748.54] loss=2.88 avg=2.71
[420 | 762.85] loss=0.83 avg=2.66
[430 | 777.21] loss=1.75 avg=2.63
[440 | 791.61] loss=3.24 avg=2.65
[450 | 806.04] loss=2.68 avg=2.65
[460 | 820.55] loss=0.97 avg=2.61
[470 | 835.17] loss=2.32 avg=2.60
[480 | 849.88] loss=2.44 avg=2.59
[490 | 864.68] loss=2.80 avg=2.60


  0%|          | 0/40 [00:00<?, ?it/s]

[500 | 879.17] loss=2.78 avg=2.60
Calculating validation loss...


100%|██████████| 40/40 [00:27<00:00,  1.43it/s]


[501 | 907.14] validation loss = 2.85
Saving checkpoint/run1_standup/model-501


# Final Results

We can now use the fine-tuned model to generate standup comedy jokes! After some testing, I found that a temperature of 0.8 produced results that were more sensical and sometimes had an element of humor to them, as well. 

In [16]:
gpt2.generate(
	sess,
    run_name='run1_standup',
	nsamples=10,
	temperature=0.8,
	prefix='<|startoftext|>',
	truncate='<|endoftext|>'
)

<|startoftext|> All right. I know, I cant wait to get back to the old days of just not doing good inside a building. Then get a good one. And this is cool for people everywhere. Its not a new problem. You dont even have to be a saint. Youre a saint. But God gives you a prayer and you can do everything you want the rest of your lives. You go find one. Its a miracle. Thatsnt the God, Im like a saint. And the other saints turned and said, I wasnt introduced to Him, its a gift. So I was in a hotel room with the rest of the other saints and we went and we walked through Church. Everybody was there. We went and we sat and were like, Uh, I know, Jesus! And then Jesus Christ! That Jesus! Really? Im like, Jesus! What the fuck are you trying to say a Godfather didnt want? Somebody is like, Um, Jesus! And I was like, Jesus? And Jesus! Jesus Christ! Jesus! Jesus!  Hrs. Yeah! Jesus! Jesus! You know, Jesus! I know, you never hear a Jesus Christ! Jesus! Jesus! Jesus! Jesus! Jesus! Jesus! Jesus! Jesus