<a href="https://colab.research.google.com/github/kcambrek/Norman/blob/master/Reddit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Norman story generator

Reddit has a plethora of surprising niche communities. One of my favourites is r/lifeofnorman. The subreddit is full of short stories written by community members about the mundane life of its protagonist, Norman.

The goal of this notebook is to retrieve posts from r/lifeofnorman, fine-tune a language generation model on these posts, and generate new stories. The language model is already trained on an extremely large corpus and only needs to be fine-tuned. For more information on the language model, see https://openai.com/blog/better-language-models/.

It is not necessary to collect the data and train the language generator every time you use this notebook. If you mount your Google Drive to this notebook, you can save a model checkpoint which can be loaded in later sessions. Therefore, you only need to collect data and train the model once.

The syntax of the results seems to be sound, but the content can be quite off.
Some interesting results:



```
Norman was hungry.
He sat down in his favorite chair, and opened his mouth to say something, but couldn't say anything.
Then, Norman's mouth started to move.
This was it.
Norman's mouth was moving.
Norman stood up, and started to eat.
```


```
Norman woke up.
He had a fever and was worried he might get sick, but he got up and got out of bed.
"I should be careful, I can't sleep tonight." Norman went to the kitchen and opened his fridge.
Norman's stomach was full.
He had to have breakfast.
Norman went to bed and got up to go to the bathroom.
On his way out he tried to get Norman's medicine.
He didn't want to risk getting sick.
"That's okay" he thought.
Norman was in heaven.
He had to take his medicine.
```


```
Norman went to the cinema.
It was a good one, he thought.
He enjoyed the marketing; it made him feel special.
He went to the cinema and sat down with a good date.
When the date came, the two sat down and enjoyed their popcorn.
The date was nice, but not nice enough for Norman.
Norman thought it was a little too much, and thought he'd be late.
He'd already been through all the events of the day, and he'd go straight to the theatre.
Norman decided that he didn't want to be late; he didn't want to be a pest.
He also didn't want to be a bother to the hostess.
Norman sat down to watch CSI.
The date ended and Norman got up to go to bed.
He went to sleep, but not before finishing up his popcorn.
```


And the last one, inspired by https://www.reddit.com/r/lifeofnorman/comments/3ykvsw/norman_considers_suicide/:

```
Norman considers suicide.
Norman recently found himself in situations where he couldn't think of anything to say to the woman at the checkout.
Norman couldn't get himself to say anything.
Norman decides that he'd rather die than live.
```

#Prerequisites

In [1]:
#install and import gpt2 simple
!pip install gpt-2-simple
import tensorflow as tf
import gpt_2_simple as gpt2

Collecting gpt-2-simple
  Downloading https://files.pythonhosted.org/packages/6f/e4/a90add0c3328eed38a46c3ed137f2363b5d6a07bf13ee5d5d4d1e480b8c3/gpt_2_simple-0.7.1.tar.gz
Collecting toposort
  Downloading https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels for collected packages: gpt-2-simple
  Building wheel for gpt-2-simple (setup.py) ... done
  Created wheel for gpt-2-simple: filename=gpt_2_simple-0.7.1-cp36-none-any.whl size=23581 sha256=71adf8d47502d2b8af0742df28aa9971be9df65e3f57394f8d4e7f8fdea8eda5
  Stored in directory: /root/.cache/pip/wheels/0c/f8/23/b53ce437504597edff76bf9c3b8de08ad716f74f6c6baaa91a
Successfully built gpt-2-simple
Installing collected packages: toposort, gpt-2-simple
Successfully installed gpt-2-simple-0.7.1 toposort-1.5


The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [2]:
#connect personal google drive to session
gpt2.mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


#Data Gathering

We can get Reddit data from the Pushshift API. However, there is a limit on how much data one can get in one request. To work around this, we fetch the data in multiple smaller requests: we make a list of monthly boundaries from the founding of r/lifeofnorman until now and cap each request at 500 submissions per month (which seems reasonable for a relatively small subreddit).



In [0]:
from dateutil.relativedelta import relativedelta
from datetime import datetime, timezone

months = []

today = datetime.today()
current = datetime(2013, 4, 18)
#for some reason the pushshift api only accepts date arguments in unix timestamp format.
while current <= today:
    months.append(int(current.replace(tzinfo=timezone.utc).timestamp()))
    current += relativedelta(months=1)
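
The loop above produces a list of month boundaries, and each consecutive pair of boundaries becomes one request window. A minimal standard-library sketch of that pairing is shown here. The helper names `add_month` and `month_windows` are illustrative and not part of the notebook, and the sketch assumes the start day (the 18th) exists in every month.

```python
from datetime import datetime, timezone

def add_month(moment):
    """Return the same day and time one calendar month later."""
    year = moment.year + moment.month // 12
    month = moment.month % 12 + 1
    return moment.replace(year=year, month=month)

def month_windows(start, end):
    """Return (after, before) Unix-timestamp pairs between start and end."""
    boundaries = []
    current = start
    while current <= end:
        boundaries.append(int(current.timestamp()))
        current = add_month(current)
    # pair each boundary with the next one: [(b0, b1), (b1, b2), ...]
    return list(zip(boundaries, boundaries[1:]))

start = datetime(2013, 4, 18, tzinfo=timezone.utc)
end = datetime(2013, 8, 1, tzinfo=timezone.utc)  # example end date
windows = month_windows(start, end)
print(windows[0])
```

Note that with this pairing, as in the notebook's own loop, posts newer than the last boundary fall outside every window, so up to a month of recent submissions may be skipped.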


In [0]:
import requests

results = []
for i in range(len(months)-1):
  r = requests.get('https://api.pushshift.io/reddit/search/submission/?subreddit=lifeofnorman&after={}&before={}&sort_type=score&sort=desc&size=500'.format(months[i], months[i+1]))
  if r.status_code != 200:
    print("Something went wrong:", r.status_code)
    break
  else:
    results.append(r)
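
The query string in the request above packs several parameters into one long URL. As a rough sketch, the same parameters can be collected in a dictionary, which makes each one easier to read and change. `BASE_URL` and `build_query` are names introduced here for illustration, and the sketch assumes the endpoint still accepts the parameters used in the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_query(after, before, subreddit="lifeofnorman", size=500):
    """Collect the query parameters for one request window."""
    return {
        "subreddit": subreddit,
        "after": after,        # window start as a Unix timestamp
        "before": before,      # window end as a Unix timestamp
        "sort_type": "score",  # sort by score ...
        "sort": "desc",        # ... highest first
        "size": size,          # maximum submissions per request
    }

params = build_query(1366243200, 1368835200)
url = BASE_URL + "?" + urlencode(params)
print(url)
```

With `requests`, the dictionary can be passed directly, for example `requests.get(BASE_URL, params=params, timeout=30)`, so the URL does not have to be assembled by hand.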

In [0]:
#we are only interested in the submissions that include a story about Norman and not in meta posts.
#luckily, the titles of most of these posts start with 'Norman'

submissions = []

for response in results:
  for submission in response.json()["data"]:
    if submission["title"].startswith("Norman"):
      submissions.append(submission["selftext"].rstrip())
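
The title filter above keeps any post whose title starts with "Norman", including posts whose body is empty or has been taken down. A possible extension, sketched below, also drops such posts. It assumes that removed or deleted posts show up with the placeholder bodies `[removed]` or `[deleted]`; that assumption should be checked against the actual data. `PLACEHOLDERS`, `is_story`, and the sample `posts` are hypothetical.

```python
# assumed placeholder bodies for empty, removed, or deleted posts
PLACEHOLDERS = {"", "[removed]", "[deleted]"}

def is_story(submission):
    """Keep posts titled 'Norman...' that still have a real body."""
    title = submission.get("title", "")
    body = submission.get("selftext", "").strip()
    return title.startswith("Norman") and body not in PLACEHOLDERS

# made-up submissions, shaped like the dictionaries used in the cell above
posts = [
    {"title": "Norman eats toast", "selftext": "Norman ate toast.\n"},
    {"title": "Norman goes out", "selftext": "[removed]"},
    {"title": "Meta: new rules", "selftext": "Please read the sidebar."},
]

stories = [post["selftext"].rstrip() for post in posts if is_story(post)]
print(stories)
```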

In [0]:
#gpt-2 needs to know when a new piece of text starts and ends. 
formatted_submissions = ["<|startoftext|> " + submission.strip() + " <|endoftext|>" for submission in submissions]
text = " ".join(formatted_submissions).replace("\n", " ").replace("\\", "")
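
To make the formatting step concrete, here is a small sketch that applies the same wrapping and clean-up to two made-up stories. The function name `format_corpus` and the sample stories are illustrative only.

```python
START, END = "<|startoftext|>", "<|endoftext|>"

def format_corpus(stories):
    """Wrap each story in start/end markers and flatten to one line."""
    wrapped = [f"{START} {story.strip()} {END}" for story in stories]
    # same clean-up as the cell above: newlines become spaces, backslashes go
    return " ".join(wrapped).replace("\n", " ").replace("\\", "")

sample = format_corpus(["Norman woke up.\nHe made tea.", "Norman slept. "])
print(sample)
```

Each story ends up between one start marker and one end marker, which is what lets the model learn where a story begins and stops.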

In [0]:
#save the data to the current session.
#the with-block closes the file even if writing fails.
with open("norman.txt", "w", encoding="utf-8") as f:
    f.write(text)

#GPT-2 fine-tuning

In [0]:
import tensorflow as tf

In [0]:
#specify the model size; check https://github.com/minimaxir/gpt-2-simple for other options.
model_name = "124M"
gpt2.download_gpt2(model_name=model_name)

Fetching checkpoint: 1.05Mit [00:00, 187Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 79.7Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 194Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:03, 163Mit/s]                                   
Fetching model.ckpt.index: 1.05Mit [00:00, 212Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 118Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 120Mit/s]                                                       


In [0]:
tf.reset_default_graph()
sess = gpt2.start_tf_sess()

gpt2.finetune(sess, "norman.txt", model_name=model_name, steps=1000)   # steps is the maximum number of training steps
#every 100 steps a new sample will be generated to show progress.

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:09<00:00,  9.41s/it]


dataset has 1558548 tokens
Training...
[1 | 13.80] loss=3.15 avg=3.15
[2 | 18.12] loss=3.56 avg=3.36
[3 | 22.46] loss=3.27 avg=3.33
[4 | 26.79] loss=3.37 avg=3.34
[5 | 31.13] loss=3.24 avg=3.32
[6 | 35.48] loss=3.29 avg=3.32
[7 | 39.80] loss=3.30 avg=3.31
[8 | 44.12] loss=3.09 avg=3.28
[9 | 48.49] loss=3.21 avg=3.28
[10 | 52.83] loss=2.99 avg=3.25
[11 | 57.19] loss=3.04 avg=3.23
[12 | 61.56] loss=3.13 avg=3.22
[13 | 65.92] loss=3.17 avg=3.21
[14 | 70.31] loss=3.16 avg=3.21
[15 | 74.69] loss=3.10 avg=3.20
[16 | 79.09] loss=2.91 avg=3.18
[17 | 83.44] loss=3.02 avg=3.17
[18 | 87.83] loss=2.98 avg=3.16
[19 | 92.25] loss=2.94 avg=3.15
[20 | 96.64] loss=3.16 avg=3.15
[21 | 101.04] loss=3.12 avg=3.15
[22 | 105.42] loss=3.27 avg=3.15
[23 | 109.80] loss=2.93 avg=3.14
[24 | 114.20] loss=3.10 avg=3.14
[25 | 118.59] loss=3.00 avg=3.13
[26 | 123.01] loss=2.88 avg=3.12
[27 | 127.41] loss=3.07 avg=3.12
[28 | 131.81] loss=2.94 avg=3.11
[29 | 136.20] loss=3.15 avg=3.12
[30 | 140.59] loss=2.84 avg=3.10
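
The `avg` column in the log above smooths the noisy per-step loss. The printed values are consistent with a bias-corrected exponential moving average using a decay of 0.99. That decay is an assumption inferred from the numbers in the log, not something the notebook states. The sketch below recomputes the average for the first five logged losses; `running_average` is a name introduced here.

```python
def running_average(losses, decay=0.99):
    """Bias-corrected exponential moving average of a loss sequence."""
    weighted_sum = 0.0
    weight = 0.0
    averages = []
    for loss in losses:
        weighted_sum = weighted_sum * decay + loss
        weight = weight * decay + 1.0
        averages.append(weighted_sum / weight)
    return averages

first_losses = [3.15, 3.56, 3.27, 3.37, 3.24]  # loss values of steps 1-5 above
formatted = [f"{avg:.2f}" for avg in running_average(first_losses)]
print(formatted)  # ['3.15', '3.36', '3.33', '3.34', '3.32']
```

Because the average reacts slowly, a steadily falling `avg` is a better sign of progress than any single `loss` value.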


In [0]:
#save a checkpoint to your drive which can be loaded later.
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

#Generate text

In [3]:
#reset tensorflow graph and load model from drive with new session.
tf.reset_default_graph()
gpt2.copy_checkpoint_from_gdrive(run_name='run1')
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')

Loading checkpoint checkpoint/run1/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-1000
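
Since data gathering and fine-tuning only need to happen once, it can help to check whether a checkpoint is already present before deciding which cells to run. A minimal sketch follows. It assumes checkpoints live under `checkpoint/<run_name>`, as the log line above suggests, and `has_checkpoint` is a hypothetical helper, not part of gpt-2-simple.

```python
import os

def has_checkpoint(run_name, checkpoint_dir="checkpoint"):
    """Return True if a non-empty checkpoint folder exists for run_name."""
    path = os.path.join(checkpoint_dir, run_name)
    return os.path.isdir(path) and len(os.listdir(path)) > 0

if has_checkpoint("run1"):
    print("Checkpoint found: data gathering and fine-tuning can be skipped.")
else:
    print("No checkpoint found: gather data and fine-tune first.")
```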


In [11]:
#provide a prompt on which the model will continue to generate text.
prompt = "Norman buys breakfast cereal." 
prompt = "<|startoftext|> " + prompt
#temperature controls how conservative vs. random the output is; values between 0 and 1 are typical.
#a value around 0.7 yields interesting results
temp = 0.7

response = gpt2.generate(sess, temperature = temp, prefix = prompt, return_as_list = True)

text_as_list = response[0].split()
if "<|endoftext|>" in text_as_list:
  print(" ".join(text_as_list[1:text_as_list.index("<|endoftext|>")]).replace(". ", ".\n"))
else:
  print(" ".join(text_as_list[1:]).replace(". ", ".\n"))

Norman buys breakfast cereal.
The price of the cereal is $1.99 a piece.
Norman laughs at himself for buying a whole bowl of cereal because he forgot to buy milk.
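
To give some intuition for the temperature setting used above, the sketch below shows the usual way temperature enters sampling: the model's scores are divided by the temperature before being turned into probabilities. This is generic sampling math, not code taken from gpt-2-simple, and the scores are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after scaling by temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which gives safer but more repetitive text. Higher temperatures flatten the distribution and make surprising continuations more likely.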
