# Mounting drive

Create shortcut to project folder in your main drive.

In [None]:
from google.colab import drive

drive.mount("/content/drive/")

Mounted at /content/drive/


In [None]:
!ls drive/MyDrive/nlp-project

 char_level_model.ipynb   models		    'NLP project plan.gdoc'
 data			  model_with_end.h5	     proovitud-mudelid.gdoc
 model.ipynb		  model_with_endings.ipynb   test.ipynb


In [None]:
data_path = "drive/MyDrive/nlp-project/data/reddit_jokes.json"

# Text generation with an RNN

Tutorial: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/ 

## Setup

### Import TensorFlow and other libraries

In [None]:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow import keras

import numpy as np
import os
import time
import json
import pandas as pd
import random

from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import GRU

from functools import reduce


## Process the text

#### Reddit dataset: cleaning

Cleaning the dataset:
- Remove anything from the post following \"edit: \"
- Remove duplicate posts
- Create a \"joke\" column to the df by either combining titles with bodies or just returning bodies(if it contains the title)

In [None]:
number_of_jokes = 10000

In [None]:
# Cleaning for jokes dataset
def clean_df(df):
    # Remove content after edit
    df["title"] = df["title"].str.replace(r'edit:.*', '')
    df["body"] = df["body"].str.replace(r'edit:.*', '')
    
    # Creating "joke" column
    df["joke"] = np.where(df["title"].str[:10] != df["body"].str[:10], df["title"] + " " + df["body"], df["body"])
    df['joke'] = df['joke'] + "❌"
    
    return df

# Read json, that has reddit submissions as "title" and "body", combine them into column "joke" and 
# remove duplicates.
def read(json_filename):
    df = pd.read_json(path_or_buf=json_filename,orient='records',compression="infer")
    print("All jokes len", len(df))
    df = clean_df(df.iloc[:number_of_jokes])
    print("Loaded", number_of_jokes)
    
    return df

In [None]:
jokes_df = read(data_path)

All jokes len 194553
Loaded 10000


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: 

In [None]:
jokes_df.head()

Unnamed: 0,body,id,score,title,joke
0,"Now I have to say ""Leroy can you please paint ...",5tz52q,1,I hate how you cant even say black paint anymore,I hate how you cant even say black paint anymo...
1,Pizza doesn't scream when you put it in the ov...,5tz4dd,0,What's the difference between a Jew in Nazi Ge...,What's the difference between a Jew in Nazi Ge...
2,...and being there really helped me learn abou...,5tz319,0,I recently went to America....,I recently went to America.... ...and being th...
3,A Sunday school teacher is concerned that his ...,5tz2wj,1,"Brian raises his hand and says, “He’s in Heaven.”","Brian raises his hand and says, “He’s in Heave..."
4,He got caught trying to sell the two books to ...,5tz1pc,0,You hear about the University book store worke...,You hear about the University book store worke...


### Vectorize the text

In [None]:
jokes_list = jokes_df['joke'].to_numpy()
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(jokes_list)

In [None]:
list(tokenizer.index_word.values())

[' ',
 'e',
 't',
 'a',
 'o',
 'i',
 's',
 'n',
 'h',
 'r',
 'd',
 'l',
 'u',
 'm',
 'w',
 'y',
 'c',
 'g',
 '.',
 'f',
 'p',
 'b',
 'k',
 '\n',
 ',',
 '"',
 'v',
 "'",
 '❌',
 '?',
 'j',
 '!',
 'x',
 ':',
 '-',
 '0',
 'z',
 '1',
 '*',
 'q',
 '2',
 '’',
 '”',
 '“',
 '5',
 '3',
 '4',
 ')',
 '9',
 '$',
 '6',
 '(',
 '7',
 '8',
 '=',
 '/',
 ';',
 '^',
 '…',
 '[',
 ']',
 '&',
 '‘',
 '\xa0',
 '#',
 '%',
 '_',
 '+',
 '>',
 'é',
 '\t',
 '~',
 '–',
 '\ufeff',
 '\r',
 '£',
 '—',
 '´',
 'ñ',
 '<',
 '\\',
 '€',
 '°',
 '@',
 '😂',
 '`',
 '\u2028',
 '😨',
 '\u200b',
 '\u200f',
 '\x9d',
 '͡',
 '‽',
 'π',
 '🤣',
 'è',
 '😇',
 '√',
 '͜',
 'ʖ',
 '×',
 '«',
 '»',
 '🇩',
 '🇰',
 '•',
 '∫',
 '¢',
 'ó',
 'μ',
 '}',
 '¡',
 'ì',
 '笑',
 '林',
 '浮',
 '白',
 '斋',
 '主',
 '人',
 '|',
 '😎',
 '😁',
 '😈',
 '☝',
 'ä',
 '😋',
 '\u2009',
 'à',
 '😊']

In [None]:
tokenizer.word_docs['ü']

0

In [None]:
tokenizer.texts_to_sequences(['❌'])

[[29]]

In [None]:
print(jokes_list[0])
print(len(jokes_list[0].split()))

I hate how you cant even say black paint anymore Now I have to say "Leroy can you please paint the fence?"❌
22


In [None]:
vector = tokenizer.texts_to_sequences(jokes_list[0:1])
print(vector)
print(len(vector[0]))

[[6, 1, 9, 4, 3, 2, 1, 9, 5, 15, 1, 16, 5, 13, 1, 17, 4, 8, 3, 1, 2, 27, 2, 8, 1, 7, 4, 16, 1, 22, 12, 4, 17, 23, 1, 21, 4, 6, 8, 3, 1, 4, 8, 16, 14, 5, 10, 2, 1, 8, 5, 15, 1, 6, 1, 9, 4, 27, 2, 1, 3, 5, 1, 7, 4, 16, 1, 26, 12, 2, 10, 5, 16, 1, 17, 4, 8, 1, 16, 5, 13, 1, 21, 12, 2, 4, 7, 2, 1, 21, 4, 6, 8, 3, 1, 3, 9, 2, 1, 20, 2, 8, 17, 2, 30, 26, 29]]
107


### The prediction task

In [None]:
print(len(jokes_list))

10000


In [None]:
jokes_without_word = []
word_without_joke = []
sequence_length = 100
for j, joke in enumerate(jokes_list[:number_of_jokes]):
  joke_words = tokenizer.texts_to_sequences([joke])
  joke_words = list(reduce(lambda a, b: a + b, joke_words))
  if j % 1000 == 0:
    print(j)
  """for i in range(len(joke_words)):
    if i >= sequence_length:
      break
    sequence = joke_words[:sequence_length] # (joke_words[:i] + joke_words[i+1:])[:sequence_length]
    if len(sequence) < sequence_length:
      sequence += [0] * (sequence_length - len(sequence))
    jokes_without_word.append(np.array(sequence))
    word_without_joke.append(joke_words[i])"""
  if len(joke_words) < 2:
    continue
  if len(joke_words) <= sequence_length:
    word = joke_words[-1]
    seq = joke_words[:len(joke_words) - 1]
    seq = [0] * (sequence_length - len(seq)) + seq
    jokes_without_word.append(np.array(seq))
    word_without_joke.append(word)
  else:
    for i in range(len(joke_words)):
      if len(joke_words) - i < sequence_length + 1:
        break
      window = joke_words[i:i + sequence_length + 1]
      word = window[-1]
      seq = window[:len(window) - 1]
      jokes_without_word.append(np.array(seq))
      word_without_joke.append(word)


0
1000
2000
3000
4000
5000
6000
7000
8000
9000


In [None]:
print(len(jokes_without_word))

1516374


In [None]:
word_without_joke[-1]

29

In [None]:
for joke in jokes_without_word:
  if len(joke) != sequence_length:
    print(len(joke), joke)

In [None]:
# print(word_without_joke[:5])
# print(jokes_without_word[:5])
X = np.array(jokes_without_word)
y = word_without_joke
# vocabulary size
print(len(X))
print(len(y))
vocab_size = len(tokenizer.word_index) + 1

y = to_categorical(y, num_classes=vocab_size)

# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=sequence_length))
model.add(GRU(64))
model.add(Dense(64, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
# print(len(X), len(y))
model.fit(X, y, batch_size=128, epochs=15)
 
# save the model to file
# model.save('model.h5')
# save the tokenizer
# dump(tokenizer, open('tokenizer.pkl', 'wb'))

1516374
1516374
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 50)           6550      
_________________________________________________________________
gru (GRU)                    (None, 64)                22272     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 131)               8515      
Total params: 41,497
Trainable params: 41,497
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
  598/11847 [>.............................] - ETA: 4:54 - loss: 1.5163 - accuracy

In [None]:
model.save('model.h5')

In [1]:
# Printing a picture of the architecture of the model

keras.utils.plot_model(model, "initial_joke_generator.png")

NameError: ignored

In [None]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
	result = list()
	in_text = seed_text
	# generate a fixed number of words
	for _ in range(n_words):
		# encode the text as integer
		encoded = tokenizer.texts_to_sequences([in_text])[0]
		# truncate sequences to a fixed length
		encoded = tf.keras.preprocessing.sequence.pad_sequences([encoded], maxlen=seq_length, truncating='pre')
		# predict probabilities for each word
		# yhat = model.predict_classes(encoded, verbose=0)
		choices = model.predict(encoded)
		yhat = np.random.choice(len(choices[0]), p=choices[0])
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += out_wordscrees
		if out_word == '❌':
			result.append('.')
			break
		result.append(out_word)
	return ''.join(result)

In [None]:
seed_text = " ".join(jokes_list[random.randint(0,len(jokes_list))].split()[:5])
print(seed_text + '\n')

generate_seq(model, tokenizer, sequence_length, seed_text, 10)

You hear about the University



'tsnicticst'

In [None]:
for i in range(10):
  seed_text = " ".join(jokes_list[i].split()[:-3])
  print(seed_text + '\n')
  print(generate_seq(model, tokenizer, sequence_length, seed_text, 30))
  print("------------------------")

NameError: ignored

In [None]:
for i in range(1, 11):
  seed_text = " ".join(jokes_list[-i].split()[:-3])
  print(seed_text + '\n')
  print(generate_seq(model, tokenizer, sequence_length, seed_text, 30))
  print("------------------------")



My wife asked why I never eat at museum cafes I told her it was because the food





 the man is the man in the man
------------------------
If Reddit were around during the 2000 elections, who would they have voted? Neither. They'd just take the opportunity to make

 you a business and i was a bu
------------------------
Who is Bush's favorite NFL Team? The

 man is the man in the man in 
------------------------
Why were the workers of the twin towers sad? They ordered pepperoni but

 i was a bull you a business i
------------------------
"What did two years of Spanish classes teach you in

 the man is the man is the man
------------------------
We'll give him gold and frankincense But wait, there's myrrh. I'm

 go to the man in the man in t
------------------------
What did the bad rapper get for

 you a business is the man is 
------------------------
Why did the kitchen renovator go to

 the man is the man is the man
------------------------
It takes 1,437 bolts to assemble a car. It takes one nut to scatter them all

 the man is the man is the man
---------------

In [None]:
for i in range(1, 11):
  words = jokes_list[-i].split()
  seed_text = " ".join(words[:-3])[:20]
  print(seed_text)
  print(words[-3:])

['Edit:', 'A', 'name']
I hate housework. You do the dishes and you do the washing. Then six months later you have
['to', 'start', 'again.']
American healthcare be like Ooh no insurance? Bad news you owe $500 I am so sorry! Oh you have insurance! Great news, in that case bullshit bullshit bullshit bullshit bullshit bullshit bullshit bullshit bullshit bullshit bullshit but fortunately all you would owe is only $700 for
['deductible', 'and', 'copay!']
What does a neckbeard get when he's
['sick', 'A', 'malady.']
Sometimes when I turn off the lights and masturbate, it feels like Jesus is watching me. Mexican
['prison', 'is', 'shit.']
Trump is a real asset to the country! Fucking Siri! I said *Ass
['Hat*', 'not', '*Asset*!!!']
Why did the cow go to the gym? To work
['on', 'his', 'calves.']
A piece of string walks into a bar... "Hi, I'd like to be served please." Says the string. "We cannot serve you, as you are only a string." Said the bartender. Upset, the string comes out of the bar, tears

In [None]:

jokes_list = jokes_df['joke'].to_numpy()

In [None]:
print(len(jokes_list))

3000


In [None]:
for i in range(1, 11):
  seed_text = " ".join(jokes_list[-i].split()[:-3])
  print(seed_text + '\n')
  print(generate_seq(model, tokenizer, sequence_length, seed_text, 3))
  print("------------------------")

A guy loses an eye on a fishing trip with his friends As he is laying in the hospital bed surrounded by all his family and friends after the surgery, his best friend rushes in the room and says: -I have great news!! I just ran into the doctor and he said you're not going to lose your eye! Everybody in the room turns around and the wounded man asks -Are you serious?! -Yeah! The doctor said he's going to put it in a jar with Formaldehyde and you get

the condoms and
------------------------
A turtle is walking across the yard . . . Three snails come up and mug him. Later the cops are asking questions about the mugging: "Can you describe your attackers?" The turtle responds, "I don't know, it all happened so fast





that right back
------------------------
How do you make a magician cry? You make

topic topic questions
------------------------
Why did the chicken cross the road? To get to the

side eater am
------------------------
Why was the Buddhist sad when he was asked to send his resume to the company as a word document via email? Attachment

to get the
------------------------
When A Teacher Asks You If You Did Your Homework Teacher: Did you do your homework? Student: Did you grade my test? Teacher:I have other students' tests to grade. Student: I have other teachers'

enough of course
------------------------
How to lose weight while still eating fast food? Buy food from England, you tend to lose

redial though that
------------------------
An unhappy teenage boy decides to ditch school for the first time, in order to get away from the people who only know he exists is when they are bullying him. He decides to wander the woods near his childhood home. After wandering for a while he comes u

In [None]:
for i in range(1, 11):
  words = jokes_list[-i].split()
  seed_text = " ".join(words[:-3])[:20]
  print(seed_text)
  print(" ".join(words[-3:]))
  print()

A guy loses an eye o
to keep it!

A turtle is walking 
. . ."

How do you make a ma
his family disappear.

Why did the chicken 
New York Times!

Why was the Buddhist
leads to suffering.

When A Teacher Asks 
homework to do.

How to lose weight w
a few pounds.

An unhappy teenage b
bears don't talk.

I found a kind of to
was a rad-ish

I told myself I woul
talks to himself.



In [None]:
seed_text = " ".join(jokes_list[-25].split()[:-3])
print(seed_text + '\n')
print(generate_seq(model, tokenizer, sequence_length, seed_text, 3))
print("------------------------")
print(" ".join(jokes_list[-25p].split()[-3:]))

Two chemists go into a bar. The first one says "I think I'll have an H2O." The second one says "I think I'll have an H2O too" He

replied me the
------------------------
died shortly after.


