# Mounting drive

Create a shortcut to the project folder in your main Drive.

In [None]:
from google.colab import drive

drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
!ls drive/MyDrive/nlp-project

'char_level_model (1).ipynb'	   data		'NLP project plan.gdoc'
 char_level_model.ipynb		   model.ipynb	 proovitud-mudelid.gdoc
'Copy of char_level_model.ipynb'   models	 test.ipynb


In [None]:
data_path = "drive/MyDrive/nlp-project/data/reddit_jokes.json"

# Text generation with an RNN

Tutorial: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/ 

## Setup

### Import TensorFlow and other libraries

In [None]:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow import keras

import numpy as np
import os
import time
import json
import pandas as pd
import random

from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, GRU, SimpleRNN
from keras.layers import Embedding

from functools import reduce


## Process the text

### Reddit dataset: cleaning

Cleaning the dataset:
- Remove everything in a post that follows "edit:"
- Remove duplicate posts
- Add a "joke" column to the DataFrame: the title combined with the body, or just the body if it already starts with the title
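
The three rules above can be sketched on plain Python dicts, without pandas. This is only an illustration: `clean_posts` and the sample posts are hypothetical, the 10-character prefix comparison mirrors the check used in `clean_df`, and the whitespace stripping is an added assumption so that otherwise identical jokes compare equal.

```python
import re

def clean_posts(posts):
    """Apply the three cleaning rules to a list of {"title", "body"} dicts."""
    seen = set()
    jokes = []
    for post in posts:
        # Rule 1: drop everything from "edit:" to the end of the line.
        title = re.sub(r"edit:.*", "", post["title"]).strip()
        body = re.sub(r"edit:.*", "", post["body"]).strip()
        # Rule 3: prepend the title unless the body already starts with it.
        joke = body if title[:10] == body[:10] else title + " " + body
        # Rule 2: skip jokes that were already seen.
        if joke in seen:
            continue
        seen.add(joke)
        jokes.append(joke)
    return jokes

# Hypothetical sample posts, not taken from the dataset file.
posts = [
    {"title": "Why did the cow go to the gym?", "body": "To work on his calves. edit: thanks!"},
    {"title": "Why did the cow go to the gym?", "body": "To work on his calves."},
    {"title": "A man walks into a bar.", "body": "A man walks into a bar. Ouch."},
]
print(clean_posts(posts))
```

The first two posts collapse into one joke once the "edit:" suffix is removed, and the third keeps only its body because the body repeats the title.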

In [None]:
number_of_jokes = 1000

In [None]:
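# Cleaning for the jokes dataset
def clean_df(df):
    # Work on a copy so the assignments below do not trigger SettingWithCopyWarning
    df = df.copy()

    # Remove everything after "edit:" (regex=True is required on recent pandas versions)
    df["title"] = df["title"].str.replace(r'edit:.*', '', regex=True)
    df["body"] = df["body"].str.replace(r'edit:.*', '', regex=True)

    # Create the "joke" column: prepend the title unless the body already starts with it
    df["joke"] = np.where(df["title"].str[:10] != df["body"].str[:10], df["title"] + " " + df["body"], df["body"])
    # Append the end-of-joke marker token
    df["joke"] = df["joke"] + " xyz"

    return df

# Read a JSON file of Reddit submissions with "title" and "body" fields, keep the first
# number_of_jokes rows and combine title and body into the "joke" column.
def read(json_filename):
    df = pd.read_json(path_or_buf=json_filename, orient='records', compression="infer")
    df = clean_df(df.iloc[:number_of_jokes])

    return df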

In [None]:
jokes_df = read(data_path)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.

In [None]:

jokes_df.head()

| | body | id | score | title | joke |
|---|---|---|---|---|---|
| 0 | Now I have to say "Leroy can you please paint ... | 5tz52q | 1 | I hate how you cant even say black paint anymore | I hate how you cant even say black paint anymo... |
| 1 | Pizza doesn't scream when you put it in the ov... | 5tz4dd | 0 | What's the difference between a Jew in Nazi Ge... | What's the difference between a Jew in Nazi Ge... |
| 2 | ...and being there really helped me learn abou... | 5tz319 | 0 | I recently went to America.... | I recently went to America.... ...and being th... |
| 3 | A Sunday school teacher is concerned that his ... | 5tz2wj | 1 | Brian raises his hand and says, “He’s in Heaven.” | Brian raises his hand and says, “He’s in Heave... |
| 4 | He got caught trying to sell the two books to ... | 5tz1pc | 0 | You hear about the University book store worke... | You hear about the University book store worke... |


In [None]:
jokes_df.shape

(10000, 5)

### Vectorize the text

In [None]:
jokes_list = jokes_df['joke'].to_numpy()
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(jokes_list)

In [None]:
list(tokenizer.index_word.values())[:5]

['the', 'a', 'and', 'xyz', 'to']

In [None]:
print(jokes_list[0])
print(len(jokes_list[0].split()))

I hate how you cant even say black paint anymore Now I have to say "Leroy can you please paint the fence?" xyz
23


In [None]:
vector = tokenizer.texts_to_sequences(jokes_list[0:1])
print(vector)
print(len(vector[0]))

[[6, 329, 45, 7, 1857, 210, 98, 150, 913, 784, 95, 6, 32, 5, 98, 785, 60, 7, 364, 913, 1, 672, 4]]
23
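
To make the numbers above less opaque, here is a minimal pure-Python sketch of what a frequency-ranked word index does. It is not the Keras implementation: `build_word_index` and `to_sequence` are hypothetical helpers, and the sketch only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also strips punctuation by default.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank: 1 is the most frequent word, 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

# Hypothetical toy corpus using the same "xyz" end marker as the notebook.
texts = ["the cat sat xyz", "the dog sat xyz", "the end xyz"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequence("the dog sat on the cat xyz", word_index))
```

Because every joke ends with `xyz`, the marker ranks among the most frequent tokens, which matches its position in the `index_word` listing above.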


### The prediction task

In [None]:
print(len(jokes_list))

1000


In [None]:
jokes_without_word = []  # input sequences (contexts) of length sequence_length
word_without_joke = []   # index of the target word that follows each context
sequence_length = 20
for j, joke in enumerate(jokes_list[:number_of_jokes]):
  # texts_to_sequences returns one list of word indices per input text
  joke_words = tokenizer.texts_to_sequences([joke])[0]
  if j % 1000 == 0:
    print(j)
  if len(joke_words) < 2:
    continue
  if len(joke_words) <= sequence_length:
    # Short joke: a single example, left-padded with zeros, predicting the last word
    word = joke_words[-1]
    seq = joke_words[:-1]
    seq = [0] * (sequence_length - len(seq)) + seq
    jokes_without_word.append(np.array(seq))
    word_without_joke.append(word)
  else:
    # Long joke: slide a window of sequence_length + 1 words over it
    for i in range(len(joke_words) - sequence_length):
      window = joke_words[i:i + sequence_length + 1]
      jokes_without_word.append(np.array(window[:-1]))
      word_without_joke.append(window[-1])


0


In [None]:
print(len(jokes_without_word))

26215
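
The windowing logic of the loop above can be seen more easily on a toy input. The sketch below assumes plain Python lists instead of NumPy arrays; `make_examples` is a hypothetical helper, not part of the notebook.

```python
def make_examples(tokens, sequence_length):
    """Turn one tokenized joke into (context, next_word) training pairs."""
    if len(tokens) < 2:
        return []  # nothing to predict from
    if len(tokens) <= sequence_length:
        # Short joke: one example, left-padded with zeros, predicting the last token.
        context = tokens[:-1]
        padded = [0] * (sequence_length - len(context)) + context
        return [(padded, tokens[-1])]
    # Long joke: slide a window of sequence_length + 1 tokens over it.
    examples = []
    for start in range(len(tokens) - sequence_length):
        window = tokens[start:start + sequence_length + 1]
        examples.append((window[:-1], window[-1]))
    return examples

print(make_examples([5, 6, 7], sequence_length=4))
# [([0, 0, 5, 6], 7)]
print(make_examples([1, 2, 3, 4, 5, 6], sequence_length=4))
# [([1, 2, 3, 4], 5), ([2, 3, 4, 5], 6)]
```

A joke with n tokens, where n is larger than `sequence_length`, yields n - `sequence_length` examples, while a shorter joke yields exactly one.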


In [None]:
jokes_without_word[:5]

[array([   6,  329,   45,    7, 1857,  210,   98,  150,  913,  784,   95,
           6,   32,    5,   98,  785,   60,    7,  364,  913]),
 array([ 329,   45,    7, 1857,  210,   98,  150,  913,  784,   95,    6,
          32,    5,   98,  785,   60,    7,  364,  913,    1]),
 array([  45,    7, 1857,  210,   98,  150,  913,  784,   95,    6,   32,
           5,   98,  785,   60,    7,  364,  913,    1,  672]),
 array([  96,    1,  234,  195,    2,  786,   10, 1858, 1859,    3,  787,
         787,  211, 1094,   41,    7,  196,   11,   10,    1]),
 array([   1,  234,  195,    2,  786,   10, 1858, 1859,    3,  787,  787,
         211, 1094,   41,    7,  196,   11,   10,    1, 1362])]

In [None]:
for joke in jokes_without_word:
  if len(joke) != sequence_length:
    print(len(joke), joke)

In [None]:
# print(word_without_joke[:5])
# print(jokes_without_word[:5])
X = np.array(jokes_without_word)
y = word_without_joke
print(len(X))
print(len(y))
# vocabulary size (+1 because index 0 is reserved for padding)
vocab_size = len(tokenizer.word_index) + 1

# one-hot encode the target words
y = to_categorical(y, num_classes=vocab_size)

# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=sequence_length))
model.add(GRU(64))
model.add(Dense(64, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
# print(len(X), len(y))
model.fit(X, y, batch_size=64, epochs=100, validation_split=0.05)
 
# save the model to file
# model.save('model.h5')
# save the tokenizer
# dump(tokenizer, open('tokenizer.pkl', 'wb'))

26215
26215
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 20, 50)            297150    
_________________________________________________________________
gru (GRU)                    (None, 64)                22272     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 5943)              386295    
Total params: 709,877
Trainable params: 709,877
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 

<tensorflow.python.keras.callbacks.History at 0x7fcddcd7ecd0>
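
The parameter counts in the summary above can be reproduced by hand. The arithmetic below is a sketch that takes the layer sizes from the model definition and the vocabulary size from the output shape of the last layer; the two-bias GRU layout is an assumption that matches the Keras default `reset_after=True`.

```python
vocab_size = 5943      # from the output shape of the last Dense layer
embedding_dim = 50
gru_units = 64
dense_units = 64

# Embedding: one vector of embedding_dim values per vocabulary entry.
embedding = vocab_size * embedding_dim
# GRU: 3 gates, each with input weights, recurrent weights and two bias vectors
# (assumed layout for reset_after=True).
gru = 3 * (embedding_dim * gru_units + gru_units * gru_units + 2 * gru_units)
# Dense layers: weight matrix plus one bias per output unit.
dense = gru_units * dense_units + dense_units
output = dense_units * vocab_size + vocab_size

total = embedding + gru + dense + output
print(embedding, gru, dense, output, total)
# 297150 22272 4160 386295 709877
```

More than half of the parameters sit in the final softmax layer, so the vocabulary size dominates the model size.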

In [None]:
model.save('drive/MyDrive/nlp-project/model_with_end.h5')

In [None]:
# Printing a picture of the architecture of the model

# keras.utils.plot_model(model, "initial_joke_generator.png")

In [None]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words=None):
    result = list()
    in_text = seed_text
    # generate words until the end marker 'xyz' is predicted,
    # or until n_words words have been generated (if n_words is given)
    while n_words is None or len(result) < n_words:
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pad or truncate to the fixed input length, keeping the most recent words
        encoded = tf.keras.preprocessing.sequence.pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the most likely next word index
        # (predict_classes was removed from recent TensorFlow versions)
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # map predicted word index to word
        out_word = tokenizer.index_word.get(yhat, '')
        # append to input
        in_text += ' ' + out_word
        if out_word == 'xyz':
            result.append('.')
            break
        result.append(out_word)
    return ' '.join(result)
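
The `pad_sequences` call determines which part of the seed the model actually sees. The sketch below mimics its behaviour for a single sequence in plain Python, assuming `truncating='pre'` together with the default `padding='pre'`; `pad_or_truncate_pre` is a hypothetical helper and expects `maxlen` to be positive.

```python
def pad_or_truncate_pre(tokens, maxlen):
    """Keep the last maxlen tokens and left-pad shorter inputs with zeros."""
    tokens = tokens[-maxlen:]                     # drop the oldest tokens if too long
    return [0] * (maxlen - len(tokens)) + tokens  # pad on the left if too short

print(pad_or_truncate_pre([7, 8, 9], maxlen=5))
# [0, 0, 7, 8, 9]
print(pad_or_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
# [3, 4, 5, 6, 7]
```

With a sequence length of 20, only the last 20 words of a long seed influence the prediction, which is the same layout the training windows use.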

In [None]:
seed_text = " ".join(jokes_list[random.randrange(len(jokes_list))].split()[:5])
print(seed_text + '\n')

generate_seq(model, tokenizer, sequence_length, seed_text)

Why did the boy leave





'the man .'

In [None]:
for i in range(10):
	seed_text = " ".join(jokes_list[i].split()[:-3])
	print(seed_text + '\n')
	print(generate_seq(model, tokenizer, sequence_length, seed_text))
	print("\nReal joke:")
	print(" ".join(jokes_list[i].split()))
	print("------------------------")

I hate how you cant even say black paint anymore Now I have to say "Leroy can you please paint

the fence .

Real joke:
I hate how you cant even say black paint anymore Now I have to say "Leroy can you please paint the fence?" xyz
------------------------
What's the difference between a Jew in Nazi Germany and pizza ? Pizza doesn't scream when you put it in the oven . I'm





so sorry .

Real joke:
What's the difference between a Jew in Nazi Germany and pizza ? Pizza doesn't scream when you put it in the oven . I'm so sorry. xyz
------------------------
I recently went to America.... ...and being there really helped me learn about American culture. So I visited a shop and as I was leaving, the Shopkeeper said "Have a nice day!" But I didn't so I

sued him .

Real joke:
I recently went to America.... ...and being there really helped me learn about American culture. So I visited a shop and as I was leaving, the Shopkeeper said "Have a nice day!" But I didn't so I sued him. xyz
------------------------
Brian raises his hand and says, “He’s in Heaven.” A Sunday school teacher is concerned that his students might be a little confused about Jesus, so he asks his class, “Where is Jesus today?” Brian raises his hand and says, “He’s in Heaven.” Susan answers, “He’s in my heart.” Little Johnny waves his hand furiously and blurts out, “He’s in our bathroom!” The teach

In [None]:
for i in range(1, 11):
	seed_text = " ".join(jokes_list[-i].split()[:-3])
	print(seed_text + '\n')
	print(generate_seq(model, tokenizer, sequence_length, seed_text))
	print("\nReal joke:")
	print(" ".join(jokes_list[-i].split()))
	print("------------------------")

Rule #1 for learning english Their our

.

Real joke:
Rule #1 for learning english Their our know rules! xyz
------------------------
I came up with a science joke... Why are people with diamond shoes so bad for the environment? They have a big





black rang and said are having this joke the priest yells how many kinds of whiskey then the next morning she sees a lady after something and does and leroy here is too million help chocolate be he bang after that ” replies well we look magnificent the man has quite he'll go '' the us .

Real joke:
I came up with a science joke... Why are people with diamond shoes so bad for the environment? They have a big carbon footprint... xyz
------------------------
A kindergarten student told his teacher he'd found a cat, but it was dead. "How do you know that the cat was dead?" she asked her student. "Because I pissed in its ear and it didn't move," answered the child innocently. "You did WHAT?!?!?!" the teacher yelled in shock. "You know," explained the boy, "I leaned over and went 'Pssst!' and it

is possible that he could have to do .

Real joke:
A kindergarten student told his teacher he'd found a cat, but it was dead. "How do you know that the cat was dead?" she asked her student. "Because

In [None]:
for i in range(1, 11):
  words = jokes_list[-i].split()
  seed_text = " ".join(words[:-3])[:20]
  print(seed_text)
  print(words[-3:])

['Edit:', 'A', 'name']
I hate housework. You do the dishes and you do the washing. Then six months later you have
['to', 'start', 'again.']
American healthcare be like Ooh no insurance? Bad news you owe $500 I am so sorry! Oh you have insurance! Great news, in that case bullshit bullshit bullshit bullshit bullshit bullshit bullshit bullshit bullshit bullshit bullshit but fortunately all you would owe is only $700 for
['deductible', 'and', 'copay!']
What does a neckbeard get when he's
['sick', 'A', 'malady.']
Sometimes when I turn off the lights and masturbate, it feels like Jesus is watching me. Mexican
['prison', 'is', 'shit.']
Trump is a real asset to the country! Fucking Siri! I said *Ass
['Hat*', 'not', '*Asset*!!!']
Why did the cow go to the gym? To work
['on', 'his', 'calves.']
A piece of string walks into a bar... "Hi, I'd like to be served please." Says the string. "We cannot serve you, as you are only a string." Said the bartender. Upset, the string comes out of the bar, tears

In [None]:

jokes_list = jokes_df['joke'].to_numpy()

In [None]:
print(len(jokes_list))

3000


In [None]:
for i in range(1, 11):
  seed_text = " ".join(jokes_list[-i].split()[:-3])
  print(seed_text + '\n')
  print(generate_seq(model, tokenizer, sequence_length, seed_text, 3))
  print("------------------------")

A guy loses an eye on a fishing trip with his friends As he is laying in the hospital bed surrounded by all his family and friends after the surgery, his best friend rushes in the room and says: -I have great news!! I just ran into the doctor and he said you're not going to lose your eye! Everybody in the room turns around and the wounded man asks -Are you serious?! -Yeah! The doctor said he's going to put it in a jar with Formaldehyde and you get

the condoms and
------------------------
A turtle is walking across the yard . . . Three snails come up and mug him. Later the cops are asking questions about the mugging: "Can you describe your attackers?" The turtle responds, "I don't know, it all happened so fast





that right back
------------------------
How do you make a magician cry? You make

topic topic questions
------------------------
Why did the chicken cross the road? To get to the

side eater am
------------------------
Why was the Buddhist sad when he was asked to send his resume to the company as a word document via email? Attachment

to get the
------------------------
When A Teacher Asks You If You Did Your Homework Teacher: Did you do your homework? Student: Did you grade my test? Teacher:I have other students' tests to grade. Student: I have other teachers'

enough of course
------------------------
How to lose weight while still eating fast food? Buy food from England, you tend to lose

redial though that
------------------------
An unhappy teenage boy decides to ditch school for the first time, in order to get away from the people who only know he exists is when they are bullying him. He decides to wander the woods near his childhood home. After wandering for a while he comes u

In [None]:
for i in range(1, 11):
  words = jokes_list[-i].split()
  seed_text = " ".join(words[:-3])[:20]
  print(seed_text)
  print(" ".join(words[-3:]))
  print()

A guy loses an eye o
to keep it!

A turtle is walking 
. . ."

How do you make a ma
his family disappear.

Why did the chicken 
New York Times!

Why was the Buddhist
leads to suffering.

When A Teacher Asks 
homework to do.

How to lose weight w
a few pounds.

An unhappy teenage b
bears don't talk.

I found a kind of to
was a rad-ish

I told myself I woul
talks to himself.
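
The comparison between generated and real endings is done by eye in these cells. One crude way to put a number on it is to count exact word matches at each position. The sketch below is only an illustration: `ending_overlap` is a hypothetical helper, and exact string matching is sensitive to case handling and punctuation.

```python
def ending_overlap(generated, reference, n_words=3):
    """Count positions where the generated words equal the last n_words of the reference."""
    gen = generated.lower().split()[:n_words]
    ref = reference.lower().split()[-n_words:]
    return sum(g == r for g, r in zip(gen, ref))

print(ending_overlap("on his calves.", "Why did the cow go to the gym? To work on his calves."))
# 3
print(ending_overlap("side eater am", "To get to the New York Times!"))
# 0
```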



In [None]:
seed_text = " ".join(jokes_list[-25].split()[:-3])
print(seed_text + '\n')
print(generate_seq(model, tokenizer, sequence_length, seed_text, 3))
print("------------------------")
print(" ".join(jokes_list[-25].split()[-3:]))

Two chemists go into a bar. The first one says "I think I'll have an H2O." The second one says "I think I'll have an H2O too" He

replied me the
------------------------
died shortly after.


