<table align = 'center'><td><a href = "https://colab.research.google.com/drive/1lA-rrmdr36jImFrukuoZd5nFfToGAKEz#scrollTo=vpfuzzvOmauq">Open in Colab</a></td>

<td><a href = "https://github.com/joshchen984/Inspirational-Quotes-Generator/tree/master">View on Github</a></td>
</table>

#Quotes Generation Notebook

In this notebook I use LSTMs to try to generate inspirational quotes, one character at a time.

Quotes were taken from https://github.com/ShivaliGoel/Quotes-500K

Sections:


1.   Loading Data
2.   Cleaning Data
  *   Mapping Characters to Unique Numbers
  *   Creating Sequences
  *   One Hot Encoding Sequences
3.   Creating Model
4.   Training Model
5.   Using Model



In [27]:
import pandas as pd
import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Input, Dense, LSTM, Dropout, Flatten, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import string
import re

##Loading Data

Run the cell below to upload quotes.zip

In [2]:
#upload quotes.zip here
#comment this cell out if you aren't running this in google colab
from google.colab import files
uploaded = files.upload()
!unzip quotes.zip

Saving quotes.zip to quotes.zip
Archive:  quotes.zip
  inflating: quotes.csv              


In [3]:
#if running this on your local machine then change data_path to the path of your quotes.csv
data_path = "/content/quotes.csv"
df = pd.read_csv(data_path, header = None)



In [4]:
df = df.loc[:,:2]
df = df.astype(str)
df.columns = ["quote", "author", "tags"]
#getting rid of non-ascii characters
df['quote'] = df['quote'].apply(lambda text:re.sub(r'[^\x00-\x7F]',' ', text))
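
As a small illustration of the substitution above, the sketch below applies the same regular expression to a made-up string (the sample text is an assumption, not taken from the dataset). Each non-ASCII character is replaced by a single space, so the length of the text does not change.

```python
import re

# Hypothetical sample text containing an accented letter and curly quotes.
sample_text = "caf\u00e9 \u201chello\u201d"

# Same pattern as above: any character outside the ASCII range becomes a space.
cleaned = re.sub(r'[^\x00-\x7F]', ' ', sample_text)
print(repr(cleaned))
```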

In [8]:
df.head()

| | quote | author | tags |
|---|---|---|---|
| 0 | I'm selfish, impatient and a little insecure. ... | Marilyn Monroe | attributed-no-source, best, life, love, mistak... |
| 1 | You've gotta dance like there's nobody watchin... | William W. Purkey | dance, heaven, hurt, inspirational, life, love... |
| 2 | You know you're in love when you can't fall as... | Dr. Seuss | attributed-no-source, dreams, love, reality, s... |
| 3 | A friend is someone who knows all about you an... | Elbert Hubbard | friend, friendship, knowledge, love |
| 4 | Darkness cannot drive out darkness: only light... | Martin Luther King Jr., A Testament of Hope: T... | darkness, drive-out, hate, inspirational, ligh... |


In [16]:
num_quotes = df.shape[0]
num_quotes

499709

In [7]:
#showing a quote
df['quote'][0]

"I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best."

##Cleaning Data

In [18]:
#turning quotes into numpy array so we can do more things with it
quotes_array = df['quote'].values

#only using a quarter of quotes because we don't need all 500k quotes
quotes_array = quotes_array[:num_quotes//4]
num_quotes = len(quotes_array)

#length of each input sequence (number of characters in the input)
T = 30

#number of characters to advance between consecutive sequences
step = 3
X_sequences = []
Y_sequences = []


###Mapping Characters to Unique Numbers
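First we collect all the unique characters and store them in `chars`. The characters are taken from the first 5,000 quotes only, so a character that appears only in later quotes will not be in `chars`.

Then we create a dictionary, `chars_index`, that maps each character to a unique number.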


In [19]:
chars = sorted(list(set(np.array2string(quotes_array[:5_000], threshold = 5_001)[1:-1])))
chars_index = dict( (c, i) for i, c in enumerate(chars))
len_chars = len(chars)

In [20]:
len_chars

87
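
The mapping can be illustrated on a tiny made-up corpus. This is a minimal sketch that joins the strings directly instead of going through `np.array2string` as the cell above does; `toy_quotes` and the other `toy_` names are hypothetical.

```python
# Hypothetical miniature corpus standing in for quotes_array.
toy_quotes = ["be bold", "be kind"]

# Collect every distinct character, sorted so the numbering is reproducible.
toy_chars = sorted(set("".join(toy_quotes)))

# Map each character to its position in the sorted list.
toy_chars_index = {c: i for i, c in enumerate(toy_chars)}

print(toy_chars)
print(toy_chars_index)
```

Sorting before numbering means the same corpus always yields the same mapping, which matters when a saved model is reloaded later.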

###Creating Sequences

In [21]:
for quote in quotes_array:
  for c in range(0,len(quote) - T, step):
    X_sequences.append(quote[c:c+T])
    Y_sequences.append(quote[c+T])

In [22]:
#number of rows
N = len(X_sequences)

In [23]:
N

7356430

In [24]:
#example of row
X_sequences[134]

'Darkness cannot drive out dark'
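
To make the windowing concrete, here is the same loop run on a short made-up string with a smaller window. The `toy_` names and values are assumptions chosen so the result is easy to read.

```python
# Hypothetical short text and small window so the output is easy to read.
toy_text = "abcdefghij"
toy_T = 4      # input length (the notebook uses T = 30)
toy_step = 3   # stride between windows (the notebook uses step = 3)

toy_X, toy_Y = [], []
for c in range(0, len(toy_text) - toy_T, toy_step):
    toy_X.append(toy_text[c:c + toy_T])   # toy_T input characters
    toy_Y.append(toy_text[c + toy_T])     # the character that follows them

for x, y in zip(toy_X, toy_Y):
    print(repr(x), "->", repr(y))
```

Each input window is paired with the single character that comes right after it, which is what the model learns to predict.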

###One Hot Encoding Sequences
We can't simply replace each character with its number, because characters aren't
<a href = "https://cyfar.org/types-variables#:~:text=The%20three%20types%20of%20categorical,ordinal%E2%80%94are%20explained%20further%20below.">ordinal variables</a>: the numbers would imply an ordering between characters that does not exist.

Instead, we one-hot encode each character.

In [25]:
def one_hot_encode(batch_size, x_sequences, y_sequences):
  #one hot encoding characters
  batch = 0
  x = np.zeros((batch_size, T, len_chars))
  y = np.zeros((batch_size, len_chars))
  while True:
    for i, sentence in enumerate(x_sequences):
      for j, char in enumerate(sentence):
        #characters missing from chars_index are encoded as a space
        try:
          x[batch, j, chars_index[char]] = 1
        except KeyError:
          x[batch, j, chars_index[' ']] = 1

      #the target character gets the same fallback
      try:
        y[batch, chars_index[y_sequences[i]]] = 1
      except KeyError:
        y[batch, chars_index[' ']] = 1

      batch+=1

      if batch >=batch_size:
        yield (x, y)
        x = np.zeros((batch_size, T, len_chars))
        y = np.zeros((batch_size, len_chars))
        batch = 0
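
The encoding of a single sequence can be sketched on a three-character vocabulary. The vocabulary and the sequence below are made up for illustration; the fallback to a space for unknown characters mirrors what `one_hot_encode` does.

```python
import numpy as np

# Hypothetical three-character vocabulary.
toy_index = {' ': 0, 'a': 1, 'b': 2}
toy_sequence = "ab?"   # '?' is deliberately not in the vocabulary

encoded = np.zeros((len(toy_sequence), len(toy_index)))
for j, char in enumerate(toy_sequence):
    # Unknown characters fall back to the space column.
    encoded[j, toy_index.get(char, toy_index[' '])] = 1

print(encoded)
```

Every row contains exactly one 1, in the column of the character at that position.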


##Creating Model

In [None]:
i = Input(shape = (T, len_chars))
x = LSTM(128)(i)
x = Dropout(0.2)(x)
x = Dense(len_chars, activation = 'softmax')(x)

model = Model(i, x)
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

In [None]:
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 30, 87)]          0         
_________________________________________________________________
lstm (LSTM)                  (None, 128)               110592    
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 87)                11223     
Total params: 121,815
Trainable params: 121,815
Non-trainable params: 0
_________________________________________________________________
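
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and for a dense layer; the variable names are only for illustration.

```python
input_dim = 87   # len_chars
units = 128      # LSTM units

# Each of the 4 LSTM gates has input weights, recurrent weights, and a bias.
lstm_params = 4 * (input_dim * units + units * units + units)

# Dense layer: one weight per (unit, output) pair plus one bias per output.
dense_params = units * input_dim + input_dim

print(lstm_params, dense_params, lstm_params + dense_params)
```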


In [None]:
#after every epoch save the model weights to my google drive
path = "/content/gdrive/My Drive/models/quote{epoch:02d}-loss{loss:.3f}-val_loss{val_loss:.3f}.hdf5"
checkpoint = ModelCheckpoint(path, monitor = 'loss', save_best_only = False, save_weights_only = True, mode = "min")
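
The placeholders in the checkpoint path are filled in with the epoch number and the logged metrics using Python's string formatting. As a rough sketch, the snippet below formats the file-name part of the template with made-up metric values to show what a saved file would be called.

```python
# File-name part of the checkpoint path used above.
name_template = "quote{epoch:02d}-loss{loss:.3f}-val_loss{val_loss:.3f}.hdf5"

# Made-up values purely for illustration; real values come from training.
example_name = name_template.format(epoch=3, loss=1.23456, val_loss=1.34567)
print(example_name)
```

Because the template includes `val_loss`, validation data has to be passed to `model.fit` so that the metric is available when the file name is built.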

##Training Model

In [None]:
batch_size = 128

In [None]:
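#The first half of the sequences is used for training and the second half for validation.
#Note: N//batch_size * 2 evaluates to (N//batch_size) * 2 steps. Each generator loops forever
#over its half, so one epoch draws about four passes' worth of batches from it.
#Use N // (batch_size * 2) for a single pass per epoch.
r = model.fit(one_hot_encode(batch_size, X_sequences[:N//2], Y_sequences[:N//2]),
              validation_data = one_hot_encode(batch_size, X_sequences[N//2:], Y_sequences[N//2:]),
              validation_batch_size = batch_size,
              validation_steps = N//batch_size * 2,
              batch_size = batch_size,
              epochs = 50,
              steps_per_epoch = N//batch_size * 2,
              verbose = 1,
              callbacks = [checkpoint])
#On a GPU, 25 epochs take around 7 hours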

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50

KeyboardInterrupt: ignored

##Using Model
###Picking a Character from Predictions

A higher temperature means the model takes more chances.

A lower temperature means the model is more conservative.

(If the temperature is low, the output has a high chance of getting stuck in a repeating loop.)

The temperature must be greater than 0. This notebook uses values between 0 and 1; at 1.0 the predicted probabilities are used unchanged.

In [9]:
def sample(preds, temperature = 1.0):
  #chooses a character from predictions
  preds = np.asarray(preds, dtype = np.float64)
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds/np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)
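
The effect of temperature can be seen by applying the same rescaling to a made-up distribution. In this sketch, `apply_temperature` is a hypothetical helper that repeats the arithmetic of `sample` without the random draw, and `toy_preds` is invented for illustration.

```python
import numpy as np

def apply_temperature(preds, temperature):
    # Hypothetical helper: the same rescaling as sample(), without the random draw.
    logits = np.log(np.asarray(preds, dtype=np.float64)) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_preds = [0.6, 0.3, 0.1]   # made-up probabilities over three characters

for t in (0.2, 0.5, 1.0):
    print(t, np.round(apply_temperature(toy_preds, t), 3))
```

As the temperature drops, more of the probability mass moves to the most likely character, which is why low temperatures give repetitive text.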

In [12]:
def generate(length, model,temperature = 1.0):
  #choosing random quote for the start seed
  start_quote_index = np.random.randint(0, num_quotes)
  start_quote = quotes_array[start_quote_index]

  while len(start_quote) < T:
    #making sure start quote is at least T characters long
    start_quote_index = np.random.randint(0, num_quotes)
    start_quote = quotes_array[start_quote_index]

  #getting random index in quote
  start_index = np.random.randint(0, len(start_quote)-T+1)
  seed = start_quote[start_index:start_index+T]

  print(f"Start seed: {seed}")
  print(f"Temperature: {temperature}")
  print()

  generated = ""
  for i in range(length):
    x_pred = np.zeros((1,T,len_chars))

    #one hot encoding x_pred (characters missing from chars_index are encoded as a space)
    for j, char in enumerate(seed):
      x_pred[0,j,chars_index.get(char, chars_index[' '])] = 1

    preds = model.predict(x_pred)[0]
    
    #choosing next character
    next_index = sample(preds, temperature)

    #adding char to generated text
    next_char = chars[next_index]
    generated +=next_char

    #printing generated character to screen
    print(next_char, end='')

    #getting rid of left character and adding next_char to end of seed
    seed = seed[1:] + next_char

Run the cell below to upload the quote-model.h5 file

In [13]:
#comment this cell out if you aren't running this on google colab
uploaded = files.upload()

Saving quote-model.h5 to quote-model.h5


In [14]:
#if running this on your local machine then change model_path to the path of your quote-model.h5
model_path = "/content/quote-model.h5"
saved_model = tf.keras.models.load_model(model_path)

### Generating New Quotes

Run the cell below to generate new quotes.

The first argument specifies the length of the new quote in characters.

The second argument is the trained model.

The third argument specifies the temperature.

In [26]:
generate(50, saved_model, 0.5)

Start seed: s is a path of setbacks, passi
Temperature: 0.5

on and even the world and feeling and works. The p