##Generating Text with Neural Networks

##Imports
First, import the required libraries

In [23]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.models import Sequential 
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

##Building the Vocabulary
The dataset is the lyrics of Lanigan's Ball, a traditional Irish song. You will split it per line then use the `Tokenizer` class to build the word index dicitonary

In [24]:
#Define the lyrics of the song
data = "America God bless you if its good to ya America please take my hand Can you help me underst New Kung Fu Kenny Throw a steak off the ark to a pool full of sharks He’ll take it Leave him in the wilderness with a sworn nemesis He'll make it He'll make it Take the gratitude from him I bet he'll show you somethin' Woah Woah I chip a nigga lil' bit of nothin' I chip a nigga lil' bit of nothin' I chip a nigga lil' bit of nothin' I chip a nigga then throw the blower in his lap Walk myself to the court like Bitch I did that X rated Johnny don't wanna go to school no mo' no mo' Johnny said books ain't cool no mo' No mo' Johnny wanna be a rapper like his big cousin Johnny caught a body yesterday out hustlin' God bless America you know we all love him Yesterday I got a call like from my dog  like 101Said they killed his only son because of insufficient funds He was sobbin' he was mobbin' way belligerent and drunk Talkin' out his head philosophin' on what the Lord had done He said KDot can you pray for me It been a fucked up day for me I know that you anointed show me how to overcome He was lookin' for some closure hopin' I could bring him closer To the spiritual my spirit do know better but I told him I can't sugarcoat the answer for you this is how I feel If somebody kill my son that mean somebody gettin' killed Tell me what you do for love loyalty and passion of All the memories collected moments you could never touch I wait in front a nigga's spot and watch him hit his block I catch a nigga leavin' service if that's all I got I chip a nigga then throw the blower in his lap Walk myself to the court like Bitch I did that Ain't no Black Power when your baby killed by a coward I can't even keep the peace don't you fuck with one of ours It be murder in the street it be bodies in the hour Ghetto bird be on the street paramedics on the dial Let somebody touch my momma touch my sister touch my woman Touch my daddy touch my niece touch my nephew touch my brother You should chip a nigga then throw the blower in his lap Matter fact I'm 'bout to speak at this convention call you back  Alright kids we're gonna talk about gun control Pray for me Damn It's not a place This country is to me a sound Of drum and bass You close your eyes to look around Hail Mary Jesus and Joseph The great American flag is wrapped in drag with explosives Compulsive disorder sons and daughters Barricaded blocks and borders look what you taught us It's murder on my street your street back streets Wall Street Corporate offices banks employees and bosses with Homicidal thoughts Donald Trump's in office We lost Barack and promised to never doubt him again But is America honest or do we bask in sin Pass the gin I mix it with American blood Then bash him in you Crippin' or you married to Blood I'll aks again oops accident It's nasty when you set us up then roll the dice then bet us up You overnight the big rifles then tell Fox to be scared of us Gang members or terrorists et cetera et cetera America's reflections of me that's what a mirror does It's not a place This country is to me a sound Of drum and bass You close your eyes to look"
#Split the long string per line and put in a list 
corpus = data.lower().split('\n')

#Preview the result
print(corpus)

["america god bless you if its good to ya america please take my hand can you help me underst new kung fu kenny throw a steak off the ark to a pool full of sharks he’ll take it leave him in the wilderness with a sworn nemesis he'll make it he'll make it take the gratitude from him i bet he'll show you somethin' woah woah i chip a nigga lil' bit of nothin' i chip a nigga lil' bit of nothin' i chip a nigga lil' bit of nothin' i chip a nigga then throw the blower in his lap walk myself to the court like bitch i did that x rated johnny don't wanna go to school no mo' no mo' johnny said books ain't cool no mo' no mo' johnny wanna be a rapper like his big cousin johnny caught a body yesterday out hustlin' god bless america you know we all love him yesterday i got a call like from my dog  like 101said they killed his only son because of insufficient funds he was sobbin' he was mobbin' way belligerent and drunk talkin' out his head philosophin' on what the lord had done he said kdot can you pr

In [25]:
#Initialize the Tokenizer Class
tokenizer = Tokenizer()

#Generate the word index dictionary
tokenizer.fit_on_texts(corpus)

#Define the total words: Add 1 for the index '0' which is just the padding token
total_words = len(tokenizer.word_index) + 1

print(f'word index dictionary: {tokenizer.word_index}')
print(f'total words: {total_words}')

word index dictionary: {'a': 1, 'the': 2, 'you': 3, 'i': 4, 'to': 5, 'my': 6, 'of': 7, 'in': 8, 'and': 9, 'me': 10, 'him': 11, 'touch': 12, 'it': 13, 'nigga': 14, 'then': 15, 'his': 16, 'chip': 17, 'for': 18, 'with': 19, 'like': 20, 'no': 21, 'be': 22, 'is': 23, 'street': 24, 'america': 25, 'throw': 26, 'that': 27, 'johnny': 28, "mo'": 29, 'he': 30, 'on': 31, 'what': 32, 'this': 33, 'your': 34, "it's": 35, 'us': 36, 'if': 37, 'take': 38, "he'll": 39, "lil'": 40, 'bit': 41, "nothin'": 42, 'blower': 43, 'lap': 44, 'know': 45, 'we': 46, 'all': 47, 'killed': 48, 'was': 49, 'up': 50, 'do': 51, 'somebody': 52, 'look': 53, 'or': 54, 'god': 55, 'bless': 56, 'can': 57, 'make': 58, 'from': 59, 'bet': 60, 'show': 61, 'woah': 62, 'walk': 63, 'myself': 64, 'court': 65, 'bitch': 66, 'did': 67, "don't": 68, 'wanna': 69, 'said': 70, "ain't": 71, 'big': 72, 'yesterday': 73, 'out': 74, 'love': 75, 'got': 76, 'call': 77, 'son': 78, 'pray': 79, 'how': 80, 'could': 81, 'but': 82, "can't": 83, 'tell': 84, '

##Preprocessing the data

Next, you will be generating the training sequences and their labels. Take each line of the song and generate inputs and labels from it. 

For example, if you only have one sentence: 'I am using TensorFlwo', then you want the model to learn the next word given anyany subphrase of this sentence

The result would be inputs as padded sequences, and labels as one-hot encoded arrays

In [26]:
#Initialize the sequences list
input_sequences = []

#Loop over every line
for line in corpus:

  #Tokenize the current line 
  token_list = tokenizer.texts_to_sequences([line])[0]

  #Loop over the line several times to generate the subphrases
  for i in range(1, len(token_list)):

    #Generate the subphrase 
    n_gram_sequence = token_list[:i+1]

    #Append the subphrase to the sequences list 
    input_sequences.append(n_gram_sequence)

#Get the length of the longest line
max_sequence_len = max([len(x) for x in input_sequences])

#Pad all sequences 
input_sequences = np.array(pad_sequences(input_sequences, maxlen = max_sequence_len, padding='pre'))

#Create inputs and label them by splitting the last token in the subphrases
xs, labels = input_sequences[:,:-1], input_sequences[:,-1]

#Convert the label into one-hot arrays
ys = tf.keras.utils.to_categorical(labels, num_classes = total_words)

Let's see the reslt for the first line of the song. The particular line and the expected token sequence is shown in the cell below 

In [27]:
#Get the sample sentence 
sentence = corpus[0].split()
print(f'sample sentence: {sentence}')

#Initialize token list
token_list = []

#Look up the indices of each word and append to the list
for word in sentence:
  token_list.append(tokenizer.word_index[word])

print(token_list)

sample sentence: ['america', 'god', 'bless', 'you', 'if', 'its', 'good', 'to', 'ya', 'america', 'please', 'take', 'my', 'hand', 'can', 'you', 'help', 'me', 'underst', 'new', 'kung', 'fu', 'kenny', 'throw', 'a', 'steak', 'off', 'the', 'ark', 'to', 'a', 'pool', 'full', 'of', 'sharks', 'he’ll', 'take', 'it', 'leave', 'him', 'in', 'the', 'wilderness', 'with', 'a', 'sworn', 'nemesis', "he'll", 'make', 'it', "he'll", 'make', 'it', 'take', 'the', 'gratitude', 'from', 'him', 'i', 'bet', "he'll", 'show', 'you', "somethin'", 'woah', 'woah', 'i', 'chip', 'a', 'nigga', "lil'", 'bit', 'of', "nothin'", 'i', 'chip', 'a', 'nigga', "lil'", 'bit', 'of', "nothin'", 'i', 'chip', 'a', 'nigga', "lil'", 'bit', 'of', "nothin'", 'i', 'chip', 'a', 'nigga', 'then', 'throw', 'the', 'blower', 'in', 'his', 'lap', 'walk', 'myself', 'to', 'the', 'court', 'like', 'bitch', 'i', 'did', 'that', 'x', 'rated', 'johnny', "don't", 'wanna', 'go', 'to', 'school', 'no', "mo'", 'no', "mo'", 'johnny', 'said', 'books', "ain't", 'c

Since there are 8 tokens here, you can expect to find this particular line in the first 7 elements of  `xs` that you generated earlier. If we get the longest subphrase generated, that should be found in xs[6] 

See the padded token sequence below

In [28]:
#Pick element 
elem_number = 6

#Print token list and phrase
print(f'token list: {xs[elem_number]}')
print(f'decoded to text: {tokenizer.sequences_to_texts([xs[elem_number]])}')

token list: [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   

If you print out the label, it shoudl show `70` because that is the next word in the phrase (`lanigan`)

See the one-hot encoded form below. YOu can use the np.argmax() method to get the index of the 'hot' label

In [29]:
#Print label 
print(f'one-hot label: {ys[elem_number]}')
print(f'index of label: {np.argmax(ys[elem_number])}')

one-hot label: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0.]
index of label: 5


If you pick the element before that, you'll see the same subphrase as above minus one word

##Build the Model

Next build the model with basically the same layers as before. The main difference is that the sigmoid output is removed and a softmax activated `Dense` layer is used instead. This output layer will have one neuron for each word in the vocabulary. So given an input token list, the output array of the final layer will have a probability for each word.

In [30]:
#Build the model 
model = Sequential([
      Embedding(total_words, 64, input_length=max_sequence_len - 1),
      Bidirectional(LSTM(20)),
      Dense(total_words, activation='softmax')
])

#Use categorical crossentropy becasue this is a multi-class problem 
model.compile(loss='categorical_crossentropy',
              optimizer = 'adam',
              metrics = ['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 625, 64)           19008     
                                                                 
 bidirectional (Bidirectiona  (None, 40)               13600     
 l)                                                              
                                                                 
 dense (Dense)               (None, 297)               12177     
                                                                 
Total params: 44,785
Trainable params: 44,785
Non-trainable params: 0
_________________________________________________________________


##Train the model
You can now train the model. We have a relatively small vocabulary so it will only take a couple of minutes to complete 500 epochs 


In [None]:
history = model.fit(xs, ys, epochs=500)

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/500
Epoch 48/500
Epoch 49/500
Epoch 50/500
Epoch 51/500
Epoch 52/500
Epoch 53/500
Epoch 54/500
Epoch 55/500
Epoch 56/500
Epoch 57/500
Epoch 58/500
Epoch 59/500
Epoch 60/500
Epoch 61/500
Epoch 62/500
Epoch 63/500
Epoch 64/500
Epoch 65/500
Epoch 66/500
Epoch 67/500
Epoch 68/500
Epoch 69/500
Epoch 70/500
Epoch 71/500
Epoch 72/500
Epoch 73/500
Epoch 74/500
Epoch 75/500
Epoch 76/500
Epoch 77/500
Epoch 78

You can visualize the results with the utility below

In [None]:
import matplotlib.pyplot as plt

#Plot utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.xlabel('Epochs')
  plt.ylabel(string)
  plt.show()

plot_graphs(history, 'accuracy')

from matplotlib import text
##Generating Text
With the model trained, you can now use it to make its own song.

1. Feed a seed text to initiate the process.
2. Model predicts the index of the most probable next word.
3. Look up the index in the reverse word index dictionary.
4. Append the next word to the seed text.
5. Feed the result to the model again. 

Steps 2 to 5 will repeat until the desired length of the song is reached. See how it is implemented in the code below:

In [None]:
#Define the seed text
seed_text = "Bitch don't kill my vibe"

#Define total words to predict 
next_words = 15

#Loop until desired length is reached 
for _ in range(next_words):
  
  #Convert the seed text to a token sequence
  token_list = tokenizer.texts_to_sequences([seed_text])[0]

  #Pad the sequence 
  token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding = 'pre')

  #Feed the model and get the probabilites for each index
  probabilities = model.predict(token_list)

  #Get the index with the highest probability
  predicted = np.argmax(probabilities, axis = -1)[0]

  #Ignore if index with the highest probability
  predicted = np.argmax(probabilities, axis = -1)[0]

  #Ignore if index is 0 becasue that is just the padding 
  if predicted != 0: 

    #Look up the word associated with the index
    output_word = tokenizer.index_word[predicted]

    #Combine with the seed text 
    seed_text += " " + output_word

print(seed_text)

In the output above, you might notice frequent repetition of words the longer the sentence gets. There are ways to get around it and the next cell shows one. Basically, instead of getting the index with max probability, you will get the top three indices and choose one at random. See if the output text makes more sense with this approach. This is not the most time efficient solution because it is always sorting the entire array even if you only need the top three. Feel free to improve it and of course, you can also develop your own method of picking the next word.

In [None]:
# Define seed text
seed_text = "Bitch don't kill my vibe"

# Define total words to predict
next_words = 20

# Loop until desired length is reached
for _ in range(next_words):

	# Convert the seed text to a token sequence
  token_list = tokenizer.texts_to_sequences([seed_text])[0]

	# Pad the sequence
  token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
	
	# Feed to the model and get the probabilities for each index
  probabilities = model.predict(token_list)

  # Pick a random number from [1,2,3]
  choice = np.random.choice([1,2,3])
	
  # Sort the probabilities in ascending order 
  # and get the random choice from the end of the array
  predicted = np.argsort(probabilities)[0][-choice]

	# Ignore if index is 0 because that is just the padding.
  if predicted != 0:
		
		# Look up the word associated with the index. 
	  output_word = tokenizer.index_word[predicted]

		# Combine with the seed text
	  seed_text += " " + output_word

# Print the result	
print(seed_text)