<a href="https://colab.research.google.com/github/plaban1981/DEEP-LEARNING-/blob/master/LSTM_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**How to create a generative model for text using LSTM recurrent neural networks in Python with Keras**

*Text generation is a type of language modelling problem.*

**Language modelling** is the core problem for a number of natural language processing tasks, such as speech-to-text, conversational systems, and text summarization.

A trained language model learns the likelihood of occurrence of a word based on the previous sequence of words used in the text.
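
As a rough illustration of "likelihood of a word given the previous words", here is a minimal sketch of a bigram model that conditions on just one previous word and estimates probabilities from raw counts. The function name `bigram_probabilities` and the toy sentence are assumptions made for this example; the LSTM model built later in this notebook learns such likelihoods very differently.

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Estimate P(next_word | previous_word) from raw bigram counts."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    # Normalise each row of counts into a probability distribution.
    return {
        prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
        for prev, nxts in counts.items()
    }

probs = bigram_probabilities("the cat sat on the mat the cat ran")
print(probs["the"])  # "cat" follows "the" twice, "mat" once
```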


Language models can operate at the character level, n-gram level, sentence level, or even paragraph level.


**The objective of this model is to generate new text, given some input text.**

*Import Libraries*

In [0]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding,LSTM,Dense,Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku
import numpy as np

*Let's use a popular nursery rhyme, “Cat and Her Kittens”, as our corpus. A corpus is a collection of text documents.*

In [11]:
from google.colab import files
files.upload()

Saving data.txt to data.txt


{'data.txt': b'The cat and her kittens\r\nThey put on their mittens\r\nTo eat a christmas pie\r\nThe poor little kittens\r\nThey lost their mittens\r\nAnd then they began to cry.\r\n\r\nO mother dear, we sadly fear\r\nWe cannot go to-day,\r\nFor we have lost our mittens\r\nIf it be so, ye shall not go\r\nFor ye are naughty kittens'}

There will be three main parts of the code: 

* dataset preparation, 

* model training, and 

* generating predictions.

**Dataset preparation steps:**

1. Tokenization: the process of extracting tokens (terms / words) from a corpus.

2. Convert the corpus into a flat dataset of sentence sequences.

3. Different sequences have different lengths, so we pad them to make their lengths equal, using the pad_sequences function of Keras.

4. To feed this data into a learning model, we need to create predictors and a label.

We will create N-gram sequences as predictors and the next word of each N-gram as the label. For example:

Sentence: "they are learning data science"

| PREDICTORS | LABEL |
|---|---|
| they | are |
| they are | learning |
| they are learning | data |
| they are learning data | science |
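
The predictor/label pairs for the example sentence can be generated with a few lines of plain Python. This is only a sketch on raw words; the helper name `ngram_pairs` is an assumption for illustration, and the notebook's own code does the same thing on integer tokens.

```python
def ngram_pairs(sentence):
    """Return (predictor_text, label_word) pairs for every prefix of the sentence."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

pairs = ngram_pairs("they are learning data science")
for predictors_text, label_word in pairs:
    print(f"{predictors_text:<25}| {label_word}")
```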

In [0]:
data = open('data.txt').read()

In [13]:
data

'The cat and her kittens\nThey put on their mittens\nTo eat a christmas pie\nThe poor little kittens\nThey lost their mittens\nAnd then they began to cry.\n\nO mother dear, we sadly fear\nWe cannot go to-day,\nFor we have lost our mittens\nIf it be so, ye shall not go\nFor ye are naughty kittens'

In [18]:
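tokenizer = Tokenizer()
corpus = data.lower().split("\n")
print(corpus)
# Note: fit_on_texts has not been called yet, so the tokenizer has an empty
# vocabulary and texts_to_sequences returns an empty list for every line.
for line in corpus:
  print(line)
  token_list = tokenizer.texts_to_sequences([line])[0]
  print(token_list)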

['the cat and her kittens', 'they put on their mittens', 'to eat a christmas pie', 'the poor little kittens', 'they lost their mittens', 'and then they began to cry.', '', 'o mother dear, we sadly fear', 'we cannot go to-day,', 'for we have lost our mittens', 'if it be so, ye shall not go', 'for ye are naughty kittens']
the cat and her kittens
[]
they put on their mittens
[]
to eat a christmas pie
[]
the poor little kittens
[]
they lost their mittens
[]
and then they began to cry.
[]

[]
o mother dear, we sadly fear
[]
we cannot go to-day,
[]
for we have lost our mittens
[]
if it be so, ye shall not go
[]
for ye are naughty kittens
[]
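
The empty lists above appear because the tokenizer has no vocabulary until it is fitted. The sketch below imitates that behaviour in plain Python: an unfitted (empty) word index yields no tokens, while a fitted one maps each word to an integer, with the most frequent words getting the smallest indices. The helpers `split_words`, `fit_word_index` and `to_sequence` are hypothetical simplifications, not the Keras implementation.

```python
import re
from collections import Counter

def split_words(line):
    # Lowercase and keep runs of letters/apostrophes; punctuation acts as a separator.
    return re.findall(r"[a-z']+", line.lower())

def fit_word_index(lines):
    counts = Counter()
    for line in lines:
        counts.update(split_words(line))
    # Most frequent word gets index 1; ties keep first-seen order (sorted is stable).
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequence(line, word_index):
    # Words missing from the index are silently dropped.
    return [word_index[w] for w in split_words(line) if w in word_index]

lines = ["the cat and her kittens", "they put on their mittens", "the poor little kittens"]
unfitted = to_sequence(lines[0], {})           # nothing fitted yet
word_index = fit_word_index(lines)
fitted = to_sequence("the poor little kittens", word_index)
print(unfitted)
print(fitted)
```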


In [0]:
tokenizer = Tokenizer()
def dataset_preparation(data):

	# basic cleanup
	corpus = data.lower().split("\n")

	# tokenization	
	tokenizer.fit_on_texts(corpus)
	total_words = len(tokenizer.word_index) + 1

	# create input sequences using list of tokens
	input_sequences = []
	for line in corpus:
		token_list = tokenizer.texts_to_sequences([line])[0]
		for i in range(1, len(token_list)):
			n_gram_sequence = token_list[:i+1]
			input_sequences.append(n_gram_sequence)

	# pad sequences 
	max_sequence_len = max([len(x) for x in input_sequences])
	input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
  

	# create predictors and label
	predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
	label = ku.to_categorical(label, num_classes=total_words)
	#print(input_sequences)
	return predictors, label, max_sequence_len, total_words
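
The padding and predictor/label split performed inside dataset_preparation can be followed on a tiny example without Keras. The sketch below left-pads with zeros and then takes the last column as the label; `pad_pre` is a hypothetical stand-in written for this illustration, and the token values are simply the first few from the rhyme.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` (keeping the last `maxlen` items)."""
    return [[value] * (maxlen - len(s)) + list(s[-maxlen:]) for s in sequences]

# N-gram prefixes of one tokenized line, e.g. "the cat and her"
seqs = [[6, 13], [6, 13, 7], [6, 13, 7, 14]]
padded = pad_pre(seqs, maxlen=4)

# Everything but the last token is the predictor; the last token is the label.
predictors_demo = [row[:-1] for row in padded]
labels_demo = [row[-1] for row in padded]
print(predictors_demo)
print(labels_demo)
```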

In [0]:
predictors, label, max_sequence_len, total_words = dataset_preparation(data)

In [37]:
predictors 

array([[ 0,  0,  0,  0,  0,  0,  6],
       [ 0,  0,  0,  0,  0,  6, 13],
       [ 0,  0,  0,  0,  6, 13,  7],
       [ 0,  0,  0,  6, 13,  7, 14],
       [ 0,  0,  0,  0,  0,  0,  2],
       [ 0,  0,  0,  0,  0,  2, 15],
       [ 0,  0,  0,  0,  2, 15, 16],
       [ 0,  0,  0,  2, 15, 16,  8],
       [ 0,  0,  0,  0,  0,  0,  4],
       [ 0,  0,  0,  0,  0,  4, 17],
       [ 0,  0,  0,  0,  4, 17, 18],
       [ 0,  0,  0,  4, 17, 18, 19],
       [ 0,  0,  0,  0,  0,  0,  6],
       [ 0,  0,  0,  0,  0,  6, 21],
       [ 0,  0,  0,  0,  6, 21, 22],
       [ 0,  0,  0,  0,  0,  0,  2],
       [ 0,  0,  0,  0,  0,  2,  9],
       [ 0,  0,  0,  0,  2,  9,  8],
       [ 0,  0,  0,  0,  0,  0,  7],
       [ 0,  0,  0,  0,  0,  7, 23],
       [ 0,  0,  0,  0,  7, 23,  2],
       [ 0,  0,  0,  7, 23,  2, 24],
       [ 0,  0,  7, 23,  2, 24,  4],
       [ 0,  0,  0,  0,  0,  0, 26],
       [ 0,  0,  0,  0,  0, 26, 27],
       [ 0,  0,  0,  0, 26, 27, 28],
       [ 0,  0,  0, 26, 27, 28,  5],
 

In [33]:
label.shape,np.argmax(label[0])

((48, 43), 13)
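
The shape above reflects one-hot encoding: each of the 48 labels becomes a vector of length 43 (the vocabulary size) with a single 1 at the label's index. A minimal pure-Python sketch of that encoding follows; `to_one_hot` is a hypothetical helper for illustration, not the Keras function.

```python
def to_one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows of length num_classes."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

one_hot = to_one_hot([13, 7, 14], num_classes=43)
print(len(one_hot), len(one_hot[0]))
print(one_hot[0].index(1.0))  # position of the 1 recovers the original label
```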

In [24]:
max_sequence_len

8

In [25]:
total_words 

43

**Recurrent Neural Networks**

Unlike feed-forward neural networks, in which activations propagate in one direction only, recurrent neural networks feed the activations of their neurons back into the network, so information flows both from inputs to outputs and from outputs back to inputs. These loops in the architecture act as a ‘memory state’ of the neurons, which lets the network remember what has been learned so far.

The memory state gives RNNs an advantage over traditional neural networks when working with sequential data.
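
To make the idea of a memory state concrete, here is a minimal sketch of a recurrent cell with a single scalar hidden state. The weights `w_x`, `w_h` and `b` are arbitrary values chosen for illustration (a real network learns them), and real layers use vectors and matrices rather than scalars.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # new state depends on the previous state
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero:
# the hidden state carries a fading memory of what came before.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```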

**Disadvantage**

RNNs suffer from a problem called the vanishing gradient.

When training across a large number of layers (or, for an RNN, a large number of time steps), it becomes very hard for the network to learn and tune the parameters of the earlier layers.

To address this problem, a type of RNN called the LSTM (Long Short-Term Memory) was developed.
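
The vanishing gradient effect can be seen numerically with a scalar recurrent cell h_t = tanh(w_h * h_{t-1}). By the chain rule, the gradient of the final state with respect to the initial state is a product of per-step factors w_h * (1 - h_t^2), and when each factor is below 1 the product shrinks rapidly. The weight and starting state below are arbitrary assumptions chosen for this sketch.

```python
import math

def gradient_through_time(steps, w_h=0.8, h0=0.5):
    """Return d h_steps / d h_0 for the scalar cell h_t = tanh(w_h * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h)
        grad *= w_h * (1.0 - h * h)  # derivative of this step w.r.t. the previous state
    return grad

for steps in (1, 5, 20, 50):
    print(steps, gradient_through_time(steps))
```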

**LSTM**

LSTMs have an additional state called ‘cell state’ through which the network makes adjustments in the information flow. 

The advantage of this state is that the model can remember or forget what it has learned more selectively.
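
The selective remembering and forgetting comes from gates that scale what is kept in, and written to, the cell state. Below is a minimal sketch of one step of the standard LSTM equations on scalars; the parameter names (`w_*`, `u_*`, `b_*`) and the bias values are assumptions picked to show the two extremes, not values from the trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p holds input weights w_*, recurrent weights u_*, biases b_*."""
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g          # keep part of the old cell state, add part of the new
    h = o * math.tanh(c)
    return h, c

zeros = {f"{k}_{gate}": 0.0 for k in ("w", "u", "b") for gate in "fiog"}
remember = dict(zeros, b_f=10.0, b_i=-10.0)   # forget gate open: old state is kept
forget = dict(zeros, b_f=-10.0, b_i=-10.0)    # forget gate closed: old state is erased

_, c_kept = lstm_step(1.0, 0.0, 2.0, remember)
_, c_forgotten = lstm_step(1.0, 0.0, 2.0, forget)
print(c_kept, c_forgotten)
```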


Let's build an LSTM model in our code. The model below stacks the following layers:

* Input Layer : Takes the sequence of words as input and, via an Embedding layer, maps each word index to a 10-dimensional vector.

* LSTM Layers : Compute the output using LSTM units. The code stacks three LSTM layers with 256, 124 and 100 units; these numbers can be fine-tuned later.

* Dropout Layer : A regularisation layer that randomly turns off the activations of some neurons in the preceding LSTM layer, which helps prevent overfitting. It is commented out in the code below.

* Output Layer : Computes a probability for every word in the vocabulary; the most probable word is taken as the next word.
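
The output layer uses a softmax activation, which turns raw scores into probabilities that sum to 1. A minimal sketch in plain Python, with made-up scores for a three-word vocabulary:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities (subtracting the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])                       # hypothetical scores
best = max(range(len(probs)), key=probs.__getitem__)   # index of the most probable word
print(probs, best)
```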

In [0]:
def create_model(predictors, label, max_sequence_len, total_words):
	
	model = Sequential()
	model.add(Embedding(total_words, 10, input_length=max_sequence_len-1))
	model.add(LSTM(256, return_sequences = True))
	#model.add(Dropout(0.1))
	model.add(LSTM(124, return_sequences = True))
	#model.add(Dropout(0.1))
	model.add(LSTM(100))
	model.add(Dense(total_words, activation='softmax'))

	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	# monitor the training loss: no validation data is passed to fit(), so 'val_loss' is never available
	earlystop = EarlyStopping(monitor='loss', min_delta=0, patience=5, verbose=0, mode='auto')
	model.fit(predictors, label, epochs=200, verbose=1, callbacks=[earlystop])
	model.summary()
	return model 

**Function to predict the next word based on the input words**

* First, tokenize the seed text.

* Pad the sequence and pass it into the trained model to get the predicted word.

Multiple predicted words can be appended together to get a predicted sequence.

In [0]:
def generate_text(seed_text, next_words, max_sequence_len,model):
	for _ in range(next_words):
		token_list = tokenizer.texts_to_sequences([seed_text])[0]
		token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
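		# predict_classes was removed from recent Keras releases; take the argmax of the predicted probabilities instead
		predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]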
		
		output_word = ""
		for word, index in tokenizer.word_index.items():
			if index == predicted:
				output_word = word
				break
		seed_text += " " + output_word
	return seed_text
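
The same greedy loop can be tried without a trained model. The sketch below swaps the linear search over the word index for an inverse dictionary and takes the predictor as a plain function; `generate`, `toy_index` and `toy_predict` are hypothetical stand-ins for illustration only.

```python
def generate(seed_text, next_words, word_index, predict_index):
    """Greedily append next_words words, asking predict_index for each next token."""
    index_word = {i: w for w, i in word_index.items()}  # inverse lookup, built once
    words = seed_text.lower().split()
    for _ in range(next_words):
        tokens = [word_index[w] for w in words if w in word_index]
        words.append(index_word.get(predict_index(tokens), ""))
    return " ".join(words)

toy_index = {"kittens": 1, "they": 2, "mittens": 3}

def toy_predict(tokens):
    # Stand-in for the model: cycles through the vocabulary based on the last token.
    return (tokens[-1] % 3) + 1 if tokens else 1

generated = generate("they", 2, toy_index, toy_predict)
print(generated)
```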


**Let's train our model using the Cat and Her Kittens rhyme.**

In [0]:
X, Y, max_len, total_words = dataset_preparation(data)

In [64]:
model = create_model(X, Y, max_len, total_words)

Epoch 1/200
Epoch 2/200
...
Epoch 81/200

**Model’s output when the above model was trained for 100 epochs.**

In [65]:
text = generate_text("cat and", 3,  max_sequence_len,model)
print(text)

cat and have lost on


In [66]:
text = generate_text("you and", 3,  max_sequence_len,model)
print(text)

you and then they began


In [67]:
text = generate_text("i we", 3,  max_sequence_len,model)
print(text)

i we then they began
