<a href="https://colab.research.google.com/github/laxmangautam/nlp-in-python-tutorial/blob/master/beginners_guide_to_text_generation_using_lstms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Beginners Guide to Text Generation using LSTMs

Text generation is a type of language modelling problem. Language modelling is the core problem for a number of natural language processing tasks such as speech-to-text, conversational systems, and text summarization. A trained language model learns the likelihood of occurrence of a word based on the previous sequence of words used in the text. Language models can operate at the character level, n-gram level, sentence level, or even paragraph level. In this notebook, I will explain how to create a language model for generating natural language text by implementing and training an LSTM-based Recurrent Neural Network.
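
To make "likelihood of occurrence of a word based on the previous sequence" concrete, here is a minimal sketch of a count-based bigram model. The toy corpus and the helper name `next_word_probs` are assumptions for illustration only; the notebook itself uses an LSTM rather than raw counts.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration, not the notebook's dataset).
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word follows each preceding word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next word | previous word) from raw counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

probs = next_word_probs("the")
print(probs)
```

A neural language model plays the same role as `next_word_probs`, but it conditions on a longer history and generalises beyond sequences it has seen verbatim.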

### Generating News headlines 

In this kernel, I will be using the dataset of [New York Times Comments and Headlines](https://www.kaggle.com/aashita/nyt-comments) to train a text generation language model that can be used to generate news headlines.


## 1. Import the libraries

As the first step, we need to import the required libraries:

In [None]:
# keras module for building LSTM 
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout , Activation
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 
from keras.optimizers import RMSprop
 
import pandas as pd
import numpy as np
import string, os 
 
import random
import heapq
import sys

In [None]:
from google.colab import drive
 
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [None]:
# Read the raw corpus; the context manager closes the file handle afterwards.
with open('/content/drive/My Drive/Colab Notebooks/Dataset_RNN_shakespeare.txt', "r") as f:
    data = f.read()
 
text = data.lower()
 
def clean_text(txt):
    # Drop punctuation, lower-case, and discard any non-ASCII characters.
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii", 'ignore')
    return txt
 
text = clean_text(text)
# Print only a short preview: printing the whole corpus (about 5 million
# characters) exceeds the notebook's IOPub data rate limit.
print(text[:500])


In [None]:
 
characters = sorted(list(set(text)))
 
print('corpus length:', len(text))
 
print('total chars:', len(characters))

corpus length: 5219164
total chars: 38


In [None]:
char2indices = dict((c, i) for i, c in enumerate(characters))
 
indices2char = dict((i, c) for i, c in enumerate(characters))
print(char2indices)
 
print(indices2char)

{'\n': 0, ' ': 1, '0': 2, '1': 3, '2': 4, '3': 5, '4': 6, '5': 7, '6': 8, '7': 9, '8': 10, '9': 11, 'a': 12, 'b': 13, 'c': 14, 'd': 15, 'e': 16, 'f': 17, 'g': 18, 'h': 19, 'i': 20, 'j': 21, 'k': 22, 'l': 23, 'm': 24, 'n': 25, 'o': 26, 'p': 27, 'q': 28, 'r': 29, 's': 30, 't': 31, 'u': 32, 'v': 33, 'w': 34, 'x': 35, 'y': 36, 'z': 37}
{0: '\n', 1: ' ', 2: '0', 3: '1', 4: '2', 5: '3', 6: '4', 7: '5', 8: '6', 9: '7', 10: '8', 11: '9', 12: 'a', 13: 'b', 14: 'c', 15: 'd', 16: 'e', 17: 'f', 18: 'g', 19: 'h', 20: 'i', 21: 'j', 22: 'k', 23: 'l', 24: 'm', 25: 'n', 26: 'o', 27: 'p', 28: 'q', 29: 'r', 30: 's', 31: 't', 32: 'u', 33: 'v', 34: 'w', 35: 'x', 36: 'y', 37: 'z'}


In [None]:
# cut the text in semi-redundant sequences of maxlen characters
 
maxlen = 40
 
step = 3
sentences = []
 
next_chars = []
 
for i in range(0, len(text) - maxlen, step):
  sentences.append(text[i: i + maxlen])
  next_chars.append(text[i + maxlen])
 
print('nb sequences:', len(sentences))

nb sequences: 1739708
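
To see what the windowing loop above produces, here is a small sketch that runs the same loop on a toy string. The names `toy_text`, `toy_maxlen`, and `toy_step` and their values are assumptions chosen so the result is easy to read.

```python
# Toy text and window settings (assumptions for illustration).
toy_text = "hello world"
toy_maxlen = 4
toy_step = 3

windows = []
targets = []
# Same loop shape as the notebook cell: each window of toy_maxlen characters
# is paired with the single character that follows it.
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

for w, t in zip(windows, targets):
    print(repr(w), '->', repr(t))
```

Because the step (3) is smaller than the window length (4), neighbouring windows overlap, which is why the sequences are called semi-redundant.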


In [None]:
# Converting indices into vectorized format
 
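# The np.bool alias was removed from recent NumPy releases; the built-in bool works everywhere.
X = np.zeros((len(sentences), maxlen, len(characters)), dtype=bool)
 
y = np.zeros((len(sentences), len(characters)), dtype=bool)
 
for i, sentence in enumerate(sentences):
  for t, char in enumerate(sentence):
    X[i, t, char2indices[char]] = 1
  # One target character per sequence, so set it once, outside the inner loop.
  y[i, char2indices[next_chars[i]]] = 1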

In [None]:
#Model Building
 
model = Sequential()
 
model.add(LSTM(64, input_shape=(maxlen, len(characters))))
 
model.add(Dense(len(characters)))
 
model.add(Activation('softmax'))
 
 
# Newer Keras releases name this argument learning_rate instead of lr.
model.compile(loss='categorical_crossentropy', optimizer=RMSprop(learning_rate=0.01))
 
print (model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 64)                26368     
_________________________________________________________________
dense (Dense)                (None, 38)                2470      
_________________________________________________________________
activation (Activation)      (None, 38)                0         
Total params: 28,838
Trainable params: 28,838
Non-trainable params: 0
_________________________________________________________________
None


In [None]:
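# Function to sample a character index from the predicted distribution
 
def pred_indices(preds, metric=1.0):
  # `metric` acts as a sampling temperature: values below 1 sharpen the
  # distribution, values above 1 flatten it.
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / metric
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  # Draw one sample; the result is one-hot, so argmax returns the sampled index.
  probs = np.random.multinomial(1, preds, 1)

  return np.argmax(probs)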
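
The effect of the second argument of `pred_indices` (called `diversity` in the training loop) can be sketched without NumPy. The helper name `rescale` and the three-value distribution below are assumptions for illustration; only the log, divide, and renormalise steps are taken from the function above.

```python
import math

def rescale(probs, temperature):
    """Apply the log / divide / renormalise steps used in pred_indices."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical softmax output over three characters.
toy_probs = [0.6, 0.3, 0.1]

sharp = rescale(toy_probs, 0.2)   # low value: the most likely character dominates
same = rescale(toy_probs, 1.0)    # 1.0 leaves the distribution unchanged
flat = rescale(toy_probs, 1.2)    # high value: probabilities move closer together

for name, dist in (("0.2", sharp), ("1.0", same), ("1.2", flat)):
    print(name, [round(p, 3) for p in dist])
```

Low values make the generated text more conservative and repetitive, while high values make it more varied but also more error-prone.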

In [None]:
import pickle


# Save the current (still untrained) weights; the context manager closes the file.
weigh = model.get_weights()
pklfile = "modelweights.pkl"
with open(pklfile, 'wb') as fpkl:
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)


sys.stdout.write("Test")

Test

In [None]:
# Train and Evaluate the Model
 
for iteration in range(1, 15):
  print('-' * 40) 
  print('Iteration', iteration)
  model.fit(X, y, batch_size=128, epochs=1)
  start_index = random.randint(0, len(text) - maxlen - 1)
  for diversity in [0.2, 0.7, 1.2]:
    print('n----- diversity:', diversity)
    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"\n')
    print(str(iteration)  + " Round completed  model training is completed ")
    #sys.stdout.write(generated)
    for i in range(400):
      # One-hot encode the current window of maxlen characters.
      x = np.zeros((1, maxlen, len(characters)))
      for t, char in enumerate(sentence):
        x[0, t, char2indices[char]] = 1.
      # Predict once per generated character, after the whole window is encoded.
      preds = model.predict(x, verbose=0)[0]
      next_index = pred_indices(preds, diversity)
      pred_char = indices2char[next_index]
      generated += pred_char
      # Slide the window forward by one character.
      sentence = sentence[1:] + pred_char
      #sys.stdout.write(pred_char)
      #sys.stdout.flush()

  # Save the weights once per training iteration.
  weigh = model.get_weights()
  pklfile = "modelweights.pkl"
  with open(pklfile, 'wb') as fpkl:
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)

----------------------------------------
Iteration 1
n----- diversity: 0.2
----- Generating with seed: "m
    so much i love his heart but i per"

1 Round completed  model training is completed 
n----- diversity: 0.7
----- Generating with seed: "m
    so much i love his heart but i per"

1 Round completed  model training is completed 
n----- diversity: 1.2
----- Generating with seed: "m
    so much i love his heart but i per"

1 Round completed  model training is completed 
----------------------------------------
Iteration 2
n----- diversity: 0.2
----- Generating with seed: "uld be so rich for when rich villains ha"

2 Round completed  model training is completed 
n----- diversity: 0.7
----- Generating with seed: "uld be so rich for when rich villains ha"

2 Round completed  model training is completed 
n----- diversity: 1.2
----- Generating with seed: "uld be so rich for when rich villains ha"

2 Round completed  model training is completed 
----------------------------------------
I

  """


----------------------------------------
Iteration 6
n----- diversity: 0.2
----- Generating with seed: "    exeunt




scene ii
before the duke "

6 Round completed  model training is completed 
n----- diversity: 0.7
----- Generating with seed: "    exeunt




scene ii
before the duke "

6 Round completed  model training is completed 
n----- diversity: 1.2
----- Generating with seed: "    exeunt




scene ii
before the duke "

6 Round completed  model training is completed 
----------------------------------------
Iteration 7
n----- diversity: 0.2
----- Generating with seed: "se are the tribunes of the people
    th"

7 Round completed  model training is completed 
n----- diversity: 0.7
----- Generating with seed: "se are the tribunes of the people
    th"

7 Round completed  model training is completed 
n----- diversity: 1.2
----- Generating with seed: "se are the tribunes of the people
    th"

7 Round completed  model training is completed 
----------------------------------------
I

## 2. Load the dataset

The works of William Shakespeare are used to train the network for automated text generation. The raw file used for training can be downloaded from http://www.gutenberg.org/.

Before training the model, several preprocessing steps are required. The major steps are:
 

*   **Preprocessing:**  Prepare the X and Y data from the full text file and convert them into an index-based, vectorized format.
*   **Deep learning model training and validation:** Train and validate the deep learning model.
*   **Text generation:** Generate the text with the trained model.

## 3. Dataset preparation

### 3.1 Dataset cleaning 

In the dataset preparation step, we will first clean the text, which includes removing punctuation and lower-casing all the words.

### 3.2 Generating Sequence of N-gram Tokens

Language modelling requires sequential input data: given a sequence (of words/tokens), the aim is to predict the next word/token.

The next step is tokenization. Tokenization is the process of extracting tokens (terms/words) from a corpus. Keras has a built-in Tokenizer class, which can be used to obtain the tokens and their indices in the corpus. After this step, every text document in the dataset is converted into a sequence of tokens.
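
As a rough sketch of what tokenization produces, the snippet below builds a word index in plain Python. It is a simplified stand-in, not the Keras Tokenizer itself; the toy headlines and the helper name `to_sequence` are assumptions for illustration.

```python
from collections import Counter

# Toy headlines (assumptions for illustration only).
headlines = ["i stand with the shedevils", "the people stand with the team"]

# Rank words by frequency, most frequent first. Index 0 is left unused
# because it is commonly reserved for padding.
word_counts = Counter(word for line in headlines for word in line.split())
word_index = {
    word: rank
    for rank, (word, _) in enumerate(word_counts.most_common(), start=1)
}

def to_sequence(line):
    """Map each known word in a line to its integer index."""
    return [word_index[w] for w in line.split() if w in word_index]

print(word_index)
print(to_sequence("i stand with the shedevils"))
```

Each headline becomes a list of integers, which is the form the later steps (n-gram generation and padding) operate on.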


Sequences such as [30, 507], [30, 507, 11], [30, 507, 11, 1] and so on represent the n-gram phrases generated from the input data, where every integer corresponds to the index of a particular word in the complete vocabulary of words present in the text. For example:

**Headline:** i stand with the shedevils

<table>
<tr><td>Ngram </td><td> Sequence of Tokens</td></tr>
<tr> <td>i stand </td><td> [30, 507] </td></tr>
<tr> <td>i stand with </td><td> [30, 507, 11] </td></tr>
<tr> <td>i stand with the </td><td> [30, 507, 11, 1] </td></tr>
<tr> <td>i stand with the shedevils </td><td> [30, 507, 11, 1, 975] </td></tr>
</table>
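
The rows of the table above can be reproduced with a short sketch: every prefix of the token list that contains at least two tokens becomes one training sequence. The token ids are the ones shown in the table; the variable names are assumptions.

```python
# Token ids for the headline "i stand with the shedevils", as shown in the table.
token_list = [30, 507, 11, 1, 975]

# Every prefix of length >= 2 becomes one n-gram training sequence.
ngram_sequences = [token_list[: i + 1] for i in range(1, len(token_list))]

for seq in ngram_sequences:
    print(seq)
```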



### 3.3 Padding the Sequences and Obtaining Variables: Predictors and Target

Now that we have generated a dataset containing sequences of tokens, different sequences may have different lengths. Before training the model, we need to pad the sequences to make their lengths equal. We can use the pad_sequences function of Keras for this purpose. To input this data into a learning model, we need to create predictors and a label. We will use the n-gram sequence as the predictors and the next word of the n-gram as the label. For example:


Headline:  they are learning data science

<table>
<tr><td>PREDICTORS </td> <td>           LABEL </td></tr>
<tr><td>they                   </td> <td>  are</td></tr>
<tr><td>they are               </td> <td>  learning</td></tr>
<tr><td>they are learning      </td> <td>  data</td></tr>
<tr><td>they are learning data </td> <td>  science</td></tr>
</table>
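
The padding and the predictor/label split can be sketched in plain Python. The helper `pad_pre` is a hypothetical stand-in for Keras pad_sequences with padding='pre', and the token ids for "they are learning data science" are made up for illustration.

```python
def pad_pre(seq, length, value=0):
    """Left-pad a sequence with `value` until it has `length` items."""
    return [value] * (length - len(seq)) + list(seq)

# Hypothetical token ids: they=1, are=2, learning=3, data=4, science=5.
sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5]]
max_len = max(len(s) for s in sequences)

# Make all sequences the same length, then split off the last token as the label.
padded = [pad_pre(s, max_len) for s in sequences]
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

for p, l in zip(predictors, labels):
    print(p, '->', l)
```

Padding on the left keeps the label in the last position of every row, so a single slice separates predictors from labels.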

Perfect, now we can obtain the input vector X and the label vector Y, which can be used for training. Recurrent neural networks have shown good performance in sequence-to-sequence learning and text data applications. Let's look at them in brief.

## 4. LSTMs for Text Generation

![](http://www.shivambansal.com/blog/text-lstm/2.png)

Unlike feed-forward neural networks, in which activation outputs are propagated in only one direction, Recurrent Neural Networks propagate activation outputs in both directions (from inputs to outputs and from outputs back to inputs). This creates loops in the network architecture which act as a ‘memory state’ of the neurons. This state gives the neurons the ability to remember what has been learned so far.

The memory state gives RNNs an advantage over traditional neural networks, but they suffer from a problem called the vanishing gradient. In this problem, while learning with a large number of layers (or, for an RNN, many time steps), it becomes really hard for the network to learn and tune the parameters of the earlier layers. To address this problem, a type of RNN called the LSTM (Long Short-Term Memory) model was developed.
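
A back-of-the-envelope sketch of why the gradient vanishes: if each step scales the gradient by a factor below 1, the product shrinks exponentially with the number of steps. The factor 0.9 is a hypothetical value chosen for illustration; real networks have a different factor at every step.

```python
# Hypothetical per-step gradient factor; any value below 1 shows the same effect.
factor = 0.9

# Gradient scale after backpropagating through a given number of steps.
scales = {steps: factor ** steps for steps in (1, 10, 40, 100)}

for steps, scale in scales.items():
    print(f"{steps:>3} steps -> gradient scaled by {scale:.6f}")
```

After 40 steps, which matches the window length used in this notebook, the signal reaching the earliest step is already a small fraction of its original size.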

LSTMs have an additional state called the ‘cell state’, through which the network adjusts the information flow. The advantage of this state is that the model can remember or forget what it has learned more selectively. To learn more about LSTMs, here is a great post. Let's build an LSTM model in our code. I have added three layers to the model, plus an optional dropout layer:

1. Input Layer: Takes the sequence of words as input.
2. LSTM Layer: Computes the output using LSTM units. I have added 100 units in the layer, but this number can be fine-tuned later.
3. Dropout Layer: A regularisation layer which randomly turns off the activations of some neurons in the LSTM layer. It helps prevent overfitting. (Optional layer)
4. Output Layer: Computes the probability of each candidate next word as output.
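
How the cell state inside the LSTM layer lets the network remember or forget selectively can be sketched with a single-unit LSTM step in plain Python. This follows the standard LSTM gate equations; the function name `lstm_step`, the weight dictionary layout, and all weight values are assumptions for illustration, not the weights Keras would learn.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM with scalar input and hypothetical weights w."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g    # cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)      # hidden state passed on to the next step or layer
    return h, c

# All weights zero except the gate biases, so the gates are fixed open or closed.
keys = ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg")
base = {k: 0.0 for k in keys}
remember = dict(base, bf=10.0, bi=-10.0)   # forget gate open: old state is kept
forget = dict(base, bf=-10.0, bi=-10.0)    # forget gate closed: old state is dropped

_, c_kept = lstm_step(1.0, 0.0, 0.8, remember)
_, c_dropped = lstm_step(1.0, 0.0, 0.8, forget)
print("kept:", round(c_kept, 4), "dropped:", round(c_dropped, 4))
```

In a trained network the gate values depend on the input and the previous hidden state, so the model decides step by step how much of the cell state to keep.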

We will run this model for a total of 100 epochs, but this can be experimented with further.

Let's train our model now.

## 5. Generating the text 

Great, our model architecture is now ready and we can train it using our data. Next, let's write the function to predict the next word based on the input words (or seed text). We will first tokenize the seed text, pad the sequences, and pass them into the trained model to get the predicted word. Multiple predicted words can be appended together to get the predicted sequence.


In [None]:
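def generate_text(seed_text, next_words, model, max_sequence_len):
    # Relies on a fitted Keras `tokenizer` being available in the enclosing scope.
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes was removed from newer Keras releases;
        # taking the argmax of predict gives the same class index.
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
        
        # Look up the word that corresponds to the predicted index.
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text.title()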

## 6. Some Results

In [None]:
print (generate_text("united states", 50, model, max_sequence_len))
print (generate_text("preident trump", 4, model, max_sequence_len))
print (generate_text("donald trump", 4, model, max_sequence_len))
print (generate_text("india and china", 40, model, max_sequence_len))
print (generate_text("new york", 4, model, max_sequence_len))
print (generate_text("science and technology", 30, model, max_sequence_len))
print (generate_text("Russia", 30, model, max_sequence_len))

United States Ai Is Cheaper Thats Also Bad News A New Authoritarian Era Gun Ones Has Dog A Border Of Gun Violence Is Sex Gerrymandering In In He It Where Ride Beauty Live To Confinement Or Men Says His Blood Blue Food Corruption Age Dynasty Men Raise Become Is Better Young Young
Preident Trump Wants A Military Parade
Donald Trump Is Hiding To Twist
India And China Better Than Food A Real Impact At The New High Map That Witness To Jail Kurds To Help Was You Him Sex De Moon Men Sex First Age Why Balance Sex Russia Sex Vote Sex Fresh Sex European Million Plastic
New York Is Finding A Premier
Science And Technology In Babies Slogs To Polls Three Spanish Words That Will Help It Become A Better Crossword Solver Student Comments Of The Week In The United Landing Men Young Young Young
India The Supreme Court Become The Run For Dreamers On The Ground Patient The Problem Of The Year Plan In Help Away The Barely Poland Corner Of Sex Raises New Athletes


## Improvement Ideas 

As we can see, the model produces output that looks reasonable, although it drifts into repetition on longer sequences. The results can be improved further by:
- Adding more data
- Fine-tuning the network architecture
- Fine-tuning the network parameters

Thanks for going through the notebook. Please upvote if you liked it.