# Generating Text: A Many-to-Many Architecture

I will create and train a text-generating model that, given a sequence of 10 words, predicts the words that follow one at a time (the code below generates `NUM_PREDS_PER_EPOCH = 100` words per seed).

### Steps for the text-generating model
* Step-1:  Download Project Gutenberg's Alice's Adventures in Wonderland dataset from https://www.gutenberg.org/files/11/11-0.txt.
* Step-2:  Preprocess the text data by converting it to lowercase and removing punctuation.
* Step-3:  Assign an ID to each unique word and then convert the dataset into a sequence of word IDs.
* Step-4:  Loop through the whole dataset, 10 words at a time. Consider the 10 words as input and the subsequent 11th word as output.
* Step-5:  Build and train a model by feeding the one-hot-encoded input words to an LSTM, which is connected to a softmax output layer (the code below uses one-hot vectors rather than an embedding layer). The target is the one-hot-encoded version of the output word.
* Step-6:  Make a prediction for the next word by choosing a random location in the text and using the words that precede it as the seed sequence.
* Step-7:  Slide the input window forward by one word, so that the word predicted in the previous step becomes the tenth (last) word of the new input.
* Step-8:  Continue this process to keep generating text.

Import the modules.

In [1]:
import re
import numpy as np
from collections import Counter
from keras.layers import LSTM
from keras.models import Sequential
from keras.layers import Dense,Activation

Using TensorFlow backend.


In [2]:
# Read the book, skip blank lines, and join the remaining lines into one string.
lines = []
with open('./data/11-0.txt', encoding='utf-8-sig') as fin:
    for line in fin:
        line = line.strip()
        if len(line) == 0:
            continue
        lines.append(line)
text = " ".join(lines)

A sample of the dataset looks as follows.

In [3]:
text[0:200]

'Project Gutenberg’s Alice’s Adventures in Wonderland, by Lewis Carroll This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it awa'

Normalize the text by converting it to lowercase and removing punctuation.

In [4]:
text = text.lower()
text = re.sub('[^0-9a-zA-Z]+',' ',text)

After normalization, a sample of the dataset looks as follows.

In [5]:
text[0:200]

'project gutenberg s alice s adventures in wonderland by lewis carroll this ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever you may copy it give it away or'
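To make the effect of the normalization step concrete, here is a minimal sketch that wraps the same two operations in a helper and applies it to a short string. The function name `normalize` and the sample sentence are mine, for illustration only.

```python
import re

def normalize(raw: str) -> str:
    """Lowercase the text and replace each run of non-alphanumeric characters with one space."""
    lowered = raw.lower()
    # After lowercasing, only digits and a-z survive; everything else becomes a space.
    return re.sub(r'[^0-9a-z]+', ' ', lowered)

sample = "Alice’s Adventures in Wonderland, by Lewis Carroll"
normalized = normalize(sample)
print(normalized)
print(normalized.split())
```

Note that the apostrophe in "Alice’s" is treated as punctuation, which is why a stand-alone token `s` appears in the normalized text. Accented letters would be dropped in the same way, since the pattern keeps ASCII letters and digits only.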

Assign each unique word an index so that it can be referenced when constructing the training and test datasets.

In [6]:
counts = Counter()
counts.update(text.split())
words = sorted(counts, key=counts.get, reverse=True)
nb_words = len(text.split())
word2index = {word: i for i, word in enumerate(words)}
index2word = {i: word for i, word in enumerate(words)}
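The same indexing logic can be tried on a toy sentence to see what the two dictionaries contain. This is a small sketch; the sentence and the `toy_` names are made up for illustration.

```python
from collections import Counter

toy_text = "the cat sat on the mat the end"

# Count word frequencies, then order the vocabulary from most to least frequent.
toy_counts = Counter(toy_text.split())
toy_words = sorted(toy_counts, key=toy_counts.get, reverse=True)

# Forward and reverse lookups between words and integer IDs.
toy_word2index = {word: i for i, word in enumerate(toy_words)}
toy_index2word = {i: word for i, word in enumerate(toy_words)}

# The text expressed as a sequence of word IDs.
toy_ids = [toy_word2index[w] for w in toy_text.split()]
print(toy_word2index)
print(toy_ids)
```

Because the vocabulary is sorted by frequency, the most common word gets ID 0. Words with equal counts keep the order in which they first appeared, since Python's sort is stable.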

Construct the input sequences of words that lead to an output word.
* Note that I am considering a sequence of 10 words and trying to predict the 11th word.

In [7]:
SEQLEN = 10
STEP = 1
input_words = []
label_words = []
text2=text.split()
for i in range(0,nb_words-SEQLEN,STEP):
    x=text2[i:(i+SEQLEN)]
    y=text2[i+SEQLEN]
    input_words.append(x)
    label_words.append(y)
print('input words list: ','\n',input_words[0])
print('label(output) words list: ','\n',label_words[0])

input words list:  
 ['project', 'gutenberg', 's', 'alice', 's', 'adventures', 'in', 'wonderland', 'by', 'lewis']
label(output) words list:  
 carroll


* Note that input_words is a list of lists, whereas label_words is a flat list of single words.
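The windowing can be seen more easily on a short token list. The sketch below uses the same loop bounds as the cell above, with a window of 3 words instead of 10; the helper name `make_windows` and the toy sentence are illustrative assumptions.

```python
def make_windows(tokens, seq_len, step=1):
    """Return (inputs, labels): each input is seq_len consecutive tokens, each label the token after it."""
    inputs, labels = [], []
    for start in range(0, len(tokens) - seq_len, step):
        inputs.append(tokens[start:start + seq_len])
        labels.append(tokens[start + seq_len])
    return inputs, labels

toy_tokens = "alice was beginning to get very tired".split()
toy_inputs, toy_labels = make_windows(toy_tokens, seq_len=3)

for window, label in zip(toy_inputs, toy_labels):
    print(window, '->', label)
```

With 7 tokens and a window of 3, the loop yields 7 - 3 = 4 pairs. The same arithmetic gives `nb_words - SEQLEN` pairs for the full text.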

Construct the vectors of the input and the output datasets.

In [8]:
total_words = len(set(words))
# Use the built-in bool: the np.bool alias was deprecated and later removed from NumPy.
X = np.zeros((len(input_words), SEQLEN, total_words), dtype=bool)
y = np.zeros((len(input_words), total_words), dtype=bool)

The preceding step creates zero-filled arrays, which are populated in the following code.

In [9]:
for i, input_word in enumerate(input_words):
    for j, word in enumerate(input_word):
        X[i, j, word2index[word]] = 1
    y[i,word2index[label_words[i]]]=1

In the preceding code
* The first for loop iterates over all the input sequences (each one is 10 words long), and the second for loop iterates over the individual words within the chosen sequence, setting the position of each word's ID to 1.
* Additionally, given that each label is a single word rather than a sequence, y is updated once per input sequence, outside the second for loop.
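A toy version of the two loops shows what the arrays look like once they are filled in. This is a sketch with a made-up 4-word vocabulary and 2-word windows; all `toy` names and values are illustrative.

```python
import numpy as np

toy_word2index = {"the": 0, "cat": 1, "sat": 2, "down": 3}
toy_inputs = [["the", "cat"], ["cat", "sat"]]   # list of input sequences
toy_labels = ["sat", "down"]                    # one label word per sequence

seq_len, vocab_size = 2, len(toy_word2index)
X_toy = np.zeros((len(toy_inputs), seq_len, vocab_size), dtype=bool)
y_toy = np.zeros((len(toy_inputs), vocab_size), dtype=bool)

for i, sequence in enumerate(toy_inputs):       # outer loop: one input sequence at a time
    for j, word in enumerate(sequence):         # inner loop: each word position in that sequence
        X_toy[i, j, toy_word2index[word]] = True
    y_toy[i, toy_word2index[toy_labels[i]]] = True  # the label needs no inner loop

print(X_toy.astype(int))
print(y_toy.astype(int))
```

Each row of `X_toy[i]` contains exactly one 1, at the column given by the word's ID, and each row of `y_toy` does the same for the label word.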

The output shapes of X and y are as follows.

In [10]:
print('Shape of X: ',X.shape)
print('Shape of y: ',y.shape)

Shape of X:  (30527, 10, 3043)
Shape of y:  (30527, 3043)
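These shapes also determine how much memory the one-hot arrays occupy. The following back-of-the-envelope sketch uses the shapes printed above and assumes one byte per element, which is what a NumPy boolean array uses.

```python
n_samples, seq_len, vocab_size = 30527, 10, 3043  # taken from the printed shapes

bytes_per_bool = 1  # NumPy stores each boolean in one byte
x_bytes = n_samples * seq_len * vocab_size * bytes_per_bool
y_bytes = n_samples * vocab_size * bytes_per_bool

print("X: %d bytes (~%.2f GB)" % (x_bytes, x_bytes / 1e9))
print("y: %d bytes (~%.2f GB)" % (y_bytes, y_bytes / 1e9))
```

So X alone takes a little under 1 GB as booleans, and four times that if it is converted to 32-bit floats. This is the main cost of one-hot encoding every input word over the whole vocabulary.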


Define the architecture of the model.

In [11]:
HIDDEN_SIZE = 128
BATCH_SIZE = 32
NUM_ITERATIONS = 100
NUM_EPOCHS_PER_ITERATION = 1
NUM_PREDS_PER_EPOCH = 100
model = Sequential()
model.add(LSTM(HIDDEN_SIZE,return_sequences=False,input_shape=(SEQLEN,total_words)))
model.add(Dense(total_words, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')

Instructions for updating:
Colocations handled automatically by placer.


A summary of the model is as follows.

In [12]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               1624064   
_________________________________________________________________
dense_1 (Dense)              (None, 3043)              392547    
Total params: 2,016,611
Trainable params: 2,016,611
Non-trainable params: 0
_________________________________________________________________
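The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and a dense layer with a bias; the variable names are mine.

```python
input_dim = 3043    # size of each one-hot input vector (total_words)
hidden_size = 128   # HIDDEN_SIZE
output_dim = 3043   # one output unit per vocabulary word

# Each of the 4 LSTM gates has input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (hidden_size * (input_dim + hidden_size) + hidden_size)

# The dense layer has one weight per (hidden unit, output unit) pair plus one bias per output unit.
dense_params = hidden_size * output_dim + output_dim

print(lstm_params, dense_params, lstm_params + dense_params)
```

The three numbers match the `Param #` column and the total in the summary above. Most of the LSTM's parameters come from the 3043-dimensional one-hot input.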


In [13]:
int(len(input_words)*0.1)

3052

In [14]:
len(input_words)

30527
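These two numbers relate to the validation split used during training. The sketch below works through the arithmetic; the cut rule `int(n * (1 - split))` is an assumption on my part, chosen because it reproduces the sample counts that appear in the training log further down.

```python
n_samples = 30527
validation_split = 0.1

# Assumed rule: the first int(n * (1 - split)) samples are used for training, the rest for validation.
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train

# The quantity computed in the cell above, used as the bound when picking a random seed index.
seed_bound = int(n_samples * validation_split)

print(n_train, n_val, seed_bound)
```

Under that assumption the validation set is the last 3053 sequences, so a seed index drawn from the last 3052 positions falls inside the held-out portion.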

### Fit the model. 
Look at how the output varies over an increasing number of epochs. After each epoch, pick a random sequence of 10 words and predict the words that follow it, which lets us observe how the predictions change as training progresses.

* In the following code, I fit the model on the input and output arrays for one epoch per iteration.
* I choose a random seed sequence: `test_idx` is a random index among the last 10% of the input array (as `validation_split` is 0.1), and I collect the input words at that location.
* I convert the input sequence of words into its one-hot-encoded version (thus obtaining an array that is 1 x 10 x total_words in shape).
* Finally, I make a prediction on the array I just created and take the word that has the highest probability.
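The sliding-window generation described in these steps can be isolated from the model. The sketch below is not the trained LSTM: `toy_predict_next` is a hypothetical stand-in that follows a fixed word cycle, so the window mechanics can be run without Keras. The helper name `generate` is also mine.

```python
def generate(seed_words, predict_next, n_words):
    """Predict one word at a time, feeding each prediction back in as the last word of the window."""
    window = list(seed_words)
    generated = []
    for _ in range(n_words):
        next_word = predict_next(window)
        generated.append(next_word)
        # Slide the window: drop the oldest word and append the prediction.
        window = window[1:] + [next_word]
    return generated

# Hypothetical stand-in for the model: return the word that follows the window's last word in a fixed cycle.
toy_cycle = ["down", "the", "rabbit", "hole"]

def toy_predict_next(window):
    position = toy_cycle.index(window[-1])
    return toy_cycle[(position + 1) % len(toy_cycle)]

toy_output = generate(["down", "the", "rabbit"], toy_predict_next, n_words=5)
print(" ".join(toy_output))
```

In the real loop, `predict_next` corresponds to one-hot encoding the window, calling `model.predict`, and taking the `argmax` of the resulting probabilities.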

In [15]:
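for iteration in range(50):
    print("=" * 50)
    print("Iteration #: %d" % (iteration))
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION, validation_split=0.1)
    # Pick a random seed sequence from the last 10% of the data.
    # Starting the range at 1 avoids index 0, which would select the very first sequence.
    test_idx = -np.random.randint(1, int(len(input_words) * 0.1) + 1)
    test_words = input_words[test_idx]
    print("Generating from seed: %s" % (test_words))
    for _ in range(NUM_PREDS_PER_EPOCH):
        # One-hot encode the current 10-word window.
        Xtest = np.zeros((1, SEQLEN, total_words))
        for j, word in enumerate(test_words):
            Xtest[0, j, word2index[word]] = 1
        pred = model.predict(Xtest, verbose=0)[0]
        # Greedy choice: take the word with the highest predicted probability.
        ypred = index2word[np.argmax(pred)]
        print(ypred, end=' ')
        # Slide the window: drop the oldest word and append the prediction.
        test_words = test_words[1:] + [ypred]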

Iteration #: 0
Instructions for updating:
Use tf.cast instead.
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['right', 'to', 'prevent', 'you', 'from', 'copying', 'distributing', 'performing', 'displaying', 'or']
Iteration #: 1
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['e', '8', 'or', '1', 'e', '9', '1', 'e', '8', 'you']
Iteration #: 2
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['copy', 'display', 'perform', 'distribute', 'or', 'redistribute', 'this', 'electronic', 'work', 'or']
Iteration #: 3
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['information', 'about', 'the', 'project', 'gutenberg', 'literary', 'archive', 'foundation', 'the', 'project']
Iteration #: 4
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['redistributing', 'project', 'gutenberg', 'tm', 'electronic', 'works', '1', 'a', 'by', 'reading']
It

Iteration #: 11
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['or', 'other', 'form', 'any', 'alternate', 'format', 'must', 'include', 'the', 'full']
Iteration #: 12
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['or', 'other', 'form', 'any', 'alternate', 'format', 'must', 'include', 'the', 'full']
Iteration #: 13
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['paragraph', '1', 'e', '8', 'or', '1', 'e', '9', '1', 'e']
Iteration #: 14
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['help', 'preserve', 'free', 'future', 'access', 'to', 'project', 'gutenberg', 'tm', 'electronic']
Iteration #: 15
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['defective', 'work', 'may', 'elect', 'to', 'provide', 'a', 'replacement', 'copy', 'in']
Iteration #: 16
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from 

Generating from seed: ['owns', 'a', 'united', 'states', 'copyright', 'in', 'these', 'works', 'so', 'the']
Iteration #: 22
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['newby', 'chief', 'executive', 'and', 'director', 'gbnewby', 'pglaf', 'org', 'section', '4']
Iteration #: 23
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['provided', 'to', 'you', 'as', 'is', 'with', 'no', 'other', 'warranties', 'of']
Iteration #: 24
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['work', 'is', 'discovered', 'and', 'reported', 'to', 'you', 'within', '90', 'days']
Iteration #: 25
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['paid', 'for', 'it', 'by', 'sending', 'a', 'written', 'explanation', 'to', 'the']
Iteration #: 26
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['collection', 'of', 'project', 'gutenberg', 'tm', 'electronic', '

Iteration #: 32
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['of', 'this', 'agreement', 'for', 'keeping', 'the', 'project', 'gutenberg', 'tm', 'name']
Iteration #: 33
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['with', 'others', '1', 'd', 'the', 'copyright', 'laws', 'of', 'the', 'place']
Iteration #: 34
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['full', 'project', 'gutenberg', 'tm', 'license', 'you', 'must', 'require', 'such', 'a']
Iteration #: 35
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['of', 'public', 'domain', 'and', 'licensed', 'works', 'that', 'can', 'be', 'freely']
Iteration #: 36
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['loose', 'network', 'of', 'volunteer', 'support', 'project', 'gutenberg', 'tm', 'ebooks', 'are']
Iteration #: 37
Train on 27474 samples, validate on 3053 samples
Epoch 1

Iteration #: 44
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['federal', 'laws', 'and', 'your', 'state', 's', 'laws', 'the', 'foundation', 's']
Iteration #: 45
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['at', 'our', 'web', 'site', 'which', 'has', 'the', 'main', 'pg', 'search']
Iteration #: 46
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['produce', 'our', 'new', 'ebooks', 'and', 'how', 'to', 'subscribe', 'to', 'our']
Iteration #: 47
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['to', 'the', 'project', 'gutenberg', 'literary', 'archive', 'foundation', 'royalty', 'payments', 'must']
Iteration #: 48
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
Generating from seed: ['full', 'project', 'gutenberg', 'tm', 'license', 'available', 'with', 'this', 'file', 'or']
Iteration #: 49
Train on 27474 samples, validate on 3053 samples
Epoch 1/1
