# Text Prediction

In this project, we predict the next character of a text from the characters that precede it. We use an LSTM model to capture the contextual dependencies in the text.

The reference for this project is taken from https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

The dataset is the book Alice in Wonderland. It is a fairly small corpus, so training and evaluation do not take much time.

The first step is to import all the dependencies:
1. numpy: numerical arrays and reshaping
2. Sequential: the Keras class used to build a model as a linear stack of layers
3. layers: the layers that make up the model (Dense, Dropout, LSTM)
4. np_utils: converts integer labels into categorical data (one-hot encoding)

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.utils import np_utils

Using TensorFlow backend.


The next step is to load the data, using the built-in open function to open the file and its read method to read the contents.

In [2]:
with open('alice.txt', encoding='utf-8') as f:  # assumes the file is UTF-8; closes it after reading
    data = f.read()

The first 100 characters of the file are shown below:

In [4]:
data[:100]

'\n\nALICE’S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3.0\n\n\n\n\nCHAPTER I.'

Next, we convert all the text to lowercase with the lower method. This shrinks the vocabulary the model has to learn, since upper- and lowercase letters then share one code.

In [5]:
data = data.lower()
data[:100]

'\n\nalice’s adventures in wonderland\n\nlewis carroll\n\nthe millennium fulcrum edition 3.0\n\n\n\n\nchapter i.'

char_list is a list of all the unique characters in the corpus. The set function removes duplicate entries, and list turns the result into a list. A set is unordered, so the order shown below is arbitrary.

In [11]:
char_list = list(set(data))
char_list

['(',
 ';',
 ' ',
 'm',
 '“',
 '[',
 'x',
 'a',
 'i',
 'j',
 'u',
 'f',
 '.',
 ':',
 '”',
 '_',
 'n',
 's',
 'r',
 'l',
 '\n',
 '!',
 'k',
 'h',
 'v',
 'p',
 't',
 '*',
 '’',
 '0',
 ']',
 'o',
 '3',
 'g',
 ')',
 'q',
 'y',
 'w',
 'b',
 'c',
 ',',
 'z',
 '?',
 '-',
 'd',
 'e',
 '‘']

char is the list of characters in char_list sorted by Unicode code point, which gives the characters a fixed, reproducible order.

In [13]:
char = sorted(char_list)
char

['\n',
 ' ',
 '!',
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '0',
 '3',
 ':',
 ';',
 '?',
 '[',
 ']',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '‘',
 '’',
 '“',
 '”']

Now we use the built-in functions enumerate() and dict() to build a dictionary that maps each unique character to an integer. A neural network cannot operate on characters directly, so it needs this numeric representation.

In [15]:
char_to_int = dict((c, i) for i, c in enumerate(char))
char_to_int

{'\n': 0,
 ' ': 1,
 '!': 2,
 '(': 3,
 ')': 4,
 '*': 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '0': 9,
 '3': 10,
 ':': 11,
 ';': 12,
 '?': 13,
 '[': 14,
 ']': 15,
 '_': 16,
 'a': 17,
 'b': 18,
 'c': 19,
 'd': 20,
 'e': 21,
 'f': 22,
 'g': 23,
 'h': 24,
 'i': 25,
 'j': 26,
 'k': 27,
 'l': 28,
 'm': 29,
 'n': 30,
 'o': 31,
 'p': 32,
 'q': 33,
 'r': 34,
 's': 35,
 't': 36,
 'u': 37,
 'v': 38,
 'w': 39,
 'x': 40,
 'y': 41,
 'z': 42,
 '‘': 43,
 '’': 44,
 '“': 45,
 '”': 46}

As we can see, each of the 47 characters in char has its own integer code, from 0 to 46. The dictionary is named char_to_int; its keys are characters and its values are integers.
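To turn integer codes back into text later, the inverse mapping is needed as well. The following is a minimal sketch on a toy string rather than the book; the names `int_to_char`, `encode` and `decode` are introduced here for illustration and are not part of the original notebook.

```python
# Toy corpus standing in for the book text (assumption for illustration).
text = "hello world"

# Same construction as in the notebook: sorted unique characters -> integer codes.
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}

# Inverse mapping (hypothetical name): integer codes -> characters.
int_to_char = {i: c for c, i in char_to_int.items()}

def encode(s):
    """Map a string to a list of integer codes."""
    return [char_to_int[c] for c in s]

def decode(ids):
    """Map a list of integer codes back to a string."""
    return "".join(int_to_char[i] for i in ids)

encoded = encode("hello")
print(encoded)          # [3, 2, 4, 4, 5]
print(decode(encoded))  # hello
```

Because both dictionaries are built from the same sorted list, encoding followed by decoding returns the original string.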

In [17]:
n_chars = len(data)
n_vocab = len(char)
print('Total characters: ', n_chars)
print('Total Vocab', n_vocab)

Total characters:  144414
Total Vocab 47


The basic processing of the dataset is done. We have a dictionary (char_to_int) of 47 unique characters and their integer equivalents. The total number of characters in the corpus is 144414.

Now we need to structure the dataset so that it can be fed to the model. We use a sequence length of 200: the integer codes of characters 0 to 199 form the first input, and character 200 is its target value. Then characters 1 to 200 form the next input, and character 201 is the target. This is repeated until the end of the file, which gives 144414 - 200 = 144214 training inputs and the same number of labels. The LSTM model is meant to pick up the context carried by each 200-character window.

In [20]:
seq_length = 200
dataX = []
dataY = []
for i in range(0, n_chars-seq_length, 1):
    seq_in = data[i:i+seq_length]
    seq_out = data[i+seq_length]
    dataX.append([char_to_int[c] for c in seq_in])  # c avoids shadowing the char list
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print('Total Patterns: ', n_patterns)

Total Patterns:  144214


In [31]:
print('The first 10 values of the 0th entry in dataX: ', dataX[0][:10])
print('The corresponding value of the 0th entry in dataY: ', dataY[0])

The first 10 values of the 0th entry in dataX:  [0, 0, 17, 28, 25, 19, 21, 44, 35, 1]
The corresponding value of the 0th entry in dataY:  30


As we can see above, dataX contains 144214 rows of 200 integer codes each, and dataY contains the corresponding target value for each row.
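The windowing is easier to follow on a short sequence. The sketch below applies the same loop to six integer codes with a window of 3; the function name `make_windows` and the toy data are assumptions made for illustration.

```python
def make_windows(ids, seq_length):
    """Return (inputs, targets): each input holds seq_length codes,
    and its target is the code that immediately follows the window."""
    inputs, targets = [], []
    for i in range(len(ids) - seq_length):
        inputs.append(ids[i:i + seq_length])
        targets.append(ids[i + seq_length])
    return inputs, targets

toy_ids = [0, 1, 2, 3, 4, 5]           # toy data, not taken from the book
toy_X, toy_Y = make_windows(toy_ids, seq_length=3)

print(toy_X)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(toy_Y)  # [3, 4, 5]

# The number of windows is always len(ids) - seq_length,
# which for the book gives 144414 - 200 = 144214.
print(len(toy_X) == len(toy_ids) - 3)  # True
```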

The last step before feeding the data to the model is to reshape it into the form an LSTM layer expects: (samples, time steps, features).

In [34]:
X = np.reshape(dataX, (n_patterns, seq_length, 1))
X.shape

(144214, 200, 1)

In [35]:
X = X / float(n_vocab)  # scale the integer codes into the range [0, 1)

In [37]:
X[0][:10]

array([[ 0.        ],
       [ 0.        ],
       [ 0.36170213],
       [ 0.59574468],
       [ 0.53191489],
       [ 0.40425532],
       [ 0.44680851],
       [ 0.93617021],
       [ 0.74468085],
       [ 0.0212766 ]])

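X is divided by n_vocab so that every value lies in the range [0, 1), a scale that generally suits neural network training better than raw integer codes. The first ten values are shown above.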
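The scaling can be checked by hand on the first few codes of dataX[0]. This is a small sketch in plain Python rather than NumPy; the variable names `ids`, `scaled` and `largest` are introduced here for illustration.

```python
n_vocab = 47

# First five integer codes of dataX[0], as printed earlier.
ids = [0, 0, 17, 28, 25]

# Dividing by the vocabulary size maps codes 0..46 into [0, 1).
scaled = [i / float(n_vocab) for i in ids]
print(scaled)

# The largest possible code is n_vocab - 1, so no value ever reaches 1.
largest = (n_vocab - 1) / float(n_vocab)
print(largest < 1.0)  # True
```

The third value, 17/47, is about 0.3617, which matches the third entry of X[0] shown above.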

dataY is then converted to its categorical (one-hot) form, so that each target becomes a vector of length 47 with a single 1 at the index of the correct character.

In [38]:
Y = np_utils.to_categorical(dataY)

In [40]:
print(Y.shape)
print(Y[0])

(144214, 47)
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
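The one-hot row above can be reproduced and read back by hand. The sketch below is a plain-Python stand-in for illustration, not the Keras implementation; `one_hot` and `argmax` are hypothetical helper names.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position index."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def argmax(vec):
    """Return the position of the largest value in vec."""
    return max(range(len(vec)), key=lambda i: vec[i])

# dataY[0] is 30, which char_to_int assigns to the character 'n'.
y0 = one_hot(30, 47)

print(len(y0))     # 47
print(sum(y0))     # 1.0
print(argmax(y0))  # 30
```

Taking the argmax of a one-hot row recovers the original integer label, which is also how a predicted probability vector is turned back into a character code.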


The last step is to create a sequential model with Keras and fit it to the data. We use 200 LSTM units, a dropout rate of 0.2, a softmax output activation, and categorical_crossentropy loss with the adam optimizer. These parameters are all chosen arbitrarily.

In [42]:
model = Sequential()
model.add(LSTM(200, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(Y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

input_shape=(X.shape[1], X.shape[2]) is (200, 1): each of the 144214 input sequences has 200 time steps with 1 feature per step. The LSTM output feeds a dense layer with Y.shape[1] (= 47) output nodes. The softmax activation turns these 47 outputs into a probability distribution over the characters, so the values sum to 1.
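What the softmax activation does to the outputs of the dense layer can be sketched in a few lines. This is a plain-Python illustration with made-up scores, not the Keras implementation; the function name `softmax` and the example values are assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive values that sum to 1."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three made-up scores; the real layer would produce 47 of them.
probs = softmax([2.0, 1.0, 0.1])

print(probs)
print(abs(sum(probs) - 1.0) < 1e-12)  # True
```

The largest score gets the largest probability, so the most likely next character is the one with the highest output.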

Since the dimensions are all correct, the model is created. Now we need to feed the data.
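The size of this model can be estimated by hand. The sketch below assumes the standard LSTM formulation with one bias vector per gate, which is what the default Keras LSTM layer uses as far as we know; calling model.summary() should confirm the figures, but that was not run here. The function names are introduced for illustration.

```python
def lstm_params(units, input_dim):
    """4 gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

n_lstm = lstm_params(units=200, input_dim=1)
n_dense = dense_params(n_inputs=200, n_outputs=47)

# Dropout has no trainable parameters.
n_total = n_lstm + n_dense

print(n_lstm)   # 161600
print(n_dense)  # 9447
print(n_total)  # 171047
```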

We train for 20 epochs with batch_size=128. Again, these hyperparameters are chosen arbitrarily.

In [None]:
model.fit(X, Y, epochs=20, batch_size=128)

Epoch 1/20

Training continues in this way for 20 epochs, with the optimizer reducing the categorical cross-entropy loss.
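The amount of work behind those 20 epochs follows from the numbers already given. A short arithmetic sketch, assuming the final partial batch of each epoch is kept rather than dropped:

```python
import math

n_patterns = 144214
batch_size = 128
epochs = 20

# Batches per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(n_patterns / batch_size)

# Size of that final batch.
last_batch = n_patterns - (steps_per_epoch - 1) * batch_size

# Total number of weight updates over the whole run.
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 1127
print(last_batch)       # 86
print(total_updates)    # 22540
```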

Thus, an LSTM model is created that predicts the next character for a given 200-character input.
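One way to use such a model is to predict a character, append it to the input window, and repeat. The sketch below shows that loop with a stand-in predictor instead of the trained network, so it runs without Keras. The names `generate`, `predict_fn` and `dummy_predict` are hypothetical; with the real model, `predict_fn` would presumably wrap model.predict on the window reshaped to (1, seq_length, 1) and divided by n_vocab.

```python
def generate(seed_ids, predict_fn, n_steps):
    """Greedy generation: repeatedly pick the most probable next code
    and slide the window forward by one position."""
    window = list(seed_ids)
    generated = []
    for _ in range(n_steps):
        probs = predict_fn(window)
        next_id = max(range(len(probs)), key=lambda i: probs[i])  # argmax
        generated.append(next_id)
        window = window[1:] + [next_id]  # drop the oldest code, append the new one
    return generated

def dummy_predict(window, n_vocab=5):
    """Stand-in for the trained model (assumption): puts all probability
    on (last code + 1) modulo n_vocab."""
    probs = [0.0] * n_vocab
    probs[(window[-1] + 1) % n_vocab] = 1.0
    return probs

result = generate([0, 1, 2], dummy_predict, n_steps=4)
print(result)  # [3, 4, 0, 1]
```

The generated codes can then be mapped back to characters with the inverse of char_to_int.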