[View in Colaboratory](https://colab.research.google.com/github/pskshyam/NLP/blob/master/Char_RNN.ipynb)

In [0]:
import numpy as np
np.random.seed(42)

Load the data

In [2]:
!wget -O Pride_and_Prejudice.txt http://www.gutenberg.org/files/1342/1342-0.txt

--2018-06-01 08:04:54--  http://www.gutenberg.org/files/1342/1342-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 724725 (708K) [text/plain]
Saving to: ‘Pride_and_Prejudice.txt’


2018-06-01 08:04:55 (907 KB/s) - ‘Pride_and_Prejudice.txt’ saved [724725/724725]



In [3]:
book_text = open('Pride_and_Prejudice.txt', encoding='utf8').read()
print(len(book_text)) #total characters in the book, not words

704190


Build Tokenizer

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer
t = Tokenizer(char_level=True)
t.fit_on_texts(book_text) #Each unique character is assigned an index number, starting at 1. There are 86 unique characters in total.

Number of unique characters

In [5]:
vocab_size = len(t.word_index)
vocab_size

86
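
As a rough illustration of what the character-level tokenizer does, the sketch below rebuilds a character index in plain Python. It assumes the Keras behaviour of ranking characters by frequency and numbering them from 1 (index 0 is reserved). `build_char_index` is a hypothetical helper for illustration, not part of Keras.

```python
from collections import Counter

def build_char_index(text):
    """Hypothetical helper: map each character to an integer, most frequent first, starting at 1."""
    counts = Counter(text)
    ranked = [ch for ch, _ in counts.most_common()]
    return {ch: i for i, ch in enumerate(ranked, start=1)}

char_index = build_char_index("hello world")
print(char_index["l"], char_index["o"])  # 'l' occurs three times, 'o' twice
print(len(char_index))                   # number of distinct characters
```

Because numbering starts at 1, the largest index equals `vocab_size`, which is why later cells use `vocab_size+1` wherever index 0 must also be representable.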

Convert Characters to Numbers

In [6]:
book_num = t.texts_to_sequences(book_text)
number_chars = len(book_num)
number_chars

704190
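
Because `book_text` is passed as a single string, `texts_to_sequences` treats every character as its own text, so `book_num` is a list of one-element lists rather than a flat list of integers. The sketch below mimics that shape with a hypothetical toy index (the real one comes from `t.word_index`) and shows one way to flatten it.

```python
from itertools import chain

# Hypothetical toy index; the real mapping comes from t.word_index
char_index = {"a": 1, "b": 2, "c": 3}

def chars_to_sequences(text, index):
    """Mimic the nested shape: one single-element list per character."""
    return [[index[ch]] for ch in text]

nested = chars_to_sequences("abca", char_index)
flat = list(chain.from_iterable(nested))
print(nested)  # [[1], [2], [3], [1]]
print(flat)    # [1, 2, 3, 1]
```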

Build Input and Output

In [0]:
sequence_length = 100
input_data = []
output_data = []

Input and output containers
> Each input sequence holds 100 consecutive characters.

> Each output is the single character that follows its 100-character input sequence.

In [0]:
for i in range(0, number_chars - sequence_length): #0 to (704190-100)
    input_seq = book_num[i : i + sequence_length] #0:100, 1:101, 2:102, ...
    output_seq = book_num[i + sequence_length] #100, 101, 102, ...
    input_data.append(input_seq)
    output_data.append(output_seq)
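
To make the sliding window concrete, here is the same loop on a toy sequence with a window of 3 instead of 100. `make_windows` is a hypothetical helper name. A sequence of length n yields n - window pairs, which matches 704190 - 100 = 704090 for the book.

```python
def make_windows(sequence, window):
    """Return (inputs, targets) where each target is the item right after its window."""
    inputs, targets = [], []
    for i in range(len(sequence) - window):
        inputs.append(sequence[i:i + window])
        targets.append(sequence[i + window])
    return inputs, targets

toy = [5, 1, 4, 2, 3]
inputs, targets = make_windows(toy, window=3)
print(inputs)   # [[5, 1, 4], [1, 4, 2]]
print(targets)  # [2, 3]
```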

In [9]:
output_data[0]

[17]

Reshape and Normalize the input

In [10]:
#Reshape the input into the 3-dimensional array an LSTM expects: (number of sequences, characters per sequence,
#numbers used to represent each character)
input_data = np.reshape(input_data, (len(input_data),sequence_length,1))
input_data.shape

(704090, 100, 1)

We have 704090 sequences of 100 characters each, with every character represented by a single number.

In [0]:
input_data = input_data / vocab_size #Dividing input_data by vocab_size 86 to normalize the data
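
The reshape and scaling steps can be checked on toy data. This is a minimal sketch with a hypothetical vocabulary of 4 characters and windows of length 3. Dividing by the vocabulary size maps indices 1..4 into the range (0, 1].

```python
import numpy as np

toy_vocab_size = 4                    # hypothetical vocabulary size
windows = [[1, 2, 3], [2, 3, 4]]      # two windows of three character indices

x = np.reshape(windows, (len(windows), 3, 1))  # (samples, timesteps, features)
x = x / toy_vocab_size                          # scale indices into (0, 1]

print(x.shape)               # (2, 3, 1)
print(x[0].ravel().tolist()) # [0.25, 0.5, 0.75]
```

Scaling indices this way imposes an artificial ordering on the characters. One-hot inputs or an embedding layer are common alternatives, though they are not what this notebook uses.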

One hot encode the output

In [12]:
from tensorflow.keras.utils import to_categorical
#Character indices start at 1, so vocab_size+1 classes are needed to cover indices 0 through 86
output_data = to_categorical(output_data,num_classes=vocab_size+1)
output_data[0:1]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.]])
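
For reference, the same encoding can be reproduced in plain NumPy. The sketch below uses a hypothetical `one_hot` helper and the first target `[17]` from above. The row has 87 entries (indices 0 through 86) with a single 1 at position 17.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical NumPy equivalent of one-hot encoding integer labels."""
    labels = np.asarray(labels).reshape(-1)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

vocab = 86
row = one_hot([[17]], num_classes=vocab + 1)
print(row.shape)     # (1, 87)
print(row.argmax())  # 17
```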

Build the model

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
model = Sequential()
model.add(LSTM(128, input_shape=(input_data.shape[1],input_data.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(vocab_size+1, activation='softmax'))
model.compile(optimizer='adam',loss='categorical_crossentropy')
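
As a sanity check on the model size, the trainable parameter count can be worked out by hand. This sketch assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias) and a plain dense layer; the Dropout layer has no parameters. Calling `model.summary()` should report matching figures.

```python
def lstm_params(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # One weight per input-output pair plus one bias per output
    return inputs * outputs + outputs

lstm = lstm_params(units=128, input_dim=1)
dense = dense_params(inputs=128, outputs=86 + 1)
print(lstm, dense, lstm + dense)  # 66560 11223 77783
```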

Train the model - the goal of training is to minimize the loss

In [14]:
model.fit(input_data, output_data, batch_size=1000, epochs=10, verbose=2)

Epoch 1/10
 - 187s - loss: 3.1469
Epoch 2/10
 - 183s - loss: 3.0603
Epoch 3/10
 - 183s - loss: 2.9948
Epoch 4/10
 - 184s - loss: 2.9513
Epoch 5/10
 - 185s - loss: 2.9183
Epoch 6/10
 - 185s - loss: 2.8786
Epoch 7/10
 - 184s - loss: 2.8516
Epoch 8/10
 - 185s - loss: 2.8274
Epoch 9/10
 - 184s - loss: 2.8054
Epoch 10/10
 - 185s - loss: 2.7841


<tensorflow.python.keras._impl.keras.callbacks.History at 0x7fec8789fb00>
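
To put the loss values in context, it helps to compare them with a model that guesses uniformly over all 87 classes, and to convert the final loss to a perplexity. The sketch below does that arithmetic, assuming the reported loss is the mean categorical cross-entropy in nats.

```python
import math

num_classes = 86 + 1
uniform_loss = math.log(num_classes)  # loss of a model that guesses uniformly
final_loss = 2.7841                   # epoch 10 loss reported above
perplexity = math.exp(final_loss)     # effective number of equally likely choices

print(round(uniform_loss, 3))  # about 4.466
print(round(perplexity, 1))    # about 16.2
```

A perplexity of roughly 16 means the model is still about as uncertain as a choice among 16 equally likely characters, so ten epochs leave considerable room for improvement.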

Pick a random starting point for prediction

In [19]:
start = np.random.randint(0, input_data.shape[0]-1)
start

72847

In [0]:
data = book_num[start: start+sequence_length]
data = [item for sublist in data for item in sublist]

Build the integer-to-character mapping

In [0]:
int_to_char = dict((i,c) for c, i in t.word_index.items())

Start Predicting String

In [18]:
print('STARTING DATA: ')
print(''.join(int_to_char[char_val] for char_val in data))
print('\nPREDICTED: ')

for i in range(100):
    #Predict the next character from the current 100-character window
    prediction = model.predict(np.reshape(data,(1, len(data), 1))/vocab_size)

    #Get the index of the character with the highest probability
    char_index_predicted = np.argmax(prediction)

    #Convert the index back to a character
    char_predicted = int_to_char[char_index_predicted]

    print(char_predicted, end='')

    #Slide the window: append the new character index and drop the oldest one
    data.append(char_index_predicted)
    data = data[1:]

STARTING DATA: 
all them.

But the attention of every lady was soon caught by a young man, whom
they had never seen 

PREDICTED: 
 he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he  he 
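
The repeated ` he ` above is a typical symptom of always taking the most probable character: once the window settles into a pattern, `argmax` keeps reproducing it. One common alternative, not used in the original notebook, is to sample from the predicted distribution with a temperature. The following is a minimal sketch, assuming `probs` is the 1-D softmax output for a single step. `sample_with_temperature` is a hypothetical helper.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from probs after rescaling by temperature (hypothetical helper)."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs + 1e-12) / temperature  # lower temperature sharpens the distribution
    logits -= logits.max()                        # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

toy_probs = [0.05, 0.6, 0.3, 0.05]
rng = np.random.default_rng(42)
samples = [sample_with_temperature(toy_probs, temperature=1.0, rng=rng) for _ in range(1000)]
print(np.bincount(samples, minlength=4) / 1000)  # roughly follows toy_probs
```

In the generation loop this would stand in for `np.argmax(prediction)`, for example `sample_with_temperature(prediction[0], temperature=0.5)`; the temperature value is a tuning choice. Index 0 has no entry in `int_to_char`, so a more robust version would zero out `probs[0]` before sampling.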