# Keras RNN
This notebook follows the reference below.

http://ethen8181.github.io/machine-learning/keras/rnn_language_model_basic_keras.html#Keras-RNN-(Recurrent-Neural-Network)---Language-Model

The goal of this project is to predict the (n+1)th word given the previous n words.

In [1]:
import plaidml.keras #GPU packages
plaidml.keras.install_backend()

In [2]:
import os
import warnings
warnings.filterwarnings('ignore')


path = os.getcwd() # Get the current working directory

import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from time import time
from collections import Counter
from keras.utils import to_categorical
from keras.utils.data_utils import get_file
from keras.models import Sequential, load_model
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.callbacks import EarlyStopping, ModelCheckpoint

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
nltk.download('wordnet')
from nltk.stem.porter import PorterStemmer


# 1. magic so that the notebook will reload external python modules
# 2. magic for inline plot
# 3. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
# 4. magic to print version
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%load_ext watermark
%watermark -a 'Ulas' -d -t -v -p keras,numpy,matplotlib,tensorflow,nltk

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/uozdemir/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/uozdemir/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Ulas 2020-07-03 07:38:52 

CPython 3.7.6
IPython 7.12.0

keras 2.2.4
numpy 1.18.1
matplotlib 3.1.3
tensorflow 1.13.1
nltk 3.4.5


# Implementation of word-level language processing.

## Data preparation

In [3]:
pathFile = get_file('nietzsche.txt', origin = 'https://s3.amazonaws.com/text-datasets/nietzsche.txt') # Download the text file

with open(pathFile, encoding = 'utf-8') as f:
    raw_text = f.read() # Save the text file raw_text variable

print('corpus length:', len(raw_text)) # Print the char length of the corpus
print('example text:', raw_text[:150]) # Print the first 150 char of the corpus

corpus length: 600893
example text: PREFACE


SUPPOSING that Truth is a woman--what then? Is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists


The raw text still contains punctuation marks and '--' separators, which need to be removed before tokenizing.

In [4]:
# ideally, we would save the cleaned text, to prevent
# doing this step every single time
porterStemmer = PorterStemmer() # Initialize stemmer
tokens = raw_text.replace('--', ' ').split()
cleanedTokens = []
table = str.maketrans('', '', string.punctuation)
for word in tokens:
    word = word.translate(table)
#     word = porterStemmer.stem(word)
    if word.isalpha():
        cleanedTokens.append(word.lower())
        
print('sampled original text: ', tokens[:10])
print('sampled cleaned text: ', cleanedTokens[:10])

sampled original text:  ['PREFACE', 'SUPPOSING', 'that', 'Truth', 'is', 'a', 'woman', 'what', 'then?', 'Is']
sampled cleaned text:  ['preface', 'supposing', 'that', 'truth', 'is', 'a', 'woman', 'what', 'then', 'is']
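
The cleaning step above can be sketched as a small standalone function. This is a minimal illustration that mirrors the cell's logic; the name `clean_tokens` and the sample string are assumptions introduced here, not part of the original notebook.

```python
import string

def clean_tokens(text):
    """Split text into lowercase alphabetic tokens without punctuation."""
    # Replace the double dash with a space so the joined words are split apart
    tokens = text.replace('--', ' ').split()
    # Translation table that deletes every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    cleaned = []
    for word in tokens:
        word = word.translate(table)
        if word.isalpha():  # drops empty strings and tokens containing digits
            cleaned.append(word.lower())
    return cleaned

sample = 'SUPPOSING that Truth is a woman--what then? Is there not ground'
print(clean_tokens(sample))
```

Note that apostrophes are deleted as well, so a token such as "it's" becomes "its", and tokens that contain digits are dropped entirely.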


Instead of storing each word as a string, it is good practice to assign each word an integer index. The model cannot consume raw strings: the Embedding layer takes integer indices and uses each one to look up a row of the embedding matrix.

In [5]:
min_count = 2 # The word should appear more than once. This is necessary to get rid of noise.
unknown_token = '<unk>' # If a word is not in the vocabulary, it will be classified as <unk>
word2index = {unknown_token: 0}
index2word = [unknown_token]


stopWords = set(stopwords.words('english')) # Generate a list of English stopwords.
filteredWords = 0
counter = Counter(cleanedTokens) # this dictionary will store how many times each word showed up in the corpus.
for word, count in counter.items():
    if count >= min_count: # and word not in stopWords: # Keep words that appear at least min_count times
        index2word.append(word) # append the word to index2word vector
        word2index[word] = len(word2index) # The word's position in index2word is its index in the dict.
    else:
        filteredWords += 1 # The word appeared fewer than min_count times, so count it as filtered.
vocabSize = len(word2index)

print('Number of Words: ', vocabSize,', Filtered Words: ', filteredWords)


Number of Words:  5090 , Filtered Words:  5097
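
To make the index mapping concrete, here is a minimal sketch on a toy corpus. The helper `build_vocab` and the `toy_*` variables are hypothetical names used only for this illustration; the logic follows the cell above, with rare words mapped to index 0 (`<unk>`).

```python
from collections import Counter

def build_vocab(tokens, min_count=2, unknown_token='<unk>'):
    """Map every word seen at least min_count times to an integer index."""
    word2index = {unknown_token: 0}
    index2word = [unknown_token]
    for word, count in Counter(tokens).items():
        if count >= min_count:
            word2index[word] = len(index2word)  # next free index
            index2word.append(word)
    return word2index, index2word

toy_tokens = 'the cat sat on the mat the cat'.split()
toy_word2index, toy_index2word = build_vocab(toy_tokens)

# Words below min_count fall back to index 0, the unknown token
toy_encoded = [toy_word2index.get(word, 0) for word in toy_tokens]
toy_decoded = [toy_index2word[i] for i in toy_encoded]

print(toy_word2index)
print(toy_encoded)
print(toy_decoded)
```

Decoding the encoded sequence shows the information lost by the frequency filter: every word that appeared only once comes back as `<unk>`.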


In [6]:
step = 3
maxlen = 40
X = []
y = []

# This part will create "sentences" with fixed lengths of 40 words. Since we are trying to predict (n+1)th word
# labels are generated using the next word after each 40 words.
# Step parameter controls the overlapping between sentences. 

for i in range(0, len(cleanedTokens) - maxlen, step):
    sentence = cleanedTokens[i:i + maxlen] # A sentence is a group of 40 words starting from index i.
    next_word = cleanedTokens[i + maxlen] # The label is the 41st word, the one right after the group
    X.append([word2index.get(word, 0) for word in sentence]) # Find the index of each word in a sentence, then append it to feature vector
    y.append(word2index.get(next_word, 0)) # Find the index of the label in the dictionary append it to vector y.

X = np.array(X) # Convert list into a numpy array
Y = to_categorical(y, vocabSize) # Keras expects label vector to be one hot encoded.

print('sequence dimension: ', X.shape) # There are 33342 sentences. Each sentence contains 40 words.
print('target dimension: ', Y.shape) # Note that 5090 is the vocabulary size.
print('example sequence:\n', X[0]) # The first sentence, encoded as word indices (0 is <unk>)

sequence dimension:  (33342, 40)
target dimension:  (33342, 5090)
example sequence:
 [ 1  2  3  4  5  6  7  8  9  5 10 11 12 13  0  3 14 15 16 17 18 19 20 21
 22 23 21 24 25 26 27  3 28 29 30 31 32  0 33 34]
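
The effect of `maxlen` and `step` is easier to see on a short sequence. The sketch below applies the same windowing loop to the integers 0 to 9; `make_windows` and the `toy_*` names are assumptions made for this example.

```python
def make_windows(indices, maxlen, step):
    """Cut a sequence into fixed-length windows and their next-item labels."""
    X, y = [], []
    for i in range(0, len(indices) - maxlen, step):
        X.append(indices[i:i + maxlen])  # maxlen consecutive items
        y.append(indices[i + maxlen])    # the item right after the window
    return X, y

toy_indices = list(range(10))
toy_X, toy_y = make_windows(toy_indices, maxlen=4, step=3)

for window, label in zip(toy_X, toy_y):
    print(window, '->', label)
```

With `step = 3` and `maxlen = 4`, consecutive windows share one item. A smaller step yields more, heavily overlapping training samples, while a step equal to `maxlen` removes the overlap.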


In [7]:
embeddingSize = 50 # Embedding vector size
lstmSize = 256 # LSTM layer size
model = Sequential()
model.add(Embedding(vocabSize, embeddingSize, input_length = maxlen)) # Embedding layer will generate vectors
model.add(LSTM(lstmSize))
# model.add(Dropout(0.2))
model.add(Dense(vocabSize, activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam') # Adam is a common default optimizer.
print(model.summary())

INFO:plaidml:Opening device "metal_amd_radeon_pro_5300m.0"


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 40, 50)            254500    
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               314368    
_________________________________________________________________
dense_1 (Dense)              (None, 5090)              1308130   
Total params: 1,876,998
Trainable params: 1,876,998
Non-trainable params: 0
_________________________________________________________________
None
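
The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch that assumes a standard LSTM layer with four gates, each with input weights, recurrent weights, and a bias; the variable names are chosen for this example.

```python
vocab_size = 5090      # vocabSize in the notebook
embedding_size = 50    # embeddingSize
lstm_size = 256        # lstmSize

# Embedding: one vector of length embedding_size per vocabulary entry
embedding_params = vocab_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_size * (embedding_size + lstm_size) + lstm_size)

# Dense: weight matrix from the LSTM output to the vocabulary, plus biases
dense_params = lstm_size * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Roughly 70% of the parameters sit in the final Dense layer, because its size scales with the vocabulary.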


In [8]:
# Converts seconds into human-readable format
def elapsed(sec):
    if sec < 60:
        return str(sec) + ' seconds'
    elif sec < (60 * 60):
        return str(sec / 60) + ' minutes'
    else:
        return str(sec / (60 * 60)) + ' hours'

In [None]:
def build_model(model, address = None):
    if address is None or not os.path.isfile(address): # Fit the model unless a saved checkpoint already exists.
        stop = EarlyStopping(monitor = 'val_loss', min_delta = 0, 
                             patience = 5, verbose = 1, mode = 'auto') # Stop early when the validation loss stops improving.
        callbacks = [stop]
        if address is not None: # Only save checkpoints when a path is given.
            save = ModelCheckpoint(address, monitor = 'val_loss', 
                                   verbose = 0, save_best_only = True) # Keep the model with the best validation loss.
            callbacks.append(save)

        start = time()
        history = model.fit(X, Y, batch_size = batch_size, 
                            epochs = epochs, verbose = 1,
                            validation_split = validation_split,
                            callbacks = callbacks)
        elapse = time() - start
        print('elapsed time: ', elapsed(elapse))
        model_info = {'history': history, 'elapse': elapse, 'model': model}
    else: # If a checkpoint exists, load it.
        model = load_model(address)
        model_info = {'model': model}

    return model_info
  

epochs = 40
batch_size = 32
validation_split = 0.2
address1 = 'lstm_weights1.hdf5'
print('model checkpoint address: ', address1)
model_info = build_model(model, address1)

model checkpoint address:  lstm_weights1.hdf5
Train on 26673 samples, validate on 6669 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40

In [10]:
def check_prediction(model, num_predict):
    true_print_out = 'Actual words: '
    pred_print_out = 'Predicted words: '
    for i in range(num_predict):
        x = X[i]
        prediction = model.predict(x[np.newaxis, :], verbose = 0)
        index = np.argmax(prediction)
        true_print_out += index2word[y[i]] + ' '
        pred_print_out += index2word[index] + ' '

    print(true_print_out)
    print(pred_print_out)


num_predict = 10
model = model_info['model']
check_prediction(model, num_predict)

Actual words: they paid to been unseemly <unk> certainly never to and 
Predicted words: the down and to <unk> the is been to and 
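
The check above only looks at the single most likely word. A possible extension, sketched here under the assumption that the model output is a one-dimensional vector of probabilities, is to list the top few candidates. The function `top_k_words` and the toy data are hypothetical and not part of the original notebook.

```python
def top_k_words(probabilities, index_to_word, k=3):
    """Return the k most probable words with their probabilities."""
    # Rank vocabulary indices from most to least probable
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return [(index_to_word[i], probabilities[i]) for i in ranked[:k]]

toy_vocab = ['<unk>', 'the', 'truth', 'is']
toy_probs = [0.1, 0.5, 0.15, 0.25]
print(top_k_words(toy_probs, toy_vocab, k=2))
```

In the notebook this could be called with `model.predict(x[np.newaxis, :], verbose = 0)[0]` and `index2word`, which may show whether the actual word is at least among the close candidates.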


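Removing stopwords does not seem to be a good idea for next-word prediction, since many of the target words are themselves stopwords (for example 'to' and 'and' in the output above).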