A language model predicts the next word in a sequence based on the words that came before it. Language models can also be developed at the character level using neural networks.


The benefit of character-based language models is their small vocabulary and their flexibility in handling words, punctuation, and other document structure. The trade-off is that they require larger models, which are slower to train.

However, with neural language models, character-based models offer promise for a general, flexible and powerful approach to language modeling.

This project is an exercise in implementing neural language models on a simple nursery rhyme and on song lyrics.

The result was two models that, given an input string, predict the sequence of words that follows it in the source text.

In [None]:
from pickle import dump, load
from numpy import array
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import Activation, Dense, Embedding, LSTM
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

In [None]:
# generate a sequence of words with a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
	in_text = seed_text
	# generate a fixed number of words
	for _ in range(n_words):
		# encode the words as integers
		encoded = tokenizer.texts_to_sequences([in_text])[0]
		# pad or truncate the sequence to a fixed length
		encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
		# predict the index of the most likely next word
		# (predict_classes is deprecated; take the argmax of predict instead)
		yhat = model.predict(encoded, verbose=0).argmax(axis=-1)[0]
		# reverse map integer to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += ' ' + out_word
	return in_text

# Source Text Creation



In [None]:
!pip install tensorflow
!pip install keras
!pip install h5py



In [None]:
s='Sing a song of sixpence,\
A pocket full of rye.\
Four and twenty blackbirds,\
Baked in a pie.\
When the pie was opened\
The birds began to sing;\
Wasn’t that a dainty dish,\
To set before the king.\
The king was in his counting house,\
Counting out his money;\
The queen was in the parlour,\
Eating bread and honey.\
The maid was in the garden,\
Hanging out the clothes,\
When down came a blackbird\
And pecked off her nose.'

with open('rhymes.txt','w') as f:
  f.write(s)

# Sequence Generation

Instead of splitting the corpus into individual words and feeding them to the model through intermediate files, I used the Tokenizer introduced in the previous labs to convert the text to numerical values.

This reduced processing time, as the model was fitted on integer codes rather than on raw strings.

I also used Keras's preprocessing function texts_to_sequences(), which, together with the Tokenizer, represents the document as a sequence of integer values in which each word is represented by a unique integer.

In other words, the Tokenizer:

- Splits words by space (split=" ").
- Filters out punctuation.
- Converts text to lowercase.

I then converted the text to sequences of integers with the texts_to_sequences() function.

I originally tried to follow the file outputs and the tokenization process from the original lab outline. However, I saw that the Tokenizer from the last lab would work here, which avoids file input and output where it is not necessary.
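To make the three steps above concrete, here is a minimal plain-Python sketch of roughly what the Tokenizer does. It is an approximation, not the Keras implementation: the helper names `simple_tokenize` and `build_word_index` are hypothetical, and the punctuation list used here (`string.punctuation`) differs slightly from the Tokenizer's default filter, which for instance keeps apostrophes.

```python
import string
from collections import Counter

def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

def build_word_index(words):
    """Assign integers from 1 upward, most frequent word first."""
    counts = Counter(words)
    # most_common() orders by count; ties keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

text = "The king was in his counting house, counting out his money."
words = simple_tokenize(text)
word_index = build_word_index(words)

# Each word is replaced by its integer, as texts_to_sequences() would do
encoded = [word_index[w] for w in words]
print(word_index)
print(encoded)
```

Index 0 is never assigned to a word in this scheme, which is why the vocabulary size used later is one larger than the number of distinct words.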

In [None]:
#using tokenizer then assigning text to numerical values instead of raw words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([s])
encoded = tokenizer.texts_to_sequences([s])[0]

The vocabulary size is retrieved from the fitted Tokenizer through its word_index attribute.

One is added because word indices start at 1, so the integer for the largest encoded word must still be a valid array index.

Ex: words encoded 1 to 10 need array indices 0 to 10, or 11 positions.
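A small sketch of why the extra position is needed, using a hypothetical three-word index in the style of `tokenizer.word_index`. The `one_hot` helper is illustrative only and stands in for what `to_categorical` does for a single label.

```python
# Hypothetical word index; real indices also start at 1
word_index = {"the": 1, "king": 2, "queen": 3}

# Position 0 is reserved and never assigned to a word
vocab_size = len(word_index) + 1

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

largest = max(word_index.values())
# Works: valid positions are 0..3, so index 3 exists
print(one_hot(largest, vocab_size))
# one_hot(largest, len(word_index)) would raise IndexError,
# because a vector of length 3 has no position 3
```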

I originally tried to vary the size of the sequences manually, but I realized that this was inefficient. Different lines had different lengths, so the sequence size had to be adjusted for each line individually.

In [None]:
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 51


In [None]:
# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
	sequence = encoded[i-2:i+1]
	sequences.append(sequence)

In [None]:
sequences

[[5, 2, 15],
 [2, 15, 6],
 [15, 6, 16],
 [6, 16, 2],
 [16, 2, 17],
 [2, 17, 18],
 [17, 18, 6],
 [18, 6, 19],
 [6, 19, 20],
 [19, 20, 7],
 [20, 7, 21],
 [7, 21, 22],
 [21, 22, 23],
 [22, 23, 3],
 [23, 3, 2],
 [3, 2, 8],
 [2, 8, 9],
 [8, 9, 1],
 [9, 1, 8],
 [1, 8, 4],
 [8, 4, 24],
 [4, 24, 25],
 [24, 25, 26],
 [25, 26, 10],
 [26, 10, 5],
 [10, 5, 27],
 [5, 27, 28],
 [27, 28, 2],
 [28, 2, 29],
 [2, 29, 30],
 [29, 30, 10],
 [30, 10, 31],
 [10, 31, 32],
 [31, 32, 1],
 [32, 1, 11],
 [1, 11, 1],
 [11, 1, 11],
 [1, 11, 4],
 [11, 4, 3],
 [4, 3, 12],
 [3, 12, 13],
 [12, 13, 33],
 [13, 33, 13],
 [33, 13, 14],
 [13, 14, 12],
 [14, 12, 34],
 [12, 34, 1],
 [34, 1, 35],
 [1, 35, 4],
 [35, 4, 3],
 [4, 3, 1],
 [3, 1, 36],
 [1, 36, 37],
 [36, 37, 38],
 [37, 38, 7],
 [38, 7, 39],
 [7, 39, 1],
 [39, 1, 40],
 [1, 40, 4],
 [40, 4, 3],
 [4, 3, 1],
 [3, 1, 41],
 [1, 41, 42],
 [41, 42, 14],
 [42, 14, 1],
 [14, 1, 43],
 [1, 43, 9],
 [43, 9, 44],
 [9, 44, 45],
 [44, 45, 2],
 [45, 2, 46],
 [2, 46, 47],
 [46, 47, 

In [None]:
print('Total Sequences: %d' % len(sequences))
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)


Total Sequences: 75
Max Sequence Length: 3


# Model Training

Here the text corpus was split into training and validation sets, and the validation data was passed to the model.fit() function in order to track validation error per epoch.

I first tried to use train_test_split from scikit-learn, but found I could do this more easily with the options documented for the model.fit() function.
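A minimal sketch of this kind of split, assuming it was done with the `validation_split` argument of `model.fit()`. That argument holds out the last fraction of the samples, taken before any shuffling. The helper name `holdout_split` and the toy samples are hypothetical, and the sketch only mirrors the slicing in plain Python rather than calling Keras.

```python
def holdout_split(samples, validation_split):
    """Split samples into (train, validation), holding out the last fraction."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

# Hypothetical stand-in for the encoded three-word sequences
samples = [[i, i + 1, i + 2] for i in range(8)]

train, val = holdout_split(samples, validation_split=0.25)
print(len(train), len(val))
```

With Keras itself the equivalent would be a call along the lines of `model.fit(X, y, validation_split=0.25, ...)`, or passing an explicit `validation_data=(X_val, y_val)` pair.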

In [None]:
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)


The model has a single hidden LSTM layer with 42 units. The output layer uses a softmax function so that the predicted values form a probability distribution: each lies between 0 and 1 and together they sum to 1.

The model uses a learned word embedding in the input layer. Each word in the vocabulary is mapped to one real-valued vector of a specified length, here 10 dimensions.
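The layer sizes described above determine the number of trainable parameters, and the arithmetic can be checked by hand. The sketch below uses the standard parameter formulas for embedding, LSTM, and dense layers. The function names are hypothetical, and the results should agree with what `model.summary()` reports for a vocabulary of 51 words.

```python
def embedding_params(vocab_size, embed_dim):
    # one vector of length embed_dim per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 51, 10, 42
counts = [
    embedding_params(vocab_size, embed_dim),   # 510
    lstm_params(embed_dim, lstm_units),        # 8904
    dense_params(lstm_units, vocab_size),      # 2193
]
print(counts, sum(counts))                     # total: 11607
```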

In [None]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(42))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 2, 10)             510       
_________________________________________________________________
lstm (LSTM)                  (None, 42)                8904      
_________________________________________________________________
dense (Dense)                (None, 51)                2193      
Total params: 11,607
Trainable params: 11,607
Non-trainable params: 0
_________________________________________________________________
None


I compiled and fit the network on the encoded text data. I used the Adam implementation of gradient descent and tracked accuracy at the end of each epoch, for 442 epochs in total.

I then fed the vocabulary mappings into the sequence generation function defined at the top. This function takes seed text and then generates the sequence of words associated with it.

Judging from the accuracy scores, the model does not always output the correct word for a given input.

This is because some words are ambiguous: the same word is followed by different words in different places.

Ex: the => pie, the => king

When the pie was opened
The birds began to sing;
Wasn’t that a dainty dish,
To set before the king.

The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.

The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.
As the outputs show, the model reproduces the associated lines recognizably, with some errors.

But it cannot reproduce "hello world", as these words do not exist in the original corpus.
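Both observations can be checked directly on the text. The sketch below is an illustration on a few lines of the rhyme, typed in lowercase without punctuation, and is not part of the original pipeline. It counts which words follow each word, and it checks whether the words of "hello world" are in the vocabulary at all.

```python
from collections import Counter, defaultdict

# A few lines of the rhyme, lowercased with punctuation removed
words = ("when the pie was opened the birds began to sing "
         "to set before the king the king was in his counting house").split()

# Count which words follow each word
followers = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    followers[current][nxt] += 1

# Several different continuations for the same input word
print(dict(followers["the"]))
# A single continuation, so no ambiguity
print(dict(followers["pie"]))

# Words outside the vocabulary have no integer code to feed the model
vocab = set(words)
print([w for w in "hello world".split() if w in vocab])
```

When a word has several observed continuations, a model that sees only that word cannot be right every time, which puts a ceiling on the achievable accuracy. Unknown words are simply skipped by `texts_to_sequences()` unless the Tokenizer was created with an `oov_token`.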

In [None]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=442, verbose=2)

Epoch 1/442
3/3 - 0s - loss: 3.9321 - accuracy: 0.0400
Epoch 2/442
3/3 - 0s - loss: 3.9296 - accuracy: 0.0800
Epoch 3/442
3/3 - 0s - loss: 3.9278 - accuracy: 0.0933
Epoch 4/442
3/3 - 0s - loss: 3.9258 - accuracy: 0.1067
Epoch 5/442
3/3 - 0s - loss: 3.9241 - accuracy: 0.1067
Epoch 6/442
3/3 - 0s - loss: 3.9221 - accuracy: 0.1067
Epoch 7/442
3/3 - 0s - loss: 3.9200 - accuracy: 0.1067
Epoch 8/442
3/3 - 0s - loss: 3.9179 - accuracy: 0.1067
Epoch 9/442
3/3 - 0s - loss: 3.9157 - accuracy: 0.1067
Epoch 10/442
3/3 - 0s - loss: 3.9132 - accuracy: 0.1067
Epoch 11/442
3/3 - 0s - loss: 3.9102 - accuracy: 0.1067
Epoch 12/442
3/3 - 0s - loss: 3.9075 - accuracy: 0.1067
Epoch 13/442
3/3 - 0s - loss: 3.9044 - accuracy: 0.1067
Epoch 14/442
3/3 - 0s - loss: 3.9009 - accuracy: 0.1067
Epoch 15/442
3/3 - 0s - loss: 3.8973 - accuracy: 0.1067
Epoch 16/442
3/3 - 0s - loss: 3.8932 - accuracy: 0.1067
Epoch 17/442
3/3 - 0s - loss: 3.8890 - accuracy: 0.1067
Epoch 18/442
3/3 - 0s - loss: 3.8838 - accuracy: 0.1067
E

<tensorflow.python.keras.callbacks.History at 0x7fb18c693fd0>

In [None]:
# test start of rhyme
print(generate_seq(model, tokenizer, max_length-1, 'Sing a son', 3))
# test mid-line
print(generate_seq(model, tokenizer, max_length-1, 'king was i', 3))
# test not in original
#print(generate_seq(model, mapping, 10, 'hello worl', 1))

Sing a son song of sixpence
king was i in the parlour


In summary, what worked was simplifying the code wherever possible: using the Tokenizer to encode the words as numerical values, avoiding manual adjustment of the sequence length, increasing the number of epochs, and adjusting the LSTM layer parameters.

# Line-by-Line Sequencing

This alternate approach lets the model use the context of each line to help resolve cases where a word in the input sequence is ambiguous.

The cost is that the model no longer learns to predict words across line boundaries.

This model uses sequence padding to ensure a fixed length input.
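A plain-Python sketch of this framing, assuming a line has already been encoded as integers. The helpers `prefix_sequences` and `pad_pre` and the token values are hypothetical. `pad_pre` only imitates the left-padding behaviour of `pad_sequences(..., padding='pre')`.

```python
def prefix_sequences(encoded_line):
    """All prefixes of a line with at least two tokens (inputs plus one target)."""
    return [encoded_line[:i + 1] for i in range(1, len(encoded_line))]

def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_length = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_length - len(seq)) + seq for seq in sequences]

# Hypothetical integer encoding of one four-word line
line = [7, 3, 12, 5]

padded = pad_pre(prefix_sequences(line))
for row in padded:
    # everything but the last token is input; the last token is the target
    print(row[:-1], "->", row[-1])
```

Because every sequence starts from the beginning of its own line, no training example spans two lines.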

In [None]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
	in_text = seed_text
	# generate a fixed number of words
	for _ in range(n_words):
		# encode the text as integers
		encoded = tokenizer.texts_to_sequences([in_text])[0]
		# pre-pad sequences to a fixed length
		encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
		# predict the index of the most likely next word
		# (predict_classes is deprecated; take the argmax of predict instead)
		yhat = model.predict(encoded, verbose=0).argmax(axis=-1)[0]
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += ' ' + out_word
	return in_text
 

The generate_seq() function builds up the input sequence by appending each prediction to the list of input words on every iteration.



In [None]:
# source text
data = """ Well here we are again
It's always such a pleasure
Remember when you tried
to kill me twice?

Oh how we laughed and laughed
Except I wasn't laughing
Under the circumstances
I've been shockingly nice

You want your freedom?
Take it
That's what I'm counting on
I used to want you dead
but
Now I only want you gone

She was a lot like you
(Maybe not quite as heavy)
Now little Caroline is in here too

One day they woke me up
So I could live forever
It's such a shame the same
will never happen to you

You've got your
short sad life left
That's what I'm counting on
I'll let you get right to it
Now I only want you gone

Goodbye my only friend
Oh, did you think I meant you?
That would be funny
if it weren't so sad

Well you have been replaced
I don't need anyone now
When I delete you maybe
I'll stop feeling so bad

Go make some new disaster
That's what I'm counting on
You're someone else's problem
Now I only want you gone
Now I only want you gone
Now I only want you gone"""


In [None]:
# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)


Vocabulary Size: 118


In [None]:
# create line-based sequences
sequences = list()
for line in data.split('\n'):
	encoded = tokenizer.texts_to_sequences([line])[0]
	for i in range(1, len(encoded)):
		sequence = encoded[:i+1]
		sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))


Total Sequences: 156


In [None]:
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)


Max Sequence Length: 7


In [None]:
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 6, 10)             1180      
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_1 (Dense)              (None, 118)               6018      
Total params: 19,398
Trainable params: 19,398
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/500
5/5 - 0s - loss: 4.7709 - accuracy: 0.0000e+00
Epoch 2/500
5/5 - 0s - loss: 4.7655 - accuracy: 0.0449
Epoch 3/500
5/5 - 0s - loss: 4.7598 - accuracy: 0.0897
Epoch 4/500
5/5 - 0s - loss: 4.7532 - accuracy: 0.0897
Epoch 5/500
5/5 - 0s - loss: 4.7443 - accuracy: 0.0897
Epoch 6/500
5/5 - 0s - loss: 4.7314 - accuracy: 0.0897
Epoch 7/500
5/5 - 0s - loss: 4.7133 - acc

<tensorflow.python.keras.callbacks.History at 0x7fb189339908>

Running the example achieves a better fit on the source data. The added context has allowed the model to disambiguate some of the examples.

There are still lines of text that start with the same word (for example, more than one line starts with 'Now'), which may still be a problem for the network.

In [None]:
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Except', 3))
print(generate_seq(model, tokenizer, max_length-1, 'Now', 7))

Except i wasn't laughing
Now i only want you gone gone dead


The first line matches the source text. But the second is strange: "Now I only want you gone gone dead".

The network saw 'Now' within the padded input sequence rather than at its start, so it output 'Now I only want you gone', which is the last line of the song. However, it got confused about where to stop and appended the ambiguous phrase "gone dead". Since the line "Now I only want you gone" appears three times in a row at the end of the input text, the model should have output "Now I only want you gone Now I only".
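One factor that may contribute, offered as an assumption rather than a diagnosis: with a maximum sequence length of 7 the model takes 6 input tokens, and `pad_sequences` by default truncates from the front as well as padding at the front. The sketch below imitates that windowing with hypothetical token ids to show what the model would be given at each generation step.

```python
def model_window(token_ids, max_length, pad_value=0):
    """Mimic pre-padding and pre-truncation to exactly max_length tokens."""
    kept = token_ids[-max_length:]
    return [pad_value] * (max_length - len(kept)) + kept

# Hypothetical token ids standing in for:
# now i only want you gone gone dead
generated = [1, 2, 3, 4, 5, 6, 6, 7]

for n in range(1, len(generated) + 1):
    # the window the model would see after n words of text
    print(n, model_window(generated[:n], max_length=6))
```

Once the text grows past six words, the first word falls out of the window, so the later predictions are made without the seed word in view.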

A different framing may produce better new lines, but not necessarily for all partial lines of input.