<a href="https://colab.research.google.com/github/pdevall/TextGeneration/blob/master/CharacterLevelModelTextGeneration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import tensorflow as tf
from tensorflow import keras


Load the data and shuffle the samples. Then compute max_length (the length of the longest word) and m (the number of samples).

In [0]:
datasetNPARRAY = np.loadtxt("/content/drive/My Drive/Colab Notebooks/CharacterLevelLanguageModel/dinos.txt", dtype="str")
np.random.shuffle(datasetNPARRAY)
print(datasetNPARRAY)
max_length = max(len(x) for x in datasetNPARRAY)
print(max_length)
m = len(datasetNPARRAY)
print(m)


Create the vocabulary from the unique characters in the text file.
Create char_to_index, which maps each character to an index.
Create index_to_char, the reverse mapping from index to character.
We also add the newline character to the vocabulary.

In [0]:
vocabulary = set(char for word in datasetNPARRAY.tolist() for char in word.lower())
vocabulary.add("\n")
char_to_index = {char: index for index, char in enumerate(sorted(vocabulary))}
print(char_to_index)
index_to_char = dict(map(reversed, char_to_index.items()))
print(index_to_char)
vocab_size = len(vocabulary)
print(vocab_size)
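
As a small illustration of the two lookup tables, the sketch below builds them for a made-up list of three names and round-trips one word through them. The list `toy_names` and the helpers `encode` and `decode` are hypothetical and not part of the notebook.

```python
# Toy data standing in for the dinosaur names (hypothetical).
toy_names = ["Rex", "Raptor", "Trex"]

# Unique lowercase characters, plus the newline character.
toy_vocabulary = set(char for word in toy_names for char in word.lower())
toy_vocabulary.add("\n")

# sorted() places "\n" before the letters, so it receives index 0.
toy_char_to_index = {char: i for i, char in enumerate(sorted(toy_vocabulary))}
toy_index_to_char = {i: char for char, i in toy_char_to_index.items()}

def encode(word):
  # Map a word to its list of character indices.
  return [toy_char_to_index[c] for c in word.lower()]

def decode(indices):
  # Map a list of indices back to a string.
  return "".join(toy_index_to_char[i] for i in indices)

print(toy_char_to_index)
print(decode(encode("Rex")))
```

Because the words are lowercased before encoding, decoding returns the lowercase form of the input.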


Create the input array X with shape (m, max_length, vocab_size), where m is the number of samples, max_length is the length of the longest word in the file, and vocab_size is the vocabulary size. Initialize it with zeros.
Create the label array Y the same way.
Loop through the words and their characters to fill X and Y with one-hot vectors. Y is X shifted one position to the left: the label at position t is the character at position t+1.

In [0]:
X = np.zeros((m, max_length, vocab_size))
Y = np.zeros((m, max_length, vocab_size))
for wordCount in range(len(datasetNPARRAY)):
  word = datasetNPARRAY[wordCount].lower()
  for charCount in range(len(word)):
    char = word[charCount]
    X[wordCount, charCount, char_to_index[char]] = 1
    if charCount < len(word)-1:
      Y[wordCount, charCount, char_to_index[word[charCount+1]]]= 1
print(X.shape)
print(Y.shape)
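
To make the one-position shift concrete, here is a minimal sketch for a single word. The four-character vocabulary, the word "rex", and the maximum length of 5 are made-up values chosen for illustration.

```python
import numpy as np

# Hypothetical toy setup: one word, a 4-character vocabulary, max length 5.
toy_char_to_index = {"\n": 0, "e": 1, "r": 2, "x": 3}
toy_index_to_char = {i: c for c, i in toy_char_to_index.items()}
word, toy_max_length = "rex", 5

x = np.zeros((toy_max_length, len(toy_char_to_index)))
y = np.zeros((toy_max_length, len(toy_char_to_index)))
for t, char in enumerate(word):
  x[t, toy_char_to_index[char]] = 1
  if t < len(word) - 1:
    # The label at position t is the character at position t + 1.
    y[t, toy_char_to_index[word[t + 1]]] = 1

# Pair each input character with the character the model should predict.
for t in range(len(word)):
  target = toy_index_to_char[int(y[t].argmax())] if y[t].any() else "(none)"
  print(word[t], "->", target)
```

With this encoding the last character of a word has no target, and the padding rows after the word stay all-zero in both arrays.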

Build the model. Here we use an LSTM layer followed by a Dense layer with softmax activation. There is no Embedding layer, because the inputs are already one-hot encoded.

In [0]:
def build_model(vocab_size, rnn_units):
  model = keras.models.Sequential()
  model.add(keras.layers.LSTM(rnn_units, return_sequences=True, input_shape=(max_length, vocab_size)))
  model.add(keras.layers.Dense(vocab_size,  activation='softmax'))
  return model

In [0]:
model = build_model(vocab_size, rnn_units=128)
model.summary()
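
As a rough cross-check of the parameter counts that `model.summary()` reports, the arithmetic can be reproduced by hand. This sketch assumes a vocabulary of 27 characters (26 lowercase letters plus the newline); the actual `vocab_size` depends on the contents of the data file.

```python
# Assumed values: 27 = 26 lowercase letters plus "\n"; 128 matches rnn_units above.
vocab_size_assumed = 27
rnn_units = 128

# An LSTM has 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (rnn_units * (vocab_size_assumed + rnn_units) + rnn_units)
# The Dense layer maps each 128-dimensional LSTM output to one score per character.
dense_params = rnn_units * vocab_size_assumed + vocab_size_assumed

print(lstm_params, dense_params, lstm_params + dense_params)
```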

In [0]:
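def loss(labels, probabilities):
  # The Dense layer already applies softmax, so the model outputs probabilities, not logits.
  return tf.keras.losses.categorical_crossentropy(labels, probabilities, from_logits=False)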

In [0]:
model.compile(optimizer='adam', loss='categorical_crossentropy')
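
Because the final Dense layer already applies softmax, the loss should be computed on probabilities rather than on raw scores. The NumPy sketch below, with made-up numbers and hypothetical helper names, shows how the cross-entropy changes if probabilities are mistakenly treated as logits and passed through softmax a second time.

```python
import numpy as np

def softmax(z):
  # Numerically stable softmax over a 1-D array.
  e = np.exp(z - np.max(z))
  return e / e.sum()

def cross_entropy(one_hot, probabilities):
  # Categorical cross-entropy for a single one-hot label.
  return -np.sum(one_hot * np.log(probabilities))

label = np.array([0.0, 1.0, 0.0])
logits = np.array([1.0, 3.0, 0.5])   # made-up raw scores
probs = softmax(logits)              # what a softmax layer would output

correct = cross_entropy(label, probs)           # loss computed on probabilities
double = cross_entropy(label, softmax(probs))   # probabilities treated as logits
print(correct, double)
```

The second softmax flattens the distribution, so the loss no longer reflects how confident the model actually was.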

In [0]:
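def generate_name(model, length=13, temperature=1.0):
  # Seed with a random letter; index 0 is "\n", so draw from 1..vocab_size-1.
  index = np.random.randint(1, vocab_size)
  name = [index_to_char[index]]
  x = np.zeros((1, max_length, vocab_size))
  x[0, 0, index] = 1
  # Generate the remaining characters without running past the input window.
  for i in range(min(length, max_length) - 1):
    # Predicted distribution for position i+1, given the characters so far.
    predictions = model.predict(x, verbose=0)[0, i].astype("float64")
    # temperature=1.0 leaves the distribution unchanged; lower values sharpen it.
    predictions = np.power(predictions, 1.0 / temperature)
    predictions = predictions / np.sum(predictions)
    index = np.random.choice(vocab_size, p=predictions)
    # Keep earlier characters in x so the LSTM sees the whole prefix.
    x[0, i+1, index] = 1
    name.append(index_to_char[index])
  print(''.join(name))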
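
To see what a sampling temperature does to a next-character distribution, here is a minimal sketch on a made-up three-way distribution. The helper `apply_temperature` is hypothetical; it raises each probability to the power 1/T and renormalises, which is equivalent to dividing the log-probabilities by T before a softmax.

```python
import numpy as np

def apply_temperature(probabilities, temperature):
  # Hypothetical helper: p ** (1 / T), renormalised to sum to 1.
  scaled = np.power(np.asarray(probabilities, dtype="float64"), 1.0 / temperature)
  return scaled / scaled.sum()

toy_probs = [0.7, 0.2, 0.1]   # made-up next-character distribution
for temperature in (0.5, 1.0, 2.0):
  print(temperature, apply_temperature(toy_probs, temperature).round(3))

# Draw one index from the sharpened distribution.
rng = np.random.default_rng(0)
index = rng.choice(len(toy_probs), p=apply_temperature(toy_probs, 0.5))
```

A temperature below 1 concentrates probability on the most likely character, a temperature above 1 spreads it out, and a temperature of exactly 1 leaves the distribution as it is.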

After a few runs, the generated text looks something like this:
etasaatosanus<br>
rinauanosalus<br>
aganuatosalus<br>

Most of the generated words end with "us".

In [0]:
def generate_name_loop(epoch, _):
  # Print three sample names every 25 epochs (epoch is 0-based).
  if epoch % 25 == 0:
    print('Names generated after epoch %d:' % epoch)
    for i in range(3):
      generate_name(model)
    print()

name_generator = keras.callbacks.LambdaCallback(on_epoch_end = generate_name_loop)

model.fit(X, Y, epochs=405, callbacks=[name_generator], verbose=1)
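
As a quick check of when the callback fires, the reporting schedule can be listed without training anything. This sketch relies only on the two constants used above and on Keras passing 0-based epoch indices to `on_epoch_end`; the variable names are illustrative.

```python
epochs = 405          # same value as passed to model.fit above
report_every = 25     # same interval as in generate_name_loop

# Keras passes 0-based epoch indices to on_epoch_end.
reporting_epochs = [epoch for epoch in range(epochs) if epoch % report_every == 0]
print(reporting_epochs[:3], "...", reporting_epochs[-1])
print(len(reporting_epochs), "reports,", 3 * len(reporting_epochs), "names in total")
```

Training for slightly more than 400 epochs means the last report comes from epoch index 400.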
