# Text generation 
## Mathematics of Deep and Machine Learning Algorithms

---



### By: **Jinghua Duan**


> Master 2 - Mathematics and Economic Decision (MED)

## Part 1. Text generation with naive LSTM

In this part, we train an LSTM on 25 short stories by O. Henry and use it to generate text.

First, we import the packages we need.

In [1]:
import os
import datetime
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer, MinMaxScaler

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

from google.colab import files
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import RNN
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### 1.1 Data importation
In this project, we load the short stories of O. Henry. The data is a plain text file that you can download from the data folder on my GitHub:

Click on this link : https://github.com/jinghua-duan/Text-generation-with-O-henry-s-short-stories/blob/main/sixes%20and%20sevenes%20ohenry.txt




In [2]:
uploaded = files.upload()

Saving sixes and sevenes ohenry.txt to sixes and sevenes ohenry.txt


In [3]:
with open('sixes and sevenes ohenry.txt', 'r') as data:
    ohenry = data.read()

Let us see the stories.

In [4]:
print(ohenry[0:1003])

Inexorably Sam Galloway saddled his pony. He was going away from the
Rancho Altito at the end of a three-months' visit. It is not to be
expected that a guest should put up with wheat coffee and biscuits
"yellow-streaked with saleratus for longer than that. Nick Napoleon, the"
"big Negro man cook, had never been able to make good biscuits: Once"
"before, when Nick was cooking at the Willow Ranch, Sam had been forced to"
"fly from his _cuisine_, after only a six-weeks' sojourn."

"On Sam's face was an expression of sorrow, deepened with regret and"
slightly tempered by the patient forgiveness of a connoisseur who cannot
be understood. But very firmly and inexorably he buckled his
"saddle-cinches, looped his stake-rope and hung it to his saddle-horn, tied"
"his slicker and coat on the cantle, and looped his quirt on his right"
"wrist. The Merrydews (householders of the Rancho Altito), men, women,"
"children, and servants, vassals, visitors, employes, dogs, and casual"
"callers were groupe

### 1.2 Text processing

In this part, we process the text at the character level. First, we find all the distinct characters and build a two-way mapping between each character and an integer id.

In [9]:
characters = sorted(list(set(ohenry)))

n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}

vocab_size = len(characters)
print('Number of unique characters: ', vocab_size)
print(characters)

Number of unique characters:  81
['\n', ' ', '!', '"', '#', '$', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [10]:
print(n_to_char)

{0: '\n', 1: ' ', 2: '!', 3: '"', 4: '#', 5: '$', 6: '&', 7: "'", 8: '(', 9: ')', 10: ',', 11: '-', 12: '.', 13: '0', 14: '1', 15: '2', 16: '3', 17: '4', 18: '5', 19: '6', 20: '7', 21: '8', 22: '9', 23: ':', 24: ';', 25: '?', 26: 'A', 27: 'B', 28: 'C', 29: 'D', 30: 'E', 31: 'F', 32: 'G', 33: 'H', 34: 'I', 35: 'J', 36: 'K', 37: 'L', 38: 'M', 39: 'N', 40: 'O', 41: 'P', 42: 'Q', 43: 'R', 44: 'S', 45: 'T', 46: 'U', 47: 'V', 48: 'W', 49: 'X', 50: 'Y', 51: 'Z', 52: '[', 53: ']', 54: '_', 55: 'a', 56: 'b', 57: 'c', 58: 'd', 59: 'e', 60: 'f', 61: 'g', 62: 'h', 63: 'i', 64: 'j', 65: 'k', 66: 'l', 67: 'm', 68: 'n', 69: 'o', 70: 'p', 71: 'q', 72: 'r', 73: 's', 74: 't', 75: 'u', 76: 'v', 77: 'w', 78: 'x', 79: 'y', 80: 'z'}


The document contains 81 unique characters: uppercase and lowercase letters, digits, punctuation, and separators such as the newline. Each character is represented by a unique integer from 0 to 80.
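As a small illustration of the two dictionaries, the sketch below builds the same kind of mapping on a toy string and checks that encoding followed by decoding returns the original text. The string `sample` and the helper names `encode` and `decode` are illustrative and not part of the notebook.

```python
# Toy corpus standing in for the O. Henry text (illustrative only).
sample = "Sam saddled his pony."

# Same construction as in the notebook: sorted unique characters -> ids.
characters = sorted(set(sample))
n_to_char = {n: char for n, char in enumerate(characters)}
char_to_n = {char: n for n, char in enumerate(characters)}

def encode(text):
    """Map each character to its integer id."""
    return [char_to_n[c] for c in text]

def decode(ids):
    """Map integer ids back to a string."""
    return "".join(n_to_char[i] for i in ids)

ids = encode("pony")
print(ids)
print(decode(ids))
```

Because the characters are sorted before being numbered, the ids depend only on which characters occur in the text, not on the order in which they appear.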

In [11]:
X = []   
Y = []  
length = len(ohenry)
seq_length = 100   

for i in range(0, length - seq_length, 1):
    sequence = ohenry[i:i + seq_length]
    label = ohenry[i + seq_length]
    X.append([char_to_n[char] for char in sequence])
    Y.append(char_to_n[label])
    
print('Number of extracted sequences:', len(X))


Number of extracted sequences: 356995


We then turn the text into sequences of numbers, where each number is the id assigned to the corresponding character. The document has 357095 characters in total, which gives 357095 - 100 = 356995 sequences of length 100.

However, there is a problem with our text: we did not remove the line separators or any leftover markup. We do that later, in section 1.4.2.

In [13]:
np.shape(Y)

(356995,)

In [14]:
print(Y[0:100])

[62, 72, 59, 59, 11, 67, 69, 68, 74, 62, 73, 7, 1, 76, 63, 73, 63, 74, 12, 1, 34, 74, 1, 63, 73, 1, 68, 69, 74, 1, 74, 69, 1, 56, 59, 0, 59, 78, 70, 59, 57, 74, 59, 58, 1, 74, 62, 55, 74, 1, 55, 1, 61, 75, 59, 73, 74, 1, 73, 62, 69, 75, 66, 58, 1, 70, 75, 74, 1, 75, 70, 1, 77, 63, 74, 62, 1, 77, 62, 59, 55, 74, 1, 57, 69, 60, 60, 59, 59, 1, 55, 68, 58, 1, 56, 63, 73, 57, 75, 63]


In [16]:
X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(len(characters))
Y_modified = np_utils.to_categorical(Y)

X_modified.shape, Y_modified.shape

((356995, 100, 1), (356995, 81))





>**The shape of our input X**

1.   We have 356995 sequences of 100 characters each, and every character is represented by a single number, so X has shape 356995 × 100 × 1. This is a character-level text representation.
2. The model predicts Y one character at a time from X. Y contains 356995 characters; one-hot encoding each of them over the 81-character vocabulary gives Y the shape 356995 × 81.
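The two shapes can be reproduced on a toy example. The sketch below uses three sequences of four ids and a vocabulary of five characters (all hypothetical values), and it uses `np.eye` indexing as a stand-in for `np_utils.to_categorical`.

```python
import numpy as np

# Toy stand-ins (illustrative): 3 sequences of 4 character ids, vocabulary of 5.
X = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 0]]
Y = [4, 0, 1]
vocab_size = 5

# (samples, time steps, features): one scalar feature per time step,
# scaled into [0, 1) by dividing by the vocabulary size.
X_modified = np.reshape(X, (len(X), 4, 1)) / float(vocab_size)

# One-hot encode the labels: row i has a 1 in column Y[i].
Y_modified = np.eye(vocab_size)[Y]

print(X_modified.shape, Y_modified.shape)
```

Note that scaling the ids treats characters as points on a number line, so characters with neighbouring ids look similar to the network even though the ordering is arbitrary.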









In [19]:
print(X_modified.shape[1])

100


In [20]:
print(X_modified.shape[2])

1


In [21]:
print(Y_modified.shape[1])

81


### 1.3 Build LSTM model

In this part, we build our LSTM model. The input has shape 100 × 1: each time, we feed the model 100 characters and it outputs a probability distribution over the 81 possible next characters.

In [30]:
model = Sequential()
model.add(LSTM(400, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(400))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 100, 400)          643200    
                                                                 
 dropout (Dropout)           (None, 100, 400)          0         
                                                                 
 lstm_1 (LSTM)               (None, 400)               1281600   
                                                                 
 dropout_1 (Dropout)         (None, 400)               0         
                                                                 
 dense (Dense)               (None, 81)                32481     
                                                                 
Total params: 1,957,281
Trainable params: 1,957,281
Non-trainable params: 0
_________________________________________________________________
None
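The parameter counts in the summary above can be checked by hand. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias, which gives 4 × (units × (input_dim + units) + units) parameters. The helper names below are illustrative.

```python
def lstm_params(input_dim, units):
    """Parameters of a standard LSTM layer: 4 gates, each with an input
    kernel, a recurrent kernel, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units

first_lstm = lstm_params(input_dim=1, units=400)
second_lstm = lstm_params(input_dim=400, units=400)
output_layer = dense_params(input_dim=400, units=81)

print(first_lstm, second_lstm, output_layer)
print(first_lstm + second_lstm + output_layer)
```

Most of the parameters sit in the recurrent kernels, so the model size is driven by the number of LSTM units rather than by the input dimension.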


In [8]:
filepath="/content/baseline-improvement-ohenry-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [9]:
model.fit(X_modified, Y_modified, epochs=2, batch_size=2048, callbacks = callbacks_list)

Epoch 1/2
Epoch 00001: loss improved from inf to 3.15183, saving model to /content/baseline-improvement-ohenry-01-3.1518.hdf5
Epoch 2/2
Epoch 00002: loss improved from 3.15183 to 2.90858, saving model to /content/baseline-improvement-ohenry-02-2.9086.hdf5


<keras.callbacks.History at 0x7ff6b60f4850>

From the log above, we can see that the model performs poorly: the loss is still large after two epochs, and training is computationally expensive. It took 8 hours to train this model.
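To put the loss value in perspective, it can be compared with the cross-entropy of a model that guesses uniformly among the 81 characters, and converted to a perplexity. This is a rough sketch that assumes the reported loss is the usual cross-entropy in natural-log units.

```python
import math

vocab_size = 81          # distinct characters in Part 1
final_loss = 2.9086      # training loss after epoch 2, from the log above

# Cross-entropy of a model that assigns probability 1/81 to every character.
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of equally likely choices per character.
perplexity = math.exp(final_loss)

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"perplexity at loss {final_loss}: {perplexity:.1f}")
```

A loss close to what character frequencies alone would give suggests that the model has learned little about the context, which is consistent with the repetitive output seen in the next section.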

### 1.4 Generate the text

We use the model trained above to generate some text. You can download the weight file from my GitHub and load it.

In [31]:
filename = "/baseline-improvement-ohenry-02-2.9086.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [118]:
start = np.random.randint(0, len(X)-1) 

string_mapped = list(X[start])

full_string = [n_to_char[value] for value in string_mapped]

print("Seed:")
print("\"", ''.join(full_string), "\"")

Seed:
" seats. Two of them
made a fight and were both killed.

It took the Daltons just ten minutes to captu "


Here we select a random sequence of 100 characters from our text and use it as the seed.

We feed these 100 characters into the model to predict the next character, append the prediction to the sequence, drop the first character, and repeat, here for 400 steps.

In [119]:
for i in range(400):
    x = np.reshape(string_mapped,(1,len(string_mapped), 1))
    x = x / float(len(characters))

    pred_index = np.argmax(model.predict(x, verbose=0))
    full_string.append(n_to_char[pred_index])
    
    string_mapped.append(pred_index)  
    string_mapped = string_mapped[1:len(string_mapped)] 

In [120]:
txt = ''.join(full_string)

print(start)
print(txt)

75892
seats. Two of them
made a fight and were both killed.

It took the Daltons just ten minutes to captu to the toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe 


#### 1.4.1 Conclusion

From the results, we can see that the model only generates "toe" over and over, which makes no sense. Why does the model perform so badly?

1.  One reason is that the model is underfitting.
2.  Another reason is that we did not sample when generating text: we always took the most likely character.
3.  A third possibility is that we did not clean the text.
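The effect of always taking the most likely character can be seen on a toy model. In the sketch below, the table `NEXT` is a made-up set of next-character probabilities that depends only on the last character; it is not the trained LSTM. With greedy decoding the output falls into a cycle, while sampling from the same probabilities can leave it.

```python
import random

# Hypothetical next-character probabilities, conditioned on the last character only.
NEXT = {
    "t": {"o": 0.55, "h": 0.45},
    "o": {"e": 0.60, " ": 0.40},
    "e": {" ": 0.90, "t": 0.10},
    " ": {"t": 0.70, "h": 0.30},
    "h": {"e": 0.80, "o": 0.20},
}

def generate(start, length, greedy, rng=None):
    """Extend `start` by `length` characters, greedily or by sampling."""
    out = start
    for _ in range(length):
        dist = NEXT[out[-1]]
        if greedy:
            nxt = max(dist, key=dist.get)          # most likely character
        else:
            chars = list(dist)
            nxt = rng.choices(chars, weights=[dist[c] for c in chars])[0]
        out += nxt
    return out

print(generate("t", 15, greedy=True))
print(generate("t", 15, greedy=False, rng=random.Random(0)))
```

Greedy decoding is deterministic, so once the model revisits a context it has already seen, it repeats the same continuation forever.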



#### 1.4.2 An alternative input after text processing

In this section, we clean the original data and build the same model as before, to check whether the lack of cleaning is the reason for the model's poor performance.

In [122]:
### data cleaning
def convert_text_to_lowercase(df):
    return df.lower()

def not_regex(pattern):
    # Regex matching any single character that does not start a match of `pattern`.
    return r"((?!{}).)".format(pattern)

def remove_punctuation(df):
    # Replace line breaks with spaces.
    df = df.replace('\n', ' ')
    df = df.replace('\r', ' ')
    alphanumeric_characters_extended = '(\\b[-/]\\b|[a-zA-Z0-9])'
    # Note: str.replace does a literal (non-regex) substitution, so this call
    # leaves the text unchanged and punctuation is kept (see the output below).
    df = df.replace(not_regex(alphanumeric_characters_extended), ' ')
    return df

def text_cleaning(df):
    """
    Takes in a string of text, then performs the following:
    1. convert text to lowercase
    2. replace new line characters '\n' with spaces

    Punctuation and stopwords are not removed by the current implementation.
    """
    df = convert_text_to_lowercase(df)
    df = remove_punctuation(df)

    return df

In [123]:
ohenryclean = text_cleaning(ohenry)


In [124]:
characters = sorted(list(set(ohenryclean)))

n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}

vocab_size = len(characters)
print('Number of unique characters: ', vocab_size)
print(characters)

Number of unique characters:  54
[' ', '!', '"', '#', '$', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


After cleaning, only 54 unique characters remain: the 26 uppercase letters and the newline are gone, while digits and punctuation are still present.
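Punctuation is still in the vocabulary because `remove_punctuation` calls `str.replace`, which treats its first argument as a literal substring rather than a regular expression. A minimal sketch of a regex-based alternative is shown below. It assumes the intent is to keep only lowercase letters, digits and spaces, which is simpler than the pattern in the notebook, and the function name `remove_punctuation_regex` is hypothetical.

```python
import re

def remove_punctuation_regex(text):
    """Lowercase the text, then replace line breaks and every character that
    is not a lowercase letter, digit, or space with a space."""
    text = text.lower()
    return re.sub(r"[^a-z0-9 ]", " ", text)

sample = 'He was going away,\n"yellow-streaked" with saleratus.'
cleaned = remove_punctuation_regex(sample)
print(cleaned)
print(sorted(set(cleaned)))

# str.replace treats the pattern as a literal substring, so nothing changes:
print("a,b".replace(r"[^a-z]", " "))
```

Replacing characters with a space instead of deleting them keeps the text length unchanged, so the number of extracted sequences would stay the same.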

In [125]:
X = []   
Y = []  
length = len(ohenryclean)
seq_length = 100   

for i in range(0, length - seq_length, 1):
    sequence = ohenryclean[i:i + seq_length]
    label = ohenryclean[i + seq_length]
    X.append([char_to_n[char] for char in sequence])
    Y.append(char_to_n[label])
    
print('Number of extracted sequences:', len(X))

Number of extracted sequences: 356995


In [126]:
X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(len(characters))
Y_modified = np_utils.to_categorical(Y)

X_modified.shape, Y_modified.shape

((356995, 100, 1), (356995, 54))

In [127]:
model = Sequential()
model.add(LSTM(400, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(400))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [18]:
filepath="/content/baseline-improvement-ohenry-clean-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [32]:
model.fit(X_modified, Y_modified, epochs=2, batch_size=256, callbacks = callbacks_list)

Epoch 1/2
Epoch 00001: loss improved from inf to 2.69770, saving model to /content/baseline-improvement-ohenry-clean-01-2.6977.hdf5
Epoch 2/2
Epoch 00002: loss improved from 2.69770 to 2.46561, saving model to /content/baseline-improvement-ohenry-clean-02-2.4656.hdf5


<keras.callbacks.History at 0x7f94e73882d0>

Still, the model is computationally costly: training again took **8 hours**.

In [36]:
filename = "/content/baseline-improvement-ohenry-clean-02-2.4656.hdf5" # link to find the file : https://github.com/DLProjectTextGeneration/TextGeneration/blob/main/Weights/baseline-improvement-britney-clean-40-0.3835_50_128.hdf5
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [37]:
start = 40560

string_mapped = list(X[start])

full_string = [n_to_char[value] for value in string_mapped]

print("Seed:")
print("\"", ''.join(full_string), "\"")


Seed:
" s meagre purchase, but her courage failed" at the act. she did not dare affront him. she knew the pr "


In [38]:
for i in range(400):
    x = np.reshape(string_mapped,(1,len(string_mapped), 1))
    x = x / float(len(characters))

    pred_index = np.argmax(model.predict(x, verbose=0))
    full_string.append(n_to_char[pred_index])
    
    string_mapped.append(pred_index)  
    string_mapped = string_mapped[1:len(string_mapped)] 

In [39]:
txt = ''.join(full_string)

print(start)
print(txt)

40560
s meagre purchase, but her courage failed" at the act. she did not dare affront him. she knew the proe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe of the sooe o


Again, the model only generates a repeated fragment, "the sooe of". The problem therefore does not seem to come from text cleaning. There must be a more fundamental flaw in the model.

## Part 2. Text generation with correct *LSTM*

### 2.1 Data importation

We start with the imports.

In [18]:
import tensorflow as tf
import tensorflow.keras.backend as K
import numpy as np
from tensorflow.keras import layers, Model
import os
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import string
import re

In [19]:
raw_data_ds = tf.data.TextLineDataset(["/content/sixes and sevenes ohenry.txt"])

In [20]:
for elems in raw_data_ds.take(10):
    print(elems.numpy().decode("utf-8"))

Inexorably Sam Galloway saddled his pony. He was going away from the
Rancho Altito at the end of a three-months' visit. It is not to be
expected that a guest should put up with wheat coffee and biscuits
"yellow-streaked with saleratus for longer than that. Nick Napoleon, the"
"big Negro man cook, had never been able to make good biscuits: Once"
"before, when Nick was cooking at the Willow Ranch, Sam had been forced to"
"fly from his _cuisine_, after only a six-weeks' sojourn."

"On Sam's face was an expression of sorrow, deepened with regret and"
slightly tempered by the patient forgiveness of a connoisseur who cannot


### 2.2 Text processing 

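In this part, we process the data differently from Part 1, using TensorFlow and Keras. First, we concatenate all the lines of the text into a single string, which removes the line separators. Note that the lines are joined without a space, so the last word of one line runs into the first word of the next (for example `theRancho` in the output below).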


In [21]:
text=""
for elem in raw_data_ds:
   text=text+(elem.numpy().decode('utf-8'))


print(text[:1000])

Inexorably Sam Galloway saddled his pony. He was going away from theRancho Altito at the end of a three-months' visit. It is not to beexpected that a guest should put up with wheat coffee and biscuits"yellow-streaked with saleratus for longer than that. Nick Napoleon, the""big Negro man cook, had never been able to make good biscuits: Once""before, when Nick was cooking at the Willow Ranch, Sam had been forced to""fly from his _cuisine_, after only a six-weeks' sojourn.""On Sam's face was an expression of sorrow, deepened with regret and"slightly tempered by the patient forgiveness of a connoisseur who cannotbe understood. But very firmly and inexorably he buckled his"saddle-cinches, looped his stake-rope and hung it to his saddle-horn, tied""his slicker and coat on the cantle, and looped his quirt on his right""wrist. The Merrydews (householders of the Rancho Altito), men, women,""children, and servants, vassals, visitors, employes, dogs, and casual""callers were grouped in the ""gall

In [22]:
print("Corpus length:", int(len(text)),"chars")

Corpus length: 350160 chars


In [23]:
chars = sorted(list(set(text)))
print("Total distinct chars:", len(chars))
print(chars)

Total distinct chars: 80
[' ', '!', '"', '#', '$', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


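The text now has 80 distinct characters and 350160 characters in total, compared with 81 and 357095 in Part 1. The difference comes from the line delimiters, which are removed when the text is read line by line.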
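The bookkeeping behind these numbers can be sketched on a few lines of text: splitting into lines and concatenating them removes one character per line break and drops `'\n'` from the vocabulary. The toy string `raw` is illustrative, and the last line simply subtracts the two corpus lengths reported in Parts 1 and 2.

```python
raw = "He was going away from the\nRancho Altito at the end\nof a visit.\n"

# Splitting into lines and concatenating them drops every '\n'.
lines = raw.splitlines()
joined = "".join(lines)

print(len(raw), len(joined), raw.count("\n"))
print("\n" in set(raw), "\n" in set(joined))

# Same bookkeeping with the character counts reported in Parts 1 and 2.
print(357095 - 350160)
```

If nothing else differs between the two readings of the file, the difference between the two totals corresponds to the number of line breaks in the original file.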


---

Next, we change how the character sequences are built.


1.   First, we shorten the character sequences from 100 to 20 characters.
2.   Second, we keep a step of 1 between the starts of consecutive sequences (`step = 1` in the code below); a larger step such as 3 would give fewer, less redundant sequences.
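The number of training sequences depends on both the sequence length and the step. A minimal sketch of the count, using the corpus length printed above (the helper name `n_sequences` is illustrative):

```python
def n_sequences(n_chars, maxlen, step):
    """Number of start positions visited by range(0, n_chars - maxlen, step)."""
    return len(range(0, n_chars - maxlen, step))

corpus_length = 350160  # corpus length reported above

print(n_sequences(corpus_length, maxlen=20, step=1))
print(n_sequences(corpus_length, maxlen=20, step=3))
print(n_sequences(corpus_length, maxlen=100, step=1))
```

Shortening the sequences barely changes the number of examples, but each example is five times cheaper to process; increasing the step is what reduces the number of examples.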






In [24]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 20
step = 1
input_chars = []
next_char = []

In [25]:
for i in range(0, len(text) - maxlen, step):
    input_chars.append(text[i : i + maxlen])
    next_char.append(text[i + maxlen])

In [26]:
print("Number of sequences:", len(input_chars))
print("input X  (input_chars)  --->   output y (next_char) ")

for i in range(5):
  print( input_chars[i],"   --->  ", next_char[i])


Number of sequences: 350140
input X  (input_chars)  --->   output y (next_char) 
Inexorably Sam Gallo    --->   w
nexorably Sam Gallow    --->   a
exorably Sam Gallowa    --->   y
xorably Sam Galloway    --->    
orably Sam Galloway     --->   s


In [27]:
X_train_ds_raw=tf.data.Dataset.from_tensor_slices(input_chars)
y_train_ds_raw=tf.data.Dataset.from_tensor_slices(next_char)

Here we define some functions to clean the data: converting the text to lowercase, removing digits and punctuation, and splitting the text into characters or words.

In [28]:
def custom_standardization(input_data):
    lowercase     = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    # Raw string so that \d reaches the regex engine unchanged.
    stripped_num  = tf.strings.regex_replace(stripped_html, r"[\d-]", " ")
    stripped_punc = tf.strings.regex_replace(stripped_num,
                             "[%s]" % re.escape(string.punctuation), "")
    return stripped_punc

def char_split(input_data):
  return tf.strings.unicode_split(input_data, 'UTF-8')

def word_split(input_data):
  return tf.strings.split(input_data)
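To see what the standardization does to a piece of text without running TensorFlow, the three steps can be mirrored with the standard `re` module. This is only an approximate plain-Python counterpart written for illustration (the function name `standardize_py` is hypothetical), not the code that the vectorization layer runs.

```python
import re
import string

def standardize_py(text):
    """Plain-Python counterpart of the cleaning steps (illustrative)."""
    text = text.lower()
    text = text.replace("<br />", " ")
    # Digits and hyphens become spaces.
    text = re.sub(r"[\d-]", " ", text)
    # Remaining punctuation characters are deleted.
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    return text

print(repr(standardize_py("Three-months' visit, 1905!")))
print(repr(standardize_py("_cuisine_")))
```

Digits and hyphens are replaced by spaces while other punctuation is deleted outright, so after cleaning only lowercase letters and spaces remain.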

In [29]:
# Model constants.
batch_size = 64
max_features = 96              # Upper bound on the vocabulary size (max_tokens)
embedding_dim = 16             # Embedding layer output dimension
sequence_length = maxlen       # Input sequence size


Here we use a TensorFlow `TextVectorization` layer to vectorize the data, rather than the character ids we built by hand in Part 1.

In [30]:
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    split=char_split, # char_split
    output_mode="int",
    output_sequence_length=sequence_length,
)

Calling `adapt` on the text-only dataset builds the vocabulary of `vectorize_layer`.

In [33]:
vectorize_layer.adapt(X_train_ds_raw.batch(batch_size))

In [15]:
print("The number of distinct characters: ", len(vectorize_layer.get_vocabulary()))

The number of distinct characters:  29


In [16]:
print("The characters are: ", vectorize_layer.get_vocabulary())

The characters are:  ['', '[UNK]', ' ', 'e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'd', 'l', 'u', 'm', 'w', 'c', 'y', 'f', 'g', 'p', 'b', 'k', 'v', 'j', 'x', 'q', 'z']


In [38]:
def vectorize_text(text):
  text = tf.expand_dims(text, -1)
  return tf.squeeze(vectorize_layer(text))

In [40]:
# Vectorize the data.
X_train_ds = X_train_ds_raw.map(vectorize_text)
y_train_ds = y_train_ds_raw.map(vectorize_text)

X_train_ds.element_spec, y_train_ds.element_spec

(TensorSpec(shape=(20,), dtype=tf.int64, name=None),
 TensorSpec(shape=(20,), dtype=tf.int64, name=None))

In [41]:
y_train_ds=y_train_ds.map(lambda x: x[0])

Let us see the vectorized text

In [43]:
for (X,y) in zip(X_train_ds.take(5), y_train_ds.take(5)):
  print(X.numpy()," --> ",y.numpy())

[ 8  7  3 26  6 11  5 22 13 18  2  9  5 15  2 20  5 13 13  6]  -->  16
[ 7  3 26  6 11  5 22 13 18  2  9  5 15  2 20  5 13 13  6 16]  -->  5
[ 3 26  6 11  5 22 13 18  2  9  5 15  2 20  5 13 13  6 16  5]  -->  18
[26  6 11  5 22 13 18  2  9  5 15  2 20  5 13 13  6 16  5 18]  -->  2
[ 6 11  5 22 13 18  2  9  5 15  2 20  5 13 13  6 16  5 18  2]  -->  9
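The integer rows above can be reproduced by hand from the vocabulary printed earlier: each token id is the position of the character in that list, with 0 reserved for padding and 1 for unknown tokens. The sketch below hard-codes the vocabulary and uses a simplified lookup (lowercasing only, no punctuation stripping); the helper name `encode_chars` is illustrative.

```python
# Vocabulary as printed by vectorize_layer.get_vocabulary() above.
vocab = ['', '[UNK]', ' ', 'e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'd',
         'l', 'u', 'm', 'w', 'c', 'y', 'f', 'g', 'p', 'b', 'k', 'v', 'j',
         'x', 'q', 'z']
token_id = {tok: i for i, tok in enumerate(vocab)}

def encode_chars(text):
    """Lowercase the text and look up each character; characters missing
    from the vocabulary map to the '[UNK]' id (1)."""
    return [token_id.get(c, 1) for c in text.lower()]

first_row = encode_chars("Inexorably Sam Gallo")
print(first_row)
print(encode_chars("w"))
```

The vocabulary is ordered by frequency, so the space and the most common letters receive the smallest ids.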


In [44]:
train_ds =  tf.data.Dataset.zip((X_train_ds,y_train_ds))

In [45]:
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.shuffle(buffer_size=512).batch(batch_size, drop_remainder=True).cache().prefetch(buffer_size=AUTOTUNE)

In [46]:
for sample in train_ds.take(1):
  print("input (X) dimension: ", sample[0].numpy().shape, "\noutput (y) dimension: ",sample[1].numpy().shape)

input (X) dimension:  (64, 20) 
output (y) dimension:  (64,)


The batch size is 64, so each input batch has shape 64 × 20 and each output batch has shape (64,), one label per sequence.
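Since the dataset is batched with `drop_remainder=True`, only full batches are kept. A quick check of how many batches one epoch contains, using the number of sequences reported earlier:

```python
n_sequences = 350140   # number of (input, next character) pairs reported above
batch_size = 64

# drop_remainder=True keeps only full batches; the leftover pairs are discarded.
full_batches, dropped = divmod(n_sequences, batch_size)
print(full_batches, dropped)
```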

### 2.3 Define model

Here we define the sampling methods used to generate text.

Why do we need sampling? After the model produces a probability distribution over the vocabulary for a given input sequence, we have to decide how to select the next token (here, a character) from that distribution. Sampling means picking the next token at random according to this conditional probability distribution, instead of always taking the most likely one.




In [47]:
def softmax(z):
   # Normalize exponentiated scores so that they sum to 1.
   return np.exp(z)/sum(np.exp(z))

def greedy_search(conditional_probability):
  # Always pick the most likely token.
  return (np.argmax(conditional_probability))

def temperature_sampling (conditional_probability, temperature=1.0):
  # Rescale the log-probabilities by the temperature, renormalize, then sample.
  conditional_probability = np.asarray(conditional_probability).astype("float64")
  conditional_probability = np.log(conditional_probability) / temperature
  reweighted_conditional_probability = softmax(conditional_probability)
  probas = np.random.multinomial(1, reweighted_conditional_probability, 1)
  return np.argmax(probas)

def top_k_sampling(conditional_probability, k):
  # Keep the k most likely tokens, pass their probabilities through softmax,
  # and sample one of them.
  top_k_probabilities, top_k_indices= tf.math.top_k(conditional_probability, k=k, sorted=True)
  top_k_probabilities= np.asarray(top_k_probabilities).astype("float32")
  top_k_probabilities= np.squeeze(top_k_probabilities)
  top_k_indices = np.asarray(top_k_indices).astype("int32")
  top_k_redistributed_probability=softmax(top_k_probabilities)
  top_k_redistributed_probability = np.asarray(top_k_redistributed_probability).astype("float32")
  sampled_token = np.random.choice(np.squeeze(top_k_indices), p=top_k_redistributed_probability)
  return sampled_token
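To see what the temperature does, the reweighting step of `temperature_sampling` can be applied to a small distribution. The sketch below uses a hypothetical four-token distribution and a helper named `reweight` (not part of the notebook) that computes softmax(log(p) / T), which is the same as raising each probability to the power 1/T and renormalizing.

```python
import numpy as np

def reweight(probs, temperature):
    """Rescale a probability vector: p_i ** (1 / T), renormalized.
    This equals softmax(log(p) / T)."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    weights = np.exp(logits - logits.max())   # subtract max for numerical stability
    return weights / weights.sum()

probs = [0.5, 0.3, 0.15, 0.05]   # hypothetical model output over 4 tokens
for t in [0.2, 0.5, 1.0, 1.2]:
    print(t, np.round(reweight(probs, t), 3))
```

A temperature below 1 sharpens the distribution towards the most likely token, so the result approaches greedy search; a temperature above 1 flattens it and makes unlikely tokens more frequent.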

In [48]:
inputs = tf.keras.Input(shape=(sequence_length,), dtype="int64")
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.2)(x)
x = layers.LSTM(512, return_sequences=True)(x)
x = layers.Flatten()(x)
predictions=  layers.Dense(max_features, activation='softmax')(x)
model_LSTM = tf.keras.Model(inputs, predictions,name="model_LSTM")

In [49]:
model_LSTM.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_LSTM.summary())

Model: "model_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20)]              0         
                                                                 
 embedding (Embedding)       (None, 20, 16)            1536      
                                                                 
 dropout (Dropout)           (None, 20, 16)            0         
                                                                 
 lstm (LSTM)                 (None, 20, 512)           1083392   
                                                                 
 flatten (Flatten)           (None, 10240)             0         
                                                                 
 dense (Dense)               (None, 96)                983136    
                                                                 
Total params: 2,068,064
Trainable params: 2,068,064
Non-

Note that we use a different loss function here, `sparse_categorical_crossentropy`. This is because the targets are integer token ids rather than one-hot vectors.
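The two losses compute the same quantity and differ only in how the target is encoded. A small numeric sketch, with a made-up prediction over three classes:

```python
import math

probs = [0.1, 0.7, 0.2]     # hypothetical softmax output over 3 classes
label = 1                   # integer class index (the "sparse" label format)
one_hot = [0.0, 1.0, 0.0]   # the same label as a one-hot vector

# Sparse form: pick the predicted probability of the true class.
sparse_ce = -math.log(probs[label])

# Dense form: sum over classes, weighted by the one-hot target.
categorical_ce = -sum(t * math.log(p) for t, p in zip(one_hot, probs))

print(sparse_ce, categorical_ce)
```

With integer targets there is no need to materialize a one-hot matrix, which is why Part 2 can feed the token ids directly.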

In [None]:
model_LSTM.fit(train_ds, epochs=20)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f7904e02810>

### 2.4 Generate the text

In [50]:
def decode_sequence (encoded_sequence):
  vocabulary = vectorize_layer.get_vocabulary()  # look up the vocabulary once
  decoded_sequence=[]
  for token in encoded_sequence:
    decoded_sequence.append(vocabulary[token])
  sequence= ''.join(decoded_sequence)
  print("\t",sequence)
  return sequence

In [None]:
def generate_text(model, seed_original, step):
    seed= vectorize_text(seed_original)
    print("The prompt is")
    decode_sequence(seed.numpy().squeeze())
    

    seed= vectorize_text(seed_original).numpy().reshape(1,-1)
    #Text Generated by Greedy Search Sampling
    generated_greedy_search = (seed)
    for i in range(step):
      predictions=model.predict(seed)
      next_index= greedy_search(predictions.squeeze())
      generated_greedy_search = np.append(generated_greedy_search, next_index)
      seed= generated_greedy_search[-sequence_length:].reshape(1,sequence_length)
    print("Text Generated by Greedy Search Sampling:")
    decode_sequence(generated_greedy_search)

    #Text Generated by Temperature Sampling
    print("Text Generated by Temperature Sampling:")
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print("\ttemperature: ", temperature)
        seed= vectorize_text(seed_original).numpy().reshape(1,-1)
        generated_temperature = (seed)
        for i in range(step):
            predictions=model.predict(seed)
            next_index = temperature_sampling(predictions.squeeze(), temperature)
            generated_temperature = np.append(generated_temperature, next_index)
            seed= generated_temperature[-sequence_length:].reshape(1,sequence_length)
        decode_sequence(generated_temperature)

    #Text Generated by Top-K Sampling
    print("Text Generated by Top-K Sampling:")
    for k in [2, 3, 4, 5]:
        print("\tTop-k: ", k)
        seed= vectorize_text(seed_original).numpy().reshape(1,-1)
        generated_top_k = (seed)
        for i in range(step):
            predictions=model.predict(seed)
            next_index = top_k_sampling(predictions.squeeze(), k)
            generated_top_k = np.append(generated_top_k, next_index)
            seed= generated_top_k[-sequence_length:].reshape(1,sequence_length)
        decode_sequence(generated_top_k)

In [None]:
generate_text(model_LSTM,"he talked to her in a ", 100)

The prompt is
	 he talked to her in 
Text Generated by Greedy Search Sampling:
	 he talked to her in soon for the have of a kinge to a maln coruing out betsime that pieces to had frong edpotiency unt
Text Generated by Temperature Sampling:
	temperature:  0.2
	 he talked to her in soon of garest sworld that to me about to a man ffom the molne the capperon a sprank hime and wi
	temperature:  0.5
	 he talked to her in sore doyty wall      of the salinote hoase worside hed noted my lace awas in his rand
	temperature:  1.0
	 he talked to her in dyor for day ecsumllr with a man from the weme a witch gleat now and that badinewimls of to
	temperature:  1.2
	 he talked to her in sheers pkpsive instlads or peh from the seen me wehe and redecen entorated of half never    i
Text Generated by Top-K Sampling:
	Top-k:  2
	 he talked to her in shewimg sowsly hape that the yabee of a hepertyrar nwert i heard of pyemiled a little sir not an
	Top-k:  3
	 he talked to her in somolar on busiellsias wile so

#### 2.4.1 Conclusion



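1.   From the results above, the model reaches a relatively small loss and a high accuracy, and it trains much faster than the model in Part 1. Compared with Part 1, we process the data with TensorFlow, use shorter input sequences (20 characters instead of 100), and feed the characters through an embedding layer instead of scaled integer ids. We also build the model with the functional API rather than `Sequential`.
2.   Unlike the first model, it produces text that resembles English words and phrases rather than a single repeated fragment. We can also use different sampling methods to generate text.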

