# Long Short-term Memory for Text Generation

## Team: Buyang Li, Yuxuan Li

This notebook trains an LSTM neural network on Nietzsche's writings and uses it to generate text.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
import random
import sys
import io


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.utils import get_file



## Dataset

### Get the data
Nietzsche's writings are available online as a plain-text dataset. The following code downloads the file and lowercases the text.

In [2]:
path = get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()

Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt


### Visualize data

In [3]:
print('corpus length:', len(text))

corpus length: 600893


In [5]:
print(text[10:513])

supposing that truth is a woman--what then? is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists, have failed to understand women--that the terrible
seriousness and clumsy importunity with which they have usually paid
their addresses to truth, have been unskilled and unseemly methods for
winning a woman? certainly she has never allowed herself to be won; and
at present every kind of dogma stands with sad and discouraged mien--if,
indeed, it stands at all!


In [6]:
chars = sorted(list(set(text)))
# total number of distinct characters
print('total chars:', len(chars))

total chars: 57


### Clean data

We cut the text into sequences of `maxlen` characters, moving the start position forward by 3 characters each time.
The features for each example form a one-hot matrix of size `maxlen` × number of characters.
The label for each example is a one-hot vector whose length is the number of characters; it represents the next character.

In [8]:
# create (character, index) and (index, character) dictionary
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [9]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 200285
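
To make the windowing concrete, here is a minimal sketch on a toy string. The helper names `make_windows` and `count_windows` are illustrative and not part of the notebook; the loop mirrors the cell above, and the count formula is simply the length of `range(0, len(text) - maxlen, step)`.

```python
def make_windows(text, maxlen, step):
    """Return (windows, next_chars) for overlapping character windows."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])   # input: maxlen characters
        targets.append(text[i + maxlen])     # label: the character that follows
    return windows, targets


def count_windows(n_chars, maxlen, step):
    """Number of windows produced by make_windows (integer ceiling division)."""
    return max(0, (n_chars - maxlen + step - 1) // step)


toy_windows, toy_targets = make_windows("hello world", maxlen=4, step=3)
for window, target in zip(toy_windows, toy_targets):
    print(repr(window), "->", repr(target))

# Same arithmetic applied to the corpus length reported above
print(count_windows(600893, 40, 3))
```

With the corpus length of 600893, `maxlen = 40`, and `step = 3`, the formula gives the 200285 sequences printed by the cell.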


In [10]:
print('Vectorization...')
# Use the builtin bool; the np.bool alias was deprecated in NumPy 1.20.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Vectorization...
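
The nested Python loop above is easy to read but slow on a corpus of this size. As an alternative, the following sketch builds the same one-hot arrays with NumPy fancy indexing. The function name `one_hot_windows` and its arguments are hypothetical, and it assumes all windows have the same length.

```python
import numpy as np


def one_hot_windows(windows, targets, char_to_index):
    """One-hot encode equal-length character windows and their next characters."""
    n, maxlen, vocab = len(windows), len(windows[0]), len(char_to_index)

    # Integer index of every character, shape (n, maxlen)
    idx = np.array([[char_to_index[c] for c in w] for w in windows])

    x = np.zeros((n, maxlen, vocab), dtype=bool)
    # Broadcast example and time-step indices against the character indices
    x[np.arange(n)[:, None], np.arange(maxlen)[None, :], idx] = True

    y = np.zeros((n, vocab), dtype=bool)
    y[np.arange(n), [char_to_index[c] for c in targets]] = True
    return x, y


toy_x, toy_y = one_hot_windows(["ab", "ba"], ["a", "b"], {"a": 0, "b": 1})
print(toy_x.shape, toy_y.shape)
```

Each window position sets exactly one entry, so `toy_x.sum()` equals the number of windows times the window length.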


## The model

### Build the model - fill in this box

In [29]:
# Four stacked LSTM layers; all but the last return the full sequence
# so that the next LSTM layer receives one vector per time step.
model = keras.Sequential()
model.add(layers.LSTM(units=128, return_sequences=True, input_shape=(x.shape[1], x.shape[2])))
model.add(layers.Dropout(0.2))

model.add(layers.LSTM(units=128, return_sequences=True))
model.add(layers.Dropout(0.25))

model.add(layers.LSTM(units=128, return_sequences=True))
model.add(layers.Dropout(0.25))

model.add(layers.LSTM(units=128))
model.add(layers.Dropout(0.25))

# Softmax over the character vocabulary
model.add(layers.Dense(y.shape[1], activation='softmax'))

optimizer = keras.optimizers.Adam(learning_rate=0.001)

# Note: mean squared error trains, but 'categorical_crossentropy' is the
# usual loss for a softmax output with one-hot targets.
model.compile(optimizer=optimizer, loss='mean_squared_error')
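
The model above pairs a softmax output with a mean squared error loss. As a rough illustration of how that choice compares with categorical cross-entropy, the sketch below evaluates both losses in plain NumPy for a single one-hot target over 57 classes (the vocabulary size reported earlier) and a uniform prediction. The helper functions are written for this example and are not the Keras implementations.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error averaged over classes."""
    return float(np.mean((y_true - y_pred) ** 2))


def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))


n_classes = 57
y_true = np.zeros(n_classes)
y_true[0] = 1.0
uniform_pred = np.full(n_classes, 1.0 / n_classes)

print("MSE for a uniform guess:          ", mse(y_true, uniform_pred))
print("Cross-entropy for a uniform guess:", categorical_crossentropy(y_true, uniform_pred))
```

For an uninformed prediction the MSE is already close to zero, while the cross-entropy equals log(57). This is one reason cross-entropy is commonly preferred for softmax classifiers; whether it improves this particular model would need to be checked by retraining.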


### Inspect the model

Use the `.summary()` method to print a simple description of the model.

In [30]:
model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_21 (LSTM)              (None, 40, 128)           95232     
                                                                 
 dropout_21 (Dropout)        (None, 40, 128)           0         
                                                                 
 lstm_22 (LSTM)              (None, 40, 128)           131584    
                                                                 
 dropout_22 (Dropout)        (None, 40, 128)           0         
                                                                 
 lstm_23 (LSTM)              (None, 40, 128)           131584    
                                                                 
 dropout_23 (Dropout)        (None, 40, 128)           0         
                                                                 
 lstm_24 (LSTM)              (None, 128)              
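
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias vector); `lstm_param_count` is a helper written for this check, not a Keras function.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of a standard LSTM layer with biases."""
    # Per gate: input weights (input_dim x units), recurrent weights
    # (units x units), and one bias per unit. There are four gates.
    return 4 * (units * (input_dim + units) + units)


# First layer: input is a one-hot vector over 57 characters
print(lstm_param_count(57, 128))

# Stacked layers: input is the 128-dimensional output of the previous LSTM
print(lstm_param_count(128, 128))
```

These values match the 95232 and 131584 shown for the first and subsequent LSTM layers in the summary.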

### Train the model

In [31]:
def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    # Temperatures below 1 sharpen the distribution; above 1 flatten it.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)  # renormalize to sum to 1
    probas = np.random.multinomial(1, preds, 1)  # one draw, returned as a one-hot row
    return np.argmax(probas)
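
To see what the temperature does before any sampling happens, the following sketch applies the same rescaling as `sample` to a small, made-up distribution. The function `apply_temperature` is hypothetical; it subtracts the maximum log-probability before exponentiating, which is a numerical-stability addition that leaves the result unchanged.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector by a temperature and renormalize."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits - np.max(logits))  # shift for numerical stability
    return exp_logits / np.sum(exp_logits)


probs = [0.6, 0.3, 0.1]  # illustrative values, not model output
for temperature in (0.5, 1.0, 2.0):
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

A temperature of 1.0 returns the original distribution, lower values concentrate probability on the most likely character, and higher values spread it out. This matches the `diversity` values used during generation.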

In [32]:
class PrintLoss(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Invoked at the end of each epoch; prints text generated by the current model.
        print()
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(text) - maxlen - 1)
        for diversity in [0.5, 1.0]:
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                x_pred = np.zeros((1, maxlen, len(chars)))
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_indices[char]] = 1.

                preds = model.predict(x_pred, verbose=0)[0]
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]

                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()
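
The core of the callback is a sliding-window loop: encode the current window, predict a distribution over the next character, sample from it, then drop the first character and append the new one. The sketch below isolates that loop so it can run without a trained model. `generate` and the uniform `predict_fn` stand-in are assumptions made for illustration; in the notebook the prediction comes from `model.predict`.

```python
import numpy as np


def generate(predict_fn, seed, n_chars, char_to_index, index_to_char, rng):
    """Generate n_chars characters by repeatedly predicting the next one."""
    maxlen, vocab = len(seed), len(char_to_index)
    window, generated = seed, []
    for _ in range(n_chars):
        # One-hot encode the current window, shape (1, maxlen, vocab)
        x_pred = np.zeros((1, maxlen, vocab))
        for t, ch in enumerate(window):
            x_pred[0, t, char_to_index[ch]] = 1.0

        probs = predict_fn(x_pred)                    # distribution over the vocabulary
        next_index = int(rng.choice(vocab, p=probs))  # sample the next character
        next_char = index_to_char[next_index]

        generated.append(next_char)
        window = window[1:] + next_char               # slide the window by one
    return "".join(generated)


alphabet = "ab "
c2i = {c: i for i, c in enumerate(alphabet)}
i2c = {i: c for c, i in c2i.items()}


def uniform_predict(x_pred):
    """Stand-in for a trained model: every character is equally likely."""
    return np.full(len(alphabet), 1.0 / len(alphabet))


sample_text = generate(uniform_predict, "ab a", 10, c2i, i2c, np.random.default_rng(0))
print(repr(sample_text))
```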

### Fit the model - fill in this box

In [36]:
desired_callbacks = PrintLoss()

In [37]:
model.fit(x, y, epochs=5, batch_size=256, callbacks=[desired_callbacks])

Epoch 1/5
----- Generating text after Epoch: 0
----- diversity: 0.5
----- Generating with seed: "lthough the science does not dominate,
b"
lthough the science does not dominate,
be etio fo e reae tienee esmnna oi mopeee  e niemeto ien e heo rttiv arpta tae e sn oa eee ceaes no eltsn eaoe tene  sttt oet  oele  oa ue tesn rrse to he raee ahd tnn maon ese  t]i woih heri es tare ecredtu reei rienot rrot elhid er eeeddh t hitih rn ar a tiat eh no tte h troeso leh  oeae eei trme rit teist eiie aseee rlef atne eed eroe oul oerr faoe sees  alote  aeas ,s itis tos eee ntei ae rrrae
----- diversity: 1.0
----- Generating with seed: "lthough the science does not dominate,
b"
lthough the science does not dominate,
bndiw7e jseohit o;e s,aeoodrl mu(odd dem [lurw,rnfmte wridsesepnorsee hti ryrstgulnt n s:hhf atnil eniao uasetll
laa ctevladle thhe hbsnrr
lcimle  nzoye (ohtabrnega tgenrl 
 ewtaee ot m= -lennuwrnoe ceodew
h'u eeemem e,, [wnle smio tthssa  napaomesusdt owtbtt trethtn taprrcerprr seme !lhd

<keras.callbacks.History at 0x1b70106f2e0>

In [1]:
def hammingDistance(x, y):
    # Bits that differ between x and y are set in x ^ y
    xor = x ^ y
    distance = 0
    while xor:
        # count the lowest bit if it is set, then shift it out
        if xor & 1:
            distance += 1
        xor = xor >> 1
    return distance
        

In [2]:
1 ^ 3

2

In [10]:
1 & 3

1

In [9]:
bin(4^1).count('1')

2
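
Both approaches above count the set bits of `x ^ y`. A third common variant, sketched here under the assumption of non-negative integers, clears the lowest set bit on each iteration, so the loop runs once per differing bit rather than once per bit position. The function name is illustrative.

```python
def hamming_distance_kernighan(x, y):
    """Hamming distance between two non-negative integers."""
    xor = x ^ y
    distance = 0
    while xor:
        xor &= xor - 1  # clear the lowest set bit
        distance += 1
    return distance


# Cross-check against the string-based count used above
for a, b in [(1, 3), (4, 1), (0, 0), (255, 0)]:
    assert hamming_distance_kernighan(a, b) == bin(a ^ b).count("1")

print(hamming_distance_kernighan(4, 1))
```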

In [20]:
1 << 

65536