
In [None]:
# Import libraries
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

import numpy as np
import os
import time
#%load_ext tensorboard
#https://colab.research.google.com/github/tensorflow/tensorboard/blob/master/docs/tensorboard_in_notebooks.ipynb#scrollTo=Po7rTfQswAMT

In [None]:
#%tensorboard --logdir logs


[Full Module links](https://byui-cse.github.io/cse450-course/module-06/)

[Project Overview](https://byui-cse.github.io/cse450-course/module-06/project.html)



---




[Basis](https://www.tensorflow.org/text/tutorials/text_generation)


---


[Extra1](https://www.tensorflow.org/text/tutorials/text_generation#advanced_customized_training)

[Extra2.1](https://www.tensorflow.org/text/guide/word_embeddings)
[Extra2.2](https://www.tensorflow.org/tutorials/text/word2vec)

[Extra3.1](https://www.tensorflow.org/text/tutorials/nmt_with_attention)
[Extra3.2](https://github.com/tensorflow/nmt#background-on-the-attention-mechanism)

## I. Parse Text Sources
First we'll load our text sources and create our vocabulary lists and encoders. 

There are ways we could do this in pure Python, but using TensorFlow's data structures and libraries lets us keep the pipeline well optimized.

In [None]:
import pandas as pd
sources = pd.read_csv('https://raw.githubusercontent.com/TBrost/CSE450_ML/main/sources.csv')
source_list = sources['url']
sourceN_list = sources['title']

In [None]:
print(sources)

                                                 url  \
0  https://raw.githubusercontent.com/byui-cse/cse...   
1  https://raw.githubusercontent.com/TBrost/CSE45...   
2  https://raw.githubusercontent.com/TBrost/CSE45...   
3  https://raw.githubusercontent.com/TBrost/CSE45...   
4  https://raw.githubusercontent.com/TBrost/CSE45...   
5  https://raw.githubusercontent.com/TBrost/CSE45...   

                            title      id               author  \
0                            Emma    emma          Jane Austen   
1  Adventures of Huckleberry Finn    huck           Mark Twain   
2                The Great Gatsby  gatsby  F. Scott Fitzgerald   
3       The Count of Monte Cristo   monte      Alexandre Dumas   
4                     The Odessey     ode                Homer   
5                       Moby Dick    moby              Dickens   

              file  
0       austen.txt  
1     HuckFinn.txt  
2       Gatsby.txt  
3  MonteCristo.txt  
4      Odessey.txt  
5         Moby.txt

In [None]:
text=""
print('Please enter the file id of the desired book to sample from.')
file = str(input('File id: '))
source=sources.loc[sources['id'] == file] #emma, huck, gatsby, monte, ode, moby
source=source.reset_index()
fname=str(source['file'][0])
origin=str(source['url'][0])
path_to_file = tf.keras.utils.get_file(fname=fname, origin=origin)
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

Please enter the file id of the desired book to sample from.
File id: huck
Downloading data from https://raw.githubusercontent.com/TBrost/CSE450_ML/main/txt_files/HuckFinn.txt


In [None]:
# Load file data
#path_to_file = tf.keras.utils.get_file('austen.txt', 'https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/austen/austen.txt')
#text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
#print('Length of text: {} characters'.format(len(text)))

In [None]:
# Verify the first part of our data
print(text[:200])

﻿
ADVENTURES
OF
HUCKLEBERRY FINN

(Tom Sawyer’s Comrade)

By Mark Twain



HUCKLEBERRY FINN

Scene: The Mississippi Valley Time: Forty to fifty years ago




CHAPTER I.


You don’t know about me witho


In [None]:
# Now we'll get a list of the unique characters in the file. This will form the
# vocabulary of our network. There may be some characters we want to remove from this 
# set as we refine the network.
vocab = sorted(set(text))
print('{} unique characters'.format(len(vocab)))
print(vocab)

76 unique characters
['\n', ' ', '!', '$', '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '—', '‘', '’', '“', '”', '\ufeff']


In [None]:
# Next, we'll encode these characters into numbers so we can use them
# with our neural network, then we'll create some mappings between the characters
# and their numeric representations
ids_from_chars = preprocessing.StringLookup(vocabulary=list(vocab))
chars_from_ids = preprocessing.StringLookup(vocabulary=ids_from_chars.get_vocabulary(), invert=True)

# Here's a little helper function that we can use to turn a sequence of ids
# back into a string:
def text_from_ids(ids):
  joinedTensor = tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
  return joinedTensor.numpy().decode("utf-8")

In [None]:
# Now we'll verify that they work, by getting the codes for the characters in
# "Truth", and then looking them up in reverse
testids = ids_from_chars(["T", "r", "u", "t", "h"])
testids

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([38, 62, 65, 64, 52])>

In [None]:
chars_from_ids(testids)

<tf.Tensor: shape=(5,), dtype=string, numpy=array([b'T', b'r', b'u', b't', b'h'], dtype=object)>

In [None]:
testString = text_from_ids( testids )
testString

'Truth'
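
To make the mapping concrete, here is a minimal pure-Python sketch of what the two lookup layers do. It assumes the layer's default behaviour of reserving index 0 for the out-of-vocabulary token `[UNK]`; the helper names `to_ids` and `to_chars` are illustrative and not part of TensorFlow.

```python
# Pure-Python sketch of a character <-> id lookup (illustrative, not the Keras layer).
text_sample = "Truth"
vocab_sample = sorted(set(text_sample))

# Assumption: index 0 is reserved for unknown characters, so real characters start at 1.
lookup_vocab = ["[UNK]"] + vocab_sample
char_to_id = {ch: i for i, ch in enumerate(lookup_vocab)}

def to_ids(chars):
    """Map characters to ids; anything unseen falls back to 0 ([UNK])."""
    return [char_to_id.get(ch, 0) for ch in chars]

def to_chars(ids):
    """Map ids back to characters."""
    return [lookup_vocab[i] for i in ids]

ids = to_ids("Truth")
print(ids)
print("".join(to_chars(ids)))
print(to_ids("Z"))  # a character that is not in the vocabulary
```

The extra `[UNK]` slot is why the model reports a vocabulary size of 77 for a text with 76 unique characters.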

## II. Construct our training data
Next we need to construct our training data by building sentence chunks. Each chunk will consist of a sequence of characters and a corresponding "next sequence" of the same length showing what would happen if we move forward in the text. This "next sequence" becomes our target variable.

For example, if this were our text:

> It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

And our sequence length was 10 with a step size of 1, our first chunk would be:

* Sequence: `It is a tr`
* Next Sequence: `t is a tru`

Our second chunk would be:

* Sequence: `t is a tru`
* Next Sequence: ` is a trut`
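
The two chunks above can be reproduced with a few lines of plain Python. This is only a sketch of the idea; `make_pair` is a hypothetical helper, not something the notebook defines.

```python
# Sketch: build (sequence, next sequence) pairs from a text.
text_sample = "It is a truth universally acknowledged"
seq_length = 10

def make_pair(text, start, seq_length):
    """Return the input chunk and the same chunk shifted one character forward."""
    chunk = text[start:start + seq_length + 1]  # one extra character for the target
    return chunk[:-1], chunk[1:]

first = make_pair(text_sample, 0, seq_length)
second = make_pair(text_sample, 1, seq_length)  # step size of 1
print(first)
print(second)
```

Note that the cells that follow cut the text into consecutive, non-overlapping chunks of `seq_length + 1` characters rather than sliding forward one character at a time. The input/target shift inside each chunk is the same.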



In [None]:
# First, create a stream of encoded integers from our text
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(562874,), dtype=int64, numpy=array([76,  1, 19, ...,  1,  1,  1])>

In [None]:
# Now, convert that into a tensorflow dataset
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [None]:
# Finally, let's batch these sequences up into chunks for our training
seq_length = 150
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

# This function will generate our sequence pairs:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

# Call the function for every sequence in our list to create a new dataset
# of input->target pairs
dataset = sequences.map(split_input_target)

In [None]:
# Verify our sequences
for input_example, target_example in  dataset.take(1):
    print("Input: ", text_from_ids(input_example))
    print("--------")
    print("Target: ", text_from_ids(target_example))

Input:  ﻿
ADVENTURES
OF
HUCKLEBERRY FINN

(Tom Sawyer’s Comrade)

By Mark Twain



HUCKLEBERRY FINN

Scene: The Mississippi Valley Time: Forty to fifty years 
--------
Target:  
ADVENTURES
OF
HUCKLEBERRY FINN

(Tom Sawyer’s Comrade)

By Mark Twain



HUCKLEBERRY FINN

Scene: The Mississippi Valley Time: Forty to fifty years a


In [None]:
# Finally, we'll randomize the sequences so that we don't just memorize the books
# in the order they were written, then build a new streaming dataset from that.
# Using a streaming dataset allows us to pass the data to our network bit by bit,
# rather than keeping it all in memory. We'll set it to figure out how much data
# to prefetch in the background.

BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

dataset

<PrefetchDataset element_spec=(TensorSpec(shape=(64, 150), dtype=tf.int64, name=None), TensorSpec(shape=(64, 150), dtype=tf.int64, name=None))>
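
As a quick sanity check, the number of batches per epoch follows from the shapes printed above. The sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Sketch: how many training batches one epoch contains, from the values printed above.
total_ids = 562874   # length of all_ids for this text
seq_length = 150
batch_size = 64

# Each sequence consumes seq_length + 1 ids; leftover ids are dropped.
num_sequences = total_ids // (seq_length + 1)
# Sequences are grouped into full batches; the last partial batch is dropped.
batches_per_epoch = num_sequences // batch_size

print(num_sequences, batches_per_epoch)
```

Since the number of sequences is smaller than `BUFFER_SIZE`, the shuffle buffer can hold the whole dataset for this text.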

## III. Build the model

Next, we'll build our model. Up until this point, you've been using the Keras symbolic API for creating your models, doing something like:

    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(80, activation='relu'))
    etc...

However, TensorFlow has another way to build models, the imperative (model subclassing) API, which gives us a lot more control over what happens inside the model. You can read more about [the differences and when to use each here](https://blog.tensorflow.org/2019/01/what-are-symbolic-and-imperative-apis.html).

We'll use the subclassing API for our RNN in this example. This involves defining our model as a custom subclass of `tf.keras.Model`.

If you're not familiar with classes in Python, you might want to review [this quick tutorial](https://www.w3schools.com/python/python_classes.asp), as well as [this one on class inheritance](https://www.w3schools.com/python/python_inheritance.asp).

Using a subclassed model is important for our situation because we're not just training it to predict a single character for a single sequence: as we make predictions with it, we need it to remember those predictions and use that memory as it makes new predictions.


In [None]:
# Create our custom model. Given a sequence of characters, this
# model's job is to predict what character should come next.
class TextModel(tf.keras.Model):

  # This is our class constructor method, it will be executed when
  # we first create an instance of the class 
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()

    # Our model will have three layers:
    
    # 1. An embedding layer that handles the encoding of our vocabulary into
    #    a vector of values suitable for a neural network
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    # 2. A GRU layer that handles the "memory" aspects of our RNN. If you're
    #    wondering why we use GRU instead of LSTM, and whether LSTM is better,
    #    take a look at this article: https://datascience.stackexchange.com/questions/14581/when-to-use-gru-over-lstm
    #    then consider trying out LSTM instead (or in addition to!)
    self.gru = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)

    # 3. Our output layer that will give us a set of probabilities for each
    #    character in our vocabulary.
    self.dense = tf.keras.layers.Dense(vocab_size)

  # This function will be executed on every forward pass, i.e. for each batch
  # during training and for each prediction. Here we will manually feed
  # information from one layer of our network to the next.
  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs

    # 1. Feed the inputs into the embedding layer, and tell it if we are
    #    training or predicting
    x = self.embedding(x, training=training)

    # 2. If we don't have any state in memory yet, get the initial (all-zero)
    #    state from our GRU layer.
    if states is None:
      states = self.gru.get_initial_state(x)
    
    # 3. Now, feed the vectorized input along with the current state of memory
    #    into the gru layer.
    x, states = self.gru(x, initial_state=states, training=training)

    # 4. Finally, pass the results on to the dense layer
    x = self.dense(x, training=training)

    # 5. Return the results
    if return_state:
      return x, states
    else: 
      return x

In [None]:
# Create an instance of our model
vocab_size=len(ids_from_chars.get_vocabulary())
embedding_dim = 256
rnn_units = 1024

model_huck = TextModel(vocab_size, embedding_dim, rnn_units)
model_austen = TextModel(vocab_size, embedding_dim, rnn_units)
#model_gatsby = TextModel(vocab_size, embedding_dim, rnn_units)

In [None]:
# Verify the output of our model is correct by running one sample through.
# This will also build the model (create its weights) for us. This step will take a bit.
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model_huck(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")


(64, 150, 77) # (batch_size, sequence_length, vocab_size)


In [None]:
# Now let's view the model summary
model_huck.summary()

Model: "text_model_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_15 (Embedding)    multiple                  19712     
                                                                 
 gru_15 (GRU)                multiple                  3938304   
                                                                 
 dense_15 (Dense)            multiple                  78925     
                                                                 
Total params: 4,036,941
Trainable params: 4,036,941
Non-trainable params: 0
_________________________________________________________________
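
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses the Keras default `reset_after=True`, which gives each of its three gates two bias vectors; the variable names are illustrative.

```python
# Sketch: recompute the parameter counts shown in the model summary.
vocab_size, embedding_dim, rnn_units = 77, 256, 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with input weights, recurrent weights and
# (assuming reset_after=True) two bias vectors.
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense: a weight matrix plus one bias per output unit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```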


## IV. Train the model

For our purposes, we'll be using [categorical cross entropy](https://machinelearningmastery.com/cross-entropy-for-machine-learning/) as our loss function*. Also, our model will be outputting ["logits" rather than normalized probabilities](https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow), because we'll be doing further transformations on the output later. 


\* Note that since our model deals with integer encoding rather than one-hot encoding, we'll specifically be using [sparse categorical cross entropy](https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other).
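
To see what "sparse" and "from logits" mean in practice, here is a small pure-Python sketch. It is not the TensorFlow implementation; the function names are made up for illustration. It shows that an integer target and the equivalent one-hot target give the same loss.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_ce_from_logits(logits, target_id):
    """Cross entropy when the target is an integer class id."""
    return -math.log(softmax(logits)[target_id])

def ce_from_logits(logits, one_hot):
    """Cross entropy when the target is a one-hot vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0]
sparse_loss = sparse_ce_from_logits(logits, 0)
dense_loss = ce_from_logits(logits, [1, 0, 0])
print(sparse_loss, dense_loss)
```

The loss is small when the target is the class with the largest logit and grows as the model assigns that class less probability.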

In [None]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
model_huck.compile(optimizer='Nadam', loss=loss)

history = model_huck.fit(dataset, epochs=90)

Epoch 1/90
Epoch 2/90
...
Epoch 84/90

## V. Use the model

Now that our model has been trained, we can use it to generate text. As mentioned earlier, to do so we have to keep track of its internal state, or memory, so that we can use previous text predictions to inform later ones.

However, with RNN-generated text, if we always just pick the character with the highest probability, our model tends to get stuck in a loop. So instead we will create a probability distribution of characters for each step, and then sample from that distribution. We can add some variation to this using a parameter known as ["temperature"](https://cs.stackexchange.com/questions/79241/what-is-temperature-in-lstm-and-neural-networks-generally). Lower temperatures make the sampling more conservative, while higher temperatures make it more varied.
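
The effect of temperature can be sketched in plain Python. This is an illustration under simplified assumptions (three made-up logits and the standard library's random sampler), not the sampling code used by the notebook.

```python
import math
import random

def softmax(logits):
    """Turn raw logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_with_temperature(logits, temperature, rng):
    """Divide the logits by the temperature, then sample an index from the result."""
    probs = softmax([x / temperature for x in logits])
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                      # hypothetical scores for three characters
cool = softmax([x / 0.5 for x in logits])     # temperature 0.5: sharper distribution
neutral = softmax(logits)                     # temperature 1.0: unchanged
warm = softmax([x / 2.0 for x in logits])     # temperature 2.0: flatter distribution

rng = random.Random(0)
sample = sample_with_temperature(logits, 0.5, rng)
print(cool, neutral, warm, sample)
```

The most likely character gets a larger share of the probability at low temperature and a smaller share at high temperature, which is what makes low-temperature text more repetitive and high-temperature text more erratic.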

In [None]:
# Here's the code we'll use to sample for us. It has some extra steps to apply
# the temperature to the distribution, and to make sure we don't get empty
# characters in our text. Most importantly, it will keep track of our model
# state for us.

class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=0.5):
    super().__init__()
    self.temperature=temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "" or "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['','[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices = skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())]) 
    self.prediction_mask = tf.sparse.to_dense(sparse_mask,validate_indices=False)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits] 
    predicted_logits, states =  self.model(inputs=input_ids, states=states, 
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    
    # Apply the prediction mask: prevent "" or "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Return the characters and model state.
    return self.chars_from_ids(predicted_ids), states
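
The prediction mask works because adding negative infinity to a logit drives that token's probability to zero. Here is a minimal pure-Python sketch of the idea; it assumes index 0 stands in for `[UNK]` and uses made-up logits.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.5, 0.3, 2.0, -0.7]         # hypothetical logits; index 0 plays the role of "[UNK]"
mask = [-math.inf, 0.0, 0.0, 0.0]      # -inf at each index that must never be sampled

masked_logits = [l + m for l, m in zip(logits, mask)]
probs = softmax(masked_logits)
print(probs)
```

The masked index ends up with probability zero, and the remaining probabilities still sum to one, so sampling can never return the masked token.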


In [None]:
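# Create an instance of the character generator.
# Note: model_gatsby is commented out where the models are created, so the
# 'gatsby' option only works if that line is restored and the model is trained.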
model = str(input('Model to use (huck, gatsby, austen): '))
if model == 'huck':
  one_step_model = OneStep(model_huck, chars_from_ids, ids_from_chars)
elif model == 'gatsby':
  one_step_model = OneStep(model_gatsby, chars_from_ids, ids_from_chars)
else:
  one_step_model = OneStep(model_austen, chars_from_ids, ids_from_chars)


# Now, let's generate 1500 characters of text by giving our model an opening
# sentence as its starting text
states = None
next_char = tf.constant(['The world seemed like such a peaceful place until the magic tree was discovered in London.'])
result = [next_char]

for n in range(1500):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)

# Print the results formatted.
result=(result[0].numpy().decode('utf-8'))
result =result.replace("igge","****")
print(result)


Model to use (huck, gatsby, austen): huck
The world seemed like such a peaceful place until the magic tree was discovered in London.
There was a stiff cares; he said we
must slid in our tracks and started for the raft in the same time.

I didn’t see no disturbance, and then they’d jaived in the front door, and set there and watched him right and talked, but the
doctor says:

“I don’t want to blow on nobody; and I’ll do
it. Tom and me was to play this n****r’s hours and
tagged and goes to decide, and keep them to the brue and easy, and if anybody ever come more than a
passel of fault with me for de watchman on dat
you’re going to see you comes.”

“So you live difforeder.”

“Well, then, what do you _be_, I s’pose we had ever live’. I doan’ might find no more about it. He’d
_got_ to do it. They’ve got across; and there’s a reward but a
shake son of things on it. One man said he reckoned
a body could a got the
boy, and was going to stir, and gave us a cussing,
and said four some stripsing 

## VI. Next Steps

This is a very simple model: an embedding layer, one GRU layer, and an output layer. However, considering how simple it is and the fact that we are predicting outputs character by character, the text it produces is pretty amazing, though it still has a long way to go before publication.

There are many other RNN architectures you could try, such as adding additional hidden dense layers, replacing GRU with one or more LSTM layers, or combining GRU and LSTM.

You could also experiment with better text cleanup to make sure odd punctuation doesn't appear, or find longer texts to use. If you combine texts from two authors, what happens? Can you generate a Jane Austen stage play by combining Austen and Shakespeare texts?

Finally, there are a number of hyperparameters to tweak, such as temperature, epochs, batch size, and sequence length.