<a href="https://colab.research.google.com/github/pcormac/20Time/blob/master/20Time.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is the very beginning of the code. To analyze our data, we will use TensorFlow and Keras, machine learning APIs that offer a high level of abstraction. A high level of abstraction means that all you need to know is that data goes into a "black box" and a trained model comes out, ready to use. You don't need to understand exactly how it works, just that it does.

To run this yourself, first make a copy of this document. Then go through the code one block at a time, clicking the play button that appears on the left-hand side when you hover over a block. Follow the prompts in the instructions that accompany each block of code.

In [0]:
import tensorflow as tf
tf.enable_eager_execution()
from google.colab import files
import os
import time

import numpy as np
print(tf.__version__)

1.12.0


The next cell contains variables you can tweak to change the model.

In [0]:
# Number of times the model reads through the entirety of your data
# Keeping this number low is recommended for time's sake
EPOCHS=3
# Number of characters to generate when making predictions at the end
# This is how long you want your output text to be
num_generate = 1000
# You can change the start string to experiment; make it relevant to your data.
# Every character in it must appear in the training text (uploaded files are lowercased).
start_string_var = 'ROMEO'
# Lower temperatures result in more predictable text.
# Higher temperatures result in more surprising text.
# Experiment to find the best setting.
temperature = 1.0
# The embedding dimension 
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
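
To get a feel for what `temperature` does before training anything, here is a minimal sketch in plain Python. It assumes the model's raw scores are divided by the temperature and then turned into probabilities with a softmax, which matches how the generation cell later uses the setting. The function name and the three example scores are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the largest value before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate characters.
example_logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature the highest-scoring character takes most of the probability, so the output is more predictable. At a high temperature the probabilities move closer together, so less likely characters get picked more often.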


This next bit of code will ask you to upload a .txt file that will be used to train the model.

---



**Please only upload one file!**



In [0]:
text = ""
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  data_path = fn
  with open(data_path, 'r') as file:
    text = file.read().lower()
    
# Stats of characters in the file
print('{} characters'.format(len(text)))
vocab = sorted(set(text))
print('{} unique characters'.format(len(vocab)))

In [0]:
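# Temporary text for testing: this replaces any uploaded text with the Shakespeare sample.
# Skip this cell if you want to train on your own file.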
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(path_to_file).read()
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

In [0]:
# Create a mapping from unique characters to indices, and from indices back to characters
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')
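
The mapping above can be tried on a tiny string to see the round trip from characters to numbers and back. This is only a sketch in plain Python: the `sample_` names and the example text are hypothetical, and a list stands in for the NumPy array used in the notebook.

```python
# Hypothetical stand-in for the text of an uploaded file.
sample_text = "hello world"

# Same recipe as the cell above: sorted unique characters form the vocabulary.
sample_vocab = sorted(set(sample_text))
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = list(sample_vocab)

# Encode the text as integers, then decode it again.
encoded = [sample_char2idx[ch] for ch in sample_text]
decoded = "".join(sample_idx2char[i] for i in encoded)

print(sample_vocab)
print(encoded)
print(decoded)
```

Because every character gets exactly one index, decoding the encoded list gives back the original text.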

In [0]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
# Each training example is a chunk of seq_length + 1 characters
examples_per_epoch = len(text) // (seq_length + 1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# Batch size
BATCH_SIZE = 64
# Number of batches needed to go through every example once
steps_per_epoch = examples_per_epoch // BATCH_SIZE

# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset
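
To see what the cell above does to each chunk, here is a small sketch in plain Python. It applies the same input/target split to a five-letter string, then works out how many batches make up one pass over the data. The helper `count_steps_per_epoch` and the one-million-character figure are assumptions for illustration, not part of the notebook.

```python
def split_input_target(chunk):
    """The target is the input shifted one character to the left."""
    return chunk[:-1], chunk[1:]

demo_input, demo_target = split_input_target("Hello")
print(repr(demo_input), "->", repr(demo_target))

def count_steps_per_epoch(num_chars, seq_length, batch_size):
    """Hypothetical helper: batches needed to see every full chunk once."""
    # Each chunk holds seq_length input characters plus one extra for the target.
    num_sequences = num_chars // (seq_length + 1)
    return num_sequences // batch_size

# Hypothetical file of one million characters with the notebook's settings.
print(count_steps_per_epoch(1_000_000, 100, 64))
```

For every position in the input, the model is asked to predict the character at the same position in the target, which is simply the next character in the text.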

Reference: https://www.tensorflow.org/tutorials/sequences/text_generation

In [0]:
# Length of the vocabulary in characters
vocab_size = len(vocab)

if tf.test.is_gpu_available():
  rnn = tf.keras.layers.CuDNNGRU
else:
  import functools
  rnn = functools.partial(
    tf.keras.layers.GRU, recurrent_activation='sigmoid')

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, 
                              batch_input_shape=[batch_size, None]),
    rnn(rnn_units,
        return_sequences=True, 
        recurrent_initializer='glorot_uniform',
        stateful=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(
  vocab_size = len(vocab), 
  embedding_dim=embedding_dim, 
  rnn_units=rnn_units, 
  batch_size=BATCH_SIZE)

for input_example_batch, target_example_batch in dataset.take(1): 
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

model.summary()
example_batch_loss  = tf.losses.sparse_softmax_cross_entropy(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)") 
print("scalar_loss:      ", example_batch_loss.numpy())
model.compile(
    optimizer = tf.train.AdamOptimizer(),
    loss = tf.losses.sparse_softmax_cross_entropy)
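
The `scalar_loss` printed above can be sanity-checked by hand. The sketch below computes the cross-entropy for a single prediction in plain Python, assuming the loss is the negative log of the softmax probability given to the correct character. The function here is a hand-written illustration, not the TensorFlow implementation, and the vocabulary size of 65 is a hypothetical example.

```python
import math

def sparse_softmax_cross_entropy(logits, target_index):
    """Negative log-probability of the target under a softmax over the logits."""
    # Log-sum-exp, shifted by the largest logit for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]

# An untrained model that scores every character equally (hypothetical vocabulary of 65).
uniform_logits = [0.0] * 65
print(sparse_softmax_cross_entropy(uniform_logits, 3))
print(math.log(65))

# A model that strongly favours the correct character has a much lower loss.
confident_logits = [0.0] * 65
confident_logits[3] = 10.0
print(sparse_softmax_cross_entropy(confident_logits, 3))
```

If the loss of a freshly built model is roughly the natural log of the vocabulary size, the model is guessing about evenly across characters, which is what you would expect before training.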

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
history = model.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=steps_per_epoch, callbacks=[checkpoint_callback])

tf.train.latest_checkpoint(checkpoint_dir)
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))
model.summary()

In [0]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)
  # This step uses num_generate and temperature, which are defined in cell number 2.
  # Change them there and run that cell again to change the generated text.
  
  # Converting our start string to numbers (vectorizing) 
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # Sample the next character from a multinomial distribution over the model's output.
      # Dividing by the temperature first controls how random the sample is.
      predictions = predictions / temperature
      predicted_id = tf.multinomial(predictions, num_samples=1)[-1,0].numpy()
      
      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)
      
      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
final_predictions = generate_text(model, start_string=start_string_var)
print(final_predictions)
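
The loop inside `generate_text` follows a simple pattern: get scores for the next character, sample one, then feed it back in as the next input. The sketch below shows that pattern in plain Python with a made-up `toy_model` in place of the trained network. The toy model, its four-letter vocabulary, and the `sample_next` helper are all assumptions for illustration.

```python
import math
import random

def sample_next(logits, temperature, rng):
    """Draw one index with probability proportional to exp(logit / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def toy_model(last_index, vocab_size):
    """Hypothetical stand-in for the trained model: favours the next index in the vocabulary."""
    logits = [0.0] * vocab_size
    logits[(last_index + 1) % vocab_size] = 3.0
    return logits

toy_vocab = ["a", "b", "c", "d"]
rng = random.Random(0)  # fixed seed so repeated runs give the same text

current = 0  # start from the character "a"
generated = []
for _ in range(10):
    logits = toy_model(current, len(toy_vocab))
    current = sample_next(logits, temperature=1.0, rng=rng)
    generated.append(toy_vocab[current])

print("a" + "".join(generated))
```

The real model also carries a hidden state from one step to the next, which is why the notebook calls `model.reset_states()` before the loop starts.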

The next step will download the output file to your computer.

**CHANGE TO OUTPUT VARIABLE NOT SPECIFIC STRING**

In [0]:
with open('output.txt', 'w') as f:
  f.write(final_predictions)
files.download('output.txt')

Portions of this page are reproduced and/or modified from work created and [shared by Google](https://developers.google.com/readme/policies/) and used according to terms described in the [Creative Commons 3.0 Attribution License](http://creativecommons.org/licenses/by/3.0/).