<a href="https://colab.research.google.com/github/pcormac/20Time/blob/master/20Time.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This is the very beginning of the code. To analyze our data, we will use TensorFlow and Keras, machine learning APIs that offer a high level of abstraction. A high level of abstraction means you can treat the library as a "black box": data goes in, and a trained model that you can use comes out. You don't need to understand exactly how it works, just that it does.**

**To run this yourself, first make a copy of this document. Then go through the code one block at a time, clicking the play button that appears on the left-hand side when you hover over each block. Follow the instructions that accompany each block of code.**

In [0]:
import tensorflow as tf
tf.enable_eager_execution()
from google.colab import files
import os
import time

import numpy as np
print(tf.__version__)

1.12.0


**The next cell is a lot of variables you can tweak to change the model.**

In [0]:
# Number of times the model reads through the entirety of your data
# Keep this number low for time's sake:
# if you are running it remotely (the default, and what I recommend),
# it takes about 1.5 hours per epoch
EPOCHS=2
# The embedding dimension 
# Haven't experimented much with this
embedding_dim = 256
# Number of RNN units
# Haven't experimented much with this
rnn_units = 1024


**This next bit of code will ask you to upload a .txt file that will be used to train the model.**

---


**Please only upload one file!**


In [0]:
text = ""
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  data_path = fn
  with open(data_path, 'r') as file:
    text = file.read().lower()

# Stats of characters in the file
print('{} characters'.format(len(text)))
vocab = sorted(set(text))
print('{} unique characters'.format(len(vocab)))

Saving OfficeS5E13.txt to OfficeS5E13.txt
User uploaded file "OfficeS5E13.txt" with length 32281 bytes
32281 characters
47 unique characters


{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '"' :   3,
  '&' :   4,
  "'" :   5,
  ',' :   6,
  '-' :   7,
  '.' :   8,
  '0' :   9,
  '1' :  10,
  '2' :  11,
  '4' :  12,
  '5' :  13,
  '6' :  14,
  '8' :  15,
  '9' :  16,
  ':' :  17,
  '?' :  18,
  '[' :  19,
  ...
}
dwight: last week i gave a fire safety talk. [clears throat] and nobody paid any attention. it's my own fault for using powerpoint. powerpoint is boring. people learn in a lot of different ways, but experience is the best teacher. [lights a cigarette
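
The table above is the start of the character-to-index mapping. As a minimal sketch of how such a mapping encodes text and decodes it again, here is a pure-Python version on a tiny sample. The `sample` string and the `sample_`-prefixed names are made up for illustration; the notebook itself builds the mapping from the uploaded file and stores it in NumPy arrays.

```python
# Illustrative only: `sample` is a made-up string, not the uploaded script.
sample = "dwight: fire!"

# Sorted unique characters form the vocabulary, as in the notebook.
sample_vocab = sorted(set(sample))

# Character -> index and index -> character lookups.
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = list(sample_vocab)

# Encode the text as integers, then decode it back to characters.
encoded = [sample_char2idx[ch] for ch in sample]
decoded = "".join(sample_idx2char[i] for i in encoded)

print(len(sample_vocab), "unique characters")
print(encoded)
print(decoded)
```

Because the vocabulary is sorted, whitespace and punctuation get the lowest indices, which matches the ordering in the table above.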


In [0]:
# Creating a mapping from unique characters to indices, and for indices to characters
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')
print(text[:50])

# Maximum length, in characters, of a single input sequence
seq_length = 100
examples_per_epoch = len(text)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

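# Batch size
BATCH_SIZE = 64
# Note: this sets one training step per character, not per batch, so each
# "epoch" runs len(text) batches and cycles through the data many times
# (the dataset is repeated during training). A single pass over the data
# would be roughly len(text) // (seq_length + 1) // BATCH_SIZE steps.
steps_per_epoch = examples_per_epoch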

# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
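
To make the input/target split in the cell above more concrete, here is a small pure-Python sketch of the same idea. The string `sample_text` and the tiny `sample_seq_length` are made up for illustration, and list slicing stands in for `tf.data` batching.

```python
# Illustrative only: a short made-up string and a tiny sequence length.
sample_text = "hello world"
sample_seq_length = 4

# Cut the text into chunks of sample_seq_length + 1 characters, dropping any
# leftover characters at the end (like batch(..., drop_remainder=True)).
chunk_size = sample_seq_length + 1
chunks = [sample_text[i:i + chunk_size]
          for i in range(0, len(sample_text) - chunk_size + 1, chunk_size)]

def split_input_target(chunk):
    # Input is everything but the last character;
    # target is the same chunk shifted one character to the right.
    return chunk[:-1], chunk[1:]

pairs = [split_input_target(c) for c in chunks]
for input_text, target_text in pairs:
    print(repr(input_text), "->", repr(target_text))
```

At each position the model sees the input character and learns to predict the target character that follows it. With the 32,281-character file above and `seq_length = 100`, the same chunking yields 32281 // 101 = 319 sequences, which is 4 full batches of 64.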

**The cell below will make the model but not start training it.**

In [0]:
# Length of the vocabulary in characters
vocab_size = len(vocab)

if tf.test.is_gpu_available():
  rnn = tf.keras.layers.CuDNNGRU
else:
  import functools
  rnn = functools.partial(
    tf.keras.layers.GRU, recurrent_activation='sigmoid')

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, 
                              batch_input_shape=[batch_size, None]),
    rnn(rnn_units,
        return_sequences=True, 
        recurrent_initializer='glorot_uniform',
        stateful=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(
  vocab_size = len(vocab), 
  embedding_dim=embedding_dim, 
  rnn_units=rnn_units, 
  batch_size=BATCH_SIZE)

for input_example_batch, target_example_batch in dataset.take(1): 
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

model.summary()
example_batch_loss  = tf.losses.sparse_softmax_cross_entropy(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)") 
print("scalar_loss:      ", example_batch_loss.numpy())
model.compile(
    optimizer = tf.train.AdamOptimizer(),
    loss = tf.losses.sparse_softmax_cross_entropy)

(64, 100, 47) # (batch_size, sequence_length, vocab_size)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           12032     
_________________________________________________________________
cu_dnngru (CuDNNGRU)         (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 47)            48175     
Total params: 3,998,511
Trainable params: 3,998,511
Non-trainable params: 0
_________________________________________________________________
Prediction shape:  (64, 100, 47)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       3.8493035
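
The numbers in the summary above can be checked by hand. The sketch below recomputes the parameter counts from the layer sizes, assuming the GRU uses three weight sets (update gate, reset gate, candidate state) with two bias vectors each, which is the layout that reproduces the count reported for `CuDNNGRU`. It also compares the starting loss with what uniform guessing would give.

```python
import math

vocab_size = 47       # unique characters reported above
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of size embedding_dim per character.
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets, each with an input kernel, a recurrent kernel,
# and (assumed here) two bias vectors.
gru_params = 3 * (embedding_dim * rnn_units
                  + rnn_units * rnn_units
                  + 2 * rnn_units)

# Dense: a weight from every RNN unit to every character, plus one bias each.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)

# An untrained model guesses roughly uniformly over the vocabulary, so its
# cross-entropy loss should be close to ln(vocab_size).
uniform_loss = math.log(vocab_size)
print(uniform_loss)
```

The starting `scalar_loss` of about 3.849 is very close to ln(47), which is what you would expect before any training has happened.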


**This cell will actually start the training process. Be warned, it will likely take about 1.5 hours for each epoch you do.**

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
history = model.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=steps_per_epoch, callbacks=[checkpoint_callback])

# Rebuild the model with a batch size of 1 so it can generate text one
# sequence at a time, then load the weights from the latest checkpoint
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))
model.summary()

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            12032     
_________________________________________________________________
cu_dnngru_1 (CuDNNGRU)       (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 47)             48175     
Total params: 3,998,511
Trainable params: 3,998,511
Non-trainable params: 0
_________________________________________________________________
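
As a rough illustration of how the checkpoint files in the training cell are named, the sketch below fills in the `{epoch}` placeholder by hand. It assumes the callback substitutes a 1-based epoch number using ordinary string formatting; the list `checkpoint_names` is a name introduced here for illustration.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Fill in the placeholder for a 2-epoch run (assumed to count from 1).
checkpoint_names = [checkpoint_prefix.format(epoch=e) for e in range(1, 3)]
for name in checkpoint_names:
    print(name)
```

One file prefix is written per epoch, and `tf.train.latest_checkpoint` then picks the most recent one from that directory.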


**The next cell will generate a new script based on the training and let you download it. Because the text is sampled at random, running it twice will give you two different scripts. The three variables at the top of the block can be changed to whatever you want.**

In [0]:
# Number of characters to generate when making predictions at the end
# This is how long you want your output text to be
num_generate_var = 10000
# You can change the start string to experiment; make it relevant to your data.
# Use only characters that appear in your file, in lowercase,
# because the training text was converted to lowercase.
start_string_var = "dwight: "
# Low temperatures result in more predictable text.
# Higher temperatures result in more surprising text.
# Experiment to find the best setting.
temperature_var = 2.0

def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)
  # This step uses the variables defined at the top of this cell.
  # Change them there and run the cell again to change the generation.
  # Number of characters to generate
  num_generate = num_generate_var

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = temperature_var

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a multinomial distribution to sample the next character from the model's output
      predictions = predictions / temperature
      predicted_id = tf.multinomial(predictions, num_samples=1)[-1,0].numpy()
      
      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)
      
      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
final_predictions = generate_text(model, start_string=start_string_var)
with open('output.txt', 'w') as f:
  f.write(final_predictions)
files.download('output.txt')


dwight: i didn't th: clap, music loud, fire!
dwight: n-no one working ineck?
jim: are you-- [peoplearned? [pause] he was aguy: jerks. just kidding, not should haropped ould have died, be a roast chael: [toby trying]
darryl: ropped outs on high-rise.
dwight: [sing."
dwight: this is ah.. nope, wayou, but i knowng hele has thatst of our everybody, i'm going on?! [thstake] hesty, like, am: what?
michael: what?
jim: yes honess hones... 
michael: expractice dummy]
dwight: michael!
dwight: wah... yes. we at's how... yllis: oh yebegins to be nide got ice dummake out?
michael: i am not a neck?
jim: are you alright?
ough, aved tooah. and now, a mont?
michael: dwight: last weed: [to rosping] sho. [whigr dows towl: okay with the lasghing, roast! ah. ahe? 911. [move dummeredith pau a heart atter mily. 
rks. just kiddwight: i kid, you kn ist twil.. nope, we? [pul, no only right lesty, laughte'sing."
jim: gounny cigarettew caw.. nop to excity: get out of the site as petrifie.
michael! [dwightad fill 
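
The effect of `temperature_var` in the generation cell can be seen in a small standalone sketch. It assumes the model's raw scores are divided by the temperature before sampling, as in the cell above, but uses the standard library's `random.choices` in place of `tf.multinomial`. The function name `sample_with_temperature` and the `example_logits` scores are made up for illustration.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    # Divide the raw scores by the temperature, then apply a softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw one index at random according to those probabilities.
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

# Made-up scores for three characters; not taken from the trained model.
example_logits = [2.0, 1.0, 0.1]
rng = random.Random(0)

for t in (0.5, 1.0, 2.0):
    _, probs = sample_with_temperature(example_logits, t, rng)
    print(t, [round(p, 2) for p in probs])
```

A low temperature concentrates probability on the highest-scoring character, while a high temperature such as the 2.0 used above flattens the distribution, which helps explain why the sample output contains so many invented words.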

Portions of this page are reproduced and/or modified from work created and [shared by Google](https://developers.google.com/readme/policies/) and used according to terms described in the [Creative Commons 3.0 Attribution License](http://creativecommons.org/licenses/by/3.0/).