In [1]:
import tensorflow as tf
import tensorflow.keras as tfk
import tensorflow.keras.layers as tfkl
from tensorflow.keras.models import Sequential


from google.colab import drive
import numpy as np
import pandas as pd

In this example, we're going to train a [CharRNN](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on a body of Shakespearean text. Ultimately, this is an unsupervised learning task. But as in our previous explorations of unsupervised DL, we will take an unlabeled dataset and create many labeled samples from it that we can use with our familiar supervised loss functions. The result will be a model that has learned the statistical properties of the input text, and can be considered a "generative" model of language because we can use it to generate synthetic passages of Shakespeare.

In [2]:
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [3]:
file_path = "/content/gdrive/My Drive/dNN/shakespeare.txt"

with open(file_path,"r") as f:
  text = f.read()

We've loaded our Shakespeare text; let's take a look at an arbitrary snippet.

In [4]:
print(text[31600:32000])

 lies i' the second chamber?
  LADY MACBETH. Donalbain.
  MACBETH. This is a sorry sight.           [Looks on his hands.
  LADY MACBETH. A foolish thought, to say a sorry sight.
  MACBETH. There's one did laugh in 's sleep, and one cried,
      "Murther!"
    That they did wake each other. I stood and heard them,
    But they did say their prayers and address'd them
    Again to sleep.
  LADY MACB


We need to convert our text into numeric arrays; the next several blocks accomplish this.

First, we'll create a mapping between characters and their numeric index. We'll also create the reverse mapping, which is useful.

In [5]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
## total chars = 75, so each one-hot encoding vector has length 75

total chars: 75
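As a quick sanity check on the two mappings, the sketch below round-trips a short string through a toy vocabulary. The sample text and the `encode`/`decode` helper names are illustrative assumptions and do not appear in the notebook.

```python
# Toy text standing in for the Shakespeare corpus (assumption for illustration).
toy_text = "to be or not to be"

# Same construction as above: sorted unique characters, then both mappings.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

def encode(s, mapping):
    """Map each character of s to its integer index."""
    return [mapping[c] for c in s]

def decode(ids, mapping):
    """Map integer indices back to a string."""
    return "".join(mapping[i] for i in ids)

ids = encode("not", toy_char_indices)
print(ids)
print(decode(ids, toy_indices_char))
```

The reverse mapping is what lets us turn predicted indices back into readable text later on.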


Next, we'll create a training set of sub-sequences. Remember, we're trying to train a model to predict the next character when it is given the preceding characters of a sub-sequence. So we will create training pairs where each X is a fixed-length sub-sequence and each Y is the corresponding next character in the text.

In [6]:
maxlen = 40 # each training sequence is 40 characters long; the target is the 41st character
step = 3 # start a new sub-sequence every 3 characters
sub_sequences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sub_sequences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen]) # grab the 41st letter
print('nb sequences:', len(sub_sequences))

nb sequences: 38700
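To see exactly which (sequence, target) pairs the loop produces, here is the same windowing applied to a toy string with a shorter window. The string and the `toy_` names are assumptions chosen to keep the output readable.

```python
# Toy values (assumptions for illustration; the notebook uses maxlen=40, step=3).
toy_text = "hello world"
toy_maxlen = 4
toy_step = 3

pairs = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    window = toy_text[i: i + toy_maxlen]   # the input sub-sequence
    target = toy_text[i + toy_maxlen]      # the character right after it
    pairs.append((window, target))

for window, target in pairs:
    print(repr(window), "->", repr(target))
```

Consecutive windows overlap whenever the step is smaller than the window length, which is how a single unlabeled text yields many labeled examples.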


In [7]:
k=300
print("(Sequence):\n" + sub_sequences[k])
print("\n(Target Character): \n" + next_chars[k])

(Sequence):
 and other Apparitions
  Lords, Gentleme

(Target Character): 
n


Next we'll create one-hot vectors for our sub-sequences. The tensor we create here will be shaped as (num_sequences x sequence_length x alphabet_size).

In [8]:
# 3-D tensor: (number of sequences, sequence length, one-hot size, i.e. number of chars in the vocabulary)
X = np.zeros((len(sub_sequences), maxlen, len(chars)), dtype=np.uint8)
# 2-D tensor: (number of sequences, 75), one one-hot vector per target character
Y = np.zeros((len(sub_sequences), len(chars)), dtype=np.uint8)
for i, seq in enumerate(sub_sequences):
    for t, char in enumerate(seq):
        X[i, t, char_indices[char]] = 1
    # the target only needs to be set once per sequence, not once per character
    Y[i, char_indices[next_chars[i]]] = 1

In [9]:
X[0,0,:]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=uint8)

In [10]:
Y[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)
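Raw one-hot rows like the ones above are hard to read, so it can help to decode them back into text. The sketch below does this on a three-character toy vocabulary; `one_hot_encode` and `one_hot_decode` are hypothetical helper names, not functions from the notebook.

```python
import numpy as np

# Toy vocabulary (assumption for illustration; the notebook's has 75 characters).
toy_chars = ["a", "b", "c"]
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

def one_hot_encode(seq, mapping):
    """Return a (len(seq), vocab_size) array with a single 1 per row."""
    encoded = np.zeros((len(seq), len(mapping)), dtype=np.uint8)
    for t, char in enumerate(seq):
        encoded[t, mapping[char]] = 1
    return encoded

def one_hot_decode(encoded, chars):
    """Recover the string from the position of the 1 in each row."""
    return "".join(chars[j] for j in encoded.argmax(axis=-1))

toy_encoded = one_hot_encode("cab", toy_char_indices)
print(toy_encoded)
print(one_hot_decode(toy_encoded, toy_chars))
```

With the notebook's own objects, `one_hot_decode(X[0], chars)` should reproduce `sub_sequences[0]`.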

Our RNN model will be quite simple.

In [11]:
char_rnn = Sequential()
# LSTM layer with 128 units
# each input is maxlen (40) x vocabulary size (75); you don't need to specify the sequence length, but it's faster if you do
char_rnn.add(tfkl.LSTM(128, input_shape=(maxlen, len(chars))))
# make a prediction using a softmax over all 75 characters
char_rnn.add(tfkl.Dense(len(chars),activation="softmax"))
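To get a feel for how "simple" this model is, we can count its trainable parameters by hand. The sketch below assumes the Keras defaults (an LSTM with four gates, each with an input kernel, a recurrent kernel, and a bias); the helper names are hypothetical. If those assumptions hold, the totals should agree with `char_rnn.summary()`.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

vocab_size = 75   # len(chars) in the notebook
lstm_units = 128

n_lstm = lstm_param_count(vocab_size, lstm_units)
n_dense = dense_param_count(lstm_units, vocab_size)
print("LSTM parameters: ", n_lstm)
print("Dense parameters:", n_dense)
print("Total:           ", n_lstm + n_dense)
```

Note that the sequence length (40) does not enter the count: the same weights are reused at every time step.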

In [12]:
char_rnn.compile(loss='categorical_crossentropy', optimizer=tfk.optimizers.RMSprop(learning_rate=0.01))

In [15]:
# train; re-running this cell continues training from the current weights, so the loss keeps decreasing
char_rnn.fit(X,Y, epochs=20, batch_size=1024)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f535f79e7b8>

Once we have a trained model, we can simulate new text by making predictions about the next character and then drawing characters in proportion to the predicted probabilities. We then simply repeat that process over and over, each time drawing the next character.

What to do with the output probabilities? There are two ways to use a trained model to generate new sequences:

  - Take the character with the highest predicted probability, append it to the seed sequence, slide the seed window over by one character, and feed it back to the network to predict the next letter.
  - Treat the softmax output as a probability vector and draw a random character index from it.
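The difference between the two strategies is easy to see on a made-up probability vector. This is a minimal sketch: the three-element `probs` array and the fixed seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one softmax output over a three-character vocabulary.
probs = np.array([0.1, 0.6, 0.3])

# Strategy 1: always take the most likely character (deterministic).
greedy_index = int(np.argmax(probs))

# Strategy 2: draw an index at random, weighted by the probabilities.
samples = rng.choice(len(probs), size=1000, p=probs)
freqs = np.bincount(samples, minlength=len(probs)) / len(samples)

print("greedy choice:", greedy_index)
print("sampled frequencies:", freqs)
```

Taking the argmax always returns the same index for the same input, so generated text tends to fall into loops. Sampling picks less likely characters some of the time, which gives more varied output.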

In [16]:
def draw_char(probs):
    # cast to float64 and renormalize so the probabilities sum to 1 within np.random.choice's tolerance
    probs = np.asarray(probs).astype('float64')
    probs = probs / np.sum(probs)
    draw = np.random.choice(len(probs), p=probs)
    return draw

def sample_text(model, sample_length=100):
    start = np.random.randint(0, len(text) - maxlen - 1)
    sequence = text[start: start + maxlen]
  
    x_preds = np.zeros((sample_length, maxlen, len(chars)))
    for i in range(sample_length):
        for t, char in enumerate(sequence[-maxlen:]):
            x_preds[i, t, char_indices[char]] = 1.

        preds = model.predict(np.expand_dims(x_preds[i,:,:], axis=0), verbose=0)[0]
        next_index = draw_char(preds)
        next_char = indices_char[next_index]

        sequence += next_char
    return sequence

In [17]:
sim = sample_text(char_rnn,sample_length=500) 

In [18]:
# print the text generated by sampling each character from the predicted probabilities (the second option)
print(sim)

e,
    They are not yet come back. But I hade they.
    Thou loses again fround as his swains, 
    You smild for rade folors hath of vales
    Then for at hade of sliet flame fried?
                                              Munching?

                      Enter Macbeth.

  MACBETH. To keips us all man cortle; the lidelan,
    Hath not somelt.
  MACBETH. [Which it same bordfed speetle your himse,
    Ne moment grainsing theming thos frees us.
    Most saying.

    Leven placither it noble ture as hasthers wereabes
    The slamoma


Notice that the model does pretty well at learning the typical statistical patterns of this text: the simulated passage has the line structure, speaker labels, and stage directions of legitimate Shakespeare, even though many of the words are invented.

But just a caution - we can also do pretty well with a much simpler method (Markov model): http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

So the lesson is to try something simple before jumping right in to deep learning.
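To make "something simple" concrete, here is a minimal sketch of a character-level Markov model: it counts which character follows each context of `order` characters and samples from those counts. It is written in the same spirit as the linked notebook but is not that notebook's code; `train_markov` and `generate_markov` are hypothetical names.

```python
import random
from collections import Counter, defaultdict

def train_markov(text, order=4):
    """Count which character follows each `order`-length context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i: i + order]][text[i + order]] += 1
    return counts

def generate_markov(counts, seed, length=50, rng=None):
    """Extend `seed` by sampling next characters in proportion to the counts."""
    rng = rng or random.Random(0)
    order = len(seed)
    out = seed
    for _ in range(length):
        options = counts.get(out[-order:])
        if not options:  # context never seen in training text
            break
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out

# Toy text (assumption for illustration); with the notebook's data you would pass `text`.
toy_counts = train_markov("abcabcabcabc", order=2)
print(generate_markov(toy_counts, seed="ab", length=4))
```

There is nothing to train by gradient descent here, so fitting takes a single pass over the text.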

## Exercise

In this example, we're going to use an RNN for sequence classification. The task we'll set up is to generate a training set of randomized strings, and train our model to detect whether a string contains any vowels.

First, we'll create a training dataset of short randomized character sequences and the corresponding label of whether or not they contain at least one vowel.

In [4]:
import string

In [5]:
def contains_vowels(sequence):
  vowels = ["a", "e", "i", "o", "u"]
  return any([vowel in list(sequence) for vowel in vowels])

In [6]:
contains_vowels("gradient")

True

In [7]:
sequences = []
labels = []
for i in range(10000):
    char_list = np.random.choice(list(string.ascii_lowercase), size=5, replace=True)
    seq = "".join(char_list)
    sequences.append(seq)
    labels.append(int(contains_vowels(seq)))

In [8]:
df = pd.DataFrame({"sequence": sequences, "label":labels})

In [9]:
df.head()

  sequence  label
0    edued      1
1    deyyw      1
2    zdjho      1
3    esrlt      1
4    gzovf      1
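All five rows shown have label 1, which is a hint that the classes are not balanced. Assuming each of the 5 characters is drawn uniformly and independently from the 26 lowercase letters, as in the generation loop, we can work out the expected share of positive labels directly:

```python
n_letters = 26
n_vowels = 5     # a, e, i, o, u
seq_len = 5

# Probability that none of the 5 characters is a vowel.
p_no_vowel = ((n_letters - n_vowels) / n_letters) ** seq_len
# Probability that at least one character is a vowel (label 1).
p_vowel = 1 - p_no_vowel

print(f"P(label = 0) = {p_no_vowel:.4f}")
print(f"P(label = 1) = {p_vowel:.4f}")
```

Roughly two thirds of the strings should be labeled 1. A model that always predicts 1 would therefore already score about 66% accuracy, which is the baseline to beat.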


Next, set up and train an RNN (of any type) to solve this task. What preprocessing will you need to do first on the raw data in order to prepare it for the network?

In [27]:
# give each letter a unique value
chars = sorted(list({letter for word in sequences for letter in word}))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 26


In [28]:
# one hot encode x
X = np.zeros((len(sequences), 5, len(chars)), dtype=np.uint8)
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        X[i, t, char_indices[char]] = 1

In [29]:
Y = np.array(labels)

In [30]:
'''
I was curious why this wouldn't work for me. I kept getting errors when I
ran this. Does it only take one-hot encodings as inputs?

ValueError: Input 0 of layer sequential_2 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [100, 5]

# encode the data based on the dictionary values
endcoded = []
for seq in sequences:
  seqList = []
  for letter in seq:
    let_code = char_indices[letter]
    seqList.append(let_code)
  endcoded.append(seqList)
X = np.array(endcoded)
'''

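On the question raised in the cell above: the error message says the LSTM layer expects a 3-D input of shape (batch, timesteps, features), whereas integer-coded sequences form a 2-D array of shape (batch, timesteps). The sketch below shows the shape difference and one way to expand integer codes into one-hot vectors with NumPy. The two example strings and the `toy_` names are assumptions for illustration.

```python
import string
import numpy as np

# Two example sequences (assumption for illustration).
toy_sequences = ["edued", "zdjho"]
toy_char_indices = {c: i for i, c in enumerate(string.ascii_lowercase)}

# Integer coding: one number per character -> 2-D array (batch, timesteps).
int_codes = np.array([[toy_char_indices[c] for c in s] for s in toy_sequences])
print(int_codes.shape)

# Indexing an identity matrix with the codes turns each integer into a
# one-hot row -> 3-D array (batch, timesteps, features).
one_hot = np.eye(len(toy_char_indices), dtype=np.uint8)[int_codes]
print(one_hot.shape)
```

One-hot encoding is not the only option. A common alternative is to keep the integer codes and put a `tfkl.Embedding` layer in front of the LSTM, so that the network learns a feature vector per character and produces the 3-D input itself.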

In [31]:
seq_rnn = Sequential()
seq_rnn.add(tfkl.LSTM(50, input_shape=(5,26)))
seq_rnn.add(tfkl.Dense(1,activation="sigmoid"))

In [32]:
seq_rnn.compile(loss='binary_crossentropy', optimizer=tfk.optimizers.RMSprop(learning_rate=0.01))

In [33]:
seq_rnn.fit(X,Y, epochs=20, batch_size=100)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fb4879b4f60>
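The training log above shows no metrics, so it is worth checking how well the classifier actually does. A minimal sketch, assuming the model outputs one sigmoid probability per sequence and that we threshold it at 0.5; `binary_accuracy` is a hypothetical helper and the arrays are stand-in values, not results from the notebook.

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples where the thresholded probability matches the label."""
    y_pred = (np.asarray(y_prob).reshape(-1) >= threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true).reshape(-1)))

# Stand-in labels and predicted probabilities (assumptions for illustration).
y_true = np.array([1, 0, 1, 1])
y_prob = np.array([[0.9], [0.2], [0.4], [0.7]])

print(binary_accuracy(y_true, y_prob))
```

With the notebook's objects this would be called as `binary_accuracy(Y, seq_rnn.predict(X))`, ideally on sequences held out from training.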