# Spelling Bee

This notebook starts our deep dive (no pun intended) into NLP by introducing sequence-to-sequence learning on Spelling Bee.

## Data Stuff

We take our data set from [The CMU pronouncing dictionary](https://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary)

In [16]:
%matplotlib inline
import importlib

import utils2; importlib.reload(utils2)
from utils2 import *
np.set_printoptions(4)
PATH = ''

In [17]:
limit_mem()

In [18]:
from sklearn.model_selection import train_test_split

The CMU pronouncing dictionary consists of sounds/words and their corresponding phonetic description (American pronunciation).

The phonetic descriptions are a sequence of phonemes. Note that the vowels end with integers; these indicate where the stress is.

Our goal is to learn how to spell these words given the sequence of phonemes.

The preparation of this data set follows the same pattern we've seen before for NLP tasks.

Here we iterate through each line of the file and grab each word/phoneme pair that starts with an uppercase letter. 

In [19]:
lines = [l.strip().split("  ") for l in open(PATH+"cmudict-0.7b", encoding='latin1') 
         if re.match('^[A-Z]', l)]
lines = [(w, ps.split()) for w, ps in lines]
lines[0], lines[-1]

(('A', ['AH0']), ('ZYWICKI', ['Z', 'IH0', 'W', 'IH1', 'K', 'IY0']))

Next we're going to get a list of the unique phonemes in our vocabulary, as well as add a null "_" for zero-padding.

In [20]:
phonemes = ["_"] + sorted(set(p for w, ps in lines for p in ps))
phonemes[:5]

['_', 'AA0', 'AA1', 'AA2', 'AE0']

In [21]:
len(phonemes)

70

Then we create mappings of phonemes and letters to respective indices.

Our letters include the padding element "_", but also "*" which we'll explain later.

In [22]:
p2i = dict((v, k) for k,v in enumerate(phonemes))
letters = "_abcdefghijklmnopqrstuvwxyz*"
l2i = dict((v, k) for k,v in enumerate(letters))

Let's create a dictionary mapping words to the sequence of indices corresponding to it's phonemes, and let's do it only for words between 5 and 15 characters long.

In [23]:
maxlen=15
pronounce_dict = {w.lower(): [p2i[p] for p in ps] for w, ps in lines
                 if (5<=len(w)<=maxlen) and re.match("^[A-Z]+$", w)}
len(pronounce_dict)

108006

Aside on various approaches to python's list comprehension:
* the first list is a typical example of a list comprehension subject to a conditional
* the second is a list comprehension inside a list comprehension, which returns a list of list
* the third is similar to the second, but is read and behaves like a nested loop
    * Since there is no inner bracket, there are no lists wrapping the inner loop

In [24]:
a=['xyz','abc']
[o.upper() for o in a if o[0]=='x'], [[p for p in o] for o in a], [p for o in a for p in o]

(['XYZ'], [['x', 'y', 'z'], ['a', 'b', 'c']], ['x', 'y', 'z', 'a', 'b', 'c'])

Split lines into words, phonemes, convert to indexes (with padding), split into training, validation, test sets. Note we also find the max phoneme sequence length for padding.

In [25]:
maxlen_p = max([len(v) for k,v in pronounce_dict.items()])

In [26]:
pairs = np.random.permutation(list(pronounce_dict.keys()))
n = len(pairs)
input_ = np.zeros((n, maxlen_p), np.int32)
labels_ = np.zeros((n, maxlen), np.int32)

for i, k in enumerate(pairs):
    for j, p in enumerate(pronounce_dict[k]): input_[i][j] = p
    for j, letter in enumerate(k): labels_[i][j] = l2i[letter]

In [27]:
go_token = l2i["*"]
dec_input_ = np.concatenate([np.ones((n,1)) * go_token, labels_[:,:-1]], axis=1)

Sklearn's <tt>train_test_split</tt> is an easy way to split data into training and testing sets.

In [28]:
(input_train, input_test, labels_train, labels_test, dec_input_train, dec_input_test
    ) = train_test_split(input_, labels_, dec_input_, test_size=0.1)

In [29]:
input_vocab_size, output_vocab_size = len(phonemes), len(letters)
input_vocab_size, output_vocab_size

(70, 28)

In [30]:
input_train = keras.utils.to_categorical(input_train, num_classes=input_vocab_size)

In [31]:
input_train.shape

(97205, 16, 70)

In [32]:
labels_train = keras.utils.to_categorical(labels_train, num_classes=output_vocab_size)

Next we proceed to build our model.

## Keras code

In [34]:
from keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from keras.layers import RepeatVector, Dense, Activation, Lambda
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.models import load_model, Model
import keras.backend as K
import numpy as np
import pandas as pd

In [35]:
def softmax(x, axis=1):
    """Softmax activation function.
    # Arguments
        x : Tensor.
        axis: Integer, axis along which the softmax normalization is applied.
    # Returns
        Tensor, output of softmax transformation.
    # Raises
        ValueError: In case `dim(x) == 1`.
    """
    ndim = K.ndim(x)
    if ndim == 2:
        return K.softmax(x)
    elif ndim > 2:
        e = K.exp(x - K.max(x, axis=axis, keepdims=True))
        s = K.sum(e, axis=axis, keepdims=True)
        return e / s
    else:
        raise ValueError('Cannot apply softmax to a tensor that is 1D')

In [36]:
Tx=16
Ty=15
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)

In [37]:
def one_step_attention(a, s_prev):
    """
    Performs one step of attention: Outputs a context vector computed as a dot product of the attention weights
    "alphas" and the hidden states "a" of the Bi-LSTM.
    
    Arguments:
    a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)
    
    Returns:
    context -- context vector, input of the next (post-attetion) LSTM cell
    """
    
    ### START CODE HERE ###
    # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
    s_prev =  repeator(s_prev)
    # Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
    concat = concatenator([a,s_prev])
    # Use densor to propagate concat through a small fully-connected neural network to compute the "energies" variable e. (≈1 lines)
    e = densor(concat)
    # Use activator and e to compute the attention weights "alphas" (≈ 1 line)
    alphas = activator(e)
    # Use dotor together with "alphas" and "a" to compute the context vector to be given to the next (post-attention) LSTM-cell (≈ 1 line)
    context = dotor([alphas,a])
    ### END CODE HERE ###
    
    return context

In [38]:
n_a = 64
n_s = 128
post_activation_LSTM_cell = LSTM(n_s, return_state = True)
output_layer = Dense(output_vocab_size, activation=softmax)

In [39]:
def model(Tx, Ty, n_a, n_s, text_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """
    
    # Define the inputs of your model with a shape (Tx,)
    # Define s0 and c0, initial hidden state for the decoder LSTM of shape (n_s,)
    X = Input(shape=(Tx, text_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0
    
    # Initialize empty list of outputs
    outputs = []
    
    ### START CODE HERE ###
    
    # Step 1: Define your pre-attention Bi-LSTM. Remember to use return_sequences=True. (≈ 1 line)
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)
    # Step 2: Iterate for Ty steps
    for t in range(Ty):
    
        # Step 2.A: Perform one step of the attention mechanism to get back the context vector at step t (≈ 1 line)
        context = one_step_attention(a, s)
        
        # Step 2.B: Apply the post-attention LSTM cell to the "context" vector.
        # Don't forget to pass: initial_state = [hidden state, cell state] (≈ 1 line)
        s, _, c = post_activation_LSTM_cell(context, initial_state=[s,c] )
        
        # Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM (≈ 1 line)
        out = output_layer(s)
        
        # Step 2.D: Append "out" to the "outputs" list (≈ 1 line)
        outputs.append(out)
    
    # Step 3: Create model instance taking three inputs and returning the list of outputs. (≈ 1 line)
    model = Model([X,s0,c0],outputs)
    
    ### END CODE HERE ###
    
    return model

In [40]:
model = model(Tx, Ty, n_a, n_s, input_vocab_size, output_vocab_size)

In [41]:
opt = Adam(lr= 0.005,  beta_1=0.9, beta_2=0.999, decay = 0.01) 
model.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [42]:
s0 = np.zeros((input_train.shape[0], n_s))
c0 = np.zeros((input_train.shape[0], n_s))
outputs = list(labels_train.swapaxes(0,1))

In [43]:
labels_train.shape

(97205, 15, 28)

In [44]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 16, 70)       0                                            
__________________________________________________________________________________________________
s0 (InputLayer)                 (None, 128)          0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 16, 128)      69120       input_1[0][0]                    
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)  (None, 16, 128)      0           s0[0][0]                         
                                                                 lstm_1[0][0]                     
          

                                                                 dense_1[10][0]                   
                                                                 dense_1[11][0]                   
                                                                 dense_1[12][0]                   
                                                                 dense_1[13][0]                   
                                                                 dense_1[14][0]                   
__________________________________________________________________________________________________
dot_1 (Dot)                     (None, 1, 128)       0           attention_weights[0][0]          
                                                                 bidirectional_1[0][0]            
                                                                 attention_weights[1][0]          
                                                                 bidirectional_1[0][0]            
          

                                                                 lstm_1[13][0]                    
                                                                 lstm_1[13][2]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 28)           3612        lstm_1[0][0]                     
                                                                 lstm_1[1][0]                     
                                                                 lstm_1[2][0]                     
                                                                 lstm_1[3][0]                     
                                                                 lstm_1[4][0]                     
                                                                 lstm_1[5][0]                     
                                                                 lstm_1[6][0]                     
          

In [None]:
from keras.callbacks import EarlyStopping, ModelCheckpoint
early_stopping = EarlyStopping(monitor='loss', patience=5, mode='min')
save_best = ModelCheckpoint('spelling_bee.hdf', save_best_only=True, 
                           monitor='loss', mode='min')

history = model.fit([input_train, s0, c0], outputs, epochs=300, batch_size=100,verbose=2,callbacks=[early_stopping,save_best])

In [116]:
input_test = keras.utils.to_categorical(input_test, num_classes=input_vocab_size)

In [120]:
input_test = input_test.swapaxes(0,1)

In [121]:
input_test.shape

(10801, 16, 70)

In [122]:
preds = model.predict([input_test,s0, c0], batch_size=128)

In [134]:
preds[0][0]

array([  4.9091e-05,   9.1638e-04,   1.4099e-03,   3.6515e-04,
         1.7284e-03,   2.7348e-04,   1.2466e-04,   9.2841e-01,
         1.1894e-03,   4.9724e-04,   5.7810e-02,   3.9567e-04,
         3.5013e-04,   2.4107e-04,   1.0281e-03,   1.1997e-04,
         7.8341e-07,   1.0567e-04,   9.5835e-05,   9.0494e-05,
         1.3909e-04,   1.8986e-04,   4.1442e-04,   1.5949e-03,
         4.5406e-05,   1.0328e-03,   1.3868e-03,   2.1544e-07], dtype=float32)

In [126]:
prediction = np.argmax(preds, axis = -1)

In [132]:
prediction[0][0]

7

In [140]:
letters

'_abcdefghijklmnopqrstuvwxyz*'

In [153]:
res = ''
for i in range(15):
    res += letters[prediction[i][1]]

In [154]:
res

'babbnn_________'

In [150]:
a.shape

(10801, 16)

In [146]:
a = np.argmax(input_test, axis = -1)

In [155]:
pre = ''
for i in range(16):
    pre += '-'+phonemes[a[1][i]]

In [156]:
pre

'-B-AE1-B-IH0-N-_-_-_-_-_-_-_-_-_-_-_'

In [111]:
np.argmax(input_test, axis = 2)

array([[57, 63, 57,  7, 43,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [24, 44,  7, 45, 31, 57,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [65,  5, 45, 21,  7, 34, 16,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [21,  2, 43, 34, 15, 55,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [33, 54, 24, 45, 35, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [41, 63, 45, 35, 42,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [21, 24, 45, 44,  7, 45,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [53,  2, 57, 26, 32,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [20,  2, 46, 42, 47,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [56,  2, 54, 32,  7, 45, 19, 26, 33,  0,  0,  0,  0,  0,  0,  0],
       [57,  5, 46, 33, 43, 35, 46,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [39, 55, 57,  7, 45,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [42,  5, 19, 38,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [24, 45, 21, 53, 54, 12, 21,  7, 42, 57,  0,