# Recap of Neural Networks

## Concepts

**Perceptron** A perceptron takes in some input values and generates an output value by calculating the weighted sum of the inputs and then applying an activation function (a nonlinear transformation) to it.

![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573599295359_Screenshot+2019-11-12+14.54.53.png)
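
As a minimal sketch of the computation described above (not code from the class materials), the snippet below computes a weighted sum and passes it through a sigmoid. The function names, the bias term and the example values are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


# Hypothetical values chosen only for illustration.
output = perceptron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)  # the weighted sum is 0.0, so the sigmoid returns 0.5
```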

**Multilayer Perceptron (MLP)**
A Multilayer Perceptron is a feedforward neural network made of multiple fully-connected layers that use nonlinear activation functions. An MLP with more than 2 layers is the most basic form of a deep neural net.

![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573599514820_Screenshot+2019-11-12+14.58.29.png)


**Backpropagation**
Backpropagation is an algorithm that helps a neural network learn by calculating the gradients of the loss with respect to the weights in a feedforward computational graph. It does so by propagating the gradients backwards through the network, differentiating layer by layer with the chain rule.
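
To make the chain rule concrete, here is a small sketch for a single sigmoid neuron with a squared-error loss. The loss choice, the function names and the numbers are assumptions for illustration; the analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Forward pass of a single sigmoid neuron."""
    return sigmoid(w * x + b)


def loss(w, b, x, target):
    """Squared error between the prediction and the target."""
    return 0.5 * (forward(w, b, x) - target) ** 2


def gradients(w, b, x, target):
    """Backward pass: apply the chain rule from the loss back to w and b."""
    a = forward(w, b, x)
    dloss_da = a - target          # d(loss) / d(activation)
    da_dz = a * (1.0 - a)          # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz    # chain rule
    return dloss_dz * x, dloss_dz  # dz/dw = x and dz/db = 1


# Hypothetical values; compare with a central finite-difference estimate.
w, b, x, target = 0.3, -0.1, 2.0, 1.0
dw, db = gradients(w, b, x, target)
eps = 1e-6
numeric_dw = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)
print(dw, numeric_dw)  # the two values should agree closely
```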

### Building Blocks of Neural Networks
**Activation Function**
Activation functions (sigmoid, tanh, ReLU) are non-linear functions that are applied to the weighted sum of the inputs of a neuron. They enable neural nets to go beyond linear functions and approximate any complex function.

![](https://www.researchgate.net/profile/Vivienne_Sze/publication/315667264/figure/download/fig3/AS:669951052496900@1536740186369/Various-forms-of-non-linear-activation-functions-Figure-adopted-from-Caffe-Tutorial.png)
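
The three activation functions named above can be written in a few lines. This is an illustrative sketch using only the standard library; frameworks such as Keras provide their own vectorized versions.

```python
import math


def sigmoid(z):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Maps any real number into (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Passes positive values through and clips negatives to 0."""
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```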

**Batch Size**
The number of training examples in a single batch. Large batch sizes can be great because they harness the power of GPUs to process more training instances per unit of time. Small batch sizes tend to generalize better and are less memory intensive.
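
A rough sketch of how a dataset is split into batches; `iterate_batches` is a hypothetical helper, and the list of integers stands in for real training examples.

```python
def iterate_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = list(range(10))  # stand-in for 10 training examples
batches = list(iterate_batches(examples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last batch is smaller when the dataset size is not a multiple of the batch size.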

**Categorical Cross-Entropy Loss**
The categorical cross-entropy loss, also known as the negative log likelihood, is a popular loss function for classification tasks. It measures the dissimilarity between two probability distributions: the true labels and the predicted labels. The closer the prediction is to the true distribution, the lower the loss.
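
A minimal sketch of the loss for a single example with a one hot encoded label. The function name, the `eps` clipping value and the example probabilities are assumptions for illustration.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative log likelihood of the true class under the prediction."""
    # eps guards against log(0) when a predicted probability is exactly zero.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0.0, 1.0, 0.0]        # the true class is index 1
confident = [0.05, 0.90, 0.05]  # hypothetical prediction close to the truth
unsure = [0.40, 0.30, 0.30]     # hypothetical prediction far from the truth

print(categorical_cross_entropy(y_true, confident))  # lower loss
print(categorical_cross_entropy(y_true, unsure))     # higher loss
```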

**Dropout**
Dropout is a regularization technique that gives a sizable performance boost (~2% for state-of-the-art models) for how simple it is. At each training step, dropout randomly turns off a percentage of the neurons in each layer. This makes the network more robust because it can’t rely on any particular set of input neurons for making predictions; the knowledge is distributed across the whole network. In effect, around 2^n (where n is the number of neurons in the architecture) slightly different neural networks are sampled during training, and their predictions are implicitly ensembled.

![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573598735731_Screenshot+2019-11-12+14.45.29.png)
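
The sketch below shows one common way to implement the idea, often called inverted dropout: surviving activations are scaled up during training so that nothing needs to change at prediction time. The function name and the explicit `rng` argument are assumptions for illustration.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    # Scale the survivors by 1 / keep_prob to keep the expected sum unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # seeded only to make the sketch repeatable
layer_output = [1.0] * 10
dropped = dropout(layer_output, rate=0.4, rng=rng)
print(dropped)  # a mix of zeros and scaled-up values
print(dropout(layer_output, rate=0.4, training=False))  # unchanged
```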

**Early Stopping** Early stopping lets you train a model with more hidden layers, more hidden neurons and more epochs than you need, and then stops training once performance has failed to improve for n consecutive epochs. It keeps the best performing model for you and helps prevent overfitting.
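
The stopping rule can be sketched over a list of validation losses. `early_stopping_epoch` and the loss values are hypothetical; in Keras the same idea is provided by the `tf.keras.callbacks.EarlyStopping` callback.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch of the best model)."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the model starts to overfit after epoch 2.
history = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```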

**Epoch** The number of epochs is the number of times your network sees the full dataset. One epoch is one complete pass of the entire dataset forward and backward through the neural network.

**Gradient Descent** Gradient descent is an iterative optimization algorithm that finds a (local) minimum of a differentiable function by repeatedly stepping in the direction of the negative gradient. We use it to find a low point of our loss function.
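
A minimal sketch on a one-dimensional function, assuming we can evaluate the gradient directly. The function `f(x) = (x - 3)^2` and all parameter values are chosen only for illustration.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient and return the final position."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


def grad_f(x):
    """Gradient of f(x) = (x - 3)^2, whose minimum is at x = 3."""
    return 2.0 * (x - 3.0)


minimum = gradient_descent(grad_f, start=0.0, learning_rate=0.1, steps=100)
print(minimum)  # very close to 3.0
```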

**Hidden Layer** A hidden layer in a neural network is a layer between the input layer and the output layer, where neurons take in a set of weighted inputs and produce an output through an activation function.

**Hyperparameter Tuning** Hyperparameter tuning is the process of searching the hyperparameter space to find the best hyperparameter values, through grid, random or Bayesian search.
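
As an illustration of random search, the sketch below samples hyperparameters uniformly from given ranges and keeps the best-scoring set. `toy_objective` is a made-up stand-in for a validation loss; in practice each evaluation would train and validate a model.

```python
import random


def random_search(objective, search_space, trials, seed=0):
    """Sample `trials` random configurations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(trials):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in search_space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Hypothetical validation loss, lowest at learning_rate=0.1, dropout=0.4."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["dropout"] - 0.4) ** 2


space = {"learning_rate": (0.0, 1.0), "dropout": (0.0, 0.9)}
best_params, best_score = random_search(toy_objective, space, trials=50)
print(best_params, best_score)
```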

**Learning Rate** The learning rate is a scalar used to update model parameters during gradient descent. It is the factor by which we multiply the gradients, so it determines how much the weights change at each training step.

![](https://paper-attachments.dropbox.com/s_39292DB9CE2A9400103E176C2ABC438C6A626910E9DBB0D6FBE28EE673C7492C_1565307718429_image.png)
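
The effect of the learning rate can be seen on a toy problem. The sketch minimizes `f(x) = x^2` with three hypothetical learning rates: one too small, one reasonable and one too large.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(x) = x^2 (gradient 2x); return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x


small = descend(0.01)      # moves slowly, still far from the minimum at 0
good = descend(0.3)        # converges quickly towards 0
too_large = descend(1.1)   # overshoots on every step and diverges
print(small, good, too_large)
```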

**Momentum**
Gradient descent takes small, consistent steps towards a local minimum, and when the gradients are tiny it can take a long time to converge. Momentum takes the previous gradients into account and accelerates small but consistent gradients. It speeds up convergence by moving through flat valleys faster and can help the optimizer roll past shallow local minima.
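
A sketch of the classical momentum update, where a velocity term accumulates past gradients. The function names and parameter values are assumptions for illustration; setting `momentum=0.0` recovers plain gradient descent.

```python
def sgd_with_momentum(grad, start, learning_rate, momentum, steps):
    """Gradient descent with a velocity term that accumulates past gradients."""
    x, velocity = start, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad(x)
        x += velocity
    return x


def grad_f(x):
    """Gradient of f(x) = x^2, whose minimum is at x = 0."""
    return 2.0 * x


plain = sgd_with_momentum(grad_f, 1.0, learning_rate=0.01, momentum=0.0, steps=50)
with_momentum = sgd_with_momentum(grad_f, 1.0, learning_rate=0.01, momentum=0.9, steps=50)
print(plain, with_momentum)  # the momentum run ends closer to 0
```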

**One Hot Encoding** Many machine learning algorithms cannot operate on categorical label data directly. They need all input variables and output variables to be numeric. In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than a hard limitation of the algorithms themselves. This means that categorical data must be converted to a numerical form. One hot encoding is one such conversion: each category becomes a vector with a 1 in that category's position and 0 everywhere else.

![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573598356099_Screenshot+2019-11-12+14.39.07.png)
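
A minimal sketch of the encoding for integer class labels; `one_hot` is a hypothetical helper that mirrors what `tf.keras.utils.to_categorical` does in the code below.

```python
def one_hot(labels, num_classes):
    """Turn each integer label into a vector with a single 1."""
    encoded = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        encoded.append(row)
    return encoded


print(one_hot([0, 2, 1], num_classes=3))
# [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```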

### Basic Neural Network

In [0]:
# Install wandb
%pip install -qq wandb
#import libraries
import warnings
warnings.filterwarnings('ignore')
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import pandas as pd
import numpy as np

In [0]:
# keras-perceptron/perceptron-normalize.py
import wandb
import tensorflow as tf

# logging code
run = wandb.init(entity="wandb", project="bloomberg-class")
config = run.config
config.concept = 'mlp'

# load data
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
img_width = X_train.shape[1]
img_height = X_train.shape[2]

# normalize data

X_train = X_train.astype('float32') / 255.
X_test = X_test.astype('float32') / 255.

# one hot encode outputs
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)
labels = [str(i) for i in range(10)]

num_classes = y_train.shape[1]

# create model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(img_width, img_height)))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Fit the model
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test),
          callbacks=[wandb.keras.WandbCallback(data_type="image", labels=labels, save_model=False)])

# LSTMs

## Concepts

**Attention Mechanism**
Attention Mechanisms are inspired by human visual attention, the ability to focus on specific parts of an image. Attention mechanisms can be incorporated in both Language Processing and Image Recognition architectures to help the network learn what to “focus” on when making predictions.
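
One common formulation, used here purely as an illustrative assumption, scores each input against a query with a dot product, turns the scores into weights with a softmax, and returns the weighted sum of the values. All names and numbers below are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Dot-product attention over a list of key and value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
weights, context = attend([1.0, 0.0], keys, values)
print(weights)  # the first key matches the query best and gets the most weight
print(context)
```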

**Bag of Words**
Bag of words is a method of feature engineering for text. In the resulting BoW feature vector, each dimension represents whether a specific token from the vocabulary is present in the document. BoW ignores the order of words.
![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573602945512_Screenshot+2019-11-12+15.55.17.png)
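
A small sketch of binary bag-of-words features, assuming whitespace tokenization and lowercasing. The helper names and example sentences are made up for illustration.

```python
def build_vocabulary(documents):
    """Map every distinct token in the corpus to a vector index."""
    tokens = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: index for index, token in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Binary vector: 1 if the token occurs in the document, else 0."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector


docs = ["the movie was great", "great was the movie", "the movie was bad"]
vocab = build_vocabulary(docs)
for doc in docs:
    print(bag_of_words(doc, vocab), doc)
```

The first two sentences get identical vectors because word order is discarded.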

**Embedding**
Embeddings map inputs like words or sentences to vectors of numbers.
![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573602983980_Screenshot+2019-11-12+15.56.20.png)

**GloVe**
GloVe is an unsupervised algorithm to convert words into embeddings trained on co-occurrence statistics.
![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573603065273_Screenshot+2019-11-12+15.57.14.png)

![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573603065257_Screenshot+2019-11-12+15.57.32.png)

**word2vec**
word2vec is an algorithm that also learns word embeddings, by trying to predict the context of words in a document. The resulting vectors capture relationships that can be expressed with vector arithmetic, e.g. `vector('king') - vector('man') + vector('woman')` lands close to `vector('queen')`.
![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573602983972_Screenshot+2019-11-12+15.56.11.png)
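
The arithmetic can be illustrated with hand-made 2-dimensional toy vectors. These are not real word2vec embeddings; the two dimensions are assumed to stand for "royalty" and "gender" purely for the sake of the example.

```python
import math

# Toy vectors invented for illustration: [royalty, gender].
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest(vector, vectors, exclude):
    """Word whose vector is most similar to `vector`, skipping `exclude`."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(vector, vectors[word]))


king, man, woman = (toy_vectors[w] for w in ("king", "man", "woman"))
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(nearest(result, toy_vectors, exclude={"king", "man", "woman"}))  # queen
```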

### Create Character Encodings
![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573602895710_Screenshot+2019-11-12+15.54.42.png)


In [0]:
# Utility Functions
import os
import subprocess
import wandb
import math
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb

# Get the imdb dataset
if not os.path.exists("aclImdb_v1.tar.gz"):
    print("Downloading imdb dataset...")
    subprocess.check_output(
        "curl -OL http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz && tar xvfz aclImdb_v1.tar.gz", shell=True)

word_to_id = imdb.get_word_index()
word_to_id = {k: (v+3) for k, v in word_to_id.items()}
id_to_word = {value: key for key, value in word_to_id.items()}
id_to_word[0] = ""  # Padding
id_to_word[1] = ""  # Start token
id_to_word[2] = "�"  # Unknown
id_to_word[3] = ""  # End token

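class TextLogger(tf.keras.callbacks.Callback):
    """Log the model's predictions on a fixed set of examples after each epoch."""

    def __init__(self, inp, out):
        super().__init__()
        self.inp = inp
        self.out = out

    # Keras calls this hook as on_epoch_end(epoch, logs)
    def on_epoch_end(self, epoch, logs=None):
        out = self.model.predict(self.inp)
        data = [[decode(self.inp[i]), o, self.out[i]]
                for i, o in enumerate(out)]
        wandb.log({"text": wandb.Table(data=data)}, commit=False)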

def cosine_sim(v1, v2):
    """Compute the cosine similarity of v1 and v2: (v1 dot v2) / (||v1|| * ||v2||)."""
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

def decode(words):
    """Turn a sequence of word ids back into text, skipping padding (id 0)."""
    return ' '.join(id_to_word[word_id] for word_id in words if word_id > 0)

if not os.path.exists("glove.6B.50d.txt"):
    print("Downloading glove embeddings...")
    subprocess.check_output(
        "curl -OL https://storage.googleapis.com/wandb/glove.6B.50d.txt", shell=True)

In [0]:
# Inspect data
vocab_size = 10000
embedding_dims = 50
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
print(X_train[0])
print(decode(X_train[0]))

In [0]:
# Inspect Glove embeddings
embeddings_index = dict()
print("Loading embeddings")
# GloVe files are UTF-8 text: one word per line followed by its vector
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Row i of the matrix holds the GloVe vector of word id i (zeros if missing)
embedding_matrix = np.zeros((vocab_size, embedding_dims))
for index in range(vocab_size):
    word = id_to_word[index]
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
film = embeddings_index["film"]
movie = embeddings_index["movie"]
book = embeddings_index["book"]
car = embeddings_index["car"]
truck = embeddings_index["truck"]
plane = embeddings_index["plane"]
print(embeddings_index["plane"])
print(cosine_sim(truck, film))

### Predicting Sentiment Data using LSTMs

Recurrent neural networks (RNNs) model sequential data by maintaining a hidden state. An RNN uses the same parameters and performs the same calculations at each step, with different inputs. At each time step, it calculates a new hidden state (“memory”) based on the current input and the previous hidden state, and persists the information through an internal loop in the network. RNNs can therefore capture and learn the order of the inputs they receive, which makes them a natural fit for sequential data such as text in natural language processing. RNNs can suffer from vanishing and exploding gradients.

![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573671695722_Screenshot+2019-11-13+11.01.23.png)
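
The recurrence can be sketched with a scalar hidden state. Real RNN layers use weight matrices and vectors; the scalar weights `w_x`, `w_h`, the bias and the tanh activation here are assumptions made to keep the example short.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, bias):
    """New hidden state from the current input and the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + bias)


def run_rnn(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Apply the same step, with the same parameters, to every input in order."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, bias)
        states.append(h)
    return states


# The same values in a different order produce different final states.
early_signal = run_rnn([1.0, 0.0, 0.0])
late_signal = run_rnn([0.0, 0.0, 1.0])
print(early_signal[-1], late_signal[-1])
```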


**Long Short-Term Memory Unit (LSTM)**
LSTM units in RNNs help combat the vanishing gradient problem by using neurons with a memory cell and three gates:
- input – determines how much of the information from the previous layer gets stored in the cell
- output – determines how much the next layer gets to know about the state of the current cell
- forget – determines what to forget about the current state of the memory cell

![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573671832985_Screenshot+2019-11-13+11.03.50.png)
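
The three gates can be sketched for a single LSTM unit with scalar input and state. This follows the standard LSTM update equations; the `params` layout of `(input weight, recurrent weight, bias)` per gate and all numeric values are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    def gate(name, activation):
        w, u, b = params[name]
        return activation(w * x + u * h_prev + b)

    i = gate("input", sigmoid)         # how much new information to store
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # the new information itself
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


params = {name: (0.5, 0.5, 0.0)
          for name in ("input", "forget", "output", "candidate")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

With the forget gate near 1 and the input gate near 0, the cell state passes through a step almost unchanged, which is what lets gradients survive over long sequences.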

In [0]:
# No Glove Embeddings
# examples/lstm/imdb-classifier/imdb-lstm.py
import wandb
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import LSTM, GRU
from tensorflow.python.client import device_lib
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.datasets import imdb

wandb.init(entity="wandb", project="bloomberg-class")
config = wandb.config
config.concept = "lstm-embedding"
config.vocab_size = 1000
config.maxlen = 300
config.batch_size = 32
config.embedding_dims = 50
config.filters = 250
config.kernel_size = 3
config.hidden_dims = 100
config.epochs = 10

# Load and tokenize input
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=config.vocab_size)

# Ensure all input is the same size
X_train = sequence.pad_sequences(
    X_train, maxlen=config.maxlen)
X_test = sequence.pad_sequences(
    X_test, maxlen=config.maxlen)

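# Override LSTM & GRU with the CuDNN variants when a GPU is present.
# These layers exist only in TensorFlow 1.x; in 2.x the standard layers
# already select the cuDNN kernel, so the getattr fallback keeps them as-is.
if 'GPU' in str(device_lib.list_local_devices()):
    print("Using CUDA for RNN layers")
    LSTM = getattr(tf.keras.layers, 'CuDNNLSTM', LSTM)
    GRU = getattr(tf.keras.layers, 'CuDNNGRU', GRU)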

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(config.vocab_size,
                                    config.embedding_dims,
                                    input_length=config.maxlen))
model.add(LSTM(100))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=config.batch_size,
          epochs=config.epochs,
          validation_data=(X_test, y_test), callbacks=[wandb.keras.WandbCallback(save_model=False)])

In [0]:
# examples/lstm/imdb-classifier/imdb-embedding.py
import wandb
import tensorflow as tf
import numpy as np
import subprocess
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.python.client import device_lib
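from tensorflow.keras.layers import LSTM, GRU
try:
    # CuDNN-specific layers exist only in TensorFlow 1.x
    from tensorflow.keras.layers import CuDNNLSTM, CuDNNGRU
except ImportError:
    # TensorFlow 2.x: the standard layers use the cuDNN kernel automatically
    CuDNNLSTM, CuDNNGRU = LSTM, GRU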
from tensorflow.keras.datasets import imdb
import os

# set parameters:
wandb.init(entity="wandb", project="bloomberg-class")
config = wandb.config
config.concept = 'lstm-glove'
config.vocab_size = 1000
config.maxlen = 300
config.batch_size = 64
config.embedding_dims = 50
config.filters = 250
config.kernel_size = 3
config.hidden_dims = 100
config.epochs = 10

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=config.vocab_size)

X_train = sequence.pad_sequences(X_train, maxlen=config.maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=config.maxlen)

embeddings_index = dict()

# GloVe files are UTF-8 text: one word per line followed by its vector
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

embedding_matrix = np.zeros((config.vocab_size, config.embedding_dims))
for index in range(config.vocab_size):
    word = id_to_word[index]
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

# overide LSTM & GRU
if 'GPU' in str(device_lib.list_local_devices()):
    print("Using CUDA for RNN layers")
    LSTM = CuDNNLSTM
    GRU = CuDNNGRU

# create model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(config.vocab_size, config.embedding_dims, input_length=config.maxlen,
                                    weights=[embedding_matrix], trainable=True))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(config.hidden_dims))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=config.batch_size,
          epochs=config.epochs,
          validation_data=(X_test, y_test), callbacks=[TextLogger(X_test[:20], y_test[:20]), wandb.keras.WandbCallback(save_model=False)])

### Predicting Sentiment using Bidirectional LSTMs

Bidirectional RNNs are composed of two RNNs flowing in opposite directions, stacked on top of each other. The forward RNN reads the input sequence from start to end, while the backward RNN reads it from end to start. We combine their states by concatenating their vectors, and in doing so we're able to make predictions using the context both before and after each word.

![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573672185132_Screenshot+2019-11-13+11.09.42.png)
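
A toy sketch of the idea with scalar hidden states: one RNN reads the sequence forwards, another reads it backwards, and the two states for each position are paired up. The simple tanh recurrence and the weights are assumptions for illustration, not the LSTM used below.

```python
import math


def run_rnn(inputs, w_x, w_h):
    """Simple tanh recurrence; returns the hidden state at every position."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(inputs, forward_params=(0.5, 0.8), backward_params=(0.3, 0.5)):
    """Pair each position's forward state with its backward state."""
    forward_states = run_rnn(inputs, *forward_params)
    # Read the sequence end-to-start, then flip back to align positions.
    backward_states = run_rnn(inputs[::-1], *backward_params)[::-1]
    return list(zip(forward_states, backward_states))


first = bidirectional([1.0, 2.0, 3.0])
second = bidirectional([1.0, 9.0, 9.0])
# At position 0 the forward state has only seen the first input, while the
# backward state has already seen everything that comes after it.
print(first[0], second[0])
```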

In [0]:
# examples/lstm/imdb-classifier/imdb-lstm.py
import wandb
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import LSTM, GRU
from tensorflow.python.client import device_lib
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.datasets import imdb

# set parameters:
wandb.init(entity="wandb", project="bloomberg-class")
config = wandb.config
config.concept = 'lstm-bidir'
config.vocab_size = 1000
config.maxlen = 300
config.batch_size = 32
config.embedding_dims = 50
config.filters = 250
config.kernel_size = 3
config.hidden_dims = 100
config.epochs = 10

# Load and tokenize input
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=config.vocab_size)

# Ensure all input is the same size
X_train = sequence.pad_sequences(
    X_train, maxlen=config.maxlen)
X_test = sequence.pad_sequences(
    X_test, maxlen=config.maxlen)

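# Override LSTM & GRU with the CuDNN variants when a GPU is present.
# These layers exist only in TensorFlow 1.x; in 2.x the standard layers
# already select the cuDNN kernel, so the getattr fallback keeps them as-is.
if 'GPU' in str(device_lib.list_local_devices()):
    print("Using CUDA for RNN layers")
    LSTM = getattr(tf.keras.layers, 'CuDNNLSTM', LSTM)
    GRU = getattr(tf.keras.layers, 'CuDNNGRU', GRU)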

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(config.vocab_size,
                                    config.embedding_dims,
                                    input_length=config.maxlen))
model.add(tf.keras.layers.Bidirectional(LSTM(config.hidden_dims)))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=config.batch_size,
          epochs=config.epochs,
          validation_data=(X_test, y_test), callbacks=[TextLogger(X_test[:20], y_test[:20]),
                                                       wandb.keras.WandbCallback(save_model=False)])

### Seq2Seq Translation

Seq2seq models combine two RNNs: one serving as the encoder, the other as the decoder. They are well suited to tasks such as machine translation.

![](https://paper-attachments.dropbox.com/s_92F7A2BE132D5E4492B0E3FF3430FFF0FB2390A4135C0D77582A2D21A2EF8567_1573672286628_Screenshot+2019-11-13+11.11.20.png)

In [0]:
# examples/lstm/seq2seq/train.py
# adapted from https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
import tensorflow as tf
import numpy as np
import wandb

wandb.init(entity="wandb", project="bloomberg-class")
config = wandb.config
config.concept = 'seq2seq'


class CharacterTable(object):
    """Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    """

    def __init__(self, chars):
        """Initialize character table.
        # Arguments
            chars: Characters that can appear in the input.
        """
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    def encode(self, C, num_rows):
        """One hot encode given string C.
        # Arguments
            num_rows: Number of rows in the returned one hot encoding. This is
                used to keep the # of rows for each data the same.
        """
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    def decode(self, x, calc_argmax=True):
        if calc_argmax:
            x = x.argmax(axis=-1)
        return ''.join(self.indices_char[i] for i in x)


# Parameters for the model and dataset.
config.training_size = 10000
config.digits = 3
config.hidden_size = 64
config.batch_size = 64

# Maximum length of input is 'int + int' (e.g., '345+678'). Maximum length of
# int is DIGITS.
maxlen = config.digits + 1 + config.digits + 1 + config.digits

# All the digits, the plus and minus signs, and a space for padding.
chars = '0123456789+- '
ctable = CharacterTable(chars)

questions = []
expected = []
seen = set()
print('Generating data...')
while len(questions) < config.training_size:
    def f(): return int(''.join(np.random.choice(list('0123456789'))
                                for i in range(np.random.randint(1, config.digits + 1))))
    a, b = f(), f()
    # Skip any addition questions we've already seen
    # Also skip any such that x+Y == Y+x (hence the sorting).
    key = tuple(sorted((a, b)))
    if key in seen:
        continue
    seen.add(key)
    
    # Pad the data with spaces such that it is always MAXLEN.
    q = '{}+{}'.format(a, b)
    query = q + ' ' * (maxlen - len(q))
    ans = str(a + b)

    # Pad answer - Answers can be of maximum size DIGITS + 1.
    ans += ' ' * (config.digits + 1 - len(ans))

    questions.append(query)
    expected.append(ans)


def log_table(epoch, logs):
    # Select 10 samples from the validation set at random so we can visualize
    # errors.
    data = []
    print()
    for i in range(10):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
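        # predict_classes was removed from recent Keras versions; argmax over
        # the predicted probabilities gives the same class indices.
        preds = np.argmax(model.predict(rowx, verbose=0), axis=-1)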
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=False)
        print('Q', q, end=' ')
        print('T', correct, end=' ')
        if correct == guess:
            print('☑', end=' ')
        else:
            print('☒', end=' ')
        data.append([q, guess, correct])
        print(guess)
    wandb.log({"examples": wandb.Table(data=data)})


log_table_callback = tf.keras.callbacks.LambdaCallback(on_epoch_end=log_table)

print('Total addition questions:', len(questions))

print('Vectorization...')
# np.bool was removed from recent NumPy versions; the builtin bool is equivalent
x = np.zeros((len(questions), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(questions), config.digits + 1, len(chars)), dtype=bool)
for i, sentence in enumerate(questions):
    x[i] = ctable.encode(sentence, maxlen)
for i, sentence in enumerate(expected):
    y[i] = ctable.encode(sentence, config.digits + 1)

# Shuffle (x, y) in unison as the later parts of x will almost all be larger
# digits.
indices = np.arange(len(y))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]

# Explicitly set apart 10% for validation data that we never train over.
split_at = len(x) - len(x) // 10
(x_train, x_val) = x[:split_at], x[split_at:]
(y_train, y_val) = y[:split_at], y[split_at:]

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(config.hidden_size,
                               input_shape=(maxlen, len(chars))))
model.add(tf.keras.layers.RepeatVector(config.digits + 1))
model.add(tf.keras.layers.LSTM(config.hidden_size, return_sequences=True))
model.add(tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(len(chars), activation='softmax')))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
model.fit(x_train, y_train,
          batch_size=config.batch_size,
          epochs=100,
          validation_data=(x_val, y_val), callbacks=[wandb.keras.WandbCallback(), log_table_callback])


# Show predictions against the validation dataset.
for iteration in range(1, 10):
    print()
    print('-' * 50)
    print('Iteration', iteration)

    # Select 10 samples from the validation set at random so we can visualize
    # errors.
    for i in range(10):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
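        # predict_classes was removed from recent Keras versions; argmax over
        # the predicted probabilities gives the same class indices.
        preds = np.argmax(model.predict(rowx, verbose=0), axis=-1)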
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=False)
        print('Q', q, end=' ')
        print('T', correct, end=' ')
        if correct == guess:
            print('☑', end=' ')
        else:
            print('☒', end=' ')
        print(guess)