### MNIST Softmax Model with 1 Hidden Layer

MNIST is a dataset of handwritten digits, and it is a great dataset for writing a simple neural network (NN).  Each handwritten digit is represented by a 28 by 28 pixel matrix, with each pixel holding an intensity from 0 to 255.  Let's see what accuracy we can obtain in predicting the handwritten digit by using each pixel as an input feature.

We are going to load some MNIST training data into a NN and run it through 1 hidden layer that uses the sigmoid function.  Finally, we are going to run the output layer through the softmax function to turn its 10 outputs into a probability for each digit.
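
As a rough illustration of the two activation functions named above, here is a minimal sketch in plain Python (standard library only). The helper names `sigmoid` and `softmax` and the sample scores are assumptions made for this example; the notebook itself relies on `tf.nn.sigmoid` and `tf.nn.softmax`.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pre-activation values, chosen only for illustration.
hidden = [sigmoid(z) for z in (-2.0, 0.0, 2.0)]
probs = softmax([1.0, 2.0, 3.0])

print(hidden)
print(probs)
```

The sigmoid is applied to each hidden unit independently, while the softmax couples all outputs together so that the largest score receives the largest probability.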

In [51]:
import numpy as np
import tensorflow as tf

from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Let's throw away the vertical pixel adjacency data by flattening each 28x28 pixel image into a vector of length 28 * 28 = 784.

In [52]:
NUM_PIXELS = 28*28

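def normalize_input(x):
    # Flatten each 28x28 image into one row of NUM_PIXELS values.
    flattened = np.reshape(x, (x.shape[0], NUM_PIXELS))
    # The tiny constant avoids dividing by zero for pixels that never change.
    stddev = flattened.std(axis=0) + 1e-100
    mean = flattened.mean(axis=0)
    # Standardize each pixel position (column) independently.
    return (flattened - mean) / stddev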

# By subtracting the mean and dividing by the standard deviation, we give each
# pixel zero mean and a variance close to 1 (pixels that never change stay at 0).
# This does not make the inputs normally distributed, but it puts them on a common scale.
# Later on, when we feed Z values through the sigmoid function, we would like those
# values to also have a variance of about 1. The sigmoid is steepest near 0, so its
# derivative there is well away from 0, and we can achieve some success in moving
# in the correct direction when we take one step of stochastic gradient descent.

# In the alternate scenario, if we feed very large values through the squashing
# function, we will end up with derivatives that are near 0. In such a case, one
# step of gradient descent will not help us as much.

# For independent zero-mean values, the variance of a product is the product of
# the variances: var(x * w) = var(x) * var(w).
# The variance of a sum of independent terms is the sum of their variances, so
# Z = sum(x_i * w_i) has variance #inputs * var(x) * var(w). With var(x) = 1,
# 1/#inputs should be the variance of our weights
# ... at least to start.

new_x_train = normalize_input(x_train)
new_x_test = normalize_input(x_test)
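
The variance argument in the comments above can be checked numerically. The sketch below uses only the standard library and assumes independent standard-normal inputs, which real MNIST pixels are not, so treat it as an idealized check rather than a statement about this dataset. The names `sample_z`, `sigmoid_grad`, and `NUM_SAMPLES` are made up for the example.

```python
import math
import random
import statistics

NUM_INPUTS = 28 * 28   # same value as NUM_PIXELS in the notebook
NUM_SAMPLES = 500      # hypothetical sample count for this check

rng = random.Random(0)

def sample_z(weight_scale):
    """Draw one pre-activation Z = sum_i x_i * w_i with fresh x and w."""
    return sum(
        rng.gauss(0.0, 1.0) * rng.gauss(0.0, weight_scale)
        for _ in range(NUM_INPUTS)
    )

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Weights with standard deviation 1/sqrt(#inputs), as in W1, versus standard deviation 1.
var_scaled = statistics.variance(
    [sample_z(1.0 / math.sqrt(NUM_INPUTS)) for _ in range(NUM_SAMPLES)]
)
var_unscaled = statistics.variance(
    [sample_z(1.0) for _ in range(NUM_SAMPLES)]
)

print("var(Z) with scaled weights:  ", var_scaled)
print("var(Z) with unscaled weights:", var_unscaled)
print("sigmoid gradient at 0: ", sigmoid_grad(0.0))
print("sigmoid gradient at 10:", sigmoid_grad(10.0))
```

Under these assumptions the scaled weights keep the variance of Z near 1, while unit-variance weights push it toward the number of inputs, where the sigmoid gradient is close to 0.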

In [53]:
def build_graph():
    X = tf.placeholder(shape=[None, NUM_PIXELS], dtype=tf.float32, name="input")
    y = tf.placeholder(shape=[None], dtype=tf.int32, name="y")
    Y = tf.one_hot(y, 10)
    learning_rate = tf.placeholder(shape=[], dtype=tf.float32, name="learning_rate")
    
    #First layer 
    W1 = tf.Variable(
        np.random.normal(size = [NUM_PIXELS, 128], scale = 1/np.sqrt(NUM_PIXELS)),
        dtype=tf.float32
    )
    b1 = tf.Variable(np.zeros(shape = [128]), dtype=tf.float32)
    
    Z1 = tf.matmul(X, W1) + b1
    H1 = tf.nn.sigmoid(Z1)
    
    #Second Layer
    W2 = tf.Variable(
        np.random.normal(size = [128, 10], scale = 1/np.sqrt(128)),
        dtype=tf.float32
    )
    b2 = tf.Variable(np.zeros(shape = [10]), dtype=tf.float32)
    Z2 = tf.matmul(H1, W2) + b2 
    H2 = tf.nn.softmax(Z2) 
    
    # Calculating error: mean cross entropy
    negative_logs = -tf.log(H2)
    # Y is one-hot, so the sum keeps only -log(probability of the true digit).
    errors = tf.reduce_sum(Y * negative_logs, axis = 1) 
    mean_error = tf.reduce_mean(errors) 
    
    predictions = tf.argmax(H2, axis = 1, output_type = tf.int32)
    accuracy = tf.reduce_mean(
        tf.cast(tf.equal(y, predictions), tf.float32)
    )
    
    # Training
    optimizer = tf.train.GradientDescentOptimizer(
        learning_rate = learning_rate
    )
    train_step = optimizer.minimize(mean_error)
    
    return {
        "X": X, 
        "y": y,
        "learning_rate": learning_rate,
        "mean_error": mean_error,
        "accuracy": accuracy,
        "train_step": train_step,
    }
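
To make the error calculation in the graph concrete, here is a small plain-Python sketch of mean cross entropy on a made-up batch. It computes log-probabilities with the log-sum-exp trick instead of taking the log of the softmax output; that is a common, numerically safer variant and not what the notebook's graph does. The function names and the example scores are assumptions for illustration.

```python
import math

def log_softmax(scores):
    """Log-probabilities via the log-sum-exp trick (never takes log(0))."""
    highest = max(scores)
    log_total = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_total for s in scores]

def mean_cross_entropy(batch_scores, labels):
    """Average of -log p(correct class) over the batch."""
    losses = [
        -log_softmax(scores)[label]
        for scores, label in zip(batch_scores, labels)
    ]
    return sum(losses) / len(losses)

# Hypothetical scores (playing the role of Z2) for two examples and three classes.
batch_scores = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
labels = [0, 2]

loss = mean_cross_entropy(batch_scores, labels)
print(loss)
```

A model that spreads its probability evenly over 10 digits has a cross entropy of ln 10, roughly 2.30, which is close to the 2.36 reported at batch 0 in the training log.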
    

In [54]:
# Create our batches from x and y
BATCH_SIZE = 32
mnist_batches = []

for batch_start in range(0, new_x_train.shape[0], BATCH_SIZE):
    batch_x = new_x_train[batch_start:batch_start + BATCH_SIZE, :]
    batch_y = y_train[batch_start:batch_start + BATCH_SIZE]
    
    mnist_batches.append((batch_x, batch_y))
    # store batches in tuples in mnist_batches.
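
The slicing pattern above can be sketched on stand-in data to show how the batch count comes out. The helper `make_batches` is a hypothetical name for this example. The last line uses the fact that the MNIST training set has 60,000 images.

```python
def make_batches(xs, ys, batch_size):
    """Split two parallel sequences into a list of (x, y) batch tuples."""
    return [
        (xs[start:start + batch_size], ys[start:start + batch_size])
        for start in range(0, len(xs), batch_size)
    ]

# Stand-in data: 10 examples split into batches of 4.
xs = list(range(10))
ys = [x % 2 for x in xs]
batches = make_batches(xs, ys, batch_size=4)

print(len(batches))                      # number of batches
print([len(bx) for bx, _ in batches])    # the final batch may be shorter
print(len(range(0, 60000, 32)))          # batches per epoch for MNIST at BATCH_SIZE = 32
```

With 60,000 training images and a batch size of 32, every batch is full, and one epoch covers 1,875 batches. That is why the batch index in the training log restarts after 1800.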

In [55]:
LEARNING_RATE = 0.01

def train_batch(session, batch_x, batch_y, graph):
    session.run(
        graph["train_step"],
        feed_dict = {
            graph["X"]: batch_x,
            graph["y"]: batch_y,
            graph["learning_rate"]: LEARNING_RATE
        }
    )
    
def evaluate_model(batch_idx, session, graph):
    me, acc = session.run(
        [graph["mean_error"], graph["accuracy"]],
        feed_dict = {
            graph["X"]: new_x_test,
            graph["y"]: y_test,
        }
    )
    
    print(f'B: {batch_idx} | ME: {me:0.2f} | ACC: {acc:0.2f}')

def train_epoch(session, graph):
    for (batch_idx, (batch_x, batch_y)) in enumerate(mnist_batches):
        train_batch(
            session = session,
            batch_x = batch_x,
            batch_y = batch_y,
            graph = graph
        )
        
        if batch_idx % 100 == 0:
            evaluate_model(batch_idx, session, graph)


NUM_EPOCHS = 10

with tf.Session() as session:
    graph = build_graph()
    session.run(tf.global_variables_initializer())
    
    for epoch_idx in range(NUM_EPOCHS):
        train_epoch(session, graph)

B: 0 | ME: 2.36 | ACC: 0.13
B: 100 | ME: 1.91 | ACC: 0.59
B: 200 | ME: 1.60 | ACC: 0.68
B: 300 | ME: 1.38 | ACC: 0.74
B: 400 | ME: 1.20 | ACC: 0.78
B: 500 | ME: 1.07 | ACC: 0.80
B: 600 | ME: 0.96 | ACC: 0.82
B: 700 | ME: 0.87 | ACC: 0.84
B: 800 | ME: 0.81 | ACC: 0.84
B: 900 | ME: 0.75 | ACC: 0.85
B: 1000 | ME: 0.70 | ACC: 0.86
B: 1100 | ME: 0.66 | ACC: 0.86
B: 1200 | ME: 0.63 | ACC: 0.86
B: 1300 | ME: 0.60 | ACC: 0.87
B: 1400 | ME: 0.58 | ACC: 0.87
B: 1500 | ME: 0.56 | ACC: 0.87
B: 1600 | ME: 0.54 | ACC: 0.88
B: 1700 | ME: 0.52 | ACC: 0.88
B: 1800 | ME: 0.50 | ACC: 0.88
B: 0 | ME: 0.49 | ACC: 0.88
B: 100 | ME: 0.48 | ACC: 0.88
B: 200 | ME: 0.47 | ACC: 0.88
B: 300 | ME: 0.46 | ACC: 0.89
B: 400 | ME: 0.45 | ACC: 0.89
B: 500 | ME: 0.44 | ACC: 0.89
B: 600 | ME: 0.43 | ACC: 0.89
B: 700 | ME: 0.42 | ACC: 0.89
B: 800 | ME: 0.42 | ACC: 0.89
B: 900 | ME: 0.41 | ACC: 0.90
B: 1000 | ME: 0.40 | ACC: 0.90
B: 1100 | ME: 0.40 | ACC: 0.90
B: 1200 | ME: 0.39 | ACC: 0.90
B: 1300 | ME: 0.39 | ACC: 0.90
... (remaining output truncated)

We are able to achieve 94% accuracy on the test set in predicting handwritten digits after about 10 epochs.  Notice how much standardizing the inputs sped up training: about 88% accuracy on the test set after the first epoch (see the log above), as opposed to 80% if we had only scaled the inputs to the range 0-1.  I assume it would have been far worse than that had we not normalized at all, although in that case we would mostly have needed to adjust the learning rate.
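
The two normalization schemes compared above can be sketched side by side in plain Python. The pixel values are invented for illustration, and the helper names are assumptions. One design choice differs from the notebook: this sketch reuses the training mean and standard deviation for the test values, a common alternative to standardizing the test set with its own statistics as `normalize_input(x_test)` does.

```python
import statistics

def fit_standardizer(train_values, eps=1e-8):
    """Return the mean and the eps-padded standard deviation of the training values."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values) + eps
    return mean, std

def standardize(values, mean, std):
    """Shift and scale values using previously fitted statistics."""
    return [(v - mean) / std for v in values]

def scale_to_unit_range(values):
    """Map 0-255 intensities to the range 0-1."""
    return [v / 255.0 for v in values]

# Hypothetical intensities of one pixel position across a few images.
train_pixel = [0, 0, 0, 64, 128, 255]
test_pixel = [0, 32, 255]

mean, std = fit_standardizer(train_pixel)
train_std = standardize(train_pixel, mean, std)
test_std = standardize(test_pixel, mean, std)

print(scale_to_unit_range(train_pixel))
print(train_std)
print(test_std)
```

Scaling to 0-1 leaves every input non-negative with a mean above 0, whereas standardizing centers each pixel at 0 with unit variance, which matches the assumption behind the 1/#inputs weight variance used when initializing W1.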