# PIMA MLP in TensorFlow

This is the same 12-8-1 MLP model that was shown in Keras. Now I am using just TensorFlow operations.

## The beginning is the same as with Keras. Just loading the data

In [1]:
import numpy as np

# fix random seed for reproducibility
np.random.seed(7)
# load pima indians dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")

In [2]:
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
y = dataset[:,8]

## All of the TensorFlow examples use multi-class even when there is just one class

I'm sure there is a way to use the sigmoid function and just predict one class, but I wasn't readily finding it. All of the examples seem to use softmax multi-class predictors.

I kept this format since we will often use multi-class and since this shows the concept of [one-hot encoding](https://en.wikipedia.org/wiki/One-hot).

To one-hot encode, we simply take each possible value and turn it into its own category. So for a "diabetes"/"no diabetes" binary classifier, we need 2 bits (01 and 10). This means that each label becomes 2 y values: 01 for no diabetes and 10 for diabetes.

Sample # | Category | One Hot Encode
:-----: | ------| ------
1 | Diabetes | 1,0
2 | No Diabetes | 0,1
3 | Diabetes | 1,0
4 | Diabetes | 1,0
5 | Diabetes | 1,0
6 | No diabetes | 0,1
7 | No diabetes | 0,1

So below, we are turning our original one-column label of Diabetes (0,1), into two columns: the first column is if the person has diabetes, the second is if they don't. Note that this means 1,1 and 0,0 are meaningless for one-hot in this case.

Our softmax is going to give a prediction (probability) of which class we are in. So it will say something like: no diabetes = 0.34 and diabetes = 0.66. So we just need to find the argmax to get the predicted column: index 0 (diabetes) or index 1 (no diabetes).
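
As a minimal sketch of the encoding described above, the snippet below builds the two-column one-hot array and recovers the original label with argmax. The `labels` values are hypothetical and simply mirror the seven samples in the table; the variable names are chosen for illustration.

```python
import numpy as np

# Hypothetical labels for illustration: 1 = diabetes, 0 = no diabetes
labels = np.array([1, 0, 1, 1, 1, 0, 0])

# Column 0 is "diabetes", column 1 is "no diabetes"
one_hot = np.array([labels, 1 - labels]).T

# argmax returns the index of the hot column: 0 = diabetes, 1 = no diabetes
hot_column = np.argmax(one_hot, axis=1)

# Because diabetes sits in column 0, the original label is 1 - hot_column
recovered = 1 - hot_column
```

Note that the argmax index is flipped relative to the original 0/1 label, because the diabetes column comes first.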

In [3]:
y = np.array([y, 1-y]).T  # One hot encode the binary classifier

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Now we start the TensorFlow code

In [5]:
import tensorflow as tf

In [6]:
from tensorflow.contrib import layers
from tensorflow.contrib import learn

## This is how we set up our learning rate ($\alpha$) for backprop

We start at rate 0.01 and exponentially decrease $\alpha$ as the step counter grows: the rate is multiplied by `decay_rate` once every `decay_steps` steps. Note that `global_step=1` below is a constant, so as written the rate stays at 0.01; for the decay to take effect, `global_step` has to be a variable that the optimizer increments. The idea is along the same lines as an algorithm called [simulated annealing](http://katrinaeg.com/simulated-annealing.html). We start taking large steps (large $\alpha$) in our random walk and then take progressively shorter steps as we feel we are approaching the minimum value. In true simulated annealing, we are actually calculating the probability of taking a step in the new direction (based on how good the new direction seems to be and how close we think we are to the bottom).

In [7]:
learning_rate = tf.train.exponential_decay(learning_rate=0.01,
                                           global_step=1,
                                           decay_steps=X_train.shape[0],
                                           decay_rate=0.95,
                                           staircase=True)
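
To make the decay schedule concrete, here is a minimal plain-Python sketch of the formula behind `tf.train.exponential_decay`: `learning_rate * decay_rate ^ (global_step / decay_steps)`, with integer division when `staircase=True`. The helper name `exponential_decay` and the value 614 for `decay_steps` are assumptions for illustration, not part of the notebook.

```python
def exponential_decay(base_rate, global_step, decay_steps, decay_rate, staircase=True):
    """Plain-Python sketch of base_rate * decay_rate ** (global_step / decay_steps)."""
    exponent = global_step / decay_steps
    if staircase:
        # Integer division: the rate drops in discrete steps
        exponent = global_step // decay_steps
    return base_rate * decay_rate ** exponent

decay_steps = 614  # assumed number of training rows, for illustration only

# A constant step of 1 gives exponent 0, so the rate is unchanged
rate_at_step_1 = exponential_decay(0.01, 1, decay_steps, 0.95)

# A step counter that has grown to 10 * decay_steps gives exponent 10
rate_at_step_6140 = exponential_decay(0.01, 10 * decay_steps, decay_steps, 0.95)
```

With the constant `global_step=1` the exponent is 0, so the rate is still 0.01; only a growing step counter produces smaller rates.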

## Defining the model parameters

In [8]:
training_epochs = 150   # Let's go through the whole training set 150 times
batch_size = 10         # Let's consider about 10 training examples at a time (so len(X_train)/10 iterations per epoch)
display_step = 10       # Let's print out the average cost every 10 epochs
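
The training loop further down turns `batch_size` into a batch count and then splits the data with `np.array_split`. The sketch below shows that arithmetic on a hypothetical 25-row array (`X_demo` is made up for illustration), which makes it easy to see that the resulting batches can be slightly larger than `batch_size`.

```python
import numpy as np

batch_size = 10
X_demo = np.arange(25 * 8).reshape(25, 8)  # hypothetical data: 25 rows, 8 features

total_batch = int(len(X_demo) / batch_size)      # 25 / 10 -> 2 batches per epoch
X_batches = np.array_split(X_demo, total_batch)  # split into 2 nearly equal chunks

batch_lengths = [len(b) for b in X_batches]      # 13 and 12 rows
```

Because the batch count is rounded down, every row is still used in each epoch; the leftover rows are spread over the first few batches instead of being dropped.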

## These are the hyperparameters.

We need to define these ahead of time. Basically, the number of layers and the number of neurons in each layer.

In [9]:
n_hidden_1 = 12                   # 12 neurons in first layer
n_hidden_2 = 8                    # 8 neurons in second layer
n_hidden_3 = 1                    # 1 neuron in the third layer
n_input_features = np.shape(X_train)[1]    # Our input layer depends on shape of data features
n_classes = np.shape(y_train)[1]  # Our output layer depends on shape of data label
dropout = 1.0-0.2             # Keep probability for tf.nn.dropout (20% of the neurons will drop out randomly for each pass)

## Remember TensorFlow Placeholders?

These are variables that will expect a stream of external data. We use these for our X and y variables.

Note that with x, we do not assume how many rows of data we have. Instead we just tell it [None]. So the TensorFlow graph just knows it will receive an unknown number of n_input_features vectors (i.e. our features per training example). This is really nice because the algorithm is set to handle 1 to 900 zillion examples.

In [10]:
x = tf.placeholder("float", [None, n_input_features])  
y = tf.placeholder("float", [None, n_classes])

## Here's the neural network

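In Keras we did this (note that the TensorFlow version below uses 8 neurons in the second layer and a 20% dropout rate instead):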

~~~~
model = Sequential() 
model.add(Dense(12, input_dim=8, activation='relu')) # Layer 1
model.add(Dense(6, activation='relu'))               # Layer 2
model.add(Dropout(0.5))                              # Dropout
model.add(Dense(1, activation='sigmoid'))            # Output layer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
~~~~

In [11]:
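def model(x, weights, biases):
    """Build the MLP graph and return the output logits (no softmax applied)."""

    # First hidden layer with relu activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)

    # Second hidden layer with relu activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)

    # Dropout: 'dropout' is the keep probability. It is a constant in the graph,
    # so it is also applied when we evaluate accuracy later on.
    layer_2 = tf.nn.dropout(layer_2, dropout)

    # Third layer: a single linear neuron (no activation)
    layer_3 = tf.add(tf.matmul(layer_2, weights['h3']), biases['b3'])

    # Output layer: one logit per class; softmax is applied inside the cost Op
    out_layer = tf.matmul(layer_3, weights['out']) + biases['out']

    return out_layer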
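
To check the shapes flowing through this network without building a TensorFlow graph, here is a rough NumPy sketch of the same forward pass. It is an illustration under assumptions: the names `forward`, `sizes`, and `x_demo` are hypothetical, the weights are random, the biases are zero for simplicity, and dropout is left out.

```python
import numpy as np

rng = np.random.RandomState(0)

def forward(x, weights, biases):
    layer_1 = np.maximum(0, x @ weights['h1'] + biases['b1'])        # relu
    layer_2 = np.maximum(0, layer_1 @ weights['h2'] + biases['b2'])  # relu
    layer_3 = layer_2 @ weights['h3'] + biases['b3']                 # linear
    return layer_3 @ weights['out'] + biases['out']                  # logits

# Layer shapes: 8 inputs -> 12 -> 8 -> 1 -> 2 classes
sizes = {'h1': (8, 12), 'h2': (12, 8), 'h3': (8, 1), 'out': (1, 2)}
weights = {name: rng.uniform(0.1, 0.4, size=shape) for name, shape in sizes.items()}
biases = {'b1': np.zeros(12), 'b2': np.zeros(8), 'b3': np.zeros(1), 'out': np.zeros(2)}

x_demo = rng.uniform(size=(5, 8))          # 5 hypothetical examples, 8 features each
logits = forward(x_demo, weights, biases)  # one row of 2 logits per example
```

Each input row ends up as two logits, one per one-hot column, regardless of how many rows are fed in.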


## In Keras we didn't have to explicitly set the initial values of the weights.

(Although we could). There are lots of algorithms for setting the initial values. Here we are simply setting the weights based on a uniform distribution from 0.1 to 0.4 (the biases use the `tf.random_uniform` default range of 0 to 1).

Note that here we use the TensorFlow **Variable** type. We don't use placeholders because these values are all internal to the graph. We'll update them just like any other internal variable. Also, note that TF uses lazy execution. So these Variables are not actually allocated or assigned until explicitly told to do so (by running the session). We'll use the TensorFlow command tf.global_variables_initializer() to set all the variables to their initial values at the start of the graph execution.

In [12]:
minval = 0.1
maxval = 0.4

weights = {
    'h1': tf.Variable(tf.random_uniform(shape=(n_input_features,n_hidden_1),minval=minval, maxval=maxval, dtype=tf.float32, seed=0)),
    'h2': tf.Variable(tf.random_uniform(shape=(n_hidden_1, n_hidden_2),minval=minval, maxval=maxval, dtype=tf.float32, seed=0)),
    'h3': tf.Variable(tf.random_uniform(shape=(n_hidden_2, n_hidden_3),minval=minval, maxval=maxval, dtype=tf.float32, seed=0)),
    'out': tf.Variable(tf.random_uniform(shape=(n_hidden_3, n_classes),minval=minval, maxval=maxval, dtype=tf.float32, seed=0))
}

biases = {
    'b1': tf.Variable(tf.random_uniform([n_hidden_1])),
    'b2': tf.Variable(tf.random_uniform([n_hidden_2])),
    'b3': tf.Variable(tf.random_uniform([n_hidden_3])),
    'out': tf.Variable(tf.random_uniform([n_classes]))
}


## TensorFlow Op

Now we assign a TensorFlow operation (Op) called $pred$ which contains our custom function $model$.  The key here is that $model$ can be copied to multiple GPU/CPUs and run in parallel with our data. No training example is dependent on any other.

As with Spark, TensorFlow's lazy execution won't actually run the $pred$ op graph until explicitly needed.

In [13]:
pred = model(x, weights, biases)

## Cost is yet another Op in our graph.

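Here we create another operation called "cost" which tells us how close our MLP's prediction ($pred$) is to the true value ($y$). Since $pred$ holds raw scores (logits), the Op first applies softmax to turn them into probabilities and then computes the cross-entropy, averaged over the batch. For our two classes, with $p$ the predicted probability of the first class and $y$ the first column of the one-hot label, the equation is:

$$-y \log(p) - (1 - y) \log(1 - p)$$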


In [14]:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
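
As a rough NumPy sketch of what this cost computes, the snippet below applies softmax to some logits, takes the cross-entropy against one-hot labels, and checks that with two classes it matches the binary form of the equation. The helper names and the example `logits` and `labels` are hypothetical values chosen for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_cross_entropy(logits, labels):
    # Per-example loss: -sum_k labels_k * log(softmax(logits)_k)
    return -(labels * np.log(softmax(logits))).sum(axis=1)

logits = np.array([[2.0, 0.5], [0.1, 1.2]])  # hypothetical network outputs
labels = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot: diabetes, no diabetes

per_example = softmax_cross_entropy(logits, labels)
cost = per_example.mean()

# With two classes this matches the binary form, using p = P(column 0)
p = softmax(logits)[:, 0]
y0 = labels[:, 0]
binary_form = -y0 * np.log(p) - (1 - y0) * np.log(1 - p)
```

The two forms agree because, with only two classes, the probability of the second column is simply one minus the probability of the first.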

## Now our gradient descent Op

We'll use [Adam](https://arxiv.org/abs/1412.6980) as our gradient descent algorithm (backpropagation supplies the gradients). Adam uses *momentum*, which is a running average of the gradients over the last few iterations, together with a running average of the squared gradients that scales the step size of each weight individually. So it tends to be less sensitive to the choice of learning rate than plain gradient descent. In short, Adam usually gets you to the bottom of the hill with much less hand tuning of your learning rate ($\alpha$).

In [15]:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
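
For intuition, here is a small plain-Python sketch of the update rule from the Adam paper, applied to a toy one-dimensional problem. It is not TensorFlow's implementation; the function name `adam_minimize`, the toy objective, and the learning rate of 0.1 are assumptions made for illustration.

```python
import math

def adam_minimize(grad_fn, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=1000):
    m = 0.0  # running average of the gradient (the "momentum" term)
    v = 0.0  # running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for starting at zero
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w

# Hypothetical toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
```

Dividing by the square root of `v_hat` is what normalizes the step size, so the first steps here are about `learning_rate` in size even though the raw gradient is much larger.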

## Now our initialize variables Op

This sets all of our TF variables (e.g. weights, biases, and the optimizer's internal state) to their initial values. If we don't do this, then TensorFlow will throw a runtime error because the variables haven't been initialized.

In [16]:
init = tf.global_variables_initializer()

## Now run the TF graph in a session

In [17]:
with tf.Session() as sess:
    
    sess.run(init)   # First run the variable initializing Op
    
    # Training cycle
    for epoch in range(training_epochs):
        
        avg_cost = 0.
        
        total_batch = int(len(X_train) / batch_size)  # Calculate the number of batches

        X_batches = np.array_split(X_train, total_batch)  # Split the features into batches
        Y_batches = np.array_split(y_train, total_batch)  # Split the labels into matching batches

        # Loop over all batches
        for i in range(total_batch):
            
            batch_x, batch_y = X_batches[i], Y_batches[i]  # Here's how we feed the PlaceHolder

            # We run both the optimizer and cost Op nodes of the graph
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
            
            # Compute average loss
            avg_cost += c / total_batch
            
        # Display logs every display_step epochs
        if epoch % display_step == 0:
            print("epoch {}:\t cost = {:.3f}".format(epoch+1, avg_cost))
            
    
    # Now let's use our model to get the accuracy for the training and test set predictions
    # This is another Op on our graph. We test whether the argmax of 'pred' and 'y' agree.
    
    # We one-hot encoded so the prediction will be determined by which of the two outputs is larger
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))

    # Calculate accuracy - Yet another Op
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    print("")
    
    # Now execute the 'accuracy' Op (which calls the 'correct_prediction' op which calls the 'model op')
    print("Accuracy on training set = {:.4f}".format(accuracy.eval({x: X_train, y: y_train})))
    
    # Same Op again, this time fed with the held-out test set
    print("Accuracy on test set = {:.4f}".format(accuracy.eval({x: X_test, y: y_test})))
    
    predTest = sess.run(tf.argmax(pred, 1), feed_dict={x: X_test, y: y_test})

epoch 1:	 cost = 5.803
epoch 11:	 cost = 0.630
epoch 21:	 cost = 0.573
epoch 31:	 cost = 0.571
epoch 41:	 cost = 0.566
epoch 51:	 cost = 0.550
epoch 61:	 cost = 0.556
epoch 71:	 cost = 0.544
epoch 81:	 cost = 0.532
epoch 91:	 cost = 0.508
epoch 101:	 cost = 0.520
epoch 111:	 cost = 0.522
epoch 121:	 cost = 0.520
epoch 131:	 cost = 0.508
epoch 141:	 cost = 0.485

Accuracy on training set = 0.7866
Accuracy on test set = 0.7597
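
The accuracy Op above can be mirrored in NumPy to see exactly what is being counted. This is a small sketch with hypothetical `logits` and `labels`, not the notebook's data.

```python
import numpy as np

# Hypothetical network outputs and one-hot ground truth for four examples
logits = np.array([[2.0, 0.5], [0.1, 1.2], [0.3, 0.9], [1.5, 1.4]])
labels = np.array([[1, 0], [0, 1], [1, 0], [1, 0]])

# A prediction is correct when the largest logit is in the hot column
correct_prediction = np.argmax(logits, axis=1) == np.argmax(labels, axis=1)

# Accuracy is the fraction of correct predictions
accuracy = correct_prediction.astype(np.float32).mean()
```

Here three of the four examples are classified correctly, so the accuracy is 0.75.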
