# Exercise 5: Multilayer perceptron

This exercise focuses on the multilayer perceptron. To make its implementation simpler, an introduction to TensorFlow is given first.

## 5.1 Introduction to TensorFlow

TensorFlow is an open-source symbolic math software library used for machine learning applications such as neural networks. The code in this exercise uses the TensorFlow 1.x graph API (sessions and placeholders). The following command imports TensorFlow in Python code:

In [1]:
import tensorflow as tf

### 5.1.1 Constants, sessions, and operations
TensorFlow is built around tensors, i.e. *n*-dimensional arrays of a given type. The three main tensor types in TensorFlow are constants, variables, and placeholders. To create a constant, the [tf.constant()](https://www.tensorflow.org/api_docs/python/tf/constant) method is used:

In [2]:
c=tf.constant(2)
print(c)

Tensor("Const:0", shape=(), dtype=int32)


Like other tensors, this constant tensor has a value, a shape, a data type, and a name. These can be specified directly:

In [3]:
c=tf.constant(3, shape=(2, 5), dtype=tf.float32, name="our_constant")
print(c)

Tensor("our_constant:0", shape=(2, 5), dtype=float32)


To evaluate a tensor, a [Session](https://www.tensorflow.org/api_docs/python/tf/Session) instance is required. Sessions are environments where tensors and operations are executed. A session can be created and then used for evaluation as follows:

In [4]:
session=tf.Session()
print(session.run(c))

[[ 3.  3.  3.  3.  3.]
 [ 3.  3.  3.  3.  3.]]


Other useful ways of creating constant tensors include the methods [tf.zeros()](https://www.tensorflow.org/api_docs/python/tf/zeros) and [tf.ones()](https://www.tensorflow.org/api_docs/python/tf/ones):

In [5]:
z=tf.zeros((2, 3))
print(session.run(z))

o=tf.ones((3, 1))
print(session.run(o))

[[ 0.  0.  0.]
 [ 0.  0.  0.]]
[[ 1.]
 [ 1.]
 [ 1.]]


The most common methods to create tensors with random values are [tf.random_uniform()](https://www.tensorflow.org/api_docs/python/tf/random_uniform) and [tf.random_normal()](https://www.tensorflow.org/api_docs/python/tf/random_normal):

In [6]:
u=tf.random_uniform(shape=(2, 4), minval=2, maxval=7)
print(session.run(u))

n=tf.random_normal(shape=(2, 4), mean=0, stddev=1)
print(session.run(n))

[[ 2.50471497  3.20345545  2.22720933  3.15086555]
 [ 6.02471256  5.7553606   2.40120959  4.41457987]]
[[-0.38866007  0.04213262 -0.35970542  1.1231339 ]
 [-1.06264019 -0.05452274  0.16148189  1.00665569]]


Addition, subtraction, multiplication, and division can be applied to tensors by using the operators +, -, *, and / or by calling [tf.add()](https://www.tensorflow.org/api_docs/python/tf/add), [tf.subtract()](https://www.tensorflow.org/api_docs/python/tf/subtract), [tf.multiply()](https://www.tensorflow.org/api_docs/python/tf/multiply), and [tf.divide()](https://www.tensorflow.org/api_docs/python/tf/divide). Each of these operations is performed element-wise. For example, when applied to matrices, [tf.multiply()](https://www.tensorflow.org/api_docs/python/tf/multiply) performs element-wise multiplication, not matrix multiplication like [tf.matmul()](https://www.tensorflow.org/api_docs/python/tf/matmul).

In [7]:
a=tf.ones((1, 2))
b=2*tf.ones((1, 2))

print(session.run(a+b+5))
print(session.run(tf.subtract(a, b)))
print(session.run(a*b))
print(session.run(tf.divide(a, b)))

[[ 8.  8.]]
[[-1. -1.]]
[[ 2.  2.]]
[[ 0.5  0.5]]
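
The difference between element-wise and matrix multiplication is easiest to see on a small example. The sketch below uses NumPy rather than TensorFlow so that it runs without a session; the operators `*` and `@` are used here only as analogues of `tf.multiply()` and `tf.matmul()`, and the matrices are made up for illustration.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])

# Element-wise product: each entry of a is multiplied by the matching entry of b
elementwise = a * b

# Matrix product: rows of a are combined with columns of b
matrix_product = a @ b

print(elementwise)
print(matrix_product)
```

The two results differ in every entry, which is why mixing up the two operations in a layer definition silently produces a different model.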


Some other operations include [tf.abs()](https://www.tensorflow.org/api_docs/python/tf/abs), [tf.exp()](https://www.tensorflow.org/api_docs/python/tf/exp), [tf.matmul()](https://www.tensorflow.org/api_docs/python/tf/matmul), [tf.pow()](https://www.tensorflow.org/api_docs/python/tf/pow), [tf.square()](https://www.tensorflow.org/api_docs/python/tf/square), and [tf.transpose()](https://www.tensorflow.org/api_docs/python/tf/transpose).

In [8]:
print(session.run(tf.transpose(tf.abs(tf.random_normal((1, 5), mean=0, stddev=3)))))

[[ 1.27414322]
 [ 1.71011102]
 [ 4.21111012]
 [ 3.64093161]
 [ 2.93537712]]


### 5.1.2 Placeholders and variables
More complex data and computations use [placeholders](https://www.tensorflow.org/api_docs/python/tf/placeholder) and [variables](https://www.tensorflow.org/api_docs/python/tf/Variable). Placeholders stand in for values that are supplied later and mostly serve as the input to the network. For this reason, a placeholder cannot be evaluated directly unless its value is "fed", i.e. passed in through the `feed_dict` dictionary.

In [9]:
#we create a placeholder
a=tf.placeholder(dtype=tf.float32)

#this would produce an error
#print(session.run(a))

#but not this
print(session.run(a, feed_dict={a:-5}))

#we use its value later
b=tf.abs(a)

#to provide a value to the placeholder, feed_dict is used
print(session.run(b, feed_dict={a:-5}))
#we can also use different input size
print(session.run(b, feed_dict={a:[-17, 1, -2]}))

print("\n\n")

m1=tf.placeholder(dtype=tf.float32)
m2=tf.placeholder(dtype=tf.float32)

p=tf.matmul(m1, m2)
print(session.run(p, feed_dict={m1:[[1], [2], [3]], m2:[[1, 2, 3]]}))
print(session.run(p, feed_dict={m1:[[1, 2, 3]], m2:[[1], [2], [3]]}))


-5.0
5.0
[ 17.   1.   2.]



[[ 1.  2.  3.]
 [ 2.  4.  6.]
 [ 3.  6.  9.]]
[[ 14.]]


Variables are mostly used for trainable parameters. While constants are initialized when created, variables are initialized within the session by an initialization procedure that must be defined. New values can be assigned manually using [tf.assign()](https://www.tensorflow.org/api_docs/python/tf/assign), but variables are mostly changed during the optimization process.


In [10]:
session=tf.Session()
#a constant is used for the initialization procedure
a=tf.Variable(3)
#random values will be used for initialization
b=tf.Variable(tf.random_uniform(shape=(2, 3)))

#this would produce an error since the variable has not been initialized - only the initialization procedure has been defined
#print(session.run(a))

#initialize all variables
session.run(tf.global_variables_initializer())
#now evaluate the variable
print(session.run(a))
print(session.run(b))

3
[[ 0.81913841  0.51047146  0.99225414]
 [ 0.04949152  0.70751309  0.2673707 ]]


### 5.1.3 Linear regression
As an example of a complete model, let's now implement simple multivariate linear regression using TensorFlow. The model is $y=\mathbf{w}^{T}\mathbf{x}+b$.

In [11]:
#data placeholders - these will be used for the given features and for the ground-truth value of y
x=tf.placeholder(dtype=tf.float32, shape=[None, 3])
y=tf.placeholder(dtype=tf.float32, shape=[None, 1])

#parameter variables
w=tf.Variable(tf.random_normal(shape=(3, 1)))
b=tf.Variable(tf.random_normal([1, 1]))

#the model for y - this will be used for the predicted value of y
y_predicted=tf.matmul(x, w)+b

TensorFlow trains a model, i.e. learns its parameter values, by minimizing a loss function that needs to be defined. The minimization is carried out by an optimizer object through its [minimize()](https://www.tensorflow.org/api_docs/python/tf/train/Optimizer#minimize) method. The learning rate chosen when defining the optimizer object and the number of training epochs have a significant impact on training: the learning rate influences how fast the learning process converges, and the number of epochs determines how long it runs.
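
For a mean squared error loss and plain gradient descent, one training step can be written out explicitly. The following is a sketch under the assumption of $N$ training samples $(\mathbf{x}_i, y_i)$, a scalar bias $b$, and a learning rate $\eta$; these symbols are introduced here for illustration only.

$$
\hat{y}_i=\mathbf{w}^{T}\mathbf{x}_i+b,\qquad L=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i-y_i\right)^{2}
$$

$$
\nabla_{\mathbf{w}}L=\frac{2}{N}\sum_{i=1}^{N}\left(\hat{y}_i-y_i\right)\mathbf{x}_i,\qquad \frac{\partial L}{\partial b}=\frac{2}{N}\sum_{i=1}^{N}\left(\hat{y}_i-y_i\right)
$$

$$
\mathbf{w}\leftarrow\mathbf{w}-\eta\,\nabla_{\mathbf{w}}L,\qquad b\leftarrow b-\eta\,\frac{\partial L}{\partial b}
$$

TensorFlow derives these gradients automatically from the computation graph, so they do not have to be written by hand.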

In [12]:
#the loss function will be mean square
loss=tf.reduce_mean(tf.square(y_predicted-y))

#gradient descent optimizer with learning rate 0.1
optimizer=tf.train.GradientDescentOptimizer(0.1)

#train operation
train=optimizer.minimize(loss)

#generate data for regression
import numpy as np
w_real=np.array([[1], [3], [-2]])
x_train=np.random.normal(size=(100, 3))
y_train=x_train@w_real

session.run(tf.global_variables_initializer())
for epoch in range(100):
    session.run(train, feed_dict={x:x_train, y:y_train})
    if ((epoch+1)%10==0):
        print("Epoch #"+str(epoch+1)+": "+str(session.run(loss, feed_dict={x:x_train, y:y_train})))

#print the trained weights
print(session.run(w))

Epoch #10: 0.402786
Epoch #20: 0.023619
Epoch #30: 0.00153801
Epoch #40: 0.000104485
Epoch #50: 7.22282e-06
Epoch #60: 5.02955e-07
Epoch #70: 3.50975e-08
Epoch #80: 2.45947e-09
Epoch #90: 1.73495e-10
Epoch #100: 1.24497e-11
[[ 0.99999839]
 [ 2.99999762]
 [-1.99999654]]
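
To make the training loop less of a black box, the same regression can be sketched in plain NumPy with the gradients of the mean squared error computed by hand. This is an illustrative re-implementation, not what TensorFlow executes internally; the fixed random seed and the scalar bias are assumptions made here for reproducibility and brevity.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption for reproducibility

# Same synthetic setup as above: noise-free targets generated with zero bias
w_real = np.array([[1.0], [3.0], [-2.0]])
x_train = rng.normal(size=(100, 3))
y_train = x_train @ w_real

# Randomly initialized parameters
w = rng.normal(size=(3, 1))
b = float(rng.normal())
learning_rate = 0.1

for epoch in range(100):
    error = x_train @ w + b - y_train                  # shape (100, 1)
    grad_w = 2.0 * x_train.T @ error / len(x_train)    # gradient of the MSE w.r.t. w
    grad_b = 2.0 * float(np.mean(error))               # gradient of the MSE w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

loss = float(np.mean((x_train @ w + b - y_train) ** 2))
print(loss)
print(w.ravel(), b)
```

As in the TensorFlow run, the learned weights should end up close to `w_real` and the bias close to zero.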


## 5.2 The XOR problem
XOR samples are not linearly separable. However, they can be separated by introducing non-linearities. The non-linear activation functions available in TensorFlow include [tf.sigmoid()](https://www.tensorflow.org/api_docs/python/tf/sigmoid), [tf.tanh()](https://www.tensorflow.org/api_docs/python/tf/tanh), [tf.nn.relu()](https://www.tensorflow.org/api_docs/python/tf/nn/relu), etc. Besides the common [tf.train.GradientDescentOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer), there are other optimizers as well, e.g. [tf.train.AdamOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer).
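
The claim that XOR is not linearly separable can be checked numerically. The NumPy sketch below (an illustration, not part of the exercise code) fits the best linear model in the least-squares sense to the four XOR samples; the appended column of ones is a standard trick, assumed here, for learning the bias as an extra weight.

```python
import numpy as np

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Append a column of ones so that the bias is learned as a third weight
x_with_bias = np.hstack([x, np.ones((4, 1))])

# Least-squares solution of the linear model y ~ x @ w + b
params, *_ = np.linalg.lstsq(x_with_bias, y, rcond=None)
predictions = x_with_bias @ params
mse = float(np.mean((predictions - y) ** 2))

print(params.ravel())       # both weights are (numerically) 0, the bias is 0.5
print(predictions.ravel())  # every sample is predicted as 0.5
print(mse)                  # 0.25
```

The best linear model ignores the inputs and predicts 0.5 for every sample, giving a mean squared error of 0.25. A loss that stalls at exactly this value during training indicates that the network is not yet doing better than a linear model.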

**Task**

Below is the code for solving the XOR problem in TensorFlow. Determine how many epochs are required for the training process to converge for each combination of the chosen activation function, optimizer, and various learning rates. Which combination turned out to be the best?

In [25]:
#activation function of the hidden layer
#activation_type=tf.nn.sigmoid
#activation_type=tf.nn.tanh
activation_type=tf.nn.relu

#optimizer_type=tf.train.GradientDescentOptimizer
optimizer_type=tf.train.AdamOptimizer

learning_rate=0.1

#training stops once the loss drops below this value
threshold=1e-4

session=tf.Session()

#training data
x_train=np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_train=np.array([[0], [1], [1], [0]])

x=tf.placeholder(tf.float32, [None, 2])
y=tf.placeholder(tf.float32, [None, 1])

w1=tf.Variable(tf.random_uniform((2, 2)))
b1=tf.Variable(tf.random_uniform([2]))

w2=tf.Variable(tf.random_uniform((2, 1)))
b2=tf.Variable(tf.random_uniform([1]))


f1=tf.matmul(x, w1)+b1
f2=activation_type(f1)
y_predicted=tf.matmul(f2, w2)+b2

loss=tf.reduce_mean(tf.square(y_predicted-y))

optimizer=optimizer_type(learning_rate)
train=optimizer.minimize(loss)

session.run(tf.global_variables_initializer())

for epoch in range(10000):
    session.run(train, feed_dict={x:x_train, y:y_train})
    error=session.run(loss, feed_dict={x:x_train, y:y_train})
    if ((epoch+1)%100==0):
        print("Epoch #"+str(epoch+1)+": "+str(error))
    if (error<threshold):
        print("Threshold passed at epoch #"+str(epoch+1)+".")
        break

session.close()

Epoch #100: 0.250031
Epoch #200: 0.25
Epoch #300: 0.25
Epoch #400: 0.25
Epoch #500: 0.25
Epoch #600: 0.25
Epoch #700: 0.25
Epoch #800: 0.25
Epoch #900: 0.25
Epoch #1000: 0.25
Epoch #1100: 0.25
Epoch #1200: 0.25
Epoch #1300: 0.25
Epoch #1400: 0.25
Epoch #1500: 0.25
Epoch #1600: 0.25
Epoch #1700: 0.25
Epoch #1800: 0.25
Epoch #1900: 0.25
Epoch #2000: 0.25
Epoch #2100: 0.25
Epoch #2200: 0.25
Epoch #2300: 0.25
Epoch #2400: 0.25
Epoch #2500: 0.25
Epoch #2600: 0.25
Epoch #2700: 0.25
Epoch #2800: 0.25
Epoch #2900: 0.25
Epoch #3000: 0.25
Epoch #3100: 0.25
Epoch #3200: 0.25
Epoch #3300: 0.25
Epoch #3400: 0.25
Epoch #3500: 0.25
Epoch #3600: 0.25
Epoch #3700: 0.25
Epoch #3800: 0.25
Epoch #3900: 0.25
Epoch #4000: 0.25
Epoch #4100: 0.25
Epoch #4200: 0.25
Epoch #4300: 0.25
Epoch #4400: 0.25
Epoch #4500: 0.25
Epoch #4600: 0.25
Epoch #4700: 0.25
Epoch #4800: 0.25
Epoch #4900: 0.25
Epoch #5000: 0.25
Epoch #5100: 0.25
Epoch #5200: 0.25
Epoch #5300: 0.25
Epoch #5400: 0.25
Epoch #5500: 0.25
Epoch #5600: 0.

## Results

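| Activation | Optimizer | Epochs until the threshold was passed |
|---|---|---|
| sigmoid | GradientDescentOptimizer | 4002 |
| sigmoid | AdamOptimizer | 221 |
| tanh | GradientDescentOptimizer | 862 |
| tanh | AdamOptimizer | 116 |
| relu | GradientDescentOptimizer | not reached within 10000 |
| relu | AdamOptimizer | not reached within 10000 |

With ReLU the loss stalled at 0.25 in both cases. ReLU itself is not linear (it is piecewise linear), so this is not a limit of the activation function. A likely explanation is the initialization: all weights and biases are drawn with `tf.random_uniform`, i.e. from [0, 1), so every hidden pre-activation starts out non-negative on all four inputs. ReLU then acts as the identity and the network initially behaves like a linear model, whose best fit for XOR is the constant 0.5 with a mean squared error of 0.25.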
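
ReLU is piecewise linear rather than linear, and a 2-2-1 ReLU network with the same structure as the one above can represent XOR exactly. The NumPy sketch below demonstrates this with hand-picked weights; the values are chosen for illustration and are not the result of any training run.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(z, 0) applied element-wise."""
    return np.maximum(z, 0.0)

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked parameters (illustrative): both hidden units see x1 + x2,
# the second one is shifted by -1 so that it only fires for the input (1, 1)
w1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([[1.0], [-2.0]])
b2 = np.array([0.0])

hidden = relu(x @ w1 + b1)
y_predicted = hidden @ w2 + b2

print(y_predicted.ravel())  # [0. 1. 1. 0.]
```

Since such a solution exists, a ReLU run that does not converge points to the optimization (initialization, learning rate) rather than to the expressive power of the activation.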

## 5.3 The MNIST dataset
[The MNIST dataset](http://yann.lecun.com/exdb/mnist/) contains 60,000 training and 10,000 test images of handwritten digits. It is used to test how well a method recognizes which digit is shown in a given image. Although the spatial arrangement of the image pixels matters, in this example we disregard it and simply use the individual pixel values as features. There are $28\cdot 28=784$ pixels, i.e. features, per image. The basic code is given below.

**Task**

Experiment with different activation functions, learning rates, batch sizes, optimizers, and architectures. What is the best combination of them? Which of them has the highest impact on the accuracy and rate of convergence? How about the size of hidden layers? Make the comparisons and draw the appropriate plots.

In [5]:
#use MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist=input_data.read_data_sets("mnist/", one_hot=True)

import tensorflow as tf

#settings
learning_rate=0.001
training_epochs_count=40
batch_size=100
batches_count=mnist.train.num_examples//batch_size

activation_function=tf.nn.relu
optimizer_type=tf.train.AdamOptimizer

#print the test accuracy every display_step epochs
display_step=1

#architecture
hidden_layer_size_1=128
hidden_layer_size_2=128
hidden_layer_size_3=128
hidden_layer_size_4=128
input_size=784
n_classes=10

#data input
x=tf.placeholder(tf.float32, [None, input_size])
y=tf.placeholder(tf.float32, [None, n_classes])

#weights
w1=tf.Variable(tf.random_normal([input_size, hidden_layer_size_1]))
w2=tf.Variable(tf.random_normal([hidden_layer_size_1, hidden_layer_size_2]))
w3=tf.Variable(tf.random_normal([hidden_layer_size_2,hidden_layer_size_3]))
w4=tf.Variable(tf.random_normal([hidden_layer_size_3,hidden_layer_size_4]))
w5=tf.Variable(tf.random_normal([hidden_layer_size_2, n_classes]))

#biases
b1=tf.Variable(tf.random_normal([hidden_layer_size_1]))
b2=tf.Variable(tf.random_normal([hidden_layer_size_2]))
b3=tf.Variable(tf.random_normal([hidden_layer_size_3]))
b4=tf.Variable(tf.random_normal([hidden_layer_size_4]))
b5=tf.Variable(tf.random_normal([n_classes]))

#layers
layer_1=activation_function(tf.add(tf.matmul(x, w1), b1))
layer_2=activation_function(tf.add(tf.matmul(layer_1, w2), b2))
layer_3=activation_function(tf.add(tf.matmul(layer_2,w3),b3))
layer_4=activation_function(tf.add(tf.matmul(layer_3,w4),b4))
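#note: the output layer reads from layer_2 (and w5 is sized accordingly), so layer_3 and layer_4 are defined but do not affect the prediction
y_predicted=tf.matmul(layer_2, w5)+b5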

cost=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_predicted, labels=y))
optimizer=optimizer_type(learning_rate=learning_rate).minimize(cost)


session=tf.Session()
session.run(tf.global_variables_initializer())

#fraction of samples whose predicted class matches the label
correct_prediction=tf.equal(tf.argmax(y_predicted, 1), tf.argmax(y, 1))
accuracy=tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

for epoch in range(training_epochs_count):
	for i in range(batches_count):
		batch_x, batch_y = mnist.train.next_batch(batch_size)
		session.run(optimizer, feed_dict={x:batch_x, y:batch_y})
	if ((epoch+1)%display_step==0):
		print("Epoch #"+str(epoch+1)+" "+str(session.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})))

session.close()

Extracting mnist/train-images-idx3-ubyte.gz
Extracting mnist/train-labels-idx1-ubyte.gz
Extracting mnist/t10k-images-idx3-ubyte.gz
Extracting mnist/t10k-labels-idx1-ubyte.gz
Epoch #1 0.8029
Epoch #2 0.8584
Epoch #3 0.8763
Epoch #4 0.8919
Epoch #5 0.9012
Epoch #6 0.9089
Epoch #7 0.9139
Epoch #8 0.9148
Epoch #9 0.9215
Epoch #10 0.9266
Epoch #11 0.9262
Epoch #12 0.9279
Epoch #13 0.9269
Epoch #14 0.9285
Epoch #15 0.9337
Epoch #16 0.9313
Epoch #17 0.9308
Epoch #18 0.9313
Epoch #19 0.9354
Epoch #20 0.9335
Epoch #21 0.9347
Epoch #22 0.9349
Epoch #23 0.9341
Epoch #24 0.9367
Epoch #25 0.9368
Epoch #26 0.9396
Epoch #27 0.9388
Epoch #28 0.9372
Epoch #29 0.9395
Epoch #30 0.9406
Epoch #31 0.9427
Epoch #32 0.944
Epoch #33 0.9446
Epoch #34 0.9441
Epoch #35 0.9433
Epoch #36 0.944
Epoch #37 0.9453
Epoch #38 0.945
Epoch #39 0.9476
Epoch #40 0.9458
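
The cost and accuracy used above can be illustrated on a toy example. The NumPy sketch below shows what a softmax cross-entropy on logits and an argmax-based accuracy compute; it is an illustrative re-implementation with three classes instead of ten, and the helper names `softmax_cross_entropy` and `accuracy` are introduced here and are not TensorFlow functions.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-sample cross-entropy between one-hot labels and softmax(logits)."""
    # Subtracting the row maximum does not change the softmax but avoids overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

def accuracy(logits, labels):
    """Fraction of samples whose highest logit matches the one-hot label."""
    return float(np.mean(np.argmax(logits, axis=1) == np.argmax(labels, axis=1)))

# Two made-up samples with three classes
logits = np.array([[2.0, 0.0, 0.0],    # confident and correct
                   [0.0, 0.0, 0.0]])   # undecided, argmax falls on class 0
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

losses = softmax_cross_entropy(logits, labels)
cost = float(np.mean(losses))

print(losses)
print(cost)
print(accuracy(logits, labels))  # 0.5
```

The undecided sample contributes a loss of $\ln 3$, the value for a uniform prediction over three classes, while the confident correct sample contributes much less. The cost minimized during training is the mean of these per-sample losses over a batch.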


In [33]:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf


def deepnn(x):
    with tf.name_scope('reshape'):
        x_image = tf.reshape(x, [-1,28,28,1])
    
    # First conv layer
    with tf.name_scope('conv1'):
        W_conv1 = weight_variable([5,5,1,32])
        b_conv1 = bias_variable([32])
        h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
    
    # First pooling layer
    with tf.name_scope('pool1'):
        h_pool1 = max_pool_2x2(h_conv1)
    
    # Second conv layer
    with tf.name_scope('conv2'):
        W_conv2 = weight_variable([5,5,32,64])
        b_conv2 = bias_variable([64])
        h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
    
    # Second pooling layer
    with tf.name_scope('pool2'):
        h_pool2 = max_pool_2x2(h_conv2)
        
    # Fully connected layer
    with tf.name_scope('fc1'):
        W_fc1 = weight_variable([7 * 7 * 64, 1024])
        b_fc1 = bias_variable([1024])
        
        h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
        h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
    
    # Dropout 
    with tf.name_scope('dropout'):
        keep_prob = tf.placeholder(tf.float32)
        h_fc1_drop = tf.nn.dropout(h_fc1,keep_prob)
    
    # Map features with classes
    with tf.name_scope('fc2'):
        W_fc2 = weight_variable([1024,10])
        b_fc2 = bias_variable([10])
        
        y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
        return y_conv, keep_prob

def conv2d(x,W):
    return tf.nn.conv2d(x,W,strides=[1,1,1,1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1,2,2,1],
                            strides=[1,2,2,1], padding='SAME')

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)


mnist = input_data.read_data_sets("mnist/", one_hot=True)
    
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
y_conv, keep_prob = deepnn(x)
with tf.name_scope('loss'):
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_,logits=y_conv)
    cross_entropy = tf.reduce_mean(cross_entropy)
    
with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
    
with tf.name_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
    correct_prediction = tf.cast(correct_prediction, tf.float32)
    accuracy = tf.reduce_mean(correct_prediction)
        
    
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
        batch = mnist.train.next_batch(50)
        if i % 100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
            print('step %d, training accuracy %g' % (i, train_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    print('test accuracy %g' % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

Extracting mnist/train-images-idx3-ubyte.gz
Extracting mnist/train-labels-idx1-ubyte.gz
Extracting mnist/t10k-images-idx3-ubyte.gz
Extracting mnist/t10k-labels-idx1-ubyte.gz
step 0, training accuracy 0.08
step 100, training accuracy 0.8
step 200, training accuracy 0.9
step 300, training accuracy 0.94
step 400, training accuracy 0.98
step 500, training accuracy 0.94
step 600, training accuracy 0.96
step 700, training accuracy 0.98
step 800, training accuracy 0.86
step 900, training accuracy 0.96
step 1000, training accuracy 0.96
step 1100, training accuracy 1
step 1200, training accuracy 0.94
step 1300, training accuracy 0.96
step 1400, training accuracy 0.92
step 1500, training accuracy 1
step 1600, training accuracy 0.96
step 1700, training accuracy 0.96
step 1800, training accuracy 1
step 1900, training accuracy 1
step 2000, training accuracy 0.92
step 2100, training accuracy 0.98
step 2200, training accuracy 0.96
step 2300, training accuracy 1
step 2400, training accuracy 0.98
step 