# <center>Assignment 1</center>

This assignment has two main parts: TensorFlow Basics and Neural Networks. You can code in Python 2 or Python 3. All the imports used in this notebook are listed below; if they work, you are (mostly) set to complete the assignment.

In [None]:
from __future__ import print_function,division
%matplotlib inline
import matplotlib.pyplot as plt
import random 
import tensorflow as tf
import numpy as np

## Tensorflow - Basics

### I. Linear Regression

<b>1a. Creating Sample Data </b>

In [None]:
x = np.random.randn(100,3) # 100 data points of dimension 3
w = np.array([[1],[2],[3]])
b = 10
y = np.dot(x, w) + b # Write code to create the target. Use Numpy operations

**1b. Plot Data**

In [None]:
# Explore the data by plotting whatever makes you understand the problem better. 
# Your code here.

# Plot the first axis of x against y.
plt.subplot(121)
plt.plot(x[:,0], y, 'bs')
plt.title("x_0 versus y")
plt.xlabel("x_0")
plt.ylabel("y")

# Plot the second axis of x against y.
plt.subplot(122)
plt.plot(x[:,1], y, 'bs') 
plt.title("x_1 versus y")
plt.xlabel("x_1")
plt.ylabel("y")
plt.show()

# Plot the third axis of x against y.
plt.subplot(111)
plt.plot(x[:,2], y, 'bs')
plt.title("x_2 versus y")
plt.xlabel("x_2")
plt.ylabel("y")
plt.show()

<b>2. Creating Placeholders</b>

In [None]:
X = tf.placeholder(dtype=tf.float32,shape=[None,3]) 
Y_Expected = tf.placeholder(dtype=tf.float32,shape=[None,1]) # Write code to create the placeholder for target.

<b>3. Creating Variables</b>

In [None]:
b = tf.Variable(dtype=tf.float32,initial_value=np.zeros(shape=(1,1)),name="b")
W = tf.Variable(dtype=tf.float32,initial_value=np.zeros(shape=(3,1)),name="w") # Write code to instantiate W with zeros. 

<b> 4. Creating Compute Graph </b>

In [None]:
Y = tf.matmul(X, W) + b # Define the equation to compute the output variable.
cost = tf.reduce_mean(tf.square((Y - Y_Expected))) # Define the cost function.  

<b> 5. Training and optimizer </b>

In [None]:
# This part has been done for you already! Just run it after you finish coding the above sections. 
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(cost)
for epoch in range(30):
    epoch_cost,_ = sess.run([cost,train_op],feed_dict={X:x,Y_Expected:y})
    print(epoch, epoch_cost)

<b> 6. Print out parameters </b>

In [None]:
# Replace the None with the correct operation. You should get W close to [[1],[2],[3]] and b close to 10. 
print("W:", W.eval())
print("b:", b.eval())
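
The training loop above can also be reproduced without TensorFlow. The following is a minimal NumPy sketch, assuming the same mean-squared-error cost and plain gradient descent with a learning rate of 0.1; the seed, the 200 epochs, and the names `w_true`, `b_true`, `grad_W`, and `grad_b` are illustrative choices, not part of the assignment.

```python
import numpy as np

rng = np.random.RandomState(0)             # illustrative seed (assumption)
x = rng.randn(100, 3)                      # same shape as the sample data above
w_true = np.array([[1.0], [2.0], [3.0]])
b_true = 10.0
y = np.dot(x, w_true) + b_true

W = np.zeros((3, 1))
b = 0.0
learning_rate = 0.1
for epoch in range(200):                   # more epochs than the cell above, for a tighter fit
    error = np.dot(x, W) + b - y           # residuals, shape (100, 1)
    grad_W = 2.0 * np.dot(x.T, error) / len(x)   # d(mean squared error)/dW
    grad_b = 2.0 * error.mean()                  # d(mean squared error)/db
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print("W:", W.ravel())
print("b:", b)
```

Under these assumptions the recovered parameters should approach `[1, 2, 3]` and `10`, which is the same check the TensorFlow cell performs with `W.eval()` and `b.eval()`.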

### II. Matrix Multiplication

In [None]:
def ndmatmul():
    """
      # 3d x 2d Matmul operation. 
      You may find some of these functions useful: einsum, tile, expand_dims.
      :return a: Placeholder for 3d tensor [float64]
              b: Placeholder for 2d tensor [float64]
              c: Matrix Product
      """
    a = tf.placeholder(dtype=tf.float64, shape=(None, None, None))
    b = tf.placeholder(dtype=tf.float64, shape=(None, None))
    c = tf.einsum('ijk,kl->ijl', a, b)
    return a,b,c

In [None]:
A,B,C = ndmatmul()

In [None]:
np.random.seed(1)
a = np.random.randn(5,2,3)
b = np.random.randn(3,1)
c = np.matmul(a,b)
print(a.shape)
print(b.shape)
print(c.shape)
print(c)

In [None]:
# Will give error if function not implemented. Your output should match Numpy's output.
sess = tf.InteractiveSession()
c_tensor = sess.run(C,feed_dict={A:a,
                            B:b})
print(c_tensor)
if (np.abs(c_tensor-c)<10**-10).all():
    print("Correct!")
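
The functions named in the docstring (`einsum`, `tile`, `expand_dims`) have NumPy counterparts, so the 3d x 2d product can be sketched outside a TensorFlow session. This is an illustration under the same shapes as the test cell, not the required solution; `c_einsum`, `b_tiled`, and `c_tiled` are names introduced here.

```python
import numpy as np

np.random.seed(1)
a = np.random.randn(5, 2, 3)   # batch of 5 matrices, each 2 x 3
b = np.random.randn(3, 1)      # one 3 x 1 matrix shared across the batch

# einsum: contract the last axis of a (k) with the first axis of b (k).
c_einsum = np.einsum('ijk,kl->ijl', a, b)

# Equivalent route: add a leading batch axis to b, tile it, then use batched matmul.
b_tiled = np.tile(np.expand_dims(b, 0), (a.shape[0], 1, 1))   # shape (5, 3, 1)
c_tiled = np.matmul(a, b_tiled)

# Compare with a symmetric tolerance. A one-sided check such as
# (result - expected < tol) also passes when the result is far too small.
print(np.allclose(c_einsum, np.matmul(a, b)))
print(np.allclose(c_tiled, c_einsum))
```

The `einsum` form avoids materialising the tiled copy of `b`, which is why it is usually the more economical of the two.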

### III. Experiments with Feed-forward NN on MNIST

In this question, you will experiment with feed-forward neural networks while training on the MNIST dataset. Read more about it <a href = "https://en.wikipedia.org/wiki/MNIST_database">here</a>. A random sample of the images is shown below.

In [None]:
# Load MNIST Data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
train_data = mnist.train.images # Returns np.array
train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
eval_data = mnist.test.images # Returns np.array
eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)
print(train_data.shape)
print(train_labels.shape)
print(eval_data.shape)
print(eval_labels.shape)
# Randomly choose 10 images from first 50 images of Train Data.
for index,idx in enumerate(random.sample(range(50),10)): 
    plt.subplot(10,1,index+1)
    plt.imshow(train_data[idx].reshape(28,28))

Fill in the following snippet as per the instructions. 
* For initialising placeholders, use None to accommodate variable batch_size. 
* Do not change the seed; use it for comparing epoch-wise loss with your friends.
* You can use the following <a href ="https://www.tensorflow.org/versions/r1.1/get_started/mnist/beginners">tutorial</a> for reference. Note that the tutorial uses softmax regression, while you are required to code a feed-forward neural network.
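
To make the difference between softmax regression and a feed-forward network concrete, here is a minimal NumPy sketch of a single forward pass with one sigmoid hidden layer, a softmax cross-entropy loss, and an accuracy computation. The random batch, the labels, and the helper names `sigmoid` and `softmax_cross_entropy` are hypothetical stand-ins; the assignment itself expects the TensorFlow graph.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_cross_entropy(logits, one_hot_labels):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1).mean()

rng = np.random.RandomState(1)
batch = rng.rand(4, 784)                     # hypothetical stand-in for 4 MNIST images
labels = np.eye(10)[rng.randint(0, 10, 4)]   # hypothetical one-hot labels

W1, b1 = 0.01 * rng.randn(784, 10), np.zeros(10)   # input -> hidden
W2, b2 = 0.01 * rng.randn(10, 10), np.zeros(10)    # hidden -> output

hidden = sigmoid(np.dot(batch, W1) + b1)     # hidden layer applies a nonlinearity
logits = np.dot(hidden, W2) + b2             # output layer stays linear (logits)
loss = softmax_cross_entropy(logits, labels)
accuracy = (logits.argmax(axis=1) == labels.argmax(axis=1)).mean()
print(loss, accuracy)
```

Softmax regression corresponds to dropping the hidden layer and feeding `batch` straight into the output layer; the hidden layer is what the assignment adds.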


In [None]:
def initializer_1(shape):
    # Do not change the seed. 
    np.random.seed(1)
    return np.random.randn(*shape)

def initializer_2(shape):
    # Do not change the seed.
    np.random.seed(1)
    return 0.01 * np.random.randn(*shape)

class MNIST_ANN:
    def __init__(self,hidden_units,activations,initializer):
        """
        Initialise the weights and build the compute graph. Use AdamOptimizer with default parameters.
        :param hidden_units - list of number of hidden units. 
               Eg: [10,20] => Layer 1 has 10 hidden units and Layer 2 has 20.
        :param activations - list of activations for each of the hidden layers.
               Eg: [tf.nn.sigmoid, tf.nn.tanh]
        :param initializer - the reference to the function used for initializing the weights
        """
        # Define the placeholders
        self.input = tf.placeholder(dtype=tf.float32, shape=(None, 784))
        self.expected_output = tf.placeholder(dtype=tf.int32, shape=(None, 10))
        
        # Initialise the weights and biases. Use zeros for the biases. 
        weights = []
        biases = []
        # Initializing the weights and biases for the input layer -> first hidden layer.
        weights.append(tf.Variable(dtype=tf.float32, initial_value=initializer((784, hidden_units[0]))))
        biases.append(tf.Variable(dtype=tf.float32, initial_value=np.zeros(hidden_units[0])))
        
        # Loop here.
        for i in range(1, len(hidden_units)):
            weights.append(tf.Variable(dtype=tf.float32, 
                                     initial_value=initializer((hidden_units[i - 1], hidden_units[i]))))
            biases.append(tf.Variable(dtype=tf.float32, initial_value=np.zeros(hidden_units[i])))
        # Initializing the weights and biases for the last hidden layer -> output layer
        weights.append(tf.Variable(dtype=tf.float32, initial_value=initializer((hidden_units[-1], 10))))
        biases.append(tf.Variable(dtype=tf.float32, initial_value=np.zeros(10)))
        
        # Build the graph for computing output.
        h = self.input
        for i in range(0, len(activations)):
            h = activations[i](tf.matmul(h, weights[i]) + biases[i])
        # For output layer
        self.output = tf.matmul(h, weights[-1]) + biases[-1]
        
        # Define the loss and accuracy here. (Refer Tutorial)
        self.cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.expected_output, logits=self.output))
        self.correct_prediction = tf.equal(tf.argmax(self.output, 1), tf.argmax(self.expected_output, 1))
        self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32))
        
        # Instantiate the optimizer
        optimizer = tf.train.AdamOptimizer()
        self.train_op = optimizer.minimize(self.cost)
        self.session = tf.Session()
        
        # Initialize all variables
        self.session.run(tf.global_variables_initializer())
    
    def train(self,train_data,train_labels,eval_data,eval_labels,batch_size,epochs=100):
        """
        Training code.
        """
        sess = self.session

        # Slice the data and labels into batches depending on the batch_size.
        batches = []
        num_batches = train_data.shape[0] // batch_size
        for i in range(num_batches):
            batch = [train_data[i * batch_size: i * batch_size + batch_size], 
                     train_labels[i * batch_size: i * batch_size + batch_size]]
            batches.append(batch)
            
        for epoch in range(epochs):
            cost_epoch = 0
            for batch in batches:
                # Forward Propagate, compute cost and backpropagate.
                cost,_ = sess.run([self.cost,self.train_op],feed_dict={self.input:batch[0],
                                                             self.expected_output: batch[1]})
                cost_epoch += cost
            if epoch%10 == 0:
                print("Train accuracy: {0:.12f}".format(self.compute_accuracy(train_data,train_labels)))        
                print("Test accuracy: {0:.12f}".format(self.compute_accuracy(eval_data,eval_labels)))
            print("Epoch {0:d}: {1:.8f}".format(epoch,cost_epoch))
        print("Train accuracy: {0:.12f}".format(self.compute_accuracy(train_data,train_labels)))
        print("Test accuracy: {0:.12f}".format(self.compute_accuracy(eval_data,eval_labels)))

    def compute_accuracy(self,data,labels):
        """
        Fill in code to compute accuracy
        """
        sess = self.session
        return sess.run(self.accuracy, feed_dict={self.input: data, self.expected_output: labels})

In [None]:
ann = MNIST_ANN([10],[tf.nn.sigmoid],initializer_1)
ann.train(train_data,train_labels,eval_data,eval_labels,batch_size=10,epochs=3)

The expected output for the above snippet is
<pre>
Train accuracy: 0.780763626099
Test accuracy: 0.791599988937
Epoch 0: 6768.86486949
Epoch 1: 3275.00310887
Epoch 2: 2590.16959983
Train accuracy: 0.873399972916
Test accuracy: 0.876900017262
</pre>
If you get a different output and believe your code is correct, you can proceed (however, I cannot think of a case where the output would differ).

### Answer the following questions by running code snippets. Unless asked explicitly (like in Q1 and Q4), you need to just show the system performance and need not comment.

**1. Use 1 hidden layer of 10 hidden units with sigmoid activation and batch_size=10 for this question. Observe the network performance for initializer_1 and initializer_2 and explain the behavior. Why does this happen? What is your guess for tanh and relu? Why?**

In [None]:
# Your code here. 

# ANN with initializer_1
ann = MNIST_ANN([10], [tf.nn.sigmoid], initializer_1)
print("START TRAINING WITH INITIALIZER_1")
ann.train(train_data,train_labels,eval_data,eval_labels,batch_size=10)
print()

# ANN with initializer_2
ann = MNIST_ANN([10], [tf.nn.sigmoid], initializer_2)
print("START TRAINING WITH INITIALIZER_2")
ann.train(train_data,train_labels,eval_data,eval_labels,batch_size=10)
print()

With initializer_2, the cost decreases faster and, correspondingly, the train and test accuracies increase faster than with initializer_1.

The weights produced by initializer_2 are 100 times smaller than those produced by initializer_1, so the pre-activations of the hidden layer are also much smaller. With initializer_1, each pre-activation is a sum over 784 inputs weighted by unit-variance weights, so its magnitude is often much larger than 1 and the Sigmoid saturates near 0 or 1. In the saturated region the derivative of the Sigmoid is close to zero, so the gradients reaching the first-layer weights are tiny and learning is slow. With initializer_2, the pre-activations stay near zero, where the Sigmoid has its steepest slope (0.25), so the gradients are larger and the cost drops faster.

For Tanh and ReLU, my guess is that the cost would decrease faster than for Sigmoid, at least with small initial weights.

Tanh is zero-centered and its slope at zero is 1, compared with 0.25 for Sigmoid, so the gradients should be larger. Tanh still saturates for large inputs, so I would expect large initial weights to slow it down in the same way as Sigmoid.

For ReLU, I think it will be the fastest. Its slope is 1 for every positive input, so it does not saturate on that side. With large initial weights, however, the large pre-activations pass through unscaled, which could make the loss large or unstable.
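
The effect of the weight scale on the Sigmoid can be checked numerically. The sketch below assumes random inputs in [0, 1] as a hypothetical stand-in for MNIST pixels (real MNIST images are sparser, so the exact numbers will differ) and compares the average Sigmoid slope under the two weight scales; `slopes` and `mean_slope` are names introduced for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(1)
inputs = rng.rand(256, 784)        # hypothetical inputs in [0, 1], not real MNIST
weights = rng.randn(784, 10)       # unit-variance weights, as in initializer_1

slopes = {}
for name, scale in [("initializer_1", 1.0), ("initializer_2", 0.01)]:
    z = np.dot(inputs, scale * weights)        # hidden-layer pre-activations
    s = sigmoid(z)
    mean_slope = (s * (1.0 - s)).mean()        # sigmoid'(z) = s * (1 - s), at most 0.25
    slopes[name] = mean_slope
    print(name, "pre-activation std:", z.std(), "mean sigmoid slope:", mean_slope)
```

A mean slope close to 0.25 means the units sit in the steep part of the Sigmoid; a much smaller value indicates saturation and therefore small gradients.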

<b>2. Play around with different configurations of the system. Spend some time on <a href="https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.52239&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false"> Tensorflow Playground </a> to get a feel. Just demonstrate the performance of the system and make observations. No need to make any comments. </b>

In [None]:
# Your code here.

# ANN with three hidden layers (30, 20, 10), initializer_2, and relu as activations
ann_0 = MNIST_ANN([30, 20, 10], [tf.nn.relu, tf.nn.relu, tf.nn.relu], initializer_2)
print("START TRAINING ANN_0 WITH RELU AND INITIALIZER_2")
ann_0.train(train_data,train_labels,eval_data,eval_labels,batch_size=10)

# ANN with two hidden layers (10, 20), initializer_2, and sigmoid and relu as activations
ann_1 = MNIST_ANN([10, 20], [tf.nn.sigmoid, tf.nn.relu], initializer_2)
print("START TRAINING ANN_1 WITH SIGMOID AND RELU, AND INITIALIZER_2")
ann_1.train(train_data,train_labels,eval_data,eval_labels,batch_size=10)

# ANN with two hidden layers (20, 10), initializer_2, and sigmoid and tanh as activations
ann_2 = MNIST_ANN([20, 10], [tf.nn.sigmoid, tf.nn.tanh], initializer_2)
print("START TRAINING ANN_2 WITH SIGMOID AND TANH, AND INITIALIZER_2")
ann_2.train(train_data,train_labels,eval_data,eval_labels,batch_size=10)

# ANN with two hidden layers (10, 20), initializer_2, and sigmoid and tanh as activations
ann_3 = MNIST_ANN([10, 20], [tf.nn.sigmoid, tf.nn.tanh], initializer_2)
print("START TRAINING ANN_3 WITH SIGMOID AND TANH, AND INITIALIZER_2")
ann_3.train(train_data,train_labels,eval_data,eval_labels,batch_size=10)

<b>4. List the problems you faced while experimenting [Loss did not decrease, ran into NaNs, etc]. What conclusions did you make? </b>

The network I used in TensorFlow Playground: 0.01 learning rate, no regularization, 2 hidden layers, and the task is classification.

With **Linear** activations: after about 300 epochs, the train and test losses stopped decreasing.

With **Sigmoid** activations: the train and test losses decreased slowly. It took about 2,000 epochs for the network to reach a train and test loss of 0.029 each.

With **Tanh** activations: the train and test losses decreased faster than with Sigmoid. It took only about 200 epochs for the network to reach a train loss of 0.027 and a test loss of 0.030.

With **ReLU** activations: the train and test losses decreased very quickly. It took only about 70 epochs for the network to reach a train and test loss of 0.028 each.

My conclusions: ReLU seems to perform best on this classification problem. Tanh is the second choice, since the loss decreases more slowly than with ReLU. Sigmoid can be used, but it performs worse: it takes a long time for the train and test losses to reach a relatively small value such as 0.028. Linear should not be used, because the problem itself is not linear; the train and test losses stop decreasing after some epochs.

Another thing I found is that adding more layers does not necessarily improve the performance of the network. In some cases it made the performance worse.
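
One way to see why Linear activations stall is that stacked linear layers collapse into a single linear layer, so extra depth adds no expressive power. The NumPy sketch below illustrates this with one hidden layer; the shapes and the names `W_merged` and `b_merged` are arbitrary choices for the example, and the same argument extends to any number of layers.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(5, 2)                       # 5 hypothetical 2-d inputs
W1, b1 = rng.randn(2, 4), rng.randn(4)    # hidden layer with a linear activation
W2, b2 = rng.randn(4, 1), rng.randn(1)    # output layer

two_layer = np.dot(np.dot(x, W1) + b1, W2) + b2

# The same function written as a single linear layer.
W_merged = np.dot(W1, W2)
b_merged = np.dot(b1, W2) + b2
one_layer = np.dot(x, W_merged) + b_merged

print(np.allclose(two_layer, one_layer))
```

Because the merged model can only draw a linear decision boundary, it cannot separate data that is not linearly separable, which is consistent with the loss plateau seen in the Playground.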