<a href="https://colab.research.google.com/github/hikmatfarhat-ndu/CSC645/blob/master/4shallow_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic Differentiation Using Tensorflow

In this notebook we use the automatic differentiation capabilities of Tensorflow to reimplement an ML model that we have already seen: the Mine/Rock classification of sonar data.

In [1]:
import tensorflow as tf
import numpy as np


### Tensorflow: Variables, Tensors and Gradients

In this section we introduce some of the Tensorflow tools we will use later.
As a simple example we consider the function $y=x^2$, starting with $x=2.2$. We then use the gradient to update $x$ so that it moves toward the minimum of $y$.

Tensorflow uses an object called __GradientTape__ to "record" the independent variables and the operations that depend on those variables.


In [2]:
x=tf.Variable(2.2,name='x')
with tf.GradientTape() as t:
  y=x**2
print("x and y")
print(x)
print(y)



x and y
<tf.Variable 'x:0' shape=() dtype=float32, numpy=2.2>
tf.Tensor(4.84, shape=(), dtype=float32)


#### Compute and apply gradient
__Note__ that the __apply_gradients__ method takes a __list of pairs__ (gradient,variable).

In [3]:
grad=t.gradient(y,x)
print("gradient")
print(grad)
## Recall the update rule of variables
## new value of x= old value of x - rate * gradient of y wrt x
print("expected new value of x is {:.2f}".format(x.numpy()-grad.numpy()*0.1))
opt=tf.optimizers.SGD(0.1)
opt.apply_gradients([(grad,x)])
print("new value of x")
print(x)


gradient
tf.Tensor(4.4, shape=(), dtype=float32)
expected new value of x is 1.76
new value of x
<tf.Variable 'x:0' shape=() dtype=float32, numpy=1.76>
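
Repeating the same update drives $x$ toward the minimum of $y=x^2$ at $x=0$. The plain-Python sketch below applies the rule by hand, assuming the analytic derivative $dy/dx=2x$ in place of the tape. The names `rate`, `steps` and `history` are illustrative and not part of the notebook.

```python
# Gradient descent on y = x**2 done by hand (no Tensorflow).
# Assumption: we use the analytic derivative dy/dx = 2*x instead of a GradientTape.
rate = 0.1      # same learning rate as the SGD optimizer above
steps = 50      # illustrative number of iterations
x = 2.2         # same starting point as above
history = [x]
for _ in range(steps):
    grad = 2.0 * x          # gradient of y with respect to x
    x = x - rate * grad     # new value of x = old value of x - rate * gradient
    history.append(x)

print("after 1 step : {:.2f}".format(history[1]))
print("after {} steps: {:.6f}".format(steps, history[-1]))
```

Each step multiplies $x$ by $1-2\cdot 0.1=0.8$, so the iterates shrink geometrically toward 0.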


#### Another example
Typically we would like to compute the gradient of a loss function with respect to multiple variables. Below we illustrate with $z=x^2+y$.

The __apply_gradients__ function expects inputs of the form [(grad_x,x),(grad_y,y),...]. We could call it using 
```
opt.apply_gradients([(grad[0],x),(grad[1],y)])
```
But a more convenient way, especially if we have many variables, is to use the __zip__ function

```
opt.apply_gradients(zip(grad,[x,y]))
```

In [16]:
x=tf.Variable(2.2,name='x')
y=tf.Variable(3.,name='y')
with tf.GradientTape() as t:
  z=x**2+y
print(t.watched_variables())
grad=t.gradient(z,[x,y])
opt=tf.optimizers.SGD(0.2)
opt.apply_gradients(zip(grad,[x,y]))
print(x)
print(y)

(<tf.Variable 'x:0' shape=() dtype=float32, numpy=2.2>, <tf.Variable 'y:0' shape=() dtype=float32, numpy=3.0>)
<tf.Variable 'x:0' shape=() dtype=float32, numpy=1.32>
<tf.Variable 'y:0' shape=() dtype=float32, numpy=2.8>
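
The printed values can be checked by hand. The sketch below assumes the analytic partial derivatives $\partial z/\partial x=2x$ and $\partial z/\partial y=1$, and uses a plain dictionary (`variables`, a hypothetical name) in place of Tensorflow variables. It pairs gradients with variables using zip in the same way as the cell above.

```python
# Hand-check of the two-variable example z = x**2 + y (no Tensorflow).
# Assumption: analytic partial derivatives dz/dx = 2*x and dz/dy = 1.
rate = 0.2
variables = {"x": 2.2, "y": 3.0}
grads = [2.0 * variables["x"], 1.0]      # [dz/dx, dz/dy]

# zip pairs each gradient with its variable, like zip(grad,[x,y]) above
for g, name in zip(grads, ["x", "y"]):
    variables[name] = variables[name] - rate * g

print("x={:.2f}, y={:.2f}".format(variables["x"], variables["y"]))
```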


## Classification of Mines/Rock 

The data is the mines vs. rocks sonar data set from https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks). It is a very small file in csv format. In the cell below we upload a Kaggle API token (kaggle.json) to colab, download a copy of the data set with the kaggle command-line tool, and unzip it. Next we read it using the pandas package.


### Upload to colab


In [5]:
from google.colab import files
file=files.upload()
!mkdir /root/.kaggle
!mv kaggle.json  /root/.kaggle
!kaggle datasets download -d mattcarter865/mines-vs-rocks
!unzip mines-vs-rocks.zip

Saving kaggle.json to kaggle.json
mkdir: cannot create directory ‘/root/.kaggle’: File exists
mines-vs-rocks.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  mines-vs-rocks.zip
replace sonar.all-data.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: Y
  inflating: sonar.all-data.csv      


### Read using Pandas

In [6]:
import pandas as pd
df=pd.read_csv("sonar.all-data.csv",header=None)


### Preprocessing the data

We need to perform several operations on the data before we use it. 
1. The data is sorted, all mines followed by all rocks, so we shuffle it using numpy.
2. We need to break it into train and test sets.
3. Make sure that the data is of type "float32" instead of numpy's default "float64". Tensorflow does not convert between float types automatically, and the variables we define below are float32.

In [7]:
#pandas data frame
m=df.values
# randomize (shuffle) the data
np.random.shuffle(m)

# Each row has 61 entries, 60 for data and the last one is the label "M" or "R"

# X contains all the data
X=m[:,0:60].astype("float32")
# Y contains all the labels
Y=m[:,60]

# convert the labels: "M"->1 and "R"->0
Y=np.array([1.0 if i=='M' else 0.0 for i in Y])

Y=Y.reshape((len(Y),1))
Y=Y.astype("float32")

# split the data and labels into a training and test sets
train_size=180
data_size=X.shape[0]

x_train=X[0:train_size,:]
x_test=X[train_size:data_size,:]

y_train=Y[0:train_size,:]
y_test=Y[train_size:data_size,:]

print("x_train shape={}".format(x_train.shape))
print("x_test shape={}".format(x_test.shape))
print("y_train shape={}".format(y_train.shape))
print("y_test shape={}".format(y_test.shape))

x_train shape=(180, 60)
x_test shape=(28, 60)
y_train shape=(180, 1)
y_test shape=(28, 1)
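
The same three steps can be tried without the csv file. The sketch below runs them on a synthetic stand-in for the sonar table. The random features, the 104/104 label counts and the fixed seed are assumptions made for illustration. Unlike the cell above, it shuffles with an explicit permutation from a seeded generator, so the split is repeatable.

```python
import numpy as np

# Synthetic stand-in for the sonar table (assumption: 208 rows, 60 features,
# labels "M"/"R" sorted by class; the 104/104 split is arbitrary).
rng = np.random.default_rng(0)            # fixed seed, so the shuffle is repeatable
features = rng.random((208, 60))          # float64, NumPy's default float type
labels = np.array(["M"] * 104 + ["R"] * 104)

# 1. Shuffle rows and labels with the same permutation so they stay aligned.
perm = rng.permutation(len(features))

# 3. Cast to float32 and convert the labels: "M"->1 and "R"->0.
X = features[perm].astype("float32")
Y = (labels[perm] == "M").astype("float32").reshape(-1, 1)

# 2. Split into training and test sets.
train_size = 180
x_train, x_test = X[:train_size], X[train_size:]
y_train, y_test = Y[:train_size], Y[train_size:]

print(features.dtype, X.dtype)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```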


### Important Note
Tensorflow stacks the samples row-wise instead of column-wise
as we have been doing when we did the gradient descent ourselves. We need to keep that in mind.
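
The two conventions give the same numbers, only transposed. The NumPy sketch below illustrates this. The sizes and the names `X_rows`, `X_cols` and `W` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_x, n_h = 5, 60, 16                  # illustrative sizes: 5 samples

X_rows = rng.standard_normal((m, n_x))   # row-wise: one sample per row
W = rng.standard_normal((n_x, n_h))      # weights in the row-wise convention

Z_rows = X_rows @ W                      # shape (m, n_h)

# Column-wise convention: one sample per column, weights transposed.
X_cols = X_rows.T                        # shape (n_x, m)
Z_cols = W.T @ X_cols                    # shape (n_h, m)

print(Z_rows.shape, Z_cols.shape)
print(np.allclose(Z_rows, Z_cols.T))
```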

### Defining the parameters


In [8]:
learning_rate = 3
nb_iterations = 2500

# Network Parameters
n_h = 16 # number of neurons in hidden layer
n_x = X.shape[1] #number of neurons in input
n_y = Y.shape[1] #number of neurons in output


### Initialization
The forward propagation phase is the same as when we did this exercise from first principles, but since Tensorflow stacks the data row-wise the forward propagation is slightly different than we are used to.
Let $W^0$,$W^1$,$b^0$,$b^1$ be the weights and biases of the first and second layers respectively. Then forward propagation is given by
\begin{align*}
Z^1&=X\cdot W^0+b^0\\
A^1 &=\sigma(Z^1)\\
Z^2 &=A^1\cdot W^1+b^1\\
A^2 &=\sigma(Z^2)
\end{align*}
Compare the above with the equations in the previous exercise. For more details consult the lecture [backpropagation](https://github.com/hikmatfarhat-ndu/CSC645/blob/master/lectures/csc645-lecture-backprop.pdf).

According to the above equations we have to define the tensorflow variables that will hold the weights and biases. 
The biases are initialized to zero and the weights are drawn randomly from a normal distribution.

In [9]:

initializer = tf.initializers.RandomNormal()

W0=tf.Variable(initializer([n_x,n_h]),trainable=True,dtype=tf.float32)
W1=tf.Variable(initializer([n_h,n_y]),trainable=True,dtype=tf.float32)

b0=tf.Variable(tf.zeros([n_h]))            #biases of the first layer
b1=tf.Variable(tf.zeros([n_y]))            #biases of the second layer
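
To see how the shapes fit together, here is a NumPy sketch of the forward propagation equations with parameters of the same shapes as above. The sample count, the seed and the weight scale of 0.05 are illustrative assumptions, and the `sigmoid` helper is defined here only for the sketch.

```python
import numpy as np

def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
m, n_x, n_h, n_y = 4, 60, 16, 1          # 4 illustrative samples

X = rng.random((m, n_x)).astype("float32")
W0 = (0.05 * rng.standard_normal((n_x, n_h))).astype("float32")
W1 = (0.05 * rng.standard_normal((n_h, n_y))).astype("float32")
b0 = np.zeros(n_h, dtype="float32")      # biases of the first layer
b1 = np.zeros(n_y, dtype="float32")      # biases of the second layer

Z1 = X @ W0 + b0     # (m, n_h): b0 is broadcast across the m rows
A1 = sigmoid(Z1)
Z2 = A1 @ W1 + b1    # (m, n_y)
A2 = sigmoid(Z2)     # one probability per sample

print(Z1.shape, A1.shape, Z2.shape, A2.shape)
```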


### Defining the model
Our model has two layers. The function "model" below returns the output of our model for a given input. Note that since Tensorflow uses the first index for the samples, the operands of the matrix product appear in a different order.

In [10]:
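def model(input):
    # Hidden fully connected layer with n_h=16 neurons
    layer_1 = tf.add(tf.matmul(input, W0), b0)
    # Output fully connected layer with a single neuron (binary classification).
    # It returns the raw output Z2 (the logits); the sigmoid is applied
    # later, in the loss and in the prediction function.
    out_layer = tf.matmul(tf.sigmoid(layer_1), W1) + b1
    return out_layer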

Once the model is defined the remaining code is similar to our previous exercise. We define the loss
as an average over the cross-entropy, but this time, since it is binary classification, we use the sigmoid instead
of the softmax function. Note that `sigmoid_cross_entropy_with_logits` applies the sigmoid itself, which is why "model" returns the raw output $Z^2$. Then our optimizer uses gradient descent to minimize the loss.

In [11]:

# Define the loss (the optimizer is created in the training step below)
def loss(pred,label):
   return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=pred, labels=label))
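
For intuition, the element-wise loss can be reproduced in NumPy. The sketch below assumes the numerically stable form given in the Tensorflow documentation for `sigmoid_cross_entropy_with_logits`, $\max(x,0)-xz+\log(1+e^{-|x|})$ for logit $x$ and label $z$, and compares it with the textbook formula. The function names and sample values are illustrative.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Element-wise binary cross-entropy computed from logits (stable form)."""
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def naive_cross_entropy(logits, labels):
    """Textbook form: -[z*log(p) + (1-z)*log(1-p)] with p = sigmoid(logits)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(labels * np.log(p) + (1 - labels) * np.log(1 - p))

logits = np.array([[-2.0], [0.0], [3.0]])
labels = np.array([[0.0], [1.0], [1.0]])

stable = sigmoid_cross_entropy(logits, labels)
naive = naive_cross_entropy(logits, labels)
print(np.allclose(stable, naive))
print("mean loss = {:.4f}".format(stable.mean()))
```

A logit of zero gives a loss of $\log 2\approx 0.69$ whatever the label. This is consistent with the first cost printed during training, when the weights are still close to zero.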


## Prediction

Given input X (and the parameters we are trying to learn), this function predicts whether each sample is a Mine (1) or a Rock (0).

In [12]:
def prediction(X):
  a=tf.math.sigmoid(model(X))
  return tf.cast((a>0.5),tf.int32)
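
Since the sigmoid is increasing and equals 0.5 at zero, thresholding the probability at 0.5 is the same as checking the sign of the logit. A small NumPy check with made-up logits:

```python
import numpy as np

logits = np.array([[-1.5], [-0.1], [0.2], [4.0]])   # illustrative model outputs
probs = 1.0 / (1.0 + np.exp(-logits))               # sigmoid of the logits

by_probability = (probs > 0.5).astype("int32")      # what prediction() does
by_logit = (logits > 0).astype("int32")             # equivalent test on the logits

print(by_probability.ravel(), by_logit.ravel())
```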

The model is now defined, so we can run our computation. Recall that our model depends on the changing parameters $W^0,W^1,b^0,b^1$; therefore its gradient changes as well and must be recomputed at every step. The train function performs a **single** training step.

In [13]:
optimizer=tf.optimizers.SGD(learning_rate)

def train(data,labels):
  with tf.GradientTape() as tape:
    diff=loss(model(data),labels)

  grad=tape.gradient(diff,[W0,W1,b0,b1])
  optimizer.apply_gradients( zip( grad , [W0,W1,b0,b1] ) )
  pT=tf.transpose(prediction(data))
  correct=np.squeeze(np.dot(pT,labels)+np.dot(1-pT,1-labels))
  return diff,correct
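
The expression for `correct` counts the matching predictions: the first dot product counts the samples where prediction and label are both 1, and the second those where both are 0. The NumPy sketch below checks this against a direct comparison, using made-up predictions and labels.

```python
import numpy as np

pred = np.array([[1], [0], [1], [1], [0]])               # predicted classes, shape (5, 1)
labels = np.array([[1.0], [0.0], [0.0], [1.0], [1.0]])   # true classes, shape (5, 1)

pT = pred.T                                              # shape (1, 5)
both_one = np.dot(pT, labels)                            # prediction 1 and label 1
both_zero = np.dot(1 - pT, 1 - labels)                   # prediction 0 and label 0
correct = np.squeeze(both_one + both_zero)

print(correct, np.sum(pred == labels))
```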

### Training Loop

In [14]:
for i in range(nb_iterations):
    cost,corr=train(x_train,y_train)
    if i%500==0:
        print("cost={:.2f},accuracy={}/{}".format(cost,corr,x_train.shape[0]))
  


cost=0.69,accuracy=95.0/180
cost=0.27,accuracy=157.0/180
cost=0.09,accuracy=178.0/180
cost=0.02,accuracy=180.0/180
cost=0.01,accuracy=180.0/180


### Accuracy

In [15]:
pT=tf.transpose(prediction(x_test))
correct=np.dot(pT,y_test)+np.dot(1-pT,1-y_test)
accuracy=100*float(np.squeeze(correct))/float(y_test.shape[0])
print("Accuracy="+str(accuracy))

Accuracy=71.42857142857143
