<a href="https://colab.research.google.com/github/hikmatfarhat-ndu/CSC645/blob/master/shallow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A shallow (two-layer) network: Recognizing sonar data
In this exercise we will use a **two-layer** neural network (one hidden layer and one output layer, fed by an input layer) to classify the two-class **sonar** data set. Each entry is the result of bouncing sonar signals at different angles off a metal cylinder (Mine) or a rock (Rock). It contains 60 values between 0 and 1 and a corresponding label (M or R). A detailed description of the data set can be found [here](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)).

Download the file sonar.all-data from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/) to your computer
and **rename** it to sonar.csv.

### Importing packages
We need the following packages: numpy for the computation, google.colab for uploading the data file into the Colab notebook (only needed when running on Colab, so it is commented out below), and pandas for reading the data from the file.

In [27]:
import numpy as np
#from google.colab import files
import pandas as pd

### Reading the data
Upload the data file to colab and read it using the pandas package.

In [28]:
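#file=files.upload()
# sonar.all-data has no header row; with the default settings pandas uses the
# first record as column names, so df holds 207 of the 208 records
# (pass header=None to keep the first record as data)
df=pd.read_csv("sonar.csv")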

## Preprocessing the data

Before we start the learning process we need to preprocess the data. First, in the file all the Mines ("M") are grouped together and all the Rocks ("R") are grouped together, so we use the numpy shuffle function to mix them randomly. Second, pandas reads the data as a DataFrame, so we need to extract the data values and the label values. Third, we convert the labels from "M" to 1 and from "R" to 0. Finally, we divide the data set into training and test subsets.

In [29]:
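# extract the values of the pandas DataFrame as a numpy array (one row per sample)
m=df.values
# randomize (shuffle) the rows, i.e. the samples
np.random.shuffle(m)
# transpose so that each sample becomes a column
m=m.T
# Each column now has 61 entries, 60 for data and the last one is the label "M" or "R"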

# X contains all the data
X=m[0:60,:].astype("float32")

# Y contains all the labels

Y=m[60,:]
# convert the labels: "M"->1 and "R"->0
Y=np.array([1.0 if i=='M' else 0.0 for i in Y])
Y=Y.reshape((1,len(Y)))
Y=Y.astype("float32")

# split the data and labels into training and test sets
x_train=X[:,0:175]
x_test=X[:,175:208]
y_train=Y[:,0:175]
y_test=Y[:,175:208]

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(60, 175)
(60, 32)
(1, 175)
(1, 32)
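
To see why the shuffle is applied to the rows before the transpose, the following toy sketch repeats the same steps on a small made-up array (the values are invented for illustration and are not taken from the sonar file). Because `np.random.shuffle` permutes whole rows, each feature vector stays paired with its label.

```python
import numpy as np

# Toy stand-in for df.values: 4 samples, 2 features plus a label column.
# In this made-up data every mine has a first feature below 0.4.
toy = np.array([[0.1, 0.2, "M"],
                [0.3, 0.4, "M"],
                [0.5, 0.6, "R"],
                [0.7, 0.8, "R"]], dtype=object)

np.random.seed(0)        # fixed seed so the run is repeatable
np.random.shuffle(toy)   # permutes the rows (samples) in place
toy_t = toy.T            # samples become columns

X_toy = toy_t[0:2, :].astype("float32")
Y_toy = np.array([1.0 if label == "M" else 0.0 for label in toy_t[2, :]])
Y_toy = Y_toy.reshape((1, len(Y_toy))).astype("float32")

print(X_toy.shape, Y_toy.shape)

# The labels still match the features after the shuffle.
aligned = np.array_equal(Y_toy[0] == 1.0, X_toy[0] < 0.4)
print(aligned)
```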


## Parameters

In [30]:
learning_rate = 3
nb_iterations = 2000
# Network Parameters
n_h = 64 # number of neurons in hidden layer
n_x = x_train.shape[0] #number of neurons in input
n_y = y_train.shape[0] #number of neurons in output

![alt text](shallow-example.png "Title")

### Sigmoid function
First we write the sigmoid function, $\sigma(z)=\frac{1}{1+e^{-z}}$.

In [31]:
def sigmoid(z):
    s=1/(1+np.exp(-z))
    return s
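
As a quick sanity check, the sketch below redefines `sigmoid` so that it runs on its own, evaluates it at a few points, and compares the identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$, which the back propagation step uses, against a central finite difference. The test points and the step size `h` are arbitrary choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
s = sigmoid(z)
print(s)   # the middle value is sigmoid(0) = 0.5

# Derivative: closed form versus central finite difference.
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = s * (1 - s)
max_diff = np.max(np.abs(numeric - analytic))
print(max_diff)
```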

### Initializing the parameters
Since we have two layers we need two weight matrices and two bias vectors. Consult the forward propagation equations shown below to determine the shapes of the parameters and initialize them accordingly.
$\sigma$ is the sigmoid function defined above, $A^0=X$ is the input, and $A^1$ and $A^2$ are the outputs of the first and second layers respectively. Recall that all the variables below (except the parameters) are vectorized versions containing all the samples, with the samples stacked as columns. So X[:,0] is the input of the first sample (index 0),
Z1[0,0] is the value at the first node of the first layer, before the sigmoid is applied, when the input is the first sample, and so on.

\begin{align*}
    Z^1&=W^0\cdot A^0+b^0\\
    A^1&=\sigma(Z^1)\\
    Z^2&=W^1\cdot A^1+b^1\\
    A^2&=\sigma(Z^2)
  \end{align*}
We initialize the weights randomly and the biases to zero. This is done in numpy using the random.randn and zeros functions. To create an $n\times m$ matrix of random numbers drawn from a standard normal distribution we use np.random.randn(n,m), and to create an $n\times m$ matrix of zeros we use np.zeros((n,m)).

In [32]:

W0=np.random.randn(n_h,n_x)
b0=np.zeros((n_h,1))
W1=np.random.randn(n_y,n_h)
b1=np.zeros((n_y,1))
#dW0=0.001*W0
#db0=b0
#dW1=0.001*W1
#db1=b1
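
The shapes can be checked on a small made-up batch before touching the real data. The sketch below is self-contained and assumes a toy batch of `m = 5` random samples; it only illustrates how the parameter shapes fit the forward propagation equations and how the `(n_h,1)` bias broadcasts across the sample columns.

```python
import numpy as np

n_x, n_h, n_y, m = 60, 64, 1, 5   # m = 5 is an arbitrary toy batch size

rng = np.random.default_rng(0)
X = rng.random((n_x, m))          # made-up inputs in [0, 1)

W0 = rng.standard_normal((n_h, n_x))
b0 = np.zeros((n_h, 1))
W1 = rng.standard_normal((n_y, n_h))
b1 = np.zeros((n_y, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Z1 = W0 @ X + b0    # (n_h, n_x) @ (n_x, m) -> (n_h, m); b0 broadcasts over columns
A1 = sigmoid(Z1)
Z2 = W1 @ A1 + b1   # (n_y, n_h) @ (n_h, m) -> (n_y, m)
A2 = sigmoid(Z2)
print(Z1.shape, A2.shape)
```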


In [33]:
def loss(A2,Y):
    m=Y.shape[1]
    logprob=Y*np.log(A2)+(1-Y)*np.log(1-A2)
    cost=-np.sum(logprob)/m
    cost=np.squeeze(cost)
    return cost

## Forward Propagation
To implement forward propagation recall that 
  \begin{align*}
    Z^1&=W^0\cdot A^0+b^0\\
    A^1&=\sigma(Z^1)\\
    Z^2&=W^1\cdot A^1+b^1\\
    A^2&=\sigma(Z^2)
  \end{align*}


In [34]:
def model(X):
    Z1=np.dot(W0,X)+b0
    A1=sigmoid(Z1)
    Z2=np.dot(W1,A1)+b1
    A2=sigmoid(Z2)
    
    return A1,A2

### Back propagation
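To compute the gradients recall the formulas from class. Here $m$ is the number of samples, $\sum_s$ sums over the samples, $*$ is the element-wise product and $\sigma'=A^1*(1-A^1)$ is the derivative of the sigmoid at $Z^1$. The shapes are shown on the right.

\begin{align*}
   db^1&=\frac{1}{m}\sum_s(A^2-Y) & (1,1)\\
      dW^1&=\frac{1}{m}(A^2-Y)\cdot {A^1}^T& (1,m)\times(m,n_h)=(1,n_h)\\
      db^0&=\frac{1}{m}\sum_s\left[\left({W^1}^T\cdot (A^2-Y)\right)*\sigma'\right] & \sum_s (n_h,1)\times (1,m)=(n_h,1)\\
      dW^0&=\frac{1}{m}\left[\left({W^1}^T\cdot (A^2-Y)\right)*\sigma'\right]\cdot X^T &(n_h,1)\times(1,m)\times(m,n_x)=(n_h,n_x)
    \end{align*}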


It is convenient to add temporary variables dZ2 and dZ1 defined as: $dZ2=A^2-Y$, $dZ1=\left({W^1}^T\cdot dZ2\right)*\sigma'$

In [35]:
#def back_propagation(parameters,X,Y):
def gradient(X,Y):
    global dW0,db0,dW1,db1
    #we will be dividing by the number of samples m
    m=X.shape[1]
    
    A1,A2=model(X)
    cost=loss(A2,Y)
    
    # the derivative of the sigmoid
    gp=A1*(1-A1)
    #we will use some temporary variables
    dZ2=A2-Y
    dW1=np.dot(dZ2,A1.T)/m
    db1=np.sum(dZ2,axis=1,keepdims=True)/m
    dZ1=np.dot(W1.T,dZ2)*gp
    dW0=np.dot(dZ1,X.T)/m
    db0=np.sum(dZ1,axis=1,keepdims=True)/m
    return cost
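
One common way to gain confidence in hand-derived gradients is to compare them with finite differences. The sketch below is a self-contained illustration on a tiny made-up network (3 inputs, 4 hidden nodes, 6 random samples); the helper names `forward`, `cost_fn`, `analytic_grads` and `numeric_grads` are hypothetical and pass the parameters explicitly instead of using the global variables of this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(params, X):
    W0, b0, W1, b1 = params
    A1 = sigmoid(W0 @ X + b0)
    A2 = sigmoid(W1 @ A1 + b1)
    return A1, A2

def cost_fn(params, X, Y):
    _, A2 = forward(params, X)
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / Y.shape[1]

def analytic_grads(params, X, Y):
    # Same formulas as in the gradient function above.
    W0, b0, W1, b1 = params
    m = X.shape[1]
    A1, A2 = forward(params, X)
    dZ2 = A2 - Y
    dW1 = dZ2 @ A1.T / m
    db1 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W1.T @ dZ2) * A1 * (1 - A1)
    dW0 = dZ1 @ X.T / m
    db0 = np.sum(dZ1, axis=1, keepdims=True) / m
    return [dW0, db0, dW1, db1]

def numeric_grads(params, X, Y, h=1e-6):
    # Central difference, perturbing one parameter entry at a time.
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + h
            plus = cost_fn(params, X, Y)
            p[idx] = old - h
            minus = cost_fn(params, X, Y)
            p[idx] = old
            g[idx] = (plus - minus) / (2 * h)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
n_x, n_h, n_y, m = 3, 4, 1, 6                      # toy sizes, chosen arbitrarily
X = rng.random((n_x, m))
Y = (rng.random((n_y, m)) > 0.5).astype(float)     # made-up 0/1 labels
params = [0.5 * rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1)),
          0.5 * rng.standard_normal((n_y, n_h)), np.zeros((n_y, 1))]

ga = analytic_grads(params, X, Y)
gn = numeric_grads(params, X, Y)
max_err = max(np.max(np.abs(a - n)) for a, n in zip(ga, gn))
print(max_err)
```

If the formulas are implemented correctly, `max_err` should be many orders of magnitude smaller than the gradients themselves.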


In [36]:
gradient(x_train,y_train)

0.7631966068072591

### Updating the parameters
In every iteration we need to update the parameters using the gradients computed above.
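
Written out, one gradient descent step takes the form below, where $\alpha$ stands for the `learning_rate` parameter (the symbol $\alpha$ is notation introduced here for illustration only).

\begin{align*}
W^0 &\leftarrow W^0-\alpha\, dW^0 & b^0 &\leftarrow b^0-\alpha\, db^0\\
W^1 &\leftarrow W^1-\alpha\, dW^1 & b^1 &\leftarrow b^1-\alpha\, db^1
\end{align*}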

In [37]:
def apply_gradients(learning_rate):

    global W0,b0,W1,b1
    W0=W0-learning_rate*dW0
    b0=b0-learning_rate*db0
    W1=W1-learning_rate*dW1
    b1=b1-learning_rate*db1
    

### Computing the cost
Recall that for $m$ samples we defined the cross-entropy cost function, implemented by the loss function above, as
\begin{align*}
cost=\frac{-1}{m}\sum_s \left[Y*\log A^2+(1-Y)*\log (1-A^2)\right]
\end{align*}
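
As a small worked example, the sketch below evaluates the cost for two made-up samples. It also shows a hypothetical variant, `loss_clipped`, that keeps the outputs away from exactly 0 and 1 so that the logarithm stays finite if the sigmoid saturates. The clipping and the value of `eps` are assumptions added for illustration and are not part of the notebook's `loss` function.

```python
import numpy as np

def loss(A2, Y):
    m = Y.shape[1]
    logprob = Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
    return float(-np.sum(logprob) / m)

def loss_clipped(A2, Y, eps=1e-12):
    # Hypothetical safeguard: clip A2 into [eps, 1 - eps] before taking logs.
    return loss(np.clip(A2, eps, 1 - eps), Y)

Y = np.array([[1.0, 0.0]])       # made-up labels
A2 = np.array([[0.9, 0.2]])      # made-up network outputs
cost = loss(A2, Y)               # equals -(log 0.9 + log 0.8) / 2
print(cost)

saturated = np.array([[1.0, 0.0]])   # outputs that hit the extremes exactly
safe_cost = loss_clipped(saturated, Y)
print(safe_cost)
```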

### Gradient Descent
Having implemented all the functions above, we can now implement gradient descent. Note that the number of nodes in the hidden layer, n_h, is a variable set in the Parameters section.


In [38]:
for i in range(nb_iterations):
    cost=gradient(x_train,y_train)
    apply_gradients(learning_rate)
    if i % 500 == 0:
        print ("Cost after iteration %i: %f" %(i, cost))


Cost after iteration 0: 0.763197
Cost after iteration 500: 0.116275
Cost after iteration 1000: 0.013728
Cost after iteration 1500: 0.006381


### Evaluating the results
At this point our network has learned the parameters. We test the predictions as follows: we compute the output $A^2$ on the test set, and for every data point we predict Mine (1) if $A^2>0.5$ and Rock (0) otherwise. Let $P$ denote the resulting 0/1 predictions. A prediction for data point $i$ is correct if $Y[i]=1$ and $P[i]=1$, or $Y[i]=0$ and $P[i]=0$.
The number of correct predictions can be computed compactly using the formula below
\begin{align*}
 Y\cdot {P}^T+(1-Y)\cdot(1-{P}^T)
\end{align*}

In [39]:
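A1,A2=model(x_test)
# threshold the output: True (Mine) if A2>0.5, otherwise False (Rock)
predictions=(A2>0.5)
# count the matches between labels and predictions; the result is a (1,1) array
correct=np.dot(y_test,predictions.T)+np.dot(1-y_test,1-predictions.T)
# .item() extracts the single value as a Python scalar
accuracy=100*correct.item()/y_test.shape[1]
print("Accuracy="+str(accuracy))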

Accuracy=90.625
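
The counting formula can be verified by hand on a small case. The sketch below uses four made-up labels and predictions, three of which agree, and also computes the accuracy with an element-wise comparison, which gives the same result.

```python
import numpy as np

y = np.array([[1.0, 0.0, 1.0, 1.0]])   # made-up labels
p = np.array([[1.0, 0.0, 0.0, 1.0]])   # made-up 0/1 predictions

# First term counts samples with label 1 and prediction 1,
# second term counts samples with label 0 and prediction 0.
correct = np.dot(y, p.T) + np.dot(1 - y, 1 - p.T)   # (1,1) array
n_correct = correct.item()

accuracy = 100 * np.mean(p == y)       # equivalent element-wise check
print(n_correct, accuracy)
```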
