<a href="https://colab.research.google.com/github/hikmatfarhat-ndu/CSC645/blob/master/2shallow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A shallow (two-layer) network: recognizing sonar data
In this exercise we will use a **two-layer** neural network (an input layer, one hidden layer and one output layer; the input layer is not counted) to classify two-class **sonar** data. Each entry is the result of bouncing sonar signals at different angles off a metal cylinder (Mine) or a rock (Rock). It contains 60 values between 0 and 1 and a corresponding label (M or R). A detailed description of the data set can be found [here](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)).



### Importing packages
We need the following packages: numpy for the computation, and pandas for reading the data from the file.

In [1]:
import numpy as np
import pandas as pd


### Retrieving the data from Kaggle
We will retrieve the data file from __Kaggle__ and later read it with the pandas package. __Note__: the `!` prefix allows us to run any shell command from the notebook.

To download from Kaggle you need to do the following:
1. Create a Kaggle account (it's free).
1. Log in to Kaggle, then click on the upper right corner and select "Account".
1. Toward the middle of the page, click the button "Create New API Token".
1. This will download a file named "kaggle.json" to your computer.
1. Upload kaggle.json to Colab (see below).

__NOTE__: The cell below will prompt you to select a file to upload.


In [2]:
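from google.colab import files
# Colab opens a file picker; select the kaggle.json downloaded above
file=files.upload()
# the kaggle command looks for the token in /root/.kaggle; -p avoids an error if the directory already exists
!mkdir -p /root/.kaggle
!mv kaggle.json  /root/.kaggle
!kaggle datasets download -d mattcarter865/mines-vs-rocks
!unzip mines-vs-rocks.zip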


Saving kaggle.json to kaggle.json
Downloading mines-vs-rocks.zip to /content
  0% 0.00/29.1k [00:00<?, ?B/s]
100% 29.1k/29.1k [00:00<00:00, 48.6MB/s]


### Reading the data

We use pandas to read the .csv file. Since the file has no header row, we pass the `header=None` parameter; otherwise the first line of data would be treated as a header.

In [4]:
df=pd.read_csv("sonar.all-data.csv",header=None)


Archive:  mines-vs-rocks.zip
replace sonar.all-data.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: sonar.all-data.csv      


## Looking at the data

Next we print the shape of the imported data together with a few rows and columns. As you can see below, all the Rocks are grouped together, followed by all the Mines. Since we have a single set of 208 samples, we will later need to set aside a portion for testing.

In [3]:
print("the data has {} rows and {} columns".format(df.shape[0],df.shape[1]))
print("we will view the first few and last few rows/columns")
cols=[1,2,59,60]
rows=[1,2,206,207]
df.loc[rows,cols]

the data has 208 rows and 61 columns
we will view the first few and last few rows/columns


| | 1 | 2 | 59 | 60 |
|---|---|---|---|---|
| 1 | 0.0523 | 0.0843 | 0.0044 | R |
| 2 | 0.0582 | 0.1099 | 0.0078 | R |
| 206 | 0.0353 | 0.049 | 0.0048 | M |
| 207 | 0.0363 | 0.0136 | 0.0115 | M |


## Preprocessing the data

Before we start the learning process we need to preprocess the data. 

1. First, all the Rocks "R" are grouped together and all the Mines "M" are grouped together, as can be seen from the output of the previous cell. We use the numpy __shuffle__ function to mix the rows randomly.

1. Second, pandas reads the data as a pandas DataFrame, so we need to extract the data values and the label values as numpy arrays.

1. Third, we convert the labels from "M" to 1 and from "R" to 0.

1. Finally, we divide the data set into training and test subsets.

In [4]:
# extract the values of the pandas DataFrame as a numpy array
m=df.values
# randomize (shuffle) the data
np.random.shuffle(m)

# Each row has 61 entries, 60 for data and the last one is the label "M" or "R"

# X contains all the data
X=m[:,0:60].astype("float32")
# Y contains all the labels
Y=m[:,60]

# convert the labels: "M"->1 and "R"->0
Y=np.array([1.0 if i=='M' else 0.0 for i in Y])

Y=Y.reshape((len(Y),1))
Y=Y.astype("float32")

# split the data and labels into training and test sets
train_size=180
data_size=X.shape[0]

x_train=X[0:train_size,:]
x_test=X[train_size:data_size,:]

y_train=Y[0:train_size,:]
y_test=Y[train_size:data_size,:]

print("x_train shape={}".format(x_train.shape))
print("x_test shape={}".format(x_test.shape))
print("y_train shape={}".format(y_train.shape))
print("y_test shape={}".format(y_test.shape))

x_train shape=(180, 60)
x_test shape=(28, 60)
y_train shape=(180, 1)
y_test shape=(28, 1)
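
Because `np.random.shuffle` draws a different order on every run, the split (and therefore the final accuracy) changes from run to run. The following is a minimal sketch of a reproducible variant of the same preprocessing steps, run on a small made-up array. The names `toy` and `rng`, the seed 0 and the array contents are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

# Hypothetical stand-in for df.values: 2 feature columns followed by a label column
toy = np.array([[0.1, 0.2, "R"],
                [0.3, 0.4, "R"],
                [0.5, 0.6, "R"],
                [0.7, 0.8, "M"],
                [0.9, 1.0, "M"],
                [1.1, 1.2, "M"]], dtype=object)

# A seeded generator gives the same row order on every run
rng = np.random.default_rng(0)
shuffled = toy[rng.permutation(len(toy))]

# Features as float32, labels as a column of 1.0 ("M") and 0.0 ("R")
X = shuffled[:, 0:2].astype("float32")
Y = (shuffled[:, 2] == "M").astype("float32").reshape(-1, 1)

# First 4 shuffled rows for training, the remaining 2 for testing
train_size = 4
x_train, x_test = X[:train_size], X[train_size:]
y_train, y_test = Y[:train_size], Y[train_size:]
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

Indexing with a permutation shuffles whole rows, so every feature vector stays paired with its own label.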


## Network Architecture and Parameters

The layout of the neural network is shown in the figure below. There are three layers:
1. The __input__ layer has dimension 60, which is the number of features in each input sample.
1. The __hidden__ layer has dimension 16. This is our choice and is somewhat arbitrary.
1. The __output__ layer has dimension 1 since this is a 2-class classification problem. The output is the probability that the input comes from a Mine.


![alt text](shallow.png "Title")

In [5]:
learning_rate = 3
nb_iterations = 4000
# Network Parameters
n_h = 16 # number of neurons in hidden layer
n_x = x_train.shape[1] #number of neurons in input
n_y = y_train.shape[1] #number of neurons in output

### Sigmoid function
First we write the sigmoid function, $\sigma(z)=\frac{1}{1+e^{-z}}$, which is applied element-wise.

In [6]:
def sigmoid(z):
    s=1/(1+np.exp(-z))
    return s
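
For large negative inputs, `np.exp(-z)` overflows and numpy emits a runtime warning, although the result still rounds to the correct value 0. The following is a sketch of one common way to avoid the overflow. The name `stable_sigmoid` is hypothetical; the notebook itself keeps using `sigmoid`.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a positive number."""
    z = np.asarray(z, dtype=np.float64)
    # exp(-|z|) cannot overflow because its argument is always <= 0
    e = np.exp(-np.abs(z))
    # z >= 0: 1/(1+exp(-z));  z < 0: exp(z)/(1+exp(z)), which is the same value
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)  # the three values are 0.0, 0.5 and 1.0
```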

### Forward Propagation
Since we have two layers, we need two weight matrices and two bias vectors (see the figure above). The forward propagation equations below determine the shapes of the parameters and therefore how to initialize them.
Here $\sigma$ is the sigmoid function defined above, $A^0=X$ is the input, and $A^1$ and $A^2$ are the outputs of the first and second layers respectively. Recall that all the variables below (except the parameters) are vectorized versions containing all the samples, with the samples stacked as rows. So $A^0[0,:]=X[0,:]$ is the input of the first sample (index 0), and
$Z^1[0,0]$ is the value computed by the first node in the first layer (before the sigmoid) when the input is the first sample, and so on. Because the samples are rows, the input multiplies the weights from the left, so $W^0$ has shape $(n_x,n_h)$ and $W^1$ has shape $(n_h,n_y)$, and each bias vector is added to every row.

\begin{align*}
    Z^1&=A^0\cdot W^0+B^0\\
    A^1&=\sigma(Z^1)\\
    Z^2&=A^1\cdot W^1+B^1\\
    A^2&=\sigma(Z^2)
  \end{align*}


In [7]:
def model(X):
    Z1=np.dot(X,W0)+b0
    A1=sigmoid(Z1)
    Z2=np.dot(A1,W1)+b1
    A2=sigmoid(Z2)
    
    return A1,A2
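
A quick way to verify the shapes is to run the forward pass on random inputs. The sketch below repeats the computation of `model` but passes the parameters explicitly so that it is self-contained. The name `forward`, the batch size of 5 and the seeded generator are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W0, b0, W1, b1):
    """Same computation as model(), with the parameters passed as arguments."""
    Z1 = np.dot(X, W0) + b0      # (m, n_x) x (n_x, n_h) -> (m, n_h)
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W1) + b1     # (m, n_h) x (n_h, n_y) -> (m, n_y)
    A2 = sigmoid(Z2)
    return A1, A2

rng = np.random.default_rng(0)
m, n_x, n_h, n_y = 5, 60, 16, 1
X = rng.random((m, n_x))                 # 5 made-up samples with 60 features each
W0 = rng.standard_normal((n_x, n_h))
b0 = np.zeros(n_h)
W1 = rng.standard_normal((n_h, n_y))
b1 = np.zeros(n_y)

A1, A2 = forward(X, W0, b0, W1, b1)
print(A1.shape, A2.shape)  # (5, 16) (5, 1)
```

The bias `b0` has shape `(n_h,)`, and numpy broadcasting adds it to each of the `m` rows of `np.dot(X, W0)`.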

### Initialization

We initialize the weights randomly and the biases to zero. In numpy this is done with the `random.randn` and `zeros` functions: `np.random.randn(n,m)` creates an $n\times m$ matrix of samples from the standard normal distribution, and `np.zeros((n,m))` creates an $n\times m$ matrix of zeros.

In [8]:

W0=np.random.randn(n_x,n_h)
b0=np.zeros((n_h))
W1=np.random.randn(n_h,n_y)
b1=np.zeros((n_y))



### Computing the cost
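Recall that for $m$ samples we defined the cross-entropy cost function as follows, where the sum runs over the samples $s$ and $*$ denotes element-wise multiplication
\begin{align*}
cost=\frac{-1}{m}\sum_s \left[Y*\log A^2+(1-Y)*\log (1-A^2)\right]
\end{align*}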

In [9]:
def loss(A2,Y):
    m=Y.shape[0]
    logprob=Y*np.log(A2)+(1-Y)*np.log(1-A2)
    cost=-np.sum(logprob)/m
    cost=np.squeeze(cost)
    return cost
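
If the network becomes very confident, an entry of $A^2$ can round to exactly 0 or 1, and `np.log` then returns `-inf`. A common safeguard is to clip the probabilities away from 0 and 1 before taking the logarithm. The sketch below assumes a hypothetical helper `safe_loss` and a clipping constant `eps=1e-12`; neither is part of the notebook.

```python
import numpy as np

def safe_loss(A2, Y, eps=1e-12):
    """Cross-entropy cost with probabilities clipped to [eps, 1-eps]."""
    m = Y.shape[0]
    # work in float64: in float32, 1-eps would round back to exactly 1
    A2 = np.clip(np.asarray(A2, dtype=np.float64), eps, 1 - eps)
    logprob = Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
    return float(-np.sum(logprob) / m)

# Worked example with two samples: cost = -(log 0.9 + log 0.8) / 2
Y = np.array([[1.0], [0.0]])
A2 = np.array([[0.9], [0.2]])
print(safe_loss(A2, Y))  # about 0.164

# Saturated and completely wrong outputs still give a finite cost
print(safe_loss(np.array([[0.0], [1.0]]), Y))
```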

## Back propagation
To compute the gradients, recall the formulas from class. Here $m$ is the number of samples, $n_x$ is the size of the input and $n_h$ is the width of the hidden layer. $\theta$ is the derivative of the sigmoid with respect to its argument, evaluated at the hidden layer, i.e. $\theta=\sigma(Z^1)(1-\sigma(Z^1))=A^1*(1-A^1)$. As before, the samples are stacked as rows, $\cdot$ is matrix multiplication, $*$ is element-wise multiplication and $\sum_s$ sums over the samples (the rows). Each formula is listed with the shapes of its factors on the right.

\begin{align*}
   db^1&=\frac{1}{m}\sum_s(A^2-Y) & (1,1)\\
      dW^1&=\frac{1}{m}{A^1}^T\cdot (A^2-Y) & (n_h,m)\times (m,1)=(n_h,1)\\
      db^0&=\frac{1}{m}\sum_s\left[\left((A^2-Y)\cdot {W^1}^T\right)*\theta\right] & \sum_s \left[(m,1)\times (1,n_h)\right]=(1,n_h)\\
      dW^0&=\frac{1}{m}X^T\cdot\left[\left((A^2-Y)\cdot {W^1}^T\right)*\theta\right] &(n_x,m)\times(m,n_h)=(n_x,n_h)
    \end{align*}


It is convenient to add temporary variables dZ2 and dZ1 defined as: $dZ^2=A^2-Y$, $dZ^1=\left(dZ^2\cdot {W^1}^T\right)*\theta$

In [10]:
def gradient(X,Y):
    global dW0,db0,dW1,db1
    #we will be dividing by the number of samples m
    m=X.shape[0]
    
    A1,A2=model(X)
    cost=loss(A2,Y)
    
    # the derivative of the sigmoid
    theta=A1*(1-A1)
    #we will use some temporary variables
    dZ2=A2-Y
    dW1=np.dot(A1.T,dZ2)/m
    db1=np.sum(dZ2,axis=0,keepdims=True)/m
    dZ1=np.dot(dZ2,W1.T)*theta
    dW0=np.dot(X.T,dZ1)/m
    db0=np.sum(dZ1,axis=0,keepdims=True)/m
    return cost
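
A standard way to gain confidence in backpropagation formulas is to compare them with finite differences of the cost. The sketch below does this for $dW^0$ on a tiny made-up problem. It is self-contained and passes the parameters explicitly instead of using the notebook's globals; the names `cost_fn` and `analytic_dW0`, the problem sizes and the step `h` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_fn(X, Y, W0, b0, W1, b1):
    """Forward pass followed by the cross-entropy cost."""
    A1 = sigmoid(X @ W0 + b0)
    A2 = sigmoid(A1 @ W1 + b1)
    cost = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return cost, A1, A2

def analytic_dW0(X, Y, W0, b0, W1, b1):
    """dW0 from the backpropagation formulas (samples stacked as rows)."""
    m = X.shape[0]
    _, A1, A2 = cost_fn(X, Y, W0, b0, W1, b1)
    dZ2 = A2 - Y
    dZ1 = (dZ2 @ W1.T) * A1 * (1 - A1)
    return X.T @ dZ1 / m

rng = np.random.default_rng(0)
m, n_x, n_h = 8, 3, 4
X = rng.random((m, n_x))
Y = rng.integers(0, 2, size=(m, 1)).astype(float)
W0 = rng.standard_normal((n_x, n_h))
b0 = np.zeros((1, n_h))
W1 = rng.standard_normal((n_h, 1))
b1 = np.zeros((1, 1))

# Central differences: perturb one entry of W0 at a time
h = 1e-6
numeric = np.zeros_like(W0)
for i in range(n_x):
    for j in range(n_h):
        Wp = W0.copy(); Wp[i, j] += h
        Wm = W0.copy(); Wm[i, j] -= h
        cp = cost_fn(X, Y, Wp, b0, W1, b1)[0]
        cm = cost_fn(X, Y, Wm, b0, W1, b1)[0]
        numeric[i, j] = (cp - cm) / (2 * h)

max_diff = np.max(np.abs(analytic_dW0(X, Y, W0, b0, W1, b1) - numeric))
print("largest difference:", max_diff)
```

If the formulas are implemented correctly, the largest difference should be many orders of magnitude smaller than the gradient entries themselves.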


### Updating the parameters
At every iteration we update each parameter by subtracting the learning rate times the corresponding gradient.

In [11]:
def apply_gradients(learning_rate):

    global W0,b0,W1,b1
    W0=W0-learning_rate*dW0
    b0=b0-learning_rate*db0
    W1=W1-learning_rate*dW1
    b1=b1-learning_rate*db1
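
The update rule is easiest to see on a problem with a single parameter. The sketch below is a toy illustration and not part of the network: it minimizes the made-up function $f(w)=(w-3)^2$, whose gradient is $2(w-3)$, using the same "parameter minus learning rate times gradient" step. The names `w` and `alpha` and the chosen values are assumptions.

```python
# Toy gradient descent on f(w) = (w - 3)**2, which has its minimum at w = 3
w = 0.0        # starting point
alpha = 0.1    # learning rate

for _ in range(100):
    dw = 2 * (w - 3)      # gradient of f at the current w
    w = w - alpha * dw    # same form as W0 = W0 - learning_rate*dW0

print(w)  # very close to 3
```

Each step multiplies the distance to the minimum by $1-2\alpha=0.8$, so the iterate approaches 3 geometrically.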
    

### Gradient Descent
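Having implemented all the functions above, we can now run gradient descent: at each iteration we compute the gradients on the full training set and then update the parameters. Note that the number of nodes in the hidden layer is a variable (`n_h`), so it is easy to change.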


In [12]:
for i in range(nb_iterations):
    cost=gradient(x_train,y_train)
    apply_gradients(learning_rate)
    if i % 500 == 0:
        print ("Cost after iteration %i: %f" %(i, cost))


Cost after iteration 0: 0.976271
Cost after iteration 500: 0.186533
Cost after iteration 1000: 0.019441
Cost after iteration 1500: 0.008411
Cost after iteration 2000: 0.005124
Cost after iteration 2500: 0.003611
Cost after iteration 3000: 0.002756
Cost after iteration 3500: 0.002213


### Evaluating the results
At this point our network has learned the parameters. We test the predictions as follows: we compute the output $A^2$ on the test set, and for every data point $i$ we predict Mine ($P[i]=1$) if $A^2[i]>0.5$ and Rock ($P[i]=0$) otherwise. We then count the correct predictions. A prediction for data point $i$ is correct if $Y[i]=1$ and $P[i]=1$, or $Y[i]=0$ and
$P[i]=0$. The number of correct predictions can be computed compactly with the formula below
\begin{align*}
 P^T\cdot Y+(1-P)^T\cdot (1-Y)
\end{align*}

In [13]:
# Get the output of both layers using the model
A1,A2=model(x_test)
#convert the predicted probabilities to Mine (1) or Rock(0) 
predictions=(A2>0.5)

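# count (predicted 1, label 1) plus (predicted 0, label 0); the result is a 1x1 array
correct=np.dot(predictions.T,y_test)+np.dot(1-predictions.T,1-y_test)
# .item() extracts the single value as a Python float
accuracy=100*correct.item()/y_test.shape[0]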
print("correct={} out of total of {}".format(correct,y_test.shape[0]))
print("Accuracy="+str(accuracy))

correct=[[23.]] out of total of 28
Accuracy=82.14285714285714
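
The dot-product formula counts exactly the positions where the prediction equals the label, so it can be cross-checked against a direct element-wise comparison. The sketch below uses four made-up outputs and labels; the arrays and the names `P`, `correct_dot` and `correct_cmp` are assumptions for illustration.

```python
import numpy as np

# Made-up network outputs and true labels for 4 samples
A2 = np.array([[0.9], [0.2], [0.6], [0.4]])
y = np.array([[1.0], [0.0], [0.0], [1.0]])

# Threshold the probabilities at 0.5 to get 0/1 predictions
P = (A2 > 0.5).astype(float)

# Count with the dot-product formula and with a direct comparison
correct_dot = (P.T @ y + (1 - P).T @ (1 - y)).item()
correct_cmp = int(np.sum(P == y))
accuracy = 100 * np.mean(P == y)

print(correct_dot, correct_cmp, accuracy)  # 2.0 2 50.0
```

Here the first two samples are classified correctly and the last two are not, so both counts give 2 out of 4.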
