#### Deep Neural Network    
The "depth" of a deep neural network refers to its number of layers.      

![](../Figures/cnn-4.PNG)

#### Activation Functions in DNN  
##### Sigmoid function   
Defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$, where $x$ is the input value and $e$ is the mathematical constant (approximately 2.718).  
The function maps any input in $(-\infty, +\infty)$ to a value between 0 and 1, which makes it useful for binary classification and logistic regression problems.   
A neuron that uses a sigmoid as its activation function is called a sigmoid unit.  

##### Hyperbolic tangent activation function   
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$  
The function maps any input to a value between -1 and 1.  

##### ReLU (Rectified Linear Unit)  
Defined as $\max(0, x)$: negative inputs become 0 and positive inputs pass through unchanged.  

##### Softmax  
Outputs a vector of values that sum up to 1.  
Each value indicates the probability of class membership.  
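
The four activation functions above can be compared on a few sample values. This is a minimal NumPy sketch, not a library implementation: ReLU is written as $\max(0, x)$, and `softmax_1d` is a hypothetical helper name that handles a single vector only.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the open interval (0, 1).
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # Squashes any real value into the open interval (-1, 1).
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # Clips negative values to 0 and passes positive values through unchanged.
    return np.maximum(0, x)

def softmax_1d(x):
    # Turns a 1-D vector of scores into positive values that sum to 1.
    exponentials = np.exp(x)
    return exponentials / np.sum(exponentials)

x = np.array([-2.0, 0.0, 2.0])   # sample inputs chosen for illustration
print("sigmoid:", sigmoid(x))
print("tanh:   ", tanh(x))
print("relu:   ", relu(x))
print("softmax:", softmax_1d(x))
```

Sigmoid, tanh, and ReLU act on each element independently, while softmax depends on the whole vector.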

|Network Type|Activation Function|  
|-|-|    
|Multilayer Perceptron|ReLU|  
|Convolutional Neural Net|ReLU|  
|Recurrent Neural Net|Sigmoid or Tanh|   

#### Output Activation Function   

|Problem Type|Activation Function|  
|-|-|    
|Binary Classification|Sigmoid|  
|Multiclass Classification|Softmax|  
|Multilabel Classification|Sigmoid|  
|Linear Regression|Linear|   
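
The difference between the multiclass and multilabel rows can be seen by applying both activations to the same scores. The sketch below uses made-up logits for three classes and is only an illustration.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])   # hypothetical raw scores for three classes

# Multiclass: softmax makes the classes compete, so the outputs sum to 1.
multiclass = np.exp(logits) / np.sum(np.exp(logits))

# Multilabel: sigmoid scores each class on its own, so the outputs need not sum to 1.
multilabel = 1 / (1 + np.exp(-logits))

print(multiclass, multiclass.sum())
print(multilabel, multilabel.sum())
```

With sigmoid outputs, several classes can have a high score at once, which is what a multilabel problem needs.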



![](../Figures/DNN_Activation_Fn.png)

#### Designing Neural Network   
![](../Figures/buildNN-1.png)   

Chain perceptrons in series to form the input, hidden, and output layers.  
Each layer has its own weights and its own sigmoid operation (activation).  

A layer with 10 nodes (neurons) is represented by a matrix with 10 columns and any number of rows, one row per example.   
The diagram represents a single row; the full matrix stacks one such set of nodes per row.   
The functions between the layers are called the **activation functions** of the network.     

  

**Input Nodes** One per feature (column) of the dataset. For MNIST: 784 pixels + bias = 785 nodes, so the input matrix is (60000, 785).  
**Output Nodes** One per label (class) of the dataset. The output matrix is (60000, 10).   
**Hidden Nodes** Their number is part of design and tuning. Say, 201.    

##### Weights   
The three-layer network above has two matrices of weights.  
Each matrix of weights in a neural network has as many rows as its input elements and    
as many columns as its output elements.  
In the above diagram, w1 is (n, d) and w2 is (d, k).  
Operations in the network: 
$$H = \text{sigmoid}(X \cdot w_1)$$
$$\hat{Y} = \text{sigmoid}(H \cdot w_2)$$
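
The shape rule can be checked with a small NumPy sketch. The sizes `m`, `n`, `d`, and `k` are arbitrary values chosen for illustration, the weights are random, and the bias nodes are left out to keep the shapes the same as in the two operations above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m, n, d, k = 4, 3, 5, 2          # hypothetical sizes: examples, inputs, hidden nodes, classes
rng = np.random.default_rng(0)

X = rng.normal(size=(m, n))      # one row per example, one column per input
w1 = rng.normal(size=(n, d))     # rows = input elements, columns = hidden nodes
w2 = rng.normal(size=(d, k))     # rows = hidden nodes, columns = output nodes

H = sigmoid(np.matmul(X, w1))        # (m, n) times (n, d) gives (m, d)
Y_hat = sigmoid(np.matmul(H, w2))    # (m, d) times (d, k) gives (m, k)

print(H.shape, Y_hat.shape)
```

The number of rows stays equal to the number of examples all the way through; only the number of columns changes from layer to layer.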

The weight matrix of a single perceptron has one row per input variable and one column per class.     
**A network for MNIST**   

![](../Figures/cnn-3.png)


#### Softmax   
For classification, the activation function that produces the output layer is softmax most of the time.  

Softmax takes an array of numbers, called _logits_:  
$\text{softmax}(l_i) = \frac{e^{l_i}}{\sum_j e^{l_j}}$   
Take the exponential of each logit and divide it by the sum of the exponentials of all the logits.  
Like the sigmoid, softmax returns an array where each element is between 0 and 1.  
Unlike the sigmoid, the elements of its output always sum to 1.   

For the logits [1.6, 3.1, 0.5], softmax returns [0.17198205, 0.77077009, 0.05724785].      
The chance of the item belonging to the second class is about 77%.   
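
As a check on the arithmetic, the softmax of the logits [1.6, 3.1, 0.5] can be worked by hand (all values rounded to three decimals):

$$e^{1.6} \approx 4.953, \quad e^{3.1} \approx 22.198, \quad e^{0.5} \approx 1.649, \quad \sum_j e^{l_j} \approx 28.800$$
$$\text{softmax}(l) \approx \left[\frac{4.953}{28.800},\ \frac{22.198}{28.800},\ \frac{1.649}{28.800}\right] \approx [0.172,\ 0.771,\ 0.057]$$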


In [None]:
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [None]:
def softmax(logits):
    exponentials = np.exp(logits)
    return exponentials/np.sum(exponentials, axis=1).reshape(-1,1)

For MNIST, the logits would be a (60000, 10) matrix.   
`axis=1` means the sum is calculated for each row, not for the entire matrix.   

In [None]:
sample = np.array([[0.3, 0.8, 0.2], [0.1,0.9,0.1]])
exponentials = np.exp(sample) 
print(exponentials); print()
print(np.sum(exponentials, axis=1)); print()
print(softmax(sample))

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

#### Forward Propagation   
The code for prediction in a perceptron is:   
```
def predict(X, w):
    weighted_sum = np.matmul(X, w)
    return sigmoid(weighted_sum)
```
To propagate the data forward through the network:   
1. Calculate the hidden layer, in the same way as the perceptron's prediction:   
    `h = sigmoid(np.matmul(prepend_bias(X), w1))`    
2. Repeat the process to calculate the output layer, this time with softmax:   
    `y_hat = softmax(np.matmul(prepend_bias(h), w2))`  

##### Refactor Code: Extract function      

In [None]:
def prepend_bias(X):
    return np.insert(X, 0, 1, axis=1)
    

In [None]:
import numpy as np
def forward(X, w1, w2):
    h = sigmoid(np.matmul(prepend_bias(X), w1))
    y_hat = softmax(np.matmul(prepend_bias(h), w2))  
    return y_hat
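
A quick way to try `forward()` is to call it with random data of MNIST-like shape. The sketch below repeats the helper functions so that it runs on its own. The inputs are random stand-ins rather than real images, the weights are random, and the hidden layer size of 200 nodes (201 with the bias) is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(logits):
    exponentials = np.exp(logits)
    return exponentials / np.sum(exponentials, axis=1).reshape(-1, 1)

def prepend_bias(X):
    # Adds a column of 1s in front of the matrix.
    return np.insert(X, 0, 1, axis=1)

def forward(X, w1, w2):
    h = sigmoid(np.matmul(prepend_bias(X), w1))
    return softmax(np.matmul(prepend_bias(h), w2))

rng = np.random.default_rng(0)
X = rng.random((5, 784))                        # 5 stand-in "images", not real MNIST data
w1 = rng.normal(scale=0.01, size=(785, 200))    # 784 inputs + bias to 200 hidden nodes (assumed size)
w2 = rng.normal(scale=0.01, size=(201, 10))     # 200 hidden nodes + bias to 10 classes

y_hat = forward(X, w1, w2)
print(y_hat.shape)          # one row per example, one column per class
print(y_hat.sum(axis=1))    # each row of softmax outputs sums to 1
```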

#### classify() function  
The neural network's `classify()` is similar to the perceptron's, except that   
it takes two weight matrices instead of one.   
For MNIST, the output layer is a (60000, 10) matrix.    
`argmax()` picks the most probable class in each row, and `reshape()` turns the result into a (60000, 1) matrix.  
  

In [None]:
def classify(X, w1, w2):
    y_hat = forward(X, w1, w2)
    labels = np.argmax(y_hat, axis=1)
    return labels.reshape(-1, 1)
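
The effect of `argmax()` and `reshape()` can be seen on a tiny example. The values below are made-up softmax outputs for three examples and four classes.

```python
import numpy as np

# Hypothetical softmax outputs: one row per example, one column per class.
y_hat = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.2, 0.1, 0.5]])

labels = np.argmax(y_hat, axis=1)   # index of the largest value in each row
column = labels.reshape(-1, 1)      # one row per example, a single column

print(labels, labels.shape)
print(column.shape)
```

Reshaping to a single column gives the predictions the same shape as a column of labels, so the two can be compared element by element.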

In [None]:
def report(iteration, X_train, Y_train, X_test, Y_test, w1, w2):
    y_hat = forward(X_train, w1, w2)
    training_loss = loss(Y_train, y_hat)
    classifications = classify(X_test, w1, w2)
    accuracy = np.average(classifications == Y_test)*100
    print(f"{iteration} {training_loss} {accuracy}")
    

#### Cross Entropy Loss   
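Softmax and cross-entropy loss pair well.  
$$L = -\frac{1}{m}\sum_{i} y_i \cdot \log(\hat{y}_i)$$
where $m$ is the number of examples, $y_i$ is the one-hot encoded label of example $i$, and $\hat{y}_i$ is its vector of predicted probabilities.  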


In [None]:
def loss(Y, y_hat):
    return -np.sum(Y * np.log(y_hat)) / Y.shape[0]
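
Because `Y` is multiplied element by element with `np.log(y_hat)`, this loss expects one-hot encoded labels. The sketch below shows a small worked case. `one_hot_encode()` is a hypothetical helper written for this example, and the predictions are made up.

```python
import numpy as np

def one_hot_encode(labels, n_classes):
    # Hypothetical helper: turns integer labels into one-hot rows.
    encoded = np.zeros((labels.shape[0], n_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1
    return encoded

def loss(Y, y_hat):
    return -np.sum(Y * np.log(y_hat)) / Y.shape[0]

labels = np.array([1, 0])              # true classes of two examples
Y = one_hot_encode(labels, 3)
y_hat = np.array([[0.1, 0.7, 0.2],     # made-up predictions
                  [0.5, 0.3, 0.2]])

# Only the predicted probability of the true class contributes in each row,
# so the result equals -(log(0.7) + log(0.5)) / 2.
print(loss(Y, y_hat))
```

The closer the predicted probability of the true class is to 1, the closer that example's contribution is to 0.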

#### Complete Code   

In [None]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(logits):
    exponentials = np.exp(logits)
    return exponentials / np.sum(exponentials, axis=1).reshape(-1, 1)

def loss(Y, y_hat):
    return -np.sum(Y * np.log(y_hat)) / Y.shape[0]

def prepend_bias(X):
    return np.insert(X, 0, 1, axis=1)

def forward(X, w1, w2):
    h = sigmoid(np.matmul(prepend_bias(X), w1))
    y_hat = softmax(np.matmul(prepend_bias(h), w2))
    return y_hat

def classify(X, w1, w2):
    y_hat = forward(X, w1, w2)
    labels = np.argmax(y_hat, axis=1)
    return labels.reshape(-1, 1)

def report(iteration, X_train, Y_train, X_test, Y_test, w1, w2):
    y_hat = forward(X_train, w1, w2)
    training_loss = loss(Y_train, y_hat)
    classifications = classify(X_test, w1, w2)
    accuracy = np.average(classifications == Y_test) * 100.0
    print(f"{iteration:5} {training_loss:.6f} {accuracy:.2f}%")

#### Backpropagation   
Backpropagation calculates the gradients of a neural network's loss with respect to its weights using the chain rule.  
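
For the network in these notes (sigmoid hidden layer, softmax output, cross-entropy loss), the chain rule leads to the gradients sketched below. This is one possible implementation under those assumptions, not code from the original notebook: `forward_with_hidden()` and `back()` are hypothetical names, and the data and weights are random. A finite-difference check on a single weight is included as a sanity test.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(logits):
    exponentials = np.exp(logits)
    return exponentials / np.sum(exponentials, axis=1).reshape(-1, 1)

def prepend_bias(X):
    return np.insert(X, 0, 1, axis=1)

def loss(Y, y_hat):
    return -np.sum(Y * np.log(y_hat)) / Y.shape[0]

def forward_with_hidden(X, w1, w2):
    # Same steps as forward(), but also returns h, which the gradients need.
    h = sigmoid(np.matmul(prepend_bias(X), w1))
    y_hat = softmax(np.matmul(prepend_bias(h), w2))
    return y_hat, h

def back(X, Y, y_hat, w2, h):
    m = X.shape[0]
    # For softmax followed by cross-entropy, the error at the output is y_hat - Y.
    output_error = y_hat - Y
    w2_gradient = np.matmul(prepend_bias(h).T, output_error) / m
    # Chain rule: go back through w2 (skipping its bias row), then through the sigmoid.
    hidden_error = np.matmul(output_error, w2[1:].T) * h * (1 - h)
    w1_gradient = np.matmul(prepend_bias(X).T, hidden_error) / m
    return w1_gradient, w2_gradient

rng = np.random.default_rng(0)
X = rng.random((6, 4))                       # 6 made-up examples, 4 features
Y = np.eye(3)[rng.integers(0, 3, size=6)]    # one-hot labels for 3 classes
w1 = rng.normal(size=(5, 2))                 # 4 inputs + bias to 2 hidden nodes
w2 = rng.normal(size=(3, 3))                 # 2 hidden nodes + bias to 3 classes

y_hat, h = forward_with_hidden(X, w1, w2)
w1_gradient, w2_gradient = back(X, Y, y_hat, w2, h)

# Finite-difference check: nudge one weight and measure how the loss changes.
eps = 1e-6
w1_nudged = w1.copy()
w1_nudged[0, 0] += eps
numeric = (loss(Y, forward_with_hidden(X, w1_nudged, w2)[0]) - loss(Y, y_hat)) / eps
print(w1_gradient[0, 0], numeric)   # the two values should be close
```

Each gradient has the same shape as the weight matrix it belongs to, so a gradient descent step can subtract it directly, scaled by a learning rate.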

### Solution with TensorFlow   
Using the Keras API that ships with TensorFlow (`tf.keras`).  

In [None]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

#### Classify MNIST dataset with DNN   
The MNIST dataset, a set of four NumPy arrays, comes preloaded in Keras.  
Load the datasets and normalise them (min-max).   
Dividing by 255 normalises the pixel values to the range 0 to 1 and converts them to floating-point numbers.   

In [None]:
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# normalise input data 
x_train = x_train / 255.0 
x_test = x_test / 255.0
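
The effect of the division can be checked on a tiny array. The sketch below uses three hand-picked pixel values instead of the real dataset, and assumes the pixels are stored as unsigned 8-bit integers from 0 to 255.

```python
import numpy as np

# Stand-in for image data: unsigned 8-bit pixel values from 0 to 255.
pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# Min-max scaling is (x - min) / (max - min); with min 0 and max 255 this is x / 255.
scaled = pixels / 255.0     # true division turns the integers into floats

print(pixels.dtype, scaled.dtype)
print(scaled.min(), scaled.max())
```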

In [None]:
print(f"{x_train.shape = }")
print(f"{x_test.shape  = }")
print(f"{y_train = }")
print(f"{y_test  = }")

1. Feed the neural network the training data, `x_train` and `y_train`.     
2. The network will learn to associate images and labels.  
3. Ask the network to produce predictions for `x_test`.  
4. Verify whether these predictions match the labels in `y_test`.   

#### Make a model (neural network)      
Two densely connected layers, with a dropout layer between them.   
The last layer is a 10-way layer that returns one raw score (logit) per digit class. Passing these scores through softmax turns them into the probabilities that the current digit image belongs to each of the 10 digit classes.  

In [None]:
from keras import models
from keras import layers

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

A Sequential model is a stack of layers in which each layer has one input tensor and one output tensor.  
This model uses Flatten, Dense, and Dropout layers.  
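
What these three layer types compute can be imitated in plain NumPy. The sketch below is only an illustration of the arithmetic, not how Keras implements the layers. The images and weights are random stand-ins, and the dropout step shows the training-time behaviour, where a fraction of values is zeroed and the rest are scaled up by 1 / (1 - rate).

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((2, 28, 28))               # two stand-in images, not real MNIST data

# Flatten: (2, 28, 28) becomes (2, 784), one row per image.
flat = images.reshape(images.shape[0], -1)

# Dense with 128 units and ReLU: multiply by the weights, add the bias, apply ReLU.
W = rng.normal(scale=0.05, size=(784, 128))    # illustrative random weights
b = np.zeros(128)
hidden = np.maximum(0, np.matmul(flat, W) + b)

# Dropout with rate 0.2 during training: zero roughly 20% of the values
# and scale the remaining ones by 1 / (1 - 0.2).
keep_mask = rng.random(hidden.shape) >= 0.2
dropped = hidden * keep_mask / 0.8

print(flat.shape, hidden.shape, dropped.shape)
```

Dropout is only active during training; at prediction time the values pass through unchanged.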

For each example, the model returns a vector of logits or log-odds scores, one for each class.  

#### Train the model    
1. Before you start training, configure and compile the model using Keras `Model.compile`.  
Set the optimizer to `adam`, set the loss to the `loss_fn` object defined in the code cell below, and specify the metric to be evaluated for the model by setting the `metrics` parameter to `accuracy`.  
2. Train and evaluate the model.    


##### Optimiser classes  
Adadelta, Adafactor, Adagrad, Adam, AdamW, Adamax, 
Ftrl, Lion, LossScaleOptimizer, Nadam, Optimizer, RMSprop, and SGD   

##### Loss Functions   
BinaryCrossentropy, BinaryFocalCrossentropy, CTC, CategoricalCrossentropy,    
CategoricalFocalCrossentropy, CategoricalHinge, CosineSimilarity, Dice,  
Hinge, Huber, KLDivergence, LogCosh, Loss, MeanAbsoluteError,   
MeanAbsolutePercentageError, MeanSquaredError, MeanSquaredLogarithmicError,  
Poisson, Reduction, SparseCategoricalCrossentropy, SquaredHinge.  

##### Metrics   
AUC, Accuracy, BinaryAccuracy, BinaryCrossentropy, BinaryIoU, CategoricalAccuracy,   
CategoricalCrossentropy, CategoricalHinge, CosineSimilarity, F1Score, FBetaScore,   
FalseNegatives, FalsePositives, Hinge, IoU, KLDivergence, LogCoshError, Mean, MeanAbsoluteError,   
MeanAbsolutePercentageError, MeanIoU, MeanMetricWrapper, MeanSquaredError, MeanSquaredLogarithmicError,   
Metric, OneHotIoU, OneHotMeanIoU, Poisson, Precision, PrecisionAtRecall, R2Score, Recall, RecallAtPrecision,   
RootMeanSquaredError, SensitivityAtSpecificity, SparseCategoricalAccuracy, SparseCategoricalCrossentropy,  
 SparseTopKCategoricalAccuracy, SpecificityAtSensitivity, SquaredHinge, Sum,   
 TopKCategoricalAccuracy, TrueNegatives, TruePositives

In [None]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
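
This loss takes integer labels rather than one-hot vectors, and `from_logits=True` tells it that the model outputs raw scores that have not yet been through softmax. The NumPy sketch below shows the idea on made-up values. The function is a hypothetical helper for illustration, not the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(labels, logits):
    # Hypothetical NumPy helper, not the Keras implementation.
    # Subtracting the row maximum leaves softmax unchanged and avoids overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    # Pick the log-probability of the true class in each row and average over the rows.
    return -np.mean(log_probs[np.arange(labels.shape[0]), labels])

labels = np.array([2, 0])                  # integer class labels, not one-hot
logits = np.array([[0.0, 0.0, 0.0],        # made-up raw scores for three classes
                   [2.0, 1.0, 0.1]])

print(sparse_categorical_crossentropy_from_logits(labels, logits))
```

For the first row all three classes are equally likely, so its contribution is $-\log(1/3)$, the loss of an untrained model that guesses uniformly among three classes.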

In [None]:
model.fit(x_train, y_train, epochs=5)

#### Validate on test set  

In [None]:
model.evaluate(x_test,  y_test, verbose=2)

In [None]:
from random import randint

print(y_test[:10])

In [None]:
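num = randint(0, 9)   # random index into the first 10 test examples
print(f"\nindex of random test value: {num}; digit: {y_test[num]}\n")

# model.predict expects a batch, so slice to keep the leading dimension
print(f"\nPredicted digit: {np.argmax(model.predict(x_test[num:num+1]))}")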