### Abstract

Deep learning, a subset of machine learning and artificial intelligence, is a family of learning algorithms loosely modeled on how the human brain (neurological system) works. It can recognize complex patterns in data such as images, text, and sound.

As an analogy, in the neurological system a neuron receives inputs from other neurons or external sources, processes them, and generates an output. In deep learning, a neuron can be seen as a function that takes inputs, weights, and a bias, computes a linear function ($Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$), and then applies an activation function ($A^{[l]}=g(Z^{[l]})$).
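
To make the analogy concrete, here is a minimal sketch of a single neuron with three inputs. The input, weight, and bias values are made up for illustration, and ReLU is assumed as the activation $g$.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
a_prev = np.array([[1.0], [2.0], [3.0]])   # inputs from 3 upstream neurons, shape (3, 1)
w = np.array([[0.5, -1.0, 1.0]])           # one weight per input, shape (1, 3)
b = np.array([[0.1]])                      # bias, shape (1, 1)

z = w @ a_prev + b                         # linear step: Z = W A + b
a = np.maximum(0, z)                       # activation step: g(Z) = ReLU(Z)
```

Here `z` works out to 0.5 - 2.0 + 3.0 + 0.1 = 1.6, and since it is positive, ReLU passes it through unchanged.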


### Deep learning vs. Machine learning

>Deep learning eliminates some of data pre-processing that is typically involved with machine learning. These algorithms can ingest and process unstructured data, like text and images, and it automates feature extraction, removing some of the dependency on human experts. For example, let’s say that we had a set of photos of different pets, and we wanted to categorize by “cat”, “dog”, “hamster”, et cetera. Deep learning algorithms can determine which features (e.g. ears) are most important to distinguish each animal from another. In machine learning, this hierarchy of features is established manually by a human expert.

>Then, through the processes of gradient descent and backpropagation, the deep learning algorithm adjusts and fits itself for accuracy, allowing it to make predictions about a new photo of an animal with increased precision.  

> Source: https://www.ibm.com/topics/deep-learning


### Forward and Backward Propagation

Forward propagation involves passing input data through a neural network to obtain predictions.
Backward propagation computes gradients of model parameters with respect to a loss function.
By adjusting weights based on gradients, the model gradually improves its predictions.


* **Forward propagation** &nbsp; $A^{[l-1]}, W^{[l]}, b^{[l]} \rightarrow Z^{[l]}, A^{[l]}$

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$

$$A^{[l]} = g^{[l]} (Z^{[l]})$$

* **Backward propagation** &nbsp; $dA^{[l]} \rightarrow dA^{[l-1]},dW^{[l]}, db^{[l]}$

$$dZ^{[l]} = dA^{[l]} * {g^{[l]}}^{'}(Z^{[l]})$$

$$dW^{[l]} = \frac{1}{m}dZ^{[l]}{A^{[l-1]}}^T$$

$$db^{[l]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l](i)}$$

In NumPy, the sum over the $m$ examples is `np.sum(dZ, axis=1, keepdims=True)`.

$$dA^{[l-1]} = \frac{dJ}{dA^{[l-1]}} = \frac{dZ^{[l]}}{dA^{[l-1]}} \frac{dJ}{dZ^{[l]}} = {W^{[l]}}^T dZ^{[l]}$$

$$\text{where} \hspace{3mm} dZ^{[L]} = A^{[L]}-Y$$


The loss for a single example is the binary cross-entropy $L = -Y\log A -(1-Y)\log(1-A)$, and the cost $J$ is its average over the $m$ training examples.
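
As a short worked derivation, the output-layer shortcut $dZ^{[L]} = A^{[L]} - Y$ follows from this loss when the output activation is the sigmoid, whose derivative is $g'(Z) = A(1-A)$. The layer superscript $[L]$ is dropped below for readability.

$$dA = \frac{\partial L}{\partial A} = -\frac{Y}{A} + \frac{1-Y}{1-A}$$

$$dZ = dA * g'(Z) = \left(-\frac{Y}{A} + \frac{1-Y}{1-A}\right) A(1-A) = -Y(1-A) + (1-Y)A = A - Y$$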


<img src="img/2layerNN_kiank.png" style="width:650px;height:400px;">
<caption><center> <u>Figure 2</u>: 2-layer neural network. <br> The model can be summarized as: ***INPUT -> LINEAR -> RELU -> LINEAR -> SIGMOID -> OUTPUT***. </center></caption>


<img src="img/LlayerNN_kiank.png" style="width:650px;height:400px;">
<caption><center> <u>Figure 3</u>: L-layer neural network. <br> The model can be summarized as: ***[LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID***</center></caption>

### Objective

In the following Jupyter notebook, I go through what I've learned from Andrew Ng's deep learning specialization course.

I revisit the model for training an L-layer deep neural network that identifies cats, with a binary output of 0 (non-cat) or 1 (cat). The training and test data are provided by the lecture. Here is the list of methods that I'm going to implement:

```python
# 1. initialize
def initialize_parameters_deep(layer_dims):
  return parameters

# 2. activation functions and linear forward
def relu(Z):
  return A, cache

def sigmoid(Z):
  return A, cache

def linear_forward(A, W, b):
  return Z, cache

# 3. activation forward
def linear_activation_forward(A_prev, W, b, activation):
  return A, cache

# 4. nn forward
def L_model_forward(X, parameters):
  return AL, caches

# 5. compute cost
def compute_cost(AL, Y):
  return cost

# 6. backward propagation
# dZ = dA * g'(Z) where g(Z) = relu
def relu_backward(dA, cache):
  return dZ

# dZ = dA * g'(Z) where g(Z) = sigmoid,
# g'(Z) = g(Z)(1- g(Z))
def sigmoid_backward(dA, cache):
  return dZ

def linear_backward(dZ, linear_cache):
  return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
  return dA_prev, dW, db

def compute_dA(A,Y):
  return dA

def L_model_backward(AL, Y, caches):
  return grads

# 7. update parameters
def update_parameters(parameters, grads, learning_rate):
  return parameters

# 8. predict
def predict(X, y, parameters):
  return np.zeros((1,m))
```


### Related Topics

- [Geoffrey Hinton: Will digital intelligence replace biological intelligence?](https://www.youtube.com/watch?v=iHCeAotHZa4)
- [Henrik Kniberg: Generative AI in a Nutshell](https://www.youtube.com/watch?v=2IK3DFHRFfw)
- [Emergent Garden: Why Neural Networks can learn (almost) anything](https://www.youtube.com/watch?v=0QczhVg5HaI)
- [Emergent Garden: Watching Neural Networks Learn](https://www.youtube.com/watch?v=TkwXa7Cvfr8)
- [Steve Brunton: Physics Informed Machine Learning](https://www.youtube.com/watch?v=JoFW2uSd3Uo)
- [How I became a machine learning practitioner](https://blog.gregbrockman.com/how-i-became-a-machine-learning-practitioner)
- [Gavin Uberti - Real-Time AI & The Future of AI Hardware](https://podcasts.apple.com/tw/podcast/gavin-uberti-real-time-ai-the-future-of-ai-hardware/id1154105909?i=1000638288111)
- [Michael Royzen: Beating GPT-4 with Open Source Models (Phind)](https://www.youtube.com/watch?v=z1rHPFiY6FA)
- [YC Combinator AI startup by college kids (2024)](https://www.youtube.com/watch?v=fmI_OciHV_8)
- [Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind](https://www.youtube.com/watch?v=UTuuTTnjxMQ)

### References

- [Make Your Own Neural Network](https://www.amazon.com/Make-Your-Own-Neural-Network/dp/1530826608)
- [Neural Networks From Scratch](https://nnfs.io/)
- [Understanding Deep Learning](https://udlbook.github.io/udlbook/)
- [Deep Learning: Foundations and Concepts](https://bishopbook.com/)
- [Zero to Mastery Learn PyTorch for Deep Learning](https://www.learnpytorch.io/)
- [LLMs from scratch](https://github.com/rasbt/LLMs-from-scratch)
- [Andrej Karpathy, intro to neural networks and backpropagation: building micrograd](https://www.youtube.com/watch?v=VMj-3S1tku0)
- [Diffusion models from scratch, from a new theoretical perspective](https://www.chenyang.co/diffusion.html)
- [3Blue1Brown: But what is a neural network?](https://www.youtube.com/watch?v=aircAruvnKk)
- [3Blue1Brown: Visualizing Attention, a Transformer's Heart](https://www.youtube.com/watch?v=eMlx5fFNoYc)
- [Mamba explained](https://thegradient.pub/mamba-explained/)
- [LLaMA Now Goes Faster on CPUs](https://justine.lol/matmul/)
- [Mamba-Palooza: 90 Days of Mamba-Inspired Research with Jason Meaux: Part 1](https://www.youtube.com/watch?v=Bg1LQ_jWliU)
- [Mamba-Palooza: 90 Days of Mamba-Inspired Research with Jason Meaux: Part 2](https://www.youtube.com/watch?v=MwIiQsEVyew)

In [1]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
%load_ext autoreload
%autoreload 2

# NOTE: for showing consistent result for each execution
np.random.seed(1)

In [2]:
# Define activation functions

# ReLU is commonly used as the activation for the hidden layers l = 1, ..., L-1.
# Its derivative is 1 for every positive input, so it does not saturate there,
# which tends to let gradient descent converge faster than with sigmoid or tanh.
def relu(Z):
  """
  Z: any NumPy array or scalar
  """
  A = np.maximum(0,Z)

  assert(A.shape == Z.shape)

  cache = Z
  return A, cache


# Sigmoid is commonly used as the activation for the output layer l = L, where \hat{y} = a^{[L]}.
# In a binary classification problem the estimate \hat{y} has to lie between 0 and 1.
def sigmoid(Z):
  A = 1/(1+np.exp(-Z))
  cache = Z

  assert(A.shape == Z.shape)

  return A, cache
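
The `sigmoid` above calls `np.exp(-Z)` directly, which can overflow and emit a warning for large negative `Z`. As a hedged sketch (not part of the course code), a numerically stable variant can branch on the sign of `Z`. The name `stable_sigmoid` is hypothetical.

```python
import numpy as np

def stable_sigmoid(Z):
  """Sigmoid that only ever exponentiates non-positive numbers (hypothetical helper)."""
  Z = np.asarray(Z, dtype=float)
  out = np.empty_like(Z)
  pos = Z >= 0
  out[pos] = 1.0 / (1.0 + np.exp(-Z[pos]))   # -Z <= 0 here, so exp cannot overflow
  exp_z = np.exp(Z[~pos])                    # Z < 0 here, so exp cannot overflow
  out[~pos] = exp_z / (1.0 + exp_z)
  return out

Z = np.array([[-1000.0, -1.0, 0.0, 1.0, 1000.0]])
A = stable_sigmoid(Z)
```

Both branches compute the same function. The split only keeps the argument of `np.exp` non-positive.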


In [3]:
def initialize_parameters_deep(layer_dims):
  """
  layer_dims: list with the size of every layer, input layer included,
    e.g. [n_0, n_1, ..., n_L]

  Initializes the weights and biases of every layer
  l = 1, 2, ..., len(layer_dims) - 1. The last of these is the output
  layer, so the parameters that `L_model_forward(X, parameters)` needs
  for the sigmoid layer are already included.

  W_l: random matrix of shape (n_l, n_{l-1}), scaled by 0.01
  b_l: zero matrix of shape (n_l, 1)
  """

  np.random.seed(1)

  parameters = {}
  L = len(layer_dims)

  for l in range(1, L):
    parameters["W"+str(l)]= np.random.randn(layer_dims[l], layer_dims[l-1])*.01
    parameters["b"+str(l)]= np.zeros((layer_dims[l],1))

    assert(parameters["W"+str(l)].shape == (layer_dims[l], layer_dims[l-1]))
    assert(parameters["b"+str(l)].shape == (layer_dims[l], 1))

  return parameters


def linear_forward(A, W, b):
  """
  A: previous activation
  W: weight
  b: bias
  """
  Z = np.dot(W,A) + b
  cache = (A,W,b)
  return Z, cache


def linear_activation_forward(A_prev, W, b, activation):
  """
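  A_prev: previous activation
  W: weight
  b: bias
  activation: "relu" or "sigmoid"
  """
  # The linear step is the same for both activations.
  Z, linear_cache = linear_forward(A_prev, W, b)

  if activation == "relu":
    A, activation_cache = relu(Z)
  elif activation == "sigmoid":
    A, activation_cache = sigmoid(Z)
  else:
    # Without this branch an unknown name would fail later with an unbound variable.
    raise ValueError("unsupported activation: " + str(activation))

  cache = (linear_cache, activation_cache)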

  return A, cache
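
The shapes in `linear_forward` rely on NumPy broadcasting: `b` has shape `(n_l, 1)` while `np.dot(W, A)` has shape `(n_l, m)`. The sketch below, with made-up layer sizes and an arbitrary seed, shows that the bias column is added to every one of the `m` examples.

```python
import numpy as np

rng = np.random.default_rng(0)            # seed chosen arbitrarily
n_prev, n_curr, m = 4, 3, 5               # hypothetical layer sizes and batch size

A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n_curr, n_prev)) * 0.01
b = np.ones((n_curr, 1))                  # non-zero bias so the broadcast is visible

Z = W @ A_prev + b                        # b of shape (3, 1) broadcasts across the 5 columns

# Column 0 of Z equals the single-example computation for example 0.
Z_first = W @ A_prev[:, [0]] + b
```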



In [4]:
layer_dims = [5,4,3]
parameters = initialize_parameters_deep(layer_dims)

for l in range(1, len(layer_dims)):
  print("W"+str(l),"=", parameters["W"+str(l)])
  print("b"+str(l),"=", parameters["b"+str(l)])

W1 = [[ 0.01624345 -0.00611756 -0.00528172 -0.01072969  0.00865408]
 [-0.02301539  0.01744812 -0.00761207  0.00319039 -0.0024937 ]
 [ 0.01462108 -0.02060141 -0.00322417 -0.00384054  0.01133769]
 [-0.01099891 -0.00172428 -0.00877858  0.00042214  0.00582815]]
b1 = [[0.]
 [0.]
 [0.]
 [0.]]
W2 = [[-0.01100619  0.01144724  0.00901591  0.00502494]
 [ 0.00900856 -0.00683728 -0.0012289  -0.00935769]
 [-0.00267888  0.00530355 -0.00691661 -0.00396754]]
b2 = [[0.]
 [0.]
 [0.]]


In [5]:
def L_model_forward(X, parameters):
  """
  X: input features of shape (n_x, m)
  parameters: dictionary with the weights and biases of every layer
    l = 1, 2, ..., L (the output layer L included)
  """
  caches = []
  A = X
  L = len(parameters) // 2

  for l in range(1, L):
    A_prev = A
    A, cache = linear_activation_forward(A_prev,
                              parameters["W"+str(l)],
                              parameters["b"+str(l)],
                              "relu")
    caches.append(cache)

  # NOTE: the output layer L uses sigmoid, so parameters must contain WL and bL.
  AL, cache = linear_activation_forward(A,
                            parameters["W"+str(L)],
                            parameters["b"+str(L)],
                            "sigmoid")

  caches.append(cache)
  print(AL.shape)
  print(X.shape[1])

  assert(AL.shape == (1, X.shape[1]))

  return AL, caches


In [6]:
m = 5
layer_dims = [4, 3, 3, 1]
parameters = initialize_parameters_deep(layer_dims)
X = np.random.randn(layer_dims[0], m)

AL, caches = L_model_forward(X, parameters)


(1, 5)
5


In [7]:
def compute_cost(AL, Y):
  m = Y.shape[1]

  cost = -1/m * np.sum(np.multiply(Y, np.log(AL))+ np.multiply(1-Y, np.log(1-AL)))

  cost = np.squeeze(cost)

  assert(cost.shape == ())

  return cost


def init_Y(m):
  Y = np.random.randint(2, size=(1, m))
  return Y
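
`compute_cost` takes `np.log(AL)` and `np.log(1-AL)` directly, so a prediction that saturates at exactly 0 or 1 produces `nan` or `inf`. A common guard, sketched here as an assumption rather than as part of the course code, is to clip `AL` first. The helper name and the `eps` value are hypothetical.

```python
import numpy as np

def compute_cost_clipped(AL, Y, eps=1e-12):
  """Cross-entropy cost with AL clipped away from 0 and 1 (eps is an assumed value)."""
  m = Y.shape[1]
  AL = np.clip(AL, eps, 1 - eps)
  cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
  return float(cost)

AL = np.array([[0.5, 0.5, 1.0]])   # the last prediction is saturated at exactly 1
Y = np.array([[1, 0, 1]])
cost = compute_cost_clipped(AL, Y)
```

The first two examples each contribute $\log 2$ and the saturated but correct third one contributes almost nothing, so the cost is close to $2\log 2 / 3$.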


In [8]:
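# NOTE: Y is drawn uniformly from [0, 1) here, so these are soft targets rather than
# the 0/1 labels that init_Y(m) would produce; the cost formula still applies.
Y = np.random.rand(1, m)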

print("compute cost =", compute_cost(AL,Y))

compute cost = 0.6931470898640357


In [9]:
# NOTE: dZ = dA * g'(Z) where g(Z) = ReLU
def relu_backward(dA, cache):
  Z = cache
  dZ = np.array(dA, copy=True) # copy dA so the caller's array is not modified

  # ReLU'(z) = 0 when z <= 0, so the gradient is zeroed there.
  dZ[Z <= 0] = 0
  
  assert (dZ.shape == Z.shape)
  
  return dZ

# NOTE: dZ = dA * g'(Z) where g(Z) = sigmoid
# g'(Z) = g(Z)(1- g(Z)) = s*(1-s) given s=g(Z)
def sigmoid_backward(dA, cache):
  Z = cache

  s = 1/(1+np.exp(-Z))
  dZ = dA * s * (1-s)
  
  assert (dZ.shape == Z.shape)
  
  return dZ


def linear_backward(dZ, linear_cache):
  """
  linear_cache = (A_prev, W, b)
  """
  A_prev, W, b = linear_cache
  m = A_prev.shape[1]

  dW = 1/m * np.dot(dZ,A_prev.T)
  db = 1/m * np.sum(dZ, axis=1, keepdims=True)
  dA_prev = np.dot(W.T, dZ)

  assert (dA_prev.shape == A_prev.shape)
  assert (dW.shape == W.shape)
  assert (db.shape == b.shape)

  return dA_prev, dW, db


def linear_activation_backward(dA, cache, activation):
  # linear_cache: (A_prev, W, b)
  # activation_cache: (Z)
  linear_cache, activation_cache = cache

  if activation == "relu":
    dZ = relu_backward(dA, activation_cache)
  elif activation == "sigmoid":
    dZ = sigmoid_backward(dA, activation_cache)
  else:
    raise ValueError("unsupported activation: " + str(activation))

  # The linear step is the same for both activations.
  dA_prev, dW, db = linear_backward(dZ, linear_cache)

  return dA_prev, dW, db


# dL/dA for the cross-entropy loss L = -Y log(A) - (1-Y) log(1-A)
def compute_dA(A,Y):
  dA = - (np.divide(Y, A) - np.divide(1 - Y, 1 - A))
  return dA
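
One way to sanity-check the expression in `compute_dA` is to compare it with a central finite difference of the loss. The sketch below is self-contained: the `loss` helper, the sample values, and the step `eps` are assumptions made for illustration.

```python
import numpy as np

def loss(A, Y):
  # Element-wise binary cross-entropy, the same loss that compute_cost averages.
  return -(Y * np.log(A) + (1 - Y) * np.log(1 - A))

A = np.array([[0.2, 0.5, 0.9]])   # hypothetical activations
Y = np.array([[1.0, 0.0, 1.0]])   # hypothetical labels
eps = 1e-6

analytic = -(Y / A - (1 - Y) / (1 - A))                       # same expression as compute_dA
numeric = (loss(A + eps, Y) - loss(A - eps, Y)) / (2 * eps)   # central difference
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
```

If the formula is right, `max_abs_diff` should be many orders of magnitude smaller than the gradients themselves.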


In [10]:
L = len(layer_dims) -1
dAL = compute_dA(AL,Y)
print(dAL)

dA_prev, dW, db = linear_activation_backward(dAL, caches[L-1], "sigmoid")
print(dA_prev)
dA_prev, dW, db = linear_activation_backward(dA_prev, caches[L-2], "relu")
print(dA_prev)
dA_prev, dW, db = linear_activation_backward(dA_prev, caches[L-3], "relu")
print(dA_prev)
# ....


[[ 1.80018618 -0.14358548 -0.65517858 -0.05955494 -1.77837805]]
[[ 5.15178950e-03 -4.10914251e-04 -1.87499614e-03 -1.70434879e-04
  -5.08937880e-03]
 [ 4.05757789e-03 -3.23638335e-04 -1.47675732e-03 -1.34235453e-04
  -4.00842287e-03]
 [ 2.26145841e-03 -1.80377224e-04 -8.23058820e-04 -7.48150504e-05
  -2.23406226e-03]]
[[ 9.54646325e-07 -7.61439845e-08  0.00000000e+00  5.49511373e-07
  -9.43081383e-07]
 [ 1.31801237e-05 -1.05126590e-06  0.00000000e+00  6.54562576e-07
  -1.30204547e-05]
 [-2.48900449e-05  1.98526632e-06  0.00000000e+00 -1.93233858e-06
   2.45885176e-05]]
[[ 1.55067533e-08 -2.76395962e-09  0.00000000e+00 -5.00281070e-10
  -1.12680009e-07]
 [-5.84011012e-09  1.92446255e-08  0.00000000e+00 -1.02463310e-08
   2.99670804e-07]
 [-5.04217222e-09  1.06841253e-08  0.00000000e+00 -1.68319909e-08
  -2.27182426e-07]
 [-1.02430555e-08 -3.28969710e-08  0.00000000e+00  3.48263183e-08
   9.91125999e-08]]


In [11]:
def L_model_backward(AL, Y, caches):
  """
  Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
  
  Arguments:
  AL -- probability vector, output of the forward propagation (L_model_forward())
  Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
  caches -- list of caches containing:
              every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
              the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
  
  Returns:
  grads -- A dictionary with the gradients
            grads["dA" + str(l)] = ... 
            grads["dW" + str(l)] = ...
            grads["db" + str(l)] = ... 
  """

  grads = {}
  L = len(caches) # the number of layers
  m = AL.shape[1]
  Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL

  dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
  current_cache = caches[L-1]
  grads["dA"+str(L-1)], grads["dW"+str(L)], grads["db"+str(L)] = linear_activation_backward(dAL, current_cache, activation="sigmoid")

  # l = L-2, ..., 0
  for l in reversed(range(L-1)):
    current_cache = caches[l]
    dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA"+str(l+1)], current_cache, activation="relu")
    grads["dA"+str(l)] = dA_prev_temp
    grads["dW"+str(l+1)] = dW_temp
    grads["db"+str(l+1)] = db_temp

  return grads

In [12]:
from testCases_v2 import *

AL, Y_assess, caches = L_model_backward_test_case()
grads = L_model_backward(AL, Y_assess, caches)
print("dW1=", grads["dW1"])
print("db1=", grads["db1"])
print("dA1=", grads["dA1"])

dW1= [[0.41010002 0.07807203 0.13798444 0.10502167]
 [0.         0.         0.         0.        ]
 [0.05283652 0.01005865 0.01777766 0.0135308 ]]
db1= [[-0.22007063]
 [ 0.        ]
 [-0.02835349]]
dA1= [[ 0.12913162 -0.44014127]
 [-0.14175655  0.48317296]
 [ 0.01663708 -0.05670698]]
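
The test case above compares against known values. Another common check is numerical gradient checking, where every analytic gradient is compared with a finite difference of the cost. The sketch below does this for a compact, self-contained 2-layer network (LINEAR -> RELU -> LINEAR -> SIGMOID) rather than for the functions in this notebook. All names, sizes, and the seed are assumptions.

```python
import numpy as np

def forward(X, p):
  Z1 = p["W1"] @ X + p["b1"]
  A1 = np.maximum(0, Z1)
  Z2 = p["W2"] @ A1 + p["b2"]
  A2 = 1 / (1 + np.exp(-Z2))
  return Z1, A1, A2

def cost(X, Y, p):
  A2 = forward(X, p)[2]
  return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(X, Y, p):
  m = X.shape[1]
  Z1, A1, A2 = forward(X, p)
  dZ2 = A2 - Y                               # sigmoid + cross-entropy shortcut
  dW2 = dZ2 @ A1.T / m
  db2 = dZ2.sum(axis=1, keepdims=True) / m
  dZ1 = (p["W2"].T @ dZ2) * (Z1 > 0)         # ReLU derivative is 1 where Z1 > 0
  dW1 = dZ1 @ X.T / m
  db1 = dZ1.sum(axis=1, keepdims=True) / m
  return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
Y = rng.integers(0, 2, size=(1, 4)).astype(float)
p = {"W1": rng.standard_normal((2, 3)), "b1": np.zeros((2, 1)),
     "W2": rng.standard_normal((1, 2)), "b2": np.zeros((1, 1))}

grads = backward(X, Y, p)
eps = 1e-6
max_err = 0.0
for name in p:
  for idx in np.ndindex(p[name].shape):
    old = p[name][idx]
    p[name][idx] = old + eps
    c_plus = cost(X, Y, p)
    p[name][idx] = old - eps
    c_minus = cost(X, Y, p)
    p[name][idx] = old                       # restore the parameter
    numeric = (c_plus - c_minus) / (2 * eps)
    max_err = max(max_err, abs(numeric - grads[name][idx]))
```

A small `max_err` suggests the backward pass matches the cost. The check is slow, since it needs two forward passes per parameter, so it is usually run only on tiny networks.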


In [13]:
def update_parameters(parameters, grads, learning_rate):
  """
  Update parameters using gradient descent
  
  Arguments:
  parameters -- python dictionary containing your parameters 
  grads -- python dictionary containing your gradients, output of L_model_backward
  learning_rate -- step size of the gradient descent update
  
  Returns:
  parameters -- python dictionary containing your updated parameters 
                parameters["W" + str(l)] = ... 
                parameters["b" + str(l)] = ...
  """

  L = len(parameters) // 2

  for l in range(L):
    parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
    parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]

  return parameters
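
To see the update rule converge, here is a minimal sketch that applies the same `parameters` / `grads` dictionary pattern to a toy cost $J(W) = (W - 3)^2$. The cost function, the target value 3, and the learning rate are all made up for illustration.

```python
import numpy as np

target = 3.0                               # hypothetical minimizer of J(W) = (W - target)^2
parameters = {"W1": np.array([[0.0]])}
learning_rate = 0.1

for _ in range(100):
  grads = {"dW1": 2 * (parameters["W1"] - target)}    # dJ/dW1
  parameters["W1"] = parameters["W1"] - learning_rate * grads["dW1"]
```

Each step shrinks the distance to the minimizer by a factor of 0.8, so after 100 steps `W1` is very close to 3.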



In [14]:
parameters, grads = update_parameters_test_case()
parameters = update_parameters(parameters, grads, 0.1)

print ("W1 = "+ str(parameters["W1"]))
print ("b1 = "+ str(parameters["b1"]))
print ("W2 = "+ str(parameters["W2"]))
print ("b2 = "+ str(parameters["b2"]))

W1 = [[-0.59562069 -0.09991781 -2.14584584  1.82662008]
 [-1.76569676 -0.80627147  0.51115557 -1.18258802]
 [-1.0535704  -0.86128581  0.68284052  2.20374577]]
b1 = [[-0.04659241]
 [-1.28888275]
 [ 0.53405496]]
W2 = [[-0.55569196  0.0354055   1.32964895]]
b2 = [[-0.84610769]]
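
The outline lists `predict`, which is not implemented in the cells above. A minimal sketch of its core step, assuming that `predict` runs `L_model_forward` and then thresholds the sigmoid output `AL` at 0.5, could look like this. The helper name, the threshold, and the sample values are assumptions.

```python
import numpy as np

def predict_from_probabilities(AL, threshold=0.5):
  """Turn sigmoid outputs into 0/1 predictions (threshold is an assumed default)."""
  return (AL > threshold).astype(int)

AL = np.array([[0.9, 0.2, 0.6, 0.4]])   # hypothetical network outputs
Y = np.array([[1, 0, 0, 0]])            # hypothetical true labels

predictions = predict_from_probabilities(AL)
accuracy = float(np.mean(predictions == Y))
```

With these values the third example is misclassified, so the accuracy is 0.75.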
