# Advanced Learning Algorithms
- Neural networks
  - inference (prediction)
  - training
- Practical advice for building ML systems
- Decision Trees


## Neural Networks
- Attempt to mimic how the human brain functions
- Re-branded in 2005 as Deep Learning
- Applied in speech recognition, computer vision, natural language processing, etc.
- Neural network performance depends on the amount of data available: the more data, the better a neural network tends to perform
- Networks can be small, medium, or large; larger networks tend to benefit more from additional data
- Example : Predicting whether a t-shirt will be a top seller based on a single feature, price
  - x --> Input
  - a = Activation = f(x) = $\frac{1}{1+e^{-(wx+b)}}$ --> Output
  - this logistic regression model can be a single neuron in the overall neural network with multiple other models working collectively
- Complex example : Predicting if a t-shirt is going to be top-selling or not with price, shipping cost, marketing, material as inputs
  - Affordability --> Price, Shipping cost --> one neuron
  - Awareness --> Marketing --> one neuron
  - Perceived Quality --> Material --> one neuron
  - It is not a strict rule that price and shipping cost feed only the first neuron. Every input can be passed to every neuron in the next layer, and during training each neuron learns which features to weight
  - Input layer ($\vec X$) ⇒ hidden layer ($\vec a$) ⇒ output layer ⇒ final output ($a$)
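
As a minimal sketch of the single-neuron example above, the snippet below evaluates $a = \frac{1}{1+e^{-(wx+b)}}$ for a few prices. The values of `w`, `b`, and `prices` are made up for illustration and are not from the notes.

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any real z to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and prices, chosen only for illustration
w, b = -0.1, 3.0
prices = np.array([10.0, 30.0, 60.0])

# One neuron: activation a = g(w*x + b), read as P(top seller | price)
a = sigmoid(w * prices + b)
print(a)
```

With a negative `w`, the activation falls as the price rises, which matches the intuition that cheaper shirts are more likely to sell well.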


## Neural Network Model
- The fundamental building block of a neural network is a layer of neurons
- layer 0 --> layer 1 --> layer 2 --> layer 3 --> layer 4
- $\vec a^{[l]}$ --> the vector $\vec a$ is the output of a layer and $[l]$ is the layer number
- $a_j^{[l]}=g(\vec w_j^{[l]} \cdot \vec a^{[l-1]}+b_j^{[l]})$
  - $\vec w_j^{[l]}$, $b_j^{[l]}$ are the parameters of layer l, unit/neuron j
  - $\vec a^{[l-1]}$ is the output of the previous layer l-1, used as the input to layer l
  - g is the activation function (sigmoid here)



## Inference : Making predictions (Forward propagation)
- Let's suppose we have one input layer, 3 hidden layers and one output layer
- $\vec X$ is the input layer : layer 0
- Layer 1 : Let's suppose there are 10 neurons in this layer taking $\vec X$ as input. Then the equations for z in this layer will be
  - $\vec W_1^{[1]}.\vec X + b_1^{[1]}$
  - ........
  - $\vec W_{10}^{[1]}.\vec X + b_{10}^{[1]}$
  - outputs $\vec a^{[1]}$
- Layer 2
  - number of neurons 4
  - input $\vec a^{[1]}$
  - equations
    - $\vec W_1^{[2]}.\vec a^{[1]} + b_1^{[2]}$ ........ $\vec W_4^{[2]}.\vec a^{[1]} + b_4^{[2]}$
  - outputs $\vec a^{[2]}$
- Layer 3
  - number of neurons 2
  - input $\vec a^{[2]}$
  - equations
    - $\vec W_1^{[3]}.\vec a^{[2]} + b_1^{[3]}$ and $\vec W_2^{[3]}.\vec a^{[2]} + b_2^{[3]}$
  - outputs $\vec a^{[3]}$
- Layer 4 - output layer
  - number of neurons 1
  - input $\vec a^{[3]}$
  - equations
    - $\vec W^{[4]}.\vec a^{[3]} + b^{[4]}$
  - outputs $a^{[4]}$
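
The layer-by-layer computation above can be sketched in NumPy as follows. The layer sizes 10, 4, 2, 1 come from the notes; the input size of 5 features and the random weights are assumptions made only so the shapes can be checked.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features = 5                 # assumed input size, not given in the notes
layer_sizes = [10, 4, 2, 1]    # units in layers 1 to 4

a = rng.random(n_features)     # layer 0: the input vector X
shapes = []
for units in layer_sizes:
    # Row j of W holds w_j for neuron j of this layer (random placeholder values)
    W = rng.normal(size=(units, a.shape[0]))
    b = np.zeros(units)
    a = sigmoid(W @ a + b)     # output of this layer, input to the next
    shapes.append(a.shape)

print(shapes)  # output shape of each layer
print(a)       # final output a^[4]
```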

## Building a neural network in TensorFlow
- layer1 = Dense(units=3, activation="sigmoid")
- layer2 = Dense(units=1, activation="sigmoid")
- model = Sequential([layer1, layer2])
- x = np.array([[200.0, 17.0],[120.0, 5.0],[425.0, 20.0],[212.0, 18.0]])
- y = np.array([1,0,0,1])
- model.compile()
- model.fit(x,y)
- y_new = model.predict(x_new)


In [None]:
# Coffee roasting example using TensorFlow
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def load_coffee_data():
    """ Creates a coffee roasting data set.
        roasting duration: 12-15 minutes is best
        temperature range: 175-260C is best
    """
    rng = np.random.default_rng(2)
    X = rng.random(400).reshape(-1,2)
    X[:,1] = X[:,1] * 4 + 11.5          # 12-15 min is best
    X[:,0] = X[:,0] * (285-150) + 150  # 350-500 F (175-260 C) is best
    Y = np.zeros(len(X))

    i=0
    for t,d in X:
        y = -3/(260-175)*t + 21
        if (t > 175 and t < 260 and d > 12 and d < 15 and d<=y ):
            Y[i] = 1
        else:
            Y[i] = 0
        i += 1

    return (X, Y.reshape(-1,1))

X,Y = load_coffee_data()
print(X.shape, Y.shape)

tf.random.set_seed(1234)  # applied to achieve consistent results
model = Sequential(
    [
        tf.keras.Input(shape=(2,)),
        Dense(3, activation='sigmoid', name = 'layer1'),
        Dense(1, activation='sigmoid', name = 'layer2')
     ]
)

model.summary()
L1_num_params = 2 * 3 + 3   # W1 parameters  + b1 parameters
L2_num_params = 3 * 1 + 1   # W2 parameters  + b2 parameters
print("L1 params = ", L1_num_params, ", L2 params = ", L2_num_params  )

W1, b1 = model.get_layer("layer1").get_weights()
W2, b2 = model.get_layer("layer2").get_weights()
print(f"W1{W1.shape}:\n", W1, f"\nb1{b1.shape}:", b1)
print(f"W2{W2.shape}:\n", W2, f"\nb2{b2.shape}:", b2)

model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01),
)

model.fit(
    X, Y,       # train on the data returned by load_coffee_data()
    epochs=10,
)

W1, b1 = model.get_layer("layer1").get_weights()
W2, b2 = model.get_layer("layer2").get_weights()
print("W1:\n", W1, "\nb1:", b1)
print("W2:\n", W2, "\nb2:", b2)





## Neural network implementation in Python
#### Forward propagation in a single layer
- x = np.array([200, 17])
- $a_1^{[1]}=g(\vec W_1^{[1]}.\vec X+b_1^{[1]})$
- $a_2^{[1]}=g(\vec W_2^{[1]}.\vec X+b_2^{[1]})$
- $a_3^{[1]}=g(\vec W_3^{[1]}.\vec X+b_3^{[1]})$
- $\vec W_1^{[2]}$ will be represented as w2_1 in the code
- $a_1^{[2]}=g(\vec W_1^{[2]}.\vec a^{[1]}+b_1^{[2]})$

In [None]:
import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

x = np.array([200, 17])
w1_1 = np.array([1,2])
b1_1 = np.array([-1])
z1_1 = np.dot(w1_1, x) + b1_1
a1_1 = sigmoid(z1_1)
print(a1_1)

w1_2 = np.array([-3,4])
b1_2 = np.array([1])
z1_2 = np.dot(w1_2, x) + b1_2
a1_2 = sigmoid(z1_2)
print(a1_2)

w1_3 = np.array([5,-6])
b1_3 = np.array([2])
z1_3 = np.dot(w1_3, x) + b1_3
a1_3 = sigmoid(z1_3)
print(a1_3)

a1 = np.array([a1_1, a1_2, a1_3])

w2_1 = np.array([-7, 8, 9])
b2_1 = np.array([3])
z2_1 = np.dot(w2_1, a1) + b2_1
a2_1 = sigmoid(z2_1)
print(a2_1)

## Neural network implementation in Python
#### General implementation of forward propagation


In [None]:
def dense(a_in,W,b):
  units = W.shape[1]
  a_out = np.zeros(units)
  for j in range(units):
    w = W[:,j]
    z = np.dot(w,a_in) + b[j]
    a_out[j] = g(z)
  return a_out

def g(z):
  return 1/(1+np.exp(-z))

def sequential(x):
  a1 = dense(x,W1,b1)
  a2 = dense(a1,W2,b2)
  a3 = dense(a2,W3,b3)
  a4 = dense(a3,W4,b4)
  f_x = a4
  return f_x

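W = np.array([[1, -3, 5],[2, 4, -6]])  # one column of weights per neuron
b = np.array([-1, 1, 2])               # one bias per neuron
a_in = np.array([-2, 4])

a_out = dense(a_in, W, b)  # activations of the 3 neurons for this input
print(a_out)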

## Speculations on Artificial General Intelligence AGI
- AI
  - ANI (artificial narrow intelligence) : smart speaker, self driving car, web search, AI in farming and factories etc
  - AGI (artificial general intelligence) : Do anything a human can do
- Experiments
  - Roe et al - 1992
  -

## Vectorization : To implement neural networks efficiently
- Instead of looping over 1D NumPy arrays neuron by neuron, store inputs and parameters as 2D arrays (matrices) to take advantage of matrix multiplication

#### Matrix Multiplication
- A matrix is a 2D array of numbers
- Dot product : $z=\vec a \cdot \vec w = \vec a^T \vec w$
- Vector matrix multiplication : $\vec Z = \vec a^T W$, one dot product per column of $W$


In [None]:
# Matrix multiplication code
import numpy as np

A = np.array([[1,-1,0.1],[2,-2,0.2]])
AT = np.array([[1,2],[-1,-2],[0.1,0.2]])
AT = A.T # another way to get transpose of matrix
W = np.array([[3,5,7,9],[4,6,8,0]])
z = np.matmul(AT,W) # alternate way to call matmul is z = AT @ W


# Neural Network Training



In [None]:
# Neural Network Training in TensorFlow

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

model.compile(loss=BinaryCrossentropy())
model.fit(X,Y,epochs=100) # epochs is the number of full passes over the training data; each pass updates the parameters batch by batch



### What's going on in the background
- In general the model training steps are
  - Define model. For example in Logistic regression specify how to compute output given input x and parameters w,b
  - $f_{\vec w,b}(\vec X) = g(z) = \frac {1}{1+e^{-z}} = \frac {1}{1+e^{-(\vec W.\vec X + b)}} $
  - Specify loss function
  - $L(f_{\vec w,b}(\vec x^{(i)}), y^{(i)}) = -y^{(i)}\log(f_{\vec w,b}(\vec x^{(i)}))-(1-y^{(i)})\log(1 - f_{\vec w,b}(\vec x^{(i)}))$
  - Define cost function
  - $J(\vec w, b) = \frac{1}{m}\sum_{i=1}^m L(f_{\vec w,b}(\vec x^{(i)}), y^{(i)})$
  - Train on data to minimize cost function (like using gradient descent)
    - repeat simultaneously,
    - $w_j = w_j - α \frac {\partial}{\partial w_j}J(\vec w,b)$
    - $b= b - α \frac {\partial}{\partial b}J(\vec w,b)$
- How is it done in Neural network using Tensorflow
  - Define model
    - model = Sequential([Dense(...),Dense(...),Dense(...)])
  - Compile and specify loss function
    - model.compile(loss=BinaryCrossentropy())
  - Minimize the cost function
    - model.fit(X,Y,epochs=100)
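
The three steps for logistic regression can be sketched in plain NumPy as below. The toy data, the learning rate, and the function names are assumptions for illustration; the gradient expressions are the standard ones for the cross-entropy cost and are not written out in these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    # Steps 1 and 2: model output f, per-example loss, then the mean as cost J
    f = sigmoid(X @ w + b)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)
    return loss.mean()

def gradient_step(X, y, w, b, alpha):
    # Step 3: one simultaneous gradient descent update of w and b
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y
    dw = X.T @ err / m
    db = err.mean()
    return w - alpha * dw, b - alpha * db

# Made-up toy data: one feature, four examples
X = np.array([[0.5], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

w, b = np.zeros(1), 0.0
cost_before = cost(X, y, w, b)
for _ in range(1000):
    w, b = gradient_step(X, y, w, b, alpha=0.1)
cost_after = cost(X, y, w, b)
print(cost_before, cost_after)
```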

### Activation Functions
- For binary : Sigmoid : $g(z) = \frac {1}{1+e^{-z}}$
- ReLU (Rectified Linear Unit) : $g(z) = max(0,z)$
- Linear activation function : $g(z) = z = \vec w.\vec x + b$
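
A minimal NumPy sketch of the three activation functions listed above, applied to a small made-up vector `z`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # max(0, z), applied element-wise
    return np.maximum(0, z)

def linear(z):
    # identity: the output is z itself
    return z

z = np.array([-2.0, 0.0, 3.0])   # illustrative values
print(sigmoid(z))
print(relu(z))
print(linear(z))
```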

### Choosing activation functions
- Different activation functions can be used in different parts of the network; with Keras `Dense`, the activation is chosen per layer
- For output layer :
  - Binary classification : Sigmoid
  - Regression (output can be positive or negative) : Linear activation function
  - Regression (output is never negative) : ReLU activation function
- For hidden layers :
  - ReLU is the most common choice
  - Sigmoid and linear activations are very rarely used

## Multiclass Classification
- Classification problems with more than 2 possible output labels
- Softmax regression algorithm is a generalization of logistic regression from binary classification to multiclass classification

#### Softmax regression
- Suppose the problem being solved has 4 possible outputs (y=1,2,3,4)
- then we calculate $z_1, z_2, z_3, z_4$, one for each possible output, as below
  - $z_1 = \vec w_1.\vec x+b_1$
  - $z_2 = \vec w_2.\vec x+b_2$
  - $z_3 = \vec w_3.\vec x+b_3$
  - $z_4 = \vec w_4.\vec x+b_4$
- then the outputs $a_1, a_2, a_3, a_4$ will be calculated as below
  - $a_1 = \frac {e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}+e^{z_4}}$
  - $a_2 = \frac {e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}+e^{z_4}}$
  - $a_3 = \frac {e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}+e^{z_4}}$
  - $a_4 = \frac {e^{z_4}}{e^{z_1}+e^{z_2}+e^{z_3}+e^{z_4}}$

#### Generalizing the Softmax Regression
- Number of possible outputs = N
- calculate $z_j = \vec w_j.\vec x+b_j$ ; j = 1,2,......,N
- calculate $a_j = \frac {e^{z_j}}{\sum_{k=1}^Ne^{z_k}}=P(y=j|\vec x)$
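
The general formula can be sketched directly in NumPy. The vector `z` below is made up, and this direct form is for illustration only, since it can overflow when entries of `z` are large.

```python
import numpy as np

def softmax(z):
    # a_j = exp(z_j) / sum_k exp(z_k)
    ez = np.exp(z)
    return ez / ez.sum()

z = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative z_1..z_4
a = softmax(z)
print(a, a.sum())
```

The outputs are all positive and sum to 1, so they can be read as the probabilities $P(y=j|\vec x)$.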

#### Cost function for Softmax Regression
- Equations used
  - $a_1 = \frac {e^{z_1}}{e^{z_1}+e^{z_2}+....+e^{z_N}}$ .....
  - $a_N = \frac {e^{z_N}}{e^{z_1}+e^{z_2}+....+e^{z_N}}$
- loss($a_1,...,a_N,y$) = $-\log a_1$ if $y=1$ ; $-\log a_2$ if $y=2$ ; .... ; $-\log a_N$ if $y=N$
- that is, loss = $-\log a_j$ if $y=j$
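
A small sketch of this loss for one example, assuming labels run from 1 to N as in the notes. The probabilities in `a` and the helper name `softmax_loss` are made up for illustration.

```python
import numpy as np

def softmax_loss(a, y):
    # a: softmax outputs a_1..a_N, y: true class label in 1..N
    return -np.log(a[y - 1])

a = np.array([0.1, 0.2, 0.6, 0.1])   # illustrative softmax output

loss_confident = softmax_loss(a, 3)  # true class got probability 0.6
loss_unlikely = softmax_loss(a, 1)   # true class got probability 0.1
print(loss_confident, loss_unlikely)
```

The loss is small when the model assigns high probability to the true class and grows as that probability approaches 0.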


### Neural Network with Softmax output
- Let's suppose we are building a neural network with 2 hidden layers and one output layer to recognize a given digit. The output layer then has 10 units, one per digit from 0 to 9, each giving the probability that the input is that digit.
- The hidden layers use ReLU and the output layer uses softmax
- Specify the model
  - import tensorflow as tf
  - from tensorflow.keras import Sequential
  - from tensorflow.keras.layers import Dense
  - model = Sequential([Dense(units=25, activation='relu'),Dense(units=15, activation='relu'),Dense(units=10, activation='softmax')])
- Specify loss and cost
  - from tensorflow.keras.losses import SparseCategoricalCrossentropy
  - model.compile(loss=SparseCategoricalCrossentropy())
- Train on data to minimize cost
  - model.fit(X,Y,epochs=100)

#### Improved implementation of softmax
- Specify the model
  - import tensorflow as tf
  - from tensorflow.keras import Sequential
  - from tensorflow.keras.layers import Dense
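  - model = Sequential([Dense(units=25, activation='relu'),Dense(units=15, activation='relu'),Dense(units=10, activation='linear')])
  - the output layer now produces the raw values $z_1,...,z_{10}$ (logits) rather than probabilities; apply tf.nn.softmax to the model output to get probabilities
- Specify loss and cost
  - from tensorflow.keras.losses import SparseCategoricalCrossentropy
  - model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))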
- Train on data to minimize cost
  - model.fit(X,Y,epochs=100)
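
With `from_logits=True` the loss is computed from the raw values z instead of from probabilities that have already been rounded. The NumPy sketch below shows the kind of numerical problem this avoids. It is an illustration using the log-sum-exp shift, not TensorFlow's actual implementation; the function names and the extreme values in `z` are assumptions, and `y` is a 0-based index here.

```python
import numpy as np

def naive_loss(z, y):
    # softmax first, then -log of the true-class probability
    a = np.exp(z) / np.exp(z).sum()
    return -np.log(a[y])

def stable_loss(z, y):
    # same quantity computed from z directly: -(z_y - log(sum_k exp(z_k))),
    # after subtracting max(z) so that exp never overflows
    z_shift = z - z.max()
    return -(z_shift[y] - np.log(np.exp(z_shift).sum()))

z = np.array([1000.0, 0.0, -1000.0])   # extreme logits, chosen to force overflow

with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_loss(z, 0)           # exp(1000) overflows, result is nan
stable = stable_loss(z, 0)             # finite

print(naive, stable)
```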


### Back Propagation
- Calculating Derivatives
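
As a hedged illustration of what "calculating derivatives" means here, the sketch below estimates a derivative numerically by nudging the parameter by a small `eps` and measuring how much the cost changes. The cost $J(w) = w^2$ and the helper name are assumptions chosen for simplicity; back propagation computes the same kind of quantity exactly and far more efficiently.

```python
def J(w):
    # assumed toy cost function, whose exact derivative is 2w
    return w ** 2

def numerical_derivative(f, w, eps=1e-3):
    # change in f divided by the small change in w
    return (f(w + eps) - f(w)) / eps

d = numerical_derivative(J, 3.0)
print(d)   # close to the exact derivative 2 * 3 = 6
```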