# Deep Learning

Systems that mimic the human brain's ability to learn.

![Deep Learning Models](https://www.artiba.org/Content/Images/types-of-deep-learning-models.jpg)

![](https://media.geeksforgeeks.org/wp-content/uploads/20250703121717652998/1-.webp)

An Artificial Neural Network (ANN) is a computational model inspired by the structure and function of biological neural networks. It is the cornerstone of Deep Learning.

It consists of a large number of simple, highly interconnected processing elements (neurons) working in unison to solve specific problems.

The term "Deep" refers to the number of layers through which the data is transformed. Modern networks can have hundreds of layers. Each layer extracts progressively higher-level features from the raw input.

* Shallow Layers: Might detect simple edges or colors.
* Deep Layers: Might detect complex shapes, textures, or even specific objects.

![](https://media.geeksforgeeks.org/wp-content/uploads/20230410104038/Artificial-Neural-Networks.webp)

| Biological Term | Artificial Equivalent |
|---|---|
| Dendrite | Input Layer / Incoming Weights |
| Cell Body | Summation and Activation |
| Axon | Output of the Neuron |
| Synapse | Weights (Strength of Connection) |

**Layer:** A collection of neurons
* Input Layer: Receives raw data.
* Hidden Layers: Layers between input and output where the "learning" happens.
* Output Layer: Provides the final prediction.

- Weights (w): Parameters that determine the importance of an input signal. They allow the network to "choose" which input features are more important.
- Bias (b): An additional parameter that allows the neuron to shift its activation threshold. It provides the flexibility to "fire" the neuron even when inputs are low (or stay silent even when they are high).

![](https://miro.medium.com/v2/resize:fit:1400/1*upfpVueoUuKPkyX3PR3KBg.png)

A neuron performs two main operations:
1. **Weighted Summation**: It takes all inputs, multiplies them by their respective weights, and adds a bias.
2. **Activation**: It passes that sum through a non-linear function.

Mathematically, for a set of inputs $x_1, x_2, x_3, \dots, x_n$:

$$z = (w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n) + b$$

Or in vector notation:

$$z = W \cdot X + b$$

Where,
* W is the vector of weights.
* X is the vector of inputs.
* b is the bias

In [1]:
import numpy as np
import matplotlib.pyplot as plt

def single_neuron_calculation(inputs, weights, bias):
    """Calculate the linear output (z) of a single neuron"""
    z = np.dot(inputs, weights) + bias
    return z

inputs = np.array([0.5, -0.2, 0.1])
weights = np.array([0.4, 0.7, -0.3])
bias = 0.1


linear_output = single_neuron_calculation(inputs, weights, bias)
print(f"Inputs: {inputs}")
print(f"Weights: {weights}")
print(f"Bias: {bias}")

print(f"Linear Sum (z) = {linear_output:.4f}")

Inputs: [ 0.5 -0.2  0.1]
Weights: [ 0.4  0.7 -0.3]
Bias: 0.1
Linear Sum (z) = 0.1300


# Activation Functions (Sigmoid, ReLU, Tanh)

Without activation functions, a neural network is just a big linear regression model, no matter how many layers it has, because a composition of linear layers is itself linear. Activation functions introduce non-linearity, allowing the network to learn complex patterns.
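A quick numerical check can make the "stacked linear layers are still linear" point concrete. The sketch below uses made-up random weights (the shapes and the seed are assumptions chosen only for illustration) and compares two layers without an activation against a single combined layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function between them (illustrative shapes)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))
x = rng.normal(size=(3, 1))

# Output of the two stacked linear layers
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping written as ONE linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

print(np.allclose(two_layer, one_layer))  # True

# With ReLU in between, the network is no longer this single linear map in general
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(with_relu, one_layer)
```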

### 1. Sigmoid Activation Function

It maps any value to a range between 0 and 1.

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$


It is well suited to the output layer of a binary classifier, because its output can be read as a probability.

### 2. Tanh (Hyperbolic Tangent) Activation Function

Maps values to a range between -1 and 1.

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$


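Tanh is zero-centered: its outputs can be positive or negative, so the gradients of the next layer's weights are not all forced to share the same sign. This is why it typically trains faster than Sigmoid.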

### 3. ReLU (Rectified Linear Unit) Activation Function

The default choice for hidden layers in most modern networks.

$$f(z) = \max(0, z)$$

It is extremely cheap to compute: a single comparison with zero.

![](https://sebastianraschka.com/images/faq/activation-functions/activation-functions.png)
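To get a feel for the three functions, it can help to evaluate them at a few points. This is a minimal NumPy sketch; the sample inputs are arbitrary and chosen only to show the output ranges described above.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes any real value into (-1, 1); zero-centered."""
    return np.tanh(z)

def relu(z):
    """Passes positive values through unchanged and clips negatives to 0."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print("sigmoid:", np.round(sigmoid(z), 4))  # approx. 0.1192, 0.5, 0.8808
print("tanh:   ", np.round(tanh(z), 4))     # approx. -0.964, 0.0, 0.964
print("relu:   ", relu(z))                  # 0.0, 0.0, 2.0
```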

# Forward Propagation

Forward Propagation is the process where input data is fed forward through the network to generate an output. Each layer passes its output to the next layer as input.


**Steps:**

For a network with one hidden layer:
1. Input Layer (X): Raw features
2. Hidden Layer:
    * Calculate linear sum: $Z^{[1]} = W^{[1]}X + b^{[1]}$
    * Calculate activation: $A^{[1]} = g(Z^{[1]})$ (where $g$ is ReLU or Tanh)
3. Output Layer:
    * Calculate linear sum: $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$
    * Calculate activation: $\hat{y} = A^{[2]} = \sigma(Z^{[2]})$ (for binary classification)

**Vectorization**

In practice, we don't calculate one neuron at a time. We use Matrix Multiplication to calculate an entire layer for a whole batch of data at once. GPUs are built to run these large matrix multiplications in parallel, which is why they are so effective for Deep Learning.

In [3]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def relu(z):
    return np.maximum(0, z)


# Initialize the data (X): 3 features (rows), 2 samples (columns)
X = np.array([[1.0, 2.0],
              [0.5, 1.1],
              [-0.2, 0.4]])


# Parameters for Hidden Layer (4 neurons)
W1 = np.random.randn(4, 3) * 0.01  # 4 neurons, 3 input features
b1 = np.zeros((4, 1))


# Parameters for Output Layer (1 neuron)
W2 = np.random.randn(1, 4) * 0.01  # 1 output neuron, 4 hidden activations as input
b2 = np.zeros((1, 1))


# Forward Pass
# Layer 1
Z1 = np.dot(W1, X) + b1
A1 = relu(Z1)

# Layer 2
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)


print(f"Input X Shape: {X.shape}")
print(f"Hidden Layer Output (A1) shape: {A1.shape}")
print(f"Final Prediction Output (A2): \n {A2}")

Input X Shape: (3, 2)
Hidden Layer Output (A1) shape: (4, 2)
Final Prediction Output (A2): 
 [[0.50002665 0.50005183]]


### Loss Functions (Cost Function): MSE, Cross-Entropy

A Loss Function (also called Cost Function) measures how "wrong" the model's predictions are.

#### 1. Mean Squared Error (MSE)
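Primarily used for Regression problems.
$$J = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2$$

Squaring the difference ensures losses are never negative and penalizes larger errors more heavily. Some texts use $\frac{1}{2m}$ instead of $\frac{1}{m}$ so that the factor 2 cancels when differentiating; this only rescales the loss.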


#### 2. Binary Cross-Entropy (Log Loss):
Primarily used for Binary Classification.
$$J = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})]$$

It uses logarithms to heavily penalize confident but wrong predictions. If y = 1 and the model predicts 0.01, the loss becomes very large.
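The penalty is easy to see numerically. For a true label $y = 1$, the per-example loss reduces to $-\log(\hat{y})$; the short sketch below evaluates it for three predictions (the prediction values are arbitrary examples).

```python
import numpy as np

# Per-example binary cross-entropy when the true label is y = 1: loss = -log(y_hat)
predictions = [0.99, 0.5, 0.01]
losses = [-np.log(y_hat) for y_hat in predictions]

for y_hat, loss in zip(predictions, losses):
    print(f"y_hat = {y_hat:<4}  loss = {loss:.4f}")
# approx. 0.0101, 0.6931, 4.6052
```

A confident correct prediction costs almost nothing, while a confident wrong one costs several hundred times more.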


#### 3. Categorical Cross-Entropy:

Used for multi-class classification problems with One-Hot encoded labels.
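For $K$ classes, the usual form is $J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{y}_k^{(i)})$, where the predicted probabilities typically come from a softmax output layer. The sketch below is one possible NumPy version; the function names, the logits, and the rows-as-samples layout are assumptions made for this illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, y_prob, epsilon=1e-15):
    """Mean cross-entropy over samples; rows are samples, columns are classes."""
    y_prob = np.clip(y_prob, epsilon, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))

# Illustrative data: 2 samples, 3 classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])

probs = softmax(logits)
loss = categorical_cross_entropy(y_onehot, probs)
print(f"Predicted probabilities:\n{probs}")
print(f"Categorical Cross-Entropy Loss: {loss:.4f}")
```

Because the labels are one-hot, only the log-probability assigned to the correct class contributes to each sample's loss.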



In [5]:
def compute_mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def compute_cross_entropy(y_true, y_pred):
    m = y_true.shape[0]
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1-epsilon)
    loss = -1/(m) * np.sum(y_true * np.log(y_pred) + (1 - y_true)*np.log(1 - y_pred))
    return loss

y_actual = np.array([1, 0, 1])
y_predicted = np.array([0.9, 0.1, 0.2])
print(f"Cross-Entropy Loss: {compute_cross_entropy(y_actual, y_predicted):.4f}")

Cross-Entropy Loss: 0.6067


# Backpropagation

Backpropagation is the "engine" that allows neural networks to learn. It computes the gradient of the loss function with respect to every weight and bias in the network by applying the chain rule layer by layer, starting at the output and moving backward toward the input.


![](https://serokell.io/files/a0/a05ov1m.Backpropagation_in_NN_pic1.jpg)

![](https://media.geeksforgeeks.org/wp-content/uploads/20250701163824448467/Backpropagation-in-Neural-Network-1.webp)

![](https://miro.medium.com/1*W7ZPd1tvyi_cIdpoDdX3DA.png)

![](https://miro.medium.com/v2/resize:fit:1400/1*DBjWT-lIAUq8m7aZSGj_Ew.png)

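The gradients are then used to update each weight: New Weight = Old Weight minus (Learning rate * gradient), or $w \leftarrow w - \eta \frac{\partial J}{\partial w}$, where $\eta$ is the learning rate.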
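As a rough sketch of what this looks like in code, the example below runs one forward pass, one backward pass, and one update for a network with a ReLU hidden layer and a sigmoid output trained with binary cross-entropy. The labels `Y`, the weight scale, the seed, and the learning rate are assumptions made for this illustration. A finite-difference check on one weight is included as a sanity test of the analytic gradient.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1
    A1 = np.maximum(0, Z1)      # ReLU
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    return Z1, A1, A2

def bce_loss(Y, A2):
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

rng = np.random.default_rng(0)
X = np.array([[1.0, 2.0],
              [0.5, 1.1],
              [-0.2, 0.4]])          # 3 features, 2 samples
Y = np.array([[1.0, 0.0]])           # hypothetical labels, one per sample
m = X.shape[1]

W1, b1 = rng.normal(size=(4, 3)) * 0.5, np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)) * 0.5, np.zeros((1, 1))

# Forward pass
Z1, A1, A2 = forward(X, W1, b1, W2, b2)

# Backward pass: chain rule from the output layer back to the input
dZ2 = A2 - Y                                   # sigmoid + cross-entropy (1/m applied below)
dW2 = dZ2 @ A1.T / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dA1 = W2.T @ dZ2
dZ1 = dA1 * (Z1 > 0)                           # ReLU derivative: 1 where Z1 > 0, else 0
dW1 = dZ1 @ X.T / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# Sanity check: compare dW2[0, 0] with a central finite-difference estimate
eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 0] += eps
W2_minus[0, 0] -= eps
numeric = (bce_loss(Y, forward(X, W1, b1, W2_plus, b2)[2])
           - bce_loss(Y, forward(X, W1, b1, W2_minus, b2)[2])) / (2 * eps)
print(f"analytic: {dW2[0, 0]:.6f}, numeric: {numeric:.6f}")

# Update: new weight = old weight - learning_rate * gradient
learning_rate = 0.1
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
```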

Gradient descent comes in three variants, which differ in how much data is used for each update:

* Stochastic Gradient Descent (SGD): Uses one sample at a time: compute the loss and gradient for that sample, then update. Very noisy, but the noise can help escape bad local minima; often needs many passes over the data.

* Batch Gradient Descent: Computes the loss and gradient on the entire dataset, then does one update. Very stable but slow for large datasets.

* Mini-Batch Gradient Descent: Uses a small batch (e.g. 32 or 64 samples) per update. It strikes a balance between stability and speed. One pass over the whole dataset (using many mini-batches) is called one epoch.
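The three variants differ only in the batch size used inside the training loop. The sketch below fits a line with mini-batch gradient descent on synthetic data; the dataset, learning rate, and epoch count are assumptions chosen for the illustration. Setting `batch_size = 1` would give stochastic gradient descent, and `batch_size = len(X)` would give batch gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative): y = 3x + 2 plus a little noise
X = rng.uniform(-1, 1, size=(256, 1))
y = 3 * X[:, 0] + 2 + 0.05 * rng.normal(size=256)

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 32      # 1 -> stochastic GD, len(X) -> batch GD
n_epochs = 50

for epoch in range(n_epochs):
    order = rng.permutation(len(X))            # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        error = (w * xb + b) - yb
        # Gradients of the mean squared error on this mini-batch
        grad_w = 2 * np.mean(error * xb)
        grad_b = 2 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")  # should land near 3 and 2
```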


**Learning Rate**

* Too Large: Updates are huge; the loss may bounce around or even increase (divergence).
* Too Small: Updates are tiny; training is very slow.
* In practice, we often use a schedule (e.g. start with 0.01 and reduce over time) or adaptive optimizers that effectively adjust the step size per parameter.
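These three regimes show up even on a one-dimensional toy problem. The sketch below minimizes $J(w) = w^2$ (gradient $2w$) from $w = 1$; the objective and the three learning rates are picked purely for illustration.

```python
def run_gradient_descent(learning_rate, n_steps=20, w0=1.0):
    """Minimize J(w) = w**2 (gradient 2w) and return the loss after each step."""
    w = w0
    losses = []
    for _ in range(n_steps):
        w -= learning_rate * 2 * w
        losses.append(w ** 2)
    return losses

for lr in [1.1, 0.001, 0.1]:
    losses = run_gradient_descent(lr)
    print(f"lr = {lr:<5}  loss after 20 steps = {losses[-1]:.6f}")
```

With `lr = 1.1` each step overshoots the minimum and the loss grows, with `lr = 0.001` the loss has barely moved from its starting value of 1, and with `lr = 0.1` it is close to zero.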

### Optimizers: Adam, RMSprop, SGD


An optimizer is the rule we use to update the weights given the gradients. Plain gradient descent uses the same learning rate for every parameter. Adaptive optimizers (like Adam) adjust the effective step size per parameter, which often speeds up training and reduces the need to tune the learning rate.

* Momentum: Instead of using the gradient directly, we keep a running, exponentially decaying accumulation of past gradients (similar to a moving average): $v_t = \beta v_{t-1} + g_t$, where $g_t$ is the gradient at step $t$. The update is then $w \leftarrow w - \eta v_t$. Typical $\beta \approx 0.9$.


* RMSprop: RMSprop keeps a running average of the squared gradients and uses it to scale the update. Parameters that usually have large gradients get a smaller effective step; parameters with small gradients get a larger effective step. So the learning rate is adaptive per parameter.
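Both update rules fit in a few lines. The sketch below is one way to write them in NumPy; the function names, the RMSprop decay of 0.9, the `epsilon` term, and the toy objective (whose two parameters have gradients that differ by a factor of 100) are assumptions made for the illustration.

```python
import numpy as np

def momentum_step(w, v, grad, learning_rate=0.01, beta=0.9):
    """v_t = beta * v_{t-1} + g_t, then w <- w - eta * v_t."""
    v = beta * v + grad
    w = w - learning_rate * v
    return w, v

def rmsprop_step(w, s, grad, learning_rate=0.01, decay=0.9, epsilon=1e-8):
    """Scale each parameter's step by a running average of its squared gradients."""
    s = decay * s + (1 - decay) * grad ** 2
    w = w - learning_rate * grad / (np.sqrt(s) + epsilon)
    return w, s

# Toy objective J(w) = 0.5 * (100 * w[0]**2 + w[1]**2): very different gradient scales
def gradient(w):
    return np.array([100.0, 1.0]) * w

w_m, v = np.array([1.0, 1.0]), np.zeros(2)
w_r, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w_m, v = momentum_step(w_m, v, gradient(w_m), learning_rate=0.001)
    w_r, s = rmsprop_step(w_r, s, gradient(w_r), learning_rate=0.01)

print("momentum:", w_m)
print("RMSprop: ", w_r)
```

With momentum, the steep coordinate converges much faster than the shallow one because both share one learning rate. With RMSprop, the two coordinates end up at nearly the same value, since dividing by the root of the running squared gradient cancels the difference in gradient scale.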

**Adam (Adaptive Moment Estimation)** 

It combines the ideas of momentum (smooth direction) and RMSprop (adaptive step size). It maintains two running averages: one of the gradient (first moment) and one of the squared gradient (second moment). Because both start at zero, it applies bias correction so that the estimates are not skewed toward zero during the early steps.

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(momentum)} $$
$$ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(squared gradients)} $$
$$ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(bias correction)} $$
$$ w \leftarrow w - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$


Typical defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, $\eta=0.001$.
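The four equations translate almost line for line into code. The sketch below is a minimal NumPy version, not a reference implementation; the function name `adam_step`, the toy quadratic objective, and the larger learning rate of 0.01 used in the loop are assumptions made for the illustration.

```python
import numpy as np

def adam_step(w, m, v, grad, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v

# Toy objective J(w) = sum((w - target)**2), chosen only for illustration
target = np.array([1.0, -2.0])
w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)

for t in range(1, 2001):
    grad = 2 * (w - target)
    w, m, v = adam_step(w, m, v, grad, t, learning_rate=0.01)

print(w)  # should be close to target
```

On the very first step the bias-corrected ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is approximately $\pm 1$, so every parameter moves by roughly the learning rate regardless of how large its gradient is.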

### Regularization Techniques (Dropout, L2 Regularization)


Overfitting means the model fits the training data very well but performs poorly on new data (it memorized the training set instead of learning general patterns).


Regularization is any technique that discourages overfitting and encourages the model to be simpler or more robust.

**Dropout**

During training only: at each forward pass we randomly set a fraction $p$ of the neurons in a layer to zero (we "drop" them). The remaining neurons are scaled up so the total activation level is roughly preserved (e.g. divide by the probability of keeping a neuron, $1 - p$). At test time, we use all neurons and apply no dropout.

The intuition: because any neuron may be switched off at random, the network cannot depend on any single one and has to spread what it learns across many neurons.
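The variant that rescales during training is often called inverted dropout. The sketch below shows one way to write it; the function name, the all-ones activations, and the drop probability of 0.2 are assumptions made for the illustration.

```python
import numpy as np

def dropout_forward(A, drop_prob, training, rng):
    """Inverted dropout: zero a fraction drop_prob of activations, rescale the rest."""
    if not training or drop_prob == 0.0:
        return A                      # test time: use all neurons unchanged
    keep_prob = 1.0 - drop_prob
    mask = rng.random(A.shape) < keep_prob   # True for the neurons we keep
    return A * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((4, 1000))               # illustrative activations

A_train = dropout_forward(A, drop_prob=0.2, training=True, rng=rng)
A_test = dropout_forward(A, drop_prob=0.2, training=False, rng=rng)

print(f"fraction zeroed in training: {np.mean(A_train == 0):.2f}")  # close to 0.20
print(f"mean activation (train): {A_train.mean():.2f}")              # close to 1.00
print(f"mean activation (test):  {A_test.mean():.2f}")               # exactly 1.00
```

Dividing by `keep_prob` during training keeps the expected activation the same in both modes, so nothing needs to be rescaled at test time.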

**L2 Regularization** (Weight Decay)


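We add an extra term to the loss that penalizes large weights. The term is proportional to the sum of the squared weights, so it pushes the network toward smaller weights.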

![](https://miro.medium.com/v2/resize:fit:1400/1*ozLs-feHr73kJTfKL8figA.png)
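A common way to write the penalty is $\frac{\lambda}{2m} \sum_l \lVert W^{[l]} \rVert^2$, which adds $\frac{\lambda}{m} W^{[l]}$ to each weight gradient. The sketch below assumes that form; the function names, the weight values, $\lambda$, $m$, and the stand-in data loss are all hypothetical and chosen only to show the arithmetic.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) * sum of squared weights over all layers (biases excluded)."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def l2_gradient(W, lam, m):
    """Extra gradient term that the penalty adds to dJ/dW."""
    return (lam / m) * W

W1 = np.array([[0.5, -1.0], [2.0, 0.0]])
W2 = np.array([[1.0, -0.5]])
lam, m = 0.1, 10         # hypothetical regularization strength and number of samples

data_loss = 0.6067       # placeholder standing in for the unregularized loss
penalty = l2_penalty([W1, W2], lam, m)
total_loss = data_loss + penalty

print(f"L2 penalty: {penalty:.4f}")      # 0.0325
print(f"Total loss: {total_loss:.4f}")   # 0.6392
print(f"Extra gradient for W2: {l2_gradient(W2, lam, m)}")
```

Because the extra gradient is proportional to the weight itself, each update shrinks every weight slightly toward zero, which is where the name "weight decay" comes from.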

Topic for next class: Implementing a Neural Network from Scratch (NumPy)