# Neural Networks - Complete Deep Explanation

---

## Introduction

In this notebook we work through **Neural Networks** in depth, with the formula behind every step. We use two examples:

1. **Regression Problem** - House Price Prediction
2. **Classification Problem** - Binary Classification

Everything is explained step by step, with formulas.

---

# Part 1: REGRESSION EXAMPLE

## Problem Statement

**Dataset:**
- **Input (X):** Size (sq.ft) = [1000, 1500, 2000]
- **Output (Y):** Price (₹ lakhs) = [50, 70, 90]

**Goal:** Given the size of a house, predict its price.

---

## Step 1: Input Layer - Data Preparation

### 1.1 Raw Data

The raw data we have:

```
Size (sq.ft): [1000, 1500, 2000]
Price (₹ lakhs): [50, 70, 90]
```

### 1.2 Converting the Input into a Vector

**Question:** How are the inputs converted into a vector?

**Answer:** A neural network expects its data in matrix form (a 2D array). The resulting matrix is called the **feature matrix**.

**Original Data:**
```
X = [1000, 1500, 2000]  → this is a 1D array (list)
```

**Vector Format (Matrix):**
```
X = [[1000],
     [1500],
     [2000]]
```

This is a **3×1 matrix** (3 samples, 1 feature).

**Mathematical Representation:**

$$
X = \begin{bmatrix}
x^{(1)} \\
x^{(2)} \\
x^{(3)}
\end{bmatrix} = \begin{bmatrix}
1000 \\
1500 \\
2000
\end{bmatrix}
$$

Here:
- $x^{(1)}$ = First training example (1000 sq.ft)
- $x^{(2)}$ = Second training example (1500 sq.ft)
- $x^{(3)}$ = Third training example (2000 sq.ft)

### 1.3 Normalization (Optional but Important)

Large input values tend to slow training down, so we normalize them:

**Min-Max Normalization Formula:**

$$
X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}}
$$

**Calculation:**

$$
X_{normalized} = \frac{[1000, 1500, 2000] - 1000}{2000 - 1000} = \frac{[0, 500, 1000]}{1000} = [0, 0.5, 1.0]
$$

**Final Input Matrix:**

$$
X = \begin{bmatrix}
0.0 \\
0.5 \\
1.0
\end{bmatrix}
$$

Similarly, the output Y can be normalized with the same formula, using $Y_{min} = 50$ and $Y_{max} = 90$:

$$
Y = \begin{bmatrix}
50 \\
70 \\
90
\end{bmatrix} \rightarrow Y_{normalized} = \begin{bmatrix}
0.0 \\
0.5 \\
1.0
\end{bmatrix}
$$
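The vectorization and min-max normalization above can be sketched in a few lines of Python. The helper name `min_max_normalize` is an illustrative choice, not part of the original notebook, and plain lists are used so the block has no dependencies.

```python
def min_max_normalize(values):
    """Scale a list of numbers into [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

sizes = [1000, 1500, 2000]   # sq.ft
prices = [50, 70, 90]        # price in lakhs

# One row per sample, one column per feature: a 3x1 matrix.
X = [[x] for x in min_max_normalize(sizes)]
Y = [[y] for y in min_max_normalize(prices)]

print(X)  # [[0.0], [0.5], [1.0]]
print(Y)  # [[0.0], [0.5], [1.0]]
```

The same minimum and maximum have to be reused later to scale new inputs and to map predictions back to lakhs.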

---

## Step 2: Network Architecture Design

We will design a simple neural network:

**Architecture:**
- **Input Layer:** 1 neuron (Size feature)
- **Hidden Layer 1:** 3 neurons
- **Output Layer:** 1 neuron (Price prediction)

**Visual Representation:**

```
Input Layer    Hidden Layer    Output Layer
    (1)            (3)              (1)
    
    [X] ---------> [H1] 
                   [H2] ---------> [Y]
                   [H3]
```

---

## Step 3: Weight Initialization

### 3.1 How Are Weights Initialized?

**Question:** How are the weights that act on the inputs initialized?

**Answer:** Weights are initialized **randomly**, with small values. This helps training get started.

### 3.2 Weight Matrices

**Between Input and Hidden Layer:**

Weight matrix $W^{[1]}$ size: **(number of hidden neurons) × (number of input features)**

$$
W^{[1]} = \begin{bmatrix}
w_{11}^{[1]} \\
w_{21}^{[1]} \\
w_{31}^{[1]}
\end{bmatrix} \text{ (3×1 matrix)}
$$

**Example Random Initialization:**

$$
W^{[1]} = \begin{bmatrix}
0.5 \\
-0.3 \\
0.8
\end{bmatrix}
$$

**Bias for Hidden Layer:**

$$
b^{[1]} = \begin{bmatrix}
0.1 \\
0.2 \\
-0.1
\end{bmatrix} \text{ (3×1 matrix)}
$$

**Between Hidden and Output Layer:**

Weight matrix $W^{[2]}$ size: **(number of output neurons) × (number of hidden neurons)**

$$
W^{[2]} = \begin{bmatrix}
w_{11}^{[2]} & w_{12}^{[2]} & w_{13}^{[2]}
\end{bmatrix} \text{ (1×3 matrix)}
$$

**Example:**

$$
W^{[2]} = \begin{bmatrix}
0.4 & 0.6 & -0.2
\end{bmatrix}
$$

**Bias for Output Layer:**

$$
b^{[2]} = [0.3] \text{ (scalar)}
$$

### 3.3 Why Random Initialization?

- If all weights start at the same value, every neuron in a layer computes the same output and receives the same gradient, so they all learn the same thing (the symmetry problem)
- Random values let each neuron learn a different pattern
- Small values help avoid exploding gradients early in training
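A minimal sketch of random initialization for the 1-3-1 network. The function name `init_layer`, the uniform range of ±0.5, and the zero starting biases are all assumptions made for illustration; the worked example in this notebook uses hand-picked weights and small non-zero biases instead.

```python
import random

def init_layer(n_out, n_in, scale=0.5, rng=None):
    """Return (W, b): an n_out x n_in matrix of small random weights and a zero bias vector."""
    if rng is None:
        rng = random.Random()
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

rng = random.Random(0)               # fixed seed so runs are repeatable
W1, b1 = init_layer(3, 1, rng=rng)   # input (1) -> hidden (3): 3x1
W2, b2 = init_layer(1, 3, rng=rng)   # hidden (3) -> output (1): 1x3

print(W1, b1)
print(W2, b2)
```

Because each weight is drawn independently, the three hidden neurons start out different from one another, which is what breaks the symmetry.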

---

## Step 4: Forward Propagation

### 4.1 Input to Hidden Layer

**Question:** How are the weights applied, and how does the multiplication work?

**Answer:** We compute a weighted sum using matrix multiplication.

**Formula:**

$$
Z^{[1]} = W^{[1]} \cdot X + b^{[1]}
$$

Here:
- $Z^{[1]}$ = Pre-activation values (for the hidden layer)
- $W^{[1]}$ = Weights (3×1)
- $X$ = Input (1×1 for one sample)
- $b^{[1]}$ = Bias (3×1)

**First Training Example Calculation ($x^{(1)} = 0.0$):**

$$
Z^{[1]} = \begin{bmatrix}
0.5 \\
-0.3 \\
0.8
\end{bmatrix} \times 0.0 + \begin{bmatrix}
0.1 \\
0.2 \\
-0.1
\end{bmatrix} = \begin{bmatrix}
0.0 + 0.1 \\
0.0 + 0.2 \\
0.0 - 0.1
\end{bmatrix} = \begin{bmatrix}
0.1 \\
0.2 \\
-0.1
\end{bmatrix}
$$

**Detailed Breakdown:**

For each hidden neuron:
- $z_1^{[1]} = w_{11}^{[1]} \times x + b_1^{[1]} = 0.5 \times 0.0 + 0.1 = 0.1$
- $z_2^{[1]} = w_{21}^{[1]} \times x + b_2^{[1]} = -0.3 \times 0.0 + 0.2 = 0.2$
- $z_3^{[1]} = w_{31}^{[1]} \times x + b_3^{[1]} = 0.8 \times 0.0 - 0.1 = -0.1$
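The weighted sum for the hidden layer can be sketched as a small matrix-vector product. The helper name `dense` is a hypothetical choice for this illustration.

```python
def dense(W, x, b):
    """Compute z = W·x + b, where each row of W holds one neuron's weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

W1 = [[0.5], [-0.3], [0.8]]   # 3x1 weight matrix
b1 = [0.1, 0.2, -0.1]         # one bias per hidden neuron
x = [0.0]                     # first normalized sample

Z1 = dense(W1, x, b1)
print(Z1)  # [0.1, 0.2, -0.1]
```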

### 4.2 Activation Function (Hidden Layer)

**Question:** What happens if there is no activation?

**Answer:** Without an activation function, stacked layers collapse into a single linear map, so the network is no more expressive than simple linear regression. Adding non-linearity is what lets it learn complex patterns.
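To see this for the two-layer network used here, drop the activation (so $A^{[1]} = Z^{[1]}$) and substitute the first layer into the second. The symbols $W'$ and $b'$ are introduced only for this illustration:

$$
Z^{[2]} = W^{[2]}\left(W^{[1]} X + b^{[1]}\right) + b^{[2]} = \underbrace{W^{[2]} W^{[1]}}_{W'} X + \underbrace{W^{[2]} b^{[1]} + b^{[2]}}_{b'}
$$

With the example weights from Step 3, $W' = 0.4(0.5) + 0.6(-0.3) + (-0.2)(0.8) = -0.14$ and $b' = 0.4(0.1) + 0.6(0.2) + (-0.2)(-0.1) + 0.3 = 0.48$, so the whole network would reduce to the straight line $Z^{[2]} = -0.14X + 0.48$.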

**Common Activation: ReLU (Rectified Linear Unit)**

$$
\text{ReLU}(z) = \max(0, z) = \begin{cases}
z & \text{if } z > 0 \\
0 & \text{if } z \leq 0
\end{cases}
$$

**Apply ReLU:**

$$
A^{[1]} = \text{ReLU}(Z^{[1]}) = \begin{bmatrix}
\max(0, 0.1) \\
\max(0, 0.2) \\
\max(0, -0.1)
\end{bmatrix} = \begin{bmatrix}
0.1 \\
0.2 \\
0.0
\end{bmatrix}
$$

**Why ReLU?**
- Simple and fast to compute
- Sets negative values to 0
- Passes positive values through unchanged
- Reduces the vanishing gradient problem
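As a quick sketch, ReLU applied element-wise to the hidden layer's pre-activations looks like this (the function name `relu` is just a convenient label):

```python
def relu(z):
    """Element-wise ReLU: max(0, v) for every entry v of the vector z."""
    return [max(0.0, v) for v in z]

Z1 = [0.1, 0.2, -0.1]
A1 = relu(Z1)
print(A1)  # [0.1, 0.2, 0.0]
```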

### 4.3 Hidden to Output Layer

**Formula:**

$$
Z^{[2]} = W^{[2]} \cdot A^{[1]} + b^{[2]}
$$

**Calculation:**

$$
Z^{[2]} = \begin{bmatrix}
0.4 & 0.6 & -0.2
\end{bmatrix} \cdot \begin{bmatrix}
0.1 \\
0.2 \\
0.0
\end{bmatrix} + 0.3
$$

$$
Z^{[2]} = (0.4 \times 0.1) + (0.6 \times 0.2) + (-0.2 \times 0.0) + 0.3
$$

$$
Z^{[2]} = 0.04 + 0.12 + 0.0 + 0.3 = 0.46
$$

### 4.4 Output Activation (Regression)

For regression, the output layer uses a **linear activation** (that is, no activation):

$$
\hat{Y} = A^{[2]} = Z^{[2]} = 0.46
$$

This is our **predicted price** (in normalized form).
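The whole forward pass for one sample can be sketched as a single function. The name `forward` and the final step that maps the prediction back to lakhs (by inverting the min-max scaling of Y) are additions for illustration; the weights are still the untrained initial values, so the price itself is not meaningful yet.

```python
def forward(x, W1, b1, W2, b2):
    """Forward pass of the 1-3-1 network for one scalar input x; returns (Z1, A1, y_hat)."""
    Z1 = [w[0] * x + b for w, b in zip(W1, b1)]          # hidden pre-activations
    A1 = [max(0.0, z) for z in Z1]                       # ReLU
    y_hat = sum(w * a for w, a in zip(W2[0], A1)) + b2   # linear output
    return Z1, A1, y_hat

W1, b1 = [[0.5], [-0.3], [0.8]], [0.1, 0.2, -0.1]
W2, b2 = [[0.4, 0.6, -0.2]], 0.3

_, _, y_hat = forward(0.0, W1, b1, W2, b2)
price = y_hat * (90 - 50) + 50   # undo the min-max scaling of Y
print(round(y_hat, 4), round(price, 2))  # 0.46 68.4
```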

---

## Step 5: Loss Calculation

### 5.1 Loss Function for Regression

**Mean Squared Error (MSE), written here for a single sample:**

$$
L = \frac{1}{2}(Y - \hat{Y})^2
$$

Here:
- $Y$ = Actual value
- $\hat{Y}$ = Predicted value
- $\frac{1}{2}$ = Included for mathematical convenience (it cancels the 2 produced by the derivative)

**For First Sample:**

$$
L = \frac{1}{2}(0.0 - 0.46)^2 = \frac{1}{2}(-0.46)^2 = \frac{1}{2}(0.2116) = 0.1058
$$

### 5.2 Total Loss (All Samples)

Compute the loss for every sample and take the average:

$$
J = \frac{1}{m} \sum_{i=1}^{m} L^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(y^{(i)} - \hat{y}^{(i)})^2
$$

Here $m$ = number of training examples (3 in our case)
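A small sketch of the total loss $J$ over all three samples, using the initial weights from Step 3. The helper name `predict` is hypothetical, and the weights are stored as flat lists because there is only one input and one output.

```python
def predict(x):
    """Forward pass with the example's initial weights (1-3-1 network, ReLU hidden layer)."""
    W1, b1 = [0.5, -0.3, 0.8], [0.1, 0.2, -0.1]
    W2, b2 = [0.4, 0.6, -0.2], 0.3
    A1 = [max(0.0, w * x + b) for w, b in zip(W1, b1)]
    return sum(w * a for w, a in zip(W2, A1)) + b2

X = [0.0, 0.5, 1.0]   # normalized sizes
Y = [0.0, 0.5, 1.0]   # normalized prices

losses = [0.5 * (y - predict(x)) ** 2 for x, y in zip(X, Y)]
J = sum(losses) / len(losses)
print([round(l, 5) for l in losses], round(J, 4))  # roughly [0.1058, 0.00405, 0.18] 0.0966
```

The predictions for the three samples are 0.46, 0.41 and 0.40, so before training the network is furthest off on the largest house.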

---

## Step 6: Backpropagation

### 6.1 What is Backpropagation?

To minimize the loss, we need to adjust the weights. Backpropagation computes the **gradient** (slope) of the loss with respect to each weight.

**Chain Rule:**

Using the chain rule from calculus, we differentiate the loss with respect to each weight.

### 6.2 Output Layer Gradient

**Loss with respect to the output:**

$$
\frac{\partial L}{\partial \hat{Y}} = \frac{\partial}{\partial \hat{Y}} \left[ \frac{1}{2}(Y - \hat{Y})^2 \right] = -(Y - \hat{Y})
$$

**Calculation:**

$$
\frac{\partial L}{\partial \hat{Y}} = -(0.0 - 0.46) = 0.46
$$

**Output activation gradient (Linear):**

$$
\frac{\partial \hat{Y}}{\partial Z^{[2]}} = 1
$$

**Combined:**

$$
dZ^{[2]} = \frac{\partial L}{\partial Z^{[2]}} = \frac{\partial L}{\partial \hat{Y}} \times \frac{\partial \hat{Y}}{\partial Z^{[2]}} = 0.46 \times 1 = 0.46
$$

### 6.3 Gradients for $W^{[2]}$ and $b^{[2]}$

**Weight gradient:**

$$
\frac{\partial L}{\partial W^{[2]}} = dZ^{[2]} \cdot (A^{[1]})^T
$$

$$
\frac{\partial L}{\partial W^{[2]}} = 0.46 \times \begin{bmatrix}
0.1 & 0.2 & 0.0
\end{bmatrix} = \begin{bmatrix}
0.046 & 0.092 & 0.0
\end{bmatrix}
$$

**Bias gradient:**

$$
\frac{\partial L}{\partial b^{[2]}} = dZ^{[2]} = 0.46
$$

### 6.4 Hidden Layer Gradient

**Propagate error backwards:**

$$
dA^{[1]} = (W^{[2]})^T \cdot dZ^{[2]}
$$

$$
dA^{[1]} = \begin{bmatrix}
0.4 \\
0.6 \\
-0.2
\end{bmatrix} \times 0.46 = \begin{bmatrix}
0.184 \\
0.276 \\
-0.092
\end{bmatrix}
$$

**ReLU derivative:**

$$
\text{ReLU}'(z) = \begin{cases}
1 & \text{if } z > 0 \\
0 & \text{if } z \leq 0
\end{cases}
$$

**Apply to our $Z^{[1]}$:**

$$
\text{ReLU}'(Z^{[1]}) = \begin{bmatrix}
1 \\
1 \\
0
\end{bmatrix} \text{ (because } Z^{[1]} = \begin{bmatrix}
0.1 \\
0.2 \\
-0.1
\end{bmatrix})
$$

**Element-wise multiplication:**

$$
dZ^{[1]} = dA^{[1]} \odot \text{ReLU}'(Z^{[1]}) = \begin{bmatrix}
0.184 \\
0.276 \\
-0.092
\end{bmatrix} \odot \begin{bmatrix}
1 \\
1 \\
0
\end{bmatrix} = \begin{bmatrix}
0.184 \\
0.276 \\
0.0
\end{bmatrix}
$$

### 6.5 Gradients for $W^{[1]}$ and $b^{[1]}$

**Weight gradient:**

$$
\frac{\partial L}{\partial W^{[1]}} = dZ^{[1]} \cdot X^T
$$

$$
\frac{\partial L}{\partial W^{[1]}} = \begin{bmatrix}
0.184 \\
0.276 \\
0.0
\end{bmatrix} \times [0.0] = \begin{bmatrix}
0.0 \\
0.0 \\
0.0
\end{bmatrix}
$$

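Every entry is zero only because this sample's normalized input is 0.0; samples with a non-zero input produce non-zero weight gradients.

**Bias gradient:**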

$$
\frac{\partial L}{\partial b^{[1]}} = dZ^{[1]} = \begin{bmatrix}
0.184 \\
0.276 \\
0.0
\end{bmatrix}
$$
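The backward pass for this one sample can be sketched as below. The function name `backward` is an assumption for illustration, and the weights are stored as flat lists because the network has a single input and a single output.

```python
def backward(x, y, W1, b1, W2, b2):
    """Gradients of L = 0.5*(y - y_hat)^2 for the 1-3-1 ReLU network on one sample."""
    # Forward pass (its intermediate values are reused below).
    Z1 = [w * x + b for w, b in zip(W1, b1)]
    A1 = [max(0.0, z) for z in Z1]
    y_hat = sum(w * a for w, a in zip(W2, A1)) + b2

    dZ2 = y_hat - y                                          # dL/dZ2 (linear output)
    dW2 = [dZ2 * a for a in A1]                              # dL/dW2
    db2 = dZ2                                                # dL/db2
    dA1 = [w * dZ2 for w in W2]                              # error pushed back to the hidden layer
    dZ1 = [da if z > 0 else 0.0 for da, z in zip(dA1, Z1)]   # multiplied by ReLU'(Z1)
    dW1 = [dz * x for dz in dZ1]                             # dL/dW1
    db1 = dZ1                                                # dL/db1
    return dW1, db1, dW2, db2

grads = backward(0.0, 0.0, [0.5, -0.3, 0.8], [0.1, 0.2, -0.1], [0.4, 0.6, -0.2], 0.3)
for name, g in zip(("dW1", "db1", "dW2", "db2"), grads):
    print(name, g)
```

Up to floating-point rounding, the printed values match the hand calculation: `dW2` is about `[0.046, 0.092, 0.0]`, `db2` about `0.46`, `db1` about `[0.184, 0.276, 0.0]`, and `dW1` is all zeros.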

---

## Step 7: Optimization (Weight Update)

### 7.1 Gradient Descent

**Formula:**

$$
W = W - \alpha \frac{\partial L}{\partial W}
$$

$$
b = b - \alpha \frac{\partial L}{\partial b}
$$

Here:
- $\alpha$ = Learning rate (e.g., 0.01)
- $\frac{\partial L}{\partial W}$ = Gradient

### 7.2 Update $W^{[2]}$ and $b^{[2]}$

**Assume $\alpha = 0.01$:**

$$
W^{[2]}_{new} = \begin{bmatrix}
0.4 & 0.6 & -0.2
\end{bmatrix} - 0.01 \times \begin{bmatrix}
0.046 & 0.092 & 0.0
\end{bmatrix}
$$

$$
W^{[2]}_{new} = \begin{bmatrix}
0.39954 & 0.59908 & -0.2
\end{bmatrix}
$$

$$
b^{[2]}_{new} = 0.3 - 0.01 \times 0.46 = 0.3 - 0.0046 = 0.2954
$$

### 7.3 Update $W^{[1]}$ and $b^{[1]}$

$$
W^{[1]}_{new} = \begin{bmatrix}
0.5 \\
-0.3 \\
0.8
\end{bmatrix} - 0.01 \times \begin{bmatrix}
0.0 \\
0.0 \\
0.0
\end{bmatrix} = \begin{bmatrix}
0.5 \\
-0.3 \\
0.8
\end{bmatrix}
$$

$$
b^{[1]}_{new} = \begin{bmatrix}
0.1 \\
0.2 \\
-0.1
\end{bmatrix} - 0.01 \times \begin{bmatrix}
0.184 \\
0.276 \\
0.0
\end{bmatrix} = \begin{bmatrix}
0.09816 \\
0.19724 \\
-0.1
\end{bmatrix}
$$
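The update rule itself is one line per parameter. A minimal sketch, with `sgd_step` as a hypothetical helper name and the gradients copied from Step 6:

```python
def sgd_step(params, grads, lr):
    """Return params - lr * grads for two flat lists of equal length."""
    return [p - lr * g for p, g in zip(params, grads)]

lr = 0.01
W2_new = sgd_step([0.4, 0.6, -0.2], [0.046, 0.092, 0.0], lr)
b1_new = sgd_step([0.1, 0.2, -0.1], [0.184, 0.276, 0.0], lr)
b2_new = sgd_step([0.3], [0.46], lr)[0]

print([round(w, 5) for w in W2_new])  # [0.39954, 0.59908, -0.2]
print([round(b, 5) for b in b1_new])  # [0.09816, 0.19724, -0.1]
print(round(b2_new, 4))               # 0.2954
```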

### 7.4 Other Optimizers

**1. Stochastic Gradient Descent (SGD):**
- Updates the weights using one sample at a time
- Fast but noisy

**2. Mini-Batch Gradient Descent:**
- Uses a small batch of samples for each update
- Balances speed and stability

**3. Adam Optimizer:**

A very widely used optimizer. It combines momentum with a per-parameter adaptive learning rate.

**Formulas:**

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(Momentum)}
$$

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(Adaptive learning rate)}
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(Bias correction)}
$$

$$
W = W - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

Here:
- $g_t$ = Current gradient
- $\beta_1$ = 0.9 (typical)
- $\beta_2$ = 0.999 (typical)
- $\epsilon$ = $10^{-8}$ (for numerical stability)
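The four Adam formulas can be sketched for a single scalar parameter as follows. The function name `adam_step` and the choice to apply it to $b^{[2]}$ with $\alpha = 0.01$ are assumptions for illustration; real implementations keep one `m` and one `v` per parameter.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns the new (w, m, v)."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (gradient scale)
    m_hat = m / (1 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step (t = 1) on b2 from the worked example, whose gradient is 0.46.
w, m, v = adam_step(0.3, 0.46, m=0.0, v=0.0, t=1, lr=0.01)
print(round(w, 4))  # 0.29
```

On the first step the bias-corrected ratio is $g / |g|$, so the parameter moves by almost exactly $\alpha$ regardless of how large the gradient is.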

**4. RMSprop:**

$$
v_t = \beta v_{t-1} + (1 - \beta) g_t^2
$$

$$
W = W - \frac{\alpha}{\sqrt{v_t} + \epsilon} g_t
$$
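A matching sketch of one RMSprop update for a single scalar parameter. The name `rmsprop_step` and the default $\beta = 0.9$ are assumptions; the text above does not fix a value for $\beta$.

```python
import math

def rmsprop_step(w, g, v, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update for a single scalar parameter; returns the new (w, v)."""
    v = beta * v + (1 - beta) * g * g        # running average of squared gradients
    w = w - lr * g / (math.sqrt(v) + eps)    # step scaled by the recent gradient size
    return w, v

w, v = rmsprop_step(0.3, 0.46, v=0.0)
print(round(w, 4), round(v, 5))  # 0.2684 0.02116
```

Unlike Adam, there is no bias correction here, so the very first step is larger than $\alpha$ because $v_t$ starts from zero.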

---

## Step 8: Training Loop

**Complete Training Process:**

```
For each epoch (1 to 1000):
    For each training sample:
        1. Forward Propagation
           - Calculate Z[1] = W[1] · X + b[1]
           - Calculate A[1] = ReLU(Z[1])
           - Calculate Z[2] = W[2] · A[1] + b[2]
           - Calculate Ŷ = Z[2]
        
        2. Calculate Loss
           - L = (1/2)(Y - Ŷ)²
        
        3. Backpropagation
           - Calculate gradients for all weights and biases
        
        4. Update Weights
           - W = W - α × gradient
           - b = b - α × gradient
    
    If average loss J for the epoch < threshold:
        Stop training
```

**Convergence:**

Over many epochs the loss gradually decreases and the model's predictions become more accurate.
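The loop above can be turned into a short runnable sketch for the regression example. The function name `train`, the stopping threshold of `1e-6`, and the flat-list weight layout are assumptions for illustration; the learning rate and initial weights are the ones used in the worked example.

```python
def train(X, Y, epochs=1000, lr=0.01, threshold=1e-6):
    """Per-sample gradient descent on the 1-3-1 ReLU network; returns the mean loss of each epoch."""
    W1, b1 = [0.5, -0.3, 0.8], [0.1, 0.2, -0.1]
    W2, b2 = [0.4, 0.6, -0.2], 0.3
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(X, Y):
            # 1. Forward propagation
            Z1 = [w * x + b for w, b in zip(W1, b1)]
            A1 = [max(0.0, z) for z in Z1]
            y_hat = sum(w * a for w, a in zip(W2, A1)) + b2
            # 2. Loss
            total += 0.5 * (y - y_hat) ** 2
            # 3. Backpropagation (dZ1 uses W2 from before the update)
            dZ2 = y_hat - y
            dZ1 = [w * dZ2 if z > 0 else 0.0 for w, z in zip(W2, Z1)]
            # 4. Update weights and biases
            W2 = [w - lr * dZ2 * a for w, a in zip(W2, A1)]
            b2 -= lr * dZ2
            W1 = [w - lr * dz * x for w, dz in zip(W1, dZ1)]
            b1 = [b - lr * dz for b, dz in zip(b1, dZ1)]
        history.append(total / len(X))
        if history[-1] < threshold:   # early stop once the mean loss is small enough
            break
    return history

history = train([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
print(history[0], history[-1])
```

Comparing the first and last entries of `history` shows how far the mean loss has dropped over training.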

---

# Part 2: CLASSIFICATION EXAMPLE

## Problem Statement

**Dataset: Student Pass/Fail Prediction**

- **Input (X):** Study Hours = [1, 2, 3, 4, 5]
- **Output (Y):** Pass/Fail = [0, 0, 1, 1, 1]

Here:
- 0 = Fail
- 1 = Pass

**Goal:** Given the study hours, predict whether the student will pass or fail.

---

## Step 1: Input Preparation

**Raw Data:**

```
X = [1, 2, 3, 4, 5]
Y = [0, 0, 1, 1, 1]
```

**Vector Format:**

$$
X = \begin{bmatrix}
1 \\
2 \\
3 \\
4 \\
5
\end{bmatrix}, \quad Y = \begin{bmatrix}
0 \\
0 \\
1 \\
1 \\
1
\end{bmatrix}
$$

**Normalization:**

$$
X_{normalized} = \frac{X - 1}{5 - 1} = \frac{[0, 1, 2, 3, 4]}{4} = \begin{bmatrix}
0.0 \\
0.25 \\
0.5 \\
0.75 \\
1.0
\end{bmatrix}
$$

---

## Step 2: Network Architecture

**Design:**
- **Input Layer:** 1 neuron (Study hours)
- **Hidden Layer:** 2 neurons
- **Output Layer:** 1 neuron (Pass/Fail probability)

```
Input (1) → Hidden (2) → Output (1)
```

---

## Step 3: Weight Initialization

**Layer 1 (Input to Hidden):**

$$
W^{[1]} = \begin{bmatrix}
0.6 \\
-0.4
\end{bmatrix}, \quad b^{[1]} = \begin{bmatrix}
0.2 \\
-0.1
\end{bmatrix}
$$

**Layer 2 (Hidden to Output):**

$$
W^{[2]} = \begin{bmatrix}
0.5 & 0.7
\end{bmatrix}, \quad b^{[2]} = 0.1
$$

---

## Step 4: Forward Propagation

### 4.1 First Sample ($x^{(1)} = 0.0$, $y^{(1)} = 0$)

**Input to Hidden:**

$$
Z^{[1]} = W^{[1]} \cdot X + b^{[1]} = \begin{bmatrix}
0.6 \\
-0.4
\end{bmatrix} \times 0.0 + \begin{bmatrix}
0.2 \\
-0.1
\end{bmatrix} = \begin{bmatrix}
0.2 \\
-0.1
\end{bmatrix}
$$

**Activation (ReLU):**

$$
A^{[1]} = \text{ReLU}(Z^{[1]}) = \begin{bmatrix}
0.2 \\
0.0
\end{bmatrix}
$$

**Hidden to Output:**

$$
Z^{[2]} = W^{[2]} \cdot A^{[1]} + b^{[2]} = \begin{bmatrix}
0.5 & 0.7
\end{bmatrix} \cdot \begin{bmatrix}
0.2 \\
0.0
\end{bmatrix} + 0.1
$$

$$
Z^{[2]} = (0.5 \times 0.2) + (0.7 \times 0.0) + 0.1 = 0.1 + 0.0 + 0.1 = 0.2
$$

### 4.2 Output Activation (Sigmoid)

**For classification we use the sigmoid:**

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

**Why Sigmoid?**
- Maps the output into the range 0 to 1
- Can be interpreted as a probability
- Well suited to binary classification

**Calculate:**

$$
\hat{Y} = \sigma(Z^{[2]}) = \frac{1}{1 + e^{-0.2}} = \frac{1}{1 + 0.8187} = \frac{1}{1.8187} = 0.5498
$$

**Interpretation:**
- $\hat{Y} = 0.5498$ means 54.98% probability of passing
- With a threshold of 0.5, 0.5498 > 0.5, so the prediction is 1 (Pass)
- But the actual Y = 0 (Fail), so the prediction is wrong!
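The classifier's forward pass and the 0.5 threshold can be sketched as follows. The names `sigmoid` and `predict_proba` are illustrative, and this simple `sigmoid` is not hardened against overflow for very large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x):
    """Forward pass of the 1-2-1 classifier with the example's initial weights."""
    W1, b1 = [0.6, -0.4], [0.2, -0.1]
    W2, b2 = [0.5, 0.7], 0.1
    A1 = [max(0.0, w * x + b) for w, b in zip(W1, b1)]          # ReLU hidden layer
    return sigmoid(sum(w * a for w, a in zip(W2, A1)) + b2)     # probability of "Pass"

p = predict_proba(0.0)
label = 1 if p > 0.5 else 0
print(round(p, 4), label)  # 0.5498 1
```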

---

## Step 5: Loss Calculation

### 5.1 Binary Cross-Entropy Loss

For classification we use **Binary Cross-Entropy**:

$$
L = -[Y \log(\hat{Y}) + (1 - Y) \log(1 - \hat{Y})]
$$

**Why this formula?**
- When Y = 1 (Pass): $L = -\log(\hat{Y})$
  - If $\hat{Y}$ is close to 1, the loss is small
  - If $\hat{Y}$ is close to 0, the loss is large
- When Y = 0 (Fail): $L = -\log(1 - \hat{Y})$
  - If $\hat{Y}$ is close to 0, the loss is small
  - If $\hat{Y}$ is close to 1, the loss is large

### 5.2 Calculate Loss

**For the first sample (Y = 0, $\hat{Y}$ = 0.5498), with $\log$ denoting the natural logarithm:**

$$
L = -[0 \times \log(0.5498) + (1 - 0) \times \log(1 - 0.5498)]
$$

$$
L = -[0 + 1 \times \log(0.4502)]
$$

$$
L = -\log(0.4502) = -(-0.7981) = 0.7981
$$

**Total Loss (All Samples):**

$$
J = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]
$$
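A minimal sketch of the per-sample binary cross-entropy using the natural logarithm. The name `binary_cross_entropy` and the clipping constant `eps` are assumptions; the clipping only guards against taking the log of 0.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one sample; y_hat is clipped away from 0 and 1."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

print(round(binary_cross_entropy(0, 0.5498), 4))  # 0.7981
```

Averaging this value over all $m$ samples gives the total loss $J$.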

---

## Step 6: Backpropagation

### 6.1 Output Layer Gradient

**Loss with respect to the output:**

For binary cross-entropy combined with a sigmoid, the derivative is simple:

$$
\frac{\partial L}{\partial Z^{[2]}} = \hat{Y} - Y
$$

**Derivation:**

$$
\frac{\partial L}{\partial \hat{Y}} = -\frac{Y}{\hat{Y}} + \frac{1-Y}{1-\hat{Y}}
$$

$$
\frac{\partial \hat{Y}}{\partial Z^{[2]}} = \sigma(Z^{[2]})(1 - \sigma(Z^{[2]})) = \hat{Y}(1 - \hat{Y})
$$

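**Applying the chain rule:**

$$
\frac{\partial L}{\partial Z^{[2]}} = \frac{\partial L}{\partial \hat{Y}} \times \frac{\partial \hat{Y}}{\partial Z^{[2]}} = \left( -\frac{Y}{\hat{Y}} + \frac{1-Y}{1-\hat{Y}} \right) \hat{Y}(1 - \hat{Y}) = -Y(1 - \hat{Y}) + (1 - Y)\hat{Y} = \hat{Y} - Y
$$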

**Calculate:**

$$
dZ^{[2]} = 0.5498 - 0 = 0.5498
$$
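One way to gain confidence in the shortcut $\hat{Y} - Y$ is to compare it with a numerical derivative. The sketch below uses a central difference with a step `h` of `1e-6`, both chosen here for illustration; `loss_from_z` is a hypothetical helper.

```python
import math

def loss_from_z(z, y):
    """Binary cross-entropy of sigmoid(z) against the label y."""
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

z, y, h = 0.2, 0, 1e-6
numeric = (loss_from_z(z + h, y) - loss_from_z(z - h, y)) / (2 * h)  # central difference
analytic = 1.0 / (1.0 + math.exp(-z)) - y                             # y_hat - y
print(round(numeric, 4), round(analytic, 4))  # 0.5498 0.5498
```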

### 6.2 Gradients for $W^{[2]}$ and $b^{[2]}$

$$
\frac{\partial L}{\partial W^{[2]}} = dZ^{[2]} \cdot (A^{[1]})^T = 0.5498 \times \begin{bmatrix}
0.2 & 0.0
\end{bmatrix} = \begin{bmatrix}
0.1100 & 0.0
\end{bmatrix}
$$

$$
\frac{\partial L}{\partial b^{[2]}} = dZ^{[2]} = 0.5498
$$

### 6.3 Hidden Layer Gradient

$$
dA^{[1]} = (W^{[2]})^T \cdot dZ^{[2]} = \begin{bmatrix}
0.5 \\
0.7
\end{bmatrix} \times 0.5498 = \begin{bmatrix}
0.2749 \\
0.3849
\end{bmatrix}
$$

$$
dZ^{[1]} = dA^{[1]} \odot \text{ReLU}'(Z^{[1]}) = \begin{bmatrix}
0.2749 \\
0.3849
\end{bmatrix} \odot \begin{bmatrix}
1 \\
0
\end{bmatrix} = \begin{bmatrix}
0.2749 \\
0.0
\end{bmatrix}
$$

### 6.4 Gradients for $W^{[1]}$ and $b^{[1]}$

$$
\frac{\partial L}{\partial W^{[1]}} = dZ^{[1]} \cdot X^T = \begin{bmatrix}
0.2749 \\
0.0
\end{bmatrix} \times [0.0] = \begin{bmatrix}
0.0 \\
0.0
\end{bmatrix}
$$

$$
\frac{\partial L}{\partial b^{[1]}} = dZ^{[1]} = \begin{bmatrix}
0.2749 \\
0.0
\end{bmatrix}
$$

---

## Step 7: Weight Update

**Learning rate $\alpha = 0.1$:**

$$
W^{[2]}_{new} = \begin{bmatrix}
0.5 & 0.7
\end{bmatrix} - 0.1 \times \begin{bmatrix}
0.1100 & 0.0
\end{bmatrix} = \begin{bmatrix}
0.489 & 0.7
\end{bmatrix}
$$

$$
b^{[2]}_{new} = 0.1 - 0.1 \times 0.5498 = 0.04502
$$

$$
W^{[1]}_{new} = \begin{bmatrix}
0.6 \\
-0.4
\end{bmatrix} - 0.1 \times \begin{bmatrix}
0.0 \\
0.0
\end{bmatrix} = \begin{bmatrix}
0.6 \\
-0.4
\end{bmatrix}
$$

$$
b^{[1]}_{new} = \begin{bmatrix}
0.2 \\
-0.1
\end{bmatrix} - 0.1 \times \begin{bmatrix}
0.2749 \\
0.0
\end{bmatrix} = \begin{bmatrix}
0.1725 \\
-0.1
\end{bmatrix}
$$
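Steps 4 to 7 for the classifier can be combined into one function that performs a single training step. The name `train_step` and the flat-list weight layout are assumptions for illustration.

```python
import math

def train_step(x, y, W1, b1, W2, b2, lr):
    """One gradient-descent step of the 1-2-1 classifier on a single sample."""
    # Forward pass
    Z1 = [w * x + b for w, b in zip(W1, b1)]
    A1 = [max(0.0, z) for z in Z1]
    y_hat = 1.0 / (1.0 + math.exp(-(sum(w * a for w, a in zip(W2, A1)) + b2)))

    # Backward pass
    dZ2 = y_hat - y                                             # sigmoid + cross-entropy shortcut
    dZ1 = [w * dZ2 if z > 0 else 0.0 for w, z in zip(W2, Z1)]   # back through ReLU

    # Parameter update
    W2 = [w - lr * dZ2 * a for w, a in zip(W2, A1)]
    b2 = b2 - lr * dZ2
    W1 = [w - lr * dz * x for w, dz in zip(W1, dZ1)]
    b1 = [b - lr * dz for b, dz in zip(b1, dZ1)]
    return W1, b1, W2, b2

W1, b1, W2, b2 = train_step(0.0, 0, [0.6, -0.4], [0.2, -0.1], [0.5, 0.7], 0.1, lr=0.1)
print([round(w, 3) for w in W2], round(b2, 5), [round(b, 4) for b in b1])
# [0.489, 0.7] 0.04502 [0.1725, -0.1]
```

Repeating this step over all five samples for many epochs is the classification counterpart of the regression training loop.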

---

## Step 8: Key Differences - Regression vs Classification

| Aspect | Regression | Classification |
|--------|-----------|----------------|
| **Output Activation** | Linear (no activation) | Sigmoid (binary) / Softmax (multi-class) |
| **Loss Function** | Mean Squared Error (MSE) | Binary Cross-Entropy / Categorical Cross-Entropy |
| **Output Range** | Any real number | 0 to 1 (probability) |
| **Prediction** | Continuous value | Class label (0 or 1) |
| **Example** | House price, temperature | Pass/Fail, Cat/Dog |
| **Evaluation Metric** | MAE, RMSE, R² | Accuracy, Precision, Recall, F1-score |

---

## Summary: Complete Neural Network Flow

### Chain Rule in Action

**Forward Propagation Chain:**

$$
X \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow{\text{ReLU}} A^{[1]} \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]} \xrightarrow{\text{Activation}} \hat{Y} \xrightarrow{\text{Loss}} L
$$

**Backpropagation Chain:**

$$
\frac{\partial L}{\partial W^{[2]}} = \frac{\partial L}{\partial \hat{Y}} \times \frac{\partial \hat{Y}}{\partial Z^{[2]}} \times \frac{\partial Z^{[2]}}{\partial W^{[2]}}
$$

$$
\frac{\partial L}{\partial W^{[1]}} = \frac{\partial L}{\partial \hat{Y}} \times \frac{\partial \hat{Y}}{\partial Z^{[2]}} \times \frac{\partial Z^{[2]}}{\partial A^{[1]}} \times \frac{\partial A^{[1]}}{\partial Z^{[1]}} \times \frac{\partial Z^{[1]}}{\partial W^{[1]}}
$$

### Key Takeaways

1. **Input Vectorization:** Convert the data into matrix form
2. **Weight Initialization:** Use small random values
3. **Forward Propagation:** Multiply and activate, layer by layer
4. **Activation Functions:** Essential for adding non-linearity
5. **Loss Calculation:** Compare the prediction with the actual value
6. **Backpropagation:** Use the chain rule to compute the gradients
7. **Optimization:** Use the gradients to update the weights
8. **Iteration:** Train for multiple epochs to improve the model

### Mathematical Foundation

Everything in neural networks is built on **linear algebra** (matrices) and **calculus** (derivatives). The chain rule is the heart of backpropagation!

---

