# Part A: Manual SGD 

This notebook demonstrates the mechanics of stochastic gradient descent by manually computing the forward pass, loss, gradient, and parameter updates for a simple linear regression model.

**Model:**
\[
\hat{y} = x \cdot w
\]

The first three samples of the Swedish Auto Insurance dataset are used.
No preprocessing is applied, so that the raw gradient behavior can be observed.


In [12]:
import pandas as pd

data = pd.read_csv("data/Swedish_Auto_Insurance_dataset.csv")
data = data.iloc[:3]
print(data, "\nX is Input Features and Y is Target Feature")

     X      Y
0  108  392.5
1   19   46.2
2   13   15.7 
X is Input Features and Y is Target Feature


## Hyperparameter Selection

- Initial weight \( w_0 = 0.5 \)
- Learning rate \( \alpha = 0.0001 \)

A small learning rate is chosen because the input values are relatively large and the gradient scales with x, so a larger rate would produce very large weight updates.
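To make the effect of input scale concrete, the sketch below evaluates the gradient \( 2x(\hat{y} - t) \) (derived later in this notebook) at the initial weight for each of the three samples. The `samples` list and the variable names are illustrative assumptions, not part of the original notebook.

```python
# (x, t) pairs copied from the three rows printed above
samples = [(108, 392.5), (19, 46.2), (13, 15.7)]
w0 = 0.5  # initial weight

# Gradient of (t - x*w)^2 with respect to w, evaluated at w0
initial_grads = [2 * x * (x * w0 - t) for x, t in samples]

for (x, t), g in zip(samples, initial_grads):
    print(f"x={x:>3}  dL/dw at w0 = {g:.1f}")
```

The gradient for x = 108 is roughly 50 times larger in magnitude than the one for x = 19, which is why the step size has to be small when the inputs are left unscaled.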


## Manual SGD Computation

For each sample, the following steps are applied:

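1. Forward: \( \hat{y} = x \cdot w \)
2. Loss: \( L = (t - \hat{y})^2 \)
3. Gradient: \( \frac{\partial L}{\partial w} = 2x(\hat{y} - t) \)
4. Update: \( w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w} \)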
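The four steps above can be sketched as a single Python function. The name `sgd_step` and its signature are hypothetical, chosen only for illustration; the arithmetic follows the steps as listed.

```python
def sgd_step(x, t, w, lr):
    """One manual SGD step for the model y_hat = x * w with squared-error loss."""
    y_hat = x * w                # 1. forward pass
    loss = (t - y_hat) ** 2      # 2. squared-error loss
    grad = 2 * x * (y_hat - t)   # 3. gradient dL/dw
    w_new = w - lr * grad        # 4. parameter update
    return y_hat, loss, grad, w_new


# Example call with the first data row and the hyperparameters chosen above
y_hat, loss, grad, w_new = sgd_step(x=108, t=392.5, w=0.5, lr=0.0001)
print(y_hat, loss, grad, round(w_new, 4))
```

Returning all intermediate values, rather than only the new weight, makes it easy to compare each quantity against the hand calculation.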




### Sample 1

### Given:
- Input: \( x = 108 \)
- Target: \( t = 392.5 \)
- Weight: \( w = 0.5 \)

### Forward Pass:
\
ŷ = x · w
\
ŷ = 108 * 0.5
\
ŷ = 54


### Loss:
\
L = (t - ŷ)²
\
L = (392.5 - 54)²
\
L = 114582.25


### Gradient:
\
∂L/∂w = 2x(ŷ - t)


**We have**
| Variable | Value | Description |
|----------|-------|-------------|
| x | 108 | Input |
| t | 392.5 | Target |
| ŷ | 54 | Prediction (x · w = 108 × 0.5) |
| L | 114582.25 | Loss = (t - ŷ)² |



**Chain Rule Setup**

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}$$



**Step 1: Calculate ∂L/∂ŷ** (Using Power Rule)

$$L = (t - \hat{y})^2$$

$$\frac{\partial L}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}}(t - \hat{y})^2$$

$$= 2(t - \hat{y}) \cdot \frac{\partial}{\partial \hat{y}}(t - \hat{y})$$

$$= 2(t - \hat{y}) \cdot \left[\frac{\partial t}{\partial \hat{y}} - \frac{\partial \hat{y}}{\partial \hat{y}}\right]$$

$$= 2(t - \hat{y}) \cdot [0 - 1]$$

$$= 2(t - \hat{y})(-1)$$

$$\boxed{\frac{\partial L}{\partial \hat{y}} = -2(t - \hat{y}) = 2(\hat{y} - t)}$$



**Step 2: Calculate ∂ŷ/∂w**

$$\hat{y} = x \cdot w$$

$$\frac{\partial \hat{y}}{\partial w} = \frac{\partial}{\partial w}(x \cdot w)$$

$$\boxed{\frac{\partial \hat{y}}{\partial w} = x}$$



**Step 3: Combine Using Chain Rule**

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}$$

$$= 2(\hat{y} - t) \cdot x$$

$$\boxed{\frac{\partial L}{\partial w} = 2x(\hat{y} - t)}$$
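The analytic gradient can be sanity-checked numerically with a central finite difference. This is a minimal sketch under the assumption that a step of `eps = 1e-4` is adequate here; the helper names `loss`, `analytic_grad`, and `numeric_grad` are illustrative.

```python
def loss(w, x, t):
    """Squared-error loss for the model y_hat = x * w."""
    return (t - x * w) ** 2


def analytic_grad(w, x, t):
    """Gradient from the chain-rule derivation: 2x(y_hat - t)."""
    return 2 * x * (x * w - t)


def numeric_grad(w, x, t, eps=1e-4):
    """Central finite-difference approximation of dL/dw."""
    return (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)


# Sample 1 values
x, t, w = 108, 392.5, 0.5
g_analytic = analytic_grad(w, x, t)
g_numeric = numeric_grad(w, x, t)
print(g_analytic, g_numeric)
```

Because the loss is quadratic in w, the central difference agrees with the analytic gradient up to floating-point rounding.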


**Step 4: Numerical Calculation**

$$\frac{\partial L}{\partial w} = 2(108)(54 - 392.5)$$

$$= (216)(-338.5)$$

$$\boxed{\frac{\partial L}{\partial w} = -73116}$$




### Update: wnew = wold - α · (∂L/∂w)
\
wnew = 0.5 - 0.0001 · (-73116)
\
wnew = 0.5 + 7.3116
\
wnew = 7.8116


**Explanation**

- The gradient was negative (-73116), so the loss decreases as the weight increases.
- The update added 7.3116, a large step caused by the large gradient magnitude.
- The weight increased from 0.5 to 7.8116. This makes sense: ŷ = 54 was too small, so a larger w is needed to reach t = 392.5.

**Verification (Optional)**

The new prediction would be:
$$\hat{y}_{new} = x \cdot w_{new} = 108 \times 7.8116 = 843.6528$$

This overshoots the target (392.5) by 451.1528, which is more than the original prediction fell short (338.5). The update is mathematically correct, but the learning rate is too large for an input of this size.
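The size of the overshoot can be predicted directly. Substituting the update rule into \( \hat{y} = x \cdot w \) gives \( \hat{y}_{new} - t = (1 - 2\alpha x^2)(\hat{y} - t) \) for the sample just used, so the error on that sample changes sign when \( 2\alpha x^2 > 1 \) and grows in magnitude when \( 2\alpha x^2 > 2 \). The sketch below evaluates this factor for the three inputs; the helper name `error_factor` is an assumption for illustration.

```python
def error_factor(x, lr):
    """Factor by which one SGD step scales the error (y_hat - t) on the sample used."""
    return 1 - 2 * lr * x ** 2


lr = 0.0001
for x in (108, 19, 13):
    print(f"x={x:>3}  error factor = {error_factor(x, lr):.4f}")

# Cross-check against the Sample 1 numbers
x, t, w = 108, 392.5, 0.5
err_before = x * w - t                 # 54 - 392.5
w_new = w - lr * 2 * x * err_before    # SGD update
err_after = x * w_new - t              # error after the update
print(err_before, round(err_after, 4), round(err_after / err_before, 4))
```

For x = 108 the factor has magnitude greater than 1, so the step moves the prediction past the target and further away than it started. For x = 19 and x = 13 the factor lies between 0 and 1, so those updates shrink the error on their own sample without overshooting.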

### Sample 2

### Given:
- Input: \( x = 19 \)
- Target: \( t = 46.2 \)
- \( w = 7.8116 \) (carried from Sample 1)

### Forward Pass:
\
ŷ = x · w
\
ŷ = 19 * 7.8116
\
ŷ = 148.4204

### Loss:
\
L = (t - ŷ)²
\
L = (46.2 - 148.4204)²
\
L = (-102.2204)²
\
L = 10449.0102

### Gradient:
\
∂L/∂w = 2x(ŷ - t)

**We have**
| Variable | Value | Description |
|----------|-------|-------------|
| x | 19 | Input |
| t | 46.2 | Target |
| ŷ | 148.4204 | Prediction (x · w = 19 × 7.8116) |
| L | 10449.0102 | Loss = (t - ŷ)² |

**Chain Rule Setup**

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}$$

**Step 1: Calculate ∂L/∂ŷ** (Using Power Rule)

$$L = (t - \hat{y})^2$$

$$\frac{\partial L}{\partial \hat{y}} = 2(t - \hat{y}) \cdot (-1) = 2(\hat{y} - t)$$

$$\boxed{\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - t)}$$

**Step 2: Calculate ∂ŷ/∂w**

$$\hat{y} = x \cdot w$$

$$\boxed{\frac{\partial \hat{y}}{\partial w} = x}$$

**Step 3: Combine Using Chain Rule**

$$\frac{\partial L}{\partial w} = 2(\hat{y} - t) \cdot x$$

$$\boxed{\frac{\partial L}{\partial w} = 2x(\hat{y} - t)}$$

**Step 4: Numerical Calculation**

$$\frac{\partial L}{\partial w} = 2(19)(148.4204 - 46.2)$$

$$= (38)(102.2204)$$

$$\boxed{\frac{\partial L}{\partial w} = 3884.3752}$$

### Update: wnew = wold - α · (∂L/∂w)
\
wnew = 7.8116 - 0.0001 · (3884.3752)
\
wnew = 7.8116 - 0.3884
\
wnew = 7.4232

**Explanation**

- The gradient was positive (3884.3752), so the loss decreases as the weight decreases.
- The update subtracted 0.3884.
- The weight decreased from 7.8116 to 7.4232. This makes sense: ŷ = 148.42 was too large, so a smaller w is needed to reach t = 46.2.

---

### Sample 3

### Given:
- Input: \( x = 13 \)
- Target: \( t = 15.7 \)
- \( w = 7.4232 \) (carried from Sample 2)

### Forward Pass:
\
ŷ = x · w
\
ŷ = 13 * 7.4232
\
ŷ = 96.5016

### Loss:
\
L = (t - ŷ)²
\
L = (15.7 - 96.5016)²
\
L = (-80.8016)²
\
L = 6528.8986

### Gradient:
\
∂L/∂w = 2x(ŷ - t)

**We have**
| Variable | Value | Description |
|----------|-------|-------------|
| x | 13 | Input |
| t | 15.7 | Target |
| ŷ | 96.5016 | Prediction (x · w = 13 × 7.4232) |
| L | 6528.8986 | Loss = (t - ŷ)² |

**Chain Rule Setup**

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}$$

**Step 1: Calculate ∂L/∂ŷ** (Using Power Rule)

$$L = (t - \hat{y})^2$$

$$\frac{\partial L}{\partial \hat{y}} = 2(t - \hat{y}) \cdot (-1) = 2(\hat{y} - t)$$

$$\boxed{\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - t)}$$

**Step 2: Calculate ∂ŷ/∂w**

$$\hat{y} = x \cdot w$$

$$\boxed{\frac{\partial \hat{y}}{\partial w} = x}$$

**Step 3: Combine Using Chain Rule**

$$\frac{\partial L}{\partial w} = 2(\hat{y} - t) \cdot x$$

$$\boxed{\frac{\partial L}{\partial w} = 2x(\hat{y} - t)}$$

**Step 4: Numerical Calculation**

$$\frac{\partial L}{\partial w} = 2(13)(96.5016 - 15.7)$$

$$= (26)(80.8016)$$

$$\boxed{\frac{\partial L}{\partial w} = 2100.8416}$$

### Update: wnew = wold - α · (∂L/∂w)
\
wnew = 7.4232 - 0.0001 · (2100.8416)
\
wnew = 7.4232 - 0.2101
\
wnew = 7.2131

**Explanation**

- The gradient was positive (2100.8416), so the loss decreases as the weight decreases.
- The update subtracted 0.2101.
- The weight decreased from 7.4232 to 7.2131. This makes sense: ŷ = 96.50 was too large, so a smaller w is needed to reach t = 15.7.

---

## Summary Table

| Sample | w (old) | x | ŷ | ∂L/∂w | w (new) |
|--------|---------|---|---|-------|---------|
| 1 | 0.5 | 108 | 54 | -73116 | 7.8116 |
| 2 | 7.8116 | 19 | 148.4204 | 3884.3752 | 7.4232 |
| 3 | 7.4232 | 13 | 96.5016 | 2100.8416 | 7.2131 |
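The whole table can be reproduced with a short loop over the three samples. This is a sketch, not the notebook's own code: the `samples` list is typed in by hand rather than read from the CSV, and `history` is an illustrative name. The loop carries the unrounded weight forward, so the later digits of ŷ and the gradient for Sample 3 can differ slightly from the hand-rounded values, while the weights agree to four decimals.

```python
# (x, t) pairs for the first three rows of the dataset
samples = [(108, 392.5), (19, 46.2), (13, 15.7)]
w = 0.5       # initial weight
lr = 0.0001   # learning rate

history = []
for i, (x, t) in enumerate(samples, start=1):
    y_hat = x * w                 # forward pass
    grad = 2 * x * (y_hat - t)    # dL/dw
    w_new = w - lr * grad         # update
    history.append((i, w, x, y_hat, grad, w_new))
    w = w_new                     # carry the weight to the next sample

print("Sample | w (old) | x | y_hat | dL/dw | w (new)")
for i, w_old, x, y_hat, grad, w_new in history:
    print(f"{i} | {w_old:.4f} | {x} | {y_hat:.4f} | {grad:.4f} | {w_new:.4f}")
```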