Build a Simple Neural Network from Scratch with PyTorch

<div style="text-align: center;">
    <img src=".\images\gradeint_descent.PNG" alt="Computational Graph" width="700">
</div>

## Gradient Descent Review
- Let's assume we have three data points $(x_1,y_1)$, $(x_2,y_2)$ and $(x_3,y_3)$ in the $xy$ coordinate system.
- The goal is to find the relation between $x$ and $y$.
- Here we can see that the points lie on a line at $45^\circ$, but how do we find the solution when the points follow a more complex pattern?
- We start from a hypothesis, as any ML algorithm does. Here it is $h(x)=\theta_1 x$: the output is connected to the input $x$ by multiplying it by a parameter (weight) $\theta_1$.
- Then we define a loss function that measures how good or bad the parameter $\theta_1$ is for the dataset we have. Here we use a squared-error loss, the basis of the mean squared error (MSE): take the difference between the prediction $\theta_1 x_i$ and the actual value $y_i$, square it, and sum over the data points. The factor $\frac{1}{2}$ is a convenience that cancels when we differentiate.
$$J(\theta_1)=\frac{1}{2} \sum_{i=1}^{3}(\theta_1 x_i - y_i)^2$$
- Now the objective is to find the value of $\theta_1$ that leads to the minimum value of the cost function $J(\theta_1)$:

$$\arg\min_{\theta_1} (J(\theta_1))$$

- $\arg\min$ means we are not looking for the minimum value of the cost function itself, but for the input value $\theta_1$ that produces that minimum.
- If we plot the cost function $J(\theta_1)$ against $\theta_1$, we can at least see visually where the function is minimal, which is at $\theta_1 = 1$. This is what we expected, because the slope of a $45^\circ$ line is 1. But what if the cost function is more complex?
- To find the minimum we look at the derivative of the function. The derivative is the instantaneous rate of change, i.e. the slope of the tangent at a point. We start from a random initial point, compute the derivative there, and check whether it is positive or negative. Depending on the sign, we decide whether to move left or right.
- If the derivative is negative, the function decreases as $\theta_1$ increases, so we have to move to the right to get a smaller value of the function.
- Repeat this process from point 2, again moving towards the right.
- When the gradient is approximately zero, we stop.

## General formula to update the values
$$\theta_1 \leftarrow \theta_1 - \alpha J'(\theta_1)$$
- where $\alpha$ is the learning rate, $\theta_1$ is the current value, and $J'(\theta_1)$ is the derivative at that point.
- The negative sign is there because we want to minimize the function, so we step against the derivative. When the derivative is negative, the update increases $\theta_1$, which moves us to the right.
- Let's calculate the derivative $J'(\theta_1)$ of the objective function with respect to the model parameters (the weights), not the input or output, using the chain rule. The input and output are given and fixed, so there is nothing to optimize there.
  $$ J'(\theta_1) =\frac{1}{2}\sum_{i=1}^{n} 2(\theta_1 x_i - y_i)x_i = \sum_{i=1}^{n} (\theta_1 x_i - y_i)x_i$$
- $\theta_1$ is the model parameter that we want to optimize. In a neural network these are the weights and biases.
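
As a minimal sketch of the update rule, assume the three points $(1,1)$, $(2,2)$, $(3,3)$ and an illustrative learning rate of $0.05$. The helper name `dJ` is hypothetical and only used here. The code evaluates the derivative on both sides of the minimum and then applies the update a few times starting from $\theta_1 = 0$.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def dJ(theta):
    """Derivative of J(theta) = 1/2 * sum((theta * x - y)**2) w.r.t. theta."""
    return np.sum((theta * x - y) * x)

print("derivative at theta=0:", dJ(0.0))  # negative, so we move right
print("derivative at theta=2:", dJ(2.0))  # positive, so we move left

alpha = 0.05   # illustrative learning rate (an assumption)
theta = 0.0    # initial guess
history = [theta]
for _ in range(5):
    theta = theta - alpha * dJ(theta)  # theta <- theta - alpha * J'(theta)
    history.append(theta)

print("theta after each update:", history)
```

With these values every update moves $\theta_1$ to the right, and the steps shrink as the derivative approaches zero near $\theta_1 = 1$.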

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Given data
x = np.array([1, 2, 3])
y_true = np.array([1, 2, 3])

plt.figure(figsize=(8, 5))
plt.plot(x,y_true, 'ok')
plt.title('n=3 training data points')
plt.show()

# Define MSE function
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

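# Sweep a constant prediction over a range of values for visualization
y_pred_range = np.linspace(-10, 10, 200)  # Range of predicted values
# dtype=float is required: y_true is an integer array, so full_like would otherwise truncate yp
mse_values = [mse(y_true, np.full_like(y_true, yp, dtype=float)) for yp in y_pred_range]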

# Plot MSE vs. predicted values
plt.figure(figsize=(8, 5))
plt.plot(y_pred_range, mse_values, label="MSE Curve", color="b")
plt.xlabel("Predicted Value")
plt.ylabel("Mean Squared Error (MSE)")
plt.title("MSE vs. Predicted Value")
plt.legend()
plt.grid()
plt.show()


# Gradient Descent to Minimize the Cost Function (MSE)

**Gradient Descent** is an optimization algorithm used to minimize a given cost function by iteratively adjusting parameters in the direction of the steepest descent (negative gradient). For **Mean Squared Error (MSE)**, the goal is to find the optimal $\theta$ that minimizes:

$$
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

where:

- $J(\theta)$ is the **cost function (MSE)**.
- $y_i$ is the **true value**.
- $\hat{y}_i$ is the **predicted value**.
- $n$ is the **number of data points**.

---

## Gradient Descent Update Rule

The update rule for gradient descent is:

$$
\theta = \theta - \alpha \frac{dJ}{d\theta}
$$

where:

- $\alpha$ (learning rate) controls the **step size**.
- $\frac{dJ}{d\theta}$ is the **gradient of the cost function**.

For **linear regression** with $\hat{y}_i = \theta x_i$, the gradient of MSE with respect to $\theta$ is:

$$
\frac{dJ}{d\theta} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) x_i
$$
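
The gradient can be worked out term by term with the chain rule, assuming the one-parameter model $\hat{y}_i = \theta x_i$. Each squared term contributes an outer derivative $2(y_i - \theta x_i)$ and an inner derivative $-x_i$:

$$
\frac{dJ}{d\theta}
= \frac{1}{n} \sum_{i=1}^{n} 2\,(y_i - \theta x_i)\,\frac{d}{d\theta}(y_i - \theta x_i)
= \frac{1}{n} \sum_{i=1}^{n} 2\,(y_i - \theta x_i)(-x_i)
= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)\,x_i
$$

The factor $x_i$ comes from the inner derivative, so samples with a larger input contribute more to the gradient.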


# Gradient Descent Algorithm

## **Algorithm Steps for Gradient Descent**
1. **Initialize Parameters:**
   - Choose an initial guess for **$\theta$ (weights/parameters)**.
   - Set the **learning rate** $\alpha$.
   - Define the number of iterations (or stopping criteria).

2. **Compute Predictions:**
   - Calculate the predicted values **$\hat{y}$** using the current parameters:
     $$
     \hat{y} = \theta x
     $$

3. **Compute the Cost Function:**
   - Evaluate the Mean Squared Error (MSE) cost function:
     $$
     J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
     $$
   - Store the cost function values (optional, for visualization).

4. **Compute the Gradient:**
   - Calculate the derivative of the cost function with respect to **$\theta$**:
     $$
     \frac{dJ}{d\theta} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) x_i
     $$
     $$

5. **Update the Parameters:**
   - Update **$\theta$** using the gradient descent formula:
     $$
     \theta = \theta - \alpha \frac{dJ}{d\theta}
     $$

6. **Repeat Steps 2 to 5:**
   - Continue iterating until:
     - A **convergence criterion** is met (e.g., the change in the cost function is minimal).
     - The **maximum number of iterations** is reached.

7. **Return the Optimal Parameters:**
   - The final value of **$\theta$** is the one that minimizes the cost function.

---

## **Notes:**
- The **learning rate** $\alpha$ should be chosen carefully:
  - Too large: The algorithm may **diverge**.
  - Too small: The algorithm may **converge slowly**.
- Gradient Descent can be **Batch Gradient Descent**, **Stochastic Gradient Descent (SGD)**, or **Mini-batch Gradient Descent** depending on the amount of data used per update.
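
The effect of the learning rate can be sketched on the three-point dataset $(1,1)$, $(2,2)$, $(3,3)$. The helper `run_gd` and the three values of $\alpha$ are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def run_gd(alpha, steps=20, theta=0.0):
    """Run `steps` gradient descent updates on the MSE of y_hat = theta * x."""
    for _ in range(steps):
        gradient = (2 / len(x)) * np.sum((theta * x - y) * x)
        theta -= alpha * gradient
    return theta

# Too small, reasonable, and too large step sizes for this dataset
for alpha in (0.001, 0.1, 0.3):
    print(f"alpha={alpha}: theta after 20 steps = {run_gd(alpha):.4f}")
```

For this dataset the gradient equals $\frac{28}{3}(\theta - 1)$, so each step multiplies the distance to the optimum by $1 - \frac{28}{3}\alpha$. Learning rates above roughly $0.21$ make that factor larger than 1 in magnitude, and the iterates move away from $\theta = 1$ instead of towards it.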

---

This algorithm is widely used in **machine learning**, particularly in **linear regression**, **logistic regression**, and **neural networks**.




In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Given dataset
x = np.array([1, 2, 3])
y_true = np.array([1, 2, 3])

# Initialize parameters
theta = 0.0  # Initial guess
alpha = 0.1  # Learning rate
iterations = 100  # Number of gradient descent steps

# Store cost values for visualization
cost_history = []

# Gradient Descent Algorithm
for i in range(iterations):
    y_pred = theta * x  # Predicted value
    error = y_pred - y_true  # Error
    cost = np.mean(error ** 2)  # Compute MSE cost
    cost_history.append(cost)  # Store cost function value
    
    # Compute gradient (derivative of cost function)
    gradient = (2 / len(x)) * np.sum(error * x)
    
    # Update theta using gradient descent update rule
    theta -= alpha * gradient

# Final optimized theta
print("Optimized theta:", theta)

# Plot the cost function over iterations
plt.figure(figsize=(8,5))
plt.plot(range(iterations), cost_history, label="Cost Function (MSE)", color="b")
plt.xlabel("Iterations")
plt.ylabel("MSE Cost")
plt.title("Gradient Descent: Cost Function vs Iterations")
plt.legend()
plt.grid()
plt.show()


## Multi-dimensional problem
For a multivariate function we use partial derivatives instead of the ordinary derivative and collect them into the gradient vector. The update rule keeps the same form:
$$\theta \leftarrow \theta - \alpha \nabla J(\theta)$$
- $\nabla J(\theta)$ is the gradient
$$
\nabla J(\theta) = 
\begin{bmatrix}
\frac{\partial J}{\partial \theta_1} \\
\frac{\partial J}{\partial \theta_2} \\
\end{bmatrix}
$$
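
A minimal sketch of the gradient vector for two parameters, assuming the model $\hat{y} = \theta_1 x_1 + \theta_2 x_2$ with an MSE cost. The data is randomly generated, and the names `cost` and `gradient` are hypothetical helpers. The analytic gradient is compared against a central finite-difference estimate of each partial derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))        # 50 samples, 2 features (made-up data)
y = X @ np.array([2.0, -3.4])       # targets from assumed parameters

def cost(theta):
    """MSE cost J(theta) for predictions X @ theta."""
    return np.mean((X @ theta - y) ** 2)

def gradient(theta):
    """Vector [dJ/dtheta_1, dJ/dtheta_2] of partial derivatives."""
    return (2 / len(y)) * X.T @ (X @ theta - y)

theta = np.zeros(2)
eps = 1e-6
# Perturb one parameter at a time to estimate each partial derivative
numeric = np.array([
    (cost(theta + eps * e) - cost(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print("analytic gradient:", gradient(theta))
print("numeric gradient: ", numeric)
```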
### If we have a lot of samples
Then it is often impractical to use the full dataset for every update. We divide the data into batches and update the parameters once per batch.
- divide the data into batches $\rightarrow$ update the parameters $\rightarrow$ repeat (for a number of epochs)
- we have to tune the hyperparameters: the batch size and the number of epochs
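
The batch loop can be sketched in plain NumPy. The data is noiseless synthetic data, and the learning rate, batch size, and number of epochs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))      # 1000 samples, 2 features (made-up data)
y = X @ np.array([2.0, -3.4])       # noiseless targets from assumed parameters

theta = np.zeros(2)
alpha, batch_size, num_epochs = 0.05, 10, 3

for epoch in range(num_epochs):
    order = rng.permutation(len(y))             # reshuffle the samples every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]   # indices of one mini-batch
        Xb, yb = X[idx], y[idx]
        grad = (2 / len(yb)) * Xb.T @ (Xb @ theta - yb)
        theta -= alpha * grad                   # one update per mini-batch
    print(f"epoch {epoch + 1}: cost = {np.mean((X @ theta - y) ** 2):.6f}")

print("estimated theta:", theta)
```

One epoch here performs 100 parameter updates (1000 samples in batches of 10), compared with a single update per pass when the full dataset is used at once.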

---
# Create a Simple Neural Network
---
## Step 0: Generate Synthetic Data
Let's assume that the output $y$ is a linear combination of the inputs $x_1$ and $x_2$ with weights $w_1$ and $w_2$, plus a bias $b$. Previously we called the weights $\theta_1$ and $\theta_2$.
$$ y = w_1 x_1 + w_2 x_2 + b $$
- This is the general framework for linear regression.
- Using linear algebra, we can write this as the inner product of the weights and the inputs (the inputs are the features):
$$
y = \left\langle
\begin{bmatrix}
w_1 \\
w_2 \\
\end{bmatrix},
\begin{bmatrix}
x_1 \\
x_2 \\
\end{bmatrix}
\right\rangle + b =
\begin{bmatrix}
x_1 & x_2 \\
\end{bmatrix}
\begin{bmatrix}
w_1 \\
w_2 \\
\end{bmatrix} + b
$$

For a complete dataset we stack the feature vectors $[x_1 \; x_2]$ into a matrix called $X$, where each row corresponds to a sample and each column corresponds to a feature:
$$
y =
\begin{bmatrix}
\vdots \\
X \\
\vdots \\
\end{bmatrix}
\begin{bmatrix}
w_1 \\
w_2 \\
\end{bmatrix} +b
$$
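
A small sketch of the stacked form, assuming three made-up samples. It checks that one matrix-vector product over $X$ gives the same outputs as taking the inner product row by row.

```python
import torch

X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])     # 3 samples (rows), 2 features (columns)
w = torch.tensor([2.0, -3.4])
b = 4.2

y_matrix = X @ w + b               # all samples at once; the scalar b is broadcast
y_rows = torch.stack([torch.dot(row, w) + b for row in X])  # one sample at a time

print(y_matrix)
print(torch.allclose(y_matrix, y_rows))
```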

In [None]:
import torch
from torch.utils import data

def synthetic_data(w, b, num_examples):
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    # b is a scalar, so broadcasting adds it to every element of the matrix-vector product
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)  # small Gaussian noise
    return X, y.reshape((-1, 1))


true_w = torch.tensor([2, -3.4])
true_b = 4.2
# features are X and labels are y
features, labels = synthetic_data(true_w, true_b, 1000)
print(features.shape, labels.shape)  # for each sample we have one output value y

## Step 1: Reading the dataset
We use `torch.utils.data` to split the dataset into mini-batches.

In [None]:
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a PyTorch data iterator."""
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

batch_size = 10  # 1000 samples with a batch size of 10 gives 100 mini-batches per epoch
# This gives us an iterable object. Each iteration yields 10 rows of X and the corresponding y
data_iter = load_array((features, labels), batch_size)

## Step 2: Defining the model
<div style="text-align: center;">
    <img src=".\images\fcnn_simple.PNG" alt="Computational Graph" width="350">
</div>



In [None]:
# `nn` is an abbreviation for neural networks
from torch import nn

# The fully connected layer is defined in the Linear class
net = nn.Sequential(nn.Linear(2, 1))  # 2 input features and 1 output neuron
# Now we have access to the weights and bias; net[0] is the first layer.
# The weight has two elements and the bias has one. requires_grad is already True, which autograd needs
print(net[0].weight, net[0].bias)

# We can also initialize w1, w2 and b ourselves
net[0].weight.data.normal_(0, 0.01)  # normally distributed values with mean 0 and standard deviation 0.01
net[0].bias.data.fill_(0)

# There is only one layer, so accessing net[1].weight would raise an error

## Step 3: Defining the cost function (loss) and optimizer

- `net.parameters()` returns all the parameters, i.e. the weights and bias, of `net`, the neural network we defined earlier. The optimizer needs them to know what to update.
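
To connect the optimizer to the update rule from the review, here is a small sketch, assuming plain SGD without momentum or weight decay and random made-up data. It checks that one `optimizer.step()` matches the manual update $\theta \leftarrow \theta - \alpha \nabla J(\theta)$ applied to each parameter.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(10, 2)   # one made-up mini-batch of 10 samples
y = torch.randn(10, 1)

net = nn.Linear(2, 1)
loss = nn.MSELoss()
lr = 0.03
optimizer = torch.optim.SGD(net.parameters(), lr=lr)

optimizer.zero_grad()
loss(net(X), y).backward()   # fills p.grad for every parameter

# Manual update, computed before the optimizer changes the parameters
expected = [p.detach() - lr * p.grad for p in net.parameters()]

optimizer.step()             # in-place update of the weights and bias

for p, e in zip(net.parameters(), expected):
    print(torch.allclose(p.detach(), e))
```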



In [None]:
loss = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.03)

## Step 4: Training

In [None]:
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        # X, y is one mini-batch, loaded by the data loader
        l = loss(net(X), y)
        optimizer.zero_grad()  # clear the gradients from the previous step
        l.backward()  # compute the gradients of the loss
        optimizer.step()  # apply the SGD update
    l = loss(net(features), labels)  # evaluate on the full dataset (features) against the true labels
    print(f'epoch {epoch + 1}, loss {l:f}')

# At any point we can access the weight and bias
w = net[0].weight.data
print('error in estimating w:', true_w - w.reshape(true_w.shape))
b = net[0].bias.data
print('error in estimating b:', true_b - b)