In this section, we build a simple neural network from scratch.
By "from scratch," I mean that we will not use PyTorch's `torch.nn` module (which is designed to make constructing neural networks easy).
Instead, we will manually build the layers and connections using basic matrix operations.

**Why not `torch.nn`?**
- The `torch.nn` module is the most common and recommended way to build neural networks in PyTorch. It simplifies the process by providing pre-built layers and functionality, allowing you to focus on designing and training the model.
- By using `torch.nn`, you can quickly create complex models without worrying about low-level details like manually implementing matrix multiplications or tracking gradients, which is why it’s the preferred approach in most real-world applications.

**Why build from scratch?**
- Seeing how a neural network works under the hood makes it clear how each layer, activation function, and weight update interacts with the others.
- This exercise will give us a solid foundation to understand what’s really happening behind `torch.nn`, and it will also give us the ability to design advanced neural network architectures that may not be readily available in `torch.nn`.

If you’re focused on getting up to speed with PyTorch quickly, feel free to skip the beginning of this tutorial and jump straight to the section on building a neural network with torch.nn, starting after the line `import torch.nn as nn`. (And don’t forget to check the last part of this tutorial, *Convention of the Input/Output Tensor Shapes (IMPORTANT)*).

### **Nothing is really complicated**
Each layer of a fully connected neural network (such a layer is also known as a dense layer) follows the mathematical expression:

$$a_i^{(l)} = \sum_{j=1}^{n_{l-1}} w_{ij}^{(l)} a_j^{(l-1)} + b_i^{(l)}$$

where:
- $a_i^{(l)}$ represents the activation/output of the $i$-th neuron in layer $l$,
- $w_{ij}^{(l)}$ is the weight connecting the $j$-th neuron in layer $(l-1)$ to the $i$-th neuron in layer $l$,
- $b_i^{(l)}$ is the bias term for the $i$-th neuron in layer $l$,
- and $a_j^{(l-1)}$ is the activation/output of the $j$-th neuron in the previous layer $(l-1)$.

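You can see that the operations involved are just matrix multiplication and addition: multiply the weight matrix $W^{(l)}$ by the output vector $a^{(l-1)}$, then add the bias vector $b^{(l)}$, i.e. $a^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$. The code below stores each sample as a row, so it writes the same operation as `x @ W + b`, where `W` has shape (inputs, outputs) and is the transpose of $W^{(l)}$ above.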

Note that the connections between layers here are linear: no activation function has been applied yet. (We will talk about activation functions later.)
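
As a quick sanity check of the formula above, the following sketch compares the element-wise sum with the matrix form, and with the row-vector form `x @ W.T + b`. The layer sizes (3 inputs, 2 outputs) and the variable names are arbitrary choices made for illustration only.

```python
import torch

torch.manual_seed(0)
n_prev, n_curr = 3, 2             # neurons in layer l-1 and in layer l (illustrative sizes)
W = torch.randn(n_curr, n_prev)   # W[i, j] plays the role of w_ij
b = torch.randn(n_curr)           # b[i] plays the role of b_i
a_prev = torch.randn(n_prev)      # activations of the previous layer

# Element-wise version: a_i = sum_j w_ij * a_j + b_i
a_loop = torch.zeros(n_curr)
for i in range(n_curr):
    for j in range(n_prev):
        a_loop[i] += W[i, j] * a_prev[j]
    a_loop[i] += b[i]

# Matrix version: a = W a_prev + b
a_mat = W @ a_prev + b

# Row-vector version: one sample stored as a row, multiplied by W transposed
a_row = a_prev.unsqueeze(0) @ W.T + b   # shape (1, n_curr)

print(torch.allclose(a_loop, a_mat, atol=1e-6))
print(torch.allclose(a_mat, a_row.squeeze(0), atol=1e-6))
```

All three forms describe the same computation; they differ only in whether a sample is treated as a column or as a row.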

In [None]:
import torch
import numpy as np

# Creating dataset in PyTorch tensors
X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32)  # shape (4, 1)
Y = torch.tensor([[2, 3], [4, 6], [6, 9], [8, 12]], dtype=torch.float32)  # shape (4, 2)
# IMPORTANT: Why do we change the shape of tensors?
# --> go to the end of this tutorial: Convention of the Input/Output Tensor Shapes

"""
We shall use object-oriented programming (OOP) to construct neural networks using a class.
(Though using simple functions is also possible.)

In OOP, a class acts as a blueprint for creating objects,
allowing us to define attributes and methods that operate on these attributes.
(It’s a fundamental concept in Python programming.)
"""

class NN:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights and biases manually
        self.mean = 0.
        self.std_dev = 0.1
        self.w1 = torch.normal(mean=self.mean, std=self.std_dev, size=(input_size, hidden_size), requires_grad=True)
        self.b1 = torch.normal(mean=self.mean, std=self.std_dev, size=(hidden_size,), requires_grad=True)
        self.w2 = torch.normal(mean=self.mean, std=self.std_dev, size=(hidden_size, hidden_size), requires_grad=True)
        self.b2 = torch.normal(mean=self.mean, std=self.std_dev, size=(hidden_size,), requires_grad=True)
        self.w_out = torch.normal(mean=self.mean, std=self.std_dev, size=(hidden_size, output_size), requires_grad=True)
        self.b_out = torch.normal(mean=self.mean, std=self.std_dev, size=(output_size,), requires_grad=True)

    # forward ver1
    def forward(self, x):
        # Input layer (input to hidden1)
        z1 = x @ self.w1 + self.b1  # matrix multiplication + bias
        # First hidden layer to second hidden layer (hidden1 to hidden2)
        z2 = z1 @ self.w2 + self.b2  # matrix multiplication + bias
        # Second layer to output (hidden2 to output)
        z_out = z2 @ self.w_out + self.b_out  # matrix multiplication + bias
        return z_out

    """
    Instead of "@" and "+", you can use `torch.matmul()` and `torch.add()` as well.
    They are the same. (see forward ver2 below)
    """
    # # forward ver2
    # def forward(self, x):
    #     # Input layer (input to hidden1)
    #     z1 = torch.add(torch.matmul(x, self.w1), self.b1)
    #     # First hidden layer to second hidden layer (hidden1 to hidden2)
    #     z2 = torch.add(torch.matmul(z1, self.w2), self.b2)
    #     # Second layer to output (hidden2 to output)
    #     z_out = torch.add(torch.matmul(z2, self.w_out), self.b_out)
    #     return z_out

    def parameters(self):
        # Collect all parameters for easy access (useful for optimization)
        return [self.w1, self.b1, self.w2, self.b2, self.w_out, self.b_out]

# Fix input and output size based on the shape of X and Y
model = NN(input_size=X.shape[1], hidden_size=3, output_size=Y.shape[1])
# (batch size, number of features) @ (number of features, layer1 dimension) @ (layer1 dimension, layer2 dimension) @ (layer2 dimension, output size)
# (4, 1) @ (1, 3) @ (3, 3) @ (3, 2) --> so the output will be a tensor of size (4, 2) "4 set of output data (batch size) each with 2 output features"

# Checking the matrix multiplication's shape
print(X.shape, model.w1.shape, model.w2.shape, model.b1.shape, model.b2.shape)

# Mean squared error (MSE) loss function
def loss(y, y_predicted):
    return ((y_predicted - y) ** 2).mean()

# Initial prediction
print(f'Prediction before training: model([5]) = {model.forward(torch.tensor([[5]], dtype=torch.float32))}')

# Main training loop settings
learning_rate = 0.01
n_iterations = 200
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(n_iterations):
    # Forward pass
    y_pred = model.forward(X)

    # Calculate loss
    l = loss(Y, y_pred)

    # Zero the gradients left over from the previous iteration (PyTorch accumulates them otherwise); this must run before backward()
    optimizer.zero_grad()

    # Backward pass to compute gradients
    l.backward()

    # Optimizer step
    optimizer.step()

    # Print training info
    if epoch % 10 == 0:
        print(f'Epoch {epoch + 1}, loss = {l:.8f}')

# Prediction after training
print(f'Prediction after training: model([5]) = {model.forward(torch.tensor([[5]], dtype=torch.float32))}')

torch.Size([4, 1]) torch.Size([1, 3]) torch.Size([3, 3]) torch.Size([3]) torch.Size([3])
Prediction before training: model([5]) = tensor([[ 0.0793, -0.0141]], grad_fn=<AddBackward0>)
Epoch 1, loss = 48.60699844
Epoch 11, loss = 35.05459213
Epoch 21, loss = 0.61255360
Epoch 31, loss = 0.49195898
Epoch 41, loss = 0.39488536
Epoch 51, loss = 0.31532276
Epoch 61, loss = 0.25029284
Epoch 71, loss = 0.19739500
Epoch 81, loss = 0.15463699
Epoch 91, loss = 0.12032990
Epoch 101, loss = 0.09302635
Epoch 111, loss = 0.07148015
Epoch 121, loss = 0.05462322
Epoch 131, loss = 0.04154541
Epoch 141, loss = 0.03148174
Epoch 151, loss = 0.02379571
Epoch 161, loss = 0.01796615
Epoch 171, loss = 0.01357216
Epoch 181, loss = 0.01027791
Epoch 191, loss = 0.00781928
Prediction after training: model([5]) = tensor([[ 9.8465, 14.9062]], grad_fn=<AddBackward0>)
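
To stay closer to the "from scratch" spirit, the optimizer can also be replaced by hand. For plain SGD with default settings (no momentum, no weight decay), `optimizer.step()` amounts to `p -= learning_rate * p.grad` for every parameter. The sketch below is an illustration rather than part of the original notebook: it uses a single linear layer on the same data, and the names `w`, `b`, and `lr` are chosen here for the example.

```python
import torch

torch.manual_seed(0)
X = torch.tensor([[1.], [2.], [3.], [4.]])                    # shape (4, 1)
Y = torch.tensor([[2., 3.], [4., 6.], [6., 9.], [8., 12.]])   # shape (4, 2)

# A single linear layer, kept small so the update rule stays in focus
w = torch.normal(mean=0., std=0.1, size=(1, 2), requires_grad=True)
b = torch.zeros(2, requires_grad=True)
lr = 0.01

for step in range(200):
    loss = ((X @ w + b - Y) ** 2).mean()   # MSE loss
    loss.backward()                        # fills w.grad and b.grad
    with torch.no_grad():                  # the update itself must not be tracked by autograd
        for p in (w, b):
            p -= lr * p.grad               # what SGD's step() does with default settings
            p.grad.zero_()                 # what zero_grad() does (here: reset to zeros)

print(f'final loss = {loss.item():.6f}')
```

Using `torch.optim.SGD` instead hides exactly these two lines inside `step()` and `zero_grad()`.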


In [None]:
# You can also print the model parameters simply with:
print(f'model parameters = {model.parameters()}')

model parameters = [tensor([[ 1.0811, -0.7628, -0.6782]], requires_grad=True), tensor([-0.1405,  0.3181,  0.1594], requires_grad=True), tensor([[-0.1280,  0.4199, -1.0095],
        [ 0.2909, -0.1669,  0.7717],
        [ 0.0541, -0.1634,  0.6986]], requires_grad=True), tensor([ 0.0021,  0.0882, -0.1247], requires_grad=True), tensor([[-0.0182, -0.3003],
        [ 0.1421,  0.4845],
        [-0.8439, -1.1604]], requires_grad=True), tensor([0.5553, 0.6318], requires_grad=True)]


In [None]:
# Or in a fancier way:
def print_model_parameters(model):
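    # Only four names are listed and zip() stops at the shorter sequence,
    # so w_out and b_out are not printed; add 'w_out' and 'b_out' to the list to include them.
    for name, param in zip(['w1', 'b1', 'w2', 'b2'], model.parameters()):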
        print(f'Parameter name: {name}')
        print(f'Value: \n{param.data}')
        print(f'Gradient: \n{param.grad}')
        print('---')

print_model_parameters(model)  # the gradients shown are those left by the last backward() call in the training loop

Parameter name: w1
Value: 
tensor([[ 1.0811, -0.7628, -0.6782]])
Gradient: 
tensor([[-0.0049,  0.0037,  0.0031]])
---
Parameter name: b1
Value: 
tensor([-0.1405,  0.3181,  0.1594])
Gradient: 
tensor([ 0.0668, -0.0500, -0.0439])
---
Parameter name: w2
Value: 
tensor([[-0.1280,  0.4199, -1.0095],
        [ 0.2909, -0.1669,  0.7717],
        [ 0.0541, -0.1634,  0.6986]])
Gradient: 
tensor([[ 0.0022, -0.0041,  0.0125],
        [-0.0031,  0.0065, -0.0216],
        [-0.0019,  0.0037, -0.0120]])
---
Parameter name: b2
Value: 
tensor([ 0.0021,  0.0882, -0.1247])
Gradient: 
tensor([-0.0074,  0.0163, -0.0585])
---


In [None]:
"""
Moving away from scratch
--> Building Neural Networks with `torch.nn`

From here on, we use `torch.nn` to build neural networks;
it may well become your go-to approach.
"""

import torch
import torch.nn as nn
import numpy as np

# Creating dataset in PyTorch tensors
X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32)  # shape (4, 1)
Y = torch.tensor([[2, 3], [4, 6], [6, 9], [8, 12]], dtype=torch.float32)  # shape (4, 2)

class NN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NN, self).__init__()
        # Define layers using nn.Linear (includes weights and biases automatically)
        self.fc1 = nn.Linear(input_size, hidden_size)  # First layer: input to hidden1
        self.fc2 = nn.Linear(hidden_size, hidden_size)  # Second layer: hidden1 to hidden2
        self.fc_out = nn.Linear(hidden_size, output_size) # output layer: hidden2 to output

    def forward(self, x):
        # Apply the layers in sequence (no activation function for simplicity)
        x = self.fc1(x)  # input -> First layer
        x = self.fc2(x)  # First layer -> Second layer
        x = self.fc_out(x) # Second layer -> output
        return x
    """
    Note:
    1. __init__ method:
       - The __init__ method is the constructor of the class. It defines and initializes the layers of the neural network.
       - Here we define three fully connected layers (nn.Linear), each of which creates its own weight and bias.
       - The method receives input_size, hidden_size, and output_size to configure the dimensions of the layers.
    2. super(NN, self).__init__():
       - This line calls the __init__ method of the parent class nn.Module (in Python 3, super().__init__() is equivalent).
       - It is necessary because NN inherits from nn.Module: the parent constructor sets up the internal bookkeeping that nn.Module uses to register parameters and submodules.
       - Without this line, assigning a layer such as self.fc1 raises an error, because nn.Module has not been initialized yet.
    """

# Fix input and output size based on the shape of X and Y
model = NN(input_size=X.shape[1], hidden_size=3, output_size=Y.shape[1])

# Mean squared error (MSE) loss function
def loss(y, y_predicted):
    return ((y_predicted - y) ** 2).mean()

# Initial prediction
print(f'Prediction before training: model([5]) = {model.forward(torch.tensor([[5]], dtype=torch.float32))}')

# Main training loop settings
learning_rate = 0.01
n_iterations = 200
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(n_iterations):
    # Forward pass (calling model(X) runs forward() through nn.Module.__call__, which also handles hooks)
    y_pred = model(X)

    # Calculate loss
    l = loss(Y, y_pred)

    # Zero the gradients left over from the previous iteration (PyTorch accumulates them otherwise); this must run before backward()
    optimizer.zero_grad()

    # Backward pass to compute gradients
    l.backward()

    # Optimizer step
    optimizer.step()

    # Print training info
    if epoch % 10 == 0:
        print(f'Epoch {epoch + 1}, loss = {l:.8f}')

# Prediction after training
print(f'Prediction after training: model([5]) = {model.forward(torch.tensor([[5]], dtype=torch.float32))}')

Prediction before training: model([5]) = tensor([[-0.7878,  0.5648]], grad_fn=<AddmmBackward0>)
Epoch 1, loss = 48.46225739
Epoch 11, loss = 0.22184399
Epoch 21, loss = 0.00322931
Epoch 31, loss = 0.00308042
Epoch 41, loss = 0.00294041
Epoch 51, loss = 0.00280765
Epoch 61, loss = 0.00268153
Epoch 71, loss = 0.00256158
Epoch 81, loss = 0.00244738
Epoch 91, loss = 0.00233854
Epoch 101, loss = 0.00223475
Epoch 111, loss = 0.00213572
Epoch 121, loss = 0.00204117
Epoch 131, loss = 0.00195090
Epoch 141, loss = 0.00186468
Epoch 151, loss = 0.00178229
Epoch 161, loss = 0.00170357
Epoch 171, loss = 0.00162833
Epoch 181, loss = 0.00155642
Epoch 191, loss = 0.00148770
Prediction after training: model([5]) = tensor([[10.0701, 14.9532]], grad_fn=<AddmmBackward0>)
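
The hand-written `loss` function used above computes the same quantity as PyTorch's built-in `nn.MSELoss` with its default `reduction='mean'`. The following sketch compares the two on a small made-up pair of tensors (the values of `y` and `y_hat` are illustrative, not taken from the training run).

```python
import torch
import torch.nn as nn

def loss(y, y_predicted):
    # Same definition as in the tutorial: mean of squared differences
    return ((y_predicted - y) ** 2).mean()

y = torch.tensor([[2., 3.], [4., 6.]])        # made-up targets
y_hat = torch.tensor([[1.5, 3.5], [4., 5.]])  # made-up predictions

manual = loss(y, y_hat)
builtin = nn.MSELoss()(y_hat, y)   # argument order is (input, target)

print(manual.item(), builtin.item())
```

Either version can be used in the training loop; the built-in one simply saves a few lines.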


In [None]:
# Because our model now subclasses torch.nn.Module, it has a named_parameters() method,
# which yields both the name and the value of each of the model's parameters.
def print_model_parameters(model):
    for name, param in model.named_parameters():
        print(f'Parameter name: {name}')
        print(f'Value: \n{param.data}')
        print(f'Gradient: \n{param.grad}')
        print('---')

print_model_parameters(model)

Parameter name: fc1.weight
Value: 
tensor([[ 1.1824],
        [-1.3786],
        [ 0.5873]])
Gradient: 
tensor([[ 1.0512e-04],
        [-5.6312e-06],
        [-2.1590e-04]])
---
Parameter name: fc1.bias
Value: 
tensor([-0.6605, -0.1920,  0.0790])
Gradient: 
tensor([ 0.0004,  0.0004, -0.0013])
---
Parameter name: fc2.weight
Value: 
tensor([[ 0.7895, -0.7183,  0.1394],
        [ 0.1031, -0.9238,  0.5506],
        [-0.1248, -0.2326, -0.0709]])
Gradient: 
tensor([[-0.0005, -0.0006,  0.0003],
        [ 0.0011,  0.0010, -0.0004],
        [-0.0016, -0.0016,  0.0007]])
---
Parameter name: fc2.bias
Value: 
tensor([-0.3280, -0.3290, -0.0565])
Gradient: 
tensor([ 0.0013, -0.0023,  0.0036])
---
Parameter name: fc_out.weight
Value: 
tensor([[ 0.5222,  0.5795, -0.0590],
        [ 0.8826,  0.6873,  0.1868]])
Gradient: 
tensor([[ 0.0065, -0.0026, -0.0017],
        [-0.0043,  0.0018,  0.0012]])
---
Parameter name: fc_out.bias
Value: 
tensor([0.3632, 0.7999])
Gradient: 
tensor([-0.0193,  0.0129])
---
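
Note that `fc1.weight` printed above has shape (3, 1), whereas the from-scratch `w1` had shape (1, 3). `nn.Linear` stores its weight as (out_features, in_features) and computes `x @ weight.T + bias`. The short check below illustrates this with a freshly created layer (its random values are unrelated to the trained model above).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(in_features=1, out_features=3)
x = torch.tensor([[1.], [2.], [3.], [4.]])   # shape (4, 1)

print(layer.weight.shape)   # (out_features, in_features)
print(layer.bias.shape)     # (out_features,)

# Reproduce the layer's output with explicit matrix operations
manual = x @ layer.weight.T + layer.bias
same = torch.allclose(layer(x), manual, atol=1e-6)
print(same)
```

So the two versions of the network perform the same computation; only the storage layout of the weight matrix differs.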


In [None]:
"""
Simpler networks like this one are also often built with `nn.Sequential`:
"""

X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32)  # shape (4, 1)
Y = torch.tensor([[2, 3], [4, 6], [6, 9], [8, 12]], dtype=torch.float32)  # shape (4, 2)

# Constructing the model directly with nn.Sequential
model = nn.Sequential(
    nn.Linear(1, 3),  # input_size = 1, hidden_size = 3
    nn.Linear(3, 3),  # hidden_size to hidden_size
    nn.Linear(3, 2)   # hidden_size to output_size = 2
)

# Mean squared error (MSE) loss function
def loss(y, y_predicted):
    return ((y_predicted - y) ** 2).mean()

# Main training loop settings
learning_rate = 0.01
n_iterations = 200
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(n_iterations):
    # Forward pass
    y_pred = model(X)

    # Calculate loss
    l = loss(Y, y_pred)

    # Optimizer: zero old gradients, backpropagate, update parameters
    optimizer.zero_grad()
    l.backward()
    optimizer.step()

    # Print training info
    if epoch % 10 == 0:
        print(f'Epoch {epoch + 1}, loss = {l:.8f}')

# Prediction after training
print(f'Prediction after training: model([5]) = {model.forward(torch.tensor([[5]], dtype=torch.float32))}')

"""
While `nn.Sequential` allows us to construct simple neural networks in a neat and compact way,
it is limited in terms of flexibility. For more complex architectures, where custom behavior or
layer configurations are needed, it is generally better to use the class-based method.
Personally, I tend to avoid the `nn.Sequential` method, even though it can be quite convenient for simpler cases.
"""

Epoch 1, loss = 38.28511810
Epoch 11, loss = 0.15921214
Epoch 21, loss = 0.10976579
Epoch 31, loss = 0.07623108
Epoch 41, loss = 0.05298983
Epoch 51, loss = 0.03696627
Epoch 61, loss = 0.02596537
Epoch 71, loss = 0.01843089
Epoch 81, loss = 0.01327113
Epoch 91, loss = 0.00972800
Epoch 101, loss = 0.00728033
Epoch 111, loss = 0.00557324
Epoch 121, loss = 0.00436669
Epoch 131, loss = 0.00349947
Epoch 141, loss = 0.00286336
Epoch 151, loss = 0.00238617
Epoch 161, loss = 0.00201960
Epoch 171, loss = 0.00173119
Epoch 181, loss = 0.00149910
Epoch 191, loss = 0.00130849
Prediction after training: model([5]) = tensor([[10.0530, 14.9428]], grad_fn=<AddmmBackward0>)



# **Convention of the Input/Output Tensor Shapes (IMPORTANT)**

In machine learning, it’s essential to understand how data is typically formatted when passed into models as tensors. As we have demonstrated with explicit matrix multiplications (`matmul`), when a single data point passes through the network, we expect to see tensor operations with the following shapes:

$$(1, 1)  @  (1, 3)  @  (3, 3)  @  (3, 2),$$

where each weight matrix has one row per neuron in the incoming layer and one column per neuron in the outgoing layer, so adjacent inner dimensions always match.

Because a matrix product treats every row of its left operand independently, we can stack multiple data points as rows and process them simultaneously, which GPUs parallelize well. This generalizes the former expression to:

$$(\text{Batch Size}, 1)  @  (1, 3)  @  (3, 3)  @  (3, 2),$$

without violating the mathematical rules of matrix operations. More generally, we can represent the tensor operations as:

$$
(\text{Batch Size}, \text{Input Dimension})  @  (\text{Input Dimension}, \text{Hidden}_1 \ \text{Dimension})  @  \cdots  @  (\text{Hidden}_\text{last} \ \text{Dimension}, \text{Output Dimension}),
$$

where the final output shape becomes $(\text{Batch Size}, \text{Output Dimension})$.
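
The shape rule above can be checked directly with random matrices. This is only a sketch: the sizes follow the network used in this tutorial (1 input feature, two hidden layers of 3 neurons, 2 outputs), biases are left out because they do not affect shapes, and the variable names are chosen for illustration.

```python
import torch

batch_size, n_in, n_hidden, n_out = 4, 1, 3, 2

x = torch.randn(batch_size, n_in)      # (Batch Size, Input Dimension)
w1 = torch.randn(n_in, n_hidden)       # (Input Dimension, Hidden_1 Dimension)
w2 = torch.randn(n_hidden, n_hidden)   # (Hidden_1 Dimension, Hidden_2 Dimension)
w_out = torch.randn(n_hidden, n_out)   # (Hidden_2 Dimension, Output Dimension)

h1 = x @ w1
h2 = h1 @ w2
out = h2 @ w_out
for name, t in [('x', x), ('h1', h1), ('h2', h2), ('out', out)]:
    print(name, tuple(t.shape))

# Changing the batch size changes only the first dimension of every result
out_single = torch.randn(1, n_in) @ w1 @ w2 @ w_out
print(tuple(out_single.shape))
```

The weight matrices never depend on the batch size, which is why the same model can be applied to one sample or to many.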

The input dimension is also commonly called the number of "features." Hence:

### **Input Tensor Shape**

$$
\text{Input Tensor Shape:} \quad (\text{Batch Size}, \text{ Number of Features})
$$

- **Batch Size ($N$)**: The number of samples processed in parallel. Instead of feeding a single data point into the model, we usually process a batch of data points. This helps optimize training by making better use of modern hardware, such as GPUs.
- **Number of Features ($F$)**: Each sample consists of one or more features. For instance, in a dataset where each sample represents an image, the features could be pixel values. In tabular data, the features might represent different measurements for each sample.

Mathematically, this input tensor can be thought of as a matrix of shape $N \times F$, where each row is one sample and each column is one feature.

### **Visual Representation of the Input Tensor Shape**

$$
\begin{pmatrix}
\text{feature}_1^{(1)} & \text{feature}_2^{(1)} & \cdots & \text{feature}_F^{(1)} \\
\text{feature}_1^{(2)} & \text{feature}_2^{(2)} & \cdots & \text{feature}_F^{(2)} \\
\text{feature}_1^{(3)} & \text{feature}_2^{(3)} & \cdots & \text{feature}_F^{(3)} \\
\vdots                 & \vdots                 & \ddots & \vdots \\
\text{feature}_1^{(N)} & \text{feature}_2^{(N)} & \cdots & \text{feature}_F^{(N)}
\end{pmatrix}
\begin{array}{l}
\left. \begin{array}{c} \\ \\ \\ \\ \\ \end{array} \right\} N \text{ rows (samples)}, \; F \text{ columns (features)}
\end{array}
$$

Note that when dealing with higher-dimensional data (such as 32×32 images), we often flatten each sample into a 1D vector, here with $32 \times 32 = 1024$ components. This can be done with the tensor method `Tensor.view` (or with `torch.flatten` or `Tensor.reshape`) to reshape the original data tensor.
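
As a small illustration of this flattening step, the sketch below reshapes a made-up batch of 8 random 32×32 "images" into the (Batch Size, Number of Features) layout. The batch size of 8 and the variable names are assumptions for the example.

```python
import torch

images = torch.randn(8, 32, 32)   # hypothetical batch: 8 single-channel 32x32 images

# Keep the batch dimension and merge everything else into one feature dimension
flat_view = images.view(images.shape[0], -1)
flat_flatten = torch.flatten(images, start_dim=1)

print(tuple(flat_view.shape))
print(torch.equal(flat_view, flat_flatten))
```

Passing `-1` lets PyTorch infer the feature count (here 1024) from the remaining dimensions.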

Now we are equipped with the foundational skills to code a simple neural network using PyTorch’s `torch.nn` module. Advanced neural networks, however complex, are assembled from these same building blocks, whether taken from `torch.nn` or written from scratch. As you may have observed, our current network is simply a series of linear transformations. In the next section, we will explore how to introduce non-linearity into the model by adding *activation functions*, which are crucial for capturing more complex patterns.
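
The remark that the current network is only a series of linear transformations can be made concrete: without activation functions, the three layers collapse into a single map $y = xW^\top + b$. The sketch below checks this numerically on a freshly initialized `nn.Sequential` model with the same layer sizes as in this tutorial; the names `W`, `b`, and `x` are chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(1, 3), nn.Linear(3, 3), nn.Linear(3, 2))
fc1, fc2, fc_out = model[0], model[1], model[2]

with torch.no_grad():
    # Combined weight: W_out W_2 W_1, shape (2, 3) @ (3, 3) @ (3, 1) = (2, 1)
    W = fc_out.weight @ fc2.weight @ fc1.weight
    # Combined bias: W_out (W_2 b_1 + b_2) + b_out, shape (2,)
    b = fc_out.weight @ (fc2.weight @ fc1.bias + fc2.bias) + fc_out.bias

    x = torch.tensor([[1.], [2.], [5.]])
    same = torch.allclose(model(x), x @ W.T + b, atol=1e-5)

print(tuple(W.shape), tuple(b.shape), same)
```

In other words, stacking linear layers alone adds parameters but no expressive power beyond one linear layer.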