A neural network comprises layers/modules that perform operations on data.

The [torch.nn](https://pytorch.org/docs/stable/nn.html) namespace provides all the building blocks you need to build your own neural network.

# Neural network to classify images in the FashionMNIST dataset.


In [2]:
import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Get a device for training


We want to be able to train the model on an accelerator such as CUDA, MPS, MTIA, or XPU. The cell below checks only for CUDA and falls back to the CPU when it is not available.

In [3]:
if torch.cuda.is_available():
  device = 'cuda'
else:
  device = 'cpu'

print(f"Using {device} device")

Using cuda device


In [6]:
!nvidia-smi

Thu Mar 13 11:36:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [9]:
!nvidia-smi -L #list available nvidia devices

GPU 0: Tesla T4 (UUID: GPU-8d249efe-d52b-788e-87b1-48041d3e09b1)


# Define the class

In [7]:
class NeuralNetwork(nn.Module):  # subclass of the nn.Module base class
  def __init__(self):
    super().__init__()  # call the nn.Module constructor
    self.flatten = nn.Flatten()
    self.linear_relu_stack = nn.Sequential(
        nn.Linear(28*28, 512),
        nn.ReLU(),
        nn.Linear(512, 512),
        nn.ReLU(),
        nn.Linear(512, 10)
    )

  def forward(self, x):
    x = self.flatten(x)
    logits = self.linear_relu_stack(x)
    return logits

In [8]:
model = NeuralNetwork().to(device)
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


To use the model, we pass it the input data. This executes the model's `forward` method.

Calling the model on the input returns a 2-dimensional tensor: dim=0 indexes the samples in the batch (each with its own output of 10 raw predicted values, one per class), and dim=1 indexes the individual values within each output.

We get the prediction probabilities by passing the logits through an instance of the `nn.Softmax` module.

In [14]:
X = torch.rand(1, 28, 28, device=device)
X

tensor([[[0.2673, 0.4616, 0.6439, 0.1135, 0.3129, 0.9808, 0.6728, 0.2944,
          0.1140, 0.1431, 0.6632, 0.6748, 0.4022, 0.7723, 0.6514, 0.9899,
          0.4485, 0.6280, 0.7989, 0.4605, 0.7856, 0.9924, 0.5583, 0.4990,
          0.9674, 0.5661, 0.1834, 0.3078],
         [0.4977, 0.4825, 0.4009, 0.9072, 0.1756, 0.7436, 0.8044, 0.3969,
          0.2072, 0.4688, 0.2921, 0.4764, 0.4385, 0.3407, 0.9944, 0.2457,
          0.1149, 0.3104, 0.5594, 0.1053, 0.7252, 0.1867, 0.1969, 0.6313,
          0.7503, 0.8884, 0.9234, 0.5925],
         [0.7620, 0.4384, 0.7022, 0.3244, 0.4776, 0.6039, 0.3639, 0.6401,
          0.1652, 0.2680, 0.9155, 0.5096, 0.5303, 0.9118, 0.2061, 0.0977,
          0.7700, 0.9031, 0.6211, 0.0383, 0.7915, 0.5787, 0.4181, 0.3859,
          0.2295, 0.0854, 0.2746, 0.0665],
         [0.1093, 0.3264, 0.5218, 0.6346, 0.2258, 0.3193, 0.7571, 0.8046,
          0.1708, 0.0653, 0.7936, 0.2982, 0.7171, 0.1653, 0.3651, 0.4892,
          0.3586, 0.0861, 0.1788, 0.7454, 0.5267, 0.4438,

In [16]:
logits = model(X)
logits

tensor([[ 0.0668,  0.0271, -0.0193, -0.0773, -0.0405, -0.0205,  0.0092, -0.0075,
          0.1733,  0.0580]], device='cuda:0', grad_fn=<AddmmBackward0>)

In [17]:
pred_probab = nn.Softmax(dim=1)(logits)
pred_probab

tensor([[0.1049, 0.1008, 0.0962, 0.0908, 0.0942, 0.0961, 0.0990, 0.0974, 0.1167,
         0.1040]], device='cuda:0', grad_fn=<SoftmaxBackward0>)

In [18]:
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred}")

Predicted class: tensor([8], device='cuda:0')
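The softmax and `argmax` steps above can be traced by hand. The following is a plain-Python sketch rather than PyTorch code: it copies the ten logits printed above into a list, and the helper names `softmax` and `argmax` are hypothetical, defined here only for illustration. Because softmax preserves ordering, the largest logit and the largest probability sit at the same index.

```python
import math

# Logits copied from the `logits` output above (one sample, ten classes).
logits = [0.0668, 0.0271, -0.0193, -0.0773, -0.0405,
          -0.0205, 0.0092, -0.0075, 0.1733, 0.0580]

def softmax(values):
    """Exponentiate each value and normalize so the results sum to 1."""
    largest = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

probs = softmax(logits)
predicted_class = argmax(probs)
print(predicted_class)  # 8
```

The probabilities are all close to 0.1 because the model is untrained, so the predicted class carries no meaning yet.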


# Model Layers

Let's break down the layers in the model. To illustrate, we take a sample minibatch of 3 images of size 28x28 and see what happens to it as we pass it through the network.


In [22]:
input_image = torch.rand(3,28,28)
print(input_image.size())

torch.Size([3, 28, 28])


## nn.Flatten

Initialize the `nn.Flatten` layer to convert each 2D 28x28 image into a contiguous array of 784 pixel values (the minibatch dimension, at dim=0, is maintained).

In [23]:
flatten = nn.Flatten()
flat_image = flatten(input_image)
print(flat_image.size())

torch.Size([3, 784])
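To make the shape change concrete, here is a plain-Python sketch of the same idea using nested lists instead of tensors. The function name `flatten_batch` is hypothetical, and the pixel values are placeholders chosen so that each value equals its position in the flattened row.

```python
# A minibatch of 3 "images", each 28 rows x 28 columns, as nested lists.
# Each pixel stores row * 28 + col, so its value equals its flattened index.
batch = [[[float(r * 28 + c) for c in range(28)] for r in range(28)]
         for _ in range(3)]

def flatten_batch(images):
    """Keep dim 0 (the batch) and concatenate each image's rows into one list."""
    return [[pixel for row in image for pixel in row] for image in images]

flat = flatten_batch(batch)
print(len(flat), len(flat[0]))  # 3 784
```

Rows are laid end to end, so the pixel at row 1, column 1 ends up at index 29 of the flattened list.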


## nn.Linear

`nn.Linear` applies a linear transformation to the input using its stored weights and biases.

In [25]:
layer1 = nn.Linear(in_features=28*28, out_features=20)
layer1

Linear(in_features=784, out_features=20, bias=True)

In [28]:
hidden1 = layer1(flat_image)
hidden1

tensor([[ 0.2155, -0.4049, -0.8427,  0.1457, -0.0643, -0.4941, -0.4382, -0.1710,
         -0.1029, -0.2099, -0.1976, -0.2963,  0.4958,  0.0162, -0.7350, -0.5259,
          0.1827, -0.0915, -0.2880, -0.1483],
        [ 0.2453, -0.3701, -0.7320,  0.3557,  0.1142, -0.1105, -0.5984, -0.2136,
         -0.4932, -0.3136, -0.0386, -0.0660,  0.4938,  0.4970, -0.7241, -0.6247,
         -0.0164, -0.1783, -0.4323,  0.0151],
        [ 0.3137, -0.6424, -0.5849,  0.1244, -0.0347, -0.0650, -0.6022,  0.0118,
         -0.1354, -0.2929, -0.2084, -0.4902,  0.6500,  0.1685, -0.7105, -0.3890,
          0.2476, -0.2165, -0.6606,  0.0412]], grad_fn=<AddmmBackward0>)

In [29]:
print(hidden1.size())

torch.Size([3, 20])
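The transformation computed by the layer is `y = x @ W.T + b`. The following plain-Python sketch spells out that arithmetic on toy sizes. The function name `linear` and the numbers are made up for illustration; the weight layout `[out_features, in_features]` matches the parameter sizes printed in the Model Parameters section.

```python
def linear(x, weight, bias):
    """Compute y = x @ weight.T + bias for a batch of row vectors.

    x:      [batch, in_features]
    weight: [out_features, in_features]
    bias:   [out_features]
    """
    return [
        [sum(xi * wi for xi, wi in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in x
    ]

# Toy sizes: 2 samples, 3 input features, 2 output features.
x = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]
weight = [[1.0, 0.0, -1.0],
          [0.5, 0.5, 0.5]]
bias = [0.0, 1.0]

y = linear(x, weight, bias)
print(y)  # [[-2.0, 4.0], [0.0, 1.5]]
```

The output has one row per sample and one column per output feature, which is why `hidden1` above has size `[3, 20]`.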


## nn.ReLU

Non-linear activations are what create the complex mappings between the model's inputs and outputs. They are applied after linear transformations to introduce nonlinearity, helping neural networks learn a wide variety of phenomena.

In [30]:
print(f"Before ReLU: {hidden1} \n\n")
hidden1 = nn.ReLU()(hidden1)
print(f"After ReLU: {hidden1}")

Before ReLU: tensor([[ 0.2155, -0.4049, -0.8427,  0.1457, -0.0643, -0.4941, -0.4382, -0.1710,
         -0.1029, -0.2099, -0.1976, -0.2963,  0.4958,  0.0162, -0.7350, -0.5259,
          0.1827, -0.0915, -0.2880, -0.1483],
        [ 0.2453, -0.3701, -0.7320,  0.3557,  0.1142, -0.1105, -0.5984, -0.2136,
         -0.4932, -0.3136, -0.0386, -0.0660,  0.4938,  0.4970, -0.7241, -0.6247,
         -0.0164, -0.1783, -0.4323,  0.0151],
        [ 0.3137, -0.6424, -0.5849,  0.1244, -0.0347, -0.0650, -0.6022,  0.0118,
         -0.1354, -0.2929, -0.2084, -0.4902,  0.6500,  0.1685, -0.7105, -0.3890,
          0.2476, -0.2165, -0.6606,  0.0412]], grad_fn=<AddmmBackward0>) 


After ReLU: tensor([[0.2155, 0.0000, 0.0000, 0.1457, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.4958, 0.0162, 0.0000, 0.0000, 0.1827, 0.0000,
         0.0000, 0.0000],
        [0.2453, 0.0000, 0.0000, 0.3557, 0.1142, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.4938, 0.4970, 0.0

ReLU sets every negative value to zero and passes positive values through unchanged.
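Element-wise, ReLU is simply `max(0, x)`. A minimal plain-Python sketch (the helper name `relu` is hypothetical), applied to the first four entries of the first row of `hidden1` printed above:

```python
def relu(values):
    """Replace each negative entry with 0.0 and keep the rest unchanged."""
    return [max(0.0, v) for v in values]

# First four entries of the first row of `hidden1` from the output above.
before = [0.2155, -0.4049, -0.8427, 0.1457]
after = relu(before)
print(after)  # [0.2155, 0.0, 0.0, 0.1457]
```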

## nn.Sequential

`nn.Sequential` is an ordered container of modules. The data is passed through all the modules in the same order as defined.

You can use sequential containers to put together a quick network like `seq_modules`.


In [33]:
seq_modules = nn.Sequential(
    flatten,
    layer1,
    nn.ReLU(),
    nn.Linear(20,10)
)
input_image = torch.rand(3,28,28)
logits = seq_modules(input_image)
logits

tensor([[-0.0787, -0.0265,  0.0381,  0.2250, -0.2706, -0.0462,  0.1415,  0.0530,
          0.1138,  0.0839],
        [-0.1836,  0.1150,  0.1064,  0.1978, -0.1172, -0.1620,  0.2105,  0.0848,
          0.1421,  0.0098],
        [-0.1120,  0.0866,  0.0424,  0.2905, -0.2480,  0.0116,  0.1444,  0.1412,
          0.0400,  0.1570]], grad_fn=<AddmmBackward0>)
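Conceptually, a sequential container is function composition: the output of each module becomes the input of the next. The sketch below mimics that behavior in plain Python. The names `sequential`, `double`, and `add_one` are hypothetical stand-ins, not part of `torch.nn`.

```python
from functools import reduce

def sequential(*modules):
    """Return a function that feeds its input through `modules` in order."""
    def run(x):
        return reduce(lambda value, module: module(value), modules, x)
    return run

# Stand-ins for layers: any callable that takes one argument works here.
def double(values):
    return [2 * v for v in values]

def add_one(values):
    return [v + 1 for v in values]

pipeline = sequential(double, add_one)
reversed_pipeline = sequential(add_one, double)
print(pipeline([1, 2, 3]))           # [3, 5, 7]
print(reversed_pipeline([1, 2, 3]))  # [4, 6, 8]
```

Swapping the order changes the result, which is why the order in which modules are listed in `nn.Sequential` matters.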

## nn.Softmax

The last linear layer of the network returns logits (raw values in (-∞, ∞)), which are passed to the `nn.Softmax` module.
The logits are scaled to values in [0, 1] representing the model's predicted probabilities for each class. The `dim` parameter indicates the dimension along which the values must sum to 1.

In [47]:
logits

tensor([[-0.0787, -0.0265,  0.0381,  0.2250, -0.2706, -0.0462,  0.1415,  0.0530,
          0.1138,  0.0839],
        [-0.1836,  0.1150,  0.1064,  0.1978, -0.1172, -0.1620,  0.2105,  0.0848,
          0.1421,  0.0098],
        [-0.1120,  0.0866,  0.0424,  0.2905, -0.2480,  0.0116,  0.1444,  0.1412,
          0.0400,  0.1570]], grad_fn=<AddmmBackward0>)

In [46]:
softmax = nn.Softmax(dim=1)
pred_probab = softmax(logits)
pred_probab

tensor([[0.0896, 0.0943, 0.1006, 0.1213, 0.0739, 0.0925, 0.1116, 0.1022, 0.1086,
         0.1054],
        [0.0792, 0.1067, 0.1058, 0.1160, 0.0846, 0.0809, 0.1174, 0.1036, 0.1097,
         0.0961],
        [0.0838, 0.1022, 0.0977, 0.1253, 0.0731, 0.0948, 0.1082, 0.1079, 0.0975,
         0.1096]], grad_fn=<SoftmaxBackward0>)

In [42]:
inputs = torch.rand(size=(4, 4), dtype=torch.float32)
inputs

tensor([[0.4821, 0.5069, 0.9276, 0.9228],
        [0.3100, 0.3381, 0.6353, 0.9888],
        [0.7175, 0.6312, 0.7221, 0.7960],
        [0.0813, 0.0582, 0.2255, 0.9559]])

In [45]:
# Input tensor of shape B x C: B = batch size (rows), C = number of classes (columns).
inputs = torch.rand(size=(4, 4), dtype=torch.float32)
soft_dim0 = torch.softmax(inputs, dim=0)
soft_dim1 = torch.softmax(inputs, dim=1)
print('**** INPUTS ****')
print(inputs)
print('**** SOFTMAX DIM=0 ****')
print(soft_dim0)
print('**** SOFTMAX DIM=1 ****')
print(soft_dim1)

**** INPUTS ****
tensor([[0.8793, 0.8018, 0.7319, 0.3394],
        [0.7831, 0.0488, 0.7002, 0.0189],
        [0.3044, 0.9235, 0.8594, 0.8932],
        [0.8315, 0.6727, 0.4078, 0.1659]])
**** SOFTMAX DIM=0 ****
tensor([[0.2920, 0.2874, 0.2612, 0.2322],
        [0.2652, 0.1354, 0.2531, 0.1685],
        [0.1644, 0.3246, 0.2967, 0.4040],
        [0.2784, 0.2526, 0.1889, 0.1952]])
**** SOFTMAX DIM=1 ****
tensor([[0.2966, 0.2745, 0.2560, 0.1729],
        [0.3489, 0.1674, 0.3212, 0.1625],
        [0.1562, 0.2902, 0.2721, 0.2815],
        [0.3309, 0.2823, 0.2166, 0.1701]])


For the softmax with dim=0, each column sums to 1, while for dim=1, each row sums to 1. Usually, you do not want to perform a softmax operation across the batch dimension.
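The difference between the two `dim` settings can be checked by hand. This plain-Python sketch copies the 4x4 `inputs` printed above into nested lists and normalizes once per row and once per column. The helper `softmax` is a hypothetical stand-in for `torch.softmax` on a single 1D list.

```python
import math

def softmax(values):
    """Exponentiate each value and normalize so the results sum to 1."""
    largest = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# The 4 x 4 `inputs` tensor printed above, as nested lists.
inputs = [[0.8793, 0.8018, 0.7319, 0.3394],
          [0.7831, 0.0488, 0.7002, 0.0189],
          [0.3044, 0.9235, 0.8594, 0.8932],
          [0.8315, 0.6727, 0.4078, 0.1659]]

# dim=1: normalize within each row (across classes).
soft_dim1 = [softmax(row) for row in inputs]

# dim=0: normalize within each column (across the batch), then transpose back.
columns = [list(col) for col in zip(*inputs)]
soft_dim0 = [list(row) for row in zip(*[softmax(col) for col in columns])]

row_sums = [sum(row) for row in soft_dim1]
col_sums = [sum(col) for col in zip(*soft_dim0)]
print([round(s, 6) for s in row_sums])  # [1.0, 1.0, 1.0, 1.0]
print([round(s, 6) for s in col_sums])  # [1.0, 1.0, 1.0, 1.0]
```

With dim=0, each sample's probabilities depend on the other samples in the batch, which is why classification uses dim=1.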

## Model Parameters

Many layers inside a neural network are *parameterized*, i.e. they have associated weights and biases that are optimized during training.

Subclassing `nn.Module` automatically tracks all fields defined inside your model object and makes all parameters accessible using your model's `parameters()` or `named_parameters()` methods.
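The parameter shapes can be worked out from the layer definitions alone. The sketch below is plain-Python arithmetic rather than a PyTorch call: it assumes the three `nn.Linear` layers defined in `NeuralNetwork` are the only parameterized layers, since `nn.Flatten` and `nn.ReLU` have no weights.

```python
# (in_features, out_features) for the three nn.Linear layers in the model.
layer_shapes = [(28 * 28, 512), (512, 512), (512, 10)]

total = 0
for in_features, out_features in layer_shapes:
    weights = out_features * in_features  # weight matrix is [out, in]
    biases = out_features                 # one bias per output feature
    total += weights + biases
    print(f"weight: [{out_features}, {in_features}] | bias: [{out_features}]")

print(total)  # 669706
```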

We iterate over each parameter and print its size and a preview of its values.

In [48]:
print(f"Model structure: {model}\n\n")

for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")

Model structure: NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


Layer: linear_relu_stack.0.weight | Size: torch.Size([512, 784]) | Values : tensor([[-0.0353, -0.0236,  0.0047,  ...,  0.0198, -0.0139, -0.0345],
        [ 0.0017,  0.0104, -0.0141,  ..., -0.0288, -0.0204,  0.0223]],
       device='cuda:0', grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.0.bias | Size: torch.Size([512]) | Values : tensor([0.0134, 0.0138], device='cuda:0', grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.2.weight | Size: torch.Size([512, 512]) | Values : tensor([[-0.0013, -0.0069, -0.0213,  ..., -0.0232,  0.0332,  0.0096],
        [-0.0141, -0.0440, -0.0296,  ..., -0.0038,  0.0418,  0.0141]],
       device='cuda:0', grad_fn=<Slic