# Neural Networks

In [1]:
import torch
import torch.nn as nn
import numpy as np

In [2]:
# initializes matrix W and vector b
classifier = nn.Linear(5, 10) # 5 inputs 10 outputs
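
To make the "matrix W and vector b" concrete, here is a minimal sketch that inspects the parameters of such a layer and reproduces its output by hand. The names `layer`, `x`, and `manual` are illustrative and do not come from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initialization is repeatable

layer = nn.Linear(5, 10)  # 5 inputs, 10 outputs
x = torch.randn(5)

# PyTorch stores the weight matrix as (out_features, in_features)
print(layer.weight.shape)  # torch.Size([10, 5])
print(layer.bias.shape)    # torch.Size([10])

# The layer computes y = W x + b
manual = layer.weight @ x + layer.bias
print(torch.allclose(layer(x), manual, atol=1e-6))
```

The weight is stored transposed relative to the "5 inputs, 10 outputs" reading order, which is why its shape is `[10, 5]`.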

In [3]:
model = nn.Linear(10,3) # 10 inputs 3 outputs
loss = nn.MSELoss()
## dummy input x
input_vector = torch.randn(10)
input_vector

tensor([ 0.7889, -0.8749, -1.7647, -0.4678,  0.1724, -0.7130, -0.1860, -1.1708,
        -0.7079,  1.1622])

In [4]:
## class number 3 as a one-hot vector: the entry at the class index is set to 1
target = torch.tensor([0., 0., 1.]) # float, to match the dtype of the prediction
## y in math
# passing the input vector to the model object of the linear classifier
pred = model(input_vector)
output = loss(pred, target) # loss is also object for MSE loss
print("Prediction: " ,pred)
print("Output: " , output)

Prediction:  tensor([-0.7739,  0.4137,  0.3271], grad_fn=<ViewBackward0>)
Output:  tensor(0.4076, grad_fn=<MseLossBackward0>)
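
The loss value can be checked by hand, since MSE is the mean of the squared element-wise differences. The sketch below copies the rounded prediction printed above, so it should agree with the reported loss only up to that rounding; `manual_mse` is an illustrative name.

```python
import torch

# Values copied from the printed prediction above (rounded to 4 decimals)
pred = torch.tensor([-0.7739, 0.4137, 0.3271])
target = torch.tensor([0.0, 0.0, 1.0])

# MSE = mean of the squared element-wise differences
manual_mse = ((pred - target) ** 2).mean()
print(round(manual_mse.item(), 4))  # 0.4076
```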


In [5]:
def train():

    model = nn.Linear(4,2) # 4 inputs 2 outputs

    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(10):
        # Build the input and label tensors (the legacy Variable wrapper is no longer needed)
        inputs = torch.Tensor([0.8,0.4,0.4,0.2])
        labels = torch.Tensor([1,0])

        # Clear gradient buffers because we don't want any gradient from previous epoch to carry forward
        optimizer.zero_grad()

        # get output from the model, given the inputs
        outputs = model(inputs)

        # get loss for the predicted output
        loss = criterion(outputs, labels)
        print(loss)

        # get gradients w.r.t to parameters
        loss.backward()

        # update parameters
        optimizer.step()

        print('epoch {}, loss {}'.format(epoch, loss.item()))

if __name__ == "__main__":
    train()

tensor(1.2210, grad_fn=<MseLossBackward0>)
epoch 0, loss 1.2210241556167603
tensor(0.7815, grad_fn=<MseLossBackward0>)
epoch 1, loss 0.7814555168151855
tensor(0.5001, grad_fn=<MseLossBackward0>)
epoch 2, loss 0.5001314878463745
tensor(0.3201, grad_fn=<MseLossBackward0>)
epoch 3, loss 0.3200841248035431
tensor(0.2049, grad_fn=<MseLossBackward0>)
epoch 4, loss 0.2048538327217102
tensor(0.1311, grad_fn=<MseLossBackward0>)
epoch 5, loss 0.13110646605491638
tensor(0.0839, grad_fn=<MseLossBackward0>)
epoch 6, loss 0.08390815556049347
tensor(0.0537, grad_fn=<MseLossBackward0>)
epoch 7, loss 0.053701214492321014
tensor(0.0344, grad_fn=<MseLossBackward0>)
epoch 8, loss 0.03436878323554993
tensor(0.0220, grad_fn=<MseLossBackward0>)
epoch 9, loss 0.02199600636959076


In [6]:
model = nn.Sequential(
        nn.Linear(3, 20), # 3 input features, 20 output features
        nn.ReLU(), # activation
        nn.Linear(20,2) # 2 output classes
)

print(model)

Sequential(
  (0): Linear(in_features=3, out_features=20, bias=True)
  (1): ReLU()
  (2): Linear(in_features=20, out_features=2, bias=True)
)


In [7]:
model = nn.Sequential(
        nn.Linear(4,5),
        nn.ReLU(),
        nn.Linear(5,2) # in_features must match the 5 outputs of the previous layer
)

print(model)

Sequential(
  (0): Linear(in_features=4, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=2, bias=True)
)


## Universal approximation theorem

According to the universal approximation theorem, given enough neurons and the correct set of weights, a multi-layer NN can approximate any continuous function on a bounded input domain to arbitrary accuracy. The theorem only states that such weights exist: finding them through training can be hard, and we have no guarantee that our data are enough to do so.
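
The "correct set of weights" part can be illustrated with a tiny hand-built case. The sketch below is not a demonstration of the theorem itself; it only shows that a 1-2-1 ReLU network with hand-picked weights (an assumption made for illustration) represents the function |x| exactly, because |x| = relu(x) + relu(-x).

```python
import torch
import torch.nn as nn

# A 1-2-1 network whose weights are set by hand instead of learned
net = nn.Sequential(nn.Linear(1, 2), nn.ReLU(), nn.Linear(2, 1))
with torch.no_grad():
    net[0].weight.copy_(torch.tensor([[1.0], [-1.0]]))  # hidden units: x and -x
    net[0].bias.zero_()
    net[2].weight.copy_(torch.tensor([[1.0, 1.0]]))     # sum the two hidden units
    net[2].bias.zero_()

x = torch.tensor([[-2.0], [0.0], [3.0]])
y = net(x)
print(torch.allclose(y, x.abs()))  # the network output matches |x|
```

In practice the weights are not known in advance and have to be found by training, which is where the difficulty lies.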

Admittedly, that doesn’t mean we should only use NNs.

In fact, we will learn about other models and how we can make a NN more compact, wider, or deeper to learn very rich data representations.

Why is that even useful?

Because NNs hide another secret besides being very good function approximators. They are also very good feature extractors.

## Deep neural networks as feature extractors

Feature extraction can be seen as the transformation of the input data points from the input space to the feature space where classification is much easier.

Here is an intuitive and oversimplified example:

Imagine that each data point has 70 dimensions. Finding the correct 70-dimensional function to distinguish the data into two categories is very difficult and time consuming.

Instead, we transform our input to a three-dimensional space where a classifier can approximate the decision boundary more easily. If we transform the 3D decision boundary back to the 70-dimensional space, we will see that it corresponds to a 70-dimensional decision boundary.

The transformed space does not always need to be low-dimensional, but high-dimensional spaces do not guarantee better results either.

Think of the 70-dim example: if one of these input dimensions refers to the label, it would be enough to have 100% accuracy.

In any case, this is the main reason Deep Neural Networks (DNNs) exist: to transform the input data into a “better” space. Better because we can classify the data more easily after we transform them!

In fact, in most real-life applications, only the last one or two layers of a neural network perform the actual classification. The rest account for feature extraction and learning representations.
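
As a rough sketch of this split, the model below is written as two parts: a feature extractor that maps 70-dimensional inputs to a 3-dimensional feature space, and a single linear layer that classifies in that space. The layer sizes and the names `feature_extractor` and `classifier` are assumptions chosen to mirror the 70-to-3 example above.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 70 input dimensions, 3 feature dimensions, 2 classes
feature_extractor = nn.Sequential(
    nn.Linear(70, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
    nn.ReLU(),
)
classifier = nn.Linear(3, 2)  # last layer: performs the actual classification

x = torch.randn(8, 70)            # batch of 8 data points
features = feature_extractor(x)   # shape (8, 3): points in the feature space
logits = classifier(features)     # shape (8, 2): one score per class
print(features.shape, logits.shape)
```

Keeping the two parts separate also makes it easy to reuse the extracted features with a different classifier.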



## Build a Neural Network With PyTorch
Implement a vanilla neural network from scratch using PyTorch.

In [8]:
X = torch.tensor([1,2,3,4,5]) # 1d array
Y = torch.tensor([[1,2], [3,4]]) # 2d array

In [9]:
def neuron(input):
    W = torch.tensor([0.5,0.5,0.5]) # 3 x 1
    b = torch.tensor([0.5]) # 1 x 1
    return torch.add(torch.matmul(W, input), b)

neuron(torch.tensor([5.0, 5.0, 5.0])) # 1 x 1

tensor([8.])

Note that the linear layer does not contain an activation function, so we have to declare each activation explicitly as well.

In [10]:
model = nn.Sequential(
        nn.Linear(2, 3),
        nn.Sigmoid(),
        nn.Linear(3, 2),
        nn.Sigmoid()
)

## Program your own neural network

The Model class inherits (takes all the properties and methods) from PyTorch's base class, torch.nn.Module.

nn.Module is the fundamental building block for all neural network modules (layers, loss functions, and entire models) in PyTorch. It automatically tracks trainable parameters, manages moving them between devices (like .cuda()), and routes a call such as model(X) to the forward method. The backward pass is derived from the forward pass by autograd.

In [11]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(2, 3)
        self.linear2 = nn.Linear(3, 2)

    def forward(self, X): # overrides nn.Module.forward; calling model(X) runs this method
        h = torch.sigmoid(self.linear1(X)) # hidden layer: multiplies the input by the weights, adds the bias, then applies the sigmoid activation
        o = torch.sigmoid(self.linear2(h)) # output layer: takes the hidden activations and applies the second linear layer and sigmoid
        return o

model = Model()
print('model: ', model)
print('w1: ', model.linear1.weight.data)
X = torch.randn((1,2)) # 1 row 2 cols
print('X: ', X)
Y = model(X)
print('Y: ', Y)

print("--- All Model Parameters ---")
# .named_parameters() returns (name, parameter) tuples
for name, param in model.named_parameters():
    if 'weight' in name or 'bias' in name:
        print(f"Layer: {name} | Shape: {param.shape}")
        print(f"Values:\n{param.data.numpy()}\n")

model:  Model(
  (linear1): Linear(in_features=2, out_features=3, bias=True)
  (linear2): Linear(in_features=3, out_features=2, bias=True)
)
w1:  tensor([[-0.1998, -0.0198],
        [-0.7021, -0.2804],
        [-0.4792, -0.3639]])
X:  tensor([[-0.5350,  0.3057]])
Y:  tensor([[0.5077, 0.6630]], grad_fn=<SigmoidBackward0>)
--- All Model Parameters ---
Layer: linear1.weight | Shape: torch.Size([3, 2])
Values:
[[-0.19976814 -0.01980531]
 [-0.70210934 -0.2804035 ]
 [-0.47918278 -0.36394453]]

Layer: linear1.bias | Shape: torch.Size([3])
Values:
[-0.00917335  0.2128045  -0.6331857 ]

Layer: linear2.weight | Shape: torch.Size([2, 3])
Values:
[[-0.31730545 -0.38738433 -0.28919014]
 [ 0.16225891  0.349739   -0.31868428]]

Layer: linear2.bias | Shape: torch.Size([2])
Values:
[0.54793125 0.49512607]
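
The shapes printed above add up to 2*3 + 3 + 3*2 + 2 = 17 trainable values. A minimal sketch of how that count can be read off any `nn.Module` follows; `TwoLayerNet` is a hypothetical stand-in with the same layer layout as `Model`, re-declared so the block runs on its own.

```python
import torch.nn as nn

class TwoLayerNet(nn.Module):  # hypothetical name; same layers as Model above
    def __init__(self):
        super().__init__()
        # Assigning layers as attributes is enough for nn.Module to register them
        self.linear1 = nn.Linear(2, 3)
        self.linear2 = nn.Linear(3, 2)

net = TwoLayerNet()
n_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(n_params)  # 17
```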



In [12]:
seed = 172
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

def fnn(input):
    model = nn.Sequential(
            nn.Linear(10, 128),
            nn.ReLU(),
            nn.Linear(128,64),
            nn.ReLU(),
            nn.Linear(64, 2)
    )

    return model(input)

input = torch.randn((1,10))
print(fnn(input))

tensor([[0.2195, 0.0326]], grad_fn=<AddmmBackward0>)


## Optimization
Explore the different variations of the gradient descent algorithm.

### Batch Gradient Descent
The equation and code presented above actually referred to batch gradient descent. In this variant, we calculate the gradient for the entire dataset on each training step before updating the weights.

```
for t in range(steps):
  dw = gradient(loss, data, w)
  w = w - learning_rate * dw
```
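
A runnable version of this loop, as a minimal sketch on an assumed toy problem: fitting y = 3x + 1 with a line `w * x + b` under an MSE loss. The data, the learning rate, and the variable names are illustrative choices, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + 1.0                     # toy data without noise

w, b = 0.0, 0.0
learning_rate = 0.1

for t in range(200):
    error = w * X + b - y             # residuals over the ENTIRE dataset
    dw = 2.0 * np.mean(error * X)     # d(MSE)/dw
    db = 2.0 * np.mean(error)         # d(MSE)/db
    w -= learning_rate * dw
    b -= learning_rate * db

print(round(float(w), 3), round(float(b), 3))
```

Every update touches all 100 examples, which is what becomes expensive when the dataset is large.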

### Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) was introduced to address the cost of batch gradient descent, which must process the whole dataset before every single update. Instead of calculating the gradient over all training examples and then updating the weights, SGD updates the weights after each training example.

```
for t in range(steps):
  for example in data:
    dw = gradient(loss, example, w)
    w = w - learning_rate * dw
```

As a result, each SGD update is much cheaper to compute, but the gradient estimated from a single example is noisy. Because the weights are updated so frequently with noisy gradients, training can oscillate strongly and become unstable.

We walk down the loss landscape in a zig-zag fashion and can keep overshooting the minimum. However, the same noise helps us escape local minima and keep searching for a better one.
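
The per-example update can be sketched on an assumed toy problem, fitting y = 3x + 1 with a line `w * x + b`. Because this toy data contains no noise, the loop settles on the exact solution; on real data the updates would keep fluctuating around the minimum. All names and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + 1.0                        # toy data without noise

w, b = 0.0, 0.0
learning_rate = 0.05

for epoch in range(50):
    for i in rng.permutation(len(X)):    # visit the examples in random order
        error = w * X[i] + b - y[i]      # residual of ONE example
        w -= learning_rate * 2.0 * error * X[i]
        b -= learning_rate * 2.0 * error

print(round(float(w), 3), round(float(b), 3))
```

The weights change 100 times per epoch here, once per example.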

### Mini-batch Stochastic Gradient Descent
Mini-batch SGD sits right in the middle of the two previous ideas and combines the best of both worlds. It randomly selects n training examples — the so-called mini-batch — from the whole dataset and computes the gradients only from them. It essentially tries to approximate Batch Gradient Descent by sampling only a subset of the data.

```
for t in range(steps):
  for mini_batch in get_batches(data, batch_size):
    dw = gradient(loss, mini_batch, w)
    w = w - learning_rate * dw
```
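
In PyTorch, the role of `get_batches` is usually played by a `DataLoader`. The following is a minimal sketch on random placeholder data; the tensor sizes, batch size, and learning rate are assumptions chosen only to show the mechanics.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(100, 4)   # 100 placeholder examples with 4 features
y = torch.randn(100, 2)   # 100 placeholder targets with 2 values

# shuffle=True draws a fresh random split into mini-batches every epoch
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(4, 2)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_sizes = []
for inputs, labels in loader:          # one epoch = one pass over all batches
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()                    # gradient from this mini-batch only
    optimizer.step()
    batch_sizes.append(inputs.shape[0])

print(batch_sizes)  # [32, 32, 32, 4]
```

The last batch is smaller because 100 is not a multiple of 32; passing `drop_last=True` to the `DataLoader` would discard it.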

## Popular Optimization Algorithms
A saddle point is a point on the surface of the graph of a function where the slopes (derivatives) are all zero but which is neither a local maximum nor a local minimum of the function.
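
A standard textbook example is f(x, y) = x^2 - y^2, which has a saddle point at the origin. The sketch below uses autograd to confirm that the gradient vanishes there, while the function still rises in one direction and falls in another. The function and variable names are illustrative.

```python
import torch

def f(p):
    # f(x, y) = x^2 - y^2: curves up along x and down along y
    return p[0] ** 2 - p[1] ** 2

origin = torch.zeros(2, requires_grad=True)
f(origin).backward()
grad_at_origin = origin.grad          # both partial derivatives are zero here

along_x = f(torch.tensor([0.1, 0.0])).item()   # positive: f rises along x
along_y = f(torch.tensor([0.0, 0.1])).item()   # negative: f falls along y
print(grad_at_origin, along_x, along_y)
```

Plain gradient descent receives a zero gradient at such a point and stops moving, even though lower values of the function are nearby.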

### Adding momentum

```
for t in range(steps):
    dw = gradient(loss, w)
    v = rho * v + dw   # velocity decayed by friction
    w = w - learning_rate * v
```
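
The momentum update above can be run as is on a toy loss. This is a minimal sketch assuming f(w) = w^2, whose gradient is 2w; the starting point and the values of `rho` and `learning_rate` are illustrative.

```python
def gradient(w):
    return 2.0 * w          # gradient of the toy loss f(w) = w**2

w, v = 5.0, 0.0
rho, learning_rate = 0.9, 0.05

for t in range(300):
    dw = gradient(w)
    v = rho * v + dw        # velocity: decayed running sum of past gradients
    w = w - learning_rate * v

print(abs(w) < 1e-3)
```

With `rho = 0` the velocity equals the current gradient and the loop reduces to plain gradient descent.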

### Adaptive learning rate
Adaptive learning rate methods perform smaller updates for parameters tied to frequent features and bigger updates for parameters tied to infrequent ones.

#### Adagrad
Adagrad keeps a running sum of the squares of the gradients in each dimension, and in each update, we scale the learning rate based on the sum. That way we achieve a different learning rate for each parameter (or an adaptive learning rate). Moreover, by using the root of the squared gradients, we only take into account the magnitude of the gradients and not the sign.

```
for t in range(steps):
    dw = gradient(loss, w)
    squared_gradients += dw * dw
    w = w - learning_rate * dw / (squared_gradients.sqrt() + e)
```

A big drawback of Adagrad is that as time goes by, the learning rate becomes smaller and smaller due to the monotonic increment of the running squared sum.
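
The shrinking learning rate is easy to observe on a toy loss. The sketch below assumes f(w) = w^2 and records the effective rate `learning_rate / (sqrt(squared_gradients) + e)` at every step; the names and hyperparameters are illustrative.

```python
import math

def gradient(w):
    return 2.0 * w                      # gradient of the toy loss f(w) = w**2

w = 5.0
learning_rate, e = 1.0, 1e-8
squared_gradients = 0.0
effective_rates = []

for t in range(50):
    dw = gradient(w)
    squared_gradients += dw * dw        # running sum: it can only grow
    rate = learning_rate / (math.sqrt(squared_gradients) + e)
    effective_rates.append(rate)
    w = w - rate * dw

print(effective_rates[0], effective_rates[-1])
```

The recorded rates never increase, so progress slows down the longer training runs.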

#### RMSprop
A solution to this problem is a modification of the above algorithm called RMSProp, which can be thought of as a “Leaky Adagrad”. In essence, we once again add the notion of friction by decaying the sum of the previous squared gradients.

As we did in momentum-based methods, we multiply our term (here the running squared sum) with a constant value (the decay rate). That way we hope that the algorithm will not slow down over the course of training as Adagrad does.

```
for t in range(steps):
    dw = gradient(loss, w)
    squared_gradients = decay_rate * squared_gradients + (1 - decay_rate) * dw * dw
    w = w - learning_rate * dw / (squared_gradients.sqrt() + e)
```
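
A runnable sketch of the same update on an assumed toy loss f(w) = w^2 follows. The hyperparameters are illustrative; because the running average forgets old gradients, the step size does not decay toward zero the way it does in Adagrad.

```python
import math

def gradient(w):
    return 2.0 * w                      # gradient of the toy loss f(w) = w**2

w = 5.0
learning_rate, decay_rate, e = 0.01, 0.9, 1e-8
squared_gradients = 0.0

for t in range(1000):
    dw = gradient(w)
    # leaky average: old squared gradients fade out instead of piling up
    squared_gradients = decay_rate * squared_gradients + (1 - decay_rate) * dw * dw
    w = w - learning_rate * dw / (math.sqrt(squared_gradients) + e)

print(abs(w) < 0.05)
```

Since each step is normalized by the recent gradient magnitude, its size is bounded by `learning_rate / sqrt(1 - decay_rate)`, and the iterate ends up hovering close to the minimum rather than converging exactly.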

#### Adam - used most commonly for DL
Adam (Adaptive moment estimation) is arguably the most popular variation nowadays. It has been used extensively in both research and business applications. Its popularity stems from the fact that it combines the two best previous ideas, momentum and adaptive learning rate.

We now keep track of two running variables, velocity and the squared gradients average we described in RMSProp. They are also called first and second moments in the original paper.

```
for t in range(1, steps + 1):   # start at 1: at t = 0 the bias correction would divide by zero
    dw = gradient(loss, w)
    moment1 = delta1 * moment1 + (1 - delta1) * dw
    moment2 = delta2 * moment2 + (1 - delta2) * dw * dw
    moment1_unbiased = moment1 / (1 - delta1**t)
    moment2_unbiased = moment2 / (1 - delta2**t)
    w = w - learning_rate * moment1_unbiased / (moment2_unbiased.sqrt() + e)
```
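
The update can be traced on a toy loss. This sketch assumes f(w) = w^2 and illustrative hyperparameters; it also stores the weight after the first step to show the effect of the bias correction. In practice one would normally use `torch.optim.Adam` rather than a hand-written loop.

```python
import math

def gradient(w):
    return 2.0 * w                      # gradient of the toy loss f(w) = w**2

w = 5.0
learning_rate, delta1, delta2, e = 0.1, 0.9, 0.999, 1e-8
moment1, moment2 = 0.0, 0.0

for t in range(1, 501):                 # t starts at 1 for the bias correction
    dw = gradient(w)
    moment1 = delta1 * moment1 + (1 - delta1) * dw          # velocity
    moment2 = delta2 * moment2 + (1 - delta2) * dw * dw     # squared-gradient average
    moment1_unbiased = moment1 / (1 - delta1 ** t)
    moment2_unbiased = moment2 / (1 - delta2 ** t)
    w = w - learning_rate * moment1_unbiased / (math.sqrt(moment2_unbiased) + e)
    if t == 1:
        first_step_w = w                # weight after the very first update

print(first_step_w, abs(w) < 0.05)
```

On the first step the unbiased moments equal `dw` and `dw * dw`, so the weight moves by almost exactly `learning_rate` regardless of how large the gradient is.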

## Activation Functions
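Most activation functions are applied element-wise and, as a result, are independent of the input shape. Softmax is the exception: it normalizes across one dimension, so that dimension has to be specified.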

- Sigmoid
- Tanh
- ReLU
- Leaky ReLU
- Parametric ReLU
- Softmax - We usually apply softmax in the last dimension of a multi-dimensional input. To do that in PyTorch, you can just set dim=-1.
  ```
  nn.Softmax(dim=-1)
  ```
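
A short sketch of what `dim=-1` does on a batch of scores: each row (one sample) is normalized on its own. The tensor values are made up for illustration.

```python
import torch
import torch.nn as nn

softmax = nn.Softmax(dim=-1)
logits = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])   # 2 samples, 3 classes
probs = softmax(logits)

print(probs.sum(dim=-1))   # each row sums to 1
print(probs[1])            # equal scores give equal probabilities
```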


In [13]:
def m_sigmoid(x):
    return 1 / (1 + torch.exp(-x))


def m_tanh(x):
    return (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))

def m_relu(x):
    return torch.mul(x, (x > 0))


def m_softmax(x):
    # normalize over the last dimension so that batched inputs work as well
    return torch.exp(x) / torch.sum(torch.exp(x), dim=-1, keepdim=True)
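
One way to sanity-check hand-written activations is to compare them with PyTorch's built-ins. In the sketch below the four helpers are re-declared so the block runs on its own, with softmax normalizing over the last dimension; the sample input `x` is arbitrary.

```python
import torch

def m_sigmoid(x):
    return 1 / (1 + torch.exp(-x))

def m_tanh(x):
    return (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))

def m_relu(x):
    return torch.mul(x, (x > 0))   # the boolean mask zeroes out negative entries

def m_softmax(x):
    return torch.exp(x) / torch.sum(torch.exp(x), dim=-1, keepdim=True)

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])  # arbitrary 1-D sample input

checks = {
    "sigmoid": torch.allclose(m_sigmoid(x), torch.sigmoid(x)),
    "tanh": torch.allclose(m_tanh(x), torch.tanh(x)),
    "relu": torch.allclose(m_relu(x), torch.relu(x)),
    "softmax": torch.allclose(m_softmax(x), torch.softmax(x, dim=-1)),
}
print(checks)
```

These direct formulas can overflow for inputs of large magnitude because of `torch.exp`, so the built-in functions are the safer choice outside of exercises.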

## Training in PyTorch