# PyTorch

In [2]:
import os 
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

## Neural Network

### Basics
A neural network in PyTorch is a nested structure of modules, each of which subclasses `nn.Module`.

The model itself subclasses `nn.Module`, as do all of its layers (e.g., a densely connected layer).

Containers store layers. You then use the container like a function to pass input through all the layers. E.g., `nn.Sequential()` will store layers like `nn.Linear()` and `nn.ReLU()`.
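
As a minimal sketch of a container in use (the layer sizes and the names `stack` and `manual` are illustrative, not taken from the notes), calling the `nn.Sequential` object passes the input through each stored layer in order:

```python
import torch
from torch import nn

# Illustrative sizes: 4 input features -> 8 hidden units -> 2 outputs.
stack = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.rand(3, 4)   # a batch of 3 samples with 4 features each
out = stack(x)         # calling the container runs the layers in order

# The same computation, applying each stored layer by hand.
manual = x
for layer in stack:
    manual = layer(manual)

print(out.shape)                    # torch.Size([3, 2])
print(torch.allclose(out, manual))  # True
```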

### Logits

$$\operatorname{logit}(p) = \log\frac{p}{1-p}$$

Also referred to as log-odds.

We say the output is the logits; this is a choice of how to *interpret* it. The matrix multiplications of a neural network produce unbounded values in $(-\infty, \infty)$, which we read as logits. A function such as softmax then converts these values to probabilities by squeezing them into $[0, 1]$. The squeezing is built from exponentials, which undo a logarithm, so reading the raw outputs as log-scale quantities makes the calculation straightforward.

Reminder: the odds are the ratio of the probability of an event happening to the probability of it not happening, $p/(1-p)$. E.g., $p = 0.75$ means the odds are 3 to 1 in favor of the event occurring: over 4 races, we expect to win 3 and lose 1.

The logit function is the inverse of the sigmoid function. The logit function is $\log\frac{p}{1-p}$. The sigmoid function is $\frac{1}{1+e^{-x}}$. As expected, sigmoid(logit(p)) = p for any probability p strictly between 0 and 1.
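
The relationship between probability, odds, log-odds, and sigmoid can be checked numerically. This is a small sketch with made-up probabilities; `torch.logit` is the built-in counterpart to the hand-written expression.

```python
import torch

p = torch.tensor([0.10, 0.50, 0.75, 0.90])  # illustrative probabilities

odds = p / (1 - p)             # p = 0.75 gives odds of 3 to 1
logit = torch.log(odds)        # log-odds, written out by hand
p_back = torch.sigmoid(logit)  # sigmoid undoes the logit

print(odds)
print(logit)
print(torch.allclose(p_back, p))              # sigmoid(logit(p)) == p
print(torch.allclose(logit, torch.logit(p)))  # agrees with the built-in
```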

### Sigmoid 

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

It is the inverse of the log-odds (logit) function, and its graph is an S-shaped curve.

Sigmoid inputs 1 logit and outputs 1 probability.

Softmax inputs a vector of logits and outputs a vector of probabilities.

You use sigmoid for binary classification and softmax for multiclass classification.
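
A short sketch of the difference in input and output shapes, using illustrative values. It also checks one connection between the two: a two-class softmax over the logits `[z, 0]` gives the same probability for the first class as `sigmoid(z)`, since $e^z/(e^z + 1) = 1/(1 + e^{-z})$.

```python
import torch

# Sigmoid: one logit in, one probability out.
z = torch.tensor(1.5)
p = torch.sigmoid(z)

# Softmax: a vector of logits in, a vector of probabilities out.
logits = torch.tensor([1.5, -0.3, 0.2])
probs = torch.softmax(logits, dim=0)

# Two-class special case: softmax over [z, 0] matches sigmoid(z).
two_class = torch.softmax(torch.stack([z, torch.tensor(0.0)]), dim=0)

print(p)
print(probs, probs.sum())
print(torch.allclose(two_class[0], p))
```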


### Softmax

$$\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Each component of the output vector is that class's exponentiated logit (its odds, under the log-odds reading) divided by the sum of the exponentiated logits of all classes. This normalization makes the outputs sum to 1, so the result is a probability distribution over the classes. ('Over' refers to the components in the vector that each value in the distribution is tied to.)

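This works because applying the exponential function $e^x$ to the logits produces the odds. $e^x$ is strictly positive, with range $(0, \infty)$, and monotonically increasing, so every term is positive and the ordering of the logits is preserved.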
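
The two steps, exponentiate and then normalize, can be written out by hand and compared with the built-in. This is a sketch with illustrative logits; library implementations typically also subtract the maximum logit first for numerical stability, which does not change the result.

```python
import torch

logits = torch.tensor([2.0, 0.0, -1.0])  # illustrative values

exp_logits = torch.exp(logits)         # strictly positive, order preserved
probs = exp_logits / exp_logits.sum()  # divide by the total so the values sum to 1

print(probs)
print(probs.sum())
print(torch.allclose(probs, torch.softmax(logits, dim=0)))
```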

### Miscellany
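`nn.Flatten()` is different from `torch.squeeze()`. Flatten merges a contiguous range of dimensions into a single dimension. E.g., flattening a 2-d (28, 28) image gives a 1-d (784,) vector. You can specify where the flatten starts and ends with `start_dim` and `end_dim`; `nn.Flatten()` defaults to `start_dim=1`, so the batch dimension is kept. Squeeze only removes dimensions of size 1 (all of them, unless you name a specific one).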
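
A quick shape comparison, as a sketch on a made-up batch of three single-channel 28x28 images:

```python
import torch
from torch import nn

batch = torch.rand(3, 1, 28, 28)  # illustrative: 3 images, 1 channel, 28x28

flat = nn.Flatten()(batch)                                  # defaults: start_dim=1, end_dim=-1
flat_partial = torch.flatten(batch, start_dim=2, end_dim=3) # merge only the last two dims
squeezed = torch.squeeze(batch)                             # drop every size-1 dim

print(flat.shape)          # torch.Size([3, 784])
print(flat_partial.shape)  # torch.Size([3, 1, 784])
print(squeezed.shape)      # torch.Size([3, 28, 28])
```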

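`nn.Linear()` applies a linear (affine) transformation to the input: $y = xA^T + b$. E.g., take a (3, 784) batch of flattened images that needs to be transformed to a (3, 20) output. The matrix that multiplies x must have shape (784, 20) to accommodate the transformation. PyTorch stores the weight A with shape (out_features, in_features) = (20, 784), so it is the transpose of A, $A^T$, that has shape (784, 20): (3, 784)(784, 20) gives (3, 20). Equivalently, $xA^T = (Ax^T)^T$, where (20, 784)(784, 3) gives (20, 3), which is the same result transposed.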
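
The shapes can be confirmed directly. This sketch uses the (3, 784) to (3, 20) example with a randomly initialized layer and random input; `manual` and `transposed` are illustrative names.

```python
import torch
from torch import nn

layer = nn.Linear(in_features=784, out_features=20)
x = torch.rand(3, 784)  # batch of 3 flattened 28x28 images

y = layer(x)
manual = x @ layer.weight.T + layer.bias          # y = x A^T + b
transposed = (layer.weight @ x.T).T + layer.bias  # (A x^T)^T + b

print(layer.weight.shape)  # torch.Size([20, 784])
print(y.shape)             # torch.Size([3, 20])
print(torch.allclose(y, manual, atol=1e-5))
print(torch.allclose(y, transposed, atol=1e-5))
```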

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using {device} device.")

Using cpu device.


In [13]:
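class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Collapse each (28, 28) image into a 784-vector; the batch dim is kept.
        self.flatten = nn.Flatten()
        # Three linear layers with ReLU in between, ending in 10 output logits.
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        # Raw, unnormalized scores; softmax is applied outside the model.
        logits = self.linear_relu_stack(x)
        return logits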

In [14]:
model = NeuralNetwork().to(device)
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


In [15]:
x = torch.rand(1, 28, 28, device=device)
print(x.size()) # note this is the same as print(x.shape)

torch.Size([1, 28, 28])


In [22]:
logits = model(x)
print(f"Logits are {logits}")
yhat = nn.Softmax(dim=1)(logits)
print()
print(f"Predicted probability is {yhat}")

Logits are tensor([[-0.0197, -0.0396,  0.0082, -0.0159,  0.0155,  0.0200,  0.0305, -0.0899,
         -0.0075,  0.0784]], grad_fn=<AddmmBackward0>)

Predicted probability is tensor([[0.0982, 0.0962, 0.1009, 0.0985, 0.1017, 0.1021, 0.1032, 0.0915, 0.0994,
         0.1083]], grad_fn=<SoftmaxBackward0>)
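
The probabilities are all close to 0.1 because the model is untrained and its weights are still at their random initial values. As a sketch of one possible next step, the predicted class can be read off as the index of the largest probability. The logits below are copied from the printed output above so that the block runs without the model.

```python
import torch

# Logits copied from the printed output above (one sample, 10 classes).
logits = torch.tensor([[-0.0197, -0.0396, 0.0082, -0.0159, 0.0155,
                        0.0200, 0.0305, -0.0899, -0.0075, 0.0784]])

probs = torch.softmax(logits, dim=1)  # functional form of nn.Softmax(dim=1)
pred = probs.argmax(dim=1)            # index of the largest probability per row

print(probs.sum(dim=1))  # each row sums to 1
print(pred)              # tensor([9])
```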
