# Build PyTorch CNN

__ML Pipeline__: Prepare data -> __build model__ -> train model -> analyze model's results

To build neural networks in PyTorch, we extend the `torch.nn.Module` PyTorch class. This means we need to utilize a little bit of object oriented programming (OOP) in Python.

## OOP Review

When we’re writing programs or building software, there are two key components, code and data. With object oriented programming, we orient our program design and structure around objects.

Objects are defined in code using classes. A class defines the object's specification or spec, which specifies what data and code each object of the class should have.

When we create an object of a class, we call the object an instance of the class, and all instances of a given class have two core components:
- Methods (code)
- Attributes (data)

The methods represent the code, while the attributes represent the data, and so the methods and attributes are defined by the class.

In a given program, many objects, a.k.a instances of a given class, can exist simultaneously, and all of the instances will have the same available attributes and the same available methods. They are uniform from this perspective.

The difference between objects of the same class is the values contained within the object for each attribute. Each object has its own attribute values. These values determine the internal state of the object. The code and data of each object is said to be encapsulated within the object.

In [1]:
class Lizard:
    def __init__(self, name):
        self.name = name

    def set_name(self, name):
        self.name = name

In [2]:
lizard = Lizard('Deer')
print(lizard.name)

Deer


In [3]:
lizard.set_name('DL')
print(lizard.name)

DL


## `torch.nn`

As we know, deep neural networks are built using multiple layers. This is what makes the network deep. Each layer in a neural network has two primary components:

* A transformation (code)
* A collection of weights (data)

Within the `nn` package, there is a class called `Module`, and it is the __base class__ for all neural network modules, which includes layers.

This means that all of the layers in PyTorch extend the `nn.Module` class and inherit all of PyTorch’s built-in functionality within the `nn.Module` class. 

In OOP this concept is known as __inheritance__.
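As a minimal sketch of inheritance in plain Python, the example below extends the `Lizard` class from the OOP review. The `Gecko` subclass and its `describe()` method are hypothetical names introduced only for illustration.

```python
class Lizard:
    def __init__(self, name):
        self.name = name

    def set_name(self, name):
        self.name = name


# Gecko is a hypothetical subclass: it inherits __init__ and set_name from Lizard
class Gecko(Lizard):
    def describe(self):
        # new method that exists only on the subclass
        return f"{self.name} is a gecko"


gecko = Gecko('Deer')
gecko.set_name('DL')            # inherited method
description = gecko.describe()  # method added by the subclass
print(description)
```

Layers relate to `nn.Module` in the same way: they get the base class functionality for free and only add what is specific to them.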

In [23]:
import torch
import torch.nn as nn


### `forward()` method

When we pass a tensor to our network as input, the __tensor flows__ forward through each layer transformation until the tensor reaches the output layer. This process of a tensor flowing forward through the network is known as a __forward pass__.

Each layer has its own transformation (code) and the tensor passes forward through each layer. The composition of all the individual layer forward passes defines the overall forward pass transformation for the network. The goal of the overall transformation is to transform or map the input to the correct prediction output class, and during the training process, the layer weights (data) are updated in such a way that the mapping adjusts to bring the output closer to the correct prediction. This is achieved efficiently by __backpropagation__.

What this all means is that, every PyTorch `nn.Module` has a `forward()` method, and so when we are building layers and networks, we must provide an implementation of the `forward()` method. The forward method is the actual transformation.

### `torch.nn.functional`

When we implement the `forward()` method of our `nn.Module` subclass, we will typically use functions from the `nn.functional` package. This package provides us with many neural network operations that we can use for building layers. In fact, many of the `nn.Module` layer classes use `nn.functional` functions to perform their operations.

The `nn.functional` package contains functions that __subclasses__ of `nn.Module` use for implementing their `forward()` methods. One reason for this is that these functions are built from PyTorch tensor operations, so PyTorch can track them and, during backpropagation, use __automatic differentiation__ to calculate the gradient of the loss with respect to the weights.
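To make the gradient tracking concrete, here is a small sketch that applies `F.relu()` to a toy computation and asks PyTorch for the gradient. The tensors `w` and `x` and their values are made up for illustration and are not part of the network we build later.

```python
import torch
import torch.nn.functional as F

# toy weights and input, chosen only for illustration
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
x = torch.tensor([0.5, 0.5, 0.5])

# F.relu is recorded by autograd like any other tensor operation
loss = F.relu(w * x).sum()
loss.backward()

# gradient of the loss with respect to w; zero where relu clipped the value
print(w.grad)
```

The middle gradient is zero because `w * x` is negative there, so `relu()` outputs a constant zero for that element.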

## Building a Neural Network in PyTorch

We now have enough information to provide an outline for building neural networks in PyTorch. The steps are as follows:

Short version:

- Extend the `nn.Module` base class.
- Define layers as class attributes.
- Implement the `forward()` method.


More detailed version:

- Create a neural network class that extends the `nn.Module` base class.
- In the class constructor, define the network’s layers as class attributes using pre-built layers from `torch.nn`.
- Use the network’s layer attributes as well as operations from the `nn.functional` API to define the network’s forward pass.

In [5]:
# a trivial neural network (zero layers)
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.layer = None

    def forward(self, t):
        t = self.layer(t)
        return t


Let’s replace this now with some real layers that come pre-built for us from PyTorch's `nn` library. We’re building a CNN, so the two types of layers we'll use are linear layers and convolutional layers.

In [6]:
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
        
        self.fc1 = nn.Linear(in_features=12*4*4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        return t

In [7]:
network = Network()
network

Network(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=192, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=60, bias=True)
  (out): Linear(in_features=60, out_features=10, bias=True)
)

We used the abbreviation `fc` in `fc1` and `fc2` because linear layers are also called fully connected layers. They are also sometimes called dense layers. So linear, dense, and fully connected are all ways to refer to the same type of layer. PyTorch uses the word linear, hence the `nn.Linear` class name.

We used the name `out` for the last linear layer because the last layer in the network is the output layer.

The above neural net has three kinds of hyperparameters that need to be manually specified:
* `kernel_size`: the size of each convolutional filter
* `out_channels`: the number of filters in the convolutional layer
* `out_features`: the size of the output tensor, i.e. the number of neurons in the dense layer

Having `out_features=10` on the final output layer is a data dependent hyperparameter, i.e. fixed due to the nature of the problem.

### CNN Layer Parameters

### Parameters vs Arguments

Parameters are used in function definitions as place-holders, while arguments are the actual values that are passed to the function. The parameters can be thought of as local variables that live inside a function.

In our network's case, the names are the parameters and the values that we have specified are the arguments.
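The distinction can be shown with a tiny function. `describe_linear()` is a hypothetical helper written only for this illustration; it is not part of PyTorch.

```python
# in_features and out_features are the parameters (place-holders in the definition)
def describe_linear(in_features, out_features):
    return f"Linear layer mapping {in_features} features to {out_features}"


# 120 and 60 are the arguments (the actual values passed in the call)
summary = describe_linear(in_features=120, out_features=60)
print(summary)
```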

### Two Types of Parameters

To better understand the argument values for these parameters, let's consider two categories or types of parameters that we used when constructing our layers.

- Hyperparameters
- Data dependent hyperparameters

A lot of terms in deep learning are used loosely, and the word parameter is one of them. Try not to let it throw you off. The main thing to remember about any type of parameter is that the parameter is a place-holder that will eventually hold or have a value.

The goal of these particular categories is to help us remember how each parameter's value is decided.

When we construct a layer, we pass values for each parameter to the layer’s constructor. Our convolutional layers have three parameters and the linear layers have two parameters.

- Convolutional layers
    - in_channels
    - out_channels  - Sets the number of filters. One filter produces one output channel.
    - kernel_size   - Sets the filter size. The words kernel and filter are interchangeable.

- Linear layers
    - in_features
    - out_features - Sets the size of the output tensor.

#### Hyperparameters
In general, hyperparameters are parameters whose values are chosen manually and arbitrarily.

As neural network programmers, we choose hyperparameter values mainly based on trial and error and increasingly by utilizing values that have proven to work well in the past. For building our CNN layers, these are the parameters we choose manually.

- `kernel_size`
- `out_channels`
- `out_features`

This means we simply choose the values for these parameters. In neural network programming, this is pretty common, and we usually test and tune these parameters to find values that work best.

One pattern that shows up quite often is that we increase our out_channels as we add additional conv layers, and after we switch to linear layers we shrink our out_features as we filter down to our number of output classes.

#### Data Dependent Hyperparameters
Data dependent hyperparameters are parameters whose values are dependent on data. The first two data dependent hyperparameters that stick out are the `in_channels` of the first convolutional layer, and the `out_features` of the output layer.



| Layer 	| Param name   	| Param value 	| The param value is                                      	|
|-------	|--------------	|-------------	|---------------------------------------------------------	|
| conv1 	| in_channels  	| 1           	| the number of color channels in the input image.        	|
| conv1 	| kernel_size  	| 5           	| a hyperparameter.                                       	|
| conv1 	| out_channels 	| 6           	| a hyperparameter.                                       	|
| conv2 	| in_channels  	| 6           	| the number of out_channels in previous layer.           	|
| conv2 	| kernel_size  	| 5           	| a hyperparameter.                                       	|
| conv2 	| out_channels 	| 12          	| a hyperparameter (higher than previous conv layer).     	|
| fc1   	| in_features  	| 12\*4\*4      	| the length of the flattened output from previous layer. 	|
| fc1   	| out_features 	| 120         	| a hyperparameter.                                       	|
| fc2   	| in_features  	| 120         	| the number of out_features of previous layer.           	|
| fc2   	| out_features 	| 60          	| a hyperparameter (lower than previous linear layer).    	|
| out   	| in_features  	| 60          	| the number of out_features in previous layer.           	|
| out   	| out_features 	| 10          	| the number of prediction classes.                       	|

## Learnable Parameters

Learnable parameters are parameters whose values are learned during the training process.

With learnable parameters, we typically start out with a set of arbitrary values, and these values then get updated in an iterative fashion as the network learns.



In [8]:
network = Network()

In [9]:
print(network)

Network(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=192, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=60, bias=True)
  (out): Linear(in_features=60, out_features=10, bias=True)
)


The `print()` function prints a string representation of our network to the console. With a sharp eye, we can notice that the printed output details our network’s architecture, listing out our network’s layers and showing the values that were passed to the layer constructors.

For this reason, in object oriented programming, we usually want to provide a string representation of our object inside our classes so that we get useful information when the object is printed. This string representation comes from Python’s default base class called object.

We can override Python’s default string representation using the `__repr__` function. This name is short for representation.
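As a small sketch, here is the `Lizard` class from the OOP review with a custom `__repr__`. The exact format of the returned string is our own choice for illustration.

```python
class Lizard:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # override the default representation inherited from object
        return f"Lizard(name={self.name!r})"


lizard = Lizard('Deer')
text = repr(lizard)
print(lizard)  # print() falls back to __repr__ when __str__ is not defined
```

PyTorch's `nn.Module` takes the same approach, which is why printing the network lists its layers.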

For the convolutional layers, the kernel_size argument is a Python tuple `(5,5)` even though we only passed the number `5` in the constructor.

This is because our filters actually have a height and width, and when we pass a single number, the code inside the layer’s constructor assumes that we want a square filter.

The **stride** is an additional parameter that we could have set, but we left it out. When the stride is not specified in the layer constructor, the layer uses its default value of 1, which is displayed as the tuple `(1, 1)`.

The **stride** tells the conv layer how far the filter should slide after each operation in the overall convolution. This tuple says to slide by one unit when moving to the right and also by one unit when moving down.
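The tuple handling can be checked directly on a layer. In this sketch, the rectangular kernel and the non-default stride are arbitrary values picked only to show that tuples can be passed explicitly.

```python
import torch.nn as nn

# a single int is expanded to a square (height, width) tuple
square = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
print(square.kernel_size, square.stride)

# tuples can also be passed explicitly; these values are arbitrary illustrations
rect = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=(3, 5), stride=(2, 1))
print(rect.kernel_size, rect.stride)
print(rect.weight.shape)
```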

### Accessing the Network's layers

In [10]:
network.conv1

Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))

In [11]:
network.conv2

Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))

In [12]:
network.fc1

Linear(in_features=192, out_features=120, bias=True)

In [13]:
network.fc2

Linear(in_features=120, out_features=60, bias=True)

In [14]:
network.out

Linear(in_features=60, out_features=10, bias=True)

### Accessing the Layer Weights

In [15]:
network.conv1.weight

Parameter containing:
tensor([[[[-0.1229,  0.0926, -0.1417, -0.0607,  0.0912],
          [ 0.1341, -0.1312,  0.0854, -0.1496,  0.1899],
          [ 0.0196, -0.0671,  0.0507, -0.1408,  0.1325],
          [-0.1721,  0.1209,  0.0503,  0.1255, -0.1475],
          [-0.1521, -0.0169,  0.1762,  0.0859, -0.0057]]],


        [[[ 0.0626, -0.0752,  0.0668,  0.1028,  0.0333],
          [-0.1684,  0.0268, -0.1093,  0.0239,  0.1331],
          [-0.0099,  0.0126,  0.1425, -0.1425, -0.1933],
          [-0.1470, -0.0811, -0.1749, -0.0252,  0.0682],
          [-0.0291, -0.0987,  0.1878,  0.0022,  0.0681]]],


        [[[-0.1740,  0.1166,  0.1760, -0.1656, -0.0819],
          [-0.1560, -0.1639, -0.0061, -0.0060, -0.0338],
          [ 0.1315,  0.0891,  0.1836, -0.0483,  0.1421],
          [ 0.0047, -0.0603,  0.1282, -0.0495, -0.1666],
          [-0.1772,  0.1583, -0.1229, -0.1144,  0.1190]]],


        [[[ 0.0885, -0.0541, -0.1832,  0.0228,  0.1807],
          [-0.1984, -0.0408,  0.0381,  0.0763, -0.0202

In this expression, we first access the conv layer object that lives inside the network object. Then, we access the weight tensor object that lives inside the conv layer object, so all of these objects are chained or linked together.

One thing to notice about the weight tensor output is that it says `Parameter containing:` at the top. This is because this tensor is a special kind of tensor: its values, or scalar components, are learnable parameters of our network.

This means that the values inside this tensor, the ones we see above, are actually learned as the network is trained. As we train, these weight values are updated in such a way that the loss function is minimized.


### Parameter Class
To keep track of all the weight tensors inside the network, PyTorch has a special class called `Parameter`. The `Parameter` class extends the tensor class, and so the weight tensor inside every layer is an instance of this `Parameter` class. This is why we see the `Parameter containing:` text at the top of the string representation output.
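A quick way to confirm this relationship is with `isinstance()`. The layer sizes below are arbitrary and used only for the check.

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=4, out_features=3)

# the layer's weight is an nn.Parameter, which is itself a torch.Tensor
is_parameter = isinstance(layer.weight, nn.Parameter)
is_tensor = isinstance(layer.weight, torch.Tensor)

# a plain tensor wrapped in nn.Parameter is tracked for gradients by default
wrapped = nn.Parameter(torch.zeros(3, 4))
print(is_parameter, is_tensor, wrapped.requires_grad)
```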


## Weight Tensor Shape

For the convolutional layers, the weight values live inside the filters, and in code, the filters are actually the weight tensors themselves.

The convolution operation inside a layer is an operation between the input channels to the layer and the filter inside the layer. This means that what we really have is an operation between two tensors.

In [16]:
network.conv1.weight.shape

torch.Size([6, 1, 5, 5])

For the first conv layer, we have 1 color channel that should be convolved by 6 filters of size 5x5 to produce 6 output channels. This is how we interpret the values inside our layer constructor.

Inside our layer though, we don’t explicitly have 6 weight tensors for each of the 6 filters. We actually represent all 6 filters using a single weight tensor whose shape reflects or accounts for the 6 filters.

The shape of the weight tensor for the first convolutional layer shows us that we have a rank-4 weight tensor. The first axis has a length of 6, and this accounts for the 6 filters.

In [17]:
network.conv2.weight.shape

torch.Size([12, 6, 5, 5])

Think of this value of 6 here as giving each of the filters some depth. Instead of having a filter that convolves all of the channels iteratively, our filter has a depth that matches the number of channels.

The two main takeaways about these convolutional layers are that our filters are represented using a single tensor and that each filter inside the tensor also has a depth that accounts for the input channels that are being convolved.

- All filters are represented using a single tensor.
- Filters have depth that accounts for the input channels.

**Our tensors are rank-4 tensors.** 

The first axis represents the number of filters. The second axis represents the depth of each filter which corresponds to the number of input channels being convolved.

The last two axes represent the height and width of each filter. We can pull out any single filter by indexing into the weight tensor’s first axis.

__(Number of filters, Depth, Height, Width)__
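As a sketch of pulling out a single filter, the snippet below builds a standalone layer with the same arguments as `conv2` and indexes into its weight tensor.

```python
import torch.nn as nn

conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

# axis 0 indexes the 12 filters; each filter has depth 6 and is 5 x 5
single_filter = conv2.weight[0]
print(single_filter.shape)

# one 5 x 5 slice of that filter, the part applied to input channel 0
filter_slice = conv2.weight[0][0]
print(filter_slice.shape)
```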


In [21]:
network

Network(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=192, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=60, bias=True)
  (out): Linear(in_features=60, out_features=10, bias=True)
)

With linear layers or fully connected layers, we have flattened rank-1 tensors as input and as output. The way we transform the in_features to the out_features in a linear layer is by using a rank-2 tensor that is commonly called a weight matrix.

This is due to the fact that the weight tensor is of rank-2 with height and width axes.

In [18]:
network.fc1.weight.shape

torch.Size([120, 192])

In [19]:
network.fc2.weight.shape

torch.Size([60, 120])

In [20]:
network.out.weight.shape

torch.Size([10, 60])

### A general example of Matrix Multiplication

Here, we have the `in_feature` and the `weight_matrix` as tensors, and we’re using the tensor method called `matmul()` to perform the operation. The name `matmul()` is short for matrix multiplication.

In general, the weight matrix defines a linear function that maps a 1-dimensional tensor with four elements to a 1-dimensional tensor that has three elements. **We can think of this function as a mapping from 4-dimensional Euclidean space to 3-dimensional Euclidean space.**

In [24]:
in_feature = torch.tensor([1, 2, 3, 4], dtype=torch.float32)

In [25]:
weight_matrix = torch.tensor([
    [1, 2, 3, 4],
    [2, 3 ,4, 5],
    [3, 4, 5, 6]
], dtype=torch.float32)

In [26]:
weight_matrix.matmul(in_feature)

tensor([30., 40., 50.])
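To see where these three numbers come from, the same product can be written out in plain Python, without PyTorch. Each output element is the dot product of one row of the weight matrix with the input.

```python
in_feature = [1, 2, 3, 4]
weight_matrix = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 6],
]

# each output element is the dot product of one weight row with the input
output = [sum(w * x for w, x in zip(row, in_feature)) for row in weight_matrix]
print(output)
```

For example, the first element is `1*1 + 2*2 + 3*3 + 4*4 = 30`.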

### Accessing the Network Parameters

In [31]:
for param in network.parameters():
    print(param.shape)

torch.Size([6, 1, 5, 5])
torch.Size([6])
torch.Size([12, 6, 5, 5])
torch.Size([12])
torch.Size([120, 192])
torch.Size([120])
torch.Size([60, 120])
torch.Size([60])
torch.Size([10, 60])
torch.Size([10])


In [29]:
for name, param in network.named_parameters():
    print(name, '\t\t', param.shape)

conv1.weight 		 torch.Size([6, 1, 5, 5])
conv1.bias 		 torch.Size([6])
conv2.weight 		 torch.Size([12, 6, 5, 5])
conv2.bias 		 torch.Size([12])
fc1.weight 		 torch.Size([120, 192])
fc1.bias 		 torch.Size([120])
fc2.weight 		 torch.Size([60, 120])
fc2.bias 		 torch.Size([60])
out.weight 		 torch.Size([10, 60])
out.bias 		 torch.Size([10])
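From these shapes we can work out how many learnable values the network has. This sketch copies the shapes listed above into a plain dictionary and multiplies out each one; it assumes Python 3.8 or later for `math.prod()`.

```python
import math

# shapes copied from the named_parameters() output above
shapes = {
    'conv1.weight': (6, 1, 5, 5), 'conv1.bias': (6,),
    'conv2.weight': (12, 6, 5, 5), 'conv2.bias': (12,),
    'fc1.weight': (120, 192), 'fc1.bias': (120,),
    'fc2.weight': (60, 120), 'fc2.bias': (60,),
    'out.weight': (10, 60), 'out.bias': (10,),
}

# number of scalar values in each tensor is the product of its axis lengths
counts = {name: math.prod(shape) for name, shape in shapes.items()}
total = sum(counts.values())
print(total)
```

Most of the values sit in `fc1.weight`, which alone holds `120 * 192 = 23040` of them.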


### PyTorch Linear Layer

In [32]:
fc = nn.Linear(in_features=4, out_features=3)

#### Callable Python Objects

PyTorch creates a weight matrix and initializes it with random values. This means that the linear functions from the two examples are different, so we are using different functions to produce these outputs.

In [33]:
fc(in_feature)

tensor([ 0.2853, -0.1881,  1.3774], grad_fn=<AddBackward0>)

Let's explicitly set the weight matrix of the linear layer to be the same as the one we used in our other example.

PyTorch module weights need to be parameters. This is why we wrap the weight matrix tensor inside a parameter class instance. Let's see now how this layer transforms the input using the new weight matrix. 

In [34]:
fc.weight = nn.Parameter(weight_matrix)

In [35]:
fc(in_feature)

tensor([29.8384, 39.5811, 50.1765], grad_fn=<AddBackward0>)

This time we are much closer to the 30, 40, and 50 values. However, we're not exact.

Why is this? 

Well, this is not exact because the linear layer is adding a bias tensor to the output. Watch what happens when we turn the bias off. We do this by passing `bias=False` to the constructor.

In [36]:
fc = nn.Linear(in_features=4, out_features=3, bias=False)
fc.weight = nn.Parameter(weight_matrix)

In [37]:
fc(in_feature)

tensor([30., 40., 50.], grad_fn=<SqueezeBackward3>)

## Callable Layers and Neural Networks

We pointed out before how it was kind of strange that we called the layer object instance as if it were a function.

What makes this possible is that PyTorch module classes implement another special Python method called `__call__()`. If a class implements the `__call__()` method, the special call method will be invoked anytime the object instance is called.

This fact is an important PyTorch concept because of the way the `__call__()` method interacts with the `forward()` method for our layers and networks.

Instead of calling the `forward()` method directly, we call the object instance. After the object instance is called, the `__call__()` method is invoked under the hood, and the `__call__()` in turn invokes the `forward()` method. This applies to all PyTorch neural network modules, namely, networks and layers.

In [39]:
fc = nn.Linear(in_features=4, out_features=3)

t = torch.tensor([1, 2, 3, 4], dtype=torch.float32)

output = fc(t)

print(output)

tensor([ 0.9839, -0.7427,  2.0162], grad_fn=<AddBackward0>)


The extra code that PyTorch runs inside the `__call__()` method is why we never invoke the `forward()` method directly. If we did, the additional PyTorch code would not be executed. As a result, any time we want to invoke our `forward()` method, we call the object instance. This applies to both layers and networks because they are both PyTorch neural network modules.
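The mechanism can be sketched in plain Python. `SimpleModule` and `Doubler` below are hypothetical stand-ins written for this illustration, not PyTorch's actual implementation; the call counter stands in for the extra work (such as running registered hooks) that PyTorch does around `forward()`.

```python
class SimpleModule:
    """Hypothetical stand-in for nn.Module; not PyTorch's actual implementation."""

    def __init__(self):
        self.calls = 0

    def __call__(self, *args, **kwargs):
        # stand-in for the extra work PyTorch does around forward()
        self.calls += 1
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class Doubler(SimpleModule):
    def forward(self, x):
        return 2 * x


layer = Doubler()
result = layer(21)          # goes through __call__, then forward()
direct = layer.forward(21)  # bypasses __call__, so the extra work is skipped
print(result, direct, layer.calls)
```

Both calls return the same value, but only the first one is counted, which mirrors why calling `forward()` directly skips the additional PyTorch code.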

## Forward Propagation
## Implementing the `forward()` method

The `forward()` method is the actual network transformation. The forward method is the mapping that maps an input tensor to a prediction output tensor. Let's see how this is done.

In [40]:
import torch.nn.functional as F  # provides relu() and max_pool2d() used in forward()

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        # (1) input layer
        t = t

        # (2) hidden convolution layer
        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        # (3) hidden convolution layer
        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)
        
        return t

As we can see here, our input tensor is transformed as we move through the convolutional layers. 
- The first convolutional layer has a convolutional operation, followed by a relu activation operation whose output is then passed to a max pooling operation with kernel_size=2 and stride=2.
- The output tensor t of the first convolutional layer is then passed to the next convolutional layer, which is identical except for the fact that we call self.conv2() instead of self.conv1().

Each of these layers is composed of a collection of weights (data) and a collection of operations (code).
The weights are encapsulated inside the `nn.Conv2d()` class instance. 

The `relu()` and the `max_pool2d()` calls are just pure operations. Neither of these has weights, and this is why we call them directly from the `nn.functional` API.

Sometimes we may see pooling operations referred to as ***pooling*** layers. Sometimes we may even hear activation operations called ***activation*** layers.

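The `4 * 4` in `in_features=12 * 4 * 4` of `fc1` is the height and width of each of the 12 output channels that come out of the second convolutional layer and its pooling operation.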

We started with a 1 x 28 x 28 input tensor. This gives a single color channel, 28 x 28 image, and by the time our tensor arrives at the first linear layer, the dimensions have changed.

The height and width dimensions have been reduced from 28 x 28 to 4 x 4 by the convolution and pooling operations.
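The reduction can be traced step by step with the usual output-size formula for convolution and pooling, `floor((n + 2 * padding - kernel_size) / stride) + 1`. The helper `output_size()` is a name introduced here for illustration; the kernel sizes and strides are the ones used in our network, with no padding.

```python
def output_size(n, kernel_size, stride=1, padding=0):
    # standard formula for one spatial dimension of a convolution or pooling op
    return (n + 2 * padding - kernel_size) // stride + 1


size = 28
sizes = [size]
# (kernel_size, stride) for conv1, max pool, conv2, max pool
for kernel_size, stride in [(5, 1), (2, 2), (5, 1), (2, 2)]:
    size = output_size(size, kernel_size, stride)
    sizes.append(size)

print(sizes)
```

Each 5 x 5 convolution trims 4 from the height and width, and each 2 x 2 pooling with stride 2 halves them, giving 28, 24, 12, 8, 4.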

Before we pass our input to the first hidden linear layer, we must `reshape()` or flatten our tensor. This will be the case any time we are passing output from a convolutional layer as input to a linear layer.

Since the fourth layer is the first linear layer, we will include our reshaping operation as a part of the fourth layer.

In [48]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision
import torchvision.transforms as transforms

torch.set_printoptions(linewidth=120)

In [49]:
train_set = torchvision.datasets.FashionMNIST(
    root = './data/FashionMNIST',
    train=True,
    download=True,
    transform=transforms.Compose([
        transforms.ToTensor()
    ])
)

In [50]:
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        # (1) input layer
        t = t

        # (2) hidden convolution layer
        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        # (3) hidden convolution layer
        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)
        
        # (4) hidden linear layer
        t = t.reshape(-1, 12 * 4 * 4)
        t = self.fc1(t)
        t = F.relu(t)

        # (5) hidden linear layer
        t = self.fc2(t)
        t = F.relu(t)

        # (6) output layer
        t = self.out(t)
        #t = F.softmax(t, dim=1)

        return t

Inside the network we usually use `relu()` as our non-linear activation function, but for the output layer, whenever we have a single category that we are trying to predict, we use `softmax()`. 

**The softmax function returns a positive probability for each of the prediction classes, and the probabilities sum to 1.**
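A plain Python sketch of the softmax computation makes both properties easy to check. The function below is a minimal illustration, not PyTorch's implementation, and the input scores are arbitrary.

```python
import math


def softmax(values):
    # subtract the max for numerical stability; the result is unchanged
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0]  # arbitrary example scores
probs = softmax(logits)
print(probs)
print(sum(probs))
```

Larger scores get larger probabilities, every probability is positive, and together they sum to 1 up to floating point rounding.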

However, in our case, we won't use `softmax()` because the loss function that we'll use, `F.cross_entropy()`, implicitly performs the `softmax()` operation on its input, so we'll just return the result of the last linear transformation.
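This can be checked on a small example. More precisely, `F.cross_entropy()` combines a log-softmax with the negative log-likelihood loss, so applying it to raw outputs gives the same value as doing those two steps by hand. The logits and labels below are made up for illustration.

```python
import torch
import torch.nn.functional as F

# arbitrary raw outputs for a batch of 2 samples and 3 classes
logits = torch.tensor([[1.0, 2.0, 3.0], [1.0, 0.0, -1.0]])
labels = torch.tensor([2, 0])

# cross_entropy applied directly to the raw outputs
loss_direct = F.cross_entropy(logits, labels)

# the same value built from log_softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(loss_direct, loss_manual)
```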

In [52]:
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7fedc63f1510>

In [53]:
network = Network()

In [54]:
sample = next(iter(train_set))

In [55]:
image, label = sample
image.shape

torch.Size([1, 28, 28])

### Create a Batch

The image tensor’s shape indicates that we have a single channel image that is 28 in height and 28 in width. Cool, this is what we expect.

Now, there's a second step we must perform before simply passing this tensor to our network. When we pass a tensor to our network, the network is expecting a batch, so even if we want to pass a single image, we still need a batch.

This is no problem. We can create a batch that contains a single image. All of this will be packaged into a single four dimensional tensor with the following dimensions: `(batch size, number of channels, height, width)`.

This requirement of the network arises from the fact that the `forward()` methods of the `nn.Conv2d` convolutional layer classes expect their tensors to have 4 dimensions. This is pretty standard as most neural network implementations deal with batches of input samples rather than single samples.

In [57]:
image.unsqueeze(0).shape

torch.Size([1, 1, 28, 28])

In [58]:
pred = network(image.unsqueeze(0))

In [59]:
pred.shape

torch.Size([1, 10])

In [60]:
pred

tensor([[-0.0120,  0.1101,  0.0547, -0.0904, -0.0431,  0.0482, -0.1171,  0.0278,  0.1302, -0.0394]])

The shape of the prediction tensor is `1 x 10`. This tells us that the first axis has a length of one while the second axis has a length of ten. The interpretation of this is that we have one image in our batch and ten prediction classes.

In [61]:
label

9

In [63]:
pred.argmax(dim=1)

tensor([8])

For each input in the batch, and for each prediction class, we have a prediction value. If we wanted these values to be probabilities, we could just use the `softmax()` function from the `nn.functional` package.

In [64]:
F.softmax(pred, dim=1)

tensor([[0.0978, 0.1105, 0.1046, 0.0905, 0.0948, 0.1039, 0.0881, 0.1018, 0.1128, 0.0952]])

There are a couple of important things we need to point out about these results. Most of the probabilities came in close to `10%`, and this makes sense because our network is guessing and we have ten prediction classes coming from a `balanced dataset`.

Another implication of the randomly generated weights is that each time we create a new instance of our network, the weights within the network will be different. This means that the predictions we get will be different if we create different networks. Keep this in mind. Your predictions will be different from what we see here.

## Sending data in batches

In [65]:
print(torch.__version__)
print(torchvision.__version__)

1.3.1
0.4.2


In [66]:
data_loader = torch.utils.data.DataLoader(
    train_set, batch_size=10
)

In [67]:
batch = next(iter(data_loader))

In [68]:
images, labels = batch

In [69]:
images.shape

torch.Size([10, 1, 28, 28])

Last time, when we pulled a single image from our training set, we had to `unsqueeze()` the tensor to add another dimension that would effectively transform the singleton image into a batch with a size of one. Now that we are working with the data loader, we are dealing with batches by default, so there is no further processing needed.

In [70]:
labels.shape

torch.Size([10])

In [84]:
preds = network(images)

In [85]:
preds.shape

torch.Size([10, 10])

The prediction tensor has a shape of `10 by 10`, which gives us two axes that each have a length of ten. This reflects the fact that we have ten images and for each of these ten images we have ten prediction classes.
`(batch size, number of prediction classes)`

The elements of the first dimension are arrays of length ten. Each of these array elements contains the ten predictions, one per category, for the corresponding image.

The elements of the second dimension are numbers. Each number is the assigned value of the specific output class. The output classes are encoded by the indexes, so each index represents a specific output class.

In [86]:
preds

tensor([[-0.0120,  0.1101,  0.0547, -0.0904, -0.0431,  0.0482, -0.1171,  0.0278,  0.1302, -0.0394],
        [-0.0106,  0.1116,  0.0468, -0.0868, -0.0432,  0.0466, -0.1274,  0.0286,  0.1241, -0.0416],
        [-0.0067,  0.0991,  0.0382, -0.0908, -0.0383,  0.0312, -0.1149,  0.0303,  0.1255, -0.0385],
        [-0.0062,  0.1018,  0.0399, -0.0906, -0.0415,  0.0368, -0.1201,  0.0279,  0.1239, -0.0399],
        [-0.0101,  0.1084,  0.0427, -0.0927, -0.0460,  0.0459, -0.1160,  0.0320,  0.1255, -0.0405],
        [-0.0124,  0.1095,  0.0453, -0.0918, -0.0395,  0.0401, -0.1166,  0.0311,  0.1245, -0.0332],
        [-0.0176,  0.1034,  0.0407, -0.0965, -0.0361,  0.0357, -0.1184,  0.0256,  0.1228, -0.0343],
        [-0.0146,  0.1137,  0.0504, -0.0924, -0.0431,  0.0473, -0.1106,  0.0313,  0.1301, -0.0368],
        [-0.0083,  0.0982,  0.0378, -0.0899, -0.0363,  0.0374, -0.1206,  0.0273,  0.1224, -0.0440],
        [-0.0136,  0.1003,  0.0474, -0.0896, -0.0385,  0.0374, -0.1182,  0.0278,  0.1238, -0.0362]])

The result from the `argmax()` function is a tensor of ten prediction categories. 

Each number is the index where the highest value occurred. We have ten numbers because there were ten images. Once we have this tensor of indices of highest values, we can compare it against the label tensor.

In [93]:
preds.argmax(dim=1)

tensor([8, 8, 8, 8, 8, 8, 8, 8, 8, 8])

In [94]:
labels

tensor([9, 0, 0, 3, 0, 2, 7, 2, 5, 5])

In [95]:
preds.argmax(dim=1).eq(labels)

tensor([False, False, False, False, False, False, False, False, False, False])

Finally, if we call the `sum()` function on this result, we can reduce the output into a single number of correct predictions inside this scalar valued tensor.

In [96]:
def get_num_correct(preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()

In [97]:
get_num_correct(preds, labels)

0
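An untrained network will often get few or none right, so to see the function return a nonzero count, here is a sketch with hand-made scores. The `preds` and `labels` values are invented for illustration and do not come from the network.

```python
import torch


def get_num_correct(preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()


# hand-made scores for 3 samples and 2 classes, for illustration only
preds = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = torch.tensor([1, 0, 0])

num_correct = get_num_correct(preds, labels)
accuracy = num_correct / len(labels)
print(num_correct, accuracy)
```

The first two predictions match their labels and the third does not, so dividing the count by the batch size gives the accuracy for the batch.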