# Chapter 3: Convolutional Neural Networks

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data
import torch.nn.functional as F
import torchvision
from torchvision import transforms
from PIL import Image

In [2]:
from PIL import Image, ImageFile
def check_image(path):
    # Return True only if PIL can open the file as an image
    try:
        with Image.open(path):
            return True
    except Exception:
        return False

In [3]:
train_data_path = "data/train/"

transforms_ = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225] )
    ])

train_data = torchvision.datasets.ImageFolder(root=train_data_path,transform=transforms_, is_valid_file=check_image)

val_data_path = "data/val/"
val_data = torchvision.datasets.ImageFolder(root=val_data_path,
                                            transform=transforms_, is_valid_file=check_image)
test_data_path = "data/test/"
test_data = torchvision.datasets.ImageFolder(root=test_data_path,
                                             transform=transforms_, is_valid_file=check_image)

batch_size=64
train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_data_loader  = torch.utils.data.DataLoader(val_data, batch_size=batch_size, shuffle=True)
test_data_loader  = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=True)

In [4]:
if torch.cuda.is_available():
    device = torch.device("cuda") 
else:
    device = torch.device("cpu")

In [5]:
class CNNNet(nn.Module):

    def __init__(self, num_classes=2):
        super(CNNNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Linear(4096, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

In [6]:
cnnnet = CNNNet()

## Convolutions

The first thing to notice is the use of nn.Sequential(). This allows us to create a chain of layers. When we use one of these chains in forward(), the input goes through each layer in the chain in succession. You can use this to break your model into more logical arrangements. In this network, we have two chains: the features block and the classifier.
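To make the chaining idea concrete, here is a minimal plain-Python sketch of what a sequential container does. TinySequential and the two toy layers are hypothetical names used only for illustration; this is not how PyTorch implements nn.Sequential, which also registers each layer's parameters.

```python
class TinySequential:
    """Toy stand-in for nn.Sequential: calls each layer in order."""

    def __init__(self, *layers):
        self.layers = list(layers)

    def __call__(self, x):
        # The output of one layer becomes the input of the next.
        for layer in self.layers:
            x = layer(x)
        return x


def double(values):
    # Hypothetical "layer" that doubles every value.
    return [2 * v for v in values]


def relu(values):
    # ReLU clamps negative values to zero.
    return [max(0, v) for v in values]


chain = TinySequential(double, relu)
result = chain([-1, 0, 3])
print(result)  # [0, 0, 6]
```

Calling the chain is the same as calling relu(double(x)), which is why a single self.features(x) in forward() can stand in for a whole stack of layers.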

The Conv2d layer is a 2D convolution. If we have a grayscale image, it consists of an array, x pixels wide and y pixels high, with each entry having a value that indicates whether it’s black or white or somewhere in between (we assume an 8-bit image, so each value can vary from 0 to 255).

Next we introduce something called a filter, or convolutional kernel. This is another matrix, most likely smaller, which we will drag across our image.
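Dragging a filter across an image can be sketched in a few lines of plain Python. The function name convolve2d and the pixel values are made up for illustration, and the sketch handles a single channel with no padding; like deep learning frameworks, it does not flip the kernel.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of rows), without padding."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    output = []
    for top in range(0, img_h - k_h + 1, stride):
        row = []
        for left in range(0, img_w - k_w + 1, stride):
            # Multiply the overlapping values elementwise and sum them.
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 4 x 4 grayscale "image" with 8-bit values and a 2 x 2 filter.
image = [
    [10, 11, 9, 3],
    [2, 123, 4, 0],
    [45, 237, 23, 99],
    [20, 67, 22, 255],
]
kernel = [
    [1, 0],
    [1, 0],
]

feature_map = convolve2d(image, kernel, stride=2)
print(feature_map)  # [[12, 13], [65, 45]]
```

Each output value is the sum of the image patch weighted by the filter, so this particular filter adds up the left column of every 2 × 2 patch.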

Let’s go back to how we’re invoking the Conv2d layer and see some of the other options that we can set:
nn.Conv2d(in_channels,out_channels, kernel_size, stride, padding).

- The in_channels is the number of input channels we’ll be receiving in the layer. At the beginning of the network, we’re taking in the RGB image as input, so the number of input channels is three. out_channels is, unsurprisingly, the number of output channels, which corresponds to the number of filters in our conv layer. 
- Next is kernel_size, which describes the height and width of our filter. This can be a single scalar specifying a square (e.g., in the first conv layer, we’re setting up an 11 × 11 filter), or you can use a tuple (such as (3,5) for a 3 × 5 filter).
- The next two parameters seem harmless enough, but they can have big effects on the downstream layers of your network, and even what that particular layer ends up looking at. stride indicates how many steps across the input we move when we adjust the filter to a new position. In our example, we end up with a stride of 2, which has the effect of making a feature map that is half the size of the input. But we could have also moved with a stride of 1, which would give us a feature map output of 4 × 4, the same size as the input. We can also pass in a tuple (a,b) that would allow us to move a across and b down on each step.
- padding adds a border of zeros around the input: with a padding of 1, each spatial dimension grows by two (one on each side), and so on. If you don’t set padding, any edge cases that PyTorch encounters in the last columns of the input are simply thrown away. It’s up to you to set padding appropriately. Just as with stride and kernel_size, you can also pass in a tuple for height × width padding instead of a single number that pads the same in both directions.
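The combined effect of kernel_size, stride, and padding on the feature map can be worked out with the output-size formula given in the PyTorch documentation, floor((n + 2p - k) / s) + 1 when dilation is 1. The sketch below applies it to the layers of the features block for the 64 × 64 images used in this chapter; conv_output_size is a hypothetical helper name, not a PyTorch function.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension (dilation of 1 assumed)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 64  # images are resized to 64 x 64 by the transform chain
layers = [
    # (description, kernel_size, stride, padding)
    ("conv 11x11, stride 4, padding 2", 11, 4, 2),
    ("maxpool 3x3, stride 2", 3, 2, 0),
    ("conv 5x5, padding 2", 5, 1, 2),
    ("maxpool 3x3, stride 2", 3, 2, 0),
    ("conv 3x3, padding 1", 3, 1, 1),
    ("conv 3x3, padding 1", 3, 1, 1),
    ("conv 3x3, padding 1", 3, 1, 1),
    ("maxpool 3x3, stride 2", 3, 2, 0),
]

sizes = [size]
for name, kernel_size, stride, padding in layers:
    size = conv_output_size(size, kernel_size, stride, padding)
    sizes.append(size)
    print(f"{name}: {size} x {size}")

print(sizes)  # [64, 15, 7, 7, 3, 3, 3, 3, 1]
```

By this arithmetic, a 64 × 64 input has shrunk to a single 1 × 1 position per channel by the end of the features block, which shows how strongly stride and pooling choices affect the downstream layers.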

## Pooling

In conjunction with the convolution layers, you will often see pooling layers. These layers reduce the resolution of the network from the previous input layer, which gives us fewer parameters in lower layers. This compression results in faster computation for a start, and it helps prevent overfitting in the network.
In our model, we’re using MaxPool2d with a kernel size of 3 and a stride of 2.

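There’s a padding option to MaxPool that adds a border around the tensor in case the stride goes outside the tensor window. For max pooling, the border is implicitly filled with negative infinity rather than zeros, so a padded position can never be picked as the maximum.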

As you can imagine, you can pool with other functions aside from taking the maximum value from a kernel. A popular alternative is to take the average of the tensor values, which allows all of the tensor data to contribute to the pool instead of just one value in the max case (and if you think about an image, you can imagine that you might want to consider the nearest neighbors of a pixel). 
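As a rough illustration of the two pooling styles, the plain-Python sketch below applies a kernel size of 3 and a stride of 2, the same settings as the MaxPool2d layers in our model, to a small made-up feature map. The names pool2d and mean are hypothetical helpers, and no padding is applied.

```python
def pool2d(feature_map, kernel_size, stride, reduce_fn):
    """Apply `reduce_fn` to each kernel_size x kernel_size window, no padding."""
    height, width = len(feature_map), len(feature_map[0])
    output = []
    for top in range(0, height - kernel_size + 1, stride):
        row = []
        for left in range(0, width - kernel_size + 1, stride):
            # Collect every value that falls under the current window.
            window = [
                feature_map[top + i][left + j]
                for i in range(kernel_size)
                for j in range(kernel_size)
            ]
            row.append(reduce_fn(window))
        output.append(row)
    return output


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
]

max_pooled = pool2d(feature_map, kernel_size=3, stride=2, reduce_fn=max)
avg_pooled = pool2d(feature_map, kernel_size=3, stride=2, reduce_fn=mean)
print(max_pooled)  # [[13, 15], [23, 25]]
print(avg_pooled)  # [[7.0, 9.0], [17.0, 19.0]]
```

Max pooling keeps only the largest value in each window, while average pooling lets all nine values contribute to the result.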

Also, PyTorch provides AdaptiveMaxPool and AdaptiveAvgPool layers, which work independently of the incoming input tensor’s dimensions (we have an AdaptiveAvgPool in our model, for example). I recommend using these in model architectures that you construct over the standard MaxPool or AvgPool layers, because they allow you to create architectures that can work with different input dimensions; this is handy when working with disparate datasets.
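The idea behind adaptive pooling can be sketched in one dimension: the caller fixes the output size, and the window boundaries are derived from the input length. The function below is a hypothetical plain-Python illustration; the exact window scheme is an assumption made for this sketch, not a copy of PyTorch's code.

```python
def adaptive_avg_pool1d(values, output_size):
    """Average `values` into exactly `output_size` bins, whatever the input length."""
    n = len(values)
    pooled = []
    for i in range(output_size):
        # Window boundaries scale with the input length (assumed scheme):
        # start is rounded down, end is rounded up, so no window is empty.
        start = (i * n) // output_size
        end = ((i + 1) * n + output_size - 1) // output_size
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


short = adaptive_avg_pool1d([1, 2, 3, 4], output_size=2)
long = adaptive_avg_pool1d([1, 2, 3, 4, 5, 6, 7, 8, 9], output_size=2)
tiny = adaptive_avg_pool1d([5], output_size=6)
print(short)  # [1.5, 3.5]
print(long)   # [3.0, 7.0]
print(tiny)   # [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
```

Whatever the input length, the output always has the requested number of entries, which is what lets the classifier after the AdaptiveAvgPool layer keep a fixed input size.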
We have one more new component to talk about, one that is incredibly simple yet important for training.

## Dropout

One recurring issue with neural networks is their tendency to overfit to training data, and a large amount of ongoing work is done in the deep learning world to identify approaches that allow networks to learn and generalize to nontraining data without simply learning how to just respond to the training inputs. 

The Dropout layer is a devilishly simple way of doing this that has the benefit of being easy to understand and effective: what if we just don’t train a random bunch of nodes within the network during a training cycle? Because they won’t be updated, they won’t have the chance to overfit to the input data, and because it’s random, each training cycle will ignore a different selection of the input, which should help generalization even further.

By default, the Dropout layers in our example CNN network are initialized with 0.5, meaning that 50% of the input tensor is randomly zeroed out. If you want to change that to 20%, add the p parameter to the initialization call: Dropout(p=0.2).

Dropout should take place only during training. If it were happening during inference time, you’d lose a chunk of your network’s reasoning power, which is not what we want! Thankfully, PyTorch’s implementation of Dropout checks which mode the model is in, as set by model.train() or model.eval(), and passes all the data through the Dropout layer at inference time.
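The training and inference behavior can be sketched in plain Python as follows. The dropout function and its training flag are hypothetical stand-ins for the layer and the model's mode. The scaling of surviving values by 1 / (1 - p) follows the behavior described in the PyTorch documentation for Dropout, and the sketch assumes p is less than 1.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p while training."""
    if not training or p == 0:
        # Inference: every value passes through unchanged.
        return list(values)
    scale = 1.0 / (1.0 - p)
    # Survivors are scaled up so the expected sum matches inference.
    return [0.0 if rng.random() < p else v * scale for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)  # fixed seed so the run is repeatable

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)
print(train_out)  # each entry is either 0.0 or twice the input
print(eval_out)   # [1.0, 2.0, 3.0, 4.0]
```

Because a new random mask is drawn on every call, each training pass silences a different subset of values.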

In [7]:
from tqdm import tqdm

In [8]:
def train(model, optimizer, loss_fn, train_loader, val_loader, epochs=20, device="cpu"):
    for epoch in tqdm(range(epochs)):
        training_loss = 0.0
        valid_loss = 0.0
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            inputs, targets = batch
            inputs = inputs.to(device)
            targets = targets.to(device)
            output = model(inputs)
            loss = loss_fn(output, targets)
            loss.backward()
            optimizer.step()
            training_loss += loss.item() * inputs.size(0)
        training_loss /= len(train_loader.dataset)
        
        model.eval()
        num_correct = 0
        num_examples = 0
        # Gradients are not needed for validation, so skip tracking them
        with torch.no_grad():
            for batch in val_loader:
                inputs, targets = batch
                inputs = inputs.to(device)
                output = model(inputs)
                targets = targets.to(device)
                loss = loss_fn(output, targets)
                valid_loss += loss.item() * inputs.size(0)
                # Predicted class = index of the highest softmax probability
                correct = torch.eq(torch.max(F.softmax(output, dim=1), dim=1)[1], targets).view(-1)
                num_correct += torch.sum(correct).item()
                num_examples += correct.shape[0]
        valid_loss /= len(val_loader.dataset)

        print('Epoch: {}, Training Loss: {:.2f}, Validation Loss: {:.2f}, accuracy = {:.2f}'.format(epoch, training_loss,
        valid_loss, num_correct / num_examples))

In [9]:
cnnnet.to(device)
optimizer = optim.Adam(cnnnet.parameters(), lr=0.001)

In [10]:
train(cnnnet, optimizer,torch.nn.CrossEntropyLoss(), train_data_loader,val_data_loader, epochs=10, device=device)

Epoch: 0, Training Loss: 1.38, Validation Loss: 0.64, accuracy = 0.78
Epoch: 1, Training Loss: 0.70, Validation Loss: 0.69, accuracy = 0.59
Epoch: 2, Training Loss: 0.68, Validation Loss: 0.61, accuracy = 0.65
Epoch: 3, Training Loss: 0.59, Validation Loss: 0.46, accuracy = 0.82
Epoch: 4, Training Loss: 0.55, Validation Loss: 0.51, accuracy = 0.76
Epoch: 5, Training Loss: 0.50, Validation Loss: 0.40, accuracy = 0.82
Epoch: 6, Training Loss: 0.48, Validation Loss: 0.51, accuracy = 0.76
Epoch: 7, Training Loss: 0.51, Validation Loss: 0.54, accuracy = 0.79
Epoch: 8, Training Loss: 0.51, Validation Loss: 0.49, accuracy = 0.77
Epoch: 9, Training Loss: 0.48, Validation Loss: 0.35, accuracy = 0.81

100%|██████████| 10/10 [00:56<00:00,  5.65s/it]

## Downloading a pretrained network

There are two ways of downloading pretrained image models with PyTorch: you can use the torchvision.models library, or you can use PyTorch Hub. The latter is preferred as of 2019, as it is a one-stop shop for all models and the new standard for distributing models with PyTorch.

In [11]:
import torchvision.models as models
alexnet = models.alexnet(num_classes=1000, pretrained=True)

Downloading: "https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth" to /home/anton/.cache/torch/checkpoints/alexnet-owt-4df8aa71.pth


In [12]:
resnet50 = torch.hub.load('pytorch/vision', 'resnet50')

Downloading: "https://github.com/pytorch/vision/archive/master.zip" to /home/anton/.cache/torch/hub/master.zip


In [13]:
print(alexnet)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

In [14]:
print(resnet50)

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 

In [15]:
torch.hub.list('pytorch/vision')

Using cache found in /home/anton/.cache/torch/hub/pytorch_vision_master


['alexnet',
 'deeplabv3_resnet101',
 'densenet121',
 'densenet161',
 'densenet169',
 'densenet201',
 'fcn_resnet101',
 'googlenet',
 'inception_v3',
 'mobilenet_v2',
 'resnet101',
 'resnet152',
 'resnet18',
 'resnet34',
 'resnet50',
 'resnext101_32x8d',
 'resnext50_32x4d',
 'shufflenet_v2_x0_5',
 'shufflenet_v2_x1_0',
 'squeezenet1_0',
 'squeezenet1_1',
 'vgg11',
 'vgg11_bn',
 'vgg13',
 'vgg13_bn',
 'vgg16',
 'vgg16_bn',
 'vgg19',
 'vgg19_bn',
 'wide_resnet101_2',
 'wide_resnet50_2']

## BatchNorm

BatchNorm, short for batch normalization, is a simple layer that has one task in life: using two learned parameters (meaning that it will be trained along with the rest of the network) to try to ensure that each minibatch that goes through the network has a mean centered around zero with a variance of 1. 
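A minimal plain-Python sketch of the computation for a single feature during training might look like this. The function batch_norm is a hypothetical helper; gamma and beta stand for the two learned parameters, and eps is the small constant that appears as eps=1e-05 in the printed ResNet layers. The running statistics that the layer keeps for inference are left out.

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a minibatch, then scale and shift."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    # Subtract the batch mean and divide by the batch standard deviation.
    normalized = [(x - mean) / (variance + eps) ** 0.5 for x in batch]
    # gamma and beta are the two learned parameters.
    return [gamma * x + beta for x in normalized]


batch = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(batch)
out_mean = sum(out) / len(out)
out_var = sum((x - out_mean) ** 2 for x in out) / len(out)
print(out_mean, out_var)  # approximately 0 and 1
```

With gamma at 1 and beta at 0, the output has a mean of roughly zero and a variance of roughly 1; training can then move gamma and beta away from those values if a different scale or offset works better.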

You might ask why we need to do this when we’ve already normalized our input by using the transform chain in Chapter 2. For smaller networks, BatchNorm is indeed less useful, but as networks get larger, the effect of any layer on another, say 20 layers down, can be vast because of repeated multiplication, and you may end up with either vanishing or exploding gradients, both of which are fatal to the training process.

The BatchNorm layers make sure that even if you use a model such as ResNet-152, the multiplications inside your network don’t get out of hand.

You might be wondering: if we have BatchNorm in our network, why are we normalizing the input at all in the training loop’s transformation chain? After all, shouldn’t BatchNorm do the work for us? And the answer here is yes, you could do that! But it’ll take longer for the network to learn how to get the inputs under control, as it will have to discover the initial transform itself, which will make training longer.
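For reference, the input normalization in the transform chain is a fixed per-channel calculation, (value - mean) / std, applied after ToTensor has scaled pixel values to the range 0 to 1. The plain-Python sketch below uses the mean and std values from the transform chain earlier in the chapter; normalize_pixel is a hypothetical helper for a single RGB pixel, not part of torchvision.

```python
# Per-channel statistics from the transforms.Normalize call in the chapter.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return [(c - m) / s for c, m, s in zip(rgb, MEAN, STD)]


average = normalize_pixel(MEAN)           # a pixel equal to the channel means
white = normalize_pixel([1.0, 1.0, 1.0])  # a fully white pixel
print(average)  # [0.0, 0.0, 0.0]
print(white)
```

Because these statistics are fixed in advance, the network starts training with inputs that are already roughly centered, rather than having to learn that transform.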