# Transfer Learning

In this notebook, you'll learn how to use pre-trained networks to solved challenging problems in computer vision. Specifically, you'll use networks trained on [ImageNet](http://www.image-net.org/) [available from torchvision](http://pytorch.org/docs/0.3.0/torchvision/models.html). 

ImageNet is a massive dataset with over 1 million labeled images in 1000 categories. It's used to train deep neural networks using an architecture called convolutional layers. I'm not going to get into the details of convolutional networks here, but if you want to learn more about them, please [watch this](https://www.youtube.com/watch?v=2-Ol7ZB0MmU).

Once trained, these models work astonishingly well as feature detectors for images they weren't trained on. Using a pre-trained network on images not in the training set is called transfer learning. Here we'll use transfer learning to train a network that can classify our cat and dog photos with near perfect accuracy.

With `torchvision.models` you can download these pre-trained networks and use them in your applications. We'll include `models` in our imports now.

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt

import torch
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms, models

Most of the pretrained models require the input to be 224x224 images. Also, we'll need to match the normalization used when the models were trained. Each color channel was normalized separately, the means are `[0.485, 0.456, 0.406]` and the standard deviations are `[0.229, 0.224, 0.225]`.

In [2]:
data_dir = '../data/Cat_Dog_data'

# TODO: Define transforms for the training data and testing data
train_transforms = transforms.Compose([transforms.RandomRotation(30),
                                       transforms.RandomResizedCrop(224),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

test_transforms = transforms.Compose([transforms.Resize(256),
                                      transforms.CenterCrop(224),
                                      transforms.ToTensor(),
                                      transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                      ])

# Pass transforms in here, then run the next cell to see how the transforms look
train_data = datasets.ImageFolder(data_dir + '/train', transform=train_transforms)
test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transforms)

trainloader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(test_data, batch_size=64)

We can load in a model such as [DenseNet](http://pytorch.org/docs/0.3.0/torchvision/models.html#id5). Let's print out the model architecture so we can see what's going on.

In [3]:
model = models.densenet121(pretrained=True)
model

Downloading: "https://download.pytorch.org/models/densenet121-a639ec97.pth" to C:\Users\DELL/.cache\torch\hub\checkpoints\densenet121-a639ec97.pth
100.0%


DenseNet(
  (features): Sequential(
    (conv0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (norm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu0): ReLU(inplace=True)
    (pool0): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (denseblock1): _DenseBlock(
      (denselayer1): _DenseLayer(
        (norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace=True)
        (conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu2): ReLU(inplace=True)
        (conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (denselayer2): _DenseLayer(
        (norm1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu

This model is built out of two main parts, the features and the classifier. The features part is a stack of convolutional layers and overall works as a feature detector that can be fed into a classifier. The classifier part is a single fully-connected layer `(classifier): Linear(in_features=1024, out_features=1000)`. This layer was trained on the ImageNet dataset, so it won't work for our specific problem. That means we need to replace the classifier, but the features will work perfectly on their own. In general, I think about pre-trained networks as amazingly good feature detectors that can be used as the input for simple feed-forward classifiers.

In [4]:
# Freeze parameters so we don't backprop through them
for param in model.parameters():
    param.requires_grad = False

from collections import OrderedDict
classifier = nn.Sequential(OrderedDict([
                          ('fc1', nn.Linear(1024, 500)),
                          ('relu', nn.ReLU()),
                          ('fc2', nn.Linear(500, 2)),
                          ('output', nn.LogSoftmax(dim=1))
                          ]))
    
model.classifier = classifier

With our model built, we need to train the classifier. However, now we're using a **really deep** neural network. If you try to train this on a CPU like normal, it will take a long, long time. Instead, we're going to use the GPU to do the calculations. The linear algebra computations are done in parallel on the GPU leading to 100x increased training speeds. It's also possible to train on multiple GPUs, further decreasing training time.

PyTorch, along with pretty much every other deep learning framework, uses [CUDA](https://developer.nvidia.com/cuda-zone) to efficiently compute the forward and backwards passes on the GPU. In PyTorch, you move your model parameters and other tensors to the GPU memory using `model.to('cuda')`. You can move them back from the GPU with `model.to('cpu')` which you'll commonly do when you need to operate on the network output outside of PyTorch. As a demonstration of the increased speed, I'll compare how long it takes to perform a forward and backward pass with and without a GPU.

In [5]:
import time

In [6]:
for device in ['cpu', 'cuda']:

    criterion = nn.NLLLoss()
    # Only train the classifier parameters, feature parameters are frozen
    optimizer = optim.Adam(model.classifier.parameters(), lr=0.001)

    model.to(device)

    for ii, (inputs, labels) in enumerate(trainloader):

        # Move input and label tensors to the GPU
        inputs, labels = inputs.to(device), labels.to(device)

        start = time.time()

        outputs = model.forward(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if ii==3:
            break
        
    print(f"Device = {device}; Time per batch: {(time.time() - start)/3:.3f} seconds")

Device = cpu; Time per batch: 3.009 seconds
Device = cuda; Time per batch: 0.014 seconds


You can write device agnostic code which will automatically use CUDA if it's enabled like so:
```python
# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

...

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
```

From here, I'll let you finish training the model. The process is the same as before except now your model is much more powerful. You should get better than 95% accuracy easily.

>**Exercise:** Train a pretrained models to classify the cat and dog images. Continue with the DenseNet model, or try ResNet, it's also a good model to try out first. Make sure you are only training the classifier and the parameters for the features part are frozen.

In [3]:
from collections import OrderedDict

In [4]:
## TODO: Use a pretrained model to classify the cat and dog images
model2 = models.efficientnet_b1(weights=True)



In [5]:
model2

EfficientNet(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): SiLU(inplace=True)
    )
    (1): Sequential(
      (0): MBConv(
        (block): Sequential(
          (0): Conv2dNormActivation(
            (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
            (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (2): SiLU(inplace=True)
          )
          (1): SqueezeExcitation(
            (avgpool): AdaptiveAvgPool2d(output_size=1)
            (fc1): Conv2d(32, 8, kernel_size=(1, 1), stride=(1, 1))
            (fc2): Conv2d(8, 32, kernel_size=(1, 1), stride=(1, 1))
            (activation): SiLU(inplace=True)
            (scale_activation): Sigmoid()
          )
          (2): Conv2dNormActivat

In [6]:
for param in model2.parameters():
    param.requires_grad = False
classifier2 = nn.Sequential(OrderedDict([
    ('fc1',nn.Linear(1280,512)),
    ('relu',nn.ReLU()),
    ('drop1',nn.Dropout(0.2)),
    ('fc2', nn.Linear(512,2)),
    ('output', nn.LogSoftmax(dim=1))
]))
model2.classifier = classifier2

In [7]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [8]:
model2.to(device)

EfficientNet(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): SiLU(inplace=True)
    )
    (1): Sequential(
      (0): MBConv(
        (block): Sequential(
          (0): Conv2dNormActivation(
            (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
            (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (2): SiLU(inplace=True)
          )
          (1): SqueezeExcitation(
            (avgpool): AdaptiveAvgPool2d(output_size=1)
            (fc1): Conv2d(32, 8, kernel_size=(1, 1), stride=(1, 1))
            (fc2): Conv2d(8, 32, kernel_size=(1, 1), stride=(1, 1))
            (activation): SiLU(inplace=True)
            (scale_activation): Sigmoid()
          )
          (2): Conv2dNormActivat

In [9]:
epochs = 10
print_step = 10
criterion = nn.NLLLoss()
optimizer = optim.Adam(model2.classifier.parameters())

In [11]:
train_losses, test_losses = [], []
for epoch in range(epochs):
    running_train_loss = 0
    step = 0
    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logps = model2(images)
        loss = criterion(logps,labels)
        loss.backward()
        optimizer.step()
        running_train_loss+=loss.item()*len(labels)
        if step%print_step==0:
            print("Epoch:{} , Step:{}, Train Batch Loss:{}".format(epoch+1,step,loss.item()))
        step+=1
    else:
        running_test_loss = 0
        running_correct_count = 0
        model2.eval()
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            logps = model2(images)
            loss = criterion(logps,labels)
            running_test_loss+=loss.item()*len(labels)
            #accuracy
            probs = torch.exp(logps)
            _, top_class = probs.topk(1,dim=1)
            equals = labels.view(*top_class.shape) == top_class
            running_correct_count += equals.sum().item()
            
        model2.train()
        train_loss = running_train_loss/len(trainloader.dataset)
        test_loss = running_test_loss/len(testloader.dataset)
        test_accuracy = running_correct_count/len(testloader.dataset)
        train_losses.append(train_loss)
        test_losses.append(test_loss)
        print("Epoch:{} , Train Loss:{}, Test Loss:{}, Accuracy:{}".format(epoch+1,
                                                                           train_loss,
                                                                          test_loss,
                                                                           test_accuracy
                                                                          ))

Epoch:1 , Step:0, Train Batch Loss:0.25042232871055603
Epoch:1 , Step:10, Train Batch Loss:0.3762143850326538
Epoch:1 , Step:20, Train Batch Loss:0.20429368317127228
Epoch:1 , Step:30, Train Batch Loss:0.11261643469333649
Epoch:1 , Step:40, Train Batch Loss:0.10823437571525574
Epoch:1 , Step:50, Train Batch Loss:0.15615692734718323
Epoch:1 , Step:60, Train Batch Loss:0.16711755096912384
Epoch:1 , Step:70, Train Batch Loss:0.22673919796943665
Epoch:1 , Step:80, Train Batch Loss:0.18839575350284576
Epoch:1 , Step:90, Train Batch Loss:0.18614991009235382
Epoch:1 , Step:100, Train Batch Loss:0.09257227182388306
Epoch:1 , Step:110, Train Batch Loss:0.09283874183893204
Epoch:1 , Step:120, Train Batch Loss:0.14351201057434082
Epoch:1 , Step:130, Train Batch Loss:0.357151061296463
Epoch:1 , Step:140, Train Batch Loss:0.09300102293491364
Epoch:1 , Step:150, Train Batch Loss:0.1502377688884735
Epoch:1 , Step:160, Train Batch Loss:0.19466927647590637
Epoch:1 , Step:170, Train Batch Loss:0.1532840

Epoch:4 , Step:330, Train Batch Loss:0.09401580691337585
Epoch:4 , Step:340, Train Batch Loss:0.07305363565683365
Epoch:4 , Step:350, Train Batch Loss:0.2363852709531784
Epoch:4 , Train Loss:0.1704002358648512, Test Loss:0.052227259242534636, Accuracy:0.98
Epoch:5 , Step:0, Train Batch Loss:0.06946796923875809
Epoch:5 , Step:10, Train Batch Loss:0.1104893609881401
Epoch:5 , Step:20, Train Batch Loss:0.09916281700134277
Epoch:5 , Step:30, Train Batch Loss:0.08039059489965439
Epoch:5 , Step:40, Train Batch Loss:0.13674400746822357
Epoch:5 , Step:50, Train Batch Loss:0.12024646997451782
Epoch:5 , Step:60, Train Batch Loss:0.10880748927593231
Epoch:5 , Step:70, Train Batch Loss:0.17124789953231812
Epoch:5 , Step:80, Train Batch Loss:0.21624135971069336
Epoch:5 , Step:90, Train Batch Loss:0.20102494955062866
Epoch:5 , Step:100, Train Batch Loss:0.16919393837451935
Epoch:5 , Step:110, Train Batch Loss:0.2578524351119995
Epoch:5 , Step:120, Train Batch Loss:0.265569269657135
Epoch:5 , Step:13

Epoch:8 , Step:290, Train Batch Loss:0.18097160756587982
Epoch:8 , Step:300, Train Batch Loss:0.07937771826982498
Epoch:8 , Step:310, Train Batch Loss:0.11814530193805695
Epoch:8 , Step:320, Train Batch Loss:0.13885675370693207
Epoch:8 , Step:330, Train Batch Loss:0.12722061574459076
Epoch:8 , Step:340, Train Batch Loss:0.19723428785800934
Epoch:8 , Step:350, Train Batch Loss:0.1172638013958931
Epoch:8 , Train Loss:0.15702118519147237, Test Loss:0.049533698412775995, Accuracy:0.9796
Epoch:9 , Step:0, Train Batch Loss:0.07520144432783127
Epoch:9 , Step:10, Train Batch Loss:0.1713179498910904
Epoch:9 , Step:20, Train Batch Loss:0.21017952263355255
Epoch:9 , Step:30, Train Batch Loss:0.12900175154209137
Epoch:9 , Step:40, Train Batch Loss:0.22599975764751434
Epoch:9 , Step:50, Train Batch Loss:0.17794130742549896
Epoch:9 , Step:60, Train Batch Loss:0.15463413298130035
Epoch:9 , Step:70, Train Batch Loss:0.1597745269536972
Epoch:9 , Step:80, Train Batch Loss:0.07766569405794144
Epoch:9 , S