# Transfer Learning and PreTrained Models

Before we get into building a ton of models, I want to first talk about one of the coolest aspects of Deep Learning: **Transfer Learning**. As we have seen in the [Intro to PyTorch](https://github.com/priyammaz/PyTorch-Adventures/tree/main/PyTorch%20Basics/Intro%20to%20PyTorch), a Deep Learning model is really just a structured bag of parameters. In the Linear Regression example, we had only two parameters, a weight $W_1$ and a bias $W_0$. The only model we can create by multiplying and adding numbers together is a linear model. The Dense MNIST model we made afterwards was a deeper model with more parameters and connections, but if we only multiply and add numbers together, we can still only represent a linear relationship. To fix this we added some ReLU (Rectified Linear Unit) activation functions, which introduce non-linearity to our model! All the weights, which are randomly initialized when we define the model, are then optimized and tuned so we have the least error between our predictions and the true values.
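
To make the point about linearity concrete, here is a minimal sketch (the layer sizes are arbitrary and chosen only for illustration). It shows that two stacked `nn.Linear` layers with no activation between them collapse into a single linear layer, and that putting a ReLU in between breaks that equivalence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between (sizes are arbitrary)
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)

# Fold them into one equivalent layer: W = W2 @ W1, b = W2 @ b1 + b2
merged = nn.Linear(4, 3)
with torch.no_grad():
    merged.weight.copy_(layer2.weight @ layer1.weight)
    merged.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.rand(16, 4)
with torch.no_grad():
    stacked_out = layer2(layer1(x))                 # no activation
    merged_out = merged(x)                          # single folded layer
    relu_out = layer2(torch.relu(layer1(x)))        # ReLU in between

# Without an activation, the stack behaves like one linear layer
same_without_relu = torch.allclose(stacked_out, merged_out, atol=1e-5)
# With a ReLU in between, the single folded layer no longer matches
same_with_relu = torch.allclose(relu_out, merged_out, atol=1e-5)
print(same_without_relu, same_with_relu)
```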

If we took our model and saved the parameters that define it, this would be known as a **pre-trained** model. Transfer learning is then the idea that, if we already have a model that can predict handwritten digits very well (0 through 9), and we have a new dataset of the lowercase alphabet (a through z), do we really need to train a new model from scratch? Or can we just grab our pre-trained model and its already optimized weights and fine-tune it to the new dataset?
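
As a minimal sketch of what "saving the parameters" can look like in PyTorch: the tiny model and the in-memory buffer below are stand-ins chosen for illustration, not part of this tutorial's pipeline. A real workflow would usually write to a file path instead of a buffer.

```python
import io
import torch
import torch.nn as nn

# A tiny stand-in model (hypothetical; any nn.Module works the same way)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Save only the parameters (the "bag of numbers"), not the Python object
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Later: rebuild the same architecture and load the saved parameters into it
restored = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
restored.load_state_dict(torch.load(buffer))

# Both models now produce the same outputs for the same input
x = torch.rand(3, 4)
with torch.no_grad():
    outputs_match = torch.equal(model(x), restored(x))
print(outputs_match)
```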

Intuitively you can think about it this way. Let's say we show a kid some pictures of dogs and cats (like the dataset we are working on). After we show them some images of these dogs and cats, they will be able to classify them pretty well all on their own! Now we want to give them a new task: classify between tigers and wolves. Obviously tigers are not the same as cats and wolves are not the same as dogs, but they do share a bunch of similarities, right? So the kid would then use their previous knowledge (the **pre-trained** model) and expand it to be able to do this new task.

## AlexNet and Convolutions

We will be skipping ahead a bit here. In the next lesson, PyTorch for Vision, we will cover in detail the ideas behind convolutions and why they are much better than Dense Linear layers for images. For now, pretend the model we will use is some black box that takes in an image and outputs some probabilities of belonging to different classes. The specific model we will look at is AlexNet, which was probably the first real evidence of the power of deep learning in 2012. It was trained on the ImageNet task (predict across 1000 classes of images) and beat all other methods by a significant margin. We will be loading a PyTorch-defined version of this model, but don't worry, we will implement this entire model from scratch in the next lesson! I am just trying to offer some intuition for Deep Learning before we get into the weeds!

## More Intuition about Deep Learning 

As we have seen, the reason it is called "Deep" Learning is because the model physically has depth and many layers of computation. Because of this we actually get some interesting properties!

![Image](../../src/visuals/conv_feature_extracts.jpeg)

[credit](https://anhvnn.wordpress.com/2018/02/01/deep-learning-computer-vision-and-convolutional-neural-networks/)

This is a common image you will see when exploring Deep Learning. You can see that earlier features that are learned are focused on simple things to find in an image, like edges, lines, etc... Regardless of the Image problem at hand, we always have to detect these features, so the underlying weights in a pre-trained model that encode this can easily be used. As we move forward through the model, the extracted features in the image become much more specific to the dataset, and in this case there is some type of face detection. It is really cool though that convolutions (Again a black box that does some type of image processing) can extract abstract features at such a high level!

Now let's say we want to use this pre-trained model to detect dogs instead. On its own, this model would do poorly, as it is optimized to find human faces! But the initial layers that extract low-level image features are probably fine to leave alone. Therefore, the typical strategy is to keep the earlier parts of the model static (don't allow gradient updates) and then fine-tune the later layers.

There are some benefits for this:

- We can pre-train a model on a giant dataset, and then fine-tune it to a similar problem that we have limited data for
- We can control and avoid problems like overfitting better
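
The freezing strategy described above can be sketched on a toy model. The two-block network here is a hypothetical stand-in (not AlexNet): `backbone` plays the role of the early layers and `head` plays the role of the later layers we fine-tune.

```python
import torch
import torch.nn as nn

# Hypothetical two-block model: early layers ("backbone") and later layers ("head")
backbone = nn.Sequential(nn.Linear(10, 16), nn.ReLU())
head = nn.Linear(16, 2)
model = nn.Sequential(backbone, head)

# Freeze the early layers: no gradients will be computed for them
for param in backbone.parameters():
    param.requires_grad_(False)

# One forward/backward pass on random data
x = torch.rand(8, 10)
labels = torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model(x), labels)
loss.backward()

# Frozen parameters receive no gradient; the head does
backbone_has_grad = any(p.grad is not None for p in backbone.parameters())
head_has_grad = all(p.grad is not None for p in head.parameters())
print(backbone_has_grad, head_has_grad)
```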

In [12]:
### Imports ###
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.models import AlexNet, AlexNet_Weights
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
from torchvision.datasets import ImageFolder
from tqdm import tqdm
import numpy as np

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import warnings
warnings.filterwarnings("ignore")

### Recap From Before

We will be skipping the details of the setup for the DataLoader and everything here. If you have any confusion, go back to my [PyTorch DataLoader](https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20DataLoaders) tutorial where I go in depth about how this works!

In [2]:
### Build Cats vs Dogs Dataset ###
PATH_TO_DATA = "./data/cats_vs_dogs/"

### DEFINE TRANSFORMATIONS ###
normalizer = transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]) ### IMAGENET MEAN/STD ###
train_transforms = transforms.Compose([
                                        transforms.Resize((224,224)),
                                        transforms.RandomHorizontalFlip(),
                                        transforms.ToTensor(),
                                        normalizer
                                      ])


dataset = ImageFolder(PATH_TO_DATA, transform=train_transforms)

train_samples, test_samples = int(0.9 * len(dataset)), len(dataset) - int(0.9 * len(dataset))
train_dataset, val_dataset = torch.utils.data.random_split(dataset, lengths=[train_samples, test_samples])

In [3]:
dataset

Dataset ImageFolder
    Number of datapoints: 37500
    Root location: ./data/cats_vs_dogs/
    StandardTransform
Transform: Compose(
               Resize(size=(224, 224), interpolation=bilinear, max_size=None, antialias=True)
               RandomHorizontalFlip(p=0.5)
               ToTensor()
               Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
           )

### Lets Now Load the Model!!
We will now load the AlexNet model from PyTorch and then poke around a bit to see how these models are typically defined! Below you will see a ton of things you haven't seen before, but it's ok! We will go into detail in the next tutorial. For now let's just take a rough look at what's going on:

In [4]:
model = AlexNet()
print(model)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

All PyTorch models are saved like this, where every layer has a name and we can access them individually. For example, if we want to look at the 1st linear layer in the classifier we can do it this way:

```
model.classifier[1]
```

In [5]:
model.classifier[1]

Linear(in_features=9216, out_features=4096, bias=True)

### What is the Output of the Model?

Notice the very last linear layer looks like this:

```
Linear(in_features=4096, out_features=1000, bias=True)
```

We can clearly see here that the model is outputting 1000 values because the ImageNet task requires us to predict across 1000 classes. As we start thinking about how we want to use this model, if we have a binary classification problem (Cats vs Dogs), then we will need to change this last linear layer. One option is a single output value (like we had in logistic regression). Here we will instead output 2 values, one per class, so that we can train with cross-entropy loss. Applying a softmax to those 2 values gives a probability for each class, and the class with a probability greater than 50% is our prediction.

**Lets Update the Model!**
We know that we can access the last layer of the model by doing *model.classifier[6]* and we want to change this to a linear layer. We saw in the [Intro to PyTorch](https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/Intro%20to%20PyTorch) that we can define a linear layer as:

```
Linear(in_features, out_features)
```

but we have to make sure that the input features of the new layer match the output features of the previous Linear layer. The previous Linear layer (*model.classifier[4]*) outputs 4096 features, so we will keep that as our input!

In [6]:
model = AlexNet()
model.classifier[6] = nn.Linear(4096, 2)
model

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

Now we can clearly see that our model is outputting 2 features rather than 1000! This gives us everything we need, so let's pass a test tensor through it to make sure it's all functional, and then we can do a bit more exploration. As we said before, the common shape for Image data is:

```
[Batch Size x Channels x Image Height x Image Width]
```

We will just make a dummy tensor that matches the CatsVsDogs dataset we made, with the shape [16 x 3 x 224 x 224]

In [7]:
rand_data = torch.rand(16,3,224,224)
model_output = model(rand_data)
model_output.shape

torch.Size([16, 2])

We can see that we passed in 16 images and the output is 2 classes per image, so the model is fully functional!!
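
The 2 values per image are raw scores (logits), not probabilities. As a small sketch of how they turn into predictions, the random tensor below is a stand-in for `model_output`. Note that `nn.CrossEntropyLoss` expects the raw logits and applies the log-softmax internally, so the softmax here is only for interpreting outputs.

```python
import torch

torch.manual_seed(0)

# Stand-in for `model_output`: raw scores (logits) for 16 images and 2 classes
logits = torch.randn(16, 2)

# Softmax over the class dimension turns each row into probabilities
probs = torch.softmax(logits, dim=1)

# The predicted class is the one with the highest probability
predictions = probs.argmax(dim=1)

print(probs.sum(dim=1))    # each row sums to 1, up to floating point error
print(predictions.shape)
```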

### Checking out the Model Parameters 
The method **.named_parameters()** that every model has allows us to iterate through all the names and parameters of the model! You will notice that the parameters are also stored as Tensors, and these are the numbers that are updated through optimization! We can add up the number of parameters in the model and see that after we reduced the final output dimension to 2, we have roughly 57 million parameters! That may sound like a lot, but many of today's largest Deep Learning models have billions of parameters.

In [8]:
total_parameters = 0
for name, params in model.named_parameters():
    num_params = int(torch.prod(torch.tensor(params.shape)))
    print(name,":", params.shape, "Num Parameters:", num_params)
    total_parameters += num_params
    
print("------------------------")
print("Total Parameters in Model", total_parameters)
    

features.0.weight : torch.Size([64, 3, 11, 11]) Num Parameters: 23232
features.0.bias : torch.Size([64]) Num Parameters: 64
features.3.weight : torch.Size([192, 64, 5, 5]) Num Parameters: 307200
features.3.bias : torch.Size([192]) Num Parameters: 192
features.6.weight : torch.Size([384, 192, 3, 3]) Num Parameters: 663552
features.6.bias : torch.Size([384]) Num Parameters: 384
features.8.weight : torch.Size([256, 384, 3, 3]) Num Parameters: 884736
features.8.bias : torch.Size([256]) Num Parameters: 256
features.10.weight : torch.Size([256, 256, 3, 3]) Num Parameters: 589824
features.10.bias : torch.Size([256]) Num Parameters: 256
classifier.1.weight : torch.Size([4096, 9216]) Num Parameters: 37748736
classifier.1.bias : torch.Size([4096]) Num Parameters: 4096
classifier.4.weight : torch.Size([4096, 4096]) Num Parameters: 16777216
classifier.4.bias : torch.Size([4096]) Num Parameters: 4096
classifier.6.weight : torch.Size([2, 4096]) Num Parameters: 8192
classifier.6.bias : torch.Size([2]

### Lets Train This Model From Scratch!
Our model is currently randomly initialized, so we will be training the model from scratch just to see how it does!

In [9]:
### SELECT DEVICE ###
# GPU device configuration
if torch.cuda.is_available():
  DEVICE = torch.device('cuda')
  print('Using GPU')
elif torch.backends.mps.is_available():
  DEVICE = torch.device('mps')
  print('Using MPS')
else:
  DEVICE = torch.device('cpu')
  print('Using CPU')
  
print(f"Training on Device {DEVICE}")

Using MPS
Training on Device mps


In [10]:
### LOAD IN and Modify AlexNet Model ###
model = AlexNet()
model.classifier[6] = nn.Linear(4096, 2)
model = model.to(DEVICE)

### MODEL TRAINING INPUTS ###
epochs = 5
optimizer = optim.Adam(params=model.parameters(), lr=0.0001)
loss_fn = nn.CrossEntropyLoss()
batch_size = 128

### BUILD DATALOADERS ###
trainloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
valloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

def train(model, device, epochs, optimizer, loss_fn, batch_size, trainloader, valloader):
    log_training = {"epoch": [],
                    "training_loss": [],
                    "training_acc": [],
                    "validation_loss": [],
                    "validation_acc": []}

    for epoch in range(1, epochs + 1):
        print(f"Starting Epoch {epoch}")
        training_losses, training_accuracies = [], []
        validation_losses, validation_accuracies = [], []

        model.train()  # enable Dropout for the training pass
        for image, label in tqdm(trainloader):
            image, label = image.to(device), label.to(device)
            optimizer.zero_grad()
            out = model(image)

            ### CALCULATE LOSS ##
            loss = loss_fn(out, label)
            training_losses.append(loss.item())

            ### CALCULATE ACCURACY ###
            predictions = torch.argmax(out, axis=1)
            accuracy = (predictions == label).sum() / len(predictions)
            training_accuracies.append(accuracy.item())

            loss.backward()
            optimizer.step()

        model.eval()  # disable Dropout so validation is deterministic
        for image, label in tqdm(valloader):
            image, label = image.to(device), label.to(device)
            with torch.no_grad():
                out = model(image)

                ### CALCULATE LOSS ##
                loss = loss_fn(out, label)
                validation_losses.append(loss.item())

                ### CALCULATE ACCURACY ###
                predictions = torch.argmax(out, axis=1)
                accuracy = (predictions == label).sum() / len(predictions)
                validation_accuracies.append(accuracy.item())

        training_loss_mean, training_acc_mean = np.mean(training_losses), np.mean(training_accuracies)
        valid_loss_mean, valid_acc_mean = np.mean(validation_losses), np.mean(validation_accuracies)

        log_training["epoch"].append(epoch)
        log_training["training_loss"].append(training_loss_mean)
        log_training["training_acc"].append(training_acc_mean)
        log_training["validation_loss"].append(valid_loss_mean)
        log_training["validation_acc"].append(valid_acc_mean)

        print("Training Loss:", training_loss_mean) 
        print("Training Acc:", training_acc_mean)
        print("Validation Loss:", valid_loss_mean)
        print("Validation Acc:", valid_acc_mean)
        
    return log_training, model


random_init_logs, model = train(model = model,
                                device = DEVICE,
                                epochs = epochs,
                                optimizer = optimizer,
                                loss_fn = loss_fn,
                                batch_size = batch_size,
                                trainloader = trainloader,
                                valloader = valloader)



Starting Epoch 1


100%|██████████| 264/264 [01:46<00:00,  2.47it/s]
100%|██████████| 30/30 [00:27<00:00,  1.07it/s]


Training Loss: 0.6399982165206562
Training Acc: 0.6662915951826356
Validation Loss: 0.6373635490735372
Validation Acc: 0.6662417769432067
Starting Epoch 2


100%|██████████| 264/264 [01:42<00:00,  2.58it/s]
100%|██████████| 30/30 [00:27<00:00,  1.08it/s]


Training Loss: 0.637497083255739
Training Acc: 0.666394137523391
Validation Loss: 0.637366896867752
Validation Acc: 0.6662417769432067
Starting Epoch 3


100%|██████████| 264/264 [01:42<00:00,  2.58it/s]
100%|██████████| 30/30 [00:27<00:00,  1.08it/s]


Training Loss: 0.6375585728974054
Training Acc: 0.6662640668677561
Validation Loss: 0.6369383871555329
Validation Acc: 0.6662417769432067
Starting Epoch 4


100%|██████████| 264/264 [01:42<00:00,  2.58it/s]
100%|██████████| 30/30 [00:27<00:00,  1.08it/s]


Training Loss: 0.637465353039178
Training Acc: 0.6662640668677561
Validation Loss: 0.6371716062227885
Validation Acc: 0.6662417769432067
Starting Epoch 5


100%|██████████| 264/264 [01:42<00:00,  2.57it/s]
100%|██████████| 30/30 [00:27<00:00,  1.08it/s]

Training Loss: 0.6377316464980444
Training Acc: 0.6663363284685395
Validation Loss: 0.6382428566614787
Validation Acc: 0.6662417769432067





## Lets Now Load our PreTrained Weights
We will still be training the entire model end to end, but it will now start with the pretrained weights from ImageNet.

**Note**: The pretrained model by default has a final linear layer that outputs to 1000 classes! When we swap this last portion out with a new Linear layer that outputs to 2 classes, only that layer will be randomly initialized. The remaining calculations (previous linear layers, convolutions, etc.) are all still using the pretrained values.

In [13]:
model = torch.hub.load('pytorch/vision:v0.10.0', 'alexnet', pretrained=True)
model.classifier[6] = nn.Linear(4096, 2)
model = model.to(DEVICE)

Downloading: "https://github.com/pytorch/vision/zipball/v0.10.0" to /Users/kmm/.cache/torch/hub/v0.10.0.zip
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /Users/kmm/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|██████████| 233M/233M [00:06<00:00, 40.4MB/s] 


In [14]:
### MODEL TRAINING INPUTS ###
epochs = 2
optimizer = optim.Adam(params=model.parameters(), lr=0.0001)
loss_fn = nn.CrossEntropyLoss()
batch_size = 128

### BUILD DATALOADERS ###
trainloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
valloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

pretrained_logs, model = train(model=model,
                               device=DEVICE,
                               epochs=epochs,
                               optimizer=optimizer,
                               loss_fn=loss_fn,
                               batch_size=batch_size,
                               trainloader=trainloader,
                               valloader=valloader)


Starting Epoch 1


100%|██████████| 264/264 [01:43<00:00,  2.56it/s]
100%|██████████| 30/30 [00:27<00:00,  1.07it/s]


Training Loss: 0.6457205771496801
Training Acc: 0.6625491378433777
Validation Loss: 0.6480644782384236
Validation Acc: 0.6662417769432067
Starting Epoch 2


100%|██████████| 264/264 [01:42<00:00,  2.57it/s]
100%|██████████| 30/30 [00:27<00:00,  1.08it/s]

Training Loss: 0.6370814320715991
Training Acc: 0.6662647550304731
Validation Loss: 0.6378946522871654
Validation Acc: 0.6662417769432067





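With pretrained weights we would expect to beat the performance of the "From Scratch" model in fewer epochs. Obviously, if we kept training the randomly initialized model for longer it could reach a similar performance, but the point is the benefit when you have limited compute. The run logged above does not show that benefit yet: validation accuracy is still about 66.6%, the same as the randomly initialized model. The benefit is much clearer in the next section, where we freeze the pretrained layers and only train the classifier head.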

## Load PreTrained Weights but Only Train the Final Classifier Layer
Like we mentioned previously, the entire model starts with the pretrained weights, but when we swap the last linear layer with one that outputs to 2 classes, that layer will be randomly initialized. A common technique is then to freeze the remaining layers and only train this classifier head. To do this, let's look at a flag that many tensors have in PyTorch!

**Note**: All parameter Tensors (at least by default) in your PyTorch model will have this. You will see that I am only showing the first layer that includes "bias" in the name. This is only for visualization, as the bias tensors are pretty small compared to the massive weight tensors that belong to the convolutions. The idea is the same regardless, as the bias is also a learnable parameter!

In [11]:
for name, param in model.named_parameters():
    if "bias" in name:
        print(name)
        print(param)
        break

features.0.bias
Parameter containing:
tensor([-0.9706, -2.8080, -0.0382, -0.0790, -0.1152,  0.0244, -0.0754, -1.4167,
         1.6432, -0.1007, -0.0164, -0.1283, -0.0677, -0.0344, -0.0743, -1.2976,
        -0.0519,  0.0111, -0.1036, -1.1884, -0.1380, -0.0497, -0.0793, -0.0420,
        -0.0970, -0.0704, -1.9365, -0.0869, -0.1393, -0.1974, -0.1294, -2.0085,
        -0.0485, -0.0630, -0.0360, -0.3865, -2.7826,  0.6600, -0.1665, -2.1298,
         0.0531, -0.0287, -0.1711, -0.0606, -0.4209, -1.9391, -1.2091,  0.0143,
        -0.1081, -0.0254, -0.1512, -1.8519, -0.0936, -0.0186, -0.0702, -0.0576,
        -0.0627, -0.0736, -1.2676, -0.1170, -0.0437, -0.3276,  0.0489, -0.0151],
       device='cuda:0', requires_grad=True)


Notice the flag at the end of the Tensor: 
```
requires_grad=True
```

This indicates that when PyTorch is optimizing the model, this tensor allows for gradient updates. Therefore, if we want to turn this off for this layer, we just need to turn that flag to False. We will want to repeat this process of turning off gradient updates for all layers except for our last one!

In [12]:
for name, param in model.named_parameters():
    print(name)

features.0.weight
features.0.bias
features.3.weight
features.3.bias
features.6.weight
features.6.bias
features.8.weight
features.8.bias
features.10.weight
features.10.bias
classifier.1.weight
classifier.1.bias
classifier.4.weight
classifier.4.bias
classifier.6.weight
classifier.6.bias


As we can see, the name of our last classifier includes "classifier.6", so we will use that to turn the gradients off on everything else!

In [13]:
model = torch.hub.load('pytorch/vision:v0.10.0', 'alexnet', pretrained=True)
model.classifier[6] = nn.Linear(4096, 2)

# Check the name of all the parameters
for name, param in model.named_parameters():
    if "classifier.6" not in name:
        param.requires_grad_(False) # In-place turn off gradient updates

        
for name, param in model.named_parameters():
    if "bias" in name:
        print(name)
        print(param)
        break

Using cache found in /home/priyam/.cache/torch/hub/pytorch_vision_v0.10.0


features.0.bias
Parameter containing:
tensor([-0.9705, -2.8070, -0.0371, -0.0795, -0.1159,  0.0252, -0.0752, -1.4181,
         1.6454, -0.0990, -0.0161, -0.1282, -0.0658, -0.0345, -0.0743, -1.2977,
        -0.0505,  0.0121, -0.1013, -1.1887, -0.1380, -0.0492, -0.0789, -0.0405,
        -0.0958, -0.0705, -1.9374, -0.0850, -0.1388, -0.1968, -0.1279, -2.0095,
        -0.0476, -0.0604, -0.0351, -0.3843, -2.7823,  0.6605, -0.1655, -2.1293,
         0.0543, -0.0274, -0.1703, -0.0593, -0.4215, -1.9394, -1.2094,  0.0153,
        -0.1081, -0.0248, -0.1503, -1.8516, -0.0928, -0.0177, -0.0700, -0.0582,
        -0.0630, -0.0721, -1.2678, -0.1176, -0.0441, -0.3259,  0.0507, -0.0146])


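Notice that the `requires_grad=True` flag is now gone from the printout! PyTorch only prints the flag when it is True, so its absence means gradient updates are turned off for this tensor. We should be good to go to retrain the model one more time.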

In [14]:
model = torch.hub.load('pytorch/vision:v0.10.0', 'alexnet', pretrained=True)
model.classifier[6] = nn.Linear(4096, 2)

# Check the name of all the parameters
for name, param in model.named_parameters():
    if "classifier.6" not in name:
        param.requires_grad_(False) # In-place turn off gradient updates

model = model.to(DEVICE)

### MODEL TRAINING INPUTS ###
epochs = 2
optimizer = optim.Adam(params=model.parameters(), lr=0.0001)
loss_fn = nn.CrossEntropyLoss()
batch_size = 128

### BUILD DATALOADERS ###
trainloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
valloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4)


frozen_logs, model = train(model=model,
                           device=DEVICE,
                           epochs=epochs,
                           optimizer=optimizer,
                           loss_fn=loss_fn,
                           batch_size=batch_size,
                           trainloader=trainloader,
                           valloader=valloader)



Using cache found in /home/priyam/.cache/torch/hub/pytorch_vision_v0.10.0


Starting Epoch 1


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 176/176 [00:15<00:00, 11.44it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 10.02it/s]


Training Loss: 0.20389853561805052
Training Acc: 0.9185770021920855
Validation Loss: 0.14128499366343023
Validation Acc: 0.9455771148204803
Starting Epoch 2


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 176/176 [00:15<00:00, 11.50it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 10.11it/s]

Training Loss: 0.12450791882689703
Training Acc: 0.9513974203304811
Validation Loss: 0.12310790345072746
Validation Acc: 0.9471144139766693





## Lets Roundup All the Ideas Now!
Pretrained models allow us to "Transfer" knowledge from a previous task to a new one. The main benefits to this are as follows:
- Less data is needed to train the model, which lets us work on niche tasks where only a small dataset is available
- Less concern for overfitting. Typically, if you train a large model on a small dataset, it will quickly memorize the data. Instead, we can train just a small part of the model and keep the rest as a feature extractor.
- Less compute! Deep Learning already takes long enough to train. When most of the model is frozen, we have fewer parameters to compute gradient updates for, so it saves us quite a bit of compute.


Pretrained models are the present and future of deep learning. It has become the gold standard to take large language, vision, speech, or other models and transfer them to new tasks. Even more interesting, these models have become so powerful that they can often transfer to seemingly unrelated tasks (e.g. using NLP-based models for protein analysis). This technique also greatly democratizes Deep Learning, because if the models are easier to train and don't require entire data centers like the pretraining stage did, then we can all enjoy solving the problems we are interested in!