In [1]:
%matplotlib inline
import torch,os,torchvision
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, models, transforms
from PIL import Image
from sklearn.model_selection import StratifiedShuffleSplit
torch.__version__

'1.0.0'

# 4.1 Fine-tuning
The earlier introduction to convolutional neural networks mentioned that PyTorch provides several classic network models that have already been trained. What are these pre-trained models for? Their main use is fine-tuning.

## 4.1.1 What is fine-tuning

What should you do when you do not have much training data for a task?
A common answer is to find a model that someone else has already trained on a similar task, swap in your own data, adjust the parameters, and train again. This is fine-tuning.
The classic network models provided with PyTorch were all trained on the ImageNet dataset. If your own training data is insufficient, these models can serve as the base model.

### Why fine-tune
1. When the dataset itself is small (a few thousand images), training a large neural network with tens of millions of parameters from scratch is unrealistic: the larger the model, the more data it needs, and overfitting cannot be avoided. If you still want the strong feature extraction ability of a large network, fine-tuning an already trained model is the only practical option.
2. Training cost can be reduced: if you export feature vectors and do transfer learning on them, the later training is very cheap. A CPU handles it without strain, and no dedicated deep learning machine is needed.
3. A model that others have put a great deal of effort into training is very likely to be stronger than one you build from scratch. There is no need to reinvent the wheel.


### Transfer Learning
Some people always associate transfer learning with neural network training, but the two concepts were originally unrelated.
Transfer learning is a branch of machine learning. It has become so closely tied to neural networks because image recognition has developed so quickly and works so well that almost all current transfer learning work is in that direction. As a result, the transfer learning most people encounter is based on neural networks for computer vision, and this article uses that as its example too.

The original purpose of transfer learning is to save the time spent on manually labeling samples: a model is transferred from a domain that already has labeled data to a domain without labels, in order to train a model suited to the new domain. Learning the target domain directly from scratch is too expensive, so we use existing related knowledge to help learn the new knowledge as quickly as possible.

A simple example illustrates the idea. What do we learn when we learn programming? Syntax, language-specific APIs, control flow, object-oriented programming, design patterns, and so on.

Syntax and APIs are specific to each language, but object-oriented programming and design patterns are shared. If we learn Java first and then C# or Python, we do not need to learn object-oriented programming and design patterns again, because the principles are the same. When learning C# we can even skip much of the syntax. This is the idea of transfer learning: abstract the common concepts and learn only what is different.

According to the learning method, transfer learning can be divided into sample-based, feature-based, model-based, and relationship-based transfer. These are not covered in detail here.

### Relationship between the two
In fact, there is no strict distinction between "transfer learning" and "fine-tuning", and the two terms are often used interchangeably. The latter is more commonly used to describe the later adjustment stage of transfer learning.
My personal understanding is that fine-tuning is a part of transfer learning. It is best described as one technique within it.

## 4.1.2 How to fine-tune
Fine-tuning works differently in different fields. In speech recognition, for example, it is generally the first few layers that are fine-tuned, whereas for image recognition it is the last few layers. I can only offer my own understanding of the reason here.

For images, the first few layers of a CNN learn low-level features such as points, lines, and surfaces. These low-level features can be extracted from any image, so we treat them as general-purpose and fine-tune only the high-level features that are built from them. For example, whether those points, lines, and surfaces form a circle, an ellipse, or a square, and what those shapes represent, is what we need to train later.

For speech, a word keeps the same meaning while its pronunciation and spelling differ between languages. For example, apple (English) and Apfel (German) mean the same thing but are pronounced and written differently. The high-level features (the meaning) are the same, so we only need to fine-tune the low-level features.

The following covers only fine-tuning for computer vision and is taken from [CS231n](http://cs231n.github.io/transfer-learning/)

- ConvNet as fixed feature extractor

There are two approaches:
1. Use the features produced by the layer just before the last fc layer to train a linear classifier (such as an SVM)
2. Retrain the last fc layer
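
As a minimal sketch of the second approach, the snippet below freezes a stand-in backbone and trains only a new final layer. The tiny `backbone`, the layer sizes, and the random data are assumptions chosen so that the code runs without downloading any weights. With a real pre-trained network the same pattern would apply to its last fc layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained feature extractor.
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 3)  # new classifier for 3 made-up classes

# Freeze the backbone so that only the head receives gradient updates.
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(backbone, head)
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 8)           # made-up inputs
y = torch.randint(0, 3, (32,))   # made-up labels

backbone_before = [p.clone() for p in backbone.parameters()]
head_before = head.weight.clone()

for _ in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

backbone_unchanged = all(
    torch.equal(a, b) for a, b in zip(backbone_before, backbone.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
print(backbone_unchanged, head_changed)
```

Only `head.parameters()` is handed to the optimizer, so even without the `requires_grad = False` loop the backbone would not be updated. Freezing additionally avoids computing gradients for it.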


- Fine-tuning the ConvNet

Freeze the parameters of the first few layers and fine-tune only the last few layers.

There are some tips that apply to both of the schemes above. For example, first compute the convolutional-layer feature vectors of the pre-trained model for all training and test data, then set the pre-trained model aside and train only your own simple fully connected network on those features.
One advantage of this method is that it saves computing resources: each iteration no longer runs the data through the whole network, only through the small fully connected network.
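
The tip above can be sketched as follows. This is an illustration under assumptions, not the method of any particular library: `extractor` is a hypothetical frozen network standing in for the pre-trained convolutional layers, and the data is random. The expensive forward pass runs once, and only a small classifier is trained on the cached result.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical frozen extractor standing in for the pre-trained layers.
extractor = nn.Sequential(nn.Linear(8, 16), nn.ReLU()).eval()

x = torch.randn(64, 8)           # made-up inputs
y = torch.randint(0, 3, (64,))   # made-up labels for 3 classes

# Step 1: run the expensive extractor once, without tracking gradients.
with torch.no_grad():
    cached = extractor(x)

# Step 2: train only a small classifier on the cached features.
head = nn.Linear(16, 3)
optimizer = torch.optim.Adam(head.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(cached, y), batch_size=16, shuffle=True)

for epoch in range(3):
    for feats, labels in loader:
        optimizer.zero_grad()
        loss = criterion(head(feats), labels)
        loss.backward()
        optimizer.step()

print(cached.shape, cached.requires_grad)
```

Note that cached features correspond to one fixed version of each input. If the inputs are randomly augmented, the cache captures only the augmentation that was drawn when it was built.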


- Pretrained models

This is essentially the same as the second scheme, but more extreme: use the entire pre-trained model as initialization and then fine-tune the whole network rather than only some layers. The amount of computation is very large, because the pre-trained weights serve only as an initialization.




## 4.1.3 Notes

1. If the new dataset is similar to the original dataset, you can simply fine-tune the final FC layer or attach a new classifier.
2. If the new dataset is relatively small and quite different from the original dataset, you can start training from the middle of the model and fine-tune only the last few layers.
3. If the new dataset is relatively small and quite different from the original dataset and the approach above still does not work, it is best to retrain the whole network and use the pre-trained model only as the initialization of the new model.
4. The input size for the new dataset should match that of the original dataset. For example, when the fc layer expects a fixed-size input, images of a different size fed into the CNN can cause a shape error.
5. If the input size differs, you can add a convolution or pooling layer before the last fc layer so that its output matches what the fc layer expects. This noticeably reduces accuracy, so it is not recommended.
6. Different learning rates can be set for different layers. In general, the layers initialized from the pre-trained weights should use a learning rate smaller than the initial learning rate (typically about one tenth of it), so that the pre-trained weights are not distorted too quickly, while the new layers use the initial learning rate and can converge quickly.
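
Note 6 maps onto optimizer parameter groups in PyTorch. The sketch below is only an illustration: the two small modules and the factor of ten are assumptions, and `pretrained` stands in for layers loaded from a pre-trained model.

```python
import torch
import torch.nn as nn

# Hypothetical two-part model: "pre-trained" layers plus a newly added head.
pretrained = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
new_head = nn.Linear(16, 3)

base_lr = 1e-3
optimizer = torch.optim.SGD(
    [
        # Pre-trained layers: one tenth of the base rate, for gentle updates.
        {"params": pretrained.parameters(), "lr": base_lr / 10},
        # New layer: no "lr" key, so it falls back to the default below.
        {"params": new_head.parameters()},
    ],
    lr=base_lr,
    momentum=0.9,
)

lrs = [group["lr"] for group in optimizer.param_groups]
print(lrs)
```

Each dictionary is one parameter group. Any option not set inside a group is taken from the keyword arguments passed to the optimizer.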

## 4.1.3 Fine-tuning examples
Here we use the official pre-trained resnet50 on the Kaggle [dog breed identification](https://www.kaggle.com/c/dog-breed-identification) competition as a simple fine-tuning example.

First, download and unpack the official data, keeping the directory structure intact. Then specify the location of the data directory here and take a look at the contents.

In [2]:
DATA_ROOT ='data'
all_labels_df = pd.read_csv(os.path.join(DATA_ROOT,'labels.csv'))
all_labels_df.head()

Unnamed: 0,id,breed
0,000bec180eb18c7604dcecc8fe0dba07,boston_bull
1,001513dfcb2ffafc82cccf4d8bbaba97,dingo
2,001cdf01b096e06d78e9e5112d419397,pekinese
3,00214f311d5d2247d5dfe4fe24b2303d,bluetick
4,0021f9ceb3235effd7fcde7f7538ed62,golden_retriever


Collect the dog breeds and assign a number to each one.

Two dictionaries are defined here, one mapping breed name to index and one mapping index to breed name, which is convenient for later processing.

In [3]:
breeds = all_labels_df.breed.unique()
breed2idx = dict((breed,idx) for idx,breed in enumerate(breeds))
idx2breed = dict((idx,breed) for idx,breed in enumerate(breeds))
len(breeds)

120
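
To make the two mappings concrete, here is a small pure-Python sketch with made-up breed names (the list `labels` is an assumption standing in for the `breed` column). It shows that encoding with one dictionary and decoding with the other is a round trip.

```python
# Made-up breed names standing in for the `breed` column.
labels = ["dingo", "pekinese", "dingo", "bluetick", "pekinese"]

# Keep first-seen order, which is also the order pandas' unique() returns.
breeds = list(dict.fromkeys(labels))
breed2idx = {breed: idx for idx, breed in enumerate(breeds)}
idx2breed = {idx: breed for breed, idx in breed2idx.items()}

encoded = [breed2idx[b] for b in labels]   # names -> integer class ids
decoded = [idx2breed[i] for i in encoded]  # integer class ids -> names
print(encoded)   # [0, 1, 0, 2, 1]
```

The integer ids are what the loss function consumes, while `idx2breed` turns a predicted id back into a readable name.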

Add the label index to the table as a new column

In [4]:
all_labels_df['label_idx'] = [breed2idx[b] for b in all_labels_df.breed]
all_labels_df.head()

Unnamed: 0,id,breed,label_idx
0,000bec180eb18c7604dcecc8fe0dba07,boston_bull,0
1,001513dfcb2ffafc82cccf4d8bbaba97,dingo,1
2,001cdf01b096e06d78e9e5112d419397,pekinese,2
3,00214f311d5d2247d5dfe4fe24b2303d,bluetick,3
4,0021f9ceb3235effd7fcde7f7538ed62,golden_retriever,4


Since our dataset is not laid out in the format that the official dataset classes expect, we define a Dataset ourselves

In [5]:
class DogDataset(Dataset):
    def __init__(self, labels_df, img_path, transform=None):
        self.labels_df = labels_df
        self.img_path = img_path
        self.transform = transform

    def __len__(self):
        return self.labels_df.shape[0]

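    def __getitem__(self, idx):
        # Look the row up by position so the result does not depend on the DataFrame index
        row = self.labels_df.iloc[idx]
        image_name = os.path.join(self.img_path, row['id']) + '.jpg'
        # Force 3 channels so grayscale or CMYK files do not break Normalize
        img = Image.open(image_name).convert('RGB')
        label = int(row['label_idx'])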

        if self.transform:
            img = self.transform(img)
        return img, label

Define some hyperparameters

In [6]:
IMG_SIZE = 224 # resnet50 expects 224x224 input, so all images are brought to this size
BATCH_SIZE = 256 # Needs roughly 4.6-5 GB of GPU memory; reduce it if memory is short, or raise it to 512 with more than 10 GB
IMG_MEAN = [0.485, 0.456, 0.406]
IMG_STD = [0.229, 0.224, 0.225]
CUDA=torch.cuda.is_available()
DEVICE = torch.device("cuda" if CUDA else "cpu")

Define image transformation rules for training and validation data

In [7]:
train_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.RandomResizedCrop(IMG_SIZE),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(30),
    transforms.ToTensor(),
    transforms.Normalize(IMG_MEAN, IMG_STD)
])

val_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.CenterCrop(IMG_SIZE),
    transforms.ToTensor(),
    transforms.Normalize(IMG_MEAN, IMG_STD)
])
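
`transforms.Normalize` applies `(x - mean) / std` to each channel of the tensor produced by `ToTensor`, whose values lie in [0, 1]. The sketch below reproduces that arithmetic by hand on a made-up image and also inverts it, which is useful when plotting normalized images. The random image and the variable names are assumptions for illustration only.

```python
import torch

IMG_MEAN = [0.485, 0.456, 0.406]
IMG_STD = [0.229, 0.224, 0.225]

# Shape (3, 1, 1) so the statistics broadcast over height and width.
mean = torch.tensor(IMG_MEAN).view(3, 1, 1)
std = torch.tensor(IMG_STD).view(3, 1, 1)

torch.manual_seed(0)
img = torch.rand(3, 4, 4)            # made-up image with values in [0, 1)

normalized = (img - mean) / std      # what Normalize computes per channel
restored = normalized * std + mean   # inverse, e.g. before plt.imshow

# A pixel equal to the channel mean is mapped to 0.
mean_pixel = (mean.expand(3, 4, 4) - mean) / std
print(torch.allclose(restored, img, atol=1e-6), mean_pixel.abs().max().item())
```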

Here we hold out 10% of the data as the validation set and train on the remaining 90%

In [8]:
dataset_names = ['train','valid']
stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_split_idx, val_split_idx = next(iter(stratified_split.split(all_labels_df.id, all_labels_df.breed)))
train_df = all_labels_df.iloc[train_split_idx].reset_index()
val_df = all_labels_df.iloc[val_split_idx].reset_index()
print(len(train_df))
print(len(val_df))

9199
1023
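
What "stratified" means here can be checked on toy data. In this sketch the labels are made up (three classes with 20 samples each). The split keeps the class proportions, so each class contributes the same share to the validation part.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up labels: 3 classes with 20 samples each.
y = np.repeat(["a", "b", "c"], 20)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, val_idx = next(splitter.split(X, y))

val_counts = {str(c): int((y[val_idx] == c).sum()) for c in np.unique(y)}
print(len(train_idx), len(val_idx), val_counts)
```

With 60 samples and `test_size=0.1`, 6 samples are held out, 2 from each class. A plain random split gives no such guarantee, which matters when some breeds have few images.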


Use PyTorch's DataLoader to load the data

In [9]:
image_transforms = {'train':train_transforms,'valid':val_transforms}

train_dataset = DogDataset(train_df, os.path.join(DATA_ROOT,'train'), transform=image_transforms['train'])
val_dataset = DogDataset(val_df, os.path.join(DATA_ROOT,'train'), transform=image_transforms['valid'])
image_dataset = {'train':train_dataset,'valid':val_dataset}

image_dataloader = {x:DataLoader(image_dataset[x],batch_size=BATCH_SIZE,shuffle=True,num_workers=0) for x in dataset_names}
dataset_sizes = {x:len(image_dataset[x]) for x in dataset_names}
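
How a `DataLoader` cuts a dataset into batches can be seen on a tiny synthetic dataset. Everything in this sketch is made up for illustration. It uses `shuffle=False`, which is a common choice for validation data because the order does not affect the metrics.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset: 10 samples with 3 features each, labels 0..9.
data = TensorDataset(
    torch.arange(30, dtype=torch.float32).view(10, 3),
    torch.arange(10),
)

loader = DataLoader(data, batch_size=4, shuffle=False, num_workers=0)
batch_sizes = [x.size(0) for x, _ in loader]
print(batch_sizes)   # [4, 4, 2]: the last batch holds the remainder

# The same arithmetic for 9199 training images and a batch size of 256.
batches_per_epoch = math.ceil(9199 / 256)
print(batches_per_epoch)   # 36
```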

Now configure the network. ImageNet has 1000 classes, while our dog data has only 120 breeds, so we replace the last fully connected layer of the model, changing its output from 1000 to 120, and train that layer

In [10]:
model_ft = models.resnet50(pretrained=True) # The official pre-trained weights are downloaded automatically here
# Freeze all parameters
for param in model_ft.parameters():
    param.requires_grad = False
# Print the original fully connected layer
print(model_ft.fc)
num_fc_ftr = model_ft.fc.in_features # Number of input features of the fc layer
model_ft.fc = nn.Linear(num_fc_ftr, len(breeds)) # New fc layer; its parameters are trainable by default
model_ft = model_ft.to(DEVICE) # Move the model to the device
print(model_ft) # Finally print the modified model

Linear(in_features=2048, out_features=1000, bias=True)
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (downsample): Sequential(
        (0): Co

)


Set training parameters

In [11]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam([
    {'params':model_ft.fc.parameters()}
], lr=0.001)#Specify the learning rate of the newly added fc layer

Define the training function

In [12]:
def train(model,device, train_loader, epoch):
    model.train()
    for batch_idx, data in enumerate(train_loader):
        x,y = data
        x=x.to(device)
        y=y.to(device)
        optimizer.zero_grad()
        y_hat = model(x)
        loss = criterion(y_hat, y)
        loss.backward()
        optimizer.step()
    print ('Train Epoch: {}\t Loss: {:.6f}'.format(epoch,loss.item()))

Define test function

In [13]:
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for i,data in enumerate(test_loader):
            x,y = data
            x=x.to(device)
            y=y.to(device)
            y_hat = model(x)
            # criterion returns the mean loss of the batch; the batch means are summed here
            test_loss += criterion(y_hat, y).item()
            pred = y_hat.max(1, keepdim=True)[1] # index of the highest score, i.e. the predicted class
            correct += pred.eq(y.view_as(pred)).sum().item()
    # Note: a sum of batch means divided by the sample count is much smaller than the per-sample loss
    n_samples = len(test_loader.dataset)
    test_loss /= n_samples
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, n_samples,
        100. * correct / n_samples))
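
One detail of the evaluation function is worth a worked example. `nn.CrossEntropyLoss` returns the mean over a batch by default, so adding up batch means and dividing by the number of samples yields a value roughly one batch size smaller than the true per-sample loss. The sketch below compares the two on made-up logits. Using `reduction="sum"` is shown as one possible way to obtain a per-sample average, not as a change to the notebook's code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 5)             # made-up scores: 8 samples, 5 classes
targets = torch.randint(0, 5, (8,))
batch_size = 4

sum_of_batch_means = 0.0
sum_of_sample_losses = 0.0
for start in range(0, len(targets), batch_size):
    lo = logits[start:start + batch_size]
    ta = targets[start:start + batch_size]
    # Default reduction: mean over the batch.
    sum_of_batch_means += F.cross_entropy(lo, ta).item()
    # reduction="sum": total loss of the batch.
    sum_of_sample_losses += F.cross_entropy(lo, ta, reduction="sum").item()

n = len(targets)
reported = sum_of_batch_means / n       # the quantity printed as "Average loss"
per_sample = sum_of_sample_losses / n   # true mean loss per sample
print(reported, per_sample)
```

With equally sized batches, `reported` equals `per_sample` divided by the batch size. This is why the "Average loss" values in the training log are so small compared with the training loss.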

Train for 9 epochs and look at the result

In [14]:
for epoch in range(1, 10):
    %time train(model=model_ft,device=DEVICE, train_loader=image_dataloader["train"],epoch=epoch)
    test(model=model_ft, device=DEVICE, test_loader=image_dataloader["valid"])

Train Epoch: 1	 Loss: 2.775527
Wall time: 1min 13s

Test set: Average loss: 0.0079, Accuracy: 700/1023 (68%)

Train Epoch: 2	 Loss: 1.965775
Wall time: 56.5 s

Test set: Average loss: 0.0047, Accuracy: 779/1023 (76%)

Train Epoch: 3	 Loss: 1.798122
Wall time: 56.4 s

Test set: Average loss: 0.0037, Accuracy: 790/1023 (77%)

Train Epoch: 4	 Loss: 1.596331
Wall time: 57.1 s

Test set: Average loss: 0.0031, Accuracy: 814/1023 (80%)

Train Epoch: 5	 Loss: 1.502677
Wall time: 56.3 s

Test set: Average loss: 0.0029, Accuracy: 822/1023 (80%)

Train Epoch: 6	 Loss: 1.430908
Wall time: 56.4 s

Test set: Average loss: 0.0028, Accuracy: 815/1023 (80%)

Train Epoch: 7	 Loss: 1.466642
Wall time: 56.4 s

Test set: Average loss: 0.0028, Accuracy: 824/1023 (81%)

Train Epoch: 8	 Loss: 1.368286
Wall time: 56.9 s

Test set: Average loss: 0.0025, Accuracy: 840/1023 (82%)

Train Epoch: 9	 Loss: 1.348546
Wall time: 56.9 s

Test set: Average loss: 0.0027, Accuracy: 814/1023 (80%)



After only 9 epochs the model reaches about 80% accuracy on the validation set, which is a reasonable result.

However, in every epoch each image has to pass through the whole network, and the frozen layers produce the same result every time for the same input, which wastes a lot of computing resources.
Below we save the outputs of the layers that need no back-propagation and whose weights are never updated. Later we can feed these saved results directly into the FC layer, or build a new network on top of them. This saves computation time, and if only the fully connected layer is trained, a CPU is enough.

## 4.1.4 Exporting feature vectors from the fixed layers
A thread on the [PyTorch Forum](https://discuss.pytorch.org/t/can-i-get-the-middle-layers-output-if-i-use-the-sequential-module/7070) suggests manually re-implementing the model's forward method. That sounds simple but is troublesome in practice, so it is not recommended.

Here we use a more advanced PyTorch API, the hook. First we define a hook function

In [15]:
in_list = [] # Collects the captured features
def hook(module, input, output):
    # input is a tuple holding the module's inputs in order; there is only one here, so take input[0]
    # To inspect everything that is passed in, uncomment:
    # for val in input:
    #     print("input val:", val)
    for i in range(input[0].size(0)):
        in_list.append(input[0][i].cpu().numpy())

Register the hook on the layer of interest so that it is called during the forward pass. Here we hook the pooling layer that sits just before the fully connected layer and record its input, that is, the feature maps before pooling, which contain more features than the pooled vector

In [16]:
model_ft.avgpool.register_forward_hook(hook)

<torch.utils.hooks.RemovableHandle at 0x24812a5e978>
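
The hook mechanism can be tried on a small model first. In this sketch `TinyNet` is a hypothetical network whose `pool` layer plays the role of `avgpool`. The hook records the shapes of the layer's input and output, and the returned handle is used to remove the hook afterwards.

```python
import torch
import torch.nn as nn

# Hypothetical small network; `pool` plays the role of `avgpool`.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        x = self.pool(self.conv(x))
        return self.fc(x.flatten(1))

captured = []

def hook(module, inputs, output):
    # `inputs` is a tuple of the positional arguments passed to the module.
    captured.append((tuple(inputs[0].shape), tuple(output.shape)))

model = TinyNet().eval()
handle = model.pool.register_forward_hook(hook)

with torch.no_grad():
    model(torch.randn(5, 3, 8, 8))   # recorded by the hook

handle.remove()                      # detach the hook when it is no longer needed
with torch.no_grad():
    model(torch.randn(5, 3, 8, 8))   # not recorded: the hook is gone

print(captured)
```

Keeping the handle matters in a notebook: running the registration cell twice without removing the first hook would record every feature twice.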

Now run the data through the model to collect the features. No back-propagation is needed here, so we can wrap the loop in torch.no_grad()

In [17]:
%%time
with torch.no_grad():
    for batch_idx, data in enumerate(image_dataloader["train"]):
        x,y = data
        x=x.to(DEVICE)
        y=y.to(DEVICE)
        y_hat = model_ft(x)

Wall time: 1min 23s


In [18]:
features=np.array(in_list)
np.save("features",features)
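
Two practical points about the export, sketched with made-up arrays. First, `np.save` appends `.npy` when the file name has no extension. Second, the loop above does not record the labels, and the training loader shuffles, so the labels have to be stored in the same order as the features for the saved array to be usable later. The arrays and file names below are assumptions. The size estimate assumes one `(2048, 7, 7)` float32 feature map per 224x224 training image.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((6, 4)).astype(np.float32)  # made-up feature vectors
labels = np.array([0, 2, 1, 1, 0, 2])                       # made-up labels, same order

with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "features"), features)   # written as features.npy
    np.save(os.path.join(tmp, "labels"), labels)        # written as labels.npy
    saved_files = sorted(os.listdir(tmp))
    loaded_features = np.load(os.path.join(tmp, "features.npy"))
    loaded_labels = np.load(os.path.join(tmp, "labels.npy"))

# Rough size of the real export under the assumption stated above.
approx_bytes = 9199 * 2048 * 7 * 7 * 4
print(saved_files, round(approx_bytes / 1e9, 2))
```

Under that assumption the full array takes about 3.7 GB, so it is worth checking available memory and disk space before exporting the pre-pooling features.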

This way, when retraining we only need to load this array and feed it directly into a linear layer or the sigmoid layer mentioned earlier.

Because the features were taken before the pooling layer, there are more of them, and they can also be classified with other classifiers such as SVMs and tree-based classifiers.
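
As a hedged illustration of the SVM option, the sketch below trains scikit-learn's `LinearSVC` on made-up "features": two well-separated clusters stand in for exported feature vectors. Real exported features of shape `(N, 2048, 7, 7)` would first need to be flattened to two dimensions, for example with `features.reshape(len(features), -1)`.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Made-up "exported features": two well-separated clusters in 16 dimensions.
n_per_class, dim = 50, 16
class0 = rng.normal(loc=-3.0, scale=1.0, size=(n_per_class, dim))
class1 = rng.normal(loc=3.0, scale=1.0, size=(n_per_class, dim))
X = np.vstack([class0, class1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Hold out every fifth sample for evaluation.
test_mask = np.arange(len(y)) % 5 == 0
clf = LinearSVC()
clf.fit(X[~test_mask], y[~test_mask])
accuracy = clf.score(X[test_mask], y[test_mask])
print(accuracy)
```

This runs entirely on the CPU, which is the point made earlier: once the features are exported, the classifier on top is cheap to train.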

The above is an introduction to fine-tuning for computer vision. For NLP, fast.ai founder Jeremy Howard released ULMFiT in 2018, which is a good reference.
See these two links for details:

[fast.ai official blog](https://nlp.fast.ai/), [original paper: Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146)