# Homework 7 - Network Compression (Knowledge Distillation)

> Author: Arvin Liu (b05902127@ntu.edu.tw)

## **Goal**

----- strong baseline -----   0.84100

----- simple baseline -----   0.83682

In [None]:
# Download dataset
!gdown --id '1GzukFVznTp_RG7b2ury7hr9TwA-MyMYj' --output food-11.zip
# Unzip the files
!unzip food-11.zip

In [2]:
import random
import numpy as np
import torch

# Fix the random seeds for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    np.random.seed(seed)  # Numpy module.
    random.seed(seed)  # Python random module.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

same_seeds(0)

# Readme


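The task of HW7 is model compression (Neural Network Compression).

There are many approaches to compression. Here we introduce four that were covered in the lectures:

* Knowledge Distillation
* Network Pruning
* Architecture Design (building CNNs with fewer parameters)
* Weight Quantization

This notebook covers Knowledge Distillation.
We provide an already-trained large model so that you can do Knowledge Distillation conveniently.
The small model we use is one that has already gone through "Architecture Design".

* Architecture Design is covered in hw7_Architecture_Design.ipynb in the same directory.
* Download the pretrained large model (47.2M): https://drive.google.com/file/d/1B8ljdrxYXJsZv2vmTequdPOofp3VF3NN/view?usp=sharing
  * Use the ResNet18 provided by torchvision, change num_classes to 11, and load the weights into it. (An example appears later.)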

In [3]:
import torch
import os
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.models as models

# Load our model architecture, TA_Student_Net (defined in hw7_Architecture_Design.ipynb)
!gdown --id '1lJS0ApIyi7eZ2b3GMyGxjPShI8jXM2UC' --output "hw7_Architecture_Design.ipynb"
%run "hw7_Architecture_Design.ipynb"

Downloading...
From: https://drive.google.com/uc?id=1lJS0ApIyi7eZ2b3GMyGxjPShI8jXM2UC
To: /content/hw7_Architecture_Design.ipynb
  0% 0.00/8.78k [00:00<?, ?B/s]100% 8.78k/8.78k [00:00<00:00, 14.8MB/s]


Knowledge Distillation
===

<img src="https://i.imgur.com/H2aF7Rv.png=100x" width="500px">

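Simply put, we let large models that already perform well tell the small model "how" to learn.
How do we do that? We use the logits predicted by the large model as the targets for the small model.

## Why does this work?
* When the data is not very clean, for example, the noisy labels only interfere with an ordinary model's learning. Learning from the logits predicted by a large model works better.
* Labels may be related to each other, and this can guide the small model's learning. For example, the digit 8 may be related to 6, 9, and 0.
* It weakens targets that are already learned well (?), so that their gradients do not interfere with other tasks that are not yet learned well.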
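
The point about related labels can be made concrete with a small, framework-free sketch. The logits below are made-up numbers for a hypothetical digit classifier over the classes 0, 6, 8 and 9, not outputs of the teacher used in this notebook, and `softmax_with_temperature` is an illustrative helper. The sketch shows how dividing the logits by a temperature T before the softmax makes the relative similarity of the wrong classes visible.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for an image of the digit 8 over classes [0, 6, 8, 9].
class_names = ['0', '6', '8', '9']
teacher_logits = [2.0, 3.0, 10.0, 3.5]

for T in (1.0, 20.0):
    probs = softmax_with_temperature(teacher_logits, T)
    formatted = ', '.join(f'{name}: {p:.3f}' for name, p in zip(class_names, probs))
    print(f'T={T:>4}: {formatted}')
```

At T=1 almost all of the probability mass sits on the correct class. At T=20 the distribution is much flatter, so the ordering among the other classes carries a usable training signal for the student.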


## How to implement it?
* $Loss = \alpha T^2 \times KL(\text{softmax}(\frac{\text{Teacher's Logits}}{T}) || \text{softmax}(\frac{\text{Student's Logits}}{T})) + (1-\alpha)(\text{Original Loss})$


* Why the code below applies log_softmax to the student: https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
* reference: [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
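
The factor $T^2$ in the loss can be motivated by looking at the gradient of the soft term, following the argument in the referenced paper. As a sketch, write $z_i$ for the student's logits, $v_i$ for the teacher's logits, $N$ for the number of classes, and $q_i$, $p_i$ for the student's and teacher's temperature-softened probabilities (these symbols are introduced here for illustration only):

$$
q_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}, \qquad
p_i = \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}, \qquad
\frac{\partial\, KL(p \,\|\, q)}{\partial z_i} = \frac{1}{T}\,(q_i - p_i)
$$

When $T$ is large compared with the logits, and assuming the logits are zero-mean, $e^{x/T} \approx 1 + x/T$ gives

$$
\frac{\partial\, KL(p \,\|\, q)}{\partial z_i} \approx \frac{1}{N T^2}\,(z_i - v_i)
$$

so the gradient of the soft term shrinks roughly as $1/T^2$. Multiplying by $T^2$ keeps it on a scale comparable to the hard-label term when $T$ changes.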

In [4]:
def loss_fn_kd(outputs, labels, teacher_outputs, T=20, alpha=0.5):
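    # Ordinary cross entropy against the hard labels
    hard_loss = F.cross_entropy(outputs, labels) * (1. - alpha)
    # KL divergence between the log_softmax of the student's logits / T
    # and the target probabilities (softmax of the teacher's logits / T).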
    soft_loss = nn.KLDivLoss(reduction='batchmean')(F.log_softmax(outputs/T, dim=1),
                             F.softmax(teacher_outputs/T, dim=1)) * (alpha * T * T)
    return hard_loss + soft_loss

# Data Processing

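We use the same dataset as in Hw3 - CNN, so for the augmentation and image-reading code in this block you can refer to it or simply copy it.

If anything is unclear, go back to the Hw3 colab.

Note that if you write your own version, it is best to use our augmentation method, so that differences in the input do not degrade the Teacher Net's predictions.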
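
The dataset class in the next cell recovers each label by taking the second run of digits in the image path. A minimal sketch of that parsing step follows. The file names are hypothetical and assume a `{class}_{id}.jpg` naming scheme, and `parse_class_idx` is an illustrative helper rather than part of the notebook.

```python
import re

def parse_class_idx(img_path):
    """Return the second run of digits in the path as the class index, or 0 if absent."""
    numbers = re.findall(r'\d+', img_path)
    try:
        return int(numbers[1])
    except IndexError:
        return 0

# The first run of digits is the "11" in "food-11"; the second is the class.
print(parse_class_idx('./food-11/training/3_100.jpg'))
# Without the "food-11" folder the indices shift, so the image id is picked up instead.
print(parse_class_idx('./training/3_100.jpg'))
```

Under these assumptions the first call yields 3 and the second yields 100, which is why the folder prefix in the path matters when the zip file is unpacked differently.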

In [5]:
import re
import torch
from glob import glob
from PIL import Image
import torchvision.transforms as transforms

class MyDataset(torch.utils.data.Dataset):

    def __init__(self, folderName, transform=None):
        self.transform = transform
        self.data = []
        self.label = []

        for img_path in sorted(glob(folderName + '/*.jpg')):
            try:
                # Get classIdx by parsing image path
                class_idx = int(re.findall(re.compile(r'\d+'), img_path)[1])
            except IndexError:
                # In inference mode there may be no label in the path; default class_idx to 0
                class_idx = 0

            image = Image.open(img_path)
            # Get File Descriptor
            image_fp = image.fp
            image.load()
            # Close File Descriptor (or it'll reach OPEN_MAX)
            image_fp.close()

            self.data.append(image)
            self.label.append(class_idx)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        image = self.data[idx]
        if self.transform:
            image = self.transform(image)
        return image, self.label[idx]


trainTransform = transforms.Compose([
    transforms.RandomCrop(256, pad_if_needed=True, padding_mode='symmetric'),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])
testTransform = transforms.Compose([
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

def get_dataloader(mode='training', batch_size=32):

    assert mode in ['training', 'testing', 'validation']

    dataset = MyDataset(
        f'./food-11/{mode}',  # original path
        # f'./{mode}',  # used earlier when the zip had no top-level folder
        transform=trainTransform if mode == 'training' else testTransform)

    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=(mode == 'training'))

    return dataloader


# Pre-processing

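We provide the TeacherNet's state_dict; its architecture is the ResNet18 from torchvision.

The StudentNet architecture is defined in hw7_Architecture_Design.ipynb.

The optimizer used here is AdamW, for no particular reason; I simply wanted to use it.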

In [6]:
# get dataloader
train_dataloader = get_dataloader('training', batch_size=32)
print('finish train_dataloader')

valid_dataloader = get_dataloader('validation', batch_size=32)
print('finish valid_dataloader')

finish train_dataloader
finish valid_dataloader


In [7]:
# # Use the pretrained teacher_resnet18.bin
# !gdown --id '1B8ljdrxYXJsZv2vmTequdPOofp3VF3NN' --output teacher_resnet18.bin

# # Use the pretrained teacher_resnet18.bin
# !gdown --id '1xiaRepwfa1XOwwKswzbHfXaDmBEVpZVh' --output teacher_resnet18.bin #pretrain

# Use teacher_resnet18_from_scratch.bin
!gdown --id '1VEoKts_clMcJYKnuUdxcNX1BtPcPTmsc' --output teacher_resnet18.bin #from scratch

teacher_net = models.resnet18(pretrained=False, num_classes=11).cuda()  # build the teacher net
student_net = StudentNet(base=16).cuda()

teacher_net.load_state_dict(torch.load('./teacher_resnet18.bin'))
optimizer = optim.AdamW(student_net.parameters(), lr=1e-3)

Downloading...
From: https://drive.google.com/uc?id=1VEoKts_clMcJYKnuUdxcNX1BtPcPTmsc
To: /content/teacher_resnet18.bin
44.8MB [00:00, 96.3MB/s]


# Start Training

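* The remaining steps are the same as what you did in Hw3 - CNN.

## Reminders

* torch.no_grad means that the operations that follow (or that tensor) do not need gradients computed.
* model.eval() and model.train() differ in whether BatchNorm updates its running statistics and whether Dropout is applied.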
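
To illustrate the difference between train and eval mode, here is a framework-free toy. `ToyBatchNorm` is a hypothetical scalar class written for this sketch; it is not PyTorch's implementation, which differs in details. It only shows the idea: training mode normalizes with the batch statistics and updates the running ones, while eval mode reuses the stored statistics and leaves them untouched.

```python
import math

class ToyBatchNorm:
    """Scalar batch normalization with a train/eval switch (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def __call__(self, batch):
        if self.training:
            # Training: normalize with batch statistics and update the running ones.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Evaluation: use the stored running statistics and leave them untouched.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

bn = ToyBatchNorm()
bn.train()
bn([1.0, 2.0, 3.0])
stats_after_train = (bn.running_mean, bn.running_var)

bn.eval()
bn([10.0, 20.0, 30.0])
stats_after_eval = (bn.running_mean, bn.running_var)
print(stats_after_train == stats_after_eval)
```

The final comparison holds because the eval-mode call does not modify the running statistics. This is the reason the teacher is kept in eval mode for the whole run and the student is switched to eval mode before validation.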



In [8]:
# # Continue training from a previous checkpoint
# !gdown --id '110YMEwwIyLNJNcaQ_uV_0daewsNz3Wiy' --output student_custom_small.bin
!gdown --id '10c_s5vehB1B0zyQfT-tnBqxmETB6bJjY' --output student_custom_small.bin

student_net = StudentNet(base=16).cuda()
student_net.load_state_dict(torch.load('student_custom_small.bin'))

Downloading...
From: https://drive.google.com/uc?id=10c_s5vehB1B0zyQfT-tnBqxmETB6bJjY
To: /content/student_custom_small.bin
  0% 0.00/1.05M [00:00<?, ?B/s]100% 1.05M/1.05M [00:00<00:00, 72.3MB/s]


<All keys matched successfully>

In [None]:
# Run 170 epochs first
# Already done: 40 + 50 + 60
# If the lr is too small, keep adding more

optimizer = optim.AdamW(student_net.parameters(), lr=1e-3)
import time
def run_epoch(dataloader, update=True, alpha=0.5):
    total_num, total_hit, total_loss = 0, 0, 0
    for now_step, batch_data in enumerate(dataloader):
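        # Reset the gradients held by the optimizer
        optimizer.zero_grad()
        # Prepare the input
        inputs, hard_labels = batch_data
        inputs = inputs.cuda()
        hard_labels = torch.LongTensor(hard_labels).cuda()
        # The teacher is never backpropagated through, so we use torch.no_grad
        # to tell torch not to cache intermediate values (for backprop), which would waste memory.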
        with torch.no_grad():
            soft_labels = teacher_net(inputs)

        if update:
            logits = student_net(inputs)
            # Use the loss defined earlier, which combines soft labels and hard labels.
            # T=20 is a setting from the original paper.
            loss = loss_fn_kd(logits, hard_labels, soft_labels, 20, alpha)
            loss.backward()
            optimizer.step()    
        else:
            # When only computing validation accuracy, use no_grad to save memory.
            with torch.no_grad():
                logits = student_net(inputs)
                loss = loss_fn_kd(logits, hard_labels, soft_labels, 20, alpha)
            
        total_hit += torch.sum(torch.argmax(logits, dim=1) == hard_labels).item()
        total_num += len(inputs)

        total_loss += loss.item() * len(inputs)
    return total_loss / total_num, total_hit / total_num


# The TeacherNet always stays in eval mode.
teacher_net.eval()
now_best_acc = 0
for epoch in range(170):
    train_start_time = time.time()
    student_net.train()
    train_loss, train_acc = run_epoch(train_dataloader, update=True)
    student_net.eval()
    valid_loss, valid_acc = run_epoch(valid_dataloader, update=False)

    # Save the best model so far.
    if valid_acc > now_best_acc:
        now_best_acc = valid_acc
        torch.save(student_net.state_dict(), 'student_model.bin')
        print('save model')
    print('epoch {:>3d}: train loss: {:6.4f}, acc {:6.4f} valid loss: {:6.4f}, acc {:6.4f}'.format(
        epoch, train_loss, train_acc, valid_loss, valid_acc))
    print('epoch cost time =', time.time() - train_start_time)
    print('')

In [None]:
# # Continue training from a previous checkpoint
!gdown --id '110YMEwwIyLNJNcaQ_uV_0daewsNz3Wiy' --output student_custom_small.bin

student_net = StudentNet(base=16).cuda()
student_net.load_state_dict(torch.load('student_custom_small.bin'))

In [None]:
# Run another 30 epochs

optimizer = optim.AdamW(student_net.parameters(), lr=1e-4)

import time
def run_epoch(dataloader, update=True, alpha=0.5):
    total_num, total_hit, total_loss = 0, 0, 0
    for now_step, batch_data in enumerate(dataloader):
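        # Reset the gradients held by the optimizer
        optimizer.zero_grad()
        # Prepare the input
        inputs, hard_labels = batch_data
        inputs = inputs.cuda()
        hard_labels = torch.LongTensor(hard_labels).cuda()
        # The teacher is never backpropagated through, so we use torch.no_grad
        # to tell torch not to cache intermediate values (for backprop), which would waste memory.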
        with torch.no_grad():
            soft_labels = teacher_net(inputs)

        if update:
            logits = student_net(inputs)
            # Use the loss defined earlier, which combines soft labels and hard labels.
            # T=20 is a setting from the original paper.
            loss = loss_fn_kd(logits, hard_labels, soft_labels, 20, alpha)
            loss.backward()
            optimizer.step()    
        else:
            # When only computing validation accuracy, use no_grad to save memory.
            with torch.no_grad():
                logits = student_net(inputs)
                loss = loss_fn_kd(logits, hard_labels, soft_labels, 20, alpha)
            
        total_hit += torch.sum(torch.argmax(logits, dim=1) == hard_labels).item()
        total_num += len(inputs)

        total_loss += loss.item() * len(inputs)
    return total_loss / total_num, total_hit / total_num


# The TeacherNet always stays in eval mode.
teacher_net.eval()
now_best_acc = 0
for epoch in range(30):
    train_start_time = time.time()
    student_net.train()
    train_loss, train_acc = run_epoch(train_dataloader, update=True)
    student_net.eval()
    valid_loss, valid_acc = run_epoch(valid_dataloader, update=False)

    # Save the best model so far.
    if valid_acc > now_best_acc:
        now_best_acc = valid_acc
        torch.save(student_net.state_dict(), 'student_model.bin')
        print('save model')
    print('epoch {:>3d}: train loss: {:6.4f}, acc {:6.4f} valid loss: {:6.4f}, acc {:6.4f}'.format(
        epoch, train_loss, train_acc, valid_loss, valid_acc))
    print('epoch cost time =', time.time() - train_start_time)
    print('')

Check the StudentNet parameter count


In [None]:
from torchsummary import summary
summary(student_net, input_size=(3, 128, 128))

In [11]:
def get_parameter_number(net):
    total_num = sum(p.numel() for p in net.parameters())
    trainable_num = sum(p.numel() for p in net.parameters() if p.requires_grad)
    return {'Total': total_num, 'Trainable': trainable_num}

In [12]:
get_parameter_number(student_net)

{'Total': 256779, 'Trainable': 256779}
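
As a rough sanity check, the parameter count reported above can be turned into an approximate storage size. The sketch assumes every parameter is stored as a 32-bit float, which the notebook does not state explicitly.

```python
num_params = 256779          # 'Total' reported by get_parameter_number above
bytes_per_float32 = 4        # assumption: parameters are stored as 32-bit floats

size_bytes = num_params * bytes_per_float32
print(f'{size_bytes} bytes = {size_bytes / 1e6:.2f} MB')
```

The estimate comes out slightly below the 1.05M reported when student_custom_small.bin was downloaded earlier. A small gap is plausible because a saved state_dict typically also holds non-parameter buffers and some serialization overhead.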

Check the teacher_net parameter count

In [None]:
from torchsummary import summary
summary(teacher_net, input_size=(3, 128, 128))

# Inference

Same as Hw3; please refer to that assignment :).
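
As a rough sketch of the final step, the cells under "testing" write one `Id,label` row per prediction. The same output can be produced with Python's `csv` module. The predictions and the in-memory buffer below are placeholders for illustration, not results from this notebook, and `write_predictions` is a hypothetical helper.

```python
import csv
import io

def write_predictions(prediction, fileobj, header=('Id', 'label')):
    """Write one row per prediction, using the row index as the Id."""
    writer = csv.writer(fileobj, lineterminator='\n')
    writer.writerow(header)
    for i, y in enumerate(prediction):
        writer.writerow([i, int(y)])

# Placeholder class indices; in the notebook these come from the student's argmax.
dummy_prediction = [3, 0, 10]
buffer = io.StringIO()
write_predictions(dummy_prediction, buffer)
print(buffer.getvalue())
```

Passing `header=('Id', 'Category')` would give the column names used for the Hw3 submission file, and an opened file object can be passed in place of the buffer.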


## testing

In [None]:
# Load the trained checkpoint
# !gdown --id '10usrlxc7KhTbwRTzG7IAmaFbsVdWqlQ3' --output student_custom_small.bin #predict_0.8402332361516035_student_model.csv
!gdown --id '11MtUk-wHWrV004j9Li-bsQFfWB2Kkl3j' --output student_custom_small.bin #predict_0.8131195335276968student_model.csv

student_net = StudentNet(base=16).cuda()
student_net.load_state_dict(torch.load('student_custom_small.bin'))

In [None]:
test_dataloader = get_dataloader('testing', batch_size=32)
print('finish test_dataloader')

In [None]:
import numpy as np
student_net.eval()
prediction = []

for now_step, batch_data in enumerate(test_dataloader):
    # Reset the optimizer gradients (not needed for inference)
    # optimizer.zero_grad()
    # Prepare the input
    inputs, hard_labels = batch_data
    inputs = inputs.cuda()

    with torch.no_grad():
        logits = student_net(inputs)
        test_label = np.argmax(logits.cpu().data.numpy(), axis=1)
        for y in test_label:
            prediction.append(y)  

In [None]:
# Submit to hw7
from google.colab import files

# Write the results to a csv file
with open("hw7_predict_0.8131195335276968student_model.csv", 'w') as f:
    f.write('Id,label\n')
    for i, y in enumerate(prediction):
        f.write('{},{}\n'.format(i, y))
# Download to the local machine
files.download("hw7_predict_0.8131195335276968student_model.csv")

In [None]:
# Submit to hw3
# The hw3 Kaggle competition can also be used to score the predictions
from google.colab import files

# Write the results to a csv file
with open("hw3_predict_0.8131195335276968student_model.csv", 'w') as f:
    f.write('Id,Category\n')
    for i, y in enumerate(prediction):
        f.write('{},{}\n'.format(i, y))
# Download to the local machine
files.download("hw3_predict_0.8131195335276968student_model.csv")

In [None]:
# Kaggle Score Record

# 1. predict_TA.csv
#   acc = 0.82964

# 2. predict_0.8440233236151603_student_model.csv
#   acc = 0.04064
#   https://drive.google.com/open?id=10uOiw6Hsn0dYQe9TnNpxGaJVp4V6QVbt

# 3. predict_0.8402332361516035_student_model.csv
#   acc = 0.86072
#   https://drive.google.com/open?id=10usrlxc7KhTbwRTzG7IAmaFbsVdWqlQ3

# 4. predict_0.8402332361516035_student_model_8bytes.csv
#   acc = 0.85475
#   https://drive.google.com/open?id=10usrlxc7KhTbwRTzG7IAmaFbsVdWqlQ3

# 5. predict_0.8131195335276968student_model.csv
#   acc = 0.83024

# Q&A

If you have any questions about Network Compression, you can email b05902127@ntu.edu.tw / ntu-ml-2020spring-ta@googlegroups.com.

Time permitting, I will post updates here.