# AIOK Model Adapter Distiller Customized DEMO
Model Adapter is a framework that reduces training and inference time, or data labeling cost, by efficiently reusing publicly available pretrained models and datasets from many domains. It contains three components for different use cases: Finetuner, Distiller, and Domain Adapter.

This demo introduces the usage of Distiller. Taking image classification as an example, it shows how to integrate Distiller into a general training pipeline to transfer knowledge from ResNet50 to ResNet18 on the CIFAR100 dataset. A simplified demo that uses the built-in pipeline is available [here](./Model_Adapter_Distiller_buildin_ResNet18_CIFAR100.ipynb).

# Content

* [Model Adapter Distiller Overview](#Model-Adapter-Distiller-Overview)
* [Environment Setup](#Environment-Setup)
* [Training with Distiller](#Training-with-Distiller)
    * [Prepare Data](#Prepare-Data)
    * [Create Transferrable Model](#Create-Transferrable-Model)
    * [Train and Evaluate](#Train-and-Evaluate)

# Model Adapter Distiller Overview
Distiller is based on knowledge distillation: it transfers knowledge from a heavy model (the teacher) to a light one (the student) with a different structure. The teacher is a large model pretrained on a specific dataset and holds sufficient knowledge for the task, while the student has a much smaller structure. Distiller trains the student not only on the dataset but also with the help of the teacher's knowledge. With Distiller, we can make use of the knowledge in existing large pretrained models while spending much less training time. It can also significantly improve the convergence speed and prediction accuracy of a small model, which is very helpful for inference.

<img src="../imgs/distiller.png" width="60%">
<center>Model Adapter Distiller Structure</center>
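
To make the idea of learning from the teacher concrete, here is a minimal sketch of the classic soft-target distillation loss (KL divergence between temperature-softened teacher and student distributions). It is an illustration only: the function name `kd_loss` and the temperature value `4.0` are assumptions, and the `KD` class in e2eAIOK may compute its loss differently.

```python
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-target distillation loss: KL(teacher || student) on softened probabilities."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # "batchmean" matches the mathematical definition of KL divergence;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


torch.manual_seed(0)
student_logits = torch.randn(8, 100)  # random stand-ins for real model outputs
teacher_logits = torch.randn(8, 100)

loss_same = kd_loss(teacher_logits, teacher_logits)  # student already matches teacher
loss_diff = kd_loss(student_logits, teacher_logits)  # student disagrees with teacher
print(loss_same.item(), loss_diff.item())
```

The loss vanishes when the student reproduces the teacher's distribution and grows as the two diverge, which is what pulls the student toward the teacher during training.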

# Environment Setup

1. Prepare code
    ``` bash
    git clone https://github.com/intel/e2eAIOK.git
    cd e2eAIOK
    git submodule update --init --recursive
    ```
2. Build docker image
   ``` bash
   cd Dockerfile-ubuntu18.04 && docker build -t e2eaiok-pytorch112 . -f DockerfilePytorch112 && cd ..
   ```
3. Run docker
   ``` bash
   docker run -it --name model_adapter --shm-size=10g --privileged --network host \
   -v ${dataset_path}:/home/vmagent/app/data  \
   -v `pwd`:/home/vmagent/app/e2eaiok \
   -w /home/vmagent/app/e2eaiok e2eaiok-pytorch112 /bin/bash 
   ```
4. Run in conda and set up e2eAIOK
   ```bash
   conda activate pytorch-1.12.0
   python setup.py sdist && pip install dist/e2eAIOK-*.*.*.tar.gz
   ```
5. Start the jupyter notebook and tensorboard service
   ``` bash
   nohup jupyter notebook --notebook-dir=/home/vmagent/app/e2eaiok --ip=${hostname} --port=8899 --allow-root &
   nohup tensorboard --logdir /home/vmagent/app/data/tensorboard --host=${hostname} --port=6006 & 
   ```
   Now you can visit the demos at `http://${hostname}:8899/` and view the TensorBoard logs at `http://${hostname}:6006`.

# Training with Distiller

Import libraries

In [1]:
import torch
from torchvision import transforms,datasets
from torch.utils.data import DataLoader
import torch.optim as optim
from timm.utils import accuracy
import timm
import transformers
import datetime

## Prepare Data

### Define Data Preprocessor

In [2]:
CIFAR100_TRAIN_MEAN = (0.5070751592371323, 0.48654887331495095, 0.4409178433670343) # mean for 3 channels
CIFAR100_TRAIN_STD = (0.2673342858792401, 0.2564384629170883, 0.27615047132568404)  # std for 3 channels

train_transform = transforms.Compose([
  transforms.RandomCrop(32, padding=4),
  transforms.RandomHorizontalFlip(),
  transforms.Resize(112),  # pretrained model is trained on a large image size, scale 32x32 to 112x112
  transforms.ToTensor(),
  transforms.Normalize(CIFAR100_TRAIN_MEAN, CIFAR100_TRAIN_STD)
])

test_transform = transforms.Compose([
  # no random augmentation at test time, so evaluation is deterministic
  transforms.Resize(112),  # pretrained model is trained on a large image size, scale 32x32 to 112x112
  transforms.ToTensor(),
  transforms.Normalize(CIFAR100_TRAIN_MEAN, CIFAR100_TRAIN_STD)
])
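
As a sketch of what the `Normalize` step does, the snippet below applies the same per-channel formula, `(x - mean) / std`, by hand to a `(3, H, W)` tensor with values in `[0, 1]`. The helper `normalize_by_hand` is hypothetical and only mirrors the arithmetic; it is not a torchvision function.

```python
import torch

CIFAR100_TRAIN_MEAN = (0.5070751592371323, 0.48654887331495095, 0.4409178433670343)
CIFAR100_TRAIN_STD = (0.2673342858792401, 0.2564384629170883, 0.27615047132568404)


def normalize_by_hand(img, mean, std):
    """Per-channel normalization of a (3, H, W) tensor: (x - mean) / std."""
    mean_t = torch.tensor(mean).view(3, 1, 1)  # broadcast over height and width
    std_t = torch.tensor(std).view(3, 1, 1)
    return (img - mean_t) / std_t


# An image whose every pixel equals the channel mean maps to all zeros.
mean_img = torch.tensor(CIFAR100_TRAIN_MEAN).view(3, 1, 1).expand(3, 112, 112)
normalized = normalize_by_hand(mean_img, CIFAR100_TRAIN_MEAN, CIFAR100_TRAIN_STD)
print(normalized.shape, normalized.abs().max().item())
```

After normalization each channel of the training set has roughly zero mean and unit variance, which generally makes optimization better conditioned.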

### Prepare dataset and dataloader

In [3]:
data_folder='/home/vmagent/app/data/dataset/cifar' # dataset location
train_set = datasets.CIFAR100(root=data_folder, train=True, download=True, transform=train_transform)
test_set = datasets.CIFAR100(root=data_folder, train=False, download=True, transform=test_transform)

train_loader = DataLoader(dataset=train_set, batch_size=128, shuffle=True, num_workers=1, drop_last=False)
validate_loader = DataLoader(dataset=test_set, batch_size=128, shuffle=False, num_workers=1, drop_last=False)  # no need to shuffle for evaluation

Files already downloaded and verified
Files already downloaded and verified
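
For orientation, the number of steps per epoch follows directly from the dataset size and the batch size. The short calculation below assumes the standard CIFAR100 split of 50,000 training and 10,000 test images, with `batch_size=128` and `drop_last=False` as in the loaders above.

```python
import math

batch_size = 128
train_samples, test_samples = 50_000, 10_000  # standard CIFAR100 split

# drop_last=False keeps the final, smaller batch, so round up.
train_steps = math.ceil(train_samples / batch_size)
test_steps = math.ceil(test_samples / batch_size)
last_train_batch = train_samples - (train_steps - 1) * batch_size

print(train_steps, test_steps, last_train_batch)  # 391 79 80
```

Because the last batch is smaller than the others, per-batch metrics should be weighted by batch size when they are averaged over an epoch.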


## Create Transferrable Model

### Create Backbone model

In [4]:
backbone = timm.create_model('resnet18', pretrained=False, num_classes=100)

### Create teacher model
To use Distiller, we need a teacher model to guide the training. Download the teacher model, a ResNet50 pretrained on CIFAR100, from [here]() and put it at `${dataset_path}/model/demo/baseline/cifar100_res50PretrainI21k/cifar100_res50_pretrain_imagenet21k.pth`, where `${dataset_path}` is the directory mounted to `/home/vmagent/app/data` in the container.

In [5]:
pretrain_model = "/home/vmagent/app/data/model/demo/baseline/cifar100_res50PretrainI21k/cifar100_res50_pretrain_imagenet21k.pth"
teacher_model = timm.create_model('resnet50', pretrained=False, num_classes=100)
teacher_model.load_state_dict(torch.load(pretrain_model), strict=True)

<All keys matched successfully>
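
A teacher is normally used for inference only. The sketch below shows one common way to freeze a model so that it stays in evaluation mode and receives no gradient updates. It uses a tiny stand-in network (`teacher_stub`) and a hypothetical helper `freeze`; the `KD` class may already handle this internally, which the demo does not state.

```python
import torch
from torch import nn

# Hypothetical stand-in for the real ResNet50 teacher.
teacher_stub = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 100))


def freeze(module):
    """Put a module in eval mode and disable gradients for all its parameters."""
    module.eval()
    for param in module.parameters():
        param.requires_grad_(False)
    return module


freeze(teacher_stub)
with torch.no_grad():  # no autograd graph is needed for teacher predictions
    teacher_logits = teacher_stub(torch.randn(4, 3, 8, 8))
print(teacher_logits.shape)
```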

### Define Distiller
Here we define a distiller using the KD algorithm; it takes a teacher model as input.

In [6]:
from e2eAIOK.ModelAdapter.engine_core.distiller import KD
distiller = KD(teacher_model)

### Make Model transferrable with distiller

In [7]:
from e2eAIOK.ModelAdapter.engine_core.transferrable_model import *
loss_fn = torch.nn.CrossEntropyLoss()
model = make_transferrable_with_knowledge_distillation(backbone,loss_fn,distiller)
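
The transferrable model bundles the backbone with the distiller, so one forward pass yields the backbone output and one loss call yields both a backbone loss and a distiller loss. The following toy wrapper sketches that pattern under stated assumptions: `ToyDistilledModel`, `ToyOutput`, `ToyLoss`, the temperature, and the unweighted sum of the two losses are all hypothetical and are not the e2eAIOK implementation.

```python
from collections import namedtuple

import torch
import torch.nn.functional as F
from torch import nn

ToyOutput = namedtuple("ToyOutput", ["backbone_output", "teacher_output"])
ToyLoss = namedtuple("ToyLoss", ["total_loss", "backbone_loss", "distiller_loss"])


class ToyDistilledModel(nn.Module):
    """Hypothetical wrapper: a trainable student plus a frozen teacher."""

    def __init__(self, backbone, teacher, loss_fn, temperature=4.0):
        super().__init__()
        self.backbone = backbone
        self.teacher = teacher.eval()
        for param in self.teacher.parameters():
            param.requires_grad_(False)  # teacher is never updated
        self.loss_fn = loss_fn
        self.temperature = temperature

    def forward(self, x):
        with torch.no_grad():
            teacher_output = self.teacher(x)
        return ToyOutput(self.backbone(x), teacher_output)

    def loss(self, output, label):
        t = self.temperature
        backbone_loss = self.loss_fn(output.backbone_output, label)
        distiller_loss = F.kl_div(
            F.log_softmax(output.backbone_output / t, dim=1),
            F.softmax(output.teacher_output / t, dim=1),
            reduction="batchmean",
        ) * t ** 2
        # Assumption: the two terms are simply added with equal weight.
        return ToyLoss(backbone_loss + distiller_loss, backbone_loss, distiller_loss)


torch.manual_seed(0)
toy = ToyDistilledModel(nn.Linear(16, 10), nn.Linear(16, 10), nn.CrossEntropyLoss())
toy_output = toy(torch.randn(4, 16))
toy_losses = toy.loss(toy_output, torch.tensor([0, 1, 2, 3]))
toy_losses.total_loss.backward()  # gradients flow into the student only
```

Keeping the two loss terms separate, as the real transferrable model also appears to do, makes it possible to log them individually during training.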

## Train and Evaluate

### Create optimizer and scheduler

In [8]:
################# create optimizer #################
init_lr = 0.01
weight_decay = 0.0001
momentum = 0.9
optimizer = optim.SGD(model.parameters(),lr=init_lr, weight_decay=weight_decay,momentum=momentum)
################# create scheduler #################
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1)
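
With `ExponentialLR`, the learning rate is multiplied by `gamma` after every epoch, so with `gamma=0.1` it drops by a factor of ten each time. The sketch below replays the schedule on a dummy parameter to show the values; the names `demo_optimizer` and `demo_scheduler` are used only to avoid clashing with the objects above.

```python
import torch
from torch import optim

dummy_param = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = optim.SGD([dummy_param], lr=0.01, weight_decay=0.0001, momentum=0.9)
demo_scheduler = optim.lr_scheduler.ExponentialLR(demo_optimizer, gamma=0.1)

lrs = []
for epoch in range(3):
    lrs.append(demo_scheduler.get_last_lr()[0])  # learning rate used in this epoch
    demo_optimizer.step()   # no gradients here; called only to keep the usual order
    demo_scheduler.step()   # lr <- lr * gamma

print(lrs)  # approximately 0.01, 0.001, 0.0001
```

Such a steep decay is harmless for the single epoch used in this demo, but for longer runs a `gamma` closer to 1 is usually a more practical choice.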

### Create Trainer

In [9]:
max_epoch = 1 # max 1 epoch
print_interval = 10 

In [12]:
class Trainer:
    def __init__(self, model, optimizer, scheduler):
        self._model = model
        self._optimizer = optimizer
        self._scheduler = scheduler
        
    def train(self, train_dataloader, valid_dataloader, max_epoch):
        ''' 
        :param train_dataloader: train dataloader
        :param valid_dataloader: validation dataloader
        :param max_epoch: number of epochs to train
        '''
        for epoch in range(0, max_epoch):
            ################## train #####################
            self._model.train()  # set training flag
            for (cur_step,(data, label)) in enumerate(train_dataloader):
                self._optimizer.zero_grad()
                output = self._model(data)
                loss_value = self._model.loss(output, label) # transferrable model has loss attribute
                loss_value.backward() 
                if cur_step%print_interval == 0:
                    batch_acc = accuracy(output.backbone_output,label)[0] # use output.backbone_output instead of output
                    dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') # date time
                    print("[{}] epoch {} step {} : training backbone loss {:.4f}, distiller loss {:4f}, training batch acc {:.4f}".format(
                      dt, epoch, cur_step, loss_value.backbone_loss.item(), loss_value.distiller_loss.item(), batch_acc.item())) # use loss_value.backbone_loss
                self._optimizer.step()
            self._scheduler.step()
            ################## evaluate ######################
            self.evaluate(self._model, valid_dataloader, epoch)
            
    def evaluate(self, model, valid_dataloader, epoch):
        with torch.no_grad():
            model.eval()  
            backbone = model.backbone # use backbone in evaluation
            loss_cum = 0.0
            sample_num = 0
            acc_cum = 0.0
            for (cur_step,(data, label)) in enumerate(valid_dataloader):
                output = backbone(data)
                batch_size = data.size(0)
                sample_num += batch_size
                loss_cum += loss_fn(output, label).item() * batch_size
                acc_cum += accuracy(output, label)[0].item() * batch_size
            dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') # date time
            loss_value = loss_cum/sample_num
            acc_value = acc_cum/sample_num

            print("[{}] epoch {} : evaluation loss {:.4f}, evaluation acc {:.4f}".format(
                dt, epoch, loss_value, acc_value))
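
The evaluation loop multiplies each batch metric by the batch size before summing, then divides by the total number of samples. The sketch below, with made-up numbers, shows why: when the last batch is smaller, a plain mean over batches gives it too much weight. The helper `weighted_mean` is hypothetical.

```python
def weighted_mean(batch_values, batch_sizes):
    """Average per-batch metrics, weighting each batch by its number of samples."""
    total = sum(value * size for value, size in zip(batch_values, batch_sizes))
    return total / sum(batch_sizes)


# Made-up example: a full batch at 50% accuracy and a small final batch at 100%.
batch_acc = [50.0, 100.0]
batch_sizes = [128, 16]

weighted = weighted_mean(batch_acc, batch_sizes)  # (128*50 + 16*100) / 144
naive = sum(batch_acc) / len(batch_acc)           # ignores batch sizes

print(round(weighted, 2), naive)  # 55.56 75.0
```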

### Start training

In [13]:
%%time
trainer = Trainer(model, optimizer, scheduler)
trainer.train(train_loader,validate_loader,max_epoch)

[2023-02-07 09:32:04] epoch 0 step 0 : training backbone loss 4.2446, distiller loss 7.003967, training batch acc 7.0312
[2023-02-07 09:32:21] epoch 0 step 10 : training backbone loss 4.1627, distiller loss 6.882109, training batch acc 8.5938
[2023-02-07 09:32:36] epoch 0 step 20 : training backbone loss 3.9140, distiller loss 6.864616, training batch acc 9.3750
[2023-02-07 09:32:51] epoch 0 step 30 : training backbone loss 3.8867, distiller loss 6.332502, training batch acc 11.7188
[2023-02-07 09:33:07] epoch 0 step 40 : training backbone loss 3.9411, distiller loss 7.078649, training batch acc 10.1562
[2023-02-07 09:33:22] epoch 0 step 50 : training backbone loss 4.2199, distiller loss 6.615680, training batch acc 8.5938
[2023-02-07 09:33:37] epoch 0 step 60 : training backbone loss 4.3427, distiller loss 6.883377, training batch acc 5.4688
[2023-02-07 09:33:52] epoch 0 step 70 : training backbone loss 4.1152, distiller loss 6.332665, training batch acc 10.1562
[2023-02-07 09:34:06] 