# Model Adapter Distiller Customized DEMO
Model Adapter is a convenient framework that can be used to reduce training and inference time, or data labeling cost, by efficiently utilizing public advanced models and datasets from many domains. It contains three main components that serve different cases: Finetuner, Distiller, and Domain Adapter.

This demo introduces the usage of Distiller. Taking image classification as an example, it shows how to distill knowledge from a pretrained ResNet50 teacher into a ResNet18 student on the CIFAR100 dataset, and how to integrate the distiller into a general training pipeline. A simplified built-in demo is available [here](./Model_Adapter_Distiller_buildin_ResNet18_CIFAR100.ipynb).

# Content

* [Model Adapter Distiller Overview](#Model-Adapter-Distiller-Overview)
* [Environment Setup](#Environment-Setup)
* [Training with Distiller](#Training-with-Distiller)
    * [Prepare Data](#Prepare-Data)
    * [Create Transferrable Model](#Create-Transferrable-Model)
    * [Train and Evaluate](#Train-and-Evaluate)

# Model Adapter Distiller Overview
Distiller is based on knowledge distillation: it transfers knowledge from a heavy model (the teacher) to a light one (the student), which may have a different structure. The teacher is a large model pretrained on a specific dataset and holds sufficient knowledge for the task, while the student has a much smaller structure. Distiller trains the student not only on the dataset but also with the help of the teacher's knowledge. This lets us make use of the knowledge in existing large pretrained models while spending much less training time. It can also significantly improve the convergence speed and prediction accuracy of a small model, which is very helpful for inference.

<img src="../imgs/distiller.png" width="60%">
<center>Model Adapter Distiller Structure</center>
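
The idea in the overview above can be made concrete with a small worked example. The plain-Python sketch below computes a classic knowledge-distillation loss for one sample: cross-entropy on the true label plus a temperature-softened KL divergence between teacher and student predictions. The function names, the temperature `T=4.0`, and the weight `alpha=0.1` are illustrative assumptions; this is not the e2eAIOK `KD` implementation or its defaults.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; a larger T gives a softer distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.1):
    """Hypothetical KD loss: alpha * CE + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: standard cross-entropy on the student's prediction.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL divergence between the temperature-softened distributions.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * (T ** 2) * kl

teacher = [5.0, 1.0, 0.5]       # made-up teacher logits for a 3-class toy problem
good_student = [4.0, 1.0, 0.0]  # agrees with the teacher
poor_student = [0.0, 3.0, 1.0]  # disagrees with the teacher

loss_good = kd_loss(good_student, teacher, label=0)
loss_poor = kd_loss(poor_student, teacher, label=0)
print(loss_good, loss_poor)
```

A student whose logits follow the teacher's receives a smaller loss, which is the signal that guides the student during training.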

# Environment Setup

1. Prepare code
    ``` bash
    git clone https://github.com/intel/e2eAIOK.git
    cd e2eAIOK
    git submodule update --init --recursive
    ```
2. Build the docker image (with proxy)
   ``` bash
   python3 scripts/start_e2eaiok_docker.py -b pytorch112 --dataset_path ${dataset_path} -w ${host0} ${host1} ${host2} ${host3} --proxy  "http://addr:ip"
   ```
3. Run docker
   ``` bash
   sshpass -p docker ssh ${host0} -p 12347
   ```
4. Run in conda and set up e2eAIOK
   ```bash
   conda activate pytorch-1.12.0
   python setup.py sdist && pip install dist/e2eAIOK-*.*.*.tar.gz
   ```
5. Start the jupyter notebook and tensorboard service
   ``` bash
   nohup jupyter notebook --notebook-dir=/home/vmagent/app/e2eaiok --ip=${hostname} --port=8899 --allow-root &
   nohup tensorboard --logdir /home/vmagent/app/data/tensorboard --host=${hostname} --port=6006 & 
   ```
   Now you can visit the demos at `http://${hostname}:8899/` and see the TensorBoard logs at `http://${hostname}:6006`.

# Training with Distiller

Import libraries

In [1]:
import torch
from torchvision import transforms,datasets
from torch.utils.data import DataLoader
import torch.optim as optim
from timm.utils import accuracy
import timm
import transformers
import datetime

## Prepare Data

### Define transforms and dataset

In [2]:
CIFAR_TRAIN_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR_TRAIN_STD = (0.2023, 0.1994, 0.2010)

train_transform = transforms.Compose([
  transforms.RandomCrop(32, padding=4),
  transforms.RandomHorizontalFlip(),
  transforms.Resize(112),  # pretrained model is trained on a large image size, scale 32x32 to 112x112
  transforms.ToTensor(),
  transforms.Normalize(CIFAR_TRAIN_MEAN, CIFAR_TRAIN_STD)
])

test_transform = transforms.Compose([
  # no random augmentation at test time, so evaluation is deterministic
  transforms.Resize(112),  # pretrained model is trained on a large image size, scale 32x32 to 112x112
  transforms.ToTensor(),
  transforms.Normalize(CIFAR_TRAIN_MEAN, CIFAR_TRAIN_STD)
])
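
To make the transforms above less abstract, here is a small plain-Python sketch of the arithmetic they perform: the per-channel normalization `(value - mean) / std` and the spatial size of an image as it passes through the training pipeline. The helper `normalize_pixel` is a hypothetical name used only for illustration; it mirrors what `transforms.Normalize` does for a single pixel.

```python
CIFAR_TRAIN_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR_TRAIN_STD = (0.2023, 0.1994, 0.2010)

def normalize_pixel(rgb, mean=CIFAR_TRAIN_MEAN, std=CIFAR_TRAIN_STD):
    """Apply (value - mean) / std per channel, as transforms.Normalize does."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))

# ToTensor scales 8-bit values into [0, 1] before normalization is applied.
white = tuple(c / 255 for c in (255, 255, 255))
black = (0.0, 0.0, 0.0)
print(normalize_pixel(white))  # all channels positive
print(normalize_pixel(black))  # all channels negative

# Spatial size through the training pipeline.
size = 32
padded = size + 2 * 4  # RandomCrop(32, padding=4) pads each side by 4 pixels
cropped = 32           # and then crops a random 32x32 window
resized = 112          # Resize(112) scales the square image to 112x112
print(padded, cropped, resized)
```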

In [3]:
data_folder='/home/vmagent/app/data/dataset/cifar' # dataset location
train_set = datasets.CIFAR100(root=data_folder, train=True, download=True, transform=train_transform)
test_set = datasets.CIFAR100(root=data_folder, train=False, download=True, transform=test_transform)

Files already downloaded and verified
Files already downloaded and verified


### Prepare dataloader

In [4]:
train_loader = DataLoader(dataset=train_set, batch_size=128, shuffle=True, num_workers=1, drop_last=False)
validate_loader = DataLoader(dataset=test_set, batch_size=128, shuffle=False, num_workers=1, drop_last=False)  # no need to shuffle for evaluation
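
With `batch_size=128` and `drop_last=False`, the number of steps per epoch follows from the dataset sizes (CIFAR100 has 50,000 training and 10,000 test images). The sketch below works through that arithmetic; `steps_per_epoch` is a hypothetical helper, not part of PyTorch.

```python
import math

def steps_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields in one pass over the data."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

train_steps = steps_per_epoch(50_000, 128)  # CIFAR100 training split
test_steps = steps_per_epoch(10_000, 128)   # CIFAR100 test split
# Because drop_last=False, the final training batch is smaller than 128.
last_train_batch = 50_000 - (train_steps - 1) * 128
print(train_steps, test_steps, last_train_batch)
```

The smaller final batch is why the evaluation code later weights each batch's metrics by its size.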

## Create Transferrable Model

### Create Backbone model

In [5]:
backbone = timm.create_model('resnet18', pretrained=False, num_classes=100)

(Optional but recommended) Optimized weight initialization can improve early learning.

In [6]:
from e2eAIOK.common.trainer.model.model_utils.model_utils import initWeights
backbone.apply(initWeights)

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (act1): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (drop_block): Identity()
      (act1): ReLU(inplace=True)
      (aa): Identity()
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act2): ReLU(inplace=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, m

### Create teacher model
To use the distiller, we need a teacher model to guide the training. Download the teacher model, a ResNet50 pretrained on CIFAR100, from [here]() and put it at `${dataset}/model/demo/baseline/cifar100_res50PretrainI21k/cifar100_res50_pretrain_imagenet21k.pth`.

In [7]:
pretrain_model = "/home/vmagent/app/data/model/demo/baseline/cifar100_res50PretrainI21k/cifar100_res50_pretrain_imagenet21k.pth"
teacher_model = timm.create_model('resnet50', pretrained=False, num_classes=100)
teacher_model.load_state_dict(torch.load(pretrain_model), strict=True)

<All keys matched successfully>

### Define Distiller
Here we define a distiller using the KD algorithm; it takes a teacher model as input.

In [8]:
from e2eAIOK.ModelAdapter.engine_core.distiller import KD
distiller = KD(teacher_model)

### Make backbone model transferrable with distiller

In [9]:
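from e2eAIOK.ModelAdapter.engine_core.transferrable_model import make_transferrable_with_knowledge_distillation
loss_fn = torch.nn.CrossEntropyLoss()
# wrap the backbone so that training uses both the backbone loss and the distiller loss
model = make_transferrable_with_knowledge_distillation(backbone, loss_fn, distiller)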

## Train and Evaluate

### Create optimizer and scheduler

In [10]:
################# create optimizer #################
init_lr = 0.01
weight_decay = 0.0001
momentum = 0.9
optimizer = optim.SGD(model.parameters(), lr=init_lr, weight_decay=weight_decay, momentum=momentum)
################# create scheduler #################
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1)
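
`ExponentialLR` multiplies the learning rate by `gamma` on every `scheduler.step()` call, which the trainer below makes once per epoch. The following plain-Python sketch reproduces that schedule for the values used here; `exponential_lr` is a hypothetical helper that only mirrors the closed-form rule `init_lr * gamma ** epoch`.

```python
init_lr = 0.01
gamma = 0.1

def exponential_lr(epoch, init_lr=init_lr, gamma=gamma):
    """Learning rate in effect after `epoch` calls to scheduler.step()."""
    return init_lr * gamma ** epoch

schedule = [exponential_lr(e) for e in range(4)]
print(schedule)
```

With `gamma=0.1` the learning rate drops tenfold after each epoch. This has no effect in a one-epoch demo, but for longer runs a value closer to 1 may be a better starting point.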

### Create Trainer

In [11]:
max_epoch = 1 # max 1 epoch
print_interval = 10 

In [12]:
class Trainer:
    def __init__(self, model, optimizer, scheduler):
        self._model = model
        self._optimizer = optimizer
        self._scheduler = scheduler
        
    def train(self, train_dataloader, valid_dataloader, max_epoch):
        ''' 
        :param train_dataloader: train dataloader
        :param valid_dataloader: validation dataloader
        :param max_epoch: maximum number of training epochs
        '''
        for epoch in range(0, max_epoch):
            ################## train #####################
            self._model.train()  # set training flag
            for (cur_step,(data, label)) in enumerate(train_dataloader):
                self._optimizer.zero_grad()
                output = self._model(data)
                loss_value = self._model.loss(output, label) # transferrable model has loss attribute
                loss_value.backward() 
                if cur_step%print_interval == 0:
                    batch_acc = accuracy(output.backbone_output,label)[0] # use output.backbone_output instead of output
                    dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') # date time
                    print("[{}] epoch {} step {} : total loss {:4f}, training backbone loss {:.4f}, distiller loss {:4f}, training batch acc {:.4f}".format(
                      dt, epoch, cur_step, loss_value.total_loss.item(),loss_value.backbone_loss.item(), loss_value.distiller_loss.item(), batch_acc.item())) 
                self._optimizer.step()
            self._scheduler.step()
            ################## evaluate ######################
            self.evaluate(valid_dataloader)
            
    def evaluate(self, valid_dataloader):
        with torch.no_grad():
            self._model.eval()  
            backbone = self._model.backbone # use backbone in evaluation
            loss_cum = 0.0
            sample_num = 0
            acc_cum = 0.0
            total_step = len(valid_dataloader)
            for (cur_step,(data, label)) in enumerate(valid_dataloader):
                output = backbone(data)
                batch_size = data.size(0)
                sample_num += batch_size
                loss_cum += loss_fn(output, label).item() * batch_size
                acc_cum += accuracy(output, label)[0].item() * batch_size
                if cur_step%print_interval == 0:
                    print(f"step {cur_step}/{total_step}")
            dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') # date time
            loss_value = loss_cum/sample_num
            acc_value = acc_cum/sample_num

            print("[{}] evaluation loss {:.4f}, evaluation acc {:.4f}".format(
                dt, loss_value, acc_value))
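
The `evaluate` method above multiplies each batch's loss and accuracy by the batch size before dividing by the total sample count. The sketch below, using made-up numbers, shows why this matters when the last batch is smaller than the others; `weighted_mean` is a hypothetical helper written for this illustration.

```python
def weighted_mean(batch_values, batch_sizes):
    """Average per-batch metrics, weighted by the number of samples in each batch."""
    total = sum(v * n for v, n in zip(batch_values, batch_sizes))
    return total / sum(batch_sizes)

# Hypothetical per-batch accuracies (in percent) and batch sizes;
# the last batch is smaller, as happens with drop_last=False.
accs = [50.0, 50.0, 100.0]
sizes = [128, 128, 16]

weighted = weighted_mean(accs, sizes)  # every sample counts equally
naive = sum(accs) / len(accs)          # every batch counts equally
print(weighted, naive)
```

The naive mean gives the 16-sample batch the same influence as a full batch and overstates the accuracy here, whereas the sample-weighted mean matches the accuracy over the whole dataset.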

### Start training

In [13]:
%%time
trainer = Trainer(model, optimizer, scheduler)
trainer.train(train_loader,validate_loader,max_epoch)

[2023-02-09 04:55:18] epoch 0 step 0 : total loss 8.668498, training backbone loss 5.5269, distiller loss 9.017567, training batch acc 0.0000
[2023-02-09 04:55:33] epoch 0 step 10 : total loss 7.362644, training backbone loss 4.8556, distiller loss 7.641201, training batch acc 1.5625
[2023-02-09 04:55:48] epoch 0 step 20 : total loss 7.232350, training backbone loss 4.5274, distiller loss 7.532903, training batch acc 1.5625
[2023-02-09 04:56:02] epoch 0 step 30 : total loss 7.261838, training backbone loss 4.1673, distiller loss 7.605680, training batch acc 8.5938
[2023-02-09 04:56:17] epoch 0 step 40 : total loss 6.535180, training backbone loss 4.0882, distiller loss 6.807068, training batch acc 5.4688
[2023-02-09 04:56:32] epoch 0 step 50 : total loss 6.611567, training backbone loss 4.0500, distiller loss 6.896182, training batch acc 7.8125
[2023-02-09 04:56:47] epoch 0 step 60 : total loss 6.338083, training backbone loss 4.2036, distiller loss 6.575247, training batch acc 6.2500
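
Each log line above reports a total loss, a backbone loss, and a distiller loss. The logged numbers appear consistent with `total = 0.1 * backbone + 0.9 * distiller`. The short check below tests this against the first few log rows. The weights `0.1` and `0.9` are an assumption inferred from this log, not values taken from the library's documentation.

```python
# (total loss, backbone loss, distiller loss) copied from the log above.
log_rows = [
    (8.668498, 5.5269, 9.017567),
    (7.362644, 4.8556, 7.641201),
    (7.232350, 4.5274, 7.532903),
    (7.261838, 4.1673, 7.605680),
]

# Assumed weights, inferred from the log rather than read from the library.
backbone_weight, distiller_weight = 0.1, 0.9

# Largest gap between the reconstructed total and the logged total.
max_error = max(
    abs(backbone_weight * b + distiller_weight * d - total)
    for total, b, d in log_rows
)
print(f"largest mismatch: {max_error:.6f}")
```

The remaining mismatch is at the level of the rounding in the printed backbone loss, which is shown with only four decimal places.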
