# HW - A Guided Tutorial on Proxies for Estimating the Performance of Vision Transformers
**Author:** 
**Version:** 
**Requirements:**
- Python 3 (tested on v3.7.16)
- numpy==1.21.5
- torch==1.13.1
- torchvision==0.14.1
- timm==0.4.12
- opencv-python==4.9.0.80
- scipy==1.7.3
- scikit-image==0.19.2
- pyyaml==5.4.1
- easydict==1.13
- matplotlib
- ipykernel

### 0. Prelim: Import packages

In [1]:
import random
import numpy as np
import time
import torch
import torch.backends.cudnn as cudnn
from timm.utils.model import unwrap_model
from lib.datasets import build_dataset
from lib import utils
import json
from scipy import stats
from torch import nn
import matplotlib.pyplot as plt

### 1. Prelim: Hyperparameters and configuration

In [2]:
batch_size=8
api_Auto_FM_Benchmark='./AutoFM_CVPR2022_API_5_7M.json'
input_size=224
data_path='./dataset/imagenet'              
seed=0
num_workers=10

In [3]:
# The code is CPU-friendly, but running the notebook on GPU(s) will drastically speed up the computation.

# Define device for torch
print("CUDA is available:", torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

CUDA is available: False


### 2. Prelim: Dataset and Dataloader

This section loads data into PyTorch Dataloaders.
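
As a self-contained illustration of the loading pattern used in the next cell, here is a minimal sketch that swaps the ImageNet dataset for random tensors. The names `toy_images`, `toy_labels`, `toy_dataset`, and `toy_loader` are hypothetical, and the 32x32 image size is chosen only to keep the example small.

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# Random stand-in data: 20 RGB "images" of size 32x32 with labels in [0, 10)
torch.manual_seed(0)
toy_images = torch.randn(20, 3, 32, 32)
toy_labels = torch.randint(0, 10, (20,))
toy_dataset = TensorDataset(toy_images, toy_labels)

# A sequential sampler yields samples in dataset order, so the first batch
# is always the first `batch_size` samples.
toy_loader = DataLoader(
    toy_dataset, batch_size=8,
    sampler=SequentialSampler(toy_dataset), num_workers=0,
    drop_last=False,
)

# Draw a single batch, as the notebook does later when computing the proxies
x_toy, y_toy = next(iter(toy_loader))
print(x_toy.shape, y_toy.shape)
```

With `drop_last=False`, the final, smaller batch is kept, so 20 samples at a batch size of 8 give three batches.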

In [4]:
# Fix the seed for reproducibility

torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
cudnn.benchmark = True  # lets cuDNN auto-tune and pick the fastest algorithms (this can make runs non-deterministic)

## Load data
dataset_val, nb_classes = build_dataset(True, data_path, input_size, "train")  # build the dataset; a single batch is drawn from it later
sampler_val = torch.utils.data.SequentialSampler(dataset_val)
    
data_loader_val = torch.utils.data.DataLoader(
    dataset_val, batch_size=batch_size,
    sampler=sampler_val, num_workers=num_workers,
    pin_memory=True, drop_last=False
)



### 3. Use AutoFormer API to introduce candidate architectures
This section loads the candidate architectures via an external API [1] and uses it to explore Vision Transformers from an architectural perspective. Fifty architectures are introduced. For each architecture, the API provides:

1) the architectural configuration;

2) the test accuracy (for image classification) on the ImageNet-1k validation set.
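
To make the two items above concrete, here is a minimal sketch of how such an API file can be read. Only the keys `net_setting` and `test-accuracy` are taken from the notebook code. The contents of `net_setting` and all numeric values below are hypothetical placeholders, not data from the benchmark.

```python
import json

# Hypothetical stand-in for the API file: one entry per architecture index.
# "placeholder_field" and the accuracies are illustrative, not real values.
toy_api_text = """
{
  "0": {"net_setting": {"placeholder_field": 1}, "test-accuracy": 70.0},
  "1": {"net_setting": {"placeholder_field": 2}, "test-accuracy": 72.0}
}
"""

toy_candidates = json.loads(toy_api_text)

# Each entry pairs an architecture configuration with its test accuracy
for arch_id, entry in toy_candidates.items():
    print(arch_id, entry["net_setting"], entry["test-accuracy"])
```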

In [5]:
### Load candidate architectures
from model.get_Vision_Transformer_Arch import model_VIT

# Use a context manager so the file is closed after loading
with open(api_Auto_FM_Benchmark) as file_api:
    arch_candidate_set = json.load(file_api)

CUDA is available: False


### 4. Develop functions to compute Proxies for Architecture Performance Estimation

This section focuses on proxies that estimate the performance of candidate Vision Transformer architectures.

**Question 2**: Write your proxy

The goal is to design a proxy (a function) that predicts the performance of a given network without training its parameters [2]. We will consider two simple proxies:

**SNIP**: This proxy multiplies each weight by its gradient at initialization and sums the absolute values over all weights [2,3].

**Gradient Norm (grad_norm)**: This proxy sums the per-layer norms of the gradients [2] at initialization. A low gradient norm may signal training difficulties, whereas a very high norm may indicate exploding gradients.

The cells below implement grad_norm and a SynFlow-style score, which applies the same weight-times-gradient measure after replacing every parameter with its absolute value.
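
The SNIP description above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the helper `snip_score` and the toy network `toy_net` are hypothetical names, the per-weight saliencies are summed into one score without any normalization, and only `nn.Conv2d` and `nn.Linear` layers are counted, mirroring the layer filter used elsewhere in this notebook.

```python
import torch
from torch import nn

def snip_score(net, inputs, targets, loss_fn):
    """Sum of |weight * gradient| over all Conv2d/Linear weights."""
    net.zero_grad()
    loss = loss_fn(net(inputs), targets)
    loss.backward()

    score = 0.0
    for layer in net.modules():
        if isinstance(layer, (nn.Conv2d, nn.Linear)) and layer.weight.grad is not None:
            score += (layer.weight * layer.weight.grad).abs().sum().item()
    return score

# Toy classifier and random batch, used only to exercise the function
torch.manual_seed(0)
toy_net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 8 * 8, 16),
    nn.ReLU(),
    nn.Linear(16, 10),
)
toy_x = torch.randn(4, 3, 8, 8)
toy_y = torch.randint(0, 10, (4,))

toy_score = snip_score(toy_net, toy_x, toy_y, nn.CrossEntropyLoss())
print("SNIP score:", toy_score)
```

Because the score depends on the random initialization and on the batch, it is usually computed with a fixed seed when comparing architectures.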


In [6]:
def get_layer_metric_array_grad_norm(net, metric):
    # Apply `metric` to every Conv2d/Linear layer and sum the per-layer results
    metric_array = []

    for layer in net.modules():
        if isinstance(layer, (nn.Conv2d, nn.Linear)):
            metric_array.append(metric(layer))

    return sum(metric_array).item()

def get_grad_norm_scores(net, inputs, targets, loss_fn):
    for param in net.parameters():
        param.requires_grad = True

    # One forward/backward pass on the batch to populate the gradients
    net.zero_grad()
    outputs = net(inputs)

    loss = loss_fn(outputs, targets)
    loss.backward()

    # Norm of each layer's weight gradient; a scalar zero if the layer received no gradient
    def grad_norm(layer):
        if layer.weight.grad is not None:
            return layer.weight.grad.norm()
        return torch.zeros((), device=layer.weight.device)

    return get_layer_metric_array_grad_norm(net, grad_norm)

In [7]:
def get_layer_metric_array_synflow(net, metric):
    # Apply `metric` to every Conv2d/Linear layer and return the per-layer results
    metric_array = []

    for layer in net.modules():
        if isinstance(layer, (nn.Conv2d, nn.Linear)):
            metric_array.append(metric(layer))

    return metric_array

def compute_synflow_per_weight(net, inputs, targets, loss_fn):

    # Replace every parameter with its absolute value; keep the signs to restore them later
    @torch.no_grad()
    def linearize(net):
        signs = {}
        for name, param in net.state_dict().items():
            signs[name] = torch.sign(param)
            param.abs_()
        return signs

    # Restore the original parameter values
    @torch.no_grad()
    def nonlinearize(net, signs):
        for name, param in net.state_dict().items():
            if 'weight_mask' not in name:
                param.mul_(signs[name])

    # Keep the signs of all parameters
    signs = linearize(net)

    # Compute gradients on the given batch with the given loss
    net.zero_grad()
    net.float()
    outputs = net(inputs)

    loss = loss_fn(outputs, targets)
    loss.backward()

    # Per-weight score: |weight * gradient| (zeros if the layer received no gradient)
    def synflow(layer):
        if layer.weight.grad is not None:
            return torch.abs(layer.weight * layer.weight.grad)
        else:
            return torch.zeros_like(layer.weight)

    # Compute the synflow score of each layer
    grads_abs = get_layer_metric_array_synflow(net, synflow)

    # Sum the synflow scores over all layers
    def sum_arr(arr):
        total = 0.
        for layer_scores in arr:
            print(layer_scores.shape)  # debug: shape of this layer's weights
            print(layer_scores.sum())  # debug: this layer's total score
            total += torch.sum(layer_scores).item()
        return total

    grads_abs = sum_arr(grads_abs)

    # Restore the signs of all parameters
    nonlinearize(net, signs)

    return grads_abs
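
For comparison, SynFlow is usually described as a data-free score: the parameters are replaced by their absolute values, an all-ones input is fed through the network, and the sum of the outputs plays the role of the loss. The cell above instead uses a real batch and a real loss. The following is a minimal sketch of the data-free variant under those assumptions. The helper `synflow_score` and the toy network `toy_mlp` are hypothetical names, and the sketch scores all parameters rather than only `nn.Conv2d` and `nn.Linear` weights.

```python
import torch
from torch import nn

def synflow_score(net, input_shape):
    """Data-free SynFlow-style score: sum of |param * grad| for an all-ones input."""
    # Replace every parameter by its absolute value, remembering the signs
    signs = {}
    with torch.no_grad():
        for name, param in net.named_parameters():
            signs[name] = torch.sign(param)
            param.abs_()

    # The "loss" is simply the sum of the outputs for an all-ones input
    net.zero_grad()
    ones = torch.ones(1, *input_shape)
    net(ones).sum().backward()

    score = 0.0
    for param in net.parameters():
        if param.grad is not None:
            score += (param * param.grad).abs().sum().item()

    # Restore the original signs
    with torch.no_grad():
        for name, param in net.named_parameters():
            param.mul_(signs[name])
    return score

# Bias-free toy MLP, used only to exercise the function
torch.manual_seed(0)
toy_mlp = nn.Sequential(
    nn.Linear(4, 8, bias=False),
    nn.ReLU(),
    nn.Linear(8, 2, bias=False),
)
weights_before = [p.detach().clone() for p in toy_mlp.parameters()]

toy_synflow = synflow_score(toy_mlp, (4,))
print("SynFlow score:", toy_synflow)
```

For this bias-free toy network, every activation is non-negative once the weights are made non-negative, so the ReLU acts as the identity and each of the two layers contributes exactly the summed output to the score.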

### 5. Computing Proxies for Estimating Performance of Candidate Architectures
This section computes the proxy scores for the candidate Vision Transformer architectures from the API.

**Leveraging candidate architecture information**: We will use the information retrieved from the API for each candidate Vision Transformer architecture:

- **Architecture configuration**: This lets us create the PyTorch model for the specific candidate architecture.
- **Test accuracy**: The API provides the test accuracy, but we do not use it for ranking here; it serves only to evaluate the proxy afterwards.

**Proxy Computation**: We will use the proxy functions **grad_norm** and **SynFlow** (defined earlier) to compute a score for each candidate architecture. This score estimates the potential performance of the architecture without training it.

In [8]:
### Get one batch of data to compute the proxies
x, y = next(iter(data_loader_val))
x = x.to(device)
y = y.to(device)

### Initialize the loss function used to compute gradients
lossfunc = nn.CrossEntropyLoss().to(device)  # .to(device) also works when CUDA is unavailable

proxy = 'synflow'  # change to 'grad_norm' to use the other proxy

# Build each network, then compute its zero-cost proxy
proxy_scores = []
accs = []
st_time = time.time()
# First fifty architectures
for arch, info in arch_candidate_set.items():
    if int(arch) == 50:
        break
    arch_start = time.time()
    ### Initialize the network from the API configuration
    net_setting = info['net_setting']
    net_arch = unwrap_model(model_VIT)
    net_arch.set_sample_config(config=net_setting)
    net_arch.to(device)

    ## Compute the proxy (both functions take the same inputs)
    if proxy == 'grad_norm':
        res = get_grad_norm_scores(net_arch, x, y, lossfunc)
    elif proxy == 'synflow':
        res = compute_synflow_per_weight(net_arch, x, y, lossfunc)
    else:
        raise ValueError(f"Unknown proxy: {proxy}")

    ## Store the proxy score to compute its correlation with accuracy later
    del net_arch
    print('Architectures: ', arch)
    print('Test-Accuracy: ', info['test-accuracy'])
    proxy_scores.append(res)
    accs.append(info['test-accuracy'])
    print('Zerocost proxy score: ', res)
    arch_end = time.time()
    print('Computation Proxy Time: ', arch_end - arch_start)
    print('---------------------------------------------')
end_time = time.time()
print('total time: ', end_time - st_time)

torch.Size([256, 3, 16, 16])
tensor(19.1641, grad_fn=<SumBackward0>)
torch.Size([768, 256])
tensor(0.7682, grad_fn=<SumBackward0>)
torch.Size([256, 256])
tensor(1.4427, grad_fn=<SumBackward0>)
torch.Size([1024, 256])
tensor(1.7797, grad_fn=<SumBackward0>)
torch.Size([256, 1024])
tensor(1.7506, grad_fn=<SumBackward0>)
torch.Size([768, 256])
tensor(0.9732, grad_fn=<SumBackward0>)
torch.Size([256, 256])
tensor(1.7208, grad_fn=<SumBackward0>)
torch.Size([1024, 256])
tensor(1.2248, grad_fn=<SumBackward0>)
torch.Size([256, 1024])
tensor(1.2349, grad_fn=<SumBackward0>)
torch.Size([768, 256])
tensor(0.6773, grad_fn=<SumBackward0>)
torch.Size([256, 256])
tensor(1.4500, grad_fn=<SumBackward0>)
torch.Size([1024, 256])
tensor(1.3707, grad_fn=<SumBackward0>)
torch.Size([256, 1024])
tensor(1.4180, grad_fn=<SumBackward0>)
torch.Size([768, 256])
tensor(0.6771, grad_fn=<SumBackward0>)
torch.Size([256, 256])
tensor(1.1591, grad_fn=<SumBackward0>)
torch.Size([1024, 256])
tensor(1.3248, grad_fn=<SumBackwa

KeyboardInterrupt: 

### 6. Evaluating Proxy Effectiveness
This section focuses on evaluating the effectiveness of the proxy in predicting actual performance.

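**Correlation Analysis**: We will calculate the Kendall rank correlation (Kendall's tau) between the proxy scores and the test accuracies. A value close to 1 means the proxy ranks the architectures in nearly the same order as their accuracies.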

In [None]:
###compute correlation with accuracy
kendalltau = stats.kendalltau(proxy_scores, accs)
print('*'*50)
print('Kendalltau:', kendalltau)
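
To build intuition for the statistic used above, here is a small worked example on made-up numbers. The lists `toy_scores` and `toy_accs` are hypothetical and unrelated to the benchmark. Kendall's tau counts concordant and discordant pairs: of the six pairs below, five are ordered the same way by score and by accuracy and one is not, which gives (5 - 1) / 6.

```python
from scipy import stats

# Hypothetical proxy scores and accuracies for four architectures.
# Architectures 1 and 2 are ranked in opposite order by the two lists.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_accs = [70.0, 71.0, 72.5, 75.0]

tau, p_value = stats.kendalltau(toy_scores, toy_accs)
print("tau:", tau)  # (5 concordant - 1 discordant) / 6 pairs, about 0.667
```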

### 7. Visualizing the Correlation Distribution
This section visualizes the correlation between the proxy scores and the test accuracies.

In [None]:
plt.scatter(x=proxy_scores, y=accs)
plt.ylabel("Accuracy")
plt.xlabel("Proxy Scores")

### 8. Identifying Top Architectures based on Proxies
This section focuses on identifying the top-performing architectures based on the computed proxy scores.

**Ranking by Proxy Scores**: We will sort the ViT architectures in descending order of their proxy scores. This ranking prioritizes architectures with higher predicted performance.

**Top Architectures**: We can then identify the top-ranked architectures (e.g., top 1 or top 5) as potential candidates for further exploration.

In [None]:
from model.get_subnet_Vision_Transformer_Arch import get_subnet_arch
best_index_architectures = np.argmax(proxy_scores)  # index of the top-1 architecture by proxy score

# Load the top-1 architecture and inspect its layer-wise details
# (this assumes the API keys are the architecture indices "0", "1", ... in loop order)
net_arch = arch_candidate_set[str(best_index_architectures)]['net_setting']
model_best_by_proxys = get_subnet_arch(net_arch)

print(model_best_by_proxys)
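
The cell above selects only the top-1 architecture. A top-k selection, as mentioned in the text, could be sketched as follows. The values in `toy_proxy_scores` and `toy_accs` are made up for illustration, and `k` is a free choice.

```python
import numpy as np

# Hypothetical proxy scores and accuracies for five architectures
toy_proxy_scores = [3.2, 7.5, 1.1, 9.4, 5.0]
toy_accs = [71.0, 73.2, 69.8, 74.1, 72.0]

k = 3
# argsort is ascending, so reverse it to rank from highest to lowest score
top_k = np.argsort(toy_proxy_scores)[::-1][:k]
print("Top-k indices:", top_k.tolist())  # [3, 1, 4]

# Accuracies of the selected architectures, in rank order
top_k_accs = [toy_accs[i] for i in top_k]
print("Their accuracies:", top_k_accs)  # [74.1, 73.2, 72.0]
```

Comparing the accuracies of the selected architectures with the best accuracy in the whole set gives a quick check of how much is lost by trusting the proxy.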

### References

[1] Chen, Minghao, Houwen Peng, Jianlong Fu, and Haibin Ling. "Autoformer: Searching transformers for visual recognition." In Proceedings of the IEEE/CVF international conference on computer vision, pp. 12270-12280. 2021.

[2] Li, Guihong, Duc Hoang, Kartikeya Bhardwaj, Ming Lin, Zhangyang Wang, and Radu Marculescu. "Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities." arXiv preprint arXiv:2307.01998 (2023).

[3] Lee, Namhoon, Thalaiyasingam Ajanthan, and Philip HS Torr. "Snip: Single-shot network pruning based on connection sensitivity." arXiv preprint arXiv:1810.02340 (2018).