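# Finding CPU-bound Models
Measure the computational intensity of different DNN models and their performance characteristics on a given device, in order to find CPU-bound (compute-bound) DNN models for testing the speedup from FSS.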

In [2]:
import os 
os.environ["CUDA_VISIBLE_DEVICES"]="1"
import torch
import sys
sys.path.append("/workspace/packages/autoqnn")
import autoqnn

Taking AlexNet and ResNet18 as examples, evaluate the compute performance of these two models on an RTX 2080 Ti.

In [3]:
import torchvision
alexnet = torchvision.models.alexnet()
resnet18 = torchvision.models.resnet18()

In [4]:
alexnet_flops,_,alexnet_mem = autoqnn.utils.get_flops_params_mems(alexnet,(2,3,224,224),32,32,"alexnet")
resnet_flops,_,resnet_mem = autoqnn.utils.get_flops_params_mems(resnet18,(2,3,224,224),32,32,"resnet18")
# !pip install thop

Flops, Params and Mems of alexnet is [0.71GFLOPs,61.10M, 237.31M]
Flops, Params and Mems of resnet18 is [1.82GFLOPs,11.69M, 75.23M]


Compute the computational intensity (flops/mem) of the two models.

In [5]:
alexnet_compintensity = alexnet_flops/alexnet_mem
resnet_compintensity = resnet_flops/resnet_mem
print(alexnet_compintensity,resnet_compintensity)

2.872081575209959 23.060971033535996
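
The gap between the two intensities may come largely from AlexNet's big fully connected layers, where each weight is read once but used for only a few operations at small batch sizes. The sketch below counts FLOPs and bytes by hand for one dense layer. `dense_layer_intensity` is a hypothetical helper, and its counting convention (2 FLOPs per multiply-accumulate, FP32 values, each tensor moved exactly once) is an assumption that may differ from what `autoqnn.utils.get_flops_params_mems` does internally.

```python
def dense_layer_intensity(batch, in_features, out_features, bytes_per_value=4):
    """FLOPs per byte for y = x @ W, bias ignored (hypothetical helper)."""
    # Assumption: one multiply and one add per multiply-accumulate.
    flops = 2 * batch * in_features * out_features
    # Assumption: input and weights are read once, output is written once.
    values_moved = (batch * in_features
                    + in_features * out_features
                    + batch * out_features)
    return flops / (values_moved * bytes_per_value)

# First fully connected layer of torchvision's AlexNet: 9216 -> 4096 features.
for batch in (2, 64):
    print("batch=%d intensity=%.3f FLOPs/byte"
          % (batch, dense_layer_intensity(batch, 9216, 4096)))
```

Under these assumptions the layer sits close to 1 FLOP/byte at batch size 2 and rises to about 31 FLOPs/byte at batch size 64, which suggests the measured intensity depends strongly on the batch size used in the `(2,3,224,224)` input shape.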


Get the upper limit of computational intensity for each compute device (peak FLOPS divided by memory bandwidth).

In [6]:
gpus={
# device_name:[peak FLOPS, memory bandwidth (bytes/s), power (W)]
#      "QS-855+":[1.032*10**12,34.1*2**30,10],
#      "QS-888+":[1.72*10**12,51.2*2**30,10],
     "1080ti":[10.616*10**12,484*2**30,250],
     "2080ti":[11.75*10**12,616*2**30,250],
     "3090":[29.28*10**12,936.2*2**30,350],
#      "A6000":[31.29*10**12,768*2**30,300]
     }
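def get_device_roofline(devices,max_I=30):
    '''Return each device's ridge-point intensity and its two roofline segments.'''
    # ridge point: intensity at which the device switches from IO-bound to compute-bound
    devices_Is={k:v[0]/v[1] for k,v in devices.items()}
    devices_lines={}
    for k in devices.keys():
        I = devices_Is[k]
        f = devices[k][0]/10**12  # peak performance in TFLOPS
        # each line is [[x0,x1],[y0,y1]]: x is intensity, y is TFLOPS
        line1=[[0,I],[0,f]]       # bandwidth-limited slope
        line2=[[I,max_I],[f,f]]   # compute-limited plateau
        devices_lines[k] = [line1,line2]
    return devices_Is, devices_lines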
gpus_Is, gpus_lines = get_device_roofline(gpus)
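
The two segments built by `get_device_roofline` are the two branches of the usual roofline expression, min(peak FLOPS, intensity × bandwidth). As a minimal sketch of how one might sample that curve without plotting it, the snippet below evaluates it for the 2080ti entry. `roofline_tflops` is a hypothetical helper, not part of `autoqnn`, and the device figures are copied from the `gpus` table.

```python
def roofline_tflops(intensity, peak_flops, bandwidth):
    """Attainable performance in TFLOPS at a given intensity (FLOPs/byte)."""
    return min(peak_flops, intensity * bandwidth) / 10**12

# 2080ti entry copied from the gpus table above.
peak_flops, bandwidth = 11.75 * 10**12, 616 * 2**30

# Ridge point: the intensity at which the slope meets the plateau.
ridge = peak_flops / bandwidth

for intensity in (1, 5, 10, ridge, 25, 30):
    print("I=%6.2f -> %.3f TFLOPS"
          % (intensity, roofline_tflops(intensity, peak_flops, bandwidth)))
```

Any model whose intensity lies to the left of `ridge` is limited by bandwidth on this device, and any model to the right of it is limited by peak compute.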

In [12]:
devices={# device_name:[peak FLOPS, memory bandwidth (bytes/s), power (W)]
     "QS-855+":[1.032*10**12,34.1*2**30,10],
     "QS-888+":[1.72*10**12,51.2*2**30,10],
     "1080ti":[10.616*10**12,484*2**30,250],
     "2080ti":[11.75*10**12,616*2**30,250],
     "3090":[29.28*10**12,936.2*2**30,350],
     "A6000":[31.29*10**12,768*2**30,300],
     "Xeon E5-2678 v3":[1.9/2*10**12,68*2**30,120],
     "Apple A14 Bionic":[1.536*10**12,34.1*2**30,10],
     "Kirin 9000":[2.332*10**12,44*2**30,10],
     }

def get_attainable_FLOPS(model,device_key,w_bit=32,a_bit=32,model_name="model"):
    '''
    Return the attainable FLOPS of `model` on the device named `device_key`.
    Candidate devices include:
    ['QS-855+', 'QS-888+', '1080ti', '2080ti', '3090', 'A6000', 'Xeon E5-2678 v3',
    'Apple A14 Bionic', 'Kirin 9000']
    '''
    # get model computing intensity
    # (assumes the 3rd/4th arguments are the weight/activation bit widths)
    model_flops,_,model_mems=autoqnn.utils.get_flops_params_mems(model,(2,3,224,224),w_bit,a_bit,model_name)
    intensity = model_flops/model_mems
    # get device computing intensity; device_key must be a name in `devices`
    device_flops,bandwidth,_ = devices[device_key]
    device_intensity = device_flops/bandwidth
    # get attainable performance
    if intensity>=device_intensity:
        attainable_flops = device_flops
        print("%s is Compute-bound model"%(model_name))
    else:
        # model_flops/model_mems * device_bandwidth
        attainable_flops = intensity*bandwidth
        print("%s is IO-bound model"%(model_name))
    return attainable_flops

In [16]:
alexnet_af = get_attainable_FLOPS(alexnet,"2080ti",model_name="alexnet")
print("Attainable FLOPS is %.4f GFLOPS"%(alexnet_af/1000**3))

Flops, Params and Mems of alexnet is [0.71GFLOPs,61.10M, 237.31M]
alexnet is IO-bound model
Attainable FLOPS is 1899.6665 GFLOPS


In [17]:
resnet_af = get_attainable_FLOPS(resnet18,"2080ti",model_name="resnet18")
print("Attainable FLOPS is %.4f GFLOPS"%(resnet_af/1000**3))

Flops, Params and Mems of resnet18 is [1.82GFLOPs,11.69M, 75.23M]
resnet18 is Compute-bound model
Attainable FLOPS is 11750.0000 GFLOPS
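
The same check can be repeated across every entry of the `devices` table to see where each model changes regime. The sketch below is self-contained: it reuses the two intensities printed earlier in this notebook rather than calling `autoqnn`, so it only holds for FP32 weights and activations with the `(2,3,224,224)` input. `classify` is a hypothetical helper.

```python
# Intensities as printed earlier in this notebook (flops/mem).
model_intensity = {"alexnet": 2.872081575209959,
                   "resnet18": 23.060971033535996}

# [peak FLOPS, bandwidth] copied from the devices table above.
device_specs = {
    "QS-855+": [1.032*10**12, 34.1*2**30],
    "QS-888+": [1.72*10**12, 51.2*2**30],
    "1080ti": [10.616*10**12, 484*2**30],
    "2080ti": [11.75*10**12, 616*2**30],
    "3090": [29.28*10**12, 936.2*2**30],
    "A6000": [31.29*10**12, 768*2**30],
    "Xeon E5-2678 v3": [1.9/2*10**12, 68*2**30],
    "Apple A14 Bionic": [1.536*10**12, 34.1*2**30],
    "Kirin 9000": [2.332*10**12, 44*2**30],
}

def classify(intensity, peak_flops, bandwidth):
    """Return (regime, attainable FLOPS) under the roofline model."""
    if intensity >= peak_flops / bandwidth:
        return "Compute-bound", peak_flops
    return "IO-bound", intensity * bandwidth

for device, (peak_flops, bandwidth) in device_specs.items():
    for model, intensity in model_intensity.items():
        regime, attainable = classify(intensity, peak_flops, bandwidth)
        print("%-17s %-9s %-13s %10.2f GFLOPS"
              % (device, model, regime, attainable / 1000**3))
```

With these figures, a model that is compute-bound on one device can be IO-bound on another with a higher ridge point, so the choice of test device matters when looking for compute-bound candidates.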
