## Lab 2: Optimizing Inference for Intel Hybrid Platforms

### Prerequisites: 
- Python 3
- onnx
- onnxruntime
- numpy
- cv2 (provided by the opencv-python package)

### Lab Objectives: 
- Improve performance with intra/inter thread count configurations
- Reduce CPU utilization by setting spin loops to false
- Improve performance with INT8 optimizations
- Further improve performance with ONNX Runtime graph optimization levels

### Let's start with a baseline of our performance with ResNet50

Import dependencies

In [None]:
import cv2
import numpy as np
import onnxruntime as rt
import onnx

Load the ResNet50 model

Make sure the model created in Lab 01 is located at `model_path`.

In [None]:
model_path = "../../models/resnet50.onnx"
model = onnx.load(model_path)
print("Model loaded!")

>Create the runtime session and use *CPUExecutionProvider* to run on the CPU with MLAS (Microsoft Linear Algebra Subprograms).

In [None]:
print("Available providers")
print(rt.get_available_providers())

In [None]:
sess = rt.InferenceSession(model_path, providers=['CPUExecutionProvider'])     #Use CPU execution provider

Read a sample image to classify the object.

>Find the sample image under resources/ and read the image in.

In [None]:
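# Read the image (cv2.imread returns None instead of raising if the path is wrong)
img = cv2.imread("../../resources/lab03_image.jpg")
assert img is not None, "Could not read the sample image; check the path"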
cv2.imshow("lab03_img", img)
cv2.waitKey(0) 
cv2.destroyWindow('lab03_img')

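# Preprocess the image for ResNet50.
def preprocess(img):
    # cv2.imread returns BGR, but the mean/std values below are in RGB order
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img / 255.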
    img = cv2.resize(img, (256, 256))
    h, w = img.shape[0], img.shape[1]
    y0 = (h - 224) // 2
    x0 = (w - 224) // 2
    img = img[y0 : y0+224, x0 : x0+224, :]
    img = (img - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
    img = np.transpose(img, axes=[2, 0, 1])
    img = img.astype(np.float32)
    img = np.expand_dims(img, axis=0)
    return img

img = preprocess(img)

Run inference with the sample image

In [None]:
input_name = sess.get_inputs()[0].name 
output_name = sess.get_outputs()[0].name 

prediction = sess.run(None, {input_name: img})[0]
prediction = np.squeeze(prediction)
top = np.argsort(prediction)[::-1]

# Read in human-readable class labels 
with open("../../resources/labels.txt", 'r') as f:
    labels = [l.rstrip() for l in f]

index = top[0]
print(index)
# The value is a probability only if the exported model ends in a softmax; otherwise it is a raw score
print("Predicted class: {0}  Score: {1}".format(labels[index], prediction[index]))
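
The value printed for the top class is a true probability only if the exported model ends in a softmax layer; some ResNet50 exports return raw scores (logits) instead. Below is a minimal sketch of turning scores into probabilities and listing the top-k classes, assuming the output is raw logits. `softmax` and `top_k` are hypothetical helper names that are not part of the lab, the example scores are made up, and the code uses only the standard library (`np.exp` would work equally well).

```python
import math

def softmax(scores):
    """Convert raw scores (logits) to probabilities that sum to 1."""
    scores = [float(s) for s in scores]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k=5):
    """Return (class_index, probability) pairs for the k highest scores."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]

# Made-up logits for a 4-class toy example (not model output)
example_logits = [1.0, 3.0, 0.5, 2.0]
for idx, p in top_k(example_logits, k=2):
    print("class {0}: {1:.3f}".format(idx, p))
```

In the notebook, `top_k(prediction)` could be called on the squeezed output, with `labels[idx]` giving the readable class name.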

Let's measure the model's baseline performance by running inference 100 times.

In [None]:
import time 

start = time.time()
for i in range(100):
    prediction = sess.run(None, {input_name: img})[0]
end = time.time()
print("Total execution time of 100 inference runs: {:.3f} seconds".format(end-start))
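
A single total can hide run-to-run variation, and the first few calls may include one-time costs such as memory allocation. Below is a minimal sketch of a reusable timing helper with untimed warm-up runs. `benchmark` and its parameters are hypothetical names, not part of the lab or of ONNX Runtime, and the workload shown is a stand-in.

```python
import statistics
import time

def benchmark(run_once, runs=100, warmup=10):
    """Time `run_once` and return latency statistics."""
    for _ in range(warmup):
        run_once()  # warm-up runs are not timed
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "total_s": sum(latencies_ms) / 1000.0,
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }

# Stand-in workload; in the lab this would be
# lambda: sess.run(None, {input_name: img})
stats = benchmark(lambda: sum(range(1000)), runs=20, warmup=2)
print("mean {mean_ms:.3f} ms, median {median_ms:.3f} ms".format(**stats))
```

Reporting the median next to the mean makes occasional slow runs easier to spot when comparing configurations later in the lab.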

#### Now that we have a baseline with ONNX Runtime's default options, we will explore some optimizations that improve performance on hybrid systems.

Let's improve performance with intra/inter thread count controls.

>For this lab, let's experiment with different values of `intra_op_num_threads` to see how they affect performance.

For additional help, refer to the Thread Tuning - Low Precision Tuning section for an example.

In [None]:
# Create session options to change different knobs for CPU (MLAS)
'''
    TODO: Create SessionOptions() object
    TODO: Change the number of threads
'''


'''
    TODO: Select the execution mode type, use ORT_SEQUENTIAL
    TODO: Use GraphOptimizationLevel.ORT_ENABLE_BASIC
    TODO: Create the session
'''
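
One possible way to approach the TODOs above is sketched here; it is not the lab's reference solution. `candidate_thread_counts` and `make_session` are hypothetical helpers, and the `session.intra_op.allow_spinning` entry is an optional extra that relates to the spin-loop objective. On a hybrid CPU, `os.cpu_count()` counts logical processors across both performance and efficiency cores, so the best thread count may be lower and should be measured rather than assumed.

```python
import os

def candidate_thread_counts(max_threads=None):
    """Powers of two below the limit, plus the limit itself."""
    limit = max_threads or os.cpu_count() or 1
    counts = []
    n = 1
    while n < limit:
        counts.append(n)
        n *= 2
    counts.append(limit)
    return counts

def make_session(model_path, intra_threads, allow_spinning=True):
    """Create a CPU session with explicit threading options."""
    # Imported here so the helper above has no third-party dependencies
    import onnxruntime as rt

    opts = rt.SessionOptions()
    opts.intra_op_num_threads = intra_threads  # threads used inside one operator
    # Operators run one at a time, so inter-op threads are not used in this mode
    opts.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL
    opts.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_BASIC
    if not allow_spinning:
        # Worker threads wait without busy-spinning between tasks
        opts.add_session_config_entry("session.intra_op.allow_spinning", "0")
    return rt.InferenceSession(model_path, sess_options=opts,
                               providers=["CPUExecutionProvider"])

# Thread counts worth timing on this machine
print(candidate_thread_counts())
```

In the notebook itself the options object needs to be named `sess_options` and the session `sess`, because later cells refer to those names.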

In [None]:
start = time.time()
for i in range(100):
    prediction = sess.run(None, {input_name: img})[0]
end = time.time()
print("Total execution time of 100 inference runs: {:.3f} seconds".format(end-start))

Next, we'll take advantage of ONNX Runtime's graph optimizations by enabling all of them. \
>Change the graph optimization level to enable all (ORT_ENABLE_ALL) \
>Hint: Try typing `rt.` and see what auto-completion suggests

In [None]:
sess_options.graph_optimization_level = '''TODO: Change to ORT_ENABLE_ALL configuration'''

sess = rt.InferenceSession(model_path, providers=["CPUExecutionProvider"], sess_options=sess_options)

start = time.time()
for i in range(100):
    prediction = sess.run(None, {input_name: img})[0]
end = time.time()
print("Total execution time of 100 inference runs: {:.3f} seconds".format(end-start))
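
To see how much each graph optimization level contributes, the levels can be compared one at a time. The sketch below lists the four members of `rt.GraphOptimizationLevel` and builds a session for a chosen level. `LEVEL_NAMES` and `session_for_level` are hypothetical names introduced for illustration, and the comments summarize the levels only loosely.

```python
LEVEL_NAMES = (
    "ORT_DISABLE_ALL",      # no graph rewrites
    "ORT_ENABLE_BASIC",     # e.g. constant folding, redundant node elimination
    "ORT_ENABLE_EXTENDED",  # adds more complex node fusions
    "ORT_ENABLE_ALL",       # adds layout optimizations
)

def session_for_level(model_path, level_name, intra_threads=0):
    """Create a CPU session that uses the named graph optimization level."""
    if level_name not in LEVEL_NAMES:
        raise ValueError("Unknown optimization level: {0}".format(level_name))
    # Imported here so the name check above works without onnxruntime installed
    import onnxruntime as rt

    opts = rt.SessionOptions()
    opts.intra_op_num_threads = intra_threads  # 0 lets ONNX Runtime choose
    opts.graph_optimization_level = getattr(rt.GraphOptimizationLevel, level_name)
    return rt.InferenceSession(model_path, sess_options=opts,
                               providers=["CPUExecutionProvider"])

for name in LEVEL_NAMES:
    print(name)
```

Timing the same 100-run loop with a session from each level gives a rough picture of where the gains come from on a given machine.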

Finally, we'll put everything together and apply it to the INT8 model we created in Lab01_02.

>Apply all of the optimizations learned for the inference session

In [None]:
model_path = "../../models/resnet50_int8.onnx"
model = onnx.load(model_path)

del sess_options

#Use all of the optimization techniques learned here
'''
    TODO: Apply all of the optimization techniques learned by 
    1. creating a SessionOptions() object
    2. Enabling all graph optimizations
    3. Using a performant number of threads
    4. Using CPU as the provider with the Int8 Quantized model
'''

input_name = sess.get_inputs()[0].name 
output_name = sess.get_outputs()[0].name 

start = time.time()
for i in range(100):
    prediction = sess.run(None, {input_name: img})[0]
end = time.time()
print("Total execution time of 100 inference runs: {:.3f} seconds".format(end-start))
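
With four timings collected (baseline, tuned threads, all graph optimizations, INT8), a small table makes the comparison easier to read. This is a minimal sketch: `speedup_table` is a hypothetical helper, and the numbers are made-up placeholders rather than measurements, to be replaced with the totals printed in the cells above.

```python
def speedup_table(timings, baseline="baseline"):
    """Return (name, seconds, speedup) rows relative to timings[baseline]."""
    base = timings[baseline]
    return [(name, secs, base / secs) for name, secs in timings.items()]

# Placeholder values for illustration only, NOT measured results
timings = {
    "baseline": 2.0,
    "tuned threads": 1.6,
    "all graph optimizations": 1.5,
    "int8 + all optimizations": 0.8,
}

for name, secs, speedup in speedup_table(timings):
    print("{0:<26} {1:>7.3f} s  {2:>5.2f}x".format(name, secs, speedup))
```

Speed is only half of the comparison for the quantized model, so it is also worth checking that its top predicted class still matches the FP32 model's on the sample image.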