# Intel OpenVINO (Open Visual Inference and Neural Network Optimization)


https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html

The OpenVINO™ toolkit enables you to optimize a deep learning model from almost any framework and deploy it with best-in-class performance on a range of Intel® processors and other hardware platforms.

OpenVINO™ toolkit is an open source toolkit that accelerates AI inference with lower latency and higher throughput while maintaining accuracy, reducing model footprint, and optimizing hardware use. It streamlines AI development and integration of deep learning in domains like computer vision, large language models (LLM), and generative AI.

# Install


```bat
python -m venv openvino_env

openvino_env\Scripts\activate

python -m pip install --upgrade pip

pip install openvino-genai==2024.2.0

pip install "openvino>=2023.1.0" transformers "torch>=2.1" tqdm --extra-index-url https://download.pytorch.org/whl/cpu
```
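To confirm that the packages landed in the active environment, a check along these lines may help. It is only a sketch built on the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for illustration, and the package list simply mirrors the `pip install` commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip commands in this section.
report = installed_versions(
    ["openvino", "openvino-genai", "transformers", "torch", "tqdm"]
)
for name, version in report.items():
    print(f"{name:<16} {version or 'NOT INSTALLED'}")
```

A `None` entry means the package is missing from the environment that runs the script, which usually points to the virtual environment not being activated.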
# OpenVINO IR format
https://docs.openvino.ai/2024/documentation/openvino-ir-format.html

OpenVINO supports the following model formats:

- PyTorch
- TensorFlow
- TensorFlow Lite
- ONNX
- PaddlePaddle
- OpenVINO IR
# Neural Network Compression Framework (NNCF)

Neural Network Compression Framework (NNCF) provides a suite of post-training and training-time algorithms for optimizing inference of neural networks in OpenVINO™ with a minimal accuracy drop.

NNCF is designed to work with models from PyTorch, TensorFlow, ONNX, and OpenVINO™.

![title](nncf.png)

https://github.com/openvinotoolkit/nncf

https://docs.openvino.ai/2024/openvino-workflow/model-optimization.html

- Post-training Quantization optimizes the inference of deep learning models by applying 8-bit integer quantization after training; it does not require model retraining or fine-tuning.

- Training-time Optimization is a suite of advanced methods for optimizing a model during training within a DL framework such as PyTorch or TensorFlow 2.x. It supports methods such as Quantization-aware Training and Structured and Unstructured Pruning.

- Weight Compression is an easy-to-use method for reducing the footprint of Large Language Models and accelerating their inference.

![title](nncf2.png)
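To make the idea behind 8-bit quantization more concrete, here is a simplified sketch of symmetric per-tensor quantization in plain Python. It is not NNCF's algorithm (NNCF's real methods are more involved; post-training quantization, for example, works with a calibration dataset). The function names and the sample weights are assumptions chosen for illustration.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.5, -0.64]  # made-up example weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Rounding moves each value by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 values:", q_weights)
print("scale:", scale)
print("max error:", max_error, "bound:", scale / 2)
```

Each value is stored in 8 bits instead of 32, at the cost of a rounding error bounded by half the scale.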

In [1]:
import warnings
from pathlib import Path
import time
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import openvino as ov

In [35]:
ov.__version__

'2024.2.0-15519-5c0f38f83f6-releases/2024/2'

# Model
https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english

In [2]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=checkpoint)

In [3]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=checkpoint)

In [4]:
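import torch

# Location of the converted model in OpenVINO IR format
ir_xml_name = checkpoint + ".xml"
MODEL_DIR = "model/"
ir_xml_path = Path(MODEL_DIR) / ir_xml_name

MAX_SEQ_LENGTH = 128
# Both inputs are int64 with batch size 1 and a dynamic (-1) sequence length
input_info = [
    (ov.PartialShape([1, -1]), ov.Type.i64),
    (ov.PartialShape([1, -1]), ov.Type.i64),
]
# Dummy tensors used only to trace the PyTorch model during conversion
default_input = torch.ones(1, MAX_SEQ_LENGTH, dtype=torch.int64)
inputs = {
    "input_ids": default_input,
    "attention_mask": default_input,
}

# Convert the PyTorch model to an in-memory OpenVINO model
ov_model = ov.convert_model(model, input=input_info, example_input=inputs)
# Save it as OpenVINO IR: an .xml file plus a .bin file with the weights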
ov.save_model(ov_model, ir_xml_path)
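In `input_info`, the `-1` in `ov.PartialShape([1, -1])` marks the sequence-length dimension as dynamic, so the converted model accepts token sequences of any length with a batch size of 1. The matching rule can be sketched in plain Python as follows; `is_compatible` is a hypothetical helper for illustration and not part of the OpenVINO API.

```python
def is_compatible(partial_shape, concrete_shape):
    """Check a concrete shape against a shape where -1 means 'any size'."""
    if len(partial_shape) != len(concrete_shape):
        return False
    return all(p == -1 or p == c for p, c in zip(partial_shape, concrete_shape))


model_input = [1, -1]  # batch of 1, dynamic sequence length

print(is_compatible(model_input, [1, 128]))  # shape of the example input
print(is_compatible(model_input, [1, 7]))    # a short sentence
print(is_compatible(model_input, [4, 128]))  # a batch of 4 does not match
```

This is why the tokenizer output can be passed to the model later without padding every sentence to `MAX_SEQ_LENGTH`.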

In [7]:
core = ov.Core()

In [41]:
core.available_devices

['CPU', 'GPU']

In [37]:
device = "GPU"
core.get_property(device, "FULL_DEVICE_NAME")

'NVIDIA GeForce GTX 1650 (dGPU)'

In [38]:
print(f"{device} SUPPORTED_PROPERTIES:\n")
supported_properties = core.get_property(device, "SUPPORTED_PROPERTIES")
indent = len(max(supported_properties, key=len))

for property_key in supported_properties:
    if property_key not in ('SUPPORTED_METRICS', 'SUPPORTED_CONFIG_KEYS', 'SUPPORTED_PROPERTIES'):
        try:
            property_val = core.get_property(device, property_key)
        except TypeError:
            property_val = 'UNSUPPORTED TYPE'
        print(f"{property_key:<{indent}}: {property_val}")

GPU SUPPORTED_PROPERTIES:

AVAILABLE_DEVICES               : ['0']
RANGE_FOR_ASYNC_INFER_REQUESTS  : (1, 2, 1)
RANGE_FOR_STREAMS               : (1, 2)
OPTIMAL_BATCH_SIZE              : 1
MAX_BATCH_SIZE                  : 1
DEVICE_ARCHITECTURE             : GPU: vendor=0x10de arch=v7.5.0
FULL_DEVICE_NAME                : NVIDIA GeForce GTX 1650 (dGPU)
DEVICE_UUID                     : 14d4b062a125d17e3637a99793a8a24b
DEVICE_LUID                     : 18da000000000000
DEVICE_TYPE                     : Type.DISCRETE
DEVICE_GOPS                     : {<Type: 'float16'>: 0.0, <Type: 'float32'>: 0.0, <Type: 'int8_t'>: 0.0, <Type: 'uint8_t'>: 0.0}
OPTIMIZATION_CAPABILITIES       : ['FP32', 'BIN', 'INT8', 'EXPORT_IMPORT']
GPU_DEVICE_TOTAL_MEM_SIZE       : 4294639616
GPU_UARCH_VERSION               : 7.5.0
GPU_EXECUTION_UNITS_COUNT       : 14
GPU_MEMORY_STATISTICS           : {}
PERF_COUNT                      : False
MODEL_PRIORITY                  : Priority.MEDIUM
GPU_HOST_TASK_PRIORITY    

In [39]:
device = "CPU"
core.get_property(device, "FULL_DEVICE_NAME")

'Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz'

In [40]:
print(f"{device} SUPPORTED_PROPERTIES:\n")
supported_properties = core.get_property(device, "SUPPORTED_PROPERTIES")
indent = len(max(supported_properties, key=len))

for property_key in supported_properties:
    if property_key not in ('SUPPORTED_METRICS', 'SUPPORTED_CONFIG_KEYS', 'SUPPORTED_PROPERTIES'):
        try:
            property_val = core.get_property(device, property_key)
        except TypeError:
            property_val = 'UNSUPPORTED TYPE'
        print(f"{property_key:<{indent}}: {property_val}")

CPU SUPPORTED_PROPERTIES:

AVAILABLE_DEVICES                    : ['']
RANGE_FOR_ASYNC_INFER_REQUESTS       : (1, 1, 1)
RANGE_FOR_STREAMS                    : (1, 16)
EXECUTION_DEVICES                    : ['CPU']
FULL_DEVICE_NAME                     : Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz
OPTIMIZATION_CAPABILITIES            : ['FP32', 'INT8', 'BIN', 'EXPORT_IMPORT']
DEVICE_TYPE                          : Type.INTEGRATED
DEVICE_ARCHITECTURE                  : intel64
NUM_STREAMS                          : 1
INFERENCE_NUM_THREADS                : 0
PERF_COUNT                           : False
INFERENCE_PRECISION_HINT             : <Type: 'float32'>
PERFORMANCE_HINT                     : PerformanceMode.LATENCY
EXECUTION_MODE_HINT                  : ExecutionMode.PERFORMANCE
PERFORMANCE_HINT_NUM_REQUESTS        : 0
ENABLE_CPU_PINNING                   : True
SCHEDULING_CORE_TYPE                 : SchedulingCoreType.ANY_CORE
MODEL_DISTRIBUTION_POLICY            : set()
ENABLE_HYPER_T

In [42]:
import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="AUTO",
    description="Device:",
    disabled=False,
)

device

Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')

In [46]:
warnings.filterwarnings("ignore")
compiled_model = core.compile_model(ov_model, device.value)
infer_request = compiled_model.create_infer_request()

In [43]:
def softmax(x):
    """
    Convert the logits returned by the model into probabilities.
    Parameters: Logits array
    Returns: Probabilities
    """

    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
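Subtracting `np.max(x)` in `softmax` does not change the result, because the common factor cancels between numerator and denominator, but it keeps the exponential from overflowing for large logits. A minimal pure-Python sketch of the same idea, with made-up logits and a hypothetical function name:

```python
import math


def stable_softmax(logits):
    """Softmax with the maximum subtracted first to avoid overflow."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = stable_softmax([-4.3, 4.6])  # hypothetical [NEGATIVE, POSITIVE] logits
print(probs, "sum =", sum(probs))

# Without the shift, math.exp(1000.0) would raise OverflowError;
# with it, the computation stays well-defined.
print(stable_softmax([1000.0, 1001.0]))
```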

In [44]:
def infer(input_text):
    """
    Classify a piece of text as POSITIVE or NEGATIVE.
    Parameters: Text to be processed
    Returns: Label: "POSITIVE" or "NEGATIVE"
    """

    # Tokenize to NumPy arrays (input_ids and attention_mask)
    encoded = tokenizer(
        input_text,
        truncation=True,
        return_tensors="np",
    )
    inputs = dict(encoded)
    label = {0: "NEGATIVE", 1: "POSITIVE"}
    result = infer_request.infer(inputs=inputs)
    # The model has a single output holding the logits for the two classes
    logits = next(iter(result.values()))
    predicted_class = int(np.argmax(softmax(logits)))
    return label[predicted_class]

In [47]:
input_text = "I had a wonderful day"
start_time = time.perf_counter()
result = infer(input_text)
end_time = time.perf_counter()
total_time = end_time - start_time
print("Label: ", result)
print("Total Time: ", "%.2f" % total_time, " seconds")

Label:  POSITIVE
Total Time:  0.17  seconds
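A single timed call like the one above also covers tokenization and may include one-off warm-up cost on the first inference, so the figure can vary between runs. One common way to get a steadier number is to discard a few warm-up calls and average over several repetitions. The sketch below uses a hypothetical helper `benchmark` and a stand-in workload so that it runs on its own; in the notebook the callable would be something like `lambda: infer(input_text)`.

```python
import time


def benchmark(fn, warmup=2, runs=10):
    """Call fn `warmup` times untimed, then time `runs` calls.

    Returns (mean seconds, best seconds).
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings), min(timings)


def stand_in_workload():
    # Placeholder for a real call such as infer("I had a wonderful day").
    sum(i * i for i in range(10_000))


mean_s, best_s = benchmark(stand_in_workload)
print(f"mean: {mean_s * 1000:.3f} ms, best: {best_s * 1000:.3f} ms")
```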


In [29]:
import requests

# Fetch the helper module from the openvino_notebooks repository
r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)

with open("notebook_utils.py", "w") as f:
    f.write(r.text)
from notebook_utils import download_file

# Download the text from the openvino_notebooks storage
vocab_file_path = download_file(
    "https://storage.openvinotoolkit.org/repositories/openvino_notebooks/data/data/text/food_reviews.txt",
    directory="data",
)

'data\food_reviews.txt' already exists.


In [48]:
start_time = time.perf_counter()
with vocab_file_path.open(mode="r") as f:
    input_text = f.readlines()
    for lines in input_text:
        print("User Input: ", lines)
        result = infer(lines)
        print("Label: ", result, "\n")
end_time = time.perf_counter()
total_time = end_time - start_time
print("Total Time: ", "%.2f" % total_time, " seconds")

User Input:  The food was horrible.

Label:  NEGATIVE 

User Input:  We went because the restaurant had good reviews

Label:  POSITIVE 

User Input:  The trip turn into a nigthmare

Label:  NEGATIVE 

User Input:  I dont really know what to say. I dont have a stand about that
Label:  NEGATIVE 

Total Time:  0.61  seconds
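The blank line after each `User Input:` line in the output comes from `readlines()`, which keeps the trailing newline of every line. If cleaner output is wanted, the lines can be stripped and empty ones skipped before inference. A small sketch, using an in-memory file with invented reviews in place of `food_reviews.txt`:

```python
import io

# Stand-in for the opened text file; the content is invented for illustration.
fake_file = io.StringIO("The food was great.\n\nService was slow.\n")

# Strip surrounding whitespace and drop lines that end up empty.
reviews = [line.strip() for line in fake_file if line.strip()]

for review in reviews:
    print("User Input: ", review)
```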


In [49]:
import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="AUTO",
    description="Device:",
    disabled=False,
)

device

Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')

In [50]:
warnings.filterwarnings("ignore")
compiled_model = core.compile_model(ov_model, device.value)
infer_request = compiled_model.create_infer_request()

In [51]:
input_text = "I had a wonderful day"
start_time = time.perf_counter()
result = infer(input_text)
end_time = time.perf_counter()
total_time = end_time - start_time
print("Label: ", result)
print("Total Time: ", "%.2f" % total_time, " seconds")

Label:  POSITIVE
Total Time:  0.18  seconds


In [52]:
start_time = time.perf_counter()
with vocab_file_path.open(mode="r") as f:
    input_text = f.readlines()
    for lines in input_text:
        print("User Input: ", lines)
        result = infer(lines)
        print("Label: ", result, "\n")
end_time = time.perf_counter()
total_time = end_time - start_time
print("Total Time: ", "%.2f" % total_time, " seconds")

User Input:  The food was horrible.

Label:  NEGATIVE 

User Input:  We went because the restaurant had good reviews

Label:  POSITIVE 

User Input:  The trip turn into a nigthmare

Label:  NEGATIVE 

User Input:  I dont really know what to say. I dont have a stand about that
Label:  NEGATIVE 

Total Time:  0.46  seconds
