# ALBERT Text Classification with OpenVINO

This notebook demonstrates text classification with OpenVINO. We use the [ALBERT](https://huggingface.co/textattack/albert-base-v2-MRPC?text=I+like+you.+I+love+you) model, which is available from HuggingFace. OpenVINO is a toolkit that helps optimize and accelerate inference of machine learning models. By using OpenVINO, we can improve the inference performance of our text classification model and make it easier to deploy on various platforms.

ALBERT is a pre-trained language model. The checkpoint used here has been fine-tuned on the Microsoft Research Paraphrase Corpus (MRPC) dataset, which trains it to classify whether two sentences are paraphrases of each other. We will use this model to classify text.

## The model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("textattack/albert-base-v2-MRPC")
pt_model = AutoModelForSequenceClassification.from_pretrained("textattack/albert-base-v2-MRPC")

## Convert ALBERT to OpenVINO IR

![conversion_pipeline](https://user-images.githubusercontent.com/29454499/211261803-784d4791-15cb-4aea-8795-0969dfbb8291.png)

To work with the ALBERT model in OpenVINO, the model must first be converted to the OpenVINO Intermediate Representation (IR) format. The model provided by HuggingFace is a PyTorch model, which OpenVINO supports via conversion to ONNX. We will use the HuggingFace transformers library to export the model to ONNX. `transformers.onnx.export` accepts a preprocessing function for input samples (the tokenizer in this case), the model instance, the ONNX export configuration, the ONNX opset version for export, and the output path. More information about exporting transformers models to ONNX can be found in the HuggingFace [documentation](https://huggingface.co/docs/transformers/serialization).

While ONNX models are directly supported by the OpenVINO runtime, it can be useful to convert them to IR format to take advantage of OpenVINO optimization tools and features.
The `mo.convert_model` Python function converts the model with [OpenVINO Model Optimizer](https://docs.openvino.ai/latest/openvino_docs_MO_DG_Python_API.html). The function returns an instance of the OpenVINO Model class, which is ready to use from the Python interface. It can also be serialized to OpenVINO IR format for later use with `openvino.runtime.serialize`. In this case, the `compress_to_fp16` parameter is enabled to compress the model weights to `FP16` precision. Dynamic input shapes with a bounded range (from one token to the maximum length defined in the processing function) can also be specified to reduce memory consumption; the code below does not pass explicit input shapes to `mo.convert_model`.
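
To give a feel for what compressing weights to `FP16` means, the following standalone sketch rounds a few values to half precision using only the Python standard library. It is an illustration only: the `to_fp16` helper and the sample values are assumptions made for this example and are not part of OpenVINO or of how Model Optimizer performs the compression.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# Hypothetical weight values chosen for illustration.
weights = [0.1, 1.0, 3.14159, 1e-3]
for w in weights:
    error = abs(w - to_fp16(w))
    print(f"{w!r} -> {to_fp16(w)!r} (abs. error {error:.2e})")

# A half-precision value takes 2 bytes, a single-precision value takes 4.
fp16_size = len(struct.pack("<e", 1.0))
fp32_size = len(struct.pack("<f", 1.0))
print(fp16_size, fp32_size)
```

Halving the storage per weight is what shrinks the `.bin` file; the small rounding error per value is the price paid for it.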

In [None]:
from pathlib import Path
from openvino.runtime import serialize
from openvino.tools import mo
from transformers.onnx import export, FeaturesManager

# define path for saving onnx model
onnx_path = Path("model/alberta.onnx")
onnx_path.parent.mkdir(exist_ok=True)

# define path for saving openvino model
model_path = onnx_path.with_suffix(".xml")

# get model onnx config function for the sequence-classification feature
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(pt_model, feature='sequence-classification')

# fill onnx config based on pytorch model config
onnx_config = model_onnx_config(pt_model.config)

# convert model to onnx
onnx_inputs, onnx_outputs = export(tokenizer, pt_model, onnx_config, onnx_config.default_onnx_opset, onnx_path)

# convert model to openvino
ov_model = mo.convert_model(onnx_path, compress_to_fp16=True)

# serialize openvino model
serialize(ov_model, str(model_path))

## Load the model

We start by building an OpenVINO Core object. Then we read the network architecture and model weights from the `.xml` and `.bin` files, respectively. Finally, we compile the model for the desired device. Since we use the dynamic shapes feature, which is only available on CPU, we must use CPU as the device. Dynamic shapes support on GPU is coming soon.

Because the text classification model has a dynamic input shape, you cannot directly switch the device to GPU for inference on integrated or discrete Intel GPUs. To run inference on iGPU or dGPU with this model, you will need to resize the model inputs to a fixed size and then run inference on the GPU device.
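
Using a fixed input size means every tokenized sentence has to be truncated or padded to the same length. The sketch below shows that idea in plain Python. The `pad_to_length` helper, the sample token ids, and the padding id `0` are assumptions made for illustration; in practice the HuggingFace tokenizer can do this itself through its `padding="max_length"`, `max_length`, and `truncation=True` arguments.

```python
def pad_to_length(token_ids, length, pad_id=0):
    """Truncate or right-pad token ids to `length`; return ids and attention mask."""
    ids = list(token_ids)[:length]
    mask = [1] * len(ids)
    padding = length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# Hypothetical token ids, not produced by the real tokenizer.
ids, mask = pad_to_length([2, 48, 25, 3], length=6)
print(ids)   # [2, 48, 25, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask marks padded positions with `0` so that the model can ignore them. The OpenVINO model itself would also need a static shape, for example through the model's `reshape` method, before compiling it for GPU.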

In [None]:
from openvino.runtime import Core

# initialize openvino core
core = Core()

# read the model and corresponding weights from file
model = core.read_model(model_path)

# compile the model for CPU devices
compiled_model = core.compile_model(model=model, device_name="CPU")

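The compiled model exposes its input and output nodes through `compiled_model.inputs` and `compiled_model.outputs`. The names of the input nodes serve as keys when passing data to the model, and the output nodes serve as keys for reading the results.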

## Run inference

NLP models usually take a list of tokens as input. A token is a word or word piece mapped to an integer. To provide the proper input, we need a vocabulary that handles this mapping. Here the HuggingFace tokenizer loaded earlier takes care of it, so no separate vocabulary file has to be loaded.
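
To make the token-to-integer mapping concrete, here is a minimal sketch with a toy vocabulary. The `toy_vocab` dictionary and the `encode` helper are hypothetical and exist only for this illustration; the real ALBERT tokenizer uses a much larger SentencePiece subword vocabulary and also adds special tokens.

```python
# Hypothetical toy vocabulary, not the vocabulary of the real tokenizer.
toy_vocab = {"<unk>": 0, "i": 1, "like": 2, "love": 3, "you": 4}

def encode(text, vocab):
    """Map each lowercased, whitespace-separated word to its integer id."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

token_ids = encode("I like you", toy_vocab)
print(token_ids)  # [1, 2, 4]
```

Words that are missing from the vocabulary fall back to the `<unk>` id in this sketch; subword tokenizers reduce how often that happens by splitting rare words into smaller known pieces.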

In [None]:
import numpy as np

def inference(sentence, inf_model):
    """
    Run inference on the pre-trained ALBERT model for sentence classification.
    
    Args:
        sentence (str): Input sentence to classify.
        inf_model: Compiled OpenVINO model used for inference.
        
    Returns:
        predicted_label (int): Predicted label for the input sentence.
    """
    
    # encode sentence with tokenizer; OpenVINO expects NumPy arrays
    encoded = tokenizer(sentence, return_tensors="np")
    
    # keep only the inputs that the compiled model actually declares
    input_names = {port.get_any_name() for port in inf_model.inputs}
    inputs = {name: value for name, value in encoded.items() if name in input_names}
    
    # run inference on model; the first output holds the logits
    output_key = inf_model.output(0)
    logits = inf_model(inputs)[output_key]
    
    # get predicted label (index of the largest logit)
    predicted_label = int(np.argmax(logits))
    
    # return predicted label
    return predicted_label

In [None]:
# classify an example sentence
sentence = "This is an example sentence."
predicted_label = inference(sentence, compiled_model)
print(predicted_label)
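
The predicted label is the index of the largest logit. If class probabilities are needed instead of a bare label, a softmax can be applied to the logits. The sketch below uses made-up logit values rather than real model output, and the `softmax` helper is written out here for illustration only.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two classes, not real model output.
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)

# The predicted label is the index of the highest probability.
example_label = max(range(len(probs)), key=probs.__getitem__)
print(probs, example_label)
```

For MRPC, label `1` conventionally denotes a paraphrase pair, but the label mapping of a particular checkpoint should be confirmed on its model card.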

## Optimize model using NNCF Post-training Quantization API

[NNCF](https://github.com/openvinotoolkit/nncf) provides a suite of advanced algorithms for Neural Networks inference optimization in OpenVINO with minimal accuracy drop.
We will use 8-bit quantization in post-training mode (without the fine-tuning pipeline) to optimize the ALBERT model.

> **Note**: NNCF Post-training Quantization is available as a preview feature in OpenVINO 2022.3 release. Fully functional support will be provided in the next releases.
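
As a rough intuition for what 8-bit quantization does to a tensor, the sketch below maps a few floating-point values to integers in the range [-127, 127] with a single scale factor and then maps them back. This is a simplified illustration under stated assumptions (symmetric, per-tensor, hypothetical weight values); it is not the algorithm NNCF implements, which calibrates value ranges on the dataset and uses more elaborate schemes.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map quantized integers back to approximate floating-point values."""
    return [q * scale for q in quantized]

# Hypothetical weight values chosen for illustration.
weights = [-0.8, -0.1, 0.0, 0.35, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [-80, -10, 0, 35, 127]
print(restored)
```

Each restored value differs from the original by at most half a quantization step, which is why a well-calibrated INT8 model usually loses little accuracy.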

The optimization process contains the following steps:

1. Create a Dataset for quantization.
2. Run `nncf.quantize` for getting an optimized model.
3. Serialize an OpenVINO IR model, using the `openvino.runtime.serialize` function.

In [None]:
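from torch.utils.data import DataLoader, TensorDataset
import nncf

# prepare sample data (a single sentence keeps the example short;
# real calibration should use a representative set of inputs)
sentence = "This is an example sentence."
inputs = tokenizer(sentence, return_tensors="pt")

# keep only the tokenizer outputs that the OpenVINO model declares as inputs
input_names = [port.get_any_name() for port in ov_model.inputs]
input_names = [name for name in input_names if name in inputs]

# create a TensorDataset with one tensor per model input
dataset = TensorDataset(*(inputs[name] for name in input_names))

# create a DataLoader to load the sample data
batch_size = 1
dataloader = DataLoader(dataset, batch_size=batch_size)

def transform_fn(batch):
    # map each batched tensor to its input name and convert it to NumPy
    return {name: tensor.numpy() for name, tensor in zip(input_names, batch)}

quantization_dataset = nncf.Dataset(dataloader, transform_fn)

# nncf.quantize is given the OpenVINO model here, not the PyTorch one
quantized_model = nncf.quantize(ov_model, quantization_dataset, preset=nncf.QuantizationPreset.MIXED)

serialize(quantized_model, 'model/quantized_alberta.xml')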

The `nncf.quantize` function provides the interface for model quantization. It requires an instance of the OpenVINO Model and a quantization dataset.

## Validate Quantized model inference

In [None]:
int8_compiled_model = core.compile_model(quantized_model, device_name="CPU")
predicted_label = inference(sentence, int8_compiled_model)
print(predicted_label)

## Compare Performance of the Original and Quantized Models
Finally, use the OpenVINO [Benchmark Tool](https://docs.openvino.ai/latest/openvino_inference_engine_tools_benchmark_tool_README.html) to measure the inference performance of the original model (saved with `FP16` weights) and the `INT8` quantized model.

> **NOTE**: For more accurate performance measurements, it is recommended to run `benchmark_app` in a terminal/command prompt after closing other applications. Run `benchmark_app -m model.xml -d CPU` to benchmark async inference on CPU for one minute. Change `CPU` to `GPU` to benchmark on GPU. Run `benchmark_app --help` to see an overview of all command-line options.

In [None]:
# Inference original model (OpenVINO IR, FP16 weights)
!benchmark_app -m model/alberta.xml -d CPU -api async

In [None]:
# Inference INT8 model (OpenVINO IR)
!benchmark_app -m model/quantized_alberta.xml -d CPU -api async
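
For a quick sanity check from within Python, a small timing helper can complement `benchmark_app`. The sketch below is an assumption-laden illustration: `measure_latency` is a hypothetical helper, and the workload is a stand-in so that the example runs on its own. It measures synchronous calls only and is not a substitute for the Benchmark Tool.

```python
import statistics
import time

def measure_latency(fn, warmup=3, runs=20):
    """Return the median wall-clock latency of `fn()` in milliseconds."""
    # warm-up calls are executed but not timed
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Stand-in workload; in this notebook it could be replaced with
# `lambda: inference(sentence, compiled_model)`.
latency_ms = measure_latency(lambda: sum(range(10_000)))
print(f"median latency: {latency_ms:.3f} ms")
```

Reporting the median rather than the mean makes the number less sensitive to occasional slow runs caused by other processes.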