# Model Export and Optimization Pipeline

This notebook exports and quantises a pre-trained sentiment analysis model through the following steps:

1. Loading a PyTorch model and tokeniser
2. Exporting the model to ONNX so it can be served with ONNX Runtime for faster inference
3. Applying dynamic quantisation to the exported ONNX model
4. Comparing inference speed and outputs between:
   - Original PyTorch model
   - ONNX Runtime model
   - Quantised model

The notebook includes a Gradio interface for interactive testing of all three model variants.

In [None]:
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import gradio as gr
import pandas as pd
from time import perf_counter

In [None]:
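# Directory holding the fine-tuned PyTorch checkpoint.
model_path = "models/best/"

pytorch_model = AutoModelForSequenceClassification.from_pretrained(model_path, local_files_only=True)

# Load the tokeniser of the base MiniLM checkpoint and keep a local copy.
tokeniser = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")
tokeniser.save_pretrained("models/tokeniser/")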

In [None]:
# export=True converts the PyTorch checkpoint to ONNX while loading it.
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_path,
    export=True,
    provider="CPUExecutionProvider"
)
ort_model.save_pretrained("models/ort/")

In [None]:
# Dynamic (is_static=False), per-tensor (per_channel=False) quantisation
# using the preset for CPUs with AVX512-VNNI support.
quantization_config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantized_model_path = "models/quantized_model/"
quantizer = ORTQuantizer.from_pretrained(ort_model)

quantizer.quantize(save_dir=quantized_model_path, quantization_config=quantization_config)

# Reload the quantised model from disk for inference.
quantized_model = ORTModelForSequenceClassification.from_pretrained(quantized_model_path, local_files_only=True)
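
One quick sanity check after quantisation is to compare how much disk space each saved model takes. The sketch below is only an illustration: `dir_size_mb` is a hypothetical helper that is not part of the original notebook, and it sums every file in the directory (configuration files included), so the numbers are a rough guide rather than exact model sizes.

```python
from pathlib import Path


def dir_size_mb(path: str) -> float:
    """Total size of all regular files under `path`, in megabytes (10**6 bytes)."""
    root = Path(path)
    total_bytes = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return total_bytes / 1e6


# Paths are the save directories used in the cells above.
# Directories that do not exist yet are skipped silently.
for name, path in [("ONNX", "models/ort/"), ("Quantised", "models/quantized_model/")]:
    if Path(path).is_dir():
        print(f"{name}: {dir_size_mb(path):.1f} MB")
```

If quantisation worked as intended, the quantised directory would normally be noticeably smaller than the ONNX one, although the exact ratio depends on the model.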

In [None]:
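from typing import Any


def run_model(text_in: str, model: Any) -> tuple[float, float]:
    """Run `model` on `text_in` and return (prediction, elapsed seconds)."""
    tokenised_text = tokeniser(
        text_in,
        truncation=True,
        padding=True,
        max_length=512,
        return_tensors="pt"
    )

    # Time only the forward pass, not tokenisation.
    start_time = perf_counter()
    with torch.no_grad():
        out = model(**tokenised_text)
    execution_time = round(perf_counter() - start_time, 5)

    # .item() requires the logits to hold a single value (one input, one output).
    prediction = round(out.logits.squeeze().item(), 5)

    return prediction, execution_time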
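
A single timed call can be noisy, because the first inference often includes one-off setup costs. As a hedged sketch of a steadier measurement, the hypothetical helper below (`median_latency`, not part of the original notebook) runs a few untimed warm-up calls and then reports the median over repeated runs. The `warmup` and `runs` defaults are arbitrary assumptions.

```python
from statistics import median
from time import perf_counter
from typing import Callable


def median_latency(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock time of `fn()` in seconds, after `warmup` untimed calls."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        timings.append(perf_counter() - start)

    return median(timings)
```

It could be used with the notebook's own function as, for example, `median_latency(lambda: run_model("Some headline", ort_model))`. The median is chosen over the mean so that an occasional slow run does not dominate the result.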

In [None]:
def compare_models(text_in: str) -> pd.DataFrame:
    data = [
        ["PyTorch", *run_model(text_in, pytorch_model)],
        ["ONNX", *run_model(text_in, ort_model)],
        ["AutoQuantization", *run_model(text_in, quantized_model)]
    ]

    return pd.DataFrame(data, columns=["Model", "Output", "Time"])

The ONNX model and the quantised model produce nearly identical predictions, differing by only ~0.03 on average. Since this difference is negligible, the ONNX model is the preferred choice here, as it provides the best balance between inference speed and prediction quality.
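
The notebook does not show how the average difference was obtained. One plausible way, sketched here under that assumption, is the mean absolute difference between the two models' predictions over the same set of inputs. `mean_abs_diff` is a hypothetical helper, not code from the original analysis.

```python
from typing import Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally long prediction lists."""
    if len(a) != len(b):
        raise ValueError("Both prediction lists must have the same length.")
    if not a:
        raise ValueError("Prediction lists must not be empty.")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

Given a list of headlines, the two inputs could be built with `[run_model(h, ort_model)[0] for h in headlines]` and the same expression using `quantized_model`.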


In [50]:
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            text = gr.Textbox(placeholder="Paste a headline here.")
            run_button = gr.Button("Run")
        output_table = gr.DataFrame()

    run_button.click(fn=compare_models, inputs=text, outputs=output_table)

demo.launch()

* Running on local URL:  http://127.0.0.1:7872
* To create a public link, set `share=True` in `launch()`.


