<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/mlflow/openvino/MLFLOW_OpenVino_Phi3_Vision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLflow, Phi-3-Vision and OpenVINO

# Phi-3-Vision
Phi-3-Vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. More details about the model can be found in the model blog post, the technical report, and the Phi-3 cookbook.

- https://huggingface.co/microsoft/Phi-3.5-vision-instruct
- https://github.com/microsoft/Phi-3CookBook


# OpenVINO
OpenVINO is an open-source toolkit for optimizing and deploying deep learning models from cloud to edge. It accelerates deep learning inference across various use cases, such as generative AI, video, audio, and language with models from popular frameworks like PyTorch, TensorFlow, ONNX, and more. Convert and optimize models, and deploy across a mix of Intel® hardware and environments, on-premises and on-device, in the browser or in the cloud.

- https://docs.openvino.ai/2024/_static/download/OpenVINO_Quick_Start_Guide.pdf
- https://docs.openvino.ai/2024/index.html

# NNCF
Neural Network Compression Framework (NNCF) provides a suite of post-training and training-time algorithms for optimizing inference of neural networks in OpenVINO™ with a minimal accuracy drop.

NNCF is designed to work with models from PyTorch, TorchFX, TensorFlow, ONNX and OpenVINO™.

https://github.com/openvinotoolkit/nncf

# MLflow

MLflow is an open-source platform, purpose-built to assist machine learning practitioners and teams in handling the complexities of the machine learning process. MLflow focuses on the full lifecycle for machine learning projects, ensuring that each phase is manageable, traceable, and reproducible.



In [None]:
%pip install -q "torch>=2.1" "torchvision" "transformers>=4.40" "protobuf>=3.20" "gradio>=4.26" "Pillow" "accelerate" "tqdm"  --extra-index-url https://download.pytorch.org/whl/cpu
%pip install  -q "openvino>=2024.2.0" "nncf>=2.11.0" mlflow ov_helpers

In [None]:
import requests
from pathlib import Path

# Download the helper scripts from the OpenVINO notebooks repository if they are not present locally
if not Path("ov_phi3_vision_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/phi-3-vision/ov_phi3_vision_helper.py")
    r.raise_for_status()  # fail early instead of saving an error page
    Path("ov_phi3_vision_helper.py").write_text(r.text)


if not Path("gradio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/phi-3-vision/gradio_helper.py")
    r.raise_for_status()
    Path("gradio_helper.py").write_text(r.text)

if not Path("notebook_utils.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py")
    r.raise_for_status()
    Path("notebook_utils.py").write_text(r.text)

In [None]:
import ipywidgets as widgets

# Select model
model_ids = [
    "microsoft/Phi-3.5-vision-instruct",
    "microsoft/Phi-3-vision-128k-instruct",
]

model_dropdown = widgets.Dropdown(
    options=model_ids,
    value=model_ids[0],
    description="Model:",
    disabled=False,
)

model_dropdown

# Convert and Optimize model
Phi-3-vision is a PyTorch model. OpenVINO supports PyTorch models via conversion to the OpenVINO Intermediate Representation (IR), using the OpenVINO model conversion API. The `ov.convert_model` function accepts the original PyTorch model instance and an example input for tracing, and returns an `ov.Model` representing this model in the OpenVINO framework. The converted model can be saved to disk with `ov.save_model` or loaded directly on a device with `core.compile_model`.

The script ov_phi3_vision_helper.py contains helper functions for model conversion; please check its content if you are interested in the conversion details.

Phi-3-vision is an autoregressive transformer generative model, which means that each model step depends on the model output from the previous step. The generation approach is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. In other words, the model predicts the next token in a loop, guided by the previously generated tokens, until a stop condition is reached (the generated sequence hits the maximum length or an end-of-string token is obtained). How the next token is selected from the predicted probabilities is determined by the chosen decoding method. You can find more information about the most popular decoding methods in this blog post: https://huggingface.co/blog/how-to-generate
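The loop described above can be sketched with a toy example. This is only an illustration of greedy decoding: `toy_logits` is a hypothetical stand-in for one model step, and the vocabulary and scores are made up, so nothing here reflects the real Phi-3-vision model.

```python
import math

# Toy vocabulary; index 3 plays the role of the end-of-sequence token.
VOCAB = ["a", "cat", "sat", "<eos>"]
EOS_ID = 3


def toy_logits(token_ids):
    """Hypothetical stand-in for one model step: one score per vocabulary entry."""
    next_id = min(len(token_ids), EOS_ID)  # favour token 1, then 2, then <eos>
    return [5.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_generate(prompt_ids, max_new_tokens=10):
    """Append the most probable next token until <eos> or the length limit."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(token_ids))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        token_ids.append(next_id)
        if next_id == EOS_ID:  # stop condition: end-of-sequence token
            break
    return token_ids


generated = greedy_generate([0])
print([VOCAB[i] for i in generated])
```

Swapping the `max(...)` selection for a random draw weighted by `probs` would turn this greedy loop into sampling, which is the kind of choice a decoding method makes.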

The entry point for the generation process for models from the Hugging Face Transformers library is the `generate` method. You can find more information about its parameters and configuration in the documentation (https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/text_generation#transformers.GenerationMixin.generate). To preserve flexibility in the choice of decoding method, we convert only the model inference for a single step.

The inference flow differs between the first step and the following ones. On the first step, the model accepts the preprocessed input instruction and image, which are transformed into a unified embedding space using the input_embedding and image_encoder models; the language model, the LLM-based part of the model, then runs on the input embeddings to predict the probabilities of the next token. On each following step, language_model accepts only the id of the next token, selected according to the sampling strategy and processed by the input_embedding model, together with the cached attention keys and values. Since the output side is autoregressive, the hidden state of an output token stays the same once computed for every further generation step, so recomputing it every time a new token is generated is wasteful. With the cache, the model saves the hidden state once it has been computed. At each time step it only computes the hidden state for the most recently generated output token and reuses the saved ones for the earlier tokens. This reduces the generation complexity from $O(n^3)$ to $O(n^2)$ for a transformer model.

More details about how it works can be found in this article: https://scale.com/blog/pytorch-improvements#Text%20Translation

To improve support for images of various resolutions, the input image is split into patches and processed by the image feature extractor and image projector, which are part of the image encoder.
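The effect of the key/value cache on generation cost can be illustrated with a simple count. The sketch below is a simplification that only counts pairwise attention-score computations and ignores the hidden size and all other layers; the function names are illustrative.

```python
def attention_cost_without_cache(n_tokens):
    """Each step t re-runs attention over all t tokens: about t * t score computations."""
    return sum(t * t for t in range(1, n_tokens + 1))


def attention_cost_with_cache(n_tokens):
    """Each step t attends only from the newest token to t cached keys: about t computations."""
    return sum(t for t in range(1, n_tokens + 1))


for n in (10, 100, 1000):
    print(n, attention_cost_without_cache(n), attention_cost_with_cache(n))
```

The first total grows roughly like n^3 / 3 and the second like n^2 / 2, which matches the cubic versus quadratic behaviour described for generation without and with the cache.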

To sum up, the model consists of 4 parts:

- Image feature extractor and image projector (two parts that together form the image encoder) for encoding input images into the embedding space.
- Input embedding for converting input text tokens into the embedding space.
- Language model for generating the answer based on the input embeddings provided by the image encoder and input embedding models.

# Compress model weights to 4-bit
To reduce memory consumption, weight compression can be applied using NNCF.
Weight compression aims to reduce the memory footprint of a model. It can also lead to significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models, which require extensive memory to store the weights during inference, can benefit from weight compression in the following ways:

- enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;

- improving the inference performance of the models by reducing the latency of the memory access when computing the operations with weights, for example, Linear layers.

The Neural Network Compression Framework (NNCF, https://github.com/openvinotoolkit/nncf) provides 4-bit / 8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs.

The main difference between weight compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weight compression, which leads to better accuracy. Weight compression for LLMs provides a solid inference performance improvement that is on par with the performance of full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.

The `nncf.compress_weights` function performs weight compression. It accepts an OpenVINO model and other compression parameters. Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.
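To make the idea of group-wise 4-bit weight compression more concrete, here is a minimal pure-Python sketch of symmetric quantization with one shared scale per group. It is a simplified illustration under assumed conventions (integer range [-8, 7], scale chosen from the largest absolute value), not NNCF's actual algorithm, and all function names are hypothetical.

```python
def quantize_group_int4_sym(weights):
    """Quantize one group of floats to integers in [-8, 7] with a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate float weights from the integers and the group scale."""
    return [v * scale for v in q]


def compress(weights, group_size):
    """Split a flat weight list into groups and quantize each group independently."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group_int4_sym(g) for g in groups]


weights = [0.7, -0.35, 0.1, 0.0, 14.0, -7.0, 2.0, 1.0]
compressed = compress(weights, group_size=4)
restored = [w for q, s in compressed for w in dequantize_group(q, s)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(compressed)
print(max_error)
```

Because each group has its own scale, the small weights in the first group are not crushed by the large weights in the second one. Smaller groups track the weights more closely at the cost of storing more scales.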

In [None]:
from ov_helpers.ov_phi3_vision_helper import convert_phi3_model

In [None]:
from pathlib import Path
import nncf


model_id = model_dropdown.value
out_dir = Path("/content/drive/MyDrive/models") / Path(model_id).name / "INT4"
out_dir

In [None]:
compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_SYM,
    "group_size": 64,
    "ratio": 0.6,
}
convert_phi3_model(model_id, out_dir, compression_configuration)
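As a rough way to reason about the configuration above, the following sketch estimates the average storage cost per weight. The assumptions are mine and are not taken from NNCF: `ratio` is treated as the fraction of weights stored in 4 bits, each 4-bit group shares one 16-bit scale, the remaining weights use 8 bits with negligible scale overhead, and the parameter count passed in is only an illustrative value.

```python
def average_bits_per_weight(ratio, group_size, scale_bits=16, backup_bits=8):
    """Rough average storage cost per weight for mixed INT4/INT8 compression.

    Assumptions (not taken from NNCF): `ratio` is the fraction of weights stored
    in 4 bits, each 4-bit group shares one `scale_bits`-wide scale, and the
    remaining weights use `backup_bits` with negligible scale overhead.
    """
    int4_bits = 4 + scale_bits / group_size
    return ratio * int4_bits + (1 - ratio) * backup_bits


def estimated_size_gb(n_params, bits_per_weight):
    """Convert a parameter count and a per-weight bit cost into gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 4.2e9  # assumed parameter count, for illustration only

bits = average_bits_per_weight(ratio=0.6, group_size=64)
print(f"{bits:.2f} bits per weight")
print(f"{estimated_size_gb(N_PARAMS, bits):.2f} GB vs {estimated_size_gb(N_PARAMS, 16):.2f} GB in FP16")
```

Raising `ratio` lowers the estimate further, at the likely cost of more accuracy loss, while a smaller `group_size` adds scale overhead in exchange for finer-grained quantization.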

In [None]:
from ov_helpers.notebook_utils import device_widget

device = device_widget(default="AUTO", exclude=["NPU"])

device

In [None]:
from ov_helpers.ov_phi3_vision_helper import OvPhi3Vision

In [None]:
model = OvPhi3Vision(out_dir, device.value)

In [None]:
import requests
from PIL import Image

url = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
image = Image.open(requests.get(url, stream=True).raw)

print("Question:\n What is unusual on this picture?")
image

In [None]:
from transformers import AutoProcessor, TextStreamer

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is unusual on this picture?"},
]

processor = AutoProcessor.from_pretrained(out_dir, trust_remote_code=True)

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, [image], return_tensors="pt")

generation_args = {"temperature": .8, "max_new_tokens": 50, "do_sample": True, "streamer": TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)}

print("Answer:")
generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

In [None]:
processor.decode(generate_ids[0],skip_prompt=True, skip_special_tokens=True)

In [None]:
import transformers
from mlflow.models import infer_signature
from mlflow.transformers import generate_signature_output
import locale
import mlflow
import json
import datetime
import pandas as pd
import numpy as np
import os
import pprint
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
from google.colab import userdata
MLFLOW_TRACKING_URI="databricks"
# Specify the workspace hostname and token
DATABRICKS_HOST="https://adb-2467347032368999.19.azuredatabricks.net/"
DATABRICKS_TOKEN=userdata.get('DATABRCKS_TTOKEN')

In [None]:

if "MLFLOW_TRACKING_URI" not in os.environ:
    os.environ["MLFLOW_TRACKING_URI"] = MLFLOW_TRACKING_URI
if "DATABRICKS_HOST" not in os.environ:
    os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST
if "DATABRICKS_TOKEN" not in os.environ:
    os.environ["DATABRICKS_TOKEN"] = DATABRICKS_TOKEN

In [None]:
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

In [None]:
mlflow.set_experiment("/Users/pepe@kk.com/openvino-phi3-vision")

In [None]:

mlflow.end_run()

In [None]:
import mlflow
from mlflow.models.signature import infer_signature
from mlflow.pyfunc import PythonModel
import pprint

In [None]:

class OpenVino_Phi3_Vision(PythonModel):
    def load_context(self, context):
        """
        This method initializes the processor and the OpenVINO model
        using the specified model snapshot directory.
        """
        self.model = OvPhi3Vision(context.artifacts["snapshot"], "AUTO")
        self.processor = AutoProcessor.from_pretrained(context.artifacts["snapshot"], trust_remote_code=True)

    def open_image(self, image_path):
        # Load the image from a URL or from a local file path
        if image_path.startswith(("http://", "https://")):
            return Image.open(requests.get(image_path, stream=True).raw)
        return Image.open(image_path)

    def predict(self, context, model_input, params=None):
        """
        This method generates a prediction for the given input.
        """
        prompt = model_input["prompt"][0]
        image = self.open_image(model_input["image_path"][0])
        # Retrieve or use default values for temperature and max_tokens
        temperature = params.get("temperature", 0.7) if params else 0.7
        max_tokens = params.get("max_tokens", 128) if params else 128
        messages = [
            {"role": "user", "content": f"<|image_1|>\n{prompt}"},
        ]

        prompt = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        # Use the processor loaded in load_context, not a notebook-level global
        inputs = self.processor(prompt, [image], return_tensors="pt")

        generation_args = {"temperature": temperature, "max_new_tokens": max_tokens, "do_sample": True, "streamer": TextStreamer(self.processor.tokenizer, skip_prompt=True, skip_special_tokens=True)}

        generate_ids = self.model.generate(**inputs, eos_token_id=self.processor.tokenizer.eos_token_id, **generation_args)

        result = self.processor.decode(generate_ids[0], skip_prompt=True, skip_special_tokens=True)
        return {"candidates": [result]}

In [None]:
import numpy as np
import pandas as pd

import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types import ColSpec, DataType, ParamSchema, ParamSpec, Schema

# Define input and output schema
input_schema = Schema(
    [
        ColSpec(DataType.string, "prompt"),
        ColSpec(DataType.string, "image_path"),
    ]
)
output_schema = Schema([ColSpec(DataType.string, "candidates")])

parameters = ParamSchema(
    [

        ParamSpec("max_tokens", DataType.integer, np.int32(1000), None),
        ParamSpec("temperature", DataType.float, np.float32(.8), None),

    ]
)

signature = ModelSignature(inputs=input_schema, outputs=output_schema,  params=parameters)


# Define input example
input_example = pd.DataFrame({"prompt": ["Question:\n What is unusual on this picture?"],
                              "image_path":["https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"]})
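The role of the parameter defaults in the signature can be illustrated with a small stand-in. This is a simplified sketch of how a params schema with defaults is expected to behave (caller values override defaults, undeclared keys are dropped). It is my reading rather than MLflow's implementation, and `PARAM_DEFAULTS` and `resolve_params` are hypothetical names.

```python
import warnings

# Defaults mirroring the ParamSchema defined in this notebook.
PARAM_DEFAULTS = {"max_tokens": 1000, "temperature": 0.8}


def resolve_params(params=None):
    """Merge caller-supplied params over the schema defaults, dropping unknown keys."""
    params = params or {}
    unknown = sorted(set(params) - set(PARAM_DEFAULTS))
    if unknown:
        warnings.warn(f"Ignoring params not declared in the schema: {unknown}")
    resolved = dict(PARAM_DEFAULTS)
    resolved.update({k: v for k, v in params.items() if k in PARAM_DEFAULTS})
    return resolved


print(resolve_params())
print(resolve_params({"max_tokens": 256, "temperature": 0.5}))
```

Under this reading, a key that is not declared in the schema never reaches `predict`, so any generation option meant to be tunable at inference time should be declared in the `ParamSchema`.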

In [None]:

import datetime
now = datetime.datetime.now()
now.strftime("%Y-%m-%d_%H:%M:%S")

In [None]:
import torch
import transformers
import torchvision
torch_version = torch.__version__.split("+")[0]

In [None]:
out_dir

In [None]:
# Start an MLflow run context and log the Phi-3 model wrapper along with the param-included signature to
# allow for overriding parameters at inference time
now = datetime.datetime.now()

description= """Phi3 Vision model converted to OpenVino.
"""
with mlflow.start_run(run_name=f"Phi3_vision_openvino_{now.strftime('%Y-%m-%d_%H:%M:%S')}", description=description) as run:
    model_info = mlflow.pyfunc.log_model(
        "phi3-vision-openvino",
        python_model=OpenVino_Phi3_Vision(),
        # NOTE: the artifacts dictionary mapping is critical! This dict is used by the load_context() method of the OpenVino_Phi3_Vision class.
        artifacts={"snapshot": str(out_dir)},  # directory with the converted INT4 model

        pip_requirements=[
            f"torch=={torch_version}",
            f"transformers=={transformers.__version__}",
            f"torchvision=={torchvision.__version__}",
             "protobuf>=3.20",
              "Pillow" ,
            "accelerate" ,
            "tqdm" ,
            "openvino>=2024.2.0",
            "nncf>=2.11.0",
            "ov_helpers"

        ],
        input_example=input_example,
        signature=signature,
    )

In [None]:

run.to_dictionary()

In [None]:


model_info.model_uri

In [None]:
# Load the model logged above; a "runs:/<run_id>/phi3-vision-openvino" URI from an earlier run also works
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

In [None]:
loaded_model

In [None]:
prompt = "Question:\n Describe this picture"
image_path ="/content/drive/MyDrive/data/beach.jpg"

In [None]:
from PIL import Image
import requests
from io import BytesIO

In [None]:
image = Image.open(image_path)
image

In [None]:
time1=  datetime.datetime.now()
response = loaded_model.predict(
    pd.DataFrame({"prompt": [prompt], "image_path": [image_path]}),
    params={"max_tokens": 256, "temperature": 0.5},  # only params declared in the signature
)
time2=  datetime.datetime.now()
print(time2-time1)

In [None]:
pprint.pprint(response['candidates'][0])

In [None]:
import gc

In [None]:
del loaded_model
del model
gc.collect()

In [None]:

gc.collect()

# Load Model from Model registry
After registering the model (for example with `mlflow.register_model` or from the MLflow UI), we can load it from the Model Registry by its name and version.

In [None]:
from mlflow import MlflowClient
import mlflow.pyfunc
client = MlflowClient()

In [None]:
model_name = "Phi35-vision-openvino"
model_version = 1

model_loaded = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{model_version}")

In [None]:
model_loaded

In [None]:
time1=  datetime.datetime.now()
response = model_loaded.predict(
    pd.DataFrame({"prompt": [prompt], "image_path": [image_path]}),
    params={"max_tokens": 256, "temperature": 0.5},  # only params declared in the signature
)
time2=  datetime.datetime.now()
print(time2-time1)

In [None]:
pprint.pprint(response['candidates'][0])