<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/microsoft_Phi3_Vision_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Microsoft Phi-3 Vision

From the abstract of the Phi-3 technical report:

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens,
whose overall performance, as measured by both academic benchmarks and internal testing, rivals
that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38
on MT-bench), despite being small enough to be deployed on a phone.
The innovation lies entirely in
our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered
publicly available web data and synthetic data. The model is also further aligned for robustness,
safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B
models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more
capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench).
Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with
strong reasoning capabilities for image and text prompts.


https://github.com/microsoft/Phi-3CookBook

https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3

https://onnxruntime.ai/docs/genai/tutorials/phi3-v.html

https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi-3-tutorial.md


In [None]:

%pip install transformers -U --quiet
%pip install datasets -U --quiet
%pip install torch -U --quiet

%pip install -U flash-attn --no-build-isolation --quiet

In [None]:
import warnings
import datetime
import pprint
# Silence all Python warnings (including the UserWarnings from setuptools and pydantic)
# to keep the notebook output readable; UserWarning is a subclass of Warning
warnings.filterwarnings("ignore", category=Warning)

In [None]:
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16

In [None]:
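# The processor handles both the chat-style text prompt and the image preprocessing
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Load the weights with the dtype set in kwargs above (bfloat16) and move the model to the GPU
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, **kwargs).cuda()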



In [None]:
import matplotlib.pyplot as plt
import cv2

In [None]:
def plt_imshow(title, image):
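  # PIL images are converted to RGB directly; anything else is treated as an
  # OpenCV array in BGR order and converted to RGB for matplotlib
  if isinstance(image, Image.Image):
    image = image.convert('RGB')
  else:
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)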
  # show the image
  plt.imshow(image)
  plt.title(title)
  plt.grid(False)
  plt.show()

In [None]:
user_prompt = '<|user|>\n'
assistant_prompt = '<|assistant|>\n'
prompt_suffix = "<|end|>\n"
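
The prompts in the following cells all share one layout: the user tag, one numbered `<|image_N|>` placeholder per image, the question, the end tag, and the assistant tag. A minimal sketch of a helper that assembles this layout is shown below. The name `build_prompt` and its parameters are hypothetical (they are not part of the notebook or of any library), and the sketch only covers the single-turn case used here.

```python
def build_prompt(question: str, num_images: int = 1) -> str:
    """Assemble a single-turn prompt in the layout used in this notebook.

    Hypothetical helper: the special tokens are copied from the cell above.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # One placeholder per image, numbered from 1, each on its own line
    image_tags = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"<|user|>\n{image_tags}{question}<|end|>\n<|assistant|>\n"


# Example: a two-image prompt
print(build_prompt("What is the difference between these two images?", num_images=2))
```

The number of placeholders should match the number of images passed to the processor, in the same order.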

In [None]:
prompt = f"{user_prompt}<|image_1|>\nCould you please introduce this stock to me and explain the plot?{prompt_suffix}{assistant_prompt}"


url = "https://g.foolcdn.com/editorial/images/767633/nvidiadatacenterrevenuefy2017tofy2024.png"

image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")


In [None]:
plt_imshow("nvidia", image)

In [None]:
time1 = datetime.datetime.now()
generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]


time2 = datetime.datetime.now()
print(f"Time taken: {time2 - time1}")
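
The same start/stop timing pattern is repeated around every generation call in this notebook. One possible way to avoid the repetition is a small context manager built on `time.perf_counter`. This is only a sketch: the name `timed` and the `seconds` key are hypothetical choices, not something the notebook or the libraries above define.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label="Time taken"):
    """Measure the wall-clock time of the enclosed block and print it.

    Hypothetical helper; the elapsed time is also stored under result["seconds"].
    """
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f} s")


# Example with a stand-in workload; in the notebook the body would be the
# model.generate(...) and processor.batch_decode(...) calls
with timed("Dummy workload") as t:
    total = sum(range(100_000))
```

`time.perf_counter` is a monotonic clock intended for measuring intervals, which makes it a reasonable alternative to subtracting two `datetime.datetime.now()` values.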

In [None]:
pprint.pprint(response)


In [None]:
prompt = f"{user_prompt}<|image_1|>\nHelp me get the title and author information of this book?{prompt_suffix}{assistant_prompt}"

url = "https://marketplace.canva.com/EAFPHUaBrFc/1/0/1003w/canva-black-and-white-modern-alone-story-book-cover-QHBKwQnsgzs.jpg"

image = Image.open(requests.get(url, stream=True).raw)

time1 = datetime.datetime.now()
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

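# Drop special tokens such as <|end|> so the decoded answer is plain text,
# consistent with the first example
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]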


time2 = datetime.datetime.now()
print(f"Time taken: {time2 - time1}")

In [None]:
plt_imshow("ocr 1", image)

In [None]:
response

In [None]:
# Local test image; assumes Google Drive is already mounted at /content/drive
path = "/content/drive/MyDrive/data (1)/docs/image1.jpeg"


In [None]:
image = Image.open(path).convert('RGB')

In [None]:
plt_imshow("ocr 2", image)

In [None]:
prompt = f"{user_prompt}<|image_1|>\nCan you extract literally the text of the following image?{prompt_suffix}{assistant_prompt}"


time1 = datetime.datetime.now()
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

# Drop special tokens such as <|end|> so the extracted text is plain
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]

time2 = datetime.datetime.now()
print(f"Time taken: {time2 - time1}")

In [None]:
pprint.pprint(response)

In [None]:
prompt = f"{user_prompt}<|image_1|>\n<|image_2|>\nWhat is the difference between these two images?{prompt_suffix}{assistant_prompt}"

print(f">>> Prompt\n{prompt}")

url = "https://hinhnen.ibongda.net/upload/wallpaper/doi-bong/2012/11/22/arsenal-wallpaper-free.jpg"

image_1 = Image.open(requests.get(url, stream=True).raw)

url = "https://assets-webp.khelnow.com/d7293de2fa93b29528da214253f1d8d0/news/uploads/2021/07/Arsenal-1024x576.jpg.webp"

image_2 = Image.open(requests.get(url, stream=True).raw)

images = [image_1, image_2]

time1 = datetime.datetime.now()
inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

time2 = datetime.datetime.now()
print(f"Time taken: {time2 - time1}")

In [None]:
plt_imshow("image 1", image_1)
plt_imshow("image 2", image_2)

In [None]:
pprint.pprint(response)