# Visualizing VLM Tokens

Visual Language Models (VLMs) use a pretrained LLM for their core smarts, but also take images as inputs. There have been many variations on how to do this, but the field has largely settled on a fairly straightforward standard pattern, depicted below. Images are first preprocessed and sent through a neural network that is pretrained for image analysis, typically a Vision Transformer (ViT) such as CLIP's image encoder. ViTs break the image down into small patches, commonly 14x14 pixels, and each patch is converted to its own vector at the output. The VLM then translates these image vectors into the same embedding space as word tokens and sends them into the LLM for analysis.
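
To make the patch arithmetic concrete, here is a minimal sketch of how an image divides into patches and where a given patch sits in pixel space. The helper names `patch_grid` and `patch_box` are hypothetical (they are not part of `imgtokens`), and the 336x336 input size is just an assumed example resolution.

```python
def patch_grid(width, height, patch_size=14):
    """Return (rows, cols, num_patches) for an image split into square patches."""
    if width % patch_size or height % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    cols = width // patch_size
    rows = height // patch_size
    return rows, cols, rows * cols


def patch_box(index, cols, patch_size=14):
    """Return the (left, top, right, bottom) pixel box of patch `index` (row-major order)."""
    row, col = divmod(index, cols)
    left, top = col * patch_size, row * patch_size
    return left, top, left + patch_size, top + patch_size


# Assumed example: a 336x336 image with 14x14 patches.
rows, cols, num_patches = patch_grid(336, 336)
print(rows, cols, num_patches)  # 24 24 576
print(patch_box(25, cols))      # (14, 14, 28, 28)
```

Each of those 576 patches would yield one vector from the ViT, so the LLM would see 576 image tokens for a single image of this size.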

In [None]:
# Assumes a Colab-style environment where the clone lands under /content
!git clone https://github.com/pifanpi/visualizing-vlm-tokens
%cd /content/visualizing-vlm-tokens
!git pull origin main
!./install-requirements.sh

In [None]:
import imgtokens
ipwt = imgtokens.ImagePatchWordTokenizer()

In [None]:
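# Pick an image URL. Each assignment below overrides the previous one,
# so the last uncommented line wins; comment out the others to switch images.

# path through grass
img_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"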

# psyduck
img_url = "https://images.tcdn.com.br/img/img_prod/460977/estatua_colecionavel_psyduck_pokemon_15cm_anime_manga_mega_saldao_mkp_127551_1_64f34b612a53196c9b84efe947d33d43.jpg"

# hand picking tomato
img_url = "https://forestry.com/wp/wp-content/uploads/2024/02/2-219.webp"

In [None]:
import requests
from PIL import Image
from io import BytesIO

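print(f"Fetching img from {img_url}")
response = requests.get(img_url, timeout=30)
response.raise_for_status()  # fail early on a bad download instead of a confusing PIL error
img = Image.open(BytesIO(response.content))
# _standardize_img is a private helper (leading underscore), so it may change between versions
img = ipwt._standardize_img(img)
img  # display the standardized image in the notebook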

In [None]:
words = ipwt.process_img(img, num_words=8)

In [None]:
ipwt.draw_with_plotly(img, words, size=1500, iframe=False)
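
The internals of `process_img` are not shown in this notebook. One plausible way to attach words to a patch, once its vector lives in the same embedding space as word tokens, is a nearest-neighbor lookup by cosine similarity against the vocabulary embeddings. The sketch below illustrates that idea only: `nearest_words`, `cosine`, and the 3-dimensional `toy_vocab` are made up for the example and are not taken from `imgtokens`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(patch_vec, vocab, num_words=3):
    """Return the num_words vocabulary entries most similar to patch_vec."""
    ranked = sorted(vocab, key=lambda w: cosine(patch_vec, vocab[w]), reverse=True)
    return ranked[:num_words]


# Toy word embeddings (hypothetical values, purely for illustration).
toy_vocab = {
    "grass": [0.9, 0.1, 0.0],
    "tomato": [0.1, 0.9, 0.1],
    "hand": [0.0, 0.2, 0.9],
    "leaf": [0.8, 0.3, 0.1],
}

patch_vec = [1.0, 0.2, 0.0]  # a made-up projected patch vector
print(nearest_words(patch_vec, toy_vocab, num_words=2))  # ['grass', 'leaf']
```

A real vocabulary has tens of thousands of entries and much higher-dimensional vectors, so in practice this lookup would be done as a single matrix multiplication rather than a Python loop.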