# DeepSeek-VL2 Tiny Model

## Notebook Overview

<span style="font-size:16px">This notebook demonstrates multimodal image understanding with DeepSeek-VL2 Tiny, a vision-language model that generates text descriptions and insights from images. The workflow combines PyTorch, transformers, and the `deepseek_vl2` utilities to process images and run inference.</span>

<b><span style="font-size:20px">Clone the official DeepSeek-VL2 GitHub repository, move into its directory, and install it in editable mode so that the `deepseek_vl2` package can be imported directly in this notebook.</span></b>

In [12]:
# %cd C:\Users\User\Desktop\OpenCV University\vlm-bench\Models\Deepseek

C:\Users\User\Desktop\OpenCV University\vlm-bench\Models\Deepseek


In [None]:
# !git clone https://github.com/deepseek-ai/DeepSeek-VL2.git
# %cd DeepSeek-VL2
# !pip install -e .

In [6]:
# !pip install -U "triton-windows<3.5"

ERROR: Could not find a version that satisfies the requirement triton (from versions: none)
ERROR: No matching distribution found for triton


## Importing libraries and modules
<span style="font-size:16px">`requests` and `BytesIO`: fetch images directly from GitHub URLs and load them into memory without saving them locally.</span>

In [1]:
# import torch
import time

import requests
from io import BytesIO

# For opening and processing images
from PIL import Image

# Import matplotlib to plot image outputs
import matplotlib.pyplot as plt

# Show plots inline
%matplotlib inline

# Set default colormap to grayscale
plt.rcParams['image.cmap'] = 'gray'

In [2]:
import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images

Python version is above 3.10, patching the collections module.


## Load the DeepSeek-VL2 Tiny Model
<span style="font-size:16px">
<b>DeepseekVLV2Processor</b>: The processor handles preprocessing of both text and image inputs so the model can understand multimodal data.<br>
Tokenizer: Converts text into token IDs suitable for the model and decodes generated IDs back into text.<br>
DeepseekVLV2ForCausalLM: The vision-language model itself, which handles both image and text inputs for inference.</span>

In [3]:
# Current Version
import transformers
transformers.__version__

'4.38.2'

In [4]:
# !pip install transformers==4.38.2
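The cells above show that the notebook was run with `transformers` 4.38.2. A small guard like the one below can flag a mismatched environment before the model is loaded. It is a sketch that uses only the standard library. `check_pin` is a hypothetical helper name, and whether other `transformers` versions also work is not established here.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package, expected):
    """Return True or False if `package` is installed, or None if it is missing."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        return None


status = check_pin("transformers", "4.38.2")
if status is None:
    print("transformers is not installed")
elif not status:
    print("transformers is installed, but not at version 4.38.2")
```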

In [5]:
# specify the path to the model
model_path = "deepseek-ai/deepseek-vl2-tiny"
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Add pad token = ['<｜▁pad▁｜>'] to the tokenizer
<｜▁pad▁｜>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Add grounding-related tokens = ['<|ref|>', '<|/ref|>', '<|det|>', '<|/det|>', '<|grounding|>'] to the tokenizer with input_ids
<|ref|>:128816
<|/ref|>:128817
<|det|>:128818
<|/det|>:128819
<|grounding|>:128820
Add chat tokens = ['<|User|>', '<|Assistant|>'] to the tokenizer with input_ids
<|User|>:128821
<|Assistant|>:128822



## Inference Function
<span style="font-size:16px">
Steps inside the function: <br>
1. <b>Load image from GitHub</b> → Downloads the image and converts it to RGB format.<br>
2. <b>Prepare conversation format</b> → Adds the user’s prompt along with the image placeholder, following the model’s input format.<br>
3. <b>Preprocess inputs with `DeepseekVLV2Processor`</b> → Converts both text and image into the token and tensor format the model expects.<br>
4. <b>Generate embeddings</b> → Runs the image encoder through `prepare_inputs_embeds` to build the input embeddings.<br>
5. <b>Run inference</b> → Calls the language model’s `generate` method to produce a response based on both the text prompt and the image.<br>
6. <b>Decode response</b> → Converts the generated tokens back into human-readable text.<br>
7. <b>Output</b> → Prints the image filename, inference time, and the model’s answer.<br>
<br>

The `run_inference` function runs DeepSeek-VL2 multimodal inference on images hosted on GitHub without plotting them.<br>
<b>force_batchify=True</b> ensures that even a single image + prompt is converted into a “batch” format, making it compatible with the model and avoiding shape/dimension errors.<br>
<b>Output:</b> The model’s response as a string, along with the image filename and inference time.
</span>  
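
The later cells call `run_inference`, so the steps above can be sketched as follows. This is a minimal sketch, not the author's original function. The helper names `load_image_from_url`, `image_name_from_url`, and `build_conversation`, the keyword parameters, and the fallback to the notebook objects `vl_gpt` and `vl_chat_processor` are assumptions made for illustration. The processor and generation calls mirror the inline cell in this notebook.

```python
import os
import time
from io import BytesIO
from urllib.parse import urlparse


def load_image_from_url(image_url, timeout=30):
    """Step 1: download an image into memory and convert it to RGB."""
    # Third-party imports are kept local so the other helpers work without them.
    import requests
    from PIL import Image

    response = requests.get(image_url, timeout=timeout)
    response.raise_for_status()  # fail early on a bad URL
    return Image.open(BytesIO(response.content)).convert("RGB")


def image_name_from_url(image_url):
    """Return the file name part of a URL, ignoring any query string."""
    return os.path.basename(urlparse(image_url).path)


def build_conversation(prompt, image_url):
    """Step 2: one user turn with an image placeholder, then an empty assistant turn."""
    return [
        {
            "role": "<|User|>",
            "content": f"<image>\n{prompt}",
            "images": [image_url],  # kept for reference; the PIL image is passed separately
        },
        {"role": "<|Assistant|>", "content": ""},
    ]


def run_inference(image_url, prompt, *, model=None, processor=None,
                  max_new_tokens=512, load_image=load_image_from_url):
    """Run one image + prompt through the model and return the decoded answer."""
    # Assumption: fall back to the objects created in the model-loading cell.
    model = vl_gpt if model is None else model
    processor = vl_chat_processor if processor is None else processor
    tokenizer = processor.tokenizer

    image = load_image(image_url)
    conversation = build_conversation(prompt, image_url)

    start = time.perf_counter()
    # Step 3: preprocess text and image into one batched input.
    prepare_inputs = processor(
        conversations=conversation,
        images=[image],
        force_batchify=True,
        system_prompt="",
    ).to(model.device)
    # Step 4: run the image encoder and build the input embeddings.
    inputs_embeds = model.prepare_inputs_embeds(**prepare_inputs)
    # Step 5: generate greedily with the language model.
    outputs = model.language.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        use_cache=True,
    )
    # Step 6: decode the generated token IDs into text.
    answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
    elapsed = time.perf_counter() - start

    # Step 7: report the file name, the timing, and the answer.
    print(f"Image: {image_name_from_url(image_url)}")
    print(f"Inference time: {elapsed:.2f} s")
    print(f"Answer: {answer}")
    return answer
```

The timer starts after the download, so the reported time covers preprocessing, generation, and decoding only. Passing `model`, `processor`, or `load_image` explicitly makes it possible to exercise the function with stand-in objects when no GPU is available.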

## In this case, we’ll ask the model to explain an image

<span style="font-size:16px">The cell below runs the steps inline on a local image, `icons.png`. The later cells test the inference function by passing:
1. A GitHub image URL
2. A text prompt asking the model to analyze the image </span>

In [6]:
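conversation = [
    {
        "role": "<|User|>",
        # <|ref|>...<|/ref|> are the grounding tokens listed in the tokenizer log above
        "content": "<image>\n<|ref|>Explain the image in 100 words<|/ref|>.",
        # "images" holds the local file paths that load_pil_images opens
        "images": ["icons.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]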
# load images and prepare for inputs
pil_images = load_pil_images(conversation)

prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


<|User|>: <image>
<|ref|>Explain the image in 100 words<|/ref|>.

<|Assistant|>: The image displays a list of popular apps categorized into two sections: "Popular apps" and "Must-have free apps." Each app has an icon, name, category, and price. The "Popular apps" section includes applications like PDF X: PDF Editor & PDF Reader, Screen recorder - Screen record & Screen capture, Sketchbook Pro, Movie Maker Video Editor, DTS Sound Unbound, HEVC Video Extensions, GST Doctor ITC Matching Software Pro, Movie Maker - Video Editor PRO, PDF Reader Pro - PDF Editor & Converter, and Doc Scan PDF Scanner. The "Must-have free apps" section features Instagram, Lively Wallpaper, VLC, Netflix, Adobe Acrobat Reader DC, ChatGPT, Telegram Desktop, Canva, Visual Studio Code, and Microsoft Teams.


In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/esp32-devkitC-v4-pinout.png",
    "Explain each pin in this ESP32 board."
)


### Extract Text from Image

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/image.png",
    "Extract all the text from this image as plain text."
)


### Analyze Table

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/Depict-Data-Studio_Transforming-a-Table_GIF.gif",
    "Explain what transformation is happening in this data table."
)


### Diagram
<span style="font-size:16px">We use the `run_inference` function to get a detailed step-by-step explanation of the water cycle from the given image.</span>

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/pngtree-natural-phenomena-of-water-cycle-seawater-waves-png-image_6940374.png",
    "Explain the stages of this process step by step."
)


### Counting the number of people

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/image1.jpg",
    "How many people are present in this image? Return only the number."
)
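
Because the prompt asks for only a number, the returned string can be converted to an integer. A minimal sketch, assuming `run_inference` returns the answer as a string as described earlier. `parse_count` is a hypothetical helper, and it tolerates extra words around the number because the model may not follow the instruction exactly.

```python
import re


def parse_count(answer):
    """Return the first integer found in the model's answer, or None if there is none."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None
```

For example, `parse_count("There are 12 people.")` gives `12`, and an answer without digits gives `None`.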


### Identifying colors

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/image2.jpg",
    "List the colors of the main objects visible in this image. Answer as a JSON array like: [\"red\", \"blue\", \"green\"]."
)
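
The prompt requests a JSON array, but a model may wrap it in extra text or a code fence. The sketch below pulls the first valid JSON value out of an answer string using the standard library's `json.JSONDecoder.raw_decode`. `extract_json` is a hypothetical helper, and it assumes `run_inference` returns the answer as a string.

```python
import json


def extract_json(answer):
    """Return the first JSON array or object embedded in `answer`, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(answer):
        if char in "[{":
            try:
                # raw_decode parses one JSON value starting at `start`
                # and ignores whatever text follows it.
                value, _ = decoder.raw_decode(answer, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; try the next bracket
    return None
```

Scanning for the first bracket that decodes cleanly keeps the helper independent of how the model phrases the surrounding text.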


### Left and Right object detection

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/image3.jpg",
    "Describe what is on the left side and what is on the right side of this image."
)


### Object identification and counting

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/image4.jpg",
    "Identify the objects in this image and count them."
)


### Scene description

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/image5.jpg",
    "Describe the details in this street image."
)


### Formula extraction

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/molecularformula.png",
    (
        "The image shows a table of chemical compounds and their formulas. "
        "Extract the compounds and formulas, but represent formulas in LaTeX math format. "
        "Output should be valid JSON only. Example: "
        "{\"Glucose\": \"$C_{6}H_{12}O_{6}$\", \"Butane\": \"$C_{4}H_{10}$\"}."
    )
)
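
Since the prompt asks for valid JSON with formulas in LaTeX math format, the answer can be checked before it is used. This is a sketch under two assumptions: `run_inference` returns the answer as a string, and the answer is pure JSON as requested. `parse_formula_table` is a hypothetical helper that returns the parsed table plus the names of any compounds whose formula is not wrapped in `$...$`.

```python
import json


def parse_formula_table(answer):
    """Parse a JSON object of compound -> formula and list badly wrapped formulas."""
    table = json.loads(answer)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(table, dict):
        raise ValueError("expected a JSON object mapping compounds to formulas")
    badly_wrapped = [
        name
        for name, formula in table.items()
        if not (
            isinstance(formula, str)
            and len(formula) > 2
            and formula.startswith("$")
            and formula.endswith("$")
        )
    ]
    return table, badly_wrapped
```

An empty second return value means every formula follows the requested `$...$` format.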


### Chart

In [None]:
run_inference(
    "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/Piechartexample11.png",
    "Explain the insights from this chart."
)


In [None]:
# Download the image
!wget "https://www.shutterstock.com/image-photo/broken-screen-laptop-smashed-3d-600nw-2257814995.jpg" -O laptop.jpg

# Load and display using PIL
from PIL import Image
from IPython.display import display

img = Image.open("laptop.jpg")
display(img)
prompt = (
    "Carefully examine this image. "
    "Identify any physical damage, broken connectors, burns, missing parts, or anything abnormal. "
    "Describe the issue clearly and in detail."
)

run_inference(
    image_url="https://www.shutterstock.com/image-photo/broken-screen-laptop-smashed-3d-600nw-2257814995.jpg",
    prompt=prompt
)

In [None]:
# Download the image
!wget "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/battery-charger-9137425.jpg" -O battery.jpg

# Load and display using PIL
from PIL import Image
from IPython.display import display

img = Image.open("battery.jpg")
display(img)


In [None]:
prompt = (
    "Carefully examine this image. "
    "Identify any physical damage, broken connectors, burns, missing parts, or anything abnormal. "
    "Describe the issue clearly and in detail."
)

run_inference(
    image_url="https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/battery-charger-9137425.jpg",
    prompt=prompt
)


In [None]:
image_url = "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/istockphoto-490106120-612x612.jpg"
response = requests.get(image_url)
response.raise_for_status()  # stop here if the download failed
img = Image.open(BytesIO(response.content))
display(img)
run_inference(
    image_url=image_url,
    prompt=(
        "Carefully examine this electronic component image. "
        "Identify any physical damage, broken connectors, burns, missing parts, or anything abnormal. "
        "Describe the issue clearly and in detail."
    )
)


In [None]:
image_url = "https://raw.githubusercontent.com/Shilpaknnarayan/Images/main/Burnt-Plug-blog.jpg"
response = requests.get(image_url)
response.raise_for_status()  # stop here if the download failed
img = Image.open(BytesIO(response.content))
display(img)
run_inference(
    image_url=image_url,
    prompt=(
        "Carefully examine this burnt plug image. "
        "Identify any physical damage, broken connectors, burns, missing parts, or anything abnormal. "
        "Describe the issue clearly and in detail."
    )
)
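
The damage-inspection cells above repeat nearly the same prompt for each image. One way to avoid the repetition is a small loop. This is a sketch only: `DAMAGE_PROMPT` and `inspect_images` are hypothetical names, and `infer` is assumed to be a callable with the same `image_url` and `prompt` keyword arguments as `run_inference`.

```python
DAMAGE_PROMPT = (
    "Carefully examine this image. "
    "Identify any physical damage, broken connectors, burns, missing parts, or anything abnormal. "
    "Describe the issue clearly and in detail."
)


def inspect_images(image_urls, infer, prompt=DAMAGE_PROMPT):
    """Run the same prompt over several images and collect the answers by URL."""
    results = {}
    for url in image_urls:
        try:
            results[url] = infer(image_url=url, prompt=prompt)
        except Exception as exc:  # keep going if one image fails to download or decode
            results[url] = f"ERROR: {exc}"
    return results
```

With the model loaded, a call would look like `inspect_images([url_1, url_2], run_inference)`, where the URLs are the image links used in the cells above.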
