# Exploring Llama 3.2-Vision (locally) with Ollama

Code authored by: Shaw Talebi

[Blog link](https://towardsdatascience.com/multimodal-models-llms-that-can-see-and-hear-5c6737c981d3)
<br>
[Video link](https://youtu.be/Ot2c5MKN_-w)

In [3]:
%pip install ollama

Defaulting to user installation because normal site-packages is not writeable
Collecting ollama
  Using cached ollama-0.5.4-py3-none-any.whl (13 kB)
Collecting httpx>=0.27
  Using cached httpx-0.28.1-py3-none-any.whl (73 kB)
Collecting pydantic>=2.9
  Using cached pydantic-2.11.9-py3-none-any.whl (444 kB)
Collecting idna
  Downloading idna-3.10-py3-none-any.whl (70 kB)
Collecting anyio
  Using cached anyio-4.10.0-py3-none-any.whl (107 kB)
Collecting httpcore==1.*
  Using cached httpcore-1.0.9-py3-none-any.whl (78 kB)
Collecting certifi
  Downloading certifi-2025.8.3-py3-none-any.whl (161 kB)
Collecting h11>=0.16
  Using cached h11-0.16.0-py3-none-any.whl (37 kB)
Collecting typing-inspection>=0.4.0
  Using cached typing_inspection-0.4.1-py3-none-any.whl (14 kB)
Collecting pydantic-core==2.33.2
  Downloading pydantic_core-2.33.2-cp39-cp

### imports

In [4]:
import ollama

### select model

In [23]:
# model = 'llama3.2-vision' # 11b
model = 'gemma3:4b' # 4b
# model = 'moondream:1.8b' # 1.8b

### pull model

In [17]:
ollama.pull(model)

ProgressResponse(status='success', completed=None, total=None, digest=None)
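
The pull above returns only a final status. For large models it can help to watch progress instead. The following is a minimal sketch, assuming each progress chunk supports dict-style `.get` and carries the same `status`, `completed`, and `total` fields shown in the `ProgressResponse` above. `pull_progress` is a hypothetical helper, not part of the `ollama` package.

```python
def pull_progress(chunk):
    """Format one pull-progress chunk as a short status string.

    Assumption: `chunk` supports .get() and has 'status', 'completed',
    and 'total' fields, as in the ProgressResponse printed above.
    """
    status = chunk.get('status')
    completed = chunk.get('completed')
    total = chunk.get('total')
    # Some chunks (e.g. the final 'success') carry no byte counts.
    if completed is None or not total:
        return str(status)
    return f"{status}: {100 * completed / total:.1f}%"


# Assumed usage (needs a running Ollama server, so left commented out):
# import ollama
# for chunk in ollama.pull(model, stream=True):
#     print(pull_progress(chunk))
```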

#### Basic Usage

In [18]:
response = ollama.chat(
    model=model,
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': ['images/shaw-sitting.jpeg']
    }]
)

print(response['message']['content'])


The image features a man sitting on a yellow stool or ottoman, wearing a black shirt and tan pants. He has his hands clasped together while smiling at the camera. The room appears to be a living space with various furniture items such as chairs, a couch, and a TV mounted on the wall in the background.

There are also several potted plants placed around the room, adding greenery and life to the space. A chair can be seen near one of the potted plants, while another is located closer to the foreground. The man's position on the yellow stool or ottoman suggests that he might be in a casual setting where seating options are provided for guests or family members.

Additionally, there is a keyboard placed nearby, possibly indicating that this living space could also serve as a workspace or an area for creative pursuits such as music production or writing.
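
Every cell in this notebook sends the same message shape: a role, a text prompt, and a list of image paths. A small helper can build that message and fail early when the image file is missing. This is only a sketch; `image_message` is a hypothetical name introduced here for illustration.

```python
from pathlib import Path


def image_message(prompt, image_path):
    """Build a single user message with one attached image.

    Raises FileNotFoundError before any request is made if the
    image path does not point to an existing file.
    """
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")
    return {
        'role': 'user',
        'content': prompt,
        'images': [str(path)],  # same format as the cells in this notebook
    }


# Assumed usage (needs a running Ollama server, so left commented out):
# import ollama
# msg = image_message('What is in this image?', 'images/shaw-sitting.jpeg')
# response = ollama.chat(model=model, messages=[msg])
# print(response['message']['content'])
```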


#### Image captioning - streaming

In [20]:
stream = ollama.chat(
    model=model,
    messages=[{
        'role': 'user',
        'content': 'Can you write a caption for this image?',
        'images': ['images/shaw-sitting.jpeg']
    }],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

urns, plants, and a television are arranged on the wooden floor of a modern living room.
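
The streaming loop above prints each chunk as it arrives but does not keep the full text. If the caption is needed afterwards, the chunks can be accumulated while printing. A minimal sketch, assuming each chunk exposes `chunk['message']['content']` as in the loop above; `collect_stream` is a hypothetical helper name.

```python
def collect_stream(chunks, echo=True):
    """Consume a stream of chat chunks and return the concatenated text.

    With echo=True the text is also printed as it arrives, matching
    the behaviour of the loop in the cell above.
    """
    parts = []
    for chunk in chunks:
        text = chunk['message']['content']
        if echo:
            print(text, end='', flush=True)
        parts.append(text)
    return ''.join(parts)


# Assumed usage with the `stream` object created above:
# caption = collect_stream(stream)
```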


#### Explaining memes

In [21]:
stream = ollama.chat(
    model=model,
    messages=[{
        'role': 'user',
        'content': 'Can you explain this meme to me?',
        'images': ['images/ai-meme.jpeg']
    }],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)


The image features a cartoon character, likely Spongebob Squarepants, sitting on a bench in a sandy area. The character is holding a hammer and appears to be working on something or fixing an object nearby. Various objects are scattered around the scene, including a blue wrench, a yellow toolbox, and several other tools. A text overlay at the bottom of the image reads "Trying to build with AI today...."

#### OCR

In [24]:
stream = ollama.chat(
    model=model,
    messages=[{
        'role': 'user',
        'content': 'Can you transcribe the text from this screenshot in a markdown format?',
        'images': ['images/5-ai-projects.jpeg']
    }],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Here's the text from the screenshot formatted in markdown:

**5 AI Projects You Can Build This Weekend (with Python)**

1.  Resume Optimization (Beginner)
    *   Idea: build a tool that adapts your resume for a specific job description

2.  YouTube Lecture Summarizer (Beginner)
    *   Idea: build a tool that takes YouTube video link and summarizes it

3.  Automatically Organizing PDFs (Intermediate)
    *   Idea: build a tool to analyze the contents of each PDF and organize them into folders based on topics.

4.  Multimodal Search (Intermediate)
    *   Idea: Use multimodal embeddings to represent user queries, text knowledge, and images in single space

5.  Desktop QA (Advanced)
    *   Idea: Connect a multimodal knowledge base to a multimodal model like Llama-3.2-11B-Vision.
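
Because the transcription comes back as a markdown list, it can be turned into structured data with a little parsing. The sketch below assumes the exact layout shown above (a numbered title line ending in a parenthesised difficulty level, followed by an `Idea:` bullet). Model output varies between runs, so the patterns may need adjusting. `parse_projects` is a hypothetical helper.

```python
import re

# Matches lines like "1.  Resume Optimization (Beginner)"
TITLE_RE = re.compile(r'^\s*(\d+)\.\s+(.*?)\s*\((\w+)\)\s*$')
# Matches lines like "    *   Idea: build a tool that ..."
IDEA_RE = re.compile(r'^\s*\*\s+Idea:\s*(.*)$')


def parse_projects(markdown_text):
    """Parse the transcribed list into a list of dicts."""
    projects = []
    for line in markdown_text.splitlines():
        title = TITLE_RE.match(line)
        if title:
            projects.append({
                'number': int(title.group(1)),
                'title': title.group(2),
                'level': title.group(3),
                'idea': None,
            })
            continue
        idea = IDEA_RE.match(line)
        if idea and projects:
            # Attach the idea to the most recent numbered item.
            projects[-1]['idea'] = idea.group(1).strip()
    return projects
```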