In [None]:
%pip install --quiet --upgrade openai pillow pydantic pandas tqdm

In [None]:
import urllib.request
import zipfile

url = "https://raw.githubusercontent.com/jsoma/dataharvest25-ai-images-video/main/sky.jpg"
urllib.request.urlretrieve(url, "sky.jpg")

url = "https://raw.githubusercontent.com/jsoma/dataharvest25-ai-images-video/main/city.png"
urllib.request.urlretrieve(url, "city.png")

url = "https://raw.githubusercontent.com/jsoma/dataharvest25-ai-images-video/main/cars.zip"
urllib.request.urlretrieve(url, "cars.zip")

with zipfile.ZipFile('cars.zip', 'r') as zip_ref:
    zip_ref.extractall()

We're going to start our journey by talking **directly to large language models**. They don't involve any heavy lifting or fancy installations on our part, they're remarkably useful across different types of problems, and you don't need many technical skills to get a "good enough" solution.

## Our LLM options

When we talk about LLMs, we think of them as chatbots that are great at text. And that's usually true! If we want one that processes other types of content – images, video or audio – we are looking for a **multimodal model**. Multimodal just means it takes different *modes* of input beyond text.

Model availabilities and capabilities change over time, as does pricing (the most important part!). Down below we'll play around with different providers *and* different models, but you can also browse the documentation to see what models might work with your use case.

### OpenAI's GPT

- https://platform.openai.com/docs/models/compare
- https://platform.openai.com/docs/pricing
- https://platform.openai.com/docs/guides/images-vision#analyze-images

### Google Gemini

- https://ai.google.dev/gemini-api/docs/models
- https://aistudio.google.com/prompts/new_chat
- https://ai.google.dev/gemini-api/docs/image-understanding
- https://ai.google.dev/gemini-api/docs/video-understanding

### Anthropic's Claude

- https://docs.anthropic.com/en/docs/about-claude/models/overview#model-comparison-table

### Deepseek

I love it but it's just *so slow* when you're using it through the API.

# Using the LLMs directly

We'll start with talking directly to the LLMs.

## Basic requests

While each LLM provider has their own tools and libraries to work with the LLM, they almost all support an "OpenAI compatible endpoint." Since ChatGPT got very popular very quickly, this is an attempt by Anthropic and Google to ease the transition to using their tooling.

This means you can easily re-use a lot of your code across different providers.

Find more about the library here: https://github.com/openai/openai-python

In [None]:
# Import the openai library
from openai import OpenAI

In order to use these services we need **API keys!**

While using the AI chatbots through the website is almost always free, using them through *Python* costs money. API keys are how they track your usage. In this case, just use mine!

In [None]:
# API KEYS GO HERE, ASK SOMA!!
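
If you'd rather not paste keys into the notebook itself, one option is to read them from environment variables. This is a minimal sketch: `require_key` is a hypothetical helper (not part of the `openai` library), and `EXAMPLE_API_KEY` is a made-up variable that only exists so the demo runs.

```python
import os

def require_key(name):
    """Read an API key from an environment variable, failing loudly if it is missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key

# Demo only: set a fake variable so the example runs anywhere
os.environ["EXAMPLE_API_KEY"] = "not-a-real-key"
print(require_key("EXAMPLE_API_KEY"))
```

With real keys exported in your shell, you would write something like `OPENAI_API_KEY = require_key("OPENAI_API_KEY")`.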

### Talking to ChatGPT

Here is how you talk to ChatGPT. You let it know the model you want, a series of messages, and then print out the (very awkward) `completion.choices[0].message.content`. There's actually a more recent API (the Responses API) buuuut it doesn't work with other providers so we're ignoring it for now.

In [None]:
client = OpenAI(api_key=OPENAI_API_KEY)

completion = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        { "role": "user", "content": "What color is the sky?", },
    ],
    temperature=0
)

print(completion.choices[0].message.content)

### Talking to Claude

Claude, the AI model run by Anthropic, is known for having the *best personality*.

In [None]:
client = OpenAI(base_url='https://api.anthropic.com/v1/', api_key=CLAUDE_API_KEY)

completion = client.chat.completions.create(
    model="claude-3-5-haiku-20241022",
    messages=[
        { "role": "user", "content": "What color is the sky?", },
    ],
    temperature=0
)

print(completion.choices[0].message.content)

### Talking to Google Gemini

Everyone ignores Google Gemini, even though it works *great*. It can be a little wordy sometimes but it's very very capable (as you'll see later on).

To experiment with it, you should *not* use the normal chat interface, but instead [AI Studio](https://aistudio.google.com/app/u/0/prompts/new_chat?pli=1).

In [None]:
client = OpenAI(
    base_url='https://generativelanguage.googleapis.com/v1beta/openai/',
    api_key=GEMINI_API_KEY
)

completion = client.chat.completions.create(
    model="gemini-2.0-flash-lite",
    messages=[
        { "role": "user", "content": "What color is the sky?", },
    ],
    temperature=0
)

print(completion.choices[0].message.content)

## Making image requests

So far we've been making basic text requests. Since this session is all about images and video we're now going to upgrade to working with images!

In [None]:
from openai import OpenAI
import base64
from IPython.display import Image

filename = 'sky.jpg'

Image(filename) 

The only complicated part of using images with an LLM is **converting them to base64 encoding,** a representation of the image using printable characters.
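
To see what that encoding looks like on something tiny, here is a small sketch that turns raw bytes into the `data:` URL format used in image requests. `to_data_url` is a hypothetical helper name; it guesses the MIME type from the file extension with the standard library's `mimetypes` module, so a `.jpg` is labeled `image/jpeg` and a `.png` is labeled `image/png`.

```python
import base64
import mimetypes

def to_data_url(raw_bytes, filename):
    """Build a data URL from raw bytes, guessing the MIME type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        # Unknown extension: fall back to a generic binary type
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Two bytes are enough to see the shape of the result
print(to_data_url(b"hi", "sky.jpg"))
# → data:image/jpeg;base64,aGk=
```

For a real image you would pass in the bytes from `open(filename, "rb").read()`.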

In [None]:
with open(filename, "rb") as image_file:
    b64_image = base64.b64encode(image_file.read()).decode("utf-8")

b64_image[:1000]

Sending the image to the conversation is a *little* different than it was last time, but not too crazy.

In [None]:
client = OpenAI(api_key=OPENAI_API_KEY)

completion = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What color is the sky in this image?"},
                {"type": "image_url", "image_url": { "url": f"data:image/jpeg;base64,{b64_image}" } }
            ],
        },
    ],
    temperature=0
)

print(completion.choices[0].message.content)

## Comparing models

The reason why it's good to use the OpenAI-compatible endpoints for all of the models is that it makes it very very very easy to compare their outputs. When you're asking a model to write a story you might not care too much, but once you move into data and image analysis you very quickly learn that some models are better than others.
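
Since every provider accepts the same request shape, one way to compare them is to loop over a list of clients. This is a rough sketch, not the only way to do it: `build_image_messages` and `compare_models` are hypothetical helpers, and `targets` is assumed to be a list of `(label, client, model_name)` tuples where each client is an `OpenAI(...)` object.

```python
def build_image_messages(prompt, data_url):
    """Build the messages list for one text prompt plus one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]

def compare_models(targets, prompt, data_url):
    """Send the same prompt and image to every (label, client, model_name) target."""
    messages = build_image_messages(prompt, data_url)
    answers = {}
    for label, client, model_name in targets:
        completion = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=0,
        )
        answers[label] = completion.choices[0].message.content
    return answers
```

You would call it with something like `compare_models([("openai", openai_client, "gpt-4.1-nano"), ("gemini", gemini_client, "gemini-2.0-flash-lite")], "List the stores in this photo", data_url)` and then print each answer side by side.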

In [None]:
filename = 'city.png'

with open(filename, "rb") as image_file:
    b64_image = base64.b64encode(image_file.read()).decode("utf-8")

Image(filename)

Let's analyze this image using OpenAI's GPT-4.1-nano. It's a nice cheap image-capable model.

In [None]:
client = OpenAI(api_key=OPENAI_API_KEY)

completion = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "List the stores in this photo"},
                {"type": "image_url", "image_url": { "url": f"data:image/png;base64,{b64_image}" } }
            ],
        },
    ],
    temperature=0
)

print(completion.choices[0].message.content)

How does Google Gemini compare?

In [None]:
client = OpenAI(base_url='https://generativelanguage.googleapis.com/v1beta/openai/', api_key=GEMINI_API_KEY)

completion = client.chat.completions.create(
    model="gemini-2.0-flash-lite",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "List the stores in this photo"},
                {"type": "image_url", "image_url": { "url": f"data:image/png;base64,{b64_image}" } }
            ],
        },
    ],
    temperature=0
)

print(completion.choices[0].message.content)

Above we're using `gemini-2.0-flash-lite`, which is a lightweight Gemini model. Let's try again with a more powerful one, `gemini-2.5-flash-preview-05-20`.

In [None]:
client = OpenAI(base_url='https://generativelanguage.googleapis.com/v1beta/openai/', api_key=GEMINI_API_KEY)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-05-20",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "List the stores in this photo"},
                {"type": "image_url", "image_url": { "url": f"data:image/png;base64,{b64_image}" } }
            ],
        },
    ],
    temperature=0
)

print(completion.choices[0].message.content)

## Structured output

You might ask an LLM politely to return data in a certain format, but it's always free to ignore you! [Pydantic](https://docs.pydantic.dev/latest/) allows you to demand more structure. You create a model of what the output should look like and the LLM follows it!

> I've been writing Pydantic for a while now, but I usually use an LLM to write the model for me, especially for more complicated ones. It's a lot of boilerplate.
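
Before involving an LLM at all, it can help to see what Pydantic does on its own. The sketch below uses a made-up `SkyReport` model (hypothetical, just for illustration) to show validation accepting good JSON, rejecting a value outside the `Literal` options, and exposing field descriptions in the generated schema. Note that descriptions are passed with `description=`, because the first positional argument of `Field` is the default value.

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class SkyReport(BaseModel):
    color: str = Field(description="Dominant color of the sky")
    cloud_count: int = Field(description="Number of distinct clouds")
    time_of_day: Literal["day", "night"]

# Well-formed JSON becomes a typed Python object
good = SkyReport.model_validate_json(
    '{"color": "blue", "cloud_count": 3, "time_of_day": "day"}'
)
print(good.cloud_count)

# A value outside the Literal options is rejected
try:
    SkyReport.model_validate_json(
        '{"color": "blue", "cloud_count": 3, "time_of_day": "brunch"}'
    )
    rejected = False
except ValidationError:
    rejected = True
print(rejected)

# The JSON schema is where the descriptions end up
schema = SkyReport.model_json_schema()
print(schema["properties"]["color"]["description"])
```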

In [None]:
from typing import Literal, List
from pydantic import BaseModel, Field

# Just ask "write me a Pydantic model for XXXX"
class ImageDescription(BaseModel):
    # Use description= here: a bare string would be treated as the field's default value
    city_guess: str = Field(description="Best guess of the location in the photograph")
    cars_visible: int = Field(description="Number of visible cars")
    season: Literal['spring', 'summer', 'fall', 'winter']

Three things change when using structured outputs:

1. You add `response_format=` along with your requested structure
2. You use `client.beta.chat.completions.parse` to make the request
3. The result comes from `completion.choices[0].message.parsed`

It's easy to accidentally cut and paste the normal LLM code, add `response_format=`, and end up with an error.

In [None]:
client = OpenAI(api_key=OPENAI_API_KEY)

completion = client.beta.chat.completions.parse(
    model="gpt-4.1-nano",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this cityscape"},
                {"type": "image_url", "image_url": { "url": f"data:image/png;base64,{b64_image}" } }
            ],
        },
    ],
    temperature=0,
    response_format=ImageDescription,
)

result = completion.choices[0].message.parsed
result

Now let's try with **Google Gemini**

In [None]:
client = OpenAI(
    base_url='https://generativelanguage.googleapis.com/v1beta/openai/',
    api_key=GEMINI_API_KEY
)

completion = client.beta.chat.completions.parse(
    # gemini-2.5-flash-preview-05-20
    # gemini-2.0-flash-lite
    model="gemini-2.5-flash-preview-05-20",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this cityscape"},
                {"type": "image_url", "image_url": { "url": f"data:image/png;base64,{b64_image}" } }
            ],
        },
    ],
    temperature=0,
    response_format=ImageDescription,
)

result = completion.choices[0].message.parsed
result

## Analyzing many images

While analyzing *one* image is fine and all, usually you want to analyze a whole lot of them! Looking at one image isn't hard, but looking at one thousand makes you an *investigative journalist*.

To do this we'll use `glob`, my favorite-named Python tool, to find all of the jpg images in the "cars" folder.

In [None]:
import glob

filenames = glob.glob("cars/*.jpg")
filenames

In [None]:
Image(filenames[0])

Now let's see what predictions the LLM makes for the country of origin (and some other things) for each image.

In [None]:
from typing import Literal, List
from pydantic import BaseModel, Field

# Just ask "write me a Pydantic model for XXXX"
class ImageDescription(BaseModel):
    # Use description= here: a bare string would be treated as the field's default value
    country_guess: str = Field(description="Best guess of the country in the photograph")
    car_make: str
    car_model: str
    license_plate_number: str
    vehicle_category: Literal['car', 'truck', 'suv', 'other']

In [None]:
from tqdm import tqdm

client = OpenAI(
    base_url='https://generativelanguage.googleapis.com/v1beta/openai/',
    api_key=GEMINI_API_KEY
)

def ask_llm(filename):
    with open(filename, "rb") as image_file:
        b64_image = base64.b64encode(image_file.read()).decode("utf-8")

    completion = client.beta.chat.completions.parse(
        model="gemini-2.5-flash-preview-05-20",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the vehicle in this photo"},
                    {"type": "image_url", "image_url": { "url": f"data:image/jpeg;base64,{b64_image}" } }
                ],
            },
        ],
        temperature=0,
        response_format=ImageDescription,
    )

    result = completion.choices[0].message.parsed
    return result

results = []
for filename in tqdm(filenames):
    result = ask_llm(filename)
    print(f"{filename} is {result}")
    results.append(result)
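
With a big folder of images, one bad file or one failed request will stop the whole loop. A possible way around that, sketched here under the assumption that you'd rather skip failures and review them later, is to catch errors per image and keep each filename next to its result. `run_all` and `fake_ask` are hypothetical names; `fake_ask` stands in for the real LLM call so the example runs without an API key.

```python
def run_all(filenames, ask):
    """Call ask(filename) for every file, collecting successes and failures separately."""
    rows, failures = [], []
    for filename in filenames:
        try:
            result = ask(filename)
        except Exception as error:
            # Record what went wrong and move on to the next image
            failures.append({"filename": filename, "error": str(error)})
            continue
        rows.append({"filename": filename, "result": result})
    return rows, failures

def fake_ask(filename):
    """Stand-in for the real LLM call: fails on any filename containing 'bad'."""
    if "bad" in filename:
        raise ValueError("could not read image")
    return f"description of {filename}"

rows, failures = run_all(["cars/a.jpg", "cars/bad.jpg", "cars/c.jpg"], fake_ask)
print(len(rows), "succeeded,", len(failures), "failed")
# → 2 succeeded, 1 failed
```

Keeping the filename inside each row also means the results stay matched to the right image even when some requests fail.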

I love to have all of my data live in pandas dataframes, but Pydantic models won't go in nicely! You need to "dump" the model to make it work.

In [None]:
results[0].model_dump()

In [None]:
import pandas as pd

# Build into dataframe
data = [result.model_dump() for result in results]
df = pd.DataFrame(data)

# Add new columns
df['filename'] = filenames
df['preview'] = df['filename'].apply(lambda filename: f'<img src="{filename}" width="100"/>')

df.head()

My favorite trick is that `preview` column. If you turn on HTML rendering you can use it to look at a little version of your images!

In [None]:
from IPython.display import HTML

HTML(df.to_html(escape=False))