## Text input

https://platform.openai.com/docs/models

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
from langchain.agents import create_agent
from langchain_nvidia_ai_endpoints import ChatNVIDIA

model = ChatNVIDIA(model="meta/llama-3.2-11b-vision-instruct")

agent = create_agent(
    model=model,
    system_prompt="You are a science fiction writer. Create a capital city at the user's request.",
)

In [3]:
from langchain.messages import HumanMessage

question = HumanMessage(content=[
    {"type": "text", "text": "What is the capital of The Moon?"}
])

response = agent.invoke(
    {"messages": [question]}
)

print(response['messages'][-1].content)

What a fascinating request! As a science fiction writer, I'd be delighted to create a capital city on the Moon.

Welcome to Luminaria, the shining capital of the Moon!

Located in the vast, cratered expanse of the Moon's surface, Luminaria is a marvel of engineering and innovation. This futuristic city is nestled within the rim of a massive, ancient crater, its walls and structures glowing with a soft, ethereal light that seems almost otherworldly.

**Geography and Climate:**
Luminaria is situated in the lunar equatorial region, where the climate is relatively stable and temperate. The city's unique architecture allows it to harness and redirect the Moon's solar energy, providing a sustainable and reliable source of power. The atmosphere is thin, but the city's advanced life support systems maintain a comfortable pressure and oxygen level, making it habitable for humans and other species.

**Districts and Landmarks:**

1. **The Spire of Luminaria**: A towering, crystalline structure th
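The cell above prints `.content` directly, which works when the model returns a plain string. Some chat models return a list of content blocks instead. Below is a minimal sketch of a helper that handles both shapes. It assumes blocks are dicts with a `"type"` key and that text blocks carry a `"text"` key. `content_to_text` is a hypothetical name, not part of LangChain.

```python
from typing import Any


def content_to_text(content: Any) -> str:
    """Flatten message content (a str or a list of blocks) into plain text."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
        # Non-text blocks (images, audio, ...) are skipped
    return "".join(parts)
```

With this helper, the last line of the cell could be written as `print(content_to_text(response['messages'][-1].content))`.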

## Image input

In [4]:
# List the vision-language models (VLMs) available on the NVIDIA endpoint
vlm_models = [m for m in ChatNVIDIA.get_available_models() if m.model_type == 'vlm']
for vlm in vlm_models:
    print(vlm)

id='nvidia/nvclip' model_type='vlm' client='ChatNVIDIA' endpoint=None aliases=None supports_tools=False supports_structured_output=False supports_thinking=False thinking_prefix=None no_thinking_prefix=None thinking_param_enable=None thinking_param_disable=None base_model=None
id='meta/llama-3.2-11b-vision-instruct' model_type='vlm' client='ChatNVIDIA' endpoint='https://ai.api.nvidia.com/v1/gr/meta/llama-3.2-11b-vision-instruct/chat/completions' aliases=None supports_tools=False supports_structured_output=False supports_thinking=False thinking_prefix=None no_thinking_prefix=None thinking_param_enable=None thinking_param_disable=None base_model=None
id='google/gemma-3n-e4b-it' model_type='vlm' client='ChatNVIDIA' endpoint=None aliases=None supports_tools=False supports_structured_output=False supports_thinking=False thinking_prefix=None no_thinking_prefix=None thinking_param_enable=None thinking_param_disable=None base_model=None
id='mistralai/mistral-small-3.1-24b-instruct-2503' model_t

In [5]:
from ipywidgets import FileUpload
from IPython.display import display

uploader = FileUpload(accept='.png', multiple=False)
display(uploader)

FileUpload(value=(), accept='.png', description='Upload')

In [6]:
print(uploader.value)

({'name': 'Screenshot 2025-10-07 212432.png', 'type': 'image/png', 'size': 246811, 'content': <memory at 0x000001697F289D80>, 'last_modified': datetime.datetime(2025, 10, 8, 1, 24, 32, 850000, tzinfo=datetime.timezone.utc)},)


In [8]:
import base64
from PIL import Image
import io

# Get the first (and only) uploaded file dict
uploaded_file = uploader.value[0]

# This is a memoryview
content_mv = uploaded_file["content"]

# Convert memoryview -> bytes
img_bytes = bytes(content_mv)  # or content_mv.tobytes()

# compress the image
with Image.open(io.BytesIO(img_bytes)) as img:
    img = img.convert("RGB")
    img.thumbnail((1024, 1024))  # optional resize for smaller payload
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=70, optimize=True)
    img_bytes = buf.getvalue()

# Now base64 encode
img_b64 = base64.b64encode(img_bytes).decode("utf-8")
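The bytes produced above are JPEG, because the image is re-saved with `format="JPEG"`, so any data URL built from them should declare `image/jpeg` rather than the upload's original type. Below is a small sketch that derives the MIME type from the leading bytes instead of hard-coding it. The helper names `sniff_image_mime` and `to_data_url` are hypothetical, and only PNG and JPEG are recognized.

```python
import base64

# Leading "magic" bytes for the two formats used in this notebook
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}


def sniff_image_mime(data: bytes) -> str:
    """Return the MIME type implied by the file signature."""
    for signature, mime in _SIGNATURES.items():
        if data.startswith(signature):
            return mime
    raise ValueError("Unrecognized image format")


def to_data_url(data: bytes) -> str:
    """Build a base64 data URL whose MIME type matches the bytes."""
    mime = sniff_image_mime(data)
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

Calling `to_data_url(img_bytes)` on the compressed bytes from the cell above would yield a URL starting with `data:image/jpeg;base64,`.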

In [None]:
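# The image was re-encoded as JPEG above, so the MIME type must be image/jpeg
# (no space is allowed after the semicolon in a data URL)
image_data_url = f"data:image/jpeg;base64,{img_b64}"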

multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Tell me about this image."},
    {"type": "image_url", "image_url": {"url": image_data_url}}
])

response = agent.invoke(
    {"messages": [multimodal_question]}
)

print(response['messages'][-1].content)

The image depicts a cartoon-style illustration of a yellow caravan with two people inside, set against a blue sky with white clouds. The caravan is positioned on a hill, with a tree to its left and a house in the background. The title "CARAVAN" is prominently displayed in large white letters, with "SAND Witch" written in smaller text underneath.

**Key Features:**

* **Caravan:** Yellow in color, with a large tire on the back and a smaller one on the front.
* **People:** Two individuals are visible inside the caravan, one sitting on top and the other standing at the side.
* **Background:** A blue sky with white clouds, a tree to the left of the caravan, and a house in the distance.
* **Title:** "CARAVAN" in large white letters, with "SAND Witch" written in smaller text underneath.
* **Border:** A gray border surrounds the image, with a red box containing Chinese characters in the top-left corner.

**Overall Impression:**

The image appears to be a promotional graphic for a science fict

## Audio input

In [11]:
# 1. Fetch all available models from the NVIDIA endpoint
all_models = ChatNVIDIA.get_available_models()

# 2. Filter for likely audio models based on ID and type conventions
#    (guard against a missing model_type before calling .lower())
audio_models = [
    model.id for model in all_models
    if "asr" in (model.model_type or "").lower()
    or "audio" in (model.model_type or "").lower()
    or "parakeet" in model.id.lower()
]

print("Models likely to support audio inputs:")
for model_id in audio_models:
    print(f"- {model_id}")

Models likely to support audio inputs:
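The filter above returned nothing, so it can help to check which `model_type` values the endpoint actually reports before guessing at substrings. Below is a minimal sketch, assuming each model object exposes a `model_type` attribute as in the listing printed earlier. `count_model_types` is a hypothetical helper name.

```python
from collections import Counter
from typing import Any, Iterable


def count_model_types(models: Iterable[Any]) -> Counter:
    """Tally models by their model_type, treating a missing type as 'unknown'."""
    return Counter(getattr(m, "model_type", None) or "unknown" for m in models)
```

For example, `count_model_types(ChatNVIDIA.get_available_models())` would show how many models fall under each type, which makes it clear whether any audio-capable type is listed at all.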


In [None]:
import sounddevice as sd
from scipy.io.wavfile import write
import base64
import io
import time
from tqdm import tqdm

# Recording settings
duration = 5  # seconds
sample_rate = 44100

print("Recording...")
audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
# Progress bar for the duration
for _ in tqdm(range(duration * 10)):   # update 10× per second
    time.sleep(0.1)
sd.wait()
print("Done.")

# Write WAV to an in-memory buffer
buf = io.BytesIO()
write(buf, sample_rate, audio)
wav_bytes = buf.getvalue()

aud_b64 = base64.b64encode(wav_bytes).decode("utf-8")
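When no microphone is available, the rest of the pipeline can still be exercised with a synthetic clip. Below is a sketch that builds a short 16-bit PCM sine tone entirely in memory using only the standard library. The function name `make_tone_wav` and its default values are illustrative assumptions, not part of the original notebook.

```python
import base64
import io
import math
import struct
import wave


def make_tone_wav(freq_hz: float = 440.0, seconds: float = 1.0, rate: int = 16000) -> bytes:
    """Return a mono 16-bit PCM WAV file containing a sine tone."""
    n_frames = int(seconds * rate)
    # Half-amplitude sine wave scaled to the signed 16-bit range
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n_frames)
    )
    pcm = b"".join(struct.pack("<h", s) for s in samples)

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# Base64-encode the tone the same way as the recorded audio
tone_b64 = base64.b64encode(make_tone_wav()).decode("utf-8")
```

`tone_b64` can stand in for `aud_b64` when testing the audio message format without recording anything.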

In [None]:
agent = create_agent(
    model='gpt-4o-audio-preview',
)

multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Tell me about this audio file"},
    {"type": "audio", "base64": aud_b64, "mime_type": "audio/wav"}
])

response = agent.invoke(
    {"messages": [multimodal_question]}
)

print(response['messages'][-1].content)