In [1]:
# Clean the model response: replace Markdown bold markers and whitespace runs
# (including newlines and tabs) with a space, trim both ends, and lowercase the text.
import re

def clean_response(text):
    return re.sub(r"\*\*|\s+", " ", text).strip().lower()
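Because `**` is replaced with a space rather than removed, the cleaned captions can contain double spaces (for example after `objects:`). A minimal sketch of a variant that avoids this is shown below; `normalize_caption` is a hypothetical name, not part of the notebook.

```python
import re

def normalize_caption(text: str) -> str:
    """Hypothetical variant of clean_response that avoids double spaces."""
    text = text.replace("**", "")      # drop Markdown bold markers entirely
    text = re.sub(r"\s+", " ", text)   # collapse every whitespace run to one space
    return text.strip().lower()        # trim the ends and lowercase

sample = "**Objects:**\n\n* Vintage car\n* Wooden door\t"
print(normalize_caption(sample))
# objects: * vintage car * wooden door
```

Removing the markers first means a later whitespace collapse never sees two adjacent spaces.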


In [2]:
from ollama import chat

def generate_caption(system_prompt, path):
    """Ask gemma3 to describe the image at `path` and print the cleaned reply."""
    response = chat(
        model='gemma3',
        messages=[
            {
                'role': 'system',
                'content': system_prompt,
            },
            {
                'role': 'user',
                'content': 'What is in this image?',
                'images': [path],
            },
        ],
    )
    print(clean_response(response.message.content))
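`generate_caption` only prints its result, so captions cannot be stored or compared afterwards. The sketch below returns the text instead. `caption_image` and `fake_chat` are hypothetical names; the chat function is passed in as an argument so the example runs without a model, and `fake_chat` only imitates the `response.message.content` shape used above.

```python
from types import SimpleNamespace

def caption_image(chat_fn, system_prompt, path, model="gemma3"):
    """Return the raw caption text instead of printing it.

    chat_fn is expected to behave like ollama.chat; it is injected so the
    function can be exercised without a running model.
    """
    response = chat_fn(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What is in this image?", "images": [path]},
        ],
    )
    return response.message.content

def fake_chat(model, messages):
    """Stand-in for ollama.chat (illustrative only): echoes the image path."""
    image = messages[-1]["images"][0]
    return SimpleNamespace(message=SimpleNamespace(content=f"**Objects:** {image}"))

caption = caption_image(fake_chat, "You are an image analysis assistant.", "car.jpg")
print(caption)
# **Objects:** car.jpg
```

With the real client, the call would presumably be `caption_image(chat, sys1, img1)` after `from ollama import chat`.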


In [3]:
sys1='''
You are an image analysis assistant.

Describe the image using the following structure:
Objects: list the important objects.
Actions: what is happening.
Scene: environment or location.
Details: any additional useful visual information.

Be concise but informative. Only describe what is visible.

'''

In [4]:
img1=r'C:\Users\fareh\coding\video-rag\car.jpg'
img2=r'C:\Users\fareh\coding\video-rag\image.png'
img3=r'C:\Users\fareh\coding\video-rag\sample-2.png'
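The paths above are absolute and machine-specific, so a quick existence check before calling the model can save a confusing error later. This is a minimal sketch; `missing_images` is a hypothetical helper and the path in the demo is made up.

```python
from pathlib import Path

def missing_images(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]

# Hypothetical path used only to demonstrate the check.
print(missing_images(["no-such-dir/car.jpg"]))
# ['no-such-dir/car.jpg']
```

In the notebook this could be called as `missing_images([img1, img2, img3])`; an empty list means every file was found.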

In [5]:
generate_caption(sys1,img1)

here’s a description of the image:  objects:  * vintage light blue/turquoise volkswagen beetle car * wooden door * concrete pavement  actions:  * the car is parked on the pavement.  scene:  * urban street setting with a yellow wall and a wooden door.  details:  * the door has a decorative upper section with arched openings. * the pavement has a stone or brick pattern.


In [6]:
generate_caption(sys1,img2)

here's a description of the image:  objects:  * man with dark hair and beard * mathematical integral symbol ∫ * text “x-1”  actions:  * the man is gesturing with his hands.  scene:  * black background.  details:  * the man appears to be in a video recording.


In [7]:
generate_caption(sys1,img3)

here's a description of the image based on the provided structure:  objects:  * nodes (b11, b12, b13, b21, b22, b23, w1, w12, w13, w22, w23, w32, w33) * arrows * text boxes with numerical values (7-2, 72, 69, 81, 8-1, 92, 75, 76, 0)  actions:  * connections between nodes represent a neural network's forward propagation process. * arrows indicate the flow of data and calculations.  scene:  * a screenshot of a document titled "forward propagation" depicting a neural network diagram.  details:  * the diagram shows a multi-layer neural network with connections between nodes. * the text boxes contain numerical values likely related to the activation functions or weights of the network. * the text “# of trainable parameters” is visible in the corner of the screenshot.


In [8]:
sys2='''
You are a visual reasoning assistant.
When given an image, first describe the important objects,
then reason step by step, and finally give a concise answer.
Do not hallucinate details that are not visible.
 '''

In [9]:
generate_caption(sys2,img3)

here's a description of the image and a step-by-step reasoning to answer your question:  description of objects:  the image shows a diagram illustrating the process of forward propagation in a neural network. it includes: *  neural network layers:  two layers are clearly depicted: "layer #1" and the subsequent layer. *  weights and biases:  various weights (w₁, w₂, w₃, w’₂, w’₃) and biases (b₁, b₂, b₃) are represented with numbers. *  activation functions:  sigmoid activation functions (represented by “σ”) are indicated at several points. *  arrows:  arrows show the flow of data and calculations between layers. *  output:  the final output of the network is shown as an equation: y = 0 + *  calculations:  some calculations are shown as equations (e.g., 6+2=8, 3/1).  reasoning:  1.  forward propagation:  the diagram depicts the forward pass of a neural network. this means the input data is fed through the network, and calculations are performed at each layer. 2.  layer 1:  the first laye

In [10]:
generate_caption(sys2,img1)

here’s a description of the image and my reasoning:  objects:  *  car:  a light blue and white vintage volkswagen beetle is parked on a sidewalk. *  wall:  the car is parked in front of a yellow wall. *  doors:  there are two dark wooden doors set into the wall.  reasoning:  1. the most prominent object is a classic volkswagen beetle car. 2. the car is situated on a paved sidewalk. 3. behind the car is a wall painted yellow. 4. two doors are built into the yellow wall.  concise answer:  a light blue volkswagen beetle is parked in front of a yellow wall with two wooden doors.


In [11]:
generate_caption(sys2,img2)

here's a description of the image:  objects:  * a man is the primary subject. he is pale, and appears to be expressing concern or frustration, with his mouth open and hands raised. * the man is against a solid black background. * written on the screen in white, is the integral symbol ∫ followed by the expression x-1.  reasoning:  the image captures a moment from a video. the video title suggests it's an educational video on calculus or integration. the symbol written on the screen, ∫x-1, indicates that the video is likely explaining how to calculate the integral of the function x-1.  concise answer:  the image shows a man against a black background with the integral of x-1 written on the screen.


Without any system prompt:

In [12]:
generate_caption("",img3)

okay, let's break down what's shown in the image. it appears to be a diagram illustrating the forward propagation process in a neural network. here's a breakdown of the key elements:  overall structure:  *  two layers:  the diagram depicts a neural network with two layers. *  connections:  arrows indicate the flow of data (signals) between neurons.  layer #1:  *  input:  the first layer receives input features *x11, x12, and x13*. *  neurons:  this layer contains three neurons, each with a bias term. the bias terms are labeled *b11, b12, and b13*. *  weights:  each connection from a neuron in layer 1 to a neuron in layer 2 has a weight. the weights are labeled *w11, w12, w13, w21, w22, w23, w31, w32, and w33*. these are the parameters to be learned during training.  layer #2:  *  neurons:  this layer receives inputs from layer 1. *  activation:  the diagram shows the output of each neuron in layer 2.  mathematical operations:  the diagram also shows the following equations: * `6 + 2 = 

In [13]:
generate_caption(" ",img2)  # poor response

here's what's in the image: *  a man:  he's a young man with a somewhat distressed expression, gesturing with his hands. *  integral symbol:  the main focus is the integral symbol (∫) drawn on a black background. *  function:  underneath the integral symbol is the function x⁻¹. *  black background:  the background is a solid black color. this image is a screenshot from a video titled “this integral breaks math”.
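Note that the two runs above still send a system message, just with empty or whitespace-only content. One way to make the "no system prompt" case explicit is to leave the system message out altogether. The sketch below builds the message list that way; `build_messages` is a hypothetical helper, and whether the model responds differently to an omitted versus an empty system message is not something this notebook tests.

```python
def build_messages(system_prompt, image_path, question="What is in this image?"):
    """Build a chat message list, omitting the system message when it is blank."""
    messages = []
    if system_prompt.strip():
        # Only include a system message when there is real content to send.
        messages.append({"role": "system", "content": system_prompt.strip()})
    messages.append({"role": "user", "content": question, "images": [image_path]})
    return messages

print(len(build_messages(" ", "image.png")))                    # 1
print(build_messages("Be concise.", "image.png")[0]["role"])    # system
```

The resulting list can be passed as the `messages` argument of the chat call in place of the hard-coded one.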


In [14]:
sys3='''
You are an assistant that produces dense captions for images.
Describe:
- all visible objects
- interactions between objects
- activities
- environment
- notable visual details
Avoid speculation. Provide factual descriptions only.
'''

In [15]:
generate_caption(sys3,img1)

here is a dense description of the image:  objects:  *  vehicle:  a light teal-colored volkswagen beetle is positioned on the left side of the image. the car has a white stripe running along the top half of its body. it features round headlights, a prominent curved roof, and steel wheels with silver hubcaps. *  wall:  a tall, light yellow wall occupies the background. the wall is constructed of stone or concrete blocks. *  doors:  two wooden doors are visible in the upper right corner of the image. the doors are a dark brown color and have a rectangular shape with evenly spaced, dark wooden vertical slats. each door has a central, dark wooden handle. *  sidewalk:  a gray concrete sidewalk runs along the bottom of the frame, extending behind the car. the sidewalk is composed of square, evenly spaced tiles.  interactions:  * the car is parked directly adjacent to the sidewalk. * the doors are positioned against the wall.  activities:  * there is no discernible activity taking place in th

In [16]:
generate_caption(sys3,img2)

here is a dense caption describing the image:  overall description:  the image depicts a man in a dark blue shirt, illuminated by a black background. he is in a video recording.  objects:  *  man:  a young man with fair skin, short brown hair, and a visible beard is the primary subject. he is wearing a dark blue button-down shirt with a dark blue and grey pattern. his mouth is open and he appears to be expressing emotion. he is holding his hands up with his palms facing upward. *  mathematical expression:  a white mathematical expression is displayed on the screen. it is the integral symbol ∫ followed by the function x-1 dx.  interactions:  * the man is visually interacting with the mathematical expression by raising his hands towards it.  activity:  * the image captures a moment from a video recording, likely related to mathematics or educational content.  environment:  * the background is black, suggesting a studio or recording environment.  notable visual details:  * the man's facia

In [17]:
generate_caption(sys3,img3)

here is a dense caption describing the visible elements in the image:  overall description:  the image displays a diagram illustrating the concept of forward propagation in a neural network. it shows a layered network with multiple nodes and connections, alongside mathematical expressions representing calculations within the network.  specific elements:  *  neural network diagram:  the image depicts a multi-layered neural network. it consists of three layers, labeled "layer #1," "layer #2," and "layer #3." *  nodes:  each layer contains several nodes, represented by circles, labeled with numerical values like “7-2”, “8-1”, “72”, “92”, “69”, “81”, “75”, and “0.” *  connections:  arrows represent connections between nodes, indicating the flow of data. the connections vary in thickness, and some are indicated with equations. *  equations:  several mathematical equations are present.  * “6+2=8” is written next to the node labeled "7-2".  * “2x/5=1” is present next to the node labeled “8-1”
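The cells above run each prompt against each image by hand. A small loop can produce the same grid in one pass, which makes side-by-side comparison easier. This is a sketch under assumptions: `run_grid` and `stub_caption` are hypothetical names, and it needs a captioning function that returns its text, which the notebook's `generate_caption` does not do as written.

```python
def run_grid(prompts, images, caption_fn):
    """Call caption_fn(system_prompt, path) for every prompt/image pair."""
    results = {}
    for prompt_name, system_prompt in prompts.items():
        for image_name, path in images.items():
            results[(prompt_name, image_name)] = caption_fn(system_prompt, path)
    return results

def stub_caption(system_prompt, path):
    """Stand-in for a real captioning call so the sketch runs without a model."""
    return f"{len(system_prompt)} chars of prompt for {path}"

results = run_grid(
    {"sys1": "structured", "none": ""},
    {"img1": "car.jpg"},
    stub_caption,
)
for key, caption in results.items():
    print(key, "->", caption)
```

Keying the results by `(prompt_name, image_name)` keeps each caption tied to the exact prompt and image that produced it.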