In [35]:
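# Clean the response in a single regex pass and convert it to lowercase.
import re

def clean_response(text):
    # Every match (a whitespace run or a Markdown bold marker **) is replaced
    # with one space, so doubled spaces can remain in the result.
    return re.sub(r"^\s+|\s+$|\*\*|[\r\n\t]+|\s+", " ", text).lower()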
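
A quick check of what the cleaning step does to a Markdown-style reply. The sample string `raw` and the `tidy` helper are made up for illustration and are not part of the notebook. It shows why the printed captions contain double spaces and a trailing space, and one possible way to collapse them.

```python
import re

def clean_response(text):
    # Same pattern as the notebook cell: every match becomes one space.
    return re.sub(r"^\s+|\s+$|\*\*|[\r\n\t]+|\s+", " ", text).lower()

def tidy(text):
    # Hypothetical extra step: collapse repeated spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

# Made-up reply in the style the model tends to produce.
raw = "Here's the image:\n\n**Objects:**\n\n* Car\n"

cleaned = clean_response(raw)
print(repr(cleaned))        # "here's the image:  objects:  * car "
print(repr(tidy(cleaned)))  # "here's the image: objects: * car"
```

The double spaces come from `**` and the neighbouring newlines each being replaced by their own space.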


In [36]:
from ollama import chat

def generate_caption(system_prompt, path):
    # Send the system prompt and the image to the gemma3 model through Ollama.
    response = chat(
        model='gemma3',
        messages=[
            {
                'role': 'system',
                'content': system_prompt,
            },
            {
                'role': 'user',
                'content': 'What is in this image? .',
                'images': [path],
            },
        ],
    )
    # Print the cleaned, lowercased caption.
    print(clean_response(response.message.content))
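
Since `generate_caption` only prints, the captions cannot be stored or compared later. Below is a minimal sketch of a variant that returns the caption instead. The names `caption_image`, `chat_fn`, and `fake_chat` are hypothetical; `fake_chat` is only a stand-in so the sketch runs without a model server. In the notebook you would pass `ollama.chat` as `chat_fn`.

```python
import re
from types import SimpleNamespace

def clean_response(text):
    # Same cleaning step as in the notebook.
    return re.sub(r"^\s+|\s+$|\*\*|[\r\n\t]+|\s+", " ", text).lower()

def caption_image(system_prompt, path, chat_fn, model="gemma3"):
    """Hypothetical variant of generate_caption that returns the caption."""
    response = chat_fn(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": "What is in this image? .",
                "images": [path],
            },
        ],
    )
    return clean_response(response.message.content)

def fake_chat(model, messages):
    # Stand-in for ollama.chat: returns an object with .message.content.
    return SimpleNamespace(message=SimpleNamespace(content="A red car"))

caption = caption_image("You are an image analysis assistant.", "car.jpg", fake_chat)
print(caption)
```

With the real client this would be called as `caption_image(sys1, img1, chat)`, and the returned strings could be collected in a dict keyed by prompt and image.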


In [37]:
sys1='''
You are an image analysis assistant.

Describe the image using the following structure:
Objects: list the important objects.
Actions: what is happening.
Scene: environment or location.
Details: any additional useful visual information.

Be concise but informative. Only describe what is visible.

'''

In [38]:
img1=r'C:\Users\fareh\coding\video-rag\car.jpg'
img2=r'C:\Users\fareh\coding\video-rag\image.png'
img3=r'C:\Users\fareh\coding\video-rag\sample-2.png'

In [52]:
generate_caption(sys1,img1)

here's an analysis of the image:  objects:  * vintage volkswagen beetle car * wooden door * pavement/sidewalk * wall  actions:  * the car is parked on the pavement.  scene:  * urban street setting.  details:  * the car is a light turquoise color with white stripes. * the door is dark brown and has a simple design. * the wall is a faded yellow color.
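
Because `sys1` asks for a fixed structure, the cleaned caption can be split back into its sections. This is a rough sketch, assuming the model keeps the four labels from the prompt and uses `*` for bullets. `split_sections` and `SECTION_NAMES` are names made up here, and the sample string is a shortened copy of the output above.

```python
import re

# Section labels requested by the sys1 prompt, lowercased by clean_response.
SECTION_NAMES = ("objects", "actions", "scene", "details")

def split_sections(caption, names=SECTION_NAMES):
    """Split a cleaned caption into {section name: list of items}."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r"):")
    # With a capturing group, split() yields
    # [preamble, name1, body1, name2, body2, ...].
    parts = pattern.split(caption)
    sections = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        # Bullets were flattened onto one line, so split on the '*' marker.
        sections[name] = [item.strip() for item in body.split("*") if item.strip()]
    return sections

sample = (
    "here's an analysis of the image:  objects:  * vintage volkswagen beetle car "
    "* wooden door  actions:  * the car is parked on the pavement.  "
    "scene:  * urban street setting.  details:  * the wall is a faded yellow color."
)

for name, items in split_sections(sample).items():
    print(name, "->", items)
```

The split is naive: a label word followed by a colon inside a description would start a new section, and a repeated label overwrites the earlier one.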


In [40]:
generate_caption(sys1,img2)

here's a description of the image:  objects:  man, integral symbol (∫), equation x-1  actions:  man is gesturing with his hands.  scene:  dark background with a black screen.  details:  the man appears to be distressed, with his mouth open and hands raised. the integral symbol and the equation x-1 are displayed in white.


In [41]:
generate_caption(sys1,img3)

here’s a description of the image:  objects:  * neural network diagram (multiple layers) * nodes/neurons (labeled b11, b12, b13, b21, b22, b23, etc.) * weights (labeled w1, w12, w13, w21, w22, w23, etc.) * arrows indicating connections  actions:  * the diagram illustrates a forward propagation process in a neural network.  scene:  * a computer screen displaying a diagram of a neural network.  details:  * the diagram shows a multi-layered neural network. * equations are present, including 6+2=8 and 2x1=3. * the text “# of trainable parameters” is visible in the top right corner. * the diagram is labeled “forward propagation”.


In [42]:
sys2='''
You are a visual reasoning assistant.
When given an image, first describe the important objects,
then reason step by step, and finally give a concise answer.
Do not hallucinate details that are not visible.
 '''

In [43]:
generate_caption(sys2,img3)

here's a breakdown of the image:  objects:  * the image displays a diagram illustrating the concept of forward propagation in a neural network. * it shows a multi-layered neural network with connections between layers. * the diagram includes values and equations associated with each layer. * it has labels such as “layer #1”, “b11”, “b12”, “b13”, “w1”, “w2”, “w3”, “x11”, “x12”, “x13”.  reasoning:  1.  network structure:  the diagram represents a neural network with three layers. the first layer (layer #1) receives inputs x11, x12, and x13. 2.  linear transformation:  each layer performs a linear transformation. for example, layer #1 has weights w1, w2, and w3. the outputs of these linear transformations (b11, b12, b13) are then summed. 3.  activation function:  an activation function (σ) is applied to the summed output. the equation σ = y = z*e^2 + 1 indicates that z is the sum of the linear transformation outputs, and e^2 is a constant (2 in this case). 4.  propagation:  the activation

In [44]:
generate_caption(sys2,img1)

here's a breakdown of what’s in the image:  description of objects:  *  car:  a light teal-colored volkswagen beetle is the most prominent object. it’s parked on the street. *  wall:  behind the car is a yellow-colored wall. *  doors:  there are two wooden doors visible on the wall. they have arched tops. *  sidewalk:  the car is parked on a gray sidewalk.  reasoning:  1. the primary focus of the image is the classic volkswagen beetle. 2. the wall provides the backdrop for the car. 3. the doors are a secondary element, providing detail to the background.  concise answer:  a light teal volkswagen beetle is parked in front of a yellow wall with wooden doors.


In [45]:
generate_caption(sys2,img2)

here's a breakdown of the image: *  objects:  the image shows a man with fair skin and brown hair, who is wearing a gray shirt. he's expressing a worried expression and gesturing with his hands. the most prominent feature is the integral symbol (∫) along with the expression x⁻¹ written above it. *  reasoning:  the image is a screenshot from a video titled "this integral breaks math.” the man is explaining or demonstrating the solution to a definite integral. *  concise answer:  the image shows the integral symbol and the expression x⁻¹ which is a part of a math video.


Without any system prompt

In [46]:
generate_caption("",img3)

okay, let's break down what's in the image. this appears to be a visualization of a forward pass in a neural network, specifically illustrating backpropagation.  overall structure:  *  neural network diagram:  the image shows a network with two layers labeled "layer #1" and a final output layer. *  forward pass:  the arrows indicate the flow of data through the network, starting with the input features. *  backpropagation:  the network calculates the loss (represented by "26") and then propagates this loss backward through the layers to adjust the weights.  breakdown of the numbers and calculations:  *  inputs:   * `x11`, `x12`, `x13`: these represent the input features to layer #1. *  layer #1 calculations:   * `b11`, `b12`, `b13`: these are the biases for the neurons in layer #1.  * `w11`, `w12`, `w13`: these are the weights connecting the inputs to the neurons in layer #1.  * `2+2 = 4`: this indicates a basic addition operation.  * `3 + 1 = 4`: another addition operation. *  layer 2

In [47]:
generate_caption(" ", img2)  # poor response

the image shows a video frame with a man pointing at the integral symbol ∫ with the function x-1 written underneath it. it appears to be a math tutorial.


In [48]:
sys3='''
You are an assistant that produces dense captions for images.
Describe:
- all visible objects
- interactions between objects
- activities
- environment
- notable visual details
Avoid speculation. Provide factual descriptions only.
'''

In [49]:
generate_caption(sys3,img1)

here's a dense description of the image, focusing on factual details:  objects:  *  vehicle:  a light teal-colored volkswagen beetle is positioned to the left of the image, facing to the right. the vehicle has a white stripe running along its roofline and a light gray wheel set. the body of the car displays some surface wear and minor imperfections. *  wall:  a solid, yellow-colored wall dominates the background. the wall appears to be constructed of stucco or a similar material with a slightly rough texture. *  doors:  two wooden doors are visible in the right side of the image. the doors have a dark brown finish and a simple, rectangular design with evenly spaced vertical grooves. they are closed. *  sidewalk:  a gray concrete sidewalk extends horizontally across the lower portion of the image. the sidewalk is marked with square tiles and appears slightly worn.  interactions:  * the car is parked on the sidewalk, adjacent to the wall and doors.  activity/environment:  * the scene app

In [50]:
generate_caption(sys3,img2)

here's a detailed description of the image:  objects:  *  man:  a young man with fair skin, brown hair, and a visible frown is the primary subject. he is wearing a dark blue shirt with a visible collar. his mouth is open in an expression of distress or frustration. both of his hands are raised upwards. *  mathematical expression:  a white mathematical expression is drawn on the screen. it shows the integral symbol (∫) followed by the expression "x-1" beneath it. *  background:  the background is a solid black color.  interactions:  * the man is raising both of his hands upwards.  activities:  * the man is expressing a reaction, possibly in response to the mathematical expression.  environment:  * the image is a screenshot from a digital display.  notable visual details:  * the man's facial expression (frown) is prominent. * the white math expression contrasts sharply with the black background.


In [51]:
generate_caption(sys3,img3)

here's a dense description of the image:  overall scene:  the image depicts a diagram illustrating the process of forward propagation in a neural network, specifically focusing on a multi-layer perceptron (mlp). the diagram is presented within a digital document with the title “forward propagation” and a date of “03 march 2022”.  diagram components:  *  network structure:  the diagram shows a layered neural network with three layers labeled “layer #1”, “layer #2” and “layer #3”. *  nodes (neurons):  each layer contains multiple nodes, represented as circles. the nodes are labeled with variables such as x11, x12, x13, x21, x22, x23, and b11, b12, b13, b21, b22, b23, b31, b32, and b33. *  connections (weights):  lines connect the nodes, representing weighted connections. these connections are labeled with variables such as w11, w12, w13, w21, w22, w23, w31, w32, and w33. *  mathematical operations:  the diagram includes several equations and calculations within the nodes.  * in the first