# VLM Agent Evaluation: Guided Run

This notebook drives the evaluation pipelines defined in the `scripts/` directory. It compares three approaches to Visual Question Answering (VQA):

1.  **Zero-Shot:** Directly asking the VLM the question.
2.  **Classic Agent:** Enhancing the VLM's prompt with information extracted via OpenCV.
3.  **DL Agent:** Using the VLM itself to perform Chain-of-Thought (CoT) reasoning.

<br/>

---

### 🔧 1. Setup

First, let's install all the required dependencies from the `requirements.txt` file.

In [None]:
!pip install -r ../requirements.txt

### 📥 2. Data Check

Before we run the evaluations, let's make sure our data is loaded correctly. The following command will run the `evaluate_agents.py` script in `show_sample` mode, which loads the dataset and displays a sample image, question, and answer.

> 💡 **Note:** Make sure you have placed `vqa_dataset.csv` in the `data/` folder and all your `.png` images in the `data/images/` folder.
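
If you want to confirm the layout from the note above before launching the script, a minimal sketch like the following may help. The helper name `check_data_layout` is hypothetical and not part of the repository; it only assumes the file and folder names stated in the note.

```python
from pathlib import Path


def check_data_layout(data_dir):
    """Report whether the expected dataset files exist under data_dir.

    Hypothetical helper: it only checks for `vqa_dataset.csv` and counts
    `.png` files in `images/`, as described in the note above.
    """
    data_dir = Path(data_dir)
    csv_path = data_dir / "vqa_dataset.csv"
    images_dir = data_dir / "images"
    # Count PNG files only if the images folder actually exists.
    png_count = len(list(images_dir.glob("*.png"))) if images_dir.is_dir() else 0
    return {"csv_found": csv_path.is_file(), "png_count": png_count}
```

Calling `check_data_layout("../data")` from this notebook returns a small dictionary; a `csv_found` of `False` or a `png_count` of `0` suggests the files are not where the scripts expect them.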

In [None]:
!python ../scripts/evaluate_agents.py --mode show_sample --sample_index 0

### 🚀 3. Run Evaluation: Zero-Shot (Baseline)

This is our baseline. The script will loop through all 100 samples, ask the VLM the question directly, and use the LLM-as-Judge to score the answer. This will give us the performance of the raw model.

All logs will be saved to `logs/evaluation.log`.
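
The baseline loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `evaluate_agents.py`: `ask_vlm` and `judge` are hypothetical callables standing in for the VLM call and the LLM-as-Judge, and the sample tuple layout is assumed.

```python
def evaluate_zero_shot(samples, ask_vlm, judge):
    """Return the fraction of samples the judge marks as correct.

    samples: iterable of (image_path, question, reference_answer) tuples
             (assumed layout).
    ask_vlm: hypothetical callable(image_path, question) -> str.
    judge:   hypothetical callable(question, reference, prediction) -> bool.
    """
    correct = 0
    total = 0
    for image_path, question, reference in samples:
        # Zero-shot: the question goes to the model with no extra context.
        prediction = ask_vlm(image_path, question)
        if judge(question, reference, prediction):
            correct += 1
        total += 1
    # Avoid dividing by zero on an empty dataset.
    return correct / total if total else 0.0
```

Keeping the model call and the judge as separate callables makes it possible to reuse the same loop for the other agents by swapping only `ask_vlm`.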

In [None]:
!python ../scripts/evaluate_agents.py --mode zero_shot

### 🤖 4. Run Evaluation: Classic Agent (CV-Enhanced)

Now, we run the "Classic" agent. For each image, this script first runs an OpenCV pipeline (`src/agent_pipelines/classic_agent.py`) to detect all objects, their colors, shapes, and coordinates. This structured data is then prepended to the VLM's prompt to provide it with explicit context.

We expect this to improve performance, as the VLM no longer has to *find* the objects; it only has to *reason* about the data provided.
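
To make the "prepend structured data" step concrete, here is a minimal sketch of how detections might be turned into prompt context. The detection dictionary keys (`color`, `shape`, `center`) and both function names are assumptions for illustration; the actual format produced by `classic_agent.py` may differ, and the OpenCV detection itself is not shown.

```python
def format_detections(detections):
    """Render a list of detection dicts as numbered text lines.

    Each dict is assumed to have 'color', 'shape', and 'center' (x, y) keys.
    """
    lines = []
    for i, det in enumerate(detections, start=1):
        x, y = det["center"]
        lines.append(f"{i}. {det['color']} {det['shape']} at (x={x}, y={y})")
    return "\n".join(lines)


def build_classic_prompt(detections, question):
    """Prepend the detected-object summary to the question."""
    context = format_detections(detections) or "(no objects detected)"
    return f"Detected objects:\n{context}\n\nQuestion: {question}"
```

For example, `build_classic_prompt([{"color": "red", "shape": "circle", "center": (40, 60)}], "How many circles?")` yields a prompt whose first lines list the object and whose last line is the question.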

In [None]:
!python ../scripts/evaluate_agents.py --mode classic

### 🧠 5. Run Evaluation: DL Agent (Chain-of-Thought)

Finally, we run the "DL" agent. This agent uses the VLM's own reasoning capabilities in a 3-step pipeline (`src/agent_pipelines/dl_agent.py`):

1.  **Decompose:** Ask the VLM to create a *plan* to answer the question.
2.  **Extract:** Ask the VLM to *describe* the image in detail.
3.  **Synthesize:** Ask the VLM to *answer* the original question using only the plan and description.

This Chain-of-Thought (CoT) approach prompts the model to "think step-by-step," which can lead to more accurate and robust answers on complex reasoning tasks.
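
The three steps above can be sketched as a single function. This is a hedged illustration rather than the contents of `dl_agent.py`: `vlm` is a hypothetical callable with the signature `vlm(prompt, image=None) -> str`, the prompt wording is invented, and sending the planning step without the image is an assumption.

```python
def run_dl_agent(vlm, image, question):
    """Run a Decompose -> Extract -> Synthesize pipeline with one model.

    vlm: hypothetical callable(prompt, image=None) -> str.
    """
    # Step 1 (Decompose): ask for a plan. Assumption: no image is sent here.
    plan = vlm(
        f"Write a short step-by-step plan for answering this question: {question}"
    )
    # Step 2 (Extract): ask for a detailed description of the image.
    description = vlm("Describe the image in detail.", image=image)
    # Step 3 (Synthesize): answer from the plan and description only,
    # so the image is deliberately not passed again.
    answer = vlm(
        "Using only the plan and description below, answer the question.\n\n"
        f"Plan:\n{plan}\n\n"
        f"Description:\n{description}\n\n"
        f"Question: {question}"
    )
    return {"plan": plan, "description": description, "answer": answer}
```

Returning the intermediate plan and description alongside the answer makes it easier to inspect where a wrong answer came from when reading the logs.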

In [None]:
!python ../scripts/evaluate_agents.py --mode dl

### 📈 6. Final Analysis

You can now compare the accuracy scores printed at the end of each run. For a full breakdown and analysis, please see the **'How It Works'** and **'Results & Analysis'** sections in the `README.md` file.