# Appendix A: Complete Prompt Templates

## Experimental Configuration (All Techniques)

- **Model:** GPT-4o (via OpenAI API, model identifier: `gpt-4o`)
- **Temperature:** 0.1 (except Self-Consistency: 0.3, 0.5, 0.7, 0.9, 1.0)
- **Max Tokens:** 4096
- **Chunk Size:** 10 frames per API request
- **Image Format:** Base64-encoded PNG, detail level: "high"
- **Frame Processing:** All frames extracted (frame_interval = 1)

---

## A.1 Zero-Shot Prompting

### Prompt
```
What's happening in these frames? Describe the scene and any notable actions or events.
```

### Implementation Notes

- No system message provided
- No formatting instructions
- No examples or guidance beyond the basic query
- Applied to each 10-frame chunk independently

---

## A.2 Sequential Prompting

### Approach

Five progressive prompts where each explicitly builds on the previous response.

### Prompt Sequence

**Prompt 1:**
```
What's happening in these frames? Describe the scene at a high level.
```

**Prompt 2:**
```
Based on what you observed in your previous response, who are the main people in the scene and what are they doing?
```

**Prompt 3:**
```
Looking at the actions you described, do you observe any potential criminal activities? If so, describe them in detail.
```

**Prompt 4:**
```
Based on your crime analysis, what objects or items are involved in the incident?
```

**Prompt 5:**
```
Considering all your observations, create a chronological timeline of events shown in these frames.
```

### Implementation Notes

- Each prompt includes the full conversation history (all previous Q&A pairs)
- Responses build cumulatively
- Applied to each 10-frame chunk before moving to next chunk
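
The history accumulation described above could look roughly like the following sketch. The `ask` callable is a hypothetical stand-in for one chat-completion call (it is not part of the original implementation), and the image attachments that accompany each chunk are omitted for brevity.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_sequence(prompts: List[str], ask: Callable[[List[Message]], str]) -> List[str]:
    """Send prompts in order, carrying every earlier Q&A pair in the history."""
    history: List[Message] = []
    responses: List[str] = []
    for prompt in prompts:
        history.append({"role": "user", "content": prompt})
        reply = ask(history)  # hypothetical wrapper around one API call
        history.append({"role": "assistant", "content": reply})
        responses.append(reply)
    return responses


def stub_ask(messages: List[Message]) -> str:
    """Stand-in for the model: reports how much history it was given."""
    return f"reply after {len(messages)} messages"


print(run_sequence(["prompt 1", "prompt 2", "prompt 3"], stub_ask))
```

With the stub, the second and third calls see 3 and 5 messages, which shows the context growing by one Q&A pair per step.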

---

## A.3 Least-to-Most Prompting

### Approach

Eight prompts progressing from simple observations to complex analysis.

### Prompt Sequence

**Prompt 1:**
```
List all visible objects in these frames. Just identify what you can see - furniture, tools, vehicles, etc.
```

**Prompt 2:**
```
Identify the people visible in these frames. How many are there? Describe each person's basic appearance (clothing, general features).
```

**Prompt 3:**
```
Describe the location and setting shown in these frames. What kind of place is this? Describe the spatial layout.
```

**Prompt 4:**
```
What actions are the people performing in these frames? List simple actions you can observe.
```

**Prompt 5:**
```
How are the people interacting with each other and with objects in the scene? Describe specific interactions.
```

**Prompt 6:**
```
Do you notice any unusual, concerning, or potentially suspicious behaviors in these frames? If so, what specifically seems unusual?
```

**Prompt 7:**
```
Based on your previous observations, analyze whether any potential criminal activities might be occurring. What specific elements suggest criminal behavior?
```

**Prompt 8:**
```
Using all your previous observations, construct a detailed chronological timeline of events shown in these frames. Include who did what, when, and potential motives.
```

### Implementation Notes

- All previous responses included in context for each subsequent prompt
- Designed to decompose complex analysis into simpler subtasks
- Each step builds foundation for the next level of complexity

---

## A.4 ReAct (Reasoning and Acting)

### Prompt Template
```
Analyze these frames using the ReAct approach (Reasoning and Acting). For each important element you observe:

1. Thought: Reason about what you're seeing and what it might mean
2. Action: Describe what specific aspect you'll focus on analyzing next
3. Observation: Make detailed observations about that aspect
4. Decision: Draw a conclusion based on your observations

Specifically, follow this cycle for:
- People and their appearances
- Actions and behaviors
- Objects and items
- Spatial relationships
- Temporal sequence of events
- Potential criminal activity

After going through these cycles, provide your final analysis of what crime appears to be occurring, who is involved, and what evidence supports this conclusion.

You are now analyzing frames {frame_range} of {total_frames}.
```

### Dynamic Variables

- `{frame_range}`: e.g., "1-10", "11-20", "21-30"
- `{total_frames}`: Total number of frames in video
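
As an illustration of how these variables might be filled in, the sketch below derives 1-based range labels for 10-frame chunks and substitutes them into the final line of the template. The helper name `frame_ranges` and the handling of a shorter final chunk are assumptions, not details taken from the original implementation.

```python
from typing import List


def frame_ranges(total_frames: int, chunk_size: int = 10) -> List[str]:
    """Return 1-based, inclusive range labels such as "1-10" and "11-20"."""
    return [
        f"{start}-{min(start + chunk_size - 1, total_frames)}"
        for start in range(1, total_frames + 1, chunk_size)
    ]


# Final line of the ReAct template, with its two placeholders.
TEMPLATE_TAIL = "You are now analyzing frames {frame_range} of {total_frames}."

labels = frame_ranges(25)
print(labels)  # ['1-10', '11-20', '21-25']
print(TEMPLATE_TAIL.format(frame_range=labels[0], total_frames=25))
```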

### Implementation Notes

- ReAct implemented as **structured reasoning framework** (not tool use)
- No external tools called
- Model structures its own reasoning through the Thought-Action-Observation-Decision cycle
- Applied to each 10-frame chunk

---

## A.5 True Iterative Refinement

### Core Prompt (repeated iteratively)
```
Analyze these video frames for criminal activity. What crime is occurring, who is involved, what evidence supports your conclusion, and how confident are you in this assessment?
```

### Refinement Process

- The same prompt is repeated for up to 8 iterations
- Each iteration receives the previous response as additional context
- **Convergence Detection:**
  - Calculates Jaccard similarity between consecutive responses
  - Stops when similarity ≥ 0.7 (70% threshold)
- **Alternative Stop Conditions:**
  - Confidence score plateaus
  - Maximum 8 iterations reached
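
A minimal sketch of the convergence check follows. The tokenization (lowercased, whitespace-separated word sets) is an assumption, since the text does not specify how responses were tokenized; `ask` is a hypothetical callable standing in for one model call, and the confidence-plateau condition is left out.

```python
from typing import Callable, List


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index of the two responses' word sets (assumed tokenization)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty responses are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def refine(ask: Callable[[List[str]], str],
           max_iterations: int = 8,
           threshold: float = 0.7) -> List[str]:
    """Repeat the analysis until consecutive responses are similar enough."""
    responses: List[str] = []
    for _ in range(max_iterations):
        reply = ask(responses)  # hypothetical call; receives earlier responses
        converged = bool(responses) and jaccard_similarity(responses[-1], reply) >= threshold
        responses.append(reply)
        if converged:
            break
    return responses


# Scripted replies stand in for model output in this example.
scripted = iter(["a b c d", "e f g h", "e f g h i", "never reached"])
history = refine(lambda previous: next(scripted))
print(len(history))  # 3: the third reply shares 4 of 5 words with the second
```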

### Implementation Notes

- Self-refinement approach: model improves its own analysis
- No prompt modification between iterations
- Context window grows with each iteration (includes all previous responses)

---

## A.6 Self-Consistency

### Comprehensive Analysis Prompt (used 5 times independently)
```
Analyze these video frames thoroughly and provide a comprehensive assessment. Please address the following aspects:

1. SCENE DESCRIPTION: Describe the environment, setting, and overall context.

2. PEOPLE IDENTIFICATION: Identify all individuals present, their apparent ages, genders, and relationships.

3. ACTIONS AND BEHAVIORS: Detail what each person is doing and their specific behaviors.

4. INTERACTIONS: Analyze how the people are interacting with each other.

5. BODY LANGUAGE: Examine non-verbal cues, postures, and emotional states.

6. OBJECTS AND ITEMS: Identify and describe any significant objects or items visible in the frames.

7. SPATIAL RELATIONSHIPS: Describe how people and objects are positioned relative to each other.

8. TEMPORAL SEQUENCE: Construct a timeline of events shown across the frames.

9. ANOMALIES OR CONCERNING ELEMENTS: Note anything unusual, suspicious, or concerning.

10. POTENTIAL CRIMINAL ACTIVITY: Based on all the above, assess whether any criminal activity appears to be occurring and explain your reasoning.

Provide a detailed, objective analysis based on what you can actually observe in the frames.
```

### Process

1. Generate 5 independent analyses using **different temperature settings**:
   - Run 1: temperature = 0.3
   - Run 2: temperature = 0.5
   - Run 3: temperature = 0.7
   - Run 4: temperature = 0.9
   - Run 5: temperature = 1.0

2. **Synthesis Prompt (aggregates the 5 analyses):**
```
You have been provided with 5 independent analyses of the same video frames, each examining the scene from the same perspective but potentially arriving at different conclusions.

Your task: Synthesize these analyses into a single, coherent assessment. Specifically:

1. CONSISTENCY ANALYSIS: Identify aspects where all or most analyses agree
2. DIVERGENCE POINTS: Note where analyses disagree and explain possible reasons
3. CONFIDENCE LEVELS: Based on agreement/disagreement, assess confidence in conclusions
4. EVIDENCE STRENGTH: Evaluate which claims have strongest support across analyses
5. FINAL ASSESSMENT: Provide your best judgment based on the collective evidence

The 5 independent analyses are:

[ANALYSIS 1]
{analysis_1}

[ANALYSIS 2]
{analysis_2}

[ANALYSIS 3]
{analysis_3}

[ANALYSIS 4]
{analysis_4}

[ANALYSIS 5]
{analysis_5}

Synthesize these into a coherent, evidence-based conclusion.
```
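
The two steps above can be sketched as follows. The `analyze` callable is a hypothetical stand-in for one model call at a given temperature, and the short template is only a placeholder for the full synthesis prompt; the temperature used for the synthesis call itself is not stated in this appendix and is not modeled here.

```python
from typing import Callable, List

TEMPERATURES: List[float] = [0.3, 0.5, 0.7, 0.9, 1.0]

# Abbreviated stand-in for the synthesis prompt; placeholders match the original.
SHORT_TEMPLATE = "\n\n".join(
    f"[ANALYSIS {i}]\n{{analysis_{i}}}" for i in range(1, 6)
)


def build_synthesis_prompt(analyze: Callable[[float], str], template: str) -> str:
    """Run one analysis per temperature and fill the synthesis template."""
    analyses = [analyze(t) for t in TEMPERATURES]
    fields = {f"analysis_{i}": text for i, text in enumerate(analyses, start=1)}
    return template.format(**fields)


# A stub replaces the model so the example runs offline.
prompt = build_synthesis_prompt(lambda t: f"analysis sampled at {t}", SHORT_TEMPLATE)
print(prompt)
```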

### Implementation Notes

- Variable temperature creates diversity in responses
- The same prompt sampled at different temperatures yields multiple perspectives
- Synthesis identifies consensus and disagreements

---

## A.7 Meta-Prompting

### Meta-Prompt Template (generates specialized analysis prompts)
```
You are an expert at designing effective prompts for analyzing surveillance footage and crime scene videos.

I want you to create a specialized prompt for analyzing a sequence of frames from a surveillance video. The prompt should help analyze what's happening in the frames without making assumptions or jumping to conclusions. The focus should be on detailed observation of:

1. People in the scene
2. Actions and behaviors
3. Objects and their usage
4. Spatial relationships
5. Sequence of events

Create a detailed, structured prompt (300-500 words) that will guide the analysis of video frames {frame_range} of {total_frames}. The prompt should:
- Guide step-by-step observation
- Focus on objective description before interpretation
- Encourage attention to details that might be easily missed
- Help organize observations into a coherent narrative

Do not include any example responses in your prompt. The prompt should only contain instructions for analyzing the frames.
```

### Two-Stage Process

1. **Stage 1:** LLM generates a specialized prompt using the meta-prompt above
2. **Stage 2:** Generated prompt is used to analyze the video frames
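
The two stages might be wired together as in the sketch below. Both callables are hypothetical: `generate` stands for a text-only call that returns the specialized prompt, and `analyze` stands for the call that sends that prompt together with the chunk's frames. Whether Stage 1 also receives the frames is not specified here, so the sketch assumes it does not.

```python
from typing import Callable


def meta_prompt_chunk(meta_template: str,
                      frame_range: str,
                      total_frames: int,
                      generate: Callable[[str], str],
                      analyze: Callable[[str], str]) -> str:
    """Stage 1 writes a specialized prompt; Stage 2 applies it to the frames."""
    meta_prompt = meta_template.format(frame_range=frame_range,
                                       total_frames=total_frames)
    specialized_prompt = generate(meta_prompt)  # Stage 1 (assumed text-only)
    return analyze(specialized_prompt)          # Stage 2 (prompt plus frames)


# Stubs replace both model calls so the example runs offline.
result = meta_prompt_chunk(
    "Design a prompt for frames {frame_range} of {total_frames}.",
    "11-20",
    120,
    generate=lambda text: f"GENERATED<{text}>",
    analyze=lambda text: f"ANALYSIS USING {text}",
)
print(result)
```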

### Dynamic Variables

- `{frame_range}`: e.g., "1-10", "11-20"
- `{total_frames}`: Total frames in video

### Implementation Notes

- Adaptive approach: prompt customized for each video chunk
- Model acts as its own prompt engineer
- Generated prompts vary based on context

---

## A.8 Chain of Thought (CoT)

### CoT Analysis Prompt
```
Analyze these video frames using a chain of thought reasoning process. Think step by step as you examine what's happening:

Step 1: First, carefully observe and list what you can actually see in the frames. Note people, objects, settings, and actions without interpretation.

Step 2: Identify the key actors in the scene. Describe each person's appearance and what they are doing. Track individuals across multiple frames.

Step 3: Describe the sequence of events chronologically. What happens first, next, and after that?

Step 4: Note any important objects or items in the scene and how they're being used.

Step 5: Consider the context and setting. Where is this taking place? What kind of environment is shown?

Step 6: Based on all the above observations, describe what appears to be happening in these frames.

Make sure to clearly show your thinking process for each step. These are frames {frame_range} of {total_frames}.
```

### CoT Synthesis Prompt (for aggregating chunks)
```
You are going to synthesize multiple analyses of different segments of the same video into a coherent understanding of the entire sequence. Use chain of thought reasoning to connect all segments into a complete narrative.

Think through the following steps:

Step 1: Review each segment analysis and extract the key information about people, objects, and actions from each one.

Step 2: Create a timeline by arranging events across all segments in chronological order.

Step 3: Identify which people appear across multiple segments and track their actions throughout.

Step 4: Note how objects or items are used or moved across the entire sequence.

Step 5: Consider the overall context and how different segments relate to each other.

Step 6: Based on all the above reasoning, construct a comprehensive description of what happens throughout the entire video.

Show your thinking at each step as you build your understanding of the complete video sequence.

Here are the segment analyses:

{segment_analyses}

Synthesize these into a complete narrative showing your chain of thought at each step.
```

### Dynamic Variables

- `{frame_range}`: Current frame range being analyzed
- `{total_frames}`: Total frames in video
- `{segment_analyses}`: All previous chunk analyses concatenated
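
One plausible way to build `{segment_analyses}` is shown below. The `[Frames ...]` label and the blank-line separator are assumptions for illustration; the appendix only states that the chunk analyses are concatenated.

```python
from typing import List, Tuple


def join_segment_analyses(segments: List[Tuple[str, str]]) -> str:
    """Concatenate (frame_range, analysis) pairs in chunk order."""
    return "\n\n".join(f"[Frames {rng}]\n{text}" for rng, text in segments)


combined = join_segment_analyses([
    ("1-10", "A person enters the room."),
    ("11-20", "The person opens a drawer."),
])
print(combined)
```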

### Implementation Notes

- Explicit step-by-step structure
- Separates observation (Steps 1-5) from interpretation (Step 6)
- Synthesis prompt aggregates multiple chunk analyses

---

## Implementation Details

### Frame Processing Pipeline

1. Extract all frames from video (no sampling, interval=1)
2. Convert frames to PNG format
3. Base64 encode each frame
4. Group into chunks of 10 frames
5. Send chunks to GPT-4o with appropriate prompt
6. Aggregate responses across chunks
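
Steps 3 and 4 can be sketched with the standard library alone. Frame extraction and PNG conversion (steps 1 and 2) are not shown because the appendix does not name the library used for them; the helper names below are illustrative, and placeholder bytes stand in for real PNG data.

```python
import base64
from typing import Iterator, List

CHUNK_SIZE = 10  # frames per API request, as listed in the configuration


def encode_frame(png_bytes: bytes) -> str:
    """Return the base64 text for one PNG-encoded frame."""
    return base64.b64encode(png_bytes).decode("ascii")


def chunk_frames(frames: List[str], size: int = CHUNK_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` frames, preserving order."""
    for start in range(0, len(frames), size):
        yield frames[start:start + size]


# Placeholder bytes stand in for real PNG data in this example.
fake_frames = [encode_frame(f"frame-{i}".encode()) for i in range(25)]
chunks = list(chunk_frames(fake_frames))
print([len(chunk) for chunk in chunks])  # [10, 10, 5]
```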

### API Request Structure
```python
{
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "<prompt_text>"},
                {"type": "image_url", "image_url": {
                    "url": "data:image/png;base64,<base64_frame_1>",
                    "detail": "high"
                }},
                # ... up to 10 frames per request
            ]
        }
    ],
    "max_tokens": 4096,
    "temperature": 0.1  # Self-Consistency runs use 0.3, 0.5, 0.7, 0.9, 1.0 instead
}
```
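
A small helper that assembles this structure for one chunk might look like the following. The function name and the length check are assumptions added for illustration; the field names and values mirror the request shown above.

```python
from typing import Any, Dict, List


def build_request(prompt_text: str,
                  frames_b64: List[str],
                  temperature: float = 0.1) -> Dict[str, Any]:
    """Assemble the request body for one chunk of base64-encoded PNG frames."""
    if not 1 <= len(frames_b64) <= 10:
        raise ValueError("expected between 1 and 10 frames per request")
    content: List[Dict[str, Any]] = [{"type": "text", "text": prompt_text}]
    for frame in frames_b64:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{frame}",
                "detail": "high",
            },
        })
    return {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 4096,
        "temperature": temperature,
    }


payload = build_request("What's happening in these frames?", ["AAAA", "BBBB"])
print(len(payload["messages"][0]["content"]))  # 3: one text part, two images
```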

### Zero-Shot Clarification

**All techniques were implemented in a zero-shot manner.** No few-shot examples were provided in any prompt. This ensures:

- Fair comparison across all techniques
- Generalizability without example bias
- Testing of model's inherent capabilities

### ReAct Clarification

Our ReAct implementation **does not use external tools**. It structures the model's internal reasoning process through the Thought-Action-Observation-Decision framework, adapted specifically for video frame analysis without tool invocation.

---

**Note:** This appendix is intended to make our experiments reproducible. All prompts are reproduced verbatim from our implementation, and researchers can replicate the experiments by using them with GPT-4o under the configuration specified above.