## Step 1: Setup & Installation

In [2]:
import os
from dotenv import load_dotenv

load_dotenv() 

print(f"MISTRAL_API_KEY: {os.environ.get('MISTRAL_API_KEY')}")

api_key = os.getenv("MISTRAL_API_KEY")


MISTRAL_API_KEY: KxoidSRc4PqGoB59Sj5PHIG8az6OYAWc


In [3]:
import os
from mistralai import Mistral, UserMessage

def mistral(user_message, model="mistral-small-latest", is_json=False):
    model = "mistral-large-latest"
    client = Mistral(api_key=api_key)

    messages = [
        UserMessage(content=user_message),
    ]

    chat_response = client.chat.complete(
        model=model,
        messages=messages,
    )

    return chat_response.choices[0].message.content


In [50]:
kairos_info= '''
# Kairos – Business Overview

## Organization
Kairos

## Business Concept
Kairos is an AI-powered long-form video understanding and retrieval platform.
It converts lengthy, unstructured videos into searchable, semantically organized content.

Users can:
- Upload a video
- Ask a question in natural language
- Retrieve the most relevant clip
- Continue interaction via chat-based follow-up questions

Kairos acts as a conversational intelligence layer over video content.

---

## Core Value Proposition

Problem:
Long-form videos are difficult and inefficient to search using traditional timestamp or metadata-based tools.

Solution:
Kairos enables:
- Scene-level semantic understanding
- Multimodal analysis (visual + speech + environmental audio)
- Retrieval-Augmented Generation (RAG)
- Clip-level precise retrieval
- Context-aware Q&A

Value:
- Faster navigation
- Improved accessibility
- Reduced manual review time
- Scalable video intelligence

---

## Target Markets

Education:
- Lecture indexing
- Study Q&A over recordings

Media & Journalism:
- Interview clip retrieval
- News archive search

Enterprise:
- Meeting summarization
- Training video indexing
- Compliance review

Content Creation:
- Highlight extraction
- Podcast indexing

---

## Product Architecture (High-Level)

Pipeline:
1. Video Upload
2. Scene Segmentation
3. Multimodal Processing:
   - Visual captions
   - Object detection
   - Speech transcription
   - Environmental audio classification
4. Scene-Level Description Synthesis (LLM)
5. Embedding + Vector Index Storage
6. RAG-Based Retrieval
7. Chat Interface + Clip Preview

Deployment:
- Cloud-based SaaS
- Azure infrastructure
- Docker containerization
- Timestamp-based streaming preview
- On-demand clip trimming (1–3 seconds execution)

---

## Competitive Positioning

Compared to traditional systems:
- Supports long-form video
- Multimodal integration
- Scene-level granularity
- Interactive conversational retrieval
- Context-grounded answers

Positioning:
Scalable interactive video intelligence platform.

---

## Potential Business Model

- SaaS subscription tiers
- Enterprise licensing
- API-based pricing
- Pay-per-minute processing
- White-label integrations

---

## Strategic Differentiation

- Scene-based indexing
- Multimodal fusion architecture
- LLM-based scene synthesis
- Retrieval-Augmented Generation
- Conversational follow-up capability

---

## Executive Summary

Kairos is an AI-powered SaaS platform that transforms long-form video into searchable, conversationally accessible content.

It enables semantic clip retrieval, interactive Q&A, and scalable multimodal understanding of complex video archives.

Long-term positioning:
- Video-native search engine
- Enterprise video intelligence layer
- Foundation for multimodal AI-assisted video interaction

'''

In [51]:
general_question = """
You are the Kairos Customer Support Assistant.

Answer based only on the general information provided below.

If the answer cannot be found in the provided information, respond politely with:

"I'm unable to find that detail in the current Kairos documentation. Please contact support for further assistance."

Do not invent information. Be clear, concise, and professional.

### Kairos General Information:

{kairos_info}

---

### User Inquiry:

<<<
{inquiry}
>>>

---

### Your response as a Customer Support Assistant:
"""


In [52]:
inquiry = "What is Kairos for?"

mistral(general_question.format(inquiry=inquiry, kairos_info=kairos_info))

'Kairos is an AI-powered platform designed to make long-form videos searchable and interactive. It converts lengthy, unstructured videos into organized, semantically understood content, allowing users to:\n\n- Upload videos\n- Ask natural language questions\n- Retrieve the most relevant clips\n- Engage in follow-up conversations\n\nIt serves as a conversational intelligence layer for video content, enabling faster navigation, improved accessibility, and scalable video understanding.'

## Step 2: Classification Task

In [None]:
classification = '''
You are a Kairos customer support intent classification system.
Your task is to assess customer intent and categorize the customer inquiry within the `<<<>>>` markers into one of the following predefined categories:

* Technical Issue
* Feature Explanation
* System Architecture Explanation
* General Inquiry

If the text does not fit into any of the above categories, classify it as:

* General Inquiry

You will respond **only** with the predefined category. Do not provide explanations or notes.

###

Here are some examples:

**Inquiry:** My video upload keeps failing at 70%.
**Category:** Technical Issue

**Inquiry:** How do I search for a specific moment in a video?
**Category:** Feature Explanation

**Inquiry:** How does Kairos integrate ASR with the Vision-Language Model?
**Category:** System Architecture Explanation

**Inquiry:** What is Kairos?
**Category:** General Inquiry

###

<<<
**Inquiry:** {inquiry}

> > >

**Category:**

'''

In [None]:
mistral(classification.format(inquiry="What is kairos about?"))


'General Inquiry'

## Step 3: Information Extraction Using JSON Mode


In [28]:
issue_extraction = """
You are a structured data extraction system for the Kairos Customer Support chatbot.

Your task is to extract technical issue details from the customer inquiry within the `<<<>>>` markers and return the information in valid JSON format.

Extract the following fields:

* issue_type
* video_length
* video_format
* error_message
* stage_of_failure
* device_or_environment
* urgency_level

Guidelines:

* If a field is not mentioned, set its value to `"Not specified"`.
* Do not infer details that are not clearly stated.
* Be concise and accurate.
* Return **only valid JSON**.
* Do not include explanations, notes, or extra text.
* Ensure the JSON is properly formatted.

###

Here are some examples:

**Inquiry:** My 2-hour MP4 video fails during scene detection with a CUDA memory error.
**Output:**

{{
  "issue_type": "Processing failure",
  "video_length": "2-hour",
  "video_format": "MP4",
  "error_message": "CUDA memory error",
  "stage_of_failure": "Scene detection",
  "device_or_environment": "GPU processing",
  "urgency_level": "Not specified"
}}


**Inquiry:** The system crashes when I upload a MOV file from my laptop.
**Output:**

{{
  "issue_type": "Upload crash",
  "video_length": "Not specified",
  "video_format": "MOV",
  "error_message": "System crash",
  "stage_of_failure": "Upload stage",
  "device_or_environment": "Laptop",
  "urgency_level": "Not specified"
}}


###

<<<
**Inquiry:** {inquiry}

> > >

"""

In [None]:
inquiry = "The video is uploading for an hours now it's only 5 minute video"

mistral(issue_extraction.format(inquiry=inquiry), is_json=True)

'```json\n{\n  "issue_type": "Upload failure",\n  "video_length": "5 minute",\n  "video_format": "Not specified",\n  "error_message": "Upload stuck",\n  "stage_of_failure": "Upload stage",\n  "device_or_environment": "Not specified",\n  "urgency_level": "Not specified"\n}\n```'

## Step 4.1: Personalized Response about System architecture

In [44]:
system_archi = """
System Architecture Explanation

1) High‑Level Overview
Kairos is a long‑form video understanding platform that transforms raw video into a structured, searchable knowledge asset. It does this by:

Segmenting video into scenes.
Extracting visual and audio evidence per scene.
Fusing that evidence into coherent scene descriptions.
Building a narrative summary and synopsis.
Generating embeddings for retrieval (RAG) and enabling query‑based clip discovery.
2) Core Components (Conceptual)

Ingestion & Cataloging

Source videos live in Videos/.
A catalog file (_all_videos.json) enables batch selection and filtering (by length or explicit selection).
Scene Segmentation Layer

Uses a shot/scene detection engine to split the video into semantically consistent scenes.
Outputs scene boundaries and timecodes.
Optionally saves physical scene clips into an output .clips directory for inspection or downstream use.
Visual Analysis Pipeline

Frame Sampling: Selects representative frames per scene at a target resolution.
Captioning: Generates short descriptive captions per frame using a lightweight VLM.
Object Detection: Runs object detection on sampled frames (and/or a separate FPS stream) to produce fine‑grained object lists and spatial hints.
Artifacts are stored under per‑video output directories (e.g., .frames, .fps, .yolo).
Audio Analysis Pipeline

Natural Sound Description: Produces semantic descriptions of non‑speech audio events.
Speech Transcription: Produces ASR transcripts with optional VAD for clarity.
These features align to scene time boundaries to preserve temporal context.
Scene Evidence Fusion

For each scene, all evidence streams are merged:
Visual captions
Object detections
Sound descriptions
Speech transcripts
An LLM produces a coherent, human‑readable scene description that is more informative than any single modality.
Narrative Construction

Scene descriptions are chunked into larger blocks.
An LLM produces multi‑scene narrative summaries, then a final synopsis.
This provides hierarchical abstraction: scene‑level → narrative‑level → synopsis‑level.
Retrieval (RAG) Layer

Scene‑level information is embedded into a vector index.
A conversational interface can retrieve the most relevant scenes for a user query and surface them as clip references or summaries.
Checkpointing & Logging

Each video run persists intermediate state to a checkpoint file to allow restarts without redoing costly steps.
Logs are stored in logs/ and log_reports/ for performance tracking and reproducibility.
3) Data Flow (End‑to‑End)

Video Selection

User selects one or more videos from the catalog.
Scene Detection

The video is segmented into ordered scenes; timecodes are captured.
Per‑Scene Feature Extraction

Visual frames are sampled and analyzed.
Audio is analyzed for both speech and ambient events.
Scene Description Generation

Multimodal evidence is fused into a single scene summary.
Narrative & Synopsis

The system composes a higher‑level narrative summary and final synopsis.
Embedding Creation

Scene‑level data is embedded for retrieval.
Querying retrieves the most relevant scenes and their evidence.
4) Storage & Artifacts
For each processed video, Kairos writes:

Scene metadata: boundaries, timecodes, derived features.
Intermediate outputs: frames, clips, detection results.
Scene descriptions and narrative summaries.
RAG embeddings for search and retrieval.
These outputs live under _processed/ with per‑video subdirectories.

5) Resilience & Reproducibility

Checkpointing prevents reprocessing already‑completed stages.
Step‑level logging ensures each stage’s runtime and outputs are tracked.
The architecture supports incremental re‑runs (e.g., only re‑summarize scenes).
6) Scalability Considerations

Batch processing is supported through a catalog‑driven selection mechanism.
Scene‑level parallelism is possible conceptually because scenes are independent once segmented.
Configurable quality‑speed tradeoffs (e.g., number of frames per scene, detection FPS) allow tuning for large video sets.
7) Extensibility
The architecture is modular:

Visual analysis can swap or add new models.
Audio analysis can add speaker diarization or emotion.
Fusion prompts can be updated for different reporting styles.
Retrieval can incorporate new metadata or ranking heuristics.
"""


In [45]:
system_question = """
You are the Kairos Customer Support Assistant.

Your task is to answer a user inquiry related to **System Architecture Explanation**.

You must base your answer strictly and only on the system architecture information provided below.

If the answer cannot be found in the provided architecture information, respond with:

"I’m unable to find that detail in the current Kairos system documentation."

Do not invent components, models, or processes that are not explicitly mentioned.

Be clear, technical, and structured.
Use concise explanations.
If appropriate, explain the flow of data step-by-step.

### Kairos System Architecture Information:

{system_archi}

---

### User Inquiry:

<<<
{inquiry}

> > >

---

### Your response as a Customer Support Assistant and an expert in Kairos System Architecture:

"""

In [46]:
inquiry = "Can kairos identify if there music in the video?"
mistral(system_question.format(inquiry=inquiry, system_archi = system_archi))

'Based on the Kairos system architecture documentation, Kairos **can identify music or other non-speech audio events** in a video through its **Audio Analysis Pipeline**. Here’s how it works:\n\n1. **Audio Analysis Pipeline**:\n   - The system processes audio tracks to detect and describe **natural sounds**, which explicitly includes **non-speech audio events** (e.g., music, ambient noise, sound effects).\n   - This is handled by the **"Natural Sound Description"** component, which generates semantic descriptions of audio events aligned with scene time boundaries.\n\n2. **Output Integration**:\n   - The detected audio events (including music) are merged with other evidence (visual, speech transcripts) during **Scene Evidence Fusion** to produce a coherent scene description.\n   - If music is present, it will be reflected in the scene’s summary (e.g., *"Background music plays while the speaker discusses..."*).\n\n3. **Limitations**:\n   - The architecture does not explicitly mention **m

## Step 4.2: Personalized Response about System architecture

In [47]:
feature_breakdown = '''
Feature Explanation: What Kairos Can and Can’t Do

What Kairos Can Do
* Analyze long‑form videos scene by scene
* Automatically splits videos into meaningful scenes and processes each one.
* Generate visual understanding
* Captions key frames to describe what’s happening visually.
* Detects objects to capture fine‑grained details.
* Analyze audio content
* Describes non‑speech sounds (e.g., ambience, effects).
* Transcribes spoken dialogue in each scene.
* Fuse multimodal evidence into rich scene descriptions
* Combines visual captions, object detections, sound descriptions, and speech into a single, coherent scene report.
* Produce narrative summaries and a final synopsis
* Builds higher‑level summaries from scene reports and then a final synopsis for the whole video.
* Enable retrieval‑style querying (RAG)
* Builds embeddings so you can ask questions and retrieve the most relevant scenes.
* Persist outputs for reuse
* Saves scene data, summaries, and logs so you can resume or reuse results without reprocessing.

What Kairos Can’t Do (Current Limitations)
* It doesn’t “understand” beyond its models
* It’s limited by the captioning, detection, ASR, and LLM accuracy; errors or hallucinations can appear.
* It doesn’t guarantee perfect scene boundaries
* Scene detection can miss subtle cuts or over‑segment fast‑moving content.
* It doesn’t provide real‑time processing
* Designed for offline processing of stored videos, not live streams.
* It doesn’t fully resolve complex narrative reasoning
* It summarizes, but it doesn’t deeply reason about plot structure, hidden motives, or nuanced themes.
* It doesn’t output polished film‑grade annotations
* Outputs are structured and useful, but not production‑ready cinematic annotations.
* It doesn’t automatically handle all video types equally
* Very static, very noisy, or visually abstract content may yield weaker results.
* It doesn’t answer questions without prior processing
* RAG requires that a video has already been processed and embedded.
'''

In [48]:
feature_question = """
You are the Kairos Customer Support Assistant.

Your task is to answer a user inquiry related to **Feature Explanation**.

You must base your answer strictly and only on the feature information provided below.

If the answer cannot be found in the provided feature information, respond with:

"I’m unable to find that detail in the current Kairos feature documentation."

Do not invent features, usage steps, or functionality that are not explicitly mentioned.

Be clear, technical, and structured.
Use concise explanations.
If appropriate, provide step-by-step instructions on how to use the feature.

### Kairos Feature Information:

{feature_breakdown}

---

### User Inquiry:

<<<
{inquiry}
>>>

---

### Your response as a Customer Support Assistant and an expert in Kairos features:
"""


In [49]:
inquiry = "Can i analze a CCTV mp4?"
mistral(feature_question.format(inquiry=inquiry, feature_breakdown = feature_breakdown))

'Yes, you can analyze a CCTV MP4 file with Kairos, provided it meets the following conditions:\n\n1. **File Type & Storage**: The video must be a stored MP4 file (not a live stream).\n2. **Content Suitability**: Kairos performs best on content with distinguishable visuals and audio. Very static, noisy, or abstract footage may yield weaker results.\n\n### How to Analyze a CCTV MP4:\n1. **Upload the Video**: Provide the stored MP4 file for offline processing.\n2. **Automatic Processing**:\n   - Kairos will split the video into scenes.\n   - It will generate visual captions, detect objects, transcribe dialogue (if present), and describe non-speech sounds.\n   - Multimodal data (visual + audio) will be fused into scene reports.\n3. **Outputs**:\n   - Scene-by-scene breakdowns.\n   - Summaries and a final synopsis.\n   - Embeddings for retrieval-style querying (RAG).\n\n### Limitations to Note:\n- **Accuracy**: Results depend on the clarity of the footage (e.g., low-light or grainy videos m

In [None]:
general_question = """
You are the Kairos Customer Support Assistant.

Your task is to answer a user inquiry classified as **General Inquiry**.

Answer based only on the general information provided below.

If the answer cannot be found in the provided information, respond politely with:

"I'm unable to find that detail in the current Kairos documentation. Please contact support for further assistance."

Do not invent information. Be clear, concise, and professional.

### Kairos General Information:

{general_info}

---

### User Inquiry:

<<<
{general_inquiry}
>>>

---

### Your response as a Customer Support Assistant:
"""
