**Created by:** Tarun Reddi (also known as Teen Different)  
**LinkedIn:** [Tarun Reddi](https://www.linkedin.com/in/tarunreddi/)  
**Portfolio:** [Tarun's Portfolio](https://redditarun.github.io/)  
**Medium Blogs:** [Teen Different on Medium](https://medium.com/@teendifferent)  
**Explore: Enhancing Omni for the Visually Impaired:** [Read the article](https://medium.com/predict/how-i-achieved-4ms-depth-object-detection-and-what-i-built-with-it-246849007223)  

<center>
  <h2 style="color: blue;">YouTube Video: <a href="https://www.youtube.com/watch?v=8_s5sz2eVb4" target="_blank" style="color: blue;">https://www.youtube.com/watch?v=8_s5sz2eVb4</a></h2>
</center>

# **Introducing Omni**

Omni isn’t just another tool; it's a groundbreaking leap toward Artificial General Intelligence (AGI), crafted *just for you*. Think of it as a personal assistant, mentor, and companion rolled into one, evolving alongside you every step of the way. Designed from the ground up, Omni isn’t just built to work with you; it’s built to know you, grow with you, and make your life smoother, smarter, and more delightful.

---

### **What Can Omni Do for You?**

![dummy_image.jpg](attachment:0e735fec-5255-4084-8e4b-1797dcb4d028.jpg)

Omni isn't bound by a single purpose—its versatility is endless. Whether you need:

- A **personal assistant** to handle tasks,
- **Blind assistance** for navigating your environment,
- A **tutor** to guide your learning journey,
- A **memory tool** to archive your experiences,
- A **navigator**, **cooking partner**, or even a **wellbeing tracker**,

Omni is there, adapting to your needs and bringing intelligence to every moment of your life.

---

## **How Does Omni Make This Magic Happen?**

Omni embodies a vision: a world where everyone has their own personal AI assistant, just as essential as a smartphone or car.

Here’s how it works:

- **Three Modes**, tailored to different tasks.
- **Five Models**, working together in harmony.
- A powerful **database** that remembers your world, storing what matters.

And that's just the beginning—Omni is designed to grow, expand, and adapt to the future.

---

### **Omni Anatomy - Inspired by Humans**

![summy1.jpg](attachment:279364f2-e215-4a82-ab98-62ff55a1d161.jpg)

Omni’s architecture takes cues from the way we function as humans:

- **Eyes** for object detection, depth estimation, and text recognition.
- **Ears** for understanding speech, processing conversations like a natural listener.
- **Mouth and Hands** for creating and conveying—reasoning, speaking, and generating content.

Omni isn’t just one model trying to do it all. Humans don’t work that way, right? Instead, it’s like a team: a collection of specialized models working together in harmony, each excelling at its own task and delivering the best results possible.

In the future, Gemini's multimodal capabilities will enable Omni to create images and videos, bringing AGI closer to true human-like intelligence.

## **A Closer Look at Omni’s Modes**

![image.png](attachment:0606ac9d-5957-4ae3-a9fd-dfcb65048dcd.png)



### **1. Rewind Mode**

Ever wished for a memory you could rely on? Rewind Mode answers questions from your past.

- Missed the details of last week’s meeting? It’s got you covered.
- Forgot your friend’s favorite food? Omni remembers for you.

### **2. Focus Mode**

Perfect for when you need precision and clarity.

- **Focus - Idle:** Contextual information gets stored in your vector database. Watch a lecture or attend a meeting, and Omni captures all the key points for future reference.
- **Focus - Working:** Need on-the-spot insights? Omni guides you through the task at hand without storing unnecessary data. Whether summarizing a document or guiding a blind user to their couch, it’s all about the now.

**Attention to Detail:**

- Omni processes objects, their distances, and positions. This opens new possibilities, like guiding the visually impaired.
- It avoids redundancy by storing only unique, relevant data. For instance, while reading a book, Omni saves the context just once, not repeatedly.
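
As a rough illustration of how object positions and distances could become guidance for a visually impaired user, here is a minimal sketch. The helper names (`horizontal_position`, `describe_scene`) and the `distance_m` field are hypothetical, chosen only for this example; the left/center/right split into thirds mirrors the rule used later in the notebook's detection code.

```python
from typing import Dict, List


def horizontal_position(center_x: float, frame_width: float) -> str:
    """Bucket a bounding-box center into the left, center, or right third."""
    if center_x < frame_width / 3:
        return "left"
    if center_x < 2 * frame_width / 3:
        return "center"
    return "right"


def describe_scene(objects: List[Dict], frame_width: float) -> str:
    """Build a short spoken-style description, nearest object first."""
    phrases = []
    for obj in sorted(objects, key=lambda o: o["distance_m"]):
        x1, _, x2, _ = obj["bbox"]
        position = horizontal_position((x1 + x2) / 2, frame_width)
        phrases.append(f"{obj['name']} {obj['distance_m']:.1f} m, {position}")
    return "; ".join(phrases)


# Hypothetical detections for a 640-pixel-wide frame.
objects = [
    {"name": "couch", "bbox": [400, 200, 620, 460], "distance_m": 2.4},
    {"name": "chair", "bbox": [20, 220, 180, 470], "distance_m": 1.1},
]
print(describe_scene(objects, frame_width=640))
```

Sorting by distance puts the nearest obstacle first, which is usually what matters most when the description is read aloud.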

### **3. Timeless Mode**

A blend of Rewind and Focus, Timeless Mode bridges past and present.

- Solve a problem by combining current context with how you approached it before.
- Discover if you’ve cooked a recipe with specific ingredients in the past.

**Attention to Detail:**
How are we bridging the gap between the current context and past information? For instance, if you asked, "Can you tell me if I learned this concept?" you wouldn’t be able to use this query directly to retrieve data from a vector database. What we’re doing instead is leveraging Gemini to enhance the process. Here’s how it works: Gemini takes the current context (e.g., information about math problems) and combines it with your query about whether you've learned these concepts in the past. Gemini then assesses the available information, identifies any gaps, and generates a Retrieval-Augmented Generation (RAG) query to gather the missing details needed to answer your question. This query might look something like, "Provide the necessary context about permutations, combinations, math, and word problems." This RAG-generated query is then used to retrieve the relevant context for a more informed response.
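
The query-refinement step described above can be sketched roughly as follows. This is an assumption-laden illustration, not the notebook's exact prompt: `build_refinement_prompt`, `refine_query`, and `fake_generate` are hypothetical names, and the model call is replaced by a stand-in so the example runs offline.

```python
from typing import Callable


def build_refinement_prompt(current_context: str, user_query: str) -> str:
    """Ask the model to rewrite a vague question as a standalone retrieval query."""
    return (
        "You are preparing a search query for a vector database of past events.\n"
        f"Current context: {current_context}\n"
        f"User question: {user_query}\n"
        "Write one standalone query that names the concrete topics needed "
        "to answer the question. Return only the query."
    )


def refine_query(current_context: str, user_query: str,
                 generate: Callable[[str], str]) -> str:
    """`generate` is any callable that maps a prompt string to the model's text."""
    return generate(build_refinement_prompt(current_context, user_query)).strip()


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call (assumption: the real one returns plain text)."""
    return " Provide the necessary context about permutations, combinations, math, and word problems. "


rag_query = refine_query(
    "Worksheet with permutation and combination word problems",
    "Can you tell me if I learned this concept?",
    fake_generate,
)
print(rag_query)
```

In the notebook, `generate` could plausibly be `lambda p: gemini_model.generate_content(p).text`; the refined query, rather than the user's original question, is then what gets embedded and sent to the vector database.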

---

## Why Do We Need a Long Context Window in Gemini?

A long context window is vital for handling queries that require sifting through extensive historical data. For example, when a user asks, *“Did I make this recipe before?”* or *“What were the details of my last meeting?”*, Gemini might need to process months or years of information—dozens of recipes or hundreds of meetings. Without a long context window, the model wouldn’t be able to retain and analyze all this data at once. Even with timestamps helping narrow the scope, the model still needs room to prioritize and focus on the most relevant details. That’s why a large context window is essential—it ensures the model can manage and extract meaningful insights from vast datasets.

### Why I Chose Gemini 1.5 Pro

Gemini 1.5 Pro stands out not just for its long context window but for how effectively it uses it. In tests, alternatives like Flash struggled with subtle details. For instance, when asked, *“Does this car belong to Sarah?”*, Flash pulled general car-related data but missed a key note from Sarah about her license plate. Gemini 1.5 Pro, however, seamlessly identified this subtle yet crucial detail while managing the broader context. 

Its ability to balance breadth and precision makes Gemini 1.5 Pro the perfect choice for leveraging a long context window, turning vast amounts of data into actionable insights. 



# **Proposal: The Future of Personalized AI**
This is more than just technology—it’s a **movement** toward a world where everyone has a personalized AI companion, as essential as a smartphone or car. Imagine your AI seamlessly integrated across all devices—VR headsets, smartphones, smartwatches—working together to store, retrieve, and adapt to your unique needs through a secure, cloud-based **personalized vector database**.

Your data remains:
- **Secure**: Fully encrypted and only accessible to you.
- **Tailored**: Customized to your preferences and history.
- **Accessible**: Ready to assist, anytime, anywhere.

This isn’t just about convenience; it’s about creating a **connected, smarter way of living**. From syncing tasks and navigating daily life to enhancing learning and well-being, this AI transforms how you interact with technology. Like cars or the internet, it envisions a future where personalized AI becomes a universal standard, empowering you to live with greater ease, clarity, and confidence.

**Your Omni, Your AI, Your Future** 🌟

# Import Libraries

In [None]:
!pip install -q pinecone-client pandas openai transformers pillow-heif
!pip install -q -r /kaggle/input/depth_pro/pytorch/01/1/depth_pro/requirements.txt
!pip install -q moviepy


In [None]:
import os
import sys
import gc
import time
import json
import random
import datetime
import pathlib
import textwrap
import numpy as np
import pandas as pd
import torch
import cv2

from kaggle_secrets import UserSecretsClient
from IPython.display import display, Markdown
from tqdm import tqdm
import matplotlib.pyplot as plt
from PIL import Image
import pillow_heif
import easyocr

from dataclasses import dataclass
from typing import List, Dict, Any

from transformers import (
    AutoTokenizer,
    AutoModel,
    pipeline,
    DetrImageProcessor,
    DetrForObjectDetection,
)

from torchvision.transforms import ToTensor, Normalize, Compose, Lambda
import google.generativeai as genai
from pinecone import Pinecone
from moviepy import AudioFileClip

import warnings
warnings.filterwarnings('ignore')

# The Models Powering Omni’s Capabilities

1. **Gemini 1.5 Pro**: Chosen for its superior context comprehension after initial experiments with 1.5 Flash 8B failed to identify relevant text from large corpora. Gemini 1.5 Pro excels in extracting precise context and delivering well-defined responses to user queries.
2. **DeTR ResNet 50**: Among the state-of-the-art object detection models. Alternatively, YOLO v11n performs efficiently on COCO datasets, providing exceptional detection accuracy with optimized performance.
3. **Depth Pro**: Apple’s monocular depth model is fast for its resolution; Apple reports a 2.25-megapixel depth map in about 0.3 seconds on a standard GPU.
4. **Whisper**: OpenAI’s speech recognition model transcribes audio with precision, essential for building contextual understanding.
5. **EasyOCR**: Streamlines text recognition from images and videos, bridging visual data with actionable insights.

In [None]:
# Load Gemini Model
user_secrets = UserSecretsClient()
apiKey = user_secrets.get_secret("GEMINI_API_KEY")
genai.configure(api_key=apiKey)
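# Note: this loads the Flash variant; use 'gemini-1.5-pro-latest' to match the Gemini 1.5 Pro choice described above.
gemini_model = genai.GenerativeModel(model_name='gemini-1.5-flash-latest')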

# Load Object Detection Model
object_processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')
object_detection_model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')


# Depth Pro imports
sys.path.append("/kaggle/input/depth_pro/pytorch/01/1/depth_pro/src")
from depth_pro import depth_pro
from depth_pro.depth_pro import create_model_and_transforms, DepthProConfig
# DepthPro configuration
weights_path = "/kaggle/input/depth_pro/pytorch/01/1/depth_pro/checkpoints/depth_pro.pt"
config = DepthProConfig(
    patch_encoder_preset="dinov2l16_384",
    image_encoder_preset="dinov2l16_384",
    decoder_features=256,
    checkpoint_uri=weights_path,
    use_fov_head=True,
    fov_encoder_preset="dinov2l16_384"
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
depth_estimation_model, transform = create_model_and_transforms(config, device=device)
depth_estimation_model.eval()


# Load Speech Recognition Model
speech_recognition_model = pipeline('automatic-speech-recognition', model='openai/whisper-tiny')

# Initialize OCR Reader for recognising text
reader = easyocr.Reader(['en'])  # For English text

print("All models loaded successfully. ✅")

# Integrate Vector Database

## Why Use RAG with Gemini's Large Context Window?
Despite Gemini’s impressive 1 million+ token capacity, the dynamic nature of Omni requires a more scalable approach. Designed as a consumer-centric product, Omni continuously ingests and stores user data, potentially exceeding token limits over time. Imagine a user building a database of 5 million tokens within months—this is where RAG (Retrieval-Augmented Generation) shines. It efficiently fetches relevant chunks of data, leveraging Gemini’s large token capacity for smooth, insightful responses.
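
To make the trade-off concrete, here is a minimal sketch of packing retrieved chunks into a fixed token budget before they are handed to the model. The names `pack_chunks` and `rough_count` are hypothetical, and the whitespace word count is only a crude stand-in for a real tokenizer such as the model's own `count_tokens`.

```python
from typing import Callable, List


def pack_chunks(chunks: List[str], budget: int,
                count_tokens: Callable[[str], int]) -> List[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected


def rough_count(text: str) -> int:
    """Crude token estimate: one token per whitespace-separated word (assumption)."""
    return len(text.split())


# Chunks are assumed to arrive already ranked by relevance.
ranked = ["met Sarah about the car", "cooked pasta with basil", "gym session at 6 pm"]
print(pack_chunks(ranked, budget=9, count_tokens=rough_count))
```

Retrieval decides which chunks are worth reading; the large context window decides how many of them fit in one request.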

### Addressing Scalability and Efficiency

Some may question the storage burden of such extensive data over years. Omni tackles this by focusing solely on textual information, akin to how the human brain processes and communicates. Just as humans articulate thoughts through language, Omni stores and retrieves data in a similar fashion. While its current capabilities are limited to "speech," integrating drawing, image, or video generation could propel Omni closer to AGI (Artificial General Intelligence).

### The Backbone: Vector Database and Contextual Data

- **Infrastructure**: Omni uses Pinecone with MPNet Base V2 as the embedding model (768 dimensions), retrieving the top 20 matches efficiently.
- **Initial Dataset**: 7 days of synthetic contextual data simulate diverse real-world scenarios, from meetings to workouts, ensuring a robust foundation for understanding and responding to queries.
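
For intuition about what "retrieving the top matches" means, here is a small pure-Python sketch of similarity search. It assumes cosine similarity, which may not be the metric the Pinecone index was created with, and uses 3-dimensional toy vectors in place of the 768-dimensional embeddings. The helper names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], store: Dict[str, List[float]],
          k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(key, cosine(query, vec)) for key, vec in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy store: keys stand in for entry IDs, vectors for summary embeddings.
store = {
    "meeting": [1.0, 0.0, 0.0],
    "workout": [0.0, 1.0, 0.0],
    "standup": [0.9, 0.1, 0.0],
}
matches = top_k([1.0, 0.0, 0.0], store, k=2)
print([key for key, _ in matches])
```

A hosted vector database performs the same ranking with approximate indexes so that it stays fast as the store grows.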

In [None]:
# Initialize Pinecone and Embedding Model Functions
def init_pinecone(api_key):
    pc = Pinecone(api_key=api_key)
    return pc

def init_embedding_model(model_name="sentence-transformers/all-mpnet-base-v2"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    return tokenizer, model

def embed_text(text, tokenizer, model):
    """Generate embedding vector for a given text using the specified model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        embeddings = model(**inputs).last_hidden_state.mean(dim=1)
    return embeddings.squeeze().tolist()

def format_metadata(metadata):
    """Ensure metadata values are of supported types."""
    formatted = {}
    for key, value in metadata.items():
        # Convert to string if the value is not a supported type
        if isinstance(value, (str, int, float, bool)) or (
            isinstance(value, list) and all(isinstance(x, str) for x in value)
        ):
            formatted[key] = value
        else:
            formatted[key] = str(value)
    return formatted

def upload_to_pinecone(data_folder, index, tokenizer, model, batch_size=100):
    """Upload data to Pinecone in batches with properly formatted metadata."""
    for file_name in sorted(os.listdir(data_folder)):
        if file_name.endswith(".json"):
            file_path = os.path.join(data_folder, file_name)
            batch_vectors = []
            
            try:
                with open(file_path, "r") as file:
                    data = json.load(file)
                    
                    for entry in data:
                        try:
                            # Generate embedding
                            text_to_embed = entry["summary"]
                            embedding_vector = embed_text(text_to_embed, tokenizer, model)
                            
                            # Format metadata properly
                            metadata = format_metadata({
                                "timestamp": entry["timestamp"],
                                "source": entry["source"],
                                "summary": entry["summary"]
                            })
                            
                            # Prepare vector data
                            vector_data = (
                                str(entry["id"]),  # Ensure ID is string
                                embedding_vector,
                                metadata
                            )
                            batch_vectors.append(vector_data)
                            
                            # Upload batch when it reaches batch_size
                            if len(batch_vectors) >= batch_size:
                                index.upsert(vectors=batch_vectors)
                                print(f"✅ Uploaded batch of {len(batch_vectors)} vectors")
                                batch_vectors = []
                                
                        except Exception as e:
                            print(f"Error processing entry: {str(e)}")
                            continue
                    
                    # Upload any remaining vectors
                    if batch_vectors:
                        index.upsert(vectors=batch_vectors)
                        print(f"✅ Uploaded final batch of {len(batch_vectors)} vectors")
                        
                print(f"✅ Completed uploading data from {file_name}")
                
            except Exception as e:
                print(f"Error processing file {file_name}: {str(e)}")
                continue

#### Push Synthetic Data into Vector Database

In [None]:
try:
    # Get API key
    pinecone_api_key = user_secrets.get_secret("RAG")
    
    # Initialize Pinecone
    pc = init_pinecone(pinecone_api_key)
    
    # Define index name
    index_name = "rag-database"
    
    # Check if index exists
    if index_name not in pc.list_indexes().names():
        print(f"Index '{index_name}' does not exist. Please create it via the Pinecone dashboard.")
    else:
        # Connect to index
        index = pc.Index(index_name)
        print(f"✅ Connected to Pinecone index: {index_name}")
        
        # Initialize embedding model
        tokenizer, model = init_embedding_model()
        print("✅ Initialized embedding model")
        
        # Upload data
        data_folder = "/kaggle/input/synthetic-contextual-data"
        upload_to_pinecone(data_folder, index, tokenizer, model)
        print("✅ All JSON files uploaded to Pinecone RAG index.")
    
except Exception as e:
    print(f"An error occurred: {str(e)}")


In [None]:
import json
import glob

# Directory containing your JSON files
data_folder = "/kaggle/input/synthetic-contextual-data"
# List all JSON files in the directory
json_files = glob.glob(os.path.join(data_folder, "*.json"))


data_list = []
for json_file in json_files:
    # Read the content of the JSON file
    with open(json_file, 'r') as file:
        data = json.load(file)
    
    # Convert JSON data to a string
    data_str = json.dumps(data)
    data_list.append(data_str)
    print(json_file, gemini_model.count_tokens(data_str))

print('Token size of combined files', gemini_model.count_tokens(data_list))


# prompt = f'Read the following compilation of 7 Days of context on what a person did: {data_list}. Now tell me what are all the unique things done by this person and list his habits'
# overall_summary = gemini_model.generate_content(prompt)
# print(overall_summary)

# Processing at Its Core

### **Frame Processing**

- Object detection and depth estimation are integrated, capturing spatial relationships with precision. For instance, object depth is averaged from 50 sampled coordinates within a region of interest.
- OCR and speech recognition extract vital textual and auditory information.
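
The sampling idea can be sketched without any model, using a nested list as a stand-in depth map. The function name `average_depth` and the fixed `seed` are assumptions made so the example is reproducible; the notebook's own implementation operates on the array returned by Depth Pro.

```python
import random
from typing import List, Tuple


def average_depth(depth_map: List[List[float]], bbox: Tuple[int, int, int, int],
                  num_points: int = 50, seed: int = 0) -> float:
    """Average up to `num_points` randomly sampled depths inside a bounding box."""
    x1, y1, x2, y2 = bbox
    height, width = len(depth_map), len(depth_map[0])
    # Clamp the box to the map so out-of-frame coordinates cannot index past the edges.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    region = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    if not region:
        return 0.0
    rng = random.Random(seed)
    samples = rng.sample(region, min(num_points, len(region)))
    return sum(samples) / len(samples)


# 8x8 toy depth map: everything is 2.0 m away except one pixel outside the box.
depth_map = [[2.0] * 8 for _ in range(8)]
depth_map[0][0] = 9.0
print(average_depth(depth_map, (2, 2, 6, 6)))
```

Sampling a fixed number of points keeps the cost per object constant regardless of how large the bounding box is, at the price of some noise when the box also covers background.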

### **RAGManager**: Streamlining Context

- Handles data storage, embedding, and retrieval in Pinecone.
- Converts summaries into embeddings for similarity-based querying.
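
One detail worth knowing about mean-pooled sentence embeddings: the notebook averages token vectors one text at a time, so no padding is involved. If texts were ever embedded in batches, padded positions should be excluded from the average. The sketch below shows that masking step in plain Python; `masked_mean_pool` is a hypothetical helper and the vectors are toy values.

```python
from typing import List


def masked_mean_pool(token_vectors: List[List[float]],
                     attention_mask: List[int]) -> List[float]:
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding row that must not affect the result.
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
print(masked_mean_pool(tokens, [1, 1, 0]))
```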

In [None]:
# Define ProcessedFrame Dataclass
@dataclass
class ProcessedFrame:
    objects_and_depth: List[Dict[str, Any]]
    text_from_video: str
    text_from_audio: str
    timestamp: str


# Define ModelManager Class
class ModelManager:
    def __init__(self, depth_estimation_model, speech_recognition_model):
        # Initialize DETR for object detection
        self.object_processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')
        self.object_model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')

        self.reader = easyocr.Reader(['en'])  # For English text
        
        # Models
        self.depth_estimation_model = depth_estimation_model
        self.speech_recognition_model = speech_recognition_model
        
        # Move models to GPU if available
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.object_model.to(self.device)
        self.depth_estimation_model.to(self.device)
        
    def detect_objects(self, frame: Image.Image) -> List[Dict[str, Any]]:
        try:
            # Prepare inputs for DETR object detection
            inputs = self.object_processor(images=frame, return_tensors="pt").to(self.device)
            
            # Perform inference (no gradients are needed at prediction time)
            with torch.no_grad():
                outputs = self.object_model(**inputs)
            
            # Post-process predictions
            target_sizes = torch.tensor([frame.size[::-1]]).to(self.device)
            results = self.object_processor.post_process_object_detection(
                outputs, threshold=0.5, target_sizes=target_sizes
            )[0]
        
            detections = []
            frame_width = frame.size[0] 
            
            for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
                # Convert box coordinates to integers
                x1, y1, x2, y2 = map(int, box.tolist())
                # Calculate the center x-coordinate of the bounding box
                bbox_center_x = (x1 + x2) / 2
                # Determine position based on center x-coordinate
                if bbox_center_x < frame_width / 3:
                    position = 'left'
                elif bbox_center_x < 2 * frame_width / 3:
                    position = 'center'
                else:
                    position = 'right'

                
                detections.append({
                    'score': score.item(),
                    'label': self.object_model.config.id2label[label.item()],
                    'bbox': box.tolist(),
                    'position': position
                })
            
            return detections
            
        except Exception as e:
            print(f"Error in object detection: {str(e)}")
            return []

    def process_depth_for_object(self, frame: Image.Image, bbox: List[float]) -> float:
        """Calculate average depth for an object using 50 random points within its bounding box"""
        try:
            # The PIL frame is already RGB, so a BGR2RGB conversion here would swap the channels
            rgb_frame = np.array(frame.convert("RGB"))
            frame_tensor = transform(rgb_frame).unsqueeze(0)
            
            # Move to device
            frame_tensor = frame_tensor.to(self.device)
            
            # Get depth map
            with torch.no_grad():
                pred_dep = self.depth_estimation_model.infer(frame_tensor)

            depth_map = pred_dep["depth"].cpu().numpy()
            
            # Convert bbox to integers
            x1, y1, x2, y2 = map(int, bbox)
            
            # Ensure coordinates are within bounds
            h, w = depth_map.shape[-2:]
            x1, x2 = max(0, x1), min(w, x2)
            y1, y2 = max(0, y1), min(h, y2)
            
            object_region = depth_map[y1:y2, x1:x2]
            
            # Sample points
            num_points = min(50, object_region.size)
            if num_points > 0:
                random_indices = random.sample(range(object_region.size), num_points)
                depths = object_region.flatten()[random_indices]
                avg_depth = np.mean(depths)  # Compute average depth
            else:
                avg_depth = 0.0
    
            return avg_depth
            
        except Exception as e:
            print(f"Error in depth processing: {str(e)}")
            return 0.0

    def process_frame(self, frame: Image.Image, audio_segment=None) -> ProcessedFrame:
        try:
            # Object Detection
            detections = self.detect_objects(frame)
            objects_with_depth = []
            
            for detection in detections:
                depth = self.process_depth_for_object(frame, detection['bbox'])
                objects_with_depth.append({
                    'name': detection['label'],
                    'distance': f"{depth:.1f} meters",
                    'confidence': f"{detection['score']:.2f}",
                    'position': detection['position']
                })
            
            # Text Recognition using EasyOCR
            try:
                # Convert PIL Image to numpy array if needed
                frame_np = np.array(frame)
                
                # Using EasyOCR
                results = self.reader.readtext(frame_np)
                text_from_video = ' '.join([text[1] for text in results])
                
                # if not text_from_video:
                #     print("No text detected in the image")
                # else:
                #     print(f"Detected text: {text_from_video}")
                    
            except Exception as e:
                print(f"Error in text recognition: {str(e)}")
                text_from_video = ""
            
            # Speech Recognition
            text_from_audio = ""
            if audio_segment:
                try:
                    text_from_audio = self.speech_recognition_model(audio_segment)['text']
                except Exception as e:
                    print(f"Error in speech recognition: {str(e)}")

            return ProcessedFrame(
                objects_and_depth=objects_with_depth,
                text_from_video=text_from_video,
                text_from_audio=text_from_audio,
                timestamp=datetime.datetime.utcnow().isoformat()
            )
            
        except Exception as e:
            print(f"Error in frame processing: {str(e)}")
            raise



# Define RAGManager Class
class RAGManager:
    def __init__(self, pinecone_index, embedding_model, tokenizer):
        self.pinecone_index = pinecone_index
        self.embedding_model = embedding_model
        self.tokenizer = tokenizer

    def store_entry(self, processed_data: ProcessedFrame, summary: str):
        # Format the metadata properly for Pinecone
        metadata = {
            "timestamp": processed_data.timestamp,
            "source": json.dumps({  # Convert complex objects to JSON string
                "object_and_depth": processed_data.objects_and_depth,
                "text_from_video": processed_data.text_from_video,
                "text_from_audio": processed_data.text_from_audio
            }),
            "summary": summary
        }
        
        # Generate embedding for the summary
        embedding = self.generate_embedding(summary)
        
        try:
            # Store in Pinecone with properly formatted metadata
            self.pinecone_index.upsert(
                vectors=[(processed_data.timestamp, embedding, metadata)]
            )
            return metadata
        except Exception as e:
            print(f"Error storing in Pinecone: {str(e)}")
            raise

    def query_similar(self, query: str, top_k: int = 20):
        query_embedding = self.generate_embedding(query)
        results = self.pinecone_index.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True
        )
        return results

    def generate_embedding(self, text: str) -> List[float]:
        inputs = self.tokenizer(text, return_tensors="pt", 
                              truncation=True, padding=True, max_length=512)
        with torch.no_grad():
            embeddings = self.embedding_model(**inputs).last_hidden_state.mean(dim=1)
        return embeddings.squeeze().tolist()


# Revolutionary Interaction Modes

### **RewindMode**:

- Retrieves historical data relevant to user queries.
- Formats results into a conversational “Historical Context” and generates warm, context-aware responses using Gemini.
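
A possible way to format retrieved matches into that context block is sketched below. It is a variant, not the notebook's exact formatting: it keeps only the timestamp and summary and sorts entries chronologically. Matches are represented as plain dictionaries here, which is an assumption; the function name `format_historical_context` is hypothetical.

```python
from typing import Dict, List


def format_historical_context(matches: List[Dict]) -> str:
    """Render matches as a chronological 'Historical Context' block."""
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    ordered = sorted(matches, key=lambda m: m["metadata"]["timestamp"])
    lines = [
        f"[{m['metadata']['timestamp']}] {m['metadata']['summary']}"
        for m in ordered
    ]
    return "Historical Context:\n" + "\n".join(lines)


# Hypothetical retrieval results.
matches = [
    {"metadata": {"timestamp": "2024-11-02T09:00:00", "summary": "Team meeting about launch"}},
    {"metadata": {"timestamp": "2024-11-01T18:30:00", "summary": "Cooked pasta with basil"}},
]
print(format_historical_context(matches))
```

Putting entries in time order makes it easier for the model to answer questions such as "what happened last" without re-deriving the sequence from raw metadata.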

### **FocusMode**:

- Optimized for idle or working scenarios, eliminating redundant data to maintain database efficiency.
- **Focus - Idle Processing**: Summarizes and stores metadata from periodic frame and audio data.
- **Focus - Working Processing**: Processes real-time data and generates context-enriched responses.
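
"Periodic" frame processing can be pictured as choosing roughly one frame per second of video. The helper below is a hypothetical sketch of that selection; it rounds the frame rate to the nearest integer and guards against rates below 1 FPS, which would otherwise produce a step of zero.

```python
from typing import List


def sampled_frame_indices(total_frames: int, fps: float) -> List[int]:
    """Return the indices of frames to process, about one per second."""
    step = max(1, round(fps))  # never let the step drop to zero
    return list(range(0, total_frames, step))


print(sampled_frame_indices(total_frames=95, fps=29.97))
```

Everything between the selected indices is skipped, which is what keeps object detection, depth estimation, and OCR affordable on longer recordings.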

### **TimelessMode**:

- Fuses real-time data with historical context for comprehensive responses.
- Processes current inputs, refines queries, and retrieves context-rich data to deliver highly detailed answers.

In [None]:
# Define RewindMode Class
class RewindMode:
    def __init__(self, rag_manager, gemini_model):
        self.rag_manager = rag_manager
        self.gemini_model = gemini_model

    def process_query(self, query: str) -> str:
        # Get relevant historical context
        results = self.rag_manager.query_similar(query)
        
        # Format context for Gemini
        context = "Historical Context:\n" + "\n".join(
            [str(r.metadata) for r in results.matches]
        )

        # print(context)
        

        # Use the instance's model rather than the notebook-level global
        print(self.gemini_model.count_tokens(context))
        
        # Generate response
        response = self.gemini_model.generate_content(f"""
            I want you to respond like you're chatting with a friend, not a robot. With a warm friendly tone.
            Context: {context}
        
            Query: {query}
        """)

        return response.text



# Define FocusMode Class
class FocusMode:
    def __init__(self, model_manager, rag_manager, gemini_model):
        self.model_manager = model_manager
        self.rag_manager = rag_manager
        self.gemini_model = gemini_model

    def remove_plagiarism(self, text: str) -> str:
        """
        Removes duplicate sentences from the text to reduce redundancy and potential plagiarism.
        """
        import re

        # Split the text into sentences using regex to handle punctuation
        sentences = re.split(r'(?<=[.!?]) +', text)
        seen_sentences = set()
        unique_sentences = []

        for sentence in sentences:
            # Normalize the sentence by stripping whitespace and converting to lowercase
            normalized_sentence = sentence.strip().lower()

            if normalized_sentence and normalized_sentence not in seen_sentences:
                seen_sentences.add(normalized_sentence)
                unique_sentences.append(sentence.strip())

        # Reconstruct the text from unique sentences
        processed_text = ' '.join(unique_sentences)
        return processed_text

    def process_idle(self, video_path: str):
        import cv2
        import numpy as np
        from PIL import Image
        from moviepy import AudioFileClip
        import datetime

        # Extract audio from the video
        audio_clip = AudioFileClip(video_path)
        audio_array = audio_clip.to_soundarray(fps=16000)

        if len(audio_array.shape) > 1 and audio_array.shape[1] == 2:
            # Average the two channels to get mono audio
            audio_array = audio_array.mean(axis=1)
        
        # Ensure audio_array is 1D
        audio_array = audio_array.flatten()
        
        audio_text_result = self.model_manager.speech_recognition_model(
            audio_array,
            return_timestamps=True
        )
        audio_text = audio_text_result['text']

        # Initialize video capture
        video_capture = cv2.VideoCapture(video_path)
        fps = video_capture.get(cv2.CAP_PROP_FPS)
        # A missing or sub-1 FPS value would make the modulo below divide by zero
        frame_interval = max(1, int(fps))

        processed_frames = []
        frame_number = 0

        while True:
            ret, frame = video_capture.read()
            if not ret:
                break

            # Extract roughly one frame per second
            if frame_number % frame_interval == 0:
                # Convert frame to PIL Image
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                pil_image = Image.fromarray(frame_rgb)

                # Process the frame
                processed = self.model_manager.process_frame(pil_image)
                processed_frames.append(processed)

            frame_number += 1

        video_capture.release()

        # Combine metadata from all frames
        combined_metadata = {
            'objects_and_depth': [],
            'text_from_video': '',
            'text_from_audio': audio_text,
            'timestamp': datetime.datetime.utcnow().isoformat()
        }

        for processed in processed_frames:
            combined_metadata['objects_and_depth'].extend(processed.objects_and_depth)
            combined_metadata['text_from_video'] += ' ' + processed.text_from_video

        # Remove plagiarism text (implement your own logic)
        combined_metadata['text_from_video'] = self.remove_plagiarism(
            combined_metadata['text_from_video']
        )
        combined_metadata['text_from_audio'] = self.remove_plagiarism(
            combined_metadata['text_from_audio']
        )

        # Convert combined metadata to JSON string for Gemini model
        context = json.dumps(combined_metadata, indent=2)

        # Generate summary using Gemini
        summary_response = self.gemini_model.generate_content(
            f"Summarize this content:\n{context}"
        )
        summary = summary_response.text

        # Attach summary to combined metadata
        combined_metadata['summary'] = summary

        # Store in RAG
        self.rag_manager.store_entry(
            ProcessedFrame(
                objects_and_depth=combined_metadata['objects_and_depth'],
                text_from_video=combined_metadata['text_from_video'],
                text_from_audio=combined_metadata['text_from_audio'],
                timestamp=combined_metadata['timestamp']
            ),
            summary
        )

        return combined_metadata

    def process_working(self, frame: Image.Image, 
                       audio_segment: Any, query: str) -> str:
        # Process current frame
        processed = self.model_manager.process_frame(frame, audio_segment)
        
        # Generate response
        context = json.dumps(processed.__dict__, indent=2)
        response = self.gemini_model.generate_content(f"""
            I want you to respond like you're chatting with a friend, not a robot, with a warm, friendly tone.
            Context: {context}
            Query: {query}
        """)
        
        
        # Store in RAG with summary
        summary = self.gemini_model.generate_content(
            f"Summarize this scene:\n{context}"
        ).text
        self.rag_manager.store_entry(processed, summary)
        
        return response.text


def to_plain_text(data_dict):
    """
    Converts a dictionary to a plain-text string, where each key-value pair is represented as 'key: value'.
    """
    plain_text = ""
    for key, value in data_dict.items():
        # Convert the value to a string if it's not already
        if isinstance(value, list):  # If the value is a list, join it into a readable string
            value = ", ".join([str(v) for v in value])
        elif isinstance(value, dict):  # If the value is a dict, recursively convert it to text
            value = to_plain_text(value)
        plain_text += f"{key}: {value}\n"  # Add each key-value pair to the string
    return plain_text


    
# Define TimelessMode Class
class TimelessMode:
    def __init__(self, model_manager, rag_manager, gemini_model):
        self.model_manager = model_manager
        self.rag_manager = rag_manager
        self.gemini_model = gemini_model


    def process_query(self, frame: Image.Image, 
                     audio_segment: Any, query: str) -> str:
        # Process current frame
        processed = self.model_manager.process_frame(frame, audio_segment)
        current_context = json.dumps(processed.__dict__, indent=2)
        
        # Get historical context
        h_context = to_plain_text(processed.__dict__)
        h_query = self.gemini_model.generate_content(
            "Generate only a RAG search query describing the additional "
            f"information needed to answer this question: {query}\n"
            f"Currently available information:\n{h_context}"
        ).text
        results = self.rag_manager.query_similar(h_query, top_k=20)
        historical_context = "\n".join(
            [r.metadata["summary"] for r in results.matches]
        )
        # print(h_query)
        # print(historical_context)
        # Generate response using both contexts; each part ends with a
        # newline so the concatenated prompt sections do not run together
        response = self.gemini_model.generate_content(
            "I want you to respond like you're chatting with a friend, not a robot, "
            "with a warm, friendly tone. Carefully examine the contextual data "
            "and provide an answer.\n"
            f"Query: {query}\n"
            f"Current Context: {current_context}\n"
            f"Historical Context: {historical_context}\n"
        ).text
        
        # Store current frame
        summary = self.gemini_model.generate_content(
            f"Summarize this scene:\n{current_context}"
        ).text
        self.rag_manager.store_entry(processed, summary)
        
        return response
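
The two-step retrieval in `TimelessMode.process_query` (first ask the LLM what extra information is needed, then search the vector store with that text) can be tried without Gemini or Pinecone. The sketch below is a simplified illustration, not the notebook's implementation: `answer_with_history`, `Match`, `fake_generate`, and `fake_retrieve` are hypothetical names, the prompts are shortened, and the stored summary is made-up stub data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Match:
    """Hypothetical stand-in for one vector-store match."""
    summary: str


def answer_with_history(
    query: str,
    current_context: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str, int], List[Match]],
    top_k: int = 20,
) -> str:
    # Step 1: ask the LLM which extra information is needed.
    rag_query = generate(
        f"Write a search query for the extra data needed to answer: {query}\n"
        f"Available information:\n{current_context}"
    )
    # Step 2: fetch stored summaries similar to that search query.
    matches = retrieve(rag_query, top_k)
    historical_context = "\n".join(m.summary for m in matches)
    # Step 3: answer with both contexts in a single prompt.
    return generate(
        f"Query: {query}\n"
        f"Current Context: {current_context}\n"
        f"Historical Context: {historical_context}\n"
    )


# Stubs so the flow runs offline (assumed behavior, not real model output).
def fake_generate(prompt: str) -> str:
    if prompt.startswith("Write a search query"):
        return "Sarah food preferences"
    return f"ANSWER based on:\n{prompt}"


def fake_retrieve(text: str, top_k: int) -> List[Match]:
    store = {"Sarah food preferences": [Match("Sarah said she loves pasta.")]}
    return store.get(text, [])[:top_k]


reply = answer_with_history(
    "What should I cook for Sarah?",
    "objects: tomatoes, basil",
    fake_generate,
    fake_retrieve,
)
print(reply)
```

Passing the model and the retriever in as plain callables keeps the control flow testable; in the notebook those roles are played by `gemini_model.generate_content` and `rag_manager.query_similar`.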


#### Execute the Entire System

In [None]:
def initialize_system():
    try:
        # Initialize ModelManager
        model_manager = ModelManager(
            depth_estimation_model,
            speech_recognition_model
        )
        
        # Initialize embedding model
        tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
        embedding_model = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")
        
        # Initialize Pinecone
        pinecone_api_key = user_secrets.get_secret("RAG")
        pc = init_pinecone(pinecone_api_key)
        index_name = "rag-database"
        pinecone_index = pc.Index(index_name)
        print(f"✅ Connected to Pinecone index: {index_name}")
        
        # Initialize RAGManager
        rag_manager = RAGManager(pinecone_index, embedding_model, tokenizer)
        
        # Initialize modes
        rewind_mode = RewindMode(rag_manager, gemini_model)
        focus_mode = FocusMode(model_manager, rag_manager, gemini_model)
        timeless_mode = TimelessMode(model_manager, rag_manager, gemini_model)
        
        print("✅ System initialized successfully")
        return rewind_mode, focus_mode, timeless_mode
        
    except Exception as e:
        print(f"❌ Error initializing system: {e}")
        return None, None, None
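
Because `initialize_system` returns `(None, None, None)` on failure, a later call such as `rewind_mode.process_query(...)` would fail with an `AttributeError` that hides the real cause. One possible guard is sketched below; `require_modes` is a hypothetical helper, not part of the notebook, and the tuples are stand-ins for real mode objects.

```python
from typing import Optional, Tuple


def require_modes(modes: Tuple[Optional[object], ...]) -> Tuple[object, ...]:
    """Raise early if any mode failed to initialize (hypothetical helper)."""
    if any(mode is None for mode in modes):
        raise RuntimeError("Omni failed to initialize; check the error printed above.")
    return modes


# Stand-in for a failed initialize_system() call.
failed = (None, None, None)
try:
    require_modes(failed)
except RuntimeError as err:
    message = str(err)
    print(message)

# Stand-in for a successful call: non-None objects pass through unchanged.
ok = require_modes(("rewind", "focus", "timeless"))
```

In the notebook this could wrap the call directly, for example `rewind_mode, focus_mode, timeless_mode = require_modes(initialize_system())`.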


#### Helper function to pre-process vision input

In [None]:
from PIL import Image

def process_image(image_path: str, max_size: int = 1024) -> Image.Image:
    """
    Processes the image by converting it to RGB if necessary and resizing it 
    to ensure its largest dimension does not exceed `max_size`.

    Parameters:
    - image_path (str): Path to the input image.
    - max_size (int): Maximum allowed size for the largest dimension of the image. Default is 1024.

    Returns:
    - Image.Image: Processed image object.
    """
    # Open the image
    image = Image.open(image_path)
    
    # Convert to RGB if not already in RGB mode
    if image.mode != 'RGB':
        image = image.convert('RGB')
    
    # Resize the image if its largest dimension exceeds the max_size
    if max(image.size) > max_size:
        ratio = max_size / max(image.size)
        new_size = tuple(int(dim * ratio) for dim in image.size)
        image = image.resize(new_size, Image.LANCZOS)
    
    return image
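
To see what the resize step produces for a given input, the size calculation can be isolated from Pillow. The sketch below is a variant rather than a copy of `process_image`: `scaled_size` is a hypothetical helper, and it uses integer arithmetic instead of a float ratio with `int()` so that rounding cannot shave a pixel off the longest side.

```python
from typing import Tuple


def scaled_size(size: Tuple[int, int], max_size: int = 1024) -> Tuple[int, ...]:
    """Return dimensions scaled so the longest side is at most max_size."""
    longest = max(size)
    if longest <= max_size:
        return size  # already small enough, keep as-is
    # Integer math keeps the aspect ratio; max(1, ...) avoids a zero-sized side.
    return tuple(max(1, dim * max_size // longest) for dim in size)


print(scaled_size((4000, 3000)))  # (1024, 768)
print(scaled_size((800, 600)))    # (800, 600)
```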


# Omni in Action
### Now that we’ve explored the inner mechanics of Omni, it’s time to dive into some exciting real-world examples and witness how Omni truly shines in action!

In [None]:
rewind_mode, focus_mode, timeless_mode = initialize_system()

## Rewind Mode

In [None]:
# Query historical information
query = "Hey, can you give me a quick recap of the key points from my November meetings? Oh, and do I have any meetings coming up soon?"
response = rewind_mode.process_query(query)
print("Rewind Mode Response:", response)


In [None]:

query = "What was that philosophy book I recently read? Who’s the author again? And where did I leave off?"
response = rewind_mode.process_query(query)
print("Rewind Mode Response:", response)

In [None]:
query = "By the way, what hobbies did Sarah mention the last time we chatted? I’d love to remember!"
response = rewind_mode.process_query(query)
print("Rewind Mode Response:", response)

## Focus Mode - Working

In [None]:
import matplotlib.pyplot as plt

demo_1 = '/kaggle/input/synthetic-contextual-data/Image_Creative_Commons_License.png'
processed_image = process_image(demo_1, max_size=1024)
query = "What’s this doc I’m looking at? Anything important I should know?"
response = focus_mode.process_working(processed_image, None, query)

fig, ax = plt.subplots(1, 2, figsize=(15, 8))
img = plt.imread(demo_1)
ax[0].imshow(img)
ax[0].axis('off')
ax[0].set_title("Input Image")
ax[1].axis('off')
ax[1].set_title("Query and Response")
ax[1].text(0.1, 0.8, "User Query:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.7, query, fontsize=12, va='top', wrap=True)
ax[1].text(0.1, 0.5, "Model Response:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.4, response, fontsize=12, va='top', wrap=True)
plt.tight_layout()
plt.show()
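
Long model responses may spill outside the right-hand panel, since Matplotlib's `wrap=True` does not wrap text to the width of a single subplot. One option is to insert line breaks before plotting. The sketch below uses only the standard library; `wrap_for_panel` and the width of 60 characters are assumptions to tune for your figure size, and the sample text is a placeholder.

```python
import textwrap


def wrap_for_panel(text: str, width: int = 60) -> str:
    """Insert line breaks so no line exceeds `width` characters."""
    paragraphs = text.splitlines() or [""]
    return "\n".join(textwrap.fill(p, width=width) for p in paragraphs)


# Placeholder text standing in for a model response.
sample = ("This document is a Creative Commons license summary. " * 3).strip()
wrapped = wrap_for_panel(sample, width=40)
print(wrapped)
```

The wrapped string can then be passed to `ax[1].text(...)` in place of `response`.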


In [None]:
demo_2 = '/kaggle/input/synthetic-contextual-data/Image_LivingRoom_Navigation.jpg'
processed_image = process_image(demo_2, max_size=1024)
prequery = "You are a friendly assistant helping a blind user navigate their environment using the contextual information. Provide a friendly response. User Query: "
query = "Hey, can you point me to my couch? How far is it?"
t_query = prequery + query
response = focus_mode.process_working(processed_image, None, t_query)

fig, ax = plt.subplots(1, 2, figsize=(15, 8))
img = plt.imread(demo_2)
ax[0].imshow(img)
ax[0].axis('off')
ax[0].set_title("Input Image")
ax[1].axis('off')
ax[1].set_title("Query and Response")
ax[1].text(0.1, 0.8, "User Query:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.7, query, fontsize=12, va='top', wrap=True)
ax[1].text(0.1, 0.5, "Model Response:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.4, response, fontsize=12, va='top', wrap=True)
plt.tight_layout()
plt.show()


In [None]:
demo_3 = '/kaggle/input/synthetic-contextual-data/Image_Person_Waiting.jpg'
processed_image = process_image(demo_3, max_size=1024)
prequery = "You are a friendly assistant helping a blind user navigate their environment using the contextual information. Provide a friendly response. User Query: "
query = "Is there someone nearby? Where exactly are they?"
t_query = prequery + query
response = focus_mode.process_working(processed_image, None, t_query)

fig, ax = plt.subplots(1, 2, figsize=(15, 8))
img = plt.imread(demo_3)
ax[0].imshow(img)
ax[0].axis('off')
ax[0].set_title("Input Image")
ax[1].axis('off')
ax[1].set_title("Query and Response")
ax[1].text(0.1, 0.8, "User Query:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.7, query, fontsize=12, va='top', wrap=True)
ax[1].text(0.1, 0.5, "Model Response:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.4, response, fontsize=12, va='top', wrap=True)
plt.tight_layout()
plt.show()


## Focus Mode - Idle


I’m going to switch on Focus idle mode for three videos. After that, we’ll ask questions about them.  

- **Video 1** is a street travel video. It simulates a user visiting a place and later forgetting the street name. We’ll test it by texting Omni something like, "Hey, do you remember the street I visited recently?"  
- **Video 2** is an Apple keynote on UX writing. It simulates watching a technical talk, presentation, or research session and then texting Omni to ask things like, "Can you create a presentation speech on UX writing based on what I learned in my last session?"  
- **Video 3** is a TED-Ed video about why we dream. It simulates forgetting what you learned from an interesting video and then querying Omni with something like, "Can you tell me why we dream from the recent video we watched?"
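
For each video, `process_idle` (defined earlier in the notebook) keeps roughly one frame per second by selecting frames whose index is a multiple of `int(fps)`. A minimal sketch of that selection rule, with `sampled_frame_indices` as a hypothetical helper name and made-up frame counts:

```python
from typing import List


def sampled_frame_indices(total_frames: int, fps: float) -> List[int]:
    """Indices kept when sampling every int(fps)-th frame."""
    step = max(1, int(fps))  # guard against fps values below 1
    return list(range(0, total_frames, step))


print(sampled_frame_indices(total_frames=150, fps=29.97))
# [0, 29, 58, 87, 116, 145]
```

Note that truncating 29.97 to 29 makes the samples drift slightly ahead of exact one-second marks, which is harmless for summarization but worth knowing if timestamps matter.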

In [None]:
# I've already executed these, so I'm not re-running them in the saved version
video_1 = '/kaggle/input/synthetic-contextual-data/Video_Pexels_Street.mp4'
result = focus_mode.process_idle(video_1)
print('Completed travelling ✅')

video_2 = '/kaggle/input/tempample/Video_WWDC24_UX_Writing (online-video-cutter.com).mp4'
result = focus_mode.process_idle(video_2)
print('Completed keynote ✅')

video_3 = '/kaggle/input/synthetic-contextual-data/Video_Ted_ed_Why_do_we_dream.mp4'
result = focus_mode.process_idle(video_3)
print('Completed TEDed ✅')

In [None]:
query = "Hey, can you get me the street or city name where I got stuck in a lot of traffic?"
response = rewind_mode.process_query(query)
print("Response:", response)

In [None]:
query = "Can you create notes on UX writing based on what I learned in my last session?"
response = rewind_mode.process_query(query)
print("Response:", response)

In [None]:
query = "Can you give a quick summary about dreams from the recent video we watched?"
response = rewind_mode.process_query(query)
print("Response:", response)

In [None]:
query = "Can you tell me why we dream, based on the recent video we watched?"
response = rewind_mode.process_query(query)
print("Response:", response)

## Timeless Mode

In [None]:
tdemo_1 = '/kaggle/input/synthetic-contextual-data/Image_Refridgerator_Contents.jpg'
processed_image = process_image(tdemo_1, max_size=1024)
query = "Hey, I want to impress Sarah with my cooking! Can you suggest some recipes using these ingredients that she'd like?"
response = timeless_mode.process_query(frame=processed_image, audio_segment=None, query = query)

fig, ax = plt.subplots(1, 2, figsize=(15, 8))
img = plt.imread(tdemo_1)
ax[0].imshow(img)
ax[0].axis('off')
ax[0].set_title("Input Image")
ax[1].axis('off')
ax[1].set_title("Query and Response")
ax[1].text(0.1, 0.8, "User Query:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.7, query, fontsize=12, va='top', wrap=True)
ax[1].text(0.1, 0.5, "Model Response:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.4, response, fontsize=12, va='top', wrap=True)
plt.tight_layout()
plt.show()

In [None]:
tdemo_3 = '/kaggle/input/synthetic-contextual-data/Image_Car.jpg'
processed_image = process_image(tdemo_3, max_size=1024)
query = "Does this sedan with this license plate belong to Sarah?"
response = timeless_mode.process_query(frame=processed_image, audio_segment=None, query=query)

fig, ax = plt.subplots(1, 2, figsize=(15, 8))
img = plt.imread(tdemo_3)
ax[0].imshow(img)
ax[0].axis('off')
ax[0].set_title("Input Image")
ax[1].axis('off')
ax[1].set_title("Query and Response")
ax[1].text(0.1, 0.8, "User Query:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.7, query, fontsize=12, va='top', wrap=True)
ax[1].text(0.1, 0.5, "Model Response:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.4, response, fontsize=12, va='top', wrap=True)
plt.tight_layout()
plt.show()

In [None]:
tdemo_4 = '/kaggle/input/synthetic-contextual-data/Image_Book_Cover.jpg'
processed_image = process_image(tdemo_4, max_size=1024)
query = "Hey, have I picked up this book before? And which chapter did I last read?"
response = timeless_mode.process_query(frame=processed_image, audio_segment=None, query= query)

fig, ax = plt.subplots(1, 2, figsize=(15, 8))
img = plt.imread(tdemo_4)
ax[0].imshow(img)
ax[0].axis('off')
ax[0].set_title("Input Image")
ax[1].axis('off')
ax[1].set_title("Query and Response")
ax[1].text(0.1, 0.8, "User Query:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.7, query, fontsize=12, va='top', wrap=True)
ax[1].text(0.1, 0.5, "Model Response:", fontsize=14, fontweight='bold')
ax[1].text(0.1, 0.4, response, fontsize=12, va='top', wrap=True)
plt.tight_layout()
plt.show()