## **VisionMate: Spatial Awareness for the Visually Impaired**

> **A real-time assistive technology system that fuses Monocular Depth Estimation with Object Detection to provide audio-visual proximity alerts.**

### **📌 1. Project Overview**
Object detection models such as YOLO report *what* an object is and where it sits in the 2D image, but not how far away it is. For a visually impaired person, knowing "there is a car" is of little use without knowing whether it is **50 meters away (Safe)** or **2 meters away (Danger)**.

**VisionMate** solves this by combining state-of-the-art Computer Vision models to create a "Smart Blind Spot Monitor" that runs on standard camera hardware (no expensive LiDAR required).

#### **Key Features**
* **Multi-Modal AI Pipeline:** Fuses **YOLOv8** (Detection) + **Depth Anything V2** (Depth).
* **Real-Time Proximity Logic:** Estimates each detected object's approximate distance and assigns it to a proximity zone.
* **Audio-Visual Feedback:**
    * 🟢 **Green:** Safe Zone (> 2.5m)
    * 🟠 **Orange:** Warning Zone (1.5m - 2.5m)
    * 🔴 **Red + Audio Alarm:** Danger Zone (< 1.5m) -> **Triggers "BEEP" Sound** 🔊
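The zone thresholds above can be sketched as a small lookup. This is a minimal illustration rather than the project's actual code: the function name `classify_zone`, the constant names, and the BGR color tuples (chosen to suit OpenCV drawing calls) are assumptions. Only the 1.5 m and 2.5 m thresholds come from the list above.

```python
from typing import Tuple

# Thresholds in meters, taken from the feature list above.
DANGER_MAX_M = 1.5   # closer than this -> Danger Zone
WARNING_MAX_M = 2.5  # up to this       -> Warning Zone

# Hypothetical BGR colors for drawing boxes (OpenCV uses BGR order).
RED = (0, 0, 255)
ORANGE = (0, 165, 255)
GREEN = (0, 255, 0)


def classify_zone(distance_m: float) -> Tuple[str, Tuple[int, int, int], bool]:
    """Return (zone name, box color, whether the beep should play)."""
    if distance_m < DANGER_MAX_M:
        return "danger", RED, True
    if distance_m <= WARNING_MAX_M:
        return "warning", ORANGE, False
    return "safe", GREEN, False


print(classify_zone(1.2))  # ('danger', (0, 0, 255), True)
print(classify_zone(4.0))  # ('safe', (0, 255, 0), False)
```

Keeping the thresholds as named constants makes it easy to tune the zones for a different walking speed or camera placement.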

### **🛠️ 2. System Architecture**

The system processes video feeds in a 4-step pipeline:

1.  **Input:** Standard RGB Video Feed.
2.  **Parallel Inference:**
    * **Branch A:** YOLOv8n detects objects (`Person`, `Car`, `Bus`, `Truck`).
    * **Branch B:** Depth Anything V2 generates a dense (per-pixel) relative depth map.
3.  **Sensor Fusion (The Logic Layer):**
    * The system extracts the Depth Map region corresponding to the detected Object Bounding Box.
    * **Noise Filtering:** Takes the **70th percentile** of the depth scores inside the box, so that background pixels (e.g., the street behind a person) do not dominate the estimate.
    * **Distance Calibration:** Converts relative depth score ($S$) to meters ($D$) using the inverse formula:
        $$D \approx \frac{K}{S}$$
        *(where $K$ is an empirically calibrated constant, set to 250 for this prototype)*.
4.  **Output:** Overlays bounding boxes and distance labels on each frame and mixes audio alerts into the output video.
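The fusion step (crop the depth map to the box, take the 70th percentile, apply $D \approx K/S$) can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the helper names `percentile` and `box_distance_m` are hypothetical, the depth map is a plain nested list so the example runs with the standard library only, and the box is assumed to be `(x1, y1, x2, y2)` in pixels. In a real pipeline the depth map would be a NumPy array and `numpy.percentile` would replace the hand-written helper.

```python
from typing import Sequence, Tuple

K_CALIBRATION = 250.0  # calibration constant from the text
PERCENTILE = 70.0      # percentile used for noise filtering


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("empty depth region")
    rank = (len(ordered) - 1) * q / 100.0
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def box_distance_m(
    depth_map: Sequence[Sequence[float]],
    box: Tuple[int, int, int, int],
    k: float = K_CALIBRATION,
) -> float:
    """Estimate distance in meters for one box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    # Flatten the depth scores that fall inside the bounding box.
    region = [score for row in depth_map[y1:y2] for score in row[x1:x2]]
    score = percentile(region, PERCENTILE)
    if score <= 0:
        return float("inf")  # no usable depth signal: treat as far away
    return k / score


# Toy depth map: a "person" (score 125) in front of a background (score 20).
toy_depth = [
    [20, 20, 20, 20, 20, 20],
    [20, 125, 125, 125, 125, 20],
    [20, 125, 125, 125, 20, 20],
    [20, 20, 20, 20, 20, 20],
]
print(box_distance_m(toy_depth, (1, 1, 6, 3)))  # 2.0
```

Because the inverse formula means larger scores are nearer, a percentile above the median biases the estimate toward the object in the foreground rather than the scene behind it.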

### **💻 3. Tech Stack**
* **Language:** Python 3.10
* **Vision Models:** `Ultralytics YOLOv8`, `HuggingFace Transformers (Depth Anything V2)`
* **Video Processing:** `OpenCV`, `MoviePy` (for Audio Mixing)
* **Hardware Acceleration:** CUDA (NVIDIA T4 GPU)

### **🚀 4. Installation & Usage**

#### **Install Dependencies**

!pip install ultralytics transformers torch torchvision opencv-python moviepy

### **🔬 5. Methodology & Calibration**
One of the key challenges in monocular depth estimation is that the output is unitless: a relative score (here scaled to 0-255), not a distance in meters. To solve this, we implemented a **Calibration Protocol**:

1.  We measured a reference object (Person) at a known distance (approx. 2 meters).
2.  We derived a **Calibration Constant (K = 250)** that maps the model's intensity score to real-world meters.
3.  **Result:** With this constant, the system flags a person at ~1.5 meters as a **"STOP"** threat and distinguishes them from a person at 4 meters (Safe).
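The calibration step follows directly from $D \approx K/S$: rearranging gives $K = D \cdot S$ for a reference object at a known distance. The sketch below illustrates that arithmetic. The function name `calibrate_k` is hypothetical, and the reference score of 125 is simply the value implied by $K = 250$ at 2 meters, not a measurement reported in this document.

```python
def calibrate_k(known_distance_m: float, depth_score: float) -> float:
    """Derive K from one reference measurement using K = D * S."""
    if known_distance_m <= 0 or depth_score <= 0:
        raise ValueError("distance and depth score must be positive")
    return known_distance_m * depth_score


# Assumed reference: a person at 2 m whose filtered depth score is 125.
k = calibrate_k(2.0, 125.0)
print(k)  # 250.0

# With K fixed, any filtered depth score maps back to meters via D = K / S.
for score in (62.5, 125.0, 200.0):
    print(score, "->", round(k / score, 2), "m")
```

A single reference point fixes $K$ only for that camera and scene scale, so repeating the measurement at a few distances and averaging the resulting constants would be a reasonable sanity check.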

### **🔮 6. Future Scope**
* **Semantic Voice Navigation:** Upgrade from simple "Beeps" to Text-to-Speech (e.g., *"Car approaching on your left"*).
* **Haptic Integration:** Connect to a vibrating wristband for silent alerts in noisy traffic.
* **Auto-Calibration:** Use the average height of a detected human to dynamically calculate the camera's focal length, making the system plug-and-play for any camera.

---
**Name:** Mohammed Asadullah Shareef  
**Hub ID:** HUB2505058