<a href="https://colab.research.google.com/github/krtimisra67/SignSpeak-ELITE-PROJECT-/blob/main/YOLO_%2B_OCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook demonstrates how to integrate a trained YOLO object detection model (loaded with the Ultralytics package) with OCR using Google's Gemini API. After a YOLO model has been trained on a custom dataset, the notebook runs detection, crops the detected regions from the image, and sends each crop to a Gemini model (`gemini-2.0-flash`) to extract the visible text. The recognized text is then converted to speech with gTTS. The result is a pipeline that combines detection, text recognition, and audio output, useful for tasks like document analysis, UI scraping, or dark pattern detection.




# STEP 1: Install dependencies

In [7]:
!pip install ultralytics -q
!pip install gTTS -q
!apt-get install -y espeak ffmpeg -qq


In [8]:
!pip install -q google-generativeai pillow



# STEP 2: Import libraries





In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
import google.generativeai as genai
from PIL import Image
import os
from gtts import gTTS
from IPython.display import Audio, display
from ultralytics import YOLO
import cv2
import numpy as np
import matplotlib.pyplot as plt

Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.



# STEP 3: Upload your YOLO model and test image

In [10]:
from google.colab import files
uploaded = files.upload()  # Upload your custom YOLO model (.pt) and test image (.jpg/.png)


Saving best.pt to best.pt


# STEP 4: Run YOLO detection

In [12]:
import torch
import cv2
import numpy as np
from PIL import Image

# Replace with your model and image filenames
model_path = '/content/best.pt'
image_path = 'road_sign.jpg'
model = YOLO(model_path)
results = model(image_path)
boxes = results[0].boxes.xyxy.cpu().numpy()  # [x1, y1, x2, y2]


image 1/1 /content/road_sign.jpg: 640x640 1 Road Signboard, 15.6ms
Speed: 15.6ms preprocess, 15.6ms inference, 402.4ms postprocess per image at shape (1, 3, 640, 640)
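
Each detection also carries a confidence score (Ultralytics exposes it as `results[0].boxes.conf`). The following is a minimal sketch of dropping low-confidence detections before cropping. The helper name `filter_by_confidence` and the 0.5 threshold are assumptions for illustration, not part of the original notebook, and the sample values are made up.

```python
def filter_by_confidence(boxes, scores, min_conf=0.5):
    """Keep only the boxes whose confidence score is at least min_conf.

    boxes:  sequence of [x1, y1, x2, y2] coordinates
    scores: sequence of confidence values, one per box
    """
    return [list(box) for box, score in zip(boxes, scores) if score >= min_conf]


# Illustrative values only, not real model output
example_boxes = [[10, 20, 110, 220], [300, 40, 380, 90]]
example_scores = [0.91, 0.32]

kept = filter_by_confidence(example_boxes, example_scores)
print(kept)
```

In the notebook, `boxes.tolist()` and `results[0].boxes.conf.cpu().numpy().tolist()` could be passed in place of the example lists.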


# STEP 5: Crop detected objects for simplification

In [13]:
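image = cv2.imread(image_path)
if image is None:  # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError(f"Could not read image: {image_path}")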
cropped_images = []

for i, box in enumerate(boxes):
    x1, y1, x2, y2 = map(int, box[:4])
    cropped = image[y1:y2, x1:x2]
    filename = f'crop_{i}.jpg'
    cv2.imwrite(filename, cropped)
    cropped_images.append(filename)
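
The slicing above assumes every box lies inside the image. If a coordinate were negative, NumPy slicing would interpret it as an offset from the end, and a box entirely outside the image would give an empty crop. A small sketch of clamping a box to the image bounds, with optional padding so text near the edge of a sign is not cut off, could look like this. The helper `clamp_box` and the `pad` parameter are hypothetical additions, not part of the original notebook.

```python
def clamp_box(box, width, height, pad=0):
    """Clamp an [x1, y1, x2, y2] box to the image bounds.

    Returns integer coordinates (x1, y1, x2, y2), or None when the
    clamped box has no area and should be skipped.
    """
    x1, y1, x2, y2 = (int(v) for v in box[:4])
    x1 = max(0, x1 - pad)
    y1 = max(0, y1 - pad)
    x2 = min(width, x2 + pad)
    y2 = min(height, y2 + pad)
    if x2 <= x1 or y2 <= y1:
        return None
    return x1, y1, x2, y2


# Illustrative boxes for a 640x480 image
print(clamp_box([-5.2, 10.7, 650.0, 300.1], 640, 480))
print(clamp_box([100, 100, 200, 200], 640, 480, pad=4))
print(clamp_box([700, 10, 720, 50], 640, 480))
```

Inside the cropping loop, the width and height would come from `image.shape[1]` and `image.shape[0]`, and boxes for which the helper returns `None` would be skipped.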


# STEP 7: Text recognition with Gemini





In [15]:
import google.generativeai as genai
GEMINI_API_KEY = "YOUR_GEMINI_API_KEY"  # 🔁 Replace this with your key; never publish a real key in a shared notebook
genai.configure(api_key=GEMINI_API_KEY)
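
Because notebooks are often shared publicly, it is safer to keep the key out of the notebook source. One possible approach, sketched below, reads the key from an environment variable and fails early when it is missing. The helper `load_api_key` and the variable name `GEMINI_API_KEY` are assumptions for illustration; in Colab, the built-in secrets panel is another option.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Read the API key from an environment variable.

    Raises RuntimeError when the variable is unset or blank, so a
    missing key is reported before any API call is attempted.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage in the notebook (assumes the variable has been set beforehand):
# genai.configure(api_key=load_api_key())
```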

In [16]:
model = genai.GenerativeModel('gemini-2.0-flash')
response = model.generate_content("Say 'hello world'")
print(response.text)

Hello, world!



In [18]:
model = genai.GenerativeModel('gemini-2.0-flash')

# OCR from cropped image regions
cropped_images = ['crop_0.jpg']  # Replace with your actual crop list

for crop_path in cropped_images:
    img = Image.open(crop_path)

    prompt = "What text do you see in this image? Only return the visible text, no explanation."

    response = model.generate_content([prompt, img])
    print(f"\n📷 OCR Result from {crop_path}:\n{response.text}")


📷 OCR Result from crop_0.jpg:
लाल बहादुर
शास्त्री स्मृति
LAL BAHADUR
SHASTRI MEMORAIL
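
The model returns the text as a single string with line breaks, and responses may contain stray spaces or blank lines. A minimal sketch of normalizing such a response into a list of clean lines follows. The helper `clean_ocr_text` is a hypothetical addition, and the sample string is constructed for illustration rather than taken from a real response.

```python
def clean_ocr_text(raw):
    """Split OCR output into lines, collapse repeated whitespace,
    and drop lines that are empty after cleaning."""
    lines = [" ".join(line.split()) for line in raw.splitlines()]
    return [line for line in lines if line]


# Constructed sample with extra spaces and a blank line
raw_response = "  लाल बहादुर \n\nLAL   BAHADUR\n"
cleaned = clean_ocr_text(raw_response)
print(cleaned)
```

In the notebook, `response.text` would be passed in place of `raw_response`.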


# STEP 8: Perform OCR on cropped images with Gemini and speak the text with gTTS


In [20]:
cropped_images = ['crop_0.jpg']  # 🔁 Add more if needed

all_text = ""

for crop_path in cropped_images:
    try:
        img = Image.open(crop_path)

        prompt = "What text do you see in this image? Only return the visible text, no explanation. If there is no text, reply with exactly: no text found"

        response = model.generate_content([prompt, img])
        text = response.text.strip()

        if text.lower() != "no text found" and text:
            print(f"\n📷 OCR Result from {crop_path}:\n{text}")
            all_text += text + ". "

    except Exception as e:
        print(f"❌ Error reading {crop_path}: {e}")

# Speak the combined text
if all_text.strip():
    print("\n🗣️ Speaking Recognized Text:")
    print(all_text)
    tts = gTTS(all_text)  # gTTS defaults to lang='en'; pass lang='hi' for Hindi text
    tts.save("detected_text.mp3")
    display(Audio("detected_text.mp3", autoplay=True))
else:
    print("No text detected in any cropped image.")


📷 OCR Result from crop_0.jpg:
लाल बहादुर
शास्त्री स्मृति
LAL BAHADUR
SHASTRI MEMORAIL

🗣️ Speaking Recognized Text:
लाल बहादुर
शास्त्री स्मृति
LAL BAHADUR
SHASTRI MEMORAIL. 
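
The recognized text above mixes Hindi (Devanagari script) and English, while gTTS speaks a whole string in one language. One possible refinement, sketched here under that assumption, is to group consecutive lines by script so each group can be synthesized with a matching `lang` code. The helpers `guess_lang` and `group_by_lang` are hypothetical, and the script check is a simple heuristic based on the Devanagari Unicode block (U+0900 to U+097F).

```python
def guess_lang(line):
    """Return 'hi' if the line contains any Devanagari character, else 'en'."""
    return "hi" if any("\u0900" <= ch <= "\u097F" for ch in line) else "en"


def group_by_lang(text):
    """Group consecutive non-empty lines that share a guessed language.

    Returns a list of (lang, text) tuples in reading order.
    """
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        lang = guess_lang(line)
        if groups and groups[-1][0] == lang:
            # Same language as the previous line: extend that segment
            groups[-1] = (lang, groups[-1][1] + " " + line)
        else:
            groups.append((lang, line))
    return groups


# The OCR output shown above, reproduced verbatim
sample = "लाल बहादुर\nशास्त्री स्मृति\nLAL BAHADUR\nSHASTRI MEMORAIL"
segments = group_by_lang(sample)
for lang, segment in segments:
    print(lang, segment)
```

Each `(lang, segment)` pair could then be passed to `gTTS(segment, lang=lang)` and saved to its own audio file.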
