<a href="https://colab.research.google.com/github/jcsmcmendes/Step_Deep_Learning/blob/main/Identification%26Segmentation_YOLO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🎯 Object Detection with YOLO and Pre-trained Models

Until now, we've used deep learning models to classify entire images — deciding what is in the image overall. But what if we want to know where things are?

This is where object detection comes in.

In this section, we’ll explore YOLO (You Only Look Once) — one of the most popular and efficient object detection algorithms. Unlike simple classifiers, YOLO can:

    Detect multiple objects in the same image

    Predict their bounding boxes

    Identify what each object is, and where it is
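
As a rough mental model (not the Ultralytics API), each detection can be pictured as a small record holding a box, a class label, and a confidence score. The `Detection` class and its field names below are hypothetical, chosen only for illustration; the corner format (x1, y1, x2, y2) matches the `xyxy` boxes used later in this notebook.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one detected object (illustration only)."""
    x1: float  # left edge, in pixels
    y1: float  # top edge, in pixels
    x2: float  # right edge, in pixels
    y2: float  # bottom edge, in pixels
    label: str  # what the object is
    confidence: float  # score between 0 and 1

    @property
    def width(self) -> float:
        return self.x2 - self.x1

    @property
    def height(self) -> float:
        return self.y2 - self.y1

    @property
    def area(self) -> float:
        return self.width * self.height


# Made-up values, not taken from a real prediction
det = Detection(x1=10, y1=20, x2=110, y2=220, label="person", confidence=0.87)
print(det.label, det.width, det.height, det.area)
```

An image with several objects would simply yield a list of such records, one per object.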

💡 Just like before, we’ll use pre-trained YOLO models, so we don’t need to train anything from scratch. These models were trained on large datasets such as COCO and can already detect common objects (people, cars, animals, and so on).

📦 What we'll do:

    Load a pre-trained YOLO model

    Run it on one or more test images

    Visualize the detections (bounding boxes + class labels)

This will help us understand how deep learning is applied not just to classification, but also to localization and detection — a key concept in real-world applications like autonomous vehicles, video surveillance, and agriculture.

In [1]:
!pip install ultralytics


In [2]:
import cv2
import os
from ultralytics import YOLO
import numpy as np


Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.


In [4]:
# Load a COCO-pretrained YOLO12n model
model = YOLO("yolo12n.pt")

Downloading https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo12n.pt to 'yolo12n.pt'...


100%|██████████| 5.34M/5.34M [00:00<00:00, 64.4MB/s]


In [5]:
model.predict(source="/content/test_yolo.jpg", save=True)



image 1/1 /content/test_yolo.jpg: 384x640 5 persons, 10 cars, 1 truck, 2 sports balls, 422.6ms
Speed: 18.8ms preprocess, 422.6ms inference, 45.2ms postprocess per image at shape (1, 3, 384, 640)
Results saved to runs/detect/predict


[ultralytics.engine.results.Results object with attributes:
 
 boxes: ultralytics.engine.results.Boxes object
 keypoints: None
 masks: None
 names: {0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair', 57: 'couch', 58: 'potted p
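
The summary line printed above ("5 persons, 10 cars, 1 truck, 2 sports balls") is a per-class count of the detections. A minimal sketch of how such a count could be derived from class IDs, assuming a hypothetical `class_ids` list written by hand to mirror that summary and a four-entry excerpt of the `names` mapping shown in the output:

```python
from collections import Counter

# Excerpt of the COCO id-to-name mapping shown in the output above
names = {0: "person", 2: "car", 7: "truck", 32: "sports ball"}

# Hypothetical class IDs, written by hand to mirror the printed summary
class_ids = [0] * 5 + [2] * 10 + [7] + [32] * 2

# Translate each ID to its name and count occurrences per class
counts = Counter(names[c] for c in class_ids)
for label, n in counts.items():
    print(f"{n} {label}")
```

With a real result, the IDs would come from `results[0].boxes.cls`, as in the cropping cell further down.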

In [7]:
model.predict(source="/content/test_yolo.jpg", save=True, conf=0.6)


image 1/1 /content/test_yolo.jpg: 384x640 4 persons, 5 cars, 282.7ms
Speed: 5.8ms preprocess, 282.7ms inference, 2.3ms postprocess per image at shape (1, 3, 384, 640)
Results saved to runs/detect/predict


[ultralytics.engine.results.Results object with attributes:
 
 boxes: ultralytics.engine.results.Boxes object
 keypoints: None
 masks: None
 names: {0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair', 57: 'couch', 58: 'potted p

In [8]:
# Calling the model directly, model(...), is shorthand for model.predict(...).
# We pass the path of the image to analyze.
# The 'conf=0.6' argument sets the confidence threshold:
# detections scoring below 0.6 (60%) are discarded.
results = model("/content/test_yolo.jpg", conf=0.6)


image 1/1 /content/test_yolo.jpg: 384x640 4 persons, 5 cars, 189.4ms
Speed: 3.8ms preprocess, 189.4ms inference, 1.6ms postprocess per image at shape (1, 3, 384, 640)
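
Conceptually, the `conf` argument acts as a filter on the candidate detections. The sketch below applies that idea to a hand-written list of (label, score) pairs, following the "confidence >= threshold" reading in the cell's comments. The scores are invented for illustration, and `filter_by_confidence` is a hypothetical helper, not part of Ultralytics.

```python
def filter_by_confidence(detections, threshold=0.6):
    """Keep only (label, score) pairs whose score meets the threshold."""
    return [(label, score) for label, score in detections if score >= threshold]


# Invented scores, for illustration only
candidates = [("person", 0.91), ("car", 0.74), ("truck", 0.41), ("sports ball", 0.28)]
kept = filter_by_confidence(candidates, threshold=0.6)
print(kept)
```

Raising the threshold trades recall for precision: fewer false alarms, but low-scoring true objects (here the truck and the ball) disappear as well.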


In [11]:
# Load the original image using OpenCV
img_path = "/content/test_yolo.jpg"
img = cv2.imread(img_path)

# Get the mapping from class ID to class name used by the model (e.g., {0: 'person', 1: 'bicycle', ...})
class_names = model.names

# Create a directory to save the cropped object images
output_dir = "Cropped_objects"
os.makedirs(output_dir, exist_ok=True)

# Access detection results: bounding boxes and class indices
boxes = results[0].boxes
xyxy = boxes.xyxy.cpu().numpy()               # Bounding box coordinates
classes = boxes.cls.cpu().numpy().astype(int) # Class IDs as integers

# Loop through each detected object
for i, (box, cls_id) in enumerate(zip(xyxy, classes)):
    x1, y1, x2, y2 = map(int, box)  # Convert coordinates to integers
    label = class_names[cls_id]     # Get the class label (e.g., 'dog', 'car')

    # Crop the object from the original image using the bounding box
    cropped = img[y1:y2, x1:x2]

    # Create a filename like 'dog_0.jpg', 'car_1.jpg', etc.
    filename = f"{label}_{i}.jpg"

    # Save the cropped image to the output directory
    cv2.imwrite(os.path.join(output_dir, filename), cropped)

    print(f"Saved: {filename}")  # Log the saved file

Saved: person_0.jpg
Saved: person_1.jpg
Saved: person_2.jpg
Saved: person_3.jpg
Saved: car_4.jpg
Saved: car_5.jpg
Saved: car_6.jpg
Saved: car_7.jpg
Saved: car_8.jpg
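
One detail the cropping loop does not cover: if boxes are ever padded or come from another source, their coordinates may fall outside the image, and a negative index in a NumPy slice wraps around instead of raising an error. A defensive sketch, assuming two hypothetical helpers, `clamp_box` and `is_empty`, that are not part of OpenCV or Ultralytics:

```python
def clamp_box(box, img_width, img_height):
    """Truncate a float (x1, y1, x2, y2) box to ints and clip it to the image."""
    x1, y1, x2, y2 = (int(v) for v in box)
    x1 = max(0, min(x1, img_width))
    x2 = max(0, min(x2, img_width))
    y1 = max(0, min(y1, img_height))
    y2 = max(0, min(y2, img_height))
    return x1, y1, x2, y2


def is_empty(box):
    """True when the clipped box has no pixels left to crop."""
    x1, y1, x2, y2 = box
    return x2 <= x1 or y2 <= y1


# A made-up box that spills over the left and right edges of a 640x384 image
clipped = clamp_box((-3.2, 10.7, 650.9, 200.0), img_width=640, img_height=384)
print(clipped, is_empty(clipped))
```

In the loop above, such a check would sit just before `img[y1:y2, x1:x2]`, skipping any box for which `is_empty` returns `True`.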


In [12]:
input_dir = "Cropped_objects"

# 1. Load images and calculate their area
image_data = []
for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    img = cv2.imread(path)  # Read the image
    if img is not None:
        h, w = img.shape[:2]           # Get image height and width
        area = w * h                   # Compute the area (in pixels)
        image_data.append((filename, area, img))  # Store filename, area, and image

# 2. Sort the images by area in descending order
# This helps us identify the largest detected objects
image_data_sorted = sorted(image_data, key=lambda x: x[1], reverse=True)

# 3. Select the top 7 largest images
top_7 = image_data_sorted[:7]
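
The same ranking can also be computed directly from (width, height) pairs, without re-reading the saved crops. A small sketch using `heapq.nlargest` from the standard library; the `crops` list and its sizes are made up for illustration:

```python
import heapq

# Made-up (filename, width, height) entries, for illustration only
crops = [
    ("person_0.jpg", 60, 180),
    ("car_4.jpg", 220, 120),
    ("car_5.jpg", 90, 50),
    ("person_1.jpg", 40, 110),
]

# Rank by pixel area (width * height) and keep the k largest
k = 2
largest = heapq.nlargest(k, crops, key=lambda c: c[1] * c[2])
print([name for name, _, _ in largest])
```

Like the slice `[:7]`, `nlargest` simply returns fewer items when fewer than `k` are available.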

In [13]:
# Load a COCO-pretrained YOLO11n segmentation model
model_seg = YOLO('yolo11n-seg.pt')

Downloading https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11n-seg.pt to 'yolo11n-seg.pt'...


100%|██████████| 5.90M/5.90M [00:00<00:00, 67.9MB/s]


In [14]:
model_seg.predict(source='/content/Cropped_objects/car_4.jpg', save=True)


image 1/1 /content/Cropped_objects/car_4.jpg: 544x640 1 person, 1 car, 403.8ms
Speed: 7.2ms preprocess, 403.8ms inference, 17.5ms postprocess per image at shape (1, 3, 544, 640)
Results saved to runs/segment/predict


[ultralytics.engine.results.Results object with attributes:
 
 boxes: ultralytics.engine.results.Boxes object
 keypoints: None
 masks: ultralytics.engine.results.Masks object
 names: {0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 
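
The next cell uses a segmentation mask to replace the background with white. As a toy version of that idea, the sketch below thresholds a soft mask and composites with `np.where`. It assumes small synthetic arrays rather than real model output, so the shapes and values are illustrative only.

```python
import numpy as np

# Tiny synthetic 4x4 BGR "image" filled with a mid-gray value
img = np.full((4, 4, 3), 100, dtype=np.uint8)

# Synthetic soft mask: values near 1 mark the object, near 0 the background
soft_mask = np.zeros((4, 4), dtype=np.float32)
soft_mask[1:3, 1:3] = 0.9

# Threshold to a boolean mask, then keep object pixels and whiten the rest
is_object = soft_mask > 0.5
white = np.full_like(img, 255)
result = np.where(is_object[..., None], img, white)

# Show one channel: the 2x2 centre keeps its value, the border turns white
print(result[:, :, 0])
```

The `[..., None]` adds a trailing axis so the (H, W) mask broadcasts across the three colour channels.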

In [15]:
output_dir = "segmented_white_background"
os.makedirs(output_dir, exist_ok=True)  # Create output directory if it doesn't exist

# Loop through the top 7 largest cropped objects
for i, (filename, _, img) in enumerate(top_7):
    results = model_seg(img)  # Run the segmentation model on the image
    seg = results[0]          # Get the first (and only) result

    # Check if any mask was detected
    if seg.masks is None or seg.boxes.conf is None:
        print(f"No segmentation detected for {filename}")
        continue

    # Extract segmentation masks and their confidence scores
    masks = seg.masks.data.cpu().numpy()       # (N, H, W) — binary masks
    confs = seg.boxes.conf.cpu().numpy()       # confidence values for each detected object

    # Use the mask with the highest confidence
    best_idx = confs.argmax()
    best_mask = masks[best_idx]

    # Resize the mask to match the image size
    resized_mask = cv2.resize(best_mask, (img.shape[1], img.shape[0]))

    # Convert the resized mask to a boolean array: True where the object is
    object_mask = resized_mask > 0.5

    # Copy the image and paint every non-object (background) pixel white
    result_img = img.copy()
    result_img[~object_mask] = [255, 255, 255]

    # Save the final segmented image with white background
    output_path = os.path.join(output_dir, f"segmented_{filename}")
    cv2.imwrite(output_path, result_img)
    print(f"Saved: {output_path}")


0: 640x384 2 persons, 1 car, 298.0ms
Speed: 4.3ms preprocess, 298.0ms inference, 23.7ms postprocess per image at shape (1, 3, 640, 384)
Saved: segmented_white_background/segmented_person_0.jpg

0: 640x384 1 person, 2 cars, 284.2ms
Speed: 4.1ms preprocess, 284.2ms inference, 16.3ms postprocess per image at shape (1, 3, 640, 384)
Saved: segmented_white_background/segmented_person_1.jpg

0: 640x416 1 person, 1 car, 1 bench, 387.0ms
Speed: 4.6ms preprocess, 387.0ms inference, 25.6ms postprocess per image at shape (1, 3, 640, 416)
Saved: segmented_white_background/segmented_person_2.jpg

0: 640x352 1 person, 2 cars, 313.1ms
Speed: 3.9ms preprocess, 313.1ms inference, 18.6ms postprocess per image at shape (1, 3, 640, 352)
Saved: segmented_white_background/segmented_person_3.jpg

0: 544x640 1 person, 1 car, 437.2ms
Speed: 4.2ms preprocess, 437.2ms inference, 18.3ms postprocess per image at shape (1, 3, 544, 640)
Saved: segmented_white_background/segmented_car_4.jpg

0: 608x640 (no detections