### Project members

**Sali Raffaele**:
- 📧 [raffaele.sali@studio.unibo.it](mailto:raffaele.sali@studio.unibo.it)
- Student Number: `0001167817`

**Zanotti Niccolò**:
- 📧 [niccolo.zanotti@studio.unibo.it](mailto:niccolo.zanotti@studio.unibo.it)
- Student Number: `0001121646`

**Zocco Ramazzo Marco**:
- 📧 [marco.zoccoramazzo@studio.unibo.it](mailto:marco.zoccoramazzo@studio.unibo.it)
- Student Number: `0001198289`

---

# **Product Recognition of Books**

## Image Processing and Computer Vision - Assignment Module \#1


Contacts:

- Prof. Giuseppe Lisanti -> giuseppe.lisanti@unibo.it
- Prof. Samuele Salti -> samuele.salti@unibo.it
- Alex Costanzino -> alex.costanzino@unibo.it
- Francesco Ballerini -> francesco.ballerini4@unibo.it

---


Computer vision-based object detection techniques can be applied in library or bookstore settings to build a system that identifies books on shelves.

Such a system could assist in:
* Helping visually impaired users locate books by title/author;
* Automating inventory management (e.g., detecting misplaced or out-of-stock books);
* Enabling faster book retrieval by recognizing spine text or cover designs.

## Task
Develop a computer vision system that, given a reference image for each book, can identify that book in a picture of a shelf.

<figure>
<a href="https://ibb.co/pvLVjbM5"><img src="https://i.ibb.co/svVx9bNz/example.png" alt="example" border="0"></a>
</figure>

For each type of product displayed on the shelf, the system should compute a bounding box aligned with the book spine or cover and report:
1. Number of instances;
1. Dimension of each instance (area in pixels of the bounding box that encloses it);
1. Position in the image reference system of each instance (four corners of the bounding box that encloses it);
1. Overlay of the bounding boxes on the scene images.

<font color="red"><b>Each step of this assignment must be solved using traditional computer vision techniques.</b></font>

#### Example of expected output
```
Book 0 - 2 instance(s) found:
  Instance 1 {top_left: (100,200), top_right: (110, 220), bottom_left: (10, 202), bottom_right: (10, 208), area: 230px}
  Instance 2 {top_left: (90,310), top_right: (95, 340), bottom_left: (24, 205), bottom_right: (23, 234), area: 205px}
Book 1 - 1 instance(s) found:
.
.
.
```
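
A minimal sketch of how a report in this layout could be produced. The `Instance` tuple and the `format_report` helper are hypothetical names introduced only for illustration, and the coordinates are made up.

```python
from typing import Dict, List, NamedTuple, Tuple

Point = Tuple[int, int]


class Instance(NamedTuple):
    """Hypothetical container for one detected instance."""

    top_left: Point
    top_right: Point
    bottom_left: Point
    bottom_right: Point
    area: int


def format_report(found: Dict[int, List[Instance]]) -> str:
    """Render detections in the textual layout of the expected output."""
    lines = []
    for book_id in sorted(found):
        instances = found[book_id]
        lines.append(f"Book {book_id} - {len(instances)} instance(s) found:")
        for n, inst in enumerate(instances, start=1):
            lines.append(
                f"  Instance {n} {{top_left: {inst.top_left}, "
                f"top_right: {inst.top_right}, "
                f"bottom_left: {inst.bottom_left}, "
                f"bottom_right: {inst.bottom_right}, area: {inst.area}px}}"
            )
    return "\n".join(lines)


# Made-up detection: an upright 50 x 200 px spine.
report = format_report(
    {0: [Instance((10, 20), (60, 20), (10, 220), (60, 220), 10000)]}
)
print(report)
```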

---

## Approach Overview

This solution implements a **traditional computer vision pipeline** for detecting books on shelves.

### Key Design Decisions

After extensive experimentation, we adopted a **simplified pipeline** based on the following observations about our dataset:

1. **Books are nearly planar**: Shelf images have minimal perspective distortion
2. **Books are upright or horizontal**: Spines can be vertical or horizontal (stacked shelves)
3. **Scale is consistent**: Model and scene images have similar scale
4. **Multiple copies are adjacent**: Same book appears side-by-side on shelves

These characteristics led us to choose:

| Component | Choice | Justification |
|-----------|--------|---------------|
| Preprocessing | CLAHE | Enhances local contrast and reveals texture |
| Features | RootSIFT | ~15-30% better matching than standard SIFT |
| Matching | BF 5NN consecutive ratio test | Captures good matches at any rank position, fundamental for detecting multiple instances in the same scene |
| Geometric model | **Similarity transform** (4 DOF) | Uniform scale + rotation + translation; strongest inductive bias for books |
| Multi-instance | Iterative detection with keypoint index exclusion | Handles multiple copies efficiently |


## Setup and dependencies installation

In the following, we assume that you have
- created a local Python virtual environment, either with the Python [venv](https://docs.python.org/3/library/venv.html) module or via [uv](https://github.com/astral-sh/uv) (preferred), with the `ipykernel` or `jupyter` package pre-installed to start the Jupyter kernel;
- `git` installed on your machine;
- a working internet connection.

We will now download the `pyproject.toml` file specifying the project dependencies.

In [None]:
from pathlib import Path


def get_project_root() -> Path:
    """Return the root directory of the project."""
    start_dir = Path.cwd()

    markers = ["assignment1.ipynb"]

    for path in [start_dir, *list(start_dir.parents)]:
        for marker in markers:
            if (path / marker).exists():
                return path

    return start_dir


PROJECT_ROOT: Path = get_project_root()

In [None]:
import urllib.request

PROJECT_REPO: str = "niccolozanotti/ipcv-assignments"
COMMIT_HASH: str = "9f1f600af59401673e2e816b12d1ae740dc4386b"

pyproject_url = (
    f"https://raw.githubusercontent.com/{PROJECT_REPO}/{COMMIT_HASH}/pyproject.toml"
)
lockfile_url = f"https://raw.githubusercontent.com/{PROJECT_REPO}/{COMMIT_HASH}/uv.lock"

urllib.request.urlretrieve(pyproject_url, PROJECT_ROOT / "pyproject.toml")
urllib.request.urlretrieve(lockfile_url, PROJECT_ROOT / "uv.lock");
pyproject_url

If using [uv](https://github.com/astral-sh/uv) (recommended) you can now install the dependencies to a local virtual environment at `.venv` simply via
```sh
uv sync --extra assignment1
```

If not, the same can be achieved with the usual python [venv](https://docs.python.org/3/library/venv.html):
```sh
python3 -m venv .venv
source .venv/bin/activate
pip install ".[assignment1]"
```

Make sure to do the above and *restart the kernel* if necessary before proceeding.

In [None]:
import warnings
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

import cv2
import matplotlib.pyplot as plt
import numpy as np

import os
import shutil
import subprocess
import zipfile
from pathlib import Path

warnings.filterwarnings("ignore")

## Configuration

The following class contains all the parameters used throughout the notebook. The meaning of each parameter is specified.

In [None]:
class Config:
    """
    Configuration parameters for the detection pipeline.

    Each parameter is justified based on dataset characteristics
    and experimental results.
    """

    # === FEATURE DETECTION ===
    # SIFT parameters (used via RootSIFT)
    SIFT_FEATURES = 0                # 0 = keep all detected features
    SIFT_N_OCTAVE_LAYERS = 7         # Increased from the default (3) to capture more detail
    SIFT_CONTRAST_THRESHOLD = 0.02   # Lowered from the default (0.04) to detect more keypoints
    SIFT_EDGE_THRESHOLD = 10         # Default value
    SIFT_SIGMA = 1.2                 # Slightly lower than default 1.6 to detect more keypoints

    # === PREPROCESSING ===
    # CLAHE (Contrast Limited Adaptive Histogram Equalization)
    CLAHE_CLIP_LIMIT = 2.0           # Moderate contrast enhancement
    CLAHE_GRID_SIZE = (4, 4)         # 4x4 grid = larger tiles than the default 8x8; less noise amplification
    APPLY_PREPROCESSING = True       # Enable/disable CLAHE

    # === FEATURE MATCHING ===
    # BF 5NN consecutive ratio test:
    KNN_K = 5                        # Number of nearest neighbors considered
    KNN_CONSECUTIVE_RATIO = 0.6      # Consecutive neighbor ratio threshold

    # === GEOMETRIC VERIFICATION ===
    # Similarity transformation (4 DOF) with RANSAC
    MIN_MATCH_COUNT = 3                  # Minimum matches to attempt estimation
    RANSAC_REPROJ_THRESHOLD = 4.0        # Pixels; allows small localization errors

    # === DETECTION VALIDATION ===
    MIN_INLIERS = 3                  # Minimum inliers for valid detection
    MIN_INLIERS_RATIO = 1 / 3        # At least 1/3 of matches should be inliers
    MIN_AREA = 1000                  # Minimum bounding box area (pixels)
    MAX_AREA = 100000                # Maximum bounding box area (pixels)
    MIN_EXTENT = 0.5                 # Min ratio: contour area / bounding rect area
    LOW_EXTENT_THRESHOLD = 0.65      # Below this extent, require extra inlier support
    MIN_INLIERS_LOW_EXTENT = 5       # Minimum inliers when extent is below LOW_EXTENT_THRESHOLD
    TOLERANCE_on_IMAGE_LIMITS = 4    # Tolerance in pixels by which the bounding box may exceed the frame.
                                     # Fundamental for detecting model_14 in scene_4, where a correct
                                     # detection had a bounding box 3 pixels outside the frame.

    # === MULTI-INSTANCE DETECTION ===
    MAX_INSTANCES_PER_BOOK = 10      # Safety limit
    SAME_BOOK_IOU_THRESHOLD = 0.30   # Reject same-book detection if >30% overlap with previous instance

## Data Classes

Structured representation of detection results.

In [None]:
@dataclass
class BoundingBox:
    """
    Represents a detected book instance.

    Stores the four corners of the bounding quadrilateral,
    which may not be axis-aligned because the similarity transform allows rotation.
    """

    top_left: Tuple[int, int]
    top_right: Tuple[int, int]
    bottom_right: Tuple[int, int]
    bottom_left: Tuple[int, int]
    area: int
    n_inliers: int
    inlier_ratio: float

    def get_polygon(self) -> np.ndarray:
        """Return corners as numpy array for geometric operations."""
        return np.array(
            [self.top_left, self.top_right, self.bottom_right, self.bottom_left],
            dtype=np.float32,
        )


@dataclass
class BookDetection:
    """All detections for a single book model in a scene."""

    book_id: int
    model_path: str
    instances: List[BoundingBox]

## RootSIFT Feature Extractor

### Why RootSIFT?

Standard SIFT descriptors are compared using Euclidean distance. However, SIFT descriptors are histograms of gradient orientations, and the **Hellinger kernel** (also known as the Bhattacharyya coefficient) is more appropriate for histogram comparison.

RootSIFT achieves this by:
1. L1-normalizing the SIFT descriptor
2. Taking the element-wise square root

This allows Euclidean distance to implicitly compute Hellinger distance, providing **~15-30% better matching accuracy**.

**Reference**: Arandjelović & Zisserman (2012), *Three things everyone should know to improve object retrieval*, IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, DOI: [10.1109/CVPR.2012.6248018](https://doi.org/10.1109/CVPR.2012.6248018)
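
The link between the two steps and the Hellinger kernel can be checked numerically. The sketch below uses two made-up 4-bin histograms in place of real 128-dimensional SIFT descriptors and verifies that the squared Euclidean distance between the RootSIFT vectors equals `2 - 2 * H`, where `H` is the Hellinger kernel of the L1-normalized originals. The helper names are hypothetical.

```python
import math
from typing import List


def l1_normalize(v: List[float]) -> List[float]:
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(v)
    return [x / total for x in v]


def root_sift(v: List[float]) -> List[float]:
    """L1-normalize, then take the element-wise square root."""
    return [math.sqrt(x) for x in l1_normalize(v)]


# Toy stand-ins for SIFT descriptors (real ones have 128 non-negative bins).
a = [4.0, 1.0, 0.0, 3.0]
b = [1.0, 2.0, 2.0, 3.0]

# Hellinger kernel of the L1-normalized histograms.
hellinger = sum(
    math.sqrt(x * y) for x, y in zip(l1_normalize(a), l1_normalize(b))
)

# Squared Euclidean distance between the RootSIFT vectors.
ra, rb = root_sift(a), root_sift(b)
sq_dist = sum((x - y) ** 2 for x, y in zip(ra, rb))

print(math.isclose(sq_dist, 2.0 - 2.0 * hellinger))
```

Because each RootSIFT vector has unit L2 norm, ranking neighbors by Euclidean distance is equivalent to ranking them by decreasing Hellinger kernel.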

In [None]:
class RootSIFT:
    """
    RootSIFT feature extractor.

    Enhances SIFT descriptors for better matching performance
    by applying L1 normalization followed by square root.
    """

    def __init__(self):
        self.sift = cv2.SIFT_create(
            nfeatures=Config.SIFT_FEATURES,
            nOctaveLayers=Config.SIFT_N_OCTAVE_LAYERS,
            contrastThreshold=Config.SIFT_CONTRAST_THRESHOLD,
            edgeThreshold=Config.SIFT_EDGE_THRESHOLD,
            sigma=Config.SIFT_SIGMA,
        )

    def detect_and_compute(self, image: np.ndarray):
        """
        Detect keypoints and compute RootSIFT descriptors.

        Args:
            image: Grayscale input image

        Returns:
            keypoints: List of cv2.KeyPoint
            descriptors: RootSIFT descriptors (Nx128 float32 array)
        """
        keypoints, descriptors = self.sift.detectAndCompute(image, None)

        if descriptors is None or len(descriptors) == 0:
            return keypoints, None

        # Convert to RootSIFT
        eps = 1e-7

        # Step 1: L1 normalize
        descriptors = descriptors / (np.sum(descriptors, axis=1, keepdims=True) + eps)

        # Step 2: Square root (Hellinger kernel)
        descriptors = np.sqrt(descriptors)

        return keypoints, descriptors.astype(np.float32)

## Image Preprocessing

### Grayscale conversion + Contrast Limited Adaptive Histogram Equalization (CLAHE)

CLAHE is applied to enhance local contrast. We initially used the default tile grid (`8x8`), but we noticed that it could be problematic on some noisy images of the dataset.
After some tweaking we decided to lower it to `4x4`. A coarser grid means larger tiles, which gives a more global behavior, smoother results, less exaggerated local contrast and, more importantly, less noise amplification.
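
To make the effect of the grid size concrete, the sketch below computes the approximate tile size for a square `n x n` grid. The 480x640 image size is an assumption chosen for illustration, not a dataset value, and integer division only approximates how tiles are laid out when the image size is not a multiple of the grid.

```python
from typing import Tuple


def approx_tile_size(image_shape: Tuple[int, int], n: int) -> Tuple[int, int]:
    """Approximate (height, width) in pixels of one tile in an n x n grid."""
    height, width = image_shape
    return height // n, width // n


image_shape = (480, 640)  # hypothetical (height, width)

for n in (8, 4):
    tile_h, tile_w = approx_tile_size(image_shape, n)
    print(f"{n}x{n} grid -> tiles of about {tile_h}x{tile_w} px")
```

Each histogram is computed over one tile, so halving the grid dimension quadruples the number of pixels per histogram.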

In [None]:
class ImagePreprocessor:
    """
    Preprocessing pipeline: grayscale conversion + CLAHE.
    """

    def __init__(self):
        self.clahe = cv2.createCLAHE(
            clipLimit=Config.CLAHE_CLIP_LIMIT, tileGridSize=Config.CLAHE_GRID_SIZE
        )

    def preprocess(self, image: np.ndarray) -> np.ndarray:
        """
        Convert to grayscale and, if enabled, apply CLAHE.

        Args:
            image: BGR input image

        Returns:
            Preprocessed grayscale image
        """
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()
        if Config.APPLY_PREPROCESSING:
            return self.clahe.apply(gray)
        else:
            return gray

## Feature Matching

### BF 5NN Consecutive Ratio Test

For each model keypoint's 5 nearest neighbors we compare consecutive distances:   
if `match[i].distance < 0.6 * match[i+1].distance` keep `match[i]`.

Unlike the standard Lowe's ratio test that only checks rank 1 vs rank 2, this captures good matches at **any rank position**. This is particularly valuable for multi-instance detection, in which a model keypoint may match multiple scene locations at different ranks.
We chose `k=5` because no scene in the dataset contains more than 4 instances of the same book. `KNN_K` in `Config` can be increased for robustness and generalization.
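
A toy illustration of the rule, using made-up distances for one model keypoint whose three closest neighbors are almost equally good (as could happen with three copies of the same book). The helper name is hypothetical.

```python
from typing import List


def consecutive_ratio_ranks(distances: List[float], ratio: float = 0.6) -> List[int]:
    """Return the ranks i for which distances[i] < ratio * distances[i + 1]."""
    return [
        i
        for i in range(len(distances) - 1)
        if distances[i] < ratio * distances[i + 1]
    ]


# Made-up sorted distances of the 5 nearest neighbors of one model keypoint.
distances = [0.20, 0.22, 0.24, 0.60, 0.65]

kept = consecutive_ratio_ranks(distances)
lowe_passes = distances[0] < 0.6 * distances[1]  # rank 0 vs rank 1 only

print(kept, lowe_passes)  # [2] False
```

With these values the standard ratio test discards the keypoint entirely, because ranks 0 and 1 are too similar, while the consecutive test still keeps the match at rank 2, where the distance gap occurs.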

In [None]:
class FeatureMatcher:
    """
    Feature matcher using BF 5NN consecutive ratio test.

    For each model keypoint, finds 5 nearest neighbors in the scene
    and keeps match[i] if match[i].distance < ratio * match[i+1].distance.
    """

    def __init__(self):
        self.bf = cv2.BFMatcher(cv2.NORM_L2)

    def match(
        self,
        des_model: np.ndarray,
        des_scene: np.ndarray,
        excluded_indices: Optional[Set[int]] = None,
    ) -> List[cv2.DMatch]:
        """
        Match model descriptors to scene descriptors.

        Args:
            des_model: Model descriptors
            des_scene: Scene descriptors
            excluded_indices: Scene keypoint indices to exclude (already matched)

        Returns:
            List of good matches
        """
        if des_model is None or des_scene is None:
            return []
        if len(des_model) < 2 or len(des_scene) < Config.KNN_K:
            return []

        if excluded_indices is None:
            excluded_indices = set()

        try:
            knn_matches = self.bf.knnMatch(des_model, des_scene, k=Config.KNN_K)
        except cv2.error:
            return []

        good_matches = []
        seen = set()

        for match_group in knn_matches:
            for i in range(len(match_group) - 1):
                m = match_group[i]
                m_next = match_group[i + 1]

                if m.trainIdx in excluded_indices:
                    continue

                if m.distance < Config.KNN_CONSECUTIVE_RATIO * m_next.distance:
                    key = (m.queryIdx, m.trainIdx)
                    if key not in seen:
                        seen.add(key)
                        good_matches.append(m)

        return good_matches


## Similarity Transform Estimator

### Similarity Transform (Partial Affine) instead of Full Affine or Homography

We use `cv2.estimateAffinePartial2D` which estimates a **similarity transform** (4 DOF):

$$
\begin{bmatrix}
x' \\
y'
\end{bmatrix}
=
\begin{bmatrix}
s \cos\theta & -s \sin\theta \\
s \sin\theta & s \cos\theta
\end{bmatrix}
\begin{bmatrix}
x \\
y
\end{bmatrix}
+
\begin{bmatrix}
t_x \\
t_y
\end{bmatrix}
$$

This allows only:
- **Uniform scaling** (single scale factor $s$)
- **Rotation** (angle $\theta$)
- **Translation** ($t_x$, $t_y$)
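
As a small worked example, the sketch below builds the 2×3 matrix of the formula above from hand-picked parameters and applies it to the corners of a 1×3 rectangle. The helper names and the parameter values are assumptions for illustration only.

```python
import math
from typing import List, Tuple


def similarity_matrix(s: float, theta: float, tx: float, ty: float) -> List[List[float]]:
    """Return the 2x3 matrix [[s cos t, -s sin t, tx], [s sin t, s cos t, ty]]."""
    c, n = s * math.cos(theta), s * math.sin(theta)
    return [[c, -n, tx], [n, c, ty]]


def apply(M: List[List[float]], pt: Tuple[float, float]) -> Tuple[float, float]:
    """Apply a 2x3 transform to a single 2D point."""
    x, y = pt
    return (
        M[0][0] * x + M[0][1] * y + M[0][2],
        M[1][0] * x + M[1][1] * y + M[1][2],
    )


# Scale by 2, rotate by 90 degrees, translate by (10, 5).
M = similarity_matrix(2.0, math.pi / 2, 10.0, 5.0)

corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0), (0.0, 3.0)]
projected = [tuple(round(v, 6) for v in apply(M, p)) for p in corners]
print(projected)  # [(10.0, 5.0), (10.0, 7.0), (4.0, 7.0), (4.0, 5.0)]
```

The projected rectangle has sides of length 2 and 6, so the 1:3 aspect ratio of the input is preserved.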

### Why Similarity Transform (4 DOF) instead of Full Affine (6 DOF) or Homography (8 DOF)?

| Property | Homography | Full Affine | Similarity (Partial Affine) |
|----------|------------|-------------|----------------------------|
| Degrees of freedom | 8 | 6 | **4** |
| Minimum points | 4 | 3 | **2** |
| Handles perspective | Yes | No | No |
| Preserves aspect ratio | No | No | **Yes** |
| Allows shear | Yes | Yes | **No** |
| Stability with few points | Lowest | Medium | **Highest** |

Books on shelves do not undergo perspective distortion, shear, or non-uniform scaling. The similarity transform (`cv2.estimateAffinePartial2D`) encodes exactly this **inductive bias**: it only allows uniform scaling, rotation, and translation. This makes RANSAC converge faster and more reliably, especially with the limited matches (~8-15) typical of narrow book spines.
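
Two correspondences suffice because they provide four equations for the four unknowns. A compact way to see this is to treat points as complex numbers, so that the transform becomes `w = a * z + b` with `a = s * exp(i * theta)` and `b = tx + i * ty`. The sketch below is only an illustration of that minimal solve under this formulation; it is not a description of how OpenCV implements its estimator, and the helper name is hypothetical.

```python
import cmath
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def similarity_from_two_points(
    src: Sequence[Point], dst: Sequence[Point]
) -> Tuple[float, float, float, float]:
    """Solve w = a*z + b from two correspondences; return (s, theta, tx, ty)."""
    z1, z2 = complex(*src[0]), complex(*src[1])
    w1, w2 = complex(*dst[0]), complex(*dst[1])
    a = (w2 - w1) / (z2 - z1)  # encodes scale and rotation
    b = w1 - a * z1            # encodes translation
    return abs(a), cmath.phase(a), b.real, b.imag


# Made-up correspondences: two model points and their scene positions.
s, theta, tx, ty = similarity_from_two_points(
    [(0.0, 0.0), (1.0, 0.0)], [(10.0, 5.0), (10.0, 7.0)]
)
print(s, math.degrees(theta), tx, ty)
```

For these points the recovered transform is a scale of 2, a rotation of 90 degrees and a translation of (10, 5).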

The resulting 2×3 matrix is converted to a 3×3 homography matrix (by appending `[0, 0, 1]`) for convenient corner projection via `cv2.perspectiveTransform`.

In [None]:
class AffineEstimator:
    """
    Similarity transform estimator using RANSAC.
    """

    def estimate(
        self, src_pts: np.ndarray, dst_pts: np.ndarray
    ) -> Tuple[Optional[np.ndarray], Optional[np.ndarray], int]:
        """
        Estimate similarity transformation using RANSAC.

        Args:
            src_pts: Source points (model) - Nx2 array
            dst_pts: Destination points (scene) - Nx2 array

        Returns:
            (homography_matrix, inlier_mask, n_inliers)
            homography_matrix is 3x3 (similarity embedded in homography form)
        """
        if len(src_pts) < 2:
            return None, None, 0

        try:
            M, inliers = cv2.estimateAffinePartial2D(
                src_pts.reshape(-1, 1, 2),
                dst_pts.reshape(-1, 1, 2),
                method=cv2.RANSAC,
                ransacReprojThreshold=Config.RANSAC_REPROJ_THRESHOLD,
            )

            if M is None or inliers is None:
                return None, None, 0

            # Convert 2x3 affine to 3x3 homography matrix
            H = np.vstack([M, [0, 0, 1]])

            n_inliers = int(np.sum(inliers))
            return H, inliers, n_inliers

        except cv2.error:
            return None, None, 0

    def transform_corners(
        self, H: np.ndarray, model_shape: Tuple[int, int]
    ) -> np.ndarray:
        """
        Project model corners to scene using the homography matrix.

        Args:
            H: 3x3 homography matrix
            model_shape: (height, width) of model image

        Returns:
            Projected corners as 4x2 array: [TL, TR, BR, BL]
        """
        h, w = model_shape
        corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
        projected = cv2.perspectiveTransform(corners, H)
        return projected.reshape(-1, 2)

## Detection Validation

### Validation Criteria

A detected rectangle is considered geometrically valid if the following 4 conditions hold:
1. **4 points**: The projected contour has exactly 4 corners
2. **Reasonable area**: Between 1000 and 100000 pixels
3. **Roughly rectangular**: The ratio of contour area to bounding rect area (extent) is above 0.5
4. **Within image bounds**: All corners lie within the image dimensions with a tolerance of 4 pixels
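
To get a feel for the extent criterion, the sketch below computes the extent of an ideal rectangle rotated by a given angle, using continuous geometry. This is an approximation of the check used here: the notebook measures extent with `cv2.contourArea` and `cv2.boundingRect` on integer pixel coordinates, so actual values differ slightly. The 40×300 spine size is a made-up example.

```python
import math


def rotated_rect_extent(width: float, height: float, angle_deg: float) -> float:
    """Extent = rectangle area / area of its axis-aligned bounding box."""
    a = math.radians(angle_deg)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    bbox_w = width * c + height * s
    bbox_h = width * s + height * c
    return (width * height) / (bbox_w * bbox_h)


# Hypothetical narrow spine, 40 px wide and 300 px tall.
for angle in (0, 5, 15, 45):
    print(angle, round(rotated_rect_extent(40, 300, angle), 3))
```

An axis-aligned rectangle has extent 1, a square rotated by 45 degrees has extent 0.5, and elongated shapes such as book spines lose extent quickly even for small rotations.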

In [None]:
def is_rectangle_valid(
    rectangle: np.ndarray, image_shape: Tuple[int, ...]
) -> Tuple[bool, str]:
    """
    Validate whether a detected rectangle is geometrically reasonable.

    Args:
        rectangle: 4x1x2 or 4x2 array of corner points
        image_shape: (height, width, ...) of the scene image

    Returns:
        (is_valid, reason_string)
    """
    # Ensure correct shape for cv2 functions
    rect = np.int32(rectangle)
    if rect.ndim == 2:
        rect = rect.reshape(-1, 1, 2)

    if len(rect) != 4:
        return False, "Not 4 points"

    # Check area
    area = cv2.contourArea(rect)
    if area < Config.MIN_AREA or area > Config.MAX_AREA:
        return False, f"Invalid area: {area:.0f}"

    # Check ratio of contour area to bounding box area (extent)
    x, y, w, h = cv2.boundingRect(rect)
    bounding_box_area = w * h
    if bounding_box_area == 0:
        return False, "Zero bounding box area"
    extent = area / bounding_box_area

    if extent < Config.MIN_EXTENT:
        return False, f"Not rectangular enough: extent={extent:.2f}"

    # Check if the rectangle is within the image bounds
    for p in rect:
        px, py = p[0]
        if (
            px + Config.TOLERANCE_on_IMAGE_LIMITS < 0
            or px - Config.TOLERANCE_on_IMAGE_LIMITS >= image_shape[1]
            or py + Config.TOLERANCE_on_IMAGE_LIMITS < 0
            or py - Config.TOLERANCE_on_IMAGE_LIMITS >= image_shape[0]
        ):
            return False, "Point out of bounds"

    return True, "Valid rectangle"

## Book Detector

### Multi-Instance Detection: Iterative Keypoint Exclusion

1. Match all model keypoints to scene (excluding previously used indices)
2. Estimate similarity transformation (RANSAC)
3. Validate the detection (number of inliers, inlier ratio, rectangle validity)
4. If valid: save detection, add inlier scene keypoint indices to exclusion set
5. Re-match with excluded indices, repeat until no valid detections remain

This approach extracts features only once and allows multiple RANSAC
attempts even when match quality degrades.
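
The control flow of these steps can be sketched with a deliberately simplified toy. Here each match is reduced to a scene keypoint index tagged with the physical copy it lies on, and "the largest group among the remaining matches" stands in for the RANSAC consensus set. The function name, the labels and the data are all hypothetical; the real pipeline estimates a similarity transform instead.

```python
from collections import Counter
from typing import Dict, List, Set


def detect_iteratively(
    match_to_copy: Dict[int, str], min_inliers: int = 3, max_instances: int = 10
) -> List[str]:
    """Toy loop: accept the largest consensus group, exclude it, repeat."""
    excluded: Set[int] = set()
    found: List[str] = []
    for _ in range(max_instances):
        # Step 1: keep only matches whose scene keypoint is still available.
        remaining = {k: c for k, c in match_to_copy.items() if k not in excluded}
        if len(remaining) < min_inliers:
            break
        # Step 2: stand-in for RANSAC, pick the largest consistent group.
        copy, n_inliers = Counter(remaining.values()).most_common(1)[0]
        # Step 3: validate the candidate.
        if n_inliers < min_inliers:
            break
        # Step 4: accept it and exclude its scene keypoints.
        found.append(copy)
        excluded.update(k for k, c in remaining.items() if c == copy)
    return found


matches = {
    0: "left", 1: "left", 2: "left", 3: "left",
    4: "right", 5: "right", 6: "right",
    7: "noise",
}
print(detect_iteratively(matches))  # ['left', 'right']
```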

In [None]:
class GeometryUtils:
    """Geometric utility functions."""

    @staticmethod
    def polygon_area(points: np.ndarray) -> float:
        """Compute polygon area using Shoelace formula."""
        n = len(points)
        area = 0.0
        for i in range(n):
            j = (i + 1) % n
            area += points[i][0] * points[j][1]
            area -= points[j][0] * points[i][1]
        return abs(area) / 2.0

    @staticmethod
    def polygon_iou(poly1: np.ndarray, poly2: np.ndarray) -> float:
        """Compute Intersection over Union for two polygons."""
        all_pts = np.vstack([poly1, poly2])
        x_min = max(0, int(np.floor(all_pts[:, 0].min())) - 5)
        y_min = max(0, int(np.floor(all_pts[:, 1].min())) - 5)
        x_max = int(np.ceil(all_pts[:, 0].max())) + 5
        y_max = int(np.ceil(all_pts[:, 1].max())) + 5

        w, h = x_max - x_min, y_max - y_min
        if w <= 0 or h <= 0:
            return 0.0

        offset = np.array([x_min, y_min])
        mask1 = np.zeros((h, w), dtype=np.uint8)
        mask2 = np.zeros((h, w), dtype=np.uint8)
        cv2.fillPoly(mask1, [(poly1 - offset).astype(np.int32)], 1)
        cv2.fillPoly(mask2, [(poly2 - offset).astype(np.int32)], 1)

        intersection = np.sum(mask1 & mask2)
        union = np.sum(mask1 | mask2)
        return intersection / union if union > 0 else 0.0

In [None]:
class BookDetector:
    """
    Main book detection pipeline.

    Pipeline:
    1. Preprocess images (grayscale conversion + CLAHE)
    2. Extract RootSIFT features
    3. Match features (BF 5NN consecutive ratio test)
    4. Iterative detection with keypoint index exclusion
    """

    def __init__(self):
        self.preprocessor = ImagePreprocessor()
        self.feature_extractor = RootSIFT()
        self.matcher = FeatureMatcher()
        self.affine_estimator = AffineEstimator()
        self.model_cache = {}

    def load_model(self, model_path: str):
        """Load and cache model image features."""
        if model_path in self.model_cache:
            return self.model_cache[model_path]

        img = cv2.imread(model_path)
        if img is None:
            raise ValueError(f"Could not load: {model_path}")

        gray = self.preprocessor.preprocess(img)
        kp, des = self.feature_extractor.detect_and_compute(gray)

        self.model_cache[model_path] = (img, kp, des)
        return img, kp, des

    def detect_in_scene(
        self, scene_path: str, model_paths: List[str], verbose: bool = False
    ) -> Tuple[List[BookDetection], np.ndarray]:
        """
        Detect all books in a scene image.
        """
        scene_img = cv2.imread(scene_path)
        if scene_img is None:
            raise ValueError(f"Could not load: {scene_path}")

        scene_gray = self.preprocessor.preprocess(scene_img)
        scene_kp, scene_des = self.feature_extractor.detect_and_compute(scene_gray)

        if verbose:
            print(f"Scene: {Path(scene_path).name} - {len(scene_kp)} keypoints")

        detections = []
        result_img = scene_img.copy()

        for book_id, model_path in enumerate(model_paths):
            model_img, model_kp, model_des = self.load_model(model_path)

            if model_des is None or len(model_kp) < Config.MIN_MATCH_COUNT:
                detections.append(BookDetection(book_id, model_path, []))
                continue

            instances = self._detect_instances(
                model_img,
                model_kp,
                model_des,
                scene_img,
                scene_kp,
                scene_des,
            )

            detection = BookDetection(book_id, model_path, instances)
            detections.append(detection)
            self._draw_detection(result_img, detection)

            if verbose and len(instances) > 0:
                print(f"  Book {book_id}: {len(instances)} instance(s)")

        return detections, result_img

    def _detect_instances(
        self, model_img, model_kp, model_des, scene_img, scene_kp, scene_des
    ) -> List[BoundingBox]:
        """
        Detect all instances using iterative similarity estimation
        with keypoint index exclusion.
        """
        instances = []
        excluded_indices: Set[int] = set()

        for iteration in range(Config.MAX_INSTANCES_PER_BOOK):
            matches = self.matcher.match(model_des, scene_des, excluded_indices)

            if len(matches) < Config.MIN_MATCH_COUNT:
                break

            src_pts = np.float32([model_kp[m.queryIdx].pt for m in matches])
            dst_pts = np.float32([scene_kp[m.trainIdx].pt for m in matches])
            match_indices = [m.trainIdx for m in matches]

            # Estimate similarity transformation
            H, inlier_mask, n_inliers = self.affine_estimator.estimate(src_pts, dst_pts)

            if H is None:
                break

            # Check inlier quality
            inlier_ratio = n_inliers / len(matches)
            if (
                n_inliers < Config.MIN_INLIERS
                or inlier_ratio < Config.MIN_INLIERS_RATIO
            ):
                break

            # Project model corners to scene
            projected = self.affine_estimator.transform_corners(H, model_img.shape[:2])

            is_valid, reason = is_rectangle_valid(projected, scene_img.shape)
            if not is_valid:
                break

            # Combined extent + inlier check: low-extent detections need
            # stronger inlier support to be trusted
            rect_check = np.int32(projected).reshape(-1, 1, 2)
            contour_area = cv2.contourArea(rect_check)
            x_r, y_r, w_r, h_r = cv2.boundingRect(rect_check)
            bb_area = w_r * h_r
            extent = contour_area / bb_area if bb_area > 0 else 0
            if (
                extent < Config.LOW_EXTENT_THRESHOLD
                and n_inliers < Config.MIN_INLIERS_LOW_EXTENT
            ):
                break

            # Compute area
            area = GeometryUtils.polygon_area(projected)

            bbox = BoundingBox(
                top_left=tuple(map(int, projected[0])),
                top_right=tuple(map(int, projected[1])),
                bottom_right=tuple(map(int, projected[2])),
                bottom_left=tuple(map(int, projected[3])),
                area=int(area),
                n_inliers=n_inliers,
                inlier_ratio=inlier_ratio,
            )

            # Same-book overlap check: reject if >30% IoU with any
            # previously detected instance of this same book
            new_poly = bbox.get_polygon()
            is_same_book_dup = any(
                GeometryUtils.polygon_iou(new_poly, prev.get_polygon())
                > Config.SAME_BOOK_IOU_THRESHOLD
                for prev in instances
            )
            if is_same_book_dup:
                # Exclude inliers so iteration can continue past this region
                polygon = projected.astype(np.float32).reshape(-1, 1, 2)
                for i in range(len(match_indices)):
                    if inlier_mask[i]:
                        kp_idx = match_indices[i]
                        pt = scene_kp[kp_idx].pt
                        if cv2.pointPolygonTest(polygon, pt, measureDist=True) >= 0:
                            excluded_indices.add(kp_idx)
                continue

            instances.append(bbox)

            # --- Spatial inlier filtering ---
            # Only exclude inliers whose scene keypoint falls INSIDE
            # the detected bounding box. This prevents RANSAC from claiming
            # keypoints that physically sit on adjacent book copies.
            polygon = projected.astype(np.float32).reshape(-1, 1, 2)
            for i in range(len(match_indices)):
                if inlier_mask[i]:
                    kp_idx = match_indices[i]
                    pt = scene_kp[kp_idx].pt
                    # pointPolygonTest returns positive if inside, 0 on edge, negative if outside
                    # Check to exclude only the inliers belonging to the instance just detected
                    if cv2.pointPolygonTest(polygon, pt, measureDist=True) >= 0:
                        excluded_indices.add(kp_idx)

        return instances

    def _draw_detection(self, img: np.ndarray, detection: BookDetection):
        """Draw bounding boxes on image."""
        colors = [
            (0, 255, 0),
            (255, 0, 0),
            (0, 0, 255),
            (255, 255, 0),
            (255, 0, 255),
            (0, 255, 255),
            (128, 0, 255),
            (255, 128, 0),
        ]
        color = colors[detection.book_id % len(colors)]

        for inst in detection.instances:
            pts = np.array(
                [inst.top_left, inst.top_right, inst.bottom_right, inst.bottom_left],
                dtype=np.int32,
            )
            cv2.polylines(img, [pts], True, color, 3)

            label = f"Book {detection.book_id}"
            pos = (inst.top_left[0], max(inst.top_left[1] - 10, 20))
            cv2.putText(img, label, pos, cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)

### Detection/Rejection Criteria

Each candidate detection goes through a sequence of checks before being accepted. If any check fails, the iteration stops (`break`) or the detection is skipped (`continue`). The checks are applied in order, and each serves a distinct purpose:

**1. Minimum match count** (`MIN_MATCH_COUNT = 3`) - (`break`)  
Before attempting RANSAC, at least 3 matches are required. A similarity transform needs a minimum of 2 points, but 3 provides a basic sanity margin. Below this, there is not enough evidence to attempt geometric verification.

**2. RANSAC failure** - (`break`)  
If `estimateAffinePartial2D` returns `None`, the point correspondences are too noisy or contradictory for any consistent transformation. This typically means the remaining matches are scattered across unrelated scene regions.

**3. Inlier count and ratio** (`MIN_INLIERS = 3`, `MIN_INLIERS_RATIO = 1/3`) - (`break`)  
The number of RANSAC inliers must reach at least 3, and the inlier ratio (inliers / total matches) must be at least 33%. These ensure that the estimated transformation has sufficient geometric consensus. A low inlier ratio indicates the matches are dominated by outliers.

**4. Rectangle validity** (`is_rectangle_valid`) - (`break`)  
The projected quadrilateral must pass `is_rectangle_valid`, described in the Detection Validation section above.

**5. Combined extent + inlier check** (`LOW_EXTENT_THRESHOLD = 0.65`, `MIN_INLIERS_LOW_EXTENT = 5`) - (`break`)  
A detection with moderately low extent (between 0.5 and 0.65) is ambiguous: it could be a slightly rotated true detection or a poorly constrained false positive. To distinguish between these cases, we require stronger inlier support. If extent < 0.65 **and** inliers < 5, the detection is rejected. This allows well-supported detections with slight skew to pass while filtering out weakly-supported ones (see next section).

**6. Same-book IoU overlap** (`SAME_BOOK_IOU_THRESHOLD = 0.30`, on failure: `continue`)  
After all geometric checks pass, the new bounding box is compared against all previously accepted instances **of the same book model** using Intersection over Union (IoU). If the overlap exceeds 30%, the detection is treated as a duplicate: RANSAC has converged on the same physical book again. Unlike the previous checks, this one uses `continue` instead of `break`: the inliers are excluded (via spatial filtering) and the loop moves on to the next iteration, since valid non-overlapping instances may still exist.
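
The order of the six checks can be summarised as a small decision function. This is a minimal sketch, not the notebook's actual implementation: the `Candidate` dataclass and the `decide` function are hypothetical names introduced here for illustration, and only the threshold constants are taken from the text above.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above
MIN_MATCH_COUNT = 3
MIN_INLIERS = 3
MIN_INLIERS_RATIO = 1 / 3
LOW_EXTENT_THRESHOLD = 0.65
MIN_INLIERS_LOW_EXTENT = 5
SAME_BOOK_IOU_THRESHOLD = 0.30


@dataclass
class Candidate:
    """Hypothetical summary of one RANSAC iteration (not the notebook's own type)."""
    n_matches: int            # matches available before RANSAC
    ransac_ok: bool           # False if the estimator returned None
    n_inliers: int            # RANSAC inliers
    rect_valid: bool          # result of the rectangle validity check
    extent: float             # extent of the projected box
    max_same_book_iou: float  # highest IoU with accepted boxes of the same book


def decide(c: Candidate) -> str:
    """Return 'break', 'continue', or 'accept', applying the six checks in order."""
    if c.n_matches < MIN_MATCH_COUNT:                       # check 1
        return "break"
    if not c.ransac_ok:                                     # check 2
        return "break"
    if (c.n_inliers < MIN_INLIERS
            or c.n_inliers / c.n_matches < MIN_INLIERS_RATIO):  # check 3
        return "break"
    if not c.rect_valid:                                    # check 4
        return "break"
    if (c.extent < LOW_EXTENT_THRESHOLD
            and c.n_inliers < MIN_INLIERS_LOW_EXTENT):      # check 5
        return "break"
    if c.max_same_book_iou > SAME_BOOK_IOU_THRESHOLD:       # check 6
        return "continue"
    return "accept"


# Illustrative (made-up) candidates
print(decide(Candidate(10, True, 10, True, 0.601, 0.0)))  # accept
print(decide(Candidate(6, True, 3, True, 0.620, 0.0)))    # break
print(decide(Candidate(12, True, 8, True, 0.90, 0.45)))   # continue
```

Only the last check returns `continue`, which reflects the distinction made above: a duplicate does not imply that no further instances exist, whereas a failed geometric check ends the search for that book.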

### Combined Extent + Inlier Check

A simple `MIN_EXTENT` threshold creates a dilemma:

| Case | Extent | Inliers | Verdict |
|------|--------|---------|--------|
| Scene 4, Book 15 iter 2 (false positive) | 0.620 | 3 | Should reject |
| Scene 16, Book 12 iter 2 (true positive) | 0.601 | 10 (100%) | Should accept |

Both have similar extent, so no single threshold can separate them. However, the key difference is **inlier support**: the false positive has only 3 inliers (the minimum), while the true positive has 10 with a 100% inlier ratio, which indicates strong geometric consensus.

The solution is a **combined check**: detections with low extent (below `LOW_EXTENT_THRESHOLD = 0.65`) are only accepted if they have sufficient inlier support (`MIN_INLIERS_LOW_EXTENT = 5`). This rejects poorly-supported skewed detections while preserving well-supported ones that happen to be slightly rotated.
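
The dilemma can be checked numerically with the two cases from the table. The sketch below is illustrative only: the helper names `single_threshold` and `combined_check` are assumptions introduced here, not functions from the notebook. It sweeps candidate `MIN_EXTENT` values and shows that none classifies both cases correctly, while the combined rule does.

```python
LOW_EXTENT_THRESHOLD = 0.65
MIN_INLIERS_LOW_EXTENT = 5

# (label, extent, inliers, should_accept), taken from the table above
cases = [
    ("Scene 4, Book 15 iter 2", 0.620, 3, False),
    ("Scene 16, Book 12 iter 2", 0.601, 10, True),
]


def single_threshold(extent: float, min_extent: float) -> bool:
    """Accept purely on extent (hypothetical baseline rule)."""
    return extent >= min_extent


def combined_check(extent: float, inliers: int) -> bool:
    """Reject only when extent is low AND inlier support is weak."""
    return not (extent < LOW_EXTENT_THRESHOLD and inliers < MIN_INLIERS_LOW_EXTENT)


# Sweep MIN_EXTENT over 0.000 .. 1.000 and keep values that get both cases right
separating = [
    t / 1000
    for t in range(0, 1001)
    if all(single_threshold(e, t / 1000) == ok for _, e, _, ok in cases)
]
print(separating)  # []

combined_ok = all(combined_check(e, n) == ok for _, e, n, ok in cases)
print(combined_ok)  # True
```

The sweep comes back empty because the false positive has the *higher* extent (0.620 vs 0.601): any threshold low enough to keep the true positive also keeps the false positive.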

## Dataset

The dataset consists of two folders of images:
* **Models**: contains one reference image for each product that the system should be able to identify;
* **Scenes**: contains different shelf pictures to test the developed algorithm in different scenarios.

In [None]:
BRANCH_NAME: str = "dataset/assignment1"
REPO_URL: str = f"https://github.com/{PROJECT_REPO}.git"

temp_dir: Path = PROJECT_ROOT / "temp_repo"
dataset_path: Path = PROJECT_ROOT / "dataset"

if dataset_path.exists():
    print(f"'{dataset_path.name}' folder already exists locally. Skipping download.")
else:
    try:
        print(
            f"Downloading dataset at {PROJECT_REPO}/{BRANCH_NAME} via git sparse checkout..."
        )

        # Clone the repo tree
        clone_cmd = [
            "git",
            "clone",
            "--filter=blob:none",
            "--sparse",
            "--depth",
            "1",
            "--branch",
            BRANCH_NAME,
            REPO_URL,
            str(temp_dir),
        ]
        subprocess.run(clone_cmd, check=True, capture_output=True, text=True)

        # Fetch the contents of the 'dataset' folder
        sparse_cmd = ["git", "-C", str(temp_dir), "sparse-checkout", "set", "dataset"]
        subprocess.run(sparse_cmd, check=True, capture_output=True, text=True)

        source_dataset_path: Path = temp_dir / "dataset"

        if source_dataset_path.exists():
            # str() keeps this working on Python versions where shutil.move rejects Path objects
            shutil.move(str(source_dataset_path), str(dataset_path))
            print("Dataset successfully downloaded.")
        else:
            print(
                f"Error: Could not find the 'dataset' folder inside the cloned repo at '{temp_dir}'."
            )

    except subprocess.CalledProcessError as e:
        print(f"Git command failed: {e.stderr}")

    finally:
        # Clean up
        if temp_dir.exists():
            shutil.rmtree(temp_dir, ignore_errors=True)

In [None]:
MODELS_PATH: Path = dataset_path / "models"
SCENES_PATH: Path = dataset_path / "scenes"

model_files = sorted(
    MODELS_PATH.glob("model_*.png"), key=lambda x: int(x.stem.split("_")[1])
)
scene_files = sorted(
    SCENES_PATH.glob("scene_*.jpg"), key=lambda x: int(x.stem.split("_")[1])
)

model_paths = [str(f) for f in model_files]
scene_paths = [str(f) for f in scene_files]

print(f"Found {len(model_paths)} models and {len(scene_paths)} scenes")
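
The numeric sort key used above matters because plain string ordering would place `model_10` before `model_2`. A minimal sketch of the difference, using hypothetical file names that follow the same `model_<N>.png` pattern:

```python
from pathlib import Path

# Hypothetical file names following the model_<N>.png pattern
names = [Path("model_10.png"), Path("model_2.png"), Path("model_1.png")]

# Plain string ordering compares character by character
lexicographic = [p.name for p in sorted(names, key=lambda p: p.name)]

# Same key as in the cell above: parse the integer after the underscore
numeric = [p.name for p in sorted(names, key=lambda p: int(p.stem.split("_")[1]))]

print(lexicographic)  # ['model_1.png', 'model_10.png', 'model_2.png']
print(numeric)        # ['model_1.png', 'model_2.png', 'model_10.png']
```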

## Run Full Detection

In [None]:
# Initialize detector
detector = BookDetector()

# Process all scenes
all_results = {}
all_images = {}

for idx, scene_path in enumerate(scene_paths):
    print(f"\nProcessing scene {idx}: {Path(scene_path).name}")

    detections, result_img = detector.detect_in_scene(
        scene_path, model_paths, verbose=False
    )

    all_results[idx] = detections
    all_images[idx] = result_img

    total = sum(len(d.instances) for d in detections)
    books = sum(1 for d in detections if len(d.instances) > 0)
    print(f"  Found {total} instances of {books} books")

print("\n" + "=" * 60)
print("PROCESSING COMPLETE")
print("=" * 60)

## Formatted Results

In [None]:
def format_output(detections: List[BookDetection]) -> str:
    """
    Format detection results as specified in the assignment.

    Output format:
    Book X - N instance(s) found:
      Instance 1 {top_left: (x,y), top_right: (x,y), ...}
    """
    lines = []

    for det in detections:
        if len(det.instances) > 0:
            lines.append(
                f"Book {det.book_id} - {len(det.instances)} instance(s) found:"
            )

            for i, inst in enumerate(det.instances, 1):
                lines.append(
                    f"  Instance {i} {{"
                    f"top_left: {inst.top_left}, "
                    f"top_right: {inst.top_right}, "
                    f"bottom_left: {inst.bottom_left}, "
                    f"bottom_right: {inst.bottom_right}, "
                    f"area: {inst.area}px}}"
                )

    return "\n".join(lines)

In [None]:
# Print formatted results
for idx, detections in all_results.items():
    detected = [d for d in detections if len(d.instances) > 0]
    if detected:
        print(f"\n{'=' * 60}")
        print(f"SCENE: {Path(scene_paths[idx]).name}")
        print(f"{'=' * 60}")
        print(format_output(detected))

## Visualization

In [None]:
def visualize_scene_with_models(scene_idx, scene_path, detections, model_paths):
    """
    Visualize detection results: scene with bounding boxes on the left,
    detected model images with matching colored borders on the right.
    One model image per book, with instance count in the label.

    Args:
        scene_idx: Scene index (for title)
        scene_path: Path to scene image
        detections: List of BookDetection for this scene
        model_paths: List of all model image paths
    """
    import matplotlib.pyplot as plt

    colors_bgr = [
        (0, 255, 0), (255, 0, 0), (0, 0, 255), (255, 255, 0),
        (255, 0, 255), (0, 255, 255), (128, 0, 255), (255, 128, 0)
    ]

    def bgr_to_rgb_norm(bgr):
        return (bgr[2] / 255, bgr[1] / 255, bgr[0] / 255)

    # Collect detected books (those with at least one instance)
    detected = [(d, colors_bgr[d.book_id % len(colors_bgr)])
                for d in detections if len(d.instances) > 0]

    # One entry per book (not per instance)
    model_entries = [(d.book_id, d.model_path, color, len(d.instances))
                     for d, color in detected]

    n_models = len(model_entries)
    if n_models == 0:
        scene_img = cv2.cvtColor(cv2.imread(scene_path), cv2.COLOR_BGR2RGB)
        fig, ax = plt.subplots(1, 1, figsize=(8, 8))
        ax.imshow(scene_img)
        ax.set_title(f"Scene {scene_idx} - No detections", fontsize=14)
        ax.axis('off')
        plt.tight_layout()
        plt.show()
        return

    # Build the scene image with bounding boxes
    scene_img = cv2.imread(scene_path)
    result_img = scene_img.copy()
    for d, color in detected:
        for inst in d.instances:
            pts = np.array([inst.top_left, inst.top_right,
                           inst.bottom_right, inst.bottom_left], dtype=np.int32)
            cv2.polylines(result_img, [pts], True, color, 3)
            label = f"Book {d.book_id}"
            pos = (inst.top_left[0], max(inst.top_left[1] - 10, 20))
            cv2.putText(result_img, label, pos, cv2.FONT_HERSHEY_SIMPLEX,
                        0.6, color, 2)
    result_img = cv2.cvtColor(result_img, cv2.COLOR_BGR2RGB)

    # Layout: scene on left, one model per detected book on right
    n_cols = 1 + n_models
    width_ratios = [4] + [1] * n_models
    fig_width = 6 + 1.5 * n_models

    fig, axes = plt.subplots(1, n_cols, figsize=(fig_width, 6),
                              gridspec_kw={'width_ratios': width_ratios})
    # n_cols >= 2 at this point, so plt.subplots returned a 1-D array of Axes
    axes = list(axes)

    # Scene
    axes[0].imshow(result_img)
    axes[0].set_title(f"Scene {scene_idx}", fontsize=14, fontweight='bold')
    axes[0].axis('off')

    # Model images
    for idx, (book_id, model_path, color, n_instances) in enumerate(model_entries):
        ax = axes[1 + idx]
        model_img = cv2.cvtColor(cv2.imread(model_path), cv2.COLOR_BGR2RGB)
        ax.imshow(model_img)

        rgb_color = bgr_to_rgb_norm(color)
        for spine in ax.spines.values():
            spine.set_edgecolor(rgb_color)
            spine.set_linewidth(4)
            spine.set_visible(True)

        label = f"Model {book_id}"
        if n_instances > 1:
            label += f"\n({n_instances} inst.)"
        ax.set_title(label, fontsize=10, color=rgb_color, fontweight='bold')
        ax.set_xticks([])
        ax.set_yticks([])

    plt.tight_layout()
    plt.show()

In [None]:
# Visualize all scenes
for scene_idx, scene_path in enumerate(scene_paths):
    visualize_scene_with_models(
        scene_idx, scene_path,
        all_results[scene_idx], model_paths
    )

## Statistics

In [None]:
# Compute statistics
total_detections = 0
detections_per_scene = []
detections_per_book = defaultdict(int)

for detections in all_results.values():
    scene_total = sum(len(d.instances) for d in detections)
    detections_per_scene.append(scene_total)
    total_detections += scene_total

    for d in detections:
        if len(d.instances) > 0:
            detections_per_book[d.book_id] += len(d.instances)

print("DETECTION STATISTICS")
print("=" * 60)
print(f"Total detections: {total_detections}")
if detections_per_scene:  # guard: max() raises on an empty list
    print(f"Average per scene: {np.mean(detections_per_scene):.2f}")
    print(f"Max in one scene: {max(detections_per_scene)}")
print(f"\nUnique books detected: {len(detections_per_book)}")

print("\nTop 10 most detected books:")
for book_id, count in sorted(detections_per_book.items(), key=lambda x: -x[1])[:10]:
    print(f"  Book {book_id}: {count} instances")

## Summary

This pipeline implements book detection using traditional computer vision:

1. **RootSIFT features** for robust descriptor matching
2. **Brute-force 5-NN matching with a consecutive ratio test** for feature matching
3. **Similarity transform** (4 DOF via `estimateAffinePartial2D`) for geometric verification
4. **Iterative keypoint exclusion** for multi-instance detection
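
To make the "4 DOF" in step 3 concrete: `estimateAffinePartial2D` returns a 2x3 matrix of the form `[[a, -b, tx], [b, a, ty]]`, which encodes uniform scale, rotation, and a 2D translation. The sketch below decomposes such a matrix using only the standard library. The helper names `decompose_similarity` and `transform_point` and the example matrix are illustrative assumptions, not part of the notebook.

```python
import math


def decompose_similarity(M):
    """Split a 2x3 similarity matrix [[a, -b, tx], [b, a, ty]] into its 4 parameters."""
    a, b = M[0][0], M[1][0]
    scale = math.hypot(a, b)                    # uniform scale
    angle_deg = math.degrees(math.atan2(b, a))  # rotation angle in degrees
    tx, ty = M[0][2], M[1][2]                   # translation
    return scale, angle_deg, tx, ty


def transform_point(M, pt):
    """Apply the 2x3 matrix to a single (x, y) point."""
    x, y = pt
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])


# Illustrative matrix: scale 2, rotation 90 degrees, translation (10, 5)
M = [[0.0, -2.0, 10.0],
     [2.0,  0.0,  5.0]]

scale, angle, tx, ty = decompose_similarity(M)
print(scale, angle, tx, ty)             # 2.0 90.0 10.0 5.0
print(transform_point(M, (1.0, 0.0)))   # (10.0, 7.0)
```

Because the model has no shear or perspective terms, a rectangle in the model image always maps to a rectangle in the scene, which is presumably why the bounding boxes stay aligned with the book spine or cover.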