# **Product Recognition of Books**

## Image Processing and Computer Vision - Assignment Module #1

---

## Approach Overview

This solution implements a **traditional computer vision pipeline** for detecting books on shelves.

### Key Design Decisions

After extensive experimentation, we adopted a **simplified pipeline** based on the following observations about our dataset:

1. **Book covers are planar and viewed nearly frontally**: Shelf images have minimal perspective distortion
2. **Books are upright**: Spines are vertical with minimal rotation
3. **Scale is consistent**: Model and scene images have similar scale
4. **Multiple copies are adjacent**: Same book appears side-by-side on shelves

These characteristics led us to choose:

| Component | Choice | Justification |
|-----------|--------|---------------|
| Features | RootSIFT | ~15-30% better matching than standard SIFT |
| Matching | 5NN | Allows one model keypoint to match multiple scene locations |
| Geometric model | **Affine** (not Homography) | 6 DOF sufficient for upright books, more stable with fewer points |
| Multi-instance | Iterative masking | Simpler and more effective than clustering approaches |

### Why Affine Instead of Homography?

- **Homography**: 8 degrees of freedom, requires 4+ point correspondences, typically needs 8+ for robustness
- **Affine**: 6 degrees of freedom, requires only 3 point correspondences, more stable with limited matches

For upright books with minimal perspective distortion, the two extra degrees of freedom of a homography have no real perspective effect to model, so they mostly fit noise in the correspondences. In our experiments, with ~8-10 matches per book instance, the affine model gave more reliable detections.
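The sample-size argument can be made concrete with the standard RANSAC iteration estimate, N = log(1 - p) / log(1 - w^s), where w is the inlier ratio, s the minimal sample size, and p the desired confidence. The sketch below is illustrative only and is not part of the pipeline: `ransac_iterations` is a hypothetical helper, and the 25% inlier ratio and 0.99 confidence are simply the values that appear in this notebook's configuration and RANSAC call.

```python
import math

def ransac_iterations(inlier_ratio: float, sample_size: int,
                      confidence: float = 0.99) -> int:
    """Iterations needed to draw at least one all-inlier sample with the given confidence."""
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample <= 0.0:
        raise ValueError("inlier_ratio must be positive")
    if p_good_sample >= 1.0:
        return 1  # every sample is all-inlier
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))

affine_iters = ransac_iterations(0.25, sample_size=3)      # affine: 3 correspondences
homography_iters = ransac_iterations(0.25, sample_size=4)  # homography: 4 correspondences
print(f"affine: {affine_iters} iterations, homography: {homography_iters} iterations")
```

At a low inlier ratio, the four-point model needs several times as many iterations as the three-point model to reach the same confidence, which is one reason the smaller model behaves better when matches are scarce.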

## 1. Setup and Imports

In [None]:
import cv2
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from typing import List, Tuple, Optional, Set
from dataclasses import dataclass
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

print(f"OpenCV version: {cv2.__version__}")

## 2. Configuration

All parameters are centralized here with justifications.

In [None]:
class Config:
    """
    Configuration parameters for the detection pipeline.

    Each parameter is justified based on dataset characteristics
    and experimental results.
    """

    # === PREPROCESSING ===
    # CLAHE (Contrast Limited Adaptive Histogram Equalization)
    # Normalizes lighting variations across shelf images
    CLAHE_CLIP_LIMIT = 2.0   # Moderate contrast enhancement
    CLAHE_GRID_SIZE = (4, 4) # 4x4 provides finer local adaptation than 8x8
    # Justification: Experiments showed fixed CLAHE outperformed adaptive selection

    # === FEATURE DETECTION ===
    # SIFT parameters (used via RootSIFT)
    SIFT_FEATURES = 0                # 0 = keep all detected features
    SIFT_CONTRAST_THRESHOLD = 0.015  # Below OpenCV's default of 0.04: keeps more low-contrast keypoints
    SIFT_EDGE_THRESHOLD = 15         # Above OpenCV's default of 10: keeps more edge-like keypoints

    # === FEATURE MATCHING ===
    # 5NN matching strategy for multi-instance detection
    # Justification: Unlike 2NN ratio test, 5NN allows one model keypoint
    # to match multiple scene locations (essential for detecting book copies)
    DISTANCE_THRESHOLD = 0.7  # Relative quality threshold
    OUTLIER_RATIO = 1.5       # Filter matches far from the group

    # === GEOMETRIC VERIFICATION ===
    # Affine transformation with RANSAC
    # Justification: Affine (6 DOF) is sufficient for upright books
    # and more stable than homography (8 DOF) with limited matches
    MIN_MATCH_COUNT = 6             # Minimum matches to attempt affine estimation
                                    # (affine needs 3 points, we require 6 for robustness)
    RANSAC_REPROJ_THRESHOLD = 4.0   # Pixels - allows for small localization errors

    # === DETECTION VALIDATION ===
    MIN_INLIERS = 6           # Minimum inliers for valid detection
    MIN_INLIERS_RATIO = 0.25  # At least 25% of matches should be inliers
    MIN_AREA = 1000           # Minimum bounding box area (pixels)
    MAX_AREA_RATIO = 4        # Max ratio: detected_area / model_area
    MIN_AREA_RATIO = 0.2      # Min ratio: detected_area / model_area

    # === MULTI-INSTANCE DETECTION ===
    MAX_INSTANCES_PER_BOOK = 10  # Safety limit
    IOU_THRESHOLD = 0.3          # Overlap threshold for duplicate removal

## 3. Data Classes

Structured representation of detection results.

In [None]:
@dataclass
class BoundingBox:
    """
    Represents a detected book instance.

    Stores the four corners of the bounding quadrilateral,
    which may not be axis-aligned due to the affine transformation.
    """
    top_left: Tuple[int, int]
    top_right: Tuple[int, int]
    bottom_right: Tuple[int, int]
    bottom_left: Tuple[int, int]
    area: int
    n_inliers: int
    inlier_ratio: float

    def get_polygon(self) -> np.ndarray:
        """Return corners as numpy array for geometric operations."""
        return np.array([self.top_left, self.top_right,
                        self.bottom_right, self.bottom_left], dtype=np.float32)

@dataclass
class BookDetection:
    """All detections for a single book model in a scene."""
    book_id: int
    model_path: str
    instances: List[BoundingBox]

## 4. RootSIFT Feature Extractor

### Why RootSIFT?

Standard SIFT descriptors are compared using Euclidean distance. However, SIFT descriptors are histograms of gradient orientations, and the **Hellinger kernel** (the Bhattacharyya coefficient) is a more appropriate similarity measure for comparing histograms.

RootSIFT achieves this by:
1. L1-normalizing the SIFT descriptor
2. Taking the element-wise square root

This allows Euclidean distance to implicitly compute Hellinger distance, providing **~15-30% better matching accuracy**.

**Reference**: Arandjelović & Zisserman, "Three things everyone should know to improve object retrieval" (CVPR 2012)
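The relationship between Euclidean distance on RootSIFT vectors and the Hellinger kernel can be checked numerically. The sketch below uses random non-negative vectors as stand-ins for real SIFT descriptors (an assumption for illustration) and a standalone `root_sift` helper that mirrors the two steps listed above.

```python
import numpy as np

def root_sift(desc: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """L1-normalize each row, then take the element-wise square root."""
    desc = desc / (desc.sum(axis=1, keepdims=True) + eps)
    return np.sqrt(desc)

rng = np.random.default_rng(0)
# Stand-ins for two non-negative 128-D SIFT descriptors (random data, not real features)
x, y = rng.random((2, 128))

rx, ry = root_sift(np.stack([x, y]))

# Hellinger kernel of the L1-normalized histograms: sum_i sqrt(x_i * y_i)
hellinger_kernel = np.sum(np.sqrt((x / x.sum()) * (y / y.sum())))
squared_euclidean = np.sum((rx - ry) ** 2)

# RootSIFT vectors have unit L2 norm, so ||rx - ry||^2 = 2 - 2 * H(x, y)
print(squared_euclidean, 2.0 - 2.0 * hellinger_kernel)
```

The two printed values agree up to the small `eps` term, so ranking neighbors by Euclidean distance on RootSIFT descriptors is equivalent to ranking them by Hellinger similarity on the original descriptors.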

In [None]:
class RootSIFT:
    """
    RootSIFT feature extractor.

    Enhances SIFT descriptors for better matching performance
    by applying L1 normalization followed by square root.
    """

    def __init__(self):
        self.sift = cv2.SIFT_create(
            nfeatures=Config.SIFT_FEATURES,
            contrastThreshold=Config.SIFT_CONTRAST_THRESHOLD,
            edgeThreshold=Config.SIFT_EDGE_THRESHOLD
        )

    def detect_and_compute(self, image: np.ndarray):
        """
        Detect keypoints and compute RootSIFT descriptors.

        Args:
            image: Grayscale input image

        Returns:
            keypoints: List of cv2.KeyPoint
            descriptors: RootSIFT descriptors (Nx128 float32 array)
        """
        keypoints, descriptors = self.sift.detectAndCompute(image, None)

        if descriptors is None or len(descriptors) == 0:
            return keypoints, None

        # Convert to RootSIFT
        eps = 1e-7

        # Step 1: L1 normalize
        descriptors = descriptors / (np.sum(descriptors, axis=1, keepdims=True) + eps)

        # Step 2: Square root (Hellinger kernel)
        descriptors = np.sqrt(descriptors)

        return keypoints, descriptors.astype(np.float32)

## 5. Image Preprocessing

### Why CLAHE?

Shelf images have varying lighting conditions (shadows, reflections, uneven illumination). **CLAHE** (Contrast Limited Adaptive Histogram Equalization) normalizes local contrast while preventing over-amplification of noise.

### Parameter Choices

- **clipLimit = 2.0**: Moderate enhancement; higher values can amplify noise
- **tileGridSize = (4, 4)**: Finer grid provides better local adaptation for book spines

We use **fixed parameters** rather than adaptive selection because experiments showed that adaptive approaches (varying CLAHE based on image brightness) performed worse overall.

In [None]:
class ImagePreprocessor:
    """
    Preprocessing pipeline for lighting normalization.
    """

    def __init__(self):
        self.clahe = cv2.createCLAHE(
            clipLimit=Config.CLAHE_CLIP_LIMIT,
            tileGridSize=Config.CLAHE_GRID_SIZE
        )

    def preprocess(self, image: np.ndarray) -> np.ndarray:
        """
        Convert to grayscale and apply CLAHE.

        Args:
            image: BGR input image

        Returns:
            Preprocessed grayscale image
        """
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()

        return self.clahe.apply(gray)

## 6. Feature Matching with 5NN

### Why 5NN Instead of Standard Ratio Test?

The standard approach (Lowe's ratio test with k=2) assumes each model keypoint matches **at most one** scene location. This fails for **multi-instance detection** where the same book appears multiple times.

**5NN strategy**:
1. For each model keypoint, find 5 nearest neighbors in the scene
2. Keep matches that are significantly better than the worst in the group
3. This allows one model keypoint to match multiple scene locations

### Filtering Criteria

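- **DISTANCE_THRESHOLD = 0.7**: The best match's distance must be at most 70% of the median distance in the group (the third-nearest neighbor); otherwise the whole group is discarded
- **OUTLIER_RATIO = 1.5**: Within a surviving group, keep every match whose distance is at most (max_distance / 1.5)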
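A small worked example may help show how the two criteria interact. The function below is a simplified, standalone version of the filtering applied to one model keypoint's neighbor group; `filter_group` and the distance values are hypothetical and chosen only for illustration.

```python
from typing import List

DISTANCE_THRESHOLD = 0.7
OUTLIER_RATIO = 1.5

def filter_group(distances: List[float]) -> List[float]:
    """Apply the two filtering rules to one keypoint's sorted neighbor distances."""
    if len(distances) < 2:
        return []
    best = distances[0]
    reference = distances[min(2, len(distances) - 1)]  # third-nearest = median of five
    worst = distances[-1]
    if reference > 0 and best > DISTANCE_THRESHOLD * reference:
        return []  # best match is not distinctive enough: drop the group
    cutoff = worst / OUTLIER_RATIO
    return [d for d in distances if d <= cutoff]

# Two copies of a book: two close neighbors, then three unrelated ones
print(filter_group([0.20, 0.22, 0.60, 0.62, 0.66]))  # → [0.2, 0.22]
# Ambiguous keypoint: all five neighbors are about equally far away
print(filter_group([0.50, 0.52, 0.55, 0.58, 0.60]))  # → []
```

In the first case both close neighbors survive, which is exactly what multi-instance detection needs; a 2NN ratio test would have rejected the keypoint because its two best matches are nearly equidistant.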

In [None]:
class FeatureMatcher:
    """
    5NN feature matcher for multi-instance detection.
    """

    def __init__(self):
        # FLANN matcher for efficient approximate nearest neighbor search
        FLANN_INDEX_KDTREE = 1
        index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
        search_params = dict(checks=100)
        self.flann = cv2.FlannBasedMatcher(index_params, search_params)

    def match(self, des_model: np.ndarray, des_scene: np.ndarray,
              excluded_indices: Optional[Set[int]] = None) -> List[cv2.DMatch]:
        """
        Match model descriptors to scene descriptors using 5NN.

        Args:
            des_model: Model descriptors
            des_scene: Scene descriptors
            excluded_indices: Scene keypoint indices to exclude (already matched)

        Returns:
            List of good matches
        """
        if des_model is None or des_scene is None:
            return []
        if len(des_model) < 2 or len(des_scene) < 5:
            return []

        if excluded_indices is None:
            excluded_indices = set()

        try:
            matches = self.flann.knnMatch(des_model, des_scene, k=5)
        except cv2.error:
            return []

        good_matches = []

        for match_group in matches:
            # Filter excluded indices
            valid = [m for m in match_group if m.trainIdx not in excluded_indices]

            if len(valid) < 2:
                continue

            distances = [m.distance for m in valid]
            min_dist = distances[0]
            median_dist = distances[min(2, len(distances) - 1)]
            max_dist = distances[-1]

            # Quality check: best match should be good relative to median
            if median_dist > 0 and min_dist > Config.DISTANCE_THRESHOLD * median_dist:
                continue

            # Keep matches significantly better than worst
            threshold = max_dist / Config.OUTLIER_RATIO
            for m in valid:
                if m.distance <= threshold:
                    good_matches.append(m)

        return good_matches

## 7. Geometry Utilities

Helper functions for geometric validation and IoU computation.
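As a sanity check on the mask-based IoU idea, the sketch below rasterizes two axis-aligned boxes with NumPy only and compares the result with the value expected by hand. It is a simplified illustration: `box_mask` and `mask_iou` are hypothetical helpers, and the actual utilities handle arbitrary quadrilaterals rather than axis-aligned boxes.

```python
import numpy as np

def box_mask(box, shape):
    """Boolean mask for an axis-aligned box (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU from two boolean masks: intersection pixels / union pixels."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0

shape = (100, 100)                       # (rows, cols) of the working canvas
a = box_mask((10, 10, 50, 50), shape)    # 40x40 box, area 1600
b = box_mask((30, 10, 70, 50), shape)    # same size, shifted 20 px to the right
print(mask_iou(a, b))                    # overlap 20x40 = 800, union 2400
```

Here the IoU is 800 / 2400, about 0.33, which is just above the `IOU_THRESHOLD = 0.3` used for duplicate removal, so two boxes overlapping by half their width would be treated as the same instance.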

In [None]:
class GeometryUtils:
    """Geometric utility functions."""

    @staticmethod
    def polygon_area(points: np.ndarray) -> float:
        """Compute polygon area using Shoelace formula."""
        n = len(points)
        area = 0.0
        for i in range(n):
            j = (i + 1) % n
            area += points[i][0] * points[j][1]
            area -= points[j][0] * points[i][1]
        return abs(area) / 2.0

    @staticmethod
    def is_convex(points: np.ndarray) -> bool:
        """Check if quadrilateral is convex using cross product signs."""
        n = len(points)
        if n != 4:
            return False

        sign = None
        for i in range(n):
            p1, p2, p3 = points[i], points[(i+1)%n], points[(i+2)%n]
            cross = (p2[0]-p1[0]) * (p3[1]-p2[1]) - (p2[1]-p1[1]) * (p3[0]-p2[0])

            if abs(cross) < 1e-6:
                continue
            if sign is None:
                sign = cross > 0
            elif (cross > 0) != sign:
                return False
        return True

    @staticmethod
    def polygon_iou(poly1: np.ndarray, poly2: np.ndarray) -> float:
        """Compute Intersection over Union for two polygons."""
        # Find bounding region
        all_pts = np.vstack([poly1, poly2])
        x_min, y_min = np.floor(all_pts.min(axis=0)).astype(int) - 10
        x_max, y_max = np.ceil(all_pts.max(axis=0)).astype(int) + 10
        x_min, y_min = max(0, x_min), max(0, y_min)

        w, h = x_max - x_min, y_max - y_min
        if w <= 0 or h <= 0:
            return 0.0

        # Create masks
        offset = np.array([x_min, y_min])
        mask1 = np.zeros((h, w), dtype=np.uint8)
        mask2 = np.zeros((h, w), dtype=np.uint8)
        cv2.fillPoly(mask1, [(poly1 - offset).astype(np.int32)], 1)
        cv2.fillPoly(mask2, [(poly2 - offset).astype(np.int32)], 1)

        intersection = np.sum(mask1 & mask2)
        union = np.sum(mask1 | mask2)

        return intersection / union if union > 0 else 0.0

## 8. Affine Transformation Estimator

### Why Affine Instead of Homography?

| Property | Homography | Affine |
|----------|------------|--------|
| Degrees of freedom | 8 | 6 |
| Minimum points | 4 | 3 |
| Handles perspective | Yes | No |
| Stability with few points | Lower | Higher |

For our dataset:
- Books are **upright** on shelves (minimal rotation)
- Camera view is **nearly frontal** (minimal perspective)
- We often have only **6-10 matches** per instance

Affine transformation preserves:
- Parallel lines (book edges remain parallel)
- Ratios of distances along lines

This is sufficient for our use case and provides more stable estimation with limited correspondences.
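The parallelism property is easy to verify numerically. In the sketch below, the affine matrix and the 100x300 "book cover" are arbitrary illustrative values, and `apply_affine` and `cross2d` are hypothetical helpers rather than part of the pipeline.

```python
import numpy as np

def apply_affine(M: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply a 2x3 affine matrix to an Nx2 array of points."""
    return pts @ M[:, :2].T + M[:, 2]

def cross2d(u: np.ndarray, v: np.ndarray) -> float:
    """z-component of the 2D cross product; zero when u and v are parallel."""
    return float(u[0] * v[1] - u[1] * v[0])

# Hypothetical affine map combining scale, shear, and translation
M = np.array([[1.2, 0.3, 40.0],
              [-0.1, 0.9, 15.0]])

# Corners of a 100x300 "book cover": opposite edges are parallel
corners = np.array([[0, 0], [100, 0], [100, 300], [0, 300]], dtype=np.float64)
tl, tr, br, bl = apply_affine(M, corners)

left_edge, right_edge = bl - tl, br - tr
top_edge, bottom_edge = tr - tl, br - bl
print(cross2d(left_edge, right_edge), cross2d(top_edge, bottom_edge))  # both ~0
```

Because the projected box is always a parallelogram under an affine map, it cannot represent the trapezoid a strongly tilted camera would produce; that is the price paid for the extra stability.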

In [None]:
class AffineEstimator:
    """
    Affine transformation estimator using RANSAC.

    More stable than homography for our nearly frontal, upright book images.
    """

    def estimate(self, src_pts: np.ndarray, dst_pts: np.ndarray) -> Tuple[Optional[np.ndarray], Optional[np.ndarray], int]:
        """
        Estimate affine transformation using RANSAC.

        Args:
            src_pts: Source points (model) - Nx2 array
            dst_pts: Destination points (scene) - Nx2 array

        Returns:
            (affine_matrix, inlier_mask, n_inliers)
            affine_matrix is 2x3 for cv2.warpAffine
        """
        if len(src_pts) < 3:
            return None, None, 0

        try:
            # cv2.estimateAffine2D returns 2x3 matrix and inlier mask
            M, inliers = cv2.estimateAffine2D(
                src_pts.reshape(-1, 1, 2),
                dst_pts.reshape(-1, 1, 2),
                method=cv2.RANSAC,
                ransacReprojThreshold=Config.RANSAC_REPROJ_THRESHOLD,
                maxIters=2000,
                confidence=0.99
            )

            if M is None or inliers is None:
                return None, None, 0

            n_inliers = int(np.sum(inliers))
            return M, inliers, n_inliers

        except cv2.error:
            return None, None, 0

    def transform_points(self, M: np.ndarray, points: np.ndarray) -> np.ndarray:
        """
        Apply affine transformation to points.

        Args:
            M: 2x3 affine matrix
            points: Nx2 array of points

        Returns:
            Transformed points (Nx2)
        """
        # Convert to homogeneous coordinates
        ones = np.ones((len(points), 1))
        pts_h = np.hstack([points, ones])  # Nx3

        # Apply transformation: [x', y'] = M @ [x, y, 1]^T
        transformed = pts_h @ M.T  # Nx2

        return transformed

    def validate(self, M: np.ndarray, model_shape: Tuple[int, int],
                 scene_shape: Tuple[int, int]) -> bool:
        """
        Validate if the affine transformation produces a reasonable bounding box.

        Checks:
        1. Result is convex quadrilateral
        2. Area is within reasonable bounds
        3. Box is within scene (with margin)
        """
        if M is None:
            return False

        # Get model corners
        h, w = model_shape
        corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)

        # Transform to scene
        projected = self.transform_points(M, corners)

        # Check convexity
        if not GeometryUtils.is_convex(projected):
            return False

        # Check bounds (allow some margin outside scene)
        scene_h, scene_w = scene_shape
        margin = max(scene_h, scene_w) * 0.2

        if (np.any(projected < -margin) or
            np.any(projected[:, 0] > scene_w + margin) or
            np.any(projected[:, 1] > scene_h + margin)):
            return False

        # Check area ratio
        model_area = h * w
        detected_area = GeometryUtils.polygon_area(projected)

        if detected_area < Config.MIN_AREA:
            return False

        ratio = detected_area / model_area
        if ratio < Config.MIN_AREA_RATIO or ratio > Config.MAX_AREA_RATIO:
            return False

        return True

## 9. Book Detector

### Multi-Instance Detection Strategy

We use **iterative detection with masking**:

1. Match all model keypoints to scene
2. Estimate affine transformation (RANSAC)
3. If valid: save detection, mark inlier keypoints as used
4. Repeat with remaining keypoints until no more valid detections
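The control flow of these four steps can be sketched with a toy problem. In the example below the "model fit" is deliberately trivial: each match votes for an x-offset, and the most frequent offset stands in for the RANSAC affine estimate. All names and values are hypothetical; only the detect, mask, repeat structure mirrors the strategy above.

```python
from collections import Counter
from typing import List, Set, Tuple

Match = Tuple[int, int]  # (scene keypoint index, x-offset voted by this match)

def detect_iteratively(matches: List[Match], min_inliers: int = 3,
                       max_instances: int = 10) -> List[int]:
    """Repeatedly fit the dominant offset, record it, and mask its inliers."""
    used: Set[int] = set()
    found: List[int] = []
    for _ in range(max_instances):
        remaining = [(idx, off) for idx, off in matches if idx not in used]
        if len(remaining) < min_inliers:
            break
        # Toy "model fit": the most frequent offset stands in for RANSAC + affine
        offset, _count = Counter(off for _, off in remaining).most_common(1)[0]
        inliers = [idx for idx, off in remaining if off == offset]
        used.update(inliers)  # mask inliers whether or not the model is accepted
        if len(inliers) >= min_inliers:
            found.append(offset)
    return found

# Two copies of a book at x-offsets 100 and 160, plus two spurious matches
matches = [(0, 100), (1, 100), (2, 100), (3, 100),
           (4, 160), (5, 160), (6, 160),
           (7, 37), (8, 412)]
print(detect_iteratively(matches))  # → [100, 160]
```

Masking the inliers even when a candidate is rejected matters: without it, the same weak hypothesis would be re-estimated on every iteration and the loop would make no progress.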

### Why Not Clustering?

We experimented with clustering approaches:
- **Scene-location DBSCAN**: Failed because adjacent books have nearby keypoints
- **Translation-voting DBSCAN**: Fragmented instances into too-small clusters

Simple iterative masking proved most effective.

In [None]:
class BookDetector:
    """
    Main book detection pipeline.

    Pipeline:
    1. Preprocess images (CLAHE)
    2. Extract RootSIFT features
    3. Match features (5NN)
    4. Iteratively detect instances (Affine + RANSAC + Masking)
    """

    def __init__(self):
        self.preprocessor = ImagePreprocessor()
        self.feature_extractor = RootSIFT()
        self.matcher = FeatureMatcher()
        self.affine_estimator = AffineEstimator()
        self.model_cache = {}  # Cache model features

    def load_model(self, model_path: str):
        """Load and cache model image features."""
        if model_path in self.model_cache:
            return self.model_cache[model_path]

        img = cv2.imread(model_path)
        if img is None:
            raise ValueError(f"Could not load: {model_path}")

        gray = self.preprocessor.preprocess(img)
        kp, des = self.feature_extractor.detect_and_compute(gray)

        self.model_cache[model_path] = (img, kp, des)
        return img, kp, des

    def detect_in_scene(self, scene_path: str, model_paths: List[str],
                        verbose: bool = False) -> Tuple[List[BookDetection], np.ndarray]:
        """
        Detect all books in a scene image.

        Args:
            scene_path: Path to scene image
            model_paths: List of paths to model images
            verbose: Print progress information

        Returns:
            (list of BookDetection, annotated image)
        """
        # Load and preprocess scene
        scene_img = cv2.imread(scene_path)
        if scene_img is None:
            raise ValueError(f"Could not load: {scene_path}")

        scene_gray = self.preprocessor.preprocess(scene_img)
        scene_kp, scene_des = self.feature_extractor.detect_and_compute(scene_gray)

        if verbose:
            print(f"Scene: {Path(scene_path).name} - {len(scene_kp)} keypoints")

        detections = []
        result_img = scene_img.copy()

        # Process each model
        for book_id, model_path in enumerate(model_paths):
            model_img, model_kp, model_des = self.load_model(model_path)

            if model_des is None or len(model_kp) < Config.MIN_MATCH_COUNT:
                detections.append(BookDetection(book_id, model_path, []))
                continue

            # Detect all instances of this book
            instances = self._detect_instances(
                model_img, model_kp, model_des,
                scene_img, scene_kp, scene_des
            )

            detection = BookDetection(book_id, model_path, instances)
            detections.append(detection)

            # Draw on result image
            self._draw_detection(result_img, detection)

            if verbose and len(instances) > 0:
                print(f"  Book {book_id}: {len(instances)} instance(s)")

        return detections, result_img

    def _detect_instances(self, model_img, model_kp, model_des,
                          scene_img, scene_kp, scene_des) -> List[BoundingBox]:
        """
        Detect all instances of a book using iterative affine estimation.

        Strategy:
        1. Match features
        2. Estimate affine transformation
        3. Validate and save detection
        4. Mask inlier keypoints
        5. Repeat until no more valid detections
        """
        instances = []
        excluded_indices: Set[int] = set()

        for _ in range(Config.MAX_INSTANCES_PER_BOOK):
            # Match with exclusion
            matches = self.matcher.match(model_des, scene_des, excluded_indices)

            if len(matches) < Config.MIN_MATCH_COUNT:
                break

            # Extract point coordinates
            src_pts = np.float32([model_kp[m.queryIdx].pt for m in matches])
            dst_pts = np.float32([scene_kp[m.trainIdx].pt for m in matches])
            match_indices = [m.trainIdx for m in matches]

            # Estimate affine transformation
            M, inlier_mask, n_inliers = self.affine_estimator.estimate(src_pts, dst_pts)

            if M is None:
                break

            # Scene keypoints supporting this model. They are masked on every
            # path below (accepted or rejected) so the next iteration makes progress.
            inlier_flags = inlier_mask.ravel().astype(bool)
            inlier_indices = [idx for idx, is_inlier in zip(match_indices, inlier_flags)
                              if is_inlier]
            excluded_indices.update(inlier_indices)

            # Check inlier quality
            inlier_ratio = n_inliers / len(matches)
            if n_inliers < Config.MIN_INLIERS or inlier_ratio < Config.MIN_INLIERS_RATIO:
                continue

            # Validate transformation
            if not self.affine_estimator.validate(M, model_img.shape[:2], scene_img.shape[:2]):
                continue

            # Create bounding box
            h, w = model_img.shape[:2]
            corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)
            projected = self.affine_estimator.transform_points(M, corners)
            area = GeometryUtils.polygon_area(projected)

            bbox = BoundingBox(
                top_left=tuple(map(int, projected[0])),
                top_right=tuple(map(int, projected[1])),
                bottom_right=tuple(map(int, projected[2])),
                bottom_left=tuple(map(int, projected[3])),
                area=int(area),
                n_inliers=n_inliers,
                inlier_ratio=inlier_ratio
            )

            # Check for duplicates (NMS)
            candidate_poly = bbox.get_polygon()
            is_duplicate = any(
                GeometryUtils.polygon_iou(candidate_poly, existing.get_polygon()) > Config.IOU_THRESHOLD
                for existing in instances
            )
            if is_duplicate:
                continue

            # Valid new instance
            instances.append(bbox)

        return instances

    def _draw_detection(self, img: np.ndarray, detection: BookDetection):
        """Draw bounding boxes on image."""
        colors = [
            (0, 255, 0), (255, 0, 0), (0, 0, 255), (255, 255, 0),
            (255, 0, 255), (0, 255, 255), (128, 0, 255), (255, 128, 0)
        ]
        color = colors[detection.book_id % len(colors)]

        for inst in detection.instances:
            pts = np.array([inst.top_left, inst.top_right,
                           inst.bottom_right, inst.bottom_left], dtype=np.int32)
            cv2.polylines(img, [pts], True, color, 3)

            label = f"Book {detection.book_id}"
            pos = (inst.top_left[0], max(inst.top_left[1] - 10, 20))
            cv2.putText(img, label, pos, cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)

## 10. Output Formatting

In [None]:
def format_output(detections: List[BookDetection]) -> str:
    """
    Format detection results as specified in the assignment.

    Output format:
    Book X - N instance(s) found:
      Instance 1 {top_left: (x,y), top_right: (x,y), ...}
    """
    lines = []

    for det in detections:
        if len(det.instances) > 0:
            lines.append(f"Book {det.book_id} - {len(det.instances)} instance(s) found:")

            for i, inst in enumerate(det.instances, 1):
                lines.append(
                    f"  Instance {i} {{"
                    f"top_left: {inst.top_left}, "
                    f"top_right: {inst.top_right}, "
                    f"bottom_left: {inst.bottom_left}, "
                    f"bottom_right: {inst.bottom_right}, "
                    f"area: {inst.area}px}}"
                )

    return "\n".join(lines)

## 11. Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

MODELS_PATH = "/content/drive/MyDrive/Colab Notebooks/IPCV_1/dataset/models"
SCENES_PATH = "/content/drive/MyDrive/Colab Notebooks/IPCV_1/dataset/scenes"

# Load file lists
model_files = sorted(Path(MODELS_PATH).glob("model_*.png"), key=lambda x: int(x.stem.split('_')[1]))
scene_files = sorted(Path(SCENES_PATH).glob("scene_*.jpg"), key=lambda x: int(x.stem.split('_')[1]))

model_paths = [str(f) for f in model_files]
scene_paths = [str(f) for f in scene_files]

print(f"Found {len(model_paths)} models and {len(scene_paths)} scenes")

## 12. Run Detection

In [None]:
# Initialize detector
detector = BookDetector()

# Process all scenes
all_results = {}
all_images = {}

for idx, scene_path in enumerate(scene_paths):
    print(f"\nProcessing scene {idx}: {Path(scene_path).name}")

    detections, result_img = detector.detect_in_scene(scene_path, model_paths, verbose=False)

    all_results[idx] = detections
    all_images[idx] = result_img

    total = sum(len(d.instances) for d in detections)
    books = sum(1 for d in detections if len(d.instances) > 0)
    print(f"  Found {total} instances of {books} books")

print("\n" + "="*60)
print("PROCESSING COMPLETE")
print("="*60)

## 13. Results

In [None]:
# Print formatted results
for idx, detections in all_results.items():
    detected = [d for d in detections if len(d.instances) > 0]
    if detected:
        print(f"\n{'='*60}")
        print(f"SCENE: {Path(scene_paths[idx]).name}")
        print(f"{'='*60}")
        print(format_output(detected))

## 14. Visualization

In [None]:
# Visualize all results
for idx in range(len(scene_paths)):
    if idx not in all_images:
        continue

    plt.figure(figsize=(12, 8))
    plt.imshow(cv2.cvtColor(all_images[idx], cv2.COLOR_BGR2RGB))
    plt.title(f"Scene {idx}: {Path(scene_paths[idx]).name}")
    plt.axis('off')
    plt.show()

In [None]:
def diagnose_detection(detector, scene_path, model_path, book_id, scene_idx):
    """
    Diagnose the affine detection pipeline for a specific scene/model pair.

    Usage:
        diagnose_detection(detector, scene_paths[19], model_paths[6], book_id=6, scene_idx=19)
    """
    import cv2
    import numpy as np
    from pathlib import Path

    print("="*70)
    print(f"DIAGNOSTIC: Scene {scene_idx} / Book {book_id}")
    print("="*70)

    # =========================================================================
    # STEP 1: Load and preprocess images
    # =========================================================================
    print("\n[STEP 1] IMAGE LOADING & PREPROCESSING")
    print("-"*50)

    scene_img = cv2.imread(scene_path)
    if scene_img is None:
        raise ValueError(f"Could not load: {scene_path}")
    model_img, model_kp, model_des = detector.load_model(model_path)

    print(f"Scene path: {Path(scene_path).name}")
    print(f"Model path: {Path(model_path).name}")
    print(f"Scene image shape: {scene_img.shape}")
    print(f"Model image shape: {model_img.shape}")

    scene_gray = detector.preprocessor.preprocess(scene_img)
    model_gray = detector.preprocessor.preprocess(model_img)

    print(f"Scene mean intensity (after CLAHE): {scene_gray.mean():.1f}")
    print(f"Model mean intensity (after CLAHE): {model_gray.mean():.1f}")

    # =========================================================================
    # STEP 2: Feature extraction
    # =========================================================================
    print("\n[STEP 2] FEATURE EXTRACTION (RootSIFT)")
    print("-"*50)

    scene_kp, scene_des = detector.feature_extractor.detect_and_compute(scene_gray)

    print(f"Model keypoints: {len(model_kp)}")
    print(f"Scene keypoints: {len(scene_kp)}")

    if model_des is not None:
        print(f"Model descriptor shape: {model_des.shape}")
    if scene_des is not None:
        print(f"Scene descriptor shape: {scene_des.shape}")

    # =========================================================================
    # STEP 3: Feature matching (5NN)
    # =========================================================================
    print("\n[STEP 3] FEATURE MATCHING (5NN)")
    print("-"*50)

    matches = detector.matcher.match(model_des, scene_des, excluded_indices=None)

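    print(f"Matches after 5NN + filtering: {len(matches)}")
    # With 5NN one model keypoint can contribute several matches, so this ratio
    # may exceed 100%. Guard against models with no keypoints.
    if len(model_kp) > 0:
        print(f"Match ratio: {len(matches)/len(model_kp)*100:.1f}% (matches / model keypoints)")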

    if len(matches) < Config.MIN_MATCH_COUNT:
        print(f"⚠️  Not enough matches (need {Config.MIN_MATCH_COUNT}, have {len(matches)})")
        return

    # Extract coordinates
    src_pts = np.float32([model_kp[m.queryIdx].pt for m in matches])
    dst_pts = np.float32([scene_kp[m.trainIdx].pt for m in matches])
    match_indices = [m.trainIdx for m in matches]

    print(f"\nScene match locations:")
    print(f"  X range: {dst_pts[:,0].min():.0f} - {dst_pts[:,0].max():.0f}")
    print(f"  Y range: {dst_pts[:,1].min():.0f} - {dst_pts[:,1].max():.0f}")

    # =========================================================================
    # STEP 4: Affine estimation (single pass on all matches)
    # =========================================================================
    print("\n[STEP 4] AFFINE ESTIMATION (all matches)")
    print("-"*50)

    M, inlier_mask, n_inliers = detector.affine_estimator.estimate(src_pts, dst_pts)

    if M is None:
        print("⚠️  Affine estimation failed!")
        return

    inlier_ratio = n_inliers / len(matches)
    print(f"Inliers: {n_inliers}/{len(matches)} ({inlier_ratio*100:.1f}%)")
    print(f"MIN_INLIERS threshold: {Config.MIN_INLIERS}")
    print(f"MIN_INLIERS_RATIO threshold: {Config.MIN_INLIERS_RATIO*100:.0f}%")

    passes_inlier_count = n_inliers >= Config.MIN_INLIERS
    passes_inlier_ratio = inlier_ratio >= Config.MIN_INLIERS_RATIO
    print(f"Passes inlier count: {passes_inlier_count}")
    print(f"Passes inlier ratio: {passes_inlier_ratio}")

    # Show affine matrix
    print(f"\nAffine matrix:")
    print(f"  [{M[0,0]:.3f}  {M[0,1]:.3f}  {M[0,2]:.1f}]")
    print(f"  [{M[1,0]:.3f}  {M[1,1]:.3f}  {M[1,2]:.1f}]")

    # Decompose affine into components
    scale_x = np.sqrt(M[0,0]**2 + M[1,0]**2)
    scale_y = np.sqrt(M[0,1]**2 + M[1,1]**2)
    rotation = np.arctan2(M[1,0], M[0,0]) * 180 / np.pi
    tx, ty = M[0,2], M[1,2]

    print(f"\nAffine decomposition:")
    print(f"  Scale X: {scale_x:.3f}")
    print(f"  Scale Y: {scale_y:.3f}")
    print(f"  Rotation: {rotation:.1f}°")
    print(f"  Translation: ({tx:.1f}, {ty:.1f})")

    # =========================================================================
    # STEP 5: Validation
    # =========================================================================
    print("\n[STEP 5] VALIDATION")
    print("-"*50)

    is_valid = detector.affine_estimator.validate(M, model_img.shape[:2], scene_img.shape[:2])
    print(f"Validation result: {'✓ VALID' if is_valid else '✗ INVALID'}")

    # Show projected corners
    h, w = model_img.shape[:2]
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)
    projected = detector.affine_estimator.transform_points(M, corners)

    print(f"\nProjected corners:")
    print(f"  Top-left:     ({projected[0,0]:.0f}, {projected[0,1]:.0f})")
    print(f"  Top-right:    ({projected[1,0]:.0f}, {projected[1,1]:.0f})")
    print(f"  Bottom-right: ({projected[2,0]:.0f}, {projected[2,1]:.0f})")
    print(f"  Bottom-left:  ({projected[3,0]:.0f}, {projected[3,1]:.0f})")

    # Area check
    model_area = h * w
    detected_area = GeometryUtils.polygon_area(projected)
    area_ratio = detected_area / model_area

    print(f"\nArea analysis:")
    print(f"  Model area: {model_area} px")
    print(f"  Detected area: {detected_area:.0f} px")
    print(f"  Area ratio: {area_ratio:.3f}")
    print(f"  Valid range: [{Config.MIN_AREA_RATIO}, {Config.MAX_AREA_RATIO}]")
    print(f"  Min area threshold: {Config.MIN_AREA}")

    # Convexity check
    is_convex = GeometryUtils.is_convex(projected)
    print(f"\nConvexity: {'✓ Convex' if is_convex else '✗ Not convex'}")

    # =========================================================================
    # STEP 6: Inlier distribution
    # =========================================================================
    print("\n[STEP 6] INLIER DISTRIBUTION")
    print("-"*50)

    if inlier_mask is not None:
        inlier_dst = dst_pts[inlier_mask.ravel() == 1]
        outlier_dst = dst_pts[inlier_mask.ravel() == 0]

        # min()/max() raise on empty arrays, so guard both groups
        if len(inlier_dst) > 0:
            print("Inlier locations:")
            print(f"  X range: {inlier_dst[:,0].min():.0f} - {inlier_dst[:,0].max():.0f}")
            print(f"  Y range: {inlier_dst[:,1].min():.0f} - {inlier_dst[:,1].max():.0f}")
        else:
            print("No inliers to report")

        if len(outlier_dst) > 0:
            print("\nOutlier locations:")
            print(f"  X range: {outlier_dst[:,0].min():.0f} - {outlier_dst[:,0].max():.0f}")
            print(f"  Y range: {outlier_dst[:,1].min():.0f} - {outlier_dst[:,1].max():.0f}")

    # =========================================================================
    # STEP 7: Simulate iterative detection
    # =========================================================================
    print("\n[STEP 7] ITERATIVE DETECTION SIMULATION")
    print("-"*50)

    excluded = set()
    iteration = 0

    while iteration < Config.MAX_INSTANCES_PER_BOOK:
        iteration += 1

        # Match with exclusion
        iter_matches = detector.matcher.match(model_des, scene_des, excluded_indices=excluded)

        if len(iter_matches) < Config.MIN_MATCH_COUNT:
            print(f"\nIteration {iteration}: Only {len(iter_matches)} matches remaining - STOP")
            break

        iter_src = np.float32([model_kp[m.queryIdx].pt for m in iter_matches])
        iter_dst = np.float32([scene_kp[m.trainIdx].pt for m in iter_matches])
        iter_indices = [m.trainIdx for m in iter_matches]

        M_iter, mask_iter, n_inliers_iter = detector.affine_estimator.estimate(iter_src, iter_dst)

        if M_iter is None:
            print(f"\nIteration {iteration}: Affine estimation failed - STOP")
            break

        inlier_ratio_iter = n_inliers_iter / len(iter_matches)
        is_valid_iter = detector.affine_estimator.validate(M_iter, model_img.shape[:2], scene_img.shape[:2])

        # Get inlier indices (flatten the mask so each entry is a scalar flag, as in Step 6)
        inlier_flags = mask_iter.ravel()
        inlier_indices = [idx for idx, flag in zip(iter_indices, inlier_flags) if flag]

        status = ""
        if n_inliers_iter < Config.MIN_INLIERS:
            status = "REJECTED (too few inliers)"
        elif inlier_ratio_iter < Config.MIN_INLIERS_RATIO:
            status = "REJECTED (low inlier ratio)"
        elif not is_valid_iter:
            status = "REJECTED (invalid geometry)"
        else:
            # Project corners for this detection
            corners_iter = detector.affine_estimator.transform_points(M_iter, corners)
            status = f"✓ VALID at ({corners_iter[0,0]:.0f}, {corners_iter[0,1]:.0f})"

        print(f"\nIteration {iteration}:")
        print(f"  Matches: {len(iter_matches)}")
        print(f"  Inliers: {n_inliers_iter} ({inlier_ratio_iter*100:.1f}%)")
        print(f"  Valid geometry: {is_valid_iter}")
        print(f"  Status: {status}")

        # Mask inliers for next iteration
        excluded.update(inlier_indices)

        if "REJECTED" in status and n_inliers_iter < 3:
            print("  → No more viable matches - STOP")
            break

    print("\n" + "="*70)
    print("END DIAGNOSTIC")
    print("="*70)


def diagnose_quick(detector, scene_paths, model_paths, scene_idx, book_id):
    """
    Quick diagnostic shortcut.

    Usage:
        diagnose_quick(detector, scene_paths, model_paths, scene_idx=19, book_id=6)
    """
    diagnose_detection(detector, scene_paths[scene_idx], model_paths[book_id], book_id, scene_idx)


def diagnose_all_failures(detector, scene_paths, model_paths, expected):
    """
    Run diagnostics on all expected multi-instance cases.

    Args:
        expected: dict of {scene_idx: {book_id: expected_count}}

    Example:
        expected = {
            10: {19: 4},  # scene_10 should have 4 copies of book_19
            19: {6: 3},   # scene_19 should have 3 copies of book_6
        }
        diagnose_all_failures(detector, scene_paths, model_paths, expected)
    """
    for scene_idx, books in expected.items():
        for book_id, expected_count in books.items():
            print(f"\n{'#'*70}")
            print(f"Expected {expected_count} instances of Book {book_id} in Scene {scene_idx}")
            print(f"{'#'*70}")
            diagnose_detection(detector, scene_paths[scene_idx], model_paths[book_id], book_id, scene_idx)

In [None]:
diagnose_detection(detector, scene_paths[27], model_paths[2], book_id=2, scene_idx=27)
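
Step 4 of the diagnostic above reads scale and rotation off the columns of the 2×3 affine matrix. The following is a minimal, standalone sketch of that decomposition. The helper names `decompose_affine` and `compose_affine` are illustrative and not part of the pipeline, and the round trip is exact only under the assumption that the matrix contains no shear or reflection.

```python
import math


def decompose_affine(M):
    """Split a 2x3 affine matrix [[a, b, tx], [c, d, ty]] into components.

    Scales are the norms of the two columns of the linear part; rotation is
    the angle of the first column (where the model's x-axis lands).
    """
    (a, b, tx), (c, d, ty) = M
    scale_x = math.hypot(a, c)
    scale_y = math.hypot(b, d)
    rotation_deg = math.degrees(math.atan2(c, a))
    return scale_x, scale_y, rotation_deg, (tx, ty)


def compose_affine(scale_x, scale_y, rotation_deg, translation):
    """Build rotation @ diag(scale_x, scale_y) plus translation (no shear)."""
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    tx, ty = translation
    return [
        [scale_x * cos_t, -scale_y * sin_t, tx],
        [scale_x * sin_t, scale_y * cos_t, ty],
    ]


# Hypothetical values for a nearly upright book at roughly model scale
M_example = compose_affine(0.95, 1.05, 3.0, (120.0, 40.0))
print(decompose_affine(M_example))
```

If the estimated matrix does contain shear, the angle describes only where the model's x-axis lands, so the printed rotation should be read as approximate.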

## 15. Statistics

In [None]:
# Compute statistics
total_detections = 0
detections_per_scene = []
detections_per_book = defaultdict(int)

for detections in all_results.values():
    scene_total = sum(len(d.instances) for d in detections)
    detections_per_scene.append(scene_total)
    total_detections += scene_total

    for d in detections:
        if len(d.instances) > 0:
            detections_per_book[d.book_id] += len(d.instances)

print("DETECTION STATISTICS")
print("="*60)
print(f"Total detections: {total_detections}")
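if detections_per_scene:
    print(f"Average per scene: {np.mean(detections_per_scene):.2f}")
    print(f"Max in one scene: {max(detections_per_scene)}")
else:
    # max() raises on an empty list, so report the empty case explicitly
    print("No scenes processed")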
print(f"\nUnique books detected: {len(detections_per_book)}")

print("\nTop 10 most detected books:")
for book_id, count in sorted(detections_per_book.items(), key=lambda x: -x[1])[:10]:
    print(f"  Book {book_id}: {count} instances")

## 16. Conclusion

### Summary

This pipeline implements book detection using traditional computer vision:

1. **RootSIFT features** for robust descriptor matching
2. **5NN matching** to support multi-instance detection
3. **Affine transformation** (not homography) for geometric verification
4. **Iterative detection with masking** for finding multiple copies
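
The iterative masking idea in item 4 can be illustrated with a toy sketch. It is not the detector's implementation: it assumes a 1D translation-only model instead of an affine one, tries every remaining match as a hypothesis instead of sampling randomly, and masks match indices rather than scene keypoint indices. The function name and its parameters are hypothetical.

```python
def detect_instances_1d(matches, min_inliers=3, tol=2.0, max_instances=5):
    """Toy multi-instance detection by repeated fit-and-mask.

    matches: list of (model_x, scene_x) pairs.
    Model:   scene_x = model_x + t, one translation t per book copy.
    Returns the estimated translation of each detected copy.
    """
    excluded = set()
    found = []
    for _ in range(max_instances):
        active = [i for i in range(len(matches)) if i not in excluded]
        if len(active) < min_inliers:
            break

        # Each remaining match proposes a translation; keep the best consensus
        best_inliers = []
        for i in active:
            t = matches[i][1] - matches[i][0]
            inliers = [
                j for j in active
                if abs((matches[j][1] - matches[j][0]) - t) <= tol
            ]
            if len(inliers) > len(best_inliers):
                best_inliers = inliers

        if len(best_inliers) < min_inliers:
            break

        # Refine the estimate on the consensus set, then mask its matches
        offsets = [matches[j][1] - matches[j][0] for j in best_inliers]
        found.append(sum(offsets) / len(offsets))
        excluded.update(best_inliers)
    return found


# Two copies of the same book (offsets near 100 and 160) plus two outliers
toy_matches = [
    (0, 100), (10, 111), (20, 119), (30, 130),
    (0, 160), (10, 170), (20, 181),
    (5, 400), (25, 37),
]
print(detect_instances_1d(toy_matches))
```

Masking the consensus set after each accepted fit is what lets the next pass lock onto the second copy instead of rediscovering the first.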

### Design Rationale

The key insight is that **simpler is better** for this dataset:

- Books are upright → little in-plane rotation to model
- Camera is frontal → minimal perspective distortion
- Scale is consistent → no complex scale handling needed

Using an affine model (6 DOF) instead of a homography (8 DOF) gives a more stable estimate when each book instance contributes only a handful of matches (~6-10).
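
To see why three correspondences are enough for an affine model, note that each point pair contributes two linear equations and the model has six unknowns. The sketch below solves that system exactly with NumPy. `affine_from_three_points` is a hypothetical helper for illustration only; the pipeline itself estimates the matrix robustly from many matches.

```python
import numpy as np


def affine_from_three_points(src, dst):
    """Solve for the 2x3 affine matrix M mapping three src points onto dst.

    Each correspondence (x, y) -> (u, v) gives two equations:
        u = a*x + b*y + tx
        v = c*x + d*y + ty
    Three non-collinear points therefore fix all six unknowns.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    if src.shape != (3, 2) or dst.shape != (3, 2):
        raise ValueError("expected exactly three 2D points on each side")

    # Rows [x, y, 1]; np.linalg.solve raises LinAlgError for collinear points
    A = np.hstack([src, np.ones((3, 1))])
    X = np.linalg.solve(A, dst)  # shape (3, 2), columns are (a, b, tx) and (c, d, ty)
    return X.T                   # shape (2, 3), same layout as the matrix printed in Step 4


# Hypothetical correspondences generated from a known transform
M_true = np.array([[0.9, -0.1, 50.0],
                   [0.1, 0.9, 30.0]])
src_pts = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 200.0]])
dst_pts = src_pts @ M_true[:, :2].T + M_true[:, 2]
print(affine_from_three_points(src_pts, dst_pts))
```

A homography has eight unknowns and so needs at least four correspondences by the same counting argument, which leaves less redundancy for outlier rejection when only a few matches are available.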

### Limitations

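- Heavily occluded books may be missed, since too few keypoints remain visible to reach the inlier thresholds
- Performance depends on book cover texture (low-texture covers yield fewer features)
- Adjacent identical books are challenging: their matches can interleave, so one affine fit may absorb inliers that belong to a neighboring copy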