# Classifier to detector using image pyramids

Back before deep learning-based object detectors, the state-of-the-art was to use HOG + Linear SVM to detect objects in an image.

We’ll be borrowing elements from HOG + Linear SVM to convert any deep neural network image classifier into an object detector.

**1st key ingredient:** 

from HOG + Linear SVM is to use image pyramids.

![](https://929687.smushcdn.com/2633864/wp-content/uploads/2020/06/keras_classifier_object_detector_pyramid_example_2.png?lossy%3D1%26strip%3D1%26webp%3D1)

Utilizing an image pyramid allows us to find objects in images at different scales (i.e., sizes) of an image (Figure 2).

At the bottom of the pyramid, we have the original image at its original size (in terms of width and height).

And at each subsequent layer, the image is resized (subsampled) and optionally smoothed (usually via Gaussian blurring).

The image is progressively subsampled until some stopping criterion is met, which is normally when a minimum size has been reached and no further subsampling needs to take place.

**2nd key ingredient**

Sliding Windows

![](https://929687.smushcdn.com/2633864/wp-content/uploads/2014/10/sliding_window_example.gif?size%3D256x377%26lossy%3D1%26strip%3D1%26webp%3D1)

a sliding window is a fixed-size rectangle that slides from left-to-right and top-to-bottom within an image.

At each stop of the window we would:

1. Extract the ROI
2. Pass it through our image classifier (ex., Linear SVM, CNN, etc.)
3. Obtain the output predictions

>> **Combined with image pyramids, sliding windows allow us to localize objects at different locations and multiple scales of the input image:**

**3rd key ingredient**

Non maxima Supression

When performing object detection, our object detector will typically produce multiple, overlapping bounding boxes surrounding an object in an image.

This behavior is totally normal — it simply implies that as the sliding window approaches an image, our classifier component is returning larger and larger probabilities of a positive detection.

Of course, multiple bounding boxes pose a problem — there’s only one object there, and we somehow need to collapse/remove the extraneous bounding boxes.

**The solution to the problem is to apply non-maxima suppression (NMS), which collapses weak, overlapping bounding boxes in favor of the more confident ones**

![](https://929687.smushcdn.com/2633864/wp-content/uploads/2020/06/keras_classifier_object_detector_steps.png?lossy%3D1%26strip%3D1%26webp%3D1)

In [1]:
#--Tensorflow
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.applications import imagenet_utils
#--Others
from imutils.object_detection import non_max_suppression
import numpy as np
import argparse 
import imutils
import time
import cv2

### Define the sliding window and the image pyramid functions

In [2]:
def sliding_window(image, step, ws):
    #ws: The window size defines the width and height (in pixels) of the window we are going to extract 
    #from our image
    #slide a window through the image. Complete the rows (x) before moving down the columns (y) of the image
    for y in range(0, image.shape[0]-ws[1], step): #substract ws so we get the difference that is the space we will pass through
        for x in range(0, image.shape[1]-ws[0], step):
            # step size, which indicates how many pixels we are going to “skip” in both the (x, y) directions.
            #ususally goes from 4 to 8 pixels
            #yield the current window (generator)
            yield (x, y, image[y:y+ws[1], x:x+ws[0]])

In [3]:
def image_pyramid(image, scale=1.5, minSize=(224,224)):
    #Yield the original image
    yield image

    #Keep lopping over the image pyramid
    while True:
        #Get the dimentions of the next image of the pyramid
        w = int(image.shape[1]/scale)
        image = imutils.resize(image, width=w)
        
        #if the resized images reaches the minimum supplied size then stop constructing the pyramid
        if image.shape[0] < minSize[1] or image.shape[1] < minSize[0]: #remember that cv2 reads height first. so image.shape[0] is height
            break

        #yield the next image in the pyramid
        yield image 

### Get the model

In [4]:
#Load the model with it's weights
model = ResNet50(weights='imagenet', include_top=True)

### Load and preprocess the data

In [5]:
orig = cv2.imread('/media/juan/juan1/pyimage_univ/object_detect_201/classifier-to-detector/images/hummingbird.jpg')
orig = imutils.resize(orig, width=600)
(H,W) = orig.shape[:2]

In [6]:
#Initialize the pyramid
pyramid = image_pyramid(orig, scale=1.5, minSize=(200,150))

In [7]:
#create two lists:
#Rois: will contain all the ROI's generated from the pyramid and sliding window
#Locs: contains the x, y coordinates of where those ROI's were on the original image
rois, locs = [], []

In [14]:
roi_size = (200,150)
input_size = (224, 224)

In [9]:
#Go through all images in the pyramid
for image in pyramid:

    #Determine the scale factor between the original image dimensions and the current layer of the pyramid
    scale = W/float(image.shape[1])

    #For each layer of the image pyramid, loop ober the sliding window locations
    for (x, y, roiOrig) in sliding_window(image, 16, roi_size):
        #Scale the (x, y) of the ROI with respect to the original image dimensions
        x = int(x*scale)
        y = int(y*scale)
        w = int(roi_size[0]*scale)
        h = int(roi_size[1]*scale)

        #Take the roi and pre-proces it soo we can classify the region with keras/tf
        roi = cv2.resize(roiOrig, input_size) #resize to the dimensions used by resnet50
        roi = img_to_array(roi)
        roi = preprocess_input(roi)

        #Update the list of ROIs and coordinates
        rois.append(roi)
        locs.append((x, y, x+w, y+h)) #remember to add the difference

        #Visualize the sliding window and roi in real time
        clone = orig.copy()
        cv2.rectangle(clone, (x,y), (x+w, y+h), (0,255,0), 2)

        cv2.imshow("visualization", clone)
        cv2.imshow('roi', roiOrig)
        cv2.waitKey(0)
    
cv2.destroyAllWindows()

### Convert the rois list into a float32 array

In [20]:
rois = np.array(rois, dtype="float32")

In [22]:
rois.shape

(365, 224, 224, 3)

### Take the ROIs and make predictions

In [21]:
start = time.time()
preds = model.predict(rois)
end = time.time()
print("[INFO] classifying ROIs took {:.2f} seconds".format(end - start))

[INFO] classifying ROIs took 17.53 seconds


In [24]:
# decode the predictions and initialize a dictionary which maps class
# labels (keys) to any ROIs associated with that label (values)
preds = imagenet_utils.decode_predictions(preds, top=1) #all the labels used to train the resnet50 model
labels = {}

In [26]:
#loop over the predictions
for (i,p) in enumerate(preds):
    #grab the prediction information for the current ROI
    (imagenetID, label, prob) = p[0]

    #filter out weak detections. Use a treshold: the minimum accepted probabilty
    if prob>= 0.9:
        #grab the bounding box associated to the probability 
        box = locs[1]

        #Get the label associated to the probability
        L = labels.get(label, [])
        L.append((box, prob))
        labels[label] = L

In [29]:
labels

{'hummingbird': [((16, 0, 216, 150), 0.98895955),
  ((16, 0, 216, 150), 0.9960862),
  ((16, 0, 216, 150), 0.9686435),
  ((16, 0, 216, 150), 0.9984379),
  ((16, 0, 216, 150), 0.9995752),
  ((16, 0, 216, 150), 0.9992683),
  ((16, 0, 216, 150), 0.9980788),
  ((16, 0, 216, 150), 0.9991916),
  ((16, 0, 216, 150), 0.9996457),
  ((16, 0, 216, 150), 0.9991525),
  ((16, 0, 216, 150), 0.99810106),
  ((16, 0, 216, 150), 0.9958823),
  ((16, 0, 216, 150), 0.9972958),
  ((16, 0, 216, 150), 0.9951442),
  ((16, 0, 216, 150), 0.9589257),
  ((16, 0, 216, 150), 0.9922621),
  ((16, 0, 216, 150), 0.9608538),
  ((16, 0, 216, 150), 0.9903033),
  ((16, 0, 216, 150), 0.9993269),
  ((16, 0, 216, 150), 0.9986399),
  ((16, 0, 216, 150), 0.9982065),
  ((16, 0, 216, 150), 0.9993145),
  ((16, 0, 216, 150), 0.9992097),
  ((16, 0, 216, 150), 0.99889195),
  ((16, 0, 216, 150), 0.9905469),
  ((16, 0, 216, 150), 0.98441863),
  ((16, 0, 216, 150), 0.96884245),
  ((16, 0, 216, 150), 0.95047325),
  ((16, 0, 216, 150), 0.958

In [27]:
#Loop obrt sl bounding boxes and the current label
for(box, prob) in labels[label]:
    #draw the bounding box on the image
    (startx, starty, endX, endY) = box
    cv2.rectangle(clone, (startx, starty), (endX, endY), (0,255,0))

In [28]:
# show the results *before* applying non-maxima suppression, then
	# clone the image again so we can display the results *after*
	# applying non-maxima suppression
cv2.imshow("Before", clone)
cv2.waitKey(0)
cv2.destroyAllWindows()
clone = orig.copy()

### Apply non maxima supression

In [30]:
#extract the bounding boxes and associated prediction probabilities, then apply non-maxima suppression
boxes = np.array([p[0] for p in labels[label]])
proba = np.array([p[1] for p in labels[label]])
boxes = non_max_suppression(boxes, proba)

In [31]:
# loop over all bounding boxes that were kept after applying non-maxima suppression
for (startX, startY, endX, endY) in boxes:
	cv2.rectangle(clone, (startX, startY), (endX, endY), (0, 255, 0), 2)
	y = startY - 10 if startY - 10 > 10 else startY + 10
	cv2.putText(clone, label, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 255, 0), 2)

	# show the output after apply non-maxima suppression
	cv2.imshow("After", clone)
	cv2.waitKey(0)
cv2.destroyAllWindows()