# Intro: Object Detection with Deep Learning
***

# Motivation
***
What if we could make computers see?
<center><img src="img/slides/tu_bs.jpg"/ style="width:1000px;"></center>


# Visual recognition tasks
***
A camera is not enough!
* We want the computer to understand the scene
    * Maybe start off by recognizing different objects first

# Classification
***
Does this image contain a building? What about a plane?
<center><img src="img/slides/tu_bs.jpg"/ style="width:1000px;"></center>


# Detection
***
Does this image contain a tree? [where?]
<center><img src="img/slides/tu_bs_tree.png"/ style="width:1000px;"></center>


# Detection
***
Which objects does this image contain?
<center><img src="img/slides/tu_bs_multi.png"/ style="width:1000px;"></center>


# Detection
***
Accurate localization (segmentation, pixelwise classification)
<center><img src="img/slides/tu_bs_seg.png"/ style="width:1000px;"></center>


# Detection
***
Classes vs instances: Where is *my* bike?
<center><img src="img/slides/tu_bs_inst.png"/ style="width:1000px;"></center>


# Visual recognition tasks
***
There are:
* Class and instance detection/localization
    * Bounding box, segmentation mask
* Object attribute estimation
    * How far away, how old, or just who is that
* Activity or event recognition
    * What is someone doing?
* Single image vs video
    * All of the above plus a time dimension

# Today's scope
***
For this lecture and the challenge:
* Class detection using bounding box labels
    * Single images

Next steps:
* Single vs multi stage networks
    * R-CNN, ...
    * YOLO/SSD
* Face detection challenge! 

# Recap: Convolutional neural networks
***
<center><img src="img/slides/cnn.gif"/ style="width:700px;"></center>


# Recap: Convolutional neural networks
***
<center><img src="img/slides/cnn_classifier.jpeg"/ style="width:1000px;"></center>


# A first approach to localization
***
<center><img src="img/slides/tu_bs.jpg"/ style="width:500px;float:right;"></center>

Let's say we want to find the positions of all bikes in our image:
* Train an image classifier for bikes
* Apply the classifier to each crop

Is that a good solution?

* There are *many* crops, and we need to check all of them
    * All possible heights and widths...
* Computation is not shared between overlapping crops
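
To get a feeling for how many crops that is, the count can be sketched as below. This is a minimal illustration only: the function names `count_crops` and `count_all_crops` and the `stride` parameter are made up for this example and assume axis-aligned, integer-sized windows.

```python
def count_crops(img_h, img_w, win_h, win_w, stride=1):
    """Number of positions for one window size (hypothetical helper)."""
    if win_h > img_h or win_w > img_w:
        return 0
    rows = (img_h - win_h) // stride + 1
    cols = (img_w - win_w) // stride + 1
    return rows * cols


def count_all_crops(img_h, img_w, stride=1):
    """Sum the positions over all possible window heights and widths."""
    return sum(
        count_crops(img_h, img_w, h, w, stride)
        for h in range(1, img_h + 1)
        for w in range(1, img_w + 1)
    )


# Even a tiny 4x4 image already has 100 distinct crops.
print(count_all_crops(4, 4))  # 100
```

For a 640x480 image and stride 1, the same count is roughly $2.4\cdot10^{10}$ crops, each of which would need its own classifier pass.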

# Better ideas
***
There are two main streams of object detectors, both refining the sliding window approach:
* Two-stage: Classify regions of interest
* Single-stage: Predictions on a fixed grid of cells

# Two-stage detection
***
<center><img src="img/slides/region_proposal.jpg"/ style="width:500px;float:right;"></center>

1. Region proposal
    * Based on the input image, find regions likely to contain objects
        * Selective search, region proposal networks
        * Usually, a few thousand proposals are drawn
2. Classification & bounding box regression stage
    * For each proposed region:
        * Predict an object class
        * Predict a bounding box containing the complete object


# Region-based CNNs (R-CNN)
***
Proposed by Girshick et al. @CVPR14

<center><img src="img/slides/rcnn_1.jpg"/ style="width:1000px;"></center>


# Region-based CNNs (R-CNN)
***
Selective search

<center><img src="img/slides/rcnn_2.jpg"/ style="width:1000px;"></center>


# Region-based CNNs (R-CNN)
***

<center><img src="img/slides/rcnn_3.jpg"/ style="width:1000px;"></center>


# Region-based CNNs (R-CNN)
***
Stage two: Feature extraction

<center><img src="img/slides/rcnn_4.jpg"/ style="width:1000px;"></center>


# Region-based CNNs (R-CNN)
***

<center><img src="img/slides/rcnn_5.jpg"/ style="width:1000px;"></center>


# Problems with R-CNN
***
Training is complicated:
* Softmax loss for classification
* SVM training with hinge loss
* Bounding box regression with MSE

The method is very slow:
* 2000 convnet passes per image!
* 47s/image during inference (in 2015)


# Fast R-CNN
***
Proposed by Girshick @ICCV15

<center><img src="img/slides/fast_rcnn_1.jpg"/ style="width:1000px;"></center>

# Fast R-CNN
***
Region proposals are projected onto the convnet features!

<center><img src="img/slides/fast_rcnn_2.jpg"/ style="width:1000px;"></center>

# Fast R-CNN
***
RoI pooling turns each region proposal into a fixed-size feature map

<center><img src="img/slides/fast_rcnn_3.jpg"/ style="width:1000px;"></center>

# Fast R-CNN
***
<center><img src="img/slides/fast_rcnn_4.jpg"/ style="width:1000px;"></center>

# Fast R-CNN
***
<center><img src="img/slides/fast_rcnn_5.jpg"/ style="width:1000px;"></center>

# Region-of-interest Pooling
***
<center><img src="img/slides/fast_rcnn_6.jpg"/ style="width:1200px;"></center>

Source: Stanford CS231n lecture
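
The pooling step in the figure can be sketched in a few lines of NumPy. This is a simplified illustration, not the Fast R-CNN reference implementation: the function name `roi_max_pool`, the `(y0, x0, y1, x1)` RoI format, and the integer feature-map coordinates are assumptions made for this example.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """Max-pool one region of interest to a fixed output size.

    feature_map: array of shape (H, W, C)
    roi: (y0, x0, y1, x1) in feature-map cells, end-exclusive (assumed format)
    """
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    reg_h, reg_w = region.shape[:2]
    out_h, out_w = out_size
    # Integer bin edges that split the region into out_h x out_w bins.
    ys = [(i * reg_h) // out_h for i in range(out_h + 1)]
    xs = [(j * reg_w) // out_w for j in range(out_w + 1)]
    out = np.zeros((out_h, out_w, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Make sure every bin covers at least one cell.
            y_end = max(ys[i + 1], ys[i] + 1)
            x_end = max(xs[j + 1], xs[j] + 1)
            out[i, j] = region[ys[i]:y_end, xs[j]:x_end].max(axis=(0, 1))
    return out


fmap = np.arange(16).reshape(4, 4, 1)
pooled = roi_max_pool(fmap, (0, 0, 4, 4))
print(pooled[..., 0])  # [[ 5  7] [13 15]]
```

Whatever the size of the RoI, the output always has shape `out_size + (C,)`, which is what lets the following fully connected layers work on every proposal.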

# What's still wrong?
***
<img src="img/slides/rcnn_meme.jpg" style="width:300px;float:right;"/>

The region proposals still come from an external source:
* E.g. selective search, EdgeBoxes, ...


Solution: Region proposal networks (RPN)
* You guessed it: Faster R-CNN!


# Faster R-CNN
***
<center><img src="img/slides/faster_rcnn.jpg"/ style="float:right;width:800px;"></center>

Proposed by Ren et al. @NIPS15:
* Four losses:
    * RPN classification object/no object
    * RPN bounding box regression
    * Final classification into object classes
    * Final box regression

Image sources: Stanford CS231n lecture

# Two-stage speed
***
<center><img src="img/slides/rcnn_speed.jpg"/ style="width:600px;"></center>

Source: Stanford CS231n lecture

# Single stage detection
***
Object detection without proposals
* Prediction on a *grid*
    * Divide the image into a fixed grid of cells
    * Predict for each cell:
        * Object class
        * Bounding box coordinates for the complete object
            * It might span multiple cells!

Notable architectures/papers:
* You Only Look Once (YOLO), by Redmon et al.
* Single Shot MultiBox Detector (SSD), by Liu et al.

# Grid-based detection
***
<center><img src="img/slides/coverage.png"/ style="width:1000px;"></center>

Source: <a href="https://www.jeremyjordan.me/object-detection-one-stage/">Jeremy Jordan</a>

# Network architecture
***
<center><img src="img/slides/anchors.jpg"/ style="float:right;width:600px;"></center>


For each cell in a $H_c\times W_c$ grid, we need:
* The probability that an object of class $n$ is in that cell
    * For $C$ classes: $H_c\times W_c\times C$ output map
* The bounding box coordinates for an object in a cell
    * Simple regression
        * Needs similarly shaped bounding boxes
    * Anchor box proposals
        * Simplifies finding the right shape
        
Image source: Stanford CS231n lecture
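
As a rough illustration of the anchor idea, the sketch below places the same set of box shapes at the center of every grid cell. The function name `make_anchors`, the `scales` and `ratios` values, the square `cell_size`, and the `(cx, cy, w, h)` format are all assumptions for this example, not the settings of a specific detector.

```python
import numpy as np


def make_anchors(grid_size, cell_size, scales=(1.0, 2.0), ratios=(0.5, 1.0, 2.0)):
    """Return anchors of shape (H_c, W_c, A, 4) as (cx, cy, w, h) in pixels."""
    grid_h, grid_w = grid_size
    shapes = []
    for s in scales:
        for r in ratios:  # r = width / height; area stays (cell_size * s)^2
            shapes.append((cell_size * s * np.sqrt(r), cell_size * s / np.sqrt(r)))
    shapes = np.array(shapes)  # (A, 2)

    # Center coordinates of every cell.
    cy, cx = np.meshgrid(
        (np.arange(grid_h) + 0.5) * cell_size,
        (np.arange(grid_w) + 0.5) * cell_size,
        indexing="ij",
    )
    anchors = np.zeros((grid_h, grid_w, len(shapes), 4))
    anchors[..., 0] = cx[..., None]
    anchors[..., 1] = cy[..., None]
    anchors[..., 2] = shapes[:, 0]
    anchors[..., 3] = shapes[:, 1]
    return anchors


anchors = make_anchors(grid_size=(2, 3), cell_size=16)
print(anchors.shape)  # (2, 3, 6, 4)
```

With anchors, the network only has to predict small corrections to a suitably shaped box instead of regressing every shape from scratch.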

# Single class detection example
***
<center><img src="img/slides/ssd.png"/ style="width:800px;"></center>


# How to train?
***
The coverage map is just a simple classification task:
* Which cell contains an object?

Bounding box regression is a little tricky:
* Based on the position of the containing cell in the image
    * Return bounding box coordinates relative to the cell position
* What about labels for empty cells?
    * Solution: Train only on cells for which we have objects
        * Mask the regression loss!
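
A minimal NumPy sketch of both ideas, assuming square cells and boxes given as `(x_min, y_min, x_max, y_max)` in pixels. The names `encode_box` and `masked_box_loss` are hypothetical, and the challenge code may encode boxes and normalize the loss differently.

```python
import numpy as np


def encode_box(box, cell_row, cell_col, cell_size):
    """Express a pixel box relative to the top-left corner of its cell."""
    x0, y0, x1, y1 = box
    ox, oy = cell_col * cell_size, cell_row * cell_size
    rel = np.array([x0 - ox, y0 - oy, x1 - ox, y1 - oy], dtype=np.float32)
    return rel / cell_size  # in units of one cell


def masked_box_loss(pred_boxes, true_boxes, coverage):
    """Mean squared error over cells that contain an object.

    pred_boxes, true_boxes: (H_c, W_c, 4); coverage: (H_c, W_c), 1 = object.
    """
    mask = coverage[..., None].astype(np.float32)
    sq_err = (pred_boxes - true_boxes) ** 2 * mask  # empty cells contribute 0
    num_values = 4.0 * max(float(coverage.sum()), 1.0)  # avoid division by zero
    return float(sq_err.sum() / num_values)


coverage = np.array([[1, 0], [0, 0]])
pred = np.ones((2, 2, 4), dtype=np.float32)
true = np.zeros((2, 2, 4), dtype=np.float32)
print(masked_box_loss(pred, true, coverage))  # 1.0
```

Because of the mask, whatever the network predicts for empty cells has no effect on the regression loss.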

Our challenge will take you through this!

# How to measure
***
Evaluation is done in steps:
1. Find all predicted bounding boxes that have a high overlap with a true box
    * Overlap is measured relatively, as the area of intersection divided by the area of union (intersection-over-union, IoU)
    * Usually, we consider a box a match for $IoU > 0.5$
2. Each ground-truth bounding box that was covered by a prediction is a *true positive*
3. Unmatched ground-truth boxes are then *false negatives*
4. All predicted boxes that were not matched to a true box are *false positives*
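
The steps above can be sketched as follows. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and the greedy one-to-one matching in `match_boxes` (a hypothetical helper) is one common choice, under which a duplicate detection of an already matched object counts as a false positive.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(preds, truths, thresh=0.5):
    """Greedily match predictions to ground truth; return (TP, FP, FN)."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for p in preds:
        best_j, best_iou = None, thresh
        for j, t in enumerate(truths):
            if j in matched:
                continue
            overlap = iou(p, t)
            if overlap > best_iou:  # strict '>' as in IoU > 0.5
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    return tp, len(preds) - tp, len(truths) - tp


preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
truths = [(1, 1, 10, 10), (100, 100, 110, 110)]
print(match_boxes(preds, truths))  # (1, 1, 1)
```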

With these counts we can compute:
* The ratio of predictions that were correct, aka *precision*:
    * $precision = TP / (TP + FP)$
* The ratio of ground-truth boxes that we found, aka *recall*:
    * $recall = TP / (TP + FN)$
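
As a small worked example, the two formulas translate directly into code. The helper name `precision_recall` is made up here, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


# 8 correct detections, 2 wrong detections, 4 missed objects:
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.80, recall=0.67
```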

# Thanks for the attention & have fun in the challenge!