# **Object Detection Using CNN**

## **Learning Objectives**

By the end of this lesson you will be able to:<br>
* Describe neural style transfer
* Extract content from an image
* Extract style from an image
* Perform neural style transfer using TensorFlow and PyTorch
* Implement CNN architecture
* Use the concept of transfer learning to model complex networks
* Explain object detection with YOLO

### Neural Style Transfer:
Neural Style Transfer uses a pretrained convolutional neural network (CNN) to transfer styles from a given image to another. 

Neural style transfer refers to a family of algorithms that take two images as input, a content image and a style image. The output combines the content of one with the style of the other to generate an entirely new image.

Consider two images, A and B, and the style-transferred image C.

* A = Gothic style image (style image)
* B = Rottweiler puppy image (content image)
* C = Rottweiler puppy image with gothic style (newly generated image)

![alt text](https://miro.medium.com/max/1208/1*fxJGoLMvVTvbNf11eqK62w.png)

### Convolutional Network of Neural Style Transfer:

## Extraction of Content from an Image:
* Shape and design are the contents of an image.
* As the input passes through successive layers of the convolutional network, it is transformed into representations that increasingly capture the actual content of the image rather than its exact pixel values.
* For content representation, the feature maps of the higher layers of the network are used.
* The conv4_2 layer (the second convolution in the fourth block) of a pretrained VGG-19 network is used as the content extractor.

## Extraction of Style from an Image:
* Color and texture are the styles of an image.
* Computing correlations between different filter responses discards the global spatial arrangement of the pixels, so only color and texture (that is, the style) are extracted.
* The matrix of correlations between the feature maps is called the Gram matrix.
* For the VGG-19 network, layers conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 are used to calculate the Gram matrices.
* The style weight can vary for each layer and acts as a hyperparameter that controls how strongly each layer's style is applied.

## Model Overview of VGG-19:

![title](https://miro.medium.com/max/1204/1*9dAefEvl4Oo1QCo_oLAOSw.png)

### Content Loss:
* The content loss function makes sure that the content present in the content image (B) is captured in the generated image (C).
* It measures the squared error between the feature representations of the generated image and the content image in layer <b> l </b>.

* Let p be the original image and x be the generated image.
* P and F are their respective feature representations in layer <b> l </b>.

Thus, the content loss is defined as the squared-error loss between the two feature representations.

![title](https://miro.medium.com/max/890/1*LT9330x35To8U5yKhziGrQ.png)
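As a minimal sketch, the content loss can be computed as shown below. It assumes the formulation of Gatys et al. (half the sum of squared differences) and uses plain Python lists in place of framework tensors. The function name `content_loss` and the toy feature values are illustrative assumptions, not part of any library.

```python
def content_loss(generated_features, content_features):
    """Half the sum of squared differences between two feature representations.

    Each argument is a list of flattened feature maps (one list per filter),
    standing in for F (generated image) and P (content image) at one layer.
    """
    total = 0.0
    for f_map, p_map in zip(generated_features, content_features):
        for f, p in zip(f_map, p_map):
            total += (f - p) ** 2
    return 0.5 * total


# Toy features: 2 filters, 3 spatial positions each (hypothetical values).
P = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
F = [[1.0, 2.0, 5.0], [0.0, 0.0, 0.0]]

print(content_loss(F, P))  # loss between generated and content features
print(content_loss(P, P))  # identical features give zero loss
```

In a real implementation, F and P would be the conv4_2 activations of the generated and content images, and the loss would be minimized with respect to the pixels of the generated image.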

### Style Loss:

* It is defined as the difference between the feature-map correlations computed for the generated image and for the style image.

* If we compute a style matrix (Gram matrix) for the generated and style images, the style loss can be seen as the mean squared difference between the two style matrices.

* Let a be the original style image and x be the generated image.
* A and G are their respective style representations in layer <b> l </b>.
* w is the weight given to each layer's contribution.

The contribution of that layer to the total loss is:

![title](https://miro.medium.com/max/778/1*BJq2SJ-5bOW1TJrGUXjJrQ.png)

And the total loss is:

![title](https://miro.medium.com/max/618/1*nnyQkctkUmO_zZ1DlLDrMg.png)


### Gram Matrix:
![title](https://miro.medium.com/max/532/1*BcOk_R3Ky2ADbcBmBL5q6A.png)
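The Gram matrix and the per-layer style loss can be sketched as follows. This is an illustration under assumptions: it follows the normalization used by Gatys et al. (dividing by 4 N² M², with N filters and M positions per feature map), and it uses plain Python lists instead of tensors. The names `gram_matrix` and `layer_style_loss` are hypothetical.

```python
def gram_matrix(features):
    """Inner products between every pair of flattened feature maps.

    `features` is a list of N feature maps, each flattened to M values.
    Entry [i][j] is the correlation between filter i and filter j.
    """
    return [[sum(a * b for a, b in zip(f_i, f_j)) for f_j in features]
            for f_i in features]


def layer_style_loss(generated_features, style_features):
    """Squared Gram-matrix difference, scaled by 1 / (4 * N^2 * M^2)."""
    n = len(style_features)       # number of filters in the layer
    m = len(style_features[0])    # positions per flattened feature map
    g = gram_matrix(generated_features)
    a = gram_matrix(style_features)
    squared = sum((g[i][j] - a[i][j]) ** 2
                  for i in range(n) for j in range(n))
    return squared / (4 * n ** 2 * m ** 2)


# Toy layer with 2 filters and 2 positions (hypothetical values).
style = [[1.0, 0.0], [0.0, 2.0]]
shuffled = [[0.0, 1.0], [2.0, 0.0]]   # same responses, positions swapped
different = [[1.0, 1.0], [0.0, 2.0]]

print(gram_matrix(style))
print(layer_style_loss(shuffled, style))
print(layer_style_loss(different, style))
```

The `shuffled` features have the same Gram matrix as `style`, which illustrates why the Gram matrix captures texture while ignoring where in the image each response occurs.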

### Combined Loss Function:
* α = Content Weight
* β = Style Weight

![title](https://miro.medium.com/max/1100/1*grlT7LGNDxnyeM4dNDKP5w.png)
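The weighted sums can be sketched in a few lines. The layer losses, layer weights, and the values of α and β below are hypothetical numbers chosen only to make the arithmetic easy to follow. In practice β is usually much larger than α, and both are tuned by hand.

```python
def total_style_loss(layer_losses, layer_weights):
    """Weighted sum of the per-layer style losses."""
    return sum(w * e for w, e in zip(layer_weights, layer_losses))


def combined_loss(content, style, alpha, beta):
    """Total loss = alpha * content loss + beta * style loss."""
    return alpha * content + beta * style


# Hypothetical per-layer style losses for the five style layers of VGG-19.
layer_losses = [0.5, 0.25, 0.25, 0.5, 0.5]
layer_weights = [1.0, 0.5, 0.5, 0.25, 0.25]

style = total_style_loss(layer_losses, layer_weights)
total = combined_loss(content=2.0, style=style, alpha=1.0, beta=4.0)
print(style, total)
```

Raising β relative to α pushes the generated image toward the style image, while raising α preserves more of the content image.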

## Assisted Practice
<b>Problem Statement: </b>
Perform a neural style transfer with TensorFlow using the VGG-19 model.
<br>
<br>
<b>Objective: </b>
Use TensorFlow for neural style transfer.

## Assisted Practice


<b> Problem Statement: </b> You work in a startup that develops photo editing features for third-party apps. You are assigned the task of creating an editing feature that applies different art styles to people’s faces.

<b>Objective:</b> Use PyTorch to build a neural style transfer model.

<b>Link to Input Images:</b> https://www.dropbox.com/sh/gggwid6oc4uo9pc/AAB0Qu4_t2KdnFSCIh7u3mTEa?dl=0

## **Object Detection**

Note: Before diving into the hands-on, let's explore the pretrained CNN models that we are leveraging.


### <b> CNN Pretrained models </b>

## **Success and History**

### **Few Popular CNNs**

![Pop CNN](https://drive.google.com/uc?id=1diezztFyAx-o28kiVQhb1xf3Kunvznyq)

### **Few CNN Architectures**


![Architecture](https://drive.google.com/uc?id=1dLRWFA7k6BP4iQV8D6w8PVHObQeV-vUa)

## **AlexNet**

### **AlexNet Architecture**

![CNN](https://drive.google.com/uc?id=1zk-dm9V1VVbrSmXulKtOlR9P8-oVxwk7)

### **AlexNet Architecture: Layer 1**


![Layer1](https://drive.google.com/uc?id=1YFIxNzULxqoTgr1Aw0DoVGFWibbaK7UE)

![neuron](https://drive.google.com/uc?id=13Sv8mxJ33yoE4P5E3UT_rv_lH0UQiG2V)

## **VGGNet**

* Seeks to investigate the effect of depth in large scale image recognition
* Fixes other parameters of architecture and steadily increases depth
* First place in localization (25.3% error) and second place in classification (7.3% error) in ILSVRC 2014, using an ensemble of 7 networks


### **Fixed Configuration**

* Convolutional Layers: 8 to 16
* Fully Connected Layers: 3
* Stride: 1
* ReLU: Follows all hidden layers
* Max-Pooling: 2x2 window
* LRN: Not used, since it gave no perceived improvement in performance
* Padding: Chosen such that spatial resolution is preserved
* Convolutional filters: Start at 64 and double after each max-pooling layer until reaching 512
* Filter sizes: 3x3, 1x1


#### **Architecture**

![12](https://drive.google.com/uc?id=111e_zR4ttJmsT3zxxiDBnbzVIy5vmFLl)

### **3x3 Filters**

0 -1 0<br>
-1 4 -1<br>
0 -1 0


* Enables deeper architectures
* Minimum size required for learning concepts of horizontal, vertical, and blob
* Fewer parameters for the same receptive field:
  * Receptive field of a stack of three 3x3 filters = receptive field of one 7x7 filter
  * With C input and output channels:
    * One layer of 3x3 filters: params = 3^2 C^2 = 9 C^2; grows as O(C^2)
    * Three layers of 3x3 filters: params = 3 * (3^2 C^2) = 27 C^2
    * One layer of 7x7 filters: params = 7^2 C^2 = 49 C^2
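The arithmetic above can be checked with a short sketch. The helper names and the choice of C = 256 are illustrative assumptions. Biases are ignored, and all convolutions are assumed to use stride 1.

```python
def conv_params(kernel_size, channels, layers=1):
    """Weights (ignoring biases) in `layers` stacked conv layers,
    each with `channels` input and output channels."""
    return layers * kernel_size ** 2 * channels ** 2


def stacked_receptive_field(kernel_size, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return layers * (kernel_size - 1) + 1


C = 256  # hypothetical channel count

rf_three_3x3 = stacked_receptive_field(3, layers=3)
params_three_3x3 = conv_params(3, C, layers=3)
params_one_7x7 = conv_params(7, C)

print(rf_three_3x3)                        # same field as a single 7x7 filter
print(params_three_3x3, params_one_7x7)    # 27 C^2 versus 49 C^2
```

The stack of three 3x3 layers covers the same 7x7 region with roughly 45% fewer weights, and it also applies three non-linearities instead of one.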
    

### **VGGNet: The Core Idea**

The VGG network is built on the idea of deeper networks with smaller filters. The VGGNet architectures range from 11 to 19 weight layers, and all of them use 3x3 filters for the convolution operations, with periodic pooling throughout the network.

[Reference Paper](https://arxiv.org/pdf/1409.1556.pdf)

## **ResNet**

### **Need for a Deeper Network**

* ResNet is an extremely deep network, with up to 152 layers
* Deeper neural networks are more difficult to train; the residual learning framework eases the training of such networks


### **Stacking Deeper Layers**

![image13](https://drive.google.com/uc?id=10GrTuunx5Gn_XLLRWiEDpzF4j60V3YzQ)

The deeper model (the 56-layer model) has higher error on both the training set and the test set.

### **Using Skip Connections**

* Take the activation from one layer and feed it into another layer, much deeper into the network
* Use layers to fit residual F(x) = H(x) – x, instead of H(x) directly


### **Residual Block**

![image 14](https://drive.google.com/uc?id=13tMFlF_D_Wm-SRp1NC1AU4KEijfEBWYO)
<br><br>
* Input x goes through a conv-relu-conv series, which gives F(x)
* The result is then added to the original input x, such that H(x) = F(x) + x
* In traditional CNNs, H(x) would just be equal to F(x)
* So, instead of learning the full transformation from x to H(x) directly, the block learns only the residual F(x) that must be added to the input x
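The skip connection can be illustrated with a small sketch. Here `residual_fn` is a hypothetical stand-in for the conv-relu-conv series, and inputs are plain lists rather than feature maps, so this shows only the addition H(x) = F(x) + x, not a real convolutional block.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn computes F(x)."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """F(x) = 0: the block reduces to the identity mapping."""
    return [0.0 for _ in x]


def small_residual(x):
    """A toy F(x) that makes a small correction to the input."""
    return [0.1 * xi for xi in x]


x = [1.0, 2.0]
identity_out = residual_block(x, zero_residual)
adjusted_out = residual_block(x, small_residual)
print(identity_out, adjusted_out)
```

When the layers output zero, the block passes its input through unchanged. This is one intuition for why adding residual blocks should not make a network worse than its shallower counterpart.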


### **ResNet Features**

* Reduces resolution in the first layer, thereby reducing memory consumption
* Maintains resolution and channels in each stage
* Averages the values of each channel (global average pooling) before feeding them to the linear classifier
* Divides each stage into residual blocks

### **ResNet Architecture**

* Has an additional convolutional layer at the beginning
* Stacks residual blocks, each of which has two 3x3 convolutional layers
* Has no hidden FC layers at the end (only the FC 1000 layer that outputs the classes)

### **ResNet: Experimental Results**
Despite being much deeper than VGG-16 and VGG-19, the ResNet model is much smaller in size because it uses global average pooling rather than large fully connected layers.
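A rough parameter count shows where the size difference comes from. The sketch assumes the standard 224x224 ImageNet configurations: VGG-16 flattens a 7x7x512 feature map into its first 4096-unit fully connected layer, while ResNet-50 applies global average pooling to obtain 2048 values before its single 1000-way layer. The helper `dense_params` is an illustrative name.

```python
def dense_params(inputs, outputs):
    """Weights plus biases of one fully connected layer."""
    return inputs * outputs + outputs


# VGG-16: flattened 7x7x512 feature map into the first FC layer (4096 units).
vgg_first_fc = dense_params(7 * 7 * 512, 4096)

# ResNet-50: global average pooling leaves 2048 values for the FC 1000 layer.
resnet_fc = dense_params(2048, 1000)

print(vgg_first_fc)
print(resnet_fc)
print(vgg_first_fc // resnet_fc)  # roughly how many times larger
```

The first fully connected layer of VGG-16 alone holds about 100 million parameters, around 50 times more than the entire classifier of ResNet-50.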

[Reference Paper](https://arxiv.org/abs/1512.03385)

## **Transfer Learning (using the pretrained models)**

The core idea of transfer learning is to use a pretrained model to solve the current classification problem.


![image 17](https://drive.google.com/uc?id=1g6e-xj8kcicnkGLIxBz1sOGh98M_U_gj)

#### **Feature Extractor**


* Remove the classification layer in the network
* Pass the input data through a pretrained network
* Train a traditional ML algorithm on the features produced by the network



#### **Fine Tuning**

* Initialize a pretrained network
* Replace the 1000-neuron classification layer with one sized to the number of classes in the input data
* Train the network

### **Transfer Learning: Considerations**


![image 18](https://drive.google.com/uc?id=1j-sC4CbxpBdVV6mJauBjSOmP3hmDJHWt)

<br>
Note: If the input data has no similarity with the ImageNet dataset, you will end up training the entire network.

### **Transfer Learning Outcomes**

The table below illustrates the improvements obtained with transfer learning in convolutional networks:
<br>
![19](https://drive.google.com/uc?id=1FHbEJ_0vL2wUte2S5iWRUuX1amnfKXcz)

The table below illustrates the improvements obtained by fine-tuning convolutional networks:
<br>
![20](https://drive.google.com/uc?id=1jcapMfBUvS0K5vTo7YiDgzaBqAsgYeFr)

## **Assisted Practice**

**Problem Statement:** Classify the images of cats and dogs in a dataset using ResNet50.<br>
[Dependencies](https://www.dropbox.com/sh/xmgbd6527nj7kcm/AAA6-G_L0tNKaV6psRiO4tI0a?dl=0)

## **Object Detection**

### **Detection vs. Classification**

![21](https://drive.google.com/uc?id=17KB_uF2EBwfzg2CO6tGDTbc6OjXE4OEC)

### **Real-Time Detection**

![22](https://drive.google.com/uc?id=1J-i5-EMNRoYz7FS4xq8iqBaHH-6Cpn9y)

### **Evaluating a Detector**

#### **Test Image**

![23](https://drive.google.com/uc?id=1RMwwvO0RFkyPXUJqcm5LRveGUtREnxdd)


#### **Ideal Detections after 3 Predictions**

![24](https://drive.google.com/uc?id=1biOKWUdSdrYz8dWafvIKc51bHI50w2Rk)

### **Sort by Confidence**

![25](https://drive.google.com/uc?id=1ura-1pqledPDOybRdE8YFI1FjT9P_b9s)

### **Evaluation Metric**

![26](https://drive.google.com/uc?id=12NIfwtgf3Q569qCm50SJvDO_lEPFkX5o)

![27](https://drive.google.com/uc?id=1AjD78EQmYEUOSW6WAnGzNkgiNt4kkW7f)

## **IOU (Intersection Over Union)**

* An advanced metric for object detection
* Measures the overlap between two areas (in computer vision, typically between two bounding boxes)
* Indicates how closely two areas match in size and location
* If two areas are exactly equal, IOU will be 1
<br>
<br>

![IOU](https://drive.google.com/uc?id=1nFFHg88D7osq-QZ0HG1z3PtOmLg8CyA4)
<br>
<br>

## **Implementing IOU**

The bounding box coordinates are in the form (x, y, width, height). First, calculate the width and height of the intersection box; the intersection size is the area of that box. The union size is the total area of the two boxes minus the intersection size.


### Function For IOU:

```python
def IOU(box1, box2):
    x1, y1, w1, h1 = box1
    x2, y2, w2, h2 = box2
    w_intersection = min(x1 + w1, x2 + w2) - max(x1, x2)
    h_intersection = min(y1 + h1, y2 + h2) - max(y1, y2)
    if w_intersection <= 0 or h_intersection <= 0: # No overlap
        return 0
    I = w_intersection * h_intersection
    U = w1 * h1 + w2 * h2 - I # Union = Total Area - I
    return I / U
```
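
As a quick check of the function above, consider two hypothetical boxes in (x, y, width, height) form, box1 = (0, 0, 2, 2) and box2 = (1, 1, 2, 2). Working through the same steps by hand:

$$
w_{\cap} = \min(0 + 2,\ 1 + 2) - \max(0,\ 1) = 1, \qquad
h_{\cap} = \min(0 + 2,\ 1 + 2) - \max(0,\ 1) = 1
$$

$$
I = w_{\cap} \cdot h_{\cap} = 1, \qquad
U = 2 \cdot 2 + 2 \cdot 2 - 1 = 7, \qquad
\text{IOU} = \frac{I}{U} = \frac{1}{7} \approx 0.143
$$

The two boxes share only one quarter of each box's area, so the IOU is low even though they visibly overlap.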

## Assisted Practice
Problem Statement: Generate a random rectangle, detect the object using Keras, and evaluate IOU for the prediction.

## **YOLO (You Only Look Once)**

### **YOLO Features**

* Generalized learning
* Exceptionally fast
* Processes the entire image simultaneously


### **YOLO Detection**

### **Steps Performed in YOLO Detection**

* Consider the image below of a dog and a bicycle, which will be fed to the YOLO algorithm.

![28](https://drive.google.com/uc?id=103cEhMpKrFuMxJvhoZ233RBnhwHc12Mq)

* The image gets split into an S*S grid.


![29](https://drive.google.com/uc?id=1a4zG9jZfyLbGUD4DzaltcznUMPICQlbP)

* Each cell predicts B bounding boxes (x, y, w, and h).

![30](https://drive.google.com/uc?id=1uev21zfH_rs7qOIL7oKen-rwXgl7QbLx)

* A confidence is assigned to each of the bounding boxes.

![31](https://drive.google.com/uc?id=1NBDSA1PkfPqksZeb2yRt8c565xz8vDDB)

* The above steps continue simultaneously for every cell.

![31](https://drive.google.com/uc?id=1xIkOfZ-ituKySu4dP9_S1_sXWcn5QM14)

* Apart from bounding boxes, each cell also predicts class probabilities for its bounding boxes.

![33](https://drive.google.com/uc?id=1himJKpWL-bQmDJDt6my2effL80-yK2NV)

* Conditional probability is evaluated based on the corresponding object.

![32](https://drive.google.com/uc?id=1rTUrBm_2FlDYMiXRXb28DSia100673xr)

* The box and class predictions are then combined.

![35](https://drive.google.com/uc?id=100MIN2m1tnPr4H6eUun-P4DwsJ2eLoZl)

* Finally, you proceed with threshold detections.

![36](https://drive.google.com/uc?id=1abgo0VB6IPKgI_wQn1iyr8ltKjgxEfYn)
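The last steps (combining box and class predictions, then thresholding) can be sketched as follows. The sketch assumes the original YOLO layout, in which each cell predicts B boxes with five values each (x, y, w, h, confidence) plus one shared set of conditional class probabilities. The function names, class names, and numbers are illustrative assumptions.

```python
def output_tensor_shape(s, b, num_classes):
    """S x S grid; each cell predicts B boxes (x, y, w, h, confidence)
    plus one set of class probabilities."""
    return (s, s, b * 5 + num_classes)


def class_specific_scores(box_confidence, class_probs):
    """Multiply a box confidence by the cell's conditional class
    probabilities P(class | object)."""
    return {name: box_confidence * p for name, p in class_probs.items()}


def threshold_detections(scores, threshold):
    """Keep only the class scores at or above the threshold."""
    return {name: s for name, s in scores.items() if s >= threshold}


shape = output_tensor_shape(s=7, b=2, num_classes=20)

# One box from one cell (hypothetical values).
cell_class_probs = {"dog": 0.75, "bicycle": 0.25}
scores = class_specific_scores(0.5, cell_class_probs)
kept = threshold_detections(scores, threshold=0.25)

print(shape)
print(scores)
print(kept)
```

A full detector would usually also apply non-maximum suppression after thresholding to remove duplicate boxes for the same object.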

## **Implementing YOLO**

### **Training Phase Steps**

* While training your YOLO model, you must match objects to the right cells.

![37](https://drive.google.com/uc?id=1pwA2u0WykOrZx3afhy4mKMiJ7zcX3Wdc)

* You should alter the class prediction for the cell.

![37](https://drive.google.com/uc?id=10yCMhbj9A1qzMsv8m8cGN1sZKf2lHZue)

* Also, examine the boxes predicted by the cell.

![39](https://drive.google.com/uc?id=1Su3Pk_phnb7n6ADod5pKuL0hwilxy84x)

* Adjust the box that best matches the object and increase its confidence.

![40](https://drive.google.com/uc?id=1dZXhPvDJMENtoBeWtIFBkiE1ufvRUI46)

* Also, decrease the confidence for other boxes.

![41](https://drive.google.com/uc?id=1PMSVtwNDUQSMKBYDWO2TZFpocS-GtBBO)

* Some cells will make predictions even though they contain no true detections.

![42](https://drive.google.com/uc?id=1q0xNPUOWSGtRyrbyyF4y3T3Texz93FtB)

* Therefore, decrease the confidence for such cells.

![43](https://drive.google.com/uc?id=1WE8kwbT1JqLnOKX4baPLtGFoyYCZQ1zw)
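Matching an object to the right cell can be sketched as below. The sketch assumes the convention of the original YOLO, where the cell that contains the center of a ground-truth box is responsible for detecting it. Boxes use the (x, y, width, height) pixel format from the IOU section, and `responsible_cell` is a hypothetical helper name.

```python
def responsible_cell(box, image_width, image_height, s):
    """Return (row, col) of the grid cell that contains the box center."""
    x, y, w, h = box
    center_x = x + w / 2
    center_y = y + h / 2
    # Scale the center to grid units, clamping to the last cell on the edge.
    col = min(int(center_x / image_width * s), s - 1)
    row = min(int(center_y / image_height * s), s - 1)
    return row, col


# Hypothetical ground-truth box in a 448x448 image with a 7x7 grid.
dog_box = (100, 200, 50, 100)
cell = responsible_cell(dog_box, image_width=448, image_height=448, s=7)
print(cell)
```

During training, only the responsible cell has its class prediction and its best-matching box pushed toward the ground truth. The confidences of the remaining boxes are pushed down.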

### Advancements in YOLO:
* <b> YOLOv2 </b> uses a few tricks to improve training and increase performance. Like OverFeat and SSD, it uses a fully convolutional model, but it still trains on whole images rather than hard negatives. Like Faster R-CNN, it adjusts priors on bounding boxes instead of predicting the width and height outright, but it still predicts the x and y coordinates directly.
* <b> YOLOv3 </b> adds further improvements, including multi-scale predictions and a better backbone classifier.
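The idea of adjusting priors in YOLOv2 can be sketched with the box parameterization from the YOLOv2 paper: the width and height scale a prior (anchor) box, while the center is predicted directly as an offset inside the cell. The function and argument names are illustrative assumptions, and all values are in grid-cell units.

```python
import math


def sigmoid(t):
    """Squash a raw prediction into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, prior_w, prior_h):
    """Turn raw network outputs into a box (center x, center y, w, h)."""
    b_x = cell_x + sigmoid(t_x)      # center stays inside the cell
    b_y = cell_y + sigmoid(t_y)
    b_w = prior_w * math.exp(t_w)    # prior width scaled by the prediction
    b_h = prior_h * math.exp(t_h)    # prior height scaled by the prediction
    return b_x, b_y, b_w, b_h


# All-zero outputs give the cell center and the unmodified prior box.
box = decode_box(0.0, 0.0, 0.0, 0.0,
                 cell_x=3, cell_y=2, prior_w=1.5, prior_h=2.5)
print(box)
```

Because the network only has to learn small corrections to a prior, the box shapes are easier to learn than raw widths and heights.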

### **Assisted Practice**

**Problem Statement:** Detect objects in an image using a YOLO model pretrained on the COCO dataset, which has 80 labeled object classes.
<br>
[Dependencies](https://www.dropbox.com/s/3sc7bkculbkm8n2/Dependencies-20200418T111606Z-001.zip?dl=0)


### **Knowledge Check**


Click [here](https://drive.google.com/open?id=1ABWWW-AGhvbPq9MJ5Z17rAe2uLWjqNvX) for knowledge checks 

## **Key Takeaways**

* Neural style transfer takes two images as input.
* Color, texture, etc. are style properties extracted from the input called the style image.
* Shape, design, etc. are content properties extracted from the input called the content image.
* VGGNet investigates the effect of depth in large-scale image recognition by fixing the other parameters of the architecture and steadily increasing the depth.
* ResNet reduces the resolution in the first layer, thereby reducing memory consumption. It also maintains the resolution and channels in each stage, and the average value of each channel is fed to the linear classifier.
* YOLO is exceptionally fast because it processes the entire image simultaneously, and it also shows generalized learning.

### Problem Statement:
You are provided with a YOLO v3 model trained on the MS COCO dataset. Using this model, create an object detection program for the different object classes in the dataset.
### Objective:
Use YOLO v3 pretrained model for object detection.

Link to Dependencies: https://www.dropbox.com/sh/b8zuyzbiey8kdso/AAAVKIorrhF-khmfNAFdSbhba?dl=0