# Introduction to Computer Vision

## Tasks Within Computer Vision
* Image Classification
* Object Detection
* Image Segmentation
* Visual Relationship Detection
* Image Captioning
* Image Reconstruction or Image Inpainting

### Convolutional Neural Networks
<img src="https://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/2016-02-15T13-14-33-533Z--1280x720.focal-760x428.jpg">




<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/cnn.jpeg?raw=1">

#### Convolutional Layer
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/conv.gif?raw=1">
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/kernelmove.png?raw=1" width="300" height="150">
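
The sliding-kernel operation pictured above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming stride 1 and no padding; the function name `conv2d_valid` and the example kernel are made up for this sketch. Like most deep learning libraries, it computes a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the window it covers.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x5 image whose values increase by 1 from left to right within each row.
image = np.arange(25, dtype=float).reshape(5, 5)
# Illustrative 3x3 filter that responds to horizontal intensity changes.
kernel = np.array([[1.0, 0.0, -1.0]] * 3)

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3)
```

A 3x3 kernel on a 5x5 input yields a 3x3 feature map, which is why real networks often pad the input when they want to preserve spatial size.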

#### Pooling Layer
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/pool.png?raw=1">
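
As a rough sketch of what a pooling layer computes, the snippet below applies non-overlapping 2x2 max pooling to a small array. The helper name `max_pool2d` and the input values are assumptions chosen for illustration.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over size x size windows of a 2D array."""
    h2, w2 = x.shape[0] // size, x.shape[1] // size
    # Crop to a multiple of `size`, then split each axis into (blocks, within-block).
    blocks = x[:h2 * size, :w2 * size].reshape(h2, size, w2, size)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 7, 2],
              [3, 2, 1, 6]])

pooled = max_pool2d(x)
print(pooled)  # [[4 5]
               #  [3 7]]
```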

### Image Classification
Image classification involves assigning a label to an entire image or photograph.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/mnist.png?raw=1">
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/cifar.png?raw=1" width="600" height="350">

An interactive visualization of a CNN classifying handwritten digits: https://www.cs.ryerson.ca/~aharley/vis/conv/flat.html

```python
# Mount Google Drive so the notebook can read and write files (Google Colab only).
from google.colab import drive
drive.mount('/content/drive')
```

### Object Detection
Object detection combines image classification with localization: for every object in an image, which may contain several, the model predicts both a class label and a bounding box.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/obj.png?raw=1">

### Image Segmentation
Image segmentation is the task of pixel-wise classification of objects in an image.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/segmentation.png?raw=1">
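
To make "pixel-wise classification" concrete, here is a minimal sketch that turns per-pixel class scores into a segmentation mask. The score values and the `(num_classes, height, width)` layout are assumptions for illustration; a real model would produce the scores.

```python
import numpy as np

# Hypothetical scores for 3 classes on a 2x2 image: shape (num_classes, H, W).
scores = np.array([
    [[0.90, 0.20], [0.10, 0.30]],  # class 0
    [[0.05, 0.70], [0.20, 0.30]],  # class 1
    [[0.05, 0.10], [0.70, 0.40]],  # class 2
])

# Each pixel gets the label of its highest-scoring class.
mask = scores.argmax(axis=0)
print(mask)  # [[0 1]
             #  [2 2]]
```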

#### General Structure

1. **Encoder**: Extracts features through a sequence of layers whose feature maps become progressively smaller in spatial size and deeper in channels (think: a pre-trained classification network such as VGG or ResNet).
2. **Decoder**: Progressively upsamples the encoder output into a segmentation mask at the pixel resolution of the input image (this is where most customizations happen).
3. **Skip connections**: Long-range connections that draw on features at varying spatial scales to improve model accuracy.
<img src="https://missinglink.ai/wp-content/uploads/2019/03/SegNet-neural-network_2x.png">

#### Fully Convolutional Network (FCN) [\[paper\]](https://arxiv.org/abs/1411.4038)
* Outputs from shallower layers carry more location information.
* Deeper layers produce more abstract, semantic features, but at a coarser spatial resolution; FCN combines the two through skip connections.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/fcn.png?raw=1">

#### SegNet [\[paper\]](https://arxiv.org/abs/1511.00561)
* Uses unpooling to upsample feature maps in the decoder: the max-pooling indices stored by the encoder are reused, which keeps high-frequency details intact in the segmentation.
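
The unpooling idea can be sketched as follows: remember where each maximum came from during pooling, then write the pooled values back to those positions when upsampling. The function names and input are hypothetical, and this is a single-channel illustration, not SegNet's actual implementation.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Max pooling that also returns the flat index of each maximum in `x`."""
    h2, w2 = x.shape[0] // size, x.shape[1] // size
    pooled = np.zeros((h2, w2), dtype=x.dtype)
    indices = np.zeros((h2, w2), dtype=int)
    for i in range(h2):
        for j in range(w2):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            r, c = np.unravel_index(window.argmax(), window.shape)
            pooled[i, j] = window[r, c]
            # Convert the position inside the window to a flat index into x.
            indices[i, j] = (i * size + r) * x.shape[1] + (j * size + c)
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """Place each pooled value back at its remembered position; zeros elsewhere."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 7, 2],
              [3, 2, 1, 6]])

pooled, indices = max_pool_with_indices(x)
restored = max_unpool(pooled, indices, x.shape)
print(restored)
```

The restored map is sparse: only the positions of the original maxima are filled, and the decoder's convolutions are left to densify it.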


#### U-Net [\[paper\]](https://arxiv.org/abs/1505.04597)
* Concatenates the encoder feature maps with the upsampled feature maps from the decoder at every stage, forming a ladder-like structure.
* These concatenating skip connections let the decoder at each stage recover relevant features that were lost when pooling in the encoder.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/unet.png?raw=1">
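
A shape-level sketch of one U-Net skip connection is shown below. The channel counts and spatial sizes are made up, and nearest-neighbour upsampling stands in for the learned up-convolution used in the paper, so treat it only as an illustration of the concatenation step.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a (channels, H, W) array."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

# Hypothetical feature maps in (channels, H, W) layout.
encoder_features = np.random.rand(64, 32, 32)   # saved before pooling in the encoder
decoder_features = np.random.rand(128, 16, 16)  # coarser map coming up the decoder

upsampled = upsample_nearest(decoder_features)  # now (128, 32, 32)
# The skip connection: stack encoder and decoder features along the channel axis.
merged = np.concatenate([encoder_features, upsampled], axis=0)
print(merged.shape)  # (192, 32, 32)
```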

#### DeepLabv3+ [\[paper\]](https://arxiv.org/abs/1802.02611)
* Atrous Convolution + Atrous Spatial Pyramid Pooling (ASPP): exploits multi-scale features by applying multiple parallel atrous (dilated) filters with different dilation rates.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/atrous.png?raw=1">
* Improved Decoder: Encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features from the network backbone that have the same spatial resolution.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/deeplab.png?raw=1">
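
To illustrate how the dilation rate widens the receptive field without adding weights, here is a 1D sketch of atrous convolution, applied at several rates in parallel in the spirit of ASPP. The function name, signal, and rates are assumptions for this example; DeepLab itself works on 2D feature maps with learned kernels.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """1D atrous convolution: kernel taps are spaced `rate` samples apart."""
    k = len(kernel)
    span = (k - 1) * rate + 1           # receptive field of one output value
    out_len = len(signal) - span + 1    # no padding
    out = np.zeros(out_len)
    for i in range(out_len):
        taps = signal[i:i + span:rate]  # every `rate`-th sample in the window
        out[i] = np.dot(taps, kernel)
    return out

signal = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])

# Same 3-tap kernel at three rates: receptive fields of 3, 5, and 7 samples.
outputs = {rate: atrous_conv1d(signal, kernel, rate) for rate in (1, 2, 3)}
for rate, out in outputs.items():
    print(rate, out)
```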

#### PSPNet [\[paper\]](https://arxiv.org/abs/1612.01105)
* Pyramid Pooling Module
    1. A CNN extracts the feature map of its last convolutional layer.
    2. The pyramid pooling module pools this feature map over sub-regions of different sizes, then upsamples and concatenates the results to form the final feature representation, which carries both local and global context information.
    <img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/pspnet.png?raw=1">
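
The two steps above can be sketched roughly as follows. This simplified version assumes the feature map size is divisible by every bin size, uses nearest-neighbour instead of bilinear upsampling, and omits the 1x1 convolutions that PSPNet uses to reduce channels; the function name and sizes are hypothetical.

```python
import numpy as np

def pyramid_pooling(features, bin_sizes=(1, 2, 3, 6)):
    """Pool a (C, H, W) map at several grid sizes, upsample, and concatenate."""
    c, h, w = features.shape
    branches = [features]  # keep the original map for local detail
    for bins in bin_sizes:
        # Average over a bins x bins grid of equally sized sub-regions.
        pooled = features.reshape(c, bins, h // bins, bins, w // bins).mean(axis=(2, 4))
        # Upsample each pooled grid back to H x W (nearest neighbour).
        upsampled = pooled.repeat(h // bins, axis=1).repeat(w // bins, axis=2)
        branches.append(upsampled)
    return np.concatenate(branches, axis=0)

features = np.random.rand(8, 6, 6)  # stand-in for the CNN's last feature map
context = pyramid_pooling(features)
print(context.shape)  # (40, 6, 6)
```

The 1x1 bin is a global average, so every pixel in that branch sees context from the whole image, while the finer bins keep progressively more local information.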

#### Mask R-CNN [\[paper\]](https://arxiv.org/abs/1703.06870)
* Performs instance segmentation: each object instance gets its own mask.
* An FCN is added on top of the CNN features of Faster R-CNN to generate a binary mask (a matrix with 1s at all locations where the pixel belongs to the object and 0s elsewhere).
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/maskrcnn.jpg?raw=1">
* R-CNN: object detection model
    1. Generate a set of proposals for bounding boxes
    2. Run the image region inside each bounding box through a pre-trained AlexNet, then through an SVM, to classify the object in the box.
    3. Once the object has been classified, run the box through a bounding-box regression model to output tighter coordinates.
    <img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/rcnn.png?raw=1">
* Faster R-CNN: uses a Region Proposal Network (RPN) for faster candidate bounding box generation.
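
The bounding-box regression step of R-CNN can be illustrated with the offset parameterization commonly used in the R-CNN family: shifts are scaled by the box size and size changes are predicted in log space. The text above does not spell out the parameterization, so this is an assumption; the function name and numbers are hypothetical.

```python
import math

def apply_box_deltas(box, deltas):
    """Refine a box given as (center_x, center_y, width, height).

    `deltas` is (dx, dy, dw, dh): center shifts relative to the box size
    and log-scale changes of width and height.
    """
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))

proposal = (50.0, 40.0, 20.0, 10.0)
# Hypothetical regressor output: shift right and up, keep the width, double the height.
deltas = (0.1, -0.2, 0.0, math.log(2.0))

refined = apply_box_deltas(proposal, deltas)
print(refined)
```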

#### Cascade Mask R-CNN [\[paper\]](https://arxiv.org/abs/1906.09756)
* As in Mask R-CNN, a segmentation branch is added in parallel to the detection branch of Cascade R-CNN.
* Cascade R-CNN: object detection model. It trains a sequence of detection heads with increasing IoU thresholds; the output of each detector is fed to the next, which acts as a resampling mechanism.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/cascade.png?raw=1">
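
Since the cascade is defined in terms of IoU thresholds, a small sketch of intersection over union may help. The boxes and the threshold values below are illustrative assumptions, with boxes given as `(x1, y1, x2, y2)`.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0.0, 0.0, 10.0, 10.0)
proposal = (0.0, 0.0, 10.0, 6.5)

overlap = iou(ground_truth, proposal)
# Illustrative per-stage thresholds: later stages demand a tighter match.
thresholds = (0.5, 0.6, 0.7)
is_positive = [overlap >= t for t in thresholds]
print(overlap, is_positive)  # 0.65 [True, True, False]
```

The same proposal counts as a positive example for the first two stages but not the third, which is why each stage needs better-localized inputs from the one before it.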

#### HRNet [\[paper\]](https://arxiv.org/abs/1908.07919)
* Existing frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), then recover the high-resolution representation from the encoded low-resolution representation.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/hrnet1.png?raw=1">
* HRNet maintains high-resolution representations through parallel multi-resolution convolutions.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/hrnet2.png?raw=1">

#### Applications
* Self-Driving Vehicles
* Face Detection
* Medical Imaging
* Video Surveillance

## Datasets
There are many vision datasets out there, but since our focus will be semantic segmentation, we'll introduce the following datasets.

### Cityscapes
30 classes in urban street scenes.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/cityscape.png?raw=1">

### PASCAL
20 object classes. The train/val data has 11,530 images containing 27,450 ROI-annotated objects and 6,929 segmentations.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/pascal.png?raw=1">

### COCO
91 object types. More than 300k photos and 40 scene categories.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/cocodata.png?raw=1">

### ADE20K
3,169 object classes across 1,072 complex everyday scenes. 25k images with an average of 19.5 annotated instances and 10.5 annotated object classes per image.
<img src="https://github.com/mcdy143/tmls_computer_vision/blob/master/tutorial/Chapter-1-Introduction/images/ade20k.png?raw=1">
This is the dataset we'll be focusing our efforts on. 

[Here](https://github.com/mosdragon/kdd2020/tree/master/datasets) are the instructions for downloading the dataset.

