# Convolutional neural networks

# Week 1

## Computer vision

Computer vision is developing fast.

Some examples of computer vision: *image classification*, *object detection* (draw boxes around objects in pictures), *neural style transfer* (combining photos with paintings).

**A single one megapixel image has 3 million features.** If we use a fully connected network, like in the previous courses, learning on such data will require a great deal of computational resources - **we would need a separate parameter for each feature. The *convolution operation* helps us handle large images.**

## Edge Detection Example

As we remember, **in a facial recognition NN the early layers detect edges, later layers detect parts of objects and the last layers detect whole faces**. We will now discuss the edge detection part.

**Filter = kernel**

Here is an example of *convolving* (\*) an image with a filter:

<img src="notes_images/ede.png" width="700">

In the case below edge detection basically works by applying a filter that recognizes the difference between "bright" and "dark" pixels.

<img src="notes_images/ede2.png" width="700">

This is how the convolution operator works. Next, we will see how it can be used in NNs.
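
As a minimal NumPy sketch of the operation described above (the helper name `conv2d_valid` and the 6x6 test image are illustrative assumptions, not code from the course), we can slide a 3x3 vertical edge filter over an image whose left half is bright and right half is dark:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and without flipping the filter."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Elementwise product of the filter and the patch under it, then summed.
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# 6x6 image: bright (10) on the left half, dark (0) on the right half.
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])
# Vertical edge filter: positive left column, zero middle, negative right column.
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)

edges = conv2d_valid(image, vertical_edge)           # 4x4 output, each row is 0, 30, 30, 0
flipped = conv2d_valid(image[:, ::-1], vertical_edge)  # dark-to-light: each row is 0, -30, -30, 0
```

The non-zero middle columns mark where the bright-to-dark transition sits in the image. Mirroring the image flips the sign of the response.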

## More Edge Detection

**There is a difference in the outputs of detecting light-to-dark and dark-to-light edge transitions.** With the filter in the image above, the former produces positive values and the latter negative values (30 vs -30).

Horizontal edge detection works similarly to vertical:

<img src="notes_images/med.png" width="700">

Many types of filters have been invented, for example the Sobel filter puts more weight on the middle row (or column) of the filter. However, **nowadays filters are not really coded manually but learned by treating the filter numbers as parameters.**

<img src="notes_images/med2.png" width="700">

## Padding

There are two downsides to applying the convolution as we did above:  
1) The image shrinks with every operation (6x6 -> 4x4).  
2) The pixels in the corners are only used in a single operation, whereas the pixels in the middle are used multiple times. This basically discards information near the edges.

***Padding* solves both of these problems. Basically we *pad* an image by adding extra pixels (zeros) around it.** We can pad with one or more pixels.

<img src="notes_images/pd.png" width="700">

There are **two common forms of padding**:
* *Valid convolution*, which shrinks the image (no padding)  
* *Same convolution*, where the output size is the same as input size (as in the image above)

<img src="notes_images/pd2.png" width="700">

The filter size is typically odd (3x3, 5x5, 7x7 etc.) for two reasons: 1) it allows for a symmetric padding (see equation) and 2) an odd filter has a central pixel/position, which can be useful for discussing the position of a filter.

## Strided Convolutions

**In *strided convolution* we essentially "skip" some pixels - we move the filter multiple steps at a time.** For example, a *stride* of two moves the filter two steps at a time.

<img src="notes_images/sc.png" width="700">

The striding filter must reside entirely inside the image to produce an output. If it would step "out", the results are discarded. To respect this, we round down the output dimensions while using the formula to calculate them.
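
The output size rule can be written as $\lfloor (n + 2p - f)/s \rfloor + 1$. Below is a small sketch of it. The function names are made up for illustration, and `same_padding` assumes an odd filter size and stride 1:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width for input size n, filter size f, padding p, stride s."""
    # Integer division rounds down, which discards positions where the filter would step out.
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the size unchanged at stride 1 (f must be odd)."""
    return (f - 1) // 2

print(conv_output_size(6, 3))                     # valid convolution: 4
print(conv_output_size(6, 3, p=same_padding(3)))  # same convolution: 6
print(conv_output_size(7, 3, s=2))                # stride 2: 3
print(conv_output_size(6, 3, s=2))                # stride 2 with rounding down: 2
```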

In maths, the convolution operation starts with a mirroring step, where the filter is flipped horizontally and vertically. In DL literature (and these lectures) we typically skip this step while still calling the operation "convolution". However, a more mathematically accurate name for the operation would be *cross-correlation*.
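
A quick 1D check of this difference, using NumPy's `np.correlate` (no flipping) and `np.convolve` (flips the kernel). The signal and kernel values are arbitrary illustration data:

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])

# Cross-correlation: slide the kernel as-is (what DL literature calls "convolution").
cross_corr = np.correlate(signal, kernel, mode="valid")
# Mathematical convolution: the kernel is mirrored before sliding.
true_conv = np.convolve(signal, kernel, mode="valid")
# Mirroring the kernel by hand turns one operation into the other.
conv_of_flipped = np.convolve(signal, kernel[::-1], mode="valid")

print(cross_corr, true_conv, conv_of_flipped)
```

Since the filter values are learned anyway, the network can learn the mirrored filter just as easily, which is why skipping the flip does not matter in practice.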

## Convolutions Over Volume

**Convolutions over volume work very similarly to 2D convolutions - we basically slide a 3D filter over a 3D image.** As before, the numbers in the filters are typically learned as parameters.

**The number of channels (depth) in the image and the filter must match.** So both would have e.g. 3 channels. The height and the width of the filter can vary.

<img src="notes_images/cov.png" width="700">

With the filter, we can detect edges for singular channels by setting the other channels' filters to zero. Or we can detect them in all channels simultaneously.

We can convolve with multiple filters for detecting edges (e.g. vertical and horizontal). We can then stack the results into a single "volume" as channels.

<img src="notes_images/cov2.png" width="700">

**Convolutions over volume are useful**, because:
1) We can operate directly with 3D images.  
2) More importantly, we can use multiple (2 or 10 or 128 etc.) filters for detecting things and combine the outputs into a single object with multiple channels.
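
A rough NumPy sketch of convolving a volume with several filters and stacking the results as channels. The function name, the array layouts (height x width x channels for the image, filters first for the filter bank) and the random data are assumptions for illustration:

```python
import numpy as np

def conv_volume(image, filters):
    """image: (n_h, n_w, n_c); filters: (n_filters, f, f, n_c). No padding, stride 1."""
    n_h, n_w, n_c = image.shape
    n_filters, f, _, f_c = filters.shape
    assert n_c == f_c, "image and filter depth must match"
    out = np.zeros((n_h - f + 1, n_w - f + 1, n_filters))
    for k in range(n_filters):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # One number per position: sum over height, width AND channels.
                out[i, j, k] = np.sum(image[i:i + f, j:j + f, :] * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))       # e.g. a tiny RGB image
filters = rng.standard_normal((2, 3, 3, 3))  # two 3x3x3 filters
output = conv_volume(image, filters)
print(output.shape)  # (4, 4, 2): one output channel per filter
```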

## One Layer of a Convolutional Network

A single convolutional layer behaves as follows:  
1) An input image is convolved with filter(s) [think: $W^{[1]}a^{[0]}$].  
2) A (different) bias is added to the output of each filter [$z^{[1]}$] and then a non-linear function is applied to each output.  
3) The resulting non-linear outputs are stacked into a single object [$a^{[1]}$].

**In a convolutional layer, the number of parameters depends on the filters - it does not depend on the size of the input image.** This is useful, since **we can process even very large images with a relatively small amount of parameters and it makes convolutional NNs less prone to overfitting.**
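
To make the parameter count concrete, here is a small sketch. The helper names and the example sizes (ten 3x3x3 filters, a 64x64x3 image, 100 hidden units) are assumptions chosen for illustration:

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """Each filter has f*f*n_c_prev weights plus one bias."""
    return (f * f * n_c_prev + 1) * n_filters

def dense_layer_params(n_in, n_out):
    """Each output unit has one weight per input plus one bias."""
    return (n_in + 1) * n_out

# Ten 3x3 filters on an RGB input: the image size does not appear anywhere.
print(conv_layer_params(f=3, n_c_prev=3, n_filters=10))  # 280

# A fully connected layer with 100 units on a flattened 64x64x3 image, for contrast.
print(dense_layer_params(n_in=64 * 64 * 3, n_out=100))   # 1228900
```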

<img src="notes_images/olcn.png" width="700">

See below for a summary of the notation we will use in this course. This is mostly conventional, but the order of |height x width x channels| can vary between resources and frameworks.

<img src="notes_images/olcn2.png" width="700">

## Simple Convolutional Network Example

In a CNN, we often link multiple convolutional layers. **Generally, each convolutional layer decreases the width and height compared to the original image and increases the number of channels.** We can feed the output of the last layer to a logistic/softmax neuron to make a prediction.

<img src="notes_images/scne.png" width="700">

A lot of the work in designing CNNs is choosing the hyperparameters: stride, padding, filter size, number of filters.

**In CNNs, we typically use three types of layers:**  
1) *Convolutional* (CONV)  
2) *Pooling* (POOL)  
3) *Fully connected* (FC)

## Pooling Layers

Pooling layers are used to reduce the size of the representation, speed up computation and make detecting features more robust.

**In *max pooling* we basically slide over the input a filter that takes the maximum value at each position and saves it as output.** So the pooling essentially **shrinks the input while trying to preserve any detected features (high numbers).** This has been found to work well in practice.

**Max pooling has hyperparameters (f & s), but it has no parameters to learn.** So it is essentially a fixed computation, once the HPs have been chosen (often f=2 & s=2 or f=3 & s=2). Padding is usually not used in max pooling (p=0).

<img src="notes_images/pl.png" width="700">

Max pooling follows the same rules we learned for the filters before - the number of channels is preserved etc. In a 3D scenario, the pooling is done separately for each channel.

**In *average pooling*, we take the average at each filter position**, instead of maximum value. This is **not used nearly as commonly as max pooling**, but it has some niche uses in collapsing certain regions in very deep NNs.
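
Both pooling types can be sketched for a single channel as follows. The function name `pool2d` and the 4x4 test input are illustrative assumptions, and a real implementation would apply this to each channel separately:

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pool a 2D array with window size f and stride s, without padding."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # No learned weights: just a fixed reduction over each window.
            out[i, j] = reduce(x[i * s:i * s + f, j * s:j * s + f])
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(x, mode="max")      # rows: 5, 7 and 13, 15
avg_pooled = pool2d(x, mode="average")  # rows: 2.5, 4.5 and 10.5, 12.5
```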

## CNN Example

We now have all the building blocks to build a CNN.

Typically the number of layers in a NN is reported as the number of layers that have parameters. Since pooling layers have no parameters, they are not counted as separate layers. Instead, POOL layers are usually seen as follow-ups to CONV layers, since the two are commonly used together.

There are a lot of hyperparameters to be chosen in CNNs and the best way to choose them is usually based on literature (starting from scratch might take a long time).

A typical CNN structure is to have multiple CONV + POOL pairs and at the end a couple of FCs into a softmax.

<img src="notes_images/cnne.png" width="700">

<img src="notes_images/cnne2.png" width="700">

Note, that most of the parameters are in the FCs of the network - not in the CONVs.

## Why Convolutions?

**The main advantage of using CONV layers instead of FCs is that CONVs require far fewer parameters.** Especially while handling images, FCs can require enormous numbers of parameters.

There are **two main reasons why CNNs have fewer parameters:**  
1) *Parameter sharing*  
  * A feature detector (e.g. vertical edges) can successfully use the same parameter values in multiple areas of the image. 
  
2) *Sparsity of connections*
  * In each layer, each output value only depends on a handful of inputs (not all of them, like in FCs).

**Since CNNs have fewer parameters, they can be trained with smaller training sets and are less prone to overfitting.**

CNNs are inherently good at capturing *translation invariances*. For example, they recognize that a cat picture shifted two pixels to the left is still a cat picture.

<img src="notes_images/wc.png" width="700">

The basic principles of training CNNs are largely similar to what we learned before. We define a cost function $J$ and we optimize it with GD/Adam/etc.

In real life, CNNs are often used with the copy-paste principle. We copy a successful CNN structure that somebody else has created and use that.

# Week 1 - exercises

### Step by step

The **main benefits of padding** are the following:

- It allows us to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as we go to deeper layers. An important special case is the "same" convolution, in which the height/width is exactly preserved after one layer. 

- It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.

In modern deep learning frameworks, you only have to implement the forward pass, and the framework takes care of the backward pass. **The backward pass for convolutional networks is complicated.**

When in an earlier course we implemented a simple (fully connected) neural network, we used backpropagation to compute the derivatives of the cost with respect to the parameters in order to update them. Similarly, in convolutional neural networks we can **calculate the derivatives of the cost with respect to the parameters in order to update them**. The backprop equations are not trivial and we did not derive them in lecture.

**Even though a pooling layer has no parameters for backprop to update, we still need to backpropagate the gradient through the pooling layer** in order to compute gradients for layers that came before the pooling layer. 

### Application

#### Window, kernel, filter
The words "window", "kernel", and "filter" are used to refer to the same thing.  This is why the parameter `ksize` refers to "kernel size", and we use `(f,f)` to refer to the filter size.  Both "kernel" and "filter" refer to the "window."

#### Details on softmax_cross_entropy_with_logits
* Softmax is used to format outputs so that they can be used for classification.  It assigns a value between 0 and 1 for each category, where the sum of all prediction values (across all possible categories) equals 1.
* Cross entropy compares the model's predicted classifications with the actual labels and results in a numerical value representing the "loss" of the model's predictions.
* "Logits" are the result of multiplying the weights and adding the biases.  Logits are passed through an activation function (such as a relu), and the result is called the "activation."
* The function named `softmax_cross_entropy_with_logits` takes logits as input (and not activations); it then applies softmax to the logits to get predictions and compares the predictions with the true labels using cross entropy.  These are done in a single function to optimize the calculations.
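
To illustrate why the two steps are combined, here is a NumPy sketch of the same computation. This is not TensorFlow's implementation; the function name `softmax_xent_from_logits` and the one-hot label layout are assumptions. Working with log-softmax directly avoids overflow for large logits:

```python
import numpy as np

def softmax_xent_from_logits(logits, labels):
    """logits, labels: (batch, classes); labels are one-hot. Returns one loss per example."""
    # Subtracting the row maximum does not change softmax but keeps exp() from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross entropy: minus the log-probability assigned to the true class.
    return -(labels * log_softmax).sum(axis=1)

labels = np.array([[0.0, 1.0, 0.0]])
uniform_loss = softmax_xent_from_logits(np.zeros((1, 3)), labels)  # equals log(3)
confident_loss = softmax_xent_from_logits(np.array([[0.0, 1000.0, 0.0]]), labels)  # close to 0
```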

Having some trouble running stuff. Should possibly downgrade tf, if problem persists. Picked from forum: "As I looked from coursera notebooks they are using tensorflow==1.2.1, keras == 2.0.7 these version."

# Week 2

## **Case studies**

## Why look at case studies?

This week, we'll basically go through a bunch of examples of CNNs to see how the pros do it.

## Classic networks

**LeNet-5**

The goal of LeNet-5 was to recognize handwritten digits. Padding wasn't used at this time. Average pooling was used, even though nowadays max would be used. The network was small, with 60 k parameters. **The height and width decrease and number of channels increases deeper into the network, as is typical in ConvNets. The basic arrangement of *conv pool conv pool fc fc output* is still quite common.** Back then, ReLU was not used - they used tanh/sigmoid. They had to do tricks with the filters due to lacking computational resources. They also had a sigmoid layer after pooling.

<img src="notes_images/cnln.png" width="700">

**AlexNet**

Input starts with images -> 96 filters -> max pooling -> same convolution (with padding) -> ... This net has a bunch of same convs. It uses a softmax at the end to figure out which of 1000 classes the image belongs to. AlexNet is way bigger than LeNet-5, even though the building blocks were the same - it had 60 M parameters. Using ReLU also made this network better than LeNet-5. This was one of the first NNs that actually made people believe in NNs in computer vision. This paper is relatively easy to read.

**Convolutions with same padding allow us to build very deep networks while not downsizing the input too quickly. Pooling layers are used to decrease the height and width.**

<img src="notes_images/cnan.png" width="700">

**VGG-16**

This network really simplified the CNN architectures. Input -> same conv layer with 64 filters -> same conv layer with 64 filters -> pool -> ... This one is way deeper than the earlier two, but the structure/architecture itself is quite simple, which made it attractive to researchers. The CNN is very big even by modern standards, 138 M parameters. There is also VGG-19, which is an even bigger version of this CNN, but it doesn't do that much better.

<img src="notes_images/cnvgg.png" width="700">

## ResNets

**In a ResNet we basically jump over some layers of the NN to directly pass information deeper into the network.** ResNets are built of *residual blocks*.

<img src="notes_images/res.png" width="700">

**Stacking residual blocks allows us to build much deeper NNs than we could achieve with *plain networks* (non-resnets). In plain nets, the training error starts increasing after a certain number of layers, whereas in resnets, it keeps going down until it plateaus.** Using ResNets (passing info down the NN) helps with the vanishing and exploding gradients problems. The network below has 5 residual blocks.

<img src="notes_images/res2.png" width="700">

## Why ResNets Work

ResNets can be made very deep while still doing well on the training set. Doing well on the training set is a prerequisite for doing well on the dev and test sets.

**Creating ResNets with many layers works well, because the layers either learn something useful or, if they don't, the NN can simply skip the useless layers by the short cuts.** So essentially, a ResNet is able to control its effective size (if we have L2-regularization or weight decay which can set terms to zero). Thus, adding more layers should rarely damage the NN, but it's possible that the extra layers learn something useful. So, we could think that for very deep NNs: ResNets $\geq$ plain nets.
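
The "skip the useless layers" argument can be checked numerically. The sketch below uses small dense layers instead of conv layers to keep it short; the function names and sizes are illustrative assumptions. When the weights of the second layer are zero, the block reduces to the identity for non-negative inputs:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a, W1, b1, W2, b2):
    """Two dense layers with a skip connection from the input to the second ReLU."""
    a1 = relu(W1 @ a + b1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a)  # the input a is added before the final non-linearity

rng = np.random.default_rng(0)
a = relu(rng.standard_normal(4))         # output of an earlier ReLU, so non-negative
W1, b1 = rng.standard_normal((4, 4)), rng.standard_normal(4)
W2, b2 = np.zeros((4, 4)), np.zeros(4)   # e.g. weight decay has driven these to zero

out = residual_block(a, W1, b1, W2, b2)
print(np.allclose(out, a))  # True: the block just passes a through
```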

**In ResNets we often see a lot of same convolutions, because they preserve the dimensions, which allows the earlier activation to be added directly to the output of a later layer (same dim. required).** If the dimensions are different, we can multiply the earlier activation by a matrix $W_s$ to match them. This can be a learned matrix or a fixed matrix implementing zero padding.

<img src="notes_images/wrw.png" width="700">

## Networks in Networks and 1x1 Convolutions

***A network in network* or *1-by-1 convolution* allows us to control (shrink/preserve/increase) the number of channels in a network.** While a 1x1 convolution might seem trivial (on a single channel it just multiplies every value by a number), it is actually useful when the input has multiple channels.

Basically we can think of a 1x1 conv as a neuron that takes in a slice of the volume (all channels at one position), multiplies it elementwise by the weights in the 1x1 filter, sums the products and takes ReLU over the result, outputting a single real number. When we add multiple filters, we get multiple real numbers. So in essence, a 1x1 conv takes in some amount of numbers (32 in the image below) and outputs an amount corresponding to the number of filters.

Therefore, 1x1 convolutions can be used to shrink the depth of the NN (whereas pooling layers only shrink the width and height).
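
A minimal sketch of a 1x1 convolution as "the same small dense layer applied at every pixel". The function name, the array layout and the sizes (32 channels in, 8 out) are assumptions for illustration:

```python
import numpy as np

def conv1x1(volume, weights, bias):
    """volume: (n_h, n_w, n_c); weights: (n_c, n_filters); bias: (n_filters,)."""
    # Matrix multiplication over the last axis mixes channels at each position separately.
    z = volume @ weights + bias
    return np.maximum(z, 0.0)  # ReLU

rng = np.random.default_rng(0)
volume = rng.standard_normal((6, 6, 32))
weights = rng.standard_normal((32, 8))
bias = rng.standard_normal(8)

shrunk = conv1x1(volume, weights, bias)
print(shrunk.shape)  # (6, 6, 8): height and width kept, channels reduced from 32 to 8
```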

<img src="notes_images/1by1.png" width="700">

## Inception Network Motivation

**The basic idea of an *inception network* is that the network uses multiple filter sizes and/or poolings and learns which combination works best instead of us manually picking a single combination.** So, we implement multiple options and concatenate the outputs. Then, the network chooses whatever combination of filter sizes it wants by learning the parameters as it wants.

<img src="notes_images/inm.png" width="700">

**The downside of inception is the added computational cost of calculating all the options.** However, using 1x1 convolution can help us reduce the computational cost. If we 1x1 convolve to reduce the amount of channels before applying the actual (5x5 in image below) filters, we reduce the computational cost tenfold compared to directly applying the filters.
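
The saving can be estimated by counting multiplications. The sizes below (a 28x28x192 input, 32 same-padded 5x5 filters and a 16-channel 1x1 bottleneck) are assumed example dimensions, and the helper name is hypothetical:

```python
def conv_multiplications(out_h, out_w, n_filters, f, n_c_in):
    """Every output value needs f*f*n_c_in multiplications."""
    return out_h * out_w * n_filters * f * f * n_c_in

# Direct 5x5 same convolution: 28x28x192 -> 28x28x32.
direct = conv_multiplications(28, 28, 32, 5, 192)

# Bottleneck: 1x1 conv down to 16 channels, then the 5x5 conv on the thinner volume.
bottleneck = (conv_multiplications(28, 28, 16, 1, 192)
              + conv_multiplications(28, 28, 32, 5, 16))

print(direct)               # 120422400
print(bottleneck)           # 12443648
print(direct / bottleneck)  # roughly 10x fewer multiplications
```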

<img src="notes_images/inm2.png" width="700">

## Inception Network

An *inception module/block* is a combination of filter sizes, pooling layers and 1x1 convolutions that concatenates all the output channels together. In pooling layers, the 1x1 conv comes after the pooling.

<img src="notes_images/incn.png" width="700">

An inception network consists of multiple inception modules.

<img src="notes_images/incn2.png" width="700">

Inception networks often have *side branches*, that try to make predictions based on hidden layers in the middle of the network. These help ensure that the intermediate features are ok at predicting the final outputs. This appears to have a regularizing effect (helps avoid overfitting).

The name "inception" comes from the movie meme "we need to go deeper".

## **Practical advice for using ConvNets**

## Using Open-Source Implementation

**Replicating networks based on papers can be difficult. Use open-source and search for open-source implementations of existing NNs.**

When developing a computer vision project, use an existing (open-source) network architecture as a basis and use transfer learning to make it work for you.

## Transfer Learning

**For most computer vision applications, we should use transfer learning with an existing open-source implementation.** This will speed up the work, since it is usually much faster to use pre-trained networks than to train all the weights from scratch.

For example, we want to create a cat detection NN for detecting our cats from images. We have little training data. What do we do?  
1. Download an open-source network with code and weights.  
2. Remove existing softmax output layer and replace it with our own.  
3. Use our small training set to teach only the softmax layer (*freeze* the earlier layers).  
4. Speed up training by "skipping" all the frozen layers and mapping the input *x* to the second last layer. So basically we can pre-compute the features for the second last layer and save them. Then we're basically just training a shallow softmax model based on the feature vector.

If we have a large dataset, we can freeze fewer layers, i.e. train more of the last layers. We can either use the existing layers and weights (as initial values) or replace the layers and reinitialize the weights.

If we have a LOT of data, we can retrain the entire network (replacing the softmax output to fit our needs). In this case, we could still use the pretrained weights as initial values instead of a random initialization.

## Data Augmentation

**Most existing computer vision applications would like to have more data to improve the model (this is not true for all DL fields). Data augmentation is a way to produce more data.**

Common data augmentation methods for images:  
* Mirroring  
* Random cropping
* Color shifting (e.g. randomly add 5 to RGB channels)
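
The three common methods above can be sketched in NumPy as follows. The function names, the 32x32 random test image and the shift values are illustrative assumptions, and images are assumed to be `uint8` arrays in height x width x channels order:

```python
import numpy as np

def mirror(image):
    """Flip the image horizontally (left-right)."""
    return image[:, ::-1, :]

def random_crop(image, size, rng):
    """Cut out a random size x size patch."""
    n_h, n_w, _ = image.shape
    top = rng.integers(0, n_h - size + 1)
    left = rng.integers(0, n_w - size + 1)
    return image[top:top + size, left:left + size, :]

def color_shift(image, shift):
    """Add a per-channel offset, e.g. shift=(5, -5, 5), keeping values in [0, 255]."""
    shifted = image.astype(np.int16) + np.asarray(shift, dtype=np.int16)
    return np.clip(shifted, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = color_shift(random_crop(mirror(image), 24, rng), (5, -5, 5))
print(augmented.shape, augmented.dtype)
```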

Less commonly used methods, but still worth a try:  
* Rotation  
* Shearing  
* Local warping

Color shifting can be done more systematically using PCA (principal component analysis), also known as PCA color augmentation. For example, for an image with a lot of R and B and little G, PCA color augmentation changes the R and B values by a lot and G only slightly, which keeps the overall tint of the image.

<img src="notes_images/da.png" width="700">

Augmentation might require a lot of disk space, if all the images are saved. Consequently, the augmentation is usually done on the fly with e.g. a CPU thread that loads images and augments them to create mini-batches of data. This data is then passed on to the CPU/GPU process that does the training. These processes can run in parallel.

**Data augmentation also has some hyperparameters, so it can be a good idea to look for existing open-source implementations of the augmentation methods.**

## State of Computer Vision

There are some things that are unique about deep learning in computer vision as opposed to other fields.

Even though there is a lot of data for *image recognition* ("is there a cat in the picture") online, the task also requires a LOT of data and could always use more. Some problems like *object detection* ("where is the cat in the picture") have relatively little data in the first place.

**When you have large amounts of data, you can use simpler algorithms and less hand-engineering. When you have little data, you will need more hand-engineering ("hacks").**

A learning algorithm has two sources of knowledge:  
1. Labeled data  
2. Hand-engineering

**Computer vision relies quite a lot on hand-engineering, since historically there has always been too little data.** The amount of data is increasing now, but some hand-engineering is still going on. Nowadays, transfer learning also helps with small datasets.

Hand-engineering is not bad per se, it takes a lot of skill and effort to work with small amounts of data. But if there's a lot of data, it's not really required and we can focus on other parts of the model.

<img src="notes_images/socv.png" width="700">

**In research, people are very fixated on doing well on standardized benchmark datasets and winning competitions, because this can help publish a paper.** The upside of this is that it helps the entire community figure out the most effective methods. The downside is that to do well on benchmarks, researchers might apply tricks/hacks that would never be used in production.

**Some tricks, that do well on benchmarks but are not used for production systems:**
* Ensembling - train several NNs independently and average the outputs $\hat{y}$ (max. 1-2 % improvement). This slows down runtime considerably, since we need to run every image through multiple (often 3 to 15) networks. Also, we need to keep all the networks available, which takes up a lot of memory.  
* Multi-crop at test time - augment the test images before running them through the network and average the results over the augmented images. A typical method is *10-crop*. While resource-heavy, at least it only requires one network as opposed to ensembling.

We should take advantage of open-source code and available research papers in our projects.

# Week 2 - exercises

## Keras tutorial

### Key Points to remember
- Keras is a tool we recommend for rapid prototyping. It allows you to quickly try out different model architectures.
- Remember the four steps in Keras:


1. Create  
2. Compile  
3. Fit/Train  
4. Evaluate/Test  

## Residual Networks

### What you should remember
- Very deep "plain" networks don't work in practice because they are hard to train due to vanishing gradients.  
- The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function. 
- There are two main types of blocks: The identity block and the convolutional block. 
- Very deep Residual Networks are built by stacking these blocks together.

# Week 3

## Detection algorithms

### Object Localization

**Localization refers to figuring out where in the picture is the object to be detected.**

In *image classification* with/without *localization* there is only one object in the image. In *detection*, there can be multiple objects of different types.

Image classification can be done with a ConvNet as we learned before. **For classification with localization we can add four new output neurons for the bounding box: the midpoint coordinates ($b_x$, $b_y$), the height ($b_h$) and the width ($b_w$).** We typically use a single output ($p_c$) for dictating if there is an object and the others to describe it, should it exist.

In classification with localization, the loss function usually has log-likelihood for the object labels $c$, squared error for the bounding box coordinates and logistic regression loss for $p_c$.
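
As a simplified sketch, the loss below uses squared error for every component instead of the mixed loss described above. The point it illustrates is that when there is no object, only the object-presence output is penalized. The output ordering `[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]`, the three classes and the function name are assumptions for illustration:

```python
import numpy as np

def localization_loss(y_hat, y):
    """y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]; squared error on every component."""
    if y[0] == 1:
        # An object is present: every output matters.
        return float(np.sum((y_hat - y) ** 2))
    # No object: the box and class outputs are "don't care", only p_c is penalized.
    return float((y_hat[0] - y[0]) ** 2)

y_object = np.array([1.0, 0.5, 0.5, 0.2, 0.3, 0.0, 1.0, 0.0])
y_empty = np.zeros(8)

loss_object = localization_loss(y_object + 0.1, y_object)  # 8 components off by 0.1
loss_empty = localization_loss(np.array([0.2, 9, 9, 9, 9, 9, 9, 9.0]), y_empty)
print(loss_object, loss_empty)
```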

<img src="notes_images/ol2.png" width="700">

<img src="notes_images/ol.png" width="700">

### Landmark Detection

We can have a NN output the x and y coordinates of **important points in an image, aka *landmarks***, that we want the NN to recognize. We can have the landmark coordinates in the NN outputs.

**The landmarked object can be for example a feature in a person's face (corners of eyes, nose, mouth).** This is a basic building block for recognizing emotions from faces, for snapchat face filters etc. Another example is pose detection, where we identify landmarks for different body parts.

Training a NN with landmarks requires training data with labeled landmarks. The landmarks must be consistent across all images, e.g. the coordinates $l_1$ are always the right corner of the left eye.

<img src="notes_images/ld.png" width="700">

### Object Detection

Object localization and landmark detection build up to *object detection*.

Using car detection as an example: First, we get a training set with very closely cropped images of cars. Then, we use the images to train a ConvNet to recognize if there is a car in a given image. The trained ConvNet can then be used in *sliding windows detection*.

**In sliding windows detection, we slide a window over the image, taking a crop at each location and passing it to the CNN, which detects if there is a car in the window.** The window is slid over the entire image with a chosen stride. Then, we choose a bigger window and slide it over the image. Then, we can choose an even bigger window. The aim is to find crops where the CNN can recognize cars.

**The computational cost of sliding windows detection is great due to the amount of crops that we have to pass through the CNN.** Using a bigger stride will decrease computational cost and decrease accuracy and a smaller one will increase both (up to a point).
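
To get a feeling for the number of crops, here is a sketch of the window generator. The function name, the 10x10 dummy image and the window sizes are illustrative assumptions, and each crop would still have to be passed through the classifier separately:

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield ((top, left), crop) for every window position that fits inside the image."""
    n_h, n_w = image.shape[:2]
    for top in range(0, n_h - window + 1, stride):
        for left in range(0, n_w - window + 1, stride):
            yield (top, left), image[top:top + window, left:left + window]

image = np.zeros((10, 10))
small_crops = list(sliding_windows(image, window=4, stride=2))
large_crops = list(sliding_windows(image, window=6, stride=2))
dense_crops = list(sliding_windows(image, window=4, stride=1))

# A smaller stride means many more crops, i.e. many more forward passes.
print(len(small_crops), len(large_crops), len(dense_crops))  # 16 9 49
```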

Sliding window detection was a fine method in the past, where the models were simpler and lighter. However, **its standard form is too slow and/or inaccurate for huge modern NNs**. Luckily, there is a modern way to implement SWD more efficiently.

<img src="notes_images/od.png" width="700">

### Convolutional Implementation of Sliding Windows

**We can implement sliding windows convolutionally to make the method efficient by modern standards.** Normally, separate sliding windows do separate and overlapping computations, which "wastes" a lot of resources. However, in a convolutional implementation, we can share some of the computations between the windows.

The first step is **turning fully connected layers into convolutional layers by computing the FC layers as 3-D volumes, "1-by-1-by-$z$" layers**, through convolution with filters. The output layer is created in a similar way - as multiple values corresponding to the class probabilities of each class the softmax is identifying.

Then, **instead of running the CNN forward prop multiple times for different crops of the image, we can run it once on the entire image and get an output describing all of the individual crops.** This significantly lowers the computational cost.
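
The FC-to-CONV step can be verified numerically: a fully connected layer on a flattened volume gives the same numbers as convolving the volume with filters that cover it entirely. The sizes (a 5x5x16 volume and 400 units) and the random weights are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.standard_normal((5, 5, 16))    # e.g. the output of the last POOL layer
W = rng.standard_normal((400, 5 * 5 * 16))  # weights of an FC layer with 400 units

# Ordinary fully connected layer on the flattened volume.
fc_out = W @ volume.reshape(-1)

# The same weights viewed as 400 filters of size 5x5x16; each filter covers the whole
# volume, so the "convolution" has a single position and yields a 1x1x400 output.
filters = W.reshape(400, 5, 5, 16)
conv_out = np.tensordot(filters, volume, axes=([1, 2, 3], [0, 1, 2]))

print(np.allclose(fc_out, conv_out))  # True
```

On a larger input image the same filters simply get more positions to slide over, which is what produces one output per window in a single forward pass.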

However, the positions of the bounding box are not very accurate with this implementation. Next, we'll learn how to fix this.

<img src="notes_images/cisw.png" width="700">

### Bounding Box Predictions

Sometimes none of the sliding window positions match perfectly with the object to be detected. And sometimes the bounding box is not really a box/square, it can have different aspect ratios. Luckily, **we can use the *YOLO algorithm* to get more accurate bounding boxes.**

YOLO works as follows:  
1) Set a grid (in example 3x3, typically 19x19) over the input image (in example 100x100).  
2) Apply the image classification and localization algorithm to each grid cell.  
3) **Assign each object to a grid cell based on the midpoint of the object.**  
4) The target output is a 3x3x8 volume (grid x grid x labels).

**YOLO is usually done with a CNN, which maps the image to the output volume.** This makes the algorithm very efficient, since we are just running an image through the CNN once - it's a convolutional implementation. YOLO is so fast it even works for real-time object detection.

The advantage of YOLO is that it gives precise bounding boxes. The disadvantage is that there can only be one object per grid cell (in the default implementation), however with a finer grid like 19x19 it's less likely that there will be multiple objects in a single cell.

In YOLO, the top left point of each grid cell has coordinates (0,0) and the lower right (1,1). The object midpoint coordinates are specified based on these coordinates. The height of the bounding box is specified as a fraction of the total grid cell height, and similarly for width.

While the midpoint coordinates only get values within [0,1], the width and the height can be > 1 if the object is bigger than the grid cell.
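
The coordinate convention can be sketched like this. The function name, the square 300x300 image and the pixel values are assumptions chosen to give round numbers, and midpoints lying exactly on the right or bottom image border are not handled:

```python
def encode_box(mid_x, mid_y, width, height, image_size, grid):
    """Return (row, col, b_x, b_y, b_h, b_w) for a box given in pixels."""
    cell = image_size / grid
    # The grid cell containing the midpoint is responsible for the object.
    col, row = int(mid_x // cell), int(mid_y // cell)
    # Midpoint relative to the cell's top-left corner, in units of the cell size.
    b_x = mid_x / cell - col
    b_y = mid_y / cell - row
    # Height and width as fractions of the cell size; these can exceed 1.
    return row, col, b_x, b_y, height / cell, width / cell

encoded = encode_box(150, 250, width=150, height=50, image_size=300, grid=3)
print(encoded)  # (2, 1, 0.5, 0.5, 0.5, 1.5)
```

Here the object is 1.5 cells wide, so $b_w > 1$, while the midpoint stays within [0,1] of the bottom-middle cell.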

**In the YOLO algorithm, at training time, only one cell ---the one containing the center/midpoint of an object--- is responsible for detecting this object.**

The original YOLO paper describes other parametrizations that might work better. It's quite a hard read.

<img src="notes_images/bbp.png" width="700">

<img src="notes_images/bbp2.png" width="700">

### Intersection Over Union

We can use the *intersection over union* (IoU) function to tell if our object detection (OD) algorithm is working well. We can also include it as a component in our OD algorithm to make it work better.

**IoU basically takes two bounding boxes and divides the size of their intersection by the size of their union. The standard value in computer vision tasks accepted as clear overlap is 0.5** - however stricter values such as 0.6 or 0.7 can also be used based on the task.

Basically, we can compare the predicted box with the true box to tell if our OD is working well. However, IoU generally measures how much any two bounding boxes overlap and can be used to improve our OD.
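
A small sketch of the IoU computation, assuming boxes are given as corner coordinates `(x1, y1, x2, y2)`. This format and the function name are assumptions, and other box formats would need converting first:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Union = both areas minus the part counted twice.
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.14: below the usual 0.5 threshold
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 for identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 for disjoint boxes
```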

<img src="notes_images/iou.png" width="700">

### Non-max Suppression

Sometimes, our object detection algorithm might detect the same object multiple times. **We can use non-max suppression to make sure each object is only detected once.**

It's possible that multiple YOLO grid cells think they found the center point of the object. This can lead to multiple detections of each object.

**Non-max suppression compares the probabilities of each object detection.** It will take the highest probability as the "true" detection and suppress any lower-probability detection boxes that have a significant overlap with the true box. Then, it checks any remaining non-suppressed detections and picks the one with the highest probability, repeating the process. **In the end, we are left with the unique highest probability boxes, which are the final predictions.** (In reality, the process is slightly more complex but this is the general idea)

The algorithm can also be written as follows:  
1) Cut out any bounding boxes with probabilities lower than a certain threshold (e.g. < 0.6).  
2) Choose the box with the maximum probability.  
3) Discard any box that has IoU > 0.5 with the maximum box.  
4) Repeat 2 and 3 until no boxes remain.
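
The four steps above can be sketched for a single class as follows. The detection format (a list of `(probability, box)` pairs with corner-coordinate boxes) and the function names are assumptions made for this example.

```python
def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    """detections: list of (probability, box) pairs for ONE class."""
    # Step 1: cut out low-probability boxes.
    remaining = [d for d in detections if d[0] >= score_threshold]
    # Sort so that the highest-probability box is always first.
    remaining.sort(key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:                      # Step 4: repeat until no boxes remain.
        best = remaining.pop(0)           # Step 2: box with the maximum probability.
        kept.append(best)
        # Step 3: discard boxes that overlap too much with the chosen box.
        remaining = [d for d in remaining if _iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),     # overlaps heavily with the 0.9 box, gets suppressed
    (0.7, (20, 20, 30, 30)),   # a separate object, survives
    (0.3, (0, 0, 10, 10)),     # below the probability threshold
]
print(non_max_suppression(detections))
```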

**If we have multiple output classes, we carry out NMS separately for each of the classes.** Details in the programming exercise.

The algorithm suppresses any probability values that are not individual maxima -> non-max suppression.

<img src="notes_images/nms.png" width="700">

<img src="notes_images/nms2.png" width="700">

### Anchor Boxes

One problem we saw with OD so far is that each grid cell can only detect a single object. **Using *anchor boxes*, we can have a grid cell detect multiple objects.**

We predefine a number of anchor boxes (two in lecture example) and reflect this by adding additional outputs $y$ corresponding to each box. Then, we can associate each object with an anchor box of similar shape using IoU.

Essentially, **each object is assigned to a grid cell *and* an anchor box.** This allows for detection of multiple objects within the same grid cell.
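
A rough sketch of the anchor assignment. Since only the shapes are compared, it assumes both boxes share the same center and are described by `(width, height)`. The anchor shapes and function names below are invented for illustration.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes with a common center, each given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(object_wh, anchors):
    """Index of the anchor box whose shape has the highest IoU with the object."""
    return max(range(len(anchors)), key=lambda i: shape_iou(object_wh, anchors[i]))


anchors = [(1.0, 2.0), (2.0, 1.0)]     # hypothetical shapes: one tall, one wide
print(best_anchor((0.8, 1.9), anchors))  # 0: a tall object (e.g. a pedestrian)
print(best_anchor((2.2, 0.9), anchors))  # 1: a wide object (e.g. a car)
```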

Anchor box limitations:  
* We **need to have as many anchor boxes as we expect to find objects in a single grid cell.** So if we have 2 anchor boxes but find 3 objects in a single cell, the algorithm will not work properly, unless we implement some sort of tie breaker.
* Anchor boxes fail if we have multiple objects (of similar shape) that can be associated with the same anchor box. This case would also require a tie breaker.

**In practice, two objects having the same midpoint in a 19x19 grid is quite rare.**

Anchor boxes allow the learning algorithm to specialize better - some neurons can specialize in wide objects and others in tall objects.

**Anchor boxes need to be chosen by the modeller.** Before, they were chosen purely by hand to reasonably represent the object shapes that were expected. Nowadays, a more advanced approach is to run a k-means algorithm on the object shapes seen in the data and use the resulting clusters to choose the most representative anchor boxes.

<img src="notes_images/ab.png" width="700">

<img src="notes_images/ab2.png" width="700">

### YOLO Algorithm

Here, we put all the above pieces together to form a practical YOLO algorithm.

**Training**. We train a CNN that inputs a 100x100x3 image and outputs a 3x3x16 volume (3x3 grid cells, 2 anchor boxes, 8 values per anchor box). Each grid cell is gone through separately and considered for all the anchor boxes. Each object is assigned to one grid cell and anchor box combination.
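
To make the 3x3x16 shape concrete, here is a sketch of how one training label could be filled in. It assumes 2 anchor boxes, 3 classes, and the per-anchor layout `[p_c, bx, by, bh, bw, c1, c2, c3]`. The helper names are hypothetical.

```python
GRID, ANCHORS, CLASSES = 3, 2, 3
VALUES_PER_ANCHOR = 5 + CLASSES   # p_c, bx, by, bh, bw plus one value per class


def empty_target():
    """A GRID x GRID x (ANCHORS * VALUES_PER_ANCHOR) label filled with zeros."""
    depth = ANCHORS * VALUES_PER_ANCHOR
    return [[[0.0] * depth for _ in range(GRID)] for _ in range(GRID)]


def place_object(target, row, col, anchor, box, class_index):
    """Write one object into the slot of its grid cell and anchor box.

    box is (bx, by, bh, bw) relative to the grid cell.
    """
    start = anchor * VALUES_PER_ANCHOR
    one_hot = [1.0 if c == class_index else 0.0 for c in range(CLASSES)]
    target[row][col][start:start + VALUES_PER_ANCHOR] = [1.0, *box, *one_hot]


y = empty_target()
place_object(y, row=1, col=2, anchor=1, box=(0.5, 0.5, 0.4, 1.8), class_index=1)
print(len(y), len(y[0]), len(y[0][0]))   # 3 3 16
print(y[1][2])                           # first 8 values stay 0, last 8 describe the object
```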

**Predicting**. The network outputs numerical values even for cells with no objects, but these carry no meaning and are ignored.

**Non-max suppression**. Due to having two anchor boxes, we get two bounding box predictions for each grid cell. Some of the boxes have very low probabilities, so we discard them. We then run NMS separately for each class we try to detect to generate the final predictions.

**YOLO is one of the most effective OD algorithms.**

<img src="notes_images/yolo.png" width="700">

<img src="notes_images/yolo2.png" width="700">

<img src="notes_images/yolo3.png" width="700">

### Region Proposals (optional lecture)

Region proposal algorithms are not used very commonly nowadays, but it's still good to know what they are.

Basically, RP selects a few meaningful windows and runs the classifier on those, instead of sliding a window over the whole image. This is based on a segmentation algorithm, which tries to find regions of importance. This is the basic R-CNN (regions with CNN) algorithm.

R-CNN is slow, but there are faster modern alternatives. For example, *Fast R-CNN* is a convolutional implementation. *Faster R-CNN* does the region proposals with a CNN.

Even the modern R-CNN algorithms are usually slower than YOLO. According to Ng, YOLO is a more promising algorithm and RP is more "nice-to-know".

<img src="notes_images/rp.png" width="700">

# Week 3 - exercises

## What we should remember:
    
- YOLO is a state-of-the-art object detection model that is fast and accurate
- It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume. 
- The encoding can be seen as a grid where each of the 19x19 cells contains information about 5 boxes.
- You filter through all the boxes using non-max suppression. Specifically: 
    - Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes
    - Intersection over Union (IoU) thresholding to eliminate overlapping boxes
- Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as a lot of computation, we used previously trained model parameters in this exercise. If you wish, you can also try fine-tuning the YOLO model with your own dataset, though this would be a fairly non-trivial exercise. 
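
A sketch of the score-thresholding step for a single box. It assumes the class score is the product of the objectness probability `p_c` and the class probability, which is a common YOLO formulation. The function name and the 0.6 threshold are illustrative.

```python
def filter_box(p_c, class_probs, threshold=0.6):
    """Return (best_class, score) for one box, or None if the score is too low.

    Assumption: score of class c = p_c * P(class c | object).
    """
    scores = [p_c * p for p in class_probs]
    best_class = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_class]
    return (best_class, best_score) if best_score >= threshold else None


print(filter_box(0.9, [0.1, 0.8, 0.1]))  # kept: class 1 with score of about 0.72
print(filter_box(0.5, [0.1, 0.8, 0.1]))  # None: best score is about 0.4
```

In the 19x19x5x85 volume this check would run for each of the 19 * 19 * 5 boxes before the IoU thresholding.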

# Week 4

## Face Recognition

### What is face recognition?

Face recognition is a special application of CNNs.

***Face verification* and *face recognition* are two different problems. Recognition is more difficult (requires accuracy > 99 %).**

<img src="notes_images/wfr.png" width="700">

### One Shot Learning

In face recognition we need to solve the ***one shot learning problem*. This means that our system needs to recognize a face based on a single photo (there is just one photo of each employee in the database).** This is traditionally a difficult task for NNs.

A bad approach would be training a CNN into a softmax with outputs corresponding to different faces. This does not work well, since the training set is too small to train the NN properly. Also, adding new personnel would be difficult and require retraining the NN.

Instead, we will learn **a *similarity* function. This compares two images and returns a value describing if they are of the same person (low value) or not (high).** We can then set a threshold and if the returned value is under the threshold, then the two images are of the same person. -> Verification.

**In a recognition task, we apply the similarity function to the photo and each face in the database.**

<img src="notes_images/osl.png" width="700">

### Siamese Network

**In a *siamese network*, we pass two images through the same CNN and** compare the outputs. The output layer is typically fully connected, and it gives a representation ("an encoding") of the full image. We can then **compare the two images' encodings and use this comparison as the similarity function - to decide if the images portray the same person.**
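
A minimal sketch of using encodings for verification and recognition, assuming the encodings have already been computed by the network and that the similarity function is the Euclidean distance. The threshold of 0.7, the names, and the toy 2-dimensional encodings are made up for illustration.

```python
import math


def distance(enc_a, enc_b):
    """Euclidean distance between two encodings (low = same person)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(enc_a, enc_b)))


def verify(enc_a, enc_b, threshold=0.7):
    """Verification (1:1): are the two encodings of the same person?"""
    return distance(enc_a, enc_b) <= threshold


def recognize(enc, database, threshold=0.7):
    """Recognition (1:K): compare against every stored encoding.

    Returns the closest name, or None if nobody is close enough.
    """
    if not database:
        return None
    name, best = min(
        ((n, distance(enc, e)) for n, e in database.items()),
        key=lambda pair: pair[1],
    )
    return name if best <= threshold else None


database = {"alice": [0.0, 0.0], "bob": [1.0, 1.0]}   # one encoding per person
print(recognize([0.1, 0.0], database))   # alice
print(recognize([5.0, 5.0], database))   # None (unknown person)
```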

<img src="notes_images/sn.png" width="700">

<img src="notes_images/sn2.png" width="700">

### Triplet Loss

**The *triplet loss function* is defined on triplets of images: an anchor (A, target image), a positive (P, similar to anchor) and a negative (N, different to anchor).** This loss function is used to train the NN to differentiate between similar and dissimilar images.

We want the distance between the encodings of A and P to be less than or equal to the distance between the encodings of A and N. Furthermore, we add a *margin* hyperparameter to the equation. The margin pushes the AP and the AN pairs further away from each other, making it harder to satisfy the comparison. The margin also prevents the NN from outputting trivial solutions (e.g. the same encoding for every image).
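
Written out, the loss for one triplet is $L(A,P,N) = \max(\|f(A)-f(P)\|^2 - \|f(A)-f(N)\|^2 + \alpha, 0)$, where $f$ is the encoding and $\alpha$ the margin. A small sketch on toy encodings (the margin value 0.2 and the function names are assumptions for this example):

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(A, P) - d(A, N) + margin, 0) for one triplet of encodings."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


A, P = [0.0, 0.0], [0.1, 0.0]
print(triplet_loss(A, P, [1.0, 0.0]))   # easy negative: loss is 0, nothing to learn
print(triplet_loss(A, P, [0.3, 0.0]))   # hard negative: loss is positive (about 0.12)
print(triplet_loss(A, A, A))            # trivial "all encodings equal" solution costs the margin
```

The last call shows why the margin matters: without it, mapping every image to the same encoding would give zero loss.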

**We cannot choose the triplets totally randomly - we should choose triplets that are "difficult" to train on (difference is small).** If we choose them randomly, the task will be too easy for gradient descent and the NN won't work properly. Furthermore, **for training we require multiple pictures of each person.**

In the DL world, algorithms are often named "\_\_\_Net" or "Deep\_\_\_".

**Typically for facial recognition, we would use a NN trained by somebody else instead of implementing it from scratch.** Big companies are training such networks with tens or hundreds of millions of images.

<img src="notes_images/tl.png" width="700">

<img src="notes_images/tl2.png" width="700">

### Face Verification and Binary Classification

We learned that triplet loss is one way to approach face recognition. However, face recognition can also be posed as a *binary classification problem*. It also works well.

In binary classification, we give the siamese network FC output to a logistic regression unit, which decides whether the images are of the same (1) or different (0) person.
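
A sketch of the logistic regression unit on top of two encodings. It assumes the features are the element-wise absolute differences of the encodings, which is one common choice. The weights and bias below are hand-picked for illustration, not learned.

```python
import math


def same_person_probability(enc_a, enc_b, weights, bias):
    """sigmoid(sum_k w_k * |enc_a[k] - enc_b[k]| + b)."""
    z = sum(w * abs(a - b) for w, a, b in zip(weights, enc_a, enc_b)) + bias
    return 1.0 / (1.0 + math.exp(-z))


weights, bias = [-5.0, -5.0], 2.0    # negative weights: larger differences lower the probability
print(same_person_probability([0.2, 0.4], [0.2, 0.4], weights, bias))  # identical encodings, > 0.5
print(same_person_probability([0.0, 0.0], [1.0, 1.0], weights, bias))  # far apart, close to 0
```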

We can precompute the encodings for the database images to save computational resources. So, we only compute the encodings for the new images taken of employees trying to enter.

The training set consists of pairs of images labeled 0 or 1.

<img src="notes_images/fvbc.png" width="700">

## Neural Style Transfer

### What is neural style transfer?

***Neural style transfer* = applying the style of an image (S) to content image (C) resulting in a generated image (G).** For example, making a photo have the style of some old painting.

This is based on using the features in the intermediate layers of a CNN.

### What are deep ConvNets learning?

Neurons in the early CNN layers learn simple features like edges or certain shades of color.

Neurons in deeper layers recognize more complex features/shapes. It can be hard to say for sure what exactly each neuron is detecting.

<img src="notes_images/wcl.png" width="700">

### Cost Function

The cost function measures how good a particular generated image is.

Two parts:  
* content cost $J_{content}(C,G)$ measures how similar C and G are in content.
* style cost $J_{style}(S,G)$ measures how similar S and G are in style.

So, the cost is of the form:

$J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)$

<img src="notes_images/cf.png" width="700">

### Content Cost Function

**We choose a hidden layer $l$ for computing the content cost. The hidden layer should be somewhere in the middle of the CNN** - not too deep and not too early.

The *content cost function* is essentially just the element-wise sum of squares of differences between the activations in layer $l$ of the images C and G.
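
A minimal sketch of the content cost on plain nested lists, assuming the activations of layer $l$ are stored as height x width x channels. The `scale` factor is a normalization constant. The default of 0.5 is an assumption here, and other constants are also used.

```python
def flatten(volume):
    """Flatten a height x width x channels nested list into one list."""
    return [value for row in volume for pixel in row for value in pixel]


def content_cost(a_C, a_G, scale=0.5):
    """scale * sum of squared element-wise differences between the activations."""
    return scale * sum((c - g) ** 2 for c, g in zip(flatten(a_C), flatten(a_G)))


a_C = [[[1.0, 2.0]], [[3.0, 4.0]]]   # toy 2 x 1 x 2 activation volume for C
a_G = [[[1.0, 0.0]], [[3.0, 2.0]]]   # toy activation volume for G
print(content_cost(a_C, a_G))        # 0.5 * (0 + 4 + 0 + 4) = 4.0
```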

<img src="notes_images/ccf.png" width="700">

### Style Cost Function

**The "style" of an image is defined as the correlation between activations across channels in layer *l*.** For example, we take the first and second channel and compare them value by value checking how correlated the values are.

The correlation tells how often the high-level features (e.g. texture, color) occur or do not occur together. For example, if channel 1 is orange every time channel 2 has vertical lines, there is high correlation between these features.

**For the cost function, we calculate a style matrix *G* (aka "gram matrix"), that computes all the required correlations and thus tells us how correlated the values are between channels *k* and *k'*.** To form *G*, we sum over the image height and width and multiply together the activations of the channels *k* and *k'*. We do this for every value of *k* and *k'*. If *G* has high values, there is a high correlation.

Technically, **in *G* we do not compute correlation but *unnormalized cross-covariance*.**

The style matrix ***G* is computed for both the style image S and generated image G. *The style cost function J* is computed as the sum of squares of the elementwise differences between the style matrices, summed over all the layers.** Computing *J* over all the layers allows us to include both lower and higher level features in the "style".
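
In symbols, the style matrix entry is $G_{kk'} = \sum_i \sum_j a_{ijk} a_{ijk'}$, summing over the height $i$ and width $j$. The sketch below computes it for one layer on nested lists and compares two images. The normalization constant of the cost is left out, and the function names are made up for this example.

```python
def gram_matrix(activations):
    """Style matrix of a height x width x channels nested list (channels x channels)."""
    n_c = len(activations[0][0])
    gram = [[0.0] * n_c for _ in range(n_c)]
    for row in activations:            # sum over the image height
        for pixel in row:              # sum over the image width
            for k in range(n_c):
                for k2 in range(n_c):
                    gram[k][k2] += pixel[k] * pixel[k2]
    return gram


def layer_style_cost(a_S, a_G):
    """Sum of squared element-wise differences between the two style matrices."""
    gram_s, gram_g = gram_matrix(a_S), gram_matrix(a_G)
    return sum(
        (s - g) ** 2
        for row_s, row_g in zip(gram_s, gram_g)
        for s, g in zip(row_s, row_g)
    )


a_S = [[[1.0, 2.0], [3.0, 4.0]]]     # toy 1 x 2 x 2 activation volume
print(gram_matrix(a_S))              # [[10.0, 14.0], [14.0, 20.0]]
print(layer_style_cost(a_S, a_S))    # 0.0: identical style
```

The overall style cost would sum `layer_style_cost` over the chosen layers, possibly with a weight per layer.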

<img src="notes_images/scf.png" width="700">

<img src="notes_images/scf2.png" width="700">

### 1D and 3D Generalizations

We can apply CNNs to more than just 2D images - they also work for 1D and 3D data.

For 1D data, we convolve a 1D input with a number of (e.g. 16 & 32 in slides) 1D filters.
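
A small sketch of a single 1D filter applied without padding and with stride 1 (as in the 2D case, this is technically cross-correlation). The output length follows the same $n - f + 1$ rule as for images. The function name and the toy signal are invented for illustration.

```python
def conv1d(signal, kernel):
    """'Valid' 1D convolution (no padding, stride 1) of a list with a filter."""
    n, f = len(signal), len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(f))
        for i in range(n - f + 1)
    ]


print(conv1d([1, 2, 3, 4], [1, 0, -1]))          # [-2, -2]
print(len(conv1d(list(range(14)), [1] * 5)))     # 10, i.e. 14 - 5 + 1
```

With 16 such filters, a length-14 input would give a 10x16 output, one column per filter.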

Similarly, for 3D data we convolve a 3D input with a number of 3D filters.

CNNs are mostly used on 2D images, since image data are so pervasive.

<img src="notes_images/1d3d.png" width="700">

<img src="notes_images/1d3d2.png" width="700">

# Week 4 - exercises

## Art generation

In this model the optimization algorithm updates the pixel values rather than the neural network's parameters. Deep learning has many different types of models and this is only one of them! 


#### Key points to remember
**General**  
- Neural Style Transfer is an algorithm that, given a content image C and a style image S, can generate an artistic image
- It uses representations (hidden layer activations) based on a pretrained ConvNet.  
- For the style, we get even better results by combining the representations from multiple different layers. 
- This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.

**Content cost**
- The content cost function is computed using one hidden layer's activations.
- The content cost takes a hidden layer activation of the neural network, and measures how different $a^{(C)}$ and $a^{(G)}$ are.

**Style cost**  
- The style of an image can be represented using the Gram matrix of a hidden layer's activations. 
- The style cost function for one layer is computed using the Gram matrix of that layer's activations. The overall style cost function is obtained using several hidden layers.
- Minimizing the style cost will cause the image $G$ to follow the style of the image $S$.

**Total cost**
- The total cost is a linear combination of the content cost $J_{content}(C,G)$ and the style cost $J_{style}(S,G)$.
- Optimizing the total cost function results in synthesizing new images.
- $\alpha$ and $\beta$ are hyperparameters that control the relative weighting between content and style.


## Face recognition

#### Key points to remember
- Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem. 
- The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
- The same encoding can be used for verification and recognition. Measuring distances between two images' encodings allows you to determine whether they are pictures of the same person. 