## Lighthouse Labs - Synaptive Medical

### W08D2 Deep Learning and Convolutional Neural Networks (CNNs)

Instructor: Socorro Dominguez  
February 23, 2021

**Agenda:**

- CNNs
- What kinds of layers do CNNs have?
    - Convolution
    - Pooling
    - Flattening
    - Full connection
     
- Case Studies of different CNN algorithms. 

- CNN tutorial

- Intro to RNNs and Special Case LSTMs

**Review: What is a Neural Network?**

We define the function recursively:

$$ x^{(l+1)} = h\left( W^{(l)} x^{(l)} + b^{(l)}\right) $$

where $W^{(l)}$ is a matrix of parameters, $b^{(l)}$ is a vector of parameters. 

So what is $x^{(l)}$?
 * $x^{(0)}$ are the inputs
 * $x^{(L)}$ are the outputs, so we can say $\hat{y}=x^{(L)}$
 * we refer to $L-1$ as the _number of hidden layers_
 * $h$ is the activation function
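
The recursive definition above can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random weights, and the choice of ReLU for $h$ are assumptions made for the example, not values from the lecture.

```python
import numpy as np

def relu(z):
    # One possible choice for the activation function h (an assumption here).
    return np.maximum(z, 0.0)

def forward(x, weights, biases, h=relu):
    """Apply x^(l+1) = h(W^(l) x^(l) + b^(l)) once per layer."""
    for W, b in zip(weights, biases):
        x = h(W @ x + b)
    return x

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # x^(0) has 4 entries; L = 3, so there are 2 hidden layers
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x0 = rng.normal(size=sizes[0])         # the inputs x^(0)
y_hat = forward(x0, weights, biases)   # the outputs x^(L)
print(y_hat.shape)  # (2,)
```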

**Deep Learning** is a subfield of *machine learning* concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.

![img](img/neural_nets.png)

### Why Deep Learning?

![img](img/unstructured_data.png)

## What is a CNN?

![img](img/CNN.png)

The most common application of CNNs is computer vision.

* Sports:
    * Player Tracking
    * Ball Tracking

* Health and Medicine:
    * Cancer / Tumor Detection
    * Cell Classification
    * Movement Analysis for neurological and musculoskeletal diseases

* Agriculture and farming:
    * Plant Recognition.
    * Farm Automation
    * Animal Monitoring

* Transportation, oil and mining, and many others!

Convolutional Neural Networks are a type of Deep Learning Algorithm.

1. CNNs take an image as an input.
2. CNNs learn the features of the image through filters. 
3. They identify important objects present in the image, allowing them to learn to discern one image from the other.

For example, during training the CNN will learn specific features of cats that differentiate them from dogs. 
Then, when it is given new images of cats and dogs, it can tell the two apart. 

! At the start of training (cold start), the filters are not hand engineered; they typically begin as random values, and as training progresses they adapt to the data and become detectors for the learned features. CNNs are continuously evolving.

![img](img/robot.png)

* The output of a CNN is usually a set of probabilities, one for each class the image could belong to. 

* A CNN is special as it tries to reduce the number of parameters in a deep neural network with many units without losing too much in the quality of the model. 

* In images, pixels that are close to one another usually have the same type of information: sky, water, leaves, etc. 

* The exception to this rule is **the edges**: the parts of an image where two different objects “touch” one another.

* The neural network is trained to recognize regions of the same information as well as the edges. This allows it to predict the object represented in the image. 

* **Example:** If the neural network detected multiple skin regions and edges that look like parts of an oval with skin-like tone on the inside and bluish tone on the outside, then it is likely that it’s a face on a sky background. 
    * If the goal is to detect people in pictures, the neural network will most likely succeed in predicting a person in this picture.
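
To make the point about reducing the number of parameters concrete, here is a rough count for one fully connected layer versus one convolutional layer on a 32x32x3 image. The layer sizes (100 units, 12 filters of size 3x3) are assumptions chosen only for illustration.

```python
def dense_params(n_inputs, n_units):
    # Every unit has one weight per input plus one bias.
    return n_inputs * n_units + n_units

def conv_params(filter_h, filter_w, in_channels, n_filters):
    # Every filter has one weight per patch entry plus one bias,
    # and the same filter is reused at every position of the image.
    return (filter_h * filter_w * in_channels + 1) * n_filters

n_dense = dense_params(32 * 32 * 3, 100)
n_conv = conv_params(3, 3, 3, 12)
print(n_dense)  # 307300
print(n_conv)   # 336
```

The convolutional layer is so much smaller because its filters only look at local patches and share their weights across the whole image.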

**The most important information in the image is local**

How does a CNN work?
- We split the image into square patches using a moving window approach. 
- We then train multiple small regression models at once, each receiving a square patch as input.
    - These models are the 'filters' that we train.
- Each regression model's job is to learn to detect a specific kind of pattern in the input patch. 

For example, one small regression model will learn to detect the sky; another will detect the grass; a third will detect the edges of a building.
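
The moving window step can be sketched as follows. The function name `extract_patches` and the toy 4x4 image are hypothetical and used only for illustration; real libraries do not materialize the patches like this.

```python
import numpy as np

def extract_patches(image, p, stride=1):
    """Return every p x p patch of a 2-D image, left to right, top to bottom."""
    rows, cols = image.shape
    patches = []
    for i in range(0, rows - p + 1, stride):
        for j in range(0, cols - p + 1, stride):
            patches.append(image[i:i + p, j:j + p])
    return np.array(patches)

image = np.arange(16).reshape(4, 4)    # toy 4x4 "image"
patches = extract_patches(image, p=3)
print(patches.shape)  # (4, 3, 3): four 3x3 patches
```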

* CNNs work much like ordinary fully connected neural networks. 
    * They have weights and biases that are learned from the training data. 
    * Every neuron in the network receives an input and performs a dot product on it. 
    * The network as a whole still maps the input to a set of class scores at the end. 
    * They have a loss function at the end to evaluate performance. 

![img](img/anatomyofcnn.png)

What seems different?

![](https://upload.wikimedia.org/wikipedia/commons/4/46/Colored_neural_network.svg)


The first architecture (the CNN) arranges its neurons in a more practical manner for images. 

Unlike in a fully connected network, the neurons are not arranged in a flat line. A CNN's neurons are arranged in three dimensions: length, width, and depth. 

For instance, the dog and cat images have dimensions 32x32x3, and the final output is a single vector of class scores with dimensions 1x1x2.

![](img/3channels.png)

The goal:  Reduce the images into an easier form to process, without losing features which are critical for getting a good prediction.

**ARCHITECTURE**

* INPUT – A typical image dataset will hold images of dimensions l x w x d, where the depth denotes the number of channels (RGB) in the image.
  
  
* CONV layer - computes the dot product between the weights of each neuron and the region of the input image to which it is connected. With 12 filters, for example, the output volume would be 32x32x12.
  
  
* The third layer is a RELU layer, which applies the activation function elementwise to the resulting dot products. 
  
  
* The fourth layer is a POOLing layer, which downsamples the spatial dimensions of the image (width and height).
  
  
* The fully connected layer will compute the class score, leading to a final volume of 1 x 1 x n; where n is the number of categories to classify.
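
As a rough sketch, the shape of the data can be traced through the layers listed above. The helper `cnn_shapes` is hypothetical, and it assumes that the convolution is padded so the spatial size is preserved and that the pooling stride equals the pooling window size.

```python
def cnn_shapes(height, width, depth, n_filters, pool_size, n_classes):
    """List the volume shape after each layer of a simple CNN."""
    return [
        ("INPUT", (height, width, depth)),
        ("CONV", (height, width, n_filters)),   # one output channel per filter
        ("RELU", (height, width, n_filters)),   # elementwise, shape unchanged
        ("POOL", (height // pool_size, width // pool_size, n_filters)),
        ("FC", (1, 1, n_classes)),              # one score per category
    ]

shapes = cnn_shapes(32, 32, 3, n_filters=12, pool_size=2, n_classes=2)
for name, shape in shapes:
    print(name, shape)
```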

The convolutional component comprises the learnable filter.  

* To detect some pattern, a small regression model has to learn the parameters of a matrix F (for “filter”) of size p × p, where p is the size of a patch.


* If we had for input a black and white image, 1 would represent the black and 0 would represent the white pixels. 
* Assume 3x3 pixel patches (p = 3). Some patch could then look like the following matrix P (for “patch”):

$$P = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$

The previous patch represents a pattern that looks like a cross. 

The small regression model that will detect such patterns (and only them) would need to learn a 3 by 3 parameter matrix F where parameters at positions corresponding to the 1s in the input patch would be positive numbers, while the parameters in positions corresponding to 0s would be close to zero. 

If we calculate the convolution of matrices P and F, the value we obtain is higher the more similar F is to P. To illustrate the convolution of two matrices, assume that F looks like this:

$$F = \begin{bmatrix} 0 & 2 & 3 \\ 2 & 4 & 1 \\ 0 & 3 & 0 \end{bmatrix}$$

The convolution operator is only defined for matrices that have the same number of rows and columns. For our matrices P and F, it is calculated by multiplying the entries at matching positions and summing the results, as illustrated below:

![convolution](img/02_Convolution.png)

If our patch had a different pattern, then the convolution with F would give a different result. 

*The more the patch “looks” like the filter, the higher the value of the convolution operation is*
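
A small sketch of this calculation for the matrices P and F given above, assuming the convolution of two equally sized matrices is the sum of their elementwise products. The diagonal patch is a made-up example of a pattern that does not match the filter.

```python
import numpy as np

P = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]])
F = np.array([[0, 2, 3],
              [2, 4, 1],
              [0, 3, 0]])

def conv_value(patch, filt):
    # Multiply entries at matching positions, then add everything up.
    return int(np.sum(patch * filt))

cross_score = conv_value(P, F)
print(cross_score)  # 12

diagonal = np.eye(3, dtype=int)   # hypothetical patch with a diagonal line
diagonal_score = conv_value(diagonal, F)
print(diagonal_score)  # 4
```

The cross-shaped patch scores higher than the diagonal one because more of its 1s line up with the large entries of F.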

For convenience, there’s also a bias parameter b associated with each filter F which is added to the result of a
convolution before applying the nonlinearity (activation function).

One layer of a CNN consists of multiple convolution filters (each with its own bias parameter).

Each filter of the first layer slides — or convolves — across the input image, left to right, top to bottom, and convolution is computed at each iteration.

Like this:

![](https://miro.medium.com/max/1400/1*ciDgQEjViWLnCbmX-EeSrA.gif)

![](https://i.stack.imgur.com/FjvuN.gif)
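
The sliding computation can be sketched with two loops. This is a deliberately naive illustration (real libraries use much faster implementations); the function name `conv2d`, the toy 4x4 image, and the bias value are assumptions for the example, while F is the filter from the text.

```python
import numpy as np

def conv2d(image, filt, bias=0.0, stride=1):
    """Slide a p x p filter over a 2-D image, left to right, top to bottom."""
    p = filt.shape[0]
    out_rows = (image.shape[0] - p) // stride + 1
    out_cols = (image.shape[1] - p) // stride + 1
    out = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            patch = image[r:r + p, c:c + p]
            out[i, j] = np.sum(patch * filt) + bias   # bias added after the sum
    return out

# Toy image with a cross in its top-left corner.
image = np.array([[0, 1, 0, 0],
                  [1, 1, 1, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0]])
F = np.array([[0, 2, 3],
              [2, 4, 1],
              [0, 3, 0]])

feature_map = conv2d(image, F, bias=1.0)
print(feature_map.shape)  # (2, 2)
```

The largest entry of `feature_map` is at position (0, 0), where the window sits exactly on the cross.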

If the CNN has one convolution layer following another convolution layer, then the subsequent layer *l + 1* treats the output of the preceding layer *l* as a collection of $size_l$ image matrices, where $size_l$ is the number of filters in layer *l*.

**Pooling**

This is a technique very often used in CNNs. Pooling works in a way very similar to convolution, as a filter applied using a moving window approach. 

Instead of applying a trainable filter to an input matrix, a pooling layer applies a fixed operator, usually either max or average. 

As with convolution, pooling's hyperparameters are the size of the filter and the stride. 

Usually, a pooling layer follows a convolution layer, and it gets the output of convolution as input. 

Pooling does not have parameters to learn. It shrinks the feature maps, which reduces the number of parameters needed in the later layers of the network; this speeds up training and often helps the accuracy of the model.

![pooling](https://miro.medium.com/max/792/1*uoWYsCV5vBU8SHFPAPao-w.gif)
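
A minimal sketch of pooling with a 2x2 window and stride 2. The function name `pool2d` and the input values are made up for illustration.

```python
import numpy as np

def pool2d(x, size=2, stride=2, op=np.max):
    """Apply a fixed operator (max or mean) over a moving window."""
    out_rows = (x.shape[0] - size) // stride + 1
    out_cols = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            out[i, j] = op(x[r:r + size, c:c + size])
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])

max_pooled = pool2d(x, op=np.max)    # [[6, 4], [8, 9]]
avg_pooled = pool2d(x, op=np.mean)   # [[3.75, 2.25], [4.5, 4.0]]
print(max_pooled)
print(avg_pooled)
```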

### Why ReLU as Activation Function

After getting the new convolved matrix, anything negative is turned to zero.

This removes unnecessary noise. 
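
In code, this is a one-line operation. The values in `convolved` are made up for the example.

```python
import numpy as np

def relu(x):
    # Keep positive values as they are; turn anything negative into zero.
    return np.maximum(x, 0)

convolved = np.array([[13.0, -7.0],
                      [-2.0, 5.0]])
activated = relu(convolved)
print(activated)  # [[13, 0], [0, 5]] as floats
```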

Hyperparameters:
* **Stride**: how big a step the moving window takes in the pooling and conv layers
* **Size of Kernel**: how big you want your filter to be
* **Padding**: how many rows and columns of zeros to add around the image
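
These three hyperparameters together determine the size of a layer's output. A small sketch using the standard output-size formula, where the helper name `output_size` is hypothetical:

```python
def output_size(n, kernel, stride=1, padding=0):
    """Output width (or height) for an input of size n along that dimension."""
    return (n + 2 * padding - kernel) // stride + 1

print(output_size(32, kernel=3))              # 30: no padding shrinks the image
print(output_size(32, kernel=3, padding=1))   # 32: padding preserves the size
print(output_size(32, kernel=2, stride=2))    # 16: typical 2x2 pooling
```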

## What does a CNN look like after all?

![img](img/05_FullCNN.png)

![img](img/mnistcnn.png)

Check out this video to understand more: https://www.youtube.com/watch?v=FmpDIaiMIeA&feature=emb_title

## Introduction to RNNs

### Example of a Use Case

[Music Generator](https://magenta.tensorflow.org/performance-rnn)

### Motivation to know about RNNs

- Language is an inherently sequential phenomenon.
- This is reflected in the metaphors used to describe language: the flow of conversation, news feeds, and Twitter streams.

**Sentiment analysis using feed-forward neural networks**

* In feed-forward neural networks, all connections flow forward (no loops).
* Each layer of hidden units is fully connected to the next.
* We need to pass a fixed-size vector representation of the text (e.g. the output of a CountVectorizer) as input.
* We lose the temporal aspect of text in this representation.
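
A small illustration of how word order is lost. The `bag_of_words` function below is a hand-rolled stand-in for a count-based vectorizer, not scikit-learn's API, and the vocabulary and sentences are made up.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bad", "good", "movie", "not", "the", "was"]
positive = "the movie was good not bad"
negative = "the movie was bad not good"

vec_positive = bag_of_words(positive, vocabulary)
vec_negative = bag_of_words(negative, vocabulary)
print(vec_positive == vec_negative)  # True: opposite meanings, same vector
```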



![img](https://learnopencv.com/wp-content/uploads/2017/10/mlp-diagram.jpg)

## Language modeling: Why should we care?

### A powerful idea in NLP that helps in many tasks

* Machine translation:
> P(In the age of data algorithms have the answer) > P(the age data of in algorithms answer the have)

* Spelling correction
> My office is a 10  <span style="color:red">minuet</span> bus ride from my home.  
> P(10 <span style="color:blue">minute</span> bus ride from my home) > P(10 <span style="color:red">minuet</span> bus ride from my home)

* Speech recognition
> P(<span style="color:blue">I read</span> a book) > P(<span style="color:red">Eye red</span> a book)

### Motivation: Language modeling task

In the beginning, NLP used a lot of Markov Chains. If you are really interested in learning NLP, you should study Markov Chains.

Markov model (here of order 2, i.e. a trigram model): $P(w_t|w_1,w_2,\dots,w_{t-1}) \approx P(w_t|w_{t-2}, w_{t-1})$


**Markov Models Downsides:**
* They are 'memoryless': they have no memory beyond the previous $n$ steps, and as $n$ grows, most n-grams are never seen in the training data (the sparsity problem).

* They have huge RAM requirements because you have to store all n-grams.
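
A toy sketch of a Markov language model that conditions on the two previous words, estimated by counting. The tiny corpus and the function names are made up for illustration; the second query shows the sparsity problem, since an unseen combination gets probability zero.

```python
from collections import Counter, defaultdict

def train_trigram_model(tokens):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def trigram_prob(counts, w1, w2, w3):
    """Estimate P(w3 | w1, w2) from the counts (no smoothing)."""
    context = counts.get((w1, w2))
    if not context:
        return 0.0
    return context[w3] / sum(context.values())

tokens = "i read a book i read a paper i wrote a book".split()
model = train_trigram_model(tokens)

p_seen = trigram_prob(model, "read", "a", "book")
p_unseen = trigram_prob(model, "wrote", "a", "paper")
print(p_seen)    # 0.5
print(p_unseen)  # 0.0
```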

## RNNs motivation
* RNNs can help us with this limited memory problem!
* RNNs are a kind of neural network model that uses hidden units to remember things over time.
* They condition the neural network on all previous words.

### RNN intuition: Example

- Put a number of feedforward networks together.
- Suppose I have 1 word represented by a vector of size 4 and I want to predict something about that word. I use one feedforward neural network. 
- Suppose I have 2 words. I use 2 of these networks and put them together. 

<img src="img/RNN_two_feedforward.png" height="800" width="800"> 


(Image credit: [learnopencv](https://www.learnopencv.com/understanding-feedforward-neural-networks/))    

### RNN intuition

- Put a number of feedforward networks together. 
- Make connections between the hidden layers.
- Process sequences by presenting one element at a time to the network.


<img src="img/RNN_introduction.png" height="800" width="800"> 

(Credit: [Stanford CS224d slides](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf))
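
The three steps above can be sketched as a simple (vanilla) RNN forward pass. The weight names `W_xh`, `W_hh`, `b_h`, the tanh activation, the sizes, and the random values are all assumptions chosen for illustration.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Present one element at a time; the hidden state links the time steps."""
    h = np.zeros(W_hh.shape[0])
    hidden_states = []
    for x_t in inputs:
        # The same weights are reused at every step; W_hh @ h is the
        # connection between the hidden layers of consecutive steps.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return np.array(hidden_states)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

sentence = rng.normal(size=(2, input_size))   # 2 words, each a vector of size 4
states = rnn_forward(sentence, W_xh, W_hh, b_h)
print(states.shape)  # (2, 3): one hidden state per word
```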

### What can we do with RNNs?

- Simple or Vanilla RNN

<img src="img/RNN_introduction.png" height="800" width="800"> 

- But a number of architectures are possible, which makes them a very rich family of models.  

One to one 

- The usual feedforward neural network 

One to many

- Music generation
- Text generation
- Image captioning 

Many to one

- Sentiment analysis
- Text classification 
- Video activity recognition 

Many to many (sequence to sequence or encoder-decoder models)

- Speech recognition 
- Machine translation


For NLP uses, also check out Transformers and BERT.

### RNN architectures

- A number of possible RNN architectures

<img src="img/RNN_architectures.png" height="1000" width="1000"> 

[source](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

### Mini Intro to LSTMs

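* RNNs tend to suffer from the vanishing gradient problem: as gradients are propagated back through many time steps, they shrink toward zero, so the network struggles to learn long-range dependencies.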
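
A rough numerical illustration of the vanishing gradient, using a scalar RNN $h_t = \tanh(w \, h_{t-1})$ with no inputs. The weight `w = 0.9`, the starting state, and the number of steps are assumptions chosen for the example.

```python
import numpy as np

def gradient_through_time(w, n_steps, h0=0.5):
    """Track d h_t / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    grads = []
    for _ in range(n_steps):
        h = np.tanh(w * h)
        grad *= w * (1.0 - h ** 2)   # chain rule: multiply by d h_t / d h_{t-1}
        grads.append(grad)
    return grads

grads = gradient_through_time(w=0.9, n_steps=50)
print(grads[0], grads[-1])  # the gradient after 50 steps is far smaller
```

Each step multiplies the gradient by a factor below 1, so the influence of early time steps fades roughly exponentially.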

### A robust solution to this problem is 

- **Use a more complex recurrent unit with gates**
    - Gated Recurrent Units (GRUs)    
    - **Long Short Term Memory networks (LSTMs)**

### Long Short Term Memory networks (LSTMs)

- [Invented in 1997](https://www.bioinf.jku.at/publications/older/2604.pdf) by Hochreiter and Schmidhuber. 
- Designed so that the model can remember things for a long time (hundreds of time steps)! 

### LSTM for image captioning 

<img src="img/RNN_LSTM_image_captioning.png" height="2000" width="2000"> 


(Credit: [LSTMs for image captioning](https://arxiv.org/pdf/1411.4555.pdf))

### Long Short Term Memory networks (LSTMs)

- In an LSTM, the repeating module is more complicated. 
- It selectively controls the flow of information using gates. 

- In addition to usual hidden units, LSTMs have memory cells. 
- The purpose of memory cells is to remember things for a long time.

### LSTMs: The core idea 

- The core idea in LSTMs is using a cell state (memory cell).
- Information can flow along the memory unchanged. 
- Information can be removed from or written to the cells, regulated by gates. 
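
A minimal sketch of one LSTM step, using the common formulation with forget, input, and output gates. The stacked weight layout, the function names, and all numeric values are assumptions for illustration. The example sets the biases so that the forget gate is almost fully open and the input gate almost fully closed, which shows information flowing along the cell state essentially unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * hidden, hidden + input)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:hidden])                 # forget gate: what to remove
    i = sigmoid(z[hidden:2 * hidden])       # input gate: what to write
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: what to expose
    g = np.tanh(z[3 * hidden:])             # candidate values to write
    c_t = f * c_prev + i * g                # updated cell state (memory)
    h_t = o * np.tanh(c_t)                  # updated hidden state
    return h_t, c_t

hidden_size, input_size = 3, 4
W = np.zeros((4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
b[:hidden_size] = 20.0                    # forget gate close to 1: keep memory
b[hidden_size:2 * hidden_size] = -20.0    # input gate close to 0: write nothing

c_prev = np.array([0.7, -1.2, 0.3])
h_t, c_t = lstm_step(np.ones(input_size), np.zeros(hidden_size), c_prev, W, b)
print(np.allclose(c_t, c_prev))  # True
```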