# Neural Networks and Deep Learning

## History

NNs started with the motivation to build computers that could mimic how the brain works --- to make
computers that could think like humans.

This work actually started in the 1950s, then fell out of favor for a while (first AI winter, 1974-1980).
- A period of greatly reduced interest and funding in AI research, mostly because of the perception that
AI (in general, not just NNs) had failed to live up to the grandiose promises and objectives it had laid out.

NNs gained popularity again in the 1980s and early 1990s before falling out of favor again in the late 1990s.

- Successes: handwritten digit recognition, used for processing checks and recognizing zip codes on envelopes.

- Not necessarily any "failures," but NNs were overtaken in popularity by probabilistic approaches.

Resurgence from around 2005, also "rebranded" with deep learning: nowadays these terms (NNs and deep learning) mean almost the same thing.

Since 2005, NNs/deep learning have revolutionized area after area of comp sci research.

- Speech recognition came first (2005-2009).  This field was initially dominated by hidden Markov models
(HMMs), but is now dominated by LSTMs (long short-term memory), a deep learning technique.

  - An interesting fact is that prior to this era, speech recognition programs often had to be trained on individual speaker's voices.
  
- Then came image recognition.  In 2012, a CNN (convolutional neural net) called AlexNet achieved a 
top-5 error of 15.3% in the ImageNet 2012 Challenge, more than 10.8 percentage points lower than that of the runner-up.  This was made feasible by GPUs (graphics processing units).

- Text/NLP came next.  Before about 2014/2015, most machine translation used statistical models, but now almost all of it is done with neural (network) models.

- Many other areas have been revolutionized as well (medical imaging, advertising, climate change forecasting, musical playlist generation, chatbots, etc.).
  
  

## Brain neurons

- Even though modern neural networks have almost nothing to do with how the brain actually works,
there was the early motivation to try to build software to replicate what was happening in the human
brain.  So it's still worthwhile to understand (a little bit) about how that works.

![image.png](attachment:b2064675-52f7-43d8-9c86-19d0feab9f14.png)
(from Wikipedia's neuron entry)

- Neurons can send information to other neurons through electrical impulses.
Each neuron in the brain is connected to other neurons and they sometimes modify what set 
of neurons each one is connected to.

- Neurons receive electrical signals from their dendrites, and send signals out down their axon.
The signals from the dendrites function as "inputs" and the neuron then can determine whether or not
to send an "output" signal down the axon based on the inputs.  The axon then connects to the dendrites
of another neuron.  This is greatly simplified, but this is the basis for human thought.


## Artificial neural networks

- In software, we will build a very simplified model of real-world neurons.  Our neurons will 
receive inputs (as numbers), do some computations based on those numbers, and produce another
number as output.

- This output will serve as the input to one (or usually more) other neurons.  We will often arrange
a collection of neurons in **layers**, where each neuron in the layer performs the same computations.



![image.png](attachment:5252d88f-0c5c-4e30-8cda-10852bebd7fc.png)


### Caveat

We really don't know how the brain works, and this model of
a neuron is vastly simplified from what we know (or think) is happening
in the brain.

Every few years neuroscientists learn more and more about what's actually
happening, so even though neural networks can do really powerful
things, it's not 100% clear that they are truly mimicking what is
happening in our brains.

And most computer scientists are ok with that.  People who do neural
net research have moved away from trying to replicate exactly
what is going on in the brain, especially with regard to modern
NN engineering techniques, which are all based on what works best, not
necessarily how our brain does it.

## Why now?

What enabled the modern deep learning revolution?  It was a combination
of things:

- We have tons of (digitized) data around that we never used to have.
So many things are now recorded electronically that never used to be,
and so machine learning algorithms can now harness that data to 
train models.

- Similarly, we now have incredibly fast computer processors (CPUs
and GPUs), which can train models incredibly quickly.

  - GPU: graphics processing unit.  This is a specialized circuit that was originally designed for displaying graphics on your computer (and is still used for that).  It contains specialized circuits for doing 2D and 3D computations in parallel, which are the underlying computations needed for computer graphics (everything is based on matrix calculations).  Because modern machine learning (and especially NNs/deep learning) is all based around matrix calculations, and can often be parallelized, these chips turned out to be extremely helpful for deep learning as well.

![image.png](attachment:86f866ad-7967-4ea5-b73c-ea14b8773a49.png)

## Example: Demand Prediction

- Let's use an example we haven't seen before: "demand prediction."

- Suppose we work for a store that sells clothing, and we want to predict whether a new T-shirt design will be a top seller or not.  We could set this up as a classification problem, where we're trying to predict "yes" or "no" for whether this shirt will be a top seller.  In the real world, we'd probably have lots of features of the shirt, but for the moment, assume we just have one feature, price.



![image.png](attachment:0de65043-1973-431d-be17-656df34d2fd1.png)

- If we were to set up this problem as a logistic regression problem, we would have our single feature (price) be $x_1$ and our model would be 

$$f(x) = \frac{1}{1 + e^{-w \cdot x}}$$

  where $x$ would be a vector of $x_0$ (the "fake" feature) and $x_1$ (price), and $w$ is a vector of weights.
  
- In neural network terms, we're going to add a new term, called the 
**activation**, and equate that with $f(x)$ in this case:

$$a = f(x) = \frac{1}{1 + e^{-w \cdot x}}$$

  The term comes from when we "activate" a neuron.

- It turns out that a neural network is built up of many of these
tiny logistic regression models, all encapsulated into an entity we'll
call a neuron or a "unit."

  - So each neuron takes as input a price $x_1$, computes the formula for 
  $a = f(x)$, and outputs that number (which we interpret as the probability
  of this shirt being a top seller.)
  
- Imagine each neuron as its own little computer doing this calculation.

- Modern neural networks consist of wiring these neurons/units together
in different ways.
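
The single-neuron computation above can be sketched in a few lines of NumPy. This is only an illustration: the weight values are made up (nothing was trained), and the function names `sigmoid` and `neuron_activation` are our own choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_activation(w, x):
    """Activation of one unit: a = f(x) = sigmoid(w . x)."""
    return sigmoid(np.dot(w, x))

# Hypothetical weights (not learned from data):
# w[0] multiplies the "fake" feature x_0 = 1, w[1] multiplies the price x_1.
# A negative w[1] means pricier shirts are less likely to be top sellers.
w = np.array([4.0, -0.2])

for price in [10.0, 20.0, 30.0]:
    x = np.array([1.0, price])  # x_0 = 1, x_1 = price
    a = neuron_activation(w, x)
    print(f"price = {price:5.1f}   a = {a:.3f}")
```

With these particular weights the activation drops as the price rises, which is the behavior we would expect a trained model to pick up from data like the plot above.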

### Extending the example to multiple features

- Imagine now we have four features: price, shipping cost, marketing (something indicating how much marketing has been done for this shirt), and material (something indicating the quality of the material used to make the shirt).

- Additionally, suppose we also think that whether a shirt becomes a top seller
or not relates to: (1) affordability (do people think they can afford this shirt), (2) awareness (do people know that the shirt exists), and (3) perceived
quality (do people think the shirt is a high-quality product).

- In some sense, our four features do not map directly onto the three factors
that we believe influence whether a shirt will be a top seller.  Instead:
  - Price and shipping affect affordability,
  - Marketing affects awareness, and
  - Material and price affect perceived quality.
  
  
                                                                      

- We can connect up the features into a neural network 
like this:

![image.png](attachment:befd894a-72cd-474f-9de3-ae534d32d880.png)

- We group neurons into layers.  Each layer contains
a set of neurons all performing the same mathematical
computations, often (but not always) on the same set of features,
and often (but not always) sending their output to the same following layer
of neurons.

- In the picture above, the affordability/awareness/perceived quality
collection of neurons form a layer.  The probability of being a top seller
is also a layer all by itself.

- The neural network above accepts 4 numbers as inputs (the **input layer**).
Those 4 numbers are sent to the 3 neurons (or units, or nodes) in the middle layer.
Each of those neurons does a computation and produces an **activation value**
(which is itself a number).  Those 3 numbers are then sent to the next layer,
which is called the **output layer**, which here is only one neuron.
The output layer does another computation and produces a final activation value,
which is the result of the neural network.

![image.png](attachment:3d193ae8-ba78-43f2-8e15-159cecb7bb31.png)

- In the picture above, we only connected some nodes of the input layer to some neurons of the middle layer.  In practice, we often connect every input in a layer to all the neurons of the following layer.

- The middle layer(s) of a neural net are sometimes called **hidden layers**. The reason for this is while we can observe the input layer values and correct output layer values from our training data, we cannot in general know ahead of time the "correct" values for what the numbers at the hidden layer should be.
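
Here is a rough sketch of the T-shirt network in NumPy, assuming every neuron is a sigmoid unit. All feature values and weights are invented for illustration, and each feature is assumed to be rescaled to the range 0 to 1. A weight of 0 stands in for "no connection," which reproduces the partial wiring from the first picture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input layer: [x_0 = 1, price, shipping, marketing, material]
# (hypothetical values, assumed rescaled to 0..1).
x = np.array([1.0, 0.3, 0.1, 0.8, 0.9])

# One weight vector per hidden neuron; all numbers are made up.
# A 0 weight means that input is not connected to that neuron.
hidden_weights = {
    "affordability":     np.array([ 2.0, -3.0, -2.0, 0.0, 0.0]),  # price, shipping
    "awareness":         np.array([-1.0,  0.0,  0.0, 4.0, 0.0]),  # marketing
    "perceived_quality": np.array([-1.0,  1.0,  0.0, 0.0, 3.0]),  # price, material
}

# Hidden layer: each neuron produces one activation value.
hidden = {name: sigmoid(np.dot(w, x)) for name, w in hidden_weights.items()}

# Output layer: one neuron that sees a constant 1 plus the 3 hidden activations.
w_out = np.array([-4.0, 3.0, 3.0, 3.0])
a_in = np.array([1.0, *hidden.values()])
p_top_seller = sigmoid(np.dot(w_out, a_in))

for name, a in hidden.items():
    print(f"{name:18s} {a:.3f}")
print(f"{'P(top seller)':18s} {p_top_seller:.3f}")
```

In a fully connected version, the zeros would simply be replaced by weights that the network is free to learn.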


## Key idea so far

- Think of a neural network as a collection of individual logistic regression
units.  We know that a single logistic regression unit can only learn a linear combination
of its input features (passed through the sigmoid).  In a neural network, however,
each layer learns linear combinations of the previous layer's outputs rather than of the raw inputs.  This layering
idea, combined with the non-linear activation function (the sigmoid function) applied between layers,
results in neural networks being able to learn more sophisticated functions
than logistic regression can learn.

- Furthermore, the multiple layers will allow automatic construction of more complicated features: some feature engineering is taken care of for us.

- In fact, though we came up with "interpretable" features in our middle/hidden layer, one of the main ideas of neural networks is that we don't need to figure out the features of any middle/hidden layers ahead of time.  The neural network will figure them out for us.
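
One standard illustration of the first point (it is not part of the T-shirt example) is the XOR function, which outputs 1 when exactly one of its two inputs is 1. No single logistic regression unit can represent XOR, because its two classes cannot be separated by a straight line, but two layers of sigmoid units can. The sketch below uses hand-picked weights rather than learned ones, and the names `w_or`, `w_nand`, `w_and`, and `xor_net` are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked (not learned) weights; index 0 multiplies the constant 1.
w_or   = np.array([-10.0,  20.0,  20.0])  # hidden unit 1: roughly "x1 OR x2"
w_nand = np.array([ 30.0, -20.0, -20.0])  # hidden unit 2: roughly "NOT (x1 AND x2)"
w_and  = np.array([-30.0,  20.0,  20.0])  # output unit:   roughly "h1 AND h2"

def xor_net(x1, x2):
    """Two-layer network of sigmoid units that approximates XOR."""
    x = np.array([1.0, x1, x2])
    h = np.array([1.0, sigmoid(np.dot(w_or, x)), sigmoid(np.dot(w_nand, x))])
    return sigmoid(np.dot(w_and, h))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(float(xor_net(x1, x2))))
```

The hidden units here play the role of the "automatically constructed features": neither one solves the problem alone, but the output unit can combine them linearly.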

<hr>
In general, neural networks can have any number of layers, and any number of
neurons in any of the layers.

## Example: facial recognition (do on board)

Let's say we want to solve the facial recognition problem.

![image.png](attachment:525c7f6a-f8c8-4f2e-8713-62c7ef24dc1c.png)

What is each layer of the network doing?

![image.png](attachment:db04e387-a784-455a-8a40-11c49e014edd.png)

The first hidden layer finds straight lines.

The second hidden layer finds eyes, noses, and mouths.

The third hidden layer finds larger portions of faces.

It figures out the features all by itself.

One of the cool things about this particular network is that each layer only looks at certain sections of the image, not the whole image.  The first layer looks at very small squares, and later layers look at larger and larger ones.  In this way, the neural net learns to find features that can appear anywhere in the image.

## For cars

Same thing happens for cars.

![image.png](attachment:f271718d-332f-42d2-8007-6bd949bab172.png)

## Math of each layer of a neural net

Recall what a single neuron is doing, mathematically.  It receives a collection
of inputs, let's assume they're in a vector $\boldsymbol{x}$.  We assume,
as before, that $x_0$ is always 1.

Each neuron also has a weight vector $\boldsymbol{w}$.  These two vectors
are the same length.

Each neuron computes the dot product $z =\boldsymbol{w} \cdot \boldsymbol{x}$.
(We used that $z$ notation in logistic regression as well!). 

Each neuron takes this dot product and then passes it through the sigmoid
function, sometimes called the **activation function**,  $g(z) = \dfrac{1}{1+e^{-z}}$.

So the complete computation is $a = g(z) = g(\boldsymbol{w} \cdot \boldsymbol{x})  = \dfrac{1}{1+e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}$.
This is a single number (a scalar).
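
As a quick worked instance with made-up numbers, take $\boldsymbol{w} = (-1, 2)$ and $\boldsymbol{x} = (1, 0.5)$, where the leading 1 is $x_0$:

$$z = \boldsymbol{w} \cdot \boldsymbol{x} = (-1)(1) + (2)(0.5) = 0, \qquad a = g(0) = \frac{1}{1 + e^{0}} = 0.5$$

A dot product of exactly zero puts the neuron at the midpoint of the sigmoid; positive values of $z$ push $a$ toward 1 and negative values push it toward 0.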





![image.png](attachment:a06924d1-c968-47e7-8960-f17eae23eee0.png)

![image.png](attachment:b58681cb-b982-45be-8727-256b44300196.png)

<hr>

Now imagine we have an entire layer of neurons.  We have to expand our notation a bit.

Each neuron in the layer receives the entire input vector $\boldsymbol{x}$.
But each neuron has its own set of weights, so now we have a collection of 
weight vectors, $\boldsymbol{w}_1, \boldsymbol{w}_2$, etc., one per neuron
in the layer.

Similarly, each neuron produces its own activation value, and now since
there's one per neuron, we will call them $a_1$, $a_2$, etc.  But again,
we can collect them into a vector $\boldsymbol{a}$.
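
A minimal sketch of one layer, computed one neuron at a time, might look like the following. The inputs and weights are arbitrary illustrative values, and `layer_activations` is a name chosen here, not a library function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_activations(weight_vectors, x):
    """Return the vector a, with one activation per neuron in the layer."""
    return np.array([sigmoid(np.dot(w_j, x)) for w_j in weight_vectors])

# Every neuron receives the same input vector (x_0 = 1, then two features).
x = np.array([1.0, 0.5, -1.0])

# Each neuron has its own weight vector (hypothetical values).
weight_vectors = [
    np.array([0.0,  1.0, 1.0]),  # neuron 1
    np.array([0.5, -1.0, 2.0]),  # neuron 2
]

a = layer_activations(weight_vectors, x)
print(a.shape)  # one entry per neuron
print(a)
```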




![image.png](attachment:64489bd2-607d-4855-aacb-52f9f563d3f5.png)

By convention, we call the input layer "layer 0" and each subsequent layer
gets one higher number (layer 1, layer 2, etc).

We will use a superscript number in square brackets to denote the variables
at each layer.

So the first layer that does any computation is layer 1 (layer 0 is just the input features), so the weight vectors at this layer are now
$\boldsymbol{w}_1^{[1]}, \boldsymbol{w}_2^{[1]}$, etc.  And the activation
values are combined into the vector $\boldsymbol{a}^{[1]}$.

![image.png](attachment:37bef2a4-9f98-4c34-8cb9-63432c33e170.png)

...or, equivalently:

![image.png](attachment:0af0c337-e124-481b-8466-609244018512.png)

To make this even more complicated, now the output of
layer one becomes the input to layer 2.

![image.png](attachment:00a458b6-277b-467b-8dc3-52412177dd7b.png)

The final step of a neural network is optional, 
and depends on if we want a binary prediction, or a
probability output.

If we want, we can put a threshold onto the final output unit of the neural
network.  This optional computation will change the output of a neuron
from a value between 0-1 to **exactly** 0 or 1, depending on if the value is
greater than or less than 0.5.
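
The thresholding step could be sketched as below. The text does not say what happens when the output is exactly 0.5; this sketch assumes such ties are mapped to 1, and the name `threshold` is our own.

```python
import numpy as np

def threshold(a, cutoff=0.5):
    """Turn activations between 0 and 1 into hard 0/1 predictions.

    Assumption: a value exactly equal to the cutoff is mapped to 1.
    """
    return (np.asarray(a) >= cutoff).astype(int)

print(threshold(0.73))             # a single output unit
print(threshold([0.2, 0.5, 0.9]))  # several outputs at once
```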

![image.png](attachment:cf32269e-3b99-4254-86ec-f8f05b3d4263.png)

## A more complex network 

## do on board, showing activation from layer 3 to layer 4.

![image.png](attachment:a8ea425e-758c-494e-bd7b-070d0d55e099.png)

![image.png](attachment:749501cf-425c-44b7-bf49-93836e25ea54.png)

## General formula

$$a_j^{[\ell]} = g \left( \boldsymbol{w}_j^{[\ell]} \cdot 
\boldsymbol{a}^{[\ell-1]} \right)$$

In the context of neural networks, we will often call $g$ the **activation function**.  $g$ doesn't actually have to be the sigmoid function; it can 
technically be any function, but there are a few common activation functions
that we use, the sigmoid function being one of them.

To make our notation consistent, we will also sometimes use $\boldsymbol{a}^{[0]}$
as another name for our input vector $\boldsymbol{x}$.

## Making predictions (inference)

Remember, of course, that we want to use neural networks, like any 
machine learning model, to make predictions about data.  In linear and
logistic regression, we defined our models as functions $f$.  In neural networks, it's a little bit more complicated to do this with a single function
since we have multiple layers of neurons, each computing its own values, but
we can certainly do it!

For a neural network with its output layer being called layer $L$, 
we define $f(x) = \boldsymbol{a}^{[L]}$, in other words, the output of the model
is just the output of the output layer.

Of course, we compute that vector $\boldsymbol{a}^{[L]}$ iteratively, by 
starting with the input layer $\boldsymbol{a}^{[0]}=\boldsymbol{x}$, and 
moving forward through the layers until we reach the output layer, computing
each layer of numbers along the way.  For this reason, this is called 
the **forward propagation** algorithm.
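
The loop described above might be sketched as follows. Two assumptions are made here that the notes do not spell out: the function takes the raw features without $x_0$, and a constant 1 is prepended to the activations before every layer so that each neuron keeps its $w_0$ term. The weights are arbitrary, and `forward_propagation` is a name chosen for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, layers):
    """Compute f(x) = a^[L] by moving forward through the layers.

    `layers` has one entry per layer 1..L. Each entry is a list of weight
    vectors, one per neuron; each weight vector has one more entry than the
    previous layer has neurons, because a constant 1 is prepended.
    """
    a = np.asarray(x, dtype=float)           # a^[0] = x
    for weight_vectors in layers:
        a_prev = np.concatenate(([1.0], a))  # prepend the "fake" feature
        a = np.array([sigmoid(np.dot(w_j, a_prev)) for w_j in weight_vectors])
    return a

# A tiny hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    [np.array([0.1, 0.4, -0.6]),             # layer 1, neuron 1
     np.array([-0.3, 0.8, 0.2]),             # layer 1, neuron 2
     np.array([0.0, -0.5, 0.5])],            # layer 1, neuron 3
    [np.array([0.2, 1.0, -1.0, 0.5])],       # layer 2 (output layer)
]

f_x = forward_propagation([2.0, 1.0], layers)
print(f_x)
```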



## Example with handwritten digit recognition

![image.png](attachment:c809764d-2c34-44ad-8263-33412768b4ed.png)

![image.png](attachment:8950a4ac-8331-4517-ab7b-d43f2b8799fb.png)

![image.png](attachment:1089d627-5428-4979-89ff-fdca1161c56d.png)

## Vectorized version of forward propagation

Suppose we have a vector $\boldsymbol{a}$ representing the output from
a layer of the neural network (or the input features $\boldsymbol{x}$).

If we want to compute what happens at the next layer, we know we can use
the general formula

$$a_j^{[\ell]} = g \left( \boldsymbol{w}_j^{[\ell]} \cdot 
\boldsymbol{a}^{[\ell-1]} \right)$$

However, the computation above defines each term of the $\boldsymbol{a}$ vector
separately.  In other words, this equation above is actually a bunch of equations, one per neuron $j$ in the layer.

To do all the computations for the equations at once, what we can do is
define a weight matrix $\boldsymbol{W}$ like we did for logistic regression.

$$\boldsymbol{W} = \begin{bmatrix}
\leftarrow & \boldsymbol{w}_1 & \rightarrow\\
\leftarrow & \boldsymbol{w}_2 & \rightarrow\\
 & \vdots & \\
\end{bmatrix}$$

Note that this matrix has the same number of rows as the number of neurons 
in the layer being computed, and the same number of columns as the number of neurons in
the **previous** layer.  

We can then calculate $\boldsymbol{a}$ all at once by:

$$\boldsymbol{z}^{[\ell]} = \boldsymbol{W}^{[\ell]}\boldsymbol{a}^{[\ell-1]}$$

$$\boldsymbol{a}^{[\ell]} = g(\boldsymbol{z}^{[\ell]})$$

where applying $g$ to a vector means that we evaluate $g$ on each entry of the vector separately.

Remember, to multiply $\boldsymbol{W}^{[\ell]}$ by $\boldsymbol{a}^{[\ell-1]}$,
the number of columns of $\boldsymbol{W}^{[\ell]}$ must match the number of 
rows of $\boldsymbol{a}^{[\ell-1]}$.  This should make sense because both of
those quantities are **the number of neurons in the previous layer** of the network.  

Furthermore, the resulting vector from that multiplication will have the number of rows of $\boldsymbol{W}^{[\ell]}$ and the number of columns of 
$\boldsymbol{a}^{[\ell-1]}$.  These two quantities are, respectively, the
**number of neurons in the current layer**, and simply **1**, because each 
vector $\boldsymbol{a}$ is a column vector.  Therefore, $\boldsymbol{a}^{[\ell]}$ is a column vector with one entry
per neuron in layer $\ell$.
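
To check the shape bookkeeping, here is a small NumPy sketch of one vectorized layer, compared against the one-neuron-at-a-time version. The layer sizes and random weights are arbitrary, and `a_prev` is assumed to already contain whatever entries the layer needs (including any constant 1).

```python
import numpy as np

def sigmoid(z):
    # np.exp works elementwise, so this applies g to every entry of z.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_prev, n_curr = 4, 3                    # neurons in layers l-1 and l

W = rng.normal(size=(n_curr, n_prev))    # one row per neuron in layer l
a_prev = rng.normal(size=(n_prev, 1))    # column vector a^[l-1]

# Vectorized: all neurons of layer l at once.
z = W @ a_prev
a = sigmoid(z)

# Same computation, one neuron (one row of W) at a time.
a_loop = np.array([[sigmoid(np.dot(W[j], a_prev[:, 0]))] for j in range(n_curr)])

print(W.shape, a_prev.shape, a.shape)    # (3, 4) (4, 1) (3, 1)
print(np.allclose(a, a_loop))            # True
```

Beyond being shorter, the matrix form hands the whole layer to optimized linear algebra routines, which is the kind of computation GPUs are built to do in parallel.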