# Learning to learn

Woah man... That's deep...

Although, at the present moment, the result of training a neural network is something of a black box, its operation does not have to be! Knowing what's going on inside of a neural network, from end-to-end, will allow you to go beyond simple hacking to solve new problems with novel architectures of your own design. Thus, before we start mucking around with code, let's take a gander at what that code aims to do!

## It's all in your head...

A neural network, as the name suggests, can be usefully imagined as a collection of neurons, each of which takes one or more inputs and produces one or more outputs. When all of a neuron's connections are from/to other neurons, we refer to that neuron as *hidden*.

It also helps to group neurons that participate in computing the same function into *layers*. In a feed-forward neural network, like the one shown in Figure 1, the outputs of one layer are the inputs to the next.

<img src="graphics/ff_basic.svg" style="width: 33%">
<div class="figcaption">Figure 1: A simple, feed-forward network with one hidden layer</div>

In general, a layer can have arbitrary connections, but you don't have to worry about that now. As a bit of a teaser, later on in the workshop, we'll circle back around and build a *recurrent* neural network in which layers have self-connections through time.


## ...except when it's on the [CG]PU

While neurons and layers are certainly convenient metaphors, we have to reformulate these abstract concepts in a way that a computer can understand! And what better format to use than numbers and simple operations on them?

The operations performed by neural networks are most elegantly described using vector calculus, which you, too, will hopefully understand. If you'd like another perspective, though, the [Hacker's guide to Neural Networks](https://karpathy.github.io/neuralnets/) offers a more code-oriented approach (but don't worry, we'll get to teh c0dez soon enough!).

## Perceptron? I loved that movie!

Let's start with a simple example. Without further eXORdium, please allow me to introduce... [*Perceptron*](https://en.wikipedia.org/wiki/Perceptron)!

<img src="graphics/perceptron.svg" style="width: 40%">
<div class="figcaption">Figure 2: A single-layer perceptron.</div>

Don't let the notation confuse you!

* $\vec{x}$ is a *feature vector*, or a numerical representation of the input['s features]
* $\vec{w}$ is a vector of weights that controls how much we pay attention to each feature
* $z$ is just a sum that forms the input to the next/output neuron,
* $f(z)$ is some function of the output neuron's input, and
* $y$ is the output

Perceptron is doing "nothing more" than computing $f(w^Tx) = f(w_2x_2 + w_1x_1 + w_0)$. Note that we set the third component of the vector $x$ to 1 (this is equivalent to the bias term).

Okay, now let's take a step back and look at our Perceptron from a different perspective. Recall the equation for a line in 2D: $ax_1 + bx_2 + c = 0$. Huh, that looks like what Perceptron is doing! That's because the inner product, $w^Tx$, is proportional to the signed distance from a line (and equals it when $(w_1, w_2)$ has unit length)! The line, itself, is where the inner product is zero (i.e. when $w$ and $x$ are perpendicular).
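To make the geometry concrete, here's a small sketch in Torch. The weights and points are made up for illustration. It checks that a point on the line gives a zero inner product, and that dividing by the length of $(w_1, w_2)$ turns the inner product into a signed distance.

```lua
require 'torch'

-- Hypothetical weights for the line x1 + x2 - 2 = 0, stored as (w1, w2, w0).
local w = torch.Tensor({1, 1, -2})

-- Points are (x1, x2, 1); the trailing 1 multiplies the bias w0.
local onLine = torch.Tensor({1, 1, 1})
local above  = torch.Tensor({3, 3, 1})

-- Length of the normal vector (w1, w2), ignoring the bias.
local norm = w:narrow(1, 1, 2):norm()

print(torch.dot(w, onLine))        -- zero: the point lies on the line
print(torch.dot(w, above) / norm)  -- signed distance of (3, 3) from the line
```

The sign of the second value tells you which side of the line the point is on, which is exactly the information a classifier needs.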

So far so good. Now, if we let $f(x) = sign(x)$ or
$$f(x) = \begin{cases}
+1 & \mbox{ if } x > 0 \\
-1 & \mbox{ if } x \le 0
\end{cases}$$

then everything on the side of the line in the direction of $w$ gets the value $+1$ and everything on the other side gets $-1$. In other words, we have a classifier!

<img src="graphics/classifier.png" style="width: 200px">
<div class="figcaption">Figure 3: A simple classifier in 2D.</div>


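Here's what that might look like in Torch:

```lua
require 'torch'

w = torch.randn(3)           -- random weights
x = torch.range(3, 1, -1)    -- feature vector (3, 2, 1); the trailing 1 pairs with the bias
z = torch.dot(w, x)          -- z = w^T x
f = z > 0 and 1 or -1        -- sign(z), with sign(0) = -1
print(f)
```

```
1
```

(Your output may differ, since the weights are random.)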


### Product reviews get me worked up!

To make things concrete, suppose we're interested in determining the sentiment of product reviews (major snoozeville, I know, but bear with me). Our reviews contain two features, $x_1$ and $x_2$, which will represent, say, the number of stars and the number of speling erors. Additionally, a subset of them are labeled as "positive," "neutral," and "negative." Our task is to determine which reviews are polarized--either positive or negative--and assign those a $+1$ while giving the neutral reviews $-1$. We will accomplish this by updating Perceptron's weights so that it performs well on the training set; the hope is for the learned classifier to *generalize* and perform well on the test set, too!

Here are our training examples:

<img src="graphics/reviews.png" style="width: 250px">
<div class="figcaption">Figure 4: Labeled product reviews.</div>

Intuitively, we can see that the most discriminative feature is $x_1$, the number of stars. I guess we should see what Perceptron thinks:

<img src="graphics/reviews_badcls.png" style="width: 250px">
<div class="figcaption">Figure 5: A first attempt at classification.</div>

Well, that's no good! Our simple linear classifier gets most of the polarized reviews but misses a chunk of the positive reviews and the lower left cluster of negative reviews. Can we fix this? Yes, but we need to go deeper!

Let's see what happens if we add more layers and some non-linearities:

<img src="graphics/perceptron2.svg" style="width: 60%">
<div class="figcaption">Figure 6: A multi-layer Perceptron (MLP).</div>

Wait, wait, no don't go! This new notation really isn't that bad! In the same manner as before,

* $\vec{x}$ is still the input feature vector
* $W^{(1)}$ is a matrix containing the stacked weight [row] vectors for each input neuron,
* the $z$s are still sums,
* the $f^{(2)}$ is an *activation function* applied to each $z^{(1)}$--we'll get to this in a bit,
* $f^{(3)}(x) = sign(x)$,
* and the rest of the network is basically the same

With more depth comes more representational power. Well, kind of. It depends on our choice of $f^{(2)}$. If we used the identity (i.e. $f^{(2)}(x) = x$), then we'd just end up with a sum of three lines--another line. The trick is that we need a *non-linear* activation function. There are [plenty of choices](https://github.com/torch/nn/blob/master/doc/transfer.md) for activation function but, for now, we'll use the simple and ubiquitous [Rectified Linear Unit](https://github.com/torch/nn/blob/master/doc/transfer.md#relu), or ReLU for short. It's just $$f^{(2)}(x) = \max(0,\, x)$$

<img src="https://raw.githubusercontent.com/torch/nn/master/doc/image/relu.png" style="width: 350px">
<div class="figcaption">Figure 7: The ReLU activation function.</div>
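If you'd like to see the "another line" claim in numbers, here's a rough sketch with made-up weights (two hidden units, no bias, purely for illustration). With the identity activation, two layers give the same answer as one combined linear map; with ReLU, they don't.

```lua
require 'torch'

-- Hypothetical weights: a 2x2 first layer and a 2-vector second layer.
local W1 = torch.Tensor({{1, -1}, {2, 0.5}})
local w2 = torch.Tensor({1, -3})
local x  = torch.Tensor({0.5, -2})

-- Identity activation: two layers collapse into a single linear map.
local z1 = torch.mv(W1:t(), x)                    -- z1 = W1^T x
local twoLayers = torch.dot(w2, z1)               -- w2^T (W1^T x)
local oneLayer  = torch.dot(torch.mv(W1, w2), x)  -- (W1 w2)^T x
print(twoLayers, oneLayer)                        -- same value

-- ReLU activation: negative entries of z1 are clamped to zero first,
-- so the network is no longer a single linear map of x.
local relu = torch.cmax(z1, 0)
print(torch.dot(w2, relu))
```

For this particular input both entries of `z1` are negative, so the ReLU version outputs zero while the linear version does not.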

I guess there's nothing else to do now but train another Perceptron! Does it work?  
<span style="color: gray">no. machine learning isn't possible. you can go home now.</span>

<img src="graphics/reviews_cls.png" style="width: 250px">
<div class="figcaption">Figure 8: Correctly classified reviews.</div>

The ReLU makes everything on the "negative" side of each line zero and $sign(0 + 0 + 0) \leq 0$, so it works! Take that, Amazon!  
<span style="color: gray">yay! my life isn't a lie!</span>
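Here's a toy version of that idea in Torch. The weights are hypothetical (they are not the ones behind Figure 8) and, to keep things small, this sketch uses only two hidden units that look at the number of stars: one fires for high star counts, the other for low ones.

```lua
require 'torch'

-- Hypothetical weights (not the ones behind Figure 8). Inputs are
-- (stars, spelling errors, 1); each column of W1 defines one line.
local W1 = torch.Tensor({
   { 1.0, -1.0},   -- weight on stars
   { 0.0,  0.0},   -- weight on spelling errors (ignored in this toy)
   {-3.5,  2.5},   -- bias
})
local w2 = torch.Tensor({1, 1})

local function classify(stars, errors)
   local x  = torch.Tensor({stars, errors, 1})
   local z1 = torch.mv(W1:t(), x)      -- z1 = W1^T x
   local f2 = torch.cmax(z1, 0)        -- ReLU
   local z2 = torch.dot(w2, f2)
   return z2 > 0 and 1 or -1           -- sign, with sign(0) = -1
end

print(classify(5, 0))   -- very positive review: polarized
print(classify(1, 7))   -- very negative review: polarized
print(classify(3, 2))   -- neutral review
```

A three-star review lands on the "negative" side of both lines, so both hidden units output zero and the final sign is $-1$.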

## Training

So far, we've been relying on the little-studied, black-box training algorithm known as "Nick draws some lines that look right." Unfortunately, this algorithm breaks down fairly quickly as the size of the data increases. We need something more robust!

The good news is that we can optimize even very deep networks with this one weird trick: backpropagation! Ignoring that for now, let's try our hands at a simple optimization problem:

Say our loss function is $\mathcal{L}(x, y) = \frac{1}{2}(y - f(x))^2$. Essentially, we want the output of our neural network, $f(x)$, to be as close as possible, in terms of Euclidean (or $\ell_2$) distance, to our desired output $y$. This is to say that we want to minimize $\mathcal{L}(x, y)$. From calculus, we know that, by taking the derivative, we can find the direction in which a variable can be nudged to increase the function the most. In this case, the only thing we can change is $f(x)$:

$$\frac{\partial\mathcal{L}}{\partial f(x)} = -(y - f(x)) = f(x) - y$$

Okay, we now know how to adjust $f(x)$ to make it look more like our desired output. Unfortunately, we can't just go and directly change $f(x)$ because it's actually some complicated, multi-layer neural network. We *can*, however, adjust the *parameters* (i.e. weights) of the network to change the function it computes; all that's needed is $\partial\mathcal{L}/\partial W^{(i)}$ for each layer $i \in [1..n]$ in the network!

We can go from where we are now, $\partial\mathcal{L}/\partial f(x)$, to $\partial\mathcal{L}/\partial W^{(i)}$ by using the [chain rule](https://en.wikipedia.org/wiki/Chain_rule):

$$\frac{\partial\mathcal{L}}{\partial W^{(n)}} = \frac{\partial\mathcal{L}}{\partial f(x)}\frac{\partial f(x)}{\partial W^{(n)}}$$

and so on. Thus, the main idea of backpropagation is to take the error signal we get at the output of the network and incrementally work it backwards until it reaches all of the weights. Finally, once you know how to move your weights to maximally increase your loss, you take a small step in the *opposite* direction to maximally decrease it! Rinse and repeat until you start [overfitting](https://en.wikipedia.org/wiki/Overfitting) your training data.
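As a minimal sketch of "take a small step in the opposite direction," here's gradient descent on the squared loss for the simplest network imaginable, $f(x) = wx$. The data point, starting weight, and learning rate are all made up for illustration.

```lua
-- Toy problem (values are made up): fit f(x) = w * x to one example.
local x, y = 2.0, 6.0
local w = 0.0                  -- initial weight
local learningRate = 0.1       -- size of each step

for step = 1, 50 do
   local f = w * x
   local dLdf = f - y          -- derivative of 1/2 (y - f)^2 w.r.t. f
   local dLdw = dLdf * x       -- chain rule: df/dw = x
   w = w - learningRate * dLdw -- step against the gradient
end

print(w)                       -- approaches y / x = 3
```

The learning rate matters: too large and the steps overshoot, too small and you'll be waiting a while.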

This is still very abstract, though. Let's analyze our MLP from earlier.

<img src="graphics/perceptron2.svg" style="width: 60%">
<div class="figcaption">Figure 6, duplicated for convenience.</div>

Recall that we want $\partial\mathcal{L}/\partial W^{(1)}$ and $\partial\mathcal{L}/\partial W^{(2)}$. It will help to write out the definition of each variable:

$$
\begin{align*}
z^{(1)} = &\ (W^{(1)})^T x \\
f^{(2)} = &\ \mbox{ReLU}(z^{(1)}) \\
z^{(2)} = &\ (w^{(2)})^T f^{(2)} \\
o = &\ 1/(1 + \exp(-z^{(2)})) = \sigma(z^{(2)})
\end{align*}
$$

Note that we had to change $f^{(3)}$ to the [logistic function](https://en.wikipedia.org/wiki/Logistic_function) since $sign$ is not differentiable at zero and its derivative is zero everywhere else, which would leave backpropagation with no error signal to pass along! This new function can be used to output the "probability" that a particular input is of our target class, which will be represented by $y \in \{0, 1\}$.

We'll also use [binary cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression) as our loss function since it pairs well with logistic regression:
$$\mathcal{L}(o, y) = -[\,y\log o + (1 - y)\log(1 - o)\,]$$
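Written out in Torch, the forward pass and the loss might look something like this. The sizes and values are hypothetical (three inputs, with the last one fixed to 1 for the bias, and two hidden units).

```lua
require 'torch'

-- Hypothetical sizes and values: 3 inputs (the last is the bias 1), 2 hidden units.
local x  = torch.Tensor({4, 1, 1})
local W1 = torch.Tensor({{0.5, -0.5}, {0.1, 0.2}, {-1.0, 1.0}})
local w2 = torch.Tensor({1.0, -2.0})
local y  = 1                                  -- target class

local z1 = torch.mv(W1:t(), x)                -- z1 = W1^T x
local f2 = torch.cmax(z1, 0)                  -- f2 = ReLU(z1)
local z2 = torch.dot(w2, f2)                  -- z2 = w2^T f2
local o  = 1 / (1 + math.exp(-z2))            -- o = sigma(z2)

-- Binary cross-entropy between the prediction o and the label y.
local loss = -(y * math.log(o) + (1 - y) * math.log(1 - o))
print(o, loss)
```

Since $y = 1$ here, only the $\log o$ term contributes: the closer $o$ gets to 1, the smaller the loss.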

Now for some derivatives! (Yes, these are a lot of symbols, but don't panic; they're just simple derivatives)

$$
\begin{align*}
\frac{\partial L}{\partial o} &= (1 - y)/(1 - o) - y/o \\[1ex]
\frac{\partial o}{\partial z^{(2)}} &= \sigma(z^{(2)})(1 - \sigma(z^{(2)})) \\[1ex]
\frac{\partial z^{(2)}}{\partial f^{(2)}} &= w^{(2)}  \\[1ex]
\frac{\partial z^{(2)}}{\partial w^{(2)}} &= f^{(2)} \\[1ex]
\frac{\partial f^{(2)}}{\partial z^{(1)}} &=  [\![z^{(1)} > 0]\!] \\[1ex]
\frac{\partial z^{(1)}}{\partial W^{(1)}} &= x
\end{align*}
$$

Phew, that was a mouthful. Anyway, we can string those partials together to get our derivatives w.r.t. the weights:

$$
\frac{\partial L}{\partial w^{(2)}} = \frac{\partial L}{\partial o}\frac{\partial o}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial w^{(2)}} = [\,(1 - y)\sigma(z^{(2)}) -y(1 - \sigma(z^{(2)}))\,]\,f^{(2)} = (\sigma(z^{(2)}) - y)\,f^{(2)}
$$

($o$ was replaced with its definition, $\sigma(z^{(2)})$, for clarity). Similarly,

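$$
\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial o}\frac{\partial o}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial f^{(2)}}\frac{\partial f^{(2)}}{\partial z^{(1)}}\frac{\partial z^{(1)}}{\partial W^{(1)}} = [\,(1 - y)\sigma(z^{(2)}) -y(1 - \sigma(z^{(2)}))\,]\, \big(w^{(2)} \odot [\![z^{(1)} > 0]\!]\big)\, x
$$

Here $\odot$ is the element-wise product. Strictly speaking, the result is the outer product of $x$ with the hidden-layer term in parentheses (scaled by the bracketed scalar), so that it has the same shape as $W^{(1)}$.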
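If you don't trust the algebra (and you shouldn't, at first), you can check it numerically. The sketch below uses made-up values, multiplies the partials together by hand, and compares one entry of the result against a finite-difference estimate. The helper `forward` and the step size `eps` are just illustrative choices.

```lua
require 'torch'

-- Hypothetical values, chosen only for illustration.
local x  = torch.Tensor({4, 1, 1})
local W1 = torch.Tensor({{0.5, -0.5}, {0.1, 0.2}, {-1.0, 1.0}})
local w2 = torch.Tensor({1.0, -2.0})
local y  = 1

-- Forward pass; returns the loss and the intermediate values.
local function forward(W1, w2)
   local z1 = torch.mv(W1:t(), x)
   local f2 = torch.cmax(z1, 0)
   local o  = 1 / (1 + math.exp(-torch.dot(w2, f2)))
   local loss = -(y * math.log(o) + (1 - y) * math.log(1 - o))
   return loss, z1, f2, o
end

local loss, z1, f2, o = forward(W1, w2)

-- Backward pass: the first two partials multiply out to (o - y).
local dz2 = o - y
local gradw2 = f2 * dz2                               -- dL/dw2
local dz1 = torch.cmul(w2, torch.gt(z1, 0):double()) * dz2
local gradW1 = torch.ger(x, dz1)                      -- dL/dW1, same shape as W1

-- Finite-difference check on one entry of W1.
local eps = 1e-6
local W1plus = W1:clone()
W1plus[1][1] = W1plus[1][1] + eps
local numeric = (forward(W1plus, w2) - loss) / eps
print(gradW1[1][1], numeric)                          -- should nearly agree
```

Checking hand-derived gradients against finite differences like this is a cheap habit that catches a lot of dropped factors and sign errors.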

Hopefully you've spotted some patterns:

* one doesn't really *care* how the derivative (gradient) w.r.t. the output came to be; all one needs to do is differentiate the current layer of computation to get the gradient w.r.t. the input
* backprop is a bit like a train: the error signal (passengers) rides backwards through the network, visiting each layer (station), and terminates (gets off at) the weights.

If you're confused at this point, don't worry. Backpropagation is not something people generally get the first time they see it. The takeaway is that, as long as your loss and neural network are differentiable, you can backpropagate the error directly to the parameters and adjust them so that you make fewer errors in the future!

Additionally, the modules in the neural network package do the backward calculation for you, so, for the most part, you can ignore what's going on under the hood. Of course, fully understanding the training process is what will differentiate you (ha) from the garden variety neural network hackers!
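For a taste of what that looks like, here's a rough sketch of the same kind of MLP built from `nn` modules and trained on a single made-up example. The layer sizes, learning rate, and number of steps are arbitrary assumptions, and note that `nn.Linear` keeps its own bias, so the input doesn't need a trailing 1.

```lua
require 'torch'
require 'nn'

torch.manualSeed(1)

-- An MLP with hypothetical sizes: 2 features, 3 hidden units, 1 output.
local model = nn.Sequential()
model:add(nn.Linear(2, 3))   -- z1 (nn.Linear adds the bias itself)
model:add(nn.ReLU())         -- f2
model:add(nn.Linear(3, 1))   -- z2
model:add(nn.Sigmoid())      -- o
local criterion = nn.BCECriterion()

local x = torch.Tensor({4, 1})   -- made-up review: 4 stars, 1 spelling error
local y = torch.Tensor({1})      -- target class

local before = model:forward(x)[1]
for step = 1, 100 do
   local o = model:forward(x)
   criterion:forward(o, y)
   model:zeroGradParameters()                    -- clear old gradients
   model:backward(x, criterion:backward(o, y))   -- backprop
   model:updateParameters(0.1)                   -- gradient descent step
end
local after = model:forward(x)[1]

print(before, after)   -- the output should have moved toward the target
```

Every `backward` call here is doing the chain-rule bookkeeping from the previous section, one module at a time.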

### Efficient Training (Batching)

In all of the previous examples, the input to the network was a single vector $x$. Since matrix-vector operations are really quite similar to matrix-matrix operations, we can stack multiple $x$s as rows and feed them in as a batch for massive <strike>damage</strike> throughput.
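Here's a small sketch of what that means for the first layer, using made-up weights and inputs: two matrix-vector products, one per example, give the same numbers as a single matrix-matrix product over the stacked batch.

```lua
require 'torch'

-- Hypothetical weights and inputs: 3 inputs, 2 hidden units.
local W1 = torch.Tensor({{0.5, -0.5}, {0.1, 0.2}, {-1.0, 1.0}})
local x1 = torch.Tensor({4, 1, 1})
local x2 = torch.Tensor({1, 6, 1})

-- One example at a time: z1 = W1^T x (matrix-vector product).
local single1 = torch.mv(W1:t(), x1)
local single2 = torch.mv(W1:t(), x2)

-- Batched: stack the examples as rows of X, then Z1 = X W1
-- (matrix-matrix product). Row i of Z1 is the z1 for example i.
local X = torch.Tensor({{4, 1, 1}, {1, 6, 1}})
local Z1 = torch.mm(X, W1)

print(single1, single2)
print(Z1)
```

The element-wise pieces (ReLU, the logistic function) apply to a whole matrix just as happily as to a vector, so the rest of the forward pass batches the same way.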

## Validation and Testing (and How Not to Overfit Your Training Data)

Okay, I know that you're tired of the theory and want to get to coding up a neural network, so I'm going to keep this part short. The gist is that, at the end of the day, you'll be running your network on data it's never seen before. While it's easy to get a model that performs very well during training (e.g., by memorizing the training data), for it to be useful, it has to *generalize* to unseen data; this is where the learning occurs.

<img src="graphics/overfitting.png" style="width: 300px">
<div class="figcaption">Figure 9: Train/Test Error vs Model Complexity</div>

To get an idea of how your model is generalizing as it trains, you'll want to periodically check its performance on some unseen data that resembles the training data: the *validation set*. Later, once you've twiddled the model's hyper-parameters (e.g., number of layers, learning rate, batch size) to maximize your performance on the validation set, you run your model once on some more unseen data: the *test* set. The performance that you report is what you got on the test set.

Often, the training, validation, and test sets are created by appropriately partitioning some larger set of data. For reproducibility, it will help to keep track of which data go in which partition.
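One way to do the partitioning, sketched below, is to shuffle the example indices once with a fixed seed and slice the result. The data here are random stand-ins, and the 80/10/10 split is just an assumption; use whatever proportions suit your data.

```lua
require 'torch'

torch.manualSeed(42)                 -- fixed seed so the split is reproducible

local n = 100                        -- hypothetical number of examples
local data = torch.randn(n, 2)       -- stand-in for real features

-- Shuffle the example indices once, then cut them into three disjoint parts.
local perm = torch.randperm(n):long()
local trainIdx = perm:narrow(1, 1, 80)    -- 80% training
local validIdx = perm:narrow(1, 81, 10)   -- 10% validation
local testIdx  = perm:narrow(1, 91, 10)   -- 10% test

local trainSet = data:index(1, trainIdx)
local validSet = data:index(1, validIdx)
local testSet  = data:index(1, testIdx)

print(trainSet:size(1), validSet:size(1), testSet:size(1))
```

Because the three index sets are slices of one permutation, no example can end up in two partitions. Saving `perm` (for instance with `torch.save`) is one way to keep track of which data went where.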

Finally, always keep in mind that it's super, super important that you never, *never ever* mix your training, validation, and test data; doing so would make it impossible to fairly evaluate your model!

# 🎉 Congratulations, you made it! 🎉

Hurrah! We finally have enough background knowledge to usefully begin building and training actual neural nets!