This notebook was inspired by neural network & machine learning labs led by [GMUM](https://gmum.net/).

For additional sources regarding today's material, I recommend reading the beginning of [Chapter 6](https://www.deeplearningbook.org/contents/mlp.html) of the Deep Learning book, especially up to and including 6.1 (*Example: Learning XOR*). Much of the following section follows that chapter.

# Neural networks

![venn diagram](figures/fig2.png)
<center>Source: <a href="https://www.deeplearningbook.org/contents/intro.html">Chapter 1</a> of the Deep Learning book.</center>

Neural networks are a type of universal function approximator loosely based on biological systems.

The goal of a neural network is to approximate some function $f^*$. For example, in the case of a classifier, $f^*(\mathbf{x})=y$ maps an input $\mathbf{x}$ to some category $y$. A neural network defines a mapping $f(\mathbf{x};\theta)=y$ and learns the values of the parameters $\theta$ that give the best approximation of $f^*$.

Neural networks are *hierarchical*, in the sense that they are typically represented by composing together many different functions. For example, $f(\mathbf{x})=f^{(3)}(f^{(2)}(f^{(1)}(\mathbf{x})))$ is a composition of three functions. In this case, $f^{(1)}$ is called the *first layer* of the network, $f^{(2)}$ the *second layer*, and so on. The length of this chain of functions is called the *depth* of the model (this is where the term *deep learning* comes from). The final layer of the network is called the *output layer*.
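
As a minimal sketch of this idea, the snippet below builds a "network" purely by composing functions. The three scalar functions `f1`, `f2`, `f3` and the helper `compose` are hypothetical and chosen only for illustration; real layers operate on vectors and have learnable parameters.

```python
from functools import reduce


def compose(*layers):
    """Return a function that applies the given layers from first to last."""
    def network(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return network


# Three hypothetical scalar "layers", chosen only for illustration.
def f1(x):
    return 2 * x      # first layer


def f2(x):
    return x + 1      # second layer


def f3(x):
    return x ** 2     # output layer


layers = (f1, f2, f3)
f = compose(*layers)
depth = len(layers)   # the length of the chain of functions

print(f(3))    # f3(f2(f1(3))) = (2 * 3 + 1) ** 2 = 49
print(depth)   # 3
```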

During training, we want $f$ to match $f^*$. The training data directly specify what the output layer must produce: each example $\mathbf{x}$ comes with a label $y \approx f^*(\mathbf{x})$. That is not the case for the other layers: the learning algorithm must decide how to use them to produce the desired output, because the training data do not say what each individual layer should do. For this reason, these layers are called *hidden layers*.

Each hidden layer of the network is typically vector-valued. The dimensionality of these hidden layers determines the *width* of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron.
<center> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/1200px-Colored_neural_network.svg.png" width=300 /> 
A neural network with one hidden layer comprised of four neurons. Source: <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">Wikipedia</a>. </center>

Instead of thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many *units* that act in parallel, each representing a vector-to-scalar function. The choice of the functions used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value.

In modern deep learning, the standard layer can be written as $g(\Theta^T\mathbf{x}+\mathbf{b})$, where $\mathbf{x}$ is the output of the previous layer, $\Theta$ is the parameter (or weight) matrix for the given layer, $\mathbf{b}$ is the bias vector, and $g$ is an activation function (e.g. the sigmoid). In this notation, the $j$-th neuron of that layer is $g(\theta_j^T\mathbf{x}+b_j)$, where $\theta_j$ is the $j$-th column of $\Theta$ (or the $j$-th row of $\Theta^T$) and $b_j$ is the $j$-th entry of the bias vector $\mathbf{b}$. We can write the whole neural network as $$f(\mathbf{x})=g_D\left(\Theta_D^T\,g_{D-1}\left(\Theta_{D-1}^T\left(\ldots g_1\left(\Theta_1^T\mathbf{x}+\mathbf{b}_1\right)\ldots\right)+\mathbf{b}_{D-1}\right)+\mathbf{b}_D\right),$$ where $g_i$, $\Theta_i$, and $\mathbf{b}_i$ are the activation function, parameter matrix, and bias vector for the $i$-th layer, respectively, and $D$ is the depth of the network.
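
The layer formula above translates almost directly into NumPy. The following is only a sketch under stated assumptions: the layer sizes in `sizes`, the random weights, and the helper names `sigmoid` and `forward` are made up for illustration, and nothing is trained here. Each $\Theta_i$ is stored with shape `(inputs, outputs)`, so that $\Theta_i^T\mathbf{x}$ has one entry per neuron.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, params, activations):
    """Apply g_i(Theta_i^T h + b_i) layer by layer, starting from h = x."""
    h = x
    for (theta, b), g in zip(params, activations):
        h = g(theta.T @ h + b)
    return h


rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # hypothetical: 3 inputs, hidden width 4, 2 outputs

# One (Theta, b) pair per layer; Theta has shape (n_in, n_out).
params = [
    (rng.normal(size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
activations = [sigmoid, sigmoid]

x = np.array([1.0, -2.0, 0.5])
y = forward(x, params, activations)
print(y.shape)  # (2,)

# The j-th neuron of the first layer uses the j-th column of Theta_1
# and the j-th entry of b_1.
theta1, b1 = params[0]
j = 2
neuron_j = sigmoid(theta1[:, j] @ x + b1[j])
layer1 = sigmoid(theta1.T @ x + b1)
print(np.isclose(neuron_j, layer1[j]))  # True
```

Here the depth $D$ is `len(params)` and the width of the hidden layer is `sizes[1]`.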

To better understand neural networks, let's look at the linear model from last week (multinomial logistic regression): $$f(\mathbf{x};\theta)=\mathtt{softmax}(\Theta^T\mathbf{x}+\mathbf{b}).$$

We can think of it as a simple neural network with a single layer (the output layer) and no hidden layers, with the $\mathtt{softmax}$ as the activation function.
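
To make this concrete, here is a minimal sketch of that model as one affine map followed by a softmax. The dimensions `n_features` and `n_classes`, the random `Theta`, and the input `x` are assumptions for illustration only; the parameters are not fitted to any data.

```python
import numpy as np


def softmax(z):
    """Softmax of a 1-D vector, shifted by its maximum for numerical stability."""
    shifted = z - np.max(z)
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
n_features, n_classes = 4, 3  # hypothetical sizes

Theta = rng.normal(size=(n_features, n_classes))  # weight matrix
b = np.zeros(n_classes)                           # bias vector

x = rng.normal(size=n_features)
logits = Theta.T @ x + b
probs = softmax(logits)                  # f(x; theta)
predicted_class = int(np.argmax(probs))  # most probable category

print(probs.sum())  # sums to 1 up to floating-point error
```
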
<img src="https://i.kym-cdn.com/photos/images/newsfeed/000/531/557/a88.jpg" /> 
<center> Why not stop at single-layer neural networks? </center>

Linear models are appealing because they can be fit efficiently and reliably. Their obvious defect is that their capacity is limited to linear functions, so they cannot capture interactions between input variables.

Another way to look at this is that, to tackle nonlinear problems, we need to transform the data into a space where the problem becomes linear. This can be done in several ways, for example with kernel methods (e.g. support vector machines) or by specifying the transformation manually. The manual approach was dominant until the advent of deep learning: different tasks used different methods, practitioners specialized in domains such as speech recognition or computer vision, and little transferred between domains.
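
The XOR problem from the Deep Learning book illustrates both points. The sketch below treats XOR as a regression problem with squared error (as the book does) and uses a hypothetical helper, `fit_linear`, based on `np.linalg.lstsq`. The hand-picked extra feature $x_1 x_2$ is an assumption made for this example, not something prescribed by the text above.

```python
import numpy as np

# The four XOR points and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)


def fit_linear(features, targets):
    """Least-squares fit of targets ~ features @ w + b; returns the predictions."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return design @ coef


# A linear model on the raw inputs predicts 0.5 for every point.
raw_pred = fit_linear(X, y)
print(np.round(raw_pred, 3))

# Adding the manually specified feature x1 * x2 makes the problem linear:
# y = x1 + x2 - 2 * x1 * x2 fits all four points exactly.
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])
aug_pred = fit_linear(X_aug, y)
print(np.round(aug_pred, 3))
```

The model is the same in both cases; only the representation of the input changes.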

![representation](figures/fig1.png)
<center>Source: <a href="https://www.deeplearningbook.org/contents/intro.html">Chapter 1</a> of the Deep Learning book.</center>

In the case of deep learning, we simply learn the new representation!
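
As a small illustration of what such a representation can look like, the sketch below evaluates the one-hidden-layer ReLU network for XOR described in section 6.1 of the Deep Learning book. Note the assumption: the weights `W`, `c`, `w`, and `b` are written down by hand here, following the book's solution, rather than learned by gradient descent.

```python
import numpy as np

# The four XOR points and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Hand-specified parameters (not learned in this sketch).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])    # hidden-layer weights
c = np.array([0.0, -1.0])     # hidden-layer biases
w = np.array([1.0, -2.0])     # output weights
b = 0.0                       # output bias


def relu(z):
    """Element-wise rectified linear unit."""
    return np.maximum(z, 0.0)


# Hidden representation: one row per input point.
H = relu(X @ W + c)
print(H)
# [[0. 0.]
#  [1. 0.]
#  [1. 0.]
#  [2. 1.]]

# In the hidden space a purely linear output layer solves XOR.
output = H @ w + b
print(output)  # [0. 1. 1. 0.]
```

The inputs `[0, 1]` and `[1, 0]` are mapped to the same hidden point, which is what makes the problem linear for the output layer.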

<img src="https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/img/spiral.1-2.2-2-2-2-2-2.gif" width=300 />
<center>Source: <a href="https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/">Neural Networks, Manifolds, and Topology</a>.</center>

How does this look in the context of an actual (convolutional) neural network (more on this later)?
![representation](figures/fig3.png)
<center>Source: <a href="https://www.deeplearningbook.org/contents/intro.html">Chapter 1</a> of the Deep Learning book.</center>

## Task 1 (1p)

Using TensorFlow's [Playground](http://playground.tensorflow.org/), answer the following questions. Some remarks:
- Each answer should be 2 sentences max. 
- When specifying the architecture of a network, it's enough to write it out as $n_1$-$n_2$-$\ldots$-$n_k$, where $n_i$ is the number of neurons in the $i$-th layer and $k$ is the number of layers (e.g. 5-3-6 specifies a neural network with 5 neurons in the first layer, 3 neurons in the second layer, and 6 neurons in the third layer).
- Don't change the amount of noise or the ratio of train to test data.
- Don't change the input features unless the exercise instructs you to do so.

### Gauss dataset

- Is this dataset a priori solvable by *shallow* methods?
- What makes this dataset easier than the others?
- Compare two models: a neural network with many layers and many neurons and a neural network with a single neuron. Which of these models is more suitable for this task?

[your answer here]

### Circle dataset
- Assume we have access to only one neuron. How many (and what) input features do we need to achieve test loss lower than $0.001$?
- Assume we have access to only unmodified features (i.e. $x_1$ and $x_2$). Create the smallest (in the number of neurons) neural network which achieves test loss lower than $0.001$. Describe the architecture of the network (including activation functions).
- Try to solve the problem with an arbitrary number of neurons with linear activations (without changing the input features). Did you manage to achieve test loss lower than $0.001$? If yes, describe the architecture. If not, propose a hypothesis as to why it didn't work.

[your answer here]

### Spiral dataset

- Achieve (stable) test loss lower than $0.1$. Describe the architecture (including the activation function, regularization, and learning rate).
- Which features improve your configuration the most?
- What visually distinguishes solutions which generalize well from solutions which overfit (look at the model visualization after training)?

[your answer here]