# 1. Overview

## 1.1. Basic concepts

### Nodes and edges
Going back to familiar binary Logistic Regression, we visualize, let's say, a model trained on the dataset having 3 features and a single label. On the graph, each feature/label ($\mathbf{x}$ or $\mathbf{y}$) is represented by a *node* and each model weight ($w_1,w_2,w_3$) is represented by a *colored edge*. The bias $w_0$ (or sometimes denoted $b$) is not showing on the graph, but keep in mind it is attached to the output node. This is the most basic architecture of a Neural Network with 4 parameters (3 weights + 1 bias).

<img src='image/mlp_linear_simple.png' style='height:180px; margin:20px auto;'>

### Layers
Now, we extend the problem to a Stacking model, where 5 base models and the meta model are all Logistic Regression. Beside an input layer and an output layer, there is a new layer between them, called the *hidden layer*. We can add more and more hidden layers for multilevel stacking design. By doing this, our Neural Network becomes *deeper* and can capture more complicated relationship in our data. This opens up a new branch of Machine Learning algorithms: Deep Learning.

<img src='image/mlp_linear_stacking.png' style='height:300px; margin:20px auto;'>

### Representation learning
In the two examples above, the Neural Network is designed for a binary classification problem, where the target is a vector storing the probabilities of being classified to the positive class. For a multi-class classification problem, we need to contruct a vector of probabilities for each class. Below is an example Neural Network architecture with 2 hidden layers for the Iris data which has 4 features and 3 classes.

<img src='image/mlp_iris.png' style='height:300px; margin:20px auto;'>

This type of architecture is generally called Deep Neural Network of Multilayer Perceptron. Notice that each node represents a vector, we can think of nodes in the hidden layers as *latent features*, as they are automatically discovered by Deep Neural Networks. Such an approach is called [representation learning], one of the benefits that Multilayer Perceptron offer.

[representation learning]: https://en.wikipedia.org/wiki/Feature_learning

### Inspiration
Now we have known what a Deep Neural Network is, but what does the term *neural* implies here? This is because Neural Nets are inspired by biological neural network that constitute human brains. Artificial Neural Networks are constructed by nodes and edges, which resemble *neurons* and *synapses* in biological brains. An artificial neuron recieves signals, processes them and transmit it to other neurons. The strength of a signal between neurons is modeled by the weight of an edge.

Biological nervous system in fact is much more complicated, and Artificial Neural Network is just a simple counterpart.

### Implementation
Now, let's construct a Neural Network for the Iris example above using TensorFlow. We use a
<code style='font-size:13px'><a href='https://www.tensorflow.org/api_docs/python/tf/keras/Sequential'>Sequential</a></code>
to add [layers] successively to the network. There are so many types of layer, but we will start with the most basic one, Dense (also known as Fully Connected), indicating all nodes from the previous layers connect to the next layer.

The first thing to notice when constructing a Neural Network is the shape of each layer. Let's take a look at the first hidden layer: it has the shape of (None, 6). This means, this layer is a matrix with 6 columns and an unspecified number of rows. In other words, the architecture of the Neural Network is fixed, but it can adapts to any data size.

Another important thing is the number of parameters (weights and biases). A large number of parameters leads to high training time and oerfitting. Here are the numbers of parameters for each layer:
- Layer 1: $4\times6=24$ weights and $6$ biases or $30$ in total
- Layer 2: $6\times6=36$ weights and $6$ biases or $42$ in total
- Layer 3: $6\times3=18$ weights and $3$ biases or $21$ in total

[layers]: https://www.tensorflow.org/api_docs/python/tf/keras/layers

In [6]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras import layers

In [5]:
model = keras.Sequential()
model.add(layers.Dense(units=6))
model.add(layers.Dense(units=6))
model.add(layers.Dense(units=3, activation='softmax'))
model.compile(loss='categorical_crossentropy')

model.build(input_shape=(None, 4))
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 6)                 30        
                                                                 
 dense_4 (Dense)             (None, 6)                 42        
                                                                 
 dense_5 (Dense)             (None, 3)                 21        
                                                                 
Total params: 93
Trainable params: 93
Non-trainable params: 0
_________________________________________________________________


## 1.2. Activation functions
We have already known a Deep Neural Network is simply the combination of many Logistic Regression models. But if we keep stacking up linear functions, the result is still a linear function. In other words, using multiple linear layers is the same as using a single linear layer. To go beyonds linearity, an [activation function] is added to each node. 

[activation function]: https://keras.io/api/layers/activations/

In [9]:
layer = tf.keras.layers.Activation('relu')
output = layer(np.arange(-1,1,0.1))
output.numpy()

array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.1, 0.2,
       0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], dtype=float32)

## 1.3. Backpropagation

In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [5]:
model = keras.Sequential()
model.add(layers.Dense(units=1))
model.compile(tf.optimizers.Adam(learning_rate=0.1), loss='mean_absolute_error')
model.build(input_shape=(None,3))
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 1)                 4         
                                                                 
Total params: 4
Trainable params: 4
Non-trainable params: 0
_________________________________________________________________


In [2]:
x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], dtype=float)

In [14]:
model1.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_7 (Dense)             (None, 1)                 2         
                                                                 
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________


In [42]:
normalizer = layers.Normalization(input_shape=(1,), axis=None)
normalizer.adapt(xs)

xs = xs.reshape(-1,1)
ys = ys.reshape(-1,1)
model1 = keras.Sequential()
# model1.add(normalizer)
model1.add(layers.Dense(units=1))

model1.compile(loss='binary_crossentropy')

model1.fit(xs, ys, validation_split=0.2, epochs=100, verbose=0)
model1.predict(xs)

array([[-0.03436869],
       [ 0.9754254 ],
       [ 1.9852195 ],
       [ 2.9950137 ],
       [ 4.0048075 ],
       [ 5.0146017 ]], dtype=float32)

- https://www.tensorflow.org/api_docs/python/tf/keras/layers
- https://www.tensorflow.org/api_docs/python/tf/keras/Model
- https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
- https://www.tensorflow.org/api_docs/python/tf/keras/metrics