# Artificial Neural Networks
https://en.wikipedia.org/wiki/Artificial_neural_network  
Artificial neural networks (ANNs) are a popular machine learning model inspired by the biological neural networks in our brains. As with other supervised machine learning models, fitting the network to input data with labeled outputs lets it "learn" a function that maps inputs to outputs, and generalize to new examples.  

This demo focuses on visualizing what a neural network looks like and how it operates, and gives examples of how to create and train one.

## Tools

In [None]:
# Used for array calculations
import numpy as np

In [None]:
# Used to define and train neural network models
from tensorflow import keras

In [None]:
# Used to visualize graphs (in this case: neural networks)
import networkx as nx

In [None]:
# The classic Python visualization library
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 10]

In [None]:
import itertools

## Data
https://en.wikipedia.org/wiki/Vector_(mathematics_and_physics)  
https://en.wikipedia.org/wiki/Tensor  
https://en.wikipedia.org/wiki/Vectorization_(mathematics)  
As with most machine learning models, data needs to be **vectorized**, i.e. turned into an array of numbers, so that the computer can work with it.  
Sometimes we might change the shape of the vector to be a multidimensional array instead of a simple array. We may then think of the vector as a **tensor**.  
For the purposes of this demo we will only be looking at data in the form of simple vectors as 1D arrays.  
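
To make the distinction concrete, here is a minimal sketch of the same numbers held as a tensor (2D array) and as a simple vector (1D array). The 2x2 "image" is a made-up example, not data used elsewhere in this demo.

```python
import numpy as np

# Hypothetical example: a 2x2 grayscale "image" stored as nested lists
image = [[0, 255],
         [128, 64]]

# As a tensor: a 2D array that keeps the spatial layout
tensor = np.array(image, dtype=np.float64)

# As a simple vector: the same numbers flattened into a 1D array
vector = tensor.reshape(-1)

print(tensor.shape)
print(vector.shape)
print(vector)
```

Both arrays hold the same four values; only the shape differs.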

The power of a neural network is its ability to approximate a very wide class of functions mapping input vectors to output vectors, given enough neurons and suitable training.  
In this demo we will demonstrate this property by defining a binary (Boolean) function for the neural network to learn. Since computing Boolean functions is exactly what the transistors in our computers do, this gives a sense of how expressive a neural network can be.

In [None]:
# Edit this!
n_inputs = 2
my_function = lambda x: x[0] ^ x[1]

# Turn the function into a truth table for training data
X = []
y = []
for bin_input in itertools.product([0,1], repeat=n_inputs):
    print(f"Input: {bin_input} Output: {my_function(bin_input)}")
    X.append(bin_input)
    y.append(my_function(bin_input))
X = np.array(X, dtype=np.float64)
y = np.array(y, dtype=np.float64)

## A Description of Simple Neural Networks
### Mathematical Description of a Neural Network
A neural network takes an **input vector**, and transforms it via a series of linear and non-linear transformations into an **output vector**.  
Mathematically this can be described by the following recurrence relation:  
$$ \vec{x}_{i+1} = g \left ( W_{i+1} \vec{x}_{i} + \vec{b}_{i+1} \right )$$
Where:  
$\vec{x}_{i}$ is the vector after $i$ transformations.  
$N$ is the number of layers in the neural network.  
$\vec{x}_{0}$ is the input vector.  
$\vec{x}_{N}$ is the output vector.  
$W_{i}$ is the weights matrix of the connections between layer $i-1$ and $i$.  
$\vec{b}_{i}$ is the bias vector of layer $i$.  
$g$ is the activation function applied elementwise to a vector.
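
The recurrence above can be sketched directly in NumPy. This is only an illustration under stated assumptions: the helper names `sigmoid` and `forward`, the 2-3-1 layer sizes, and the hand-picked parameter values are all hypothetical, and Keras is not used here.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic activation, used here as g."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x0, weights, biases, g=sigmoid):
    """Apply x_{i+1} = g(W_{i+1} x_i + b_{i+1}) once per layer."""
    x = x0
    for W, b in zip(weights, biases):
        x = g(W @ x + b)
    return x

# Hypothetical 2-3-1 network with hand-picked (not trained) parameters
weights = [np.zeros((3, 2)), np.ones((1, 3))]
biases = [np.zeros(3), np.zeros(1)]

x_N = forward(np.array([1.0, 0.0]), weights, biases)
print(x_N)
```

In this sketch each $W_{i}$ has shape (outputs, inputs), matching the formula. Keras `Dense` layers store their kernel with shape (inputs, outputs), i.e. the transpose.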

In [None]:
# Define the NN's layers 
layers = [] # Input layer is simply defined by an input shape
layers.append(keras.layers.Dense(8, activation='sigmoid', input_shape=(n_inputs,))) # 1st Hidden layer
layers.append(keras.layers.Dense(1, activation='sigmoid')) # Output layer

# Join the layers together
model = keras.Sequential(layers)
model.summary()

In [None]:
# Display the model weights and biases as tensors
for i,layer in enumerate(model.layers):
    weights = layer.get_weights()[0]
    biases = layer.get_weights()[1]
    print("Layer {} Weights:".format(i))
    print(weights)
    print("Layer {} Biases:".format(i))
    print(biases)
    print('')

### Neural Networks as Graphs
https://en.wikipedia.org/wiki/Directed_graph  
https://en.wikipedia.org/wiki/Directed_acyclic_graph  
Often we like to look at the connections as a directed graph, where the nodes are the neurons and the edges are the connections between the neurons.  
The graph is directed and acyclic.  
We can take the representation a step further by thinking of the neural network as a weighted graph, in which the weights of the edges are the connection weights, and the weights of the nodes are the biases.

In [None]:
# Create a graph object from the NN
NN_graph = nx.DiGraph()
for i_layer,layer in enumerate(model.layers):
    for (i_input,i_output),weight in np.ndenumerate(layer.get_weights()[0]):
        input_label = f"{i_layer}:{i_input}"
        output_label = f"{i_layer+1}:{i_output}"
        if i_input==0:
            bias = layer.get_weights()[1][i_output]
            NN_graph.add_node(output_label, bias=bias)
        NN_graph.add_edge(input_label, output_label, weight=weight)

In [None]:
# Layout and display the graph
pos = nx.nx_agraph.graphviz_layout(NN_graph, prog='dot', args="-Grankdir=LR")
nx.draw_networkx(NN_graph, pos=pos, node_size=2000, node_color='lightskyblue')

## Training
### Problem Definition
https://en.wikipedia.org/wiki/Universal_approximation_theorem  
https://en.wikipedia.org/wiki/Machine_learning  
The training, or learning, of a machine learning model is accomplished by performing an optimization over the model's parameters to minimize the error between the model's predicted values and the true labeled values from the training data.  

Mathematically we can write training as solving the following problem:  
$$ \min_{\vec{w}} \sum_{i} C \left (  f \left ( \vec{X}_{i}, \vec{w} \right ),  \vec{y}_{i} \right ) $$
Where:  
$\vec{w}$ is the vector of parameters (weights) of the model.  
$\vec{X}_{i}$ is the input vector of the $i$th datapoint in the training set.  
$\vec{y}_{i}$ is the output vector (label) of the $i$th datapoint in the training set.  
$f$ is the function evaluated by the model.  
$C$ is the loss (cost) function.  

As one can see, this is the classic formulation of an optimization problem.  
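
As a minimal sketch of this objective, the code below sums a per-datapoint loss over a tiny dataset and picks the best of a few candidate parameters by brute force. The one-parameter linear model `f`, the squared-error `C`, the toy data and the candidate list are all assumptions chosen for illustration; real training uses gradient-based optimizers rather than a search like this.

```python
import numpy as np

# Hypothetical toy dataset where y = 2 * x
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])

def f(x, w):
    """Assumed model: a linear function of the input."""
    return np.dot(w, x)

def C(y_pred, y_true):
    """Assumed per-datapoint loss: squared error."""
    return (y_pred - y_true) ** 2

def objective(w):
    """Sum of the loss over every datapoint in the training set."""
    return sum(C(f(x_i, w), y_i) for x_i, y_i in zip(X_toy, y_toy))

# Brute-force "optimization" over a handful of candidate parameters
candidates = [0.0, 1.0, 2.0, 3.0]
best_w = min(candidates, key=lambda w: objective(np.array([w])))

print(objective(np.array([0.0])))
print(best_w)
```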

### Objective/Cost/Loss Functions
https://en.wikipedia.org/wiki/Loss_function  
In machine learning the terms "objective function", "cost function" and "loss function" all refer to the same concept: a measure of error of the model that we seek to minimize.  

Let's look at a few different popular loss functions:  

#### Absolute Error
https://en.wikipedia.org/wiki/Approximation_error  
The simplest loss function. Simple to compute, and intuitive to understand.  
$$ C \left ( \vec{y}_{\mathrm{pred}}, \vec{y}_{\mathrm{true}} \right ) = \left \| \vec{y}_{\mathrm{pred}} - \vec{y}_{\mathrm{true}} \right \| _{1} = \sum_i \left | y_{\mathrm{pred}, i} - y_{\mathrm{true}, i} \right | $$
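
A minimal NumPy sketch of the absolute error. The function name `absolute_error` and the example vectors are hypothetical.

```python
import numpy as np

def absolute_error(y_pred, y_true):
    """L1 norm of the difference between prediction and label."""
    return np.sum(np.abs(y_pred - y_true))

# Hypothetical prediction and label
abs_loss = absolute_error(np.array([0.9, 0.2, 0.4]), np.array([1.0, 0.0, 0.0]))
print(abs_loss)
```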

#### Quadratic (Least Squares)
https://en.wikipedia.org/wiki/Least_squares  
The most commonly known loss function. Simple to compute, and intuitive to understand in terms of distance in Euclidean space.  
$$ C \left ( \vec{y}_{\mathrm{pred}}, \vec{y}_{\mathrm{true}} \right ) = \left \| \vec{y}_{\mathrm{pred}} - \vec{y}_{\mathrm{true}} \right \| ^{2} = \sum_i \left ( y_{\mathrm{pred}, i} - y_{\mathrm{true}, i} \right ) ^{2} $$
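
A minimal NumPy sketch of the quadratic loss. The function name `squared_error` and the example vectors are hypothetical. Note that Keras's `'mean_squared_error'` averages over the elements instead of summing them, which differs from this formula only by a constant factor.

```python
import numpy as np

def squared_error(y_pred, y_true):
    """Squared Euclidean distance between prediction and label."""
    return np.sum((y_pred - y_true) ** 2)

# Hypothetical prediction and label
sq_loss = squared_error(np.array([0.9, 0.2, 0.4]), np.array([1.0, 0.0, 0.0]))
print(sq_loss)
```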

#### Cross-Entropy
https://en.wikipedia.org/wiki/Cross_entropy  
https://en.wikipedia.org/wiki/Information_theory  
A very popular loss function for machine learning classification problems. Simple to compute, but the intuition, which comes from information theory, is a little harder to grasp: with a base-2 logarithm, it is the average number of binary questions needed to identify an element drawn from distribution $\vec{y}_{\mathrm{true}}$ when the questions are optimized for distribution $\vec{y}_{\mathrm{pred}}$.
$$ C \left ( \vec{y}_{\mathrm{pred}}, \vec{y}_{\mathrm{true}} \right ) = - \sum_{i} y_{\mathrm{true}, i} \log \left ( y_{\mathrm{pred}, i} \right ) $$  
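
A minimal NumPy sketch of the cross-entropy. The function name `cross_entropy`, the clipping constant `eps` (added only to avoid taking the log of zero), and the example distributions are assumptions. The natural logarithm is used here; replacing it with `np.log2` would give the result in bits.

```python
import numpy as np

def cross_entropy(y_pred, y_true, eps=1e-12):
    """Cross-entropy between a true and a predicted distribution."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return -np.sum(y_true * np.log(y_pred))

# Hypothetical one-hot label and predicted class probabilities
ce_loss = cross_entropy(np.array([0.7, 0.2, 0.1]), np.array([1.0, 0.0, 0.0]))
print(ce_loss)
```

With a one-hot label only the predicted probability of the true class contributes, so the loss reduces to $-\log(0.7)$ in this example.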

Many other loss functions exist. Many are simply variants of the above.  

### Optimization Techniques
https://en.wikipedia.org/wiki/Mathematical_optimization  
https://en.wikipedia.org/wiki/Optimization_problem  
https://en.wikipedia.org/wiki/Stochastic_gradient_descent  

In [None]:
# `lr` is a deprecated alias; current Keras versions expect `learning_rate`
optimizer = keras.optimizers.Adam(learning_rate=0.1)
model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['accuracy'])

In [None]:
model.fit(X, y, epochs=100)

In [None]:
y_pred = model.predict(X)
print("y actual:")
print(y)
print("y predicted:")
print(y_pred)