# Introduction to ANN with Keras

## The Perceptron
The Perceptron is based on an artificial neuron called the <em>threshold logic unit</em> (TLU). The TLU computes a weighted sum of its inputs:

$z = w_{1}x_{1} + \dots + w_{n}x_{n} = \textbf{x}^T\textbf{w}$

then applies a step function to that sum and outputs the result:

$h_{\textbf{w}}(\textbf{x}) = \text{step}(z)$<br>
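
The two steps above (weighted sum, then step function) can be sketched in NumPy. This is a minimal illustration, assuming a Heaviside step that outputs 1 when $z \geq 0$; the names `heaviside` and `tlu_output` and the example inputs and weights are made up for this sketch and do not come from the notes.

```python
import numpy as np

def heaviside(z):
    """Step function: 1 where z >= 0, else 0 (one common convention)."""
    return np.where(z >= 0, 1, 0)

def tlu_output(x, w):
    """Weighted sum z = x^T w followed by the step function."""
    z = x @ w
    return heaviside(z)

x = np.array([1.0, 2.0, 3.0])     # illustrative input values
w = np.array([0.5, -1.0, 0.25])   # illustrative connection weights
z = x @ w                         # 0.5 - 2.0 + 0.75 = -0.75
output = int(tlu_output(x, w))    # z is negative, so the TLU outputs 0
```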

A Perceptron is composed of a single layer of TLUs. The outputs of that layer are computed as:

$h_{\textbf{W,b}}(\textbf{X}) = \phi(\textbf{XW} + \textbf{b})$<br>

$\textbf{X}$ – the matrix of input features<br>
$\textbf{W}$ – the weight matrix, with one row per input neuron and one column per artificial neuron in the layer<br>
$\textbf{b}$ – the bias vector, which contains all the connection weights between the bias neuron and the artificial neurons<br>
$\phi$ – the activation function; when the artificial neurons are TLUs, it is a step function<br>
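
As a rough sketch of the layer equation, the snippet below evaluates $\phi(\textbf{XW} + \textbf{b})$ for a small batch. The function names, the shapes (3 instances, 2 inputs, 3 TLUs), and all numeric values are assumptions chosen only for illustration.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

def perceptron_layer(X, W, b, phi=step):
    """Compute phi(XW + b) for a batch of instances."""
    return phi(X @ W + b)

X = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 0.5]])           # 3 instances, 2 input features
W = np.array([[1.0, -1.0, 0.5],
              [0.5, 1.0, -2.0]])     # one row per input, one column per TLU
b = np.array([-1.0, 0.0, 0.5])       # one bias term per TLU

out = perceptron_layer(X, W, b)      # shape (3, 3): one row per instance, one column per TLU
```

The bias vector is broadcast across rows, so each instance gets the same bias terms added to its weighted sums.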

Hebb's rule: the connection weight between two neurons is increased whenever they have the same output.

### Perceptron training rule
$w^{\text{next step}}_{i,j} = w_{i,j} + \eta(y_j - \hat{y}_j)x_i$<br>

$w_{i, j}$ is the connection weight between the ith input neuron and the jth output neuron.<br>
$x_i$ is the ith input value of the current training instance.<br>
$\hat{y}_j$ is the output of the jth output neuron for the current training instance.<br>
$y_j$ is the target output of the jth output neuron for the current training instance.<br>
$\eta$ is the learning rate.<br>
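
The update rule can be written out directly in NumPy. The sketch below is a hypothetical from-scratch implementation, not scikit-learn's `Perceptron`; it assumes a step function that fires at $z \geq 0$, treats the bias as a weight on a constant input of 1, and uses the logical AND function as toy linearly separable data.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Online Perceptron learning: one weight update per training instance."""
    n_inputs, n_outputs = X.shape[1], y.shape[1]
    W = np.zeros((n_inputs, n_outputs))
    b = np.zeros(n_outputs)
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = step(x_i @ W + b)
            error = y_i - y_hat                # (y_j - y_hat_j) for every output j
            W += eta * np.outer(x_i, error)    # w_ij += eta * (y_j - y_hat_j) * x_i
            b += eta * error                   # same rule with the bias input fixed at 1
    return W, b

# Toy, linearly separable data: logical AND (illustrative only)
X_and = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([[0], [0], [0], [1]])

W, b = train_perceptron(X_and, y_and)
pred = step(X_and @ W + b).ravel()   # predictions for the four training instances
```

Weights only change when a prediction is wrong, so once every training instance is classified correctly the remaining epochs leave `W` and `b` untouched.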


```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]            # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0

per_clf = Perceptron(max_iter=50, tol=None)
per_clf.fit(X, y)
per_clf.predict([[2, 0.5]])
```

```
array([1])
```

## Multilayer perceptron and backpropagation
An MLP is composed of one input layer, one or more layers of TLUs (i.e. hidden layers), and one output layer. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

### Backpropagation
Backpropagation is Gradient Descent using an efficient technique for computing the gradients automatically, in two passes through the network (one forward, one backward). It can calculate the gradient of the network's error with regard to every single model parameter; in other words, it can find how each connection weight and each bias term should be tweaked to reduce the error. Once it has these gradients, it performs a normal Gradient Descent step, and the process is repeated until convergence.

For each training instance, the algorithm makes a forward pass through the network and measures the output error with a loss function. It then goes back through the network, one layer at a time, to measure the error contribution from each connection (the reverse pass), and finally tweaks the connection weights to reduce the error (the Gradient Descent step).
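
The forward pass, reverse pass, and Gradient Descent step described above can be sketched for a tiny network. Everything here is an assumption made for illustration: one hidden layer, sigmoid activations in both layers, a mean squared error loss, random toy data, and the function names `forward`, `mse`, and `backward`. It is not how Keras implements backpropagation internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward pass: return hidden activations and network output."""
    W1, b1, W2, b2 = params
    A1 = sigmoid(X @ W1 + b1)    # hidden layer
    A2 = sigmoid(A1 @ W2 + b2)   # output layer
    return A1, A2

def mse(y_hat, y):
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def backward(X, y, params):
    """Reverse pass: gradient of the loss w.r.t. every weight and bias."""
    W1, b1, W2, b2 = params
    m = X.shape[0]
    A1, A2 = forward(X, params)
    dZ2 = (A2 - y) * A2 * (1 - A2) / m     # error signal at the output layer
    dW2, db2 = A1.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)     # error propagated back to the hidden layer
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3))                         # toy inputs
y = rng.integers(0, 2, size=(8, 1)).astype(float)   # toy binary targets
params = [rng.normal(size=(3, 4)), np.zeros(4),
          rng.normal(size=(4, 1)), np.zeros(1)]

loss_before = mse(forward(X, params)[1], y)
grads = backward(X, y, params)
eta = 0.1
params = [p - eta * g for p, g in zip(params, grads)]   # one Gradient Descent step
loss_after = mse(forward(X, params)[1], y)
```

Repeating the last four lines in a loop gives the full training procedure; a single small step like this one should already lower the loss slightly.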

To make this work, the authors replaced the step function with the logistic function $\sigma(z) = \frac{1}{1+e^{-z}}$. The step function contains only flat segments, so there is no gradient to work with, whereas the logistic function has a well-defined nonzero derivative everywhere.
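
A quick numerical comparison makes the difference concrete. This sketch assumes a central finite-difference estimate of the slope (the helper `numerical_derivative` is hypothetical) and compares it with the known closed form $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])               # sample points away from z = 0
step_slope = numerical_derivative(step, z)         # zero everywhere: nothing for Gradient Descent to follow
sigmoid_slope = numerical_derivative(sigmoid, z)   # strictly positive at every point
analytic = sigmoid(z) * (1 - sigmoid(z))           # closed-form derivative of the logistic function
```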
