# Artificial Neural Network (ANN)

***Overview***

<p style="float: right; margin: 0 0 10px 10px;">
  <a data-flickr-embed="true" data-header="true" href="https://en.wikipedia.org/wiki/Cerebral_cortex#/media/File:Minute_structure_of_the_cerebral_cortex.jpg" title="Caught in the App LONDON">
    <img 
      src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/Minute_structure_of_the_cerebral_cortex.jpg/800px-Minute_structure_of_the_cerebral_cortex.jpg" 
      alt="Neocortex layers" 
      width="400" 
      height="666">
  
  </a>
</p>

An artificial neural network (ANN) is a computational model inspired by the biological networks of neurons found in the brain. Consisting of interconnected units ("neurons" or "nodes") organized in layers, these models can recognize intricate patterns, model complex relationships, and make useful predictions. Generally, a model includes an input layer, where data is initially fed in; one or more hidden layers, which process and transform the data; and an output layer, which produces the final classification or prediction. Connections between nodes vary in strength, which is characterized by a "weight" (represented as a floating-point value), and each node has an associated "bias", or adjustment term.

An ANN "learns" via an iterative process of weight adjustment ("training"). During this process, the model makes predictions, compares these to known values ("labels"), and computes the difference ("error"). Using a process of optimization, the model attempts to minimize this error by adjusting weights and biases iteratively. 

_Right: biological neurons spanning vertically organized layers of neocortex. Source: [Encyclopedia Britannica - 1911 EB, Vol. 4, Page 400](https://en.wikipedia.org/wiki/Cerebral_cortex)_

***Layered Structure of an ANN***

1. Input layer: receives raw data as input. The number of neurons in the input layer corresponds to the number of features of the dataset. 
2. Hidden layer(s): one or more intermediate layers that process input data through various mathematical operations. These layers enable the network to learn complex patterns. The "deepest" layer typically refers to the hidden layer furthest from the input layer.
3. Output layer: the final layer, responsible for producing the model's prediction. For a classification task, the number of neurons here usually corresponds to the number of output classes; for a regression task that predicts a single value, the layer has a single neuron.
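
As a rough illustration of how these layer sizes follow from a dataset, the sketch below derives them for a made-up classification problem. The arrays `X` and `y` and the hidden size of 4 are assumptions chosen purely for the example.

```python
import numpy as np

# Hypothetical dataset: 5 samples with 3 features each, and 2 possible classes.
X = np.zeros((5, 3))
y = np.array([0, 1, 1, 0, 1])

n_inputs = X.shape[1]          # one input neuron per feature
n_hidden = 4                   # a free design choice (hyperparameter)
n_outputs = len(np.unique(y))  # one output neuron per class

layer_sizes = [n_inputs, n_hidden, n_outputs]
print(layer_sizes)  # [3, 4, 2]
```

Only the first and last sizes are dictated by the data; the number and width of hidden layers are chosen by the modeler.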

***The Model***

Each neuron in an ANN computes a weighted sum of its inputs, adds a bias, and applies an activation function to produce an output. For a whole layer, this can be expressed mathematically as:

$$
O = f(W \cdot X + b)
$$

Where:

- $X$: the layer's input, a vector whose dimensionality equals the number of neurons in the previous layer
- $W$: the weight matrix for the layer, whose values update as training occurs
- $b$: a vector of "bias" values, one for each neuron in the layer
- $f$: the activation function applied to the weighted sum of inputs
- $O$: the layer's output, representing the transformed data
- $W \cdot X$: the product of the weight matrix $W$ and the input vector $X$ (one dot product per neuron in the layer)
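
The layer equation can be sketched directly in NumPy. This is a minimal illustration rather than a reference implementation: the helper name `layer_forward`, the example values, and the choice of a sigmoid for $f$ are all assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, used here as an example activation f."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, X, b, f):
    """Compute O = f(W . X + b) for a single layer."""
    return f(W @ X + b)

# Example values (assumed): a layer with 3 inputs and 2 neurons.
X = np.array([1.0, 2.0, 3.0])
W = np.array([[0.1, 0.2, 0.3],
              [-0.1, -0.2, -0.3]])  # shape (2 neurons, 3 inputs)
b = np.array([0.0, 0.5])            # one bias per neuron

O = layer_forward(W, X, b, sigmoid)
print(O.shape)  # (2,)
```

Each row of `W` holds the weights of one neuron, so the output has one entry per neuron in the layer.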

***Training Process***

The training process is an iterative procedure in which the network adjusts its weights and biases to minimize the error (or loss) between its predictions and the actual labels (ground truth). An optimization algorithm such as gradient descent is used to this end. During gradient descent, the gradient of the loss function with respect to the weights is computed. Then, the weights are updated in the direction that reduces the loss. The weight update rule for gradient descent is expressed mathematically as:

$$
\text{New weight} = \text{Old weight} - \eta \cdot \frac{\partial L}{\partial W}
$$ 

Where: 
- $\eta$ (eta): the learning rate, a hyperparameter that determines the size of steps taken during training
- $L$: the loss function, which measures the error between predictions and actual values 
- $W$: the weight matrix to be continuously updated 
- $\frac{\partial L}{\partial W}$: the gradient of the loss function with respect to the weight matrix 

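The gradient $\frac{\partial L}{\partial W}$ has the same shape as the weight matrix $W$. Each of its entries is the partial derivative of the loss with respect to a single weight:

$$
\left( \frac{\partial L}{\partial W} \right)_{ij} = \frac{\partial L}{\partial W_{ij}}
$$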
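
To make the update rule concrete, here is a minimal sketch of gradient descent. To keep the gradient simple enough to write by hand, it uses a single linear layer with a mean squared error loss rather than a full ANN. The synthetic data, `true_w`, the learning rate, and the number of steps are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): 50 samples, 3 features, noiseless linear labels.
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

def loss(w):
    """Mean squared error between predictions X @ w and labels y."""
    return np.mean((X @ w - y) ** 2)

def grad(w):
    """Gradient of the mean squared error with respect to w."""
    return (2.0 / len(y)) * (X.T @ (X @ w - y))

eta = 0.1        # learning rate
w = np.zeros(3)  # starting weights
initial_loss = loss(w)

for _ in range(200):
    # New weight = old weight - eta * dL/dW
    w = w - eta * grad(w)

final_loss = loss(w)
print(final_loss < initial_loss)  # True
```

In a multi-layer network the same update is applied to every weight matrix and bias vector; only the computation of the gradient becomes more involved.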

***Activation Function Selection***

An activation function is applied to the weighted sum of inputs plus bias at each layer, introducing nonlinearity to the model. It plays a critical role by determining the output of each neuron based on its input, enabling ANNs to model complex relationships. 

Examples include: 

1. Sigmoid: outputs values between 0 and 1. 
2. Rectified linear unit (ReLU): popular in the hidden layers of deep networks, ReLU "rectifies" negative values to 0 while leaving positive values unchanged. 
3. Softmax: commonly applied in the output layer for multi-class classification problems. Outputs a normalized vector of probabilities between 0 and 1 whose sum equals 1. 
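
The three functions listed above can be written in a few lines of NumPy. This is a sketch for illustration; the function names and the example vector `z` are assumptions, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than part of the mathematical definition.

```python
import numpy as np

def sigmoid(z):
    """Squash values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Set negative values to 0 and leave positive values unchanged."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # avoids overflow in exp for large scores
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))
print(relu(z))
print(softmax(z))
```

For this input, `relu` zeroes out the first entry, while `softmax` assigns the largest probability to the last entry because it has the largest score.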

***Implementation in Python***

We'll be setting up our model manually to demonstrate its inner mechanics. This approach helps us understand the fundamental concepts behind neural networks, such as how layers, weights, and activation functions work together to make predictions.

In practice, it can be convenient to use one of several powerful libraries and frameworks for neural network implementation. These offer pre-built models with various architectures and optimization routines to streamline model development and training by removing the need for writing low-level code from scratch. In this notebook, we will utilize TensorFlow, an open-source framework developed by Google. It provides a comprehensive ecosystem for building and deploying machine learning models. TensorFlow is widely used because it offers:

- Ease of Use: High-level APIs like Keras make it easy to build and train models.
- Performance: Optimized for both CPU and GPU, allowing for efficient training and inference.
- Flexibility: Supports a wide range of machine learning tasks, from simple linear regression to complex deep learning models.
- Community and Support: A large community of developers and extensive documentation.

Alternatives include: 

- PyTorch: Developed by Facebook, PyTorch is known for its dynamic computation graph and ease of use, especially in research settings.
- Keras: Initially an independent project, Keras was later integrated into TensorFlow as `tf.keras`. It provides a high-level API for building neural networks.
- MXNet: An open-source deep learning framework that is scalable and supports multiple languages.
- Caffe: Developed by the Berkeley Vision and Learning Center (BVLC), Caffe is known for its speed and modularity.
- Theano: One of the earliest deep learning libraries, Theano is now mostly used for educational purposes.

Let's proceed with our setup for a manual implementation. 

In [20]:
import numpy as np 

# Set model characteristics and hyperparameters
input_layer_size = 3   # number of input features
hidden_layer_size = 4  # number of hidden units
output_layer_size = 1  # number of output units

learning_rate = 0.01   # step size for gradient descent

# Randomly initialize weights and biases from a standard normal distribution.
# The extra "+ 1" column in each weight matrix is reserved for a bias weight.
w1 = np.random.randn(hidden_layer_size, input_layer_size + 1)   # shape (4, 4)
b1 = np.random.randn(hidden_layer_size)                         # shape (4,)
w2 = np.random.randn(output_layer_size, hidden_layer_size + 1)  # shape (1, 5)
b2 = np.random.randn(output_layer_size)                         # shape (1,)

***Explanation***

`np.random.randn()` returns an array of the requested shape filled with random samples from a standard normal distribution ($\mu = 0$, $\sigma = 1$).

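Random initialization breaks the symmetry between neurons. If every weight in a layer started with the same value, all neurons in that layer would compute the same output and receive the same update, so they could never learn different features. Starting from random values lets each neuron develop its own role during training, although it does not by itself guarantee that the model converges.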
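
The effect of the starting weights can be seen in a small sketch. The input `x`, the constant value 0.5, and the use of `tanh` as the activation are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.array([0.5, -1.0, 2.0])  # a single example with 3 features

# Four hidden neurons, initialized two different ways.
w_const = np.full((4, 3), 0.5)        # every weight identical
w_rand = rng.standard_normal((4, 3))  # weights drawn at random

h_const = np.tanh(w_const @ x)
h_rand = np.tanh(w_rand @ x)

print(np.unique(h_const).size)  # 1
print(np.unique(h_rand).size)   # 4
```

With identical weights the four neurons are indistinguishable, whereas random weights give each neuron a different starting response to the same input.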

The weight matrix `w1` describes the strength of connections between neurons in the input and hidden layers. Its dimensionality is (hidden_layer_size, input_layer_size + 1): the number of rows corresponds to the number of neurons in the hidden layer, while the number of columns corresponds to the number of neurons in the input layer plus one. The extra column accounts for the bias term, an additional weight at each hidden node that multiplies a constant input of 1 rather than a feature. This value shifts the node's weighted sum, which helps the model better fit the data.

The matrix `w2` contains the weights between the hidden and output layers. Its dimensionality is (output_layer_size, hidden_layer_size + 1): the number of rows corresponds to the number of neurons in the output layer, while the number of columns corresponds to the number of neurons in the hidden layer plus one, which again accounts for the bias term. The vectors `b1` and `b2` hold one bias value per neuron in the hidden and output layers, respectively.
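
One way to use weight matrices with this extra column is to append a constant 1 to each layer's input, so that the last column acts as the bias. The sketch below illustrates that convention. It is an assumption about how the shapes could be used, not necessarily how the rest of this notebook proceeds; the helper `with_bias_input`, the sigmoid activation, and the example input are hypothetical, and the separate `b1` and `b2` vectors are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
input_layer_size, hidden_layer_size, output_layer_size = 3, 4, 1

w1 = rng.standard_normal((hidden_layer_size, input_layer_size + 1))
w2 = rng.standard_normal((output_layer_size, hidden_layer_size + 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def with_bias_input(v):
    """Append a constant 1 so the last weight column acts as the bias."""
    return np.append(v, 1.0)

x = np.array([0.2, -0.4, 0.9])  # one example with 3 features

hidden = sigmoid(w1 @ with_bias_input(x))       # shape (4,)
output = sigmoid(w2 @ with_bias_input(hidden))  # shape (1,)

# The same hidden activations, written with an explicit bias vector.
hidden_explicit = sigmoid(w1[:, :-1] @ x + w1[:, -1])
print(np.allclose(hidden, hidden_explicit))  # True
```

The final check shows that folding the bias into the weight matrix is equivalent to the $W \cdot X + b$ form, with $b$ stored as the last column.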


