# Artificial Neural Networks (ANNs)

# 1. What are ANNs?

Artificial neural networks are one of the main tools used in machine learning. As the “neural” part of their name suggests, they are brain-inspired systems intended to replicate the way that we humans learn. Neural networks consist of input and output layers, as well as (in most cases) one or more hidden layers of units that transform the input into something the output layer can use. In the simplest case an ANN has three interconnected layers. The first layer consists of input neurons. Those neurons send data on to the second (hidden) layer, which in turn sends its output to the third (output) layer. ANNs are considered non-linear statistical data modeling tools: they model complex relationships between inputs and outputs or find patterns in data. Note that a neuron is sometimes also referred to as a perceptron, although strictly speaking a perceptron is one specific, simple kind of artificial neuron.

Note: How layers are created and arranged, how many neurons each layer holds, and similar questions all fall under neural network architecture design. It is common to have these questions when starting out. Neural network architecture is frankly an ocean: one gains knowledge of it only by exploring it, working on different problems, and optimising for better solutions by trial and error. Some rules of thumb are known, but because the mechanism is so complex there are very few detailed proofs of why some things work for some problems and others do not.

<img src="https://www.digitaltrends.com/wp-content/uploads/2017/11/artificial_neural_network_1.jpg"  width="500">

For a basic idea of how a deep learning neural network learns, imagine a factory line. After the raw materials (the data set) are input, they are then passed down the conveyer belt, with each subsequent stop or layer extracting a different set of high-level features. If the network is intended to recognize an object, the first layer might analyze the brightness of its pixels. The next layer could then identify any edges in the image, based on lines of similar pixels. After this, another layer may recognize textures and shapes, and so on. By the time the fourth or fifth layer is reached, the deep learning net will have created complex feature detectors. It can figure out that certain image elements (such as a pair of eyes, a nose, and a mouth) are commonly found together.

Once this is done, the researchers who have trained the network can give labels to the output, and then use backpropagation to correct any mistakes which have been made. After a while, the network can carry out its own classification tasks without needing humans to help every time.

<a id="2"></a>
# 2. Types of Neural Network

There are multiple types of neural network, each of which comes with its own specific use cases and level of complexity. The most basic type of neural net is the feedforward neural network, in which information travels in only one direction, from input to output. Another widely used type is the recurrent neural network, in which data can flow in multiple directions. These networks possess greater learning abilities and are widely employed for more complex tasks such as handwriting or language recognition.

There are also convolutional neural networks, Boltzmann machine networks, Hopfield networks, and a variety of others. Picking the right network for your task depends on the data you have to train it with, and the specific application you have in mind. In some cases, it may be desirable to use multiple approaches, such as would be the case with a challenging task like voice recognition.

<img src="https://i.stack.imgur.com/LgmYv.png"  width="500">

<a id="3"></a>
# 3. Is there a difference between NN and ANN?

Neural network is a broad term that encompasses the various types of networks shown above. Is ANN one of those types? To answer this, it is important to realise that a neural network on its own is not an algorithm but a framework within which learning algorithms operate. ANN is the most basic form of neural network. The term ANN was coined much earlier, and nowadays the two terms are used interchangeably.


<a id="4"></a>
# 4. In what situation does the algorithm fit best?

ANNs were historically rarely used for predictive modelling, because they tend to over-fit the relationship in the training data. ANNs are generally used in cases where what has happened in the past is repeated almost exactly in the same way. For example, say we are playing a game of blackjack against a computer. An ANN-based opponent would be a very good opponent in this case (assuming the computation time can be kept low). With time, the ANN will train itself on all possible cases of card flow. And given that the cards are not being reshuffled by a dealer, the ANN will be able to memorize every single call. In that sense, it is a machine learning technique with an enormous memory. But it does not work well when the scoring population is significantly different from the training sample. For instance, if I plan to target customers for a campaign by feeding their past responses to an ANN, I will probably be using the wrong technique, as it might have over-fitted the relationship between the response and the other predictors.

<a id="5"></a>
# 5. How does ANN work?

The working of an ANN takes its roots from the neural networks in the human brain. An ANN operates on what are referred to as hidden states, which are similar to neurons. Each hidden state is a transient form with probabilistic behavior. A grid of such hidden states acts as a bridge between the input and the output.

We have an input layer which is the data we provide to the ANN. We have the hidden layers, which is where the magic happens. Lastly, we have the output layer, which is where the finished computations of the network are placed for us to use.





![](http://cdn-images-1.medium.com/max/600/1*f0hA2R652htmc1EaDrgG8g.png)

Initially, the weights of the network can be set randomly. When input is given to the input layer, the process moves forward: the hidden layer receives the input combined with the weights. This continues until the output layer is reached and a result is produced. The result is then compared with the actual value, and the backpropagation algorithm adjusts the weights of the network's connections to improve the result. What do the neurons in the layers do, then? Each is individually responsible for part of the learning. Each neuron applies an activation function that determines how much signal it passes on, depending on which activation function is used and what input came from the previous layer. We'll look at activation functions in detail next.

<img src="http://www.analyticsvidhya.com/blog/wp-content/uploads/2014/10/flowchart-ANN.png"  width="600">

<a id="6"></a>
# 6. Activation Function
Activation functions are essential for an artificial neural network to learn complicated, non-linear functional mappings between the inputs and the response variable. They introduce non-linear properties to the network. Their main purpose is to convert the input signal of a node in an ANN into an output signal, which is then used as an input to the next layer in the stack.

Specifically, in an ANN we compute the sum of products of the inputs (X) and their corresponding weights (W), apply an activation function f(x) to it to get the output of that layer, and feed that output as input to the next layer.
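The weighted-sum-then-activate step can be sketched in a few lines of plain Python. This is only an illustration: the names `layer_output`, `x`, `W`, and `b` are assumptions, the bias term is an addition that the notebook introduces later, and the sigmoid is just one possible choice of f.

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases, activation=sigmoid):
    """Compute the outputs of one layer.

    weights[i][j] is the weight from input j to neuron i.
    """
    outputs = []
    for w_row, bias in zip(weights, biases):
        # Sum of products of the inputs and their weights, plus the bias
        net = sum(w * x for w, x in zip(w_row, inputs)) + bias
        outputs.append(activation(net))
    return outputs

# Hypothetical values, chosen only for illustration
x = [0.5, -1.0]
W = [[0.1, 0.4], [-0.3, 0.2]]
b = [0.0, 0.1]

hidden = layer_output(x, W, b)
print(hidden)  # these values would be fed to the next layer
```

Passing `hidden` back into `layer_output` with a second weight matrix would chain two layers together.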



#### Most popular types of activation functions
* Sigmoid (logistic)
* Tanh (hyperbolic tangent)
* ReLu (rectified linear unit)

**Sigmoid activation function**: It is an activation function of the form f(x) = 1 / (1 + exp(-x)). Its range is between 0 and 1, and it is an S-shaped curve. It is easy to understand and apply, but it has fallen out of popularity for several reasons:

* Vanishing gradient problem.
* Its output isn’t zero-centered (0 < output < 1). This makes the gradient updates go too far in different directions and makes optimization harder.
* Sigmoids saturate and kill gradients.
* Sigmoids have slow convergence.

<img src="http://cdn-images-1.medium.com/max/1600/0*WYB0K0zk1MiIB6xp.png"  width="400">


**Hyperbolic tangent function (Tanh)**: Its mathematical formula is f(x) = (1 - exp(-2x)) / (1 + exp(-2x)). Its output is zero-centered because its range is between -1 and 1, i.e. -1 < output < 1. This makes optimization easier, so in practice it is usually preferred over the sigmoid function. It still suffers from the vanishing gradient problem, however.


<img src="http://cdn-images-1.medium.com/max/1600/0*VHhGS4NwibecRjIa.png"  width="400">

**ReLu (Rectified Linear Unit)**: It has become very popular in recent years, and it has been reported to converge about six times faster than the tanh function. It is simply R(x) = max(0, x), i.e. if x < 0, R(x) = 0, and if x >= 0, R(x) = x. From this mathematical form we can see that it is very simple and efficient. In machine learning and computer science, simple and consistent techniques are often the ones that end up preferred. ReLu also mitigates the vanishing gradient problem, since its gradient does not shrink for positive inputs. Many deep learning models use ReLu nowadays.

Its limitation is that it should only be used within the hidden layers of a neural network model.

For output layers, we should therefore use a softmax function for a classification problem to compute the probabilities of the classes, and simply a linear function for a regression problem.

Another problem with ReLu is that some units can be fragile during training and can die: a weight update can leave a neuron in a state where it never activates on any data point again. Put simply, ReLu can result in dead neurons.

To fix the problem of dying neurons, a modification called Leaky ReLu was introduced. It adds a small slope for negative inputs to keep the updates alive.

There is also the Maxout function, which generalizes both ReLu and Leaky ReLu.

![](http://cdn-images-1.medium.com/max/1600/0*qtfLu9rmtNullrVC.png)
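The activation functions discussed in this section can be written out directly from their formulas. The following is a minimal sketch in plain Python; the function names and the Leaky ReLu slope of 0.01 are assumptions made for illustration, not values given in the text.

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + exp(-x)), range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """f(x) = (1 - exp(-2x)) / (1 + exp(-2x)), range (-1, 1).

    Written out to mirror the formula; math.tanh(x) gives the same value.
    """
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

def relu(x):
    """R(x) = max(0, x)."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Like ReLu, but with a small (assumed) slope for negative inputs."""
    return x if x >= 0 else slope * x

def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))
print(softmax([1.0, 2.0, 3.0]))
```

Note how `relu` returns exactly zero for every negative input, which is what allows a neuron to "die", while `leaky_relu` keeps a small non-zero response.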

## 6.1 Differentiation of Activation Functions
When constructing Artificial Neural Network (ANN) models, one of the primary considerations is choosing activation functions for hidden and output layers that are **differentiable**. This is because calculating the **backpropagated error** signal that is used to determine ANN parameter updates requires the **gradient** of the activation function. Three of the most commonly used activation functions in ANNs are the identity function, the logistic sigmoid function, and the hyperbolic tangent function. Examples of these functions and their associated gradients (derivatives in 1D) are plotted in the figure below:

<img src="https://dustinstansbury.github.io/theclevermachine/assets/images/a-gentle-introduction-to-neural-networks/common_activation_functions.png" width="700">

#### Why do we use differentiation (derivatives)?
When updating the curve, the **slope** tells us in **which direction** and **by how much to change or update the curve**. That is why we use differentiation in almost every part of machine learning and deep learning.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*p_hyqAtyI8pbt2kEl6siOQ.png" width="600">
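As a rough sketch of what these gradients look like in code, the derivatives of the identity, sigmoid, and tanh functions have simple closed forms. The helper `numerical_grad` is a hypothetical name for a central-difference estimate, included only to check the closed forms against the slope measured directly.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def identity_grad(x):
    """The identity function f(x) = x has slope 1 everywhere."""
    return 1.0

def sigmoid_grad(x):
    """d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """d/dx tanh(x) = 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2

def numerical_grad(f, x, h=1e-5):
    """Estimate the slope of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    print("sigmoid", x, sigmoid_grad(x), numerical_grad(sigmoid, x))
    print("tanh   ", x, tanh_grad(x), numerical_grad(math.tanh, x))
```

The sigmoid gradient never exceeds 0.25 and shrinks quickly away from zero, which is one way to see where the vanishing gradient problem comes from.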

<a id="7"></a>
# 7. What happens without activation function?

If we do not apply an activation function, the output signal is simply a linear function of the input. A linear function is just a polynomial of degree one. Linear equations are easy to solve, but they are limited in their complexity and have less power to learn complex functional mappings from data. A neural network without activation functions would simply be a linear regression model, which has limited power and does not perform well most of the time. We want our neural network to learn and compute not just a linear function but something more complicated than that. Without activation functions, our neural network would also be unable to learn and model other complicated kinds of data such as images, video, audio, and speech. That is why we use artificial neural network techniques such as deep learning to make sense of complicated, high-dimensional, non-linear big datasets, where the model has many hidden layers and a very complicated architecture that helps us extract knowledge from such datasets.
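To make the point concrete, here is a small sketch showing that two layers without activation functions compute exactly what a single linear layer computes. The weight matrices `W1` and `W2` are made-up values and biases are left out for brevity (with biases the combined map is still just one linear map plus a constant).

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights: 2 inputs -> 3 hidden nodes -> 1 output
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
W2 = [[1.0, -2.0, 0.5]]
x = [2.0, -1.0]

# Two layers applied one after the other, with no activation function
two_layers = matvec(W2, matvec(W1, x))

# A single layer whose weights are the product W2 * W1
one_layer = matvec(matmul(W2, W1), x)

print(two_layers, one_layer)  # the two results agree
```

However many linear layers are stacked, they can always be collapsed into one in this way, so the extra depth adds no expressive power without a non-linearity in between.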

<a id="8"></a>
# 8. How are NNs different from classical models?

To better understand artificial neural computing, it is important to know first how a conventional 'serial' computer and its software process information. A serial computer has a central processor that can address an array of memory locations where data and instructions are stored. The processor reads an instruction, along with any data the instruction requires, from memory addresses; the instruction is then executed and the results are saved in a specified memory location as required. In a serial system (and a standard parallel one as well) the computational steps are deterministic, sequential and logical, and the state of a given variable can be tracked from one operation to another.

In comparison, ANNs are not sequential or necessarily deterministic. There are no complex central processors; rather, there are many simple ones which generally do nothing more than take the weighted sum of their inputs from other processors. ANNs do not execute programmed instructions; they respond in parallel (either simulated or actual) to the pattern of inputs presented to them. There are also no separate memory addresses for storing data. Instead, information is contained in the overall activation 'state' of the network. 'Knowledge' is thus represented by the network itself, which is quite literally more than the sum of its individual components.

# 9. Overall ANN Internal Behavior
The figure below depicts an example of a 3-4-1 Multi-layered ANN that solves a classification problem:

<img src="https://miro.medium.com/v2/resize:fit:1400/1*rLUL1hmN8E53lqGuei-jyw.png" width="600">

# 10.1 Practical / Manual Example

In this notebook, we will examine computational abstractions of neural networks. These can help us understand the essence of what neurons compute, but can also be used to compute functions that we would not otherwise know how to compute.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import math

%matplotlib inline

## 10.2 The Neuron

The actual working of neurons involves many aspects (including chemical, electrical, physical, timings). We will abstract all of this away into three numbers:

* **Activation** - a value representing the excitement of a neuron
* **Bias** - a value representing a default or bias (sometimes called a threshold)
* **Weight** - a value representing a connection to another neuron

In addition, there is a **transfer function** that takes all of the incoming activations times their associated weights plus the bias, and squashes the resulting sum. This limits the activations from growing too big or too small.

## 10.3 Network of Neurons

To build a network of neurons, we first start by grouping neurons together in **layers**.

A typical Artificial Neural Network (ANN) is composed of three layers: **input**, **hidden**, and **output**. Each layer contains a collection of neurons, or simply **nodes** for short. Typically, the nodes in a layer are **fully connected** to the nodes in the next layer. For instance, every input node will have a weighted connection to every hidden node. Similarly, every hidden node will have a *weighted connection* to every output node.

Processing in a network works as follows. Input is propagated forward from the input layer through the hidden layer and finally through the output layer to produce a response. Each node, regardless of the layer it is in, uses the same transfer function in order to propagate its information forward to the next layer. This is described next.

## 10.4 Activation / Transfer function of a neuron (node)

Each node maintains an activation value that depends on the activation values of its incoming neighbors, the weights from its incoming neighbors, and an additional value, called the **default bias value**. To compute this activation value, we first calculate the node's net input.

The net input is a weighted sum of all the incoming activations plus the node's bias value:

$$ net_i = \sum\limits_{j=1}^n w_{ij} x_j + b_i $$

where $w_{ij}$ is the weight, or connection strength, from the $j^{th}$ node to the $i^{th}$ node, $x_j$ is the activation signal of the $j^{th}$ input node, and $b_i$ is the bias value of the $i^{th}$ node. 

Here is some corresponding Python code to compute this function for each node:

First, we define the indexes for the result nodes (`toNodes`) and the incoming nodes (`fromNodes`):

In [None]:
toNodes = range(0, 2)
fromNodes = range(0, 2)

That allows us to store the weights between the nodes in a matrix, and other related values in lists:

In [None]:
bias       = [0.2, -0.1, 0.5, 0.1]
activation = [0.8, -0.3]
netInput   = [0, 0]
netOutput = [0, 0]
weight = [[ 0.1, -0.8], [-0.3,  0.1]]

We can then compute the net input `netInput[i]` as per the above equation:

In [None]:
# Compute the net input of each receiving node:
# its bias plus the weighted sum of the incoming activations
for i in toNodes:
    netInput[i] = bias[i]
    for j in fromNodes:
        netInput[i] += weight[i][j] * activation[j]
netInput

where `weight[i][j]` is the weight $w_{ij}$, or connection strength, from the $j^{th}$ node to the $i^{th}$ node, `activation[j]` is the activation signal $x_j$ of the $j^{th}$ input node, and `bias[i]` is the bias value $b_i$ of the $i^{th}$ node.

After computing the net input, each node has to compute its output activation. The value that results from applying the activation function to the net input is the signal that will be sent as output to all the nodes in the next layer. The **activation function** used in backprop networks is generally:

$$ a_i = \sigma(net_i) $$

where 

$$ \sigma(x) = \dfrac{1}{1 + e^{-x}} $$

The function `math.exp(x)` returns the exponential of x: $e^{x}$.

In [None]:
def activationFunction(netInput):
    return 1.0 / (1.0 + math.exp(-netInput))

Now we can compute the complete activation of a unit:

In [None]:
for i in toNodes:
    activation[i] = activationFunction(netInput[i])
activation

This $\sigma$ is the activation function, as shown in the plot below. Notice that the function is monotonically increasing and bounded by 0.0 and 1.0 as the net input approaches negative infinity and positive infinity, respectively.

In [None]:
xs = range(-10, 10)
pts = [activationFunction(x) for x in xs]

In [None]:
plt.figure(figsize=(6,4))
plt.plot(xs, pts)
plt.xlabel("input")
plt.ylabel("output")
plt.show()

## 10.5 How to set the weights?

For many years, it was unknown how to learn the weights in a multi-layered neural network. In addition, Marvin Minsky and Seymour Papert showed in their 1969 book [Perceptrons](https://en.wikipedia.org/wiki/Perceptrons_(book)) that some simple functions cannot be computed without multiple layers. (Actually, the idea of using simulated evolution to search for the weights could have been used, but no one thought to do that.)

Specifically, they looked at the function XOR:

**Input 1** | **Input 2** | **Target**
------------|-------------|-------
 0 | 0 | 0
 0 | 1 | 1 
 1 | 0 | 1 
 1 | 1 | 0 

This killed research into neural networks for more than a decade. So, the idea of neural networks generally was ignored until the mid 1980s when the **Back-Propagation of Error** (backprop) was created.
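The difficulty with XOR can be illustrated with a small brute-force search. The sketch below is not a proof, only a demonstration under stated assumptions: it tries every combination of two weights and a bias on a coarse, arbitrarily chosen grid for a single neuron with a step activation, and reports whether any combination reproduces a given truth table. The names `threshold_unit` and `find_single_unit` are hypothetical.

```python
import itertools

def threshold_unit(w1, w2, b, x1, x2):
    """Single neuron with a step activation: fires when the net input is > 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def find_single_unit(truth_table, grid):
    """Search a grid of (w1, w2, b) values for a unit matching the table."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all(threshold_unit(w1, w2, b, x1, x2) == target
               for (x1, x2), target in truth_table.items()):
            return (w1, w2, b)
    return None  # no setting on the grid reproduces the table

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0

print("AND:", find_single_unit(AND, grid))
print("XOR:", find_single_unit(XOR, grid))
```

The search finds weights for AND but comes back empty for XOR, because no single straight line separates the XOR inputs that map to 1 from those that map to 0. A hidden layer removes this restriction.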

## 10.6 Learning Rule - Backpropagation Algorithm

Networks trained with the backpropagation algorithm fall under the category of *supervised learning* schemes. That is, during training, the network is presented with a training input, and the inputs are propagated using the transfer function until output appears in the output layer. The output is then compared with the expected or target output and an error is computed. The error is then backpropagated by applying the learning rule.

### 10.6.1 Learning Rate
The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.

The learning rate may be the most important hyperparameter when configuring your neural network. Therefore it is vital to know how to investigate the effects of the learning rate on model performance and to build an intuition about the dynamics of the learning rate on model behavior.

### 10.6.2 Effects of Learning Rate Values
Given a perfectly configured learning rate, the model will learn to best approximate the function given available resources (the number of layers and the number of nodes per layer) in a given number of training epochs (passes through the training data).

Generally, a large learning rate allows the model to learn faster, at the cost of arriving at a sub-optimal final set of weights. A smaller learning rate may allow the model to learn a more optimal or even globally optimal set of weights but may take significantly longer to train.

At extremes, a learning rate that is too large will result in weight updates that will be too large and the performance of the model (such as its loss on the training dataset) will oscillate over training epochs. Oscillating performance is said to be caused by weights that diverge (are divergent). A learning rate that is too small may never converge or may get stuck on a suboptimal solution.

<img src='https://willamette.edu/~gorr/classes/cs449/figs/descent1.gif' width='200'>

In the simplified function of figure above, the situation is simple. Any step in a downward direction will take us closer to the global minimum. For real problems, however, error surfaces are typically complex, and may more resemble the situation shown in the figure below.

<img src='https://willamette.edu/~gorr/classes/cs449/figs/descent2.gif' width='200'>

Here there are numerous local minima, and the ball is shown trapped in one such minimum. Progress here is only possible by climbing higher before descending to the global minimum.

The figure below shows the difference between different learning rate values, and their effect on under and overfitted models:
<img src="https://machinelearningmastery.com/wp-content/uploads/2018/11/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Learning-Rates-on-the-Blobs-Classification-Problem.png" width="600">
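The three regimes described in this section (too small, reasonable, too large) can be reproduced on a toy problem. The sketch below minimises the made-up function f(w) = w**2, whose gradient is 2*w, with plain gradient descent. The function name, starting point, and learning rate values are assumptions chosen to make the behaviours visible.

```python
def gradient_descent(lr, w0=5.0, steps=20):
    """Minimise f(w) = w**2 and return the list of weights visited."""
    w = w0
    path = [w]
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - lr * gradient  # step against the slope
        path.append(w)
    return path

for lr in (0.01, 0.4, 1.1):
    path = gradient_descent(lr)
    print(f"lr={lr}: final w = {path[-1]:.4g}")
```

With the smallest rate the weight has barely moved towards the minimum at 0 after 20 steps, with the middle rate it gets very close, and with the largest rate each step overshoots by more than the last, so the weight oscillates in sign and grows without bound.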

### 10.6.3 Momentum
Another technique that can help the network out of local minima is the use of a momentum term. This is probably the most popular extension of the backprop algorithm; it is hard to find cases where this is not used. With momentum m, the weight update at a given time t becomes:

<img src='https://willamette.edu/~gorr/classes/cs449/equations/momentum.gif' width='300'>

where 0 < m < 1 is a new global parameter which must be determined by trial and error. Momentum simply adds a fraction m of the previous weight update to the current one. When the gradient keeps pointing in the same direction, this will increase the size of the steps taken towards the minimum. It is therefore often necessary to reduce the global learning rate µ when using a lot of momentum (m close to 1). If you combine a high learning rate with a lot of momentum, you will rush past the minimum with huge steps!

### 10.6.4 Effects of Momentum Values
When the gradient keeps changing direction, momentum will smooth out the variations. This is particularly useful when the network is not well-conditioned. In such cases the error surface has substantially different curvature along different directions, leading to the formation of long narrow valleys. For most points on the surface, the gradient does not point towards the minimum, and successive steps of gradient descent can oscillate from one side to the other, progressing only very slowly to the minimum (figure below). 

<img src='https://willamette.edu/~gorr/classes/cs449/figs/valley1.gif' width='300'>

And now, how the addition of momentum helps to speed up convergence to the minimum by damping these oscillations:

<img src='https://willamette.edu/~gorr/classes/cs449/figs/valley2.gif' width='300'>

The figure below shows the difference between different momentum values, and their effect on under- and overfitted models:

<img src="https://machinelearningmastery.com/wp-content/uploads/2018/11/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Momentums-on-the-Blobs-Classification-Problem.png" width="600">

Both momentum and learning rate are crucial in training neural networks, as they can significantly impact the convergence and performance of the model. Finding the right balance for these hyperparameters is often a key part of successfully training a neural network.
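The effect of momentum in a long narrow valley can be sketched on a toy error surface. The example below assumes the usual form of the update (new update = -learning rate × gradient + m × previous update) and a made-up error function E(x, y) = 0.5 * (x**2 + 50 * y**2), which is 50 times steeper along y than along x. The function name `descend` and all numeric values are illustrative assumptions.

```python
def descend(lr, momentum, steps=100):
    """Run gradient descent with momentum and return the final error."""
    x, y = 1.0, 1.0    # starting point on the error surface
    dx, dy = 0.0, 0.0  # previous weight updates (none yet)
    for _ in range(steps):
        grad_x, grad_y = x, 50.0 * y
        # Current update = plain gradient step + a fraction of the last update
        dx = -lr * grad_x + momentum * dx
        dy = -lr * grad_y + momentum * dy
        x, y = x + dx, y + dy
    return 0.5 * (x ** 2 + 50.0 * y ** 2)

print("no momentum :", descend(lr=0.03, momentum=0.0))
print("momentum 0.8:", descend(lr=0.03, momentum=0.8))
```

Without momentum the learning rate has to stay small to remain stable in the steep direction, so progress along the shallow direction is slow. With momentum the same learning rate reaches a much lower error in the same number of steps.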

The backpropagation algorithm, also called the *generalized delta rule*, systematically changes the weights by using a weight change equation. We use an optional momentum term in the weight change rule to help speed up convergence. The weight change rule is different for weights between the hidden-output layer nodes and the input-hidden layer nodes. For the hidden-output layer nodes it is:

In [None]:
desiredOutput = [0.1, 0.2]
actualOutput = [0.8, 0.6]

error = [0.0 for _ in desiredOutput]
delta = [0.0 for _ in desiredOutput]

EPSILON = 0.1   # Learning rate
MOMENTUM = 0.9 # Momentum term

# Assuming weight is the weight matrix from the forward pass
weightUpdate = [[0.0 for _ in fromNodes] for _ in toNodes]

And the learning rule applied:

In [None]:
for i in toNodes:
    error[i] = desiredOutput[i] - actualOutput[i]
    delta[i] = error[i] * actualOutput[i] * (1 - actualOutput[i])
    for j in fromNodes:
        weightUpdate[i][j] = (EPSILON * delta[i] * actualOutput[j]) + (MOMENTUM * weightUpdate[i][j])

# Apply weight updates to the network's weights
for i in toNodes:
    for j in fromNodes:
        weight[i][j] += weightUpdate[i][j]

weight

That is, at the $i^{th}$ output node, the error is the difference between desired and actual outputs. The weight change between a hidden layer node $j$ and output node $i$ --- `weightUpdate[i][j]` --- is a fraction of the computed delta value and additionally a fraction of the weight change from the previous training step. **MOMENTUM** is a constant that ranges between 0.0 and 1.0 and **EPSILON** is called the **learning rate** and is also a constant that varies between 0.0 and 1.0.

In the above code `delta[i] * actualOutput[j]` is, up to sign, the partial derivative of the overall error with respect to each weight (here `actualOutput[j]` stands in for the activation that node $j$ sends along the connection). This is the slope of the error. Thus, backprop changes the weight by a small fraction of the slope of the error. We only know the slope of this curve, not its shape, and thus have to take very small steps.

And that is all of the math, and Python, necessary to train a back-propagation of error neural network. Even though this is a very simple formulation, it has been shown that such a three-layer network (input, hidden, output) can, given enough hidden nodes, approximate a very wide class of functions to arbitrary accuracy.

## 1.6 Training a Neural Network

Given a task, how does one train a neural network to do/solve the task? This involves the following steps:

 1. Determine an appropriate network architecture.
 1. Define a data set that will be used for training.
 1. Define the neural network parameters to be used for training.
 1. Train the network.
 1. Test the trained network.
 1. Do post training analysis.

### 1.6.1 Determining an appropriate architecture

Recall that a neural network consists of an input layer, an output layer, and zero or more hidden layers. Once a network has been trained, when you present an input to the network, the network will propagate the inputs through its layers to produce an output (using the transfer function described above). If the input represents an instance of the task, the output should be the solution to that instance after the network has been trained. Thus, one can view a neural network as a general pattern associator. Thus, given a task, the first step is to identify the nature of inputs to the pattern associator. This is normally in the form of number of nodes required to represent the input. Similarly, you will need to determine how many output nodes will be required. For example, consider a simple logical connective, AND whose input-output characteristics are summarized in the table below:

**Input A** | **Input B** | **Target**
------------|-------------|-------
 0 | 0 | 0
 0 | 1 | 0 
 1 | 0 | 0 
 1 | 1 | 1 

This is a very simple example, but it will help us illustrate all of the important concepts in defining and training neural networks.

In this example, it is clear that we will need two nodes in the input layer, and one in the output layer. We can start by assuming that we will not need a hidden layer. In general, as far as the design of a neural network is concerned, you always begin by identifying the size of the input and output layers. Then, you decide how many hidden layers you would use. In most situations you will need one hidden layer, though there are no hard and fast rules about its size. Through much empirical practice, you will develop your own heuristics about this. We will return to this issue later. In the case of the AND network, it is simple enough that we have decided not to use any hidden layers.

### 1.6.2 Define a data set that will be used for training

Once you have decided on the network architecture, you have to prepare the data set that will be used for training. Each item in the data set represents an input pattern and the correct output pattern that should be produced by the network (since this is supervised training). In most tasks, there can be an infinite number of such input-output associations. Obviously it would be impossible to enumerate all associations for all tasks (and it would make little sense to even try to do this!). You have to then decide what comprises a good representative data set that, when used in training a network, would generalize to all situations.

In the case of the AND network, the data set is very small, finite (only 4 cases!), and exhaustive.

The other issue you have to take into consideration here is that of the range of each input and output value. Remember the transfer function of a node is a sigmoid-function that serves to squash all input values between 0.0 and 1.0. Thus, regardless of the size of each input value into a node, the output produced by each node is between 0.0 and 1.0. This means that all output nodes have values in that range. If the task you are dealing with expects outputs between 0.0 and 1.0, then there is nothing to worry about. However, in most situations, you will need to *scale* the output values back to the values in the task domain. 

In reality, it is also a good idea to scale the input values from the domain into the 0.0 to 1.0 range (especially if most input values are outside the -5.0 and 5.0 range). Thus, defining a data set for training almost always requires a collection of input-output pairs, as well as scaling and unscaling operations. Luckily, for the AND task, we do not need to do any scaling, but we will see several examples of this later.
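A minimal sketch of the scaling and unscaling operations, assuming a simple linear (min-max) mapping between a known domain range and the 0.0 to 1.0 range. The helper `make_scaler` and the temperature range are hypothetical and used only for illustration.

```python
def make_scaler(lo, hi):
    """Return (scale, unscale) functions mapping [lo, hi] <-> [0.0, 1.0]."""
    span = hi - lo

    def scale(value):
        # Domain value -> network value in [0.0, 1.0]
        return (value - lo) / span

    def unscale(value):
        # Network value in [0.0, 1.0] -> domain value
        return value * span + lo

    return scale, unscale

# Hypothetical task domain: temperatures between -20 and 40 degrees
scale, unscale = make_scaler(-20.0, 40.0)

print(scale(10.0))            # 0.5
print(unscale(scale(25.0)))   # back to 25.0
```

The same pair of functions would be applied to targets before training and to network outputs afterwards, so that results are reported in the units of the task.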

### 1.6.3 Define the neural network parameters

The next step is to define the parameters required to train the neural network. These include the following:

 1. The learning rate
 1. The momentum constant
 1. The tolerance
 1. Other training-related parameters

The learning rate, EPSILON, and the momentum constant, MOMENTUM, have to be between 0.0 and 1.0 and are critical to the overall training algorithm. The appropriate values of these constants are best determined by experimentation. Tolerance (which is also between 0.0 and 1.0) is the largest difference between an output and its target that still counts as correct. For example, if tolerance is set to 0.1, then an output value within 0.1 of the desired output is considered correct (for a target of 1.0, any output of 0.9 or above). Other training parameters generally specify how often training progress is reported, where that progress is logged, and so on. We will see specific examples of these as we start working with actual networks.
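As a rough sketch of how a tolerance check might look, the snippet below treats tolerance as an absolute difference between output and target. That interpretation, the function name `within_tolerance`, and the sample output values are all assumptions made for illustration.

```python
import numpy as np

def within_tolerance(outputs, targets, tolerance=0.1):
    """Return a boolean array: True where |output - target| <= tolerance."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.abs(outputs - targets) <= tolerance

# Made-up network outputs for the four AND patterns
outputs = [0.04, 0.08, 0.15, 0.93]
targets = [0, 0, 0, 1]

correct = within_tolerance(outputs, targets, tolerance=0.1)
print(correct)         # the third pattern (0.15 vs. 0) misses the tolerance
print(correct.mean())  # fraction of patterns counted as correct
```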

For the AND network, we will set EPSILON to 0.5 and MOMENTUM to 0.0, and report the progress every 5 epochs (see below).

### 1.6.4 Train the network

Once all the parameters are specified, you start the training process. This involves presenting each input pattern to the network, propagating it all the way until an output is produced, comparing the output with the desired target, computing the error, backpropagating the error, and applying the learning rule. This process is repeated until all inputs are exhausted. A single pass through an entire data set is called an *epoch*. In general, you train the network for many epochs (anywhere from a few hundred to millions!) until its performance improves and stabilizes. Performance of the network is generally measured in terms of the *total sum-squared error*, or *TSS* for short. This is the error in each pattern squared and summed over all the patterns. Initially, you will notice that the TSS is quite high, but it will slowly decrease as the number of epochs increases.

You can either stop the training process after a certain number of epochs have elapsed, or after the TSS has decreased to a specific amount.
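The TSS and the two stopping criteria can be sketched as follows. The function names and the default values of `max_epochs` and `tss_goal` are hypothetical choices for illustration, not values prescribed by the text.

```python
import numpy as np

def total_sum_squared_error(outputs, targets):
    """Square the error of each pattern and sum over all patterns."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sum((targets - outputs) ** 2))

def should_stop(epoch, tss, max_epochs=1000, tss_goal=0.01):
    """Stop when the epoch budget is used up or the TSS is low enough."""
    return epoch >= max_epochs or tss <= tss_goal

# Made-up outputs for the four AND patterns
tss = total_sum_squared_error([0.1, 0.2, 0.1, 0.8], [0, 0, 0, 1])
print(tss)                                   # approximately 0.10
print(should_stop(epoch=10, tss=tss))        # neither criterion is met yet
print(should_stop(epoch=1000, tss=tss))      # epoch budget exhausted
```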

### 1.6.5 Test the trained network

Once the network has been trained, it is time to test it. There are several ways of doing this. Perhaps the easiest is to turn learning off (another training parameter) and then see the outputs produced by the network for each input in the data set. When a trained network is going to be used in a *deployed* application, all you have to do is save the weights of all interconnections in the network into a file. The trained network can then be recreated at any time by reloading the weights.
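One possible way to save and reload weights is NumPy's `.npz` format, sketched below. The helper names, the file name `and_weights.npz`, and the weight values are assumptions for illustration, and the example writes to a temporary directory.

```python
import os
import tempfile
import numpy as np

def save_weights(path, weights, bias):
    """Store the weights and bias of a network in a single .npz file."""
    np.savez(path, weights=weights, bias=bias)

def load_weights(path):
    """Read the weights and bias back from a .npz file."""
    with np.load(path) as data:
        return data["weights"], data["bias"]

# Hypothetical trained parameters for a two-input neuron
weights = np.array([0.7, 0.6])
bias = np.array([-1.0])

path = os.path.join(tempfile.mkdtemp(), "and_weights.npz")
save_weights(path, weights, bias)
loaded_weights, loaded_bias = load_weights(path)
print(loaded_weights, loaded_bias)
```

Recreating the network then amounts to building the same architecture and assigning the loaded arrays to it.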

### 1.6.6 Do post training analysis

Perhaps the most important step in using neural networks is the analysis one performs once a network has been trained. There is a whole host of analysis techniques, and we will present some of them as we go along.

# 2. Learning AND - Calculate logic gate result

Consider the AND logic results when possible inputs come into play:

 <img src="https://cdn.shopify.com/s/files/1/0611/1644/9018/files/AND_Logic_Gate_symbol_with_truth_table_600x600.png?v=1681242963" width="500">

I have implemented a simple artificial neural network (ANN) from scratch to solve the AND problem. The network was trained with input data corresponding to the truth table of the AND logical operation. After training, the network successfully predicted the correct outputs for all input combinations.

The AND gate is so named because, if 0 is called "false" and 1 is called "true," the gate acts in the same way as the logical "and" operator. The illustration and table above show the circuit symbol and logic combinations for an AND gate. (In the symbol, the input terminals are at left and the output terminal is at right.) The output is "true" when both inputs are "true." Otherwise, the output is "false." In other words, the output is 1 only when input one AND input two are both 1.

This implementation uses a single neuron with a sigmoid activation function and a gradient-based weight update rule with momentum. It's a rudimentary form of ANN, sufficient for demonstrating basic logic operations like AND:

## 2.1 Define the AND logic

In [None]:
import numpy as np

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = np.array([0, 0, 0, 1])  # AND logic

## 2.2 Implement an ANN

In [None]:
class ANN:
    def __init__(self, learning_rate=0.1, momentum=0.9):
        # Initialize weights and bias to random values in [0, 1)
        self.weights = np.random.rand(2)
        self.bias = np.random.rand()  # scalar bias, so predict() returns a scalar
        self.learning_rate = learning_rate
        self.momentum = momentum
        # Initialize the velocity for momentum
        self.velocity_weights = np.zeros(2)
        self.velocity_bias = 0

    def activation(self, x):
        # Sigmoid activation function
        return 1 / (1 + np.exp(-x))

    def derivative_activation(self, x):
        # Derivative of the sigmoid, written in terms of its output:
        # x is expected to be sigmoid(z), not the raw weighted sum z
        return x * (1 - x)

    def predict(self, input_features):
        # Compute the weighted sum and add bias
        weighted_sum = np.dot(input_features, self.weights) + self.bias
        # Apply activation function
        return self.activation(weighted_sum)

    def train(self, training_inputs, labels, epochs):
        loss_history = []
        accuracy_history = []

        for _ in range(epochs):
            total_loss = 0
            correct_predictions = 0

            for inputs, label in zip(training_inputs, labels):
                prediction = self.predict(inputs)
                error = label - prediction

                # Calculate the loss (squared error)
                total_loss += error**2

                # Update the weights and bias by gradient descent with momentum
                # (for a single neuron there are no hidden layers to backpropagate through)
                gradient = error * self.derivative_activation(prediction)
                self.velocity_weights = (self.momentum * self.velocity_weights) + (self.learning_rate * gradient * inputs)
                self.velocity_bias = (self.momentum * self.velocity_bias) + (self.learning_rate * gradient)
                
                self.weights += self.velocity_weights
                self.bias += self.velocity_bias

                # Count correct predictions
                if (prediction > 0.5) == label:
                    correct_predictions += 1

            # Record loss and accuracy for each epoch
            average_loss = total_loss / len(training_inputs)
            accuracy = correct_predictions / len(training_inputs)
            loss_history.append(average_loss)
            accuracy_history.append(accuracy)

        return loss_history, accuracy_history

# Initialize and train the enhanced ANN
ann = ANN(learning_rate=0.05, momentum=0.9)
loss_history, accuracy_history = ann.train(inputs, labels, epochs=500)
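
Read off the code above, the update applied for each training pattern can be written out as follows. The symbols are notation introduced here for illustration: $\eta$ stands for `learning_rate`, $\mu$ for `momentum`, $t$ for the label, $y$ for the prediction, and $v_w$, $v_b$ for `velocity_weights` and `velocity_bias`.

$$
\begin{aligned}
y &= \sigma(w \cdot x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
\delta &= (t - y)\, y\, (1 - y) \\
v_w &\leftarrow \mu\, v_w + \eta\, \delta\, x, \qquad v_b \leftarrow \mu\, v_b + \eta\, \delta \\
w &\leftarrow w + v_w, \qquad b \leftarrow b + v_b
\end{aligned}
$$

With $\mu = 0$ this reduces to a plain gradient step on the squared error, up to a constant factor absorbed into $\eta$.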

## 2.3 Test the model and calculate predictions

In [None]:
# Testing the trained ANN
predictions = [ann.predict(input_data) for input_data in inputs]
predictions

#### Add a threshold for classification

In [None]:
adjusted_predictions = [1 if ann.predict(input_data) > 0.5 else 0 for input_data in inputs]
adjusted_predictions

#### Calculate final accuracy score

In [None]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
acc_score = accuracy_score(labels, adjusted_predictions)
print("Accuracy: ", acc_score)

In [None]:
import matplotlib.pyplot as plt
# Plotting the loss and accuracy
plt.figure(figsize=(8, 4))
plt.plot(loss_history)
plt.plot(accuracy_history)
plt.title('Accuracy vs. Loss Convergence over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy / Loss')
plt.legend(['Loss', 'Accuracy'])
plt.tight_layout()
plt.show()