# Artificial Neural Networks


# Neural Network

Computers can perform complex numerical calculations in the blink of an eye. In theory, they even have more processing power than the human brain: $10^9$ transistors with a switching time of $10^{-9}$ seconds against $10^{11}$ neurons with a switching time of about $10^{-3}$ seconds. Still, brains outperform computers in almost every aspect. What AI wants to do is to make computers learn and make decisions based on that knowledge. One of the tools that AI uses to achieve that goal is __Neural Networks__. They try to imitate a few abilities that the brain has and computers don’t. Basically, what Artificial Neural Networks (ANNs) are trying to accomplish is to introduce brain functionalities to a computer by copying the behavior of the nervous system. In essence, ANNs are copying the nature God created.

# Biology behind the Idea

So, which of nature’s concepts are Neural Networks trying to imitate? As you are probably aware, the smallest unit of the nervous system is a __neuron__. Neurons are cells with a similar and simple structure. Yet, by continuous communication, these cells achieve enormous processing power. Put in simple terms, neurons are just switches. These switches generate an output signal if they receive a certain amount of input stimuli. This output signal is the input for another neuron.

<table><tr>
    <td> <img src="images/in search of memory_big.jpg" width="300"> </td> 
<td><img src="images/two_neurons.png" width="300" /></td> 
    </tr>    </table>
<center>Figure 1: Biological Neurons and Conceptual drawing of two Neurons </center>

Each neuron has these components:

- Body, also known as soma
- Dendrites
- Axon

The body (soma) of a neuron carries out the basic life processes of the cell. Every neuron has a single axon. This is a long part of the cell; in fact, some axons run the entire length of the spine. It acts like a wire and is the output of the neuron. Dendrites, on the other hand, are the inputs of the neuron, and each neuron has multiple dendrites. The axons and dendrites of different neurons never touch each other, even though they come close.

These gaps between axons and dendrites are called synapses. Through these synapses, signals are carried by neurotransmitter molecules. There are various neurotransmitter chemicals, and each serves a different type of neuron. Among them are the famous serotonin and dopamine. The amount and type of these chemicals dictate how “strong” the input to the neuron is. If there is enough input on all dendrites, the soma will “fire up” the signal on the axon and transmit it to the next neuron.

# Main Components and Concepts of Neural Networks

Before we dive into the concepts and components of neural networks, let’s reflect on their goal. What we want them to do is to learn certain processes the way we do. For example, after we show an image of a dog to the neural network a few times, we expect that the next time we show it, the network will be able to say “Ok, that is a dog”. So, how do neural networks do that?

Based on the elements of the nervous system, Artificial Neural Networks are composed of small processing units (neurons) and weighted connections between them. The weight of a connection simulates the amount of neurotransmitter transferred between neurons, described in the previous section. Mathematically, we can define a __Neural Network__ as an ordered triple $(N, C, w)$, where $N$ is the set of neurons, $C$ is a set $\{(i, j) \mid i, j \in N\}$ whose elements are connections between neurons $i$ and $j$, and $w(i, j)$ is the weight of the connection between neurons $i$ and $j$.

Usually, a neuron receives the outputs of many other neurons as its input. A propagation function combines them, taking the connection weights into account, into the so-called network input of that neuron. Often, this propagation function is just the sum of the weighted inputs, the weighted sum. Mathematically, it is defined like this:

\begin{align}
net = \sum_{k} i_k \cdot w_k
\end{align}

where $net$ is the network input, $i_k$ is the value of the $k$-th input and $w_k$ is the weight of the connection through which that input arrived. The network input is then processed by the activation function. This function decides whether the output of the neuron will be active. It simulates the functionality of the biological soma, which fires an output only if the stimuli on the input are strong enough.
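To make the weighted sum and the activation step concrete, here is a minimal Python sketch of a single neuron. The function names, the threshold value of `1.0` and the example numbers are assumptions chosen for illustration; they do not come from a particular library.

```python
def weighted_sum(inputs, weights):
    """Propagation function: sum of each input times its connection weight."""
    return sum(i * w for i, w in zip(inputs, weights))


def step_activation(net, threshold=1.0):
    """Hypothetical threshold activation: fire (1) only if net reaches the threshold."""
    return 1 if net >= threshold else 0


inputs = [1.0, 0.0, 1.0]     # outputs of three upstream neurons
weights = [0.5, 0.9, 0.75]   # weights of the three incoming connections

net = weighted_sum(inputs, weights)   # 1.0*0.5 + 0.0*0.9 + 1.0*0.75 = 1.25
output = step_activation(net)         # 1.25 >= 1.0, so the neuron fires
print(net, output)
```

Lowering the weights or raising the threshold keeps the neuron silent, which mirrors the soma firing only on strong enough stimuli.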

# A bit of history 

Deep learning and artificial intelligence are big buzzwords today, aren’t they? However, this field is not quite as new as most people think. We as humans were always interested in the way we think and the structure of our brain. Of course, I am not saying that our great-great-great ancestors were trying to build Artificial Neural Networks, but there was always a certain curiosity about thinking and learning processes. With the advance of modern electronics, this curiosity was harnessed and we started exploring ways in which we can build a thinking machine. The roots of the field can be traced back to 1943, when a young mathematician, __Walter Pitts__, and a neurophysiologist, __Warren McCulloch__, wrote a paper that introduced the first model of neurological networks. They explained how neurons might work and how we can replicate this behavior using simple circuits.

<table><tr>
    <td> <img src="images/JohnvonNeumann.jpg" width="400" /> </td> 
<td><img src="images/WarrenMcCulloch.png" width="400" /></td> 
    <td><img src="images/WalterPitts.jpg" width="200"/></td> 
    <td><img src="images/FrankRosenblatt.jpg" width="300"/></td>
    </tr>    </table>
<center>Figure 1: John von Neumann, Warren McCulloch, Walter Pitts and Frank Rosenblatt </center>

<img src="images/PerceptronAlgorithm.png" />
<center>Figure 2: Perceptron Algorithm </center>

This figure explains how the perceptron receives the inputs of a sample $\mathbf{x}$ and combines them with the weights $\mathbf{w}$ to compute the net input $z$. The net input $z$ is then passed on to the activation function $h$ (the step function $step(z)$ in this case), which generates a binary output $-1$ or $+1$, the predicted class label of the sample. During the learning phase, this output $\hat{y}$ is used to calculate the error of the prediction and update the weights.
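The perceptron described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the learning rate `eta`, the bias term `b`, the zero initialization and the AND-style toy data are choices made here, not details given in the text.

```python
def net_input(x, w, b):
    """Combine the inputs of one sample with the weights (plus an assumed bias)."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def step(z):
    """Step activation: predicted class label is +1 or -1."""
    return 1 if z >= 0.0 else -1


def train_perceptron(samples, labels, eta=0.5, epochs=10):
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            y_hat = step(net_input(x, w, b))
            error = y - y_hat            # 0 when correct, +2 or -2 when wrong
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]
            b += eta * error
    return w, b


# Toy, linearly separable data: label +1 only when both inputs are 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]

w, b = train_perceptron(X, y)
predictions = [step(net_input(x, w, b)) for x in X]
print(predictions)
```

Note that the weights change only when the binary prediction is wrong; a correct prediction gives an error of zero and leaves them untouched.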

_The Organization of Behavior_, a book written by Donald O. Hebb, reinforced this concept and introduced the Hebbian rule. This rule implies that the connection between two neurons is strengthened when both neurons are active. However, testing these theories was limited until computers gained enough processing power in the 1950s. Then it became possible to test some of them, and Nathaniel Rochester of IBM was one of the first to try to simulate a neural network in the lab. Even though this first experiment failed, it blazed the path for further experiments. In 1956, well-known scientists and ambitious students met at the Dartmouth Summer Research Project and discussed how to simulate a brain.

Somewhere around that time, around 1956-1957, _Frank Rosenblatt_ and neurobiologist Charles Wightman developed the first successful neurocomputer, the Mark I Perceptron. Intrigued by the functions of the eye of a fly, they made a computer that was able to recognize simple numerals. It used a simple 20×20 pixel sensor and worked with 512 motor-driven potentiometers, which were used to adjust the weights of the connections. The first neural network used in the real world was MADALINE, developed in 1959 by Bernard Widrow and Marcian Hoff of Stanford. MADALINE is actually an adaptive filter that eliminates echoes on phone lines, and it is still in commercial use. Cool fact, right?

Only a few years after Frank Rosenblatt's perceptron algorithm, the Adaptive Linear Neuron (_Adaline_) was published by __Bernard Widrow__ and his doctoral student Ted Hoff, and it can be considered an improvement on the perceptron (B. Widrow et al., Adaptive "Adaline" neuron using chemical "Memistors"). The Adaline algorithm is particularly interesting because it illustrates the key concept of defining and _minimizing cost functions_, which lays the groundwork for understanding more advanced machine learning algorithms for classification, such as logistic regression and support vector machines, as well as regression models that we will discuss in future lectures.

<img src="images/AdalineAlgorithm.png" />
<center>Figure 3: Adaline Algorithm </center>

The key difference between the Adaline and Perceptron algorithm is that the weights are updated based on a linear activation function rather than a step function like in the perceptron. In Adaline, this linear activation function $h(z)$ is simply the identity function of the net input so that 

\begin{align}
h(\mathbf{z}) = h(\mathbf{w}^T\mathbf{x}) = \mathbf{w}^T\mathbf{x}.
\end{align}

While the linear activation function is used for learning the weights, a _quantizer_, which is similar to the step function, can then be used to predict the class labels, as illustrated in the figure above.

If we compare the preceding figure to the illustration of the perceptron algorithm, the difference is that we now use the continuous-valued output from the linear activation function to compute the model error and update the weights, rather than the binary class labels.
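The following sketch illustrates the Adaline idea: the error comes from the continuous output of the identity activation, and the quantizer is applied only when predicting labels. The sum-of-squared-errors cost, the learning rate, the bias term and the toy data are assumptions made for this example.

```python
def net_input(x, w, b):
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def quantizer(z):
    """Turns the continuous output into a class label (+1 or -1)."""
    return 1 if z >= 0.0 else -1


def train_adaline(samples, labels, eta=0.1, epochs=200):
    w = [0.0] * len(samples[0])
    b = 0.0
    costs = []
    for _ in range(epochs):
        # Identity activation: the output used for learning is the net input itself.
        outputs = [net_input(x, w, b) for x in samples]
        errors = [y - out for y, out in zip(labels, outputs)]
        # Update every weight from the errors of the whole training set.
        for j in range(len(w)):
            w[j] += eta * sum(e * x[j] for e, x in zip(errors, samples))
        b += eta * sum(errors)
        # Assumed cost: half the sum of squared errors (before this update).
        costs.append(0.5 * sum(e * e for e in errors))
    return w, b, costs


X = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [-1, -1, -1, 1]

w, b, costs = train_adaline(X, labels)
predictions = [quantizer(net_input(x, w, b)) for x in X]
print(predictions, costs[0], costs[-1])
```

Because the error is continuous, the recorded cost shrinks smoothly from epoch to epoch, which is what makes it possible to talk about minimizing a cost function at all.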

# How do Artificial Neural Networks learn?

The most important aspect of Artificial Neural Networks is __learning__. The biggest power of these systems is that they can be familiarized with some kind of problem in the process of training and are later able to solve problems of the same class, just like humans do! Before we dive into that exciting topic, let’s have a quick recap of some of the most important components of artificial neural networks and their architecture.

The smallest and most important unit of the artificial neural network is the __neuron__. As in biological neural systems, these neurons are connected with each other, and together they have great processing power. In general, ANNs try to replicate the behavior and processes of the real brain, and that is why their architecture is modeled on biological observations. The same goes for the artificial neuron: its structure is reminiscent of the structure of a real neuron.


Every neuron has input connections and output connections. These connections simulate the behavior of the synapses in the brain. The same way that synapses in the brain transfer the signal from one neuron to another, connections pass information between artificial neurons. These connections have weights, meaning that the value sent over a connection is multiplied by this factor. Again, this is inspired by brain synapses: the weights simulate the amount of neurotransmitter passed between biological neurons. So, an important connection will have a bigger weight value than connections that are not important.

Since numerous values can arrive at one neuron, every neuron has a so-called _input function_. Input values from all weighted connections are usually summed up, which is done by the __weighted sum__ function. This value is then passed to the __activation function__, whose job is to calculate whether a signal should be sent to the output of the neuron.

We can (and usually do) have multiple layers of neurons in each ANN. It looks something like this:


<img src="images/MultiLayerNeuralNetwork.png"  >
<center>Figure 9: Multiple Layered Artificial Neural Network </center>
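As a rough sketch of how values flow through such a layered network, the code below chains the weighted sum and an activation function layer by layer. The sigmoid activation, the bias terms, the layer sizes and the weight values are all assumptions picked for illustration.

```python
import math


def sigmoid(z):
    """Assumed smooth activation function with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """weights[j] holds the incoming connection weights of neuron j in this layer."""
    return [
        sigmoid(sum(i * w for i, w in zip(inputs, neuron_weights)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]

output = network_forward([1.0, 0.5], layers)
print(output)
```

Each layer only sees the outputs of the layer before it, which is why the output of one neuron can serve as the input of many others.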

## Learning

If we observe nature, we can see that systems that are able to learn are highly adaptable. In their quest to acquire knowledge, these systems use input from the outside world and modify information that they’ve already collected, or modify their internal structure. That is exactly what ANNs do. They adapt and modify their architecture in order to learn. To be more precise, the ANNs change weights of connections based on input and desired output.

“Why weights?”, one might ask. Well, if you look closer into the structure of the ANNs, there are a few components we could change inside of the ANN if we want to modify their architecture. For example, we could create new connections among neurons, or delete them, or add and delete neurons. We could even modify input function or activation function. As it turns out, changing weights is the most practical approach. Plus, most of the other cases could be covered by changing weights. Deleting a connection, for example, can be done by setting the weight to 0. And a neuron can be deleted if we set weights on all its connections to zero.
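A tiny sketch of the last point: a connection with weight 0 contributes nothing to the weighted sum, so the neuron behaves as if that connection did not exist. The input and weight values are made up for the example.

```python
def weighted_sum(inputs, weights):
    return sum(i * w for i, w in zip(inputs, weights))


inputs = [0.7, 0.2, 0.9]
weights = [0.4, 0.0, 0.5]   # the middle connection is "deleted" via weight 0

with_zero_weight = weighted_sum(inputs, weights)
without_connection = weighted_sum([0.7, 0.9], [0.4, 0.5])   # connection really removed

print(with_zero_weight == without_connection)
```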

## Training

Training is a necessary process for every ANN; it is the process in which the ANN gets familiar with the problem it needs to solve. In practice, we usually have some collected data on which we base our predictions, classification, or any other processing. This data is called the training set. Based on the behavior during training and the nature of the training set, we distinguish a few classes of learning:

- Unsupervised learning – The training set contains only inputs. The network attempts to identify similar inputs and to put them into categories. This type of learning is biologically motivated, but it is not suitable for all problems.

- Reinforcement learning – The training set contains inputs, but the network is also provided with additional information during training. Once the network calculates the output for one of the inputs, we provide information that indicates whether the result was right or wrong and, possibly, the nature of the mistake that the network made.

- Supervised learning – The training set contains inputs and desired outputs. This way the network can compare its calculated output with the desired output and take appropriate actions based on that.

Supervised learning is the most commonly used, so let’s dig a little deeper into this topic. Basically, we get a training set that contains vectors of input values and vectors of desired output values. Once the network calculates the output for one of the inputs, the __cost function__ calculates the error. This error indicates how close our guess is to the desired output. One of the most used cost functions is the __mean squared error__ function:

\begin{align}
C(w, b) = \frac{1}{2n}\sum_x (y - \hat{y})^2.
\end{align}

Here, $y$ is the class label or the desired output, $\hat{y}$ is the output generated by the ANN, which is a function of the training input vector $\mathbf{x}$, and $n$ is the number of training inputs. Note also that the cost depends on $w$ and $b$, which represent the weights and biases, respectively.

Now, this error is sent back through the neural network, and the weights are modified accordingly. This process is called __backpropagation__. Backpropagation is an advanced mathematical algorithm that gives the Artificial Neural Network the ability to adjust all weights at once. It is a complex topic and would require an entirely separate lecture. The important thing to remember here is that by using this algorithm, ANNs are able to modify weights efficiently.
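As a small worked example of the cost formula above, the function below computes $\frac{1}{2n}\sum (y - \hat{y})^2$ for lists of desired and generated outputs. The function name and the sample values are made up for illustration.

```python
def mse_cost(desired, predicted):
    """Mean squared error with the 1/(2n) factor used in the formula above."""
    n = len(desired)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(desired, predicted)) / (2 * n)


desired = [1.0, 0.0, 1.0, 1.0]      # y: desired outputs
predicted = [0.5, 0.5, 1.0, 0.0]    # y_hat: outputs generated by the network

cost = mse_cost(desired, predicted)  # (0.25 + 0.25 + 0.0 + 1.0) / 8 = 0.1875
print(cost)
```

A perfect prediction gives a cost of zero, and the cost grows with the square of each individual mistake.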

# Gradient Descent

The entire point of training is to set the weights to the correct values, so that we get the desired output from our neural network. This means that we are trying to make the error as small as possible, i.e. to find a global minimum of the cost function. One way of solving this problem is to use calculus directly: we could compute derivatives and then solve for the places where the cost function has an extremum. However, the cost function is not a function of one or a few variables; it is a function of all weights in the network, so these calculations quickly grow into a monster. That is why we use a technique called __gradient descent__.

There is one useful analogy that describes this process quite well. Imagine that you had a ball inside a rounded valley like in the picture below. If you let the ball roll, it will go from one side of the valley to the other, until it gets to the bottom.

<img src="images/GradientDescent1.png"  width="400">
<center>Figure 5: Gradient Descent</center>




Essentially, we can view this behavior as the ball optimizing its position from left to right until it eventually comes to the bottom, i.e. the lowest point of the valley. The bottom, in this case, is the minimum of our error function. This is what the gradient descent algorithm does. It starts from one position and, by calculating the derivatives of the cost function $C$, gets the information about where “the ball” should roll. Every time we calculate the derivatives, we get information about the slope of the side of the valley at the current position. This is represented in the picture below with the blue line.

When the slope is negative (downward from left to right), the ball should move to the right; otherwise, it should move to the left. Be aware that the ball is just an analogy, and we are not trying to develop an accurate simulation of the laws of physics. We are trying to get to the minimum of the function using this iterative method, since we already realized that solving for the minimum analytically is not practical.

<img src="images/GradientDescent2.png"  width="400">
<center>Figure 6: Gradient Descent Derivatives</center>
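The rolling-ball picture can be tried out on a cost function with a single weight. In this sketch the valley is the hypothetical function $C(w) = (w - 3)^2$, and the learning rate `eta` and the number of steps are assumptions chosen for the example.

```python
def cost(w):
    """Hypothetical one-weight cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def cost_derivative(w):
    """Slope of the valley at position w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, eta=0.1, steps=100):
    w = w_start
    for _ in range(steps):
        slope = cost_derivative(w)
        # Negative slope moves w to the right, positive slope moves it to the left.
        w -= eta * slope
    return w


from_left = gradient_descent(0.0)    # starts left of the minimum, rolls right
from_right = gradient_descent(8.0)   # starts right of the minimum, rolls left
print(from_left, from_right)
```

Both starting points end up very close to the bottom of the valley, and the steps get smaller as the slope flattens out.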


In a nutshell, the process goes like this:

- Feed the training set into the neural network and get the output.
- Compare the output with the desired output and calculate the error using the cost function.
- Based on the error value and the cost function used, decide how the weights should be changed in order to minimize the error.
- Repeat the process until the error is minimal.

What I’ve just explained has one more name: __Batch Gradient Descent__. This is because we put the entire training set through the network and only then modify the weights. The problem with this approach is that we can end up in a local minimum of the error function rather than the global one. This is one of the biggest problems in Neural Networks, and there are multiple ways to address it.

A common way to avoid the trap of a local minimum is to modify the weights after each processed input of the training set. When all inputs from the training set have been processed, one epoch is done. It is necessary to run multiple epochs to get the best results. This process is called __Stochastic Gradient Descent__. By doing so, we also reduce the possibility of another problem arising: __overfitting__. Overfitting is a situation in which a neural network performs well on the training set, but not on real values later. This happens when the weights are tuned to solve only the specific problem we have in the training set.
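The difference between the two variants is only in when the weights are updated. The sketch below fits a single linear neuron with both; the toy data, the learning rate, the shuffling of the inputs in every epoch and the fixed random seed are assumptions made for this illustration.

```python
import random


def predict(x, w, b):
    return w * x + b


def batch_gradient_descent(xs, ys, eta=0.1, epochs=500):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Process the entire training set first ...
        errors = [y - predict(x, w, b) for x, y in zip(xs, ys)]
        # ... then modify the weights once per epoch.
        w += eta * sum(e * x for e, x in zip(errors, xs)) / n
        b += eta * sum(errors) / n
    return w, b


def stochastic_gradient_descent(xs, ys, eta=0.1, epochs=500, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)            # visit the inputs in a random order
        for k in order:
            # Modify the weights after every single input.
            error = ys[k] - predict(xs[k], w, b)
            w += eta * error * xs[k]
            b += eta * error
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]             # generated from y = 2x + 1

w_batch, b_batch = batch_gradient_descent(xs, ys)
w_sgd, b_sgd = stochastic_gradient_descent(xs, ys)
print(w_batch, b_batch, w_sgd, b_sgd)
```

On this simple problem the cost has a single minimum, so both variants arrive at roughly the same weights; the example shows the update schedule, not the local-minimum behavior.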

# Conclusion

Now, let’s sum this up in a few steps:

1. We randomly initialize the weights in our neural network.
2. We send the first set of input values to the neural network and propagate the values through it to get the output value.
3. We compare the output value to the expected output value and calculate the error using the cost function.
4. We propagate the error back through the network and set the weights according to that information.
5. We repeat steps 2 to 4 for every input value we have in our training set.
6. When the entire training set has been sent through the neural network, we have finished one epoch. After that, we repeat for more epochs.
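The steps listed above, in order, can be sketched with the smallest possible "network": a single sigmoid neuron. With only one neuron, propagating the error back reduces to a simple gradient update of that neuron's weights, so this is not full multi-layer backpropagation. The sigmoid activation, learning rate, seed and the logical-OR training set are assumptions for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    """Propagate one input through the single neuron."""
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)


def train(samples, targets, eta=0.5, epochs=2000, seed=42):
    rng = random.Random(seed)
    # Step 1: random initialization of the weights (and an assumed bias).
    w = [rng.uniform(-0.5, 0.5) for _ in samples[0]]
    b = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):                     # step 6: repeat for more epochs
        for x, y in zip(samples, targets):      # step 5: every input in the set
            y_hat = forward(x, w, b)            # step 2: get the output
            error = y - y_hat                   # step 3: compare with expected
            # Step 4: send the error back; for squared error and a sigmoid
            # neuron the update is scaled by the slope y_hat * (1 - y_hat).
            delta = error * y_hat * (1.0 - y_hat)
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
            b += eta * delta
    return w, b


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0.0, 1.0, 1.0, 1.0]                        # logical OR

w, b = train(X, y)
predictions = [round(forward(x, w, b)) for x in X]
print(predictions)
```

Here the weights are updated after every single input, so this loop follows the stochastic variant of gradient descent.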

So, that is an oversimplified representation of how neural networks learn. What I haven’t mentioned is that in practice, the training set is separated into two parts and the second part of the training set is used to validate the work of the network.
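A minimal sketch of such a split is shown below. The function name, the 20% validation share, the shuffling and the fixed seed are assumptions for the example; the text does not prescribe them.

```python
import random


def train_validation_split(samples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


data = list(range(10))                 # stand-in for ten training examples
train_part, validation_part = train_validation_split(data)
print(len(train_part), len(validation_part))
```

The network is trained only on the first part, and its error on the held-out part indicates how well it handles inputs it has not seen.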

Hopefully, this lecture provides a good overview of the way neural networks learn. Since it is a complex topic, some things were left uncovered (backpropagation, for example); they will be covered in the lectures to come. I tried not to go too deep into the math, which leaves plenty of room for further research.

# Further Study and Assignments:

## 1. Group Assignment:



## 2. Programming Assignment: 


## 3. Reading Assignment for all



# References: 

- [Introduction to Artificial Neural Networks](https://rubikscode.net/2017/11/13/introduction-to-artificial-neural-networks/)
- ["Make your own neural network" by  Tariq Rashid](http://makeyourownneuralnetwork.blogspot.com/)


- [Artificial Neural Networks Series](http://deepinthought.in/artificial-neural-networks-series) BY rubiikscode, 2018


