- A new idea emerged to copy the biological brain for artificial intelligence.
- Nature-inspired algorithms led to the development of neural networks and deep learning.
- AI is now on the cusp of a new golden age with machines that rival human intelligence.
- Traditional approaches to computing cannot mimic the intelligence of birds and bees.
- A bee has around 950,000 neurons.
- Neural networks emerged as a biologically inspired solution.
- Neural networks are now the foundation of powerful AI technology, such as Google's DeepMind.
- This guide is about understanding and creating neural networks for difficult tasks like recognizing human handwriting.



<a id='Forward propagation explained'></a>



### Forward propagation explained

#### What will we do?
In this book we’ll take a journey to make a neural network that can recognise handwritten numbers.

We’ll journey through mathematical ideas like functions, simple linear classifiers, iterative refinement, matrix multiplication, gradient calculus, optimisation through gradient descent and even geometric rotations.

#### A Simple Predicting Machine

<img src="./img/kilometrs_miles.png" width="700"/>

Imagine we don’t know the formula for converting between kilometres and miles. All we know is that the relationship between the two is linear.
So the mysterious calculation needs to be of the form 'miles = kilometres * C', where 'C' is a constant. We don’t know what this constant 'C' is yet.

What should we do to work out that missing constant 'C'? Let’s just pluck a value at random and give it a go! Let’s try C = 0.5 and see what happens.

<img src="./img/miles_kilometrs_err_1.png" width="700"/>

With C = 0.5, 100 kilometres gives 50 miles. The difference between our calculated answer and the actual truth (62.137 miles) is the error: 12.137.

<img src="./img/miles_kilometrs_err_2.png" width="700"/>
<br>
<img src="./img/miles_kilometrs_err_3.png" width="700"/>
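
The guess-and-correct loop above can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code: the known example (100 km is 62.137 miles), the function name `refine_constant`, and the `rate` parameter that moderates each correction are all assumptions made for the sketch.

```python
def refine_constant(km, true_miles, c=0.5, rate=0.5, rounds=20):
    """Repeatedly nudge the constant c in the direction that shrinks the error."""
    history = []
    for _ in range(rounds):
        error = true_miles - km * c      # truth minus our calculated answer
        history.append((c, error))
        c += rate * error / km           # move c part of the way towards the exact fix
    return c, history


# Assumed known example: 100 kilometres is 62.137 miles.
c, history = refine_constant(100.0, 62.137)
first_guess, first_error = history[0]
```

Each round halves the remaining error because `rate` is 0.5, so `c` settles close to 0.62137 after a handful of rounds.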

#### Classifying is Not Very Different from Predicting
We called the above simple machine a predictor, because it takes an input and makes a prediction of what the output should be.


<img src="./img/lr_divider_1.png" width="500"/>


The adjustable parameter 'C' changed the slope of that straight line. We can use the line to separate different kinds of things.

<img src="./img/lr_divider_2.png" width="500"/>

For now, we’re simply trying to illustrate the idea of a simple classifier.
How do we get the right slope? How do we improve a line we know isn’t a good divider between the two kinds of bugs?

- T - target (labeled data)
- Y - predicted
- E - error


#### Training A Simple Classifier

<img src="./img/training_data.png" width="500"/>
<br>
<img src="./img/training_data_visualization.png" width="500"/>

Looking back at our kilometres to miles predictor, we had a linear function whose parameter we adjusted.
The dividing line here is also a straight line, "y = Ax".
You may also notice that this "y = Ax" is simpler than the fuller form for a straight line, "y = Ax + B".

Let’s go for **'A' is 0.25** to get started. The dividing line is **'y = 0.25x'**. Let’s plot this line on the same plot of training data to see what it looks like:

<img src="./img/training_data_visualization_2.png" width="500"/>
<br>
<img src="./img/training_data_visualization_3.png" width="500"/>

How is **A** related to **E**? If we can know this, then we can understand how **changing one affects the other.**
Mathematicians use the delta symbol Δ to mean “a small change in”. Let’s write that out:

t = (A + ΔA)x
Let’s picture this to make it easier to understand. You can see the new slope **(A+ ΔA)**

<img src="./img/training_data_visualization_4.png" width="500"/>
<br>
<img src="./img/training_data_visualization_explained.png" width="700"/>

- T - target (labeled data)
- Y - predicted value (dependent value)
- E - error
- x - feature (independent value)
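
Since the current line gives y = Ax and the desired line gives t = (A + ΔA)x, the error is E = t - y = (ΔA)x, which rearranges to ΔA = E / x. The sketch below applies that single update. The training example (x = 3.0, target 1.1) and the function name `slope_update` are illustrative assumptions, not values taken from the figures.

```python
def slope_update(a, x, target):
    """Return the prediction, the error, and the slope change that removes the error."""
    y = a * x                 # current prediction, y = Ax
    error = target - y        # E = t - y
    delta_a = error / x       # from E = (delta A) * x
    return y, error, delta_a


a = 0.25                      # starting slope used in the text
x, target = 3.0, 1.1          # hypothetical training example
y, error, delta_a = slope_update(a, x, target)
new_a = a + delta_a           # the updated line passes through the target
```

Applying the full ΔA makes the line hit this one example exactly. With several examples a moderated update is usually preferred, so that the last example seen does not dominate.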


#### Sometimes One Classifier Is Not Enough
We can illustrate the limit of a linear classifier with a simple but stark example: the Boolean logic functions AND and OR.
Boolean logic functions typically take two inputs and output one answer:

<img src="./img/logical_ftion.png" width="500"/>

There is another Boolean function called XOR, short for eXclusive OR, which only has a true output if either one of the inputs A or B is true, but not both.

<img src="./img/XOR.png" width="500"/>
<br>
<img src="./img/logical_XOR_visual.png" width="500"/>

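We can plot the two inputs A and B to a logical function as coordinates on a graph. For AND, only when both inputs are true, at (1,1), is the output also true, shown as green. False outputs are shown red. A single straight line can separate green from red for AND and OR, but not for XOR, where the true outputs at (0,1) and (1,0) sit diagonally opposite each other.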
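
One way to see this limit is to search for a dividing line directly. The sketch below tries a small grid of candidate lines of the form `w_a*A + w_b*B + bias > 0` and reports whether any of them reproduces each truth table. The grid, the function name `find_separating_line`, and the "greater than zero means true" convention are assumptions made for illustration. A grid search only shows that no line in the grid works, although for XOR it can also be shown algebraically that no line exists at all.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

TRUTH_TABLES = {
    "AND": [a & b for a, b in INPUTS],
    "OR": [a | b for a, b in INPUTS],
    "XOR": [a ^ b for a, b in INPUTS],
}


def find_separating_line(targets, grid):
    """Return the first (w_a, w_b, bias) that classifies all four points, or None."""
    for w_a, w_b, bias in product(grid, repeat=3):
        outputs = [int(w_a * a + w_b * b + bias > 0) for a, b in INPUTS]
        if outputs == targets:
            return (w_a, w_b, bias)
    return None


GRID = [i / 2 for i in range(-4, 5)]   # candidate values -2.0, -1.5, ..., 2.0
results = {name: find_separating_line(t, GRID) for name, t in TRUTH_TABLES.items()}
```

In `results`, AND and OR each map to a working line, while XOR maps to `None`.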

#### Representing Boolean Functions with Linear Classification



#### Neurons, Nature’s Computing Machines
The very capable human brain has about 100 billion neurons!

A biological neuron doesn’t produce an output that is a simple linear function of the input. That is, its output does not take the form:

***output = (constant ∗ input) + (maybe another constant)***

Observations suggest that neurons don’t react readily, but instead suppress the input until it has grown so large that it triggers an output. You can think of this as a threshold that must be reached before any output is produced. It’s like water in a cup - the water doesn’t spill over until it has first filled the cup. Intuitively this makes sense - the neurons don’t want to be passing on tiny noise signals, only emphatically strong intentional signals. The following illustrates this idea of only producing an output signal if the input is sufficiently dialed up to pass a threshold.

<img src="./img/neuron_threshold.png" width="500"/>

A function that takes the input signal and generates an output signal, but takes into account some kind of threshold is called an **activation function**.

A simple step function could do this:

<img src="./img/step_fun.png" width="500"/>
<br>
<img src="./img/sigmoid_fun.png" width="500"/>


<img src="./img/sigmoid_formula.png" width="500"/>

- e - a mathematical constant, 2.71828…
- x - the input. It is negated and e is raised to the power of that -x. The result is added to 1, so we have 1+e^-x.

When x is zero, e^-x is one because anything raised to a power of zero is 1. So y becomes 1/(1+1) or simply 1/2, a half. So the basic sigmoid cuts the y-axis at y=1/2.
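
As a quick check of the formula y = 1/(1+e^-x), here is a minimal sketch using Python's standard `math` module. The sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: y = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


at_zero = sigmoid(0.0)        # e^0 is 1, so this is 1 / (1 + 1)
far_left = sigmoid(-5.0)      # strongly negative input, output close to 0
far_right = sigmoid(5.0)      # strongly positive input, output close to 1
```

The output always lies strictly between 0 and 1, which is what gives the curve its soft threshold shape.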

<img src="./img/sigmoid_formula_detailed.png" width="500"/>

The first thing to realize is that real biological neurons take many inputs, not just one. We saw this when we had two inputs to the Boolean logic machine, so the idea of having more than one input is not new or unusual.

We simply combine them by adding them up, and the resultant sum is the input to the sigmoid function which controls the output. This reflects how real neurons work. The following diagram illustrates this idea of combining inputs and then applying the threshold to the combined sum:

<img src="./img/sum_inputs.png" width="600"/>
<br>
<img src="./img/neron_bio_comp.png" width="600"/>
<br>
<img src="./img/neron_comput.png" width="600"/>


If only one of the several inputs is large and the rest small, this may be enough to fire the neuron.

What’s more, the neuron can fire if some of the inputs are individually almost, but not quite, large enough because when combined the signal is large enough to overcome the threshold.

**The electrical signals are collected by the dendrites and these combine to form a stronger electrical signal. If the signal is strong enough to pass the threshold, the neuron fires a signal**

<img src="./img/bio_nn.png" width="600"/>

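Which part of this architecture does the learning? The most obvious thing is to adjust the strength of the connections between nodes. Within a node, we could have adjusted the summation of the inputs or the shape of the sigmoid function, but that is more complicated than adjusting the connection strengths.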

The following diagram again shows the connected nodes, but this time a weight is shown associated with each connection.

A low weight will de-emphasise a signal, and a high weight will amplify it.

<img src="./img/nn_layers_weighted.png" width="600"/>

As the network learns to improve its outputs by refining the link weights inside the network, some weights become zero or close to zero. Zero, or almost zero, weights mean those links don’t contribute to the network because signals don’t pass.

TODO

<img src="./img/nn_example1.png" width="600"/>

Random starting values aren’t such a bad idea, and it is what we did when we chose an initial slope value for the simple linear classifiers earlier on.

- The first layer of nodes is the input layer, and it doesn’t do anything other than represent the input signals. That is, the input nodes don’t apply an activation function to the input.
- The second layer is where we do need to do some calculations. For each node in this layer we need to work out the combined input and then apply the sigmoid function **y = 1/(1+e^-x)**.

<img src="./img/second_layer_explained.png" width="600"/>
<br>
<img src="./img/second_layer_formula_explained.png" width="600"/>
<br>
<img src="./img/second_layer_formula_explained2.png" width="600"/>

The process is repeated for all subsequent layers.



#### Matrix Multiplication is Useful
A matrix is just a table, a rectangular grid, of numbers. That’s it. There’s nothing much more complex about a matrix than that.

Here’s an example of two simple matrices multiplied together.

<img src="./img/matrics_multiplication.png" width="600"/>
<br>
<img src="./img/matrics_multiplication2.png" width="600"/>

The number of columns in the first matrix must equal the number of rows in the second. So you can’t multiply a “2 by 2” matrix by a “5 by 5” matrix.
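
The rule can be made concrete with a small pure-Python sketch. The helper name `matmul` and the example numbers are assumptions for illustration. A numerical library would normally be used in practice.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows_a, cols_a = len(a), len(a[0])
    rows_b, cols_b = len(b), len(b[0])
    if cols_a != rows_b:
        raise ValueError(
            f"cannot multiply a {rows_a} by {cols_a} matrix by a {rows_b} by {cols_b} matrix"
        )
    # Each result element is a row of `a` combined with a column of `b`.
    return [
        [sum(a[i][k] * b[k][j] for k in range(cols_a)) for j in range(cols_b)]
        for i in range(rows_a)
    ]


product_2x2 = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])   # [[19, 22], [43, 50]]
```

Calling `matmul` with a 2 by 2 matrix and a 5 by 5 matrix raises `ValueError`, because two columns cannot be paired with five rows.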




Look what happens if we replace the letters with words that are more meaningful to our neural networks. The second matrix is a two by one matrix, but the multiplication approach is the same.

<img src="./img/matrixs_mult_nn.png" width="700"/>
<br>
<img src="./img/nn_matrix_inputs.png" width="700"/>

We can express all the calculations that go into working out the combined moderated signal, x, into each node of the second layer using matrix multiplication

<center>
X=W∗I
</center>

- W - the matrix of weights
- I - the matrix of inputs
- X - the resultant matrix of combined moderated signals into layer 2

This is fantastic! A little bit of effort to understand matrix multiplication has given us a powerful tool for implementing neural networks without lots of effort from us.
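
A minimal sketch of X = W∗I for a layer with two inputs and two nodes follows. The weight and input values are hypothetical, and the single-column input matrix is stored as a flat list for brevity. Each row of `W` is assumed to hold the weights on the links arriving at one node of layer 2.

```python
def combined_inputs(weights, inputs):
    """Compute X = W * I, where I is a single column stored as a flat list."""
    return [sum(w * i for w, i in zip(row, inputs)) for row in weights]


# Hypothetical values: one row of W per node in layer 2.
W = [
    [0.9, 0.3],
    [0.2, 0.8],
]
I = [1.0, 0.5]

X = combined_inputs(W, I)   # roughly [1.05, 0.6]
```

The same function works unchanged for any number of inputs and nodes, as long as each row of `W` has one weight per input.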

## Forward propagation explained
TODO img

Input layer:
- X1=1
- X2=2

Hidden layer:
- Node 1: w1,0=0.1, w1,1=0.2, b=0
- Node 2: w1,0=0.4, w1,1=0.5, b=0
- Node 3: w1,0=1.1, w1,1=1.2, b=0

Output layer:
- w2,0=1, w2,1=1.2, w2,2=1.4

1. Calculate the weighted sum and the activation function output for each node in the hidden layer:
    * Node 1:
        * Weighted sum = (0.1 * 1) + (0.2 * 2) + 0 = 0.5
        * Activation function output = sigmoid(0.5) = 0.6225
    * Node 2:
        * Weighted sum = (0.4 * 1) + (0.5 * 2) + 0 = 1.4
        * Activation function output = sigmoid(1.4) = 0.8022
    * Node 3:
        * Weighted sum = (1.1 * 1) + (1.2 * 2) + 0 = 3.5
        * Activation function output = sigmoid(3.5) = 0.9707

2. Calculate the weighted sum and the activation function output for the output node:
    * Weighted sum = (1 * 0.6225) + (1.2 * 0.8022) + (1.4 * 0.9707) = 2.9441
    * Activation function output = sigmoid(2.9441) = 0.9500
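
The two steps can be recomputed with a short script, which is a convenient way to check the arithmetic by hand. This is a sketch under stated assumptions: the function names are made up for illustration, and the output node is assumed to have a bias of 0 because the example does not list one.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(weights, biases, inputs):
    """Weighted sum plus bias for every node, followed by the sigmoid."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(sigmoid(weighted_sum))
    return outputs


inputs = [1.0, 2.0]                                      # X1, X2
hidden_weights = [[0.1, 0.2], [0.4, 0.5], [1.1, 1.2]]    # one row per hidden node
hidden_biases = [0.0, 0.0, 0.0]
output_weights = [[1.0, 1.2, 1.4]]                       # single output node
output_biases = [0.0]                                    # assumed, not given in the example

hidden = layer_forward(hidden_weights, hidden_biases, inputs)
output = layer_forward(output_weights, output_biases, hidden)

print([round(h, 4) for h in hidden])
print(round(output[0], 4))
```

The hidden activations come out at about 0.6225, 0.8022 and 0.9707, and the final output at about 0.95.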


## Activation functions

TODO:

**Activation function:** this is easy and doesn’t need matrix multiplication. All we need to do is apply the sigmoid function to each individual element of the matrix X.

**Softmax**

**ReLU**
<br>
<img src="./img/activation_functions.png" width="900"/>
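
Until this section is filled in, the three functions named above can be sketched with their standard textbook definitions. The helper names are assumptions made for illustration, and the softmax subtracts the largest input before exponentiating, a common trick to avoid overflow.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: pass positive inputs through, clamp the rest to 0."""
    return max(0.0, x)


def softmax(values):
    """Turn a list of scores into positive values that sum to 1."""
    shift = max(values)                        # subtracting the max avoids overflow
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def apply_elementwise(fn, matrix):
    """Apply an activation to each individual element of a matrix (list of rows)."""
    return [[fn(v) for v in row] for row in matrix]


X = [[0.0], [2.0], [-2.0]]                     # hypothetical combined inputs
O = apply_elementwise(sigmoid, X)
probabilities = softmax([1.0, 2.0, 3.0])
```

Sigmoid and ReLU act on one element at a time, whereas softmax needs the whole list of scores because each output depends on the sum of all of them.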

## Backpropagation explained
Backpropagation is an algorithm used to train neural networks by adjusting the weights. It works by propagating the error in the output of the network backwards through the layers to adjust the weights in each layer.

TODO:
http://neuralnetworksanddeeplearning.com/chap2.html

### Learning Weights From More Than One Node
How do we update link weights when more than one node contributes to an output and its error? The following illustrates this problem.

<br>
<img src="./img/otput_layer.png" width="700"/>



#### Backpropagating Errors From More Output Nodes

<img src="./img/err_calculation.png" width="700"/>
<br>
<img src="./img/w_err_calc.png" width="700"/>
<br>
<img src="./img/backpropagating_explained.png" width="800"/>

#### Backpropagating Errors To More Layers

<br>
<img src="./img/backpropagating_err_2.png" width="800"/>

<br>
Working back from the final output layer at the right hand side
<br>
<img src="./img/backpropagating_err_2_details.png" width="800"/>
<br>
<img src="./img/backpropagating_err_3.png" width="800"/>

<br>

We don’t have the target or desired outputs for the hidden nodes. We only have the target values for the final output layer nodes, and these come from the training examples. That means we have some kind of error for each of the two links that emerge from this middle layer node. We could recombine these two link errors to form the error for this node as a second best approach because we don’t actually have a target value for the middle layer node.
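
The recombination described above can be sketched as follows. The sketch assumes that each output node's error is split across its incoming links in proportion to the link weights. The text here does not spell that rule out, so treat it as an assumption, along with the function names and the example weights and errors.

```python
def split_error(output_error, incoming_weights):
    """Share one output node's error across its incoming links, in proportion to weight."""
    total = sum(incoming_weights)
    return [output_error * w / total for w in incoming_weights]


def hidden_errors(output_errors, weights):
    """weights[k][j] is the weight on the link from hidden node j to output node k."""
    errors = [0.0] * len(weights[0])
    for output_error, row in zip(output_errors, weights):
        for j, link_error in enumerate(split_error(output_error, row)):
            errors[j] += link_error           # recombine the link errors per hidden node
    return errors


# Hypothetical two-node hidden layer feeding a two-node output layer.
weights = [
    [2.0, 3.0],   # links into output node 1
    [1.0, 4.0],   # links into output node 2
]
output_errors = [0.8, 0.5]

e_hidden = hidden_errors(output_errors, weights)   # roughly [0.42, 0.88]
```

The hidden errors add up to the same total as the output errors, so nothing is lost or invented while the error is passed back a layer.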

<br>
<img src="./img/backpropagating_err_4.png" width="800"/>
<br>
<img src="./img/backpropagating_err_5.png" width="800"/>
