# How Do Neural Networks Actually Learn?
The goal of this journey is to take you from "1+1=2" to "How a Neural Network Learns", in steps small enough that you think "that was easy" at each one.



# Supporting material

This journey takes ideas from these excellent sources:

- But what is a neural network: https://www.youtube.com/watch?v=aircAruvnKk
- Backpropagation Demystified: A Step-by-Step Guide to the Heart of Neural Networks: https://www.youtube.com/watch?v=QflXxNfMCKo
- Neural Networks Explained from Scratch using Python: https://www.youtube.com/watch?v=9RN2Wr8xvro
- Using neural nets to recognize handwritten digits: https://neuralnetworksanddeeplearning.com/chap1.html
- The fastai course: https://course.fast.ai/

# To follow this journey yourself:

- install JupyterLab: https://jupyter.org/install
- git clone https://github.com/kennylomax/aiwarmups.git
- cd aiwarmups
- pip install nbclassic
- jupyter nbclassic hackAIthonWarmup3_0.ipynb


# Prepare our environment:

In [None]:
!pip install fastbook
from fastai.vision.all import *
from fastbook import *
from ipywidgets import *
import math
import matplotlib.pyplot as plt
import numpy as np

# BACKGROUND
## Setting the scene:

- Neural Networks may be unfamiliar, but they are not complex or magic. They are actually rather simple.
- We consider the basic Neural Network in this journey.
- Making and training a new Neural Network from scratch to recognise hand-written digits takes about 20 lines of code:
  - code: https://github.com/Bot-Academy/NeuralNetworkFromScratch/blob/master/nn.py
  - video explanation: https://www.youtube.com/watch?v=9RN2Wr8xvro
  - there are no special libraries (hiding complexity) here<br>

![pyy.png](attachment:6baeb1dd-8c40-4bb2-915d-ff2e0751f1aa.png)


  - I did not (and cannot) write this, but we can try it out :)
    - git clone https://github.com/Bot-Academy/NeuralNetworkFromScratch.git
    - cd NeuralNetworkFromScratch
    - python nn.py

## Computer Data.. It's all just numbers..
Computers deal with numbers.

**ALL data in a computer**  - code, text, photos, videos, spreadsheets, songs, websites - **EVERYTHING** is represented internally as a **sequence of numbers**.

When a computer is presented with a photo/audio recording/video/text, the computer will convert it to number(s) before accepting it and working with it.

![hsnc.png](attachment:53f5b9a2-3472-4797-ae4e-e83f2ed2d24b.png)<br>
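
As a small illustration (a sketch added for this walkthrough; the variable names are made up), Python can show the numbers hiding behind a short piece of text:

```python
# Text is stored as numbers: each character maps to one or more byte values.
text = "Hi!"
text_as_numbers = list(text.encode("utf-8"))
print(text_as_numbers)  # [72, 105, 33]

# The same numbers can be turned back into the original text.
round_trip = bytes(text_as_numbers).decode("utf-8")
print(round_trip)  # Hi!
```

Photos, audio and video work the same way, just with many more numbers.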


## Also with AI, ML, Neural Networks, LLMs.. it's all just numbers

AI, ML, NEURAL NETWORKS and LLMs **all work by processing NUMBERS: taking NUMBERs as input and providing NUMBERs as output**
<br>
<br>![ainums.png](attachment:33d1b4df-8534-410a-beb3-e1c0f2e90d95.png)

## MNIST Data Set - The "Hello World" data set of Neural Networks:
- consists of 60000 hand-written digit images for training (plus 10000 more for testing)<br>
![mnist.png](attachment:96392cc6-4499-4148-b3b7-2bf74fb2453f.png)

- each stored numerically, and each paired with a label between 0 and 9<br>
![mnists.png](attachment:23cfe50e-3d3c-4607-94cf-9d4069d6ecd2.png)
- each image is 28x28 pixels
- each pixel is encoded as a number (0=black, 255=white):<br>
- for example one of the digits, with the label "9" is: <br>

![mnist9.png](attachment:fb1c7c0a-fbb7-4706-928f-2882e13b1f3a.png)

![mnist9excel.png](attachment:106b8d16-a49c-436b-a48e-6699119e413f.png)

- If you read from left to right and top to bottom, you get 28*28 = 784 numbers
- **784 Numbers -> an image -> 784 Numbers**
- The classic Neural Network challenge of the last decades was to get Neural Networks to recognise the MNIST digits (More @ https://en.wikipedia.org/wiki/MNIST_database and https://www.kaggle.com/code/hojjatk/read-mnist-dataset )
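
The "784 numbers" idea can be sketched in a few lines of NumPy. This is only an illustration: the image below is random noise rather than a real MNIST digit, and the names `image`, `flat` and `restored` are made up for the example.

```python
import numpy as np

# A made-up 28x28 "image": random pixel values from 0 to 255 (not a real MNIST digit).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Read left to right, top to bottom: 784 numbers in a single row.
flat = image.reshape(-1)
print(flat.shape)  # (784,)

# The same 784 numbers can be laid out as a 28x28 image again.
restored = flat.reshape(28, 28)
print(np.array_equal(image, restored))  # True
```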



# MATHEMATICAL FUNCTIONS

**1+1=2** is a statement about the Mathematical Function "ADDITION":

- if the numbers 1 and 1 are provided as **input** to the **Mathematical Function** "ADDITION", we get the **output** 2 
![112.png](attachment:bebccc99-82c1-47da-b663-ceaffdfc6fe7.png)

Some other examples of Mathematical Functions:

  - 1,2,3,4,5,6 → **ADDITION** → 21    (several inputs, 1 output)<br>
  - 1 → **ADD_ONE** → 2   (1 input, 1 output)<br>
  - 2,6,10 → **MULTIPLY** →  120     <br>
  - 2,6,10 → **AVERAGE** → 6 <br>
  - -12 ,  7,  11,  18  →  **SHOW_SMALLEST_AND_LARGEST** →  -12, 18  (Note we can have functions with **multiple outputs**)
  - -12, 7, 11, 17  →  **HOW_MANY_NEGATIVE_AND_HOW_MANY_POSITIVE**  →  1, 3

All of these Mathematical Functions:
- take zero or more **number(s)** as input,
- produce one or more **number(s)** as output
- are **deterministic** - meaning they always give the same answer for the same input. 1+1 always equals 2...<br> ![nfn.png](attachment:10bfd4cd-81cf-4421-8b28-30f6eae2417e.png)
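
The example functions above are simple enough to write as ordinary Python. This is a minimal sketch; the Python function names are just lower-case versions of the labels used above.

```python
import math

def addition(*numbers):
    return sum(numbers)

def add_one(x):
    return x + 1

def multiply(*numbers):
    return math.prod(numbers)

def average(*numbers):
    return sum(numbers) / len(numbers)

def show_smallest_and_largest(*numbers):
    # Two outputs, returned together as a pair.
    return min(numbers), max(numbers)

def how_many_negative_and_how_many_positive(*numbers):
    negatives = sum(1 for n in numbers if n < 0)
    positives = sum(1 for n in numbers if n > 0)
    return negatives, positives

print(addition(1, 2, 3, 4, 5, 6))                               # 21
print(add_one(1))                                               # 2
print(multiply(2, 6, 10))                                       # 120
print(average(2, 6, 10))                                        # 6.0
print(show_smallest_and_largest(-12, 7, 11, 18))                # (-12, 18)
print(how_many_negative_and_how_many_positive(-12, 7, 11, 17))  # (1, 3)
```

Numbers in, numbers out, and the same input always gives the same output.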


What about these:

![dogcat12.png](attachment:87c45a98-c5df-4593-bf81-9fd7ef30921a.png)

![mn09.png](attachment:e6f9b419-9692-4e35-9811-6c4f8ad9f7fb.png)

![Chess.png](attachment:ad1e001f-2a98-4821-9ee0-ec13a7efd6fe.png)


- These might be horrendously complex MATHEMATICAL FUNCTIONS, but **THEY ARE STILL MATHEMATICAL FUNCTIONS** :
    - they take numbers as input, 
    - they produce numbers as output
    - they are deterministic

### Some Mathematical Functions can be implemented as a normal computer program..

![nfn.png](attachment:65cbfba9-044f-4123-a5e0-d5f6c498b66d.png)

![ncpn.png](attachment:01532d2a-e79c-49c0-8148-652624056538.png)

For example adding numbers together:


![addab.png](attachment:33bd2e57-8db5-4b67-a544-0593aa2ddd94.png)

And also supremely complex problems like:

![rock.png](attachment:7d2d0e47-e69c-44c2-8e8e-913335b77063.png)

### But some Mathematical Functions are too complex to be implemented as a traditional computer program..

- is this a picture of a dog
- beat me at chess
- what does this image of a hand-written message say
  
This is where Neural Networks come in..

# NEURAL NETWORKS

Like a program, Neural Networks:

- also take numbers as input and produce numbers as output.
- are deterministic (they provide the same answer when given the same input)


![nfn.png](attachment:681cabd7-2436-4716-b568-e55f3a2a4605.png)

![nnnn.png](attachment:02d19032-3d22-4705-93e2-56625a419130.png)


But unlike a program, the Neural Network:

- gives **APPROXIMATE** answers, typically as "probabilities/likelihoods"
- rather than being "programmed", it is "trained" using thousands/millions of input-output pairs
- Neural networks let us **approximate the behaviour of massively complex mathematical functions**
- Think of a Neural Network as a **Universal Mathematical Function Approximator**  (More @ https://www.youtube.com/watch?v=xg4bIeJTVF0)

## Our task for the Neural Network, from 30000 feet: 

How to create a Neural Network that recognises MNIST characters very well..

![mniste.png](attachment:acf8ce98-2a45-4676-96cb-3857cfd63be6.png)
 

## Our task from 20000 feet: 
Neural Networks are made up of **Perceptrons** - often hundreds, sometimes thousands or millions of them..

### A Perceptron:

![ps.png](attachment:9dd98392-6ddd-4ccc-b6a4-438ea46fda57.png)

A neural network will have:
  - an input layer, which is simply **the input** numbers
  - 0 or more "hidden" layers, **containing perceptrons**
  - an output layer containing **the answer**
  - **adjustable weights** connecting everything

Here is an untrained neural network for recognising MNIST characters. 

The blue parts are our perceptrons<br>

![NNBW.png](attachment:619a3302-1597-4a59-a180-8f845287071e.png)


(For comparison, GPT-3, a predecessor of the models behind ChatGPT, has 96 layers of size 12288 each)

The challenge is to **adjust the weights** to get the behaviour we want..


### Why Perceptrons? What makes them so useful?

Perceptrons can represent simple mathematical functions exactly:

Consider the function: **y = mx+c**

This can be replicated by the Perceptron:

![ymxc2.png](attachment:5bca19ab-c7a9-4ab2-8754-ff9cb58c7ab0.png)

In [None]:
def f(x): return 4*x+2
plot_function(f, 'x', '4x+2')

But that is not very exciting or useful..

Let's make a small change to this perceptron:

![ymxcmin0.png](attachment:87cfe4b3-175b-4d5e-8eee-b8740527afcd.png)

We call the extra block the **"activation function"** (because it determines how much the perceptron is "activated")

![perceprenamed.png](attachment:e76ca99c-a4af-431d-a60c-501aae40ed7b.png)


In [None]:
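# A perceptron computing weight*x + bias, followed by the activation max(y, 0)
def rectified_linear(weight,bias,x):
  y=weight*x+bias
  return torch.clip(y, 0.)  # negative values are clipped to 0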

@interact(weight=1.5, bias=1.5)
def plot_relu(weight, bias):
    plot_function( partial(rectified_linear, weight, bias))

This is much more useful, because **you can theoretically combine a lot of these (perhaps millions) to closely approximate any function**..

![NN3.png](attachment:54ccdcdf-1dd0-4790-bfa1-0cd23669fb34.png)


### Consider (just) two perceptrons:

In [None]:
def double_relu(m1, b1, m2, b2,x):
   return rectified_linear(m1,b1,x)+  rectified_linear(m2,b2,x)

@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
    plot_function( partial(double_relu, m1, b1,m2,b2))

With enough perceptrons and tweaking of the weights, you could approximate any (continuous) mathematical function as closely as you like:

![awave2.png](attachment:c3e9d5fa-23ff-4123-88b5-27e38ac43b15.png)
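
To make this concrete, here is a hedged sketch (not from the original notebook) that builds an approximation of sin(x) on the interval 0 to 2π from nothing but a constant plus 20 shifted and scaled max(a,0) units. The helper name `relu_sum_approximation` and the choice of evenly spaced "knots" are assumptions for this example; a real network would have to learn these weights rather than have them computed directly.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_sum_approximation(f, knots):
    """Return a function made of a constant plus shifted, scaled ReLUs
    that matches f exactly at every knot (straight lines in between)."""
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)      # slope of each straight segment
    # Each ReLU switches on at its knot and changes the slope by this amount.
    slope_changes = np.diff(slopes, prepend=0.0)

    def approx(x):
        x = np.asarray(x, dtype=float)
        bumps = relu(x[..., None] - knots[:-1]) * slope_changes
        return values[0] + bumps.sum(axis=-1)

    return approx

knots = np.linspace(0.0, 2 * np.pi, 21)            # 21 knots -> 20 ReLU units
approx_sin = relu_sum_approximation(np.sin, knots)

xs = np.linspace(0.0, 2 * np.pi, 500)
max_error = np.max(np.abs(approx_sin(xs) - np.sin(xs)))
print(max_error)                                   # small; shrinks as knots are added
```

Using more knots (more units) makes the straight segments shorter and the fit tighter, which is the intuition behind "add more perceptrons to get a better approximation".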

### A better activation function
It turns out that max(a,0) is good, but a different **activation function** is even more useful because:

- it still allows us to create any function by adding more and more perceptrons
- it again disallows negative values, but it also squashes the output to between 0 and 1
- it prevents explosions in values across the Neural Network
- it is called the Sigmoid Function  (more @ https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e)

![sigmoid.png](attachment:50cb1784-f376-40af-9ab8-1ebd2e199a1e.png)


In [None]:
# Sigmoid squashes any input into the range 0..1 (same result as torch.sigmoid)
def sigmoid(x): return 1/(1+torch.exp(-x))
plot_function(sigmoid, title='Activation Function "Sig"', min=-4, max=4)

This gives us a full Perceptron:

![percbg.png](attachment:e1bb32a8-346f-4baf-b55b-eab192136342.png)

Our network has not just 1 layer (which can simulate a complex x-y function), but also an additional "hidden" layer.

This allows the neural network to simulate multi-dimensional mathematical functions - ones that are good enough to recognise hand-written digits.. 

![nnbg.png](attachment:dbba9a37-27b2-4239-b2b3-bc3332900c57.png)
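
As a rough sketch of what such a network computes (an illustration with assumptions, not the notebook's own code): the layer sizes 784, 15 and 10 are assumed here, the weights are random (so the network is untrained), the biases start at zero, and the input is random noise rather than a real digit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Assumed layer sizes: 784 inputs, 15 hidden perceptrons, 10 outputs.
w_hidden = rng.uniform(-0.5, 0.5, size=(15, 784))
b_hidden = np.zeros(15)
w_output = rng.uniform(-0.5, 0.5, size=(10, 15))
b_output = np.zeros(10)

# A made-up input image: 784 pixel values scaled to the range 0..1.
pixels = rng.uniform(0.0, 1.0, size=784)

hidden = sigmoid(w_hidden @ pixels + b_hidden)   # 15 numbers
output = sigmoid(w_output @ hidden + b_output)   # 10 numbers, one per digit 0-9

print(output.round(2))
print("untrained guess:", int(np.argmax(output)))
```

With random weights the ten outputs are meaningless; training is what turns them into useful likelihoods.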


Our challenge is to find the weights in the network such that the Neural Network best approximates our hypothetical perfect mathematical function.

!! If we had endless time, we could simply adjust the weights randomly until we found an excellent fit, but we want an efficient way to do this...


## Our task from 5000 feet


### Training a Neural Network

- Let's call one snapshot of the NN at work, **an iteration**.
- In the **iteration** below we have presented an image of "9" and we can see the output

![gameplan1.png](attachment:f9e21160-61b2-47ca-874e-9fe9e7436ff1.png)

#### The "Cost" of our Network is measured with Mean Squared Error

The Cost (Mean Squared Error) is a useful way to **measure how bad our results are**:  (more @ https://en.wikipedia.org/wiki/Mean_squared_error)

- 0 = perfect
- The larger the Cost, the worse the results
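
A minimal sketch of the cost calculation for one image labelled "9". The two network outputs below are made up for illustration, not real results.

```python
import numpy as np

def mean_squared_error(output, target):
    return np.mean((output - target) ** 2)

# The ideal answer for an image labelled "9": 0 everywhere except position 9.
target = np.zeros(10)
target[9] = 1.0

# Two made-up network outputs.
poor_output = np.full(10, 0.5)   # undecided about every digit
good_output = np.array([0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0, 0.9])

poor_cost = mean_squared_error(poor_output, target)
good_cost = mean_squared_error(good_output, target)
perfect_cost = mean_squared_error(target, target)

print(poor_cost)      # 0.25
print(good_cost)      # about 0.003
print(perfect_cost)   # 0.0
```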


## Finding good weights with The Chain Rule .. THE SECRET SAUCE...

- In this simple example of a NN, we have **almost 12000 weights** (784x15 + 15x10 = 11910)
- The ideal (unknown) mathematical function, mapping all MNIST inputs to their correct ideal outputs, is **VERY COMPLICATED** and we cannot hope to directly calculate the ideal weights in an efficient manner.
- Instead, a technique called the "CHAIN RULE" allows us to **iteratively improve the weights at each iteration, to reduce the current cost**

### Chain Rule

- For any iteration:
  - Calculate the gradient of the cost with respect to every weight for that iteration, by applying the chain rule
  - Take each weight "downhill" a little bit 
  - Repeat for the next iteration

See this cool explanation:  https://www.youtube.com/watch?v=Ilg3gGewQ5U


#### Gradients ..

If variables x and y are related by a function **f** such that **y = f(x)**, then the derivative of that function, **f'** (a.k.a. **the derivative of f w.r.t. x**, also written **dy/dx**), gives us the **slope** at any point **x**


![slope2.png](attachment:4bc4b4dd-f2ec-436f-8a9a-f7e0bcc81dd9.png)

We know each **weight is related to the overall cost** by some (unknown) function: c = f(w) 

Though we don't know that function, we **CAN** determine the current slope of that function at w.

We want to reduce the cost so:

  - find the slope at w
  - if the slope at w is positive (going up to the right), then choose a slightly smaller w
  - if the slope at w is negative (going up to the left), then choose a slightly larger w

![Screenshot 2024-04-12 at 09.37.42.png](attachment:992e5ee1-3e3d-40df-8704-f129d9c53146.png)


This is called "**Gradient Descent**"
-  it allows us to iteratively locate the best weight for producing the lowest cost.

![mingd.png](attachment:8cde5f06-4743-47b8-b21a-8f5fa9a0623b.png)
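
Here is Gradient Descent on a single weight, as a hedged sketch. The cost function below is made up so that we know the answer (the lowest cost is at w = 3); in a real network we don't know the function, only the slope. The name `learning_rate` is an assumption for "how big is a little bit".

```python
def cost(w):
    return (w - 3.0) ** 2 + 1.0   # made-up cost curve, lowest at w = 3

def slope(w):
    return 2.0 * (w - 3.0)        # derivative of the cost with respect to w

w = 0.0               # starting guess
learning_rate = 0.1   # how far to move on each step

for step in range(50):
    # Positive slope -> make w smaller; negative slope -> make w larger.
    w = w - learning_rate * slope(w)

print(w, cost(w))     # w ends up very close to 3, cost very close to 1
```

Subtracting `learning_rate * slope(w)` covers both rules at once: the sign of the slope decides the direction, and its size decides how big the step is.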


#### The Chain Rule

Although we don't know the overall function from a weight to the cost, every individual function (relationship) inside our neural network is clearly defined..

We consider just a small section of the NN, but big enough to demonstrate the whole process.  

- Cost = MSE(o1)
- o1 = sig(z3)
- z3 = h1 * w5 + h2 * w6
- h1 = sig(z1)
- h2 = sig(z2)
- z1 = x1 * w1 + x2 * w3
- z2 = x1 * w2 + x2 * w4

![msez.png](attachment:82d66b50-ccdb-4c6d-899e-a0fae1b9d093.png)

For any iteration, the chain rule **gives us the slope of the cost with respect to each weight, by chaining together the slopes of the functions between that weight and the output**

![Screenshot 2024-04-12 at 10.04.22.png](attachment:c5fd78b7-2e3e-427a-9a18-22130821054c.png)

For example:<br>
**dCost/dw5 = dCost/do1 * do1/dz3 * dz3/dw5**  gives us the slope of the cost with respect to w5, and<br>
**dCost/dw1 = dCost/do1 * do1/dz3 * dz3/dh1 * dh1/dz1 * dz1/dw1**  gives us the slope of the cost with respect to w1.<br>

(more @ https://en.wikipedia.org/wiki/Chain_rule )
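
The two chains above can be checked numerically. The sketch below uses the small network section defined earlier, with made-up inputs, target and weights, and assumes the cost for a single output is (o1 - target)^2. It multiplies the individual slopes together and compares the result with a "nudge the weight and see what happens" estimate (`numeric_slope` is a helper invented for this example).

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    z1 = x1 * w["w1"] + x2 * w["w3"]
    z2 = x1 * w["w2"] + x2 * w["w4"]
    h1, h2 = sig(z1), sig(z2)
    z3 = h1 * w["w5"] + h2 * w["w6"]
    o1 = sig(z3)
    return h1, h2, o1

def cost_of(x1, x2, target, w):
    o1 = forward(x1, x2, w)[2]
    return (o1 - target) ** 2

# Made-up values for one iteration.
x1, x2, target = 0.05, 0.10, 0.01
w = {"w1": 0.15, "w2": 0.25, "w3": 0.20, "w4": 0.30, "w5": 0.40, "w6": 0.45}

h1, h2, o1 = forward(x1, x2, w)

# The individual slopes (the derivative of sig(z) is sig(z) * (1 - sig(z))).
dcost_do1 = 2.0 * (o1 - target)
do1_dz3 = o1 * (1.0 - o1)
dz3_dw5 = h1
dz3_dh1 = w["w5"]
dh1_dz1 = h1 * (1.0 - h1)
dz1_dw1 = x1

# Chain them together.
dcost_dw5 = dcost_do1 * do1_dz3 * dz3_dw5
dcost_dw1 = dcost_do1 * do1_dz3 * dz3_dh1 * dh1_dz1 * dz1_dw1

def numeric_slope(name, eps=1e-6):
    # Nudge one weight up and down a tiny bit and measure the change in cost.
    up, down = dict(w), dict(w)
    up[name] += eps
    down[name] -= eps
    return (cost_of(x1, x2, target, up) - cost_of(x1, x2, target, down)) / (2 * eps)

print(dcost_dw5, numeric_slope("w5"))
print(dcost_dw1, numeric_slope("w1"))
```

The two numbers on each line should agree to many decimal places, which is a handy sanity check whenever you derive gradients by hand.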

# HOW A NEURAL NETWORK LEARNS... Ground level..

We now have all we need to train a neural network...

Again we consider a small section of the NN, but big enough to demonstrate the whole process.  

(For a good video walk-through of the next steps see https://www.youtube.com/watch?v=QflXxNfMCKo )

![nn1a.png](attachment:0076fdf9-2448-4425-96e4-25226360b3fc.png)



1) Initialize the network by selecting random values between 0 and 1 for all the weights
2) "Feed Forwards":<br> Plug in the input values of an image, to **set the values in each perceptron** (and to measure the **cost**) <br>

![nn2b.png](attachment:907abc03-edbc-4c9f-8c0b-3edf721285c6.png)


3) "Backpropagation": <br>Calculate the slopes of the cost with respect to all weights, by using the chain rule..<br>

![Screenshot 2024-04-12 at 10.04.22.png](attachment:4399aebd-995f-481e-84af-dfcfeed3a032.png)
  - if the slope at w is positive, choose a slightly smaller w<br>
  - if the slope at w is negative, choose a slightly larger w<br>
4) Repeat from step 2 for all images in our data set, to make **one Epoch**

Perform a few Epochs and we are done!
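
The four steps can be put together in a tiny, hedged sketch using the small network section from earlier (two inputs, two hidden perceptrons, one output, no biases). The two training examples, the `learning_rate` value and the number of epochs are all made up for illustration; because the toy data set is so small, it is passed through many times.

```python
import math
import random

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def feed_forward(x1, x2, w):
    # w holds [w1, w2, w3, w4, w5, w6]
    h1 = sig(x1 * w[0] + x2 * w[2])   # z1 = x1*w1 + x2*w3
    h2 = sig(x1 * w[1] + x2 * w[3])   # z2 = x1*w2 + x2*w4
    o1 = sig(h1 * w[4] + h2 * w[5])   # z3 = h1*w5 + h2*w6
    return h1, h2, o1

def slopes(x1, x2, target, w):
    # Chain rule: slope of the cost (o1 - target)^2 with respect to each weight.
    h1, h2, o1 = feed_forward(x1, x2, w)
    d_z3 = 2.0 * (o1 - target) * o1 * (1.0 - o1)
    d_z1 = d_z3 * w[4] * h1 * (1.0 - h1)
    d_z2 = d_z3 * w[5] * h2 * (1.0 - h2)
    return [d_z1 * x1, d_z2 * x1, d_z1 * x2, d_z2 * x2, d_z3 * h1, d_z3 * h2]

def mean_cost(data, w):
    return sum((feed_forward(x1, x2, w)[2] - t) ** 2 for x1, x2, t in data) / len(data)

random.seed(0)
weights = [random.random() for _ in range(6)]    # step 1: random weights in 0..1
data = [(1.0, 0.0, 1.0), (0.0, 1.0, 0.0)]        # made-up (x1, x2, target) examples
learning_rate = 0.5

cost_before = mean_cost(data, weights)
for epoch in range(2000):
    for x1, x2, target in data:                  # step 4: every example = one epoch
        grads = slopes(x1, x2, target, weights)  # steps 2 and 3: forward, then backward
        # Positive slope -> smaller weight; negative slope -> larger weight.
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
cost_after = mean_cost(data, weights)

print(cost_before, cost_after)                   # the cost should have dropped
```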


# Python Example

The Python example does this in about 20 lines of code.

Rather than altering one weight at a time, matrices are used so that many weights can be changed in one step.

See the Python walk-through for an example:

  - code: https://github.com/Bot-Academy/NeuralNetworkFromScratch/blob/master/nn.py
  - video explanation: https://www.youtube.com/watch?v=9RN2Wr8xvro
  - there are no special libraries (hiding complexity) here<br>

![pyy.png](attachment:890c5962-c790-40b7-a578-5c3e4bf1d0da.png)