# Neural networks and machine learning

## Aims:

- to be able to perform complex tasks beyond simple things such as fitting straight lines to data. 
- Examples of such tasks are:

  - handwriting recognition
  - speech recognition
  - image analysis (e.g., recognising objects in an image)
  - predicting the weather from historical data, not from physics principles
  - determining whether an email message is likely to be spam
  - targeting advertising based on the data available about a person
  - answering Google queries
  - writing coherent and accurate text (e.g., ChatGPT)
  - finding patterns in very large datasets, e.g., examining hundreds of billions of stars in our Galaxy, with dozens of data points (e.g., position, velocity, distance, spectrum) per star, and finding significant correlations (e.g., stars that were born at the same time/place; stars that come from an external galaxy)

- to be able to think as well as, or better than, a human
- to replace humans in many jobs
- to replace humans

<img src="https://mcba1.phys.unsw.edu.au/~mcba/t900.png" width="300" />

## Approach

- we have a model for a "computer" that can achieve at least some of the above aims moderately well: the human brain
- the human brain derives its thinking ability primarily through the interaction of neurons and synapses (the connections between neurons)
- if we simulate neurons and synapses in a computer, and connect a bunch of them together, the result should be able to think
- aside: will a machine ever be able to truly be conscious and think?

## How does a neuron work (greatly simplified)

- a neuron is a single cell that can be treated as having a single output (the firing rate of an electrical pulse), and multiple inputs (from other neurons, through synapses)
- the neuron's output is a function of summing all its inputs, with weights

![A neuron](https://mcba1.phys.unsw.edu.au/~mcba/neuron.png)

A complete map of the brain of a fruit fly larva: https://www.youtube.com/watch?v=NXr0ZdoYgRw with 3,016 neurons and 548,000 synapses, the result of a 12-year project completed in 2023.

## An artificial neuron

- is a _"node"_ with a single output and multiple inputs
- the output is the value of an _"activation function"_ applied to a single number: a linear combination of the inputs with _"weights"_, plus a _"bias"_ (an offset)
- for examples of activation functions, see Wikipedia
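
As a minimal sketch of the description above, the following computes the output of one artificial neuron. The sigmoid is used as the activation function purely as an assumption for illustration (it is one common choice among many), and the input values, weights, and bias are made up.

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Output of one artificial neuron: activation(sum_i w_i * x_i + bias)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical example values, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

output = neuron_output(inputs, weights, bias)
print(output)
```

Here the weighted sum is 0.2 - 0.3 - 0.4 + 0.1 = -0.4, so the neuron outputs sigmoid(-0.4), which is a little over 0.4.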

## An artificial neural network

![An artificial neural network, IBM](https://mcba1.phys.unsw.edu.au/~mcba/ibm-ann.png)

- an artificial neural network is simply a collection of _"nodes"_
- for convenience, the nodes are arranged in _"layers"_
- there is an _"input layer"_ of nodes that receives input from outside the network, e.g., this might be
     - an audio signal from a microphone
     - the light intensity in a pixel of an image
     - text
- there is an _"output layer"_ of nodes that is the final result of the calculation, e.g.,
     - the image contains a giraffe
     - the email is spam with a certainty greater than 99%
     - tomorrow's maximum temperature will be greater than 20 °C
- there are zero or more _"hidden layers"_ that embody the algorithm
- a node can accept 1 or more inputs from nodes in the preceding layer, and send its output to 1 or more nodes in the following layer
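
The layered structure described above can be sketched as a "forward pass", in which the outputs of one layer become the inputs of the next. This is only an illustration: it assumes every node is connected to every node in the preceding layer, uses a sigmoid activation throughout, and the weights and biases are invented rather than trained.

```python
import math

def sigmoid(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_output(inputs, weights, biases):
    """Each row of `weights` holds the input weights of one node in the layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Pass the inputs through each (weights, biases) layer in turn."""
    values = inputs
    for weights, biases in layers:
        values = layer_output(values, weights, biases)
    return values

# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output node.
hidden = ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, -1.0, 0.5])
output = ([[1.0, -1.0, 0.5]], [0.0])

result = forward([1.0, 2.0], [hidden, output])
print(result)  # a list with one number between 0 and 1
```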
  
## To make a neural network useful you have to

- specify the number of input nodes, the number of output nodes, and the number of hidden layers
- define the inputs to the input layer
- define the outputs from the output layer
- specify the activation functions
- set initial weights and biases for each node
- train the network on data (i.e., find the weights and biases)
- evaluate the performance of the network
- if the performance isn't satisfactory, try changing the training method, the node topology, the activation functions, perhaps improve the data quality, and retrain

## The Google crash course on machine learning

The following course is a 15-hour introduction to machine learning and the Google TensorFlow API:

https://developers.google.com/machine-learning/crash-course/

We are going to look at "First Steps with TF", and later "Validation Set".
       
## Machine learning terminology

https://developers.google.com/machine-learning/glossary

### Suppose we have a dataset upon which we want to train our model

- a _"feature"_ in the dataset is analogous to a column of a spreadsheet, i.e., the value of some variable, such as time, temperature; it can be as complex as you like, e.g., the "from" line in an email, the number of words in an input textbox, the intensity of a star's spectrum at 543 nm
- a _"label"_ is something we are trying to predict, e.g., "this email is spam", "this star is of spectral type G2", "this image contains a giraffe". The label is also a column in a spreadsheet, and there can be multiple labels for an example
- an _"example"_ is like a row of a spreadsheet;  a _"labelled example"_ has the label attached (perhaps by being manually entered by a human), an _"unlabelled example"_ has no label
- a _"model"_ is the neural network with all its nodes, layers, activation functions, weights, and biases. It represents our way of estimating labels from features
- _"training"_ the model is the process of refining the model so that it works to some desired accuracy
- the _"loss"_ of a model is a measure of the quality of output from the model. A loss of zero is perfect, a high loss indicates a poor model
- we have to define how we measure _"loss"_, and what function we might pass the raw loss measurement through to assist with training.  A common choice for loss is least squares, where the loss is the mean of the squares of the differences between the model's predictions and the labels (assumed to be scalars in this example)
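
A minimal sketch of a least-squares loss follows. It assumes that each label is compared with the model's prediction for the same example, and that both are scalars. The function name and the numbers are illustrative only.

```python
def mean_squared_error(predictions, labels):
    """Average of the squared differences between predictions and labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(squared) / len(squared)

labels = [1.0, 2.0, 3.0]

# A model that reproduces every label exactly has zero loss.
perfect = mean_squared_error([1.0, 2.0, 3.0], labels)

# A poorer model: errors of 1, 0 and 2 give a loss of (1 + 0 + 4) / 3.
poor = mean_squared_error([2.0, 2.0, 5.0], labels)

print(perfect, poor)
```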

### We are now ready to train our model

- the idea is to vary the _"parameters"_ (the weights and biases of the nodes), in such a way as to minimise the loss
- the classic approach to doing this is _"gradient descent"_ to a minimum
- we may need to use random jumps of the parameters to explore enough of the parameter space, e.g., to avoid getting stuck in a local minimum
- for very large datasets we will have to train on a subset of the data
- we may have to randomly select _"batches"_ of data from the main dataset in order to speed up the training process
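
To make the idea concrete, here is a sketch of gradient descent with randomly selected batches, applied to the simplest possible model: a straight line y = w*x + b with a mean-squared-error loss. The data are synthetic, and the learning rate, batch size, and number of passes through the data are assumptions picked to suit this toy problem, not recommended values.

```python
import random

def train_linear(xs, ys, learning_rate=0.05, epochs=300, batch_size=4, seed=0):
    """Fit y = w*x + b by mini-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # initial parameters
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)             # visit the examples in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the batch-mean squared error with respect to w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step "downhill", i.e. against the gradient.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 3x + 1 (illustrative only).
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]

w, b = train_linear(xs, ys)
print(w, b)  # should end up close to 3 and 1
```

Each update uses only a handful of examples, so it is cheap to compute, at the cost of a noisier path towards the minimum.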

### "hyperparameters"

- a _"hyperparameter"_ is some property of the model, and the learning process, beyond the weights and biases. Hyperparameters include the number of hidden layers, the number of nodes, the rate at which the model descends to a minimum (the _"learning rate"_), and the activation function
- the loss function is differentiable with respect to the model _"parameters"_, which makes minimisation efficient with gradient descent; this isn't the case with hyperparameters 
- often, a human has to choose the hyperparameters, and lets TensorFlow optimise the parameters
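
The split between parameters and hyperparameters can be illustrated with a toy problem. In this sketch, the single parameter `w` is found by gradient descent on the made-up loss (w - 3)^2, while the learning rate (a hyperparameter) is chosen by simply trying a few hand-picked candidate values. The candidates and the loss are assumptions for illustration only.

```python
def final_loss(learning_rate, steps=50):
    """Minimise the toy loss (w - 3)^2 by gradient descent; return the final loss."""
    w = 0.0
    for _ in range(steps):
        gradient = 2 * (w - 3)           # d(loss)/dw: the loss is differentiable in w
        w -= learning_rate * gradient
    return (w - 3) ** 2

# Hypothetical candidate learning rates, tried one by one.
candidates = [0.001, 0.01, 0.1, 1.1]
losses = {lr: final_loss(lr) for lr in candidates}

best = min(losses, key=losses.get)
print(best)
```

With these candidates, the two smallest learning rates converge too slowly to reach the minimum in 50 steps, and the largest overshoots and diverges, which is why the choice matters.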

### "Feature engineering"

- the art/science of going from raw data to features
- identifying and handling bad data
- mapping data to features
- scaling data so that the features map fairly uniformly to a range such as 0 to 1, or -1 to +1
- _"one-hot encoding"_ where a feature is a vector of binary values, and each example only has one bit set (e.g., the vector could represent every word in a dictionary, and the feature could be an input word)
- _"multi-hot encoding"_ where multiple bits can be 1
- note that the vectors for one-hot encoding are very sparse (i.e., almost all zeroes), so special techniques are used to store them without using lots of computer memory
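
A few of these steps can be sketched in plain Python. The helper names and the four-word vocabulary are hypothetical. Min-max scaling is used here as one simple way of mapping values to the range 0 to 1, and storing only the index of the set bit is one simple way of keeping a one-hot vector compact.

```python
def min_max_scale(values):
    """Linearly rescale values to the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]     # constant feature: nothing to scale
    return [(v - low) / (high - low) for v in values]

def one_hot(item, vocabulary):
    """Vector with a single 1 at the position of `item` in `vocabulary`."""
    return [1 if word == item else 0 for word in vocabulary]

def multi_hot(items, vocabulary):
    """Vector with a 1 for every vocabulary word that appears in `items`."""
    present = set(items)
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical four-word "dictionary".
vocabulary = ["cat", "dog", "giraffe", "star"]

scaled = min_max_scale([10.0, 15.0, 20.0])
dog = one_hot("dog", vocabulary)
cat_and_star = multi_hot(["cat", "star"], vocabulary)

# Sparse storage of a one-hot vector: keep only the index of the set bit.
dog_index = vocabulary.index("dog")

print(scaled, dog, cat_and_star, dog_index)
```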

### Handling non-linear models

- linear problems are very quick to train, so if at all possible, try to make the problem linear
- one approach to linearising a model is to create _"feature crosses"_, i.e., creating a new feature by multiplying two or more features together
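
As an illustration, consider four made-up examples whose label is 1 when the two features have the same sign. No straight line in the (x1, x2) plane separates the two classes, but after adding the feature cross x1*x2, a linear rule on the three features does. The weights below are set by hand for clarity, not learned.

```python
def add_cross(example):
    """Append the product x1 * x2 as a third, synthetic feature."""
    x1, x2 = example
    return [x1, x2, x1 * x2]

# Four hypothetical examples: the label is 1 when x1 and x2 share a sign.
examples = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, 0, 0]

crossed = [add_cross(e) for e in examples]

# Linear rule on the crossed features: only the cross term carries weight.
weights, bias = [0.0, 0.0, 1.0], 0.0
predictions = [
    1 if sum(w * x for w, x in zip(weights, row)) + bias > 0 else 0
    for row in crossed
]

print(predictions)
```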

### Training sets, validation sets, and test data sets

- these are essential to avoid _"overfitting"_, where your model works perfectly with the data it was trained on, but fails when presented with new data
- split your data into three subsets: the _"training set"_, the _"validation"_ set, and the _"test set"_. Choosing these subsets, and ensuring that they are all representative of the data, can be tricky. 
- train your model on the training set
- evaluate the model on the validation set, adjust the model (e.g., its hyperparameters) if needed, then retrain
- finally test your model on the test set
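
A simple random split might look like the sketch below. The 70/15/15 proportions are an assumption for illustration, and a purely random shuffle is only appropriate when the examples are independent of each other. Time series or grouped data, for instance, need more care to keep the subsets representative.

```python
import random

def split_dataset(examples, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle the examples and split them into training, validation and test sets."""
    shuffled = list(examples)            # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]   # whatever is left over
    return train, validation, test

# Stand-in dataset: 100 examples represented by their row numbers.
examples = list(range(100))
train, validation, test = split_dataset(examples)

print(len(train), len(validation), len(test))
```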

### Regularization

- _"Regularization"_ is simply reducing the complexity of a model to avoid overfitting. This is often done by adding a penalty to the loss that drives some weights towards zero (or exactly to zero). One way to quantify the complexity of a model is the number of nonzero weights.
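
One common way to put this into practice is to add a penalty on the weights to the loss, so that training favours simpler models. The sketch below shows two widely used penalties, usually called L1 (sum of absolute values) and L2 (sum of squares). These names, the `strength` parameter, and the numbers are not from the notes above and are used here only for illustration.

```python
def l1_penalty(weights):
    """Sum of absolute weights; tends to push some weights exactly to zero."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared weights; discourages large weights."""
    return sum(w * w for w in weights)

def regularized_loss(data_loss, weights, strength, penalty=l2_penalty):
    """Loss on the data plus a complexity penalty scaled by `strength`."""
    return data_loss + strength * penalty(weights)

def count_nonzero(weights):
    """A crude complexity measure: the number of nonzero weights."""
    return sum(1 for w in weights if w != 0)

# Hypothetical weights and data loss.
weights = [0.0, 2.0, -1.0, 0.0]
total_l2 = regularized_loss(0.5, weights, strength=0.1)
total_l1 = regularized_loss(0.5, weights, strength=0.1, penalty=l1_penalty)

print(total_l2, total_l1, count_nonzero(weights))
```

The `strength` value is itself a hyperparameter: larger values favour simpler models at the expense of fitting the training data less closely.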



