# Essential ML concepts

## What is machine learning?

Almost any problem can be framed as a mapping from some input to some output.

# TODO diagram of example input-outputs

Some problems are easy to solve because we have managed to find the function that maps their inputs to their outputs.

# diagram of easy problems

For other problems of interest, it is extremely hard to find out what that function is.

# TODO diagram of hard problems

> Machine learning is the task of having a computer find this function for itself.

A more technical definition: a computer system is said to learn if its performance on a task (T), as measured by a performance metric (P), improves with experience (E).

As we will see, the most important functions in machine learning are parametric - that is, they have parameters which change the input-output relationship of the function.

The "learning" in machine learning is the adjustment of these parameters so that the function changes to better map inputs to outputs.
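As a minimal sketch of what "parametric" means, the hypothetical function below computes a straight line whose slope `w` and intercept `b` are its parameters. The function name and all values are assumptions made for illustration.

```python
def linear_model(x, w, b):
    """A parametric function: w and b control how x maps to the output."""
    return w * x + b

# The same input gives different outputs as the parameters change.
x = 2.0
before = linear_model(x, w=1.0, b=0.0)  # parameters before any learning
after = linear_model(x, w=3.0, b=1.0)   # parameters after a hypothetical update
print(before, after)
```

Changing `w` and `b` changes the input-output relationship without changing the code of the function itself.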

# TODO gif of changing linear regression prediction and parameters


## How does machine learning learn those functions?

The most important algorithm to understand for this program is how a machine learning model learns: it repeatedly updates its parameters by comparing its predictions with the labels given for many different input examples.

Overall, it works like this:
1. Get an example
1. Make a prediction from the inputs by passing them through the function
1. Compare the prediction with the target to compute the error
1. Update the function parameters in such a way that reduces the error
1. Repeat
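The steps above can be sketched on a toy problem. This is an illustration under stated assumptions, not a definitive implementation: the dataset, the one-parameter model `w * x`, the squared error, and the learning rate are all invented for the example.

```python
# Hypothetical toy dataset: each example is (input, target), with target = 2 * input.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0               # the single model parameter, starting from an arbitrary guess
learning_rate = 0.05  # assumed step size for this illustration

for epoch in range(200):                  # 5. repeat
    for x, target in examples:            # 1. get an example
        prediction = w * x                # 2. make a prediction
        error = prediction - target       # 3. compare the prediction with the target
        gradient = 2 * error * x          # slope of the squared error with respect to w
        w = w - learning_rate * gradient  # 4. update the parameter to reduce the error

print(round(w, 3))  # a value very close to 2.0
```

The parameter ends up near 2, the value that maps every input in this dataset to its target.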

## The 4 typical components of any machine learning algorithm

1. The data
1. The model
1. The criterion
1. The optimiser

### Component 1: Data

> The data is the set of _examples_ that make up your dataset

Examples in a dataset usually consist of 2 things: 
1. _Features_
    - The inputs
    - The thing you are using to predict the output
    - Also referred to as:
        - Inputs
1. _Labels_
    - The output
    - Also referred to as:
        - Outputs
        - Targets

Example datasets:
- A large body of text, where the examples are different sentences
- A set of examples containing images and their classification
- A set of examples containing features of different houses and the price of that house

Most commonly, the features of an example can be represented as a vector, and its label as a scalar.

## TODO vector, scalar features labels

All of the data can be represented in a single matrix, where the rows are examples, and the columns are different features of each of those examples. We call this the design matrix.
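As a rough illustration, a design matrix can be written as a list of rows. The house data below is entirely hypothetical, invented only to show the layout.

```python
# Hypothetical house data, invented purely for illustration.
# Columns (features): floor area in square metres, number of bedrooms, age in years.
design_matrix = [
    [120.0, 3, 15],  # example 0
    [85.0, 2, 40],   # example 1
    [200.0, 5, 2],   # example 2
    [60.0, 1, 8],    # example 3
]
# One scalar label (the price) per example, in the same order as the rows.
labels = [350_000, 210_000, 720_000, 150_000]

n_examples = len(design_matrix)     # number of rows
n_features = len(design_matrix[0])  # number of columns
feature_vector = design_matrix[1]   # the features of a single example
print(n_examples, n_features)  # 4 3
```

Each row is the feature vector of one example, and the label at the same index is that example's target.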

#### Show some example data e.g. house pricing

> Your data defines your problem, and what the model learns

Whether your targets are continuous or categorical values determines whether you have a _regression_ or a _classification_ problem:
- Regression: Targets are continuous
    - E.g. House price prediction
- Classification: Targets are categorical
    - E.g. Which word should come next following the input text?

### Component 2: Model

> The model takes an input, and uses it to predict an output.

# TODO diagram x -> y model/f

For example:
- Your model might take in a set of text and predict the next word
- Your model might take in the joint positions and accelerations of a robot and use them to predict the next move
- Your model might take in the features of a house and use it to predict the price

Because the model is, in most cases of interest, a mathematical function, the data needs to be represented numerically.
This can be a challenge for some common data types, like text.
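One simple way to turn text into numbers is to give each distinct word an integer id. The sketch below assumes whitespace-separated words and is only one of many possible schemes; real systems typically use more sophisticated tokenisation.

```python
text = "the cat sat on the mat"
words = text.split()

# Build a vocabulary: each distinct word gets an integer id, in order of first appearance.
vocabulary = {}
for word in words:
    if word not in vocabulary:
        vocabulary[word] = len(vocabulary)

# Represent the text as a list of integer ids.
encoded = [vocabulary[word] for word in words]
print(encoded)  # [0, 1, 2, 3, 0, 4]
```

The repeated word "the" maps to the same id both times, so the numeric form preserves which words are identical.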

### Component 3: The Criterion

> The criterion is a function that measures how bad your model is

The criterion is also commonly referred to as:
- The loss function
- The error function

The value returned from the criterion is a measure of how _bad_ your model is, which may also be known as:
- The loss
- The error

#### Mean Squared Error

The most important loss function for regression is the _mean squared error_.
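As a minimal sketch, the mean squared error is the average of the squared differences between predictions and targets. The function name and the numbers below are hypothetical.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical predictions and targets: the errors are 0, -1 and 2.
mse = mean_squared_error([2.0, 4.0, 7.0], [2.0, 5.0, 5.0])
print(mse)
```

Squaring makes every error positive and penalises large errors much more heavily than small ones.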

# TODO mean squared error diagram

#### Cross Entropy Loss

The most important loss function for classification is the _cross entropy loss_.

For classification tasks, your model will output a probability for each possible class. 
The cross entropy loss function is designed to take in the probability predicted for _the true class label_, and return:
- a low loss if that probability is high
    - The best probability to predict for the true label would be 1, which will give you a loss of zero
- a high loss if that probability is low.
    - The worst probability to predict for the true label would be 0, which will give you an infinite loss
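For a single example, this behaviour comes from taking the negative logarithm of the probability predicted for the true class. The sketch below is an illustration with a hypothetical function name, not a library implementation. It writes $-\log(p)$ in the equivalent form $\log(1/p)$ and handles the zero-probability case explicitly.

```python
import math

def cross_entropy(probability_of_true_class):
    """Loss for one example: -log(p), written here in the equivalent form log(1 / p)."""
    if probability_of_true_class == 0:
        return math.inf  # log(0) is undefined; the loss grows without bound
    return math.log(1 / probability_of_true_class)

# The loss rises as the probability given to the true class falls.
for p in [1.0, 0.9, 0.5, 0.1, 0.0]:
    print(p, cross_entropy(p))
```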

As you optimise your model, you should see the loss decrease. Graphically, this produces what is known as a _loss curve_.

# TODO ![](./images/tensorboard-loss-curve.png)

# TODO cross entropy loss diagram

### Component 4: The Optimiser

> The optimiser is an algorithm used to update the model so that it improves

#### Gradient descent

The most important optimiser to be aware of is called _stochastic gradient descent_.

You can picture it doing this:

# TODO ![](./images/gradient-descent-visualisation.png)

It works as follows:
- Take a batch of examples and predict their targets
- Compute the gradient of the objective (the thing you want to minimise - in our case, the loss) with respect to the model parameters
    - This tells you "for each parameter, if I were to increase it slightly, how quickly would the objective increase?"
    - The gradient contains one scalar value for each model parameter
- Update the parameters in the direction that would reduce the loss
    - This direction is the opposite sign of the gradient
        - If the parameter gradient is positive, then that means that increasing it would increase the loss, so it should be decreased
        - If the parameter gradient is negative, then that means that decreasing it would increase the loss, so it should be increased
    - We shift each parameter by an amount proportional to the gradient, scaled by a value we call the learning rate, $\alpha$
        - The learning rate controls the step size of each update
            - Too small and the model will take too long to converge
            - Too large and the model will diverge

Mathematically, this is repeated until the stopping condition is met:

$$\hat{y} = \text{model}(x, w)$$

$$L = \text{loss}(\hat{y}, y)$$

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}$$
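To see how the learning rate $\alpha$ affects this update, here is a small sketch on an assumed toy objective $L(w) = w^2$, whose gradient is $2w$ and whose minimum is at $w = 0$. The function name, starting value, and step counts are illustrative choices.

```python
def run_gradient_descent(learning_rate, steps=50, w=1.0):
    """Minimise the toy objective L(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient  # step against the gradient
    return w

too_small = run_gradient_descent(0.001)  # barely moves from the starting value
reasonable = run_gradient_descent(0.1)   # ends close to the minimum at w = 0
too_large = run_gradient_descent(1.1)    # overshoots further on every step
print(too_small, reasonable, too_large)
```

With the largest learning rate, each step overshoots the minimum by more than the previous one, so the parameter grows in magnitude instead of converging.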

Gradient descent requires that:
- The model and loss function are differentiable (and therefore continuous) with respect to the parameters
- The model is parametric


## Machine learning jargon



## Other essential machine learning concepts to understand

### Supervised & Unsupervised data

Supervised data is a set of data where each example contains both an input and an output.
- House features -> House price label

Unsupervised data, on the other hand, is a set of data where you only have features - no labels.
- Corpus of unlabelled text

### Thinking about high dimensional data

# diagram of 1d, 2d, 3d data

Because we live in a 3-D world, we can visualise arrows in 1, 2, and 3 dimensions. Each of these arrows is equivalent to a vector containing the coordinates of the arrow's tip (with its tail at the origin).

For higher dimensional objects, like a vector with 4 or more elements, you can still think of it like an arrow, but you can't quite visualise what it looks like.

> "To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it." - Geoffrey Hinton


### Hyperparameters

Gradient descent directly optimises the parameters that control the function the model represents, changing their values during training as the model iteratively makes predictions and improves.

> Hyperparameters are parameters that cannot be optimised directly, or that are difficult to optimise because they need to be set before training

Hyperparameters include:
- The type of function that represents the model e.g. neural network vs decision tree
- The learning rate of the optimiser
- The architecture of the neural network model


### The training set, the validation set, and the test set

When we train machine learning algorithms, we typically split the data into 3 sets, each with a different use.
- The training set
    - Used for updating model parameters that can be optimised directly
- The validation set
    - Used for evaluating trained models' generalisation capability
    - Used to compare between different models and different model hyperparameters and make choices about which to use
    - NOT used for evaluating final performance
- The test set
    - Used ONLY to determine a final measure of performance of the model on totally unseen data
    - Never used to make any decision about which model or hyperparameters are better
        - Doing so would be cheating: you would be picking the best model using the same data its final performance is evaluated on
    - Different from the validation set, which is used to make choices about the model
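A simple way to create the three sets is to shuffle the examples and slice them. The sketch below assumes a hypothetical dataset of 100 examples and a 70/15/15 split. The proportions are a common choice rather than a rule.

```python
import random

# Hypothetical dataset of 100 examples, represented here by their indices.
examples = list(range(100))

random.seed(0)            # fixed seed so the split is reproducible
random.shuffle(examples)  # shuffle so each set is a random sample of the data

# Assumed 70/15/15 split, computed with integer arithmetic.
n_train = len(examples) * 70 // 100
n_validation = len(examples) * 15 // 100

training_set = examples[:n_train]
validation_set = examples[n_train:n_train + n_validation]
test_set = examples[n_train + n_validation:]

print(len(training_set), len(validation_set), len(test_set))  # 70 15 15
```

Because the three slices do not overlap, no example used for training or model selection can leak into the final evaluation.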

### Overfitting & underfitting

- Underfitting: When your model is not able to learn to perform well enough on the training set.
- Overfitting: When your model learns to fit the training set too closely.

# TODO ![](./images/overfitting-underfitting.png)

Both overfitting and underfitting can lead to poor performance on unseen examples.

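Symptoms of overfitting and underfitting can appear in your loss curves. For example, a training loss that keeps falling while the validation loss rises suggests overfitting, whereas a training loss that stays high suggests underfitting.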

<!-- ### Important Performance Metrics -->

### Regularisation

> Regularisation is any method intended to reduce the generalisation error of a model (its error on unseen data), rather than its training error

# TODO ![](./images/overfitting-underfitting.png)



# Essential Python programming concepts

## Variables

## Lists

## Dictionaries

## Functions

## Classes

## Inheritance

## Magic methods



# Essential Mathematical Concepts


## Tensors

> In machine learning, a tensor is a generalisation of vectors and matrices to any number of dimensions

# TODO ![](./images/vector-matrix-tensor.png)

### Vector & matrix notation

We typically represent vectors with lower-case mathematical symbols, and matrices with upper-case mathematical symbols.

A vector element might have a subscript to represent its position in the vector.

A matrix element might have a double subscript to represent its position in the matrix, where the first subscript represents the row position, and the second the column position.
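The notation can be mirrored in code with lists. The values below are hypothetical. Note that mathematical subscripts usually start at 1, while Python indices start at 0.

```python
# A vector v with elements v_1, v_2, v_3 (hypothetical values).
v = [10, 20, 30]

# A matrix A with 2 rows and 3 columns, stored as a list of rows.
A = [
    [1, 2, 3],
    [4, 5, 6],
]

# Mathematical subscripts usually start at 1; Python indices start at 0.
v_2 = v[1]       # second element of the vector
A_2_3 = A[1][2]  # row 2, column 3 of the matrix
print(v_2, A_2_3)  # 20 6
```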

### Matrix multiplication

### Broadcasting

In many cases when programming, the shapes of the operands in a calculation don't match mathematically, but the calculation still works - such as when multiplying a matrix element-wise by a column vector.
This is because the numerical library (for example, NumPy) assumes that you want to _broadcast_ that vector across the matrix in a way that does make sense.
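As a small illustration of broadcasting, the sketch below multiplies a matrix element-wise by a column vector. It assumes NumPy is installed, and the values are hypothetical.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])  # shape (2, 3)
column = np.array([[10],
                   [100]])      # shape (2, 1), a column vector

# Element-wise multiplication: the column is stretched across all 3 columns of the matrix.
result = matrix * column
print(result.shape)  # (2, 3)
print(result)
```

Each row of the matrix is scaled by the matching entry of the column vector, as if the column had been copied three times side by side.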




### Dot product

### Cosine similarity

### e & log

### PI

### SIGMA

## Real numbers

## Differentiation

## Probability distributions


## Introducing the map of how we will build up to ChatGPT

- We need to be able to understand complex data like text
- We need to generate text data
- We need to be able to make sure the system is safe to use

## Where are we on the map of our journey to ChatGPT?

