# Fundamentals

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukeconibear/intro_ml/blob/main/docs/01_fundamentals.ipynb)

# if you're using colab, then install the required modules
import sys

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    pass

## Basic ideas

### Overview

Machine learning is a subset of Artificial Intelligence.

It is a range of methods that learn associations from data.

It then uses these associations for new predictions.

These can be useful for:

- Prediction problems (e.g., pattern recognition).
- Problems that you cannot program explicitly (e.g., image recognition).
- Faster approximations to problems that you can program explicitly (e.g., spam classification).

### Methods

Within machine learning, there are many different methods.

Some main methods are:

- Classic
- Deep learning (neural networks)
- Reinforcement learning
- Ensembles (e.g., multiple decision trees)

We'll focus on _classic machine learning_ and _deep learning_ in this course.

### Classic Machine Learning

...


Classic machine learning suits simple data with clear features.

There are a wide variety of types. Some common ones are:

- Linear models
    - ...
- Nearest neighbours
    - Points are similar to their neighbours.
- Decision trees
    - Split the data by a decision (i.e., a branch of leaves).
    - Combine multiple decisions (i.e., a tree). 
- Support vector machines
    - ...
- And many more.

### Deep Learning

Deep learning uses neural networks, where _deep_ means many layers.

It is useful for non-linear problems with a large number of features.

It suits complicated data with unclear features.



There are a wide variety of types. Some common ones are:

- Convolutional Neural Networks (CNN)
    - ...
- Recurrent Neural Networks (RNN)
    - For sequential data e.g., time-series, natural language.
    - Loops over timesteps while maintaining information from previous timesteps.
- Transformers
    - ...
- Sequence
    - ...
- Generative Adversarial Networks (GANs)
    - ...
- And many more.
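As a minimal sketch of the recurrent idea above, the loop below steps through a sequence while carrying a state forward from one timestep to the next. The sizes, the random weights, and the `tanh` activation are assumptions for illustration only, and nothing is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
timesteps, n_features, n_hidden = 5, 3, 4

inputs = rng.normal(size=(timesteps, n_features))
W_input = rng.normal(size=(n_hidden, n_features))  # input-to-state weights
W_state = rng.normal(size=(n_hidden, n_hidden))    # state-to-state weights
bias = np.zeros(n_hidden)

state = np.zeros(n_hidden)  # information carried between timesteps
outputs = []
for x_t in inputs:
    # The new state depends on the current input and the previous state.
    state = np.tanh(W_input @ x_t + W_state @ state + bias)
    outputs.append(state)

outputs = np.stack(outputs)
print(outputs.shape)
```

Each row of `outputs` is the state after one timestep, so later rows depend on all earlier inputs.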

...

The steps in training a neural network are:

- Take the inputs.
- Forward propagate.
- Predict the outputs.
- Compute the loss.
- Backward propagate.
- Apply gradient descent.
- Update the weights and biases.
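The steps above can be sketched for the simplest possible network: a single neuron with one weight and one bias. The toy data, the learning rate, and the number of epochs are assumptions chosen so that the example converges quickly. The gradients are written out by hand rather than computed by a deep learning library.

```python
import numpy as np

# Toy data (hypothetical): y = 2x + 1 with no noise.
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0

# Parameters to learn, starting from zero.
weight, bias = 0.0, 0.0
learning_rate = 0.1  # a hyperparameter, chosen for illustration

for epoch in range(500):
    # Forward propagate: predict the outputs from the inputs.
    y_pred = weight * x + bias

    # Compute the loss (mean squared error).
    loss = np.mean((y_pred - y) ** 2)

    # Backward propagate: gradient of the loss for each parameter.
    grad_weight = np.mean(2.0 * (y_pred - y) * x)
    grad_bias = np.mean(2.0 * (y_pred - y))

    # Gradient descent: update the weight and bias.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

print(f"weight={weight:.3f}, bias={bias:.3f}, loss={loss:.6f}")
```

The learned weight and bias should approach 2 and 1, the values used to generate the data.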


Scale is driving deep learning progress:

- Bigger training data (i.e., a larger number of labelled examples, m).
- Bigger neural networks.

Investment and attention now drive it forward further.



### Data

The data is a sample of the problem you're studying.

Data has inputs (features) and outputs (targets).

- The inputs are what you provide to the model.
- The outputs are what you're trying to predict.

The data is normally in the form of tensors.

Tensors are multi-dimensional arrays:

- Scalars are rank-0 tensors.
- Vectors are rank-1 tensors.
- Matrices are rank-2 tensors.
- 3+ dimensional arrays are rank-3+ tensors.

![tensors.png](images/tensors.png)  

*[Image source](https://medium.com/mlait/tensors-representation-of-data-in-neural-networks-bbe8a711b93b)*
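As a small sketch, the ranks listed above can be checked with NumPy, where `ndim` gives the rank and `shape` gives the size along each axis. The values are arbitrary.

```python
import numpy as np

scalar = np.array(5.0)                       # rank-0 tensor
vector = np.array([1.0, 2.0, 3.0])           # rank-1 tensor
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # rank-2 tensor
cube = np.zeros((2, 3, 4))                   # rank-3 tensor

for name, tensor in [("scalar", scalar), ("vector", vector),
                     ("matrix", matrix), ("cube", cube)]:
    print(f"{name}: rank={tensor.ndim}, shape={tensor.shape}")
```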

### Supervised and unsupervised

- Supervised learning is when you provide labelled outputs to learn from.
- Unsupervised learning is when you don't provide any labels.

Below is an example of supervised learning (classify different coloured markers) and unsupervised learning (find clusters within similar data).

![supervised_vs_unsupervised.png](images/supervised_vs_unsupervised.png)  

*[Image source](https://analystprep.com/study-notes/cfa-level-2/quantitative-method/supervised-machine-learning-unsupervised-machine-learning-deep-learning/)*

We'll focus on supervised learning in this course.

### Classification and regression

- Classification problems are those that try to predict a discrete category (e.g., cat or dog).
- Regression problems are those that try to predict a continuous number (e.g., beans in a jar).

Below is an example of classification (separate blue circles from purple crosses) and regression (predict a numerical value from the data).

![classification_vs_regression.png](images/classification_vs_regression.png)  

*[Image source](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d)*
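To make the difference concrete, here is a rough sketch with made-up numbers. The regression fits a straight line with `np.polyfit` and predicts a number. The classification uses a nearest-class-mean rule (an assumption chosen for brevity, not a method from this course) and predicts a category.

```python
import numpy as np

# Regression: predict a continuous number (hypothetical jar data).
jar_volume_litres = np.array([1.0, 2.0, 3.0, 4.0])
beans_in_jar = np.array([100.0, 200.0, 300.0, 400.0])
slope, intercept = np.polyfit(jar_volume_litres, beans_in_jar, deg=1)
predicted_beans = slope * 5.0 + intercept  # prediction for a 5 litre jar

# Classification: predict a discrete category (hypothetical pet weights).
weights_kg = np.array([3.0, 4.0, 20.0, 30.0])
labels = np.array(["cat", "cat", "dog", "dog"])
class_means = {
    str(label): weights_kg[labels == label].mean()
    for label in np.unique(labels)
}


def classify(weight_kg):
    """Return the class whose mean weight is closest to the input."""
    return min(class_means, key=lambda label: abs(class_means[label] - weight_kg))


print(predicted_beans, classify(5.0), classify(18.0))
```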

### Training, validation, and test splits

The data is normally split into training, validation, and test sets.

- The training set is for training the model.
- The validation set (optional) is for iteratively optimising the model during training.
- The test set is only for testing the model at the end.
    - This should remain untouched and _single-use_ (to ensure it is representative of future data).

![train-val-test-split.png](images/train-val-test-split.png)  

*[Image source](https://stackoverflow.com/a/56100053/6250873)*

The size of the split depends on the size of the dataset and the signal you're trying to predict (i.e., the smaller the signal, the larger the test set needs to be).

- For small data sets, a split of 60/20/20 for train/validation/test may be suitable.
- For medium data sets, a split of 80/10/10 for train/validation/test may be suitable.
- For large data sets, a split of 90/5/5 for train/validation/test may be suitable.
- For very large data sets, a split of 98/1/1 for train/validation/test may be suitable.

The split may benefit from being stratified to ensure each set has a representative sample of each class.
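A minimal sketch of an 80/10/10 split using shuffled indices is below. The random data, the seed, and the variable names are assumptions for illustration. It does not stratify, which you may want for classification problems.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data set: 1000 samples with 3 features each.
n_samples = 1000
features = rng.normal(size=(n_samples, 3))
targets = rng.normal(size=n_samples)

# Shuffle the indices once, then slice them into three groups.
indices = rng.permutation(n_samples)
n_train = n_samples * 80 // 100
n_val = n_samples * 10 // 100

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]  # set aside until the very end

x_train, y_train = features[train_idx], targets[train_idx]
x_val, y_val = features[val_idx], targets[val_idx]
x_test, y_test = features[test_idx], targets[test_idx]

print(len(train_idx), len(val_idx), len(test_idx))
```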

### Cross-validation

To estimate the _variability_ in the model's score, you can use cross-validation.

This repeats the _training/validation_ split multiple times (_the test data remains untouched_).

There are various methods for cross-validation.

These are mainly variations of K-fold cross-validation, where you split the data into k folds (e.g., 5) and use each fold once for validation.

Variations then consider stratifying (preserving original class frequencies), shuffling, sampling, and replacing.

Below is an example for 5-fold cross-validation (i.e., splitting 5 times).

![cross_validation.png](images/cross_validation_diagram.png)  

*[Image source](https://inria.github.io/scikit-learn-mooc/python_scripts/02_numerical_pipeline_cross_validation.html)*
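As a sketch of the idea, the hypothetical helper `k_fold_indices` below shuffles the sample indices, cuts them into k folds, and uses each fold once for validation. It only produces indices. Training and scoring a model on each split is left out, and the test set is assumed to have been set aside already.

```python
import numpy as np


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != i]
        )
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=20, n_folds=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} training, {len(val_idx)} validation")
```

The spread of the validation scores across the folds then gives an estimate of the variability.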

### Hyperparameters

These are what _you set before_ model training (e.g., the architecture or the learning rate).

They control the learning process.

They are often found by iteratively trying out different options.

This iterative tuning method can be:

- Systematically over a grid (i.e., grid-search)
    - Thorough, slow, not suitable for problems with many variables
- Randomly over a grid (i.e., random grid-search)
    - Faster and more suitable for problems with many variables
- Other options including:
    - Using Bayes' theorem (i.e., Bayesian optimisation) to choose a new set of hyperparameters to test based on the performance of prior sets.
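The first two options can be sketched with the standard library. Here `validation_score` is a hypothetical stand-in for training a model and scoring it on the validation set, and the grid values are made up. In practice each call would be a full training run.

```python
import itertools
import random


def validation_score(learning_rate, n_layers):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return -(abs(learning_rate - 0.01) + abs(n_layers - 3))


grid = {"learning_rate": [0.001, 0.01, 0.1], "n_layers": [1, 2, 3, 4]}

# Grid search: try every combination (thorough, but grows quickly).
grid_candidates = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]
best_grid = max(grid_candidates, key=lambda params: validation_score(**params))

# Random search: try a fixed number of randomly sampled combinations.
rng = random.Random(0)
random_candidates = [
    {name: rng.choice(options) for name, options in grid.items()}
    for _ in range(5)
]
best_random = max(random_candidates, key=lambda params: validation_score(**params))

print(len(grid_candidates), best_grid)
print(len(random_candidates), best_random)
```

Random search tests fewer combinations, so it may miss the best one, but its cost does not grow with the number of hyperparameters.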

### Parameters

These are what the model learns _during training_ (i.e., the weights / biases / coefficients of the model).


The best set of parameters is the global optimum.

Training may instead settle in a local optimum.

...

### Training

- Optimiser
- Loss function (the error on a single training example)
    - You always want to minimise this.
    - It is a proxy for the metric with a smooth gradient (in some cases it is the same as the metric, e.g., mean squared error).
- Metric

...

- Cost function (the average of the loss function over the whole training set)
- Gradient descent
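As a small sketch of the distinction above, using squared error and made-up values: the loss is computed per training example, and the cost is the average of those losses.

```python
import numpy as np

# Hypothetical targets and predictions for three training examples.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# Loss: the squared error on each single training example.
losses = (y_pred - y_true) ** 2

# Cost: the average of the losses over the whole training set.
cost = losses.mean()

print(losses, cost)
```

With squared error as the loss, this cost is the mean squared error.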


### Evaluation

The goal of machine learning is predicting new data.

Hence, the objective is to minimise the _test error_ (as this represents new data).

...

- error analysis (e.g., Confusion Matrix)
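A confusion matrix counts how often each true class was predicted as each class. The sketch below builds one by hand for hypothetical binary labels, with rows as the true class and columns as the predicted class (a common convention, assumed here).

```python
import numpy as np

# Hypothetical labels: class 0 is common, class 1 is rare.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

n_classes = 2
confusion = np.zeros((n_classes, n_classes), dtype=int)
for true_label, predicted_label in zip(y_true, y_pred):
    confusion[true_label, predicted_label] += 1

# Accuracy is the diagonal (correct predictions) over the total.
accuracy = np.trace(confusion) / confusion.sum()

# Recall for the rare class: correct rare predictions over all true rare cases.
recall_rare = confusion[1, 1] / confusion[1].sum()

print(confusion)
print(accuracy, recall_rare)
```

Here the accuracy looks reasonable while only half of the rare class is found, which is why a single overall score can hide errors on imbalanced classes.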

...

R² (coefficient of determination):

- Can be any value up to 1, as a model can be arbitrarily bad (i.e., the score can be negative).
- 1 is perfect.
- 0 means the model is no better than always predicting the mean.
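These three cases can be checked with a hand-written sketch of the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the values are assumptions for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    residual_ss = np.sum((y_true - y_pred) ** 2)
    total_ss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - residual_ss / total_ss


y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)
mean_only = r_squared(y_true, np.full_like(y_true, y_true.mean()))
worse_than_mean = r_squared(y_true, y_true[::-1])

print(perfect, mean_only, worse_than_mean)
```

The last case predicts the values in reverse order, which is worse than predicting the mean, so the score is negative.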

The evaluation metric is the goal of training.

- Use a single evaluation metric to help guide decisions.

Classification

- Class imbalance

Regression

- 


### Underfit

A model _underfits_ the data when it has _high bias_ (i.e., systematic errors). 

This means the model is _too simple_ to capture the association.

You can tell that the model underfits because there are _both_ high training errors and high test errors.

To reduce underfitting, try:

- Adding more features.
- Adding more complex features.
- Decreasing regularisation (i.e., decrease preference for simpler functions).

More training data is unlikely to help a model that underfits the data.

### Overfit

A model _overfits_ the data when it has _high variance_ (i.e., its predictions vary a lot with the particular training data).

This means the model is _too complex_ and fits the noise rather than the underlying association.

You can tell that the model overfits because there are _low_ training errors _but_ high test errors (i.e., there is a big difference between these errors, where the model doesn't work well on new data because it overfitted to the noise in the training data).

To reduce overfitting, try:

- Adding more data.
- Using fewer or simpler features.
- Increasing regularisation (i.e., increase preference for simpler functions).
- Using a smaller neural network with fewer layers/parameters.

Below is an example of underfitting (straight line through non-linear data) and overfitting (very high-order polynomial passing through every training point).

![underfit_vs_overfit.png](images/underfit_vs_overfit.png)  

*[Image source](https://www.educative.io/edpresso/overfitting-and-underfitting)*
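The same comparison can be sketched numerically by fitting polynomials of different degrees to noisy quadratic data and comparing training and test errors. The data, the noise level, and the degrees are assumptions chosen for illustration. The exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic signal plus noise.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = x_train**2 + rng.normal(scale=0.1, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=100)
y_test = x_test**2 + rng.normal(scale=0.1, size=x_test.size)

train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    # Fit on the training data only.
    coefficients = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = np.mean((np.polyval(coefficients, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefficients, x_test) - y_test) ** 2)
    print(f"degree {degree}: train={train_mse[degree]:.4f}, test={test_mse[degree]:.4f}")
```

The straight line (degree 1) cannot follow the curve, so its training error stays high. The degree 9 polynomial passes through all 10 training points, so its training error is almost zero, yet its test error is not, which is the gap that signals overfitting.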

## Exercises

```{admonition} Exercise 1

...

```

## {ref}`Solutions <fundamentals>`

## Key Points

```{important}

- [x] _..._

```

## Further information

### Good practices

- The choice of algorithm depends on the problem/data (i.e., whether you use linear regression, deep learning, etc.).
    - What assumptions are appropriate?
- Future data should be from the same distribution as the training data (otherwise there is _data drift_).
- The test set should be representative of the future data. For example:
    - For time series, test data may be 2021, while training data was 2015-2020.
    - For medical applications, test data may be completely new patients, not multiple visits from the same patients as in the training data.
- Consider reducing the dimensionality of the data (e.g., using PCA, Principal Component Analysis).
- Have a baseline to compare the model skill against (i.e., simple model, human performance, etc.).
- ...

### Caveats

- Predictions are primarily based on associations, not explanations or causation.
- Predictions and models are specific to the data they were trained on.

### Resources

Resources in **bold** are highly recommended.

- **[Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/), Aurélien Géron, 2019, O’Reilly Media, Inc.**  
    - **[Jupyter notebooks](https://github.com/ageron/handson-ml2).**  
- [Deep Learning with Python, 2nd Edition](https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff), François Chollet, 2021, Manning.  
    - [Jupyter notebooks](https://github.com/fchollet/deep-learning-with-python-notebooks).  
- [Artificial Intelligence: A Modern Approach, 4th edition](http://aima.cs.berkeley.edu/), Stuart Russell and Peter Norvig, 2021, Pearson.  
- [Machine Learning Yearning](https://www.deeplearning.ai/programs/), Andrew Ng.  

(online_courses)=
### Online courses

Courses in **bold** are highly recommended.

#### Machine learning

- **[Machine learning](https://www.coursera.org/learn/machine-learning), Coursera, Andrew Ng.**
    - **CS229, Stanford University: [Video lectures](https://www.youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU).**  
- **[Machine Learning for Intelligent Systems](http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/), Kilian Weinberger, 2018.**  
    - **CS4780, Cornell: [Video lectures](https://youtube.com/playlist?list=PLl8OlHZGYOQ7bkVbuRthEsaLr7bONzbXS).**  
- [Artificial Intelligence: Principles and Techniques](https://www.youtube.com/playlist?list=PLoROMvodv4rO1NB9TD4iUZ3qghGEGtqNX), Percy Liang and Dorsa Sadigh, CS221, Stanford, 2019.  
- [Machine learning in Python with scikit-learn](https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn/), scikit-learn developers, 2022.
  - [Course materials](https://inria.github.io/scikit-learn-mooc/)
  - [Jupyter Notebooks](https://github.com/INRIA/scikit-learn-mooc/) 


#### Deep learning

- **[Deep Learning Specialization](https://www.coursera.org/specializations/deep-learning), Coursera, DeepLearning.AI (_NumPy, Keras, TensorFlow_)**
    - **CS230, Stanford University: [Video lectures](https://www.youtube.com/playlist?list=PLoROMvodv4rOABXSygHTsbvUz4G_YQhOb), [Syllabus](http://cs230.stanford.edu/syllabus/)**
- [NYU Deep Learning](https://atcold.github.io/NYU-DLSP21/), Yann LeCun and Alfredo Canziani, NYU, 2021 (_PyTorch_)
    - [Video lectures](https://www.youtube.com/playlist?list=PLLHTzKZzVU9e6xUfG10TkTWApKSZCzuBI)  