In [None]:
%%HTML
<link rel="stylesheet" type="text/css" href="css/custom.css">

# Machine Learning Theory

In this section we shall discuss the difference between classification and regression. 

- [Supervised learning](#supervised)
- [What is classification?](#classification)
- [What is regression?](#regression)
    - [<mark> Exercise: Classification or Regression </mark>](#c-or-r)

# Machine Learning Theory

Afterwards we shall also build some intuition for some of the more common machine learning models. 

- [A brief look into machine learning models](#models)
    - Decision Tree
    - Random Forest
    - Gradient Boosting
    - K-Nearest Neighbour
    - Support vector machine
    - Linear Regression
    - Polynomial Regression
    - Logistic Regression
    - Neural Networks/Deep Learning

# Machine Learning Theory

Finally, we shall examine how to check if a model generalises well, by testing for under/over-fitting.

- [Under/over-fitting](#fitting)


<a id = 'supervised'></a>
## Supervised Learning
---
The goal of supervised learning is to learn a function that can map a feature matrix (X) to a target vector (y).

<img src='../images/03_Machine_Learning_Theory/supervised_predictions.png' width=800px>

The most common forms of supervised learning are **classification** and **regression**.

## Classification
---
<a id = 'classification'></a>
<img src='../images/03_Machine_Learning_Theory/classification.jpeg' width=800px align='center'>

<img src='../images/03_Machine_Learning_Theory/robot-class.png' width=400px align='right'>

## What is classification?

When performing classification we try to identify the class or category that an observation belongs to.

The target vector therefore consists of discrete values (or labels).

Can you think of any examples of a classification problem?

## Examples of Classification Problems
<img src=../images/03_Machine_Learning_Theory/types-class.png width="800" >


## Regression vs. Classification


Can you think of a situation where a classification algorithm wouldn't work?

<a id = 'regression'></a>
## Regression 
---
We've seen problems that require us to classify observations (e.g. type of penguin).

We perform regression when our target variable is continuous. 

<img src='../images/03_Machine_Learning_Theory/supervised_predictions.png' width=600px>


## Examples of Regression Problems
<img src='../images/03_Machine_Learning_Theory/type-regress.png' width=700px>


<a id = 'c-or-r'></a>
## <mark> Exercise: Classification or regression? </mark>

1. Predict the height of a potted plant from the amount of rainfall.
2. Identify whether a movie review is positive or negative.
3. Decide whether the object in an image is a cat, dog or mouse.
4. Predict the price of a house based upon information about its floor space and the area it is situated in.
5. Predict whether a user is in their 20s, 30s, 40s, 50s or 60+ (what could be an issue with this framing of the problem?)


<a id = 'models'></a>
## Machine Learning models
---

### Decision Tree

This is a non-parametric algorithm that is able to perform both classification and regression by making a series of decisions based on the different features of the dataset. This series of decisions has a tree-like structure, which allows us to produce a readable decision diagram.

<img src="../images/03_Machine_Learning_Theory/Decision-Tree-Algorithms.png" style="display: block;margin-left: auto;margin-right: auto;height: 200px"/>

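Each decision splits the tree into new branches. At the very end of the branches are leaf nodes, each of which holds the tree's prediction: a class for classification or a numerical value for regression. If the depth of the tree is restricted, a leaf may still contain a mix of classes, in which case the most common class is typically predicted.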
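
To make the idea of a single decision concrete, here is a minimal sketch of how one split might be chosen. It tries every threshold on one feature and keeps the one with the lowest weighted Gini impurity. The Gini criterion, the function names and the toy flipper-length data are assumptions made for illustration; a full tree would search over every feature and repeat the process on each branch.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 when every label in the group is identical."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, impurity) of the best 'value <= threshold' split."""
    best_threshold, best_impurity = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each branch's impurity by the share of rows it receives.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Hypothetical toy data: flipper length (mm) and species label.
flipper_mm = [181, 186, 190, 193, 210, 215, 220, 230]
species = ["Adelie"] * 4 + ["Gentoo"] * 4

threshold, impurity = best_split(flipper_mm, species)
print(threshold, impurity)
```

On this toy data a single split separates the two species completely, so both branches are pure leaf nodes.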

## Ensemble methods

Ensemble models make predictions based on a number of different models. 

By combining individual models, an ensemble can be more flexible (less bias) and less specific to the data it was trained on (less variance) than any single model.

The two most popular ensemble methods are bagging and boosting:
- **Bagging**: Training a bunch of individual models in a *parallel* way. Each model is trained on a random subset of the data.
- **Boosting**: Training a bunch of individual models in a *sequential* way. Each individual model learns from mistakes made by the previous model.

<!-- https://towardsdatascience.com/basic-ensemble-learning-random-forest-adaboost-gradient-boosting-step-by-step-explained-95d49d1e2725 -->

### Random Forest (bagging)

A Random Forest ensembles a large number of independent unrestricted decision trees. The trees are trained on (bootstrapped) subsamples of the training data, with a random selection of features.

After each tree produces a prediction, voting or averaging is used to produce the final prediction.

<img src='../images/03_Machine_Learning_Theory/randomforest.jpeg' width=400px>
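
The two ingredients described above, bootstrapped samples and a combined prediction, can be sketched in a few lines. This is only an illustration with hypothetical helper names and made-up tree outputs; it leaves out the trees themselves and the random selection of features.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (a bootstrapped sample)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine class predictions from several trees into one label."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Combine numeric predictions from several trees into one value."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for the row indices of a training set
sample = bootstrap_sample(rows, rng)  # some rows repeat, others are left out

# Hypothetical outputs from five trees for a single observation.
label = majority_vote(["Adelie", "Gentoo", "Adelie", "Adelie", "Chinstrap"])
value = average([4100.0, 3900.0, 4000.0, 4200.0, 3800.0])
```

Voting is used for classification and averaging for regression.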


### Gradient Boosting

Gradient Boosting is an ensemble learning method that sequentially trains models to address the mistakes, or residual error, of the models that came before.

<img src='../images/03_Machine_Learning_Theory/gradient-boosting.png' width=600px>
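
As a rough sketch of the sequential idea, the code below starts from the mean of the targets and then repeatedly fits a one-split tree (a "stump") to the current residuals. The stump learner, the `learning_rate` value and the toy data are assumptions chosen for illustration, and the sketch uses squared error, for which the residual is exactly the quantity each new model is trained on.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    distinct = sorted(set(xs))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then fit each new stump to the remaining residuals."""
    baseline = sum(ys) / len(ys)
    stumps = []
    predictions = [baseline] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: baseline + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data with three rough "steps".
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

one_round = boost(xs, ys, n_rounds=1)
many_rounds = boost(xs, ys, n_rounds=20)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

Each extra round reduces the error on the training data, which is also why boosting can overfit if it runs for too many rounds.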

### K-Nearest Neighbour

The KNN algorithm assumes that similar things exist in close proximity. To classify a new data point, the model looks at the K nearest points in the training data and assigns the most common class among them.

<font color='blue'>$\text{Birds of a feather flock together}$</font>
<img src="../images/03_Machine_Learning_Theory/KnnClassification.png" width="400">
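
A minimal sketch of this procedure, assuming Euclidean distance and a simple majority vote (the function name and the toy measurements are hypothetical):

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: (bill length mm, bill depth mm) and species.
points = [(39.1, 18.7), (38.6, 17.9), (40.3, 18.0),
          (46.1, 13.2), (47.5, 14.0), (48.7, 14.1)]
labels = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"]

prediction = knn_predict(points, labels, (39.5, 18.2), k=3)
print(prediction)
```

Because the method relies on distances, features measured on very different scales are usually rescaled first so that no single feature dominates.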

### Support Vector Machine

SVMs are powerful classifiers that separate two classes with a hyperplane, chosen so that the margin between the hyperplane and the nearest points of each class is as large as possible.

<img src='../images/03_Machine_Learning_Theory/svm.gif' width=800px>

### Support Vector Machine

<img src='../images/03_Machine_Learning_Theory/svm.png' width=900px>

### Linear Regression

Linear regression is a linear approach to modelling the relationship between a dependent variable and one or more explanatory variables.

In the case where there is one explanatory variable, the linear relationship is defined by a slope ($\mathbf{m}$) and an intercept parameter ($\mathbf{c}$):

$$ y = \mathbf{m} x + \mathbf{c} $$

For example, to predict a penguin's body mass ($y$) from its bill length ($x$), writing the intercept as $\beta_0$ and the slope as $\beta_1$:

$$ y = \beta_0 + \beta_1 x $$
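
One common way to choose the intercept and slope is ordinary least squares, which has a closed-form solution when there is a single explanatory variable. The sketch below assumes that approach; the function name and the toy values are hypothetical and are not taken from the penguins dataset.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept and slope of a straight line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    # The fitted line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical values that lie exactly on the line y = 500 + 80x.
bill_length_mm = [35.0, 40.0, 45.0, 50.0]
body_mass_g = [3300.0, 3700.0, 4100.0, 4500.0]

intercept, slope = fit_line(bill_length_mm, body_mass_g)
print(intercept, slope)
```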

In [None]:
import pandas as pd
import seaborn as sns

# Load the penguins dataset.
penguins = pd.read_csv('../data/penguins.csv')


def simple_regression():
    """Scatter body mass against bill length with a fitted regression line."""
    sns.regplot(data=penguins, x='bill_length_mm', y='body_mass_g',
                scatter_kws={'s': 10, 'alpha': 0.2},
                line_kws={'color': 'red', 'linewidth': 2})

In [None]:
simple_regression()

### Residuals 

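We can examine how good the regression is by calculating the **residuals**: the vertical distance between each data point and the regression line, i.e. the difference between the observed value ($y_i$) and the predicted value ($\hat{y}_i$).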
<!-- 
![](https://cdn.kastatic.org/googleusercontent/Ebu4-AAwd4Z3irAQ9-AVyvA2abB-rb8cvQBjy60N42qD7JcDyd81bvz8DRiX6y2op9w2ryROslzP9OFtJ5PO9i6s) -->


<img src="../images/03_Machine_Learning_Theory/residuals.png" width="600">

### Cost function

We aim to find the model that minimises a cost/objective function.

We could consider the average residual value:

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$

What would be the issue with this approach?

### Cost function

To ensure residuals do not cancel each other out, it is better to use a method that is not dependent on the sign ($\pm$) of the residual.

For example, the Mean Absolute Error (MAE):

$$\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

### Cost function

The most common choice is the Mean Squared Error (MSE):

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This approach gives more weight to large residuals.
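
The three candidate cost functions can be compared on a small example. The targets and predictions below are hypothetical, chosen so that the residuals are +1, -1, +4 and -4.

```python
def mean_residual(y_true, y_pred):
    """Average signed residual: positive and negative errors can cancel."""
    return sum(y - p for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average size of the residuals, ignoring their sign."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared residual: large errors count disproportionately."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions with residuals +1, -1, +4, -4.
y_true = [10.0, 10.0, 10.0, 10.0]
y_pred = [9.0, 11.0, 6.0, 14.0]

print(mean_residual(y_true, y_pred))        # 0.0
print(mean_absolute_error(y_true, y_pred))  # 2.5
print(mean_squared_error(y_true, y_pred))   # 8.5
```

The mean residual is zero even though every prediction is wrong, while the MSE is dominated by the two residuals of size 4.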

### Multiple linear regression

We may get a better model if we use information from more than one explanatory variable (**multiple** linear regression).

For example, to predict a penguin's body mass ($y$) from its bill length ($x_1$) and bill depth ($x_2$):

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $$

### Polynomial regression

It may be that the explanatory variables have a polynomial relationship with the dependent variable.

*Think of Einstein's famous $E = m c^2$*

### Polynomial regression

To examine this, we can create polynomial versions of our explanatory variables and check whether they give us a better fit.

<!-- ![](https://static.javatpoint.com/tutorial/machine-learning/../images/machine-learning-polynomial-regression.png) -->

<img src="../images/03_Machine_Learning_Theory/poly.png" width="600" >

For example, if there are two explanatory variables $x_1$ and $x_2$, a regression with quadratic features would look like so:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2 $$

where we have additional $\beta_3$, $\beta_4$ and $\beta_5$ parameters associated with the polynomial features.
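
The quadratic model above can be sketched as a feature expansion followed by an ordinary weighted sum. The function names and coefficient values are hypothetical and only illustrate the arithmetic.

```python
def quadratic_features(x1, x2):
    """Expand two explanatory variables into the five quadratic-model features."""
    return [x1, x2, x1 ** 2, x1 * x2, x2 ** 2]


def predict(betas, x1, x2):
    """Evaluate y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x1*x2 + b5*x2^2."""
    intercept, *weights = betas
    features = quadratic_features(x1, x2)
    return intercept + sum(w * f for w, f in zip(weights, features))


# Hypothetical coefficients beta_0 ... beta_5, chosen only for illustration.
betas = [1.0, 2.0, 3.0, 0.5, -1.0, 0.25]
y = predict(betas, x1=2.0, x2=4.0)
print(y)  # 15.0
```

The model is still linear in its parameters, so once the extra features have been created it can be fitted in the same way as a multiple linear regression.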



### Logistic Regression

With logistic regression we aim to predict the probability that the output is 1 ($y = 1$).

For example, the probability the penguin's sex is male ($y=1$), given their bill depth ($x$).

<img src="../images/03_Machine_Learning_Theory/linear.png" width="700">


However, the predictions ($y$) made using linear regression have a potentially infinite range $(-\infty, \infty)$, whereas a probability must lie between 0 and 1.

<img src="../images/03_Machine_Learning_Theory/linear.png" width="700">


With logistic regression, we use a function ($S$) to ensure the values are bounded between 0 and 1. This is what gives logistic regression its famous "S" shape.

<img src="../images/03_Machine_Learning_Theory/logistic.png" width="700">

Where $S$ is the sigmoid function:
$$ S(x) = \frac{1}{1+e^{-x}} = \frac{e^{x}}{1+e^{x}} $$
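
A minimal sketch of the sigmoid and of how it turns a linear prediction into a probability. The function names and the coefficient values in the usage lines are hypothetical.

```python
import math


def sigmoid(x):
    """Squash any real number into a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the equivalent form e^x / (1 + e^x) to avoid overflow.
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


def predict_probability(beta_0, beta_1, x):
    """Probability that y = 1: the sigmoid of a linear function of x."""
    return sigmoid(beta_0 + beta_1 * x)


# Hypothetical coefficients, chosen only to show the shape of the curve.
for bill_depth in (14.0, 17.0, 20.0):
    print(bill_depth, predict_probability(-17.0, 1.0, bill_depth))
```

The linear part can take any value, but the output always stays between 0 and 1 and equals 0.5 exactly where the linear part is zero.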

### Neural Networks/Deep Learning
Artificial neural networks are loosely inspired by the human brain and can learn features directly from the data. Neural networks with many layers form a subfield of machine learning called deep learning.

<!-- ![half center](https://miro.medium.com/max/1386/1*ZX05x1xYgaVoa4Vn2kKS9g.png) -->

<img src="../images/03_Machine_Learning_Theory/nn.png">

<a id = 'fitting'></a>
## Under/over-fitting
---

A model's performance falls on a spectrum between these two regimes:

* The model is **underfitting**: the model is too simple to learn the underlying pattern.
* The model is **overfitting**: the model fits the training data too closely and doesn't generalize to new data.

Below is an example of regression: the model starts off too simple but then gets too specific to the dataset, suggesting it will not generalize.

<img src="../images/03_Machine_Learning_Theory/fitting_regimes.png" style="display: block;margin-left: auto;margin-right: auto;width: 900px"/>

Below is an example of classification: the model starts off too simple but then gets too specific to the dataset, suggesting it will not generalize.
    
<!-- ![](https://www.oreilly.com/library/view/deep-learning/9781491924570/assets/dpln_0107.png) -->

<img src="../images/03_Machine_Learning_Theory/classification_fitting.png"/>

If we only evaluate our model performance on the data we trained on, we won't understand how our model generalizes.

Therefore we create separate **train** and (hold out) **test** sets.

<!-- <img src="https://miro.medium.com/max/2272/1*-8_kogvwmL1H6ooN1A1tsQ.png" style="display: block;margin-left: auto;margin-right: auto;width: 600px"/> -->

<img src="../images/03_Machine_Learning_Theory/train-test.png" width="600" />
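
A hold-out split can be sketched as a shuffle followed by a slice. The function name, the 25% test fraction and the fixed seed are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows, then hold out a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for 20 observations
train, test = train_test_split(rows, test_fraction=0.25, seed=0)
print(len(train), len(test))  # 15 5
```

Shuffling before slicing matters when the rows are stored in a meaningful order, for example grouped by class.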

**Underfitting vs overfitting**

If a model performs badly on the train set, we know it is **underfitting**.

If a model performs well on the train set, but badly on the test set, it is **overfitting**.


<img src="../images/03_Machine_Learning_Theory/under-over.png" width="600" />
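
The two rules above can be written as a simple rule of thumb. The thresholds `good_score` and `max_gap` are arbitrary assumptions, as is the function name; what counts as "well" or "badly" depends on the problem.

```python
def diagnose_fit(train_score, test_score, good_score=0.8, max_gap=0.1):
    """Rule-of-thumb reading of train and test scores (higher is better)."""
    if train_score < good_score:
        # Poor performance even on the data the model has seen.
        return "underfitting"
    if train_score - test_score > max_gap:
        # Good on the train set but much worse on unseen data.
        return "overfitting"
    return "reasonable fit"


# Hypothetical accuracy scores for three different models.
print(diagnose_fit(0.55, 0.53))
print(diagnose_fit(0.99, 0.70))
print(diagnose_fit(0.90, 0.88))
```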

# Summary

In this notebook we made the distinction between two types of **supervised** machine learning:
 - Classification: where we attempt to distinguish between two or more classes;
 - Regression: where we attempt to predict a numerical value.
 
We also briefly discussed some of the more common modelling techniques. This list is by no means exhaustive and in the later notebooks we will discuss the mechanics of these algorithms in more detail.

Finally, we discussed how to test if a model is able to generalize and what we mean by under/over-fitting.