# Supervised Machine Learning: Regression and Classification

## Regression

### Linear Regression Model

#### Univariate Linear Regression

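Our linear regression model is $f_{w,b}(x) = wx + b$, and our cost function is 
$$
J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})^{2},
$$
where $m$ is the number of training examples and $(x^{(i)}, y^{(i)})$ is the $i^{th}$ training example.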

We train the regression model with the gradient descent algorithm: given the cost function $J(w,b)$ for linear regression, we look for the parameters that achieve $\min_{w,b} J(w,b)$.

The gradient descent update, applied simultaneously to both parameters, is as follows: 
$$
\text{Repeat Until Convergence}: \begin{cases}
w = w - \alpha \frac{\partial}{\partial w} J(w,b) = w - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \\
b = b - \alpha \frac{\partial}{\partial b} J(w,b) = b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) \\
\end{cases}
$$
where $\alpha$ is the learning rate. 
- If $\alpha$ is too small, gradient descent will still work, but it will be slow.
- If $\alpha$ is too large, gradient descent may overshoot the minimum and fail to converge, or even diverge. 

In **batch** gradient descent, each step of the gradient descent algorithm uses all the training examples. 
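
As a minimal sketch of the update rule above, the following NumPy code runs batch gradient descent for the univariate model. The function names, the toy data and the values of `alpha` and `num_iters` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) for univariate linear regression."""
    m = x.shape[0]
    errors = w * x + b - y
    return np.sum(errors ** 2) / (2 * m)

def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.1, num_iters=5000):
    """Batch gradient descent: every step uses all m training examples."""
    m = x.shape[0]
    for _ in range(num_iters):
        errors = w * x + b - y              # f_wb(x^(i)) - y^(i) for every i
        dj_dw = np.sum(errors * x) / m      # partial derivative w.r.t. w
        dj_db = np.sum(errors) / m          # partial derivative w.r.t. b
        # Simultaneous update: both gradients were computed before either change.
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b

# Toy data generated from y = 2x + 1 (assumed for illustration only).
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

w_fit, b_fit = gradient_descent(x_train, y_train)
final_cost = compute_cost(x_train, y_train, w_fit, b_fit)
```

On this toy data the fitted parameters should approach $w = 2$ and $b = 1$, with the cost approaching zero.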

#### Multiple Linear Regression

For a problem with multiple feature variables, we need to change our model. For a training table with $m$ rows (examples) and $n$ feature columns, we use the following notation:
- $x_j$ is the $j^{th}$ feature. 
- $n$ is the number of features.
- $\vec{x}^{(i)}$ is the feature vector of the $i^{th}$ training example. 
- $x_j^{(i)}$ is the value of feature $j$ in the $i^{th}$ training example. 

Our linear regression model is now $f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$, or in vector notation $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$.

The cost function now becomes $J(\vec{w}, b)$, and so our gradient descent algorithm becomes: 
$$
\text{Repeat Until Convergence}: \begin{cases}
w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w},b) = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})x_j^{(i)} \\
b = b - \alpha \frac{\partial}{\partial b} J(\vec{w},b) = b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) \\
\end{cases}
$$
with simultaneous updates to $w_j$ for $j = 1,...,n$ and $b$.
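
The multi-feature update can be sketched in vectorised form, where `X` is an $m \times n$ matrix whose rows are the training examples. The function names, toy data and hyperparameter values below are assumptions chosen for illustration.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Gradients of J(w, b) for multiple linear regression."""
    m = X.shape[0]
    errors = X @ w + b - y            # f_wb(x^(i)) - y^(i), shape (m,)
    dj_dw = X.T @ errors / m          # one entry per feature j, shape (n,)
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

def gradient_descent_multi(X, y, alpha=0.1, num_iters=10000):
    """Batch gradient descent with simultaneous updates of all w_j and b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Toy data generated from y = 1*x1 + 2*x2 + 3 (assumed for illustration only).
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_train = X_train @ np.array([1.0, 2.0]) + 3.0

w_fit, b_fit = gradient_descent_multi(X_train, y_train)
```

The expression `X.T @ errors` computes the sum over $i$ for every feature $j$ at once, so no explicit loop over features is needed.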

#### Gradient Descent

##### Feature Scaling

To allow gradient descent to work more efficiently, we can use feature scaling. This is useful when features take on very different ranges of values. Typically we aim for roughly $-1 \le x_j \le 1$ for each feature, but ranges close to this do not need rescaling; we only need to rescale when the values are much larger or much smaller than this range. 

There are different ways of doing this: 
* Dividing by the maximum: for a range $a \le x \le b$ with $b > 0$, we divide by $b$ to get $\frac{a}{b} \le x \le 1$
* Mean normalisation: for a range $ a \le x \le b$ with a mean value $\mu$, we normalise $x$ by $x = \frac{x - \mu}{b - a}$ so that the new range is $\frac{a - \mu}{b - a} \le x \le \frac{b - \mu}{b - a}$.
* Z-score normalisation: for a range $ a \le x \le b$ with a mean $\mu$ and a standard deviation $\sigma$, we normalise $x$ by $x = \frac{x - \mu}{\sigma}$ so that the new range is $\frac{a - \mu}{\sigma} \le x \le \frac{b - \mu}{\sigma}$.
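
The three scaling methods above can be sketched with NumPy, scaling each feature column independently. The function names and the small feature matrix are hypothetical, and the first function assumes every column has a positive maximum.

```python
import numpy as np

def max_scale(X):
    """Divide each feature by its maximum (assumes the maximum is positive)."""
    return X / X.max(axis=0)

def mean_normalise(X):
    """Subtract the mean, then divide by the range (max - min)."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def zscore_normalise(X):
    """Subtract the mean, then divide by the standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features on very different scales (illustrative values only).
X = np.array([[300.0, 1.0],
              [1000.0, 3.0],
              [2000.0, 5.0]])

X_max = max_scale(X)
X_mean = mean_normalise(X)
X_z = zscore_normalise(X)
```

When a model is trained on scaled features, new inputs need to be scaled with the same statistics (mean, range or standard deviation) computed from the training set.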

##### Convergence

Let $\epsilon$ be a small positive threshold, for example $\epsilon = 10^{-3}$. If $J(\vec{w},b)$ decreases by $\le \epsilon$ in one iteration, declare **convergence**.
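
A minimal sketch of this automatic convergence test, assuming a univariate model and a small threshold `epsilon`. The function names, the toy data and the values of `alpha`, `epsilon` and `max_iters` are illustrative assumptions.

```python
import numpy as np

def cost(x, y, w, b):
    """Squared error cost J(w, b)."""
    return np.mean((w * x + b - y) ** 2) / 2

def fit_until_converged(x, y, alpha=0.1, epsilon=1e-9, max_iters=100_000):
    """Run gradient descent until the cost decreases by at most epsilon."""
    w, b = 0.0, 0.0
    previous_cost = cost(x, y, w, b)
    for iteration in range(1, max_iters + 1):
        errors = w * x + b - y
        w, b = w - alpha * np.mean(errors * x), b - alpha * np.mean(errors)
        current_cost = cost(x, y, w, b)
        # Stop once one iteration improves the cost by no more than epsilon.
        # A rising cost also triggers the stop; that usually means alpha is too large.
        if previous_cost - current_cost <= epsilon:
            return w, b, iteration
        previous_cost = current_cost
    return w, b, max_iters

# Toy data generated from y = 2x + 1 (assumed for illustration only).
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

w_fit, b_fit, iterations_used = fit_until_converged(x_train, y_train)
```

Plotting the cost against the iteration number (a learning curve) is a common complement to this test, since a suitable $\epsilon$ can be hard to choose in advance.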

##### Feature Engineering

Feature engineering involves using intuition to design new features by transforming or combining the original features. For example, if we have features $x_1,x_2$ then we can create a new feature $x_3 = x_1x_2$.

##### Polynomial Regression 

For non-linear data, you can create polynomial features such as $x^2$ and $x^3$ through feature engineering. 
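
Both ideas can be sketched in a few lines of NumPy: a product feature $x_3 = x_1x_2$ and a matrix of polynomial features built from a single input. The array values are made up for illustration.

```python
import numpy as np

# Feature engineering: combine two original features into a new one.
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])
x3 = x1 * x2                              # engineered feature x3 = x1 * x2

# Polynomial features: stack x, x^2 and x^3 as columns of one matrix.
x = np.arange(1.0, 6.0)                   # 1.0, 2.0, ..., 5.0
X_poly = np.column_stack([x, x ** 2, x ** 3])
```

The columns of `X_poly` can then be passed to an ordinary linear regression model. Their ranges differ widely ($x^3$ grows much faster than $x$), so feature scaling becomes more important here.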

##### Linear Regression in Python


In [None]:
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor               # Linear regression trained with stochastic gradient descent
from sklearn.preprocessing import StandardScaler            # Scales the data with z-score normalisation
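
A hedged usage sketch for the two scikit-learn classes imported above: the features are scaled with `StandardScaler` and a model is then fitted with `SGDRegressor`. The synthetic data, the random seeds and `max_iter=1000` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data from y = 3*x1 - 2*x2 + 5 (assumed for illustration only).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(200, 2))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + 5.0

# Z-score normalise the features before running gradient descent.
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)

# Fit a linear model with stochastic gradient descent.
sgdr = SGDRegressor(max_iter=1000, random_state=0)
sgdr.fit(X_norm, y_train)

w_norm = sgdr.coef_                    # weights for the normalised features
b_norm = sgdr.intercept_               # intercept term
r2 = sgdr.score(X_norm, y_train)       # coefficient of determination R^2
y_pred = sgdr.predict(scaler.transform(X_train[:3]))
```

Note that `SGDRegressor` updates the parameters one example at a time, so it is stochastic rather than batch gradient descent, and that new inputs must go through `scaler.transform` before prediction.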

## Classification 

### Logistic Regression 

As linear regression is not a good algorithm for classification problems, we turn to logistic regression. 

#### Logistic Regression Model


##### Sigmoid Function 

The sigmoid function (also known as the **logistic function**) outputs values between 0 and 1. The function is as follows: 
$$
g(z) = \frac{1}{1 + e^{-z}}, \qquad 0 \lt g(z) \lt 1
$$
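
The sigmoid function is a one-liner in NumPy. This sketch (the name `sigmoid` is our own) works on scalars and arrays alike.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

z_values = np.array([-10.0, 0.0, 10.0])
g_values = sigmoid(z_values)   # near 0 for very negative z, 0.5 at z = 0, near 1 for large z
```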

##### Linear Regression Algorithm

Recall that the linear regression model is $f_{\vec{w},b} (\vec{x}) = \vec{w} \cdot \vec{x} + b$.

##### Building Logistic Regression Algorithm 

If we let $z = \vec{w} \cdot \vec{x} + b$ and substitute $z$ into $g(z)$ we get:
$$ 
f_{\vec{w},b} (\vec{x}) = g(z) 
                        = g(\vec{w} \cdot \vec{x} + b)
                        = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}
$$

##### Algorithm Interpretation

The result that you get from the algorithm can be interpreted as the "probability" that the classification is 1. 

For example, if we take $x = \text{tumor size}$ and 
$y = \begin{cases}
    0 \text{ : not malignant} \\
    1 \text{ : malignant} \\
\end{cases}
$
then $f_{\vec{w},b} (\vec{x}) = 0.7$ would mean there is a $70\%$ chance that $y$ is $1$, i.e. that the tumor is malignant.

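Formally, you can interpret the output of the logistic regression model as $f_{\vec{w},b} (\vec{x}) = P(y = 1|\vec{x};\vec{w},b)$, the probability that $y = 1$ given the input $\vec{x}$ and the parameters $\vec{w}, b$.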

##### Decision Boundary

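* The decision boundary is the set of points where $z = \vec{w} \cdot \vec{x} + b = 0$, i.e. where $f_{\vec{w},b}(\vec{x}) = 0.5$. When $z$ is linear in the original features, the boundary is linear.
* A non-linear decision boundary arises when $z$ contains polynomial features, for example $z = w_1x_1^2 + w_2x_2^2 + b$. We can use such features to fit the decision boundary to our plotted data.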
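
To make the model, its probabilistic reading and the decision boundary concrete, here is a small sketch. The parameters `w = [1, 1]`, `b = -3` and the three input points are hypothetical values chosen so that the boundary is the line $x_1 + x_2 = 3$, and the threshold of 0.5 is the usual default rather than a requirement.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Model output f_wb(x), read as P(y = 1 | x; w, b)."""
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    """Predict class 1 when the probability reaches the threshold."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Hypothetical parameters: the decision boundary is x1 + x2 = 3.
w = np.array([1.0, 1.0])
b = -3.0

X = np.array([[1.0, 1.0],    # z = -1, below the boundary
              [1.5, 1.5],    # z =  0, exactly on the boundary
              [3.0, 2.0]])   # z =  2, above the boundary

probabilities = predict_proba(X, w, b)
labels = predict(X, w, b)
```

A point with $z = 0$ gets probability exactly 0.5, which is why $z = 0$ marks the boundary between the two predicted classes.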

##### Cost Function for Logistic Regression

Substituting the logistic regression model into the squared error cost used in linear regression results in a non-convex cost function. Gradient descent is then unreliable, because it can get stuck in one of the multiple local minima. Instead, we use a different cost function for logistic regression. 

Note that the **cost function for linear regression** is 
$$
J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} (f_{\vec{w},b} (\vec{x}^{(i)}) - y^{(i)})^2
$$ 
For linear regression, we denote the loss function by 
$$ 
L(f_{\vec{w},b} (\vec{x}^{(i)}),y^{(i)}) = \frac{1}{2} (f_{\vec{w},b} (\vec{x}^{(i)}) - y^{(i)})^2
$$

For **logistic regression**, we denote the logistic loss function as: 
$$
L(f_{\vec{w},b} (\vec{x}^{(i)}),y^{(i)}) = 
\begin{cases}
- \log(f_{\vec{w},b} (\vec{x}^{(i)})) \text { if } y^{(i)} = 1 \\
- \log(1 - f_{\vec{w},b} (\vec{x}^{(i)})) \text { if } y^{(i)} = 0 \\
\end{cases}
$$
The further the prediction $f_{\vec{w},b} (\vec{x}^{(i)})$ is from the target $y^{(i)}$, the higher the loss. 

Hence the **cost function for logistic regression** is the average loss over the training set: 
$$
J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} L(f_{\vec{w},b} (\vec{x}^{(i)}),y^{(i)}),
$$
where each term of the sum uses the case of the loss that matches its own label $y^{(i)}$.

We can simplify these to get: 
$$ 
\begin{aligned}
\text{Logistic Loss Function } &= L(f_{\vec{w},b} (\vec{x}^{(i)}),y^{(i)}) \\
&= -y^{(i)} \log(f_{\vec{w},b}(\vec{x}^{(i)})) - (1 - y^{(i)}) \log(1 - f_{\vec{w},b}(\vec{x}^{(i)})) \\
\text{Cost Function for Logistic Regression } &= J(\vec{w},b) \\ 
&= \frac{1}{m} \sum_{i=1}^{m} L(f_{\vec{w},b} (\vec{x}^{(i)}),y^{(i)}) \\
&= - \frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1 - y^{(i)}) \log(1 - f_{\vec{w},b}(\vec{x}^{(i)}))] \\
\end{aligned}
$$
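
The sketch below computes the logistic cost as the mean of the single-expression loss $-y\log(f) - (1-y)\log(1-f)$ and, as a check, also evaluates the two-case form of the loss. The function names and demo values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(f, y):
    """Single-expression logistic loss for predictions f and labels y."""
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

def piecewise_loss(f, y):
    """Two-case form: -log(f) if y = 1, -log(1 - f) if y = 0."""
    return np.where(y == 1, -np.log(f), -np.log(1 - f))

def logistic_cost(X, y, w, b):
    """Cost J(w, b): the average logistic loss over the m examples."""
    f = sigmoid(X @ w + b)
    return np.mean(logistic_loss(f, y))

# Illustrative data and parameters (not from the original notes).
X_demo = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]])
y_demo = np.array([0, 0, 1, 1])
w_demo = np.array([1.0, -1.0])
b_demo = 0.5

f_demo = sigmoid(X_demo @ w_demo + b_demo)
cost_demo = logistic_cost(X_demo, y_demo, w_demo, b_demo)
cost_at_zero = logistic_cost(X_demo, y_demo, np.zeros(2), 0.0)
```

With all parameters at zero every prediction is 0.5, so the cost equals $\log 2$ regardless of the labels; this makes a convenient sanity check.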

##### Gradient Descent for Logistic Regression

Gradient descent can be used to minimise the cost function. As a reminder, the algorithm is: 
$$
\text{Repeat Until Convergence}: \begin{cases}
w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w},b) = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})x_j^{(i)} \\
b = b - \alpha \frac{\partial}{\partial b} J(\vec{w},b) = b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) \\
\end{cases}
$$
with simultaneous updates to $w_j$ for $j = 1,...,n$ and $b$. The updates look the same as for linear regression, but here 
$$
f_{\vec{w},b}(\vec{x}) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}
$$
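
A minimal sketch of gradient descent for logistic regression. The update has the same form as for linear regression, but the prediction passes through the sigmoid. The toy data set, `alpha` and `num_iters` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b):
    f = sigmoid(X @ w + b)
    return np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))

def logistic_gradient_descent(X, y, alpha=0.1, num_iters=10000):
    """Batch gradient descent on the logistic cost, starting from zeros."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        errors = sigmoid(X @ w + b) - y      # f_wb(x^(i)) - y^(i)
        dj_dw = X.T @ errors / m
        dj_db = np.sum(errors) / m
        w = w - alpha * dj_dw                # simultaneous update of all w_j
        b = b - alpha * dj_db
    return w, b

# Small, linearly separable toy data set (assumed for illustration only).
X_train = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5],
                    [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

w_fit, b_fit = logistic_gradient_descent(X_train, y_train)
predictions = (sigmoid(X_train @ w_fit + b_fit) >= 0.5).astype(int)
final_cost = logistic_cost(X_train, y_train, w_fit, b_fit)
```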

##### Logistic Regression in Python

In [None]:
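import numpy as np
from sklearn.linear_model import LogisticRegression  # Logistic regression classifier from scikit-learn

# Toy data for illustration only: six examples, two features, binary labels
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()     # Create an estimator instance; fit/predict/score are instance methods
model.fit(X, y)                  # Fits the logistic regression model
y_pred = model.predict(X)        # Predicts the class (0 or 1) for each row of X
accuracy = model.score(X, y)     # Mean accuracy of the model on (X, y), a score from 0 to 1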

## Overfitting 

### The Problem

When fitting a model to data, two opposite problems can arise: 
* Underfitting - this is when the model does not fit the training set well and the algorithm has high bias. 
* Overfitting - this is when the model fits the training set extremely well but does not generalise to new examples and the algorithm has high variance. Often this means the model will not predict future values accurately. 

### Addressing Overfitting 

There are a few ways to address overfitting:
1. Collect more data
2. Select a subset of the features to use
3. Reduce size of parameters through **regularisation**

Regularisation involves reducing the size of the parameters $w_j$. Rather than eliminating features from the model outright, we shrink some of the $w_j$ values so that every feature is kept but the curve follows the underlying pattern more smoothly. 

### Cost Function for Regularisation

For the cost function for regularisation, we add the regularisation term $\frac{\lambda}{2m} \sum_{j=1}^n w_j^2$ to the original cost function: 

$$
\begin{aligned}
\text{Cost Function for Regularisation} &= J(\vec{w},b) \\ 
&= \frac{1}{2m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 \\
&\text{where } \lambda \gt 0 \\
\end{aligned}
$$

We want to minimise both the original cost (mean squared error cost) and the regularisation term: the first fits the data and the second keeps the $w_j$ small. To balance both of these, you pick a suitable value of $\lambda$.
* A large value of $\lambda$ will reduce the size of parameters $w_1,...,w_n$; if $\lambda$ is too large, the model underfits.
* With $\lambda = 0$ there is no regularisation, so the model can overfit.
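
As a sketch, the regularised cost for linear regression can be computed as below, assuming the $\frac{1}{2m}$ squared error cost from the linear regression section. The function name and the demo values are hypothetical; `lambda_` carries a trailing underscore because `lambda` is a reserved word in Python.

```python
import numpy as np

def regularised_linear_cost(X, y, w, b, lambda_=1.0):
    """Squared error cost plus the regularisation term (b is not regularised)."""
    m = X.shape[0]
    errors = X @ w + b - y
    squared_error_term = np.sum(errors ** 2) / (2 * m)
    regularisation_term = lambda_ / (2 * m) * np.sum(w ** 2)
    return squared_error_term + regularisation_term

# Illustrative values only.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_demo = np.array([1.0, 2.0])
w_demo = np.array([1.0, 1.0])
b_demo = 0.0

cost_unregularised = regularised_linear_cost(X_demo, y_demo, w_demo, b_demo, lambda_=0.0)
cost_regularised = regularised_linear_cost(X_demo, y_demo, w_demo, b_demo, lambda_=1.0)
```

For these values the errors are 2 and 5, so the squared error term is $29/4 = 7.25$, and with $\lambda = 1$ the regularisation term adds $2/4 = 0.5$.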

### Regularised Linear Regression 

For gradient descent, we want to minimise the regularised cost function, i.e. 
$$
\min_{\vec{w},b} J(\vec{w},b) = \min_{\vec{w},b} \left[ \frac{1}{2m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 \right]
$$ 

Applying this to our gradient descent algorithm, we get: 
$$
\text{Repeat Until Convergence}: \begin{cases}
w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w},b) = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}w_j \right] \\
b = b - \alpha \frac{\partial}{\partial b} J(\vec{w},b) = b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) \\
\end{cases}
$$
with simultaneous updates to $w_j$ for $j = 1,...,n$ and $b$. The parameter $b$ is not regularised, so its update is unchanged.
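
One regularised gradient descent step for linear regression might look like the following sketch. The function name and the demo values for `alpha` and `lambda_` are assumptions for illustration.

```python
import numpy as np

def regularised_linear_step(X, y, w, b, alpha, lambda_):
    """One simultaneous update of w and b with an L2 penalty on w only."""
    m = X.shape[0]
    errors = X @ w + b - y
    dj_dw = X.T @ errors / m + (lambda_ / m) * w   # extra (lambda/m) * w_j term
    dj_db = np.sum(errors) / m                     # b is not regularised
    return w - alpha * dj_dw, b - alpha * dj_db

# Illustrative values only.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_demo = np.array([1.0, 2.0])
w_demo = np.array([1.0, 1.0])
b_demo = 0.0

w_new, b_new = regularised_linear_step(X_demo, y_demo, w_demo, b_demo,
                                       alpha=0.1, lambda_=1.0)
```

Rearranging the update gives $w_j = w_j(1 - \alpha\frac{\lambda}{m}) - \alpha \cdot (\text{unregularised gradient})$, so each step first shrinks $w_j$ slightly and then applies the usual update.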

### Regularised Logistic Regression

For gradient descent, we want to minimise the regularised cost function, i.e. 
$$
\min_{\vec{w},b} J(\vec{w},b) = \min_{\vec{w},b} \left[ - \frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1 - y^{(i)}) \log(1 - f_{\vec{w},b}(\vec{x}^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 \right]
$$
where $f_{\vec{w},b}(\vec{x}) = \frac{1}{1 + e^{-z}}$ and $z = \vec{w} \cdot \vec{x} + b$.

Applying this to our gradient descent algorithm, we get: 
$$
\text{Repeat Until Convergence}: \begin{cases}
w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w},b) = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}w_j \right] \\
b = b - \alpha \frac{\partial}{\partial b} J(\vec{w},b) = b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) \\
\end{cases}
$$
with simultaneous updates to $w_j$ for $j = 1,...,n$ and $b$.
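
The sketch below pairs the regularised logistic cost with its gradient. The analytic gradient can be compared against a finite-difference estimate of the cost as a sanity check. The function names and demo values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularised_logistic_cost(X, y, w, b, lambda_):
    """Average logistic loss plus (lambda / 2m) * sum of w_j squared."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    data_term = np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))
    return data_term + lambda_ / (2 * m) * np.sum(w ** 2)

def regularised_logistic_gradient(X, y, w, b, lambda_):
    """Gradients of the regularised logistic cost with respect to w and b."""
    m = X.shape[0]
    errors = sigmoid(X @ w + b) - y
    dj_dw = X.T @ errors / m + (lambda_ / m) * w
    dj_db = np.sum(errors) / m                 # b is not regularised
    return dj_dw, dj_db

# Illustrative data and parameters (not from the original notes).
X_demo = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]])
y_demo = np.array([0, 0, 1, 1])
w_demo = np.array([0.3, -0.2])
b_demo = 0.1
lambda_demo = 0.7

dj_dw, dj_db = regularised_logistic_gradient(X_demo, y_demo, w_demo, b_demo, lambda_demo)
```

Apart from the sigmoid inside the prediction, the gradient has the same form as in regularised linear regression.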