# Machine learning
Formal definition by Tom M. Mitchell:
> A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

A machine is not initially (or ever) perfect at performing the given task, so the function it comes up with for performing the task is called a **hypothesis function**. A training algorithm is used to improve the hypothesis function. During training, the quality of the predictions is measured with a **cost function**.

The hypothesis function is often denoted $h_\theta(x)$, whereas the cost function is denoted $J(\theta_0,\theta_1)$.

**Gradient descent** is one such training algorithm. It repeatedly adjusts the theta values in the direction that lowers the cost function, until the cost reaches a minimum (the best theta values). The size of each step is set by the learning rate $\alpha$:

$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$$

In simpler form:

$$\theta_j:=\theta_j-\alpha\,[\text{slope of the cost function}]$$
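
As a minimal sketch of this update rule, assume a hypothetical one-parameter cost $J(\theta)=(\theta-3)^2$, chosen only for illustration, whose slope is $2(\theta-3)$. The function names below are made up for this example:

```python
def gradient_descent_1d(slope, theta, alpha, iterations):
    """Repeatedly apply theta := theta - alpha * slope(theta)."""
    for _ in range(iterations):
        theta = theta - alpha * slope(theta)
    return theta


def example_slope(theta):
    # Slope of the hypothetical cost J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent_1d(example_slope, theta=0.0, alpha=0.1, iterations=100)
print(theta_min)  # approaches 3, where the example cost is smallest
```

Each step moves theta against the slope, so the distance to the minimum shrinks as long as alpha is small enough.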


# Linear regression

General form of the **hypothesis function** in linear regression:

$$\hat{y}=h_\theta(x)=\theta_0+\theta_1x$$

The values of $\theta_0$ and $\theta_1$ are adjusted as the machine learns.

**Cost function**:

$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m(\hat{y}_i-y_i)^2 = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x_i)-y_i)^2$$
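
The cost function above could be computed along these lines. This is a sketch only: the function names and the four data points (which lie exactly on $y = 1 + 2x$) are made up for illustration.

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x"""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1): sum of squared errors divided by 2m."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Made-up training data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(cost(1.0, 2.0, xs, ys))  # 0.0, a perfect fit
print(cost(0.0, 2.0, xs, ys))  # 0.5, every prediction is off by 1
```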

**Gradient descent** (update $\theta_0$ and $\theta_1$ simultaneously):

$$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x_i)-y_i)$$

$$\theta_1:=\theta_1-\alpha\frac{1}{m}\sum^m_{i=1}((h_\theta(x_i)-y_i)x_i)$$
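
The two update rules above can be sketched as a plain Python loop. The learning rate, iteration count, starting values and data are assumptions chosen for this small example, not recommended settings.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit theta0 and theta1 for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction errors h(x_i) - y_i under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed before
        # either parameter changes.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up training data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # close to 1 and 2
```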

The examples above represent linear regression with only one variable $x$ (univariate linear regression). For linear regression with multiple variables (multivariate linear regression), the hypothesis function looks like this:
$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+...+\theta_nx_n$$

Using matrix multiplication this can be represented as:
$$
    h_\theta(x)=
    \begin{bmatrix}
    \theta_0 & \theta_1 & \ldots & \theta_n \\
    \end{bmatrix}        
    \begin{bmatrix}
    x_0 \\
    x_1 \\
    \vdots \\
    x_n \\
    \end{bmatrix}
    =\theta^Tx
$$

We can assume $x_0=1$, which gives the input vector $x$ and the parameter vector $\theta$ the same number of elements ($n+1$), so they can be multiplied together:
$$
    x=
    \begin{bmatrix}
    x_0 \\
    x_1 \\
    \vdots \\
    x_n \\    
    \end{bmatrix}
    ,\,
    \theta=
    \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \vdots \\
    \theta_n \\
    \end{bmatrix}    
$$
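
A minimal NumPy sketch of $h_\theta(x)=\theta^Tx$ with the $x_0=1$ convention. The function name and the numeric values are made up for illustration.

```python
import numpy as np


def hypothesis(theta, features):
    """Evaluate h_theta(x) = theta^T x after prepending x_0 = 1."""
    x = np.concatenate(([1.0], features))  # x_0 = 1 pairs with theta_0
    return float(theta @ x)


theta = np.array([1.0, 2.0, 3.0])  # theta_0, theta_1, theta_2 (made-up values)
features = np.array([4.0, 5.0])    # x_1, x_2 (made-up values)

prediction = hypothesis(theta, features)
print(prediction)  # 1*1 + 2*4 + 3*5 = 24.0
```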

**Vectorized cost function**:

$$J(\theta) = \dfrac {1}{2m} (X\theta - \vec{y})^{T} (X\theta - \vec{y})$$

Where $\vec{y}$ denotes the vector of all $y$ values and $X$ is the matrix whose rows are the training inputs, each with $x_0=1$ as its first element.
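
The vectorized cost could be written in NumPy roughly as follows, assuming `X` is the matrix whose rows are the training inputs (each starting with $x_0=1$). The data is made up and lies exactly on $y = 1 + 2x$.

```python
import numpy as np


def cost(X, y, theta):
    """J(theta) = (X theta - y)^T (X theta - y) / (2m)."""
    m = len(y)
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * m)


# Made-up data: the first column is x_0 = 1, the second is the single feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

print(cost(X, y, np.array([1.0, 2.0])))  # 0.0, a perfect fit
print(cost(X, y, np.array([0.0, 2.0])))  # 0.5, every prediction is off by 1
```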

**Vectorized gradient descent**:

$$\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - \vec{y})$$
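
A sketch of the vectorized update in NumPy. The learning rate, iteration count, zero initialization and data are assumptions for this small example; `X` is taken to be the matrix of training inputs with a leading column of ones.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Repeat theta := theta - (alpha / m) * X^T (X theta - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        # One matrix expression updates every theta_j at the same time.
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta


# Made-up data lying exactly on y = 1 + 2x; the first column is x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = gradient_descent(X, y)
print(theta)  # close to [1, 2]
```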

**Feature normalization** transforms the input values so that all features are on a similar scale, which helps gradient descent converge in fewer iterations. It combines two techniques: **mean normalization** subtracts the feature's average value $\mu_i$, and **feature scaling** divides by $s_i$, which is either the range (max - min) or the standard deviation of the feature.

$$x_i := \dfrac{x_i - \mu_i}{s_i}$$

Where $\mu_i$ is the average of all the values for feature $i$, and $s_i$ is either the range of values (max - min) or the standard deviation.
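
One possible NumPy sketch of this normalization, using the standard deviation for $s_i$. The function name and the data are made up, and the matrix is assumed to exclude the constant $x_0=1$ column, since a constant column has zero spread and would cause a division by zero.

```python
import numpy as np


def normalize_features(X):
    """Return (X - mu) / s per column, together with mu and s."""
    mu = X.mean(axis=0)  # average of each feature
    s = X.std(axis=0)    # alternative: X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s


# Made-up inputs with two features on very different scales.
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [1416.0, 2.0]])

X_norm, mu, s = normalize_features(X)
print(X_norm.mean(axis=0))  # approximately 0 for every feature
print(X_norm.std(axis=0))   # approximately 1 for every feature
```

The returned `mu` and `s` would need to be reused on any new input before making a prediction, so that it is scaled the same way as the training data.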

The number of features can be increased (if feasible) by deriving new features from existing ones, for example the product of two existing features.

The hypothesis function can also be changed from a straight line (a polynomial of degree 1) into a curve (e.g. a polynomial of degree 2 or higher, or a square root function) if that fits the data better. This is called **polynomial regression**. For example:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$$

Polynomial terms greatly widen the range of the feature values (if $x_1$ ranges from 1 to 100, then $x_1^3$ ranges from 1 to 1,000,000), so feature scaling becomes very important.
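
To illustrate, the sketch below builds the polynomial terms from a single feature and prints the range of each resulting column. The function name and the input values are made up for this example.

```python
import numpy as np


def polynomial_features(x1, degree=3):
    """Return the columns x1, x1^2, ..., x1^degree."""
    return np.column_stack([x1 ** d for d in range(1, degree + 1)])


x1 = np.array([1.0, 10.0, 100.0])  # made-up values of one feature
P = polynomial_features(x1)

# Range (max - min) of each column: 99, 9999 and 999999.
ranges = P.max(axis=0) - P.min(axis=0)
print(ranges)
```

The columns differ in range by several orders of magnitude, which is why they are normally scaled before running gradient descent.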

The **normal equation** is a method of finding the optimal theta analytically, without iteration.

$$\theta = (X^T X)^{-1}X^T \vec{y}$$
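
A minimal NumPy sketch of the normal equation. Note one swapped-in detail: instead of forming the inverse explicitly, it solves the equivalent linear system $(X^TX)\theta = X^Ty$ with `np.linalg.solve`, which gives the same theta whenever $X^TX$ is invertible. The function name and data are made up.

```python
import numpy as np


def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta in one step."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# Made-up data lying exactly on y = 1 + 2x; the first column is x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = normal_equation(X, y)
print(theta)  # close to [1, 2]
```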

| **Gradient descent**       | **Normal equation**                           |
| -------------------------- | --------------------------------------------- |
| Need to choose alpha       | No need to choose alpha                       |
| Needs many iterations      | No need to iterate                            |
| $O(kn^2)$                  | $O(n^3)$, need to calculate inverse of $X^TX$ |
| Works well when n is large | Slow if n is very large (over 10,000)         |



