# Entry ? - Ordinary Least Squares (OLS)

## Learning Style

<table align='left'>
    <tr>
        <th>Supervision</th>
        <th>Prediction types</th>
    </tr>
    <tr>
        <td>Supervised</td>
        <td>Regression</td>
    </tr>
</table>

## Description

On page 49, *Introduction to Machine Learning with Python* calls this "the simplest and most classic linear method for regression." It is usually the default method for linear regression and is the method implemented by the `sklearn.linear_model.LinearRegression` class.

This method finds the best-fit line by minimizing the mean squared error (MSE).

I covered mean squared error in [Entry 21](https://julielinx.github.io/blog/21_reg_score_theory/), but here's a reminder along with how the equation would be applied over a dataset/matrix:

- **error**: the difference between predictions and the true value
- **squared error**: the error term squared, which makes all values positive. Squaring is used instead of the absolute value so that large errors (outliers) carry more weight
- **mean squared error**: sum the squared error of all data points and divide by the number of data points

$MSE(X, h_{\theta}) = \frac{1}{m} \sum_{i=1}^{m} (\theta^{T}x^{(i)} - y^{(i)})^2$

Where:

- X: matrix of features
- *m*: number of observations (data points)
- $h_{\theta}$: prediction function, also called a *hypothesis*
- $\theta$: array of weights
- $x^{(i)}$: array of features for a specific observation
- $y^{(i)}$: observed output for a specific observation
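To make the formula concrete, here is a minimal NumPy sketch of the MSE calculation over a whole matrix. The function name `mse` and the toy data are made up for illustration; the first column of `X` is a column of ones so that the first weight acts as the intercept.

```python
import numpy as np

def mse(X, y, theta):
    """Mean squared error of the linear hypothesis h(x) = theta^T x."""
    errors = X @ theta - y        # prediction minus true value, per observation
    return np.mean(errors ** 2)   # square each error, then average over m rows

# Toy data (invented for this example): a bias column plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

theta_perfect = np.array([0.0, 2.0])  # y = 2x, fits the toy data exactly
theta_off = np.array([0.0, 1.0])      # y = x, misses by 1, 2, and 3

mse_perfect = mse(X, y, theta_perfect)  # 0.0
mse_off = mse(X, y, theta_off)          # (1 + 4 + 9) / 3

print(mse_perfect, mse_off)
```

Notice that the largest miss (3) contributes 9 of the 14 total squared error, which is the "outliers matter more" effect of squaring.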

## Purpose

OLS is basically the starting point for linear regression. It calculates the theta array (i.e., the weights) used to compute the output from an array of inputs.

There are two options when calculating the theta array:

- Normal equation
- Gradient descent

When $X^{T}X$ has an inverse, the theta array can be calculated directly using the normal equation. When there is no inverse, the iterative process of gradient descent is used instead.

### Normal Equation

$\hat{\theta} = (X^{T} X)^{-1} X^{T} y$

Where:

- $\hat{\theta}$: theta array, the hypothesized weights
- X: input feature matrix
- $X^{T}$: the transpose of X
- y: array of target values
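As a rough illustration, the normal equation can be written almost symbol for symbol in NumPy. The synthetic data below (an intercept of 4 and a slope of 3 plus a little noise) is an assumption made purely for the example. The second calculation uses `np.linalg.solve`, which gives the same answer without forming the inverse explicitly and is generally the more numerically stable option.

```python
import numpy as np

# Synthetic data (invented for this example): y = 4 + 3x + noise
rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, size=m)
y = 4.0 + 3.0 * x + rng.normal(scale=0.1, size=m)

# Add a column of ones so the first weight is the intercept
X = np.column_stack([np.ones(m), x])

# Normal equation, written as in the formula above
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Same system solved without an explicit inverse
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(theta_hat)    # close to [4, 3]
print(theta_solve)  # same values as theta_hat
```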

### Gradient Descent
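Here is a minimal sketch of batch gradient descent for the MSE cost defined in the Description section. It repeatedly steps the weights against the gradient, $\nabla_{\theta} MSE = \frac{2}{m} X^{T}(X\theta - y)$. The function name, the learning rate `alpha=0.1`, the iteration count, and the synthetic data are all assumptions chosen for illustration, not recommended settings.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iterations=2000):
    """Batch gradient descent on the MSE cost (illustrative sketch)."""
    m, n = X.shape
    theta = np.zeros(n)  # start with all weights at zero
    for _ in range(n_iterations):
        gradient = (2 / m) * X.T @ (X @ theta - y)  # gradient of MSE w.r.t. theta
        theta = theta - alpha * gradient            # step downhill
    return theta

# Synthetic data (invented for this example): y = 4 + 3x + noise
rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, size=m)
y = 4.0 + 3.0 * x + rng.normal(scale=0.1, size=m)
X = np.column_stack([np.ones(m), x])  # bias column plus one feature

theta_gd = gradient_descent(X, y)

# Closed-form least squares solution for comparison
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_gd)
print(theta_exact)
```

On this small, well-behaved problem the iterative answer lands on the same weights as the closed-form solution. With a learning rate that is too large the updates diverge, and with one that is too small many more iterations are needed.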



## Behavior

Normal equation vs gradient descent

<table align='left'>
    <tr>
        <td><b>Comparison category</b></td>
        <td><b>Normal Equation</b></td>
        <td><b>Gradient Descent</b></td>
    </tr>
    <tr>
        <td>Alpha</td>
        <td>No need to choose alpha</td>
        <td>Need to choose alpha</td>
    </tr>
    <tr>
        <td>Iteration</td>
        <td>No need to iterate</td>
        <td>Needs many iterations</td>
    </tr>
    <tr>
        <td>Computational complexity</td>
        <td>$O(n^{3})$ (need to calculate inverse of $X^{T}X$) *</td>
        <td>$O(kn^{2})$</td>
    </tr>
    <tr>
        <td>Speed with large feature set</td>
        <td>Slow if <i>n</i> is very large</td>
        <td>Works well when <i>n</i> is large</td>
    </tr></table>

\* Scikit-learn's implementation of OLS uses the pseudoinverse instead of the inverse, resolving this limitation.

Where:

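- alpha: the learning rate (step size) used by gradient descent
- *n*: number of features
- *k*: number of gradient descent iterations
- **X**: feature matrix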

#### Computational complexity

A few notes on computational complexity. The formulas in the table above are from **Reading: Normal Equation** in week 2 of Andrew Ng's [Machine Learning](https://www.coursera.org/learn/machine-learning) course. On that slide, he also notes that:

> [...] if we have a very large number of features, the normal equation will be slow. In practice, when *n* exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.

*Hands-On Machine Learning with Scikit-Learn* adds the following in relation to feature size and computational complexity:

> Both the Normal Equation and the SVD approach get very slow when the number of features grows large (e.g., 100,000). On the positive side, both are linear with regard to the number of instances in the training set (they are both *O(m)*), so they handle large training sets efficiently, provided they can fit in memory.

It also adds that the computational complexity of the SVD approach used in `sklearn.linear_model.LinearRegression` is about $O(n^{2})$. This puts it at roughly the same computational complexity as gradient descent. However, everything still has to fit into memory, and *Hands-On Machine Learning with Scikit-Learn* states on page 122 that gradient descent is much faster than the normal equation or SVD when there are hundreds of thousands of features. Because of these two properties, gradient descent is still the better choice for large datasets.
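As a quick sanity check, the sketch below fits `sklearn.linear_model.LinearRegression` and compares its weights with the ones obtained from NumPy's pseudoinverse (`np.linalg.pinv`, which is computed via SVD). The synthetic data and true weights are assumptions for the example, and the comparison only shows that the two routes agree on a well-behaved problem. It is not a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (invented for this example): 200 observations, 3 features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_weights = np.array([1.5, -2.0, 0.5])
y = 3.0 + X @ true_weights + rng.normal(scale=0.1, size=200)

# Scikit-learn handles the intercept on its own
model = LinearRegression().fit(X, y)

# Pseudoinverse route: add the bias column by hand
X_b = np.column_stack([np.ones(len(X)), X])
theta_pinv = np.linalg.pinv(X_b) @ y

print(model.intercept_, model.coef_)
print(theta_pinv)  # first value is the intercept, the rest are the weights
```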



## Parameters

## Strengths

- Fast to train
- Fast to predict
  - Computational complexity is linear
  - E.g., it takes twice as long to predict on twice as many instances (or twice as many features)
- Scales easily to very large datasets
- Works well with sparse data
- Easy to interpret / easy to see feature importance
- Performs well when the number of features is large compared to the number of observations (e.g., 104 features but only 5 observations)

## Limitations

- In low dimensions, linear models appear to have very limited usefulness. However, as more dimensions are added, the model becomes more powerful and can become overfit
- Often unclear why coefficients are what they are, particularly if there are highly correlated features
- Features should be scaled to improve the algorithm's ability/speed to converge on the correct solution (if you've forgotten what centering and scaling are, see [Entry 8](https://julielinx.github.io/blog/08_center_scale_and_latex/))
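To illustrate the scaling point, here is a small sketch that standardizes two features on very different scales and compares the condition number of $X^{T}X$ before and after. The condition number is used here only as a rough proxy for how difficult the problem is for gradient descent (larger generally means slower convergence). The feature scales are invented for the example.

```python
import numpy as np

# Two made-up features on very different scales
rng = np.random.default_rng(1)
X = np.column_stack([
    rng.normal(0, 1, 500),      # values roughly between -3 and 3
    rng.normal(50, 1000, 500),  # values in the thousands
])

# Center and scale: subtract each column's mean, divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = np.linalg.cond(X.T @ X)
cond_scaled = np.linalg.cond(X_scaled.T @ X_scaled)

print(cond_raw)     # very large
print(cond_scaled)  # much smaller after scaling
```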

## Evaluation

## Datasets

## Resources

- [Introduction to Machine Learning with Python](https://www.amazon.com/Introduction-Machine-Learning-Python-Scientists/dp/1449369413)
- [Hands-On Machine Learning with Scikit-Learn](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646)
- [Machine Learning course](https://www.coursera.org/learn/machine-learning)