## Introduction

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
data = pd.read_csv('AmesHousing.txt', delimiter="\t")
train = data[0:1460]
test = data[1460:]

features = ['Wood Deck SF', 'Fireplaces', 'Full Bath', '1st Flr SF', 'Garage Area',
            'Gr Liv Area', 'Overall Qual']

In [5]:
X = train[features]
y = train['SalePrice']

The OLS estimate of the parameter vector $a$ is given by:

$a = (X^TX)^{-1} X^Ty$

In [8]:
# OLS estimate a = (X^T X)^{-1} X^T y, computed directly from the formula above
ols_estimation = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)
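As a cross-check on the formula, here is a minimal sketch that computes the same estimate three ways on synthetic data. The names `X_demo`, `y_demo`, and `true_a` are made up for illustration and are not part of the housing dataset. Solving the linear system $(X^TX)a = X^Ty$ with `np.linalg.solve` avoids forming the inverse explicitly, which is generally the more numerically stable choice.

```python
import numpy as np

# Synthetic stand-in for the housing data (an assumption for illustration):
# 200 rows, 3 features, known coefficients, small Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_a = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_a + rng.normal(scale=0.01, size=200)

# Same formula as the cell above: explicit inverse of X^T X.
a_inverse = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Solve the linear system (X^T X) a = X^T y without forming the inverse.
a_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# NumPy's least-squares routine, for comparison.
a_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(a_inverse)
print(a_solve)
print(a_lstsq)
```

All three estimates should agree to within floating point error and land close to `true_a`.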

## Cost Function

Unlike gradient descent, OLS estimation provides what is known as a **closed form solution** to the problem of finding the optimal parameter values. A closed form solution is one that can be computed arithmetically with a predictable number of mathematical operations. Gradient descent, on the other hand, is an algorithmic approach that can require a different number of iterations (and therefore a different number of mathematical operations) depending on the initial parameter values, the learning rate, and so on. While the approaches differ, both techniques share the high-level objective of minimizing the cost function.

Before we dive into how the cost function is represented in matrix form, let's understand how the error is represented. Because the error is the difference between the predictions made using the model, $\hat{y}$, and the actual labels, $y$, it's represented as a vector. The Greek letter epsilon ($\epsilon$) is often used to represent the error vector:

$\epsilon =  \hat{y} - y$

We can build on this to define $y$:

$y = Xa - \epsilon$

Even though this closely resembles the matrix equation $Ax=b$, we have two unknowns (the parameter vector $a$ and the error vector $\epsilon$). We're looking for a model, represented by the parameter vector $a$, that minimizes the mean squared error between the labels, $y$, and the predictions, $\hat{y} = Xa$. Said another way, the cost function is this mean squared error.

Here's what the cost function looks like in matrix form:

**$J(a) = \dfrac{1}{n} (Xa - y)^T(Xa - y)$**
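To make the matrix form concrete, here is a minimal sketch that evaluates $J(a)$ on a tiny example small enough to check by hand. The function name `cost` and the numbers in `X_demo`, `y_demo`, and `a_demo` are illustrative assumptions, not values from the housing data.

```python
import numpy as np

def cost(X, y, a):
    """Mean squared error J(a) = (1/n) (Xa - y)^T (Xa - y)."""
    residual = X @ a - y          # the error vector, epsilon = y_hat - y
    return (residual @ residual) / len(y)

# Tiny hand-checkable example (made-up numbers, not the housing data).
X_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
y_demo = np.array([1.0, 2.0, 4.0])
a_demo = np.array([1.0, 2.0])

# Predictions are [1, 2, 3], so the error vector is [0, 0, -1] and J = 1/3.
print(cost(X_demo, y_demo, a_demo))
```

The dot product of the error vector with itself is the sum of squared errors, so dividing by the number of rows gives the mean squared error.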

## Derivative Of The Cost Function

Deriving the derivative of the cost function is fairly involved, and out of scope for this mission. Understanding the derivation requires some familiarity with matrix calculus, which is a specific notation for applying calculus concepts to matrices. If you're interested in the derivation, we recommend Eli Bendersky's walkthrough [on his blog](http://eli.thegreenplace.net/2015/the-normal-equation-and-matrix-calculus/).

Here's the derivative of the cost function:

$\frac{dJ(a)}{da} = \frac{2}{n}\left(X^TXa - X^Ty\right)$

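The cost function is minimized where this derivative equals zero. Setting it to zero gives $X^TXa = X^Ty$, and solving for $a$ leaves us with the OLS estimation formula: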

$a = (X^TX)^{-1}X^Ty$
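One way to sanity-check this result is to evaluate the derivative at the OLS estimate and confirm that it is numerically zero. The sketch below does that on random synthetic data; `gradient`, `X_demo`, and `y_demo` are hypothetical names used only for illustration. The $\frac{2}{n}$ factor comes from differentiating the mean squared error cost, and a constant factor does not change where the derivative is zero.

```python
import numpy as np

def gradient(X, y, a):
    """Gradient of J(a) = (1/n) (Xa - y)^T (Xa - y) with respect to a."""
    n = len(y)
    return (2.0 / n) * (X.T @ X @ a - X.T @ y)

# Random synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = rng.normal(size=100)

# OLS estimate from the closed form formula.
a_ols = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# A nearby parameter vector that is not the minimizer.
a_other = a_ols + 0.1

grad_at_ols = gradient(X_demo, y_demo, a_ols)
grad_at_other = gradient(X_demo, y_demo, a_other)

# Largest gradient component at each point.
print(np.abs(grad_at_ols).max())
print(np.abs(grad_at_other).max())
```

The first value should be zero up to floating point error, while the second should be clearly nonzero.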

## Gradient Descent vs. Ordinary Least Squares

Now that we've explored a lot of the math that underlies OLS estimation, let's understand its limitations. The biggest limitation is that OLS estimation is computationally expensive when the data is large. This is because computing a matrix inverse has a computational complexity of approximately $O(n^3)$, where $n$ is the number of rows (and columns) of the square matrix being inverted. You can read more about the computational complexity of the matrix inverse and other common matrix operations on [Wikipedia](https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra).

OLS is commonly used when the number of elements in the dataset (and therefore the matrix that's inverted) is less than a few million elements. On larger datasets, gradient descent is used because it's much more flexible. For many practical problems, we can set a threshold accuracy value (or a set number of iterations) and use a "good enough" solution. This is especially useful when iterating and trying different features in our model.
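The "good enough" idea can be sketched as follows: run gradient descent until the gradient norm drops below a threshold or an iteration cap is reached, then compare against the closed form answer. This is a minimal illustration on synthetic data, not a tuned implementation. The function `gradient_descent`, its parameters (`learning_rate`, `tolerance`, `max_iterations`), and the data are all assumptions chosen for the example.

```python
import numpy as np

# Synthetic data with known coefficients (an assumption for illustration).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(500, 3))
true_a = np.array([1.5, -2.0, 0.7])
y_demo = X_demo @ true_a + rng.normal(scale=0.1, size=500)

# Closed form answer for reference.
a_ols = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

def gradient_descent(X, y, learning_rate=0.1, tolerance=1e-6, max_iterations=10_000):
    """Minimize the mean squared error, stopping early once the gradient is small."""
    n, p = X.shape
    a = np.zeros(p)
    for iteration in range(1, max_iterations + 1):
        grad = (2.0 / n) * (X.T @ (X @ a - y))
        a = a - learning_rate * grad
        if np.linalg.norm(grad) < tolerance:
            break
    return a, iteration

a_gd, iterations_used = gradient_descent(X_demo, y_demo)

print(iterations_used)                # iterations needed to reach the tolerance
print(np.abs(a_gd - a_ols).max())     # largest gap to the closed form estimate
```

Loosening `tolerance` or lowering `max_iterations` trades accuracy for fewer operations, which is the flexibility the closed form solution does not offer.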