# Normal Equation

Sometimes the optimization problem can be solved analytically. This section looks at how to do that.

**Gradient Descent** gives one way of minimizing $J(\theta)$. Let's discuss another way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In the **Normal Equation** method, we minimize $J$ by explicitly taking its derivatives with respect to each $\theta_{j}$ and setting them to zero. This allows us to find the optimum $\theta$ without iteration. The normal equation formula is given below:

$$\theta = (X^{T}X)^{-1}X^Ty$$
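
The formula follows from the derivative step described above. Here is a short sketch, assuming the usual least-squares cost with $m$ training examples (this section does not restate $J$, so the $\frac{1}{2m}$ scaling is an assumption; it does not change the minimizer) and assuming $X^{T}X$ is invertible for the last step:

$$\begin{aligned}
J(\theta) &= \frac{1}{2m}(X\theta - y)^{T}(X\theta - y) \\
\nabla_{\theta} J(\theta) &= \frac{1}{m}X^{T}(X\theta - y) = 0 \\
X^{T}X\theta &= X^{T}y \\
\theta &= (X^{T}X)^{-1}X^{T}y
\end{aligned}$$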

Now suppose that we have the following data:

| x<sub>0</sub> | Size (feet<sup>2</sup>) x<sub>1</sub> | Number of bedrooms x<sub>2</sub> | Number of floors x<sub>3</sub> | Age of home (years) x<sub>4</sub> | Price ($1000) y |
|------|-----------------|------------------------|----------------------|------------------------|------------------|
| 1    | 2104            | 5                      | 1                    | 45                     | 460              |
| 1    | 1416            | 3                      | 2                    | 40                     | 232              |
| 1    | 1534            | 3                      | 2                    | 30                     | 315              |
| 1    | 852             | 2                      | 1                    | 36                     | 178            |

For this data, the matrix $X$ and the vector $y$ that go into the **Normal Equation** are:

$$X=\begin{bmatrix}
1 & 2104 & 5 & 1 & 45 \\
1 & 1416 & 3 & 2 & 40 \\
1 & 1534 & 3 & 2 & 30 \\
1 & 852 & 2 & 1 & 36 \\
\end{bmatrix},\quad y=\begin{bmatrix}
460 \\ 232 \\ 315 \\ 178
\end{bmatrix}$$
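
As a minimal sketch of evaluating the formula with NumPy, the block below uses only the first three columns of the table ($x_0$, size, bedrooms). That restriction is a choice made for this illustration: with all five columns there are only four rows, so $X^{T}X$ would be singular. The variable names are illustrative, and `np.linalg.solve` is used in place of an explicit inverse.

```python
import numpy as np

# First three columns of the table above: x0 (intercept), size, bedrooms.
X = np.array([
    [1.0, 2104.0, 5.0],
    [1.0, 1416.0, 3.0],
    [1.0, 1534.0, 3.0],
    [1.0,  852.0, 2.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])  # price in $1000

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system avoids forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# At the optimum the gradient X^T (X theta - y) vanishes (up to rounding).
gradient = X.T @ (X @ theta - y)

print("theta:", theta)
print("predictions:", X @ theta)
```

Solving $X^{T}X\theta = X^{T}y$ as a linear system gives the same $\theta$ as the formula and is usually the more numerically stable option.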

There is **no need** to do feature scaling with the normal equation.
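
One way to see this is to fit the same data with and without scaling and compare the fitted prices. The sketch below does this for the size and bedrooms columns. The helper `normal_equation` and the choice of standardization (zero mean, unit variance) are assumptions made for this illustration.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Size and bedrooms columns from the table above, plus the price column.
features = np.array([
    [2104.0, 5.0],
    [1416.0, 3.0],
    [1534.0, 3.0],
    [ 852.0, 2.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])
ones = np.ones((features.shape[0], 1))

# Fit on the raw features.
X_raw = np.hstack([ones, features])
pred_raw = X_raw @ normal_equation(X_raw, y)

# Fit on standardized features (zero mean, unit variance per column).
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
X_scaled = np.hstack([ones, scaled])
pred_scaled = X_scaled @ normal_equation(X_scaled, y)

# The coefficients differ between the two fits; compare the fitted prices.
print(np.allclose(pred_raw, pred_scaled))
```

Scaling changes the values of $\theta$ but not the fitted predictions, because both versions of $X$ (each with its intercept column) span the same column space.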

The following is a comparison of gradient descent and the normal equation:

| **Gradient Descent**           | **Normal Equation**                             |
|--------------------------------|-------------------------------------------------|
| Need to choose $\alpha$        | No need to choose $\alpha$                      |
| Needs many iterations          | No need to iterate                              |
| $O(kn^2)$                      | $O(n^3)$, need to calculate inverse of $X^{T}X$ |
| Works well when **n** is large | Slow if **n** is very large                     |

As a rule of thumb, once the number of features $n$ gets close to 10,000, it is a good idea to switch to the **Gradient Descent** method.

If you want to see how the **Normal Equation** is obtained, read [Derivation of the Normal Equation for Linear Regression](https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression).


## Noninvertibility

If $X^TX$ is **noninvertible**, the common causes might be having:

* Redundant features, where two features are very closely related (i.e., they are linearly dependent)
* Too many features (e.g. $m \le n$, no more training examples than features). In this case, delete some features or use **Regularization** (to be explained in another lesson)

Solutions to the above problems include deleting a feature that is linearly dependent on another, or deleting one or more features when there are too many.
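
The example table from this lesson is itself a case of "too many features": it has four rows but five columns, so $X^{T}X$ is a singular $5 \times 5$ matrix. The sketch below checks this and then uses the Moore-Penrose pseudoinverse (`np.linalg.pinv`) to obtain a solution anyway. The pseudoinverse is a technique swapped in for this illustration, not one of the remedies listed above.

```python
import numpy as np

# All five columns of the example table: 4 examples, 5 parameters.
X = np.array([
    [1.0, 2104.0, 5.0, 1.0, 45.0],
    [1.0, 1416.0, 3.0, 2.0, 40.0],
    [1.0, 1534.0, 3.0, 2.0, 30.0],
    [1.0,  852.0, 2.0, 1.0, 36.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])

# rank(X) is at most 4, so the 5x5 matrix X^T X is singular.
rank = np.linalg.matrix_rank(X)
print("rank of X:", rank, "columns:", X.shape[1])

# The pseudoinverse of X picks the minimum-norm least-squares solution,
# which is well defined even though (X^T X)^{-1} does not exist.
theta = np.linalg.pinv(X) @ y
print("theta:", theta)
print("predictions:", X @ theta)
```

Because the four rows are linearly independent, this fit reproduces all four prices. That reflects the shortage of data rather than the quality of the model, which is why removing features or regularizing is the recommended fix.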

