## Multivariate Linear Regression

- **Multiple Features**
    - The multivariable form of the hypothesis function:
    $h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \cdots + \theta_nx_n$

    - Using the definition of matrix multiplication:

    $h_\theta(x) = \begin{bmatrix}\theta_0 & \theta_1 & \cdots & \theta_n\end{bmatrix}\begin{bmatrix}x_0\\x_1\\\vdots\\x_n\end{bmatrix} = \theta^Tx$

    where $x_0 = 1$.
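
The vectorized form above can be sketched in plain Python. This is only an illustration: the function name `hypothesis` and the sample values are assumptions, not part of the original notes.

```python
def hypothesis(theta, x):
    """Return h_theta(x) = theta^T x for a feature list x given WITHOUT the bias term."""
    x_with_bias = [1.0] + list(x)  # prepend x_0 = 1
    if len(theta) != len(x_with_bias):
        raise ValueError("theta must have exactly one more entry than x")
    # Dot product theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n
    return sum(t * xi for t, xi in zip(theta, x_with_bias))


# Hypothetical parameters and a single example with n = 2 features.
theta = [1.0, 2.0, 3.0]
x = [4.0, 5.0]
prediction = hypothesis(theta, x)  # 1*1 + 2*4 + 3*5
```

Prepending $x_0 = 1$ inside the function keeps the intercept $\theta_0$ in the same dot product as the other parameters.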

- **Gradient Descent For Multiple Variables**

    $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$

    Repeat this update until convergence, updating all $\theta_j$ ($j = 0, 1, \dots, n$) simultaneously.

    - Reminder:
        - $(x^{(i)}, y^{(i)})$ is the $i^{th}$ training example (row $i$ of the training set)
        - $x_j^{(i)}$ is the value of feature $j$ in the $i^{th}$ training example (row $i$, column $j$ of the training set)
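
A minimal sketch of the update rule, assuming each row of `X` already starts with $x_0 = 1$. The function names, the toy data, and the fixed iteration count (used here in place of a real convergence check) are all illustrative assumptions.

```python
def predict(theta, row):
    """h_theta(x) for one row that already includes x_0 = 1."""
    return sum(t * v for t, v in zip(theta, row))


def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent; returns the learned parameter list theta."""
    m = len(X)            # number of training examples
    n_params = len(X[0])  # n + 1 parameters (including theta_0)
    theta = [0.0] * n_params
    for _ in range(num_iters):
        # Errors are computed once per iteration from the OLD theta,
        # so every theta_j is updated simultaneously.
        errors = [predict(theta, X[i]) - y[i] for i in range(m)]
        gradients = [
            sum(errors[i] * X[i][j] for i in range(m)) / m
            for j in range(n_params)
        ]
        theta = [theta[j] - alpha * gradients[j] for j in range(n_params)]
    return theta


# Toy data generated from y = 1 + 2 * x_1 (hypothetical).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(X, y, alpha=0.1, num_iters=5000)
# theta should end up close to [1.0, 2.0]
```

In practice one would usually stop when the change in the cost falls below a small threshold instead of running a fixed number of iterations.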

- **Gradient Descent in Practice - Feature Scaling**
    
    - We can speed up gradient descent by keeping each of our input values in roughly the same range.
    - Two techniques that help with this are **feature scaling** and **mean normalization**.

    - Feature scaling divides the input values by the range of the input variable (max - min), so the new range is just 1.
    - Mean normalization subtracts the average value of an input variable from each of its values, so the new average for that variable is zero. Applying both:
    
        $x_i := \dfrac{x_i - \mu_i}{s_i}$
        
        where $\mu_i$ is the average of feature $i$ and $s_i$ is either the range of its values (max - min) or its standard deviation.
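
The formula above can be sketched for a single feature column as follows. The function name `mean_normalize`, the `use_std` flag, and the sample house sizes are assumptions made for illustration.

```python
def mean_normalize(values, use_std=False):
    """Return (scaled_values, mu, s) for one feature column.

    s is the range (max - min) by default, or the population
    standard deviation when use_std is True.
    """
    m = len(values)
    mu = sum(values) / m
    if use_std:
        s = (sum((v - mu) ** 2 for v in values) / m) ** 0.5
    else:
        s = max(values) - min(values)
    if s == 0:
        raise ValueError("feature is constant and cannot be scaled")
    return [(v - mu) / s for v in values], mu, s


# Hypothetical feature column: house sizes.
sizes = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
scaled, mu, s = mean_normalize(sizes)
scaled_std, _, _ = mean_normalize(sizes, use_std=True)
```

Returning `mu` and `s` matters because the same shift and scale have to be applied to any new input before it is passed to the hypothesis.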

- **Gradient Descent in Practice - Learning Rate**

    - If $\alpha$ is too small: convergence is slow.
    
    - If $\alpha$ is too large: the cost $J(\theta)$ may not decrease on every iteration, and gradient descent may not converge.
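
One way to see both effects is to record the cost after every iteration for several values of $\alpha$. The sketch below assumes the usual squared-error cost $J(\theta) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$, which this section does not restate; the function names, toy data, and the three learning rates are illustrative assumptions.

```python
def predict(theta, row):
    return sum(t * v for t, v in zip(theta, row))


def cost(theta, X, y):
    # Assumed squared-error cost: J = 1/(2m) * sum of squared errors.
    m = len(X)
    return sum((predict(theta, X[i]) - y[i]) ** 2 for i in range(m)) / (2 * m)


def cost_history(X, y, alpha, num_iters):
    """Run batch gradient descent and return J after every iteration."""
    m, n_params = len(X), len(X[0])
    theta = [0.0] * n_params
    history = [cost(theta, X, y)]
    for _ in range(num_iters):
        errors = [predict(theta, X[i]) - y[i] for i in range(m)]
        theta = [
            theta[j] - alpha * sum(errors[i] * X[i][j] for i in range(m)) / m
            for j in range(n_params)
        ]
        history.append(cost(theta, X, y))
    return history


# Toy data generated from y = 1 + 2 * x_1 (hypothetical).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

small = cost_history(X, y, alpha=0.001, num_iters=50)  # decreases, but slowly
good = cost_history(X, y, alpha=0.1, num_iters=50)     # decreases quickly
large = cost_history(X, y, alpha=1.0, num_iters=50)    # cost grows: divergence
```

On this toy data the cost shrinks steadily for the two smaller rates and grows for `alpha=1.0`. Which values count as "too large" depends on the data, so these thresholds should not be read as general guidance.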

- **Feature and Polynomial Regression**

    - We can improve our features and the form of our hypothesis function in a couple of different ways.

    - **Polynomial Regression**

        - Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

        - We can **change the behavior or curve** of our hypothesis function by making it a quadratic, cubic, or square root function (or any other form).
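
As a sketch of the idea, a quadratic hypothesis can be built by treating powers of a single input as extra features; the model stays linear in $\theta$. The helper names and numbers below are hypothetical.

```python
def polynomial_features(x, degree):
    """Map a single input x to [1, x, x^2, ..., x^degree]; the leading 1 is x_0."""
    return [x ** d for d in range(degree + 1)]


def hypothesis(theta, features):
    return sum(t * f for t, f in zip(theta, features))


# Quadratic hypothesis h(x) = theta_0 + theta_1*x + theta_2*x^2.
theta = [1.0, 0.0, 2.0]
row = polynomial_features(3.0, degree=2)  # features 1, x, x^2 for x = 3
value = hypothesis(theta, row)            # 1 + 0*3 + 2*9

# A square-root feature works the same way: features 1, x, sqrt(x) for x = 16.
sqrt_row = [1.0, 16.0, 16.0 ** 0.5]
```

Powers of a feature can span very different ranges (for example $x$ up to $10^3$ gives $x^2$ up to $10^6$), so feature scaling tends to matter more once such features are added.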

## Computing Parameters Analytically

- **Normal Equation**

    - The normal equation is a second way of minimizing the cost function $J$ (the first is gradient descent). It performs the minimization explicitly, without resorting to an iterative algorithm: we take the derivatives of $J$ with respect to the $\theta_j$'s and set them to zero. This lets us find the optimum $\theta$ without iteration.

    $$\theta = (X^TX)^{-1}X^Ty$$

    where $X$ is the $m \times (n+1)$ matrix

    $$X = \begin{bmatrix}1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & \cdots & x_n^{(m)}\end{bmatrix}$$

    in which $x_j^{(i)}$ is feature $j$ of the $i^{th}$ training example, and $y$ is the $m \times 1$ matrix ($m$-dimensional vector)

    $$y = \begin{bmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{bmatrix}$$

    in which $y^{(i)}$ is the target value of the $i^{th}$ training example (the value the hypothesis tries to predict).

    - There is **no need** to do feature scaling with the normal equation.

    - When to use the normal equation (gradient descent vs. normal equation):
        - **Gradient Descent**:

            - Need to choose alpha.
            - Needs many iterations.
            - $O(kn^2)$
            - Works well when n is large.

        - **Normal equation**:

            - No need to choose alpha.
            - No need to iterate.
            - $O(n^3)$, because we need to compute the inverse of $X^TX$ (an $(n+1) \times (n+1)$ matrix)
            - Slow if n is very large.

        - In practice, when $n$ exceeds 10,000 it might be a good time to switch from the normal equation to an iterative process (gradient descent).
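
A minimal pure-Python sketch of the normal equation follows. Instead of forming $(X^TX)^{-1}$ explicitly, it solves the equivalent linear system $(X^TX)\theta = X^Ty$ by Gaussian elimination; that substitution, the helper names, and the toy data are assumptions of this example, not something the notes prescribe.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)  # rows of Bt are the columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def normal_equation(X, y):
    """Return theta satisfying (X^T X) theta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Toy data generated from y = 1 + 2 * x_1 (hypothetical); no feature scaling applied.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, y)
# theta should be close to [1.0, 2.0]
```

The elimination step is where the cubic cost in the number of parameters comes from, which is why this approach becomes slow when $n$ is very large.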

- **Normal Equation Noninvertibility**

    - When implementing the normal equation in Octave, we want to use the `pinv` function rather than `inv`. The `pinv` function will give you a value of $\theta$ even if $X^TX$ is not invertible.

    - If $X^TX$ is noninvertible, the common causes are:

        - Redundant features, where two features are very closely related (i.e. they are linearly dependent).
        
        - Too many features (e.g. $m \le n$). In this case, delete some features or use regularization.
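
Both causes can be checked numerically by testing whether $X^TX$ has full rank. The sketch below does this in pure Python with Gaussian elimination; the helper names, the absolute tolerance `tol`, and the three small design matrices are assumptions made for illustration.

```python
def gram_matrix(X):
    """Return X^T X for a design matrix X given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def rank(A, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    M = [list(row) for row in A]
    n_rows, n_cols = len(M), len(M[0])
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        if r == n_rows:
            break
        pivot = max(range(r, n_rows), key=lambda i: abs(M[i][c]))
        if abs(M[pivot][c]) < tol:
            continue  # no usable pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, n_rows):
            factor = M[i][c] / M[r][c]
            for j in range(c, n_cols):
                M[i][j] -= factor * M[r][j]
        r += 1
    return r


def is_invertible(A):
    return rank(A) == len(A)


# Redundant features: the third column is exactly 2 * the second column.
X_redundant = [[1.0, 1.0, 2.0], [1.0, 2.0, 4.0], [1.0, 3.0, 6.0], [1.0, 4.0, 8.0]]
# Too many features: m = 2 examples but n = 2 features (3 columns with x_0).
X_few_examples = [[1.0, 1.0, 5.0], [1.0, 2.0, 3.0]]
# Enough examples and no linear dependence between columns.
X_ok = [[1.0, 1.0, 5.0], [1.0, 2.0, 3.0], [1.0, 3.0, 4.0], [1.0, 4.0, 1.0]]

redundant_ok = is_invertible(gram_matrix(X_redundant))
few_ok = is_invertible(gram_matrix(X_few_examples))
full_ok = is_invertible(gram_matrix(X_ok))
```

Only the last matrix gives an invertible $X^TX$. A pseudoinverse routine (`pinv` in Octave, or `numpy.linalg.pinv` in NumPy) still returns a $\theta$ in the two degenerate cases.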

    