# Week 2 - Notes

---

## Multivariate linear regression

**Notation:**
* n = number of features
* x<sup>(i)</sup> = input features of ith training example
* x<sub>j</sub><sup>(i)</sup> = value of feature j in ith training example
* For the convenience of notation, define x<sub>0</sub><sup>(i)</sup> = 1 (so x index starts at 0)

**Example:**
* x = \[\[2104, 5, 1, 45, 460\], \[1416, 3, 2, 40, 232\], \[1534, 3, 2, 30, 315\], \[852, 2, 1, 36, 178\]\]
* n = 5
* x<sup>(1)</sup> = \[2104, 5, 1, 45, 460\]
* x<sub>2</sub><sup>(1)</sup> = 5

### Hypothesis

Previously (single feature): h<sub>&theta;</sub>(x) = &theta;<sub>0</sub> + &theta;<sub>1</sub>x

Multiple features: h<sub>&theta;</sub>(x) = &theta;<sub>0</sub>x<sub>0</sub> + &theta;<sub>1</sub>x<sub>1</sub> + &theta;<sub>2</sub>x<sub>2</sub> + ... + &theta;<sub>n</sub>x<sub>n</sub>

For vectors, **_V_** denotes a column vector and **_V<sup>T</sup>_** (its transpose) a row vector.

We can write &theta; as a vector: &theta;<sup>T</sup> = \[&theta;<sub>0</sub>, &theta;<sub>1</sub>, ... &theta;<sub>n</sub>\]

We can write x as a vector: x<sup>T</sup> = \[x<sub>0</sub>, x<sub>1</sub>, ... x<sub>n</sub>\]

Therefore, we can write the multivariate linear regression hypothesis h<sub>&theta;</sub>(x) as

### = **&theta;<sup>T</sup>x**
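As a minimal sketch of the vectorized form, assuming NumPy: once x<sub>0</sub> = 1 is prepended, the hypothesis is a single dot product. The `theta` values and the helper name `hypothesis` are made up for illustration; they are not fitted parameters.

```python
import numpy as np

# Feature values of the first training example, x^(1), from the example above.
features = np.array([2104.0, 5.0, 1.0, 45.0, 460.0])

# Prepend x_0 = 1 so that theta_0 acts as the intercept term.
x = np.concatenate(([1.0], features))

# Hypothetical parameters theta_0 .. theta_5, chosen only for illustration.
theta = np.array([1.0, 0.1, 2.0, 3.0, 0.5, 0.01])

def hypothesis(theta, x):
    """Return h_theta(x) = theta^T x for two 1-D arrays of equal length."""
    return float(np.dot(theta, x))

# The dot product agrees with the written-out sum theta_0*x_0 + ... + theta_n*x_n.
explicit_sum = sum(t * v for t, v in zip(theta, x))
print(hypothesis(theta, x), explicit_sum)
```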

---


## Gradient descent for multiple variables

**Hypothesis:** h<sub>&theta;</sub>(x) = &theta;<sup>T</sup>x

**Parameters:** &theta; = \[&theta;<sub>0</sub>, &theta;<sub>1</sub>, ... &theta;<sub>n</sub>\]

**Cost Function:** J(&theta;) = (1 / 2m) &Sigma;<sub>i=1</sub><sup>m</sup> ( h<sub>&theta;</sub>(x<sup>(i)</sup>) - y<sup>(i)</sup> )<sup>2</sup>, where m is the number of training examples
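The cost function can be sketched in NumPy as follows. This is an illustrative version, not code from the course: the function name `compute_cost` and the tiny data set are assumptions, and `X` is taken to be an m-by-(n+1) array whose first column is x<sub>0</sub> = 1.

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum over i of (theta^T x^(i) - y^(i))^2."""
    m = len(y)
    errors = X @ theta - y  # h_theta(x^(i)) - y^(i) for every example
    return float(errors @ errors) / (2 * m)

# Made-up data that satisfies y = 1 + 2 * x_1 exactly.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

print(compute_cost(X, y, np.array([1.0, 2.0])))  # perfect fit, so the cost is 0.0
print(compute_cost(X, y, np.array([0.0, 0.0])))  # (1 + 9 + 25) / 6
```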

### Gradient Descent

![gradient descent](https://i.imgur.com/pMKnb4f.png)

Then, after plugging our multivariate hypothesis function into the gradient descent algorithm:

![gradient descent hypothesis](https://i.imgur.com/X4Gcj5Z.png)


This update has the same form as in the single-feature case; it is simply applied simultaneously to every &theta;<sub>j</sub>, j = 0, ..., n.
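A minimal vectorized sketch of one such update, assuming NumPy and the standard rule &theta;<sub>j</sub> := &theta;<sub>j</sub> - &alpha; (1/m) &Sigma; (h<sub>&theta;</sub>(x<sup>(i)</sup>) - y<sup>(i)</sup>) x<sub>j</sub><sup>(i)</sup>. The name `gradient_step`, the data, and the choice of `alpha` are illustrative assumptions.

```python
import numpy as np

def gradient_step(X, y, theta, alpha):
    """Apply one simultaneous update to every theta_j and return the new theta."""
    m = len(y)
    errors = X @ theta - y          # shape (m,)
    gradient = (X.T @ errors) / m   # shape (n + 1,), one entry per theta_j
    return theta - alpha * gradient

# Made-up data generated by y = 1 + 2 * x_1; the first column of X is x_0 = 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta = np.zeros(2)
for _ in range(5000):
    theta = gradient_step(X, y, theta, alpha=0.1)
print(theta)  # approaches [1, 2]
```

Because the whole vector `theta` is replaced at once, every &theta;<sub>j</sub> is computed from the old values, which is what "simultaneous update" requires.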

---

Techniques to make gradient descent perform better with multiple variables in practice:

## Feature scaling

Idea: Make sure, in a problem with multiple features, that those features are on a similar scale. Feature scaling can speed up convergence during gradient descent.

E.g.:

**Before scaling:** x<sub>1</sub> = size (0 - 2000 ft<sup>2</sup>), x<sub>2</sub> = number of bedrooms (1-5)

**After scaling:** x<sub>1</sub> = size ft<sup>2</sup> / 2000, x<sub>2</sub> = number of bedrooms / 5

So that (approximately) 0 <= x<sub>j</sub> <= 1 for every feature j = 1, ..., n
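A minimal sketch of this scaling, assuming NumPy. It divides each column by its largest observed value rather than by the fixed constants 2000 and 5 used above, and the data values are illustrative.

```python
import numpy as np

# Columns (assumed for illustration): size in ft^2, number of bedrooms.
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [852.0, 2.0]])

# Divide every column by its maximum so each value lies in (0, 1].
X_scaled = X / X.max(axis=0)
print(X_scaled)
```

The x<sub>0</sub> = 1 column, if present, needs no scaling and is usually added afterwards.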

## Mean normalization

Replace x<sub>j</sub> with x<sub>j</sub> - &mu;<sub>j</sub>, where &mu;<sub>j</sub> is the mean of feature j over the training set, so that each feature has approximately zero mean (this does not apply to x<sub>0</sub> = 1)

Or, alternatively, replace x<sub>j</sub> with (x<sub>j</sub> - &mu;<sub>j</sub>) / s<sub>j</sub>, where s<sub>j</sub> is either the range of feature j (max - min) or its standard deviation
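Both variants can be sketched in a few lines, assuming NumPy. The function name `mean_normalize` and its `use_std` flag are hypothetical, and the sketch assumes no column is constant (otherwise the divisor would be zero).

```python
import numpy as np

def mean_normalize(X, use_std=True):
    """Return ((X - mu) / s, mu, s), computed column by column.

    s is the standard deviation if use_std is True, otherwise the range
    (max - min). mu and s are returned so the same transformation can be
    applied to new inputs later.
    """
    mu = X.mean(axis=0)
    s = X.std(axis=0) if use_std else X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s

# Illustrative data: one row per training example, without the x_0 column.
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [852.0, 2.0]])

X_norm, mu, s = mean_normalize(X)
print(X_norm.mean(axis=0))  # approximately zero for every column
```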

---

**How do we make sure gradient descent is working correctly?**

![plot the cost function](https://i.imgur.com/s6zGUY1.png)

1. **Plot** the cost function J(&theta;) against the number of iterations of gradient descent, and check that it decreases on every iteration. The plot also helps you judge whether J(&theta;) has converged.
2. Set up an automatic convergence test. A sample test: declare convergence if J(&theta;) decreases by less than some small threshold (e.g. 10<sup>-3</sup>) in one iteration.
3. For a sufficiently small learning rate &alpha;, J(&theta;) **should** decrease on every iteration. If J(&theta;) increases or fails to settle (the plot makes this easy to spot), decrease &alpha;. If &alpha; is too small, gradient descent still converges, but slowly.
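The three checks above can be combined in one loop. The following is a sketch under stated assumptions rather than a reference implementation: the names `cost` and `run_gradient_descent`, the defaults `epsilon=1e-3` and `max_iters=10_000`, and the data are all illustrative.

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    errors = X @ theta - y
    return float(errors @ errors) / (2 * len(y))

def run_gradient_descent(X, y, alpha, epsilon=1e-3, max_iters=10_000):
    """Return (theta, history), where history[k] is J(theta) after k updates."""
    theta = np.zeros(X.shape[1])
    history = [cost(X, y, theta)]
    for _ in range(max_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(X, y, theta))
        if history[-1] > history[-2]:
            # J went up: the learning rate is probably too large.
            raise ValueError("J(theta) increased; try a smaller alpha")
        if history[-2] - history[-1] < epsilon:
            break  # automatic convergence test
    return theta, history

# Made-up data generated by y = 1 + 2 * x_1; the first column of X is x_0 = 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta, history = run_gradient_descent(X, y, alpha=0.1)
print(len(history) - 1, history[-1])  # number of updates and the final cost
```

Plotting `history` against the iteration index (for instance with matplotlib's `plt.plot(history)`) gives the curve described in step 1. On this data, calling the function with `alpha=1.0` raises the `ValueError` on the first update, which is the situation described in step 3.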

---

In [9]:
import numpy as np

# Python implementation of multivariate linear regression
# using batch gradient descent
# Input: training set (x, y), where x is an m-by-n matrix with one row per
#        example; include a column of ones if an intercept term is wanted
#        a, learning rate
#        N, convergence threshold: stop once every component of the summed
#           gradient has absolute value below N
# Output: list with one parameter per column of x, at an approximate
#         minimum of J (how close depends on N)
def multivariate_regression(x, y, a, N):
    x = np.asarray(x, dtype=float)  # accepts np.matrix or a 2-D array
    m, n = x.shape
    theta = np.zeros(n)

    while True:
        # grad_sum[j] = sum over i of (h(x^(i)) - y^(i)) * x_j^(i)
        grad_sum = np.zeros(n)
        for i in range(m):
            # prediction error for example i: theta^T x^(i) - y^(i)
            error = np.matmul(theta, x[i]) - y[i]
            for j in range(n):
                grad_sum[j] += error * x[i, j]
        if np.all(np.abs(grad_sum) < N):
            break
        # simultaneous update of every theta_j
        theta = theta - (a / m) * grad_sum
    return theta.tolist()


In [10]:
# Testing multivariate regression

a = np.matrix([
  [3.9,8.94],
  [5.4,10.85],
  [5.8,11.61],
  [6,13.65],
  [6.5,13.54],
  [6.1,13.29],
  [5.9,17.65],
  [5.5,18.81],
  [5.4,17.91]
  ])
b = [8.73,14.28,17.68,9.94,14.99,18.75,11.4,15.08,19.3]

print(multivariate_regression(a,b,.001,10))

[1.192758903904197, 0.539655929622876]


## Polynomial regression

If we look at the distribution of our data and decide that a linear regression is not appropriate, we can try to fit a polynomial regression to it.

**Linear regression hypothesis:** h<sub>&theta;</sub>(x) = &theta;<sub>0</sub>x<sub>0</sub> + &theta;<sub>1</sub>x<sub>1</sub> + &theta;<sub>2</sub>x<sub>2</sub> + ... + &theta;<sub>n</sub>x<sub>n</sub>

**Polynomial regression hypothesis:** for a single input x and degree d, h<sub>&theta;</sub>(x) = &theta;<sub>0</sub> + &theta;<sub>1</sub>x + &theta;<sub>2</sub>x<sup>2</sup> + ... + &theta;<sub>d</sub>x<sup>d</sup>. This is still linear regression in &theta;, with the powers of x used as the features: x<sub>j</sub> = x<sup>j</sup>.

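With polynomial regression, feature scaling is **_extremely_** important: if x ranges from 1 to 1000, then x<sup>2</sup> ranges up to 10<sup>6</sup> and x<sup>3</sup> up to 10<sup>9</sup>.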
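A small sketch of building and scaling polynomial features, assuming NumPy and a single input x. The helper name `polynomial_features` and the data are made up for illustration.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a 1-D input x to the columns [x, x^2, ..., x^degree]."""
    return np.column_stack([x ** p for p in range(1, degree + 1)])

x = np.array([1.0, 10.0, 100.0, 1000.0])  # made-up single feature
P = polynomial_features(x, degree=3)
print(P.max(axis=0))  # column maxima: 1e3, 1e6, 1e9

# Mean-normalize each power, then add the x_0 = 1 column.
P_norm = (P - P.mean(axis=0)) / P.std(axis=0)
X = np.column_stack([np.ones(len(x)), P_norm])
print(X.shape)  # (4, 4): x_0 plus three scaled powers of x
```

The resulting `X` can be passed to any linear regression routine unchanged, since the model is still linear in &theta;.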

You also have some choice in what feature to use for your model. E.g. rather than just using the frontage and the depth of a house plot, you can use the area (frontage * depth), or the area and the square root of the area, etc.
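For instance, the derived features mentioned above could be assembled like this. It is a sketch assuming NumPy, and the frontage and depth values are invented.

```python
import numpy as np

# Made-up plots: frontage and depth in feet.
frontage = np.array([50.0, 40.0, 60.0])
depth = np.array([100.0, 80.0, 120.0])

# Replace the two raw measurements with area and sqrt(area).
area = frontage * depth
X = np.column_stack([np.ones(len(area)), area, np.sqrt(area)])
print(X.shape)  # (3, 3): x_0, area, sqrt(area)
```

Since area and its square root are on very different scales, these columns would normally be scaled before running gradient descent.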