# Linear Regression with multiple variables


We will implement a linear regression model to predict housing prices. Here each training example has 
several features.  

The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is 
the size of the house (in square feet), the second column is the number of bedrooms, and the third column 
is the price of the house.

## Feature normalization 

Because house sizes are about 1000 times the number of bedrooms, we need to perform feature scaling to ensure 
that gradient descent converges in a reasonable amount of time.

In [224]:
# Multivariate Linear Regression 
%matplotlib inline
import numpy as np
import pandas as pd  
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Take a second to appreciate how easy it is to read csv using pandas. No mess, no drama!
data = pd.read_csv("ex1data2.txt", header=None, names=['Size', 'No. Bedrooms', 'Price']) 
data.head()

Unnamed: 0,Size,No. Bedrooms,Price
0,2104,3,399900
1,1600,3,329900
2,2400,3,369000
3,1416,2,232000
4,3000,4,539900


In [225]:
data.describe()

Unnamed: 0,Size,No. Bedrooms,Price
count,47.0,47.0,47.0
mean,2000.680851,3.170213,340412.659574
std,794.702354,0.760982,125039.899586
min,852.0,1.0,169900.0
25%,1432.0,3.0,249900.0
50%,1888.0,3.0,299900.0
75%,2269.0,4.0,384450.0
max,4478.0,5.0,699900.0


To normalize the features we should:
   * Subtract the mean value of each feature from the dataset.
   * Divide the resulting values by the feature's standard deviation.
   
*Standard deviation (std)* is a measure of the amount of variation in a set of data values.

In [226]:
# Feature normalization using built-in pandas functions (this also scales the Price column)
data = (data - data.mean()) / data.std()
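
When the fitted model is later applied to a new house, that input has to be scaled with the same means and standard deviations as the training data. The following is a minimal sketch of that idea in plain NumPy. The names `normalize_features`, `mu`, `sigma`, and `new_house` are hypothetical and do not appear in the notebook, and the sample rows are only the first five rows of the dataset shown above.

```python
import numpy as np

def normalize_features(values):
    """Return z-scored columns plus the per-column mean and std that were used."""
    values = np.asarray(values, dtype=float)
    mu = values.mean(axis=0)
    # ddof=1 gives the sample standard deviation, matching the pandas std() default
    sigma = values.std(axis=0, ddof=1)
    return (values - mu) / sigma, mu, sigma

# First five rows of ex1data2.txt (size, bedrooms), used only as a small illustration
sample = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]])
normalized, mu, sigma = normalize_features(sample)

# A new input (made-up values) is scaled with the stored training statistics
new_house = (np.array([1650, 3]) - mu) / sigma
```

Keeping `mu` and `sigma` around is the design choice that matters here: recomputing them from new data would put the input on a different scale from the one the parameters were learned on.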

## Gradient descent 
We previously applied gradient descent to a single variable. Here we have an additional variable, so the code should support any number of features and be vectorized. According to the implementation note 
in Andrew's exercise file, the cost function in this case can be written as:  
$$J(\theta) =  \frac{1}{2 m}  \big(X\theta -  \overrightarrow{y} \big)^{T} \big(X\theta -  \overrightarrow{y} \big)$$
where:
$$X = \left[ \begin{array}{c}
- (x^{(1)})^{T} - \\
- (x^{(2)})^{T} -\\
\ldots\\
- (x^{(m)})^{T} -\\ \end{array} \right]$$
and:
$$\vec{y} = \left[ \begin{array}{cccc}
y^{(1)}\\
y^{(2)}\\
\ldots\\
y^{(m)}\\ \end{array} \right]$$
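
To make the matrix form concrete, here is a small sketch that compares it with the explicit sum over training examples. It is an illustration under assumptions: the toy values are made up, the names `cost_vectorized`, `cost_loop`, `X_toy`, `y_toy`, and `theta_toy` are hypothetical, and it uses 1-D NumPy arrays rather than the `np.matrix` row vector used in the notebook cells.

```python
import numpy as np

def cost_vectorized(theta, X, y):
    """J(theta) = (1 / 2m) * (X theta - y)^T (X theta - y)."""
    m = len(y)
    residual = X @ theta - y  # shape (m,)
    return (residual @ residual) / (2 * m)

def cost_loop(theta, X, y):
    """The same cost written as an explicit sum over the training examples."""
    m = len(y)
    total = 0.0
    for i in range(m):
        total += (X[i] @ theta - y[i]) ** 2
    return total / (2 * m)

# Toy data (made up): a column of ones plus two features
X_toy = np.array([[1.0, 0.5, -1.0],
                  [1.0, -0.2, 0.0],
                  [1.0, 1.5, 1.0]])
y_toy = np.array([0.3, -0.1, 1.2])
theta_toy = np.array([0.1, 0.2, -0.3])

# The two formulations should agree up to floating-point error
same = np.isclose(cost_vectorized(theta_toy, X_toy, y_toy),
                  cost_loop(theta_toy, X_toy, y_toy))
```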

In [227]:
# Set alpha and initialize the theta values to zero. The number of thetas is always one more 
# than the number of features, so in this case we have three thetas. However, the algorithm 
# should still be written to accommodate any number of thetas.

alpha = 0.01 
theta = np.zeros(3)
theta = np.matrix(theta)

# number of iterations (named num_iters to avoid shadowing the built-in iter)
num_iters = 1500 

# A column of ones should be inserted at the beginning of the X matrix
# DataFrame.insert(loc, column name, value)
data.insert(0, 'One', 1)

Of the 4 columns in our dataframe (data.shape[1]), the first three (One, Size, and No. Bedrooms) are used to construct X, and the last one (Price) is used to build y.

In [228]:
# .iloc[] selects rows and columns by integer position (from 0 to length-1 along each axis)
X = data.iloc[:, 0: data.shape[1]-1]
X = np.matrix(X)
y = data.iloc[:, data.shape[1]-1:data.shape[1]]
y = np.matrix(y)

In [229]:
m = data.shape[0]

In [230]:
def cost(theta, X, y):
    # theta is a (1 x n) row matrix, so X * theta.T is the (m x 1) matrix of predictions
    squared_errors = np.power(((X * theta.T) - y), 2)
    # squared_errors is an (m x 1) matrix; sum it and divide by 2m, where m = len(X)
    return np.sum(squared_errors) / (2 * len(X))

In [231]:
cost(theta, X, y)

0.48936170212765967
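
This value can be checked by hand. The short derivation below assumes that the Price column was scaled with the sample standard deviation (the pandas `std()` default), so the normalized prices have zero mean and satisfy $\sum_{i} \big(y^{(i)}\big)^{2} = m - 1$. With every theta equal to zero, all predictions are zero and the cost for $m = 47$ reduces to:

$$J(0) = \frac{1}{2m}\sum_{i=1}^{m}\big(0 - y^{(i)}\big)^{2} = \frac{m-1}{2m} = \frac{46}{94} \approx 0.4894$$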

In [232]:
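def gradientDecentMatP(theta, alpha, X, y):
    # Perform one gradient descent step and return the updated (1 x n) theta
    theta1 = np.matrix(np.zeros(theta.shape))
    # error is computed once from the current theta, so all thetas update simultaneously
    error = (X * theta.T) - y
    for j in range(theta.shape[1]):
        # element-wise product of the error with feature column j
        temp = np.multiply(error, X[:,j])
        theta1[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(temp))
    return theta1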
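
The per-feature loop above can also be collapsed into a single matrix expression, theta - (alpha / m) * X^T (X theta - y). The sketch below is one possible way to write that with plain 1-D NumPy arrays instead of `np.matrix`. The names `gradient_step`, `gradient_step_loop`, and the toy data are hypothetical and only serve to show that both forms give the same update.

```python
import numpy as np

def gradient_step(theta, alpha, X, y):
    """One batch update: theta - (alpha / m) * X^T (X theta - y)."""
    m = len(y)
    error = X @ theta - y  # shape (m,)
    return theta - (alpha / m) * (X.T @ error)

def gradient_step_loop(theta, alpha, X, y):
    """The same update computed one parameter at a time."""
    m = len(y)
    error = X @ theta - y
    new_theta = np.zeros_like(theta)
    for j in range(len(theta)):
        new_theta[j] = theta[j] - (alpha / m) * np.sum(error * X[:, j])
    return new_theta

# Toy data (made up): a column of ones plus two features
X_demo = np.array([[1.0, 0.5, -1.0],
                   [1.0, -0.2, 0.0],
                   [1.0, 1.5, 1.0],
                   [1.0, -1.1, 0.4]])
y_demo = np.array([0.3, -0.1, 1.2, -0.8])
theta_demo = np.zeros(3)

step_vectorized = gradient_step(theta_demo, 0.01, X_demo, y_demo)
step_looped = gradient_step_loop(theta_demo, 0.01, X_demo, y_demo)
```

The vectorized form works for any number of features without changes, because the shape of `theta` follows from the number of columns in `X`.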

In [233]:
# We run gradient descent once and use the result to check if the cost is decreasing in one iteration  
gradientDecentMatP(theta, alpha, X, y)

matrix([[ -7.08652994e-19,   8.36796367e-03,   4.32851306e-03]])

In [234]:
theta_new = gradientDecentMatP(theta, alpha, X, y)
cost(theta_new, X, y)

0.48054910410767188

In [235]:
# We try different values for alpha, to see how it affects the cost
alphas = [0.05, 0.03, 0.01, 0.1]
for alpha in alphas:
    costs = []
    theta = np.matrix(np.zeros(3))
    for i in range(100):
        costs.append(cost(theta, X, y))
        theta = gradientDecentMatP(theta, alpha, X, y)
    print("alpha =", alpha)
    print(costs[-10:-1])

alpha = 0.05
[0.1325690403200446, 0.13248881991623926, 0.13241201967481833, 0.13233849351760485, 0.13226810164385419, 0.13220071025488278, 0.13213619129159312, 0.13207442218417484, 0.13201528561331716]
alpha = 0.03
[0.13986059016215924, 0.13962141790056626, 0.13938871202698191, 0.13916227734379866, 0.13894192627700599, 0.13872747844665043, 0.13851876026925145, 0.13831560458948056, 0.13811785033864607]
alpha = 0.01
[0.19079217031603379, 0.18991630937488599, 0.18905886281327441, 0.18821934247553765, 0.18739727439800891, 0.18659219838412336, 0.185803667592357, 0.18503124813660804, 0.1842745186986435]
alpha = 0.1
[0.13072076616188341, 0.13071787659326714, 0.13071523055549961, 0.13071280752396855, 0.13071058870386201, 0.1307085568843821, 0.1307066963052449, 0.13070499253443252, 0.13070343235624751]


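This means that, among the values tried, larger alphas converge faster within 100 iterations: with alpha = 0.1 the cost has almost levelled off near 0.1307, while with alpha = 0.01 it is still around 0.184 and decreasing.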
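
The same comparison can be reproduced without the data file. The sketch below is only an illustration under assumptions: it uses synthetic, made-up data in place of the normalized housing set, a vectorized update on 1-D arrays, and the hypothetical names `run_gradient_descent`, `X_syn`, `y_syn`, and `final_costs`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the normalized housing data (values are made up)
n_samples = 50
features = rng.normal(size=(n_samples, 2))
features = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)
X_syn = np.column_stack([np.ones(n_samples), features])
y_syn = features @ np.array([0.9, -0.1]) + 0.1 * rng.normal(size=n_samples)

def run_gradient_descent(X, y, alpha, num_iters):
    """Run batch gradient descent from theta = 0 and return the cost after each update."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        error = X @ theta - y
        theta = theta - (alpha / m) * (X.T @ error)
        residual = X @ theta - y
        history.append((residual @ residual) / (2 * m))
    return history

# Cost reached after 100 iterations for each learning rate
final_costs = {alpha: run_gradient_descent(X_syn, y_syn, alpha, 100)[-1]
               for alpha in [0.01, 0.03, 0.05, 0.1]}
for alpha, value in final_costs.items():
    print(f"alpha = {alpha}: cost after 100 iterations = {value:.4f}")
```

The ordering only holds while alpha is small enough for the updates to stay stable. A learning rate that is too large makes the cost grow instead of shrink, so "larger is faster" applies to the range tried here, not in general.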