### Step 3: The mathematics behind the Least Squares Method.


In this particular section we'll use the least squares method to estimate the coefficients. Here's a quick breakdown of how this method works mathematically:

Take a quick look at the plot we created above using seaborn. Each point has a coordinate of the form (X,Y). Now draw an imaginary vertical line from each point to our current "best-fit" line. We'll call the distance between each point and the best-fit line D. To picture what we're describing, take a look at the image below:

In [None]:
# Quick display of image from Wikipedia
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Linear_least_squares_example2.svg/220px-Linear_least_squares_example2.svg.png'
Image(url)

Now as before, we're labeling each green line as having a distance D, and each red point as having a coordinate of (X,Y). Then we can define our best-fit line as the line for which the following sum is as small as possible:
$$ D_{1}^2 + D_{2}^2 + D_{3}^2 + D_{4}^2 + \cdots + D_{N}^2 $$

So how do we find this line? The least-squares line approximating the set of points

$$ (X_{1},Y_{1}),(X_{2},Y_{2}),(X_{3},Y_{3}),\ldots,(X_{N},Y_{N}) $$

has the equation:
$$ Y = a_{0} +a_{1}X $$
This is just a rewritten form of the standard equation for a line, with intercept $a_{0}$ in place of b and slope $a_{1}$ in place of m:
$$Y=mx+b$$

We can solve for these constants a0 and a1 by simultaneously solving these equations:
$$ \Sigma Y = a_{0}N + a_{1}\Sigma X $$
$$ \Sigma XY = a_{0}\Sigma X + a_{1}\Sigma X^2 $$

These are called the normal equations for the least squares line. They can be rearranged further to solve for a0 and a1 directly, but we'll let numpy and scikit-learn do the heavy lifting here. If you want further information on the mathematics of the above formulas, check out this [video](https://www.youtube.com/watch?v=Qa2APhWjQPc).
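To make the normal equations concrete, here is a minimal sketch that solves them for a tiny made-up dataset (four points lying exactly on Y = 1 + 2X, not the Boston data). The variable names `lhs` and `rhs` are just illustrative.

```python
import numpy as np

# Made-up points that lie exactly on Y = 1 + 2X
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.0, 3.0, 5.0, 7.0])
N = len(X)

# The two normal equations written as a 2x2 linear system:
#   [ N       sum(X)   ] [a0]   [ sum(Y)  ]
#   [ sum(X)  sum(X^2) ] [a1] = [ sum(XY) ]
lhs = np.array([[N, X.sum()],
                [X.sum(), (X ** 2).sum()]])
rhs = np.array([Y.sum(), (X * Y).sum()])

# Solve for the intercept a0 and the slope a1
a0, a1 = np.linalg.solve(lhs, rhs)
print("a0 = %.2f, a1 = %.2f" % (a0, a1))
```

Because the points sit exactly on a line, the solution recovers the intercept and slope used to build them.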

For now, we'll use numpy to do a simple single variable linear regression. Afterwards we'll unleash the power of scikit learn to do a full multivariate linear regression.

### Step 4: Using Numpy for a Univariate Linear Regression

Numpy has a built-in least squares solver in its linear algebra library. We'll use this first for our univariate regression and then move on to scikit-learn for multivariate regression.

We will start by setting up the X and Y arrays.

In [None]:
# Mandatory imports copied from the previous project
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()

# load data as pandas DataFrame
boston_df = DataFrame(boston.data)

# label columns
boston_df.columns = boston.feature_names
boston_df['Price'] = boston.target

# Set up X as the average number of rooms per dwelling (RM)
X = boston_df.RM

# Set up Y as the target price of the houses.
Y = boston_df.Price

Now that we have X and Y, let's go ahead and use numpy to create the single variable linear regression.

We know that a line has the equation:
$$y=mx+b$$
which we can rewrite using matrices:
$$y=Ap$$
where:
$$A = \begin{bmatrix}x & 1\end{bmatrix}$$
and
$$p= \begin{bmatrix}m \\b\end{bmatrix}$$

This is the same as the first equation if you carry out the linear algebra. 
So we'll start by creating the A matrix using numpy. Each row of A has the form [x 1], so we'll loop over every value in our original X with a list comprehension and build an array whose rows are [x 1].
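As a rough sketch of this setup, the snippet below builds the [x 1] matrix with a list comprehension and hands it to numpy's least squares routine. It uses made-up data on a known line (the lowercase `x` and `y` are hypothetical stand-ins, not the Boston columns), so the recovered slope and intercept can be checked by eye.

```python
import numpy as np

# Made-up data on the line y = 3x - 2 (stand-ins for rooms and prices)
x = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
y = 3.0 * x - 2.0

# Each row of A is [x_i, 1], matching y = A p with p = [m, b]
A = np.array([[value, 1.0] for value in x])

# lstsq returns (solution, residuals, rank, singular values);
# the solution holds the slope m and the intercept b
m, b = np.linalg.lstsq(A, y, rcond=None)[0]
print("m = %.2f, b = %.2f" % (m, b))
```

The same two steps should carry over to the Boston columns once X is replaced by the rooms column and Y by the prices.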

In [None]:
# Create the X array in the form [X 1]


Great! Now we can get the best fit values!

In [None]:
# Now find out m and b values for our best fit line


Finally let's plot it! Note that we plot the original Boston columns. The matrix form was only needed to use numpy's least squares method.

In [None]:
# Plot the original points (Price vs Number of Rooms)


# Next best fit line


### Step 5: Error evaluation ###

Great! We've just completed a single variable regression using the least squares method with Python! Let's see if we can find the error in our fitted line. Checking out the documentation [here](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html), we see that the second element of the returned tuple holds the total squared error. For each element, it takes the difference between the line and the true value (our original D value), squares it, and returns the sum of all these. This is the summed D^2 value we discussed earlier.

It's probably easier to understand the root mean square error, which is similar to the standard deviation. In this case, to find the root mean square error we divide by the number of elements and then take the square root.

For now let's see how we can get the root mean square error of the line we just fitted.

In [None]:
# Get the result tuple; X must be the two-column [X 1] array built in Step 4
result = np.linalg.lstsq(X, Y, rcond=None)

# Get the total squared error (returned as a one-element array)
error_total = result[1]

# Get the root mean square error
rmse = np.sqrt(error_total[0] / len(X))

# Print
print("The root mean square error was %.2f " % rmse)

Since the root mean square error (RMSE) corresponds approximately to the standard deviation, we can say that, if the errors are roughly normally distributed, about 95% of house prices will fall within 2 times the RMSE of our fitted line. Note: If you need more information check out this [link](http://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule).

Thus we can reasonably expect a house price to be within $13,200 of our line fit about 95% of the time.
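The two-times-RMSE rule of thumb can be checked on simulated data. The sketch below is only an illustration under stated assumptions: the data are made up, the noise is drawn from a normal distribution, and the slope, intercept and noise level are arbitrary choices rather than values taken from the Boston fit.

```python
import numpy as np

# Simulated data: a straight line plus normally distributed noise
rng = np.random.default_rng(0)
x = rng.uniform(3.0, 9.0, size=2000)
noise = rng.normal(0.0, 5.0, size=x.size)
y = 9.0 * x - 34.0 + noise

# Fit the line with the [x 1] matrix form
A = np.vstack([x, np.ones(x.size)]).T
(m, b), residual_sum, _, _ = np.linalg.lstsq(A, y, rcond=None)

# RMSE from the summed squared residuals
rmse = np.sqrt(residual_sum[0] / x.size)

# Fraction of points that lie within 2 RMSE of the fitted line
within = np.abs(y - (m * x + b)) <= 2 * rmse
fraction_within = within.mean()
print("RMSE: %.2f, fraction within 2 RMSE: %.3f" % (rmse, fraction_within))
```

With normal noise the fraction should land close to 0.95. Real data with skewed or heavy-tailed errors may behave differently.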

### Step 6: Using scikit learn to implement a multivariate regression

Now, we'll keep moving along by using scikit-learn to do a multivariable regression. This will be a similar approach to the above example, but scikit-learn will be able to take into account more than a single variable!

We'll start by importing the [linear regression library](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) from sklearn.

The sklearn.linear_model.LinearRegression class is an estimator. Estimators predict a value based on the observed data. In scikit-learn, all estimators implement the fit() and predict() methods. The former method is used to learn the parameters of a model, and the latter method is used to predict the value using the learned parameters. It is easy to experiment with different models using scikit-learn because all estimators implement the fit and predict methods.

In [None]:
# Import for Linear Regression
import sklearn
from sklearn.linear_model import LinearRegression

Next, we create a LinearRegression object. Afterwards, type linear_regression. and press tab to see the list of methods available on this object.

In [None]:
# Create a LinearRegression Object


The functions we will be using are:

linear_regression.fit() which fits a linear model

linear_regression.predict() which is used to predict Y using the linear model with estimated coefficients

linear_regression.score() which returns the coefficient of determination (R^2), a measure of how well observed outcomes are replicated by the model. Learn more about it [here](http://en.wikipedia.org/wiki/Coefficient_of_determination)
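The three methods listed above can be tried out on a small made-up dataset before touching the Boston data. This is only a sketch: `demo_model`, `features` and `target` are hypothetical names, and the target is built without noise so that a perfect fit is expected.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data with three features
rng = np.random.default_rng(1)
features = rng.normal(size=(100, 3))
target = features @ np.array([1.5, -2.0, 0.5]) + 4.0

# fit() learns the coefficients and the intercept
demo_model = LinearRegression()
demo_model.fit(features, target)

# predict() applies the learned line to the features
predictions = demo_model.predict(features)

# score() returns R^2 (close to 1 here because there is no noise)
r_squared = demo_model.score(features, target)
print("R^2 = %.3f" % r_squared)
```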





We'll start the multivariable regression analysis by separating our Boston DataFrame into the feature columns and the target column:

In [None]:
# Data Columns
X = boston_df.drop('Price', axis=1)

# Targets
Y = boston_df.Price

Finally, we're ready to pass the X and Y using the linear regression object.

In [None]:
# Implement Linear Regression


Let's go ahead and check the intercept and the number of coefficients.

In [None]:
#print(' The number of coefficients used : %d ' % len(linear_regression.coef_))

Great! So we have basically made an equation for a line, but instead of just one coefficient m and an intercept b, we now have 13 coefficients. To get an idea of what this looks like check out the [documentation](http://scikit-learn.org/stable/modules/linear_model.html) for this equation:
$$ y(w,x) = w_0 + w_1 x_1 + ... + w_p x_p $$

Where $w = (w_1, ..., w_p)$ are the coefficients and $w_0$ is the intercept.
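To see that this equation is what a fitted model evaluates, the sketch below rebuilds the predictions by hand from `intercept_` (the $w_0$ term) and `coef_` (the remaining weights). The data are made up and `demo_model` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with four features and a little noise
rng = np.random.default_rng(2)
features = rng.normal(size=(50, 4))
target = (features @ np.array([2.0, 0.0, -1.0, 3.0]) + 10.0
          + rng.normal(scale=0.1, size=50))

demo_model = LinearRegression().fit(features, target)

# y = w_0 + w_1 x_1 + ... + w_p x_p, written with the fitted attributes
manual = demo_model.intercept_ + features @ demo_model.coef_

# The hand-built values should match predict()
print(np.allclose(manual, demo_model.predict(features)))
```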

What we'll do next is set up a DataFrame showing all the features and their estimated coefficients obtained from the linear regression.

In [None]:
# Set a DataFrame from the Features
coeff_df = DataFrame(boston_df.columns)
coeff_df.columns = ['Features']

# Set a new column lining up the coefficients from the linear regression


# Print
coeff_df

It seems the feature with the largest positive coefficient is the number of rooms. Keep in mind that a coefficient is not a correlation: its size depends on the units of the feature.

Now let's move on to Predicting prices!

### Step 7: Using Training and Validation 

A training set is the part of a dataset used to build a model, while a validation set is used to check the model once it is built. Data points in the training set are excluded from the validation set.

Fortunately, scikit learn has a built in function specifically for this called train_test_split().

The parameters passed are X and Y, then optionally a test_size parameter giving the proportion of the dataset to include in the test set. You can learn more about this [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
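Here is a small sketch of how test_size affects the split, using made-up arrays instead of the Boston data. The names `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical, and random_state is set only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows with 2 features each, and one target per row
features = np.arange(40).reshape(20, 2)
target = np.arange(20)

# Hold out 25% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)
```

When test_size is left out, scikit-learn falls back to a 25% test set, which is what the split in this notebook relies on.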

In [None]:
# Split dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y)

In [None]:
# Print shapes of the training and test sets


### Step 8: Predicting Prices

Now that we have our training and test sets, let's go ahead and try to use them to predict house prices. We'll use our training set for fitting the model and then use test set for validation.

In [None]:
# Create linear regression object
linear_regression = LinearRegression()

# Fit the model on training set
linear_regression.fit(X_train,Y_train)

Now run a prediction on both the training set and the test set.

In [None]:
# Predictions
Y_pred_train = linear_regression.predict(X_train)
Y_pred_test = linear_regression.predict(X_test)

Now we will get the mean squared error

In [None]:
# Mean Squared Error
print("Mean squared error on Training set : %.2f"  % np.mean((Y_train - Y_pred_train) ** 2)) 
print("Mean squared error on Test set : %.2f"  %np.mean((Y_test - Y_pred_test) ** 2))

It looks like the mean squared errors on the training set and the test set are very close. Since the split is random, the exact values will change from run to run.
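The mean squared error computed by hand with numpy can be cross-checked against scikit-learn's `mean_squared_error`. The numbers below are made up purely for illustration and are not taken from the Boston fit.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observed and predicted values
observed = np.array([10.0, 12.0, 20.0, 18.0])
predicted = np.array([11.0, 11.0, 16.0, 20.0])

# Same formula as in the cell above: mean of the squared differences
manual_mse = np.mean((observed - predicted) ** 2)

# scikit-learn's helper computes the same quantity
library_mse = mean_squared_error(observed, predicted)

print(manual_mse, library_mse)
```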

In [None]:
# Test set score
linear_regression.score(X_test, Y_test)

### Step 9 : Residual Plots

In regression analysis, the difference between the observed value and the predicted value is called the residual.

$$Residual = Observed\:value - Predicted\:value $$

Each data point has one residual.


A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. With several features, as in our case, the predicted value is commonly used on the horizontal axis instead. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

Residual plots are a good way to visualize the errors of your model. If you have done a good job, the residuals should be randomly scattered around the zero line. If there is some structure or pattern, that means your model is not capturing something.

So now let's go ahead and create the residual plot. For more info on residual plots check out this great [link](http://blog.minitab.com/blog/adventures-in-statistics/why-you-need-to-check-your-residual-plots-for-regression-analysis).
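To illustrate what "structure in the residuals" can look like, the sketch below fits a straight line to deliberately curved, made-up data and measures how strongly the residuals follow the curve. It uses `np.polyfit` as a stand-in for the regression and is not part of the Boston analysis.

```python
import numpy as np

# Made-up data that follows a curve, not a line
x = np.linspace(-3.0, 3.0, 61)
y_curved = x ** 2

# Fit a straight line anyway (polyfit returns slope, then intercept)
slope, intercept = np.polyfit(x, y_curved, 1)
residuals = y_curved - (slope * x + intercept)

# The residuals of a least squares line average out to zero...
print("mean residual: %.6f" % residuals.mean())

# ...but here they still track the curve the line could not capture
pattern = np.corrcoef(residuals, x ** 2)[0, 1]
print("correlation with x^2: %.3f" % pattern)
```

A correlation this strong is the kind of pattern that would show up as a clear U shape in a residual plot.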

In [None]:
# Scatter plot for training data
train = plt.scatter(Y_pred_train,(Y_train-Y_pred_train),c='b',alpha=0.5)

# Scatter plot for test data
test = plt.scatter(Y_pred_test,(Y_test-Y_pred_test),c='r',alpha=0.5)

# Plot a horizontal axis
plt.hlines(y=0,xmin=-10,xmax=50)

#Labels
plt.legend((train,test),('Training','Test'),loc='lower left')
plt.title('Residual Plots')

Great! There are no obvious patterns in the residual plot. The residuals seem to be randomly scattered above and below the horizontal line. We could also use seaborn to create these plots:

In [None]:
# Residual plot of the entire dataset using seaborn


That's all for linear regression. Please refer to scikit learn documentation for more details :  http://scikit-learn.org/stable/modules/linear_model.html#linear-model