# Predicting Boston Housing Prices

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit  # sklearn.cross_validation was removed in scikit-learn 0.20

import ex02_visuals as vs

# pretty display for notebooks
%matplotlib inline

data = pd.read_csv('ex02_housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis=1)

print("Boston housing dataset has {} data points with {} vars each".format(*data.shape))

# features:
# RM - average number of rooms among homes in the neighborhood
# LSTAT - percentage of homeowners considered "lower class"
# PTRATIO - ratio of students to teachers in schools

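# summary statistics of the target variable (median home value, MEDV)
print(data['MEDV'].min())     # minimum: 105000.0
print(data['MEDV'].max())     # maximum: 1024800.0
print(data['MEDV'].mean())    # mean: 454342.944785
print(data['MEDV'].median())  # median: 438900.0
print(data['MEDV'].std())     # standard deviation: 165340.277653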


Boston housing dataset has 489 data points with 4 vars each
105000.0
1024800.0
454342.944785
438900.0
165340.277653


## Developing a Model

The coefficient of determination (denoted by R<sup>2</sup>) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variables.

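* for a least-squares linear fit with an intercept, evaluated on the data it was fitted to, the coefficient of determination is the square of the correlation (r) between predicted y scores and actual y scores; in that setting it ranges from 0 to 1. On held-out data, or for other kinds of model, R<sup>2</sup> can be negative, which means the model predicts worse than always guessing the mean of y.
* with simple linear regression (a single independent variable), the coefficient of determination is also equal to the square of the correlation between x and y scores.
* an R<sup>2</sup> of 0 means that the model explains none of the variance in the dependent variable; it does no better than always predicting the mean.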
* an R<sup>2</sup> of 1 means the dependent variable can be predicted without error from the independent variable.
* an R<sup>2</sup> between 0 and 1 indicates the extent to which the dependent variable is predictable. An R<sup>2</sup> of 0.10 means that 10 percent of the variance in Y is predictable from X; an R<sup>2</sup> of 0.20 means that 20 percent is predictable; and so on.
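
To make the interpretation above concrete, here is a minimal sketch that computes R<sup>2</sup> from its definition, 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The helper name `r_squared` and the sample values are made up for illustration and are not part of the project code. By this definition the score can fall below 0 when predictions are worse than the mean baseline.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, written out with plain Python lists."""
    mean_true = sum(y_true) / len(y_true)
    # residual sum of squares: error left over after using the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]  # made-up illustration values

good = r_squared(y_true, [2.5, 0.0, 2.0, 8.0])                           # close predictions
baseline = r_squared(y_true, [sum(y_true) / len(y_true)] * len(y_true))  # always the mean
poor = r_squared(y_true, [7.0, 2.0, -0.5, 3.0])                          # badly mismatched

print(round(good, 3))      # 0.949
print(round(baseline, 3))  # 0.0
print(round(poor, 3))      # -0.525
```

The three cases line up with the reading of the score: close predictions give a value near 1, the mean baseline gives exactly 0, and predictions worse than the baseline give a negative value.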


In [None]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    
    # Calculate the performance score between 'y_true' and 'y_predict'
    score = r2_score(y_true, y_predict)
    
    # Return the score
    return score
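
As a quick sanity check, `performance_metric` can be called on a handful of hand-written values. The numbers below are illustrative only and are not drawn from the housing data; the function is repeated so the snippet runs on its own.

```python
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """Return the R^2 score between true and predicted values."""
    return r2_score(y_true, y_predict)

# hypothetical true and predicted values, chosen only for illustration
score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
print("Model has a coefficient of determination, R^2, of {:.3f}.".format(score))
# Model has a coefficient of determination, R^2, of 0.923.
```

A score of about 0.923 says that these predictions account for roughly 92 percent of the variance in the true values.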
