# Boston House Price Predictor

In this notebook, I have developed the first project of the Machine Learning Engineer Nanodegree at [Udacity](https://eu.udacity.com/).

## Getting Started
In this project, I have developed a predictive model for Boston housing prices. The model has been trained and tested on data collected from homes in suburbs of Boston, Massachusetts.

The dataset for this project originates from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/). The Boston housing data was collected in 1978, and each of the 506 entries represents aggregated data about 14 features for homes from various suburbs in Boston, Massachusetts. For the purposes of this project, the following preprocessing steps have been applied to the dataset:
- 16 data points have an `'MEDV'` value of 50.0. These data points likely contain **missing or censored values** and have been removed.
- 1 data point has an `'RM'` value of 8.78. This data point can be considered an **outlier** and has been removed.
- The features `'RM'`, `'LSTAT'`, `'PTRATIO'`, and `'MEDV'` are essential. The remaining **non-relevant features** have been excluded.
- The feature `'MEDV'` has been **multiplicatively scaled** to account for 35 years of market inflation.
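
The preprocessing steps listed above could be reproduced roughly as follows. This is only a sketch under stated assumptions, not the code that produced `housing.csv`: the helper name `preprocess_boston`, the `medv_scale` argument, and the toy data frame are hypothetical. The actual inflation factor is not given here, so it is left as a parameter.

```python
import pandas as pd

# Columns kept after preprocessing, as listed in the text
KEPT_COLUMNS = ['RM', 'LSTAT', 'PTRATIO', 'MEDV']


def preprocess_boston(raw, medv_scale):
    """Apply the four preprocessing steps to a raw housing DataFrame.

    Hypothetical helper; 'medv_scale' is the multiplicative inflation
    factor, which must be supplied by the caller.
    """
    # Step 1: drop rows whose MEDV equals 50.0 (likely missing or censored)
    cleaned = raw[raw['MEDV'] != 50.0]
    # Step 2: drop the RM outlier with value 8.78
    cleaned = cleaned[cleaned['RM'] != 8.78]
    # Step 3: keep only the essential features and the target
    cleaned = cleaned[KEPT_COLUMNS].copy()
    # Step 4: scale MEDV multiplicatively
    cleaned['MEDV'] = cleaned['MEDV'] * medv_scale
    return cleaned.reset_index(drop=True)


# Toy data (made up for illustration); 'CRIM' stands in for a dropped feature
toy = pd.DataFrame({
    'RM': [6.5, 8.78, 5.9, 7.1],
    'LSTAT': [5.0, 5.1, 12.4, 3.2],
    'PTRATIO': [15.3, 20.2, 18.7, 14.0],
    'CRIM': [0.01, 0.02, 0.30, 0.05],
    'MEDV': [24.0, 21.9, 18.5, 50.0],
})

# The scale of 2.0 is arbitrary and chosen only to keep the example readable
result = preprocess_boston(toy, medv_scale=2.0)
print(result)
```

In the toy frame, the second row is removed as the `'RM'` outlier and the last row as a censored `'MEDV'` value, so two rows with four columns remain.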

In [1]:
#<-Import libraries necessary for this project->#
import numpy as np
import pandas as pd
from sklearn.cross_validation import ShuffleSplit
%matplotlib inline

#<-Load the Boston housing dataset->#
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)
    
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))



Boston housing dataset has 489 data points with 4 variables each.




## Data Exploration
In this first section of the project, I have explored the Boston housing dataset.

### Implementation: Calculate Statistics
First, I have calculated descriptive statistics for the Boston housing prices, because these statistics are important for analyzing the prediction results of the constructed model.

In [2]:
# Minimum price of the data
minimum_price = np.min(prices)

# Maximum price of the data
maximum_price = np.max(prices)

# Mean price of the data
mean_price = np.mean(prices)

# Median price of the data
median_price = np.median(prices)

# Standard deviation of prices of the data
std_price = np.std(prices)

# Show the calculated statistics
print("Statistics for Boston housing dataset:\n")
print("Minimum price: ${}".format(minimum_price)) 
print("Maximum price: ${}".format(maximum_price))
print("Mean price: ${}".format(mean_price))
print("Median price ${}".format(median_price))
print("Standard deviation of prices: ${}".format(std_price))

Statistics for Boston housing dataset:

Minimum price: $105000.0
Maximum price: $1024800.0
Mean price: $454342.9447852761
Median price $438900.0
Standard deviation of prices: $165171.13154429474


## Defining a Performance Metric
To quantify the performance of the model during training and testing, I have calculated the [*coefficient of determination*](http://stattrek.com/statistics/dictionary.aspx?definition=coefficient_of_determination), R<sup>2</sup>.

R<sup>2</sup> measures the proportion of the variance in the **target variable** that the model explains, and its values typically range from 0 to 1. A model with an R<sup>2</sup> of 0 is no better than a model that always predicts the *mean* of the target variable, whereas a model with an R<sup>2</sup> of 1 perfectly predicts the target variable. Any value between 0 and 1 indicates what proportion of the variation in the target variable can be explained by the **features** using this model. _A model can be given a negative R<sup>2</sup> as well, which indicates that the model is **arbitrarily worse** than one that always predicts the mean of the target variable._
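
The behaviour described above follows from the usual definition R<sup>2</sup> = 1 - SS<sub>res</sub> / SS<sub>tot</sub>. The sketch below computes it by hand with NumPy on made-up numbers. The function name `r2_by_hand` and the sample values are assumptions for illustration only; the notebook itself relies on `r2_score` from scikit-learn.

```python
import numpy as np


def r2_by_hand(y_true, y_predict):
    """Coefficient of determination computed from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_predict = np.asarray(y_predict, dtype=float)
    # Residual sum of squares: error left over after the model's predictions
    ss_res = np.sum((y_true - y_predict) ** 2)
    # Total sum of squares: error of always predicting the mean
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up target values and three sets of predictions
y_true = [3.0, -0.5, 2.0, 7.0, 4.2]
good = r2_by_hand(y_true, [2.5, 0.0, 2.1, 7.8, 5.3])     # close predictions
baseline = r2_by_hand(y_true, [np.mean(y_true)] * 5)      # always the mean
bad = r2_by_hand(y_true, [7.0, 4.2, -0.5, 3.0, 2.0])      # shuffled targets

print("Good model:     {:.3f}".format(good))      # about 0.923
print("Mean baseline:  {:.3f}".format(baseline))  # 0.000
print("Poor model:     {:.3f}".format(bad))       # negative
```

The mean baseline scores exactly 0, and predictions that are worse than the mean push the score below 0.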


In [3]:
from sklearn.metrics import r2_score

########################################################
# Performance metric function: R2 Score Implementation #
########################################################

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    
    #<--Calculate the performance score between 'y_true' and 'y_predict'-->#
    score = r2_score(y_true, y_predict)
    
    return score

## Training the Model

In this section, I have trained a decision tree regressor and, to obtain an optimized model, tuned it using grid search with shuffle-split cross-validation.

### Implementation: Shuffle and Split Data
Next, I have split the Boston housing dataset into training (80%) and testing (20%) subsets. The data is shuffled into a random order when the subsets are created, to remove any bias from the ordering of the dataset.

In [4]:
# 'sklearn.cross_validation' was removed in scikit-learn 0.20; use 'model_selection'
from sklearn.model_selection import train_test_split

#<-Shuffle and split the data into training and testing subsets->#
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

print("Training and testing split was successful.")

Training and testing split was successful.


### Implementation: Fitting the Model
I have implemented a **decision tree algorithm**. To ensure an optimized model, I have trained the model using the grid search technique to optimize the `'max_depth'` parameter.

In [7]:
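from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
# 'sklearn.grid_search' and 'sklearn.cross_validation' were removed in
# scikit-learn 0.20; both classes now live in 'sklearn.model_selection'
from sklearn.model_selection import GridSearchCV, ShuffleSplit

##################################
# Function for fitting the model #
##################################


def fit_model(X, y):
    """ Performs grid search over the 'max_depth' parameter for a 
        decision tree regressor trained on the input data [X, y]. """

    ##<-Object definitions->##
    #<-Create cross-validation sets from the training data->#
    cv_sets = ShuffleSplit(n_splits = 10, test_size = 0.20, random_state = 0)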

    regressor = DecisionTreeRegressor()
    
    #<-Create a dictionary for the parameter 'max_depth' with a range from 1 to 10->#
    params = {'max_depth': list(range(1,11))}

    #<-Transform 'performance_metric' into a scoring function using 'make_scorer'->#
    scoring_fnc = make_scorer(performance_metric)

    #<-Defining the grid search cv->#
    grid = GridSearchCV(regressor, params, scoring = scoring_fnc, cv = cv_sets)
    
    ##<-Fitting the model->##
    #<-Fit the grid search object to the data to compute the optimal model->#
    grid = grid.fit(X, y)

    #<-Return the optimal model after fitting the data->#
    return grid.best_estimator_
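
To make explicit what the grid search in `fit_model` does, the loop below spells out the same idea by hand: for each candidate `max_depth`, a tree is trained and scored with R<sup>2</sup> on ten shuffled splits, and the depth with the best mean validation score is chosen. This is a sketch on synthetic data, not the housing data; the data-generating formula and the fixed `random_state` of the tree are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Ten random 80/20 splits, as in fit_model
cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

mean_scores = {}
for depth in range(1, 11):
    fold_scores = []
    for train_idx, valid_idx in cv_sets.split(X):
        # Train on the 80% part and score on the held-out 20% part
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree.fit(X[train_idx], y[train_idx])
        fold_scores.append(r2_score(y[valid_idx], tree.predict(X[valid_idx])))
    mean_scores[depth] = np.mean(fold_scores)

# The candidate with the highest mean validation score wins
best_depth = max(mean_scores, key=mean_scores.get)
print("Best max_depth on the synthetic data: {}".format(best_depth))
```

`GridSearchCV` performs this search and, with its default settings, additionally refits the winning configuration on all the data passed to `fit`, which is what `best_estimator_` returns.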



## Making Predictions
Once a model has been trained on a given set of data, it can now be used to make predictions on new sets of input data.

In [8]:
#<-Fit the training data to the model using grid search->#
reg = fit_model(X_train, y_train)

#<-Produce the optimal value for 'max_depth'->#
print("Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']))

Parameter 'max_depth' is 4 for the optimal model.


### Examples - Predicting Selling Prices
Imagine that you were a real estate agent in the Boston area looking to use this model to help price homes owned by your clients that they wish to sell. You have collected the following information from three of your clients:

| Feature | Client 1 | Client 2 | Client 3 |
| :---: | :---: | :---: | :---: |
| Total number of rooms in home | 5 rooms | 4 rooms | 8 rooms |
| Neighborhood poverty level (as %) | 17% | 32% | 3% |
| Student-teacher ratio of nearby schools | 15-to-1 | 22-to-1 | 12-to-1 |

* What price would you recommend each client sell his/her home at? 
* Do these prices seem reasonable given the values for the respective features? 

In [9]:
#<-Produce a matrix for client data->#
client_data = [[5, 17, 15], # Client 1
               [4, 32, 22], # Client 2
               [8, 3, 12]]  # Client 3

# Predictions
for i, price in enumerate(reg.predict(client_data)):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price))

Predicted selling price for Client 1's home: $403,025.00
Predicted selling price for Client 2's home: $237,478.72
Predicted selling price for Client 3's home: $931,636.36


### Sensitivity
An optimal model is not necessarily a robust model. Sometimes, a model is either too complex or too simple to sufficiently generalize to new data. Sometimes, a model could use a learning algorithm that is not appropriate for the structure of the data given. Other times, the data itself could be too noisy or contain too few samples to allow a model to adequately capture the target variable; that is, the model is underfitted. To check robustness, the cell below refits the model over ten trials and reports one predicted price per trial.
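
A sensitivity check of this kind can be sketched as follows: the model is refitted on several different training and testing splits, and the spread of the predictions for a single sample is measured. The function `predict_trials`, the fixed-depth `simple_fitter`, and the synthetic data are hypothetical stand-ins, not the helper used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def predict_trials(X, y, fitter, sample, n_trials=10):
    """Refit on n_trials different splits and predict one sample each time."""
    predictions = []
    for k in range(n_trials):
        # A different random_state gives a different 80/20 split per trial
        X_train, _, y_train, _ = train_test_split(
            X, y, test_size=0.2, random_state=k)
        model = fitter(X_train, y_train)
        predictions.append(float(model.predict([sample])[0]))
    return predictions


def simple_fitter(X, y):
    """Stand-in for fit_model: a tree with a fixed depth (an assumption)."""
    return DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)


# Synthetic data (made up for illustration)
rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(300, 3))
y = 50.0 * X[:, 0] - 10.0 * X[:, 1] + rng.normal(scale=20.0, size=300)

trial_prices = predict_trials(X, y, simple_fitter, sample=[5.0, 5.0, 5.0])
price_range = max(trial_prices) - min(trial_prices)
print("Range in predictions: {:.2f}".format(price_range))
```

A small range relative to the scale of the target suggests that the prediction does not depend much on which rows end up in the training set.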

In [10]:
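#<-'vs' is the project's 'visuals.py' helper module; this import assumes it sits next to the notebook->#
import visuals as vs

vs.PredictTrials(features, prices, fit_model, client_data)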

Trial 1: $391,183.33
Trial 2: $424,935.00
Trial 3: $415,800.00
Trial 4: $420,622.22
Trial 5: $418,377.27
Trial 6: $411,931.58
Trial 7: $399,663.16
Trial 8: $407,232.00
Trial 9: $351,577.61
Trial 10: $413,700.00

Range in prices: $73,357.39
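
The reported range can be checked directly from the ten trial predictions above, and comparing it with the minimum price from the statistics section gives a feel for how large it is. The snippet is plain arithmetic on the printed values; the variable names are chosen for illustration.

```python
# Trial predictions copied from the output above
trial_prices = [391183.33, 424935.00, 415800.00, 420622.22, 418377.27,
                411931.58, 399663.16, 407232.00, 351577.61, 413700.00]

# Minimum price reported in the statistics section
minimum_price = 105000.0

# Spread between the highest and lowest trial prediction
price_range = max(trial_prices) - min(trial_prices)
# Average prediction across the ten trials
mean_trial = sum(trial_prices) / len(trial_prices)

print("Range in prices: ${:,.2f}".format(price_range))
print("Range relative to the minimum price: {:.1%}".format(price_range / minimum_price))
print("Range relative to the mean trial price: {:.1%}".format(price_range / mean_trial))
```

The range comes out at about 70% of the cheapest home in the dataset and about 18% of the average trial prediction.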


### Applicability

The constructed model should not be used in a real-world setting, for several reasons, although the same approach could work well for house price prediction given a large, high-quality dataset. The main reasons are listed below.

- Outdated data set: The data was collected in 1978, and the world has changed a lot since then. Inflation has raised the price level of goods and services in the American economy (https://www.bls.gov/data/), and cultural and other worldwide changes have also affected the housing market.

- Lack of features: The features used are not sufficient to describe a home. Other features can strongly affect prices, such as the plot area in square feet (house size) or the proximity of important city services (location within the city).

- Robustness: In the sensitivity analysis, the range in prices is 73,357.39 USD. This is too high: it amounts to about 70% of the minimum price in the dataset (105,000 USD), so two predictions for the same house can differ by a very large margin.

- Generalizability to other areas: The model is not applicable to other areas, such as a rural town. The people and the characteristics of the place are very different from those of Boston, so housing prices should follow a different trend.

I think it is fair to judge the price of an individual home based on the characteristics of the neighborhood, but I would also consider the characteristics of the city, area, and country and, of course, of the house itself. It is important to consider all of these characteristics in order to choose the key features and tune the hyperparameters correctly.