In [1]:
import pandas as pd
import numpy as np

import os # to open system file

## Training Data

In [5]:
# Read the dataset into a data table using Pandas
df = pd.read_csv("ml_house_data_set.csv")

# Create a web page view of the data for easy viewing
html = df[0:100].to_html()

# Save the html to a temporary file then we can open it using the web browser
with open("data.html", "w") as f:
    f.write(html)

# Build the absolute path of the page so it can be opened in a web browser
full_filename = os.path.abspath("data.html")

df.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,has_fireplace,has_pool,has_central_heating,has_central_cooling,house_number,street_name,unit_number,city,zip_code,sale_price
0,1978,1,4,1,1,1689,1859,attached,508,0,True,False,True,True,42670,Lopez Crossing,,Hallfort,10907,270897.0
1,1958,1,3,1,1,1984,2002,attached,462,0,True,False,True,True,5194,Gardner Park,,Hallfort,10907,302404.0
2,2002,1,3,2,0,1581,1578,none,0,625,False,False,True,True,4366,Harding Islands,,Lake Christinaport,11203,2519996.0
3,2004,1,4,2,0,1829,2277,attached,479,0,True,False,True,True,3302,Michelle Highway,,Lake Christinaport,11203,197193.0
4,2006,1,4,2,0,1580,1749,attached,430,0,True,False,True,True,582,Jacob Cape,,Lake Christinaport,11203,207897.0


## Feature Engineering
We can see that the feature *garage_type* takes three possible values:
* none: no garage
* attached: the garage is attached to the house
* detached: the garage is in a separate building

We need to preprocess this feature by applying one-hot encoding.
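
As a minimal sketch of what one-hot encoding does, the snippet below encodes a hypothetical three-row sample. Only the column name `garage_type` and its three values come from the data set; the sample frame itself is made up for illustration.

```python
import pandas as pd

# Hypothetical three-row sample; only the column name and values mirror the data set
sample = pd.DataFrame({"garage_type": ["attached", "none", "detached"]})

# get_dummies creates one indicator column per category
encoded = pd.get_dummies(sample, columns=["garage_type"])
print(encoded.columns.tolist())
```

Each row ends up with exactly one indicator set, which is why a single text column turns into three numeric features.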

We have the house address: the house number, the street name, the unit number, the city, and the zip code. The house number and the unit number aren't useful, so we drop them from our model.

If we know the zip code of a house, then we also know its city, so we only need to include one of them. The street name may not influence the price of the house, but it would make our model much more complex: with one-hot encoding we would end up with a new feature for every single street.

The house sale price is the y value that we are trying to predict with our model.

In [6]:
# Remove the fields from the data set that we don't want to include in our model
del df['house_number']
del df['unit_number']
del df['street_name']
del df['zip_code']

# Replace categorical data with one-hot encoded data
features_df = pd.get_dummies(df, columns=['garage_type', 'city'])

# Remove the sale price from the feature data
del features_df['sale_price']

# Create the X and y arrays
# (as_matrix() was removed from pandas; to_numpy() is its replacement)
X = features_df.to_numpy()   # we want X to be a NumPy array and not a pandas DataFrame
y = df['sale_price'].to_numpy()

**Curse of Dimensionality:** As the number of dimensions (or features) in the data increases, the number of data points required to build a good model grows exponentially.
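
A rough way to see the exponential growth is to count how many points are needed to keep a fixed number of samples along every axis. The figure of 10 samples per axis is an arbitrary assumption chosen for the arithmetic, not a rule.

```python
def points_needed(dimensions, samples_per_axis=10):
    """Points required to cover a grid with samples_per_axis values on every axis."""
    return samples_per_axis ** dimensions

# The count multiplies by 10 every time one more feature is added
for d in (1, 2, 3, 6):
    print(d, points_needed(d))
```

This is one reason for dropping columns such as the street name before encoding.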

## Train the model:

We need to do two things:
* shuffle the data
* split the data

In [7]:
from sklearn.model_selection import train_test_split
# Shuffle the data (the default behaviour) and split it into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
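
To make the two steps concrete, here is a hand-written sketch of shuffling and splitting with NumPy on toy data. It illustrates the idea only and is not how `train_test_split` is implemented; the seed and all `*_toy` names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily

# Toy data: 10 rows with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Step 1: shuffle the row indices
indices = rng.permutation(len(X_toy))

# Step 2: cut the shuffled indices 70/30
cut = round(0.7 * len(X_toy))
train_idx, test_idx = indices[:cut], indices[cut:]

X_train_toy, X_test_toy = X_toy[train_idx], X_toy[test_idx]
y_train_toy, y_test_toy = y_toy[train_idx], y_toy[test_idx]
print(X_train_toy.shape, X_test_toy.shape)
```

Shuffling first matters because the rows of a data set are often grouped, for example by city, and an unshuffled cut would put whole groups in only one of the two sets.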

Create and train our model. We use an algorithm called gradient boosting and set the following hyper-parameters for it:
* **n_estimators**: tells the model how many decision trees to build. A higher number usually allows the model to be more accurate, but it increases the time required to train the model.
* **learning_rate**: controls how much each additional decision tree influences the overall prediction. Lower rates usually lead to higher accuracy, provided enough trees are built.
* **max_depth**: controls how many layers deep each individual decision tree can be.
* **min_samples_leaf**: controls how many training samples must end up in a leaf of a decision tree. For example, if we say 9, then at least 9 houses must exhibit the same characteristic before we consider it meaningful enough to build a decision around it. This helps limit the influence of outliers on our model.
* **max_features**: the fraction of features that we randomly choose to consider each time we create a branch in a decision tree.
* **loss**: controls how scikit-learn calculates the model's error rate during training.
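
The interaction between `n_estimators` and `learning_rate` can be observed with `staged_predict`, which yields the prediction after each additional tree. The sketch below uses synthetic data and arbitrary settings, so the numbers say nothing about the house model itself.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): y depends on the first feature only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_model = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    min_samples_leaf=9,
    max_features=1.0,
    random_state=0,
)
demo_model.fit(X_tr, y_tr)

# Test error after 1 tree, 2 trees, ..., 200 trees
test_errors = [mean_absolute_error(y_te, pred) for pred in demo_model.staged_predict(X_te)]
print(len(test_errors), test_errors[0], test_errors[-1])
```

With a small learning rate each tree only nudges the prediction, so the error after the first tree is large and shrinks as more trees are added.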

In [16]:
from sklearn.ensemble import GradientBoostingRegressor
import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases

# Fit regression model
model = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=6,
    min_samples_leaf=9,
    max_features=0.1,
    loss='huber',
    random_state=0
)
model.fit(X_train, y_train)

# Save the trained model to a file so we can use it in other programs
joblib.dump(model, 'trained_house_classifier_model.pkl')

['trained_house_classifier_model.pkl']

The next step is to measure how well the model performs. We will use a measurement called **Mean Absolute Error**, which looks at every prediction and gives us the average of how wrong it was.
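
As a quick sketch of what the metric computes, here is the same calculation done by hand on three made-up prices (the numbers are illustrative, not taken from the data set):

```python
import numpy as np

# Hypothetical actual and predicted sale prices
actual = np.array([200000.0, 310000.0, 150000.0])
predicted = np.array([210000.0, 300000.0, 180000.0])

# Average of the absolute differences: (10000 + 10000 + 30000) / 3
mae_by_hand = np.mean(np.abs(actual - predicted))
print("Mean Absolute Error: %.2f" % mae_by_hand)
```

Because the errors are averaged in the same unit as the target, the result reads directly as "off by this many dollars on average".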

In [18]:
from sklearn.metrics import mean_absolute_error 

# Find the error rate on the training set
mae = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.4f" % mae)

# Find the error rate on the test set
mae = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.4f" % mae)

Training Set Mean Absolute Error: 48331.8249
Test Set Mean Absolute Error: 58388.4658


The difference between these two numbers tells us a lot about how well the model is working.

## Improve Our System:

We can indirectly tell if our model is overfitting or underfitting based on the training set and test set error rates.

**Overfitting**: The model fits our training data almost perfectly, but does not generalize to the test data.
* Training set error very low
* Test set error very high

To solve this problem we should make our model less complex by applying the following solutions:
![](OverFitting.png)

**Underfitting**: The model couldn't capture the patterns in the data set very well.
* Training set error very high
* Test set error very high

To solve this problem we can make our model more complex by applying the following solutions:
![](UnderFitting.png)
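
Both patterns can be reproduced on synthetic data. The sketch below swaps in a single decision tree instead of gradient boosting to keep it small: a depth-1 tree is too simple for the data, while an unrestricted tree memorizes the training set. The data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine wave
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

results = {}
for name, depth in [("shallow", 1), ("deep", None)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Store (training error, test error) for each tree
    results[name] = (
        mean_absolute_error(y_tr, tree.predict(X_tr)),
        mean_absolute_error(y_te, tree.predict(X_te)),
    )
    print(name, "train=%.3f test=%.3f" % results[name])
```

The gap between the two numbers for the deep tree is the signature of overfitting; two similarly high numbers for the shallow tree point to underfitting.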

### Grid Search:

To fix these problems we should adjust the hyper-parameters. Machine learning models have lots of hyper-parameters to adjust, and the best way to tune them is through trial and error.

The first step is to declare our model without passing in any parameters.

In [8]:
from sklearn.ensemble import GradientBoostingRegressor

# Create the model
model = GradientBoostingRegressor()

Then we define the parameter grid. A good strategy is to try a few values for each parameter, where each value increases or decreases by a significant amount.

In [9]:
# Parameters we want to try
param_grid = {
    'n_estimators': [500, 1000, 3000],
    'max_depth': [4, 6],
    'min_samples_leaf': [3, 5, 9, 17],
    'learning_rate': [0.1, 0.05, 0.02, 0.01],
    'max_features': [1.0, 0.3, 0.1],
    'loss': ['squared_error', 'absolute_error', 'huber']  # named 'ls' and 'lad' in scikit-learn releases before 1.0
}
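
Before running a grid like this, it helps to count how many models it implies. The sketch below repeats the grid under a separate name (`grid_to_count`, a name introduced here) and assumes 5 cross-validation folds, which is the default in recent scikit-learn releases; older releases used a different default.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the grid above; the loss names follow recent scikit-learn releases
grid_to_count = {
    'n_estimators': [500, 1000, 3000],
    'max_depth': [4, 6],
    'min_samples_leaf': [3, 5, 9, 17],
    'learning_rate': [0.1, 0.05, 0.02, 0.01],
    'max_features': [1.0, 0.3, 0.1],
    'loss': ['squared_error', 'absolute_error', 'huber'],
}

n_combinations = len(ParameterGrid(grid_to_count))   # 3 * 2 * 4 * 4 * 3 * 3
cv_folds = 5                                         # assumed number of folds
print(n_combinations, n_combinations * cv_folds)     # 864 4320
```

With thousands of fits, some of them using 3000 trees, a full search can take a long time, so it may be worth starting with a smaller grid.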

Next, define the grid search using the GridSearchCV class. CV stands for cross-validation. It automatically slices the training data into smaller subsets, using part of the data to train the different models and a different part to test them.
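
The slicing can be sketched with `KFold` on toy data. Five folds and the toy array are assumptions for the example; the point is only that every row is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 rows with 2 features each
X_toy = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5)           # five folds, assumed for this sketch
folds = list(kfold.split(X_toy))    # list of (train indices, validation indices)

for fold_number, (train_idx, val_idx) in enumerate(folds):
    print(fold_number, "train:", train_idx, "validate:", val_idx)
```

Each candidate parameter combination is trained and scored once per fold, and the scores are averaged to rank the combinations.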

In [None]:
from sklearn.model_selection import GridSearchCV
# Define the grid search we want to run. Run it with four cpus in parallel.
gs_cv = GridSearchCV(model, param_grid, n_jobs=4)

# Run the grid search - on only the training data!
gs_cv.fit(X_train, y_train)

# Print the parameters that gave us the best result!
print(gs_cv.best_params_)

In [None]:
# Find the error rate on the training set using the best parameters
mae = mean_absolute_error(y_train, gs_cv.predict(X_train))
print("Training Set Mean Absolute Error: %.4f" % mae)

# Find the error rate on the test set using the best parameters
mae = mean_absolute_error(y_test, gs_cv.predict(X_test))
print("Test Set Mean Absolute Error: %.4f" % mae)

### Feature Selection:
Some features are probably very important in determining the value of a house, for example its size. Other features probably matter less. With a tree-based machine learning algorithm like gradient boosting, we can look at the trained model and have it tell us how often each feature is used in determining the final price.

In [3]:
import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases

# These are the feature labels from our data set
feature_labels = np.array(['year_built', 'stories', 'num_bedrooms', 'full_bathrooms', 
                           'half_bathrooms', 'livable_sqft', 'total_sqft', 'garage_sqft',
                           'carport_sqft', 'has_fireplace', 'has_pool', 'has_central_heating',
                           'has_central_cooling', 'garage_type_attached', 'garage_type_detached',
                           'garage_type_none', 'city_Amystad', 'city_Brownport', 'city_Chadstad',
                           'city_Clarkberg', 'city_Coletown', 'city_Davidfort', 'city_Davidtown',
                           'city_East Amychester', 'city_East Janiceville', 'city_East Justin',
                           'city_East Lucas', 'city_Fosterberg', 'city_Hallfort', 'city_Jeffreyhaven',
                           'city_Jenniferberg', 'city_Joshuafurt', 'city_Julieberg', 'city_Justinport',
                           'city_Lake Carolyn', 'city_Lake Christinaport', 'city_Lake Dariusborough',
                           'city_Lake Jack', 'city_Lake Jennifer', 'city_Leahview', 'city_Lewishaven',
                           'city_Martinezfort', 'city_Morrisport', 'city_New Michele', 'city_New Robinton',
                           'city_North Erinville', 'city_Port Adamtown', 'city_Port Andrealand', 
                           'city_Port Daniel', 'city_Port Jonathanborough', 'city_Richardport',
                           'city_Rickytown', 'city_Scottberg', 'city_South Anthony', 'city_South Stevenfurt',
                           'city_Toddshire', 'city_Wendybury', 'city_West Ann', 'city_West Brittanyview',
                           'city_West Gerald', 'city_West Gregoryview', 'city_West Lydia', 'city_West Terrence'])

# Load the trained model created with train_model.py
model = joblib.load('trained_house_classifier_model.pkl')

# Create a numpy array based on the model's feature importances
importance = model.feature_importances_

# Sort the feature labels based on the feature importance rankings from the model
feature_indexes_by_importance = importance.argsort()

# Print each feature label, from least important to most important (argsort is ascending)
for index in feature_indexes_by_importance:
    print("{} - {:.2f}%".format(feature_labels[index], (importance[index] * 100.0)))

city_Martinezfort - 0.00%
city_Julieberg - 0.00%
city_New Michele - 0.00%
city_New Robinton - 0.00%
city_Davidtown - 0.02%
city_Rickytown - 0.07%
city_West Terrence - 0.08%
city_West Brittanyview - 0.10%
city_Amystad - 0.11%
city_Leahview - 0.12%
city_Fosterberg - 0.12%
city_Lake Jennifer - 0.14%
city_Clarkberg - 0.15%
city_Port Daniel - 0.16%
city_Toddshire - 0.16%
city_South Stevenfurt - 0.17%
city_West Lydia - 0.18%
city_Brownport - 0.18%
city_Joshuafurt - 0.19%
city_Port Adamtown - 0.19%
city_East Justin - 0.20%
city_West Gerald - 0.21%
city_Davidfort - 0.22%
city_Lake Carolyn - 0.23%
city_Jenniferberg - 0.24%
city_East Lucas - 0.26%
city_Wendybury - 0.27%
city_Lake Christinaport - 0.30%
city_East Janiceville - 0.31%
city_Hallfort - 0.32%
city_Justinport - 0.33%
city_Morrisport - 0.33%
city_West Gregoryview - 0.34%
city_Port Jonathanborough - 0.36%
city_East Amychester - 0.36%
city_Scottberg - 0.37%
city_Lake Dariusborough - 0.40%
city_Richardport - 0.42%
city_South Anthony - 0.51%
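
An alternative to typing the labels by hand is to take them from the DataFrame columns used for training. The sketch below shows this on a small synthetic model; the toy columns, prices, and model settings are assumptions, and in the notebook the labels would come from `features_df.columns` instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic features: the price depends on livable_sqft only
rng = np.random.default_rng(0)
toy_features = pd.DataFrame({
    "livable_sqft": rng.uniform(800, 3000, size=200),
    "num_bedrooms": rng.integers(1, 6, size=200),
    "has_pool": rng.integers(0, 2, size=200),
})
toy_price = 100 * toy_features["livable_sqft"] + rng.normal(0, 5000, size=200)

toy_model = GradientBoostingRegressor(n_estimators=50, random_state=0)
toy_model.fit(toy_features.to_numpy(), toy_price.to_numpy())

# Pair each importance with its column name, most important first
ranked = pd.Series(toy_model.feature_importances_, index=toy_features.columns)
ranked = ranked.sort_values(ascending=False)
for name, value in ranked.items():
    print("{} - {:.2f}%".format(name, value * 100.0))
```

Keeping labels and importances in one Series avoids the two arrays drifting out of sync when columns are added or removed.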

## Using the Estimator in a Real-World Program:

In [4]:
import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases

# Load the model we trained previously
model = joblib.load('trained_house_classifier_model.pkl')

# For the house we want to value, we need to provide the features in the exact same
# arrangement as our training data set.
house_to_value = [
    # House features
    2006,   # year_built
    1,      # stories
    4,      # num_bedrooms
    3,      # full_bathrooms
    0,      # half_bathrooms 
    2200,   # livable_sqft
    2350,   # total_sqft
    0,      # garage_sqft
    0,      # carport_sqft
    True,   # has_fireplace
    False,  # has_pool
    True,   # has_central_heating
    True,   # has_central_cooling
    
    # Garage type: Choose only one
    0,      # attached
    0,      # detached
    1,      # none
    
    # City: Choose only one
    0,      # Amystad
    1,      # Brownport
    0,      # Chadstad
    0,      # Clarkberg
    0,      # Coletown
    0,      # Davidfort
    0,      # Davidtown
    0,      # East Amychester
    0,      # East Janiceville
    0,      # East Justin
    0,      # East Lucas
    0,      # Fosterberg
    0,      # Hallfort
    0,      # Jeffreyhaven
    0,      # Jenniferberg
    0,      # Joshuafurt
    0,      # Julieberg
    0,      # Justinport
    0,      # Lake Carolyn
    0,      # Lake Christinaport
    0,      # Lake Dariusborough
    0,      # Lake Jack
    0,      # Lake Jennifer
    0,      # Leahview
    0,      # Lewishaven
    0,      # Martinezfort
    0,      # Morrisport
    0,      # New Michele
    0,      # New Robinton
    0,      # North Erinville
    0,      # Port Adamtown
    0,      # Port Andrealand
    0,      # Port Daniel
    0,      # Port Jonathanborough
    0,      # Richardport
    0,      # Rickytown
    0,      # Scottberg
    0,      # South Anthony
    0,      # South Stevenfurt
    0,      # Toddshire
    0,      # Wendybury
    0,      # West Ann
    0,      # West Brittanyview
    0,      # West Gerald
    0,      # West Gregoryview
    0,      # West Lydia
    0       # West Terrence
]

# scikit-learn assumes you want to predict the values for lots of houses at once, so it expects an array.
# We just want to look at a single house, so it will be the only item in our array.
homes_to_value = [
    house_to_value
]

# Run the model and make a prediction for each house in the homes_to_value array
predicted_home_values = model.predict(homes_to_value)

# Since we are only predicting the price of one house, just look at the first prediction returned
predicted_value = predicted_home_values[0]

print("This house has an estimated value of ${:,.2f}".format(predicted_value))

This house has an estimated value of $552,706.28
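
Writing out dozens of zeros by hand is error-prone. One possible alternative, sketched here under assumptions, is to describe the house as a record, one-hot encode it, and align it to the training columns with `reindex`. The `training_columns` list below is a hypothetical shortened version of the real column list.

```python
import pandas as pd

# Hypothetical subset of the training columns, in training order
training_columns = [
    "year_built", "livable_sqft",
    "garage_type_attached", "garage_type_detached", "garage_type_none",
    "city_Amystad", "city_Brownport",
]

# Describe the house with readable values
new_house = pd.DataFrame([{
    "year_built": 2006,
    "livable_sqft": 2200,
    "garage_type": "none",
    "city": "Brownport",
}])

# Encode, then add the missing indicator columns as 0 in the training order
encoded = pd.get_dummies(new_house, columns=["garage_type", "city"])
row = encoded.reindex(columns=training_columns, fill_value=0).astype(int)
print(row.iloc[0].tolist())   # [2006, 2200, 0, 0, 1, 0, 1]
```

The resulting row has the same arrangement as the training data, which is what the model's `predict` method expects.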


To see how important the other features are, we can change one or two of them and predict the price of the house again. If the change in the price is significant, then we can say that the feature is important.
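
A minimal sketch of this what-if check, using a small synthetic model rather than the saved one: the price is generated from the square footage only, so changing the second feature should barely move the prediction. All names and numbers here are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: column 0 is livable sqft, column 1 is a fireplace flag with no effect on price
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.uniform(800, 3000, size=300),
    rng.integers(0, 2, size=300),
])
y_toy = 100 * X_toy[:, 0] + rng.normal(0, 5000, size=300)
toy_model = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Baseline house and two variations that each change a single feature
base = np.array([[1500.0, 0.0]])
bigger = np.array([[2500.0, 0.0]])
with_fireplace = np.array([[1500.0, 1.0]])

base_price = toy_model.predict(base)[0]
sqft_change = toy_model.predict(bigger)[0] - base_price
fireplace_change = toy_model.predict(with_fireplace)[0] - base_price
print("sqft change: {:,.2f}".format(sqft_change))
print("fireplace change: {:,.2f}".format(fireplace_change))
```

Changing one feature at a time keeps the comparison fair, since any difference in the prediction can be attributed to that feature alone.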