# Home App 
## Part V - Organizing the Model for the App

The data for this project comes from the Kaggle dataset [House Sales in King County, USA](https://www.kaggle.com/architdxb/king-countyusa/data).

The variables that we will use are:

* **id** : house identification number
* **price** : target variable for prediction
* **bedrooms** : number of bedrooms in the home
* **bathrooms** : number of bathrooms in the home
* **sqft_living** : square footage of the home
* **sqft_lot** : square footage of the lot
* **yr_built** : year the home was built
* **yr_renovated** : year the home was renovated
* **zipcode** : zip code of the home

Part V of the project organizes the data loading, train/test split, and modeling steps into a simple format, without any data exploration.

In [1]:
# load necessary packages 
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import pandas as pd
import numpy as np
import glob
import itertools
import seaborn as sns
from scipy import stats
from time import time

# pretty display for notebooks
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

from IPython.display import display, HTML  # display() renders DataFrames in notebooks

# hide warnings
import warnings
warnings.filterwarnings('ignore')

# show full column contents (None replaces the deprecated -1)
pd.set_option('display.max_colwidth', None)

data_path = "C:/Users/Yiruru/Documents/launchCode/data/processed_data.csv"

# Load the processed housing data
try:
    df = pd.read_csv(data_path)
    print("Main dataset has {} samples with {} features each.".format(*df.shape))
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")

Main dataset has 21969 samples with 8 features each.


### Train/Test data Split

In [3]:
new_data = df

# Initialize a scaler, then apply it to the numerical features
scaler = MinMaxScaler()
numerical = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'year', 'zipcode']
price_raw = new_data['price']
features_raw = new_data.drop('price', axis=1)
features_raw[numerical] = scaler.fit_transform(new_data[numerical])

# One-hot encode any non-numeric columns
# Note: the split below is taken from the unscaled new_data, not from `features`
features = pd.get_dummies(features_raw)
price = new_data['price']
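
As a rough illustration of what `MinMaxScaler` does to each numerical column, the sketch below rescales a list of values to the [0, 1] range using (x - min) / (max - min). The helper name `min_max_scale` and the sample values are made up for illustration; this is not scikit-learn's implementation.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical bedroom counts, not taken from the dataset
bedrooms = [2, 3, 4, 6]
scaled = min_max_scale(bedrooms)
print(scaled)
```

Because the minimum and maximum are learned from whatever data the scaler is fitted on, fitting on the full dataset (as above) lets the held-out rows influence the scaling.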

In [4]:
# Train and Test
# All columns except the last are features; the last column (price) is the target
X = new_data.iloc[:, :-1]
Y = new_data.iloc[:, -1]

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the features and price data into training (70%) and held-out (30%) sets,
# then split the held-out data evenly into testing and validation sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5, random_state=42)

# Show the results of the split
print("Training set has {} samples.".format(x_train.shape[0]))
print("Validation set has {} samples.".format(x_val.shape[0]))
print("Testing set has {} samples.".format(x_test.shape[0]))

# confirm % of test set
print("Testing set is {}%.".format(round(100.0 * x_test.shape[0] / X.shape[0], 2)))

Training set has 15378 samples.
Validation set has 3296 samples.
Testing set has 3295 samples.
Testing set is 15.0%.
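
The sizes printed above follow from the two `test_size` fractions. As a sketch, assuming `train_test_split` rounds a fractional held-out count up to the next whole sample, the arithmetic can be reproduced with the standard library alone. The helper `split_sizes` is a hypothetical name used only here.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (n_kept, n_held_out), rounding the held-out count up (assumed behavior)."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 21969  # number of rows reported when the dataset was loaded

# First split: 70% training, 30% held out
n_train, n_held_out = split_sizes(n_total, 0.3)

# Second split: the held-out rows are divided between testing and validation
n_test, n_val = split_sizes(n_held_out, 0.5)

print(n_train, n_val, n_test)
print(round(100.0 * n_test / n_total, 2))
```

The held-out 30% is an odd number of rows, so the validation set ends up one sample larger than the testing set.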


In [5]:
# Get data characteristics for the training, validation, and test sets
n_train = len(x_train)
n_val = len(x_val)
n_test = len(x_test)

# Mean sale price in each split
y_train_mean = y_train.mean()
y_val_mean = y_val.mean()
y_test_mean = y_test.mean()

basic_df = pd.DataFrame([['{:.0f}'.format(n_train), '{:.0f}'.format(n_val), '{:.0f}'.format(n_test)],
                         ['{:.4f}'.format(y_train_mean), '{:.4f}'.format(y_val_mean), '{:.4f}'.format(y_test_mean)]],
                        index=['Sample Size', 'Mean Price'],
                        columns=['Training Set', 'Validation Set', 'Test Set'])
print('Table 3: Characteristics of Training, Validation, and Test Set')
basic_df

Table 3: Characteristics of Training, Validation, and Test Set

| | Training Set | Validation Set | Test Set |
|---|---|---|---|
| Sample Size | 15378 | 3296 | 3295 |
| Mean Price | 540842.6695 | 529067.96 | 534933.2155 |


### Modeling 

In [8]:
# Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost import XGBRegressor
from sklearn import metrics   # Additional scikit-learn functions
from sklearn.model_selection import GridSearchCV   # Performing grid search
from sklearn.metrics import cohen_kappa_score, make_scorer, precision_score, recall_score, mean_squared_error
from sklearn.svm import LinearSVC

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams

# Negated MSE, so that a higher score is better during model selection
scoring_function = make_scorer(mean_squared_error, greater_is_better=False)
target = 'price'
# Every column except the first (id) and the last (price)
predictors = new_data.columns[1:-1]

In [11]:
def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    """Fit alg on dtrain, optionally choosing n_estimators by cross-validation."""
    if useTrainCV:
        # Cross-validate with early stopping to pick the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='rmse', early_stopping_rounds=early_stopping_rounds, verbose_eval=True)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    # (passing eval_metric to fit() is deprecated in newer XGBoost releases)
    alg.fit(dtrain[predictors], dtrain[target])

    # Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])

    # Print model report:
    print("\nModel Report")
    print("Mean squared error : %.4g" % metrics.mean_squared_error(dtrain[target].values, dtrain_predictions))
    print("r2 Score : %.4g" % metrics.r2_score(dtrain[target].values, dtrain_predictions))
    return dtrain_predictions
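
The model report prints the mean squared error and the R² score. As a minimal sketch of what those two numbers measure, the pure-Python functions below compute them from their textbook definitions. The names `mse_sketch` and `r2_sketch` and the toy prices are illustrative assumptions; the notebook itself relies on `sklearn.metrics`.

```python
def mse_sketch(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_sketch(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy prices (hypothetical, not from the dataset)
actual = [300000.0, 450000.0, 600000.0]
predicted = [310000.0, 440000.0, 620000.0]

mse = mse_sketch(actual, predicted)
r2 = r2_sketch(actual, predicted)
print(mse, r2)
```

Since prices are in dollars, the mean squared error is in squared dollars and can look very large; R² is unitless, which makes it easier to compare across splits.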

In [13]:
# combine x_train and y_train for model tuning
df_train = pd.concat([x_train.reset_index(drop=True), y_train.reset_index(drop=True)], axis=1)

# gather test set
df_test = pd.concat([x_test.reset_index(drop=True), y_test.reset_index(drop=True)], axis=1)

# final model
xgb2 = XGBRegressor(
    learning_rate=0.1,
    n_estimators=406,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.7,
    colsample_bytree=0.9,
    objective='reg:squarederror',  # current name for the deprecated 'reg:linear'
    reg_alpha=100,
    nthread=4,
    scale_pos_weight=1,
    seed=100)

# data set used for training
print(df_train)
xgb2.fit(df_train[predictors], df_train['price'])

# predict on the test set only after the model has been fitted
dtest_predictions = xgb2.predict(df_test[predictors])
df_test['pred'] = dtest_predictions
print("r2 Score : %.4g" % metrics.r2_score(df_test['price'].values, dtest_predictions))

               id  bedrooms  bathrooms  sqft_living  sqft_lot  zipcode  year  \
0      3990200065  4         2.50       2050         9143      98166    1992   
1      1525069021  3         2.50       2580         214315    98053    1986   
2      9553200125  3         1.50       2440         5750      98115    1939   
3      3080000030  3         2.50       2230         4000      98144    1954   
4      3782760040  3         3.25       2780         4002      98019    2009   
5      5152960350  4         2.75       2390         9650      98003    1976   
6      3992700585  3         1.75       1880         9360      98125    1941   
7      6071700020  3         2.25       1640         8400      98006    1962   
8      3902310210  4         2.50       2100         8800      98033    1980   
9      4435000705  3         1.00       1350         8700      98188    1942   
10     925049278   4         2.00       1490         4054      98115    1926   
11     9834200165  4         1.50       