# Regression with XGBRegressor

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Huang,Jing-Hau is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

## Introduction to House Sales in King County, USA dataset

This dataset contains the sale prices of houses in King County, Washington, for sales between May 2014 and May 2015.<br><br>
The data is from Kaggle (https://www.kaggle.com/harlfoxem/housesalesprediction).

Along with the house price, it contains 18 house features, the date of sale, and the ID of the sale.<br><br>
The variables in the dataset are described below.

Id: Unique ID for each home sold<br>
Date: Date of the home sale<br>
Price: Price of each home sold<br>
Bedrooms: Number of bedrooms<br>
Bathrooms: Number of bathrooms, where .5 accounts for a room with a toilet but no shower<br>
Sqft_living: Square footage of the apartment's interior living space<br>
Sqft_lot: Square footage of the land space<br>
Floors: Number of floors<br>
Waterfront: A dummy variable for whether the apartment was overlooking the waterfront or not<br>
View: An index from 0 to 4 of how good the view of the property was<br>
Condition: An index from 1 to 5 on the condition of the apartment<br>
Grade: An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design and 11-13 have a high quality level of construction and design<br>
Sqft_above: The square footage of the interior housing space that is above ground level<br>
Sqft_basement: The square footage of the interior housing space that is below ground level<br>
Yr_built: The year the house was initially built<br>
Yr_renovated: The year of the house's last renovation<br>
Zipcode: What zipcode area the house is in<br>
Lat: Latitude<br>
Long: Longitude<br>
Sqft_living15: The square footage of interior housing living space for the nearest 15 neighbors<br>
Sqft_lot15: The square footage of the land lots of the nearest 15 neighbors

## XGBRegressor model

In [1]:
import xgboost as xgb
import pandas as pd

In [2]:
data = pd.read_csv('kc_house_data.csv')

In [3]:
data.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [4]:
print(data.shape[0], 'samples of data')
# shape[1] counts all 21 columns: id, date, the price target, and the 18 features
print(data.shape[1], 'features of data')

21613 samples of data
21 features of data


In [5]:
data.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [6]:
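# Replace the construction year with the age of the house at the time of sale
data['yr_built'] = data['date'].str[:4].astype(int) - data['yr_built']
# Replace the renovation year with the years since renovation; rows with 0 are left unchanged
mask = data['yr_renovated'] > 0
data.loc[mask, 'yr_renovated'] = data.loc[mask, 'date'].str[:4].astype(int) - data.loc[mask, 'yr_renovated']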
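
The transformation in the cell above can be checked by hand on single rows. The following is a minimal pure-Python sketch of the same logic, not the notebook's pandas code; the helper name `to_ages` is hypothetical, and it assumes (as the mask does) that a `yr_renovated` of 0 should be left as 0.

```python
def to_ages(date_str, yr_built, yr_renovated):
    """Return (age at sale, years since renovation) for one row.

    date_str looks like '20141013T000000'; its first four characters are the sale year.
    """
    sale_year = int(date_str[:4])
    age = sale_year - yr_built
    # Assumption: rows with yr_renovated == 0 are left as 0, mirroring the mask
    since_renovation = sale_year - yr_renovated if yr_renovated > 0 else 0
    return age, since_renovation


# The first two rows shown by data.head()
print(to_ages("20141013T000000", 1955, 0))     # (59, 0)
print(to_ages("20141209T000000", 1951, 1991))  # (63, 23)
```

One consequence of this encoding is that a house renovated in its sale year also ends up with 0, the same value as a row whose `yr_renovated` was 0 to begin with.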

In [7]:
data.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 59 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 63 | 23 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 82 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 49 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 28 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [8]:
# Features: every column after price (this drops id and date)
X = data.iloc[:, 3:]
# Target: the price column
y = data.iloc[:, 2]
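
Positional slicing depends on the column order, so it is worth confirming which names end up in `X` and `y`. The sketch below applies the same positions to the column list printed by `data.columns`; the variable names are illustrative only.

```python
# Column order as printed by data.columns
columns = ['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
           'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
           'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
           'lat', 'long', 'sqft_living15', 'sqft_lot15']

target_name = columns[2]     # same position as data.iloc[:, 2]
feature_names = columns[3:]  # same positions as data.iloc[:, 3:]

print(target_name)           # price
print(len(feature_names))    # 18
```

The count of 18 matches the number of house features described in the introduction.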

In [9]:
# build model
model1 = xgb.XGBRegressor()
model1.fit(X, y)

XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
       importance_type='gain', interaction_constraints=None,
       learning_rate=0.300000012, max_delta_step=0, max_depth=6,
       min_child_weight=1, missing=nan, monotone_constraints=None,
       n_estimators=100, n_jobs=0, num_parallel_tree=1,
       objective='reg:squarederror', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
       validate_parameters=False, verbosity=None)

In [10]:
# predict
y_model1 = model1.predict(X)

In [11]:
from sklearn.metrics import mean_squared_error

# MSE measured on the same data the model was trained on
mean_squared_error(y, y_model1)

3808752752.5778284
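
A mean squared error is expressed in squared price units, which makes the raw number hard to read. As a rough aid, the sketch below defines MSE in plain Python and takes the square root of the value reported above to get a root mean squared error in the same units as `price`. The helper `mse` and the toy values are illustrative and not part of the notebook.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy check of the definition: squared errors are 0, 0 and 4
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))

# RMSE for model1, from the training-set MSE printed above
mse_model1 = 3808752752.5778284
rmse_model1 = math.sqrt(mse_model1)
print(round(rmse_model1))  # about 61715, in price units
```

Because this error is measured on the training data, it is likely to understate the error on unseen houses.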

#### Hyperparameters of XGBoost
colsample_bytree: The subsample ratio of columns used when constructing each tree. Subsampling occurs once for every tree constructed.<br><br>
learning_rate: Step-size shrinkage applied to each new tree's contribution, used to prevent overfitting. The range is [0, 1].<br><br>
max_depth: The maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.<br><br>
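
To get a feel for `colsample_bytree`, it helps to translate the ratio into a number of columns. The sketch below does this for the 18 feature columns in `X`. The rounding rule (truncate, but keep at least one column) is an assumption about XGBoost's internals rather than something taken from this notebook, and `columns_per_tree` is a hypothetical helper.

```python
def columns_per_tree(n_features, colsample_bytree):
    """Approximate number of columns each tree may use.

    Assumption: the product is truncated and at least one column is kept.
    """
    return max(1, int(colsample_bytree * n_features))


n_features = 18  # number of columns in X
for ratio in (0.1, 0.3, 0.5, 1.0):
    print(ratio, columns_per_tree(n_features, ratio))
```

Under this assumption a ratio of 0.1 leaves each tree a single column to split on, so very small ratios can limit what any one tree is able to learn.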

In [12]:
model2 = xgb.XGBRegressor(colsample_bytree=0.1, learning_rate=0.3, max_depth=10)
model2.fit(X, y)
y_model2 = model2.predict(X)
# Training-set MSE, for comparison with model1
mean_squared_error(y, y_model2)

14061365297.696045

## Cross-validation and hyperparameter tuning

In [13]:
from sklearn.model_selection import cross_val_score

In [14]:
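# With no scoring argument, cross_val_score uses the estimator's own score method,
# which is R^2 for a regressor; one value is printed per fold
print(cross_val_score(model1, X, y, cv=3))
print(cross_val_score(model2, X, y, cv=3))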

[0.87884148 0.87869525 0.87208372]
[0.76148794 0.76169361 0.75721928]
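
The values printed above are R² scores, one per fold, so higher is better and 1.0 is a perfect fit. The sketch below spells out the R² formula in plain Python and shows how 21613 rows would be divided into three folds. It assumes consecutive, unshuffled folds with the remainder going to the first folds, which is how scikit-learn's `KFold` sizes its splits. The helpers `r_squared` and `fold_sizes` are illustrative names.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fold_sizes(n_samples, n_splits):
    """Sizes of consecutive folds; the first n_samples % n_splits folds get one extra row."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # always predicting the mean
print(fold_sizes(21613, 3))
```

Each score is computed on a fold the model did not see during fitting, so these numbers are a fairer basis for comparing `model1` and `model2` than the training-set MSE.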


Apply grid search to find good hyperparameters.

In [15]:
from sklearn.model_selection import GridSearchCV
model = xgb.XGBRegressor()
colsample_bytree = [0.1, 0.3, 0.5]
learning_rate = [0.1, 0.3, 0.5]
max_depth = [i*10 for i in range(1, 4)]

In [16]:
print(colsample_bytree)
print(learning_rate)
print(max_depth)

[0.1, 0.3, 0.5]
[0.1, 0.3, 0.5]
[10, 20, 30]
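
A grid search tries every combination of the candidate values, so the cost grows multiplicatively with each list. The sketch below enumerates the combinations for the three lists above with `itertools.product` and counts the model fits that a 3-fold search would need. It only illustrates the bookkeeping and does not train anything.

```python
from itertools import product

param_grid = {
    "colsample_bytree": [0.1, 0.3, 0.5],
    "learning_rate": [0.1, 0.3, 0.5],
    "max_depth": [i * 10 for i in range(1, 4)],
}

# One dict per combination of candidate values
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3
print(len(combinations))            # 27 combinations
print(len(combinations) * n_folds)  # 81 fits during the search
print(combinations[0])
```

With three values per parameter the search is still manageable, but adding a fourth parameter with three values would triple the number of fits.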


In [23]:
model.fit(X, y)
model.get_params()

{'base_score': 0.5,
 'booster': None,
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'gpu_id': -1,
 'importance_type': 'gain',
 'interaction_constraints': None,
 'learning_rate': 0.300000012,
 'max_delta_step': 0,
 'max_depth': 6,
 'min_child_weight': 1,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 100,
 'n_jobs': 0,
 'num_parallel_tree': 1,
 'objective': 'reg:squarederror',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'subsample': 1,
 'tree_method': None,
 'validate_parameters': False,
 'verbosity': None}

In [17]:
param_grid = dict(colsample_bytree=colsample_bytree,
                  learning_rate=learning_rate,
                  max_depth=max_depth)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)

In [18]:
grid.fit(X, y)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
       colsample_bynode=None, colsample_bytree=None, gamma=None,
       gpu_id=None, importance_type='gain', interaction_constraints=None,
       learning_rate=None, max_delta_step=None, max_depth=None,
       min_child_we...pos_weight=None, subsample=None,
       tree_method=None, validate_parameters=False, verbosity=None),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'colsample_bytree': [0.1, 0.3, 0.5], 'learning_rate': [0.1, 0.3, 0.5], 'max_depth': [10, 20, 30]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [19]:
print(grid.best_score_)
print(grid.best_params_)

0.8827326307314575
{'colsample_bytree': 0.5, 'learning_rate': 0.1, 'max_depth': 10}


In [20]:
# Refit with the best parameters found by the grid search
model3 = xgb.XGBRegressor(colsample_bytree=0.5, learning_rate=0.1, max_depth=10)
model3.fit(X, y)
y_model3 = model3.predict(X)
# Training-set MSE, as for model1 and model2
mean_squared_error(y, y_model3)

1759004114.5372527