# House price prediction using regression

This example uses the [Kaggle house sale prices for King County](https://www.kaggle.com/harlfoxem/housesalesprediction/data) dataset.
It records actual house sales in that region over about a year, along
with features of each house such as the number of bedrooms, the
number of bathrooms and the size of the living area.

We'll use [Tablesaw](https://jtablesaw.github.io/tablesaw/) to store and manipulate our data
and the linear regression class from the [Smile](http://haifengl.github.io/)
machine learning library. So, we'll add those libraries to the classpath and define some imports to simplify access to the classes we need.

In [158]:
%%classpath add mvn
tech.tablesaw tablesaw-beakerx 0.36.0
com.github.haifengl smile-core 1.5.3

In [159]:
%import tech.tablesaw.api.*
%import smile.regression.OLS

We'll also enable a BeakerX display widget for Tablesaw tables.

In [160]:
tech.tablesaw.beakerx.TablesawDisplayer.register()
OutputCell.HIDDEN

### Exploring the data

We start by loading the data and printing its shape and structure.

In [161]:
records = Table.read().csv("../resources/kc_house_data.csv")
records.shape()

21613 rows X 21 cols

In [162]:
records.structure()

We might want to explore the _number of bedrooms_ feature. We can display a summary of that feature and examine some outliers.

In [163]:
records.column("bedrooms").summary().print()

         Column: bedrooms          
 Measure   |        Value         |
-----------------------------------
        n  |               21613  |
      sum  |               72854  |
     Mean  |   3.370841623097218  |
      Min  |                   0  |
      Max  |                  33  |
    Range  |                  33  |
 Variance  |  0.8650150097573497  |
 Std. Dev  |   0.930061831147451  |

In [164]:
records.where(records.column("bedrooms").isGreaterThan(10))

We can remove the 33-bedroom record as an outlier:

In [165]:
records = records.dropWhere(records.column("bedrooms").isGreaterThan(30))
records.shape()

21612 rows X 21 cols

We might want to explore the _number of bathrooms_ feature. We can find the maximum number of bathrooms and display a histogram.

In [166]:
maxBathrooms = records.column('bathrooms').toList().max()

8.0

In [167]:
plot = new Histogram(title: 'Bathroom histogram',
                     binCount: maxBathrooms,
                     xLabel: '#bathrooms',
                     yLabel: '#houses',
                     data: records.column('bathrooms').toList())

### Linear regression

We might posit that the more bedrooms a house has, the higher its price. If that relationship is roughly linear, linear regression gives us the line of best fit. Ordinary least squares (OLS) finds that line by minimising the sum of squared residuals, i.e. the squared differences between the actual and the predicted prices. Let's use the OLS implementation from the Smile library:

In [168]:
cols = ['bedrooms', 'price']
priceModel = new OLS(records.select(*cols).smile().numericDataset('price'))

Linear Model:

Residuals:
	       Min	        1Q	    Median	        3Q	       Max
	-993338.9867	-203008.4337	-65410.8645	105991.5663	6824398.8589

Coefficients:
                  Estimate    Std. Error     t value    Pr(>|t|)
Intercept      110315.7263     9108.4603     12.1113      0.0000 ***
bedrooms       127547.5691     2610.1285     48.8664      0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 348398.6399 on 21610 degrees of freedom
Multiple R-squared: 0.0995,    Adjusted R-squared: 0.0995
F-statistic: 2387.9246 on 1 and 21610 DF,  p-value: 0.000
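
To make the idea of a least-squares line concrete, here is a minimal sketch in plain Groovy of the closed-form solution for a single feature. It is an illustration under stated assumptions, not how Smile computes its fit: the toy data and the `fitLine` closure are hypothetical and not part of the original notebook.

```groovy
// Toy data (hypothetical, not from the King County dataset): y = 1 + 2x exactly
def xs = [1.0d, 2.0d, 3.0d, 4.0d]
def ys = [3.0d, 5.0d, 7.0d, 9.0d]

// Closed-form ordinary least squares for one feature:
//   slope = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
//   intercept = meanY - slope * meanX
def fitLine = { List<Double> x, List<Double> y ->
    double meanX = x.sum() / x.size()
    double meanY = y.sum() / y.size()
    double sxy = 0.0d
    double sxx = 0.0d
    for (int i = 0; i < x.size(); i++) {
        sxy += (x[i] - meanX) * (y[i] - meanY)
        sxx += (x[i] - meanX) * (x[i] - meanX)
    }
    double slope = sxy / sxx
    [intercept: meanY - slope * meanX, slope: slope]
}

def line = fitLine(xs, ys)
println "intercept=${line.intercept}, slope=${line.slope}"
```

Because the toy points lie exactly on a line, the sketch recovers an intercept of 1 and a slope of 2. On real data such as ours the points scatter around the line, which is what the residuals in the summary describe.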


We want the R-squared value to be as close to 1 as possible. A value of 0.0995 means that the number of bedrooms explains only about 10% of the variance in price, so bedrooms alone aren't a good indicator of price. We can explore bathrooms instead.
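
For reference, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch in plain Groovy; the toy values and the `rSquared` closure are hypothetical illustrations, not part of the original notebook or of Smile's API.

```groovy
// Hypothetical toy values, not taken from the King County data
def actual    = [1.0d, 2.0d, 3.0d, 4.0d]
def predicted = [1.0d, 2.0d, 3.0d, 6.0d]

// R-squared = 1 - SS_res / SS_tot
def rSquared = { List<Double> y, List<Double> yHat ->
    double meanY = y.sum() / y.size()
    double ssRes = 0.0d   // squared error left over after the predictions
    double ssTot = 0.0d   // squared deviation of the actual values from their mean
    for (int i = 0; i < y.size(); i++) {
        ssRes += (y[i] - yHat[i]) * (y[i] - yHat[i])
        ssTot += (y[i] - meanY) * (y[i] - meanY)
    }
    1.0d - ssRes / ssTot
}

println rSquared(actual, predicted)   // poor predictions give a low value
println rSquared(actual, actual)      // perfect predictions give 1.0
```

In this toy case a single bad prediction drags the value down to about 0.2, while perfect predictions give exactly 1.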

In [169]:
cols = ['bathrooms', 'price']
priceModel = new OLS(records.select(*cols).smile().numericDataset('price'))

Linear Model:

Residuals:
	       Min	        1Q	    Median	        3Q	       Max
	-1438177.6355	-184517.8475	-41517.8475	113231.1207	5925318.2373

Coefficients:
                  Estimate    Std. Error     t value    Pr(>|t|)
Intercept       10687.9535     6210.8477      1.7209      0.0853 .
bathrooms      250331.9576     2759.5824     90.7137      0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 312443.3785 on 21610 degrees of freedom
Multiple R-squared: 0.2758,    Adjusted R-squared: 0.2757
F-statistic: 8228.9774 on 1 and 21610 DF,  p-value: 0.000


In [170]:
p0 = priceModel.predict([0] as double[])

10687.953547666326

In [171]:
pMax = priceModel.predict([maxBathrooms] as double[])

2013343.6143124562
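
These two predictions follow directly from the fitted coefficients. As a rough check, the sketch below recomputes them in plain Groovy from the rounded values printed in the model summary; the `predictPrice` closure is a hypothetical stand-in for `priceModel.predict`, so results agree only up to rounding.

```groovy
// Coefficients copied (rounded to four decimals) from the bathrooms model summary
double intercept = 10687.9535d
double slope     = 250331.9576d

// A one-feature linear model predicts: price = intercept + slope * bathrooms
def predictPrice = { double bathrooms -> intercept + slope * bathrooms }

println predictPrice(0.0d)   // the intercept itself
println predictPrice(8.0d)   // roughly 2013343.61, close to pMax
```

The line drawn in the next plot simply joins these two points.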

In [172]:
plot = new Plot(title: 'Price x Bathrooms', xLabel: 'Bathrooms', yLabel: 'Price')
plot << new Points(x: records.column('bathrooms').toList(),
                   y: records.column('price').toList())
plot << new Line(x: [0, maxBathrooms], y: [p0, pMax])

With an R-squared of about 0.28, bathrooms are a better indicator than bedrooms but still not a great one. Let's try the size of the living area.

In [173]:
maxSqftLiving = records.column('sqft_living').toList().max()
cols = ['sqft_living', 'price']
priceModel = new OLS(records.select(*cols).smile().numericDataset('price'))

Linear Model:

Residuals:
	       Min	        1Q	    Median	        3Q	       Max
	-1476117.9717	-147471.4363	-24025.9920	106199.1818	4362019.7516

Coefficients:
                  Estimate    Std. Error     t value    Pr(>|t|)
Intercept      -43603.3525     4402.7891     -9.9036      0.0000 ***
sqft_living       280.6293        1.9364    144.9217      0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 261454.2971 on 21610 degrees of freedom
Multiple R-squared: 0.4929,    Adjusted R-squared: 0.4928
F-statistic: 21002.3042 on 1 and 21610 DF,  p-value: 0.000


In [174]:
p0 = priceModel.predict([0] as double[])
pMax = priceModel.predict([maxSqftLiving] as double[])
plot = new Plot(title: 'Price x Sqft living', xLabel: 'Sqft living', yLabel: 'Price')
plot << new Points(x: records.column('sqft_living').toList(),
                   y: records.column('price').toList())
plot << new Line(x: [0, maxSqftLiving], y: [p0, pMax])

This is much better (R-squared is about 0.49), but we can improve further by considering multiple features.
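
With several features the fitted line becomes a weighted sum of the features, and least squares picks the weights that minimise the same squared-error criterion as before. A sketch of the model, where the symbols ($x_{ij}$ for feature $j$ of house $i$, $\beta_j$ for its coefficient, $p$ features, $n$ houses) are notation assumed here rather than taken from the notebook:

$$
\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip},
\qquad
(\beta_0, \dots, \beta_p) = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

Each row of the coefficient table in the model summary corresponds to one $\beta_j$, with the intercept as $\beta_0$.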

In [175]:
cols = ['sqft_living', 'bathrooms', 'grade', 'view', 'bedrooms',
        'sqft_above', 'yr_renovated', 'waterfront']
priceModel = new OLS(records.select(*cols + ['price']).smile().numericDataset('price'))

Linear Model:

Residuals:
	       Min	        1Q	    Median	        3Q	       Max
	-1283979.2993	-123010.7210	-17886.3115	95539.4141	4574026.4384

Coefficients:
                  Estimate    Std. Error     t value    Pr(>|t|)
Intercept     -502995.0577    14353.5849    -35.0432      0.0000 ***
sqft_living       230.3461        4.4683     51.5509      0.0000 ***
bathrooms      -22264.7003     3261.1853     -6.8272      0.0000 ***
grade          101718.4289     2259.2452     45.0232      0.0000 ***
view            59202.3221     2398.4641     24.6834      0.0000 ***
bedrooms       -31514.1662     2243.0584    -14.0496      0.0000 ***
sqft_above        -47.6528        4.2197    -11.2929      0.0000 ***
yr_renovated       64.8914        3.9634     16.3725      0.0000 ***
waterfront     566677.4369    19979.5034     28.3629      0.0000 ***
---------------------------------------------------------------------
Significance codes:

We have an even better R-squared value, but a model with several features is harder to visualize. This time we will plot the actual prices against the prices predicted by our regression model.

In [177]:
plot = new Plot(title: 'Actual vs predicted price', xLabel: 'Actual', yLabel: 'Predicted')
// build one feature row per house, in the same column order used to train the model
predictions = cols.collect{ records.column(it).toList() }.transpose().collect{ priceModel.predict(it as double[]) }
actuals = records.column('price').toList()
plot << new Points(x: actuals, y: predictions)
// reference line y = x: a point on this line was predicted exactly
to = [actuals.max(), predictions.max()].min()
from = [actuals.min(), predictions.min()].min()
plot << new Line(x: [from, to], y: [from, to])