# Housing price prediction feature engineering

In general, the price of a house depends on many factors, and location plays a major role in determining the value of a property. In this notebook, we explore how location-related attributes affect housing prices in the USA. Each house is described by the information provided in ***an existing dataset***, plus some additional spatial attributes derived from its location using xarray-spatial ***(and probably some elevation dataset, and census-parquet as well?)***.

Existing features:
- ...

New features:
- ***Slope?*** (from an elevation dataset)
- ***Population density, ...?*** (from Census if none available in the existing features)
- Distance to nearest hospital (or grocery store / university / pharmacy)

We'll first build a machine learning model and train it on the existing features only. Then, for each newly added feature, we'll retrain the model and compare its results against that baseline to find out which features improve the predictions.

## Imports

First, let's import all necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import rasterio

import datashader as ds
from datashader.transfer_functions import shade
from datashader.transfer_functions import stack
from datashader.transfer_functions import dynspread
from datashader.transfer_functions import set_background
from datashader.colors import Elevation

from xrspatial import slope
from xrspatial import proximity

## Load the existing dataset

In [None]:
# placeholder data: assume each house has x/y (lon/lat) coordinates plus some additional values
df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'x': [0, 1, 2, 0, 4],
    'y': [2, 0, 1, 3, 1],
    'column_1': [2, 3, 4, 2, 6],
    'price': [1, 3, 4, 3, 7]
})

## Build and train housing price model

We'll split the data into train data and test data.

Now let's build the model to predict housing price. 

After tuning the hyperparameters, we selected the best model as below.

Prediction accuracy on the test set.
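The steps above can be sketched as follows. This is only a minimal illustration, not the final model: the data is synthetic, ordinary least squares stands in for whichever model is eventually chosen, and the helper names (`train_test_split`, `fit_linear`, `predict`) are hypothetical. Since price is a continuous target, "accuracy" is assumed here to mean RMSE and R² on the held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 houses, 3 existing features, noisy linear price.
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -1.0, 0.5]) + 10.0 + rng.normal(scale=0.1, size=200)

def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the row indices and hold out the first `test_fraction` of them."""
    order = np.random.default_rng(seed).permutation(len(y))
    n_test = int(round(len(y) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def fit_linear(X, y):
    """Least-squares fit with an intercept column appended to the features."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply fitted coefficients (the last entry is the intercept)."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

X_train, X_test, y_train, y_test = train_test_split(X, y)
coef = fit_linear(X_train, y_train)
y_pred = predict(coef, X_test)

# Test-set metrics: root mean squared error and coefficient of determination.
test_rmse = float(np.sqrt(np.mean((y_pred - y_test) ** 2)))
test_r2 = float(1.0 - np.sum((y_pred - y_test) ** 2)
                / np.sum((y_test - y_test.mean()) ** 2))
print(f"test RMSE: {test_rmse:.3f}, test R^2: {test_r2:.3f}")
```

Fixing the seed keeps the split reproducible, so later runs with extra features are compared on exactly the same test rows.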

## Calculate spatial attributes

As stated above, we'll calculate spatial attributes (slope, ...?) for each location, along with its proximity to the nearest services.

**TBD**: What is the format of additional data? Is it in vector or raster format?
- If raster (preferred), load directly as 2D xarray DataArrays
- If vector, load into a pandas/geopandas DataFrame and rasterize with datashader.

Assume that the data is in vector format.

In [None]:
# bounding box of the raster
xmin, xmax, ymin, ymax = (
    df.x.min(),
    df.x.max(),
    df.y.min(),
    df.y.max()
)
xrange = (xmin, xmax)
yrange = (ymin, ymax)

# width and height of the raster image
W, H = 800, 600

# canvas object to rasterize the houses
cvs = ds.Canvas(plot_width=W, plot_height=H, x_range=xrange, y_range=yrange)
raster = cvs.points(df, x='x', y='y', agg=ds.min('id'))

# visualize the raster
points_shaded = dynspread(shade(raster, cmap='salmon', min_alpha=0, span=(0,1), how='linear'), threshold=1, max_px=5)
set_background(points_shaded, 'black')

Identify the location of each house in pixel space.
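One way to do this is a linear mapping from data coordinates to integer pixel indices. The sketch below is written in plain NumPy and makes two assumptions that should be checked against `raster.coords`: columns run from `xmin` to the right and rows run from `ymin` upward, and a point lying exactly on the upper bound is placed in the last pixel. The helper name `to_pixel` is hypothetical.

```python
import numpy as np

def to_pixel(values, vmin, vmax, n_pixels):
    """Map data coordinates in [vmin, vmax] to integer pixel indices 0..n_pixels-1."""
    values = np.asarray(values, dtype=float)
    idx = np.floor((values - vmin) * n_pixels / (vmax - vmin)).astype(int)
    # Points on the upper edge would land at index n_pixels; clip them into the last pixel.
    return np.clip(idx, 0, n_pixels - 1)

# Same toy coordinates and raster size as in the cells above.
xs = np.array([0, 1, 2, 0, 4])
ys = np.array([2, 0, 1, 3, 1])
W, H = 800, 600

cols = to_pixel(xs, xs.min(), xs.max(), W)
rows = to_pixel(ys, ys.min(), ys.max(), H)
print(list(zip(rows.tolist(), cols.tolist())))
```

With `rows` and `cols` in hand, a value can be read from any raster of the same shape by integer indexing, for example `some_raster.values[rows, cols]`.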

Calculate the new feature value for each house.
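As an illustration for the "distance to nearest hospital" feature, the sketch below computes the value directly in vector space with NumPy. It is a stand-in, not the raster workflow itself: in the notebook the same quantity would come from sampling an `xrspatial.proximity` raster at each house's pixel, and this brute-force version is mainly useful as a cross-check. The hospital coordinates and the helper name `nearest_distance` are hypothetical, and the distances are Euclidean in the units of `x` and `y`, so real lon/lat data would need to be projected first.

```python
import numpy as np

# House coordinates from the toy DataFrame above, as (x, y) pairs.
houses = np.array([[0, 2], [1, 0], [2, 1], [0, 3], [4, 1]], dtype=float)

# Hypothetical hospital locations in the same coordinate system.
hospitals = np.array([[0, 0], [4, 4]], dtype=float)

def nearest_distance(points, targets):
    """Euclidean distance from each point to its closest target."""
    diff = points[:, None, :] - targets[None, :, :]   # shape (n_points, n_targets, 2)
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # shape (n_points, n_targets)
    return dist.min(axis=1)

dist_to_hospital = nearest_distance(houses, hospitals)
print(dist_to_hospital)
```

The resulting array has one value per house and can be attached to the DataFrame as a new column, for example `df['dist_to_hospital'] = dist_to_hospital`.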

Retrain the model with the new feature and compute the test accuracy.

## Feature selection