# Part 3 - Feature Engineering (and Selection)

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from geopy import Nominatim
import geojson
import folium
from branca.colormap import LinearColormap, StepColormap

%matplotlib inline

## Load and preview the data

In [None]:
df = pd.read_csv('./data/rew_van_jan12_clean.csv') # load contents of .csv into a pandas.DataFrame object
df.head(5) # display first 5 entries of DataFrame

In [None]:
df.columns

## Select the Features we wish to use
Feature selection is normally an iterative process: we set up experiments to test hypotheses generated during our EDA (e.g. `sqft` appears to have the highest correlation with `price` and should therefore have the greatest impact on our model), try different combinations of features, and compare model performance.  
For the purpose of this example, we will simply select features that showed good correlation in our EDA.
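
As a rough illustration of what "good correlation" can mean for the numeric features, the sketch below ranks columns by the absolute value of their Pearson correlation with `price`. The toy values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the listings data; these numbers are invented for illustration.
toy = pd.DataFrame({
    "sqft":  [600, 850, 1200, 1500, 2400],
    "bed":   [1, 2, 2, 3, 4],
    "bath":  [1, 1, 2, 2, 3],
    "price": [450_000, 610_000, 820_000, 990_000, 1_650_000],
})

# Pearson correlation of every numeric column with the target,
# sorted by absolute strength (strongest first).
corr_with_price = (
    toy.select_dtypes("number")
       .corr()["price"]
       .drop("price")
       .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

Correlation only captures linear relationships with numeric columns, so categorical features such as `property_type` need a different check (for example, comparing price distributions per category).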

In [None]:
selected_features = ['area', 'bath', 'bed', 'property_type', 'sqft', 'strata_type', 'sub_area']

In [None]:
# keep only the selected features, plus 'price' (the target we want to predict)
# building a new list keeps this cell safe to re-run
df = df[selected_features + ['price']]

## Create Dummy Variables from categorical features
In order to use the pertinent categorical features in our model, we must first convert them to [Dummy Variables]. This creates a new column for each category and encodes a binary 1 or 0 for whether that category is present in the sample.  
To do this, we use the built-in pandas function `pandas.get_dummies()`.

[Dummy Variables]: https://en.wikipedia.org/wiki/Dummy_variable_(statistics)

To get a better understanding of dummy variables, let's generate them for `property_type`.

In [None]:
print("before conversion:")
df[['property_type']].head(5)

In [None]:
print("after conversion:")
pd.get_dummies(data=df['property_type']).head(5)

We see that the `property_type` column has been replaced by one column per category found in `property_type`, with a binary 1 or 0 indicating whether the sample belongs to that category (pandas 2.0 and later show these as `True`/`False` by default).

Let's convert the entire dataframe to dummy variables. By default, `get_dummies()` only encodes columns with object, string, or category dtype, so numerical columns are left unchanged.

In [None]:
df_engineered = pd.get_dummies(data=df)

In [None]:
df_engineered.head(5)

Our dataset has now been expanded to include dummy variables!
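
To make the behaviour concrete, here is a minimal sketch on a toy dataframe. The category labels are hypothetical, and `dtype=int` is passed only so that the output shows 1/0 regardless of pandas version.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column.
# The property_type labels are hypothetical examples.
toy = pd.DataFrame({
    "sqft": [650, 1400, 2100],
    "property_type": ["Apt/Condo", "Townhouse", "House"],
})

# Only the object-dtype column is encoded; 'sqft' passes through untouched.
# New columns are named <original column>_<category>.
encoded = pd.get_dummies(data=toy, dtype=int)
print(encoded.columns.tolist())
# ['sqft', 'property_type_Apt/Condo', 'property_type_House', 'property_type_Townhouse']
```

Each row has exactly one `1` across the three `property_type_*` columns, which is what lets a model treat the category as a set of numeric inputs.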

## Feature Engineering  
We can further extend our dataset by engineering new features that can have a positive effect on our model.  
For example, we have yet to use the latitude/longitude positions of the houses for any purpose other than visualization.
Since location is a major factor in house prices, perhaps we could create a few new features: `average_1km`, `average_2km`, `average_3km`, representing the average price of the other listings within a 1 km, 2 km, and 3 km radius of each house.

I will leave it as an exercise for the reader to engineer a few such features and test whether model performance improves.
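
As a possible starting point for that exercise, the sketch below computes radius-based average prices using the haversine (great-circle) distance. It assumes the dataframe has `lat` and `lon` columns in decimal degrees; those column names, the helper functions, and the toy coordinates are all hypothetical, and the selected-features dataframe above would need those columns kept for this to apply.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat, lon):
    """Pairwise great-circle distances in km between all points, as an (n, n) array."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def average_price_within(df, radius_km, lat_col="lat", lon_col="lon", price_col="price"):
    """Mean price of the *other* listings within radius_km; NaN if there are none."""
    dist = haversine_km(df[lat_col], df[lon_col])
    neighbours = dist <= radius_km
    np.fill_diagonal(neighbours, False)  # exclude each listing's own price
    prices = df[price_col].to_numpy(dtype=float)
    counts = neighbours.sum(axis=1)
    totals = neighbours.astype(float) @ prices
    avg = np.full(len(df), np.nan)
    np.divide(totals, counts, out=avg, where=counts > 0)
    return pd.Series(avg, index=df.index)

# Toy listings with made-up coordinates and prices.
toy = pd.DataFrame({
    "lat":   [49.2827, 49.2850, 49.2500, 49.2700],
    "lon":   [-123.1207, -123.1150, -123.1000, -123.1207],
    "price": [800_000, 600_000, 1_000_000, 700_000],
})
for km in (1, 2, 3):
    toy[f"average_{km}km"] = average_price_within(toy, radius_km=km)
print(toy)
```

Two caveats apply to this sketch. The pairwise distance matrix needs memory proportional to the square of the number of listings, so a spatial index would be a better fit for large datasets. Also, because the feature is built from `price` itself, the averages should be computed from training listings only when evaluating a model, otherwise information about the target leaks into the test set.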

## Save the dataframe to .csv file

In [None]:
df_engineered.to_csv('./data/rew_van_jan12_clean_engineered.csv', index=False)