# Part 3 - Feature Engineering (and Selection)

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from geopy import Nominatim
import geojson
import folium
from branca.colormap import LinearColormap, StepColormap

%matplotlib inline

## Load and preview the data

In [2]:
df = pd.read_csv('./data/sf/data_clean_imputed.csv') # load contents of .csv into a pandas.DataFrame object
df.head(5) # display first 5 entries of DataFrame

,title,address,city,state,postal_code,price,facts and features,url,bed,bath,sqft,property_type
0,Condo For Sale,220 Lombard St APT 116,San Francisco,CA,94111,849000.0,"1 bd , 1 ba , 830 sqft",https://www.zillow.com/homedetails/220-Lombard...,1.0,1.0,830.0,condo
1,Condo For Sale,101 Lombard St APT 603W,San Francisco,CA,94111,1650000.0,"2 bds , 2 ba , 1,500 sqft",https://www.zillow.com/homedetails/101-Lombard...,2.0,2.0,1500.0,condo
2,Condo For Sale,733 Front St UNIT 312,San Francisco,CA,94111,1195000.0,"1 bd , 1 ba , 1,189 sqft",https://www.zillow.com/homedetails/733-Front-S...,1.0,1.0,1189.0,condo
3,Condo For Sale,550 Davis St UNIT 44,San Francisco,CA,94111,1995000.0,"3 bds , 2 ba , 1,520 sqft",https://www.zillow.com/homedetails/550-Davis-S...,3.0,2.0,1520.0,condo
4,Condo For Sale,240 Lombard St APT 437,San Francisco,CA,94111,625000.0,"1 bd , 1 ba , 566 sqft",https://www.zillow.com/homedetails/240-Lombard...,1.0,1.0,566.0,condo


In [3]:
df.columns

Index(['title', 'address', 'city', 'state', 'postal_code', 'price',
       'facts and features', 'url', 'bed', 'bath', 'sqft', 'property_type'],
      dtype='object')

## Select the Features we wish to use
Feature selection is normally an iterative process: we set up experiments to test hypotheses generated during our EDA (e.g. `sqft` appears to have the highest correlation with `price`, so it should have the greatest impact on our model), try different combinations of features, and compare model accuracy.  
For this example, we simply select the features that correlated well with `price` in our EDA.
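As a rough illustration of that kind of check, the sketch below ranks numeric features by their Pearson correlation with `price`. The `toy` frame is a stand-in built from the five rows previewed above, not the full dataset, so the resulting numbers say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for the listings data (values copied from the preview above).
toy = pd.DataFrame({
    "bed":   [1.0, 2.0, 1.0, 3.0, 1.0],
    "bath":  [1.0, 2.0, 1.0, 2.0, 1.0],
    "sqft":  [830.0, 1500.0, 1189.0, 1520.0, 566.0],
    "price": [849000.0, 1650000.0, 1195000.0, 1995000.0, 625000.0],
})

# Pearson correlation of every numeric column with the target, strongest first.
corr_with_price = (
    toy.corr()["price"]
       .drop("price")
       .sort_values(ascending=False)
)
print(corr_with_price)
```

Correlation only captures linear, pairwise relationships, so it is a starting point for choosing features rather than a substitute for testing them in a model.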

In [4]:
# keep 'price': it is the target we want to predict
selected_features = ['bath', 'bed', 'property_type', 'sqft', 'postal_code', 'price']

Quick note: postal codes are stored as integers, so Pandas would treat the column as numeric rather than categorical. Let's convert it to a string.

In [5]:
df.postal_code = df.postal_code.astype(str)
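To see why the cast matters, here is a small sketch on a toy frame (three made-up rows, not the real dataset). With default arguments, `pd.get_dummies` only encodes object, string, and categorical columns, so an integer postal code passes through untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "postal_code": [94111, 94111, 94112],
    "sqft": [830.0, 1500.0, 1250.0],
})

# Integer postal codes are treated as numeric and passed through unchanged.
as_int = pd.get_dummies(toy)
print(list(as_int.columns))  # ['postal_code', 'sqft']

# After casting to str, each distinct code becomes its own indicator column.
toy["postal_code"] = toy["postal_code"].astype(str)
as_str = pd.get_dummies(toy)
print(list(as_str.columns))
```

An alternative that avoids the cast is to name the column explicitly via the `columns` argument, e.g. `pd.get_dummies(toy, columns=["postal_code"])`.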

In [7]:
df = df[selected_features]
df

,bath,bed,property_type,sqft,postal_code,price
0,1.000000,1.0,condo,830.0,94111,849000.0
1,2.000000,2.0,condo,1500.0,94111,1650000.0
2,1.000000,1.0,condo,1189.0,94111,1195000.0
3,2.000000,3.0,condo,1520.0,94111,1995000.0
4,1.000000,1.0,condo,566.0,94111,625000.0
5,1.000000,1.0,condo,914.0,94111,1196000.0
6,1.000000,2.0,house,1250.0,94112,999000.0
7,1.500000,3.0,house,1325.0,94112,899000.0
8,1.000000,2.0,house,750.0,94112,699000.0
9,6.000000,6.0,house,3168.0,94112,1599999.0


## Create Dummy Variables from categorical features
To use the categorical features in our model, we must first convert them to [Dummy Variables]. This creates one new column per category and encodes a binary 1 or 0 for whether that category applies to the sample.  
To do this, we use the Pandas function `pandas.get_dummies()`.

[Dummy Variables]: https://en.wikipedia.org/wiki/Dummy_variable_(statistics)

To get a better understanding of dummy variables, let's generate them for `property_type`.

In [11]:
print("before conversion:")
df[['property_type']].head(10)

before conversion:


,property_type
0,condo
1,condo
2,condo
3,condo
4,condo
5,condo
6,house
7,house
8,house
9,house


In [9]:
print("after conversion:")
pd.get_dummies(data=df['property_type']).head(10)

after conversion:


,apartment,auction,coming,condo,coop,house,lot,new
0,0,0,0,1,0,0,0,0
1,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0
3,0,0,0,1,0,0,0,0
4,0,0,0,1,0,0,0,0
5,0,0,0,1,0,0,0,0
6,0,0,0,0,0,1,0,0
7,0,0,0,0,0,1,0,0
8,0,0,0,0,0,1,0,0
9,0,0,0,0,0,1,0,0


We see that the `property_type` column has been replaced by one column per category, each holding a binary 1 or 0 that indicates whether the sample belongs to that category.
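One caveat, assuming a recent Pandas install: from version 2.0 onward, `get_dummies` returns boolean (`True`/`False`) columns by default instead of the 0/1 integers shown above. Passing `dtype=int` should reproduce the integer output; the short series below is illustrative only.

```python
import pandas as pd

property_type = pd.Series(["condo", "condo", "house"], name="property_type")

# dtype=int forces 0/1 integers regardless of the pandas version's default.
dummies = pd.get_dummies(property_type, dtype=int)
print(dummies)
```

Either representation carries the same information, so this mostly matters when comparing printed output against the tables in this notebook.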

Let's preview the same conversion for `postal_code`, then convert the entire dataframe to dummy variables (Pandas leaves numerical columns unchanged).

In [15]:
pd.get_dummies(data=df['postal_code']).head(10)

,94102,94103,94104,94105,94107,94108,94109,94110,94111,94112,...,94121,94122,94123,94124,94127,94131,94132,94133,94134,94158
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [16]:
df_engineered = pd.get_dummies(data=df)

In [17]:
df_engineered.head(5)

,bath,bed,sqft,price,property_type_apartment,property_type_auction,property_type_coming,property_type_condo,property_type_coop,property_type_house,...,postal_code_94121,postal_code_94122,postal_code_94123,postal_code_94124,postal_code_94127,postal_code_94131,postal_code_94132,postal_code_94133,postal_code_94134,postal_code_94158
0,1.0,1.0,830.0,849000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2.0,2.0,1500.0,1650000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1.0,1.0,1189.0,1195000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2.0,3.0,1520.0,1995000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1.0,1.0,566.0,625000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


Our dataset has now been expanded to include dummy variables!

## Feature Engineering
We can further extend our dataset by engineering new features that may improve our model.  
For example, we have yet to use the latitude/longitude of the houses for anything other than visualization. Since location is a major factor in house prices, we could create a few new features, `average_1km`, `average_2km`, and `average_3km`, representing the average listing price within a 1 km, 2 km, and 3 km radius of each house.

Engineering a few such features and testing whether model performance improves is left as an exercise for the reader.
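As a possible starting point for that exercise, here is a minimal sketch of the radius-average idea. It assumes the dataframe has `lat` and `lon` columns in decimal degrees (hypothetical names; the file loaded in this notebook does not contain them), and the helpers `haversine_km` and `average_price_within` are illustrative rather than part of any library. The coordinates in `toy` are made up.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees (broadcasts)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def average_price_within(df, radius_km):
    """Mean price of the *other* listings within radius_km of each listing."""
    lat = df["lat"].to_numpy()
    lon = df["lon"].to_numpy()
    price = df["price"].to_numpy(dtype=float)
    # Pairwise distance matrix: O(n^2) memory, fine for a few thousand rows.
    dist = haversine_km(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
    neighbours = dist <= radius_km
    np.fill_diagonal(neighbours, False)  # exclude each listing's own price
    counts = neighbours.sum(axis=1)
    totals = (neighbours * price[None, :]).sum(axis=1)
    # NaN where a listing has no neighbours inside the radius.
    avg = np.full(len(df), np.nan)
    np.divide(totals, counts, out=avg, where=counts > 0)
    return pd.Series(avg, index=df.index)

# Made-up coordinates: two listings about 100 m apart, one several km away.
toy = pd.DataFrame({
    "lat":   [37.8030, 37.8035, 37.7200],
    "lon":   [-122.4030, -122.4040, -122.4400],
    "price": [849000.0, 1650000.0, 999000.0],
})
for r in (1, 2, 3):
    toy[f"average_{r}km"] = average_price_within(toy, r)
print(toy)
```

Each listing's own price is excluded so the feature does not simply echo the target. When evaluating a model, the averages should likewise be computed from training rows only; otherwise prices from the test set leak into the features.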

## Save the dataframe to a .csv file

In [18]:
df_engineered.to_csv('./data/sf/data_clean_engineered.csv', index=False)