# Data Prep

#### Data

For the blog article [ColorFrames: A Novel Machine Learning Approach to Assess Florida's Climate-Driven Real Estate Risk](https://medium.com/p/60e9a8913b85/edit). we will be using a new data set scraped from [Realtor.com](https://www.realtor.com).


This dataset includes records of over 1600 south Florida real estate properties, and includes flood risk information, demographic information, listing images, listing descriptions, how long the property has been listed, sales history, and lots of other related information. 

The target we want to predict is the difference in the rate of property value change (regression) vs the rate of change of similar properties nearby. 

In the dataset, approximately 66% of properties have a low flood risk (less than 5 on the [Flood Factor](https://floodfactor.com) scoring scale), and approx. 33% have a high flood risk. 

We will test the hypothesis that a high flood risk score is a strong predictor for the difference in the rate of property value change by building an ML model, and then applying various model explainability tools to it.

## Load the Property Data

Scraping data is a bit messy. Many values available on one real estate listing, might not also be found on another listing. The scaper easily gets confused with AJAX loading and other complications, therefore, we need to clean up things before we can build the model. Because it's time-consuming to clean-up everything, we will do a "quick-pass" here, but there will be additional data that could be cleaned in order to obtain more training data. Please feel free to do this and provide a link! 


In [26]:
import pandas as pd
import evalml
from evalml.preprocessing import load_data

property_data = pd.read_csv('../data/raw/property.csv').reset_index()
property_data

Unnamed: 0,index,Address,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,YearBuilt,Style,Description,PropertyFeatures,PropertyHistory,NearbySchoolInfo,NeighborhoodInfo,NearbyHomeValues,Page_URL,Meta_keywords
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,"$649,000",2bed,2bath,"1,350sqft",Noise: Medium,...,Year Built1970,Garage1 Car,Very desirable turnkey North East wonderful wr...,Very desirable turnkey North East wonderful wr...,Property PriceDateEventPricePrice/Sq FtSource0...,,1 Las Olas Cir Apt 517 is located in Central B...,#content-nearbyhomevalues .address{min-width:1...,https://www.realtor.com/realestateandhomes-det...,
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,"$689,000",2bed,2bath,"1,479sqft",Noise: Medium,...,Year Built2008,Garage1 Car,Welcome to the exclusive Plaza at Oceanside! Y...,Welcome to the exclusive Plaza at Oceanside! Y...,Property PriceDateEventPricePrice/Sq FtSource0...,Rating* School Name Grades Distance 4/10Pomp...,1 N Ocean Blvd Apt 610 is located in Beach nei...,#content-nearbyhomevalues .address{min-width:1...,https://www.realtor.com/realestateandhomes-det...,
2,2,"100 Lincoln Rd Unit 643, Miami Beach, FL, 33139",https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,"$650,000",1bed,1bath,845sqft,Noise: Medium,...,Year Built1965,Garage1 Car,Completely renovated ocean-view 845 sq ft one-...,Completely renovated ocean-view 845 sq ft one-...,Property PriceDateEventPricePrice/Sq FtSource0...,,100 Lincoln Rd Unit 643 is located in Miami Be...,#content-nearbyhomevalues .address{min-width:1...,https://www.realtor.com/realestateandhomes-det...,
3,3,"100 Lincoln Rd Unit 702, Miami Beach, FL, 33139",https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,"$450,000",1bed,1bath,820sqft,Noise: Medium,...,Year Built1965,Garage1 Car,Completely renovated ocean-view 820 sq ft. one...,Completely renovated ocean-view 820 sq ft. one...,Property PriceDateEventPricePrice/Sq FtSource0...,,100 Lincoln Rd Unit 702 is located in Miami Be...,#content-nearbyhomevalues .address{min-width:1...,https://www.realtor.com/realestateandhomes-det...,
4,4,"100 N Ocean Blvd Unit 208, Delray Beach, FL, 3...",https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,"$2,495,000",3bed,3bath,"2,150sqft",Noise: Medium,...,Year Built1957,StatusFor Sale,Brand new luxury oceanfront condo in the heart...,Brand new luxury oceanfront condo in the heart...,Property PriceDateEventPricePrice/Sq FtSource0...,,100 N Ocean Blvd Unit 208 is located in Delray...,#content-nearbyhomevalues .address{min-width:1...,https://www.realtor.com/realestateandhomes-det...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609,1609,"Sunny Isles Beach, FL, 33160",https://ap.rdcpix.com/94921820c16c62e5922977e6...,https://ap.rdcpix.com/94921820c16c62e5922977e6...,https://ap.rdcpix.com/94921820c16c62e5922977e6...,"$975,000",4bed,2.5bath,"2,667sqft",Noise: Medium,...,Year Built2000,Garage2 Cars,"Beautiful one story, 4bed/2.5bath house at Gol...","Beautiful one story, 4bed/2.5bath house at Gol...",Property PriceDateEventPricePrice/Sq FtSource0...,Rating* School Name Grades Distance 9/10Norm...,This Property is located in Golden Gate Estate...,#content-nearbyhomevalues .address{min-width:1...,https://www.realtor.com/realestateandhomes-det...,
1610,1610,"Valrico, FL, 33594",https://ap.rdcpix.com/09bf9b7cab8bf2246d934171...,https://ap.rdcpix.com/09bf9b7cab8bf2246d934171...,https://ap.rdcpix.com/09bf9b7cab8bf2246d934171...,"$523,750",5bed,3.5bath,"3,416sqft",,...,Price per sqft$153,StatusFor Sale,Under Construction. Come home to our newest fl...,Under Construction. Come home to our newest fl...,Property PriceDateEventPricePrice/Sq FtSource0...,Contact an agent for Schools Information.,No neighborhood information is available.Popul...,#content-nearbyhomevalues .address{min-width:1...,https://www.realtor.com/realestateandhomes-det...,
1611,1611,"Village of Islands, FL, 33036",https://ap.rdcpix.com/7a97fcfa05de6110d59c75dd...,https://ap.rdcpix.com/7a97fcfa05de6110d59c75dd...,https://ap.rdcpix.com/7a97fcfa05de6110d59c75dd...,"$2,275,000",4bed,4.5bath,,Noise: Medium,...,,,BEACH RESIDENCE: The Beach house at Tarpon Poi...,BEACH RESIDENCE: The Beach house at Tarpon Poi...,Property PriceDateEventPricePrice/Sq FtSource0...,,No neighborhood information is available.Ask a...,#content-nearbyhomevalues .address{min-width:1...,https://www.realtor.com/realestateandhomes-det...,
1612,1612,"Village of Islands, FL, 33036",https://ap.rdcpix.com/17da5a866083e430abeb91c3...,https://ap.rdcpix.com/17da5a866083e430abeb91c3...,https://ap.rdcpix.com/17da5a866083e430abeb91c3...,"$1,595,000",4bed,4bath,,Noise: Medium,...,,,Oceanfront home in premiere gated Islamorada c...,Oceanfront home in premiere gated Islamorada c...,Property PriceDateEventPricePrice/Sq FtSource1...,Rating* School Name Grades Distance 7/10Plan...,No neighborhood information is available.Ask a...,#content-nearbyhomevalues .address{min-width:1...,https://www.realtor.com/realestateandhomes-det...,


## Load the Demographic Data

[Geocodio](https://www.geocod.io) provides a few data sets we can easily append to our property data, for additional studies and usage: 

* `The US Census Bureau`
* `Local city, county, and state datasets from OpenAddresses`
* `OpenStreetMap`
* `GeoNames`
* `CanVecPlus by Natural Resources Canada`
* `StatCan`
* `Legislator information from the UnitedStates project on GitHub`


In [27]:
demographics_data = pd.read_csv('../data/raw/demographics.csv', low_memory=False).reset_index()
demographics_data


Unnamed: 0,index,Address,Latitude,Longitude,Accuracy Score,Accuracy Type,Number,Street,Unit Type,Unit Number,...,Current Senator #2 OpenSecrets id,Current Senator #2 LIS id,Current Senator #2 C-SPAN id,Current Senator #2 GovTrack id,Current Senator #2 Vote Smart id,Current Senator #2 Ballotpedia id,Current Senator #2 Washington post id,Current Senator #2 ICPSR id,Current Senator #2 Wikipedia id,Current Senator #2 Source
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",26.118226,-80.106961,1.00,rooftop,1,Las Olas Cir,Apt,517,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",26.232514,-80.091608,1.00,rooftop,1,N Ocean Blvd,Apt,610,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
2,2,"100 Lincoln Rd Unit 643, Miami Beach, FL, 33139",25.790450,-80.129027,1.00,rooftop,100,Lincoln Rd,Unit,643,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
3,3,"100 Lincoln Rd Unit 702, Miami Beach, FL, 33139",25.790450,-80.129027,1.00,rooftop,100,Lincoln Rd,Unit,702,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
4,4,"100 N Ocean Blvd Unit 208, Delray Beach, FL, 3...",26.463537,-80.058697,1.00,rooftop,100,N Ocean Blvd,Unit,208,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609,1609,"Sunny Isles Beach, FL, 33160",25.950648,-80.122823,0.80,place,,,,,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1610,1610,"Valrico, FL, 33594",27.940934,-82.242479,1.00,place,,,,,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1611,1611,"Village of Islands, FL, 33036",24.901690,-80.682667,0.53,place,,,,,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1612,1612,"Village of Islands, FL, 33036",24.901690,-80.682667,0.53,place,,,,,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...


### Merge the Data

The data has been presorted, and reindexed, therefore we can simply merge on the index id.

In [28]:
merged = pd.merge(property_data, demographics_data, on='index')
merged

Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,Current Senator #2 OpenSecrets id,Current Senator #2 LIS id,Current Senator #2 C-SPAN id,Current Senator #2 GovTrack id,Current Senator #2 Vote Smart id,Current Senator #2 Ballotpedia id,Current Senator #2 Washington post id,Current Senator #2 ICPSR id,Current Senator #2 Wikipedia id,Current Senator #2 Source
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,"$649,000",2bed,2bath,"1,350sqft",Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,"$689,000",2bed,2bath,"1,479sqft",Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
2,2,"100 Lincoln Rd Unit 643, Miami Beach, FL, 33139",https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,"$650,000",1bed,1bath,845sqft,Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
3,3,"100 Lincoln Rd Unit 702, Miami Beach, FL, 33139",https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,"$450,000",1bed,1bath,820sqft,Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
4,4,"100 N Ocean Blvd Unit 208, Delray Beach, FL, 3...",https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,"$2,495,000",3bed,3bath,"2,150sqft",Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609,1609,"Sunny Isles Beach, FL, 33160",https://ap.rdcpix.com/94921820c16c62e5922977e6...,https://ap.rdcpix.com/94921820c16c62e5922977e6...,https://ap.rdcpix.com/94921820c16c62e5922977e6...,"$975,000",4bed,2.5bath,"2,667sqft",Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1610,1610,"Valrico, FL, 33594",https://ap.rdcpix.com/09bf9b7cab8bf2246d934171...,https://ap.rdcpix.com/09bf9b7cab8bf2246d934171...,https://ap.rdcpix.com/09bf9b7cab8bf2246d934171...,"$523,750",5bed,3.5bath,"3,416sqft",,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1611,1611,"Village of Islands, FL, 33036",https://ap.rdcpix.com/7a97fcfa05de6110d59c75dd...,https://ap.rdcpix.com/7a97fcfa05de6110d59c75dd...,https://ap.rdcpix.com/7a97fcfa05de6110d59c75dd...,"$2,275,000",4bed,4.5bath,,Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1612,1612,"Village of Islands, FL, 33036",https://ap.rdcpix.com/17da5a866083e430abeb91c3...,https://ap.rdcpix.com/17da5a866083e430abeb91c3...,https://ap.rdcpix.com/17da5a866083e430abeb91c3...,"$1,595,000",4bed,4bath,,Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...


Note that the word `sqft` is contained within the column `Area`, therefore we will identify all rows which contain `sqft`, and disregard rows that dont. NOTE: rows that do not contain `sqft` would need to be cleaned in a seperate workflow. For no, we will focus on cleaning the bulk of the data only, and later on we can come back to clean the stragglers.

In [29]:
word = 'sqft'
new_df = merged[merged["Area"].str.contains(word) == True]
new_df

Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,Current Senator #2 OpenSecrets id,Current Senator #2 LIS id,Current Senator #2 C-SPAN id,Current Senator #2 GovTrack id,Current Senator #2 Vote Smart id,Current Senator #2 Ballotpedia id,Current Senator #2 Washington post id,Current Senator #2 ICPSR id,Current Senator #2 Wikipedia id,Current Senator #2 Source
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,"$649,000",2bed,2bath,"1,350sqft",Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,"$689,000",2bed,2bath,"1,479sqft",Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
2,2,"100 Lincoln Rd Unit 643, Miami Beach, FL, 33139",https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,"$650,000",1bed,1bath,845sqft,Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
3,3,"100 Lincoln Rd Unit 702, Miami Beach, FL, 33139",https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,"$450,000",1bed,1bath,820sqft,Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
4,4,"100 N Ocean Blvd Unit 208, Delray Beach, FL, 3...",https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,"$2,495,000",3bed,3bath,"2,150sqft",Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1606,1606,"Satellite Beach, FL, 32937",https://ap.rdcpix.com/0d60946741e52dbde78ed306...,https://ap.rdcpix.com/0d60946741e52dbde78ed306...,https://ap.rdcpix.com/0d60946741e52dbde78ed306...,"$579,900",3bed,2bath,"2,121sqft",,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1607,1607,"Sebastian, FL, 32958",https://ap.rdcpix.com/7da32f8b901b314e749c0fea...,https://ap.rdcpix.com/7da32f8b901b314e749c0fea...,https://ap.rdcpix.com/7da32f8b901b314e749c0fea...,"$234,892",3bed,2bath,"1,900sqft",Noise: Low,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1608,1608,"Sebastian, FL, 32958",https://ap.rdcpix.com/7da32f8b901b314e749c0fea...,https://ap.rdcpix.com/7da32f8b901b314e749c0fea...,https://ap.rdcpix.com/7da32f8b901b314e749c0fea...,"$234,892",3bed,2bath,"1,900sqft",Noise: Low,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...
1609,1609,"Sunny Isles Beach, FL, 33160",https://ap.rdcpix.com/94921820c16c62e5922977e6...,https://ap.rdcpix.com/94921820c16c62e5922977e6...,https://ap.rdcpix.com/94921820c16c62e5922977e6...,"$975,000",4bed,2.5bath,"2,667sqft",Noise: Medium,...,N00043290,S404,,412838,124204.0,Rick Scott,,41903.0,Rick Scott,Legislator data is originally collected and ag...



Other columns need similar cleaning, such as `Beds`, `FloodInfo`, and `YearBuilt`.  

In [30]:
word = 'bed'
new_df = new_df[new_df["Beds"].str.contains(word) == True]

word = 'Year Built'
new_df = new_df[new_df["YearBuilt"].str.contains(word) == True]

word = 'Flood Factor'
new_df = new_df[new_df["FloodInfo"].str.contains(word) == True]




The `Style` column is a total mess because many/most listings do not include this information. For now, we will simply drop it to save time parsing the mess.

In [31]:
new_df = new_df.drop('Style',axis=1)

### Using [Featuretools](https://featuretools.alteryx.com/en/stable/index.html) to speed up data cleaning.

The [Featuretools](https://featuretools.alteryx.com/en/stable/index.html) library has a few great data cleaning tools we will use to save time. Specifically:

* `remove_low_information_features`: Keep only features that have at least 2 unique values and that are not all null
* `remove_highly_null_features`: Removes columns from a feature matrix that have higher than a set threshold of null values.
* `remove_single_value_features`: Removes columns in feature matrix that are highly correlated with another column.
* `remove_highly_correlated_features`: Removes columns in feature matrix that are highly correlated with another column.


In [32]:
# see https://docs.featuretools.com/en/stable/_modules/featuretools/selection/selection.html
from featuretools.selection import remove_low_information_features, remove_highly_null_features, remove_single_value_features, remove_highly_correlated_features


df = new_df

"""
    Select features that have at least 2 unique values and that are not all null

    Args:
        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances
        features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select

    Returns:
        (feature_matrix, features)

 """
df_t = remove_low_information_features(df)



"""
    Removes columns from a feature matrix that have higher than a set threshold
    of null values.

    Args:
        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances.
        features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select.
        pct_null_threshold (float): If the percentage of NaN values in an input feature exceeds this amount,
                that feature will be considered highly-null. Defaults to 0.95.

    Returns:
        pd.DataFrame, list[:class:`.FeatureBase`]:
            The feature matrix and the list of generated feature definitions. Matches dfs output.
            If no feature list is provided as input, the feature list will not be returned.
"""
df_t = remove_highly_null_features(df_t)



"""
    Removes columns in feature matrix where all the values are the same.

    Args:
        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances.
        features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select.
        count_nan_as_value (bool): If True, missing values will be counted as their own unique value.
                    If set to False, a feature that has one unique value and all other
                    data missing will be removed from the feature matrix. Defaults to False.

     Returns:
        pd.DataFrame, list[:class:`.FeatureBase`]:
            The feature matrix and the list of generated feature definitions.
            Matches dfs output.
            If no feature list is provided as input, the feature list will not be returned.
"""
df_t = remove_single_value_features(df_t)



"""
    Removes columns in feature matrix that are highly correlated with another column.

    Note:
        We make the assumption that, for a pair of features, the feature that is further
        right in the feature matrix produced by ``dfs`` is the more complex one.
        The assumption does not hold if the order of columns in the feature
        matrix has changed from what ``dfs`` produces.

    Args:
        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature
                    names and rows are instances.
        features (list[:class:`featuretools.FeatureBase`] or list[str], optional):
                    List of features to select.
        pct_corr_threshold (float): The correlation threshold to be considered highly
                    correlated. Defaults to 0.95.
        features_to_check (list[str], optional): List of column names to check
                    whether any pairs are highly correlated. Will not check any
                    other columns, meaning the only columns that can be removed
                    are in this list. If null, defaults to checking all columns.
        features_to_keep (list[str], optional): List of colum names to keep even
                    if correlated to another column. If null, all columns will be
                    candidates for removal.

    Returns:
        pd.DataFrame, list[:class:`.FeatureBase`]:
            The feature matrix and the list of generated feature definitions.
            Matches dfs output. If no feature list is provided as input,
            the feature list will not be returned. For consistent results,
            do not change the order of features outputted by dfs.
"""
df_t = remove_highly_correlated_features(df_t)
df_t

Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,Current Senator #2 Party,Current Senator #2 Url,Current Senator #2 Address,Current Senator #2 Phone,Current Senator #2 Contact form,Current Senator #2 Twitter,Current Senator #2 Bioguide id,Current Senator #2 LIS id,Current Senator #2 Ballotpedia id,Current Senator #2 Wikipedia id
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,"$649,000",2bed,2bath,"1,350sqft",Noise: Medium,...,Republican,https://www.rickscott.senate.gov,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,"$689,000",2bed,2bath,"1,479sqft",Noise: Medium,...,Republican,https://www.rickscott.senate.gov,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott
2,2,"100 Lincoln Rd Unit 643, Miami Beach, FL, 33139",https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,"$650,000",1bed,1bath,845sqft,Noise: Medium,...,Republican,https://www.rickscott.senate.gov,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott
3,3,"100 Lincoln Rd Unit 702, Miami Beach, FL, 33139",https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,"$450,000",1bed,1bath,820sqft,Noise: Medium,...,Republican,https://www.rickscott.senate.gov,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott
4,4,"100 N Ocean Blvd Unit 208, Delray Beach, FL, 3...",https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,"$2,495,000",3bed,3bath,"2,150sqft",Noise: Medium,...,Republican,https://www.rickscott.senate.gov,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1594,1594,"Miami, FL, 33156",https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,"$349,900",2bed,2bath,"1,155sqft",Noise: High,...,Republican,https://www.rickscott.senate.gov,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott
1596,1596,"North Miami, FL, 33161",https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,"$365,000",3bed,1bath,"1,128sqft",Noise: Medium,...,Republican,https://www.rickscott.senate.gov,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott
1600,1600,"Pembroke Pines, FL, 33024",https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,"$469,900",5bed,2bath,"2,388sqft",Noise: Medium,...,Republican,https://www.rickscott.senate.gov,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott
1601,1601,"Pembroke Pines, FL, 33027",https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,"$495,000",4bed,3bath,"2,437sqft",Noise: Medium,...,Republican,https://www.rickscott.senate.gov,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott


### Clean up the Flood Risk Data

The `Flood Risk Data` is a prominent feature of this study, so we want to clean up the formatting a bit.

In [33]:
df_t[['FemaInfo','FloodFactorInfo']] = df_t.FloodInfo.str.split(' • ', expand=True) 
df_t['FloodFactorInfo'] = df_t['FloodFactorInfo'].astype(str).str.replace('/10 New','').str.replace('Flood Factor ','')
df_t['FemaInfo'] = df_t['FemaInfo'].astype(str).str.replace('FEMA Zone ','').str.replace('(est.)','')
df_t


Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,Current Senator #2 Address,Current Senator #2 Phone,Current Senator #2 Contact form,Current Senator #2 Twitter,Current Senator #2 Bioguide id,Current Senator #2 LIS id,Current Senator #2 Ballotpedia id,Current Senator #2 Wikipedia id,FemaInfo,FloodFactorInfo
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,"$649,000",2bed,2bath,"1,350sqft",Noise: Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),10
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,"$689,000",2bed,2bath,"1,479sqft",Noise: Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),6
2,2,"100 Lincoln Rd Unit 643, Miami Beach, FL, 33139",https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,"$650,000",1bed,1bath,845sqft,Noise: Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),10
3,3,"100 Lincoln Rd Unit 702, Miami Beach, FL, 33139",https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,"$450,000",1bed,1bath,820sqft,Noise: Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),10
4,4,"100 N Ocean Blvd Unit 208, Delray Beach, FL, 3...",https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,"$2,495,000",3bed,3bath,"2,150sqft",Noise: Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X (),9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1594,1594,"Miami, FL, 33156",https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,"$349,900",2bed,2bath,"1,155sqft",Noise: High,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X (),1
1596,1596,"North Miami, FL, 33161",https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,"$365,000",3bed,1bath,"1,128sqft",Noise: Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),3
1600,1600,"Pembroke Pines, FL, 33024",https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,"$469,900",5bed,2bath,"2,388sqft",Noise: Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),6
1601,1601,"Pembroke Pines, FL, 33027",https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,"$495,000",4bed,3bath,"2,437sqft",Noise: Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),1


# Clean up the Numerical Features

We cannot reformat things like `Area` and `Baths` and `Year Built` and `Days on Realtor.com` as long as they contain text characters, so we need to remove these in order to correctly format the dataset for training our model.


In [34]:
df_t['Beds'] = df_t['Beds'].str.replace('bed','')
df_t['Baths'] = df_t['Baths'].str.replace('bath','')
df_t['Noise'] = df_t['Noise'].str.replace('Noise:','')
df_t['PropertyType'] = df_t['PropertyType'].str.replace('Property Type','')
df_t['DaysOnRealtor'] = df_t['DaysOnRealtor'].str.replace('Days on Realtor.com','').str.replace('Days','')
df_t['Area'] = df_t['Area'].str.replace('sqft','').str.replace(',','')
df_t['Price'] = df_t['Price'].str.replace('$','').str.replace(',','').str.replace(',','')
df_t['PricePerSQFT'] = df_t['PricePerSQFT'].astype(str).str.replace(',','')
df_t['YearBuilt'] = df_t['YearBuilt'].astype(str).str.replace('Year Built','')
df_t

Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,Current Senator #2 Address,Current Senator #2 Phone,Current Senator #2 Contact form,Current Senator #2 Twitter,Current Senator #2 Bioguide id,Current Senator #2 LIS id,Current Senator #2 Ballotpedia id,Current Senator #2 Wikipedia id,FemaInfo,FloodFactorInfo
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,649000,2,2,1350,Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),10
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,689000,2,2,1479,Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),6
2,2,"100 Lincoln Rd Unit 643, Miami Beach, FL, 33139",https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,650000,1,1,845,Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),10
3,3,"100 Lincoln Rd Unit 702, Miami Beach, FL, 33139",https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,450000,1,1,820,Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),10
4,4,"100 N Ocean Blvd Unit 208, Delray Beach, FL, 3...",https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,2495000,3,3,2150,Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X (),9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1594,1594,"Miami, FL, 33156",https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,349900,2,2,1155,High,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X (),1
1596,1596,"North Miami, FL, 33161",https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,365000,3,1,1128,Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),3
1600,1600,"Pembroke Pines, FL, 33024",https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,469900,5,2,2388,Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),6
1601,1601,"Pembroke Pines, FL, 33027",https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,495000,4,3,2437,Medium,...,716 Hart Senate Office Building Washington DC ...,202-224-5274,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),1


### Split up the `LastSoldAmt` and `LastSoldYear` features

These columns were included together in the scraped data, so we need to split them up accordingly, in order to properly format them as model features.

In [35]:
df_t[['LastSoldAmt','LastSoldYear']] = df_t.LastSold.str.split(' in ', expand=True) 
df_t

Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,Current Senator #2 Contact form,Current Senator #2 Twitter,Current Senator #2 Bioguide id,Current Senator #2 LIS id,Current Senator #2 Ballotpedia id,Current Senator #2 Wikipedia id,FemaInfo,FloodFactorInfo,LastSoldAmt,LastSoldYear
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,649000,2,2,1350,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),10,Last Sold$ 600k,2019
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,689000,2,2,1479,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),6,Last Sold$ 640k,2015
2,2,"100 Lincoln Rd Unit 643, Miami Beach, FL, 33139",https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,650000,1,1,845,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),10,Last Sold$ 520k,2015
3,3,"100 Lincoln Rd Unit 702, Miami Beach, FL, 33139",https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,450000,1,1,820,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),10,Last Sold$ 520k,2014
4,4,"100 N Ocean Blvd Unit 208, Delray Beach, FL, 3...",https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,2495000,3,3,2150,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X (),9,Last Sold$ 1.51M,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1594,1594,"Miami, FL, 33156",https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,349900,2,2,1155,High,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X (),1,Last Sold$ 250k,2012
1596,1596,"North Miami, FL, 33161",https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,365000,3,1,1128,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),3,Last Sold$ 325k,2017
1600,1600,"Pembroke Pines, FL, 33024",https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,469900,5,2,2388,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),6,Last Sold$ 320k,2016
1601,1601,"Pembroke Pines, FL, 33027",https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,495000,4,3,2437,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),1,Last Sold$ 207k,1999


### Cleanup the `LastSoldAmt`

The `LastSoldAmt` data used text characters to indicate thousands and millions, however, for our purposes we need to replace these with their numerical kin.

In [36]:
df_t['LastSoldAmt'] = df_t['LastSoldAmt'].astype(str).str.replace('k','000')
df_t['LastSoldAmt'] = df_t['LastSoldAmt'].astype(str).str.replace('M','000000').str.replace('.','').str.replace('Last Sold','').str.replace('$','').str.replace('000000','0000')
df_t

Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,Current Senator #2 Contact form,Current Senator #2 Twitter,Current Senator #2 Bioguide id,Current Senator #2 LIS id,Current Senator #2 Ballotpedia id,Current Senator #2 Wikipedia id,FemaInfo,FloodFactorInfo,LastSoldAmt,LastSoldYear
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,649000,2,2,1350,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),10,600000,2019
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,689000,2,2,1479,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),6,640000,2015
2,2,"100 Lincoln Rd Unit 643, Miami Beach, FL, 33139",https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,https://ap.rdcpix.com/32ac046dba8dbb679a722ce4...,650000,1,1,845,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),10,520000,2015
3,3,"100 Lincoln Rd Unit 702, Miami Beach, FL, 33139",https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,https://ap.rdcpix.com/71c42415a0332246d3f616e4...,450000,1,1,820,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),10,520000,2014
4,4,"100 N Ocean Blvd Unit 208, Delray Beach, FL, 3...",https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,https://ap.rdcpix.com/5e41042394b16968e83d340b...,2495000,3,3,2150,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X (),9,1510000,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1594,1594,"Miami, FL, 33156",https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,349900,2,2,1155,High,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X (),1,250000,2012
1596,1596,"North Miami, FL, 33161",https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,https://ap.rdcpix.com/aa00b7ec33a3e6b53972a11f...,365000,3,1,1128,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,AE (),3,325000,2017
1600,1600,"Pembroke Pines, FL, 33024",https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,469900,5,2,2388,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),6,320000,2016
1601,1601,"Pembroke Pines, FL, 33027",https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,https://ap.rdcpix.com/08b9b60a18e9af0d89489672...,495000,4,3,2437,Medium,...,https://www.rickscott.senate.gov/contact_rick,SenRickScott,S001217,S404,Rick Scott,Rick Scott,X500 (),1,207000,1999


### Drop Unecessary Columns and Save the Preprocessed Data

In [37]:
df_t = df_t.drop('LastSold',axis=1)
df_t = df_t.drop('index',axis=1)
df_t = df_t.reset_index()
drop_cols = [col for col in df_t.columns if 'url' in col.lower() or ' id' in col.lower()]
X_t = df_t
X_t = X_t.drop(drop_cols,axis=1)
t_datapath = '../data/processed/preprocessed.csv'
X_t.to_csv(t_datapath,index=False)

### Download the Property Images for Later Use
Each property in the dataset comes with 3 urls conatining listing images. While these images are not immediately of concern, we will be using them in a later blog post. For now, we will simply download them.  

In [38]:
import requests

def download_images(indx):
    file_name = str(indx)+'.png'
    urldata = df_t[df_t['index']==indx]
    url1 = df_t['Image_URL'].values[0]
    url2 = df_t['Image_URL1'].values[0]
    url3 = df_t['Image_URL2'].values[0]
    urls = [url1,url2,url3]
    ct=0
    for url in urls:
        response = requests.get(url)
        with open('../data/images/_'+str(ct)+'_'+file_name, "wb") as file:
            file.write(response.content)
            file.close()
        ct+=1
                    
# df_t['index'].apply(download_images)

## Merge the Zillow Data

#### ZHVI User Guide
One of Zillow’s most cited metrics is ZHVI, the Zillow Home Value Index. 

It tells us the typical home value in a given geography (metro area, city, ZIP code, etc.), now and over time. For general information about ZHVI, please refer to this methodology guide and this lighter-hearted video.

When citing ZHVI, it’s important to understand how it works, including its cuts (which are like “flavors”) and naming conventions, and how to calculate common statistics with it. Here’s an overview:

#### Flagship ZHVI

ZHVI is calculated in a variety of cuts, consisting of different home types and adjustments. These cuts are available on Zillow Research’s data page, zillow.com/data.

The most-cited or “flagship” ZHVI is the all homes, middle tier, smoothed and seasonally adjusted cut. It is used by Zillow for most consumer-facing presentations of the ZHVI, such as on our Home Details Pages (the address page for most U.S. homes) and Home Values Pages (which shows ZHVI for various regions).This cut represents the middle of the market for all homes. It is smoothed to soften short-term variability, and seasonally adjusted to remove the effect of the seasonal cycle of housing (and instead focus on longer term trends).

In Zillow’s Econ Data API, this flagship ZHVI cut is keyed with a cutTypeKey of uc_sfrcondo_tier_0.33_0.67_sm_sa.

This same cut is also used to create the Zillow Home Value Forecast (ZHVF).

#### What to call ZHVI

ZHVI represents the “typical” home value for a region. When referring to the ZHVI dollar amount, it should be designated as the “typical home value for the region.” An earlier version of ZHVI represented a median value, but this is no longer the case. Wording should be changed to reflect the new ZHVI, and should be “typical home value” — it is NOT the “median home value”.

#### ZHVI/ZHVF release schedule

ZHVI and ZHVF are published on the Econ Data API on the third Thursday of each month. When data are released, the last valid point for ZHVI is the end of the month prior to the release month; the first month of ZHVF is the end of the month of the release month. For example:

It is the third Thursday of November 2020. Data have been pushed to the API for ZHVI and ZHVF.
The latest datapoint for ZHVI will be 10-31-2020.
The first datapoint for ZHVF is 11-30-2020.
The time point the ZHVI/ZHVF represents is designated in the timePeriodEndDateTime key. 

If you are accessing ZHVI data before the third Thursday of the month, there will be a two-month data lag. For example, if you pull ZHVI data on November 10 (before the third Thursday), the last valid ZHVI point will be 09-30-2020.

#### Calculating historical and forecasted growth

Yearly growth rates for ZHVI are always calculated same month to same month. For example, if you wish to display how much home values have changed over the prior year (in percentage terms), you use the formula:

100 * [ ZHVI_{this month current year} – ZHVI_{this month last year} ] / [ ZHVI_{this month last year} ]

 

To calculate the forecasted growth for the coming year, you use the same formula, but substitute  ZHVF for ZHVI for one point:

100 * [ ZHVF_{this month next year} – ZHVI_{this month current year} ] / [ ZHVI_{this month current year} ]


For more examples and info, [visit Zillow](https://www.zillow.com/research/zhvi-user-guide/)

In [39]:
zillow1beds = pd.read_csv('../data/raw/zillow1bed.csv')
zillow1beds['zipbeds'] = zillow1beds['RegionName'].astype(str)+'_'+str(1)

zillow2beds = pd.read_csv('../data/raw/zillow2bed.csv')
zillow2beds['zipbeds'] = zillow2beds['RegionName'].astype(str)+'_'+str(2)


zillow3beds = pd.read_csv('../data/raw/zillow3bed.csv')
zillow3beds['zipbeds'] = zillow3beds['RegionName'].astype(str)+'_'+str(3)


zillow4beds = pd.read_csv('../data/raw/zillow4bed.csv')
zillow4beds['zipbeds'] = zillow4beds['RegionName'].astype(str)+'_'+str(4)


zillow5beds = pd.read_csv('../data/raw/zillow5bed.csv')
zillow5beds['zipbeds'] = zillow5beds['RegionName'].astype(str)+'_'+str(5)

zillowdata = pd.concat([zillow1beds, zillow2beds, zillow3beds, zillow4beds, zillow5beds])
zillowdata

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,1/31/96,...,5/31/20,6/30/20,7/31/20,8/31/20,9/30/20,10/31/20,11/30/20,12/31/20,1/31/21,zipbeds
0,61639,0,10025,Zip,NY,NY,New York,New York-Newark-Jersey City,New York County,107876.0,...,688401.0,687989.0,686916.0,685278.0,682705.0,681110.0,680490.0,681329.0,681740.0,10025_1
1,84654,1,60657,Zip,IL,IL,Chicago,Chicago-Naperville-Elgin,Cook County,107928.0,...,217353.0,217807.0,218778.0,220062.0,221396.0,222713.0,224034.0,225383.0,226155.0,60657_1
2,61637,2,10023,Zip,NY,NY,New York,New York-Newark-Jersey City,New York County,179075.0,...,762927.0,763024.0,762572.0,762335.0,762696.0,764745.0,766037.0,768289.0,769852.0,10023_1
3,84616,4,60614,Zip,IL,IL,Chicago,Chicago-Naperville-Elgin,Cook County,123763.0,...,246965.0,247455.0,247958.0,248226.0,248181.0,248364.0,249036.0,250130.0,250644.0,60614_1
4,91940,5,77449,Zip,TX,TX,Katy,Houston-The Woodlands-Sugar Land,Harris County,85415.0,...,138753.0,139456.0,140111.0,140617.0,142016.0,143976.0,146953.0,149562.0,151943.0,77449_1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22068,65438,34322,18348,Zip,PA,PA,Pocono Lake,East Stroudsburg,Monroe County,,...,208351.0,208001.0,209010.0,211422.0,214524.0,215693.0,214938.0,216432.0,219000.0,18348_5
22069,66881,34430,21405,Zip,MD,MD,Annapolis,Baltimore-Columbia-Towson,Anne Arundel County,454363.0,...,1041542.0,1044743.0,1047632.0,1054339.0,1059607.0,1071507.0,1080129.0,1099575.0,1110337.0,21405_5
22070,59376,34430,4109,Zip,ME,ME,Portland,Portland-South Portland,Cumberland County,,...,963020.0,970516.0,975945.0,986262.0,995443.0,1010473.0,1027687.0,1050814.0,1077295.0,4109_5
22071,95851,34430,89155,Zip,NV,NV,Las Vegas,Las Vegas-Henderson-Paradise,Clark County,227168.0,...,434705.0,436505.0,440215.0,444924.0,450580.0,454860.0,458650.0,461560.0,465385.0,89155_5


Now we merge on the key we're creating which concatenates the area Zip code and the property bedroom count in `zipbeds`.

In [40]:
# load preprocessed data
t_datapath = '../data/processed/preprocessed.csv'

target = 'Price'

#set to None for production / actual use, set lower for testing
n_rows=None

#set the index
index='index'

X, y = load_data(t_datapath, index=index, target=target, n_rows=n_rows)

y = y.reset_index().drop('index',axis=1).reset_index()[target]
X_t = X.reset_index().drop('index',axis=1).reset_index()
X_t[target]=y

df_t['LastSoldDate'] = '1/31/' + df_t['LastSoldYear'].astype(str).str[2:4]
df_t['zipbeds'] = df_t['Zip'].astype(str).str.replace('zip_','')+'_'+df_t['Beds'].astype(str)
zipbeds = list(set(df_t['zipbeds'].values))
zillowdata['zipbeds'] = zillowdata['zipbeds'].astype(str)
df_t['zipbeds'] = df_t['zipbeds'].astype(str)

             Number of Features
Categorical                  60
Numeric                     640

Number of training examples: 1071
Targets
325000     1.49%
450000     1.21%
350000     1.21%
339000     0.93%
349900     0.84%
           ...  
245000     0.09%
5750000    0.09%
74995      0.09%
2379000    0.09%
256000     0.09%
Name: Price, Length: 567, dtype: object


Columns (5,9,23,700) have mixed types.Specify dtype option on import or set low_memory=False.


In [41]:
df_t = pd.merge(df_t, zillowdata, on='zipbeds')
df_t

Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,4/30/20,5/31/20,6/30/20,7/31/20,8/31/20,9/30/20,10/31/20,11/30/20,12/31/20,1/31/21
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,649000,2,2,1350,Medium,...,501044.0,502600.0,503676.0,503465.0,503912.0,504745.0,506914.0,509148.0,511553.0,511939.0
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,689000,2,2,1479,Medium,...,338394.0,340301.0,341701.0,342476.0,343887.0,345737.0,347743.0,349896.0,352125.0,353530.0
2,631,"2375 SE 10th St, Pompano Beach, FL, 33062",https://ap.rdcpix.com/91b0b7d798e068eb11fca991...,https://ap.rdcpix.com/91b0b7d798e068eb11fca991...,https://ap.rdcpix.com/91b0b7d798e068eb11fca991...,689900,2,2,1522,Medium,...,338394.0,340301.0,341701.0,342476.0,343887.0,345737.0,347743.0,349896.0,352125.0,353530.0
3,1327,"710 N Ocean Blvd Apt 503, Pompano Beach, FL, 3...",https://ap.rdcpix.com/716d92179deb6bba96fb108c...,https://ap.rdcpix.com/716d92179deb6bba96fb108c...,https://ap.rdcpix.com/716d92179deb6bba96fb108c...,369900,2,2,1028,Medium,...,338394.0,340301.0,341701.0,342476.0,343887.0,345737.0,347743.0,349896.0,352125.0,353530.0
4,1361,"750 N Ocean Blvd Apt 1101, Pompano Beach, FL, ...",https://ap.rdcpix.com/0a6b5a0e9bdc54d41f6d9eed...,https://ap.rdcpix.com/0a6b5a0e9bdc54d41f6d9eed...,https://ap.rdcpix.com/0a6b5a0e9bdc54d41f6d9eed...,699900,2,2,1140,Medium,...,338394.0,340301.0,341701.0,342476.0,343887.0,345737.0,347743.0,349896.0,352125.0,353530.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1022,1588,"Hialeah, FL, 33014",https://ap.rdcpix.com/5cb22efbeeacbe94e2c36bd8...,https://ap.rdcpix.com/5cb22efbeeacbe94e2c36bd8...,https://ap.rdcpix.com/5cb22efbeeacbe94e2c36bd8...,495000,3,2,2250,Medium,...,350231.0,351812.0,353318.0,355472.0,357916.0,360602.0,362943.0,364841.0,368020.0,372115.0
1023,1592,"Miami Gardens, FL, 33055",https://ap.rdcpix.com/32a80c05c177393e5c2cb991...,https://ap.rdcpix.com/32a80c05c177393e5c2cb991...,https://ap.rdcpix.com/32a80c05c177393e5c2cb991...,325000,4,2,1652,Medium,...,334367.0,335556.0,337416.0,339233.0,341314.0,343987.0,346848.0,349629.0,353215.0,357360.0
1024,1594,"Miami, FL, 33156",https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,349900,2,2,1155,High,...,267461.0,268577.0,269611.0,270693.0,271905.0,273320.0,274689.0,275988.0,278071.0,280484.0
1025,1600,"Pembroke Pines, FL, 33024",https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,469900,5,2,2388,Medium,...,587467.0,590968.0,594095.0,595700.0,598211.0,601264.0,606193.0,610923.0,616772.0,620969.0


## Calculate the rate of change for each Property, and its `comparables`

In real estate, a `comparable` is a nearby property with similar features, such as the same number of bedrooms. In our case, for each property we merged on `zipbeds` we want to train a model to predict the difference in the rate of change of the price vs the rate of change in price for nearby comparables. To do this, we will find the `LastSoldDate` column for the target property, and lookup the corresponding ZHVI rate of change from that date until now.

In [42]:
X_t = df_t.copy()

time_series_cols = [col for col in X_t.columns if '/' in col and 'Percentage' not in col and 'Value' not in col and 'Margin of error' not in col and 'Metro' not in col and col != '1/31/21']

l = []
for ct in range(len(X_t)):
    try:
        indx = X_t['index'].values[ct]
        last_sold_date = X_t['LastSoldDate'].values[ct]
        zillow_price = X_t[last_sold_date].values[ct]
        X_ts = X_t[X_t['index']==indx]
        X_ts['zillow_price'] = zillow_price
        X_ts['zillow_price_change'] = X_ts['1/31/21'].astype(float) - X_ts['zillow_price'].astype(float)
        X_ts['zillow_price_change_rate'] = X_ts['zillow_price_change'].astype(float) / float(2021.0 - X_ts['LastSoldYear'].astype(float))
        X_ts['zillow_price_change_percent'] = X_ts['zillow_price_change'].astype(float) / X_ts['zillow_price'].astype(float)
        l.append(X_ts)
    except: pass
    
df = pd.concat(l)

df['last_sold_price_change'] = df['Price'].astype(float) - df['LastSoldAmt'].astype(float)
df['last_sold_price_change_percent'] = (df['Price'].astype(float) - df['LastSoldAmt'].astype(float)) / df['LastSoldAmt'].astype(float)
df['last_sold_price_change_rate'] = df['last_sold_price_change'].astype(float) / float(2021.0 - X_ts['LastSoldYear'].astype(float))
df['yearly_price_delta'] = df['last_sold_price_change_rate'].astype(float) - df['zillow_price_change_rate'].astype(float)
df

Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,12/31/20,1/31/21,zillow_price,zillow_price_change,zillow_price_change_rate,zillow_price_change_percent,last_sold_price_change,last_sold_price_change_percent,last_sold_price_change_rate,yearly_price_delta
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,649000,2,2,1350,Medium,...,511553.0,511939.0,495061.0,16878.0,8439.000000,0.034093,49000.0,0.081667,2227.272727,-6211.727273
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,689000,2,2,1479,Medium,...,352125.0,353530.0,280804.0,72726.0,12121.000000,0.258992,49000.0,0.076563,2227.272727,-9893.727273
2,631,"2375 SE 10th St, Pompano Beach, FL, 33062",https://ap.rdcpix.com/91b0b7d798e068eb11fca991...,https://ap.rdcpix.com/91b0b7d798e068eb11fca991...,https://ap.rdcpix.com/91b0b7d798e068eb11fca991...,689900,2,2,1522,Medium,...,352125.0,353530.0,221403.0,132127.0,12011.545455,0.596771,394900.0,1.338644,17950.000000,5938.454545
3,1327,"710 N Ocean Blvd Apt 503, Pompano Beach, FL, 3...",https://ap.rdcpix.com/716d92179deb6bba96fb108c...,https://ap.rdcpix.com/716d92179deb6bba96fb108c...,https://ap.rdcpix.com/716d92179deb6bba96fb108c...,369900,2,2,1028,Medium,...,352125.0,353530.0,329454.0,24076.0,12038.000000,0.073078,109900.0,0.422692,4995.454545,-7042.545455
4,1361,"750 N Ocean Blvd Apt 1101, Pompano Beach, FL, ...",https://ap.rdcpix.com/0a6b5a0e9bdc54d41f6d9eed...,https://ap.rdcpix.com/0a6b5a0e9bdc54d41f6d9eed...,https://ap.rdcpix.com/0a6b5a0e9bdc54d41f6d9eed...,699900,2,2,1140,Medium,...,352125.0,353530.0,219779.0,133751.0,14861.222222,0.608570,456900.0,1.880247,20768.181818,5906.959596
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1022,1588,"Hialeah, FL, 33014",https://ap.rdcpix.com/5cb22efbeeacbe94e2c36bd8...,https://ap.rdcpix.com/5cb22efbeeacbe94e2c36bd8...,https://ap.rdcpix.com/5cb22efbeeacbe94e2c36bd8...,495000,3,2,2250,Medium,...,368020.0,372115.0,194998.0,177117.0,22139.625000,0.908302,303000.0,1.578125,13772.727273,-8366.897727
1023,1592,"Miami Gardens, FL, 33055",https://ap.rdcpix.com/32a80c05c177393e5c2cb991...,https://ap.rdcpix.com/32a80c05c177393e5c2cb991...,https://ap.rdcpix.com/32a80c05c177393e5c2cb991...,325000,4,2,1652,Medium,...,353215.0,357360.0,329200.0,28160.0,28160.000000,0.085541,-5000.0,-0.015152,-227.272727,-28387.272727
1024,1594,"Miami, FL, 33156",https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,349900,2,2,1155,High,...,278071.0,280484.0,154974.0,125510.0,13945.555556,0.809878,99900.0,0.399600,4540.909091,-9404.646465
1025,1600,"Pembroke Pines, FL, 33024",https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,469900,5,2,2388,Medium,...,616772.0,620969.0,501475.0,119494.0,23898.800000,0.238285,149900.0,0.468438,6813.636364,-17085.163636


## Defining the Target Variable we will train our model to predict.

For our initial blog entry, we will use the feature `yearly_price_delta_percent` as our target variable. It is defined as follows:

In [43]:
df['yearly_price_delta_percent'] = df['last_sold_price_change_percent'].astype(float) -  df['zillow_price_change_percent'].astype(float)
df

Unnamed: 0,index,Address_x,Image_URL,Image_URL1,Image_URL2,Price,Beds,Baths,Area,Noise,...,1/31/21,zillow_price,zillow_price_change,zillow_price_change_rate,zillow_price_change_percent,last_sold_price_change,last_sold_price_change_percent,last_sold_price_change_rate,yearly_price_delta,yearly_price_delta_percent
0,0,"1 Las Olas Cir Apt 517, Fort Lauderdale, FL, 3...",https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,https://ap.rdcpix.com/0a94ad8f3794006d77fd642e...,649000,2,2,1350,Medium,...,511939.0,495061.0,16878.0,8439.000000,0.034093,49000.0,0.081667,2227.272727,-6211.727273,0.047574
1,1,"1 N Ocean Blvd Apt 610, Pompano Beach, FL, 33062",https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,https://ap.rdcpix.com/1041dd209aa1020ef9d06dc7...,689000,2,2,1479,Medium,...,353530.0,280804.0,72726.0,12121.000000,0.258992,49000.0,0.076563,2227.272727,-9893.727273,-0.182430
2,631,"2375 SE 10th St, Pompano Beach, FL, 33062",https://ap.rdcpix.com/91b0b7d798e068eb11fca991...,https://ap.rdcpix.com/91b0b7d798e068eb11fca991...,https://ap.rdcpix.com/91b0b7d798e068eb11fca991...,689900,2,2,1522,Medium,...,353530.0,221403.0,132127.0,12011.545455,0.596771,394900.0,1.338644,17950.000000,5938.454545,0.741873
3,1327,"710 N Ocean Blvd Apt 503, Pompano Beach, FL, 3...",https://ap.rdcpix.com/716d92179deb6bba96fb108c...,https://ap.rdcpix.com/716d92179deb6bba96fb108c...,https://ap.rdcpix.com/716d92179deb6bba96fb108c...,369900,2,2,1028,Medium,...,353530.0,329454.0,24076.0,12038.000000,0.073078,109900.0,0.422692,4995.454545,-7042.545455,0.349614
4,1361,"750 N Ocean Blvd Apt 1101, Pompano Beach, FL, ...",https://ap.rdcpix.com/0a6b5a0e9bdc54d41f6d9eed...,https://ap.rdcpix.com/0a6b5a0e9bdc54d41f6d9eed...,https://ap.rdcpix.com/0a6b5a0e9bdc54d41f6d9eed...,699900,2,2,1140,Medium,...,353530.0,219779.0,133751.0,14861.222222,0.608570,456900.0,1.880247,20768.181818,5906.959596,1.271676
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1022,1588,"Hialeah, FL, 33014",https://ap.rdcpix.com/5cb22efbeeacbe94e2c36bd8...,https://ap.rdcpix.com/5cb22efbeeacbe94e2c36bd8...,https://ap.rdcpix.com/5cb22efbeeacbe94e2c36bd8...,495000,3,2,2250,Medium,...,372115.0,194998.0,177117.0,22139.625000,0.908302,303000.0,1.578125,13772.727273,-8366.897727,0.669823
1023,1592,"Miami Gardens, FL, 33055",https://ap.rdcpix.com/32a80c05c177393e5c2cb991...,https://ap.rdcpix.com/32a80c05c177393e5c2cb991...,https://ap.rdcpix.com/32a80c05c177393e5c2cb991...,325000,4,2,1652,Medium,...,357360.0,329200.0,28160.0,28160.000000,0.085541,-5000.0,-0.015152,-227.272727,-28387.272727,-0.100692
1024,1594,"Miami, FL, 33156",https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,https://ap.rdcpix.com/59a889d85648bd7c78f0b83f...,349900,2,2,1155,High,...,280484.0,154974.0,125510.0,13945.555556,0.809878,99900.0,0.399600,4540.909091,-9404.646465,-0.410278
1025,1600,"Pembroke Pines, FL, 33024",https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,https://ap.rdcpix.com/9948ccf7025e9b51f5586a14...,469900,5,2,2388,Medium,...,620969.0,501475.0,119494.0,23898.800000,0.238285,149900.0,0.468438,6813.636364,-17085.163636,0.230152


## Final Cleanup & Save

A few straggler text characters remain in some numerical columns, so remove them then save.

In [44]:
df = df.drop(time_series_cols, axis=1)
df =df[df['LastSoldAmt'] != 'Property TypeTownhome']
df = df[df['LastSoldAmt'] != 'Property TypeSingle Family Home']
df.to_csv('../data/processed/zillow_merged.csv',index=False)