# Tampa Real-Estate Recommender
## Feature Engineering
TB Real Estate Corporation is a real estate investment firm in the Tampa Bay, Florida area. The local market is very active and single-family homes sell quickly, so the firm needs to assess the value of homes coming onto the market quickly and accurately enough to make a competitive offer ahead of other buyers. Comparing each listing price against the predicted sale price lets the firm identify properties that may be priced below market value and would make good investments.  
<br>
The objective of feature engineering is to prepare the features and build the training and testing datasets that the machine learning models will use to predict the sale price of residential properties.

# 1 Imports and File Locations<a id='1'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import category_encoders as ce

In [2]:
ext_data = '../data/external/'
raw_data = '../data/raw/'
interim_data = '../data/interim/'
report_figures = '../reports/figures/'

# 2 Read Sales Data into DataFrame<a id='2'></a>

In [3]:
df = pd.read_csv(interim_data + 'sales_df.csv', dtype={'FOLIO': object, 'DOR_CODE': object, 'NBHC': object, 'SECTION_CD': object, 'TOWNSHIP_CD': object, 'RANGE_CD': object}, parse_dates=['S_DATE'])
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847102 entries, 0 to 847101
Data columns (total 39 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   FOLIO            847102 non-null  object        
 1   DOR_CODE         847102 non-null  object        
 2   S_DATE           847102 non-null  datetime64[ns]
 3   VI               847102 non-null  object        
 4   QU               847102 non-null  object        
 5   REA_CD           847102 non-null  object        
 6   S_AMT            847102 non-null  float64       
 7   S_TYPE           847102 non-null  object        
 8   ORIG_SALES_DATE  847102 non-null  object        
 9   SITE_ADDR        847013 non-null  object        
 10  SITE_CITY        847095 non-null  object        
 11  SITE_ZIP         847102 non-null  object        
 12  tBEDS            847102 non-null  float64       
 13  tBATHS           847102 non-null  float64       
 14  tSTORIES         847

# 3 Feature Engineering<a id='3'></a>

In [4]:
# Replace DOR_CODE with boolean for Single Family House
df['single_family_house'] = df['DOR_CODE'] == '0100'

In [5]:
# Replace S_DATE with the integer number of days since the Unix epoch (1970-01-01)
df['sales_date_epoch'] = (df['S_DATE'] - dt.datetime(1970,1,1)).dt.days
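
As a quick check of the conversion above, the sketch below applies the same subtraction to three hand-picked dates. The dates and the `sample` frame are illustrative only and are not rows from `sales_df.csv`.

```python
import datetime as dt
import pandas as pd

# Illustrative dates only; not taken from the sales data
sample = pd.DataFrame({'S_DATE': pd.to_datetime(['1970-01-01', '1970-01-02', '2000-01-01'])})

# Same expression as the cell above: whole days elapsed since 1970-01-01
sample['sales_date_epoch'] = (sample['S_DATE'] - dt.datetime(1970, 1, 1)).dt.days
print(sample['sales_date_epoch'].tolist())  # [0, 1, 10957]
```

Storing the date as a single increasing integer lets a model treat the sale date as an ordinary numeric feature.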

In [6]:
# Replace VI with boolean for Improved vs. Vacant
df['improved'] = df['VI'] == 'I'

In [7]:
# Calculate the age of the home in years
df['age_of_home'] = df['S_DATE'].dt.year - df['ACT']

In [8]:
# Extract 5-digit zip codes and group any with fewer than 1000 sales into 'Other'
df['zip5'] = df['SITE_ZIP'].astype(str).str[:5]
zip_counts = df['zip5'].value_counts()
mask = df['zip5'].isin(zip_counts[zip_counts < 1000].index)
# Assign through .loc so the DataFrame itself is updated without a chained-assignment warning
df.loc[mask, 'zip5'] = 'Other'
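
A minimal sketch of the same rare-category grouping on toy data, writing through `.loc` so that the frame is modified directly. The `toy` frame and the `min_sales` threshold of 3 are assumptions for illustration; the notebook uses a threshold of 1000.

```python
import pandas as pd

# Toy data; the threshold of 3 stands in for the notebook's 1000
toy = pd.DataFrame({'zip5': ['33556'] * 4 + ['33602'] * 3 + ['33701'] * 2 + ['33510']})
min_sales = 3  # hypothetical threshold for this toy example

counts = toy['zip5'].value_counts()
rare = counts[counts < min_sales].index

# Zip codes with fewer than min_sales rows are collapsed into one 'Other' bucket
toy.loc[toy['zip5'].isin(rare), 'zip5'] = 'Other'
print(toy['zip5'].value_counts().to_dict())
```

Grouping rare zip codes keeps the number of encoded categories small and avoids categories with too few sales to learn from.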


In [9]:
# Binary Encode zip5 values
ce_binary = ce.BinaryEncoder(cols = ['zip5'])
df = ce_binary.fit_transform(df)



In [10]:
df[[i for i in df.columns if i.startswith('zip5')]+['SITE_ZIP']].head()

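|   | zip5_0 | zip5_1 | zip5_2 | zip5_3 | zip5_4 | zip5_5 | zip5_6 | SITE_ZIP |
|---|--------|--------|--------|--------|--------|--------|--------|----------|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 33556 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 33556 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 33556 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 33556 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 33556 |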
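
The rows above show the first zip code mapped to a single trailing 1, which fits the general idea of binary encoding: each category gets an integer index, and the bits of that index are spread across a few columns. The sketch below illustrates that idea in plain pandas. It is not the `category_encoders` implementation; `binary_encode` is a hypothetical helper, and its numbering of categories by first appearance is an assumption.

```python
import numpy as np
import pandas as pd

def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Hypothetical helper: spread each category's integer index over bit columns."""
    # Number categories 1..k in order of first appearance (an assumption of this sketch)
    codes = pd.factorize(series)[0] + 1
    n_bits = max(int(codes.max()).bit_length(), 1)
    # Column 0 holds the most significant bit
    bits = {f'{prefix}_{i}': (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
    return pd.DataFrame(bits, index=series.index)

toy = pd.Series(['33556', '33602', 'Other', '33556', '33701'])
encoded = binary_encode(toy, 'zip5')
print(encoded)
```

With this scheme the column count grows roughly with the logarithm of the number of categories, which is why a high-cardinality field such as a zip code needs far fewer columns than one-hot encoding would produce.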


In [11]:
# Bin and One-Hot Encode number of bedrooms
bins = [-np.inf, 1.9, 2.9, 3.9, 4.9, 5.9, np.inf]
labels = ['<2', '2', '3', '4', '5', '>5']
df['bedrooms_binned'] = pd.cut(df['tBEDS'], bins=bins, labels=labels)
df = pd.get_dummies(df, columns=['bedrooms_binned'], drop_first=True, prefix='BED')
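
To see how the bin edges and labels line up, the sketch below runs the same `pd.cut` and `pd.get_dummies` calls on a handful of made-up bedroom counts. The `toy` frame is illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'tBEDS': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 7.0]})  # illustrative values
bins = [-np.inf, 1.9, 2.9, 3.9, 4.9, 5.9, np.inf]
labels = ['<2', '2', '3', '4', '5', '>5']

# Intervals are right-closed by default, so (1.9, 2.9] receives the label '2'
toy['bedrooms_binned'] = pd.cut(toy['tBEDS'], bins=bins, labels=labels)
binned = toy['bedrooms_binned'].astype(str).tolist()

# drop_first=True drops the first category ('<2'), which becomes the baseline
dummies = pd.get_dummies(toy, columns=['bedrooms_binned'], drop_first=True, prefix='BED')
print(binned)
print(list(dummies.columns))
```

A row with all `BED_*` columns equal to zero therefore represents a home with fewer than two bedrooms.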

In [12]:
# Bin and One-Hot Encode number of bathrooms
bins = [-np.inf, 0.9, 1.4, 1.9, 2.4, 2.9, 3.4, 3.9, 4.4, 4.9, 5.4, np.inf]
labels = ['<1.0', '1.0', '1.5', '2.0', '2.5', '3.0', '3.5', '4.0', '4.5', '5.0', '>5.0']
df['bathrooms_binned'] = pd.cut(df['tBATHS'], bins=bins, labels=labels)
df = pd.get_dummies(df, columns=['bathrooms_binned'], drop_first=True, prefix='BATH')

In [13]:
# Replace tSTORIES with boolean for Single Story vs. Multiple Stories
df['single_story'] = df['tSTORIES'] < 2.0

In [14]:
# Replace tUNITS with boolean for Single Unit vs. Multiple Units
df['single_unit'] = df['tUNITS'] < 2.0

In [15]:
# Replace tBLDGS with boolean for Single Building vs. Multiple Buildings
df['single_building'] = df['tBLDGS'] < 2.0

In [16]:
# Create Market Area and One-Hot Encode
df['MARKET_AREA_CD'] = df['NBHC'].astype(str).str[1:3]
df = pd.get_dummies(df, columns=['MARKET_AREA_CD'], drop_first=True, prefix='MKT')

In [17]:
# Binary Encode neighborhood code values
ce_binary = ce.BinaryEncoder(cols = ['NBHC'])
df = ce_binary.fit_transform(df)



In [18]:
# Convert Municipality Codes to Names and One-Hot Encode
df['MUNICIPALITY_CD'] = df['MUNICIPALITY_CD'].replace({'A': 'Tampa', 'T': 'Temple Terrace', 'P': 'Plant City', 'U': 'Unincorporated'})
df = pd.get_dummies(df, columns=['MUNICIPALITY_CD'], drop_first=True, prefix='CITY')

In [19]:
# Convert Township and Range Codes to single value
df['TOWNSHIP_RANGE'] = (df['RANGE_CD'].astype(int) - 16) + ((df['TOWNSHIP_CD'].astype(int) - 27)*6)
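
A worked example of the expression above may help. The sketch assumes townships start at 27 and ranges at 17 with six ranges per township, which the offsets 27, 16, and 6 suggest but the notebook does not state. `township_range_index` is a hypothetical helper that mirrors the expression.

```python
def township_range_index(township: int, range_: int) -> int:
    """Hypothetical helper mirroring the notebook's TOWNSHIP_RANGE expression."""
    return (range_ - 16) + (township - 27) * 6

print(township_range_index(27, 17))  # 1
print(township_range_index(28, 17))  # 7
print(township_range_index(32, 22))  # 36
```

Under those assumptions each township/range pair maps to a distinct integer, numbered row by row across the grid.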

In [20]:
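# Convert Section Codes to unique values across all Township/Ranges.
# SECTION_CD was read as a string, so `*` repeats the code TOWNSHIP_RANGE times (e.g. '05' * 2 -> '0505')
df['SECTION_CD'] = df['SECTION_CD'] * df['TOWNSHIP_RANGE']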

In [21]:
# One-Hot Encode Township - Range
df = pd.get_dummies(df, columns=['TOWNSHIP_RANGE'], drop_first=True, prefix='TR')

In [22]:
# Binary Encode Section Code values
ce_binary = ce.BinaryEncoder(cols = ['SECTION_CD'])
df = ce_binary.fit_transform(df)



In [23]:
# Create boolean for planned community (platted land) where Land Type ID is not 'ZZZ'
df['planned_community'] = df['LAND_TYPE_ID'] != 'ZZZ'

# 4 Create Training and Testing Datasets<a id='4'></a>

In [24]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847102 entries, 0 to 847101
Data columns (total 150 columns):
 #    Column               Dtype         
---   ------               -----         
 0    FOLIO                object        
 1    DOR_CODE             object        
 2    S_DATE               datetime64[ns]
 3    VI                   object        
 4    QU                   object        
 5    REA_CD               object        
 6    S_AMT                float64       
 7    S_TYPE               object        
 8    ORIG_SALES_DATE      object        
 9    SITE_ADDR            object        
 10   SITE_CITY            object        
 11   SITE_ZIP             object        
 12   tBEDS                float64       
 13   tBATHS               float64       
 14   tSTORIES             float64       
 15   tUNITS               float64       
 16   tBLDGS               float64       
 17   JUST                 float64       
 18   LAND                 float64       
 19   

In [25]:
y = df[['S_AMT']].copy()

In [27]:
drop_cols = ['S_AMT', 'FOLIO', 'DOR_CODE', 'S_DATE', 'VI', 'QU', 'REA_CD', 'S_TYPE', 'ORIG_SALES_DATE', 'SITE_ADDR', 
             'SITE_CITY', 'SITE_ZIP', 'tBEDS', 'tBATHS', 'tSTORIES', 'tUNITS', 'tBLDGS', 'ACT', 'EFF', 'SD1', 'SD2', 
             'TIF', 'BASE', 'TOWNSHIP_CD', 'RANGE_CD', 'LAND_TYPE_ID', 'BLOCK_NUM', 'LOT_NUM']
X = df.drop(drop_cols, axis=1)
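
The split into training and testing sets is not shown in this part of the notebook. As one common approach, the sketch below performs a shuffled 80/20 hold-out on toy stand-ins for `X` and `y`. The split ratio, the seed, and every name ending in `_toy` are assumptions, not the notebook's own choices.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X and y; values are illustrative only
X_toy = pd.DataFrame({'JUST': np.arange(10.0), 'ACREAGE': np.arange(10.0) / 10})
y_toy = pd.DataFrame({'S_AMT': np.arange(10.0) * 1000})

rng = np.random.default_rng(42)          # assumed seed, for reproducibility
shuffled = rng.permutation(len(X_toy))   # shuffled row positions
n_test = int(len(X_toy) * 0.2)           # assumed 20% hold-out

# Use the same row positions for X and y so features and targets stay aligned
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
X_train, X_test = X_toy.iloc[train_idx], X_toy.iloc[test_idx]
y_train, y_test = y_toy.iloc[train_idx], y_toy.iloc[test_idx]
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Because the features include a sale date, a split by date rather than a random shuffle may be worth considering if the model is meant to predict future sales.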

In [28]:
X.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847102 entries, 0 to 847101
Data columns (total 122 columns):
 #    Column               Dtype  
---   ------               -----  
 0    JUST                 float64
 1    LAND                 float64
 2    BLDG                 float64
 3    EXF                  float64
 4    HEAT_AR              float64
 5    ASD_VAL              float64
 6    TAX_VAL              float64
 7    ACREAGE              float64
 8    NBHC_0               int64  
 9    NBHC_1               int64  
 10   NBHC_2               int64  
 11   NBHC_3               int64  
 12   NBHC_4               int64  
 13   NBHC_5               int64  
 14   NBHC_6               int64  
 15   NBHC_7               int64  
 16   NBHC_8               int64  
 17   NBHC_9               int64  
 18   SECTION_CD_0         int64  
 19   SECTION_CD_1         int64  
 20   SECTION_CD_2         int64  
 21   SECTION_CD_3         int64  
 22   SECTION_CD_4         int64  
 23   SECTION