# Tampa Real-Estate Recommender
## Exploratory Data Analysis
TB Real Estate Corporation is a real estate investment firm in the Tampa Bay, Florida area. The local market is very active, and single-family homes sell quickly. To beat competitors to a competitive offer, the firm needs to assess the value of homes coming onto the market quickly and accurately. Comparing each listing price against the predicted sale price lets the firm identify properties that may be priced below market value and would make good investments.
<br>
The objective of the EDA is to identify which features are the best predictors of sales price for residential properties.
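
As a rough illustration of the comparison described above, the sketch below flags listings whose asking price sits below a predicted sale price by some margin. The column names `LIST_PRICE` and `PREDICTED_PRICE`, the sample values, and the 10% margin are all assumptions made for the example; none of them come from the project data.

```python
import pandas as pd

# Hypothetical listings: both price columns and their values are
# illustrative assumptions, not columns from the sales data set.
listings = pd.DataFrame({
    "LIST_PRICE": [250_000, 310_000, 180_000],
    "PREDICTED_PRICE": [300_000, 305_000, 175_000],
})

MARGIN = 0.10  # assumed minimum discount to treat a listing as underpriced

# Discount of the asking price relative to the predicted sale price.
listings["DISCOUNT"] = 1 - listings["LIST_PRICE"] / listings["PREDICTED_PRICE"]

# Flag listings priced at least MARGIN below the prediction.
listings["UNDERPRICED"] = listings["DISCOUNT"] >= MARGIN
print(listings)
```

How large the margin should be depends on how accurate the price model turns out to be, so it is best treated as a tunable parameter.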

# 1 Imports and File Locations<a id='1'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

In [2]:
ext_data = '../data/external/'
raw_data = '../data/raw/'
interim_data = '../data/interim/'
report_figures = '../reports/figures/'

# 2 Read Sales data into dataframe<a id='2'></a>

In [3]:
sales_df = pd.read_csv(interim_data + 'sales_df.csv')
sales_df.head()


Unnamed: 0,FOLIO,DOR_CODE,S_DATE,VI,QU,REA_CD,S_AMT,S_TYPE,ORIG_SALES_DATE,SITE_ADDR,...,BASE,ACREAGE,NBHC,MUNICIPALITY_CD,SECTION_CD,TOWNSHIP_CD,RANGE_CD,LAND_TYPE_ID,BLOCK_NUM,LOT_NUM
0,80100,100,1987-08-01,I,Q,1,50000.0,WD,1985-11-01,19859 ANGEL LN,...,2016,5.05878,211007.0,U,1,27,17,1,0,1.1
1,80100,100,1985-11-01,V,Q,1,24000.0,WD,1985-11-01,19859 ANGEL LN,...,2016,5.05878,211007.0,U,1,27,17,1,0,1.1
2,90100,100,2021-10-27,I,Q,1,750000.0,WD,1973-01-01,19913 ANGEL LN,...,1973,4.43849,211007.0,U,1,27,17,1,0,2.1
3,90100,100,1997-05-01,I,Q,1,169900.0,WD,1973-01-01,19913 ANGEL LN,...,1973,4.43849,211007.0,U,1,27,17,1,0,2.1
4,100000,100,1988-06-01,I,Q,1,52500.0,WD,1977-12-01,6934 W COUNTY LINE RD,...,1994,0.992559,211007.0,U,1,27,17,1,0,3.0


In [4]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847102 entries, 0 to 847101
Data columns (total 39 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   FOLIO            847102 non-null  int64  
 1   DOR_CODE         847102 non-null  int64  
 2   S_DATE           847102 non-null  object 
 3   VI               847102 non-null  object 
 4   QU               847102 non-null  object 
 5   REA_CD           847102 non-null  object 
 6   S_AMT            847102 non-null  float64
 7   S_TYPE           847102 non-null  object 
 8   ORIG_SALES_DATE  847102 non-null  object 
 9   SITE_ADDR        847013 non-null  object 
 10  SITE_CITY        847095 non-null  object 
 11  SITE_ZIP         847102 non-null  object 
 12  tBEDS            847102 non-null  float64
 13  tBATHS           847102 non-null  float64
 14  tSTORIES         847102 non-null  float64
 15  tUNITS           847102 non-null  float64
 16  tBLDGS           847102 non-null  floa

In [None]:
# sales_curr_df (sales on or after 2021-01-01) is created in a later cell; run that cell first.
sales_curr_df['NBHC'].astype(str).str[1:3].unique()

In [None]:
# Days between each sale date and the Unix epoch (1970-01-01).
# S_DATE is read from the CSV as strings, so parse it before subtracting.
sales_df['S_DATE_epoch'] = (pd.to_datetime(sales_df['S_DATE']) - pd.Timestamp('1970-01-01')).dt.days
sales_df['S_DATE_epoch'].describe()

In [None]:
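plt.subplots(figsize=(12,10))
# numeric_only=True (pandas >= 1.5) skips the text columns, which cannot be correlated.
sns.heatmap(sales_df.corr(numeric_only=True));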

In [None]:
sales_curr_df = sales_df[sales_df['S_DATE'] >= '2021-01-01']

In [None]:
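plt.subplots(figsize=(12,10))
# numeric_only=True (pandas >= 1.5) skips the text columns, which cannot be correlated.
sns.heatmap(sales_curr_df.corr(numeric_only=True));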

**S_AMT** is the sale price of the property.  This will be the target feature to predict.
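
Since the goal is to find the best predictors of the target, one simple first pass is to rank the numeric columns by the absolute value of their Pearson correlation with `S_AMT`. The sketch below is only one possible screening step: the helper `rank_by_target_corr` is a hypothetical name, and the toy values are made up for illustration. Correlation captures linear association only, so a low rank does not prove a feature is useless.

```python
import pandas as pd

def rank_by_target_corr(df: pd.DataFrame, target: str = "S_AMT") -> pd.Series:
    """Rank numeric columns by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")  # drop text columns
    corr = numeric.corr()[target].drop(target)    # correlation of each feature with target
    return corr.abs().sort_values(ascending=False)

# Toy data with made-up values; only the column names mirror the sales data.
toy = pd.DataFrame({
    "S_AMT": [100.0, 200.0, 300.0, 400.0],
    "tBEDS": [1.0, 2.0, 3.0, 4.0],
    "ACREAGE": [1.0, 3.0, 2.0, 4.0],
    "SITE_CITY": ["A", "B", "C", "D"],
})

ranking = rank_by_target_corr(toy)
print(ranking)
```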

In [None]:
print(sales_df['S_AMT'].describe().apply(lambda x: format(x, 'f')))

In [None]:
S_AMT_mean = sales_df['S_AMT'].mean()
S_AMT_std = sales_df['S_AMT'].std()
_ = plt.subplots(figsize=(15, 5))
_ = plt.hist(data=sales_df, x='S_AMT', bins=[0, 100000, 200000, 300000, 400000, 500000, 1000000])
_ = plt.xlabel('Sales Amount')
_ = plt.ylabel('count')
_ = plt.title('Distribution of Property Sale Amounts')
_ = plt.ticklabel_format(useOffset=False, style='plain')
_ = plt.axvline(S_AMT_mean, color='r')
_ = plt.axvline(S_AMT_mean+S_AMT_std, color='r', linestyle='--')
_ = plt.axvline(S_AMT_mean+(2*S_AMT_std), color='r', linestyle='-.')

**FOLIO** is a unique identifier for a property.  It will not impact the sales price and can be dropped from the analysis.
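
The preview rows above show the same `FOLIO` on more than one sale, so the identifier is unique per property rather than per row. Before dropping the column, it may be worth checking how many properties sold more than once. The sketch below does that on a toy frame built from the five preview rows; the variable names are illustrative.

```python
import pandas as pd

# The five rows shown in the sales_df.head() preview.
toy = pd.DataFrame({
    "FOLIO": [80100, 80100, 90100, 90100, 100000],
    "S_AMT": [50000.0, 24000.0, 750000.0, 169900.0, 52500.0],
})

# Number of recorded sales for each property.
sales_per_folio = toy.groupby("FOLIO").size()

# Count of distinct properties and share of them with repeat sales.
n_properties = toy["FOLIO"].nunique()
repeat_share = (sales_per_folio > 1).mean()

print(sales_per_folio)
print(n_properties, repeat_share)
```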

In [None]:
sales_df['FOLIO'].astype(str).describe()

In [None]:
sales_df.drop('FOLIO', axis=1, inplace=True)

**S_DATE** is the date of the sale. Property sales increased sharply from the mid-1990s until the mid-2000s, when the U.S. mortgage crisis crashed the housing market. Over the 2010s, the annual number of property sales recovered to near pre-crash levels.
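
One way to put numbers on a rise or fall in sales activity is to count sales per year and compute the year-over-year percentage change. The sketch below uses a handful of made-up dates purely to show the mechanics; it is not a summary of the actual data.

```python
import pandas as pd

# Made-up sale dates for illustration only.
toy = pd.DataFrame({"S_DATE": pd.to_datetime([
    "2005-03-01", "2005-07-15", "2005-11-30", "2005-12-01",
    "2006-02-10", "2006-08-20",
    "2007-05-05",
])})

# Sales per calendar year, ordered by year.
counts = toy["S_DATE"].dt.year.value_counts().sort_index()

# Fractional change relative to the previous year (first year is NaN).
yoy = counts.pct_change()

print(pd.DataFrame({"sales": counts, "yoy_change": yoy}))
```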

In [None]:
sales_df['S_DATE'] = pd.to_datetime(sales_df['S_DATE'])

In [None]:
plt.subplots(figsize=(15, 5))
sns.countplot(x=sales_df['S_DATE'].dt.year)
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.xticks(rotation = 90)
plt.title('Residential Property Sale Counts by Year')
sns.despine()
plt.show()

In [None]:
plt.subplots(figsize=(15, 5))
sales_df.groupby(sales_df['S_DATE'].dt.year)['S_AMT'].mean().plot()
plt.xlabel('Year')
plt.ylabel('Avg Sale Price')
plt.title('Avg Sale Price per Year for Residential Homes')
plt.show()

In [None]:
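# Plot the relationship between sale date and sale price.
# The regression fit needs a numeric x, so convert dates to days since 1970-01-01.
s_date_days = (pd.to_datetime(sales_df['S_DATE']) - pd.Timestamp('1970-01-01')).dt.days.rename('Days since 1970-01-01')
sns.jointplot(x=s_date_days, y=sales_df['S_AMT'], kind="reg");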

**DOR_CODE** is the Department of Revenue code, which indicates the type of property (e.g. single-family home, condo, commercial). The data has already been filtered to the following residential property types. The column is stored as an integer, so leading zeros are dropped (0100 appears as 100).
- 0100: Single Family Residential
- 0102: Single family home built around a mobile home
- 0106: Townhouse/Villa
- 0200: Mobile Home
- 0400: Condominium
- 0408: Mobile Home Condominium
- 0800: Multi-Family Residential (Duplex, Triplex, Quadplex, etc.) < 10 units
- 0801: Multi-Family Residential (units individually owned)
- 0802: Multi-Family Residential (units rentals)

In [None]:
sales_df['DOR_CODE'].value_counts()

In [None]:
sns.countplot(data=sales_df,
              x='DOR_CODE')
plt.xlabel('')
plt.ylabel('Frequency')
plt.title('Property Type Codes')
sns.despine()
plt.show()

In [None]:
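# A swarm plot positions every point individually, so plot a random sample
# (the size here is an arbitrary choice) to keep rendering time manageable.
sns.swarmplot(data=sales_df.sample(n=5000, random_state=0),
            x='DOR_CODE',
            y='S_AMT')
plt.xlabel('Property Type Code')
plt.ylabel('Sales Price')
sns.despine()
# Save into the figures directory defined in the file-locations cell.
plt.savefig(report_figures + 'sales_prices_by_property_type.png')
plt.show()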