## Preprocessing and Training Data Development - Vacancy Rates 

Goal:  Create a cleaned development dataset you can use to complete the modeling step of your project


#### Steps: 
● 1. Create dummy or indicator features for categorical variables

● 2. Split into testing and training datasets

● 3. Standardize the magnitude of numeric features using a scaler

In [1]:
#imports
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
from pandas_profiling import ProfileReport

In [2]:
#load data
path= '/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/interim'
os.chdir(path) 

In [3]:
# load cleaned data
df = pd.read_csv('master_complete_for_EDA.csv', dtype={'Zipcode': object})
df

Unnamed: 0,Zipcode,RentPrice,Year,SizeRank,State,City,Metro,CountyName,HomePrice,Vacancy_Rate%,MOE-VacancyRate%,int_rate,med_hIncome,uspop_growth,unemplt_rate,newHouse_starts,resConstruct_spending
0,02333,1368.536,2011,8782.0,MA,East Bridgewater,Boston-Cambridge-Newton,Plymouth County,,3.024027,2.199925,0.750000,57021.0,0.720018,8.933333,611.916667,255208.583333
1,02338,1311.076,2011,11179.0,MA,Halifax,Boston-Cambridge-Newton,Plymouth County,274920.17,3.116343,2.948791,0.750000,57021.0,0.720018,8.933333,611.916667,255208.583333
2,02339,1484.626,2011,8621.0,MA,Hanover,Boston-Cambridge-Newton,Plymouth County,415097.50,4.464646,2.066438,0.750000,57021.0,0.720018,8.933333,611.916667,255208.583333
3,02341,1266.816,2011,10079.0,MA,Hanson,Boston-Cambridge-Newton,Plymouth County,,3.586322,2.340722,0.750000,57021.0,0.720018,8.933333,611.916667,255208.583333
4,02343,1524.006,2011,9640.0,MA,Holbrook,Boston-Cambridge-Newton,Norfolk County,247510.42,3.732901,2.926524,0.750000,57021.0,0.720018,8.933333,611.916667,255208.583333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
264955,98279,1059.870,2018,23400.0,WA,Olga,,San Juan County,552805.42,51.219512,10.993457,2.458333,64324.0,0.522337,3.891667,1248.250000,564448.750000
264956,98280,993.850,2018,25265.0,WA,Eastsound,,San Juan County,678499.00,51.329243,12.777549,2.458333,64324.0,0.522337,3.891667,1248.250000,564448.750000
264957,98311,1533.500,2018,4981.0,WA,Bremerton,Bremerton-Silverdale,Kitsap County,314320.83,6.540162,1.960476,2.458333,64324.0,0.522337,3.891667,1248.250000,564448.750000
264958,98326,778.990,2018,26185.0,WA,Clallam Bay,Port Angeles,Clallam County,150193.17,28.537736,14.679524,2.458333,64324.0,0.522337,3.891667,1248.250000,564448.750000


In [4]:
#drop margin of error of vacancy rate and other econometric data
df.drop(columns=['MOE-VacancyRate%', 'int_rate', 'med_hIncome', 'uspop_growth', 'unemplt_rate', 'newHouse_starts', 'resConstruct_spending'], axis=1, inplace=True)

In [5]:
df.dtypes

Zipcode           object
RentPrice        float64
Year               int64
SizeRank         float64
State             object
City              object
Metro             object
CountyName        object
HomePrice        float64
Vacancy_Rate%    float64
dtype: object

In [6]:
#not needed for this dataframe
'''
#Create a new dataframe, setting the index to 'Year'
df = DF.set_index('Year')
#Save the DATE labels 
df_index = df.index
#Save the column names
df_columns = df.columns
df.head()
'''

"\n#Create a new dataframe, setting the index to 'Year'\ndf = DF.set_index('Year')\n#Save the DATE labels \ndf_index = df.index\n#Save the column names\ndf_columns = df.columns\ndf.head()\n"

In [7]:
#not needed for this dataframe
'''
#split into two dataframes for future modeling and predicting vacancy rates in 2019-2020
df = df[df.Year < 2019]
df_2019_2020 = df[df.Year > 2018]
'''

'\n#split into two dataframes for future modeling and predicting vacancy rates in 2019-2020\ndf = df[df.Year < 2019]\ndf_2019_2020 = df[df.Year > 2018]\n'

In [9]:
#check NaNs
df.isna().sum()/len(df)*100

Zipcode           0.000000
RentPrice         7.513210
Year              0.000000
SizeRank         10.211353
State            10.211353
City             10.211353
Metro            31.328502
CountyName       10.211353
HomePrice        14.107035
Vacancy_Rate%     0.000000
dtype: float64

### For Nashville Trip - delete for research

In [24]:
#filter to all of Tennessee (the output below includes Memphis and other TN cities)
nashville = df[df.State == 'TN'].copy()
nashville = nashville[nashville.Year > 2015]
#annualized rent as a percent of home price
nashville['price_rent_ratio'] = ((nashville.RentPrice*12)/nashville.HomePrice)*100
nashville.sort_values(by=['price_rent_ratio'], ascending=False)

Unnamed: 0,Zipcode,RentPrice,Year,SizeRank,State,City,Metro,CountyName,HomePrice,Vacancy_Rate%,price_rent_ratio
173102,38108,976.755,2016,6959.0,TN,Memphis,Memphis,Shelby County,23679.75,22.700472,49.498242
173100,38106,927.005,2016,4699.0,TN,Memphis,Memphis,Shelby County,22925.08,22.042282,48.523538
211058,38108,960.020,2017,6959.0,TN,Memphis,Memphis,Shelby County,25091.42,24.261553,45.913065
211056,38106,943.590,2017,4699.0,TN,Memphis,Memphis,Shelby County,25044.75,22.867006,45.211392
235795,38106,948.800,2018,4699.0,TN,Memphis,Memphis,Shelby County,25453.17,23.854647,44.731560
...,...,...,...,...,...,...,...,...,...,...,...
235802,38392,,2018,25226.0,TN,Mercer,Jackson,Madison County,87059.67,22.112211,
235888,38226,,2018,29401.0,TN,Dukedom,Martin,Weakley County,72354.67,18.750000,
235892,38311,,2018,22920.0,TN,Bath Springs,,Decatur County,130746.58,39.008264,
235901,38007,,2018,32966.0,TN,Ridgely,,Lake County,50009.92,0.000000,


In [21]:
nashville.Year.value_counts()

2018    32
2017    32
2016    32
2015    32
2014    32
2013    32
2012    32
2011    32
Name: Year, dtype: int64

In [116]:
#drop NaNs
df.dropna(subset=['RentPrice', 'SizeRank', 'HomePrice'], inplace=True)
df.isna().sum()/len(df)*100

Zipcode           0.000000
RentPrice         0.000000
Year              0.000000
SizeRank          0.000000
State             0.000000
City              0.000000
Metro            21.728894
CountyName        0.000000
HomePrice         0.000000
Vacancy_Rate%     0.000000
dtype: float64

In [117]:
#get percent data lost with NaN drop
(1 - len(df)/264960)*100

17.766832729468597
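The percent-lost calculation above relies on a hard-coded row count. As a minimal sketch on a made-up toy frame (the names `toy`, `n_before`, and `pct_lost` are illustrative, not from the notebook), the count can be recorded before the drop so the figure stays correct if the input file changes:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "RentPrice": [1000.0, np.nan, 1200.0, 900.0],
    "SizeRank": [1.0, 2.0, np.nan, 4.0],
    "HomePrice": [200000.0, 150000.0, 180000.0, 120000.0],
})

n_before = len(toy)  # record the row count before dropping anything
toy_clean = toy.dropna(subset=["RentPrice", "SizeRank", "HomePrice"])
pct_lost = (1 - len(toy_clean) / n_before) * 100
print(f"{pct_lost:.1f}% of rows dropped")
```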

In [118]:
#check number unique values for each column
print('Data set has', df['Zipcode'].nunique(), 'zipcodes out of US total of ~42,000')
print('Data set has', df['State'].nunique(), 'states out of US total of 51')
print('Data set has', df['CountyName'].nunique(), 'counties out of US total of ~3,006')
#seems to be more metro areas than the US has
#maybe spelling differences and some listed more than once, or possibly data set has additional ones?
print('Data set has', df['Metro'].nunique(), 'metro areas out of US total of 384')
print('Data set has', df['City'].nunique(), 'cities out of US total of ~19,495')

#df['State'].value_counts()/len(df)*100

Data set has 29012 zipcodes out of US total of ~42,000
Data set has 51 states out of US total of 51
Data set has 1757 counties out of US total of ~3,006
Data set has 861 metro areas out of US total of 384
Data set has 14539 cities out of US total of ~19,495


In [119]:
#check partition sizes with a 80/20 train/test split
print('train size:', len(df) * .8, 'test size:', len(df) * .2)

train size: 174308.0 test size: 43577.0


###  1. Create dummy or indicator features for categorical variables
Hint: you’ll need to think about your old favorite pandas functions here, like get_dummies(). Consult this guide for help:
<https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40>

In [120]:
#get first 2 or 3 characters in zipcode so there are fewer dummy variables
#this preserves some of the geographic location of the zipcode
df['Zipcode_2'] = df.Zipcode.astype(str).str[:2]
df['Zipcode_3'] = df.Zipcode.astype(str).str[:3]
df.head()

Unnamed: 0,Zipcode,RentPrice,Year,SizeRank,State,City,Metro,CountyName,HomePrice,Vacancy_Rate%,Zipcode_2,Zipcode_3
1,2338,1311.076,2011,11179.0,MA,Halifax,Boston-Cambridge-Newton,Plymouth County,274920.17,3.116343,2,23
2,2339,1484.626,2011,8621.0,MA,Hanover,Boston-Cambridge-Newton,Plymouth County,415097.5,4.464646,2,23
4,2343,1524.006,2011,9640.0,MA,Holbrook,Boston-Cambridge-Newton,Norfolk County,247510.42,3.732901,2,23
5,2346,1310.016,2011,5289.0,MA,Middleborough,Boston-Cambridge-Newton,Plymouth County,264492.5,7.960256,2,23
6,2347,1307.736,2011,9579.0,MA,Lakeville,Boston-Cambridge-Newton,Plymouth County,309743.67,11.565968,2,23
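Prefix extraction only preserves geography if every zipcode is a five-character string. If a zipcode were ever read as an integer, its leading zero would be lost and the prefix would come from the wrong digits. The sketch below assumes five-digit US ZIP codes and uses a hypothetical helper, `zip_prefix`, that restores the padding before slicing:

```python
import pandas as pd

def zip_prefix(zipcodes: pd.Series, n_digits: int) -> pd.Series:
    """Return the first n_digits of each ZIP code, restoring leading zeros first."""
    padded = zipcodes.astype(str).str.zfill(5)
    return padded.str[:n_digits]

# Mixed input: one integer that lost its leading zero, two clean values.
zips = pd.Series([2338, "02339", 98311])
print(zip_prefix(zips, 2).tolist())
print(zip_prefix(zips, 3).tolist())
```

Because the notebook loads `Zipcode` with `dtype=object`, the padding step may be a no-op there; it is shown only as a guard.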


In [121]:
#check dtypes
df.dtypes

Zipcode           object
RentPrice        float64
Year               int64
SizeRank         float64
State             object
City              object
Metro             object
CountyName        object
HomePrice        float64
Vacancy_Rate%    float64
Zipcode_2         object
Zipcode_3         object
dtype: object

### Quick EDA

In [70]:
#profile = ProfileReport(df.drop(columns=['Zipcode', 'Year', 'State', 'City', 'Metro', 'CountyName']))
#profile


In [71]:
#save pandas profiling report as html
os.chdir('/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/reports')
#profile.to_file("3.2.1-EDA-Mentor_Feedback_zipcodeMaster.html")


Home prices and rent prices appear to be highly skewed, as do vacancy rates.

### Reducing skew in variables 

In [122]:
#determine index for continuous variables
num_feats = df.dtypes[df.dtypes !='object'].index

#calculate skew and sort
skew_feats = df[num_feats].skew().sort_values(ascending=False)
skewness=pd.DataFrame({'Skew':skew_feats})
print(skewness)

                   Skew
HomePrice      5.483811
RentPrice      2.763417
Vacancy_Rate%  2.002356
SizeRank       0.196360
Year          -0.040894


In [123]:
#perform Box-Cox transformation to reduce skew in HomePrice & RentPrice
#keep a separate lambda per column so each transform can be inverted later
from scipy.stats import boxcox
df['HomePrice'], home_lmbda = boxcox(df['HomePrice'], lmbda=None)
df['RentPrice'], rent_lmbda = boxcox(df['RentPrice'], lmbda=None)

In [124]:
#determine index for continuous variables
num_feats = df.dtypes[df.dtypes !='object'].index

#calculate skew and sort
skew_feats = df[num_feats].skew().sort_values(ascending=False)
skewness=pd.DataFrame({'Skew':skew_feats})
print(skewness)

                   Skew
Vacancy_Rate%  2.002356
SizeRank       0.196360
HomePrice     -0.008246
Year          -0.040894
RentPrice     -0.161277
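Keeping the fitted lambda for each transformed column makes it possible to map values back to the original scale. The following is a sketch on synthetic, strictly positive data (not the notebook's prices), using `scipy.stats.boxcox` to fit and `scipy.special.inv_boxcox` to invert:

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox, skew

rng = np.random.default_rng(42)
# Synthetic right-skewed, strictly positive values (Box-Cox requires positive input).
prices = rng.lognormal(mean=12.0, sigma=0.6, size=5000)

# With lmbda left unset, boxcox estimates lambda by maximum likelihood and returns it.
transformed, lam = boxcox(prices)
restored = inv_boxcox(transformed, lam)

print("skew before:", skew(prices))
print("skew after: ", skew(transformed))
print("fitted lambda:", lam)
```

The round trip through `inv_boxcox` only works with the same lambda that produced the transform, which is why each column needs its own stored value.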


## Dealing with Outliers

In [87]:
#get outliers for home/rent prices
df.describe()

Unnamed: 0,RentPrice,Year,SizeRank,HomePrice,Vacancy_Rate%
count,217885.0,217885.0,217885.0,217885.0,217885.0
mean,2.550404,2014.573394,14615.051178,4.534452,16.178597
std,0.029817,2.284327,8955.875344,0.063595,14.006757
min,1.834068,2011.0,0.0,4.226036,0.0
25%,2.532008,2013.0,6938.0,4.492023,7.0
50%,2.548076,2015.0,14027.0,4.533992,12.020588
75%,2.568023,2017.0,21854.0,4.576174,20.332937
max,2.658538,2018.0,34430.0,4.786674,99.839744


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217885 entries, 1 to 264959
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Zipcode        217885 non-null  object 
 1   RentPrice      217885 non-null  float64
 2   Year           217885 non-null  int64  
 3   SizeRank       217885 non-null  float64
 4   State          217885 non-null  object 
 5   City           217885 non-null  object 
 6   Metro          170541 non-null  object 
 7   CountyName     217885 non-null  object 
 8   HomePrice      217885 non-null  float64
 9   Vacancy_Rate%  217885 non-null  float64
 10  Zipcode_2      217885 non-null  object 
 11  Zipcode_3      217885 non-null  object 
dtypes: float64(4), int64(1), object(7)
memory usage: 21.6+ MB


In [17]:
#drop RentPrice outliers more than 3 std from the mean (either side)
df_no_outliers = df[np.abs(df.RentPrice-df.RentPrice.mean()) <= (3*df.RentPrice.std())]

In [18]:
#fraction of rows remaining after dropping rent outliers
213662/217885

0.9806182160313927

In [19]:
#number of outliers in rentprice
len(df[(np.abs(df.RentPrice-df.RentPrice.mean()) > (3*df.RentPrice.std()))])

4223

In [20]:
#fraction of rows lost after dropping rent outliers
4223/217885

0.01938178396860729

In [21]:
#drop HomePrice outliers more than 3 std from the mean (either side)
df_no_outliers = df_no_outliers[np.abs(df_no_outliers.HomePrice-df_no_outliers.HomePrice.mean()) <= (3*df_no_outliers.HomePrice.std())]
#fraction of rows remaining after dropping home price outliers
209807/213662

0.9819574842508261

In [22]:
#get total fraction of rows dropped across both outlier filters
(217885-209807)/217885

0.037074603575280536
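The two filters above follow the same pattern, so the row counts do not need to be typed in by hand. As a sketch, a small hypothetical helper (`drop_outliers`, not part of the notebook) can apply the 3-standard-deviation rule to any column and let the kept fraction be computed from the result:

```python
import numpy as np
import pandas as pd

def drop_outliers(frame: pd.DataFrame, column: str, n_std: float = 3.0) -> pd.DataFrame:
    """Keep rows whose value in `column` lies within n_std standard deviations of the mean."""
    deviation = (frame[column] - frame[column].mean()).abs()
    return frame[deviation <= n_std * frame[column].std()]

rng = np.random.default_rng(0)
toy = pd.DataFrame({"RentPrice": rng.normal(1000, 100, size=1000)})  # synthetic rents
toy.loc[0, "RentPrice"] = 10_000  # inject one obvious outlier

kept = drop_outliers(toy, "RentPrice")
print("fraction kept:", len(kept) / len(toy))
```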

The data with dropped outliers was tested; results are listed below. Overall it performed very slightly better with 2-digit zipcodes and very slightly worse with 3-digit zipcodes than the data that kept the outliers.

In [None]:
#rename this dataset for easier coding below..
#df = df_no_outliers

### Substitute outliers instead of dropping them 

In [44]:
def cap_outliers(series, std_threshold=3, verbose=False):
    '''Caps outliers in series to closest existing value within threshold (std).'''

    lbound = np.mean(series) - std_threshold * np.std(series)
    ubound = np.mean(series) + std_threshold * np.std(series)

    outliers = (series < lbound) | (series > ubound)

    series = series.copy()
    series.loc[series < lbound] = series.loc[~outliers].min()
    series.loc[series > ubound] = series.loc[~outliers].max()

    # For comparison purposes.
    if verbose:
        print('\n'.join(
            ['Capping outliers by the Standard deviation method:',
             f'   Std threshold: {std_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

In [50]:
#cap outliers for rentprice and home price
df_cap_outliers = df.copy()
df_cap_outliers.RentPrice = cap_outliers(df.RentPrice, verbose=True)
df_cap_outliers.HomePrice = cap_outliers(df.HomePrice, verbose=True)

Capping outliers by the Standard deviation method:
   Std threshold: 3
   Lower bound: -384.93674461068895
   Upper bound: 2575.722647641905

Capping outliers by the Standard deviation method:
   Std threshold: 3
   Lower bound: -369911.64280612214
   Upper bound: 740812.8045544759



In [55]:
#compare data set after capping outliers to original dataset
df_cap_outliers.describe()

Unnamed: 0,RentPrice,Year,SizeRank,HomePrice,Vacancy_Rate%
count,217885.0,217885.0,217885.0,217885.0,217885.0
mean,1081.491701,2014.573394,14615.051178,178653.224927,16.178597
std,426.862711,2.284327,8955.875344,140986.026691,14.006757
min,19.96,2011.0,0.0,10956.33,0.0
25%,804.206,2013.0,6938.0,88016.67,7.0
50%,966.296,2015.0,14027.0,134667.5,12.020588
75%,1236.036,2017.0,21854.0,214877.0,20.332937
max,2575.67,2018.0,34430.0,740776.67,99.839744


In [53]:
df.describe()

Unnamed: 0,RentPrice,Year,SizeRank,HomePrice,Vacancy_Rate%
count,217885.0,217885.0,217885.0,217885.0,217885.0
mean,1095.392952,2014.573394,14615.051178,185450.6,16.178597
std,493.444364,2.284327,8955.875344,185121.2,14.006757
min,19.96,2011.0,0.0,10956.33,0.0
25%,804.206,2013.0,6938.0,88016.67,7.0
50%,966.296,2015.0,14027.0,134667.5,12.020588
75%,1236.036,2017.0,21854.0,214877.0,20.332937
max,5620.32,2018.0,34430.0,6141946.0,99.839744


In [56]:
#rename this dataset for easier coding below..
#df = df_cap_outliers

Data with outliers was:
- missing around 1 of the first 2-digit zipcodes (~100 in total US)
- missing around 113 of the first 3-digit zipcodes (~1000 in total US)

Data with dropped outliers was:
- missing around 1 of the first 2-digit zipcodes (~100 in total US)
- missing around 120 of the first 3-digit zipcodes (~1000 in total US)

Data with capped outliers was:
- missing around 1 of the first 2-digit zipcodes (~100 in total US)
- missing around 113 of the first 3-digit zipcodes (~1000 in total US)

### Create Dummy Variables 

In [143]:
#get dummy variables for 'object' columns 
df_dummy = pd.get_dummies(df.drop(columns=['Zipcode', 'Year', 'State', 'City', 'Metro', 'CountyName', 'Zipcode_3']))

#missing ~1 of the first 2-digit zipcodes (~100 in total US)
#missing ~113 of the first 3-digit zipcodes (~1000 in total US)
df_dummy.head()

Unnamed: 0,RentPrice,SizeRank,HomePrice,Vacancy_Rate%,Zipcode_2_00,Zipcode_2_01,Zipcode_2_02,Zipcode_2_03,Zipcode_2_04,Zipcode_2_05,...,Zipcode_2_90,Zipcode_2_91,Zipcode_2_92,Zipcode_2_93,Zipcode_2_94,Zipcode_2_95,Zipcode_2_96,Zipcode_2_97,Zipcode_2_98,Zipcode_2_99
1,2.572542,11179.0,4.596884,3.116343,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2.581766,8621.0,4.629315,4.464646,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2.583657,9640.0,4.588181,3.732901,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,2.572481,5289.0,4.593701,7.960256,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,2.572349,9579.0,4.606547,11.565968,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
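To make the encoding step concrete, here is a minimal sketch of `pd.get_dummies` on a made-up three-row frame. Passing `dtype=int` is a choice made for this example: recent pandas versions return boolean columns by default, while the output above shows 0/1 integers.

```python
import pandas as pd

# Toy frame with one numeric column and one categorical zipcode prefix.
toy = pd.DataFrame({
    "RentPrice": [1311.1, 1484.6, 1533.5],
    "Zipcode_2": ["02", "02", "98"],
})

# Encode the listed column; other columns pass through unchanged.
# dtype=int forces 0/1 output instead of booleans.
toy_dummy = pd.get_dummies(toy, columns=["Zipcode_2"], dtype=int)
print(toy_dummy.columns.tolist())
print(toy_dummy["Zipcode_2_02"].tolist())
```

For a linear model, `drop_first=True` is one common option for avoiding perfectly collinear dummy columns. Whether it helps here is untested in this notebook.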


### 2. Split into testing and training datasets
Hint: don’t forget your sklearn functions here, like train_test_split().

In [144]:
#define variable X, y
X = df_dummy.drop('Vacancy_Rate%', axis=1)
y = df_dummy['Vacancy_Rate%']

In [145]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

### Establish Baseline Measurement Comparisons
Using a DummyRegressor, see what R2, MAE, and MSE would be if the mean of the training target were used as the prediction for every row

In [129]:
#simplest baseline (not really a model): the mean of the training target
train_mean = y_train.mean()

print(train_mean)

16.18981778612596


In [130]:
#Fit the dummy regressor on the training data
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
#create dummy regressor predictions 
y_tr_pred = dumb_reg.predict(X_train)
#Make prediction with the single value of the (training) mean.
y_te_pred = train_mean * np.ones(len(y_test))
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)

(0.0, -1.6361439166168168e-05)

In [131]:
#establish baseline for mean absolute error and mean square error 
print('MAEs:', mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred))
print('MSEs:', mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred))

MAEs: 9.986224021600865 9.931656462952944
MSEs: 197.13646433710005 192.39640964681763
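The baseline numbers above follow from how a mean predictor behaves: it ignores the features, returns the training mean for every row, and therefore scores an R2 of 0 on its own training data. A small sketch on synthetic data (not the notebook's) illustrates this:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))      # features are ignored by DummyRegressor
y_toy = rng.uniform(0, 100, size=200)  # synthetic target on a 0-100 scale

baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

print(np.allclose(pred, y_toy.mean()))  # every prediction equals the training mean
print(r2_score(y_toy, pred))            # R2 of the mean predictor on its training data
```

On a held-out split the R2 is typically slightly negative, because the test mean differs a little from the training mean.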


### 3. Standardize the magnitude of numeric features using a scaler
Hint: you might need to employ Python code like this:

In [132]:
scaler = StandardScaler()
#fit the scaler on the training set
scaler.fit(X_train)
#apply the scaling to both the train and test split
X_tr_scaled = scaler.transform(X_train)
X_te_scaled = scaler.transform(X_test)
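The reason for fitting the scaler on the training split only is that the test split should not influence any statistic the model sees. As a sketch on synthetic data, the training split ends up with mean 0 and unit variance by construction, while the test split is only approximately centered:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(500, 2))  # synthetic features
X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)  # statistics come from the training split only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))  # ~0 and ~1 by construction
print(X_te_s.mean(axis=0))                      # close to 0, but not exactly
```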

#### Initial Model: Train the model on the train split

In [133]:
%%time
lm = LinearRegression().fit(X_train, y_train)

CPU times: user 27.3 s, sys: 3.56 s, total: 30.8 s
Wall time: 20.1 s


In [134]:
#Make predictions using the model on both train and test splits
y_tr_pred = lm.predict(X_train)
y_te_pred = lm.predict(X_test)

In [135]:
#Assess model performance
# r^2 - train, test
r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
print('r2:', r2)

r2: (0.37291118296766945, 0.37050198578843974)


In [136]:
#MAE - train, test
mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
print('mae:', mae)

mae: (7.4384374537334, 7.402211432205946)


In [137]:
# MSE - train, test
mse = mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
print('mse:', mse)

mse: (123.62207221508828, 121.11117626096285)


**This is better performance than the DummyRegressor/mean baseline (see earlier); it's also better than using 2 digits for zipcode (R2 ~0.23, MAE ~8.3, MSE ~146.7)**

Stats for data before dropped outliers  (3 digit zipcodes):
- r2: (0.3767111514126693, 0.3751352905370249)
- mae: (7.487708463361065, 7.452121983056816)
- mse: (122.87295987124845, 120.21975964739423)

**Interestingly, data with dropped outliers had slightly worse performance**

Stats for data with *dropped* outliers (2 digit zipcodes) *seems to be overfitting* 
- r2: (0.22952583661700865, 0.235109639685662)
- mae: (8.257680891773221, 8.218592583740403)
- mse: (146.05654935373445, 145.04292379880803)

Stats for data with *dropped* outliers (3 digit zipcodes) 
- r2: (0.36221669826682446, 0.3605874734215121)
- mae: (7.429639210875079, 7.450697391869509)
- mse: (120.90272810390731, 121.249092916289)

**Overall, capped outliers performed slightly better than dropped outliers, but comparable to leaving outliers in**

Stats for data with *capped* outliers (2 digit zipcodes)
- r2: (0.23318119382420266, 0.23738971038225587)
- mae: (8.393471405034054, 8.324683650748993)
- mse: (151.1679482366927, 146.7210811141307)

Stats for data with *capped* outliers (3 digit zipcodes) *seems to be overfitting* 
- r2: (0.37703398888642803, 0.37500828828268473)
- mae: (7.501619749191513, 7.469675100354592)
- mse: (122.80931683311614, 120.24419402536472)

**Reducing skewness seemed to have not much effect vs. using original data**

Stats for data after box cox transformation to reduce skew in home/rent prices - with outliers (2 digit zipcodes) *seems to be overfitting* 
- r2: (0.22949476429165416, 0.2339725120183841)
- mae: (8.347900627480898, 8.276014415834373)
- mse: (151.89467792076718, 147.37852705362886)

Stats for data after box cox transformation to reduce skew in home/rent prices - with outliers (3 digit zipcodes)
- r2: (0.37291118296766945, 0.37050198578843974)
- mae: (7.4384374537334, 7.402211432205946)
- mse: (123.62207221508828, 121.11117626096285)

#### Test scaled data

In [138]:
%%time
lm = LinearRegression().fit(X_tr_scaled, y_train)

CPU times: user 28.8 s, sys: 4.78 s, total: 33.6 s
Wall time: 23.5 s


In [139]:
#Make predictions using the model on both train and test splits
y_tr_pred = lm.predict(X_tr_scaled)
y_te_pred = lm.predict(X_te_scaled)

In [140]:
#Assess model performance
# r^2 - train, test
r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
print('r2:', r2)

r2: (0.37291109056433713, 0.37050150992967323)


In [141]:
#MAE - train, test
mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
print('mae:', mae)

mae: (7.4384183667969745, 7.402225947633483)


In [142]:
# MSE - train, test
mse = mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
print('mse:', mse)

mse: (123.62209043115452, 121.1112678129831)


**Not much difference between scaling and not scaling data**

**Interestingly, more data in this notebook is performing worse than less data in notebook 4.2.1**
- possibly due to less data being easier to predict (higher variation in larger dataset)
- possibly due to Zillow rental data being better quality than the ACS rental data?
- *could try combo of ACS and Zillow data*
    
    
Stats for scaled data before dropped outliers (3 digit zipcodes): *seems to be overfitting* 
- r2: ~37.5% on test set
- mae: (7.488184295812186, 7.452440692995879)
- mse: (122.87383263653231, 120.22394615268081)

Stats for scaled data with dropped outliers (2 digit zipcodes) *seems to be overfitting* 
- r2: (0.22952580560458724, 0.23511028144732893)
- mae: (8.257714304821766, 8.21862410746757)
- mse: (146.0565552326691, 145.0428021042576)

Stats for scaled data with dropped outliers (3 digit zipcodes) **Interestingly performed a little worse than data with outliers**
- r2: (0.36220887494858034, 0.36060807932079775)
- mae: (7.428875046209663, 7.449761685868122)
- mse: (120.90421114762445, 121.24518550676346)


**overall capped outliers performed slightly better than dropped outliers, but very slightly worse compared with leaving outliers in**

Stats for data with *capped* outliers (2 digit zipcodes)
- r2: 
- mae: 
- mse:

Stats for data with *capped* outliers (3 digit zipcodes) 
- r2: (0.37703297973414895, 0.37501029805764485)
- mae: (7.501407203913482, 7.469608056180069)
- mse: (122.80951577382842, 120.24380735820462)

Stats for data after box cox transformation to reduce skew in home/rent prices - with outliers (2 digit zipcodes) *seems to be overfitting* 
- r2: (0.22949465952101533, 0.23397338683550317)
- mae: (8.347951615429123, 8.276031549742534)
- mse: (151.8946985748805, 147.37835874470983)

Stats for data after box cox transformation to reduce skew in home/rent prices - with outliers (3 digit zipcodes)
- r2: (0.37291109056433713, 0.37050150992967323)
- mae: (7.4384183667969745, 7.402225947633483)
- mse: (123.62209043115452, 121.1112678129831)

### NEXT MAKE PIPELINE AND TRY OTHER MODELS (ie. Random Forest)
- use 2 or 3 digit zipcodes
- drop or not drop outliers
- scale or not scale data

In [146]:
rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)


RandomForestRegressor()

In [147]:
#Make predictions using the model on both train and test splits
y_tr_pred = rf_reg.predict(X_train)
y_te_pred = rf_reg.predict(X_test)

In [148]:
#Assess model performance
# r^2 - train, test
r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
print('r2:', r2)

r2: (0.9631418616928961, 0.7423842823224307)


In [149]:
#MAE - train, test
mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
print('mae:', mae)

mae: (1.693676659290483, 4.4850607047333515)


In [150]:
# MSE - train, test
mse = mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
print('mse:', mse)

mse: (7.266083067910291, 49.56352821910074)


## Save processed data

In [None]:
#save vacancy rate data for modeling - remember to use random state=42!
#df.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/VacancyRate_Zillow_2014_2018', index=False)
#df_2019_2020.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/VacancyRate_Zillow_2019_2020', index=False)

In [None]:
#save the scaled training and test splits

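#X_tr_scaled is a NumPy array, so wrap it in a DataFrame before saving
#pd.DataFrame(X_tr_scaled, columns=X_train.columns).to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/X_tr_scaled', index=False)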
#X_te_scaled
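By default, `StandardScaler.transform` returns a NumPy array, which has no `to_csv` method. A minimal sketch of saving such an array with its column names, using a temporary directory and made-up values in place of the project paths and data:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

X_tr_toy = np.array([[0.5, -1.2], [1.5, 0.3]])  # stand-in for a scaled array
feature_names = ["RentPrice", "HomePrice"]      # stand-in for X_train.columns

out_dir = Path(tempfile.mkdtemp())              # hypothetical output location
out_path = out_dir / "X_tr_scaled.csv"
pd.DataFrame(X_tr_toy, columns=feature_names).to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```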

### Summary

### NEEDS UPDATING
This summary should provide a quick overview for someone wanting to know quickly why the given model was chosen for the next part of the business problem to help guide important business decisions.

- dropped the margin of error of the vacancy rate: it is correlated with vacancy rate and is not available for 2019-2020, so it would not serve a predictive model well
- inspected data and created dummy variables for categorical variables (e.g. metro, state, city, county name)
- split into testing and training datasets
- established baseline measurement comparisons with dummy regressors
- attempted to fit the training data on a linear regression model...

**Note: THIS NOTEBOOK WAS NOT COMPLETED BECAUSE THE .FIT() FUNCTION TOOK FAR TOO LONG**
- model fitting for linear regression on 5000 rows took 69.6165668964386 seconds with a -7 R2 score
- model fitting for linear regression on 10000 rows took 711.7782809734344 seconds with a -6.32 R2 score
- did not complete notebook

### Reflection: 
**Review the following questions and apply them to your dataset**:

● Does my data set have any categorical data, such as Gender or day of the week? **Yes noted above in summary**

● Do my features have data values that range from 0-100 or 0-1, or both and more? **Yes, vacancy rate is 0-100**