## Preprocessing and Training Data Development
### Goal:  Create a cleaned development dataset you can use to complete the modeling step of your project

NOTE: Originally the DataFrames had 3, 7, 8, and 9 variables in addition to the rental vacancy rate (the 3x, 7x, 8x, and 9x in the DataFrame names). After dropping the ppi_resConstruct variable (PPI for residential construction), they have 3, 6, 7, and 8 variables respectively. Due to time limitations, the DataFrame names in the notebook below were not changed to reflect this. The names were changed only when the DataFrames were written to CSV.



#### Steps: 
● 1. Split into testing and training datasets

● 2. Create dummy or indicator features for categorical variables

● 3. Standardize the magnitude of numeric features using a scaler

In [1]:
#imports
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
from pandas_profiling import ProfileReport

In [2]:
#load data
path= '/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/interim'
os.chdir(path) 

In [3]:
DF = pd.read_csv('df3_1956')
DF

Unnamed: 0,DATE,uspop_growth,med_hIncome,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent,homePrice_index,newHouse_starts,ppi_resConstruct,resConstruct_spending
0,1956-01-01,,,6.2,4.0,2.50,35.900,,,,
1,1956-02-01,,,6.2,3.9,2.50,35.900,,,,
2,1956-03-01,,,6.2,4.2,2.50,35.900,,,,
3,1956-04-01,,,5.9,4.0,2.65,36.000,,,,
4,1956-05-01,,,5.9,4.3,2.75,36.100,,,,
...,...,...,...,...,...,...,...,...,...,...,...
770,2020-03-01,0.473954,,6.6,4.4,0.25,339.519,215.160,1269.0,224.5,595963.0
771,2020-04-01,0.473954,,5.7,14.7,0.25,340.135,217.323,934.0,215.9,569892.0
772,2020-05-01,0.473954,,5.7,13.3,0.25,340.811,218.600,1038.0,217.3,549977.0
773,2020-06-01,0.473954,,5.7,11.1,0.25,341.294,219.819,1220.0,221.4,542307.0


In [4]:
#Create a new dataframe, setting the index to 'DATE'
df = DF.set_index('DATE')
#Save the DATE labels 
df_index = df.index
#Save the column names
df_columns = df.columns
df.head()

Unnamed: 0_level_0,uspop_growth,med_hIncome,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent,homePrice_index,newHouse_starts,ppi_resConstruct,resConstruct_spending
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1956-01-01,,,6.2,4.0,2.5,35.9,,,,
1956-02-01,,,6.2,3.9,2.5,35.9,,,,
1956-03-01,,,6.2,4.2,2.5,35.9,,,,
1956-04-01,,,5.9,4.0,2.65,36.0,,,,
1956-05-01,,,5.9,4.3,2.75,36.1,,,,


### Deal with Remaining NaNs
Because the variables in the time series begin and end at different dates, we will create 4 DataFrames. Some cover a longer timeframe with fewer variables, and others cover a shorter timeframe with more variables. This was the tradeoff that had to be made.

Based on this we can use these 4 dataframes later to see which might produce the best predictive model for vacancy rate.
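As a minimal sketch of this tradeoff, the toy frame below (hypothetical values and a made-up `late_series` column, not the project data) shows how `dropna(axis=1)` keeps the full timeframe but loses the late-starting variable, while `dropna(axis=0)` keeps every variable but loses the early rows.

```python
import numpy as np
import pandas as pd

# Toy monthly frame (hypothetical values): one complete series and one that starts late.
toy = pd.DataFrame(
    {
        "rentl_vacnyRate": [6.2, 6.2, 5.9, 5.9],
        "late_series": [np.nan, np.nan, 1.0, 2.0],  # made-up column for illustration
    },
    index=pd.to_datetime(["1956-01-01", "1956-02-01", "1956-03-01", "1956-04-01"]),
)

# Keep every row, drop any column with a gap: longer timeframe, fewer variables.
long_and_narrow = toy.dropna(axis=1)

# Keep every column, drop any row with a gap: shorter timeframe, more variables.
short_and_wide = toy.dropna(axis=0)

print(long_and_narrow.shape)
print(short_and_wide.shape)
```

The cells that follow apply the same two calls, plus date filters, to build the four DataFrames.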

In [5]:
#count the remaining NaNs in each column (some variables start later than others)
df.isna().sum()

uspop_growth              60
med_hIncome              355
rentl_vacnyRate            0
unemplt_rate               0
int_rate                   0
cpi_rent                   0
homePrice_index          373
newHouse_starts           36
ppi_resConstruct         365
resConstruct_spending    553
dtype: int64

In [6]:
#drop any variable that is missing data during the period covered by the rental vacancy rate data (1956-2020)
df3x_1956_2020 = df.dropna(axis=1)
df3x_1956_2020

Unnamed: 0_level_0,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1956-01-01,6.2,4.0,2.50,35.900
1956-02-01,6.2,3.9,2.50,35.900
1956-03-01,6.2,4.2,2.50,35.900
1956-04-01,5.9,4.0,2.65,36.000
1956-05-01,5.9,4.3,2.75,36.100
...,...,...,...,...
2020-03-01,6.6,4.4,0.25,339.519
2020-04-01,5.7,14.7,0.25,340.135
2020-05-01,5.7,13.3,0.25,340.811
2020-06-01,5.7,11.1,0.25,341.294


In [7]:
#trim df to include all variables & no NaNs(2002-2018)
df9x_2002_2018 = df.dropna(axis=0)
df9x_2002_2018

Unnamed: 0_level_0,uspop_growth,med_hIncome,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent,homePrice_index,newHouse_starts,ppi_resConstruct,resConstruct_spending
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2002-01-01,0.927797,59360.0,9.1,5.7,1.25,197.000,116.438,1698.0,140.3,382979.0
2002-02-01,0.927797,59360.0,9.1,5.7,1.25,197.700,116.918,1829.0,140.3,391434.0
2002-03-01,0.927797,59360.0,9.1,5.7,1.25,198.200,117.931,1642.0,140.9,390942.0
2002-04-01,0.927797,59360.0,8.4,5.9,1.25,198.500,119.211,1592.0,141.3,404255.0
2002-05-01,0.927797,59360.0,8.4,5.8,1.25,198.800,120.790,1764.0,141.2,399164.0
...,...,...,...,...,...,...,...,...,...,...
2018-08-01,0.522337,63179.0,7.1,3.8,2.50,320.651,205.448,1280.0,231.3,553691.0
2018-09-01,0.522337,63179.0,7.1,3.7,2.75,321.533,205.506,1246.0,231.7,553579.0
2018-10-01,0.522337,63179.0,6.6,3.8,2.75,322.628,205.514,1207.0,232.5,542175.0
2018-11-01,0.522337,63179.0,6.6,3.7,2.75,323.968,205.264,1204.0,229.2,539913.0


In [8]:
#trim data to match med_hIncome, homePrice_index, ppi_resConstruct
df_1987_2018 = df[(df.index >= '1987-01-01') & (df.index < '2019-01-01')]
#drop resConstruct_spending due to data only from 2002 on
df8x_1987_2018 = df_1987_2018.dropna(axis=1)
df8x_1987_2018.isna().sum()

uspop_growth        0
med_hIncome         0
rentl_vacnyRate     0
unemplt_rate        0
int_rate            0
cpi_rent            0
homePrice_index     0
newHouse_starts     0
ppi_resConstruct    0
dtype: int64

In [9]:
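#trim data to start in 1987, matching homePrice_index and ppi_resConstruct
df_1987_2020 = df[(df.index >= '1987-01-01') & (df.index < '2020-06-01')]
#dropna also removes med_hIncome (no data after 2018) and resConstruct_spending (data only from 2002 on)
df7x_1987_2020 = df_1987_2020.dropna(axis=1)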
df7x_1987_2020

Unnamed: 0_level_0,uspop_growth,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent,homePrice_index,newHouse_starts,ppi_resConstruct
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1987-01-01,0.893829,7.4,6.6,5.50,121.300,63.755,1774.0,100.0
1987-02-01,0.893829,7.4,6.6,5.50,121.700,64.156,1784.0,100.4
1987-03-01,0.893829,7.4,6.6,5.50,121.800,64.491,1726.0,100.7
1987-04-01,0.893829,7.5,6.3,5.50,122.000,64.994,1614.0,101.1
1987-05-01,0.893829,7.5,6.3,5.50,122.300,65.568,1628.0,101.3
...,...,...,...,...,...,...,...,...
2020-01-01,0.473954,6.6,3.6,2.25,337.825,212.470,1617.0,228.0
2020-02-01,0.473954,6.6,3.5,2.25,338.616,213.255,1567.0,227.3
2020-03-01,0.473954,6.6,4.4,0.25,339.519,215.160,1269.0,224.5
2020-04-01,0.473954,5.7,14.7,0.25,340.135,217.323,934.0,215.9


In [10]:
df.corr()

Unnamed: 0,uspop_growth,med_hIncome,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent,homePrice_index,newHouse_starts,ppi_resConstruct,resConstruct_spending
uspop_growth,1.0,-0.371142,0.006683,-0.098825,0.180918,-0.670099,-0.801649,0.206284,-0.797751,-0.093562
med_hIncome,-0.371142,1.0,0.317792,-0.62626,-0.359616,0.682694,0.673637,0.05985,0.516545,0.757655
rentl_vacnyRate,0.006683,0.317792,1.0,0.004194,-0.551596,0.498373,0.182388,-0.271587,0.066732,-0.316184
unemplt_rate,-0.098825,-0.62626,0.004194,1.0,0.079377,0.061554,-0.095361,-0.335132,0.065836,-0.757575
int_rate,0.180918,-0.359616,-0.551596,0.079377,1.0,-0.480865,-0.552798,0.2843,-0.723648,0.66467
cpi_rent,-0.670099,0.682694,0.498373,0.061554,-0.480865,1.0,0.940999,-0.375218,0.980302,0.144342
homePrice_index,-0.801649,0.673637,0.182388,-0.095361,-0.552798,0.940999,1.0,-0.15264,0.90182,0.681801
newHouse_starts,0.206284,0.05985,-0.271587,-0.335132,0.2843,-0.375218,-0.15264,1.0,-0.457428,0.788621
ppi_resConstruct,-0.797751,0.516545,0.066732,0.065836,-0.723648,0.980302,0.90182,-0.457428,1.0,-0.073348
resConstruct_spending,-0.093562,0.757655,-0.316184,-0.757575,0.66467,0.144342,0.681801,0.788621,-0.073348,1.0


In [11]:
df3x_1956_2020.corr()
#same values as the matching entries in df.corr() above, since no rows were dropped

Unnamed: 0,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent
rentl_vacnyRate,1.0,0.004194,-0.551596,0.498373
unemplt_rate,0.004194,1.0,0.079377,0.061554
int_rate,-0.551596,0.079377,1.0,-0.480865
cpi_rent,0.498373,0.061554,-0.480865,1.0


In [12]:
df9x_2002_2018.corr()

Unnamed: 0,uspop_growth,med_hIncome,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent,homePrice_index,newHouse_starts,ppi_resConstruct,resConstruct_spending
uspop_growth,1.0,-0.182595,0.780945,0.035228,0.494537,-0.874804,-0.348468,0.405317,-0.826768,0.097091
med_hIncome,-0.182595,1.0,-0.43746,-0.817595,0.416378,0.341517,0.732775,0.435608,0.055424,0.757655
rentl_vacnyRate,0.780945,-0.43746,1.0,0.457909,0.279683,-0.792619,-0.501899,0.092708,-0.6624,-0.219973
unemplt_rate,0.035228,-0.817595,0.457909,1.0,-0.56931,-0.155006,-0.679573,-0.663699,0.053444,-0.880589
int_rate,0.494537,0.416378,0.279683,-0.56931,1.0,-0.349664,0.456235,0.600832,-0.362287,0.702287
cpi_rent,-0.874804,0.341517,-0.792619,-0.155006,-0.349664,1.0,0.624595,-0.500019,0.934835,-0.016194
homePrice_index,-0.348468,0.732775,-0.501899,-0.679573,0.456235,0.624595,1.0,0.122238,0.510446,0.656654
newHouse_starts,0.405317,0.435608,0.092708,-0.663699,0.600832,-0.500019,0.122238,1.0,-0.653296,0.814592
ppi_resConstruct,-0.826768,0.055424,-0.6624,0.053444,-0.362287,0.934835,0.510446,-0.653296,1.0,-0.204557
resConstruct_spending,0.097091,0.757655,-0.219973,-0.880589,0.702287,-0.016194,0.656654,0.814592,-0.204557,1.0


In [13]:
#compared with the original df, rentl_vacnyRate in df9x..corr() has:
'''
-a stronger positive correlation with uspop_growth and unemplt_rate
-a weak positive correlation with int_rate and newHouse_starts (both were negative)
-a negative correlation with med_hIncome, cpi_rent, homePrice_index, and ppi_resConstruct (all were positive)
-a weaker negative correlation with resConstruct_spending
'''

#compared with df8x_1987_2018, rentl_vacnyRate in df9x..corr() has:
'''
-a stronger positive correlation with unemplt_rate
-a positive correlation with !!uspop_growth and int_rate (both were negative)
-a negative correlation with med_hIncome, !!cpi_rent, homePrice_index, and !!ppi_resConstruct (all were positive)
'''

'\n-a stronger positive correlation with unemplt_rate\n-a positive correlation with !!uspop_growth and int_rate (both were negative)\n-a negative correlation with med_hIncome, !!cpi_rent, homePrice_index, and !!ppi_resConstruct (all were positive)\n'

In [14]:
df8x_1987_2018.corr()

Unnamed: 0,uspop_growth,med_hIncome,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent,homePrice_index,newHouse_starts,ppi_resConstruct
uspop_growth,1.0,-0.491476,-0.163384,-0.053428,0.569004,-0.81379,-0.762166,0.26706,-0.785862
med_hIncome,-0.491476,1.0,0.132818,-0.597023,-0.138127,0.597789,0.673637,0.262778,0.498466
rentl_vacnyRate,-0.163384,0.132818,1.0,0.375742,-0.242571,0.189736,0.349123,-0.020481,0.189417
unemplt_rate,-0.053428,-0.597023,0.375742,1.0,-0.501656,0.056141,-0.086004,-0.675707,0.134883
int_rate,0.569004,-0.138127,-0.242571,-0.501656,1.0,-0.732514,-0.554135,0.488726,-0.728954
cpi_rent,-0.81379,0.597789,0.189736,0.056141,-0.732514,1.0,0.929831,-0.387818,0.980774
homePrice_index,-0.762166,0.673637,0.349123,-0.086004,-0.554135,0.929831,1.0,-0.159314,0.887943
newHouse_starts,0.26706,0.262778,-0.020481,-0.675707,0.488726,-0.387818,-0.159314,1.0,-0.472726
ppi_resConstruct,-0.785862,0.498466,0.189417,0.134883,-0.728954,0.980774,0.887943,-0.472726,1.0


In [15]:
#compared with the original df, rentl_vacnyRate in df8x..corr() has:
'''
-a stronger positive correlation with homePrice_index, unemplt_rate, and ppi_resConstruct
-a weaker positive correlation with med_hIncome and cpi_rent
-a negative correlation with uspop_growth (was slightly positive)
-a weaker negative correlation with int_rate and newHouse_starts
'''

'\n-a stronger positive correlation with homePrice_index, unemplt_rate, and ppi_resConstruct\n-a weaker positive correlation with med_hIncome and cpi_rent\n-a negative correlation with uspop_growth (was slightly positive)\n-a weaker negative correlation with int_rate and newHouse_starts\n'

In [16]:
df7x_1987_2020.corr()


Unnamed: 0,uspop_growth,rentl_vacnyRate,unemplt_rate,int_rate,cpi_rent,homePrice_index,newHouse_starts,ppi_resConstruct
uspop_growth,1.0,-0.01884,0.007084,0.556928,-0.846276,-0.799672,0.247318,-0.813892
rentl_vacnyRate,-0.01884,1.0,0.334149,-0.191856,0.042124,0.195045,-0.013439,0.070549
unemplt_rate,0.007084,0.334149,1.0,-0.463699,-0.00084,-0.113098,-0.629166,0.064533
int_rate,0.556928,-0.191856,-0.463699,1.0,-0.705715,-0.549737,0.485947,-0.713242
cpi_rent,-0.846276,0.042124,-0.00084,-0.705715,1.0,0.94039,-0.356749,0.979933
homePrice_index,-0.799672,0.195045,-0.113098,-0.549737,0.94039,1.0,-0.152146,0.901359
newHouse_starts,0.247318,-0.013439,-0.629166,0.485947,-0.356749,-0.152146,1.0,-0.44578
ppi_resConstruct,-0.813892,0.070549,0.064533,-0.713242,0.979933,0.901359,-0.44578,1.0


In [17]:
#compared with df8x_1987_2018, rentl_vacnyRate in df7x..corr() has:
'''
-a weaker positive correlation with unemplt_rate, cpi_rent, homePrice_index, and ppi_resConstruct
-a weaker negative correlation with uspop_growth, int_rate, and newHouse_starts
'''

'\n-a weaker positive correlation with unemplt_rate, cpi_rent, homePrice_index, and ppi_resConstruct\n-a weaker negative correlation with uspop_growth, int_rate, and newHouse_starts\n'

In [18]:
profile8x = ProfileReport(df9x_2002_2018.drop(['ppi_resConstruct'], axis=1))
profile8x







In [19]:
#check partition sizes with a 70/30 train/test split for all DataFrames
dfs = [df3x_1956_2020, df7x_1987_2020, df8x_1987_2018, df9x_2002_2018]
for d in dfs:
   print('train size:', len(d) * .70, 'test size:', len(d) * .30)

train size: 542.5 test size: 232.5
train size: 280.7 test size: 120.3
train size: 268.79999999999995 test size: 115.19999999999999
train size: 142.79999999999998 test size: 61.199999999999996
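The fractional sizes printed above cannot be actual row counts. As a rough sketch, assuming scikit-learn's `train_test_split` rounds the test share up and gives the remaining rows to training when only `test_size` is passed, the real partition sizes can be estimated as below. The row counts (775, 401, 384, 204) are inferred from the printed output, and `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, test_size=0.30):
    """Estimate integer (train, test) sizes for a float test_size.

    Assumption: the test set is rounded up and training gets the remainder.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# Row counts inferred from the printed sizes above (for example 542.5 / 0.70 = 775).
for n_rows in (775, 401, 384, 204):
    print(n_rows, split_sizes(n_rows))
```

Checking `len(X_train3)` and `len(X_test3)` after the split is the authoritative way to confirm these numbers.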


### 1. Split into testing and training datasets
Hint: don’t forget your sklearn functions here, like train_test_split().

In [20]:
#define variable X, y for all 4 DFs
#drop ppi residential construction because it's highly correlated with cpi_rent
X3 = df3x_1956_2020.drop(['rentl_vacnyRate'], axis=1)
y3 = df3x_1956_2020['rentl_vacnyRate']

X7 = df7x_1987_2020.drop(['rentl_vacnyRate', 'ppi_resConstruct'], axis=1)
y7 = df7x_1987_2020['rentl_vacnyRate']

X8 = df8x_1987_2018.drop(['rentl_vacnyRate', 'ppi_resConstruct'], axis=1)
y8 = df8x_1987_2018['rentl_vacnyRate']

X9 = df9x_2002_2018.drop(['rentl_vacnyRate', 'ppi_resConstruct'], axis=1)
y9 = df9x_2002_2018['rentl_vacnyRate']

In [21]:
#train test split each of the 4 DFs
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.30, random_state=42)

X_train7, X_test7, y_train7, y_test7 = train_test_split(X7, y7, test_size=0.30, random_state=42)

X_train8, X_test8, y_train8, y_test8 = train_test_split(X8, y8, test_size=0.30, random_state=42)

X_train9, X_test9, y_train9, y_test9 = train_test_split(X9, y9, test_size=0.30, random_state=42)

### Establish Baseline Measurement Comparisons
Use a DummyRegressor that always predicts the mean of the training target to see what R^2, MSE, and MAE a no-skill baseline achieves on each DataFrame.
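To make the baseline concrete, here is a small worked sketch in plain Python with hypothetical target values. It uses the usual definition R^2 = 1 - SS_res / SS_tot, so predicting the training mean scores exactly 0 on the training data and typically slightly below 0 on held-out data, whose own mean differs a little from the training mean.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed by hand."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical vacancy-rate targets, not the project data.
y_train = [6.0, 7.0, 8.0, 9.0]
y_test = [6.5, 7.0, 9.5]

# The baseline "model" is a single number: the mean of the training target.
train_mean = sum(y_train) / len(y_train)

r2_train = r2(y_train, [train_mean] * len(y_train))
r2_test = r2(y_test, [train_mean] * len(y_test))
print(r2_train, r2_test)
```

Any real model should beat these baseline scores on the test split to be worth keeping.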

In [22]:
#simplest baseline (not even a model): the mean of each training target
train_mean3 = y_train3.mean()
train_mean7 = y_train7.mean()
train_mean8 = y_train8.mean()
train_mean9 = y_train9.mean()
print(train_mean3, train_mean7, train_mean8, train_mean9)

7.316789667896673 8.30285714285714 8.414552238805966 8.891549295774645


In [23]:
#Fit the dummy regressor on the training data
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train3, y_train3)
#create dummy regressor predictions 
y_tr_pred3 = dumb_reg.predict(X_train3)
#Make prediction with the single value of the (training) mean.
y_te_pred3 = train_mean3 * np.ones(len(y_test3))
r2_score(y_train3, y_tr_pred3), r2_score(y_test3, y_te_pred3)

(0.0, -0.0019935404380870825)

In [24]:
#repeat for other DFs
dumb_reg.fit(X_train7, y_train7)
y_tr_pred7 = dumb_reg.predict(X_train7)
y_te_pred7 = train_mean7 * np.ones(len(y_test7))
r2_score(y_train7, y_tr_pred7), r2_score(y_test7, y_te_pred7)

(0.0, -0.021688069284026223)

In [25]:
dumb_reg.fit(X_train8, y_train8)
y_tr_pred8 = dumb_reg.predict(X_train8)
y_te_pred8 = train_mean8 * np.ones(len(y_test8))
r2_score(y_train8, y_tr_pred8), r2_score(y_test8, y_te_pred8)

(0.0, -0.09102577573606774)

In [26]:
dumb_reg.fit(X_train9, y_train9)
y_tr_pred9 = dumb_reg.predict(X_train9)
y_te_pred9 = train_mean9 * np.ones(len(y_test9))
r2_score(y_train9, y_tr_pred9), r2_score(y_test9, y_te_pred9)

(0.0, -6.390443283788017e-05)

In [27]:
#establish baseline for mean absolute error and mean square error 
print('MAEs:', mean_absolute_error(y_train3, y_tr_pred3), mean_absolute_error(y_test3, y_te_pred3))
print('MSEs:', mean_squared_error(y_train3, y_tr_pred3), mean_squared_error(y_test3, y_te_pred3))

MAEs: 1.2185700085783147 1.2585884421075968
MSEs: 2.2213232731035797 2.3665555891614383


In [28]:
print('MAEs:', mean_absolute_error(y_train7, y_tr_pred7), mean_absolute_error(y_test7, y_te_pred7))
print('MSEs:', mean_squared_error(y_train7, y_tr_pred7), mean_squared_error(y_test7, y_te_pred7))

MAEs: 0.9914285714285713 1.093719008264462
MSEs: 1.357134693877551 1.5496232754258716


In [29]:
print('MAEs:', mean_absolute_error(y_train8, y_tr_pred8), mean_absolute_error(y_test8, y_te_pred8))
print('MSEs:', mean_squared_error(y_train8, y_tr_pred8), mean_squared_error(y_test8, y_te_pred8))

MAEs: 1.0497911561595012 0.9587493566649491
MSEs: 1.4399001726442413 1.1472408463984731


In [30]:
print('MAEs:', mean_absolute_error(y_train9, y_tr_pred9), mean_absolute_error(y_test9, y_te_pred9))
print('MSEs:', mean_squared_error(y_train9, y_tr_pred9), mean_squared_error(y_test9, y_te_pred9))

MAEs: 1.1267804007141444 1.2266242616992278
MSEs: 1.6478159095417577 1.8607120323028585


###  2. Create dummy or indicator features for categorical variables
Hint: you’ll need to think about your old favorite pandas functions here, like get_dummies(). Consult this guide for help.
<https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40>

In [31]:
#step not needed as there are no categorical variables in this data set

### 3. Standardize the magnitude of numeric features using a scaler
Hint: you might need to employ Python code like this:

In [32]:
'''
# Making a Scaler object
scaler = preprocessing.StandardScaler()
# Fitting data to the scaler object
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=names)
'''

'\n# Making a Scaler object\nscaler = preprocessing.StandardScaler()\n# Fitting data to the scaler object\nscaled_df = scaler.fit_transform(df)\nscaled_df = pd.DataFrame(scaled_df, columns=names)\n'

In [33]:
scaler = StandardScaler()
#fit the scaler on the training set
scaler.fit(X_train3)
#apply the scaling to both the train and test split
X_tr_scaled3 = scaler.transform(X_train3)
X_te_scaled3 = scaler.transform(X_test3)

In [34]:
#repeat for other 3 DFs
scaler.fit(X_train7)
#apply the scaling to both the train and test split
X_tr_scaled7 = scaler.transform(X_train7)
X_te_scaled7 = scaler.transform(X_test7)

In [35]:
scaler.fit(X_train8)
#apply the scaling to both the train and test split
X_tr_scaled8 = scaler.transform(X_train8)
X_te_scaled8 = scaler.transform(X_test8)

In [36]:
scaler.fit(X_train9)
#apply the scaling to both the train and test split
X_tr_scaled9 = scaler.transform(X_train9)
X_te_scaled9 = scaler.transform(X_test9)
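The fit-on-train, transform-both pattern in the scaling cells above can be sketched for a single feature as follows. The values are hypothetical. The sketch assumes the standard z-score formula with the population standard deviation, which is what `StandardScaler` uses by default.

```python
from statistics import fmean, pstdev

# Hypothetical values for one feature, not the project data.
x_train = [1.0, 2.0, 3.0, 4.0]
x_test = [2.5, 6.0]

# "fit": learn the mean and population standard deviation from the training split only.
mu = fmean(x_train)
sigma = pstdev(x_train)

# "transform": apply the training statistics to both splits.
z_train = [(x - mu) / sigma for x in x_train]
z_test = [(x - mu) / sigma for x in x_test]

print(z_train)
print(z_test)
```

Because the test values are scaled with training statistics, the scaled test split will generally not have mean 0 or standard deviation 1. That is expected and keeps information about the test split out of the fitted scaler.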

#### Initial Model: Train the model on the train split

In [37]:
lm3 = LinearRegression().fit(X_tr_scaled3, y_train3)
lm7 = LinearRegression().fit(X_tr_scaled7, y_train7)
lm8 = LinearRegression().fit(X_tr_scaled8, y_train8)
lm9 = LinearRegression().fit(X_tr_scaled9, y_train9)

In [38]:
#Make predictions using the model on both train and test splits
y_tr_pred3 = lm3.predict(X_tr_scaled3)
y_te_pred3 = lm3.predict(X_te_scaled3)

y_tr_pred7 = lm7.predict(X_tr_scaled7)
y_te_pred7 = lm7.predict(X_te_scaled7)

y_tr_pred8 = lm8.predict(X_tr_scaled8)
y_te_pred8 = lm8.predict(X_te_scaled8)

y_tr_pred9 = lm9.predict(X_tr_scaled9)
y_te_pred9 = lm9.predict(X_te_scaled9)

In [39]:
#Assess model performance
# r^2 - train, test
r2_3 = r2_score(y_train3, y_tr_pred3), r2_score(y_test3, y_te_pred3)
r2_7 = r2_score(y_train7, y_tr_pred7), r2_score(y_test7, y_te_pred7)
r2_8 = r2_score(y_train8, y_tr_pred8), r2_score(y_test8, y_te_pred8)
r2_9 = r2_score(y_train9, y_tr_pred9), r2_score(y_test9, y_te_pred9)

print('r2_3:', r2_3)
print('r2_7:', r2_7)
print('r2_8:', r2_8)
print('r2_9:', r2_9)

r2_3: (0.3786286887979474, 0.36527363953055403)
r2_7: (0.5624284164387359, 0.4945240261279067)
r2_8: (0.7788776493349637, 0.702583887364401)
r2_9: (0.9037194857757508, 0.926397963119903)


**This is markedly better R^2 performance than the dummy regressor (mean) baseline (see earlier):**

Dummy3 - (0.0, -0.0019935404380870825)

Dummy7 - (0.0, -0.021688069284026223)

Dummy8 - (0.0, -0.09102577573606774)

Dummy9 - (0.0, -6.390443283788017e-05)


In [40]:
#MAE - train, test
mae3 = mean_absolute_error(y_train3, y_tr_pred3), mean_absolute_error(y_test3, y_te_pred3)
mae7 = mean_absolute_error(y_train7, y_tr_pred7), mean_absolute_error(y_test7, y_te_pred7)
mae8 = mean_absolute_error(y_train8, y_tr_pred8), mean_absolute_error(y_test8, y_te_pred8)
mae9 = mean_absolute_error(y_train9, y_tr_pred9), mean_absolute_error(y_test9, y_te_pred9)
print('mae3:', mae3)
print('mae7:', mae7)
print('mae8:', mae8)
print('mae9:', mae9)

mae3: (0.9749857684123913, 1.0066431519713577)
mae7: (0.6356097465140528, 0.6986135878501661)
mae8: (0.4544621075570944, 0.4658158009465815)
mae9: (0.3322796827360887, 0.2904618915941307)


In [41]:
# MSE - train, test
mse3 = mean_squared_error(y_train3, y_tr_pred3), mean_squared_error(y_test3, y_te_pred3)
mse7 = mean_squared_error(y_train7, y_tr_pred7), mean_squared_error(y_test7, y_te_pred7)
mse8 = mean_squared_error(y_train8, y_tr_pred8), mean_squared_error(y_test8, y_te_pred8)
mse9 = mean_squared_error(y_train9, y_tr_pred9), mean_squared_error(y_test9, y_te_pred9)

print('mse3:', mse3)
print('mse7:', mse7)
print('mse8:', mse8)
print('mse9:', mse9)

mse3: (1.3802665548120066, 1.4991266463657213)
mse7: (0.5938435771059314, 0.7666697476752088)
mse8: (0.31839411089808617, 0.3127404689980033)
mse9: (0.15865256311757933, 0.13694344433165437)


**This is markedly better MAE and MSE performance than the dummy regressor (mean) baseline (see earlier):**

Dummy3 -
MAEs: 1.2185700085783147 1.2585884421075968
MSEs: 2.2213232731035797 2.3665555891614383

Dummy7 -
MAEs: 0.9914285714285713 1.093719008264462
MSEs: 1.357134693877551 1.5496232754258716

Dummy8 -
MAEs: 1.0497911561595012 0.9587493566649491
MSEs: 1.4399001726442413 1.1472408463984731

Dummy9 -
MAEs: 1.1267804007141444 1.2266242616992278
MSEs: 1.6478159095417577 1.8607120323028585

## Save processed data

In [42]:
#scale each full DataFrame, then save both the scaled and unscaled versions
df3_scale = scaler.fit_transform(df3x_1956_2020)
df3x_scale_1956_2020 = pd.DataFrame(df3_scale, columns=df3x_1956_2020.columns)
df3x_scale_1956_2020.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/df3x_scale_1956_2020', index=False)
df3x_1956_2020.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/df3x_1956_2020', index=False)

df7_scale = scaler.fit_transform(df7x_1987_2020)
df7x_scale_1987_2020 = pd.DataFrame(df7_scale, columns=df7x_1987_2020.columns)
df7x_scale_1987_2020.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/df6x_scale_1987_2020', index=False)
df7x_1987_2020.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/df6x_1987_2020', index=False)

df8_scale = scaler.fit_transform(df8x_1987_2018)
df8x_scale_1987_2018 = pd.DataFrame(df8_scale, columns=df8x_1987_2018.columns)
df8x_scale_1987_2018.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/df7x_scale_1987_2018', index=False)
df8x_1987_2018.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/df7x_1987_2018', index=False)


df9_scale = scaler.fit_transform(df9x_2002_2018)
df9x_scale_2002_2018 = pd.DataFrame(df9_scale, columns=df9x_2002_2018.columns)
df9x_scale_2002_2018.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/df8x_scale_2002_2018', index=False)
df9x_2002_2018.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/df9x_2002_2018', index=False)

In [43]:
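#save the scaled training and test splits
#scaler.transform returns a NumPy array, so wrap each split in a DataFrame before calling to_csv
pd.DataFrame(X_tr_scaled3, columns=X_train3.columns).to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/X_tr_scaled3', index=False)
pd.DataFrame(X_te_scaled3, columns=X_test3.columns).to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/X_te_scaled3', index=False)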

### Summary
This summary should provide a quick overview for someone wanting to know quickly why the given model was chosen for the next part of the business problem to help guide important business decisions.

In [None]:
#complete summary

#dropped ppi_resConstruct because of its high correlation with cpi_rent

### Reflection: 
**Review the following questions and apply them to your dataset**:

● Does my data set have any categorical data, such as Gender or day of the week?

● Do my features have data values that range from 0-100, 0-1, or both and more?

In [None]:
#MAKE PIPELINE AND TRY PREDICTIONS
#scale the test and training splits...

#look again at DataCamp for help with non-linear variables
    #standardize, log-transform or normalize your data, as well as statistically valid ways to remove outliers.
    #make violin or box plots of each variable
    
    
#try percent change (pct_change) due to time series as well, see this and then maybe make 3rd data set?
 #check FRED ReadMe/literature on each variable to make sure they don't need to be converted
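The pipeline item in the to-do cell above could be sketched roughly as follows. This is an illustration on synthetic data (the arrays `X` and `y` are generated here, not loaded from the project files), using `make_pipeline` so that the scaler is fit only on the training split whenever the pipeline is fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 features, linear signal plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Scaling and regression chained in one estimator.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

# score() returns R^2 for regressors.
test_r2 = pipe.score(X_test, y_test)
print(round(test_r2, 3))
```

With a pipeline, the separate `scaler.fit` and `scaler.transform` cells for each DataFrame collapse into one `fit` call, and the same object can be passed to `cross_validate` or `GridSearchCV`.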