## Preprocessing and Training Data Development - Vacancy Rates 
### Goal:  Create a cleaned development dataset you can use to complete the modeling step of your project


#### Steps: 
● 1. Create dummy or indicator features for categorical variables

● 2. Standardize the magnitude of numeric features using a scaler

● 3. Split into testing and training datasets

In [1]:
#imports
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
from pandas_profiling import ProfileReport

In [2]:
#load data
path= '/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/interim'
os.chdir(path) 

In [3]:
DF = pd.read_csv('VacancyRate_Zipcode_AND_National_2011_2020.csv',  dtype={'Zipcode': object}, parse_dates=['Year'])
DF

Unnamed: 0,Zipcode,Vacancy_Rate%,MOE-VacancyRate%,Year
0,02333,3.024027,2.199925,2011-01-01
1,02338,3.116343,2.948791,2011-01-01
2,02339,4.464646,2.066438,2011-01-01
3,02341,3.586322,2.340722,2011-01-01
4,02343,3.732901,2.926524,2011-01-01
...,...,...,...,...
264965,NATNL,6.850000,0.000000,2016-01-01
264966,NATNL,7.175000,0.000000,2017-01-01
264967,NATNL,6.875000,0.000000,2018-01-01
264968,NATNL,6.750000,0.000000,2019-01-01


In [4]:
DF.dtypes

Zipcode                     object
Vacancy_Rate%              float64
MOE-VacancyRate%           float64
Year                datetime64[ns]
dtype: object

In [5]:
#Create a new dataframe, setting the index to 'Year'
df = DF.set_index('Year')
#Save the DATE labels 
df_index = df.index
#Save the column names
df_columns = df.columns
df.head()

Year,Zipcode,Vacancy_Rate%,MOE-VacancyRate%
2011-01-01,2333,3.024027,2.199925
2011-01-01,2338,3.116343,2.948791
2011-01-01,2339,4.464646,2.066438
2011-01-01,2341,3.586322,2.340722
2011-01-01,2343,3.732901,2.926524


In [6]:
#check for NaNs
df.isna().sum()

Zipcode             0
Vacancy_Rate%       0
MOE-VacancyRate%    0
dtype: int64

In [7]:
#check the share (%) of rows taken by each unique value in the MOE column
df['MOE-VacancyRate%'].value_counts()/len(df)*100

0.000000      5.630071
100.000000    0.207571
20.000000     0.046043
33.333333     0.040759
25.000000     0.037363
                ...   
8.671855      0.000377
17.389154     0.000377
2.754040      0.000377
26.238062     0.000377
5.466803      0.000377
Name: MOE-VacancyRate%, Length: 234862, dtype: float64
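
The same percentage breakdown is available directly from `value_counts(normalize=True)`. A minimal sketch on a made-up Series (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df['MOE-VacancyRate%']; values are illustrative only
moe = pd.Series([0.0, 0.0, 100.0, 20.0])

# Manual version: counts divided by the number of rows, as a percentage
manual_pct = moe.value_counts() / len(moe) * 100

# Built-in equivalent: normalize=True returns relative frequencies
builtin_pct = moe.value_counts(normalize=True) * 100
```

Both give the share of rows per unique value, sorted from most to least frequent.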

In [8]:
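#change National ('NATNL') zipcode to '99999' for later modeling
#(assign the result back; inplace=True on a selected column may not modify df in newer pandas)
df['Zipcode'] = df['Zipcode'].replace('NATNL', '99999')
df['Zipcode'] = df['Zipcode'].astype('int')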
df.dtypes

Zipcode               int64
Vacancy_Rate%       float64
MOE-VacancyRate%    float64
dtype: object

In [9]:
#rough partition sizes for a 7/8 : 1/8 train/test split
#(note: TimeSeriesSplit(n_splits=8) holds out len(df) // 9 rows per fold, so the actual sizes differ slightly)
print('train size:', len(df) * .875, 'test size:', len(df) * .125)

train size: 231848.75 test size: 33121.25


### 1. Split into testing and training datasets
Hint: don’t forget your sklearn functions here, like train_test_split().

In [10]:
#separate 2019 and 2020 data from the data set for later use and prediction
df_2019AND20 = df[df.index > '2018']
df = df[df.index < '2019']
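
Comparing a datetime index with the string `'2018'` treats it as January 1, 2018, which works here because every row is dated January 1. A minimal sketch of an equivalent split that names the years explicitly, using a toy frame with made-up values:

```python
import pandas as pd

# Toy frame with a yearly DatetimeIndex (hypothetical values)
idx = pd.to_datetime(['2017-01-01', '2018-01-01', '2019-01-01', '2020-01-01'])
toy = pd.DataFrame({'Vacancy_Rate%': [3.0, 3.1, 3.2, 3.3]}, index=idx)

# Explicit year masks do not depend on how the string '2018' is parsed
holdout = toy[toy.index.year >= 2019]   # 2019 and 2020
history = toy[toy.index.year <= 2018]   # everything through 2018
```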

In [11]:
#define variable X, y
X = df.drop('Vacancy_Rate%', axis=1)
y = df['Vacancy_Rate%']

In [12]:
#time series train/test split
#each pass overwrites the variables, so only the last (largest) of the 8 folds is kept
tss = TimeSeriesSplit(n_splits = 8)
for train_index, test_index in tss.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index,:]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
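
To see what `TimeSeriesSplit(n_splits=8)` produces, the sketch below collects the fold sizes for 90 toy rows (a made-up row count). By default each test fold holds `n_samples // (n_splits + 1)` rows, so the final fold is roughly an 8/9 to 1/9 split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 90 ordered toy rows (hypothetical); real data should be sorted by date first
X_toy = np.arange(90).reshape(-1, 1)

tss = TimeSeriesSplit(n_splits=8)
fold_sizes = [(len(tr), len(te)) for tr, te in tss.split(X_toy)]

# A loop that reassigns its variables on every pass ends with the last fold only
last_train_size, last_test_size = fold_sizes[-1]
```

The training window grows by one test-fold length on each pass, while the test fold always covers the rows immediately after it.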

### Establish Baseline Measurement Comparisons
Use a dummy regressor to see what R2, MSE, and MAE would be if the training-set mean were predicted for every row.

In [13]:
#initial baseline (not even a model): the mean of the training target
train_mean = y_train.mean()

print(train_mean)

17.601757567048768


In [14]:
#Fit the dummy regressor on the training data
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
#create dummy regressor predictions 
y_tr_pred = dumb_reg.predict(X_train)
#Make prediction with the single value of the (training) mean.
y_te_pred = train_mean * np.ones(len(y_test))
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)

(0.0, -0.0014593635752446765)
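
The training R2 of exactly 0.0 follows from the definition: a constant prediction equal to the mean leaves the residual sum of squares equal to the total sum of squares. A minimal sketch on toy values (hypothetical, not from the dataset):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Toy data (hypothetical); the dummy regressor ignores the features
X_toy = np.zeros((4, 1))
y_toy = np.array([1.0, 2.0, 3.0, 6.0])

dummy = DummyRegressor(strategy='mean').fit(X_toy, y_toy)
pred = dummy.predict(X_toy)

# Every prediction equals the training mean, so R^2 on the training data is 0
train_r2 = r2_score(y_toy, pred)
```

On unseen data the score is typically slightly negative, because the test mean rarely matches the training mean exactly.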

In [15]:
#establish baseline for mean absolute error and mean square error 
print('MAEs:', mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred))
print('MSEs:', mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred))

MAEs: 11.649025685891326 12.356698194671722
MSEs: 266.9221614513521 296.48278858832686


###  2. Create dummy or indicator features for categorical variables
Hint: you’ll need to think about your old favorite pandas functions here, like get_dummies(). Consult this guide for help:
<https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40>

In [16]:
#step not needed as the zipcode variable was changed to integer above
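
For reference, if the zip code were treated as a category instead of an integer, `pd.get_dummies` would create one indicator column per code. A minimal sketch on a toy frame with made-up values; with tens of thousands of zip codes this would produce a very wide frame, which is one practical reason to avoid it here.

```python
import pandas as pd

# Toy frame (hypothetical values); Zipcode kept as a string category
toy = pd.DataFrame({'Zipcode': ['02333', '02338', 'NATNL'],
                    'MOE-VacancyRate%': [2.2, 2.9, 0.0]})

# One indicator column per zip code; the numeric column passes through unchanged
encoded = pd.get_dummies(toy, columns=['Zipcode'])
```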

### 3. Standardize the magnitude of numeric features using a scaler
Hint: you might need to employ Python code like this:

In [18]:
#don't need because everything is % (0-100)

'''
scaler = StandardScaler()
#fit the scaler on the training set
scaler.fit(X_train)
#apply the scaling to both the train and test split
X_tr_scaled = scaler.transform(X_train)
X_te_scaled = scaler.transform(X_test)
'''

#### Initial Model: Train the model on the train split

In [19]:
lm = LinearRegression().fit(X_train, y_train)

In [20]:
#Make predictions using the model on both train and test splits
y_tr_pred = lm.predict(X_train)
y_te_pred = lm.predict(X_test)

In [21]:
#Assess model performance
# r^2 - train, test
r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
print('r2:', r2)

r2: (0.4826938195558145, 0.4819883171151147)


**This is markedly better than the dummy regressor (mean) baseline for R^2 (see earlier):**

Dummy R2 = (0.0, -0.0014593635752446765)

In [22]:
#MAE - train, test
mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
print('mae:', mae)

mae: (7.940877958949165, 8.326015280389571)


In [23]:
# MSE - train, test
mse = mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
print('mse:', mse)

mse: (138.0804838163052, 153.35774355811242)


**This is also markedly better than the dummy regressor (mean) baseline for MAE and MSE (see earlier):**

Dummy -

MAEs: 11.649025685891326 12.356698194671722

MSEs: 266.9221614513521 296.48278858832686

**MSE is still very high (possibly due to a few large errors, since squaring penalizes outliers heavily; dataset size alone does not inflate a mean)**
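
Because MSE is expressed in squared units, its square root (RMSE) is easier to compare with MAE. A quick check using the linear regression test-set values reported above:

```python
import numpy as np

# Test-set values copied from the linear regression results above
test_mse = 153.35774355811242
test_mae = 8.326015280389571

# RMSE is in the same units as the target (percentage points)
test_rmse = np.sqrt(test_mse)
```

An RMSE noticeably larger than the MAE suggests that a relatively small number of large errors contributes much of the squared error.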

## Save processed data

In [24]:
#save vacancy rate data for modeling - remember to use random state=42 in the modeling notebook!
#keep the 'Year' index in the output (index=False would drop it, since 'Year' is the index and not a column)
df.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/VacancyRate_Zipcode_AND_National_2011_2018')
df_2019AND20.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/VacancyRate_Zipcode_National_2019_2020')
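
Since `Year` is stored as the index, writing with `index=False` leaves it out of the file. A minimal sketch of a round trip that preserves it, using a toy frame and a temporary path (both hypothetical, chosen only for this illustration):

```python
import os
import tempfile
import pandas as pd

# Toy frame with a 'Year' DatetimeIndex (hypothetical values)
toy = pd.DataFrame(
    {'Zipcode': [2333, 2338], 'Vacancy_Rate%': [3.0, 3.1]},
    index=pd.DatetimeIndex(['2011-01-01', '2012-01-01'], name='Year'),
)

# Hypothetical temporary location, not the project's data directory
out_path = os.path.join(tempfile.mkdtemp(), 'toy_vacancy.csv')

toy.to_csv(out_path)  # default index=True writes 'Year' as the first column
reloaded = pd.read_csv(out_path, index_col='Year', parse_dates=['Year'])
```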

In [25]:
#save the scaled training and test splits

#X_tr_scaled.to_csv(r'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/processed/X_tr_scaled', index=False)
#X_te_scaled
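
If the scaled splits are ever saved, note that `StandardScaler.transform` returns a NumPy array by default, which has no `to_csv` method. A minimal sketch that wraps the result back into a DataFrame first, using a toy training split with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training features (hypothetical values) with a 'Year' index
X_train_toy = pd.DataFrame(
    {'Zipcode': [2333, 2338, 2339], 'MOE-VacancyRate%': [2.2, 2.9, 2.1]},
    index=pd.DatetimeIndex(['2011-01-01'] * 3, name='Year'),
)

scaler = StandardScaler().fit(X_train_toy)
scaled = scaler.transform(X_train_toy)  # plain NumPy array by default

# Restore column names and the index so the result can be written with to_csv
X_tr_scaled = pd.DataFrame(scaled, columns=X_train_toy.columns,
                           index=X_train_toy.index)
```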

### Summary
This summary should provide a quick overview for someone wanting to know quickly why the given model was chosen for the next part of the business problem to help guide important business decisions.

- loaded data and set the 'Year' column as a datetime index to prepare for time series analysis
- inspected data for NaNs and unique values
- separated the 2011-2018 data from the 2019-2020 data, which is held out for later use and prediction
- performed a TimeSeries train/test split
- established baseline comparisons using a dummy regressor
- trained a linear regression model on the training data
    - this yielded an R2 of .48 on the test set
    - MAE of 8.33 on the test set
    - MSE of 153.36 on the test set
    - markedly better performance than the dummy regressor (mean) baseline
    - model underperforms the other multiple regression models (i.e., linear regression, ridge regression, and random forest) in notebook 4.3