# Baseline Modeling

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

In [2]:
groupdf = pd.read_csv('./data/groupdf.csv')

In [3]:
groupdf.head()

Unnamed: 0,location,totalvalue,latitude_x,longitude_x,logvalue,review_count,latitude_y,longitude_y,log_reviews,price_1.0,...,type_Venues,type_Vietnamese,type_Vitaminssupplements,type_Waffles,type_Whiskeybars,type_Wine_Bars,type_Winetastingroom,type_Womenscloth,type_Wraps,type_Yoga
0,90001,292490.6,33.968543,-118.261693,12.564232,8588,6353.401556,-22112.494475,502.254649,135.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,90002,287087.7,33.946024,-118.250578,12.551515,1081,1459.928001,-5084.555034,98.1702,39.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,90003,297284.7,33.961248,-118.273066,12.573247,4701,4755.757869,-16558.215243,383.669881,118.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,90004,1024344.0,34.077047,-118.313083,13.743391,69031,7939.550715,-27565.505661,1115.333722,98.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,90005,1113551.0,34.058708,-118.319786,13.858325,106745,7901.772113,-27445.860659,1236.250662,73.0,...,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Setting our primary explanatory variables for the baseline model: price levels, ratings, log reviews, and only the top 6 restaurant types.

In [4]:
X = groupdf[['price_1.0',
             'price_2.0',
             'price_3.0',
             'price_4.0',
             'rating_1.0',
             'rating_2.0',
             'rating_2.5',
             'rating_3.0',
             'rating_3.5',
             'rating_4.0',
             'rating_4.5',
             'rating_5.0',
             'type_Mexican',
             'type_Coffee',
             'type_Pizza',
             'type_Hotdogs',
             'type_Burgers',
             'type_Bakeries',
             'log_reviews']]

Setting our outcome variable as the log-transformed home values

In [5]:
y = groupdf.logvalue

Instantiating a linear regression model, splitting the data into training and test sets, and scaling the features.

In [6]:
lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [7]:
sc = StandardScaler()

In [8]:
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
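
One way to keep the scaler and the estimator together, so that a model is never scored on data scaled differently from what it was fit on, is a scikit-learn `Pipeline`. The following is a minimal sketch on synthetic data, not the notebook's actual workflow: `X_demo`, `y_demo`, and the data-generating numbers are assumptions made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the groupdf features).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(120, 4))
y_demo = 0.03 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The pipeline fits the scaler on the training rows only, then reuses
# the training means and variances when it scores the test rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

train_r2 = pipe.score(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)

# The fitted scaler is reachable by its auto-generated step name.
scaler_means = pipe.named_steps["standardscaler"].mean_
```

Calling `pipe.score(X_te, y_te)` applies the same transform-then-predict sequence as the separate `sc.transform` and `lr.score` calls, just in one object.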

In [9]:
lr.fit(X_train_sc, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [10]:
lr.score(X_train_sc, y_train)

0.694967392695512

In [11]:
lr.score(X_test_sc, y_test)

-0.0974639276197198

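So it seems like our baseline linear regression model is badly overfit to the training set: the test R-squared is negative, which means the model predicts the held-out zip codes worse than simply predicting the mean home value would. Multicollinearity among the features is a likely contributor, so next we're going to try Lasso and Ridge to see if regularization can ameliorate some of these issues.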
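
To make the negative test score concrete, here is a small sketch of how R-squared (the quantity that `.score()` returns for scikit-learn regressors) is computed. The toy `y_true` values and predictions are made up for illustration and have nothing to do with the home-value data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

# Always predicting the mean gives exactly 0.
mean_r2 = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])

# Predictions that are worse than the mean push the score below 0.
bad_r2 = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])
```

Here `mean_r2` is 0.0 and `bad_r2` is -3.0, so a score just below zero, like the one above, indicates a model that is slightly worse than the mean-only baseline on the test set.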

In [12]:
lasso = Lasso(alpha=3, max_iter=50000)
lasso.fit(X_train, y_train)
lasso.score(X_train, y_train)

0.5642698015054128

In [13]:
lasso.score(X_test, y_test)

0.2705440575138798

As we can see, the score for the training set dropped from .69 to .56, but the test set now has a positive score of .27. I tried various alphas, and an alpha of 3 gave the highest scores. Next we'll try Ridge.
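
The notebook does not record how the candidate alphas were compared. One common approach, sketched below on synthetic data, is to score each alpha with cross-validation inside the training set so that the test set is only used once at the end. The data, the `alphas` list, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (hypothetical; not the groupdf features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 8))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Mean 5-fold cross-validated R^2 on the training rows for each alpha.
alphas = [0.001, 0.01, 0.1, 1.0, 3.0]
cv_scores = {}
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=50000)
    cv_scores[alpha] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

# Refit with the best alpha and touch the test set only once.
best_alpha = max(cv_scores, key=cv_scores.get)
final = Lasso(alpha=best_alpha, max_iter=50000).fit(X_tr, y_tr)
test_r2 = final.score(X_te, y_te)
```

Selecting alpha this way keeps the reported test score an honest estimate, since the test rows play no part in choosing the hyperparameter.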

In [14]:
ridge = Ridge(alpha=1000)
ridge.fit(X_train, y_train)
ridge.score(X_train, y_train)

0.6653061306317665

In [15]:
ridge.score(X_test, y_test)

0.2541392669246998

Even with alpha cranked up to 1000 (beyond that, the score stopped improving), Ridge did not perform as well as Lasso on the test set, so now I'm going to try Random Forests.

In [21]:
rf = RandomForestRegressor(n_estimators=100)

In [22]:
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [23]:
rf.score(X_train, y_train)

0.9329949018504071

In [24]:
rf.score(X_test, y_test)

0.3758669494527853

Looks like Random Forests has the best test score thus far (.38), although the gap from its training score (.93) shows that it is still heavily overfit. We tried tuning the hyperparameters, but this did not improve the test score much, so we decided to leave the remaining hyperparameters at their defaults. To be able to interpret the statistical significance of our coefficients, I am now going to try Statsmodels.
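
The tuning runs themselves are not shown in this notebook. As a rough sketch of what such a search can look like, the example below runs a small cross-validated grid search over two Random Forest hyperparameters on synthetic data. The grid values, the data, and the variable names are assumptions chosen for illustration, not the settings that were actually tried.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical; not the groupdf features).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 5))
y_demo = (np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2
          + rng.normal(scale=0.2, size=120))

# A deliberately small grid: shallower trees and larger leaves both
# limit how closely each tree can fit the training rows.
param_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_   # best combination found in the grid
best_cv_r2 = search.best_score_     # its mean cross-validated R^2
```

Fixing `random_state` makes the comparison between grid points repeatable, which matters when the differences between settings are small.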

In [None]:
X = sm.add_constant(X)

In [26]:
model = sm.OLS(y, X).fit()

In [27]:
model.summary()

Dep. Variable:,logvalue,R-squared:,0.621
Model:,OLS,Adj. R-squared:,0.534
Method:,Least Squares,F-statistic:,7.158
Date:,"Mon, 29 Apr 2019",Prob (F-statistic):,7.61e-11
Time:,15:11:53,Log-Likelihood:,-27.519
No. Observations:,103,AIC:,95.04
Df Residuals:,83,BIC:,147.7
Df Model:,19,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,13.5778,0.088,154.939,0.000,13.404,13.752
price_1.0,-0.0264,0.011,-2.317,0.023,-0.049,-0.004
price_2.0,-0.0167,0.012,-1.407,0.163,-0.040,0.007
price_3.0,-0.0191,0.012,-1.577,0.119,-0.043,0.005
price_4.0,-0.0233,0.017,-1.401,0.165,-0.056,0.010
rating_1.0,0.0051,0.042,0.121,0.904,-0.078,0.089
rating_2.0,0.0184,0.017,1.114,0.268,-0.014,0.051
rating_2.5,0.0270,0.016,1.738,0.086,-0.004,0.058
rating_3.0,0.0234,0.013,1.832,0.071,-0.002,0.049

Omnibus:,5.425,Durbin-Watson:,1.401
Prob(Omnibus):,0.066,Jarque-Bera (JB):,4.78
Skew:,-0.473,Prob(JB):,0.0917
Kurtosis:,3.468,Cond. No.,2700.0


In [28]:
model.pvalues[model.pvalues<0.05]

const        5.471415e-104
price_1.0     2.295518e-02
dtype: float64

In [29]:
model.params[model.params.index.isin(model.pvalues[model.pvalues<0.05].index)]

const        13.577829
price_1.0    -0.026399
dtype: float64

As we can see, the R-squared is .62, meaning that our model explains 62% of the variance in log home values in the sample it was fit on (the adjusted R-squared is .53). However, almost none of the coefficients are statistically significant: only the coefficient on 1-dollar-sign restaurants is significant at an $\alpha$ of .05. Because this model was fit on the unscaled features, the interpretation of the coefficient is that each additional 1-dollar-sign restaurant in a zip code is associated with a roughly 2.6% decrease in home value, on average and holding all other variables in our model constant.
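
Because `logvalue` is a log-transformed outcome, a coefficient can be converted into a percentage change in home value per one-unit change in the regressor as it was passed to `sm.OLS`. The short sketch below does that arithmetic for the `price_1.0` coefficient reported above; the variable names are made up for illustration.

```python
import math

# Coefficient on price_1.0, copied from the model.params output above.
coef_price_1 = -0.026399

# In a log-level model, one more unit of x multiplies y by exp(coef),
# so the exact percentage change is (exp(coef) - 1) * 100.
exact_pct_change = (math.exp(coef_price_1) - 1.0) * 100.0

# For small coefficients, coef * 100 is a close approximation.
approx_pct_change = coef_price_1 * 100.0
```

Both values round to about -2.6, which is where the 2.6% figure comes from.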
