# Parsimonious Models

## Package Imports

Code to import some packages has been included below.  Feel free to import any additional packages that you need here, or later in the assignment.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from scipy.stats import norm
from scipy.stats import t
import statsmodels.api as sm
import statsmodels.formula.api as smf

<hr>

## <u>Case Study</u>: Predicting Airbnb Superhost Status

Suppose that you have a spare room in your house (in Seattle) and you are considering advertising this room to guests on Airbnb. Airbnb superhosts are considered to be experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests. Once a host reaches Superhost status, a superhost badge will automatically appear on their listing and profile to help guests identify them. You would like to assess your chances of being named a superhost with your property.

The following dataset is a sample (assume random) of available Airbnb listings in Seattle, WA. These listings were collected in January 2016, and filtered to just contain listings from the five most popular Seattle neighborhoods (for Airbnb listings) and just contain listings that are either in a house or apartment property.

This dataset contains the following variables.

**Listing Information**
The dataset contains the following information about the Airbnb *listing*:
* <u>price</u>: price of the listing per night (in US dollars)
* <u>review_scores_rating</u>: the average rating of the listing [0,100] (100 is the best)
* <u>number_of_reviews</u>: the number of reviews for the listing
* <u>security_deposit</u>: the security deposit required for the listing (in US dollars)
* <u>cleaning_fee</u>: the cleaning fee required for the listing (in US dollars)
* <u>neighborhood</u>: the neighborhood of Seattle the listing is located in
* <u>property_type</u>: is the listing in a 'House' or 'Apartment'
* <u>room_type</u>: is the listing an 'Entire home/apt', 'Private room', or 'Shared room'
* <u>accommodates</u>: how many guests will the listing accommodate
* <u>bathrooms</u>: how many bathrooms does the listing have
* <u>beds</u>: how many beds does the listing have

**Host Information**
The dataset also contains the following information about the *host* of the given Airbnb listing:
* <u>host_is_superhost</u>: is the host a "superhost": t=True, f=False
* <u>host_has_profile_pic</u>: does the host have a profile pic in their bio: t=True, f=False
* <u>host_response_time</u>: how fast will the host respond to requests (on average)
* <u>host_acceptance_rate</u>: what percent of booking requests will the host accept


## 1. Model Searching 

We'd like to fit a parsimonious model to predict the **price** of an Airbnb in Seattle.

For this question, we will consider as possible predictors:
- **cleaning_fee**
- **room_type**
- **beds**
- **bathrooms**
- **host_acceptance_rate**
- **host_response_time**

For this question, you will perform backward elimination using adjusted $R^2$ as the metric.

From this output, you will be asked to report:
- the baseline level for the variable host_response_time
- the first variable removed from the model
- the number of distinct models fit
- the number of possible models you could fit
- the optimal adjusted $R^2$
- and other, theoretical questions.

In [2]:
df = pd.read_csv('seattle_airbnb_listings_cleaned.csv')

In [3]:
mod1 = smf.ols('price~cleaning_fee+room_type+beds+bathrooms+host_acceptance_rate+host_response_time', data=df).fit()
mod1.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.464
Model:,OLS,Adj. R-squared:,0.449
Method:,Least Squares,F-statistic:,31.61
Date:,"Wed, 03 May 2023",Prob (F-statistic):,1.08e-39
Time:,19:14:19,Log-Likelihood:,-1910.0
No. Observations:,339,AIC:,3840.0
Df Residuals:,329,BIC:,3878.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,88.4181,99.754,0.886,0.376,-107.819,284.655
room_type[T.Private room],-53.9800,11.587,-4.659,0.000,-76.775,-31.185
room_type[T.Shared room],-53.7469,15.005,-3.582,0.000,-83.264,-24.230
host_response_time[T.within a day],-4.6667,70.392,-0.066,0.947,-143.142,133.809
host_response_time[T.within a few hours],5.4227,69.455,0.078,0.938,-131.210,142.055
host_response_time[T.within an hour],-5.6033,69.242,-0.081,0.936,-141.817,130.610
cleaning_fee,0.5392,0.121,4.468,0.000,0.302,0.777
beds,13.6743,4.114,3.324,0.001,5.581,21.767
bathrooms,40.5366,8.806,4.604,0.000,23.214,57.859

0,1,2,3
Omnibus:,467.513,Durbin-Watson:,1.951
Prob(Omnibus):,0.0,Jarque-Bera (JB):,80135.686
Skew:,6.536,Prob(JB):,0.0
Kurtosis:,77.179,Cond. No.,2870.0
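
The summary lists coefficients for only three `host_response_time` levels because the remaining level serves as the baseline. The snippet below is a minimal sketch of how that baseline arises, assuming the formula interface's default treatment coding, which sorts string levels and uses the first one as the reference. The level list combines the three levels shown in the summary with the baseline reported in the answers of this notebook; the variable names are illustrative only.

```python
# Levels of host_response_time: three appear in the summary table,
# the fourth is the baseline reported in this notebook's answers.
levels = [
    "within an hour",
    "within a few hours",
    "within a day",
    "a few days or more",
]

# Assumption: treatment (dummy) coding sorts string levels and
# treats the first sorted level as the reference category.
ordered = sorted(levels)
baseline, dummies = ordered[0], ordered[1:]

print("baseline:", baseline)
print([f"host_response_time[T.{lvl}]" for lvl in dummies])
```

The dummy names printed here follow the same order as the rows in the summary table.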


The adjusted $R^2$ of model 1 (the full model) is 0.449.
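
As a sanity check, the adjusted value can be reproduced from the other numbers in the summary. This sketch uses the standard adjusted $R^2$ formula with the rounded $R^2$ (0.464), the number of observations (339), and the model degrees of freedom (9) from the table above. The helper name `adjusted_r2` is hypothetical, not part of statsmodels.

```python
def adjusted_r2(r2, n_obs, n_slopes):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_slopes - 1)

# Values read off the model 1 summary: R-squared, No. Observations, Df Model.
print(round(adjusted_r2(0.464, 339, 9), 3))  # 0.449
```

The penalty factor $(n-1)/(n-p-1)$ grows with the number of slopes, which is why dropping a weak predictor can raise the adjusted $R^2$ even though the plain $R^2$ can never increase.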

**Iteration 1 of Backward Elimination**

In [31]:
# eliminate the variable 'cleaning fee'
mod1_2 = smf.ols('price~room_type+beds+bathrooms+host_acceptance_rate+host_response_time', data=df).fit()
mod1_2.rsquared_adj

0.41742637884273226

In [5]:
# eliminate the variable 'room_type'
mod1_3 = smf.ols('price~cleaning_fee+beds+bathrooms+host_acceptance_rate+host_response_time', data=df).fit()
mod1_3.rsquared_adj

0.4021741670720679

In [6]:
# eliminate the variable 'beds'
mod1_4 = smf.ols('price~cleaning_fee+room_type+bathrooms+host_acceptance_rate+host_response_time', data=df).fit()
mod1_4.rsquared_adj

0.4323043393911763

In [7]:
# eliminate the variable 'bathrooms'
mod1_5 = smf.ols('price~cleaning_fee+room_type+beds+host_acceptance_rate+host_response_time', data=df).fit()
mod1_5.rsquared_adj

0.4153680674029714

In [8]:
# eliminate the variable 'host acceptance rate'
mod1_6 = smf.ols('price~cleaning_fee+room_type+beds+bathrooms+host_response_time', data=df).fit()
mod1_6.rsquared_adj

0.45004319831677364

In [9]:
# eliminate the variable 'host response time'
mod1_7 = smf.ols('price~cleaning_fee+room_type+beds+bathrooms+host_acceptance_rate', data=df).fit()
mod1_7.rsquared_adj

0.4513451665178705

At least one reduced model has a higher adjusted $R^2$ than model 1 (which includes all explanatory variables). The model with the highest adjusted $R^2$ is mod1_7, which drops 'host_response_time', so that variable is removed first.

**Iteration 2 of Backward Elimination**

In [24]:
mod2 = smf.ols('price~cleaning_fee+room_type+beds+bathrooms+host_acceptance_rate', data=df).fit()
mod2.rsquared_adj

0.4513451665178705

In [25]:
# eliminate the variable 'cleaning fee'
mod2_2 = smf.ols('price~room_type+beds+bathrooms+host_acceptance_rate', data=df).fit()
mod2_2.rsquared_adj

0.4180554656906412

In [26]:
# eliminate the variable 'room_type'
mod2_3 = smf.ols('price~cleaning_fee+beds+bathrooms+host_acceptance_rate', data=df).fit()
mod2_3.rsquared_adj

0.4068338033919704

In [27]:
# eliminate the variable 'beds'
mod2_4 = smf.ols('price~cleaning_fee+room_type+bathrooms+host_acceptance_rate', data=df).fit()
mod2_4.rsquared_adj

0.43290995065591764

In [28]:
# eliminate the variable 'bathrooms'
mod2_5 = smf.ols('price~cleaning_fee+room_type+beds+host_acceptance_rate', data=df).fit()
mod2_5.rsquared_adj

0.4146529620394587

In [29]:
# eliminate the variable 'host acceptance rate'
mod2_6 = smf.ols('price~cleaning_fee+room_type+beds+bathrooms', data=df).fit()
mod2_6.rsquared_adj

0.45226586314160244

At least one reduced model has a higher adjusted $R^2$ than model 2. The model with the highest adjusted $R^2$ is mod2_6, which drops 'host_acceptance_rate', so that variable is removed next.

**Iteration 3 of Backward Elimination**

In [32]:
mod3 = smf.ols('price~cleaning_fee+room_type+beds+bathrooms', data=df).fit()
mod3.rsquared_adj

0.45226586314160244

In [33]:
# eliminate the variable 'cleaning_fee'
mod3_2 = smf.ols('price~room_type+beds+bathrooms', data=df).fit()
mod3_2.rsquared_adj

0.4193452166994731

In [34]:
# eliminate the variable 'room_type'
mod3_3 = smf.ols('price~cleaning_fee+beds+bathrooms', data=df).fit()
mod3_3.rsquared_adj

0.40854853470251995

In [35]:
# eliminate the variable 'beds'
mod3_4 = smf.ols('price~cleaning_fee+room_type+bathrooms', data=df).fit()
mod3_4.rsquared_adj

0.4339166212592471

In [36]:
# eliminate the variable 'bathrooms'
mod3_5 = smf.ols('price~cleaning_fee+room_type+beds', data=df).fit()
mod3_5.rsquared_adj

0.4152873668013447

No reduced model has a higher adjusted $R^2$ than model 3, so the elimination stops here.<br>
The final selected variables are therefore 'cleaning_fee', 'room_type', 'beds', and 'bathrooms'. 
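
The three iterations above can be condensed into a single loop. The sketch below is one way to write greedy backward elimination. To stay self-contained, it replays the adjusted $R^2$ values already reported in the cells above instead of refitting the models. The names `backward_eliminate`, `REPORTED`, and `reported_score` are hypothetical helpers. In practice the `score` argument could wrap `smf.ols(...).fit().rsquared_adj`.

```python
def backward_eliminate(predictors, score):
    """Greedy backward elimination; `score(kept)` returns a value to maximize."""
    current = list(predictors)
    best = score(current)
    n_fits = 1            # the full model
    dropped = []
    while len(current) > 1:
        # Try removing each remaining predictor in turn.
        trials = [(score([q for q in current if q != p]), p) for p in current]
        n_fits += len(trials)
        trial_score, removed = max(trials, key=lambda t: t[0])
        if trial_score <= best:
            break         # no reduced model improves the metric
        current = [q for q in current if q != removed]
        best = trial_score
        dropped.append(removed)
    return current, best, dropped, n_fits

FULL = ["cleaning_fee", "room_type", "beds", "bathrooms",
        "host_acceptance_rate", "host_response_time"]
HRT, HAR = "host_response_time", "host_acceptance_rate"

# Adjusted R^2 values copied from the outputs above,
# keyed by the set of predictors removed from the full model.
REPORTED = {
    frozenset(): 0.449,
    frozenset({"cleaning_fee"}): 0.41742637884273226,
    frozenset({"room_type"}): 0.4021741670720679,
    frozenset({"beds"}): 0.4323043393911763,
    frozenset({"bathrooms"}): 0.4153680674029714,
    frozenset({HAR}): 0.45004319831677364,
    frozenset({HRT}): 0.4513451665178705,
    frozenset({HRT, "cleaning_fee"}): 0.4180554656906412,
    frozenset({HRT, "room_type"}): 0.4068338033919704,
    frozenset({HRT, "beds"}): 0.43290995065591764,
    frozenset({HRT, "bathrooms"}): 0.4146529620394587,
    frozenset({HRT, HAR}): 0.45226586314160244,
    frozenset({HRT, HAR, "cleaning_fee"}): 0.4193452166994731,
    frozenset({HRT, HAR, "room_type"}): 0.40854853470251995,
    frozenset({HRT, HAR, "beds"}): 0.4339166212592471,
    frozenset({HRT, HAR, "bathrooms"}): 0.4152873668013447,
}

def reported_score(kept):
    """Look up the adjusted R^2 reported above for this set of predictors."""
    return REPORTED[frozenset(FULL) - frozenset(kept)]

final, best, dropped, n_fits = backward_eliminate(FULL, reported_score)
print(final)
print(dropped, n_fits)
```

The loop counts each candidate model once, so refitting the winner at the start of the next iteration (as `mod2` and `mod3` do above) does not add to the number of distinct models.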

In [37]:
mod_final = smf.ols('price~cleaning_fee+room_type+beds+bathrooms', data=df).fit()
mod_final.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.46
Model:,OLS,Adj. R-squared:,0.452
Method:,Least Squares,F-statistic:,56.82
Date:,"Wed, 03 May 2023",Prob (F-statistic):,1.28e-42
Time:,20:23:29,Log-Likelihood:,-1911.0
No. Observations:,339,AIC:,3834.0
Df Residuals:,333,BIC:,3857.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,36.1562,9.923,3.644,0.000,16.636,55.677
room_type[T.Private room],-51.1213,11.309,-4.521,0.000,-73.367,-28.876
room_type[T.Shared room],-53.1608,14.617,-3.637,0.000,-81.914,-24.408
cleaning_fee,0.5487,0.120,4.591,0.000,0.314,0.784
beds,14.2010,4.068,3.491,0.001,6.200,22.202
bathrooms,42.1904,8.694,4.853,0.000,25.088,59.293

0,1,2,3
Omnibus:,472.292,Durbin-Watson:,1.954
Prob(Omnibus):,0.0,Jarque-Bera (JB):,84105.822
Skew:,6.648,Prob(JB):,0.0
Kurtosis:,79.01,Cond. No.,308.0


The baseline level for the variable 'host_response_time' is 'a few days or more'. <br>
The first variable removed from the model is 'host_response_time'. <br>
The number of distinct models fit (through iteration 3) is $1 \text{ (original full model)} + 6 \text{ (first iteration)} + 5 \text{ (second iteration)} + 4 \text{ (third iteration)} = 16$.<br>
The number of possible models I could fit is $2^6=64$. <br>
The optimal adjusted $R^2$ is 0.452. <br>
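
The count of $2^6 = 64$ possible models comes from each of the six candidate variables being either in or out of the model. A short sketch that enumerates these subsets directly (the list name `predictors` is illustrative, and the empty subset corresponds to the intercept-only model):

```python
from itertools import combinations

predictors = ["cleaning_fee", "room_type", "beds", "bathrooms",
              "host_acceptance_rate", "host_response_time"]

# Every subset of the predictors, from the intercept-only model (size 0)
# up to the full model (size 6).
subsets = [combo
           for k in range(len(predictors) + 1)
           for combo in combinations(predictors, k)]

print(len(subsets))  # 64
```

Backward elimination fit only 16 of these 64 models, which is the trade-off of a greedy search: it is cheaper, but it is not guaranteed to find the subset with the best adjusted $R^2$.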

## 2. Regularized Regression 
Suppose now that I'd like to understand what features are associated with whether you have the whole space or not (**room_type** is Entire home/apt compared to any other option).

In order to understand this, I'd like to consider the following variables as potential predictors:
- **bathrooms**
- **accommodates**
- **security_deposit**
- **property_type**
- **host_is_superhost**

I know that I might not want to include all of these variables in the model, so I opt to fit a LASSO model with $\lambda = 20$.

I will compare this to a second LASSO model with $\lambda = 2$.

For this problem, fit both of these two models.  You will be asked to report the number of slope coefficients that are equal to 0 for each of these two models.

Then, once you have selected your predictors, fit non-regularized models with the predictors suggested by these two LASSO models.

You will be asked to report the optimal BIC from the last two models and the variable whose coefficient is the largest (in magnitude).
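
The $\lambda$ values above have to be translated into scikit-learn's `C` parameter, which is an inverse regularization strength. As a sketch, assuming the objective described in the scikit-learn user guide for L1-penalized logistic regression (labels coded as $y_i \in \{-1, 1\}$, slopes $w$, intercept $c$):

$$
\min_{w,\,c} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^\top w + c)}\right)
$$

Dividing through by $C$ leaves the minimizer unchanged and gives the more familiar penalized form:

$$
\min_{w,\,c} \; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^\top w + c)}\right) + \frac{1}{C} \lVert w \rVert_1
$$

Under this convention $\lambda = 1/C$, so $\lambda = 20$ corresponds to `C=1/20` and $\lambda = 2$ corresponds to `C=1/2`. Other texts scale the loss by $1/n$, in which case the mapping would differ by a factor of $n$.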

In [39]:
from sklearn.linear_model import LogisticRegression

In [40]:
df.head()

Unnamed: 0,price,review_scores_rating,number_of_reviews,security_deposit,cleaning_fee,neighborhood,property_type,room_type,accommodates,bathrooms,beds,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_has_profile_pic,host_identity_verified
0,300,100,24,500,95,Wallingford,House,Entire home/apt,5,1.5,3,within a few hours,1.0,1,t,t,t
1,149,96,11,300,105,Wallingford,Apartment,Entire home/apt,6,1.0,3,within an hour,1.0,1,f,t,t
2,95,95,79,150,40,Wallingford,Apartment,Entire home/apt,3,1.0,2,within an hour,1.0,1,f,t,t
3,105,100,13,500,50,Wallingford,House,Private room,2,2.0,1,within a few hours,1.0,1,t,t,t
4,140,99,30,250,65,Wallingford,House,Entire home/apt,2,1.0,1,within an hour,1.0,1,t,t,t


In [45]:
# features matrix, X
X = df[['bathrooms', 'accommodates', 'security_deposit', 'property_type', 'host_is_superhost']]
X.head()

Unnamed: 0,bathrooms,accommodates,security_deposit,property_type,host_is_superhost
0,1.5,5,500,House,t
1,1.0,6,300,Apartment,f
2,1.0,3,150,Apartment,f
3,2.0,2,500,House,t
4,1.0,2,250,House,t


In [46]:
X = pd.get_dummies(X, drop_first=True)
X.head()

Unnamed: 0,bathrooms,accommodates,security_deposit,property_type_House,host_is_superhost_t
0,1.5,5,500,1,1
1,1.0,6,300,0,0
2,1.0,3,150,0,0
3,2.0,2,500,1,1
4,1.0,2,250,1,1


In [50]:
# Target array, y: 1 if the listing is an entire home/apt, 0 otherwise.
# Built without overwriting df['room_type'], so the cell can be re-run safely.
y = (df['room_type'] == 'Entire home/apt').astype('int64')
y.head()

0    1
1    1
2    1
3    0
4    1
Name: room_type, dtype: int64

In [53]:
# LASSO model with lambda = 20 (C is the inverse regularization strength, taken here as 1/lambda)
clf1 = LogisticRegression(penalty='l1', solver='liblinear',
                          max_iter=1000, C=1/20)
clf1.fit(X,y)

LogisticRegression(C=0.05, max_iter=1000, penalty='l1', solver='liblinear')

In [58]:
clf1.coef_.T

array([[ 0.00000000e+00],
       [ 6.61861120e-01],
       [-4.11917270e-04],
       [-1.08188287e+00],
       [ 0.00000000e+00]])
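
To read this output, each coefficient has to be matched to its column in `X`. The sketch below pairs the printed coefficients with the column names shown in the `X.head()` output above and counts the slopes that LASSO shrank exactly to zero. The values are copied by hand from the array above, and the list names are illustrative.

```python
# Column order of X after pd.get_dummies, as shown in the output above.
features = ["bathrooms", "accommodates", "security_deposit",
            "property_type_House", "host_is_superhost_t"]

# Coefficients printed by clf1.coef_.T (lambda = 20).
coefs = [0.00000000e+00, 6.61861120e-01, -4.11917270e-04,
         -1.08188287e+00, 0.00000000e+00]

zeroed = [name for name, c in zip(features, coefs) if c == 0.0]
kept = [name for name, c in zip(features, coefs) if c != 0.0]

print(len(zeroed), zeroed)
print(kept)
```

The surviving columns are the ones carried into the non-regularized model fit in the next cell.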

In [66]:
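# Non-regularized logistic regression using only the predictors with nonzero LASSO (lambda = 20) coefficients
mod_1 = smf.logit('y~accommodates+security_deposit+property_type', data=df).fit()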
mod_1.bic

Optimization terminated successfully.
         Current function value: 0.224184
         Iterations 9


175.30065921409513

In [54]:
# LASSO model with lambda = 2 (C is the inverse regularization strength, taken here as 1/lambda)
clf2 = LogisticRegression(penalty='l1', solver='liblinear',
                          max_iter=1000, C=1/2)
clf2.fit(X,y)

LogisticRegression(C=0.5, max_iter=1000, penalty='l1', solver='liblinear')

In [59]:
clf2.coef_.T

array([[-8.01324088e-01],
       [ 1.58475589e+00],
       [ 6.55967246e-04],
       [-2.86542044e+00],
       [ 1.16438420e-01]])

In [65]:
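# Non-regularized logistic regression using all five predictors, since the LASSO (lambda = 2) model kept every slope nonzero
mod_2 = smf.logit('y~bathrooms+accommodates+security_deposit+property_type+host_is_superhost', data=df).fit()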
mod_2.bic

Optimization terminated successfully.
         Current function value: 0.216972
         Iterations 9


182.0629114420658
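
The two BIC values can be cross-checked against the optimizer output. This sketch assumes that the printed "Current function value" is the average negative log-likelihood per observation, and that both models use all 339 rows (the observation count from the earlier OLS summaries). The helper name `bic_from_logit_output` is hypothetical. The parameter counts are the intercept plus three slopes for `mod_1` and the intercept plus five slopes for `mod_2`.

```python
import math

def bic_from_logit_output(mean_neg_loglik, n_obs, n_params):
    """BIC = k * ln(n) - 2 * log-likelihood."""
    log_lik = -n_obs * mean_neg_loglik
    return n_params * math.log(n_obs) - 2 * log_lik

n_obs = 339  # assumption: no rows dropped for missing values

bic_1 = bic_from_logit_output(0.224184, n_obs, n_params=4)
bic_2 = bic_from_logit_output(0.216972, n_obs, n_params=6)

# Lower BIC is preferred.
best = min([("mod_1", bic_1), ("mod_2", bic_2)], key=lambda m: m[1])
print(round(bic_1, 2), round(bic_2, 2), best[0])
```

Both reconstructed values agree with the reported ones to within rounding of the printed function value. The smaller model has the lower BIC: its slightly worse fit is outweighed by the $\ln(n)$ penalty on the two extra parameters in `mod_2`.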