# Lab 12 Building Parsimonious Models - Group - [5 points] - Solutions


## <u>Case Study</u>: Forward Selection with BIC and your Own Dataset

In this analysis, you will do the following.

1. **Choose your own dataset that meets the following specifications.**
    * It is not the fake vs. real Instagram account dataset. Any other dataset that meets the specifications below is fine.
    * This dataset has at least one categorical variable that you can use as your response variable in a logistic regression model.
        - This categorical variable should have just two levels.
        - Alternatively, you can create a categorical variable that has just two levels.
    * This dataset has at least 4 other variables that you will use as *potential* explanatory variables to put in the model.
    
2. **Use a forward selection algorithm with BIC** to try to find a parsimonious logistic regression model from the 4 candidate explanatory variables that you are considering.



### Imports

In [2]:
import numpy as np
import pandas as pd
import zipfile as zp
import statsmodels.api as sm
import statsmodels.formula.api as smf

## 1. [0.5 pt] Data Preliminaries

Load your CSV file into a dataframe. Then create a new 0/1 response variable in your dataframe, where 1 = the response variable level that you are trying to predict (i.e., the success level) and 0 = the response variable level that you are not trying to predict (i.e., the failure level). Finally, display the first 5 rows of your updated dataframe below.

In [3]:
df = pd.read_csv('seattle_airbnb_listings_cleaned.csv')
df.head()

Unnamed: 0,price,review_scores_rating,number_of_reviews,security_deposit,cleaning_fee,neighborhood,property_type,room_type,accommodates,bathrooms,beds,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_has_profile_pic,host_identity_verified
0,300,100,24,500,95,Wallingford,House,Entire home/apt,5,1.5,3,within a few hours,1.0,1,t,t,t
1,149,96,11,300,105,Wallingford,Apartment,Entire home/apt,6,1.0,3,within an hour,1.0,1,f,t,t
2,95,95,79,150,40,Wallingford,Apartment,Entire home/apt,3,1.0,2,within an hour,1.0,1,f,t,t
3,105,100,13,500,50,Wallingford,House,Private room,2,2.0,1,within a few hours,1.0,1,t,t,t
4,140,99,30,250,65,Wallingford,House,Entire home/apt,2,1.0,1,within an hour,1.0,1,t,t,t


In [6]:
df['y'] = df['host_is_superhost'].map({'f':0, 't':1})
df.head()

Unnamed: 0,price,review_scores_rating,number_of_reviews,security_deposit,cleaning_fee,neighborhood,property_type,room_type,accommodates,bathrooms,beds,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_has_profile_pic,host_identity_verified,y
0,300,100,24,500,95,Wallingford,House,Entire home/apt,5,1.5,3,within a few hours,1.0,1,t,t,t,1
1,149,96,11,300,105,Wallingford,Apartment,Entire home/apt,6,1.0,3,within an hour,1.0,1,f,t,t,0
2,95,95,79,150,40,Wallingford,Apartment,Entire home/apt,3,1.0,2,within an hour,1.0,1,f,t,t,0
3,105,100,13,500,50,Wallingford,House,Private room,2,2.0,1,within a few hours,1.0,1,t,t,t,1
4,140,99,30,250,65,Wallingford,House,Entire home/apt,2,1.0,1,within an hour,1.0,1,t,t,t,1
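
A quick way to confirm that a 0/1 mapping like the one above encoded every row is to count the missing values it produced. The sketch below uses a small made-up column (the values are illustrative only, not taken from the Airbnb data), because `.map()` returns `NaN` for any value that is not a key in the dictionary.

```python
import pandas as pd

# Toy column standing in for host_is_superhost (illustrative values only).
raw = pd.Series(['t', 'f', 't', None, 'f'], name='host_is_superhost')

# Same mapping as in the lab: 't' is the success level, 'f' the failure level.
y = raw.map({'f': 0, 't': 1})

n_missing = int(y.isna().sum())      # rows the mapping could not encode
counts = y.value_counts().to_dict()  # how many 0s and 1s were produced
print(n_missing, counts)
```

If `n_missing` is greater than zero on a real dataset, those rows would need to be dropped or recoded before fitting the logistic regression.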


<hr>

## <u>Tutorial</u>: Fitting a Regression Curve with No Explanatory Variables

To fit a logistic regression (or linear regression) model with the **statsmodels.formula.api** package that has no explanatory variables (i.e., just the intercept), write a 1 where you would normally put the explanatory variables in the **.logit()** or **.ols()** function, as shown below.

In [7]:
import pandas as pd
df_temp=pd.DataFrame({'y': [1,1,0,0,1,1,0]})
df_temp

Unnamed: 0,y
0,1
1,1
2,0
3,0
4,1
5,1
6,0


In [8]:
import statsmodels.formula.api as smf
example_model = smf.logit('y~1', data=df_temp).fit()
example_model.summary()

Optimization terminated successfully.
         Current function value: 0.682908
         Iterations 4


Dep. Variable:,y,No. Observations:,7.0
Model:,Logit,Df Residuals:,6.0
Method:,MLE,Df Model:,0.0
Date:,"Wed, 10 Nov 2021",Pseudo R-squ.:,5.282e-12
Time:,15:29:41,Log-Likelihood:,-4.7804
converged:,True,LL-Null:,-4.7804
Covariance Type:,nonrobust,LLR p-value:,

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.2877,0.764,0.377,0.706,-1.209,1.785


<hr>

## 2.  [4.5 pts]  Forward Selection with BIC Score

Next, starting with the logistic regression model with **no explanatory variables**, perform a forward selection algorithm using the BIC in an attempt to find the logistic regression model with the lowest BIC score.
* You should consider 4 possible explanatory variables in this algorithm.
* You will use your **full dataset** when fitting this logistic regression model (i.e., you should *not* split this dataset into training and test datasets in this particular assignment).

Once the algorithm has stopped, print out the summary output of your **final model**. 
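
The forward selection procedure described above can also be written as a loop. The following is a minimal sketch, not the required solution: `forward_select` and `logit_bic` are hypothetical helper names, and the scorer assumes a dataframe with a 0/1 response column, as created in part 1. At each step it scores every model that adds one remaining variable, keeps the best addition only if it lowers the BIC, and stops otherwise.

```python
def forward_select(candidates, score):
    """Greedy forward selection that minimizes `score` (e.g. BIC).

    candidates: list of explanatory-variable names.
    score: callable that takes a tuple of variable names and returns a
           number (lower is better); score(()) is the intercept-only model.
    """
    selected = []
    best = score(tuple(selected))
    remaining = list(candidates)
    while remaining:
        # Score every model that adds exactly one remaining variable.
        trials = [(score(tuple(selected + [v])), v) for v in remaining]
        trial_best, var = min(trials)
        if trial_best >= best:
            break  # no single addition lowers the score, so stop
        selected.append(var)
        remaining.remove(var)
        best = trial_best
    return selected, best


def logit_bic(data, response):
    """Return a BIC scorer for logit formulas (hypothetical helper)."""
    import statsmodels.formula.api as smf  # imported only when this is called

    def score(terms):
        rhs = " + ".join(terms) if terms else "1"
        return smf.logit(f"{response} ~ {rhs}", data=data).fit(disp=0).bic

    return score
```

With the dataframe from part 1, a call such as `forward_select(['accommodates', 'bathrooms', 'host_response_rate', 'review_scores_rating'], logit_bic(df, 'y'))` would return the selected variables and the BIC of the resulting model. Passing the scorer in as an argument keeps the selection loop independent of any particular modeling library.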

In [11]:
current_mod = smf.logit('y ~ accommodates', data = df).fit()
print('BIC of Current Model', current_mod.bic)

Optimization terminated successfully.
         Current function value: 0.609104
         Iterations 5
BIC of Current Model 424.62445385910297


In [13]:
test_mod = smf.logit('y ~ accommodates + bathrooms', data = df).fit()
print('BIC of Test Model', test_mod.bic)

Optimization terminated successfully.
         Current function value: 0.605746
         Iterations 5
BIC of Test Model 428.174007530974


In [17]:
test_mod = smf.logit('y ~ accommodates + bathrooms + host_response_rate', data = df).fit()
print('BIC of Test Model', test_mod.bic)

Optimization terminated successfully.
         Current function value: 0.602282
         Iterations 6
BIC of Test Model 431.650951050532


In [18]:
test_mod = smf.logit('y ~ accommodates + bathrooms + host_response_rate + review_scores_rating', data = df).fit()
print('BIC of Test Model', test_mod.bic)

Optimization terminated successfully.
         Current function value: 0.536105
         Iterations 7
BIC of Test Model 392.6095012697017


In [19]:
final_mod = smf.logit('y ~ accommodates + bathrooms + host_response_rate + review_scores_rating', data = df).fit()
print("BIC of Final Model", final_mod.bic)

Optimization terminated successfully.
         Current function value: 0.536105
         Iterations 7
BIC of Final Model 392.6095012697017


In [20]:
final_mod.summary()

Dep. Variable:,y,No. Observations:,339.0
Model:,Logit,Df Residuals:,334.0
Method:,MLE,Df Model:,4.0
Date:,"Wed, 10 Nov 2021",Pseudo R-squ.:,0.1198
Time:,15:36:37,Log-Likelihood:,-181.74
converged:,True,LL-Null:,-206.49
Covariance Type:,nonrobust,LLR p-value:,4.606e-10

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-23.5148,4.207,-5.590,0.000,-31.760,-15.270
accommodates,-0.0599,0.075,-0.802,0.423,-0.206,0.087
bathrooms,0.2007,0.270,0.743,0.457,-0.329,0.730
host_response_rate,1.8918,1.451,1.304,0.192,-0.952,4.735
review_scores_rating,0.2168,0.041,5.300,0.000,0.137,0.297
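
The BIC reported for the final model can be reproduced from the summary above using BIC = -2 · (log-likelihood) + (number of estimated parameters) · ln(n). The sketch below plugs in the rounded values from the table, so it should match the printed BIC only up to rounding.

```python
import math

log_likelihood = -181.74  # Log-Likelihood from the summary above
n_obs = 339               # No. Observations
n_params = 5              # intercept plus 4 slopes (Df Model + 1)

bic = -2 * log_likelihood + n_params * math.log(n_obs)
print(round(bic, 2))
```

The `n_params * math.log(n_obs)` term is the penalty that grows with each added explanatory variable, which is why an extra variable must improve the log-likelihood enough to lower the BIC.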


Group Members: Hadley So and Mithil Guruvugari