Peer review form

https://forms.gle/Nou31utuFUxadzf37

---

## My answers

### Your EDA summary should include AT LEAST this:
- N = 1941.
- The unit of observation is a house sale.
- The sample covers sales from 2006-2008.

But it should go much further than this. Variables with missing values and outliers should be quite easy to find, and beyond that, there are a lot of creative ways to transform and combine variables that you should consider.
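
As one possible starting point for the missing-value and outlier checks, here is a minimal sketch. The helper name `missing_and_outliers`, the 1.5 x IQR rule, and the toy numbers are all assumptions for illustration; only the column names come from the housing data.

```python
import numpy as np
import pandas as pd

def missing_and_outliers(df, iqr_multiplier=1.5):
    """For each numeric column, count missing values and IQR-rule outliers.

    Hypothetical helper: the 1.5 * IQR cutoff is a common convention,
    not something the assignment prescribes.
    """
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - iqr_multiplier * iqr
    upper = q3 + iqr_multiplier * iqr
    # Comparisons against NaN are False, so missing values are not flagged as outliers
    is_outlier = (numeric < lower) | (numeric > upper)
    return pd.DataFrame({
        "n_missing": numeric.isna().sum(),
        "n_outliers": is_outlier.sum(),
    })

# Toy data with made-up values (NOT the real housing_train.csv)
toy = pd.DataFrame({
    "v_SalePrice": [150000, 162000, 171000, 158000, 900000],
    "v_Garage_Area": [400.0, np.nan, 520.0, 480.0, 450.0],
})
report = missing_and_outliers(toy)
print(report)
```

On the real data you would call `missing_and_outliers(housing)` after loading the CSV, then look more closely at whichever columns it flags.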

### Regressions

import pandas as pd
import numpy as np
from statsmodels.formula.api import ols as sm_ols
from statsmodels.iolib.summary2 import summary_col # nicer tables


# Load the training data and build the derived variables used in the models
housing = (pd
           .read_csv('input_data2/housing_train.csv')
           .assign(TotSF        = lambda x: x.v_Total_Bsmt_SF + x.v_Gr_Liv_Area,
                   l_salePrice  = lambda x: np.log(x.v_SalePrice),
                   l_TotSF      = lambda x: np.log(x.TotSF),
                   l_1st_Flr_SF = lambda x: np.log(x.v_1st_Flr_SF),
                   # log(1 + x) stays defined when an area is 0 (e.g. no garage or basement)
                   l_garage     = lambda x: np.log(1 + x.v_Garage_Area),
                   l_base       = lambda x: np.log(1 + x.v_Total_Bsmt_SF),
                   l_lot        = lambda x: np.log(1 + x.v_Lot_Area),
                   # age of the house at the time of sale, in years
                   age          = lambda x: x.v_Yr_Sold - x.v_Year_Built
                  )
          )

m1 = sm_ols('v_SalePrice ~ v_1st_Flr_SF  ', data=housing).fit()
m2 = sm_ols('v_SalePrice ~ l_1st_Flr_SF  ', data=housing).fit()
m3 = sm_ols('l_salePrice ~ v_1st_Flr_SF  ', data=housing).fit()
m4 = sm_ols('l_salePrice ~ l_1st_Flr_SF  ', data=housing).fit()
m5 = sm_ols('l_salePrice ~ v_Yr_Sold  ', data=housing).fit()
m6 = sm_ols('l_salePrice ~ C(v_Yr_Sold)  ', data=housing).fit()

# format extra stats for the regression table
info_dict = {'No. observations': lambda x: "{:,.0f}".format(x.nobs)}

# print out multiple regression results at once
table = summary_col(results=[m1,m2,m3,m4,m5,m6],
                    float_format='%0.2f',
                    stars = True,
                    model_names=['m1','m2','m3','m4','m5','m6'],
                    info_dict=info_dict,
                    regressor_order=['Intercept','v_1st_Flr_SF','l_1st_Flr_SF','v_Yr_Sold',
                                    'C(v_Yr_Sold)[T.2007]','C(v_Yr_Sold)[T.2008]',
                                    'l_TotSF','l_lot'],
                   )
table.add_title('OLS Regressions of Interest Rate')
print(table)
print('Model 7 is not showing many variables.')


                        OLS Regressions of Interest Rate
                          m1           m2         m3       m4      m5      m6   
--------------------------------------------------------------------------------
Intercept            40435.21*** -866085.93*** 11.33*** 6.62*** 22.29   12.02***
                     (4506.89)   (31862.17)    (0.02)   (0.16)  (22.94) (0.02)  
v_1st_Flr_SF         121.95***                 0.00***                          
                     (3.67)                    (0.00)                           
l_1st_Flr_SF                     149632.27***           0.77***                 
                                 (4543.94)              (0.02)                  
v_Yr_Sold                                                       -0.01           
                                                                (0.01)          
C(v_Yr_Sold)[T.2007]                                                    0.03    
                                                    

Written answers:

1. List $\beta_1$ for Models 1-6 to make it easier on your graders.
    - See above
1. Interpret $\beta_1$ in Model 2. 
    - A 1% increase in first floor square footage is associated with sales prices that are about \$1,496 higher, all else equal.
1. Interpret $\beta_1$ in Model 3. 
    - A 1 sq ft increase in first floor square footage is associated with sales prices that are 0.06% higher, all else equal. (0.06% is the right amount, not 0.0006%.)
    - I had to print out the model with more decimal places to figure that out. 
1. Of models 1-4, which do you think best explains the data and why?
    - Model 4 has the highest R2 and Adj R2
1. Interpret $\beta_1$ in Model 5.
    - Mechanical version: A 1 unit increase in year is associated with sales prices that are 1% lower, all else equal (1%, not 0.01%). I.e., sales prices declined by 1% each year, on average, in this sample.
1. Interpret $\alpha$ and $\beta_1$ in Model 6
    - Alpha: the average log price in 2006 is 12.02. 
    - Beta1: The average log price in 2007 is 0.03 higher than in 2006, i.e. prices in 2007 were about 3% higher than in 2006. 
1. Why is the R2 of Model 6 higher than the R2 of Model 5?
    - Model 5 forces the average log price to follow a straight line over time. Model 6 *can* reproduce a line, but it can also fit any sequence of yearly averages. Thus, Model 6 is more flexible, and its R2 can never be lower than Model 5's. 
1. Speculate (not graded): Could you use the specification of Model 6 in a predictive regression? How about Model 5?
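    - Model 6: **Not well! What if your model has to be applied to a home sale in 2009? You can't estimate a beta for 2009 because there are no 2009 observations in your sample! So year=2007 will be 0, and year=2008 will be 0, and the 2009 sales will get the 2006 average price as the predicted value!**
    - Model 5: Yes. (Predictive power TBD.) If the year is 2009, the predicted log price is $\alpha + \beta_1 \times 2009$, i.e. 22.29 + 2009*(-0.01) with the rounded values in the table. Use the unrounded coefficients for an actual prediction, because rounding the slope to two decimals changes the answer substantially.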
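
The coefficient interpretations above all come from the same small set of conversions, which can be sketched as follows. The function names are hypothetical, and the Model 3 slope of 0.0006 is an assumption backed out from the 0.06% answer (the table only shows 0.00); the other coefficients are the rounded values from the table.

```python
import math

def lin_log_effect(beta, pct_change_x=1.0):
    """y ~ log(x): change in y (in units of y) for a pct_change_x % increase in x."""
    return beta * math.log(1 + pct_change_x / 100)

def log_lin_effect(beta, delta_x=1.0):
    """log(y) ~ x: percent change in y for a delta_x unit increase in x."""
    return 100 * (math.exp(beta * delta_x) - 1)

def log_log_effect(beta, pct_change_x=1.0):
    """log(y) ~ log(x): percent change in y for a pct_change_x % increase in x."""
    return 100 * ((1 + pct_change_x / 100) ** beta - 1)

# Model 2: the usual shortcut is beta / 100; the exact figure is slightly smaller
m2_shortcut = 149632.27 / 100          # dollars per 1% more first floor area
m2_exact = lin_log_effect(149632.27)

# Model 3: slope assumed to be roughly 0.0006 (not visible in the 2-decimal table)
m3_pct = log_lin_effect(0.0006)        # % change in price per extra sq ft

# Model 4: elasticity, so a 1% increase in area goes with roughly beta % in price
m4_pct = log_log_effect(0.77)

# Model 5: % change in price per additional year
m5_pct = log_lin_effect(-0.01)

print(round(m2_shortcut, 2), round(m2_exact, 2))
print(round(m3_pct, 4), round(m4_pct, 3), round(m5_pct, 3))
```

For small coefficients the shortcuts (divide by 100 for `y ~ log(x)`, multiply by 100 for `log(y) ~ x`) are very close to the exact conversions, which is why the written answers use them.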

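To make the Model 5 vs. Model 6 comparison concrete, here is a small sketch on made-up data (six invented log prices, not the housing sample). It uses `np.polyfit` for the straight-line fit and plain yearly means for the dummy-variable fit, which is what OLS on year dummies reproduces; the helper names are hypothetical.

```python
import numpy as np

# Toy sample with invented log prices for 2006-2008 (NOT the real data)
years = np.array([2006, 2006, 2007, 2007, 2008, 2008])
log_price = np.array([12.00, 12.04, 12.03, 12.07, 11.98, 12.00])

def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Model 5 analogue: a single slope on year
slope, intercept = np.polyfit(years, log_price, deg=1)
fit_linear = intercept + slope * years

# Model 6 analogue: one average per year (equivalent to year dummies)
year_means = {int(y): log_price[years == y].mean() for y in np.unique(years)}
fit_dummies = np.array([year_means[int(y)] for y in years])

r2_linear = r_squared(log_price, fit_linear)
r2_dummies = r_squared(log_price, fit_dummies)

def predict_dummies(year):
    """Unseen years have every dummy equal to 0, so they get the 2006 baseline."""
    return year_means.get(year, year_means[2006])

def predict_linear(year):
    """The line extrapolates to any year."""
    return intercept + slope * year

# Rounding the coefficients to 2 decimals (as in a regression table) before predicting
rounded_pred = round(float(intercept), 2) + round(float(slope), 2) * 2009

print(r2_linear, r2_dummies)
print(predict_dummies(2009), predict_linear(2009), rounded_pred)
```

In this toy sample the yearly averages are not on a line, so the dummy version fits better in-sample, but for 2009 it can only fall back on the 2006 average. The last line also shows why predictions should use the unrounded coefficients: with year values around 2000, a tiny change in the slope moves the prediction a lot.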