# Multiple Linear Regression (also called Multi-Linear or ~~Multivariate~~ regression)
- Previously, we did **Simple linear regression**
- This time...

$$ AmazonPrice = \beta_0 + \beta_1 ListPrice + \beta_H Height + \beta_W Width + \beta_T \underbrace{I(HardorPaper == H)}_{\text{indicator function}}$$ 

The indicator function equals $1$ if the book is a hardback (`H`) and $0$ otherwise.

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 \underbrace{I(x_4 == H)}_{\text{indicator function}}$$ 

- The $x$'s are called **covariates** (in statistics) or **features** (in machine learning)
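As a minimal sketch of how an indicator enters the model, the snippet below builds a design matrix with an intercept column, one numeric covariate, and a 0/1 indicator column, then recovers the coefficients by least squares. The data and the coefficient values are made up for illustration and are not taken from `amazonbooks.csv`.

```python
import numpy as np

# Hypothetical toy data (not from amazonbooks.csv)
list_price = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
book_type = np.array(["P", "H", "P", "H", "P", "H"])

# Indicator function I(book_type == "H"): 1 for hardback, 0 otherwise
is_hard = (book_type == "H").astype(float)

# Design matrix with columns [1, x_1, I(x_2 == H)]
X = np.column_stack([np.ones_like(list_price), list_price, is_hard])

# Noise-free response generated from assumed coefficients
true_beta = np.array([-2.0, 0.8, 1.5])
y = X @ true_beta

# Least squares recovers the coefficients used to generate y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

The indicator column behaves like any other covariate: its coefficient is the shift in $y$ for hardbacks relative to paperbacks, holding the other covariates fixed.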

In [1]:
import pandas as pd
amazonbooks = pd.read_csv("amazonbooks.csv", encoding="ISO-8859-1")
amazonbooks

Unnamed: 0,Title,Author,List Price,Amazon Price,Hard_or_Paper,NumPages,Publisher,Pub year,ISBN-10,Height,Width,Thick,Weight_oz
0,"1,001 Facts that Will Scare the S#*t Out of Yo...",Cary McNeal,12.95,5.18,P,304.0,Adams Media,2010.0,1605506249,7.8,5.5,0.8,11.2
1,21: Bringing Down the House - Movie Tie-In: Th...,Ben Mezrich,15.00,10.20,P,273.0,Free Press,2008.0,1416564195,8.4,5.5,0.7,7.2
2,100 Best-Loved Poems (Dover Thrift Editions),Smith,1.50,1.50,P,96.0,Dover Publications,1995.0,486285537,8.3,5.2,0.3,4.0
3,1421: The Year China Discovered America,Gavin Menzies,15.99,10.87,P,672.0,Harper Perennial,2008.0,61564893,8.8,6.0,1.6,28.8
4,1493: Uncovering the New World Columbus Created,Charles C. Mann,30.50,16.77,P,720.0,Knopf,2011.0,307265722,8.0,5.2,1.4,22.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
320,Where the Sidewalk Ends,Shel Silverstein,18.99,12.24,H,192.0,HarperCollins,2004.0,60572345,9.3,6.6,1.1,24.0
321,White Privilege,Paula S. Rothenberg,27.55,27.55,P,160.0,Worth Publishers,2011.0,1429233443,9.1,6.1,0.7,8.0
322,Why I wore lipstick,Geralyn Lucas,12.95,5.18,P,224.0,St Martin's Griffin,2005.0,031233446X,8.0,5.4,0.7,6.4
323,"Worlds Together, Worlds Apart: A History of th...",Robert Tignor,97.50,97.50,P,480.0,W. W. Norton & Company,2010.0,393934942,10.7,8.9,0.9,14.4


In [23]:
import statsmodels.formula.api as smf
fit_model = smf.ols(formula='Q("Amazon Price") ~ Q("List Price") + Height + Width + C(Hard_or_Paper)', 
                    data=amazonbooks).fit()

In [14]:
dir(fit_model)

['HC0_se',
 'HC1_se',
 'HC2_se',
 'HC3_se',
 '_HCCM',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abat_diagonal',
 '_cache',
 '_data_attr',
 '_data_in_cache',
 '_get_robustcov_results',
 '_is_nested',
 '_use_t',
 '_wexog_singular_values',
 'aic',
 'bic',
 'bse',
 'centered_tss',
 'compare_f_test',
 'compare_lm_test',
 'compare_lr_test',
 'condition_number',
 'conf_int',
 'conf_int_el',
 'cov_HC0',
 'cov_HC1',
 'cov_HC2',
 'cov_HC3',
 'cov_kwds',
 'cov_params',
 'cov_type',
 'df_model',
 'df_resid',
 'eigenvals',
 'el_test',
 'ess',
 'f_pvalue',
 'f_test',
 'fittedvalues',
 'fvalue',
 'get_influence',
 'get_prediction',
 'get_robustcov_results',
 'info_criteria',


In [16]:
dir(fit_model.model)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_kwargs',
 '_data_attr',
 '_df_model',
 '_df_resid',
 '_fit_collinear',
 '_fit_ridge',
 '_fit_zeros',
 '_formula_max_endog',
 '_get_init_kwds',
 '_handle_data',
 '_init_keys',
 '_kwargs_allowed',
 '_setup_score_hess',
 '_sqrt_lasso',
 'data',
 'df_model',
 'df_resid',
 'endog',
 'endog_names',
 'exog',
 'exog_names',
 'fit',
 'fit_regularized',
 'formula',
 'from_formula',
 'get_distribution',
 'hessian',
 'hessian_factor',
 'information',
 'initialize',
 'k_constant',
 'loglike',
 'nobs',
 'normalized_cov_params',
 'pinv_wexog',
 'predict',
 'rank',
 'score',
 'weights',
 'wendog',
 'wexog',
 'wexog_singular_value

In [17]:
fit_model.model.formula

'Q("Amazon Price") ~ Q("List Price") + Height + Width + C(Hard_or_Paper)'

$$ AmazonPrice = \beta_0 + \beta_1 ListPrice + \require{enclose} \enclose{horizontalstrike}{\beta_H Height + \beta_W Width} + \beta_P \underbrace{I(HardorPaper == P)}_{\text{indicator function}}$$ 

The p-values in the table below are for the null hypothesis that the coefficient is 0. If a p-value is not small, it does not provide evidence against this null hypothesis. Since the `Height` and `Width` variables do not have small p-values, we do not have evidence against the assumption that the coefficients of these variables in the linear model are $0$. Therefore, we remove these variables from the model: if the coefficients were zero (an assumption we have not provided evidence against), then $0$ times those variables is just $0$, and they disappear from the model.
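The `t` and `P>|t|` columns of the summary can be roughly reproduced by hand from `coef` and `std err`. The sketch below uses a normal approximation to the t-distribution, an assumption that is reasonable here only because the residual degrees of freedom are large, so the results should agree with the table to about two decimal places rather than exactly. The helper name `two_sided_p_normal` is hypothetical.

```python
import math

def two_sided_p_normal(coef, std_err):
    """Approximate two-sided p-value for H0: coefficient = 0 (normal approximation)."""
    t_stat = coef / std_err
    # P(|Z| > |t|) for a standard normal Z
    return t_stat, math.erfc(abs(t_stat) / math.sqrt(2))

# coef and std err copied from the Height and Width rows of the summary
t_height, p_height = two_sided_p_normal(-0.3326, 0.289)
t_width, p_width = two_sided_p_normal(0.2316, 0.306)
print(round(t_height, 3), round(p_height, 3))
print(round(t_width, 3), round(p_width, 3))
```

Neither p-value is small, which is the basis for dropping `Height` and `Width`.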

In [27]:
fit_model.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.911
Model:,OLS,Adj. R-squared:,0.909
Method:,Least Squares,F-statistic:,799.5
Date:,"Mon, 08 May 2023",Prob (F-statistic):,3.32e-163
Time:,14:23:50,Log-Likelihood:,-873.34
No. Observations:,319,AIC:,1757.0
Df Residuals:,314,BIC:,1776.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-2.7254,2.371,-1.150,0.251,-7.390,1.939
C(Hard_or_Paper)[T.P],2.0305,0.498,4.081,0.000,1.051,3.009
"Q(""List Price"")",0.8463,0.018,46.533,0.000,0.811,0.882
Height,-0.3326,0.289,-1.151,0.251,-0.901,0.236
Width,0.2316,0.306,0.756,0.450,-0.371,0.834

0,1,2,3
Omnibus:,100.776,Durbin-Watson:,1.968
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1617.753
Skew:,0.818,Prob(JB):,0.0
Kurtosis:,13.91,Cond. No.,282.0


In [28]:
# this was last time...
smf.ols(formula='Q("Amazon Price") ~ Q("List Price")', data=amazonbooks).fit().summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.903
Model:,OLS,Adj. R-squared:,0.903
Method:,Least Squares,F-statistic:,3002.0
Date:,"Mon, 08 May 2023",Prob (F-statistic):,2.82e-165
Time:,14:24:27,Log-Likelihood:,-897.98
No. Observations:,324,AIC:,1800.0
Df Residuals:,322,BIC:,1808.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-2.4070,0.354,-6.791,0.000,-3.104,-1.710
"Q(""List Price"")",0.8298,0.015,54.789,0.000,0.800,0.860

0,1,2,3
Omnibus:,114.617,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1830.463
Skew:,0.992,Prob(JB):,0.0
Kurtosis:,14.474,Cond. No.,38.5


In [30]:
fit_model2 = smf.ols(formula='Q("Amazon Price") ~ Q("List Price") + Thick + Weight_oz + NumPages + C(Hard_or_Paper)', 
                     data=amazonbooks).fit()
fit_model2.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.921
Model:,OLS,Adj. R-squared:,0.92
Method:,Least Squares,F-statistic:,719.6
Date:,"Mon, 08 May 2023",Prob (F-statistic):,1.67e-167
Time:,14:35:10,Log-Likelihood:,-813.63
No. Observations:,314,AIC:,1639.0
Df Residuals:,308,BIC:,1662.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.8980,0.812,-1.106,0.270,-2.496,0.700
C(Hard_or_Paper)[T.P],1.2895,0.542,2.380,0.018,0.223,2.356
"Q(""List Price"")",0.8552,0.016,54.369,0.000,0.824,0.886
Thick,-1.8302,1.133,-1.615,0.107,-4.060,0.400
Weight_oz,-0.0664,0.044,-1.519,0.130,-0.153,0.020
NumPages,-0.0012,0.002,-0.491,0.624,-0.006,0.004

0,1,2,3
Omnibus:,121.423,Durbin-Watson:,2.118
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1675.391
Skew:,1.177,Prob(JB):,0.0
Kurtosis:,14.068,Cond. No.,2670.0


In [32]:
fit_model.model.formula

'Q("Amazon Price") ~ Q("List Price") + Height + Width + C(Hard_or_Paper)'

In [31]:
fit_model2.model.formula

'Q("Amazon Price") ~ Q("List Price") + Thick + Weight_oz + NumPages + C(Hard_or_Paper)'

![](https://www.jcpcarchives.org/userfiles/values-of-p-Inference.jpg)

In [33]:
fit_model3 = smf.ols(formula='Q("Amazon Price") ~ Q("List Price") + Thick + Weight_oz + C(Hard_or_Paper)', 
                     data=amazonbooks).fit()
fit_model3.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.921
Model:,OLS,Adj. R-squared:,0.92
Method:,Least Squares,F-statistic:,902.3
Date:,"Mon, 08 May 2023",Prob (F-statistic):,2.37e-169
Time:,14:38:55,Log-Likelihood:,-816.27
No. Observations:,315,AIC:,1643.0
Df Residuals:,310,BIC:,1661.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.6885,0.736,-0.935,0.350,-2.137,0.760
C(Hard_or_Paper)[T.P],1.1079,0.450,2.463,0.014,0.223,1.993
"Q(""List Price"")",0.8540,0.016,54.641,0.000,0.823,0.885
Thick,-2.3027,0.737,-3.124,0.002,-3.753,-0.853
Weight_oz,-0.0672,0.043,-1.580,0.115,-0.151,0.016

0,1,2,3
Omnibus:,122.742,Durbin-Watson:,2.114
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1760.075
Skew:,1.178,Prob(JB):,0.0
Kurtosis:,14.338,Cond. No.,132.0


In [34]:
fit_model4 = smf.ols(formula='Q("Amazon Price") ~ Q("List Price") + Thick + C(Hard_or_Paper)', 
                     data=amazonbooks).fit()
fit_model4.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.923
Model:,OLS,Adj. R-squared:,0.923
Method:,Least Squares,F-statistic:,1280.0
Date:,"Mon, 08 May 2023",Prob (F-statistic):,1.93e-177
Time:,14:39:29,Log-Likelihood:,-855.77
No. Observations:,323,AIC:,1720.0
Df Residuals:,319,BIC:,1735.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.0255,0.766,-1.339,0.181,-2.532,0.481
C(Hard_or_Paper)[T.P],1.1440,0.464,2.464,0.014,0.231,2.057
"Q(""List Price"")",0.8663,0.014,61.780,0.000,0.839,0.894
Thick,-3.1085,0.561,-5.541,0.000,-4.212,-2.005

0,1,2,3
Omnibus:,101.699,Durbin-Watson:,2.083
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1295.817
Skew:,0.897,Prob(JB):,4.14e-282
Kurtosis:,12.647,Cond. No.,115.0


In [35]:
fit_model4.model.formula

'Q("Amazon Price") ~ Q("List Price") + Thick + C(Hard_or_Paper)'

$$AmazonPrice =  \enclose{horizontalstrike}{\beta_0}0 + \beta_L ListPrice + \beta_T Thick + \beta_P I(Hard\_or\_Paper==P) + \beta_H I(Hard\_or\_Paper==H)$$

Using the `-1` notation in the formula didn't quite do what I wanted!
I wanted to have no intercept and just an indicator of paperback, like this (as opposed to the above):

$$AmazonPrice =  \enclose{horizontalstrike}{\beta_0}0 + \beta_L ListPrice + \beta_T Thick + \beta_P I(Hard\_or\_Paper==P)$$

In [36]:
fit_model5 = smf.ols(formula='Q("Amazon Price") ~ -1 + Q("List Price") + Thick + C(Hard_or_Paper)', 
                     data=amazonbooks).fit()
fit_model5.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.923
Model:,OLS,Adj. R-squared:,0.923
Method:,Least Squares,F-statistic:,1280.0
Date:,"Mon, 08 May 2023",Prob (F-statistic):,1.93e-177
Time:,14:48:23,Log-Likelihood:,-855.77
No. Observations:,323,AIC:,1720.0
Df Residuals:,319,BIC:,1735.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
C(Hard_or_Paper)[H],-1.0255,0.766,-1.339,0.181,-2.532,0.481
C(Hard_or_Paper)[P],0.1185,0.547,0.217,0.829,-0.958,1.195
"Q(""List Price"")",0.8663,0.014,61.780,0.000,0.839,0.894
Thick,-3.1085,0.561,-5.541,0.000,-4.212,-2.005

0,1,2,3
Omnibus:,101.699,Durbin-Watson:,2.083
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1295.817
Skew:,0.897,Prob(JB):,4.14e-282
Kurtosis:,12.647,Cond. No.,123.0
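The summaries of `fit_model4` and `fit_model5` share the same R-squared and log-likelihood, and their coefficients are related: `C(Hard_or_Paper)[H]` equals the earlier `Intercept` (-1.0255), and `C(Hard_or_Paper)[P]` (0.1185) equals `Intercept + C(Hard_or_Paper)[T.P]` (-1.0255 + 1.1440). A minimal sketch of why, using plain least squares on made-up data rather than the book data: dropping the intercept while keeping one indicator per level gives a different parameterization of the same model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(5, 40, size=n)              # hypothetical numeric covariate
is_p = (rng.random(n) < 0.7).astype(float)  # hypothetical paperback indicator
is_p[:2] = [0.0, 1.0]                       # make sure both groups are present
is_h = 1.0 - is_p                           # hardback indicator
y = -1.0 + 0.9 * x + 1.2 * is_p + rng.normal(0, 1, size=n)

# Parameterization A: intercept + paperback indicator (like C(...)[T.P])
X_a = np.column_stack([np.ones(n), is_p, x])
# Parameterization B: no intercept, one indicator per level (like "-1 + C(...)")
X_b = np.column_stack([is_h, is_p, x])

beta_a, *_ = np.linalg.lstsq(X_a, y, rcond=None)
beta_b, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Same fitted values, so same residuals and same fit quality
same_fit = np.allclose(X_a @ beta_a, X_b @ beta_b)
print(same_fit)
# Hardback level coefficient plays the role of the intercept
print(beta_b[0], beta_a[0])
# Paperback level coefficient equals intercept + paperback offset
print(beta_b[1], beta_a[0] + beta_a[1])
```

Because the two indicator columns sum to a column of ones, the intercept is still implicitly in the model; only a single indicator column without an intercept removes it.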


In [38]:
fit_model4.model.formula

'Q("Amazon Price") ~ Q("List Price") + Thick + C(Hard_or_Paper)'

In [39]:
fit_model4.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.923
Model:,OLS,Adj. R-squared:,0.923
Method:,Least Squares,F-statistic:,1280.0
Date:,"Mon, 08 May 2023",Prob (F-statistic):,1.93e-177
Time:,14:50:22,Log-Likelihood:,-855.77
No. Observations:,323,AIC:,1720.0
Df Residuals:,319,BIC:,1735.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.0255,0.766,-1.339,0.181,-2.532,0.481
C(Hard_or_Paper)[T.P],1.1440,0.464,2.464,0.014,0.231,2.057
"Q(""List Price"")",0.8663,0.014,61.780,0.000,0.839,0.894
Thick,-3.1085,0.561,-5.541,0.000,-4.212,-2.005

0,1,2,3
Omnibus:,101.699,Durbin-Watson:,2.083
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1295.817
Skew:,0.897,Prob(JB):,4.14e-282
Kurtosis:,12.647,Cond. No.,115.0


In [40]:
fit_model6 = smf.ols(formula='Q("Amazon Price") ~ -1 + Q("List Price") + Thick + (Hard_or_Paper=="P")', 
                     data=amazonbooks).fit()
fit_model6.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.923
Model:,OLS,Adj. R-squared:,0.923
Method:,Least Squares,F-statistic:,1280.0
Date:,"Mon, 08 May 2023",Prob (F-statistic):,1.93e-177
Time:,14:57:02,Log-Likelihood:,-855.77
No. Observations:,323,AIC:,1720.0
Df Residuals:,319,BIC:,1735.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
"Hard_or_Paper == ""P""[False]",-1.0255,0.766,-1.339,0.181,-2.532,0.481
"Hard_or_Paper == ""P""[True]",0.1185,0.547,0.217,0.829,-0.958,1.195
"Q(""List Price"")",0.8663,0.014,61.780,0.000,0.839,0.894
Thick,-3.1085,0.561,-5.541,0.000,-4.212,-2.005

0,1,2,3
Omnibus:,101.699,Durbin-Watson:,2.083
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1295.817
Skew:,0.897,Prob(JB):,4.14e-282
Kurtosis:,12.647,Cond. No.,123.0


In [116]:
amazonbooks['Paperback'] = (amazonbooks['Hard_or_Paper']=='P').astype(int)
# edit -- I prefer using hardback rather than paperback...
# - there are more paperbacks, so an indicator of hardback acts like an adjustment for the smaller book type
# - the Hardback indicator has a smaller p-value than the Paperback one
amazonbooks['Hardback'] = (amazonbooks['Hard_or_Paper']=='H').astype(int)

In [117]:
#fit_model7 = smf.ols(formula='Q("Amazon Price") ~ -1 + Q("List Price") + Thick + Paperback', 
fit_model7 = smf.ols(formula='Q("Amazon Price") ~ -1 + Q("List Price") + Thick + Hardback', 
                     data=amazonbooks).fit()
fit_model7.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared (uncentered):,0.963
Model:,OLS,Adj. R-squared (uncentered):,0.963
Method:,Least Squares,F-statistic:,2805.0
Date:,"Mon, 08 May 2023",Prob (F-statistic):,2.37e-229
Time:,15:49:31,Log-Likelihood:,-855.79
No. Observations:,323,AIC:,1718.0
Df Residuals:,320,BIC:,1729.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
"Q(""List Price"")",0.8673,0.013,66.095,0.000,0.842,0.893
Thick,-3.0112,0.336,-8.968,0.000,-3.672,-2.351
Hardback,-1.1581,0.459,-2.523,0.012,-2.061,-0.255

0,1,2,3
Omnibus:,99.007,Durbin-Watson:,2.079
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1262.07
Skew:,0.86,Prob(JB):,8.81e-275
Kurtosis:,12.53,Cond. No.,59.1


In [118]:
fit_model7.model.formula

'Q("Amazon Price") ~ -1 + Q("List Price") + Thick + Hardback'

In [119]:
fit_model8 = smf.ols(formula='Q("Amazon Price") ~ Thick + Weight_oz + Height+ Width + Paperback', 
                     data=amazonbooks).fit()
fit_model8.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.232
Model:,OLS,Adj. R-squared:,0.22
Method:,Least Squares,F-statistic:,18.62
Date:,"Mon, 08 May 2023",Prob (F-statistic):,3.62e-16
Time:,15:49:35,Log-Likelihood:,-1208.2
No. Observations:,314,AIC:,2428.0
Df Residuals:,308,BIC:,2451.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-33.6768,8.112,-4.152,0.000,-49.638,-17.716
Thick,-6.3227,2.771,-2.282,0.023,-11.775,-0.870
Weight_oz,0.4550,0.166,2.741,0.006,0.128,0.782
Height,3.0343,0.936,3.241,0.001,1.192,4.877
Width,3.5401,0.952,3.718,0.000,1.667,5.414
Paperback,2.9002,1.614,1.797,0.073,-0.276,6.076

0,1,2,3
Omnibus:,376.037,Durbin-Watson:,2.113
Prob(Omnibus):,0.0,Jarque-Bera (JB):,24357.299
Skew:,5.355,Prob(JB):,0.0
Kurtosis:,44.797,Cond. No.,217.0


In [60]:
fit_model9 = smf.ols(formula='Q("Amazon Price") ~ Thick + Weight_oz + Height + Width', 
                     data=amazonbooks).fit()
fit_model9.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared:,0.224
Model:,OLS,Adj. R-squared:,0.214
Method:,Least Squares,F-statistic:,22.31
Date:,"Mon, 08 May 2023",Prob (F-statistic):,3.36e-16
Time:,15:07:46,Log-Likelihood:,-1209.9
No. Observations:,314,AIC:,2430.0
Df Residuals:,309,BIC:,2448.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-28.6618,7.644,-3.750,0.000,-43.702,-13.622
Thick,-7.3698,2.719,-2.711,0.007,-12.719,-2.020
Weight_oz,0.4505,0.167,2.704,0.007,0.123,0.778
Height,3.0165,0.940,3.210,0.001,1.168,4.865
Width,3.2323,0.940,3.439,0.001,1.383,5.082

0,1,2,3
Omnibus:,381.525,Durbin-Watson:,2.076
Prob(Omnibus):,0.0,Jarque-Bera (JB):,25810.105
Skew:,5.476,Prob(JB):,0.0
Kurtosis:,46.044,Cond. No.,204.0


In [58]:
fit_model10 = smf.ols(formula='Q("Amazon Price") ~ -1 + Thick + Weight_oz + Width', 
                     data=amazonbooks).fit()
fit_model10.summary()

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared (uncentered):,0.592
Model:,OLS,Adj. R-squared (uncentered):,0.588
Method:,Least Squares,F-statistic:,150.5
Date:,"Mon, 08 May 2023",Prob (F-statistic):,2.91e-60
Time:,15:06:55,Log-Likelihood:,-1217.1
No. Observations:,314,AIC:,2440.0
Df Residuals:,311,BIC:,2451.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Thick,-11.7762,2.314,-5.089,0.000,-16.329,-7.223
Weight_oz,0.7890,0.141,5.578,0.000,0.511,1.067
Width,2.4681,0.312,7.899,0.000,1.853,3.083

0,1,2,3
Omnibus:,396.53,Durbin-Watson:,2.073
Prob(Omnibus):,0.0,Jarque-Bera (JB):,28642.71
Skew:,5.846,Prob(JB):,0.0
Kurtosis:,48.305,Cond. No.,52.8


In [120]:
fit_model7.model.formula

'Q("Amazon Price") ~ -1 + Q("List Price") + Thick + Hardback'

$$\large \hat y = \beta_L ListPrice + \beta_T Thick + \beta_P Paperback$$

$$\large \hat y = 0.86 ListPrice - 3.69 Thick + 0.70 Paperback$$


In [73]:
# the original paperback based model
0.86 * 12.95 - 3.69 * 0.8 + 0.70 *1

8.884999999999998

In [121]:
fit_model7.params

Q("List Price")    0.867349
Thick             -3.011230
Hardback          -1.158125
dtype: float64

$$\large \hat y = \beta_L ListPrice + \beta_T Thick + \beta_H Hardback$$

$$\large \hat y = 0.867 ListPrice - 3.011 Thick - 1.158 Hardback$$


In [77]:
# stale output: computed when model 7 still used the Paperback indicator
fit_model7.predict()[0]

8.8901407366203

In [124]:
# model 7 changed to hardback
0.867 * 12.95 - 3.01 * 0.8 -1.15 *0

8.81965

In [125]:
fit_model7.predict()[0]

8.823188806135462
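The small gap between the hand calculation (8.81965) and `predict()` (8.8232) comes from rounding the coefficients. As a quick check, the sketch below repeats the calculation with the six-decimal values printed by `fit_model7.params`; the dictionary names are only for illustration.

```python
# Values copied from fit_model7.params and from the first row of amazonbooks
params = {"List Price": 0.867349, "Thick": -3.011230, "Hardback": -1.158125}
first_row = {"List Price": 12.95, "Thick": 0.8, "Hardback": 0}  # row 0 is a paperback

# y-hat is the sum of coefficient * covariate over all terms (no intercept here)
y_hat = sum(params[name] * first_row[name] for name in params)
print(y_hat)
```

With the less-rounded coefficients, the result agrees with `fit_model7.predict()[0]` to several decimal places.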

In [61]:
fit_model9.model.formula

'Q("Amazon Price") ~ Thick + Weight_oz + Height + Width'

In [59]:
fit_model10.model.formula

'Q("Amazon Price") ~ -1 + Thick + Weight_oz + Width'

In [62]:
fit_model9.params

Intercept   -28.661822
Thick        -7.369776
Weight_oz     0.450477
Height        3.016454
Width         3.232274
dtype: float64

$$\large \hat y = \beta_0 + \beta_T Thick + \beta_{Wt} Weight\_oz + \beta_H Height + \beta_W Width $$

$$\large \hat y = -28.66 - 7.36 Thick + 0.45 Weight\_oz + 3.01 Height + 3.23 Width $$

In [65]:
# y-hat for the first row observation
-28.66 - 7.36*0.8 + 0.45 * 11.2 + 3.01 * 7.8 + 3.23 *5.5

11.734999999999996

In [71]:
fit_model9.predict()[0]

11.79354246101503

In [64]:
amazonbooks

Unnamed: 0,Title,Author,List Price,Amazon Price,Hard_or_Paper,NumPages,Publisher,Pub year,ISBN-10,Height,Width,Thick,Weight_oz,Paperback
0,"1,001 Facts that Will Scare the S#*t Out of Yo...",Cary McNeal,12.95,5.18,P,304.0,Adams Media,2010.0,1605506249,7.8,5.5,0.8,11.2,1
1,21: Bringing Down the House - Movie Tie-In: Th...,Ben Mezrich,15.00,10.20,P,273.0,Free Press,2008.0,1416564195,8.4,5.5,0.7,7.2,1
2,100 Best-Loved Poems (Dover Thrift Editions),Smith,1.50,1.50,P,96.0,Dover Publications,1995.0,486285537,8.3,5.2,0.3,4.0,1
3,1421: The Year China Discovered America,Gavin Menzies,15.99,10.87,P,672.0,Harper Perennial,2008.0,61564893,8.8,6.0,1.6,28.8,1
4,1493: Uncovering the New World Columbus Created,Charles C. Mann,30.50,16.77,P,720.0,Knopf,2011.0,307265722,8.0,5.2,1.4,22.4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
320,Where the Sidewalk Ends,Shel Silverstein,18.99,12.24,H,192.0,HarperCollins,2004.0,60572345,9.3,6.6,1.1,24.0,0
321,White Privilege,Paula S. Rothenberg,27.55,27.55,P,160.0,Worth Publishers,2011.0,1429233443,9.1,6.1,0.7,8.0,1
322,Why I wore lipstick,Geralyn Lucas,12.95,5.18,P,224.0,St Martin's Griffin,2005.0,031233446X,8.0,5.4,0.7,6.4,1
323,"Worlds Together, Worlds Apart: A History of th...",Robert Tignor,97.50,97.50,P,480.0,W. W. Norton & Company,2010.0,393934942,10.7,8.9,0.9,14.4,1


In [99]:
import plotly.express as px
import plotly.graph_objects as go
import numpy as np

In [127]:
# fit_model7
y_7 = fit_model7.model.endog
print("r =",np.corrcoef(y_7, fit_model7.predict())[0,1])
fig = px.scatter(x=y_7, y=fit_model7.predict(),
                 labels={"x": "y", "y": "y-hat"},)
fig.add_trace(go.Scatter(x=[0,140], y=[0,140], name="y=x", line_shape='linear'))
fig.show()

r = 0.960867144102875


In [131]:
0.960867144102875**2

0.9232656686164151

In [104]:
# fit_model9
y_9 = fit_model9.model.endog
print("r =",np.corrcoef(y_9, fit_model9.predict())[0,1])
fig = px.scatter(x=y_9, y=fit_model9.predict(),
                 labels={"x": "y", "y": "y-hat"},)
fig.add_trace(go.Scatter(x=[0,140], y=[0,140], name="y=x", line_shape='linear'))
fig.show()

r = 0.4733900350278686


In [103]:
# fit_model10
y_10 = fit_model10.model.endog
print("r =",np.corrcoef(y_10, fit_model10.predict())[0,1])
fig = px.scatter(x=y_10, y=fit_model10.predict(),
                 labels={"x": "y", "y": "y-hat"},)
fig.add_trace(go.Scatter(x=[0,140], y=[0,140], name="y=x", line_shape='linear'))
fig.show()

r = 0.4346690986911411


In [109]:
# y versus y-hat correlation for different model fits

# model 7 # model 9 # model 10
0.960**2, 0.473**2, 0.434**2

(0.9216, 0.22372899999999998, 0.188356)

$R^2$ is called the coefficient of determination for a linear model
- $R^2$ is the "proportion of variation explained by the model"
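For a least-squares fit that includes an intercept, this proportion equals the squared correlation between $y$ and $\hat y$, which is why squaring the `r` values above approximately reproduces the R-squared values in the summaries. A minimal sketch on made-up data (not the book data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)    # hypothetical data

X = np.column_stack([np.ones(n), x])      # model with an intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ss_res = np.sum((y - y_hat) ** 2)         # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation around the mean
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(y, y_hat)[0, 1]
print(r_squared, r ** 2)
```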

In [128]:
fit_model11 = smf.ols(formula='Q("Amazon Price") ~ -1 + Q("List Price") * Thick * Paperback', 
                    data=amazonbooks).fit()
# fit_model7.summary() # model has 3 covariates only
fit_model11.summary() # model 11 has 7 covariates

0,1,2,3
Dep. Variable:,"Q(""Amazon Price"")",R-squared (uncentered):,0.967
Model:,OLS,Adj. R-squared (uncentered):,0.966
Method:,Least Squares,F-statistic:,1306.0
Date:,"Mon, 08 May 2023",Prob (F-statistic):,5.21e-229
Time:,15:53:54,Log-Likelihood:,-840.89
No. Observations:,323,AIC:,1696.0
Df Residuals:,316,BIC:,1722.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
"Q(""List Price"")",0.7693,0.056,13.774,0.000,0.659,0.879
Thick,-4.6291,0.686,-6.747,0.000,-5.979,-3.279
"Q(""List Price""):Thick",0.0996,0.042,2.369,0.018,0.017,0.182
Paperback,-4.2854,1.059,-4.048,0.000,-6.368,-2.202
"Q(""List Price""):Paperback",0.3562,0.079,4.535,0.000,0.202,0.511
Thick:Paperback,7.8834,1.555,5.070,0.000,4.824,10.943
"Q(""List Price""):Thick:Paperback",-0.4716,0.087,-5.426,0.000,-0.643,-0.301

0,1,2,3
Omnibus:,100.989,Durbin-Watson:,2.049
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1023.142
Skew:,0.966,Prob(JB):,6.72e-223
Kurtosis:,11.502,Cond. No.,381.0


# Interactions! Automatically create new data columns to help predict!

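In the formula, `*` expands to all main effects plus every interaction between them, so the fitted model is:

$$\large \hat y = \beta_L ListPrice + \beta_T Thick + \beta_{LT} ListPrice \times Thick + \beta_P Paperback + \beta_{LP} ListPrice \times Paperback + \beta_{TP} Thick \times Paperback + \beta_{LTP} ListPrice \times Thick \times Paperback$$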
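The seven rows in the model 11 summary correspond to seven data columns: the three covariates, their three pairwise products, and the one three-way product. A minimal sketch of that expansion, built by hand with hypothetical values (the formula interface does the equivalent automatically):

```python
from itertools import combinations

import numpy as np

# Hypothetical values for three covariates (not from amazonbooks.csv)
columns = {
    "ListPrice": np.array([12.95, 15.00, 18.99]),
    "Thick": np.array([0.8, 0.7, 1.1]),
    "Paperback": np.array([1.0, 1.0, 0.0]),
}

# a * b * c means: every non-empty subset of {a, b, c}, each as a product column
design = {}
for size in range(1, len(columns) + 1):
    for names in combinations(columns, size):
        design[":".join(names)] = np.prod([columns[n] for n in names], axis=0)

for name, values in design.items():
    print(name, values)
```

Each interaction column gets its own coefficient, so, for example, the slope on `ListPrice` is allowed to differ between paperbacks and hardbacks.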


- For model 7: $R^2$ reported in the summary function was 0.963
- For model 11: $R^2$ reported in the summary function was 0.967
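Both summaries label this statistic "R-squared (uncentered)" because neither model has an intercept. The uncentered version measures total variation around zero rather than around the mean of $y$, so it is not directly comparable with the squared correlation computed earlier for model 7 (about 0.923) or with the R-squared of models that include an intercept. A minimal sketch of the difference on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(5, 40, size=n)
y = 0.85 * x + rng.normal(0, 3, size=n)   # hypothetical data, true line through the origin

X = x.reshape(-1, 1)                      # no intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

ss_res = np.sum(resid ** 2)
r2_centered = 1 - ss_res / np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_uncentered = 1 - ss_res / np.sum(y ** 2)             # total variation around zero
print(r2_centered, r2_uncentered)
```

Since $\sum y_i^2 \ge \sum (y_i - \bar y)^2$, the uncentered value is never smaller than the centered one for the same residuals.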

In [130]:
amazonbooks

Unnamed: 0,Title,Author,List Price,Amazon Price,Hard_or_Paper,NumPages,Publisher,Pub year,ISBN-10,Height,Width,Thick,Weight_oz,Paperback,Hardback
0,"1,001 Facts that Will Scare the S#*t Out of Yo...",Cary McNeal,12.95,5.18,P,304.0,Adams Media,2010.0,1605506249,7.8,5.5,0.8,11.2,1,0
1,21: Bringing Down the House - Movie Tie-In: Th...,Ben Mezrich,15.00,10.20,P,273.0,Free Press,2008.0,1416564195,8.4,5.5,0.7,7.2,1,0
2,100 Best-Loved Poems (Dover Thrift Editions),Smith,1.50,1.50,P,96.0,Dover Publications,1995.0,486285537,8.3,5.2,0.3,4.0,1,0
3,1421: The Year China Discovered America,Gavin Menzies,15.99,10.87,P,672.0,Harper Perennial,2008.0,61564893,8.8,6.0,1.6,28.8,1,0
4,1493: Uncovering the New World Columbus Created,Charles C. Mann,30.50,16.77,P,720.0,Knopf,2011.0,307265722,8.0,5.2,1.4,22.4,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
320,Where the Sidewalk Ends,Shel Silverstein,18.99,12.24,H,192.0,HarperCollins,2004.0,60572345,9.3,6.6,1.1,24.0,0,1
321,White Privilege,Paula S. Rothenberg,27.55,27.55,P,160.0,Worth Publishers,2011.0,1429233443,9.1,6.1,0.7,8.0,1,0
322,Why I wore lipstick,Geralyn Lucas,12.95,5.18,P,224.0,St Martin's Griffin,2005.0,031233446X,8.0,5.4,0.7,6.4,1,0
323,"Worlds Together, Worlds Apart: A History of th...",Robert Tignor,97.50,97.50,P,480.0,W. W. Norton & Company,2010.0,393934942,10.7,8.9,0.9,14.4,1,0
