# Question 0 - R-Squared Warmup [20 points]

In this question you will fit a model to the ToothGrowth data used in the notes on Resampling and Statsmodels-OLS. Read the data, log transform tooth length, and then fit a model with independent variables for supplement type, dose (as categorical), and their interaction. Demonstrate how to compute the R-Squared and Adjusted R-Squared values and compare your computations to the attributes (or properties) already present in the result object.

In [3]:
# model imports
import numpy as np
import pandas as pd
from scipy.stats import t
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import statsmodels.formula.api as smf
import statsmodels.api as sm
from os.path import exists

In [34]:
# use the "ToothGrowth" data from the R datasets package.
file = 'tooth_growth.feather'
if exists(file):
    tg_data = pd.read_feather(file)
else: 
    tooth_growth = sm.datasets.get_rdataset('ToothGrowth')
    #print(tooth_growth.__doc__)
    tg_data = tooth_growth.data
    tg_data.to_feather(file)

In [35]:
# log transform tooth length and encode supplement type as a 0/1 indicator (OJ = 1);
# dose is left numeric here
trans_tg_data = tg_data.copy()  # copy so the original frame is not modified
trans_tg_data["log_len"] = np.log(trans_tg_data["len"])
trans_tg_data['supp'] = pd.get_dummies(trans_tg_data['supp'])['OJ'].astype(int)
trans_tg_data

,len,supp,dose,log_len
0,4.2,0,0.5,1.435085
1,11.5,0,0.5,2.442347
2,7.3,0,0.5,1.987874
3,5.8,0,0.5,1.757858
4,6.4,0,0.5,1.856298
5,10.0,0,0.5,2.302585
6,11.2,0,0.5,2.415914
7,11.2,0,0.5,2.415914
8,5.2,0,0.5,1.648659
9,7.0,0,0.5,1.94591


In [36]:
# fit a model
mod1 = sm.OLS.from_formula('log_len ~ supp*dose', data=trans_tg_data)
res1 = mod1.fit()
res1.summary2()

Model:,OLS,Adj. R-squared:,0.666
Dependent Variable:,log_len,AIC:,25.3657
Date:,2021-11-03 22:49,BIC:,33.7431
No. Observations:,60,Log-Likelihood:,-8.6829
Df Model:,3,F-statistic:,40.26
Df Residuals:,56,Prob (F-statistic):,5.28e-14
R-squared:,0.683,Scale:,0.083789

,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,1.8019,0.1121,16.0726,0.0000,1.5773,2.0265
supp,0.6564,0.1585,4.1404,0.0001,0.3388,0.9741
dose,0.7639,0.0847,9.0136,0.0000,0.5941,0.9336
supp:dose,-0.3294,0.1198,-2.7484,0.0080,-0.5695,-0.0893

Omnibus:,2.645,Durbin-Watson:,1.373
Prob(Omnibus):,0.266,Jarque-Bera (JB):,2.575
Skew:,-0.469,Prob(JB):,0.276
Kurtosis:,2.613,Condition No.:,11.0
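
The task statement asks for dose as a categorical factor, while the summary above reports `Df Model: 3`, which is what a numeric dose produces. As a rough illustration of the difference, the sketch below builds by hand the treatment-coded design that a formula such as `log_len ~ supp*C(dose)` would imply and counts its columns. The helper `design_row` is hypothetical, and the balanced layout (2 supplements x 3 doses x 10 animals) is an assumption consistent with the 60 observations reported.

```python
from itertools import product

# Assumed balanced layout: 2 supplements x 3 doses x 10 animals = 60 rows.
supps = [0, 1]             # 0/1 indicator, as in the notebook (OJ = 1)
doses = [0.5, 1.0, 2.0]
cells = [(s, d) for s, d in product(supps, doses) for _ in range(10)]

def design_row(supp, dose, reference=0.5):
    """Hypothetical helper: treatment-coded row for supp * C(dose).

    Columns: intercept, supp, one dummy per non-reference dose,
    and one supp-by-dose interaction per non-reference dose.
    """
    dose_dummies = [int(dose == level) for level in doses if level != reference]
    interactions = [supp * d for d in dose_dummies]
    return [1, supp] + dose_dummies + interactions

X = [design_row(s, d) for s, d in cells]
n_obs, n_cols = len(X), len(X[0])
df_model = n_cols - 1       # regressors excluding the intercept
df_resid = n_obs - n_cols
print(n_obs, n_cols, df_model, df_resid)  # 60 6 5 54
```

With dose treated as categorical the model would have 5 regressors rather than 3, so the degrees of freedom used in the adjusted R-squared would change accordingly.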


The summary table reports $R^2 = 0.683$ and $R_{Adj}^2 = 0.666$.
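
For reference, these quantities follow the usual definitions for a model with an intercept, where $n$ is the number of observations and $p$ the number of regressors excluding the intercept (the symbols $n$, $p$, $\hat{y}_i$, and $\bar{y}$ are notation introduced here, not taken from the notebook):

$$
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2,
$$

$$
R^2 = 1 - \frac{RSS}{TSS}, \qquad
R_{Adj}^2 = 1 - \frac{n - 1}{n - p - 1} \cdot \frac{RSS}{TSS}.
$$

For the fit above, $n = 60$ and $p = 3$, so the adjustment factor is $(n-1)/(n-p-1) = 59/56$.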

In [47]:
# Compute R-squared and adjusted R-squared by hand
y = trans_tg_data["log_len"]
TSS = np.sum((y - y.mean())**2)   # total sum of squares
RSS = np.sum(res1.resid**2)       # residual sum of squares
R_Squared = 1 - RSS/TSS
print(round(R_Squared, 6))
# adjusted R-squared: scale by (n - 1)/(n - p - 1) = (df_resid + df_model)/df_resid
R_adj = 1 - (res1.df_resid + res1.df_model)/res1.df_resid*(RSS/TSS)
print(round(R_adj, 6))
# compare with the attributes of the result object
print(np.isclose(R_Squared, res1.rsquared), np.isclose(R_adj, res1.rsquared_adj))

0.683238
0.666268
True True


The computed R-squared and adjusted R-squared agree, to three decimal places, with the values reported in the summary table above (0.683 and 0.666).
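
As a quick arithmetic check that does not depend on the fitted result object, the adjustment can be applied directly to the reported R-squared. This is a minimal sketch: the function name `adjusted_r_squared` is hypothetical, and the inputs (60 observations, 3 regressors, R-squared of 0.6832376344099036) are taken from the output above.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R^2 for a model with an intercept and n_regressors slopes."""
    df_total = n_obs - 1                  # n - 1
    df_resid = n_obs - n_regressors - 1   # n - p - 1
    return 1.0 - (df_total / df_resid) * (1.0 - r_squared)

r2 = 0.6832376344099036                   # R-squared printed above
adj = adjusted_r_squared(r2, n_obs=60, n_regressors=3)
print(round(adj, 3))  # 0.666
```

The factor applied here is 59/56, and the result rounds to the 0.666 shown in the summary table.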