---
author: Krtin Juneja (KJUNEJA@falcon.bentley.edu)
---

The solution below uses the `mtcars` example dataset, which describes car design and fuel consumption and comes from a 1974 issue of Motor Trend magazine.  (See how to quickly load some sample data.)

We will create two models, one nested inside the other, meaning that the smaller model uses a subset of the terms in the larger one.
The construction below is a natural one for this dataset, but it is not the only way to create nested models.

In [2]:
from rdatasets import data
df = data('mtcars')

Consider a model that uses the number of cylinders (`cyl`) and the weight of a car (`wt`) to predict its fuel efficiency (`mpg`). We fit this model with the ordinary least squares function `ols` from `statsmodels` and then perform an ANOVA to see whether the predictors are significant.

In [3]:
from statsmodels.formula.api import ols
add_model = ols('mpg ~ cyl + wt', data=df).fit()

import statsmodels.api as sm
sm.stats.anova_lm(add_model, typ=1)

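|          | df   | sum_sq     | mean_sq    | F          | PR(>F)       |
|----------|------|------------|------------|------------|--------------|
| cyl      | 1.0  | 817.712952 | 817.712952 | 124.043687 | 5.424327e-12 |
| wt       | 1.0  | 117.162269 | 117.162269 | 17.773034  | 0.00022202   |
| Residual | 29.0 | 191.171966 | 6.592137   |            |              |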


Both $p$-values in the final column of the output, **PR(>F)**, are below $0.05$, which suggests that both predictors are significant.  A natural question to ask is whether the two predictors have an interaction effect.  Let's create a model containing the interaction term.

In [5]:
int_model = ols('mpg ~ cyl*wt', data=df).fit()
sm.stats.anova_lm(int_model, typ=1)

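|          | df   | sum_sq     | mean_sq    | F          | PR(>F)       |
|----------|------|------------|------------|------------|--------------|
| cyl      | 1.0  | 817.712952 | 817.712952 | 145.856269 | 1.280635e-12 |
| wt       | 1.0  | 117.162269 | 117.162269 | 20.89835   | 8.942713e-05 |
| cyl:wt   | 1.0  | 34.195767  | 34.195767  | 6.099533   | 0.01988242   |
| Residual | 28.0 | 156.976199 | 5.606293   |            |              |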


The last $p$-value in the final column, the one in the `cyl:wt` row, is about $0.0199$. Because it is below $0.05$, there is a significant interaction between the two predictors.

We now have one model (`add_model`) nested inside a larger model (`int_model`).
To check which model is better, we can conduct an ANOVA comparing the two models. We use the `anova_lm` function from `statsmodels`.

In [6]:
from statsmodels.stats.anova import anova_lm
anova_lm(add_model, int_model)

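|   | df_resid | ssr        | df_diff | ss_diff   | F        | Pr(>F)   |
|---|----------|------------|---------|-----------|----------|----------|
| 0 | 29.0     | 191.171966 | 0.0     |           |          |          |
| 1 | 28.0     | 156.976199 | 1.0     | 34.195767 | 6.099533 | 0.019882 |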


We have just performed this hypothesis test:

$H_0$: the two models are equally useful for predicting the outcome

$H_a$: the larger model fits the data better than the smaller model

In the final column of the output, called **Pr(>F)**,
the only number in that column is the $p$-value of our test, $0.019882$
(the test statistic itself is $F = 6.099533$, in the column to its left).
Since the $p$-value is below our chosen threshold of $0.05$, we reject the null hypothesis
and prefer to use the second model.
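
To see where the numbers in the comparison table come from, the $F$ statistic and its $p$-value can be reproduced by hand from the two residual sums of squares. The following is a minimal sketch using only the Python standard library. The helper names `t_pdf` and `f_tail_one_df` are made up for this illustration, and the tail probability relies on the fact that an $F$ variable with $1$ and $\nu$ degrees of freedom is the square of a $t$ variable with $\nu$ degrees of freedom, so the shortcut only applies when the models differ by a single term.

```python
import math

# Values copied from the model-comparison table above.
ssr_small, df_small = 191.171966, 29   # add_model: residual sum of squares, residual df
ssr_large, df_large = 156.976199, 28   # int_model: residual sum of squares, residual df

df_diff = df_small - df_large          # number of extra terms in the larger model (1)
ss_diff = ssr_small - ssr_large        # reduction in the residual sum of squares

# F = (improvement per extra term) / (residual variance of the larger model)
f_stat = (ss_diff / df_diff) / (ssr_large / df_large)

def t_pdf(t, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + t * t / nu) ** (-(nu + 1) / 2)

def f_tail_one_df(f, nu, steps=2000):
    """P(F > f) for an F(1, nu) variable, computed as P(|T| > sqrt(f)) for T ~ t(nu).

    The integral of the t density over [0, sqrt(f)] is approximated with
    Simpson's rule, so `steps` must be even.
    """
    upper = math.sqrt(f)
    h = upper / steps
    total = t_pdf(0.0, nu) + t_pdf(upper, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 1 - 2 * total * h / 3

p_value = f_tail_one_df(f_stat, df_large)
print(f_stat, p_value)
```

In practice there is no need for the hand-rolled integration: if SciPy is available, `scipy.stats.f.sf(f_stat, df_diff, df_large)` gives the same upper-tail probability and also works when the models differ by more than one term.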

The same method can be used to check whether a covariate should be included in a model, or whether additional variables are worth adding: fit the model with and without the extra terms and compare the two fits.
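
As an illustration of that more general use, here is a hedged sketch that tests whether one extra variable is worth adding. It does not use `mtcars`: the data is synthetic, and the column names `x1`, `x2`, `x3`, and `y` are placeholders invented for this example. The outcome is generated so that it really does depend on `x3`, so we would expect the comparison to favor the larger model.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic data for illustration only (not from the mtcars dataset).
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'x3': rng.normal(size=n),
})
# By construction, y depends on all three predictors plus noise.
demo['y'] = 1 + 2 * demo['x1'] - demo['x2'] + 1.5 * demo['x3'] + rng.normal(size=n)

# Smaller model first, then the larger model that adds the candidate variable.
small = ols('y ~ x1 + x2', data=demo).fit()
large = ols('y ~ x1 + x2 + x3', data=demo).fit()

comparison = anova_lm(small, large)
p_value = comparison.loc[1, 'Pr(>F)']

print(comparison)
print('Keep x3' if p_value < 0.05 else 'Drop x3')
```

The pattern is the same as in the `mtcars` example: the row labeled 1 holds the test of the larger model against the smaller one, and its **Pr(>F)** entry is the $p$-value to compare against the chosen threshold.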