In [1]:
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
import scipy.stats as stats

### Question 1

**Single Factor Studies - ANOVA Model II**

Random Cell Means Model: $Y_{ij} = \mu_{i} + \epsilon_{ij}$

- $\mu_i$ are independent $N(\mu_., \sigma_\mu^2)$
- $\epsilon_{ij}$ are independent $N(0, \sigma^2)$
- $\mu_i$ and $\epsilon_{ij}$ are independent random variables
- $i = 1, 2, \dots, r$ and $j = 1, 2, \dots, n$
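
To make the two sources of randomness concrete, the model can be simulated directly. This is a minimal sketch: the values of `mu_dot`, `sigma_mu`, and `sigma` are hypothetical and chosen only for illustration, not estimated from the sodium data.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
r, n = 6, 8        # number of factor levels and replicates per level
mu_dot = 18.0      # overall mean mu.
sigma_mu = 4.0     # standard deviation of the random level means
sigma = 0.8        # standard deviation of the error term

rng = np.random.default_rng(0)

# One random mean per level: mu_i ~ N(mu., sigma_mu^2)
mu_i = rng.normal(mu_dot, sigma_mu, size=r)

# Independent errors: eps_ij ~ N(0, sigma^2)
eps = rng.normal(0.0, sigma, size=(r, n))

# Y_ij = mu_i + eps_ij (each level mean is broadcast across its n replicates)
Y = mu_i[:, None] + eps

print(Y.shape)
print(Y.mean(axis=1))  # sample level means, scattered around mu_dot
```

Setting `sigma_mu = 0.0` makes every level share the same mean, which is the situation described by the null hypothesis of the F test.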

In [2]:
data = np.array([
    [24.4, 1, 1],
    [22.6, 1, 2],
    [23.8, 1, 3],
    [22.0, 1, 4],
    [24.5, 1, 5],
    [22.3, 1, 6],
    [25.0, 1, 7],
    [24.5, 1, 8],
    [10.2, 2, 1],
    [12.1, 2, 2],
    [10.3, 2, 3],
    [10.2, 2, 4],
    [9.9, 2, 5],
    [11.2, 2, 6],
    [12.0, 2, 7],
    [9.5, 2, 8],
    [19.2, 3, 1],
    [19.4, 3, 2],
    [19.8, 3, 3],
    [19.0, 3, 4],
    [19.6, 3, 5],
    [18.3, 3, 6],
    [20.0, 3, 7],
    [19.4, 3, 8],
    [17.4, 4, 1],
    [18.1, 4, 2],
    [16.7, 4, 3],
    [18.3, 4, 4],
    [17.6, 4, 5],
    [17.5, 4, 6],
    [18.0, 4, 7],
    [16.4, 4, 8],
    [13.4, 5, 1],
    [15.0, 5, 2],
    [14.1, 5, 3],
    [13.1, 5, 4],
    [14.9, 5, 5],
    [15.0, 5, 6],
    [13.4, 5, 7],
    [14.8, 5, 8],
    [21.3, 6, 1],
    [20.2, 6, 2],
    [20.7, 6, 3],
    [20.8, 6, 4],
    [20.1, 6, 5],
    [18.8, 6, 6],
    [21.1, 6, 7],
    [20.3, 6, 8],
], dtype=np.float32)

#### 1.a



- **Null Hypothesis ($H_0$)**: The mean sodium content is the same across all brands.<br>
The variance of the brand means is zero: $\sigma_\mu^2 = 0$.

- **Alternative Hypothesis ($H_1$)**: The mean sodium content differs among the brands.<br>The variance of the brand means is positive: $\sigma_\mu^2 > 0$.

**Test Statistic**
$$ F = \frac{\text{Mean Square Between Groups (MSTR)}}{\text{Mean Square Within Groups (MSE)}} $$


**Decision Rule**
- Conclude $H_0$ if $F \leq F_{1-\alpha, r-1, r(n-1)}$
- Conclude $H_1$ if $F > F_{1-\alpha, r-1, r(n-1)}$
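
The rule works because, for a balanced design, $E\{MSE\} = \sigma^2$ and $E\{MSTR\} = \sigma^2 + n\sigma_\mu^2$, so a ratio well above 1 points to $\sigma_\mu^2 > 0$. The sketch below computes the statistic by hand on a small hypothetical balanced array `y` (not the sodium data) and applies the decision rule; it is meant only to illustrate the arithmetic.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced layout: r = 3 levels (rows), n = 4 replicates (columns)
y = np.array([
    [24.1, 22.9, 23.5, 24.3],
    [10.4, 11.8, 10.1, 11.0],
    [19.3, 19.9, 18.7, 19.5],
])
r, n = y.shape

level_means = y.mean(axis=1)
grand_mean = y.mean()

# Between-level and within-level sums of squares
SSTR = n * np.sum((level_means - grand_mean) ** 2)
SSE = np.sum((y - level_means[:, None]) ** 2)

# Mean squares with r - 1 and r(n - 1) degrees of freedom
MSTR = SSTR / (r - 1)
MSE = SSE / (r * (n - 1))
F = MSTR / MSE

alpha = 0.01
F_crit = stats.f.ppf(1 - alpha, r - 1, r * (n - 1))

print("F =", F, "critical value =", F_crit)
print("Conclude H1" if F > F_crit else "Conclude H0")
```

The hand-computed `F` should agree with `scipy.stats.f_oneway(*y).statistic`, which is a convenient cross-check.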

In [6]:
# process data into categorical variables
df = pd.DataFrame(data, columns=["Sodium", "Brand", "Replicate"])
df["Brand"] = df["Brand"].astype("category")

# setup and fit OLS model
model = ols("Sodium ~ C(Brand)", data=df).fit()

# ANOVA table with Type II sums of squares
# (typ=2 selects the sum-of-squares type; it is unrelated to "ANOVA Model II")
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
print()


# Mean Square Between Groups
dfB = anova_table.at["C(Brand)", "df"]
MSTR = anova_table.at["C(Brand)", "sum_sq"] / dfB

# Mean Square Within Groups
dfW = anova_table.at["Residual", "df"]
MSE = anova_table.at["Residual", "sum_sq"] / dfW

f_calculated = MSTR / MSE
print("Calculated F-statistic: ", f_calculated)

alpha = 0.01
f_critical = stats.f.ppf(1 - alpha, dfB, dfW)
print("Critical F-statistic: ", f_critical)

if f_calculated > f_critical:
    print("\nReject null hypothesis")
else:
    print("\nFail to reject null hypothesis")

              sum_sq    df           F        PR(>F)
C(Brand)  854.529147   5.0  238.711145  1.083748e-29
Residual   30.070003  42.0         NaN           NaN

Calculated F-statistic:  238.71114463850907
Critical F-statistic:  3.488234863858266

Reject null hypothesis


From the result of the analysis above:
- F-statistic: 238.71
- p-value: approximately $1.08 \times 10^{-29}$

**Conclusion**
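- The p-value is far below $\alpha = 0.01$, so we reject the null hypothesis.
- There is significant evidence that the mean sodium content is not the same across brands, i.e. $\sigma_\mu^2 > 0$. Under Model II this conclusion applies to the population of brands from which the six were sampled.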
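
The decision rule was stated in terms of a critical value, while the conclusion quotes a p-value. The two approaches always give the same decision, as this short sketch illustrates using the F statistic and degrees of freedom reported in the output above.

```python
from scipy import stats

# Values taken from the ANOVA output above
f_stat, df_num, df_den = 238.71, 5, 42
alpha = 0.01

# p-value: upper-tail probability of the F(df_num, df_den) distribution
p_value = stats.f.sf(f_stat, df_num, df_den)

# Critical value: the (1 - alpha) quantile of the same distribution
f_critical = stats.f.ppf(1 - alpha, df_num, df_den)

# Both comparisons lead to the same decision
print("p-value < alpha:     ", p_value < alpha)
print("F > critical value:  ", f_stat > f_critical)
```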