## Two-way ANOVA

Checking whether fare varies significantly with both passenger class and port of embarkation, and whether the two factors interact.

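#### Hypotheses  
Null hypotheses:  
The mean fare does not differ across passenger classes.  
The mean fare does not differ across ports of embarkation.  
There is no interaction between passenger class and port of embarkation in their effect on fare.  
Alternative hypotheses:  
The mean fare differs for at least one passenger class.  
The mean fare differs for at least one port of embarkation.  
The effect of passenger class on fare depends on the port of embarkation.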

In [1]:
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [2]:
# Load Titanic dataset
titanic = sns.load_dataset('titanic')

In [3]:
# Check for missing values in relevant columns
print(titanic[['fare', 'class', 'embarked']].isnull().sum())

fare        0
class       0
embarked    2
dtype: int64


In [4]:
# Drop rows with missing values
titanic_clean = titanic[['fare', 'class', 'embarked']].dropna()

In [5]:
# Rename the column: `class` is a reserved Python keyword, so it cannot be used as a name in the model formula
titanic_clean = titanic_clean.rename(columns={'class': 'class_boarded'})

In [6]:
titanic_clean.columns

Index(['fare', 'class_boarded', 'embarked'], dtype='object')

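#### Generating the ANOVA table

To generate the ANOVA table, we first fit a linear model with `ols` and then pass the fitted model to `anova_lm`.  
For two categorical factors with an interaction, the formula has the general form:  

response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)

Here `C(...)` marks a column as categorical and `:` denotes the interaction between the two factors.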
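For reference, the model fitted in this section can be written out as an equation. This is a sketch in the usual effects notation; the symbols are introduced here for illustration and do not appear in the notebook. $y_{ijk}$ is the fare of passenger $k$ in class $i$ who embarked at port $j$, $\alpha_i$ is the class effect, $\beta_j$ is the port effect, and $(\alpha\beta)_{ij}$ is the interaction:

$$
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
$$

Each row of the ANOVA table (other than Residual) tests whether one of these three sets of terms is zero.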

In [7]:
# Perform Two-Way ANOVA
model = ols('fare ~ C(class_boarded) + C(embarked) + C(class_boarded):C(embarked)', data=titanic_clean).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print("Two-Way ANOVA Results:\n", anova_table)

# Interpretation
alpha = 0.05
# Skip the Residual row: it has no F-test, so its p-value is NaN
for factor, row in anova_table.drop(index='Residual').iterrows():
    p_value = row['PR(>F)']
    if p_value < alpha:
        print(f"\nReject the null hypothesis: The factor '{factor}' significantly affects the fare.")
    else:
        print(f"\nFail to reject the null hypothesis: The factor '{factor}' does not significantly affect the fare.")

Two-Way ANOVA Results:
                                     sum_sq     df           F        PR(>F)
C(class_boarded)              6.199766e+05    2.0  200.611239  1.650342e-72
C(embarked)                   2.234883e+04    2.0    7.231607  7.671544e-04
C(class_boarded):C(embarked)  3.959226e+04    4.0    6.405606  4.408632e-05
Residual                      1.359793e+06  880.0         NaN           NaN

Reject the null hypothesis: The factor 'C(class_boarded)' significantly affects the fare.

Reject the null hypothesis: The factor 'C(embarked)' significantly affects the fare.

Reject the null hypothesis: The factor 'C(class_boarded):C(embarked)' significantly affects the fare.


### Interpretation of Factors:
C(class_boarded)  
P-value (1.65e-72) is much smaller than alpha = 0.05.
Conclusion: The fare is significantly affected by the passenger class. 

C(embarked)  
P-value (7.67e-4) is smaller than alpha = 0.05.
Conclusion: The fare is significantly affected by the port of embarkation.  

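C(class_boarded):C(embarked) (Interaction Effect)  
P-value (4.41e-5) is smaller than alpha = 0.05.  
Conclusion: There is a significant interaction effect between the passenger class and embarkation port on the fare.
This means the way fare changes with the class also depends on the embarkation port.  
Because the interaction is significant, the two main effects above should be read with caution: the fare difference between classes is not the same at every port.  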
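To make the idea of an interaction concrete, here is a minimal sketch on a small toy table. The fares below are invented for illustration and are not Titanic values. It computes the mean fare in every class and port cell and then the First-minus-Third gap within each port. If there were no interaction, that gap would be roughly the same at every port.

```python
import pandas as pd

# Toy data, invented purely for illustration (NOT taken from the Titanic dataset)
toy = pd.DataFrame({
    "class_boarded": ["First", "First", "Third", "Third",
                      "First", "First", "Third", "Third"],
    "embarked": ["C", "C", "C", "C", "S", "S", "S", "S"],
    "fare": [100.0, 120.0, 10.0, 12.0, 60.0, 70.0, 9.0, 11.0],
})

# Mean fare for every class x port cell
cell_means = toy.pivot_table(index="class_boarded", columns="embarked",
                             values="fare", aggfunc="mean")

# Gap between First and Third class within each port;
# unequal gaps across ports are what an interaction looks like
class_gap = cell_means.loc["First"] - cell_means.loc["Third"]

print(cell_means)
print(class_gap)
```

The same pivot could be run on `titanic_clean` to see which class and port combinations drive the interaction reported in the table.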

Residual  
Residuals represent the variation in fare not explained by the model.
No F-statistic or p-value is reported for this row, because the residual mean square serves as the denominator of each F-statistic rather than being a tested effect.
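The F column can be reproduced by hand from the sums of squares and degrees of freedom, which shows how the Residual row enters every test. The sketch below copies the printed values from the ANOVA table; the variable names are chosen here for illustration. Each F is the term's mean square (sum_sq / df) divided by the residual mean square.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table above
table = {
    "C(class_boarded)": (6.199766e+05, 2.0),
    "C(embarked)": (2.234883e+04, 2.0),
    "C(class_boarded):C(embarked)": (3.959226e+04, 4.0),
}
ss_resid, df_resid = 1.359793e+06, 880.0

# Residual mean square: the common denominator of every F-statistic
ms_resid = ss_resid / df_resid

# F = (term mean square) / (residual mean square)
f_stats = {term: (ss / df) / ms_resid for term, (ss, df) in table.items()}

for term, f in f_stats.items():
    print(f"{term}: F = {f:.3f}")
```

The results should agree with the table's F column up to the rounding of the printed sums of squares.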

### Summary
We performed a two-way ANOVA (Type II sums of squares) to examine the effect of passenger class and embarkation port on fare. Passenger class (F(2, 880) = 200.61, p < 0.001) and embarkation port (F(2, 880) = 7.23, p < 0.001) both significantly affect fare, and their interaction is also significant (F(4, 880) = 6.41, p < 0.001). The interaction suggests that the influence of passenger class on fare varies depending on the embarkation port.

In [9]:
anova_table

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(class_boarded) | 619976.6 | 2.0 | 200.611239 | 1.650342e-72 |
| C(embarked) | 22348.83 | 2.0 | 7.231607 | 0.0007671544 |
| C(class_boarded):C(embarked) | 39592.26 | 4.0 | 6.405606 | 4.408632e-05 |
| Residual | 1359793.0 | 880.0 | NaN | NaN |
