## Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

* Assumptions for ANOVA

1. Normality of the sampling distribution of the mean: the sample means must be approximately normally distributed, which the central limit theorem supports when group sizes are reasonably large.
2. Absence of outliers: ANOVA is sensitive to extreme scores, so outliers should be identified and dealt with (investigated, corrected, or removed when justified) before performing ANOVA.
3. Homogeneity of variance: the population variance is the same at every level of each independent variable (factor): σ1² = σ2² = σ3².
4. Independence: the samples are random and the observations are independent of one another.

* Some examples of violations that could impact the validity of the results:

__Non-normality:__ If the data within each group are far from normally distributed, especially with small samples, the ANOVA results may be unreliable. For example, strongly skewed data or data with extreme outliers violate the assumption of normality.

__Heteroscedasticity:__ If the variances of the data within each group are not equal, the ANOVA results may be unreliable. For example, if the data within one group have a much larger variance than the data in another group, the assumption of homogeneity of variance is violated.

__Lack of independence:__ If the observations are not independent of each other, the ANOVA results may be unreliable. For example, if the same group of individuals is measured at multiple time points, the measurements are correlated, which violates the assumption of independence; a repeated measures design is needed in that case.
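
The normality and equal-variance assumptions can be screened before running ANOVA. The sketch below is one possible check, assuming SciPy is available: it applies the Shapiro-Wilk test to each group and Levene's test across groups. The scores are illustrative, and the names `groups`, `shapiro_p`, and `levene_p` are introduced here only for the example.

```python
import numpy as np
from scipy import stats

# Illustrative scores for three independent groups (example data).
groups = {
    "Group 1": np.array([79, 78, 88, 94, 92, 85, 83, 85, 82, 81]),
    "Group 2": np.array([85, 86, 88, 75, 78, 94, 98, 79, 71, 80]),
    "Group 3": np.array([91, 92, 93, 85, 87, 84, 82, 88, 95, 96]),
}

# Normality within each group: Shapiro-Wilk (H0: the sample comes from a normal distribution).
shapiro_p = {}
for name, values in groups.items():
    _, p_value = stats.shapiro(values)
    shapiro_p[name] = p_value

# Homogeneity of variance across groups: Levene's test (H0: all variances are equal).
levene_stat, levene_p = stats.levene(*groups.values())

for name, p_value in shapiro_p.items():
    print(f"{name}: Shapiro-Wilk p = {p_value:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test signals a possible violation; with small samples these tests have little power, so plots of the data are a useful complement.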



## What are the three types of ANOVA, and in what situations would each be used?

The three main types of ANOVA are:

1. One-Way ANOVA: Used when there is one factor with at least 2 levels and the levels are independent of each other
2. Repeated Measures ANOVA: Used when there is one factor with at least 2 levels and the levels are dependent on each other (the same subjects are measured at every level)
3. Factorial ANOVA: Used when there are 2 or more factors, each factor has 2 or more levels, and the levels may be dependent or independent

## What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

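Partitioning of variance in ANOVA means splitting the total variability in the data into two parts: the variability between the group means (explained by the factor) and the variability within the groups (unexplained, or residual). It is important because the ANOVA hypothesis test is built from the ratio of these two parts:

* Null hypothesis (H0): μ1 = μ2 = μ3 = ... = μk (k = number of levels)

* Alternate hypothesis (Ha): At least one of the population means is not equal to the others

* The test statistic in ANOVA is the F statistic:

* F = (Variance between samples) / (Variance within samples)

A large F indicates that the group means differ more than would be expected from the within-group variability alone.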
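
In symbols, a sketch of the decomposition for $k$ groups, where group $j$ has $n_j$ observations $x_{ij}$, group mean $\bar{x}_j$, grand mean $\bar{x}$, and $N = \sum_j n_j$ (this notation is introduced here for illustration):

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x})^2}_{SS_{\text{total}}}
= \underbrace{\sum_{j=1}^{k} n_j(\bar{x}_j-\bar{x})^2}_{SS_{\text{between}}}
+ \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_j)^2}_{SS_{\text{within}}}
$$

$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$

Each sum of squares is divided by its degrees of freedom to obtain a variance (mean square), and the F statistic is the ratio of the two mean squares.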

## How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Group 1":[79,78,88,94,92,85,83,85,82,81],"Group 2":[85,86,88,75,78,94,98,79,71,80],"Group 3":[91,92,93,85,87,84,82,88,95,96]})
df.shape

(10, 3)

In [2]:
# Calculate the group means and the total mean.
total_mean = df.unstack().mean()
group_mean = df.mean()
print('Group mean:')
print(group_mean)
print('\nTotal mean = ',total_mean)

Group mean:
Group 1    84.7
Group 2    83.4
Group 3    89.3
dtype: float64

Total mean =  85.8


In [3]:
#Calculate the sum of squares total
## tss = summation ( sample value - total mean)**2

tss = 0
for i in df.unstack():
    tss += (i - total_mean)**2
    
print('Total Sum of Square, SST = ',tss.round(2))

Total Sum of Square, SST =  1292.8


In [4]:
# Calculate explained sum of squares 
## sse = summation ( length of group*( group mean - total mean)**2)

n= 10
sse = np.sum(n * (group_mean -  total_mean)**2)
print('Explained Sum of Squares, SSE = ',sse.round(2))

Explained Sum of Squares, SSE =  192.2


In [5]:
# Calculate the sum of squares residual

ssr = tss - sse
print('Residual Sum of Squares, SSR',ssr.round(2))

Residual Sum of Squares, SSR 1100.6
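
The sums of squares lead directly to the F statistic. The sketch below recomputes the between-group and within-group sums of squares for the same three groups, converts them to an F statistic and p-value, and compares the result with `scipy.stats.f_oneway`. The variable names (`ss_between`, `ss_within`, `f_manual`) are chosen here for illustration.

```python
import numpy as np
from scipy import stats

g1 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
g2 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
g3 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
groups = [np.array(g, dtype=float) for g in (g1, g2, g3)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)              # number of groups
n_total = all_values.size    # total number of observations

# Explained (between-group) and residual (within-group) sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares; the p-value is the upper tail of F(k-1, N-k).
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against SciPy's one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy : F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```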


## In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [6]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

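# Weekly fuel consumption of a car, with how often the father and the son drive it each week.
# Note: each freq_father value occurs with exactly one freq_son value, so the two factors are
# confounded, and both are treated as numeric (1 df each); this illustrates the mechanics only.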
data = pd.DataFrame({'freq_father': np.repeat([1,2,3], 10), 
                     'freq_son': np.repeat([2,3,1], 10), 
                     'consumption': [10, 12, 13, 9, 8, 14, 12, 11, 14, 8, 
                                    22, 28, 23, 22, 27, 26, 25, 24, 23, 28, 
                                    15, 21, 16, 17, 20, 21, 20, 19, 18, 17]})

# Fit the ANOVA model using the formula interface
model = ols('consumption ~ freq_father + freq_son + freq_father: freq_son', data=data).fit()

# Calculate the ANOVA table
result = sm.stats.anova_lm(model,  typ = 2)

# Split the table: the first two rows are the main effects; the remaining rows are
# the interaction and the residual. The trailing `:-1` drops the PR(>F) column.
main = result.iloc[:-2, :-1]
interaction = result.iloc[2:, :-1]

# Print the main effects and interaction effects
print(f'Main Effects:\n{main}\n')
print(f'Interactions:\n{interaction}')

Main Effects:
                 sum_sq   df           F
freq_father  819.597644  1.0  161.644532
freq_son     755.906043  1.0  149.083003

Interactions:
                          sum_sq    df          F
freq_father:freq_son  195.381663   1.0  38.534002
Residual              136.900000  27.0        NaN


## Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

The p-value (0.02) is less than the significance level (alpha = 0.05), so:
* We reject the null hypothesis that all group means are equal.
* At least one group mean differs significantly from the others. The F-test alone does not tell us which groups differ; a post-hoc test is needed for that.
* The independent variable has a statistically significant effect on the dependent variable.

## In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

In a repeated measures ANOVA, we can handle missing data by implementing one of the below methods:

__Multiple imputation (MI):__ In this method, missing data are imputed multiple times to create several complete datasets, each dataset is analysed, and the results are pooled. The consequence of using this method is that it can be computationally intensive and relies on assumptions about why the data are missing and on a suitable imputation model.

__Last observation carried forward (LOCF):__ In this method, the last observed value is carried forward to replace the missing value. The consequence of using this method is that it assumes the missing value is the same as the last observed value, which may not be accurate if the outcome variable is changing over time, and which can bias the estimates and understate the variability.

The consequences of using different methods to handle missing data can be significant, and the choice of method should be based on the nature and extent of the missing data, as well as the assumptions underlying the method. In general, it is recommended to use multiple imputation or other methods that account for the uncertainty associated with missing data whenever possible, to avoid bias and improve the accuracy of the estimates.
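
A minimal sketch of LOCF on long-format repeated-measures data, assuming pandas and a hypothetical table with columns `subject`, `time`, and `score`. Values are carried forward within each subject only, so one subject's score never fills another subject's gap.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: three subjects measured at three time points.
long_df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time":    [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "score":   [10.0, np.nan, 14.0, 20.0, 22.0, np.nan, 30.0, 31.0, 33.0],
})

# Order by time within subject so that "last observation" is well defined.
long_df = long_df.sort_values(["subject", "time"])

# LOCF: forward-fill missing scores separately for each subject.
long_df["score_locf"] = long_df.groupby("subject")["score"].ffill()

print(long_df)
```

Subject 1's missing score at time 2 becomes 10.0 even though the next observed value is 14.0, which shows how LOCF can flatten a real change over time.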

## What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after ANOVA has detected a significant overall effect. The F-test only says that at least one group mean differs; post-hoc tests identify which groups differ while controlling the error rate across the multiple comparisons. For example, if a one-way ANOVA comparing three diets is significant, a post-hoc test is needed to find out which pairs of diets (A vs. B, A vs. C, B vs. C) actually differ. Here are some common post-hoc tests used after ANOVA, along with situations where they might be appropriate:

* Tukey's HSD (Honestly Significant Difference) Test: This test is used to compare all pairs of group means when the variances are equal and the sample sizes are equal or nearly equal (the Tukey-Kramer version handles unequal sample sizes). Tukey's HSD test controls the family-wise error rate, meaning it limits the chance of any false positive result (Type I error) across the whole set of comparisons.

* Bonferroni Correction: This correction divides the significance level by the number of comparisons and can be applied to any set of tests. It is best suited to a small number of planned comparisons, because it becomes very conservative as the number of comparisons grows.

* Scheffe's Test: This test is used when complex contrasts (not only pairwise differences) are of interest. It works with unequal sample sizes but assumes equal variances. Scheffe's test is more conservative than other post-hoc tests for pairwise comparisons, meaning it is less likely to produce false positive results (Type I error) but also less likely to detect real differences.

* Games-Howell Test: This test is used when the variances are unequal, with or without unequal sample sizes, because it does not assume homogeneity of variance. It is somewhat less powerful than Tukey's HSD when the variances are in fact equal.
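
As an illustration of the Bonferroni approach, the sketch below runs all pairwise independent-samples t-tests on three example groups and adjusts the p-values with `multipletests` from statsmodels. The data and the names `pairs`, `raw_p`, and `adj_p` are assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Example scores for three independent groups.
groups = {
    "Group 1": np.array([79, 78, 88, 94, 92, 85, 83, 85, 82, 81]),
    "Group 2": np.array([85, 86, 88, 75, 78, 94, 98, 79, 71, 80]),
    "Group 3": np.array([91, 92, 93, 85, 87, 84, 82, 88, 95, 96]),
}

# One unadjusted t-test per pair of groups.
pairs = list(combinations(groups, 2))
raw_p = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]

# Bonferroni: each p-value is multiplied by the number of tests (capped at 1).
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p_raw, p_adj, rej in zip(pairs, raw_p, adj_p, reject):
    print(f"{a} vs {b}: raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f}, reject = {rej}")
```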

## A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [7]:
import numpy as np
import scipy.stats as stat

# Generate random weight loss data for each diet
np.random.seed(6)
diet_a = np.random.normal(10,8,20)
diet_b = np.random.normal(12, 9, 15)
diet_c = np.random.normal(8, 5, 15)
alpha = 0.05 #(default)

# Conduct one-way ANOVA on the weight loss data
f_stat, p_val = stat.f_oneway(diet_a, diet_b, diet_c)

# decision making and printing the results
if p_val < alpha:
    print("Reject the null hypothesis: at least one diet's mean weight loss differs significantly from the others.")
else:
    print("Fail to reject the null hypothesis")


Reject the null hypothesis: at least one diet's mean weight loss differs significantly from the others.
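
The question also asks for the F-statistic and p-value to be reported. A minimal sketch that regenerates the same simulated samples and prints both, together with the degrees of freedom (the names `k` and `n_total` are introduced here for illustration):

```python
import numpy as np
from scipy import stats

# Same simulated weight-loss samples as in the cell above.
np.random.seed(6)
diet_a = np.random.normal(10, 8, 20)
diet_b = np.random.normal(12, 9, 15)
diet_c = np.random.normal(8, 5, 15)

f_stat, p_val = stats.f_oneway(diet_a, diet_b, diet_c)

k = 3                                                 # number of diets
n_total = len(diet_a) + len(diet_b) + len(diet_c)     # 50 participants

# Report F with its between-group and within-group degrees of freedom.
print(f"F({k - 1}, {n_total - k}) = {f_stat:.3f}, p = {p_val:.4f}")
```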


## A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [8]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

prog = np.repeat(('A','B','C'), 10)
expLev = np.repeat(('novice', 'experienced'),15)

# Generate random data
np.random.seed(1)
np.random.shuffle(prog)
np.random.shuffle(expLev)
time = np.random.randint(10, 25, 30)

df = pd.DataFrame({'Program': prog, 'ExperienceLevel': expLev, 'Time': time})

# Performing two-way ANOVA
model = ols('Time ~ C(Program)+ C(ExperienceLevel) + C(Program):C(ExperienceLevel)', data = df).fit()

sm.stats.anova_lm(model, typ =2)


                                   sum_sq    df         F    PR(>F)
C(Program)                       0.186339   2.0  0.005345   0.99467
C(ExperienceLevel)              18.419672   1.0  1.056791  0.314198
C(Program):C(ExperienceLevel)   22.164852   2.0  0.635832  0.538182
Residual                       418.315476  24.0       NaN       NaN


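The p-values for both main effects and for the interaction are all greater than 0.05, so none of these effects is statistically significant at the 5% level: F = 0.005, p = 0.995 for program; F = 1.057, p = 0.314 for experience level; and F = 0.636, p = 0.538 for the interaction. We therefore find no evidence of differences in the average time to complete the task across the software programs or between novice and experienced employees, and no evidence that the effect of the program depends on experience level. This is expected here, because the times were generated at random, independently of both factors.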

## An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [9]:
import numpy as np
import pandas as pd
import scipy.stats as stat
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# H0: mean(Control) == mean(Experimental)
# H1: mean(Control) != mean(Experimental)

# Generate random data
np.random.seed(2)
score = np.random.randint(50,100, 100)
group = np.repeat(('Control', 'Experimental'), 50)
np.random.shuffle(group)

# make dataframe
df = pd.DataFrame({'Score': score, 'Group': group})

# separate the scores by group
grp1 = df[df['Group']== 'Control']['Score']
grp2 = df[df['Group']== 'Experimental']['Score']

# conduct two sample t-test (use t_stat so the `stat` module is not overwritten)
t_stat, p_val = stat.ttest_ind(grp1, grp2)

alpha = 0.05
if p_val < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')

reject the null hypothesis


#### Since the difference is significant, perform a post-hoc test

In [10]:
tukey=pairwise_tukeyhsd( df['Score'],df['Group'])
print(tukey)

    Multiple Comparison of Means - Tukey HSD, FWER=0.05     
 group1    group2    meandiff p-adj   lower    upper  reject
------------------------------------------------------------
Control Experimental    -7.02 0.0132 -12.5401 -1.4999   True
------------------------------------------------------------


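This means that the mean test score for the Experimental group is 7.02 points lower than that of the Control group (p-adj = 0.0132), so these simulated data give no evidence that the new method improves scores. With only two groups there is just one pair to compare, so the post-hoc test adds no information beyond the t-test itself.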
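
With two groups, the pooled-variance two-sample t-test and a one-way ANOVA are the same test: the F statistic equals the square of the t statistic and the p-values coincide. The sketch below checks this on hypothetical scores; the group means and standard deviations are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical test scores for two independent groups of 50 students.
control = rng.normal(70, 10, 50)
experimental = rng.normal(75, 10, 50)

# Pooled-variance t-test (the default of ttest_ind) and one-way ANOVA on the same data.
t_stat, p_t = stats.ttest_ind(control, experimental)
f_stat, p_f = stats.f_oneway(control, experimental)

print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
print(f"t-test p = {p_t:.6f}, ANOVA p = {p_f:.6f}")
```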

## A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [201]:
import pingouin as pg
import numpy as np
import pandas as pd
import statsmodels.stats.multicomp as mp

# generate random data
np.random.seed(4)
store = np.repeat(('A', 'B', 'C'), 30)
sales = np.concatenate([np.random.randint(300, 450, 30), np.random.randint(200, 210, 30) , np.random.randint(250, 350, 30)])
df = pd.DataFrame({'store': store, 'sales': sales, 'day': list(range(1,31))*3})

# conducting repeated measure anova
pg.rm_anova(dv = 'sales', within = 'day', subject='store', data= df)


  Source  ddof1  ddof2         F     p-unc      ng2       eps
0    day     29     58  1.961647  0.014657  0.06514  0.059421


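The p-value of 0.014657 is less than the alpha level of 0.05. Note, however, that the call above uses `within = 'day'` and `subject = 'store'`, so this F-test (29 and 58 degrees of freedom) compares the 30 days with the stores acting as subjects; it indicates that at least two days differ in mean sales. To compare the three stores, the store should be the within factor and the day the subject.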
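
Because the question asks about differences between stores, with each day measured at all three stores, a sketch of the analysis with `store` as the within factor and `day` as the subject is given below. It uses `AnovaRM` from statsmodels rather than pingouin (a substitution made here so the example depends only on libraries already imported elsewhere in this notebook) and regenerates the same simulated data.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Same simulated sales data as in the cell above.
np.random.seed(4)
store = np.repeat(("A", "B", "C"), 30)
sales = np.concatenate([
    np.random.randint(300, 450, 30),
    np.random.randint(200, 210, 30),
    np.random.randint(250, 350, 30),
])
df = pd.DataFrame({"store": store, "sales": sales, "day": list(range(1, 31)) * 3})

# Each day is a "subject" observed under all three store conditions.
res = AnovaRM(data=df, depvar="sales", subject="day", within=["store"]).fit()
table = res.anova_table
print(table)
```

With this orientation the test has 2 and 58 degrees of freedom (3 stores, 30 days), and the row labelled `store` carries the F statistic and p-value for the store effect.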
 
   
One common post-hoc test is the Tukey HSD test, which can be performed using the pairwise_tukeyhsd function from the statsmodels library.

In [202]:
# perform Tukey HSD post-hoc test
tus = mp.pairwise_tukeyhsd(df['sales'], df['store'] )
print(tus)

  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
group1 group2  meandiff p-adj   lower     upper   reject
--------------------------------------------------------
     A      B -177.9333   0.0 -195.6434 -160.2233   True
     A      C  -84.9333   0.0 -102.6434  -67.2233   True
     B      C      93.0   0.0     75.29    110.71   True
--------------------------------------------------------


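The adjusted p-values for all three pairwise comparisons are less than the alpha level of 0.05, so every pair of stores differs significantly in mean sales. Because `meandiff` is the mean of group2 minus the mean of group1, store A has the highest mean sales (about 177.9 above store B and 84.9 above store C), store C is in the middle (93.0 above store B), and store B has the lowest. Note that Tukey's HSD treats the three samples as independent and ignores the fact that the stores were measured on the same days.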
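
A post-hoc approach that respects the repeated-measures structure is to run paired t-tests on the same days and apply a Bonferroni correction. The sketch below does this on the same simulated data; the names `wide`, `pairs`, and `adjusted_p` are introduced here for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Same simulated sales data as above.
np.random.seed(4)
store = np.repeat(("A", "B", "C"), 30)
sales = np.concatenate([
    np.random.randint(300, 450, 30),
    np.random.randint(200, 210, 30),
    np.random.randint(250, 350, 30),
])
df = pd.DataFrame({"store": store, "sales": sales, "day": list(range(1, 31)) * 3})

# One row per day, one column per store, so values in a row are paired.
wide = df.pivot(index="day", columns="store", values="sales")

pairs = list(combinations(wide.columns, 2))
n_tests = len(pairs)

adjusted_p = {}
for a, b in pairs:
    # Paired t-test on the day-by-day differences between two stores.
    t_stat, p_raw = stats.ttest_rel(wide[a], wide[b])
    # Bonferroni: multiply by the number of comparisons, capped at 1.
    adjusted_p[(a, b)] = min(p_raw * n_tests, 1.0)
    print(f"{a} vs {b}: t = {t_stat:.3f}, Bonferroni-adjusted p = {adjusted_p[(a, b)]:.4g}")
```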