# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

A one-way ANOVA is a statistical test used to determine whether or not there is a significant difference between the means of three or more independent groups. 

Before we can conduct a one-way ANOVA, we must first check to make sure that three assumptions are met.

1. Normality – Each sample was drawn from a normally distributed population.


2. Equal Variances – The variances of the populations that the samples come from are equal.


3. Independence – The observations in each group are independent of each other and the observations within groups were obtained by a random sample.



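Typical violations include strongly skewed data or extreme outliers in a group (normality), one group with a much larger spread than the others (equal variances), and repeated measurements on the same subjects that are treated as separate observations (independence). Such violations can distort the F-statistic and its p-value, so the conclusions may not be valid.

To check the assumptions of ANOVA, various diagnostic plots can be used, such as histograms, box plots, and Q-Q plots. If the assumptions are violated, alternative methods may be needed, such as non-parametric tests or transformations of the data.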
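As a complement to the plots, the normality and equal-variance assumptions can also be checked with formal tests. The sketch below is only an illustration on simulated data: the group names, sample sizes, and means are assumptions, and the Shapiro-Wilk, Levene, and Kruskal-Wallis tests are one common choice among several.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 simulated observations each
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=11, scale=2, size=30),
    "C": rng.normal(loc=12, scale=2, size=30),
}

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Equal variances: Levene's test across groups (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

# Non-parametric alternative to one-way ANOVA if the assumptions look doubtful
kruskal_stat, kruskal_p = stats.kruskal(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value, group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
print(f"Kruskal-Wallis p-value: {kruskal_p:.3f}")
```

A small p-value in the Shapiro-Wilk or Levene test suggests that the corresponding assumption may not hold; independence cannot be tested this way and has to follow from the study design.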

# Q2. What are the three types of ANOVA, and in what situations would each be used?

An ANOVA test is a way to find out whether the differences between group means in survey or experiment results are statistically significant.

The three types of ANOVA are:

1. One-way ANOVA:

   - A one-way ANOVA compares the means of two or more independent (unrelated) groups, defined by a single factor, using the F-distribution. The null hypothesis is that all group means are equal, so a significant result means that at least one group mean differs from the others.

   - For example, you might be studying the effects of tea on weight loss and form three groups: green tea, black tea, and no tea.

2. Two-way ANOVA:

   - A two-way ANOVA is used to determine the effect of two nominal predictor variables (factors) on a continuous outcome variable.

   - It tests the main effect of each factor on the outcome as well as the interaction between them, that is, whether the effect of one factor depends on the level of the other. Variation explained by the factors is treated as systematic, while the remaining variation is treated as random error.

3. Three-way ANOVA:

   - This is used when we have three categorical independent variables and one continuous dependent variable. For example, to examine the effect of gender, age group, and income bracket on a particular outcome, we would use a three-way ANOVA.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?


An ANOVA uses an F-test to evaluate whether the variance among the groups is greater than the variance within a group. Another way to view this problem is that we could partition variance, that is, we could divide the total variance in our data into the various sources of that variation.


The total variance in ANOVA can be divided into two parts:

1. The variance between groups, which measures the differences between the group means.

2. The variance within groups, which measures the variation of the observations within each group.

- The partitioning of variance is important because it allows researchers to determine the proportion of the total variation that can be attributed to differences between groups versus differences within groups.


- This information is used to calculate the F-statistic, which is used to test the hypothesis that the means of the groups are equal.


- If the variance between groups is large relative to the variance within groups, then the F-statistic will be large, indicating that at least one group mean is likely to differ from the others.

- If the variance within groups is large relative to the variance between groups, then the F-statistic will be small, and the data give no evidence of a difference between the group means.
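The partition described above can be written out explicitly. The notation here is an assumption introduced for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ for observation $j$ in group $i$, $\bar{y}_i$ for the mean of group $i$, and $\bar{y}$ for the grand mean.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}}
$$

$$
F=\frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$

Each sum of squares is divided by its degrees of freedom to give a mean square, and the F-statistic is the ratio of the between-group mean square to the within-group mean square.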


# Q4. How would we calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

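The three quantities are SST (the sum of squares total), SSR (the sum of squares regression), and SSE (the sum of squares error). Naming varies between textbooks: in this answer SSR is the explained sum of squares and SSE is the residual sum of squares, which is the reverse of the abbreviations used in the question.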

SST:

- The sum of squares total, denoted SST, is the sum of the squared differences between the observed dependent variable and its mean. You can think of this as the dispersion of the observed values around the mean, much like the variance in descriptive statistics.

- It is a measure of the total variability of the dataset.

SSR:

- The sum of squares due to regression, or SSR, is the sum of the squared differences between the predicted values and the mean of the dependent variable. Think of it as a measure of how much of the variability our line explains.

- If SSR is equal to SST, our regression model captures all the observed variability and is perfect.

SSE:

- The sum of squares error, or SSE, is the sum of the squared differences between the observed values and the predicted values.

- We usually want to minimize the error. The smaller the error, the better the estimation power of the regression.
    

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

#create pandas DataFrame
df = pd.DataFrame({'hours': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                             3, 4, 4, 4, 5, 5, 6, 7, 7, 8],
                   'score': [68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                             88, 85, 89, 94, 93, 94, 96, 89, 92, 97]})

#view first five rows of DataFrame
df.head()



   hours  score
0      1     68
1      1     76
2      1     74
3      2     80
4      2     76


In [2]:
#define response variable
y = df['score']

#define predictor variable
x = df[['hours']]

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()

#calculate sse (residual sum of squares)
sse = np.sum((model.fittedvalues - df.score)**2)
print(sse)

#calculate ssr (explained sum of squares)
ssr = np.sum((model.fittedvalues - df.score.mean())**2)
print(ssr)

#calculate sst (for OLS with an intercept, sst = ssr + sse)
sst = ssr + sse
print(sst)

331.0748847926267
917.4751152073726
1248.5499999999993
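The example above uses a regression model. For a one-way ANOVA, the same three quantities can be computed directly from the group means. The following is a minimal sketch on made-up scores for three hypothetical groups; the variable names `ss_total`, `ss_between`, and `ss_within` are chosen for illustration and correspond to the total, explained, and residual sums of squares.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical scores for three groups (made-up values for illustration)
groups = [
    np.array([68, 76, 74, 80, 76]),
    np.array([78, 81, 84, 86, 83]),
    np.array([88, 85, 89, 94, 93]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total sum of squares: every observation against the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)

# Between-group (explained) sum of squares: group means against the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group (residual) sum of squares: observations against their group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F-statistic from the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

print("SS total:  ", ss_total)
print("SS between:", ss_between)
print("SS within: ", ss_within)
print("F (manual):", f_manual)
print("F (scipy): ", f_oneway(*groups).statistic)
```

The manual F-statistic should agree with the value from `scipy.stats.f_oneway`, and the between and within sums of squares should add up to the total.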


# Q5. In a two-way ANOVA, how would we calculate the main effects and interaction effects using Python?

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load data
data = pd.read_csv('record.csv')

# fit two-way ANOVA model (wrap a factor in C(...) if it is coded as numbers)
model = ols('response_variable ~ group1 + group2 + group1:group2', data=data).fit()

# build the Type II ANOVA table once and reuse it
anova_table = sm.stats.anova_lm(model, typ=2)

# sums of squares attributed to each main effect
main_effect_1 = anova_table.loc['group1', 'sum_sq']
main_effect_2 = anova_table.loc['group2', 'sum_sq']

# sum of squares attributed to the interaction
interaction_effect = anova_table.loc['group1:group2', 'sum_sq']

# print the results
print('Main effect 1:', main_effect_1)
print('Main effect 2:', main_effect_2)
print('Interaction effect:', interaction_effect)

# Q6. Suppose we conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can we conclude about the differences between the groups, and how would we interpret these results?

- If we conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, we can conclude that there is a statistically significant difference between the groups. Specifically, we can conclude that at least one of the groups differs significantly from the others in terms of the mean value of the response variable.

- The F-statistic of 5.23 is the ratio of the variation between groups to the variation within groups. The larger the F-statistic, the stronger the evidence of a difference between the group means. The p-value of 0.02 is the probability of obtaining an F-statistic at least this large if the null hypothesis were true, that is, if all group means were equal. Since the p-value is less than the typical significance level of 0.05, we reject the null hypothesis and conclude that at least one group mean differs from the others.

- To interpret the results, we can perform a post-hoc analysis, such as a Tukey HSD test, to identify which groups differ significantly from each other. Additionally, we can calculate effect sizes, such as eta-squared or Cohen's d, to estimate the magnitude of the differences between the groups.

- Obtaining a significant F-statistic and p-value in a one-way ANOVA indicates that there is a significant difference between the groups in terms of the mean value of the response variable.

- Further analyses can be performed to identify which groups differ significantly and to estimate the magnitude of the differences.
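The link between an F-statistic and its p-value can be sketched with the F-distribution in SciPy. The question does not give the number of groups or observations, so the degrees of freedom below (3 groups, 30 observations) are purely an assumption; the resulting p-value therefore will not match the reported 0.02 exactly.

```python
from scipy import stats

f_stat = 5.23

# Assumed design (not given in the question): 3 groups, 30 observations in total
df_between = 3 - 1
df_within = 30 - 3

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_stat, df_between, df_within)
print(f"p-value under the assumed degrees of freedom: {p_value:.4f}")

# Decision at the usual 5% significance level
alpha = 0.05
decision = "Reject H0" if p_value < alpha else "Fail to reject H0"
print(decision)
```

The same F-statistic leads to different p-values under different degrees of freedom, which is why both should be reported together.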


# Q7. In a repeated measures ANOVA, how would we handle missing data, and what are the potential consequences of using different methods to handle missing data?

We can handle missing data in the following ways:

1. Listwise deletion: Any participant with missing data on any of the variables is excluded from the analysis. This approach is simple to implement, but it reduces the sample size (and therefore statistical power) and can introduce bias if the missingness is related to the outcome variable or to the other variables in the analysis.

2. Multiple imputation: One of the most effective ways of dealing with missing data is multiple imputation (MI). Using MI, we create multiple plausible replacements of the missing data, given what we have observed and a statistical model (the imputation model). The results then depend on how well the imputation model is specified.

3. Pairwise deletion: The available data for each participant is used for the analysis, even if some variables are missing for some participants. This approach retains more participants in the analysis, but different comparisons are then based on different subsets of participants, and it can introduce bias if the data is not missing at random.
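The difference between listwise and pairwise deletion can be seen on a small example. This is a minimal sketch with made-up measurements; the column names `t1`, `t2`, and `t3` stand for three hypothetical time points.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
wide = pd.DataFrame({
    "subject": ["s1", "s2", "s3", "s4", "s5"],
    "t1": [5.1, 4.8, 6.0, 5.5, 4.9],
    "t2": [5.4, np.nan, 6.2, 5.9, 5.0],
    "t3": [5.9, 5.2, np.nan, 6.1, 5.3],
})

# Listwise deletion: keep only participants observed at every time point
complete_cases = wide.dropna()

# Pairwise (available-case) view: each time point uses whatever data it has
available_n = wide[["t1", "t2", "t3"]].count()

print("Participants before deletion:", len(wide))
print("Participants after listwise deletion:", len(complete_cases))
print("Observations available per time point:")
print(available_n)
```

Listwise deletion drops every participant with any gap, while the available-case counts show how much data each time point would contribute under pairwise deletion.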


# Q8. What are some common post-hoc tests used after ANOVA, and when would we use each one? Provide an example of a situation where a post-hoc test might be necessary.

An ANOVA is a statistical test that is used to determine whether or not there is a statistically significant difference between the means of three or more independent groups. 

In order to find out exactly which groups are different from each other, we must conduct a post hoc test (also known as a multiple comparison test), which will allow us to explore the difference between multiple group means while also controlling for the family-wise error rate.

- Some common post-hoc tests include:

  - Tukey’s Test – useful when you want to make every possible pairwise comparison.

  - Holm’s Method – a slightly more conservative test compared to Tukey’s Test.

  - Dunnett’s Correction – useful when you want to compare every group mean to a control mean, and you’re not interested in comparing the treatment means with one another.

  - Scheffe's Test – a conservative post-hoc test that can be used with unequal sample sizes and for any contrast between group means, not only pairwise comparisons. It still assumes equal variances; when that assumption is doubtful, a test such as Games-Howell is the usual choice.
                
                
- Conclusion: 

- If an ANOVA produces a p-value that is less than our significance level, we can use post hoc tests to find out which group means differ from one another.
- Post hoc tests allow us to control the family-wise error rate while performing multiple pairwise comparisons.
- The tradeoff of controlling the family-wise error rate is lower statistical power. We can reduce the effects of lower statistical power by making fewer pairwise comparisons.
- You should determine beforehand which groups you’d like to make pairwise comparisons on and which post hoc test you will use to do so.

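- We can take plant growth as an example: if an ANOVA comparing plant growth under three fertilizers is significant, it only tells us that the fertilizers are not all equal, so a post-hoc test is needed to find out which pairs of fertilizers differ.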
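The plant growth example can be sketched in Python with Tukey's test. The data below are simulated, and the fertilizer labels, means, and sample sizes are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated plant growth (cm) under three hypothetical fertilizers
rng = np.random.default_rng(42)
growth = pd.DataFrame({
    "fertilizer": np.repeat(["A", "B", "C"], 20),
    "growth_cm": np.concatenate([
        rng.normal(20.0, 2, 20),   # fertilizer A
        rng.normal(20.5, 2, 20),   # fertilizer B
        rng.normal(25.0, 2, 20),   # fertilizer C
    ]),
})

# Omnibus one-way ANOVA: is there any difference between the group means?
samples = [g["growth_cm"].to_numpy() for _, g in growth.groupby("fertilizer")]
f_stat, p_val = f_oneway(*samples)
print("F-statistic:", f_stat)
print("p-value:", p_val)

# Post-hoc: Tukey's HSD for all pairwise comparisons, family-wise error rate 5%
tukey = pairwise_tukeyhsd(endog=growth["growth_cm"],
                          groups=growth["fertilizer"],
                          alpha=0.05)
print(tukey)
```

The `reject` column of the Tukey table shows, for each pair of fertilizers, whether the difference in means is significant after controlling the family-wise error rate.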

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [3]:
import numpy as np
from scipy.stats import f_oneway

# Generate random weight loss data for three diets
# (simulated values; note this draws 50 observations per diet, 150 in total)
np.random.seed(1)
diet_a = np.random.normal(5, 1, 50)
diet_b = np.random.normal(6, 1, 50)
diet_c = np.random.normal(4, 1, 50)

# Conduct one-way ANOVA
f_stat, p_val = f_oneway(diet_a, diet_b, diet_c)

# Print results
print("F-statistic: ", f_stat)
print("p-value: ", p_val)

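F-statistic:  68.0129472265407
p-value:  1.2263106300978192e-21

Because the p-value is far below 0.05, we reject the null hypothesis that the three diets produce the same mean weight loss: at least one diet differs from the others. This is expected here, since the data were simulated with different means (5, 6, and 4). A post-hoc test would be needed to say which diets differ.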


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate random time data for three programs and two experience levels
np.random.seed(1)
data = {'program': ['A', 'B', 'C'] * 20,
        'experience': ['novice']*30 + ['experienced']*30,
        'time': np.random.normal(10, 2, 60)}
df = pd.DataFrame(data)

# Conduct two-way ANOVA
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()
tab = sm.stats.anova_lm(model, typ=2)

# Print results
print(tab)

                              sum_sq    df         F    PR(>F)
C(program)                  1.181428   2.0  0.171062  0.843224
C(experience)               1.118041   1.0  0.323769  0.571711
C(program):C(experience)   17.222352   2.0  2.493673  0.092075
Residual                  186.473318  54.0       NaN       NaN


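The results suggest that, at the 0.05 significance level, there is no significant main effect of software program (p = 0.843), no significant main effect of experience level (p = 0.572), and no significant interaction between software program and experience level (p = 0.092). This is expected, because the times were simulated from a single normal distribution that does not depend on program or experience.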

# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [5]:
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Create a data frame with test scores and group assignments
data = pd.DataFrame({
    'test_scores': [85, 70, 75, 80, 90, 65, 70, 75, 95, 80, 
                    75, 78, 80, 85, 90, 95, 80, 75, 85, 70, 
                    85, 75, 80, 90, 70, 75, 85, 80, 90, 75,
                    70, 60, 75, 85, 90, 80, 75, 70, 85, 90,
                    85, 55, 95, 90, 75, 70, 80, 85, 90, 75],
    'group': ['experimental']*25 + ['control']*25
})

# Compute the t-test
control = data.loc[data['group'] == 'control', 'test_scores']
experimental = data.loc[data['group'] == 'experimental', 'test_scores']
t_stat, p_val = stats.ttest_ind(control, experimental)
print("t-statistic:", t_stat)
print("p-value:", p_val)

# Compute the post-hoc test (with only two groups this repeats the t-test)
tukey_result = pairwise_tukeyhsd(data['test_scores'], data['group'])
print(tukey_result)

t-statistic: -0.20344395903324447
p-value: 0.839648074666957
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj   lower  upper  reject
----------------------------------------------------------
control experimental     0.52 0.8396 -4.6192 5.6592  False
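----------------------------------------------------------

With a p-value of about 0.84, far above 0.05, we fail to reject the null hypothesis: there is no evidence that test scores differ between the two teaching methods. Because there are only two groups, the Tukey HSD comparison gives the same p-value as the t-test (0.8396), so no further post-hoc test is needed.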


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats.anova import AnovaRM

# create a sample dataset in long format: each of the 30 days is the
# repeated "subject", measured once for each of the three stores
np.random.seed(1)
data = pd.DataFrame({
    'subject': ['day%d' % (i // 3 + 1) for i in range(90)],
    'store': ['A', 'B', 'C'] * 30,
    'sales': np.random.randint(100, 1000, 90)
})

# fit the repeated measures ANOVA (AnovaRM expects long-format data)
model = AnovaRM(data, depvar='sales', subject='subject', within=['store'])
results = model.fit()

# print the ANOVA table
print(results.anova_table)

# post-hoc: paired comparisons between stores with a Bonferroni correction
# (Tukey's HSD assumes independent groups, so it does not fit this design;
# older pingouin versions name this function pairwise_ttests)
posthoc = pg.pairwise_tests(data=data, dv='sales', within='store',
                            subject='subject', padjust='bonf')
print(posthoc)