Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

1.Independence: If the observations are not independent, the results of the ANOVA may be biased. For example, if you are comparing the exam scores of students from different schools, but the students within each school are not independent (e.g. they all took the same prep course), this violates the independence assumption.

2.Normality: If the data within each group (equivalently, the model residuals) are not approximately normally distributed, the p-values can be inaccurate, particularly with small samples. For example, if you are comparing the blood pressure of two different groups, but the data within each group are heavily skewed, this violates the normality assumption.

3.Homogeneity of variance: If the variances of the groups being compared are not equal, the ANOVA results may be inaccurate. For example, if you are comparing the weight of apples from two different orchards, but one orchard uses a different fertilizer that causes the weights to vary more widely, this violates the homogeneity of variance assumption.
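
As a rough illustration, normality and homogeneity of variance can be screened with standard tests. The sketch below uses hypothetical simulated groups (the names, sizes, and spreads are assumptions, not data from this assignment), with the third group deliberately given a larger spread. Independence cannot be checked this way; it has to be judged from how the data were collected.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions,
# with group C deliberately given a much larger spread.
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=1, size=40),
    "B": rng.normal(loc=10, scale=1, size=40),
    "C": rng.normal(loc=10, scale=5, size=40),
}

# Normality: Shapiro-Wilk test within each group
# (a small p-value suggests a departure from normality).
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance: Levene's test across all groups
# (a small p-value suggests unequal variances).
levene_stat, levene_p = stats.levene(*groups.values())

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
```

Here Levene's test should flag the unequal variances, which is the kind of violation described in point 3.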

Q2. What are the three types of ANOVA, and in what situations would each be used?

1.One-Way ANOVA: This is the most basic type of ANOVA and is used when there is only one independent variable. It is used to compare the means of three or more groups to determine if there is a statistically significant difference between them. 

2.Two-Way ANOVA: This type of ANOVA is used when there are two independent variables. It is used to test the main effect of each independent variable on the dependent variable and to determine whether the two variables interact.

3.Three-Way ANOVA: This type of ANOVA is used when there are three independent variables. It tests the main effect of each variable, the two-way interactions between each pair of variables, and the three-way interaction among all three on the dependent variable.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

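Partitioning of variance in ANOVA refers to the breakdown of the total variation in the data (the total sum of squares) into different sources: a between-group component explained by group membership and a within-group (residual) component, so that the total equals the sum of the two.

It is important to understand the concept of partitioning of variance in ANOVA because the F-statistic is built from these components: it is the ratio of the between-group mean square to the within-group mean square. The partition therefore shows how much of the observed variation is attributable to differences between the groups and how much is unexplained variation within them.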

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [6]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a sample data frame
data = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'C', 'C'],
                     'Values': [1, 2, 3, 4, 5, 6]})

# fit the one-way ANOVA model
model = ols('Values ~ Group', data).fit()

# calculate the total sum of squares (SST)
sst = ((data['Values'] - data['Values'].mean()) ** 2).sum()

# calculate the explained sum of squares (SSE)
sse = ((model.fittedvalues - data['Values'].mean()) ** 2).sum()

# calculate the residual sum of squares (SSR)
ssr = ((model.resid) ** 2).sum()

# print the results
print('SST:', sst)
print('SSE:', sse)
print('SSR:', ssr)


SST: 17.5
SSE: 16.000000000000007
SSR: 1.5
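
As a cross-check, the same quantities can be computed by hand with NumPy, which also makes the partition and the F-statistic explicit. This is a minimal sketch on the same toy data; the variable names are illustrative. Note that the labels follow the question (SSE = explained, SSR = residual), whereas many textbooks use SSE for the error sum of squares and SSR for the regression sum of squares.

```python
import numpy as np
from scipy.stats import f_oneway

# Same toy data as above: three groups of two observations each
groups = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group (explained) and within-group (residual) sums of squares
sst = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
k, n_total = len(groups), len(all_values)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

print("SST:", sst, "between + within:", ss_between + ss_within)
print("F (manual):", f_manual, "F (scipy):", f_oneway(*groups).statistic)
```

The between and within components add up to the total, and the hand-computed F matches the value from `scipy.stats.f_oneway`.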


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd

# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Fit the two-way ANOVA model with interaction
model = ols('response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data).fit()

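# Build the ANOVA table once (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)

# Main effects are the rows for each factor; the interaction effect is the
# factor1:factor2 row (the last row of the table is the residual, not the interaction)
main_effects = anova_table.loc[['C(factor1)', 'C(factor2)']]
interaction_effect = anova_table.loc['C(factor1):C(factor2)']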

print('Main effects:')
print(main_effects)

print('Interaction effect:')
print(interaction_effect)


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

The F-statistic is the ratio of between-group variance to within-group variance. A large F-statistic indicates that the group means vary more than would be expected from the within-group variability alone. The p-value of 0.02 means that, if all group means were truly equal, an F-statistic at least this large would occur only about 2% of the time. Because 0.02 is below the commonly used significance level of 0.05, we reject the null hypothesis and conclude that at least one group mean differs from the others; the ANOVA by itself does not identify which groups differ. To interpret these results, you could say something like: "The results of the one-way ANOVA indicated significant differences between the groups (F(df_between, df_within) = 5.23, p = 0.02)," filling in the degrees of freedom from the number of groups and observations, which the question does not state.
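
To show how the p-value follows from the F-statistic, the sketch below evaluates the upper tail of the F distribution with `scipy.stats.f.sf`. The degrees of freedom are an assumption made purely for illustration (3 groups and 17 observations in total), since the question gives only F and p.

```python
from scipy.stats import f

f_stat = 5.23

# Assumed degrees of freedom (not given in the question):
# 3 groups -> dfn = 2; 17 observations in total -> dfd = 14.
dfn, dfd = 2, 14

# Survival function = P(F >= f_stat) when the null hypothesis is true
p_value = f.sf(f_stat, dfn, dfd)
alpha = 0.05

print("p-value:", round(p_value, 3))
print("reject H0 at alpha = 0.05:", p_value < alpha)
```

With other degrees of freedom the same F-statistic gives a different p-value, which is why both are normally reported together.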

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

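1.Listwise deletion (complete-case analysis): This method involves excluding any cases with missing data from the analysis. It reduces the sample size and statistical power, and it can bias the results unless the data are missing completely at random.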

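2.Mean imputation: This method involves replacing missing values with the mean of the observed values for that variable. It keeps every case but artificially shrinks the variance and ignores the correlation between a subject's repeated measurements, so standard errors tend to be too small.

3.Last observation carried forward (LOCF): This method involves replacing missing values with the last observed value for that variable. It assumes that a subject's value does not change after the last measurement, which can bias the estimated change over time.

4.Multiple imputation: This method involves creating multiple imputed datasets, each with a different imputed value for the missing data, and combining the results of the repeated measures ANOVA across the imputed datasets. It reflects the uncertainty about the missing values, but it relies on a suitable imputation model and on the data being missing at random.

The potential consequences of using different methods to handle missing data in a repeated measures ANOVA are therefore loss of power (deletion), understated variability (mean imputation), and biased estimates of change over time (LOCF). Different methods can lead to different conclusions from the same data, so the chosen method and its assumptions should be reported.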
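
The simpler strategies can be sketched with pandas on a small wide-format table. The data, column names (`t1`, `t2`, `t3`), and subject labels are hypothetical and chosen only to make the differences visible; multiple imputation needs a dedicated imputation model and is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    {"t1": [5.0, 6.0, 7.0, 8.0],
     "t2": [6.0, np.nan, 8.0, 9.0],
     "t3": [7.0, 8.0, np.nan, 10.0]},
    index=["s1", "s2", "s3", "s4"],
)

# Deletion: drop every subject with at least one missing measurement
complete_cases = wide.dropna()

# Mean imputation: fill each time point with that column's observed mean
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's previous measurement forward along the row
locf = wide.ffill(axis=1)

print(complete_cases)
print(mean_imputed)
print(locf)
```

Deletion leaves only two of the four subjects, while the two imputation approaches keep all four but fill the gaps with different values.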

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

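1.Tukey's Honestly Significant Difference (HSD) Test: This test is used when comparing all possible pairs of means. It controls the family-wise error rate across all pairwise comparisons and is a common default choice.

2.Bonferroni Correction: This correction is used to control for Type I errors when making multiple comparisons by dividing the significance level by the number of comparisons. It is simple and works well for a small number of planned comparisons, but it becomes conservative as the number of comparisons grows.

3.Scheffe Test: This test is used when comparing all possible combinations of means, including complex contrasts such as one group against the average of two others. It is more conservative than other tests.

4.Fisher's Least Significant Difference (LSD) Test: This test is used to compare means between pairs of groups. It runs unadjusted pairwise tests after a significant overall F-test, so it has the most power but the weakest protection against Type I errors.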

A post-hoc test might be necessary in a situation where a researcher is comparing the effectiveness of three different treatments for a medical condition. After conducting an ANOVA, the researcher finds that there is a significant difference between the three treatments. To determine which treatments are significantly different from each other, a post-hoc test such as Tukey's HSD could be used.
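
As an illustration of the Bonferroni approach for this scenario, the sketch below runs all pairwise t-tests on hypothetical simulated treatment scores (the treatment names, means, and sample sizes are assumptions) and adjusts the p-values with `multipletests` from statsmodels.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical outcome scores for three treatments (simulated)
rng = np.random.default_rng(42)
treatments = {
    "T1": rng.normal(50, 5, 30),
    "T2": rng.normal(50, 5, 30),
    "T3": rng.normal(60, 5, 30),
}

# Unadjusted p-value for every pair of treatments
pairs = list(combinations(treatments, 2))
raw_p = [stats.ttest_ind(treatments[a], treatments[b]).pvalue for a, b in pairs]

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject = {r}")
```

Each adjusted p-value is compared with 0.05, so the chance of any false positive across the three comparisons stays at or below 5%.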

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [10]:
import numpy as np
from scipy.stats import f_oneway

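# generate some sample data (simulated: 50 observations per diet, no fixed seed,
# so the printed values change from run to run)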
diet_a = np.random.normal(loc=5, scale=2, size=50)
diet_b = np.random.normal(loc=7, scale=2, size=50)
diet_c = np.random.normal(loc=6, scale=2, size=50)

# perform one-way ANOVA
f_stat, p_val = f_oneway(diet_a, diet_b, diet_c)

# print results
print("F-statistic:", f_stat)
print("p-value:", p_val)

F-statistic: 13.14157130813411
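p-value: 5.613515257394152e-06

With F ≈ 13.14 and p ≈ 5.6e-06, well below 0.05, we reject the null hypothesis that the three diets have the same mean weight loss: at least one diet differs from the others. The ANOVA alone does not say which diets differ, so a post-hoc test would be the next step.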


Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [15]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate sample data
np.random.seed(123)
n = 30
programs = np.repeat(['A', 'B', 'C'], 2*n)
experience = np.tile(['novice', 'experienced'], 3*n)
time = np.round(np.random.normal(10, 2, 6*n), 2)
data = pd.DataFrame({'programs': programs, 'experience': experience, 'time': time})

# Fit ANOVA model
model = ols('time ~ programs + experience + programs:experience', data=data).fit()

# Print ANOVA table
print(sm.stats.anova_lm(model, typ=2))

                         sum_sq     df         F    PR(>F)
programs               3.429288    2.0  0.385040  0.681002
experience             8.558681    1.0  1.921935  0.167417
programs:experience    4.438774    2.0  0.498385  0.608376
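Residual             774.849623  174.0       NaN       NaN

None of the effects is statistically significant at the 0.05 level: programs (F = 0.39, p = 0.681), experience (F = 1.92, p = 0.167), and the programs:experience interaction (F = 0.50, p = 0.608). This is expected here, because the simulated times are all drawn from the same normal distribution regardless of program or experience. Note that the simulation generates 180 observations (30 per combination of program and experience level) rather than the 30 employees described in the question.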


Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [16]:
import numpy as np
import pandas as pd
from scipy import stats

# Generate sample data
np.random.seed(123)
control_scores = np.random.normal(70, 10, 100)
experimental_scores = np.random.normal(75, 12, 100)

# Conduct two-sample t-test
t, p = stats.ttest_ind(control_scores, experimental_scores)
print('t = {:.2f}, p = {:.3f}'.format(t, p))

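# Conduct post-hoc test (Tukey's HSD); with only two groups there is a single
# comparison, so this reproduces the conclusion of the t-test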
data = pd.DataFrame({'score': np.concatenate([control_scores, experimental_scores]),
                     'group': np.concatenate([np.repeat('control', 100), np.repeat('experimental', 100)])})
from statsmodels.stats.multicomp import pairwise_tukeyhsd
posthoc = pairwise_tukeyhsd(data['score'], data['group'])
print(posthoc)

t = -2.76, p = 0.006
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
control experimental   4.4945 0.0063 1.2815 7.7074   True
---------------------------------------------------------
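
The agreement between the t-test and the Tukey result is not a coincidence. As a small check, reusing the same seed and simulated scores, a one-way ANOVA on two groups gives an F-statistic equal to the square of the pooled-variance t-statistic, with the same p-value.

```python
import numpy as np
from scipy import stats

# Same simulated scores as in the example above
np.random.seed(123)
control_scores = np.random.normal(70, 10, 100)
experimental_scores = np.random.normal(75, 12, 100)

t_res = stats.ttest_ind(control_scores, experimental_scores)
f_res = stats.f_oneway(control_scores, experimental_scores)

# With two groups the pooled-variance t-test and one-way ANOVA agree: F = t^2
print("t^2 =", t_res.statistic ** 2)
print("F   =", f_res.statistic)
print("p-values:", t_res.pvalue, f_res.pvalue)
```

With only two groups, a significant t-test already identifies which groups differ, so a post-hoc test adds no new information.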


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [17]:
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(123)
store_a_sales = np.random.normal(1000, 100, 30)
store_b_sales = np.random.normal(1200, 150, 30)
store_c_sales = np.random.normal(800, 80, 30)

# Combine data into a single dataframe
data = pd.DataFrame({'sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales]),
                     'store': np.repeat(['A', 'B', 'C'], 30)})

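# Conduct one-way ANOVA
# (note: this treats the three stores as independent groups and ignores that the
# same 30 days were observed for every store, so it is not a repeated measures analysis)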
model = stats.f_oneway(store_a_sales, store_b_sales, store_c_sales)
f, p = model.statistic, model.pvalue
print('F = {:.2f}, p = {:.3f}'.format(f, p))

# Conduct post-hoc test (Tukey's HSD)
posthoc = pairwise_tukeyhsd(data['sales'], data['store'])
print(posthoc)


F = 74.10, p = 0.000
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
group1 group2  meandiff p-adj   lower     upper   reject
--------------------------------------------------------
     A      B  216.7514   0.0  133.0069  300.4959   True
     A      C -210.7784   0.0 -294.5228 -127.0339   True
     B      C -427.5297   0.0 -511.2742 -343.7853   True
--------------------------------------------------------
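
Because the question asks for a repeated measures ANOVA, one possible sketch uses `AnovaRM` from statsmodels, treating each day as the "subject" and store as the within-subject factor. The `day` column is an assumption added here for illustration; the simulated sales reuse the same seed and parameters as above, and the resulting F-statistic will not match the one-way result printed earlier.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Same simulated sales as above, now keeping track of the day (1..30)
np.random.seed(123)
sales = np.concatenate([
    np.random.normal(1000, 100, 30),
    np.random.normal(1200, 150, 30),
    np.random.normal(800, 80, 30),
])
long_data = pd.DataFrame({
    "day": np.tile(np.arange(1, 31), 3),
    "store": np.repeat(["A", "B", "C"], 30),
    "sales": sales,
})

# Each day is the "subject"; store is the within-subject factor
rm_result = AnovaRM(long_data, depvar="sales", subject="day", within=["store"]).fit()
print(rm_result.anova_table)
```

The table reports the F value with 2 and 58 degrees of freedom, since day-to-day variation is removed from the error term.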
