Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Analysis of variance (ANOVA) is a statistical technique used to compare the means of three or more groups. The following are the assumptions that must be met in order to use ANOVA:

Independence: The observations must be independent of each other, both within and between groups.

Normality: The data within each group (equivalently, the model residuals) should be approximately normally distributed.

Homogeneity of variance: The variance of the data should be approximately equal across groups.

Random sampling: The data must be randomly sampled from the population.

Violations of these assumptions can impact the validity of the ANOVA results. Here are some examples of possible violations:

Independence: Violations of independence occur when the data points within a group are not independent. For example, if measurements are taken from the same individual over time, the data points may be correlated.

Normality: A violation of normality occurs when the data within a group is not normally distributed. For example, if the data is skewed or has extreme outliers, it may not meet the assumption of normality.

Homogeneity of variance: A violation of homogeneity of variance occurs when the variance of the data in each group is not equal. For example, if the variance of the data in one group is much larger than the variance in another group, it may violate this assumption.

Random sampling: Violations of random sampling occur when the data is not randomly sampled from the population. For example, if certain groups are intentionally oversampled, it may bias the results.

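It is important to check for violations of these assumptions before conducting an ANOVA to ensure that the results are valid. If an assumption is violated, an alternative may be more appropriate: for example, Welch's ANOVA when variances are unequal, the Kruskal-Wallis test when the data are clearly non-normal, or a repeated measures or mixed-effects model when observations are not independent.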
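
As a minimal sketch of how the normality and equal-variance assumptions might be checked before running an ANOVA, the cell below applies `scipy.stats.shapiro` (Shapiro-Wilk) to each group and `scipy.stats.levene` across groups. The three groups are simulated, hypothetical data, and the usual 0.05 cutoff is a convention rather than a rule.

```python
import numpy as np
from scipy import stats

# hypothetical data: three groups drawn from normal distributions with equal variance
rng = np.random.default_rng(0)
groups = {
    'A': rng.normal(loc=10, scale=2, size=30),
    'B': rng.normal(loc=11, scale=2, size=30),
    'C': rng.normal(loc=12, scale=2, size=30),
}

# Shapiro-Wilk test per group: a small p-value suggests non-normality
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Levene's test across groups: a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value, group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

Formal tests like these are sensitive to sample size, so it can help to pair them with a visual check such as a Q-Q plot of the residuals.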

Q2. What are the three types of ANOVA, and in what situations would each be used?

ANOVA stands for analysis of variance, which is a statistical method used to test for differences between groups. There are three main types of ANOVA:

One-way ANOVA: This is used when you have one independent variable (also called a factor), typically with three or more levels, and you want to test for differences between the means of the dependent variable (the outcome variable) across those levels. With exactly two levels, it is equivalent to an independent two-sample t-test. For example, you might use a one-way ANOVA to test whether there are differences in weight gain among people who are on three different diets.

Two-way ANOVA: This is used when you have two independent variables (also called factors), and you want to test for the effects of each factor on the dependent variable, as well as their interaction. For example, you might use a two-way ANOVA to test whether there are differences in weight gain among people who are on three different diets, and whether the effect of the diet varies by gender.

Repeated measures ANOVA: This is used when you have repeated measurements of the same individuals on the dependent variable, and you want to test for differences in the means across the levels of the independent variable. For example, you might use a repeated measures ANOVA to test whether blood pressure differs when each participant receives all three doses of a medication, one after another, so that every person contributes a measurement at every dose.

Each type of ANOVA is used in different situations, depending on the research question and design of the study. One-way ANOVA is used when there is only one factor of interest, while two-way ANOVA is used when there are two factors of interest, and their interaction is of interest. Repeated measures ANOVA is used when measurements of the dependent variable are taken from the same individuals over time or under different conditions.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the process of breaking down the total variation in the dependent variable into different sources of variation. In ANOVA, the total variation in the dependent variable is partitioned into two components: variance between groups (also called the "treatment" variance) and variance within groups (also called the "error" variance).

The variance between groups represents the differences in the means of the dependent variable across the different levels of the independent variable. The variance within groups represents the variability of the data within each group, which is not due to the independent variable.

Understanding the partitioning of variance is important because it allows researchers to determine whether the differences in the means of the dependent variable across the levels of the independent variable are statistically significant or not. If the between-groups variance is large relative to the within-groups variance, it suggests that the independent variable is having a significant effect on the dependent variable. On the other hand, if the between-groups variance is small relative to the within-groups variance, it suggests that the independent variable is not having a significant effect on the dependent variable.

Additionally, understanding the partitioning of variance can help researchers identify potential sources of error or variability in their study design. For example, if the within-groups variance is very large, it may suggest that there is a lot of variability in the data that is not explained by the independent variable, indicating a need for further investigation into other sources of variability.
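
The partition can be verified by hand. The sketch below reuses the six scores from the Q4 code and computes each sum of squares directly with NumPy, so that the identity "total = between + within" and the resulting F ratio are visible; the variable names are illustrative choices, not a standard API.

```python
import numpy as np

# the six scores from the Q4 example, split into their three groups
groups = [np.array([6.0, 8.0]), np.array([9.0, 12.0]), np.array([8.0, 10.0])]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# total variation: squared deviations of every score from the grand mean
ss_total = ((all_scores - grand_mean) ** 2).sum()

# between-groups variation: squared deviations of group means from the
# grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# within-groups variation: squared deviations of scores from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares (sum of squares / degrees of freedom)
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print("SS total:  ", ss_total)
print("SS between:", ss_between)
print("SS within: ", ss_within)
print("F:", f_stat)
```

The between and within sums of squares add up to the total, which is exactly the partition described above.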

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd

# create a DataFrame with the data
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'score': [6, 8, 9, 12, 8, 10]})

# fit the one-way ANOVA model
model = ols('score ~ group', data=df).fit()

# calculate the total sum of squares (SST)
sst = ((df['score'] - df['score'].mean()) ** 2).sum()

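# calculate the explained (between-groups) sum of squares, called SSE here;
# note that some textbooks use "SSE" for the error sum of squares instead
sse = model.ess

# calculate the residual (within-groups) sum of squares (SSR);
# SST should equal SSE + SSR
ssr = model.ssr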

print('SST:', sst)
print('SSE:', sse)
print('SSR:', ssr)


SST: 20.833333333333332
SSE: 12.333333333333332
SSR: 8.5


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# create example data
df = pd.DataFrame({
    'score': [10, 12, 9, 14, 11, 13, 8, 10, 7, 12, 9, 14, 10, 11, 13, 8],
    'group1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'group2': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y']
})

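# fit two-way ANOVA model; C() marks a column as categorical (no manual dummy
# coding needed) and * expands to both main effects plus their interaction
model = ols('score ~ C(group1) * C(group2)', data=df).fit()

# ANOVA table: F-statistic and p-value for each main effect and the interaction
anova_table = anova_lm(model, typ=2)
print(anova_table)

# main effects as differences between marginal means
group1_means = df.groupby('group1')['score'].mean()
group2_means = df.groupby('group2')['score'].mean()
main_effect_group1 = group1_means['A'] - group1_means['B']
main_effect_group2 = group2_means['X'] - group2_means['Y']

# interaction effect as a difference of differences between cell means
cell_means = df.groupby(['group1', 'group2'])['score'].mean().unstack()
interaction_effect = ((cell_means.loc['A', 'X'] - cell_means.loc['A', 'Y'])
                      - (cell_means.loc['B', 'X'] - cell_means.loc['B', 'Y']))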


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

If you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, you can conclude that there is evidence of a significant difference between the means of the groups.

The F-statistic is a ratio of the between-group variability to the within-group variability. A large F-statistic indicates that the between-group variability is large relative to the within-group variability, which is evidence in favor of the alternative hypothesis that at least one group mean is different from the others. In this case, the F-statistic of 5.23 is large enough to suggest that the null hypothesis of equal means across all groups should be rejected.

The p-value of 0.02 indicates that the probability of obtaining an F-statistic as large as or larger than the observed value, assuming the null hypothesis is true, is only 2%. This is below the commonly used significance level of 0.05, which suggests that the differences between the means of the groups are statistically significant.

In interpreting these results, it is important to keep in mind that a significant result in an ANOVA only tells us that there is evidence of a difference between at least two groups, but it does not tell us which specific groups differ from each other. To determine which groups differ, post-hoc tests or contrasts can be conducted.
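
To make the link between the F-statistic and the p-value concrete, here is a small sketch using `scipy.stats.f`. The question does not state the degrees of freedom, so the values below (3 groups, 17 observations in total) are purely hypothetical; they were chosen because they give a p-value close to the 0.02 quoted in the question.

```python
from scipy import stats

f_stat = 5.23

# hypothetical degrees of freedom: 3 groups and 17 observations in total
df_between = 3 - 1
df_within = 17 - 3

# p-value: area under the F distribution to the right of the observed statistic
p_val = stats.f.sf(f_stat, df_between, df_within)

# critical value that F must exceed to be significant at the 0.05 level
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_val:.3f}")
print(f"critical F at alpha = 0.05: {f_crit:.2f}")
```

With different degrees of freedom the same F-statistic would give a different p-value, which is why both are normally reported together.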

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In repeated measures ANOVA, missing data can occur when a participant drops out of the study or fails to complete one or more of the measurements. There are several methods for handling missing data in ANOVA, including listwise deletion, pairwise deletion, mean substitution, and multiple imputation.

Listwise deletion involves removing any participant with missing data from the analysis. This can result in reduced statistical power and potential bias if the missing data are not completely at random.

Pairwise deletion involves only removing cases with missing data for specific variables, allowing the remaining data to be used for other variables. This can increase statistical power but can also lead to biased estimates if the missing data are related to other variables in the analysis.

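Mean substitution involves replacing missing values with the mean value for that variable across all participants. This artificially shrinks the variance of the variable and weakens its correlations with other measurements, so standard errors and test statistics can be distorted even when the data are missing completely at random.

Multiple imputation involves using statistical methods to estimate several plausible values for each missing data point based on patterns in the available data, then pooling the results across the imputed datasets. Provided the data are missing at random and the imputation model is reasonable, this can preserve statistical power and reduce bias compared to the other methods mentioned above.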

The choice of method for handling missing data can affect the results of the analysis, and it is important to carefully consider the potential consequences of each method before making a decision.
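
The difference between two of these methods can be seen on a toy example. The sketch below uses a small, hypothetical wide-format table (one row per participant, one column per time point) and compares listwise deletion with mean substitution using plain pandas calls.

```python
import numpy as np
import pandas as pd

# hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    'time1': [5.0, 6.0, 7.0, 8.0, 9.0],
    'time2': [6.0, np.nan, 8.0, 9.0, 10.0],
    'time3': [7.0, 8.0, np.nan, 10.0, 11.0],
})

# listwise deletion: drop every participant with at least one missing value
listwise = scores.dropna()

# mean substitution: fill each gap with that column's observed mean
mean_filled = scores.fillna(scores.mean())

print("participants kept after listwise deletion:", len(listwise))
print("time2 variance, observed values only:", scores['time2'].var())
print("time2 variance, after mean substitution:", mean_filled['time2'].var())
```

Listwise deletion loses two of the five participants, while mean substitution keeps everyone but reduces the variance of the filled column.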

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

After conducting an ANOVA, post-hoc tests can be used to compare specific pairs of groups and identify which ones differ significantly. Some common post-hoc tests include:

Tukey's HSD (honest significant difference): This test is commonly used when there are three or more groups and all pairwise comparisons between groups need to be tested. It controls the family-wise error rate across the full set of pairwise comparisons.

Bonferroni correction: This is a general adjustment for multiple comparisons that divides the significance level by the number of comparisons, and it also controls the family-wise error rate. It is more conservative than Tukey's HSD when all pairs are compared, so it is best suited to a small number of comparisons chosen in advance.

Scheffe's test: This test is more conservative than Tukey's HSD for pairwise comparisons, because it protects against all possible contrasts, not just pairs. It is useful for unplanned comparisons that involve combinations of groups (for example, one group against the average of two others), and it can be used when sample sizes are unequal.

Fisher's LSD (least significant difference): This test compares all pairs of means in a one-way ANOVA using unadjusted t-tests, and it should only be run after a significant overall F-test. It is the least conservative option and does not control the family-wise error rate in general, so it is usually reserved for cases with only three groups.

In a situation where a post-hoc test might be necessary, suppose a study was conducted to investigate the effect of three different treatments on the height of a plant. The study involved three treatment groups, and the height of the plant was measured at the end of the treatment. The results of the ANOVA indicate that there is a significant difference between the means of the three treatment groups. A post-hoc test such as Tukey's HSD can be used to determine which treatment group(s) significantly differ from the others.
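
The plant example could look roughly like this in code. The heights below are made-up numbers for illustration only; the call itself uses `pairwise_tukeyhsd` from `statsmodels`.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# hypothetical plant heights (cm) under three treatments, five plants each
heights = np.array([20.1, 21.3, 19.8, 20.6, 21.0,
                    22.4, 23.1, 22.8, 23.5, 22.0,
                    25.2, 24.8, 25.9, 26.1, 25.4])
treatments = np.repeat(['T1', 'T2', 'T3'], 5)

# compare every pair of treatments while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=heights, groups=treatments, alpha=0.05)
print(tukey)
```

Each row of the printed table is one pairwise comparison, and the `reject` column indicates whether that pair differs at the chosen significance level.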

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [1]:
import numpy as np
from scipy.stats import f_oneway

# create example data (25 weight-loss values per diet)
diet_a = np.array([1.2, 0.8, 1.5, 2.0, 1.3, 1.1, 0.9, 1.6, 1.2, 1.8,
                   0.7, 1.0, 1.1, 1.2, 1.3, 0.8, 1.5, 1.1, 1.6, 1.0,
                   1.3, 1.2, 0.9, 1.1, 1.4])
diet_b = np.array([1.5, 1.3, 1.1, 1.6, 1.4, 1.8, 1.5, 1.7, 1.2, 1.4,
                   1.6, 1.3, 1.2, 1.5, 1.4, 1.1, 1.6, 1.8, 1.7, 1.3,
                   1.5, 1.2, 1.4, 1.7, 1.3])
diet_c = np.array([2.0, 1.8, 2.1, 1.9, 1.7, 1.6, 2.2, 1.8, 2.0, 1.9,
                   1.5, 1.7, 1.8, 1.9, 2.1, 1.6, 1.8, 1.7, 1.9, 2.0,
                   2.2, 2.1, 1.8, 1.6, 1.9])

# conduct one-way ANOVA
f_stat, p_val = f_oneway(diet_a, diet_b, diet_c)

# report the results
print("F-statistic:", f_stat)
print("p-value:", p_val)


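F-statistic: 43.60337243401761
p-value: 3.920642179257211e-13

The p-value is far below 0.05, so we reject the null hypothesis that the three diets produce the same mean weight loss. The ANOVA alone shows only that at least one diet differs from the others; a post-hoc test such as Tukey's HSD would be needed to identify which diets differ.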


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load data (assumes data.csv has columns 'time', 'program' and 'experience')
data = pd.read_csv("data.csv")

# conduct two-way ANOVA
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# report the results
print(anova_table)


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [3]:
import pandas as pd
import scipy.stats as stats

# create data
control_scores = [70, 75, 68, 72, 74, 73, 69, 71, 77, 75, 80, 72, 76, 73, 75, 78, 70, 74, 71, 73, 76, 72, 75, 79, 73, 75, 71, 77, 74, 72, 75, 71, 73, 72, 74, 76, 75, 73, 70, 74, 77, 75, 72, 73, 71, 76, 75, 72, 73, 70]
experimental_scores = [74, 78, 81, 76, 77, 79, 75, 83, 82, 79, 85, 78, 80, 77, 76, 81, 79, 75, 78, 80, 82, 77, 79, 81, 80, 76, 78, 77, 82, 80, 81, 79, 80, 75, 77, 78, 76, 80, 82, 79, 78, 77, 81, 79, 80, 76, 78, 79]

# conduct two-sample t-test
t_stat, p_val = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)

# report the results
print("t-statistic: ", t_stat)
print("p-value: ", p_val)

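# conduct post-hoc test (Tukey's HSD); with only two groups there is a single
# pair to compare, so this simply repeats the comparison made by the t-test
from statsmodels.stats.multicomp import pairwise_tukeyhsd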

scores = control_scores + experimental_scores
groups = ["Control"] * len(control_scores) + ["Experimental"] * len(experimental_scores)

tukey_results = pairwise_tukeyhsd(scores, groups)
print(tukey_results)


t-statistic:  -10.459658382217075
p-value:  1.573821467619323e-17
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental     5.21   0.0 4.2195 6.2005   True
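--------------------------------------------------------

The p-value is far below 0.05, so the difference in mean test scores between the two groups is statistically significant, with the experimental group scoring about 5.2 points higher on average. Because there are only two groups, the Tukey output repeats the single Control vs. Experimental comparison and adds nothing beyond the t-test; post-hoc tests are only informative with three or more groups.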


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
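import pandas as pd
from itertools import combinations
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# load data (assumes one row per day, with columns store_A, store_B, store_C)
stores = ['store_A', 'store_B', 'store_C']
data = pd.read_csv("data.csv")

# reshape to long format: one row per (day, store) combination
long_data = data[stores].rename_axis('day').reset_index().melt(
    id_vars='day', var_name='store', value_name='sales')

# conduct repeated measures ANOVA, treating each day as the repeated "subject"
rm_results = AnovaRM(long_data, depvar='sales', subject='day', within=['store']).fit()
print(rm_results)

# follow up with paired t-tests, Bonferroni-adjusted for the three comparisons
pairs = list(combinations(stores, 2))
for a, b in pairs:
    t_stat, p_val = stats.ttest_rel(data[a], data[b])
    print(a, "vs", b, "t =", t_stat, "adjusted p =", min(p_val * len(pairs), 1.0))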
