Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Analysis of Variance (ANOVA) is a statistical method used to compare means across multiple groups. To ensure the validity of ANOVA results, certain assumptions must be met. These assumptions are:

1. **Normality**: The data within each group (more precisely, the residuals) should be approximately normally distributed. ANOVA is fairly robust to mild departures when group sizes are large and similar, so violations matter most when sample sizes are small. For example, if the data are highly skewed or have heavy tails, the F-test's p-value can be misleading.

2. **Homogeneity of Variances (Homoscedasticity)**: The variances within each group should be approximately equal, because the F-test pools them into a single estimate of within-group variability. Violations of this assumption, known as heteroscedasticity, can distort the Type I error rate (typically inflating false positives), especially when group sizes are unequal. If one group has a much larger variance than the others, the ANOVA results become less reliable.

3. **Independence of Observations**: Observations within each group must be independent of each other. Independence is a fundamental assumption in many statistical tests, and violations could lead to inaccurate standard errors and, consequently, incorrect p-values.

4. **Random Sampling**: Data should be collected through a random sampling process. This assumption ensures that the sample is representative of the population and enhances the generalizability of the results.

**Examples of Violations:**

- **Non-Normality**: If the data within each group deviate significantly from a normal distribution, it may lead to inaccurate p-values and confidence intervals. For instance, if the data are highly skewed or exhibit heavy tails, ANOVA results might be less reliable.

- **Heteroscedasticity**: Unequal variances between groups can impact the overall F-test in ANOVA. For example, if one group has much larger variability than the others, the pooled within-group variance no longer represents every group, so the F-test may reject the null hypothesis too often (Type I errors) or too rarely.

- **Correlated Observations**: If observations within groups are correlated, it violates the assumption of independence. For instance, if repeated measurements are taken on the same subjects over time, the observations may be correlated.

- **Non-Random Sampling**: If the sampling process is not random, it may introduce biases into the sample, impacting the generalizability of the results.

It's important to check these assumptions before interpreting the results of ANOVA. If assumptions are violated, alternative methods or transformations of the data may be considered, or a non-parametric test may be more appropriate.
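
The checks described above can be sketched with SciPy. This is a minimal illustration under stated assumptions: the three groups are simulated (hypothetical data, not from any study), and Shapiro-Wilk, Levene, and Kruskal-Wallis are just one reasonable choice of tests.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three simulated groups of 30 observations each
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=12, scale=2, size=30),
    "C": rng.normal(loc=11, scale=2, size=30),
}

# Normality: Shapiro-Wilk test within each group (a small p-value suggests non-normality)
normality_p = {}
for name, values in groups.items():
    _, normality_p[name] = stats.shapiro(values)
    print(f"Shapiro-Wilk, group {name}: p = {normality_p[name]:.3f}")

# Homogeneity of variances: Levene's test, median-centred to be robust to skew
_, levene_p = stats.levene(*groups.values(), center="median")
print(f"Levene: p = {levene_p:.3f}")

# Non-parametric alternative if the checks above raise doubts: Kruskal-Wallis
_, kruskal_p = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis: p = {kruskal_p:.3f}")
```

Independence and random sampling cannot be tested this way; they have to be judged from how the data were collected.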

Q2. What are the three types of ANOVA, and in what situations would each be used?

Analysis of Variance (ANOVA) comes in different types, and the choice of which type to use depends on the design of the study and the nature of the variables being analyzed. The three main types of ANOVA are:

1. **One-Way ANOVA:**
   - **Use Case:** When comparing means across two or more independent groups or levels of a single categorical variable.
   - **Example:** Comparing the mean scores of students from different schools or the mean performance of individuals exposed to different treatments.

2. **Two-Way ANOVA:**
   - **Use Case:** When there are two independent categorical variables (factors) and you want to examine their main effects and the interaction between them.
   - **Example:** Investigating the effects of both gender and treatment on exam scores. This involves two factors (gender and treatment) and their interaction.

3. **Repeated Measures ANOVA:**
   - **Use Case:** When measurements are taken on the same subjects at multiple points in time or under different conditions.
   - **Example:** Assessing the impact of a drug on blood pressure levels measured before and after treatment in the same group of individuals.

**When to Use Each Type:**
- **Use One-Way ANOVA:**
  - When comparing means across two or more independent groups or levels of a single categorical variable.
  - Example: Comparing average scores of students from different schools.

- **Use Two-Way ANOVA:**
  - When there are two independent categorical variables, and you want to assess the main effects of each variable and their interaction.
  - Example: Investigating the effects of both diet and exercise on weight loss. Here, diet and exercise are two factors, and the interaction term explores whether the combined effect is different from what would be expected by simply adding the individual effects.

- **Use Repeated Measures ANOVA:**
  - When measurements are taken on the same subjects under different conditions or at multiple time points.
  - Example: Evaluating the impact of a drug on blood pressure levels measured before and after treatment in the same individuals.

Choosing the right type of ANOVA is crucial for obtaining meaningful and accurate results, so it's essential to consider the study design and the nature of the variables being analyzed.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in Analysis of Variance (ANOVA) refers to the division of the total variability observed in the data into different components, each associated with specific sources. Understanding this concept is crucial for gaining insights into the relative contributions of various factors or sources of variability in a study. ANOVA accomplishes this partitioning by distinguishing between different types of variance:

1. **Total Variance (Total Sum of Squares - SST):**
   - This represents the overall variability in the dependent variable (response variable) across all groups or conditions.

2. **Between-Group Variance (Between-Group Sum of Squares - SSB):**
   - This reflects the variability in the dependent variable that is attributable to the differences between the group means. It assesses whether the means of different groups are significantly different.

3. **Within-Group Variance (Within-Group Sum of Squares - SSW):**
   - Also known as the error variance, it represents the variability in the dependent variable that is not explained by the differences between group means. It reflects the random variability within each group.

The partitioning of variance is typically summarized in an ANOVA table, which includes the degrees of freedom, sum of squares, mean squares, and the F-statistic. The F-statistic is calculated by dividing the between-group mean square by the within-group mean square.
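
Written out for a one-way design, the partition looks as follows. The notation is an assumption introduced here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the group mean, and $\bar{y}$ the overall mean.

$$
\text{SST} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2, \qquad
\text{SSB} = \sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2, \qquad
\text{SSW} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2
$$

$$
\text{SST} = \text{SSB} + \text{SSW}, \qquad
F = \frac{\text{SSB}/(k-1)}{\text{SSW}/(N-k)}
$$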

**Importance of Understanding Partitioning of Variance:**

1. **Identifying Significant Sources of Variability:**
   - By partitioning the total variability into between-group and within-group components, ANOVA helps identify whether the observed differences among group means are statistically significant.

2. **Quantifying the Effect Size:**
   - The proportion of total variance explained by between-group differences provides a measure of effect size. A larger proportion suggests a stronger effect.

3. **Assessing the Model Fit:**
   - Understanding how much of the total variance is explained by the model helps assess how well the model fits the data. A good model should account for a significant portion of the observed variability.

4. **Interpreting F-Statistic:**
   - The F-statistic, derived from the partitioned variances, is used to test the null hypothesis that all group means are equal. Understanding the components of the F-statistic aids in interpreting the results of hypothesis tests.

In summary, the partitioning of variance in ANOVA provides a structured way to analyze and interpret the sources of variability in a study, allowing researchers to draw meaningful conclusions about the effects of different factors on the dependent variable.

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
from scipy.stats import f_oneway

# Example data for three groups
group1 = [15, 18, 21, 24, 27]
group2 = [12, 14, 18, 20, 22]
group3 = [10, 13, 16, 19, 22]

# Combine data from all groups
all_data = np.concatenate([group1, group2, group3])

# Calculate overall mean
overall_mean = np.mean(all_data)

# Calculate total sum of squares (SST)
sst = np.sum((all_data - overall_mean)**2)

# Collect the groups once so the calculations below stay in sync
groups = [group1, group2, group3]

# Calculate group means
group_means = [np.mean(group) for group in groups]

# Calculate explained (between-group) sum of squares, called SSE in this question
# Note: many textbooks instead use SSE for the error sum of squares
sse = np.sum([len(group) * (mean - overall_mean)**2 for group, mean in zip(groups, group_means)])

# Calculate residual (within-group) sum of squares, called SSR in this question
ssr = np.sum([(value - mean)**2 for group, mean in zip(groups, group_means) for value in group])

# Print the results
print(f"Total Sum of Squares (SST): {sst}")
print(f"Explained Sum of Squares (SSE): {sse}")
print(f"Residual Sum of Squares (SSR): {ssr}")

# Perform one-way ANOVA for comparison
f_statistic, p_value = f_oneway(group1, group2, group3)
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")


Total Sum of Squares (SST): 316.93333333333334
Explained Sum of Squares (SSE): 68.13333333333334
Residual Sum of Squares (SSR): 248.8
F-statistic: 1.643086816720257
P-value: 0.234042509471001
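
As a cross-check, the printed sums of squares can be turned back into the F-statistic and p-value by hand. The sketch below copies the three values from the output above; here SSE plays the role of the between-group sum of squares (SSB in Q3) and SSR the within-group sum of squares (SSW). The eta squared line is an added illustration of the effect-size idea from Q3, not part of the original cell.

```python
from scipy import stats

# Values copied from the printed output above
sst, sse, ssr = 316.93333333333334, 68.13333333333334, 248.8
k, n_total = 3, 15  # three groups, fifteen observations in total

# The partition SST = SSE + SSR should hold up to floating-point error
partition_gap = abs(sst - (sse + ssr))

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = sse / (k - 1)
ms_within = ssr / (n_total - k)

# F ratio and its upper-tail probability under the F(k-1, N-k) distribution
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Eta squared: share of total variability explained by group membership
eta_squared = sse / sst

print(f"F = {f_manual:.4f}, p = {p_manual:.4f}, eta squared = {eta_squared:.3f}")
```

The F and p values agree with the `f_oneway` output above (about 1.643 and 0.234), and roughly 21% of the total variability is attributable to group differences.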


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [8]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the inbuilt dataset from statsmodels
data = sm.datasets.get_rdataset("ToothGrowth", "datasets").data

# printing top 5 rows of Tooth Growth dataset
print('Top 5 rows of Tooth Growth Dataset')
print(data.head())
print('\n==============================================================\n')

# Define the model formula
model_formula = "len ~ C(supp) + C(dose) + C(supp):C(dose)"

# Fit the model using OLS regression
model = ols(model_formula, data).fit()

# Build the Type II ANOVA table once and reuse it below
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract the sums of squares attributed to the main effects and to the interaction
main_effects = anova_table['sum_sq'][:2]
interaction_effect = anova_table['sum_sq'][2:3]

# Print the results
print("Main effects:")
print(main_effects)
print("\n==============================\n")
print("Interaction effect:")
print(interaction_effect)
print("\n==============================\n")
print("ANOVA Table:")
print(anova_table)


Top 5 rows of Tooth Growth Dataset
    len supp  dose
0   4.2   VC   0.5
1  11.5   VC   0.5
2   7.3   VC   0.5
3   5.8   VC   0.5
4   6.4   VC   0.5


Main effects:
C(supp)     205.350000
C(dose)    2426.434333
Name: sum_sq, dtype: float64


Interaction effect:
C(supp):C(dose)    108.319
Name: sum_sq, dtype: float64


ANOVA Table:
                      sum_sq    df          F        PR(>F)
C(supp)           205.350000   1.0  15.571979  2.311828e-04
C(dose)          2426.434333   2.0  91.999965  4.046291e-18
C(supp):C(dose)   108.319000   2.0   4.106991  2.186027e-02
Residual          712.106000  54.0        NaN           NaN


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In a one-way ANOVA, the F-statistic is used to test whether there are significant differences between the means of three or more independent (unrelated) groups. The associated p-value helps determine the statistical significance of the observed differences. Here's how you can interpret the results:

1. **F-Statistic:**
   - In your case, the F-statistic is 5.23. This value represents the ratio of the variability between group means to the variability within groups. A higher F-statistic suggests that there may be significant differences among the group means.

2. **P-value:**
   - The p-value associated with the F-statistic is 0.02. This p-value is the probability of observing an F-statistic at least as large as the one obtained, assuming that all group means are equal in the population.

3. **Interpretation:**
   - Since the p-value (0.02) is less than the commonly used significance level of 0.05, you would reject the null hypothesis. The null hypothesis in this context is that all group means are equal. Therefore, you have evidence that at least one group mean differs from the others.

4. **Conclusion:**
   - Based on the results, you can conclude that there are statistically significant differences between at least two of the groups. However, the ANOVA itself does not identify which specific groups are different from each other; it only indicates that not all group means are equal.

5. **Post-hoc Tests (if applicable):**
   - If your ANOVA indicates significant differences, it is often followed by post-hoc tests (e.g., Tukey's HSD, Bonferroni correction) to identify which specific groups differ from each other.

In summary, with an F-statistic of 5.23 and a p-value of 0.02, you would reject the null hypothesis and conclude that there are significant differences between at least two of the groups. Further analyses or post-hoc tests may be needed to determine the nature of these differences.
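
The decision rule can be sketched with SciPy's F distribution. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations in total) are hypothetical, and the resulting p-value is not expected to reproduce the quoted 0.02 exactly.

```python
from scipy import stats

f_observed = 5.23
alpha = 0.05

# Hypothetical degrees of freedom (an assumption): 3 groups, 30 observations in total
df_between, df_within = 2, 27

# Upper-tail probability of the observed F under the null hypothesis
p_value = stats.f.sf(f_observed, df_between, df_within)

# Critical value: the F that would be exceeded with probability alpha under the null
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value = {p_value:.4f}, critical F = {f_critical:.3f}")
print("Reject H0" if f_observed > f_critical else "Fail to reject H0")
```

Comparing the p-value with alpha and comparing the observed F with the critical F are equivalent ways of reaching the same decision.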

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is crucial to ensure the validity and reliability of the results. The choice of method for handling missing data can impact the accuracy of the analysis. Here are some common approaches and their potential consequences:

1. **Complete Case Analysis (Listwise Deletion):**
   - **Method:** Exclude cases with missing data on any variable involved in the analysis.
   - **Consequences:**
     - Reduces sample size, potentially leading to loss of statistical power.
     - Results may be biased if the missing data are not missing completely at random (MCAR). If the missingness is related to the outcome, it can introduce bias.

2. **Pairwise Deletion (Available Case Analysis):**
   - **Method:** Analyze all cases that have data for the specific comparison being made.
   - **Consequences:**
     - Preserves more data than complete case analysis.
     - May introduce bias if the missing data are related to the outcome or if the pattern of missingness is not random.

3. **Imputation Methods (e.g., Mean Imputation, Last Observation Carried Forward):**
   - **Method:** Replace missing values with estimated values based on observed data.
   - **Consequences:**
     - Preserves sample size but can introduce bias if the imputation method is not appropriate.
     - Mean imputation assumes missing values are missing completely at random, and it may underestimate variability.
     - Last Observation Carried Forward assumes that missing values are stable over time, which may not be valid.

4. **Multiple Imputation:**
   - **Method:** Generate multiple datasets with imputed values and perform analyses on each dataset, then combine results.
   - **Consequences:**
     - Preserves sample size and provides more realistic estimates of uncertainty.
     - Requires assumptions about the missing data mechanism, and inappropriate assumptions can still lead to biased results.

5. **Maximum Likelihood Estimation (MLE):**
   - **Method:** Estimates model parameters while accounting for missing data by maximizing the likelihood function.
   - **Consequences:**
     - Preserves sample size and provides unbiased parameter estimates under the assumption that data are missing at random (MAR).
     - Assumes that the missing data mechanism is ignorable, and results can be biased if this assumption is violated.

**General Considerations:**
- The choice of method depends on the nature of missing data and the assumptions that can be reasonably made about the missing data mechanism.
- It is essential to perform sensitivity analyses to assess the impact of different missing data handling methods on results.

In summary, the consequences of using different methods for handling missing data in repeated measures ANOVA include potential biases, loss of statistical power, and the impact on the validity of the results. Researchers should carefully consider the missing data mechanism and choose an appropriate method based on the assumptions and goals of the analysis.
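
Two of the simpler approaches above, listwise deletion and mean imputation, can be compared in a short pandas sketch. Everything here is hypothetical: 20 simulated subjects measured at three time points (`t1`, `t2`, `t3`), with about 15% of values removed completely at random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    rng.normal(loc=50, scale=10, size=(20, 3)),
    columns=["t1", "t2", "t3"],
)

# Remove roughly 15% of the values completely at random (MCAR by construction)
missing_mask = rng.random(wide.shape) < 0.15
wide_missing = wide.mask(missing_mask)

# 1. Listwise deletion: keep only subjects observed at every time point
complete_cases = wide_missing.dropna()

# 2. Mean imputation: fill each gap with that time point's observed mean
mean_imputed = wide_missing.fillna(wide_missing.mean())

print(f"Subjects before deletion: {len(wide_missing)}, after: {len(complete_cases)}")
print("Variance of observed values per time point:")
print(wide_missing.var())
print("Variance after mean imputation:")
print(mean_imputed.var())
```

Listwise deletion shrinks the sample, while mean imputation keeps every subject but can only pull the per-column variance down, which is the underestimated variability mentioned above.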

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after ANOVA to further investigate pairwise differences between groups when the overall ANOVA test indicates significant differences. These tests help identify which specific groups differ from each other. Some common post-hoc tests include:

1. **Tukey's Honestly Significant Difference (HSD):**
   - **Use Case:** Tukey's HSD is used when you have three or more groups, and you want to identify which specific pairs of groups have significantly different means. It controls the familywise error rate, making it suitable for multiple comparisons.
   - **Example:** In a study comparing the effectiveness of three different teaching methods, Tukey's HSD can be applied to determine which pairs of teaching methods lead to significantly different student performance.

2. **Bonferroni Correction:**
   - **Use Case:** Bonferroni correction is a conservative approach used to control the familywise error rate by adjusting the significance level for each pairwise comparison. It is suitable when you have a predetermined significance level and want to perform multiple comparisons without increasing the overall Type I error rate.
   - **Example:** In a clinical trial comparing the efficacy of four different drug treatments, Bonferroni correction can be applied to assess pairwise differences while controlling for the increased risk of Type I errors.

3. **Sidak Correction:**
   - **Use Case:** Similar to Bonferroni correction, Sidak correction is used to control the familywise error rate. It is slightly less conservative than Bonferroni and is exact when the comparisons are independent, although the practical gain in power is usually small.
   - **Example:** In a market research study comparing the mean ratings of several products across different demographics, Sidak correction can be applied to identify significant differences while controlling for familywise error.

4. **Duncan's Multiple Range Test:**
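   - **Use Case:** Duncan's test is used when you have three or more groups and want to identify homogeneous subsets of means, meaning groups that do not differ significantly from each other. It is more liberal than Tukey's HSD and does not strictly control the familywise error rate, so it trades extra power for a higher risk of false positives.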
   - **Example:** In an agricultural study comparing the yields of different fertilizer treatments, Duncan's test can be used to group fertilizers that lead to similar yields.

5. **Holm's Method:**
   - **Use Case:** Holm's method is a step-down procedure that controls the familywise error rate. It is never less powerful than Bonferroni, because it compares the smallest p-value against the Bonferroni threshold and then relaxes the threshold for each subsequent p-value.
   - **Example:** In a marketing study comparing the sales performance of products in different regions, Holm's method can be applied to identify significant differences while adjusting for multiple comparisons.

**Example Scenario:**
Suppose you conducted a one-way ANOVA to compare the mean scores of students who received different types of training programs (A, B, C, and D). The ANOVA indicates a significant overall difference. A post-hoc test, such as Tukey's HSD, could be used to determine which specific pairs of training programs have significantly different mean scores.

In summary, the choice of post-hoc test depends on factors such as the number of groups, the desired control of Type I errors, and the assumptions about the data. Researchers should carefully consider these factors to select an appropriate post-hoc test for their specific study.
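
The p-value corrections above (Bonferroni, Sidak, Holm) can be compared with `multipletests` from statsmodels. The six raw p-values are made up for illustration; in practice they would come from the pairwise tests.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from six pairwise comparisons (illustrative only)
raw_p = np.array([0.001, 0.012, 0.030, 0.040, 0.200, 0.600])

adjusted = {}
for method in ("bonferroni", "sidak", "holm"):
    # multipletests returns (reject flags, adjusted p-values, Sidak alpha, Bonferroni alpha)
    reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    adjusted[method] = p_adj
    print(f"{method:>10}: adjusted p = {np.round(p_adj, 3)}, reject = {reject}")
```

The Holm and Sidak adjusted p-values are never larger than the Bonferroni ones, which is why Bonferroni is described as the most conservative of the three.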

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [9]:
import numpy as np
from scipy.stats import f_oneway

# Generate example data for the three diets.
# Note: this simulates 50 participants per diet (150 in total), which is an
# assumption that departs from the 50 participants overall in the question.
np.random.seed(42)  # for reproducibility
weight_loss_a = np.random.normal(loc=5, scale=2, size=50)
weight_loss_b = np.random.normal(loc=7, scale=2, size=50)
weight_loss_c = np.random.normal(loc=6, scale=2, size=50)

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(weight_loss_a, weight_loss_b, weight_loss_c)

# Print the results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

# Interpret the results
if p_value < 0.05:
    print("There is a significant difference in mean weight loss between at least two diets.")
else:
    print("There is no significant difference in mean weight loss between the diets.")


F-statistic: 21.809565795751926
P-value: 5.0767681760454e-09
There is a significant difference in mean weight loss between at least two diets.
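
Because the overall test is significant, a post-hoc comparison is the natural follow-up. The sketch below regenerates the same simulated data (same seed and parameters as the cell above) and applies Tukey's HSD via statsmodels; it is an added illustration rather than part of the original answer.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Recreate the simulated data from the cell above (same seed, same order of draws)
np.random.seed(42)
weight_loss_a = np.random.normal(loc=5, scale=2, size=50)
weight_loss_b = np.random.normal(loc=7, scale=2, size=50)
weight_loss_c = np.random.normal(loc=6, scale=2, size=50)

# Stack the values and build a matching vector of diet labels
values = np.concatenate([weight_loss_a, weight_loss_b, weight_loss_c])
labels = np.repeat(["A", "B", "C"], 50)

# Tukey's HSD compares every pair of diets while controlling the familywise error rate
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey)
```

Each row of the printed table reports the mean difference for one pair of diets, its adjusted p-value, and whether the null hypothesis of equal means is rejected for that pair.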


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [10]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate example data
np.random.seed(42)  # for reproducibility

# Create a dataframe with columns for software program, experience level, and completion time.
# Note: this simulates 90 observations rather than the 30 employees in the question, and
# Time is drawn independently of Program and Experience, so no real effect is built in.
data = pd.DataFrame({
    'Program': np.random.choice(['A', 'B', 'C'], size=90),
    'Experience': np.random.choice(['Novice', 'Experienced'], size=90),
    'Time': np.random.normal(loc=10, scale=2, size=90)  # Adjust mean and scale as needed
})

# Fit the two-way ANOVA model
formula = 'Time ~ C(Program) + C(Experience) + C(Program):C(Experience)'
model = ols(formula, data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print("ANOVA Table:")
print(anova_table)

# Interpret the results (label-based .loc lookups avoid chained indexing)
print("\nInterpretation:")
if anova_table.loc['C(Program)', 'PR(>F)'] < 0.05:
    print("There is a significant main effect of software program on completion time.")
else:
    print("There is no significant main effect of software program on completion time.")

if anova_table.loc['C(Experience)', 'PR(>F)'] < 0.05:
    print("There is a significant main effect of experience level on completion time.")
else:
    print("There is no significant main effect of experience level on completion time.")

if anova_table.loc['C(Program):C(Experience)', 'PR(>F)'] < 0.05:
    print("There is a significant interaction effect between software program and experience level.")
else:
    print("There is no significant interaction effect between software program and experience level.")


ANOVA Table:
                              sum_sq    df         F    PR(>F)
C(Program)                  1.334021   2.0  0.193670  0.824297
C(Experience)               5.096305   1.0  1.479736  0.227223
C(Program):C(Experience)    8.396750   2.0  1.219018  0.300694
Residual                  289.301266  84.0       NaN       NaN

Interpretation:
There is no significant main effect of software program on completion time.
There is no significant main effect of experience level on completion time.
There is no significant interaction effect between software program and experience level.


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [12]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import MultiComparison

# Generate example data
np.random.seed(42)  # for reproducibility
control_group = np.random.normal(loc=70, scale=10, size=100)
experimental_group = np.random.normal(loc=75, scale=10, size=100)

# Perform a two-sample t-test (equal_var=False selects Welch's t-test,
# which does not assume equal variances in the two groups)
t_statistic, p_value = ttest_ind(control_group, experimental_group, equal_var=False)

# Print the results of the t-test
print("Two-sample t-test results:")
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Check if the results are significant
if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups.")
    print("\nPost-hoc Test:")
    
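    # Perform a post-hoc test (Tukey's HSD). With only two groups there is a single
    # pairwise comparison, so this step simply confirms the t-test result.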
    all_data = np.concatenate([control_group, experimental_group])
    group_labels = ['Control'] * 100 + ['Experimental'] * 100
    posthoc_results = MultiComparison(all_data, group_labels).tukeyhsd()

    # Print the post-hoc results
    print(posthoc_results)
else:
    print("There is no significant difference in test scores between the control and experimental groups.")


Two-sample t-test results:
T-statistic: -4.754695943505282
P-value: 3.8246846694060195e-06
There is a significant difference in test scores between the control and experimental groups.

Post-hoc Test:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.2615   0.0 3.6645 8.8585   True
--------------------------------------------------------
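
With only two groups, a standardized effect size is arguably more informative than a post-hoc test. The sketch below computes Cohen's d on the same simulated scores (same seed and parameters as the cell above); it is an added illustration, not part of the original answer.

```python
import numpy as np

# Recreate the simulated scores from the cell above (same seed, same order of draws)
np.random.seed(42)
control_group = np.random.normal(loc=70, scale=10, size=100)
experimental_group = np.random.normal(loc=75, scale=10, size=100)

n1, n2 = len(control_group), len(experimental_group)

# Pooled variance: weighted average of the two sample variances
pooled_var = (
    (n1 - 1) * control_group.var(ddof=1) + (n2 - 1) * experimental_group.var(ddof=1)
) / (n1 + n2 - 2)

# Cohen's d: mean difference expressed in pooled standard deviations
cohens_d = (experimental_group.mean() - control_group.mean()) / np.sqrt(pooled_var)
print(f"Cohen's d: {cohens_d:.2f}")
```

By the usual rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects.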


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [17]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# set random seed for reproducibility
np.random.seed(456)

# generate sales data for Store A, B, and C
sales_a = np.random.normal(loc=1000, scale=100, size=(30,))
sales_b = np.random.normal(loc=1050, scale=150, size=(30,))
sales_c = np.random.normal(loc=800, scale=80, size=(30,))

# create a DataFrame to store the sales data
sales_df = pd.DataFrame({'Store A': sales_a, 'Store B': sales_b, 'Store C': sales_c})

# reshape the DataFrame for repeated measures ANOVA
sales_melted = pd.melt(sales_df.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
sales_melted.columns = ['Day', 'Store', 'Sales']

# Printing top 5 rows of generated data
print('Generated data top 5 rows : ')
print(sales_melted.head())

print('\n================================================\n')

# perform repeated measures ANOVA
rm_anova = AnovaRM(sales_melted, 'Sales', 'Day', within=['Store'])
rm_results = rm_anova.fit()
print(rm_results)

# check if null hypothesis should be rejected based on p-value
# (.iloc[0] selects the first row by position; plain [0] on a labelled Series is deprecated)
if rm_results.anova_table['Pr > F'].iloc[0] < 0.05:
    # perform post-hoc Tukey test
    # note: Tukey's HSD treats observations as independent, so it ignores the pairing by day
    print('Reject the Null Hypothesis : At least one of the groups has a different mean.\n')
    print('Tukey HSD posthoc test:')
    tukey_results = pairwise_tukeyhsd(sales_melted['Sales'], sales_melted['Store'])
    print(tukey_results)
else:
    print('NO significant difference between groups.')

Generated data top 5 rows : 
   Day    Store        Sales
0    0  Store A   933.187150
1    1  Store A   950.179048
2    2  Store A  1061.857582
3    3  Store A  1056.869225
4    4  Store A  1135.050948


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 51.5040 2.0000 58.0000 0.0000

Reject the Null Hypothesis : At least one of the groups has a different mean.

Tukey HSD posthoc test:
    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1  group2  meandiff p-adj    lower     upper   reject
-----------------------------------------------------------
Store A Store B   21.2439 0.6945   -40.881   83.3688  False
Store A Store C -207.8078    0.0 -269.9328 -145.6829   True
Store B Store C -229.0517    0.0 -291.1766 -166.9268   True
-----------------------------------------------------------
