### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Analysis of Variance (ANOVA) is a statistical technique for comparing means across groups. Its results are valid and reliable only when several assumptions hold. The main assumptions, with examples of violations that can affect validity, are:

### Assumptions of ANOVA:

1. **Normality:**
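   - **Assumption:** The data within each group (equivalently, the model residuals) should be approximately normally distributed.
   - **Example Violation:** If the data in one or more groups deviate substantially from normality, for example through heavy skew or extreme outliers, the F-test's p-value may be inaccurate, particularly with small samples. This can be assessed with a normality test such as Shapiro-Wilk or by inspecting histograms and Q-Q plots.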

2. **Homogeneity of Variances (Homoscedasticity):**
   - **Assumption:** The variances of the groups being compared should be approximately equal.
   - **Example Violation:** Unequal variances can lead to inflated Type I error rates or reduced power. Levene's test or Bartlett's test can be used to assess homogeneity of variances.

3. **Independence:**
   - **Assumption:** Observations should be independent of each other, both within each group and across groups.
   - **Example Violation:** If observations within a group are correlated, it may lead to pseudoreplication and affect the accuracy of the results. Ensure independence through appropriate study design.

4. **Random Sampling:**
   - **Assumption:** Observations should be randomly and independently sampled from the populations being studied.
   - **Example Violation:** If the sampling is biased or non-random, it may introduce selection bias and impact the generalizability of the results.

### Examples of Violations:

1. **Skewed Distributions:**
   - **Violation:** If the distribution within a group is highly skewed, it may affect the normality assumption.
   - **Impact:** ANOVA is robust to mild departures from normality, but extreme skewness might lead to inaccurate results.

2. **Heterogeneous Variances:**
   - **Violation:** Unequal variances among groups violate the assumption of homogeneity of variances.
   - **Impact:** It can lead to incorrect conclusions about group differences, and adjustments or alternative methods may be needed.

3. **Correlated Observations:**
   - **Violation:** If observations within a group are not independent, it violates the independence assumption.
   - **Impact:** It may lead to inflated Type I error rates or distorted confidence intervals.

4. **Non-Random Sampling:**
   - **Violation:** If the sampling process is not random, it violates the random sampling assumption.
   - **Impact:** Results may not be generalizable to the larger population.

### Dealing with Violations:

1. **Transformations:**
   - If normality is violated, applying transformations (e.g., log-transform) might help.

2. **Non-parametric Tests:**
   - When assumptions are seriously violated, non-parametric alternatives (e.g., Kruskal-Wallis test) can be considered.

3. **Bootstrapping:**
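   - Bootstrapping techniques can be used when distributional assumptions are doubtful, because they estimate the sampling distribution from the data instead of assuming normality. They still require independent observations, and they become unreliable when samples are very small.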

It's essential to check these assumptions before interpreting the results of an ANOVA analysis to ensure the validity and reliability of the findings. If assumptions are severely violated, alternative approaches may be considered.
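
As a rough illustration of the checks described above, the sketch below runs a Shapiro-Wilk test per group, Levene's test across groups, and the Kruskal-Wallis test as a non-parametric fallback. The group names, sample sizes, and distribution parameters are made up for the example; only the SciPy functions are real.

```python
import numpy as np
from scipy import stats

# Simulated data (hypothetical): three groups of 30 observations each
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=5.0, scale=2.0, size=30),
    "B": rng.normal(loc=5.5, scale=2.0, size=30),
    "C": rng.normal(loc=6.0, scale=2.0, size=30),
}

# Normality: Shapiro-Wilk test within each group (index 1 is the p-value)
shapiro_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Homogeneity of variances: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())

# Non-parametric alternative if the checks above look doubtful
kruskal_stat, kruskal_p = stats.kruskal(*groups.values())

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
print("Kruskal-Wallis p-value:", kruskal_p)
```

A small p-value from Shapiro-Wilk or Levene's test signals a possible violation. With large samples these tests flag even trivial departures, so it is worth reading them alongside plots.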

### Q2. What are the three types of ANOVA, and in what situations would each be used?

Analysis of Variance (ANOVA) is a statistical technique used to compare means among different groups. There are three main types of ANOVA, each suited for different situations:

### 1. One-Way ANOVA:

- **Situation:**
  - Used when comparing means of three or more independent (unrelated) groups.
  - There is one categorical independent variable (factor) with three or more levels (groups).
  - Assumes that the populations being compared have the same variance.

- **Example:**
  - Comparing the mean scores of students in three different teaching methods (A, B, C) to determine if there is a significant difference in their exam performance.

- **Formula:**
  - $F = \frac{\text{Between-group variability}}{\text{Within-group variability}}$

### 2. Two-Way ANOVA:

- **Situation:**
  - Used when comparing means of groups formed by two independent categorical variables (factors).
  - There are two main effects and an interaction effect between the two factors.
  - Can be used to explore the impact of each factor individually and their interaction.

- **Example:**
  - Investigating the influence of both gender and treatment type on the effectiveness of a drug.

- **Formula:**
  - Each effect (the two main effects and the interaction) has its own F-statistic: $F_{\text{effect}} = \frac{MS_{\text{effect}}}{MS_{\text{error}}}$

### 3. Repeated Measures ANOVA:

- **Situation:**
  - Used when comparing means of the same group across different time points or conditions.
  - There is one group of participants measured under different conditions or at multiple time points.
  - Assumes sphericity: the variances of the differences between all possible pairs of conditions are equal.

- **Example:**
  - Analyzing the effect of a training program by measuring participants' performance before, during, and after the training.

- **Formula:**
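  - Similar to one-way ANOVA, but variability between subjects is removed from the error term, so the F-statistic compares the variability between conditions with the remaining within-subject error.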

### Key Points:

- **Between-Group Variability:**
  - Represents the variation in means between different groups.

- **Within-Group Variability:**
  - Represents the variation within each group.

- **F-Statistic:**
  - The ratio of between-group variability to within-group variability. A large F-value suggests that at least one group mean differs from the others.

- **P-value:**
  - Determines the statistical significance of the F-statistic. If the p-value is less than the chosen significance level, the null hypothesis is rejected.

Choosing the appropriate type of ANOVA depends on the study design and the nature of the independent variables. One-way ANOVA is used for independent groups, two-way ANOVA for two independent variables, and repeated measures ANOVA for repeated measurements on the same group.

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in Analysis of Variance (ANOVA) refers to the process of decomposing the total variability in the data into different components, each associated with specific sources. Understanding this partitioning is crucial for interpreting ANOVA results and gaining insights into the contributions of various factors to the overall variability in the data.

### Components of Variance in ANOVA:

1. **Total Variance ($SS_{\text{Total}}$):**
   - Represents the total variability in the dependent variable across all observations.
   - Computed as the sum of squared differences between each individual data point and the overall mean.

   $$ SS_{\text{Total}} = \sum_{j} \sum_{i} (Y_{ij} - \bar{Y}_{\text{Total}})^2 $$

2. **Between-Group Variance ($SS_{\text{Between}}$):**
   - Represents the variability in the dependent variable that is attributable to differences between the group means.
   - Computed as the sum of squared differences between each group mean and the overall mean, weighted by the number of observations in each group.

   $$ SS_{\text{Between}} = \sum_{j} N_j (\bar{Y}_j - \bar{Y}_{\text{Total}})^2 $$

3. **Within-Group Variance ($SS_{\text{Within}}$):**
   - Represents the variability in the dependent variable that is not explained by differences between the group means.
   - Computed as the sum of squared differences between each individual data point and its respective group mean.

   $$ SS_{\text{Within}} = \sum_{j} \sum_{i} (Y_{ij} - \bar{Y}_j)^2 $$

### Relationship:

The total variance can be decomposed into the sum of the between-group variance and the within-group variance:

$$ SS_{\text{Total}} = SS_{\text{Between}} + SS_{\text{Within}} $$

### Importance of Understanding Partitioning of Variance:

1. **Identifying Sources of Variation:**
   - Helps researchers understand the relative contributions of different factors or groups to the overall variability in the data.

2. **ANOVA F-Statistic:**
   - The ratio of the between-group mean square to the within-group mean square, $F = \frac{MS_{\text{Between}}}{MS_{\text{Within}}}$, is used to assess the statistical significance of group differences. Each mean square is the corresponding sum of squares divided by its degrees of freedom. A large F-value is evidence that at least one group mean differs from the others.

3. **Effect Size:**
   - The proportion of total variability explained by the between-group component, $\eta^2 = SS_{\text{Between}} / SS_{\text{Total}}$ (eta squared), indicates the effect size and helps evaluate the practical significance of the observed differences.

4. **Post-hoc Analysis:**
   - Understanding the partitioning of variance guides post-hoc analyses to explore specific group differences or interactions that contribute to the observed patterns.

5. **Model Evaluation:**
   - Helps in assessing the adequacy of the ANOVA model and whether the included factors adequately explain the observed variability.

In summary, the partitioning of variance in ANOVA provides a comprehensive view of the distribution of variability in the data, facilitating a deeper understanding of group differences and the impact of various factors on the dependent variable.
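
To make the decomposition concrete, here is a minimal sketch that computes the three sums of squares by hand with NumPy and cross-checks the resulting F-statistic against `scipy.stats.f_oneway`. The six data values are a small made-up example, and the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Small illustrative dataset: three groups with two observations each
groups = [np.array([10.0, 12.0]), np.array([15.0, 18.0]), np.array([8.0, 11.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group and within-group sums of squares
ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom, mean squares and the F-ratio
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Effect size: share of total variability explained by group membership
eta_squared = ss_between / ss_total

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print("SS_total:", ss_total)
print("SS_between + SS_within:", ss_between + ss_within)
print("F (manual):", f_manual, "| F (scipy):", f_scipy)
print("eta squared:", eta_squared)
```

The first two printed values agree up to floating-point rounding, which is the identity $SS_{\text{Total}} = SS_{\text{Between}} + SS_{\text{Within}}$ in action.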

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assuming 'data' is your DataFrame with columns 'group' and 'value'
# 'group' is the categorical variable (factor)
# 'value' is the continuous variable (dependent variable)

# Create a sample dataset
data = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'value': [10, 12, 15, 18, 8, 11]
})

# Fit the one-way ANOVA model
model = ols('value ~ group', data=data).fit()

# Get ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract sums of squares from the ANOVA table
SST = anova_table['sum_sq']['group'] + anova_table['sum_sq']['Residual']
SSE = anova_table['sum_sq']['group']
SSR = anova_table['sum_sq']['Residual']

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)


Total Sum of Squares (SST): 65.33333333333334
Explained Sum of Squares (SSE): 54.33333333333335
Residual Sum of Squares (SSR): 11.0


### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [3]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assuming you have a DataFrame named 'data' with columns 'factor1', 'factor2', and 'value'
# 'factor1' and 'factor2' are the two categorical variables (factors)
# 'value' is the continuous variable (dependent variable)

# Create a sample dataset
data = pd.DataFrame({
    'factor1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'value': [10, 12, 15, 18, 8, 11, 14, 16]
})

# Fit the two-way ANOVA model
model = ols('value ~ factor1 * factor2', data=data).fit()

# Get ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

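# Extract the sum of squares attributed to each main effect and to the interaction
# (the 'F' and 'PR(>F)' columns of anova_table hold the test statistics and p-values)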
main_effect_factor1 = anova_table['sum_sq']['factor1']
main_effect_factor2 = anova_table['sum_sq']['factor2']
interaction_effect = anova_table['sum_sq']['factor1:factor2']

print("Main Effect of Factor 1:", main_effect_factor1)
print("Main Effect of Factor 2:", main_effect_factor2)
print("Interaction Effect:", interaction_effect)


Main Effect of Factor 1: 60.50000000000011
Main Effect of Factor 2: 12.500000000000044
Interaction Effect: 1.5777218104420236e-30


### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

In a one-way ANOVA, the F-statistic is used to test whether there are significant differences between the means of three or more groups. The associated p-value helps determine the statistical significance of the observed differences. Here's how to interpret the results:

1. **Null Hypothesis (H0):** All group population means are equal.

2. **Alternative Hypothesis (H1):** The alternative hypothesis suggests that at least one group mean is different from the others.

Given your results:

- **F-statistic of 5.23:** This test statistic follows an F-distribution when the null hypothesis is true. It is the ratio of the variance between groups to the variance within groups, so here the between-group mean square is about five times the within-group mean square.

- **p-value of 0.02:** This is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis were true. A p-value below the chosen significance level (commonly 0.05) leads you to reject the null hypothesis.

**Interpretation:**

Since the p-value (0.02) is less than the typical significance level of 0.05, you would reject the null hypothesis. This suggests that there is sufficient evidence to conclude that at least one group mean is different from the others.

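In practical terms, you can interpret this as follows:

"There are statistically significant differences between the group means. The data provide enough evidence to reject the null hypothesis at the 5% level, indicating that the dependent variable differs across at least some of the groups." Statistical significance alone does not show that the differences are large; an effect size such as eta squared is needed to judge practical importance.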

Keep in mind that to identify which specific groups are different, you may need to perform post-hoc tests (e.g., Tukey's HSD test) or pairwise comparisons. The significant F-statistic indicates overall group differences, but additional tests are required for detailed comparisons between individual groups.
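
The question does not state the degrees of freedom, so the sketch below assumes a hypothetical design with 3 groups and 15 observations in total (degrees of freedom 2 and 12). Under that assumption it shows how the p-value, the 5% critical value, and eta squared follow from the F-statistic using `scipy.stats.f`.

```python
from scipy import stats

f_stat = 5.23

# Hypothetical design (not given in the question): 3 groups, 15 observations
df_between = 3 - 1
df_within = 15 - 3

# p-value: upper-tail probability of the F distribution at the observed value
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical F value at the 5% significance level
f_critical = stats.f.ppf(0.95, df_between, df_within)

# Eta squared recovered from F and the degrees of freedom (one-way design)
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F (alpha = 0.05): {f_critical:.2f}")
print(f"eta squared: {eta_squared:.3f}")
```

With other degrees of freedom the same F-statistic gives a different p-value, which is why both should be reported alongside F.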

### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is an important aspect of data analysis. Different methods can be used, and the choice of method can impact the results. Here are some common approaches to handling missing data in repeated measures ANOVA and their potential consequences:

1. **Complete Case Analysis (CCA):**
   - **Method:** Exclude cases with missing data.
   - **Consequences:** This approach may lead to biased results if missing data are not missing completely at random (MCAR). If certain patterns of missingness are related to the outcome, the analysis may be biased.

2. **Mean Imputation:**
   - **Method:** Replace missing values with the mean of the observed values for that variable.
   - **Consequences:** Mean imputation shrinks the variance of the imputed variable and weakens its relationships with the other measurements, which can make standard errors too small. It also assumes that missing values have the same mean as observed values, which may not be true.

3. **Last Observation Carried Forward (LOCF):**
   - **Method:** Replace missing values with the last observed value.
   - **Consequences:** LOCF assumes that the last observed value is an accurate representation of the missing value. This may not be appropriate if the variable is changing over time.

4. **Interpolation or Linear Imputation:**
   - **Method:** Use linear interpolation between observed data points.
   - **Consequences:** This method assumes a linear relationship between observed values, which may not be valid. It can be sensitive to the assumption of linearity.

5. **Multiple Imputation:**
   - **Method:** Generate multiple sets of imputed values for missing data, creating multiple datasets, analyze each dataset separately, and pool the estimates (for example with Rubin's rules).
   - **Consequences:** Multiple imputation is a more sophisticated approach that accounts for uncertainty in imputed values. However, it typically assumes the data are missing at random (MAR) given the observed variables, and it may be computationally intensive.

6. **Model-Based Imputation:**
   - **Method:** Use a statistical model to impute missing values.
   - **Consequences:** Model-based imputation considers the relationships within the data and can provide more accurate imputations. However, the validity of the model assumptions is crucial.

The choice of method should be guided by the nature of the missing data and the assumptions that can reasonably be made. It's essential to report any method used for handling missing data in research publications and consider the potential impact on the validity of the results. Sensitivity analyses using different imputation methods can also be informative.
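
The following sketch contrasts three of the simpler approaches on a tiny wide-format table (one row per subject, one column per time point). The data values and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
    "t2": [11.0, np.nan, 12.0, 15.0, 15.0],
    "t3": [13.0, 14.0, np.nan, 17.0, 16.0],
})

# Complete case analysis: drop every subject with at least one missing value
complete_cases = scores.dropna()

# Mean imputation: fill each time point with the mean of its observed values
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's last observed value forward across time points
locf = scores.ffill(axis=1)

print("Subjects kept by complete case analysis:", len(complete_cases))
print("Std of t2, observed values only:", scores["t2"].std())
print("Std of t2 after mean imputation:", mean_imputed["t2"].std())
print(locf)
```

Complete case analysis discards two of the five subjects here, and mean imputation visibly reduces the spread of `t2`, which matches the consequences listed above.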

### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

In the context of Analysis of Variance (ANOVA), post-hoc tests are conducted to make pairwise comparisons between group means when the overall ANOVA indicates significant differences among groups. Here are some common post-hoc tests and when to use each one:

1. **Tukey's Honestly Significant Difference (HSD) Test:**
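   - **Use Case:** Tukey's HSD compares every pair of group means while controlling the familywise error rate. It was designed for equal sample sizes; the Tukey-Kramer variant extends it to unequal sizes.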
   - **Example:** Suppose you conducted a one-way ANOVA comparing the test scores of students from three different teaching methods, and the ANOVA indicates a significant difference. You could use Tukey's HSD to identify which pairs of teaching methods have significantly different means.

2. **Bonferroni Correction:**
   - **Use Case:** Bonferroni correction is a conservative method appropriate when conducting multiple pairwise comparisons.
   - **Example:** In a clinical trial with four treatment groups, you want to compare the mean effectiveness of each treatment with every other treatment. Since you are making multiple comparisons, you might use the Bonferroni correction to adjust the significance level for each comparison.

3. **Scheffé's Test:**
   - **Use Case:** Scheffé's test is conservative for simple pairwise comparisons because it protects all possible contrasts among the means, not just pairs. It can be used when sample sizes are unequal.
   - **Example:** In a study comparing the means of different age groups on a cognitive test, the ANOVA indicates a significant difference. Scheffé's test could be used for pairwise comparisons between age groups to identify where the significant differences lie.

4. **Duncan's Multiple Range Test:**
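   - **Use Case:** Duncan's test is less conservative and therefore more powerful, but it does not fully control the familywise error rate, so false positives are more likely. It is usually applied when sample sizes are equal.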
   - **Example:** In an agricultural study comparing the yield of different fertilizers applied to crops, the ANOVA indicates a significant difference. Duncan's test can be used to compare the yields of individual fertilizers and identify significant differences.

5. **Holm's Method:**
   - **Use Case:** Holm's method is a step-down procedure that controls the familywise error rate.
   - **Example:** In a marketing study comparing the sales performance of different advertising strategies, the ANOVA indicates a significant overall effect. Holm's method can be applied to make pairwise comparisons between specific advertising strategies while controlling for the overall error rate.

When to use a particular post-hoc test depends on factors such as sample size, homogeneity of variances, and the desired level of control over the familywise error rate. It's essential to choose a test that aligns with the characteristics of the data and the study design.
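
To show how two of these corrections differ in practice, here is a small pure-Python sketch of the Bonferroni and Holm adjustments. The three raw p-values are hypothetical, and the helper functions `bonferroni` and `holm` are written for this example and do not come from any library.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment; returns values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min((m - rank) * p_values[idx], 1.0)
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons
raw_p = [0.01, 0.02, 0.03]
alpha = 0.05

bonf_p = bonferroni(raw_p)
holm_p = holm(raw_p)

print("Bonferroni-adjusted:", bonf_p)
print("Holm-adjusted:", holm_p)
print("Rejected (Bonferroni):", [p < alpha for p in bonf_p])
print("Rejected (Holm):", [p < alpha for p in holm_p])
```

On these inputs Bonferroni rejects only the first comparison while Holm rejects all three. Both control the familywise error rate, and Holm never rejects fewer hypotheses than Bonferroni.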

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [6]:
import scipy.stats as stats
import pandas as pd
import numpy as np

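# Generate random weight loss data for each diet
# Note: this simulates 50 participants per diet (150 in total), whereas the question describes 50 participants overall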
np.random.seed(42)  # for reproducibility
weight_loss_A = np.random.normal(loc=5, scale=2, size=50)
weight_loss_B = np.random.normal(loc=4.5, scale=1.5, size=50)
weight_loss_C = np.random.normal(loc=6, scale=2.5, size=50)

# Creating a DataFrame
data = pd.DataFrame({
    'Diet': ['A']*50 + ['B']*50 + ['C']*50,
    'WeightLoss': np.concatenate([weight_loss_A, weight_loss_B, weight_loss_C])
})

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(
    data[data['Diet'] == 'A']['WeightLoss'],
    data[data['Diet'] == 'B']['WeightLoss'],
    data[data['Diet'] == 'C']['WeightLoss']
)

# Print the results
print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


F-Statistic: 7.984872861507485
P-Value: 0.0005104585600694623
There is a significant difference between the mean weight loss of the three diets.


### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [7]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
import numpy as np

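# Generate random data (90 simulated observations, whereas the question describes 30 employees)
# Completion times are drawn independently of both factors, so no true effect is built in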
np.random.seed(42)  # for reproducibility

# Create a DataFrame with columns: 'Program', 'ExperienceLevel', and 'CompletionTime'
data = pd.DataFrame({
    'Program': np.random.choice(['A', 'B', 'C'], size=90),
    'ExperienceLevel': np.random.choice(['Novice', 'Experienced'], size=90),
    'CompletionTime': np.random.normal(loc=10, scale=2, size=90)
})

# Fit the two-way ANOVA model
model = ols('CompletionTime ~ C(Program) + C(ExperienceLevel) + C(Program):C(ExperienceLevel)', data=data).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the results
print(anova_table)

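# Interpret the results: compare each PR(>F) value with the significance level (0.05)
# A p-value below 0.05 for C(Program) or C(ExperienceLevel) indicates a main effect
# A p-value below 0.05 for C(Program):C(ExperienceLevel) indicates an interaction effect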


                                   sum_sq    df         F    PR(>F)
C(Program)                       1.334021   2.0  0.193670  0.824297
C(ExperienceLevel)               5.096305   1.0  1.479736  0.227223
C(Program):C(ExperienceLevel)    8.396750   2.0  1.219018  0.300694
Residual                       289.301266  84.0       NaN       NaN

None of the three p-values is below 0.05, so there is no evidence of a main effect of program, a main effect of experience level, or an interaction between them. This is expected here, because the simulated completion times were generated independently of both factors.


### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [8]:
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pandas as pd
import numpy as np

# Generate random test scores for the control and experimental groups
np.random.seed(42)  # for reproducibility
control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=10, size=100)

# Create a DataFrame
data = pd.DataFrame({
    'Group': ['Control'] * 100 + ['Experimental'] * 100,
    'TestScores': np.concatenate([control_scores, experimental_scores])
})

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(
    data[data['Group'] == 'Control']['TestScores'],
    data[data['Group'] == 'Experimental']['TestScores']
)

# Print the results
print("Two-Sample T-Test:")
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

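# Follow up with post-hoc test (Tukey's HSD)
# With only two groups this repeats the single comparison already made by the t-test;
# post-hoc tests only add information when there are three or more groups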
tukey_results = pairwise_tukeyhsd(data['TestScores'], data['Group'])

# Print the post-hoc results
print("\nPost-Hoc Test (Tukey's HSD):")
print(tukey_results)


Two-Sample T-Test:
T-Statistic: -4.754695943505282
P-Value: 3.819135262679469e-06

Post-Hoc Test (Tukey's HSD):
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.2615 0.001 3.6645 8.8585   True
--------------------------------------------------------


### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
# Notebook shell command: install pingouin if it is not already available
!pip install pingouin

import numpy as np
import pandas as pd
import pingouin as pg

# Generate random daily sales data for three stores over 30 days
np.random.seed(42)  # for reproducibility
days = 30
sales_data = pd.DataFrame({
    'Day': np.arange(1, days + 1),  # each day is the repeated-measures "subject"
    'Store_A': np.random.normal(loc=100, scale=10, size=days),
    'Store_B': np.random.normal(loc=110, scale=15, size=days),
    'Store_C': np.random.normal(loc=95, scale=12, size=days)
})

# Reshape to long format, keeping 'Day' as the subject identifier
sales_long = pd.melt(sales_data, id_vars='Day', var_name='Store', value_name='DailySales')

# Perform repeated measures ANOVA (long-format data needs the subject column)
rm_anova_result = pg.rm_anova(data=sales_long, dv='DailySales', within='Store', subject='Day')

# Print the ANOVA result
print(rm_anova_result)
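
The question also asks for a post-hoc follow-up. One possible sketch, assuming the same kind of simulated data, runs paired t-tests for every pair of stores with `scipy.stats.ttest_rel` and applies a Bonferroni adjustment. This is one option among several and is not the only valid post-hoc procedure; the column names and seed are assumptions for the example.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Simulated data (assumption): 30 days of sales for three stores, wide format
np.random.seed(42)
days = 30
sales_data = pd.DataFrame({
    "Store_A": np.random.normal(loc=100, scale=10, size=days),
    "Store_B": np.random.normal(loc=110, scale=15, size=days),
    "Store_C": np.random.normal(loc=95, scale=12, size=days),
})

pairs = list(combinations(sales_data.columns, 2))
results = []
for store_1, store_2 in pairs:
    # Paired t-test: both stores are observed on the same days
    t_stat, p_raw = stats.ttest_rel(sales_data[store_1], sales_data[store_2])
    # Bonferroni adjustment: multiply by the number of comparisons, cap at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results.append({
        "pair": f"{store_1} vs {store_2}",
        "t": t_stat,
        "p_raw": p_raw,
        "p_bonferroni": p_adj,
    })

posthoc = pd.DataFrame(results)
print(posthoc)
```

A pair is declared different when its Bonferroni-adjusted p-value falls below the chosen significance level, such as 0.05.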
