### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Ans. **Assumptions required to use ANOVA (Analysis of Variance)**

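1. Normality of the sampling distribution of the mean: The distribution of the sample means should be approximately normal. With reasonably large samples, the Central Limit Theorem makes this plausible even when the raw data are somewhat non-normal.
2. Absence of outliers: Extreme scores can distort group means and inflate variances, so outlying scores need to be identified and removed from (or otherwise handled in) the dataset.
3. Homogeneity of variance: The variance within each group should be approximately equal.
4. Independence: Samples are independent and randomly drawn.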
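
The normality and equal-variance assumptions can be checked before running the ANOVA. The sketch below is one possible way to do this with `scipy.stats`, using the Shapiro-Wilk test for normality and Levene's test for homogeneity of variance. The three groups are hypothetical data generated only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three independent groups drawn with a fixed seed
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (5.0, 5.5, 6.0)]

# Normality within each group: Shapiro-Wilk (null hypothesis: data are normal)
shapiro_pvalues = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variance: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_pvalues, 3))
print("Levene p-value:", round(levene_p, 3))
```

A small p-value in either test suggests that the corresponding assumption may not hold for the data at hand.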

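**Violations that could impact the validity of results**

Violations of the assumptions of your analysis reduce your ability to trust your results and to draw valid inferences from them. For example, strongly skewed data or extreme outliers violate normality, one group with a much larger spread than the others violates homogeneity of variance, and measuring the same participant several times while treating the measurements as separate subjects violates independence. Common ways to address such violations are: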

1. *Data transformation:* A common issue that researchers face is a violation of the assumption of normality. Numerous statistics texts recommend data transformations, such as natural log or square root transformations, to address this violation. Data transformations are not without consequence; for example, once you transform a variable and conduct your analysis, you can only interpret the transformed variable. You cannot provide an interpretation of the results based on the untransformed variable values.

2. *Non-parametric analysis:* You may encounter issues where multiple assumptions are violated, or a data transformation does not correct the violated assumption. In these cases, you may opt to use non-parametric analyses.

3. *Alternative statistics for determining significance:* You may consider using more conservative statistics for determining significance if your assumptions are violated. For example, if the assumption of homogeneity of variance was violated in your analysis of variance (ANOVA), you can use alternative F statistics such as Welch's F or the Brown-Forsythe F.
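
As a rough illustration of the first two remedies, the sketch below applies a log transformation before a one-way ANOVA and, separately, runs the Kruskal-Wallis test as a non-parametric alternative. The right-skewed groups are hypothetical data generated only for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed data for three independent groups
rng = np.random.default_rng(1)
groups = [rng.lognormal(mean=m, sigma=0.6, size=25) for m in (1.0, 1.2, 1.5)]

# Remedy 1: log-transform, then run the ANOVA on the transformed scale
log_groups = [np.log(g) for g in groups]
f_log, p_log = stats.f_oneway(*log_groups)

# Remedy 2: Kruskal-Wallis, a rank-based non-parametric alternative
h_stat, p_kw = stats.kruskal(*groups)

print(f"ANOVA on log scale: F = {f_log:.2f}, p = {p_log:.4f}")
print(f"Kruskal-Wallis:     H = {h_stat:.2f}, p = {p_kw:.4f}")
```

Note that the first result describes differences in the log of the variable, not in the original units.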

### Q2. What are the three types of ANOVA, and in what situations would each be used?


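Ans. Types of ANOVA:

1. *One-way ANOVA:* One factor with at least two levels, where the levels are independent (different subjects in each group). Use it to compare the means of two or more independent groups.
2. *Repeated measures ANOVA:* One factor with at least two levels, where the levels are dependent (the same subjects are measured at every level). Use it when each subject is measured under every condition or at several time points.
3. *Factorial ANOVA:* Two or more factors (each with at least two levels); levels can be independent or dependent. Use it to study the effects of several factors and their interaction at once.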

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans. Partitioning of variance in ANOVA means dividing the *Total Variance* into *Between Group Variance* (variation explained by group membership) and *Within Group Variance* (unexplained variation among observations in the same group). It is an important concept because the F-statistic is the ratio of these two components: comparing them tells us whether the differences among group means are larger than expected by chance, i.e. whether the result is statistically significant when dealing with more than two groups that are independent of each other.
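
In symbols, the partition can be written as follows. The notation is introduced here only for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{x}_i$, and grand mean $\bar{x}$.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{SS_{\text{within}}},
\qquad
F=\frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$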

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

Ans. We can calculate SST (Total Sum of Squares), SSE (Explained Sum of Squares), and SSR (Residual Sum of Squares) in a one-way ANOVA using Python by following these steps:

1. Calculate the overall mean (grand mean) of all observations.
2. Calculate the sum of squares total (SST) by summing the squared deviations of each observation from the grand mean.
3. Calculate the sum of squares explained (SSE) by summing the squared deviations of each group mean from the grand mean, weighted by the number of observations in each group.
4. Calculate the sum of squares residual (SSR) by summing the squared deviations of each observation from its respective group mean.

Here's a Python code snippet to demonstrate this calculation using the `numpy` library:

```python
import numpy as np

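def one_way_anova_sumsquares(groups):
    # Accept groups of equal or unequal size
    groups = [np.asarray(group, dtype=float) for group in groups]
    all_obs = np.concatenate(groups)

    # Calculate overall (grand) mean
    overall_mean = all_obs.mean()

    # Calculate SST (Total Sum of Squares)
    sst = np.sum((all_obs - overall_mean) ** 2)

    # Calculate SSE (Explained Sum of Squares): group means vs. grand mean,
    # weighted by group size
    sse = sum(len(group) * (group.mean() - overall_mean) ** 2 for group in groups)

    # Calculate SSR (Residual Sum of Squares): observations vs. their group mean
    ssr = sum(np.sum((group - group.mean()) ** 2) for group in groups)

    return sst, sse, ssr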

# Example
groups = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
sst, sse, ssr = one_way_anova_sumsquares(groups)
print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)
```
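
To connect the sums of squares to the test itself, the sketch below turns the explained (between-group) and residual (within-group) sums of squares into mean squares and an F-statistic, then compares the result with `scipy.stats.f_oneway` on the same example groups. The variable names are chosen for this illustration only.

```python
import numpy as np
from scipy.stats import f, f_oneway

groups = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

k = len(groups)          # number of groups
n_total = all_obs.size   # total number of observations

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares, F-statistic, and its p-value from the F distribution
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = f.sf(f_manual, k - 1, n_total - k)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = f_oneway(*groups)
print("Manual F:", f_manual, "SciPy F:", f_scipy)
```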

### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Ans. In a two-way ANOVA, you can calculate the main effects and interaction effects using Python by fitting a linear model to your data and then examining the coefficients associated with each factor and their interactions. `statsmodels` is well suited to this because it reports standard errors and p-values for each term (`scikit-learn` can fit the same linear model but does not report them). Here's a general outline of the steps involved:

1. Prepare your data: Ensure your data is structured properly with one column for each factor or independent variable, and one column for the dependent variable.
2. Fit a linear model: Use a suitable library to fit a linear model to your data, specifying both main effects and interaction terms.
3. Extract coefficients: After fitting the model, extract the coefficients associated with the main effects and interaction terms.
4. Interpret results: Examine the coefficients to determine the strength and direction of the main effects and interaction effects.

Here's a Python code snippet using `statsmodels` to demonstrate this process:

```python
import pandas as pd
import statsmodels.api as sm

# Example data (replace with your actual data)
data = {
    'Factor1': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'Factor2': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Dependent': [10, 15, 20, 12, 18, 24, 8, 10, 12]
}

df = pd.DataFrame(data)

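# Convert categorical variables to 0/1 dummy variables
df = pd.get_dummies(df, columns=['Factor1', 'Factor2'], dtype=float)

# Add constant column for intercept
df['Intercept'] = 1.0

# Build the interaction columns as products of the dummy variables.
# Only the Factor2 = B interactions are included: with one observation per
# cell, the full set of interaction terms would leave no residual degrees of freedom.
df['Factor1_1:Factor2_B'] = df['Factor1_1'] * df['Factor2_B']
df['Factor1_2:Factor2_B'] = df['Factor1_2'] * df['Factor2_B']

# Fit the linear model (reference levels: Factor1 = 3, Factor2 = A)
predictors = ['Intercept', 'Factor1_1', 'Factor1_2', 'Factor2_B', 'Factor2_C',
              'Factor1_1:Factor2_B', 'Factor1_2:Factor2_B']
model = sm.OLS(df['Dependent'], df[predictors])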
results = model.fit()

# Print summary of the model
print(results.summary())
```

In the above code:
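- `Factor1_1`, `Factor1_2`, `Factor2_B`, and `Factor2_C` are dummy-coded terms for the main effects of Factor1 and Factor2, measured relative to the reference levels (Factor1 = 3 and Factor2 = A) that are left out of the model.
- `Factor1_1:Factor2_B` and `Factor1_2:Factor2_B` represent interaction effects between Factor1 and Factor2, i.e. how the effect of Factor1 changes when Factor2 is B.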
- `Intercept` represents the intercept term.

After fitting the model, you can examine the coefficients to interpret the main effects and interaction effects. The coefficients will indicate the direction and magnitude of the effects.
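
Another way to look at the same effects, without fitting a model, is to compare cell means with marginal means. The sketch below does this with `pandas` on the example data: a main effect is a marginal mean minus the grand mean, and the interaction is what remains in each cell after removing both main effects. This decomposition assumes a balanced design, and it describes effect sizes only; it does not test significance.

```python
import pandas as pd

df = pd.DataFrame({
    'Factor1': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'Factor2': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Dependent': [10, 15, 20, 12, 18, 24, 8, 10, 12],
})

# Table of cell means: rows are Factor1 levels, columns are Factor2 levels
cell_means = df.pivot_table(index='Factor1', columns='Factor2',
                            values='Dependent', aggfunc='mean')
grand_mean = df['Dependent'].mean()

# Main effects: marginal mean minus grand mean
main_effect_f1 = cell_means.mean(axis=1) - grand_mean
main_effect_f2 = cell_means.mean(axis=0) - grand_mean

# Interaction: cell mean - row mean - column mean + grand mean
interaction = (cell_means
               .sub(cell_means.mean(axis=1), axis=0)
               .sub(cell_means.mean(axis=0), axis=1)
               + grand_mean)

print(main_effect_f1, main_effect_f2, interaction, sep="\n\n")
```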

### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Ans. In the context of a one-way ANOVA, the F-statistic measures the ratio of the variance between groups to the variance within groups. The p-value associated with the F-statistic is the probability of observing an F-statistic at least this large if the null hypothesis (all group means are equal) were true.

Given that you obtained an F-statistic of 5.23 and a p-value of 0.02:

1. **Significance of the F-statistic**: An F-statistic of 5.23 means the variation between group means is about 5.23 times the variation within groups. On its own this does not establish significance, because the threshold depends on the degrees of freedom, so we need to consider the p-value.
2. **Interpretation of the p-value**: The p-value of 0.02 is less than the typical significance level of 0.05. Therefore, we would reject the null hypothesis at the 0.05 significance level (though not at the stricter 0.01 level). This is evidence that at least two of the group means differ.
3. **Conclusion**: Based on these results, we can conclude that there are statistically significant differences between the groups. However, the ANOVA itself does not tell us which specific groups are different from each other. To determine pairwise differences between groups, post-hoc tests such as Tukey's HSD (Honestly Significant Difference) test or pairwise comparisons with a Bonferroni correction can be conducted.

In summary, an F-statistic of 5.23 with a p-value of 0.02 indicates that there are statistically significant differences between the groups, warranting further investigation to determine the nature of these differences through post-hoc tests.
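
The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not give the degrees of freedom, so the design below (3 groups, 15 observations in total) is purely an assumption for illustration; a different design would give a different p-value for the same F.

```python
from scipy.stats import f

f_statistic = 5.23

# Hypothetical design (NOT given in the question): 3 groups, 15 observations
k, n_total = 3, 15
dfn, dfd = k - 1, n_total - k   # numerator and denominator degrees of freedom

# p-value: probability of an F at least this large under the null hypothesis
p_value = f.sf(f_statistic, dfn, dfd)

# Critical value that F must exceed to be significant at the 0.05 level
critical_f = f.ppf(0.95, dfn, dfd)

print(f"p-value: {p_value:.4f}, critical F at 0.05: {critical_f:.3f}")
```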

### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans. Handling missing data in repeated measures ANOVA is essential to ensure the validity and reliability of the analysis. There are several approaches to handle missing data, each with its potential consequences:

1. **Complete Case Analysis (CCA)**:
   - In CCA, cases with any missing data are completely excluded from the analysis.
   - Pros: Simple to implement; gives unbiased estimates when data are missing completely at random (MCAR).
   - Cons: May lead to biased estimates if missingness is related to the outcome or other variables of interest. Reduces statistical power if a large portion of data is missing.

2. **Mean Imputation**:
   - Missing values are replaced with the mean of the observed values for that variable.
   - Pros: Preserves sample size, maintains the mean of the observed data.
   - Cons: May underestimate standard errors, reduces variability, can distort relationships, and lead to biased estimates, especially if data are not missing at random.

3. **Last Observation Carried Forward (LOCF)**:
   - Missing values are replaced with the value from the last observed time point for that participant.
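   - Pros: Simple to implement, keeps every participant in the analysis.
   - Cons: Assumes the outcome stays unchanged after the last observation, which is rarely realistic. It can bias treatment effects in either direction, especially if missingness is related to treatment response, and it understates variability, which can inflate statistical significance.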

4. **Multiple Imputation (MI)**:
   - Missing values are replaced with multiple sets of plausible values based on the observed data.
   - Pros: Accounts for uncertainty due to missing data, retains variability, preserves statistical power, and provides unbiased parameter estimates under certain assumptions.
   - Cons: More complex to implement, requires assumptions about the missing data mechanism (e.g., missing at random), computational intensity.

5. **Model-Based Imputation**:
   - Missing values are imputed using a statistical model fitted to the observed data.
   - Pros: Accounts for complex patterns of missingness, provides unbiased estimates if the model is correctly specified.
   - Cons: Requires assumptions about the underlying data distribution and missing data mechanism, may be computationally intensive.

The choice of method depends on the nature of the missing data, the assumptions about the missing data mechanism, and the specific goals of the analysis. It's essential to consider the potential consequences of each method and perform sensitivity analyses to assess the robustness of the results. Additionally, documenting the methods used for handling missing data is crucial for transparency and reproducibility.
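
The three simplest approaches can be sketched with `pandas`. The wide-format table below (one row per participant, one column per time point) is hypothetical data made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are participants, columns are time points
wide = pd.DataFrame({
    't1': [5.0, 6.0, 7.0, 5.5],
    't2': [5.5, np.nan, 7.5, 6.0],
    't3': [6.0, 6.5, np.nan, 6.5],
}, index=['p1', 'p2', 'p3', 'p4'])

# Complete case analysis: drop every participant with any missing value
complete_cases = wide.dropna()

# Mean imputation: fill each time point with its observed mean
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each participant's last observed value forward in time
locf = wide.ffill(axis=1)

print(complete_cases, mean_imputed, locf, sep="\n\n")
```

Here complete case analysis keeps only two of the four participants, which shows how quickly it can reduce the sample size.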

### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Ans. Post-hoc tests are used in ANOVA (Analysis of Variance) when the overall F-test indicates a significant difference among group means but does not specify which specific groups differ from each other. These tests help to identify pairwise differences between groups. Some common post-hoc tests include:

1. **Tukey's Honestly Significant Difference (HSD)**:
   - Tukey's HSD test controls the familywise error rate, providing simultaneous confidence intervals for all pairwise differences between group means.
   - It assumes homogeneity of variances and is most straightforward with a balanced design (equal sample sizes); the Tukey-Kramer variant extends it to unequal sample sizes.
   - Example: Suppose you conducted a one-way ANOVA with three treatment groups and found a significant difference among group means. Tukey's HSD test would help identify which specific pairs of treatment groups differ significantly.

2. **Bonferroni Correction**:
   - Bonferroni correction adjusts the significance level for multiple comparisons to maintain a desired overall Type I error rate.
   - It is conservative and suitable for controlling the familywise error rate when conducting multiple pairwise comparisons.
   - Example: If you have multiple pairwise comparisons to make after conducting an ANOVA, Bonferroni correction would adjust the p-values to ensure that the overall Type I error rate remains within an acceptable range.

3. **Scheffé's Test**:
   - Scheffé's test provides simultaneous confidence intervals for all possible contrasts among group means (not only pairwise comparisons) and can be used with unequal sample sizes.
   - It is more conservative than Tukey's HSD test for pairwise comparisons, and it still assumes homogeneity of variances.
   - Example: When you want to test complex contrasts (for instance, the average of two treatment groups against a third) in addition to pairwise comparisons while controlling the familywise error rate, Scheffé's test would be appropriate.

4. **Dunnett's Test**:
   - Dunnett's test is used when one group serves as the control or reference group, and you want to compare all other groups to this control group.
   - It adjusts for multiple comparisons while focusing on specific group comparisons.
   - Example: In a clinical trial comparing the effectiveness of different treatments to a control group, Dunnett's test would help identify which treatment groups differ significantly from the control group.

The choice of post-hoc test depends on the specific study design, assumptions, and research questions. It's important to select a test that aligns with the characteristics of your data and the goals of your analysis. Additionally, it's essential to adjust for multiple comparisons to control the overall Type I error rate and avoid spurious findings.
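
Two of these procedures can be sketched in a few lines: Tukey's HSD via `pairwise_tukeyhsd` from `statsmodels`, and Bonferroni-corrected pairwise t-tests via `scipy` and `multipletests`. The three groups below are hypothetical data generated for illustration, and the group names are made up.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Hypothetical data: one control group and two treatment groups
rng = np.random.default_rng(2)
samples = {
    'control': rng.normal(10.0, 2.0, size=20),
    'treat_1': rng.normal(12.0, 2.0, size=20),
    'treat_2': rng.normal(13.0, 2.0, size=20),
}

# Tukey's HSD on all pairs of groups
values = np.concatenate(list(samples.values()))
labels = np.repeat(list(samples.keys()), 20)
tukey = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(tukey)

# Pairwise t-tests with a Bonferroni correction
pairs = list(combinations(samples, 2))
raw_p = [ttest_ind(samples[a], samples[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject H0 = {r}")
```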

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

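Ans. To conduct a one-way ANOVA in Python, you can use the `scipy.stats` module. Below is a Python code example demonstrating how to perform a one-way ANOVA on the weight loss data for diets A, B, and C. The values are illustrative sample data (50 values per diet, whereas the question describes 50 participants in total), so replace them with the actual measurements: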

```python
import numpy as np
from scipy.stats import f_oneway

# Weight loss data for each diet
diet_A = np.array([1.5, 2.0, 1.8, 2.2, 1.9, 2.1, 1.7, 1.8, 1.6, 2.0,
                   1.9, 2.2, 1.5, 1.8, 1.6, 2.0, 1.7, 1.9, 2.1, 1.8,
                   2.0, 1.6, 1.9, 2.1, 1.7, 2.0, 1.8, 1.9, 2.2, 1.6,
                   1.8, 2.1, 1.7, 1.9, 2.0, 1.6, 1.8, 1.7, 1.9, 2.2,
                   1.6, 1.8, 1.7, 2.1, 1.9, 2.0, 1.5, 2.2, 1.8, 1.6])

diet_B = np.array([1.3, 1.7, 1.5, 1.8, 1.6, 1.9, 1.4, 1.6, 1.5, 1.7,
                   1.3, 1.6, 1.4, 1.8, 1.5, 1.7, 1.4, 1.6, 1.3, 1.8,
                   1.5, 1.7, 1.4, 1.6, 1.3, 1.5, 1.8, 1.4, 1.6, 1.3,
                   1.5, 1.8, 1.4, 1.6, 1.3, 1.7, 1.5, 1.9, 1.4, 1.6,
                   1.3, 1.7, 1.5, 1.8, 1.4, 1.6, 1.3, 1.5, 1.8, 1.4])

diet_C = np.array([1.0, 1.2, 0.8, 1.3, 1.1, 1.4, 0.9, 1.1, 1.0, 1.2,
                   0.8, 1.3, 1.0, 1.2, 0.9, 1.4, 1.1, 1.3, 0.8, 1.2,
                   1.0, 1.4, 0.9, 1.1, 1.0, 1.3, 0.8, 1.2, 1.0, 1.4,
                   0.9, 1.1, 1.1, 1.3, 0.9, 1.2, 1.0, 1.4, 0.8, 1.1,
                   1.0, 1.3, 0.9, 1.2, 1.1, 1.4, 0.8, 1.0, 1.2, 1.3])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Print results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("The p-value is less than 0.05, so we reject the null hypothesis.")
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("The p-value is greater than or equal to 0.05, so we fail to reject the null hypothesis.")
    print("There is no significant difference between the mean weight loss of the three diets.")
```

This code first defines the weight loss data for each diet, then performs a one-way ANOVA using `f_oneway()` function from `scipy.stats`. Finally, it prints out the F-statistic and p-value and interprets the results. If the p-value is less than 0.05, it concludes that there is a significant difference between the mean weight loss of the three diets; otherwise, it concludes that there is no significant difference.

### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

Ans. To conduct a two-way ANOVA in Python, you can use the `statsmodels` library, which provides an easy-to-use interface for fitting linear models. Below is a Python code example demonstrating how to perform a two-way ANOVA on the task completion time data for different software programs and employee experience levels:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (replace with your actual data).
# All three columns must have the same length (45 rows here).
data = {
    'Software': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'] * 5,
    'Experience': ['Novice', 'Experienced'] * 22 + ['Novice'],
    'Time': [10.2, 11.5, 9.8, 12.3, 13.2, 11.0, 9.5, 10.8, 9.2,
             11.9, 12.7, 10.5, 10.1, 11.4, 9.7, 12.2, 13.0, 10.8,
             9.3, 10.6, 8.9, 12.5, 13.4, 11.2, 9.7, 11.0, 9.3,
             10.0, 11.3, 9.6, 11.1, 12.8, 10.6, 10.2, 11.5, 9.8,
             12.3, 13.2, 11.0, 9.5, 10.8, 9.2, 11.9, 12.7, 10.5]
}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
```

This code first creates a DataFrame `df` containing the software program, employee experience level, and task completion time data. Then, it fits a two-way ANOVA model using the `ols()` function from `statsmodels.formula.api`, specifying both main effects (`C(Software)` and `C(Experience)`) and their interaction (`C(Software):C(Experience)`). Finally, it performs ANOVA using `sm.stats.anova_lm()` and prints out the ANOVA table containing the F-statistics and p-values for main effects and interaction effects.

Interpreting the results of the two-way ANOVA involves examining the p-values associated with the main effects and interaction effect:

- If the p-value for the main effect of software program or employee experience level is less than the chosen significance level (e.g., 0.05), it indicates a significant main effect.
- If the p-value for the interaction effect is less than the chosen significance level, it suggests that there is a significant interaction effect between software program and employee experience level.

### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

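Ans. To conduct a two-sample t-test in Python, you can use `scipy.stats`. With only two groups, a significant t-test already identifies which groups differ, so a post-hoc test is not strictly needed; the Bonferroni step from `statsmodels` is included below only to follow the question, and with a single comparison it leaves the p-value unchanged. Below is a Python code example demonstrating how to perform these analyses: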

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Test scores for control group (traditional teaching method)
control_scores = np.array([85, 78, 90, 82, 79, 88, 92, 81, 84, 87,
                           80, 83, 86, 89, 91, 77, 79, 81, 83, 85,
                           78, 80, 82, 84, 86, 88, 90, 92, 94, 96,
                           85, 88, 81, 79, 87, 83, 82, 90, 88, 85,
                           84, 82, 86, 89, 91, 87, 83, 85, 88, 80])

# Test scores for experimental group (new teaching method)
experimental_scores = np.array([88, 82, 95, 85, 84, 91, 94, 86, 90, 92,
                                83, 87, 89, 93, 96, 81, 85, 87, 89, 88,
                                81, 83, 85, 87, 89, 91, 93, 95, 97, 99,
                                88, 91, 84, 82, 90, 87, 85, 93, 91, 88,
                                86, 85, 89, 92, 94, 90, 87, 88, 91, 82])

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_scores, experimental_scores)
print("Two-sample t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Check if the results are significant
if p_value < 0.05:
    print("The difference in test scores between the two groups is significant.")
    
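    # Post-hoc adjustment: pass the p-values as a list and take the single
    # adjusted value (with one comparison, Bonferroni leaves it unchanged)
    p_adjusted = multipletests([p_value], method='bonferroni')[1][0]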
    if p_adjusted < 0.05:
        print("The difference remains significant after adjusting for multiple comparisons.")
    else:
        print("The difference is not significant after adjusting for multiple comparisons.")
else:
    print("There is no significant difference in test scores between the two groups.")
```

In this code:
- We define the test scores for the control group (`control_scores`) and the experimental group (`experimental_scores`).
- We perform a two-sample t-test using `ttest_ind()` from `scipy.stats`.
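- If the t-test results are significant (p-value < 0.05), we apply `multipletests()` from `statsmodels.stats.multitest` with the Bonferroni correction. Because there is only one comparison between two groups, the adjusted p-value equals the original one; the correction matters only when several comparisons are made.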
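
One way to see why two groups need no post-hoc test is that a one-way ANOVA on two groups is equivalent to the pooled-variance t-test: the F-statistic equals the squared t-statistic and the p-values match. The sketch below checks this on hypothetical scores generated for illustration.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Hypothetical test scores for two groups of 50 students
rng = np.random.default_rng(4)
control = rng.normal(85, 5, size=50)
experimental = rng.normal(88, 5, size=50)

# Pooled-variance two-sample t-test (the default in ttest_ind)
t_stat, p_t = ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, p_f = f_oneway(control, experimental)

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {p_t:.6f}, ANOVA p = {p_f:.6f}")
```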

### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a posthoc test to determine which store(s) differ significantly from each other.

Ans. To conduct a repeated measures ANOVA in Python and follow up with a post-hoc test if the results are significant, you can use libraries like `statsmodels` and `pingouin`. Below is a Python code example demonstrating how to perform these analyses:

```python
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample data (replace with your actual data): synthetic daily sales,
# one value per store per day, generated with a fixed seed
rng = np.random.default_rng(0)
store_means = {'Store A': 100, 'Store B': 105, 'Store C': 110}  # illustrative only

data = {
    'Day': list(range(1, 31)) * 3,
    'Store': ['Store A'] * 30 + ['Store B'] * 30 + ['Store C'] * 30,
    'Sales': np.concatenate([rng.normal(m, 8, size=30) for m in store_means.values()])
}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Perform repeated measures ANOVA (Store is the within factor, Day is the subject)
aov = pg.rm_anova(data=df, dv='Sales', within='Store', subject='Day')
print(aov.round(3))

# If the ANOVA p-value printed above is below 0.05, follow up with post-hoc tests (Tukey HSD)
posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'], alpha=0.05)
print(posthoc)

In this code:
- We define the sales data for each store (`Store A`, `Store B`, and `Store C`) over 30 days.
- We perform a repeated measures ANOVA using the `rm_anova()` function from the `pingouin` library. This function conducts a repeated measures ANOVA with the specified within-subject factor (`Store`) and subject variable (`Day`).
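- If the repeated measures ANOVA results are significant, we perform post-hoc tests using the `pairwise_tukeyhsd()` function from the `statsmodels` library to determine which stores differ significantly from each other. Note that Tukey's HSD treats the stores as independent groups and ignores that the same days are measured for every store; paired t-tests with a Bonferroni correction are an alternative that respects the repeated measures structure.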
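
A post-hoc comparison that keeps the day-by-day pairing can be sketched with paired t-tests (`scipy.stats.ttest_rel`) and a Bonferroni correction. This is an alternative to the Tukey step, not part of the original answer, and the sales figures below are hypothetical values generated with a shared day effect purely for illustration.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

# Hypothetical sales: a day effect shared by all stores plus store-specific noise
rng = np.random.default_rng(5)
day_effect = rng.normal(0, 10, size=30)
sales = {
    'Store A': 100 + day_effect + rng.normal(0, 3, size=30),
    'Store B': 103 + day_effect + rng.normal(0, 3, size=30),
    'Store C': 108 + day_effect + rng.normal(0, 3, size=30),
}

# Paired t-test for every pair of stores (same 30 days in each sample)
pairs = list(combinations(sales, 2))
raw_p = [ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs]

# Bonferroni correction across the three comparisons
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject H0 = {r}")
```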