# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

## Assumptions Required for ANOVA

1. **Independence of Observations**  
   - **Assumption:** The data points in each group are independent of each other.
   - **Violation Example:** If the observations are related (e.g., repeated measures from the same subjects or groups), the independence assumption is violated, leading to misleading results.

2. **Normality of the Data**  
   - **Assumption:** The data in each group should be approximately normally distributed.
   - **Violation Example:** If the data is heavily skewed or has outliers, this assumption is violated, potentially affecting the F-statistic and p-values, especially with small sample sizes.

3. **Homogeneity of Variance (Homoscedasticity)**  
   - **Assumption:** The variance within each group should be approximately equal.
   - **Violation Example:** If the group variances are very different (heteroscedasticity), this can lead to inflated Type I error rates, meaning the test may incorrectly reject the null hypothesis.

## Examples of Violations Impacting Validity:

1. **Violation of Independence:**  
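   - If the same participants are measured across multiple groups (e.g., in a repeated-measures design), the observations are correlated and the assumption of independence is violated. Analysing such data with an ordinary between-groups ANOVA can distort Type I error rates and power; a repeated measures ANOVA should be used instead.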

2. **Violation of Normality:**  
   - If one or more groups are highly skewed, the p-values from the F-test may be unreliable, particularly when samples are small or unequal in size. In this case, non-parametric tests like the Kruskal-Wallis test may be more appropriate.

3. **Violation of Homogeneity of Variance:**  
   - If the variances between groups are unequal (e.g., one group has a much wider range of values than others), the results from ANOVA may not be valid. This could lead to incorrect conclusions about group differences. A Levene's test can be used to check this assumption before proceeding with ANOVA.
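
The checks mentioned above can be sketched with `scipy.stats`. This is a minimal illustration on made-up data (the `groups` dictionary is hypothetical, not data from this document); it runs a Shapiro-Wilk test per group for normality, Levene's test for equal variances, and the Kruskal-Wallis test as a non-parametric alternative.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three independent groups (illustrative only)
groups = {
    "g1": np.array([23, 21, 22, 20, 24]),
    "g2": np.array([31, 29, 30, 32, 34]),
    "g3": np.array([40, 42, 41, 39, 43]),
}

# Normality: Shapiro-Wilk test within each group (index 1 is the p-value)
shapiro_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Homogeneity of variance: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())

# Non-parametric alternative to one-way ANOVA if normality looks doubtful
kruskal_stat, kruskal_p = stats.kruskal(*groups.values())

print("Shapiro-Wilk p-values:", shapiro_p)
print(f"Levene p-value: {levene_p:.3f}")
print(f"Kruskal-Wallis H = {kruskal_stat:.2f}, p = {kruskal_p:.4f}")
```

A small p-value from Shapiro-Wilk or Levene's test signals a possible violation. With very small groups these tests have little power, so plots of the residuals are a useful complement.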

## Conclusion:
Violations of these assumptions can significantly affect the reliability of the ANOVA test. To mitigate such issues, data transformations or non-parametric alternatives may be considered depending on the nature of the violation.


# Q2. What are the three types of ANOVA, and in what situations would each be used?

## Types of ANOVA and Their Use Cases

1. **One-Way ANOVA**  
   - **Definition:** One-Way ANOVA is used when we have one independent variable with three or more levels (groups) and want to compare the means of these groups.
   - **Use Case:**  
     - Example: A researcher wants to test if three different teaching methods (Traditional, Online, and Blended) lead to different average test scores. The independent variable is the teaching method, and the dependent variable is the test score.
     - **When to Use:** When there is one factor with multiple levels, and the goal is to compare the means of those levels.

2. **Two-Way ANOVA**  
   - **Definition:** Two-Way ANOVA is used when we have two independent variables, and it examines the interaction between them along with their individual effects on the dependent variable.
   - **Use Case:**  
     - Example: A study investigating the effect of two factors, such as gender (Male, Female) and teaching method (Traditional, Online), on student performance.
     - **When to Use:** When there are two factors (independent variables) and the goal is to examine both the individual effects of each factor and any interaction between them.

3. **Repeated Measures ANOVA**  
   - **Definition:** Repeated Measures ANOVA is used when the same subjects are measured multiple times under different conditions (i.e., within-subjects design).
   - **Use Case:**  
     - Example: A researcher wants to test if a drug has different effects on blood pressure at three time points (before treatment, one week after treatment, and one month after treatment).  
     - **When to Use:** When the same subjects are exposed to different conditions or measurements at different times, and the goal is to examine within-subject variation.
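
As a rough sketch of the third case, `statsmodels` provides an `AnovaRM` class for balanced within-subject designs. The blood pressure readings and the column names (`subject`, `time`, `bp`) below are hypothetical and chosen only to mirror the drug example.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical readings: 4 subjects, each measured at 3 time points (long format)
bp = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time": ["before", "week1", "month1"] * 4,
    "bp": [150, 142, 138, 160, 155, 147, 145, 140, 139, 155, 146, 140],
})

# AnovaRM needs exactly one observation per subject and time point
rm_result = AnovaRM(data=bp, depvar="bp", subject="subject", within=["time"]).fit()
print(rm_result.anova_table)
```

The resulting table reports the F-test for the within-subject factor `time`, with error degrees of freedom based on the subject-by-time variation rather than on the total sample.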




# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

## Partitioning of Variance in ANOVA

In ANOVA, the **total variance** in the data is partitioned into components that reflect the different sources of variation. This partitioning helps determine how much of the overall variability is attributable to the factors being tested versus random variability or error.

### 1. **Total Variance (SST - Sum of Squares Total)**
   - **Definition:** The total variance is the overall variability in the dataset, measured by how much each individual data point deviates from the grand mean (overall average of all data points).
   - **Formula:**  
     \[
     SST = \sum (Y_{ij} - \bar{Y}_{grand})^2
     \]
   - **Importance:** Total variance gives a baseline measure of how much the data as a whole varies.

### 2. **Between-Groups Variance (SSB - Sum of Squares Between Groups)**
   - **Definition:** The between-groups variance measures how much the group means differ from the grand mean. It quantifies the variation explained by the independent variable (factor).
   - **Formula:**  
     \[
     SSB = \sum n_i (\bar{Y}_{i} - \bar{Y}_{grand})^2
     \]
     where \( n_i \) is the sample size of group \( i \), \( \bar{Y}_i \) is the mean of group \( i \), and \( \bar{Y}_{grand} \) is the grand mean.
   - **Importance:** This component reflects the effect of the factor(s) being studied, indicating how much of the variation in the data is explained by the differences between group means.

### 3. **Within-Groups Variance (SSW - Sum of Squares Within Groups)**
   - **Definition:** The within-groups variance measures the variability within each group, reflecting random error or other unexplained sources of variation. It is the sum of the squared differences between individual data points and their respective group means.
   - **Formula:**  
     \[
     SSW = \sum \sum (Y_{ij} - \bar{Y}_i)^2
     \]
     where \( Y_{ij} \) is an individual data point, and \( \bar{Y}_i \) is the mean of group \( i \).
   - **Importance:** This component accounts for the variability that is not explained by the factor(s), often representing random error or inherent variability within the groups.

### **F-Statistic**
   - The F-statistic is computed as the ratio of the between-groups variance to the within-groups variance:
     \[
     F = \frac{SSB / df_{between}}{SSW / df_{within}}
     \]
     where:
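     - \( df_{between} = k - 1 \) is the degrees of freedom associated with the between-groups variance, where \( k \) is the number of groups.
     - \( df_{within} = N - k \) is the degrees of freedom associated with the within-groups variance, where \( N \) is the total number of observations.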
   - **Importance:** A high F-statistic suggests that a large portion of the variance is explained by the factor(s), making it likely that the factor(s) have a significant effect on the dependent variable.

### **Why Understanding Partitioning of Variance is Important:**
1. **Identifies the Sources of Variation:**
   - By partitioning the total variance into between-group and within-group components, ANOVA helps identify how much of the variation is explained by the factor(s) and how much is due to random error.

2. **Informs Statistical Inference:**
   - The F-statistic, which is derived from the ratio of between-group to within-group variance, is used to assess whether the group means are statistically significantly different. If between-group variance is large relative to within-group variance, it suggests that the factor has a significant effect.

3. **Guides the Interpretation of Results:**
   - Understanding variance partitioning helps in making informed decisions about the significance of the results. A large between-group variance relative to within-group variance typically leads to rejecting the null hypothesis, while similar variances suggest no significant effect.

4. **Improves Model Accuracy:**
   - Partitioning the variance is also useful for model diagnostics, allowing researchers to assess how well the model fits the data and whether additional factors or interactions need to be considered.
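
The partition can be checked numerically. The sketch below uses a small made-up dataset (the three arrays are hypothetical) to compute SST, SSB, and SSW from their definitions, confirm that SST = SSB + SSW, and compare the hand-computed F-statistic with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of three observations each
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([10.0, 12.0, 14.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Sums of squares from their definitions
sst = np.sum((all_values - grand_mean) ** 2)
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F = mean square between / mean square within
f_manual = (ssb / (k - 1)) / (ssw / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"SST = {sst:.2f}, SSB + SSW = {ssb + ssw:.2f}")
print(f"F (manual) = {f_manual:.3f}, F (scipy) = {f_scipy:.3f}, p = {p_scipy:.4f}")
```

Agreement between the two F values shows that the library routine is doing nothing more than the partition described above.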

---


# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

## Calculating Total Sum of Squares (SST), Explained Sum of Squares (SSE), and Residual Sum of Squares (SSR) in a One-Way ANOVA Using Python

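In a one-way ANOVA, we can calculate the following components of variance. The labels follow the question: SSE here is the between-groups sum of squares (SSB in Q3) and SSR is the within-groups sum of squares (SSW in Q3). Many textbooks use the opposite convention, with SSE for the error (residual) term and SSR for the regression (explained) term, so check which definition a source uses.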

1. **Total Sum of Squares (SST)**: Measures the total variability in the data.
2. **Explained Sum of Squares (SSE)**: Measures the variability explained by the factor (between-group variability).
3. **Residual Sum of Squares (SSR)**: Measures the unexplained variability (within-group variability).

### **Formulae:**
- **SST** (Total Sum of Squares):
  \[
  SST = \sum_{i=1}^{n} (Y_{i} - \bar{Y}_{grand})^2
  \]
  where \( Y_{i} \) is each data point and \( \bar{Y}_{grand} \) is the grand mean (mean of all data points across groups).

- **SSE** (Explained Sum of Squares):
  \[
  SSE = \sum_{i=1}^{k} n_i (\bar{Y}_i - \bar{Y}_{grand})^2
  \]
  where \( n_i \) is the sample size of group \( i \), \( \bar{Y}_i \) is the mean of group \( i \), and \( \bar{Y}_{grand} \) is the grand mean.

- **SSR** (Residual Sum of Squares):
  \[
  SSR = SST - SSE
  \]
  where SSR is the difference between the total variability (SST) and the explained variability (SSE).

### **Steps to Calculate in Python:**

1. **Calculate the Grand Mean** (mean of all data points across groups).
2. **Calculate the Group Means** (mean of each group).
3. **Compute SST**, **SSE**, and **SSR** using the above formulas.

### **Example Python Code:**


In [1]:
import numpy as np

# Sample data for three groups
group1 = np.array([23, 21, 22, 20, 24])
group2 = np.array([31, 29, 30, 32, 34])
group3 = np.array([40, 42, 41, 39, 43])

# Combine all groups into a single array for total data
all_data = np.concatenate([group1, group2, group3])

# Calculate the grand mean
grand_mean = np.mean(all_data)

# Calculate the group means
group1_mean = np.mean(group1)
group2_mean = np.mean(group2)
group3_mean = np.mean(group3)

# Number of observations in each group
n1, n2, n3 = len(group1), len(group2), len(group3)

# Total sum of squares (SST)
SST = np.sum((all_data - grand_mean)**2)

# Explained sum of squares (SSE)
SSE = n1 * (group1_mean - grand_mean)**2 + n2 * (group2_mean - grand_mean)**2 + n3 * (group3_mean - grand_mean)**2

# Residual sum of squares (SSR)
SSR = SST - SSE

# Print results
print(f"Total Sum of Squares (SST): {SST}")
print(f"Explained Sum of Squares (SSE): {SSE}")
print(f"Residual Sum of Squares (SSR): {SSR}")

Total Sum of Squares (SST): 937.6
Explained Sum of Squares (SSE): 902.8
Residual Sum of Squares (SSR): 34.80000000000007


### **Explanation of Code:**
1. We define three groups of data.
2. We calculate the grand mean of all the data points combined.
3. We calculate the mean of each group.
4. Using the formulae for SST, SSE, and SSR, we compute the values.

### **Interpretation:**
- **SST (total variability)** represents the overall variation in the data.
- **SSE (explained variability)** tells us how much of the variation is explained by the differences between the groups.
- **SSR (residual variability)** represents the error or variation that cannot be explained by the groups.

These calculations are essential in performing ANOVA and understanding the sources of variation in the data.

# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

## Calculating Main Effects and Interaction Effects in a Two-Way ANOVA Using Python

In a two-way ANOVA, we examine the effects of two independent variables (factors) on a dependent variable. The analysis looks at two kinds of effects, which together give three tests (one for each main effect and one for the interaction):
1. **Main Effects:** The individual effects of each factor on the dependent variable.
2. **Interaction Effect:** The extent to which the effect of one factor depends on the level of the other factor.

### **Steps to Calculate Main Effects and Interaction Effects:**
1. **Main Effect of Factor A:** This measures how the levels of Factor A influence the dependent variable, averaging over the levels of Factor B.
2. **Main Effect of Factor B:** This measures how the levels of Factor B influence the dependent variable, averaging over the levels of Factor A.
3. **Interaction Effect (A × B):** This measures whether the effect of Factor A changes across the levels of Factor B (and vice versa), beyond what the two main effects explain.

### **Example Python Code:**

We'll use the `statsmodels` library to perform the two-way ANOVA and calculate the main and interaction effects.


In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Sample data
data = {
    'FactorA': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A3', 'A3', 'A3'],
    'FactorB': ['B1', 'B2', 'B1', 'B2', 'B1', 'B2', 'B1', 'B2', 'B1'],
    'Response': [23, 21, 22, 20, 24, 26, 28, 27, 30]
}

# Convert data into a DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Response ~ C(FactorA) + C(FactorB) + C(FactorA):C(FactorB)', data=df).fit()

# Perform the ANOVA (anova_lm defaults to Type I, i.e. sequential, sums of squares)
anova_results = anova_lm(model)

# Print the ANOVA table
print(anova_results)

                        df     sum_sq    mean_sq         F    PR(>F)
C(FactorA)             2.0  66.888889  33.444444  4.894309  0.113618
C(FactorB)             1.0   4.500000   4.500000  0.658537  0.476484
C(FactorA):C(FactorB)  2.0   0.333333   0.166667  0.024390  0.976096
Residual               3.0  20.500000   6.833333       NaN       NaN


## Interpretation of Two-Way ANOVA Results

In a two-way ANOVA, we analyze the effects of two independent factors on a dependent variable. The results from the ANOVA table include **main effects** and **interaction effects**.

### **Key Components of the ANOVA Table:**
1. **Main Effect of Factor A:**  
   - This tests whether different levels of Factor A significantly affect the dependent variable.
   - A low p-value (typically < 0.05) suggests a significant impact.

2. **Main Effect of Factor B:**  
   - This tests whether different levels of Factor B significantly affect the dependent variable.
   - A low p-value (typically < 0.05) suggests a significant impact.

3. **Interaction Effect (A × B):**  
   - This tests whether the effect of one factor depends on the level of the other factor.
   - A significant interaction effect (p-value < 0.05) suggests that Factor A and Factor B do not act independently.

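### **Example Interpretation of Hypothetical Results**

The table below uses made-up values (not the output above) purely to show how each row is read; its p-values are illustrative and were not computed from the listed degrees of freedom.
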
| Source               | df  | Sum of Squares | Mean Square | F-Statistic | p-value |
|----------------------|----|---------------|------------|------------|--------|
| Factor A            | 2  | 60.000        | 30.000     | 5.000      | 0.020  |
| Factor B            | 1  | 50.000        | 50.000     | 8.333      | 0.012  |
| Interaction (A × B) | 2  | 30.000        | 15.000     | 2.500      | 0.090  |
| Residual            | 3  | 18.000        | 6.000      | -          | -      |

### **Interpretation**
- **Factor A has a significant effect (p = 0.020):**  
  This means that the levels of Factor A significantly influence the dependent variable.
  
- **Factor B has a significant effect (p = 0.012):**  
  This suggests that the levels of Factor B also significantly influence the dependent variable.
  
- **The interaction effect (p = 0.090) is not significant:**  
  There is no significant evidence that Factor A and Factor B interact, so their effects on the dependent variable can be treated as additive. A non-significant test does not prove that no interaction exists.

### **Conclusion**
- If the **main effects** are significant but the **interaction effect** is not, we can interpret the individual effects of Factor A and Factor B separately.
- If the **interaction effect is significant**, we should interpret the effects of Factor A and Factor B together rather than in isolation.
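
One practical point when computing these effects: the sample data in the code above is unbalanced (the cells do not all contain the same number of observations), and with unbalanced data the default sequential (Type I) sums of squares depend on the order of the terms in the formula. The sketch below refits the same model and requests Type II sums of squares through the `typ` argument of `anova_lm`; choosing Type II here is an illustrative assumption, not the only reasonable option.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Same sample data as in the example above
df = pd.DataFrame({
    "FactorA": ["A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"],
    "FactorB": ["B1", "B2", "B1", "B2", "B1", "B2", "B1", "B2", "B1"],
    "Response": [23, 21, 22, 20, 24, 26, 28, 27, 30],
})

# Cell counts reveal whether the design is balanced
cell_counts = pd.crosstab(df["FactorA"], df["FactorB"])
print(cell_counts)

# A * B expands to both main effects plus their interaction
model = ols("Response ~ C(FactorA) * C(FactorB)", data=df).fit()

# Type II sums of squares: each main effect is adjusted for the other
type2 = anova_lm(model, typ=2)
print(type2)
```

The residual row is the same under either type; only the sums of squares attributed to the factors can change.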


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

## Interpretation of One-Way ANOVA Results

### Given Information:
- **F-statistic:** 5.23
- **p-value:** 0.02

### **Conclusion:**
- **p-value = 0.02**: The p-value is less than the common significance level of 0.05, so the null hypothesis that all group means are equal is rejected. In other words, at least one group mean differs significantly from the others.

- **F-statistic = 5.23**: This F-statistic value represents the ratio of the variance explained by the group differences (between-group variance) to the variance within the groups (within-group variance). A larger F-statistic suggests that the between-group variance is greater than the within-group variance, reinforcing the evidence that there are differences between the groups.

### **Interpretation:**
Since the p-value is less than the significance level (0.05), we reject the null hypothesis and conclude that **there is a significant difference between the groups**. However, this does not tell us which specific groups are different. To determine where the differences lie, you may perform a **post-hoc test** (such as Tukey's HSD) to identify which pairs of groups are significantly different from each other.

### **Summary:**
- **Reject the null hypothesis**: There is significant evidence to suggest that not all group means are equal.
- **Next step**: Perform post-hoc tests to identify which groups differ from one another.
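
To make the link between the F-statistic and the p-value concrete, the p-value is the right-tail probability of the F distribution at the observed statistic. The question does not state the degrees of freedom, so the values below (3 groups and 17 observations, giving 2 and 14) are an assumption chosen because they yield a p-value close to 0.02; the helper name `f_test_p_value` is also just illustrative.

```python
from scipy import stats

def f_test_p_value(f_stat, df_between, df_within):
    """Right-tail probability of the F distribution, i.e. the ANOVA p-value."""
    return stats.f.sf(f_stat, df_between, df_within)

# Assumed degrees of freedom: k - 1 = 2 and N - k = 14 (not given in the question)
p = f_test_p_value(5.23, df_between=2, df_within=14)
print(f"p-value for F = 5.23 with (2, 14) df: {p:.4f}")
```

With different degrees of freedom the same F-statistic gives a different p-value, which is why both numbers are normally reported together with the degrees of freedom.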


# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

## Handling Missing Data in Repeated Measures ANOVA

In a **repeated measures ANOVA**, we examine how subjects' scores change over time or under different conditions, with each subject being measured multiple times. Missing data can arise due to various reasons, such as non-response or dropouts. Handling missing data appropriately is crucial for the validity of the analysis.

### **Methods to Handle Missing Data:**

1. **Complete Case Analysis (Listwise Deletion):**
   - In this method, you exclude any subjects who have missing data for one or more time points or conditions.
   - **Pros:**
     - Simple to implement and understand.
     - Every analysed subject contributes a full set of measurements, so a standard repeated measures ANOVA can be run without modification.
   - **Cons:**
     - Can lead to loss of statistical power if many subjects are excluded.
     - May introduce bias if the missing data is not missing completely at random (MCAR).

2. **Mean Substitution:**
   - Missing values are replaced with the mean of that variable (across all subjects) or the mean for the subject's group.
   - **Pros:**
     - Easy to implement.
     - Can preserve sample size.
   - **Cons:**
     - Reduces variability in the data, potentially underestimating the true variation.
     - Can introduce bias if the missingness is not MCAR.

3. **Multiple Imputation:**
   - Multiple imputation involves creating several different plausible values for each missing data point and then averaging the results from analyses conducted on each imputed dataset.
   - **Pros:**
     - Accounts for uncertainty about missing data.
     - Preserves the variability in the data and statistical power.
   - **Cons:**
     - More complex to implement.
     - Requires statistical software and expertise in the technique.

4. **Maximum Likelihood Estimation (MLE):**
   - MLE uses the available data to estimate the parameters of the model, effectively handling missing values by considering the distribution of the data.
   - **Pros:**
     - Utilizes all available data without imputation.
     - Works well under the assumption that data are missing at random (MAR).
   - **Cons:**
     - Assumes that the data are missing at random (MAR), which may not always be true.
     - Requires statistical software and more advanced knowledge.

5. **Last Observation Carried Forward (LOCF):**
   - In this method, the last observed value for a subject is used to fill in missing data for subsequent time points.
   - **Pros:**
     - Simple and preserves data continuity.
   - **Cons:**
     - Can lead to bias, as it assumes that the subject's condition does not change over time.
     - May overestimate the stability of the subject’s response.

### **Consequences of Using Different Methods:**

- **Complete Case Analysis (Listwise Deletion):**
  - Can lead to biased estimates if the missing data are not missing completely at random (MCAR).
  - Reduces statistical power by excluding data from subjects with missing values.
  
- **Mean Substitution:**
  - Can lead to underestimation of variance, as it artificially reduces the variability in the data.
  - May produce biased estimates if the missing data is not MCAR.

- **Multiple Imputation:**
  - Provides more accurate estimates and reflects the uncertainty due to missing data.
  - May be complex and requires software expertise, but it is widely regarded as one of the best methods for handling missing data.

- **Maximum Likelihood Estimation (MLE):**
  - Handles missing data efficiently when data are missing at random (MAR) and maintains statistical power.
  - However, if the assumption of MAR is violated, MLE can yield biased estimates.

- **Last Observation Carried Forward (LOCF):**
  - May result in biased conclusions, especially if the data exhibit non-stationary trends over time.
  - Assumes that missing data are similar to the last observed data point, which may not be valid.

### **Best Practices:**
- **If data are missing at random (MAR):** Methods like **Multiple Imputation** or **Maximum Likelihood Estimation (MLE)** are preferred as they account for the uncertainty associated with missing values.
- **If data are missing completely at random (MCAR):** Listwise deletion might be acceptable, but it's still recommended to use methods like **Multiple Imputation** for more robust results.
- **If data are not missing at random (MNAR):** Handling missing data becomes more challenging, and advanced techniques like **Model-based Approaches** or **Sensitivity Analysis** may be necessary.
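
The three simplest methods above can be sketched in a few lines of `pandas`. The wide-format table below (one row per subject, one column per time point) is hypothetical; multiple imputation and maximum likelihood need dedicated modelling tools and are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are subjects, columns are time points
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 9.0],
        "t2": [11.0, np.nan, 12.0, 10.0],
        "t3": [13.0, 14.0, np.nan, 12.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Listwise deletion: keep only subjects with every time point observed
complete_cases = scores.dropna()

# Mean substitution: fill each gap with that time point's mean
mean_filled = scores.fillna(scores.mean())

# LOCF: carry each subject's last observed value forward across time
locf = scores.ffill(axis=1)

print(f"Subjects kept after listwise deletion: {len(complete_cases)} of {len(scores)}")
print(f"Variance of t2 before / after mean substitution: "
      f"{scores['t2'].var():.3f} / {mean_filled['t2'].var():.3f}")
print(locf)
```

Even on this tiny example the trade-offs are visible: listwise deletion halves the sample, and mean substitution shrinks the variance of the filled column.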



# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

## Common Post-Hoc Tests Used After ANOVA

When ANOVA indicates that there is a significant difference between group means, post-hoc tests are conducted to determine which specific groups differ from each other. Most of these tests control for Type I error (false positives) by adjusting for multiple comparisons; Fisher's LSD is the notable exception.

### **1. Tukey’s Honest Significant Difference (HSD) Test:**
- **Purpose:** Tukey's HSD test is used to compare all possible pairs of means while controlling for the family-wise error rate (the probability of making one or more Type I errors across multiple comparisons).
- **When to Use:**
  - After performing a one-way ANOVA when you want to test all pairwise differences between groups.
  - It’s the most commonly used post-hoc test due to its balance of simplicity and power.
- **Example:** Suppose we have a study comparing the effects of three different diets (low-carb, high-protein, and balanced) on weight loss. After finding that there are overall differences in weight loss among the three groups using ANOVA, Tukey's test can be used to determine which specific diets differ significantly from one another.

### **2. Bonferroni Correction:**
- **Purpose:** The Bonferroni correction adjusts the significance threshold to account for multiple comparisons, making the test more stringent.
- **When to Use:**
  - When you have a limited number of comparisons and want a more conservative method.
  - It's particularly useful when the number of comparisons is small (e.g., fewer than 10), as it can be too conservative when the number of tests is large.
- **Example:** If a researcher is comparing four different teaching methods in a classroom experiment and performing three planned pairwise comparisons, the Bonferroni correction would divide the significance level by the number of comparisons, so each test is evaluated at 0.05 / 3 ≈ 0.0167.

### **3. Scheffé’s Test:**
- **Purpose:** Scheffé’s test is a more conservative post-hoc test that can be used for all possible contrasts (comparisons of means). It is particularly useful for complex comparisons that go beyond simple pairwise comparisons.
- **When to Use:**
  - When you have a large number of groups and want to perform general contrasts, not just pairwise comparisons.
  - It’s more flexible but also less powerful than other tests like Tukey’s HSD.
- **Example:** In a study comparing the effectiveness of five different types of therapies for treating depression, Scheffé's test would allow the researcher to test not only pairwise differences but also more complex contrasts, such as comparing the average of some therapies against others.

### **4. Dunnett’s Test:**
- **Purpose:** Dunnett’s test is used when comparing multiple treatment groups to a single control group. It is specifically designed to compare each treatment group against the control group while controlling for the Type I error rate.
- **When to Use:**
  - When you have a control group and several treatment groups, and you are only interested in comparing each treatment to the control.
  - Because it makes fewer comparisons than Tukey's HSD (each treatment against the control only, rather than all pairs), it is more powerful for that specific purpose.
- **Example:** In a clinical trial testing four new drugs for lowering cholesterol, Dunnett’s test would be used to compare each drug to a placebo (control group).

### **5. Fisher’s Least Significant Difference (LSD) Test:**
- **Purpose:** Fisher's LSD test is a simple post-hoc test that performs pairwise comparisons after an ANOVA if the overall test is significant. It does not adjust for the number of comparisons, so it has a higher risk of Type I error.
- **When to Use:**
  - When the number of comparisons is small, and you are willing to accept a higher risk of Type I error.
  - It should be used with caution, as it does not control the family-wise error rate.
- **Example:** If an experiment on student test scores between three study methods results in significant overall differences, Fisher’s LSD test could be applied to compare the means of each method. However, this test should be used with caution if the number of methods (groups) is large.

### **6. Holm-Bonferroni Method:**
- **Purpose:** The Holm-Bonferroni method is a step-down procedure that controls the family-wise error rate by adjusting p-values in a sequential manner.
- **When to Use:**
  - When you want a less conservative approach than the Bonferroni correction but still want to control for Type I error across multiple comparisons.
- **Example:** If you have 10 groups and wish to make multiple comparisons between them, the Holm-Bonferroni method would be a less strict alternative to the Bonferroni correction.
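
The difference between the Bonferroni and Holm-Bonferroni adjustments can be seen on a small set of p-values. The sketch below uses `multipletests` from `statsmodels`; the three raw p-values are made up for illustration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from three pairwise comparisons
raw_p = [0.01, 0.02, 0.03]

# Bonferroni: every p-value is multiplied by the number of tests
reject_bonf, p_bonf, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

# Holm: step-down procedure, the multiplier shrinks as p-values are ranked
reject_holm, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

print("Bonferroni adjusted:", p_bonf, "reject:", reject_bonf)
print("Holm adjusted:      ", p_holm, "reject:", reject_holm)
```

On these inputs Bonferroni rejects only the smallest p-value, while Holm rejects all three, which illustrates why Holm is described as less conservative while still controlling the family-wise error rate.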

### **Example Scenario Where Post-Hoc Test is Necessary:**
Imagine a study comparing the effect of three different marketing strategies (strategy A, strategy B, and strategy C) on sales performance. The one-way ANOVA reveals that there are significant differences among the strategies. However, the ANOVA doesn't tell you which specific pairs of strategies (A vs. B, A vs. C, or B vs. C) are different.

In this case, a **Tukey HSD test** would be performed as a post-hoc analysis to compare each pair of strategies and identify which specific strategy or strategies differ significantly from others.
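
A minimal sketch of that follow-up, using `pairwise_tukeyhsd` from `statsmodels`, is shown here. The sales figures are invented for illustration and the strategies are deliberately well separated.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical sales figures, five observations per strategy
sales = np.array(
    [23, 21, 22, 20, 24, 31, 29, 30, 32, 34, 40, 42, 41, 39, 43], dtype=float
)
strategy = np.repeat(["A", "B", "C"], 5)

# Compare every pair of strategies while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=sales, groups=strategy, alpha=0.05)
print(tukey.summary())
```

Each row of the summary corresponds to one pair of strategies and reports the mean difference, an adjusted p-value, a confidence interval, and whether the null hypothesis of equal means is rejected.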



# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

To conduct a one-way ANOVA using Python to compare the mean weight loss of three diets (A, B, and C), we can follow the steps below.

Here’s how we would perform the analysis in Python:

In [3]:
import numpy as np
from scipy import stats

# Simulated sample data (50 draws per diet; no random seed is set, so results vary between runs)
diet_A = np.random.normal(5, 2, 50)  # Mean = 5, SD = 2
diet_B = np.random.normal(6, 2.5, 50)  # Mean = 6, SD = 2.5
diet_C = np.random.normal(4, 1.5, 50)  # Mean = 4, SD = 1.5

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")


F-statistic: 10.774303645042563
P-value: 4.300552292471652e-05


## Interpretation of Results:

- **F-statistic: 10.77**: The F-statistic value is quite large, indicating that there is more variability between the group means compared to the variability within the groups. This suggests that the diets may have a significant effect on weight loss.

- **P-value: 4.30e-05**: The p-value is extremely small, much less than the common significance level of 0.05. This means that there is strong evidence to reject the null hypothesis.

## Conclusion:
Since the p-value is **less than 0.05**, we reject the null hypothesis. This indicates that **there are significant differences** in the mean weight loss between at least two of the diets (A, B, and C).

Therefore, we can conclude that the weight loss observed in participants is likely affected by the type of diet they followed, and at least one of the diets results in a significantly different weight loss outcome compared to the others. To determine which specific diets differ from each other, further post-hoc tests, such as Tukey’s HSD, would be necessary.
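
The question describes 50 participants in total, whereas the simulation above draws 50 values per diet without a seed. A closer, reproducible sketch is given below; the group sizes (17, 17, 16), the true means, the common standard deviation of 2, and the seed are all assumptions made for illustration, so the resulting numbers are not the ones reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Randomly assign 50 participants to the three diets (17 / 17 / 16)
diets = rng.permutation(np.array(["A"] * 17 + ["B"] * 17 + ["C"] * 16))

# Assumed true mean weight loss per diet, common standard deviation of 2
true_means = {"A": 5.0, "B": 6.0, "C": 4.0}
weight_loss = np.array([rng.normal(true_means[d], 2.0) for d in diets])

# Split the outcomes by diet and run the one-way ANOVA
samples = [weight_loss[diets == d] for d in ("A", "B", "C")]
f_stat, p_val = stats.f_oneway(*samples)

print(f"Group sizes: {[len(s) for s in samples]}")
print(f"F-statistic: {f_stat:.3f}, p-value: {p_val:.4f}")
```

The interpretation follows the same rule as above: a p-value below 0.05 leads to rejecting the hypothesis of equal means, followed by a post-hoc test.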


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

To conduct a two-way ANOVA using Python, we need data that includes the two factors: software program (Program A, B, C) and employee experience level (novice vs. experienced). Here’s how you can conduct a two-way ANOVA and interpret the results.

### Step-by-step Approach:

1. **Prepare the Data:** We have two factors:
   - Factor 1: Software Program (A, B, C)
   - Factor 2: Employee Experience (Novice, Experienced)
2. **Perform Two-Way ANOVA:** Use Python’s `statsmodels` package for two-way ANOVA. This will help us evaluate:
   - Main effect of Software Program.
   - Main effect of Employee Experience.
   - Interaction effect between Software Program and Employee Experience.

Here is an example of how the Python code might look:

In [4]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Sample Data (30 employees, 3 software programs, 2 experience levels)
np.random.seed(42)

# Generating random time data
program_A = np.random.normal(30, 5, 10)  # Program A (mean time = 30)
program_B = np.random.normal(35, 5, 10)  # Program B (mean time = 35)
program_C = np.random.normal(40, 5, 10)  # Program C (mean time = 40)

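# Experience levels: the first 15 rows are Novice and the last 15 Experienced,
# so Program A is all Novice, Program C is all Experienced, and only Program B mixes both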
experience_novice = ['Novice'] * 15 + ['Experienced'] * 15

# Create DataFrame
data = pd.DataFrame({
    'Time': np.concatenate([program_A, program_B, program_C]),
    'Program': ['A']*10 + ['B']*10 + ['C']*10,
    'Experience': experience_novice
})

# Fit the model
model = ols('Time ~ Program * Experience', data=data).fit()

# Perform ANOVA
anova_results = anova_lm(model)

# Display results
print(anova_results)


                      df      sum_sq     mean_sq          F    PR(>F)
Program              2.0  357.276664  178.638332  11.798952  0.000226
Experience           1.0    1.384525    1.384525   0.091447  0.764752
Program:Experience   2.0    9.841029    4.920515   0.324997  0.725423
Residual            26.0  393.644861   15.140187        NaN       NaN
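
As a quick check on how the table's columns relate, the sketch below recomputes the F-statistics and p-values from the printed mean squares: each F is the effect's mean square divided by the residual mean square, and the p-value is the upper tail of the F distribution. The variable names are illustrative, and the numbers are copied from the table above rather than recomputed from the data.

```python
from scipy import stats

# Mean squares and degrees of freedom copied from the printed ANOVA table
ms_residual, df_residual = 15.140187, 26
effects = {
    'Program': (178.638332, 2),
    'Experience': (1.384525, 1),
    'Program:Experience': (4.920515, 2),
}

recomputed = {}
for name, (mean_sq, df_effect) in effects.items():
    f_value = mean_sq / ms_residual                       # F = MS_effect / MS_residual
    p_value = stats.f.sf(f_value, df_effect, df_residual)  # upper-tail probability
    recomputed[name] = (f_value, p_value)
    print(f"{name}: F = {f_value:.4f}, p = {p_value:.6f}")
```

The recomputed values should agree with the `F` and `PR(>F)` columns up to rounding.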


## Interpretation of Results:

### Main Effect of Program:
- **F-statistic**: 11.80
- **p-value**: 0.0002
- Since the p-value is less than 0.05, we **reject the null hypothesis** and conclude that there is a significant difference in the average time it takes to complete the task between the software programs.

### Main Effect of Experience:
- **F-statistic**: 0.09
- **p-value**: 0.7648
- Since the p-value is greater than 0.05, we **fail to reject the null hypothesis**: the data show no significant difference in task completion time based on employee experience level.

### Interaction Effect:
- **F-statistic**: 0.32
- **p-value**: 0.7254
- Since the p-value is greater than 0.05, we **fail to reject the null hypothesis**: the data show no significant interaction effect between the software programs and employee experience level.

## Conclusion:
- **Software Program**: There is a significant difference in task completion times across the three software programs.
- **Employee Experience**: Employee experience does not significantly affect the task completion time.
- **Interaction Effect**: There is no significant interaction between software program choice and employee experience level.

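If the interaction effect were significant, it would imply that the effect of one factor (software program) depends on the level of the other factor (employee experience).

One caveat about the simulated data: the experience labels are assigned to the first and last 15 rows, so Program A contains only novices and Program C only experienced employees. Because the factors are not fully crossed, the Experience and interaction rows should be read with caution; the table's degrees of freedom (2 + 1 + 2 + 26 = 31) exceed the 29 available from 30 observations, which is a symptom of this layout.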
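
A minimal sketch of a fully crossed version of this design, assuming 5 employees in each Program x Experience cell. The data are simulated with the same program means used above (30, 35, 40) and no built-in experience or interaction effect; the names `balanced`, `cell_counts`, and `cell_means` are illustrative. The table of cell means is what an interaction would show up in: roughly parallel rows suggest no interaction.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(42)

# Fully crossed layout: 3 programs x 2 experience levels x 5 employees = 30 rows
programs = np.repeat(['A', 'B', 'C'], 10)
experience = np.tile(np.repeat(['Novice', 'Experienced'], 5), 3)

# Simulated times: program means 30/35/40, SD 5 (assumed values for illustration)
program_means = {'A': 30, 'B': 35, 'C': 40}
times = np.array([rng.normal(program_means[p], 5) for p in programs])

balanced = pd.DataFrame({'Time': times, 'Program': programs, 'Experience': experience})

# Every Program x Experience cell should contain 5 employees
cell_counts = pd.crosstab(balanced['Program'], balanced['Experience'])
print(cell_counts)

# Mean completion time per cell; non-parallel patterns hint at an interaction
cell_means = balanced.pivot_table(values='Time', index='Program',
                                  columns='Experience', aggfunc='mean')
print(cell_means.round(2))

# Two-way ANOVA with interaction; C() marks the factors as categorical
model = ols('Time ~ C(Program) * C(Experience)', data=balanced).fit()
table = anova_lm(model, typ=2)
print(table)
```

With a balanced layout the degrees of freedom add up as expected (2 + 1 + 2 + 24 = 29).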


# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.



To conduct a two-sample t-test to determine if there are any significant differences in test scores between the control group and the experimental group, we'll follow these steps:

Step-by-Step Approach:

1. **Prepare the Data:** We have two groups:
   - Control Group: Traditional teaching method.
   - Experimental Group: New teaching method.

2. **Perform the Two-Sample T-test:** Use Python's scipy.stats.ttest_ind to compare the two groups.

3. **Interpret the Results:** Check the p-value to determine if there is a significant difference between the groups.

4. **Post-Hoc Test:** With only two groups there is a single pairwise comparison, so a significant t-test already identifies which groups differ. The follow-up is to compare the group means to see the direction of the difference. Post-hoc procedures such as pairwise t-tests or Tukey's HSD become necessary only when an ANOVA compares three or more groups.

Python Code Example:

In [5]:
import numpy as np
from scipy import stats

# Generate sample data for both groups (control and experimental)
np.random.seed(42)

# Control group test scores (traditional method)
control_group = np.random.normal(75, 10, 50)  # Mean = 75, SD = 10, n = 50

# Experimental group test scores (new method)
experimental_group = np.random.normal(80, 10, 50)  # Mean = 80, SD = 10, n = 50

# Perform two-sample t-test
t_stat, p_value = stats.ttest_ind(control_group, experimental_group)

# Output the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("There is a significant difference between the two groups.")
else:
    print("There is no significant difference between the two groups.")


T-statistic: -4.108723928204809
P-value: 8.261945608702613e-05
There is a significant difference between the two groups.


## Interpretation:

- **T-statistic**: -4.109
- **P-value**: 8.26e-05
- Since the p-value is **much less than 0.05**, we reject the null hypothesis and conclude that there is a significant difference between the test scores of the control group and the experimental group.

## Conclusion:

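- The new teaching method (experimental group) produces significantly different test scores compared to the traditional teaching method (control group).
- The t-statistic is negative and the control group was passed first to `ttest_ind`, so the experimental group's mean score is the higher one. Together with the significant p-value, this supports the conclusion that the new teaching method improves test scores.
- Because only two groups are compared, no separate post-hoc test is needed.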
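
A minimal sketch of the follow-up for the two-group case, assuming the same simulated data as above. It reports the direction of the difference, Welch's t-test (which does not assume equal variances), and Cohen's d as an effect size. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Same simulated scores as in the cell above
np.random.seed(42)
control_group = np.random.normal(75, 10, 50)
experimental_group = np.random.normal(80, 10, 50)

# Direction of the effect: positive means the experimental group scored higher
mean_diff = experimental_group.mean() - control_group.mean()

# Welch's t-test drops the equal-variance assumption of the default t-test
t_welch, p_welch = stats.ttest_ind(control_group, experimental_group, equal_var=False)

# Cohen's d using the pooled standard deviation
n1, n2 = len(control_group), len(experimental_group)
pooled_var = ((n1 - 1) * control_group.var(ddof=1)
              + (n2 - 1) * experimental_group.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = mean_diff / np.sqrt(pooled_var)

print(f"Mean difference (experimental - control): {mean_diff:.2f}")
print(f"Welch t = {t_welch:.3f}, p = {p_welch:.2e}")
print(f"Cohen's d = {cohens_d:.2f}")
```

Reporting the mean difference and effect size alongside the p-value shows how large the improvement is, not only that it is statistically detectable.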


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.


To conduct a repeated measures ANOVA to determine if there are any significant differences in the average daily sales of three retail stores (Store A, Store B, and Store C), we will follow these steps:

Step-by-Step Approach:

1. **Prepare the Data:** We have three stores with sales data for the same 30 days. Each day acts as the "subject", and the three stores are the repeated measurements taken on it.

2. **Perform the Repeated Measures ANOVA:** Python's statsmodels package provides `AnovaRM` for this design, with Day as the subject and Store as the within-subject factor.

3. **Interpret the Results:** Check the p-value to determine if there is a significant difference in sales between the stores.

4. **Post-Hoc Test:** If the results are significant, we can follow up with a post-hoc test, such as pairwise comparisons using Tukey's HSD test, to identify which stores differ significantly.

### Python Code

In [6]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data for three stores (Store A, Store B, Store C)
np.random.seed(42)
days = np.arange(1, 31)  # 30 days

# Sales data for Store A, B, C
store_a_sales = np.random.normal(1500, 200, 30)  # Mean = 1500, SD = 200
store_b_sales = np.random.normal(1600, 250, 30)  # Mean = 1600, SD = 250
store_c_sales = np.random.normal(1550, 220, 30)  # Mean = 1550, SD = 220

# Create a DataFrame for repeated measures ANOVA
df = pd.DataFrame({
    'Day': np.tile(days, 3),  # Repeat days for each store
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales]),
    'Store': np.repeat(['Store A', 'Store B', 'Store C'], 30)
})

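# Fit a one-way ANOVA on Store.
# Note: this treats all 90 observations as independent; it does not use Day as
# the repeated-measures subject (AnovaRM is imported above but not used here).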
model = ols('Sales ~ C(Store)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# If the results are significant, follow up with a post-hoc test (Tukey's HSD)
tukey = pairwise_tukeyhsd(df['Sales'], df['Store'], alpha=0.05)
print(tukey)


                sum_sq    df         F    PR(>F)
C(Store)  1.999011e+05   2.0  2.234164  0.113191
Residual  3.892148e+06  87.0       NaN       NaN
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1  group2 meandiff p-adj    lower    upper   reject
---------------------------------------------------------
Store A Store B 107.3388  0.127  -22.8828 237.5604  False
Store A Store C   90.464 0.2279  -39.7576 220.6856  False
Store B Store C -16.8747 0.9488 -147.0963 113.3469  False
---------------------------------------------------------


## Interpretation:

- **ANOVA Table**:
    - **F-statistic**: 2.23
    - **P-value**: 0.113
    - Since the p-value is **greater than 0.05**, we fail to reject the null hypothesis. This suggests that there are no significant differences in average daily sales between the three stores.

- **Post-Hoc Test (Tukey's HSD)**:
    - **Store A vs Store B**: p-value = 0.127 (not significant)
    - **Store A vs Store C**: p-value = 0.2279 (not significant)
    - **Store B vs Store C**: p-value = 0.9488 (not significant)

## Conclusion:

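- The ANOVA fitted above indicates that there are **no significant differences** in the average daily sales between Store A, Store B, and Store C (**p-value = 0.113**). Note that the code fits a one-way model (residual df = 87) that treats the 90 observations as independent; a repeated measures ANOVA with Day as the subject would have 2 and 58 degrees of freedom.
- The post-hoc Tukey's HSD test also shows that none of the pairwise comparisons between the stores are statistically significant. Because the omnibus test is not significant, the post-hoc test is shown for completeness only.
- Therefore, based on this analysis, we find no evidence that the store (A, B, C) has a significant effect on daily sales.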
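
A minimal sketch of the repeated measures version, assuming the same simulated data as above with Day as the subject and Store as the within-subject factor. It uses statsmodels' `AnovaRM`, followed by paired t-tests with a Bonferroni correction as one possible post-hoc choice for paired data (Tukey's HSD assumes independent groups). The names `long_df`, `rm_table`, and `pairs` are illustrative. Because the simulated sales are drawn independently, there is no real day-to-day correlation here, so the result may be close to the one-way table.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Same simulated sales as in the cell above
np.random.seed(42)
days = np.arange(1, 31)
sales = {
    'Store A': np.random.normal(1500, 200, 30),
    'Store B': np.random.normal(1600, 250, 30),
    'Store C': np.random.normal(1550, 220, 30),
}

# Long format: one row per (Day, Store) combination
long_df = pd.DataFrame({
    'Day': np.tile(days, 3),
    'Sales': np.concatenate(list(sales.values())),
    'Store': np.repeat(list(sales.keys()), 30),
})

# Repeated measures ANOVA: Day is the subject, Store the within-subject factor
rm_results = AnovaRM(data=long_df, depvar='Sales', subject='Day', within=['Store']).fit()
rm_table = rm_results.anova_table
print(rm_table)

# Post-hoc: paired t-tests on the same days, Bonferroni-adjusted for 3 comparisons
pairs = list(combinations(sales.keys(), 2))
for store_1, store_2 in pairs:
    t_stat, p_raw = stats.ttest_rel(sales[store_1], sales[store_2])
    p_adj = min(p_raw * len(pairs), 1.0)
    print(f"{store_1} vs {store_2}: t = {t_stat:.3f}, adjusted p = {p_adj:.4f}")
```

The post-hoc comparisons are only interpreted when the `Store` row of the repeated measures table is significant.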
