## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

## Ans
-------
Analysis of Variance (ANOVA) is a statistical technique used to compare means across two or more groups or treatments to determine whether there are statistically significant differences among them. To use ANOVA effectively, certain assumptions must be met. Violations of these assumptions can impact the validity of the results. The key assumptions for ANOVA are:

1. **Independence of Observations**: This assumption requires that the observations within and between groups are independent of each other. In other words, the value of one observation should not be influenced by the value of another observation. Violation of this assumption can occur in various ways, such as in time series data or repeated measures designs when observations within the same group are correlated.

   **Example of Violation**: In a study measuring the effect of a new drug on blood pressure, if the same individuals are measured multiple times within a short period, their measurements may be correlated, violating the independence assumption.

2. **Normality**: ANOVA assumes that the residuals (the differences between observed values and the group means) are normally distributed for each group. Violations of this assumption can lead to incorrect p-values and confidence intervals.

   **Example of Violation**: If you're comparing the test scores of students from different schools, and the test scores within each school are not normally distributed, ANOVA results may be unreliable.

3. **Homogeneity of Variances (Homoscedasticity)**: This assumption requires that the variances of the residuals are roughly equal across all groups. When the variances differ substantially, especially if the group sizes are also unequal, the F-test's Type I error rate can be inflated or deflated, so the reported p-value is no longer trustworthy.

   **Example of Violation**: In an ANOVA comparing the yields of different types of crops, if one crop type has much larger variations in yield compared to others, it could violate the assumption of homogeneity of variances.

4. **Additivity (only for models without an interaction term)**: In a two-way or higher-order ANOVA that is fitted with main effects only, the model assumes that the factors act additively, meaning the effect of one factor is the same at every level of the other. An interaction is not a violation of ANOVA in general: a factorial ANOVA can include an interaction term and test it directly. The problem arises only when a real interaction is left out of the model, because the estimated main effects then become misleading.

   **Example of Violation**: Suppose you're studying the effects of temperature and humidity on plant growth with a main-effects-only model. If high temperature affects plant growth differently at different humidity levels, the additivity assumption is violated, and the model should include a temperature-by-humidity interaction term.

5. **Random Sampling**: ANOVA assumes that the data are collected from random samples from each group or treatment. If the samples are not randomly selected, the results may not generalize to the larger population.

   **Example of Violation**: If you're comparing income levels in different cities but only collect data from one specific neighborhood in each city, the sample may not be representative of the entire city, violating the random sampling assumption.
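
The normality and equal-variance assumptions can be checked before running the ANOVA. The snippet below is a minimal sketch, assuming SciPy is available; the three groups are made-up numbers chosen so that the third group is far more variable than the other two. It applies the Shapiro-Wilk test to the pooled residuals and Levene's test to the groups.

```python
import numpy as np
from scipy import stats

# Hypothetical data: group_c is deliberately much more spread out
group_a = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1])
group_b = np.array([10.3, 10.0, 10.2, 10.1, 9.9, 10.4, 10.2, 10.1])
group_c = np.array([2.0, 18.0, 5.0, 15.0, 1.0, 19.0, 8.0, 12.0])
groups = [group_a, group_b, group_c]

# Residuals are the observations minus their own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: a small p-value suggests the residuals are not normal
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene (median-centered): a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(*groups, center="median")

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
```

Independence and random sampling cannot be tested this way; they depend on how the study was designed and how the data were collected.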



## Q2. What are the three types of ANOVA, and in what situations would each be used?

## Ans
------
Analysis of Variance (ANOVA) comes in several types, each designed for different situations and research questions. The three primary types of ANOVA are:

1. **One-Way ANOVA**:
   - **Use Case**: One-Way ANOVA is used when you have one categorical independent variable (factor) with two or more levels (groups or treatments), and you want to determine if there are statistically significant differences in the means of a continuous dependent variable among these groups. In practice it is used with three or more groups, because with exactly two groups it gives the same result as a pooled two-sample t-test.
   - **Example**: You have three different types of fertilizer (A, B, and C), and you want to know if they lead to different plant growth heights.

2. **Two-Way ANOVA**:
   - **Use Case**: Two-Way ANOVA is used when you have two independent categorical variables (factors), and you want to examine how they individually and together affect a continuous dependent variable. It helps identify main effects and interaction effects.
   - **Example**: You're studying the effects of both temperature (low and high) and humidity (low and high) on plant growth. Two-Way ANOVA allows you to see how each factor and their combination affect plant growth.

3. **Repeated Measures ANOVA**:
   - **Use Case**: Repeated Measures ANOVA is used when you have a within-subjects design, meaning you're measuring the same subjects under multiple conditions or time points. It's suitable when you want to see if there are significant differences across the repeated measurements and possibly interactions between these measurements.
   - **Example**: You're testing the memory performance of the same group of participants before training, immediately after training, and one month later. Repeated Measures ANOVA helps assess whether mean memory performance differs across these time points.



## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

## Ans
------
The partitioning of variance in ANOVA is a fundamental concept that helps researchers understand how the total variation in their data can be broken down into different components. This breakdown is crucial for several reasons:

1. **Identifying Sources of Variation**: ANOVA helps identify the sources of variation in a dataset. It allows you to determine whether the observed differences among groups are due to the factors you're studying (such as treatment effects) or if they could be attributed to random chance.

2. **Quantifying Treatment Effects**: By partitioning the variance, ANOVA quantifies how much of the total variation in the data can be attributed to the independent variables (factors or treatments). This provides a measure of the treatment effects' strength and significance.

3. **Hypothesis Testing**: ANOVA uses the partitioned variance to conduct hypothesis tests. It compares the variation between groups (due to treatment effects) to the variation within groups (due to random fluctuations or error). This comparison is essential for determining if there are statistically significant differences among the groups.

4. **Assessing Model Fit**: Understanding the partitioned variance helps assess how well the statistical model (ANOVA model) fits the data. If the model explains a substantial portion of the total variance, it suggests that the model is a good representation of the data and can help in making predictions.

5. **Interpreting Results**: When ANOVA reveals statistically significant differences among groups, knowing how the variance is partitioned allows you to interpret the results effectively. You can attribute the observed differences to specific factors or treatments.

The partitioning of variance splits the total sum of squares into two components:

1. **Between-Group Sum of Squares (SSB)**: This represents the variation between different groups or levels of the independent variable. It reflects the differences caused by the factors or treatments you're studying.

2. **Within-Group Sum of Squares (SSW or SSE)**: This represents the variation within each group. It accounts for the random variability or error in the data that cannot be explained by the independent variable. It's essentially the variation left after accounting for the treatment effects.

Together these make up the **Total Sum of Squares (SST)**, the overall variation of every observation around the grand mean: SST = SSB + SSW.

Mathematically, ANOVA divides each sum of squares by its degrees of freedom to obtain mean squares (MSB = SSB / (k - 1) and MSW = SSW / (N - k), for k groups and N observations) and forms the F-statistic F = MSB / MSW. The F-statistic is used to determine whether the between-group variance is significantly greater than the within-group variance. If the F-statistic is significant, it suggests that the independent variable (the factor or treatment) has a statistically significant effect on the dependent variable.
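
The relationship between the sums of squares and the F-statistic can be checked numerically. The following is a minimal sketch on a small made-up dataset (the numbers and variable names are illustrative only): it computes the between-group, within-group, and total sums of squares by hand, forms the F-statistic from the mean squares, and compares it with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical data for three groups
groups = [
    np.array([12, 14, 16, 15, 13]),
    np.array([22, 24, 25, 26, 23]),
    np.array([32, 35, 36, 34, 33]),
]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Between-group, within-group, and total sums of squares
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
sst = ((all_data - grand_mean) ** 2).sum()

# Mean squares use k - 1 and N - k degrees of freedom
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)
f_manual = (ssb / df_between) / (ssw / df_within)

# Reference value from SciPy
f_scipy, p_scipy = stats.f_oneway(*groups)

print("SSB + SSW:", ssb + ssw, "SST:", sst)
print("F (manual):", f_manual, "F (SciPy):", f_scipy)
```

Here the group means are 14, 24, and 34, so almost all of the total variation is between groups, which is why the F-statistic is so large.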



## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

## Ans
--------

In [1]:
import numpy as np

# Sample data for three groups (replace this with your own data)
group1 = np.array([12, 14, 16, 15, 13])
group2 = np.array([22, 24, 25, 26, 23])
group3 = np.array([32, 35, 36, 34, 33])

# Combine all data into one array
all_data = np.concatenate([group1, group2, group3])

# Calculate the grand mean (overall mean)
grand_mean = np.mean(all_data)

# Calculate the Total Sum of Squares (SST)
SST = np.sum((all_data - grand_mean) ** 2)

# Calculate the Explained Sum of Squares (SSE), i.e. the between-group sum of squares.
# Note: this follows the question's naming. Many texts call this quantity SSB
# and reserve SSE for the error (within-group) term.
groups = [group1, group2, group3]
SSE = sum(len(group) * (np.mean(group) - grand_mean) ** 2 for group in groups)

# Calculate the Residual Sum of Squares (SSR), i.e. the within-group sum of squares
SSR = SST - SSE

# Print the results
print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)


Total Sum of Squares (SST): 1030.0
Explained Sum of Squares (SSE): 1000.0
Residual Sum of Squares (SSR): 30.0


## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

##  Ans
--------

In [3]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (replace with your own data)
data = pd.DataFrame({
    'A': ['A1', 'A1', 'A2', 'A2', 'A1', 'A2', 'A1', 'A2'],
    'B': ['B1', 'B2', 'B1', 'B2', 'B1', 'B2', 'B2', 'B1'],
    'Y': [10, 12, 14, 16, 18, 20, 22, 24]
})

# Perform a two-way ANOVA
formula = 'Y ~ C(A) + C(B) + C(A):C(B)'  # Specify the formula for the model
model = ols(formula, data=data).fit()

# Get the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

# Mean square for each effect (sum of squares / degrees of freedom).
# Whether an effect is significant is judged from the F and PR(>F) columns of anova_table.
mean_sq_A = anova_table.loc['C(A)', 'sum_sq'] / anova_table.loc['C(A)', 'df']
mean_sq_B = anova_table.loc['C(B)', 'sum_sq'] / anova_table.loc['C(B)', 'df']
mean_sq_interaction = anova_table.loc['C(A):C(B)', 'sum_sq'] / anova_table.loc['C(A):C(B)', 'df']

# Print the results
print("Mean Square for Main Effect of A:", mean_sq_A)
print("Mean Square for Main Effect of B:", mean_sq_B)
print("Mean Square for Interaction:", mean_sq_interaction)


Mean Square for Main Effect of A: 18.000000000000114
Mean Square for Main Effect of B: 2.000000000000025
Mean Square for Interaction: 7.999999999999965


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

## Ans
------
Let's interpret the results based on the given F-statistic of 5.23 and a p-value of 0.02:

1. **F-Statistic**: The F-statistic measures the ratio of the variance between the groups to the variance within the groups. In this case, the F-statistic is 5.23.

2. **P-Value**: The p-value associated with the F-statistic tells you the probability of obtaining an F-statistic at least as large as the one you observed, assuming there are no real differences among the group means. In this case, the p-value is 0.02.

Now, let's interpret these results:

- **Null Hypothesis (H0)**: The null hypothesis in ANOVA states that all population group means are equal.

- **Alternative Hypothesis (Ha)**: The alternative hypothesis suggests that at least one group mean is different from the others.

With the given results:

1. **Significance Level**: Before interpreting the results, it's crucial to consider the significance level (alpha) you chose for your analysis. Commonly used significance levels are 0.05 (5%) and 0.01 (1%). If you chose alpha = 0.05, a p-value less than 0.05 is typically considered statistically significant.

2. **Interpretation**:
   - Since the p-value (0.02) is less than the chosen significance level (e.g., 0.05), you can reject the null hypothesis.
   - This means there is evidence to suggest that at least one of the group means is different from the others.

3. **Conclusion**:
   - Based on these results, you can conclude that there are statistically significant differences among the groups.
   - However, the ANOVA test itself does not tell you which specific groups are different from each other. To identify which groups differ, you may need to perform post-hoc tests (e.g., Tukey's HSD, Bonferroni correction) or conduct pairwise comparisons between group means.
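
The decision rule above can be reproduced from the F-distribution. The question does not state the degrees of freedom, so the values below (three groups and 18 observations, giving 2 and 15 degrees of freedom) are assumptions chosen only for illustration. Under those assumptions the computed p-value comes out close to the reported 0.02.

```python
from scipy import stats

f_observed = 5.23
alpha = 0.05

# Assumed degrees of freedom (not given in the question):
# 3 groups -> 2 between-group df; 18 observations -> 15 within-group df
df_between = 2
df_within = 15

# Upper-tail probability of the F-distribution at the observed statistic
p_value = stats.f.sf(f_observed, df_between, df_within)

# Critical value that F must exceed to reject H0 at the chosen alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print("p-value:", p_value)
print("Critical F:", f_critical)
print("Reject H0:", reject_null)
```

Comparing the p-value with alpha and comparing the observed F with the critical F are two views of the same test, so they always lead to the same decision.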


## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

## Ans
--------
Handling missing data in a repeated measures ANOVA is crucial to ensure the validity and accuracy of your analysis. Missing data can occur for various reasons, such as participant non-responses, data recording errors, or dropout in longitudinal studies. 

There are several methods to handle missing data in repeated measures ANOVA, each with its advantages and potential consequences:

1. **Listwise Deletion (Complete Case Analysis)**:
   - **Method**: This method involves removing cases (participants) with any missing data in any of the variables of interest. It analyzes only the data from cases with complete information.
   - **Advantages**: It's straightforward and easy to implement.
   - **Consequences**: It can discard a large share of the data and reduce statistical power. Estimates are unbiased only if the data are missing completely at random (MCAR). If missingness is related to the variables being studied (MAR or MNAR), the results may be biased.

2. **Pairwise Deletion**:
   - **Method**: Pairwise deletion uses all available data for each analysis, and it includes participants with missing data in specific analyses where they have complete data.
   - **Advantages**: It retains more data than listwise deletion, maximizing the available information.
   - **Consequences**: It can lead to different sample sizes for different comparisons, potentially complicating the interpretation of results. It may also produce biased estimates if data are not missing completely at random (MCAR).

3. **Imputation**:
   - **Method**: Imputation involves estimating the missing values using various techniques, such as mean imputation, regression imputation, or multiple imputation.
   - **Advantages**: It retains all cases and can reduce bias in parameter estimates if imputation models are correctly specified. Multiple imputation is preferred when uncertainty in imputed values is a concern.
   - **Consequences**: The accuracy of imputation depends on the chosen method and the assumption that the data are missing at random (MAR). Incorrect imputation methods can introduce bias and artificially reduce variability.

4. **Maximum Likelihood Estimation (MLE)**:
   - **Method**: MLE is a statistical technique that estimates model parameters while accounting for missing data using likelihood-based methods. For repeated measures it is typically applied by fitting a linear mixed-effects model, since a classical repeated measures ANOVA needs complete data for every subject.
   - **Advantages**: MLE gives valid parameter estimates if data are missing at random (MAR) and the model is correctly specified. It uses all available data, which helps preserve statistical power.
   - **Consequences**: It can be computationally intensive and may require specialized software. The MAR assumption is essential for valid results.

5. **Pattern Mixture Models**:
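   - **Method**: Pattern mixture models group participants by their missing data pattern (for example, completers versus early dropouts), model the outcome within each pattern, and then combine the results. They are one approach for data that may be missing not at random (MNAR).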
   - **Advantages**: They can provide insights into how different missing data patterns might affect results.
   - **Consequences**: They can be complex to implement and require careful consideration of the missing data mechanisms.
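
Two of the consequences listed above are easy to see on a toy example. The snippet below is a minimal sketch with made-up scores for five subjects at three time points (the column names `t1`, `t2`, `t3` are hypothetical). It shows how listwise deletion shrinks the sample and how mean imputation artificially reduces variability.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 15.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 16.0, 15.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="subject"),
)

# Listwise deletion: keep only subjects with no missing values
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with its column mean
mean_imputed = wide.fillna(wide.mean())

# Compare the spread of t2 before and after imputation
var_observed = wide["t2"].var()
var_imputed = mean_imputed["t2"].var()

print("Subjects before/after listwise deletion:", len(wide), len(complete_cases))
print("Variance of t2 (observed values only):", var_observed)
print("Variance of t2 (after mean imputation):", var_imputed)
```

Two of the five subjects are dropped by listwise deletion, and the variance of `t2` falls after mean imputation because the imputed value sits exactly at the mean.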


## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

## Ans
-------
After conducting an Analysis of Variance (ANOVA) and finding that there are statistically significant differences among groups, post-hoc tests are often used to determine which specific groups differ from each other. Common post-hoc tests include:

1. **Tukey's Honestly Significant Difference (Tukey's HSD)**:
   - **When to Use**: Tukey's HSD is appropriate when you have conducted a one-way ANOVA with three or more groups, and you want to compare all possible pairs of means to identify which specific pairs are significantly different.
   - **Example**: You are testing the effect of different teaching methods (A, B, C, D) on student test scores. After ANOVA, Tukey's HSD can help determine which teaching methods result in significantly different test scores.

2. **Bonferroni Correction**:
   - **When to Use**: Bonferroni correction is suitable when you want to control the familywise error rate (the probability of making one or more Type I errors) by adjusting the significance level. It's commonly used when conducting multiple pairwise comparisons.
   - **Example**: In a medical study, you are comparing the effects of four different drugs on blood pressure. To maintain an overall Type I error rate of 0.05, Bonferroni correction adjusts the significance level for each pairwise comparison.

3. **Dunnett's Test**:
   - **When to Use**: Dunnett's test is used when you have a control group and want to compare the treatment groups to the control group. It's helpful for identifying which treatment groups are significantly different from the control.
   - **Example**: You're studying the effects of three new diets (A, B, C) on weight loss compared to a control group. Dunnett's test can tell you which diets result in significantly different weight loss compared to the control.

4. **Scheffé's Test**:
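   - **When to Use**: Scheffé's test is a conservative post-hoc test that controls the familywise error rate for all possible contrasts, including complex ones (e.g., the average of two groups versus a third), and it works with unequal sample sizes. It still assumes equal variances, and it is less powerful than Tukey's HSD when only pairwise comparisons are of interest.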
   - **Example**: You're comparing the performance of three different products (X, Y, Z) in a manufacturing process, and the sample sizes are unequal. Scheffé's test can help identify significant differences among the products.

5. **Fisher's Least Significant Difference (LSD)**:
   - **When to Use**: Fisher's LSD is a less stringent test used when you are not concerned about controlling the familywise error rate strictly. It's more sensitive but can have a higher Type I error rate.
   - **Example**: You're analyzing the effects of different levels of a fertilizer (low, medium, high) on crop yield. If you want to quickly identify which pairs of fertilizer levels are significantly different, Fisher's LSD can be used.

6. **Games-Howell**:
   - **When to Use**: Games-Howell is suitable when sample sizes are unequal, and variances are not necessarily equal across groups. It's a post-hoc test for unequal variances.
   - **Example**: You're studying the impact of different teaching methods (A, B, C) on student performance, but the group variances are unequal. Games-Howell can help identify significant differences.
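
As an illustration of the first test in the list, the following is a minimal sketch of Tukey's HSD using `pairwise_tukeyhsd` from statsmodels. The scores for the three teaching methods are made-up numbers, chosen so that the group means are clearly separated.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical test scores for three teaching methods, five students each
scores = np.array([
    62, 65, 64, 63, 66,   # method A
    72, 75, 74, 73, 76,   # method B
    82, 85, 84, 83, 86,   # method C
])
methods = np.repeat(["A", "B", "C"], 5)

# Compare every pair of methods while controlling the familywise error rate
tukey_result = pairwise_tukeyhsd(endog=scores, groups=methods, alpha=0.05)

# The summary lists each pair, the mean difference, and whether H0 is rejected
print(tukey_result.summary())
```

With three groups there are three pairwise comparisons (A-B, A-C, B-C), and the `reject` column of the summary shows which pairs differ significantly.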



##  Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

## Ans
------

In [6]:
import numpy as np
from scipy import stats

# Generate simulated weight loss data for three diets.
# Note: this simulates 50 participants per diet (150 in total), whereas the
# question describes 50 participants overall.
np.random.seed(123)
diet_a = np.random.normal(loc=5.5, scale=1.5, size=50)
diet_b = np.random.normal(loc=4.8, scale=1.2, size=50)
diet_c = np.random.normal(loc=4.2, scale=1.0, size=50)

#conduct one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

print("F-statistic: {:.2f}".format(f_statistic))
print("p-value: {:.4f}".format(p_value))

if p_value < 0.05:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


F-statistic: 9.14
p-value: 0.0002
There is a significant difference between the mean weight loss of the three diets.


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

## Ans
-------

In [7]:
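import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Setting random seed for reproducibility
np.random.seed(123)

# Generating 2 random time samples, one for novice and one for experienced employees
time_novice = np.random.normal(loc=15, scale=2, size=30)
time_expert = np.random.normal(loc=10, scale=2, size=30)

# Generate simulated data (60 simulated employees, not the 30 in the question).
# Caveat: with this layout Software A is used only by novices and Software C only
# by experienced employees, so two of the six Software x Experience cells are empty
# and the two factors are partly confounded. Interpret the table below with caution.
data = pd.DataFrame({
    'Software': ['A']*20 + ['B']*20 + ['C']*20,
    'Experience': ['Novice']*30 + ['Experienced']*30,
    'Time': list(time_novice)+list(time_expert)
})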

# Print the simulated data head 
print('Simulated Data example :')
print(data.head())

print('\n======================================================================================\n')

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=1)

# Set significance level
alpha = 0.05

# Main effects and interaction effect
print(table)
print('\n')
if table.loc['C(Software)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of software.")
else:
    print("Conclusion: There is no significant main effect of software.")

if table.loc['C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of experience.")
else:
    print("Conclusion: There is no significant main effect of experience.")

if table.loc['C(Software):C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant interaction effect between software and experience.")
else:
    print("Conclusion: There is no significant interaction effect between software and experience.")


Simulated Data example :
  Software Experience       Time
0        A     Novice  12.828739
1        A     Novice  16.994691
2        A     Novice  15.565957
3        A     Novice  11.987411
4        A     Novice  13.842799


                             df      sum_sq     mean_sq          F  \
C(Software)                 2.0  204.881181  102.440590  18.135666   
C(Experience)               1.0  165.079097  165.079097  29.224933   
C(Software):C(Experience)   2.0   17.481552    8.740776   1.547431   
Residual                   56.0  316.319953    5.648571        NaN   

                                 PR(>F)  
C(Software)                8.460472e-07  
C(Experience)              1.375177e-06  
C(Software):C(Experience)  2.217544e-01  
Residual                            NaN  


Conclusion: There is a significant main effect of software.
Conclusion: There is a significant main effect of experience.
Conclusion: There is no significant interaction effect between software and experience.


## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

## Ans
------ 

In [9]:
import numpy as np
import scipy.stats as stats

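# Simulated test scores for the control and experimental groups (replace with your own data).
# No random seed is set, so the exact numbers change from run to run.

control_scores = np.random.normal(loc=75, scale=10, size=100)
experimental_scores = np.random.normal(loc=80, scale=10, size=100)


# Perform a two-sample t-test. With only two groups, a significant result already
# identifies which groups differ, so no separate post-hoc test is needed.
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)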

# Print the results
print("T-Statistic:", t_statistic)
print("p-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups (p < 0.05).")
else:
    print("There is no significant difference in test scores between the control and experimental groups (p ≥ 0.05).")


T-Statistic: -12.843608009993993
p-value: 1.2008605930616465e-22
There is a significant difference in test scores between the control and experimental groups (p < 0.05).
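
The reason no post-hoc test is needed here is that, with exactly two groups, a one-way ANOVA and the pooled two-sample t-test are the same test: the F-statistic equals the square of the t-statistic and the p-values match. The sketch below checks this on simulated scores; the seed, group sizes, and means are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical scores: 50 students per group
control = rng.normal(loc=75, scale=10, size=50)
experimental = rng.normal(loc=80, scale=10, size=50)

# Pooled-variance two-sample t-test (equal_var=True is the default)
t_stat, t_p = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, f_p = stats.f_oneway(control, experimental)

print("t squared:", t_stat ** 2, "F:", f_stat)
print("t-test p-value:", t_p, "ANOVA p-value:", f_p)
```

Post-hoc tests only become necessary with three or more groups, where a significant F-test does not say which pairs of groups differ.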


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

## Ans
_____

In [10]:
import numpy as np
import pandas as pd
from scipy import stats

# Sample data for daily sales in each store (replace with your own data)
store_A = np.random.normal(loc=1000, scale=100, size=(30,))
store_B = np.random.normal(loc=1050, scale=150, size=(30,))
store_C = np.random.normal(loc=800, scale=80, size=(30,))



# Create a DataFrame
data = pd.DataFrame({'Store': ['A']*len(store_A) + ['B']*len(store_B) + ['C']*len(store_C),
                     'Sales': np.concatenate((store_A, store_B, store_C))})

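# Perform a one-way ANOVA. Note: f_oneway treats the three stores as independent
# groups and ignores that the same 30 days were observed for every store, so this
# is not a repeated measures ANOVA.
f_statistic, p_value = stats.f_oneway(store_A, store_B, store_C)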

# Print the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("There is a significant difference in average daily sales between the stores (p < 0.05).")
else:
    print("There is no significant difference in average daily sales between the stores (p ≥ 0.05).")


F-Statistic: 63.11869060165205
p-value: 1.1575298191909095e-17
There is a significant difference in average daily sales between the stores (p < 0.05).
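
Because the same 30 days are observed for every store, the day can be treated as the repeated "subject". The following is a minimal sketch of a repeated measures ANOVA using `AnovaRM` from statsmodels, followed by Bonferroni-adjusted paired t-tests as one possible post-hoc procedure. The simulated sales, the shared day effect, and the column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
n_days = 30

# Hypothetical data: a shared day effect makes sales on the same day correlated
day_effect = rng.normal(loc=0, scale=50, size=n_days)
sales = {
    "A": 1000 + day_effect + rng.normal(0, 40, n_days),
    "B": 1050 + day_effect + rng.normal(0, 40, n_days),
    "C": 800 + day_effect + rng.normal(0, 40, n_days),
}

# Long format: one row per (day, store) combination
long_data = pd.DataFrame({
    "Day": np.tile(np.arange(n_days), len(sales)),
    "Store": np.repeat(list(sales), n_days),
    "Sales": np.concatenate(list(sales.values())),
})

# Repeated measures ANOVA with Day as the subject and Store as the within factor
rm_result = AnovaRM(long_data, depvar="Sales", subject="Day", within=["Store"]).fit()
print(rm_result.anova_table)

# Post-hoc: paired t-tests with a Bonferroni adjustment for three comparisons
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
adjusted_p = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_rel(sales[first], sales[second])
    adjusted_p[(first, second)] = min(p_raw * len(pairs), 1.0)
    print(f"Store {first} vs Store {second}: adjusted p = {adjusted_p[(first, second)]:.4g}")
```

The paired tests respect the within-day pairing, and multiplying each p-value by the number of comparisons is the Bonferroni adjustment described in Q8.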
