# Q1

In [None]:
"""
Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
"""

In [None]:
"""
ANOVA (Analysis of Variance) has several assumptions that need to be met in order to ensure the validity of the results. These assumptions are:

1. Independence: The observations within each group are independent of each other. Violation of this assumption occurs when there is dependence or correlation between the observations. For example, if the data points within each group are paired or clustered in some way, such as repeated measurements on the same subjects, the independence assumption may be violated.
2. Normality: The residuals (the differences between observed values and predicted values) within each group are normally distributed. Violation of this assumption occurs when the residuals are not normally distributed. For example, if the residuals follow a skewed or non-normal distribution, it can affect the accuracy of the ANOVA results.
3. Homogeneity of Variance: The variance of the residuals is constant across all groups. Violation of this assumption occurs when there are unequal variances across the groups. For example, if one group has a much larger variance than the others, the pooled error variance no longer describes any single group well, which distorts the F-test.

Violations of these assumptions can impact the validity of the ANOVA results. Here are some examples of how the violations can affect the analysis:

- Independence violation: In a repeated measures design, where the same subjects are measured multiple times, observations from the same subject are correlated. Treating them as independent misstates the error variance, so the p-values of the F-test can no longer be trusted; with clustered observations inside a group, the Type I error rate is typically inflated.
- Normality violation: If the residuals are strongly skewed or heavy-tailed, the F-statistic no longer follows the F-distribution exactly, so p-values and confidence intervals become inaccurate. The effect is most serious with small samples; with large, balanced groups, ANOVA is fairly robust to moderate non-normality.
- Homogeneity of Variance violation: When the group variances differ substantially, the pooled within-group variance misrepresents the individual groups, so the F-statistic and its p-value are distorted. The problem is worst with unequal group sizes: if the smaller groups have the larger variances, the Type I error rate is inflated, and if the larger groups have the larger variances, the test becomes overly conservative.

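It is important to assess these assumptions before conducting an ANOVA. If any of them is violated, an alternative method or an adjustment may be necessary: a transformation of the data (for skewed residuals), Welch's ANOVA (for unequal variances), a non-parametric test such as Kruskal-Wallis (for non-normal data), or a repeated measures or mixed-effects model (for dependent observations).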
"""
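
A minimal sketch of how the normality and equal-variance assumptions could be checked with SciPy. The three groups below are simulated purely for illustration (they are an assumption, not data from this notebook). `scipy.stats.shapiro` tests the pooled residuals for normality and `scipy.stats.levene` tests for equal variances. Independence cannot be tested this way; it has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 observations each
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (5.0, 5.5, 6.0)]

# In a one-way ANOVA the residual is the observation minus its group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Normality of residuals: a small p-value suggests non-normal residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: a small p-value suggests unequal group variances
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
```

A large p-value in either test does not prove the assumption holds; it only means the data show no strong evidence against it.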

# Q2

In [None]:
"""
What are the three types of ANOVA, and in what situations would each be used?
"""

In [None]:
"""
One-Way ANOVA: This type of ANOVA is used when there is only one categorical independent variable (factor) with two or more groups, and the goal is to compare the means of these groups. It tests whether there are significant differences between the means of the groups. One-Way ANOVA is appropriate when there is a single factor influencing the response variable. For example, you might use One-Way ANOVA to compare the effectiveness of different treatments on patient outcomes.
Two-Way ANOVA: Two-Way ANOVA is used when there are two categorical independent variables (factors), and the goal is to examine the main effect of each factor as well as their interaction effect on the response variable. The interaction tells you whether the effect of one factor depends on the level of the other. Two-Way ANOVA is appropriate when two factors may influence the response variable and you want to separate their individual effects from their combined effect. For example, you might use Two-Way ANOVA to study the effects of both gender and age group on test scores.
Three-Way ANOVA: Three-Way ANOVA extends the concept of Two-Way ANOVA by including three categorical independent variables (factors) and their interactions. It allows for investigating the main effects of each factor, the interactions between pairs of factors, and the three-way interaction on the response variable. Three-Way ANOVA is used when there are three factors influencing the response variable, and you want to explore their individual effects and the combined effects of their interactions. This type of ANOVA is less common and typically applied in complex experimental designs with multiple factors.
"""

# Q3

In [None]:
"""
What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
"""

In [None]:
"""
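The partitioning of variance in ANOVA refers to splitting the total variation in the response variable (the total sum of squares) into a part explained by the group differences (the between-group sum of squares) and a part left unexplained (the within-group, or residual, sum of squares), so that total = between + within. It is important because the F-statistic is built directly from this split: each part is divided by its degrees of freedom to give a mean square, and F is the ratio of the between-group mean square to the within-group mean square. Understanding the partition therefore shows how much of the variation each source accounts for and where the significance test comes from.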
"""
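
The partition can be checked numerically. The sketch below uses three small illustrative groups (assumed values) and computes the within-group sum of squares directly, rather than by subtraction, to confirm that the between-group and within-group parts add up to the total.

```python
import numpy as np

# Illustrative data for three groups
groups = [
    np.array([1, 2, 3, 4, 5]),
    np.array([2, 4, 6, 8, 10]),
    np.array([3, 6, 9, 12, 15]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)

# Between-group variation: group means against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variation: every observation against its own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

print("Total:", ss_total)      # 230.0
print("Between:", ss_between)  # 90.0
print("Within:", ss_within)    # 140.0
print("Between + Within == Total:", np.isclose(ss_between + ss_within, ss_total))
```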

# Q4

In [None]:
"""
How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?
"""

In [1]:
import numpy as np
import scipy.stats as stats

# Example data for three groups
group1 = np.array([1, 2, 3, 4, 5])
group2 = np.array([2, 4, 6, 8, 10])
group3 = np.array([3, 6, 9, 12, 15])

# Concatenate the data from all groups
data = np.concatenate((group1, group2, group3))

# Calculate the one-way ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)

# Calculate the total sum of squares (SST)
mean_data = np.mean(data)
sst = np.sum((data - mean_data) ** 2)

# Calculate the explained (between-group) sum of squares (SSE):
# each group's size times the squared distance of its mean from the grand mean.
# Note: many texts use "SSE" for the error (residual) sum of squares instead;
# the names here follow the question.
sse = sum(len(g) * (np.mean(g) - mean_data) ** 2
          for g in (group1, group2, group3))

# Calculate the residual sum of squares (SSR)
ssr = sst - sse

print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)
print("F-statistic:", f_statistic)
print("p-value:", p_value)


SST: 230.0
SSE: 90.0
SSR: 140.0
F-statistic: 3.857142857142857
p-value: 0.05086290933139865


# Q5

In [None]:
"""
In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
"""

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

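# Example data
# Note: A and B are numeric here, so ols() treats them as continuous predictors
# (a regression with an A*B product term), and B = 2*A makes them perfectly
# collinear. For categorical factors, write the terms as C(A) and C(B).
data = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                     'B': [2, 4, 6, 8, 10],
                     'Y': [3, 6, 9, 12, 15]})

# Fit the model with both main-effect terms and their interaction
model = ols('Y ~ A + B + A:B', data=data).fit()
anova_table = sm.stats.anova_lm(model)  # sequential (Type I) sums of squares by default

# Mean square (sum_sq / df) of each term, used here as a summary of each effect.
# Because Y = 3*A exactly, A absorbs all of the variation and the values for
# B and A:B are only floating-point noise.
main_effect_A = anova_table.loc['A', 'sum_sq'] / anova_table.loc['A', 'df']
main_effect_B = anova_table.loc['B', 'sum_sq'] / anova_table.loc['B', 'df']
interaction_effect = anova_table.loc['A:B', 'sum_sq'] / anova_table.loc['A:B', 'df']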

print("Main Effect of A:", main_effect_A)
print("Main Effect of B:", main_effect_B)
print("Interaction Effect:", interaction_effect)


Main Effect of A: 90.00000000000007
Main Effect of B: 9.663546088957395e-30
Interaction Effect: 3.1554436208840472e-30
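
For a design with truly categorical factors, main effects and interaction effects can also be read off the marginal and cell means. The following is a minimal sketch on a hypothetical balanced 2x2 dataset (the factor levels `a1`, `a2`, `b1`, `b2` and the values are made up for illustration): a main effect is a marginal mean minus the grand mean, and an interaction effect is what remains of a cell mean after removing the grand mean and both main effects.

```python
import pandas as pd

# Hypothetical balanced 2x2 design with two observations per cell
df = pd.DataFrame({
    'A': ['a1'] * 4 + ['a2'] * 4,
    'B': ['b1', 'b1', 'b2', 'b2'] * 2,
    'Y': [10, 12, 14, 16, 20, 22, 30, 32],
})

grand_mean = df['Y'].mean()

# Main effects: marginal mean of each level minus the grand mean
effect_A = df.groupby('A')['Y'].mean() - grand_mean
effect_B = df.groupby('B')['Y'].mean() - grand_mean

# Interaction effects: cell mean - grand mean - effect of A - effect of B
cell_means = df.groupby(['A', 'B'])['Y'].mean().unstack()
interaction = cell_means.sub(effect_A, axis=0).sub(effect_B, axis=1) - grand_mean

print("Main effect of A:\n", effect_A)
print("Main effect of B:\n", effect_B)
print("Interaction effects:\n", interaction)
```

In this example the interaction effects are +1.5 or -1.5 in every cell, meaning the gain from moving to `b2` is larger at `a2` than at `a1`.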


# Q6

In [None]:
"""
Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?
"""

In [None]:
"""
The F-statistic is the ratio of the between-group variability to the within-group variability (the between-group mean square divided by the within-group mean square). A larger F-statistic indicates greater differences between the group means relative to the variability within the groups. An F-statistic of 5.23 means the between-group mean square is about five times the within-group mean square; whether that is statistically significant depends on the degrees of freedom, which is what the p-value accounts for.
The p-value of 0.02 is the probability of observing an F-statistic at least as large as 5.23 if the null hypothesis (all group means are equal) were true. Since the p-value is below the conventional significance level of 0.05, we reject the null hypothesis and conclude that there are statistically significant differences between the groups.
In terms of interpretation, the data provide evidence that at least one group mean differs from the others. However, the ANOVA test does not identify which group(s) differ, and the p-value alone says nothing about how large the differences are. To locate the specific differences, follow up with post-hoc tests or pairwise comparisons, and report an effect size such as eta squared alongside the test.
"""
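
The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not give the degrees of freedom, so the values below (3 groups, 30 observations) are assumptions chosen only for illustration; the resulting p-value will therefore not match the 0.02 in the question exactly.

```python
from scipy import stats

f_observed = 5.23
df_between = 2    # hypothetical: 3 groups - 1
df_within = 27    # hypothetical: 30 observations - 3 groups

# p-value: upper-tail area of the F distribution beyond the observed statistic
p_value = stats.f.sf(f_observed, df_between, df_within)

# Critical value: the F-statistic needed for significance at alpha = 0.05
f_critical = stats.f.ppf(0.95, df_between, df_within)

print("p-value:", p_value)
print("Critical F at alpha = 0.05:", f_critical)
print("Reject the null hypothesis:", f_observed > f_critical)
```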

# Q7

In [None]:
"""
In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?
"""

In [None]:
"""
Here are a few approaches commonly used to handle missing data in a repeated measures ANOVA, along with their consequences:

Complete Case Analysis (Listwise Deletion): This approach excludes every participant who has any missing measurement. It is straightforward to implement, but it reduces the sample size and statistical power, and it biases the results unless the data are missing completely at random, which compromises the representativeness and generalizability of the findings.
Pairwise Deletion: This approach uses all available data for each individual comparison, so different comparisons are based on different subsets of participants. It keeps more data than listwise deletion, but the varying sample sizes make results hard to compare across analyses, and the results can still be biased if the missingness is related to the outcome.
Imputation: Imputation estimates the missing values from the available information, for example by mean imputation, regression imputation, or multiple imputation. It retains the sample size and statistical power, but single imputation (mean imputation in particular) understates the variability of the data and can make effects look more precise than they are. Multiple imputation carries the uncertainty of the imputed values through to the final analysis and gives valid results when the data are missing at random and the imputation model is adequate.
"""
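
A small sketch of two of these consequences, using a hypothetical wide-format table (one row per subject, one column per time point; all values are made up). Listwise deletion shrinks the sample, and mean imputation shrinks the variance of the imputed column.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: 5 subjects, 3 time points, 2 missing values
wide = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    't1': [5.0, 6.0, 7.0, 8.0, 9.0],
    't2': [6.0, np.nan, 8.0, 9.0, 10.0],
    't3': [7.0, 8.0, np.nan, 10.0, 11.0],
}).set_index('subject')

# Listwise deletion: drop every subject with at least one missing measurement
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with its column (time point) mean
mean_imputed = wide.fillna(wide.mean())

print("Subjects before / after listwise deletion:", len(wide), "/", len(complete_cases))
print("Variance of t2, observed values only:", wide['t2'].var())
print("Variance of t2 after mean imputation:", mean_imputed['t2'].var())
```

Here two of the five subjects are lost to listwise deletion, and the variance of `t2` drops after mean imputation because the imputed value sits exactly at the mean.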

# Q8

In [None]:
"""
What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.
"""

In [None]:
"""
After conducting an ANOVA and finding a significant overall effect, post-hoc tests are performed to determine which specific group means differ from each other. Some common post-hoc tests used after ANOVA include:

1. Tukey's Honestly Significant Difference (HSD): Tukey's HSD test compares all possible pairs of group means and provides simultaneous confidence intervals. It assumes equal variances and was designed for equal sample sizes; the Tukey-Kramer variant handles unequal sample sizes. This test controls the family-wise error rate, making it the usual choice when every pair of groups is to be compared.
2. Bonferroni Correction: The Bonferroni correction adjusts the significance level for multiple comparisons by dividing the desired alpha level by the number of comparisons. It keeps the family-wise Type I error rate at or below alpha, but it becomes very conservative as the number of comparisons grows, so it suits a small number of pre-planned comparisons.
3. Scheffe's Test: Scheffe's test is a conservative post-hoc test that maintains the family-wise error rate for every possible linear combination (contrast) of group means, not only pairwise differences. It is useful when sample sizes are unequal or when exploring complex contrasts, such as one group against the average of two others.
4. Sidak's Test: Sidak's test is another method for controlling the family-wise error rate in multiple comparisons. It is slightly less conservative than the Bonferroni correction, but for all pairwise comparisons it is usually more conservative than Tukey's HSD test.
5. Fisher's Least Significant Difference (LSD): Fisher's LSD test compares pairs of means with t-tests that use the pooled error variance from the ANOVA, and is run only after a significant overall F-test. It is less conservative than the Bonferroni correction, but it does not control the family-wise error rate once there are more than three groups.

The choice of post-hoc test depends on the research design, the sample sizes, and which comparisons are of interest. Tukey's HSD is generally preferred for all pairwise comparisons because it balances control of the family-wise Type I error rate against power. Bonferroni suits a few planned comparisons, and Scheffe's test suits complex contrasts; both are more conservative. Fisher's LSD is the most powerful of these but should be used with caution because of its weaker error control.
Example scenario: Suppose a researcher conducts an experiment comparing the effectiveness of four different treatments for pain relief. After performing an ANOVA, they find a significant overall effect. To determine which specific treatment groups differ from each other, they would conduct post-hoc tests, such as Tukey's HSD or Bonferroni correction, to compare the mean pain relief scores between pairs of treatments. This helps identify which treatments are significantly different and provides a more nuanced understanding of the findings.
"""
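
The Bonferroni approach from the example scenario can be sketched as follows. The four treatment groups are simulated (hypothetical names `T1` to `T4` and made-up means), each pair is compared with `scipy.stats.ttest_ind`, and `multipletests` from statsmodels applies the correction. With four groups there are six pairwise comparisons, so each raw p-value is multiplied by six (capped at 1).

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical pain relief scores for four treatments, 20 patients each
rng = np.random.default_rng(1)
treatments = {
    name: rng.normal(loc=mu, scale=1.0, size=20)
    for name, mu in [('T1', 5.0), ('T2', 5.2), ('T3', 6.5), ('T4', 5.1)]
}

# One t-test per pair of treatments: 4 choose 2 = 6 comparisons
pairs = list(combinations(treatments, 2))
raw_p = [stats.ttest_ind(treatments[a], treatments[b]).pvalue for a, b in pairs]

# Bonferroni correction: adjusted p = min(raw p * number of comparisons, 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')

for (a, b), p_raw, p_adj, rej in zip(pairs, raw_p, adj_p, reject):
    print(f"{a} vs {b}: raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f}, reject = {rej}")
```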

# Q9

In [None]:
"""
A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.
"""

In [3]:
import scipy.stats as stats
import numpy as np

# Weight loss data for each diet
# (illustrative values: 50 observations per diet, i.e. 150 in total, rather than 50 overall)
diet_A = np.array([2.5, 3.1, 1.8, 2.9, 2.2, 2.4, 3.0, 2.6, 2.1, 2.8,
                   2.7, 2.3, 2.9, 2.6, 2.5, 2.2, 2.7, 2.8, 2.9, 2.4,
                   2.6, 2.5, 2.3, 2.7, 2.8, 2.4, 2.2, 2.6, 2.5, 2.9,
                   2.7, 2.3, 2.8, 2.5, 2.6, 2.7, 2.4, 2.2, 2.9, 2.8,
                   2.5, 2.7, 2.6, 2.8, 2.9, 2.5, 2.2, 2.4, 2.6, 2.3])

diet_B = np.array([1.7, 1.8, 2.0, 1.9, 1.6, 1.8, 1.7, 2.1, 2.0, 1.8,
                   1.6, 1.9, 1.8, 1.7, 2.2, 1.9, 2.0, 1.6, 1.8, 1.7,
                   2.1, 1.8, 1.7, 1.6, 1.9, 1.8, 1.7, 2.0, 1.6, 1.8,
                   1.7, 2.2, 1.9, 2.0, 1.6, 1.8, 1.7, 2.1, 1.8, 1.7,
                   1.6, 1.9, 1.8, 1.7, 2.0, 1.6, 1.8, 1.7, 2.2, 1.9])

diet_C = np.array([3.8, 3.7, 3.9, 4.1, 3.5, 3.6, 4.0, 3.8, 3.9, 3.7,
                   4.1, 3.5, 3.6, 4.0, 3.8, 3.9, 3.7, 4.1, 3.5, 3.6,
                   4.0, 3.8, 3.9, 3.7, 4.1, 3.5, 3.6, 4.0, 3.8, 3.9,
                   3.7, 4.1, 3.5, 3.6, 4.0, 3.8, 3.9, 3.7, 4.1, 3.5,
                   3.6, 4.0, 3.8, 3.9, 3.7, 4.1, 3.5, 3.6, 4.0, 3.8])

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("One-Way ANOVA Results")
print("---------------------")
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level

if p_value < alpha:
    print("There are significant differences between the mean weight loss of the diets.")
else:
    print("There are no significant differences between the mean weight loss of the diets.")


One-Way ANOVA Results
---------------------
F-statistic: 1055.021967553837
p-value: 6.49580749021216e-88
There are significant differences between the mean weight loss of the diets.
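
Statistical significance does not say how large the differences are, so an effect size helps with the interpretation. As a sketch, eta squared (the share of total variance explained by the diet) can be recovered from the F-statistic and the degrees of freedom, here 2 between groups and 147 within groups for three groups of 50. The helper name `eta_squared_from_f` is introduced only for this illustration.

```python
def eta_squared_from_f(f_stat, df_between, df_within):
    """Eta squared = SS_between / SS_total, rewritten in terms of F and the dfs."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)

# F-statistic reported above; dfs follow from 3 groups and 150 observations
eta_sq = eta_squared_from_f(1055.021967553837, df_between=2, df_within=147)
print("Eta squared:", round(eta_sq, 3))
```

A value of roughly 0.93 would mean that diet membership accounts for most of the variation in weight loss in this illustrative dataset.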


# Q10

In [None]:
"""
A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.
"""

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe with the data
# (illustrative values: 3 observations per Software x Experience cell, 18 in total)
data = pd.DataFrame({
    'Software': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'] * 2,
    'Experience': ['Novice'] * 9 + ['Experienced'] * 9,
    'Time': [10, 12, 11, 13, 15, 14, 8, 9, 10, 9, 10, 11, 14, 12, 13, 11, 12, 10]
})

# Perform two-way ANOVA
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

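# Print the ANOVA table
# Reading the output at alpha = 0.05: Software has a significant main effect,
# Experience has none (the novice and experienced means are equal in this data),
# and the Software x Experience interaction is significant, so the effect of
# the program depends on the employee's experience level.
print(anova_table)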


                                 sum_sq    df             F    PR(>F)
C(Software)                4.300000e+01   2.0  2.150000e+01  0.000108
C(Experience)              3.813341e-28   1.0  3.813341e-28  1.000000
C(Software):C(Experience)  9.000000e+00   2.0  4.500000e+00  0.034815
Residual                   1.200000e+01  12.0           NaN       NaN


# Q11

In [None]:
"""
An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.
"""

In [5]:
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate random test scores for the control and experimental groups
# (simulated data with 100 students per group)
np.random.seed(42)
control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=12, size=100)

# Perform two-sample t-test (Student's t-test; equal_var=True is the default)
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Print the t-statistic and p-value
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

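# Perform post-hoc test (Tukey's HSD)
# With only two groups this repeats the same single comparison as the t-test;
# post-hoc tests are only needed when three or more groups are compared.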
data = pd.DataFrame({
    'Scores': np.concatenate([control_scores, experimental_scores]),
    'Group': ['Control'] * 100 + ['Experimental'] * 100
})
posthoc = pairwise_tukeyhsd(data['Scores'], data['Group'])

# Print the post-hoc test results
print(posthoc)


T-Statistic: -4.316398519082441
P-Value: 2.5039591073846333e-05
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.3061   0.0 3.4251 9.1872   True
--------------------------------------------------------
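
Because the two groups are simulated with different standard deviations (10 and 12), Welch's t-test, which does not assume equal variances, is a reasonable alternative. The sketch below regenerates the same simulated scores and runs both versions through `scipy.stats.ttest_ind`, switching on Welch's version with `equal_var=False`.

```python
import numpy as np
from scipy import stats

# Same simulated scores as in the cell above
np.random.seed(42)
control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=12, size=100)

# Student's t-test assumes equal variances; Welch's t-test does not
student = stats.ttest_ind(control_scores, experimental_scores)
welch = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)

print("Student's t-test: t =", student.statistic, "p =", student.pvalue)
print("Welch's t-test:   t =", welch.statistic, "p =", welch.pvalue)
```

With equal group sizes the two t-statistics coincide and only the degrees of freedom (and therefore the p-values) differ slightly; the choice matters more when group sizes are unequal.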


# Q12

In [None]:
"""
A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.
"""

In [6]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

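# Create a dataframe with sales data for each store and day
# NOTE: 'Day' (0, 0, 0, 1, 1, 1, ...) and 'Store' (30 x A, 30 x B, 30 x C) do not
# line up as a crossed design here: Store A only receives days 0-9, Store B days
# 10-19 and Store C days 20-29. Using np.tile(range(30), 3) for 'Day' would give
# every store one observation per day.
np.random.seed(42)
data = pd.DataFrame({
    'Day': np.repeat(range(30), 3),
    'Store': ['Store A'] * 30 + ['Store B'] * 30 + ['Store C'] * 30,
    'Sales': np.random.randint(100, 500, size=90)
})

# Fit an OLS model with Store, Day and their interaction
# NOTE: this is a factorial ANOVA on an OLS fit, not a repeated measures ANOVA;
# statsmodels provides AnovaRM for the repeated measures case.
model = ols('Sales ~ C(Store) + C(Day) + C(Store):C(Day)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)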

# Print the ANOVA table
print(anova_table)

# Perform post-hoc test (Tukey's HSD)
posthoc = pairwise_tukeyhsd(data['Sales'], data['Store'])

# Print the post-hoc test results
print(posthoc)




                       sum_sq    df             F    PR(>F)
C(Store)         3.246523e-10   2.0  1.220381e-14  1.000000
C(Day)           7.224219e+05  29.0  1.872837e+00  0.112527
C(Store):C(Day)  5.553506e+05  58.0  7.198570e-01  0.790324
Residual         7.980760e+05  60.0           NaN       NaN
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower    upper  reject
-------------------------------------------------------
Store A Store B   0.0333    1.0 -70.5651 70.6318  False
Store A Store C      2.3 0.9967 -68.2984 72.8984  False
Store B Store C   2.2667 0.9968 -68.3318 72.8651  False
-------------------------------------------------------
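
A repeated measures layout treats each day as the "subject" that is measured once per store. The following is a minimal sketch using `AnovaRM` from statsmodels, assuming a fully crossed design (every store observed on every day). The sales figures are random placeholders generated with a different random generator than the cell above, so the numbers are not comparable with the output shown there.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: 30 days x 3 stores, one sales value per pair
rng = np.random.default_rng(42)
n_days = 30
stores = ['Store A', 'Store B', 'Store C']
data = pd.DataFrame({
    'Day': np.tile(np.arange(n_days), len(stores)),   # 0..29 repeated for each store
    'Store': np.repeat(stores, n_days),               # 30 rows per store
    'Sales': rng.integers(100, 500, size=n_days * len(stores)).astype(float),
})

# Day is the repeated "subject"; Store is the within-subject factor
result = AnovaRM(data, depvar='Sales', subject='Day', within=['Store']).fit()
rm_table = result.anova_table
print(rm_table)
```

The test for `Store` then has 2 numerator and 58 denominator degrees of freedom, because the day-to-day variation is removed from the error term.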
