### 1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.


ANOVA (Analysis of Variance) is a statistical method used to analyze the differences between means of two or more groups. The assumptions required to use ANOVA are:

Independence: The observations should be independent of one another, both within each group and across groups. That is, the response for one observation should not be influenced by the response of any other observation.

Normality: The distribution of the response variable should be approximately normal within each group (equivalently, the model residuals should be approximately normal).

Homogeneity of Variance: The variance of the response variable should be equal across all groups.

Examples of violations that could impact the validity of the results are:

Independence violation: If the observations are not independent, the standard errors are wrong and the F-test tends to reject the null hypothesis too often. For example, if the response variable is measured on the same individual at different times, the observations are correlated and a repeated measures design is needed instead.

Normality violation: If the distribution of the response variable is far from normal within each group, the p-values can be inaccurate, especially with small or unequal group sizes. For example, if the response variable is heavily skewed or contains extreme outliers, the ANOVA results may not be reliable.

Homogeneity of Variance violation: If the variance of the response variable differs across groups, the pooled error variance no longer represents every group and the F-test can be too liberal or too conservative. For example, if one group has a much larger variance than the others, particularly when group sizes are also unequal, a significant result may reflect the difference in spread rather than a difference in means.
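
The normality and equal-variance assumptions can be screened with standard tests before running ANOVA. The sketch below is only an illustration: the three groups are simulated (they are not from the text), and the Shapiro-Wilk and Levene tests are one common choice among several. Independence cannot be checked this way; it has to be judged from how the data were collected.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 observations; group C has a much larger spread.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=1.0, size=30)
group_b = rng.normal(loc=10.5, scale=1.0, size=30)
group_c = rng.normal(loc=11.0, scale=4.0, size=30)
groups = {"A": group_a, "B": group_b, "C": group_c}

# Normality within each group (Shapiro-Wilk; H0: the sample comes from a normal distribution).
for name, values in groups.items():
    w_stat, p_norm = stats.shapiro(values)
    print(f"Shapiro-Wilk, group {name}: W={w_stat:.3f}, p={p_norm:.3f}")

# Homogeneity of variance (Levene; H0: all groups have the same variance).
lev_stat, p_levene = stats.levene(group_a, group_b, group_c)
print(f"Levene: statistic={lev_stat:.3f}, p={p_levene:.4f}")
```

A small Levene p-value, as expected here because group C was simulated with four times the standard deviation, suggests using a method that does not assume equal variances.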

----

### 2. What are the three types of ANOVA, and in what situations would each be used?


There are three types of ANOVA: one-way ANOVA, two-way ANOVA, and repeated measures ANOVA. Each type is used in different situations:

1. One-way ANOVA: This type of ANOVA is used when there is one categorical independent variable (factor) and one continuous dependent variable. It is typically used to compare the means of three or more groups. For example, a one-way ANOVA can be used to determine if there are significant differences in the mean heights of plants grown under different types of fertilizers.

2. Two-way ANOVA: This type of ANOVA is used when there are two categorical independent variables and one continuous dependent variable. It tests the main effect of each independent variable and whether the two interact, that is, whether the effect of one depends on the level of the other. For example, a two-way ANOVA can be used to determine if mean test scores differ between two teaching methods, between two schools, and whether the effect of the teaching method depends on the school.

3. Repeated measures ANOVA: This type of ANOVA is used when the same participants are measured on the same dependent variable multiple times or under multiple conditions. It is used to determine if there are any significant differences between the means of the time points or conditions. For example, a repeated measures ANOVA can be used to determine if there is a significant difference in the mean anxiety levels of participants before, during, and after a stressful task.

----

### 3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?


Partitioning of variance in ANOVA refers to the division of the total variation in the data into components, each of which represents a source of variation. A one-way ANOVA decomposes the total sum of squares of the response variable into two components: the variation due to differences between group means (the "between-group" sum of squares) and the variation of observations around their own group mean (the "within-group" sum of squares, also called the "error" or "residual" sum of squares). The within-group component is the random error; it is not a separate third term.

It is important to understand the concept of partitioning of variance in ANOVA because it allows us to determine the proportion of the total variation that is explained by the independent variable(s) and the proportion that is left unexplained (the within-group variation). This information helps us to interpret the results of ANOVA and to draw valid conclusions about the differences between groups.

By partitioning the variance, ANOVA also allows us to test the significance of the differences between groups. Each sum of squares is divided by its degrees of freedom to give a mean square, and the F-statistic is the ratio of the between-group mean square to the within-group mean square. If this ratio is large relative to the F distribution expected under the null hypothesis, this suggests that there is a significant difference between the groups.
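
The decomposition can be verified numerically. The following sketch uses three small hypothetical groups (the values are made up for illustration) and computes the total, between-group, and within-group sums of squares directly with NumPy, then compares the resulting F-statistic with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical measurements for three groups (not taken from the text).
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([7.0, 8.0, 6.0, 7.0]),
    np.array([9.0, 10.0, 11.0, 10.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation around the grand mean.
ss_total = ((all_values - grand_mean) ** 2).sum()
# Between-group variation: group means around the grand mean, weighted by group size.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, p_value = f_oneway(*groups)

print(f"SS total = {ss_total:.4f}, SS between + SS within = {ss_between + ss_within:.4f}")
print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}")
```

The two sums of squares add up to the total, and the manually computed F matches the library value, which is the identity the F-test relies on.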

---

### 4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [3]:

import pandas as pd
from statsmodels.formula.api import ols
import seaborn as sns
from statsmodels.stats.anova import anova_lm

# Loading Iris dataset from seaborn
df_iris = sns.load_dataset('iris')
print('Top 5 rows of IRIS dataset : ')
print(df_iris.head())
print('\n===================================================================\n')

# Fit the one-way ANOVA model (sepal length vs Species)
model = ols('sepal_length ~ species', data=df_iris).fit()

# Calculate the sum of squares for the model
print('Values for Sepal Length vs Species:')
SSE = model.ess
SSR = model.ssr
SST = SSE + SSR

print('SSE:', round(SSE,4))
print('SSR:', round(SSR,4))
print('SST:', round(SST,4))

print('\n===================================================================\n')
# Print the ANOVA table
print(anova_lm(model))

Top 5 rows of IRIS dataset : 
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


Values for Sepal Length vs Species:
SSE: 63.2121
SSR: 38.9562
SST: 102.1683


             df     sum_sq    mean_sq           F        PR(>F)
species     2.0  63.212133  31.606067  119.264502  1.669669e-31
Residual  147.0  38.956200   0.265008         NaN           NaN


---

### 5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [4]:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the inbuilt dataset from statsmodels
data = sm.datasets.get_rdataset("ToothGrowth", "datasets").data

# printing top 5 rows of Tooth Growth dataset
print('Top 5 rows of Tooth Growth Dataset')
print(data.head())
print('\n==============================================================\n')

# Define the model formula
model_formula = "len ~ C(supp) + C(dose) + C(supp):C(dose)"

# Fit the model using OLS regression
model = ols(model_formula, data).fit()

# Compute the Type II ANOVA table once and reuse it
anova_table = sm.stats.anova_lm(model, typ=2)

# Sums of squares for the main effects and the interaction effect, selected by label
main_effects = anova_table.loc[['C(supp)', 'C(dose)'], 'sum_sq']
interaction_effect = anova_table.loc[['C(supp):C(dose)'], 'sum_sq']

# Print the results
print("Main effects:")
print(main_effects)
print("\n==============================\n")
print("Interaction effect:")
print(interaction_effect)
print("\n==============================\n")
print("ANOVA Table:")
print(anova_table)

Top 5 rows of Tooth Growth Dataset
    len supp  dose
0   4.2   VC   0.5
1  11.5   VC   0.5
2   7.3   VC   0.5
3   5.8   VC   0.5
4   6.4   VC   0.5


Main effects:
C(supp)     205.350000
C(dose)    2426.434333
Name: sum_sq, dtype: float64


Interaction effect:
C(supp):C(dose)    108.319
Name: sum_sq, dtype: float64


ANOVA Table:
                      sum_sq    df          F        PR(>F)
C(supp)           205.350000   1.0  15.571979  2.311828e-04
C(dose)          2426.434333   2.0  91.999965  4.046291e-18
C(supp):C(dose)   108.319000   2.0   4.106991  2.186027e-02
Residual          712.106000  54.0        NaN           NaN


---

### 6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?


If you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, you can conclude that there is a statistically significant difference between the group means at the conventional 0.05 significance level.

The F-statistic is the ratio of the between-group mean square to the within-group mean square. A value of 5.23 means that the variability between the group means is about five times what would be expected from the within-group variability alone. The p-value of 0.02 means that, if the null hypothesis were true, an F-statistic at least as large as 5.23 would occur only about 2% of the time.

Therefore, because 0.02 is below 0.05, we reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from at least one other. The test does not say which groups differ or how large the differences are.

To interpret these results, we would need to perform post-hoc tests to determine which group(s) differ significantly from the others. We might also want to examine the effect size of the differences between the groups, such as using eta-squared or partial eta-squared to quantify the proportion of variability in the response variable that can be explained by the group variable.
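
As a rough illustration of the effect-size step, eta-squared for a one-way ANOVA can be recovered from the F-statistic and its degrees of freedom. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are assumptions, and the resulting p-value will not necessarily equal the reported 0.02. The helper `eta_squared_from_f` is a hypothetical name introduced only for this sketch.

```python
from scipy.stats import f as f_dist


def eta_squared_from_f(f_stat, df_between, df_within):
    """Eta-squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# Assumed degrees of freedom: 3 groups and 30 observations in total.
f_stat, df_between, df_within = 5.23, 2, 27

# Upper-tail probability of the F distribution, i.e. the ANOVA p-value.
p_value = f_dist.sf(f_stat, df_between, df_within)
eta_sq = eta_squared_from_f(f_stat, df_between, df_within)

print(f"p-value for F({df_between}, {df_within}) = {f_stat}: {p_value:.4f}")
print(f"eta-squared: {eta_sq:.3f}")
```

Under these assumed degrees of freedom, roughly 28% of the variability in the response would be attributable to group membership.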

---

### 7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?


In a repeated measures ANOVA, missing data can be handled in different ways, but the choice of method can have potential consequences on the validity and reliability of the results. Here are some common methods for handling missing data in a repeated measures ANOVA:

Listwise deletion: This method removes every participant who has a missing value at any time point. While it is easy to implement, it reduces the sample size and statistical power, and it biases the estimates unless the data are missing completely at random (MCAR).

Pairwise deletion: This method involves using only the available data for each variable pair in the analysis, which can lead to different sample sizes for each comparison. This method can increase statistical power but can also produce biased results if the missing data are not missing completely at random (MCAR) or are systematically related to the outcome variable.

Mean imputation: This method involves replacing missing values with the mean of the available data for that variable. While this method preserves the sample size, it shrinks the variance of the imputed variable, weakens its correlations with other variables, and produces standard errors that are too small.

Regression imputation: This method involves replacing missing values using a regression equation that predicts the missing data based on other variables in the dataset. While this method uses more information than mean imputation, the imputed values fall exactly on the regression line, so it understates variability and overstates the relationships between variables unless random noise is added to the predictions.

Multiple imputation: This method involves creating multiple imputed datasets based on a model that accounts for the uncertainty of the missing data. This method can produce accurate estimates and standard errors while preserving the variability of the data, but it can also be computationally intensive and require a large number of imputations to achieve reliable results.
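
The trade-offs of the two simplest methods can be seen on a toy example. The sketch below uses a small hypothetical wide-format table (one row per participant, one column per time point; the values are invented) and compares listwise deletion with mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point.
wide = pd.DataFrame({
    "t1": [5.0, 6.0, 7.0, 8.0, 9.0],
    "t2": [6.0, np.nan, 8.0, 9.0, 10.0],
    "t3": [7.0, 8.0, np.nan, 10.0, 11.0],
})

# Listwise deletion: keep only participants observed at every time point.
listwise = wide.dropna()

# Mean imputation: fill each gap with the mean of that time point.
mean_imputed = wide.fillna(wide.mean())

print("Participants before / after listwise deletion:", len(wide), "/", len(listwise))
print("Variance of t2, observed values only:", round(wide["t2"].var(), 4))
print("Variance of t2, after mean imputation:", round(mean_imputed["t2"].var(), 4))
```

Listwise deletion discards two of the five participants, while mean imputation keeps all five but reduces the variance of `t2`, which is the shrinkage described above.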

---

### 8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.


Post-hoc tests are used to compare group means after a significant ANOVA result. Some common post-hoc tests include:

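Tukey's Honestly Significant Difference (HSD) test: This test compares all possible pairs of means and controls the family-wise type I error rate. It is appropriate when all pairwise comparisons are of interest and the group variances are approximately equal; the Tukey-Kramer version handles unequal group sizes.

Bonferroni correction: This method divides the significance level by the number of comparisons (or, equivalently, multiplies each p-value by that number) to control the family-wise error rate. It is appropriate when there is a small number of planned comparisons; with many comparisons it becomes very conservative and loses power.

Scheffe's test: This test is conservative and controls the family-wise error rate for all possible contrasts, not only pairwise comparisons. It is appropriate when complex contrasts are of interest, such as comparing the average of two groups with a third, and it can be used with unequal group sizes, although it still assumes equal variances.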

Dunnett's test: This test compares each group mean to a control group mean and controls the overall type I error rate. It is appropriate when there is a clear control group and multiple treatment groups.

Games-Howell test: This test adjusts for unequal variances and sample sizes and compares all possible pairs of means. It is appropriate when the assumption of equal variances is violated.
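
A post-hoc test becomes necessary whenever an ANOVA on three or more groups is significant and the question is which groups differ. As an illustration, suppose test scores are compared across three teaching methods. The sketch below simulates such data (the group names, means, and sample sizes are assumptions, not values from the text) and runs Tukey's HSD with `pairwise_tukeyhsd` from statsmodels.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores: methods 1 and 2 are similar, method 3 is clearly higher.
rng = np.random.default_rng(42)
scores = np.concatenate([
    rng.normal(loc=70, scale=5, size=20),
    rng.normal(loc=72, scale=5, size=20),
    rng.normal(loc=85, scale=5, size=20),
])
labels = np.repeat(["method_1", "method_2", "method_3"], 20)

# All pairwise comparisons with the family-wise error rate held at 5%.
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

Each row of the summary reports one pair of groups, the difference in means, the adjusted p-value, and whether the null hypothesis of equal means is rejected for that pair.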

---

### 9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [5]:

import numpy as np
from scipy.stats import f_oneway

diet_a = np.array([3.2, 2.5, 4.1, 3.9, 1.5, 2.8, 2.1, 3.0, 2.7, 2.2,
                   1.9, 2.5, 3.1, 1.8, 2.3, 3.5, 2.9, 2.6, 2.0, 2.7,
                   1.6, 2.1, 3.0, 2.2, 2.6])

diet_b = np.array([4.0, 3.8, 3.9, 4.2, 3.6, 4.1, 2.8, 3.7, 4.3, 3.2,
                   2.9, 3.6, 4.4, 3.1, 3.3, 3.5, 3.7, 3.8, 3.0, 3.5,
                   2.7, 3.2, 3.9, 3.0, 3.8])

diet_c = np.array([5.0, 5.2, 4.9, 5.5, 4.8, 5.1, 4.7, 5.3, 5.0, 5.4,
                   5.1, 5.2, 4.5, 4.8, 5.3, 4.6, 4.9, 5.2, 5.1, 4.7,
                   4.6, 5.0, 4.8, 5.2, 5.1])

f_stat, p_value = f_oneway(diet_a, diet_b, diet_c)

print("F-statistic:", f_stat)
print("p-value:", p_value)

F-statistic: 149.65831558918146
p-value: 2.2551291246307732e-26

The p-value is far below 0.05, so we reject the null hypothesis that the three diets produce the same mean weight loss: at least one diet differs from the others. The ANOVA alone does not identify which diets differ; a post-hoc test such as Tukey's HSD is needed for that.


---

### 10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [6]:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

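# create the data frame: 10 employees per program, split evenly between novice and
# experienced so that every program x experience cell contains 5 observations
# (an empty cell makes the interaction model rank-deficient and yields NaN rows)
data = pd.DataFrame({'time': [12, 15, 13, 16, 11, 14, 18, 13, 15, 17, 19, 16, 18, 20, 14, 12, 17, 15, 19, 22, 20, 23, 18, 21, 19, 24, 22, 25, 21, 23],
                     'program': ['A']*10 + ['B']*10 + ['C']*10,
                     'experience': (['novice']*5 + ['experienced']*5)*3})

# fit the two-way ANOVA model with both main effects and their interaction
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# print the ANOVA table
print(anova_table)

With five observations in each of the six cells, the sums of squares are 263.47 for program (2 df), 16.13 for experience (1 df), 13.87 for the interaction (2 df), and 140.40 for the residual (24 df). The corresponding F-statistics are about 22.52, 2.76, and 1.19. Compared with the 5% critical values of the F distribution (about 3.40 for 2 and 24 df, and 4.26 for 1 and 24 df), only the main effect of program is significant: mean completion time differs between the programs, while there is no evidence of an experience effect or of a program-by-experience interaction.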


---

### 11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [7]:

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.multicomp as mc

control = [70, 75, 68, 80, 73, 85, 78, 72, 82, 79]
experiment = [82, 90, 87, 92, 94, 81, 89, 85, 90, 88]

t_stat, p_val = stats.ttest_ind(control, experiment)
print("t-statistic: ", t_stat)
print("p-value: ", p_val)

data = np.array(control + experiment)
groups = np.array(['control']*len(control) + ['experiment']*len(experiment))
tukey = mc.MultiComparison(data, groups)
posthoc = tukey.tukeyhsd()
print(posthoc)

t-statistic:  -5.324313533850306
p-value:  4.6265323467775783e-05
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1   group2   meandiff p-adj lower   upper  reject
-------------------------------------------------------
control experiment     11.6   0.0 7.0228 16.1772   True
-------------------------------------------------------
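
With only two groups, a post-hoc test adds no information beyond the t-test itself: there is a single pair to compare. A more informative follow-up is to check that the result does not depend on the equal-variance assumption and to report an effect size. The sketch below reuses the `control` and `experiment` scores from the cell above; the choice of Welch's t-test and Cohen's d is a suggestion, not part of the original analysis.

```python
import numpy as np
from scipy import stats

control = np.array([70, 75, 68, 80, 73, 85, 78, 72, 82, 79])
experiment = np.array([82, 90, 87, 92, 94, 81, 89, 85, 90, 88])

# Welch's t-test: does not assume equal variances in the two groups.
t_welch, p_welch = stats.ttest_ind(control, experiment, equal_var=False)

# Cohen's d using the pooled standard deviation (equal group sizes).
mean_diff = experiment.mean() - control.mean()
pooled_sd = np.sqrt((control.var(ddof=1) + experiment.var(ddof=1)) / 2)
cohens_d = mean_diff / pooled_sd

print(f"Welch t = {t_welch:.4f}, p = {p_welch:.6f}")
print(f"Mean difference = {mean_diff:.1f}, Cohen's d = {cohens_d:.2f}")
```

Because the two groups have the same size, Welch's t-statistic equals the pooled one reported above; only the degrees of freedom, and therefore the p-value, change slightly.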


---

### 12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [8]:

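import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import MultiComparison

# Long-format data: one row per (store, day) pair.
# Note: the same 30 sales figures are reused for all three stores, so this sample
# has no store-to-store variation and the F value it produces is numerical noise.
data = {'store': ['A']*30 + ['B']*30 + ['C']*30,
        'day': list(range(1, 31))*3,
        'sales': [100, 120, 105, 110, 130, 115, 90, 105, 95, 115, 130, 120, 105, 110, 100, 95, 110, 105, 120, 115, 105, 100, 110, 120, 115, 105, 90, 95, 100, 110]*3}
df = pd.DataFrame(data)

# 'day' plays the role of the subject; 'store' is the within-subject factor
rm = AnovaRM(df, 'sales', 'day', within=['store'])
fit = rm.fit()
print(fit.summary())

# Follow up with Tukey's HSD only if the store effect is significant
if fit.anova_table['Pr > F'].iloc[0] < 0.05:
    posthoc = MultiComparison(df['sales'], df['store'])
    posthoc_results = posthoc.tukeyhsd()
    print(posthoc_results.summary())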

               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
store -8.8666 2.0000 58.0000 1.0000
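
A negative F value with a p-value of 1.0 is a sign that the input data are degenerate rather than a real finding. For comparison, the sketch below runs the same analysis on simulated sales in which the stores actually differ. The store means, the noise levels, and the random seed are all assumptions made for illustration. Because the same days are observed for every store, the follow-up uses paired t-tests with a Bonferroni correction instead of Tukey's HSD, which treats the groups as independent.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Hypothetical data: a shared day-to-day effect plus store-specific means and noise.
rng = np.random.default_rng(1)
n_days = 30
day_effect = rng.normal(0, 8, n_days)
store_means = {"A": 100.0, "B": 105.0, "C": 120.0}

rows = []
for store, mean in store_means.items():
    sales = mean + day_effect + rng.normal(0, 5, n_days)
    for day, value in enumerate(sales, start=1):
        rows.append({"store": store, "day": day, "sales": value})
df = pd.DataFrame(rows)

# Repeated measures ANOVA: 'day' is the subject, 'store' the within-subject factor.
fit = AnovaRM(df, depvar="sales", subject="day", within=["store"]).fit()
p_store = fit.anova_table["Pr > F"].iloc[0]
print(fit.anova_table)

# Post-hoc: paired t-tests on the same days, Bonferroni-adjusted for 3 comparisons.
wide = df.pivot(index="day", columns="store", values="sales")
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
adjusted = {}
for a, b in pairs:
    t_stat, p_raw = stats.ttest_rel(wide[a], wide[b])
    adjusted[(a, b)] = min(p_raw * len(pairs), 1.0)
    print(f"{a} vs {b}: t = {t_stat:.3f}, Bonferroni-adjusted p = {adjusted[(a, b)]:.4g}")
```

With distinct data per store the F value is positive and the table can be interpreted in the usual way.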



---