Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

ANOVA, or Analysis of Variance, is a parametric statistical technique for testing whether the means of three or more groups differ significantly. It assesses the effect of one or more factors by comparing the means of the groups (samples).

Assumptions for ANOVA:
1. The dependent variable is approximately normally distributed within each group. This assumption is more critical for smaller sample sizes.
2. The samples are selected at random and should be independent of one another.
3. All groups have equal variances (homogeneity of variance).
4. Each data point should belong to one and only one group. There should be no overlap or sharing of data points between groups.

Violating these assumptions can undermine the validity of ANOVA results. Here are some examples:

1. Homogeneity of variance: The variances of the groups being compared are not equal, for example when one treatment group is far more variable than the others. This distorts the F-test: it may inflate the Type I error rate or reduce the power of the test, particularly when group sizes are also unequal.
2. Normality: The residuals (the differences between observed values and their group means) do not follow a normal distribution, for example with heavily skewed data or extreme outliers. This can distort the Type I error rate of the F-test and the coverage of confidence intervals, mainly in small samples.
3. Independence: Observations are not independent, for example in repeated measures designs or clustered data analyzed as if every observation came from a different subject. This leads to incorrect standard errors, inflated Type I error rates, and misleading F-test results, and it can change the interpretation of group differences.
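
The normality and equal-variance assumptions above can be screened before running ANOVA. The following is a minimal sketch, assuming three hypothetical groups simulated with NumPy; the group names and parameters are illustrative only. It uses the Shapiro-Wilk test for normality and Levene's test for homogeneity of variance, both from scipy.stats.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical samples (assumption): three groups with the same spread
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.5, scale=1.0, size=30)
group_c = rng.normal(loc=6.0, scale=1.0, size=30)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test per group (H0: the sample comes from a normal distribution)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variance: Levene's test (H0: all group variances are equal)
levene_stat, levene_p = stats.levene(*groups)

for name, p in zip("ABC", shapiro_p):
    print(f"Group {name}: Shapiro-Wilk p = {p:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may be violated. Independence cannot be tested this way; it has to follow from how the data were collected.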

Q2. What are the three types of ANOVA, and in what situations would each be used?

There are three main types of ANOVA, each suitable for different experimental designs and research questions:

1. One-Way ANOVA:

- One-way ANOVA is used when there is one independent variable (factor) with three or more levels (groups). It determines whether there are statistically significant differences in the means of the dependent variable across the different levels of the independent variable.

- Example: Comparing the effectiveness of three different teaching methods (levels of the independent variable) on student test scores (dependent variable).

2. Two-Way ANOVA:

- Two-way ANOVA is used when there are two independent variables (factors) and one dependent variable. It examines the main effects of each independent variable as well as their interaction effect.
- Example: Investigating the effects of both gender and treatment type (two independent variables) on patient recovery time (dependent variable).

3. Repeated Measures ANOVA:

- Repeated Measures ANOVA is used when the same subjects are measured under different conditions or at different time points. It assesses within-subject differences across the repeated measures.
- Example: Analyzing the effect of a training program on participants' performance by measuring their scores before training, immediately after training, and one month after training.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the decomposition of the total variability observed in the data into different components that can be attributed to specific sources, such as between-group differences and within-group variation. 

- By partitioning the total variance into between-group variance and within-group variance, ANOVA helps researchers understand the relative contribution of different factors or groups to the overall variability in the data. This insight can inform decisions about which factors are most influential and merit further investigation.
- ANOVA compares the magnitude of between-group variance to within-group variance to determine whether the observed differences among groups are statistically significant. By understanding how the total variance is partitioned, researchers can assess the significance of group differences and make valid inferences about the effects of independent variables on the dependent variable.
- ANOVA provides valuable information about the sources of variability in the data, which enhances the interpretation of hypothesis tests. For example, if a significant difference is found between groups, understanding the partitioning of variance can help identify which specific groups or factors are driving the observed differences.
- Partitioning the variance allows researchers to identify areas of interest for further analysis or research. For example, if a large proportion of the total variance is explained by between-group differences, it may indicate that the independent variable(s) under investigation have a substantial impact on the dependent variable, warranting further exploration or experimental manipulation.
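
The partition described above can be checked by hand on a tiny data set. This is a sketch with made-up values for three hypothetical groups; it computes the total, between-group, and within-group sums of squares with NumPy and shows that the first equals the sum of the other two.

```python
import numpy as np

# Made-up observations for three hypothetical groups
groups = {
    "A": np.array([4.0, 5.0, 6.0]),
    "B": np.array([6.0, 7.0, 8.0]),
    "C": np.array([9.0, 10.0, 11.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total variability: squared deviations of every observation from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Between-group variability: squared deviations of group means from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# Within-group variability: squared deviations of observations from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# F is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print("SS total  :", round(ss_total, 4))    # approximately 44
print("SS between:", round(ss_between, 4))  # approximately 38
print("SS within :", round(ss_within, 4))   # approximately 6
print("F         :", round(f_stat, 4))      # approximately 19
```

Here most of the total variability (38 of 44) lies between the groups, which is why the F-statistic is large.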

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [2]:
import pandas as pd
from statsmodels.formula.api import ols
import seaborn as sns
from statsmodels.stats.anova  import anova_lm

#loading Iris dataset

df_iris = sns.load_dataset('iris')
print('Top 5 rows of iris dataset')
print(df_iris.head())

#fitting the anova model
model = ols('sepal_length ~ species', data = df_iris).fit()

#calculating SST, SSE and SSR
print('Value of sepal length vs species')
SSR = model.ssr
SSE = model.ess
SST = SSR + SSE

print('SSR:', SSR)
print('SSE:', SSE)
print('SST:', SST)

print(anova_lm(model))

Top 5 rows of iris dataset
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Value of sepal length vs species
SSR: 38.9562
SSE: 63.21213333333335
SST: 102.16833333333335
             df     sum_sq    mean_sq           F        PR(>F)
species     2.0  63.212133  31.606067  119.264502  1.669669e-31
Residual  147.0  38.956200   0.265008         NaN           NaN


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [19]:
import pandas as pd
import numpy as np
import seaborn as sns
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import statsmodels.api as sm


df = sns.load_dataset('iris')
print('Top 5 rows of the Iris dataset')
print(df.head())
print('\n-------------------------------------\n')

# NOTE: petal_length is continuous; wrapping it in C() treats every distinct value as its own level
model = ols('sepal_length ~ C(species) + C(petal_length) + C(species):C(petal_length)', data=df).fit()

# compute the Type II ANOVA table once, then slice out the rows of interest
anova_table = anova_lm(model, typ=2)
main_effects = anova_table.iloc[:2]
interaction_eff = anova_table.iloc[2:3]

print('Main effects')
print(main_effects)
print('\n------------\n')
print('Interaction effects')
print(interaction_eff)
print('\n------------\n')
print(anova_table)

Top 5 rows of the Iris dataset
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

-------------------------------------

Main effects
                       sum_sq    df             F        PR(>F)
C(species)       6.909412e-12   2.0  3.292629e-11  9.999954e-01
C(petal_length)  3.550098e+02  42.0  8.056061e+01  1.889577e-36

------------

Interaction effects
                               sum_sq    df         F        PR(>F)
C(species):C(petal_length)  84.879528  84.0  9.630644  2.764512e-20

------------

                                  sum_sq     df             F        PR(>F)
C(species)                  6.909412e-12    2.0  3.292629e-11  9.999954e-01
C(



Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

An F-statistic of 5.23 means the between-group mean square is about 5.23 times the within-group mean square. The p-value of 0.02 is the probability of obtaining an F-statistic at least this large if all group means were truly equal. Because 0.02 is below the typical significance level of 0.05, we reject the null hypothesis and conclude that at least one group mean differs significantly from the others.

In summary, the results suggest that there are significant differences between the groups, but further post-hoc tests or additional analysis may be needed to determine which specific groups differ from each other.
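
To show where such a p-value comes from, here is a small sketch using the F distribution in scipy.stats. The question does not state the degrees of freedom, so the values below (3 groups, 15 observations in total) are purely an assumption chosen for illustration.

```python
from scipy import stats

f_statistic = 5.23

# Assumption (not given in the question): 3 groups, 15 observations in total
n_groups, n_total = 3, 15
df_between = n_groups - 1      # numerator degrees of freedom
df_within = n_total - n_groups  # denominator degrees of freedom

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value at alpha = 0.05 for the same degrees of freedom
critical_value = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value = {p_value:.4f}")
print(f"critical F at alpha 0.05 = {critical_value:.3f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

Under these assumed degrees of freedom the p-value rounds to 0.02, in line with the figure in the question; other degrees of freedom would give a different p-value for the same F.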

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?


In repeated measures ANOVA, missing data can arise due to various reasons such as participant dropout, technical errors, or incomplete responses. Handling missing data appropriately is crucial to ensure the validity and reliability of the analysis. Here are some common methods for handling missing data in repeated measures ANOVA and their potential consequences:

1. Complete Case Analysis (CCA):
- In CCA, any case with missing data in any variable is excluded from the analysis.
- It's straightforward and easy to implement.
- This method can lead to biased estimates if missingness is related to the outcome or other variables in the analysis. It can also reduce statistical power if a large portion of the data is missing.

2. Mean Imputation:

- Mean imputation involves replacing missing values with the mean of the observed values for that variable.
- It preserves the sample size and can provide unbiased estimates if data are missing completely at random (MCAR).
- It can underestimate the standard errors, leading to inflated Type I error rates. It also assumes that the missing data have the same mean as the observed data, which may not always be true.

3. Last Observation Carried Forward (LOCF):

- LOCF involves using the last observed value for a participant to replace missing values in subsequent time points.
- It's simple and maintains the time sequence of the data.
- It can lead to biased estimates if missingness is related to changes over time or if there's a trend in the data. It may also overestimate treatment effects, especially in longitudinal studies with dropout.

4. Multiple Imputation:

- Multiple imputation generates several plausible values for each missing observation, based on the observed data and the estimated uncertainty.
- It provides more accurate estimates compared to single imputation methods and properly accounts for uncertainty due to missing data.
- It's computationally intensive and may require additional assumptions about the missing data mechanism. It can also be challenging to implement correctly.

5. Model-Based Imputation:

- Model-based imputation involves using regression or other statistical models to predict missing values based on observed data.
- It can provide more accurate imputations by incorporating information from other variables.
- It relies on the assumption that the model used for imputation accurately represents the relationship between variables, which may not always be the case.
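
The three simplest strategies above can be sketched with pandas. The data frame below is hypothetical: one row per subject and one column per time point, with two missing scores. The column and index names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are subjects, columns are time points
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 15.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis: drop every subject with any missing time point
complete_cases = scores.dropna()

# Mean imputation: fill each gap with the mean of that time point's observed values
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's last observed value forward along the time axis
locf = scores.ffill(axis=1)

print("Complete cases:\n", complete_cases)
print("Mean imputed:\n", mean_imputed)
print("LOCF:\n", locf)
```

Complete case analysis keeps only two of the four subjects here, which illustrates the loss of power, while mean imputation and LOCF keep all four but fill the gaps with values that carry no uncertainty.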

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

After conducting an Analysis of Variance (ANOVA) and finding a significant difference between groups, it's common to perform post-hoc tests to determine which specific groups differ from each other. There are several post-hoc tests available, each with its own assumptions and appropriate use cases. Some common post-hoc tests include:

1. Tukey's Honestly Significant Difference (HSD):

- Use when you want to compare all pairs of group means and the groups have roughly equal sizes and homogeneous variances.
- It controls the family-wise error rate across all pairwise comparisons.

2. Bonferroni Correction:

- Use when you have a small number of comparisons; it is a general correction that can be applied to any set of tests, including groups of unequal size.
- It's more conservative than Tukey's HSD and controls the family-wise error rate by dividing the significance threshold by the number of comparisons.

3. Scheffe's test: 

- This test also controls the family-wise error rate but is more conservative than Tukey's HSD test. 
- It is often used for complex contrasts (not only pairwise comparisons) and when there is no prior knowledge about which groups differ.

4. Duncan's New Multiple Range Test:

- Use when you have unequal group sizes and homogeneity of variances.
- It's less conservative than Tukey's HSD, so it has more power, but it does not control the family-wise error rate as strictly.
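
A post-hoc test becomes necessary when, for example, an ANOVA on three treatments is significant and you need to know which pairs of treatments differ. The sketch below illustrates the Bonferroni approach on simulated data for three hypothetical groups: it runs every pairwise independent t-test and then adjusts the p-values with multipletests from statsmodels.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)

# Simulated data (assumption): groups A and B share a mean, group C is higher
samples = {
    "A": rng.normal(5, 1, 20),
    "B": rng.normal(5, 1, 20),
    "C": rng.normal(8, 1, 20),
}

# Raw p-values from every pairwise two-sample t-test
pairs = list(combinations(samples, 2))
raw_p = [stats.ttest_ind(samples[a], samples[b]).pvalue for a, b in pairs]

# Bonferroni adjustment: each p-value is multiplied by the number of tests (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p_raw, p_adj, rej in zip(pairs, raw_p, adj_p, reject):
    print(f"{a} vs {b}: raw p = {p_raw:.4g}, adjusted p = {p_adj:.4g}, reject = {rej}")
```

The adjusted p-values are never smaller than the raw ones, which is what makes the procedure conservative.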

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [2]:
import numpy as np
from scipy.stats import f_oneway

np.random.seed(1)

# NOTE: for illustration this simulates 50 participants per diet (150 in total)
diet_a = np.random.normal(5,1,50)
diet_b = np.random.normal(4,1,50)
diet_c = np.random.normal(2,1,50)

f_statistic, p_value = f_oneway(diet_a,diet_b,diet_c)

null_hypo = "The mean weight loss is the same for all three diets"
alternate_hypo = "The mean weight loss is not the same for all three diets"

print('F-statistic = ', f_statistic)
print('P_value = ', p_value)

alpha = 0.05

if p_value<alpha:
    print('We reject the null hypothesis')
    print(alternate_hypo)
else:
    print('We failed to reject the null hypothesis')
    print(null_hypo)

F-statistic =  140.82403794251056
P_value =  6.893019806567957e-35
We reject the null hypothesis
The mean weight loss is not the same for all three diets


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [14]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

np.random.seed(123)

time_novice = np.random.normal(15, 2, 30)
time_expert = np.random.normal(10, 2, 30)

df = pd.DataFrame({'Software':['A']*20 + ['B']*20 + ['C']*20,
                 'Experience': ['Novice']*30 + ['Experienced']*30,
                  'Time': list(time_novice)+list(time_expert)})
print('Dataset:')
print(df.head())
print('\n-----------------------\n')

model = ols('Time ~ C(Software) + C(Experience) + C(Software): C(Experience)', data = df).fit()

table = sm.stats.anova_lm(model, typ=1)

alpha = 0.05

print(table)
print('\n-----------------------\n')
# use positional indexing (.iloc) because the table is indexed by term name
if table['PR(>F)'].iloc[0] < alpha:
    print('Conclusion: There is significant main effect of Software')
else:
    print('Conclusion: There is no significant main effect of Software')

if table['PR(>F)'].iloc[1] < alpha:
    print('Conclusion: There is significant main effect of Experience')
else:
    print('Conclusion: There is no significant main effect of Experience')

# the third row is the Software x Experience interaction, not a main effect
if table['PR(>F)'].iloc[2] < alpha:
    print('Conclusion: There is significant interaction effect between Software and Experience')
else:
    print('Conclusion: There is no significant interaction effect between Software and Experience')

Dataset:
  Software Experience       Time
0        A     Novice  12.828739
1        A     Novice  16.994691
2        A     Novice  15.565957
3        A     Novice  11.987411
4        A     Novice  13.842799

-----------------------

                             df      sum_sq  ...          F        PR(>F)
C(Software)                 2.0  204.881181  ...  18.135666  8.460472e-07
C(Experience)               1.0  165.079097  ...  29.224933  1.375177e-06
C(Software):C(Experience)   2.0   17.481552  ...   1.547431  2.217544e-01
Residual                   56.0  316.319953  ...        NaN           NaN

[4 rows x 5 columns]

-----------------------

Conclusion: There is significant main effect of Software
Conclusion: There is significant main effect of Experience
Conclusion: There is no significant interaction effect between Software and Experience


Here are the interpretations of the three conclusions:
"There is a significant main effect of software": This means that the software programs used by the employees have a significant impact on the outcome variable (e.g., completion time), independent of the experience level of the employees. This suggests that the choice of software program is an important factor that should be considered carefully when completing this task.

"There is a significant main effect of experience": This means that the experience level of the employees has a significant impact on the outcome variable, independent of the software program used. Specifically, this suggests that experienced employees may complete the task faster than novices, or vice versa. This finding can be helpful for the company to identify the best employees for a given task and to provide appropriate training for new employees.

"There is NO significant interaction effect between software and experience": This means that the effect of software on the outcome variable does not depend on the experience level of the employees, and vice versa. This suggests that the software programs perform similarly for both novices and experienced employees. This finding can be helpful for the company to decide which software program to use, as they do not need to consider the experience level of the employees when making the choice.

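Main effects and interactions are easiest to separate when every combination of software and experience level is observed. The sketch below is one way to simulate such a balanced design, assuming 5 employees per cell (30 in total, as in the question); the cell means and the random generator are assumptions and do not reproduce the numbers above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(123)

# Assumption: 3 programs x 2 experience levels x 5 employees per cell = 30 employees
rows = []
for software in ["A", "B", "C"]:
    for experience, mean_time in [("Novice", 15), ("Experienced", 10)]:
        for t in rng.normal(mean_time, 2, 5):
            rows.append({"Software": software, "Experience": experience, "Time": t})
balanced_df = pd.DataFrame(rows)

# Every Software x Experience cell should contain the same number of employees
cell_counts = pd.crosstab(balanced_df["Software"], balanced_df["Experience"])
print(cell_counts)

# "*" expands to both main effects plus their interaction
model = ols("Time ~ C(Software) * C(Experience)", data=balanced_df).fit()
table = anova_lm(model, typ=2)
print(table)
```

With equal cell counts the two factors are not confounded, so the sums of squares for each main effect and for the interaction can be attributed unambiguously.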
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [4]:
import numpy as np
import pandas as pd
import scipy.stats as stats

np.random.seed(45)

test_score_control = np.random.normal(70,3,50)
test_score_experimental = np.random.normal(70,3,50)

df = pd.DataFrame({'test_score': list(test_score_control)+list(test_score_experimental),
                  'Teaching_method': ['Control']*50 + ['experimental']*50})
print('Dataset:')
print(df.head())
print('\n-------------\n')

null_hypo = 'There is no significant difference in test scores between two groups'
alternate_hypo = 'There is significant difference in test scores between two groups'

statistic, p_value = stats.ttest_ind(test_score_control, test_score_experimental, equal_var=True)

print('Statistic', statistic)
print('P_value', p_value)

alpha = 0.05

if p_value<alpha:
    print('Reject the null hypothesis')
    print(alternate_hypo)
else:
    print('We failed to reject the null hypothesis')
    print(null_hypo)

Dataset:
   test_score Teaching_method
0   70.079124         Control
1   70.780965         Control
2   68.814563         Control
3   69.387097         Control
4   66.185102         Control

-------------

Statistic -1.5847354411553825
P_value 0.11624822250929473
We failed to reject the null hypothesis
There is no significant difference in test scores between two groups


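Tukey's HSD test
With only two groups a post-hoc test is not strictly needed: the single pairwise comparison is the t-test itself, and the adjusted p-value below (0.1162) matches the t-test p-value. It is shown for completeness.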

In [5]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_result =  pairwise_tukeyhsd(df['test_score'], df['Teaching_method'], 0.05)
print(tukey_result)

   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj   lower  upper  reject
----------------------------------------------------------
Control experimental   0.8829 0.1162 -0.2227 1.9886  False
----------------------------------------------------------


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-
hoc test to determine which store(s) differ significantly from each other.

In [6]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# set random seed for reproducibility
np.random.seed(456)

# generate sales data for Store A, B, and C
sales_a = np.random.normal(loc=1000, scale=100, size=(30,))
sales_b = np.random.normal(loc=1050, scale=150, size=(30,))
sales_c = np.random.normal(loc=800, scale=80, size=(30,))

# create a DataFrame to store the sales data
sales_df = pd.DataFrame({'Store A': sales_a, 'Store B': sales_b, 'Store C': sales_c})

# reshape the DataFrame for repeated measures ANOVA
sales_melted = pd.melt(sales_df.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
sales_melted.columns = ['Day', 'Store', 'Sales']

# Printing top 5 rows of generated data
print('Generated data top 5 rows : ')
print(sales_melted.head())

print('\n================================================\n')

# perform repeated measures ANOVA
rm_anova = AnovaRM(sales_melted, 'Sales', 'Day', within=['Store'])
rm_results = rm_anova.fit()
print(rm_results)

# check if null hypothesis should be rejected based on p-value
if rm_results.anova_table['Pr > F'].iloc[0] < 0.05:
    # perform post-hoc Tukey test (note: this treats the stores as independent groups and ignores the pairing by day)
    print('Reject the Null Hypothesis : Atleast one of the group has different mean.\n')
    print('Tukey HSD posthoc test:')
    tukey_results = pairwise_tukeyhsd(sales_melted['Sales'], sales_melted['Store'])
    print(tukey_results)
else:
    print('NO significant difference between groups.')

Generated data top 5 rows : 
   Day    Store        Sales
0    0  Store A   933.187150
1    1  Store A   950.179048
2    2  Store A  1061.857582
3    3  Store A  1056.869225
4    4  Store A  1135.050948


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 51.5040 2.0000 58.0000 0.0000

Reject the Null Hypothesis : Atleast one of the group has different mean.

Tukey HSD posthoc test:
    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1  group2  meandiff p-adj    lower     upper   reject
-----------------------------------------------------------
Store A Store B   21.2439 0.6945   -40.881   83.3688  False
Store A Store C -207.8078    0.0 -269.9328 -145.6829   True
Store B Store C -229.0517    0.0 -291.1766 -166.9268   True
-----------------------------------------------------------


Interpretation of the above
The repeated measures ANOVA gives a p-value (Pr > F) of 0.0000, which is less than 0.05, so we reject the null hypothesis: at least one store's mean sales differs from the others.

Tukey's post-hoc test gives the following interpretation:

No significant difference between the sales of Store A and Store B (reject=False). The observed mean difference of about 21.24 dollars in favor of Store B is not statistically significant.
Significant difference between the sales of Store A and Store C (reject=True). Store C sells about 207.8 dollars less than Store A.
Significant difference between the sales of Store B and Store C (reject=True). Store C sells about 229.05 dollars less than Store B.
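
Because the same 30 days are measured for every store, a post-hoc procedure that respects this pairing is a possible alternative to Tukey's HSD. The following is a minimal sketch, assuming paired t-tests with a manual Bonferroni correction; the data are re-simulated with a different random generator, so the numbers will not match the output above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(456)

# Re-simulated sales (assumption): one row per day, one column per store
wide = pd.DataFrame(
    {
        "Store A": rng.normal(1000, 100, 30),
        "Store B": rng.normal(1050, 150, 30),
        "Store C": rng.normal(800, 80, 30),
    }
)

pairs = list(combinations(wide.columns, 2))
n_comparisons = len(pairs)

# Paired t-test per store pair, then Bonferroni: multiply by the number of tests, cap at 1
adjusted = {}
for a, b in pairs:
    t_stat, p_raw = stats.ttest_rel(wide[a], wide[b])
    adjusted[(a, b)] = min(p_raw * n_comparisons, 1.0)

for (a, b), p_adj in adjusted.items():
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{a} vs {b}: Bonferroni-adjusted p = {p_adj:.4g} ({verdict})")
```

Pairing by day removes day-to-day variation shared by all stores, which is the same source of variation the repeated measures ANOVA accounts for.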