Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [1]:
# Answer:
'''
The assumptions required for ANOVA are:

Normality: The data in each group (equivalently, the model residuals) should be approximately normally distributed. Violating this assumption can make the p-values and confidence intervals inaccurate, especially when the groups are small. 
            For example, if the data in one of the groups is highly skewed, ANOVA may not be appropriate.

Homogeneity of variance: The variance of the data should be the same in every group. This assumption is also known as homoscedasticity. 
                        If the variances differ markedly, the F-test can reject too often or too rarely, particularly when the group sizes are unequal. For example, one group having a much larger variance than the others can distort the results.

Independence: The observations should be independent of each other, both within and between groups. Violation of this assumption can lead to inaccurate p-values and confidence intervals. 
                For example, repeated measurements taken on the same subject are correlated with each other, so a standard one-way ANOVA is not appropriate for them.

Examples of violations that could impact the validity of ANOVA results are:

Outliers: Outliers can have a significant impact on the mean and variance of a group, which can violate the assumptions of normality and homogeneity of variance.

Non-normality: If the data in each group does not follow a normal distribution, the ANOVA results may be unreliable.

Unequal variances: If the variances in each group are significantly different, then ANOVA may not be appropriate.

Correlated data: If the observations in each group are not independent, then ANOVA may not be appropriate.

Missing data: If many observations are missing, and especially if they are not missing at random, the groups can become unbalanced and the estimates biased, so the ANOVA results may be unreliable.

'''
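The first two assumptions above can be checked before running an ANOVA. The following is a minimal sketch, assuming three small groups with made-up values; it uses the Shapiro-Wilk test for normality and Levene's test for equal variances from scipy. Independence cannot be tested this way and has to be judged from the study design.

```python
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
group_a = [4, 6, 8, 7, 5]
group_b = [10, 12, 14, 11, 13]
group_c = [19, 17, 21, 20, 18]
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test per group (H0: the sample comes from a normal distribution)
shapiro_pvalues = []
for g in groups:
    _, p = stats.shapiro(g)
    shapiro_pvalues.append(p)

# Homogeneity of variance: Levene's test across groups (H0: all group variances are equal)
levene_stat, levene_p = stats.levene(*groups)

for name, p in zip("ABC", shapiro_pvalues):
    print(f"Shapiro-Wilk p-value, group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold. With samples this small the tests have little power, so they are best read alongside plots of the data.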

Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
# Answer:
'''
The three types of ANOVA (Analysis of Variance) are:

One-Way ANOVA: This type of ANOVA is used when there is one categorical independent variable (also known as a factor) and one continuous dependent variable. 
                It is used to test if there is a significant difference in the mean values of the dependent variable across two or more groups defined by the independent variable.

Two-Way ANOVA: This type of ANOVA is used when there are two categorical independent variables and one continuous dependent variable. 
                It is used to test if there is a significant interaction between the two independent variables, as well as to test for main effects of each independent variable.

MANOVA (Multivariate ANOVA): This type of ANOVA is used when there are two or more continuous dependent variables and one or more categorical independent variables. 
                            It is used to test if there is a significant difference between the mean values of the dependent variables across two or more groups defined by the independent variable.
'''

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
# Answer:
'''
Partitioning of variance in ANOVA refers to the process of dividing the total variation of a dataset into different sources of variation: 
a between-group (explained) component and a within-group (residual) component, so that total variation = between-group variation + within-group variation.

Understanding the partitioning of variance is important in ANOVA because it allows us to determine the proportion of the total variation in the data that can be explained by the differences between the groups. 
The F-statistic is built from this partition as the ratio of the between-group mean square to the within-group mean square, which tells us whether the differences between the groups are statistically significant or could have occurred by chance.
'''
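The partition described above can be checked numerically. This is a small sketch in plain Python, assuming three made-up groups of equal size; the variable names are illustrative.

```python
# Hypothetical scores for three groups (illustrative values only)
groups = {
    "A": [2.0, 3.0, 4.0],
    "B": [5.0, 6.0, 7.0],
    "C": [8.0, 9.0, 10.0],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Total variation: every observation against the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Between-group variation: each group mean against the grand mean, weighted by group size
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group variation: every observation against its own group mean
ss_within = sum(
    (x - sum(values) / len(values)) ** 2
    for values in groups.values()
    for x in values
)

# F-ratio: between-group mean square over within-group mean square
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_ratio = (ss_between / df_between) / (ss_within / df_within)

print(ss_total, ss_between + ss_within, f_ratio)
```

The first two printed values agree, which is the partition itself: the total sum of squares equals the between-group plus the within-group sum of squares.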

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [6]:
import scipy.stats as stats
import pandas as pd

# Create a DataFrame with the data
data = {'Group A': [4, 6, 8, 7, 5],
        'Group B': [10, 12, 14, 11, 13],
        'Group C': [19, 17, 21, 20, 18]}
df = pd.DataFrame(data)

# Calculate the overall mean
overall_mean = df.values.mean()

# Calculate the total sum of squares
squared_deviations = (df - overall_mean) ** 2
SST = squared_deviations.values.sum()

# Calculate the explained (between-group) sum of squares:
# each group's squared deviation is weighted by the number of observations in that group
group_means = df.mean()
squared_deviations = (group_means - overall_mean) ** 2
SSE = (squared_deviations * len(df)).sum()

# Calculate the residual (within-group) sum of squares
squared_deviations = (df - group_means) ** 2
SSR = squared_deviations.values.sum()

# Print the results (SST should equal SSE + SSR)
print(f'SST: {SST:.2f}')
print(f'SSE: {SSE:.2f}')
print(f'SSR: {SSR:.2f}')


SST: 453.33
SSE: 423.33
SSR: 30.00


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [11]:
# Answer:
'''
In a two-way ANOVA, the main effects and interaction effects can be calculated using Python with the help of ANOVA tables. 
The ANOVA table summarizes the sources of variation, degrees of freedom, sums of squares, mean squares, and F-test statistics for each factor and interaction.

To calculate the main effects, we can look at the ANOVA table and compare the mean square of each factor to the mean square of the residuals. 
Each factor has its own main effect, and that effect is significant when the factor's F-test statistic has a p-value below the chosen significance level. For example, if we have a two-way ANOVA with factors A and B, we can calculate the main effect of A by 
comparing the mean square for A to the mean square of the residuals, and the main effect of B by comparing the mean square for B to the mean square of the residuals.

To calculate the interaction effect, we can look at the ANOVA table and compare the mean square for the interaction to the mean square of the residuals. 
If the F-test statistic for the interaction is significant, then there is evidence of an interaction effect.
'''

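The mean-square comparisons described in this answer can be sketched by hand for a balanced design. The example below assumes a made-up 2 x 2 layout with three replicates per cell; the data and the names `y`, `ss_a`, `ss_b`, and `ss_ab` are illustrative, and the formulas used here apply only when every cell has the same number of observations.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: factor A (2 levels) x factor B (2 levels), 3 replicates per cell.
# Array axes: (level of A, level of B, replicate)
y = np.array([
    [[10.0, 11.0, 12.0], [14.0, 15.0, 16.0]],
    [[20.0, 21.0, 22.0], [30.0, 31.0, 32.0]],
])
a, b, n = y.shape
grand_mean = y.mean()
mean_a = y.mean(axis=(1, 2))   # one mean per level of A
mean_b = y.mean(axis=(0, 2))   # one mean per level of B
cell_means = y.mean(axis=2)    # one mean per (A, B) cell

# Sums of squares for a balanced design
ss_a = b * n * np.sum((mean_a - grand_mean) ** 2)
ss_b = a * n * np.sum((mean_b - grand_mean) ** 2)
ss_cells = n * np.sum((cell_means - grand_mean) ** 2)
ss_ab = ss_cells - ss_a - ss_b
ss_resid = np.sum((y - cell_means[:, :, None]) ** 2)

# Degrees of freedom and residual mean square
df_a, df_b, df_ab, df_resid = a - 1, b - 1, (a - 1) * (b - 1), a * b * (n - 1)
ms_resid = ss_resid / df_resid

# Each effect's F is its mean square divided by the residual mean square
for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("A:B", ss_ab, df_ab)]:
    f_value = (ss / df) / ms_resid
    p_value = stats.f.sf(f_value, df, df_resid)
    print(f"{name}: F = {f_value:.2f}, p = {p_value:.4f}")
```

For unbalanced data, a model-based ANOVA table (for example from statsmodels) is the safer route, because these closed-form sums of squares no longer partition the total cleanly.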

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [None]:
# Answer:
'''
If the F-statistic is 5.23 and the p-value is 0.02 in a one-way ANOVA, we can conclude that there is a significant difference between at least two of the groups. 
This means that the null hypothesis of equal means across all groups can be rejected at the specified level of significance (usually 0.05).

The F-statistic represents the ratio of the variance between groups to the variance within groups. A large F-statistic indicates that the variance between groups is large 
relative to the variance within groups. The p-value is the probability of observing an F-statistic at least this large, assuming that the null hypothesis is true.

To interpret the results, we can say that there is evidence at the 0.05 level that at least two of the groups have different means. However, we cannot determine which specific groups are different 
from each other based solely on the ANOVA results. Post-hoc tests such as Tukey's HSD or Bonferroni correction can be used to make pairwise comparisons between the groups and determine which pairs are significantly different.
'''
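The link between the F-statistic and its p-value can be sketched with scipy's F distribution. The degrees of freedom are not given in the question, so the values below (3 groups, 17 observations) are assumptions chosen only for illustration.

```python
from scipy import stats

f_statistic = 5.23

# Assumed degrees of freedom (not given in the question): 3 groups and 17 observations
df_between = 3 - 1
df_within = 17 - 3

# Survival function = P(F >= f_statistic) under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

alpha = 0.05
reject_null = p_value < alpha
print(f"p-value: {p_value:.4f}, reject H0 at alpha={alpha}: {reject_null}")
```

With these assumed degrees of freedom the p-value comes out close to 0.02, so the null hypothesis of equal means is rejected at the 0.05 level but would not be rejected at 0.01.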

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [None]:
# Answer:
'''
In a repeated measures ANOVA, missing data can be handled in a few ways:

Listwise deletion: If a participant has any missing data in any of the repeated measures, they are excluded from the analysis entirely. 
This method reduces the sample size and therefore the power, and it may result in biased estimates if the data are not missing completely at random.

Pairwise deletion: If a participant has some missing data in one or more repeated measures, their data is still included in the analysis for the measures where data is available. 
This method keeps more data, but each comparison is then based on a different set of participants, and the estimates may still be biased if the data are not missing completely at random.

Imputation: Missing values can be replaced with plausible values. There are different methods to do this such as mean imputation, regression imputation, and multiple imputation.

The consequences of using different methods to handle missing data are different. Listwise deletion causes the largest loss of power because it discards every participant with an incomplete record. 
Pairwise deletion retains more observations but makes results harder to compare across measures. Imputation can preserve power and give unbiased estimates if the imputation model is correctly specified, although simple mean imputation understates the variance.
'''
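The three approaches above can be sketched with pandas. The wide-format table below (one row per participant, one column per measurement occasion) and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per measurement occasion
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [6.0, np.nan, 8.0, 9.0],
        "t3": [7.0, 8.0, np.nan, 10.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Listwise deletion: keep only participants with complete records
listwise = wide.dropna()

# Pairwise deletion: each statistic uses whatever data is available for the columns involved;
# DataFrame.corr excludes missing values pair by pair
pairwise_corr = wide.corr()

# Mean imputation: replace each missing value with its column mean (this understates variance)
mean_imputed = wide.fillna(wide.mean())

print(len(wide), "participants before,", len(listwise), "after listwise deletion")
print(mean_imputed)
```

Here listwise deletion halves the sample, which shows how quickly power is lost when missing values are spread across participants.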

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [None]:
# Answer:
'''
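Post-hoc tests are used to determine which specific groups in an ANOVA are significantly different from one another after finding a significant difference between 
at least two groups in the ANOVA. Some common post-hoc tests include Tukey's Honestly Significant Difference (HSD), Bonferroni correction, Scheffé's test, and Dunnett's test.

Tukey's HSD is used to compare every pair of group means. The Bonferroni correction suits a small number of pre-planned comparisons. Scheffé's test is suited to complex contrasts beyond simple pairs, 
and Dunnett's test is used to compare each group against a single control group.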

For example, suppose a researcher conducts an ANOVA to compare the mean weight of fish across four different tanks in a fish farm. The ANOVA shows a significant difference 
between at least two of the tanks. To determine which tanks are significantly different, the researcher can use a post-hoc test such as Tukey's HSD or Bonferroni correction.
'''
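The fish-tank example can be sketched with Bonferroni-corrected pairwise t-tests. The weights below are made up for illustration, and the tank names are hypothetical.

```python
from itertools import combinations
from scipy import stats

# Hypothetical fish weights (grams) for four tanks; values are illustrative only
tanks = {
    "tank1": [200, 205, 198, 210, 202],
    "tank2": [220, 225, 218, 230, 222],
    "tank3": [201, 207, 199, 209, 204],
    "tank4": [240, 238, 245, 242, 236],
}

pairs = list(combinations(tanks, 2))
n_comparisons = len(pairs)

adjusted = {}
for first, second in pairs:
    _, p_raw = stats.ttest_ind(tanks[first], tanks[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
    adjusted[(first, second)] = min(p_raw * n_comparisons, 1.0)

for (first, second), p_adj in adjusted.items():
    print(f"{first} vs {second}: adjusted p = {p_adj:.4f}, significant = {p_adj < 0.05}")
```

Bonferroni is simple but conservative; with many groups, Tukey's HSD usually retains more power for all-pairs comparisons.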

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [12]:
import scipy.stats as stats

# create illustrative data: 30 made-up observations per diet (the question mentions 50 participants in total)
diet_A = [5, 8, 9, 10, 6, 7, 5, 4, 6, 7, 8, 10, 9, 12, 13, 8, 6, 4, 5, 7, 8, 10, 11, 9, 12, 7, 6, 8, 10, 9]
diet_B = [4, 5, 7, 6, 4, 5, 6, 7, 8, 9, 7, 6, 8, 9, 10, 11, 12, 8, 7, 6, 5, 4, 6, 7, 9, 10, 11, 12, 8, 9]
diet_C = [3, 5, 6, 7, 4, 3, 5, 6, 7, 8, 9, 6, 7, 8, 9, 10, 11, 12, 8, 7, 6, 5, 4, 6, 7, 9, 10, 11, 12, 8]

# conduct one-way ANOVA
f_stat, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

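# print results
print("F-statistic:", f_stat)
print("p-value:", p_value)

# Interpretation: the p-value (about 0.55) is well above 0.05, so we fail to reject the null hypothesis;
# these data show no significant difference in mean weight loss between the three diets.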


F-statistic: 0.5965250965250967
p-value: 0.5529586973116905


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [16]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

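# Create a DataFrame with illustrative data: 36 observations (3 programs x 2 experience levels x 6 employees),
# rather than the 30 employees mentioned in the question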
data = {'Software': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C',
                     'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'Experience': ['Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice',
                       'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced',
                       'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice',
                       'Novice', 'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced',
                       'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced'],
        'Time': [9.2, 8.5, 8.8, 10.4, 11.2, 10.6, 11.7, 12.1, 11.8, 7.6, 7.9, 7.5, 9.8, 9.5, 10.1, 12.5, 12.8, 12.1,
                 8.9, 9.2, 8.5, 10.7, 10.5, 10.9, 12.7, 12.9, 12.2, 7.2, 7.5, 7.1, 9.4, 9.2, 9.7, 11.5, 11.8, 11.3]}

df = pd.DataFrame(data)

# Fit the model with interaction term
model = ols('Time ~ C(Software)*C(Experience)', data=df).fit()

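# Print the ANOVA table
table = sm.stats.anova_lm(model, typ=2)
print(table)

# Interpretation (alpha = 0.05): both main effects and the interaction have p-values below 0.05.
# Completion time differs between programs and between experience levels, and the size of the
# program effect depends on the employee's experience level.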


                              sum_sq    df           F        PR(>F)
C(Software)                94.017222   2.0  301.444603  1.370577e-20
C(Experience)               7.380278   1.0   47.326327  1.230393e-07
C(Software):C(Experience)   2.153889   2.0    6.905949  3.411440e-03
Residual                    4.678333  30.0         NaN           NaN


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [17]:
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate simulated data: 100 scores per group (means 70 and 75, standard deviation 10)
np.random.seed(123)
control = np.random.normal(70, 10, size=100)
experimental = np.random.normal(75, 10, size=100)

# Conduct two-sample t-test
t_stat, p_value = stats.ttest_ind(control, experimental)
print("t-statistic:", t_stat)
print("p-value:", p_value)

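# Conduct post-hoc test (with only two groups this repeats the t-test comparison;
# Tukey's HSD becomes informative when there are three or more groups)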
data = pd.DataFrame({'score': np.concatenate([control, experimental]),
                     'group': np.concatenate([['control']*len(control), ['experimental']*len(experimental)])})
posthoc = pairwise_tukeyhsd(data['score'], data['group'])
print(posthoc)


t-statistic: -3.0316172004188147
p-value: 0.0027577299763983324
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
control experimental   4.5336 0.0028 1.5846 7.4826   True
---------------------------------------------------------


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-
hoc test to determine which store(s) differ significantly from each other.

In [19]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe with illustrative sales data (3 days per store rather than the 30 in the question)
data = {'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'day': [1, 2, 3, 1, 2, 3, 1, 2, 3],
        'sales': [100, 120, 110, 90, 80, 85, 70, 75, 80]}
df = pd.DataFrame(data)

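# Fit an OLS model with store as a factor, day as a numeric covariate, and their interaction.
# Note: this only approximates a repeated measures ANOVA, which would treat each day as the
# repeated-measures unit (subject) and store as the within-subject factor.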
model = ols('sales ~ store + day + store:day', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)


           sum_sq   df     F    PR(>F)
store      1950.0  2.0  15.6  0.025980
day          37.5  1.0   0.6  0.495025
store:day    75.0  2.0   0.6  0.603682
Residual    187.5  3.0   NaN       NaN
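A repeated measures ANOVA proper treats each day as the unit that is measured under all three stores. The following is a minimal sketch using `AnovaRM` from statsmodels on the same illustrative data; treating `day` as the subject identifier is an assumption about how the study is structured.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Same illustrative data as the cell above: each day is measured under all three stores
df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "day": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "sales": [100.0, 120.0, 110.0, 90.0, 80.0, 85.0, 70.0, 75.0, 80.0],
})

# day identifies the repeated-measures unit (subject); store is the within-subject factor
result = AnovaRM(data=df, depvar="sales", subject="day", within=["store"]).fit()
rm_table = result.anova_table
print(rm_table)
```

Because day-to-day variation is removed from the error term, the store effect is tested against 4 residual degrees of freedom here. If the store effect is significant, pairwise comparisons between stores (for example paired t-tests with a Bonferroni correction) would be the natural follow-up.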
