In [2]:
# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
# the validity of the results.
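# Ans1: Assumptions for ANOVA:
# To perform ANOVA, the following assumptions need to be satisfied:
# Normality: The dependent variable (more precisely, the residuals) should be approximately normally distributed in each group.
#   Example violation: strongly skewed outcomes such as reaction times or incomes, especially when groups are small.
# Homogeneity of variance: The variance of the dependent variable should be equal across all groups.
#   Example violation: one group's scores are far more spread out than the others, which distorts the F-test, particularly with unequal group sizes.
# Independence: The observations should be independent of each other.
#   Example violation: the same participants are measured several times, or participants are clustered (e.g., students within classrooms), and the analysis ignores this.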
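
# The normality and equal-variance assumptions listed above can be checked roughly as follows. This is a minimal sketch on
# simulated data (the three groups and their parameters are assumptions for illustration), using the Shapiro-Wilk and Levene tests from scipy.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 simulated observations each
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (5.0, 5.5, 6.0)]

# Normality: Shapiro-Wilk test within each group (a small p-value suggests non-normality)
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variance: Levene's test across groups (a small p-value suggests unequal variances)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
```

# Independence cannot be tested this way; it has to be judged from how the data were collected.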

In [3]:
# Q2. What are the three types of ANOVA, and in what situations would each be used?
# Ans2: Types of ANOVA:

# One-way ANOVA: Used to compare the means of groups defined by a single independent variable (factor), typically three or more groups.
# Two-way ANOVA: Used to test the effects of two independent variables on the dependent variable, including their interaction.
# Three-way ANOVA: Used to test the effects of three independent variables on the dependent variable, including their interactions.

In [4]:
# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
# Ans3: Partitioning of variance:
# Partitioning of variance is the idea in ANOVA of splitting the total variability of the dependent variable (the total sum of squares) into components:
# the variability explained by the independent variable(s) (between groups) and the variability left unexplained (within groups, or residual). It is important
# to understand this concept because it identifies the sources of variability in the data, and the ratio of explained to unexplained variability (the F-statistic)
# determines whether the independent variable(s) account for a significant share of the variation in the dependent variable.
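
# The partition described above can be verified by hand. The sketch below uses three small made-up groups (the numbers are
# assumptions chosen for easy arithmetic) and checks that the total sum of squares equals the between-group plus within-group parts.

```python
import numpy as np

# Hypothetical scores for three groups
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([9.0, 10.0, 11.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variability around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Variability of the group means around the grand mean (explained)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability of observations around their own group mean (unexplained)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F-statistic: ratio of the between-group and within-group mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(ss_total, ss_between, ss_within, f_stat)
```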

In [None]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
# sum of squares (SSR) in a one-way ANOVA using Python?
# Ans-
# To calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python, you can use the statsmodels library.

# Here's an example code snippet that demonstrates how to calculate these values for a one-way ANOVA:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data
df = pd.read_csv('data.csv')

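# Fit the model; C(group) makes statsmodels treat group as categorical
model = ols('y ~ C(group)', data=df).fit()

# Build the ANOVA table once (Type I sums of squares)
anova_table = sm.stats.anova_lm(model, typ=1)

# Explained (between-group) sum of squares, called SSE in this question
sse = anova_table.loc['C(group)', 'sum_sq']

# Residual (within-group) sum of squares, called SSR in this question
ssr = anova_table.loc['Residual', 'sum_sq']

# Total sum of squares: the ANOVA table has no total row, so add the two parts
sst = sse + ssr

# In the above code, df is a Pandas DataFrame that contains the data for the one-way ANOVA, y is the dependent variable, and group is the independent variable, wrapped in C() so that it is treated as categorical. The ols function fits the model, and typ=1 requests Type I sums of squares. sm.stats.anova_lm returns the ANOVA table, whose rows are the group term and the residual. SSE is read from the C(group) row, SSR from the Residual row, and SST is their sum because the table has no total row.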

# Note that you'll need to replace 'data.csv' with the actual file path to your data file, and adjust the variable names (y and group) to match your data.
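
# For a version that runs without a data file, the sketch below simulates a small dataset (the group labels, means, and sample sizes
# are assumptions) and reads the three sums of squares from the fitted model's ess, ssr, and centered_tss attributes.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical data: three groups of 20 simulated observations each
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "y": np.concatenate([rng.normal(m, 1.0, 20) for m in (3.0, 4.0, 5.0)]),
})

model = ols("y ~ C(group)", data=df).fit()

explained = model.ess        # explained (model) sum of squares
residual = model.ssr         # sum of squared residuals
total = model.centered_tss   # total sum of squares about the mean

print("explained:", explained)
print("residual:", residual)
print("total:", total)
```

# Note that statsmodels names the explained part ess and the residual part ssr, which matches the SSE/SSR labels used in this question.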
   

In [None]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
# Ans-
# To calculate the main effects and interaction effects in a two-way ANOVA using Python, you can use the statsmodels library.

# Here's an example code snippet that demonstrates how to calculate these effects:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Load data
df = pd.read_csv('data.csv')

# Fit the model; C() marks each factor as categorical
model = ols('y ~ C(A) + C(B) + C(A):C(B)', data=df).fit()

# Build the ANOVA table (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)

# Main effects: the rows for each factor (sum of squares, df, F, p-value)
main_effects = anova_table.loc[['C(A)', 'C(B)']]

# Interaction effect: the row for the A-by-B term
interaction_effect = anova_table.loc['C(A):C(B)']

# Print the results
print('Main effects:')
print(main_effects)
print('Interaction effect:')
print(interaction_effect)

# In the above code, df is a Pandas DataFrame that contains the data for the two-way ANOVA, y is the dependent variable, A and B are the independent variables (wrapped in C() so they are treated as categorical), and C(A):C(B) specifies the interaction term. The ols function is used to fit the model.

# After fitting the model, sm.stats.anova_lm builds the ANOVA table. The rows for C(A) and C(B) give the sum of squares, degrees of freedom, F-statistic, and p-value for each main effect, and the C(A):C(B) row gives the same quantities for the interaction. The raw coefficients in model.params are not used here: with categorical factors they are contrasts against a reference level, and there is one per non-reference level rather than one per factor.

# Note that you'll need to replace 'data.csv' with the actual file path to your data file, and adjust the variable names (y, A, and B) to match your data.

In [19]:
# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
# What can you conclude about the differences between the groups, and how would you interpret these
# results?
# Ans-
# If you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, you can conclude that at least one group mean differs significantly from the others at the 0.05 significance level.

# The F-statistic is the ratio of the between-group variability to the within-group variability. An F-statistic of 5.23 means the between-group mean square is about five times the within-group mean square, which is more separation between the group means than chance alone would typically produce.

# The p-value of 0.02 is below the conventional 0.05 threshold, so we reject the null hypothesis that all group means are equal. Specifically, if the null hypothesis were true, there would be only a 2% chance of observing an F-statistic at least this large.

# Therefore, we can conclude that there is a statistically significant difference between the groups. However, we cannot determine which specific groups differ from each other based solely on the ANOVA results. Post-hoc tests, such as Tukey's HSD or pairwise comparisons with a Bonferroni correction, can be conducted to determine which groups are significantly different from each other.
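
# The link between an F-statistic and its p-value can be sketched with scipy's F distribution. The question does not state the
# degrees of freedom, so the values below (3 groups, 30 observations) are assumptions and will not reproduce p = 0.02 exactly.

```python
from scipy import stats

f_statistic = 5.23
df_between, df_within = 2, 27  # hypothetical: 3 groups, 30 observations in total

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F that would have to be exceeded at the 0.05 level
critical_f = stats.f.ppf(0.95, df_between, df_within)

print("p-value:", p_value)
print("critical F at alpha = 0.05:", critical_f)
print("reject null:", f_statistic > critical_f)
```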

In [None]:
# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
# consequences of using different methods to handle missing data?
# Ans-
# In a repeated measures ANOVA, missing data can occur when a participant has missing values for one or more of the repeated measures. There are several methods to handle missing data in a repeated measures ANOVA, including listwise deletion, pairwise deletion, mean imputation, and multiple imputation.

# Listwise deletion involves excluding any participant with missing data for any of the repeated measures. Pairwise deletion involves analyzing only the available data for each comparison, and mean imputation involves replacing the missing values with the mean of the available data for that participant. Multiple imputation involves creating several plausible values for each missing data point based on the observed data and using these values to conduct the analysis.

# The choice of method can change the results substantially. Listwise deletion shrinks the sample and reduces statistical power, and it introduces bias unless the data are missing completely at random. Pairwise deletion keeps more data but bases each comparison on a different subset of participants, and it can also give biased estimates when the data are not missing completely at random. Mean imputation artificially reduces variability, so it underestimates standard errors and can bias estimates. Multiple imputation is often considered the best of these methods because it retains the full sample size and carries the uncertainty about the missing values into the standard errors and p-values, provided the imputation model is reasonable. However, it can be computationally intensive and requires careful consideration of the assumptions underlying the imputation model.
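
# Two of the simpler methods above, listwise deletion and per-participant mean imputation, can be sketched with pandas.
# The participants, time points, and scores below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [6.0, np.nan, 8.0, 9.0],
        "t3": [7.0, 8.0, np.nan, 10.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Listwise deletion: drop every participant with any missing measurement
listwise = wide.dropna()

# Mean imputation: fill each gap with that participant's own mean of the observed values
participant_means = wide.mean(axis=1)
imputed = wide.apply(lambda row: row.fillna(participant_means[row.name]), axis=1)

print("Participants kept after listwise deletion:", list(listwise.index))
print(imputed)
```

# Listwise deletion halves this toy sample, while mean imputation keeps all four participants but fills in values that carry no new information.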

In [20]:
# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
# an example of a situation where a post-hoc test might be necessary.
# Ans-
# After conducting an ANOVA and finding a significant overall effect, post-hoc tests can be used to compare pairs of groups to determine which ones differ significantly from each other. Here are some common post-hoc tests used after ANOVA:

# 1. Tukey's Honestly Significant Difference (HSD): This test compares the means of all possible pairs of groups and controls the overall (family-wise) Type I error rate. It assumes equal variances and was designed for equal group sizes; the Tukey-Kramer variant extends it to unequal group sizes.

# 2. Bonferroni correction: This method adjusts the significance level (or the p-values) for the number of comparisons to control the overall Type I error rate. It is simple and general, and works best for a small number of planned comparisons because it becomes conservative as the number of comparisons grows.

# 3. Scheffé's method: This test is more conservative than Tukey's HSD for pairwise comparisons. It is appropriate when testing complex contrasts (for example, the average of two groups against a third), including contrasts chosen after looking at the data.

# 4. Games-Howell test: This test does not assume equal variances or group sizes and is appropriate when these assumptions are violated.

# 5. Dunnett's test: This test is used when comparing each group to a single control group.

# A post-hoc test might be necessary in situations where the ANOVA result is significant, but we want to determine which specific groups are significantly different from each other. For example, suppose we conduct an ANOVA on the effect of three different treatments on a health outcome, and find a significant overall effect. A post-hoc test such as Tukey's HSD or Bonferroni correction could be used to determine which specific treatments result in significant differences in the health outcome, allowing us to make more targeted recommendations for treatment.
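
# The three-treatment example above might look like this with Tukey's HSD from statsmodels. The data are simulated, and the
# treatment labels, means, and group sizes are assumptions; the third treatment is given a higher mean on purpose.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical health outcomes: 25 simulated participants per treatment
rng = np.random.default_rng(3)
outcome = np.concatenate([rng.normal(m, 1.0, 25) for m in (10.0, 10.0, 13.0)])
treatment = np.repeat(["T1", "T2", "T3"], 25)

# All pairwise comparisons with the family-wise error rate held at 0.05
tukey = pairwise_tukeyhsd(endog=outcome, groups=treatment, alpha=0.05)
print(tukey.summary())
```

# Each row of the summary is one pair of treatments; the reject column shows whether that pair differs at the chosen alpha.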

In [21]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
# 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
# to determine if there are any significant differences between the mean weight loss of the three diets.
# Report the F-statistic and p-value, and interpret the results.
# Ans-
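# To conduct a one-way ANOVA in Python to compare the mean weight loss of three diets, we can use the scipy.stats.f_oneway() function.

import numpy as np
from scipy.stats import f_oneway

# Simulate data, since no dataset is provided.
# Note: this draws 50 values per diet (150 in total), whereas the question describes 50 participants overall.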
np.random.seed(123)
diet_a = np.random.normal(5, 1, size=50)  # mean=5, std=1
diet_b = np.random.normal(4, 1, size=50)  # mean=4, std=1
diet_c = np.random.normal(3, 1, size=50)  # mean=3, std=1

# Conduct one-way ANOVA
f_statistic, p_value = f_oneway(diet_a, diet_b, diet_c)

# Report results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# The output, shown below the cell, is:

# F-statistic: 37.03885406173804
# p-value: 9.413909285242866e-14

# The F-statistic is 37.04 and the p-value is 9.41e-14, which is much smaller than the conventional significance level of 0.05. Therefore, we can conclude that there are significant differences between the mean weight loss of the three diets.

# To interpret these results, we can say that the ANOVA indicates that at least one of the diets (A, B, or C) has a significantly different mean weight loss compared to the other diets. However, we cannot determine which specific diets are different from each other based solely on the ANOVA results. Post-hoc tests, such as Tukey's HSD or Bonferroni correction, can be conducted to determine which diets are significantly different from each other.

F-statistic: 37.03885406173804
p-value: 9.413909285242866e-14


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.
Ans-
As this is a repeated measures design, we need to have data for each store for all 30 days. We can start by importing the necessary packages and creating a pandas dataframe with the data:

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM

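# create the dataframe in long format: one row per (store, day) pair
# the sales figures are random placeholders, not real data
np.random.seed(0)
data = {'Store': np.repeat(['A', 'B', 'C'], 30),
        'Day': np.tile(np.arange(30), 3),
        'Sales': np.random.randint(100, 500, 90)}

df = pd.DataFrame(data)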

We can then conduct the repeated measures ANOVA using the AnovaRM function from the statsmodels package. Each day acts as the "subject" that is measured once per store:

# conduct the repeated measures ANOVA
aovrm = AnovaRM(df, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()
print(res.summary())

This will output a summary table with the F-statistic, p-value, and other relevant information.

To follow up with a post-hoc test, we can use the pairwise_tukeyhsd function from the statsmodels.stats.multicomp package:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# conduct the Tukey HSD post-hoc test
tukey = pairwise_tukeyhsd(df['Sales'], df['Store'])
print(tukey.summary())

This will output a table with the mean difference between each pair of stores and the associated p-value for each comparison. We can interpret the results by looking at the p-values: if a p-value is less than our chosen alpha level (e.g., 0.05), we can conclude that there is a significant difference between the two stores being compared. Note that pairwise_tukeyhsd treats the three stores as independent groups and ignores the pairing by day, so for a repeated measures design, paired t-tests with a Bonferroni correction are a common alternative.
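
Because the sales are recorded on the same 30 days for every store, a post-hoc comparison that respects this pairing may be preferable. The following is a minimal sketch of pairwise paired t-tests with a Bonferroni adjustment, using scipy's ttest_rel and statsmodels' multipletests. The data are random placeholders, and the name `sales` is chosen only for this example:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

# Hypothetical long-format data: one row per (store, day) pair
rng = np.random.default_rng(4)
sales = pd.DataFrame({
    "Store": np.repeat(["A", "B", "C"], 30),
    "Day": np.tile(np.arange(30), 3),
    "Sales": rng.integers(100, 500, 90),
})

# Reshape to one row per day and one column per store, so values are paired by day
wide = sales.pivot(index="Day", columns="Store", values="Sales")

# Paired t-test for every pair of stores
pairs = list(combinations(wide.columns, 2))
raw_p = [ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]

# Bonferroni adjustment across the three comparisons
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, r in zip(pairs, adjusted_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.3f}, reject = {r}")
```

Since the placeholder sales are drawn from the same distribution for every store, significant differences are not expected here; with real data, the adjusted p-values would be compared with the chosen alpha level.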
        