In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [None]:
Assumptions for ANOVA validity:

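1. Independence: Observations are independent of one another, both within and across groups.
2. Normality: The residuals (equivalently, the data within each group) are approximately normally distributed.
3. Homogeneity of Variance: The population variances of the groups are roughly equal (homoscedasticity).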

Violations:

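1. Independence: Data from related individuals, or repeated measures on the same subject, treated as if they were independent observations. This typically inflates the Type I error rate.
2. Normality: Heavily skewed or outlier-laden data, especially with small samples, where the F-test is less robust.
3. Homogeneity of Variance: One group far more variable than the others, which distorts the F-test most when group sizes are also unequal.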
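
The normality and equal-variance assumptions can be checked before running the ANOVA. The following is a minimal sketch, assuming three simulated groups (the data and variable names are illustrative, not from the question). It uses the Shapiro-Wilk test per group and Levene's test across groups; independence cannot be tested this way and has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Illustrative data: three simulated groups with equal spread
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10, 11, 12)]

# Normality: Shapiro-Wilk test on each group (a small p-value suggests non-normality)
shapiro_p = []
for g in groups:
    stat, p = stats.shapiro(g)
    shapiro_p.append(p)

# Homogeneity of variance: Levene's test (a small p-value suggests unequal variances)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
```

A large p-value here does not prove that an assumption holds; it only means the test found no evidence against it.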


In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?



In [None]:
The three types of ANOVA are:

1. One-Way ANOVA: This is used when comparing means across two or more independent groups on a single factor or independent variable. It's suitable for situations where there is only one factor being investigated. For example, comparing the effectiveness of three different treatments on a medical condition.

2. Two-Way ANOVA: This is used when comparing means across two independent variables simultaneously. It assesses the main effects of each independent variable as well as any interaction between them. It's suitable for situations where there are two factors being investigated, and you want to examine their individual effects and potential interaction. For example, studying the effects of both diet and exercise regimen on weight loss.

3. Repeated Measures ANOVA (or within-subjects ANOVA): This is used when measurements are taken on the same subjects under different conditions or time points. It's suitable for situations where each participant serves as their own control, such as in longitudinal studies or experiments involving before-and-after measurements. For example, tracking changes in cognitive function over time with different interventions.


In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
ANOVA partitions the total variance in the dependent variable into two components:

1. Between-group variance: This component represents the variability between the group means. It reflects the extent to which the groups differ from each other on the dependent variable. In ANOVA, this variance is compared to the variability within groups to determine if there are statistically significant differences between the group means.

2. Within-group variance: Also known as error variance, this component represents the variability within each group. It reflects the natural variability or random fluctuations in the data that are not accounted for by the independent variable(s). In ANOVA, this variance serves as the baseline against which the between-group variance is compared.

Understanding the partitioning of variance is crucial for several reasons:

1. Interpretation of Results: By decomposing the total variance into its constituent components, researchers can gain insight into the relative contributions of different factors to the overall variability observed in the data. This allows for a more nuanced interpretation of the results and helps identify which factors are driving significant differences between groups.

2. Assessment of Model Fit: Partitioning variance helps evaluate how well the statistical model fits the data. If a substantial portion of the total variance can be explained by the between-group differences while minimizing within-group variability, it suggests that the model is effectively capturing the systematic effects of the independent variable(s) on the dependent variable.

3. Effect Size Estimation: Partitioning variance facilitates the calculation of effect sizes, which quantify the magnitude of the differences between groups. Effect sizes provide valuable information about the practical significance of the observed differences beyond mere statistical significance.

4. Guidance for Further Analysis: Understanding the partitioning of variance can inform subsequent analyses and guide the selection of appropriate post-hoc tests or follow-up investigations. For example, if most of the variance is explained by between-group differences, it may warrant further exploration of specific group comparisons or factors driving these differences.
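
For a one-way design, the partition described above can be written out as follows. This is the standard decomposition; the symbols are introduced here for illustration: k groups, n_i observations in group i, N observations in total, group means \bar{y}_i, and grand mean \bar{y}.

```latex
\begin{aligned}
SS_{\text{total}} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \\
&= \underbrace{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2}_{SS_{\text{between}}}
 + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{SS_{\text{within}}} \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)},
\qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
\end{aligned}
```

The F-statistic compares the two mean squares, and eta squared is one common effect size built from the same partition.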


In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
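import numpy as np

# Each row is one group; each column is one observation within that group
data = np.array([[5, 7, 9, 11, 13],
                 [6, 8, 10, 12, 14],
                 [4, 6, 8, 10, 12]])

n_per_group = data.shape[1]
grand_mean = np.mean(data)
group_means = np.mean(data, axis=1)

# Total: squared deviations of every observation from the grand mean
SST = np.sum((data - grand_mean) ** 2)
# Explained (between groups): squared deviations of group means from the grand mean
SSE = n_per_group * np.sum((group_means - grand_mean) ** 2)
# Residual (within groups): squared deviations of observations from their group mean
SSR = np.sum((data - group_means.reshape(-1, 1)) ** 2)

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)


Total Sum of Squares (SST): 130.0
Explained Sum of Squares (SSE): 10.0
Residual Sum of Squares (SSR): 120.0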
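
As a cross-check, the between-group (explained) and within-group (residual) sums of squares can be turned into an F-statistic and compared with `scipy.stats.f_oneway`. This is a small sketch on the same toy array; the names `ss_between` and `ss_within` are chosen here for clarity.

```python
import numpy as np
from scipy.stats import f_oneway

# Same toy data: one row per group
data = np.array([[5, 7, 9, 11, 13],
                 [6, 8, 10, 12, 14],
                 [4, 6, 8, 10, 12]])

k, n = data.shape                      # k groups, n observations per group
grand_mean = data.mean()
group_means = data.mean(axis=1)

ss_between = n * np.sum((group_means - grand_mean) ** 2)
ss_within = np.sum((data - group_means[:, None]) ** 2)

# Mean squares divide each sum of squares by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (k * n - k)
f_manual = ms_between / ms_within

# f_oneway takes one array per group, so unpack the rows
f_scipy, p_scipy = f_oneway(*data)

print("Manual F:", f_manual)
print("SciPy F:", f_scipy)
```

The two F values should agree, which confirms that the within-group sum of squares is the denominator (residual) term.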


In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
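import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative toy data: two observations per (A, B) cell so that residual
# degrees of freedom remain after fitting the interaction model
data = pd.DataFrame({'A': [1, 1, 2, 2, 1, 1, 2, 2],
                     'B': [1, 2, 1, 2, 1, 2, 1, 2],
                     'Y': [5, 7, 9, 11, 6, 8, 10, 13]})

if data.isnull().values.any() or not np.isfinite(data.values).all():
    print("Data contains NaN or infinite values. Please preprocess the data.")
else:
    # Two main effects plus their interaction
    model = ols('Y ~ C(A) + C(B) + C(A):C(B)', data=data).fit()

    # F-tests for each main effect and for the interaction
    print(sm.stats.anova_lm(model, typ=2))

    # Treatment-coded coefficients, relative to the reference cell A=1, B=1
    main_effects = model.params[['C(A)[T.2]', 'C(B)[T.2]']]
    interaction_effect = model.params['C(A)[T.2]:C(B)[T.2]']

    print("Main Effects:")
    print(main_effects)
    print("Interaction Effect:")
    print(interaction_effect)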
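
To see what the main and interaction effects measure, they can also be computed by hand from the cell means of a 2x2 design. This sketch uses four toy cell values (5, 7, 9, 11); the names `main_A`, `main_B`, and `interaction` are introduced here for illustration.

```python
import pandas as pd

# One toy observation per cell of a 2x2 design
cells = pd.DataFrame({'A': [1, 1, 2, 2],
                      'B': [1, 2, 1, 2],
                      'Y': [5, 7, 9, 11]})

# Rows are levels of A, columns are levels of B
cell_means = cells.pivot_table(index='A', columns='B', values='Y', aggfunc='mean')

# Main effect of a factor: difference between its marginal means
main_A = cell_means.loc[2].mean() - cell_means.loc[1].mean()
main_B = cell_means[2].mean() - cell_means[1].mean()

# Interaction: does the effect of B change across the levels of A?
interaction = ((cell_means.loc[2, 2] - cell_means.loc[2, 1])
               - (cell_means.loc[1, 2] - cell_means.loc[1, 1]))

print("Main effect of A:", main_A)
print("Main effect of B:", main_B)
print("Interaction contrast:", interaction)
```

An interaction contrast of zero means the effect of B is the same at both levels of A, so the two factors act additively in this toy example.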


In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [4]:
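F_statistic = 5.23
p_value = 0.02

alpha = 0.05

if p_value < alpha:
    print("The p-value is less than the significance level.")
    print("Reject the null hypothesis.")
    print("At least one group mean differs from the others; a post-hoc test is needed to identify which.")
else:
    print("The p-value is not less than the significance level.")
    print("Fail to reject the null hypothesis.")
    print("There is insufficient evidence that the group means differ.")


The p-value is less than the significance level.
Reject the null hypothesis.
At least one group mean differs from the others; a post-hoc test is needed to identify which.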
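
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not give. The sketch below assumes hypothetical values (3 groups and 15 observations, so 2 and 12 degrees of freedom) purely to show how the p-value and the critical value would be obtained from the F distribution.

```python
from scipy.stats import f

F_statistic = 5.23
alpha = 0.05

# Hypothetical degrees of freedom (not given in the question)
df_between = 2    # number of groups - 1
df_within = 12    # total observations - number of groups

# p-value: probability of an F at least this large under the null hypothesis
p_value = f.sf(F_statistic, df_between, df_within)

# Critical value: the F that cuts off the upper alpha tail
critical_value = f.ppf(1 - alpha, df_between, df_within)

print("p-value:", p_value)
print("Critical F:", critical_value)
print("Reject H0:", F_statistic > critical_value)
```

With these assumed degrees of freedom the p-value comes out close to the 0.02 stated in the question, and comparing F to the critical value leads to the same decision as comparing p to alpha.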


In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [None]:
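A standard repeated measures ANOVA needs a complete set of measurements for every subject, so missing values must be handled before the analysis. Common approaches and their consequences:

1. Complete Case Analysis (CCA):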
   - Method: Excludes cases with any missing data, analyzing only the complete cases.
   - Consequences: CCA may lead to biased estimates if the missing data are not missing completely at random (MCAR). It may also reduce the sample size and statistical power, particularly if missingness is related to the outcome variable or covariates.

2. Mean Imputation:
   - Method: Replaces missing values with the mean of observed values for that variable.
   - Consequences: Mean imputation can distort the distribution and reduce variance, leading to biased estimates and underestimation of standard errors. It does not account for uncertainty introduced by imputation, potentially inflating Type I error rates.

3. Last Observation Carried Forward (LOCF):
   - Method: Imputes missing values with the last observed value for that variable.
   - Consequences: LOCF assumes that missing values remain constant over time, which may not be valid. It can introduce bias, particularly if missingness is related to the outcome variable or if data are not missing completely at random.

4. Multiple Imputation (MI):
   - Method: Generates several plausible values for each missing entry based on the observed data, analyzes each completed dataset separately, and pools the results.
   - Consequences: MI preserves variability and accounts for the uncertainty introduced by imputation. It yields approximately unbiased estimates and valid standard errors when the data are missing at random (MAR) and the imputation model is correctly specified. However, it relies on assumptions about the missing data mechanism and may be computationally intensive.

5. Model-Based Imputation:
   - Method: Imputes missing values using a statistical model that predicts missing values based on observed data.
   - Consequences: Model-based imputation can provide accurate estimates if the model adequately captures the relationships in the data. However, misspecification of the imputation model can lead to biased results.
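
The effect of the simpler methods above can be seen on a small wide-format table. This is a sketch with made-up values (5 subjects, 3 time points, two missing entries); it only illustrates how complete-case analysis shrinks the sample and how mean imputation shrinks the variance.

```python
import numpy as np
import pandas as pd

# Made-up repeated measures data: one row per subject, one column per time point
wide = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    't1': [10.0, 12.0, 11.0, 13.0, 9.0],
    't2': [11.0, np.nan, 12.0, 15.0, 10.0],
    't3': [13.0, 14.0, np.nan, 16.0, 12.0],
}).set_index('subject')

# Complete case analysis: drop every subject with any missing value
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with its column mean
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's last observed value forward in time
locf = wide.ffill(axis=1)

# Mean imputation adds values with zero deviation, so the variance shrinks
var_observed = wide['t2'].var()
var_imputed = mean_imputed['t2'].var()

print("Subjects kept by CCA:", len(complete_cases), "of", len(wide))
print("Variance of t2 (observed only):", var_observed)
print("Variance of t2 (mean-imputed):", var_imputed)
```

Multiple imputation is not shown here because it needs an imputation model and a pooling step, which go beyond a short sketch.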



In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [None]:
Common post-hoc tests used after ANOVA include:

1. Tukey's Honestly Significant Difference (HSD):
   - Use: Tukey's HSD is used when you have three or more groups and want to compare all possible pairwise differences between group means.
   - Situation: Suppose you conducted a one-way ANOVA with three or more treatment groups and found a significant overall difference. Tukey's HSD can help identify which specific groups differ significantly from each other.

2. Bonferroni Correction:
   - Use: Bonferroni correction adjusts the significance level for multiple comparisons to control the familywise error rate.
   - Situation: When performing multiple pairwise comparisons, Bonferroni correction is useful to reduce the likelihood of false positive findings. For example, if you conduct several pairwise comparisons between treatment groups, Bonferroni correction can help maintain an overall alpha level of 0.05 while adjusting for the increased risk of Type I errors due to multiple testing.

3. Sidak Correction:
   - Use: Similar to Bonferroni correction, Sidak correction adjusts the significance level for multiple comparisons to control the familywise error rate.
   - Situation: Sidak correction is also used to address multiple comparisons. It is slightly less conservative than Bonferroni correction and is exact when the comparisons are independent, so it gives marginally more power at the same familywise error rate.

4. Dunnett's Test:
   - Use: Dunnett's test compares each treatment group to a control group while controlling the overall Type I error rate.
   - Situation: When you have a control group and several treatment groups, Dunnett's test can help identify which treatment groups differ significantly from the control group, while still maintaining the overall Type I error rate at the desired level.

5. Scheffé's Test:
   - Use: Scheffé's test is a conservative post-hoc test that controls the familywise error rate for all possible contrasts, not only pairwise comparisons.
   - Situation: Scheffé's test is used when you want to test complex contrasts (for example, the average of two groups against a third) or comparisons chosen after looking at the data. It can be applied with unequal sample sizes, but it still assumes homogeneous variances; when variances differ, a procedure such as Games-Howell is more appropriate.
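
The difference between the Bonferroni and Sidak corrections is easiest to see numerically. A minimal sketch, assuming a hypothetical design with 4 groups and a familywise alpha of 0.05:

```python
from math import comb

alpha = 0.05
k = 4                      # hypothetical number of groups
m = comb(k, 2)             # number of pairwise comparisons

# Bonferroni: split alpha evenly across the m comparisons
bonferroni_alpha = alpha / m

# Sidak: per-comparison level that gives exactly alpha overall
# when the m comparisons are independent
sidak_alpha = 1 - (1 - alpha) ** (1 / m)

print("Pairwise comparisons:", m)
print("Bonferroni per-comparison alpha:", bonferroni_alpha)
print("Sidak per-comparison alpha:", sidak_alpha)
```

The Sidak threshold is always slightly larger than the Bonferroni one, which is why it is described as marginally less conservative.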

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [6]:
import numpy as np
from scipy.stats import f_oneway

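# Illustrative data: 5 participants per diet (the question describes 50
# participants, but no measurements were provided)
diet_A = np.array([2, 3, 4, 5, 6])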
diet_B = np.array([1, 2, 3, 4, 5])
diet_C = np.array([3, 4, 5, 6, 7])

F_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

print("F-statistic:", F_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("The p-value is less than 0.05. Reject the null hypothesis.")
    print("There are significant differences between the mean weight loss of the three diets.")
else:
    print("The p-value is greater than or equal to 0.05. Fail to reject the null hypothesis.")
    print("There are no significant differences between the mean weight loss of the three diets.")


F-statistic: 2.0
p-value: 0.177978515625
The p-value is greater than or equal to 0.05. Fail to reject the null hypothesis.
There are no significant differences between the mean weight loss of the three diets.


In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulated data for illustration (no real measurements were provided)
np.random.seed(0)
n_employees = 30
employee_experience = np.random.choice(['Novice', 'Experienced'], size=n_employees)
software_program = np.random.choice(['A', 'B', 'C'], size=n_employees)
task_completion_time = np.random.normal(loc=10, scale=2, size=n_employees)

data = pd.DataFrame({'Employee_Experience': employee_experience,
                     'Software_Program': software_program,
                     'Task_Completion_Time': task_completion_time})

model = ols('Task_Completion_Time ~ C(Employee_Experience) * C(Software_Program)', data=data).fit()

anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Report each effect against alpha = 0.05 (the Residual row has no p-value)
print("\nInterpretation (alpha = 0.05):")
for effect, p in anova_table['PR(>F)'].dropna().items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{effect}: p = {p:.3f} ({verdict})")


                                                sum_sq    df         F  \
C(Employee_Experience)                       12.126559   1.0  2.905717   
C(Software_Program)                           5.434252   2.0  0.651067   
C(Employee_Experience):C(Software_Program)    7.579898   2.0  0.908132   
Residual                                    100.160287  24.0       NaN   

                                              PR(>F)  
C(Employee_Experience)                      0.101180  
C(Software_Program)                         0.530456  
C(Employee_Experience):C(Software_Program)  0.416691  
Residual                                         NaN  

Interpretation (alpha = 0.05):
C(Employee_Experience): p = 0.101 (not significant)
C(Software_Program): p = 0.530 (not significant)
C(Employee_Experience):C(Software_Program): p = 0.417 (not significant)


In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [18]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import scikit_posthocs as sp

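np.random.seed(42)

# Simulated scores for illustration; note that this draws 100 students per group
control = np.random.normal(75, 10, 100)
experimental = np.random.normal(80, 10, 100)

df = pd.DataFrame({'score': np.concatenate([control, experimental]),
                   'group': np.repeat(['control', 'experimental'], 100)})

# With exactly two groups, one-way ANOVA is equivalent to the equal-variance
# two-sample t-test (F = t**2, identical p-value)
anova_result = stats.f_oneway(control, experimental)
print("ANOVA F-statistic:", anova_result.statistic)
print("ANOVA p-value:", anova_result.pvalue)

# With only two groups there is a single pairwise comparison, so the post-hoc
# test cannot add information beyond the test above
tukey_result = sp.posthoc_tukey(df, val_col='score', group_col='group')
print("\nTukey's HSD post-hoc test results:")
print(tukey_result)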


ANOVA F-statistic: 22.607133515185637
ANOVA p-value: 3.819135262679368e-06

Tukey's HSD post-hoc test results:
              control  experimental
control         1.000         0.001
experimental    0.001         1.000
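
Since the question asks for a two-sample t-test, it can be run directly with `scipy.stats.ttest_ind`. The sketch below uses freshly simulated scores (50 per group, an assumption made here so that the total is 100 students) and checks that the t-test and the two-group ANOVA agree. Welch's version is included as an option when equal variances are doubtful.

```python
import numpy as np
from scipy import stats

# Simulated scores: 50 students per group (illustrative values only)
rng = np.random.default_rng(42)
control = rng.normal(75, 10, 50)
experimental = rng.normal(80, 10, 50)

# Standard two-sample t-test (assumes equal variances)
t_stat, t_p = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, f_p = stats.f_oneway(control, experimental)

# Welch's t-test drops the equal-variance assumption
welch_t, welch_p = stats.ttest_ind(control, experimental, equal_var=False)

print("t-statistic:", t_stat, "p-value:", t_p)
print("t squared:", t_stat ** 2, "ANOVA F:", f_stat)
print("Welch t-statistic:", welch_t, "p-value:", welch_p)
```

The squared t-statistic matches the ANOVA F and the p-values coincide, so with two groups a significant t-test already identifies which pair differs.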


In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [19]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
import statsmodels.stats.multicomp as mc

# Simulated daily sales for illustration (no real data were provided)
np.random.seed(42)
days = 30
sales_store_A = np.random.normal(loc=500, scale=50, size=days)
sales_store_B = np.random.normal(loc=550, scale=40, size=days)
sales_store_C = np.random.normal(loc=600, scale=45, size=days)

data = pd.DataFrame({'Store_A': sales_store_A,
                     'Store_B': sales_store_B,
                     'Store_C': sales_store_C})

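# Caution: f_oneway treats the three stores as independent groups and ignores
# that the same 30 days are measured for every store. A repeated measures ANOVA
# (for example statsmodels' AnovaRM with the day as the subject) accounts for
# that pairing, and paired comparisons are then the matching post-hoc choice.
anova_result = f_oneway(data['Store_A'], data['Store_B'], data['Store_C'])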
print("ANOVA F-statistic:", anova_result.statistic)
print("ANOVA p-value:", anova_result.pvalue)

if anova_result.pvalue < 0.05:
    all_sales = np.concatenate([sales_store_A, sales_store_B, sales_store_C])
    all_groups = ['Store_A'] * days + ['Store_B'] * days + ['Store_C'] * days
    posthoc_result = mc.MultiComparison(all_sales, all_groups).tukeyhsd()
    print("\nPost-hoc test results (Tukey's HSD):")
    print(posthoc_result)
else:
    print("No significant differences found. Post-hoc test not performed.")


ANOVA F-statistic: 50.361055546233764
ANOVA p-value: 2.9591453821082584e-15

Post-hoc test results (Tukey's HSD):
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj  lower   upper   reject
------------------------------------------------------
Store_A Store_B  54.5608   0.0 28.4285  80.6931   True
Store_A Store_C 109.9872   0.0 83.8549 136.1195   True
Store_B Store_C  55.4263   0.0  29.294  81.5586   True
------------------------------------------------------
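
Because every store is observed on the same 30 days, the day can be treated as the "subject" and the store as the within-subject factor. The following is a sketch of that repeated measures analysis using `statsmodels.stats.anova.AnovaRM`. It uses freshly simulated sales, so its numbers will not match the output above, and the column names `Day`, `Store`, and `Sales` are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated daily sales (illustrative values only)
rng = np.random.default_rng(42)
days = 30
wide = pd.DataFrame({
    'Day': np.arange(days),
    'Store_A': rng.normal(500, 50, days),
    'Store_B': rng.normal(550, 40, days),
    'Store_C': rng.normal(600, 45, days),
})

# AnovaRM expects long format: one row per (subject, condition) pair
long = wide.melt(id_vars='Day', var_name='Store', value_name='Sales')

# Day is the repeated "subject"; Store is the within-subject factor
rm_result = AnovaRM(long, depvar='Sales', subject='Day', within=['Store']).fit()
print(rm_result.anova_table)
```

With 3 stores and 30 days the test has 2 and 58 degrees of freedom, compared with 2 and 87 for the independent-groups ANOVA, because day-to-day variation is removed from the error term.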
