In [2]:
#Q1 Explain the assumptions required to use ANOVA and provide examples of violations that could impact
# the validity of the results.

Independence of Observations: The observations in each group must be independent of each other. This means that the value of one observation should not depend on or influence the value of another observation within the same group.

Normality of Residuals: The residuals (the differences between observed values and group means) should follow a normal distribution within each group.

Homogeneity of Variances (Homoscedasticity): The variances of the residuals should be approximately equal across all groups. 

Independence of Groups: The groups being compared should be independent of each other.

Random Sampling: The samples within each group should be drawn randomly from their respective populations. 

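Absence of Outliers: Extreme values can distort the group means and inflate the variances, so the data should be free of influential outliers.

Examples of violations: treating repeated measurements of the same subject as independent observations (independence), analysing strongly skewed data such as income (normality), or comparing groups where one group has a much larger spread than the others (homogeneity of variances). Each of these can distort the F-statistic and its p-value.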
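The normality and equal-variance assumptions above can be checked before running ANOVA. The following is a minimal sketch, assuming three small hypothetical groups; it uses the Shapiro-Wilk test per group and Levene's test across groups, which are common choices rather than the only options.

```python
from scipy import stats

# Hypothetical measurements for three independent groups (illustration only)
groups = {
    "A": [15, 20, 25, 30, 35],
    "B": [10, 18, 22, 28, 32],
    "C": [12, 16, 20, 24, 28],
}

# Shapiro-Wilk test per group: a small p-value suggests the data are not normal
normality_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Levene's test: a small p-value suggests the group variances are not equal
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in normality_p.items():
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

With very small samples these tests have little power, so a plot of the residuals is usually worth inspecting as well.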

In [3]:
# Q2 What are the three types of ANOVA, and in what situations would each be used?

One-Way ANOVA:

Use Case: One-Way ANOVA is used when you have one independent variable (factor) with more than two levels or groups, and you want to compare the means of those groups to determine if there are statistically significant differences.
Example: You have data on the test scores of students from three different schools (groups A, B, and C), and you want to determine if there are any significant differences in the average test scores among the schools.

Two-Way ANOVA:

Use Case: Two-Way ANOVA is used when you have two independent variables (factors) and you want to examine how they interact to affect a dependent variable. It can determine the main effects of each factor and whether there is an interaction effect between them.
Example: You want to analyze how both the type of diet (factor 1: low-fat, high-fat) and the type of exercise (factor 2: cardio, strength training) impact weight loss in a study. Two-Way ANOVA allows you to assess the effects of each factor and their interaction.

Repeated Measures ANOVA (RM-ANOVA):

Use Case: Repeated Measures ANOVA is used when you have a repeated measurement or a dependent variable collected at multiple time points or under multiple conditions on the same subjects. It assesses changes over time or across conditions.
Example: You measure the blood pressure of the same group of individuals at three different time points: before treatment, after one month of treatment, and after three months of treatment. Repeated Measures ANOVA helps determine if there are statistically significant changes in blood pressure over time due to the treatment.

In [4]:
# Q3 What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept that involves breaking down the total variance observed in a dataset into different components. Understanding this concept is crucial because it allows researchers to assess the sources of variation in their data and determine if there are statistically significant differences among groups or factors being studied. 

Total Variance (Total Sum of Squares - SST): The total variance represents the overall variation observed in the data. It is calculated as the sum of the squared differences between each data point and the overall mean.

$$SST = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2$$

$X_{ij}$ = the $i$-th data point in group $j$ (the sum runs over every data point).
$\bar{X}$ = the overall mean of all data points.

Between-Group Variance (Between-Group Sum of Squares - SSB): This component of variance assesses the differences between the group means. It measures how much of the total variance can be attributed to differences among the groups or factors being studied.

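$$SSB = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2$$

$n_j$ = number of observations in group $j$ (when every group has the same size $n$, this reduces to $n \sum_j (\bar{X}_j - \bar{X})^2$).
$\bar{X}_j$ = mean of group $j$.
$\bar{X}$ = overall mean.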

Within-Group Variance (Within-Group Sum of Squares - SSW): This component of variance measures the variation within each group. It quantifies the random variation or noise within the groups that cannot be explained by differences between the groups.

$$SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$$

$X_{ij}$ = the $i$-th data point in group $j$.
$\bar{X}_j$ = mean of group $j$.
 
 
Hypothesis Testing: ANOVA tests whether the between-group mean square (SSB divided by its degrees of freedom) is significantly greater than the within-group mean square (SSW divided by its degrees of freedom). The ratio of the two is the F-statistic. If the between-group variation is much larger than the within-group variation, it suggests that there are significant differences among the groups.

Effect Size: By examining the proportion of total variance explained by between-group variation, researchers can assess the effect size of the factors or treatments under investigation.

Post-Hoc Tests: If ANOVA indicates significant differences among groups, post-hoc tests can be conducted to identify which specific groups differ from each other.

Experimental Design: Understanding the partitioning of variance informs the design of experiments by helping researchers determine the appropriate sample sizes and factors to include in their study.
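The partition can be verified numerically. This is a small sketch on hypothetical groups of unequal size, showing that the total sum of squares splits exactly into between-group and within-group parts, and that their ratio gives the proportion of variance explained (eta squared).

```python
import numpy as np

# Hypothetical groups with unequal sizes (illustration only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 9.0, 8.0, 8.0]),
    np.array([1.0, 3.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total: squared deviation of every point from the grand mean
sst = ((all_values - grand_mean) ** 2).sum()
# Between: squared deviation of each group mean, weighted by group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within: squared deviation of every point from its own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

eta_squared = ssb / sst  # proportion of total variance explained by the groups
print(f"SST = {sst:.2f}, SSB = {ssb:.2f}, SSW = {ssw:.2f}, eta^2 = {eta_squared:.3f}")
```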

In [5]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
# sum of squares (SSR) in a one-way ANOVA using Python?

In [16]:
import scipy.stats as stats

# Sample data for each group
group1 = [15, 20, 25, 30, 35]
group2 = [10, 18, 22, 28, 32]
group3 = [12, 16, 20, 24, 28]


f_statistic, p_value = stats.f_oneway(group1, group2, group3)

df_total = len(group1) + len(group2) + len(group3) - 1
df_between = len([group1, group2, group3]) - 1
df_within = df_total - df_between

all_data = group1 + group2 + group3
overall_mean = sum(all_data) / len(all_data)

mean_group1 = sum(group1) / len(group1)
mean_group2 = sum(group2) / len(group2)
mean_group3 = sum(group3) / len(group3)

In [28]:
SST = sum((x - overall_mean) ** 2 for x in all_data)

In [29]:
# Weight each squared deviation by the group size n_j (not n_j - 1)
SSB = (len(group1) * (mean_group1 - overall_mean) ** 2
       + len(group2) * (mean_group2 - overall_mean) ** 2
       + len(group3) * (mean_group3 - overall_mean) ** 2)

In [30]:
SSW= SST - SSB
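One way to sanity-check hand-computed sums of squares is to rebuild the F-statistic from them and compare it with `stats.f_oneway`. The sketch below repeats the three sample groups from this question so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

group1 = [15, 20, 25, 30, 35]
group2 = [10, 18, 22, 28, 32]
group3 = [12, 16, 20, 24, 28]

groups = [np.asarray(g, dtype=float) for g in (group1, group2, group3)]
all_data = np.concatenate(groups)
overall_mean = all_data.mean()

sst = ((all_data - overall_mean) ** 2).sum()
ssb = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_data) - len(groups)

# F is the ratio of the between-group and within-group mean squares
f_manual = (ssb / df_between) / (ssw / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(group1, group2, group3)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```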

In [32]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Main Effects:

Main effects represent the individual influence of each independent variable (factor) on the dependent variable, averaged over the levels of the other factor. In a two-way ANOVA, there are two main effects: one for each factor.
You can calculate the main effects using ANOVA tables or by comparing group means for each factor.

Interaction Effects:

Interaction effects represent the combined or joint influence of two or more independent variables on the dependent variable. They indicate whether the effect of one factor depends on the level of another factor.
Interaction effects can be calculated by examining how the presence of one factor affects the effect of the other factor(s) on the dependent variable.
Interaction effects are often tested using the interaction term in a linear model.



In [15]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [34]:
data = {
    'FactorA': [1, 2, 3, 1, 2, 3],
    'FactorB': [1, 1, 1, 2, 2, 2],
    'Y': [10, 12, 15, 18, 20, 25],
}

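# Create a linear model (two-way ANOVA) with interaction term.
# FactorA and FactorB are numeric here, so each enters as a 1-df linear term;
# wrap them in C(...) to treat them as categorical factors instead.
model = ols('Y ~ FactorA * FactorB', data=data).fit()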

# Print the ANOVA table to obtain main effects and interaction effects
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                     sum_sq   df      F    PR(>F)
FactorA           36.000000  1.0   43.2  0.022374
FactorB          112.666667  1.0  135.2  0.007315
FactorA:FactorB    1.000000  1.0    1.2  0.387628
Residual           1.666667  2.0    NaN       NaN


The ANOVA table will provide information about the main effects of 'FactorA' and 'FactorB' and their interaction effect. Look for the p-values associated with each effect to determine their significance. If the p-value is less than your chosen significance level (e.g., 0.05), it suggests a significant effect.

In [37]:
# Q6 Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
# What can you conclude about the differences between the groups, and how would you interpret these
# results?

In a one-way ANOVA, the F-statistic and its associated p-value are used to assess whether there are statistically significant differences among the group means.

F-Statistic: The F-statistic is a test statistic that measures the ratio of the variation between the group means (explained variance) to the variation within the groups (unexplained or residual variance). In your case, the F-statistic is 5.23.

p-value: The p-value associated with the F-statistic indicates the probability of observing an F-statistic as extreme as the one obtained (or more extreme) under the null hypothesis. In your case, the p-value is 0.02.

Null Hypothesis (H0): The null hypothesis in ANOVA is that all group (population) means are equal.

Alternative Hypothesis (H1): The alternative hypothesis is that at least one group mean differs from the others.

If p-value < α (typically 0.05): You would reject the null hypothesis (H0).
If p-value ≥ α (typically 0.05): You would fail to reject the null hypothesis (H0).


In your case, you have a p-value of 0.02, which is less than the common significance level of 0.05. Therefore, you would reject the null hypothesis. This suggests that at least one group mean differs from the others. The F-test alone does not show which groups differ; a post-hoc test is needed for that.
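The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not state. The sketch below assumes 3 groups and 30 observations purely for illustration, so the resulting p-value will not be exactly 0.02.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2   # assumption: 3 groups
df_within = 27   # assumption: 30 observations in total
alpha = 0.05

# p-value: upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Equivalent decision rule: compare F with the critical value at level alpha
critical_f = stats.f.ppf(1 - alpha, df_between, df_within)
reject_h0 = f_statistic > critical_f

print(f"p = {p_value:.4f}, critical F = {critical_f:.3f}, reject H0: {reject_h0}")
```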

In [38]:
# Q7 In a repeated measures ANOVA, how would you handle missing data, and what are the potential
# consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is an important consideration because missing data can potentially bias the results and reduce the power of the analysis. There are several methods to handle missing data in repeated measures ANOVA, and the choice of method can impact the validity and interpretability of the results. Here's how you can handle missing data and the potential consequences of different approaches:

Listwise Deletion (Complete Case Analysis): This method involves excluding participants with missing data from the analysis. Only subjects with complete data across all time points and conditions are included.

Consequences: While this approach is straightforward, it can result in a loss of valuable data, reduced statistical power, and potentially biased estimates if missing data are not missing completely at random (MCAR). It may also lead to a non-representative sample.

Pairwise Deletion: With this method, you use all available data for each pair of time points or conditions, even if data are missing for some subjects at specific time points.

Consequences: Pairwise deletion retains more data than listwise deletion and can maximize the sample size for each comparison. However, it can introduce inconsistency in the sample size across comparisons, potentially affecting the power and validity of individual comparisons.

Imputation Methods:

Mean Imputation: Replace missing values with the mean of observed values for the respective variable.

Last Observation Carried Forward (LOCF): Use the last observed value for a participant to fill in missing data.

Linear Interpolation: Estimate missing values based on linear trends between observed data points.

Multiple Imputation: Generate multiple complete datasets with imputed values, perform analyses on each dataset, and combine results.

Consequences: Imputation methods retain the full sample size. However, single-value methods (mean imputation, LOCF, linear interpolation) understate the variability in the data and can bias the results, while multiple imputation reflects the uncertainty of the imputed values but assumes that the data are missing at random (MAR). The choice of imputation method can therefore affect the results.

Mixed-Model Analysis (Linear Mixed Effects Models): This approach is a robust method for handling missing data in repeated measures ANOVA. Mixed models account for both within-subject correlations and varying numbers of observations per subject.

Consequences: Mixed models are flexible and can handle various patterns of missing data because they use every available observation. They give unbiased parameter estimates when the data are missing at random (MAR) and make good use of statistical power. However, they require careful model specification and may be computationally intensive.
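The simpler approaches above can be compared side by side in pandas. This is a minimal sketch on a hypothetical wide-format table (one row per subject, one column per time point); the readings are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical blood-pressure readings; NaN marks a missed visit
wide = pd.DataFrame(
    {
        "t1": [120.0, 132.0, 128.0, 140.0, 125.0],
        "t2": [118.0, np.nan, 126.0, 135.0, 124.0],
        "t3": [115.0, 127.0, np.nan, 130.0, np.nan],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="subject"),
)

# Listwise deletion: keep only subjects with no missing time point
complete_cases = wide.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's last observed value forward across time points
locf = wide.ffill(axis=1)

print(f"subjects kept by listwise deletion: {len(complete_cases)} of {len(wide)}")
print(mean_imputed)
print(locf)
```

Listwise deletion keeps only two of the five subjects here, which illustrates how quickly power is lost when gaps are spread across participants.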

In [39]:
# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
# an example of a situation where a post-hoc test might be necessary.

Tukey's Honestly Significant Difference (Tukey's HSD):

Use Case: Tukey's HSD is a widely used post-hoc test for comparing all possible pairs of group means. It controls the familywise error rate (the probability of making at least one Type I error across all comparisons) while retaining good power, and it assumes equal variances across groups. Use it when you have three or more groups and want to compare every pair.
Example: In a study comparing the performance of four different teaching methods, you find a significant difference in test scores among the methods. You would use Tukey's HSD to determine which specific pairs of methods are different from each other.

Bonferroni Correction:

Use Case: Bonferroni correction is a conservative method to control the familywise error rate. It is suitable when you want to make multiple pairwise comparisons, and you want to ensure that the overall Type I error rate remains below a certain threshold (e.g., 0.05).
Example: In a medical trial comparing the effectiveness of five different treatments, you use Bonferroni correction to adjust the significance level for each comparison to maintain an overall significance level of 0.05.

Duncan's Multiple Range Test:

Use Case: Duncan's test is less conservative than Tukey's HSD: it has more power to detect differences but gives weaker protection against Type I errors across the whole set of comparisons. It is suitable for pairwise comparisons when a higher risk of false positives is acceptable.
Example: In an agricultural study, you have data on the yield of different crop varieties from different regions. You use Duncan's test to compare varieties and regions and identify which combinations are significantly different.

Scheffé's Test:

Use Case: Scheffé's test controls the familywise error rate for all possible contrasts, not only pairwise comparisons, and it works with unequal sample sizes. This makes it more conservative than Tukey's HSD for simple pairwise comparisons, but well suited to complex contrasts of interest.
Example: In a social science survey, you have data on the job satisfaction levels of employees in various departments of a large organization. You use Scheffé's test to explore specific contrasts, such as comparing satisfaction levels between different departments.

Games-Howell Test:

Use Case: The Games-Howell test is a post-hoc test that does not assume equal variances across groups and is suitable when you have unequal sample sizes and variances. It is less conservative than Tukey's HSD.
Example: In a clinical trial, you are comparing the effectiveness of multiple drug treatments with different sample sizes and potentially different variances. The Games-Howell test helps identify which treatments are significantly different from each other.

When to use a particular post-hoc test depends on factors like the nature of your data, the number of groups, whether the variances are equal, and the level of control you want over the Type I error rate. It's important to select the most appropriate post-hoc test for your specific research question and dataset to make valid pairwise comparisons.
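As an illustration of the Bonferroni approach, the sketch below runs all pairwise t-tests on four hypothetical teaching methods and adjusts the p-values with `multipletests` from statsmodels. The scores and group names are invented for the example.

```python
from itertools import combinations

from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical test scores for four teaching methods (illustration only)
scores = {
    "method_1": [72, 75, 78, 74, 71],
    "method_2": [80, 83, 79, 85, 82],
    "method_3": [73, 76, 74, 77, 72],
    "method_4": [88, 90, 86, 91, 89],
}

# All pairwise comparisons: 4 groups give 6 pairs
pairs = list(combinations(scores, 2))
raw_p = [stats.ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs]

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p_raw, p_adj, rej in zip(pairs, raw_p, adjusted_p, reject):
    print(f"{a} vs {b}: raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f}, reject = {rej}")
```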






In [1]:
# Q9 A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
# 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
# to determine if there are any significant differences between the mean weight loss of the three diets.
# Report the F-statistic and p-value, and interpret the results.

In [3]:
import numpy as np

In [13]:
diet_A = [2.5, 3.1, 2.8, 3.5, 2.9, 3.2, 2.7, 3.0, 3.3, 2.6, 2.9, 3.1, 3.4, 3.0, 3.2, 3.3, 2.8, 3.0, 3.1, 2.7,
          2.9, 3.3, 3.4, 2.8, 3.2, 2.7, 3.0, 3.3, 2.6, 2.9, 3.1, 3.4, 3.0, 3.2, 3.3, 2.8, 3.0, 3.1, 2.7, 2.9, 3.3,
          3.4, 2.8, 3.2, 2.7, 3.0, 3.3]
diet_B = [2.1, 2.5, 2.4, 2.0, 2.6, 2.3, 2.7, 2.2, 2.8, 2.1, 2.5, 2.4, 2.0, 2.6, 2.3, 2.7, 2.2, 2.8, 2.1, 2.5,
          2.4, 2.0, 2.6, 2.3, 2.7, 2.2, 2.8, 2.1, 2.5, 2.4, 2.0, 2.6, 2.3, 2.7, 2.2, 2.8, 2.1, 2.5, 2.4, 2.0, 2.6,
          2.3, 2.7, 2.2, 2.8]
diet_C = [1.8, 2.0, 1.9, 1.7, 2.1, 1.9, 2.2, 1.8, 2.3, 1.7, 2.0, 1.9, 1.7, 2.1, 1.9, 2.2, 1.8, 2.3, 1.7, 2.0,
          1.9, 1.7, 2.1, 1.9, 2.2, 1.8, 2.3, 1.7, 2.0, 1.9, 1.7, 2.1, 1.9, 2.2, 1.8, 2.3, 1.7, 2.0, 1.9, 1.7, 2.1,
          1.9, 2.2, 1.8, 2.3]

In [17]:
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

In [18]:
p_value

3.958686390437258e-44

p_value < 0.05, so we reject the null hypothesis and conclude that there are significant differences between the mean weight loss of the three diets.
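Since the question asks for both the F-statistic and the p-value, a small helper can format them together with the degrees of freedom. This is a sketch only: `report_oneway` is a hypothetical helper name, and the inputs are just the first four values of each diet list, used here for brevity.

```python
from scipy import stats


def report_oneway(*groups):
    """Return a one-line summary of a one-way ANOVA (hypothetical helper)."""
    f_stat, p = stats.f_oneway(*groups)
    df_between = len(groups) - 1
    df_within = sum(len(g) for g in groups) - len(groups)
    return f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p:.3g}"


# First four values of each diet list, for brevity
summary = report_oneway(
    [2.5, 3.1, 2.8, 3.5],
    [2.1, 2.5, 2.4, 2.0],
    [1.8, 2.0, 1.9, 1.7],
)
print(summary)
```

Calling `report_oneway(diet_A, diet_B, diet_C)` on the full lists would give the summary for the complete sample.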

In [19]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to
# complete a task using three different software programs: Program A, Program B, and Program C. They
# randomly assign 30 employees to one of the programs and record the time it takes each employee to
# complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
# interaction effects between the software programs and employee experience level (novice vs.
# experienced). Report the F-statistics and p-values, and interpret the results.

In [23]:
import pandas as pd

In [20]:
data = {
    'Software': ['A', 'B', 'C'] * 30,
    'Experience': ['Novice', 'Experienced'] * 45,
    'Time': [15.3, 16.2, 15.8, 16.5, 17.2, 16.9, 14.7, 15.4, 14.8, 15.2, 15.7, 15.5, 16.0, 15.6, 16.1] * 6,
}

In [24]:
df = pd.DataFrame(data)

In [25]:
df

   Software   Experience  Time
0         A       Novice  15.3
1         B  Experienced  16.2
2         C       Novice  15.8
3         A  Experienced  16.5
4         B       Novice  17.2
..      ...          ...   ...
85        B  Experienced  15.7
86        C       Novice  15.5
87        A  Experienced  16.0
88        B       Novice  15.6


In [30]:
formula='Time ~ C(Software) + C(Experience) + C(Software):C(Experience)'

In [31]:
model = ols(formula, data=df).fit()

In [32]:
anova_table = sm.stats.anova_lm(model, typ=2)

In [33]:
anova_table

                                 sum_sq    df             F    PR(>F)
C(Software)                3.488000e+00   2.0  3.763255e+00  0.027212
C(Experience)              9.007750e-27   1.0  1.943719e-26  1.000000
C(Software):C(Experience)  2.285231e-28   2.0  2.465570e-28  1.000000
Residual                   3.892800e+01  84.0           NaN       NaN


C(Software): p-value = 0.027 < 0.05, so there is a significant difference in mean task completion time among the software programs.

C(Experience): p-value = 1.0 > 0.05, so experience level has no significant effect on task completion time.

C(Software):C(Experience): p-value = 1.0 > 0.05, so there is no significant interaction; the effect of the software program does not depend on experience level.
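Looking at the cell means is a quick way to see where main effects and interactions come from. The sketch below rebuilds the same sample data and tabulates the mean time for each Software x Experience combination.

```python
import numpy as np
import pandas as pd

# Same sample data as above, rebuilt so that this cell runs on its own
df = pd.DataFrame({
    "Software": ["A", "B", "C"] * 30,
    "Experience": ["Novice", "Experienced"] * 45,
    "Time": [15.3, 16.2, 15.8, 16.5, 17.2, 16.9, 14.7, 15.4, 14.8,
             15.2, 15.7, 15.5, 16.0, 15.6, 16.1] * 6,
})

# Mean completion time per cell: rows are software programs, columns are experience levels
cell_means = df.groupby(["Software", "Experience"])["Time"].mean().unstack()
print(cell_means)
```

In this sample data the two experience columns are identical within each program, which is why the Experience and interaction sums of squares in the ANOVA table are essentially zero.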

In [34]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test
# scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
# experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
# two-sample t-test using Python to determine if there are any significant differences in test scores
# between the two groups. If the results are significant, follow up with a post-hoc test to determine which
# group(s) differ significantly from each other.

In [35]:
import scipy.stats as stats

# Sample data (replace this with your actual data)
control_group_scores = [85, 88, 90, 78, 92, 87, 86, 82, 89, 83, 88, 85, 91, 80, 84, 87, 89, 81, 86, 85]
experimental_group_scores = [91, 94, 96, 85, 97, 92, 90, 86, 93, 88, 92, 89, 96, 87, 91, 94, 92, 88, 95, 90]

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)

# Print the results
print("Two-Sample T-Test - t-statistic:", t_statistic)
print("Two-Sample T-Test - p-value:", p_value)

Two-Sample T-Test - t-statistic: -4.856369527717335
Two-Sample T-Test - p-value: 2.0772880196554376e-05


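H0: There is no difference in mean test score between the two groups.
H1: There is a difference in mean test score between the two groups.

p_value < 0.05, so we reject H0 and conclude that there is a significant difference in test scores between the two groups. Because only two groups are compared, the t-test itself identifies which groups differ, so no separate post-hoc test is needed; the negative t-statistic shows that the control group scored lower than the experimental group.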
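Two common complements to the pooled t-test are Welch's version, which does not assume equal variances, and an effect size such as Cohen's d. The sketch below reuses the sample scores; the simple pooled standard deviation used for Cohen's d relies on both groups having the same size.

```python
import numpy as np
from scipy import stats

control = np.array([85, 88, 90, 78, 92, 87, 86, 82, 89, 83,
                    88, 85, 91, 80, 84, 87, 89, 81, 86, 85], dtype=float)
experimental = np.array([91, 94, 96, 85, 97, 92, 90, 86, 93, 88,
                         92, 89, 96, 87, 91, 94, 92, 88, 95, 90], dtype=float)

# Pooled-variance t-test (as above) and Welch's t-test (no equal-variance assumption)
t_pooled, p_pooled = stats.ttest_ind(control, experimental)
t_welch, p_welch = stats.ttest_ind(control, experimental, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
# (this simple average of variances is valid because both groups have equal size)
pooled_sd = np.sqrt((control.var(ddof=1) + experimental.var(ddof=1)) / 2)
cohens_d = (experimental.mean() - control.mean()) / pooled_sd

print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.2e}")
print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.2e}")
print(f"Cohen's d = {cohens_d:.2f}")
```

With equal group sizes the two t-statistics coincide; only the degrees of freedom, and therefore the p-values, can differ slightly.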

In [36]:
# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
# retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
# on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
# significant differences in sales between the three stores. If the results are significant, follow up with a
# post-hoc test to determine which store(s) differ significantly from each other.

In [43]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample data (replace this with your actual data)
data = {
    'Store': ['A', 'B', 'C'] * 30,
    'Sales': [100, 110, 95, 105, 115, 98, 102, 112, 93, 108, 100, 113, 97, 111, 96] * 6,
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

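# Perform one-way ANOVA (note: this treats the 90 store-day records as independent
# and does not model the repeated measurements taken on the same days)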
formula = 'Sales ~ C(Store)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA results
print("One-Way ANOVA Results:")
print(anova_table)

# Perform Tukey's HSD post-hoc test
posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'], alpha=0.05)

# Print the post-hoc test results
print("\nTukey's HSD Post-Hoc Test:")
print(posthoc)


One-Way ANOVA Results:
          sum_sq    df          F        PR(>F)
C(Store)  1757.6   2.0  27.677237  4.982458e-10
Residual  2762.4  87.0        NaN           NaN

Tukey's HSD Post-Hoc Test:
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj  lower    upper  reject
----------------------------------------------------
     A      B      7.2   0.0   3.7308 10.6692   True
     A      C     -3.4 0.056  -6.8692  0.0692  False
     B      C    -10.6   0.0 -14.0692 -7.1308   True
----------------------------------------------------
