Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Here are the key assumptions required to use ANOVA:

Normality:

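The residuals (equivalently, the data within each group) should be approximately normally distributed.
Violation: Strongly skewed distributions, heavy tails, or extreme outliers.
Impact: Distorted Type I error rates and reduced power, mainly when groups are small or unequal in size; with large, balanced groups the F-test is fairly robust to moderate non-normality.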

Homogeneity of Variances (Homoscedasticity):

The variances within each group should be roughly equal.
Violation: Unequal variances across groups.
Impact: Inflation of Type I error rates, biased F-ratios.

Independence of Observations:

Observations within and between groups should be independent of each other.
Violation: Dependent samples (e.g., repeated measures), clustering effects.
Impact: Underestimated standard errors, unreliable p-values.

Additional assumptions for more complex ANOVA designs:

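Additivity: When the model contains no interaction term (for example, a blocked design with one observation per cell), factor effects are assumed to be additive.
Linearity: When a continuous covariate is included (as in ANCOVA), its relationship with the dependent variable is assumed to be linear.

Examples of violations and their impacts: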

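Strongly skewed data: Might lead to incorrect conclusions, particularly in small samples, because the actual Type I error rate can differ from the nominal level.
Unequal variances combined with unequal group sizes: Can significantly distort results. When the smaller groups have the larger variances, the F-test rejects too often; when the larger groups have the larger variances, it rejects too rarely.
Correlated observations: Underestimated standard errors, leading to increased risk of false positives.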
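
The normality and equal-variance assumptions can be checked before running an ANOVA. The cell below is a minimal sketch, assuming hypothetical simulated groups; it uses the Shapiro-Wilk test and Levene's test from scipy. Independence cannot be tested this way and has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups simulated from normal distributions
# with the same spread (illustrative values only).
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=2.0, size=30) for m in (5.0, 5.5, 6.0)]

# Normality: Shapiro-Wilk test per group.
# A small p-value suggests the group is not normally distributed.
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variances: Levene's test across all groups.
# A small p-value suggests the group variances differ.
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Levene p-value:", round(float(levene_p), 3))
```

If Levene's test indicates unequal variances, a Welch-type ANOVA or a Games-Howell post-hoc test is a common alternative to the classical F-test.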

Q2. What are the three types of ANOVA, and in what situations would each be used?


There are three main types of ANOVA, each suited to different research questions and scenarios:

1. One-Way ANOVA:

Definition: Compares the means of groups (typically three or more) defined by the levels of a single categorical independent variable.
Situations: When you want to know if there are statistically significant differences in a dependent variable (e.g., plant growth rates) between groups defined by a single factor (e.g., fertilizer type).
Example: Comparing the average corn yields of three different fertilizer groups.

2. Two-Way ANOVA:

Definition: Examines the effects of two independent variables (both categorical) on a single dependent variable.
Situations: When you want to investigate the combined and individual effects of two factors and potential interactions between them.
Example: Studying the influence of both exercise frequency and diet type on weight loss outcomes.

3. N-Way ANOVA:

Definition: Generalizes beyond two factors, testing for the effects of three or more independent variables (categorical) on a single dependent variable.
Situations: When you have a complex research design with multiple influencing factors and intricate interactions to explore.
Example: Analyzing the performance of athletes under different training programs, competition pressures, and environmental conditions.

Choosing the right type of ANOVA depends on several factors:

1. Number of independent variables: One-way for one, two-way for two, N-way for three or more.
2. Complexity of research question: Simple comparisons (one-way) vs. multi-faceted interactions (N-way).
3. Availability of data and resources: N-way analyses need enough observations in every combination of factor levels and can be computationally demanding.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning Variance:

Central concept in ANOVA.
Involves dividing the total variability in a dataset into distinct sources to pinpoint the causes of variation.

Understanding it is essential because:

Forms the basis of ANOVA's hypothesis testing.
Determines which factors contribute significantly to differences in the dependent variable.
Guides interpretation of results and conclusions.

Explanation:

Total Sum of Squares (SST): Total squared deviation of all scores from the grand mean (overall mean).
Between-Groups Sum of Squares (SSB): Variation attributed to differences between group means.
Within-Groups Sum of Squares (SSW): Variation within each group, not explained by group membership (individual differences and random error).

Relationship and Formula:

SST = SSB + SSW

ANOVA's Test Statistic:

F-ratio: The between-groups mean square (SSB divided by k - 1) divided by the within-groups mean square (SSW divided by N - k), where k is the number of groups and N is the total number of observations.
A significant F-ratio suggests group means differ more than expected by chance, indicating the independent variable likely has an effect.

Key Points:

Partitioning variance isolates the effects of the independent variable from random error.
Understanding this concept is crucial for interpreting ANOVA results and drawing appropriate conclusions.
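
Written out for a one-way design, the partition looks as follows. The notation is introduced here for illustration: x_{ij} is observation i in group j, \bar{x}_j is the mean of group j, \bar{x} is the grand mean, n_j is the size of group j, k is the number of groups, and N is the total number of observations.

```latex
\begin{aligned}
SST &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x} \right)^2 \\
SSB &= \sum_{j=1}^{k} n_j \left( \bar{x}_j - \bar{x} \right)^2 \\
SSW &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x}_j \right)^2 \\
SST &= SSB + SSW \\
F &= \frac{SSB / (k - 1)}{SSW / (N - k)}
\end{aligned}
```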

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [4]:
# Generate dataset
np.random.seed(0)  # for reproducibility
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

In [5]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Create a LinearRegression instance
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

In [7]:
# Make predictions
y_pred = model.predict(X_test)

In [8]:
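# Calculate the mean of y on the test set
y_mean = np.mean(y_test)

# Residual (error) sum of squares: observed minus predicted.
# Note: the question calls this quantity SSR; this cell labels it SSE ("sum of squared errors").
sse = np.sum((y_test - y_pred) ** 2)

# Explained (regression) sum of squares: predicted minus mean.
# Note: the question calls this quantity SSE; this cell labels it SSR ("regression").
ssr = np.sum((y_pred - y_mean) ** 2)

# Total sum of squares.
# SST = SSE + SSR holds exactly only on the data the model was fitted to,
# so the values printed for the held-out test set need not add up.
sst = np.sum((y_test - y_mean) ** 2)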

print(f"SSE: {sse}")
print(f"SSR: {ssr}")
print(f"SST: {sst}")

SSE: 18.355064939428587
SSR: 9.553853891520301
SST: 25.891988774148167
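
The cell above fits a linear regression rather than a one-way ANOVA. A minimal sketch of the same decomposition for grouped data is shown below, using the question's naming (SSE = explained, SSR = residual). The three groups and their values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical measurements for three groups (illustrative values only)
groups = {
    "A": np.array([4.0, 5.0, 6.0]),
    "B": np.array([7.0, 8.0, 9.0]),
    "C": np.array([10.0, 11.0, 12.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total sum of squares: every observation against the grand mean
sst = np.sum((all_values - grand_mean) ** 2)

# Explained (between-groups) sum of squares: group means against the grand mean,
# weighted by group size
sse_explained = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# Residual (within-groups) sum of squares: observations against their own group mean
ssr_residual = sum(np.sum((g - g.mean()) ** 2) for g in groups.values())

print("SST:", sst)                       # 60.0
print("SSE (explained):", sse_explained)  # 54.0
print("SSR (residual):", ssr_residual)    # 6.0
```

Here the explained and residual parts add up to the total (54 + 6 = 60), which is the identity a one-way ANOVA relies on.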


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create sample data for two factors
factor1 = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
factor2 = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
response = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12])

# Create a DataFrame with the data
df = pd.DataFrame({'Factor1': factor1, 'Factor2': factor2, 'Response': response})

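# Fit the two-way ANOVA model
# Note: Factor1 and Factor2 are numeric here, so the formula treats them as
# continuous covariates (1 df each). Wrapping them as C(Factor1) and C(Factor2)
# would treat them as categorical, but with one observation per cell the
# interaction model would then leave no residual degrees of freedom.
model = ols('Response ~ Factor1 + Factor2 + Factor1:Factor2', data=df).fit()
anova_table = sm.stats.anova_lm(model)

# Extract the sums of squares attributed to each main effect and to the interaction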
main_effect_factor1 = anova_table['sum_sq']['Factor1']
main_effect_factor2 = anova_table['sum_sq']['Factor2']
interaction_effect = anova_table['sum_sq']['Factor1:Factor2']

print("Main Effect of Factor 1:", main_effect_factor1)
print("Main Effect of Factor 2:", main_effect_factor2)
print("Interaction Effect:", interaction_effect)

Main Effect of Factor 1: 54.00000000000003
Main Effect of Factor 2: 6.0000000000000036
Interaction Effect: 3.1554436208840472e-30
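
The values printed above are sums of squares. The effects themselves can also be read off the cell and marginal means. The sketch below, which assumes the same nine observations, computes each main effect as a marginal mean minus the grand mean and each interaction effect as what remains in a cell mean after removing the grand mean and both main effects.

```python
import numpy as np
import pandas as pd

# Same nine observations as in the cell above
df = pd.DataFrame({
    "Factor1": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Factor2": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Response": [4, 5, 6, 7, 8, 9, 10, 11, 12],
})

grand_mean = df["Response"].mean()

# Main effects: marginal mean of each level minus the grand mean
effect_f1 = df.groupby("Factor1")["Response"].mean() - grand_mean
effect_f2 = df.groupby("Factor2")["Response"].mean() - grand_mean

# Interaction effects: cell mean minus grand mean minus both main effects
cell_means = df.pivot_table(index="Factor1", columns="Factor2",
                            values="Response", aggfunc="mean")
interaction = cell_means.sub(effect_f1, axis=0).sub(effect_f2, axis=1) - grand_mean

print(effect_f1.tolist())   # [-3.0, 0.0, 3.0]
print(effect_f2.tolist())   # [-1.0, 0.0, 1.0]
print(np.allclose(interaction.values, 0.0))  # True
```

Squaring these effects and summing over the nine cells gives 54 for Factor 1, 6 for Factor 2, and 0 for the interaction, matching the sums of squares printed above.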


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In a one-way ANOVA, the F-statistic is used to test the null hypothesis that the means of all the groups are equal. The p-value associated with the F-statistic indicates the probability of obtaining the observed F-value (or a more extreme value) if the null hypothesis is true.

In this case, we obtained an F-statistic of 5.23 and a p-value of 0.02. The p-value of 0.02 indicates that the probability of obtaining an F-statistic as extreme as 5.23 (or more extreme) under the assumption of equal group means is 0.02.

Based on the obtained results, we reject the null hypothesis and conclude that at least one group mean differs from the others. The p-value is below the conventional significance level of 0.05, so differences this large would be unlikely if all group means were equal. The test does not say which groups differ or how large the differences are.

To interpret these results further, it is necessary to conduct post hoc tests or examine the group means directly. These additional analyses can provide insights into which specific groups differ significantly from each other and the direction of those differences (i.e., which groups have higher or lower means compared to others). Post hoc tests, such as Tukey's test or pairwise t-tests, allow for comparisons between individual groups while controlling for the overall experiment-wise error rate.

Therefore, based on an F-statistic of 5.23 and a p-value of 0.02 in a one-way ANOVA, we can conclude that there are statistically significant differences between the groups. Further post hoc tests or examination of the group means will provide more specific information about the nature and direction of these differences.

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Understanding Missing Data in Repeated Measures:

Missing data can significantly impact results, especially since repeated measures involve multiple observations from the same participants.
Addressing it appropriately is crucial for valid conclusions.

Common Methods for Handling Missing Data:

1. Listwise Deletion:

Removes any participant with missing values in any of the repeated measures.
Consequences:
Reduced power, and potential bias unless the data are missing completely at random (MCAR).
Not ideal for smaller samples or when missingness is related to the outcome.

2. Pairwise Deletion:

Excludes participants only from specific analyses where they have missing values.
Consequences:
Can lead to different sample sizes across analyses, complicating interpretation.
Might underestimate standard errors and inflate Type I error rates.

3. Mean Imputation:

Replaces missing values with the mean of the observed scores for that measure.
Consequences:
Underestimates variability and can distort relationships between variables.
Not recommended for repeated measures due to potential bias.

4. Multiple Imputation:

Creates multiple "complete" datasets by imputing missing values based on statistical models.
Analyzes each dataset and combines results for more robust inferences.
Consequences:
Computationally intensive but often preferred for handling missing data in repeated measures.

5. Mixed Models:

Advanced statistical models that can accommodate missing data without explicit imputation.
Uses all available data and gives valid estimates when the data are missing at random (MAR), a weaker requirement than MCAR.
Consequences:
More complex but can provide more accurate results for repeated measures with missing data.
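
The first and third methods can be illustrated with pandas. This is a minimal sketch on hypothetical wide-format scores (one row per participant, one column per time point); it shows how listwise deletion shrinks the sample and how mean imputation shrinks the spread.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 15.0, 10.0],
    "t3": [12.0, 14.0, np.nan, 16.0, 11.0],
})

# Listwise deletion: keep only participants with no missing value at any time point
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = scores.fillna(scores.mean())

print(len(scores), len(complete_cases))  # 5 3
# Standard deviation of t2: observed values only vs. after mean imputation
print(round(scores["t2"].std(), 2), round(mean_imputed["t2"].std(), 2))  # 2.16 1.87
```

Two of five participants are lost to listwise deletion, and the imputed column has a smaller standard deviation than the observed values, which is the underestimated variability described above.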

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are conducted after finding a significant result in an Analysis of Variance (ANOVA) to determine which specific group differences are driving the overall significance. Since ANOVA itself doesn't identify where the differences lie, post-hoc tests are employed for pairwise comparisons between groups. Some common post-hoc tests include:

1. Tukey's Honestly Significant Difference (HSD):

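When to Use: Tukey's HSD is suited to comparing every pair of group means while controlling the familywise error rate (overall Type I error rate). It assumes equal variances and was designed for equal sample sizes; the Tukey-Kramer variant extends it to unequal sample sizes.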
Example Situation: You conducted a one-way ANOVA and found a significant difference among the means. Tukey's HSD can help identify which specific pairs of groups are significantly different from each other.

2. Bonferroni Correction:

When to Use: Bonferroni controls the familywise error rate by dividing the significance level by the number of comparisons. It can be applied to any set of tests and is best suited to a small number of planned comparisons, because it becomes increasingly conservative (less powerful) as the number of comparisons grows.
Example Situation: You have multiple groups, and you want to perform a few pre-specified pairwise comparisons. The Bonferroni correction helps control the overall probability of making a Type I error.

3. Duncan's New Multiple Range Test:

When to Use: Duncan's test is less conservative (more powerful) than Tukey's HSD, but it does not fully control the familywise error rate, so it carries a higher risk of false positives. It was designed for equal sample sizes.
Example Situation: You conducted a one-way ANOVA with equal sample sizes in an exploratory study, and you are willing to accept more false positives in exchange for a better chance of detecting real differences.

4. Scheffé's Test:

When to Use: Scheffé's test is the most conservative of these procedures. It controls the familywise error rate for all possible contrasts, not only pairwise comparisons, and it can be used with unequal sample sizes.
Example Situation: You have a complex experimental design and want to test contrasts chosen after looking at the data (for example, the average of two groups against a third) while controlling the overall Type I error rate.

5. Games-Howell Test:

When to Use: Games-Howell is a robust post-hoc test for situations with unequal variances and sample sizes. It does not assume equal variances across groups.
Example Situation: You conducted a one-way ANOVA, and Levene's test indicated unequal variances. Games-Howell can be used to compare specific groups.
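
As an illustration of the Bonferroni approach, the sketch below runs all pairwise t-tests on three hypothetical simulated groups and adjusts the p-values with `multipletests` from statsmodels. The group names, means, and sizes are assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Hypothetical data: groups A and B share a mean, group C is shifted upward
rng = np.random.default_rng(1)
groups = {
    "A": rng.normal(5.0, 2.0, 30),
    "B": rng.normal(5.0, 2.0, 30),
    "C": rng.normal(7.0, 2.0, 30),
}

# Raw p-values from every pairwise two-sample t-test
pairs = list(combinations(groups, 2))
raw_p = []
for a, b in pairs:
    _, p = ttest_ind(groups[a], groups[b])
    raw_p.append(p)

# Bonferroni adjustment: each p-value is multiplied by the number of tests (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, p_adj, rej in zip(pairs, raw_p, adj_p, reject):
    print(f"{a} vs {b}: raw p = {p:.4f}, adjusted p = {p_adj:.4f}, reject = {rej}")
```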

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [10]:
import numpy as np
from scipy.stats import f_oneway

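# Generate example data for weight loss
# Note: this simulates 50 participants per diet (150 in total), whereas the question describes 50 participants overall
np.random.seed(42)  # Set seed for reproducibility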
diet_A = np.random.normal(loc=5, scale=2, size=50)  # Example data for Diet A
diet_B = np.random.normal(loc=4.5, scale=2, size=50)  # Example data for Diet B
diet_C = np.random.normal(loc=6, scale=2, size=50)  # Example data for Diet C

# Combine data into a single array
all_data = np.concatenate([diet_A, diet_B, diet_C])

# Create group labels
group_labels = ['A'] * 50 + ['B'] * 50 + ['C'] * 50

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Report results
print("One-Way ANOVA Results:")
print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("The one-way ANOVA result is statistically significant, indicating that there are significant differences in mean weight loss between at least two diets.")
else:
    print("The one-way ANOVA result is not statistically significant, suggesting that there may not be significant differences in mean weight loss between the diets.")


One-Way ANOVA Results:
F-Statistic: 8.914168610576342
P-Value: 0.00022180999236284595
The one-way ANOVA result is statistically significant, indicating that there are significant differences in mean weight loss between at least two diets.


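Since the p-value (about 0.0002) is less than the 0.05 significance level, we reject the null hypothesis that the three diets have the same mean weight loss. At least one diet's mean differs from the others; the ANOVA alone does not identify which one.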
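
To see which diets differ, a post-hoc test can follow the significant ANOVA. The sketch below applies Tukey's HSD from statsmodels, assuming the same simulated data as the cell above (same seed and call order).

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Recreate the simulated weight-loss data (50 values per diet)
np.random.seed(42)
diet_A = np.random.normal(loc=5, scale=2, size=50)
diet_B = np.random.normal(loc=4.5, scale=2, size=50)
diet_C = np.random.normal(loc=6, scale=2, size=50)

# Stack the values and build a matching list of group labels
all_data = np.concatenate([diet_A, diet_B, diet_C])
group_labels = ['A'] * 50 + ['B'] * 50 + ['C'] * 50

# Tukey's HSD compares every pair of diets while controlling the familywise error rate
tukey_results = pairwise_tukeyhsd(all_data, group_labels, alpha=0.05)
print(tukey_results)
```

Each row of the printed table is one pair of diets; the `reject` column shows whether that pair differs at the 0.05 level.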

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [11]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate sample data
np.random.seed(0)
n = 30
programs = ['A', 'B', 'C']
experience_levels = ['novice', 'experienced']

data = pd.DataFrame({
    'Program': np.random.choice(programs, n),
    'Experience': np.random.choice(experience_levels, n),
    'Time': np.random.normal(10, 2, n)
})

# Fit the two-way ANOVA model
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Print the ANOVA table
print(anova_table)

                            df     sum_sq   mean_sq         F    PR(>F)
C(Program)                 2.0  11.306452  5.653226  2.145100  0.138964
C(Experience)              1.0   2.102143  2.102143  0.797652  0.380665
C(Program):C(Experience)   2.0   6.013261  3.006630  1.140857  0.336272
Residual                  24.0  63.249921  2.635413       NaN       NaN


The ANOVA table shows the degrees of freedom (df), sum of squares (sum_sq), mean sum of squares (mean_sq), F-statistic (F), and p-value (PR(>F)) for each factor and the interaction term, as well as the residual.

Interpreting the results:

Software Program (C(Program)): F = 2.145 with p = 0.139, which is greater than the conventional significance level of 0.05. There is no significant main effect of software program on task completion time.
Employee Experience Level (C(Experience)): F = 0.798 with p = 0.381, which is greater than 0.05. There is no significant main effect of employee experience level on task completion time.
Interaction between Software Program and Employee Experience (C(Program):C(Experience)): F = 1.141 with p = 0.336, which is greater than 0.05. There is no significant interaction effect between software program and experience level.

In summary, the two-way ANOVA shows no significant main effect of software program, no significant main effect of experience level, and no significant interaction. This is expected for this example, because the simulated times were drawn from the same distribution regardless of program or experience.

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [12]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate example data
np.random.seed(42)  # Set seed for reproducibility
control_group = np.random.normal(loc=70, scale=10, size=50)  # Example data for the control group
experimental_group = np.random.normal(loc=75, scale=10, size=50)  # Example data for the experimental group

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

# Report t-test results
print("Two-Sample T-Test Results:")
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

# Check if the results are significant (using a common significance level of 0.05)
if p_value < 0.05:
    print("The two-sample t-test result is statistically significant, indicating a difference in test scores between the control and experimental groups.")
    
    # Perform post-hoc Tukey's HSD test (with only two groups, this repeats the single comparison already made by the t-test)
    data = np.concatenate([control_group, experimental_group])
    labels = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)
    
    tukey_results = pairwise_tukeyhsd(data, labels, alpha=0.05)
    
    # Print post-hoc results
    print("\nPost-Hoc Tukey's HSD Test Results:")
    print(tukey_results)
else:
    print("The two-sample t-test result is not statistically significant, suggesting no significant difference in test scores between the control and experimental groups.")


Two-Sample T-Test Results:
T-Statistic: -4.108723928204809
P-Value: 8.261945608702611e-05
The two-sample t-test result is statistically significant, indicating a difference in test scores between the control and experimental groups.

Post-Hoc Tukey's HSD Test Results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental   7.4325 0.0001 3.8427 11.0224   True
----------------------------------------------------------


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [14]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")
from statsmodels.formula.api import ols

# Generate sample data
np.random.seed(0)
n = 30
days = range(1, n+1)
stores = ['A', 'B', 'C']

data = pd.DataFrame({
    'Day': np.repeat(days, len(stores)),
    'Store': np.tile(stores, n),
    'Sales': np.random.normal(1000, 100, n*len(stores))
})

# Fit the repeated measures ANOVA model
model = ols('Sales ~ C(Store) + C(Day) + C(Store):C(Day)', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Print the ANOVA table
print(anova_table)

                   df        sum_sq       mean_sq    F  PR(>F)
C(Store)          2.0  2.939774e+04  1.469887e+04  0.0     NaN
C(Day)           29.0  3.824792e+05  1.318894e+04  0.0     NaN
C(Store):C(Day)  58.0  5.410130e+05  9.327811e+03  0.0     NaN
Residual          0.0  6.247669e-22           inf  NaN     NaN


The table above does not provide usable tests. The formula includes the Store-by-Day interaction, and with exactly one observation per store per day the model has as many parameters as observations (1 + 2 + 29 + 58 = 90). That leaves 0 residual degrees of freedom and a residual sum of squares of essentially zero, so the F column is meaningless and every p-value is NaN.

In a repeated measures design, Day plays the role of the subject (blocking) factor, and the Store-by-Day variation serves as the error term for the Store effect. Using the mean squares in the table, the Store effect would be tested as F = 14698.87 / 9327.81, which is about 1.58 on 2 and 58 degrees of freedom. This is below the 5% critical value of roughly 3.16, so there is no evidence of a difference in average daily sales between the three stores, and a post-hoc test is not needed. This is expected for this example, because the simulated sales were drawn from the same distribution for every store and day.
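
A repeated measures ANOVA that treats each day as the subject and the store as the within-subject factor can be sketched with `AnovaRM` from statsmodels. The cell below assumes the same simulated data as above and is one possible way to set up the analysis.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Recreate the simulated sales data: one row per (day, store) combination
np.random.seed(0)
n = 30
stores = ['A', 'B', 'C']
data = pd.DataFrame({
    'Day': np.repeat(np.arange(1, n + 1), len(stores)),
    'Store': np.tile(stores, n),
    'Sales': np.random.normal(1000, 100, n * len(stores))
})

# Day is the repeated "subject"; Store is the within-subject factor
rm_results = AnovaRM(data, depvar='Sales', subject='Day', within=['Store']).fit()
print(rm_results.anova_table)
```

The resulting table reports the F value for Store with 2 numerator and 58 denominator degrees of freedom. If that test were significant, pairwise paired comparisons between stores with a multiple-comparison correction would be a reasonable follow-up.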