#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Ans.

ANOVA (Analysis of Variance) is a statistical method used to compare means across multiple groups. Its results are only valid when a few assumptions hold. The key assumptions, with examples of violations and their impact, are:  

1. Independence of Observations
- Assumption: The observations within each group must be independent of each other.
- Violation Example: If the data come from related samples, such as measurements from the same participants at different times, the observations are correlated and the independence assumption is violated. Such designs call for a repeated measures ANOVA instead.
- Impact of Violation: If observations aren't independent, the within-group variability is misestimated. With positively correlated observations it is usually underestimated, which inflates the Type I error rate.
2. Normality
- Assumption: The residuals (or errors) within each group are approximately normally distributed.
- Violation Example: If the data has a strong skew, heavy tails, or is bimodal, the normality assumption is violated.
- Impact of Violation: Severe departures from normality can lead to inaccurate p-values and incorrect decisions about statistical significance. For small sample sizes, normality is crucial; however, for large sample sizes, the Central Limit Theorem may mitigate the effects.
3. Homogeneity of Variances (Homoscedasticity)
- Assumption: The variance within each group should be approximately equal.
- Violation Example: If one group has much higher or lower variance than others, this assumption is violated (heteroscedasticity).
- Impact of Violation: If the variances are unequal, the F-statistic may be distorted, especially when group sizes are also unequal. This can increase the likelihood of Type I and Type II errors. Welch's ANOVA is a common alternative in this situation.
4. Scale of Measurement
- Assumption: The dependent variable should be measured on an interval or ratio scale (i.e., continuous).
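- Violation Example: If the dependent variable is ordinal (e.g., a single Likert item), the distances between categories are not guaranteed to be equal, so comparing means is questionable.
- Impact of Violation: Using ANOVA with ordinal or categorical data can lead to incorrect conclusions. A rank-based alternative such as the Kruskal-Wallis test is often preferred.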
5. Additivity and Linearity
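- Assumption: The effects of the factors on the dependent variable combine additively unless an interaction term is included. When a continuous covariate is present (as in ANCOVA), its relationship with the dependent variable is assumed to be linear.
- Violation Example: The data show a non-linear relationship, or the factors interact in ways that are not captured in the model.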
- Impact of Violation: If the relationship isn't linear or the model fails to account for interactions, ANOVA might not correctly capture the variance, leading to misleading results.

When These Assumptions Are Violated:
- Type I Error: Incorrectly rejecting a true null hypothesis (false positive).
- Type II Error: Failing to reject a false null hypothesis (false negative).
- Reduced Power: The test may have a lower chance of detecting a real effect.
- Misleading Conclusions: Violations can lead to incorrect interpretations, such as finding significant effects when there are none, or missing real effects.
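
The normality and equal-variance assumptions above can be checked before running ANOVA. The following is a minimal sketch, assuming three independent groups stored as NumPy arrays; the data, the seed, and the variable names are hypothetical and only serve as an illustration. It uses the Shapiro-Wilk test on the residuals and Levene's test on the groups.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three independent groups of 20 observations each
rng = np.random.default_rng(0)
groups = [
    rng.normal(loc=10, scale=2, size=20),
    rng.normal(loc=12, scale=2, size=20),
    rng.normal(loc=11, scale=2, size=20),
]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test (centered on the median by default)
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption. Independence cannot be tested this way; it has to follow from the study design.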

---

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans.

1. One-Way ANOVA  
- Purpose: Compares the means of three or more independent (unrelated) groups based on a single independent variable.
- Use Case: When you want to determine if there is a significant difference in the average test scores of students across three different teaching methods.

2. Two-Way ANOVA
- Purpose: Examines the effect of two independent variables on a dependent variable, including their interaction effect.
- Use Case: When analyzing how both "teaching method" and "gender" affect student test scores, and whether there is an interaction between them.

3. Repeated Measures ANOVA
- Purpose: Used when the same subjects are measured multiple times under different conditions or at different time points.
- Use Case: When assessing the effectiveness of a drug by measuring blood pressure in the same group of patients before, during, and after treatment.

---

#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans.

Partitioning of Variance in ANOVA:  
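In ANOVA, the total variability in the data (the total sum of squares) is divided (or partitioned) into two components: between-group variability, which reflects differences among the group means, and within-group variability, which reflects the spread of observations around their own group mean. Comparing these components shows whether the group means differ by more than random variation would explain, and it helps in understanding the sources of variability in the data.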

**Importance of Understanding Variance Partitioning**  
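1. Helps in Hypothesis Testing:  
- The F-statistic in ANOVA is calculated as:
𝐹 = (Variance Between Groups) / (Variance Within Groups)
- Each variance is a mean square, i.e., a sum of squares divided by its degrees of freedom.
- A high F-value, judged against the F distribution with those degrees of freedom, suggests a significant difference between groups.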

2. Distinguishes True Effects from Random Variability:  
- If between-group variance is significantly larger than within-group variance, it suggests a real effect rather than random variation.

3. Influences Statistical Power:  
- If within-group variance is high, it can reduce the power of ANOVA to detect real differences.
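
The partition described above can be written out explicitly. This is a sketch in common one-way ANOVA notation; the symbols are introduced here as assumptions and do not appear in the original answer: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the grand mean.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2}_{SS_{\text{within}}}

F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
```

The two terms on the right are the between-group and within-group variability, and dividing each by its degrees of freedom gives the two variances that form the F-ratio.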

---

#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

Ans.

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

data = {
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Values': [5, 6, 7, 8, 9, 10, 12, 13, 14]
}

df = pd.DataFrame(data)

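# Naming follows the question: SSE = explained (between groups), SSR = residual (within groups).
# Many textbooks swap these labels (SSR = regression, SSE = error).

# Grand mean of all observations
y_bar = df['Values'].mean()

# SST: squared deviations of every observation from the grand mean
sst = np.sum((df['Values'] - y_bar) ** 2)

# SSE: group size times the squared deviation of each group mean from the grand mean
group_means = df.groupby('Group')['Values'].mean()
group_counts = df.groupby('Group')['Values'].count()
sse = np.sum(group_counts * (group_means - y_bar) ** 2)

# SSR: squared deviations of each observation from its own group mean
df['Group_Mean'] = df.groupby('Group')['Values'].transform('mean')
ssr = np.sum((df['Values'] - df['Group_Mean']) ** 2)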

print(f"SST (Total Sum of Squares): {sst:.2f}")
print(f"SSE (Explained Sum of Squares): {sse:.2f}")
print(f"SSR (Residual Sum of Squares): {ssr:.2f}")

print(f"Verification: SST = SSE + SSR → {np.isclose(sst, sse + ssr)}")

anova_result = f_oneway(df[df['Group'] == 'A']['Values'], df[df['Group'] == 'B']['Values'], df[df['Group'] == 'C']['Values'])

print(f"ANOVA F-statistic: {anova_result.statistic:.4f}, p-value: {anova_result.pvalue:.4f}")

SST (Total Sum of Squares): 80.00
SSE (Explained Sum of Squares): 74.00
SSR (Residual Sum of Squares): 6.00
Verification: SST = SSE + SSR → True
ANOVA F-statistic: 37.0000, p-value: 0.0004
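
To connect the sums of squares to the reported F-statistic, the sketch below rebuilds F and its p-value by hand from the same nine values. The variable names (`ss_between`, `ss_within`, `f_manual`) are chosen for this illustration and are not part of the original cell.

```python
import numpy as np
from scipy import stats

# Same data as the cell above, stored per group
groups = {
    'A': np.array([5, 6, 7]),
    'B': np.array([8, 9, 10]),
    'C': np.array([12, 13, 14]),
}
all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# Degrees of freedom: k - 1 between groups, N - k within groups
k = len(groups)
n_total = len(all_values)
df_between = k - 1
df_within = n_total - k

# F is the ratio of the two mean squares; the p-value is the upper tail of the F distribution
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

print(f"F = {f_manual:.4f}, p = {p_manual:.4f}")  # F = 37.0000, p = 0.0004
```

The result matches the `f_oneway` output above, which confirms that the F-statistic is the explained mean square (74 / 2) divided by the residual mean square (6 / 6).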


---

#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Ans.

In a two-way ANOVA, we analyze the effects of two independent variables (factors) on a dependent variable. We calculate:
- Main Effects: The impact of each independent variable separately.
- Interaction Effect: The combined effect of both independent variables (i.e., whether the effect of one factor depends on the level of the other).

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Sample dataset: Two independent variables (Factor A & Factor B)
data = {
    'Factor_A': ['Low', 'Low', 'Low', 'High', 'High', 'High', 'Low', 'Low', 'Low', 'High', 'High', 'High'],
    'Factor_B': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Response': [10, 12, 14, 20, 22, 21, 13, 15, 14, 25, 27, 26]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Define the two-way ANOVA model with interaction term
model = smf.ols('Response ~ C(Factor_A) + C(Factor_B) + C(Factor_A):C(Factor_B)', data=df).fit()

# Build the ANOVA table (typ=2 requests Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)

# Print ANOVA table
print(anova_table)


                         sum_sq   df           F        PR(>F)
C(Factor_A)              330.75  1.0  189.000000  7.560184e-07
C(Factor_B)               36.75  1.0   21.000000  1.795939e-03
C(Factor_A):C(Factor_B)    6.75  1.0    3.857143  8.513507e-02
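Residual                  14.00  8.0         NaN           NaN

Both main effects are significant at the 0.05 level: Factor_A (F = 189.00, p ≈ 7.6e-07) and Factor_B (F = 21.00, p ≈ 0.0018). The interaction term C(Factor_A):C(Factor_B) is not significant (F = 3.86, p ≈ 0.085), so there is no strong evidence that the effect of one factor depends on the level of the other.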
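
The sums of squares in the ANOVA table can also be reproduced by hand from the group means. The sketch below assumes a balanced design (equal cell sizes), which holds for this dataset; the helper `between_ss` is a hypothetical name introduced for this illustration.

```python
import pandas as pd

# Same data as the cell above
df = pd.DataFrame({
    'Factor_A': ['Low'] * 3 + ['High'] * 3 + ['Low'] * 3 + ['High'] * 3,
    'Factor_B': ['A'] * 6 + ['B'] * 6,
    'Response': [10, 12, 14, 20, 22, 21, 13, 15, 14, 25, 27, 26],
})
grand_mean = df['Response'].mean()


def between_ss(keys):
    """Size-weighted squared deviations of group means from the grand mean."""
    summary = df.groupby(keys)['Response'].agg(['mean', 'count'])
    return (summary['count'] * (summary['mean'] - grand_mean) ** 2).sum()


# Main effects: variation of the marginal means of each factor
ss_a = between_ss('Factor_A')
ss_b = between_ss('Factor_B')

# Interaction: variation of the cell means not explained by the two main effects
ss_cells = between_ss(['Factor_A', 'Factor_B'])
ss_ab = ss_cells - ss_a - ss_b

# Residual: variation of observations around their own cell mean
cell_mean = df.groupby(['Factor_A', 'Factor_B'])['Response'].transform('mean')
ss_resid = ((df['Response'] - cell_mean) ** 2).sum()

print(f"SS_A = {ss_a:.2f}, SS_B = {ss_b:.2f}, SS_AB = {ss_ab:.2f}, SS_resid = {ss_resid:.2f}")
# SS_A = 330.75, SS_B = 36.75, SS_AB = 6.75, SS_resid = 14.00
```

These values agree with the `sum_sq` column of the table. In unbalanced designs this simple subtraction no longer holds, and the type of sums of squares matters.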


---

#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Ans.

- The F-statistic quantifies the ratio of between-group variance to within-group variance.
- A p-value of 0.02 (which is less than the common threshold of 0.05) suggests that at least one group mean significantly differs from the others.
- Therefore, we reject the null hypothesis (H0), which states that all group means are equal.
- However, ANOVA does not tell us which specific groups are different—for that, we need a post-hoc test (e.g., Tukey's HSD).
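
The decision rule above can be sketched in code. The question does not give the degrees of freedom, so the values below (3 groups, 30 observations) are purely hypothetical, and the resulting p-value will not reproduce the stated 0.02 exactly.

```python
from scipy import stats

f_stat = 5.23
alpha = 0.05

# Hypothetical degrees of freedom: 3 groups and 30 observations in total
df_between = 3 - 1
df_within = 30 - 3

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value: the F that would have to be exceeded at the chosen alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value = {p_value:.4f}, critical F = {f_critical:.2f}")
if p_value < alpha:
    print("Reject H0: at least one group mean differs.")
else:
    print("Fail to reject H0.")
```

Comparing the p-value with alpha and comparing F with the critical value are equivalent ways of reaching the same decision.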

---

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans.

1. Listwise Deletion (Complete Case Analysis) => dataframe.dropna(inplace=True)  
- Method: Exclude subjects with any missing data across time points.
- Pros: Simple and easy to implement.
- Cons: Reduces sample size, which lowers statistical power and can introduce bias if data is not missing completely at random (MCAR).

2. Mean/Group Mean Imputation => dataframe.fillna(dataframe.mean(numeric_only=True), inplace=True)
- Method: Replace missing values with the mean of that variable (overall mean) or the mean of the respective group.
- Pros: Maintains sample size.
- Cons: Underestimates variability, potentially leading to inflated false positive rates.

3. Last Observation Carried Forward (LOCF) => dataframe.ffill(inplace=True)
- Method: Use the last recorded value of a subject to fill in missing data.
- Pros: Preserves within-subject correlation in repeated measures.
- Cons: Assumes data remains stable over time, which may not be valid.
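
The one-line calls above work on wide-format data (one row per subject). Repeated measures data is often stored in long format (one row per subject and time point), where each method has to respect the subject structure. The sketch below illustrates this on a small hypothetical dataset; the column names `subject`, `time`, and `score` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: 3 subjects measured at 3 time points
long_df = pd.DataFrame({
    'subject': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'time': ['t1', 't2', 't3'] * 3,
    'score': [10.0, 12.0, 14.0, 11.0, np.nan, 15.0, 9.0, 10.0, np.nan],
})

# 1. Listwise deletion: drop every subject that has at least one missing value
complete = long_df.groupby('subject').filter(lambda g: g['score'].notna().all())

# 2. Mean imputation: fill each gap with the mean of the same time point
mean_imputed = long_df.copy()
time_means = mean_imputed.groupby('time')['score'].transform('mean')
mean_imputed['score'] = mean_imputed['score'].fillna(time_means)

# 3. LOCF: carry the last observed value forward within each subject only
locf = long_df.sort_values(['subject', 'time']).copy()
locf['score'] = locf.groupby('subject')['score'].ffill()

print("Subjects kept after listwise deletion:", complete['subject'].unique().tolist())
print(mean_imputed)
print(locf)
```

Here listwise deletion keeps only subject 1, which shows how quickly the sample shrinks, while the two imputation methods keep all nine rows but fill the gaps with different values.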

---

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Ans.

**Common Post-Hoc Tests After ANOVA:**
- When a one-way or two-way ANOVA detects a significant difference (p < 0.05), it only tells us that at least one group differs from the others—but it does not specify which groups are different. Post-hoc tests are used to determine which specific group differences are significant while controlling for multiple comparisons.

1. Tukey’s Honestly Significant Difference (Tukey HSD):  
- Use:
  - Comparing all pairwise group means.
  - Works best with equal sample sizes (balanced design); the Tukey-Kramer adjustment handles unequal sizes.
- Controls for: Familywise error rate (FWER) using the studentized range distribution.
- Example: If an ANOVA comparing three plant fertilizers shows a significant effect on plant growth, Tukey HSD can determine which fertilizers differ.

2. Bonferroni Correction:
- Use:
  - Comparing a few specific groups rather than all pairwise comparisons.
  - Very conservative (controls Type I error well but reduces power).
- Controls for: FWER by adjusting the significance threshold (α′ = α/number of tests).
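- Example: If only a small number of pre-planned comparisons among the fertilizers are of interest (e.g., A vs. B and A vs. C), Bonferroni can be less conservative than Tukey, which adjusts for all pairwise comparisons.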

3. Holm’s Method:
- Use:
  - Similar to Bonferroni but more powerful (less conservative).
  - Used for sequential rejection testing (ranks p-values and adjusts dynamically).
- Controls for: FWER with more statistical power than Bonferroni.
- Example: When comparing multiple drugs' effects on cholesterol levels.

When a Post-Hoc Test Is Necessary:
- Suppose you conduct an ANOVA on three teaching methods (A, B, C) and find a significant effect on student performance (p = 0.01, F = 4.5).
- This means at least one method is different, but which one?
- A Tukey HSD test can tell you whether A differs from B, B differs from C, etc.
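
The difference between the Bonferroni and Holm adjustments can be seen on a small set of p-values. The sketch below uses `multipletests` from statsmodels; the three unadjusted p-values are hypothetical and stand in for the pairwise comparisons A vs. B, A vs. C, and B vs. C.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical unadjusted p-values for A vs. B, A vs. C, B vs. C
raw_p = np.array([0.01, 0.04, 0.03])

# Bonferroni: multiply every p-value by the number of tests (capped at 1)
reject_bonf, p_bonf, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')

# Holm: step-down procedure, the smallest p-value gets the largest multiplier
reject_holm, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method='holm')

print("Bonferroni adjusted:", np.round(p_bonf, 4), reject_bonf)  # [0.03 0.12 0.09]
print("Holm adjusted:      ", np.round(p_holm, 4), reject_holm)  # [0.03 0.06 0.06]
```

Holm's adjusted p-values are never larger than Bonferroni's, which is why it has more power while still controlling the familywise error rate.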

---

#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

Ans.

In [3]:
import numpy as np
import pandas as pd
import scipy.stats as stats

np.random.seed(42)  # For reproducibility
# Simulated data (assumption): 50 participants per diet, 150 in total
diet_A = np.random.normal(loc=5.0, scale=1.5, size=50)  # Mean = 5 kg loss
diet_B = np.random.normal(loc=6.0, scale=1.5, size=50)  # Mean = 6 kg loss
diet_C = np.random.normal(loc=4.5, scale=1.5, size=50)  # Mean = 4.5 kg loss

f_stat, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

print(f"F-statistic: {f_stat:.2f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05  # Significance level
if p_value < alpha:
    print("Result: Reject the null hypothesis (At least one diet is significantly different).")
else:
    print("Result: Fail to reject the null hypothesis (No significant difference between diets).")

F-statistic: 18.44
P-value: 0.0000
Result: Reject the null hypothesis (At least one diet is significantly different).


In [4]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.DataFrame({
    'WeightLoss': np.concatenate([diet_A, diet_B, diet_C]),
    'Diet': ['A'] * 50 + ['B'] * 50 + ['C'] * 50
})

tukey = pairwise_tukeyhsd(df['WeightLoss'], df['Diet'], alpha=0.05)
print(tukey)

Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     A      B   1.3649    0.0  0.6951  2.0346   True
     A      C  -0.2207 0.7157 -0.8905   0.449  False
     B      C  -1.5856    0.0 -2.2554 -0.9158   True
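----------------------------------------------------

Interpretation: the ANOVA rejects the null hypothesis of equal mean weight loss (F = 18.44, p < 0.0001). The Tukey HSD comparisons show that Diet B leads to a significantly larger mean weight loss than both Diet A (difference of about 1.36 kg) and Diet C (about 1.59 kg), while Diets A and C do not differ significantly (p-adj = 0.7157).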


---

#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

Ans.

In [5]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

np.random.seed(42)
# Simulated data (assumption): 10 employees per Software x Experience cell, 60 in total

novice_A = np.random.normal(loc=60, scale=10, size=10)  # 10 novice users, mean time 60 minutes
novice_B = np.random.normal(loc=55, scale=8, size=10)   # 10 novice users, mean time 55 minutes
novice_C = np.random.normal(loc=50, scale=7, size=10)   # 10 novice users, mean time 50 minutes

experienced_A = np.random.normal(loc=40, scale=5, size=10)  # 10 experienced users, mean time 40 minutes
experienced_B = np.random.normal(loc=38, scale=5, size=10)  # 10 experienced users, mean time 38 minutes
experienced_C = np.random.normal(loc=35, scale=5, size=10)  # 10 experienced users, mean time 35 minutes

data = pd.DataFrame({
    'Time': np.concatenate([novice_A, novice_B, novice_C, experienced_A, experienced_B, experienced_C]),
    'Software': ['A']*10 + ['B']*10 + ['C']*10 + ['A']*10 + ['B']*10 + ['C']*10,
    'Experience': ['Novice']*30 + ['Experienced']*30
})

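# 'Software * Experience' expands to both main effects plus their interaction
model = ols('Time ~ Software * Experience', data=data).fit()

# anova_lm defaults to Type I (sequential) sums of squares; in this balanced design they match Type II
anova_result = anova_lm(model)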

print(anova_result)

                       df       sum_sq      mean_sq           F        PR(>F)
Software              2.0  1082.648271   541.324136   17.391425  1.478244e-06
Experience            1.0  4236.940052  4236.940052  136.122558  2.207911e-16
Software:Experience   2.0   638.840997   319.420498   10.262202  1.669443e-04
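Residual             54.0  1680.799761    31.125921         NaN           NaN

Interpretation: all three effects are significant at the 0.05 level. Experience has the largest effect (F = 136.12, p ≈ 2.2e-16), Software also matters (F = 17.39, p ≈ 1.5e-06), and the significant Software:Experience interaction (F = 10.26, p ≈ 1.7e-04) indicates that the differences between programs depend on the experience level. The main effects should therefore be read together with the interaction.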
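
A table of cell means helps to read an interaction effect. The sketch below regenerates the simulated data with the same seed, settings, and draw order as the cell above, then tabulates the mean completion time per Software and Experience combination; the names `cells`, `cell_means`, and `gap` are introduced for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Same simulation settings as above: (experience, software, mean, standard deviation)
cells = [
    ('Novice', 'A', 60, 10), ('Novice', 'B', 55, 8), ('Novice', 'C', 50, 7),
    ('Experienced', 'A', 40, 5), ('Experienced', 'B', 38, 5), ('Experienced', 'C', 35, 5),
]
frames = [
    pd.DataFrame({
        'Experience': experience,
        'Software': software,
        'Time': np.random.normal(loc=mean, scale=sd, size=10),
    })
    for experience, software, mean, sd in cells
]
data = pd.concat(frames, ignore_index=True)

# Mean completion time for every combination of the two factors
cell_means = data.groupby(['Experience', 'Software'])['Time'].mean().unstack('Software')
print(cell_means.round(2))

# Gap between novice and experienced users for each program;
# clearly unequal gaps are what an interaction effect looks like
gap = cell_means.loc['Novice'] - cell_means.loc['Experienced']
print(gap.round(2))
```

If the gap were roughly the same for all three programs, the lines in an interaction plot would be parallel and the interaction term would be small.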


---

#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

Ans.

In [6]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(42)

control_scores = np.random.normal(loc=75, scale=10, size=50)  # Mean = 75, SD = 10
experimental_scores = np.random.normal(loc=80, scale=10, size=50)  # Mean = 80, SD = 10

t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores)

print("Two-Sample t-test Results:")
print(f"t-statistic: {t_stat:.2f}")
print(f"p-value: {p_value:.4f}")

alpha = 0.05  # Significance level
if p_value < alpha:
    print("Result: Reject the null hypothesis (The groups differ significantly).")
else:
    print("Result: Fail to reject the null hypothesis (No significant difference between groups).")

df = pd.DataFrame({
    'Score': np.concatenate([control_scores, experimental_scores]),
    'Group': ['Control'] * 50 + ['Experimental'] * 50
})

tukey = pairwise_tukeyhsd(df['Score'], df['Group'], alpha=0.05)
print("\nPost-hoc Test (Tukey HSD) Results:")
print(tukey)


Two-Sample t-test Results:
t-statistic: -4.11
p-value: 0.0001
Result: Reject the null hypothesis (The groups differ significantly).

Post-hoc Test (Tukey HSD) Results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental   7.4325 0.0001 3.8427 11.0224   True
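----------------------------------------------------------

Interpretation: the experimental group scored on average about 7.43 points higher than the control group, and the difference is significant (t = -4.11, p = 0.0001). With only two groups there is a single pairwise comparison, so the Tukey HSD result repeats the conclusion of the t-test rather than adding new information.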
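
The reason a post-hoc test adds nothing with two groups can be checked numerically: a one-way ANOVA on two groups is equivalent to the pooled-variance t-test, with F equal to t squared and identical p-values. The sketch below regenerates the same simulated scores; the variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

np.random.seed(42)

# Same simulated scores as the cell above
control = np.random.normal(loc=75, scale=10, size=50)
experimental = np.random.normal(loc=80, scale=10, size=50)

# Pooled-variance two-sample t-test (the default of ttest_ind)
t_stat, p_t = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, p_f = stats.f_oneway(control, experimental)

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print(f"p (t-test) = {p_t:.6f}, p (ANOVA) = {p_f:.6f}")
```

Post-hoc tests become useful only with three or more groups, where a significant overall result leaves open which pairs differ.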


---

#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

Ans.

In [7]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(42)

store_A_sales = np.random.normal(loc=500, scale=50, size=30)  # Store A
store_B_sales = np.random.normal(loc=520, scale=60, size=30)  # Store B
store_C_sales = np.random.normal(loc=510, scale=55, size=30)  # Store C

df = pd.DataFrame({
    'Sales': np.concatenate([store_A_sales, store_B_sales, store_C_sales]),
    'Store': ['A']*30 + ['B']*30 + ['C']*30,
    'Day': np.tile(np.arange(1, 31), 3)  # Day 1 to Day 30 for each store
})

# 'Day' is passed as the subject identifier: each day is observed once under every store
anova = AnovaRM(df, 'Sales', 'Day', within=['Store'])
anova_result = anova.fit()

print("Repeated Measures ANOVA Results:")
print(anova_result)

tukey = pairwise_tukeyhsd(df['Sales'], df['Store'], alpha=0.05)
print("\nPost-hoc Test (Tukey HSD) Results:")
print(tukey)

Repeated Measures ANOVA Results:
               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  1.7276 2.0000 58.0000 0.1867


Post-hoc Test (Tukey HSD) Results:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B  22.1376 0.2314  -9.8984 54.1736  False
     A      C   20.116 0.2972   -11.92  52.152  False
     B      C  -2.0216 0.9876 -34.0576 30.0144  False
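-----------------------------------------------------

Interpretation: the repeated measures ANOVA finds no significant difference in mean daily sales between the three stores (F(2, 58) = 1.73, p = 0.1867), so the null hypothesis is not rejected and a post-hoc test is not strictly required. The Tukey HSD output agrees, since none of the pairwise comparisons is significant. Note that Tukey HSD treats the stores as independent groups and ignores the pairing of observations by day.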
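
A post-hoc procedure that respects the repeated measures structure compares the stores with paired t-tests and then adjusts the p-values for multiple comparisons. The sketch below is one possible approach, not the method used in the cell above: it regenerates the same simulated sales, runs `ttest_rel` on each pair of stores, and applies Holm's correction through `multipletests`.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(42)

# Same simulated sales as the cell above; position i in each array is the same day
sales = {
    'A': np.random.normal(loc=500, scale=50, size=30),
    'B': np.random.normal(loc=520, scale=60, size=30),
    'C': np.random.normal(loc=510, scale=55, size=30),
}

# Paired t-test for every pair of stores (observations are matched by day)
pairs = list(combinations(sales, 2))
raw_p = [stats.ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs]

# Holm's step-down correction controls the familywise error rate
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='holm')

for (a, b), p, p_adj, rej in zip(pairs, raw_p, adj_p, reject):
    print(f"{a} vs {b}: raw p = {p:.4f}, Holm-adjusted p = {p_adj:.4f}, reject = {rej}")
```

Because the simulated stores are generated independently of each other, pairing by day brings little benefit here; with real sales data, where all stores share good and bad days, the paired approach would usually be more powerful.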
