Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact 
the validity of the results.
Ans:-Analysis of Variance (ANOVA) is a statistical technique used to compare means among more than two groups. There are several assumptions associated with ANOVA, and violating these assumptions can impact the validity of the results. Here are the key assumptions for ANOVA:

Independence of Observations:

Assumption: The observations within each group are independent of each other.
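Violation: If observations are not independent, the standard errors are typically underestimated, which inflates the Type I error rate and makes the results unreliable. For example, in a repeated measures design where the same subjects are measured in every condition, observations from the same subject are correlated, and a repeated measures ANOVA should be used instead.
Normality: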

Assumption: The residuals (the differences between the observed values and their group means) are normally distributed within each group.
Violation: If the residuals are not normally distributed, it may affect the accuracy of the p-values. ANOVA is robust to moderate deviations from normality, especially with larger sample sizes. However, severe departures may impact the validity.
Homogeneity of Variances (Homoscedasticity):

Assumption: The variances of the residuals are equal across all groups.
Violation: Heteroscedasticity (unequal variances) can lead to inaccurate F-statistics and p-values, particularly when group sizes are also unequal. Levene's test can be used to test for homogeneity of variances.
Additivity:

Assumption: In designs with more than one factor, the effects of the factors are additive, meaning that the effect of one factor on the response variable is the same at every level of the other factor.
Violation: If there is interaction between factors (non-additivity), it may complicate the interpretation of main effects. Interaction effects can be addressed through more complex designs or statistical techniques.
Random Sampling:

Assumption: The data are collected using random sampling from the population of interest.
Violation: Non-random sampling may lead to biased estimates and affect the generalizability of the results to the broader population.
Examples of violations and their impact:

Non-Normality: If the residuals are not normally distributed, the p-values may be unreliable, and the confidence intervals may not accurately represent the true uncertainty in the estimates.

Heteroscedasticity: The F-test pools the within-group variances into a single estimate, so when the variances differ (especially with unequal group sizes) the pooled estimate misrepresents some groups and the Type I error rate can be too high or too low.
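
The normality and equal-variance assumptions above can be checked before running ANOVA. The following is a minimal sketch, assuming simulated data for three hypothetical groups; the group means, spread, and sample sizes are made up for illustration. It uses scipy's Shapiro-Wilk test on the residuals and Levene's test on the groups.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions (illustration only)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (5.0, 6.0, 7.0)]

# In a one-way layout, residuals are deviations from each group's own mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: the null hypothesis is that the residuals are normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: the null hypothesis is that all groups have equal variances
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk: W = {shapiro_stat:.4f}, p = {shapiro_p:.4f}")
print(f"Levene: W = {levene_stat:.4f}, p = {levene_p:.4f}")
```

A small p-value in either test is evidence against the corresponding assumption. With large samples these tests can flag deviations too small to matter, so it is worth looking at a plot of the residuals as well.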

Q2. What are the three types of ANOVA, and in what situations would each be used?
Ans:-Analysis of Variance (ANOVA) can be categorized into three main types based on the experimental design and the number of independent variables. These types are one-way ANOVA, two-way ANOVA, and repeated measures ANOVA.

One-Way ANOVA:

Design: Compares means across three or more independent (unrelated) groups.
Use Cases:
Comparing the means of multiple groups to determine if there are significant differences.
Example: Testing whether there are differences in test scores among students who receive different teaching methods (Group 1, Group 2, Group 3).
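Two-Way ANOVA:

Design: Examines the influence of two independent variables (factors), and their interaction, on a dependent variable. In the between-subjects form, each subject is in only one combination of the levels of the two independent variables.
Use Cases:
Testing the main effect of each factor and whether the effect of one factor depends on the level of the other.
Repeated Measures ANOVA:

Design: A within-subjects design in which each subject is measured under every condition (or at every time point), so observations from the same subject are related. With two within-subject factors, each subject is in all combinations of the levels of the two independent variables.
Use Cases:
Comparing means when the same subjects are measured repeatedly, such as before, during, and after a treatment.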

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Ans:-The partitioning of variance in Analysis of Variance (ANOVA) refers to the process of decomposing the total variability in the observed data into different components associated with the factors or variables under consideration. In a one-way design, the total sum of squares (SST) splits into a between-group sum of squares (SSB) and a within-group sum of squares (SSW), so that SST = SSB + SSW. Understanding this concept is crucial for interpreting the sources of variability in a study and gaining insights into the contributions of different factors to the overall variation in the data.
Why is it important to understand the partitioning of variance in ANOVA?

Identifying Sources of Variation:

Understanding the partitioning of variance helps researchers identify the sources of variability in the data. It allows them to assess whether observed differences are due to actual treatment effects or simply reflect random variability.
Assessing Group Differences:

By comparing the between-group variability (SSB) to the within-group variability (SSW), each scaled by its degrees of freedom, researchers can determine whether there are significant differences between group means. If the between-group variability is large relative to the within-group variability, it suggests that the group means are not all equal.
Interpreting the F-Statistic:

The F-statistic in ANOVA is calculated as the ratio of the between-group mean square to the within-group mean square: F = MSB / MSW, where MSB = SSB / (k - 1) and MSW = SSW / (N - k) for k groups and N total observations. Understanding the partitioning of variance helps interpret the F-statistic and evaluate the significance of group differences.
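
As a worked illustration of the partition, the sketch below computes the F-statistic by hand from the sums of squares, dividing each by its degrees of freedom, and compares it with scipy's `f_oneway`. The three groups are small made-up samples, and the variable names are chosen only for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical data for three groups (illustration only)
groups = [
    np.array([23, 25, 28, 30, 32]),
    np.array([18, 20, 22, 25, 28]),
    np.array([27, 30, 33, 35, 38]),
]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: each sum of squares divided by its degrees of freedom
msb = ssb / (k - 1)
msw = ssw / (n_total - k)

# F-statistic and its upper-tail p-value from the F distribution
f_manual = msb / msw
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against scipy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}")
print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

The manual and library results should agree, which confirms that the F-statistic is a ratio of mean squares and not of raw sums of squares.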

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual 
sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
import numpy as np
import scipy.stats as stats

# Example data for three groups (replace with your actual data)
group1 = np.array([23, 25, 28, 30, 32])
group2 = np.array([18, 20, 22, 25, 28])
group3 = np.array([27, 30, 33, 35, 38])

# Combine data into a single array
all_data = np.concatenate([group1, group2, group3])

# Calculate overall mean
overall_mean = np.mean(all_data)

# Calculate Total Sum of Squares (SST)
sst = np.sum((all_data - overall_mean) ** 2)

# Calculate Group Means
mean_group1 = np.mean(group1)
mean_group2 = np.mean(group2)
mean_group3 = np.mean(group3)

# Calculate Explained Sum of Squares (SSE): between-group variation
sse = (len(group1) * (mean_group1 - overall_mean) ** 2
       + len(group2) * (mean_group2 - overall_mean) ** 2
       + len(group3) * (mean_group3 - overall_mean) ** 2)

# Calculate Residual Sum of Squares (SSR): within-group variation
ssr_group1 = np.sum((group1 - mean_group1) ** 2)
ssr_group2 = np.sum((group2 - mean_group2) ** 2)
ssr_group3 = np.sum((group3 - mean_group3) ** 2)

ssr = ssr_group1 + ssr_group2 + ssr_group3

# Verify the relationship: SST = SSE + SSR
assert np.isclose(sst, sse + ssr), "SST is not equal to SSE + SSR"

# Print results
print(f'Total Sum of Squares (SST): {sst:.4f}')
print(f'Explained Sum of Squares (SSE): {sse:.4f}')
print(f'Residual Sum of Squares (SSR): {ssr:.4f}')


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data for a two-way ANOVA (replace with your actual data)
np.random.seed(42)  # for reproducibility
data = pd.DataFrame({
    'Variable_A': np.random.choice(['A1', 'A2', 'A3'], 100),
    'Variable_B': np.random.choice(['B1', 'B2'], 100),
    'Dependent_Variable': np.random.randn(100)
})

# Fit the two-way ANOVA model
formula = 'Dependent_Variable ~ Variable_A * Variable_B'
model = ols(formula, data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract the sum of squares attributed to each main effect and to the interaction
ss_A = anova_table.loc['Variable_A', 'sum_sq']
ss_B = anova_table.loc['Variable_B', 'sum_sq']
ss_interaction = anova_table.loc['Variable_A:Variable_B', 'sum_sq']

# Print results (the full table also holds the F-statistics and p-values)
print(anova_table)
print(f'Sum of squares, main effect of Variable_A: {ss_A:.4f}')
print(f'Sum of squares, main effect of Variable_B: {ss_B:.4f}')
print(f'Sum of squares, interaction: {ss_interaction:.4f}')


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. 
What can you conclude about the differences between the groups, and how would you interpret these 
results?
Ans:-In a one-way ANOVA, the F-statistic is used to test whether there are statistically significant differences among the means of three or more groups. The p-value associated with the F-statistic indicates the probability of obtaining such an extreme F-statistic by random chance if the null hypothesis (no group differences) is true.

Here's how you can interpret the result:

F-Statistic:

The F-statistic is a ratio of the variability between group means to the variability within groups. A larger F-statistic suggests that the variability between groups is larger relative to the variability within groups.
P-value:

The p-value is the probability of observing an F-statistic as extreme as the one obtained, assuming the null hypothesis is true. A small p-value (typically below the chosen significance level, e.g., 0.05) indicates evidence against the null hypothesis.
Hypotheses:

Null Hypothesis (H0): All group means are equal.
Alternative Hypothesis (H1): At least one group mean differs from the others.
In your case:

F-Statistic: 5.23
P-value: 0.02
Conclusion:

The p-value (0.02) is less than the chosen significance level (e.g., 0.05). Therefore, you would reject the null hypothesis.
Interpretation:

There is sufficient evidence to conclude that there are significant differences among the group means. In other words, at least one group mean is different from the others.
It's important to note that a significant result in ANOVA doesn't identify which specific groups are different from each other. Post hoc tests or pairwise comparisons can be conducted to explore and identify where the differences lie.
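
To make the link between the F-statistic and the p-value concrete, the sketch below computes the p-value and the critical value from the F distribution. The question does not state the degrees of freedom, so the design used here (3 groups, 15 observations in total) is purely an assumption for illustration.

```python
from scipy import stats

f_observed = 5.23
alpha = 0.05

# Hypothetical design (not given in the question): 3 groups, 15 observations
df_between = 3 - 1
df_within = 15 - 3

# p-value: upper-tail area of the F distribution beyond the observed statistic
p_value = stats.f.sf(f_observed, df_between, df_within)

# Critical value: the F-statistic that would be exactly significant at alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = {alpha}: {f_critical:.4f}")
if f_observed > f_critical:
    print("Reject the null hypothesis: at least one group mean differs.")
else:
    print("Fail to reject the null hypothesis.")
```

Comparing the observed F to the critical value and comparing the p-value to alpha are two views of the same decision rule. Different degrees of freedom would give a different p-value for the same F-statistic.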

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential 
consequences of using different methods to handle missing data?
Ans:-Handling missing data in a repeated measures ANOVA is an important aspect of data analysis. When dealing with repeated measures or longitudinal data, missing values can arise for various reasons, such as participant dropout, technical issues, or other sources of non-response. The way missing data is handled can impact the validity of the results and the conclusions drawn from the analysis. Here are common approaches to handle missing data in repeated measures ANOVA and their potential consequences:

Common Approaches for Handling Missing Data:
Complete Case Analysis (CCA) or Listwise Deletion:

Approach: Exclude cases with missing data on any variable involved in the analysis.
Potential Consequences:
Reduces sample size, potentially leading to loss of statistical power.
Results in biased estimates if missingness is related to the outcome or predictors.
Mean Imputation:

Approach: Replace missing values with the mean of the observed values for the variable.
Potential Consequences:
Preserves sample size but shrinks the variance of the imputed variable, so standard errors are underestimated, and estimates are biased if missingness is not completely at random.
Fails to account for the uncertainty associated with imputed values.
Last Observation Carried Forward (LOCF) or Next Observation Carried Backward (NOCB):

Approach: Impute missing values with the last observed value (LOCF) or the next observed value (NOCB).
Potential Consequences:
May not accurately represent the true underlying values, especially if the pattern of missingness is not random.
Can introduce biases, especially in the presence of trends or systematic changes over time.
Multiple Imputation:

Approach: Impute missing values multiple times to create several complete datasets, perform analyses on each dataset, and combine results.
Potential Consequences:
Provides more accurate estimates by accounting for the uncertainty associated with missing data.
Requires more sophisticated statistical techniques and assumptions about the missing data mechanism.
Considerations for Choosing an Approach:
Missing Data Mechanism:

Understanding the mechanism causing missing data is crucial. If missingness is completely at random (MCAR), complete case analysis gives unbiased estimates at the cost of statistical power, although mean imputation still understates variability. If missingness is related to observed variables or to the outcome, more advanced methods like multiple imputation are preferred.
Sample Size:

The impact on sample size is an important consideration. Methods like CCA may result in a smaller sample size, potentially reducing statistical power.
Assumptions:

Each method makes different assumptions about the nature of missing data. Assess the validity of these assumptions based on the characteristics of your data.
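
The simpler approaches above can be sketched with pandas on a small wide-format table (one row per subject, one column per time point). The data and column names below are hypothetical, and the sketch shows only the mechanics of each method, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 9.0],
        "t2": [11.0, np.nan, 12.0, 10.0],
        "t3": [13.0, 14.0, np.nan, 12.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Listwise deletion: keep only subjects observed at every time point
complete_cases = scores.dropna()

# Mean imputation: fill each time point with that column's observed mean
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's previous observation forward across time points
locf = scores.ffill(axis=1)

print(f"Subjects kept by listwise deletion: {len(complete_cases)} of {len(scores)}")
print(f"Variance of t2, observed values only: {scores['t2'].var():.3f}")
print(f"Variance of t2 after mean imputation: {mean_imputed['t2'].var():.3f}")
print(locf)
```

Listwise deletion halves this small sample, and mean imputation visibly shrinks the variance of the imputed column, which is why it tends to understate standard errors.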

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide 
an example of a situation where a post-hoc test might be necessary.
Ans:-After performing an Analysis of Variance (ANOVA) and finding that there are significant differences among group means, post-hoc tests are often conducted to determine which specific groups differ from each other. These tests help to identify pairwise differences and provide more detailed information than the overall ANOVA. Some common post-hoc tests include:

Tukey's Honestly Significant Difference (HSD):

Use Case:
Use when you have conducted a one-way ANOVA and found a significant difference among three or more groups.
Example:
Suppose you conducted a one-way ANOVA to compare the mean scores of students who received different teaching methods (A, B, C). The ANOVA indicates significant differences, and Tukey's HSD can be used to identify which specific pairs of teaching methods differ significantly.
Bonferroni Correction:

Use Case:
Appropriate when conducting multiple pairwise comparisons to control the familywise error rate.
Example:
In a study with three treatment groups, you want to compare all possible pairs (three comparisons). Instead of using a standard significance level (e.g., 0.05) for each test, you divide it by the number of comparisons (0.05 / 3 ≈ 0.0167) so that the overall Type I error rate stays at or below 0.05.
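Scheffé's Test:

Use Case:
Suitable when you want to test complex contrasts (for example, one group against the average of two others) in addition to pairwise comparisons, or when sample sizes are unequal. It is more conservative than Tukey's HSD and still assumes homogeneity of variances; when variances are unequal, a test such as Games-Howell is more appropriate.
Example:
You are comparing the effectiveness of different therapies on a psychological measure. Due to the nature of the therapies, the sample sizes in each group are not equal. Scheffé's test can be applied to assess pairwise differences and any other contrasts of interest.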
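
The Bonferroni approach can be sketched as a set of pairwise t-tests, each judged against the adjusted significance level. The data below are simulated for three hypothetical teaching methods; the group means, spread, and sample sizes are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical test scores for three teaching methods (illustration only)
rng = np.random.default_rng(1)
groups = {
    "A": rng.normal(70, 10, 30),
    "B": rng.normal(75, 10, 30),
    "C": rng.normal(85, 10, 30),
}

# All pairwise comparisons and the Bonferroni-adjusted significance level
pairs = list(combinations(groups, 2))
alpha = 0.05
adjusted_alpha = alpha / len(pairs)

results = {}
for name_1, name_2 in pairs:
    t_stat, p_raw = stats.ttest_ind(groups[name_1], groups[name_2])
    results[(name_1, name_2)] = (t_stat, p_raw, bool(p_raw < adjusted_alpha))
    print(f"{name_1} vs {name_2}: t = {t_stat:.3f}, p = {p_raw:.4f}, "
          f"significant at {adjusted_alpha:.4f}: {p_raw < adjusted_alpha}")
```

Bonferroni is simple and works with any test, but it becomes conservative as the number of comparisons grows. Tukey's HSD is usually the more powerful choice when all pairwise comparisons are of interest.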

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python 
to determine if there are any significant differences between the mean weight loss of the three diets. 
Report the F-statistic and p-value, and interpret the results.

In [None]:
import scipy.stats as stats
import numpy as np

# Example data (replace with your actual data)
# 50 participants split as evenly as possible across the three diets
np.random.seed(42)
weight_loss_A = np.random.normal(5, 2, 17)  # mean=5, std=2
weight_loss_B = np.random.normal(7, 2, 17)  # mean=7, std=2
weight_loss_C = np.random.normal(6, 2, 16)  # mean=6, std=2

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(weight_loss_A, weight_loss_B, weight_loss_C)

# Print results
print(f'F-Statistic: {f_statistic:.4f}')
print(f'P-value: {p_value:.4f}')

# Interpretation
alpha = 0.05  # Significance level

if p_value < alpha:
    print("Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to conclude significant differences between the mean weight loss of the three diets.")


Q10. A company wants to know if there are any significant differences in the average time it takes to 
complete a task using three different software programs: Program A, Program B, and Program C. They 
randomly assign 30 employees to one of the programs and record the time it takes each employee to 
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or 
interaction effects between the software programs and employee experience level (novice vs. 
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data (replace with your actual data)
np.random.seed(42)

# Simulated data for illustration: 30 employees, 10 per program,
# with 5 novice and 5 experienced employees in each program
data = pd.DataFrame({
    'Time': np.random.normal(loc=20, scale=5, size=30),
    'Program': np.repeat(['A', 'B', 'C'], 10),
    'Experience': np.tile(['Novice', 'Experienced'], 15)
})

# Convert categorical variables to categorical type
data['Program'] = pd.Categorical(data['Program'])
data['Experience'] = pd.Categorical(data['Experience'])

# Fit the two-way ANOVA model
formula = 'Time ~ Program * Experience'
model = ols(formula, data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print results
print(anova_table)
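
To interpret the table, each row's p-value is compared with the significance level. The sketch below uses a hypothetical helper, `interpret_anova_table`, which reads the `F` and `PR(>F)` columns that `anova_lm` produces. The example table is hand-built with made-up numbers so the snippet runs on its own; it is not the output of the model above.

```python
import pandas as pd


def interpret_anova_table(table, alpha=0.05):
    """Return {term: (F, p, is_significant)} for every non-residual row."""
    summary = {}
    effects = table.drop(index="Residual", errors="ignore")
    for term, row in effects.iterrows():
        summary[term] = (row["F"], row["PR(>F)"], bool(row["PR(>F)"] < alpha))
    return summary


# Hypothetical table with made-up numbers, shaped like anova_lm output
example_table = pd.DataFrame(
    {
        "sum_sq": [120.0, 15.0, 30.0, 2100.0],
        "df": [2.0, 1.0, 2.0, 84.0],
        "F": [2.4, 0.6, 0.6, float("nan")],
        "PR(>F)": [0.097, 0.441, 0.551, float("nan")],
    },
    index=["Program", "Experience", "Program:Experience", "Residual"],
)

for term, (f_value, p_value, significant) in interpret_anova_table(example_table).items():
    verdict = "significant" if significant else "not significant"
    print(f"{term}: F = {f_value:.2f}, p = {p_value:.3f} ({verdict})")
```

A significant `Program` or `Experience` row indicates a main effect of that factor, while a significant `Program:Experience` row indicates that the effect of the program depends on experience level. When the interaction is significant, the main effects should be interpreted with caution.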


Q11. An educational researcher is interested in whether a new teaching method improves student test 
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the 
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a 
two-sample t-test using Python to determine if there are any significant differences in test scores 
between the two groups. If the results are significant, follow up with a post-hoc test to determine which 
group(s) differ significantly from each other.

In [None]:
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data (replace with your actual data)
np.random.seed(42)
control_group = np.random.normal(70, 10, 50)
experimental_group = np.random.normal(75, 10, 50)

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Print t-test results
print('Two-Sample T-Test Results:')
print(f'T-Statistic: {t_statistic:.4f}')
print(f'P-value: {p_value:.4f}')

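# Follow up with a post-hoc test (Tukey's HSD) only if the t-test is significant.
# With just two groups there is a single pairwise comparison, so this gives
# the same conclusion as the t-test; it matters with three or more groups.
alpha = 0.05
if p_value < alpha:
    data = np.concatenate([control_group, experimental_group])
    group_labels = ['Control'] * 50 + ['Experimental'] * 50

    tukey_results = pairwise_tukeyhsd(data, group_labels, alpha=alpha)

    # Print post-hoc test results
    print('\nPost-Hoc Test (Tukey\'s HSD):')
    print(tukey_results)
else:
    print('\nNo significant difference; no post-hoc test is needed.')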
