#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
    Ans. Assumptions of ANOVA:
    Independence: Observations must be independent of one another, both within each group and across groups. In other words, no data point should be influenced by or related to any other data point.
    Normality: The data within each group (equivalently, the model residuals) should be approximately normally distributed. This assumption matters most when sample sizes are small, because ANOVA is fairly robust to moderate departures from normality with larger samples.
    Homogeneity of variance (Homoscedasticity): The variability (variance) of the data should be roughly the same across all groups. This means that the spread of the data points should be similar for each group.
    Random Sampling: The data should be collected through a random sampling process, ensuring that the sample is representative of the population.
    
    Examples of Violations:
    Non-Independence: If observations are influenced by or related to one another, the independence assumption is violated. For instance, if the same individuals appear in more than one group (as in a repeated measures design), or if observations are clustered (e.g., students from the same classroom), a standard ANOVA can give invalid results.
    Non-Normality: If the data within each group deviate substantially from a normal distribution, the validity of the ANOVA results can suffer, especially with small samples. This can be checked using graphical methods (e.g., QQ-plots) or statistical tests (e.g., Shapiro-Wilk test).
    Non-Homogeneity of Variance: Unequal variances across the groups can distort the F-test's Type I error rate, particularly when group sizes are also unequal. Violation of this assumption can be assessed using statistical tests (e.g., Levene's test) or visual inspection of box plots or residual plots.
    Non-Random Sampling: If the data is not collected through a random sampling process, the ANOVA results may not be generalizable to the target population. This can introduce bias and limit the external validity of the findings.
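
    As a rough sketch of how the normality and equal-variance checks mentioned above might be run, the snippet below applies the Shapiro-Wilk test to each group and Levene's test across groups using `scipy.stats`. The three groups are made-up illustrative data, and the 0.05 cutoff is an assumed convention rather than a rule.

```python
from scipy import stats

# Hypothetical sample data for three groups (illustration only)
groups = {
    "A": [10, 12, 15, 14, 11],
    "B": [18, 20, 22, 19, 21],
    "C": [25, 28, 24, 27, 26],
}

alpha = 0.05  # assumed significance level

# Normality: Shapiro-Wilk test within each group
shapiro_p = {}
for name, values in groups.items():
    stat, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk, group {name}: W = {stat:.3f}, p = {p:.3f}")

# Homogeneity of variance: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")

# A p-value below alpha flags a possible violation of the assumption
normality_ok = all(p > alpha for p in shapiro_p.values())
equal_variance_ok = levene_p > alpha
print("Normality plausible:", normality_ok)
print("Equal variances plausible:", equal_variance_ok)
```

    With very small groups these tests have little power, so a non-significant result is weak evidence that the assumption holds; plots are a useful complement.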


#### Q2. What are the three types of ANOVA, and in what situations would each be used?
    Ans. One-Way ANOVA: This type of ANOVA is used when there is a single categorical independent variable (with two or more levels/groups, typically three or more, since with two groups it is equivalent to a two-sample t-test) and a continuous dependent variable. It determines whether there are any statistically significant differences between the means of the groups.

    Two-Way ANOVA: This type of ANOVA is used when there are two categorical independent variables (factors) and one continuous dependent variable. It examines the main effects of each factor and their interaction effect on the dependent variable.

    Three-Way ANOVA (and higher): This type of ANOVA extends the concept of two-way ANOVA to situations with three or more categorical independent variables. It is used when there are multiple factors, and researchers want to examine their individual and combined effects on the dependent variable.

#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
    Ans. The partitioning of variance in ANOVA involves breaking down the total variability in the data into components attributed to specific sources. For a one-way ANOVA, the total sum of squares (SST) is split into an explained part and a residual part, so that SST = SSE + SSR. (Many textbooks use these abbreviations the other way around, with SSE for the error sum of squares; the naming here follows the wording of the questions.)

    Total Sum of Squares (SST): SST represents the total variability in the dependent variable and is calculated as the sum of squared differences between each data point and the overall mean of all data points.

    Explained Sum of Squares (SSE): SSE, also known as the "between-group" variability, measures how far the group means lie from the overall mean, with each group's squared distance weighted by its sample size. It quantifies how much of the total variability can be attributed to the effect of the independent variable.

    Residual Sum of Squares (SSR): SSR, also called the "within-group" variability, measures the variation within each group. It is the sum of squared differences between individual data points and their respective group means.

    The partitioning is essential because it helps researchers understand the relative contributions of the independent variable (treatment effect) and random variability (error) to the overall variability in the dependent variable. By comparing SSE to SSR, each divided by its degrees of freedom (the F-ratio), ANOVA can determine whether the observed differences between group means are statistically significant or plausibly due to chance.
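
    In symbols, the partition can be written as follows. The notation is assumed here for illustration: $y_{ij}$ is the $j$-th observation in group $i$, $n_i$ is the size of group $i$, $\bar{y}_i$ is the mean of group $i$, $\bar{y}$ is the overall mean, $k$ is the number of groups, and $N$ is the total number of observations. SSE and SSR follow this document's naming (explained and residual).

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 \\
SSE &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 \\
SSR &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
SST &= SSE + SSR, \qquad
F = \frac{SSE / (k - 1)}{SSR / (N - k)}
\end{aligned}
```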

#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?
    Ans. To calculate the Total Sum of Squares (SST), Explained Sum of Squares (SSE), and Residual Sum of Squares (SSR) in a one-way ANOVA using Python, you can follow these steps:

    1. Compute the overall mean (grand mean) of the data.
    2. Calculate the sum of squared differences between each data point and the overall mean to get the Total Sum of Squares (SST).
    3. Calculate the squared difference between each group mean and the overall mean, multiply it by that group's size, and sum over the groups to get the Explained Sum of Squares (SSE).
    4. Calculate the sum of squared differences between each data point and its respective group mean to get the Residual Sum of Squares (SSR). Equivalently, SSR = SST - SSE.

In [24]:
import numpy as np

# Sample data for three groups (replace these with your own data)
group_a = [10, 12, 15, 14, 11]
group_b = [18, 20, 22, 19, 21]
group_c = [25, 28, 24, 27, 26]

# Combine the data into a single array
data = np.concatenate([group_a, group_b, group_c])

# Calculate the overall mean
overall_mean = np.mean(data)

# Calculate the Total Sum of Squares (SST)
sst = np.sum((data - overall_mean)**2)

# Calculate the group means
mean_group_a = np.mean(group_a)
mean_group_b = np.mean(group_b)
mean_group_c = np.mean(group_c)

# Calculate the Explained Sum of Squares (SSE): squared distance of each
# group mean from the overall mean, weighted by group size
sse = (len(group_a) * (mean_group_a - overall_mean)**2 +
       len(group_b) * (mean_group_b - overall_mean)**2 +
       len(group_c) * (mean_group_c - overall_mean)**2)

# Calculate the Residual Sum of Squares (SSR) from the identity SST = SSE + SSR
ssr = sst - sse

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)

Total Sum of Squares (SST): 501.7333333333333
Explained Sum of Squares (SSE): 464.5333333333333
Residual Sum of Squares (SSR): 37.19999999999999
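
    As a follow-up sketch, the sums of squares can be turned into an F-statistic and cross-checked against `scipy.stats.f_oneway`. Here SSR is computed directly from the within-group deviations rather than by subtraction; the variable names are illustrative and the data are the same sample groups used above.

```python
import numpy as np
from scipy import stats

# Same illustrative groups as in the cell above
group_a = [10, 12, 15, 14, 11]
group_b = [18, 20, 22, 19, 21]
group_c = [25, 28, 24, 27, 26]
groups = [np.asarray(g, dtype=float) for g in (group_a, group_b, group_c)]

data = np.concatenate(groups)
grand_mean = data.mean()

# Between-group (explained) and within-group (residual) sums of squares
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssr = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
k, n_total = len(groups), len(data)
df_between, df_within = k - 1, n_total - k

# The F-statistic is the ratio of the two mean squares
f_manual = (sse / df_between) / (ssr / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's built-in one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print("Manual F:", f_manual, "SciPy F:", f_scipy)
print("Manual p:", p_manual, "SciPy p:", p_scipy)
```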


#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [25]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data for a two-way ANOVA (replace these with your own data)
data = {
    'A': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'B': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Y': [10, 12, 15, 14, 11, 18, 20, 22, 19, 21]
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Convert the 'Y' column to numeric (optional, depending on your data)
df['Y'] = pd.to_numeric(df['Y'])

# Perform two-way ANOVA
formula = 'Y ~ C(A) + C(B) + C(A):C(B)'  # C() indicates categorical variables and ':' represents interaction
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model)

# Extract the mean square for each term from the ANOVA table.
# Note: these are mean squares, not effect sizes; whether an effect is
# significant is judged from the 'F' and 'PR(>F)' columns of anova_table.
mean_sq_A = anova_table.loc['C(A)', 'mean_sq']
mean_sq_B = anova_table.loc['C(B)', 'mean_sq']
mean_sq_interaction = anova_table.loc['C(A):C(B)', 'mean_sq']

print("Mean Square A:", mean_sq_A)
print("Mean Square B:", mean_sq_B)
print("Mean Square Interaction:", mean_sq_interaction)

Mean Square A: 144.4000000000001
Mean Square B: 2.016666666666664
Mean Square Interaction: 0.01666666666666798


#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?
    Ans. In a one-way ANOVA, the F-statistic and p-value are used to determine whether there are significant differences between the groups (treatment levels) with respect to the dependent variable.

    F-statistic: The F-statistic is the ratio of the between-group mean square to the within-group mean square, that is, the variability of the group means relative to the variability inside the groups. In this case, the F-statistic is 5.23.

    p-value: The p-value is the probability of obtaining an F-statistic at least as large as the observed one if all group population means were truly equal. In this case, the p-value is 0.02.

    Interpretation:
    Since the p-value (0.02) is less than the commonly chosen significance level of 0.05, we reject the null hypothesis. The null hypothesis in ANOVA states that all group population means are equal. Thus, we conclude that at least one group mean differs from the others.

    However, the ANOVA does not tell us exactly which groups are different from each other. To identify the specific differences between groups, further post-hoc tests (e.g., Tukey's test) can be conducted.
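
    The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not give the degrees of freedom, so the design below (3 groups, 30 observations in total) is purely an assumption for illustration; the resulting p-value therefore will not match 0.02 exactly.

```python
from scipy import stats

f_statistic = 5.23   # from the question
alpha = 0.05         # assumed significance level

# Hypothetical design: 3 groups, 30 observations in total (not given in the question)
k, n_total = 3, 30
df_between, df_within = k - 1, n_total - k

# p-value: upper-tail probability of the F distribution at the observed F
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the observed F must exceed this to reject H0 at level alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = {alpha}: {f_critical:.3f}")
print("Reject H0:", f_statistic > f_critical)
```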

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?
    Ans. In a repeated measures ANOVA, where the same participants are measured multiple times under different conditions, handling missing data can be crucial for obtaining accurate results.

    Methods for Handling Missing Data:

    Complete Case Analysis (Listwise Deletion): In this method, any participant with missing data on any variable is excluded from the analysis. In a repeated measures design this discards a participant's entire record even if only one measurement is missing. While it is straightforward, it reduces statistical power and can bias the results unless the data are missing completely at random.

    Mean Imputation: Missing values are replaced with the mean value of the observed data. This artificially shrinks the variability and distorts relationships in the data.

    Multiple Imputation: This method creates multiple plausible imputed datasets based on the observed data's distribution. The analysis is performed on each imputed dataset, and the results are combined. It is more robust than mean imputation but requires additional computational resources.

    Maximum Likelihood Estimation: Rather than filling in the missing values, this approach (for example, a linear mixed-effects model) estimates the model parameters from all available observations using the likelihood function. It is considered a principled approach and gives unbiased estimates when data are missing at random.

    Potential Consequences of Different Methods:
    Using inappropriate missing data handling methods can lead to biased estimates, reduced statistical power, and incorrect conclusions. Complete case analysis can decrease the sample size, leading to a loss of information. Mean imputation can affect the data distribution and correlations. Multiple imputation and maximum likelihood estimation are generally preferred when data are missing at random, as they provide more accurate estimates and appropriate standard errors.
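
    The effect of the two simplest methods can be seen in a small sketch. The wide-format table below (one row per participant, one column per time point) is hypothetical data made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    'time1': [10.0, 12.0, 11.0, 14.0, 13.0, 9.0],
    'time2': [12.0, np.nan, 13.0, 15.0, 14.0, 11.0],
    'time3': [15.0, 16.0, np.nan, 18.0, 17.0, 13.0],
})

# Listwise deletion: drop every participant with at least one missing value
complete_cases = scores.dropna()
print("Participants before:", len(scores),
      "after listwise deletion:", len(complete_cases))

# Mean imputation: fill each gap with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

# Mean imputation leaves the column means unchanged but shrinks the variances
print("Variance (observed values only):")
print(scores.var())
print("Variance (after mean imputation):")
print(mean_imputed.var())
```

    Two of the six participants are lost to listwise deletion here, while mean imputation keeps everyone but understates the spread of the affected columns.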


#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.
    Ans. Post-hoc tests are used to compare multiple groups in ANOVA when the overall ANOVA result is significant. Some common post-hoc tests include:

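    Tukey's Honestly Significant Difference (HSD): Tukey's test is used to compare all pairs of group means. It controls the familywise error rate and is the usual choice when every pairwise comparison is of interest, particularly with equal or similar group sizes.

    Bonferroni Correction: This method adjusts the significance level for multiple comparisons to avoid inflating the overall Type I error rate. It is more conservative than Tukey's test and works best for a small number of pre-planned comparisons, because it becomes overly strict as the number of comparisons grows.

    Scheffe's Test: Scheffe's test is the most conservative of these but is suitable for complex comparisons, since it covers all possible contrasts among the group means (for example, the average of two groups versus a third), not only pairwise differences.

    Dunnett's Test: Dunnett's test is used when comparing several treatment groups to a single control group, and it controls the familywise Type I error rate for that set of comparisons.

    Example: If a one-way ANOVA comparing the mean weight loss of three diets is significant, it only shows that at least one diet differs. A post-hoc test such as Tukey's HSD is then needed to find out which specific pairs of diets differ.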
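
    As a minimal sketch of the Bonferroni idea, the snippet below runs all pairwise two-sample t-tests with `scipy.stats.ttest_ind` and multiplies each raw p-value by the number of comparisons. The group data are hypothetical, and pooled-variance t-tests are an assumption made for simplicity.

```python
from itertools import combinations
from scipy import stats

# Hypothetical data for three groups (illustration only)
groups = {
    'A': [5, 6, 7, 6, 5],
    'B': [4, 3, 2, 3, 4],
    'C': [2, 3, 4, 3, 2],
}

alpha = 0.05  # assumed familywise significance level
pairs = list(combinations(groups, 2))
n_comparisons = len(pairs)

adjusted = {}
for g1, g2 in pairs:
    t_stat, p_raw = stats.ttest_ind(groups[g1], groups[g2])
    # Bonferroni: multiply each raw p-value by the number of comparisons (capped at 1)
    p_adj = min(1.0, p_raw * n_comparisons)
    adjusted[(g1, g2)] = p_adj
    print(f"{g1} vs {g2}: raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f}, "
          f"significant = {p_adj < alpha}")
```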


#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [26]:
import numpy as np
import pandas as pd
from scipy import stats

# Sample data for weight loss for each diet (replace these with your own data)
diet_A = [5, 6, 7, 6, 5]
diet_B = [4, 3, 2, 3, 4]
diet_C = [2, 3, 4, 3, 2]

# Combine the data into a single array
data = np.concatenate([diet_A, diet_B, diet_C])

# Create a group labels array
groups = ['A'] * len(diet_A) + ['B'] * len(diet_B) + ['C'] * len(diet_C)

# Create a DataFrame
df = pd.DataFrame({'Weight Loss': data, 'Diet': groups})

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

print("F-Statistic:", f_statistic)
print("p-value:", p_value)

F-Statistic: 18.95238095238094
p-value: 0.0001933016445130483


    Interpretation:
    In this example, the one-way ANOVA yields an F-statistic of about 18.95 and a p-value of about 0.00019. Since the p-value is well below 0.05 (the assumed significance level), we reject the null hypothesis of equal means and conclude that mean weight loss differs for at least one of the three diets (A, B, and C). Note that the sample data contain only 5 values per diet; a full analysis would use the measurements from all 50 participants. Further post-hoc tests can be performed to identify which diets are significantly different from each other.
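
    One possible follow-up is Tukey's HSD via `pairwise_tukeyhsd` from statsmodels, sketched below on the same sample diet data. The variable names `weight_loss` and `diet` are illustrative.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same illustrative diet data as in the cell above
diet_A = [5, 6, 7, 6, 5]
diet_B = [4, 3, 2, 3, 4]
diet_C = [2, 3, 4, 3, 2]

weight_loss = np.concatenate([diet_A, diet_B, diet_C])
diet = ['A'] * len(diet_A) + ['B'] * len(diet_B) + ['C'] * len(diet_C)

# Tukey's HSD compares every pair of diets while controlling the familywise error rate
tukey = pairwise_tukeyhsd(endog=weight_loss, groups=diet, alpha=0.05)
print(tukey.summary())
```

    For this sample data the pairs involving diet A are flagged as different, while diets B and C are not distinguishable from each other.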


#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [27]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data for task completion time (replace these with your own data)
data = {
    'Software': ['A', 'B', 'C'] * 10,
    'Experience': ['Novice'] * 15 + ['Experienced'] * 15,
    'Time': [25, 30, 27, 32, 28, 34, 30, 33, 31, 35,
             22, 20, 24, 19, 23, 18, 26, 21, 29, 27,
             40, 38, 42, 36, 41, 39, 37, 44, 43, 45]
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Perform two-way ANOVA
formula = 'Time ~ C(Software) + C(Experience) + C(Software):C(Experience)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

                                sum_sq    df         F    PR(>F)
C(Software)                   2.600000   2.0  0.022414  0.977856
C(Experience)               425.633333   1.0  7.338506  0.012252
C(Software):C(Experience)    28.066667   2.0  0.241954  0.786984
Residual                   1392.000000  24.0       NaN       NaN


    Interpretation:
    In this two-way ANOVA, we are analyzing the effects of two factors: 'Software' (with levels A, B, and C) and 'Experience' (with levels Novice and Experienced) on the task completion time.

    Main Effects:
    Software: F = 0.02, p = 0.978. The p-value is far above 0.05, so there is no evidence that mean task completion time differs between the three software programs (A, B, and C).
    Experience: F = 7.34, p = 0.012. The p-value is below 0.05, so experience level has a significant effect on task completion time. In this sample data the experienced group is actually slower on average (about 35.1 versus 27.5 for novices).

    Interaction Effect:
    The interaction term tests whether the effect of one factor (e.g., 'Software') on task completion time depends on the level of the other factor (e.g., 'Experience'). Here F = 0.24 and p = 0.787, so there is no evidence of an interaction: the effect of software does not appear to differ between novice and experienced users.

    Overall, only experience level shows a significant effect in this sample data. Because 'Experience' has just two levels, no post-hoc test is needed to locate the difference. Post-hoc tests (e.g., Tukey's test) would be useful only if a factor with three or more levels, such as 'Software', had been significant.
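
    The ANOVA table says whether an effect exists but not its direction. A simple way to see the direction, sketched here on the same sample data, is to look at the marginal means and the cell means with pandas.

```python
import pandas as pd

# Same illustrative data as in the cell above
df = pd.DataFrame({
    'Software': ['A', 'B', 'C'] * 10,
    'Experience': ['Novice'] * 15 + ['Experienced'] * 15,
    'Time': [25, 30, 27, 32, 28, 34, 30, 33, 31, 35,
             22, 20, 24, 19, 23, 18, 26, 21, 29, 27,
             40, 38, 42, 36, 41, 39, 37, 44, 43, 45],
})

# Marginal means for each factor show the direction of any main effect
experience_means = df.groupby('Experience')['Time'].mean()
software_means = df.groupby('Software')['Time'].mean()
print(experience_means)
print(software_means)

# Cell means: roughly parallel rows suggest little or no interaction
cell_means = df.pivot_table(values='Time', index='Experience',
                            columns='Software', aggfunc='mean')
print(cell_means)
```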

#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [28]:
import numpy as np
from scipy import stats

# Sample data for test scores (replace these with your own data)
control_group = [80, 85, 78, 90, 88, 82, 75, 84, 92, 81]  # Placeholder: 10 scores shown; extend to the full control group
experimental_group = [88, 90, 82, 95, 87, 94, 86, 89, 93, 91]  # Placeholder: 10 scores shown; extend to the full experimental group

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

print("T-Statistic:", t_statistic)
print("p-value:", p_value)

T-Statistic: -2.835436868147591
p-value: 0.010969952273298564


    Interpretation:
    The two-sample t-test yields a t-statistic of about -2.84 and a p-value of about 0.011. Since the p-value is less than the chosen significance level (e.g., 0.05), there is a significant difference in mean test scores between the two groups. The negative t-statistic reflects that the control group's sample mean (83.5) is lower than the experimental group's (89.5).

    Because only two groups are being compared, no post-hoc test is needed: the t-test itself already identifies the one pair that differs. Post-hoc procedures such as Tukey's HSD or Bonferroni-adjusted pairwise comparisons become relevant only when three or more groups are compared.
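
    A more informative follow-up for two groups is to report the direction and size of the difference. The sketch below, on the same sample scores, adds Welch's t-test (which does not assume equal variances) and Cohen's d with a pooled standard deviation. Both are offered as optional extras, not as part of the original analysis.

```python
import numpy as np
from scipy import stats

# Same illustrative scores as in the cell above
control_group = [80, 85, 78, 90, 88, 82, 75, 84, 92, 81]
experimental_group = [88, 90, 82, 95, 87, 94, 86, 89, 93, 91]

control = np.asarray(control_group, dtype=float)
experimental = np.asarray(experimental_group, dtype=float)

# Direction of the effect: difference in sample means
mean_diff = experimental.mean() - control.mean()

# Welch's t-test does not assume equal variances in the two groups
t_welch, p_welch = stats.ttest_ind(experimental, control, equal_var=False)

# Cohen's d using the pooled standard deviation
n1, n2 = len(experimental), len(control)
pooled_sd = np.sqrt(((n1 - 1) * experimental.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = mean_diff / pooled_sd

print("Mean difference (experimental - control):", mean_diff)
print("Welch t:", t_welch, "p:", p_welch)
print("Cohen's d:", cohens_d)
```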

#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [29]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data for daily sales (replace these with your own data)
data = {
    'Day': np.repeat(np.arange(1, 31), 3),
    'Store': ['Store A'] * 30 + ['Store B'] * 30 + ['Store C'] * 30,
    'Sales': np.random.randint(900, 1200, 90)  # Random placeholder sales (unseeded, so results vary between runs)
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Fit a two-factor OLS ANOVA (Store, Day, and their interaction);
# note that this is not a true repeated measures model
formula = 'Sales ~ C(Store) + C(Day) + C(Store):C(Day)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model)

print(anova_table)

                   df         sum_sq      mean_sq         F    PR(>F)
C(Store)          2.0    8418.688889  4209.344444  0.496990  0.610841
C(Day)           29.0  195196.381988  6730.909724  0.794707  0.747570
C(Store):C(Day)  58.0  470311.918012  8108.826173  0.957395  0.565521
Residual         60.0  508180.666667  8469.677778       NaN       NaN


    Interpretation:
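    The table reports F-statistics and p-values for the main effect of 'Store', the main effect of 'Day', and the 'Store' by 'Day' interaction. Note that the model fitted here is an ordinary two-factor OLS ANOVA: a true repeated measures ANOVA would treat 'Day' as the repeated (blocking) unit and test 'Store' against the Store-by-Day variation. Also, because `np.repeat` lays the days out as 1, 1, 1, 2, 2, 2, ... while the stores come in blocks of 30, each day ends up paired with a single store rather than with all three; `np.tile(np.arange(1, 31), 3)` would give one row per store per day.

    For the main effect of 'Store', F = 0.50 and p = 0.61, which is well above the chosen significance level (e.g., 0.05), so there is no evidence of a difference in average daily sales between the three stores. This is expected, since the sales figures here are random placeholders.

    If the 'Store' effect were significant (p < 0.05), post-hoc tests (e.g., Tukey's test or Bonferroni-adjusted paired comparisons) could then be used to determine which store(s) differ significantly from each other in terms of average daily sales.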
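
    For reference, a repeated measures ANOVA can be sketched with `AnovaRM` from statsmodels, treating each day as the repeated unit and the store as the within-subject factor. The sales values are again random placeholders (seeded here so the run is reproducible), so the result carries no real meaning.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)  # fixed seed so the placeholder data are reproducible

days = np.arange(1, 31)
stores = ['Store A', 'Store B', 'Store C']

# Long format: exactly one row per (day, store) combination
df = pd.DataFrame({
    'Day': np.tile(days, len(stores)),
    'Store': np.repeat(stores, len(days)),
    'Sales': rng.integers(900, 1200, size=len(days) * len(stores)).astype(float),
})

# Day is the repeated unit ("subject"); Store is the within-subject factor
result = AnovaRM(data=df, depvar='Sales', subject='Day', within=['Store']).fit()
print(result.anova_table)
```

    `AnovaRM` requires balanced data with one observation per subject and condition, which is why the days are tiled so that every day appears once for each store.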