## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Analysis of Variance (ANOVA) is a statistical technique used to compare means of two or more groups to determine whether there are statistically significant differences among them. ANOVA relies on several assumptions, and violations of these assumptions can impact the validity of the results. The key assumptions for ANOVA are:

1. **Independence of Observations**:
   - Assumption: Observations within each group are independent of each other.
   - Violation Example: In a repeated measures design, where the same individuals are measured over time, observations within the same subject may not be independent. This can violate the assumption.

2. **Normality**:
   - Assumption: The data within each group (equivalently, the model residuals) are approximately normally distributed.
   - Violation Example: If the data within a group deviate markedly from a normal distribution (e.g., they are highly skewed or contain extreme outliers), this assumption may be violated. Violations can be detected through normality tests (e.g., Shapiro-Wilk) or visual inspection of histograms and Q-Q plots.

3. **Homogeneity of Variances (Homoscedasticity)**:
   - Assumption: The variances of the data in all groups are approximately equal (homogeneity of variances).
   - Violation Example: If one group's variance is several times larger than another's, this assumption is violated. This is often checked using statistical tests (e.g., Levene's test) or by visually inspecting side-by-side boxplots or residual plots.

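4. **Absence of Extreme Outliers**:
   - Assumption: No group contains extreme outliers, because group means and variances are highly sensitive to them.
   - Violation Example: A single extreme score in one group (e.g., a data-entry error) can shift that group's mean and inflate its variance, which distorts the F-statistic. Outliers should be investigated and handled with a documented rule rather than removed automatically.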
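
As a rough illustration of how the normality and equal-variance assumptions above might be checked, the sketch below runs a Shapiro-Wilk test on each group and Levene's test across groups using scipy. The three groups are made-up numbers, and the usual 0.05 cut-off is a convention rather than a rule.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three independent groups (illustrative values only)
groups = {
    "A": np.array([45, 52, 48, 50, 47, 51]),
    "B": np.array([55, 50, 60, 57, 53, 58]),
    "C": np.array([58, 62, 53, 57, 60, 56]),
}

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal)
shapiro_p = {name: stats.shapiro(x).pvalue for name, x in groups.items()}

# Homogeneity of variances: Levene's test across groups (null: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption is doubtful. With small samples these tests have little power, so plots are worth checking as well.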



## Q2. What are the three types of ANOVA, and in what situations would each be used?

Analysis of Variance (ANOVA) is a statistical technique used to compare means of two or more groups to determine if there are statistically significant differences among them. There are three main types of ANOVA, each used in different situations:

1. **One-Way ANOVA**:
   - **Situation**: One-Way ANOVA is used when you have one categorical independent variable (factor) and a continuous dependent variable. It is typically applied with three or more levels (groups), since a t-test already covers the two-group case. It tests whether there are any statistically significant differences among the means of these groups.
   - **Example**: Suppose you want to compare the average test scores of students in three different schools (School A, School B, and School C) to determine if there is a significant difference in performance across the schools.

2. **Two-Way/Factorial ANOVA**:
   - **Situation**: Two-Way ANOVA is used when you have two categorical independent variables (factors), and you want to examine the effects of both factors on a continuous dependent variable. It can reveal whether there are main effects for each factor, as well as an interaction effect (i.e., whether the combination of the two factors has an effect that is different from what would be expected from the individual effects).
   - **Example**: Consider a study where you want to investigate the effects of both the type of diet (Factor 1: Diet A, Diet B) and the amount of exercise (Factor 2: Low, High) on weight loss. Two-Way ANOVA would help you determine the impact of each factor and whether there is an interaction between diet and exercise on weight loss.

3. **Repeated Measures ANOVA**:
   - **Situation**: Repeated Measures ANOVA is used when you have one group of participants that you measure repeatedly under different conditions or at different time points. It's designed to assess changes within the same subjects over time or under different conditions.
   - **Example**: Suppose you are studying the effect of a drug on blood pressure and you measure the blood pressure of the same group of individuals before taking the drug, one hour after taking the drug, and two hours after taking the drug. Repeated Measures ANOVA would help you determine whether there is a statistically significant change in blood pressure over time due to the drug.



## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept that helps us understand how the total variability in the data is divided into different sources of variation. It is important to understand this concept because it provides insights into the relationships between the independent and dependent variables and allows us to assess the significance of these relationships. The partitioning of variance in ANOVA consists of several components:

1. **Total Variance (Total Sum of Squares, SST)**:
   - This represents the total variability in the data. It is calculated as the sum of the squared differences between each data point and the overall mean.
   - SST = Variability between groups (SSB, explained by the factor) + Variability within groups (SSW, unexplained by the factor)

2. **Between-Group Variance (Between-Groups Sum of Squares, SSB)**:
   - This component represents the variability between the group means. It measures the extent to which the group means differ from each other.
   - SSB = Σ(ni * (group_mean_i - grand_mean)^2), where ni is the number of observations in group i.

3. **Within-Group Variance (Within-Groups Sum of Squares, SSW)**:
   - This component represents the variability within each group or category. It accounts for the differences among individual data points within the same group.
   - SSW = Σ_i Σ_j (x_ij - group_mean_i)^2, where x_ij is the j-th data point in group i.

4. **Degrees of Freedom (df)**:
   - Degrees of freedom are associated with each component of variance. They represent the number of values in the calculation of that component that are free to vary.
   - Degrees of freedom for between-groups (dfB) = Number of groups - 1
   - Degrees of freedom for within-groups (dfW) = Total number of observations - Number of groups

5. **Mean Squares (MS)**:
   - Mean squares are obtained by dividing the sum of squares by their respective degrees of freedom.
   - MSB (Mean Squares Between) = SSB / dfB
   - MSW (Mean Squares Within) = SSW / dfW

6. **F-Ratio (F-statistic)**:
   - The F-ratio is the ratio of the mean squares between and mean squares within. It is used to test the hypothesis of whether there are significant differences among the group means.
   - F = MSB / MSW

Understanding the partitioning of variance is crucial for several reasons:

- It provides a structured way to quantify the sources of variability in the data.
- It helps assess the significance of the factors being studied by comparing the between-group variance to the within-group variance.
- It informs us about the relative importance of different factors in explaining the variation in the dependent variable.
- It aids in hypothesis testing and determining whether the observed differences among groups are statistically significant.

Overall, understanding how variance is partitioned in ANOVA is essential for interpreting the results accurately, drawing meaningful conclusions, and making informed decisions based on statistical analyses.
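
For a one-way design, the partition described above can be written compactly. The notation is introduced here for illustration only: $x_{ij}$ is observation $j$ in group $i$, $\bar{x}_i$ is the mean of group $i$, $\bar{x}$ is the grand mean, $n_i$ is the size of group $i$, $k$ is the number of groups, and $N$ is the total number of observations.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{SST}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{SSB}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{SSW},
\qquad
F=\frac{SSB/(k-1)}{SSW/(N-k)}
```

The identity holds because the cross term between the two pieces sums to zero within every group.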

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

1. Calculate the total sum of squares (SST):

SST represents the total variability in the data and is calculated as the sum of the squared differences between each data point and the overall mean.

In [1]:
import numpy as np

# Sample data
data = np.array([45, 52, 48, 55, 50, 60, 58, 62, 53, 57])

# Calculate the overall mean
overall_mean = np.mean(data)

# Calculate SST
sst = np.sum((data - overall_mean) ** 2)
sst


264.0

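2. Calculate the explained sum of squares (SSE):

SSE here represents the variability in the data that is explained by the group means (the between-group sum of squares, SSB, from Q3). It is calculated as the sum of the squared differences between each group mean and the overall mean, weighted by the number of observations in each group. Note that many textbooks use "SSE" for the error (residual) sum of squares instead; the labels in this answer follow the wording of the question.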

In [2]:
# Group data into categories (e.g., groups A, B, C)
group_A = np.array([45, 52, 48])
group_B = np.array([55, 50, 60])
group_C = np.array([58, 62, 53, 57])

# Calculate group means
mean_A = np.mean(group_A)
mean_B = np.mean(group_B)
mean_C = np.mean(group_C)

# Calculate SSE
sse = len(group_A) * (mean_A - overall_mean) ** 2 + len(group_B) * (mean_B - overall_mean) ** 2 + len(group_C) * (mean_C - overall_mean) ** 2
sse


148.33333333333326

3. Calculate the residual sum of squares (SSR):

SSR here represents the unexplained variability in the data (the within-group sum of squares, SSW, from Q3) and is calculated as the sum of the squared differences between each individual data point and its respective group mean.

In [3]:
# Calculate SSR for each group
ssr_A = np.sum((group_A - mean_A) ** 2)
ssr_B = np.sum((group_B - mean_B) ** 2)
ssr_C = np.sum((group_C - mean_C) ** 2)

# Calculate the total SSR
ssr = ssr_A + ssr_B + ssr_C
ssr

115.66666666666667
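
As a sanity check on the three quantities above, the following sketch recomputes them in one place, confirms that SST equals SSE plus SSR, and builds the F-statistic from the partition. It reuses the same three groups; the comparison against `scipy.stats.f_oneway` is an added cross-check, not part of the original calculation.

```python
import numpy as np
from scipy import stats

# Same three groups as in the cells above
group_A = np.array([45, 52, 48])
group_B = np.array([55, 50, 60])
group_C = np.array([58, 62, 53, 57])
groups = [group_A, group_B, group_C]

all_data = np.concatenate(groups)
overall_mean = all_data.mean()

# Total, explained (between-group), and residual (within-group) sums of squares
sst = np.sum((all_data - overall_mean) ** 2)
sse = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)

# F-statistic built from the partition, and its upper-tail p-value
f_manual = (sse / df_between) / (ssr / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against scipy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print("SST == SSE + SSR:", np.isclose(sst, sse + ssr))
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```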

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In a two-way ANOVA, the main effects and the interaction effect are read from the ANOVA table produced by fitting a linear model. The statsmodels library provides this through `ols` and `anova_lm`; scipy's `f_oneway` covers only the one-way case.

1. Perform Two-Way ANOVA:

First, perform the two-way ANOVA using Python. You'll need the statsmodels library for this example.

In [4]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm


In [5]:
# Create a DataFrame with your data
data = pd.DataFrame({'Factor1': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                     'Factor2': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y'],
                     'Response': [10, 12, 8, 9, 11, 13, 7, 8]})
# Fit the two-way ANOVA model
formula = 'Response ~ Factor1 * Factor2'
model = ols(formula, data=data).fit()
# Perform the ANOVA analysis
anova_table = anova_lm(model, typ=2)


#### Interpret the results:
The ANOVA table contains one row per effect (plus a Residual row), with the columns sum_sq, df, F, and PR(>F):

##### Main Effects:

- Factor1
- Factor2

##### Interaction Effect:

- Factor1:Factor2

You can read the F-statistic for each effect from the F column of the anova_table (the matching p-values are in the PR(>F) column):

In [6]:
# F = (sum_sq / df) of the effect divided by the residual mean square
main_effect_Factor1 = anova_table.loc['Factor1', 'F']
main_effect_Factor2 = anova_table.loc['Factor2', 'F']
interaction_effect = anova_table.loc['Factor1:Factor2', 'F']


In [7]:
round(main_effect_Factor1, 2)

9.0

In [8]:
round(main_effect_Factor2, 2)

49.0

In [9]:
round(interaction_effect, 2)

1.0
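
To see what the main and interaction effects mean in terms of the raw data, the sketch below computes them directly from group means with pandas, using the same eight observations. This is an illustrative decomposition for a balanced design; the variable names are introduced here and are not part of the statsmodels output.

```python
import pandas as pd

# Same data as the DataFrame used above
data = pd.DataFrame({'Factor1': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                     'Factor2': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y'],
                     'Response': [10, 12, 8, 9, 11, 13, 7, 8]})

grand_mean = data['Response'].mean()

# Main effects: marginal mean of each level minus the grand mean
effect_factor1 = data.groupby('Factor1')['Response'].mean() - grand_mean
effect_factor2 = data.groupby('Factor2')['Response'].mean() - grand_mean

# Interaction: cell mean minus what the two main effects alone would predict
cell_means = data.groupby(['Factor1', 'Factor2'])['Response'].mean().unstack()
additive = grand_mean + effect_factor1.values[:, None] + effect_factor2.values[None, :]
interaction = cell_means - additive

print(effect_factor1)
print(effect_factor2)
print(interaction)
```

In a balanced design such as this one, each sum of squares in the ANOVA table is the squared effects weighted by the number of observations behind them, e.g. 4 × (0.75² + 0.75²) = 4.5 for Factor1.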

## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

In a one-way ANOVA, the F-statistic is used to test whether there are statistically significant differences among the means of three or more groups. The p-value associated with the F-statistic helps determine the significance of these differences. Here's how to interpret the results:

1. **F-Statistic (5.23)**:
   - The F-statistic measures the ratio of the variance between the group means to the variance within the groups. In simple terms, it quantifies whether the variation in the means of the groups is larger than what you'd expect by chance.
   - In your case, the F-statistic is 5.23.

2. **p-value (0.02)**:
   - The p-value is the probability of observing an F-statistic at least as large as the one obtained if all group means were actually equal (i.e., if the null hypothesis were true).
   - In your case, the p-value is 0.02, which is less than the common significance level of 0.05.

Now, let's interpret these results:

**Conclusion**:
Since the p-value (0.02) is less than the chosen significance level (e.g., 0.05), you would typically reject the null hypothesis. Therefore:

- **Rejecting the Null Hypothesis**: This means that there is evidence to suggest that there are statistically significant differences among at least some of the group means.

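- **Supporting the Alternative Hypothesis**: At least one group mean differs from the others. The F-test alone does not identify which groups differ; a post-hoc test (e.g., Tukey's HSD) is needed for that, and an effect size (e.g., eta squared) is needed to judge how large the differences are.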


In summary, with an F-statistic of 5.23 and a p-value of 0.02, you have evidence to conclude that there are significant differences between at least some of the groups you are comparing in your one-way ANOVA analysis.
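
The question does not state the degrees of freedom, so the sketch below assumes a hypothetical design of 3 groups and 15 observations in total, purely to show how an F-statistic is turned into a p-value, a critical value, and an effect size. With different group counts or sample sizes, the numbers would differ.

```python
from scipy import stats

f_statistic = 5.23
# Assumed design (not stated in the question): 3 groups, 15 observations in total
df_between = 3 - 1
df_within = 15 - 3

# Upper-tail probability of the F distribution gives the p-value
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value at alpha = 0.05; reject the null when F exceeds it
alpha = 0.05
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

# Eta squared recovered from F and the degrees of freedom (one-way case)
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

print(f"p-value: {p_value:.3f}, critical F: {f_critical:.2f}, eta squared: {eta_squared:.2f}")
```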

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is important to ensure the validity of your analysis. Repeated measures ANOVA is typically used when you have data collected from the same subjects or items over multiple time points or conditions. Missing data can occur for various reasons, such as participant dropout, equipment malfunction, or errors in data recording. Here's how you can handle missing data and the potential consequences of different methods:

**1. Listwise Deletion:**
   - This method involves removing all cases (participants or items) with any missing data from the analysis.
   - Pros:
     - Simple and easy to implement.
   - Cons:
     - Reduces the sample size, potentially leading to a loss of statistical power.
     - May introduce bias if the missing data are not missing completely at random (MCAR).

**2. Mean Imputation:**
   - Replace missing values with the mean of the observed values for that variable.
   - Pros:
     - Retains the entire sample.
     - Simple and can preserve the sample's overall characteristics.
   - Cons:
     - Reduces variance, potentially underestimating standard errors and inflating Type I error rates.
     - Assumes that missing data are missing completely at random (MCAR).

**3. Linear Interpolation:**
   - If your data have a temporal or ordinal structure (e.g., time points), you can estimate missing values using linear interpolation.
   - Pros:
     - Preserves the temporal structure of the data.
   - Cons:
     - May not be appropriate for all types of data.
     - Requires careful consideration of the underlying pattern of the data.

**4. Multiple Imputation:**
   - This advanced technique involves creating multiple complete datasets with imputed values for missing data, analyzing each dataset separately, and then pooling the results into a single set of estimates.
   - Pros:
     - Retains all available data.
     - Accounts for uncertainty related to missing data.
     - Provides more accurate parameter estimates and standard errors.
   - Cons:
     - Complex and computationally intensive.
     - Requires assumptions about the missing data mechanism (e.g., missing at random, MAR).

**5. Maximum Likelihood Estimation (MLE):**
   - MLE estimates model parameters from all observed values without imputing the missing ones. It is typically available through linear mixed-effects models rather than through the classical repeated measures ANOVA routine.
   - Pros:
     - Retains all available data.
     - Provides unbiased parameter estimates and standard errors when the data are missing at random (MAR) and the model is correctly specified.
   - Cons:
     - Requires specialized software or knowledge.

**Potential Consequences:**
- Using listwise deletion can lead to reduced statistical power and potential bias if the data are not MCAR.
- Mean imputation can introduce bias and underestimate variability, leading to incorrect statistical inferences.
- Linear interpolation may not be appropriate for all types of data and may not accurately represent missing values.
- Multiple imputation and MLE are more robust methods but require more complex statistical techniques and software.

The choice of how to handle missing data in a repeated measures ANOVA should be based on the nature of your data, the reasons for missingness, and the underlying assumptions about missing data mechanisms. It is crucial to carefully document and justify your chosen approach to address missing data in your analysis to ensure the validity and reliability of your results.
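
The first three approaches above can be sketched with pandas on a small wide-format table (one row per subject, one column per time point). The data and column names are hypothetical; multiple imputation and maximum likelihood need dedicated tooling and are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame(
    {"t1": [10.0, 12.0, 9.0, 11.0],
     "t2": [11.0, np.nan, 10.0, 13.0],
     "t3": [13.0, 15.0, np.nan, 14.0]},
    index=["s1", "s2", "s3", "s4"],
)

# 1. Listwise deletion: drop every subject with any missing time point
listwise = scores.dropna()

# 2. Mean imputation: fill each time point with that column's observed mean
mean_imputed = scores.fillna(scores.mean())

# 3. Linear interpolation along each subject's time series (across columns)
interpolated = scores.interpolate(axis=1)

print("Subjects kept after listwise deletion:", len(listwise))
print(mean_imputed)
print(interpolated)
```

Here listwise deletion halves the sample, while mean imputation gives subject s2 the average of the other subjects at t2 regardless of s2's own trajectory. Subject s3's missing final time point has no later neighbour to interpolate toward, so it is worth checking how such edge values are filled before relying on interpolation.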

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Common post-hoc tests used after ANOVA:

1. **Tukey's HSD**: Controls the familywise error rate across all pairwise comparisons; a common default when every pair of groups is of interest.

2. **Bonferroni Correction**: Adjusts significance levels for pairwise comparisons, ideal for maintaining an overall low Type I error rate.

3. **Sidak Correction**: Similar to Bonferroni but less conservative, controlling familywise error rate with more power.

4. **Holm-Bonferroni Method**: Efficiently controls familywise error rate while maximizing statistical power in multiple comparisons.

5. **Dunnett's Test**: Compares treatment groups to a control group, identifying which treatments differ significantly from the control.

6. **Fisher's LSD**: Liberal test that runs unadjusted pairwise comparisons after a significant F-test, so it does not control the familywise error rate well with more than three groups; best reserved for a small number of comparisons planned in advance.
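
A post-hoc test becomes necessary when the overall ANOVA is significant and there are three or more groups, because the F-test does not say which groups differ. The sketch below illustrates this with made-up plant heights under three fertilizers, using `scipy.stats.tukey_hsd` (available in recent scipy releases); the data and variable names are hypothetical.

```python
from scipy import stats

# Hypothetical plant heights (cm) under three fertilizers (illustrative values only)
fert_a = [20.1, 21.3, 19.8, 20.7, 21.0]
fert_b = [22.4, 23.1, 22.8, 23.5, 22.0]
fert_c = [20.5, 21.1, 20.2, 21.4, 20.8]

# Omnibus test first: is there any difference among the three means?
f_statistic, p_value = stats.f_oneway(fert_a, fert_b, fert_c)
print("F-statistic:", f_statistic, "p-value:", p_value)

# Tukey's HSD compares every pair while controlling the familywise error rate
tukey = stats.tukey_hsd(fert_a, fert_b, fert_c)
# tukey.pvalue[i, j] is the adjusted p-value for group i versus group j
print(tukey.pvalue)
```

Pairs with an adjusted p-value below the chosen significance level are the ones that differ; the remaining pairs cannot be distinguished with these data.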

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [10]:
import numpy as np
import scipy.stats as stats
diet_A = [2.1, 1.8, 2.5, 1.9, 2.3, 2.8, 2.4, 2.0, 2.2, 2.7, 2.3, 2.1, 2.6, 2.0, 2.5, 2.4, 1.9, 2.2, 2.7, 2.8, 2.5, 2.1, 2.0, 2.3, 2.6, 2.4, 2.7, 2.5, 2.2, 2.8, 2.0, 2.1, 2.3, 2.6, 2.4, 2.7, 2.5, 2.2, 2.0, 2.1, 2.3, 2.6, 2.4, 2.7, 2.5, 2.2, 2.0]
diet_B = [1.7, 1.9, 1.5, 1.8, 1.6, 1.7, 2.0, 1.8, 1.6, 1.9, 1.7, 1.5, 1.8, 1.6, 1.7, 2.0, 1.8, 1.6, 1.9, 1.7, 1.5, 1.8, 1.6, 1.7, 2.0, 1.8, 1.6, 1.9, 1.7, 1.5, 1.8, 1.6, 1.7, 2.0, 1.8, 1.6, 1.9, 1.7, 1.5, 1.8, 1.6, 1.7, 2.0, 1.8, 1.6]
diet_C = [2.9, 2.7, 3.1, 2.6, 3.0, 3.2, 2.8, 3.1, 2.9, 2.7, 3.0, 3.2, 2.8, 3.1, 2.9, 2.7, 3.0, 3.2, 2.8, 3.1, 2.9, 2.7, 3.0, 3.2, 2.8, 3.1, 2.9, 2.7, 3.0, 3.2, 2.8, 3.1, 2.9, 2.7, 3.0, 3.2, 2.8, 3.1, 2.9, 2.7, 3.0, 3.2, 2.8, 3.1]
# Perform one-way ANOVA

# Create a corresponding group variable for the three diets
groups = ['A'] * len(diet_A) + ['B'] * len(diet_B) + ['C'] * len(diet_C)

# Perform the one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Output the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)


F-statistic: 370.931049849141
p-value: 3.955036964768049e-55


**Interpret the results:**

The F-statistic is about 370.93 and the p-value is about 3.96e-55, which is far below the usual significance level of 0.05. You would therefore reject the null hypothesis of equal means and conclude that mean weight loss differs for at least one of the three diets.

The ANOVA does not identify which diets differ; a post-hoc test such as Tukey's HSD is needed for that. Note also that the example lists contain more than 50 values in total, so they serve as illustrative data rather than the 50-participant study described in the question.



## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [11]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Create a DataFrame with your data
data = pd.DataFrame({
    'Software': np.repeat(['A', 'B', 'C'], 10),  # 10 employees per software
    'Experience': np.tile(['Novice', 'Experienced'], 15),  # alternates, giving 5 novice and 5 experienced per software
    'Time': np.random.normal(30, 5, 30)  # Random placeholder (unseeded, so results change per run); replace with your actual time data
})

# Fit the two-way ANOVA model
formula = 'Time ~ C(Software) + C(Experience) + C(Software):C(Experience)'
model = ols(formula, data=data).fit()

# Perform the two-way ANOVA
anova_table = anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)


                               sum_sq    df         F    PR(>F)
C(Software)                 68.909648   2.0  0.978882  0.390236
C(Experience)                0.000779   1.0  0.000022  0.996286
C(Software):C(Experience)   16.426648   2.0  0.233345  0.793656
Residual                   844.755070  24.0       NaN       NaN


##### Interpret the results:

Main Effects:

- C(Software): Tests the main effect of software programs (F ≈ 0.98, p ≈ 0.39 in the table above).
- C(Experience): Tests the main effect of employee experience level (F ≈ 0.00002, p ≈ 0.996).

Interaction Effect:

- C(Software):C(Experience): Tests the interaction effect between software programs and employee experience level (F ≈ 0.23, p ≈ 0.79).

You can read the F-statistic and p-value for each effect from the anova_table like this:

In [12]:
main_effect_Software = anova_table.loc['C(Software)', ['F', 'PR(>F)']]
main_effect_Experience = anova_table.loc['C(Experience)', ['F', 'PR(>F)']]
interaction_effect = anova_table.loc['C(Software):C(Experience)', ['F', 'PR(>F)']]


If the p-value for an effect is less than your chosen significance level (e.g., 0.05), you would typically reject the null hypothesis for that effect, indicating a significant effect. In the table above all three p-values are well above 0.05, so neither main effect nor the interaction is significant. That is expected here, because the Time values are random placeholders with no built-in group differences.

If the interaction effect is significant, it suggests that the effect of software programs on task completion time depends on the employee's experience level (and vice versa).

## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [13]:
import numpy as np
import pandas as pd
from scipy import stats
# Example data (replace with your actual data)
control_group = [85, 78, 92, 88, 75, 79, 81, 90, 87, 82]  # Scores for the control group
experimental_group = [92, 89, 94, 91, 86, 90, 93, 88, 95, 89]  # Scores for the experimental group
# Perform a two-sample t-test assuming equal variances
t_stat, p_value = stats.ttest_ind(control_group, experimental_group)

# Print the t-statistic and p-value
print("t-statistic:", t_stat)
print("p-value:", p_value)


t-statistic: -3.53854415246012
p-value: 0.0023469370674799885


##### Interpret the results:

The p-value (about 0.0023) is less than the chosen significance level of 0.05, so you would reject the null hypothesis and conclude that test scores differ significantly between the two groups. The negative t-statistic reflects that the control group's mean (83.7) is lower than the experimental group's mean (90.7) in this example data.

Post-hoc Tests (if significant):

Post-hoc tests (e.g., Tukey's HSD, Bonferroni correction) are used after an ANOVA with three or more groups to find which pairs differ. With only two groups, no post-hoc test is needed: a significant t-test already identifies the one pair that differs, and the direction comes from comparing the two means.
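
As one possible follow-up for a two-group comparison, the sketch below runs Welch's t-test (which does not assume equal variances) and computes Cohen's d as an effect size, using the same example scores. The pooled-standard-deviation form of Cohen's d is one common choice among several.

```python
import numpy as np
from scipy import stats

# Same example scores as in the cell above
control_group = np.array([85, 78, 92, 88, 75, 79, 81, 90, 87, 82])
experimental_group = np.array([92, 89, 94, 91, 86, 90, 93, 88, 95, 89])

# Welch's t-test: does not assume equal variances in the two groups
t_welch, p_welch = stats.ttest_ind(control_group, experimental_group, equal_var=False)

# Cohen's d using the pooled standard deviation (sample variances, ddof=1)
n1, n2 = len(control_group), len(experimental_group)
pooled_var = ((n1 - 1) * control_group.var(ddof=1)
              + (n2 - 1) * experimental_group.var(ddof=1)) / (n1 + n2 - 2)
mean_difference = experimental_group.mean() - control_group.mean()
cohens_d = mean_difference / np.sqrt(pooled_var)

print("Mean difference (experimental - control):", mean_difference)
print("Welch t:", t_welch, "p-value:", p_welch)
print("Cohen's d:", cohens_d)
```

Because the two groups have the same size, the Welch t-statistic matches the equal-variance one; only the degrees of freedom, and therefore the p-value, change.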






## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [14]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM
# Example data (replace with your actual data)
data = pd.DataFrame({
    'Store': np.repeat(['Store A', 'Store B', 'Store C'], 30),  # 30 days of sales for each store
    'Day': np.tile(np.arange(1, 31), 3),  # Day numbers from 1 to 30 repeated for each store
    'Sales': np.random.randint(300, 600, 90)  # Replace with your actual sales data
})
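# Fit the repeated measures ANOVA model:
# each Day is the repeated unit (subject) and Store is the within-subject factor
model = AnovaRM(data, depvar='Sales', subject='Day', within=['Store'])
results = model.fit()

# Print the ANOVA results
print(results)



##### Interpret the results:

The printed table has a single row for the within-subject factor, Store, with Num DF = 2 (three stores minus one) and Den DF = 58 ((3 - 1) × (30 - 1)). Because the Sales values are unseeded random placeholders, the F value and p-value change on every run, so no output is reproduced here.

If the p-value for Store is less than your chosen significance level (e.g., 0.05), you would conclude that average daily sales differ between at least two of the stores, and follow up with pairwise comparisons between stores (e.g., paired t-tests with a Bonferroni or Holm correction) to find which ones. With one observation per store per day, there is no separate Store-by-Day interaction to test; that variation serves as the error term.

Please make sure your data is appropriately structured for repeated measures analysis. If you have cross-sectional data (sales data for a single point in time), a repeated measures ANOVA may not be suitable, and you might want to consider a one-way ANOVA or a different statistical approach to assess differences between stores.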
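
For the post-hoc step the question asks about, one reasonable option is to run paired t-tests between every pair of stores and adjust the p-values with the Holm method. The sketch below uses seeded, made-up sales figures (with Store B given a higher average on purpose) and implements the Holm adjustment by hand; the data and names are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical daily sales: 30 days (rows) by 3 stores (columns), seeded for repeatability
rng = np.random.default_rng(0)
stores = ["Store A", "Store B", "Store C"]
sales = rng.normal(loc=[450, 480, 455], scale=40, size=(30, 3))

# Paired t-test for every pair of stores (the same 30 days are measured in each store)
pairs = list(combinations(range(3), 2))
p_raw = np.array([stats.ttest_rel(sales[:, i], sales[:, j]).pvalue for i, j in pairs])

# Holm step-down adjustment: sort p-values, scale by the number of remaining tests,
# then enforce monotonicity and cap at 1
order = np.argsort(p_raw)
m = len(p_raw)
scaled = p_raw[order] * (m - np.arange(m))
p_holm = np.empty(m)
p_holm[order] = np.minimum(np.maximum.accumulate(scaled), 1.0)

for (i, j), p, p_adj in zip(pairs, p_raw, p_holm):
    print(f"{stores[i]} vs {stores[j]}: raw p = {p:.4f}, Holm-adjusted p = {p_adj:.4f}")
```

Pairs whose Holm-adjusted p-value falls below the significance level are the stores that differ. The paired test is used because each day contributes one observation to every store.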