# Statistics Advance-6

#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Assumptions of ANOVA:
* Independence: Observations are independent of each other.
* Normality: The residuals (errors) are normally distributed.
* Homogeneity of Variance: The variances of the residuals are equal across groups.
* Homogeneity of Regression Slopes: This applies to ANCOVA (ANOVA with a covariate) rather than plain ANOVA; the slope relating the covariate to the response is the same in every group.

Examples of violations and their impact:
* Independence Violation: If observations are not independent (for example, repeated measurements on the same subject treated as separate observations), standard errors are underestimated and the type I error rate of the F-test is inflated.
* Normality Violation: If the residuals are strongly skewed or heavy-tailed, the p-values and confidence intervals may not be accurate, particularly with small samples.
* Homogeneity of Variance Violation: Unequal variances across groups can inflate the type I error rate or reduce power, especially when group sizes are also unequal.
* Homogeneity of Regression Slopes Violation: In ANCOVA, if the covariate's slope differs between groups, the adjusted group means are not comparable and inferences about group effects can be incorrect.
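
The normality and equal-variance assumptions above can be checked before running the ANOVA. The following is a minimal sketch, assuming three small hypothetical groups; it uses `scipy.stats.shapiro` on the pooled residuals and `scipy.stats.levene` across the groups. The data and variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three independent groups
groups = [
    np.array([2, 3, 4, 2, 3], dtype=float),
    np.array([1, 2, 1, 1, 3], dtype=float),
    np.array([4, 5, 6, 4, 5], dtype=float),
]

# Residuals in a one-way ANOVA are deviations from each group's own mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Normality: Shapiro-Wilk test on the pooled residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
```

A small p-value in either test suggests the corresponding assumption may be violated. Independence cannot be tested this way; it has to be judged from the study design.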

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

Three types of ANOVA:
1. One-Way ANOVA: Used when comparing the means of three or more independent groups defined by a single categorical factor.
2. Two-Way ANOVA: Used when the groups are defined by two categorical factors, to test each factor's main effect and the interaction between them.
3. Repeated Measures ANOVA: Used when comparing means across related groups, such as repeated measurements on the same subjects over time or under different conditions.

#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the division of the total variance in the data into different components that can be attributed to specific sources of variation. This includes the explained variance (variation between groups) and the unexplained variance (variation within groups). Understanding this concept is important because it allows us to quantify how much of the total variability in the data is explained by the factors being studied.
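
In symbols, the partition for a one-way design can be written as follows. The notation is introduced here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{y}_i$, and grand mean $\bar{y}$.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}}
$$

$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$

The F-statistic compares the explained variation per degree of freedom with the unexplained variation per degree of freedom.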

#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
# Example data for two groups
g1 = np.array([25, 30, 28, 24, 26])
g2 = np.array([40, 35, 38, 42, 36])
data = [g1, g2]  # List of data arrays for each group
all_values = np.concatenate(data)
grand_mean = np.mean(all_values)
# Total sum of squares: squared deviations of every observation from the grand mean
SST = np.sum((all_values - grand_mean)**2)
# Residual (within-group) sum of squares: squared deviations from each group's own mean
SSR = sum(np.sum((group - np.mean(group))**2) for group in data)
# Explained (between-group) sum of squares: the remainder of the total
SSE = SST - SSR
print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)

Total Sum of Squares (SST): 392.40000000000003
Explained Sum of Squares (SSE): 336.40000000000003
Residual Sum of Squares (SSR): 56.0


#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

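We would typically perform a two-way ANOVA with statsmodels: fit a linear model with `ols` using a formula such as `response ~ C(factor_a) * C(factor_b)` (placeholder column names), then pass the fitted model to `anova_lm`. A main effect is the effect of one factor on the response, averaged over the levels of the other factor. The interaction effect captures how the effect of one factor changes across the levels of the other, that is, the non-additive part of their combined effect. Each row of the resulting ANOVA table reports the F-statistic and p-value for one main effect or for the interaction. Note that `scipy.stats.f_oneway` handles only a single factor, so it cannot estimate interaction effects.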
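
To make the main and interaction effects concrete, here is a minimal sketch that computes them by hand for a balanced design. It assumes hypothetical data with two levels per factor and three observations per cell; the array layout and all names are illustrative, and the formulas used apply only when every cell has the same number of observations.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: 2 levels of factor A x 2 levels of factor B,
# 3 observations per cell. Array shape: (levels_A, levels_B, replicates).
y = np.array([
    [[10.0, 12.0, 11.0], [14.0, 15.0, 16.0]],
    [[20.0, 19.0, 21.0], [30.0, 29.0, 31.0]],
])
a, b, n = y.shape
grand = y.mean()
mean_a = y.mean(axis=(1, 2))   # marginal means of factor A
mean_b = y.mean(axis=(0, 2))   # marginal means of factor B
cell = y.mean(axis=2)          # cell means

# Sums of squares for the main effects, the interaction, and the error
ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_error = np.sum((y - cell[:, :, None]) ** 2)
ss_total = np.sum((y - grand) ** 2)

df_a, df_b = a - 1, b - 1
df_ab = (a - 1) * (b - 1)
df_error = a * b * (n - 1)
ms_error = ss_error / df_error

# F-statistic and p-value for each effect
for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("A:B", ss_ab, df_ab)]:
    f_stat = (ss / df) / ms_error
    p_value = stats.f.sf(f_stat, df, df_error)
    print(f"{name}: F = {f_stat:.3f}, p = {p_value:.4g}")
```

In a balanced design the four components add up to the total sum of squares, which is a useful sanity check on the calculation.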

#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

An F-statistic of 5.23 means that the variation between group means is about five times as large as the variation within groups. The p-value of 0.02 is below the conventional significance level of 0.05, so we reject the null hypothesis that all group means are equal. We can conclude that at least one group mean differs from the others. The ANOVA alone does not identify which groups differ; that requires a post-hoc test.
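
The link between the F-statistic and the p-value can be sketched as below. The question does not give the degrees of freedom, so this example assumes a hypothetical design with 3 groups and 15 observations in total; under that assumption the p-value comes out close to the reported 0.02.

```python
from scipy import stats

f_stat = 5.23
# Assumed design (not given in the question): 3 groups, 15 observations in total
k, n_total = 3, 15
df_between = k - 1          # numerator degrees of freedom
df_within = n_total - k     # denominator degrees of freedom

# p-value: upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_stat, df_between, df_within)
# Critical value at the 5% significance level
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.3f}")
print(f"critical F at alpha = 0.05: {f_crit:.2f}")
```

Rejecting the null hypothesis when the p-value is below 0.05 is equivalent to the observed F exceeding the critical value.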

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

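Missing data in a repeated measures ANOVA needs care because the standard analysis requires a complete set of measurements for every subject. The main options are to exclude subjects with any missing value (listwise deletion), to impute the missing values, or to fit a mixed-effects model, which can use all available observations. Listwise deletion reduces the sample size and power, and can bias estimates unless the data are missing completely at random. Simple imputation, such as mean substitution, understates variability and can produce p-values that are too small. Mixed-effects models generally remain valid when the data are missing at random. The choice should be based on why the data are missing and on the assumptions of the analysis.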
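
The difference between two of these approaches can be sketched with pandas. The example assumes a small hypothetical wide-format table with one row per subject and one column per time point; the subject labels and scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 15.0, 14.0],
        "t3": [13.0, 14.0, np.nan, 17.0, 15.0],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Option 1: listwise deletion keeps only subjects with complete data
complete_cases = scores.dropna()

# Option 2: mean imputation fills each gap with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

print("Subjects before deletion:", len(scores))
print("Subjects after listwise deletion:", len(complete_cases))
print("Std. dev. per time point, observed data:")
print(scores.std())
print("Std. dev. per time point, after mean imputation:")
print(mean_imputed.std())
```

Listwise deletion shrinks the sample, while mean imputation keeps every subject but lowers the spread of the imputed columns, which is why it can overstate significance.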

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

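Common post-hoc tests include Tukey's HSD, the Bonferroni correction, and Scheffe's method. They are used after ANOVA rejects the null hypothesis, to determine which specific groups differ while controlling the family-wise error rate. Tukey's HSD suits comparing all pairs of group means. The Bonferroni correction suits a small number of pre-planned comparisons. Scheffe's method is more conservative and suits complex contrasts that go beyond pairwise comparisons. For example, if a one-way ANOVA shows a significant difference in mean scores among three treatment groups, a post-hoc test can identify which pairs of groups have different means.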
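
As a minimal sketch of one such procedure, the code below runs Bonferroni-corrected pairwise t-tests with `scipy.stats.ttest_ind`. The three groups and their scores are hypothetical, and this is only one of several reasonable post-hoc approaches.

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical scores for three treatment groups
groups = {
    "A": [2, 3, 4, 2, 3],
    "B": [1, 2, 1, 1, 3],
    "C": [4, 5, 6, 4, 5],
}

pairs = list(combinations(groups, 2))
n_comparisons = len(pairs)

adjusted = {}
for first, second in pairs:
    t_stat, p_raw = ttest_ind(groups[first], groups[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1
    p_adj = min(p_raw * n_comparisons, 1.0)
    adjusted[(first, second)] = p_adj
    print(f"{first} vs {second}: t = {t_stat:.3f}, adjusted p = {p_adj:.4f}")
```

Pairs whose adjusted p-value falls below 0.05 are the ones that differ significantly after correcting for multiple comparisons.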

#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [2]:
from scipy.stats import f_oneway
import numpy as np
# Example weight loss data for three diet groups
wlA = np.array([2, 3, 4, 2, 3])
wlB = np.array([1, 2, 1, 1, 3])
wlC = np.array([4, 5, 6, 4, 5])
F_stat, p_value = f_oneway(wlA, wlB, wlC)
print("F-statistic:", F_stat)
print("P-value:", p_value)
if p_value < 0.05:
    print("Reject the null hypothesis: There are significant differences between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis: There are no significant differences between the mean weight loss of the three diets.")

F-statistic: 17.81818181818181
P-value: 0.0002555382200150048
Reject the null hypothesis: There are significant differences between the mean weight loss of the three diets.


#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [3]:
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# Create a DataFrame with 'time', 'software', and 'experience' columns
data = pd.DataFrame({
    'time': [10, 15, 12, 8, 13, 18, 20, 16, 17, 22],
    'software': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'experience': ['novice', 'experienced', 'novice', 'experienced', 'novice', 'experienced', 'novice', 'experienced', 'novice', 'experienced']
})
data['software'] = data['software'].astype('category')
data['experience'] = data['experience'].astype('category')
# Specify the formula for the model
formula = 'time ~ C(software) * C(experience)'
# Fit the model and perform ANOVA
model = ols(formula, data=data).fit()
anova_results = anova_lm(model)
print(anova_results)

                            df  sum_sq  mean_sq         F    PR(>F)
C(software)                2.0   108.9    54.45  4.109434  0.107166
C(experience)              1.0     4.9     4.90  0.369811  0.575945
C(software):C(experience)  2.0     8.1     4.05  0.305660  0.752436
Residual                   4.0    53.0    13.25       NaN       NaN

None of the effects is significant at the 0.05 level: software (F = 4.11, p = 0.107), experience (F = 0.37, p = 0.576), and their interaction (F = 0.31, p = 0.752). With only 10 example observations and 4 residual degrees of freedom, the test has very little power, so these results should not be read as evidence that no effects exist.


#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [4]:
from scipy.stats import ttest_ind
cgs = [10, 15, 12, 8, 13]  # Example scores for the control group
egs = [18, 20, 16, 17, 22]  # Example scores for the experimental group
t_statistic, p_value = ttest_ind(cgs, egs)
print("t-statistic:", t_statistic)
print("P-value:", p_value)
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference in test scores between the control and experimental groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in test scores between the control and experimental groups.")

t-statistic: -4.324614442506509
P-value: 0.0025302781637910293
Reject the null hypothesis: There is a significant difference in test scores between the control and experimental groups.

Because only two groups are compared, the significant t-test already shows which groups differ, so no post-hoc test is needed. In this example the experimental group's mean score (18.6) is higher than the control group's (11.6).


#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [5]:
import pandas as pd
from scipy.stats import f_oneway
# Create a DataFrame with sales and store columns
data = pd.DataFrame({
    'sales': [120, 130, 125, 110, 135, 128, 122, 130, 124, 118],
    'store': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C']
})
data['store'] = data['store'].astype('category')
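# One-way ANOVA as a simplification: it treats the stores as independent groups.
# A true repeated measures ANOVA would need a day identifier to pair the observations.
sg = [group['sales'] for name, group in data.groupby('store', observed=True)]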
F_stat, p_value = f_oneway(*sg)
print("F-statistic:", F_stat)
print("P-value:", p_value)
if p_value < 0.05:
    print("Reject the null hypothesis: There are significant differences between the mean sales of the stores.")
else:
    print("Fail to reject the null hypothesis: There are no significant differences between the mean sales of the stores.")

F-statistic: 1.4941267387944361
P-value: 0.2881573469718555
Fail to reject the null hypothesis: There are no significant differences between the mean sales of the stores.
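
The analysis above uses a one-way ANOVA, which ignores that the same days are measured at every store. A repeated measures ANOVA removes the day-to-day variation from the error term. The following is a minimal sketch of that calculation, assuming hypothetical sales for five days recorded at all three stores; the figures are illustrative and are not the data used above. statsmodels also provides an `AnovaRM` class for this kind of design.

```python
import numpy as np
from scipy import stats

# Hypothetical sales: rows are days (the repeated "subjects"),
# columns are stores A, B and C measured on the same days.
sales = np.array([
    [120.0, 135.0, 124.0],
    [130.0, 128.0, 118.0],
    [125.0, 122.0, 121.0],
    [110.0, 130.0, 119.0],
    [118.0, 131.0, 123.0],
])
n_days, n_stores = sales.shape
grand = sales.mean()

ss_total = np.sum((sales - grand) ** 2)
# Variation between store means (the effect of interest)
ss_stores = n_days * np.sum((sales.mean(axis=0) - grand) ** 2)
# Variation between day means (removed from the error term)
ss_days = n_stores * np.sum((sales.mean(axis=1) - grand) ** 2)
# Remaining store-by-day variation serves as the error term
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_stat = (ss_stores / df_stores) / (ss_error / df_error)
p_value = stats.f.sf(f_stat, df_stores, df_error)

print(f"F({df_stores}, {df_error}) = {f_stat:.3f}, p = {p_value:.4f}")
```

If the result were significant, a post-hoc follow-up would use paired comparisons between stores (for example, paired t-tests with a Bonferroni correction), since the observations are matched by day.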
