# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Analysis of Variance (ANOVA) is a statistical technique used to compare the means of three or more groups to determine if there are significant differences among them. To use ANOVA effectively and interpret its results accurately, certain assumptions must be met. Violations of these assumptions can impact the validity of the ANOVA results. The main assumptions for ANOVA are:

- Independence of Observations:

Assumption: Observations must be independent of one another, both within each group and across groups. The value of one observation should not depend on or be influenced by any other observation.

Violation Example: In a study comparing test scores of students from different schools, if students from the same school collaborate or share answers, the independence assumption is violated.

- Normality:

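Assumption: The data within each group (equivalently, the residuals) should follow an approximately normal distribution. This assumption matters most for small samples (typically fewer than 30 per group); with larger samples the F-test is fairly robust to moderate departures from normality.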

Violation Example: If a group's data is significantly skewed or has heavy tails, it may not meet the normality assumption. For example, test scores in a highly competitive exam may have a skewed distribution.

- Homogeneity of Variance (Homoscedasticity):

Assumption: The variances of the groups should be roughly equal. In other words, the spread of data points within each group should be similar.

Violation Example: If one group has a much larger variance (greater variability) than another, the homogeneity of variance assumption is violated. For instance, if one group of employees has highly variable productivity compared to another group, this assumption may be violated.

Examples of Violations and Their Impact on Validity:

- a. Non-Normality:

Violation: Suppose you are comparing the effectiveness of three different medications on pain relief, and the data for one medication group is heavily skewed due to a few extreme outliers.

Impact: The ANOVA results may be unreliable, especially with small samples. Extreme outliers and strong skew distort the group means and variances, which can affect the accuracy of p-values and confidence intervals and lead to incorrect conclusions.

- b. Heteroscedasticity:

Violation: In a study comparing the performance of three car models, the variances of the miles per gallon (MPG) differ substantially across models, with one model showing a much larger variance than the others.

Impact: ANOVA assumes equal variances among groups. Heteroscedasticity can lead to inflated Type I error rates (false positives) or reduced power, making it harder to detect true differences. The problem is most severe when group sizes are also unequal.

- c. Lack of Independence:

Violation: In a survey of employee satisfaction, responses from employees within the same department may not be independent if they influence each other's responses.

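Impact: Violations of independence can lead to unreliable ANOVA results. Correlated responses carry less information than the same number of independent responses, so the within-group variability tends to be underestimated, which inflates the F-statistic and the Type I error rate.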

When these assumptions are violated, alternative statistical methods (for example, Welch's ANOVA for unequal variances or the Kruskal-Wallis test for non-normal data) or transformations of the data may be necessary to ensure valid conclusions. If the assumptions cannot be met, ANOVA results should be interpreted with caution.
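The normality and equal-variance assumptions can be screened with standard tests before running ANOVA. The sketch below is illustrative only: the data are randomly generated, and the group names (`med_A`, `med_B`, `med_C`) are hypothetical. It applies the Shapiro-Wilk test to each group and Levene's test across groups using SciPy. Independence cannot be checked this way; it has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical pain-relief scores for three medication groups (randomly generated, illustrative only)
rng = np.random.default_rng(0)
groups = {
    "med_A": rng.normal(loc=5.0, scale=1.0, size=20),
    "med_B": rng.normal(loc=5.5, scale=1.0, size=20),
    "med_C": rng.normal(loc=6.0, scale=3.0, size=20),  # deliberately larger spread
}

# Normality: Shapiro-Wilk test within each group (H0: the data are normally distributed)
normality_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    normality_p[name] = p

# Homogeneity of variance: Levene's test across groups (H0: all group variances are equal)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in normality_p.items():
    print(f"Shapiro-Wilk p-value for {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold. These tests have low power in small samples, so plots such as histograms or Q-Q plots are a useful complement.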

# Q2. What are the three types of ANOVA, and in what situations would each be used?

Analysis of Variance (ANOVA) is a statistical technique used to compare the means of three or more groups to determine if there are significant differences among them. There are three main types of ANOVA, each suited for different situations:

- One-Way ANOVA:

Use Case: One-way ANOVA is used when you have one categorical independent variable (factor) with three or more levels or groups and a continuous dependent variable. It is employed to test if there are any statistically significant differences in the means of the groups.

_Example: Comparing the test scores of students who attended three different schools to determine if there are significant differences in their performance._

- Two-Way ANOVA:

Use Case: Two-way ANOVA is used when you have two categorical independent variables (factors) and a continuous dependent variable. It assesses the impact of both factors on the dependent variable and whether there are interactions between the factors.

_Example: Evaluating the effects of two factors, such as the type of fertilizer (Factor A) and the amount of sunlight (Factor B), on plant growth._

- Repeated Measures ANOVA (or Within-Subjects ANOVA):

Use Case: Repeated Measures ANOVA is used when you have a single group of subjects that is measured at multiple time points or under different conditions (repeated measurements) and a continuous dependent variable. It is used to assess changes within the same subjects over time or across conditions.

_Example: Examining the effect of a drug treatment on the blood pressure of the same group of patients before treatment, immediately after treatment, and at regular intervals thereafter._

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept that helps explain the sources of variability in a dataset and how they contribute to the overall variance in the dependent variable. It is important to understand this concept because it provides valuable insights into the relationships between factors, helps assess the significance of those relationships, and guides the interpretation of ANOVA results. Here's an explanation of the partitioning of variance in ANOVA:

In ANOVA, the total variance in the dependent variable is divided into two main components:

- Between-Group Variance (or Treatment Variance):

This component represents the variability in the dependent variable that is attributed to the differences between the groups or levels of the independent variable(s). It measures how much the group means differ from each other.

Mathematically, it is calculated as the sum of squared differences between the group means and the overall mean, weighted by the number of observations in each group.

- Within-Group Variance (or Error Variance):

This component represents the variability in the dependent variable that cannot be explained by the differences between the groups. It includes random variability and unexplained sources of variance within each group.

Mathematically, it is calculated as the sum of squared differences between each individual data point and the mean of its respective group.

The key idea behind partitioning the variance is to assess whether the between-group variance is significantly greater than the within-group variance. If the between-group variance is much larger than the within-group variance, it suggests that there are significant differences among the groups, and this is typically what ANOVA tests for.

The F-statistic is the ratio of the between-group mean square to the within-group mean square, where each mean square is the corresponding sum of squares divided by its degrees of freedom. It is used to determine whether the differences among group means are statistically significant. If the F-statistic is sufficiently large, it suggests that at least one group mean differs from the others, and you can reject the null hypothesis that all group means are equal.
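Written out for a one-way design, the partition described above takes the following form. The notation is an assumption introduced here for illustration: $y_{ij}$ is observation $j$ in group $i$, $\bar{y}_i$ is the mean of group $i$, $\bar{y}$ is the overall mean, $k$ is the number of groups, $n_i$ is the size of group $i$, and $N$ is the total number of observations.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}}
$$

$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$

The degrees of freedom partition in the same way: $N-1 = (k-1) + (N-k)$.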

- Understanding the partitioning of variance is important for several reasons:

Interpretation of Results: It helps you interpret ANOVA results by quantifying the relative importance of the factors being studied in explaining the variability in the dependent variable.

Identifying Significant Effects: It allows you to identify which factors or levels of factors have a significant effect on the dependent variable, helping you focus on meaningful relationships in your data.

Model Assessment: It aids in evaluating the goodness of fit of your statistical model, indicating whether the model adequately explains the observed variation in the data.

Research and Decision Making: It assists researchers, analysts, and decision-makers in drawing conclusions about the impact of different factors or treatments on the dependent variable, which can inform further investigations or decisions.

In summary, understanding the partitioning of variance in ANOVA is crucial for conducting hypothesis tests, drawing meaningful conclusions, and making informed decisions based on the analysis of data with multiple groups or factors. It helps identify where the variability in the data comes from and whether it is statistically significant.

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

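In a one-way ANOVA, you can calculate the Total Sum of Squares (SST), Explained Sum of Squares (SSE), and Residual Sum of Squares (SSR) directly with NumPy. Naming conventions vary: here SSE denotes the explained (between-group) sum of squares and SSR the residual (within-group) sum of squares, whereas some textbooks use SSE for the error sum of squares and SSR for the regression sum of squares. Here's how you can do it: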

Total Sum of Squares (SST):

SST represents the total variation in the dependent variable. It is the sum of squared differences between each data point and the overall mean.

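In [1]:
import numpy as np
# Illustrative data for three groups (replace with your data)
group1_data = np.array([2.1, 1.8, 2.5, 1.9, 2.3])
group2_data = np.array([1.5, 1.7, 1.3, 1.8, 1.4])
group3_data = np.array([2.8, 2.6, 2.9, 3.1, 2.7])
# Pool all observations into one 1-D array (this also works for unequal group sizes)
data = np.concatenate([group1_data, group2_data, group3_data])
# Calculate the overall mean
overall_mean = np.mean(data)
# Calculate the total sum of squares (SST)
SST = np.sum((data - overall_mean) ** 2)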

Explained Sum of Squares (SSE):

SSE represents the variation in the dependent variable explained by the differences between the group means. It is the sum of squared differences between each group mean and the overall mean, weighted by the number of observations in each group.

In [4]:
# Arrays for each group's data (illustrative values; replace with your data)
group1_data = np.array([2.1, 1.8, 2.5, 1.9, 2.3])
group2_data = np.array([1.5, 1.7, 1.3, 1.8, 1.4])
group3_data = np.array([2.8, 2.6, 2.9, 3.1, 2.7])
# Overall mean of all observations pooled together
overall_mean = np.mean(np.concatenate([group1_data, group2_data, group3_data]))
# Calculate the group means
group1_mean = np.mean(group1_data)
group2_mean = np.mean(group2_data)
group3_mean = np.mean(group3_data)
# Calculate the explained sum of squares (SSE); add one term per additional group
SSE = (len(group1_data) * (group1_mean - overall_mean) ** 2 +
       len(group2_data) * (group2_mean - overall_mean) ** 2 +
       len(group3_data) * (group3_mean - overall_mean) ** 2)

Residual Sum of Squares (SSR):

SSR represents the unexplained variation in the dependent variable. It is the sum of squared differences between each data point and its respective group mean.

In [5]:
# Calculate the residuals for each group (deviations from the group's own mean)
residuals_group1 = group1_data - group1_mean
residuals_group2 = group2_data - group2_mean
residuals_group3 = group3_data - group3_mean

# Calculate the residual sum of squares (SSR); add one term per additional group
SSR = np.sum(residuals_group1 ** 2) + np.sum(residuals_group2 ** 2) + np.sum(residuals_group3 ** 2)

Once you have calculated SST, SSE, and SSR, you can use these values to perform an F-test to determine whether there are significant differences among the group means. This test will help you assess whether the independent variable (e.g., group or treatment) has a statistically significant effect on the dependent variable. You can use libraries like SciPy to perform the F-test.
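The three quantities and the resulting F-test can be combined into one helper. This is a minimal sketch, not a reference implementation: the function name `one_way_anova_sums` is hypothetical, and the data are the example diet values used later in Q9. It follows this document's naming (SSE = explained, SSR = residual) and cross-checks the hand-computed F-statistic against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

def one_way_anova_sums(*groups):
    """Return SST, SSE (explained), SSR (residual), F, and p for a one-way ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_data = np.concatenate(groups)
    overall_mean = all_data.mean()

    # Total variation around the overall mean
    sst = np.sum((all_data - overall_mean) ** 2)
    # Between-group variation, weighted by group size
    sse = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
    # Within-group variation around each group's own mean
    ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_data) - len(groups)
    f_stat = (sse / df_between) / (ssr / df_within)
    # Upper-tail probability of the F distribution
    p_value = stats.f.sf(f_stat, df_between, df_within)
    return sst, sse, ssr, f_stat, p_value

# Example values (same as the diet example in Q9)
diet_A = [2.1, 1.8, 2.5, 1.9, 2.3]
diet_B = [1.5, 1.7, 1.3, 1.8, 1.4]
diet_C = [2.8, 2.6, 2.9, 3.1, 2.7]

sst, sse, ssr, f_stat, p_value = one_way_anova_sums(diet_A, diet_B, diet_C)
scipy_f, scipy_p = stats.f_oneway(diet_A, diet_B, diet_C)

print("SST:", sst, "SSE:", sse, "SSR:", ssr)
print("F (manual):", f_stat, "F (SciPy):", scipy_f)
print("p (manual):", p_value, "p (SciPy):", scipy_p)
```

The identity SST = SSE + SSR should hold up to floating-point error, and the manual F-statistic should match SciPy's.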

# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

A5 : In a two-way ANOVA, you can calculate the main effects and interaction effects using Python with the help of libraries like NumPy and SciPy. Here's a step-by-step guide on how to do it:

Let's assume you have a dataset with two categorical independent variables, Factor A (with 'a' levels) and Factor B (with 'b' levels), and a continuous dependent variable.

Data Preparation:

Organize your data into arrays or data structures where each row corresponds to an observation, and each column corresponds to one of the factors.

In [6]:
import numpy as np

# Arrays for Factor A, Factor B, and the dependent variable
# (illustrative values with one observation per cell; replace with your data)
factor_A = np.array(['A1', 'A1', 'A2', 'A2', 'A3', 'A3'])
factor_B = np.array(['B1', 'B2', 'B1', 'B2', 'B1', 'B2'])
dependent_variable = np.array([10, 12, 8, 9, 14, 11])

Calculate Group Means:

Calculate the means for each combination of Factor A and Factor B. These are the cell means.

In [7]:
import pandas as pd

# Calculate cell means with a pandas pivot table, which handles categorical (string) factor levels
# Rows correspond to the levels of Factor A, columns to the levels of Factor B
df = pd.DataFrame({'A': factor_A, 'B': factor_B, 'y': dependent_variable})
cell_means = df.pivot_table(values='y', index='A', columns='B', aggfunc='mean').to_numpy()

Calculate Marginal Means:
    
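Calculate the marginal means for Factor A and Factor B. The main effect of each level is the deviation of its marginal mean from the grand mean.

In [8]:
# Calculate the marginal means for Factor A (averaging over the levels of Factor B)
marginal_mean_A = np.mean(cell_means, axis=1)

# Calculate the marginal means for Factor B (averaging over the levels of Factor A)
marginal_mean_B = np.mean(cell_means, axis=0)

# Main effects: deviations of the marginal means from the grand mean
grand_mean = np.mean(cell_means)
main_effect_A = marginal_mean_A - grand_mean
main_effect_B = marginal_mean_B - grand_mean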

Calculate Interaction Effect:

Calculate the interaction effect for each cell by subtracting both marginal means from the cell mean and adding back the grand mean.

In [9]:
# Interaction effect: cell mean - row marginal mean - column marginal mean + grand mean
interaction_effect = (cell_means
                      - marginal_mean_A[:, None]
                      - marginal_mean_B[None, :]
                      + np.mean(cell_means))

Performing Hypothesis Tests:

You can use hypothesis tests to determine if the main effects and interaction effect are statistically significant. For this, you would typically use an ANOVA table or statistical tests such as F-tests.

Here's a simplified example that demonstrates how to calculate main effects and interaction effects using Python:

In [10]:
import numpy as np
import pandas as pd

# Example data (one observation per cell of the 3 x 2 design)
factor_A = np.array(['A1', 'A1', 'A2', 'A2', 'A3', 'A3'])
factor_B = np.array(['B1', 'B2', 'B1', 'B2', 'B1', 'B2'])
dependent_variable = np.array([10, 12, 8, 9, 14, 11])

# Calculate cell means (rows: levels of Factor A, columns: levels of Factor B)
df = pd.DataFrame({'A': factor_A, 'B': factor_B, 'y': dependent_variable})
cell_means = df.pivot_table(values='y', index='A', columns='B', aggfunc='mean').to_numpy()

# Calculate marginal means for Factor A and Factor B, and the grand mean
marginal_mean_A = np.mean(cell_means, axis=1)
marginal_mean_B = np.mean(cell_means, axis=0)
grand_mean = np.mean(cell_means)

# Calculate interaction effect: cell mean - row mean - column mean + grand mean
interaction_effect = cell_means - marginal_mean_A[:, None] - marginal_mean_B[None, :] + grand_mean

print("Cell Means:")
print(cell_means)
print("Marginal Mean for Factor A:", marginal_mean_A)
print("Marginal Mean for Factor B:", marginal_mean_B)
print("Main Effect of Factor A:", marginal_mean_A - grand_mean)
print("Main Effect of Factor B:", marginal_mean_B - grand_mean)
print("Interaction Effect:")
print(interaction_effect)

Remember that for hypothesis testing, you would typically perform ANOVA tests or other relevant statistical tests to assess the significance of the main effects and interaction effect. The F-statistic and p-values are often used for this purpose.

# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

A6 : 
    
In a one-way ANOVA, the F-statistic and its associated p-value are used to assess whether there are significant differences among the means of three or more groups. Here's how to interpret the results you provided:

F-Statistic: The F-statistic is a test statistic that measures the ratio of the variance between groups (explained variance) to the variance within groups (unexplained variance). It quantifies whether the differences in group means are statistically significant.

P-Value: The p-value associated with the F-statistic is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis were true (that is, if all group means were equal). A small p-value indicates evidence against the null hypothesis.

In your case:

F-Statistic = 5.23

p-Value = 0.02

- Hypotheses:

Null Hypothesis (H0): All group means are equal.

Alternative Hypothesis (Ha): At least one group mean differs from the others.

- Interpretation:

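Since the p-value (0.02) is less than the commonly used significance level of 0.05, you would reject the null hypothesis at that level. At a stricter level of 0.01 the result would not be significant, so the significance level should be chosen before running the test.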

Based on the results, you have evidence to conclude that there are statistically significant differences among the group means.

However, the ANOVA test itself does not tell you which specific groups differ from each other. To identify which pairs of groups have significant differences, you need post hoc tests, such as Tukey's HSD (Honestly Significant Difference) test or pairwise comparisons with a Bonferroni correction.

Additionally, you can calculate effect size measures (e.g., eta-squared or partial eta-squared) to quantify the practical significance or strength of the observed differences among the groups.

In summary, with an F-statistic of 5.23 and a p-value of 0.02, you have evidence to conclude that there are significant differences among the groups. However, further analyses or post hoc tests may be needed to determine which specific group(s) differ from one another.
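For a one-way ANOVA, eta-squared can be recovered from the F-statistic and its degrees of freedom. The sketch below is illustrative: the question does not state the degrees of freedom, so `df_between = 2` and `df_within = 27` are hypothetical values (three groups of ten), and the helper name `eta_squared_from_f` is also an assumption.

```python
from scipy import stats

def eta_squared_from_f(f_stat, df_between, df_within):
    """Eta-squared for a one-way ANOVA: SS_between / SS_total, expressed through F."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)

f_stat = 5.23
df_between, df_within = 2, 27  # HYPOTHETICAL degrees of freedom, not given in the question

eta_sq = eta_squared_from_f(f_stat, df_between, df_within)
# p-value implied by these hypothetical degrees of freedom (need not equal the reported 0.02)
p_check = stats.f.sf(f_stat, df_between, df_within)

print(f"eta-squared: {eta_sq:.3f}")
print(f"p-value for F({df_between}, {df_within}) = {f_stat}: {p_check:.4f}")
```

Eta-squared is the proportion of total variance attributable to group membership, so it complements the p-value by describing how large the differences are rather than only whether they are detectable.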


# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

A7:
    
Handling missing data in a repeated measures ANOVA (or any statistical analysis) is crucial to ensure the validity and reliability of your results. Missing data can occur for various reasons, such as participant dropouts, equipment failures, or incomplete responses. There are several methods for dealing with missing data, each with its potential consequences. Here are common approaches and their implications:

- Complete Case Analysis (Listwise Deletion):

In this approach, any case (participant) with missing data on any variable is removed from the analysis.

Pros:

Simple and straightforward.

Analyzes only observed values, so no imputed data enter the analysis.

Cons:

Reduces the sample size, potentially leading to a loss of statistical power.

May introduce bias if the data are not missing completely at random (MCAR), for example if certain types of participants are more likely to have missing data.

- Mean Imputation (a form of Single Imputation):

Each missing value is replaced with the mean of the observed values for that variable.

Pros:

Simple and does not reduce the sample size.

Cons:

Reduces variance, potentially underestimating standard errors.

Assumes that missing data are missing completely at random (MCAR), which is often unrealistic.

Can distort relationships and correlations in the data.

- Linear Interpolation or Last Observation Carried Forward (LOCF):

Linear interpolation estimates missing values based on adjacent time points or carries forward the last observed value.

Pros:

Can provide plausible estimates of missing values for time-series data.

Cons:

Assumes linear trends between observations, which may not be valid.

LOCF assumes that nothing changes after the last observation, which can bias estimates in either direction when there is a trend over time and tends to understate variability.

- Multiple Imputation:

Multiple imputation generates multiple datasets with imputed values and combines results from each dataset to account for uncertainty in imputation.

Pros:

Provides approximately unbiased parameter estimates when the data are missing at random (MAR) and the imputation model is well specified.

Accounts for the uncertainty associated with missing data.

Cons:

Requires specialized software and additional computational resources.

Can be complex to implement correctly.

- Model-Based Imputation:

Imputation methods based on regression models or machine learning techniques to predict missing values.

Pros:

Can provide accurate imputations when relationships between variables are well understood.

Cons:

Requires knowledge and assumptions about the underlying data structure.

Complex and may overfit the imputed values.

The choice of method should depend on the nature of the missing data and the assumptions made about the missing data mechanism (e.g., MCAR, missing at random [MAR], or missing not at random [MNAR]). Linear mixed-effects models are a common alternative to repeated measures ANOVA because they use all available observations without imputation and remain valid under MAR. It is generally advisable to perform sensitivity analyses by comparing several methods to assess how robust your results are to the handling of missing data.

Using inappropriate methods for handling missing data can lead to biased and unreliable results. Therefore, it is essential to carefully consider the nature of your data and the assumptions underlying your chosen imputation method. Consulting with a statistician or data analyst with expertise in missing data handling is often recommended to ensure the appropriate treatment of missing data in your repeated measures ANOVA.
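The simpler strategies above can be compared side by side with pandas. This is a minimal sketch on a made-up wide-format table (one row per subject, one column per time point); the subject labels and values are hypothetical. Multiple imputation and model-based imputation are not shown because they need a fitted imputation model.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measurements: rows are subjects, columns are time points
wide = pd.DataFrame(
    {
        "t1": [120.0, 130.0, 125.0, 140.0],
        "t2": [118.0, np.nan, 121.0, 135.0],
        "t3": [115.0, 126.0, np.nan, np.nan],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis: keep only subjects with no missing time points
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with that time point's observed mean
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's last observed value forward across time points
locf = wide.ffill(axis=1)

# Linear interpolation within each subject, filling only gaps between observed values
interpolated = wide.T.interpolate(method="linear", limit_area="inside").T

print("Subjects kept by listwise deletion:", list(complete_cases.index))
print("Variance at t2, observed vs. mean-imputed:",
      wide["t2"].var(), mean_imputed["t2"].var())
print(locf)
print(interpolated)
```

In this toy table listwise deletion keeps a single subject, and mean imputation shrinks the variance at `t2`, which illustrates the loss of power and the underestimated variability described above.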

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

A8:
    
Post-hoc tests are used after conducting an Analysis of Variance (ANOVA) to compare group means and identify specific pairwise differences between groups when the ANOVA indicates that there are significant differences among three or more groups. Common post-hoc tests include:

- Tukey's Honestly Significant Difference (Tukey's HSD):

When to Use: Tukey's HSD is designed for comparing all pairs of group means while controlling the familywise error rate (the probability of making at least one Type I error across all comparisons). It assumes roughly equal variances across groups.

Example: In a one-way ANOVA comparing the performance of four different training methods, you use Tukey's HSD to determine which specific pairs of training methods have significantly different means.

- Bonferroni Correction:

When to Use: Bonferroni correction is a simple, general-purpose approach that controls the familywise error rate by dividing the desired significance level (e.g., 0.05) by the number of comparisons being made. It becomes increasingly conservative as the number of comparisons grows.

Example: When conducting multiple pairwise comparisons in a one-way ANOVA with four groups, you would use Bonferroni correction to adjust the significance level for each comparison.

- Scheffé's Method:

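When to Use: Scheffé's method controls the familywise error rate across all possible contrasts, not just pairwise comparisons. For pairwise comparisons alone it is more conservative than Tukey's HSD, so it is most useful when you want to test complex contrasts (for example, one group against the average of several others). It can be used with unequal sample sizes.

Example: In a one-way ANOVA with unequal group sizes, you want to compare a control group against the average of three treatment groups in addition to the pairwise comparisons. Scheffé's method can be used to control the error rate across all of these contrasts.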

- Duncan's Multiple Range Test (MRT):

When to Use: Duncan's MRT is less conservative and more powerful than the methods above, and it assumes relatively homogeneous group variances. It does not control the familywise error rate, so it produces more false positives as the number of groups increases.

Example: In a one-way ANOVA comparing the yields of five different crop fertilizers, Duncan's MRT can help identify which fertilizers have significantly different yields.

- Holm-Bonferroni Method:

When to Use: Holm-Bonferroni adjusts for multiple comparisons in a way that is less conservative than Bonferroni correction but still controls the familywise error rate.

Example: In a repeated measures ANOVA comparing the effects of three different interventions at multiple time points, the Holm-Bonferroni method can be used to control for multiple comparisons.

- Games-Howell Test:

When to Use: The Games-Howell test is suitable when group variances are unequal and sample sizes are different. It does not assume homogeneity of variances.

Example: In a one-way ANOVA comparing the performance of several different schools with varying student populations, you can use the Games-Howell test to compare schools with different sample sizes and variances.

- Example Situation Requiring a Post-hoc Test:
Suppose you are conducting a one-way ANOVA to compare the effectiveness of four different advertising strategies on product sales. The ANOVA indicates that there are significant differences among the strategies. In this scenario, you would perform a post-hoc test (e.g., Tukey's HSD, Bonferroni, or Scheffé's method) to determine which specific pairs of advertising strategies have significantly different impacts on sales. This helps you identify which strategies are more effective than others and make informed decisions for your advertising campaign.
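The advertising scenario above can be sketched with pairwise t-tests followed by Bonferroni and Holm-Bonferroni adjustments, implemented directly with NumPy and SciPy. The sales figures and strategy names are hypothetical, and pairwise t-tests are only one possible choice of post-hoc procedure; Tukey's HSD would be an equally reasonable option here.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical sales figures for four advertising strategies (illustrative only)
sales = {
    "strategy_1": np.array([20.0, 22.0, 19.0, 24.0, 21.0]),
    "strategy_2": np.array([28.0, 30.0, 27.0, 31.0, 29.0]),
    "strategy_3": np.array([21.0, 20.0, 23.0, 22.0, 19.0]),
    "strategy_4": np.array([35.0, 33.0, 36.0, 34.0, 37.0]),
}

# All pairwise comparisons: 4 groups give 6 pairs
pairs = list(combinations(sales, 2))
raw_p = np.array([stats.ttest_ind(sales[a], sales[b])[1] for a, b in pairs])
m = len(pairs)

# Bonferroni: multiply every p-value by the number of comparisons (capped at 1)
bonferroni_p = np.minimum(raw_p * m, 1.0)

# Holm-Bonferroni: multiply the i-th smallest p-value by (m - i), i = 0..m-1,
# then enforce that adjusted p-values never decrease along the sorted order
order = np.argsort(raw_p)
holm_sorted = np.maximum.accumulate(raw_p[order] * (m - np.arange(m)))
holm_p = np.empty(m)
holm_p[order] = np.minimum(holm_sorted, 1.0)

for (a, b), p_b, p_h in zip(pairs, bonferroni_p, holm_p):
    print(f"{a} vs {b}: Bonferroni p = {p_b:.4f}, Holm p = {p_h:.4f}")
```

The Holm-adjusted p-values are never larger than the Bonferroni-adjusted ones, which is why Holm-Bonferroni is described above as the less conservative of the two.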

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [18]:
import numpy as np
import scipy.stats as stats
# Example data 
diet_A = np.array([2.1, 1.8, 2.5, 1.9, 2.3])  # Weight loss for Diet A
diet_B = np.array([1.5, 1.7, 1.3, 1.8, 1.4])  # Weight loss for Diet B
diet_C = np.array([2.8, 2.6, 2.9, 3.1, 2.7])  # Weight loss for Diet C
# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
# Print the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)
# Interpret the results
alpha = 0.05  # Set the significance level
if p_value < alpha:
    print("The p-value is less than the significance level (alpha).")
    print("There are significant differences between the mean weight loss of the three diets.")
else:
    print("The p-value is greater than the significance level (alpha).")
    print("There is no evidence of significant differences between the mean weight loss of the three diets.")

F-Statistic: 38.03703703703701
p-value: 6.397333028465909e-06
The p-value is less than the significance level (alpha).
There are significant differences between the mean weight loss of the three diets.

Interpretation: With an F-statistic of about 38.04 and a p-value of about 6.4e-06, the null hypothesis that all three diets produce the same mean weight loss is rejected at the 0.05 level. Note that this example uses five illustrative values per diet rather than the 50 participants described in the question. The ANOVA does not show which diets differ, so a post-hoc test would be needed to identify the specific pairs.


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [15]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Example data
data = pd.DataFrame({
    'Software': ['A', 'B', 'C'] * 30,
    'Experience': ['Novice', 'Experienced'] * 45,
    'Time': np.random.normal(30, 5, 90)  # Random time data
})
# Fit the two-way ANOVA model
formula = 'Time ~ C(Software) + C(Experience) + C(Software):C(Experience)'
model = ols(formula, data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
# Print the ANOVA table
print(anova_table)
# Interpret the results
alpha = 0.05  # Set the significance level
# Main effects of Software and Experience
p_software = anova_table['PR(>F)']['C(Software)']
p_experience = anova_table['PR(>F)']['C(Experience)']
# Interaction effect
p_interaction = anova_table['PR(>F)']['C(Software):C(Experience)']

if p_software < alpha:
    print("There is a significant main effect of Software.")
else:
    print("There is no significant main effect of Software.")

if p_experience < alpha:
    print("There is a significant main effect of Experience.")
else:
    print("There is no significant main effect of Experience.")

if p_interaction < alpha:
    print("There is a significant interaction effect between Software and Experience.")
else:
    print("There is no significant interaction effect between Software and Experience.")

                                sum_sq    df         F    PR(>F)
C(Software)                  30.597812   2.0  0.585784  0.558927
C(Experience)                51.740366   1.0  1.981103  0.162964
C(Software):C(Experience)    13.197384   2.0  0.252659  0.777321
Residual                   2193.824194  84.0       NaN       NaN
There is no significant main effect of Software.
There is no significant main effect of Experience.
There is no significant interaction effect between Software and Experience.


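This code tests for main effects of software program and experience level, and for their interaction, on task completion time. Because the example times are drawn at random from a single normal distribution with no seed set, there are no true effects by construction and the table will differ from run to run. The example also generates 90 rows rather than the 30 employees described in the question. With the output shown here, none of the three p-values is below 0.05, so there is no evidence of a main effect or an interaction.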

# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [17]:
import numpy as np
import scipy.stats as stats
# Example data
control_group_scores = np.array([85, 88, 92, 78, 90,])  # Test scores for control group
experimental_group_scores = np.array([95, 92, 98, 88, 96,])  # Test scores for experimental group
# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)
# Print the results of the t-test
print("Two-Sample T-Test Results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)
# Set the significance level (alpha)
alpha = 0.05
# Interpret the t-test results
if p_value < alpha:
    print("The p-value is less than the significance level (alpha).")
    print("There is a significant difference in test scores between the two groups.")
    print("With only two groups, no post-hoc test is needed: the t-test already compares the only pair of groups.")
else:
    print("The p-value is greater than or equal to the significance level (alpha).")
    print("There is no significant difference in test scores between the two groups.")
# Post-hoc tests (e.g., Tukey's HSD or Bonferroni) are only needed when comparing three or more groups.

Two-Sample T-Test Results:
t-statistic: -2.400000000000001
p-value: 0.0431767278278466
The p-value is less than the significance level (alpha).
There is a significant difference in test scores between the two groups.
With only two groups, no post-hoc test is needed: the t-test already compares the only pair of groups.


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [19]:
import numpy as np
import scipy.stats as stats
# Example data 
store_A_sales = np.array([1000, 950, 1100, 1050])  # Sales data for Store A
store_B_sales = np.array([900, 850, 920, 930])    # Sales data for Store B
store_C_sales = np.array([1200, 1250, 1180, 1220])  # Sales data for Store C
# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(store_A_sales, store_B_sales, store_C_sales)
# Print the results of the one-way ANOVA
print("One-Way ANOVA Results:")
print("F-Statistic:", f_statistic)
print("p-value:", p_value)
# Set the significance level (alpha)
alpha = 0.05
# Interpret the one-way ANOVA results
if p_value < alpha:
    print("The p-value is less than the significance level (alpha).")
    print("There are significant differences in daily sales between the three stores.")
    print("You can proceed with post-hoc tests to determine which store(s) differ significantly.")
else:
    print("The p-value is greater than or equal to the significance level (alpha).")
    print("There is no significant difference in daily sales between the three stores.")
# If the one-way ANOVA is significant, you can proceed with post-hoc tests (e.g., Tukey's HSD or Bonferroni) to identify specific store differences.

One-Way ANOVA Results:
F-Statistic: 46.936758893280626
p-value: 1.7327070140008526e-05
The p-value is less than the significance level (alpha).
There are significant differences in daily sales between the three stores.
You can proceed with post-hoc tests to determine which store(s) differ significantly.
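
The cell above runs a one-way ANOVA, which treats the three stores as independent groups. Since the question describes the same days being recorded for every store, a repeated measures analysis that removes day-to-day variation from the error term may be more appropriate. The sketch below computes it by hand with NumPy and SciPy, under the assumption (not stated in the example data) that entry `i` of each array belongs to the same day. It does not check the sphericity assumption, and the follow-up uses paired t-tests with a Bonferroni correction as one possible post-hoc choice.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Same example values as above; entry i of each array is ASSUMED to be the same day
store_A_sales = np.array([1000, 950, 1100, 1050], dtype=float)
store_B_sales = np.array([900, 850, 920, 930], dtype=float)
store_C_sales = np.array([1200, 1250, 1180, 1220], dtype=float)

# Rows are days (the repeated "subjects"), columns are stores (the within factor)
sales = np.column_stack([store_A_sales, store_B_sales, store_C_sales])
n_days, n_stores = sales.shape
grand_mean = sales.mean()

# Partition the total sum of squares into stores, days, and error
ss_total = np.sum((sales - grand_mean) ** 2)
ss_stores = n_days * np.sum((sales.mean(axis=0) - grand_mean) ** 2)
ss_days = n_stores * np.sum((sales.mean(axis=1) - grand_mean) ** 2)
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_stat = (ss_stores / df_stores) / (ss_error / df_error)
p_value = stats.f.sf(f_stat, df_stores, df_error)

print("Repeated Measures ANOVA Results:")
print("F-Statistic:", f_stat)
print("p-value:", p_value)

# Post-hoc: paired t-tests between stores, Bonferroni-adjusted for 3 comparisons
labels = ["A", "B", "C"]
posthoc = {}
for i, j in combinations(range(n_stores), 2):
    _, p = stats.ttest_rel(sales[:, i], sales[:, j])
    posthoc[(labels[i], labels[j])] = min(p * 3, 1.0)

for pair, p_adj in posthoc.items():
    print(f"Store {pair[0]} vs Store {pair[1]}: adjusted p = {p_adj:.4f}")
```

With only four days in the example, the error term has few degrees of freedom, so the results are illustrative rather than conclusive.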
