Answer 1:-

ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups or treatments to determine if there are any significant differences among them.

To use ANOVA and ensure the validity of its results, certain assumptions need to be met. These assumptions are :

Independence : The observations are independent of each other, both within each group and across groups. This means that the value of one observation should not influence or be related to the value of any other observation.

Normality : The data in each group should follow a normal distribution. The normality assumption is particularly important when the sample sizes are small, as ANOVA tends to be robust to violations when the sample sizes are large.

Homogeneity of Variance (Homoscedasticity) : The variability of the data (variance) within each group should be roughly equal. In other words, the spread of the data points around the group means should be similar across all groups.

Absence of outliers : There should be no extreme outliers in the data, because they can distort the group means and variances.

Examples of Violations:

Violation of Independence: In some experimental designs, the independence assumption may be violated if there is a hierarchical or nested structure in the data. For example, if you measure the performance of students within different classrooms, the students within the same classroom may not be independent of each other due to shared characteristics or teaching styles.

Violation of Normality: If the data in any of the groups deviates significantly from a normal distribution, it can impact the validity of ANOVA results. For instance, if the data is strongly skewed or has heavy tails, the normality assumption may not hold.

Violation of Homoscedasticity: Unequal variances among groups can lead to biased ANOVA results. For example, if the variability of test scores in one group is much larger than that in another group, the assumption of homogeneity of variance may not be met.
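The assumptions above can be checked before running ANOVA. The following is a minimal sketch on simulated data, not a definitive procedure: the group means, sample sizes, and the 1.5 * IQR outlier rule are assumptions chosen for illustration. It uses the Shapiro-Wilk test for normality and Levene's test for equal variances from scipy.

```python
import numpy as np
from scipy import stats

# Simulated, hypothetical data: three groups drawn with the same spread
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=40) for mu in (5.0, 5.5, 6.0)]

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal)
shapiro_p = []
for g in groups:
    _, p = stats.shapiro(g)
    shapiro_p.append(p)

# Homogeneity of variance: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups)

# Outliers: count points further than 1.5 * IQR from the quartiles
def count_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return int(np.sum((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)))

outlier_counts = [count_outliers(g) for g in groups]

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
print("Outliers per group:", outlier_counts)
```

A small p-value in either test suggests the corresponding assumption may not hold. Independence cannot be tested this way; it has to be judged from how the data were collected.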

Answer 2:-

The three types of ANOVA (Analysis of Variance) are:

One-Way ANOVA:

One-Way ANOVA is used when there is one categorical independent variable (also known as a factor) with three or more levels or groups, and we want to compare the means of a continuous dependent variable across these groups.
It is suitable for situations where we have one factor and want to determine if there are any significant differences in the means of the dependent variable across the different levels of that factor.

    
Two-Way ANOVA:

Two-Way ANOVA is used when there are two categorical independent variables (factors), and we want to examine the interaction between these two factors and their effects on a continuous dependent variable.
It is suitable for situations where we have two factors, and we want to investigate how the means of the dependent variable vary across the combinations of levels of both factors.

    
Three-Way ANOVA:

Three-Way ANOVA is an extension of the two-way ANOVA and is used when there are three categorical independent variables (factors).
It is suitable for situations where we have three factors, and we want to examine their individual effects and their interactions on a continuous dependent variable.

Answer 3:-

Partitioning of variance in ANOVA refers to the division of the total variability in the data into different components that are associated with various sources of variation.

In ANOVA, the total variability in the data is broken down into two components, between-group and within-group, which together add up to the total:

Between-Group Variance (Between-Treatments Variance): This component represents the variability in the dependent variable that can be attributed to the differences between the groups or treatments (levels of the independent variable). It measures the effect of the factors (independent variables) on the dependent variable. 

Within-Group Variance (Within-Treatments Variance or Residual Variance): This component represents the variability in the dependent variable that cannot be explained by the differences between the groups. It accounts for the random or unexplained variation within each group. It measures the variability of data points within each group around the group mean. 

Total Variance: This is the overall variability in the data. Measured as sums of squares, it equals the between-group sum of squares plus the within-group sum of squares. It represents the total variation in the dependent variable across all groups and treatments.
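For a one-way design, the partition and the test statistic built from it can be written compactly. The symbols here are introduced for illustration: k is the number of groups and N is the total number of observations.

```latex
\[
SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}},
\qquad
F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
\]
```

A large F means the variability between group means is large relative to the variability within groups.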

Answer 4:-

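Calculate the Total Sum of Squares (SST):
SST represents the total variability in the dependent variable.
It is the sum of squared differences between the individual data points ($y_i$) and the overall mean of the response variable ($\bar{y}$).

$$SST = \sum_i (y_i - \bar{y})^2$$

Calculate the Explained Sum of Squares (SSE):
SSE represents the variability in the dependent variable that can be attributed to the differences between the group means (treatments).
It is the sum of squared differences between the predicted values ($\hat{y}_i$, which in one-way ANOVA is the mean of the group that observation $i$ belongs to) and the overall mean of the response variable ($\bar{y}$).

$$SSE = \sum_i (\hat{y}_i - \bar{y})^2$$

Calculate the Residual Sum of Squares (SSR):
SSR represents the unexplained variability in the dependent variable within each group.
It is the sum of squared differences between the observed data points ($y_i$) and the predicted values ($\hat{y}_i$).

$$SSR = \sum_i (y_i - \hat{y}_i)^2$$

The three quantities are related by $SST = SSE + SSR$. Note that some textbooks swap the abbreviations (SSE for the error sum of squares and SSR for the regression sum of squares); here E stands for explained and R for residual.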
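The sums of squares can be computed directly with NumPy. The sketch below uses a small made-up dataset (three groups of five values) and checks that the explained and residual parts add up to the total, and that the resulting F statistic agrees with scipy.stats.f_oneway. Here "explained" means between groups and "residual" means within groups.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of five observations each
groups = [
    np.array([4.0, 5.0, 6.0, 5.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0, 7.0]),
    np.array([9.0, 8.0, 10.0, 9.0, 9.0]),
]

y = np.concatenate(groups)
grand_mean = y.mean()

# In one-way ANOVA the predicted value of an observation is its group mean
y_hat = np.concatenate([np.full(len(g), g.mean()) for g in groups])

ss_total = np.sum((y - grand_mean) ** 2)
ss_explained = np.sum((y_hat - grand_mean) ** 2)  # between groups
ss_residual = np.sum((y - y_hat) ** 2)            # within groups

# F statistic from the mean squares
k, n = len(groups), len(y)
f_manual = (ss_explained / (k - 1)) / (ss_residual / (n - k))
f_scipy, p_scipy = stats.f_oneway(*groups)

print("Total:", ss_total)
print("Explained:", ss_explained)
print("Residual:", ss_residual)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```

For this dataset the group means are 5, 7, and 9 with a grand mean of 7, so the explained part is 40, the residual part is 6, and the total is 46.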

Answer 5:-

In [4]:
import pandas as pd
import numpy as np

In [5]:
# Importing libraries
import statsmodels.api as sm
from statsmodels.formula.api import ols
  
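# Create a dataframe
# Note: Fertilizer and Watering take identical values in every row, so the two
# factors are perfectly confounded and their effects cannot be separated.
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11,
                                     14, 15, 16, 16, 17, 18, 14, 13, 
                                     14, 14, 14, 15, 16, 16, 17, 18,
                                     14, 13, 14, 14, 14, 15]})
  
  
# Performing two-way ANOVA
# anova_lm returns Type I (sequential) sums of squares by default; pass typ=2 for Type II
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',data=dataframe).fit()
result = sm.stats.anova_lm(model)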
  
# Print the result
print(result)

model.summary()

                             df     sum_sq   mean_sq         F    PR(>F)
C(Fertilizer)               1.0   0.033333  0.033333  0.012069  0.913305
C(Watering)                 1.0   0.092308  0.092308  0.033422  0.856260
C(Fertilizer):C(Watering)   1.0   0.057692  0.057692  0.020889  0.886118
Residual                   28.0  77.333333  2.761905       NaN       NaN


Dep. Variable:     height             R-squared:            0.0
Model:             OLS                Adj. R-squared:       -0.035
Method:            Least Squares      F-statistic:          0.01207
Date:              Tue, 03 Sep 2024   Prob (F-statistic):   0.913
Time:              07:33:04           Log-Likelihood:       -56.772
No. Observations:  30                 AIC:                  117.5
Df Residuals:      28                 BIC:                  120.3
Df Model:          1
Covariance Type:   nonrobust

                                                   coef  std err       t  P>|t|  [0.025  0.975]
Intercept                                       14.8000    0.429  34.491  0.000  13.921  15.679
C(Fertilizer)[T.weekly]                         -0.0222    0.202  -0.110  0.913  -0.437   0.392
C(Watering)[T.weekly]                           -0.0222    0.202  -0.110  0.913  -0.437   0.392
C(Fertilizer)[T.weekly]:C(Watering)[T.weekly]   -0.0222    0.202  -0.110  0.913  -0.437   0.392

Omnibus:         0.177   Durbin-Watson:      0.916
Prob(Omnibus):   0.915   Jarque-Bera (JB):   0.011
Skew:            0.029   Prob(JB):           0.995
Kurtosis:        2.929   Cond. No.           1.33e+32


Answer 6:-

The p-value is the probability of obtaining results at least as extreme as those observed, assuming that there are no true differences between the group means (the null hypothesis). A smaller p-value means the observed differences would be unlikely if the null hypothesis were true.

Interpretation:

Since the p-value (0.02) is less than the significance level α = 0.05 (assumed here because it is the conventional choice), we reject the null hypothesis.
This means that there is sufficient evidence to conclude that at least one group mean differs from the others. The test does not say which groups differ; that requires a post-hoc test.
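The p-value of an F-test is the upper tail area of the F distribution beyond the observed statistic. As a sketch, the numbers below are hypothetical (an F statistic of 4.5 from 3 groups and 30 observations); they are chosen only because they give a p-value close to 0.02.

```python
from scipy import stats

# Hypothetical values, not taken from a real analysis
f_statistic = 4.5
df_between = 2    # number of groups - 1
df_within = 27    # number of observations - number of groups
alpha = 0.05

# Survival function = P(F >= f_statistic) under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

print("p-value:", round(p_value, 4))
print("Reject H0:", p_value < alpha)
```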

Answer 7:-

There are several methods to handle missing data in a repeated measures ANOVA:

Complete Case Analysis (Listwise Deletion): This method involves excluding any case with missing data in any of the variables being analyzed. While it is the simplest approach, it can lead to biased results if the data is not missing completely at random. It can also reduce the sample size and statistical power, potentially leading to less reliable results.

Mean Imputation: In this method, missing data in a variable are replaced by the mean of that variable from the observed cases. While this is a straightforward approach, it may distort the distribution of the variable and underestimate the standard error, leading to overly optimistic statistical significance.

Last Observation Carried Forward (LOCF): LOCF imputes missing data with the value of the last observed data point for the same subject. This method assumes that the outcome stays unchanged after the last observation, which may not be appropriate when values are trending over time and can bias the estimated effects.

Multiple Imputation: Multiple imputation involves creating multiple plausible imputations for the missing data, incorporating uncertainty in the imputation process. This approach can provide more reliable estimates and standard errors. However, it can be computationally intensive and may require making assumptions about the missing data mechanism.

Maximum Likelihood Estimation (MLE): MLE is a statistical approach that estimates parameters by maximizing the likelihood function. In the context of missing data, it allows for the use of all available data and provides unbiased estimates under the assumption that data is missing at random.

Pattern-Mixture Models: These models involve considering different patterns of missingness and fitting separate models for each pattern. This approach can be complex but may provide more accurate estimates when the missing data mechanism is related to the outcome.
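The three simplest methods above can be sketched in a few lines of pandas. The table is a made-up wide-format dataset (one row per subject, one column per time point) with two missing values; the column and subject names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: rows are subjects, columns are time points
wide = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 15.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Complete case analysis: drop every subject with any missing value
complete_cases = wide.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's last observed value forward along the time axis
locf = wide.ffill(axis=1)

print(complete_cases)
print(mean_imputed)
print(locf)
```

Listwise deletion keeps only two of the four subjects here, which shows how quickly it can shrink the sample. Multiple imputation and likelihood-based methods need dedicated modelling tools and are not shown.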

Answer 8:-

Some common post-hoc tests are:

Tukey's Honestly Significant Difference (HSD): Tukey's HSD test is widely used when comparing all possible pairs of group means. It controls the familywise error rate, keeping the probability of at least one false positive across all comparisons at the desired level (typically 0.05). The test assumes homogeneity of variances and is designed for equal group sizes; the Tukey-Kramer variant extends it to unequal group sizes.

Bonferroni Correction: The Bonferroni correction is a simple method to adjust the significance level for multiple comparisons. It divides the desired alpha level (usually 0.05) by the number of comparisons being made. This method is more conservative but can be applied to any set of comparisons.
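A Bonferroni-corrected set of pairwise comparisons can be sketched as follows. The three groups are simulated (group C is given a higher mean on purpose), and plain two-sample t-tests are used for each pair; both choices are assumptions for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Simulated, hypothetical groups: A and B share a mean, C is shifted upward
rng = np.random.default_rng(3)
samples = {
    "A": rng.normal(5, 1, 30),
    "B": rng.normal(5, 1, 30),
    "C": rng.normal(7, 1, 30),
}

alpha = 0.05
pairs = list(combinations(samples, 2))   # (A, B), (A, C), (B, C)
adjusted_alpha = alpha / len(pairs)      # Bonferroni: divide alpha by the number of tests

comparisons = {}
for g1, g2 in pairs:
    _, p = stats.ttest_ind(samples[g1], samples[g2])
    comparisons[(g1, g2)] = (p, p < adjusted_alpha)
    print(f"{g1} vs {g2}: p = {p:.4g}, significant = {p < adjusted_alpha}")
```

Each pairwise p-value is compared against 0.05 / 3 rather than 0.05, which is what makes the procedure conservative as the number of comparisons grows.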

Answer 9:-

In [7]:
import numpy as np
import scipy.stats as stats

# Generate simulated data assuming normal distribution with same variance
np.random.seed(20)
diet_A = np.random.normal(5, 1, 50)
diet_B = np.random.normal(4, 1, 50)
diet_C = np.random.normal(3, 1, 50)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Set significance level
alpha = 0.05

# Null hypothesis: The mean weight loss is the same for all three diets.
# Alternative hypothesis: The mean weight loss is different for at least one diet.
null_hypothesis = "The mean weight loss is the same for all three diets."
alternate_hypothesis = "The mean weight loss is different for at least one diet."

print("F-statistic:", f_statistic)
print("p-value:", p_value)
if p_value < alpha:
    print("We reject the null hypothesis.")
    print(f"Final Conclusion : {alternate_hypothesis}")
else:
    print("We fail to reject the null hypothesis.")
    print(f"Final Conclusion : {null_hypothesis}")

F-statistic: 41.80444706032352
p-value: 4.2309140010930765e-15
We reject the null hypothesis.
Final Conclusion : The mean weight loss is different for at least one diet.


Answer 10:-

In [8]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate random data for the example (replace this with your actual data)
np.random.seed(42)

# Software programs: A, B, C
software_programs = np.random.choice(['A', 'B', 'C'], size=90)

# Employee experience level: Novice, Experienced
experience_level = np.random.choice(['Novice', 'Experienced'], size=90)

# Random time data for each combination of program and experience level
time_to_complete_task = np.random.normal(loc=20, scale=5, size=90)

# Create a DataFrame
data = pd.DataFrame({'Software': software_programs,
                     'ExperienceLevel': experience_level,
                     'Time': time_to_complete_task})

# Perform the two-way ANOVA
model = ols('Time ~ C(Software) + C(ExperienceLevel) + C(Software):C(ExperienceLevel)', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Report the results
print(anova_table)

                                  df       sum_sq    mean_sq         F  \
C(Software)                      2.0     9.309580   4.654790  0.216246   
C(ExperienceLevel)               1.0    31.851905  31.851905  1.479736   
C(Software):C(ExperienceLevel)   2.0    52.479686  26.239843  1.219018   
Residual                        84.0  1808.132913  21.525392       NaN   

                                  PR(>F)  
C(Software)                     0.805984  
C(ExperienceLevel)              0.227223  
C(Software):C(ExperienceLevel)  0.300694  
Residual                             NaN  


Answer 11:-

In [9]:
import numpy as np
import scipy.stats as stats

# Generate random data for the example (replace this with your actual data)
np.random.seed(42)

control_group = np.random.normal(loc=75, scale=5, size=100)
experimental_group = np.random.normal(loc=80, scale=6, size=100)

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Report the results
print("Two-sample t-test:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level

if p_value < alpha:
    print("There is a significant difference in test scores between the control and experimental groups.")
else:
    print("There is no significant difference in test scores between the control and experimental groups.")

Two-sample t-test:
t-statistic: -7.738786904885968
p-value: 5.026085102727666e-13
There is a significant difference in test scores between the control and experimental groups.


Answer 12:-

In [10]:
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine the test scores and group information into a DataFrame
data = pd.DataFrame({'Test_Score': np.concatenate([control_group, experimental_group]),
                     'Group': ['Control'] * 100 + ['Experimental'] * 100})

# Perform Tukey's HSD post-hoc test
tukey_results = pairwise_tukeyhsd(data['Test_Score'], data['Group'])

# Report the results
print("\nTukey's HSD post-hoc test:")
print(tukey_results)


Tukey's HSD post-hoc test:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   5.6531   0.0 4.2125 7.0936   True
--------------------------------------------------------


Answer 13:-

In [11]:
import numpy as np
import pandas as pd
import scipy.stats as stats

# Generate random data for the example (replace this with your actual data)
np.random.seed(42)

store_A_sales = np.random.normal(loc=1000, scale=100, size=30)
store_B_sales = np.random.normal(loc=950, scale=90, size=30)
store_C_sales = np.random.normal(loc=1100, scale=110, size=30)

# Combine the sales data and group information into a DataFrame
data = pd.DataFrame({'Sales': np.concatenate([store_A_sales, store_B_sales, store_C_sales]),
                     'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30})

# Perform one-way ANOVA
# Note: stats.f_oneway treats the three stores as independent groups;
# it is not a repeated measures test.
F_statistic, p_value = stats.f_oneway(store_A_sales, store_B_sales, store_C_sales)

# Report the results
print("One-way ANOVA (independent groups):")
print("F-statistic:", F_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level

if p_value < alpha:
    print("There is a significant difference in average daily sales between the three stores.")
else:
    print("There is no significant difference in average daily sales between the three stores.")


One-way ANOVA (independent groups):
F-statistic: 23.62763182315457
p-value: 6.36905489476218e-09
There is a significant difference in average daily sales between the three stores.
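Note that `stats.f_oneway` treats the three stores as independent groups. If the same days are observed in every store, so that each day acts as the repeated "subject", a repeated measures analysis may be more appropriate. The sketch below uses `AnovaRM` from statsmodels on simulated long-format data. The shared day effect, the noise levels, and the column names (Day, Store, Sales) are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)
n_days = 30

# Hypothetical shared day-to-day variation that affects all stores alike
day_effect = rng.normal(0, 50, n_days)
store_means = {"A": 1000, "B": 950, "C": 1100}

# Long format: one row per (day, store) observation
rows = []
for store, mu in store_means.items():
    sales = mu + day_effect + rng.normal(0, 60, n_days)
    for day, value in enumerate(sales):
        rows.append({"Day": day, "Store": store, "Sales": value})
long_df = pd.DataFrame(rows)

# Day is the repeated unit, Store is the within-subject factor
rm_result = AnovaRM(long_df, depvar="Sales", subject="Day", within=["Store"]).fit()
print(rm_result.anova_table)
```

`AnovaRM` requires a balanced design with exactly one observation per day and store, which the simulated data satisfy.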
