In [1]:
# Ques 1
# Analysis of Variance (ANOVA) is a statistical method used to compare means among three or more groups. It is based on several assumptions, and violations of these assumptions can impact the validity of ANOVA results. The key assumptions for using ANOVA are:

#1.> Independence: The observations within each group should be independent of each other. Violations could occur if observations within a group are correlated, such as in repeated measures designs where the same subjects are measured over time.

#2.> Normality: The distribution of the residuals (the differences between observed values and predicted values) should be approximately normal for each group. Violations can occur if the residuals are significantly skewed or have heavy tails, leading to inaccurate p-values and confidence intervals.

#3.> Homogeneity of Variance (Homoscedasticity): The variance of the residuals should be roughly constant across all groups. Violations could lead to unequal variability between groups, which may affect the ANOVA F-test and subsequent post hoc tests.

#4.> Additivity (No Unmodeled Interaction): This assumption applies to two-way (or higher-way) ANOVA models that are fitted without an interaction term. It assumes that the effect of each factor on the dependent variable is the same at every level of the other factors. (Strictly speaking, "homogeneity of regression slopes" is the analogous assumption in ANCOVA, where a continuous covariate is involved.) Violations can result in misleading interpretations of main effects and interactions.

# Examples of violations that could impact the validity of ANOVA results:

#1.> Outliers: Outliers can lead to violations of the assumptions, especially normality and homogeneity of variance. They can skew the distribution of residuals and affect the accuracy of the F-test.

#2.> Non-Normality: If the residuals are not normally distributed within groups, the ANOVA results may be invalid. This could occur when the data is heavily skewed or has extreme kurtosis.

#3.> Heteroscedasticity: Unequal variances between groups can lead to inflated or deflated F-statistics, affecting the overall significance of the ANOVA. This violation can arise when the spread of residuals differs among groups.

#4.> Violations of Independence: If observations are not independent within groups, it could lead to pseudo-replication, where the effective sample size is smaller than the actual sample size, potentially affecting the reliability of the results.

#5.> Interactions: In designs with multiple factors, an interaction that is present in the data but left out of the model (or a covariate whose slope differs between groups, in ANCOVA) can distort the interpretation of main effects, leading to erroneous conclusions.
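
# The normality and equal-variance assumptions above can be checked numerically. The sketch below is one possible approach, not the only one: it applies the Shapiro-Wilk test to the pooled residuals and Levene's test to the groups. The three groups are hypothetical values chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three independent groups (illustrative only)
group_a = np.array([23, 27, 29, 31, 32], dtype=float)
group_b = np.array([18, 20, 22, 25, 28], dtype=float)
group_c = np.array([15, 16, 18, 20, 23], dtype=float)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
# A small p-value (e.g. below 0.05) is evidence against the corresponding assumption.
```

# Independence cannot be tested this way; it has to be judged from how the data were collected.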

In [2]:
# Ques 2
# ans - 
# 1.> One-Way ANOVA: One factor with at least 2 levels; the levels are independent (different subjects in each level).
# example - Medication (factor) with 10 mg, 20 mg, and 30 mg as levels.

#2.> Repeated Measures ANOVA: One factor with at least 2 levels; the levels are dependent (the same subjects are measured at every level).
# example - Running (factor) with Day 1, Day 2, and Day 3 as levels.

#3.> Factorial ANOVA: Two or more factors (each with at least 2 levels); the levels can be independent, dependent, or a mix of both.
# example - Running and Gender are two factors; Day 1, Day 2, and Day 3 are the levels of Running, and Male and Female are the levels of Gender.

In [3]:
# Ques 3
# ans - The partitioning of variance in Analysis of Variance (ANOVA) refers to the process of decomposing the total variability in a dataset into different components that can be attributed to different sources of variation. ANOVA aims to determine whether there are statistically significant differences between the means of multiple groups and to quantify the contributions of different factors to the overall variability in the data. This partitioning helps researchers understand the relative importance of various factors in explaining the variation observed in the dependent variable.

# In ANOVA, the total variability in the data is described by three sums of squares, two of which are components of the third:

# 1.> Between-Group Sum of Squares (SSB): This component represents the variability between the group means. It measures how much the group means differ from the overall mean. A larger SSB relative to SSW suggests that the group means differ from each other.

# 2.> Within-Group Sum of Squares (SSW): Also known as the "error" or "residual" sum of squares, this component represents the variability within each group. It measures how much individual data points deviate from their respective group means. A smaller SSW indicates that the data points within each group are relatively similar.

# 3.> Total Sum of Squares (SST): This is the overall variability in the entire dataset, regardless of group membership. It is the sum of the between-group and within-group sums of squares: SST = SSB + SSW.

# The importance of understanding the partitioning of variance in ANOVA includes:

#1.> Interpretation of Results: By decomposing the total variance into its components, researchers can better interpret the sources of variation that contribute to the observed differences between groups. This helps to provide a clearer understanding of the factors that are driving the significant results.

#2.> Effect Size Calculation: The partitioning of variance is crucial for calculating effect sizes, such as eta-squared (η²) or partial eta-squared (η²p). These effect sizes quantify the proportion of total variance explained by the factors of interest, helping researchers assess the practical significance of their findings.

#3.> Research Design and Hypotheses: Understanding the partitioning of variance can guide researchers in designing experiments and formulating hypotheses. For example, if the within-group variance is very high, it might suggest that other variables not included in the analysis are influencing the results.

#4.> Model Assessment: The partitioning of variance provides a basis for assessing the adequacy of the ANOVA model. If the between-group variance is much larger than the within-group variance, it suggests that the groups are indeed different and that the ANOVA model is a good fit for the data.

#5.> Comparative Analysis: ANOVA allows for the comparison of multiple groups simultaneously, which can provide insights into the relative performance or characteristics of different treatments, conditions, or groups.
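
# As a small illustration of the partition and of the eta-squared effect size mentioned above, the sketch below checks SST = SSB + SSW on a tiny dataset and computes eta-squared as SSB / SST. The three groups are hypothetical values picked so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical data: three groups of three observations each
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([10.0, 11.0, 12.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares: group size times squared distance of group mean from grand mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: squared distance of each value from its own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Total sum of squares: squared distance of each value from the grand mean
sst = ((all_values - grand_mean) ** 2).sum()

# Eta-squared: proportion of total variability explained by group membership
eta_squared = ssb / sst

print("SSB:", ssb, "SSW:", ssw, "SST:", sst)
print("eta-squared:", eta_squared)
```

# Here SSB = 54, SSW = 6 and SST = 60, so eta-squared is 0.9: group membership accounts for 90% of the total variability.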

In [4]:
# Ques 4
#ans --
import numpy as np
from scipy import stats

# Sample data for each group
group1 = np.array([23, 27, 29, 31, 32])
group2 = np.array([18, 20, 22, 25, 28])
group3 = np.array([15, 16, 18, 20, 23])

# Combine data from all groups
all_data = np.concatenate((group1, group2, group3))

# Calculate overall mean
overall_mean = np.mean(all_data)

# Calculate Total Sum of Squares (SST)
sst = np.sum((all_data - overall_mean)**2)

# Calculate group means
group1_mean = np.mean(group1)
group2_mean = np.mean(group2)
group3_mean = np.mean(group3)

# Calculate Explained (between-group) Sum of Squares (SSE)
# Note: SSE here stands for "explained", not "error", sum of squares
sse = (group1_mean - overall_mean)**2 * len(group1) + \
      (group2_mean - overall_mean)**2 * len(group2) + \
      (group3_mean - overall_mean)**2 * len(group3)

# Calculate Residual Sum of Squares (SSR)
ssr = sst - sse

# Calculate degrees of freedom
df_total = len(all_data) - 1
df_groups = 3 - 1  # Number of groups minus 1
df_residual = df_total - df_groups

# Calculate mean square for groups (MS_groups) and for residuals (MS_residual)
ms_groups = sse / df_groups
ms_residual = ssr / df_residual

# Calculate F-statistic
f_statistic = ms_groups / ms_residual

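# Calculate p-value (upper-tail probability of the F distribution;
# sf is numerically more accurate than 1 - cdf for very small p-values)
p_value = stats.f.sf(f_statistic, df_groups, df_residual)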

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Total Sum of Squares (SST): 407.73333333333335
Explained Sum of Squares (SSE): 252.13333333333333
Residual Sum of Squares (SSR): 155.60000000000002
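
# A quick way to sanity-check the manual calculation is to compare it with scipy's built-in one-way ANOVA. The sketch below reuses the same three groups, rebuilds the F-statistic from the sums of squares, and compares it with stats.f_oneway. The variable names are chosen for this sketch only.

```python
import numpy as np
from scipy import stats

group1 = np.array([23, 27, 29, 31, 32])
group2 = np.array([18, 20, 22, 25, 28])
group3 = np.array([15, 16, 18, 20, 23])
groups = [group1, group2, group3]

all_data = np.concatenate(groups)
overall_mean = all_data.mean()

# Manual sums of squares
ss_between = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Manual F-statistic: ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Library result for comparison
f_scipy, p_scipy = stats.f_oneway(group1, group2, group3)

print("Manual F:", f_manual)
print("scipy F: ", f_scipy)
print("p-value: ", p_scipy)
```

# Both routes give an F-statistic of about 9.72, which confirms the decomposition above.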


In [6]:
# Ques 5
import numpy as np
from scipy import stats

# Sample data for each combination of factors A and B
data = np.array([
    [15, 18, 20],
    [12, 14, 17],
    [17, 19, 22],
    [10, 12, 15]
])

# Calculate the overall mean
overall_mean = np.mean(data)

# Calculate the main effects for factors A and B
main_effect_A = np.mean(data, axis=1) - overall_mean
main_effect_B = np.mean(data, axis=0) - overall_mean

# Calculate the interaction effect
interaction_effect = np.zeros_like(data, dtype=float)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        interaction_effect[i, j] = data[i, j] - (main_effect_A[i] + main_effect_B[j] + overall_mean)

# Calculate the Total Sum of Squares (SST)
sst = np.sum((data - overall_mean)**2)

# Calculate the Main Effects Sum of Squares (SSE_main) for factors A and B
sse_main_A = np.sum(main_effect_A**2) * data.shape[1]
sse_main_B = np.sum(main_effect_B**2) * data.shape[0]

# Calculate the Interaction Effect Sum of Squares (SSE_interaction)
sse_interaction = np.sum(interaction_effect**2)

# With only one observation per A-by-B cell, the interaction cannot be
# separated from random error (SST = SS_A + SS_B + SS_interaction exactly),
# so the interaction sum of squares is used as the residual (error) term.
ssr = sse_interaction

# Calculate degrees of freedom
df_total = data.size - 1
df_factor_A = data.shape[0] - 1
df_factor_B = data.shape[1] - 1
df_residual = df_factor_A * df_factor_B  # equals df_total - df_factor_A - df_factor_B

# Calculate mean squares for factors A and B and for the residual
ms_factor_A = sse_main_A / df_factor_A
ms_factor_B = sse_main_B / df_factor_B
ms_residual = ssr / df_residual

# Calculate F-statistics for factors A and B
f_statistic_factor_A = ms_factor_A / ms_residual
f_statistic_factor_B = ms_factor_B / ms_residual

# Calculate p-values for factors A and B (upper tail of the F distribution)
p_value_factor_A = stats.f.sf(f_statistic_factor_A, df_factor_A, df_residual)
p_value_factor_B = stats.f.sf(f_statistic_factor_B, df_factor_B, df_residual)

# A separate F-test for the interaction would require replicated observations in each cell
print("Main Effect A:", main_effect_A)
print("Main Effect B:", main_effect_B)
print("Interaction Effect:", interaction_effect)
print(f"F-statistic for Factor A: {f_statistic_factor_A:.2f}")
print(f"P-value for Factor A: {p_value_factor_A:.2e}")
print(f"F-statistic for Factor B: {f_statistic_factor_B:.2f}")
print(f"P-value for Factor B: {p_value_factor_B:.2e}")


Main Effect A: [ 1.75       -1.58333333  3.41666667 -3.58333333]
Main Effect B: [-2.41666667 -0.16666667  2.58333333]
Interaction Effect: [[-0.25        0.5        -0.25      ]
 [ 0.08333333 -0.16666667  0.08333333]
 [ 0.08333333 -0.16666667  0.08333333]
 [ 0.08333333 -0.16666667  0.08333333]]
F-statistic for Factor A: 361.00
P-value for Factor A: 3.65e-07
F-statistic for Factor B: 301.00
P-value for Factor B: 9.61e-07


In [7]:
# ques 6
# In a one-way ANOVA, the F-statistic is used to test the null hypothesis that all group means are equal against the alternative hypothesis that at least one group mean is different from the others. The p-value associated with the F-statistic is the probability of obtaining an F-statistic at least as large as the observed one, under the assumption that the null hypothesis is true.

# Given an F-statistic of 5.23 and a p-value of 0.02, here's what you can conclude and how you would interpret the results:

#1.> Conclusion about Group Differences:

# The F-statistic of 5.23 indicates that there is some variability in the means of the groups.
# The p-value of 0.02 is less than the commonly used significance level of 0.05 (or 5%).
# Since the p-value is less than 0.05, you would reject the null hypothesis at the 0.05 significance level.
# This means that you have evidence to conclude that there are significant differences between at least some of the groups.
#2.> Interpretation:

# The F-statistic of 5.23 means that the between-group mean square is 5.23 times as large as the within-group mean square.
# The p-value of 0.02 suggests that the probability of observing such a large F-statistic under the assumption of no group differences is only 0.02 (or 2%).
# In practical terms, these results indicate that the groups are not all the same and that there are statistically significant differences between the groups' means.
# However, the ANOVA does not tell you which specific groups are different from each other; it only indicates the presence of differences.
#3.> Further Analysis:

#To determine which specific groups are different from each other, you might consider conducting post hoc tests (e.g., Tukey's HSD, Bonferroni, or others).
# These tests help identify the specific pairs of groups that have significantly different means.
#It's important to note that while the p-value is a valuable indicator of the strength of evidence against the null hypothesis, it doesn't provide information about the magnitude of the effect or the practical significance of the differences. Effect size measures (such as eta-squared) can help provide insights into the size of the differences between groups.

# In summary, with an F-statistic of 5.23 and a p-value of 0.02, you can conclude that there are significant differences between the groups' means. However, additional post hoc tests or further analysis would be necessary to determine which specific groups are different from each other.
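
# The link between the F-statistic and the p-value can be sketched in a few lines. The question does not state the degrees of freedom, so the values below (2 between groups, 12 within groups, i.e. three groups of five observations) are an assumption made only so that the numbers can be reproduced; with them, F = 5.23 gives a p-value that rounds to 0.02.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Hypothetical degrees of freedom (not given in the question)
df_between = 2   # number of groups - 1
df_within = 12   # total observations - number of groups

# p-value: upper-tail probability of the F distribution
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F-statistic must exceed this to reject H0 at level alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha

print("p-value:", round(p_value, 4))
print("critical F:", round(f_critical, 3))
print("reject H0:", reject_null)
```

# Comparing the p-value with alpha and comparing the F-statistic with the critical value are two equivalent ways of reaching the same decision.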








In [1]:
# Ques 7 
#ans-- Handling missing data in a repeated measures ANOVA (Analysis of Variance) is an important consideration to ensure the validity and accuracy of your results. There are several methods to handle missing data, each with its own potential consequences. Let's explore the options and their implications:

#1.> Listwise Deletion (Complete Case Analysis):

# In this approach, cases with missing data on any variable are excluded from the analysis.
# Pros: It is straightforward and retains cases with complete data.
# Cons: It can lead to biased results if missing data are not missing completely at random (MCAR). It can also reduce statistical power and result in an unrepresentative sample.

#2.> Pairwise Deletion:

# Cases with missing data for a specific comparison are excluded only from that specific comparison, allowing all other data to be used.
# Pros: Maximizes use of available data and can provide more information than listwise deletion.
# Cons: Can lead to different degrees of freedom for different comparisons, which may complicate interpretation. Similar to listwise deletion, it may introduce bias if missingness is not MCAR.

#3.> Mean Imputation:

# Missing values are replaced with the mean of the available data for the respective variable.
# Pros: Simple to implement and can maintain the original sample size.
# Cons: Can lead to underestimation of variability and correlations between variables, as well as distortion of relationships between variables.

#4.> Last Observation Carried Forward (LOCF) or Next Observation Carried Backward (NOCB):

# Missing data are replaced with the last available observation (LOCF) or the next available observation (NOCB).
# Pros: Simple to implement and maintains the time sequence of data.
# Cons: Can introduce bias if the missing data patterns are related to the values themselves, leading to inaccurate estimates.

#5.> Multiple Imputation:

# Missing data are imputed multiple times, creating several complete datasets. Analysis is conducted on each dataset, and results are pooled to obtain final estimates.
# Pros: Provides more accurate estimates by accounting for uncertainty related to missing data. Remains valid when data are missing at random (MAR), not only when they are MCAR.
# Cons: Can be computationally intensive and may require assumptions about the missing data mechanism.

#6.> Model-Based Imputation:

# Imputes missing values based on a model fitted to the observed data.
# Pros: Can produce accurate imputations when the model assumptions hold.
# Cons: Model misspecification can lead to biased results.

#7.> Interpolation or Extrapolation:

# Missing data are estimated using interpolation (for missing data within the observed range) or extrapolation (for missing data outside the observed range).
# Pros: Can be appropriate when the data follow a predictable pattern.
# Cons: Extrapolation can be risky and may lead to unreliable estimates.
# The consequences of using different methods to handle missing data include potential bias, inflated or deflated standard errors, incorrect p-values, and distorted relationships between variables. The choice of method should be guided by the nature of the missing data, the underlying assumptions, and the goals of the analysis. It's important to transparently report the chosen method and any potential limitations associated with it.
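
# Three of the simpler strategies above can be compared side by side with pandas. The sketch below uses a small hypothetical wide-format table (one row per subject, one column per measurement day); the subject labels and scores are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated-measures data: one row per subject, one column per day
scores = pd.DataFrame(
    {
        "day1": [10.0, 12.0, 11.0, 9.0],
        "day2": [11.0, np.nan, 12.0, 10.0],
        "day3": [13.0, 14.0, np.nan, 11.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# 1. Listwise deletion: drop every subject with at least one missing value
listwise = scores.dropna()

# 2. Mean imputation: replace each missing value with that day's mean
mean_imputed = scores.fillna(scores.mean())

# 3. LOCF: carry each subject's last observed value forward across days
locf = scores.ffill(axis=1)

print("Subjects kept after listwise deletion:", list(listwise.index))
print("Day 2 variance, observed values only:", scores["day2"].var())
print("Day 2 variance, after mean imputation:", mean_imputed["day2"].var())
print("LOCF value for s2 on day 2:", locf.loc["s2", "day2"])
```

# Listwise deletion keeps only two of the four subjects, and mean imputation shrinks the day 2 variance, which illustrates the loss of power and the underestimated variability described above.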






In [2]:
# Ques 8 
#ans-- Post-hoc tests are used to make pairwise comparisons between groups after obtaining a significant result in an Analysis of Variance (ANOVA) to determine which specific groups differ from each other. When you have three or more groups and ANOVA indicates a significant overall effect, post-hoc tests help identify which group differences are responsible for that effect. Some common post-hoc tests include:

#1.> Tukey's Honestly Significant Difference (HSD):

# Use when you want to control the familywise error rate (the probability of making at least one Type I error across all comparisons) at a specific level (often 0.05).
# Suitable for balanced designs (equal group sizes) and when you have a moderate to large number of groups.
# Example: A researcher conducts an experiment comparing the performance of four different teaching methods (A, B, C, D) on students' test scores. ANOVA shows a significant difference between at least some of the groups. Tukey's HSD is used to determine which specific pairs of teaching methods are significantly different from each other.

#2.> Bonferroni Correction:

# Use when you need a more conservative approach to control the familywise error rate across multiple comparisons.
# Divides the desired significance level (e.g., 0.05) by the number of comparisons, resulting in a stricter threshold for individual p-values.
# Example: A clinical trial assesses the effectiveness of three different drug treatments (X, Y, Z) on a particular medical condition. ANOVA suggests a significant difference between treatments. To account for multiple comparisons, Bonferroni correction is applied to compare each treatment to the others while maintaining an overall significance level.

#3.> Dunn's Test:

# Use after a significant Kruskal-Wallis test, i.e., when the normality assumption of ANOVA is doubtful (for example with small, skewed samples); group sizes may be unequal.
# It's a non-parametric post-hoc procedure based on rank sums, usually combined with a multiplicity correction such as Bonferroni or Holm.
# Example: A study examines the effect of different exercise regimes (High Intensity, Moderate Intensity, Low Intensity) on weight loss. Because the assumptions of parametric tests might not hold well, a Kruskal-Wallis test is used instead of ANOVA. It indicates a significant difference, and Dunn's test is employed to make pairwise comparisons.

#4.> Scheffe's Method:

# Use when you want a conservative approach that controls the familywise error rate for all possible contrasts, not only pairwise comparisons. It also works with unequal sample sizes.
# Offers broader protection against Type I errors but is less powerful than other methods for simple pairwise comparisons. Like ANOVA itself, it assumes homogeneous variances.
# Example: An experiment investigates the impact of three different marketing strategies (A, B, C) on sales across different regions. ANOVA reveals a significant effect. Scheffe's method is chosen because the researcher also wants to test complex contrasts, such as strategy A against the average of B and C.

#5.> Fisher's LSD (Least Significant Difference):

# Use when you want a simple and exploratory approach to pairwise comparisons.
# It's less stringent than some other post-hoc tests and might be used when you're exploring potential group differences before conducting more stringent tests.
# Example: An analysis examines the effect of various doses of a drug (5 mg, 10 mg, 20 mg) on blood pressure. ANOVA shows a significant difference. Fisher's LSD is employed for exploratory pairwise comparisons before considering other post-hoc tests.
# Remember, the choice of post-hoc test should be based on the specific characteristics of your data, the research question, and the assumptions of each test. It's important to adjust for multiple comparisons to control the overall Type I error rate and to report the chosen post-hoc test and its results transparently in your analysis.
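
# As a concrete illustration of the Bonferroni idea, the sketch below runs all pairwise independent-samples t-tests and compares each p-value with alpha divided by the number of comparisons. This is a simplified version that uses plain pairwise t-tests rather than the pooled ANOVA error term, and the group data are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical scores for three groups (illustrative only)
groups = {
    "A": np.array([23, 27, 29, 31, 32], dtype=float),
    "B": np.array([18, 20, 22, 25, 28], dtype=float),
    "C": np.array([15, 16, 18, 20, 23], dtype=float),
}

alpha = 0.05
pairs = list(combinations(groups, 2))   # ("A","B"), ("A","C"), ("B","C")
adjusted_alpha = alpha / len(pairs)     # Bonferroni-adjusted threshold

results = {}
for name1, name2 in pairs:
    # Two-sided independent-samples t-test for this pair
    t_stat, p_val = stats.ttest_ind(groups[name1], groups[name2])
    results[(name1, name2)] = (p_val, p_val < adjusted_alpha)

print("Adjusted alpha:", round(adjusted_alpha, 4))
for pair, (p_val, significant) in results.items():
    print(pair, "p =", round(p_val, 4), "significant:", significant)
```

# Each comparison is judged against 0.05 / 3 rather than 0.05, which is what keeps the familywise error rate at or below 0.05.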






In [3]:
# Ques 9 
# ans -- Below is an example of how to conduct a one-way ANOVA in Python using the scipy.stats library to compare the mean weight loss of three diets (A, B, and C), with 50 simulated observations per diet.

import numpy as np
from scipy import stats

# Simulated weight loss data for each diet
diet_A = np.array([3.2, 4.5, 2.7, 5.1, 3.9, 4.3, 3.8, 4.2, 2.5, 3.6,
                   4.1, 2.9, 3.8, 4.0, 3.7, 4.5, 3.4, 4.1, 3.3, 3.0,
                   4.2, 3.6, 3.5, 4.8, 4.4, 3.2, 3.9, 4.1, 3.7, 4.0,
                   3.3, 4.2, 2.8, 4.5, 3.6, 3.9, 4.1, 3.7, 2.9, 4.4,
                   3.1, 4.3, 3.6, 3.8, 4.0, 2.7, 3.5, 4.2, 3.0, 4.4])

diet_B = np.array([2.8, 3.9, 2.2, 3.1, 3.5, 3.8, 3.4, 3.7, 2.5, 3.0,
                   3.3, 2.9, 3.6, 3.1, 3.2, 3.8, 2.7, 3.4, 2.9, 2.8,
                   3.5, 3.2, 3.0, 3.6, 3.1, 2.7, 3.4, 3.5, 3.0, 3.6,
                   2.8, 3.9, 2.7, 3.4, 3.1, 3.6, 3.3, 2.6, 3.8, 3.4,
                   2.9, 3.7, 3.1, 3.3, 3.5, 2.5, 3.0, 3.8, 2.8, 3.5])

diet_C = np.array([1.9, 2.6, 2.0, 1.8, 2.3, 2.7, 2.5, 2.1, 2.8, 1.7,
                   2.4, 1.6, 2.3, 2.2, 1.9, 2.6, 2.0, 2.1, 1.8, 2.0,
                   2.4, 2.2, 2.3, 2.5, 2.1, 1.7, 2.6, 2.3, 2.0, 2.5,
                   1.8, 2.4, 1.5, 2.7, 2.2, 2.4, 2.6, 1.6, 2.5, 2.0,
                   2.3, 2.1, 2.2, 2.4, 2.6, 1.7, 1.9, 2.3, 1.8, 2.2])

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


F-statistic: 153.0731705736759
p-value: 1.1595673877935681e-36
There is a significant difference between the mean weight loss of the three diets.


In [4]:
# In this example, we first define illustrative weight loss data for each diet as NumPy arrays. We then use the stats.f_oneway function to perform the one-way ANOVA. The F-statistic and p-value are printed, followed by an interpretation of the results based on the chosen significance level (alpha = 0.05).

# Remember that the data provided here is simulated for illustrative purposes. In a real-world scenario, you would replace the simulated data with your actual data collected from the study participants.

In [5]:
# Ques 10 
# ans --
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Simulated data
np.random.seed(42)

# Create data frame
data = pd.DataFrame({
    'Software': np.repeat(['Program A', 'Program B', 'Program C'], 20),
    'Experience': np.tile(['Novice', 'Experienced'], 30),
    'Time': np.random.normal(10, 2, 60)  # Simulated task completion times
})

# Perform two-way ANOVA
formula = 'Time ~ Software + Experience + Software:Experience'
model = ols(formula, data).fit()
anova_results = anova_lm(model)

# Print ANOVA results
print(anova_results)

# Interpret the results
alpha = 0.05

# Main effect of Software
print("\nMain effect of Software:")
software_p_value = anova_results.loc['Software', 'PR(>F)']
if software_p_value < alpha:
    print("There is a significant main effect of Software.")
else:
    print("There is no significant main effect of Software.")

# Main effect of Experience
print("\nMain effect of Experience:")
experience_p_value = anova_results.loc['Experience', 'PR(>F)']
if experience_p_value < alpha:
    print("There is a significant main effect of Experience.")
else:
    print("There is no significant main effect of Experience.")

# Interaction effect between Software and Experience
print("\nInteraction effect between Software and Experience:")
interaction_p_value = anova_results.loc['Software:Experience', 'PR(>F)']
if interaction_p_value < alpha:
    print("There is a significant interaction effect between Software and Experience.")
else:
    print("There is no significant interaction effect between Software and Experience.")


                       df      sum_sq   mean_sq         F    PR(>F)
Software              2.0    2.323530  1.161765  0.327679  0.722017
Experience            1.0    0.026471  0.026471  0.007466  0.931463
Software:Experience   2.0    0.993526  0.496763  0.140114  0.869575
Residual             54.0  191.453381  3.545433       NaN       NaN

Main effect of Software:
There is no significant main effect of Software.

Main effect of Experience:
There is no significant main effect of Experience.

Interaction effect between Software and Experience:
There is no significant interaction effect between Software and Experience.
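
# Alongside the ANOVA table, it often helps to look at the cell means, since these are what the main effects and the interaction summarize. The sketch below rebuilds the same simulated data frame and tabulates the mean completion time and the sample size for each Software-by-Experience cell. The names cell_means and cell_counts are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the simulated data used in the two-way ANOVA above
np.random.seed(42)
data = pd.DataFrame({
    'Software': np.repeat(['Program A', 'Program B', 'Program C'], 20),
    'Experience': np.tile(['Novice', 'Experienced'], 30),
    'Time': np.random.normal(10, 2, 60)  # Simulated task completion times
})

# Mean completion time in each Software x Experience cell
cell_means = data.groupby(['Software', 'Experience'])['Time'].mean().unstack()

# Number of observations in each cell (a balanced design has equal counts)
cell_counts = data.groupby(['Software', 'Experience']).size().unstack()

print(cell_means.round(2))
print(cell_counts)
```

# Roughly parallel rows in the cell-means table are consistent with the non-significant interaction reported above; because the data are random draws from a single distribution, no real effects are expected here.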


In [None]:
# Ques 11
# ans-- 