# Assignment

### Ans1)

Analysis of Variance (ANOVA) is a statistical technique that is used to test for differences among two or more groups or samples. However, before conducting ANOVA, certain assumptions must be met in order to ensure that the results are valid and accurate. These assumptions include:

1) Independence: Observations must be independent of one another, both within each group and across groups. The value of one observation should carry no information about any other, which rules out, for example, measuring the same subject in more than one group.

2) Normality: The data within each group (equivalently, the model residuals) should be normally or approximately normally distributed. This assumption matters most when the sample size is small; with larger samples the F-test is fairly robust to moderate departures from normality.

3) Homogeneity of variance: The variance within each group should be equal or approximately equal. If the variances differ substantially, particularly when the group sizes are also unequal, the F-test can reject the null hypothesis too often or too rarely.

4) Random sampling: The samples should be randomly selected from the population.

Violations of these assumptions can affect the validity of the ANOVA results. Examples of violations and their impact on validity are:

1) Non-independence: If observations are related to one another, the within-group variability is typically underestimated and the F-statistic is inflated, which raises the false-positive rate. For example, if a study is conducted in a classroom where students work together in groups formed by academic ability, the scores within each group are likely to be correlated rather than independent.

2) Non-normality: If the data within groups are far from normal, the p-value may be inaccurate, especially with small samples. For example, strongly right-skewed measurements such as response times or incomes can violate this assumption, and with very small samples it is difficult even to check whether it holds.

3) Non-homogeneity of variance: If the group variances are unequal, ANOVA may detect differences that do not exist or miss ones that do. For example, if one group has a much larger variance than the others, and especially if that group is also the smallest, the test can produce too many false positives.

4) Non-random sampling: If the samples are not randomly selected from the population, the results may not accurately reflect the population as a whole. For example, if a study is conducted only on people who volunteer, the results may not be representative of the entire population.
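The normality and equal-variance assumptions listed above can be checked numerically before running an ANOVA. The following is a minimal sketch, assuming three hypothetical groups of simulated data (the group names, sizes, and seed are illustrative only); it uses the Shapiro-Wilk test for normality and Levene's test for homogeneity of variance. Independence and random sampling are properties of the study design and cannot be verified this way.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions with equal spread
rng = np.random.default_rng(0)
groups = {
    "g1": rng.normal(loc=10, scale=2, size=30),
    "g2": rng.normal(loc=12, scale=2, size=30),
    "g3": rng.normal(loc=15, scale=2, size=30),
}

# Normality: Shapiro-Wilk test within each group (H0: the data are normal)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance: Levene's test across groups (H0: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test signals a possible violation; a large p-value does not prove the assumption holds, particularly with small samples.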

### Ans2)

The three types of ANOVA are:

1) One-Way ANOVA: One-way ANOVA is used when there is only one independent variable with two or more levels (or categories); with exactly two levels it gives the same result as an independent-samples t-test, so it is most often applied with three or more. It is used to determine whether there are significant differences in the means of a dependent variable across the different levels of the independent variable. One-way ANOVA is commonly used in studies that involve comparing the means of several groups, such as in medical studies to compare the effectiveness of different treatments.

2) Two-Way ANOVA: Two-way ANOVA is used when there are two independent variables. It is used to determine whether there are significant main effects of each independent variable and whether there is an interaction effect between the two independent variables on the dependent variable. Two-way ANOVA is commonly used in studies that involve examining the effects of two different factors, such as in psychology studies to examine the effects of both gender and age on a dependent variable.

3) Three-Way ANOVA: Three-way ANOVA is used when there are three independent variables. It is used to determine whether there are significant main effects of each independent variable and whether there are interaction effects between each pair of independent variables, as well as a three-way interaction effect. Three-way ANOVA is commonly used in studies that involve examining the effects of three different factors, such as in education studies to examine the effects of teaching method, classroom size, and teacher experience on a dependent variable.

### Ans3)

The partitioning of variance in ANOVA (Analysis of Variance) is a technique used to identify the sources of variation in a set of data. ANOVA is a statistical method that is used to compare the means of two or more groups and determine whether the differences among them are statistically significant.

The variance of a set of data measures how much the data points are spread out from the mean. The partitioning of variance in ANOVA involves breaking down the total variability of the data into components that can be attributed to specific sources. In a one-way ANOVA there are two components: variation between groups (differences among the group means) and variation within groups (the spread of observations around their own group mean, which is treated as random error).

Understanding the partitioning of variance in ANOVA is important because it allows researchers to identify the sources of variation in their data and determine which factors are contributing the most to differences between groups. By identifying these sources of variation, researchers can develop more accurate models to explain the data, make more precise predictions, and better understand the underlying processes that are driving the observed differences.
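The decomposition described above can be written out for a one-way layout. The notation is an assumption of this sketch rather than something fixed by the text: $y_{ij}$ is observation $j$ in group $i$, $\bar{y}_i$ is the mean of group $i$, $\bar{y}$ is the grand mean, $k$ is the number of groups, $n_i$ is the size of group $i$, and $N$ is the total number of observations.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{\text{SST (total)}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{\text{SSB (between groups)}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{\text{SSW (within groups)}}
$$

The between-group term is what the code in Ans4 calls the explained sum of squares, and the within-group term is its residual sum of squares. Dividing each by its degrees of freedom gives the mean squares whose ratio is the F-statistic:

$$
F = \frac{\text{SSB}/(k-1)}{\text{SSW}/(N-k)}
$$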

### Ans4)

In [3]:
import scipy.stats as stats
import numpy as np

# Generate some sample data (no seed is set, so the values change on every run)
group1 = np.random.normal(loc=10, scale=2, size=30)
group2 = np.random.normal(loc=12, scale=2, size=30)
group3 = np.random.normal(loc=15, scale=2, size=30)
groups = [group1, group2, group3]
data = np.concatenate(groups)

# Run the one-way ANOVA
f_stat, p_value = stats.f_oneway(group1, group2, group3)
n = len(data)
k = 3

# Calculate the total sum of squares (SST)
SST = np.sum((data - np.mean(data))**2)

# Calculate the explained (between-group) sum of squares (SSE):
# each group's squared mean deviation is weighted by its group size
SSE = sum(len(g) * (np.mean(g) - np.mean(data))**2 for g in groups)

# Calculate the residual (within-group) sum of squares (SSR)
SSR = SST - SSE

print(f"SST = {SST:.2f}")
print(f"SSE = {SSE:.2f}")
print(f"SSR = {SSR:.2f}")


SST = 764.90
SSE = 464.63
SSR = 300.28
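As a cross-check on the partition, the F-statistic can be rebuilt from the sums of squares and compared with the value returned by `scipy.stats.f_oneway`. This is a minimal sketch on separately simulated data; the helper name `anova_partition` and the seed are hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats


def anova_partition(*groups):
    """Return (SST, SS_between, SS_within, F) for a one-way layout."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    data = np.concatenate(groups)
    grand_mean = data.mean()
    n, k = data.size, len(groups)

    sst = np.sum((data - grand_mean) ** 2)
    # Each group's squared mean deviation is weighted by its sample size
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Spread of the observations around their own group mean
    ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

    f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
    return sst, ss_between, ss_within, f_manual


rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
g1 = rng.normal(10, 2, 30)
g2 = rng.normal(12, 2, 30)
g3 = rng.normal(15, 2, 30)

sst, ssb, ssw, f_manual = anova_partition(g1, g2, g3)
f_scipy, p_scipy = stats.f_oneway(g1, g2, g3)

print("SST equals SSB + SSW:", np.isclose(sst, ssb + ssw))
print("Manual F matches f_oneway:", np.isclose(f_manual, f_scipy))
```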


### Ans5)

In [5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create some sample data
df = pd.DataFrame({'A': ['a1', 'a2', 'a1', 'a2', 'a1', 'a2', 'a1', 'a2'],
                   'B': ['b1', 'b1', 'b2', 'b2', 'b1', 'b1', 'b2', 'b2'],
                   'Y': [3, 4, 6, 8, 7, 9, 10, 12]})

# Fit the two-way ANOVA model
model = ols('Y ~ A + B + A:B', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Express each effect's sum of squares as a share of the total sum of squares
# (eta squared). The significance tests for the main effects and the
# interaction are in the 'F' and 'PR(>F)' columns of anova_table.
total_ss = anova_table['sum_sq'].sum()
eta_sq_A = anova_table.loc['A', 'sum_sq'] / total_ss
eta_sq_B = anova_table.loc['B', 'sum_sq'] / total_ss
eta_sq_interaction = anova_table.loc['A:B', 'sum_sq'] / total_ss

print("Eta squared for A:", eta_sq_A)
print("Eta squared for B:", eta_sq_B)
print("Eta squared for A:B interaction:", eta_sq_interaction)


Eta squared for A: 0.09589041095890422
Eta squared for B: 0.3307240704500982
Eta squared for A:B interaction: 0.001956947162426608


### Ans6)

If a one-way ANOVA yields an F-statistic of 5.23 and a p-value of 0.02, the null hypothesis that all group means are equal is rejected at the conventional 0.05 significance level. The F-statistic is the ratio of between-group variability to within-group variability, so a larger value implies greater differences between the group means relative to the variation within each group. The p-value is the probability of obtaining an F-statistic at least as large as the one observed, assuming that the null hypothesis is true (i.e., all population means are equal).

A p-value of 0.02 means that an F-statistic this large would occur only about 2% of the time if the null hypothesis were true, so the observed differences are unlikely to be due to chance alone. Because 0.02 is below the usual threshold of 0.05, the result is statistically significant, although it would not be at a stricter threshold such as 0.01.

Therefore, there is evidence that at least one group mean differs from the others. The ANOVA does not say which groups differ or how large the differences are; that requires post-hoc tests or an examination of the confidence intervals of the group means.
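The link between an F-statistic and its p-value can be sketched with the F distribution in SciPy. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are purely hypothetical, and the resulting p-value will not necessarily equal the quoted 0.02.

```python
from scipy import stats

f_stat = 5.23      # F-statistic quoted in the question
df_between = 2     # hypothetical: 3 groups, so k - 1 = 2
df_within = 27     # hypothetical: 30 observations, so N - k = 27
alpha = 0.05

# Survival function: P(F >= f_stat) when the null hypothesis is true
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value that F must exceed to be significant at level alpha
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = {alpha}: {f_crit:.2f}")
print("reject H0:", f_stat > f_crit)
```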





### Ans7)

Handling missing data in repeated measures ANOVA depends on the nature of the missing data. Here are a few potential methods for handling missing data in repeated measures ANOVA:

1) Listwise or pairwise deletion: With this method, subjects (or individual comparisons) with missing values are excluded from the analysis. This reduces power, and the results are unbiased only if the data are missing completely at random (MCAR).

2) Last observation carried forward (LOCF): This method replaces missing values with the value from the previous time point. LOCF assumes that the outcome stays constant after the last observed value, so it can introduce bias whenever the outcome is still changing over time, and it understates the variability of the data.

3) Maximum likelihood estimation: Rather than filling in missing values, this approach estimates the model parameters from all available observations under an assumed distribution, for example by fitting a linear mixed-effects model. Maximum likelihood estimation can provide unbiased estimates if the data are missing at random (MAR).

4) Multiple imputation: This method involves creating multiple plausible values for each missing data point based on the observed data and using these imputed values to estimate the model parameters. Multiple imputation can provide unbiased estimates if the data are missing at random, and it can also increase the precision of the estimates.
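The two simplest strategies in the list above, deletion of incomplete cases and LOCF, can be illustrated with pandas. This is a minimal sketch on a small hypothetical wide-format table (one row per subject, one column per time point); the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [5.5, np.nan, 7.5, 8.5],
        "t3": [6.0, 6.5, np.nan, 9.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Listwise deletion: keep only subjects observed at every time point
complete_cases = wide.dropna()

# LOCF: fill each gap with that subject's most recent earlier observation
locf = wide.ffill(axis=1)

print(complete_cases)
print(locf)
```

Listwise deletion here discards half of the subjects, which shows how quickly power is lost. LOCF keeps every subject but cannot fill a value that is missing at the first time point, since there is nothing to carry forward.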



### Ans8)

Some common post-hoc tests include:

1) Tukey's Honestly Significant Difference (HSD): This test is commonly used when there are three or more groups being compared. The Tukey HSD test calculates the minimum difference that must exist between the means of any two groups in order for that difference to be considered statistically significant.

2) Bonferroni correction: This method adjusts the p-values (or the significance level) of the pairwise comparisons to account for multiple comparisons, and is generally more conservative than other post-hoc tests. It works best when only a small number of comparisons are planned, because it loses power rapidly as the number of comparisons grows.

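3) Scheffé's test: This test is more conservative than Tukey's HSD test for pairwise comparisons. It is used when complex contrasts are of interest, such as comparing the average of two groups with a third, because it controls the error rate over all possible contrasts.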

4) Dunnett's test: This test is used when there is a control group being compared to multiple other groups. It controls the overall error rate while allowing for multiple pairwise comparisons with the control group.

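5) Fisher's Least Significant Difference (LSD): This test performs pairwise t-tests using the pooled error variance from the ANOVA, and is carried out only if the overall F-test is significant. It is less conservative than other post-hoc tests because it makes no further adjustment for multiple comparisons, so it controls the familywise error rate only when there are exactly three groups.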

An example of a situation where a post-hoc test might be necessary is in a study examining the effects of three different exercise programs on weight loss. After conducting an ANOVA, the researcher finds a significant difference between the means of the three groups. To determine which specific groups are significantly different from each other, the researcher could conduct a post-hoc test such as Tukey's HSD test. This would allow the researcher to identify which exercise programs resulted in significantly greater weight loss compared to the others.
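The exercise-program scenario above can be sketched in code with `pairwise_tukeyhsd` from statsmodels. The weight-loss figures are simulated and entirely hypothetical (group means, spread, sample size, and seed are assumptions made for illustration), so only the workflow is meaningful, not the numbers.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

# Hypothetical weight loss (kg) for 20 participants in each program
program_a = rng.normal(3.0, 1.0, 20)
program_b = rng.normal(4.0, 1.0, 20)
program_c = rng.normal(6.0, 1.0, 20)

weight_loss = np.concatenate([program_a, program_b, program_c])
program = np.repeat(["A", "B", "C"], 20)  # labels in the same order as the data

# All pairwise comparisons with the familywise error rate held at 5%
tukey = pairwise_tukeyhsd(endog=weight_loss, groups=program, alpha=0.05)
print(tukey.summary())
```

Each row of the summary reports one pair of programs, the difference in their means, an adjusted p-value, and whether the null hypothesis of equal means is rejected for that pair.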

### Ans9)

In [1]:
import scipy.stats as stats

# Define the data for each diet group
diet_A = [5.1, 4.5, 6.2, 7.8, 3.2, 5.5, 4.3, 6.4, 4.9, 5.7, 6.1, 5.4, 4.7, 3.9, 5.3, 6.8, 5.6, 4.1, 5.9, 6.0]
diet_B = [7.2, 8.5, 6.8, 7.5, 6.2, 8.1, 9.2, 7.8, 6.5, 7.0, 8.0, 7.3, 6.9, 8.3, 7.6, 6.7, 7.1, 7.9, 8.6, 6.4]
diet_C = [10.3, 11.2, 12.1, 9.7, 11.0, 10.5, 12.6, 9.5, 10.8, 11.5, 9.9, 10.7, 11.4, 12.2, 10.6, 12.5, 11.9, 9.8, 10.4, 12.0]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("F-statistic: ", f_statistic)
print("p-value: ", p_value)


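F-statistic:  177.87661148829076
p-value:  3.1279147100109306e-25

Because the p-value is far below 0.05, the null hypothesis of equal means across the three diet groups is rejected: at least one diet differs from the others. A post-hoc test such as Tukey's HSD would be needed to identify which pairs of diets differ.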


### Ans10)

In [5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a data frame with the data
data = {'Software': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C',
                     'A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C',
                     'A', 'A', 'B', 'B', 'C', 'C'],
        'Experience': ['Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
                       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
                       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
                       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
                       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced'],
        'Time': [13.8, 15.6, 12.3, 13.1, 10.7, 11.3, 14.2, 14.9, 12.6, 13.7, 11.1, 11.9,
                 15.2, 16.3, 13.5, 14.7, 12.1, 12.9, 14.8, 15.7, 12.9, 13.8, 11.5, 12.1,
                 15.5, 16.1, 13.2, 14.3, 11.7, 12.2]}

df = pd.DataFrame(data)

# Conduct two-way ANOVA
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()
sm.stats.anova_lm(model, typ=2)


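| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Software) | 59.890667 | 2.0 | 89.701448 | 7.281977e-12 |
| C(Experience) | 6.075 | 1.0 | 18.197703 | 0.0002686055 |
| C(Software):C(Experience) | 0.216 | 2.0 | 0.323515 | 0.7267079 |
| Residual | 8.012 | 24.0 | | |

Both main effects are statistically significant at the 0.05 level: Software (F = 89.70, p < 0.001) and Experience (F = 18.20, p < 0.001). The interaction is not significant (F = 0.32, p = 0.727), so there is no evidence that the effect of software on time depends on experience level.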


### Ans11)

In [3]:
import numpy as np
from scipy.stats import ttest_ind

# Generate some example data
np.random.seed(123)
control_scores = np.random.normal(75, 10, size=100)
experimental_scores = np.random.normal(80, 10, size=100)

# Conduct two-sample t-test
t_stat, p_val = ttest_ind(control_scores, experimental_scores)

# Report the results
print("T-statistic: {:.2f}".format(t_stat))
print("P-value: {:.3f}".format(p_val))



T-statistic: -3.03
P-value: 0.003


In [4]:
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Create a data frame with the data
data = {'Group': ['Control'] * 100 + ['Experimental'] * 100,
        'Score': np.concatenate((control_scores, experimental_scores))}

df = pd.DataFrame(data)

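# Conduct Tukey's HSD test. With only two groups it is equivalent to the
# t-test above; a post-hoc test adds information when three or more groups are compared.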
tukey_results = pairwise_tukeyhsd(df['Score'], df['Group'])

# Report the results
print(tukey_results)


   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
Control Experimental   4.5336 0.0028 1.5846 7.4826   True
---------------------------------------------------------


### Ans12)

In [3]:
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM

# Generate sample data (no seed is set, so the values change on every run)
store_a_sales = np.random.normal(1000, 100, 30)
store_b_sales = np.random.normal(1200, 150, 30)
store_c_sales = np.random.normal(900, 120, 30)

sales_df = pd.DataFrame({
    'Store A': store_a_sales,
    'Store B': store_b_sales,
    'Store C': store_c_sales
})

sales_df_melted = pd.melt(sales_df.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
sales_df_melted.columns = ['day', 'store', 'sales']

# Perform repeated measures ANOVA
rm_anova = AnovaRM(sales_df_melted, 'sales', 'day', within=['store'])
res = rm_anova.fit()
print(res)

               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
store 47.6875 2.0000 58.0000 0.0000
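A significant repeated measures ANOVA is usually followed by pairwise comparisons to find which stores differ. One common option, sketched below, is paired t-tests with a Bonferroni adjustment. The data are regenerated here with a fixed, arbitrary seed so the block is self-contained, which means the numbers will not match the run shown above; treating each day as the repeated unit mirrors the analysis above and is an assumption of this sketch.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed, for reproducibility only

# Hypothetical daily sales for three stores over the same 30 days
sales = {
    "Store A": rng.normal(1000, 100, 30),
    "Store B": rng.normal(1200, 150, 30),
    "Store C": rng.normal(900, 120, 30),
}

pairs = list(combinations(sales, 2))
results = {}
for first, second in pairs:
    # Paired t-test: the same days are observed for both stores
    t_stat, p_raw = stats.ttest_rel(sales[first], sales[second])
    # Bonferroni: multiply by the number of comparisons and cap at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results[(first, second)] = (t_stat, p_adj)
    print(f"{first} vs {second}: t = {t_stat:.2f}, adjusted p = {p_adj:.4f}")
```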

