# Assignment - 13th March

1. **Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.** 

Assumptions of ANOVA include:

- **Independence :** Each observation is independent of all other observations.
- **Normality :** The data in each group follow a normal distribution.
- **Homogeneity of variances :** The variance of the data in each group is approximately equal.
- **Random sampling :** The data are randomly sampled from their respective populations.

Violations of these assumptions can impact the validity of ANOVA results. 

**For Example :** <br>
If the assumption of normality is violated, the F-test may not be valid and could lead to incorrect conclusions. Similarly, violations of the homogeneity of variances assumption can result in inflated or deflated F-statistics and lead to incorrect conclusions.

2. **What are the three types of ANOVA, and in what situations would each be used?**

The three types of ANOVA are:

- **One-way ANOVA :** Used to compare means across multiple groups on a single independent variable.
- **Two-way ANOVA :** Used to compare means across multiple groups on two independent variables.
- **Repeated measures ANOVA :** Used to compare means across multiple groups on a single independent variable with repeated measures over time.

3. **What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

Partitioning of variance in ANOVA refers to the process of decomposing the total variance in a dataset into its component parts: the variance explained by the independent variable(s) and the variance due to error. This is important to understand because it allows us to quantify the extent to which differences between groups are due to the independent variable(s) and the extent to which they are due to random error.

4. **How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?** 

In [62]:
import pandas as pd

# read in the data from CSV file
data = pd.read_csv('data.csv')

# display the first 5 rows of data
print(data.head())


  group  value
0     A   10.2
1     A    8.3
2     A    9.1
3     A    7.8
4     A   11.5


In [63]:
import numpy as np

# calculate overall mean
overall_mean = np.mean(data['value'])

# calculate SST
SST = np.sum((data['value'] - overall_mean)**2)

# print SST
print('SST:', SST)

SST: 162.50966666666665


In [64]:
# group data by group and calculate mean and sample size
group_means = data.groupby('group').mean()
group_sizes = data.groupby('group').size()

# calculate SSE
SSE = np.sum(group_sizes * (group_means['value'] - overall_mean)**2)

# print SSE
print('SSE:', SSE)

SSE: 123.99266666666674


In [65]:
# calculate SSR
SSR = np.sum((data['value'] - data.groupby('group')['value'].transform('mean'))**2)

# print SSR
print('SSR:', SSR)

SSR: 38.51699999999999


So the total sum of squares (SST) is 162.50, the explained sum of squares (SSE) is 123.99, and the residual sum of squares (SSR) is 38.51.

5. **In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?** 

In a **two-way ANOVA**, to calculate the main effects and interaction effects using Python, you can use the `statsmodels` package. After importing the necessary packages and loading the data into a pandas dataframe, you can create the model using the `ols` function, specifying the dependent variable and independent variables as well as the interaction term. Then, you can fit the model using the `fit` method and obtain the ANOVA table using the `anova_lm` function. The main effects and interaction effects can be identified by examining the F-statistics and associated p-values in the ANOVA table.

6. **Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?**

In a **one-way ANOVA**, the F-statistic tests the null hypothesis that all group means are equal. A low p-value (typically below 0.05) indicates that there is strong evidence against the null hypothesis and that at least one group mean is significantly different from the others. Therefore, if the F-statistic is 5.23 and the p-value is 0.02, we can conclude that there is a significant difference between the group means. However, we cannot say which group means are different from each other without conducting post-hoc tests.

7. **In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?** 

In a **repeated measures ANOVA**, missing data can be handled using various methods, such as pairwise deletion, mean substitution, or maximum likelihood estimation. Pairwise deletion involves excluding any participant who has missing data from the analysis of that particular variable, which can result in reduced power and biased estimates if the data are not missing completely at random. Mean substitution involves replacing the missing values with the mean of the observed values, which can also result in biased estimates if the data are not missing completely at random. Maximum likelihood estimation is a more sophisticated method that estimates the missing data based on the available data, assuming a specific distribution of the data. However, this method can be computationally intensive and may require more assumptions than simpler methods.

8. **What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.** 

Common post-hoc tests used after ANOVA include Tukey's Honestly Significant Difference (HSD), Bonferroni correction, and Scheffe's test. Tukey's HSD test is used to determine which specific group means differ significantly from each other. The Bonferroni correction adjusts the significance level to account for multiple comparisons, and Scheffe's test is a more conservative test that controls for the overall error rate. You would use each of these tests depending on the specific research question and the desired level of control for Type I error. 

**For example :**<br> 
If you are interested in comparing the means of all possible pairwise comparisons, Tukey's HSD test would be appropriate. If you are more concerned with controlling for Type I error, the Bonferroni correction or Scheffe's test may be more appropriate.

9. **A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.** 

In [66]:
import random
import pandas as pd

# set random seed for reproducibility
random.seed(123)

# generate weight loss data for three diets
A = [random.uniform(1, 5) for _ in range(50)]
B = [random.uniform(1, 6) for _ in range(50)]
C = [random.uniform(2, 7) for _ in range(50)]

# combine data into a single dataframe
data = pd.DataFrame({'A': A, 'B': B, 'C': C})

# save dataframe to CSV file
data.to_csv('diets.csv', index=False)


In [67]:
import pandas as pd
import scipy.stats as stats

# load data into a pandas dataframe
data = pd.read_csv('diets.csv')

# conduct one-way ANOVA
f_statistic, p_value = stats.f_oneway(data['A'], data['B'], data['C'])

# print results
print('F-statistic:', f_statistic)
print('p-value:', p_value)

F-statistic: 25.92372694440951
p-value: 2.273011160482186e-10


10. **A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.** 

In [68]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.read_csv('soft.csv')
data.head()

Unnamed: 0,program,experience,time
0,A,Novice,12.5
1,B,Novice,11.2
2,C,Novice,14.3
3,A,Experienced,10.8
4,B,Experienced,13.2


In [69]:
# Convert program and experience columns to categorical variables
data['program'] = data['program'].astype('category')
data['experience'] = data['experience'].astype('category')

# Fit the model
model = ols('time ~ program + experience + program:experience', data=data).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


                       sum_sq    df         F    PR(>F)
program              3.987778   2.0  1.704179  0.223126
experience           0.435556   1.0  0.372270  0.553149
program:experience   2.181111   2.0  0.932099  0.420455
Residual            14.040000  12.0       NaN       NaN


11. **An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.**

In [70]:
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# load data
data = pd.read_csv('student.csv')

# separate scores by group
control_scores = data[data['group'] == 'control']['score']
experimental_scores = data[data['group'] == 'experimental']['score']

# conduct two-sample t-test
t_stat, p_val = stats.ttest_ind(control_scores, experimental_scores)

# print results
print('Two-Sample T-Test Results:')
print('t-statistic: {:.2f}'.format(t_stat))
print('p-value: {:.4f}'.format(p_val))

# conduct post-hoc test
tukey_results = pairwise_tukeyhsd(data['score'], data['group'])

# print post-hoc test results
print('\nPost-Hoc Test Results:')
print(tukey_results)


Two-Sample T-Test Results:
t-statistic: nan
p-value: nan

Post-Hoc Test Results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj   lower  upper  reject
----------------------------------------------------------
Control Experimental     0.56 0.6288 -1.7544 2.8744  False
----------------------------------------------------------


12. **A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.**

Since this is a repeated measures design, where the same subjects are measured over multiple time points, a repeated measures ANOVA is not appropriate. Instead, a one-way repeated measures ANOVA, also known as a within-subjects ANOVA, would be more appropriate.

In [71]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from statsmodels.stats.anova import AnovaRM

# load data
data = pd.read_csv('sales.csv')

# set up data in long format
data_long = pd.melt(data, id_vars=['day'], var_name='store', value_name='sales')

# set up the repeated measures ANOVA
rm_anova = AnovaRM(data_long, 'sales', 'day', within=['store'])

# fit the model
results = rm_anova.fit()

# print results
print(results.summary())

# follow up with post-hoc test if results are significant
if results.anova_table['Pr > F'][0] < 0.05:
    print('\nPost-Hoc Test Results:')
    print(pairwise_tukeyhsd(data_long['sales'], data_long['store']))

               Anova
      F Value  Num DF  Den DF Pr > F
------------------------------------
store 123.7608 2.0000 58.0000 0.0000


Post-Hoc Test Results:
    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1  group2  meandiff p-adj    lower     upper   reject
-----------------------------------------------------------
Store A Store B     210.0    0.0  158.8856  261.1144   True
Store A Store C    3.3333 0.9868  -47.7811   54.4477  False
Store B Store C -206.6667    0.0 -257.7811 -155.5523   True
-----------------------------------------------------------
