In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.


In [None]:
Analysis of Variance (ANOVA) is a statistical method used to compare means between three or more groups. 
The method makes several assumptions about the data, which are important to consider when interpreting the results.
Assumptions of ANOVA:
Independence: The observations within each group must be independent of each other. This means that the value
of one observation should not influence the value of another observation within the same group.

Normality: The residuals (in practice, the data within each group) should be approximately normally distributed. This means that the data
should be roughly symmetric around the group mean, with most values close to the mean.

Homogeneity of variance: The variance within each group should be approximately equal. This means that the spread of the data should be similar across all groups.

Violations of these assumptions can impact the validity of ANOVA results and should be carefully considered when interpreting 
the results. Examples of violations include:
Independence: If there is a relationship between observations within a group, the assumption of independence is violated. For
example, if the same individual is measured multiple times in a study, the observations within that individual are not independent.

Normality: If the data is not normally distributed within each group, the ANOVA results may not be valid. For example, if the
data is highly skewed, the normality assumption may be violated.

Homogeneity of variance: If the variance is not approximately equal within each group, the ANOVA results may not be valid. For 
example, if the variance in one group is much larger than the variance in another group, the homogeneity of variance assumption may be violated.

In summary, it is important to check the assumptions of ANOVA before interpreting the results.
If violations are found, alternative methods may be needed, such as Welch's ANOVA for unequal variances, the Kruskal-Wallis test for non-normal data, or repeated measures ANOVA for dependent observations.
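
The normality and equal-variance assumptions described above can be checked numerically before running the test. The following is a minimal sketch on simulated data (the group names, means, and sample sizes are made up for illustration), using the Shapiro-Wilk test and Levene's test from scipy.stats.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 observations each (illustrative only).
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=11, scale=2, size=30),
    "C": rng.normal(loc=12, scale=2, size=30),
}

# Normality: Shapiro-Wilk on each group (null hypothesis: the data are normal).
shapiro_p = {name: stats.shapiro(x).pvalue for name, x in groups.items()}

# Homogeneity of variance: Levene's test (null hypothesis: equal variances).
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests that the corresponding assumption may be violated. Independence cannot be checked this way; it depends on how the data were collected.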

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?


In [None]:
The three types of ANOVA are:
One-Way ANOVA: One-way ANOVA is a type of ANOVA that is used to compare the means of three or more groups 
that are classified by a single factor or independent variable. For example, a researcher might be interested in 
determining whether there are significant differences in the effectiveness of three different types of pain medication. 
The factor is the type of pain medication, and the dependent variable is the level of pain relief. One-way ANOVA is used to test whether
there is a significant difference in the mean level of pain relief between the three types of pain medication.

Repeated Measures ANOVA: Repeated measures ANOVA is a type of ANOVA that is used to compare three or more means
when the same individuals are measured repeatedly over time or under different conditions. For example, a researcher might be
interested in determining whether there are significant differences in reaction time when the same participants are presented
with different types of stimuli. The within-subject factor is the type of stimulus, and the dependent variable is the reaction time. Repeated measures
ANOVA is used to test whether there is a significant difference in the mean reaction time between the different types of stimuli.

Factorial ANOVA: Factorial ANOVA is a type of ANOVA that is used when there are two or more independent variables (factors)
and the goal is to study their separate and combined effects on the dependent variable. For example, a researcher might be interested in
whether the effectiveness of several weight loss programs differs between men and women. The factors are gender and weight loss program, and the
dependent variable is the amount of weight lost. Factorial ANOVA is used to test the main effect of each factor on weight loss,
as well as whether there is a significant interaction between the gender and weight loss program factors.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
    

In [None]:

The partitioning of variance in ANOVA refers to the division of the total variation in a 
dataset into different sources of variation, which are then used to calculate the F-statistic and test
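for significant differences between groups. In a one-way ANOVA, the total variation is partitioned into two components: the variation between groups and the variation within groups.
The variation between groups reflects the differences between the group means and is measured by the
sum of squares between groups (SSB): the squared difference between each group mean and the overall mean, weighted by group size and summed over groups.
The variation within groups reflects the variability of the observations around their own group mean and is measured by the sum of squares within groups (SSW).
The total variation is measured by the sum of squares total (SST), which is the sum of the squared differences between each data point and the overall mean, so that SST = SSB + SSW.
Dividing each sum of squares by its degrees of freedom gives the mean squares, and the F-statistic is their ratio: F = (SSB / (k - 1)) / (SSW / (N - k)) for k groups and N observations.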
By understanding the partitioning of variance in ANOVA, we can determine the proportion of the total variance that can 
be attributed to the differences between groups (SSB), and the proportion that is due to random error within each group (SSW). 
This allows us to test whether the differences between groups are statistically significant, and to determine the magnitude of these 
differences relative to the overall variability in the dataset. It also allows us to identify the sources of variation that are most
important in explaining the differences between groups, and to assess the validity of our conclusions based on the assumptions underlying the ANOVA model.
In summary, understanding the partitioning of variance in ANOVA is important for interpreting the results of the analysis,
identifying sources of variation that contribute to group differences, and evaluating the assumptions underlying the ANOVA model.
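
The partition can be verified by hand. The sketch below uses a small made-up dataset (the values are illustrative, not from any study) to compute the between-group, within-group, and total sums of squares with NumPy, and then compares the resulting F-statistic with scipy.stats.f_oneway.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical measurements for three groups (illustrative values only).
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([9.0, 8.0, 10.0, 9.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group sum of squares: group size times squared distance of each group mean from the grand mean.
ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: squared distance of each value from its own group mean.
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total sum of squares: squared distance of each value from the grand mean.
sst = ((all_values - grand_mean) ** 2).sum()

# F is the ratio of the between-group and within-group mean squares.
f_manual = (ssb / (k - 1)) / (ssw / (n_total - k))
f_scipy, p_value = f_oneway(*groups)

print(f"SSB = {ssb:.3f}, SSW = {ssw:.3f}, SST = {sst:.3f}")
print(f"F (manual) = {f_manual:.3f}, F (scipy) = {f_scipy:.3f}, p = {p_value:.4f}")
```

For this toy dataset the group means are 5, 7, and 9, which gives SSB = 32, SSW = 6, SST = 38, and F = 24.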

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?


In [5]:
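from statsmodels.stats.anova import anova_lm
from statsmodels.formula.api import ols
import seaborn as sns

# Load the iris data and fit a one-way model: petal_length explained by species
df = sns.load_dataset('iris')

model = ols('petal_length ~ species', data=df).fit()

sse = model.ess   # explained (between-group) sum of squares
ssr = model.ssr   # residual (within-group) sum of squares
sst = sse + ssr   # total sum of squares

print(f'SSE (explained): {sse:.4f}')
print(f'SSR (residual): {ssr:.4f}')
print(f'SST (total): {sst:.4f}')
print(anova_lm(model))

SSE (explained): 437.1028
SSR (residual): 27.2226
SST (total): 464.3254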
             df    sum_sq     mean_sq            F        PR(>F)
species     2.0  437.1028  218.551400  1180.161182  2.856777e-91
Residual  147.0   27.2226    0.185188          NaN           NaN


In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?


In [2]:
import seaborn as sns

df = sns.load_dataset('iris')
df.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [5]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Note: sepal_width is continuous; wrapping it in C() treats each distinct value as a
# separate level, which leaves many empty species x sepal_width cells.
model_formula = 'sepal_length ~ C(species) + C(sepal_width) + C(species):C(sepal_width)'

model = ols(model_formula, data=df).fit()

# Type II ANOVA table: one row per main effect, one for the interaction, one for the residual
table = anova_lm(model, typ=2)

# Select rows by label so the main effects and the interaction are not mixed up
main_effect = table.loc[['C(species)', 'C(sepal_width)']]
interaction_effect = table.loc[['C(species):C(sepal_width)']]

print(table)

                               sum_sq     df          F        PR(>F)
C(species)                        NaN    2.0        NaN           NaN
C(sepal_width)              48.787995   22.0  11.711669  4.679696e-10
C(species):C(sepal_width)  131.529630   44.0  15.786994  7.024495e-26
Residual                    20.260738  107.0        NaN           NaN


  F /= J
  F /= J
  F /= J


In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?


In [None]:
The significance level is not given in the problem, so a significance level of alpha = 0.05 is assumed.
Since the p-value of 0.02 is less than alpha = 0.05, we REJECT the null hypothesis that all group means are equal.
This means there is sufficient evidence to conclude that at least one group mean differs from the others.
The F-statistic of 5.23 is the ratio of the between-group mean square to the within-group mean square, so the
between-group mean square is 5.23 times the within-group mean square.
However, the F-test does not show which specific groups differ or in which direction, so post-hoc tests or further analyses are necessary.
Post-hoc methods such as Tukey's Honestly Significant Difference (HSD), the Bonferroni correction, and Scheffe's method can be used to compare the group means.

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?


In [None]:
In repeated measures ANOVA, missing data can occur when one or more measurements are not available for a particular subject at a specific time point. 
There are several methods to handle missing data in repeated measures ANOVA, each with its own potential consequences:
Complete Case Analysis (CCA): This method involves excluding any participant with missing data from the analysis. CCA is easy to implement and gives
unbiased estimates if data are missing completely at random (MCAR). However, it reduces the sample size and statistical power,
and it can produce biased estimates when the MCAR assumption does not hold.

Last Observation Carried Forward (LOCF): This method involves using the last observed value of a variable as a substitute for any missing values for that
variable at later time points. LOCF assumes that a subject's value does not change after the last observation, so it can produce biased estimates and understate variability when that assumption fails.

Multiple Imputation (MI): This method involves creating several completed datasets by imputing plausible values for the missing data, analyzing each completed dataset separately, and then pooling the results.
MI can lead to unbiased estimates if data are missing at random (MAR), but it can be computationally intensive and requires a model for the missing data. If data are missing not at random (MNAR), standard MI can still be biased.

Maximum Likelihood (ML): This method involves estimating model parameters using all available data and allowing for missing values. ML can lead to unbiased
estimates if data are missing at random (MAR), but it requires making assumptions about the distribution of the data.

The choice of method for handling missing data in repeated measures ANOVA should depend on the reason for the missing data and the assumptions that can be 
made about the missingness mechanism. The potential consequences of using different methods to handle missing data can include biased estimates, loss of statistical power, and reduced generalizability of the findings.
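
As a rough illustration of how the choice of method changes the data that reach the analysis, the sketch below applies complete case analysis and LOCF to a small made-up table with pandas (the subject labels, time points, and scores are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point.
scores = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [5.5, np.nan, 7.5, 8.5],
        "t3": [6.0, np.nan, np.nan, 9.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis: drop every subject with any missing measurement.
complete_cases = scores.dropna()

# LOCF: carry each subject's last observed value forward across the time points.
locf = scores.ffill(axis=1)

print("Subjects kept by complete case analysis:", list(complete_cases.index))
print("Mean at t3, complete cases:", complete_cases["t3"].mean())
print("Mean at t3, LOCF:", locf["t3"].mean())
```

Complete case analysis keeps only two of the four subjects, while LOCF keeps all four but fills the gaps with values that may no longer be accurate, so the two approaches give different means at the last time point.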

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.


In [None]:
Post-hoc tests are used in ANOVA to compare specific pairs of groups after a significant main effect or interaction effect has been found. Some common post-hoc tests include:
Tukey's Honestly Significant Difference (HSD) test: This test compares all possible pairs of group means and controls for the family-wise
error rate. It is often used when the group sizes are equal (the Tukey-Kramer version handles unequal sizes) and when there is no prior knowledge about which groups differ.

Bonferroni correction: This method controls the family-wise error rate by dividing the alpha level by the number of comparisons. It is often used when only a small number of comparisons, ideally planned in advance, are of interest, because it becomes very conservative as the number of comparisons grows.

Scheffe's test: This test also controls the family-wise error rate but is more conservative than Tukey's HSD test for pairwise comparisons. It is often used when complex contrasts (for example, one group against the average of several others) are of interest in addition to pairwise comparisons.

Dunnett's test: This test compares each group mean to a control group mean and controls for the family-wise error rate. It is often used when
there is a control group and when the research question is focused on comparing other groups to the control group.

The choice of post-hoc test depends on the research question, the number of groups, and the prior knowledge about which groups are likely to differ.
A post-hoc test might be necessary when an ANOVA indicates a significant difference between groups but does not identify which specific groups differ. 
For example, a researcher might conduct an ANOVA to examine the effect of different instructional methods on student achievement. If the ANOVA shows a significant 
main effect of instructional method, the researcher might use a post-hoc test to compare the mean scores of each instructional method to identify which methods are significantly different from each other.
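
The instructional-method example above can be sketched as follows. The scores are simulated (the means, standard deviation, and group sizes are assumptions made for illustration), and the sketch runs Tukey's HSD with statsmodels and a Bonferroni adjustment of pairwise t-tests.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Hypothetical test scores for three instructional methods (simulated).
rng = np.random.default_rng(1)
scores = np.concatenate([
    rng.normal(70, 8, 25),  # method A
    rng.normal(75, 8, 25),  # method B
    rng.normal(82, 8, 25),  # method C
])
methods = np.repeat(["A", "B", "C"], 25)

# Tukey HSD: all pairwise comparisons with family-wise error control.
tukey = pairwise_tukeyhsd(endog=scores, groups=methods, alpha=0.05)
print(tukey)

# Bonferroni: adjust the p-values of the three pairwise t-tests.
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
raw_p = [ttest_ind(scores[methods == a], scores[methods == b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: Bonferroni-adjusted p = {p:.4f}, reject = {r}")
```

Both procedures report, for each pair of methods, whether the difference in means is significant after accounting for the number of comparisons.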

Question 9

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.


In [7]:
import numpy as np
from scipy.stats import f_oneway

# Note: each diet gets 50 simulated observations here, whereas the question describes 50 participants in total.
A = np.random.normal(1,5,50)
B = np.random.normal(3,5,50)
C = np.random.normal(2,5,50)

null = 'the three diets have the same mean weight loss'
alternate = 'at least one diet has a different mean weight loss'

f_stat, p = f_oneway(A,B,C)

alpha = 0.05

if p<alpha:
    print('reject null:', alternate)
else:
    print('fail to reject null:', null)

reject null: at least one diet has a different mean weight loss
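
Since the question also asks for the F-statistic and p-value, a seeded variant that reports them could look like the sketch below. The split of the 50 participants into groups of 17, 17, and 16, and the simulated means and spread, are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import f_oneway

# Simulated weight loss for 50 participants split across three diets.
# The means, spread, and group sizes are illustrative assumptions.
rng = np.random.default_rng(42)
diet_a = rng.normal(loc=1, scale=5, size=17)
diet_b = rng.normal(loc=3, scale=5, size=17)
diet_c = rng.normal(loc=2, scale=5, size=16)

f_stat, p_value = f_oneway(diet_a, diet_b, diet_c)
df_between = 3 - 1   # number of groups minus one
df_within = 50 - 3   # total observations minus number of groups

print(f"F({df_between}, {df_within}) = {f_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: at least one diet mean differs.")
else:
    print("Fail to reject the null hypothesis: no evidence that the diet means differ.")
```

Fixing the seed makes the reported F-statistic and p-value reproducible from run to run.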


In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.


In [23]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import numpy as np
import pandas as pd


df = pd.DataFrame({'software': ['A','B','C','C','B','A','B','A','C','C','A','B','C','C','B','A','B','A','C','C','A','B','C','C','B','A','B','A','C','C'],
                   'experience':['Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp','Nov','exp'],
                   'time': np.random.normal(loc=10,scale=2, size =30)})

print(df.head())

  software experience       time
0        A        Nov   9.893413
1        B        exp   8.857808
2        C        Nov   9.271280
3        C        exp  11.202295
4        B        Nov   8.946930


In [24]:
model_formula = 'time ~ C(software) + C(experience) + C(software):C(experience)'

model = ols(model_formula, data=df).fit()

# Type II ANOVA table (anova_lm is imported in the previous cell)
value = anova_lm(model, typ=2)

print('anova model is', value)
print('              ')
alpha = 0.05

# Look up each p-value by row label rather than by position
if value.loc['C(software)', 'PR(>F)'] < alpha:
    print('There is a significant main effect of software')
else:
    print('There is not a significant main effect of software')


if value.loc['C(experience)', 'PR(>F)'] < alpha:
    print('There is a significant main effect of experience')
else:
    print('There is not a significant main effect of experience')

if value.loc['C(software):C(experience)', 'PR(>F)'] < alpha:
    print('There is a significant interaction effect between software and experience')
else:
    print('There is not a significant interaction effect between software and experience')

anova model is                               sum_sq    df         F    PR(>F)
C(software)                 0.486608   2.0  0.101374  0.903980
C(experience)               4.783032   1.0  1.992884  0.170874
C(software):C(experience)   5.056499   2.0  1.053413  0.364322
Residual                   57.601316  24.0       NaN       NaN
              
There is not a significant main effect of software
There is not a significant main effect of experience
There is not a significant interaction effect between software and experience


In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.


In [31]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

control_group = np.random.normal(loc=10, scale=2, size =50)
experiment_group = np.random.normal(loc=10, scale=2, size =50)

df =pd.DataFrame({'score':list(control_group) + list(experiment_group),
                   'group': ['control']*50 + ['experiment']*50})
print(df.head())

       score    group
0  13.970920  control
1  11.474915  control
2  14.161988  control
3  11.437428  control
4   9.120410  control


In [29]:
null_hypothesis = "There is NO difference in mean test scores between the control and experimental groups."
alt_hypothesis = "There is a SIGNIFICANT difference in mean test scores between the control and experimental groups."

control_scores = df[df['group']=='control']['score']
experiment_scores = df[df['group']=='experiment']['score']

# Independent two-sample t-test assuming equal variances
t_stat, p = ttest_ind(control_scores,experiment_scores,equal_var=True)
alpha = 0.05
if p<alpha:
    print('reject null:', alt_hypothesis)
else:
    print('fail to reject null:', null_hypothesis)

fail to reject null: There is NO difference in mean test scores between the control and experimental groups.


In [33]:
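# Tukey HSD test
# With only two groups there is a single pairwise comparison, so this adds nothing beyond the t-test.

from statsmodels.stats.multicomp import pairwise_tukeyhsd


tukey_results = pairwise_tukeyhsd(df['score'], df['group'], alpha=0.05)
print(tukey_results)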

  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1   group2   meandiff p-adj  lower  upper  reject
-------------------------------------------------------
control experiment   0.4736 0.215 -0.2794 1.2265  False
-------------------------------------------------------


In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [65]:
import pandas as pd
import numpy as np

np.random.seed(30)
# Simulate 10 days of sales per store (the question describes 30 days)
Sales_A = np.random.normal(loc=10,scale=2,size=10)
Sales_B = np.random.normal(loc=10,scale=2,size=10)
Sales_C = np.random.normal(loc=10,scale=2,size=10)

### df = pd.DataFrame({'Group':['A']*10 + ['B']*10 + ['C']*10,
####                  'sales': list(Sales_A) + list(Sales_B) + list(Sales_C)})


df = pd.DataFrame({'Store A': Sales_A, 'Store B': Sales_B, 'Store C': Sales_C})

print(df.tail())

     Store A    Store B    Store C
5  10.607586  11.520770  13.233781
6   6.548075   9.428709  12.851020
7  13.170191  11.076735   8.670490
8  10.268593   5.832207  11.970036
9   7.786289  11.875563   6.599069


In [69]:
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

res = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
res.columns =['Day', 'Store', 'Sales']


rm_anova = AnovaRM(res,'Sales', 'Day', within=['Store'])
rm_results = rm_anova.fit()
print(rm_results)



               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  0.1097 2.0000 18.0000 0.8967



In [70]:
# Look up the p-value for the Store effect by row label
if rm_results.anova_table.loc['Store', 'Pr > F'] < 0.05:
    # perform post-hoc Tukey test on the long-format data (res)
    print('Reject the null hypothesis: at least one store has a different mean.\n')
    print('Tukey HSD post-hoc test:')
    tukey_results = pairwise_tukeyhsd(res['Sales'], res['Store'])
    print(tukey_results)
else:
    print('NO significant difference between groups.')

NO significant difference between groups.
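
Tukey's HSD treats the groups as independent samples, so it ignores the fact that the three stores are measured on the same days. One common alternative for repeated measures is to run paired t-tests and adjust the p-values. The sketch below shows this with a Bonferroni adjustment on simulated data; the store means, the spread, and the seed are assumptions made for illustration, not the data used above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

# Simulated daily sales for three stores over the same 30 days (illustrative).
rng = np.random.default_rng(30)
sales = pd.DataFrame({
    "Store A": rng.normal(10, 2, 30),
    "Store B": rng.normal(10, 2, 30),
    "Store C": rng.normal(12, 2, 30),
})

# Paired t-test for each pair of stores, matched by day (row).
pairs = list(combinations(sales.columns, 2))
raw_p = [ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs]

# Bonferroni adjustment across the three comparisons.
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject = {r}")
```

Each comparison uses the day-by-day differences between two stores, which matches the repeated measures design.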
