In [None]:
# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
# the validity of the results.
"""ANOVA (analysis of variance) is a statistical technique used to compare means across multiple groups.

The assumptions required to use ANOVA are:

Normality: The dependent variable should be approximately normally distributed within each group
(equivalently, the residuals should be approximately normal).

Homogeneity of variance: The variance of the dependent variable should be roughly equal across all groups.

Independence: The observations should be independent of each other, both within and between groups.


Examples of violations

 Normality: if the data in one group is heavily skewed or contains extreme outliers, the F-test can be
 unreliable, especially with small samples.

 Homogeneity of variance: if one group has a much larger variance than the others, the Type I error rate
 can be inflated or deflated, particularly when the group sizes are unequal.

 Independence: if the same subjects are measured more than once, or participants are clustered (for example
 students from the same classroom), the observations are correlated and a standard ANOVA can give misleading p-values.
"""

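The normality and equal-variance assumptions described above can be checked before running the ANOVA. The snippet below is a minimal sketch, not a definitive procedure: the three groups are hypothetical simulated data, and the Shapiro-Wilk and Levene tests are one common choice among several.

```python
import numpy as np
from scipy import stats

# Hypothetical data (an assumption for illustration): three groups of 30 observations
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=5.0, scale=1.0, size=30),
    "B": rng.normal(loc=5.5, scale=1.0, size=30),
    "C": rng.normal(loc=6.0, scale=1.0, size=30),
}

# Normality within each group (Shapiro-Wilk): a small p-value suggests non-normal data
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance (Levene): a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(*groups.values())

print(shapiro_p)
print(levene_p)
```

Independence cannot be tested this way; it has to be judged from how the data were collected.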



In [None]:
# Q2. What are the three types of ANOVA, and in what situations would each be used?
"""The three types of ANOVA are:

One-way ANOVA: used when there is a single independent variable (factor) with two or more levels or groups.
For example, to compare the mean weight gain among three different diets, we would use a one-way ANOVA.

Two-way ANOVA: used when there are two independent variables (factors), each with two or more levels or groups.
It tests the main effect of each factor and the interaction between them.
For example, to determine the effect of rain and snow on crop yield, we would use a two-way ANOVA.

Repeated measures ANOVA: used when the same subjects are measured under different
conditions or at different time points.
For example, to compare the blood pressure of the same patients before, during, and after a treatment.
"""

In [None]:
# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
"""Partitioning of variance is a central concept in ANOVA. It refers to dividing the total variance in the data
into components associated with the different sources of variation.

In a one-way ANOVA the total sum of squares is split into two components: the variation between groups
and the variation within groups (SST = SSB + SSW).

The between-group variation measures how far the group means lie from the overall mean.
The within-group variation measures how far the observations in each group lie from their own group mean.

Partitioning of variance is important because it lets us test whether the differences between the groups are
statistically significant. The F-statistic is the ratio of the between-group mean square to the within-group
mean square; a large F indicates that the group means differ by more than would be expected by chance alone.

Partitioning also shows the relative contribution of each source of variation to the overall variance.
This helps identify which factors matter most in explaining the variability in the data, and can help
guide future research and interventions.

"""

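The partition described above can be verified by hand. This is a small sketch using made-up numbers (the three groups are an assumption chosen so the arithmetic is easy to follow), showing that the total sum of squares equals the between-group plus the within-group sum of squares.

```python
import numpy as np

# Hypothetical data: three groups of three observations each
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([8.0, 9.0, 10.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Between-group variation: each group mean against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: each observation against its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(ss_total, ss_between, ss_within, f_stat)
```

For these numbers the between-group part is 24, the within-group part is 6, and the total is 30, giving F = 12.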
In [39]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
# sum of squares (SSR) in a one-way ANOVA using Python?
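import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {'food': [3, 4, 5, 6, 7],
        'weight_lost': [1, 2, 3, 2, 3]}
df = pd.DataFrame(data)

# Note: 'food' is numeric here, so it enters the model as one continuous predictor (1 df).
# For a categorical grouping factor, use C(food) in the formula instead.
model = ols('weight_lost ~ food', data=df).fit()
tbl = sm.stats.anova_lm(model, typ=1)
print(tbl)

# Explained sum of squares (SSE in the question's notation): the row for the factor
sse = tbl.loc['food', 'sum_sq']
# Residual sum of squares (SSR in the question's notation)
ssr = tbl.loc['Residual', 'sum_sq']
# The total sum of squares is the sum of the two components
sst = sse + ssr

print(round(sst, 4))
print(round(sse, 4))
print(round(ssr, 4))




           df  sum_sq  mean_sq    F    PR(>F)
food      1.0     1.6      1.6  4.0  0.139326
Residual  3.0     1.2      0.4  NaN       NaN
2.8
1.6
1.2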


In [38]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

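import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({'food1': [8, 12, 19, 8, 6, 11],
                     'food2': [4, 5, 4, 6, 9, 7],
                     'weight': [45, 46, 55, 67, 78, 87]})

# Main effects of food1 and food2 plus their interaction.
# Both predictors are numeric here; wrap them in C() to treat them as categorical factors.
model = ols('weight ~ food1 + food2 + food1:food2', data=data).fit()

# Type II ANOVA table: one row per main effect, one for the interaction, one for the residual.
# Its F and PR(>F) columns are what test each effect for significance.
anova_table = sm.stats.anova_lm(model, typ=2)

# Mean square of each term = sum of squares / degrees of freedom
food1_mean_sq = anova_table.loc['food1', 'sum_sq'] / anova_table.loc['food1', 'df']
food2_mean_sq = anova_table.loc['food2', 'sum_sq'] / anova_table.loc['food2', 'df']
interaction_mean_sq = anova_table.loc['food1:food2', 'sum_sq'] / anova_table.loc['food1:food2', 'df']

print(food1_mean_sq)
print(food2_mean_sq)
print(interaction_mean_sq)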


88.40903881010894
978.5141341764569
151.13052457126028


In [None]:
# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
# What can you conclude about the differences between the groups, and how would you interpret these
# results?
"""The F-statistic is the ratio of the variance between the groups to the variance within the groups. It
tests the null hypothesis that all group means are equal.

An F-statistic of 5.23 means the between-group variance is about five times the within-group variance.
Since the p-value of 0.02 is below 0.05, we reject the null hypothesis and conclude that at least one group
mean differs from the others. The ANOVA alone does not show which groups differ; a post-hoc test is needed for that.
"""

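The link between an F-statistic and its p-value can be illustrated with the F distribution. The question does not state the degrees of freedom, so the values below are assumptions (3 groups and 18 observations in total), chosen only because they give a p-value of roughly 0.02 for F = 5.23.

```python
from scipy import stats

f_stat = 5.23
df_between = 2   # assumption: 3 groups, so 3 - 1 = 2
df_within = 15   # assumption: 18 observations in total, so 18 - 3 = 15

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value at alpha = 0.05: reject the null hypothesis when F exceeds it
critical_value = stats.f.ppf(0.95, df_between, df_within)

print(p_value)
print(critical_value)
print(f_stat > critical_value)
```

With other degrees of freedom the same F-statistic would give a different p-value, which is why both numbers are reported together.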

In [None]:
# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
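# consequences of using different methods to handle missing data?

"""The simplest approach is deletion. Listwise deletion drops every subject with any missing measurement, and
pairwise deletion uses whatever observations are available for each comparison. Both reduce the sample size and
power, and can bias the results if the data are not missing completely at random.
Alternatively, missing values can be imputed with an estimated value, for example by mean imputation or
regression imputation. Mean imputation shrinks the variance and regression imputation overstates the
relationships between variables, so either can bias the results and increase the risk of Type I or Type II errors."""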
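
The difference between deletion and imputation can be seen on a toy example. This is a sketch with hypothetical scores for four subjects at three time points (the column and subject names are made up for illustration); it only shows the mechanics, not a recommendation of either method.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0],
        "t2": [11.0, np.nan, 12.0, 15.0],
        "t3": [13.0, 14.0, np.nan, 17.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Listwise deletion: keep only subjects measured at every time point
listwise = scores.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = scores.fillna(scores.mean())

print(len(listwise))        # subjects left after deletion
print(scores.var())         # variance per time point, ignoring missing values
print(mean_imputed.var())   # variance per time point after imputation
```

Listwise deletion halves the sample here, while mean imputation keeps all four subjects but lowers the variance of the columns that had missing values.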



In [None]:
# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
# an example of a situation where a post-hoc test might be necessary.

"""After an ANOVA finds a significant overall effect, post-hoc tests are used to determine which specific groups differ.

Tukey's Honestly Significant Difference (HSD) test: compares all pairs of group means while controlling the
family-wise error rate. It is recommended when every pairwise comparison is of interest, the variances are
homogeneous, and the sample sizes are equal (the Tukey-Kramer variant handles unequal sizes).

Bonferroni correction: adjusts the alpha level for multiple comparisons by dividing it by the number of tests.
It is simple and general, and works well for a small number of planned comparisons, but it becomes very
conservative as the number of tests grows.

Scheffé's method: the most conservative of these procedures for pairwise comparisons. It is recommended
when complex contrasts (for example one group against the average of two others) are of interest, and it can be
used with unequal sample sizes.

Dunnett's test: compares each group with a control group, rather than all groups with each other. It is recommended
when there is a single control group and multiple experimental groups.

For example, suppose a researcher wants to compare the effectiveness of three types of exercise on weight loss:
aerobic exercise, resistance exercise, and a combination of both. After conducting an ANOVA, the researcher finds a
significant difference between the group means. The ANOVA does not say which groups differ, so a post-hoc test is
needed. Tukey's HSD would be appropriate here because all three pairwise comparisons are of interest, provided the
variances are homogeneous. If the variances are clearly heterogeneous, a procedure that does not assume equal
variances, such as the Games-Howell test, would be more appropriate.
"""

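The exercise example above can be sketched with Tukey's HSD from statsmodels. The weight-loss figures are simulated and purely hypothetical (group means, spread, and sample size of 20 per group are all assumptions), so the output only illustrates the shape of the result.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical weight loss in kg for 20 participants per exercise type
rng = np.random.default_rng(42)
aerobic = rng.normal(4.0, 1.0, 20)
resistance = rng.normal(4.2, 1.0, 20)
combined = rng.normal(6.5, 1.0, 20)

values = np.concatenate([aerobic, resistance, combined])
labels = ["aerobic"] * 20 + ["resistance"] * 20 + ["combined"] * 20

# All pairwise comparisons with the family-wise error rate held at 5%
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())
```

Each row of the summary is one pair of groups, with the adjusted p-value and a `reject` flag for that pair.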

In [12]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
# 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
# to determine if there are any significant differences between the mean weight loss of the three diets.
# Report the F-statistic and p-value, and interpret the results.
# H0: the mean weight loss is the same across the three diets
# H1: at least one diet has a different mean weight loss
import scipy.stats as stats
import numpy as np

# Synthetic, evenly spaced (not random) weight-loss values, 50 per diet, for illustration only
A = np.linspace(1, 10, 50)
B = np.linspace(1, 5, 50)
C = np.linspace(1, 7, 50)
f_stat, p_val = stats.f_oneway(A, B, C)

print(f_stat)
print(p_val)

if p_val < 0.05:
    print("reject the null hypothesis ")
else:
    print("failed to reject the null hypothesis ")

"""The values are evenly spaced over 1 to 10 for A, 1 to 5 for B, and 1 to 7 for C, so the group means differ
(5.5, 3, and 4). The p-value is below 0.05, so we reject the null hypothesis: at least one diet has a
different mean weight loss."""


20.176470588235297
1.808323932018226e-08
reject the null hypothesis 


In [46]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to
# complete a task using three different software programs: Program A, Program B, and Program C. They
# randomly assign 30 employees to one of the programs and record the time it takes each employee to
# complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
# interaction effects between the software programs and employee experience level (novice vs.
# experienced). Report the F-statistics and p-values, and interpret the results.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative data: 90 timings, balanced over 3 programs and 2 experience levels
# (the question describes 30 employees)
data = pd.DataFrame({
    'Software': ['A', 'B', 'C'] * 30,
    'Experience': ['Novice'] * 45 + ['Experienced'] * 45,
    'Time': [10, 12, 15, 13, 14, 12, 11, 16, 18, 20, 11, 14, 16, 17, 14, 13, 12, 14, 11, 10,
             14, 13, 11, 12, 16, 17, 19, 20, 18, 16, 15, 18, 17, 19, 20, 16, 14, 11, 12, 13,
             14, 12, 10, 11, 15, 16, 13, 12, 11, 10, 12, 14, 15, 13, 12, 14, 16, 18, 19, 17,
             16, 15, 19, 18, 17, 15, 14, 13, 12, 11, 14, 16, 15, 14, 13, 12, 10, 11, 13, 15,
             16, 15, 19, 18, 17, 15, 14, 13, 12, 11]
})

# Main effects of Software and Experience plus their interaction
model = ols('Time ~ Software + Experience + Software:Experience', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Interpretation: every p-value in the table is far above 0.05, so there is no evidence of a
# main effect of Software, a main effect of Experience, or an interaction between them.

                         sum_sq    df         F    PR(>F)
Software               1.155556   2.0  0.074897  0.927901
Experience             0.011111   1.0  0.001440  0.969816
Software:Experience    1.155556   2.0  0.074897  0.927901
Residual             648.000000  84.0       NaN       NaN


In [48]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test
# scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
# experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
# two-sample t-test using Python to determine if there are any significant differences in test scores
# between the two groups. If the results are significant, follow up with a post-hoc test to determine which
# group(s) differ significantly from each other.

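from scipy.stats import ttest_ind

# Illustrative scores for 10 students per group (the question describes 100 students)
control_scores = [80, 85, 90, 75, 78, 82, 86, 88, 92, 85]
experimental_scores = [90, 92, 95, 88, 86, 93, 96, 98, 85, 90]

# Two-sample t-test assuming equal variances (the scipy default)
t_stat, p_value = ttest_ind(control_scores, experimental_scores)
print(t_stat)
print(p_value)

# Interpretation: the p-value is below 0.05, so the mean scores differ significantly.
# The negative t-statistic means the control mean (84.1) is lower than the experimental mean (91.3).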


-3.3132887518206413
0.0038661916961076725


In [49]:
import numpy as np
import statsmodels.stats.multicomp as mc

# With only two groups a post-hoc test adds no new information: the single
# Tukey HSD comparison reproduces the p-value of the t-test.
control_scores = [80, 85, 90, 75, 78, 82, 86, 88, 92, 85]
experimental_scores = [90, 92, 95, 88, 86, 93, 96, 98, 85, 90]
scores = np.concatenate([control_scores, experimental_scores])
labels = ['Control'] * len(control_scores) + ['Experimental'] * len(experimental_scores)
tukey_results = mc.MultiComparison(scores, labels).tukeyhsd()
print(tukey_results)


   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental      7.2 0.0039 2.6346 11.7654   True
----------------------------------------------------------


In [45]:
# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
# retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
# on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
# significant differences in sales between the three stores. If the results are significant, follow up with a
# post-hoc test to determine which store(s) differ significantly from each other.

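import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative data: 10 days of sales per store (the question describes 30 days)
data = {'Day': [i + 1 for i in range(10)] * 3,
        'Store': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
        'Sales': [50, 52, 48, 55, 53, 51, 49, 50, 52, 47,
                  60, 58, 62, 59, 61, 63, 58, 59, 60, 61,
                  45, 47, 50, 49, 46, 48, 50, 52, 55, 53]}

sales_df = pd.DataFrame(data)

# Approximation: Day enters as a numeric covariate, so this is an additive linear model
# rather than a true repeated measures ANOVA, which would treat each day as a subject.
rm_model = ols('Sales ~ Store + Day', data=sales_df).fit()

anova_table = sm.stats.anova_lm(rm_model, typ=2)
print(anova_table)

# Post-hoc pairwise comparison of the stores. Tukey's HSD treats the stores as
# independent groups, so it ignores the pairing of observations by day.
tukey_results = pairwise_tukeyhsd(sales_df['Sales'], sales_df['Store'])
print(tukey_results)

# Interpretation: the Store effect is significant (p < 0.05). Store B differs from both
# A and C, while A and C do not differ significantly from each other.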


              sum_sq    df          F        PR(>F)
Store     673.866667   2.0  56.663339  3.328188e-10
Day        12.897980   1.0   2.169102  1.528121e-01
Residual  154.602020  26.0        NaN           NaN
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj  lower    upper  reject
----------------------------------------------------
     A      B      9.4   0.0   6.6382 12.1618   True
     A      C     -1.2 0.536  -3.9618  1.5618  False
     B      C    -10.6   0.0 -13.3618 -7.8382   True
----------------------------------------------------
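
A repeated measures ANOVA that treats each day as the repeated "subject" can be sketched with `AnovaRM` from statsmodels. This is one possible formulation rather than the notebook's own method: it reuses the same illustrative 10-day data, and the choice of `Day` as the subject and `Store` as the within-subject factor is an assumption about how the design maps onto the function.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Same illustrative sales data as above: 10 days, each observed at all three stores
sales_df = pd.DataFrame({
    "Day": list(range(1, 11)) * 3,
    "Store": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "Sales": [50, 52, 48, 55, 53, 51, 49, 50, 52, 47,
              60, 58, 62, 59, 61, 63, 58, 59, 60, 61,
              45, 47, 50, 49, 46, 48, 50, 52, 55, 53],
})

# Day plays the role of the subject; Store is the within-subject factor
rm_result = AnovaRM(sales_df, depvar="Sales", subject="Day", within=["Store"]).fit()
print(rm_result.anova_table)
```

Because day-to-day variation is removed from the error term, the denominator degrees of freedom here are 18 rather than the 26 of the additive model above.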
