In [1]:
pip install -r req.txt


Q1. Assumptions for ANOVA:

1. **Normality**: The dependent variable should be approximately normally distributed within each group. When the data deviate strongly from normality (for example, through heavy skew or extreme outliers), the p-values from the F-test become less reliable, particularly with small samples.
   
2. **Homogeneity of Variance**: The variances of the groups should be approximately equal. Violations, known as heteroscedasticity, can lead to inflated Type I error rates, impacting the validity of ANOVA results.
   
3. **Independence**: Observations within each group should be independent of each other. Violations, such as autocorrelation or dependence among observations, can bias ANOVA results.

Examples of violations:

- Normality: Skewed or heavy-tailed distributions, outliers.
- Homogeneity of Variance: Unequal variances across groups, evident from visual inspection or Levene's test.
- Independence: Data collected from the same subjects over time, violating the assumption of independence.
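
The first two assumptions can be checked numerically before running the test. The following is a minimal sketch, assuming the diet data used in Q4 and Q9 and SciPy's `shapiro` and `levene` functions; the choice of these two tests is an illustration, not the only option:

```python
import scipy.stats as stats

# Weight-loss values for three diets (same example data as in Q4 and Q9)
groups = {
    'A': [50, 26, 15, 24, 65, 87, 84, 65],
    'B': [65, 65, 15, 54, 35, 65, 84, 87],
    'C': [65, 65, 84, 84, 69, 56, 94, 59],
}

# Normality: Shapiro-Wilk test within each group
shapiro_p = {}
for name, values in groups.items():
    statistic, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk, group {name}: W={statistic:.3f}, p={p:.3f}")

# Homogeneity of variance: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene: W={levene_stat:.3f}, p={levene_p:.3f}")
```

A small p-value (for example, below 0.05) in either test suggests the corresponding assumption may be violated. Independence cannot be checked this way; it depends on how the data were collected.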

Q2. Types of ANOVA:

1. **One-way ANOVA**: Compares means across two or more independent groups. Useful when comparing the effects of a single factor or treatment.
   
2. **Two-way ANOVA**: Examines the influence of two independent categorical variables on a continuous dependent variable. Useful for studying interactions between factors.
   
3. **Repeated measures ANOVA**: Analyzes data where the same subjects are measured multiple times under different conditions. Useful for within-subject designs.

Q3. Partitioning of Variance:

In ANOVA, the total variance in the dependent variable is partitioned into components: 

- **Total sum of squares (SST)**: Variation in the dependent variable across all observations.
  
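- **Explained sum of squares (SSE)**: Variation between the group means that is explained by the factors or treatments (also called the between-group sum of squares).
  
- **Residual sum of squares (SSR)**: Unexplained variation or error within groups (also called the within-group sum of squares).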

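This partition shows how much of the total variance is accounted for by the factors being studied (SSE) and how much is left unexplained (SSR), which is what the F-test compares. Note that some textbooks reverse the abbreviations, using SSE for the error (residual) sum of squares and SSR for the regression (explained) sum of squares; the definitions above are the ones used here.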
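
Written out for a one-way design, the partition looks as follows. The symbols are introduced here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{y}_i$, and grand mean $\bar{y}$.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij}-\bar{y})^2}_{\text{SST}}
= \underbrace{\sum_{i=1}^{k} n_i\,(\bar{y}_i-\bar{y})^2}_{\text{SSE (explained)}}
+ \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij}-\bar{y}_i)^2}_{\text{SSR (residual)}}
$$

$$
F = \frac{\text{SSE}/(k-1)}{\text{SSR}/(N-k)}
$$

With the values computed in Q4 ($k = 3$, $N = 24$), this gives $F = (1656.33/2)/(10777.50/21) \approx 1.61$, which agrees with the F-statistic reported by `stats.f_oneway` in Q9.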



In [10]:
# Q4. Calculation of SST, SSE, and SSR in one-way ANOVA using Python:
import numpy as np
import pandas as pd  # needed for pd.DataFrame below

# Assuming 'data' is a DataFrame containing the weight loss data for three diets A, B, and C
# Example:
data = pd.DataFrame({'A': [50,26,15,24,65,87,84,65],
                     'B': [65,65,15,54,35,65,84,87],
                     'C': [65,65,84,84,69,56,94,59]})

# Calculate total sum of squares (SST)
SST = np.sum((data.values - np.mean(data.values))**2)

# Calculate explained sum of squares (SSE)
SSE = np.sum([len(data[group]) * (np.mean(data[group]) - np.mean(data.values))**2 for group in data.columns])

# Calculate residual sum of squares (SSR)
SSR = SST - SSE

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)



Total Sum of Squares (SST): 12433.833333333332
Explained Sum of Squares (SSE): 1656.3333333333333
Residual Sum of Squares (SSR): 10777.499999999998


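Q5. In Python, main and interaction effects in a two-way ANOVA are typically calculated by fitting a linear model that includes both factors and their interaction term, for example with `ols` from `statsmodels.formula.api`, and then passing the fitted model to `sm.stats.anova_lm`. The resulting table reports a sum of squares, an F-statistic, and a p-value for each main effect and for the interaction.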
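
To make the quantities behind a two-way ANOVA concrete, here is a minimal sketch that computes the sums of squares and F-statistics by hand with NumPy. It assumes a balanced design (the same number of replicates in every cell); the data values and the factor names A and B are hypothetical and chosen only for illustration:

```python
import numpy as np

# Hypothetical balanced design: 2 levels of factor A x 3 levels of factor B,
# 2 replicates per cell. Array shape is (levels of A, levels of B, replicates).
y = np.array([
    [[4.0, 6.0], [7.0, 9.0], [10.0, 12.0]],   # factor A, level 0
    [[5.0, 7.0], [6.0, 8.0], [15.0, 17.0]],   # factor A, level 1
])
a, b, n = y.shape

grand_mean = y.mean()
mean_a = y.mean(axis=(1, 2))   # one mean per level of A
mean_b = y.mean(axis=(0, 2))   # one mean per level of B
cell_mean = y.mean(axis=2)     # one mean per (A, B) cell

# Sums of squares for the main effects, the interaction, and the error
ss_a = b * n * np.sum((mean_a - grand_mean) ** 2)
ss_b = a * n * np.sum((mean_b - grand_mean) ** 2)
ss_ab = n * np.sum(
    (cell_mean - mean_a[:, None] - mean_b[None, :] + grand_mean) ** 2
)
ss_error = np.sum((y - cell_mean[:, :, None]) ** 2)
ss_total = np.sum((y - grand_mean) ** 2)

# Degrees of freedom and F-statistics (each effect is tested against the error)
df_a, df_b = a - 1, b - 1
df_ab = (a - 1) * (b - 1)
df_error = a * b * (n - 1)
ms_error = ss_error / df_error

f_a = (ss_a / df_a) / ms_error
f_b = (ss_b / df_b) / ms_error
f_ab = (ss_ab / df_ab) / ms_error

print("F (main effect A):", f_a)
print("F (main effect B):", f_b)
print("F (interaction A x B):", f_ab)
```

In a balanced design the four components add up to the total sum of squares. For unbalanced data this simple decomposition no longer holds, which is one reason to use a model-based routine instead.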

Q6. Interpretation of ANOVA results:

With an F-statistic of 5.23 and a p-value of 0.02, we reject the null hypothesis that the means of all groups are equal at the 0.05 significance level. This indicates that at least two group means differ significantly, but not which ones. Post-hoc tests can then determine which specific groups differ.

Q7. Handling missing data in repeated measures ANOVA:

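- Missing Completely at Random (MCAR): Cases with missing data can be excluded without biasing the results, although the smaller sample reduces statistical power.
- Missing at Random (MAR): Use imputation techniques (for example, multiple imputation) that draw on the observed auxiliary variables; simply excluding incomplete cases can bias the results.
- Missing Not at Random (MNAR): Address the underlying reasons for missingness or use sensitivity analysis to assess its impact.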

Consequences of different methods: Ignoring missing data or using inappropriate imputation techniques can bias estimates, inflate Type I error rates, or reduce statistical power.
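
The difference between two common approaches can be sketched with pandas. The DataFrame below is hypothetical (made-up scores for five subjects under three conditions), and mean imputation is shown only for comparison, not as a recommendation:

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per condition
scores = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    'cond_1': [10.0, 12.0, np.nan, 11.0, 9.0],
    'cond_2': [11.0, np.nan, 13.0, 12.0, 10.0],
    'cond_3': [12.0, 14.0, 15.0, 13.0, 11.0],
})
conditions = ['cond_1', 'cond_2', 'cond_3']

# Listwise deletion: keep only subjects with complete data
complete_cases = scores.dropna(subset=conditions)

# Simple mean imputation: fill each gap with that condition's observed mean
imputed = scores.copy()
imputed[conditions] = imputed[conditions].fillna(imputed[conditions].mean())

print("Subjects before deletion:", len(scores))
print("Subjects after listwise deletion:", len(complete_cases))
print("Variance per condition (complete cases):")
print(complete_cases[conditions].var())
print("Variance per condition (mean-imputed):")
print(imputed[conditions].var())
```

Listwise deletion discards every subject with a gap, which costs power. Mean imputation keeps all subjects but places the filled values exactly at the mean, which tends to understate the variance.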

Q8. Common post-hoc tests:

- **Tukey's Honestly Significant Difference (HSD)**: Compares all pairs of means, useful for one-way ANOVA.
  
- **Bonferroni correction**: Controls family-wise error rate by adjusting significance levels for multiple comparisons.
  
- **Holm-Bonferroni method**: Sequentially adjusts significance levels, maintaining family-wise error rate control.

Post-hoc tests are necessary when ANOVA indicates significant differences between groups, helping identify specific pairwise differences.
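
The two correction methods above can be sketched in a few lines of plain Python. The function names and the p-values are hypothetical and chosen only to illustrate how the two procedures differ:

```python
def bonferroni(p_values, alpha=0.05):
    """Return reject decisions using the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Return reject decisions using Holm's step-down procedure."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return reject


# Hypothetical p-values from three pairwise comparisons
p_values = [0.010, 0.020, 0.040]
bonferroni_decisions = bonferroni(p_values)
holm_decisions = holm(p_values)

print("Bonferroni:", bonferroni_decisions)  # [True, False, False]
print("Holm:      ", holm_decisions)        # [True, True, True]
```

With these p-values Bonferroni rejects only the first comparison, while Holm rejects all three. Both control the family-wise error rate, but Holm's sequential thresholds are less conservative.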



In [8]:
# Q9. One-way ANOVA in Python:
import scipy.stats as stats
import pandas as pd

# Assuming 'data' is a DataFrame containing weight loss data for three diets A, B, and C
# Example:
data = pd.DataFrame({'A': [50,26,15,24,65,87,84,65],
                     'B': [65,65,15,54,35,65,84,87],
                     'C': [65,65,84,84,69,56,94,59]})

# Perform one-way ANOVA
F_statistic, p_value = stats.f_oneway(data['A'], data['B'], data['C'])

print("F-statistic:", F_statistic)
print("P-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("Reject the null hypothesis: There are significant differences between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis: There are no significant differences between the mean weight loss of the three diets.")


# This tests whether there are significant differences in the mean weight loss among the three diets. If the p-value is less than 0.05, we reject the null hypothesis, concluding that there are significant differences between the diets. Otherwise, if the p-value is greater than 0.05, we fail to reject the null hypothesis.

F-statistic: 1.6136859197401996
P-value: 0.2228878492661036
Fail to reject the null hypothesis: There are no significant differences between the mean weight loss of the three diets.


Q10. Two-way ANOVA with Python:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assuming 'data' is a DataFrame containing the time data for each software program and employee experience level
# Example:
# data = pd.DataFrame({'Time': [list of time data],
#                      'Program': [list of program labels],
#                      'Experience': [list of experience level labels]})

# Fit the ANOVA model
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)
```

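Interpretation: In the ANOVA table, a p-value below the chosen significance level (for example, 0.05) in the `C(Program)` or `C(Experience)` row indicates a significant main effect, and a p-value below that level in the `C(Program):C(Experience)` row indicates a significant interaction. If the interaction is significant, interpret the main effects with caution, because the effect of one factor then depends on the level of the other.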

Q11. Two-sample t-test with Python:

```python
import scipy.stats as stats

# Assuming 'control' and 'experimental' are arrays containing test scores for control and experimental groups
# Example:
# control = [list of test scores for control group]
# experimental = [list of test scores for experimental group]

# Perform two-sample t-test
t_stat, p_value = stats.ttest_ind(control, experimental)

print("T-statistic:", t_stat)
print("P-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("Reject the null hypothesis: There are significant differences in test scores between the control and experimental groups.")
else:
    print("Fail to reject the null hypothesis: There are no significant differences in test scores between the control and experimental groups.")
```

Because only two groups are compared here, a significant t-test already shows which groups differ, so no post-hoc test is required. A post-hoc test such as Tukey's HSD is needed only when three or more groups are compared.

Q12. Repeated measures ANOVA with Python:

```python
import pingouin as pg

# Assuming 'data' is a DataFrame containing the daily sales data for each store
# Example:
# data = pd.DataFrame({'Day': [list of days],
#                      'Store': [list of store labels],
#                      'Sales': [list of daily sales]})

# Perform repeated measures ANOVA
rm_anova = pg.rm_anova(dv='Sales', within='Store', subject='Day', data=data)

print(rm_anova)
```

If the results are significant, follow up with a post-hoc test such as pairwise t-tests with Bonferroni correction to determine which store(s) differ significantly from each other.