Statistical Hypothesis Testing
===================================
This notebook demonstrates several statistical hypothesis tests on synthetic datasets. It checks test assumptions, selects a test for each question, and interprets each result at the 0.05 significance level.

Steps:
1. **Generate Synthetic Datasets**
   - Draw three normally distributed samples of 100 values each (`data_X`, `data_Y`, `data_Z`) with different means and standard deviations.
2. **Create a DataFrame**
   - Combine the datasets into a single DataFrame for analysis.
3. **Perform Statistical Tests**
   - Normality Check (Shapiro-Wilk Test)
   - Homogeneity of Variances (Levene's Test)
   - Difference in Means (ANOVA)
   - Non-Parametric Comparison (Mann-Whitney U Test)
   - Association Between Categorical Variables (Chi-Square Test)

Each test is documented with:
- Null and Alternative Hypotheses
- Assumptions
- Interpretation of Results


## Import Libraries

In [14]:
import pandas as pd
import numpy as np
from scipy.stats import shapiro, levene, f_oneway, mannwhitneyu, chi2_contingency

## Generate Synthetic Datasets

In [15]:
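# Fix the seed so the three samples are reproducible.
np.random.seed(24)
# Three normal samples of 100 values each; loc is the mean, scale the standard deviation.
data_X = np.random.normal(loc=80, scale=8, size=100)
data_Y = np.random.normal(loc=85, scale=6, size=100)
data_Z = np.random.normal(loc=90, scale=12, size=100)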
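
As a quick sanity check on the generated samples, the sketch below compares each sample's mean and standard deviation with the parameters passed to `np.random.normal`. The `summarize` helper and the `targets` dictionary are illustrative names, not part of the original notebook.

```python
import numpy as np

np.random.seed(24)
data_X = np.random.normal(loc=80, scale=8, size=100)
data_Y = np.random.normal(loc=85, scale=6, size=100)
data_Z = np.random.normal(loc=90, scale=12, size=100)


def summarize(sample):
    """Return the sample mean and the sample standard deviation (ddof=1)."""
    sample = np.asarray(sample, dtype=float)
    return sample.mean(), sample.std(ddof=1)


# Hypothetical lookup of the (loc, scale) values used to generate each sample.
targets = {"X": (80, 8), "Y": (85, 6), "Z": (90, 12)}
samples = {"X": data_X, "Y": data_Y, "Z": data_Z}

for name, sample in samples.items():
    mean, std = summarize(sample)
    loc, scale = targets[name]
    print(f"{name}: mean={mean:.2f} (target {loc}), std={std:.2f} (target {scale})")
```

With 100 values per group, the sample statistics should land near the targets but will not match them exactly.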

## Create a DataFrame

In [17]:
data_frame = pd.DataFrame({
    "Result": np.concatenate([data_X, data_Y, data_Z]),
    "Category": ["X"] * 100 + ["Y"] * 100 + ["Z"] * 100
})
# Note: pd.cut leaves values outside (0, 100] as NaN, so any Result above 100 gets no Level.
data_frame["Level"] = pd.cut(data_frame["Result"], bins=[0, 40, 70, 100], labels=["Low", "Moderate", "High"])
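
Because group Z is drawn with mean 90 and standard deviation 12, some results are likely to exceed 100 and fall outside the bins passed to `pd.cut`. The sketch below counts how many rows end up without a `Level`, and shows one possible alternative with an open-ended top bin. The open-ended bin is an assumption made for illustration; it is not what the notebook uses, and it would change the downstream chi-square result.

```python
import numpy as np
import pandas as pd

np.random.seed(24)
results = np.concatenate([
    np.random.normal(loc=80, scale=8, size=100),
    np.random.normal(loc=85, scale=6, size=100),
    np.random.normal(loc=90, scale=12, size=100),
])

# Bins as in the notebook: anything outside (0, 100] is left as NaN.
level = pd.cut(results, bins=[0, 40, 70, 100], labels=["Low", "Moderate", "High"])
counts = pd.Series(level).value_counts(dropna=False)
n_unbinned = int(pd.isna(level).sum())
print(counts)
print(f"Rows without a Level: {n_unbinned}")

# Illustrative alternative (an assumption): open-ended top bin so large values are labelled "High".
level_open = pd.cut(results, bins=[0, 40, 70, np.inf], labels=["Low", "Moderate", "High"])
print(f"Rows without a Level (open-ended bin): {int(pd.isna(level_open).sum())}")
```

Rows with a missing `Level` are silently dropped by `pd.crosstab`, so the count above also shows how many observations a later contingency table would exclude.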


### Normality Check (Shapiro-Wilk Test)

In [18]:
print("Normality Check\n")
for group in ["X", "Y", "Z"]:
    print(f"Group: {group}")
    stat, p = shapiro(data_frame[data_frame["Category"] == group]["Result"])
    print(" - Null Hypothesis: The data follows a normal distribution.")
    print(" - Alternative Hypothesis: The data does not follow a normal distribution.")
    print(f" - p-value: {p:.4f}")
    if p > 0.05:
        print(f"   ✅ Interpretation: Fail to reject the null hypothesis. No evidence against normality at the 0.05 level.\n")
    else:
        print(f"   ❌ Interpretation: Reject the null hypothesis. The data does not follow a normal distribution.\n")

Normality Check

Group: X
 - Null Hypothesis: The data follows a normal distribution.
 - Alternative Hypothesis: The data does not follow a normal distribution.
 - p-value: 0.4702
   ✅ Interpretation: Fail to reject the null hypothesis. No evidence against normality at the 0.05 level.

Group: Y
 - Null Hypothesis: The data follows a normal distribution.
 - Alternative Hypothesis: The data does not follow a normal distribution.
 - p-value: 0.8788
   ✅ Interpretation: Fail to reject the null hypothesis. No evidence against normality at the 0.05 level.

Group: Z
 - Null Hypothesis: The data follows a normal distribution.
 - Alternative Hypothesis: The data does not follow a normal distribution.
 - p-value: 0.0260
   ❌ Interpretation: Reject the null hypothesis. The data does not follow a normal distribution.
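
Group Z was itself drawn from a normal distribution, so its rejection is a false positive. With three tests at the 0.05 level, the chance of at least one false rejection is 1 - 0.95^3, or about 14%. The sketch below reports skewness and excess kurtosis alongside the Shapiro-Wilk result and repeats the decision at a Bonferroni-adjusted threshold of 0.05 / 3. The `normality_summary` helper is a hypothetical name introduced here, and the Bonferroni adjustment is one possible choice rather than the notebook's method.

```python
import numpy as np
from scipy.stats import kurtosis, shapiro, skew


def normality_summary(sample, alpha=0.05):
    """Shapiro-Wilk result plus two descriptive shape statistics."""
    sample = np.asarray(sample, dtype=float)
    stat, p = shapiro(sample)
    return {
        "W": float(stat),
        "p_value": float(p),
        "skewness": float(skew(sample)),
        # Fisher definition: 0 for a normal distribution.
        "excess_kurtosis": float(kurtosis(sample)),
        "reject_normality": bool(p < alpha),
    }


np.random.seed(24)
groups = {
    "X": np.random.normal(loc=80, scale=8, size=100),
    "Y": np.random.normal(loc=85, scale=6, size=100),
    "Z": np.random.normal(loc=90, scale=12, size=100),
}

adjusted_alpha = 0.05 / len(groups)  # Bonferroni threshold for three tests
for name, values in groups.items():
    summary = normality_summary(values, alpha=adjusted_alpha)
    print(name, {key: round(val, 4) if isinstance(val, float) else val
                 for key, val in summary.items()})
```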



### Homogeneity of Variances (Levene's Test)


In [19]:
print("Homogeneity of Variances Check\n")
stat, p = levene(data_frame[data_frame["Category"] == "X"]["Result"],
                 data_frame[data_frame["Category"] == "Y"]["Result"],
                 data_frame[data_frame["Category"] == "Z"]["Result"])
print(f"Levene’s Test:\n - Null Hypothesis: Variances are equal across groups.\n - Alternative Hypothesis: Variances are not equal across groups.\n - p-value: {p:.4f}")
if p > 0.05:
    print("   ✅ Interpretation: Fail to reject the null hypothesis. No evidence that variances differ across the groups.\n")
else:
    print("   ❌ Interpretation: Reject the null hypothesis. Variances are not equal across the groups.\n")


Homogeneity of Variances Check

Levene’s Test:
 - Null Hypothesis: Variances are equal across groups.
 - Alternative Hypothesis: Variances are not equal across groups.
 - p-value: 0.0000
   ❌ Interpretation: Reject the null hypothesis. Variances are not equal across the groups.
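
The `levene` call above relies on SciPy's default `center="median"`, which is the Brown-Forsythe variant and is more robust to non-normal data; `center="mean"` gives the original Levene statistic. The sketch below runs both and prints the sample variances, which show which group drives the difference. The `levene_results` dictionary is an illustrative name.

```python
import numpy as np
from scipy.stats import levene

np.random.seed(24)
groups = {
    "X": np.random.normal(loc=80, scale=8, size=100),
    "Y": np.random.normal(loc=85, scale=6, size=100),
    "Z": np.random.normal(loc=90, scale=12, size=100),
}

# Sample variances (ddof=1) give a direct view of the spread in each group.
for name, values in groups.items():
    print(f"{name}: sample variance = {np.var(values, ddof=1):.2f}")

# Compare the median-centred (default) and mean-centred versions of the test.
levene_results = {}
for center in ("median", "mean"):
    stat, p = levene(*groups.values(), center=center)
    levene_results[center] = (float(stat), float(p))
    print(f"center={center}: statistic={stat:.3f}, p-value={p:.4g}")
```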



### Difference in Means (ANOVA)

In [20]:
print("Difference in Means (ANOVA)\n")
stat, p = f_oneway(data_frame[data_frame["Category"] == "X"]["Result"],
                   data_frame[data_frame["Category"] == "Y"]["Result"],
                   data_frame[data_frame["Category"] == "Z"]["Result"])
print(f"ANOVA Test:\n - Null Hypothesis: All group means are equal.\n - Alternative Hypothesis: At least one group mean is different.\n - p-value: {p:.4f}")
if p < 0.05:
    print("   ❗ Interpretation: Reject the null hypothesis. There is a significant difference in means among the groups.\n")
else:
    print("   ✅ Interpretation: Fail to reject the null hypothesis. No significant difference in means among the groups.\n")


Difference in Means (ANOVA)

ANOVA Test:
 - Null Hypothesis: All group means are equal.
 - Alternative Hypothesis: At least one group mean is different.
 - p-value: 0.0000
   ❗ Interpretation: Reject the null hypothesis. There is a significant difference in means among the groups.
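
Levene's test above rejected equal variances, which classic one-way ANOVA assumes. One common alternative is Welch's ANOVA, which weights each group by `n / variance` and adjusts the denominator degrees of freedom. The `welch_anova` function below is a hand-written sketch of that formula, not a SciPy function and not part of the original notebook.

```python
import numpy as np
from scipy.stats import f as f_dist


def welch_anova(*samples):
    """Welch's one-way ANOVA. Returns (F, df1, df2, p_value)."""
    k = len(samples)
    n = np.array([len(s) for s in samples], dtype=float)
    means = np.array([np.mean(s) for s in samples])
    variances = np.array([np.var(s, ddof=1) for s in samples])

    # Precision weights: groups with smaller variance of the mean count more.
    weights = n / variances
    total_weight = weights.sum()
    weighted_mean = (weights * means).sum() / total_weight

    numerator = (weights * (means - weighted_mean) ** 2).sum() / (k - 1)
    lam = (((1 - weights / total_weight) ** 2) / (n - 1)).sum()
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return float(f_stat), df1, float(df2), float(f_dist.sf(f_stat, df1, df2))


np.random.seed(24)
data_X = np.random.normal(loc=80, scale=8, size=100)
data_Y = np.random.normal(loc=85, scale=6, size=100)
data_Z = np.random.normal(loc=90, scale=12, size=100)

welch_f, welch_df1, welch_df2, welch_p = welch_anova(data_X, data_Y, data_Z)
print(f"Welch F({welch_df1}, {welch_df2:.1f}) = {welch_f:.3f}, p-value = {welch_p:.4g}")
```

With two groups, this statistic reduces to the square of Welch's t statistic, which is a convenient way to check the implementation.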



### Non-Parametric Comparison (Mann-Whitney U Test)

In [22]:
print("Non-Parametric Test (Mann-Whitney U)\n")
stat, p = mannwhitneyu(data_frame[data_frame["Category"] == "X"]["Result"],
                       data_frame[data_frame["Category"] == "Y"]["Result"],
                       alternative='two-sided')
print(f"Mann-Whitney U Test:\n - Null Hypothesis: The distributions of Group X and Group Y are the same.\n - Alternative Hypothesis: The distributions of Group X and Group Y differ.\n - p-value: {p:.4f}")
if p < 0.05:
    print("   ❗ Interpretation: Reject the null hypothesis. Significant difference between Group X and Y.\n")
else:
    print("   ✅ Interpretation: Fail to reject the null hypothesis. No significant difference between Group X and Y.\n")


Non-Parametric Test (Mann-Whitney U)

Mann-Whitney U Test:
 - Null Hypothesis: The distributions of Group X and Group Y are the same.
 - Alternative Hypothesis: The distributions of Group X and Group Y differ.
 - p-value: 0.0001
   ❗ Interpretation: Reject the null hypothesis. Significant difference between Group X and Y.
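
The test above compares only groups X and Y. A minimal sketch of extending it to all three pairs follows, with a Bonferroni adjustment (each p-value multiplied by the number of comparisons and capped at 1) to account for running several tests. The `pairwise` dictionary is an illustrative name, and Bonferroni is one of several possible corrections.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu

np.random.seed(24)
groups = {
    "X": np.random.normal(loc=80, scale=8, size=100),
    "Y": np.random.normal(loc=85, scale=6, size=100),
    "Z": np.random.normal(loc=90, scale=12, size=100),
}

pairs = list(combinations(groups, 2))  # (X, Y), (X, Z), (Y, Z)
pairwise = {}
for first, second in pairs:
    stat, p = mannwhitneyu(groups[first], groups[second], alternative="two-sided")
    # Bonferroni: scale by the number of comparisons, capped at 1.
    p_adjusted = min(1.0, float(p) * len(pairs))
    pairwise[(first, second)] = (float(p), p_adjusted)
    print(f"{first} vs {second}: p={p:.4g}, Bonferroni-adjusted p={p_adjusted:.4g}")
```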



### Association Between Categorical Variables (Chi-Square Test)

In [23]:
print("Association Between Categorical Variables\n")
contingency = pd.crosstab(data_frame["Category"], data_frame["Level"])
stat, p, dof, expected = chi2_contingency(contingency)
print(f"Chi-Square Test:\n - Null Hypothesis: Category and Level are independent.\n - Alternative Hypothesis: There is a relationship between Category and Level.\n - p-value: {p:.4f}")
if p < 0.05:
    print("   ❗ Interpretation: Reject the null hypothesis. There is a relationship between Category and Level.\n")
else:
    print("   ✅ Interpretation: Fail to reject the null hypothesis. No evidence of a relationship between Category and Level.\n")


Association Between Categorical Variables

Chi-Square Test:
 - Null Hypothesis: Category and Level are independent.
 - Alternative Hypothesis: There is a relationship between Category and Level.
 - p-value: 0.0022
   ❗ Interpretation: Reject the null hypothesis. There is a relationship between Category and Level.
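
The chi-square approximation is usually considered reliable only when expected cell counts are not too small; a common rule of thumb is at least 5 per cell. The p-value also says nothing about the strength of the association, which Cramér's V summarizes on a 0 to 1 scale. The sketch below computes both for a hypothetical 2x2 table. The `chi_square_summary` helper and the example counts are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_summary(table):
    """Chi-square test of independence plus smallest expected count and Cramér's V."""
    observed = np.asarray(table, dtype=float)
    # correction=False so that V is based on the uncorrected statistic.
    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    return {
        "chi2": float(chi2),
        "p_value": float(p),
        "dof": int(dof),
        "min_expected": float(expected.min()),
        "cramers_v": float(np.sqrt(chi2 / (n * min_dim))),
    }


# Hypothetical counts, for illustration only.
example = pd.DataFrame([[10, 20], [20, 10]], index=["A", "B"], columns=["Moderate", "High"])
summary = chi_square_summary(example)
print(summary)
```

The same helper could be applied to the `contingency` table built above, provided every row and column has a non-zero total.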

