**Practical 4: Hypothesis Testing**
*   **Formulate null and alternative hypotheses for a given problem.**
*   **Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).**
*   **Interpret the results and draw conclusions based on the test outcomes.**



# **One-sample t-test: does the sample mean differ from a hypothesized value?**

In [1]:
import pandas as pd
import scipy.stats as stats

# 1. Load the dataset
df_students = pd.read_csv('student_sleep_patterns.csv')

# 2. Select the continuous variable: 'Sleep_Duration'
data = df_students['Sleep_Duration'].copy()

# 3. Define the hypothesized population mean (mu_0) for sleep duration
mu_0 = 7.5  # Hypothesized average sleep duration is 7.5 hours

# Define the null hypothesis
H0 = f"The average sleep duration of students is exactly {mu_0} hours."

# Define the alternative hypothesis (ttest_1samp performs a two-tailed test by default)
H1 = f"The average sleep duration of students is not {mu_0} hours."

# 4. Calculate the test statistic (t-statistic) and p-value
t_stat, p_value = stats.ttest_1samp(data, mu_0)

# 5. Print the results
print("Sample Variable: Sleep_Duration (hours)")
print("Hypothesized Mean (mu_0):", mu_0)
print("Sample Mean:", data.mean())

print("\nTest statistic:", t_stat)
print("p-value:", p_value)

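# 6. Conclusion
alpha = 0.05
if p_value < alpha:
    print("\nReject the null hypothesis.")
    print(f"Conclusion: {H1}")
else:
    # Failing to reject H0 does not prove H0; it only means the evidence is insufficient.
    print("\nFail to reject the null hypothesis.")
    print(f"Conclusion: There is not enough evidence that the average sleep duration differs from {mu_0} hours.")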

Sample Variable: Sleep_Duration (hours)
Hypothesized Mean (mu_0): 7.5
Sample Mean: 6.4723999999999995

Test statistic: -15.465337709710655
p-value: 2.321080448498737e-44

Reject the null hypothesis.
Conclusion: The average sleep duration of students is not 7.5 hours.
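
To make the test statistic less of a black box, the sketch below recomputes it by hand as t = (x̄ − μ₀) / (s / √n) and cross-checks the result against `stats.ttest_1samp`. The `sample` array is hypothetical and is not taken from `student_sleep_patterns.csv`; only the formula and the SciPy calls carry over.

```python
import numpy as np
from scipy import stats

# Hypothetical sleep durations in hours (illustrative only, not from the dataset)
sample = np.array([6.1, 7.0, 5.8, 6.9, 7.4, 6.3, 5.5, 6.8, 7.1, 6.0])
mu_0 = 7.5

n = sample.size
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)           # sample standard deviation (n - 1 denominator)
standard_error = sample_std / np.sqrt(n)

# t = (x_bar - mu_0) / (s / sqrt(n))
t_manual = (sample_mean - mu_0) / standard_error

# Two-tailed p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's built-in one-sample test
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The two lines of output should agree up to floating-point rounding, which confirms that the built-in test is the textbook formula applied with n − 1 degrees of freedom.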


**Two-sample t-test**: The independent-samples t-test (also called the two-sample t-test or independent t-test) compares the means of two independent groups to determine whether there is statistical evidence that the corresponding population means differ. It is a parametric test.

Example: do mean monthly charges differ between male and female customers?

In [9]:
import pandas as pd
from scipy.stats import ttest_ind
import numpy as np

# 1. Load the dataset
df = pd.read_csv("telco.csv")

# 2. Identify two independent samples: Monthly Charge for Male and Female customers
# Group 1 (gender1 proxy): Male Monthly Charge
gender1 = df[df['Gender'] == 'Male']['Monthly Charge'].dropna().to_numpy()
# Group 2 (gender2 proxy): Female Monthly Charge
gender2 = df[df['Gender'] == 'Female']['Monthly Charge'].dropna().to_numpy()

# Print data (First 5 elements shown for brevity)
print("gender1 data (Male Monthly Charge first 5 elements):-\n")
print(gender1[:5])
print("gender2 data (Female Monthly Charge first 5 elements):-\n")
print(gender2[:5])

# Calculate means
gender1_mean = np.mean(gender1)
gender2_mean = np.mean(gender2)
print("\ngender1 mean value (Male): %.2f" % gender1_mean)
print("gender2 mean value (Female): %.2f" % gender2_mean)

# Calculate standard deviations
gender1_std = np.std(gender1)
gender2_std = np.std(gender2)
print("gender1 std value (Male): %.2f" % gender1_std)
print("gender2 std value (Female): %.2f" % gender2_std)

# Perform the Two-Sample T-test
# H0: The mean monthly charges for Male and Female are equal.
ttest, pval = ttest_ind(gender1, gender2)

print("\np-value: %.4f" % pval)
if pval < 0.05:
    print("we reject null hypothesis: The mean monthly charges are significantly different.")
else:
    print("we fail to reject null hypothesis: The mean monthly charges are not significantly different.")

gender1 data (Male Monthly Charge first 5 elements):-

[39.65 95.45 45.3  74.4  40.2 ]
gender2 data (Female Monthly Charge first 5 elements):-

[80.65 98.5  76.5  78.05 70.45]

gender1 mean value (Male): 64.33
gender2 mean value (Female): 65.20
gender1 std value (Male): 30.11
gender2 std value (Female): 30.06

p-value: 0.2215
we fail to reject null hypothesis: The mean monthly charges are not significantly different.
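
By default `ttest_ind` pools the two sample variances, which is reasonable here because the two standard deviations printed above are nearly identical (30.11 and 30.06). When the spreads differ noticeably, Welch's variant (`equal_var=False`) is the usual alternative. The sketch below compares the two on synthetic data; the groups, sizes and parameters are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups with different sizes and spreads (illustrative only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=64.0, scale=30.0, size=200)
group_b = rng.normal(loc=66.0, scale=15.0, size=80)

# Pooled-variance (Student) test: the default, assumes equal population variances
t_pooled, p_pooled = ttest_ind(group_a, group_b)

# Welch's test: does not assume equal variances
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

# Welch statistic by hand: difference in means over the unpooled standard error
se_unpooled = np.sqrt(
    group_a.var(ddof=1) / group_a.size + group_b.var(ddof=1) / group_b.size
)
t_manual = (group_a.mean() - group_b.mean()) / se_unpooled

print(f"pooled: t={t_pooled:.4f}, p={p_pooled:.4f}")
print(f"welch:  t={t_welch:.4f}, p={p_welch:.4f}")
print(f"manual Welch t={t_manual:.4f}")
```

The hand-computed statistic should match the Welch result; the pooled result generally differs when both the variances and the group sizes are unequal.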


**Paired-samples t-test**: The paired-samples t-test, also called the dependent-samples t-test, is a univariate test for a significant difference between two related measurements. For example, you might record each individual's blood pressure before and after a treatment, condition, or time point.

H0: the mean difference between the two paired samples is 0

H1: the mean difference between the two paired samples is not 0

In [3]:
import pandas as pd
from scipy import stats

# 1. Load the dataset
df = pd.read_csv("mental_health_diagnosis_treatment_.csv")

# 2. Rename columns for easier access
df.rename(columns={'Symptom Severity (1-10)': 'Symptom_Severity',
                   'Stress Level (1-10)': 'Stress_Level'}, inplace=True)

# 3. Display descriptive statistics for the chosen columns
print(df[['Symptom_Severity','Stress_Level']].describe().to_markdown(numalign="left", stralign="left"))

# 4. Perform the paired t-test
# H0: The mean difference between Symptom Severity and Stress Level is zero.
ttest, pval = stats.ttest_rel(df['Symptom_Severity'], df['Stress_Level'])

# 5. Print the results
print("\nPaired T-test Results:")
print("p-value:", pval)

# 6. Conclusion
if pval < 0.05:
    print("reject Null Hypothesis: There is a significant difference between the mean Symptom Severity and the mean Stress Level.")
else:
    print("Fail to reject Null Hypothesis: There is no significant difference between the mean Symptom Severity and the mean Stress Level.")

|       | Symptom_Severity   | Stress_Level   |
|:------|:-------------------|:---------------|
| count | 500                | 500            |
| mean  | 7.478              | 7.542          |
| std   | 1.70626            | 1.70941        |
| min   | 5                  | 5              |
| 25%   | 6                  | 6              |
| 50%   | 8                  | 8              |
| 75%   | 9                  | 9              |
| max   | 10                 | 10             |

Paired T-test Results:
p-value: 0.5574736415853678
Fail to reject Null Hypothesis: There is no significant difference between the mean Symptom Severity and the mean Stress Level.
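
A paired t-test is the same calculation as a one-sample t-test on the per-individual differences, tested against a mean of 0. The sketch below illustrates that equivalence with hypothetical before/after blood pressure readings; the numbers are invented for the example and do not come from the dataset used above.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after readings for the same 8 individuals (illustrative only)
before = np.array([142, 138, 150, 145, 160, 155, 148, 152], dtype=float)
after = np.array([135, 136, 147, 140, 158, 149, 146, 150], dtype=float)

# Paired test on the two related samples
t_paired, p_paired = stats.ttest_rel(before, after)

# Equivalent formulation: one-sample test of the differences against 0
differences = before - after
t_diff, p_diff = stats.ttest_1samp(differences, 0.0)

print("paired:     ", t_paired, p_paired)
print("differences:", t_diff, p_diff)
```

Because the test works on differences, the two columns must be aligned row by row: each row has to describe the same individual in both measurements.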


# **When you can run a Z test**
Several different types of tests are used in statistics (e.g., F test, chi-square test, t test). You would use a Z test if:

*   Your sample size is greater than 30. Otherwise, use a t test.
*   Data points are independent of each other. In other words, one data point isn’t related to and doesn’t affect another data point.
*   Your data are normally distributed. However, for large sample sizes (over 30) this doesn’t always matter.
*   Your data are randomly selected from a population, where each item has an equal chance of being selected.
*   Sample sizes are equal if at all possible.

Example: a **one-sample Z test** of whether the mean blood pressure equals a hypothesized value (75 in the code below).

In [5]:
import pandas as pd
from statsmodels.stats import weightstats as stests
import numpy as np

# 1. Load the dataset
df = pd.read_csv("diabetes.csv")

# 2. Select the continuous variable: 'BloodPressure'
# Filter out 0 values, which represent missing data in this dataset.
bp_data = df[df['BloodPressure'] != 0]['BloodPressure']

# 3. Define the hypothesized population mean (mu_0)
mu_0 = 75

# Perform the one-sample Z-test
# H0: The mean BloodPressure is mu_0 (75).
# H1: The mean BloodPressure is not mu_0 (75).
ztest, pval = stests.ztest(bp_data, x2=None, value=mu_0)

# 4. Print the results
print(f"Sample Variable: BloodPressure (excluding 0s, n={len(bp_data)})")
print(f"Hypothesized Mean (mu_0): {mu_0}")
print(f"Sample Mean: {bp_data.mean():.2f}")
print("Z-statistic:", ztest)
print("p-value:", float(pval))

# 5. Conclusion
alpha = 0.05
if pval < alpha:
    print("\nreject null hypothesis: The average blood pressure is significantly different from 75 mmHg.")
else:
    print("\nfail to reject null hypothesis: The average blood pressure is not significantly different from 75 mmHg.")

Sample Variable: BloodPressure (excluding 0s, n=733)
Hypothesized Mean (mu_0): 75
Sample Mean: 72.41
Z-statistic: -5.673645234780451
p-value: 1.3979047891468943e-08

reject null hypothesis: The average blood pressure is significantly different from 75 mmHg.
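
For reference, the one-sample Z statistic can be computed with the standard library alone as z = (x̄ − μ₀) / (s / √n), with a two-sided p-value taken from the standard normal distribution. The sketch below uses hypothetical readings rather than `diabetes.csv`, and estimates the standard deviation from the sample, which is an assumption that is only reasonable for large samples.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical blood pressure readings in mmHg (illustrative only, n = 36)
readings = [72, 68, 80, 75, 70, 66, 74, 78, 71, 69,
            73, 77, 65, 70, 76, 72, 68, 74, 79, 67,
            71, 75, 70, 73, 69, 72, 66, 78, 74, 70,
            68, 76, 73, 71, 69, 75]
mu_0 = 75

n = len(readings)
x_bar = mean(readings)
s = stdev(readings)                       # sample standard deviation (n - 1 denominator)

# z = (x_bar - mu_0) / (s / sqrt(n))
z = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"n={n}, mean={x_bar:.2f}, z={z:.4f}, p={p_value:.6f}")
```

The only difference from the one-sample t-test is the reference distribution: the normal distribution replaces the t distribution, and the two give nearly the same p-value once n is large.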


**Two-sample Z test**: As with the two-sample t-test, the two-sample Z test compares two independent groups and decides whether their population means are equal.

H0: the difference between the two group means is 0

H1: the difference between the two group means is not 0

Example: do non-diabetic and diabetic individuals differ in mean BMI?

In [6]:
import pandas as pd
from statsmodels.stats import weightstats as stests
import numpy as np

# 1. Load the dataset
df = pd.read_csv("diabetes.csv")

# 2. Filter out 0 values from 'BMI' (missing data)
df_cleaned = df[df['BMI'] != 0].copy()

# 3. Create two independent samples based on the binary 'Outcome' column
# x1: Non-diabetic (Outcome = 0) BMI
x1_bmi = df_cleaned[df_cleaned['Outcome'] == 0]['BMI']
# x2: Diabetic (Outcome = 1) BMI
x2_bmi = df_cleaned[df_cleaned['Outcome'] == 1]['BMI']

# 4. Perform the two-sample Z-test (value=0 tests for equal means)
ztest, pval1 = stests.ztest(x1_bmi, x2=x2_bmi,
                            value=0, alternative='two-sided')

# 5. Print the results
print(f"Sample 1: Non-diabetic BMI (n={len(x1_bmi)}, Mean={x1_bmi.mean():.2f})")
print(f"Sample 2: Diabetic BMI (n={len(x2_bmi)}, Mean={x2_bmi.mean():.2f})")

print("\nZ-statistic:", ztest)
print("p-value:", float(pval1))

# 6. Conclusion
alpha = 0.05
if pval1 < alpha:
    print("\nreject null hypothesis: The average BMI is significantly different between non-diabetic and diabetic individuals.")
else:
    print("\nfail to reject null hypothesis: The average BMI is not significantly different between non-diabetic and diabetic individuals.")

Sample 1: Non-diabetic BMI (n=491, Mean=30.86)
Sample 2: Diabetic BMI (n=266, Mean=35.41)

Z-statistic: -9.07722131205247
p-value: 1.1138197379236504e-19

reject null hypothesis: The average BMI is significantly different between non-diabetic and diabetic individuals.
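
The two-sample Z statistic divides the difference in sample means by a standard error. The sketch below uses a pooled-variance standard error, which is one common choice and, as far as I know, the default in `statsmodels`; treat that as an assumption rather than a guarantee. The helper `two_sample_z` and the synthetic BMI-like values are hypothetical and exist only for this illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sample_z(sample_a, sample_b):
    """Return (z, two-sided p) for H0: the two population means are equal."""
    n1, n2 = len(sample_a), len(sample_b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(sample_a) + (n2 - 1) * variance(sample_b)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Synthetic BMI-like values for two independent groups (illustrative only)
rng = random.Random(0)
group_a = [rng.gauss(31.0, 6.0) for _ in range(40)]
group_b = [rng.gauss(35.0, 6.0) for _ in range(50)]

z_stat, p_val = two_sample_z(group_a, group_b)
print(f"z={z_stat:.4f}, p={p_val:.6f}")
```

Swapping the two groups flips the sign of z but leaves the two-sided p-value unchanged, so the order of the arguments matters only for interpreting the direction of the difference.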


**Chi-square test**: This test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference.
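
To show what the test computes, the sketch below derives the expected frequencies by hand: under independence, each expected count is (row total × column total) / grand total, and the degrees of freedom are (rows − 1) × (columns − 1). The observed counts are copied from the Category-by-Brand contingency table printed later in this section; the variable names are illustrative.

```python
# Observed counts copied from the Category x Brand contingency table in this notebook
observed = {
    "Books":       [18, 9, 22, 19, 13],
    "Clothing":    [13, 13, 21, 16, 15],
    "Electronics": [13, 11, 13, 16, 16],
    "Home":        [22, 15, 23, 13, 16],
    "Sports":      [20, 21, 16, 19, 15],
    "Toys":        [17, 22, 13, 20, 20],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(rows, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(rows) - 1) * (len(col_totals) - 1)

print("grand total:", grand_total)
print("expected (Books row):", [round(e, 2) for e in expected[0]])
print("chi-square:", chi2, "dof:", dof)
```

These values can be compared with the statistic, degrees of freedom, and expected frequencies that `chi2_contingency` prints for the same table.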

In [4]:
import pandas as pd
from scipy.stats import chi2_contingency

# Load the dataset
df = pd.read_csv('Amazon_Like_Dataset.csv')

# Create a contingency table using 'Category' and 'Brand'
contingency_table = pd.crosstab(df['Category'], df['Brand'])

# Perform the chi-square test
chi2_statistic, p_value, dof, expected_frequencies = chi2_contingency(contingency_table)

# Print the contingency table (Observed Frequencies)
print('Contingency Table (Observed Frequencies):\n')
print(contingency_table.to_markdown(numalign="left", stralign="left"))

# Print the results
print('\nChi-square statistic:', chi2_statistic)
print('P-value:', p_value)
print('Degrees of freedom:', dof)
print('\nExpected frequencies (rounded for readability):\n')
print(pd.DataFrame(expected_frequencies, index=contingency_table.index, columns=contingency_table.columns).round(2).to_markdown(numalign="left", stralign="left"))

# Conclusion
if p_value < 0.05:
    print("\nReject the null hypothesis: There is a significant association between Product Category and Brand.")
else:
    print("\nFail to reject the null hypothesis: There is no significant evidence of an association between Product Category and Brand.")

Contingency Table (Observed Frequencies):

| Category    | BrandA   | BrandB   | BrandC   | BrandD   | BrandE   |
|:------------|:---------|:---------|:---------|:---------|:---------|
| Books       | 18       | 9        | 22       | 19       | 13       |
| Clothing    | 13       | 13       | 21       | 16       | 15       |
| Electronics | 13       | 11       | 13       | 16       | 16       |
| Home        | 22       | 15       | 23       | 13       | 16       |
| Sports      | 20       | 21       | 16       | 19       | 15       |
| Toys        | 17       | 22       | 13       | 20       | 20       |

Chi-square statistic: 17.378074815247604
P-value: 0.6283047812641522
Degrees of freedom: 20

Expected frequencies (rounded for readability):

| Category    | BrandA   | BrandB   | BrandC   | BrandD   | BrandE   |
|:------------|:---------|:---------|:---------|:---------|:---------|
| Books       | 16.69    | 14.74    | 17.5     | 16.69    | 15.39    |
| Clothing    | 16.07    | 14.2  