# <center>PR 3</center>
# <center>Derivable Judgement</center>

In [67]:
import pandas as pd
import numpy as np
from scipy import stats


df = pd.read_csv('derivable_judgement_dataset.csv')
df.head()

Unnamed: 0,record_id,age_group,age,weight,gender,region,smoking_status,exercise_frequency,bmi,blood_pressure,diabetes,hypertension,cholesterol_level,glucose_level,visit_date
0,b41ed172-165f-4438-b1f9-c0133c6bbd52,60+,69,66,Female,East,Smoker,Daily,23.5,158.2,True,False,230.7,128.0,2023-01-21
1,24940e3e-a551-494d-876c-ecaca624b591,60+,75,78,Female,West,Smoker,Daily,36.8,171.9,True,True,190.0,132.5,2023-07-19
2,22407e64-7d52-439a-9687-c25a0ab9f1e7,60+,81,70,Male,North,Former Smoker,Rarely,21.7,148.1,False,False,174.6,74.7,2023-03-08
3,5d37b2f8-56e3-423c-83e9-a40ab5d8c074,46-60,53,67,Male,West,Non-Smoker,Weekly,17.1,146.4,True,False,208.7,143.2,2023-08-16
4,0b0ca677-5874-4a69-8b6e-d2432fe08ae3,60+,79,105,Male,West,Smoker,Weekly,32.2,165.8,False,True,216.8,80.7,2023-01-13


## <center>Theoretical Foundation</center>

In [157]:
"""
    1. What is inferential statistics?
    -> Inferential statistics is a branch of statistics that uses a random sample of data to draw conclusions (estimates and hypothesis tests) about a larger population.

    2. What is hypothesis testing and its components?
    -> Hypothesis testing is a systematic method used to evaluate sample data and decide whether to reject or fail to reject a claim (hypothesis) about a population.
       Components:
       - Null Hypothesis (H0): The default assumption (no effect/no difference).
       - Alternative Hypothesis (H1): The claim we want to test (there is an effect/difference).
       - Significance Level (alpha): The probability of rejecting the null hypothesis when it is true (usually 0.05).
       - Test Statistic: A standardized value calculated from sample data.

    3. Explain confidence interval and critical value.
    -> Confidence Interval (CI): A range of values computed from sample statistics that is likely to contain the true population parameter. A 95% CI means that, across repeated samples, about 95% of intervals built this way would contain the true value (e.g., "We are 95% confident the true mean is between X and Y").
    -> Critical Value: The point on the test distribution that separates the rejection region from the non-rejection region. In a two-tailed test, we reject the null hypothesis when the absolute value of the test statistic exceeds it.

    4. Define p-value.
    -> The p-value (probability value) is the probability of observing a result at least as extreme as the one in the sample, assuming the null hypothesis is true. A low p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, leading us to reject it.

    5. Differentiate Type 1 and Type 2 errors.
    -> Type 1 Error (False Positive): Rejecting the null hypothesis when it is actually true (e.g., telling a healthy person they have a disease).
    -> Type 2 Error (False Negative): Failing to reject the null hypothesis when it is actually false (e.g., telling a sick person they are healthy).

    6. Brief descriptions of z-test, t-test, chi-square test, and ANOVA test.
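    -> Z-test: Used to test hypotheses about means when the population variance is known, or when the sample is large (n > 30) so the sample variance is a reliable stand-in.
    -> T-test: Used to test hypotheses about means when the population variance is unknown and must be estimated from the sample; the difference from the z-test matters most for small samples (n < 30).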
    -> Chi-Square Test: Used to check relationships or independence between two categorical variables.
    -> ANOVA (Analysis of Variance): Used to compare the means of three or more groups to see if at least one is significantly different.

    7. What is Covariance?
    -> Covariance is a measure of how two variables change together. A positive covariance means they move in the same direction, while a negative covariance means they move in opposite directions.
"""

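To make the link between the critical value and the p-value concrete, here is a minimal sketch for a two-tailed z-test. The observed statistic `z_obs = 2.5` is a made-up value chosen for illustration, not a result from the dataset.

```python
from scipy import stats

alpha = 0.05
z_obs = 2.5  # hypothetical test statistic, chosen only for illustration

# Two-tailed critical value: the cut-off that leaves alpha/2 in each tail
z_crit = stats.norm.ppf(1 - alpha / 2)

# Two-tailed p-value: probability of a statistic at least this extreme under H0
p_value = 2 * stats.norm.sf(abs(z_obs))

# Both decision rules lead to the same conclusion
reject_by_critical = abs(z_obs) > z_crit
reject_by_p = p_value < alpha

print(f'critical z = {z_crit:.4f}, p-value = {p_value:.4f}')
print(f'reject H0 by critical value: {reject_by_critical}, by p-value: {reject_by_p}')
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same rule, so they cannot disagree.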

## <center>Data Analysis & Testing Tasks</center>

#### <center>1</center>

In [69]:
H0 = 'Smoking has no effect on Diabetes prevalence'
H1 = 'Smoking affects Diabetes prevalence'

In [70]:

H0 = 'The mean BMI is equal across all Age Groups.'
H1 = 'At least one Age Group has a significantly different mean BMI compared to the others.'

#### <center>2</center>

In [71]:
def manual_ci(data,alpha):
    mean = np.mean(data)
    n = len(data)
    std = np.std(data, ddof=1)

    se = std / np.sqrt(n)
    t_critical = stats.t.ppf(1 - alpha /2, df=n-1)
    margin_error = t_critical * se

    ci_lower = mean - margin_error
    ci_upper = mean + margin_error

    return ci_lower, ci_upper

def lib_ci(data,alpha):
    mean = np.mean(data)
    n = len(data)
    # The first argument is the confidence level (1 - alpha), not 1 - alpha/2
    ci = stats.t.interval(1 - alpha, df=n-1, loc=mean, scale=stats.sem(data))

    return ci


columns = df.select_dtypes(include='number').columns


In [72]:
print('Manual Way')
print()

for col in columns:
    ci = manual_ci(df[col],alpha=0.05)
    print(f'{col.upper()} CI: [{ci[0]:.2f}, {ci[1]:.2f}]')

Manual Way

AGE CI: [50.52, 52.87]
WEIGHT CI: [73.22, 75.18]
BMI CI: [25.57, 26.36]
BLOOD_PRESSURE CI: [145.80, 147.65]
CHOLESTEROL_LEVEL CI: [199.24, 204.12]
GLUCOSE_LEVEL CI: [108.28, 111.41]


In [73]:
print('Library Way')
print()
for col in columns:
    ci = lib_ci(df[col],alpha=0.05)
    print(f'{col.upper()} CI: [{ci[0]:.2f}, {ci[1]:.2f}]')

Library Way

AGE CI: [50.52, 52.87]
WEIGHT CI: [73.22, 75.18]
BMI CI: [25.57, 26.36]
BLOOD_PRESSURE CI: [145.80, 147.65]
CHOLESTEROL_LEVEL CI: [199.24, 204.12]
GLUCOSE_LEVEL CI: [108.28, 111.41]
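As a sanity check on the two approaches, the sketch below compares a hand-computed interval with `stats.t.interval` on synthetic data. The sample is randomly generated with a fixed seed and only stands in for a dataset column. The first positional argument of `stats.t.interval` is the confidence level, so a 95% interval takes `1 - alpha`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)  # synthetic stand-in for a column

alpha = 0.05
n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# Manual interval: mean +/- t_critical * standard error
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
manual = (mean - t_crit * se, mean + t_crit * se)

# Library interval: pass the confidence level itself (0.95)
library = stats.t.interval(1 - alpha, df=n - 1, loc=mean, scale=stats.sem(sample))

print('Intervals agree:', np.allclose(manual, library))
```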


#### <center>3</center>

In [98]:
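def z_test(data,alpha):
    # Treat the full column as the population and test a random sample of 50 against its mean
    mu = np.mean(data)
    n = 50

    sample = np.random.choice(data, n)  # unseeded, so results vary between runs
    x_bar = np.mean(sample)
    sigma = np.std(data)  # population standard deviation (ddof=0)

    z_score = (x_bar - mu) / (sigma / np.sqrt(n))
    p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

    critical_z = stats.norm.ppf(1 - alpha/ 2)

    return z_score, p_value, critical_z,alpha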

In [125]:
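def t_test(data, alpha):
    # Treat the full column as the population and test a small random sample against its mean
    mu = np.mean(data)
    n = 29

    sample = np.random.choice(data, n)  # unseeded, so results vary between runs
    x_bar = np.mean(sample)
    s = np.std(sample, ddof=1)  # sample standard deviation (n - 1 in the denominator)

    t_score = (x_bar - mu) / (s/ np.sqrt(n))
    p_value = 2 * (1 - stats.t.cdf(abs(t_score), df = n-1))

    critical_t = stats.t.ppf(1 - alpha/2, df= n-1)

    return t_score, p_value, critical_t, alpha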

In [115]:
print('Z-Test')
z_list = []
for col in columns:
    z = z_test(df[col],alpha=0.05)
    # list(z)
    z_list.append(z)
    print(f'\n{col.upper()}: ')

    print(f'Z score: {z[0]}')
    print(f'P value: {z[1]}')
    print(f'Critical z: {z[2]}')
    print('-------------------')

Z-Test

AGE: 
Z score: 0.3096184445695319
P value: 0.7568511283705972
Critical z: 1.959963984540054
-------------------

WEIGHT: 
Z score: -0.19868795674472398
P value: 0.8425068453300935
Critical z: 1.959963984540054
-------------------

BMI: 
Z score: 0.07099014879407047
P value: 0.9434055958336147
Critical z: 1.959963984540054
-------------------

BLOOD_PRESSURE: 
Z score: 0.46009444527788407
P value: 0.6454484309868256
Critical z: 1.959963984540054
-------------------

CHOLESTEROL_LEVEL: 
Z score: 3.000187022780609
P value: 0.002698138815001583
Critical z: 1.959963984540054
-------------------

GLUCOSE_LEVEL: 
Z score: 0.43085423753228025
P value: 0.6665743597617999
Critical z: 1.959963984540054
-------------------


In [131]:
print('T-Test')
t_list = []
for col in columns:
    t = t_test(df[col],alpha=0.05)
    t_list.append(t)
    print(f'\n{col.upper()}: ')

    print(f'T score: {t[0]}')
    print(f'P value: {t[1]}')
    print(f'Critical t: {t[2]}')
    print('-------------------')

T-Test

AGE: 
T score: 0.3468601551747029
P value: 0.7312898068915707
Critical t: 2.048407141795245
-------------------

WEIGHT: 
T score: -0.5378658814238876
P value: 0.5949194492008141
Critical t: 2.048407141795245
-------------------

BMI: 
T score: -0.16211292176173275
P value: 0.8723811516262747
Critical t: 2.048407141795245
-------------------

BLOOD_PRESSURE: 
T score: 0.5132953871393062
P value: 0.611771309375289
Critical t: 2.048407141795245
-------------------

CHOLESTEROL_LEVEL: 
T score: 0.39854114983175876
P value: 0.6932543260813482
Critical t: 2.048407141795245
-------------------

GLUCOSE_LEVEL: 
T score: -0.6057879713193859
P value: 0.5495351612233548
Critical t: 2.048407141795245
-------------------
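A one-sample t-test computed by hand can be cross-checked against `stats.ttest_1samp`. The sketch below uses a seeded synthetic sample and a hypothesised mean `mu0`; both are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=26, scale=5, size=29)  # synthetic small sample
mu0 = 25.0  # hypothesised population mean (assumed for illustration)

n = len(sample)

# Manual t statistic: the sample standard deviation needs ddof=1
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Library version of the same two-sided test
t_lib, p_lib = stats.ttest_1samp(sample, popmean=mu0)

print('Statistics agree:', np.isclose(t_manual, t_lib))
print('P-values agree:  ', np.isclose(p_manual, p_lib))
```

The two results only match when the standard deviation is computed with `ddof=1`; with NumPy's default `ddof=0` the manual statistic comes out slightly too large in magnitude.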


#### <center>4</center>

In [139]:
def two_sample_z_test(data1, data2, alpha):
    n1 = len(data1)
    x_bar1 = np.mean(data1)
    var1 = np.var(data1, ddof=1)

    n2 = len(data2)
    x_bar2 = np.mean(data2)
    var2 = np.var(data2, ddof=1)

    se = np.sqrt((var1/n1) + (var2/n2))

    z_score = (x_bar1 - x_bar2) / se

    p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

    critical_z = stats.norm.ppf(1 - alpha/2)

    return z_score, p_value, critical_z, alpha

group1_df = df[df['smoking_status'] == 'Smoker']
group2_df = df[df['smoking_status'] == 'Non-Smoker']

z_results_list = []

for col in columns:
    z_res = two_sample_z_test(group1_df[col], group2_df[col], alpha=0.05)
    z_results_list.append(z_res)

    print(f'\n{col.upper()} (Smoker vs Non-Smoker): ')
    print(f'Z score:    {z_res[0]:.4f}')
    print(f'P value:    {z_res[1]:.4f}')
    print(f'Critical z: {z_res[2]:.4f}')
    print('-------------------')




AGE (Smoker vs Non-Smoker): 
Z score:    -0.1685
P value:    0.8662
Critical z: 1.9600
-------------------

WEIGHT (Smoker vs Non-Smoker): 
Z score:    -0.0087
P value:    0.9931
Critical z: 1.9600
-------------------

BMI (Smoker vs Non-Smoker): 
Z score:    -0.0825
P value:    0.9342
Critical z: 1.9600
-------------------

BLOOD_PRESSURE (Smoker vs Non-Smoker): 
Z score:    1.8146
P value:    0.0696
Critical z: 1.9600
-------------------

CHOLESTEROL_LEVEL (Smoker vs Non-Smoker): 
Z score:    1.3438
P value:    0.1790
Critical z: 1.9600
-------------------

GLUCOSE_LEVEL (Smoker vs Non-Smoker): 
Z score:    4.1390
P value:    0.0000
Critical z: 1.9600
-------------------


In [155]:
for col, z_res in zip(columns, z_results_list):
    z_p, z_alpha = z_res[1], z_res[3]
    print(f'\nZ-test of {col.upper()}: ')
    if z_p < z_alpha:
        print(f'The p-value is less than 0.05, so we REJECT the null hypothesis.\nSignificant difference found in {col}.')
    else:
        print(f'The p-value is greater than 0.05, so we FAIL to reject the null hypothesis.\nNo significant difference in {col}.')


Z-test of AGE: 
The p-value is greater than 0.05, so we FAIL to reject the null hypothesis.
No significant difference in age.

Z-test of WEIGHT: 
The p-value is greater than 0.05, so we FAIL to reject the null hypothesis.
No significant difference in weight.

Z-test of BMI: 
The p-value is greater than 0.05, so we FAIL to reject the null hypothesis.
No significant difference in bmi.

Z-test of BLOOD_PRESSURE: 
The p-value is greater than 0.05, so we FAIL to reject the null hypothesis.
No significant difference in blood_pressure.

Z-test of CHOLESTEROL_LEVEL: 
The p-value is greater than 0.05, so we FAIL to reject the null hypothesis.
No significant difference in cholesterol_level.

Z-test of GLUCOSE_LEVEL: 
The p-value is less than 0.05, so we REJECT the null hypothesis.
Significant difference found in glucose_level.
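The statistic in `two_sample_z_test` uses unpooled variances, which is the same statistic that Welch's t-test computes; only the reference distribution differs (normal versus t). A minimal sketch on seeded synthetic data, where the group means, spreads, and sizes are assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(110, 20, size=300)  # synthetic stand-in for one group
group_b = rng.normal(105, 25, size=350)  # synthetic stand-in for the other group

# Two-sample z statistic with unpooled sample variances
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
z_score = (group_a.mean() - group_b.mean()) / se
p_z = 2 * stats.norm.sf(abs(z_score))

# Welch's t-test: same statistic, p-value taken from a t distribution
t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=False)

print('Statistics agree:', np.isclose(z_score, t_stat))
print(f'p (normal) = {p_z:.4f}, p (Welch t) = {p_t:.4f}')
```

With groups of a few hundred observations the two p-values are nearly identical; for small groups the t-based p-value is the safer choice.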


#### <center>5</center>

In [145]:
def manual_chi_square(col1, col2, alpha):
    observed_table = pd.crosstab(col1, col2)
    observed = observed_table.values

    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    grand_total = observed.sum()

    expected = np.outer(row_totals, col_totals) / grand_total

    chi_squared_stat = np.sum((observed - expected)**2 / expected)

    rows = observed.shape[0]
    cols = observed.shape[1]
    df_val = (rows - 1) * (cols - 1)

    critical_val = stats.chi2.ppf(1 - alpha, df=df_val)
    p_value = 1 - stats.chi2.cdf(chi_squared_stat, df=df_val)

    return chi_squared_stat, p_value, critical_val, alpha

print('Chi-Square Test of Independence (Manual)')
print('Hypothesis: Smoking Status affects Diabetes Prevalence')

chi2, p, crit, alpha = manual_chi_square(df['smoking_status'], df['diabetes'], alpha=0.05)

print(f'\nChi-Square Statistic: {chi2:.4f}')
print(f'Critical Value:       {crit:.4f}')
print(f'P-Value:              {p:.4f}\n')

if p < alpha:
    print('Decision: Reject Null Hypothesis.')
    print('There IS a significant relationship between Smoking Status and Diabetes.')
else:
    print('Decision: Fail to Reject Null Hypothesis.')
    print('There is NO significant relationship between Smoking Status and Diabetes.')

Chi-Square Test of Independence (Manual)
Hypothesis: Smoking Status affects Diabetes Prevalence

Chi-Square Statistic: 21.3014
Critical Value:       5.9915
P-Value:              0.0000

Decision: Reject Null Hypothesis.
There IS a significant relationship between Smoking Status and Diabetes.
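The manual chi-square computation can be checked against `stats.chi2_contingency`. The contingency table below is hypothetical and only illustrates the mechanics. `correction=False` disables the Yates continuity correction, which SciPy would otherwise apply to 2x2 tables.

```python
import numpy as np
from scipy import stats

# Hypothetical 3x2 contingency table (rows: groups, columns: outcome no/yes)
observed = np.array([[30, 70],
                     [45, 55],
                     [60, 40]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, df=dof_manual)

chi2_lib, p_lib, dof_lib, expected_lib = stats.chi2_contingency(observed, correction=False)

print('Statistics agree:', np.isclose(chi2_manual, chi2_lib))
print('Degrees of freedom:', dof_manual, dof_lib)
```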


#### <center>6</center>

In [156]:
print('One-Way ANOVA Test (Age Group vs BMI)')

unique_age_groups = df['age_group'].unique()

groups = [df[df['age_group'] == g]['bmi'].values for g in unique_age_groups]

f_stat, p_value = stats.f_oneway(*groups)

print(f"F-Statistic: {f_stat:.4f}")
print(f"P-Value:     {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Conclusion: Reject Null Hypothesis.")
    print("There IS a significant difference in BMI across Age Groups.")
else:
    print("Conclusion: Fail to Reject Null Hypothesis.")
    print("There is NO significant difference in BMI across Age Groups.")

One-Way ANOVA Test (Age Group vs BMI)
F-Statistic: 0.9880
P-Value:     0.4131
Conclusion: Fail to Reject Null Hypothesis.
There is NO significant difference in BMI across Age Groups.
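For reference, the F-statistic that `stats.f_oneway` returns is the ratio of between-group to within-group mean squares. The sketch below rebuilds it by hand on seeded synthetic groups; the group sizes and distribution parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Four synthetic groups of unequal size
groups = [rng.normal(26, 4, size=n) for n in (40, 55, 35, 60)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_lib, p_lib = stats.f_oneway(*groups)

print('F agrees:', np.isclose(f_manual, f_lib))
print('p agrees:', np.isclose(p_manual, p_lib))
```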


#### <center>7</center>

In [151]:
print('Spearman Rank Correlation (Age vs BMI)')

var1 = df['age']
var2 = df['bmi']

corr_coef, p_val = stats.spearmanr(var1, var2)

print(f"Spearman Coefficient: {corr_coef:.4f}")
print(f"P-Value:              {p_val:.4f}")
print("Interpretation:")


abs_corr = abs(corr_coef)
if abs_corr > 0.7:
    strength = "Strong"
elif abs_corr > 0.3:
    strength = "Moderate"
else:
    strength = "Weak"

if corr_coef > 0:
    direction = "Positive (+)"
else:
    direction = "Negative (-)"

print(f"There is a {strength}, {direction} monotonic relationship.")


Spearman Rank Correlation (Age vs BMI)
Spearman Coefficient: 0.0454
P-Value:              0.1512
Interpretation:
There is a Weak, Positive (+) monotonic relationship.
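Spearman's coefficient is the Pearson correlation of the ranked values, which is why it measures monotonic rather than strictly linear association. A short sketch on seeded synthetic data (the variables `x` and `y` are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.integers(18, 90, size=100)           # synthetic ages
y = 0.05 * x + rng.normal(25, 4, size=100)   # synthetic values loosely tied to x

rho, p_rho = stats.spearmanr(x, y)

# Rank both variables (ties get the average rank), then apply Pearson
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print('Spearman equals Pearson on ranks:', np.isclose(rho, r_on_ranks))
```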


#### <center>8</center>


In [159]:

print('\n[1] IMPACT OF SMOKING (Smoker vs Non-Smoker)')
print('--------------------------------------------')
# Use a dedicated name for the z-test p-values so the Spearman p_val is not overwritten
for col, z_res in zip(columns, z_results_list):
    z_p = z_res[1]

    if z_p < 0.05:
        print(f"  • {col}: SIGNIFICANT DIFFERENCE (p={z_p:.4f})")
        print(f"    -> Smokers and non-smokers differ significantly in {col}.")


print('\n[2] SMOKING & DIABETES RELATIONSHIP')
print('--------------------------------------------')
if p < 0.05:
    print(f"  • Result: SIGNIFICANT (p={p:.4f})")
    print("  -> Interpretation: Smoking Status and Diabetes are DEPENDENT (linked).")
else:
    print(f"  • Result: Not Significant (p={p:.4f})")


print('\n[3] BMI ACROSS AGE GROUPS')
print('--------------------------------------------')

if p_value < 0.05:
    print(f"  • Result: SIGNIFICANT (p={p_value:.4f})")
    print("  -> Interpretation: BMI varies significantly depending on age group.")
else:
    print(f"  • Result: Not Significant (p={p_value:.4f})")
    print("  -> Interpretation: No evidence that mean BMI differs across age groups.")


print('\n[4] AGE vs BMI CORRELATION')
print('--------------------------------------------')

print(f"  • Spearman Coeff: {corr_coef:.4f}")
if p_val < 0.05:
    print("  -> Interpretation: There is a significant correlation.")
else:
    print("  -> Interpretation: No significant correlation found.")


[1] IMPACT OF SMOKING (Smoker vs Non-Smoker)
--------------------------------------------
  • glucose_level: SIGNIFICANT DIFFERENCE (p=0.0000)
    -> Smokers and non-smokers differ significantly in glucose_level.

[2] SMOKING & DIABETES RELATIONSHIP
--------------------------------------------
  • Result: SIGNIFICANT (p=0.0000)
  -> Interpretation: Smoking Status and Diabetes are DEPENDENT (linked).

[3] BMI ACROSS AGE GROUPS
--------------------------------------------
  • Result: Not Significant (p=0.4131)
  -> Interpretation: No evidence that mean BMI differs across age groups.

[4] AGE vs BMI CORRELATION
--------------------------------------------
  • Spearman Coeff: 0.0454
  -> Interpretation: No significant correlation found.
