In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/cleaned-day2-day4/cleaned_day2.csv


In [4]:
df = pd.read_csv('/kaggle/input/cleaned-day2-day4/cleaned_day2.csv')

In [5]:
object_list = ['Gender', 'Ever_Married', 'Work_Type', 'Residence_Type','Smoking_Status']

for col in object_list:
    df[col] = df[col].astype('category')


Converting these text columns to the `category` dtype reduces memory usage, can speed up grouping and comparison operations, and makes it clear that each column holds a fixed set of labels rather than free text.
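The memory saving can be checked directly. The snippet below is a minimal sketch on a synthetic column; the label values and the 5,000-row size are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic low-cardinality text column (labels are illustrative only).
rng = np.random.default_rng(0)
work = pd.Series(rng.choice(["Private", "Self-employed", "Govt_job"], size=5000))

# deep=True counts the memory held by the strings themselves.
bytes_as_text = work.memory_usage(deep=True)
bytes_as_category = work.astype("category").memory_usage(deep=True)

print(f"text:     {bytes_as_text:,} bytes")
print(f"category: {bytes_as_category:,} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of rows.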

## One-sample t-test

100 mg/dL is the commonly used upper limit of normal fasting glucose, so it gives a meaningful baseline for testing whether the dataset's mean glucose level is unusually high.

In [6]:
from scipy.stats import ttest_1samp
t_stat, p_value = ttest_1samp(df['Avg_Glucose_Level'],100)

print("--- One-Sample T-Test Results ---")
print("Hypothesized Population Mean (mu): 100 mg/dL")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value:     {p_value:.4f}")


--- One-Sample T-Test Results ---
Hypothesized Population Mean (mu): 100 mg/dL
T-statistic: 9.6919
P-value:     0.0000


Since the p-value is much smaller than 0.05, we reject the null hypothesis (H₀) that the mean glucose level equals 100 mg/dL.

The t-statistic is positive, so the sample mean is above 100 mg/dL.

A one-sample t-test showed that the mean glucose level in the dataset is significantly higher than the normal fasting benchmark of 100 mg/dL (t = 9.69, p < 0.001).
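For reference, the t-statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors. The sketch below recomputes it by hand on synthetic data (the mean, spread, and sample size are assumptions, not the real column) and prints the p-value in scientific notation so that a very small value does not round to `0.0000`.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic glucose-like values; NOT the notebook's Avg_Glucose_Level column.
rng = np.random.default_rng(42)
glucose = rng.normal(loc=106, scale=45, size=5000)
mu0 = 100  # hypothesized population mean

# t = (sample mean - mu0) / (sample std / sqrt(n))
n = glucose.size
t_manual = (glucose.mean() - mu0) / (glucose.std(ddof=1) / np.sqrt(n))

t_scipy, p_value = ttest_1samp(glucose, mu0)

print(f"t (manual): {t_manual:.4f}")
print(f"t (scipy):  {t_scipy:.4f}")
print(f"p-value:    {p_value:.3e}")  # scientific notation instead of a rounded 0.0000
```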

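## Two-sample t-test (independent)

This test checks whether stroke patients have a significantly different mean BMI from non-stroke individuals. Passing `equal_var=False` runs Welch's t-test, which does not assume that the two groups have equal variances.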

In [7]:
from scipy.stats import ttest_ind
bmi_stroke = df[df['Stroke']==1]['BMI']
bmi_no_stroke = df[df['Stroke']==0]['BMI']
t_stat, p_value = ttest_ind(bmi_stroke,bmi_no_stroke, equal_var = False)

print("--- Two-Sample T-Test: BMI vs Stroke ---")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value:     {p_value:.4f}")


--- Two-Sample T-Test: BMI vs Stroke ---
T-statistic: 3.3257
P-value:     0.0010


* Since the p-value (0.001) is less than 0.05, we reject the null hypothesis. There is a statistically significant difference in mean BMI between people who had a stroke and those who did not.
* The positive t-statistic (3.33) indicates that the mean BMI of individuals with stroke is higher than that of individuals without stroke.
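The sign of the t-statistic follows the order of the arguments: it is positive when the first group has the larger mean. A minimal sketch of the Welch statistic computed by hand, on two synthetic groups whose sizes and means are assumed for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
bmi_a = rng.normal(30.5, 6.0, size=250)    # synthetic stand-in for the stroke group
bmi_b = rng.normal(28.8, 7.8, size=4800)   # synthetic stand-in for the no-stroke group

# Welch standard error: each group contributes its own variance / n.
se = np.sqrt(bmi_a.var(ddof=1) / bmi_a.size + bmi_b.var(ddof=1) / bmi_b.size)
t_manual = (bmi_a.mean() - bmi_b.mean()) / se

t_welch, p_welch = ttest_ind(bmi_a, bmi_b, equal_var=False)

print(f"mean A = {bmi_a.mean():.2f}, mean B = {bmi_b.mean():.2f}")
print(f"t (manual) = {t_manual:.4f}, t (scipy) = {t_welch:.4f}, p = {p_welch:.3e}")
```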

# ANOVA
Does average glucose level differ across different types of work?

Hypothesis

H₀:
Mean Avg_Glucose_Level is the same for all work types

H₁:
At least one work type has a different mean glucose level

In [8]:
from scipy.stats import f_oneway

groups = [
    df[df['Work_Type']==work]['Avg_Glucose_Level']
    for work in df['Work_Type'].cat.categories
]

f_stat, p_value = f_oneway(*groups)
print("--- One-Way ANOVA: Avg Glucose Level by Work Type ---")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value:     {p_value:.4f}")


--- One-Way ANOVA: Avg Glucose Level by Work Type ---
F-statistic: 16.6053
P-value:     0.0000


* Because the p-value is much smaller than 0.05, we reject the null hypothesis. At least one work type has a different mean glucose level from the others.
* F-statistic: the variation between the work-type group means is much larger than the variation within each group (F = 16.61).
* This shows that work type is associated with glucose level. The test does not say which work types differ from each other.

#### A one-way ANOVA showed that average glucose levels differ significantly across work types (F = 16.61, p < 0.001), which may reflect work-related lifestyle differences in health risk.
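The F-statistic is the ratio of between-group variance to within-group variance. The sketch below rebuilds it from sums of squares on three synthetic groups (group means and sizes are assumptions) and compares the result with `f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

# Three synthetic groups standing in for work types (values are made up).
rng = np.random.default_rng(2)
groups = [rng.normal(m, 40, size=n) for m, n in [(95, 600), (105, 2900), (112, 800)]]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size

# Between-group: how far each group mean sits from the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group: spread of values around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_value = f_oneway(*groups)

print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}, p = {p_value:.3e}")
```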

# Chi-Square Test
* A chi-square test of independence is appropriate for examining whether smoking status is associated with stroke occurrence, because both variables are categorical.

H₀ (Null Hypothesis):
Smoking status and stroke are independent (no association)

H₁ (Alternative Hypothesis):
Smoking status and stroke are associated

In [9]:
from scipy.stats import chi2_contingency

table = pd.crosstab(df['Smoking_Status'],df['Stroke'])

chi2, p_value, dof, expected = chi2_contingency(table)
print("\n--- Chi-Square Test Results ---")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value:              {p_value:.4f}")
print(f"Degrees of freedom:   {dof}")
print(f" expected: \n{expected}")



--- Chi-Square Test Results ---
Chi-square statistic: 29.2257
P-value:              0.0000
Degrees of freedom:   3
 expected: 
[[ 840.91603053   43.08396947]
 [1799.78860834   92.21139166]
 [ 750.54609513   38.45390487]
 [1468.749266     75.250734  ]]


* Because the p-value is much smaller than 0.05, we reject the null hypothesis.
* Smoking status and stroke are not independent. There is a statistically significant association between smoking status and stroke occurrence.
* chi2: Stroke cases are not distributed across smoking categories in proportion to the size of each category.

#### A chi-square test showed a significant association between smoking status and stroke occurrence (χ² = 29.23, p < 0.001), indicating that the stroke rate varies across smoking categories.
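The `expected` array printed by the test is what each cell would hold if the two variables were independent: row total × column total ÷ grand total. A minimal sketch with made-up counts (the row labels and numbers are illustrative, not the notebook's table):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts; rows are smoking categories, columns are Stroke = 0 / 1.
table = pd.DataFrame(
    [[830, 54], [1810, 82], [720, 69], [1500, 44]],
    index=["former", "never", "current", "unknown"],
    columns=[0, 1],
)

row_totals = table.sum(axis=1).to_numpy()
col_totals = table.sum(axis=0).to_numpy()
expected_manual = np.outer(row_totals, col_totals) / table.to_numpy().sum()

chi2, p_value, dof, expected = chi2_contingency(table.to_numpy())

# Observed share of stroke cases per row shows where the deviation comes from.
stroke_rate = table[1] / table.sum(axis=1)
print(stroke_rate.round(4))
```

Comparing the observed stroke share per row against the overall share is a quick way to see which categories drive the statistic.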

# Chi-Square Test for Hypertension × Stroke
The goal of this test is to determine whether having hypertension is statistically independent of having a stroke.


In [10]:
from scipy.stats import chi2_contingency

table = pd.crosstab(df['Hypertension'],df['Stroke'])

chi2, p_value, dof, expected = chi2_contingency(table)
print("\n--- Chi-Square Test Results ---")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value:              {p_value:.4f}")



--- Chi-Square Test Results ---
Chi-square statistic: 81.5731
P-value:              0.0000


* The output provides very strong evidence of an association between hypertension status and stroke status (χ² = 81.57, p < 0.001).
* We reject the null hypothesis of independence: hypertension status and stroke status are statistically dependent. The test itself does not give a direction, but the segment analysis shows that people with hypertension have the higher stroke rate.
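One detail worth knowing for 2×2 tables like this one: `chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom, which makes the statistic slightly smaller. The sketch below uses made-up counts to compare both settings and to read off the direction of the association from the row-wise stroke share.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x2 counts: rows = Hypertension 0 / 1, columns = Stroke 0 / 1.
observed = np.array([[4400, 180], [430, 65]])

chi2_yates, p_yates, dof, _ = chi2_contingency(observed)  # default: correction=True
chi2_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

# Share of stroke cases within each hypertension group.
risk = observed[:, 1] / observed.sum(axis=1)

print(f"chi2 with Yates correction: {chi2_yates:.4f}")
print(f"chi2 without correction:    {chi2_plain:.4f}")
print(f"stroke share, no hypertension vs hypertension: {risk[0]:.4f} vs {risk[1]:.4f}")
```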



# Two-Sample t-Test Heart_Disease vs BMI
Is the BMI different between people with and without heart disease?

Hypotheses

H₀: Mean BMI is the same for Heart_Disease = 0 and 1

H₁: Mean BMI is different

In [11]:
bmi_hd = df[df['Heart_Disease']==1]['BMI']
bmi_no_hd = df[df['Heart_Disease']==0]['BMI']

t_stat, p_value = ttest_ind(bmi_hd,bmi_no_hd,equal_var=False )
print("--- Two-Sample T-Test: BMI vs Heart Disease ---")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value:     {p_value:.4f}")


--- Two-Sample T-Test: BMI vs Heart Disease ---
T-statistic: 3.8991
P-value:     0.0001


Since the p-value (0.0001) is much smaller than 0.05, we reject the null hypothesis.

The t-statistic is positive: people with heart disease have a higher average BMI than people without heart disease.

Larger t-value → stronger evidence that the difference is real (not random)
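Because the t-value also grows with sample size, it can help to report an effect size next to it. The sketch below computes Cohen's d with a pooled standard deviation; `cohens_d` is a hypothetical helper written for this example, and the two groups are synthetic.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = (
        (a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)
    ) / (a.size + b.size - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic stand-ins for the two BMI groups (means and sizes are assumptions).
rng = np.random.default_rng(3)
bmi_hd = rng.normal(30.2, 6.0, size=280)
bmi_no_hd = rng.normal(28.8, 7.8, size=4800)

d = cohens_d(bmi_hd, bmi_no_hd)
print(f"Cohen's d = {d:.3f}")
```

A value of d expresses the difference in units of standard deviation, so it stays comparable across tests with different sample sizes.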

# Engineered Features
### Age_Group

Stroke risk increases sharply with age

Groups are easier to interpret than raw age

Useful for: visualization, chi-square tests, modeling


In [12]:
df['Age_Group'] = pd.cut(
    df['Age'],
    bins=[0, 20, 40, 60, df['Age'].max()],  # right-closed bins: (0, 20], (20, 40], ...
    labels=['Young', 'Adults', 'Middle-aged', 'Senior']
)
df['Age_Group'].value_counts()

Age_Group
Middle-aged    1562
Senior         1304
Adults         1218
Young          1025
Name: count, dtype: int64
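The counts above add up to 5,109, so every row landed in a bin. That is worth checking because `pd.cut` uses right-closed intervals by default and excludes the lowest edge, so an age of exactly 0 would become `NaN`. A small sketch with made-up ages shows the edge behaviour and the `include_lowest=True` option:

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([0, 0.5, 20, 21, 40, 60, 61, 82])
bins = [0, 20, 40, 60, ages.max()]
labels = ["Young", "Adults", "Middle-aged", "Senior"]

default_cut = pd.cut(ages, bins=bins, labels=labels)
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_cut, "include_lowest": inclusive_cut}))
```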

In [13]:
table3 = pd.crosstab(df['Age_Group'],df['Stroke'])

chi2, p_value, dof, expected = chi2_contingency(table3)
print("\n--- Chi-Square Test Results ---")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value:              {p_value:.4f}")


--- Chi-Square Test Results ---
Chi-square statistic: 313.7342
P-value:              0.0000


The p-value is extremely small, so we reject the null hypothesis. This shows that age group and stroke status are related.

In summary, the results clearly show that stroke occurrence is strongly linked to age, with the risk increasing as people get older.
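A chi-square statistic grows with sample size, so Cramér's V is a common way to express the strength of the association on a 0 to 1 scale. Plugging in the reported χ² = 313.73 with n = 5,109 and two outcome columns gives V ≈ 0.25. The sketch below shows the calculation; `cramers_v` is a hypothetical helper and the counts are made up.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    observed = np.asarray(observed)
    chi2 = chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    return np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

# Made-up counts: rows = age groups, columns = Stroke 0 / 1.
observed = np.array([[1020, 5], [1200, 18], [1480, 82], [1160, 144]])
v = cramers_v(observed)
print(f"Cramér's V = {v:.3f}")
```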

# High_Glucose_Flag

Why this is important

A fasting plasma glucose of 126 mg/dL or higher is a clinical diagnostic threshold for diabetes

Converts continuous glucose into a risk indicator

In [14]:
df['High_Glucose'] = (df['Avg_Glucose_Level']>=126).astype(int)
df['High_Glucose'].value_counts()

High_Glucose
0    4129
1     980
Name: count, dtype: int64

Normal glucose: The majority of individuals in the dataset (4,129 of 5,109, about 81%) have an average glucose level below the clinical threshold of 126 mg/dL.

High glucose: A smaller subset (980 individuals, about 19%) meets or exceeds the clinical threshold of 126 mg/dL.

# Chi-Squared Test
Are people who fall into the High_Glucose=1 category more likely to have a stroke than those in the High_Glucose=0 category?


In [15]:
table4 = pd.crosstab(df['High_Glucose'],df['Stroke'])

chi2, p_value, dof, expected = chi2_contingency(table4)
print("\n--- Chi-Square Test Results ---")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value:              {p_value:.4f}")



--- Chi-Square Test Results ---
Chi-square statistic: 72.8966
P-value:              0.0000


Because the p-value is much smaller than 0.05, we reject the null hypothesis. This means high glucose status and stroke status are related.

Overall, the results show strong evidence that people with high glucose levels have a different, and likely higher, risk of stroke compared to those with normal glucose levels.
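The chi-square test only says the two groups differ. To see the direction and size of the difference, a row-normalized crosstab gives the stroke rate in each group, and their ratio gives a risk ratio. A minimal sketch on a tiny made-up frame that reuses the notebook's column names:

```python
import pandas as pd

# Tiny made-up frame; only the column names match the notebook.
demo = pd.DataFrame({
    "High_Glucose": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "Stroke":       [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# normalize="index" turns each row into proportions that sum to 1.
rates = pd.crosstab(demo["High_Glucose"], demo["Stroke"], normalize="index")
risk_ratio = rates.loc[1, 1] / rates.loc[0, 1]

print(rates)
print(f"risk ratio (high vs normal glucose): {risk_ratio:.2f}")
```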

# Medical_Risk_Score
It combines multiple related health conditions into one meaningful risk indicator, making the data easier to analyze and interpret.

In [16]:
df['Obese'] = (df['BMI'] >= 30).astype(int)  # BMI of 30 or above is flagged as obese
df['Medical_Risk_Score'] = (
    df['Hypertension'] + df['Heart_Disease'] + df['High_Glucose'] + df['Obese']
)
df['Medical_Risk_Score'].value_counts()

Medical_Risk_Score
0    2460
1    1830
2     629
3     174
4      16
Name: count, dtype: int64

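People with several conditions at once (hypertension, heart disease, high glucose, obesity) form a small portion of the population: 190 of 5,109 individuals score 3 or 4. Given the associations tested above, they are expected to be the most vulnerable to stroke, although the score itself has not yet been tested against stroke here.

The Medical_Risk_Score shows that most individuals have few or no medical risk factors (2,460 score 0 and 1,830 score 1), while a small subset accumulates multiple conditions, which makes the score a candidate for capturing a gradient of increasing stroke risk.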
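Whether the score really forms a gradient can be checked by computing the stroke rate at each score level. The sketch below does this on a tiny made-up frame; only the column names come from the notebook.

```python
import pandas as pd

# Made-up rows for illustration.
demo = pd.DataFrame({
    "Medical_Risk_Score": [0, 0, 0, 0, 1, 1, 1, 2, 2, 3],
    "Stroke":             [0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
})

# Group size and share of stroke cases per score level.
summary = demo.groupby("Medical_Risk_Score")["Stroke"].agg(n="count", stroke_rate="mean")
print(summary)
```

Reporting `n` alongside the rate matters here because the highest score levels contain very few individuals.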

# Age_Glucose_Interaction

High glucose at older age is more dangerous.

In [17]:
df['Age_Glucose_Interaction']=df['Age']*df['High_Glucose']

High glucose contributes more to the feature when age is high:

High_Glucose = 0 → interaction = 0 (no extra risk term)

Older + high glucose → large interaction value

The Age–Glucose interaction feature captures the compounded effect of aging and hyperglycemia. It lets a model recognize that elevated glucose may pose greater stroke risk in older individuals.
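A four-row example makes the behaviour of the interaction term concrete. The ages and glucose values below are made up; the two derived columns follow the same formulas as the cells above.

```python
import pandas as pd

# Made-up combinations of young/old and normal/high glucose.
demo = pd.DataFrame({
    "Age": [25.0, 70.0, 25.0, 70.0],
    "Avg_Glucose_Level": [95.0, 95.0, 180.0, 180.0],
})

demo["High_Glucose"] = (demo["Avg_Glucose_Level"] >= 126).astype(int)
demo["Age_Glucose_Interaction"] = demo["Age"] * demo["High_Glucose"]

print(demo)
```

The term is zero whenever the flag is zero, regardless of age, and equals the age itself when the flag is one.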

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5109 entries, 0 to 5108
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   Gender                   5109 non-null   category
 1   Age                      5109 non-null   float64 
 2   Hypertension             5109 non-null   int64   
 3   Heart_Disease            5109 non-null   int64   
 4   Ever_Married             5109 non-null   category
 5   Work_Type                5109 non-null   category
 6   Residence_Type           5109 non-null   category
 7   Avg_Glucose_Level        5109 non-null   float64 
 8   BMI                      5109 non-null   float64 
 9   Smoking_Status           5109 non-null   category
 10  Stroke                   5109 non-null   int64   
 11  Age_Group                5109 non-null   category
 12  High_Glucose             5109 non-null   int64   
 13  Obese                    5109 non-null   int64   
 14  Medical_

In [19]:
df.to_csv("final_cleaned_day4.csv", index=False)
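One thing to keep in mind when reloading this file: CSV stores plain text, so the `category` dtype is not preserved and has to be re-applied after `read_csv`. A minimal sketch using an in-memory buffer and a made-up column:

```python
import io
import pandas as pd

# Made-up categorical column for illustration.
demo = pd.DataFrame({"Work_Type": pd.Categorical(["Private", "Govt_job", "Private"])})

# Round trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

was_kept = isinstance(reloaded["Work_Type"].dtype, pd.CategoricalDtype)
print(f"category dtype kept after reload: {was_kept}")

# Re-cast after loading (or pass dtype={"Work_Type": "category"} to read_csv).
reloaded["Work_Type"] = reloaded["Work_Type"].astype("category")
```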


# Day 1 Insight — Data Understanding & Quality

The dataset contains over 5,100 patient records with demographic, medical, and lifestyle variables relevant to stroke analysis. Initial profiling revealed that stroke cases are relatively rare and that certain categorical variables contain inconsistencies, highlighting the need for careful preprocessing and cautious interpretation of statistical comparisons.

# Day 2 Insight — Data Cleaning & Preprocessing

Data cleaning focused on improving analytical reliability by removing non-informative identifiers, resolving categorical inconsistencies, handling missing BMI values using median imputation, and standardizing data types and labels; medically meaningful outliers in BMI and glucose were intentionally retained, as they represent real high-risk individuals rather than data errors.

# Day 3 Insights — Univariate & Bivariate EDA

### Age & Stroke Pattern
* Age shows a strong relationship with stroke occurrence, with stroke cases concentrated mainly among older individuals; univariate and bivariate analyses consistently indicate that stroke patients tend to be significantly older than non-stroke individuals.

### Glucose & BMI Distributions
* Both average glucose level and BMI exhibit right-skewed distributions with clinically meaningful high-value outliers, indicating that while most individuals fall within normal ranges, a small subgroup carries substantially elevated metabolic and obesity-related risk.

### Weak Individual Correlations, Strong Combined Effects
* Correlation analysis reveals that individual numeric variables such as age, BMI, and glucose level have only weak linear relationships with stroke, suggesting that stroke risk arises from the combined influence of multiple factors rather than any single dominant predictor.

### Lifestyle & Demographic Segments

* Segment analysis shows that stroke risk varies meaningfully across demographic and lifestyle groups, with higher stroke rates observed among older age groups, males, smokers (especially former smokers), and individuals in certain work types, highlighting the role of lifestyle and age-related patterns in stroke occurrence.


# Day 4 Insights — Statistical Tests & Feature Engineering
### Glucose as a Key Risk Factor
* Statistical testing confirmed that the mean glucose level in the dataset is significantly higher than the normal clinical benchmark of 100 mg/dL. Together with the chi-square result for the High_Glucose flag, this indicates that elevated glucose is common in the dataset and is associated with stroke.

### BMI, Heart Conditions & Stroke
* Two-sample t-tests showed that individuals with stroke and those with heart disease both have significantly higher average BMI compared to their respective comparison groups, reinforcing the link between obesity, cardiovascular conditions, and stroke risk.

### Lifestyle and Medical Conditions Matter
* Chi-square tests demonstrated significant associations between stroke and several categorical factors, including smoking status, hypertension, age group, and the high-glucose flag. A one-way ANOVA additionally showed that average glucose level differs across work types. Together these results indicate that stroke risk is tied to both medical conditions and lifestyle-related factors rather than occurring randomly.

### Value of Engineered Risk Features
* Feature engineering produced features intended for risk stratification: the Medical Risk Score separates a large group with few or no conditions from a small group with multiple conditions, while the Age–Glucose interaction encodes the idea that high glucose is more dangerous at older ages, a compounded effect not visible in single variables. Neither feature was tested against stroke in this notebook, so their predictive value remains to be confirmed in modeling.



# Segment findings
* Stroke risk increases sharply with age, with senior individuals showing the highest stroke occurrence, while young individuals show minimal risk.

* Former smokers and current smokers show higher stroke rates compared to never-smokers, with males consistently having higher stroke rates across most smoking categories.

* Individuals with hypertension, heart disease, high glucose, or obesity show substantially higher stroke risk compared to those without these conditions.

* Average glucose and BMI levels vary across work types, with private and self-employed workers showing higher variability, suggesting lifestyle or occupational influence on health risk.

# Data Quality Issues
* The dataset contained missing values in the BMI column, which were handled using median imputation to reduce the influence of extreme values.

* Stroke cases were relatively rare compared to non-stroke cases, which may limit the strength of some statistical comparisons and requires cautious interpretation.

* Certain categorical variables required standardization (e.g., work type labels and smoking status formatting) to ensure consistent grouping during analysis.

* A single “Other” gender record was removed due to its extremely small sample size, which could distort categorical analysis.

* The dataset does not include a date or time variable, so time-based trend analysis could not be performed.

* Some variables (such as smoking status marked as “Unknown”) represent incomplete information rather than true absence, and were treated as valid categories.
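The median imputation mentioned in the first item of this list can be sketched as follows. The BMI values are made up, and the example includes one extreme value to show why the median is less affected by it than the mean.

```python
import numpy as np
import pandas as pd

# Made-up BMI values with two gaps and one extreme value (60.0).
bmi = pd.Series([22.0, 27.5, np.nan, 31.0, 60.0, np.nan])

median_bmi = bmi.median()   # NaN values are skipped by default
mean_bmi = bmi.mean()       # pulled upward by the extreme value
bmi_filled = bmi.fillna(median_bmi)

print(f"median = {median_bmi}, mean = {mean_bmi}")
print(bmi_filled.tolist())
```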