## 1. Business Understanding
Stroke is a major global health concern and one of the leading causes of death. Identifying high-risk patients early supports better medical planning and prevention.

This project aims to explore patient characteristics and lifestyle factors to understand which variables are most associated with stroke.

**Objective:**
We want to identify:
* Which patient attributes (age, glucose level, hypertension, heart disease, etc.) are linked to higher stroke risk.
* Whether lifestyle choices (such as smoking) influence stroke likelihood.
* Which demographic groups may require more attention from healthcare providers.

The insights gained can guide early screening, improve awareness strategies, and support healthcare decision-making.

## 2. Data Description
The dataset contains 5,110 patient records across 12 columns. Each record describes one individual's demographic, medical, and lifestyle information.

| Feature | Description |
| :--- | :--- |
| **id** | Unique patient identifier |
| **gender** | Male, Female, Other |
| **age** | Age of the patient in years (fractional for infants) |
| **hypertension** | 0 = No, 1 = Yes |
| **heart_disease** | 0 = No, 1 = Yes |
| **ever_married** | Yes / No |
| **work_type** | children, Govt_job, Private, Self-employed, Never_worked |
| **Residence_type** | Rural / Urban |
| **avg_glucose_level** | Average blood glucose level |
| **bmi** | Body Mass Index |
| **smoking_status** | never smoked, formerly smoked, smokes, Unknown |
| **stroke** | 1 = Stroke occurred, 0 = No stroke |

In [1]:
import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# List every file available under the read-only Kaggle input directory.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv


In [3]:
df = pd.read_csv("/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df.head()


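| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |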


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [6]:
df.shape

(5110, 12)

In [8]:
df.describe()

| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 4909.0 | 5110.0 |
| mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.28356 | 7.854067 | 0.21532 |
| min | 67.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25% | 17741.25 | 25.0 | 0.0 | 0.0 | 77.245 | 23.5 | 0.0 |
| 50% | 36932.0 | 45.0 | 0.0 | 0.0 | 91.885 | 28.1 | 0.0 |
| 75% | 54682.0 | 61.0 | 0.0 | 0.0 | 114.09 | 33.1 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |
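
The summary statistics show a `bmi` maximum of 97.6 against a 75th percentile of 33.1, which suggests a long right tail. One common way to flag such values is the 1.5 × IQR rule. With the quartiles printed above (23.5 and 33.1), that rule would put the upper fence at 47.5. The sketch below is only an illustration: `iqr_fences` is a hypothetical helper, and `toy_bmi` is a small made-up series that reuses the printed quartiles, not the real column.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the k * IQR rule; NaNs are ignored."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values (assumption): chosen so the quartiles match the describe() output.
toy_bmi = pd.Series([18.0, 23.5, 28.1, 33.1, 97.6], name="bmi")

low, high = iqr_fences(toy_bmi)
outliers = toy_bmi[(toy_bmi < low) | (toy_bmi > high)]
print(f"fences: ({low:.1f}, {high:.1f})")
print(outliers)
```

On the real data the same call would be `iqr_fences(df["bmi"])`. Whether flagged values should be capped or kept is a separate modelling decision.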


In [9]:
df.nunique()

id                   5110
gender                  3
age                   104
hypertension            2
heart_disease           2
ever_married            2
work_type               5
Residence_type          2
avg_glucose_level    3979
bmi                   418
smoking_status          4
stroke                  2
dtype: int64

In [10]:
df.select_dtypes(include=['int64','float64']).columns


Index(['id', 'age', 'hypertension', 'heart_disease', 'avg_glucose_level',
       'bmi', 'stroke'],
      dtype='object')

In [13]:
df.select_dtypes(include=[object]).columns

Index(['gender', 'ever_married', 'work_type', 'Residence_type',
       'smoking_status'],
      dtype='object')

In [14]:
df.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
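
Raw missing counts are easier to judge as a share of the rows. The sketch below shows one way to report both, using a hypothetical helper `missing_summary` and a small made-up frame in place of `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame (assumption): illustrative values only, not the full dataset.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

print(missing_summary(toy))
```

Applied to the real data as `missing_summary(df)`, this should list only `bmi`, since it is the only column with a non-zero count above.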

In [20]:
df['gender'].value_counts()


gender
Female    2994
Male      2115
Other        1
Name: count, dtype: int64

In [21]:
df['smoking_status'].value_counts() 

smoking_status
never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: count, dtype: int64

In [27]:
df['stroke'].value_counts()

stroke
0    4861
1     249
Name: count, dtype: int64
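
The counts above imply a strongly imbalanced target. A short sketch of how the class shares could be computed follows; `class_balance` is a hypothetical helper, and the series is rebuilt from the two printed counts rather than loaded from the CSV.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Return the count and percentage of each class in the target column."""
    counts = target.value_counts()
    return pd.DataFrame({
        "count": counts,
        "percent": (counts / counts.sum() * 100).round(2),
    })


# Rebuild the target from the printed counts: 4861 zeros and 249 ones.
stroke = pd.Series([0] * 4861 + [1] * 249, name="stroke")

balance = class_balance(stroke)
print(balance)
```

Stroke cases come out at roughly 4.87% of the records, which matches the `stroke` mean of 0.048728 in the `describe()` output. With an imbalance this large, accuracy alone would likely be a misleading metric later on.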

### ⚠️ Data Quality Issue Log

| Feature | Issue Type | Count / % | Severity | Action Plan for Day 2 |
| :--- | :--- | :--- | :--- | :--- |
| **id** | Irrelevant Info | 5110 unique values | Low | **Drop column.** No predictive value. |
| **bmi** | Missing Values | 201 rows (3.9%) | Medium | **Impute** using Median (robust to outliers). |
| **bmi** | Outliers | Max value is 97.6 | Medium | **Investigate** distribution; consider capping or noting for future scaling. |
| **gender** | Rare Category | 1 row = "Other" (0.02%) | Low | **Drop row.** A single record (1 of 5,110) is too few to model as its own category. |
| **smoking_status** | "Hidden" Missing | 1544 rows "Unknown" (30.2%) | High | **Keep as category.** Do NOT impute; treat "Unknown" as its own group. |
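
The action plan in the log could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the final Day 2 implementation: `apply_cleaning_plan` is a hypothetical helper, and `toy` is a small made-up frame with the same column names as the dataset.

```python
import numpy as np
import pandas as pd


def apply_cleaning_plan(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the logged actions: drop id, drop gender 'Other', impute bmi."""
    cleaned = frame.drop(columns=["id"])
    # Remove the rare "Other" gender rows before computing the median.
    cleaned = cleaned[cleaned["gender"] != "Other"].copy()
    # Median imputation for bmi (robust to the long right tail).
    cleaned["bmi"] = cleaned["bmi"].fillna(cleaned["bmi"].median())
    # smoking_status is left untouched so "Unknown" stays a category.
    return cleaned.reset_index(drop=True)


# Toy frame (assumption): illustrative rows only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Other", "Female"],
    "bmi": [36.6, np.nan, 22.0, 24.0],
    "smoking_status": ["formerly smoked", "Unknown", "never smoked", "Unknown"],
    "stroke": [1, 1, 0, 0],
})

cleaned = apply_cleaning_plan(toy)
print(cleaned)
```

If the data is later split for modelling, computing the median on the training portion only would avoid leaking information from the test set.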
