# Exploratory Data Analysis: Stroke Dataset


## Why This Project?
- **Data Cleaning**: Handle missing values and inconsistencies using Pandas.
- **Statistical Analysis**: Use NumPy and Pandas for calculations like mean and median.
- **Visualization**: Create plots with Matplotlib and Seaborn to identify patterns.
- **Feature Relationships**: Analyze how features like age, glucose levels, and smoking status relate to stroke.


## Step 1: Import Libraries
Load the required libraries for data manipulation, analysis, and visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

## Step 2: Load Dataset
Load the stroke dataset and inspect the first few rows.

In [3]:
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
df.head(5)

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


## Step 3: Exploring the Data
Check the dataset's structure, missing values, and basic statistics.

In [None]:
print('Dataset Shape:', df.shape)
df.info()  # info() prints its report directly and returns None, so no print() wrapper
print('\nMissing Values:')
print(df.isnull().sum())
print('\nSummary Statistics:')
print(df.describe())

### Initial Insights
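- **Dataset Size**: Contains 12 columns: an `id`, the target `stroke` (0 = no stroke, 1 = stroke), and 10 features such as `age`, `avg_glucose_level`, and `bmi`.
- **Missing Values**: `bmi` has missing values, written as 'N/A' in the raw CSV (`pd.read_csv` reads these as `NaN` by default), and `smoking_status` includes 'Unknown' entries, which `isnull()` does not flag.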
- **Numerical Features**: `age` ranges from very young to 82, and `avg_glucose_level` shows high variability (55.12 to 271.74).
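
The two kinds of missing marker behave differently on load: `N/A` is one of pandas' default NA strings, while `Unknown` is ordinary text. The sketch below checks this on a tiny made-up CSV (the rows and the `sample` name are invented for illustration, not taken from the dataset).

```python
import io
import pandas as pd

# Tiny made-up sample that mimics the file's conventions (not real rows).
raw = io.StringIO(
    "id,bmi,smoking_status\n"
    "1,36.6,formerly smoked\n"
    "2,N/A,never smoked\n"
    "3,32.5,Unknown\n"
)
sample = pd.read_csv(raw)

# 'N/A' is parsed as NaN, so bmi is already a float column with one gap.
print(sample['bmi'].dtype)
print(sample['bmi'].isnull().sum())

# 'Unknown' is a regular string, so it has to be counted explicitly.
n_unknown = (sample['smoking_status'] == 'Unknown').sum()
print(n_unknown)
```

Running the same two checks on `df` shows how many rows each cleaning step in the next section will touch.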


## Step 4: Data Cleaning
Handle missing `bmi` values and 'Unknown' in `smoking_status`.

In [None]:
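# Replace 'N/A' in bmi with NaN and fill with median
# (read_csv normally parses 'N/A' as NaN already; the replace is a safeguard)
df['bmi'] = df['bmi'].replace('N/A', np.nan).astype(float)
# Assign back instead of fillna(inplace=True) on a column, which newer pandas warns about
df['bmi'] = df['bmi'].fillna(df['bmi'].median())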

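# Replace 'Unknown' in smoking_status with the most common known category
# (excluding 'Unknown' itself so the replacement can never be a no-op)
known_mode = df.loc[df['smoking_status'] != 'Unknown', 'smoking_status'].mode()[0]
df['smoking_status'] = df['smoking_status'].replace('Unknown', known_mode)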

# Verify cleaning
print('Missing Values After Cleaning:')
print(df.isnull().sum())

### Cleaning Insights
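- `bmi` missing values were filled with the median, which is less sensitive to outliers than the mean.
- `smoking_status` 'Unknown' values were replaced with the mode so the column contains only known categories. This inflates the most common category, so findings that involve smoking status should be read with that in mind.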
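
To illustrate why the median is the more robust fill value, the sketch below compares the mean and median of a small invented series with one extreme value. The numbers are illustrative only and are not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Invented BMI-like values with one extreme outlier and one gap.
bmi = pd.Series([22.0, 24.0, 26.0, 28.0, 90.0, np.nan])

mean_fill = bmi.mean()      # pulled upward by the outlier
median_fill = bmi.median()  # stays near the bulk of the values

print(f'mean = {mean_fill:.1f}, median = {median_fill:.1f}')
# mean = 38.0, median = 26.0

# Fill the gap with the median and confirm nothing is missing afterwards.
bmi_filled = bmi.fillna(median_fill)
print(bmi_filled.isnull().sum())
```

A single value of 90 moves the mean well above every typical observation, while the median stays at 26.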


## Step 5: Basic Analysis
Perform simple statistical analysis to explore relationships with `stroke`.

In [None]:
# Stroke rate by hypertension
stroke_by_hypertension = df.groupby('hypertension')['stroke'].mean()
print('Stroke Rate by Hypertension:\n', stroke_by_hypertension)

# Average age of stroke vs. non-stroke patients
avg_age_stroke = df[df['stroke'] == 1]['age'].mean()
avg_age_no_stroke = df[df['stroke'] == 0]['age'].mean()
print(f'Average Age (Stroke): {avg_age_stroke:.2f}')
print(f'Average Age (No Stroke): {avg_age_no_stroke:.2f}')

### Analysis Insights
- **Hypertension**: Patients with hypertension have a higher stroke rate than those without.
- **Age**: Stroke patients are considerably older on average than non-stroke patients. No significance test has been run, so this is a descriptive difference.
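
Group rates are easier to judge alongside group sizes, since a rate computed from only a few patients is much noisier. A minimal sketch on invented data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 8 patients without hypertension, 4 with.
toy = pd.DataFrame({
    'hypertension': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    'stroke':       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# For a 0/1 column the mean is the rate; count and sum give it context.
summary = toy.groupby('hypertension')['stroke'].agg(
    n='count', strokes='sum', rate='mean'
)
print(summary)
```

The same aggregation applied to `df` would report the stroke rate next to the number of patients behind it.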


## Step 6: Visualizations
Create plots to visualize relationships between features and stroke occurrence.

In [None]:
# Barplot of Stroke by Hypertension
plt.figure(figsize=(8, 5))
sns.countplot(x='hypertension', hue='stroke', data=df, palette='Set2')
plt.title('Stroke by Hypertension')
plt.xlabel('Hypertension (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

# Boxplot of Age by Stroke
plt.figure(figsize=(8, 5))
sns.boxplot(x='stroke', y='age', data=df, palette='Set1')
plt.title('Age Distribution by Stroke Status')
plt.xlabel('Stroke (0 = No, 1 = Yes)')
plt.ylabel('Age')
plt.show()

# Correlation Heatmap for numerical features
plt.figure(figsize=(8, 6))
# id is an arbitrary identifier, so exclude it from the correlations
corr = df.select_dtypes(include=[np.number]).drop(columns='id').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

### Insights from Visualizations
- **Hypertension Barplot**: Patients with hypertension are more likely to have strokes, though stroke cases are rare overall.
- **Age Boxplot**: Stroke patients have a higher median age (~68) than non-stroke patients (~40), consistent with age being a key risk factor.
- **Correlation Heatmap**:
  - `age` has a weak-to-moderate positive correlation (0.25) with `stroke`, the largest among the numerical features discussed here.
  - `avg_glucose_level` shows a weaker correlation (0.13) with `stroke`.
  - `bmi` has a very low correlation with `stroke`, suggesting it may not be a strong predictor.
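
The heatmap values that matter most here are those in the `stroke` column. A sketch of extracting and ranking them is shown below, using synthetic data generated purely for illustration (the `toy` frame, the seed, and the way `stroke` is simulated are all assumptions, not properties of the real dataset).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy data is repeatable
n = 200
age = rng.uniform(20, 80, n)
bmi = rng.normal(28, 5, n)
# Hypothetical target: older synthetic patients are more likely to be 1.
stroke = (rng.uniform(0, 1, n) < (age - 20) / 120).astype(int)
toy = pd.DataFrame({'age': age, 'bmi': bmi, 'stroke': stroke})

# Keep only the target's column, drop its self-correlation, and sort.
target_corr = toy.corr()['stroke'].drop('stroke').sort_values(ascending=False)
print(target_corr.round(2))
```

Pearson correlation against a binary target only captures linear association, so a low value does not rule out a non-linear relationship.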


## Step 7: Key Takeaways
- **Key Risk Factors**: Age and hypertension show the clearest association with stroke in this dataset.
- **Glucose and BMI**: Elevated glucose levels show some association with stroke, but BMI appears less influential.
- **Next Steps**: Consider encoding categorical variables and building a predictive model to further analyze stroke risk factors.
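
As a starting point for the encoding step, one common option is one-hot encoding with `pd.get_dummies`. The sketch below runs on a small invented frame (the `toy` name and its rows are hypothetical); whether to use `drop_first` depends on the model chosen later.

```python
import pandas as pd

# Invented rows that reuse category labels seen in the dataset.
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'smoking_status': ['smokes', 'never smoked', 'smokes'],
    'age': [67.0, 61.0, 49.0],
})

# One-hot encode the text columns; drop_first removes one redundant
# indicator per feature.
encoded = pd.get_dummies(toy, columns=['gender', 'smoking_status'], drop_first=True)
print(encoded.columns.tolist())
# ['age', 'gender_Male', 'smoking_status_smokes']
```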
