# Heart Disease Dataset - Exploratory Data Analysis

This notebook explores the heart disease dataset to understand the features, distributions, and relationships before model building.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

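# Set plot style; the seaborn theme is applied last, so it takes precedence
# over 'ggplot' wherever the two define the same settings
plt.style.use('ggplot')
sns.set_theme(style="whitegrid")  # sns.set() is an alias for set_theme()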
%matplotlib inline

In [None]:
# Load the dataset
df = pd.read_csv('../data/heart.csv')

# Report the dataset shape, then display the first few rows
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Check for missing values
print("Missing values per column:")
df.isnull().sum()

In [None]:
# Summary statistics
df.describe()

In [None]:
# Check target class distribution
target_counts = df['target'].value_counts()
print("Target class distribution:")
print(target_counts)

plt.figure(figsize=(8, 6))
sns.countplot(x='target', data=df)
plt.title('Heart Disease Class Distribution')
plt.xlabel('Has Heart Disease')
plt.ylabel('Count')
plt.show()

In [None]:
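# Pairwise Pearson correlations between numeric columns
# (numeric_only requires pandas >= 1.5; drop it on older versions)
plt.figure(figsize=(15, 12))
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', linewidths=0.5)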
plt.title('Correlation Matrix of Heart Disease Features')
plt.show()

In [None]:
# Visualize key numerical features by target class
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

fig, axes = plt.subplots(len(numerical_features), 1, figsize=(12, 15))

for i, feature in enumerate(numerical_features):
    sns.boxplot(x='target', y=feature, data=df, ax=axes[i])
    axes[i].set_title(f'Distribution of {feature} by Heart Disease')
    axes[i].set_xlabel('Has Heart Disease')

# Apply the layout after titles and labels exist so the spacing accounts for them
fig.tight_layout(pad=3.0)
plt.show()

In [None]:
# Visualize categorical features
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

fig, axes = plt.subplots(4, 2, figsize=(18, 15))
axes = axes.flatten()

for i, feature in enumerate(categorical_features):
    sns.countplot(x=feature, hue='target', data=df, ax=axes[i])
    axes[i].set_title(f'Distribution of {feature} by Heart Disease')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Count')
    axes[i].legend(title='Has Heart Disease')

# Apply the layout after titles, labels, and legends exist so the spacing accounts for them
fig.tight_layout(pad=3.0)
plt.show()

In [None]:
# Pairplot for key features
important_features = ['age', 'thalach', 'chol', 'oldpeak', 'target']
sns.pairplot(df[important_features], hue='target', palette='Set1')
plt.suptitle('Pairplot of Key Heart Disease Features', y=1.02)
plt.show()

## Key Findings

1. **Class Distribution**: The positive and negative classes are roughly balanced, so class imbalance should not need special handling during model training.

2. **Correlations**: Several features show moderate to strong correlations with the target variable:
   - `thalach` (maximum heart rate achieved): Higher values tend to be associated with a lower likelihood of heart disease
   - `oldpeak` (ST depression induced by exercise): Higher values correlate with higher disease risk
   - `cp` (chest pain type): Certain chest pain types are more indicative of heart disease than others

3. **Age Distribution**: Older patients appear slightly more likely to have heart disease, but the trend is weak.

4. **Cholesterol**: The `chol` distributions separate less by target class than might be expected, suggesting cholesterol is not strongly predictive on its own.

5. **Categorical Features**: `ca` (number of major vessels), `thal`, and `cp` (chest pain type) show clear differences in distribution between positive and negative cases.
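
The findings above are read off the plots. As a rough numeric cross-check, the sketch below shows one way to put numbers on them. The helper names (`class_balance`, `target_correlations`, `disease_rate_by_category`) are hypothetical and not part of this notebook, and the last helper assumes `target` is coded 0/1 so that its mean within a group is the share of positive cases.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, target: str = "target") -> pd.Series:
    """Share of rows in each target class (the shares sum to 1)."""
    return df[target].value_counts(normalize=True).sort_index()


def target_correlations(df: pd.DataFrame, target: str = "target") -> pd.Series:
    """Pearson correlation of each numeric column with the target,
    ordered from strongest to weakest by absolute value."""
    # numeric_only requires pandas >= 1.5
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def disease_rate_by_category(
    df: pd.DataFrame, feature: str, target: str = "target"
) -> pd.Series:
    """Mean of the target within each level of a categorical feature.

    Assumption: the target is coded 0/1, so the mean is the share of
    positive cases in that level.
    """
    return df.groupby(feature)[target].mean()
```

With the `df` loaded earlier, `class_balance(df)` speaks to finding 1, `target_correlations(df)` to findings 2 through 4, and `disease_rate_by_category(df, 'cp')` (or `'ca'`, `'thal'`) to finding 5. Pearson correlation treats the category codes of `cp` and `thal` as if they were ordered numbers, so the per-level rates are the more reliable view for those columns.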

These insights will guide our feature engineering and modeling approach in the next notebook.