# Feature Engineering for Heart Disease Prediction

This notebook prepares the heart disease dataset for model training: it imputes missing values, derives new features, scales the continuous columns, and takes a first look at feature importance.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Set the random seed for reproducibility
np.random.seed(42)

## Load and Inspect the Dataset

In [2]:
# Load the dataset
data = pd.read_csv('../data/heart.csv')

# Make a copy of the data to avoid modifying the original
df = data.copy()

In [3]:
# Basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nDataset Info:")
df.info()
print("\nDescriptive Statistics:")
df.describe()

## Handle Missing Values

In [4]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# In case the check above reports missing values, impute them by column type.
# Note: fitting the imputers on the full dataset lets test-set statistics leak into
# training; for a strict evaluation, fit them on the training split only.

# Use the median for continuous features (the target and the integer-coded
# categorical columns are deliberately excluded here)
numerical_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
numerical_imputer = SimpleImputer(strategy='median')
df[numerical_cols] = numerical_imputer.fit_transform(df[numerical_cols])

# Use the most frequent value for categorical features
categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])
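
The leakage concern with imputation can be illustrated on a small example. The sketch below is not part of the original pipeline: it uses a made-up toy frame (the values and the `toy_*` names are assumptions for illustration) and shows one way to learn the fill values from the training rows only and then reuse them on the held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame standing in for heart.csv (values are made up for illustration)
toy = pd.DataFrame({
    "chol": [200.0, np.nan, 240.0, 180.0, 260.0, 220.0, np.nan, 210.0],
    "cp": [0.0, 1.0, np.nan, 2.0, 1.0, 1.0, 0.0, 1.0],
    "target": [0, 1, 0, 1, 1, 0, 1, 0],
})

# Split first, so the imputers never see the held-out rows
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

# Learn the fill values from the training rows only
toy_train[["chol"]] = median_imputer.fit_transform(toy_train[["chol"]])
toy_train[["cp"]] = mode_imputer.fit_transform(toy_train[["cp"]])

# Reuse those same fill values on the held-out rows
toy_test[["chol"]] = median_imputer.transform(toy_test[["chol"]])
toy_test[["cp"]] = mode_imputer.transform(toy_test[["cp"]])

print("Remaining NaNs (train, test):",
      int(toy_train.isnull().sum().sum()),
      int(toy_test.isnull().sum().sum()))
```

The same pattern would apply to the real data if the split were moved ahead of the imputation step.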

## Feature Engineering

Let's create some new features that might help the model.

In [5]:
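# Create age groups: (0, 40], (40, 50], (50, 60], (60, 100] -> 0, 1, 2, 3
# pd.cut returns a categorical; cast to int so later steps get a plain numeric column
df['age_group'] = pd.cut(df['age'], bins=[0, 40, 50, 60, 100], labels=[0, 1, 2, 3]).astype(int)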

# A BMI feature would need height and weight, which this dataset lacks, so it is skipped.

# A total cholesterol / HDL ratio would need HDL, which is also unavailable.
# Instead, express cholesterol relative to the dataset mean (1.0 = average).
# Note: this is a rescaling of 'chol', not an approximation of the ratio.
df['chol_normalized'] = df['chol'] / df['chol'].mean()

# Create a blood pressure category from resting systolic pressure (mm Hg)
def bp_category(x):
    if x < 120:
        return 0  # Normal
    elif x < 130:
        return 1  # Elevated
    elif x < 140:
        return 2  # Stage 1 hypertension
    else:
        return 3  # Stage 2 hypertension

df['bp_category'] = df['trestbps'].apply(bp_category)

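# Heart-rate features based on the age-predicted maximum (220 - age)
# Note: 'heart_rate_reserve' here is the predicted maximum minus the achieved
# maximum ('thalach'), not the clinical heart rate reserve (maximum minus resting rate).
# 'max_heart_rate' is a linear function of 'age', so it adds no new information by itself.
df['max_heart_rate'] = 220 - df['age']
df['heart_rate_reserve'] = df['max_heart_rate'] - df['thalach']
df['heart_rate_percentage'] = df['thalach'] / df['max_heart_rate']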

# Interaction terms
df['age_chol_interaction'] = df['age'] * df['chol_normalized']
df['exercise_bp_interaction'] = df['exang'] * df['bp_category']
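
The blood pressure thresholds can also be expressed with `pd.cut` instead of a row-wise `apply`. The sketch below is an optional alternative, not the notebook's method; the readings are made up and include the boundary values. Note that `right=False` is needed to match `bp_category`, whose bins are closed on the left, whereas the age bins above use the default right-closed intervals (an age of 40 falls in group 0).

```python
import numpy as np
import pandas as pd

def bp_category(x):
    """Same thresholds as the notebook's function."""
    if x < 120:
        return 0
    elif x < 130:
        return 1
    elif x < 140:
        return 2
    return 3

# Made-up systolic readings, including the boundary values
trestbps = pd.Series([94, 119, 120, 129, 130, 139, 140, 200])

# right=False makes each bin closed on the left: [120, 130), [130, 140), ...
bp_cut = pd.cut(
    trestbps,
    bins=[-np.inf, 120, 130, 140, np.inf],
    labels=[0, 1, 2, 3],
    right=False,
).astype(int)

bp_apply = trestbps.apply(bp_category)
print("Vectorized result matches apply():", bool((bp_cut == bp_apply).all()))
```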

## Feature Scaling

In [6]:
# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

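# Apply standard scaling to the continuous features
# (fit on the training set only, then reuse the fitted scaler on the test set)
scaler = StandardScaler()
numerical_cols_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak',
                           'heart_rate_reserve', 'heart_rate_percentage', 'age_chol_interaction']

# Work on explicit copies to avoid pandas' SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
X_train[numerical_cols_to_scale] = scaler.fit_transform(X_train[numerical_cols_to_scale])
X_test[numerical_cols_to_scale] = scaler.transform(X_test[numerical_cols_to_scale])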
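
A quick sanity check on the scaling step is to confirm that the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics. The sketch below uses randomly generated stand-in data (the `demo_*` names and the distribution parameters are assumptions, not values from the dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two continuous columns (parameters are arbitrary)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age": rng.normal(54, 9, size=200),
    "chol": rng.normal(246, 52, size=200),
})
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean and std from train
test_scaled = demo_scaler.transform(demo_test)        # reuses the train statistics

print("train mean:", train_scaled.mean(axis=0).round(3))
print("train std: ", train_scaled.std(axis=0).round(3))
print("test mean: ", test_scaled.mean(axis=0).round(3))
```

The test-set means will generally be close to, but not exactly, zero. That is expected and is what a model would see on new data.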

## Save Processed Data

In [7]:
import os
import joblib

# Create the output folder if it does not exist yet
os.makedirs('../data/processed', exist_ok=True)

# Save the processed training and testing datasets
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)

# Save the fitted scaler so the same transformation can be reapplied to new data
joblib.dump(scaler, '../data/processed/scaler.pkl')
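
To show how a saved scaler might be reused later, the sketch below round-trips a small `StandardScaler` through `joblib` and checks that the restored object transforms data identically. It writes to a temporary directory rather than `../data/processed/`, and the input values are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for a continuous column
values = np.array([[120.0], [130.0], [140.0], [150.0]])
fitted = StandardScaler().fit(values)

# Save and reload through a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored scaler should apply exactly the same transformation
same = np.allclose(fitted.transform(values), restored.transform(values))
print("Restored scaler matches original:", same)
```

A scaler loaded this way expects the same columns, in the same order, as the ones it was fitted on.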

## Feature Importance Analysis

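Let's fit a random forest to get a preliminary idea of feature importance. Its impurity-based importances are computed on the training data and can be split across correlated features (several engineered features here are derived from `age` and `chol`), so treat the ranking as a rough guide.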

In [8]:
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# View feature importance
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf.feature_importances_
})
feature_importance = feature_importance.sort_values('Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance)
plt.title('Feature Importance from Random Forest')
plt.tight_layout()
plt.savefig('../data/processed/feature_importance.png')
plt.show()
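
Permutation importance on held-out data is a common cross-check for impurity-based importances. The sketch below is a minimal illustration on synthetic data from `make_classification` (the feature names `f0` to `f5` and all `*_demo` variables are hypothetical); with the real data, the fitted `rf`, `X_test`, and `y_test` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
cols = [f"f{i}" for i in range(X_demo.shape[1])]
X_demo = pd.DataFrame(X_demo, columns=cols)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr)

# Shuffle each feature in the held-out set and measure the drop in accuracy
result = permutation_importance(rf_demo, Xte, yte, n_repeats=10, random_state=42)

perm = pd.DataFrame({
    "Feature": cols,
    "Importance": result.importances_mean,
    "Std": result.importances_std,
}).sort_values("Importance", ascending=False)
print(perm)
```

Features whose mean importance is close to zero, or within one standard deviation of it, contribute little to held-out accuracy in this setup.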

## Conclusion

In this notebook, we performed feature engineering on the heart disease dataset to prepare it for model training. We:

1. Handled missing values
2. Created new features based on domain knowledge
3. Split the data into training and testing sets
4. Standardized the continuous features, fitting the scaler on the training set only
5. Saved the processed data and the fitted scaler
6. Analyzed feature importance with a random forest

The processed data is now ready for model training in the next notebook.