# Heart Disease Risk Prediction - Model Training

This notebook implements a machine learning pipeline to predict heart disease risk using the BRFSS (Behavioral Risk Factor Surveillance System) dataset.

## Approach
1. Load and preprocess the data
2. Feature selection based on correlation with target variable
3. Train a Linear SVM classifier with class balancing
4. Optimize decision threshold for better F1-score
5. Generate predictions on test set

## Model Performance
- **Algorithm**: Linear Support Vector Machine (LinearSVC)
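- **Validation F1-score**: ~0.366 after threshold optimization (the threshold is tuned on the same validation split it is scored on, so this figure is likely optimistic)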
- **Key features**: 23 features selected based on |correlation| > 0.1 with target


## 1. Import Libraries


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score, classification_report, precision_recall_curve
from sklearn.impute import SimpleImputer

import warnings
# Note: this silences every warning, including LinearSVC's ConvergenceWarning
warnings.filterwarnings('ignore')


## 2. Load Data


In [None]:
# Load training and test datasets
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

print(f"Train dataset shape: {train_df.shape}")
print(f"Test dataset shape: {test_df.shape}")
# Assumes the label column is named 'TARGET'
print("\nTarget distribution:")
print(train_df['TARGET'].value_counts(normalize=True))


## 3. Data Preprocessing

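Drop columns that are entirely NaN, since they carry no information. Train and test are checked separately, so the two frames may end up with different columns; the test set is re-aligned to the selected features in Section 9.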


In [None]:
# Remove columns in which every value is missing
cols_all_nan_train = train_df.columns[train_df.isna().all()]
train_df = train_df.drop(columns=cols_all_nan_train)

cols_all_nan_test = test_df.columns[test_df.isna().all()]
test_df = test_df.drop(columns=cols_all_nan_test)

print(f"Removed {len(cols_all_nan_train)} columns with 100% NaN from train")
print(f"Removed {len(cols_all_nan_test)} columns with 100% NaN from test")
print(f"\nCleaned train shape: {train_df.shape}")
print(f"Cleaned test shape: {test_df.shape}")


## 4. Feature Selection

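Select features whose absolute Pearson correlation with the target variable exceeds 0.1. The correlations are computed on the full training set, before the train/validation split, so the validation score reported later is slightly optimistic.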


In [None]:
# Identify target column (assumed to be the last column of the training file)
target_col = train_df.columns[-1]
print(f"Target variable: {target_col}")

# Coerce every column to numeric; values that cannot be parsed become NaN
train_num = train_df.apply(pd.to_numeric, errors='coerce')

# Compute each feature's Pearson correlation with the target
# (corrwith avoids building the full feature-by-feature matrix)
corr_with_target = train_num.drop(columns=[target_col]).corrwith(train_num[target_col])

# Select features with |correlation| > 0.1 (NaN correlations are never selected)
selected_features = corr_with_target[corr_with_target.abs() > 0.1].index.tolist()

print(f"\nNumber of features selected: {len(selected_features)}")
print("\nTop 10 correlated features:")
print(corr_with_target.abs().sort_values(ascending=False).head(10))
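
The selection rule above can be wrapped in a small helper so that it is easy to run on the training fold only. The sketch below is illustrative rather than the notebook's own code: `select_by_correlation` and the toy columns `strong`, `noise`, and `text` are hypothetical names, and the data is constructed so that the expected selection is obvious.

```python
import numpy as np
import pandas as pd


def select_by_correlation(df, target_col, min_abs_corr=0.1):
    """Return (selected column names, correlation Series) for one DataFrame."""
    # Coerce to numeric so text columns become NaN instead of raising
    numeric = df.apply(pd.to_numeric, errors="coerce")
    features = numeric.drop(columns=[target_col])
    # Pearson correlation of every feature column with the target
    corr = features.corrwith(numeric[target_col])
    # NaN correlations compare as False, so they are never selected
    selected = corr[corr.abs() > min_abs_corr].index.tolist()
    return selected, corr


# Toy data (hypothetical): deterministic, so the outcome does not depend on a seed
n = 1000
target = np.tile([0, 1], n // 2)
noise = np.tile([1, 1, -1, -1], n // 4)      # exactly uncorrelated with target
toy = pd.DataFrame({
    "strong": 2.0 * target + 0.1 * noise,    # almost a copy of the target
    "noise": noise,
    "text": ["a"] * n,                       # not numeric, coerced to NaN
    "TARGET": target,
})

selected, corr = select_by_correlation(toy, "TARGET")
print(selected)
```

Calling the helper on the training rows only, after the split, would keep the validation rows out of the feature selection step.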


## 5. Prepare Training Data


In [None]:
# Extract features and target
X = train_df[selected_features].copy()
y = train_df[target_col]

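# Encode categorical variables as integer codes
# Note: astype(str) turns NaN into the literal category 'nan', so missing values
# in these columns become their own class instead of being imputed later
encoder_dict = {}
for col in X.select_dtypes(include=['object', 'category']).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    encoder_dict[col] = le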

print(f"Encoded {len(encoder_dict)} categorical columns")
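
As an alternative to one `LabelEncoder` per column, scikit-learn's `OrdinalEncoder` can encode feature columns and assign a reserved code to categories it has not seen during fitting. This is a minimal sketch on made-up data, not what the notebook does; the `state` column and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column; "FL" never appears in the training frame
train_cat = pd.DataFrame({"state": ["NY", "CA", "NY", "TX"]})
test_cat = pd.DataFrame({"state": ["CA", "FL"]})

# Unknown categories at transform time are encoded as -1 instead of raising
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train_cat)
test_encoded = encoder.transform(test_cat)

print(encoder.categories_)
print(test_encoded.ravel())
```

A dedicated "unknown" code keeps unseen test categories distinct from real training classes, at the cost of a value the model has never been trained on.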


## 6. Train-Validation Split

Use a stratified 70/30 split so that the training and validation sets both keep the original class distribution.


In [None]:
# Split data with stratification (important for imbalanced dataset)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, 
    test_size=0.3, 
    stratify=y, 
    random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
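
To see what `stratify` buys on an imbalanced target, the sketch below splits a synthetic label vector with roughly the 9% positive rate mentioned in the summary and compares the positive rate in each part. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 90 positives out of 1000 rows (9% positive class)
y_toy = np.array([1] * 90 + [0] * 910)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy,
    test_size=0.3,
    stratify=y_toy,
    random_state=42,
)

train_rate = y_tr.mean()
val_rate = y_va.mean()
print(f"Positive rate in train: {train_rate:.3f}")
print(f"Positive rate in validation: {val_rate:.3f}")
```

Without `stratify`, the two rates would vary from split to split, which makes F1-scores on a small positive class harder to compare.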


## 7. Model Training

Pipeline with:
- SimpleImputer: fill missing values with each column's most frequent value
- StandardScaler: standardize features to zero mean and unit variance
- LinearSVC: linear SVM classifier with balanced class weights


In [None]:
# Create pipeline
pipe = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    StandardScaler(),
    LinearSVC(C=1, class_weight='balanced', random_state=42, max_iter=2000)
)

# Train model
print("Training model...")
pipe.fit(X_train, y_train)
print("✓ Model trained successfully")
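
Because warnings are suppressed in this notebook, a `ConvergenceWarning` from `LinearSVC` would go unnoticed. One way to check convergence explicitly is to compare the fitted estimator's `n_iter_` attribute with `max_iter`. The sketch below does this on synthetic data from `make_classification`; `demo_pipe`, `X_demo`, and `y_demo` are illustrative names, not objects from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in for the real training data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

max_iter = 2000
demo_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    StandardScaler(),
    LinearSVC(C=1, class_weight="balanced", random_state=42, max_iter=max_iter),
)
demo_pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
svc = demo_pipe.named_steps["linearsvc"]
n_iter = int(svc.n_iter_)

# Reaching max_iter suggests the solver stopped before converging
converged = n_iter < max_iter
print(f"Iterations used: {n_iter} of {max_iter} (converged: {converged})")
```

If the solver does hit the limit on the real data, raising `max_iter` or lowering `C` are the usual first things to try.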


## 8. Model Evaluation & Threshold Optimization

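Since the dataset is imbalanced, we optimize the decision threshold to maximize F1-score. LinearSVC does not output probabilities, so the threshold is applied to the signed scores from `decision_function`, where the default cut-off is 0.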


In [None]:
# Standard prediction (threshold = 0)
y_pred_val = pipe.predict(X_val)
f1_standard = f1_score(y_val, y_pred_val)
print(f"F1-score with default threshold: {f1_standard:.4f}")
print("\nClassification Report (default threshold):")
print(classification_report(y_val, y_pred_val))


In [None]:
# Find optimal threshold using decision function
scores_val = pipe.decision_function(X_val)

# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_val, scores_val)

# Compute F1-score for each threshold
f1_scores = 2 * precision * recall / (precision + recall + 1e-12)

# Find best threshold (excluding last point which has no threshold)
best_idx = f1_scores[:-1].argmax()
best_thresh = thresholds[best_idx]

print(f"\nOptimal threshold: {best_thresh:.4f}")
print(f"Best F1-score: {f1_scores[best_idx]:.4f}")
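
The indexing in the threshold search is easier to follow on a handful of points. In `precision_recall_curve`, `precision[i]` and `recall[i]` describe the predictions `score >= thresholds[i]`, and the final precision/recall pair has no matching threshold. The labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Six hypothetical validation points: true labels and decision scores
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([-1.5, -0.5, -0.2, 0.1, 0.4, 1.2])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# F1 at every candidate threshold; the epsilon guards against 0/0
f1 = 2 * precision * recall / (precision + recall + 1e-12)

# Drop the last precision/recall pair, which has no threshold
best_idx = f1[:-1].argmax()
best_thresh = thresholds[best_idx]

# Re-apply the threshold with the same >= convention used by the curve
y_pred = (scores >= best_thresh).astype(int)

print(f"Best threshold: {best_thresh:.2f}")
print(f"F1 from curve: {f1[best_idx]:.4f}")
print(f"F1 recomputed: {f1_score(y_true, y_pred):.4f}")
```

Here the best threshold is -0.2, below the default of 0: admitting one more true positive is worth the single false positive it lets through. The two F1 values agree because the predictions are rebuilt with the same `>=` rule.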


In [None]:
# Predict with optimized threshold
y_pred_val_optimized = (scores_val >= best_thresh).astype(int)
f1_optimized = f1_score(y_val, y_pred_val_optimized)

print(f"F1-score with optimized threshold: {f1_optimized:.4f}")
print("\nClassification Report (optimized threshold):")
print(classification_report(y_val, y_pred_val_optimized))
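
The optimized F1-score above is measured on the same rows that were used to pick the threshold, so it tends to overstate performance on new data. A less biased estimate can be obtained by tuning the threshold on one held-out part and scoring on another. The sketch below shows the idea on synthetic data; `best_f1_threshold` and the `X_fit` / `X_tune` / `X_eval` names are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def best_f1_threshold(y_true, scores):
    """Return the decision threshold that maximizes F1 on the given data."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return thresholds[f1[:-1].argmax()]


# Synthetic, imbalanced stand-in for the real data
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Three disjoint parts: fit the model, tune the threshold, evaluate
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X_syn, y_syn, test_size=0.4, stratify=y_syn, random_state=0
)
X_tune, X_eval, y_tune, y_eval = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

model = make_pipeline(
    StandardScaler(),
    LinearSVC(class_weight="balanced", random_state=0, max_iter=2000),
)
model.fit(X_fit, y_fit)

# Threshold chosen on the tuning part only
thresh = best_f1_threshold(y_tune, model.decision_function(X_tune))

f1_tune = f1_score(y_tune, (model.decision_function(X_tune) >= thresh).astype(int))
f1_eval = f1_score(y_eval, (model.decision_function(X_eval) >= thresh).astype(int))

print(f"F1 on tuning part (optimistic): {f1_tune:.4f}")
print(f"F1 on evaluation part: {f1_eval:.4f}")
```

The score on the evaluation part is the one to quote when reporting expected performance.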


## 9. Predictions on Test Set


In [None]:
# Prepare test data with same features
X_test = test_df.reindex(columns=selected_features, fill_value=np.nan).copy()

# Apply the encoders fitted on the training data
for col, le in encoder_dict.items():
    if col in X_test.columns:
        values = X_test[col].astype(str)
        # Categories unseen during training fall back to the first known class
        # (an arbitrary choice that keeps le.transform from raising)
        values = values.where(values.isin(le.classes_), le.classes_[0])
        X_test[col] = le.transform(values)

print("Test data prepared")


In [None]:
# Get decision function scores
scores_test = pipe.decision_function(X_test)

# Apply optimized threshold
y_test_pred = (scores_test >= best_thresh).astype(int)

print(f"Generated {len(y_test_pred)} predictions")
print("\nPrediction distribution:")
print(pd.Series(y_test_pred).value_counts(normalize=True))


## 10. Export Predictions


In [None]:
# Get ID column from test set
id_col = 'ID' if 'ID' in test_df.columns else test_df.columns[0]

# Create submission dataframe
submission = pd.DataFrame({
    id_col: test_df[id_col],
    'pred': y_test_pred
})

# Save to CSV
submission.to_csv('predictions_svm_optimized.csv', index=False)

print("✓ Predictions exported to 'predictions_svm_optimized.csv'")
print("\nFirst 10 predictions:")
print(submission.head(10))


## Summary

This notebook demonstrates:
- Feature selection based on correlation analysis
- Handling imbalanced datasets with class weighting
- Threshold optimization to improve F1-score
- Complete pipeline from data loading to predictions

### Key Takeaways:
- The dataset is highly imbalanced (~9% positive class)
- Feature selection reduced dimensionality from 324 to 23 features
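- Threshold optimization improved validation F1-score over the default threshold of 0, reaching ~0.366
- Feature scaling matters for LinearSVC: the margin and the L2 penalty depend on feature magnitudes, so unscaled features can dominate the fit and slow convergence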
