# 🎯 Support Vector Machine (SVM) Classification

## 📊 Project Overview

This notebook demonstrates a **Support Vector Machine (SVM)** for binary classification: predicting whether a patient has heart disease.

### 🎯 Learning Objectives:
- Understand Support Vector Machine for classification
- Compare with Linear Regression and Logistic Regression
- Implement complete ML pipeline with SVM
- Evaluate classification models with different kernels
- Feature importance analysis and hyperparameter tuning

---

## 🔍 Linear Regression vs Logistic Regression vs SVM

### Key Differences:

| Aspect | Linear Regression | Logistic Regression | Support Vector Machine |
|--------|------------------|---------------------|------------------------|
| **Purpose** | Predict continuous values | Predict probabilities/classes | Find optimal decision boundary |
| **Output** | Real number (-∞ to +∞) | Probability (0 to 1) | Class prediction with margin |
| **Use Case** | Car prices, house prices | Disease prediction, spam detection | Complex classification, high-dimensional data |
| **Function** | y = mx + b | y = 1/(1 + e^-(mx+b)) | f(x) = sign(w·x + b) |
| **Loss Function** | Mean Squared Error (MSE) | Log Loss (Cross-Entropy) | Hinge Loss |
| **Evaluation** | R², RMSE, MAE | Accuracy, Precision, Recall, F1, ROC-AUC | Accuracy, Precision, Recall, F1, ROC-AUC |
| **Algorithm** | Ordinary Least Squares | Maximum Likelihood Estimation | Quadratic Programming |
| **Decision Boundary** | Not applicable | Linear boundary | Linear/Non-linear (with kernels) |
| **Key Feature** | Best fit line | Sigmoid transformation | Maximum margin separation |
| **Handles Non-linearity** | No | No | Yes (with kernel trick) |
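
The loss-function row of the table can be made concrete with a small numeric sketch. It assumes the SVM label convention y ∈ {-1, +1} and uses `score` for the raw model output w·x + b. The function names are illustrative and do not come from any library.

```python
import math

def squared_error(y, score):
    """Per-sample term of Mean Squared Error."""
    return (y - score) ** 2

def logistic_loss(y, score):
    """Log loss for labels in {-1, +1}: log(1 + exp(-y * score))."""
    return math.log(1.0 + math.exp(-y * score))

def hinge_loss(y, score):
    """Hinge loss for labels in {-1, +1}: max(0, 1 - y * score)."""
    return max(0.0, 1.0 - y * score)

# One positive example (y = +1) scored with increasing confidence.
for score in (-1.0, 0.0, 0.5, 1.0, 3.0):
    print(f"score={score:+.1f}  "
          f"squared={squared_error(1, score):.3f}  "
          f"log={logistic_loss(1, score):.3f}  "
          f"hinge={hinge_loss(1, score):.3f}")
```

The hinge loss is exactly zero once the score reaches 1 (the point lies on or outside the margin), the log loss keeps shrinking but never reaches zero, and the squared error penalizes a score of 3 even though it is a confident, correct prediction.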

### When to Use Which?

**Use Linear Regression when:**
- ✅ Predicting continuous numerical values
- ✅ Output can be any real number
- ✅ Examples: prices, temperatures, sales

**Use Logistic Regression when:**
- ✅ Simple binary classification
- ✅ Need probability estimates
- ✅ Interpretable coefficients required
- ✅ Examples: spam/not spam, disease/healthy

**Use SVM when:**
- ✅ Complex classification problems
- ✅ High-dimensional data
- ✅ Non-linear relationships (with kernels)
- ✅ Small to medium datasets
- ✅ Need robust decision boundaries
- ✅ Examples: text classification, image recognition, gene classification

### Visual Comparison:

```
Linear Regression:           Logistic Regression:         SVM:
       y                            y (probability)           y (class)
       │                            │    1.0 ─────────        │    +1 ────────
       │     /                      │         ╱               │      ╱│╲
       │    /                       │       ╱                 │    ╱  │  ╲
       │   /                        │     ╱  (S-curve)        │  ╱    │    ╲
       │  /                         │   ╱                     │╱   margin  ╲│
       │ /                          │ ╱                       │      │      │
       └─────── x                   └─────────── x            └──────│──────── x
                                       0.0                           -1
                                                              (Maximum Margin)
```
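
To make "maximum margin" concrete, here is a minimal sketch on six made-up 2D points (not the heart disease data). It assumes a linear kernel, where the margin width equals 2 / ‖w‖, and uses a large `C` to approximate a hard margin. The names `X_toy`, `y_toy`, and `toy_svm` are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (invented for illustration).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
toy_svm = SVC(kernel="linear", C=1e3).fit(X_toy, y_toy)

w = toy_svm.coef_[0]        # normal vector of the separating hyperplane
b = toy_svm.intercept_[0]   # offset
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, " b =", b)
print("margin width =", margin_width)
print("support vectors:\n", toy_svm.support_vectors_)

# Support vectors sit on the margin, so |w·x + b| is close to 1 for them.
sv_scores = toy_svm.decision_function(toy_svm.support_vectors_)
print("decision values at support vectors:", sv_scores)
```

Only the support vectors determine the boundary; moving any other point (without crossing the margin) leaves `w` and `b` unchanged.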


## 📚 Import Libraries


In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Machine Learning - Models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression  # For comparison

# Machine Learning - Evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score,
    precision_recall_curve, average_precision_score
)

# Statistical analysis
from scipy import stats

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✅ Libraries imported successfully!")


## 📁 Load Dataset

**Dataset**: Heart Disease UCI Dataset

### Dataset Information:
- **Source**: UCI Machine Learning Repository
- **Target**: Binary classification (0 = No disease, 1 = Disease present)
- **Features**: 13 clinical attributes

### Features Description:
1. **age**: Age in years
2. **sex**: Sex (1 = male, 0 = female)
3. **cp**: Chest pain type (0-3)
4. **trestbps**: Resting blood pressure (mm Hg)
5. **chol**: Serum cholesterol (mg/dl)
6. **fbs**: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
7. **restecg**: Resting ECG results (0-2)
8. **thalach**: Maximum heart rate achieved
9. **exang**: Exercise induced angina (1 = yes, 0 = no)
10. **oldpeak**: ST depression induced by exercise
11. **slope**: Slope of peak exercise ST segment (0-2)
12. **ca**: Number of major vessels colored by fluoroscopy (0-3)
13. **thal**: Thalassemia (1 = normal, 2 = fixed defect, 3 = reversible defect)


In [None]:
# Load the dataset
# You can download the dataset from: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
# Place it in the dataset folder

df = pd.read_csv('../../dataset/heart.csv')

print(f"Dataset shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
print(f"Number of features: {df.shape[1] - 1}")  # exclude the target column
print("\n" + "="*50)
print("First 5 rows:")
df.head()


## 🔍 Exploratory Data Analysis (EDA)

### Step 1: Basic Information


In [None]:
# Dataset info
print("📊 Dataset Information:")
print("="*50)
df.info()


In [None]:
# Statistical summary
print("📈 Statistical Summary:")
print("="*50)
df.describe()


In [None]:
# Check for missing values
print("🔍 Missing Values:")
print("="*50)
missing = df.isnull().sum()
print(missing[missing > 0] if missing.sum() > 0 else "No missing values found! ✅")


In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"🔍 Duplicate rows: {duplicates}")
if duplicates > 0:
    print(f"Removing {duplicates} duplicate rows...")
    df = df.drop_duplicates()
    print(f"✅ New shape: {df.shape}")


### Step 2: Target Variable Analysis


In [None]:
# Analyze target variable distribution
print("🎯 Target Variable Distribution:")
print("="*50)
target_col = 'target'  # Adjust if your target column has a different name

target_counts = df[target_col].value_counts()
print(f"\nClass Distribution:")
print(target_counts)
print(f"\nPercentage:")
print(df[target_col].value_counts(normalize=True) * 100)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(data=df, x=target_col, ax=axes[0], palette='Set2')
axes[0].set_title('Target Variable Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Heart Disease (0=No, 1=Yes)', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)

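# Pie chart (value_counts() sorts by frequency, so sort by class label
# to keep the slices aligned with the label order 0, 1)
axes[1].pie(target_counts.sort_index(), labels=['No Disease', 'Disease'], autopct='%1.1f%%',
            colors=['#90EE90', '#FFB6C1'], startangle=90)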
axes[1].set_title('Target Variable Proportion', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()


## 🔧 Data Preprocessing

### Step 1: Prepare Data for Modeling


In [None]:
# Separate features and target
X = df.drop(target_col, axis=1)
y = df[target_col]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures: {list(X.columns)}")

# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("\n✅ Data split completed!")
print("="*50)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"\nTraining target distribution:")
print(y_train.value_counts())
print(f"\nTesting target distribution:")
print(y_test.value_counts())


In [None]:
# Feature Scaling (CRITICAL for SVM!)
print("⚠️  IMPORTANT: SVM requires feature scaling!")
print("="*50)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for better readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("✅ Feature scaling completed!")
print("="*50)
print("\nScaled training data (first 5 rows):")
X_train_scaled.head()
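
What `StandardScaler` computes can be reproduced by hand. The sketch below uses invented values for two features on very different scales (loosely resembling cholesterol and ST depression). The `_demo` names are illustrative and are not part of the notebook's pipeline.

```python
import numpy as np

# Made-up values: column 0 spans hundreds, column 1 spans single digits.
X_train_demo = np.array([[200.0, 0.5], [240.0, 1.5], [280.0, 2.5], [320.0, 3.5]])
X_test_demo = np.array([[260.0, 2.0]])

# Learn the statistics from the training data only...
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)   # population std (ddof=0), as StandardScaler uses

# ...then apply the same transformation to both sets.
X_train_demo_scaled = (X_train_demo - mean) / std
X_test_demo_scaled = (X_test_demo - mean) / std

print("train mean after scaling:", X_train_demo_scaled.mean(axis=0))
print("train std after scaling: ", X_train_demo_scaled.std(axis=0))
print("scaled test row:", X_test_demo_scaled)
```

Reusing the training mean and standard deviation for the test set is what `scaler.transform(X_test)` does; refitting the scaler on the test set would leak information about it into the evaluation.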


## 🤖 SVM Model Training

### Understanding SVM Kernels

**SVM Kernel Functions:**
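- **Linear**: No feature transformation; fastest to train and suited to data that is close to linearly separable
- **RBF (Radial Basis Function)**: The default kernel for scikit-learn's `SVC`; handles non-linear patterns by measuring similarity through distance
- **Polynomial**: Models interactions between features up to a chosen degree
- **Sigmoid**: Applies a tanh function to the dot product, resembling a neural-network activation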
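
The four kernels can be written as plain functions of two sample vectors. This is a sketch of the formulas as scikit-learn documents them for `SVC`; the default values for `gamma`, `degree`, and `coef0` below are arbitrary choices for the demonstration.

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = <x, z>"""
    return float(np.dot(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    diff = np.asarray(x) - np.asarray(z)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def polynomial_kernel(x, z, degree=3, gamma=0.5, coef0=1.0):
    """K(x, z) = (gamma * <x, z> + coef0) ** degree"""
    return float((gamma * np.dot(x, z) + coef0) ** degree)

def sigmoid_kernel(x, z, gamma=0.5, coef0=0.0):
    """K(x, z) = tanh(gamma * <x, z> + coef0)"""
    return float(np.tanh(gamma * np.dot(x, z) + coef0))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])

print("linear:    ", linear_kernel(x, z))
print("rbf:       ", rbf_kernel(x, z))
print("polynomial:", polynomial_kernel(x, z))
print("sigmoid:   ", sigmoid_kernel(x, z))
```

Each kernel returns a similarity score for a pair of samples. The SVM only ever needs these pairwise scores, which is why it can fit non-linear boundaries without explicitly building the transformed features.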

### Model 1: Linear SVM


In [None]:
# Train Linear SVM model
print("🤖 Training Linear SVM Model...")
print("="*50)

# Initialize and train the model
svm_linear = SVC(kernel='linear', random_state=42, probability=True)
svm_linear.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred_linear = svm_linear.predict(X_train_scaled)
y_test_pred_linear = svm_linear.predict(X_test_scaled)

# Predict probabilities
y_train_pred_proba_linear = svm_linear.predict_proba(X_train_scaled)[:, 1]
y_test_pred_proba_linear = svm_linear.predict_proba(X_test_scaled)[:, 1]

print("✅ Linear SVM training completed!")


### Model 2: RBF SVM (Non-linear)


In [None]:
# Train RBF SVM model
print("🤖 Training RBF SVM Model...")
print("="*50)

# Initialize and train the model
svm_rbf = SVC(kernel='rbf', random_state=42, probability=True)
svm_rbf.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred_rbf = svm_rbf.predict(X_train_scaled)
y_test_pred_rbf = svm_rbf.predict(X_test_scaled)

# Predict probabilities
y_train_pred_proba_rbf = svm_rbf.predict_proba(X_train_scaled)[:, 1]
y_test_pred_proba_rbf = svm_rbf.predict_proba(X_test_scaled)[:, 1]

print("✅ RBF SVM training completed!")
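
Setting `probability=True` makes `SVC` fit an additional calibration step using internal cross-validation, which slows training, and the resulting probabilities can occasionally disagree with `predict`. When only a ranking score is needed (for ROC-AUC, for example), `decision_function` is a possible alternative. The sketch below uses synthetic data and illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (features are already on a similar scale).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# No probability=True: training skips the extra internal cross-validation.
rbf_demo = SVC(kernel="rbf").fit(X_tr, y_tr)

# Signed scores: positive values favor class 1, negative values class 0.
scores = rbf_demo.decision_function(X_te)
auc_demo = roc_auc_score(y_te, scores)
print(f"ROC-AUC from decision_function: {auc_demo:.3f}")
```

ROC-AUC depends only on how the samples are ranked, so uncalibrated decision scores are sufficient for it. Probabilities are still the right choice when the output must be read as a risk estimate.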


## 📊 Model Evaluation

### Step 1: Performance Comparison


In [None]:
# Calculate metrics for both models
def calculate_metrics(y_true, y_pred, y_pred_proba, model_name):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_pred_proba)
    
    return {
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': auc
    }

# Calculate metrics for all models
linear_metrics = calculate_metrics(y_test, y_test_pred_linear, y_test_pred_proba_linear, 'Linear SVM')
rbf_metrics = calculate_metrics(y_test, y_test_pred_rbf, y_test_pred_proba_rbf, 'RBF SVM')

# Create comparison DataFrame
comparison_df = pd.DataFrame([linear_metrics, rbf_metrics])

print("📊 SVM MODEL COMPARISON - Performance Metrics")
print("="*70)
print(comparison_df.to_string(index=False))
print("="*70)


### Step 2: Confusion Matrix Comparison


In [None]:
# Confusion Matrices for both models
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Linear SVM Confusion Matrix
cm_linear = confusion_matrix(y_test, y_test_pred_linear)
sns.heatmap(cm_linear, annot=True, fmt='d', cmap='Blues', cbar=True,
            xticklabels=['No Disease', 'Disease'],
            yticklabels=['No Disease', 'Disease'], ax=axes[0])
axes[0].set_title('Confusion Matrix - Linear SVM', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Actual Label', fontsize=12)
axes[0].set_xlabel('Predicted Label', fontsize=12)

# RBF SVM Confusion Matrix
cm_rbf = confusion_matrix(y_test, y_test_pred_rbf)
sns.heatmap(cm_rbf, annot=True, fmt='d', cmap='Greens', cbar=True,
            xticklabels=['No Disease', 'Disease'],
            yticklabels=['No Disease', 'Disease'], ax=axes[1])
axes[1].set_title('Confusion Matrix - RBF SVM', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Actual Label', fontsize=12)
axes[1].set_xlabel('Predicted Label', fontsize=12)

plt.tight_layout()
plt.show()

# Print confusion matrix breakdown for best model
best_model_name = comparison_df.loc[comparison_df['Accuracy'].idxmax(), 'Model']
if best_model_name == 'Linear SVM':
    cm_best = cm_linear
else:
    cm_best = cm_rbf

TN, FP, FN, TP = cm_best.ravel()
print(f"\n📊 Best Model ({best_model_name}) - Confusion Matrix Breakdown:")
print("="*50)
print(f"True Negatives (TN): {TN} - Correctly predicted No Disease")
print(f"False Positives (FP): {FP} - Incorrectly predicted Disease")
print(f"False Negatives (FN): {FN} - Incorrectly predicted No Disease")
print(f"True Positives (TP): {TP} - Correctly predicted Disease")
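
The four confusion-matrix counts are enough to derive the headline metrics by hand. The sketch below uses invented counts, not results from this notebook, and the helper name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only.
example = metrics_from_counts(tn=24, fp=4, fn=3, tp=30)
for name, value in example.items():
    print(f"{name:>9}: {value:.4f}")
```

In a screening setting, a false negative (a missed case of disease) is usually costlier than a false positive, so recall deserves at least as much attention as accuracy when choosing the "best" model.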


## 🎛️ Hyperparameter Tuning


In [None]:
# Grid Search for best hyperparameters
print("🔍 Performing Grid Search for SVM Hyperparameter Tuning...")
print("="*70)

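# gamma only affects the RBF kernel, so it is searched only there;
# a single flat grid would refit identical linear models for every gamma value
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100],
     'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]}
]
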
grid_search = GridSearchCV(SVC(random_state=42, probability=True),
                          param_grid,
                          cv=5,
                          scoring='accuracy',
                          n_jobs=-1)

grid_search.fit(X_train_scaled, y_train)

print(f"\n✅ Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate best model
best_svm = grid_search.best_estimator_
y_test_pred_best = best_svm.predict(X_test_scaled)
y_test_pred_proba_best = best_svm.predict_proba(X_test_scaled)[:, 1]

best_accuracy = accuracy_score(y_test, y_test_pred_best)
best_auc = roc_auc_score(y_test, y_test_pred_proba_best)
best_f1 = f1_score(y_test, y_test_pred_best)

print(f"\n📊 Best SVM Model Performance on Test Set:")
print(f"Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")
print(f"ROC-AUC: {best_auc:.4f}")
print(f"F1-Score: {best_f1:.4f}")
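
When cross-validating on data that was scaled beforehand, each validation fold has already influenced the scaler's statistics. One way to avoid this is to put the scaler inside a `Pipeline`, so it is refit on the training part of every fold. The sketch below assumes synthetic data; the step names `scaler` and `svc` are arbitrary labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameters are addressed as <step name>__<parameter>.
param_grid_demo = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
]

search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)   # raw features: the scaler is refit inside every fold

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy:    {search.score(X_te, y_te):.4f}")
```

The fitted `search` object applies scaling and prediction together, so it can be called on unscaled test data directly.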


## 🆚 Comparison with Logistic Regression


In [None]:
# Train Logistic Regression for comparison
print("🔄 Training Logistic Regression for Comparison...")
print("="*50)

log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

y_test_pred_lr = log_reg.predict(X_test_scaled)
y_test_pred_proba_lr = log_reg.predict_proba(X_test_scaled)[:, 1]

# Calculate metrics for Logistic Regression
lr_metrics = calculate_metrics(y_test, y_test_pred_lr, y_test_pred_proba_lr, 'Logistic Regression')

# Add to comparison
final_comparison = pd.DataFrame([linear_metrics, rbf_metrics, lr_metrics])

print("\n📊 FINAL MODEL COMPARISON - SVM vs Logistic Regression")
print("="*80)
print(final_comparison.to_string(index=False))
print("="*80)

# Visual comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
x = np.arange(len(metrics))
width = 0.25

fig, ax = plt.subplots(figsize=(14, 8))
# A row sliced from a mixed-type DataFrame has object dtype, so cast to float for plotting
ax.bar(x - width, final_comparison.iloc[0, 1:].astype(float), width, label='Linear SVM', color='skyblue')
ax.bar(x, final_comparison.iloc[1, 1:].astype(float), width, label='RBF SVM', color='lightcoral')
ax.bar(x + width, final_comparison.iloc[2, 1:].astype(float), width, label='Logistic Regression', color='lightgreen')

ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Model Comparison - SVM vs Logistic Regression', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()
ax.grid(alpha=0.3, axis='y')
ax.set_ylim([0, 1.1])

# Add value labels on bars
for i, model_idx in enumerate([0, 1, 2]):
    offset = (i - 1) * width
    for j, metric in enumerate(metrics):
        value = final_comparison.iloc[model_idx, j+1]
        ax.text(j + offset, value + 0.02, f'{value:.3f}', 
               ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.show()
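
A comparison based on a single train/test split can change with the random seed, especially on a dataset of a few hundred rows. Cross-validated scores give a rough sense of that variability. The sketch below assumes synthetic data and wraps each model in a pipeline so scaling happens inside each fold; the names `candidates` and `cv_summary` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Linear SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "RBF SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

cv_summary = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name:<20} accuracy = {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```

If the mean scores of two models differ by less than their fold-to-fold standard deviation, the data does not clearly favor either one.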


## 📝 Summary and Key Takeaways

### 🎯 Model Performance Summary:
- **Linear SVM**: Performs well when the classes are close to linearly separable
- **RBF SVM**: Can capture non-linear patterns through the kernel trick
- **Best Model**: The model with the highest test accuracy, reported as `best_model_name` in the confusion matrix cell

### 💡 Key Learnings:

1. **SVM is powerful for classification**
   - Finds optimal decision boundary with maximum margin
   - Handles non-linear data with kernel trick
   - The soft margin (controlled by `C`) tolerates some misclassified or outlying points

2. **Kernel selection is crucial**
   - Linear: Fast, good for linearly separable data
   - RBF: Flexible, handles complex patterns
   - Polynomial: Good for polynomial relationships

3. **Feature scaling is mandatory for SVM**
   - SVM is sensitive to feature scales
   - Use StandardScaler or MinMaxScaler, fitted on the training data only
   - Kernels rely on distances or dot products between samples, so unscaled features with large ranges dominate

4. **Hyperparameter tuning improves performance**
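   - `C` controls regularization: smaller values give a wider margin and stronger regularization
   - `gamma` sets the RBF kernel width: larger values make each sample's influence more local and raise the risk of overfitting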
   - Use GridSearchCV for systematic optimization
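
The effect of `gamma` can be observed directly by training the same RBF SVM with a small, moderate, and large value. The sketch below assumes scikit-learn's synthetic `make_moons` data rather than the heart disease dataset; the exact numbers depend on the data, so none are quoted here.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with a curved boundary.
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

results = []
for gamma in (0.01, 1.0, 100.0):
    model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    results.append((
        gamma,
        model.score(X_tr, y_tr),        # training accuracy
        model.score(X_te, y_te),        # test accuracy
        int(model.n_support_.sum()),    # total number of support vectors
    ))

print(f"{'gamma':>8} {'train acc':>10} {'test acc':>10} {'n_SV':>6}")
for gamma, train_acc, test_acc, n_sv in results:
    print(f"{gamma:>8} {train_acc:>10.3f} {test_acc:>10.3f} {n_sv:>6}")
```

A large gap between training and test accuracy at high `gamma` is the typical sign of overfitting; the same loop can be repeated over `C` to see how the regularization strength changes the picture.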

### 🔄 Linear vs Logistic vs SVM - Final Comparison:

| When to Use | Linear Regression | Logistic Regression | SVM |
|-------------|------------------|---------------------|-----|
| **Task Type** | Regression (predict numbers) | Simple binary classification | Complex classification |
| **Data Size** | Any size | Small to large | Small to medium |
| **Non-linearity** | No | No | Yes (with kernels) |
| **Interpretability** | High | High | Medium |
| **Training Speed** | Fast | Fast | Slower |
| **Examples** | House prices, sales | Spam detection, simple diagnosis | Text classification, image recognition |

### 🚀 Next Steps:
1. Try other SVM kernels (polynomial, sigmoid)
2. Implement multi-class SVM classification
3. Explore ensemble methods combining SVM with other algorithms
4. Deploy the model as a web application
5. Add feature selection techniques
6. Try other advanced algorithms (Random Forest, XGBoost, Neural Networks)

---

## 📚 Additional Resources:

- [Scikit-Learn SVM Documentation](https://scikit-learn.org/stable/modules/svm.html)
- [StatQuest: Support Vector Machines](https://www.youtube.com/watch?v=efR1C6CvhmE)
- [Understanding SVM Kernels](https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d)

---

**✅ Project Complete! You've successfully implemented Support Vector Machine for Heart Disease Prediction!**
