# Heart_Disease_Prediction&Algorithm_Comparison-SVM-KNN

### Heart disease prediction using SVM and KNN algorithms

This project uses the UCI Heart Disease dataset to build Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) classifiers and compare how well they predict heart disease.

# Import The Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/itsluckysharma01/Datasets/refs/heads/main/heart_disease_uci.csv')

In [3]:
df

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
915,916,54,Female,VA Long Beach,asymptomatic,127.0,333.0,True,st-t abnormality,154.0,False,0.0,,,,1
916,917,62,Male,VA Long Beach,typical angina,,139.0,False,st-t abnormality,,,,,,,0
917,918,55,Male,VA Long Beach,asymptomatic,122.0,223.0,True,st-t abnormality,100.0,False,0.0,,,fixed defect,2
918,919,58,Male,VA Long Beach,asymptomatic,,385.0,True,lv hypertrophy,,,,,,,0


# Dataset Column Descriptions

## Feature Variables

| Column | Description |
|--------|-------------|
| **id** | Unique identifier for each patient record |
| **age** | Patient's age in years |
| **sex** | Patient's gender (Male/Female) |
| **dataset** | Source site of the record (e.g., Cleveland, VA Long Beach) |
| **cp** | Chest pain type |
| **trestbps** | Resting blood pressure (in mm Hg) |
| **chol** | Serum cholesterol level (in mg/dl) |
| **fbs** | Fasting blood sugar > 120 mg/dl (True/False) |
| **restecg** | Resting electrocardiographic results |
| **thalch** | Maximum heart rate achieved during exercise |
| **exang** | Exercise induced angina (True/False) |
| **oldpeak** | ST depression induced by exercise relative to rest |
| **slope** | Slope of the peak exercise ST segment |
| **ca** | Number of major vessels (0-3) colored by fluoroscopy |
| **thal** | Thallium stress test result (normal, fixed defect, reversable defect); often mislabeled as thalassemia |

## Categorical Variable Details

### Chest Pain Type (cp):
- **typical angina**: Classic chest pain related to heart disease
- **atypical angina**: Chest pain not typical of heart disease
- **non-anginal**: Chest pain not related to heart disease
- **asymptomatic**: No chest pain symptoms

### Resting ECG Results (restecg):
- **normal**: Normal electrocardiogram
- **st-t abnormality**: Abnormal ST-T wave patterns
- **lv hypertrophy**: Left ventricular hypertrophy (enlarged heart muscle)

### ST Segment Slope (slope):
- **upsloping**: Upward sloping ST segment
- **flat**: Flat ST segment
- **downsloping**: Downward sloping ST segment

### Thallium Stress Test Result (thal):
- **normal**: Normal blood flow
- **fixed defect**: Perfusion defect present both at rest and under stress
- **reversable defect**: Perfusion defect present under stress but not at rest
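
SVM and KNN both operate on numeric feature vectors, so the text categories listed above would need a numeric encoding before modeling. A minimal sketch of one common option, one-hot encoding with `pd.get_dummies`, follows. The `sample` frame is a hypothetical four-row stand-in rather than rows from the dataset, and this is not necessarily the encoding used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample that mirrors three of the categorical columns described above.
sample = pd.DataFrame({
    "cp": ["typical angina", "asymptomatic", "non-anginal", "atypical angina"],
    "restecg": ["lv hypertrophy", "normal", "st-t abnormality", "normal"],
    "slope": ["downsloping", "flat", "upsloping", "flat"],
})

# One-hot encode: every category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(sample, columns=["cp", "restecg", "slope"], dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

One-hot encoding avoids implying an order between categories, which matters for distance-based models such as KNN.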

## Target Variable

| Column | Description |
|--------|-------------|
| **num** | **Target variable** - Presence of heart disease |

**Target Values:**
- **0**: No heart disease
- **1-4**: Varying degrees of heart disease presence (1 = mild, 4 = severe)

---

**Note:** The `num` column is your target variable for prediction in this heart disease classification task.
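
Because `num` takes the values 0 to 4, it can be modeled as a five-class problem or collapsed into a binary "disease present" label. The sketch below shows the binary option on a hypothetical series of target values; whether to binarize is a modeling choice and is an assumption here, not something this notebook prescribes.

```python
import pandas as pd

# Hypothetical target values on the 0-4 scale described above.
num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# Collapse to binary: 0 stays 0 (no disease), 1-4 become 1 (disease present).
has_disease = (num > 0).astype(int)

print(has_disease.tolist())
print(has_disease.value_counts().to_dict())
```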

# Data Analysis

In [4]:
df.isnull().sum()

id            0
age           0
sex           0
dataset       0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
num           0
dtype: int64

In [5]:
df.shape

(920, 16)
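
With 920 rows, the raw counts above translate to roughly 66% missing for `ca`, 53% for `thal`, and 34% for `slope`. A small helper can report counts and percentages together. The sketch below is illustrative: `missing_summary` and the `toy` frame are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({
    "age": [63, 67, 54, 62],
    "trestbps": [145.0, 160.0, 127.0, np.nan],
    "ca": [0.0, np.nan, np.nan, np.nan],
})

summary = missing_summary(toy)
print(summary)
```

On the real data this would be called as `missing_summary(df)`.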


# Handling Missing Values

## Numerical Data

In [7]:
df['trestbps'] = df['trestbps'].fillna(df['trestbps'].mean())

In [8]:
df.isnull().sum()

id            0
age           0
sex           0
dataset       0
cp            0
trestbps      0
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
num           0
dtype: int64

In [9]:
df['chol'] = df['chol'].fillna(df['chol'].mean())

In [10]:
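df['thalch'] = df['thalch'].fillna(df['thalch'].mean())

In [11]:
df['oldpeak'] = df['oldpeak'].fillna(df['oldpeak'].mean())

In [12]:
# ca is a small discrete count (0-3), so fill with the most frequent value rather than the mean
df['ca'] = df['ca'].fillna(df['ca'].mode()[0])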

In [13]:
df

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.000000,233.0,True,lv hypertrophy,150.000000,False,2.300000,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.000000,286.0,False,lv hypertrophy,108.000000,True,1.500000,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.000000,229.0,False,lv hypertrophy,129.000000,True,2.600000,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.000000,250.0,False,normal,187.000000,False,3.500000,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.000000,204.0,False,lv hypertrophy,172.000000,False,1.400000,upsloping,0.0,normal,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
915,916,54,Female,VA Long Beach,asymptomatic,127.000000,333.0,True,st-t abnormality,154.000000,False,0.000000,,0.0,,1
916,917,62,Male,VA Long Beach,typical angina,132.132404,139.0,False,st-t abnormality,137.545665,,0.878788,,0.0,,0
917,918,55,Male,VA Long Beach,asymptomatic,122.000000,223.0,True,st-t abnormality,100.000000,False,0.000000,,0.0,fixed defect,2
918,919,58,Male,VA Long Beach,asymptomatic,132.132404,385.0,True,lv hypertrophy,137.545665,,0.878788,,0.0,,0


In [14]:
df.isnull().sum()


id            0
age           0
sex           0
dataset       0
cp            0
trestbps      0
chol          0
fbs          90
restecg       2
thalch        0
exang        55
oldpeak       0
slope       309
ca            0
thal        486
num           0
dtype: int64
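
The per-column cells above can also be collapsed into a single `fillna` call by passing a dict of fill values. The sketch below assumes the same choices (mean for continuous columns, mode for `ca`) and runs on a hypothetical toy frame; `mean_cols`, `mode_cols`, and `fill_values` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column.
toy = pd.DataFrame({
    "trestbps": [145.0, np.nan, 120.0, 130.0],
    "chol": [233.0, 286.0, np.nan, 250.0],
    "ca": [0.0, 3.0, 0.0, np.nan],
})

mean_cols = ["trestbps", "chol"]  # continuous measurements
mode_cols = ["ca"]                # discrete count

# Build one {column: fill value} mapping, then fill every column in one pass.
fill_values = {col: toy[col].mean() for col in mean_cols}
fill_values.update({col: toy[col].mode()[0] for col in mode_cols})

filled = toy.fillna(fill_values)
print(filled)
```

Keeping the fill values in a dict also makes it easy to reuse the same values on new data later.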

## Categorical Data

In [16]:
categorical_cols = ['fbs', 'exang', 'slope', 'thal', 'restecg']

for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        # mode() ignores NaN, so this is the most frequent observed category
        mode_value = df[col].mode()[0]
        df[col] = df[col].fillna(mode_value)
        print(f"Filled {col} missing values with mode: {mode_value}")

# Final check for missing values
print("\n" + "="*50)
print("Final missing values check:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

Filled restecg missing values with mode: normal

Final missing values check:
id          0
age         0
sex         0
dataset     0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalch      0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
num         0
dtype: int64

Total missing values: 0


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  920 non-null    float64
 6   chol      920 non-null    float64
 7   fbs       920 non-null    bool   
 8   restecg   920 non-null    object 
 9   thalch    920 non-null    float64
 10  exang     920 non-null    bool   
 11  oldpeak   920 non-null    float64
 12  slope     920 non-null    object 
 13  ca        920 non-null    float64
 14  thal      920 non-null    object 
 15  num       920 non-null    int64  
dtypes: bool(2), float64(5), int64(3), object(6)
memory usage: 102.6+ KB
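
The dtype summary above suggests a natural split into numeric, boolean, and text columns, which is useful when different column groups need different preprocessing. A minimal sketch using `select_dtypes` on a hypothetical two-row frame follows; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical frame with the same mix of dtypes as df.info() reports.
toy = pd.DataFrame({
    "age": [63, 67],
    "sex": ["Male", "Female"],
    "trestbps": [145.0, 160.0],
    "fbs": [True, False],
    "num": [0, 2],
})

# "number" covers int and float columns but not bool.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
bool_cols = toy.select_dtypes(include="bool").columns.tolist()
# Everything that is neither numeric nor bool is treated as text here.
text_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_cols, bool_cols, text_cols)
```

Note that the numeric group includes `num`, so on the real data the target (and `id`) would need to be dropped from the feature list separately.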
