# Heart Attack Analysis & Prediction Dataset

In this task you are asked to use `heart-data.csv` to train a support vector machine to predict heart attacks.

See `Data description.docx` or `Data description.pdf` for a description of the dataset.

# Reading Dataset

In [96]:
import pandas as pd

data = pd.read_csv('heart-data.csv', na_values=["?",'UNDEFINED'])

data.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,Male,Non-anginal pain,145,233,High,Hypertrophy,150,No,2.3,Down-sloping,0.0,Fixed defect,1
1,37,Male,Atypical angina,130,250,Low,Normal,187,No,3.5,Down-sloping,0.0,Normal,1
2,41,Female,Typical angina,130,204,Low,Hypertrophy,172,No,1.4,Up-sloping,0.0,Normal,1
3,56,Male,Typical angina,120,236,Low,Normal,178,No,0.8,Up-sloping,0.0,Normal,1
4,57,Female,Asymptomatic,120,354,Low,Normal,163,Yes,0.6,Up-sloping,0.0,Normal,1


# TODO
1. Remove samples with missing data (there are **7 samples** with missing data).
2. Split the data into input and output.
3. Replace categorical values with numeric values (use numeric encoding and one-hot encoding where suitable).
4. Split the dataset into (train - validation - test) by calling `train_test_split` two times:
    - First time: use `test_size=0.20` and `random_state=0`.
    - Second time: use `test_size=0.25` and `random_state=0`.
5. Apply feature scaling using `MinMaxScaler`.
6. Train a support vector machine classifier using suitable hyper-parameter values. 
7. Print the accuracy of both training and validation. Try to achieve **validation accuracy > 82%**.
8. Test your support vector machine and print accuracy of testing.
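
The two `test_size` values in step 4 work out to roughly a 60/20/20 split, because the second call only sees the 80% left over from the first. A minimal sketch of that arithmetic (the variable names are illustrative, not part of the task):

```python
# Fractions of the full dataset produced by two successive splits.
first_test_size = 0.20   # first call: held out as the test set
second_test_size = 0.25  # second call: taken from the remaining 80% as validation

test_frac = first_test_size
val_frac = (1 - first_test_size) * second_test_size
train_frac = 1 - test_frac - val_frac

print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
```

The exact row counts can differ by one from these fractions, since `train_test_split` rounds to whole samples.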

# 1) Removing missing values:-

In [137]:
# Which columns contain at least one missing value?
mask = data.isnull().any(axis=0)
mask

age         False
sex         False
cp          False
trtbps      False
chol        False
fbs         False
restecg     False
thalachh    False
exng        False
oldpeak     False
slp         False
caa          True
thall        True
output      False
dtype: bool

In [171]:
# Which rows contain at least one missing value?
mask = data.isnull().any(axis=1)
mask

0      False
1      False
2      False
3      False
4      False
       ...  
298    False
299    False
300    False
301    False
302    False
Length: 303, dtype: bool

In [142]:
num_of_rows_with_nan = mask.sum()
num_of_total_rows = len(data)
# Fraction of rows that have at least one missing value
print(num_of_rows_with_nan / num_of_total_rows)
# Fraction of missing values per column
data.isnull().sum() / len(data)

0.0231023102310231


age         0.000000
sex         0.000000
cp          0.000000
trtbps      0.000000
chol        0.000000
fbs         0.000000
restecg     0.000000
thalachh    0.000000
exng        0.000000
oldpeak     0.000000
slp         0.000000
caa         0.016502
thall       0.006601
output      0.000000
dtype: float64

In [143]:
data_clean = data[~mask]
data_clean.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,Male,Non-anginal pain,145,233,High,Hypertrophy,150,No,2.3,Down-sloping,0.0,Fixed defect,1
1,37,Male,Atypical angina,130,250,Low,Normal,187,No,3.5,Down-sloping,0.0,Normal,1
2,41,Female,Typical angina,130,204,Low,Hypertrophy,172,No,1.4,Up-sloping,0.0,Normal,1
3,56,Male,Typical angina,120,236,Low,Normal,178,No,0.8,Up-sloping,0.0,Normal,1
4,57,Female,Asymptomatic,120,354,Low,Normal,163,Yes,0.6,Up-sloping,0.0,Normal,1
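
Filtering with a row mask gives the same result as pandas' built-in `dropna`. The sketch below checks this on a small made-up frame; the `toy` values are invented for illustration and are not taken from `heart-data.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the heart data (values are made up).
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "caa": [0.0, np.nan, 0.0, 2.0],
    "thall": ["Normal", "Normal", None, "Fixed defect"],
})

toy_mask = toy.isnull().any(axis=1)   # True for rows with at least one missing value
by_mask = toy[~toy_mask]              # boolean-mask filtering
by_dropna = toy.dropna(axis=0)        # built-in equivalent

print(int(toy_mask.sum()))            # number of rows removed
print(by_mask.equals(by_dropna))      # both approaches keep the same rows
```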


In [148]:
# Additionally drop the two columns that contained missing values
data_without_ms = data_clean.drop(columns=['caa', 'thall'])
data_without_ms.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,output
0,63,Male,Non-anginal pain,145,233,High,Hypertrophy,150,No,2.3,Down-sloping,1
1,37,Male,Atypical angina,130,250,Low,Normal,187,No,3.5,Down-sloping,1
2,41,Female,Typical angina,130,204,Low,Hypertrophy,172,No,1.4,Up-sloping,1
3,56,Male,Typical angina,120,236,Low,Normal,178,No,0.8,Up-sloping,1
4,57,Female,Asymptomatic,120,354,Low,Normal,163,Yes,0.6,Up-sloping,1


# 2) Splitting the data into input and output:-

In [151]:
data_input = data_without_ms.drop(columns=['output'])
data_output = data_without_ms['output']
data_input.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp
0,63,Male,Non-anginal pain,145,233,High,Hypertrophy,150,No,2.3,Down-sloping
1,37,Male,Atypical angina,130,250,Low,Normal,187,No,3.5,Down-sloping
2,41,Female,Typical angina,130,204,Low,Hypertrophy,172,No,1.4,Up-sloping
3,56,Male,Typical angina,120,236,Low,Normal,178,No,0.8,Up-sloping
4,57,Female,Asymptomatic,120,354,Low,Normal,163,Yes,0.6,Up-sloping


# 3) Replacing categorical values:-

## 1) Numeric encoding:-

In [156]:
data_input.dtypes

age           int64
sex          object
cp           object
trtbps        int64
chol          int64
fbs          object
restecg      object
thalachh      int64
exng         object
oldpeak     float64
slp          object
dtype: object

In [160]:
print(data['sex'].unique())
print(data['cp'].unique())
print(data['fbs'].unique())
print(data['restecg'].unique())
print(data['exng'].unique())
print(data['slp'].unique())

['Male' 'Female']
['Non-anginal pain' 'Atypical angina' 'Typical angina' 'Asymptomatic']
['High' 'Low']
['Hypertrophy' 'Normal' 'ST-T wave abnormality']
['No' 'Yes']
['Down-sloping' 'Up-sloping' 'Flat']


In [161]:
# Binary categories are mapped to 0/1; multi-valued ones are one-hot encoded below
data_input_encoded = data_input.replace({
    'sex': {'Male': 0, 'Female': 1},
    'fbs': {'High': 0, 'Low': 1},
    'exng': {'No': 0, 'Yes': 1},
})

In [162]:
data_input_encoded.dtypes

age           int64
sex           int64
cp           object
trtbps        int64
chol          int64
fbs           int64
restecg      object
thalachh      int64
exng          int64
oldpeak     float64
slp          object
dtype: object

## 2) One-hot Encoding:-

In [163]:
# get_dummies encodes every remaining object column (cp, restecg, slp)
one_hot_cols = pd.get_dummies(data_input_encoded, prefix='is')

In [164]:
one_hot_cols.head()

Unnamed: 0,age,sex,trtbps,chol,fbs,thalachh,exng,oldpeak,is_Asymptomatic,is_Atypical angina,is_Non-anginal pain,is_Typical angina,is_Hypertrophy,is_Normal,is_ST-T wave abnormality,is_Down-sloping,is_Flat,is_Up-sloping
0,63,0,145,233,0,150,0,2.3,0,0,1,0,1,0,0,1,0,0
1,37,0,130,250,1,187,0,3.5,0,1,0,0,0,1,0,1,0,0
2,41,1,130,204,1,172,0,1.4,0,0,0,1,1,0,0,0,0,1
3,56,0,120,236,1,178,0,0.8,0,0,0,1,0,1,0,0,0,1
4,57,1,120,354,1,163,1,0.6,1,0,0,0,0,1,0,0,0,1
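
A shared `prefix='is'` works here, but it would produce duplicate column names if two encoded columns had a category in common. As an alternative sketch, leaving `prefix` at its default and listing the columns explicitly keeps the source column in each new name. The `toy_cat` frame below is made up for illustration.

```python
import pandas as pd

# Toy frame with two categorical columns (values are made up).
toy_cat = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": ["Non-anginal pain", "Atypical angina", "Typical angina"],
    "slp": ["Down-sloping", "Down-sloping", "Up-sloping"],
})

# Default prefix: each dummy column is named "<source column>_<category>".
toy_encoded = pd.get_dummies(toy_cat, columns=["cp", "slp"], dtype=int)
print(list(toy_encoded.columns))
```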


# 4) Splitting into (train - validation - test):-

In [177]:
from sklearn.model_selection import train_test_split

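# First split: hold out 20% of the data as the test set
X, X_test, y, y_test = train_test_split(
    one_hot_cols, data_output, test_size=0.20, random_state=0
)

# Second split: take 25% of the remainder as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)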

In [165]:
print(X_train.shape)
print(y_train.shape)
print('---------------------')
print(X_val.shape)
print(y_val.shape)
print('---------------------')
print(X_test.shape)
print(y_test.shape)

(181, 12)
(181,)
---------------------
(61, 12)
(61,)
---------------------
(61, 12)
(61,)
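
The shapes printed above add up to 303 rows and 12 columns, so they may come from an earlier state of the notebook, before the incomplete rows were removed and the one-hot columns were added. As a rough check, the sketch below applies the same two calls to 296 placeholder indices, assuming 303 rows minus the 7 rows with missing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 296  # assumption: 303 rows minus the 7 rows with missing data
indices = np.arange(n_rows)  # placeholder for the real feature rows

rest_idx, test_idx = train_test_split(indices, test_size=0.20, random_state=0)
train_idx, val_idx = train_test_split(rest_idx, test_size=0.25, random_state=0)

print(len(train_idx), len(val_idx), len(test_idx))  # 177 59 60
```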


# 5) Applying feature scaling using MinMaxScaler:-

In [178]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit on the training set only, so validation and test data do not leak into the scaling
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
X_train_scaled[:5]

array([[0.68085106, 0.        , 0.43396226, 0.13975904, 1.        ,
        0.51145038, 1.        , 0.30645161, 1.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 0.        , 1.        ],
       [0.70212766, 0.        , 0.24528302, 0.28433735, 1.        ,
        0.21374046, 1.        , 0.29032258, 1.        , 0.        ,
        0.        , 0.        , 0.        , 1.        , 0.        ,
        0.        , 1.        , 0.        ],
       [0.70212766, 1.        , 0.62264151, 0.03614458, 1.        ,
        0.5648855 , 0.        , 1.        , 1.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.        ,
        1.        , 0.        , 0.        ],
       [0.78723404, 1.        , 0.49056604, 0.31084337, 1.        ,
        0.61832061, 0.        , 0.        , 0.        , 1.        ,
        0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 1.        , 0.        ],
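
`MinMaxScaler` maps each feature to `(x - min) / (max - min)`, using the minimum and maximum seen during `fit`. Because the scaler is fitted on the training set only, validation or test values can fall outside `[0, 1]`. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration.
X_train_toy = np.array([[20.0], [40.0], [60.0]])
X_val_toy = np.array([[30.0], [70.0]])

demo_scaler = MinMaxScaler().fit(X_train_toy)  # learns min=20, max=60
scaled_val = demo_scaler.transform(X_val_toy)

# The same transformation written out by hand.
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)
manual_val = (X_val_toy - col_min) / (col_max - col_min)

print(scaled_val.ravel())                   # 70 lies above the training max, so it maps above 1
print(np.allclose(scaled_val, manual_val))
```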
    

# 6) Training an SVM classifier:-

## 1) Linear SVM:-

In [194]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [201]:
svc = SVC(kernel='linear', random_state=0, C=0.7)
svc.fit(X_train_scaled, y_train)

SVC(C=0.7, kernel='linear', random_state=0)

In [219]:
y_pred_train = svc.predict(X_train_scaled)
y_pred_val = svc.predict(X_val_scaled)

print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_val, y_pred_val))

0.9717514124293786
0.6949152542372882


## 2) Poly SVM:-

In [234]:
svc = SVC(kernel='poly', degree=4, random_state=0, C=0.5)
svc.fit(X_train_scaled, y_train)

y_pred_train = svc.predict(X_train_scaled)
y_pred_val = svc.predict(X_val_scaled)

print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_val, y_pred_val))

0.9096045197740112
0.6949152542372882


## 3) RBF SVM:-

In [235]:
svc = SVC(kernel='rbf', gamma=0.02, random_state=0, C=100)
svc.fit(X_train_scaled, y_train)

y_pred_train = svc.predict(X_train_scaled)
y_pred_val = svc.predict(X_val_scaled)

print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_val, y_pred_val))

0.8926553672316384
0.6610169491525424
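
Instead of trying hyper-parameter values one cell at a time, a small loop can compare several combinations by validation accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart data, and the grid values are arbitrary assumptions rather than recommended settings.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_va_s = demo_scaler.transform(X_tr), demo_scaler.transform(X_va)

C_grid = [0.1, 1, 10, 100]   # assumed candidate values
gamma_grid = [0.01, 0.1, 1]  # assumed candidate values

best = None  # (validation accuracy, C, gamma, fitted model)
for C, gamma in product(C_grid, gamma_grid):
    model = SVC(kernel="rbf", C=C, gamma=gamma, random_state=0).fit(X_tr_s, y_tr)
    val_acc = accuracy_score(y_va, model.predict(X_va_s))
    if best is None or val_acc > best[0]:
        best = (val_acc, C, gamma, model)

best_val_acc, best_C, best_gamma, best_model = best
print(f"best C={best_C}, gamma={best_gamma}, validation accuracy={best_val_acc:.3f}")
```

Keeping the fitted `best_model` makes it possible to evaluate the selected configuration once on the test set, rather than whichever model happened to be trained last.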


# 8) Testing:- 

In [236]:
# `svc` is the most recently fitted model, i.e. the RBF SVM from the previous cell
y_pred_test = svc.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred_test))

0.85
