# Kernelized SVMs

### Metadata/Sources

The data was collected in 2019 and retrieved from Kaggle.

Dependent variable (DV): Cardio = presence or absence of cardiovascular disease (CVD)

Independent variables (IV): Age, Height, Weight, Gender, SBP, DBP, Cholesterol, Glucose, Alcohol, Active, and Smoke

SBP = systolic blood pressure

DBP = diastolic blood pressure

The dataset contains 4221 observations (individuals) and 12 variables.

### Summary of the data

| Feature | Feature type | Column | Data type / coding |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |


### Introduction

This study investigates cardiovascular risk factors in adults in the United States. The data comes from the Kaggle dataset entitled "Cardiovascular" and combines measurements taken during a medical examination with answers the patients gave in a survey. The original dataset has 70,000 observations and 12 variables. For this assignment it was preprocessed and reduced to 4221 observations and 12 variables.

### Objective

The purpose of this study is to predict the status of cardiovascular disease in people.

### Load and preprocess data

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('cardio.csv')

In [3]:
pd.set_option('display.float_format', lambda x: '%.1f' % x)

In [4]:
data.head()

Unnamed: 0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age.1
0,18393,2,168,62.0,110,80,1,1,0,0,1,0,50.4
1,20228,1,156,85.0,140,90,3,1,0,0,1,1,55.4
2,18857,1,165,64.0,130,70,3,1,0,0,0,1,51.7
3,17623,2,169,82.0,150,100,1,1,0,0,1,1,48.3
4,17474,1,156,56.0,100,60,1,1,0,0,0,0,47.9


In [5]:
# Drop the original age column (in days); age.1 appears to hold the same age in years
data = data.drop('age', axis =1)
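
The notebook does not show how `age.1` was produced. A minimal sketch, assuming it is simply `age` in days divided by 365 and rounded to one decimal (the frame `demo` and the column name `age_years` are hypothetical, used only for illustration):

```python
import pandas as pd

# Hypothetical mini-frame holding the five `age` values printed by data.head().
demo = pd.DataFrame({"age": [18393, 20228, 18857, 17623, 17474]})

# Assumption: years = days / 365, rounded to one decimal place.
demo["age_years"] = (demo["age"] / 365).round(1)
print(demo)
```

Under this assumption the five values agree with the `age.1` column displayed above, but the original conversion may have been done differently.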

### Set the variables

In [6]:
X = data[['age.1', 'gender', 'ap_hi', 'ap_lo', 'cholesterol']].values
y = data[['cardio']].values.ravel()

In [7]:
from sklearn import preprocessing

In [8]:
X = preprocessing.scale(X)
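
Scaling the full matrix before cross-validation lets each held-out fold influence the scaling statistics. One common alternative, sketched here on synthetic data rather than the cardio file, is to put the scaler inside a pipeline so it is refit on each training fold. The names `X_demo`, `y_demo`, and `pipe` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with five predictors (not the cardio data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)

# The scaler is fitted on the training part of each fold only.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1, gamma=0.1))
scores = cross_val_score(pipe, X_demo, y_demo, scoring="accuracy", cv=5)
print(scores.mean())
```

With a dataset of this size the difference from scaling up front is usually small, so the results below are unlikely to change much.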

# SVM

In [9]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

In [10]:
# gamma only affects the rbf and sigmoid kernels; the linear kernel ignores it
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001, 10]}

### Linear SVM: Specifying and fitting model

In [11]:
svm_linear = GridSearchCV(svm.SVC(kernel = 'linear'),param_grid, verbose =1)
svm_linear.fit(X,y)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(estimator=SVC(kernel='linear'),
             param_grid={'C': [0.1, 1, 10, 100],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001, 10]},
             verbose=1)

### Best parameters

In [12]:
svm_linear.best_params_

{'C': 1, 'gamma': 1}
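
The reported `gamma` for the linear kernel is not meaningful, because a linear kernel is a plain dot product and does not use `gamma`. A small check on synthetic data (the names `X_demo`, `dec_a`, and `dec_b` are illustrative) compares the decision values of two linear SVMs that differ only in `gamma`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (not the cardio data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Two linear SVMs that differ only in gamma.
dec_a = SVC(kernel="linear", C=1, gamma=0.0001).fit(X_demo, y_demo).decision_function(X_demo)
dec_b = SVC(kernel="linear", C=1, gamma=10).fit(X_demo, y_demo).decision_function(X_demo)

# Compare the two sets of decision values.
same = np.allclose(dec_a, dec_b)
print(same)
```

This is consistent with the grid-search scores for the linear kernel being identical across all six `gamma` values for a given `C`.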

### Linear SVM: Cross validation

In [13]:
svm_linear_accuracy = cross_val_score(svm_linear, X, y, scoring = 'accuracy', cv = 5).mean()

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits


In [14]:
svm_linear_accuracy

0.7223346139824448
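
Passing a `GridSearchCV` object to `cross_val_score` performs nested cross-validation: the grid search is rerun inside each of the five outer folds, which is why the "Fitting 5 folds" message appears five times. A sketch that makes the two loops explicit, on synthetic data and with a smaller, hypothetical grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (not the cardio data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Inner loop selects C; outer loop estimates accuracy of the whole tuning procedure.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=inner_cv)
nested_scores = cross_val_score(search, X_demo, y_demo, scoring="accuracy", cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```

The nested estimate describes the tuning procedure as a whole, so it can differ slightly from the score of a single model refitted with the selected parameters.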

### Linear SVM with C=1 and gamma = 1

In [15]:
svm_linear2 = SVC(kernel ='linear', C=1, gamma =1).fit(X,y)
svm_linear2_ac = cross_val_score(svm_linear2, X, y, scoring ='accuracy', cv = 5).mean()
svm_linear2_ac

0.7225715808070893

### Accuracy statistics

In [16]:
svm_linear.cv_results_['mean_test_score']

array([0.72233461, 0.72233461, 0.72233461, 0.72233461, 0.72233461,
       0.72233461, 0.72257158, 0.72257158, 0.72257158, 0.72257158,
       0.72257158, 0.72257158, 0.72257158, 0.72257158, 0.72257158,
       0.72257158, 0.72257158, 0.72257158, 0.72233461, 0.72233461,
       0.72233461, 0.72233461, 0.72233461, 0.72233461])

### SD of accuracy statistics

In [17]:
svm_linear.cv_results_['std_test_score']

array([0.01782803, 0.01782803, 0.01782803, 0.01782803, 0.01782803,
       0.01782803, 0.01774515, 0.01774515, 0.01774515, 0.01774515,
       0.01774515, 0.01774515, 0.01774515, 0.01774515, 0.01774515,
       0.01774515, 0.01774515, 0.01774515, 0.01781227, 0.01781227,
       0.01781227, 0.01781227, 0.01781227, 0.01781227])

### Sigmoid SVM

In [18]:
svm_sigmoid = GridSearchCV(svm.SVC(kernel='sigmoid'), param_grid, verbose =1)
svm_sigmoid.fit(X,y)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(estimator=SVC(kernel='sigmoid'),
             param_grid={'C': [0.1, 1, 10, 100],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001, 10]},
             verbose=1)

### Best parameters

In [19]:
svm_sigmoid.best_params_

{'C': 100, 'gamma': 0.001}

In [20]:
svm_sigmoid.best_estimator_

SVC(C=100, gamma=0.001, kernel='sigmoid')

In [21]:
svm_sigmoid_accuracy = cross_val_score(svm_sigmoid, X,y, scoring ='accuracy', cv =5).mean()

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits


In [22]:
svm_sigmoid_accuracy

0.7223346139824448

### svm_sigmoid with SVC(C=100, gamma =0.001)

In [23]:
svm_sigmoid2 = SVC(kernel ='sigmoid', C=100, gamma =0.001).fit(X,y)
svm_sigmoid2_ac = cross_val_score(svm_sigmoid2, X, y, scoring ='accuracy', cv = 5).mean()
svm_sigmoid2_ac

0.7223346139824448

### Accuracy statistics

In [24]:
svm_sigmoid.cv_results_['mean_test_score']

array([0.56929331, 0.71096049, 0.68987717, 0.50225076, 0.50153986,
       0.56550072, 0.56858128, 0.65600409, 0.716411  , 0.69011414,
       0.50225076, 0.56692196, 0.57047758, 0.61359488, 0.71925405,
       0.71617404, 0.68987717, 0.56739561, 0.57047758, 0.60743347,
       0.68087636, 0.72233461, 0.71617404, 0.56763229])

### SD of accuracy statistics

In [25]:
svm_sigmoid.cv_results_['std_test_score']

array([0.00991581, 0.02136152, 0.01614395, 0.00078618, 0.00047372,
       0.01601622, 0.01090424, 0.00972623, 0.01894007, 0.01647352,
       0.00078618, 0.01672767, 0.00906959, 0.01547276, 0.0188247 ,
       0.01895564, 0.01647107, 0.01679506, 0.00906959, 0.01723219,
       0.01253518, 0.01782803, 0.01895564, 0.01712456])
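
The raw arrays above are hard to read because they do not show which `C` and `gamma` each score belongs to. One way to pair them, sketched on synthetic data with a smaller hypothetical grid, is to load `cv_results_` into a DataFrame:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data and a reduced grid (assumptions for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
grid = {"C": [0.1, 1], "gamma": [0.1, 0.01, 0.001]}
search = GridSearchCV(SVC(kernel="sigmoid"), grid, cv=5).fit(X_demo, y_demo)

# Keep the parameter columns next to the mean and SD of the fold accuracies.
cols = ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[cols]
results = results.sort_values("rank_test_score").reset_index(drop=True)
print(results)
```

Applied to `svm_sigmoid.cv_results_`, the same column selection would list all 24 candidates with their parameters.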

### rbf SVM

In [26]:
svm_rbf = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, verbose = 1)
svm_rbf.fit(X,y)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001, 10]},
             verbose=1)

### svm_rbf: Cross validation

In [27]:
svm_rbf_accuracy = cross_val_score(svm_rbf, X, y, scoring ='accuracy', cv = 5).mean()
svm_rbf_accuracy

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits


0.7263605260943942

### Best parameters 

In [28]:
svm_rbf.best_params_

{'C': 1, 'gamma': 0.1}

In [29]:
svm_rbf.best_estimator_

SVC(C=1, gamma=0.1)

### rbf SVM with SVC(C=1, gamma=0.1)

In [30]:
# Use a new name so the imported `svm` module is not overwritten
svm_rbf2 = SVC(kernel ='rbf', C=1, gamma =0.1).fit(X,y)
svm_rbf_ac = cross_val_score(svm_rbf2, X, y, scoring ='accuracy', cv = 5).mean()
svm_rbf_ac

0.7296772203370818

### Accuracy statistics

In [31]:
svm_rbf.cv_results_['mean_test_score']

array([0.71949045, 0.72351805, 0.70101125, 0.66808211, 0.50153986,
       0.69627668, 0.7247026 , 0.72967722, 0.72351945, 0.70290726,
       0.66926666, 0.71143722, 0.71877927, 0.72730643, 0.72801929,
       0.72162399, 0.70124877, 0.70006674, 0.71451583, 0.72470092,
       0.72470372, 0.72446647, 0.71806921, 0.68395945])

### SD of accuracy statistics

In [32]:
svm_rbf.cv_results_['std_test_score']

array([0.01716771, 0.018747  , 0.02024215, 0.01589508, 0.00047372,
       0.0170136 , 0.01634635, 0.01956796, 0.01750618, 0.0189991 ,
       0.0181695 , 0.01511439, 0.0183314 , 0.02211755, 0.01858043,
       0.01932906, 0.01908957, 0.01385218, 0.01740036, 0.02103662,
       0.01729969, 0.01809721, 0.02018155, 0.01190106])

In [33]:
pd.set_option('display.float_format', lambda x: '%.6f' % x)

In [34]:
accuracy_table = pd.DataFrame({'SVM_linear':svm_linear2_ac,
                    'SVM_sigmoid':svm_sigmoid2_ac,
                    'SVM_rbf': svm_rbf_ac},
                   index= ['accuracy'])

In [35]:
accuracy_table

Unnamed: 0,SVM_linear,SVM_sigmoid,SVM_rbf
accuracy,0.722572,0.722335,0.729677


# Classification Models

# KNN

In [36]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

In [37]:
knn = KNeighborsClassifier(n_neighbors = 3)

In [38]:
knn_accuracy = cross_val_score(knn, X,y, scoring='accuracy', cv = 5).mean()
knn_accuracy

0.6704557054320088
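
The value `n_neighbors = 3` is fixed rather than tuned. A sketch of how the number of neighbours could be chosen by grid search, as was done for the SVMs, on synthetic data and with a hypothetical candidate list:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the cardio data).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Candidate values for k are an assumption chosen for illustration.
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 11, 21, 31]},
    scoring="accuracy",
    cv=5,
)
knn_search.fit(X_demo, y_demo)
print(knn_search.best_params_, knn_search.best_score_)
```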

# Decision tree

In [39]:
from sklearn.tree import DecisionTreeClassifier

In [40]:
dtree = DecisionTreeClassifier(criterion='entropy')

In [41]:
dtree_accuracy = cross_val_score(dtree, X,y, scoring='accuracy', cv = 5).mean()
dtree_accuracy

0.6169160660702768

# Naive bayes

### Gaussian Naive Bayes

In [42]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB

In [43]:
nbayes = GaussianNB() 

In [44]:
nbayes_accuracy = cross_val_score(nbayes, X,y, scoring='accuracy', cv = 5).mean()
nbayes_accuracy

0.5823228357497406

### Bernoulli Naive Bayes

In [45]:
nbern = BernoulliNB()

In [46]:
nbern_accuracy = cross_val_score(nbern, X,y, scoring='accuracy', cv = 5).mean()
nbern_accuracy

0.7180728567822989

# Comparison of classifiers by their accuracy

In [47]:
comparison_table = pd.DataFrame({'SVM_rbf':svm_rbf_ac,
                                 'KNN':knn_accuracy,
                                 'Decision Tree': dtree_accuracy,
                                'Gaussian NB':nbayes_accuracy,
                                'Bernoulli NB':nbern_accuracy},
                                index= ['accuracy'])

In [48]:
comparison_table

Unnamed: 0,SVM_rbf,KNN,Decision Tree,Gaussian NB,Bernoulli NB
accuracy,0.729677,0.670456,0.616916,0.582323,0.718073
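
The table reports mean accuracy only. A sketch of a comparison that scores every classifier on the same shuffled folds and also reports the spread across folds, using synthetic data (the names `models` and `summary` are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the cardio data).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# One splitter shared by all models so the folds are identical.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "SVM_rbf": SVC(kernel="rbf", C=1, gamma=0.1),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Gaussian NB": GaussianNB(),
    "Bernoulli NB": BernoulliNB(),
}

# Collect mean and SD of fold accuracies for each model.
summary = pd.DataFrame(
    {
        name: {"mean": s.mean(), "std": s.std()}
        for name, s in (
            (name, cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv))
            for name, model in models.items()
        )
    }
).T
print(summary)
```

Reporting the standard deviation helps judge whether a gap such as the one between the RBF SVM and Bernoulli NB is larger than the fold-to-fold variation.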


# Discussion and conclusion

Support vector machines can handle both classification and regression problems, and their approach differs somewhat from that of the other classifiers used here. The problem in question is treated as non-linear. A kernel method lets a linear classifier solve a non-linear problem by working in a transformed feature space.
In the first analysis, three kernel models are fitted: linear, radial basis function (RBF), and sigmoid.
Each model is fitted with the best C and gamma values found by grid search.
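
As an illustration of what the RBF kernel computes, the sketch below evaluates exp(-gamma * ||x - z||^2) by hand for a few random points and compares the result with scikit-learn's `rbf_kernel`. The array `A` and the value of `gamma` are arbitrary assumptions, not values taken from the cardio data.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Four arbitrary points with five features each.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))
gamma = 0.1

# Pairwise squared Euclidean distances, then the RBF similarity.
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

# Library implementation of the same kernel matrix.
K_sklearn = rbf_kernel(A, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))
```

The SVM only ever sees this matrix of similarities, which is how a linear decision rule in the implicit feature space becomes a non-linear boundary in the original predictors.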

Among the kernels compared, the radial basis function (RBF) kernel performed best, with a mean cross-validated accuracy of 0.729677. It is the most accurate of the kernel models used in this study.

A few other classification methods were fitted to compare their results with those of the kernel models. As the results above show, KNN has a mean cross-validated accuracy of 0.670456, the decision tree 0.616916, Gaussian NB 0.582323, and Bernoulli NB 0.718073.

The Bernoulli NB method came close to the RBF SVM, with mean cross-validated accuracies of 0.718073 and 0.729677, respectively.