# Support Vector Machine (SVM)

In this notebook, we demonstrate how to build an SVM classifier that predicts a person's Body Mass Index (BMI) category from their gender, height, and weight. We will then: 1) visualize the data, 2) evaluate the accuracy of the predictions, and 3) compare the predictions with the k-NN classifier that was previously trained on this same dataset. The dataset was obtained from Kaggle. Source: [500 Person Gender-Height-Weight-Body Mass Index](https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex)

`Note`: remember that the first step is EDA. Although it is not performed in this notebook, that does not mean it isn't needed. The EDA was excluded to focus on the ML task.

In [1]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

In [2]:
df = pd.read_csv('BMI_data.csv')
df.head(10)

Unnamed: 0,Gender,Height,Weight,Index
0,Male,174,96,4
1,Male,189,87,2
2,Female,185,110,4
3,Female,195,104,3
4,Male,149,61,3
5,Male,189,104,3
6,Male,147,92,5
7,Male,154,111,5
8,Male,174,90,3
9,Female,169,103,4


In [3]:
#partition the data
X   = df[['Height', 'Weight', 'Gender']] #get the input features
y   = df['Index']              #get the target

X_train, X_test, y_train, y_test = train_test_split(X,              #the input features
                                                    y,              #the label
                                                    test_size=0.3,  #set aside 30% of the data as the test set
                                                    random_state=7, #reproduce the results
                                                    stratify=y      #preserve the distribution of the labels
                                                   )
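
The effect of `stratify` can be checked on a small example. The sketch below uses made-up imbalanced labels (`X_demo` and `y_demo` are hypothetical, not the BMI data) and compares the class proportions in the two splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7, stratify=y_demo
)

# Share of each class in the training and test splits
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share, test_share)
```

With stratification, both splits should keep roughly the 80/20 class mix of the full label vector, which matters for rare classes such as the smallest BMI categories.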

In [4]:
X_train = X_train.copy()
X_test  = X_test.copy()

In [5]:
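#encode the categorical feature
#drop='first' drops one of the two encoded gender columns, leaving a single 0/1 column
#note: scikit-learn versions older than 1.2 use sparse=False instead of sparse_output=False
encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
encoder.fit(X_train[['Gender']])
#encoder.categories_ #view the categories

X_train['Gender'] = encoder.transform(X_train[['Gender']])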
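
As a minimal sketch of what the encoding step produces, the example below fits an encoder on a toy training frame and reuses it on a toy test frame. The frames `train_demo` and `test_demo` are hypothetical, and `drop='first'` is assumed so that a two-level feature becomes one indicator column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames standing in for the training and test splits
train_demo = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})
test_demo = pd.DataFrame({'Gender': ['Female', 'Male']})

# drop='first' keeps a single indicator column for a two-level feature
enc_demo = OneHotEncoder(drop='first')
enc_demo.fit(train_demo[['Gender']])  # learn the categories from the training data only

# transform returns a sparse matrix by default; convert it to a dense 1-D array
train_demo['Gender'] = enc_demo.transform(train_demo[['Gender']]).toarray().ravel()
test_demo['Gender'] = enc_demo.transform(test_demo[['Gender']]).toarray().ravel()

print(enc_demo.categories_)           # categories are sorted: Female, Male
print(train_demo['Gender'].tolist())  # [1.0, 0.0, 0.0, 1.0]
```

Because the categories are sorted alphabetically, 'Female' is the dropped level here and 'Male' is encoded as 1.0.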
 

In [6]:
X_train.head()

Unnamed: 0,Height,Weight,Gender
54,196,116,0.0
117,146,104,1.0
12,192,101,1.0
371,160,124,1.0
301,182,108,0.0


In [7]:
#standardize the numeric features (zero mean, unit variance)
scaler = StandardScaler()
scaler.fit(X_train[['Height', 'Weight']])

X_train[['Height', 'Weight']] = scaler.transform(X_train[['Height', 'Weight']]) #scale the training data
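
To illustrate what `StandardScaler` does, the sketch below fits a scaler on a few made-up height and weight values (`train_hw` and `test_hw` are hypothetical) and applies the training statistics to a new row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical height (cm) and weight (kg) values, not taken from the dataset
train_hw = np.array([[150.0, 50.0], [160.0, 60.0], [170.0, 70.0], [180.0, 80.0]])
test_hw = np.array([[165.0, 65.0]])

scaler_demo = StandardScaler().fit(train_hw)  # learns the mean and std from the training data only
train_scaled = scaler_demo.transform(train_hw)
test_scaled = scaler_demo.transform(test_hw)  # reuses the training mean and std

print(scaler_demo.mean_)          # per-column training means
print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)                # the test row expressed in training-set units
```

Fitting on the training split and only transforming the test split keeps information about the test data out of the preprocessing step.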

In [8]:
#view the prepared data
X_train.head(10)

Unnamed: 0,Height,Weight,Gender
54,1.614241,0.333217,0.0
117,-1.428545,-0.040285,1.0
12,1.370819,-0.13366,1.0
371,-0.576564,0.582218,1.0
301,0.762261,0.084216,0.0
77,-0.089719,-0.476037,1.0
429,-0.272286,-0.320411,1.0
116,-0.272286,-1.503168,1.0
181,0.457983,0.364342,0.0
370,-1.124266,-1.347542,0.0


In [9]:
#initialize the classifier

svm = SVC() #the default kernel is rbf
svm.fit(X_train, y_train) #fit the data

SVC()

In [10]:
#encode & scale the new/test data
X_test[['Height', 'Weight']] = scaler.transform(X_test[['Height', 'Weight']]) #scale the test data
X_test['Gender']             = encoder.transform(X_test[['Gender']]) #encode the test data

#predict the labels for the test set
y_pred   = svm.predict(X_test)

print('The predicted BMI is: {}'.format(y_pred))

The predicted BMI is: [4 5 4 2 1 5 5 4 4 5 2 3 4 4 5 5 3 5 4 2 2 2 4 4 2 5 2 2 5 5 5 2 2 4 5 5 4
 4 5 5 2 5 5 5 2 5 5 5 0 5 2 4 5 3 3 1 5 5 1 5 4 5 4 3 4 5 5 4 3 1 5 4 4 4
 5 1 5 2 5 5 3 4 3 5 5 5 4 3 5 4 4 4 2 4 5 2 4 5 1 5 3 5 3 5 1 3 5 1 5 4 5
 5 1 3 2 4 4 4 2 2 3 5 5 5 4 4 2 5 5 5 5 5 5 5 3 4 5 5 3 4 4 5 4 2 4 4 3 5
 5 3]
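
The manual encode, scale, and predict steps above could also be bundled into a single estimator. The sketch below is one possible arrangement using scikit-learn's `ColumnTransformer` and `Pipeline`, which this notebook does not use; the toy data (`train_df`, `train_labels`, `new_people`) is hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data with the same column names as the notebook
train_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Height': [150, 155, 170, 175, 190, 195],
    'Weight': [90, 95, 70, 72, 50, 52],
})
train_labels = [5, 5, 2, 2, 0, 0]

# Scale the numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['Height', 'Weight']),
    ('cat', OneHotEncoder(drop='first'), ['Gender']),
])

# The pipeline fits the preprocessing on the training data, then fits the SVC
model = Pipeline([('preprocess', preprocess), ('svc', SVC())])
model.fit(train_df, train_labels)

# New rows go through the same fitted preprocessing before prediction
new_people = pd.DataFrame({'Gender': ['Female'], 'Height': [152], 'Weight': [93]})
pred = model.predict(new_people)
print(pred)
```

With this design, the same fitted transformations are applied to training and new data automatically, so there is less room to forget a step or to fit a transformer on the test set.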


In [11]:
#evaluate the prediction accuracy
print('The accuracy of the model is: {}'.format(svm.score(X_test, y_test)))

The accuracy of the model is: 0.8866666666666667


In [12]:
#get the classification report
target_names = ['Extremely Weak', 'Weak', 'Normal', 'Overweight', 'Obesity', 'Extreme Obesity']
print(classification_report(y_test, y_pred, target_names = target_names))

                 precision    recall  f1-score   support

 Extremely Weak       1.00      0.25      0.40         4
           Weak       0.67      0.86      0.75         7
         Normal       0.86      0.86      0.86        21
     Overweight       0.94      0.85      0.89        20
        Obesity       0.87      0.87      0.87        39
Extreme Obesity       0.92      0.97      0.94        59

       accuracy                           0.89       150
      macro avg       0.88      0.78      0.79       150
   weighted avg       0.89      0.89      0.88       150
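
The per-class precision and recall in a report like this can be derived from a confusion matrix. The sketch below uses made-up labels for a 3-class problem (`y_true_demo` and `y_pred_demo` are hypothetical, not the predictions above) and shows how a class can have perfect precision but low recall, as in the 'Extremely Weak' row.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for a 3-class problem (not the notebook's predictions)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 1, 1, 1, 2, 2, 1])

cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows = true class, columns = predicted class

# Recall = correct predictions / all true members of the class (row sum)
recall_manual = np.diag(cm) / cm.sum(axis=1)
# Precision = correct predictions / all predictions of the class (column sum)
precision_manual = np.diag(cm) / cm.sum(axis=0)

print(cm)
print(recall_manual)     # class 0: 1 of 4 true members found, so recall is 0.25
print(precision_manual)  # class 0: its only prediction is correct, so precision is 1.0

# The manual values agree with scikit-learn's per-class scores
print(np.allclose(recall_manual, recall_score(y_true_demo, y_pred_demo, average=None)))
print(np.allclose(precision_manual, precision_score(y_true_demo, y_pred_demo, average=None)))
```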



## Hyperparameter Tuning
Choosing good hyperparameter values for an algorithm can improve its ability to identify patterns and result in better performance. Hyperparameter tuning involves experimenting with different values to determine which ones improve the model.

See the scikit-learn guide on [tuning the hyper-parameters of an estimator using sklearn](https://scikit-learn.org/stable/modules/grid_search.html) to learn more, and view the [documentation on SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) to learn about its hyperparameters.

In [13]:
#set up the parameter grid to try values of C from 1 to 9 and values of gamma from 1 to 4 (np.arange excludes the stop value)
params = {'C': np.arange(1, 10),
         'gamma': np.arange(1, 5)}

svm = SVC()

#initialize Grid Search with Cross Validation
svm_grid = GridSearchCV(estimator = svm, param_grid = params, cv=5)
svm_grid.fit(X_train, y_train)

print(svm_grid.best_params_) #print the best parameters
print(svm_grid.best_score_)  #print the best score

{'C': 6, 'gamma': 1}
0.9257142857142858
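
The grid above steps through integer values. A common alternative is to search C and gamma on a logarithmic scale, which covers several orders of magnitude with few candidates. The sketch below shows that variant on synthetic data; `X_demo`, `y_demo`, and the grid values are assumptions for illustration, not results for the BMI dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for illustration, not the BMI dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=7)

# Log-spaced candidates for both hyperparameters
param_grid = {'C': np.logspace(-1, 2, 4),      # 0.1, 1, 10, 100
              'gamma': np.logspace(-3, 0, 4)}  # 0.001, 0.01, 0.1, 1

search = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)                # best combination found on the synthetic data
print(len(search.cv_results_['params']))  # 4 x 4 = 16 combinations were evaluated
```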


## Diagnosing Bias-Variance Problems

There is a tradeoff between bias and variance. As the model becomes more complex, the variance increases and the bias decreases; as the model becomes less complex, the bias increases and the variance decreases.

How do we diagnose the bias-variance problem? We need to evaluate the model's performance on both the training set and the test set.
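
One way to make this comparison concrete is scikit-learn's `validation_curve`, which returns training and validation scores for a range of values of a single hyperparameter. The sketch below varies `gamma` on synthetic data; `X_demo`, `y_demo`, and the `gammas` range are assumptions for illustration, not the BMI dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for illustration, not the BMI dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=7)

gammas = np.array([0.01, 0.1, 1.0, 10.0])  # larger gamma gives a more flexible RBF model

# Each result has one row per gamma value and one column per cross-validation fold
train_scores, val_scores = validation_curve(
    SVC(), X_demo, y_demo, param_name='gamma', param_range=gammas, cv=5
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for g, tr, va in zip(gammas, train_mean, val_mean):
    print(f'gamma={g}: train={tr:.3f}, validation={va:.3f}, gap={tr - va:.3f}')
```

A training score far above the validation score points to high variance (overfitting), while low scores on both point to high bias (underfitting).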

In [14]:
params = {'C': np.arange(1, 10),
         'gamma': np.arange(1, 5)}

svm = SVC()

#initialize Grid Search with Cross Validation and return the training score
svm_grid = GridSearchCV(estimator = svm, param_grid = params, cv=5, return_train_score = True) 
svm_grid.fit(X_train, y_train)

print(svm_grid.best_params_)
print(svm_grid.best_score_)

{'C': 6, 'gamma': 1}
0.9257142857142858


In [15]:
#display the cv results for the training and validation folds (scikit-learn labels the validation scores as 'test')
cv_results = pd.DataFrame(svm_grid.cv_results_)
cv_results = cv_results.sort_values('mean_test_score', ascending=False)
cv_results[['mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']].head()

Unnamed: 0,mean_train_score,std_train_score,mean_test_score,std_test_score
20,0.967857,0.004518,0.925714,0.029137
32,0.970714,0.005249,0.925714,0.020996
24,0.967143,0.005249,0.922857,0.030772
28,0.967857,0.005051,0.922857,0.024908
13,0.974286,0.004165,0.92,0.021381


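For the best configuration (C=6, gamma=1), the mean_train_score is about 96.8% and the mean_test_score is about 92.6%, a gap of roughly 4 percentage points. With both scores high and the gap this small, the SVC does not appear to substantially overfit or underfit the data, which suggests low bias and low variance.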
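
The gap between the training and validation scores can be computed directly from the values in the table above. The sketch below copies those five rows into a small DataFrame (`results_demo` is a hypothetical name) and adds a `gap` column.

```python
import pandas as pd

# Values copied from the cv_results table above (index = position in the grid)
results_demo = pd.DataFrame(
    {'mean_train_score': [0.967857, 0.970714, 0.967143, 0.967857, 0.974286],
     'mean_test_score': [0.925714, 0.925714, 0.922857, 0.922857, 0.920000]},
    index=[20, 32, 24, 28, 13],
)

# Difference between the mean training score and the mean validation score
results_demo['gap'] = results_demo['mean_train_score'] - results_demo['mean_test_score']
print(results_demo.round(4))
```

All five gaps fall between roughly 0.04 and 0.055, so none of the top configurations shows a large train-validation difference.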