<font size="20"><u><b>Support Vector Machines (SVM)</b></u></font>

<font size="6"><u><b>By Nishit Sonawane</b></u></font>

## •<u>Introduction to Support Vector Machines</u>

### – Support Vector Machines (SVMs) are powerful machine learning models used for classification and regression tasks. They have gained significant popularity due to their ability to handle complex datasets and produce accurate results. SVMs are particularly effective when dealing with high-dimensional feature spaces.

### – At its core, SVM is a binary classification algorithm that aims to find an optimal hyperplane that separates different classes in a dataset. The hyperplane is determined by maximizing the margin, which is the distance between the hyperplane and the nearest data points of each class. The data points that lie closest to the hyperplane are known as support vectors, hence the name "Support Vector Machines."
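
#### •<u>Sketch: margin of a linear SVM on toy data:</u> The margin idea described above can be illustrated with a minimal example. The four points and the names 'X_toy', 'y_toy' and 'toy_model' are hypothetical and are not part of the breast cancer analysis. A linear kernel with a large C is used here as an approximation of a hard-margin SVM, for which the margin width is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two points per class, separated along the x-axis.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A linear kernel with a large C approximates a hard-margin SVM.
toy_model = SVC(kernel="linear", C=1000.0)
toy_model.fit(X_toy, y_toy)

w = toy_model.coef_[0]        # normal vector of the separating hyperplane
b = toy_model.intercept_[0]   # offset of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:")
print(toy_model.support_vectors_)
```

#### The two classes sit at x = 0 and x = 2, so the widest possible margin is about 2, with the hyperplane near x = 1.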


## •<u>Linearly Separable vs Non-Separable Data</u>  

### Since scikit-learn's breast cancer dataset consists of multiple features, it is not as simple to determine linear separability by visual inspection. Linear separability refers to the property of data points being separable by a straight line in a 2D space or a hyperplane in higher-dimensional spaces.
### In the case of the breast cancer dataset, the features represent various characteristics of tumors, such as mean radius, mean texture, etc. These features are not inherently visualizable in a 2D space, making it challenging to determine linear separability through visual inspection alone.
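
#### •<u>Sketch: a rough numerical check for linear separability:</u> When the data cannot be inspected visually, one heuristic is to fit a linear SVM with a large C and check whether every training point is classified correctly. The helper 'fits_training_data_linearly' and the toy arrays are hypothetical names introduced for illustration. A result of True means a separating hyperplane was found; a result of False is only suggestive, because C is finite and the solver uses a tolerance.

```python
import numpy as np
from sklearn.svm import SVC

def fits_training_data_linearly(X, y, C=1000.0):
    """Return True if a linear SVM classifies every training point correctly."""
    clf = SVC(kernel="linear", C=C)
    clf.fit(X, y)
    return bool(clf.score(X, y) == 1.0)

# Hypothetical toy set that a vertical line can separate.
X_sep = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y_sep = np.array([0, 0, 1, 1])

# XOR layout: no straight line classifies all four points correctly.
X_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_xor = np.array([0, 0, 1, 1])

separable_result = fits_training_data_linearly(X_sep, y_sep)
xor_result = fits_training_data_linearly(X_xor, y_xor)
print("separable toy set:", separable_result)
print("XOR toy set:", xor_result)
```

#### The same helper could be applied to the breast cancer features, ideally after standardizing them, since the features are on very different scales.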

# •<u>The use of Support Vector Machines (SVM) for classification on the Breast Cancer Wisconsin (Diagnostic) dataset.</u>•

#### •<u>Importing necessary libraries:</u> pandas and numpy are imported for data manipulation. The sklearn modules used for splitting, model building, and evaluation are imported in the cells where they are first needed.

In [1]:
import pandas as pd
import numpy as np

#### •<u>Loading and exploring the dataset:</u>   The Breast Cancer dataset is loaded using 'load_breast_cancer()' from 'sklearn.datasets'. The dataset's description is printed using 'print(cancer['DESCR'])'. The dataset features are stored in 'df_feat', and their information is displayed using 'df_feat.info()'. The target names are printed using 'cancer['target_names']'.

In [2]:
from sklearn.datasets import load_breast_cancer

In [3]:
cancer = load_breast_cancer()

In [4]:
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [5]:
print(cancer['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [6]:
df_feat = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])


In [7]:
df_feat.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [8]:
df_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [9]:
cancer['target_names']

array(['malignant', 'benign'], dtype='<U9')
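
#### •<u>Sketch: mapping target values to class names:</u> The order of 'target_names' fixes the label encoding: index 0 is 'malignant' and index 1 is 'benign'. The short check below counts the samples per class. The name 'class_counts' is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# np.bincount counts how often each integer label (0, 1) occurs in the target.
counts = np.bincount(cancer["target"])
class_counts = {str(name): int(n) for name, n in zip(cancer["target_names"], counts)}
print(class_counts)   # {'malignant': 212, 'benign': 357}
```

#### This encoding matters later when reading the confusion matrix, where row and column 0 correspond to malignant tumors.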

#### •<u>Splitting the data into training and testing sets:</u> The dataset is divided into input features (X) and target labels (y). The 'train_test_split()' function from 'sklearn.model_selection' is used to split the data into training and testing sets, with 30% of the data reserved for testing.

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X = df_feat
y = cancer['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
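
#### •<u>Sketch: checking the split sizes and class balance:</u> A quick sanity check of the split can look like the following. The variable names ('X_all', 'X_tr', 'X_te', and so on) are hypothetical, and the stratified variant is an optional alternative, not what the notebook above uses.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# Same settings as the notebook: 30% test data, fixed random seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=101
)
print("train size:", len(X_tr), "test size:", len(X_te))
print("test class counts (malignant, benign):", np.bincount(y_te))

# Optional variant (an assumption, not used above): stratify=y_all keeps
# the class proportions of the full dataset in both splits.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_all, y_all, test_size=0.3, random_state=101, stratify=y_all
)
benign_share_stratified = np.mean(y_te_s == 1)
print("benign share in stratified test set:", benign_share_stratified)
```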

## •<u>The Kernel Trick</u>

### – The kernel trick is a concept in machine learning, particularly in support vector machines (SVMs), that allows us to efficiently compute the results of a potentially high-dimensional feature space without explicitly calculating the transformed feature vectors. It is a mathematical technique that makes SVMs applicable to non-linear classification problems.
### – The SVC (Support Vector Classifier) model from scikit-learn is employed to perform classification on the breast cancer dataset. The SVC model has a parameter called kernel which determines the type of kernel function to be used.
### – By default, the SVC model uses the Radial Basis Function (RBF) kernel, which is a popular choice for non-linear classification problems. The RBF kernel implicitly maps the input data into a higher-dimensional space, allowing the model to find non-linear decision boundaries. This mapping is performed using the kernel trick, which avoids the explicit computation of the transformed feature space.
### – In the context of breast cancer classification, the RBF kernel can capture non-linear patterns and boundaries in the feature space defined by the dataset. If the malignant and benign samples are not linearly separable, it may separate them better than a linear kernel, although the result depends on the choice of C and gamma and on the scale of the features.
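
#### •<u>Sketch: the kernel trick in numbers:</u> The idea above can be checked on a small example. For 2-D inputs, the degree-2 polynomial kernel K(x, z) = (x · z)^2 equals an ordinary dot product after the explicit map phi(x) = (x1^2, sqrt(2)·x1·x2, x2^2). The function 'phi' and the vectors 'x' and 'z' are hypothetical illustrations. The RBF value is computed with 'rbf_kernel' from 'sklearn.metrics.pairwise' and compared with the formula exp(-gamma · ||x - z||^2).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = phi(x) @ phi(z)   # dot product computed in the 3-D feature space
via_kernel = (x @ z) ** 2    # same value computed from the 2-D inputs only
print(explicit, via_kernel)

# RBF kernel: the implied feature space is infinite-dimensional,
# but the kernel value only needs the squared distance between x and z.
gamma = 0.1
rbf_sklearn = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]
rbf_manual = np.exp(-gamma * np.sum((x - z) ** 2))
print(rbf_sklearn, rbf_manual)
```

#### Both pairs of values agree, which is the point of the trick: the kernel gives the feature-space dot product without building the feature vectors.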

#### •<u>Building an SVM model:</u> An SVM model is created using SVC() from sklearn.svm. The model is trained on the training data using model.fit(X_train, y_train).

In [12]:
from sklearn.svm import SVC

In [13]:
model = SVC()

In [14]:
model.fit(X_train, y_train)

SVC()

#### •<u>Make predictions and evaluate the model:</u> The trained SVM model is used to make predictions on the test data using 'model.predict(X_test)'. The confusion matrix and classification report are printed using 'confusion_matrix()' and 'classification_report()' from 'sklearn.metrics', respectively.

In [15]:
predictions = model.predict(X_test)

In [16]:
from sklearn.metrics import classification_report, confusion_matrix

In [17]:
print(confusion_matrix(y_test, predictions))
print(50*'_')
print(classification_report(y_test, predictions))
print(50*'_')

[[ 56  10]
 [  3 102]]
__________________________________________________
              precision    recall  f1-score   support

           0       0.95      0.85      0.90        66
           1       0.91      0.97      0.94       105

    accuracy                           0.92       171
   macro avg       0.93      0.91      0.92       171
weighted avg       0.93      0.92      0.92       171

__________________________________________________


### •<u>Confusion Matrix Explained (pre tuning)</u>

#### <u>[1,1] True Negative (TN):</u> The model correctly predicted 56 samples as class 0 (malignant tumors). With scikit-learn's label encoding, class 0 is the negative class.
#### <u>[1,2] False Positive (FP):</u> The model incorrectly predicted 10 samples as class 1 (benign tumors) when they were actually class 0 (malignant tumors).
#### <u>[2,1] False Negative (FN):</u> The model incorrectly predicted 3 samples as class 0 (malignant tumors) when they were actually class 1 (benign tumors).
#### <u>[2,2] True Positive (TP):</u> The model correctly predicted 102 samples as class 1 (benign tumors), the positive class.
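
#### •<u>Sketch: reading the four cells programmatically:</u> scikit-learn orders the rows of the confusion matrix by true label and the columns by predicted label, with labels sorted as 0 then 1. According to 'target_names', 0 is malignant and 1 is benign. The label arrays below are reconstructed from the counts in the matrix above purely for illustration; they are not the model's actual per-sample predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Reconstructed labels: 66 true class-0 samples, then 105 true class-1 samples.
y_true_demo = np.array([0] * 66 + [1] * 105)
# 56 of the class-0 samples predicted 0, 10 predicted 1;
# 3 of the class-1 samples predicted 0, 102 predicted 1.
y_pred_demo = np.array([0] * 56 + [1] * 10 + [0] * 3 + [1] * 102)

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_class1 = tp / (tp + fn)   # share of benign samples recognised
recall_class0 = tn / (tn + fp)   # share of malignant samples recognised
print(tn, fp, fn, tp)
print(round(accuracy, 2), round(recall_class1, 2), round(recall_class0, 2))
```

#### The rounded values (0.92, 0.97, 0.85) match the accuracy and recall columns of the classification report above.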

#### -----------------------------------------------------------------------------------------------

## •<u>Understanding the Hyperparameters – Cost and Gamma</u>

### In Support Vector Machines (SVM), the hyperparameters "C" and "gamma" play crucial roles in the model's performance and behavior. Understanding these hyperparameters is essential for effectively tuning SVMs for different datasets and problem domains. Here's an explanation of the cost (C) and gamma hyperparameters:
### <u>1. Cost (C):</u>
### <u>1.1:</u> The cost parameter, often denoted as "C," controls the trade-off between maximizing the margin and minimizing the classification errors in SVM.
### <u>1.2:</u> A smaller value of C allows for a wider margin but may result in more misclassifications. It emphasizes a simpler decision boundary with potentially more margin violations.
### <u>1.3:</u> A larger value of C leads to a narrower margin but aims to minimize the number of misclassifications. It prioritizes a more complex decision boundary that attempts to fit the training data more precisely.
### <u>1.4:</u> The choice of C depends on the problem's characteristics and the desired balance between underfitting (C too small) and overfitting (C too large).
### <u>2. Gamma:</u>
### <u>2.1:</u> The gamma parameter of the RBF kernel determines how far the influence of a single training example reaches when shaping the decision boundary.
### <u>2.2:</u> A smaller gamma value implies a larger influence radius and results in a smoother decision boundary. It leads to a more global approach to classification.
### <u>2.3:</u> A larger gamma value makes the decision boundary focus on closer data points. It leads to a more complex and localized decision boundary that can fit intricate patterns in the data.
### <u>2.4:</u> High values of gamma can result in overfitting, as the model becomes too sensitive to individual training examples.
### <u>2.5:</u> The optimal gamma value depends on the dataset's characteristics, such as the scale of the features and the level of complexity in the data. It often requires experimentation to find the appropriate value.
### –Finding the right values for C and gamma is typically achieved through hyperparameter tuning techniques, such as grid search or random search, where different combinations of values are evaluated to determine the optimal settings for a given problem.
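
#### •<u>Sketch: how gamma changes the train/test gap:</u> The effect described above can be observed by fitting a few (C, gamma) pairs on the same split and comparing training and test accuracy. The pairs chosen here and the names 'settings' and 'gaps' are assumptions for illustration. The features are left unscaled, as in the rest of this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=101
)

# Hypothetical (C, gamma) pairs: small gamma vs. large gamma.
settings = [(0.1, 0.0001), (1, 0.0001), (1, 1)]
gaps = {}
for C, gamma in settings:
    clf = SVC(C=C, gamma=gamma).fit(X_tr, y_tr)
    train_acc = clf.score(X_tr, y_tr)
    test_acc = clf.score(X_te, y_te)
    # A large positive gap (train much better than test) signals overfitting.
    gaps[(C, gamma)] = train_acc - test_acc
    print(f"C={C}, gamma={gamma}: train={train_acc:.3f}, test={test_acc:.3f}")
```

#### On unscaled features, a large gamma makes each training point influence only its immediate neighbourhood, so the model tends to memorise the training set and generalise poorly.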

## •<u>GridSearchCV – to select the optimum hyperparameters</u>


### – GridSearchCV is a popular method for hyperparameter tuning in machine learning, including for Support Vector Machines (SVMs). It is a systematic approach that exhaustively searches through a predefined grid of hyperparameter values to find the optimal combination that yields the best performance.
### – Here's a step-by-step overview of how GridSearchCV works for selecting the optimum hyperparameters in SVM:
### 1. <u>Define the parameter grid</u> 
### 2. <u>Create the SVM model</u> 
### 3. <u>Set up the GridSearchCV</u> 
### 4. <u>Perform grid search</u> 
### 5. <u>Find the best hyperparameters</u> 
### 6. <u>Evaluate the model</u> 
### – By scoring every combination with cross-validation on the training data, GridSearchCV helps find hyperparameter values that perform well on held-out folds, which reduces the risk of overfitting to a single training split.
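
#### •<u>Sketch: the six steps in one place:</u> The steps listed above can be sketched compactly as follows. This version differs from the notebook's own cells in one respect, which is an assumption added here: the SVC is wrapped in a 'Pipeline' with a 'StandardScaler', so that scaling is refit inside every cross-validation fold. The grid values and the names 'pipe', 'grid_params' and 'search' are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=101
)

# 1. Define the parameter grid ("svc__" routes a value to the "svc" step).
grid_params = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 0.01, 0.001]}

# 2. Create the SVM model (here: scaling followed by an RBF SVC).
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# 3. Set up the grid search with 5-fold cross-validation.
search = GridSearchCV(pipe, grid_params, cv=5)

# 4. Perform the grid search on the training data only.
search.fit(X_tr, y_tr)

# 5. Find the best hyperparameters.
print(search.best_params_)

# 6. Evaluate the refit best model on the untouched test data.
test_accuracy = search.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

#### Keeping the test set out of the search means the final score is an estimate on data that played no part in choosing C and gamma.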

#### ----------------------------------------------------------------------------------------------

#### •<u>Hyperparameter tuning with GridSearchCV:</u> GridSearchCV is employed for hyperparameter tuning. A parameter grid with different combinations of the cost (C) and gamma values is defined using 'param_grid'. GridSearchCV is initialized with the SVM model, the parameter grid, and verbosity. The 'grid.fit(X_train, y_train)' method is called to perform the grid search and find the best hyperparameters.

In [18]:
from sklearn.model_selection import GridSearchCV


In [19]:
param_grid = {'C':[0.1,1,10,100,1000], 'gamma':[1,0.1,0.01,0.001,0.0001]}

In [20]:
grid = GridSearchCV(SVC(), param_grid, verbose=3)

In [21]:
grid.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ....................C=0.1, gamma=1;, score=0.637 total time=   0.0s
[CV 2/5] END ....................C=0.1, gamma=1;, score=0.637 total time=   0.0s
[CV 3/5] END ....................C=0.1, gamma=1;, score=0.625 total time=   0.0s
[CV 4/5] END ....................C=0.1, gamma=1;, score=0.633 total time=   0.0s
[CV 5/5] END ....................C=0.1, gamma=1;, score=0.633 total time=   0.0s
[CV 1/5] END ..................C=0.1, gamma=0.1;, score=0.637 total time=   0.0s
[CV 2/5] END ..................C=0.1, gamma=0.1;, score=0.637 total time=   0.0s
[CV 3/5] END ..................C=0.1, gamma=0.1;, score=0.625 total time=   0.0s
[CV 4/5] END ..................C=0.1, gamma=0.1;, score=0.633 total time=   0.0s
[CV 5/5] END ..................C=0.1, gamma=0.1;, score=0.633 total time=   0.0s
[CV 1/5] END .................C=0.1, gamma=0.01;, score=0.637 total time=   0.0s
[CV 2/5] END .................C=0.1, gamma=0.01

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
             verbose=3)

#### •<u>Evaluating the tuned model:</u> The best hyperparameters are accessed using 'grid.best_params_', and the best estimator is obtained using 'grid.best_estimator_'. Predictions are made on the test data using the tuned model ('grid_predictions'). The confusion matrix and classification report are printed again to evaluate the tuned model's performance.

In [22]:
grid.best_params_

{'C': 1, 'gamma': 0.0001}

In [23]:
grid.best_estimator_

SVC(C=1, gamma=0.0001)

In [24]:
grid_predictions = grid.predict(X_test)

In [25]:
print(confusion_matrix(y_test, grid_predictions))
print(50*'_')
print(classification_report(y_test, grid_predictions))
print(50*'_')

[[ 59   7]
 [  4 101]]
__________________________________________________
              precision    recall  f1-score   support

           0       0.94      0.89      0.91        66
           1       0.94      0.96      0.95       105

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.93       171
weighted avg       0.94      0.94      0.94       171

__________________________________________________


### •<u>Confusion Matrix Explained (post tuning)</u>

#### <u>[1,1] True Negative (TN):</u> The model correctly predicted 59 samples as class 0 (malignant tumors). With scikit-learn's label encoding, class 0 is the negative class.
#### <u>[1,2] False Positive (FP):</u> The model incorrectly predicted 7 samples as class 1 (benign tumors) when they were actually class 0 (malignant tumors).
#### <u>[2,1] False Negative (FN):</u> The model incorrectly predicted 4 samples as class 0 (malignant tumors) when they were actually class 1 (benign tumors).
#### <u>[2,2] True Positive (TP):</u> The model correctly predicted 101 samples as class 1 (benign tumors), the positive class.
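
#### •<u>Sketch: comparing the two confusion matrices:</u> The change from tuning can be quantified directly from the two matrices printed in this notebook. The helper 'accuracy_from_cm' and the array names are hypothetical; the counts are copied from the outputs above.

```python
import numpy as np

# Counts copied from the notebook outputs (rows: true label, columns: predicted).
cm_before = np.array([[56, 10], [3, 102]])
cm_after = np.array([[59, 7], [4, 101]])

def accuracy_from_cm(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return np.trace(cm) / cm.sum()

acc_before = accuracy_from_cm(cm_before)
acc_after = accuracy_from_cm(cm_after)
errors_before = int(cm_before.sum() - np.trace(cm_before))
errors_after = int(cm_after.sum() - np.trace(cm_after))

print(f"accuracy: {acc_before:.3f} -> {acc_after:.3f}")
print(f"misclassified samples: {errors_before} -> {errors_after}")
```

#### The improvement amounts to two fewer misclassified samples out of 171, so it is best read as a modest gain on this particular split.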

#### ------------------------------------------------------------------------------------------------

## •<u>Advantages of the above SVM Model</u>

### 1 		<u>Effective in High-Dimensional Spaces:</u> SVMs often perform well on datasets with a high number of features or dimensions, and can remain effective even when the number of features is greater than the number of samples.
### 2 		<u>Versatile Kernel Functions:</u> SVMs can utilize different kernel functions, such as linear, polynomial, and radial basis function (RBF), to handle both linearly separable and non-linearly separable data. This flexibility allows SVMs to capture complex patterns and relationships in the data.
### 3 		<u>Robust to Overfitting:</u> SVMs have a regularization parameter (C) that controls the trade-off between maximizing the margin and minimizing the classification error. This regularization helps prevent overfitting by balancing the model's ability to fit the training data while generalizing well to unseen data.
### 4 		<u>Effective for Small/Medium-Sized Datasets:</u> SVMs are particularly suitable for datasets with a small to medium number of samples. They can provide accurate and stable predictions even with limited training data.
### 5 		<u>Global Optimal Solution:</u> The objective of SVMs is to find the hyperplane that maximizes the margin between classes. This optimization problem has a unique global optimal solution, ensuring consistency and stability in model training.
### 6 		<u>Tolerant of Some Outliers:</u> With a soft margin (a moderate value of C), a few outliers are allowed to violate the margin without dominating the decision boundary. With a very large C, the boundary becomes more sensitive to them.
### 7 		<u>Memory Efficient at Prediction Time:</u> SVMs rely only on a subset of training points (the support vectors) to define the decision boundary, so the fitted model only needs to store those points. Training a kernel SVM, however, becomes expensive as the number of samples grows, so SVMs are best suited to small and medium-sized datasets.
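
#### •<u>Sketch: counting the support vectors:</u> The claim that only a subset of training points defines the boundary can be checked on a fitted model. The sketch refits an SVC with the tuned values reported above (C=1, gamma=0.0001) on the same split. The names 'clf', 'n_sv' and 'sv_fraction' are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=101
)

clf = SVC(C=1, gamma=0.0001).fit(X_tr, y_tr)

# support_vectors_ holds the training points that define the boundary;
# n_support_ gives the number of support vectors per class.
n_sv = clf.support_vectors_.shape[0]
sv_fraction = n_sv / len(X_tr)
print("support vectors per class:", clf.n_support_)
print(f"{n_sv} of {len(X_tr)} training points ({sv_fraction:.0%}) are support vectors")
```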


## •<u>Conclusion</u>

### The code implements an SVM model for the classification of breast cancer data. It demonstrates the training and evaluation of the model, including the use of confusion matrices and classification reports to assess its performance. It also showcases the use of grid search for hyperparameter tuning. 
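### On this train/test split, tuning C and gamma with grid search raised the test accuracy from 0.92 to 0.94 and reduced the total number of misclassified samples from 13 to 11. This suggests that hyperparameter tuning is worthwhile for this model, although the gain is modest and comes from a single split.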