# Seed classification using Support Vector Machine

### Importing the necessary libraries and modules

In [9]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score

### Import of data and data inspection

In [10]:
sd = pd.read_csv(r"C:\Users\pjhop\OneDrive\Documents\Programming & Coding\Python\Projects\Datasets\seeds_dataset.csv")
sd.head()

|   | ID | area | perimeter | compactness | lengthOfKernel | widthOfKernel | asymmetryCoefficient | lengthOfKernelGroove | seedType |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15.26 | 14.84 | 0.871 | 5.763 | 3.312 | 2.221 | 5.22 | 1 |
| 1 | 2 | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 1.018 | 4.956 | 1 |
| 2 | 3 | 14.29 | 14.09 | 0.905 | 5.291 | 3.337 | 2.699 | 4.825 | 1 |
| 3 | 4 | 13.84 | 13.94 | 0.8955 | 5.324 | 3.379 | 2.259 | 4.805 | 1 |
| 4 | 5 | 16.14 | 14.99 | 0.9034 | 5.658 | 3.562 | 1.355 | 5.175 | 1 |


In [11]:
sd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ID                    210 non-null    int64  
 1   area                  210 non-null    float64
 2   perimeter             210 non-null    float64
 3   compactness           210 non-null    float64
 4   lengthOfKernel        210 non-null    float64
 5   widthOfKernel         210 non-null    float64
 6   asymmetryCoefficient  210 non-null    float64
 7   lengthOfKernelGroove  210 non-null    float64
 8   seedType              210 non-null    int64  
dtypes: float64(7), int64(2)
memory usage: 14.9 KB


The `info()` output shows 210 non-null values in each of the 9 columns, so the dataset has no missing values.

In [12]:
sd = sd.drop(['ID'], axis=1)

In [13]:
sd.seedType.unique()

array([1, 2, 3], dtype=int64)

### Splitting of the data into training and test sets

In [14]:
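# Target labels as a NumPy array (Series.ravel() is deprecated in recent pandas versions)
y = sd.seedType.to_numpy()
# Feature matrix: the seven kernel measurements
x = sd[['area', 'perimeter', 'compactness', 'lengthOfKernel', 'widthOfKernel', 'asymmetryCoefficient', 'lengthOfKernelGroove']]
# Hold out 25% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=101)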

### Model fit, predictions and metrics

A Support Vector Machine (SVM) is a supervised learning algorithm that can be used for both classification and regression problems. It works by finding the hyperplane (a line, in two dimensions) that maximises the margin between classes in the feature space. This hyperplane is the decision boundary. SVMs can also handle data that is not linearly separable by implicitly mapping it to a higher-dimensional space using a kernel function.
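To make the idea of a maximum-margin boundary concrete, here is a minimal sketch on four hand-picked 2D points (hypothetical toy data, not the seeds dataset). For a linear kernel the fitted weight vector `w` defines the hyperplane, and the margin width is `2 / ||w||`. The large `C` value is an assumption chosen so that the fit behaves like a hard-margin SVM.

```python
import numpy as np
from sklearn import svm

# Toy, linearly separable data (hypothetical, not the seeds dataset).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A large C leaves almost no tolerance for margin violations.
clf = svm.SVC(kernel="linear", C=1000)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
```

The two classes sit at `x1 = 0` and `x1 = 2`, so the widest possible gap between them is about 2 units, which is what `margin_width` should approximate.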

In the code below, we perform hyperparameter tuning using 5-fold cross-validation. We test the performance of four kernel functions (linear, polynomial, radial basis function and sigmoid) over a range of hyperparameters:

* **Linear kernel** - Linear kernels are often used when the dataset is linearly separable, or when the number of features is so large that a more complex kernel such as the polynomial or radial basis function becomes computationally expensive.
* **Polynomial kernel** - The polynomial kernel can map the data into a higher-dimensional feature space, allowing for non-linear classification in the original feature space. It is often used when the relationship between the input features and the output class is expected to be non-linear.
* **Radial basis function kernel** - The RBF kernel is often used when the relationship between the input features and the output class is highly non-linear and complex. However, choosing an appropriate value for the gamma parameter can be challenging, and finding the optimal hyperparameters may require extensive experimentation.
* **Sigmoid kernel** - The sigmoid kernel is based on the hyperbolic tangent function. It is included in the grid below for comparison with the other three.
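The kernels searched in this notebook can be written as plain functions of two feature vectors. The sketch below follows the kernel formulas given in the scikit-learn `SVC` documentation, with `degree=3` and `coef0=0.0` as in the `SVC` defaults. The function names and the two example vectors are hypothetical and only serve as illustration.

```python
import numpy as np

def linear_kernel(a, b):
    # Plain dot product: no mapping to a higher-dimensional space.
    return a @ b

def poly_kernel(a, b, gamma=0.1, coef0=0.0, degree=3):
    # (gamma * <a, b> + coef0) ** degree
    return (gamma * (a @ b) + coef0) ** degree

def rbf_kernel(a, b, gamma=0.1):
    # exp(-gamma * ||a - b||^2): similarity decays with squared distance.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def sigmoid_kernel(a, b, gamma=0.1, coef0=0.0):
    # tanh(gamma * <a, b> + coef0)
    return np.tanh(gamma * (a @ b) + coef0)

# Two hypothetical feature vectors.
a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])

print("linear: ", linear_kernel(a, b))
print("poly:   ", poly_kernel(a, b))
print("rbf:    ", rbf_kernel(a, b))
print("sigmoid:", sigmoid_kernel(a, b))
```

Note that `gamma` appears in every formula except the linear one.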

We will also tune two additional parameters:

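* `gamma`, the kernel coefficient for the polynomial, radial basis function and sigmoid kernels (the linear kernel ignores it). It determines how much curvature we allow in the decision boundary: a higher gamma means more curvature, and vice versa.
* `C`, the regularization parameter, which determines how much training error we allow the model to make. In scikit-learn the strength of the regularization is inversely proportional to `C`, so a smaller `C` tolerates more misclassified training points. This helps the model avoid learning patterns that don't exist and reduces overfitting.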
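One way to see why a higher `gamma` gives a more curved boundary is to look at the RBF kernel value between two fixed points as `gamma` grows. The sketch below uses the same ten `gamma` values as the grid in this notebook. The squared distance of 4.0 is an arbitrary, hypothetical choice.

```python
import numpy as np

# Same gamma values as the grid searched in this notebook.
gammas = np.linspace(0.1, 1.0, 10)

# Hypothetical pair of points that are 2 units apart (squared distance 4).
squared_distance = 4.0

# RBF kernel value exp(-gamma * ||a - b||^2) for each gamma.
similarities = np.exp(-gammas * squared_distance)

for g, s in zip(gammas, similarities):
    print(f"gamma={g:.1f}  K={s:.4f}")
```

As `gamma` increases, the similarity between the two points drops towards zero. Each training point then influences only its immediate neighbourhood, which lets the boundary bend more tightly around individual points.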

In [37]:
svc = svm.SVC(random_state=101)

In [41]:
param_grid = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': np.linspace(0.1, 1.0, 10), 
    'C': [0.1, 0.5, 1, 5, 10]
}
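It can help to count how many models this grid implies before running the search. The sketch below rebuilds the same grid and counts the combinations with `itertools.product`. It also shows a hypothetical alternative, `split_grid`, which relies on `GridSearchCV` accepting a list of dictionaries and avoids repeating the linear kernel for every `gamma` value, since the linear kernel does not use `gamma`.

```python
from itertools import product

import numpy as np

param_grid = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': np.linspace(0.1, 1.0, 10),
    'C': [0.1, 0.5, 1, 5, 10]
}

# Every combination of kernel, gamma and C is one candidate model.
n_candidates = len(list(product(*param_grid.values())))
n_fits = n_candidates * 5  # 5-fold cross-validation fits each candidate 5 times

print(n_candidates)  # 200
print(n_fits)        # 1000

# Hypothetical alternative: search gamma only for the kernels that use it.
split_grid = [
    {'kernel': ['linear'], 'C': param_grid['C']},
    {'kernel': ['poly', 'rbf', 'sigmoid'],
     'gamma': param_grid['gamma'],
     'C': param_grid['C']},
]
n_split = sum(len(list(product(*g.values()))) for g in split_grid)
print(n_split)       # 155
```

Both grids cover the same distinct models. The split version skips the 45 linear candidates that differ only in a `gamma` value the kernel never reads.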

In [46]:
grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, scoring="accuracy", cv=5)
grid_search.fit(x, y)

In [47]:
print(f"Best kernel function: {grid_search.best_params_['kernel']}")
print(f"Best gamma: {grid_search.best_params_['gamma']}")
print(f"Best C: {grid_search.best_params_['C']}")
print(f"Accuracy: {round(grid_search.best_score_ * 100, 2)} %")

Best kernel function: poly
Best gamma: 0.1
Best C: 0.1
Accuracy: 93.81 %
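The accuracy reported above is `best_score_`, the mean cross-validated accuracy of the best candidate. Because the search was fitted on all of `x` and `y`, the `x_test` and `y_test` split created earlier was not used as a held-out set. A minimal sketch of a held-out evaluation follows. It makes three assumptions: it uses synthetic data from `make_classification` as a stand-in for the seeds dataset, it uses a smaller grid to keep the run short, and it uses macro averaging for the F1 score. Its numbers are therefore not comparable to the result above.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the seeds data: 210 rows, 7 features, 3 classes.
X, y = make_classification(
    n_samples=210, n_features=7, n_informative=5, n_redundant=2,
    n_classes=3, random_state=101,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101
)

# Smaller grid than the notebook's, chosen only to keep the sketch fast.
param_grid = {'kernel': ['linear', 'rbf'], 'gamma': [0.1, 1.0], 'C': [0.1, 1, 10]}

# Tune on the training split only, so the test split stays unseen.
grid_search = GridSearchCV(svm.SVC(random_state=101), param_grid,
                           scoring="accuracy", cv=5)
grid_search.fit(X_train, y_train)

# GridSearchCV refits the best candidate on the whole training split by default.
y_pred = grid_search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred, average="macro")  # multiclass needs an averaging mode

print(f"Best parameters: {grid_search.best_params_}")
print(f"Cross-validated accuracy: {round(grid_search.best_score_ * 100, 2)} %")
print(f"Test accuracy: {round(test_accuracy * 100, 2)} %")
print(f"Test macro F1: {round(test_f1 * 100, 2)} %")
```

Applied to the notebook's own variables, the same pattern would fit on `x_train` and `y_train`, then score the predictions for `x_test` against `y_test`.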
