Example from: https://www.datacamp.com/tutorial/svm-classification-scikit-learn-python

First, load the dataset you will use.

In [106]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
cancer = datasets.load_breast_cancer()

Exploring Data

After you have loaded the dataset, you might want to know a little bit more about it. You can check feature and target names.

In [107]:
# Print the names of the 30 features
print("Features: ", cancer.feature_names)

# Print the class labels ('malignant', 'benign')
print("Labels: ", cancer.target_names)

Features:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels:  ['malignant' 'benign']
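
Beyond the names, the array shape and the class balance are worth a look before modelling. A minimal sketch that uses only attributes of the object returned by load_breast_cancer(); the variable name class_counts is illustrative and not part of the original example.

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# Feature matrix: one row per sample, one column per feature
print("Data shape:", cancer.data.shape)

# Count how many samples carry each label (0 = malignant, 1 = benign)
class_counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, class_counts):
    print(f"{name}: {count}")
```

The label order follows cancer.target_names, so index 0 corresponds to 'malignant' and index 1 to 'benign'.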


Splitting Data

To understand model performance, dividing the dataset into a training set and a test set is a good strategy.

Split the dataset with the train_test_split() function. Pass it three arguments: the features, the target, and the test set size. You can also set random_state so that the random split is reproducible from run to run.

In [108]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set (70% training, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)
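
If you want the malignant/benign ratio preserved in both subsets, train_test_split() also accepts a stratify argument. The sketch below is an optional variation, not what this example uses; the variable names ending in _s are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()

# Same 70/30 split and seed, but stratified on the target
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    cancer.data, cancer.target,
    test_size=0.3, random_state=109, stratify=cancer.target,
)

print("Train/test sizes:", X_train_s.shape[0], X_test_s.shape[0])

# Fraction of benign samples (label 1) in each subset
print("Benign share, train:", np.mean(y_train_s))
print("Benign share, test: ", np.mean(y_test_s))
```

With stratification the two shares should be nearly identical, which matters more as classes become more imbalanced.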

Generating Model

Let's build a support vector machine model. First, import the svm module and create a support vector classifier object by passing kernel='linear' to the SVC() constructor.

Then fit the model on the training set using fit() and predict on the test set using predict().

In [109]:
# Import the svm module
from sklearn import svm

# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')

# Train the model using the training set
clf.fit(X_train, y_train)

# Predict the response for the test set
y_pred = clf.predict(X_test)
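
For a binary problem, predict() follows the sign of the classifier's decision_function(): a positive score maps to the second class (label 1) and a negative score to the first (label 0). The sketch below repeats the split and fit so that it runs on its own; the names scores and pred_from_sign are illustrative additions.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)

clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)

# Signed score for each test sample (its sign gives the side of the hyperplane)
scores = clf.decision_function(X_test)

# Positive score -> label 1, otherwise label 0
pred_from_sign = (scores > 0).astype(int)
print("Matches predict():", np.array_equal(pred_from_sign, clf.predict(X_test)))

# A linear kernel exposes one weight per feature
print("Weight vector shape:", clf.coef_.shape)
print("Support vectors per class:", clf.n_support_)
```

The coef_ attribute is only available with the linear kernel; other kernels do not have an explicit weight vector in the input space.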

Evaluating the Model

Let's estimate how accurately the classifier can predict whether a patient's tumor is malignant or benign.

Accuracy is computed by comparing the actual test set labels with the predicted labels.

In [110]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9649122807017544


You got a classification rate of 96.49%, which is considered very good accuracy.

For further evaluation, you can also check the precision and recall of the model.

In [111]:
# By default the positive class is label 1 ('benign' in this dataset).
# Model Precision: what percentage of tuples predicted positive are actually positive?
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model Recall: what percentage of actual positive tuples are predicted positive?
print("Recall:", metrics.recall_score(y_test, y_pred))

Precision: 0.9811320754716981
Recall: 0.9629629629629629
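
By default, precision_score() and recall_score() treat label 1 as the positive class, which is 'benign' in this dataset. If the malignant class is the one you care about most, the pos_label argument selects it. A sketch under that assumption, repeating the earlier split and fit so that it is self-contained; the *_malignant names are illustrative.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)
clf = svm.SVC(kernel='linear').fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Treat 'malignant' (label 0) as the positive class
precision_malignant = metrics.precision_score(y_test, y_pred, pos_label=0)
recall_malignant = metrics.recall_score(y_test, y_pred, pos_label=0)

print("Precision (malignant):", precision_malignant)
print("Recall (malignant):", recall_malignant)
```

Recall for the malignant class is the share of malignant tumors the model actually catches, which is often the more relevant number in a screening setting.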


### K-fold cross validation

In [112]:
from sklearn.model_selection import KFold, cross_val_score

k_fold = KFold(n_splits=10)
clf = svm.SVC(kernel='linear')
accuracies = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=k_fold)
accuracies

array([0.975     , 0.975     , 0.95      , 0.925     , 0.95      ,
       0.9       , 0.975     , 0.925     , 0.94871795, 0.94871795])

In [113]:
print("Average accuracy: ", accuracies.mean())

Average accuracy:  0.9472435897435897
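
The mean alone hides how much the folds disagree. A small sketch that also reports the spread, using the ten fold accuracies printed above (copied by hand into fold_accuracies, an illustrative name):

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above
fold_accuracies = np.array([0.975, 0.975, 0.95, 0.925, 0.95,
                            0.9, 0.975, 0.925, 0.94871795, 0.94871795])

mean_acc = fold_accuracies.mean()
std_acc = fold_accuracies.std()  # population standard deviation across folds

print(f"Accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
print("Worst fold:", fold_accuracies.min(), "Best fold:", fold_accuracies.max())
```

A large spread relative to the mean would suggest that the estimate depends heavily on which samples land in which fold.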


### Final performance using test dataset

In [114]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))

Accuracy: 0.9649122807017544
Precision: 0.9811320754716981
Recall: 0.9629629629629629


In [115]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm

array([[ 61,   2],
       [  4, 104]])
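
The accuracy, precision, and recall reported earlier can be recovered directly from this matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so with label 1 as the positive class the four cells read TN, FP, FN, TP. A short sketch using the counts shown above:

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[61, 2],
               [4, 104]])

# ravel() flattens row by row: true-0 row first, then true-1 row
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of everything predicted positive, how much is right
recall = tp / (tp + fn)      # of everything actually positive, how much is found

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

These values agree with the metrics printed for the test set: 165 of 171 predictions are correct, 104 of 106 predicted positives are right, and 104 of 108 actual positives are found.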

In [116]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.97      0.95        63
           1       0.98      0.96      0.97       108

    accuracy                           0.96       171
   macro avg       0.96      0.97      0.96       171
weighted avg       0.97      0.96      0.97       171



### Tuning hyperparameters

In [117]:
# Available hyperparameters to tune
svm_classifier = svm.SVC()
svm_classifier.get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [118]:
# More hyperparameters and candidate values can be added to this grid.
hyperparameters_to_tune = {
    'C': [0.1, 1],
    'kernel': ['linear', 'rbf']
}

In [119]:
from sklearn.model_selection import GridSearchCV

grid_search_classifier = GridSearchCV(svm.SVC(), hyperparameters_to_tune, cv=10)

grid_search_classifier.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=SVC(),
             param_grid={'C': [0.1, 1], 'kernel': ['linear', 'rbf']})

In [120]:
import pandas as pd

grid_search_cv_results = pd.DataFrame(grid_search_classifier.cv_results_)
grid_search_cv_results[['params', 'mean_test_score']]

                           params  mean_test_score
0  {'C': 0.1, 'kernel': 'linear'}         0.944679
1     {'C': 0.1, 'kernel': 'rbf'}         0.879551
2    {'C': 1, 'kernel': 'linear'}         0.944679
3       {'C': 1, 'kernel': 'rbf'}         0.899487


In [121]:
print("Best score:", grid_search_classifier.best_score_)
print("Best parameters:", grid_search_classifier.best_params_)

Best score: 0.9446794871794871
Best parameters: {'C': 0.1, 'kernel': 'linear'}
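
By default, GridSearchCV refits the best parameter combination on the whole training set, so the fitted search object can predict directly. The sketch below evaluates the tuned model on the held-out test set. As an addition that is not part of the original example, it wraps the classifier in a Pipeline with StandardScaler, since SVMs are sensitive to feature scale; the step names scaler and svc and the variable names are arbitrary choices.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)

# Scaling lives inside the pipeline so each CV fold is scaled
# using statistics from its own training part only
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", svm.SVC())])

# Same candidate values as above; the "svc__" prefix routes each
# parameter to the SVC step of the pipeline
param_grid = {"svc__C": [0.1, 1], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=10)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)

# predict() delegates to best_estimator_, which was refit on all of X_train
y_pred_tuned = search.predict(X_test)
print("Test accuracy:", metrics.accuracy_score(y_test, y_pred_tuned))
```

Because the test set plays no part in the search, the final accuracy is a fairer estimate of how the chosen configuration generalizes than the cross-validation score used to select it.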
