This notebook gives a basic walkthrough of scikit-learn's SVM classifiers, using the breast cancer dataset as an example.

In [1]:
from sklearn.svm import LinearSVC,SVC
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn import preprocessing
# Utility for splitting the data into training and test sets
from sklearn.model_selection import train_test_split

# Load the data
cancer = load_breast_cancer()

In [None]:
# First, we create the data frame
from sklearn.metrics import recall_score, precision_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score,cross_val_predict
c= pd.concat([pd.DataFrame(cancer.data),pd.DataFrame(cancer.target)], axis= 1)
a=list(cancer.feature_names)
a.append('cancer')
c.columns=a

# Next, we separate the features x from the target y and standardize the features
y = c['cancer']
x = preprocessing.scale(c.drop('cancer', axis = 1),axis=0)

# Next, we split the data into training and test sets
x_train, x_verify, y_train, y_verify=train_test_split(x,y, test_size=0.5,random_state=0)


# Basics about the sklearn specification

1. The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean. This happens automatically if you scale the data using the StandardScaler.
2. Also make sure you set the loss hyperparameter to "hinge", as it is not the default value (the default is "squared_hinge"). With the hinge loss, LinearSVC optimizes the standard soft-margin SVM objective.
3. Finally, for better performance, you should set the dual hyperparameter to False, unless there are more features than training instances. Note, however, that dual=False cannot be combined with loss="hinge": that combination raises the error below, so with the hinge loss you must keep dual=True.
Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False
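The three points above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the pipeline, the variable names (`clf`, `test_accuracy`, `hinge_primal_error`) and the `max_iter` value are assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# StandardScaler centers and scales the features; inside a pipeline it is
# fitted on the training data only.
clf = make_pipeline(
    StandardScaler(),
    # max_iter=10000 is an arbitrary choice to give the solver room to converge.
    LinearSVC(C=1, loss="hinge", dual=True, max_iter=10000),
)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)

# Combining loss="hinge" with dual=False is rejected when fit() is called.
try:
    LinearSVC(C=1, loss="hinge", dual=False).fit(X_train, y_train)
    hinge_primal_error = None
except ValueError as exc:
    hinge_primal_error = str(exc)
print(hinge_primal_error)
```

Putting the scaler inside the pipeline also means that cross-validation refits it on each training fold, which avoids leaking test-set statistics into the scaling step.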

In [63]:

# Create and fit the model
# If your SVM model is overfitting, you can try regularizing it by reducing C.
# The loss hyperparameter specifies how misclassification is penalized

model = LinearSVC(C=1, loss='hinge')
model.fit(x_train,y_train)
model.score(x_verify,y_verify)

# Use cross-validation to estimate precision and recall
cross_val_score(model,x,y,cv=3,scoring='precision')
cross_val_score(model,x,y,cv=3,scoring='recall')

# We can also get the predictions themselves
# Unlike logistic regression, scikit-learn's SVM classifiers do not return class probabilities by default


a=model.predict(x_verify)

If you want an SVM to output class probabilities, use the SVC class and set its probability hyperparameter to True. This makes the SVC class use cross-validation to estimate class probabilities, which slows down training, and it adds a predict_proba() method.
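As a rough sketch of the probability option, the example below fits an SVC with `probability=True` and compares `predict_proba()` with the raw `decision_function()` scores. The linear kernel, the pipeline and the variable names are assumptions made for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# probability=True enables predict_proba(); the probability estimates come
# from an internal cross-validation step, so training is slower.
prob_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", C=1, probability=True, random_state=0),
)
prob_clf.fit(X_train, y_train)

proba = prob_clf.predict_proba(X_test[:5])        # one column per class
scores = prob_clf.decision_function(X_test[:5])   # signed distance-like scores
print(proba)
print(scores)
```

If only a ranking or a custom threshold is needed, `decision_function()` is available without `probability=True` and avoids the extra training cost.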

# How to Tackle Linear Inseparability 1

When the data itself is hard to separate linearly, a good method is to use a kernel function that implicitly maps it to a higher-dimensional space.
The following block shows an SVC with a polynomial kernel.
If your model is overfitting, you might want to reduce the polynomial degree. Conversely, if it is underfitting, you can try increasing it. The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.

In [72]:
# Next, we do an exercise comparing the performance of different polynomial kernel degrees.

from statistics import mean
for d in [1,2,3,4,5,6]:
    model= SVC(kernel="poly", degree=d, coef0=1, C=1)
    pre_score= cross_val_score(model,x,y,cv=3,scoring='precision')
    rec_score = cross_val_score(model,x,y,cv=3,scoring='recall')
    # p*r/(p+r) is half of the F1 score 2*p*r/(p+r), so the maximum possible value is 0.5
    print (mean(list(map(lambda p,r: p*r/(p+r), pre_score, rec_score))))
    

0.48894454918224145
0.49102583009940254
0.4917233013423949
0.490334412453506
0.4874925933611028
0.4867077701186449


# How to Tackle Linear Inseparability 2

Use the RBF kernel, which implicitly maps the original data into an infinite-dimensional feature space.
γ acts like a regularization hyperparameter: if your model is overfitting, you should reduce it; if it is underfitting, you should increase it (similar to the C hyperparameter).
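One way to see the effect of γ is to compare training and validation accuracy over a few values. The sketch below uses `validation_curve`; the γ values, `C=1`, the pipeline and the variable names are assumptions chosen for illustration, not settings from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
rbf_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1))

# A very small, a moderate, and a very large gamma (illustrative values).
gammas = [0.0001, 0.01, 10]
train_scores, val_scores = validation_curve(
    rbf_clf, X, y,
    param_name="svc__gamma",  # "svc" is the step name make_pipeline assigns
    param_range=gammas,
    cv=3,
    scoring="accuracy",
)

# Average over the cross-validation folds for each gamma.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for g, tr, va in zip(gammas, train_mean, val_mean):
    print(f"gamma={g}: train accuracy={tr:.3f}, validation accuracy={va:.3f}")
```

A large gap between training and validation accuracy at a high γ is the overfitting signature described above, while low scores on both at a very small γ point to underfitting.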

In [73]:
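from statistics import mean
for g in [1,2,3,4,5,6,7]:
    # Note: C=0.001 regularizes very strongly; the identical scores printed for every gamma
    # suggest the model predicts the same class for all samples. Try a larger C to see the effect of gamma.
    model= SVC(kernel="rbf",gamma=g , C=0.001)
    pre_score= cross_val_score(model,x,y,cv=3,scoring='precision')
    rec_score = cross_val_score(model,x,y,cv=3,scoring='recall')
    # p*r/(p+r) is half of the F1 score 2*p*r/(p+r)
    print (mean(list(map(lambda p,r: p*r/(p+r), pre_score, rec_score))))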

0.38553005786015493
0.38553005786015493
0.38553005786015493
0.38553005786015493
0.38553005786015493
0.38553005786015493
0.38553005786015493


# How to choose kernel

As a rule of thumb, you should always try the linear kernel first (remember that LinearSVC is much faster than SVC(kernel="linear")), especially if the training set is very large or if it has plenty of features. If the training set is not too large, you should also try the Gaussian RBF kernel; it works well in most cases. Then, if you have spare time and computing power, you can experiment with a few other kernels, using cross-validation and grid search. You'd want to experiment like that especially if there are kernels specialized for your training set's data structure.
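The kernel comparison described above could be run with a grid search along these lines. This is a sketch under stated assumptions: the parameter grid, the `f1` scoring choice and the step names `scaler` and `svc` are illustrative and do not come from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# One sub-grid per kernel, so gamma is only searched where it applies.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": [0.001, 0.01, 0.1]},
]

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

For a large training set, the linear sub-grid could be replaced by a LinearSVC pipeline, in line with the speed remark above.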