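# Problem 3: Classification with Support Vector Machines

Tasks:
1. Use a support vector machine to classify spam on the spambase dataset
2. Train the model on the training set, then compute accuracy, precision, recall, and F1 on the test set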
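As a reminder of what the four metrics measure, here is a minimal sketch that computes them from the confusion counts of a binary task. The helper name `binary_metrics` and the toy labels are made up for illustration, and label 1 is assumed to be the positive (spam) class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Toy labels (hypothetical): 3 true positives, 1 false positive,
# 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On real data the `sklearn.metrics` functions are the better choice; this version only spells out the definitions.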

## Use SVC to complete the spambase classification task

Requirement: use all features and fill in the table below

###### Double-click here to fill in

Kernel | C | Accuracy | Precision | Recall | F1
- | - | - | - | - | -
rbf | 0.1 | 0.7 | 0.65 | 0.46 | 0.54
rbf | 1 | 0.72 | 0.67 | 0.49 | 0.57
linear | 0.1 | 0.89 | 0.85 | 0.86 | 0.85
linear | 1 | 0.78 | 0.64 | 0.97 | 0.77
sigmoid | 0.1 | 0.63 | 0.52 | 0.53 | 0.52
sigmoid | 1 | 0.63 | 0.52 | 0.54 | 0.53

In [1]:
# Load the data (columns 0-56 are the features, column 57 is the label)
import numpy as np
data = np.loadtxt('data/spambase/spambase.data', delimiter = ",")
spamx = data[:, :57]
spamy = data[:, 57]

In [2]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(spamx, spamy, test_size = 0.3, random_state = 32)
trainX.shape, trainY.shape, testX.shape, testY.shape

((3220, 57), (3220,), (1381, 57), (1381,))
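The shapes above follow from the 4601 rows of the dataset and `test_size = 0.3`. A small arithmetic sketch, assuming the test count is rounded up and the remainder goes to the training set (which is consistent with the shapes printed above):

```python
import math

n_samples = 4601   # 3220 + 1381, the row counts printed above
test_size = 0.3

# 0.3 * 4601 = 1380.3, which is rounded up to a whole number of test rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```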

**Note: for the linear kernel, use the LinearSVC class rather than SVC(kernel = 'linear'). LinearSVC does not take a kernel parameter!**
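To make the difference concrete, the sketch below fits both classes on a small synthetic dataset (two well-separated clusters, invented here and unrelated to spambase) and checks which estimator exposes a `kernel` parameter.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# Synthetic, well-separated two-class data (an assumption; not spambase).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

linear_svc = LinearSVC(C=1).fit(X, y)               # no kernel argument
kernel_svc = SVC(kernel='linear', C=1).fit(X, y)    # kernel chosen explicitly

# Does each estimator list 'kernel' among its parameters?
has_kernel_param = ('kernel' in linear_svc.get_params(),
                    'kernel' in kernel_svc.get_params())
print(has_kernel_param)

# Fraction of points on which the two models predict the same label.
agreement = np.mean(linear_svc.predict(X) == kernel_svc.predict(X))
print(agreement)
```

Both models learn a linear decision boundary, but they are separate implementations with different defaults, so their predictions need not match exactly on harder data.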

In [3]:
# Import the models and the evaluation metrics
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [4]:
# RBF kernel, C=0.1
clf_r0 = SVC(kernel='rbf', C=0.1)
clf_r0.fit(trainX, trainY)
prediction = clf_r0.predict(testX)
as_spam_r = round(accuracy_score(testY, prediction), 2)
ps_spam_r = round(precision_score(testY, prediction), 2)
rs_spam_r = round(recall_score(testY, prediction), 2)
f1_spam_r = round(f1_score(testY, prediction), 2)
print('rbf: C=0.1')
print('Accuracy:', as_spam_r, 'Precision:', ps_spam_r, 'Recall:', rs_spam_r, 'F1:', f1_spam_r)


rbf: C=0.1
Accuracy: 0.7 Precision: 0.65 Recall: 0.46 F1: 0.54


In [5]:
# RBF kernel, C=1
clf_r0 = SVC(kernel='rbf', C=1)
clf_r0.fit(trainX, trainY)
prediction = clf_r0.predict(testX)
as_spam_r = round(accuracy_score(testY, prediction), 2)
ps_spam_r = round(precision_score(testY, prediction), 2)
rs_spam_r = round(recall_score(testY, prediction), 2)
f1_spam_r = round(f1_score(testY, prediction), 2)
print('rbf: C=1')
print('Accuracy:', as_spam_r, 'Precision:', ps_spam_r, 'Recall:', rs_spam_r, 'F1:', f1_spam_r)

rbf: C=1
Accuracy: 0.72 Precision: 0.67 Recall: 0.49 F1: 0.57


In [6]:
# Linear kernel via LinearSVC, C=0.1
clf = LinearSVC(C=0.1)
clf.fit(trainX, trainY)
prediction = clf.predict(testX)
as_spam_l = round(accuracy_score(testY, prediction), 2)
ps_spam_l = round(precision_score(testY, prediction), 2)
rs_spam_l = round(recall_score(testY, prediction), 2)
f1_spam_l = round(f1_score(testY, prediction), 2)
print('linear: C=0.1')
print('Accuracy:', as_spam_l, 'Precision:', ps_spam_l, 'Recall:', rs_spam_l, 'F1:', f1_spam_l)

linear: C=0.1
Accuracy: 0.89 Precision: 0.85 Recall: 0.86 F1: 0.85




In [7]:
# Linear kernel via LinearSVC, C=1
clf = LinearSVC(C=1)
clf.fit(trainX, trainY)
prediction = clf.predict(testX)
as_spam_l = round(accuracy_score(testY, prediction), 2)
ps_spam_l = round(precision_score(testY, prediction), 2)
rs_spam_l = round(recall_score(testY, prediction), 2)
f1_spam_l = round(f1_score(testY, prediction), 2)
print('linear: C=1')
print('Accuracy:', as_spam_l, 'Precision:', ps_spam_l, 'Recall:', rs_spam_l, 'F1:', f1_spam_l)


linear: C=1
Accuracy: 0.78 Precision: 0.64 Recall: 0.97 F1: 0.77




In [8]:
# Sigmoid kernel, C=0.1
clf_s0 = SVC(kernel='sigmoid', C=0.1)
clf_s0.fit(trainX, trainY)
prediction = clf_s0.predict(testX)
as_spam_s = round(accuracy_score(testY, prediction), 2)
ps_spam_s = round(precision_score(testY, prediction), 2)
rs_spam_s = round(recall_score(testY, prediction), 2)
f1_spam_s = round(f1_score(testY, prediction), 2)
print('sigmoid: C=0.1')
print('Accuracy:', as_spam_s, 'Precision:', ps_spam_s, 'Recall:', rs_spam_s, 'F1:', f1_spam_s)

# Sigmoid kernel, C=1
clf_s1 = SVC(kernel='sigmoid', C=1)
clf_s1.fit(trainX, trainY)
prediction = clf_s1.predict(testX)
as_spam_s = round(accuracy_score(testY, prediction), 2)
ps_spam_s = round(precision_score(testY, prediction), 2)
rs_spam_s = round(recall_score(testY, prediction), 2)
f1_spam_s = round(f1_score(testY, prediction), 2)
print('sigmoid: C=1')
print('Accuracy:', as_spam_s, 'Precision:', ps_spam_s, 'Recall:', rs_spam_s, 'F1:', f1_spam_s)


sigmoid: C=0.1
Accuracy: 0.63 Precision: 0.52 Recall: 0.53 F1: 0.52
sigmoid: C=1
Accuracy: 0.63 Precision: 0.52 Recall: 0.54 F1: 0.53
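The cells above repeat the same fit, predict, and score steps for every kernel and C value. One way to collapse them is a small helper plus a loop, sketched below. The `evaluate` helper and the `demo_*` names are hypothetical, and the data is a synthetic stand-in from `make_classification` so that the block runs on its own; with the notebook's `trainX`, `trainY`, `testX`, `testY` the printed rows would correspond to the table at the top.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC


def evaluate(clf, train_X, train_y, test_X, test_y):
    """Fit clf, predict on the test set, and return the four rounded metrics."""
    pred = clf.fit(train_X, train_y).predict(test_X)
    return (round(accuracy_score(test_y, pred), 2),
            round(precision_score(test_y, pred), 2),
            round(recall_score(test_y, pred), 2),
            round(f1_score(test_y, pred), 2))


# Synthetic stand-in data (an assumption); swap in trainX, trainY, testX, testY.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
demo_trainX, demo_testX, demo_trainY, demo_testY = train_test_split(
    X, y, test_size=0.3, random_state=32)

results = {}
for C in (0.1, 1):
    models = {'rbf': SVC(kernel='rbf', C=C),
              'linear': LinearSVC(C=C),   # linear kernel via LinearSVC
              'sigmoid': SVC(kernel='sigmoid', C=C)}
    for kernel, clf in models.items():
        results[(kernel, C)] = evaluate(clf, demo_trainX, demo_trainY,
                                        demo_testX, demo_testY)

# One row per (kernel, C): kernel | C | accuracy | precision | recall | F1
for (kernel, C), scores in results.items():
    print(kernel, C, *scores, sep=' | ')
```

Keeping the scores in a dictionary keyed by `(kernel, C)` avoids the reuse of variable names across cells and makes transcription mistakes in the table less likely.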


# Optional: compare the running time of LinearSVC and SVC(kernel = 'linear')

In [9]:
# YOUR CODE HERE
import time
# Time fit + predict for LinearSVC with C=0.1
start = time.time()
clf = LinearSVC(C=0.1)
clf.fit(trainX, trainY)
prediction = clf.predict(testX)
end = time.time()
print("LinearSVC running time", end - start)


LinearSVC running time 0.5790750980377197




In [10]:
# Time fit + predict for LinearSVC with C=1
start = time.time()
clf = LinearSVC(C=1)
clf.fit(trainX, trainY)
prediction = clf.predict(testX)
end = time.time()
print("LinearSVC running time", end - start)

LinearSVC running time 0.5755894184112549




In [11]:
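# Time fit + predict for SVC(kernel='linear') with C=0.1
start = time.time()
clf_r0 = SVC(kernel='linear', C=0.1)
clf_r0.fit(trainX, trainY)
prediction = clf_r0.predict(testX)
end = time.time()
print("SVC(kernel='linear') running time", end - start)

SVC(kernel='linear') running time 98.0829975605011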


**SVC(kernel = 'linear') takes far longer than LinearSVC here: about 98 s versus roughly 0.58 s.**
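The three timing cells share one pattern, which can be wrapped in a helper. The sketch below uses a hypothetical `fit_predict_seconds` function and `time.perf_counter`, which is generally better suited to measuring intervals than `time.time`. It runs on small synthetic data (an assumption), so the gap it reports will not resemble the one measured on spambase.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC


def fit_predict_seconds(clf, train_X, train_y, test_X):
    """Return the wall-clock seconds spent on fit followed by predict."""
    start = time.perf_counter()
    clf.fit(train_X, train_y).predict(test_X)
    return time.perf_counter() - start


# Small synthetic stand-in for spambase (an assumption, for a quick run).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
train_X, train_y, test_X = X[:350], y[:350], X[350:]

timings = {
    'LinearSVC': fit_predict_seconds(LinearSVC(C=0.1), train_X, train_y, test_X),
    "SVC(kernel='linear')": fit_predict_seconds(
        SVC(kernel='linear', C=0.1), train_X, train_y, test_X),
}
for name, seconds in timings.items():
    print(name, 'running time', seconds)
```

Labelling each measurement by its dictionary key keeps the printed name tied to the model that was actually timed.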

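**LinearSVC is built on liblinear, whose regularization term also penalizes the intercept; SVC is built on libsvm, which does not penalize the intercept.**
**liblinear is specialized for linear models, whereas libsvm uses a general kernel solver even when the kernel is linear, so on large datasets liblinear converges much faster. The speed-up comes from the model being linear; it does not require the data to be linearly separable.**
**(In short: on large-scale linear problems, LinearSVC is faster.)**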
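The intercept remark can be written out as two objectives. This is a sketch under assumptions: labels $y_i \in \{-1, +1\}$, scikit-learn's default settings (hinge loss for SVC, L2 penalty with squared hinge loss for LinearSVC), and $s$ standing for LinearSVC's `intercept_scaling`, with the intercept handled as the weight $\tilde{b}$ of an extra constant feature.

$$
\text{SVC (libsvm):}\qquad
\min_{w,\,b}\ \frac{1}{2}\, w^\top w \;+\; C \sum_i \max\bigl(0,\ 1 - y_i\,(w^\top x_i + b)\bigr)
$$

$$
\text{LinearSVC (liblinear):}\qquad
\min_{w,\,\tilde{b}}\ \frac{1}{2}\,\bigl(w^\top w + \tilde{b}^{\,2}\bigr) \;+\; C \sum_i \max\bigl(0,\ 1 - y_i\,(w^\top x_i + s\,\tilde{b})\bigr)^2
$$

In the first objective $b$ appears only inside the loss, so it is not regularized. In the second, $\tilde{b}$ appears in the penalty term, which is why the two classes can give different results even with the same C.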