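# Accuracy

Accuracy is the proportion of correctly classified samples among all samples: $$Accuracy = \frac{n_{correct}}{n_{total}}$$

where $n_{correct}$ is the number of correctly classified samples and $n_{total}$ is the total number of samples.

Accuracy is the simplest and most intuitive evaluation metric for classification, but it has an obvious flaw: when the classes are highly imbalanced, the majority class dominates the score. For example, when negative samples make up 99% of the data, a classifier that predicts every sample as negative still reaches 99% accuracy. In other words, high overall accuracy does not imply high accuracy on the minority class.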
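The 99% example can be checked with a few lines of plain Python. This is only a sketch on made-up data: the 990/10 class split and the helper names `accuracy` and `per_class_accuracy` are assumptions for illustration, not part of the notebook.

```python
# Hypothetical data: 990 negative samples (0) and 10 positive samples (1).
y_true = [0] * 990 + [1] * 10
# A trivial "classifier" that predicts negative for every sample.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    return n_correct / len(y_true)


def per_class_accuracy(y_true, y_pred, label):
    """Accuracy restricted to the samples whose true label is `label`."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in pairs) / len(pairs)


overall = accuracy(y_true, y_pred)               # 0.99
neg_acc = per_class_accuracy(y_true, y_pred, 0)  # 1.0
pos_acc = per_class_accuracy(y_true, y_pred, 1)  # 0.0
print(overall, neg_acc, pos_acc)
```

The overall score looks excellent even though not a single positive sample is recognized, which is why accuracy alone is a poor summary on imbalanced data.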

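# Precision and Recall

Precision is the number of correctly classified positive samples divided by the number of samples the classifier labeled as positive.

Recall is the number of correctly classified positive samples divided by the number of samples that are actually positive.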
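The two definitions can be restated in formula form. The confusion-matrix counts $TP$ (true positives), $FP$ (false positives), and $FN$ (false negatives) are symbols introduced here for illustration; they do not appear elsewhere in these notes:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

For instance, with hypothetical counts $TP = 8$, $FP = 2$, and $FN = 4$, Precision is $8/10 = 0.8$ and Recall is $8/12 \approx 0.67$.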

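Precision and Recall are in tension with each other yet complementary. To raise Precision, the classifier should label a sample as positive only when it is 'more confident', but being this conservative usually means missing many positive samples it is 'less sure' about, which lowers Recall.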
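A small sketch can make this trade-off concrete. The scores, the labels, and the helper `precision_recall` are all hypothetical; the only point is that raising the decision threshold tends to trade Recall for Precision.

```python
# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]


def precision_recall(scores, labels, threshold):
    """Predict positive when score >= threshold, then compute both metrics."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l == 1 for p, l in zip(preds, labels))
    fp = sum(p and l == 0 for p, l in zip(preds, labels))
    fn = sum((not p) and l == 1 for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


strict = precision_recall(scores, labels, 0.85)   # conservative threshold
lenient = precision_recall(scores, labels, 0.25)  # permissive threshold
print('strict :', strict)   # precision 1.0, recall 0.4
print('lenient:', lenient)  # precision 5/7, recall 1.0
```

With the conservative threshold every predicted positive is correct but three of the five positives are missed; with the permissive threshold all positives are found at the cost of two false positives.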

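In ranking problems there is usually no fixed threshold that separates positive from negative results. Instead, a ranking model is evaluated with the Precision and Recall of its Top N results: the Top N results returned by the model are treated as the samples it judged positive, and Precision and Recall are then computed over those first N positions (Precision@N and Recall@N).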
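One possible way to compute Precision@N and Recall@N is sketched below. The ranked relevance list and the function name `precision_recall_at_n` are assumptions made for illustration.

```python
def precision_recall_at_n(ranked_labels, n):
    """Compute Precision@N and Recall@N.

    ranked_labels: relevance labels (1 = positive, 0 = negative),
    already sorted by model score with the best result first.
    """
    top_n = ranked_labels[:n]
    hits = sum(top_n)                 # positives found in the top N
    total_pos = sum(ranked_labels)    # all positives in the list
    precision = hits / n
    recall = hits / total_pos if total_pos else 0.0
    return precision, recall


# Hypothetical ranking of 10 results containing 4 positives.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
p3, r3 = precision_recall_at_n(ranked, 3)  # 2/3 and 2/4
p5, r5 = precision_recall_at_n(ranked, 5)  # 3/5 and 3/4
print(p3, r3, p5, r5)
```

As N grows, Recall@N can only stay the same or increase, while Precision@N may go either way depending on where the positives sit in the ranking.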

In [99]:
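# Note: fetch_mldata was removed in scikit-learn 0.22; newer versions provide fetch_openml instead.
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')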
print(mnist.keys())

dict_keys(['DESCR', 'COL_NAMES', 'target', 'data'])




# Testing accuracy, precision, recall, and decision_function with SGDClassifier

In [4]:
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.keys())

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [15]:
x,y = iris['data'],iris['target']
print(iris['target_names'])
print(x.shape,y.shape)
print(x[:10],y[:10])

['setosa' 'versicolor' 'virginica']
(150, 4) (150,)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]] [0 0 0 0 0 0 0 0 0 0]


In [49]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state=42)
print(len(x_train),len(x_test))

120 30


In [93]:
from sklearn.linear_model import SGDClassifier
import numpy as np
sgd_clf = SGDClassifier(max_iter=500,tol=np.inf)
print(sgd_clf)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=500,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=inf,
       validation_fraction=0.1, verbose=0, warm_start=False)


In [94]:
# Use grid search to find the best value of alpha
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha':[0.0001,0.001,0.01,0.1,0.5,1,10]}
gsc_clf = GridSearchCV(sgd_clf,param_grid,cv=5,verbose=2,n_jobs=-1)
gsc_clf.fit(x_train,y_train)
print(gsc_clf.best_params_)

Fitting 5 folds for each of 7 candidates, totalling 35 fits
{'alpha': 0.01}


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  35 out of  35 | elapsed:    0.0s finished


In [96]:
from sklearn.metrics import accuracy_score,precision_score,recall_score
sgd_clf = SGDClassifier(alpha = 0.01,max_iter=500,tol=np.inf)
y_train_binary = (y_train==2)
y_test_binary = (y_test==2)

sgd_clf.fit(x_train,y_train_binary)
y_pred = sgd_clf.predict(x_test)
print(y_pred)
print('accuracy:',accuracy_score(y_test_binary,y_pred))
print('precision:',precision_score(y_test_binary,y_pred))
print('recall:',recall_score(y_test_binary,y_pred))

[False False  True False False False False  True False False  True False
 False False False False  True False False  True False  True False  True
  True  True  True  True False False]
accuracy: 1.0
precision: 1.0
recall: 1.0


In [97]:
sgd_clf.decision_function(x_test)

array([ -1.59514741, -21.33742071,   9.92195581,  -1.51654601,
        -2.12670602, -20.28895799,  -5.77567603,   1.65149002,
        -0.35750844,  -4.64594746,   1.3410811 , -19.41768147,
       -22.50311942, -19.40036252, -20.84449669,  -1.84827056,
         6.0664906 ,  -4.02912631,  -1.24477169,   5.78139401,
       -18.27329839,   1.09062241, -18.78220914,   5.42302492,
         2.97572375,   2.89445789,   5.31777898,   5.70412483,
       -18.70094329, -18.28662067])
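For a binary linear classifier such as this one, `predict` returns the positive class when the decision score is greater than 0. The sketch below applies that rule by hand to the first eight scores copied from the output above, then tries a stricter threshold. The helper `predict_with_threshold` and the threshold value 5.0 are hypothetical choices for illustration.

```python
# First eight scores copied from the decision_function output above.
scores = [-1.59514741, -21.33742071, 9.92195581, -1.51654601,
          -2.12670602, -20.28895799, -5.77567603, 1.65149002]


def predict_with_threshold(scores, threshold=0.0):
    """Label a sample positive when its score exceeds the threshold."""
    return [s > threshold for s in scores]


# Threshold 0 matches the first eight entries of y_pred printed earlier.
default_pred = predict_with_threshold(scores)
# A stricter, hypothetical threshold keeps only the most confident positive.
strict_pred = predict_with_threshold(scores, threshold=5.0)
print(default_pred)
print(strict_pred)
```

Moving the threshold this way is how one would trade Recall for Precision without retraining the model.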

# Testing accuracy and decision_function with SVM

In [76]:
from sklearn.svm import SVC
svc_clf = SVC(random_state=42)
print(svc_clf)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=42,
  shrinking=True, tol=0.001, verbose=False)


In [80]:
svc_clf.fit(x_train,y_train)
y_pred = svc_clf.predict(x_test)
print(y_pred)
print('accuracy:',accuracy_score(y_test,y_pred))

[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
accuracy: 1.0




In [78]:
svc_clf.decision_function(x_test)

array([[-0.34247012,  2.3545421 ,  0.98792802],
       [ 2.33722198,  0.88091852, -0.2181405 ],
       [-0.18833381,  0.83901002,  2.34932379],
       [-0.35303414,  2.35000768,  1.00302646],
       [-0.31330558,  2.28813783,  1.02516775],
       [ 2.36507018,  0.85593114, -0.22100132],
       [-0.27540886,  2.48027321,  0.79513565],
       [-0.2863676 ,  1.00300792,  2.28335968],
       [-0.3328408 ,  2.26376023,  1.06908057],
       [-0.32006764,  2.5       ,  0.82006764],
       [-0.31659828,  1.06215688,  2.2544414 ],
       [ 2.38293247,  0.81784973, -0.2007822 ],
       [ 2.37598257,  0.82541645, -0.20139902],
       [ 2.37937215,  0.82869748, -0.20806963],
       [ 2.38386256,  0.82223633, -0.2060989 ],
       [-0.3241883 ,  2.30215196,  1.02203635],
       [-0.28397955,  0.80983391,  2.47414564],
       [-0.32261479,  2.47464289,  0.84797189],
       [-0.35583633,  2.36421965,  0.99161669],
       [-0.29353484,  0.82341044,  2.47012439],
       [ 2.36890345,  0.83693733, -0.205
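With `decision_function_shape='ovr'`, each row of the output holds one score per class, and the class with the highest score usually agrees with `predict` (tie handling may differ between scikit-learn versions). The sketch below checks this by hand on the first three rows copied from the output above; the helper `argmax` is a hypothetical name.

```python
# First three rows copied from the SVC decision_function output above.
rows = [
    [-0.34247012, 2.3545421, 0.98792802],
    [2.33722198, 0.88091852, -0.2181405],
    [-0.18833381, 0.83901002, 2.34932379],
]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# Pick the class with the highest score in each row.
predicted = [argmax(row) for row in rows]
print(predicted)  # [1, 0, 2], the same as the first three entries of y_pred
```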

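# Mean Squared Error, Root Mean Squared Error, and Mean Absolute Percentage Error

Mean squared error: $$MSE = \frac{1}{n}\sum_{i=1}^n(\hat{y}_i - y_i)^2$$

where $y_i$ is the true value of the $i$-th sample, $\hat{y}_i$ is its predicted value, and $n$ is the number of samples.

Root mean squared error: $$RMSE = \sqrt{MSE}$$

Both MSE and RMSE are sensitive to outliers, which can distort the final model evaluation.

Mean absolute percentage error: $$MAPE = \sum_{i=1}^n\left|\frac{\hat{y}_i - y_i}{y_i}\right| \times \frac{100}{n}$$

MAPE is more robust to outliers: it effectively normalizes the error at each point, which reduces the impact of the large absolute errors caused by a few outliers.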
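The three metrics can be compared on a small made-up example. The data and the function names `mse`, `rmse`, and `mape` are assumptions for illustration; the last sample plays the role of an outlier with a large absolute error.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (requires nonzero y_true)."""
    total = sum(abs((p - t) / t) for t, p in zip(y_true, y_pred))
    return total * 100 / len(y_true)


# Hypothetical values: the last point has a large absolute error.
y_true = [100.0, 100.0, 100.0, 1000.0]
y_pred = [110.0, 90.0, 100.0, 500.0]

mse_value = mse(y_true, y_pred)    # 62550.0
rmse_value = rmse(y_true, y_pred)  # about 250.1
mape_value = mape(y_true, y_pred)  # about 17.5 (percent)
print(mse_value, rmse_value, mape_value)
```

The single large error pushes RMSE to about 250 even though three of the four predictions are within 10 of the target, while MAPE stays at about 17.5% because each error is divided by its true value. Note that MAPE is undefined when any true value is 0.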

# ROC Curve