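## Hyperparameters
Hyperparameters: parameters that must be set before the algorithm runs  
Model parameters: parameters learned while the algorithm runs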
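
To make the distinction concrete, here is a minimal sketch. It uses `Ridge` regression on made-up toy data, neither of which appears in this notebook; they are assumptions chosen only because a linear model has visible learned parameters. `alpha` is fixed by us before training, while `coef_` and `intercept_` are produced by `fit()`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data generated from y = 2x + 1 (illustrative only, not from the notebook).
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0

# alpha is a hyperparameter: we choose it before calling fit().
model = Ridge(alpha=1e-6)
model.fit(X_toy, y_toy)

# coef_ and intercept_ are model parameters: fit() learns them from the data.
print(model.coef_, model.intercept_)
```

In the kNN examples below, `n_neighbors`, `weights`, and `p` play the role of `alpha`: they are passed to the constructor and must be chosen by search.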

### 1. How do we find good hyperparameters?
- Domain knowledge
- Empirical default values
- Experimental search

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [2]:
digits = datasets.load_digits()

X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

### 2. Finding the best k

In [3]:
best_score = 0.0
best_k = -1

for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score

print('best_k: ', best_k)
print('best_score: ', best_score)

best_k:  4
best_score:  0.9916666666666667


### 3. Should neighbors be weighted by distance?

In [4]:
best_method = ""
best_score = 0.0
best_k = -1

for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_method = method
            best_k = k
            best_score = score
            
print('best_method: ', best_method)            
print('best_k: ', best_k)
print('best_score: ', best_score)

best_method:  uniform
best_k:  4
best_score:  0.9916666666666667
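
The two weighting schemes can disagree. The sketch below uses a three-point toy dataset (an assumption made up for illustration, not part of the digits data): the query point has one close neighbor of class 0 and two farther neighbors of class 1. With `weights="uniform"` every neighbor gets one vote; with `weights="distance"` each vote is weighted by the inverse of the neighbor's distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: one class-0 point near the query, two class-1 points farther away.
X_toy = np.array([[0.0], [3.0], [4.0]])
y_toy = np.array([0, 1, 1])
query = np.array([[1.0]])  # distances to the three points: 1, 2, 3

uniform_clf = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Uniform: class 1 wins with 2 votes against 1.
uniform_pred = uniform_clf.predict(query)[0]
# Distance: class 0 scores 1/1 = 1.0, class 1 scores 1/2 + 1/3, so class 0 wins.
distance_pred = distance_clf.predict(query)[0]

print(uniform_pred, distance_pred)
```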


### 4. Minkowski distance
Use the Minkowski distance as the metric and search over its hyperparameter `p`
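
As a reminder of what `p` controls, the Minkowski distance of order `p` is the `p`-th root of the sum of the absolute coordinate differences raised to the power `p`. The helper below, `minkowski_distance`, is a hypothetical name written for this sketch and is not a scikit-learn function; it shows that `p=1` gives the Manhattan distance and `p=2` the Euclidean distance.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski_distance(a, b, 1)  # |3| + |4|
euclidean = minkowski_distance(a, b, 2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```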

In [5]:
%%time

best_method = ""
best_score = 0.0
best_k = -1
best_p = -1

# p is the Minkowski exponent; it is tried here for both weighting schemes
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        for p in range(1, 6):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method, p=p)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_method = method
                best_k = k
                best_score = score
                best_p = p
            
print('best_method: ', best_method)            
print('best_k: ', best_k)
print('best_score: ', best_score)
print('best_p: ', best_p)


best_method:  uniform
best_k:  4
best_score:  0.9916666666666667
best_p:  2
CPU times: user 1min 27s, sys: 595 ms, total: 1min 28s
Wall time: 1min 28s


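### 5. Grid Search
Grid search is an exhaustive search over specified parameter values: every combination of estimator parameters is scored with cross-validation, and the best-scoring combination is selected.

The parameter grid is a list of dictionaries; each dictionary defines one grid to search.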

In [6]:
params_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': list(range(1, 11))
    },
    {
        'weights': ['distance'],
        'n_neighbors': list(range(1, 11)),
        'p': list(range(1, 6))
    }
]
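
To see how many candidates a grid like this produces, it can be expanded with `ParameterGrid` from `sklearn.model_selection`. The sketch below rebuilds the same grid under the assumed name `example_grid` so that it runs on its own; each dictionary is expanded separately, so the `uniform` candidates never receive a `p` value.

```python
from sklearn.model_selection import ParameterGrid

# Same structure as the grid above, redefined so this sketch is self-contained.
example_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]

# Each dictionary is expanded into the Cartesian product of its value lists.
candidates = list(ParameterGrid(example_grid))
print(len(candidates))  # 10 uniform candidates + 10 * 5 distance candidates
print(candidates[0])
```

Every candidate is then fitted once per cross-validation fold, which is why the grid search takes noticeably longer than the hand-written loops.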

In [7]:
knn_clf = KNeighborsClassifier()

In [8]:
%%time
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(knn_clf, params_grid)
grid_search.fit(X_train, y_train)



CPU times: user 5min 51s, sys: 364 ms, total: 5min 52s
Wall time: 5min 52s


In [9]:
grid_search.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=3,
           weights='distance')

In [10]:
grid_search.best_params_

{'n_neighbors': 3, 'p': 3, 'weights': 'distance'}

In [11]:
grid_search.best_score_

0.9853862212943633
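
Note that `best_score_` is the mean cross-validated score of the best candidate, computed on folds of the training data, so it is not directly comparable to the test-set scores from the manual loops above. The sketch below shows how the two numbers are obtained side by side. It uses a deliberately small grid (`small_grid`, an assumption chosen so the example runs quickly) rather than the full grid from this section, and `cv=5` is passed explicitly.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666
)

# Small illustrative grid (assumption), not the full grid from this section.
small_grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), small_grid, cv=5)
search.fit(X_train, y_train)

# Mean cross-validated accuracy of the best candidate (training data only).
cv_score = search.best_score_
# Accuracy of the refitted best estimator on the held-out test set.
test_score = search.best_estimator_.score(X_test, y_test)
print(search.best_params_, cv_score, test_score)
```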

In [None]:
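# n_jobs: number of CPU cores to use in parallel (-1 uses all available cores)
# verbose: controls how much progress information is printed during the search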
grid_search = GridSearchCV(knn_clf, params_grid, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)