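##### How the K-nearest neighbors algorithm works
- If most of the k samples most similar to a given sample (that is, its k nearest neighbors in feature space) belong to one class, the sample is assigned to that class as well.
- Origin: KNN is a classification algorithm first proposed by Cover and Hart.
- The distance between two samples can be measured in several ways, for example the Euclidean distance.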
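
The majority-vote rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how sklearn implements it: the helper names `euclidean` and `knn_predict` and the toy data are assumptions made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training samples closest to query."""
    # Sort (features, label) pairs by distance to the query and keep the k closest
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # a
print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))  # b
```

An odd k is a common choice for two-class problems because it avoids tied votes.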

In [3]:
# import datasets
# Load a built-in dataset; iris.data is a 2-D numpy.ndarray of shape [n_samples, n_features]
from sklearn.datasets import load_iris
iris = load_iris()
iris.data.shape
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

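##### Estimators in sklearn: the API that implements the algorithms
1. For classification:
    - sklearn.neighbors: k-nearest neighbors
    - sklearn.naive_bayes: naive Bayes
    - sklearn.linear_model.LogisticRegression: logistic regression
2. For regression:
    - sklearn.linear_model.LinearRegression: linear regression
    - sklearn.linear_model.Ridge: ridge regression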
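
The estimators listed above share the same `fit` / `predict` / `score` interface, so they can be swapped freely. The sketch below fits three of the classifiers on the iris data. `GaussianNB` and `max_iter=1000` are choices made for this example, and the printed value is accuracy on the training data, so it says little about generalization.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Three different algorithms behind one common estimator interface
estimators = [
    KNeighborsClassifier(),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
]

scores = {}
for est in estimators:
    est.fit(X, y)                       # learn from features and labels
    name = type(est).__name__
    scores[name] = est.score(X, y)      # accuracy on the training data
    print(name, scores[name])
```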

##### K-nearest neighbors API
```python
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')
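    n_neighbors: int, optional, default 5. Number of neighbors used by default for queries.
    algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional. Algorithm used to compute the nearest neighbors:
        'ball_tree' uses BallTree,
        'kd_tree' uses KDTree,
        'brute' uses a brute-force search,
        'auto' tries to pick the most suitable algorithm based on the values passed to fit. (Different implementations affect efficiency.)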
```
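
As a quick check of the parameters above, this sketch fits `KNeighborsClassifier` with each explicit `algorithm` setting and predicts the first five iris samples. The choice of samples is arbitrary. It only illustrates that `algorithm` changes how neighbors are searched, not what is being computed.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

predictions = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn.fit(X, y)
    # Predict the class of the first five samples (all setosa, class 0)
    predictions[algorithm] = knn.predict(X[:5])
    print(algorithm, predictions[algorithm])
```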

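##### Model selection
- Hyperparameters: parameters that must be set by hand rather than learned from the data
- Cross-validation: split the data into N parts; one part serves as the validation set and the rest as the training set. Run N rounds, using a different part for validation each time, and take the mean score as the final result
- Grid search: a model may have several hyperparameters. Define a set of candidate hyperparameter combinations, evaluate each with cross-validation, and build the model with the best combination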
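
Cross-validation and grid search can be combined by hand before reaching for a dedicated API. The sketch below scores each candidate `n_neighbors` with 5-fold cross-validation via `cross_val_score` and keeps the best one. The candidate values and `cv=5` are assumptions chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

mean_scores = {}
for k in (3, 5, 7):
    # One accuracy value per fold; each fold serves as the validation set once
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = fold_scores.mean()
    print(k, mean_scores[k])

# The "grid search" step: keep the hyperparameter with the best mean score
best_k = max(mean_scores, key=mean_scores.get)
print("best n_neighbors:", best_k)
```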

##### Grid search API
``````python
sklearn.model_selection.GridSearchCV(estimator, param_grid=None,cv=None)
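    estimator: the estimator object
    param_grid: estimator parameters (dict), e.g. {"n_neighbors": [1, 3, 5]}
    cv: number of cross-validation folds
    
Result attributes:
    best_score_: mean cross-validated score of the best parameter combination
    best_estimator_: the estimator refit with the best parameters
    cv_results_: per-fold and mean validation scores for every parameter combination (training scores are included only when return_train_score=True)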
``````

In [23]:
def kNeigh(data):
    
    # split data into training and test data
    # Split the dataset
    # Training data: used to fit the model
    # Test data: used to evaluate whether the model works
    # train_test_split returns: training features, test features, training labels, test labels
    # from sklearn.model_selection import train_test_split
    # x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size = 0.25)
    
    # standardization for both training and test data
    # (fit the scaler on the training features only, then reuse it on the test features)
    # from sklearn.preprocessing import StandardScaler
    # ss = StandardScaler()
    # x_train = ss.fit_transform(x_train)
    # x_test = ss.transform(x_test)
    
    # build model using k nearest neighbors
    from sklearn.neighbors import KNeighborsClassifier
    
    # grid search
    from sklearn.model_selection import GridSearchCV
    
    knn = KNeighborsClassifier()
    gs = GridSearchCV(knn, param_grid = {"n_neighbors": [3, 5, 7]}, cv = 10)
    
    gs.fit(data.data, data.target)
    #print(gs.predict(x_test))
    #print(gs.score(x_test, y_test))
    
    print("Mean cross-validated score of the best_estimator (see below):")
    print(gs.best_score_)
    print("=" * 50)
    
    print("Best estimator is:")
    print(gs.best_estimator_)
    print("=" * 50)
    
    print("Detailed results are:")
    print(gs.cv_results_)
    print("=" * 50)
    
    return None


if __name__ == "__main__":
    kNeigh(iris)

Mean cross-validated score of the best_estimator (see below):
0.9666666666666667
Best estimator is:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')
Detailed results are:
{'mean_fit_time': array([0.00069294, 0.0004894 , 0.00052092]), 'std_fit_time': array([2.65232546e-04, 7.82706844e-05, 1.08963679e-04]), 'mean_score_time': array([0.00117686, 0.00090501, 0.00099604]), 'std_score_time': array([0.00032406, 0.0001196 , 0.00023415]), 'param_n_neighbors': masked_array(data=[3, 5, 7],
             mask=[False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 7}], 'split0_test_score': array([1., 1., 1.]), 'split1_test_score': array([0.93333333, 0.93333333, 0.93333333]), 'split2_test_score': array([1., 1., 1.]), 'split3_test_score': array([0.93333333, 1.        , 1.        ]), 'split4_test_score'
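
The split and standardization steps that are commented out in `kNeigh` could be combined with the grid search roughly as follows. This is a sketch under assumptions, not the notebook's method: `make_pipeline`, `random_state=0`, and `stratify=y` are additions for this example. Putting the scaler inside the pipeline means it is refit on the training portion of every fold, so the validation fold does not influence the scaling.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 25% of the data; it is never seen during the grid search
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
gs = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": [3, 5, 7]},
    cv=10,
)
gs.fit(x_train, y_train)

test_accuracy = gs.score(x_test, y_test)  # accuracy on the held-out test set
print("best params:", gs.best_params_)
print("cross-validated score:", gs.best_score_)
print("test accuracy:", test_accuracy)
```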

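##### Strengths and weaknesses of K-nearest neighbors
- Strength: no parameters to estimate (there is no explicit training step)
- Weaknesses: results depend on the choice of k; prediction is computationally expensive and memory-hungry, because the training samples must be kept and compared against each query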