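# SVM Optimization: Sample Selection
## Idea
A standard SVM often needs a large number of support vectors, and intuitively that number grows with the size of the training set. A straightforward way to reduce the number of support vectors is therefore to reduce the number of training samples. Fewer samples not only lower the number of support vectors required but can also remove some of the noise in the data.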

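## Methods
The implementation provides two ways to reduce the number of samples.
- **Random sampling**
    This is the simplest approach, but a random subsample does not necessarily represent the whole dataset well, so it may affect the generalization of the final model.
- **Clustering**
    Selecting samples by clustering is more intuitive: points that are close in space are merged into a single point, and the merged points vote on its class, similar in spirit to the majority vote in a decision tree. The benefit is that the selected samples are more representative, which intuitively helps generalization. We use four common clustering methods: K-means, mean shift, DBSCAN, and Gaussian mixture models.
    - **K-means**
        K-means is the most widely used clustering method. The number of clusters must be given in advance, and K-means then assigns every point to one of those clusters. In our design, we first partition the data with K-means and then compute the sum of the y values in each cluster. If the sum is 0, the class of the cluster cannot be determined, so we discard the cluster center and put all of its original points back into the sample set. If the sum is nonzero, we add the cluster center to the sample set and use the sign of the sum as its class (the y-value vote).
    - **Mean shift**
        Mean shift does not need the number of clusters in advance. It finds the modes of a kernel density estimate and uses them as cluster centroids, then assigns each sample to its nearest centroid, which removes K-means' need to supply initial centroids beforehand. We first partition the data with mean shift and then compute the sum of the y values in the same way: if the sum is 0, all points of the cluster are added to the sample set; otherwise the cluster center is added.
    - **DBSCAN**
        DBSCAN also does not need the number of clusters in advance; instead it takes a distance threshold for the $\epsilon$-neighborhood and a minimum number of samples. Its advantages are that it can find clusters of arbitrary shape and can identify noise points. We first partition the data with DBSCAN and apply the same y-value vote to each cluster, using the mean of the cluster's points as its center. Because DBSCAN flags noise points, we also add those noise points to the sample set unchanged.
    - **Gaussian mixture model**
        A Gaussian mixture model describes the data as a mixture of several components, each defined by a Gaussian probability density. We first partition the data with a Gaussian mixture model and then apply the same y-value vote.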
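
The y-value vote shared by the four clustering methods can be sketched independently of the clustering step. The block below is a minimal illustration, not the implementation used in this notebook: the function name `vote_reduce` and its parameters are hypothetical, labels are assumed to be +1/-1, and `-1` is assumed to be the noise label (as DBSCAN in scikit-learn reports it). When no centers are passed, the mean of each cluster's points is used as its representative.

```python
import numpy as np

def vote_reduce(X, y, labels, centers=None, noise_label=-1):
    """Replace each cluster by one labelled point; keep tied clusters and noise as-is.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    Xs, ys = [], []
    for c in np.unique(labels):
        mask = labels == c
        vote = np.sign(y[mask].sum())
        if c == noise_label or vote == 0:
            # Noise points and clusters with a tied vote are kept unchanged.
            Xs.append(X[mask])
            ys.append(y[mask])
        else:
            # One representative per cluster, labelled by the sign of the vote.
            rep = X[mask].mean(axis=0) if centers is None else np.asarray(centers)[c]
            Xs.append(rep[None, :])
            ys.append(np.array([vote]))
    return np.vstack(Xs), np.concatenate(ys)

# Toy data: cluster 0 is clearly positive, cluster 1 is tied, one noise point.
X = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0], [5.0, 5.0]])
y = np.array([1, 1, 1, -1, -1])
labels = np.array([0, 0, 1, 1, -1])
Xs, ys = vote_reduce(X, y, labels)
```

On this toy input the five points shrink to four: the noise point and the two points of the tied cluster survive unchanged, and cluster 0 collapses to its mean with label +1.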

## Performance Analysis
**Random sampling**

|Ratio|Support vectors|Training accuracy|Test accuracy|
|:------:|:------:|:------:|:------:|
|1|199|0.922|0.924|
|0.1|28|0.920|0.914|
|0.2|51|0.920|0.922|
|0.3|68|0.927|0.924|

**K-means**

|Clusters|Support vectors|Training accuracy|Test accuracy|
|:------:|:------:|:------:|:------:|
|Original|199|0.922|0.924|
|500|139|0.896|0.924|
|300|83|0.909|0.924|
|200|59|0.905|0.918|

**Mean shift**

|Bandwidth|Support vectors|Training accuracy|Test accuracy|
|:------:|:------:|:------:|:------:|
|Original|199|0.922|0.924|
|0.1|50|0.941|0.926|
|0.2|29|0.917|0.924|
|0.3|50|0.765|0.926|

**DBSCAN**

|eps|min_samples|Support vectors|Training accuracy|Test accuracy|
|:------:|:------:|:------:|:------:|:------:|
|Original|Original|199|0.922|0.924|
|0.1|10|137|0.893|0.924|
|0.1|20|199|0.922|0.924|
|0.2|10|17|0.936|0.916|

**Gaussian mixture model**

|n_components|Support vectors|Training accuracy|Test accuracy|
|:------:|:------:|:------:|:------:|
|Original|199|0.922|0.924|
|500|142|0.891|0.924|
|300|81|0.907|0.924|
|100|29|0.926|0.920|

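## Conclusion
The results above show that both random sampling and clustering can substantially reduce the number of support vectors. With random sampling, the number of support vectors grows as the sampling ratio increases and test accuracy rises with it; at a ratio of 0.3 the test accuracy matches the original 0.924 and the training accuracy (0.927) slightly exceeds the original 0.922. With clustering, the number of support vectors generally falls as the number of clusters decreases, while test accuracy stays essentially unchanged, so generalization is largely unaffected. Training accuracy is mostly stable as well, with the exception of mean shift at a bandwidth of 0.3, where it drops to 0.765. Mean shift and DBSCAN can, however, take noticeably longer to run under some parameter settings, which reflects the runtime behavior of these two algorithms.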

In [None]:
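import numpy as np
from sklearn.cluster import KMeans, MeanShift, DBSCAN
from sklearn.mixture import GaussianMixture

def _Select(self,X,y):
    """
        select the training samples according to self.select
        1) origin: keep all samples
        2) random: random subsample
        3) cluster: cluster representatives
    """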
    
    # utilize the original X as the sample of the data
    if self.select == 'origin':
        self.Ns = np.shape(X)[0]
        return X,y
    # utilize numpy.random.choice to sample a specific percentage of data, then return the data
    elif self.select == 'random':
        sample_percentage = self.selectargs.get('per', 0.1)
        self.Ns = int(np.shape(X)[0] * sample_percentage)
        sample_id = np.random.choice(np.shape(X)[0], self.Ns, replace = False)
        # index with the sampled ids directly instead of copying row by row
        return np.asarray(X)[sample_id], np.asarray(y)[sample_id]
    
    # implement 4 clustering methods to cluster all data
    elif self.select == 'cluster':
        method = self.selectargs.get('method', 'kmeans')
        # utilize K-Means to cluster different data, and calculate the sum of y value of each cluster, 
        # if the sum is positive, then the y value of the cluster is +1, else if the sum is negative,
        # then the y value of the cluster is -1, if the sum is zero, we put all data belonging to the
        # cluster back to the sample set.
        if method == 'kmeans':
            clusters = self.selectargs.get('clusters', 10)
            kmeans = KMeans(n_clusters = clusters, random_state=0).fit(X)
            Xs = kmeans.cluster_centers_
            # ys accumulates the sum of the y values of each cluster
            ys = np.zeros(clusters)
            for i in range(np.shape(X)[0]):
                ys[kmeans.labels_[i]] += y[i]
            ys = np.sign(ys)
            zero_indices = np.where(ys == 0)
            Xs = np.delete(Xs, zero_indices, 0)
            ys = np.delete(ys, zero_indices)
            for i in range(np.shape(X)[0]):
                if kmeans.labels_[i] in zero_indices[0]:
                    Xs = np.append(Xs, [X[i]], 0)
                    ys = np.append(ys, y[i])
            self.Ns = np.shape(Xs)[0]
            return Xs,ys

        # utilize Mean-shift to cluster different data, and calculate the sum of y value of each cluster, 
        # if the sum is positive, then the y value of the cluster is +1, else if the sum is negative,
        # then the y value of the cluster is -1, if the sum is zero, we put all data belonging to the
        # cluster back to the sample set.
        elif method == 'meanshift':
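            # accept the correctly spelled key, falling back to the original 'bandwith' spelling
            bw = self.selectargs.get('bandwidth', self.selectargs.get('bandwith', 0.1))
            meanshift = MeanShift(bandwidth = bw).fit(X)
            Xs = meanshift.cluster_centers_
            # size ys by the number of centers so that every label is a valid index
            clusters = Xs.shape[0]
            ys = np.zeros(clusters)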
            for i in range(np.shape(X)[0]):
                ys[meanshift.labels_[i]] += y[i]
            ys = np.sign(ys)
            zero_indices = np.where(ys == 0)
            Xs = np.delete(Xs, zero_indices, 0)
            ys = np.delete(ys, zero_indices)
            for i in range(np.shape(X)[0]):
                if meanshift.labels_[i] in zero_indices[0]:
                    Xs = np.append(Xs, [X[i]], 0)
                    ys = np.append(ys, y[i])
            self.Ns = np.shape(Xs)[0]
            return Xs,ys

        # utilize DBSCAN to cluster different data, and calculate the sum of y value of each cluster, 
        # if the sum is positive, then the y value of the cluster is +1, else if the sum is negative,
        # then the y value of the cluster is -1, if the sum is zero, we put all data belonging to the
        # cluster back to the sample set
        # finally, we put all noise data back to the sample set.
        elif method == 'dbscan':
            argeps = self.selectargs.get('eps', 0.1)
            argmin_samples = self.selectargs.get('min_samples', 10)
            dbscan = DBSCAN(eps = argeps, min_samples = argmin_samples).fit(X)
            Xs = np.zeros((0, np.shape(X)[1]))
            ys = np.zeros(0)
            for cnum in np.unique(dbscan.labels_):
                if cnum == -1:
                    # label -1 marks noise points; append them unchanged
                    indices = np.where(dbscan.labels_ == -1)
                    Xs = np.append(Xs, X[indices], axis = 0)
                    ys = np.append(ys, y[indices])
                else:
                    indices = np.where(dbscan.labels_ == cnum)
                    tempX = X[indices]
                    tempy = y[indices]
                    yvalue = np.sum(tempy)
                    yvalue = np.sign(yvalue)
                    if yvalue == 0:
                        Xs = np.append(Xs, tempX, axis = 0)
                        ys = np.append(ys, tempy)
                    else:
                        Xs = np.append(Xs, [np.mean(tempX, axis = 0)], axis = 0)
                        ys = np.append(ys, yvalue)
            self.Ns = np.shape(Xs)[0]
            return Xs,ys

        # utilize Gaussian Mixture Model to cluster different data, and calculate the sum of y value of each cluster, 
        # if the sum is positive, then the y value of the cluster is +1, else if the sum is negative,
        # then the y value of the cluster is -1, if the sum is zero, we put all data belonging to the
        # cluster back to the sample set
        elif method == 'gmm':
            components = self.selectargs.get('n_components', 100)
            gmm = GaussianMixture(n_components = components).fit(X)
            Xs = gmm.means_
            ys = np.zeros(components)
            label = gmm.predict(X)
            for i in range(len(label)):
                ys[label[i]] += y[i]
            ys = np.sign(ys)
            zero_indices = np.where(ys == 0)
            Xs = np.delete(Xs, zero_indices, 0)
            ys = np.delete(ys, zero_indices)
            for i in range(np.shape(X)[0]):
                if label[i] in zero_indices[0]:
                    Xs = np.append(Xs, [X[i]], 0)
                    ys = np.append(ys, y[i])
            self.Ns = np.shape(Xs)[0]
            return Xs,ys
        # unknown clustering method: fall back to the full data set
        self.Ns = np.shape(X)[0]
        return X,y