# 利用k-Means聚类算法对未标注数据分组

## 10.1 K-均值聚类算法
1. 优点： 容易实现
2. 缺点： 可能收敛到局部最小值，在大规模数据集上收敛较慢
3. 适用数据类型： 数值型数据

## k-Means聚类的一般流程
1. 收集数据： 使用任意方法
2. 准备数据： 需要数值型数据来计算距离，也可以将标称型数据映射为二值型数据再用于距离计算
3. 分析数据： 使用任意方法
4. 训练算法： 不适用于无监督学习，即无监督学习没有训练过程
5. 测试算法： 应用聚类算法、观察结果，可以使用量化的误差指标如误差平方和来评价算法的结果
6. 使用算法： 可以用于所希望的任何应用，通常情况下，簇质心可以代表整个簇的数据来做出决策

In [117]:
## k-Means聚类支持函数
from numpy import *

# 导入数据
def loadDataSet(fileName):
    dataMat = []
    with open(fileName) as fr:
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float,curLine))
            dataMat.append(fltLine)
    return dataMat

#计算两个向量的欧式距离
def distEclud(vecA,vecB):
    return sqrt(sum(power(vecA - vecB,2)))

# 为给定数据集构建一个包含k个随机质心的集合
def randCent(dataSet,k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k,n)))
    #  print(dataSet)
    for j in range(n):
        try:
            minJ = min(dataSet[:,j])
            maxJ = max(dataSet[:,j])
        except:
            print(dataSet)
        rangeJ = float(maxJ - minJ)
        centroids[:,j] = minJ + rangeJ * random.rand(k,1)
    return centroids

In [118]:
dataMat = mat(loadDataSet('testSet.txt'))
# print(dataMat)
print(min(dataMat[:,0]))
print(min(dataMat[:,1]))
print(max(dataMat[:,0]))
print(max(dataMat[:,1]))

[[-5.379713]]
[[-4.232586]]
[[4.838138]]
[[5.1904]]


In [119]:
print(randCent(dataMat,2))
distEclud(dataMat[0],dataMat[1])

[[-0.46629237  2.83074054]
 [ 1.00458155 -3.09816814]]


5.184632816681332

In [120]:
## k-Means聚类算法
def kMeans(dataSet,k,distMeans=distEclud,createCent=randCent):
    '''
    参数:
    数据集
    簇的数目
    计算距离的函数
    创建初始质心的函数
    '''
    # 数据集中数据点的总数
    m = shape(dataSet)[0]
    # 存放簇分类结果，共两列，第一列记录簇索引值，第二列存储误差（即点到质心的距离）
    clusterAssment = mat(zeros((m,2)))
    # 创建初始质心
    centroids = createCent(dataSet,k)
    # 标志变量，如果任一点的簇分配结果发生改变，则更新ClusterChanged标志
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # 遍历所有数据找到距离每个点最近的质心
        for i in range(m):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeans(centroids[j,:],dataSet[i,:])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i,0] != minIndex:
                clusterChanged = True
            clusterAssment[i,:] = minIndex,minDist ** 2
        # print("centroids:",centroids)
        # 遍历所有的质心，更新它们的取值
        for cent in range(k):
            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A == cent)[0]]
            # 按列求平均
            centroids[cent,:] = mean(ptsInClust,axis=0)
    print("centroids:",centroids)
    return centroids,clusterAssment

In [121]:
dataMat = mat(loadDataSet('testSet.txt'))
mycentroids,clusterAssment = kMeans(dataMat,4)

centroids: [[-2.46154315  2.78737555]
 [-3.53973889 -2.89384326]
 [ 2.65077367 -2.79019029]
 [ 2.6265299   3.10868015]]


> 上面的结果可以看出经过3次迭代之后，K-均值算法收敛，共有4个质心

## 10.2 使用后处理来提高聚类性能
1. 度量聚类效果的指标是SSE
2. SSE值越小表示数据点越接近于它们的质心
3. 后处理：一种方法是将具有最大的SSE值的簇划分成两个簇
    具体实现方法：可以将最大簇包含的点过滤出来并在这些点上运行k-means算法，其中k=2
4. 为了保持簇总数不变，可以将两个簇进行合并，合并最近的质心，或者合并两个使得SSE值增幅最小的质心

## 10.3 二分 k-means 算法
选择SSE最大的簇进行划分，直到簇数目达到用户指定的数目为止

In [122]:
## 二分 k - means 均值聚类算法
def biKmeans(dataSet,k,distMeans = distEclud):
    m = shape(dataSet)[0]
    # 簇分配结果矩阵
    clusterAssment = mat(zeros((m,2)))
    #计算整个数据集的每个特征的质心，centroid0:1*n列表
    centroid0 = mean(dataSet,axis=0).tolist()[0]
    #用centList列表保存所有质心
    centList = [centroid0]
    #遍历所有点，计算每个点到质心的误差值，并保存
    for j in range(m):
        clusterAssment[j,1] = distMeans(mat(centroid0),dataSet[j,:]) ** 2
    while (len(centList) < k):
        lowestSSE = inf
        #遍历簇列表中的每个簇，然后将每个簇都生成两个簇，同时给出每个簇的误差
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0]]
            print("ptsInCurrCluster长度",len(ptsInCurrCluster))
            centroidMat,splitClustAss = kMeans(ptsInCurrCluster,2,distMeans)
            sseSplit = sum(splitClustAss[:,1])
            # print("i:",i)
            #剩余数据集的误差
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])
            print("sseSplit,and notDplit:",sseSplit,sseNotSplit)
            # 将该簇划分后的误差和剩余数据的误差之和作为本次最终误差，
            # 若本次划分的SSE值最小，则本次划分保存，即执行划分操作
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSEE = sseSplit + sseNotSplit
        #更新簇的分配结果，且新的质点会被添加到centList中
        bestClustAss[nonzero(bestClustAss[:,0].A==1)[0],0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:,0].A==0)[0],0] = bestCentToSplit
        print("最佳分割点",bestCentToSplit)
        print("最佳聚类簇分配矩阵长度",len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0,:]
        centList.append(bestNewCents[1,:])
        clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:] = bestClustAss
        print("clusterAssment:",clusterAssment)
    return centList,clusterAssment

In [123]:
dataMat3 = mat(loadDataSet('testSet2.txt'))
centList,myNewAssments = biKmeans(dataMat3,3)
centList

ptsInCurrCluster长度 60
centroids: [[-0.00675605  3.22710297]
 [-0.45965615 -2.7782156 ]]
sseSplit,and notDplit: 453.0334895807502 0.0
最佳分割点 0
最佳聚类簇分配矩阵长度 60
clusterAssment: [[0.00000000e+00 1.08435724e+01]
 [0.00000000e+00 1.15291655e+01]
 [1.00000000e+00 1.02184582e+00]
 [0.00000000e+00 3.55915016e+00]
 [0.00000000e+00 4.74982956e+00]
 [1.00000000e+00 3.87167519e+00]
 [0.00000000e+00 7.83989775e+00]
 [0.00000000e+00 9.61405951e+00]
 [1.00000000e+00 3.53809057e+00]
 [0.00000000e+00 1.37355664e+01]
 [0.00000000e+00 4.39471728e-01]
 [1.00000000e+00 2.56674394e-02]
 [0.00000000e+00 5.47620209e+00]
 [0.00000000e+00 8.58850041e+00]
 [1.00000000e+00 2.11734245e+00]
 [0.00000000e+00 1.44228725e+01]
 [0.00000000e+00 1.88713505e+01]
 [1.00000000e+00 9.76749869e-03]
 [0.00000000e+00 8.25991037e+00]
 [0.00000000e+00 1.30087682e+01]
 [1.00000000e+00 9.41791924e-01]
 [0.00000000e+00 2.78013075e+00]
 [0.00000000e+00 4.33512814e+00]
 [1.00000000e+00 1.48785604e-01]
 [0.00000000e+00 2.08107319e+01]
 [0

[matrix([[-0.00675605,  3.22710297]]),
 matrix([[ 0.35496167, -3.36033556]]),
 matrix([[-1.12616164, -2.30193564]])]

In [124]:
dataMat[nonzero(clusterAssment[:,0].A == 1)[0],:]

matrix([[-5.379713, -3.362104],
        [-3.487105, -1.724432],
        [-2.786837, -3.099354],
        [-3.195883, -2.283926],
        [-3.403367, -2.778288],
        [-4.007257, -3.207066],
        [-3.674424, -2.261084],
        [-2.579316, -3.497576],
        [-3.837877, -3.253815],
        [-2.121479, -4.232586],
        [-3.762093, -2.432191],
        [-4.323818, -3.938116],
        [-4.009299, -2.978115],
        [-3.171184, -3.572452],
        [-2.565729, -2.012114],
        [-2.651229, -3.103198],
        [-4.599622, -2.185829],
        [-2.793241, -2.149706],
        [-4.905566, -2.91107 ]])

In [125]:
bestclustAss = matrix([[1,2.3019],
                       [2,6.683]])
clustAss = matrix([[0,1.44],
                  [0,2.77]])
clustAss[nonzero(clustAss[:,0].A == 0)[0],:] = bestclustAss
clustAss

matrix([[1.    , 2.3019],
        [2.    , 6.683 ]])

In [126]:
clusterAssment

matrix([[ 3.        ,  2.3201915 ],
        [ 0.        ,  1.39004893],
        [ 2.        ,  7.46974076],
        [ 1.        ,  3.60477283],
        [ 3.        ,  2.7696782 ],
        [ 0.        ,  2.80101213],
        [ 2.        ,  5.10287596],
        [ 1.        ,  1.37029303],
        [ 3.        ,  2.29348924],
        [ 0.        ,  0.64596748],
        [ 2.        ,  1.72819697],
        [ 1.        ,  0.60909593],
        [ 3.        ,  2.51695402],
        [ 0.        ,  0.13871642],
        [ 2.        ,  9.12853034],
        [ 2.        , 10.63785781],
        [ 3.        ,  2.39726914],
        [ 0.        ,  3.1024236 ],
        [ 2.        ,  0.40704464],
        [ 1.        ,  0.49023594],
        [ 3.        ,  0.13870613],
        [ 0.        ,  0.510241  ],
        [ 2.        ,  0.9939764 ],
        [ 1.        ,  0.03195031],
        [ 3.        ,  1.31601105],
        [ 0.        ,  0.90820377],
        [ 2.        ,  0.54477501],
        [ 1.        ,  0.316

## 10.4 示例：对地图上的点进行聚类
示例：对于地理数据应用二分K-均值算法  
1. 收集数据： 使用Yahoo！PlaceFinder API收集数据（这个API已停用）
2. 准备数据： 只保留经纬度信息
3. 分析数据： 使用Matplotlib来构建一个二维数据图，其中包含簇与地图
4. 训练算法： 训练不适用无监督学习
5. 测试算法： 使用10.4节中的biKmeans()函数
6. 使用算法： 最后的输出是包含簇及簇中心的地图

### 10.4.2 对地理坐标进行聚类
需要注意的是：  
不能用经纬度直接计算距离，北极附近没走几米的经度变化可能达到数10度，而在赤道附近走相同的距离，带来的经度可能只是零点几  
可以用球面余弦定理来计算两个经纬度之间的距离

In [127]:
## 球面距离及簇绘图函数
def distSLC(vecA,vecB):
    a = sin(vecA[0,1]*pi/180) * sin(vecB[0,1]*pi/180)
    b = cos(vecA[0,1]*pi/180) * cos(vecB[0,1]*pi/180) * cos(pi * (vecB[0,1]-vecA[0,0])/180)
    return arccos(a + b) * 6371

import matplotlib
import matplotlib.pyplot as plt

def clusterClubs(numClust=5):
    datList = []
    for line in open('places.txt').readlines():
        lineArr = line.split('\t')
        datList.append([float(lineArr[4]),float(lineArr[3])])
    datMat = mat(datList)
    #  print(len(datMat))
    myCentroids,clustAssing = biKmeans(datMat,numClust,distMeans=distSLC)
    
    fig = plt.figure()
    rect = [0.1,0.1,0.8,0.8]
    scatterMarkers = ['s','o','^','8','p','d','v','h','>','<']
    axprops = dict(xticks=[],yticks=[])
    ax0 = fig.add_axes(rect,label='ax0',**axprops)
    imgP = plt.imread('Portland.png')
    ax0.imshow(imgP)
    ax1 = fig.add_axes(rect,label='ax1',frameon=False)
    for i in range(numClust):
        ptsInCurrCluster = datMat[nonzero(clustAssing[:,0].A==i)[0],:]
        markerStyle = scatterMarkers[i % len(scatterMarkers)]
        ax1.scatter(ptsInCurrCluster[:,0].flatten().A[0],ptsInCurrCluster[:,1].flatten().A[0],
                   marker=markerStyle,s=90)
    ax1.scatter(myCentroids[:,0].flatten().A[0],myCentroid[:,1].flatten().A[0],marker='+',s=300)
    plt.show()

In [130]:
clusterClubs(3)

ptsInCurrCluster长度 69
centroids: [[         nan          nan]
 [-122.6316762   45.5123067]]
sseSplit,and notDplit: 6663124362.554555 0.0
最佳分割点 0
最佳聚类簇分配矩阵长度 69
clusterAssment: [[1.00000000e+00 9.66171379e+07]
 [1.00000000e+00 9.66042283e+07]
 [1.00000000e+00 9.65944468e+07]
 [1.00000000e+00 9.66182234e+07]
 [1.00000000e+00 9.65749960e+07]
 [1.00000000e+00 9.66037467e+07]
 [1.00000000e+00 9.67262276e+07]
 [1.00000000e+00 9.65761262e+07]
 [1.00000000e+00 9.67868944e+07]
 [1.00000000e+00 9.67325317e+07]
 [1.00000000e+00 9.66643797e+07]
 [1.00000000e+00 9.66192856e+07]
 [1.00000000e+00 9.66592283e+07]
 [1.00000000e+00 9.63057183e+07]
 [1.00000000e+00 9.65537519e+07]
 [1.00000000e+00 9.65460768e+07]
 [1.00000000e+00 9.65507611e+07]
 [1.00000000e+00 9.65073850e+07]
 [1.00000000e+00 9.64691237e+07]
 [1.00000000e+00 9.65539267e+07]
 [1.00000000e+00 9.65509961e+07]
 [1.00000000e+00 9.64275384e+07]
 [1.00000000e+00 9.65999269e+07]
 [1.00000000e+00 9.65952954e+07]
 [1.00000000e+00 9.65537519e+07]

  return N.ndarray.mean(self, axis, dtype, out, keepdims=True)._collapse(axis)
  ret, rcount, out=ret, casting='unsafe', subok=False)


UnboundLocalError: local variable 'maxJ' referenced before assignment