In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

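Suppose there is a blog platform where users publish posts. We run a clustering analysis on the posts so that popular articles can be displayed by category.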

In [2]:
from time import time
from sklearn.datasets import load_files

print("loading documents ...")
t = time()
docs = load_files('datasets/clustering/data')
print("summary: {0} documents in {1} categories.".format(
    len(docs.data), len(docs.target_names)))
print("done in {0} seconds".format(time() - t))

loading documents ...
summary: 3949 documents in 4 categories.
done in 3.5900049209594727 seconds


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

max_features = 20000
print("vectorizing documents ...")
t = time()
vectorizer = TfidfVectorizer(max_df=0.4, 
                             min_df=2, 
                             max_features=max_features, 
                             encoding='latin-1')
X = vectorizer.fit_transform((d for d in docs.data))
print("n_samples: %d, n_features: %d" % X.shape)
print("number of non-zero features in sample [{0}]: {1}".format(
    docs.filenames[0], X[0].getnnz()))
print("done in {0} seconds".format(time() - t))

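# Parameter reference: https://blog.csdn.net/laobai1015/article/details/80451371
# max_df / min_df: a float is a proportion of documents (drop terms above max_df or below min_df); an integer is an absolute document count.
# max_features: keep only the top max_features terms, ordered by term frequency across the corpus.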

vectorizing documents ...
n_samples: 3949, n_features: 20000
number of non-zero features in sample [datasets/clustering/data\sci.electronics\11902-54322]: 56
done in 3.9820303916931152 seconds


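max_df=0.4 means that a word appearing in more than 40% of the documents is treated as a high-frequency word that does not help clustering, so it is dropped when the vocabulary is built.  
min_df=2 means that a word that is too rare, appearing in fewer than two documents (that is, in only one), is also dropped from the vocabulary.  
max_features limits the vocabulary size further: terms are ranked by their frequency across the corpus, and only the top max_features terms are kept.  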
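To make the three filters concrete, here is a minimal sketch on a made-up five-document corpus (the documents and variable names are illustrative assumptions, not part of the dataset above). It shows which terms survive each setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the dataset used in this notebook.
toy_docs = [
    "the key encrypts the message",
    "the shuttle reached orbit",
    "the key opens the lock",
    "the orbit of the moon",
    "the circuit needs a resistor",
]

# No filtering: every token of two or more characters enters the vocabulary.
full = TfidfVectorizer().fit(toy_docs)

# max_df=0.4 drops terms found in more than 40% of the documents ("the" is in all 5).
# min_df=2 drops terms found in fewer than 2 documents.
filtered = TfidfVectorizer(max_df=0.4, min_df=2).fit(toy_docs)

# max_features=3 keeps the 3 terms with the highest total count in the corpus.
limited = TfidfVectorizer(max_features=3).fit(toy_docs)

print(len(full.vocabulary_))
print(sorted(filtered.vocabulary_))  # ['key', 'orbit']
print(sorted(limited.vocabulary_))   # ['key', 'orbit', 'the']
```

Note that "key" and "orbit" each appear in exactly 2 of 5 documents (40%), so they are not above the max_df threshold and are kept.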
   
   
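The TfidfVectorizer class converts the whole document collection into a matrix. Each row represents one document, and each element in a row is the importance of the corresponding term, measured by TF-IDF.
○ fit_transform() combines fit() and transform()
○ fit() analyzes the corpus and extracts the vocabulary
○ transform() converts each document into a vector; together the vectors form the matrix stored in X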
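The relationship between the three methods can be checked on a tiny example. This is only a sketch with a made-up three-document corpus; `toy_docs`, `X_one_step` and `X_two_steps` are names chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus for illustration only.
toy_docs = ["key chip encryption", "space shuttle orbit", "chip orbit key"]

# One step: learn the vocabulary and IDF weights, then build the matrix.
X_one_step = TfidfVectorizer().fit_transform(toy_docs)

# Two steps: fit() learns the vocabulary, transform() builds the matrix.
vec = TfidfVectorizer()
vec.fit(toy_docs)
X_two_steps = vec.transform(toy_docs)

print(X_one_step.shape)  # (number of documents, vocabulary size)
print(np.allclose(X_one_step.toarray(), X_two_steps.toarray()))
```

Keeping the fitted vectorizer around matters in practice: new documents must go through transform() on the same object so that they share the vocabulary learned by fit().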
   

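The output shows that the vector built from one article is sparse: most of its elements are 0.  
That makes sense: the vocabulary has 20000 terms, while the sample article contains only 56 distinct vocabulary terms.  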
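As a quick worked calculation, the share of non-zero entries in that row follows directly from the two numbers printed above (the variable names are chosen here for illustration):

```python
n_features = 20000  # vocabulary size reported in the output above
nnz = 56            # non-zero entries in the first document's row

density = nnz / n_features
print("density: {:.2%}".format(density))  # density: 0.28%
```

With fewer than 1% of entries populated, storing X as a sparse matrix is far cheaper than a dense 3949 x 20000 array.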

In [5]:
from sklearn.cluster import KMeans

print("clustering documents ...")
t = time()
n_clusters = 4
kmean = KMeans(n_clusters=n_clusters, 
               max_iter=100,
               tol=0.01,
               verbose=1,
               n_init=3)
kmean.fit(X);
print("kmean: k={}, cost={}".format(n_clusters, int(kmean.inertia_)))
print("done in {0} seconds".format(time() - t))
# n_init=3 means 3 random initializations are tried and the best run is kept as the model: https://blog.csdn.net/EleanorWiser/article/details/70226704

clustering documents ...
Initialization complete
Iteration  0, inertia 7502.026
Iteration  1, inertia 3844.598
Iteration  2, inertia 3832.021
Iteration  3, inertia 3828.654
Iteration  4, inertia 3827.405
Iteration  5, inertia 3826.808
Iteration  6, inertia 3826.471
Iteration  7, inertia 3826.302
Iteration  8, inertia 3826.195
Iteration  9, inertia 3826.019
Iteration 10, inertia 3825.825
Iteration 11, inertia 3825.739
Iteration 12, inertia 3825.628
Iteration 13, inertia 3825.543
Iteration 14, inertia 3825.510
Iteration 15, inertia 3825.465
Iteration 16, inertia 3825.449
Iteration 17, inertia 3825.441
Iteration 18, inertia 3825.440
Iteration 19, inertia 3825.438
Iteration 20, inertia 3825.432
Iteration 21, inertia 3825.431
Iteration 22, inertia 3825.429
Iteration 23, inertia 3825.428
Iteration 24, inertia 3825.426
Iteration 25, inertia 3825.421
Iteration 26, inertia 3825.416
Converged at iteration 26: center shift 0.000000e+00 within tolerance 4.896692e-07
Initialization complete
Iterati

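The number of clusters is 4.  
max_iter=100 means k-means runs at most 100 iterations.  
tol=0.01 is the convergence tolerance: once the cluster centers move less than this (relative) threshold between two iterations, the algorithm is considered converged and stops.  
verbose=1 prints progress information for each iteration.  
n_init=3 means the algorithm is run with 3 random initializations and the best result (lowest inertia) is kept as the model.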
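The same parameters can be tried on a small synthetic dataset where the right answer is obvious. This is a sketch under assumptions: the three 2-D blobs, the random seed, and the variable names are made up for illustration and are unrelated to the document matrix X.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three tight, well-separated 2-D blobs of 20 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.1 * rng.randn(20, 2) for c in centers])

# Same settings as above, minus verbose; random_state fixed for repeatability.
km = KMeans(n_clusters=3, max_iter=100, tol=0.01, n_init=3, random_state=0)
km.fit(points)

# inertia_ is the sum of squared distances from each point to its assigned center.
manual_inertia = ((points - km.cluster_centers_[km.labels_]) ** 2).sum()

print(km.inertia_, manual_inertia)
print(km.n_iter_)                # iterations used by the best of the 3 runs
print(np.bincount(km.labels_))   # cluster sizes
```

The inertia printed here is the same quantity that appears as "cost" in the notebook output and in each "Iteration ..., inertia ..." log line.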

In [6]:
len(kmean.labels_)

3949

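In total, k-means was run 3 times, converging after 26, 55, and 24 iterations respectively. This automatically groups the 3949 documents into clusters.  
kmean.labels_ holds the cluster assigned to each document.  
As expected, len(kmean.labels_) is 3949. We can also look at the cluster assignments of the 10 documents at indices 1000 to 1009, together with their file names.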

In [7]:
kmean.labels_[1000:1010]

array([2, 2, 2, 1, 3, 1, 3, 2, 1, 1])

In [8]:
docs.filenames[1000:1010]

array(['datasets/clustering/data\\sci.crypt\\10888-15289',
       'datasets/clustering/data\\sci.crypt\\11490-15880',
       'datasets/clustering/data\\sci.crypt\\11270-15346',
       'datasets/clustering/data\\sci.electronics\\12383-53525',
       'datasets/clustering/data\\sci.space\\13826-60862',
       'datasets/clustering/data\\sci.electronics\\11631-54106',
       'datasets/clustering/data\\sci.space\\14235-61437',
       'datasets/clustering/data\\sci.crypt\\11508-15928',
       'datasets/clustering/data\\sci.space\\13593-60824',
       'datasets/clustering/data\\sci.electronics\\12304-52801'],
      dtype='<U52')
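To read the two outputs above side by side, the ten labels can be paired with the category folder in each file name. The sketch below copies both arrays from the outputs; parsing the category by splitting on the backslash is an assumption that matches the Windows-style paths shown here.

```python
from collections import Counter

# Copied from the two outputs above.
labels = [2, 2, 2, 1, 3, 1, 3, 2, 1, 1]
filenames = [
    'datasets/clustering/data\\sci.crypt\\10888-15289',
    'datasets/clustering/data\\sci.crypt\\11490-15880',
    'datasets/clustering/data\\sci.crypt\\11270-15346',
    'datasets/clustering/data\\sci.electronics\\12383-53525',
    'datasets/clustering/data\\sci.space\\13826-60862',
    'datasets/clustering/data\\sci.electronics\\11631-54106',
    'datasets/clustering/data\\sci.space\\14235-61437',
    'datasets/clustering/data\\sci.crypt\\11508-15928',
    'datasets/clustering/data\\sci.space\\13593-60824',
    'datasets/clustering/data\\sci.electronics\\12304-52801',
]

# The category is the folder name between the two backslashes.
categories = [name.split('\\')[1] for name in filenames]

# Count how often each (category, cluster) pair occurs.
pairs = Counter(zip(categories, labels))
for (category, cluster), count in sorted(pairs.items()):
    print(category, cluster, count)
```

In this sample all four sci.crypt posts land in cluster 2 and all three sci.electronics posts in cluster 1, while one of the three sci.space posts is placed in cluster 1 instead of cluster 3.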

In [34]:
kmean.cluster_centers_.shape

(4, 20000)

In [29]:
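# Supplementary note: how argsort works
a = np.array([10,30,20,40])
a.argsort() # indices that sort a in ascending order
b = a.argsort()[::-1] # reversed, so the indices are in descending order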
#t1 = vectorizer.get_feature_names()
#t1[2]
b

array([3, 1, 2, 0], dtype=int64)

In [30]:
e =  b[0:2]
e

array([3, 1], dtype=int64)

In [33]:
a[1]

30

In [17]:
# For each cluster, look at the 10 terms with the highest weight
from __future__ import print_function

print("Top terms per cluster:")

order_centroids = kmean.cluster_centers_.argsort()[:, ::-1] # per-row indices in descending order of weight
order_centroids


Top terms per cluster:


array([[12313, 16356, 12398, ...,  6878,  6877, 19999],
       [ 2337, 12398, 10635, ...,  7362,  7363,  8367],
       [10522,  4415,  6936, ...,  8539, 16223, 19999],
       [16848,  8962, 12463, ..., 14902,  7521, 19999]], dtype=int64)

In [21]:
 order_centroids[0, :10]

array([12313, 16356, 12398, 13888,  8480,  8263,  2907,  8971,  8877,  8831], dtype=int64)

In [24]:
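# Get the vocabulary as a list of terms.
# Note: get_feature_names() was removed in scikit-learn 1.2; on 1.0 or later use get_feature_names_out() instead.
terms = vectorizer.get_feature_names()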
terms[12313]

'msg'

In [25]:

for i in range(n_clusters):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Cluster 0: msg she my pitt gordon geb banks her he has
Cluster 1: any my know by me your like anyone will do
Cluster 2: key clipper encryption chip government will keys escrow we nsa
Cluster 3: space henry nasa toronto shuttle zoo pat moon spencer orbit


OVER

Reference: https://blog.csdn.net/sinat_26917383/article/details/70577710

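1. homogeneity_score: homogeneity, whether each cluster contains only members of a single class
2. completeness_score: completeness, whether all members of a class end up in the same cluster
3. v_measure_score: the harmonic mean of the two; 1 is the best score
4. adjusted_rand_score: adjusted Rand index; the larger the value, the closer the match with the true labels. The maximum is 1, random labelings score close to 0, and the value can be negative
5. silhouette_score: silhouette coefficient, in [-1, 1]; the score is higher when samples are close to their own cluster and far from other clusters. It does not require a labeled dataset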
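A small hand-made example helps to see what the label-based scores measure. The label lists below are made up for illustration; the functions are the same ones from sklearn.metrics used in this notebook.

```python
from sklearn import metrics

truth = [0, 0, 0, 1, 1, 1]

# Same grouping as truth, but with the cluster ids swapped.
renamed = [1, 1, 1, 0, 0, 0]
# Every cluster is pure, but class 0 is split across two clusters.
split = [0, 0, 2, 1, 1, 1]

ari_renamed = metrics.adjusted_rand_score(truth, renamed)
h = metrics.homogeneity_score(truth, split)
c = metrics.completeness_score(truth, split)
v = metrics.v_measure_score(truth, split)

print(ari_renamed)  # cluster ids do not matter, only the grouping
print(h, c, v)      # pure clusters give h = 1; the split class pulls c below 1
```

This also explains why the cluster ids produced by k-means (0 to 3) do not need to line up with the ids in docs.target: these scores compare groupings, not label values.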

In [15]:
from sklearn import metrics

labels = docs.target
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, kmean.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, kmean.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, kmean.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, kmean.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, kmean.labels_, sample_size=1000))

Homogeneity: 0.453
Completeness: 0.532
V-measure: 0.489
Adjusted Rand-Index: 0.295
Silhouette Coefficient: 0.004


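Are these numbers good or bad? Frankly, they are only mediocre. Use the definitions of the metrics introduced above to interpret what each value means. One possible reason is that the dataset quality is not high. Interested readers can read the original corpus and check whether a human labeler could assign these articles to the correct categories.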
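As a sanity check on how the scores relate, the V-measure can be recomputed from the reported homogeneity and completeness, assuming the default equal weighting of the two (a plain harmonic mean):

```python
# Scores copied from the printed results above.
homogeneity = 0.453
completeness = 0.532

# Harmonic mean of the two scores.
v_measure = 2 * homogeneity * completeness / (homogeneity + completeness)
print("V-measure: %0.3f" % v_measure)  # V-measure: 0.489
```

The result agrees with the V-measure printed by the notebook.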

In [10]:
a = np.array([10, 30, 20, 40])
a.argsort()[::-1]

array([3, 1, 2, 0])

In [9]:
a = np.array([[20, 10, 30, 40], [100, 300, 200, 400], [1, 5, 3, 2]])
a.argsort()[:, ::-1]

array([[3, 2, 0, 1],
       [3, 1, 2, 0],
       [1, 2, 3, 0]])
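The row-wise argsort above is exactly the step used to find the top terms per cluster. Here is a minimal sketch with a made-up five-term vocabulary and two hypothetical centroids (none of these values come from the fitted model):

```python
import numpy as np

# Hypothetical vocabulary and centroid weights (2 clusters x 5 terms).
terms = np.array(["key", "moon", "nasa", "chip", "orbit"])
centers = np.array([
    [0.9, 0.0, 0.1, 0.7, 0.2],
    [0.1, 0.6, 0.8, 0.0, 0.5],
])

# Sort each row ascending, then reverse the columns to get descending order.
order = centers.argsort()[:, ::-1]

# Take the first 3 column indices of each row and map them back to terms.
top_terms = [terms[row[:3]].tolist() for row in order]
print(top_terms)  # [['key', 'chip', 'orbit'], ['nasa', 'moon', 'orbit']]
```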