In [2]:
from sklearn.datasets import load_files

docs = load_files('../input/clustering/')

In [4]:
type(docs)

sklearn.utils.Bunch

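+ max_df: a word that appears in more than 40% of the documents is treated as a high-frequency word. It does little to distinguish documents from one another, so it is dropped.
+ min_df: a word that appears in fewer than 2 documents carries too little information, so it is dropped as well. Note that this is a document count, not a raw word count.
+ max_features: caps the size of the vocabulary. The remaining words are ranked by their total term frequency across the corpus, and only the top max_features words are kept.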
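The effect of these three parameters can be checked on a tiny corpus. The sketch below is illustrative only: the six sentences are made up for this example and are not part of the dataset, and the ranking behaviour of max_features is as described in the scikit-learn documentation (term frequency across the corpus).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "the rocket reached orbit",
    "the satellite reached orbit",
    "the cipher uses secret key",
    "the circuit uses resistor",
    "the encrypted key stays secret",
    "the amplifier circuit uses power",
]

# max_df=0.4: drop words found in more than 40% of the 6 documents (3 or more).
# min_df=2: drop words found in fewer than 2 documents.
filtered = TfidfVectorizer(max_df=0.4, min_df=2).fit(corpus)
kept = sorted(filtered.vocabulary_)
print(kept)

# max_features=2 on its own: keep the 2 most frequent words in the corpus.
capped = TfidfVectorizer(max_features=2).fit(corpus)
top_two = sorted(capped.vocabulary_)
print(top_two)
```

Here "the" (6 documents) and "uses" (3 documents) exceed the max_df threshold, while words such as "rocket" occur in only one document and fall below min_df.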

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

max_features = 20000
vectorizer = TfidfVectorizer(
    max_df=0.4, min_df=2, max_features=max_features, encoding='latin-1')
X = vectorizer.fit_transform(docs.data)

In [8]:
type(X)

scipy.sparse.csr.csr_matrix

In [10]:
X.shape

(3949, 20000)

In [21]:
docs.filenames[0], X[0].getnnz()

('./datasets/clustering/clustering/sci.electronics/11902-54322', 56)

The output shows that this row is sparse: of its 20000 entries, only 56 are non-zero.
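For the row above, the density works out to 56 / 20000 = 0.28%. The same bookkeeping can be sketched on a small hand-made matrix; the values below are invented for illustration and only stand in for the real TF-IDF matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 4 matrix standing in for the TF-IDF matrix X.
dense = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
])
X_toy = csr_matrix(dense)

# Number of stored non-zero entries per row (per document).
row_nnz = X_toy.getnnz(axis=1)

# Fraction of all cells that are non-zero.
density = X_toy.nnz / (X_toy.shape[0] * X_toy.shape[1])

print(row_nnz, density)
```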

In [22]:
X[0].shape

(1, 20000)

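+ tol=0.01 sets the convergence tolerance: once the cluster centers move less than the tolerance between two consecutive iterations, the algorithm is considered to have converged. scikit-learn scales this value by the variance of the data, which is why the log reports a much smaller effective tolerance.
+ verbose=1 prints progress information for each iteration.
+ **n_init=3 runs k-means 3 times with different initial centers and keeps the run with the lowest inertia; the runs are not averaged. k-means is sensitive to its starting points, and different initial centers can lead to different clusterings, so repeating the run and keeping the best result makes the outcome more stable.**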
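What n_init does can be imitated by hand: fit several single-initialization models and keep the one with the lowest inertia. This is only a sketch of the idea, not scikit-learn's internal code; the synthetic blobs, the seeds 0 to 2, and the use of init='random' are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical synthetic data standing in for the TF-IDF matrix.
X_toy, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Three independent runs, each with a single random initialization.
runs = [
    KMeans(n_clusters=4, n_init=1, init='random', random_state=seed).fit(X_toy)
    for seed in range(3)
]
inertias = [run.inertia_ for run in runs]

# Keep the run with the smallest within-cluster sum of squares.
best = min(runs, key=lambda run: run.inertia_)

print(inertias)
print(best.inertia_)
```

Passing n_init=3 to a single KMeans object performs this selection internally, so the fitted labels_ and inertia_ come from the best of the three runs.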

In [24]:
from sklearn.cluster import KMeans

print('Document clustering')
n_clusters = 4
kmean = KMeans(
    n_clusters=n_clusters, max_iter=100, tol=0.01, verbose=1, n_init=3)
kmean.fit(X)

Document clustering
Initialization complete
Iteration  0, inertia 7506.750
Iteration  1, inertia 3848.901
Iteration  2, inertia 3832.885
Iteration  3, inertia 3827.446
Iteration  4, inertia 3826.241
Iteration  5, inertia 3825.734
Iteration  6, inertia 3825.357
Iteration  7, inertia 3825.091
Iteration  8, inertia 3824.813
Iteration  9, inertia 3824.488
Iteration 10, inertia 3823.996
Iteration 11, inertia 3822.582
Iteration 12, inertia 3820.688
Iteration 13, inertia 3819.748
Iteration 14, inertia 3819.177
Iteration 15, inertia 3818.908
Iteration 16, inertia 3818.742
Iteration 17, inertia 3818.616
Iteration 18, inertia 3818.543
Iteration 19, inertia 3818.499
Iteration 20, inertia 3818.471
Iteration 21, inertia 3818.459
Iteration 22, inertia 3818.451
Iteration 23, inertia 3818.449
Iteration 24, inertia 3818.447
Converged at iteration 24: center shift 0.000000e+00 within tolerance 4.896692e-07
Initialization complete
Iteration  0, inertia 7618.285
Iteration  1, inertia 3851.557
Iteration  2, inertia 3840.

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=4, n_init=3, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.01, verbose=1)

In [26]:
kmean.labels_.shape

(3949,)

In [27]:
kmean.labels_[1000:1010]

array([1, 1, 1, 3, 2, 3, 2, 1, 3, 3], dtype=int32)

In [28]:
docs.filenames[1000:1010]

array(['./datasets/clustering/clustering/sci.crypt/10888-15289',
       './datasets/clustering/clustering/sci.crypt/11490-15880',
       './datasets/clustering/clustering/sci.crypt/11270-15346',
       './datasets/clustering/clustering/sci.electronics/12383-53525',
       './datasets/clustering/clustering/sci.space/13826-60862',
       './datasets/clustering/clustering/sci.electronics/11631-54106',
       './datasets/clustering/clustering/sci.space/14235-61437',
       './datasets/clustering/clustering/sci.crypt/11508-15928',
       './datasets/clustering/clustering/sci.space/13593-60824',
       './datasets/clustering/clustering/sci.electronics/12304-52801'],
      dtype='<U60')
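One way to read the two outputs together is to tally, for each cluster label, which newsgroup folder its documents came from. The sketch below hard-codes the ten labels and file names shown above, so it only describes this slice and not the full result; the variable names are chosen for illustration.

```python
from collections import Counter, defaultdict

# The ten cluster labels and file names printed above.
labels = [1, 1, 1, 3, 2, 3, 2, 1, 3, 3]
filenames = [
    './datasets/clustering/clustering/sci.crypt/10888-15289',
    './datasets/clustering/clustering/sci.crypt/11490-15880',
    './datasets/clustering/clustering/sci.crypt/11270-15346',
    './datasets/clustering/clustering/sci.electronics/12383-53525',
    './datasets/clustering/clustering/sci.space/13826-60862',
    './datasets/clustering/clustering/sci.electronics/11631-54106',
    './datasets/clustering/clustering/sci.space/14235-61437',
    './datasets/clustering/clustering/sci.crypt/11508-15928',
    './datasets/clustering/clustering/sci.space/13593-60824',
    './datasets/clustering/clustering/sci.electronics/12304-52801',
]

# The category is the name of the folder that contains each file.
categories = [path.split('/')[-2] for path in filenames]

# Count categories separately for each cluster label.
by_cluster = defaultdict(Counter)
for label, category in zip(labels, categories):
    by_cluster[label][category] += 1

for label in sorted(by_cluster):
    print(label, dict(by_cluster[label]))
```

In this slice, cluster 1 holds only sci.crypt documents and cluster 2 only sci.space documents, while cluster 3 mixes three sci.electronics documents with one from sci.space. On the full dataset the same tally could be built from kmean.labels_ and docs.target, which load_files fills in from the folder names.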