# Clustering Documents

`load_files` documentation: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html

+ Loads text files, using the subfolder names as the category labels.

In [2]:
from sklearn.datasets import load_files

docs = load_files('../input/clustering/')
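As a minimal sketch of the directory layout that `load_files` expects, the following builds a tiny temporary corpus and loads it. The folder names, file names, and texts are made up for illustration and are not part of the dataset used in this notebook.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical layout: one subfolder per category, one text file per document.
with tempfile.TemporaryDirectory() as root:
    for category, text in [("sci.space", "rocket launch orbit"),
                           ("sci.crypt", "cipher key encryption")]:
        folder = Path(root) / category
        folder.mkdir()
        (folder / "doc1.txt").write_text(text, encoding="utf-8")

    # Files are read immediately, so the result outlives the temporary folder.
    toy = load_files(root, encoding="utf-8")

print(toy.target_names)           # category names taken from the subfolder names
print(len(toy.data), toy.target)  # documents and their category indices
```

Each entry of `toy.target` indexes into `toy.target_names`, which is how the folder name of every document can be recovered later.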

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

max_features = 20000

vectorizer = TfidfVectorizer(
    max_df=0.4, min_df=2, max_features=max_features, encoding='latin-1')
X = vectorizer.fit_transform(docs.data)

In [8]:
X

<3949x20000 sparse matrix of type '<class 'numpy.float64'>'
	with 461902 stored elements in Compressed Sparse Row format>


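`max_df`: `max_df=0.4` means that a term appearing in more than $40\%$ of the documents is treated as too frequent to help separate the clusters, so it is dropped when the vocabulary is built.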

`min_df`: `min_df=2` means that a term appearing in fewer than 2 documents (that is, in only one document) is considered too rare and is removed from the vocabulary.

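`max_features`: `max_features` further limits the vocabulary size. The remaining terms are ranked by their term frequency across the corpus (not by TF-IDF weight), and only the top `max_features` terms are kept.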
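The effect of the three parameters can be checked on a toy corpus. This is only a sketch: the corpus and the helper `vocab` are made up for illustration, and the thresholds differ from the ones used above so that each filter leaves something behind.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "apple" is in all 4 documents, "banana" in 2, the rest in 1.
corpus = ["apple banana", "apple cherry", "apple banana date", "apple egg"]


def vocab(**params):
    """Fit a vectorizer with the given parameters and return its sorted vocabulary."""
    return sorted(TfidfVectorizer(**params).fit(corpus).vocabulary_)


print(vocab(max_df=0.75))     # drops "apple" (in more than 75% of the documents)
print(vocab(min_df=2))        # keeps only terms found in at least 2 documents
print(vocab(max_features=2))  # keeps the 2 most frequent terms in the corpus
```

Note that "banana", which appears in exactly 2 documents, survives `min_df=2`: only terms below the threshold are removed.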

In [6]:
from sklearn.cluster import KMeans

n_clusters = 4

kmean = KMeans(n_clusters=n_clusters, max_iter=100, tol=0.01, verbose=1, n_init=3)
kmean

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=4, n_init=3, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.01, verbose=1)

In [7]:
kmean.fit(X)

Initialization complete
Iteration  0, inertia 7429.981
Iteration  1, inertia 3841.973
Iteration  2, inertia 3829.550
Iteration  3, inertia 3826.611
Iteration  4, inertia 3825.621
Iteration  5, inertia 3824.562
Iteration  6, inertia 3822.343
Iteration  7, inertia 3820.837
Iteration  8, inertia 3820.296
Iteration  9, inertia 3820.076
Iteration 10, inertia 3819.992
Iteration 11, inertia 3819.900
Iteration 12, inertia 3819.805
Iteration 13, inertia 3819.624
Iteration 14, inertia 3819.381
Iteration 15, inertia 3819.292
Iteration 16, inertia 3819.260
Iteration 17, inertia 3819.241
Iteration 18, inertia 3819.217
Iteration 19, inertia 3819.210
Converged at iteration 19: center shift 0.000000e+00 within tolerance 4.896692e-07
Initialization complete
Iteration  0, inertia 7539.440
Iteration  1, inertia 3842.001
Iteration  2, inertia 3828.867
Iteration  3, inertia 3822.184
Iteration  4, inertia 3820.386
Iteration  5, inertia 3819.674
Iteration  6, inertia 3819.317
Iteration  7, inertia 3819.155
...

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=4, n_init=3, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.01, verbose=1)

In [9]:
kmean.inertia_

3817.742355730966
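`inertia_` is the sum of squared distances from each sample to its closest cluster center. A small sketch on made-up 2-D points (not the document vectors above) that recomputes the value by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# Four made-up points forming two obvious groups.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
km = KMeans(n_clusters=2, n_init=3, random_state=0).fit(pts)

# Distance of every point to the center of the cluster it was assigned to.
diffs = pts - km.cluster_centers_[km.labels_]
manual_inertia = float((diffs ** 2).sum())

print(km.inertia_, manual_inertia)  # the two values should agree
```

Because inertia always decreases as `n_clusters` grows, it is mainly useful for comparing runs with the same number of clusters, such as the `n_init` restarts above.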

In [10]:
kmean.labels_[1000:1010]

array([1, 1, 1, 0, 2, 0, 2, 1, 0, 0], dtype=int32)

In [11]:
docs.filenames[1000:1010]

array(['../input/clustering/sci.crypt/10888-15289',
       '../input/clustering/sci.crypt/11490-15880',
       '../input/clustering/sci.crypt/11270-15346',
       '../input/clustering/sci.electronics/12383-53525',
       '../input/clustering/sci.space/13826-60862',
       '../input/clustering/sci.electronics/11631-54106',
       '../input/clustering/sci.space/14235-61437',
       '../input/clustering/sci.crypt/11508-15928',
       '../input/clustering/sci.space/13593-60824',
       '../input/clustering/sci.electronics/12304-52801'], dtype='<U47')
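To see how the cluster labels line up with the folder names, the ten labels and file paths printed above can be tallied. This is a rough sketch over this slice only; the variable names are illustrative.

```python
from collections import Counter
from pathlib import PurePosixPath

# Values copied from the two outputs above.
labels = [1, 1, 1, 0, 2, 0, 2, 1, 0, 0]
filenames = [
    '../input/clustering/sci.crypt/10888-15289',
    '../input/clustering/sci.crypt/11490-15880',
    '../input/clustering/sci.crypt/11270-15346',
    '../input/clustering/sci.electronics/12383-53525',
    '../input/clustering/sci.space/13826-60862',
    '../input/clustering/sci.electronics/11631-54106',
    '../input/clustering/sci.space/14235-61437',
    '../input/clustering/sci.crypt/11508-15928',
    '../input/clustering/sci.space/13593-60824',
    '../input/clustering/sci.electronics/12304-52801',
]

# The category is the name of the folder that contains each file.
categories = [PurePosixPath(name).parent.name for name in filenames]

# Count how often each (cluster, category) pair occurs.
table = Counter(zip(labels, categories))
for (cluster, category), count in sorted(table.items()):
    print(cluster, category, count)
```

In this slice, every `sci.crypt` document falls into cluster 1, while one `sci.space` document is grouped with the `sci.electronics` documents in cluster 0.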

In [12]:
kmean.cluster_centers_.shape

(4, 20000)

In [13]:
order_centroids = kmean.cluster_centers_.argsort(axis=1)[:,::-1]

In [14]:
order_centroids.shape

(4, 20000)

In [15]:
order_centroids

array([[ 2337, 12398, 10635, ...,  4856,  4860,  4621],
       [10522,  4415,  6936, ...,  8704,  8703, 19999],
       [16848,  8962, 12463, ...,  6234,  6233, 19999],
       [12398, 16356, 12313, ...,  7358,  7357, 19999]])

In [16]:
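# On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead;
# get_feature_names() was removed in 1.2.
terms = vectorizer.get_feature_names()
for i in range(n_clusters):
    print('Cluster center', i + 1, end='')
    # The first 10 indices of each row are the terms with the largest weights.
    for index in order_centroids[i, :10]:
        print(' %s ' % terms[index], end='')
    print()

Cluster center 1 any  my  know  by  me  anyone  your  ca  like  thanks 
Cluster center 2 key  clipper  encryption  chip  government  will  keys  escrow  we  your 
Cluster center 3 space  henry  nasa  toronto  pat  shuttle  zoo  moon  we  gov 
Cluster center 4 my  she  msg  pitt  he  gordon  her  geb  banks  has 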


In [17]:
kmean.cluster_centers_

array([[1.83698178e-03, 1.69844030e-03, 3.78138188e-05, ...,
        2.42740466e-04, 1.87546922e-04, 1.87546922e-04],
       [2.70557485e-04, 1.35559431e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.17714595e-03, 3.20367668e-03, 3.15739811e-04, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.48105028e-03, 1.99367990e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [18]:
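import numpy as np

# argsort returns, for each row, the indices that would sort it in ascending order.
x = np.array([[0, 1, 2, 3], [4, 3, 2, 1]])
x.argsort(axis=1)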

array([[0, 1, 2, 3],
       [3, 2, 1, 0]])
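Reversing each row of the `argsort` result with `[:, ::-1]` gives the indices in descending order, which is how `order_centroids` ranks the terms of each cluster center. A minimal sketch with made-up weights and term names:

```python
import numpy as np

# Made-up "cluster centers": 2 clusters over a 3-term vocabulary.
centers = np.array([[0.1, 0.7, 0.2],
                    [0.5, 0.0, 0.3]])
toy_terms = ['alpha', 'beta', 'gamma']  # hypothetical vocabulary

# Ascending argsort per row, then reverse each row to get descending order.
order = centers.argsort(axis=1)[:, ::-1]

# The two highest-weighted terms of every cluster.
top_terms = [[toy_terms[j] for j in row[:2]] for row in order]
print(order)
print(top_terms)
```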