# Text Mining - Clustering using `K-Means` with `tfidf`

To cluster texts using `K-Means` with a `tfidf` vectorizer, we follow the example at this [URL](http://scikit-learn.org/stable/auto_examples/text/document_clustering.html). That example uses a vectorizer, `TfidfVectorizer`, to compute the `tfidf` weight of every word in each document.

We are going to use the [20newsgroups](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups) corpus and select two groups: `alt.atheism` and `sci.space`.

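To make the `tfidf` weighting concrete, here is a minimal pure-Python sketch on three toy sentences. It assumes the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document vector. The function `tfidf` and the toy documents are hypothetical, and the whitespace tokenizer is a simplification: `TfidfVectorizer` tokenizes differently and can also remove stop words.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical toy corpus, not taken from 20newsgroups
toy_docs = [
    "the rocket reached orbit",
    "the shuttle reached the station",
    "the debate about god",
]
toy_vectors = tfidf(toy_docs)
for doc, vec in zip(toy_docs, toy_vectors):
    top = sorted(vec.items(), key=lambda kv: -kv[1])[:2]
    print(doc, "->", top)
```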
In [1]:
%pylab
%matplotlib inline

%config InlineBackend.figure_format = 'retina'

import numpy as np

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


In [2]:
# Import all libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import metrics

from sklearn.cluster import KMeans

from time import time

### 1. Load the corpus of texts

In [3]:
# Load some categories from the training set
categories = [
    'alt.atheism',
    'sci.space'
]

print("Loading 20 newsgroups dataset for categories:")
print(categories)
print()

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print()

# Get the labels of each document
labels = dataset.target
# Number of true classes, used as k for K-Means
true_k = np.unique(labels).shape[0]

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'sci.space']

1786 documents
2 categories



### 2. Texts vectorization 

In [4]:
n_features = 100
use_idf = True

# Create the tfidf vectorizer
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, #max_features = n_features,
                             stop_words='english', use_idf=use_idf)
# Vectorize dataset
vec_dataset = vectorizer.fit_transform(dataset.data)
print(vec_dataset.toarray())
print("n_samples: %d, n_features: %d" % vec_dataset.shape)

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
n_samples: 1786, n_features: 15500

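The `max_df=0.5` and `min_df=2` arguments prune the vocabulary by document frequency: a float `max_df` is read as a proportion of documents and an integer `min_df` as an absolute count. The sketch below illustrates that filtering rule in plain Python. The helper `prune_vocabulary`, its whitespace tokenizer and the toy documents are assumptions made for illustration, not the vectorizer's actual code.

```python
from collections import Counter

def prune_vocabulary(docs, max_df=0.5, min_df=2):
    """Keep terms that appear in at least min_df documents
    and in at most max_df * n_docs documents."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency of every term
    df = Counter(term for tokens in tokenized for term in tokens)
    max_count = max_df * n_docs  # float max_df is a proportion of documents
    return sorted(term for term, count in df.items()
                  if min_df <= count <= max_count)

# Hypothetical toy corpus
toy_docs = [
    "the moon landing",
    "the moon orbit",
    "the god debate",
    "the god question",
]
kept = prune_vocabulary(toy_docs)
print(kept)
```

In this toy corpus, "the" appears in every document and is dropped by `max_df`, while words that appear only once are dropped by `min_df`.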

### 3. Texts clustering 

In [5]:
# Apply the KMeans algorithm
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1, verbose=True, random_state=1)

# Get time 0
t0 = time()

# Fit the KMeans algorithm with vectorized texts.
km.fit(vec_dataset)
print("Fit time: %0.3fs" % (time() - t0))
print()

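# Print the top terms of each cluster
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
terms = vectorizer.get_feature_names_out()
# Sort each centroid's term weights in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

print("Print the clusters:")
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()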

Initialization complete
Iteration  0, inertia 3446.181
Iteration  1, inertia 1741.009
Iteration  2, inertia 1735.972
Iteration  3, inertia 1733.070
Iteration  4, inertia 1732.291
Iteration  5, inertia 1732.167
Iteration  6, inertia 1732.152
Converged at iteration 6: center shift 0.000000e+00 within tolerance 6.317085e-09
Fit time: 0.710s

Print the clusters:
Cluster 0: god com keith people sgi don livesey atheists say think
Cluster 1: space nasa henry access com toronto digex gov alaska pat

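The verbose log above shows the inertia (the sum of squared distances from each sample to its centroid) shrinking until the assignments stop changing. A minimal sketch of that loop, Lloyd's algorithm, could look as follows. It is a simplification under stated assumptions: the initial centroids are simply the first `k` points rather than the `k-means++` initialisation used above, and the function `kmeans` and the 2-D toy points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # Inertia: sum of squared distances to the assigned centroid
    inertia = sum(squared_distance(p, centroids[a])
                  for p, a in zip(points, assignments))
    return assignments, centroids, inertia

# Hypothetical toy data: two well-separated groups of 2-D points
toy_points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0),
              (1.0, 0.0), (5.0, 6.0), (6.0, 5.0)]
assignments, centroids, inertia = kmeans(toy_points, k=2)
print(assignments, inertia)
```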

### 4. Clustering quality measurement

In [6]:
# Calculate the clustering goodness with: homogeneity_score, completeness_score and v_measure_score

# A clustering is homogeneous if each cluster contains only members of a single class
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))

# A clustering is complete if all members of a given class are assigned to the same cluster
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))

# V-measure is the harmonic mean of the last two metrics
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))

Homogeneity: 0.902
Completeness: 0.906
V-measure: 0.904


The quality scores are good for our dataset (1786 documents split into 2 clusters). The homogeneity is high (0.902), which indicates that each cluster contains almost exclusively documents from a single newsgroup. The completeness is similarly high (0.906), so the documents of each newsgroup mostly end up in the same cluster. This is what a clustering process aims for: clusters that are internally as homogeneous as possible and as different from one another as possible.

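For readers who want to see where these three numbers come from, here is a small entropy-based sketch. It assumes the usual definitions: homogeneity is `1 - H(C|K) / H(C)`, completeness is `1 - H(K|C) / H(K)`, and V-measure is their harmonic mean, where `C` are the true classes and `K` the cluster ids. The function names and toy labels are hypothetical, and this is not scikit-learn's implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given) for two equally long label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())

def v_measure(classes, clusters):
    """Return (homogeneity, completeness, v_measure)."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

# Hypothetical toy labels: six documents, two true classes
toy_classes = [0, 0, 0, 1, 1, 1]
perfect = v_measure(toy_classes, [1, 1, 1, 0, 0, 0])    # cluster ids swapped
imperfect = v_measure(toy_classes, [0, 0, 1, 1, 1, 1])  # one document misplaced
print("perfect:   h=%.3f c=%.3f v=%.3f" % perfect)
print("imperfect: h=%.3f c=%.3f v=%.3f" % imperfect)
```

The first toy clustering uses swapped cluster ids on purpose: these metrics compare groupings, so the numeric ids themselves do not matter.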
### 5. Use the trained KMeans to classify other texts

In [7]:
print("Let's classify 2 new texts:")
print ()

# Set the new texts:
atheism = ["Atheism remains one of the most extreme taboos in Saudi Arabia. It is a red line that no one can cross. Atheists in Saudi Arabia have been suffering from imprisonment, maginalisation, slander, ostracisation and even execution. Atheists are considered terrorists. Efforts for normalisation between those who believe and those who don’t remain bleak in the kingdom. Despite constant warnings of Saudi religious authorities of “the danger of atheism,” many citizens in the kingdom are turning their backs on Islam. The Saudi dehumanizing strict laws in the name of Islam, easy access to information and mass communication are the primary driving forces pushing Saudis to leave religion. Unfortunately, those who explicitly do, find themselves harshly punished or forced to live dual lives."]
space = ["The man speaking was Neil Armstrong, whose brevity marked the moment when the lunar module Eagle completed its perilous journey from Apollo 11 and touched down upon the surface of the Moon. The world waited on tenterhooks as hour after hour of checks were carried out. Finally, the hatch opened, and Armstrong descended the ladder to become the first human to set foot on the Moon, with the now immortal words: That’s one small step for man, one giant leap for mankind.There cannot be many who have not, however briefly, glanced at the Moon and wondered what it must have been like for Armstrong to look back at the blue and green planet we call home. The landing may have happened almost five decades ago, but space exploration has not lost its allure. Even those of us who were not born when this momentous event unfolded are caught in its gravitational pull. With this in mind, it seems only fitting that Sotheby’s New York has decided to host its first space exploration auction, featuring memorabilia from American-led space missions, exactly 48 years to the day after Apollo 11’s lunar landing."]

# Vectorize the texts
tfAtheism =  vectorizer.transform(atheism)
tfSpace =  vectorizer.transform(space)

# Print the texts
print ("TEXT 1 (about atheism in Saudi Arabia):\n", atheism)
print ()
print ("TEXT 2 (about the arrival of man on the moon):\n", space)
print ()

# Print the tfMatrix
print ("tfAtheism:",tfAtheism.toarray() )
print ("tfSpace:",tfSpace.toarray() )
print ()

# Make the prediction and print it
atheismPrediction = km.predict(tfAtheism)[0]
print ("Text 1 prediction (atheism): Cluster", atheismPrediction)
spacePrediction = km.predict(tfSpace)[0]
print ("Text 2 prediction (space): Cluster", spacePrediction)

Let's classify 2 new texts:

TEXT 1 (about atheism in Saudi Arabia):

TEXT 2 (about the arrival of man on the moon):
 ['The man speaking was Neil Armstrong, whose brevity marked the moment when the lunar module Eagle completed its perilous journey from Apollo 11 and touched down upon the surface of the Moon. The world waited on tenterhooks as hour after hour of checks were carried out. Finally, the hatch opened, and Armstrong descended the ladder to become the first human to set foot on the Moon, with the now immortal words: That’s one small step for man, one giant leap for mankind.There cannot be many who have not, however briefly, glanced at the Moon and wondered what it must have been like for Armstrong to look back at the blue and green planet we call home. The landing may have happened almost five decades ago, but space exploration has not lost its allure. Even those of us who were not born when this momentous event unfolded are caught in its gravitational pull. With this in m

As we can see, our system has classified the new texts correctly.
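
Note that K-Means cluster ids are arbitrary: cluster 0 is not tied to the first newsgroup. One way to turn a predicted cluster into a newsgroup name is a majority vote over the training documents, sketched below. The helper `majority_class_per_cluster` and the toy arrays are hypothetical; in this notebook the inputs would presumably be `km.labels_`, `labels` and `dataset.target_names`.

```python
from collections import Counter, defaultdict

def majority_class_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label among its members."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in votes.items()}

# Hypothetical toy data standing in for km.labels_ and dataset.target
toy_cluster_ids = [0, 0, 1, 1, 1, 0]
toy_true_labels = [1, 1, 0, 0, 1, 1]
toy_target_names = ["alt.atheism", "sci.space"]

mapping = majority_class_per_cluster(toy_cluster_ids, toy_true_labels)
for cluster in sorted(mapping):
    print("Cluster %d -> %s" % (cluster, toy_target_names[mapping[cluster]]))
```

With such a mapping, a value returned by `km.predict` can be reported as a newsgroup name instead of a bare cluster number.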