# Clustering Wikipedia Articles
TruncatedSVD performs a PCA-like dimensionality reduction on sparse arrays in csr_matrix format, such as word-frequency arrays.
Here we combine TruncatedSVD with k-means to cluster some popular Wikipedia pages.

Truncated SVD performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Unlike PCA, this estimator does not center the data before computing the SVD. Centering would turn most zero entries into nonzero ones and destroy sparsity, so skipping it is what lets the estimator work with sparse matrices efficiently.
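
To make the "no centering" point concrete, here is a minimal sketch on a small random sparse matrix (illustrative data, not the Wikipedia vectors). It assumes the `arpack` solver and compares the singular values found by TruncatedSVD with the leading singular values of the same matrix, densified but left uncentered.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Toy sparse matrix: 20 samples x 10 features, about 30% nonzero (illustrative data)
X = sparse_random(20, 10, density=0.3, format='csr', random_state=0)

# ARPACK computes the leading singular triplets of the sparse matrix directly
svd = TruncatedSVD(n_components=3, algorithm='arpack', random_state=0)
reduced = svd.fit_transform(X)  # dense array with one row per sample

# Reference: singular values of the same matrix, densified and NOT centered
reference = np.linalg.svd(X.toarray(), compute_uv=False)

print(reduced.shape)
print(np.allclose(svd.singular_values_, reference[:3]))
```

The input stays in sparse format throughout the fit; only the reduced output is a dense array.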

In [2]:
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from scipy.sparse import csr_matrix

In [6]:
df = pd.read_csv('datasets/wikipedia-vectors.csv', index_col=0)
articles = csr_matrix(df.transpose())
titles = list(df.columns)
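
The transpose above suggests that the CSV stores one word per row and one article per column, while scikit-learn expects one sample per row. A small sketch with a hypothetical word-frequency table (the words, titles, and values are made up for illustration) shows the same reshaping:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical word-frequency table: one row per word, one column per article
toy = pd.DataFrame(
    {'Article A': [0.0, 0.5, 0.0, 0.2],
     'Article B': [0.3, 0.0, 0.0, 0.0],
     'Article C': [0.0, 0.0, 0.7, 0.1]},
    index=['apple', 'banana', 'cherry', 'date'],
)

# Transpose so each row is an article (a sample), then store it sparsely
toy_articles = csr_matrix(toy.transpose().values)
toy_titles = list(toy.columns)

print(toy_articles.shape)  # (n_articles, n_words)
print(toy_articles.nnz)    # number of stored nonzero entries
```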

In [21]:
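# Reduce the word-frequency features to 50 components
svd = TruncatedSVD(n_components=50)
# Group the reduced articles into 6 clusters
kmeans = KMeans(n_clusters=6)
# Chain both steps so they are fitted and applied together
pipeline = make_pipeline(svd, kmeans)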
# Fit the pipeline to articles
pipeline.fit(articles)
# Calculate the cluster labels
labels = pipeline.predict(articles)
# Create a DataFrame aligning labels and titles
df = pd.DataFrame({'label': labels, 'article': titles})
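
The same pipeline pattern can be tried on a toy matrix where the right answer is known in advance. The sketch below is an illustration under stated assumptions: the data are synthetic (two groups of documents with disjoint vocabularies), and `random_state` and `n_init` are set explicitly, which the cell above does not do, so that repeated runs give the same label numbering. `fit_predict` is used as a shorthand for calling `fit` and then `predict` on the same data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Synthetic documents: rows 0-3 only use words 0-4, rows 4-7 only use words 5-9
rng = np.random.default_rng(0)
dense = np.zeros((8, 10))
dense[:4, :5] = rng.integers(1, 5, size=(4, 5))
dense[4:, 5:] = rng.integers(1, 5, size=(4, 5))
toy_docs = csr_matrix(dense)

# Fixed seeds make the reduction and the cluster numbering reproducible
toy_pipeline = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# Fit the pipeline and return the cluster label of each document in one call
toy_labels = toy_pipeline.fit_predict(toy_docs)
print(toy_labels)
```

Which group receives label 0 and which receives label 1 is arbitrary; only the grouping itself is meaningful.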

In [22]:
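# Set up visualization
def highlight(row):
    # Map each cluster label to a background color
    options = {0: 'LightYellow', 1: 'Moccasin', 2: 'Khaki', 3: 'Gold', 4: 'Coral', 5: 'Tomato'}
    color = options.get(row['label'])
    # Return one CSS rule per cell so the whole row is colored
    return [f'background-color: {color}'] * len(row)
# Sort by cluster label and color each row (axis=1 applies highlight row-wise)
df.sort_values('label').style.apply(highlight, axis=1)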

| | label | article |
|---|---|---|
| 14 | 0 | Climate change |
| 17 | 0 | Greenhouse gas emissions by the United States |
| 13 | 0 | Connie Hedegaard |
| 12 | 0 | Nigel Lawson |
| 11 | 0 | Nationally Appropriate Mitigation Action |
| 10 | 0 | Global warming |
| 18 | 0 | 2010 United Nations Climate Change Conference |
| 15 | 0 | Kyoto Protocol |
| 19 | 0 | 2007 United Nations Climate Change Conference |
| 16 | 0 | 350.org |
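
Where a styled table is not convenient, the clusters can also be inspected as plain text by grouping titles by label. This is a minimal sketch on a hypothetical result frame (the labels and titles below are placeholders, not output from the pipeline above); the same `groupby` call should work on any frame with `label` and `article` columns.

```python
import pandas as pd

# Hypothetical labels and titles standing in for the clustering result
toy_result = pd.DataFrame({
    'label': [1, 0, 1, 0],
    'article': ['Doc A', 'Doc B', 'Doc C', 'Doc D'],
})

# Collect the titles in each cluster, ordered by label
by_cluster = toy_result.groupby('label')['article'].apply(list)

for label, members in by_cluster.items():
    print(label, members)
```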
