# Clustering with k-Means on text datasets


In this notebook we will apply the k-Means clustering algorithm to a text dataset, using the techniques presented in the three previous notebooks (MLLAB-11, MLLAB-12, and MLLAB-13).

The PriceRunner CPUs dataset contains 3861 CPU product titles crawled from the PriceRunner product comparison platform in 2019. For each product, the dataset includes:

* a unique integer identifier
* a title
* an integer identifier of the product vendor
* an integer identifier of the correct cluster (that is, the cluster in which the product should be placed), along with its textual label
* two category columns that will not be used

Our goal here is to create clusters of products so that each cluster contains CPUs of the same model.


In [1]:
import numpy as np
import pandas as pd


In [2]:
# Read the PriceRunner aggregate file with the 3861 product titles
df = pd.read_csv('datasets/pricerunner_cpus.csv', encoding='utf-8', skiprows=1, header=None)
df.columns = ['ID', 'Title', 'VendorID', 'ClusterID', 'ClusterLabel', 'CategoryID', 'CategoryLabel']

print("Dataframe Shape: ", df.shape)
df.head(10)


Dataframe Shape:  (3861, 7)


Unnamed: 0,ID,Title,VendorID,ClusterID,ClusterLabel,CategoryID,CategoryLabel
0,13772,amd ryzen 7 1700x 8 core am4 cpu/processor,4,5851,AMD Ryzen 7 1700X 3.4GHz Box,2615,CPUs
1,13773,amd ryzen 7 1700x 3.4ghz 16mb l3 processor,14,5851,AMD Ryzen 7 1700X 3.4GHz Box,2615,CPUs
2,13774,amd ryzen 7 1700x 95 w 8 core/16 threads 3.8gh...,3,5851,AMD Ryzen 7 1700X 3.4GHz Box,2615,CPUs
3,13775,open box amd ryzen 7 1700x 3.8 ghz 8 core 95w ...,30,5851,AMD Ryzen 7 1700X 3.4GHz Box,2615,CPUs
4,13776,amd ryzen 7 1700x 8 core 16 thread am4 cpu/pro...,121,5851,AMD Ryzen 7 1700X 3.4GHz Box,2615,CPUs
5,13777,wof processor amd ryzen 7 1700x 8 x 3.4 ghz oc...,16,5851,AMD Ryzen 7 1700X 3.4GHz Box,2615,CPUs
6,13778,amd ryzen 7 1700x cpu am4 3.4ghz 3.8 turbo 8 c...,18,5851,AMD Ryzen 7 1700X 3.4GHz Box,2615,CPUs
7,13779,amd prozessor cpu ryzen 7 sockel am4 1700x 8 x...,36,5851,AMD Ryzen 7 1700X 3.4GHz Box,2615,CPUs
8,13780,intel core intel core i7 7700 processor 8m cac...,18,5852,Intel Core i7-7700 3.6GHz Box,2615,CPUs
9,13781,intel core i7 7700 3.60ghz s1151 8mb cache kab...,8,5852,Intel Core i7-7700 3.6GHz Box,2615,CPUs


In [3]:
# Copy the dataframe to another dataframe with 2 columns
products_df = df[['Title', 'ClusterID']].copy()
products_df.head(10)


Unnamed: 0,Title,ClusterID
0,amd ryzen 7 1700x 8 core am4 cpu/processor,5851
1,amd ryzen 7 1700x 3.4ghz 16mb l3 processor,5851
2,amd ryzen 7 1700x 95 w 8 core/16 threads 3.8gh...,5851
3,open box amd ryzen 7 1700x 3.8 ghz 8 core 95w ...,5851
4,amd ryzen 7 1700x 8 core 16 thread am4 cpu/pro...,5851
5,wof processor amd ryzen 7 1700x 8 x 3.4 ghz oc...,5851
6,amd ryzen 7 1700x cpu am4 3.4ghz 3.8 turbo 8 c...,5851
7,amd prozessor cpu ryzen 7 sockel am4 1700x 8 x...,5851
8,intel core intel core i7 7700 processor 8m cac...,5852
9,intel core i7 7700 3.60ghz s1151 8mb cache kab...,5852


In [4]:
import re

# Strip HTML tags, remove punctuation, and apply case folding to the product titles.
# Emoticons, if any are found, are appended at the end of the cleaned text.
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')

    return text

# Example on the second row
preprocessor(products_df.loc[1, 'Title'])


'amd ryzen 7 1700x 3 4ghz 16mb l3 processor'

In [5]:
# Apply the preprocessor function to the product titles
products_df['Title'] = products_df['Title'].apply(preprocessor)
products_df.head()


Unnamed: 0,Title,ClusterID
0,amd ryzen 7 1700x 8 core am4 cpu processor,5851
1,amd ryzen 7 1700x 3 4ghz 16mb l3 processor,5851
2,amd ryzen 7 1700x 95 w 8 core 16 threads 3 8gh...,5851
3,open box amd ryzen 7 1700x 3 8 ghz 8 core 95w ...,5851
4,amd ryzen 7 1700x 8 core 16 thread am4 cpu pro...,5851


In [6]:
# Copy the Title and ClusterID columns of the dataframe into numpy arrays
titles_array = products_df['Title'].to_numpy()
real_clusters = products_df['ClusterID'].to_numpy()
titles_array


array(['amd ryzen 7 1700x 8 core am4 cpu processor',
       'amd ryzen 7 1700x 3 4ghz 16mb l3 processor',
       'amd ryzen 7 1700x 95 w 8 core 16 threads 3 8ghz 4mb cpu black',
       ..., 'intel bx80532ke3066e processor 3 06 ghz 1 mb l2',
       'intel bx80532ke2400d processor 2 4 ghz 0 512 mb l2',
       'intel bx80532ke3066e processor 3 06 ghz 1 mb l2'], dtype=object)

In [7]:
# Vectorize the product titles using the TfidfVectorizer of scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(use_idf=True, norm='l2', smooth_idf=True)

vectorized_titles = tfidf_vectorizer.fit_transform(titles_array).toarray()

vectorized_titles.shape, vectorized_titles


((3861, 2288),
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]))
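
To make the TF-IDF representation more concrete, here is a small sketch on three made-up, already-preprocessed titles (hypothetical data, not rows read from the dataset; all `toy_*` names are illustrative). It uses the same vectorizer settings and inspects two properties: with `norm='l2'` every row has unit length, so the dot product of two rows is their cosine similarity, and titles that share model tokens end up more similar than titles that do not. Note that the default `token_pattern` keeps only tokens of two or more characters, so single digits such as `7` are dropped.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles used only for illustration
toy_titles = [
    'amd ryzen 7 1700x 8 core am4 processor',
    'amd ryzen 7 1700x 3 4ghz 16mb l3 processor',
    'intel core i7 7700 3 60ghz s1151 processor',
]

toy_vectorizer = TfidfVectorizer(use_idf=True, norm='l2', smooth_idf=True)
toy_matrix = toy_vectorizer.fit_transform(toy_titles)  # sparse matrix, one row per title

# L2 norm of each row (expected to be 1 because of norm='l2')
row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()

# Pairwise cosine similarities (plain dot products of unit-length rows)
similarities = (toy_matrix @ toy_matrix.T).toarray()

print(toy_matrix.shape)
print(row_norms)
print(np.round(similarities, 2))
```

In this toy example the two Ryzen titles should be closer to each other than either is to the Intel title, which is the property k-Means relies on when grouping titles.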

In [8]:
#from sklearn.decomposition import PCA

#pca = PCA(n_components = 100)

#reduced_titles = pca.fit_transform(vectorized_titles)

#reduced_titles.shape, reduced_titles
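
The cell above is commented out, so dimensionality reduction is not part of the pipeline in this notebook. For reference only, a minimal sketch of what such a PCA step might look like on a random stand-in matrix; the shape (50 documents, 20 terms) and `n_components=5` are illustrative assumptions, not values used in the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a dense document-term matrix (hypothetical shape)
rng = np.random.default_rng(0)
toy_vectors = rng.random((50, 20))

# Project the 20-dimensional vectors onto the first 5 principal components
toy_pca = PCA(n_components=5)
toy_reduced = toy_pca.fit_transform(toy_vectors)

# Fraction of the total variance retained by the kept components
retained_variance = toy_pca.explained_variance_ratio_.sum()

print(toy_reduced.shape)
print(retained_variance)
```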


In [9]:
from sklearn.cluster import KMeans

# Apply k-Means to the dataset. The target number of clusters (1975) is known in advance.
km = KMeans(n_clusters=1975, init='random', n_init=1, max_iter=10, tol=1e-04, random_state=0)

predicted_clusters = km.fit_predict(vectorized_titles)

print(predicted_clusters)
print(real_clusters)


[1045  228  958 ...  530 1590  530]
[5851 5851 5851 ... 7913 7914 7915]
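
The predicted labels and the real cluster IDs use different numbering (k-Means assigns labels from 0 to `n_clusters - 1`), so the two arrays cannot be compared element by element. The external metrics used in the next cell only look at which items are grouped together, not at the label values. A small sketch with made-up labels (the `toy_*` lists are hypothetical):

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground truth: three clusters identified by arbitrary IDs
toy_real = [5851, 5851, 5852, 5852, 5853]

# Same grouping as toy_real, but with different label values
toy_pred = [2, 2, 0, 0, 1]

# A different grouping: items 0 and 2 are paired, items 1 and 3 are paired
toy_pred_bad = [2, 0, 2, 0, 1]

score_same = adjusted_rand_score(toy_real, toy_pred)
score_bad = adjusted_rand_score(toy_real, toy_pred_bad)

print(score_same)
print(score_bad)
```

The first score is 1.0 because the partition is identical even though no label value matches; the second is lower because the grouping itself differs.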


In [10]:
from sklearn.metrics import completeness_score
from sklearn.metrics import homogeneity_score
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.metrics import normalized_mutual_info_score

print("Adjusted Rand Score: ", adjusted_rand_score(real_clusters, predicted_clusters))
print("Completeness: ", completeness_score(real_clusters, predicted_clusters))
print("Homogeneity: ", homogeneity_score(real_clusters, predicted_clusters))
print("NMI: ", normalized_mutual_info_score(real_clusters, predicted_clusters))
print("AMI: ", adjusted_mutual_info_score(real_clusters, predicted_clusters))


Adjusted Rand Score:  0.254997954471954
Completeness:  0.9242097301000132
Homogeneity:  0.9242987714725341
NMI:  0.9242542486417431
AMI:  0.3596284367979433
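
The gap between NMI (about 0.92) and AMI (about 0.36) deserves a note. NMI is not corrected for chance, and when the number of clusters is large relative to the number of items, even unrelated labelings can obtain a high NMI. This is one likely contributor to the gap here, since 3861 titles are split into 1975 clusters. The sketch below illustrates the effect with two independent random labelings; the sizes (400 items, 200 groups) are arbitrary assumptions chosen to mimic the roughly two-items-per-cluster ratio of this dataset.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
n_items, n_groups = 400, 200  # hypothetical sizes, about two items per group

# Two labelings drawn independently at random: they carry no shared structure
labels_a = rng.integers(0, n_groups, size=n_items)
labels_b = rng.integers(0, n_groups, size=n_items)

random_nmi = normalized_mutual_info_score(labels_a, labels_b)
random_ami = adjusted_mutual_info_score(labels_a, labels_b)

print("NMI of random labelings:", random_nmi)
print("AMI of random labelings:", random_ami)
```

In this regime the NMI of random labelings is far from zero while the AMI stays close to zero, so the chance-adjusted scores (AMI and the Adjusted Rand Score) are the more conservative indicators of clustering quality.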
