# Document Clustering and Topic Modeling: User Reviews on Watches

>Ming Zhao <br>
>December 17, 2021

In this project, I clustered watch reviews into topics and identified the keywords of each topic. The pipeline applies standard NLP preprocessing (tokenization, stop-word removal, and stemming), extracts Term Frequency-Inverse Document Frequency (TF-IDF) features, and fits K-Means and Latent Dirichlet Allocation (LDA) models.

### Contents

- Loading Data  

- Preprocessing Data  
    - Stop-word removal
    - Tokenization and stemming
    - Term Frequency-Inverse Document Frequency (TF-IDF)

- K-Means Clustering  

- Latent Dirichlet Allocation (LDA)  
   
<br>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
import warnings
warnings.filterwarnings("ignore")  # suppress library warnings (e.g. deprecations) in the cell outputs

from IPython.display import Image

## Loading Data

In [4]:
# skip malformed rows silently; on_bad_lines replaces error_bad_lines/warn_bad_lines (pandas >= 1.3)
df = pd.read_csv('watch_reviews.tsv', sep='\t', on_bad_lines='skip')

In [5]:
df.tail(3)

| | marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 960201 | US | 40571775 | R2B1G5650WWFCE | B00005QEME | 252821780 | Men's Timex Internet Messenger Sport Watch | Watches | 5 | 3 | 16 | N | N | This is a great watch | Dear Targeteers,<BR>This watch is exelent. it ... | 2001-11-06 |
| 960202 | US | 44474855 | R2MMGPUWXXOFI2 | B00004YK0H | 118389241 | Energizer 393 Button Cell Battery | Watches | 4 | 0 | 0 | N | N | Now watt a minute here. | In the old days, the common hearing battery in... | 2001-04-05 |
| 960203 | US | 44474855 | R2BZMVAERMRUDE | B00004YK0H | 118389241 | Energizer 393 Button Cell Battery | Watches | 4 | 5 | 7 | N | N | 1/10 Watt difference for hearing aids | I have found that a #393 watch battery is the ... | 2001-04-05 |


In [6]:
# Check missing values
df.isnull().sum()

marketplace            0
customer_id            0
review_id              0
product_id             0
product_parent         0
product_title          2
product_category       0
star_rating            0
helpful_votes          0
total_votes            0
vine                   0
verified_purchase      0
review_headline        7
review_body          148
review_date            4
dtype: int64

In [7]:
# Remove missing values
df.dropna(subset=['review_body'], inplace=True)

In [8]:
df.reset_index(inplace=True, drop=True)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960056 entries, 0 to 960055
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   marketplace        960056 non-null  object
 1   customer_id        960056 non-null  int64 
 2   review_id          960056 non-null  object
 3   product_id         960056 non-null  object
 4   product_parent     960056 non-null  int64 
 5   product_title      960054 non-null  object
 6   product_category   960056 non-null  object
 7   star_rating        960056 non-null  int64 
 8   helpful_votes      960056 non-null  int64 
 9   total_votes        960056 non-null  int64 
 10  vine               960056 non-null  object
 11  verified_purchase  960056 non-null  object
 12  review_headline    960049 non-null  object
 13  review_body        960056 non-null  object
 14  review_date        960052 non-null  object
dtypes: int64(5), object(10)
memory usage: 109.9+ MB


In [10]:
df['product_category'].value_counts()

Watches    960056
Name: product_category, dtype: int64

In [11]:
# Use the first 1000 reviews as the training data
data = df.loc[0:999, 'review_body'].tolist()
##data=df[:1000].review_body.tolist()

In [12]:
data[:5]

['Absolutely love this watch! Get compliments almost every time I wear it. Dainty.',
 'I love this watch it keeps time wonderfully.',
 'Scratches',
 'It works well on me. However, I found cheaper prices in other places after making the purchase',
 "Beautiful watch face.  The band looks nice all around.  The links do make that squeaky cheapo noise when you swing it back and forth on your wrist which can be embarrassing in front of watch enthusiasts.  However, to the naked eye from afar, you can't tell the links are cheap or folded because it is well polished and brushed and the folds are pretty tight for the most part.<br /><br />I love the new member of my collection and it looks great.  I've had it for about a week and so far it has kept good time despite day 1 which is typical of a new mechanical watch"]

## Preprocessing Data

### Stop-word removal, Tokenizing, Stemming

Tokenizing is the process of splitting a text into individual words or sequences of words (N-grams). Each of these smaller units is called a token.  <br> 
-1-grams: "I like this product" -> [I, like, this, product] <br>
-2-grams: "I like this product" -> [I like, like this, this product] <br>
-3-grams: "I like this product" -> [I like this, like this product] <br>
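
The N-gram lists above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `ngrams` is my own, and the whitespace split stands in for a real tokenizer such as `nltk.word_tokenize`.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization, for illustration only.
tokens = "I like this product".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['I', 'like', 'this', 'product']
print(bigrams)   # ['I like', 'like this', 'this product']
print(trigrams)  # ['I like this', 'like this product']
```

A sentence with m tokens yields m - n + 1 n-grams, which is why the 3-gram list has only two entries.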

Stemming is the process of reducing a word to its root form (stem).  <br>
-For example, the words care, cared, and caring share the stem ‘care’. <br>

Stop words are words such as "in", "a", and "the" that carry little meaning on their own. <br>


In [13]:
# Use nltk's English stopwords 
stopwords = nltk.corpus.stopwords.words('english')
# Extra stopwords
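stopwords.append("watch") # every review is about watches, so the word carries no information
stopwords.append("br") # residue of the HTML tag <br>
stopwords.append("'s") # token split off from he's or she's
stopwords.append("'m") # token split off from I'm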

print("We use " + str(len(stopwords)) + " stop-words.")
print(stopwords)

We use 183 stop-words.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 

In [14]:
stemmer = SnowballStemmer("english")

# Tokenization and stemming
def tokenization_and_stemming(text):

    # tokenize the document, lowercase each token, and drop stop-words
    tokens = []
    for word in nltk.word_tokenize(text):
        if word.lower() not in stopwords:
            tokens.append(word.lower())
    ## equivalently,
    ## tokens = [word.lower() for word in nltk.word_tokenize(text) if word.lower() not in stopwords]

    # keep only purely alphabetic tokens
    # (drops numeric tokens, raw punctuation, and tokens that mix letters with other characters)
    filtered_tokens = []
    for token in tokens:
        if token.isalpha():
            filtered_tokens.append(token)
    ## equivalently,
    ## filtered_tokens = [token for token in tokens if token.isalpha()]

    # stemming
    stems = [stemmer.stem(t) for t in filtered_tokens] # list comprehension
    ## equivalently,
    ## stems = []
    ## for t in filtered_tokens:
        ## stems.append(stemmer.stem(t))

    return stems


In [15]:
tokenization_and_stemming(data[0])

['absolut',
 'love',
 'get',
 'compliment',
 'almost',
 'everi',
 'time',
 'wear',
 'dainti']

In [16]:
data[0]

'Absolutely love this watch! Get compliments almost every time I wear it. Dainty.'

### TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general. <br>

TF: Term Frequency <br>
IDF: Inverse Document Frequency <br>
TF-IDF is the product of the two statistics, TF and IDF.  <br>

$\mathrm{TF}(\text{wordA}, \text{docB}) = \frac{\text{count of wordA in docB}}{\text{count of total words in docB}}$ <br>

$\mathrm{IDF}(\text{wordA}) = \log\left(\frac{\text{number of total documents in corpus}}{\text{number of documents where wordA appears} + 1}\right)$

$\text{TF-IDF}(\text{wordA}, \text{docB}) = \mathrm{TF}(\text{wordA}, \text{docB}) \times \mathrm{IDF}(\text{wordA})$
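
To make the three formulas above concrete, here is a small worked example in plain Python. It is a sketch under stated assumptions: the function names and the four-document toy corpus are invented for illustration, and the IDF uses the "+ 1" in the denominator exactly as written above.

```python
import math

def tf(word, doc):
    """Term frequency: share of the tokens in `doc` that equal `word`."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """Inverse document frequency, with + 1 in the denominator as in the formula above."""
    n_containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (n_containing + 1))

def tf_idf(word, doc, corpus):
    """TF-IDF is the product of the two statistics."""
    return tf(word, doc) * idf(word, corpus)

# Made-up, already tokenized corpus of four short "reviews".
corpus = [
    ["great", "band", "great", "price"],
    ["band", "broke"],
    ["love", "band"],
    ["love", "gift"],
]

# "great" is frequent in document 0 and rare in the corpus: 0.5 * log(4 / 2)
print(round(tf_idf("great", corpus[0], corpus), 4))  # 0.3466

# "band" appears in 3 of 4 documents: 0.25 * log(4 / 4) = 0
print(tf_idf("band", corpus[0], corpus))  # 0.0
```

Note that scikit-learn's `TfidfVectorizer` does not use exactly this formula: by default it smooths the IDF as $\ln\frac{1+n}{1+df} + 1$ and L2-normalizes each document vector, so the values in its matrix will differ from this sketch.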

In [17]:
# define vectorizer parameters
# TfidfVectorizer will help us to create TF-IDF matrix
# max_df: maximum document frequency for the given word
# min_df: minimum document frequency for the given word
# max_features: maximum number of words
# use_idf: if not true, we only calculate tf
# stop_words: built-in stop words
# tokenizer: how to tokenize the document
# ngram_range: (min_value, max_value), eg. (1,3) means the result will include 1-gram, 2-gram, 3-gram

tfidf_model = TfidfVectorizer(max_df=0.99, min_df=0.01, max_features=1000,
                              use_idf=True, ngram_range=(1,1),
                              stop_words="english",
                              tokenizer=tokenization_and_stemming)

tfidf_matrix = tfidf_model.fit_transform(data) #fit the vectorizer to data

print("In total, there are " + str(tfidf_matrix.shape[0]) + " reviews and " \
      + str(tfidf_matrix.shape[1]) + " terms.")


In total, there are 1000 reviews and 239 terms.


In [18]:
tfidf_matrix.shape
## Y.shape is (n,m)
## Y.shape[0] is n rows
## Y.shape[1] is m columns

(1000, 239)

In [19]:
tfidf_matrix.todense() # todense returns a matrix

matrix([[0.       , 0.5125863, 0.       , ..., 0.       , 0.       ,
         0.       ],
        [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
         0.       ],
        [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
         0.       ],
        ...,
        [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
         0.       ],
        [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
         0.       ],
        [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
         0.       ]])

In [20]:
type(tfidf_matrix.toarray()) # toarray returns a ndarray

numpy.ndarray

In [21]:
# get feature names
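# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an ndarray
tf_selected_words = tfidf_model.get_feature_names_out().tolist()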

In [22]:
len(tf_selected_words)

239

In [23]:
tf_selected_words[:20]

['abl',
 'absolut',
 'accur',
 'actual',
 'adjust',
 'alarm',
 'alreadi',
 'alway',
 'amaz',
 'amazon',
 'anoth',
 'arm',
 'arriv',
 'automat',
 'awesom',
 'bad',
 'band',
 'batteri',
 'beauti',
 'best']

## K-means Clustering

**K-means** is an unsupervised learning algorithm. <br>
It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (also called the cluster center or cluster centroid). <br>
It minimizes the within-cluster sum of squared errors. <br>

In [26]:
Image(url= "https://i.stack.imgur.com/mhwgB.png"
,width=400, height=200)

In [27]:
Image(url= "https://stanford.edu/~cpiech/cs221/img/kmeansViz.png"
,width=400, height=200)

In [28]:
Image(url= "https://www.unioviedo.es/compnum/labs/new/d1.png"
,width=400, height=200)

The objects are represented as $d$-dimensional vectors $\left(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}\right)$. K-means partitions them into $k$ groups $\mathbf{S}=\left\{ S_{1},S_{2},\ldots,S_{k}\right\}$ so that the sum of squared distances from each object to the centroid of its group is minimized. The problem can be formulated as:

$$\underset{\mathbf{S}}{\mathrm{min}}\; E\left(\mathbf{S}\right)=\underset{\mathbf{S}}{\mathrm{min}}\sum_{i=1}^{k}\sum_{\mathbf{x}_{j}\in S_i}\left\Vert \mathbf{x}_{j}-\boldsymbol{\mu}_{i}\right\Vert ^{2}
\quad$$

where $\mathbf{S}$ is the partition of the dataset into $k$ groups or clusters, each object $\mathbf{x}_{j}$ is a vector whose elements represent its characteristics or attributes, and $\boldsymbol{\mu}_{i}$ is the centroid (mean) of the objects in $S_i$.
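
The objective above is usually minimized by alternating two steps: assign every object to its nearest centroid, then move every centroid to the mean of its objects. The following is a minimal pure-Python sketch of that iteration, not scikit-learn's implementation; all function names and the six toy points are my own assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_of(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
            for p in points]

def kmeans(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # random initialization from the data points
    for _ in range(n_iter):
        labels = assign(points, centroids)   # assignment step
        for i in range(k):                   # update step
            members = [p for p, label in zip(points, labels) if label == i]
            if members:                      # an empty cluster keeps its old centroid
                centroids[i] = centroid_of(members)
    labels = assign(points, centroids)
    # the objective E: within-cluster sum of squared distances
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, inertia

# Toy data: two obvious groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, inertia = kmeans(points, k=2)
print(labels, round(inertia, 3))
```

Which group receives label 0 depends on the random initialization, which is why the notebook fixes `random_state` when calling `KMeans`.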

In [29]:
# k-means clustering
num_clusters = 5
km = KMeans(n_clusters = num_clusters, init='random', random_state=1991)
km.fit(tfidf_matrix)

KMeans(init='random', n_clusters=5, random_state=1991)

In [30]:
clusters = km.labels_.tolist()

In [31]:
# create data frame
reviews = {'review': df[:1000].review_body, 'cluster': clusters} # dictionary
frame = pd.DataFrame(reviews)

In [32]:
frame.head(10)

| | review | cluster |
|---|---|---|
| 0 | Absolutely love this watch! Get compliments al... | 2 |
| 1 | I love this watch it keeps time wonderfully. | 2 |
| 2 | Scratches | 0 |
| 3 | It works well on me. However, I found cheaper ... | 0 |
| 4 | Beautiful watch face. The band looks nice all... | 1 |
| 5 | i love this watch for my purpose, about the pe... | 2 |
| 6 | for my wife and she loved it, looks great and ... | 2 |
| 7 | I was about to buy this thinking it was a Swis... | 0 |
| 8 | Watch is perfect. Rugged with the metal &#34;B... | 0 |
| 9 | Great quality and build.<br />The motors are r... | 0 |


In [33]:
print("Number of reviews included in each cluster:")
frame['cluster'].value_counts().to_frame()

Number of reviews included in each cluster:


| | cluster |
|---|---|
| 0 | 548 |
| 1 | 194 |
| 2 | 114 |
| 4 | 81 |
| 3 | 63 |


In [34]:
km.cluster_centers_

array([[0.00466197, 0.00485654, 0.0002854 , ..., 0.00506218, 0.01268035,
        0.01135057],
       [0.00794106, 0.00333389, 0.01235751, ..., 0.00930425, 0.02610331,
        0.02771646],
       [0.        , 0.03673336, 0.        , ..., 0.01189996, 0.01672348,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.00771798,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.00809634,
        0.        ]])

In [35]:
km.cluster_centers_.shape
# 5 clusters and 239 terms
# -> assumption: the cluster centroid can represent the cluster
# -> the greater a term's weight in the centroid (its mean tf-idf over the cluster), the more representative the term is of the cluster
# -> select the 6 terms with the greatest centroid weights to represent each cluster

(5, 239)

In [36]:
# km.cluster_centers_ holds the weight of each term in each cluster centroid.
# Sort the term indices of each row in decreasing order of weight to get the top terms.

order_centroids = km.cluster_centers_.argsort()[:, ::-1] 
## order_centroids = np.argsort(-km.cluster_centers_)
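
The descending sort above is what turns a centroid into a keyword list. As a rough illustration of the same idea without NumPy, the sketch below ranks term indices by weight and maps them back to words. The helper `top_terms`, the five-word vocabulary, and the weights are all made up for this example.

```python
def top_terms(centroid, vocabulary, n):
    """Return the n vocabulary terms with the largest centroid weights."""
    # Sort term indices by weight, largest first: a pure-Python analogue of argsort()[::-1].
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [vocabulary[i] for i in order[:n]]

vocabulary = ["band", "gift", "love", "price", "time"]  # made-up terms
centroid = [0.02, 0.11, 0.30, 0.00, 0.07]               # made-up centroid weights

keywords = top_terms(centroid, vocabulary, 3)
print(keywords)  # ['love', 'gift', 'time']
```

When two weights are exactly equal, this version and the NumPy version may order the tied terms differently.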

In [37]:
###print("<Document Clustering Result by K-means>")
###print ()
Cluster_keywords_summary = {} #dictionary
for i in range(num_clusters):
    Cluster_keywords_summary[i] = [] #the ith key has an empty list of values
    ###print ("Cluster " + str(i) + " words: ", end='') 
    for ind in order_centroids[i, :6]: 
        Cluster_keywords_summary[i].append(tf_selected_words[ind])       
        ###print (tf_selected_words[ind] + ",", end='')
    ###print ()
    ###cluster_reviews = frame[frame.cluster==i].review.tolist()
    ###print ("Cluster " + str(i) + " reviews: " + str(len(cluster_reviews)) + " reviews ")
    ###print (", ".join(cluster_reviews))
    ###print ()

In [38]:
df_Cluster_keywords = pd.DataFrame(Cluster_keywords_summary).transpose()
df_Cluster_keywords.columns = ["Word" + str(i) for i in range(df_Cluster_keywords.shape[1])]
df_Cluster_keywords.index = ["Topic" + str(i) for i in range(df_Cluster_keywords.shape[0])]

In [39]:
df_Cluster_keywords

| | Word0 | Word1 | Word2 | Word3 | Word4 | Word5 |
|---|---|---|---|---|---|---|
| Topic0 | great | look | work | like | band | perfect |
| Topic1 | time | wear | day | read | hand | easi |
| Topic2 | love | wife | look | husband | beauti | gift |
| Topic3 | nice | price | look | realli | simpl | good |
| Topic4 | good | product | qualiti | look | price | seller |


## LDA

Latent Dirichlet Allocation (LDA) is a generative statistical model, that is, a model of the joint probability distribution $P(X,Y)$ of the observable variables $X$ (here, the words in each document) and the unobserved variables $Y$ (here, the latent topics). It allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. <br>

If observations are words collected into documents, it posits that 1) each document is a mixture of a small number of topics and 2) each word's presence is attributable to one of the document's topics. <br>

LDA is an example of a topic model. In LDA, each document is assumed to be characterized by a particular set of topics. <br>

If the input data is a $\text{doc} \times \text{word}$ matrix, LDA factorizes it into 2 matrices: $\text{doc} \times \text{topic}$ and $\text{topic} \times \text{word}$. <br>
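
The shapes involved are easier to see on a tiny example. The sketch below multiplies a made-up doc-topic matrix (3 documents, 2 topics) by a made-up topic-word matrix (2 topics, 4 words) to recover a doc-word matrix of word probabilities. All numbers and the helper `matmul` are assumptions for illustration; they are not output of the model fitted in this notebook.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# doc x topic: each row is one document's topic mixture and sums to 1
doc_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]

# topic x word: each row is one topic's distribution over 4 words and sums to 1
topic_word = [
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.1, 0.3, 0.6],
]

# doc x word: the word distribution implied for each document
doc_word = matmul(doc_topic, topic_word)

print(len(doc_word), len(doc_word[0]))     # 3 4
print([round(p, 2) for p in doc_word[0]])  # [0.45, 0.37, 0.12, 0.06]
```

In scikit-learn, `fit_transform` returns the doc-topic matrix and `components_` holds the topic-word matrix. The rows of `components_` are unnormalized pseudo-counts, so each row has to be divided by its sum before it can be read as a probability distribution like the one in this sketch.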

**LDA vs. K-means**

If both are applied to assign K topics to a set of N documents, the most evident difference is that K-means partitions the N documents into K disjoint clusters, whereas LDA assigns each document a mixture of topics. That is, each document is characterized by one or more topics (e.g. Document D is 60% Topic A, 30% Topic B, and 10% Topic C). <br>
Hence, LDA can give more realistic results than K-means when a document covers more than one topic.
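
The difference between a hard assignment and a mixture can be sketched in a few lines. The helper `dominant_topic` and the example mixtures are hypothetical; the function mirrors what `np.argmax` does on one row of a doc-topic matrix.

```python
def dominant_topic(mixture):
    """Index of the largest weight in a topic mixture; ties go to the lowest index."""
    return max(range(len(mixture)), key=lambda i: mixture[i])

# A soft assignment, as LDA gives: 60% Topic A, 30% Topic B, 10% Topic C.
mixture = [0.6, 0.3, 0.1]

# Collapsing it to one label, as K-means would report, keeps only the largest share.
hard_label = dominant_topic(mixture)
print(hard_label)  # 0

# A uniform mixture carries no preference, yet a hard label still has to pick one topic.
uniform = [0.2, 0.2, 0.2, 0.2, 0.2]
print(dominant_topic(uniform))  # 0
```

The uniform case is worth keeping in mind when reading a dominant-topic column: a label of 0 can mean either "mostly Topic 0" or "no information at all", and only the full mixture distinguishes the two.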

In [40]:
Image(url= "https://www.researchgate.net/publication/336065245/figure/fig1/AS:807371718815752@1569503826964/Latent-Dirichlet-allocation-LDA-process-and-its-two-outputs-a-LDA-document.ppm"
,width=600, height=400)

In [41]:
Image(url= "https://cdn-images-1.medium.com/max/800/1*a5IlRfBwrv6yVrkj4ExX_g.png"
,width=300, height=100)

In [42]:
Image(url= "https://www.researchgate.net/publication/343176425/figure/fig1/AS:916645862182912@1595556812172/LDA-topic-modelling-process.ppm"
,width=600, height=400)

In [43]:
# use LDA for topic modeling (5 topics, to match the 5 K-means clusters)
lda_model = LatentDirichletAllocation(n_components=5, random_state=1991)

In [44]:
# doc_word matrix
tfidf_matrix.shape

(1000, 239)

In [45]:
# doc-topic matrix
doc_topic = lda_model.fit_transform(tfidf_matrix)

In [46]:
doc_topic.shape

(1000, 5)

In [47]:
doc_topic

array([[0.05950691, 0.53812812, 0.06086787, 0.0598547 , 0.2816424 ],
       [0.08399915, 0.66242875, 0.08432227, 0.08400489, 0.08524495],
       [0.2       , 0.2       , 0.2       , 0.2       , 0.2       ],
       ...,
       [0.59983335, 0.10000195, 0.10015176, 0.10000025, 0.10001269],
       [0.47905571, 0.06713959, 0.06699435, 0.31998322, 0.06682713],
       [0.0672561 , 0.7260046 , 0.06888808, 0.07035559, 0.06749563]])

In [48]:
# topic-word matrix
topic_word = lda_model.components_

In [49]:
topic_word.shape

(5, 239)

In [50]:
topic_word

array([[0.20017765, 0.20232328, 0.87620545, ..., 0.20046613, 2.24236221,
        3.19430708],
       [0.20120433, 0.20672584, 0.20102552, ..., 0.20159425, 2.47888732,
        2.5193017 ],
       [1.50256865, 0.2001359 , 0.2093817 , ..., 3.94013261, 1.98853381,
        1.63803382],
       [0.29642298, 0.20096231, 0.20195735, ..., 1.16893045, 0.2018429 ,
        0.20289249],
       [2.89495347, 7.6856155 , 2.06518506, ..., 1.42457386, 9.14976156,
        5.04257037]])

In [51]:
# column names
topic_names = ["Topic" + str(i) for i in range(lda_model.n_components)]

# index names
doc_names = ["Doc" + str(i) for i in range(len(data))]

# create a data frame
df_document_topic = pd.DataFrame(np.round(doc_topic,2), columns=topic_names, index=doc_names)

# get dominant topic for each document
topic = np.argmax(df_document_topic.values, axis=1)
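## axis=1 takes the argmax across the columns of each row, i.e. one value per document
## (axis=0 would take it down each column, i.e. one value per topic)
## vs. np.max(df_document_topic.values, axis=1), which returns the largest weight itself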

df_document_topic['topic'] = topic

In [52]:
df_document_topic.head(10)

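| | Topic0 | Topic1 | Topic2 | Topic3 | Topic4 | topic |
|---|---|---|---|---|---|---|
| Doc0 | 0.06 | 0.54 | 0.06 | 0.06 | 0.28 | 1 |
| Doc1 | 0.08 | 0.66 | 0.08 | 0.08 | 0.09 | 1 |
| Doc2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0 |
| Doc3 | 0.06 | 0.06 | 0.06 | 0.06 | 0.76 | 4 |
| Doc4 | 0.19 | 0.04 | 0.04 | 0.04 | 0.7 | 4 |
| Doc5 | 0.08 | 0.36 | 0.41 | 0.07 | 0.08 | 2 |
| Doc6 | 0.54 | 0.07 | 0.06 | 0.06 | 0.27 | 0 |
| Doc7 | 0.06 | 0.75 | 0.06 | 0.06 | 0.06 | 1 |
| Doc8 | 0.05 | 0.05 | 0.36 | 0.05 | 0.5 | 4 |
| Doc9 | 0.06 | 0.67 | 0.06 | 0.06 | 0.16 | 1 |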


In [53]:
df_document_topic['topic'].value_counts().to_frame()

| | topic |
|---|---|
| 4 | 312 |
| 0 | 204 |
| 3 | 175 |
| 1 | 166 |
| 2 | 143 |


In [54]:
df_topic_words = pd.DataFrame(topic_word)
df_topic_words.columns = tf_selected_words
df_topic_words.index = topic_names

In [55]:
df_topic_words

| | abl | absolut | accur | actual | adjust | alarm | alreadi | alway | amaz | amazon | ... | weight | went | wife | wind | wish | work | worn | worth | wrist | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Topic0 | 0.200178 | 0.202323 | 0.876205 | 0.20083 | 0.200176 | 0.200273 | 1.03701 | 0.201194 | 8.250275 | 0.293153 | ... | 0.205074 | 0.208047 | 0.202356 | 0.200096 | 0.200167 | 1.706322 | 0.200049 | 0.200466 | 2.242362 | 3.194307 |
| Topic1 | 0.201204 | 0.206726 | 0.201026 | 2.80345 | 0.201914 | 1.167272 | 0.85989 | 3.566437 | 0.200737 | 2.052761 | ... | 0.868185 | 0.20293 | 0.200692 | 0.200444 | 3.760299 | 1.797782 | 0.200999 | 0.201594 | 2.478887 | 2.519302 |
| Topic2 | 1.502569 | 0.200136 | 0.209382 | 1.333917 | 0.200691 | 1.825499 | 2.267823 | 0.204002 | 0.202399 | 0.410212 | ... | 1.864641 | 0.201754 | 0.201179 | 0.200671 | 0.206354 | 0.321058 | 1.414495 | 3.940133 | 1.988534 | 1.638034 |
| Topic3 | 0.296423 | 0.200962 | 0.201957 | 0.694169 | 2.211366 | 0.203925 | 0.200546 | 0.662416 | 0.200845 | 1.800083 | ... | 0.200753 | 1.296837 | 0.200781 | 0.200227 | 0.200371 | 25.352876 | 0.200127 | 1.16893 | 0.201843 | 0.202892 |
| Topic4 | 2.894953 | 7.685616 | 2.065185 | 0.82403 | 3.479024 | 2.543441 | 1.398101 | 0.205955 | 1.914578 | 2.246331 | ... | 2.57971 | 1.807606 | 11.83739 | 3.43227 | 0.203318 | 11.88363 | 2.187422 | 1.424574 | 9.149762 | 5.04257 |


In [56]:
# top keywords for each topic
def top_topic_words (tfidf_model, lda_model, n_words):
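    # get_feature_names_out() already returns an ndarray (get_feature_names() was removed in scikit-learn 1.2)
    words = tfidf_model.get_feature_names_out()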
    topic_words = []
    # for each topic, we have weight for each word
    for topic_words_weights in lda_model.components_:
        top_words = topic_words_weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return pd.DataFrame(topic_words) 

In [57]:
df_topic_words = top_topic_words(tfidf_model, lda_model, 15)

In [58]:
df_topic_words.columns = ["Word" + str(i) for i in range(df_topic_words.shape[1])]
df_topic_words.index = ["Topic" + str(i) for i in range(df_topic_words.shape[0])]

In [59]:
df_topic_words

| | Word0 | Word1 | Word2 | Word3 | Word4 | Word5 | Word6 | Word7 | Word8 | Word9 | Word10 | Word11 | Word12 | Word13 | Word14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Topic0 | good | great | look | band | beauti | love | pictur | price | qualiti | big | amaz | cheap | deal | broke | feel |
| Topic1 | love | like | gift | bought | larg | husband | watch | buy | beauti | pretti | compliment | realli | stylish | color | time |
| Topic2 | expect | light | realli | comfort | nice | awesom | price | look | wear | band | color | thank | valu | face | blue |
| Topic3 | work | excel | product | perfect | recommend | fast | ship | deliveri | tri | day | nice | qualiti | money | exact | simpl |
| Topic4 | nice | time | look | band | batteri | strap | work | wife | hand | love | fit | use | easi | set | littl |
