# Lab 3: Text clustering

Generate and interpret document and term clusters from hotel reviews.

Objectives:

- work with sklearn components
- construct document-term and term-document matrices
- apply dimensionality reduction
- perform cluster analysis on document and term collections
- interpret results

In [1]:
import re

import nltk
import numpy as np
import pandas as pd
from cytoolz import *

In [2]:
pd.set_option("display.max_colwidth", None)

In [3]:
df = pd.read_pickle("/data/hotels_id.pkl")
lax = df.query('offering_id==77944')

In [4]:
len(lax)

538

In [5]:
lax.head()

Unnamed: 0,title,text,overall,value,service,cleanliness,offering_id
362503,“Nice stay with a bit of a twist”,"The hotel with a ghetto feel !!\nThis hotel had all the features that were required for a pleasant stay. The staff were pleasant and very helpful and the location was a bonus, as me and my partner were about to make the 15 hour journey back to Australia the next morning. We decided that it would be a good idea to stay close to the airport, so we did not have the chance to miss the flight. \nWe decided to go out and do a bit of shopping. We came back with the outisde of the hotel covered in police tape and a vehicle with the CSI theme, with all the shell casings from a shooting, a mass of new reporters trying to get the story of the day. \nLater that evening we were down in the lobby area, when about 50 people were out the front rowing over the shooting that had just taken place and who was to blame, when all of a sudden a gentleman was screaming about pulling out a gun and shooting them.\nOf course it was not the hotels fault, however it is just some advise for people, to be careful as the area was a bit dangerous and wouldn't be suitable for a young family.",3.0,3.0,4.0,4.0,77944
362504,“Excellent pit stop”,"Popular with longhaul cabin and flight crew is always a good sign when the same needs apply to your stop over: dead close to terminals, good 24hr airport shuttle, quiet, good beds, clean, quality international, no-fuss and locally-aware reception staff. Oh, and quality, not rip-off food and bar/s on site. This establishment managed all of the above without any fuss in the 23hrs we spent as a family breaking a ridiculous NY to Brisbane sector. Add a decent gym, a small pool and some touches of unexpected luxury and I can recommend this one unhesitatingly. The lobby bar food particularly was outstanding.",4.0,3.0,4.0,4.0,77944
362505,“Great if you rent a car from Avis; excellent if you”,"We stayed there throughout our Los Angeles 3 nights. Our rental car was just across the road (very handy). There is nothing bad about the Renaissance... and nothing exciting either! - It's just what you expect from an airport hotel. We had managed to get a really cheap rate at our dates (from memory, under $150 a night), and that made the stay really worthwhile. We could explore the wider area (Los Angeles, Hollywood, Venice Beach, Beverly Hills...) by car from the hotel, and never got stuck in traffic!",4.0,4.0,4.0,4.0,77944
362528,“Decent place to stay”,"This hotel was decent to stay. Clean rooms, nice sheets and courteous staff. Valet parking is a ripoff, and self parking was expensive (12 dollars per day). The rooms had nice terrycloth robes and they were well insulated from the other rooms and airport noise overall. This would be a great place to stay if you needed to be close to the airport.",4.0,4.0,4.0,4.0,77944
362530,“Great for an overnight”,"In October 2012, we did an overnight stay at the Renaissance LAX on the eve of a vacation at a time share down the coast. Our room was clean and comfortable, the staff attentive and friendly and the folks at the bar/restaurant made us feel at home after our late flight. Nice place to stay if you have to stay at an airport, and the price was certainly right. Highly recommend.",5.0,5.0,5.0,5.0,77944


## Construct a document-term matrix

See the [Sklearn user manual](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) for functions that construct document-term matrices (dtms). The most useful for this lab are [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Both have lots of options you can try adjusting, e.g.:

- min_df: ignore words that occur in fewer than `min_df` documents (an integer is read as a count, a float as a proportion of documents)
- max_df: ignore words that occur in more than `max_df` documents (an integer is read as a count, a float as a proportion of documents)
- stop_words: if set to `"english"`, removes common English function words
- analyzer, tokenizer, preprocessor, lowercase, strip_accents: control sklearn's built-in tokenizer. If you want to use your own tokenizer (like the multi-word one we made last week), set `analyzer=identity` (`identity` is imported from `cytoolz` above) and give the vectorizer pre-tokenized texts.

`TfidfVectorizer` adds a few additional options:

- norm: set to `l2` (the default) to normalize doc vectors to unit length or `None` for no normalization
- use_idf: include idf term
- smooth_idf: add one to document frequencies
- sublinear_tf: replace tf term with 1+log(tf)

And see docs for many other options!

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
tf = TfidfVectorizer(min_df=2, max_df=0.6, use_idf=False)
dtm = tf.fit_transform(lax["text"])

In [8]:
dtm

<538x2781 sparse matrix of type '<class 'numpy.float64'>'
	with 42251 stored elements in Compressed Sparse Row format>

## Apply dimensionality reduction

The original formulation of Latent Semantic Indexing used SVD for dimensionality reduction. The sklearn function [`TruncatedSVD`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) is an efficient implementation of SVD for large sparse matrices (such as dtms). The argument `n_components` sets the number of dimensions (i.e., columns) in the reduced matrix.

You can experiment with other dimensionality reduction techniques, as described in the manual sections on [Matrix factorization](https://scikit-learn.org/stable/modules/decomposition.html#lsa) and [Manifold](https://scikit-learn.org/stable/modules/manifold.html#t-sne). Or, you can skip this step and use your dtm directly as input to the clusterer.
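
As a rough illustration of what `n_components` controls, the sketch below reduces a synthetic sparse matrix (a random stand-in for a dtm, not the hotel data) and reports how much of the variance the retained components account for. The matrix size, density, and seeds are arbitrary assumptions.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for a dtm: 100 "documents" x 50 "terms", about 10% nonzero
toy = sparse.random(100, 50, density=0.1, format="csr", random_state=0)

# Keep 5 dimensions; the input stays sparse, the output is a dense array
svd = TruncatedSVD(n_components=5, random_state=0)
reduced = svd.fit_transform(toy)

print(reduced.shape)
# Share of the total variance captured by the 5 retained components
print(svd.explained_variance_ratio_.sum())
```

On real text the retained share is usually a small fraction of the total, so it can be worth checking this number when choosing `n_components`.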



In [9]:
from sklearn.decomposition import TruncatedSVD

In [10]:
lsi = TruncatedSVD(n_components=25, random_state=760)
dtm_lsi = lsi.fit_transform(dtm)

In [11]:
dtm_lsi.shape

(538, 25)

## Perform cluster analysis

Sklearn has implementations of most current clustering algorithms. See the [clustering chapter](https://scikit-learn.org/stable/modules/clustering.html#clustering) for more discussion.

The [k-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm is a particularly simple one. Its main argument, `n_clusters`, sets the number of clusters that k-means will find. K-means is a non-deterministic algorithm: it starts from randomly chosen centroids, so each time you run it you may get a different answer. If you want your results to be reproducible, set the `random_state` argument.
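
A small sketch of the reproducibility point, using random synthetic points rather than the review data (the array shape, `n_clusters=4`, and the seeds are arbitrary assumptions): two separate fits with the same `random_state` should assign identical labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 200 points in 5 dimensions
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 5))

# Two independent fits that share the same random_state
labels_a = KMeans(n_clusters=4, random_state=1, n_init=10).fit_predict(points)
labels_b = KMeans(n_clusters=4, random_state=1, n_init=10).fit_predict(points)

# True when both runs produced exactly the same assignment
print(np.array_equal(labels_a, labels_b))
```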




In [12]:
from sklearn.cluster import KMeans

In [13]:
kmeans = KMeans(n_clusters=20, random_state=858, n_init='auto')
cluster = kmeans.fit_predict(dtm_lsi)

## Interpret results

Now the hard part: what does all this mean? There are two ways of evaluating a clustering solution, quantitatively and qualitatively.

In [14]:
from sklearn import metrics

A simple quantitative metric: how many documents are in each cluster?

In [15]:
nltk.FreqDist(cluster).most_common()

[(2, 41),
 (12, 41),
 (17, 41),
 (4, 36),
 (3, 33),
 (9, 31),
 (6, 31),
 (8, 30),
 (0, 27),
 (5, 26),
 (1, 26),
 (18, 26),
 (19, 24),
 (14, 23),
 (7, 22),
 (11, 22),
 (15, 21),
 (13, 13),
 (16, 12),
 (10, 12)]

More complex metrics take into account how diffuse the clusters are and how well separated they are. Three potentially useful metrics are:

- [Calinski Harabasz_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score) [higher = better]
- [Davies Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score) [lower = better]
- [Silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score) [higher = better]

These are mostly useful for comparing different clusterings of the same data. For example, we could use these to find a good value for *k*, the number of clusters.






In [16]:
for k in range(2, 20):
    _kmeans = KMeans(n_clusters=k, random_state=858, n_init="auto")
    _cluster = _kmeans.fit_predict(dtm_lsi)
    print(
        f"{k:2}  "
        f"{metrics.calinski_harabasz_score(dtm_lsi, _cluster):7.3f}   "
        f"{metrics.davies_bouldin_score(dtm_lsi, _cluster):7.3f}   "
        f"{metrics.silhouette_score(dtm_lsi, _cluster):7.3f}  "
    )

 2   73.224     2.523     0.111  
 3   51.110     3.244     0.086  
 4   38.569     3.619     0.047  
 5   34.990     3.219     0.081  
 6   29.994     3.145     0.076  
 7   27.726     3.225     0.055  
 8   25.029     3.192     0.051  
 9   22.464     3.335     0.039  
10   21.392     3.162     0.046  
11   20.269     3.062     0.048  
12   19.216     3.032     0.043  
13   18.683     2.818     0.048  
14   17.463     2.867     0.046  
15   16.575     2.919     0.044  
16   16.524     2.774     0.049  
17   15.700     2.828     0.047  
18   15.283     2.776     0.049  
19   15.123     2.637     0.052  


Qualitative evaluation means reading the data and trying to figure it out. We could look at all of the reviews in some of the smaller clusters:

In [17]:
lax[cluster == 16]["text"].head()

362504    Popular with longhaul cabin and flight crew is always a good sign when the same needs apply to your stop over: dead close to terminals, good 24hr airport shuttle, quiet, good beds, clean, quality international, no-fuss and locally-aware reception staff. Oh, and quality, not rip-off food and bar/s on site. This establishment managed all of the above without any fuss in the 23hrs we spent as a family breaking a ridiculous NY to Brisbane sector. Add a decent gym, a small pool and some touches of unexpected luxury and I can recommend this one unhesitatingly. The lobby bar food particularly was outstanding.
362885                                                                                                                                                                                                                                                                                                                                                                                  The 

Or, since we're using *k*-means, we could sort them in order of proximity to the cluster centroid.

In [18]:
C = 16
distances = metrics.pairwise.euclidean_distances(
    kmeans.cluster_centers_[C, np.newaxis, :], dtm_lsi[cluster == C, :]
)
lax[cluster == C].iloc[np.argsort(distances)[0]]["text"].head(5)

362504    Popular with longhaul cabin and flight crew is always a good sign when the same needs apply to your stop over: dead close to terminals, good 24hr airport shuttle, quiet, good beds, clean, quality international, no-fuss and locally-aware reception staff. Oh, and quality, not rip-off food and bar/s on site. This establishment managed all of the above without any fuss in the 23hrs we spent as a family breaking a ridiculous NY to Brisbane sector. Add a decent gym, a small pool and some touches of unexpected luxury and I can recommend this one unhesitatingly. The lobby bar food particularly was outstanding.
364394                                                                                                                                                                                                                                                                                                                                                                                      

## Term clustering

Document clustering is not usually very helpful, unless the documents are very short and/or naturally fall into a small number of distinct classes. Term clustering, on the other hand, does sometimes give interesting results. 

To cluster terms, all we need to do is transpose the document-term matrix to form the term-document matrix (tdm). After that, the rest of the process is the same, except that instead of relating documents by the terms they share, we're relating terms by the documents they co-occur in.



In [124]:
tf = TfidfVectorizer(min_df=2, max_df=0.6, use_idf=True)
tdm = tf.fit_transform(lax["text"]).T

lsi = TruncatedSVD(n_components=5, random_state=858)
tdm_lsi = lsi.fit_transform(tdm)

kmeans = KMeans(20, random_state=858, n_init="auto")
cluster = kmeans.fit_predict(tdm_lsi)

In [125]:
nltk.FreqDist(cluster).most_common()

[(0, 1421),
 (11, 489),
 (14, 382),
 (8, 134),
 (18, 110),
 (12, 72),
 (1, 41),
 (7, 31),
 (17, 29),
 (16, 18),
 (5, 13),
 (4, 11),
 (2, 8),
 (19, 6),
 (9, 5),
 (6, 5),
 (15, 2),
 (10, 2),
 (3, 1),
 (13, 1)]

In [126]:
vocab = tf.get_feature_names_out()

In [132]:
vocab[cluster == 5]

array(['across', 'burger', 'can', 'car', 'close', 'king', 'minutes',
       'rental', 'right', 'street', 'valet', 'walk', 'your'], dtype=object)

In [129]:
vocab[cluster == 17]

array(['about', 'after', 'all', 'also', 'area', 'could', 'day', 'didn',
       'flight', 'free', 'get', 'got', 'just', 'la', 'morning', 'next',
       'no', 'only', 'or', 'out', 'place', 'really', 'time', 'two', 'up',
       'us', 'what', 'when', 'which'], dtype=object)

In [133]:
C = 17
distances = metrics.pairwise.euclidean_distances(
    kmeans.cluster_centers_[C, np.newaxis, :], tdm_lsi[cluster == C, :]
)
vocab[cluster == C][np.argsort(distances)[0]]

array(['about', 'just', 'also', 'la', 'really', 'got', 'only', 'up',
       'morning', 'could', 'two', 'didn', 'next', 'what', 'day', 'after',
       'place', 'area', 'out', 'when', 'time', 'get', 'no', 'flight',
       'or', 'which', 'free', 'all', 'us'], dtype=object)

In [127]:
vocab[cluster == 16]

array(['again', 'angeles', 'bar', 'beds', 'buffet', 'definitely', 'early',
       'excellent', 'food', 'friendly', 'helpful', 'location', 'los',
       'needed', 'overnight', 'recommend', 'restaurant', 'will'],
      dtype=object)

In [128]:
C = 16
distances = metrics.pairwise.euclidean_distances(
    kmeans.cluster_centers_[C, np.newaxis, :], tdm_lsi[cluster == C, :]
)
vocab[cluster == C][np.argsort(distances)[0]]

array(['recommend', 'needed', 'definitely', 'excellent', 'beds', 'buffet',
       'friendly', 'early', 'overnight', 'helpful', 'location', 'bar',
       'los', 'angeles', 'restaurant', 'again', 'food', 'will'],
      dtype=object)

### Results

Once you're done experimenting with document and term clustering, write up a paragraph on your results. What methods did you try? What worked and what didn't? Did you manage to find any useful information about hotels from this?

I first tried changing max_df to 0.9 to ignore words that occur in more than 90% of documents instead of 80%. I noticed that this affected the number of documents in each cluster. Initially, cluster 3 had 63 documents, the most of any cluster. After the change, cluster 7 had the most, with 55 documents. Looking at the metrics, the scores were roughly the same, so the change I made probably wasn't very influential. I tested different values and found that a value of 0.6 for max_df yielded better scores.

Next, I wanted to see how changing the number of clusters would affect the results. Obviously, this affected the number of documents within each cluster: more clusters meant documents were spread more thinly across them. However, the scores used to find the optimal number of clusters *k* did not change. Based on the scores, two clusters is optimal.

Looking at the words in different clusters, some clusters contain words that tell us more about hotels than others. For example, cluster 16 has words like 'buffet', 'friendly', and 'recommend', while cluster 17 doesn't have as many distinctive keywords that tell us something useful about hotels.
