### Mining Consumer Reviews for Insights using Topic Modelling for Short Text



- Mining Yelp reviews for insights. The Yelp dataset is available from https://www.yelp.com/dataset in JSON format. Be warned that the file is about 4 GB.
- We will break the reviews down into sentences and cluster them using the gsdmm package. The resulting clusters should group sentences about similar aspects and experiences. While many reviews are about restaurants, others cover different businesses, such as nail salons.
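
Because the file is large, it helps to read only part of it. The following is a minimal sketch, assuming the review file stores one JSON object per line with a `text` field; `iter_review_texts` is a hypothetical helper and is not the `get_yelp_reviews` function imported later in this notebook.

```python
import json
from itertools import islice


def iter_review_texts(path, limit=None):
    """Yield the "text" field of each review, reading one JSON object per line.

    Hypothetical helper: assumes a line-delimited JSON file with a "text" key.
    """
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines so json.loads never sees an empty string.
        lines = (line for line in handle if line.strip())
        # islice stops after `limit` reviews; limit=None reads the whole file.
        for line in islice(lines, limit):
            yield json.loads(line)["text"]
```

For example, `" ".join(iter_review_texts("yelp_academic_dataset_review.json", limit=10000))` would build one string from the first 10,000 reviews without loading the whole file into memory.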

#### Packages Required

Install GSDMM by cloning its repository with `git clone https://github.com/rwalk/gsdmm.git`. It is a handy package for topic modelling of short text.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Import the GSDMM model class from the mgp module

from mgp import MovieGroupProcess

In [3]:
import string
import pickle

In [4]:
from phrases import get_yelp_reviews

In [5]:
from preprocess_bbc_dataset import get_stopwords

In [6]:
import nltk

In [7]:
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
yelp_reviews_file = "yelp_academic_dataset_review.json"
stopwords_file_path = "reviews_stopwords.csv"
stopwords = get_stopwords(stopwords_file_path)

In [8]:
def preprocess(text):
    sentences = tokenizer.tokenize(text)
    sentences = [nltk.tokenize.word_tokenize(sentence) for sentence in sentences]
    sentences = [list(set(word_list)) for word_list in sentences]
    sentences = [[word for word in word_list if word not in stopwords and word not in string.punctuation] for word_list in sentences]
    return sentences

We define the preprocessing function. It first splits the text into sentences, tokenizes each sentence into words, and removes duplicates from the word lists. The duplicate removal is necessary for the GSDMM model, as it requires a list of the unique tokens that occur in each text. The function then removes stopwords and punctuation from the word lists.
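
To see what each step does without the NLTK models or the stopword file, here is a simplified sketch of the same pipeline. It is an illustration only: the regular expressions stand in for the NLTK sentence and word tokenizers, the stopword set is a toy list, and `preprocess_sketch` is a hypothetical name. Unlike `list(set(...))`, it keeps tokens in first-seen order so the output is reproducible.

```python
import re
import string

# Toy stopword list for illustration; the notebook loads its own from a CSV file.
TOY_STOPWORDS = {"the", "was"}


def preprocess_sketch(text, stopwords=TOY_STOPWORDS):
    """Split text into sentences, then into unique, filtered tokens."""
    # Stand-in for the sentence tokenizer: split after ., ! or ? plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Stand-in for word_tokenize: runs of word characters or single punctuation marks.
    sentences = [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences]
    # Remove duplicates while keeping first-seen order.
    sentences = [list(dict.fromkeys(tokens)) for tokens in sentences]
    # Drop stopwords and punctuation tokens.
    return [
        [tok for tok in tokens if tok not in stopwords and tok not in string.punctuation]
        for tokens in sentences
    ]


print(preprocess_sketch("the pizza was great, really great. the service was slow!"))
# [['pizza', 'great', 'really'], ['service', 'slow']]
```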


In [9]:
def top_words_by_cluster(mgp, top_clusters, num_words):
    for cluster in top_clusters:
        sort_dicts = sorted(mgp.cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:num_words]
        print(f'Cluster {cluster}: {sort_dicts}')


We define the top_words_by_cluster function, which prints the most frequent words in each cluster. It sorts the words in each cluster by frequency in descending order and prints them as (word, frequency) tuples. The num_words parameter determines how many words are printed per cluster.
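
The sorting step can be tried on a stand-in object before a model is fitted. This is a minimal sketch, assuming only that the model exposes `cluster_word_distribution` as a sequence of word-to-count dictionaries, which is how the function above indexes it. `top_words` and `toy_model` are hypothetical names, and the variant returns the result instead of printing it.

```python
from types import SimpleNamespace


def top_words(model, clusters, num_words):
    """Return {cluster: [(word, count), ...]} with counts in descending order."""
    return {
        cluster: sorted(
            model.cluster_word_distribution[cluster].items(),
            key=lambda item: item[1],
            reverse=True,
        )[:num_words]
        for cluster in clusters
    }


# Toy stand-in for a fitted model with two clusters.
toy_model = SimpleNamespace(
    cluster_word_distribution=[
        {"chicken": 5, "sauce": 3, "fries": 4},
        {"wait": 7, "table": 2},
    ]
)

print(top_words(toy_model, [1, 0], 2))
# {1: [('wait', 7), ('table', 2)], 0: [('chicken', 5), ('fries', 4)]}
```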

In [10]:
reviews = get_yelp_reviews(yelp_reviews_file)

 

In [11]:
sentences = preprocess(reviews)

In [13]:
vocab = set(word for sentence in sentences for word in sentence)
n_terms = len(vocab)

We collect all the unique words in the review sentences into a set, and then assign the size of that set to the n_terms variable, which is used later when fitting the model.
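
On a toy input the vocabulary step looks like this. The sentences are made up for illustration and the `toy_` names are hypothetical.

```python
# Three already-preprocessed toy sentences (lists of unique tokens).
toy_sentences = [["pizza", "great"], ["service", "slow"], ["pizza", "slow"]]

# Flatten the nested lists into a set so each word is counted once.
toy_vocab = {word for sentence in toy_sentences for word in sentence}
toy_n_terms = len(toy_vocab)

print(sorted(toy_vocab), toy_n_terms)
# ['great', 'pizza', 'service', 'slow'] 4
```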

In [14]:
# Fit the model
mgp = MovieGroupProcess(K=25, alpha=0.1, beta=0.1, n_iters=30)
mgp.fit(sentences, n_terms)

In stage 0: transferred 59089 clusters with 25 clusters populated
In stage 1: transferred 48396 clusters with 25 clusters populated
In stage 2: transferred 40615 clusters with 25 clusters populated
In stage 3: transferred 35691 clusters with 25 clusters populated
In stage 4: transferred 32481 clusters with 25 clusters populated
In stage 5: transferred 30276 clusters with 25 clusters populated
In stage 6: transferred 28640 clusters with 25 clusters populated
In stage 7: transferred 27141 clusters with 25 clusters populated
In stage 8: transferred 26081 clusters with 25 clusters populated
In stage 9: transferred 25141 clusters with 24 clusters populated
In stage 10: transferred 24460 clusters with 23 clusters populated
In stage 11: transferred 24097 clusters with 22 clusters populated
In stage 12: transferred 23561 clusters with 22 clusters populated
In stage 13: transferred 23288 clusters with 22 clusters populated
In stage 14: transferred 23002 clusters with 21 clusters populated
In st

[14, 22, 21, 22, 5, 7, 13, 5, 22, 22, 22, 14, 22, 7, 4, 4, 8, 7, 5, 5,
 22, 4, 19, 14, 14, 5, 12, 4, 4, 4, 4, 4, 8, 8, 7, 14, 19, 8, 14, 14,
 22, 5, 14, 7, 0, 21, 7, 7, 12, 7, 4, 4, 8, 12, 24, 24, 12, 21, 14, 14,
 21, 14, 21, 0, 14, 14, 0, 24, 7, 14, 5, 7, 7, 24, 5, 4, 4, 0, 22, 16,
 7, 4, 4, 4, 4, 7, 4, 5, 5, 7, 14, 8, 16, 16, 22, 22, 22, 5, 21, 16,
 24, 7, 16, 21, 4, 12, 7, 7, 7, 4, 4, 4, 7, 7, 21, 4, 4, 8, 21, 24,
 14, 21, 4, 14, 19, 19, 4, 4, 7, 8, 8, 5, 0, 7, 16, 16, 7, 4, 4, 4,
 4, 7, 7, 7, 21, 4, 8, 7, 14, 19, 0, 5, 5, 5, 4, 4, 21, 24, 7, 24,
 8, 0, 7, 0, 5, 7, 8, 0, 8, 0, 7, 13, 8, 13, 22, 22, 14, 14, 7, 8,
 7, 8, 0, 24, 24, 24, 24, 24, 24, 22, 5, 4, 4, 13, 14, 8, 4, 8, 8, 7,
 14, 4, 4, 7, 7, 7, 14, 14, 14, 14, 14, 7, 0, 14, 14, 14, 14, 5, 22, 8,
 7, 5, 14, 4, 14, 7,


We create the GSDMM model. The K parameter is an upper bound on the number of clusters: the algorithm settles on a number of populated clusters less than or equal to K (in the log above, it drops from 25 to 21 by stage 14). The alpha parameter controls the probability that a text is placed in a new, empty cluster, and the beta parameter controls how strongly word overlap drives the assignment. When beta is closer to 0, a text is clustered mainly by its similarity to the words already in a cluster; when it is closer to 1, the assignment depends more on how many texts each cluster already holds. The n_iters parameter sets the number of passes the algorithm makes through the corpus.
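
To make the roles of alpha and beta concrete, here is a minimal sketch of the unnormalized score that the published GSDMM sampler assigns to a candidate cluster, simplified for documents whose tokens are unique (as produced by the preprocessing above). It is an illustration under that assumption, not the package's own code; `cluster_score` and all the toy counts are hypothetical.

```python
from math import prod


def cluster_score(doc, m_z, n_z, n_z_w, D, K, V, alpha, beta):
    """Unnormalized probability that `doc` joins one cluster.

    doc   : list of unique tokens
    m_z   : number of documents currently in the cluster
    n_z   : total number of word occurrences in the cluster
    n_z_w : dict mapping word -> count in the cluster
    D, K, V : number of documents, cluster upper bound, vocabulary size
    """
    # Popularity term: larger clusters attract more documents; alpha keeps
    # empty clusters from scoring exactly zero.
    popularity = (m_z + alpha) / (D - 1 + K * alpha)
    # Similarity term: rewards clusters that already contain the doc's words;
    # beta smooths the counts of words the cluster has not seen.
    numerator = prod(n_z_w.get(word, 0) + beta for word in doc)
    denominator = prod(n_z + V * beta + i for i in range(len(doc)))
    return popularity * numerator / denominator


doc = ["pizza", "cheese"]
shared = dict(D=10, K=3, V=6, beta=0.1)
food = cluster_score(doc, 4, 8, {"pizza": 3, "cheese": 2, "sauce": 3}, alpha=0.1, **shared)
service = cluster_score(doc, 5, 10, {"wait": 4, "table": 3, "server": 3}, alpha=0.1, **shared)
empty_low = cluster_score(doc, 0, 0, {}, alpha=0.1, **shared)
empty_high = cluster_score(doc, 0, 0, {}, alpha=1.0, **shared)

print(food > service)          # the cluster sharing words with the doc wins
print(empty_high > empty_low)  # a larger alpha favors opening an empty cluster
```

In the sampler these scores would be computed for every cluster, normalized to sum to 1, and used to draw the document's new cluster.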

In [15]:
# Count the documents in each cluster and list the 18 most populous clusters.
# top_words_by_cluster (defined in cell In [9]) then prints the 20 most
# frequent words in each of them.
doc_count = np.array(mgp.cluster_doc_count)
top_clusters = doc_count.argsort()[-18:][::-1]
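
The expression `doc_count.argsort()[-18:][::-1]` sorts the cluster indices by ascending document count, keeps the last 18, and reverses them so the largest cluster comes first. The sketch below shows the same ranking in plain Python on made-up counts. `most_populous` is a hypothetical name, and clusters with equal counts may come out in a different order than with `argsort`.

```python
def most_populous(doc_counts, n):
    """Return the indices of the n largest counts, largest first."""
    order = sorted(range(len(doc_counts)), key=lambda i: doc_counts[i], reverse=True)
    return order[:n]


# Toy document counts for five clusters.
toy_counts = [12, 0, 40, 7, 25]
print(most_populous(toy_counts, 3))
# [2, 4, 0]
```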

In [16]:
top_words_by_cluster(mgp, top_clusters, 20)

Cluster 4: [('chicken', 985), ('ordered', 872), ('cheese', 780), ('sauce', 765), ('delicious', 649), ('salad', 578), ('fries', 540), ('fresh', 516), ('shrimp', 457), ('fried', 456), ('sandwich', 403), ('burger', 394), ('side', 367), ('cooked', 337), ('bread', 330), ('pork', 324), ('order', 322), ('meat', 321), ('pizza', 307), ('hot', 304)]
Cluster 14: [('order', 671), ('table', 539), ('wait', 498), ('people', 388), ('service', 374), ('asked', 328), ('bar', 310), ('told', 295), ('long', 282), ('waitress', 267), ('night', 264), ('server', 256), ('seated', 244), ('ordered', 241), ('restaurant', 235), ('waiting', 233), ('drink', 228), ('drinks', 226), ('busy', 223), ('menu', 217)]
Cluster 7: [('service', 776), ('experience', 365), ('restaurant', 364), ('worth', 364), ('stars', 316), ('lunch', 279), ('eat', 266), ('location', 249), ('wait', 244), ('breakfast', 210), ('night', 205), ('5', 203), ('spot', 198), ('visit', 191), ('hotel', 190), ('star', 189), ('area', 185), ('find', 180), ('plac

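The results of the clustering make sense for many of the clusters. In the output shown above, cluster 4 is about food, cluster 14 relates to ordering, waiting, and table service, and cluster 7 concerns the overall experience and rating. The remaining clusters, which are cut off in the output above, cover aspects such as the selection available, the atmosphere, dessert, and hair and nail salons. Cluster numbers differ from run to run, because GSDMM assigns texts to clusters by random sampling.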