## Environment Setup

In [23]:
import pandas as pd  # data handling
import numpy as np
from gsdmm.mgp import MovieGroupProcess

#preprocessing data
import nltk
from nltk.tokenize import RegexpTokenizer
import replacers  # project-local module (replacers.py)

import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

%matplotlib inline

## Load Data Set

In [3]:
df = pd.read_csv('./data/Amazon_review.csv', encoding='949')
df = df.drop(df.columns[0], axis=1)
print(df.shape)
df.head()

(23210, 4)


| | stars | title | date | text |
|---|---|---|---|---|
| 0 | 5.0 out of 5 stars | Sound quality has been improved again! 3rd ti... | on November 2, 2017 | So it’s been an interesting first few weeks wi... |
| 1 | 4.0 out of 5 stars | Audio is IMPROVED after software update! See ... | on November 2, 2017 | ---------------------------UPDATE 11/4/2017---... |
| 2 | 5.0 out of 5 stars | Audio Now Sounds Great after Software Update -... | on November 2, 2017 | Update: Amazon has pushed a software update an... |
| 3 | 5.0 out of 5 stars | WoW | on November 23, 2017 | I love my Echo, it's stylish, loud... |
| 4 | 3.0 out of 5 stars | Hoped to replace our small bluetooth speakers.... | on November 5, 2017 | Sound is not great. A review of the Echo as a ... |


## Pre-processing

We use only the review text for clustering.

In [10]:
reviewList = df['text']
print(type(reviewList))
reviewList[0]

<class 'pandas.core.series.Series'>


'So it’s been an interesting first few weeks with the Echo and am happy to say Echo 2nd Gen has finally delivered on its promise of improved sound quality over 1st Gen Echo, with the 3rd firmware since launch.  If you are confused about a lot of the negative reviews, old firmware is the likely cause of most of them regarding poor sound quality.Want to keep this short and spare all the gory details, but there was a bug in the launch version of the firmware, which was fixed after a few days, but the first fix, while satisfying some, was not, in my opinion a full fix and left the mid-range frequencies muted and tinny.  Today I noticed that Alexa’s voice in this unit sounded much more like Alexa’s voice on Gen 1 Echo’s I own and, after playing some music, suspected they had upgraded the firmware again, and indeed they have.  The current firmware is 592452720 and it’s a massive improvement over both the original and updated version 592452420.So I decided to do some more side-by-side compari

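Convert the review Series to a plain Python list. After tokenization this becomes a nested list (one token list per review), which is the input format the clustering expects.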

In [12]:
newList = [text for text in reviewList]
print(type(newList))
newList

<class 'list'>


['So it’s been an interesting first few weeks with the Echo and am happy to say Echo 2nd Gen has finally delivered on its promise of improved sound quality over 1st Gen Echo, with the 3rd firmware since launch.  If you are confused about a lot of the negative reviews, old firmware is the likely cause of most of them regarding poor sound quality.Want to keep this short and spare all the gory details, but there was a bug in the launch version of the firmware, which was fixed after a few days, but the first fix, while satisfying some, was not, in my opinion a full fix and left the mid-range frequencies muted and tinny.  Today I noticed that Alexa’s voice in this unit sounded much more like Alexa’s voice on Gen 1 Echo’s I own and, after playing some music, suspected they had upgraded the firmware again, and indeed they have.  The current firmware is 592452720 and it’s a massive improvement over both the original and updated version 592452420.So I decided to do some more side-by-side compar

Convert all review text to lowercase.

In [13]:
lower_text = [str(line).lower() for line in newList]
lower_text

['so it’s been an interesting first few weeks with the echo and am happy to say echo 2nd gen has finally delivered on its promise of improved sound quality over 1st gen echo, with the 3rd firmware since launch.  if you are confused about a lot of the negative reviews, old firmware is the likely cause of most of them regarding poor sound quality.want to keep this short and spare all the gory details, but there was a bug in the launch version of the firmware, which was fixed after a few days, but the first fix, while satisfying some, was not, in my opinion a full fix and left the mid-range frequencies muted and tinny.  today i noticed that alexa’s voice in this unit sounded much more like alexa’s voice on gen 1 echo’s i own and, after playing some music, suspected they had upgraded the firmware again, and indeed they have.  the current firmware is 592452720 and it’s a massive improvement over both the original and updated version 592452420.so i decided to do some more side-by-side compar

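Expand all contractions (for example, "it’s" becomes "it is"), removing the apostrophes to restore the full words. Note that possessives are expanded the same way, so "alexa’s voice" becomes "alexa is voice" in the output below.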

In [14]:
replacer = replacers.RegexpReplacer()
decomp_words_text = [replacer.replace(line) for line in lower_text]
decomp_words_text

['so it is been an interesting first few weeks with the echo and am happy to say echo 2nd gen has finally delivered on its promise of improved sound quality over 1st gen echo, with the 3rd firmware since launch.  if you are confused about a lot of the negative reviews, old firmware is the likely cause of most of them regarding poor sound quality.want to keep this short and spare all the gory details, but there was a bug in the launch version of the firmware, which was fixed after a few days, but the first fix, while satisfying some, was not, in my opinion a full fix and left the mid-range frequencies muted and tinny.  today i noticed that alexa is voice in this unit sounded much more like alexa is voice on gen 1 echo is i own and, after playing some music, suspected they had upgraded the firmware again, and indeed they have.  the current firmware is 592452720 and it is a massive improvement over both the original and updated version 592452420.so i decided to do some more side-by-side c
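The `replacers` module is a project-local file that is not shown in this notebook. As a rough sketch of how a regex-based contraction expander of this kind might work, assuming a small hand-written pattern list (the patterns and the apostrophe normalization below are illustrative assumptions, not the project's actual code):

```python
import re

# Hypothetical pattern list; the real replacers.py may use different rules.
CONTRACTION_PATTERNS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"i'm", "i am"),
    (r"(\w+)'ll", r"\1 will"),
    (r"(\w+)n't", r"\1 not"),
    (r"(\w+)'ve", r"\1 have"),
    (r"(\w+)'re", r"\1 are"),
    (r"(\w+)'d", r"\1 would"),
    (r"(\w+)'s", r"\1 is"),
]


class RegexpReplacer:
    """Illustrative stand-in for the project's replacers.RegexpReplacer."""

    def __init__(self, patterns=CONTRACTION_PATTERNS):
        # Compile once so each review is processed with prebuilt regexes.
        self.patterns = [(re.compile(regex), repl) for regex, repl in patterns]

    def replace(self, text):
        # Assumption: typographic apostrophes are normalized to ASCII first.
        result = text.replace("\u2019", "'")
        for pattern, repl in self.patterns:
            result = pattern.sub(repl, result)
        return result


replacer = RegexpReplacer()
expanded = replacer.replace("it\u2019s a massive improvement")
```

Because the last rule cannot tell a contraction from a possessive, this sketch reproduces the "alexa is voice" behaviour seen in the output above.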

Tokenize each review into a list of word tokens.

In [15]:
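def tokenize(list_type):
    """Split each review string into a list of word tokens."""
    # Raw string avoids the invalid-escape warning for "\w".
    t = RegexpTokenizer(r"\w+")
    tokenized_list = []

    for review in list_type:
        tokens = t.tokenize(review)
        tokenized_list.append(tokens)

    return tokenized_list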
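To see what a `\w+` pattern keeps and drops, the same idea can be tried with only the standard library. This is a sketch, assuming that matching `\w+` with `re.findall` behaves like the tokenizer call above; `tokenize_reviews` and the sample sentences are made up for illustration.

```python
import re

WORD_PATTERN = re.compile(r"\w+")


def tokenize_reviews(reviews):
    """Return one list of word tokens per review string."""
    return [WORD_PATTERN.findall(review) for review in reviews]


sample = [
    "the mid-range frequencies sounded muted.  it is a fix!",
    "echo 2nd gen",
]
tokens = tokenize_reviews(sample)
# Punctuation is dropped and hyphenated words are split in two,
# so "mid-range" yields the tokens "mid" and "range".
```

The result is a nested list with one inner list per review, which is the shape passed to the clustering step later on.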

In [40]:
token = tokenize(decomp_words_text)
print(len(token))
token

23210


[['so',
  'it',
  'is',
  'been',
  'an',
  'interesting',
  'first',
  'few',
  'weeks',
  'with',
  'the',
  'echo',
  'and',
  'am',
  'happy',
  'to',
  'say',
  'echo',
  '2nd',
  'gen',
  'has',
  'finally',
  'delivered',
  'on',
  'its',
  'promise',
  'of',
  'improved',
  'sound',
  'quality',
  'over',
  '1st',
  'gen',
  'echo',
  'with',
  'the',
  '3rd',
  'firmware',
  'since',
  'launch',
  'if',
  'you',
  'are',
  'confused',
  'about',
  'a',
  'lot',
  'of',
  'the',
  'negative',
  'reviews',
  'old',
  'firmware',
  'is',
  'the',
  'likely',
  'cause',
  'of',
  'most',
  'of',
  'them',
  'regarding',
  'poor',
  'sound',
  'quality',
  'want',
  'to',
  'keep',
  'this',
  'short',
  'and',
  'spare',
  'all',
  'the',
  'gory',
  'details',
  'but',
  'there',
  'was',
  'a',
  'bug',
  'in',
  'the',
  'launch',
  'version',
  'of',
  'the',
  'firmware',
  'which',
  'was',
  'fixed',
  'after',
  'a',
  'few',
  'days',
  'but',
  'the',
  'first',
  'fix',

The preprocessing above is the same as the existing method used in our project.

-------------------------------------

## Text Clustering with GSDMM

GSDMM (Gibbs Sampling for the Dirichlet Multinomial Mixture model) is introduced in "A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering" by Jianhua Yin and Jianyong Wang.

A function that collects the unique words (the vocabulary) across all tokenized reviews.

In [24]:
def compute_Voca(texts):
    V = set()
    for text in texts:
        for word in text:
            V.add(word)
    return V

Execution result

In [25]:
compute_Voca(token)

{'heard',
 'could',
 'benign',
 'relocate',
 'him',
 'beloved',
 'facebook',
 'pointing',
 'bt',
 '20x20',
 'commercial',
 'smoke',
 'dud',
 'cigarette',
 'speaker',
 'accepted',
 'alit',
 'speak',
 'wine',
 'passcode',
 'vanessaomarie',
 'candid',
 'workyou',
 'guitarists',
 'signing',
 'clumsy',
 'pinpoint',
 'alarming',
 '31st',
 'eleven',
 'eliminates',
 'abour',
 '595530420',
 'rhapsody',
 'indicates',
 'gmt',
 'adjacent',
 'aaaaa',
 'straightforward',
 'pretends',
 'deserving',
 'through',
 'amplifier',
 'messed',
 'hiccup',
 'knowingly',
 'tv',
 'awaking',
 'accomplish',
 'timely',
 'swapping',
 'handlers',
 'recent',
 'labelled',
 'switching',
 'wizz',
 'clearance',
 'nervous',
 'chopin',
 'strum',
 'frustrates',
 'celebrities',
 'hassel',
 'strung',
 '8th',
 'promises',
 'bugging',
 'intimidate',
 'geology',
 'tired',
 'translate',
 'firewall',
 'knowledgeable',
 'limitedi',
 'whoopee',
 'dee',
 'regards',
 'dispose',
 'htings',
 'plate',
 'pat',
 'inconveniently',
 'afforded'

In [29]:
size = len(compute_Voca(token))
size

12883
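The vocabulary size is what gets passed to the model later, but the set above also contains likely typos such as `htings` and `abour`. One possible way to see how common each word is, sketched with `collections.Counter` (the helper name `count_vocabulary` and the toy documents are assumptions for illustration):

```python
from collections import Counter


def count_vocabulary(docs):
    """Count how often each token occurs across all tokenized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


docs = [
    ["echo", "sound", "quality"],
    ["sound", "is", "great"],
    ["echo", "echo"],
]
counts = count_vocabulary(docs)
vocab_size = len(counts)  # same value as len(compute_Voca(docs))
rare_words = sorted(w for w, c in counts.items() if c == 1)
```

In the notebook this could be called as `count_vocabulary(token)`; words that occur only once are candidates for inspection before clustering.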

Set the parameters (K, alpha, beta) and the number of iterations (n_iters).

In [30]:
mgp = MovieGroupProcess(K=30, n_iters=100, alpha=0.2, beta=0.01)
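To give some intuition for these parameters: `K` is an upper bound on the number of clusters, `alpha` influences how readily a document joins a small or empty cluster, and `beta` influences how strongly shared words matter. The sketch below scores one document against one cluster following the conditional probability in the Yin and Wang paper, computed in log space. It is an illustration under that assumption, not the library's code, and the function and argument names are made up here.

```python
import math


def log_cluster_score(doc, m_z, n_z, n_z_w, D, K, V, alpha, beta):
    """Unnormalized log-probability of assigning `doc` to one cluster.

    doc   : list of tokens of the document being (re)assigned
    m_z   : number of other documents currently in the cluster
    n_z   : number of word tokens currently in the cluster
    n_z_w : dict mapping word -> count of that word in the cluster
    D, K, V : number of documents, max clusters, vocabulary size
    """
    # Popularity term: larger clusters attract more documents;
    # alpha > 0 keeps empty clusters reachable.
    score = math.log(m_z + alpha) - math.log(D - 1 + K * alpha)

    # Similarity term, numerator: repeated words get an increasing offset.
    seen = {}
    for word in doc:
        j = seen.get(word, 0)
        score += math.log(n_z_w.get(word, 0) + beta + j)
        seen[word] = j + 1

    # Similarity term, denominator: one factor per token in the document.
    for i in range(len(doc)):
        score -= math.log(n_z + V * beta + i)
    return score


doc = ["echo", "sound"]
shared = log_cluster_score(doc, 3, 6, {"echo": 3, "sound": 3}, 10, 2, 5, 0.2, 0.01)
unrelated = log_cluster_score(doc, 3, 6, {"light": 6}, 10, 2, 5, 0.2, 0.01)
```

With equal cluster sizes, the cluster that already contains the document's words gets the higher score, which is what pulls similar reviews together during sampling.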

### Model fitting

In [31]:
y = mgp.fit(token, size)

In stage 0: transferred 21746 clusters with 30 clusters populated
In stage 1: transferred 14976 clusters with 30 clusters populated
In stage 2: transferred 10470 clusters with 30 clusters populated
In stage 3: transferred 8012 clusters with 30 clusters populated
In stage 4: transferred 6668 clusters with 30 clusters populated
In stage 5: transferred 5858 clusters with 30 clusters populated
In stage 6: transferred 5621 clusters with 30 clusters populated
In stage 7: transferred 5280 clusters with 29 clusters populated
In stage 8: transferred 5212 clusters with 28 clusters populated
In stage 9: transferred 5108 clusters with 28 clusters populated
In stage 10: transferred 4937 clusters with 26 clusters populated
In stage 11: transferred 4890 clusters with 25 clusters populated
In stage 12: transferred 4852 clusters with 25 clusters populated
In stage 13: transferred 4733 clusters with 25 clusters populated
In stage 14: transferred 4717 clusters with 26 clusters populated
In stage 15: tran

In [34]:
len(y)

23210
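Since `fit` returns one cluster label per review, the labels can be paired with the tokenized reviews to inspect each cluster. A minimal sketch, assuming only a list of labels and a parallel list of token lists (`summarize_clusters` and the toy data are hypothetical):

```python
from collections import Counter, defaultdict


def summarize_clusters(labels, docs, top_n=5):
    """Map each cluster label to (document count, most frequent words)."""
    sizes = Counter(labels)
    word_counts = defaultdict(Counter)
    for label, doc in zip(labels, docs):
        word_counts[label].update(doc)
    return {
        label: (sizes[label], [w for w, _ in word_counts[label].most_common(top_n)])
        for label in sizes
    }


labels = [0, 1, 0, 1]
docs = [
    ["sound", "quality", "sound"],
    ["setup", "wifi"],
    ["sound", "bass"],
    ["wifi", "app", "wifi"],
]
summary = summarize_clusters(labels, docs, top_n=1)
```

In the notebook this could be called as `summarize_clusters(y, token)` to list the size and dominant words of each populated cluster.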

In [54]:
def tokenize_stream(list_type):
    """Tokenize every review and concatenate the tokens into one flat list."""
    t = RegexpTokenizer(r"\w+")
    tokenized_list = []

    for review in list_type:
        tokens = t.tokenize(review)
        tokenized_list.extend(tokens)

    return tokenized_list

In [55]:
new_token = tokenize_stream(decomp_words_text)

In [61]:
cluster = {
    'cluster': y,
    'review': decomp_words_text
}

In [62]:
export = pd.DataFrame.from_dict(cluster)

In [63]:
export.to_csv('cluster_review.csv')