## Generating topics from product reviews.
I took a data science class this semester and quickly realized I needed to learn Python (shocking, I know). To help me learn the basics and fulfill several class assignments, I worked on generating common topics that appear in product reviews. Each "topic" is a cluster of words that appear frequently and in the same contexts.

### Cleaning the data.
The data I'm using is a subset of the Amazon office product reviews available [here](http://jmcauley.ucsd.edu/data/amazon/). The full set has 53,258 reviews, but I kept only the first 10,000 with fewer than 500 characters. Removing stop words ("and", "the", etc.) left 246,172 words to work with.
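
The filtering step itself is not shown in this notebook. A minimal sketch of how it might look, assuming the reviews are already loaded as plain strings (the function name `first_short_reviews` and the sample data are hypothetical, not part of the original code):

```python
from itertools import islice

def first_short_reviews(review_texts, max_chars=500, limit=10000):
    """Return the first `limit` reviews that are shorter than `max_chars` characters."""
    short = (text for text in review_texts if len(text) < max_chars)
    return list(islice(short, limit))

# Made-up sample data, only to show the behavior
sample = ["Great pens.", "x" * 600, "Sturdy binder.", "Ink dried out fast."]
kept = first_short_reviews(sample, max_chars=500, limit=2)
print(kept)
```

Using a generator with `islice` stops reading as soon as enough short reviews have been collected, which matters little for 53,258 reviews but keeps the idea usable on larger files.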

In [4]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import unidecode

reviews = []
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
with open('amazon_short_reviews.csv') as inputfile:
    for line in inputfile:
        line_words = tokenizer.tokenize(unidecode.unidecode(line).lower())
        reviews.append([w for w in line_words if w not in stop_words])
words = [word for line in reviews for word in line]

In [5]:
print(len(reviews))
print(len(words))

10000
246172


### Calculating SIP scores.
An "SIP" is a statistically improbable phrase. The SIP score tells you just how improbable a word or phrase is. For example, a word with SIP = 5 is used 5 times as often in the corpus of interest (here, product reviews) as it is in everyday language.

To find the SIP scores I first calculated the local frequency for each word (number of occurrences / total number of words), filtering out words that occurred fewer than 10 times. I used the `wordfreq` library to find the baseline word frequencies. Dividing local frequency by baseline frequency gives the SIP score.

In [6]:
from wordfreq import word_frequency

num_words = len(words)
text = nltk.Text(words)
uniq_words = list(set(words))
word_pool = []
word_SIP = []
for word in uniq_words:
    count = text.count(word)
    if count < 10:
        continue
    word_pool.append(word)
    freq = word_frequency(word, 'en')
    if freq == 0:
        # wordfreq returns 0 for unknown words; use a small floor to avoid dividing by zero
        freq = 1e-5
    word_SIP.append(count/num_words/freq)

# sort scores and words together so the two lists stay aligned
ranked = sorted(zip(word_SIP, word_pool), reverse=True)
word_pool_sorted = [word for _, word in ranked]
word_SIP_sorted = [score for score, _ in ranked]
SIPscoreslist = list(zip(word_pool_sorted, word_SIP_sorted))

In this set of product reviews, "envelopes" is the word with the highest SIP score, followed by "binder".

In [7]:
SIPscoreslist[:5]

[('envelopes', 931.1233099801483),
 ('binder', 869.5694745520298),
 ('folders', 869.1542524751455),
 ('sturdy', 805.7274660321721),
 ('cartridges', 740.3234551354059)]
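
For reference, the same score can be sketched more compactly with `collections.Counter`, which counts every word in one pass. This is only an illustration: the function name `sip_scores` and the toy baseline dictionary are made up, and in the notebook the baseline would come from `word_frequency(word, 'en')` instead.

```python
from collections import Counter

def sip_scores(tokens, baseline_freq, min_count=10, floor=1e-5):
    """Return (word, SIP) pairs sorted by descending SIP score.

    SIP = (count / total tokens) / baseline frequency.
    """
    counts = Counter(tokens)
    total = len(tokens)
    scores = {}
    for word, count in counts.items():
        if count < min_count:
            continue  # skip rare words, as in the notebook
        # fall back to a small floor when the baseline is missing or zero
        base = baseline_freq.get(word, 0) or floor
        scores[word] = (count / total) / base
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data with made-up baseline frequencies
toy_tokens = ["binder"] * 3 + ["good"] * 5 + ["the"] * 2
toy_baseline = {"binder": 0.001, "good": 0.05, "the": 0.5}
ranked = sip_scores(toy_tokens, toy_baseline, min_count=3)
print(ranked)
```

Keeping each word and its score in one pair avoids having to keep two parallel lists in sync.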

### Word embeddings with Word2Vec.
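Next I trained a Word2Vec model on the tokenized reviews, so that each word is represented by a 300-dimensional vector.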

In [9]:
from gensim.models import Word2Vec
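# note: `size` is the gensim 3.x name for the vector dimensionality; gensim 4.0 renamed it to `vector_size`
model = Word2Vec(reviews, size=300, window=5, min_count=1, workers=4)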



As a quick check, here are the five words the model considers most similar to "envelopes".

In [10]:
print(model.wv.most_similar(positive=['envelopes'], topn=5))

[('mailing', 0.9965976476669312), ('packages', 0.9965600967407227), ('stock', 0.9963386058807373), ('sealing', 0.9961926937103271), ('professional', 0.9958876371383667)]
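
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that measure, using made-up 3-dimensional vectors rather than the real 300-dimensional ones (the helper names here are hypothetical):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(word, vectors, topn=5):
    """Rank all other words by cosine similarity to `word`."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, invented for illustration only
toy_vectors = {
    "envelopes": [0.9, 0.1, 0.3],
    "mailing": [0.8, 0.2, 0.3],
    "calculator": [0.1, 0.9, 0.0],
}
neighbors = rank_by_similarity("envelopes", toy_vectors)
print(neighbors)
```

A score near 1 means the two vectors point in almost the same direction.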


### Separate nouns and adjectives.

In [12]:
# tag the SIP-sorted words so the nouns and adjectives stay in order of SIP score
from nltk import pos_tag
word_pos = pos_tag(word_pool_sorted)

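The tagger makes some mistakes. For example, "binder" and "ink" are tagged as verbs (VBP) when they should be nouns. This is probably because `pos_tag` expects words in sentence context, and here it only sees a list of words sorted by SIP score.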

In [13]:
word_pos[:10]

[('envelopes', 'NNS'),
 ('binder', 'VBP'),
 ('folders', 'NNS'),
 ('sturdy', 'JJ'),
 ('cartridges', 'NNS'),
 ('pens', 'NNS'),
 ('ink', 'VBP'),
 ('inks', 'NNS'),
 ('printer', 'NN'),
 ('staples', 'NNS')]
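
One possible way around this, not used in the notebook, would be to tag each review in its original sentence context and then give every word its most frequent tag. The sketch below shows only the voting step on hand-written tagged sentences; `majority_tags` is a hypothetical helper, and the tagged input is assumed to come from running `pos_tag` on each tokenized review.

```python
from collections import Counter, defaultdict

def majority_tags(tagged_sentences):
    """Map each word to the tag it receives most often across all sentences."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

# Hand-written example tags, for illustration only
toy_tagged = [
    [("this", "DT"), ("binder", "NN"), ("is", "VBZ"), ("sturdy", "JJ")],
    [("the", "DT"), ("binder", "NN"), ("broke", "VBD")],
    [("binder", "VBP"), ("clips", "NNS")],
]
tags = majority_tags(toy_tagged)
print(tags["binder"])
```

With two noun votes against one verb vote, "binder" comes out as a noun in this toy data.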

In [14]:
nounBucket = []
adjBucket = []
# for now, just call all non-nouns "adjectives"
for word, tag in word_pos:
    if tag.startswith("NN"):
        nounBucket.append(word)
    else:
        adjBucket.append(word)

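Here are the top ten words in each bucket. Note that the mis-tagged nouns "binder" and "ink" end up in the adjective bucket.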

In [15]:
print(nounBucket[:10])
print(adjBucket[:10])

['envelopes', 'folders', 'cartridges', 'pens', 'inks', 'printer', 'staples', 'cartridge', 'pencils', 'tabs']
['binder', 'sturdy', 'ink', 'avery', '3m', 'flimsy', 'adhesive', 'refill', 'staple', 'durable']


### Cluster the nouns.

In [21]:
wordBank = nounBucket

num_clust = 10
similarity_thresh = 0.90
SIP_thresh = 10.0

nounClusters = []
for _ in range(num_clust):
    parent_word = wordBank[0] # the first one has the highest SIP
    cluster = [parent_word]
    count = 0
    for j, word in enumerate(wordBank[1:]):
        if count < 9: # cap each cluster at 10 words, including the parent
            if model.wv.similarity(parent_word, word) > similarity_thresh:
                # caution: j indexes wordBank (nouns only), while word_SIP_sorted
                # covers all words, so this is not necessarily this word's own score
                if word_SIP_sorted[j+1] > SIP_thresh:
                    cluster.append(word)
                    count += 1
        else:
            break
    nounClusters.append(cluster)
    wordBank = [t for t in wordBank if t not in cluster] # remove current cluster from wordBank

Here are the first five noun clusters. The first word in each cluster is its parent word.

In [22]:
nounClusters[:5]

[['envelopes',
  'folders',
  'inks',
  'staples',
  'pencils',
  'tabs',
  'markers',
  'scotch',
  'labels',
  'tape'],
 ['cartridges',
  'printer',
  'cartridge',
  'printers',
  'hp',
  'amazon',
  'brands',
  'epson',
  'costco',
  'xl'],
 ['pens',
  'erase',
  'nib',
  'dries',
  'colors',
  'prints',
  'bleed',
  'odor',
  'printing',
  'refills'],
 ['stapler',
  'gel',
  'magnets',
  'matte',
  'flap',
  'pricey',
  'pencil',
  'peel',
  'notebooks',
  'dispenser'],
 ['templates',
  'calculator',
  'notebook',
  'pads',
  'packaging',
  'bulky',
  'sheets',
  'glare',
  'thinner',
  'inserts']]
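
The clustering loop looks up SIP scores by position in `word_SIP_sorted`, which is ordered over all words rather than over the nouns alone. A sketch of the same greedy procedure with a lookup by word instead is below. It is an illustration under assumptions, not the code that produced the output above: `greedy_clusters` is a hypothetical helper, and in the notebook `sip_by_word` could be `dict(SIPscoreslist)` and `similarity` could be `model.wv.similarity`.

```python
def greedy_clusters(word_bank, sip_by_word, similarity, num_clusters=10,
                    max_size=10, similarity_thresh=0.90, sip_thresh=10.0):
    """Greedily group words around the highest-SIP word that is still unassigned."""
    remaining = list(word_bank)  # assumed sorted by descending SIP
    clusters = []
    for _ in range(num_clusters):
        if not remaining:
            break
        parent = remaining[0]
        cluster = [parent]
        for word in remaining[1:]:
            if len(cluster) >= max_size:
                break
            if (similarity(parent, word) > similarity_thresh
                    and sip_by_word[word] > sip_thresh):
                cluster.append(word)
        clusters.append(cluster)
        members = set(cluster)
        remaining = [w for w in remaining if w not in members]
    return clusters

# Toy inputs, invented for illustration only
toy_bank = ["envelopes", "folders", "printer", "cartridge"]
toy_sip = {"envelopes": 900.0, "folders": 800.0, "printer": 500.0, "cartridge": 5.0}
toy_groups = {"envelopes": "paper", "folders": "paper",
              "printer": "print", "cartridge": "print"}

def toy_similarity(a, b):
    return 0.95 if toy_groups[a] == toy_groups[b] else 0.1

clusters = greedy_clusters(toy_bank, toy_sip, toy_similarity, num_clusters=3)
print(clusters)
```

Passing the similarity function in as an argument keeps the sketch runnable without a trained model.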

### Add adjectives to topics.

In [24]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

adjClusters = []
bigram_measures = BigramAssocMeasures()
adjSet = set(adjBucket)
num_adj = 10
for cluster in nounClusters:
    # make a copy of words, replacing every cluster word with the parent_word of that cluster
    parent_word = cluster[0]
    similar_words = set(cluster[1:])
    words_copy = [parent_word if w in similar_words else w for w in words]

    # now find bigrams (within a 5-word window) that pair parent_word with a word in adjBucket
    finder = BigramCollocationFinder.from_words(words_copy, window_size=5)
    # an ngram filter removes every bigram for which the function returns True
    parent_filter = lambda *w: parent_word not in w
    adj_filter = lambda w1, w2: adjSet.isdisjoint([w1, w2])
    finder.apply_freq_filter(2)
    finder.apply_ngram_filter(parent_filter)
    finder.apply_ngram_filter(adj_filter)
    adj_temp = finder.nbest(bigram_measures.pmi, num_adj)
    adj_temp = [pair[1] if pair[0] == parent_word else pair[0] for pair in adj_temp]
    adjClusters.append(adj_temp)
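
The ranking above relies on pointwise mutual information (PMI), which rewards word pairs that occur together more often than their individual frequencies would predict. Below is a simplified sketch for adjacent word pairs only. The function name `bigram_pmi` is hypothetical, and because the notebook uses `window_size=5`, the numbers here would not match NLTK's output exactly.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI for adjacent bigrams: log2(p(x, y) / (p(x) * p(y)))."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {
        (w1, w2): math.log2(count * n / (unigram[w1] * unigram[w2]))
        for (w1, w2), count in bigram.items()
    }

# Toy token stream, invented for illustration only
toy_tokens = ["sturdy", "binder", "cheap", "pens",
              "sturdy", "binder", "nice", "pens"]
pmi = bigram_pmi(toy_tokens)
print(pmi[("sturdy", "binder")], pmi[("pens", "sturdy")])
```

PMI tends to favor rare pairs, which is presumably why the notebook first applies a frequency filter of 2.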

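Here are the words found for each cluster. Since every non-noun went into the adjective bucket, the lists also include brand names and numbers.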

In [25]:
adjClusters[:10]

[['masking',
  'packing',
  'virtually',
  'hanging',
  'rounded',
  'mechanical',
  'pendaflex',
  'smead',
  'invisible',
  'thermal'],
 ['www',
  'refilled',
  'genuine',
  'refurbished',
  'lowest',
  'canon',
  'starter',
  'remanufactured',
  'died',
  'prime'],
 ['assorted',
  'sakura',
  'vibrant',
  'fountain',
  '36',
  'dry',
  'assortment',
  'g2',
  'scratchy',
  'pilot'],
 ['commercial',
  'cup',
  'swingline',
  'mechanical',
  'drafting',
  'matching',
  'automatic',
  'electric',
  'reduced',
  'sharpens'],
 ['graphing',
  'spiral',
  'legal',
  'bound',
  '150',
  'solar',
  'website',
  '15',
  '65',
  'poly'],
 ['filler',
  'insertable',
  'mate',
  'leaf',
  'kodak',
  'ruled',
  'photo',
  '24',
  'copy',
  'glossy'],
 ['sits',
  'fellowes',
  'shredding',
  'rarely',
  'build',
  'closer',
  'combined',
  'hurt',
  'spills',
  'puts'],
 ['highest',
  'as_li_tl',
  'www',
  'flawlessly',
  'lesser',
  'lowest',
  'competitive',
  'competitive',
  'high',
  'unbeat

### Save topics.

In [26]:
topics = []
for i in range(num_clust):
    if len(adjClusters[i]) != 0:
        topicName = " ".join([nounClusters[i][0], adjClusters[i][0]])
        queryParams = [topicName, adjClusters[i][1:], nounClusters[i][1:]]
    else:
        topicName = nounClusters[i][0]
        queryParams = [topicName, [], nounClusters[i][1:]]
    topics.append(queryParams)

In [27]:
topics[:2]

[['envelopes masking',
  ['packing',
   'virtually',
   'hanging',
   'rounded',
   'mechanical',
   'pendaflex',
   'smead',
   'invisible',
   'thermal'],
  ['folders',
   'inks',
   'staples',
   'pencils',
   'tabs',
   'markers',
   'scotch',
   'labels',
   'tape']],
 ['cartridges www',
  ['refilled',
   'genuine',
   'refurbished',
   'lowest',
   'canon',
   'starter',
   'remanufactured',
   'died',
   'prime'],
  ['printer',
   'cartridge',
   'printers',
   'hp',
   'amazon',
   'brands',
   'epson',
   'costco',
   'xl']]]
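
The cell above builds the `topics` list but does not write it anywhere. One way to persist it might be JSON, sketched below under that assumption. The function `topics_to_json` and the field names `name`, `adjectives`, and `nouns` are hypothetical and were chosen only for readability.

```python
import json

def topics_to_json(topics):
    """Serialize [name, adjectives, nouns] triples into a JSON string."""
    records = [
        {"name": name, "adjectives": adjectives, "nouns": nouns}
        for name, adjectives, nouns in topics
    ]
    return json.dumps(records, indent=2)

# Toy topics in the same [name, adjectives, nouns] shape as above
toy_topics = [
    ["envelopes masking", ["packing"], ["folders", "tape"]],
    ["stapler", [], ["pencil"]],
]
serialized = topics_to_json(toy_topics)
restored = json.loads(serialized)
print(restored[0]["name"])
```

Writing `serialized` to a file with `open(..., 'w')` would then make the topics available to other scripts.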