## Generating topics from product reviews.
I took a data science class this semester and quickly realized I needed to learn Python (shocking, I know). To help me learn the basics and fulfill several class assignments, I worked on generating common topics that appear in product reviews. Each "topic" is a cluster of words that frequently appear together.

### Cleaning the data.
The data I'm using is a subset of Amazon office product reviews from [here](http://jmcauley.ucsd.edu/data/amazon/). There are 53,258 reviews in total, but I only kept the first 10,000 reviews with fewer than 500 characters. Removing stop words ("and", "the", etc.) left 246,172 words to work with.

In [1]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import unidecode

reviews = []
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
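with open('amazon_short_reviews.csv') as inputfile:
    for line in inputfile:
        # transliterate to ASCII, lowercase, then split on non-word characters
        line_words = tokenizer.tokenize(unidecode.unidecode(line).lower())
        reviews.append([w for w in line_words if w not in stop_words])
# flatten the per-review token lists into one list of words
words = [word for line in reviews for word in line]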

In [2]:
print(len(reviews))
print(len(words))

10000
246172
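
The step that selected the 10,000 short reviews is not shown in this notebook. Here is a minimal sketch of how such a subset could be picked, assuming the review texts are already available as an iterable of strings. The function name `select_short_reviews` and its parameters are hypothetical, not the code I actually ran.

```python
from itertools import islice


def select_short_reviews(texts, max_chars=500, limit=10000):
    """Return the first `limit` texts that are shorter than `max_chars` characters."""
    short = (t for t in texts if len(t) < max_chars)
    return list(islice(short, limit))


# Toy input standing in for the real review texts.
sample = ["Great pens.", "x" * 600, "Sturdy binder, flimsy tabs.", "ok"]
print(select_short_reviews(sample, max_chars=500, limit=2))
```

Because the filter is a generator, the scan stops as soon as `limit` short reviews have been found.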


### Calculating SIP scores.
An "SIP" is a statistically improbable phrase. The SIP score tells you just how improbable a word or phrase is. For example, a word with SIP = 5 is used 5 times as often in the corpus of interest (here, product reviews) as it is in everyday language.

To find the SIP scores I first calculated the local frequency for each word (number of occurrences / total number of words), filtering out words that occurred fewer than 10 times. I used the `wordfreq` library to find the baseline word frequencies. Dividing local frequency by baseline frequency gives the SIP score.

In [3]:
from wordfreq import word_frequency

from collections import Counter

num_words = len(words)
word_counts = Counter(words)  # count every word in a single pass
word_pool = []
word_SIP = []
for word, count in word_counts.items():
    if count < 10:
        continue  # too rare for a reliable frequency estimate
    freq = word_frequency(word, 'en')
    if freq == 0:
        freq = 1e-5  # floor for words wordfreq does not know
    word_pool.append(word)
    word_SIP.append(count / num_words / freq)

# sort words and scores together, highest SIP first
SIPscoreslist = sorted(zip(word_pool, word_SIP), key=lambda pair: (pair[1], pair[0]), reverse=True)
word_pool_sorted = [word for word, _ in SIPscoreslist]
word_SIP_sorted = [score for _, score in SIPscoreslist]

Here are the 5 words with the highest SIP scores. 

In [4]:
SIPscoreslist[:5]

[('envelopes', 931.1233099801483),
 ('binder', 869.5694745520298),
 ('folders', 869.1542524751455),
 ('sturdy', 805.7274660321721),
 ('cartridges', 740.3234551354059)]
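
To make the arithmetic concrete, here is a small self-contained sketch of the same SIP calculation on a toy corpus. The `sip_scores` function is a hypothetical helper, and the `baseline` dictionary holds made-up frequencies standing in for `wordfreq` lookups, so the numbers mean nothing beyond illustration.

```python
from collections import Counter


def sip_scores(tokens, baseline_freq, min_count=10, floor=1e-5):
    """Map each sufficiently common token to local frequency / baseline frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    scores = {}
    for word, count in counts.items():
        if count < min_count:
            continue  # skip rare words
        base = baseline_freq.get(word, 0) or floor  # avoid dividing by zero
        scores[word] = (count / total) / base
    return scores


# Toy corpus of 16 tokens with made-up baseline frequencies.
tokens = ["binder"] * 4 + ["good"] * 8 + ["the"] * 2 + ["pens"] * 2
baseline = {"binder": 0.125, "good": 0.5}
scores = sip_scores(tokens, baseline, min_count=3)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy corpus "binder" makes up a quarter of the tokens but has a baseline frequency of one eighth, so it is used twice as often as expected. "good" appears exactly as often as its baseline predicts and scores 1.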

### Word embeddings with Word2Vec.
`Gensim` is a Python library created specifically for topic modeling. Within the library is the `Word2Vec` class, which trains a shallow neural network to represent each word as a vector of numbers.

In [5]:
from gensim.models import Word2Vec
# note: gensim 4.0 renamed the `size` parameter to `vector_size`
model = Word2Vec(reviews, size=300, window=5, min_count=1, workers=4)



These word vectors can be compared (using something like cosine similarity) to find words that are similar to each other. Because the word vectors were created from product reviews, "similar" words are words that are used in the same contexts. Here we list the 5 words most similar to "envelopes".

In [6]:
print(model.wv.most_similar(positive=['envelopes'], topn=5))

[('seal', 0.9962739944458008), ('envelope', 0.9949904680252075), ('seals', 0.9942649602890015), ('sealing', 0.9939274191856384), ('packages', 0.9936246275901794)]
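
The scores returned by `most_similar` are cosine similarities. As a rough illustration of what that measure computes, here is a plain-Python sketch on toy 2- and 3-dimensional vectors. The `cosine_similarity` helper is written for this example and is not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors; the Word2Vec vectors in this notebook have 300 dimensions.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))            # in between
```

Vectors pointing the same way score 1 and orthogonal vectors score 0, regardless of their lengths.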


### Separate nouns and adjectives.
Next I used part-of-speech tagging (POS tagging) to separate the words into nouns and adjectives. In this version, every word that is not tagged as a noun goes into the "adjective" bucket.

In [8]:
from nltk import pos_tag
# tagging preserves the SIP ordering of word_pool_sorted
word_pos = pos_tag(word_pool_sorted)

nounBucket = []
adjBucket = []
# for now, just call all non-nouns "adjectives"
for word, tag in word_pos:
    if tag.startswith("NN"):  # NN, NNS, NNP, NNPS
        nounBucket.append(word)
    else:
        adjBucket.append(word)

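POS tagging is not perfect here. The tagger expects words in sentence context, but it receives an SIP-ordered word list, so clear nouns such as "binder" and "ink" end up in the non-noun bucket (see the output below). I would like to come back to this and explore other options.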

In [12]:
print(nounBucket[:10])
print(adjBucket[:10])

['envelopes', 'folders', 'cartridges', 'pens', 'inks', 'printer', 'staples', 'cartridge', 'pencils', 'tabs']
['binder', 'sturdy', 'ink', 'avery', '3m', 'flimsy', 'adhesive', 'refill', 'staple', 'durable']
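
One option I might explore is to tag each review in context and then give every word its most frequent tag. The sketch below shows only the voting step on hand-written tags. The `majority_tags` function is a hypothetical helper, and in practice the tagged sentences would have to come from running `pos_tag` on the original reviews before stop words are removed, which I have not tried.

```python
from collections import Counter, defaultdict


def majority_tags(tagged_sentences):
    """Return each word's most frequent tag across all tagged sentences."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


# Hand-written tags standing in for per-review pos_tag output.
tagged = [
    [("sturdy", "JJ"), ("binder", "NN")],
    [("binder", "NN"), ("works", "VBZ")],
    [("binder", "JJ"), ("clips", "NNS")],
]
tags = majority_tags(tagged)
nouns = [w for w, t in tags.items() if t.startswith("NN")]
print(nouns)
```

Here "binder" is tagged as a noun twice and as an adjective once, so the vote keeps it in the noun bucket.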


### Cluster the nouns.
I used nouns with high SIP scores as cluster "centers". Any other noun that is similar enough to a cluster's center (cosine similarity above 0.90) and has an SIP score above 10 is included in that cluster, up to 10 words per cluster. Clusters are exclusive, so each word can appear in only one cluster.

In [13]:
wordBank = nounBucket
sip_by_word = dict(SIPscoreslist)  # look up each word's own SIP score

num_clust = 10
max_cluster_size = 10
similarity_thresh = 0.90
SIP_thresh = 10.0

nounClusters = []
for _ in range(num_clust):
    parent_word = wordBank[0]  # the first remaining noun has the highest SIP
    cluster = [parent_word]
    for word in wordBank[1:]:
        if len(cluster) >= max_cluster_size:
            break
        if (model.wv.similarity(parent_word, word) > similarity_thresh
                and sip_by_word[word] > SIP_thresh):
            cluster.append(word)
    nounClusters.append(cluster)
    wordBank = [t for t in wordBank if t not in cluster]  # remove current cluster from wordBank

Here are the first two noun clusters. Ideally, each cluster represents a common topic from the product reviews.

In [14]:
nounClusters[:2]

[['envelopes',
  'folders',
  'inks',
  'staples',
  'pencils',
  'tabs',
  'markers',
  'scotch',
  'labels',
  'tape'],
 ['cartridges',
  'printer',
  'cartridge',
  'printers',
  'hp',
  'amazon',
  'brands',
  'refills',
  'epson',
  'costco']]
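
The greedy clustering idea can be tried without a trained model. Below is a minimal sketch that takes the similarity function and the SIP scores as arguments. The function `greedy_clusters`, the toy group labels, and the toy scores are all made up for illustration.

```python
def greedy_clusters(words, similarity, sip, num_clusters=10, max_size=10,
                    sim_thresh=0.90, sip_thresh=10.0):
    """Greedy, exclusive clustering. `words` must be sorted by descending SIP."""
    remaining = list(words)
    clusters = []
    while remaining and len(clusters) < num_clusters:
        parent = remaining[0]  # highest-SIP word still unassigned
        cluster = [parent]
        for word in remaining[1:]:
            if len(cluster) >= max_size:
                break
            if similarity(parent, word) > sim_thresh and sip[word] > sip_thresh:
                cluster.append(word)
        clusters.append(cluster)
        members = set(cluster)
        remaining = [w for w in remaining if w not in members]
    return clusters


# Toy similarity: words that share a made-up group label count as similar.
group = {"envelopes": "paper", "folders": "paper",
         "cartridges": "ink", "toner": "ink", "desk": "misc"}
toy_similarity = lambda a, b: 1.0 if group[a] == group[b] else 0.0
toy_sip = {"envelopes": 900, "cartridges": 700, "folders": 500, "toner": 50, "desk": 5}
ordered = sorted(toy_sip, key=toy_sip.get, reverse=True)
print(greedy_clusters(ordered, toy_similarity, toy_sip, num_clusters=2))
```

Because each cluster is removed from the pool before the next center is chosen, the order of the words determines which cluster a borderline word joins.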

### Add adjectives to topics.
I added adjectives to the noun clusters to add meaning and interpretability. This was done by finding word pairs (within a 5-word window) that contain both an adjective and a noun from one of the clusters, then ranking the pairs by pointwise mutual information (PMI).

In [15]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

adjClusters = []
bigram_measures = BigramAssocMeasures()
adjSet = set(adjBucket)
num_adj = 10
for cluster in nounClusters:
    # copy the word list, replacing every cluster word with the cluster's parent_word
    parent_word = cluster[0]
    similar_words = set(cluster[1:])
    words_copy = [parent_word if w in similar_words else w for w in words]

    # find word pairs (within a 5-word window) that pair parent_word with a word in adjBucket
    finder = BigramCollocationFinder.from_words(words_copy, window_size=5)
    parent_filter = lambda *w: parent_word not in w  # drop pairs without parent_word
    adj_filter = lambda w1, w2: adjSet.isdisjoint([w1, w2])  # drop pairs without an adjective
    finder.apply_freq_filter(2)  # keep pairs seen at least twice
    finder.apply_ngram_filter(parent_filter)
    finder.apply_ngram_filter(adj_filter)
    # rank the surviving pairs by PMI and keep the non-parent word of each
    adj_temp = finder.nbest(bigram_measures.pmi, num_adj)
    adj_temp = [pair[1] if pair[0] == parent_word else pair[0] for pair in adj_temp]
    adjClusters.append(adj_temp)

Here are the first two adjective clusters, which correspond to the first two noun clusters.

In [16]:
adjClusters[:2]

[['masking',
  'packing',
  'virtually',
  'hanging',
  'rounded',
  'mechanical',
  'pendaflex',
  'smead',
  'invisible',
  'thermal'],
 ['www',
  'refilled',
  'genuine',
  'refurbished',
  'costly',
  'lowest',
  'canon',
  'starter',
  'remanufactured',
  'died']]
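
The pairs above are ranked by pointwise mutual information (PMI). Here is a small worked example of the textbook PMI formula on made-up counts. The `pmi` helper is written for this illustration, and NLTK's windowed finder may normalize its counts differently, so treat this as a sketch of the idea rather than a reproduction of the scores.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """Pointwise mutual information (base 2) from raw counts."""
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Made-up counts: two words co-occur 2 times in a 16-token corpus,
# and each word appears 4 times overall.
print(pmi(pair_count=2, count_x=4, count_y=4, total=16))
```

A positive PMI means the two words co-occur more often than chance would predict. PMI tends to favor rare pairs, which may be part of why unusual words show up near the top of the adjective lists.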

### Save Topics
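Time to bring the noun clusters and adjective clusters together to form topics. Each topic is a list of three items: a name (the cluster's parent noun plus its top adjective), the remaining adjectives, and the remaining nouns.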

In [17]:
topics = []
for i in range(num_clust):
    if len(adjClusters[i]) != 0:
        topicName = " ".join([nounClusters[i][0], adjClusters[i][0]])
        queryParams = [topicName, adjClusters[i][1:], nounClusters[i][1:]]
    else:
        topicName = nounClusters[i][0]
        queryParams = [topicName, [], nounClusters[i][1:]]
    topics.append(queryParams)

In [18]:
topics[:2]

[['envelopes masking',
  ['packing',
   'virtually',
   'hanging',
   'rounded',
   'mechanical',
   'pendaflex',
   'smead',
   'invisible',
   'thermal'],
  ['folders',
   'inks',
   'staples',
   'pencils',
   'tabs',
   'markers',
   'scotch',
   'labels',
   'tape']],
 ['cartridges www',
  ['refilled',
   'genuine',
   'refurbished',
   'costly',
   'lowest',
   'canon',
   'starter',
   'remanufactured',
   'died'],
  ['printer',
   'cartridge',
   'printers',
   'hp',
   'amazon',
   'brands',
   'refills',
   'epson',
   'costco']]]
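
Each topic could be flattened into a search query string. Here is a minimal sketch of one way to do that, assuming the three-item topic layout built above. The `topic_to_query` function and its `max_terms` cutoff are hypothetical choices, not something I have evaluated.

```python
def topic_to_query(topic, max_terms=5):
    """Flatten [name, adjectives, nouns] into one space-separated query string."""
    name, adjectives, nouns = topic
    terms = name.split() + adjectives + nouns
    unique_terms = list(dict.fromkeys(terms))  # drop duplicates, keep order
    return " ".join(unique_terms[:max_terms])


# Shortened version of the second topic above.
example_topic = ["cartridges www", ["refilled", "genuine"], ["printer", "cartridge"]]
print(topic_to_query(example_topic))
```

Putting the topic name first keeps the highest-SIP noun at the front of the query.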

### Conclusions
I have a lot more to learn about Python and topic modeling. The top two topics, "envelopes masking" and "cartridges www", are not particularly helpful as labels, although using the clusters as a search query may yield some interesting results.