## Generating topics from product reviews.
I took a data science class this semester and quickly realized I needed to learn Python (shocking, I know). To help me learn the basics, I worked on generating common topics that appear in product reviews. Each "topic" is a cluster of words that appear frequently and in the same contexts.

## Cleaning the data.
The data I'm using is a subset of Amazon office product reviews from [here](http://jmcauley.ucsd.edu/data/amazon/). There are 53,258 reviews, but I only kept reviews shorter than 500 characters (to satisfy requirements for a class assignment). This left me with 24,305 reviews. Removing stop words ("and", "the", etc.) left 647,723 words to work with.
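
As a rough illustration of the cleaning steps described above, here is a minimal sketch on a toy example. It assumes a tiny hand-written stop-word list instead of NLTK's, and the `clean_reviews` helper and sample reviews are hypothetical; only the 500-character cutoff and the `\w+` token pattern come from this post.

```python
import re

# Assumption: a tiny illustrative stop-word list; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"and", "the", "a", "is", "it", "this", "i", "to", "of"}
MAX_CHARS = 500  # length cutoff used for the class assignment


def clean_reviews(raw_reviews, stop_words=TOY_STOP_WORDS, max_chars=MAX_CHARS):
    """Hypothetical helper: drop long reviews, tokenize, and remove stop words."""
    cleaned = []
    for review in raw_reviews:
        if len(review) >= max_chars:  # keep only reviews shorter than the cutoff
            continue
        tokens = re.findall(r"\w+", review.lower())  # runs of word characters
        tokens = [t for t in tokens if t not in stop_words]
        if tokens:  # skip reviews that end up empty
            cleaned.append(tokens)
    return cleaned


sample = [
    "This binder is sturdy and the rings close tightly.",
    "x" * 600,    # too long, dropped
    "It is the",  # only stop words, dropped
]
cleaned = clean_reviews(sample)
print(cleaned)
```

Only the first sample review survives, reduced to its content words.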

In [4]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import unidecode

reviews = []
tokenizer = RegexpTokenizer(r'\w+')  # split on runs of word characters
stop_words = set(stopwords.words('english'))
with open('amazon_short_reviews.csv') as inputfile:
    for line in inputfile:
        # transliterate to ASCII, lowercase, then tokenize
        line_words = tokenizer.tokenize(unidecode.unidecode(line).lower())
        reviews.append([w for w in line_words if w not in stop_words])  # remove stop words
words = [word for line in reviews for word in line]  # flat list of every remaining word
reviews = [x for x in reviews if x]  # remove empty reviews

In [5]:
print(len(reviews))
print(len(words))

24305
647723


## Calculating SIP scores.
An "SIP" is a statistically improbable phrase. The SIP score tells you just how improbable a word or phrase is. For example, a word with SIP = 5 is used 5 times as often in the corpus of interest (here, product reviews) as it is in everyday language.

To find the SIP scores, I first calculated the local frequency for each word as (number of occurrences / total number of words), filtering out words that occurred fewer than 10 times. I used the `wordfreq` library to find the baseline word frequencies. Dividing local frequency by baseline frequency gives the SIP score.
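
To make the arithmetic concrete, here is a small worked sketch of the same calculation on a toy corpus. The `sip_scores` helper is hypothetical and the baseline frequencies are made-up numbers rather than real `wordfreq` values; the 10-occurrence cutoff and the 0.00001 fallback mirror the notebook cell.

```python
from collections import Counter


def sip_scores(words, baseline_freq, min_count=10, fallback_freq=1e-5):
    """Return {word: SIP}, where SIP = (count / total words) / baseline frequency."""
    total = len(words)
    scores = {}
    for word, count in Counter(words).items():
        if count < min_count:  # ignore rare words
            continue
        # use the fallback when the baseline is missing or zero
        freq = baseline_freq.get(word, 0) or fallback_freq
        scores[word] = count / total / freq
    return scores


# Toy corpus of 1,000 words
toy_words = ["binder"] * 20 + ["good"] * 50 + ["rare"] * 2 + ["filler"] * 928
# Hypothetical baseline frequencies (illustrative only)
toy_baseline = {"binder": 0.0001, "good": 0.001, "filler": 0.928}

scores = sip_scores(toy_words, toy_baseline)
for word, score in sorted(scores.items(), key=lambda pair: pair[1], reverse=True):
    print(word, round(score, 2))
```

Here "binder" makes up 2% of the toy corpus against a baseline of 0.01%, so its SIP is about 200, while "rare" is dropped because it occurs fewer than 10 times.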

In [6]:
from collections import Counter
from wordfreq import word_frequency

num_words = len(words)
word_counts = Counter(words)  # count every word in a single pass
word_pool = []
word_SIP = []
for word, count in word_counts.items():
    if count < 10:  # skip words that occur fewer than 10 times
        continue
    freq = word_frequency(word, 'en')  # baseline frequency in general English
    if freq == 0:
        freq = 1e-5  # fallback baseline for words wordfreq does not know
    word_pool.append(word)
    word_SIP.append(count / num_words / freq)

# sort words by SIP score, highest first
SIPscoreslist = sorted(zip(word_pool, word_SIP), key=lambda pair: pair[1], reverse=True)
word_pool_sorted = [word for word, _ in SIPscoreslist]
word_SIP_sorted = [score for _, score in SIPscoreslist]

In [8]:
SIPscoreslist[:5]

[('binder', 1289.0747265053337),
 ('folders', 886.1633999626201),
 ('sturdy', 807.2216685543215),
 ('cartridges', 715.1373118764491),
 ('pens', 678.9790593846693)]

## Word vectors with Word2Vec.
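To group related words into topics, I trained a Word2Vec model on the tokenized reviews. Word2Vec gives each word a vector, and words that appear in similar contexts end up with similar vectors.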

In [None]:
from gensim.models import Word2Vec

# WARNING: with min_count greater than 1, a word with a high SIP score may be
# missing from the word vectors
# Note: `vector_size` is the gensim 4.x name; older versions call it `size`
model = Word2Vec(reviews, vector_size=300, window=5, min_count=1, workers=4)
# Another concern: a highly similar word may not be included in the SIP scores,
# so SIP and Word2Vec need to share the same base vocabulary

In [None]:
print(model.wv.most_similar(positive=['binder'], topn=5))
print(model.wv.most_similar(positive=['pens'], topn=5))
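
`most_similar` ranks words by the cosine similarity of their vectors. Below is a minimal sketch of that ranking on made-up 3-dimensional vectors, restricted to words that also have a SIP score, which is one possible way to handle the vocabulary mismatch noted in the comments. The vectors, the word list, and the `most_similar_in_pool` helper are all hypothetical and not taken from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_in_pool(word, vectors, pool, topn=5):
    """Rank the words in `pool` by cosine similarity to `word`, highest first."""
    candidates = [w for w in vectors if w != word and w in pool]
    scored = [(w, cosine(vectors[word], vectors[w])) for w in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors for illustration only
toy_vectors = {
    "binder": [1.0, 0.9, 0.0],
    "folders": [0.9, 1.0, 0.1],
    "pens": [0.0, 0.2, 1.0],
    "typo": [1.0, 0.9, 0.0],  # same direction as 'binder' but has no SIP score
}
sip_pool = {"binder", "folders", "pens"}  # words that have a SIP score

similar = most_similar_in_pool("binder", toy_vectors, sip_pool, topn=2)
print(similar)
```

In this toy setup "folders" ranks above "pens" for "binder", and "typo" is skipped even though its vector matches, because it is not in the SIP pool.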