In [70]:
import pandas as pd
from gensim import corpora, models, similarities
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict

### Topic Modeling

In this notebook, we're going to learn about an unsupervised learning approach from NLP called Topic Modeling. This is a broad category, and we're going to focus on its most popular method, Latent Dirichlet Allocation (LDA).

Here's some intuition about the goals of Topic Modeling:
  * Similar in spirit to clustering techniques: find things that are similar
  * Our datapoints are documents and our features are word counts
  * Documents that are about a similar subject tend to use many of the same words
  * Thus, the clusters of word counts can be thought of as "topics" (e.g. Sports, Politics, Business)
  * This is unsupervised learning, so we want to extract these hidden topics from our corpus (we have no labels)
  * In LDA, there can be some overlap of the topics within a single document. A news article might use words and concepts from both economics and politics.
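
To make the last point concrete, here is a minimal sketch of the generative story LDA assumes: each document draws its own mixture over topics, and each word is drawn from one of those topics. The topic names, word probabilities, and the `generate_document` helper are hypothetical toy values chosen for illustration; they are not part of gensim.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate a toy document under the LDA generative story.

    topics: dict mapping topic name -> {word: probability} (hypothetical toy data)
    alpha:  Dirichlet concentration parameters, one per topic
    """
    names = list(topics)
    # 1. Each document gets its own mixture over topics.
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Each word position first picks a topic according to the mixture...
        topic = rng.choices(names, weights=mixture)[0]
        # 3. ...then picks a word from that topic's word distribution.
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return dict(zip(names, mixture)), words

toy_topics = {
    "economics": {"market": 0.5, "tax": 0.3, "trade": 0.2},
    "politics": {"vote": 0.5, "senate": 0.3, "tax": 0.2},
}
rng = random.Random(0)
mixture, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=12, rng=rng)
print(mixture)  # this document's share of each topic
print(words)    # words drawn from both topics in those proportions
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic word distributions and each document's mixture.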
  
We're going to use the *gensim* package to do LDA.  

In [5]:
# read yelp.csv into a DataFrame
url = '../data/yelp.csv'
yelp = pd.read_csv(url)
X = yelp.text

In the past, we've used CountVectorizer to do things like remove stop words, select the top features, and construct a document-term matrix. Here we're going to do some of those things by hand, and let gensim take care of some others.

In [80]:
# remove common words and tokenize

# let's borrow the stop words from CountVectorizer.
# We initialize a CountVectorizer() but never fit it; we just want its word list.

stoplist = set(CountVectorizer(stop_words='english').get_stop_words() )
texts = [[word for word in document.lower().split() if word not in stoplist] for document in list(X)]
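
Splitting on whitespace keeps punctuation attached to words, which is why tokens such as `-` and `good.` show up in the topic output later in this notebook. As a possible alternative, here is a minimal regex-based sketch using only the standard library. The `tokenize` helper, the `TOKEN_RE` pattern, and the tiny `toy_stoplist` are hypothetical stand-ins, not what the cell above uses.

```python
import re

# Hypothetical pattern: runs of letters, optionally with one internal apostrophe (e.g. "it's").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def tokenize(document, stoplist):
    """Lowercase the text, extract word-like tokens, and drop stop words."""
    return [tok for tok in TOKEN_RE.findall(document.lower()) if tok not in stoplist]

toy_stoplist = {"the", "was", "and", "a"}  # tiny stand-in for the CountVectorizer list
tokens = tokenize("The food was good. Really good - and it's cheap!", toy_stoplist)
print(tokens)  # ['food', 'good', 'really', 'good', "it's", 'cheap']
```

With this tokenizer, `good.` and `good` are counted as the same word and the bare `-` disappears.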

In [110]:
# count up the frequency of each word
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# remove words that occur only once in the whole corpus
texts = [[token for token in text if frequency[token] > 1] for text in texts]

Next, we're going to create our bag-of-words representation of the data. Previously, we used CountVectorizer to generate a document-term matrix. Gensim offers similar functionality.

In [111]:
dictionary = corpora.Dictionary(texts)

In [105]:
print(dictionary)

Dictionary(6838 unique tokens: [u'yellow', u"friend's", u'hanging', u'deli', u'regional']...)


In [115]:
# this object stores everything we need for our bag-of-words representation
dictionary.doc2bow("golf car window waiter".split())

[(115, 1), (793, 1), (3677, 1), (3864, 1)]
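
Each pair in that output is `(token_id, count)`. To show the idea, here is a minimal pure-Python sketch of a token-to-id mapping and a bag-of-words conversion. The helpers `build_vocabulary` and `to_bow` are hypothetical, and the ids they assign will not match the ones gensim assigns.

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_texts = [["golf", "car", "golf"], ["waiter", "window", "car"]]
token2id = build_vocabulary(toy_texts)
bow = to_bow("golf car golf pizza".split(), token2id)
print(token2id)  # {'golf': 0, 'car': 1, 'waiter': 2, 'window': 3}
print(bow)       # [(0, 2), (1, 1)]
```

Note that "pizza" is silently dropped because it is not in the vocabulary, and that word order is lost: only counts remain.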

Now we create a corpus in the bag of words representation by transforming each of our texts.

In [116]:
corpus = [dictionary.doc2bow(text) for text in texts]

In [53]:
print(corpus[5])

[(6, 3), (12, 1), (15, 1), (16, 4), (19, 1), (25, 4), (36, 3), (37, 1), (44, 1), (50, 2), (54, 4), (56, 1), (59, 5), (60, 1), (63, 1), (65, 5), (83, 1), (88, 9), (91, 2), (93, 1), (96, 1), (105, 2), (108, 3), (113, 1), (127, 1), (131, 3), (140, 4), (148, 2), (149, 1), (154, 1), (157, 1), (159, 1), (161, 2), (165, 1), (168, 7), (174, 5), (175, 1), (180, 4), (181, 1), (182, 2), (185, 1), (186, 1), (189, 2), (192, 4), (193, 2), (196, 1), (251, 1), (264, 2), (282, 1), (284, 1), (285, 1), (286, 1), (287, 1), (288, 1), (289, 1), (290, 1), (291, 1), (292, 1), (293, 1), (294, 1), (295, 1), (296, 1), (297, 1), (298, 1), (299, 1), (300, 1), (301, 1), (302, 1), (303, 2), (304, 1), (305, 1), (306, 1), (307, 1), (308, 1), (309, 1), (310, 1), (311, 1), (312, 1), (313, 1), (314, 1), (315, 1), (316, 1), (317, 1), (318, 1), (319, 1), (320, 1), (321, 1), (322, 1), (323, 1), (324, 2), (325, 1), (326, 1), (327, 1), (328, 1), (329, 1), (330, 1), (331, 1), (332, 1), (333, 2), (334, 1), (335, 1), (336, 1), (

Now that we have our corpus, we will fit an LDA model to this data. We can pick any number of topics we like and see how the result turns out.


In [117]:
# warning: this takes a minute or two
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, alpha='auto')



In [118]:
lda.show_topics()

[u'0.011*- + 0.009*food + 0.009*ordered + 0.009*good + 0.008*like + 0.008*place + 0.007*just + 0.007*great + 0.006*best + 0.006*really',
 u"0.014*good + 0.010*food + 0.010*it's + 0.009*like + 0.008*chicken + 0.008*just + 0.007*place + 0.006*got + 0.006*little + 0.006*burger",
 u"0.011*just + 0.011*place + 0.009*good + 0.008*like + 0.007*really + 0.007*food + 0.007*i'm + 0.005*it's + 0.005*didn't + 0.005*little",
 u"0.015*it's + 0.010*just + 0.008*like + 0.007*really + 0.006*place + 0.006*don't + 0.005*- + 0.005*good + 0.005*food + 0.005*i've",
 u"0.014*like + 0.012*just + 0.010*it's + 0.010*- + 0.008*place + 0.007*really + 0.006*don't + 0.005*good + 0.005*food + 0.005*got",
 u"0.009*place + 0.009*just + 0.007*love + 0.007*like + 0.007*time + 0.006*good + 0.006*- + 0.006*food + 0.006*great + 0.005*it's",
 u"0.014*place + 0.012*like + 0.011*food + 0.008*good + 0.008*just + 0.008*it's + 0.008*really + 0.006*pizza + 0.006*- + 0.006*pretty",
 u'0.014*food + 0.011*service + 0.010*dog + 0.006
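
The topics above share many generic words ("good", "place", "food", "just"), which makes them hard to tell apart. One common remedy is to drop words that appear in a large fraction of documents, in addition to the rare ones we already removed. The sketch below is a pure-Python illustration; the `drop_common_tokens` helper and its `max_doc_fraction` parameter are hypothetical. Gensim's `Dictionary` also has a `filter_extremes` method intended for this kind of pruning.

```python
from collections import Counter

def drop_common_tokens(texts, max_doc_fraction=0.5):
    """Remove tokens that occur in more than max_doc_fraction of the documents."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text))  # count each token at most once per document
    limit = max_doc_fraction * len(texts)
    return [[tok for tok in text if doc_freq[tok] <= limit] for text in texts]

toy_texts = [
    ["good", "pizza", "place"],
    ["good", "burger", "place"],
    ["good", "dog", "groomer"],
    ["pizza", "crust", "good"],
]
filtered = drop_common_tokens(toy_texts)
print(filtered)
# [['pizza', 'place'], ['burger', 'place'], ['dog', 'groomer'], ['pizza', 'crust']]
```

Here "good" appears in every document, so it carries no information about which topic a document belongs to and is removed.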

In [102]:
lda.top_topics(corpus)

[([(0.014065597961374307, u'place'),
   (0.013558365075836501, u'good'),
   (0.012024852163405949, u'food'),
   (0.0077697351437719222, u'just'),
   (0.0067911028195645996, u'service'),
   (0.0062420825318874424, u'really'),
   (0.0058014426375006057, u'great'),
   (0.0055872844700345049, u'time'),
   (0.0055568370110874454, u"it's"),
   (0.0053055146042074791, u'got'),
   (0.0051860177915481614, u'like'),
   (0.0050400587262547265, u"i'm"),
   (0.0047866217048370529, u'-'),
   (0.0044233059799882006, u"i've"),
   (0.0043957643274345629, u'little'),
   (0.0040343576117017887, u"didn't"),
   (0.0037519270388395288, u"don't"),
   (0.003584143032704142, u'came'),
   (0.0035284232220913485, u'went'),
   (0.0031748211735253455, u'good.')],
  -504.48052087638246),
 ([(0.0098665676284805902, u'like'),
   (0.0084474962467171455, u'food'),
   (0.0073160061650787119, u'place'),
   (0.0072213481376723959, u'good'),
   (0.0059832165880357707, u'just'),
   (0.0046442391853328364, u'came'),
   (0.00
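
Each entry returned by `top_topics` pairs a topic's word list with a single number (for example `-504.48...` above). That number is a topic coherence score: higher (less negative) values indicate that the topic's top words tend to appear in the same documents. The following is a simplified sketch of a UMass-style coherence measure, assuming the pairwise form log((D(w_i, w_j) + 1) / D(w_i)) with w_i the higher-ranked word; the `umass_coherence` helper is hypothetical and gensim's own computation may differ in its details.

```python
import math
from itertools import combinations

def umass_coherence(top_words, texts):
    """Simplified UMass-style coherence for one topic.

    top_words: the topic's words, ordered from most to least probable
    texts:     tokenized documents (lists of tokens)
    """
    doc_sets = [set(text) for text in texts]

    def doc_count(*words):
        # number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # combinations() keeps list order, so `earlier` is always the higher-ranked word
    for earlier, later in combinations(top_words, 2):
        denom = doc_count(earlier)
        if denom == 0:
            continue  # assumption of this sketch: skip words that never occur
        score += math.log((doc_count(earlier, later) + 1) / denom)
    return score

toy_docs = [
    ["pizza", "crust", "cheese"],
    ["pizza", "cheese"],
    ["dog", "groomer"],
    ["dog", "pizza"],
]
coherent = umass_coherence(["pizza", "cheese"], toy_docs)
mixed = umass_coherence(["pizza", "groomer"], toy_docs)
print(coherent, mixed)
```

In the toy data, "pizza" and "cheese" co-occur in two documents while "pizza" and "groomer" never do, so the first word list scores higher.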

## Exercise
  * Try this again with only 2 topics. How do the topics seem?
  * Try this again without removing stop words. How do the topics seem?
  * From what you've learned about LDA, do you think this method would work better with documents the size of tweets or the size of long news articles?