The goal of topic modeling is to discover the topics that are present in your corpus. Each document in the corpus is treated as a mixture of one or more topics.

This notebook uses Latent Dirichlet Allocation (LDA), which needs two inputs:
- a document-term matrix
- the number of topics you would like the algorithm to detect
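
To make the first input concrete, here is a minimal sketch of a document-term matrix built by hand with the standard library. The three-document corpus is made up for illustration and is not taken from the review data; later cells build the real matrix with `CountVectorizer`.

```python
from collections import Counter

# Toy corpus (made up for illustration, not taken from the review data).
docs = [
    "pizza crust pizza sauce",
    "coffee bar coffee music",
    "pizza bar",
]

# Vocabulary: every distinct term, sorted so the columns have a stable order.
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document, one column per term, each cell a raw count.
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

LDA only sees these counts, so word order inside a document is ignored. The number of topics is not learned from the matrix; it has to be chosen up front, which is why the cells below try several values.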

## Topic Modeling
First, let's look only at the nouns.


In [1]:
import pandas as pd
from gensim import matutils, models
import scipy.sparse
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    tokenized = word_tokenize(text)
    # Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with 'NN'
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if pos.startswith('NN')]
    return ' '.join(all_nouns)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\jgcao\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
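
The noun filter relies on a prefix match against Penn Treebank tags. The sketch below isolates that step on hand-written `(token, tag)` pairs, so it runs without NLTK data. The tags are written by hand for illustration and are not produced by `pos_tag`. The helper name `keep_by_prefix` is hypothetical.

```python
# Hand-written (token, tag) pairs standing in for pos_tag output;
# the tags are illustrative, not produced by running NLTK.
tagged = [
    ("wife", "NN"), ("took", "VBD"), ("me", "PRP"), ("here", "RB"),
    ("friends", "NNS"), ("scottsdale", "NNP"), ("good", "JJ"),
]

def keep_by_prefix(tagged_tokens, prefix):
    """Keep the tokens whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

# 'NN' matches singular, plural, and proper nouns alike.
nouns_only = keep_by_prefix(tagged, "NN")
print(nouns_only)
```

Verbs, pronouns, adverbs, and adjectives are dropped, which is why the noun-only text in the next cells reads like a list of keywords.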


In [2]:
cleaned = pd.read_pickle('./pickle/corpus.pkl')
cleaned

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,1/26/2011,fWKvX83p0-ka4JS3dc6E5A,5,my wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,7/27/2011,IjZ33sJrzXqU-0X6U8NwyA,5,i have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,6/14/2012,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate rice is so good and i also...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,5/27/2010,G-WvGaISbqqaMHlNnByodA,5,rosie dakota and i love chaparral dog park its...,review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,1/5/2012,1uJFq2r5QfJG_6ExMRCaGw,5,general manager scott petello is a good egg no...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
...,...,...,...,...,...,...,...,...,...,...
9995,VY_tvNUCCXGXQeSvJl757Q,7/28/2012,Ubyfp2RSDYW0g7Mbr8N3iA,3,first visithad lunch here today used my group...,review,_eqQoPtQ3e3UxLE4faT6ow,1,2,0
9996,EKzMHI1tip8rC1-ZAy64yg,1/18/2012,2XyIOQKbVFb6uXQdJ0RzlQ,4,should be called house of deliciousness i cou...,review,ROru4uk5SaYc3rg8IU7SQw,0,0,0
9997,53YGfwmbW73JhFiemNeyzQ,11/16/2010,jyznYkIbpqVmlsZxSDSypA,4,i recently visited olive and ivy for business ...,review,gGbN1aKQHMgfQZkqlsuwzg,0,0,0
9998,9SKdOoDHcFoxK5ZtsgHJoA,12/2/2012,5UKq9WQE1qQbJ0DJbc-B6Q,2,my nephew just moved to scottsdale recently so...,review,0lyVoNazXa20WzUyZPLaQQ,0,0,0


In [3]:
data_nouns = pd.DataFrame(cleaned.text.apply(nouns))
data_nouns

Unnamed: 0,text
0,wife birthday breakfast weather grounds pleasu...
1,i idea people reviews place everyone something...
2,gyro plate rice i candy selection
3,rosie dakota dog park lot paths desert xerisca...
4,manager scott petello egg detail issues speak ...
...,...
9995,lunch today groupon bruschetta pretzels calzon...
9996,house deliciousness i item item blah food mind...
9997,i business week visits fox restaurants establi...
9998,nephew bunch friends bar girlfriend pool watch...


In [4]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns['text'])
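# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())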
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0,aa,aaa,aaaaaalright,aaand,aaron,aarons,ab,abacus,abbaye,abbreviations,...,zumaroka,zumba,zupa,zupas,zur,zuzu,zuzus,zzzzzzzzzzzzzzzzz,éclairs,école
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Create the gensim corpus
# Sparse2Corpus treats columns as documents by default, so transpose to terms x documents
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())
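
The vocabulary dictionary step is just a mapping inversion. Here is a small sketch with a made-up three-term vocabulary standing in for `cvn.vocabulary_`. The names `vocabulary`, `id2word`, and `bow` are illustrative only.

```python
# Toy stand-in for CountVectorizer.vocabulary_ (term -> column index).
vocabulary = {"bar": 0, "coffee": 1, "pizza": 2}

# Invert it to column index -> term, the direction gensim's id2word expects.
id2word = {index: term for term, index in vocabulary.items()}

# A bag-of-words document as (term_id, count) pairs becomes readable again.
bow = [(2, 2), (0, 1)]
readable = [(id2word[term_id], count) for term_id, count in bow]
print(readable)
```

Without this mapping the model could still be trained, but the topics would be reported as integer ids rather than words.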

In [6]:
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.017*"place" + 0.007*"store" + 0.007*"service" + 0.007*"staff" + 0.007*"bar" + 0.006*"coffee" + 0.005*"day" + 0.005*"way" + 0.005*"night" + 0.005*"room"'),
 (1,
  '0.032*"food" + 0.025*"place" + 0.012*"service" + 0.009*"restaurant" + 0.009*"menu" + 0.008*"order" + 0.007*"chicken" + 0.006*"pizza" + 0.006*"lunch" + 0.006*"sauce"')]

In [7]:
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.026*"place" + 0.011*"bar" + 0.010*"food" + 0.010*"service" + 0.008*"night" + 0.007*"staff" + 0.006*"room" + 0.006*"coffee" + 0.006*"area" + 0.006*"way"'),
 (1,
  '0.033*"food" + 0.024*"place" + 0.013*"service" + 0.010*"restaurant" + 0.010*"menu" + 0.009*"chicken" + 0.009*"order" + 0.007*"pizza" + 0.007*"lunch" + 0.007*"sauce"'),
 (2,
  '0.010*"store" + 0.007*"ice" + 0.007*"cream" + 0.006*"dog" + 0.006*"chocolate" + 0.005*"staff" + 0.004*"day" + 0.004*"place" + 0.004*"shop" + 0.004*"experience"')]

In [7]:
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.034*"food" + 0.024*"place" + 0.013*"service" + 0.011*"restaurant" + 0.010*"menu" + 0.010*"chicken" + 0.009*"order" + 0.008*"lunch" + 0.007*"sauce" + 0.007*"salad"'),
 (1,
  '0.011*"place" + 0.011*"store" + 0.009*"service" + 0.007*"staff" + 0.006*"room" + 0.006*"day" + 0.006*"way" + 0.005*"car" + 0.005*"experience" + 0.005*"hotel"'),
 (2,
  '0.034*"pizza" + 0.024*"burger" + 0.016*"fries" + 0.015*"place" + 0.009*"dog" + 0.007*"burgers" + 0.006*"crust" + 0.005*"order" + 0.005*"staff" + 0.005*"way"'),
 (3,
  '0.033*"place" + 0.022*"bar" + 0.020*"food" + 0.013*"night" + 0.012*"beer" + 0.010*"drinks" + 0.010*"service" + 0.010*"hour" + 0.009*"coffee" + 0.008*"music"')]
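
One way to compare topics is to parse the strings that `print_topics()` returns and separate the words every topic shares from the words that appear in only one. The sketch below does this for the first five terms of each topic in the four-topic noun output above. The helper `parse_topic` is hypothetical and not part of gensim.

```python
import re

# First five terms of each topic from the four-topic noun output above.
topics = {
    0: '0.034*"food" + 0.024*"place" + 0.013*"service" + 0.011*"restaurant" + 0.010*"menu"',
    1: '0.011*"place" + 0.011*"store" + 0.009*"service" + 0.007*"staff" + 0.006*"room"',
    2: '0.034*"pizza" + 0.024*"burger" + 0.016*"fries" + 0.015*"place" + 0.009*"dog"',
    3: '0.033*"place" + 0.022*"bar" + 0.020*"food" + 0.013*"night" + 0.012*"beer"',
}

def parse_topic(topic_string):
    """Turn '0.034*"food" + ...' into [('food', 0.034), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

parsed = {topic_id: parse_topic(s) for topic_id, s in topics.items()}
word_sets = {topic_id: {word for word, _ in terms} for topic_id, terms in parsed.items()}

# Words present in every topic say little about what separates the topics.
shared = set.intersection(*word_sets.values())

# Words that appear in exactly one topic are the most useful for naming it.
distinctive = {}
for topic_id, terms in parsed.items():
    others = set().union(*(ws for tid, ws in word_sets.items() if tid != topic_id))
    distinctive[topic_id] = [word for word, _ in terms if word not in others]

print(shared)
print(distinctive)
```

On this truncated view, "place" is the only word common to all four topics, while words such as "burger" or "beer" show up in a single topic each.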

### Topic Modeling (Nouns and Adjectives)

In [6]:
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    tokenized = word_tokenize(text)
    # Noun tags start with 'NN' and adjective tags start with 'JJ'
    selected = [word for (word, pos) in pos_tag(tokenized) if pos.startswith(('NN', 'JJ'))]
    return ' '.join(selected)

In [7]:
data_nouns_adj = pd.DataFrame(cleaned.text.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,text
0,wife birthday breakfast excellent weather perf...
1,i idea people bad reviews place everyone somet...
2,gyro plate rice good i candy selection
3,rosie dakota i chaparral dog park convenient l...
4,general manager scott petello good egg detail ...
...,...
9995,lunch today groupon bruschetta pretzels cheese...
9996,house deliciousness i item item blah blah i wa...
9997,i olive ivy business last week visits fox rest...
9998,nephew bunch friends local bar girlfriend shoo...


In [8]:
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj['text'])
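# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())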
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,aa,aaa,aaaaaalright,aaand,aaron,aarons,aarp,ab,abacus,abbaye,...,zumba,zupa,zupas,zur,zuzu,zuzus,zweigel,zzzzzzzzzzzzzzzzz,éclairs,école
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
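
The `max_df=.8` argument tells `CountVectorizer` to ignore terms whose document frequency is strictly higher than 80% of the documents. The standard-library sketch below mimics that rule on a made-up five-document corpus. It is an illustration of the threshold, not scikit-learn's implementation.

```python
from collections import Counter

# Made-up corpus: "place" is in all five documents, "good" in four of them.
docs = [
    "good pizza place",
    "good coffee place",
    "good bar place",
    "great store place",
    "good burger place",
]

max_df = 0.8
n_docs = len(docs)

# Document frequency: the number of documents containing each term at least once.
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

# Drop terms that occur in strictly more than max_df * n_docs documents.
dropped = sorted(t for t, df in doc_freq.items() if df > max_df * n_docs)
kept = sorted(t for t, df in doc_freq.items() if df <= max_df * n_docs)

print(dropped)
print(kept)
```

A term sitting exactly on the threshold, like "good" here at 4 of 5 documents, is kept. Only terms above it are removed.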


In [9]:
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [12]:
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.012*"place" + 0.009*"great" + 0.007*"good" + 0.005*"nice" + 0.005*"store" + 0.005*"staff" + 0.005*"service" + 0.004*"new" + 0.004*"coffee" + 0.004*"bar"'),
 (1,
  '0.023*"food" + 0.021*"good" + 0.018*"place" + 0.014*"great" + 0.009*"service" + 0.006*"restaurant" + 0.006*"menu" + 0.006*"chicken" + 0.006*"little" + 0.005*"order"')]

In [13]:
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.013*"place" + 0.010*"great" + 0.008*"good" + 0.006*"nice" + 0.006*"service" + 0.005*"store" + 0.005*"staff" + 0.005*"bar" + 0.005*"new" + 0.004*"ive"'),
 (1,
  '0.015*"good" + 0.015*"pizza" + 0.014*"place" + 0.009*"food" + 0.008*"breakfast" + 0.008*"cheese" + 0.006*"best" + 0.006*"sandwich" + 0.006*"great" + 0.005*"order"'),
 (2,
  '0.026*"food" + 0.022*"good" + 0.019*"place" + 0.016*"great" + 0.011*"service" + 0.008*"restaurant" + 0.007*"menu" + 0.007*"chicken" + 0.006*"salad" + 0.006*"little"')]

In [10]:
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.027*"place" + 0.021*"good" + 0.019*"food" + 0.017*"great" + 0.010*"bar" + 0.010*"service" + 0.009*"pizza" + 0.007*"night" + 0.006*"nice" + 0.005*"drinks"'),
 (1,
  '0.011*"cream" + 0.010*"chocolate" + 0.010*"ice" + 0.007*"fresh" + 0.005*"flavors" + 0.005*"delicious" + 0.005*"best" + 0.005*"market" + 0.004*"cheese" + 0.004*"cake"'),
 (2,
  '0.008*"great" + 0.008*"store" + 0.007*"place" + 0.006*"good" + 0.005*"nice" + 0.005*"staff" + 0.005*"new" + 0.005*"service" + 0.004*"room" + 0.004*"ive"'),
 (3,
  '0.023*"food" + 0.022*"good" + 0.015*"place" + 0.012*"great" + 0.009*"chicken" + 0.009*"service" + 0.008*"restaurant" + 0.007*"menu" + 0.007*"salad" + 0.006*"sauce"')]

### Identifying Topics

In [None]:
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

Topic 0:   
Topic 1:    
Topic 2:   
Topic 3:  

In [None]:
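corpus_transformed = ldana[corpusna]
# Each document yields a list of (topic, probability) pairs, so keep the most probable topic
list(zip([max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed], data_dtmna.index))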
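
Indexing an `LdaModel` with a corpus yields, for each document, a list of `(topic_id, probability)` pairs that can contain more than one topic. The sketch below uses made-up distributions to show how such lists reduce to one dominant topic per document and how to count documents per topic. The helper name `dominant_topic` is hypothetical.

```python
from collections import Counter

# Made-up per-document topic distributions in (topic_id, probability) form.
doc_topics = [
    [(0, 0.91), (3, 0.07)],
    [(1, 0.55), (2, 0.43)],
    [(3, 0.98)],
    [(0, 0.30), (1, 0.10), (3, 0.60)],
]

def dominant_topic(distribution):
    """Return the topic id with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])[0]

assignments = [dominant_topic(d) for d in doc_topics]

# How many documents fall under each topic.
topic_sizes = Counter(assignments)

print(assignments)
print(topic_sizes)
```

A document such as the second one, split 0.55 to 0.43, is a reminder that a single label hides how mixed some reviews are.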