<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Guided Practice with Topic Modeling and LDA


---


In practice, you will rarely need to build an unsupervised topic model like LDA from scratch. Fortunately, sklearn comes with LDA topic modeling functionality. Another popular LDA implementation, which we will explore in this lab, comes from the `gensim` package.

Let's explore a brief walkthrough of LDA and topic modeling using gensim. We will work with a small collection of documents represented as a list.

### 1. Load the packages and create the small "documents".

You may need to install the gensim package with `pip` or `conda`.

In [2]:
from gensim import corpora, models, matutils
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import defaultdict
import pandas as pd


doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

# compile sample documents into a list
documents = [doc_a, doc_b, doc_c, doc_d, doc_e]
df        = pd.DataFrame(documents, columns=['text'])

In [3]:
df

Unnamed: 0,text
0,Brocolli is good to eat. My brother likes to e...
1,My mother spends a lot of time driving my brot...
2,Some health experts suggest that driving may c...
3,I often feel pressure to perform well at schoo...
4,Health professionals say that brocolli is good...


### 2. Load stop words either from NLTK or sklearn

In [4]:
from nltk.corpus import stopwords

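# request the English list explicitly; calling words() with no argument returns every language
nltk_stops = stopwords.words('english')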

In [5]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom_stop_words = list(ENGLISH_STOP_WORDS)

custom_stop_words.append('mother')
custom_stop_words.append('brother')

In [5]:
# A:

### 3. Use CountVectorizer to transform our text, taking out the stopwords.

In [6]:
vect = CountVectorizer(stop_words = custom_stop_words)
X = vect.fit_transform(df['text'])

### 4. Extract the tokens that remain after stopword removal.

The `.vocabulary_` attribute of the vectorizer contains a dictionary that maps each term to its column index. The vectorizer's `.get_feature_names()` method returns the terms in column order (in scikit-learn 1.0 and later, this method is called `.get_feature_names_out()`).

In [11]:
terms = vect.vocabulary_
features = vect.get_feature_names()
print(terms, '\n', features)

{u'tension': 23, u'good': 10, u'increased': 12, u'feel': 9, u'practice': 16, u'brocolli': 3, u'pressure': 17, u'say': 19, u'experts': 8, u'likes': 13, u'spends': 21, u'professionals': 18, u'school': 20, u'driving': 6, u'perform': 15, u'suggest': 22, u'drive': 5, u'eat': 7, u'better': 1, u'health': 11, u'lot': 14, u'time': 24, u'baseball': 0, u'cause': 4, u'blood': 2} 
[u'baseball', u'better', u'blood', u'brocolli', u'cause', u'drive', u'driving', u'eat', u'experts', u'feel', u'good', u'health', u'increased', u'likes', u'lot', u'perform', u'practice', u'pressure', u'professionals', u'say', u'school', u'spends', u'suggest', u'tension', u'time']


### 5. Get counts of tokens.

Convert the matrix from the vectorizer to a dense matrix, then sum by column to get the counts per term.

In [14]:
X.todense()

matrix([[0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
         0, 0, 1],
        [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
         1, 1, 0],
        [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
         0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
         0, 0, 0]])

In [17]:
token_df = pd.DataFrame(X.todense(), columns=features)
token_df

Unnamed: 0,baseball,better,blood,brocolli,cause,drive,driving,eat,experts,feel,...,perform,practice,pressure,professionals,say,school,spends,suggest,tension,time
0,0,0,0,2,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,1,0,0,1
2,0,0,1,0,1,0,1,0,1,0,...,0,0,1,0,0,0,0,1,1,0
3,0,1,0,0,0,1,0,0,0,1,...,1,0,1,0,0,1,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0


In [22]:
sum_matrix = X.toarray().sum(axis=0)
sum_token_df = pd.DataFrame(sum_matrix.reshape(1, -1), columns=features)
sum_token_df

Unnamed: 0,baseball,better,blood,brocolli,cause,drive,driving,eat,experts,feel,...,perform,practice,pressure,professionals,say,school,spends,suggest,tension,time
0,1,1,1,3,1,1,2,2,1,1,...,1,1,2,1,1,1,1,1,1,1


### 6. Set up the vocabulary dictionary

First we need to set up the vocabulary. Gensim's LDA expects our vocabulary to be in a format where the dictionary keys are the column indices and the values are the words themselves.

Create this dictionary below.

In [29]:
vocab = {v: k for k, v in vect.vocabulary_.items()}
vocab

{0: u'baseball',
 1: u'better',
 2: u'blood',
 3: u'brocolli',
 4: u'cause',
 5: u'drive',
 6: u'driving',
 7: u'eat',
 8: u'experts',
 9: u'feel',
 10: u'good',
 11: u'health',
 12: u'increased',
 13: u'likes',
 14: u'lot',
 15: u'perform',
 16: u'practice',
 17: u'pressure',
 18: u'professionals',
 19: u'say',
 20: u'school',
 21: u'spends',
 22: u'suggest',
 23: u'tension',
 24: u'time'}

### 7. Create a token to id mapping with gensim's `corpora.Dictionary`

This dictionary class is a more standard way to work with gensim models. There are a few standard steps we should go through:

**7.1. Count the frequency of words.**

We can do this easily with Python's `defaultdict(int)`, which lets us increment a key's count without first checking that the key exists in the dictionary:

```python
frequency = defaultdict(int)

for text in documents:
    for token in text.split():
        frequency[token] += 1
```




In [31]:
frequency = defaultdict(int)

for text in documents:
    for token in text.split():
        frequency[token] += 1

In [32]:
frequency

defaultdict(int,
            {'Brocolli': 1,
             'Health': 1,
             'I': 1,
             'My': 2,
             'Some': 1,
             'a': 1,
             'and': 1,
             'around': 1,
             'at': 1,
             'baseball': 1,
             'better.': 1,
             'blood': 1,
             'brocolli': 1,
             'brocolli,': 1,
             'brother': 3,
             'but': 2,
             'cause': 1,
             'do': 1,
             'drive': 1,
             'driving': 2,
             'eat': 1,
             'eat.': 1,
             'experts': 1,
             'feel': 1,
             'for': 1,
             'good': 3,
             'health': 1,
             'health.': 1,
             'increased': 1,
             'is': 2,
             'likes': 1,
             'lot': 1,
             'may': 1,
             'mother': 2,
             'mother.': 1,
             'my': 4,
             'never': 1,
             'not': 1,
             'of': 1,
             'often

**7.2 Remove any words that only appear once, or appear in the stopwords.**

Iterate through the documents and only keep useful words/tokens.

In [33]:
texts = [[token for token in text.split() if frequency[token] > 1 and token not in nltk_stops]
         for text in documents]
texts

[['good', 'My', 'brother', 'good'],
 ['My', 'mother', 'driving', 'brother'],
 ['driving'],
 ['mother', 'brother'],
 ['good']]
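
Notice that splitting on whitespace keeps case and punctuation, which is why `'My'` survives the stop word filter and `'eat.'` is counted separately from `'eat'`. Below is a minimal sketch of one way to normalize tokens before counting, using only the standard library. The `tokenize` helper, the regular expression, and the tiny `toy_stops` set are illustrative assumptions and not part of the lab's own pipeline.

```python
import re
from collections import Counter

# Two of the sample documents from step 1
sample_docs = [
    "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother.",
    "Health professionals say that brocolli is good for your health.",
]

# Hypothetical, deliberately tiny stop word set (for illustration only)
toy_stops = {"is", "to", "my", "but", "not", "that", "for", "your"}


def tokenize(text):
    """Lowercase the text and keep only runs of letters, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


# Count every normalized token across the documents
norm_frequency = Counter(tok for doc in sample_docs for tok in tokenize(doc))

# Keep tokens that occur more than once and are not stop words
texts_clean = [
    [tok for tok in tokenize(doc) if norm_frequency[tok] > 1 and tok not in toy_stops]
    for doc in sample_docs
]
print(texts_clean)
```

With normalization, `eat` and `eat.` collapse into a single token, so more of the informative words pass the "appears more than once" filter.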

**7.3 Create the `corpora.Dictionary` object with the retained tokens.**

In [35]:
corp_dict = corpora.Dictionary(texts)
corp_dict

<gensim.corpora.dictionary.Dictionary at 0x11d88e990>

**7.4 Use the `dictionary.doc2bow()` function to convert the texts to bag-of-word representations.**

In [36]:
corpus = [corp_dict.doc2bow(text) for text in texts]
corpus

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 1), (3, 1), (4, 1)],
 [(3, 1)],
 [(2, 1), (4, 1)],
 [(0, 1)]]
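
To make the bag-of-words format concrete, here is a rough pure-Python sketch of the idea behind `doc2bow`: map each token to an integer id, then emit `(id, count)` pairs per document. The `to_bow` helper and the first-seen id ordering are assumptions made for illustration; gensim assigns ids in its own order, so the ids produced here will not necessarily match the output above.

```python
from collections import Counter

# Filtered token lists, copied from step 7.2
sample_texts = [
    ['good', 'My', 'brother', 'good'],
    ['My', 'mother', 'driving', 'brother'],
    ['driving'],
    ['mother', 'brother'],
    ['good'],
]

# Assign each distinct token an integer id in first-seen order (an assumption)
token2id = {}
for tokens in sample_texts:
    for tok in tokens:
        token2id.setdefault(tok, len(token2id))


def to_bow(tokens, mapping):
    """Return sorted (token_id, count) pairs for the tokens found in the mapping."""
    counts = Counter(mapping[tok] for tok in tokens if tok in mapping)
    return sorted(counts.items())


sketch_corpus = [to_bow(tokens, token2id) for tokens in sample_texts]
print(token2id)
print(sketch_corpus)
```

Each document becomes a short list of pairs, and tokens that are missing from the mapping are simply skipped.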

**Why should we use this process?**

The main advantage is that this dictionary object has quick helper functions.

There are also performance advantages if you ever want to save your work to a file and load it at a later time. Tokenization can take a while, especially when your text files are large. You can save the computed dictionary to disk and load it later, which is much faster than recomputing it. It is also possible to add new documents without having to re-tokenize your entire corpus, which is great for online systems that receive new documents on demand.

As you work with larger datasets with text, this is a much better way to handle LDA and other Gensim models from a performance point of view.

### 8. Set up the LDA model

We can create the gensim LDA model object like so:

```python
lda = models.LdaModel(
    # supply our sparse document-term matrix wrapped in a matutils.Sparse2Corpus object
    matutils.Sparse2Corpus(X, documents_columns=False),
    # or alternatively use the corpus created with the gensim dictionary in step 7
    # (in that case id2word must be corp_dict so that the ids match):
    # corpus,
    # the number of topics we want:
    num_topics  =  3,
    # how many passes over the corpus during training:
    passes      =  20,
    # the id2word vocabulary we made ourselves
    id2word     =  vocab
    # or use the gensim dictionary object (together with corpus):
    # id2word     =  corp_dict
)
```

In [37]:
lda = models.LdaModel(
    # supply our sparse document-term matrix wrapped in a matutils.Sparse2Corpus object
    matutils.Sparse2Corpus(X, documents_columns=False),
    # or alternatively use the corpus created with the gensim dictionary in step 7
    # (in that case id2word must be corp_dict so that the ids match):
    # corpus,
    # the number of topics we want:
    num_topics  =  3,
    # how many passes over the corpus during training:
    passes      =  20,
    # the id2word vocabulary we made ourselves
    id2word     =  vocab)
    # or use the gensim dictionary object (together with corpus):
    # id2word     =  corp_dict

### 9. Look at the topics

The model has a `.print_topics` method that accepts the number of topics to print and the number of words per topic. The number before each word is the probability of that word within the topic.

In [38]:
lda.print_topics(num_topics=3, num_words=5)

[(0,
  u'0.122*"good" + 0.122*"brocolli" + 0.085*"eat" + 0.085*"health" + 0.049*"likes"'),
 (1,
  u'0.077*"driving" + 0.077*"health" + 0.077*"pressure" + 0.077*"experts" + 0.077*"suggest"'),
 (2,
  u'0.093*"pressure" + 0.093*"school" + 0.093*"perform" + 0.093*"feel" + 0.093*"better"')]

### 10. Get the topic scores for a document

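The `.get_document_topics` method accepts a bag-of-words representation of a document and returns the probability of each topic for that document. The bag-of-words must use the same word ids the model was trained with; for the model above, those are the `CountVectorizer` column indices, not the ids in `corp_dict`.

In [39]:
doc_bow = list(matutils.Sparse2Corpus(X, documents_columns=False))[0]  # first document, in the model's id space
lda.get_document_topics(doc_bow)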

[(0, 0.66202914149771219), (1, 0.16874577422559342), (2, 0.16922508427669442)]

### 11. Label and visualize the topics

Let's come up with some high-level labels. This is the subjective part of LDA: what do the word probabilities that represent each topic mean? Let's make some up.

Plot a heatmap of the topic probabilities for each of the documents.

In [17]:
# A:
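
One possible approach to the heatmap is sketched below with matplotlib. The `doc_topics` values and the topic labels are made-up placeholders rather than output from the model above; with the gensim model, each row could instead be built from `lda.get_document_topics(bow, minimum_probability=0)` for one document's bag-of-words.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder document-topic probabilities (rows: documents, columns: topics)
doc_topics = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
    [0.60, 0.35, 0.05],
])
topic_labels = ["food", "health and driving", "school"]  # subjective, made-up labels
doc_labels = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(doc_topics, cmap="viridis", vmin=0, vmax=1, aspect="auto")

# Label the columns with topics and the rows with documents
ax.set_xticks(range(len(topic_labels)))
ax.set_xticklabels(topic_labels)
ax.set_yticks(range(len(doc_labels)))
ax.set_yticklabels(doc_labels)

fig.colorbar(im, ax=ax, label="topic probability")
fig.tight_layout()
```

Each row shows how one document's probability mass is spread across the topics, so a document dominated by a single topic appears as one bright cell.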

### 12. Fit an LDA model with sklearn

Sklearn's LDA model is in the decomposition submodule:

```python
from sklearn.decomposition import LatentDirichletAllocation
```

One of the greatest benefits of the sklearn implementation is that it comes with the familiar `.fit()`, `.transform()` and `.fit_transform()` methods.

**12.1 Initialize and fit an sklearn LDA with three topics (`n_components=3`) on our output from the CountVectorizer.**

In [45]:
from sklearn.decomposition import LatentDirichletAllocation

#vect = CountVectorizer(stop_words = custom_stop_words)
#X = vect.fit_transform(df['text'])

# n_components sets the number of topics (it was called n_topics before sklearn 0.19)
sk_lda = LatentDirichletAllocation(n_components=3)
sk_lda.fit_transform(X)



array([[ 0.91330721,  0.04428874,  0.04240405],
       [ 0.04943166,  0.90149105,  0.04907729],
       [ 0.9282845 ,  0.03644799,  0.03526751],
       [ 0.04889211,  0.0483507 ,  0.90275719],
       [ 0.05529972,  0.89605737,  0.04864292]])

**12.2 Print out the topic-word distributions using the `.components_` attribute.**

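Each row of this matrix represents a topic, and the columns are the words. (These are not probabilities: the values are unnormalized weights, so a row must be divided by its sum to give a topic-word distribution.)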

In [46]:
sk_lda.components_

array([[ 0.45376817,  0.44615914,  1.27981472,  2.08053476,  1.25378989,
         0.46195035,  1.26839293,  2.06222441,  1.26640099,  0.43911772,
         2.11259359,  1.25768786,  1.28041945,  1.21865986,  0.45087237,
         0.45387897,  0.45913257,  1.25029409,  0.47266146,  0.45203179,
         0.44959027,  0.47153734,  1.28911913,  1.30417881,  0.48047492],
       [ 1.24499584,  0.44140145,  0.45592915,  1.28355638,  0.44442038,
         0.45345222,  1.27290712,  0.43358889,  0.4973353 ,  0.45115552,
         1.28578162,  2.10633102,  0.47537939,  0.46851729,  1.25096756,
         0.46952503,  1.24799025,  0.4781014 ,  1.31443938,  1.23209942,
         0.47281691,  1.25368818,  0.46795221,  0.48932845,  1.22196078],
       [ 0.47636212,  1.30644177,  0.43111887,  0.44981828,  0.44704825,
         1.24878852,  0.47141648,  0.45502627,  0.48574516,  1.25163179,
         0.47762516,  0.47018697,  0.48572543,  0.47089863,  0.45053364,
         1.26760799,  0.4889668 ,  1.28672186,  0
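
Since the rows of `.components_` are unnormalized weights, dividing each row by its sum gives a proper distribution over words for each topic. The sketch below shows this on a small made-up matrix; `toy_components` and `toy_words` are hypothetical stand-ins for `sk_lda.components_` and the vectorizer's feature names.

```python
import numpy as np

# Hypothetical components matrix: 2 topics (rows) x 4 words (columns)
toy_components = np.array([
    [2.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 3.0, 1.0],
])
toy_words = np.array(['brocolli', 'eat', 'pressure', 'school'])

# Divide each row by its sum so that every topic becomes a distribution over words
topic_word = toy_components / toy_components.sum(axis=1, keepdims=True)

# Column indices of the two highest-probability words per topic, largest first
top_idx = np.argsort(topic_word, axis=1)[:, ::-1][:, :2]

for k, idx in enumerate(top_idx):
    print(k, list(zip(toy_words[idx], topic_word[k, idx].round(2))))
```

The same two steps, row normalization followed by `argsort`, can be applied to the real matrix to list the top words for each topic.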

**12.3 Use the `.transform()` method to convert the matrix into the topic scores.**

These are the document-topic distributions.

In [47]:
sk_lda.transform(X)

array([[ 0.91330721,  0.04428874,  0.04240405],
       [ 0.04943166,  0.90149105,  0.04907729],
       [ 0.9282845 ,  0.03644799,  0.03526751],
       [ 0.04889211,  0.0483507 ,  0.90275719],
       [ 0.05529972,  0.89605737,  0.04864292]])

### 13. Further steps

This has been a very basic example.  LDA typically doesn't perform well on very small datasets.  You should try to see how it behaves on your own using a larger text dataset.  Keep in mind: finding the optimal number of topics can be tricky and subjective.

**Generally, you should consider:**
- How well topics are applied to documents overall
- The strength of topics overall, to all documents
- Improving preprocessing such as stopword removal
- Building a nice web interface to explore your documents (see: [LDAExplorer](https://github.com/dyerrington/LDAExplorer), and [pyLDAvis](https://github.com/bmabey/pyLDAvis/blob/master/README.rst))

These general guidelines should help you tune your hyperparameter **K** for number of topics.
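
As one rough way to compare values of **K**, the sketch below fits sklearn's LDA for a few topic counts and records the perplexity of each fit (lower is better). The candidate values `(2, 3, 4)`, the built-in English stop word list, and `random_state=0` are arbitrary choices for illustration. The perplexity here is computed on the training documents themselves; on a real dataset a held-out set would give a more honest comparison, and the numbers should be read alongside the subjective checks listed above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# The five sample documents from step 1
docs = [
    "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother.",
    "My mother spends a lot of time driving my brother around to baseball practice.",
    "Some health experts suggest that driving may cause increased tension and blood pressure.",
    "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better.",
    "Health professionals say that brocolli is good for your health.",
]
counts = CountVectorizer(stop_words='english').fit_transform(docs)

# Fit one model per candidate number of topics and record its perplexity
perplexity_by_k = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(counts)
    perplexity_by_k[k] = model.perplexity(counts)

for k, value in sorted(perplexity_by_k.items()):
    print(k, round(value, 1))
```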