## Unit #11 Topic Modeling

* Overview
* LDA
* Example Code

<font color=blue>---------------------------------------------------------------</font>

### 11.1 Overview

Topic Modeling is a machine learning technique for determining what topics are in a document or set of documents.  It is an _unsupervised_ technique in that it doesn't require pre-labeled examples to _learn_ how to identify topics.  Instead, it applies statistical methods to find groups of words that tend to occur together.


In this unit, we will look into using a _Latent Dirichlet Allocation_ (LDA) algorithm to search for topics.



### 11.2 LDA

LDA assumes that each document is a mixture of topics and that each topic is a mixture of words.  Words related to a topic will be mixed in with other words.  LDA tries to work backwards, to determine the topics from the given set of words.  It does this by estimating which words tend to occur together across the documents.
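
To make "working backwards" more concrete, here is a minimal sketch of the forward direction that LDA assumes: a document picks a mixture of topics, and each word is drawn from one of those topics.  The topic names, words, and probabilities below are made up for illustration, and the helper functions are hypothetical; they are not part of `gensim`.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, alpha, n_words, rng):
    """Generate one toy document following LDA's assumed generative story."""
    # 1. Pick this document's mixture of topics.
    topic_mix = sample_dirichlet(alpha, rng)
    topic_names = list(topic_word_probs)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(topic_names, weights=topic_mix, k=1)[0]
        # 3. Pick a word according to that topic's word probabilities.
        vocab = list(topic_word_probs[topic])
        weights = list(topic_word_probs[topic].values())
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return topic_mix, words

# Hypothetical toy topics (assumed values, for illustration only)
toy_topics = {
    "food": {"brocolli": 0.5, "eat": 0.3, "good": 0.2},
    "family": {"mother": 0.5, "brother": 0.4, "good": 0.1},
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
topic_mix, words = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(topic_mix)  # proportion of each topic in this document
print(words)      # the generated "document"
```

LDA only ever sees the generated words; its job is to estimate the topic mixtures and the word probabilities that could plausibly have produced them.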

To use LDA, we will 
- Read in one or more documents as text;
- Pre-process the text (e.g., convert to lower case, remove stop words);
- Create a dictionary of unique words and a **document-term matrix** -- I'll explain in a minute;
- Feed into the LDA algorithm 
    - the document-term matrix;
    - the dictionary;
    - the number of topics we want it to find;
    - the number of iterations we want the algorithm to perform to fine-tune the results
- Display the resulting words for the topics
    
One of the preprocessing steps that we have not seen before is the conversion of our bag of words to a **document-term matrix**. Basically, this is a list of counts for each unique word in each document.  But, instead of keeping the word with its count, we will create a pair of numbers where the first number is the position of the word in the dictionary and the second number is the count for that word.  For example, `(5, 12)` would mean that the word in position 5 (i.e., sixth word) of the dictionary occurs 12 times in that document.  Words that do not occur in a document are simply left out of its list.  Note:  This dictionary is a `gensim` object; it is not the same as the dictionary data type that we learned about.  
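
The idea can be sketched in plain Python.  This is only an illustration of the (position, count) pairs; the function names are hypothetical, and `gensim` may assign positions to words in a different order than this sketch does.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign each unique token an integer id, in order of first appearance."""
    token_to_id = {}
    for tokens in texts:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def to_bag_of_words(tokens, token_to_id):
    """Convert one tokenized document into sorted (word id, count) pairs."""
    counts = Counter(token_to_id[t] for t in tokens)
    return sorted(counts.items())

# Two tiny, already pre-processed documents (made up for illustration)
toy_texts = [
    ["brocolli", "good", "eat", "good", "brocolli"],
    ["mother", "eat", "brocolli"],
]

token_to_id = build_dictionary(toy_texts)
bows = [to_bag_of_words(tokens, token_to_id) for tokens in toy_texts]
print(token_to_id)
print(bows)
```

Here `brocolli` gets position 0, `good` 1, `eat` 2, and `mother` 3, so the first document becomes `[(0, 2), (1, 2), (2, 1)]` and the second becomes `[(0, 1), (2, 1), (3, 1)]`.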

The other thing that you should note is that LDA will always return exactly as many topics as you ask for, whether or not the text really contains that many.  If you over-estimate the number of topics, you may see significant overlap in the topic words.  Of course, the right number will depend on how large the body of work is.  Larger bodies of text tend to have more subtopics. 

Finally, because LDA returns a list of words for each of the topics, it will be up to us to interpret what the topic actually is.

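Now, we are ready to look at an example.  Before you attempt to run the example yourself, you will need to install `gensim`.  You can do this by typing `!pip install gensim` in the IPython console.  It may take several minutes to install.  The example also uses NLTK's English stop word list; if it is not already on your machine, you may need to run `nltk.download('stopwords')` once.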

<font color=blue>---------------------------------------------------------------</font>

### 11.3 Example Code

In [150]:
# This example is borrowed from
# http://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

# compile sample documents into a list
doc_list = [doc_a, doc_b, doc_c, doc_d, doc_e]

In [151]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Split text into runs of word characters, which drops punctuation
tokenizer = RegexpTokenizer(r'\w+')
# NLTK's list of English stop words
stop_words = stopwords.words("english")
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [152]:
texts = []
for doc in doc_list:
    # convert to lower case and split into word tokens
    doc_lower = doc.lower()
    tokens = tokenizer.tokenize(doc_lower)
    # remove stop words from tokens
    stopped_tokens = [w for w in tokens if w not in stop_words]
    # stem each token (e.g., "pressure" becomes "pressur")
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)

In [153]:
from gensim import corpora, models

dictionary = corpora.Dictionary(texts)
print(dictionary)
# The dictionary maps each unique word across all of the texts to an integer position.


Dictionary(32 unique tokens: ['brocolli', 'brother', 'eat', 'good', 'like']...)


In [154]:
# Convert each text to a list of (word position, count) pairs based on the dictionary
doc_term_matrix = [dictionary.doc2bow(text) for text in texts]
print(doc_term_matrix)

[[(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)], [(1, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(8, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1)], [(1, 1), (5, 1), (8, 1), (19, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)], [(0, 1), (3, 1), (16, 2), (30, 1), (31, 1)]]


In [155]:
import gensim
# Generate an LDA model
# More passes over the documents give the model more chances to settle on
# stable topics, at the cost of longer training time
ldamodel = gensim.models.ldamodel.LdaModel(doc_term_matrix, # the document-term matrix
                                           num_topics=5,    # how many topics we want it to find
                                           id2word = dictionary, # maps word positions back to words
                                           passes=20)  # number of passes over the documents

In [157]:
topics = ldamodel.print_topics(num_topics=5, num_words=7)
for item in topics:
    print(item)

(0, '0.031*"mother" + 0.031*"brother" + 0.031*"drive" + 0.031*"say" + 0.031*"profession" + 0.031*"pressur" + 0.031*"health"')
(1, '0.078*"mother" + 0.078*"brother" + 0.078*"drive" + 0.078*"spend" + 0.078*"practic" + 0.078*"lot" + 0.078*"basebal"')
(2, '0.102*"good" + 0.102*"brocolli" + 0.102*"health" + 0.070*"eat" + 0.038*"tension" + 0.038*"suggest" + 0.038*"caus"')
(3, '0.065*"mother" + 0.065*"brother" + 0.065*"pressur" + 0.065*"drive" + 0.065*"perform" + 0.065*"well" + 0.065*"better"')
(4, '0.031*"brother" + 0.031*"mother" + 0.031*"drive" + 0.031*"profession" + 0.031*"say" + 0.031*"pressur" + 0.031*"health"')
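
The output above shows the overlap we discussed earlier: topics 0 and 4 list the same seven words.  As a rough check, the sketch below parses three of the printed topic strings and measures how many words each pair shares.  The Jaccard measure (shared words divided by all words) is just one simple choice made for this illustration, the regular expression assumes the `weight*"word"` format shown above, and the helper functions are hypothetical rather than part of `gensim`.

```python
import re
from itertools import combinations

# Topic strings copied from the printed output for topics 0, 2, and 4
topic_strings = {
    0: '0.031*"mother" + 0.031*"brother" + 0.031*"drive" + 0.031*"say" + 0.031*"profession" + 0.031*"pressur" + 0.031*"health"',
    2: '0.102*"good" + 0.102*"brocolli" + 0.102*"health" + 0.070*"eat" + 0.038*"tension" + 0.038*"suggest" + 0.038*"caus"',
    4: '0.031*"brother" + 0.031*"mother" + 0.031*"drive" + 0.031*"profession" + 0.031*"say" + 0.031*"pressur" + 0.031*"health"',
}

def parse_topic(topic_string):
    """Turn '0.031*"mother" + ...' into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

def jaccard(words_a, words_b):
    """Fraction of words shared by two topics: 0.0 (none) to 1.0 (identical)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Keep only the words for each topic
topic_words = {tid: [w for w, _ in parse_topic(s)] for tid, s in topic_strings.items()}

for a, b in combinations(sorted(topic_words), 2):
    print(a, b, round(jaccard(topic_words[a], topic_words[b]), 2))
```

For these strings, topics 0 and 4 score 1.0 (identical word lists), while topic 2 shares only `health` with each of them.  A high score between two topics is a hint that we asked for more topics than this small set of documents supports.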


## Activity: Topic Modeling on Emma Chapter 1

Try applying LDA to "emma_chapter_one".  Because this is a single document, it will be the only item in the `doc_list`.  For example, 
```
doc_list = [raw_text]
```


<font color=blue>---------------------------------------------------------------</font>