## Topic Modeling with LDA

**First, install the following packages:**

```
pip install numpy
pip install lda
pip install textmining
pip install nltk
```

The following program is based on the sample code from these pages: 
- https://pypi.python.org/pypi/lda
- https://gist.github.com/cstrelioff/4e84d18fc13b0de8aac4#file-lda_textmine_ex-py
- http://pydoc.net/Python/textmining/1.0/textmining/


### Notes:
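- clean_up_text(text, min_word) removes stopwords and any word of min_word characters or fewer. The program below calls it with min_word = 2, so only words of 3 or more characters are kept.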
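
The filtering rule in the note above can be tried on its own before running the full program. The following is a minimal sketch, not the notebook's code: `demo_clean_up` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny made-up stand-in for NLTK's English list, and `re.findall` stands in for `RegexpTokenizer` with the same pattern.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the notebook uses
# NLTK's full English list (stopwords.words('english')).
DEMO_STOPWORDS = {"the", "and", "for", "with", "that"}

def demo_clean_up(text, min_word, stopwords=DEMO_STOPWORDS):
    """Keep tokens that are not stopwords and are longer than min_word."""
    # Same token pattern as the notebook: word characters and apostrophes.
    tokens = re.findall(r"[\w']+", text.lower())
    return " ".join(t for t in tokens if t not in stopwords and len(t) > min_word)

cleaned = demo_clean_up("We stand for peace, and the UN is with us.", 2)
print(cleaned)
```

With `min_word = 2`, two-letter words such as "we" and "un" are dropped even though they are not in the stopword set, which is why the length filter matters in addition to the stopword list.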

In [30]:
# Copyright © 2015 Christopher C. Strelioff <chris.strelioff@gmail.com>
# Distributed under terms of the MIT license.
# Modified by Suguru Ishizaki (Nov.10, 2016)

"""
An example of getting titles and vocab for lda using textmine package.
-- adapted from: http://www.christianpeccei.com/textmining/
"""

import os
from nltk.tokenize import RegexpTokenizer  
from nltk.corpus import stopwords
import numpy as np                     # numpy is a package for scientific computing
import textmining                      # textmining is used for creating a document-term matrix.

tokenizer = RegexpTokenizer(r"[\w']+")    # Notice the single quote within the square brackets; it keeps words like "that's" together.
stopset = stopwords.words('english')    # a list of English stopwords (requires the NLTK 'stopwords' corpus)
print(stopset)

def remove_curly_quotes(text):
    'this function replaces curly quotes with ascii quotes'
    return text.replace(u"\u2018", "'").replace(u"\u2019", "'").replace(u"\u201c", '"').replace(u"\u201d", '"')

def clean_up_text(text, min_word):
    'this function removes punctuation, stopwords, and words of min_word characters or fewer.'
    lst = tokenizer.tokenize(remove_curly_quotes(text))          # make a list of words w/o punctuation
    res = ' '.join([word for word in lst if word not in stopset and len(word) > min_word])  # drop stopwords and short words
    return res

#
# Create a list of documents (strings), and a list of titles (= file names)
#
titles = []
docs = []
for file in os.listdir("data"):
    with open(os.path.join("data", file)) as fin:   # the file is closed automatically
        s = fin.read().lower()
    docs.append(clean_up_text(s, 2))   # clean_up_text removes punctuation, stopwords, and short words (< 3 characters)
    titles.append(file)

#
# Just for debugging, we print the first 100 characters of each cleaned document.
#
print("\n**These are the 'documents', making up our 'corpus':")
for n, title in enumerate(titles):    
    doc = docs[n]                      # n-th document (string) in the list.                   
    print("document {}: {}".format(n+1, title))
    print("{}".format(doc[0:100]))
 
# Initialize class to create term-document matrix
tdm = textmining.TermDocumentMatrix(tokenizer=tokenizer.tokenize)    # use the RegexpTokenizer

# Add the documents (string) to the term-document matrix
for doc in docs:
    tdm.add_doc(doc)

temp = list(tdm.rows(cutoff=1))    # create a temporary variable with doc-term info
vocab = tuple(temp[0])             # get the vocab from the first row
X = np.array(temp[1:])             # get document-term matrix from remaining rows

titles = tuple(titles)             # make a tuple that includes all the titles (file names)

## Print out the results.
## A blog post with a little extra info: http://bit.ly/1bxob2E

print("\n** Output produced by the textmining package...")

# document-term matrix
print("*** The 'document-term' matrix")
print("type(X): {}".format(type(X)))
print("shape: {}".format(X.shape))
print("X:", X, sep="\n" )

# the vocab
print("\n*** The 'vocabulary':")
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}".format(len(vocab)))
print("vocab:", vocab, sep="\n")

# titles for each story
print("\n*** The 'titles' for this 'corpus':")
print("type(titles): {}".format(type(titles)))
print("len(titles): {}".format(len(titles)))
print("titles:", titles, sep="\n", end="\n\n")

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no
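
The next cell picks the top words for each topic with `np.argsort` followed by a reversed slice, which can be hard to read at first glance. Here is a minimal sketch of that idiom on made-up numbers; `demo_vocab` and `demo_topic_dist` are hypothetical values for illustration, not output from `lda`.

```python
import numpy as np

# Hypothetical vocabulary and topic-word probabilities, made up for illustration.
demo_vocab = ("peace", "war", "world", "people", "rights")
demo_topic_dist = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
num_top_words = 3

ascending = np.argsort(demo_topic_dist)           # indices from least to most probable
top_idx = ascending[:-(num_top_words + 1):-1]     # walk backwards from the end, num_top_words steps
top_words = np.array(demo_vocab)[top_idx]         # look up the words at those indices
print(" ".join(top_words))
```

The slice `[:-(n + 1):-1]` starts at the last element and steps backwards, so it returns the `n` highest-probability words in descending order.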

In [31]:
# Source: https://pypi.python.org/pypi/lda
#
# ---------------------------------------------------------
#
# !!! Make sure to execute the code cell above first. !!!
#
# ---------------------------------------------------------

import numpy as np
import lda
import lda.datasets

#
# This function models the topics for the document term matrix created above.
#
def run_topic_modeling(X, vocab, titles, num_top_words, num_topics):
    model = lda.LDA(n_topics=num_topics, 
                    n_iter=1500, 
                    random_state=1)
    print("Topic modeling... It may take a minute or more ...\n")
    model.fit(X)                                                 # model.fit_transform(X) is also available
    topic_word = model.topic_word_                               # model.components_ also works
    
    print("Topics found")
    for i, topic_dist in enumerate(topic_word):
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(num_top_words+1):-1]
        print('Topic {}: {}'.format(i, ' '.join(topic_words)))

    print("\nTop topic per document:")
    doc_topic = model.doc_topic_
    for i in range(len(titles)):                                 # one row of doc_topic per document
        print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
        
run_topic_modeling(X, vocab, titles, 8, 10)   # Get 10 topics, with 8 top words per topic.


Topic modeling... It may take a minute or more ...

Topics found
Topic 0: nations work states also human cannot democracy many
Topic 1: united international different look conflict russia that's borders
Topic 2: world war we've children young replace iraq across
Topic 3: america time rights one president future every united
Topic 4: people new together power support opportunity path right
Topic 5: world countries progress end global it's make around
Topic 6: united believe weapons states could peaceful region syria
Topic 7: believe better history like order see need democratic
Topic 8: must that's violence faith understand killed vision chris
Topic 9: peace must come security stand year palestinians challenges

Top topic per document:
Obama_UN_Speech_2009.09.23.txt (top topic: 9)
Obama_UN_Speech_2010.09.23.txt (top topic: 3)
Obama_UN_Speech_2011.09.21.txt (top topic: 3)
Obama_UN_Speech_2012.09.25.txt (top topic: 8)
Obama_UN_Speech_2013.09.24.txt (top topic: 6)
Obama_UN_Speech_2014.09.2