## Topic Modeling with LDA

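**Install the following packages first:**

```
pip install numpy
pip install lda
pip install textmining
pip install nltk
```

The program also imports `nltk`, whose English stopword list has to be downloaded once (for example with `nltk.download('stopwords')`).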

The program below is adapted from the sample code on these pages:
- https://pypi.python.org/pypi/lda
- https://gist.github.com/cstrelioff/4e84d18fc13b0de8aac4#file-lda_textmine_ex-py
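
Before reading the full program, it may help to see what a document-term matrix looks like: one row per document, one column per vocabulary word, and each entry a word count. The sketch below builds one with only the standard library and NumPy. The helper `build_dtm` and the three toy documents are made up for illustration; they are not part of `textmining`, whose tokenization and cutoff rules may differ.

```python
from collections import Counter

import numpy as np


def build_dtm(docs):
    """Return (vocab, X), where X[i, j] counts how often vocab[j] occurs in docs[i].

    Hypothetical helper for illustration only; it splits on whitespace
    and lowercases, which is simpler than what a real tokenizer does.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = tuple(sorted(set().union(*counts)))      # all distinct words, sorted
    X = np.array([[c[word] for word in vocab] for c in counts])
    return vocab, X


# Toy corpus (invented), standing in for the speech files.
toy_docs = ["peace and security", "security of nations", "peace peace nations"]
vocab, X = build_dtm(toy_docs)

print(vocab)      # column labels
print(X.shape)    # (number of documents, number of vocabulary words)
print(X)          # one row of counts per document
```

The `lda` package expects exactly this kind of input: a matrix of non-negative integer counts, plus a separate vocabulary that maps column indices back to words.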



In [1]:
# Copyright © 2015 Christopher C. Strelioff <chris.strelioff@gmail.com>
# Distributed under terms of the MIT license.
# Modified by Suguru Ishizaki (Nov.10, 2016)

"""
An example of getting titles and vocab for lda using textmine package.
-- adapted from: http://www.christianpeccei.com/textmining/
"""

import os
from nltk.tokenize import RegexpTokenizer  
from nltk.corpus import stopwords
import numpy as np                     # numpy is a package for scientific computing
import textmining                      # textmining is used for creating a document-term matrix.

tokenizer = RegexpTokenizer(r"[\w']+")      # Notice that there is a single quote within the square brackets.
stopset = set(stopwords.words('english'))   # a set gives fast membership tests

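def clean_up_text(text):
    """Remove punctuation and stopwords from text; return the remaining words joined by spaces."""
    lst = tokenizer.tokenize(text)                                 # make a list of words w/o punctuation
    # Note: the comparison is case-sensitive, so capitalized stopwords such as "The" or "THE" are kept.
    res = ' '.join([word for word in lst if word not in stopset])  # remove stopwords
    return res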

# Create a list of documents (strings), and a list of titles (= file names)
titles = []
docs = []
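for file in sorted(os.listdir("data")):             # sorted, because os.listdir returns names in arbitrary order
    print(file)
    with open(os.path.join("data", file)) as fin:   # the with-block closes the file automatically
        s = fin.read()
    docs.append(clean_up_text(s))
    titles.append(file)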

# Just for debugging. We print the first 100 characters of each file.
print("\n**These are the 'documents', making up our 'corpus':")
for n, title in enumerate(titles):    
    doc = docs[n]                     # n-th document (string) in the list.                   
    print("document {}: {}".format(n+1, title))
    print("{}".format(doc[0:100]))

# make a tuple that includes all the titles (file names)
titles = tuple(titles)

# Initialize class to create term-document matrix
tdm = textmining.TermDocumentMatrix()

# Add the documents (string) to the term-document matrix
for doc in docs:
    tdm.add_doc(doc)


temp = list(tdm.rows(cutoff=1))    # create a temp variable with doc-term info
vocab = tuple(temp[0])             # get the vocab from first row
X = np.array(temp[1:])             # get document-term matrix from remaining rows

## print out info, as in blog post with a little extra info
## post: http://bit.ly/1bxob2E

print("\n** Output produced by the textmining package...")

# document-term matrix
print("* The 'document-term' matrix")
print("type(X): {}".format(type(X)))
print("shape: {}".format(X.shape))
print("X:", X, sep="\n" )

# the vocab
print("\n* The 'vocabulary':")
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}".format(len(vocab)))
print("vocab:", vocab, sep="\n")

# titles for each story
print("\n* Again, the 'titles' for this 'corpus':")
print("type(titles): {}".format(type(titles)))
print("len(titles): {}".format(len(titles)))
print("titles:", titles, sep="\n", end="\n\n")

Obama_UN_Speech_2009.09.23.txt
Obama_UN_Speech_2010.09.23.txt
Obama_UN_Speech_2011.09.21.txt
Obama_UN_Speech_2012.09.25.txt
Obama_UN_Speech_2013.09.24.txt
Obama_UN_Speech_2014.09.24.txt
Obama_UN_Speech_2015.09.28.txt
Obama_UN_Speech_2016.09.20.txt

**These are the 'documents', making up our 'corpus':
document 1: Obama_UN_Speech_2009.09.23.txt
Remarks President United Nations General Assembly REMARKS BY THE PRESIDENT TO THE UNITED NATIONS GEN
document 2: Obama_UN_Speech_2010.09.23.txt
Remarks President United Nations General Assembly New York New York 10 01 A M EDT THE PRESIDENT Mr P
document 3: Obama_UN_Speech_2011.09.21.txt
Remarks President Obama Address United Nations General Assembly United Nations New York New York 10 
document 4: Obama_UN_Speech_2012.09.25.txt
Remarks President UN General Assembly United Nations Headquarters New York New York 10 22 A M EDT TH
document 5: Obama_UN_Speech_2013.09.24.txt
Remarks President Obama Address United Nations General Assembly United Nations 
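
The next cell ranks the words of each topic with `np.argsort` and a reversed slice, and picks each document's dominant topic with `argmax`. As a small illustration of those two idioms, here is a sketch on toy arrays. The vocabulary and all probabilities are invented; in the real program they come from `model.topic_word_` and `model.doc_topic_`.

```python
import numpy as np

# Made-up vocabulary and topic-word distributions (one row per topic).
vocab = ("peace", "war", "nations", "rights", "iran")
topic_word = np.array([
    [0.40, 0.05, 0.30, 0.20, 0.05],
    [0.05, 0.45, 0.10, 0.05, 0.35],
])
num_top_words = 3

top_words = []
for topic_dist in topic_word:
    order = np.argsort(topic_dist)           # indices from least to most probable
    top = order[:-(num_top_words + 1):-1]    # walk backwards: the n most probable first
    top_words.append([vocab[j] for j in top])

# Made-up document-topic distributions (one row per document).
doc_topic = np.array([
    [0.8, 0.2],
    [0.3, 0.7],
])
top_topic = doc_topic.argmax(axis=1)         # index of the largest entry in each row

for i, words in enumerate(top_words):
    print("Topic {}: {}".format(i, " ".join(words)))
print("Top topic per document:", top_topic.tolist())
```

The slice `[:-(n+1):-1]` reads the sorted indices from the end and stops after `n` items, so the result is already ordered from most to least probable.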

In [2]:
# Source: https://pypi.python.org/pypi/lda
#
# ---------------------------------------------------------
#
# !!! Make sure to execute the code cell above first. !!!
#
# ---------------------------------------------------------

import numpy as np
import lda
import lda.datasets

#
# This function runs topic modeling on the document-term matrix created above.
#
def run_topic_modeling(X, vocab, titles, num_top_words, num_topics):
    model = lda.LDA(n_topics=num_topics, 
                    n_iter=1500, 
                    random_state=1)
    print("Topic modeling... It may take a minute or more ...\n")
    model.fit(X)                                                 # model.fit_transform(X) is also available
    topic_word = model.topic_word_                               # model.components_ also works
    
    print("Topics found")
    for i, topic_dist in enumerate(topic_word):
        # argsort is ascending, so the reversed slice takes the num_top_words most probable words
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(num_top_words+1):-1]
        print('Topic {}: {}'.format(i, ' '.join(topic_words)))

    print("\nTop topic per document:")
    doc_topic = model.doc_topic_
    for i in range(len(titles)):                                 # use the titles argument, not the global docs
        print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))    
        
run_topic_modeling(X, vocab, titles, 8, 10)


Topic modeling... It may take a minute or more ...

Topics found
Topic 0: must the peace united time security rights new
Topic 1: world we people nations us i together also
Topic 2: that work cannot see people power s end
Topic 3: and human it children war make around so
Topic 4: violence america muslim seen war arab killed transition
Topic 5: i believe but like international countries democracy many
Topic 6: come this pursue responsibility effort address global years
Topic 7: states united iran weapons syria could peaceful iraq
Topic 8: better order young look new reject communities strong
Topic 9: america within problems president lives middle conflict civil

Top topic per document:
Obama_UN_Speech_2009.09.23.txt (top topic: 1)
Obama_UN_Speech_2010.09.23.txt (top topic: 0)
Obama_UN_Speech_2011.09.21.txt (top topic: 0)
Obama_UN_Speech_2012.09.25.txt (top topic: 1)
Obama_UN_Speech_2013.09.24.txt (top topic: 7)
Obama_UN_Speech_2014.09.24.txt (top topic: 1)
Obama_UN_Speech_2015.09.28.tx