For now, I'll just open up a text-heavy subreddit such as r/todayilearned.

In [1]:
import pickle

In [2]:
with open('docs/todayilearned[10-3].p', 'rb') as f:
    _, docs, trans = pickle.load(f)

In [3]:
docs[2]

'>I saw one boy daydreaming and just swinging his broom in the air. \n\n>Me (In Japanese): "What the hell are you doing?" \n\n>**"Uh, you know... air pollution."**\n\nGive that kid a medal.'

This document is heavy with Markdown markup. After some looking around, the easiest way to strip it appears to be converting the Markdown to HTML with `markdown` and then stripping the HTML tags with `BeautifulSoup`.

In [4]:
from bs4 import BeautifulSoup
from markdown import markdown

In [5]:
html = markdown(docs[2])
text = BeautifulSoup(html, 'html.parser').get_text()  # concatenates every text node
text

'\nI saw one boy daydreaming and just swinging his broom in the air. \nMe (In Japanese): "What the hell are you doing?" \n"Uh, you know... air pollution."\n\nGive that kid a medal.'
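
To make the "strip the HTML" half of this step concrete, here is a minimal sketch that uses only the standard library's `html.parser` instead of `BeautifulSoup`. The `TextCollector` class, the `strip_html` helper, and the hand-written `sample_html` fragment are hypothetical names introduced for illustration, and the fragment only approximates what a Markdown converter would emit. The idea is the same: walk the HTML, keep the text nodes, and join them.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def strip_html(html):
    """Return the concatenated text content of an HTML string."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return ''.join(collector.chunks)


# Hand-written fragment, assumed to resemble converted Markdown.
sample_html = (
    '<blockquote>\n'
    '<p><strong>"Uh, you know... air pollution."</strong></p>\n'
    '</blockquote>\n'
    '<p>Give that kid a medal.</p>'
)
plain = strip_html(sample_html)
print(repr(plain))
```

The newlines between tags survive as text nodes, which is why the stripped output above also starts with a stray `\n`.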

OK, now we need to tokenize the text into words. I prefer regex tokenization, since contractions and punctuation make things hairy. We'll lowercase everything, and the regex will also keep subreddit names (such as /r/todayilearned) and hyphenated words together as single tokens.

In [6]:
from nltk.tokenize import RegexpTokenizer

In [7]:
# Subreddit names first, then words; no capturing group, so each token is the full match.
tokenizer = RegexpTokenizer(r"/r/\w+|[\w'-]+")
print(tokenizer.tokenize(text.lower()))

['i', 'saw', 'one', 'boy', 'daydreaming', 'and', 'just', 'swinging', 'his', 'broom', 'in', 'the', 'air', 'me', 'in', 'japanese', 'what', 'the', 'hell', 'are', 'you', 'doing', 'uh', 'you', 'know', 'air', 'pollution', 'give', 'that', 'kid', 'a', 'medal']
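
A note on how a pattern of this shape behaves around subreddit names: Python's `re.findall` returns the contents of a capturing group rather than the whole match, so wrapping only the word alternative in parentheses yields an empty string for every `/r/...` match. The sketch below uses plain `re` (assuming the tokenizer applies its pattern the way `re.findall` does) to compare a grouped and an ungrouped pattern. The `sample` sentence is made up for illustration and is not from the dataset.

```python
import re

# Made-up sample sentence containing a subreddit name.
sample = "til from /r/todayilearned that it's well-known"

# Grouped variant: findall returns only what the parentheses captured,
# so a match of the /r/ alternative comes back as an empty string.
grouped = re.findall(r"([\w'-]+)|/r/\w+", sample)

# Ungrouped variant: findall returns each full match.
ungrouped = re.findall(r"/r/\w+|[\w'-]+", sample)

print(grouped)
print(ungrouped)
```

The sample document above contains no subreddit names, so both variants would produce the same token list for it.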


Great, now let's encapsulate this into a nice little function. We'll feed it a list of raw untokenized documents.

In [8]:
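def docs_tokenize(docs):
    """Yield a list of lowercase tokens for each raw Markdown document."""
    # Subreddit names first, then words; no capturing group, so each token is the full match.
    tokenizer = RegexpTokenizer(r"/r/\w+|[\w'-]+")
    for d in docs:
        html = markdown(d)  # Markdown -> HTML
        text = BeautifulSoup(html, 'html.parser').get_text()  # HTML -> plain text
        yield tokenizer.tokenize(text.lower())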

Now, since that was so quick, let's start on the LDA.

In [9]:
from lda_gibbs import LDA

In [10]:
token_docs = list(docs_tokenize(docs))

In [11]:
model = LDA(token_docs)
theta, beta = model.train(ntopics=20, niter=200)

In [12]:
tsr = model.topic_significances()
# Pair each topic's significance score with its ten most representative words.
named_tsr = [(s, model.topic_representatives(i, topn=10, show_scores=False)) for i, s in enumerate(tsr)]
for score, words in sorted(named_tsr, reverse=True):
    print(score, words)

1.0 ('http', 'com', 'www', 'news', 'watch', '10', 'he', 'https', 'uk', 'source')
0.362230566279 ('be', 'you', 'would', 'why', 'it', 'if', 'do', 'is', "that's", 'not')
0.279827290821 ('i', 'you', 'that', 'a', 'know', 'about', 'me', 'am', 'the', 'for')
0.264362053181 ('money', 'we', 'for', 'are', 'have', 'on', 'all', 'back', 'still', 'he')
0.261539971002 ('blood', 'his', 'he', 'donate', 'have', 'this', 'antigen', 'maybe', 'guy', 'from')
0.212612546494 ('in', '2', '1', 'about', '000', 'of', 'years', 'one', 'the', 'world')
0.208819559615 ('beard', 'you', 'grow', 'i', 'get', 'have', 'it', 'to', 'would', 'with')
0.204588527388 ('they', 'i', 'know', 'what', 'he', 'that', 'think', 'even', 'a', 'could')
0.200204152086 ('the', 'my', 'in', 'i', 'a', 'one', 'never', 'no', 'good', 'this')
0.182960059965 ('like', 'i', 'me', 'a', 'you', 'mean', 'but', "it's", 'this', 'go')
0.166761772088 ('was', 'in', 'had', 'i', 'my', 'and', 'a', 'were', 'one', 'school')
0.163292645184 ('not', 'so', 'that', 'is', 'w