## LDA
### Load data
The data is compressed with bzip2. Each line is a bytes object holding one JSON document, which we parse into a dictionary. Each dictionary contains keys such as `parent_id`, `id`, `author`, and `body`. For now, we only care about the comment text, stored under `body`.

In [1]:
import bz2
import json
# Open the bzip2-compressed file; the context manager closes it afterwards.
with bz2.BZ2File("data/2017-01.bz2") as f_in:
    # Each line is a bytes string holding one JSON object.
    documents = [json.loads(line) for line in f_in]  # a list of dictionaries
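
As a minimal sketch of the loading step described above, the file can also be read lazily, one line at a time, instead of being read into memory in one go. The helper name `iter_comments` is hypothetical and not part of the original notebook.

```python
import bz2
import json


def iter_comments(path):
    """Yield one dict per non-empty line of a bzip2-compressed JSON-lines file."""
    # Text mode ("rt") decodes the bytes, so json.loads receives a str.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Assuming the same file as above, `documents = list(iter_comments("data/2017-01.bz2"))` would produce the same kind of list of dictionaries.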

In [15]:
print(len(documents))

156850


In [2]:
print(documents[0])
corpus = [doc['body'] for doc in documents]

{'parent_id': 't1_dbufkak', 'id': 'dbumnvn', 'edited': False, 'created_utc': 1483228806, 'distinguished': None, 'author_flair_css_class': None, 'author_flair_text': None, 'controversiality': 0, 'subreddit_id': 't5_2wlj3', 'retrieved_on': 1485679713, 'link_id': 't3_5l72cl', 'author': 'twigwam', 'score': 1, 'gilded': 0, 'stickied': False, 'body': 'OMGosh im an idiot this week.  Thank you', 'subreddit': 'CryptoCurrency'}
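
The topic listings further down contain tokens such as `deleted` and `removed`, which suggests that some comments hold only Reddit's placeholder text `[deleted]` or `[removed]`. The notebook keeps those comments; the sketch below shows one possible way to drop them before building the corpus. The names `PLACEHOLDER_BODIES`, `keep_comment`, and `build_corpus` are hypothetical.

```python
# Assumption: these two strings are the only placeholder bodies worth filtering.
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}


def keep_comment(doc):
    """Return True if the comment has real text in its 'body' field."""
    body = (doc.get("body") or "").strip()
    return bool(body) and body not in PLACEHOLDER_BODIES


def build_corpus(documents):
    """Collect the body text of every comment that passes keep_comment."""
    return [doc["body"] for doc in documents if keep_comment(doc)]
```

With this filter in place, `corpus = build_corpus(documents)` would replace the list comprehension used above.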


### Feature extraction
We use CountVectorizer because LDA works on raw term counts.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
cvt = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
vecs = cvt.fit_transform(corpus)
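# get_feature_names() was removed from recent scikit-learn releases; prefer the newer method when it exists.
vecs_feature_names = cvt.get_feature_names_out() if hasattr(cvt, "get_feature_names_out") else cvt.get_feature_names()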
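
To make the `min_df=2` and `max_df=0.95` arguments concrete, here is a simplified standard-library sketch of document-frequency filtering followed by raw counting. It is not scikit-learn's implementation: stop-word removal is left out, and the names `tokenize` and `count_matrix` are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


def count_matrix(docs, min_df=2, max_df=0.95):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: number of documents containing the term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    max_count = max_df * len(docs)
    # Keep terms found in at least min_df documents and at most max_df of all documents.
    vocabulary = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term in tokens:
            if term in index:
                row[index[term]] += 1
        rows.append(row)
    return vocabulary, rows
```

A term that occurs in every document carries little information for separating topics, and a term seen in a single document cannot link documents together, which is the motivation for filtering at both ends.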

### Topic model
The number of topics has to be set by hand, which is a drawback of LDA.

In [4]:
from sklearn.decomposition import LatentDirichletAllocation as LDA
from time import time
now = time()
n = 10 # number of topics
# n_components was called n_topics before scikit-learn 0.19
lda = LDA(n_components=n, max_iter=20, learning_method='online', learning_offset=50., random_state=123)
lda = lda.fit(vecs)
print("Training completed in {}s.".format(time() - now))

Training completed in 1056.0191588401794s.
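
Because the number of topics is chosen by hand, one common heuristic is to fit several models and compare their perplexity. The sketch below does this on a tiny made-up corpus; it assumes a scikit-learn version that accepts `n_components`, scores on the training data for brevity (a held-out split would be a fairer comparison), and uses the hypothetical names `perplexity_by_topic_count` and `toy_texts`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def perplexity_by_topic_count(texts, candidates, seed=123):
    """Fit one LDA model per candidate topic count and return {n: perplexity}."""
    counts = CountVectorizer().fit_transform(texts)
    scores = {}
    for n in candidates:
        model = LatentDirichletAllocation(
            n_components=n, learning_method="online", max_iter=10, random_state=seed
        )
        model.fit(counts)
        scores[n] = model.perplexity(counts)  # lower is better
    return scores


# Made-up sentences, only here to keep the example self-contained.
toy_texts = [
    "miners validate blocks and collect transaction fees",
    "segwit changes how block size is counted",
    "send coins from your wallet with a private key",
    "keep the wallet private key offline",
    "the price dropped after exchange trading volume fell",
    "trading volume on the exchange moved the price",
]
scores = perplexity_by_topic_count(toy_texts, candidates=[2, 3])
for n, value in sorted(scores.items()):
    print("n = {}: perplexity {:.1f}".format(n, value))
```

On the full corpus each fit takes many minutes, as the timing above shows, so in practice the candidate list would stay short.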


### Print topics

In [5]:
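def top_k_words(model, names, k):
    """Print the k highest-weighted words of every topic in a fitted model."""
    for idx, features in enumerate(model.components_):
        # Negating the weights makes argsort return indices in descending order.
        top = (-features).argsort()[:k]
        print("Topic {}: ".format(idx) + " ".join(names[int(i)] for i in top))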

In [6]:
top_k_words(lda, vecs_feature_names, 15)

Topic 0: 10 amp 000 wait hours fuck message minutes 20 futures left double order kraken banned
Topic 1: bitcoin use gt node work used blockchain like need address problem does people using way
Topic 2: deleted coinbase account bank satoshi cash today lol card job accounts video deposit fud customer
Topic 3: change reason gt away sure clear argument know words work economic point team project rules
Topic 4: bitcoin just people like don think money time know good buy new want really going
Topic 5: com https reddit www http bitcoin comments removed np btc info post gt 2017 github
Topic 6: segwit block gt miners bitcoin blocks core transaction fees fork hard transactions bu chain size
Topic 7: wallet exchange use chinese volume usd old key yes private using just keys send wallets
Topic 8: btc long price trading small short market days term coins try exchanges stop link ok
Topic 9: monero thanks currency bitcoin gold community org question nice paper thank million value crypto watch
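
The listing above shows only the words. Each row of `components_` can also be normalized into a probability distribution over the vocabulary, which makes it possible to see how strongly a word is tied to a topic. The following is a minimal sketch that accepts plain lists as well as NumPy arrays; the name `top_k_with_weights` is hypothetical.

```python
def top_k_with_weights(components, names, k):
    """Return, for each topic, its k heaviest (word, probability) pairs."""
    topics = []
    for row in components:
        total = float(sum(row))
        # Indices ordered by descending weight, like (-row).argsort() in NumPy.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        topics.append([(names[j], row[j] / total) for j in order])
    return topics
```

Assuming the fitted model from above, `top_k_with_weights(lda.components_, vecs_feature_names, 15)` would return one list of pairs per topic.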


### Train the model with 20 topics

In [7]:
now = time()
n = 20
# n_components was called n_topics before scikit-learn 0.19
lda = LDA(n_components=n, max_iter=20, learning_method='online', learning_offset=50., random_state=123)
lda = lda.fit(vecs)
print("Training completed in {}s.".format(time() - now))

Training completed in 1815.3930730819702s.


In [8]:
top_k_words(lda, vecs_feature_names, 15)

Topic 0: mining power lol hours comment pool vote difficulty banned job pools probably hash minutes hour
Topic 1: people good time like new just bitcoin right did big really gt think thing days
Topic 2: wait futures pboc took cny premium yuan picture huobi managed spike till mtgox quarterly gotta
Topic 3: stop double greg bullshit gonna dumb referring raise kept adam maxwell explained surely thoughts satoshis
Topic 4: possible today edit nice looks argument man channel google page far test unfortunately charts relevant
Topic 5: coins probably come satoshi talking working good stuff problems wants privacy future available talk lot
Topic 6: bitcoin segwit miners people network don gt fork value blockchain want just hard doesn way
Topic 7: transaction core fee wallet transactions use need usd old data using wallets send code software
Topic 8: com https reddit www http comments bitcoin amp np org message link en 2017 imgur
Topic 9: removed exchanges chinese volume exchange trading interest

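### Top 20 words per topic

In [14]:
def begin_k_words(model, names, k):
    """Print the first k words of every topic after sorting by descending weight."""
    for idx, features in enumerate(model.components_):
        # argsort on the negated weights orders the words, so this matches top_k_words.
        print("Topic {}: ".format(idx) + " ".join(names[int(i)] for i in (-features).argsort()[:k]))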

begin_k_words(lda, vecs_feature_names, 20)

Topic 0: mining power lol hours comment pool vote difficulty banned job pools probably hash minutes hour points left hashrate miner ban
Topic 1: people good time like new just bitcoin right did big really gt think thing days make open trying things understand
Topic 2: wait futures pboc took cny premium yuan picture huobi managed spike till mtgox quarterly gotta fighting ach scared lending quarterlies
Topic 3: stop double greg bullshit gonna dumb referring raise kept adam maxwell explained surely thoughts satoshis lying forgot windows excellent bob
Topic 4: possible today edit nice looks argument man channel google page far test unfortunately charts relevant broken awesome house hey luke
Topic 5: coins probably come satoshi talking working good stuff problems wants privacy future available talk lot digital waiting sold early currently
Topic 6: bitcoin segwit miners people network don gt fork value blockchain want just hard doesn way nodes think need make users
Topic 7: transaction core 

## HDP
