## Initial idea about topic distribution
Topic distribution is a way to evaluate the topic communities we obtain. The communities are computed with the Girvan-Newman algorithm on a graph whose nodes are words and whose edges are similarities. Each community is a topic, consisting of a list of words. Topic distribution can be measured with a topic-level TF-IDF, which is a mathematical combination of the TF-IDF values of the topic's words.

tf-idf = tf(T, d) * idf(T, D)

where T is a topic, d is a single document (thread), and D is the whole corpus.

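Right now, I have three ways to calculate topic TF-IDF based on this idea:

1. tf(T, d) = ∑tf(t, d)/N, idf(T, D) = (∑log(N/n))/N
2. tf(T, d) = qt(∑tf(t, d), N), idf(T, D) = qt(∑log(N/n), N)
3. tf(T, d) = qt(∏tf(t, d), N), idf(T, D) = qt(∏log(N/n), N)

Here t ranges over the words of topic T. Inside log(N/n), N is the number of documents in D and n is the number of documents containing t (the usual idf). The outer N is the number of topic words being aggregated (in the code below, the topic words that occur in the thread), and qt(x, N) stands for the N-th root of x.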

The last one may not be very reasonable, because the products put it on a different scale from the other two.

Although each variant is normalized by N, I'm not sure which one matches common practice, or why.
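
To make the comparison concrete, here is a minimal sketch of the three variants for one topic in one thread. It assumes that qt(x, N) means the N-th root and that N is the number of topic words found in the thread; the function name `topic_scores` and the toy numbers are made up for illustration and are not taken from the corpus.

```python
import math

def topic_scores(tfs, dfs, num_docs):
    """Return the three candidate topic TF-IDF scores for one topic in one thread.

    tfs: term frequencies of the topic words found in the thread
    dfs: document frequencies of the same words
    num_docs: number of documents in the corpus
    """
    k = len(tfs)  # number of topic words being aggregated (the outer N)
    idfs = [math.log(num_docs / df) for df in dfs]
    # 1. arithmetic mean of tf times arithmetic mean of idf
    mean_score = (sum(tfs) / k) * (sum(idfs) / k)
    # 2. k-th root of the sums (assumed reading of qt)
    root_sum_score = sum(tfs) ** (1 / k) * sum(idfs) ** (1 / k)
    # 3. k-th root of the products, i.e. geometric means
    geo_score = math.prod(tfs) ** (1 / k) * math.prod(idfs) ** (1 / k)
    return mean_score, root_sum_score, geo_score

# Toy numbers: two topic words with tf 1 and 4, df 10 and 100, 1000 documents.
scores = topic_scores(tfs=[1, 4], dfs=[10, 100], num_docs=1000)
print(scores)
```

Variant 1 is an arithmetic mean and variant 3 a geometric mean, so variant 3 is never larger than variant 1 for positive inputs; variant 2 has no such standard interpretation, which may be a reason to prefer one of the other two.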


### 1. loading data from file

#### loading topics

In [2]:
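# Each line of community.txt holds one topic (community).
# NOTE: split() only splits on whitespace, so the separating commas
# stay attached to the words (see the output below).
topics = []
with open('./data/community.txt') as commF:
    for line in commF:
        topics.append(line.strip('\n').split())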

In [2]:
topics

[['[deleted],',
  'china,',
  'chinese,',
  'exchanges,',
  'good,',
  'long,',
  'make,',
  'trading,',
  'well'],
 ['amp,', 'com,', 'comments,', 'http,', 'https,', 'reddit,', 'www'],
 ['anything,',
  'everyone,',
  'need,',
  'say,',
  'short,',
  'someone,',
  'something,',
  'sure,',
  'users,',
  'want'],
 ['back,',
  'bitcoins,',
  'btc,',
  'buy,',
  'coinbase,',
  'coins,',
  'currency,',
  'day,',
  'exchange,',
  'first,',
  'gold,',
  'market,',
  'support,',
  'time,',
  'using,',
  'value,',
  'years'],
 ['better,',
  'get,',
  'going,',
  'lot,',
  'miners,',
  'much,',
  'price,',
  'see,',
  'way,',
  'yes'],
 ['bitcoin,', 'even,', 'like,', 'one,', 'people,', 'think'],
 ['block,', 'blocks'],
 ['blockchain,',
  'chain,',
  'change,',
  'code,',
  'core,',
  'fork,',
  'hard,',
  'monero,',
  'network,',
  'post,',
  'run,',
  'said,',
  'segwit,',
  'transactions,',
  'wallet,',
  'work'],
 ['blocksize,',
  'fee,',
  'fees,',
  'increase,',
  'limit,',
  'size,',
  'tran
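
Every word except the last one in each topic keeps a trailing comma, so later lookups only match thread tokens that also end in a comma. A possible fix, sketched under the assumption that each line of `community.txt` looks like `word, word, ..., word` (inferred from the output above), is to strip the comma while parsing. The helper name `parse_topic_line` is hypothetical.

```python
def parse_topic_line(line):
    """Split one line of the community file into clean topic words."""
    # Split on whitespace, then drop the trailing comma left on each token.
    return [token.rstrip(',') for token in line.split()]

sample = "block, blocks\n"  # assumed line format, inferred from the output above
words = parse_topic_line(sample)
print(words)
```

Applying this would change every count computed below, so the printed results in this notebook correspond to the unstripped words.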

#### loading threads

In [3]:
import json

# The JSON file maps subreddit -> {thread_id: thread text};
# flatten it into a single thread_id -> text dict.
threads = {}
with open('./data/2017-01thread.json') as f:
    data = json.load(f)
for subred in data.values():
    threads.update(subred)

In [4]:
len(threads)

9690

### 2. calculate word tf and df

In [5]:
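from collections import Counter

word_tf = {}  # thread_id -> {topic word: count in that thread}
word_df = {}  # topic word -> number of threads containing it
for thread_id, thread in threads.items():
    # Count every token once instead of rescanning the list per word.
    counts = Counter(thread.split())
    word_tf[thread_id] = {}
    for topic in topics:
        for word in topic:
            tf = counts.get(word, 0)
            if tf:
                word_tf[thread_id][word] = tf
                word_df[word] = word_df.get(word, 0) + 1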

In [17]:
word_tf

{'t3_5l72cl': {},
 't3_5lcf0w': {'coins,': 1, 'support,': 1, 'think': 3, 'want': 4, 'years': 1},
 't3_5lbxkn': {'think': 3, 'way,': 1, 'years': 1},
 't3_5l7apn': {'well': 1},
 't3_5le1u7': {'think': 1, 'well': 1},
 't3_5l8sqz': {},
 't3_5la2sq': {},
 't3_5le2zh': {'coins,': 1,
  'price,': 1,
  'think': 6,
  'time,': 1,
  'want': 5,
  'well': 1},
 't3_5k34p2': {'years': 1},
 't3_5le424': {'want': 4},
 't3_5k9kz5': {},
 't3_5letol': {'think': 1, 'want': 1},
 't3_5lep39': {'code,': 1, 'think': 1, 'time,': 1, 'well': 1, 'work': 2},
 't3_5lf1yb': {'coins,': 1, 'think': 1},
 't3_5ldkhd': {'years': 1},
 't3_5lfcwc': {'think': 2, 'time,': 1, 'well': 1, 'years': 1},
 't3_5kpyio': {},
 't3_5lflgx': {'way,': 1},
 't3_5lg6pw': {'make,': 1},
 't3_5lgwmt': {'said,': 1, 'sure,': 1, 'think': 2},
 't3_5lgili': {},
 't3_5gye1h': {},
 't3_57smlq': {},
 't3_5li69u': {'think': 2},
 't3_5lf8ur': {},
 't3_5lhkp7': {},
 't3_5ljrn3': {'better,': 1, 'coins,': 2, 'market,': 1, 'think': 3},
 't3_5lj67j': {'better

In [18]:
word_df

{'[deleted],': 1,
 'anything,': 188,
 'back,': 142,
 'better,': 134,
 'bitcoin,': 704,
 'bitcoins,': 194,
 'block,': 188,
 'blockchain,': 229,
 'blocks': 823,
 'blocksize,': 77,
 'btc,': 123,
 'buy,': 94,
 'chain,': 148,
 'change,': 139,
 'china,': 28,
 'chinese,': 3,
 'code,': 169,
 'coinbase,': 71,
 'coins,': 245,
 'com,': 5,
 'comments,': 49,
 'core,': 66,
 'currency,': 249,
 'day,': 258,
 'even,': 9,
 'everyone,': 67,
 'exchange,': 230,
 'exchanges,': 231,
 'fee,': 138,
 'fees,': 280,
 'first,': 161,
 'fork,': 193,
 'get,': 15,
 'going,': 29,
 'gold,': 162,
 'good,': 175,
 'hard,': 61,
 'increase,': 124,
 'like,': 119,
 'limit,': 140,
 'long,': 84,
 'lot,': 69,
 'make,': 95,
 'market,': 231,
 'miners,': 227,
 'monero,': 63,
 'much,': 147,
 'need,': 31,
 'network,': 274,
 'one,': 281,
 'people,': 304,
 'post,': 130,
 'price,': 257,
 'reddit,': 49,
 'run,': 80,
 'said,': 430,
 'say,': 286,
 'see,': 130,
 'segwit,': 155,
 'short,': 86,
 'size,': 146,
 'someone,': 39,
 'something,': 13
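
The counts above depend on exact token matches, which is probably why `'blocks'` (823) and `'block,'` (188) differ so much. As a rough sketch of a more forgiving count, the tokens could be normalized before matching. The helper `count_topic_words`, the lowercasing, and the punctuation stripping are all assumptions, not part of the original pipeline; note that stripping punctuation would also turn `[deleted]` into `deleted`.

```python
from collections import Counter
import string

def count_topic_words(text, vocabulary):
    """Count occurrences of vocabulary words in text after light normalization."""
    # Strip surrounding punctuation and lowercase each whitespace-separated token.
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return dict(Counter(tok for tok in tokens if tok in vocabulary))

vocab = {"block", "blocks"}  # toy vocabulary with commas already removed
tf = count_topic_words("Blocks are full, block, size matters. blocks!", vocab)
print(tf)
```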

### 3-1 calculate normalized topic tf-idf
#### tf(T, d) = ∑tf(t, d)/N, idf(T, D) = (∑log(N/n))/N

In [6]:
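import math

num_threads = len(threads)
topic_tfidf = {}  # thread_id -> one score per topic
for thread_id in word_tf:
    topic_tfidf[thread_id] = []
    for topic in topics:
        tfidf = 0
        num_word = 0  # topic words that occur in this thread
        for word in topic:
            tf = word_tf[thread_id].get(word)
            if tf:
                num_word += 1
                tfidf += tf * math.log(num_threads / word_df[word])
        # Average the per-word tf * idf products over the matched words.
        # Note this is the mean of the products, not the product of the
        # separate tf and idf means in the heading above.
        if num_word != 0:
            tfidf = tfidf / num_word
        topic_tfidf[thread_id].append(tfidf)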

In [7]:
topic_tfidf


{'t3_5l72cl': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 't3_5lcf0w': [0,
  0,
  5.6210209500984725,
  3.4133618105338983,
  0,
  3.5104545656888484,
  0,
  0,
  0],
 't3_5lbxkn': [0,
  0,
  0,
  2.0376045825343208,
  3.071826817142558,
  3.5104545656888484,
  0,
  0,
  0],
 't3_5l7apn': [2.0064251277599667, 0, 0, 0, 0, 0, 0, 0, 0],
 't3_5le1u7': [2.0064251277599667, 0, 0, 0, 0, 1.1701515218962828, 0, 0, 0],
 't3_5l8sqz': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 't3_5la2sq': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 't3_5le2zh': [2.0064251277599667,
  0,
  7.026276187623091,
  3.2356234001135857,
  3.6297736199895922,
  7.020909131377697,
  0,
  0,
  0],
 't3_5k34p2': [0, 0, 0, 2.0376045825343208, 0, 0, 0, 0, 0],
 't3_5le424': [0, 0, 5.6210209500984725, 0, 0, 0, 0, 0, 0],
 't3_5k9kz5': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 't3_5letol': [0, 0, 1.4052552375246181, 0, 0, 1.1701515218962828, 0, 0, 0],
 't3_5lep39': [2.0064251277599667,
  0,
  0,
  2.7936553058870865,
  0,
  1.1701515218962828,
  0,
  3.9364978523450906,
  0],
 't3_5

### 4. Calculate topic tf-idf distribution

In [8]:
# Count, for each topic, the threads with a nonzero topic tf-idf score.
topic_distribution = [0] * len(topics)
for thread_id, topic_tfidf_lst in topic_tfidf.items():
    for i, score in enumerate(topic_tfidf_lst):
        if score != 0:
            topic_distribution[i] += 1

In [9]:
topic_distribution

[1611, 99, 2645, 2366, 1308, 3306, 856, 2330, 1562]

In [10]:
[len(t) for t in topics]

[9, 7, 10, 17, 10, 6, 2, 16, 7]
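
The raw counts are hard to compare because the topics differ in size. One possible way to read them, sketched here with the numbers printed above, is to express each count as a share of all 9690 threads and as threads per topic word. Both normalizations are my own suggestion rather than an established measure.

```python
topic_distribution = [1611, 99, 2645, 2366, 1308, 3306, 856, 2330, 1562]
topic_sizes = [9, 7, 10, 17, 10, 6, 2, 16, 7]
num_threads = 9690

# Share of all threads in which each topic has a nonzero score.
coverage = [count / num_threads for count in topic_distribution]
# Threads reached per topic word, a rough correction for topic size.
threads_per_word = [count / size
                    for count, size in zip(topic_distribution, topic_sizes)]

for i, (c, w) in enumerate(zip(coverage, threads_per_word)):
    print(f"topic {i}: coverage={c:.3f}, threads per word={w:.1f}")
```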

### 3-2. calculate normalized topic tf-idf
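#### tf(T, d) = qt(∏tf(t, d), N), idf(T, D) = qt(∏log(N/n), N)

The resulting topic distribution turns out to be exactly the same as the previous one.

Even though the scores are on a different scale, we basically do the same thing: we only took the log of each word's tf-idf value before averaging, which gives the log of the geometric mean. The distribution only counts the threads whose score is nonzero, and a score is nonzero whenever at least one topic word occurs in the thread (barring a score that happens to be exactly 0), so the counts do not change.

Pfft...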
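
A small sketch of why the counts agree: with toy per-word tf-idf values (made up, not from the corpus), the arithmetic mean and the mean of logs are nonzero for the same threads, and exponentiating the mean of logs recovers the geometric mean. The function names are hypothetical.

```python
import math

def arithmetic_score(values):
    """Mean of the per-word tf-idf values, 0 if no topic word matched."""
    return sum(values) / len(values) if values else 0

def log_mean_score(values):
    """Mean of the logs of the per-word tf-idf values, 0 if no word matched."""
    return sum(math.log(v) for v in values) / len(values) if values else 0

# Toy per-word tf-idf values of one topic in three threads.
threads_tfidf = {"a": [2.0, 8.0], "b": [], "c": [3.0]}

count_arith = sum(1 for v in threads_tfidf.values() if arithmetic_score(v) != 0)
count_log = sum(1 for v in threads_tfidf.values() if log_mean_score(v) != 0)
# exp(mean of logs) is the geometric mean: sqrt(2 * 8) = 4
geo_mean_a = math.exp(log_mean_score(threads_tfidf["a"]))
print(count_arith, count_log, geo_mean_a)
```

The only way the two counts could differ is a thread whose geometric mean is exactly 1, since its log is 0.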


In [18]:
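import math

num_threads = len(threads)
topic_tfidf = {}  # thread_id -> one score per topic
for thread_id in word_tf:
    topic_tfidf[thread_id] = []
    for topic in topics:
        tfidf = 0
        num_word = 0  # topic words that occur in this thread
        for word in topic:
            tf = word_tf[thread_id].get(word)
            if tf:
                num_word += 1
                # log of the per-word tf-idf; assumes idf > 0, i.e. the
                # word does not occur in every thread
                tfidf += math.log(tf * math.log(num_threads / word_df[word]))
        # Mean of the logs = log of the geometric mean of the tf-idf values.
        if num_word != 0:
            tfidf = tfidf / num_word
        topic_tfidf[thread_id].append(tfidf)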


In [19]:
# Count, for each topic, the threads with a nonzero topic tf-idf score.
topic_distribution = [0] * len(topics)
for thread_id, topic_tfidf_lst in topic_tfidf.items():
    for i, score in enumerate(topic_tfidf_lst):
        if score != 0:
            topic_distribution[i] += 1

In [20]:
topic_distribution

[1611, 99, 2645, 2366, 1308, 3306, 856, 2330, 1562]

### 3-3 Topic TF-IDF
TF-IDF = ∑∑tf/∑df

In the numerator, sum the tf of every word in the topic over all documents.

In the denominator, sum the df of every word in the topic.

In [31]:
topic_tfidf = []
for topic in topics:
    topic_tf = 0
    topic_df = 0
    for word in topic:
        # sum of this word's tf over all threads
        topic_tf += sum(value.get(word, 0) for value in word_tf.values())
        # sum of word df
        topic_df += word_df.get(word, 0)
    # guard against a topic none of whose words occur in the corpus
    topic_ti = topic_tf / topic_df if topic_df else 0
    topic_tfidf.append(topic_ti)

In [32]:
topic_tfidf

[1.8674223755544603,
 1.1238095238095238,
 2.5262135922330096,
 1.6969350411710888,
 1.3413612565445026,
 3.371157323688969,
 4.101879327398615,
 1.7580229561958305,
 2.9413854351687387]
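
As a sanity check on what this ratio means, the seventh topic (`'block,'`, `'blocks'`) can be worked through with the numbers printed earlier. Since ∑df counts every (word, thread) pair in which a topic word occurs, the ratio is the average number of occurrences per such pair; this interpretation is mine and is only a sketch.

```python
df_block, df_blocks = 188, 823   # word_df values printed above
score = 4.101879327398615        # seventh entry of topic_tfidf above

topic_df = df_block + df_blocks  # denominator of the ratio
implied_tf = score * topic_df    # total occurrences implied by the score
print(topic_df, round(implied_tf))
```

So the two words together occur about 4147 times across 1011 (word, thread) pairs, roughly 4.1 occurrences each time one of them appears in a thread. The measure contains no log(N/n) term, so it behaves like an average term frequency rather than a TF-IDF in the usual sense.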