# Topic Modeling using LDA

### References

* Data: Drug Dataset (400 documents)
* Preprocess: https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
* LDA: https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/07/09/lda/

### Load Raw Data

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', 999)
news_data = pd.read_csv('./mallet_top_sen.tsv', sep='\t')

In [2]:
news_data.head()

Unnamed: 0.1,Unnamed: 0,id,Topic_Num,Topic_Perc_Contribu,Topic_Keywords,Origin_Text,Text
0,0,44029,0.0,0.2935,"analysi, multivari, regress, variabl, model, predictor, cardiac, time, univari, heart",Hazard Ratio (and 95% Confidence Intervals) in Univariate and Multivariate Analysis of Predictors of Major Cardiac Events (Cardiac Death or Worsening of Heart Failure Leading to Heart Transplantation),"['hazard', 'ratio', 'confid', 'interv', 'univari', 'multivari', 'analysi', 'predictor', 'major', 'cardiac', 'event', 'cardiac', 'death', 'worsen', 'heart', 'failur', 'lead', 'heart', 'transplant']"
1,1,23344,0.0,0.2836,"analysi, multivari, regress, variabl, model, predictor, cardiac, time, univari, heart","Left Ventricular and Right Ventricular Ejection Fractions, Left Ventricular and Right Ventricular Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations for Both Ventricles in 30 Cases of Left Sided WPW legend","['leav', 'ventricular', 'right', 'ventricular', 'eject', 'fraction', 'leav', 'ventricular', 'right', 'ventricular', 'mean', 'phase', 'leav', 'right', 'mean', 'phase', 'differ', 'rmp', 'phase', 'standard', 'deviat', 'ventricl', 'case', 'leav', 'side', 'wpw']"
2,2,41163,0.0,0.2817,"analysi, multivari, regress, variabl, model, predictor, cardiac, time, univari, heart","Partial Regression Coefficients (All Subjects, n = 262) for Forward Stepwise Linear Regression for Dependent Variables Augmentation Pressure and Augmentation Index legend","['partial', 'regress', 'coeffici', 'subject', 'forward', 'stepwis', 'linear', 'regress', 'depend', 'variabl', 'augment', 'pressur', 'augment', 'index']"
3,3,23343,0.0,0.2797,"analysi, multivari, regress, variabl, model, predictor, cardiac, time, univari, heart","Left Ventricular (LVEF) and Right Ventricular (RVEF) Ejection Fractions, Left Ventricular (LVMP) and Right Ventricular (RVMP) Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations (LVPSD and RVPSD) for Both Ventricles in 14 Cases of Right Sided WPW legend","['leav', 'ventricular', 'lvef', 'right', 'ventricular', 'rvef', 'eject', 'fraction', 'leav', 'ventricular', 'lvmp', 'right', 'ventricular', 'rvmp', 'mean', 'phase', 'leav', 'right', 'mean', 'phase', 'differ', 'rmp', 'phase', 'standard', 'deviat', 'lvpsd', 'rvpsd', 'ventricl', 'case', 'right', 'side', 'wpw']"
4,4,24968,0.0,0.2782,"analysi, multivari, regress, variabl, model, predictor, cardiac, time, univari, heart",Predictors of Mortality by Multivariable Analysis: Variables Are Shown in the Order They Entered a Stepwise Cox Regression Model,"['predictor', 'mortal', 'multivari', 'analysi', 'variabl', 'show', 'order', 'enter', 'stepwis', 'cox', 'regress', 'model']"


#### Extract target data

In [3]:
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
data_text = news_data[['Origin_Text']].copy()
data_text['index'] = news_data['Unnamed: 0']
documents = data_text
documents.head()


Unnamed: 0,Origin_Text,index
0,Hazard Ratio (and 95% Confidence Intervals) in Univariate and Multivariate Analysis of Predictors of Major Cardiac Events (Cardiac Death or Worsening of Heart Failure Leading to Heart Transplantation),0
1,"Left Ventricular and Right Ventricular Ejection Fractions, Left Ventricular and Right Ventricular Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations for Both Ventricles in 30 Cases of Left Sided WPW legend",1
2,"Partial Regression Coefficients (All Subjects, n = 262) for Forward Stepwise Linear Regression for Dependent Variables Augmentation Pressure and Augmentation Index legend",2
3,"Left Ventricular (LVEF) and Right Ventricular (RVEF) Ejection Fractions, Left Ventricular (LVMP) and Right Ventricular (RVMP) Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations (LVPSD and RVPSD) for Both Ventricles in 14 Cases of Right Sided WPW legend",3
4,Predictors of Mortality by Multivariable Analysis: Variables Are Shown in the Order They Entered a Stepwise Cox Regression Model,4


### Preprocessing

* Import Libraries

In [4]:
!pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org gensim


In [5]:
!pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org nltk


In [6]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

import numpy as np
np.random.seed(2018)

import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gracelee/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

* Preprocess
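 1. simple_preprocess: Lowercase the text and split it into tokens
 2. STOPWORDS: Remove stopwords, and drop tokens of three characters or fewer
 3. lemmatize_stemming: Lemmatize each token as a verb, then stem it
 
* lemmatize_stemming
 - Lemmatizing and stemming both reduce a word to a base form
 - Lemmatizing maps a word to its dictionary form, so the result is a real word; stemming strips suffixes by rule, so the result may not be one (e.g. 'efficaci')
 - pos is the part of speech used for lemmatizing ('v' = verb)
 - https://m.blog.naver.com/PostView.nhn?blogId=vangarang&logNo=220963244354&proxyReferer=https%3A%2F%2Fwww.google.com%2F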
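
The filtering steps above can be sketched without gensim or NLTK. This is only an approximation: the regular expression stands in for `simple_preprocess`, `TOY_STOPWORDS` is a hypothetical four-word stand-in for gensim's `STOPWORDS`, and the lemmatize/stem step is left out entirely.

```python
import re

# Hypothetical, tiny stand-in for gensim's STOPWORDS (an assumption for illustration).
TOY_STOPWORDS = {"at", "for", "the", "to"}

def toy_preprocess(text, min_len=4):
    """Lowercase, keep alphabetic runs, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

sample = "Treatment efficacy at week 36 for the modified intention-to-treat population"
print(toy_preprocess(sample))
```

Digits and hyphens never make it into a token here, which is why numbers such as `36` and short abbreviations do not appear in the processed documents.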

In [7]:
def lemmatize_stemming(text):
    stemmer = SnowballStemmer('english')
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

* Test

In [8]:
doc_sample = documents[documents['index'] == 100].values[0][0]
print('original document: ')

words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)

print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['Treatment', 'efficacy', 'at', 'week', '36', 'for', 'the', 'modified', 'intention-to-treat', 'population', 'in', 'the', 'open-label', 'period', 'and', 'at', 'week', '88', 'for', 'the', 'modified', 'intention-to-treat', 'subpopulations', 'in', 'the', 'double-blind', 'period']


 tokenized and lemmatized document: 
['treatment', 'efficaci', 'week', 'modifi', 'intent', 'treat', 'popul', 'open', 'label', 'period', 'week', 'modifi', 'intent', 'treat', 'subpopul', 'doubl', 'blind', 'period']


* Run

In [9]:
%time processed_docs = documents['Origin_Text'].map(preprocess)
processed_docs[:10]

CPU times: user 612 ms, sys: 3.56 ms, total: 615 ms
Wall time: 615 ms


0                                                                                        [hazard, ratio, confid, interv, univari, multivari, analysi, predictor, major, cardiac, event, cardiac, death, worsen, heart, failur, lead, heart, transplant]
1                                           [leav, ventricular, right, ventricular, eject, fraction, leav, ventricular, right, ventricular, mean, phase, leav, right, mean, phase, differ, phase, standard, deviat, ventricl, case, leav, side, legend]
2                                                                                                                   [partial, regress, coeffici, subject, forward, stepwis, linear, regress, depend, variabl, augment, pressur, augment, index, legend]
3    [leav, ventricular, lvef, right, ventricular, rvef, eject, fraction, leav, ventricular, lvmp, right, ventricular, rvmp, mean, phase, leav, right, mean, phase, differ, phase, standard, deviat, lvpsd, rvpsd, ventricl, case, right, side, legend]
4       

----

### T-SNE

* https://datascienceschool.net/view-notebook/3e7aadbf88ed4f0d87a76f9ddc925d69/
* https://lumiamitie.github.io/r/python/tsne-for-r-py/

In [10]:
### The TSNE model has no transform method, only fit_transform
# library import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

np.random.seed(2018)

In [11]:
type(documents['Origin_Text'].values.tolist())

list

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
%time vect.fit([' '.join(d) for d in processed_docs])
%time tsne_data = vect.transform([' '.join(d) for d in processed_docs]).toarray()

CPU times: user 27.3 ms, sys: 1.84 ms, total: 29.1 ms
Wall time: 28.1 ms
CPU times: user 21 ms, sys: 3.39 ms, total: 24.4 ms
Wall time: 24.4 ms


In [13]:
tsne_data[:10]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [14]:
%time tsne_result = TSNE(learning_rate=300, init='pca').fit_transform(np.array(tsne_data))

CPU times: user 16.5 s, sys: 941 ms, total: 17.5 s
Wall time: 17.3 s


In [15]:
tsne_result[:10]

array([[  9.664474  ,   8.992035  ],
       [ 23.601093  , -15.2017975 ],
       [ 11.646137  ,   0.97763896],
       [ 23.60182   , -15.202716  ],
       [  8.242556  ,   3.2811804 ],
       [ 13.225922  ,  -0.97631407],
       [  5.6147494 ,   8.802479  ],
       [  4.419053  ,   8.055964  ],
       [ 12.462818  ,  -1.2678202 ],
       [  9.879312  ,   5.7678313 ]], dtype=float32)

In [16]:
# # Visualization
# plt.scatter(tsne_result[:, 1], tsne_result[:, 0])
# plt.xlim(tsne_result[:, 1].min()-3, tsne_result[:, 1].max()+3) # min, max
# plt.ylim(tsne_result[:, 0].min()-3, tsne_result[:, 0].max()+3) # min, max
# plt.xlabel('t-SNE feature 0') # x-axis label
# plt.ylabel('t-SNE feature 1') # y-axis label
# plt.show() # show the plot

In [17]:
%time tsne_3d_result = TSNE(n_components=3, learning_rate=300, init='pca').fit_transform(np.array(tsne_data))

CPU times: user 24.4 s, sys: 952 ms, total: 25.3 s
Wall time: 25.2 s


In [18]:
tsne_3d_result[:10]

array([[ 103.58746  ,   72.253586 ,    5.65633  ],
       [  31.781635 ,   89.92669  ,  -82.63583  ],
       [  20.661259 ,  124.87336  ,  -45.061954 ],
       [  24.231483 ,   92.08438  , -102.17858  ],
       [-146.37094  ,    8.817643 ,  -81.40019  ],
       [ 107.89072  ,  130.8689   ,  -55.291992 ],
       [ 124.24056  ,   16.723385 ,   -6.6259217],
       [ 127.55127  ,   27.194212 ,  -41.71926  ],
       [  96.68417  ,  108.18609  ,  -65.152824 ],
       [  45.471752 , -115.20811  ,   79.15913  ]], dtype=float32)

In [19]:
# from mpl_toolkits.mplot3d import Axes3D

# plt.style.use('fivethirtyeight')

# plt.rcParams["figure.figsize"] = (20,10)
# plt.rcParams['lines.linewidth'] = 1
# plt.rcParams['lines.color'] = 'r'
# plt.rcParams['axes.grid'] = True 

# fig = plt.figure(figsize=(8, 6))
# ax = fig.add_subplot(111, projection='3d')

# for x, y, z in tsne_3d_result:
#     ax.scatter(x, y, z, c='blue')
    
# ax.set_xlabel('X Label')
# ax.set_ylabel('Y Label')
# ax.set_zlabel('Z Label')

----

### LDA

* Setting Variables

    1. document_topic_counts: list of Counters, one per document (topic -> number of tokens assigned)
    2. topic_word_counts: list of Counters, one per topic (word -> number of tokens assigned)
    3. topic_counts: list of integers, one per topic (total tokens assigned to the topic)
    4. document_lengths: list of integers, one per document (number of tokens)
    5. distinct_words: set of all unique words in the dataset
    6. V: number of distinct words
    7. D: number of documents
    
* Counter Object
 - A dict subclass from collections that counts elements; a missing key counts as 0
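
To make these structures concrete, here is a minimal sketch on a two-document toy corpus. The corpus `toy_docs`, the hand-picked assignment `toy_topics`, and `K = 2` are assumptions chosen for illustration, not values from the dataset.

```python
from collections import Counter

# Toy corpus and a hand-picked topic per token (both are assumptions).
toy_docs = [["heart", "failur", "heart"], ["stroke", "risk"]]
toy_topics = [[0, 1, 0], [1, 1]]
K = 2  # number of topics in this toy example

document_topic_counts = [Counter() for _ in toy_docs]
topic_word_counts = [Counter() for _ in range(K)]
topic_counts = [0] * K
document_lengths = [len(doc) for doc in toy_docs]

# Tally every (word, topic) pair into the three count structures.
for d, (doc, topics) in enumerate(zip(toy_docs, toy_topics)):
    for word, topic in zip(doc, topics):
        document_topic_counts[d][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1

distinct_words = {w for doc in toy_docs for w in doc}
V, D = len(distinct_words), len(toy_docs)

print(document_topic_counts[0])  # topic 0 twice, topic 1 once
print(topic_word_counts[1])
print(topic_counts, document_lengths, V, D)
```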

In [20]:
from collections import Counter

def get_variables(K):
    # K: number of topics chosen by the user (passed in by the caller)

    # Number of times each topic is assigned within each document:
    # a list of Counters, one Counter per document
    document_topic_counts = [Counter() for _ in processed_docs]

    # Number of times each word is assigned to each topic:
    # a list of Counters, one Counter per topic
    topic_word_counts = [Counter() for _ in range(K)]

    # Total number of words assigned to each topic:
    # a list of integers, one per topic
    topic_counts = [0 for _ in range(K)]

    # Total number of words in each document:
    # a list of integers, one per document
    document_lengths = list(map(len, processed_docs))

    # Number of distinct words
    distinct_words = set(word for document in processed_docs for word in document)
    V = len(distinct_words)

    # Total number of documents
    D = len(processed_docs)

    return V, D, document_topic_counts, topic_word_counts, topic_counts, document_lengths, distinct_words

In [21]:
def p_topic_given_document(topic, d, alpha=0.1):
    # Fraction of the words in document d that are assigned
    # to topic (smoothed by adding alpha)
    return ((document_topic_counts[d][topic] + alpha) /
            (document_lengths[d] + K * alpha))

def p_word_given_topic(word, topic, beta=0.1):
    # Fraction of the words assigned to topic that are word
    # (smoothed by adding beta)
    return ((topic_word_counts[topic][word] + beta) /
            (topic_counts[topic] + V * beta))

def topic_weight(d, word, k):
    # Given a document and one of its words,
    # return the weight of the k-th topic
    return p_word_given_topic(word, k) * p_topic_given_document(k, d)
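
Written out, the product returned by `topic_weight` has the form below. The symbols are notation introduced here for readability, not the notebook's own: $n_{d,k}$ is `document_topic_counts[d][k]`, $N_d$ is `document_lengths[d]`, $n_{k,w}$ is `topic_word_counts[k][word]`, and $n_k$ is `topic_counts[k]`, with $\alpha = \beta = 0.1$ by default.

```latex
\mathrm{weight}(k \mid d, w)
  = \frac{n_{d,k} + \alpha}{N_d + K\alpha}
    \times
    \frac{n_{k,w} + \beta}{n_k + V\beta}
```

The weights are not normalized over $k$; the sampling step only needs them up to a common constant.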

In [22]:
import random

def choose_new_topic(d, word):
    return sample_from([topic_weight(d, word, k) for k in range(K)])

def sample_from(weights):
    # Return i with probability weights[i] / sum(weights)
    total = sum(weights)
    # Pick a point uniformly between 0 and total
    rnd = total * random.random()
    # Return the smallest i such that
    # weights[0] + ... + weights[i] >= rnd
    for i, w in enumerate(weights):
        rnd -= w
        if rnd <= 0:
            return i
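
A quick empirical check of `sample_from`, as a sketch: the function is repeated here so the snippet runs on its own, and the weights `[1, 1, 2]` and the draw count are arbitrary choices for illustration. The observed shares should land close to 0.25, 0.25 and 0.5.

```python
import random
from collections import Counter

def sample_from(weights):
    """Return index i with probability weights[i] / sum(weights)."""
    total = sum(weights)
    rnd = total * random.random()
    for i, w in enumerate(weights):
        rnd -= w
        if rnd <= 0:
            return i

random.seed(0)
weights = [1, 1, 2]  # unnormalized weights; they need not sum to 1
n_draws = 20000
draws = Counter(sample_from(weights) for _ in range(n_draws))
shares = [draws[i] / n_draws for i in range(len(weights))]
print(shares)
```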

* Run
 - Initialize by assigning a random topic to every word in every document
 - Tally the counts
    1. document_topic_counts
        - for each document, the number of its words assigned to each topic
    2. topic_word_counts
        - for each topic, the number of times each word is assigned to it, counted over all documents

In [23]:
random.seed(0)

K = 8
V, D, document_topic_counts, topic_word_counts, topic_counts, document_lengths, distinct_words = get_variables(K)

# Assign each word to a random topic
document_topics = [[random.randrange(K) for word in document] for document in processed_docs]

# Starting from this random initialization,
# tally the counts needed to compute A and B
for d in range(D):
    for word, topic in zip(processed_docs[d], document_topics[d]):
        document_topic_counts[d][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1

In [24]:
len(processed_docs)

400

----

In [25]:
import time
start_time = time.time() 

for iteration in range(3):
    for d in range(D):
        for i, (word, topic) in enumerate(zip(processed_docs[d], document_topics[d])):
            # For Gibbs sampling, remove the current word and
            # its topic from all counts
            document_topic_counts[d][topic] -= 1
            topic_word_counts[topic][word] -= 1
            topic_counts[topic] -= 1
            document_lengths[d] -= 1

            # Choose a new topic for this word based on the topic
            # assignments of every other word in the corpus
            new_topic = choose_new_topic(d, word)
            document_topics[d][i] = new_topic

            # Add the word back to the counts under its new topic
            document_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1
            document_lengths[d] += 1
    
    print("--- %d iter: %s mins ---" % (iteration, str((time.time() - start_time) / 60.)))

print("--- %s mins ---" % str((time.time() - start_time) / 60.))

--- 0 iter: 0.005571548144022624 mins ---
--- 1 iter: 0.011192631721496583 mins ---
--- 2 iter: 0.016779232025146484 mins ---
--- 0.01678473154703776 mins ---
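
The same remove, resample, add-back cycle can be tried in isolation. The sketch below is self-contained and makes several assumptions: a three-document toy corpus, `K = 2`, five sweeps, and `random.choices` from the standard library in place of the notebook's `sample_from`. It is meant to show that each sweep moves tokens between topics while keeping every total intact.

```python
import random
from collections import Counter

random.seed(0)
docs = [["heart", "failur", "cardiac"], ["stroke", "risk", "stroke"], ["heart", "risk"]]
K, alpha, beta = 2, 0.1, 0.1
V = len({w for doc in docs for w in doc})

# Random initial topic per token, then tally the counts.
doc_topics = [[random.randrange(K) for _ in doc] for doc in docs]
doc_topic = [Counter() for _ in docs]
topic_word = [Counter() for _ in range(K)]
topic_total = [0] * K
for d, doc in enumerate(docs):
    for w, z in zip(doc, doc_topics[d]):
        doc_topic[d][z] += 1
        topic_word[z][w] += 1
        topic_total[z] += 1

def weight(d, w, k, n_d):
    # n_d: tokens in document d, excluding the one being resampled
    p_topic = (doc_topic[d][k] + alpha) / (n_d + K * alpha)
    p_word = (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
    return p_topic * p_word

for sweep in range(5):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = doc_topics[d][i]
            # Remove the current token from all counts.
            doc_topic[d][z] -= 1
            topic_word[z][w] -= 1
            topic_total[z] -= 1
            # Draw a new topic in proportion to the weights.
            weights = [weight(d, w, k, len(doc) - 1) for k in range(K)]
            z = random.choices(range(K), weights=weights)[0]
            # Add the token back under its new topic.
            doc_topics[d][i] = z
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

n_tokens = sum(len(doc) for doc in docs)
print(topic_total, sum(topic_total) == n_tokens)
```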


In [26]:
## Topic counts for the i-th document (here i = 0)
document_topic_counts[0]

Counter({0: 6, 1: 0, 2: 6, 3: 0, 4: 0, 5: 6, 6: 1, 7: 0})

In [27]:
## Word counts for the i-th topic (words assigned at least 25 times)
for i in range(8):
    print('Topic %d: %s' % (i, ','.join(['%s(%s)' % (k, topic_word_counts[i].get(k)) for k in topic_word_counts[i].keys() if topic_word_counts[i].get(k) >= 25])))

Topic 0: risk(28)
Topic 1: patient(25)
Topic 2: 
Topic 3: treatment(25)
Topic 4: year(30),death(30)
Topic 5: coronari(30)
Topic 6: event(30),advers(30)
Topic 7: 


In [28]:
## Word counts for the i-th topic (10 most common words)
for i in range(8):
    print('Topic %d: %s' % (i, ','.join(['%s(%s)' % (a, b) for a, b in topic_word_counts[i].most_common(10)])))

Topic 0: risk(28),score(23),relat(23),heart(21),combin(18),predict(17),cardiac(15),failur(14),analysi(13),stroke(12)
Topic 1: patient(25),advers(18),treatment(14),year(14),clinic(14),event(13),women(13),modern(13),grade(12),model(10)
Topic 2: model(24),patient(23),sensit(18),differ(17),hazard(16),standardis(16),surviv(14),year(14),relat(14),cancer(14)
Topic 3: treatment(25),event(23),popul(21),studi(19),stroke(15),advers(14),patient(13),area(11),score(10),ischaem(10)
Topic 4: year(30),death(30),risk(18),myocardi(18),acut(18),efficaci(17),caus(16),accord(16),infarct(15),patient(14)
Topic 5: coronari(30),intervent(17),diseas(12),cancer(10),base(9),peak(9),effect(8),estim(8),control(8),tomographi(8)
Topic 6: event(30),advers(30),patient(22),grade(16),mortal(13),regress(11),multivari(11),coronari(10),variabl(10),occur(9)
Topic 7: legend(14),analysi(14),patient(12),specif(8),adjust(8),number(8),associ(7),ratio(7),imag(7),rate(7)


In [29]:
documents.head()

Unnamed: 0,Origin_Text,index
0,Hazard Ratio (and 95% Confidence Intervals) in Univariate and Multivariate Analysis of Predictors of Major Cardiac Events (Cardiac Death or Worsening of Heart Failure Leading to Heart Transplantation),0
1,"Left Ventricular and Right Ventricular Ejection Fractions, Left Ventricular and Right Ventricular Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations for Both Ventricles in 30 Cases of Left Sided WPW legend",1
2,"Partial Regression Coefficients (All Subjects, n = 262) for Forward Stepwise Linear Regression for Dependent Variables Augmentation Pressure and Augmentation Index legend",2
3,"Left Ventricular (LVEF) and Right Ventricular (RVEF) Ejection Fractions, Left Ventricular (LVMP) and Right Ventricular (RVMP) Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations (LVPSD and RVPSD) for Both Ventricles in 14 Cases of Right Sided WPW legend",3
4,Predictors of Mortality by Multivariable Analysis: Variables Are Shown in the Order They Entered a Stepwise Cox Regression Model,4


In [30]:
import operator

# Copy the slice so that adding columns does not trigger SettingWithCopyWarning
doc_result = documents[['index', 'Origin_Text']].copy()
doc_result.columns = ['id', 'document']
# Dominant topic of each document: the topic with the highest token count
doc_result['topic'] = doc_result.id.apply(lambda x: max(document_topic_counts[x].items(), key=operator.itemgetter(1))[0])
# Note: topic_prob holds that topic's token count, not a probability
doc_result['topic_prob'] = doc_result.id.apply(lambda x: max(document_topic_counts[x].items(), key=operator.itemgetter(1))[1])
doc_result['topic_word'] = doc_result.topic.apply(lambda x: ','.join(['%s(%s)' % (a, b) for a, b in topic_word_counts[x].most_common(10)]))
doc_result = pd.merge(doc_result, pd.DataFrame(tsne_result, columns=['plot_x', 'plot_y']), left_index=True, right_index=True)
doc_result = pd.merge(doc_result, pd.DataFrame(tsne_3d_result, columns=['td_x', 'td_y', 'td_z']), left_index=True, right_index=True)


In [31]:
plt.style.use('fivethirtyeight')

plt.rcParams["figure.figsize"] = (20,10)
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.color'] = 'r'
plt.rcParams['axes.grid'] = True 

# doc_result.plot.scatter(x='plot_x', y='plot_y', c='topic', colormap='Accent')

In [32]:
# threedee = plt.figure().gca(projection='3d')
# threedee.scatter(doc_result.td_x, doc_result.td_y, doc_result.td_z, c=doc_result.topic)

# plt.savefig('3d_scatter_lda.png')

In [33]:
doc_result[doc_result.topic == 3].sort_values('topic_prob', ascending=False).head()

Unnamed: 0,id,document,topic,topic_prob,topic_word,plot_x,plot_y,td_x,td_y,td_z
3,3,"Left Ventricular (LVEF) and Right Ventricular (RVEF) Ejection Fractions, Left Ventricular (LVMP) and Right Ventricular (RVMP) Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations (LVPSD and RVPSD) for Both Ventricles in 14 Cases of Right Sided WPW legend",3,25,"treatment(25),event(23),popul(21),studi(19),stroke(15),advers(14),patient(13),area(11),score(10),ischaem(10)",23.60182,-15.202716,24.231483,92.084381,-102.178581
1,1,"Left Ventricular and Right Ventricular Ejection Fractions, Left Ventricular and Right Ventricular Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations for Both Ventricles in 30 Cases of Left Sided WPW legend",3,22,"treatment(25),event(23),popul(21),studi(19),stroke(15),advers(14),patient(13),area(11),score(10),ischaem(10)",23.601093,-15.201797,31.781635,89.926689,-82.635834
373,373,Pooled analysis of the effect of any aspirin versus control in secondary prevention after TIA and ischaemic stroke on the early risk of any recurrent ischaemic stroke and on disabling or fatal ischaemic stroke stratified by the nature of the presenting event (TIA and minor stroke vs major stroke) and by time from presenting event to randomisation (14 days vs >14 days),3,21,"treatment(25),event(23),popul(21),studi(19),stroke(15),advers(14),patient(13),area(11),score(10),ischaem(10)",-9.190104,5.256001,11.063636,38.820911,129.07402
361,361,"Pooled analysis of the early risk of recurrent vascular events, given per time period after randomisation, in trials of aspirin versus control in secondary prevention after transient ischaemic attack and ischaemic stroke",3,16,"treatment(25),event(23),popul(21),studi(19),stroke(15),advers(14),patient(13),area(11),score(10),ischaem(10)",-8.648508,4.525882,-0.745511,36.26992,104.87133
297,297,Percentage of Total Cross Section Represented by Plaque Area (Difference Between Lumen Area and Area Delimited by Internal Elastic Lamina) at Either 7 or 21 Days After Treatment with rhVEGF (2 g/kg by a Single Intramuscular Injection) or Albumin legend,3,16,"treatment(25),event(23),popul(21),studi(19),stroke(15),advers(14),patient(13),area(11),score(10),ischaem(10)",26.421835,3.316678,-65.335838,-24.417757,-64.738106


In [34]:
doc_result.sort_values('topic_prob', ascending=False).head()

Unnamed: 0,id,document,topic,topic_prob,topic_word,plot_x,plot_y,td_x,td_y,td_z
69,69,"Antibodies to neuronal antigens in cerebellar syndromes Sera were screened by routine immunohistochemistry on frozen sections of rat cerebellum and positive staining patterns1 were confirmed, as appropriate, by western blotting on rat cerebellar extracts or recombinant Hu or Yo polypeptides. VGCC antibodies were measured by immunoprecipitation of 125I--conotoxin MVIIC-labelled VGCCs extracted from human cerebellum, 2 and antibodies to glutamic acid decarboxylase measured with a commercial kit (RSR Ltd, Cardiff, UK) .",6,31,"event(30),advers(30),patient(22),grade(16),mortal(13),regress(11),multivari(11),coronari(10),variabl(10),occur(9)",-1.035476,4.40404,-65.114105,-27.71023,-144.433136
3,3,"Left Ventricular (LVEF) and Right Ventricular (RVEF) Ejection Fractions, Left Ventricular (LVMP) and Right Ventricular (RVMP) Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations (LVPSD and RVPSD) for Both Ventricles in 14 Cases of Right Sided WPW legend",3,25,"treatment(25),event(23),popul(21),studi(19),stroke(15),advers(14),patient(13),area(11),score(10),ischaem(10)",23.60182,-15.202716,24.231483,92.084381,-102.178581
1,1,"Left Ventricular and Right Ventricular Ejection Fractions, Left Ventricular and Right Ventricular Mean Phases, Left-to-Right Mean Phase Difference (L-RMP) and Phase Standard Deviations for Both Ventricles in 30 Cases of Left Sided WPW legend",3,22,"treatment(25),event(23),popul(21),studi(19),stroke(15),advers(14),patient(13),area(11),score(10),ischaem(10)",23.601093,-15.201797,31.781635,89.926689,-82.635834
50,50,"Potential effect of the internet on physical activity interventions (based on effect estimates from web-based physical activity interventions and other physical activity interventions) and the potential effect of mobile phones on physical activity interventions (based on effect estimates from telephone-based physical activity interventions and from other physical activity interventions) , by country income",5,22,"coronari(30),intervent(17),diseas(12),cancer(10),base(9),peak(9),effect(8),estim(8),control(8),tomographi(8)",-15.383671,0.556147,45.630489,-75.589569,-66.935539
201,201,"Decomposition analysis of the change of global disability-adjusted life years (thousands) by level 1 causes from 1990 to 2010 into total population growth, population ageing, and changes in age-specific, sex-specific, and cause-specific disability-adjusted-life-year rates",4,21,"year(30),death(30),risk(18),myocardi(18),acut(18),efficaci(17),caus(16),accord(16),infarct(15),patient(14)",-30.504271,-1.075239,115.293465,-109.835449,25.390877
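
The `topic` and `topic_prob` columns come from taking the largest entry of each document's topic counter. A small sketch using the counts printed earlier for document 0 shows two details worth keeping in mind: `topic_prob` is a token count rather than a probability, and ties are broken by iteration order. The `share` value is an addition for illustration and is not computed in the notebook.

```python
import operator
from collections import Counter

# Topic counts of document 0, copied from the `document_topic_counts[0]` output.
counts = Counter({0: 6, 1: 0, 2: 6, 3: 0, 4: 0, 5: 6, 6: 1, 7: 0})

# max() returns the first maximal item it meets, so topics 2 and 5
# (also 6 tokens each) lose the tie to topic 0 in this iteration order.
topic, n_tokens = max(counts.items(), key=operator.itemgetter(1))

# Hypothetical extra: the dominant topic's share of the document's tokens.
share = n_tokens / sum(counts.values())

print(topic, n_tokens, round(share, 3))
```

In the notebook itself the winner of a tie depends on the order in which topics were first inserted into each counter, so a three-way tie like this one is effectively an arbitrary assignment.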


* Gibbs Sampling
    * http://4four.us/article/2014/10/lda-parameter-estimation
    * https://bab2min.tistory.com/569

* PyLDAvis
    * https://lovit.github.io/nlp/2018/09/27/pyldavis_lda/

----

In [35]:
import pyLDAvis  # only pyLDAvis.prepare is used below
import sklearn.preprocessing  # makes sklearn.preprocessing.normalize available

In [36]:
# list of str, vocab list (fixed order shared by every array below)
vocab = list(distinct_words)

# numpy.ndarray, shape = (n_topics, n_terms)
# Note: raw topic-word counts; the rows are not normalized to sum to 1
topic_term_dists = np.array([[topic_word_counts[i][w] for w in vocab] for i in range(K)])

# numpy.ndarray, shape = (n_docs, n_topics)
# Index each Counter by topic id so that column k is always topic k
# (Counter.values() follows insertion order, not topic order)
doc_topic_dists = np.array([[counts[k] for k in range(K)] for counts in document_topic_counts], dtype=float)
doc_topic_dists = doc_topic_dists + 6.25  # additive smoothing on every topic
doc_topic_dists = sklearn.preprocessing.normalize(doc_topic_dists, norm='l1', axis=1)

# numpy.ndarray, shape = (n_docs,)
doc_lengths = np.array(document_lengths)

# numpy.ndarray, shape = (n_vocabs,)
term_frequency = topic_term_dists.sum(axis=0)

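* topic_term_dists: topic-word counts, shape (n_topics, n_terms)
* doc_topic_dists: smoothed, L1-normalized topic proportions per document, shape (n_docs, n_topics)
* doc_lengths: number of tokens per document, shape (n_docs,)
* vocab: list of distinct words, in the column order of topic_term_dists
* term_frequency: total count of each word over all topics, shape (n_vocabs,)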
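
The smoothing and L1 normalization behind `doc_topic_dists` can be checked by hand for one row. The sketch below uses plain Python instead of NumPy and scikit-learn, with the topic counts printed earlier for document 0; the variable names are illustrative.

```python
# Topic counts of document 0, as printed by `document_topic_counts[0]`.
counts = [6, 0, 6, 0, 0, 6, 1, 0]
smoothing = 6.25  # constant added to every topic count in the cell above

smoothed = [c + smoothing for c in counts]
total = sum(smoothed)                # 19 + 8 * 6.25 = 69.0
row = [s / total for s in smoothed]  # L1-normalized row, sums to 1

print([round(p, 8) for p in row])
```

With 50 pseudo-counts spread over 8 topics and documents of roughly 20 tokens, the smoothing term outweighs the observed counts, which is why every row of `doc_topic_dists` stays fairly close to uniform.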

In [46]:
topic_term_dists

array([[0, 0, 0, ..., 0, 0, 7],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 4],
       ...,
       [1, 0, 0, ..., 1, 1, 2],
       [0, 0, 0, ..., 1, 0, 0],
       [3, 1, 2, ..., 0, 0, 0]])

In [37]:
pd.DataFrame([d.values() for d in document_topic_counts]).fillna(0).values

array([[6., 0., 6., ..., 6., 1., 0.],
       [0., 0., 0., ..., 0., 0., 3.],
       [7., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 6., 0., 0.],
       [4., 4., 0., ..., 0., 0., 0.],
       [8., 0., 1., ..., 0., 0., 0.]])

In [38]:
lda_mallet_data = {
    'topic_term_dists':topic_term_dists,
    'doc_topic_dists':doc_topic_dists,
    'doc_lengths':doc_lengths,
    'vocab':vocab,
    'term_frequency':term_frequency
}
vis_data = pyLDAvis.prepare(**lda_mallet_data)
# pyLDAvis.display(vis_data)
# pyLDAvis.save_html(vis_data, 'test.html')



In [39]:
# Data for the horizontal bar chart on the right side of LDAvis
# Freq: Estimated term frequency within the selected topic
# Total: Overall term frequency
print(vis_data.topic_info.Category.unique())
vis_data.topic_info[vis_data.topic_info.Category == 'Topic1'].sort_values('Freq', ascending=False).head()

['Default' 'Topic1' 'Topic2' 'Topic3' 'Topic4' 'Topic5' 'Topic6' 'Topic7'
 'Topic8']


term,Category,Freq,Term,Total,loglift,logprob
736,Topic1,19086.89814,treatment,52818.859371,7.6038,3.2189
752,Topic1,17559.946289,event,60135.107296,7.3907,3.1355
423,Topic1,16032.994438,popul,31774.648659,7.9377,3.0445
469,Topic1,14506.042586,studi,21253.895208,8.2397,2.9444
808,Topic1,11452.138884,stroke,21026.516902,8.0141,2.7081


----

## Visualization

### 1. Main View
* Layout: https://www.codingfactory.net/10530

#### a. HBar Chart
* Data: vis_data.topic_info[vis_data.topic_info.Category == 'Topic1'].sort_values('Freq', ascending=False).head()
* D3: http://bl.ocks.org/erikvullings/51cc5332439939f1f292

In [40]:
import json

hbar_json = {}
hbar_json['labels'] = vis_data.topic_info.Category.unique().tolist()
# Largest overall term frequency among the per-topic rows (used to scale the bars)
hbar_json['max_width'] = float(vis_data.topic_info.loc[vis_data.topic_info.Category != 'Default', 'Total'].max())
for l in hbar_json['labels']:
    # Top 5 terms of this category by estimated within-topic frequency
    tmp_df = vis_data.topic_info[vis_data.topic_info.Category == l].sort_values('Freq', ascending=False).head()
    hbar_json[l] = list(tmp_df[['Term', 'Freq', 'Total']].reset_index().to_dict('index').values())

with open('./Visualization/res/lda/hbar_data.json', 'w') as f:
    f.write(json.dumps(hbar_json, indent=4))

#### b. Scatter Chart
* Data: tsne_result
* D3: https://bl.ocks.org/Niekes/1c15016ae5b5f11508f92852057136b5

In [64]:
doc_topic_dists

array([[0.17753623, 0.09057971, 0.17753623, ..., 0.17753623, 0.10507246,
        0.09057971],
       [0.08333333, 0.08333333, 0.08333333, ..., 0.08333333, 0.08333333,
        0.12333333],
       [0.20384615, 0.09615385, 0.09615385, ..., 0.09615385, 0.09615385,
        0.09615385],
       ...,
       [0.10964912, 0.10964912, 0.10964912, ..., 0.21491228, 0.10964912,
        0.10964912],
       [0.17372881, 0.17372881, 0.1059322 , ..., 0.1059322 , 0.1059322 ,
        0.1059322 ],
       [0.23360656, 0.10245902, 0.11885246, ..., 0.10245902, 0.10245902,
        0.10245902]])

In [41]:
# Copy the slice so that adding columns does not trigger SettingWithCopyWarning
doc_result = documents[['index', 'Origin_Text']].copy()
doc_result.columns = ['id', 'document']
# Dominant topic of each document: the topic with the highest token count
doc_result['topic'] = doc_result.id.apply(lambda x: max(document_topic_counts[x].items(), key=operator.itemgetter(1))[0])
# Note: topic_prob holds that topic's token count, not a probability
doc_result['topic_prob'] = doc_result.id.apply(lambda x: max(document_topic_counts[x].items(), key=operator.itemgetter(1))[1])
doc_result['topic_word'] = doc_result.topic.apply(lambda x: ','.join(['%s(%s)' % (a, b) for a, b in topic_word_counts[x].most_common(10)]))
doc_result = pd.merge(doc_result, pd.DataFrame(tsne_result, columns=['plot_x', 'plot_y']), left_index=True, right_index=True)
doc_result = pd.merge(doc_result, pd.DataFrame(tsne_3d_result, columns=['td_x', 'td_y', 'td_z']), left_index=True, right_index=True)


In [42]:
scatter_json = list(doc_result[['id', 'plot_x', 'plot_y', 'topic']].to_dict('index').values())

with open('./Visualization/res/lda/scatter_data.json', 'w') as f:
    f.write(json.dumps(scatter_json, indent=4))

#### c. Table
* Data: doc_result[['topic', 'document']].head()

In [43]:
doc_result.to_csv('./data_output/lda.tsv', sep='\t', index_label=False)

In [44]:
doc_result.groupby('topic').head(1)[['topic', 'topic_word']]

Unnamed: 0,topic,topic_word
0,0,"risk(28),score(23),relat(23),heart(21),combin(18),predict(17),cardiac(15),failur(14),analysi(13),stroke(12)"
1,3,"treatment(25),event(23),popul(21),studi(19),stroke(15),advers(14),patient(13),area(11),score(10),ischaem(10)"
2,1,"patient(25),advers(18),treatment(14),year(14),clinic(14),event(13),women(13),modern(13),grade(12),model(10)"
5,2,"model(24),patient(23),sensit(18),differ(17),hazard(16),standardis(16),surviv(14),year(14),relat(14),cancer(14)"
18,6,"event(30),advers(30),patient(22),grade(16),mortal(13),regress(11),multivari(11),coronari(10),variabl(10),occur(9)"
36,7,"legend(14),analysi(14),patient(12),specif(8),adjust(8),number(8),associ(7),ratio(7),imag(7),rate(7)"
42,5,"coronari(30),intervent(17),diseas(12),cancer(10),base(9),peak(9),effect(8),estim(8),control(8),tomographi(8)"
47,4,"year(30),death(30),risk(18),myocardi(18),acut(18),efficaci(17),caus(16),accord(16),infarct(15),patient(14)"


In [45]:
doc_result.groupby('topic').agg({'id': 'unique'})

topic,id
0,"[0, 6, 9, 12, 20, 24, 26, 27, 33, 37, 48, 55, 64, 66, 75, 83, 87, 103, 119, 149, 174, 190, 209, 210, 217, 220, 232, 261, 262, 266, 268, 270, 274, 276, 278, 283, 288, 290, 298, 315, 328, 334, 345, 346, 350, 351, 352, 353, 354, 359, 364, 366, 376, 382, 387, 389, 394, 395, 398, 399]"
1,"[2, 4, 11, 14, 16, 17, 22, 30, 31, 46, 52, 62, 81, 93, 94, 98, 114, 120, 121, 123, 124, 125, 129, 131, 133, 143, 144, 147, 150, 157, 164, 171, 172, 173, 179, 181, 183, 184, 194, 215, 226, 233, 239, 244, 245, 252, 259, 277, 285, 331, 333, 343, 349, 384, 386]"
2,"[5, 7, 8, 13, 15, 23, 28, 32, 34, 35, 38, 39, 43, 45, 57, 58, 59, 67, 73, 74, 84, 85, 100, 108, 110, 113, 130, 141, 142, 145, 163, 166, 177, 186, 192, 200, 203, 204, 205, 206, 212, 213, 214, 218, 219, 223, 225, 229, 231, 234, 242, 248, 254, 255, 264, 272, 286, 326, 362, 379, 388]"
3,"[1, 3, 10, 65, 79, 90, 101, 104, 105, 107, 115, 116, 122, 126, 128, 136, 137, 139, 140, 165, 182, 198, 221, 241, 251, 256, 280, 282, 289, 296, 297, 300, 332, 338, 341, 358, 361, 363, 365, 367, 368, 370, 373, 374, 377, 378, 380, 390, 393, 396]"
4,"[47, 54, 68, 70, 96, 99, 158, 168, 175, 176, 178, 180, 189, 191, 197, 201, 202, 207, 211, 216, 222, 227, 230, 235, 238, 240, 246, 247, 249, 257, 260, 265, 267, 269, 275, 279, 287, 291, 292, 293, 294, 304, 355, 357, 369, 375, 385, 392]"
5,"[42, 50, 56, 60, 71, 72, 76, 80, 86, 91, 109, 111, 151, 153, 167, 187, 195, 208, 224, 236, 237, 253, 258, 273, 305, 306, 307, 308, 310, 314, 317, 319, 320, 322, 324, 325, 327, 329, 330, 335, 336, 339, 342, 360, 381, 391]"
6,"[18, 19, 21, 25, 29, 41, 44, 49, 61, 69, 78, 82, 92, 95, 97, 102, 106, 112, 117, 118, 127, 132, 134, 135, 138, 146, 148, 160, 169, 185, 188, 263, 271, 284, 299, 301, 302, 311, 318, 323, 344, 347, 371]"
7,"[36, 40, 51, 53, 63, 77, 88, 89, 152, 154, 155, 156, 159, 161, 162, 170, 193, 196, 199, 228, 243, 250, 281, 295, 303, 309, 312, 313, 316, 321, 337, 340, 348, 356, 372, 383, 397]"
