# Latent Analysis & Topic Modeling

## References

* [A Comparison of LDA and NMF for Topic Modeling on Literary Themes](https://wiki.ubc.ca/Course:CPSC522/A_Comparison_of_LDA_and_NMF_for_Topic_Modeling_on_Literary_Themes)
* [Complete Guide to Topic Modeling](https://nlpforhackers.io/topic-modeling/)
* [Topic Modelling with LSA and LDA](https://www.kaggle.com/rcushen/topic-modelling-with-lsa-and-lda)

## Data

In [1]:
# Formulas (prescriptions) that contain ginseng (인삼)
corpus_path = "../../_data/formulas_including_insam.txt"
with open( corpus_path, 'r', encoding='utf-8' ) as f:
    corpus_ = f.read()
# The file holds a description header and the corpus body, separated by "***"
header, corpus_raw = corpus_.split("***")
corpus_raw = corpus_raw.strip()
corpus = [ line.strip() for line in corpus_raw.split("\n") ]
print( "# Corpus Description \n", header.strip() )
print()
print( "# Corpus Size: ", len(corpus) )


# Corpus Description 
 Formulas containing "인삼" (ginseng), extracted from InSAM 1.10. 
2,196 in total
Source texts: 동의보감, 방약합편, 의학입문, 사상의학, 현대상한론, 경악전서, 의방집해, 금궤요략

# Corpus Size:  2196


In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from time import time

In [3]:
n_components = 2 ** 4
n_top_words = 4
stop_words = ["인삼"]
min_df = 6

In [4]:
# Build TF Matrix and TF-IDF Matrix

t0 = time()
tf_vectorizer = CountVectorizer( max_df=0.95, min_df=min_df, stop_words=stop_words )
tf = tf_vectorizer.fit_transform( corpus )
print( "TF Matrix done in {:03f}s.".format(time() - t0) )

t0 = time()
tfidf_vectorizer = TfidfVectorizer( max_df=0.95, min_df=min_df, stop_words=stop_words )
tfidf = tfidf_vectorizer.fit_transform( corpus )
print( "TF-IDF Matrix done in {:03f}s.".format(time() - t0) )

TF Matrix done in 0.020104s.
TF-IDF Matrix done in 0.020566s.


In [5]:
import numpy as np

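def get_top_words( components, feature_names, n_top_words):
    """Return, for each topic, the n_top_words (feature, score) pairs with the highest scores."""
    arr = np.array( components )
    argsorted = np.argsort( arr )   # per-row feature indices, ascending by score
    rst = []
    for k, lst in enumerate( argsorted):
        q = [ ( feature_names[i], arr[k][i] ) for i in lst ]
        q.reverse()                 # descending by score
        rst.append( q[:n_top_words] )
    return rst

def print_top_words( n_top_words_with_score, template="{0}({1:0.3f})" ):
    """Print one line per topic: the topic number followed by its tab-separated top words."""
    for i, topic in enumerate( n_top_words_with_score ):
        print( "Topic {:02d}".format(i+1), "\t".join( [ template.format(f,s) for f, s in topic ] ) )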
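
To check what the top-word helper returns, here is a small self-contained sketch on a hand-made topic-term matrix. `top_words` is a hypothetical, vectorized variant of the helper above, and the matrix values are made up for illustration.

```python
import numpy as np

def top_words(components, feature_names, n_top_words):
    """For each topic (row), return the n_top_words highest-scoring (feature, score) pairs."""
    components = np.asarray(components)
    # Sort each row ascending, reverse to descending, keep the first n_top_words columns
    order = np.argsort(components, axis=1)[:, ::-1][:, :n_top_words]
    return [
        [(feature_names[i], float(components[k, i])) for i in row]
        for k, row in enumerate(order)
    ]

# Toy topic-term matrix: 2 topics x 3 herbs (made-up scores)
toy_components = np.array([
    [0.1, 2.0, 0.5],
    [1.5, 0.0, 0.3],
])
toy_features = ["감초", "당귀", "생강"]

toy_top = top_words(toy_components, toy_features, n_top_words=2)
for k, topic in enumerate(toy_top):
    print("Topic {:02d}".format(k + 1), topic)
```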

## Extracting tf-idf features for NMF

[sklearn.decomposition.NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html)

In [6]:
print("Extracting tf-idf features for NMF...")

# Fit the NMF model
t0 = time()
print("Fitting the NMF model with tf-idf features ..."  )
nmf_params = {
    "n_components": n_components, 
    "random_state": 1, 
    "max_iter": 1000, 
#     "alpha": .1, 
#     "l1_ratio": .5
}
nmf_model = NMF( **nmf_params )
nmf_W = nmf_model.fit_transform( tfidf )  # (NO_DOCUMENTS, NO_TOPICS)
nmf_H = nmf_model.components_             # (NO_TOPICS, NO_FEATURES)
print( "done in {:03f}s.".format(time() - t0) )

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

print_top_words( get_top_words( nmf_H, tfidf_feature_names, n_top_words) )

Extracting tf-idf features for NMF...
Fitting the NMF model with tf-idf features ...
done in 0.110970s.

Topics in NMF model:
Topic 01 당귀(2.972)	백작약(2.722)	천궁(2.235)	숙지황(1.490)
Topic 02 생강(1.949)	대조(1.724)	반하(0.875)	감초(0.366)
Topic 03 건강(2.812)	감초(0.716)	백출(0.495)	황련(0.348)
Topic 04 오미자(2.273)	맥문동(2.191)	아교(0.453)	상백피(0.380)
Topic 05 황금(2.562)	시호(2.198)	반하(0.892)	감초(0.591)
Topic 06 계지(2.241)	작약(0.818)	대조(0.242)	세신(0.233)
Topic 07 방풍(1.279)	길경(1.225)	강활(0.985)	지각(0.868)
Topic 08 원지(1.395)	산조인(1.180)	주사(1.036)	석창포(0.858)
Topic 09 승마(2.724)	창출(1.104)	황백(1.045)	진피(1.038)
Topic 10 백복령(2.094)	백출(2.005)	진피(1.040)	감초(0.695)
Topic 11 숙지황(1.549)	산약(0.970)	두충(0.876)	구기자(0.751)
Topic 12 복령(2.099)	백출(0.765)	작약(0.372)	감초(0.267)
Topic 13 생지황(1.571)	지모(1.540)	석고(0.893)	지골피(0.678)
Topic 14 목향(1.152)	후박(0.983)	곽향(0.832)	사인(0.710)
Topic 15 부자(2.681)	육계(0.370)	관계(0.141)	정향(0.139)
Topic 16 황기(2.812)	육계(0.549)	감초(0.525)	당귀(0.299)
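
NMF factorizes the document-term matrix into a non-negative document-topic matrix `W` and a non-negative topic-term matrix `H`, so that `W @ H` approximates the input. A minimal sketch on a made-up 4 x 4 matrix with two obvious groups of documents (the matrix and the `init="nndsvd"` choice are assumptions for illustration, not the notebook's settings):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
])

model = NMF(n_components=2, init="nndsvd", random_state=1, max_iter=1000)
W = model.fit_transform(X)   # (n_documents, n_topics)
H = model.components_        # (n_topics, n_terms)

# Largest absolute difference between the input and its reconstruction
reconstruction_error = np.abs(W @ H - X).max()
# Topic with the largest weight for each document
dominant = W.argmax(axis=1)

print(W.shape, H.shape)
print(reconstruction_error)
print(dominant)
```

Because both factors are non-negative, each row of `H` can be read directly as a weighted list of herbs, which is what the topic listing above does.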


## Extracting tf features for LDA

[LDA (Latent Dirichlet Allocation), Korean Wikipedia](https://ko.wikipedia.org/wiki/%EC%9E%A0%EC%9E%AC_%EB%94%94%EB%A6%AC%ED%81%B4%EB%A0%88_%ED%95%A0%EB%8B%B9)

[sklearn.decomposition.LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)

In [7]:
print("Extracting tf features for LDA...")

# Fit the LDA model
t0 = time()
print("Fitting LDA models with tf features ..." )
lda_params = {
    "n_components": n_components, 
    "max_iter": 100,
    "learning_method": 'online',
#     "learning_offset": 50.,
#     "random_state": 0
}
      
lda_model = LatentDirichletAllocation( **lda_params )
lda_L = lda_model.fit_transform( tf )  # (NO_DOCUMENTS, NO_TOPICS)
lda_D = lda_model.components_          # (NO_TOPICS, NO_FEATURES)
print( "done in {:03f}s.".format(time() - t0) )

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print_top_words( get_top_words( lda_D, tf_feature_names, n_top_words) )

Extracting tf features for LDA...
Fitting LDA models with tf features ...
done in 23.232027s.

Topics in LDA model:
Topic 01 산약(98.240)	숙지황(82.485)	구기자(68.893)	연자육(62.106)
Topic 02 대황(53.218)	당귀(18.821)	사군자(13.428)	감초(12.008)
Topic 03 감초(90.441)	맥문동(85.470)	석고(62.910)	지모(59.303)
Topic 04 감초(212.833)	건강(202.122)	백출(178.917)	진피(150.443)
Topic 05 방풍(194.210)	감초(153.529)	길경(142.302)	천궁(116.777)
Topic 06 주사(107.585)	석창포(57.946)	사향(57.045)	백복령(54.748)
Topic 07 원지(99.640)	산조인(71.727)	감초(67.489)	당귀(62.107)
Topic 08 부자(126.333)	당귀(103.506)	백작약(75.389)	천궁(71.273)
Topic 09 목향(203.025)	백출(180.231)	감초(169.358)	진피(150.473)
Topic 10 생강(472.985)	대조(340.588)	반하(306.904)	감초(281.812)
Topic 11 감초(108.875)	오미자(101.842)	길경(83.363)	아교(66.192)
Topic 12 당귀(420.287)	감초(343.288)	백작약(313.723)	백출(280.603)
Topic 13 목단피(54.045)	관계(39.868)	오수유(33.215)	도인(25.641)
Topic 14 맥문동(156.340)	생지황(123.191)	오미자(118.614)	천문동(88.352)
Topic 15 복령(92.355)	감초(88.719)	황금(87.777)	당귀(82.172)
Topic 16 마황(57.976)	세신(44.331)	천초(41.208)	천오
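
The LDA scores are much larger than the NMF scores because scikit-learn's `components_` for LDA holds pseudo-counts (roughly, how often each word was assigned to each topic), not probabilities. Dividing each row by its sum gives a per-topic word distribution. A minimal sketch with made-up pseudo-counts (the array below is not taken from the fitted model):

```python
import numpy as np

# Toy pseudo-count matrix: 2 topics x 3 words (made-up values)
toy_pseudo_counts = np.array([
    [6.0, 3.0, 1.0],
    [1.0, 1.0, 2.0],
])

# Normalize each topic (row) so that its word weights sum to 1
toy_topic_word = toy_pseudo_counts / toy_pseudo_counts.sum(axis=1)[:, np.newaxis]

print(toy_topic_word)
```

Applied to the fitted model, the same expression would be `lda_model.components_ / lda_model.components_.sum(axis=1)[:, np.newaxis]`; the ranking of words within a topic does not change, only the scale.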

## Extracting tf-idf features for LSA

In [8]:
print("Extracting tf-idf features for LSA...")

# Fit the LSA model (truncated SVD of the tf-idf matrix)
t0 = time()
print("Fitting LSA models with tf-idf features ..." )
lsa_params = {
    "n_components": n_components, 
}

lsa_model = TruncatedSVD( **lsa_params )
lsa_L = lsa_model.fit_transform( tfidf )  # (NO_DOCUMENTS, NO_TOPICS)
lsa_S = lsa_model.components_             # (NO_TOPICS, NO_FEATURES); taken from lsa_model, not lda_model
print( "done in {:03f}s.".format(time() - t0) )

print("\nTopics in LSA model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print_top_words( get_top_words( lsa_S, tfidf_feature_names, n_top_words) )

Extracting tf-idf features for LSA...
Fitting LSA models with tf-idf features ...
done in 0.015000s.
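
Unlike NMF and LDA, the components of a truncated SVD can contain negative loadings, so ranking words by raw value only surfaces the positive end of each component. One possible alternative, sketched here on a made-up component (the values and the helper name `rank_terms` are assumptions), is to rank by absolute loading and keep the sign for interpretation:

```python
import numpy as np

def rank_terms(component, feature_names, n_top, by_abs=False):
    """Return the n_top (term, loading) pairs, ranked by value or by absolute value."""
    component = np.asarray(component)
    key = np.abs(component) if by_abs else component
    order = np.argsort(key)[::-1][:n_top]   # indices of the largest keys first
    return [(feature_names[i], float(component[i])) for i in order]

# Toy LSA component with one strong negative loading (made-up values)
toy_component = np.array([0.1, -0.9, 0.4])
toy_terms = ["감초", "당귀", "생강"]

by_value = rank_terms(toy_component, toy_terms, n_top=1)
by_magnitude = rank_terms(toy_component, toy_terms, n_top=1, by_abs=True)

print(by_value)       # strongest positive loading
print(by_magnitude)   # strongest loading regardless of sign
```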

In [9]:
from sklearn.manifold import TSNE

def get_tsne_model( params ):
    return TSNE( **params )

tsne_params = {
    "n_components": 2,
    "perplexity": 50,
    "learning_rate": 100,
    "n_iter": 2000,   # newer scikit-learn releases name this parameter max_iter
    "verbose": 1,
    "random_state": 0,
    "angle": 0.75   
}

# Fit a separate t-SNE model on each document-topic matrix
tsne_nmf_model = get_tsne_model( tsne_params )
tsne_nmf_vectors = tsne_nmf_model.fit_transform( nmf_W )

tsne_lda_model = get_tsne_model( tsne_params )
tsne_lda_vectors = tsne_lda_model.fit_transform( lda_L )

tsne_lsa_model = get_tsne_model( tsne_params )
tsne_lsa_vectors = tsne_lsa_model.fit_transform( lsa_L )



[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 2196 samples in 0.003s...
[t-SNE] Computed neighbors for 2196 samples in 0.214s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2196
[t-SNE] Computed conditional probabilities for sample 2000 / 2196
[t-SNE] Computed conditional probabilities for sample 2196 / 2196
[t-SNE] Mean sigma: 0.051628
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.540077
[t-SNE] KL divergence after 2000 iterations: 0.853741
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 2196 samples in 0.001s...
[t-SNE] Computed neighbors for 2196 samples in 0.177s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2196
[t-SNE] Computed conditional probabilities for sample 2000 / 2196
[t-SNE] Computed conditional probabilities for sample 2196 / 2196
[t-SNE] Mean sigma: 0.129896
[t-SNE] KL divergence after 250 iterations with early exaggeration: 66.387924
[t-SNE] KL divergence after 2000 iterations: 0.63857

In [10]:
nmf_topics = np.argmax( nmf_W, axis=1 ) 
lda_topics = np.argmax( lda_L, axis=1 ) 
lsa_topics = np.argmax( lsa_L, axis=1 ) 
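
Taking the `argmax` over each row assigns every document to its single strongest topic, which is what colors the scatter plots. A quick way to see how balanced the assignment is would be to count documents per topic; a minimal sketch on a made-up document-topic matrix (`toy_doc_topic` is hypothetical):

```python
import numpy as np

# Toy document-topic matrix: 4 documents x 3 topics (made-up weights)
toy_doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Index of the strongest topic for each document
toy_topics = np.argmax(toy_doc_topic, axis=1)

# Number of documents assigned to each topic; minlength keeps empty topics in the result
toy_counts = np.bincount(toy_topics, minlength=toy_doc_topic.shape[1])

print(toy_topics)
print(toy_counts)
```

For the fitted models the same call would be, for example, `np.bincount(nmf_topics, minlength=n_components)`.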

In [11]:
from bokeh.plotting import figure
from bokeh.io import output_notebook, show

colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5" ])
colormap = colormap[:n_components]

In [12]:
output_notebook()
plot = figure(title="t-SNE Clustering of {} NMF Topics".format( n_components ), plot_width=400, plot_height=400)
plot.scatter( x=tsne_nmf_vectors[:,0], y=tsne_nmf_vectors[:,1], color=colormap[ nmf_topics ] )
show( plot )

In [13]:
output_notebook()
plot = figure(title="t-SNE Clustering of {} LDA Topics".format( n_components ), plot_width=400, plot_height=400)
plot.scatter( x=tsne_lda_vectors[:,0], y=tsne_lda_vectors[:,1], color=colormap[ lda_topics ] )
show( plot )

In [14]:
output_notebook()
plot = figure(title="t-SNE Clustering of {} LSA Topics".format( n_components ), plot_width=400, plot_height=400)
plot.scatter( x=tsne_lsa_vectors[:,0], y=tsne_lsa_vectors[:,1], color=colormap[ lsa_topics ] )
show( plot )