# Latent Analysis & Topic Modeling

## References

* [A Comparison of LDA and NMF for Topic Modeling on Literary Themes](https://wiki.ubc.ca/Course:CPSC522/A_Comparison_of_LDA_and_NMF_for_Topic_Modeling_on_Literary_Themes)
* [Complete Guide to Topic Modeling](https://nlpforhackers.io/topic-modeling/)
* [Topic Modelling with LSA and LDA](https://www.kaggle.com/rcushen/topic-modelling-with-lsa-and-lda)
* [text2vec/topic modeling](http://text2vec.org/topic_modeling.html)

When text is analyzed with a bag-of-words model, the base data is usually a Document-Term matrix (DTM). Here a Term is the feature we want to observe.

In this DTM, each row vector is an embedding of a Document in a vector space, and each column vector can be viewed as an embedding of a Term in a vector space. The DTM therefore lets us manipulate documents and terms mathematically.
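
As a minimal sketch of this view, the following builds a tiny DTM by hand and reads one row and one column as vectors. The three toy documents are hypothetical and are not taken from the corpus used later.

```python
from collections import Counter

# Toy corpus (hypothetical data, not taken from formulas.txt)
docs = ["ginseng licorice ginger", "licorice ginger", "ginseng ginseng jujube"]

# Vocabulary: one column per distinct term, in sorted order
vocab = sorted({term for doc in docs for term in doc.split()})

# dtm[i][j] = number of times term j occurs in document i
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocab])

# Row vector: document 0 described by its term counts
doc0_vector = dtm[0]

# Column vector: the term "ginseng" described by its counts per document
j = vocab.index("ginseng")
ginseng_vector = [row[j] for row in dtm]

print(vocab)
print(doc0_vector, ginseng_vector)
```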

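Sometimes, however, we want to posit a third kind of attribute besides Document and Term, the latent features, and view the data through it. For example, if the whole corpus is a set of documents covering a handful of topics, we can analyze which topic each Document belongs to and how strongly each Term is associated with each topic.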

* If this third attribute is a "topic", as in the example above, the task is called topic modeling, and
* if it is some latent meaning, the task is called latent analysis.


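These tasks differ somewhat in purpose, but from a linear algebra point of view both amount to decomposing the DTM into two matrices. That is, the DTM is expressed, at least approximately, as the product of a Document-LatentFeature matrix (X) and a LatentFeature-Term matrix (U).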

$$
DTM = X \cdot U
$$

* If m Terms are observed over n Documents and the number of latent features is p, the matrices have the following shapes:
  - DTM.shape : ( n, m )
  - X.shape : ( n, p )
  - U.shape : ( p, m )

Expressed this way, X captures the relationship between documents and latent features, and U captures the relationship between latent features and terms.
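
The shapes above can be checked with a small NumPy sketch. The sizes and the random matrices are arbitrary placeholders chosen for illustration. They are not the result of any decomposition.

```python
import numpy as np

n, m, p = 5, 7, 2          # documents, terms, latent features (toy sizes)
rng = np.random.default_rng(0)

X = rng.random((n, p))     # Document-LatentFeature matrix
U = rng.random((p, m))     # LatentFeature-Term matrix

DTM_approx = X @ U         # the product has the shape of the DTM: (n, m)
print(X.shape, U.shape, DTM_approx.shape)

# Entry (i, j) is the dot product of document i's and term j's latent vectors
i, j = 3, 4
entry_matches = bool(np.isclose(DTM_approx[i, j], X[i] @ U[:, j]))
print(entry_matches)
```

Because the product has rank at most p, a real DTM is usually only approximated by such a product when p is much smaller than n and m.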

There are many methods of [matrix decomposition](https://ko.wikipedia.org/wiki/%ED%96%89%EB%A0%AC_%EB%B6%84%ED%95%B4), and the answer is not unique. Here we try three commonly used methods: NMF, LDA, and LSA.

* NMF : Non-negative Matrix Factorization
* LDA : Latent Dirichlet Allocation
* LSA : Latent Semantic Analysis

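In addition, the resulting X and U serve as __new embedding vectors__ for Documents and Terms, respectively. In particular, the original TF or TF-IDF matrix is sparse, whereas the derived X and U are denser. Besides the computational advantage, the results are often good as well, so it has become common to use an SVD-decomposed TF-IDF matrix as the embedding rather than the raw TF-IDF matrix.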

Furthermore, if p is set to 2 or 3, the result is a reduction of the DTM to two or three dimensions that can be used for visualization. __These methods are therefore also dimensionality reduction methods.__
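
A minimal sketch of the p = 2 case, using a plain NumPy SVD on a hypothetical 4 x 5 count matrix. It is meant only to illustrate the idea and is not the pipeline used later in this notebook.

```python
import numpy as np

# Toy DTM: 4 documents x 5 terms (hypothetical counts)
dtm = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

p = 2
# Thin SVD: dtm = left @ diag(s) @ Vt, singular values in descending order
left, s, Vt = np.linalg.svd(dtm, full_matrices=False)

doc_2d = left[:, :p] * s[:p]   # (4, 2): one 2-D point per document
term_2d = Vt[:p].T             # (5, 2): one 2-D point per term

# Rank-p reconstruction and its relative Frobenius error
dtm_p = doc_2d @ term_2d.T
rel_err = np.linalg.norm(dtm - dtm_p) / np.linalg.norm(dtm)
print(doc_2d.shape, term_2d.shape, rel_err)
```

The rows of `doc_2d` and `term_2d` can be plotted directly as 2-D scatter points. The relative error indicates how much of the DTM is lost by keeping only two latent features.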

## Data

In [1]:
corpus_path = "../data/formulas.txt"
# Read the whole file; the header and the corpus body are separated by "***"
with open( corpus_path, 'r', encoding='utf-8' ) as f:
    corpus_ = f.read()
header, corpus_raw = corpus_.split("***")
corpus_raw = corpus_raw.strip()
corpus = [ line.strip() for line in corpus_raw.split("\n") ]
corpus_tokenized = [ line.split() for line in corpus ]

print( "# Corpus Description" )
print( header.strip() )
print()
print( "# Corpus Size: ", len(corpus) )


# Corpus Description
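- Source : Korean Traditional Knowledge Portal (www.koreantk.com) > Traditional Medicine > Prescriptions
- Characteristics : data containing only the extracted herbal composition
- Data created : 2016.01.16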

# Corpus Size:  19162


In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from time import time

In [3]:
n_components = 2 ** 4
n_top_words = 4
stop_words = []
min_df = 6

In [4]:
# Build TF Matrix and TF-IDF Matrix

t0 = time()
tf_vectorizer = CountVectorizer( max_df=0.95, min_df=min_df, stop_words=stop_words )
tf = tf_vectorizer.fit_transform( corpus )
print( "TF Matrix done in {:03f}s.".format(time() - t0) )

t0 = time()
tfidf_vectorizer = TfidfVectorizer( max_df=0.95, min_df=min_df, stop_words=stop_words )
tfidf = tfidf_vectorizer.fit_transform( corpus )
print( "TF-IDF Matrix done in {:03f}s.".format(time() - t0) )

print( "Shape ( Documents, Terms ):", tf.shape )

TF Matrix done in 0.125995s.
TF-IDF Matrix done in 0.126005s.
Shape ( Documents, Terms ): (19162, 916)
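
The `min_df` and `max_df` arguments used above prune the vocabulary by document frequency. The pure-Python sketch below mimics the documented meaning of these two parameters on a hypothetical toy corpus: an integer is read as an absolute document count and a float as a proportion of documents. It is a simplified illustration, not scikit-learn's implementation.

```python
# Toy tokenized corpus (hypothetical data)
docs = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "d"],
    ["a", "b", "e"],
]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

min_df = 2      # integer: absolute document count
max_df = 0.95   # float: proportion of documents

# Keep terms that are neither too rare nor too common
kept = sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)
print(kept)
```

Here `a` appears in every document and is dropped by `max_df`, while `c`, `d`, and `e` appear only once and are dropped by `min_df`.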


In [5]:
import numpy as np
np.set_printoptions( precision=2, edgeitems=6, linewidth=240 )

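def get_top_words( components, feature_names, n_top_words):
    """Return, for each topic row, the n_top_words (term, weight) pairs with the largest weights."""
    arr = np.array( components )
    argsorted = np.argsort( arr )   # ascending order within each row
    rst = []
    for k, lst in enumerate( argsorted):
        q = [ ( feature_names[i], arr[k][i] ) for i in lst ]
        q.reverse()                 # largest weight first
        rst.append( q[:n_top_words] )
    return rst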

def print_top_words( n_top_words_with_score, template="{0}({1:0.3f})" ):
    for i, topic in enumerate( n_top_words_with_score ):
        print( "Topic {:02d}".format(i+1), "\t".join( [ template.format(f,s) for f, s in topic ] ) )
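
The ranking step in `get_top_words` can be illustrated on a small, hypothetical topic-term matrix. This sketch uses a vectorized variant of the same idea (sort each row, reverse, keep the first few indices). The weights and term names are made up.

```python
import numpy as np

# Hypothetical topic-term weights: 2 topics x 4 terms
components = np.array([
    [0.1, 0.7, 0.0, 0.2],
    [0.5, 0.0, 0.4, 0.1],
])
feature_names = ["ginger", "ginseng", "jujube", "licorice"]
n_top = 2

# argsort is ascending, so reverse each row and keep the first n_top indices
top_idx = np.argsort(components, axis=1)[:, ::-1][:, :n_top]
top_words = [[feature_names[i] for i in row] for row in top_idx]
print(top_words)
```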

## Extracting tf-idf features for NMF

NMF : Non-negative matrix factorization

[sklearn.decomposition.NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html)
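
To give a feel for what a non-negative factorization does, here is a minimal sketch of the classic multiplicative update rules (Lee and Seung) for the Frobenius objective. It is an illustration under simplifying assumptions. scikit-learn's `NMF` defaults to a coordinate descent solver, so this is not the computation performed in the cell below. The function name `nmf_multiplicative` and the toy matrix are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, p, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative matrix V (n, m) into W (n, p) @ H (p, m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, p)) + eps
    H = rng.random((p, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative matrix with two obvious blocks (hypothetical data)
V = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
], dtype=float)

W, H = nmf_multiplicative(V, p=2)
print(np.round(W @ H, 2))
```

`W` plays the role of the Document-Topic matrix and `H` the Topic-Term matrix. Both stay non-negative, which is what makes the factors readable as additive "parts".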

In [11]:
print("Extracting tf-idf features for NMF...")

# Fit the NMF model
t0 = time()
print("Fitting the NMF model with tf-idf features ..."  )
nmf_params = {
    "n_components": n_components, 
    "random_state": 1, 
    "max_iter": 1000, 
#     "alpha": .1, 
#     "l1_ratio": .5
}
nmf_model = NMF( **nmf_params )
nmf_W = nmf_model.fit_transform( tfidf )  # (NO_DOCUMENTS, NO_TOPICS)
nmf_H = nmf_model.components_             # (NO_TOPICS, NO_FEATURES)
print( "done in {:03f}s.".format(time() - t0) )

Extracting tf-idf features for NMF...
Fitting the NMF model with tf-idf features ...
done in 1.275000s.


In [12]:
print("\n# NMF model")
print("\n## Document-Topic Matrix :", nmf_W.shape )
print( nmf_W )
print("\n## Topic-Term Matrix:", nmf_H.shape )
print( nmf_H )
print("\n## Topics in NMF model:")
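# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()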
print_top_words( get_top_words( nmf_H, tfidf_feature_names, n_top_words) )


# NMF model

## Document-Topic Matrix : (19162, 16)
[[0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 ... 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00]
 [1.21e-03 0.00e+00 0.00e+00 2.54e-04 4.87e-02 0.00e+00 ... 9.77e-04 1.03e-03 2.45e-03 0.00e+00 3.79e-03 1.46e-03]
 [1.21e-03 0.00e+00 0.00e+00 2.54e-04 4.87e-02 0.00e+00 ... 9.77e-04 1.03e-03 2.45e-03 0.00e+00 3.79e-03 1.46e-03]
 [0.00e+00 0.00e+00 9.78e-03 0.00e+00 0.00e+00 0.00e+00 ... 0.00e+00 3.36e-03 0.00e+00 0.00e+00 0.00e+00 0.00e+00]
 [0.00e+00 0.00e+00 4.88e-03 0.00e+00 0.00e+00 0.00e+00 ... 0.00e+00 1.68e-03 0.00e+00 0.00e+00 0.00e+00 0.00e+00]
 [0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 ... 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00]
 ...
 [2.13e-03 0.00e+00 0.00e+00 0.00e+00 1.39e-02 3.01e-02 ... 6.18e-02 4.24e-02 2.68e-02 6.95e-02 2.02e-02 0.00e+00]
 [2.13e-03 0.00e+00 0.00e+00 0.00e+00 1.39e-02 3.01e-02 ... 6.18e-02 4.24e-02 2.68e-02 6.95e-02 2.02e-02 0.00e+00]
 [0.00e+00 0.00e+00 0.

## Extracting tf features for LDA


[Latent Dirichlet allocation (LDA)](https://ko.wikipedia.org/wiki/%EC%9E%A0%EC%9E%AC_%EB%94%94%EB%A6%AC%ED%81%B4%EB%A0%88_%ED%95%A0%EB%8B%B9)

[sklearn.decomposition.LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
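
According to the scikit-learn documentation, the rows of `components_` can be read as pseudo-counts and become per-topic term distributions once each row is normalized by its sum. The sketch below shows that normalization on a hypothetical 2 x 3 array standing in for `lda_model.components_`.

```python
import numpy as np

# Hypothetical pseudo-counts shaped like lda_model.components_: 2 topics x 3 terms
components = np.array([
    [8.0, 1.0, 1.0],
    [0.5, 0.5, 4.0],
])

# Divide each row by its sum so that every topic becomes a distribution over terms
topic_term_prob = components / components.sum(axis=1, keepdims=True)
print(topic_term_prob)
```

After normalization each row sums to 1, so the values can be compared across topics as probabilities rather than raw pseudo-counts.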

In [13]:
print("Extracting tf features for LDA...")

# Fit the LDA model
t0 = time()
print("Fitting LDA models with tf features ..." )
lda_params = {
    "n_components": n_components, 
    "max_iter": 100,
    "learning_method": 'online',
    "random_state": 1,
#     "learning_offset": 50.,
}
      
lda_model = LatentDirichletAllocation( **lda_params )
lda_L = lda_model.fit_transform( tf )  # (NO_DOCUMENTS, NO_TOPICS)
lda_D = lda_model.components_          # (NO_TOPICS, NO_FEATURES)
print( "done in {:03f}s.".format(time() - t0) )

Extracting tf features for LDA...
Fitting LDA models with tf features ...
done in 138.558999s.


In [14]:
print("\n# LDA model")
print("\n## Document-Topic Matrix :", lda_L.shape )
print( lda_L )
print("\n## Topic-Term Matrix:", lda_D.shape )
print( lda_D )
print("\n## Topics in LDA model:")
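# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
tf_feature_names = tf_vectorizer.get_feature_names_out()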
print_top_words( get_top_words( lda_D, tf_feature_names, n_top_words) )


# LDA model

## Document-Topic Matrix : (19162, 16)
[[0.03 0.03 0.03 0.03 0.03 0.53 ... 0.03 0.03 0.03 0.03 0.03 0.03]
 [0.01 0.01 0.29 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.27]
 [0.01 0.01 0.29 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.27]
 [0.03 0.03 0.03 0.53 0.03 0.03 ... 0.03 0.03 0.03 0.03 0.03 0.03]
 [0.02 0.02 0.02 0.35 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.35]
 [0.06 0.06 0.06 0.06 0.06 0.06 ... 0.06 0.06 0.06 0.06 0.06 0.06]
 ...
 [0.01 0.01 0.32 0.01 0.01 0.01 ... 0.01 0.17 0.01 0.09 0.09 0.01]
 [0.01 0.01 0.32 0.01 0.01 0.01 ... 0.01 0.17 0.01 0.09 0.09 0.01]
 [0.03 0.03 0.03 0.03 0.03 0.03 ... 0.03 0.03 0.03 0.03 0.03 0.03]
 [0.03 0.03 0.03 0.03 0.03 0.03 ... 0.03 0.03 0.03 0.03 0.03 0.03]
 [0.02 0.27 0.27 0.02 0.02 0.02 ... 0.02 0.02 0.27 0.02 0.02 0.02]
 [0.27 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.27 0.02]]

## Topic-Term Matrix: (16, 916)
[[6.25e-02 6.25e-02 6.25e-02 6.25e-02 6.25e-02 6.25e-02 ... 6.25e-02 6.25e-02 6.25e-02 6.25e-02 6.25e-02

## Extracting tf-idf features for LSA

[Latent semantic analysis (LSA)](https://en.wikipedia.org/wiki/Latent_semantic_analysis)

In [15]:
print("Extracting tf-idf features for LSA...")

# Fit the SVD model
t0 = time()
print("Fitting LSA models with tf-idf features ..." )
lsa_params = {
    "n_components": n_components, 
}
      
lsa_model = TruncatedSVD( **lsa_params )
lsa_L = lsa_model.fit_transform( tfidf )  # (NO_DOCUMENTS, NO_TOPICS)
lsa_S = lsa_model.components_             # (NO_TOPICS, NO_FEATURES)
print( "done in {:03f}s.".format(time() - t0) )

Extracting tf-idf features for LSA...
Fitting LSA models with tf-idf features ...
done in 0.104999s.


In [16]:
print("\n# LSA model")
print("\n## Document-Topic Matrix :", lsa_L.shape )
print( lsa_L )
print("\n## Topic-Term Matrix:", lsa_S.shape )
print( lsa_S )
print("\n## Topics in LSA model:")
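# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()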
print_top_words( get_top_words( lsa_S, tfidf_feature_names, n_top_words) )


# LSA model

## Document-Topic Matrix : (19162, 16)
[[ 1.15e-01 -5.50e-02 -1.40e-01 -1.34e-01  2.34e-01 -2.80e-01 ...  3.80e-02  2.49e-01  2.61e-02  1.16e-01 -1.29e-01 -1.02e-01]
 [ 9.27e-02 -1.17e-04 -5.55e-02 -2.99e-02  6.83e-02 -2.29e-02 ... -9.02e-02 -1.73e-02 -1.11e-01  3.50e-02  1.34e-02  3.12e-02]
 [ 9.27e-02 -1.17e-04 -5.55e-02 -2.99e-02  6.83e-02 -2.29e-02 ... -9.02e-02 -1.73e-02 -1.11e-01  3.50e-02  1.34e-02  3.12e-02]
 [ 3.15e-02  3.95e-02  5.22e-03  1.97e-02 -2.07e-02  5.08e-02 ...  5.95e-03 -4.70e-03  3.35e-02 -7.06e-02  1.16e-02 -1.38e-02]
 [ 1.60e-02  2.00e-02  2.59e-03  9.92e-03 -1.01e-02  2.55e-02 ...  2.92e-03 -2.15e-03  1.73e-02 -3.58e-02  6.46e-03 -6.75e-03]
 [ 1.59e-18 -2.36e-17  5.73e-16  6.90e-16 -6.79e-16 -4.24e-16 ...  2.89e-15 -1.85e-15  4.98e-15  7.39e-16  3.03e-15 -2.81e-15]
 ...
 [ 3.77e-01 -1.07e-01 -1.96e-01  7.22e-02  4.41e-02 -2.92e-02 ... -7.23e-02 -1.75e-01  6.25e-02 -1.32e-01 -3.44e-02  3.35e-02]
 [ 3.77e-01 -1.07e-01 -1.96e-01  7.22e-02  4.41e-02 -

## Visualization

### Documents

In [17]:
from sklearn.manifold import TSNE

def get_tsne_model( params ):
    return TSNE( **params )

tsne_parms = {
    "n_components": 2,
    "perplexity": 50,
    "learning_rate": 100,
    "n_iter": 2000,        # renamed to "max_iter" in scikit-learn 1.5 and later
    "verbose": 1,
    "random_state": 0,
    "angle": 0.75
}

tsne_nmf_doc_model = get_tsne_model( tsne_parms )
tsne_nmf_doc_vectors = tsne_nmf_doc_model.fit_transform( nmf_W )

tsne_lda_doc_model = get_tsne_model( tsne_parms )
tsne_lda_doc_vectors = tsne_lda_doc_model.fit_transform( lda_L )

tsne_lsa_doc_model = get_tsne_model( tsne_parms )
tsne_lsa_doc_vectors = tsne_lsa_doc_model.fit_transform( lsa_L )

[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 19162 samples in 0.227s...
[t-SNE] Computed neighbors for 19162 samples in 7.271s...
[t-SNE] Computed conditional probabilities for sample 1000 / 19162
[t-SNE] Computed conditional probabilities for sample 2000 / 19162
[t-SNE] Computed conditional probabilities for sample 3000 / 19162
[t-SNE] Computed conditional probabilities for sample 4000 / 19162
[t-SNE] Computed conditional probabilities for sample 5000 / 19162
[t-SNE] Computed conditional probabilities for sample 6000 / 19162
[t-SNE] Computed conditional probabilities for sample 7000 / 19162
[t-SNE] Computed conditional probabilities for sample 8000 / 19162
[t-SNE] Computed conditional probabilities for sample 9000 / 19162
[t-SNE] Computed conditional probabilities for sample 10000 / 19162
[t-SNE] Computed conditional probabilities for sample 11000 / 19162
[t-SNE] Computed conditional probabilities for sample 12000 / 19162
[t-SNE] Computed conditional probabilities for sa

In [18]:
# Dominant topic per document: index of the largest value in each row
nmf_doc_topics = np.argmax( nmf_W, axis=1 )
lda_doc_topics = np.argmax( lda_L, axis=1 )
lsa_doc_topics = np.argmax( lsa_L, axis=1 )
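
One point worth keeping in mind when reading the plots: NMF and LDA produce non-negative values, but LSA values are signed, so taking the argmax picks the largest positive value and ignores strongly negative ones. The pure-Python sketch below contrasts the two choices on hypothetical rows. Selecting by magnitude is shown only as a possible alternative, not as what this notebook does.

```python
# Hypothetical document-latent rows: one non-negative (NMF/LDA-like), one signed (LSA-like)
rows = {
    "nonneg": [0.05, 0.70, 0.25],
    "signed": [0.10, -0.90, 0.30],
}

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

for name, row in rows.items():
    by_value = argmax(row)                         # same choice as np.argmax on one row
    by_magnitude = argmax([abs(v) for v in row])   # alternative for signed values
    print(name, by_value, by_magnitude)
```

For the non-negative row both choices agree. For the signed row they differ, which is why the LSA coloring should be interpreted with more caution than the NMF and LDA ones.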

In [19]:
from bokeh.plotting import figure
from bokeh.io import output_notebook, show

colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5" ])
colormap = colormap[:n_components]

In [28]:
output_notebook()
plot = figure(title="t-SNE Clustering of Documents by {} NMF Topics".format( n_components ), plot_width=600, plot_height=600)
plot.scatter( x=tsne_nmf_doc_vectors[:,0], y=tsne_nmf_doc_vectors[:,1], color=colormap[ nmf_doc_topics ] )
show( plot )

In [29]:
output_notebook()
plot = figure(title="t-SNE Clustering of Documents by  {} LDA Topics".format( n_components ), plot_width=600, plot_height=600)
plot.scatter( x=tsne_lda_doc_vectors[:,0], y=tsne_lda_doc_vectors[:,1], color=colormap[ lda_doc_topics ] )
show( plot )

In [30]:
output_notebook()
plot = figure(title="t-SNE Clustering of Documents by  {} LSA Topics".format( n_components ), plot_width=600, plot_height=600)
plot.scatter( x=tsne_lsa_doc_vectors[:,0], y=tsne_lsa_doc_vectors[:,1], color=colormap[ lsa_doc_topics ] )
show( plot )

### Terms

In [21]:
from sklearn.manifold import TSNE

def get_tsne_model( params ):
    return TSNE( **params )

tsne_parms = {
    "n_components": 2,
    "perplexity": 50,
    "learning_rate": 100,
    "n_iter": 2000,        # renamed to "max_iter" in scikit-learn 1.5 and later
    "verbose": 1,
    "random_state": 0,
    "angle": 0.75
}

tsne_nmf_term_model = get_tsne_model( tsne_parms )
tsne_nmf_term_vectors = tsne_nmf_term_model.fit_transform( nmf_H.T )

tsne_lda_term_model = get_tsne_model( tsne_parms )
tsne_lda_term_vectors = tsne_lda_term_model.fit_transform( lda_D.T )

tsne_lsa_term_model = get_tsne_model( tsne_parms )
tsne_lsa_term_vectors = tsne_lsa_term_model.fit_transform( lsa_S.T )



[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 916 samples in 0.001s...
[t-SNE] Computed neighbors for 916 samples in 0.040s...
[t-SNE] Computed conditional probabilities for sample 916 / 916
[t-SNE] Mean sigma: 0.007256
[t-SNE] KL divergence after 250 iterations with early exaggeration: 60.197289
[t-SNE] KL divergence after 2000 iterations: 0.673097
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 916 samples in 0.000s...
[t-SNE] Computed neighbors for 916 samples in 0.042s...
[t-SNE] Computed conditional probabilities for sample 916 / 916
[t-SNE] Mean sigma: 8.631911
[t-SNE] KL divergence after 250 iterations with early exaggeration: 62.073399
[t-SNE] KL divergence after 2000 iterations: 0.337237
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 916 samples in 0.001s...
[t-SNE] Computed neighbors for 916 samples in 0.040s...
[t-SNE] Computed conditional probabilities for sample 916 / 916
[t-SNE] Mean sigma: 0.001978
[t-SNE] KL divergence after 250 itera

In [25]:
# Dominant topic per term: index of the largest value in each column
nmf_term_topics = np.argmax( nmf_H, axis=0 )
lda_term_topics = np.argmax( lda_D, axis=0 )
lsa_term_topics = np.argmax( lsa_S, axis=0 )

In [34]:
output_notebook()
plot = figure(title="t-SNE Clustering of Terms by {} NMF Topics".format( n_components ), plot_width=600, plot_height=600)
plot.scatter( x=tsne_nmf_term_vectors[:,0], y=tsne_nmf_term_vectors[:,1], color=colormap[ nmf_term_topics ] )
show( plot )

In [35]:
output_notebook()
plot = figure(title="t-SNE Clustering of Terms by {} LDA Topics".format( n_components ), plot_width=600, plot_height=600)
plot.scatter( x=tsne_lda_term_vectors[:,0], y=tsne_lda_term_vectors[:,1], color=colormap[ lda_term_topics ] )
show( plot )

In [36]:
output_notebook()
plot = figure(title="t-SNE Clustering of Terms by {} LSA Topics".format( n_components ), plot_width=600, plot_height=600)
plot.scatter( x=tsne_lsa_term_vectors[:,0], y=tsne_lsa_term_vectors[:,1], color=colormap[ lsa_term_topics ] )
show( plot )

## Conclusion

As shown above, we used NMF, LDA, and LSA to derive latent third attributes from a TF matrix or a TF-IDF matrix.