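Topic modeling: assigning each document to one or more topics (a form of dimensionality reduction)

- NMF (Non-negative Matrix Factorization)

- SVD (Singular Value Decomposition)

- LDA (Latent Dirichlet Allocation): a Bayesian probabilistic generative model

LDA estimates, probabilistically, which topics each word is associated with, and groups documents by those topics.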
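
As a rough illustration of how the three methods listed above relate, the sketch below runs each of them on the same tiny count matrix. The four-sentence corpus and all variable names are made up for illustration and are not part of the notebook's dataset; `TruncatedSVD` stands in for SVD here because it is the scikit-learn estimator that accepts sparse input.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not taken from the IMDB data used below).
toy_docs = [
    "the war film shows soldiers and history",
    "a documentary about war and history",
    "the tv show has funny episodes",
    "funny tv series with short episodes",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)  # (4 docs, n_words)

n_topics = 2
models = {
    "NMF": NMF(n_components=n_topics, init="nndsvda", max_iter=500, random_state=0),
    "SVD": TruncatedSVD(n_components=n_topics, random_state=0),
    "LDA": LatentDirichletAllocation(n_components=n_topics, random_state=0),
}

# Every method reduces each document to n_topics numbers.
reduced = {name: model.fit_transform(toy_counts) for name, model in models.items()}
for name, doc_topic in reduced.items():
    print(name, doc_topic.shape)
```

All three return one row per document and one column per topic. Only the LDA rows are probability distributions that sum to 1; NMF gives non-negative weights, and SVD components may be negative.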

In [5]:
import pandas as pd

In [6]:
df = pd.read_csv("../datasets/imdb_sentiment.csv")
df
# sentiment: 0 = negative, 1 = positive

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
...,...,...
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
cv = CountVectorizer(max_features=10000, max_df=.15)
x = cv.fit_transform(df["review"])

In [9]:
from sklearn.decomposition import LatentDirichletAllocation

In [10]:
# Fit LDA with 10 topics
lda = LatentDirichletAllocation(n_components=10, learning_method="batch", max_iter=25, random_state=0, n_jobs=-1)
topics = lda.fit_transform(x)

In [11]:
topics

array([[0.00090936, 0.00090936, 0.39891652, ..., 0.06949323, 0.00090925,
        0.00090931],
       [0.84771635, 0.09677388, 0.00116307, ..., 0.0011632 , 0.00116298,
        0.04736832],
       [0.33645733, 0.10527601, 0.17411418, ..., 0.00089306, 0.00089301,
        0.07038862],
       ...,
       [0.51980577, 0.00217448, 0.08891902, ..., 0.08935927, 0.09440624,
        0.00217453],
       [0.00277895, 0.00277863, 0.50764398, ..., 0.24729753, 0.00277834,
        0.00277863],
       [0.00625198, 0.00625067, 0.00625188, ..., 0.00625088, 0.19705439,
        0.00625188]])

In [12]:
topics.shape

(50000, 10)
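
Each row of the document-topic matrix is a probability distribution over topics, so a document's dominant topic is simply the column with the largest value. The sketch below checks this on a small random count matrix; the data and the `toy_` names are placeholders, not the notebook's `x` or `topics`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(6, 8))  # 6 fake documents, 8 fake words

toy_lda = LatentDirichletAllocation(n_components=3, learning_method="batch", random_state=0)
toy_topics = toy_lda.fit_transform(toy_counts)  # shape (6, 3)

row_sums = toy_topics.sum(axis=1)     # each row sums to 1
dominant = toy_topics.argmax(axis=1)  # index of the most probable topic per document
print(toy_topics.shape, row_sums.round(6), dominant)
```

Applied to the notebook's result, `topics.argmax(axis=1)` would give one topic index (0 to 9) per review.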

In [13]:
lda.components_

array([[7.23812846e+01, 9.64245592e+01, 1.00001796e-01, ...,
        4.98972645e+01, 1.00004776e-01, 1.00002884e-01],
       [1.00017631e-01, 1.75939748e+01, 1.00007020e-01, ...,
        4.75696349e+00, 6.80999552e+01, 6.10999747e+01],
       [6.84518344e+01, 1.72127782e+02, 1.00011418e-01, ...,
        1.00030531e-01, 1.00006859e-01, 1.00001436e-01],
       ...,
       [2.54343125e+00, 4.83742782e+01, 1.00001507e-01, ...,
        1.16313405e+01, 1.00001806e-01, 1.00005899e-01],
       [2.76797580e+00, 1.06667794e+02, 7.60999424e+01, ...,
        2.15079090e+00, 1.00007036e-01, 1.00000545e-01],
       [1.00022705e-01, 9.76326885e-01, 1.00009641e-01, ...,
        1.00036243e-01, 1.00011314e-01, 1.00004374e-01]])

In [14]:
lda.components_.shape

(10, 10000)
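
The entries of `components_` are unnormalized topic-word weights rather than probabilities, so each row has to be divided by its total before the values can be read as word shares within a topic. A minimal sketch of that normalization on placeholder data follows. It also checks an observation about scikit-learn's defaults, stated here as an assumption about the library's behavior: with `topic_word_prior` left unset the prior is `1 / n_components`, and in batch mode no weight falls below it, which would explain the many values near `1.0e-01` in the output above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(6, 8))  # placeholder document-word counts

n_topics = 2
toy_lda = LatentDirichletAllocation(n_components=n_topics, learning_method="batch", random_state=0)
toy_lda.fit(toy_counts)

weights = toy_lda.components_                              # shape (n_topics, n_words)
word_probs = weights / weights.sum(axis=1, keepdims=True)  # rows now sum to 1

print(word_probs.sum(axis=1))
# Assumed default prior: 1 / n_components; weights should not drop below it.
print(weights.min() >= 1 / n_topics - 1e-9)
```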

In [15]:
import numpy as np

In [16]:
feature_names = np.array(cv.get_feature_names_out())
feature_names

array(['00', '000', '007', ..., 'zoom', 'zorro', 'zucco'], dtype=object)

In [17]:
for idx, words in enumerate(lda.components_):
    total = words.sum()
    largest = words.argsort()[::-1]  # word indices, highest weight first
    print(f"topic {idx + 1:2}", end=" : ")

    # Top 10 words, each with its share of the topic's total weight (%)
    for i in range(10):
        print(f"{feature_names[largest[i]]}({words[largest[i]] * 100 / total:.2f})", end=" ")

    print()  # end the line once per topic

topic  1 : didn(0.59) worst(0.57) nothing(0.55) actors(0.48) actually(0.47) minutes(0.46) want(0.41) funny(0.41) script(0.40) re(0.39) 
topic  2 : book(1.21) original(0.61) van(0.44) version(0.40) king(0.39) match(0.38) series(0.38) lee(0.34) monster(0.33) sequel(0.33) 
topic  3 : gets(0.47) wife(0.46) guy(0.43) police(0.38) plays(0.34) doesn(0.33) woman(0.32) down(0.31) himself(0.30) goes(0.30) 
topic  4 : young(0.77) father(0.63) mother(0.57) woman(0.54) family(0.54) role(0.38) old(0.38) wife(0.38) son(0.37) performance(0.36) 
topic  5 : show(2.16) series(0.97) tv(0.77) episode(0.68) funny(0.65) years(0.65) now(0.63) kids(0.61) saw(0.58) old(0.55) 
topic  6 : us(0.39) work(0.38) director(0.37) world(0.34) without(0.32) may(0.31) own(0.31) between(0.30) audience(0.29) real(0.29) 
topic  7 : war(1.56) american(0.93) world(0.62) years(0.49) history(0.49) western(0.48) documentary(0.46) us(0.44) country(0.41) real(0.38) 
topi