<a href="https://colab.research.google.com/github/hdpark1208/StudyCode/blob/main/NLP/NLP_TopicModeling(LSA).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Topic Modeling
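> A topic is the subject matter of a text. Topic modeling is a family of statistical models used in machine learning and natural language processing to discover the abstract topics in a collection of documents. It is a text-mining technique for uncovering the hidden semantic structure of a body of text.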

## Singular Value Decomposition (SVD)

![image.png](attachment:image.png)

Reference: https://angeloyeo.github.io/2019/08/01/SVD.html
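
For reference, the factorization used in the rest of this notebook can be written out as below. The dimension names $m$ (documents), $n$ (terms), and $k$ (topics kept) are notation introduced here for illustration, not taken from the original figure.

$$
A = U \Sigma V^{\mathsf{T}}, \qquad
A \in \mathbb{R}^{m \times n},\;
U \in \mathbb{R}^{m \times m},\;
\Sigma \in \mathbb{R}^{m \times n},\;
V \in \mathbb{R}^{n \times n}
$$

$$
U^{\mathsf{T}} U = I, \qquad V^{\mathsf{T}} V = I, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots), \quad \sigma_1 \ge \sigma_2 \ge \dots \ge 0
$$

Truncated SVD keeps only the $k$ largest singular values and the matching singular vectors:

$$
A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k^{\mathsf{T}} \in \mathbb{R}^{k \times n}
$$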

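## Latent Semantic Analysis (LSA)
> A plain DTM or TF-IDF matrix is built from word frequencies alone, so it cannot capture the meaning of words.  
> The basic idea of LSA is to apply truncated SVD to the DTM or TF-IDF matrix, reducing its dimensionality and drawing out the latent meaning of the words.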

### Full SVD

In [None]:
import numpy as np
A=np.array([[0,0,0,1,0,1,1,0,0],[0,0,0,1,1,0,1,0,0],[0,1,1,0,2,0,0,0,0],[1,0,0,0,0,0,0,1,1]])
print(A) # treat A as a DTM (4 documents x 9 terms)
np.shape(A)

[[0 0 0 1 0 1 1 0 0]
 [0 0 0 1 1 0 1 0 0]
 [0 1 1 0 2 0 0 0 0]
 [1 0 0 0 0 0 0 1 1]]


(4, 9)

In [None]:
U, s, VT = np.linalg.svd(A,full_matrices = True)
print(U.round(2))
np.shape(U)

[[ 0.24  0.75  0.    0.62]
 [ 0.51  0.44 -0.   -0.74]
 [ 0.83 -0.49 -0.    0.27]
 [ 0.   -0.    1.   -0.  ]]


(4, 4)

In [None]:
print(s.round(2)) # returned as a 1-D array of singular values, not as a diagonal matrix
np.shape(s)

[2.69 2.05 1.73 0.77]


(4,)

In [None]:
S = np.zeros((4,9)) # zero matrix with the shape of the (rectangular) diagonal matrix
S[:4, :4] = np.diag(s) # place the singular values on the diagonal
print(S.round(2))
np.shape(S)

[[2.69 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.   2.05 0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.   1.73 0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.77 0.   0.   0.   0.   0.  ]]


(4, 9)

In [None]:
print(VT.round(2))
np.shape(VT)

[[ 0.    0.31  0.31  0.28  0.8   0.09  0.28  0.    0.  ]
 [ 0.   -0.24 -0.24  0.58 -0.26  0.37  0.58 -0.   -0.  ]
 [ 0.58 -0.    0.    0.   -0.    0.   -0.    0.58  0.58]
 [-0.    0.35  0.35 -0.16 -0.25  0.8  -0.16  0.    0.  ]
 [-0.   -0.78 -0.01 -0.2   0.4   0.4  -0.2   0.    0.  ]
 [-0.29  0.31 -0.78 -0.24  0.23  0.23  0.01  0.14  0.14]
 [-0.29 -0.1   0.26 -0.59 -0.08 -0.08  0.66  0.14  0.14]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19  0.75 -0.25]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19 -0.25  0.75]]


(9, 9)

* np.allclose(): returns True if the two arrays are element-wise equal within a tolerance

In [None]:
np.allclose(A,np.dot(np.dot(U,S),VT).round(2))

True
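
As a side note, the manual construction of the 4 x 9 matrix `S` can be avoided by asking NumPy for the reduced SVD. The sketch below is an illustration only; the names `U_r`, `s_r`, `VT_r`, and `A_rebuilt` are introduced here and are not part of the original notebook.

```python
import numpy as np

A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]])

# full_matrices=False keeps only the first min(m, n) singular vectors,
# so VT_r has 4 rows instead of 9 and np.diag(s_r) can be used directly.
U_r, s_r, VT_r = np.linalg.svd(A, full_matrices=False)
print(U_r.shape, s_r.shape, VT_r.shape)  # (4, 4) (4,) (4, 9)

# The product of the three factors reproduces A up to floating-point error.
A_rebuilt = U_r @ np.diag(s_r) @ VT_r
print(np.allclose(A, A_rebuilt))
```

Because the dropped rows of `VT` are multiplied by the all-zero columns of `S` in the full version, both forms reconstruct the same matrix.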

### Truncated SVD

In [None]:
U = U[:,:2]
S = S[:2,:2]
VT = VT[:2,:]

print(U)
print(S)
print(VT)
print(np.shape(U),np.shape(S),np.shape(VT))

[[ 2.39751712e-01  7.51083898e-01]
 [ 5.06077194e-01  4.44029376e-01]
 [ 8.28495619e-01 -4.88580485e-01]
 [ 7.37437945e-17 -1.55319324e-17]]
[[2.68731789 0.        ]
 [0.         2.04508425]]
[[ 5.55111512e-17  3.08298331e-01  3.08298331e-01  2.77536539e-01
   8.04917216e-01  8.92159849e-02  2.77536539e-01  5.17165587e-17
   5.17165587e-17]
 [ 1.11022302e-16 -2.38904821e-01 -2.38904821e-01  5.84383395e-01
  -2.60689306e-01  3.67263060e-01  5.84383395e-01 -4.00759068e-17
  -4.00759068e-17]]
(4, 2) (2, 2) (2, 9)


In [None]:
A # compare the original A with its rank-2 reconstruction from the truncated SVD

array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 2, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 1]])

In [None]:
np.dot(np.dot(U,S),VT).round(2)

array([[ 0.  , -0.17, -0.17,  1.08,  0.12,  0.62,  1.08, -0.  , -0.  ],
       [ 0.  ,  0.2 ,  0.2 ,  0.91,  0.86,  0.45,  0.91,  0.  ,  0.  ],
       [ 0.  ,  0.93,  0.93,  0.03,  2.05, -0.17,  0.03,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ]])
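
The rank-2 reconstruction above is close to `A` but not equal to it. How much is lost can be measured directly: for a truncated SVD, the Frobenius norm of the error equals the square root of the sum of the squared singular values that were dropped (the Eckart-Young result). The following is a minimal sketch of that check; `k`, `A_k`, `frob_error`, and `discarded` are names chosen here for illustration.

```python
import numpy as np

A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]])

U, s, VT = np.linalg.svd(A, full_matrices=False)

k = 2  # number of topics kept
A_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]

# np.linalg.norm on a 2-D array defaults to the Frobenius norm.
frob_error = np.linalg.norm(A - A_k)
# Energy carried by the singular values that were discarded.
discarded = np.sqrt(np.sum(s[k:] ** 2))

print(round(float(frob_error), 4), round(float(discarded), 4))
```

The two printed numbers should agree, which is one way to choose `k`: keep adding components until the discarded singular values are small enough for the task.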

In [None]:
print(np.shape(A))
print(np.shape(U))
print(np.shape(VT))

(4, 9)
(4, 2)
(2, 9)


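What the truncated U, S, and VT represent:  
A: number of documents x number of terms: 4 x 9  
U: number of documents x number of topics: 4 x 2  
(each row of U can be viewed as a document vector, a numeric representation of that document's latent meaning)  
S: number of topics x number of topics: 2 x 2  
(the diagonal holds the singular values, which weight each topic)  
VT: number of topics x number of terms: 2 x 9  
(each column of VT can be viewed as a word vector, a numeric representation of that term's latent meaning)  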
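
To make the interpretation above concrete, here is a small sketch that builds document vectors and word vectors from the truncated factors and compares documents by cosine similarity. Scaling the vectors by the singular values is one common convention and is an assumption of this sketch; the helper `cosine` and the variable names are hypothetical.

```python
import numpy as np

A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]])

U, s, VT = np.linalg.svd(A, full_matrices=False)
k = 2

doc_vectors = U[:, :k] * s[:k]        # one row per document, shape (4, 2)
word_vectors = VT[:k, :].T * s[:k]    # one row per term, shape (9, 2)

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is (numerically) zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 1e-12 else 0.0

sim_01 = cosine(doc_vectors[0], doc_vectors[1])  # documents 1 and 2 share two terms
sim_02 = cosine(doc_vectors[0], doc_vectors[2])  # documents 1 and 3 share none
print(round(sim_01, 3), round(sim_02, 3))
```

The fourth document shares no terms with the others, and its singular value is only the third largest, so with `k = 2` its vector is numerically zero. This is why its row in the rank-2 reconstruction is all zeros.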

## Practice

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True,random_state=1,remove=('headers','footers','quotes'))
documents = dataset.data
print(type(dataset))
print(type(documents))

<class 'sklearn.utils.Bunch'>
<class 'list'>


In [None]:
print(len(documents))
documents[1]

11314


"\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim.  And I'm sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) \n--\nBake Timmons, III"

In [None]:
dataset.target_names # the 20 categories

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

* Text Preprocessing

In [None]:
test = pd.DataFrame({'a':["it's just test","hi i am boy","가 나다 라"]})
test

Unnamed: 0,a
0,it's just test
1,hi i am boy
2,가 나다 라


In [None]:
test['a'] = test['a'].apply(lambda x:'-'.join([w for w in x.split()]))
# split on whitespace, then join the pieces back together with '-' as the separator
test

Unnamed: 0,a
0,it's-just-test
1,hi-i-am-boy
2,가-나다-라


In [None]:
news_df = pd.DataFrame({'document':documents})
news_df.head()

Unnamed: 0,document
0,Well i'm not sure about the story nad it did s...
1,"\n\n\n\n\n\n\nYeah, do you expect people to re..."
2,Although I realize that principle is not one o...
3,Notwithstanding all the legitimate fuss about ...
4,"Well, I will have to change the scoring on my ..."


In [None]:
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z]", " ", regex=True) # replace every non-letter character with a space (regex=True must be explicit in pandas 2.0+)
news_df.head()

Unnamed: 0,document,clean_doc
0,Well i'm not sure about the story nad it did s...,Well i m not sure about the story nad it did s...
1,"\n\n\n\n\n\n\nYeah, do you expect people to re...",Yeah do you expect people to read the ...
2,Although I realize that principle is not one o...,Although I realize that principle is not one o...
3,Notwithstanding all the legitimate fuss about ...,Notwithstanding all the legitimate fuss about ...
4,"Well, I will have to change the scoring on my ...",Well I will have to change the scoring on my ...


In [None]:
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x:' '.join([w for w in x.split() if len(w)>3]))
# split on whitespace
# keep only the words longer than 3 characters and re-join them with spaces
news_df.head()

Unnamed: 0,document,clean_doc
0,Well i'm not sure about the story nad it did s...,Well sure about story seem biased What disagre...
1,"\n\n\n\n\n\n\nYeah, do you expect people to re...",Yeah expect people read actually accept hard a...
2,Although I realize that principle is not one o...,Although realize that principle your strongest...
3,Notwithstanding all the legitimate fuss about ...,Notwithstanding legitimate fuss about this pro...
4,"Well, I will have to change the scoring on my ...",Well will have change scoring playoff pool Unf...


In [None]:
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x:x.lower())
# convert to lowercase
news_df.head()

Unnamed: 0,document,clean_doc
0,Well i'm not sure about the story nad it did s...,well sure about story seem biased what disagre...
1,"\n\n\n\n\n\n\nYeah, do you expect people to re...",yeah expect people read actually accept hard a...
2,Although I realize that principle is not one o...,although realize that principle your strongest...
3,Notwithstanding all the legitimate fuss about ...,notwithstanding legitimate fuss about this pro...
4,"Well, I will have to change the scoring on my ...",well will have change scoring playoff pool unf...
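
The three cleaning steps above (strip non-letters, drop short words, lowercase) can also be bundled into one function and applied in a single pass. This is a minimal sketch using only the standard library; `clean_text` and its `min_len` parameter are hypothetical names, not part of the original notebook.

```python
import re

def clean_text(text, min_len=4):
    """Apply the notebook's three cleaning steps to one document string."""
    # 1. Replace every non-letter character with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 2. Keep only words with at least min_len characters (i.e. longer than 3).
    kept = [w for w in letters_only.split() if len(w) >= min_len]
    # 3. Re-join with single spaces and lowercase the result.
    return " ".join(kept).lower()

sample = "Well i'm not sure about the story, it DID seem biased!"
print(clean_text(sample))  # well sure about story seem biased
```

With a DataFrame, the same function could be applied as `news_df['document'].apply(clean_text)`, which avoids building two intermediate columns.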


* Stopword removal

In [None]:
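import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True) # fetch the stopword corpus if it is not already installed
stop_words = stopwords.words('english') # load the English stopword list from NLTK
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split()) # tokenize on whitespace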
tokenized_doc[:5]

0    [well, sure, about, story, seem, biased, what,...
1    [yeah, expect, people, read, actually, accept,...
2    [although, realize, that, principle, your, str...
3    [notwithstanding, legitimate, fuss, about, thi...
4    [well, will, have, change, scoring, playoff, p...
Name: clean_doc, dtype: object

In [None]:
tokenized_doc = tokenized_doc.apply(lambda x : [item for item in x if item not in stop_words])
tokenized_doc[0][:10] # first 10 tokens of document 0 after stopword removal

['well',
 'sure',
 'story',
 'seem',
 'biased',
 'disagree',
 'statement',
 'media',
 'ruin',
 'israels']
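
The filter above tests each token against a Python list, which scans the list on every lookup. Converting the stopwords to a set first makes each membership test a constant-time operation on average, which matters across 11,314 documents. The sketch below uses a tiny made-up stopword list so that it runs without NLTK data; `remove_stopwords` and `toy_stop_set` are hypothetical names.

```python
def remove_stopwords(tokens, stop_set):
    """Return the tokens that are not in stop_set, preserving their order."""
    return [t for t in tokens if t not in stop_set]

# Toy stand-in for stopwords.words('english'); build the set once, reuse it for every document.
toy_stop_set = frozenset(["about", "what", "that", "your"])

tokens = ["well", "sure", "about", "story", "what", "biased"]
print(remove_stopwords(tokens, toy_stop_set))  # ['well', 'sure', 'story', 'biased']
```

In the notebook's setting this would amount to something like `stop_set = set(stop_words)` followed by the same `apply` call with `stop_set` in place of `stop_words`.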

* Detokenization

In [None]:
# rebuild each document as a single string now that the stopwords are removed
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)
    
news_df['clean_doc'] = detokenized_doc
detokenized_doc[0]

'well sure story seem biased disagree statement media ruin israels reputation rediculous media israeli media world lived europe realize incidences described letter occured media whole seem ignore subsidizing israels existance europeans least degree think might reason report clearly atrocities shame austria daily reports inhuman acts commited israeli soldiers blessing received government makes holocaust guilt away look jews treating races power unfortunate'

In [None]:
news_df['clean_doc'].head()

0    well sure story seem biased disagree statement...
1    yeah expect people read actually accept hard a...
2    although realize principle strongest points wo...
3    notwithstanding legitimate fuss proposal much ...
4    well change scoring playoff pool unfortunately...
Name: clean_doc, dtype: object

In [None]:
len(news_df['clean_doc'])

11314

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=1000, # keep only the 1,000 most frequent terms
                             max_df = 0.5,
                             smooth_idf=True)
X = vectorizer.fit_transform(news_df['clean_doc'])
X.shape # 행렬 크기 확인

(11314, 1000)
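
To show what the values in `X` mean, here is a rough NumPy sketch of the weighting that `TfidfVectorizer` applies with `smooth_idf=True` and its default L2 row normalization, as described in the scikit-learn documentation. The 3 x 3 count matrix is made up for illustration, and the sketch ignores `max_df`, `max_features`, and stopword filtering.

```python
import numpy as np

# Toy term counts: 3 documents (rows) x 3 terms (columns).
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
df = np.count_nonzero(counts, axis=0)        # number of documents containing each term

# Smoothed IDF: acts as if one extra document containing every term had been seen.
idf = np.log((1 + n_docs) / (1 + df)) + 1

tfidf = counts * idf                                     # weight raw counts by IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)    # L2-normalize each document row

print(idf.round(3))
print(tfidf.round(3))
```

A term that occurs in every document (the first column here) gets the minimum IDF of 1, while rarer terms are weighted up. In the notebook itself, `max_df=0.5` goes one step further and drops terms that appear in more than half of the documents.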

* Topic Modeling

In [None]:
from sklearn.decomposition import TruncatedSVD # SVD
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)
svd_model.fit(X)
len(svd_model.components_)

20

In [None]:
np.shape(svd_model.components_) # corresponds to VT

(20, 1000)
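
`components_` plays the role of VT. The document-side matrix (the counterpart of U scaled by the singular values) comes from `transform`. The sketch below runs `TruncatedSVD` on the small 4 x 9 matrix from earlier rather than on the newsgroup data, so that it finishes instantly; `toy_svd` and `doc_topic` are names introduced here.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=float)

toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_svd.fit(A)

# Each row is one document expressed in the 2-dimensional topic space.
doc_topic = toy_svd.transform(A)

print(toy_svd.components_.shape)  # (2, 9)  topics x terms
print(doc_topic.shape)            # (4, 2)  documents x topics

# transform() projects the rows of A onto the topic axes stored in components_.
print(np.allclose(doc_topic, A @ toy_svd.components_.T))
```

For the newsgroup model, `svd_model.transform(X)` would give an 11314 x 20 matrix in the same way, which can be used for document similarity or clustering.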

In [None]:
svd_model.components_[0] # weights of the 1,000 terms for topic 0

array([0.01469447, 0.05019033, 0.02132607, 0.03099971, 0.01786723,
       0.01260325, 0.01715725, 0.01224439, 0.01126587, 0.06126056,
       0.01114634, 0.01188523, 0.00909293, 0.03608965, 0.0119157 ,
       0.0424128 , 0.01857113, 0.00890044, 0.03624319, 0.01657918,
       0.02174245, 0.01354628, 0.01112828, 0.01230344, 0.01455431,
       0.02946656, 0.0115149 , 0.00979221, 0.00642394, 0.00974007,
       0.03633115, 0.01418675, 0.02011534, 0.04356877, 0.01712549,
       0.01251488, 0.01446864, 0.029691  , 0.02643804, 0.01646404,
       0.01288599, 0.02301893, 0.03350333, 0.01178545, 0.01300871,
       0.01944342, 0.01244579, 0.00888671, 0.03220778, 0.01262578,
       0.02545491, 0.01232535, 0.00834398, 0.01797053, 0.01751491,
       0.01108651, 0.01278451, 0.03838722, 0.01352356, 0.02482804,
       0.02255126, 0.02324325, 0.01423322, 0.01096089, 0.01091531,
       0.01466118, 0.01037784, 0.01137089, 0.01225925, 0.0532616 ,
       0.02038375, 0.01650816, 0.04202284, 0.01134817, 0.01758

In [None]:
terms = vectorizer.get_feature_names_out().tolist() # the 1,000-term vocabulary in alphabetical order (get_feature_names() was removed in scikit-learn 1.2)
terms[:10]

['ability',
 'able',
 'accept',
 'access',
 'according',
 'account',
 'action',
 'actions',
 'actual',
 'actually']

* ndarray.argsort(): returns the indices that would sort the array in ascending order

In [None]:
a = np.array([1.5, 0.2, 4.2, 2.5])
s = a.argsort()

print(type(a))
print(s)
print(a[s])

<class 'numpy.ndarray'>
[1 0 3 2]
[0.2 1.5 2.5 4.2]


In [None]:
b = np.array([[3,2,1,4],[4,3,2,1],[1,2,3,4]])
s = b.argsort()
print(s)
print("\n")
print(b[0][s[0]],b[1][s[1]],b[2][s[2]])
print("\n")
print(s[::-1])
print("\n")
print(b[0][s[0][::-1]],b[1][s[1][::-1]],b[2][s[2][::-1]])


[[2 1 0 3]
 [3 2 1 0]
 [0 1 2 3]]


[1 2 3 4] [1 2 3 4] [1 2 3 4]


[[0 1 2 3]
 [3 2 1 0]
 [2 1 0 3]]


[4 3 2 1] [4 3 2 1] [4 3 2 1]
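
Note that in the output above `s[::-1]` reverses the order of the rows of the index array, not the order within each row. To get per-row descending indices for a 2-D array in one step, the column axis can be reversed instead. The following sketch shows this on the same `b`; `desc_idx` and `top2` are names introduced for illustration.

```python
import numpy as np

b = np.array([[3, 2, 1, 4],
              [4, 3, 2, 1],
              [1, 2, 3, 4]])

# Sort each row ascending, then reverse along the column axis for descending order.
desc_idx = np.argsort(b, axis=1)[:, ::-1]
print(desc_idx)
# [[3 0 1 2]
#  [0 1 2 3]
#  [3 2 1 0]]

# take_along_axis applies each row's indices to the matching row of b.
print(np.take_along_axis(b, desc_idx, axis=1))
# [[4 3 2 1]
#  [4 3 2 1]
#  [4 3 2 1]]

# Indices of the 2 largest values in every row.
top2 = desc_idx[:, :2]
print(top2)
```

This per-row "top n" pattern is the same operation that is later applied to each topic's row of term weights.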


In [None]:
svd_model.components_[19] # weights of the 1,000 terms for the last topic (component loadings, which can be negative, not TF-IDF values)

array([ 1.97575734e-03, -3.69580918e-02, -5.46641336e-03, -2.47811744e-02,
        2.10618109e-03, -3.24084255e-03, -3.09080912e-03, -8.41008376e-04,
        7.03964537e-03,  1.34951748e-02, -1.60699001e-03, -1.98147554e-03,
        4.60503093e-03, -1.11554157e-01,  1.17360504e-02,  3.28627459e-02,
       -2.07097105e-03,  1.31495461e-02, -1.31700442e-02,  3.70167909e-02,
        3.51813667e-04,  1.88475156e-03, -6.72352608e-03, -8.75746977e-03,
       -5.90888776e-03, -5.77533985e-03,  4.46013825e-03,  2.92565285e-03,
        2.35323479e-03, -3.74359598e-03, -2.81315282e-02, -1.72086900e-02,
       -7.02901044e-03,  1.83700398e-02,  1.46594705e-02, -4.83038701e-05,
        6.24345652e-03,  9.61445912e-03, -6.55360308e-02, -1.28855554e-02,
        1.12085328e-02, -9.88147278e-03, -2.61564251e-02,  7.37334068e-03,
       -3.47468806e-03,  4.40256753e-03, -1.15441920e-02, -5.44492023e-03,
        1.69835737e-02,  4.87786413e-03,  1.79735137e-03, -1.46804864e-03,
        3.34108013e-03,  

In [None]:
def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" %(idx+1),[(feature_names[i],topic[i].round(5)) for i in topic.argsort()[:-n-1:-1]]) # top n terms by weight (5 by default)
get_topics(svd_model.components_,terms)

Topic 1: [('like', 0.21386), ('know', 0.20046), ('people', 0.19293), ('think', 0.17805), ('good', 0.15128)]
Topic 2: [('thanks', 0.32888), ('windows', 0.29088), ('card', 0.18069), ('drive', 0.17455), ('mail', 0.15111)]
Topic 3: [('game', 0.37064), ('team', 0.32443), ('year', 0.28154), ('games', 0.2537), ('season', 0.18419)]
Topic 4: [('drive', 0.53324), ('scsi', 0.20165), ('hard', 0.15628), ('disk', 0.15578), ('card', 0.13994)]
Topic 5: [('windows', 0.40399), ('file', 0.25436), ('window', 0.18044), ('files', 0.16078), ('program', 0.13894)]
Topic 6: [('chip', 0.16114), ('government', 0.16009), ('mail', 0.15625), ('space', 0.1507), ('information', 0.13562)]
Topic 7: [('like', 0.67086), ('bike', 0.14236), ('chip', 0.11169), ('know', 0.11139), ('sounds', 0.10371)]
Topic 8: [('card', 0.46633), ('video', 0.22137), ('sale', 0.21266), ('monitor', 0.15463), ('offer', 0.14643)]
Topic 9: [('know', 0.46047), ('card', 0.33605), ('chip', 0.17558), ('government', 0.1522), ('video', 0.14356)]
Topic 10
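
The slice `[:-n-1:-1]` in `get_topics` walks the ascending `argsort` result backwards from its last element and stops after `n` items, which yields the indices of the `n` largest weights in descending order. Below is a variant that returns the pairs instead of printing them, run on a made-up 2 x 3 component matrix so the result is easy to check by eye. `top_terms`, `toy_components`, and `toy_terms` are hypothetical names.

```python
import numpy as np

def top_terms(components, feature_names, n=5):
    """Return, for each topic, the n highest-weighted (term, weight) pairs."""
    result = []
    for topic in components:
        # argsort is ascending; [:-n-1:-1] takes the last n indices in reverse order.
        top_idx = topic.argsort()[:-n - 1:-1]
        result.append([(feature_names[i], round(float(topic[i]), 5)) for i in top_idx])
    return result

toy_components = np.array([[0.1, 0.9, 0.5],
                           [0.7, 0.2, 0.3]])
toy_terms = ["apple", "banana", "cherry"]

topics = top_terms(toy_components, toy_terms, n=2)
for idx, pairs in enumerate(topics, start=1):
    print(f"Topic {idx}: {pairs}")
# Topic 1: [('banana', 0.9), ('cherry', 0.5)]
# Topic 2: [('apple', 0.7), ('cherry', 0.3)]
```

Returning a list makes it straightforward to store the topics in a DataFrame or compare runs, rather than reading them off printed output.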

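* Pros and cons of LSA
> In summary, LSA is quick and easy to implement, and because it can draw out the latent meaning of words it performs well in tasks such as computing document similarity. However, because of how SVD works, adding new data to an already computed LSA usually means recomputing it from scratch. In other words, it is hard to update with new information. This is one reason why neural-network-based methods such as Word2Vec, another way of turning word meaning into vectors, have recently drawn more attention than LSA.

* In other words, when new data is added, the result changes for all of the data, not just for the new rows.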
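
The update problem can be seen on the small 4 x 9 example. The sketch below adds one hypothetical new document (`new_doc`, a copy of the fourth document's words) in two ways: projecting it into the existing 2-topic space ("folding in", a common workaround that is an assumption here, not something the original notebook does), and recomputing the SVD from scratch.

```python
import numpy as np

A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]])

U, s, VT = np.linalg.svd(A, full_matrices=False)
k = 2

# Hypothetical new document using the same three words as the fourth document.
new_doc = np.array([1, 0, 0, 0, 0, 0, 0, 1, 1])

# Option 1: fold the new document into the OLD topic space without refitting.
folded = new_doc @ VT[:k, :].T
print(folded.round(2))  # the old two topics carry no weight on these words

# Option 2: recompute the SVD on the enlarged matrix.
A_new = np.vstack([A, new_doc])
s_new = np.linalg.svd(A_new, compute_uv=False)
print(s[:k].round(2), s_new[:k].round(2))
```

Folding in is cheap but leaves the new document essentially unrepresented, because the existing topics were never fitted to its words. Recomputing changes the leading singular values, and with them the topic axes used by every existing document, which is the drawback described above.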