# Text Analysis
***

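## Topic Modeling

Topic modeling is the task of discovering the themes that run through a collection of documents. The two main machine-learning approaches are **LSA (Latent Semantic Analysis)** and **LDA (Latent Dirichlet Allocation)**.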

### 1. LDA

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
categories = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', 'comp.windows.x', 'talk.politics.mideast',
              'soc.religion.christian', 'sci.electronics', 'sci.med']

In [3]:
news = fetch_20newsgroups(subset = 'all', remove = ['headers', 'footers', 'quotes'], categories = categories)

In [4]:
tfidf = TfidfVectorizer(max_features = 2000, stop_words = 'english', lowercase = True, ngram_range = (1, 2))

In [5]:
tfidf_vect = tfidf.fit_transform(news.data)

In [6]:
lda = LatentDirichletAllocation(random_state = 42, n_components = 8)

In [7]:
lda.fit(tfidf_vect)

LatentDirichletAllocation(n_components=8, random_state=42)

In [8]:
lda.components_.shape

(8, 2000)

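Each row of `lda.components_` corresponds to one of the 8 topics and each column to one of the 2,000 features. A value indicates how strongly that feature is associated with that topic: the higher the value, the more central the word is to the topic.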
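The raw values in `components_` are unnormalized weights, so they are easier to compare across topics once each row is scaled to sum to 1. The following is a minimal sketch of that normalization on a tiny made-up corpus; the `docs` list and the names `toy_lda` and `topic_word` are illustrative assumptions, not part of the notebook above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only
docs = [
    "the bike has a new engine and rear wheel",
    "ride the motorcycle on the open road",
    "the doctor treated the patient for pain",
    "medical treatment can reduce the disease",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(X)

# Each row of components_ holds one topic's weights over all features.
# Dividing by the row sum turns the weights into a word distribution per topic.
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(topic_word.shape)        # (n_topics, n_features)
print(topic_word.sum(axis=1))  # each row sums to 1 after normalization
```

After this step, `topic_word[i, j]` can be read as the share of topic `i` attributed to feature `j`, which makes values comparable between topics of different overall size.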

In [28]:
import pandas as pd

In [45]:
# Transpose so that rows are features and columns are topics.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
comp = pd.DataFrame(lda.components_.T,
                    columns = [f'topic{i + 1}' for i in range(lda.n_components)],
                    index = tfidf.get_feature_names_out())

In [46]:
comp

| feature | topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 |
|---|---|---|---|---|---|---|---|---|
| 00 | 0.519589 | 0.125117 | 0.125703 | 10.620845 | 2.663954 | 0.414047 | 0.125030 | 5.449856 |
| 00 00 | 0.125159 | 0.125000 | 0.125000 | 2.161238 | 0.125006 | 0.125000 | 0.125000 | 0.125003 |
| 000 | 0.661151 | 0.125450 | 11.195645 | 0.484808 | 5.472321 | 0.125820 | 0.282969 | 10.315446 |
| 01 | 0.125143 | 0.125130 | 0.125040 | 5.751957 | 0.131944 | 0.125023 | 0.125022 | 2.631434 |
| 02 | 0.125396 | 0.125305 | 0.466798 | 4.638846 | 0.846349 | 0.125011 | 0.125021 | 2.381068 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| yes | 2.546639 | 5.671099 | 2.945623 | 6.233634 | 11.465248 | 3.603143 | 1.373197 | 16.442142 |
| yesterday | 0.125518 | 3.291266 | 1.612367 | 0.627138 | 1.641267 | 0.376250 | 0.125001 | 4.776229 |
| york | 8.206182 | 0.125045 | 2.444131 | 0.194254 | 4.348075 | 0.360954 | 0.125019 | 0.499134 |
| young | 0.618343 | 0.318258 | 0.919693 | 0.268605 | 7.373503 | 2.679148 | 0.125104 | 6.353781 |


Among the rows displayed above, 'york' has by far the largest `topic1` weight (about 8.21), so it can be judged a central word of topic 1.
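Reading the most heavily weighted word off a truncated printout is error-prone, because the hidden rows may contain larger values. As a sketch, pandas' `idxmax` and `nlargest` pick out the top words directly. The small frame below is rebuilt by hand from the `topic1` values of the rows displayed above, purely for illustration; the names `visible`, `top_word`, and `top_three` are assumptions.

```python
import pandas as pd

# topic1 weights copied from the rows shown in the output above
# (only the visible rows, not the full 2000-feature table)
visible = pd.DataFrame(
    {"topic1": [0.519589, 0.125159, 0.661151, 0.125143, 0.125396,
                2.546639, 0.125518, 8.206182, 0.618343]},
    index=["00", "00 00", "000", "01", "02", "yes", "yesterday", "york", "young"],
)

top_word = visible["topic1"].idxmax()      # label of the largest weight
top_three = visible["topic1"].nlargest(3)  # three largest weights, descending

print(top_word)                   # york
print(top_three.index.tolist())   # ['york', 'yes', '000']
```

Applied to the full `comp` frame, `comp.idxmax()` would return the top feature for every topic column at once.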

In [22]:
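def display_topic_words(model, features, topn):
    """Print the topn highest-weighted features for each topic."""
    for topic_index, topic in enumerate(model.components_):
        # Sort feature indices by weight, largest first, and keep the top topn
        topic_word_idx = topic.argsort()[::-1]
        top_idx = topic_word_idx[:topn]

        # Join the corresponding feature names into one line
        all_features = ' '.join([features[i] for i in top_idx])
        print(f'Topic # {topic_index + 1}\n{all_features}\n')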

In [23]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = tfidf.get_feature_names_out()

In [24]:
display_topic_words(lda, features, 15)

Topic # 1
new york york san writes colorado michael francisco san francisco bell baseball nl april angeles los california

Topic # 2
bike dod ride riding dog bikes just motorcycle road miles ve rear bmw don got

Topic # 3
israel israeli jews arab jewish arabs state muslims war people peace palestinian palestinians israelis killed

Topic # 4
thanks window file program know graphics does use windows files using edu mail server hi

Topic # 5
god people think jesus church say don believe just christian did know like christ christians

Topic # 6
doctor msg disease medical cause food pain don patients like know just treatment body people

Topic # 7
banks gordon gordon banks skepticism pitt pitt edu shameful edu soon battery heat ground chain helmet water

Topic # 8
like don just think year good game know time use ve games team does used
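The word lists above describe the topics themselves. To see which topic an individual document leans toward, a fitted model can also return a per-document topic distribution. The sketch below shows this on a tiny made-up corpus rather than the newsgroup data; `docs`, `toy_lda`, `doc_topic`, and `dominant_topic` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only
docs = [
    "the bike has a new engine and rear wheel",
    "ride the motorcycle on the open road",
    "the doctor treated the patient for pain",
    "medical treatment can reduce the disease",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)

# fit_transform returns one row per document and one column per topic;
# each row is that document's topic distribution
doc_topic = toy_lda.fit_transform(X)

# The column with the largest value is the document's dominant topic
dominant_topic = doc_topic.argmax(axis=1)

print(doc_topic.shape)
print(dominant_topic)
```

With the notebook's model, the equivalent call would be `lda.transform(tfidf_vect)`, giving an array with one row per news post and 8 columns.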

