## 05. Sentiment Analysis

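Four elements of sentiment analysis: the entity (or its aspects), the sentiment, the opinion holder, and the time of utterance \
Sentiment analysis steps: data collection -> subjectivity detection -> polarity detection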
![image.png](attachment:image.png)

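* Supervised learning: trains on the training data and its target labels. Nearly identical to any other classification task.
* Unsupervised learning: relies on a 'Lexicon', a kind of sentiment vocabulary dictionary.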
    
![image.png](attachment:image.png)
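As a rough illustration of the lexicon-based (unsupervised) idea, the sketch below scores a text by summing per-word polarity values. The `TOY_LEXICON` entries and the `lexicon_polarity` helper are hypothetical, made up for this example; real lexicons such as SentiWordNet or VADER are far richer.

```python
# Hypothetical toy lexicon: word -> polarity score (assumed values, for illustration only)
TOY_LEXICON = {"good": 1.0, "great": 1.5, "bad": -1.0, "boring": -1.5}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Return 1 (positive) if the summed word scores are >= 0, else 0 (negative)."""
    tokens = text.lower().split()
    # Words missing from the lexicon contribute nothing to the score
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    return 1 if score >= 0 else 0


print(lexicon_polarity("a great and good film"))
print(lexicon_polarity("boring and bad"))
```

No labels are needed for this approach, which is why it is treated as unsupervised; its quality depends entirely on the lexicon.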

#### Supervised Sentiment Analysis Practice - IMDB Movie Reviews

Uses the IMDB movie reviews from Kaggle \
https://www.kaggle.com/c/word2vec-nlp-tutorial/data

In [1]:
import pandas as pd

# Tab-separated file
# quoting = 3 (csv.QUOTE_NONE) makes the parser ignore double quotes
review_df = pd.read_csv('data/word2vec-nlp-tutorial/labeledTrainData.tsv', header = 0, sep = '\t', quoting = 3)
review_df.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


sentiment: 1 means a positive review, 0 a negative one\
review: the text of the movie review

In [2]:
print(review_df['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

Let's remove the `<br />` HTML tags and the special characters.\
Regular expressions make this kind of text processing easy.

In [4]:
import re

# Replace the <br /> tags with a space using the string replace function
review_df['review'] = review_df['review'].str.replace('<br />', ' ')

# Use the re module to replace every non-alphabetic character with a space
# For each x, anything matching [^a-zA-Z] (i.e. not a letter) becomes " "
review_df['review'] = review_df['review'].apply(lambda x: re.sub("[^a-zA-Z]"," ", x))

In [5]:
print(review_df['review'][0])

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him   The actual feature film bit when it finally starts is only on for  

In [6]:
# Split into training and test data
from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df.drop(['id', 'sentiment'], axis = 1, inplace = False)

X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size = 0.3, random_state = 156)
X_train.shape, X_test.shape

((17500, 1), (7500, 1))

In [10]:
# Modeling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Run CountVectorizer and logistic regression as a single pipeline
pipeline = Pipeline([('cnt_vect', CountVectorizer(stop_words = 'english', ngram_range = (1,2))),
                    ('lr_clf', LogisticRegression(C = 10))])

pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

# Note: roc_auc_score receives the hard labels (pred) here, which is what produced the value below;
# passing pred_probs would be the more usual input for ROC-AUC
print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred),roc_auc_score(y_test, pred)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy: 0.8860, ROC-AUC: 0.8859
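The ROC-AUC above was computed from the hard 0/1 predictions (`pred`) rather than from `pred_probs`. ROC-AUC measures how well scores rank positives above negatives, so thresholded labels can discard ranking information. The sketch below uses a hand-rolled pairwise AUC to show the effect; the `pairwise_auc` helper and the toy data are assumptions for illustration, not scikit-learn's implementation.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed): the model ranks both positives above both negatives,
# but every probability is above the 0.5 cut-off
y_true = [0, 0, 1, 1]
probs = [0.55, 0.60, 0.70, 0.90]
labels = [1 if p >= 0.5 else 0 for p in probs]

auc_probs = pairwise_auc(y_true, probs)    # ranking preserved
auc_labels = pairwise_auc(y_true, labels)  # ranking lost after thresholding
print(auc_probs, auc_labels)
```

In the pipeline above, the probability-based variant would correspond to `roc_auc_score(y_test, pred_probs)`.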


In [11]:
pipeline.predict(X_test['review'])

array([0, 0, 0, ..., 1, 0, 0], dtype=int64)

In [12]:
pipeline.predict_proba(X_test['review'])

array([[9.99999776e-01, 2.24446720e-07],
       [9.62425178e-01, 3.75748218e-02],
       [9.85471048e-01, 1.45289518e-02],
       ...,
       [1.56097428e-01, 8.43902572e-01],
       [9.97565990e-01, 2.43400970e-03],
       [9.94172366e-01, 5.82763360e-03]])

In [13]:
# Run again with the TF-IDF vectorizer
pipeline = Pipeline([('tfidf_vect', TfidfVectorizer(stop_words = 'english', ngram_range = (1,2))),
                    ('lr_clf', LogisticRegression(C = 10))])

pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred),roc_auc_score(y_test, pred)))

Accuracy: 0.8936, ROC-AUC: 0.8934


#### Unsupervised Sentiment Analysis

Sentiment score: a number expressing how positive or negative a word is, determined with reference to the word's position, its neighboring words, the context, POS (Part of Speech), and so on.

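* NLTK's WordNet: provides semantic information for each word, called a Synset, so that meanings that change with context can be taken into account. Unfortunately, the predictive performance of NLTK's sentiment lexicon is not very good.
* SentiWordNet: similar to NLTK's WordNet. Assigns three sentiment scores to each synset: a positive score, a negative score, and an objectivity score.
* VADER: a sentiment analysis package aimed mainly at social media text. It is often used on large datasets thanks to its relatively fast run time and good results.
* Pattern: attracts the most attention in terms of predictive performance.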

#### Sentiment Analysis with SentiWordNet

In [15]:
import nltk
#nltk.download('all')

In [16]:
from nltk.corpus import wordnet as wn

term = 'present'

# Get the WordNet synsets for the word 'present'
synsets = wn.synsets(term)
print('synsets() return type : ', type(synsets))
print('synsets() number of returned values : ', len(synsets))
print('synsets() returned values : ', synsets)

synsets() return type :  <class 'list'>
synsets() number of returned values :  18
synsets() returned values :  [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]


A list of 18 synset objects, one per sense, is returned. The synset name 'present.n.01' reads as: the word present, the part of speech noun (n), and its first sense as a noun (01).
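The naming scheme can be unpacked with plain string handling. The sketch below is a minimal illustration; `parse_synset_name` and `POS_LABELS` are hypothetical helpers written for this example, not part of NLTK.

```python
# WordNet part-of-speech codes as they appear in synset names
POS_LABELS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}


def parse_synset_name(name):
    """Split a name like 'present.n.01' into (lemma, pos_code, sense_number)."""
    # rsplit from the right so a lemma containing dots would stay intact
    lemma, pos_code, sense = name.rsplit(".", 2)
    return lemma, pos_code, int(sense)


lemma, pos_code, sense = parse_synset_name("present.n.01")
print(lemma, POS_LABELS[pos_code], sense)
```

Names such as 'dense.s.04' in later outputs use the `s` code, which marks an adjective satellite.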

In [17]:
for synset in synsets:
    print('##### Synset name : ', synset.name(), '#####')
    print('POS : ', synset.lexname())
    print('Definition : ', synset.definition())
    print('Lemmas : ', synset.lemma_names())

##### Synset name :  present.n.01 #####
POS :  noun.time
Definition :  the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas :  ['present', 'nowadays']
##### Synset name :  present.n.02 #####
POS :  noun.possession
Definition :  something presented as a gift
Lemmas :  ['present']
##### Synset name :  present.n.03 #####
POS :  noun.communication
Definition :  a verb tense that expresses actions or states at the time of speaking
Lemmas :  ['present', 'present_tense']
##### Synset name :  show.v.01 #####
POS :  verb.perception
Definition :  give an exhibition of to an interested audience
Lemmas :  ['show', 'demo', 'exhibit', 'present', 'demonstrate']
##### Synset name :  present.v.02 #####
POS :  verb.communication
Definition :  bring forward and present to the mind
Lemmas :  ['present', 'represent', 'lay_out']
##### Synset name :  stage.v.01 #####
POS :  verb.creation
Definition :  perform (a play), especially on a stage
Lemmas :  

present.n.01 has the lexname noun.time: the noun sense relating to time, that is, 'the present'.\
present.n.02 has the lexname noun.possession and means 'a gift'.

The path_similarity() method expresses the relationship between one synset and another as a similarity score.

In [20]:
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

entities = [tree, lion, tiger, cat, dog]
similarities = []
entity_names = [entity.name().split('.')[0] for entity in entities]

# Loop over each word's synset and measure its similarity to every other synset
for entity in entities:
    similarity = [round(entity.path_similarity(compared_entity), 2)
                 for compared_entity in entities]
    similarities.append(similarity)
    
# Store the result as a DataFrame
similarity_df = pd.DataFrame(similarities, columns = entity_names, index = entity_names)
similarity_df

| | tree | lion | tiger | cat | dog |
|---|---|---|---|---|---|
| tree | 1.0 | 0.07 | 0.07 | 0.08 | 0.12 |
| lion | 0.07 | 1.0 | 0.33 | 0.25 | 0.17 |
| tiger | 0.07 | 0.33 | 1.0 | 0.25 | 0.17 |
| cat | 0.08 | 0.25 | 0.25 | 1.0 | 0.2 |
| dog | 0.12 | 0.17 | 0.17 | 0.2 | 1.0 |


Lion and tiger have the highest similarity at 0.33, while tree and lion (and equally tree and tiger) have the lowest at 0.07.
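These values follow from how path similarity is defined: the score is 1 / (shortest path length + 1) over the hypernym taxonomy. The sketch below reproduces the idea on a small hand-written taxonomy; `TOY_HYPERNYMS` is a simplified assumption typed in for illustration (it is not read from WordNet), and the helper functions are hypothetical.

```python
from collections import deque

# Simplified child -> parent links (assumed for illustration, not loaded from WordNet)
TOY_HYPERNYMS = {
    "lion": "big_cat",
    "tiger": "big_cat",
    "big_cat": "feline",
    "cat": "feline",
    "feline": "carnivore",
    "dog": "canine",
    "canine": "carnivore",
}


def shortest_path_length(a, b, hypernyms=TOY_HYPERNYMS):
    """Breadth-first search over the taxonomy treated as an undirected graph."""
    neighbors = {}
    for child, parent in hypernyms.items():
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path between the two nodes


def toy_path_similarity(a, b):
    """Score two nodes as 1 / (shortest path length + 1)."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1.0 / (dist + 1)


for pair in [("lion", "tiger"), ("lion", "cat"), ("cat", "dog"), ("lion", "dog")]:
    print(pair, round(toy_path_similarity(*pair), 2))
```

Lion and tiger are two steps apart through their shared parent, which gives 1 / 3, about 0.33; identical nodes are zero steps apart, which gives 1.0 on the diagonal.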

This time, let's use SentiWordNet instead of WordNet.

In [21]:
import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() return type : ', type(senti_synsets))
print('senti_synsets() number of returned values : ', len(senti_synsets))
print('senti_synsets() returned values : ', senti_synsets)

senti_synsets() return type :  <class 'list'>
senti_synsets() number of returned values :  11
senti_synsets() returned values :  [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]


When a word carries no sentiment, its negative and positive scores are 0 and its objectivity score is 1.

In [22]:
import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print('father positive score : ', father.pos_score())
print('father negative score : ', father.neg_score())
print('father objectivity score : ', father.obj_score())
print('\n')

fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score : ', fabulous.pos_score())
print('fabulous negative score : ', fabulous.neg_score())
print('fabulous objectivity score : ', fabulous.obj_score())

father positive score :  0.0
father negative score :  0.0
father objectivity score :  1.0


fabulous positive score :  0.875
fabulous negative score :  0.125
fabulous objectivity score :  0.0
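The three scores are linked: in both printed cases the objectivity score equals 1 minus the sum of the positive and negative scores. The sketch below checks that relation against the values above and also shows the net polarity (positive minus negative) that a document-level score could accumulate; the helper names are hypothetical.

```python
def objectivity(pos_score, neg_score):
    """Objectivity as the remainder after positive and negative mass."""
    return 1.0 - (pos_score + neg_score)


def net_polarity(pos_score, neg_score):
    """Signed contribution of one word sense: positive minus negative."""
    return pos_score - neg_score


# (pos, neg) pairs copied from the printed output above
father = (0.0, 0.0)
fabulous = (0.875, 0.125)

print("father   obj:", objectivity(*father), "net:", net_polarity(*father))
print("fabulous obj:", objectivity(*fabulous), "net:", net_polarity(*fabulous))
```

A neutral word like father contributes nothing to a summed score, while fabulous contributes 0.75.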


#### Sentiment Analysis of Movie Reviews with SentiWordNet [Unsupervised]

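1. Split the document into sentences
2. Tokenize each sentence into words and tag their parts of speech
3. Create synset and senti_synset objects from the POS-tagged words
4. Sum the positive/negative scores and classify as positive or negative against a threshold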

In [27]:
from nltk.corpus import wordnet as wn

# Convert simple NLTK Penn Treebank tags to WordNet POS tags
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return

from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

## Compute the sentiment of a single document
def swn_polarity(text):
    # Initialize the sentiment score
    sentiment = 0.0
    tokens_count = 0
    
    lemmatizer = WordNetLemmatizer()
    
    # Split the document into sentences
    raw_sentences = sent_tokenize(text)
    
    ## Per-sentence processing
    # Tokenize and POS-tag each sentence, build senti_synsets, and sum the sentiment scores
    for raw_sentence in raw_sentences:
        
        # POS-tag the tokenized sentence with NLTK
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        
        ## Per-token processing
        for word, tag in tagged_sentence:
            # Map to a WordNet POS tag; only nouns, adjectives, and adverbs are scored
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
                continue
            # Lemmatize the word
            lemma = lemmatizer.lemmatize(word, pos = wn_tag)
            if not lemma:
                continue
                
            # Look up synsets for the lemma and its POS
            synsets = wn.synsets(lemma, pos = wn_tag)
            if not synsets:
                continue
            # Take the first (most common) sense and get its SentiWordNet entry
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            # For every word, add the positive score and subtract the negative score
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())
            tokens_count += 1
                
    # No scorable tokens: treat as negative (0)
    if not tokens_count:
        return 0
        
    # Total score of 0 or above is positive (1), otherwise negative (0)
    if sentiment >= 0:
        return 1
    return 0

In [28]:
train_df = review_df # reuse the review_df dataset created earlier
train_df['preds'] = train_df['review'].apply(lambda x: swn_polarity(x))
y_target = train_df['sentiment'].values
preds = train_df['preds'].values

In [29]:
# Performance
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score, roc_auc_score
import numpy as np

print(confusion_matrix(y_target, preds))
print("Accuracy:", np.round(accuracy_score(y_target, preds), 4))
print("Precision:", np.round(precision_score(y_target, preds), 4))
print("Recall:", np.round(recall_score(y_target, preds), 4))

[[7668 4832]
 [3636 8864]]
Accuracy: 0.6613
Precision: 0.6472
Recall: 0.7091
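The three metrics can be recomputed by hand from the confusion matrix, which scikit-learn lays out as [[TN, FP], [FN, TP]] for binary labels 0 and 1. The sketch below is a worked check using only the four counts printed above.

```python
# Counts copied from the confusion matrix above: rows are true labels, columns are predictions
tn, fp = 7668, 4832
fn, tp = 3636, 8864

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all reviews classified correctly
precision = tp / (tp + fp)     # share of predicted positives that are truly positive
recall = tp / (tp + fn)        # share of true positives that were found

print(total, round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The 4,832 false positives are what hold precision down: with a cut-off of score >= 0, many negative reviews still end up on the positive side.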


#### Sentiment Analysis with VADER

The polarity_scores() method of SentimentIntensityAnalyzer returns the sentiment scores for a document.

In [30]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(train_df['review'][0])
print(senti_scores)

{'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}
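To read this output: neg, neu, and pos are proportions that add up to about 1, and compound is a single aggregate score between -1 and 1. The sketch below checks this on the printed dictionary and turns compound into a 0/1 label; `label_from_compound` and its 0.1 cut-off are assumptions for illustration.

```python
# Scores copied from the printed output above
senti_scores = {'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}

# The three proportions should add up to roughly 1
proportions = senti_scores['neg'] + senti_scores['neu'] + senti_scores['pos']


def label_from_compound(compound, threshold=0.1):
    """Return 1 (positive) when compound reaches the threshold, else 0 (negative)."""
    return 1 if compound >= threshold else 0


predicted = label_from_compound(senti_scores['compound'])
print(round(proportions, 3), predicted)
```

This first review is labeled 1 (positive) in the dataset, so a compound of -0.7943 would misclassify it under any non-negative threshold.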


In [31]:
def vader_polarity(review, threshold = 0.1):
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    
    # Return 1 if the compound value is at or above the threshold, otherwise 0
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment

# Run the function above on every document to get predictions
review_df['vader_preds'] = review_df['review'].apply(lambda x: vader_polarity(x, 0.1))
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

print(confusion_matrix(y_target, vader_preds))
print("Accuracy:", np.round(accuracy_score(y_target, vader_preds), 4))
print("Precision:", np.round(precision_score(y_target, vader_preds), 4))
print("Recall:", np.round(recall_score(y_target, vader_preds), 4))

[[ 6736  5764]
 [ 1867 10633]]
Accuracy: 0.6948
Precision: 0.6485
Recall: 0.8506


## 06. Topic Modeling - 20 Newsgroups

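Topic modeling extracts a compact set of central words that effectively represent the hidden themes in a collection of documents. LSA and LDA are the most commonly used techniques.

Let's work through an example with LDA.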

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Extract 8 topics: motorcycles, baseball, graphics, X Window System, Middle East, Christianity, electronics, medicine
cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics','comp.windows.x',
       'talk.politics.mideast', 'soc.religion.christian','sci.electronics','sci.med']

# Fetch only the documents in the categories above
news_df = fetch_20newsgroups(subset = 'all', remove = ('headers', 'footers', 'quotes'),
                            categories = cats, random_state = 0)

# LDA is applied to count-based vectorization only
count_vect = CountVectorizer(max_df = 0.95, max_features = 1000, min_df = 2,
                            stop_words = 'english', ngram_range = (1,2))
feat_vect = count_vect.fit_transform(news_df.data)
print('CountVectorizer Shape : ', feat_vect.shape)

CountVectorizer Shape :  (7862, 1000)


In [3]:
news_df.data[:2]

['I appreciate if anyone can point out some good books about the dead sea\nscrolls of Qumran. Thanks in advance.',
 'hi all, i got several emails and a couple news replies and i guess i\nshoulda went into more detail... Being my anxiety level is peaking and you\nfolks have no clue who I am I may as well post the specifics and see what\nyou people think regarding my previous post.\nTo recap i applied to 20 schools total, 16 of which were MD and 4 DO.\n\nas it stands now i have had 13 rejects, 4 interviews( 2 MD and 2 DO), the\nresults of which are 2 waiting lists (1 MD and one DO)\n\n3 schools i heard nothing from at all.\n\nI have contacted all institutions other than the rejects and they have no\ninfo whatsoever to tell me.\n\nI have taken a good mix to apply to.. 2-3 top schools a bunch of middles\nand a few "safety"  (funny that most of my safety schools were the first\nto reject me)\n\nmy index is at like a 3.5 mcats were R7 P9 B10 WQ and R7 P9 B11 WR\nI couldnt get the damn readin

In [41]:
print(feat_vect)

  (0, 93)	1
  (0, 669)	1
  (0, 390)	1
  (0, 148)	1
  (0, 251)	1
  (0, 876)	1
  (0, 70)	1
  (0, 877)	1
  (1, 390)	1
  (1, 428)	1
  (1, 391)	1
  (1, 237)	1
  (1, 607)	1
  (1, 403)	1
  (1, 955)	2
  (1, 512)	2
  (1, 678)	2
  (1, 655)	2
  (1, 881)	2
  (1, 733)	1
  (1, 688)	1
  (1, 23)	1
  (1, 894)	1
  (1, 15)	1
  (1, 12)	1
  :	:
  (7858, 61)	3
  (7858, 864)	2
  (7858, 133)	1
  (7859, 511)	1
  (7859, 528)	1
  (7859, 782)	1
  (7859, 773)	1
  (7859, 54)	1
  (7859, 666)	1
  (7859, 159)	1
  (7859, 387)	1
  (7859, 126)	1
  (7860, 876)	1
  (7860, 70)	1
  (7860, 877)	1
  (7860, 428)	1
  (7860, 678)	1
  (7860, 922)	1
  (7860, 243)	1
  (7860, 795)	1
  (7860, 911)	1
  (7860, 682)	1
  (7860, 909)	1
  (7860, 490)	1
  (7861, 973)	1


In [34]:
# Run LDA topic modeling with 8 topics
lda = LatentDirichletAllocation(n_components = 8, random_state = 0)
lda.fit(feat_vect)
print(lda.components_.shape)
lda.components_

(8, 1000)


array([[3.60992018e+01, 1.35626798e+02, 2.15751867e+01, ...,
        3.02911688e+01, 8.66830093e+01, 6.79285199e+01],
       [1.25199920e-01, 1.44401815e+01, 1.25045596e-01, ...,
        1.81506995e+02, 1.25097844e-01, 9.39593286e+01],
       [3.34762663e+02, 1.25176265e-01, 1.46743299e+02, ...,
        1.25105772e-01, 3.63689741e+01, 1.25025218e-01],
       ...,
       [3.60204965e+01, 2.08640688e+01, 4.29606813e+00, ...,
        1.45056650e+01, 8.33854413e+00, 1.55690009e+01],
       [1.25128711e-01, 1.25247756e-01, 1.25005143e-01, ...,
        9.17278769e+01, 1.25177668e-01, 3.74575887e+01],
       [5.49258690e+01, 4.47009532e+00, 9.88524814e+00, ...,
        4.87048440e+01, 1.25034678e-01, 1.25074632e-01]])

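The components_ attribute gives, for each topic, a number indicating how strongly each word feature was assigned to that topic. The larger the value, the more central the word is to the topic.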
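Because the raw values act like pseudo-counts, dividing each row by its sum turns it into a per-topic word distribution, and sorting a row in descending order yields the topic's central words. The sketch below shows both steps on a tiny made-up matrix; `toy_components` and `toy_vocab` are hypothetical stand-ins for `lda.components_` and the vectorizer's vocabulary.

```python
import numpy as np

# Hypothetical 2-topic x 3-word matrix standing in for lda.components_
toy_components = np.array([[8.0, 1.5, 0.5],
                           [0.5, 1.5, 4.0]])
toy_vocab = np.array(['game', 'image', 'god'])

# Normalize each row so that it sums to 1: a word distribution per topic
topic_word_dist = toy_components / toy_components.sum(axis=1, keepdims=True)

# For each topic, take the indexes of the 2 largest values (argsort ascending, then reversed)
top_words = [toy_vocab[row.argsort()[::-1][:2]].tolist() for row in toy_components]

print(topic_word_dist)
print(top_words)
```

Normalizing does not change the ordering within a row, so the top words are the same whether the raw or the normalized values are sorted.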

To inspect this more intuitively, let's write a display_topics() function that lists the words for each topic in descending order of relevance.

In [45]:
def display_topics(model, feature_names, no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('Topic #', topic_index)
        
        # Sort this topic's row of components_ in descending order
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes = topic_word_indexes[:no_top_words]
        
        feature_concat = ' '.join([feature_names[i] for i in top_indexes])
        print(feature_concat)
        
# Get the names of all words in the CountVectorizer object
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = count_vect.get_feature_names_out()

# Show the top 15 most relevant words for each topic
display_topics(lda, feature_names, 15)

Topic # 0
year 10 game medical health team 12 20 disease cancer 1993 games years patients good
Topic # 1
don just like know people said think time ve didn right going say ll way
Topic # 2
image file jpeg program gif images output format files color entry 00 use bit 03
Topic # 3
like know don think use does just good time book read information people used post
Topic # 4
armenian israel armenians jews turkish people israeli jewish government war dos dos turkey arab armenia 000
Topic # 5
edu com available graphics ftp data pub motif mail widget software mit information version sun
Topic # 6
god people jesus church believe christ does christian say think christians bible faith sin life
Topic # 7
use dos thanks windows using window does display help like problem server need know run
