### 토픽 모델링
- 머신러닝 기반의 토픽 모델링을 적용해 문서 집합에 숨어 있는 주제를 찾아냄
> 사람이 수행하는 토픽 모델링은 더 함축적인 의미로 문장을 요약하는 것에 반해 머신러닝 기반의 토픽 모델링은 숨겨진 주제를 효과적으로 표현할 수 있는 중심 단어를 함축적으로 추출
- LSA(Latent Sementic Analysis) 와 LDA(Latent Dirichlet Allocation) 기법
 - LSA는 단어-문서행렬(Word-Document Matrix), 단어-문맥행렬(window based co-occurrence matrix) 등 입력 데이터에 특이값 분해를 수행해 데이터의 차원수를 줄여 계산 효율성을 키우면서 행간에 숨어있는(latent) 의미를 이끌어내기 위한 방법론
 - LDA는 미리 알고 있는 주제별 단어수 분포를 바탕으로, 주어진 문서에서 발견된 단어수 분포를 분석, 해당 문서가 어떤 주제들을 함께 다루고 있을지를 예측
 
 Topic modeling is a machine learning technique that automatically analyzes text data to determine cluster words for a set of documents. This is known as 'unsupervised' machine learning because it doesn't require a predefined list of tags or training data that's been previously classified by humans.

In [6]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
"""
# 모토사이클, 야구, 그래픽스, 윈도우즈, 중동, 기독교, 전자공학, 의학 8개 주제를 추출
cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics'
        , 'comp.windows.x', 'talk.politics.guns'
        , 'soc.religion.christian', 'sci.electronics', 'sci.med']

# 위에서 cats 변수로 기재된 category만 추출. featch_20newsgroups( )의 categories에 cats 입력
news_df = fetch_20newsgroups(subset='all', remove=('header', 'footer', 'quotes'), categories=cats, random_state=0)

# LDA 는 Count기반의 Vectorizer만 적용
count_vect = CountVectorizer(max_df=0.95, max_features=1000, min_df=2, stop_words='english', ngram_range=(1, 2))

feat_vect = count_vect.fit_transform(news_df.data)
print(feat_vect.shape)
"""

# 모토사이클, 야구, 그래픽스, 윈도우즈, 중동, 기독교, 전자공학, 의학 8개 주제를 추출.
cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', \
        'comp.windows.x', 'talk.politics.mideast', 'soc.religion.christian',\
        'sci.electronics', 'sci.med'  ]
# 위에서 cats 변수로 기재된 category만 추출. featch_20newsgroups( )의 
# categories에 cats 입력
news_df = fetch_20newsgroups(subset='all', remove=('headers','footers','quotes'),
                            categories = cats, random_state=0)
# LDA 는 Count기반의 Vectorizer만 적용
count_vect = CountVectorizer(max_df=0.95, max_features=1000, min_df=2,\
                            stop_words='english', ngram_range=(1,2))
feat_vect = count_vect.fit_transform(news_df.data)
print(feat_vect.shape)
print(news_df.data[0])
#print(feat_vect[0].toarray())


(7862, 1000)
I appreciate if anyone can point out some good books about the dead sea
scrolls of Qumran. Thanks in advance.


In [6]:
feat_vect.toarray()[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,

In [7]:
lda = LatentDirichletAllocation(n_components=8, random_state=0)
lda.fit(feat_vect)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=8, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation

Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete dataset such as text corpora. It is also a topic model that is used for discovering abstract topics from a collection of documents.

The graphical model of LDA is a three-level generative mode:

![screenshot]("./images/lda_model_graph.png")

Note on notations presented in the graphical model above, which can be found in Hoffman et al. (2013):

The corpus is a collection of  documents.

A document is a sequence of  words.

There are  topics in the corpus.

The boxes represent repeated sampling.

In the graphical model, each node is a random variable and has a role in the generative process. A shaded node indicates an observed variable and an unshaded node indicates a hidden (latent) variable. In this case, words in the corpus are the only data that we observe. The latent variables determine the random mixture of topics in the corpus and the distribution of words in the documents. The goal of LDA is to use the observed words to infer the hidden topic structure.

In [8]:
print(lda.components_.shape)
lda.components_

(8, 1000)


array([[3.60992018e+01, 1.35626798e+02, 2.15751867e+01, ...,
        3.02911688e+01, 8.66830093e+01, 6.79285199e+01],
       [1.25199920e-01, 1.44401815e+01, 1.25045596e-01, ...,
        1.81506995e+02, 1.25097844e-01, 9.39593286e+01],
       [3.34762663e+02, 1.25176265e-01, 1.46743299e+02, ...,
        1.25105772e-01, 3.63689741e+01, 1.25025218e-01],
       ...,
       [3.60204965e+01, 2.08640688e+01, 4.29606813e+00, ...,
        1.45056650e+01, 8.33854413e+00, 1.55690009e+01],
       [1.25128711e-01, 1.25247756e-01, 1.25005143e-01, ...,
        9.17278769e+01, 1.25177668e-01, 3.74575887e+01],
       [5.49258690e+01, 4.47009532e+00, 9.88524814e+00, ...,
        4.87048440e+01, 1.25034678e-01, 1.25074632e-01]])

In [9]:
# argsort() 넘파이 배열의 원소를 오름차순으로 정렬하는 메소드임
import numpy as np
d1 = np.arange(10, 25)
print(d1)

d2 = d1.argsort() # 오름차순
print(d2)

topic_word_indexes = d1.argsort()[::-1] # 내림차순
print(topic_word_indexes)

top_indexes = topic_word_indexes[:10]
top_indexes

[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[14 13 12 11 10  9  8  7  6  5  4  3  2  1  0]


array([14, 13, 12, 11, 10,  9,  8,  7,  6,  5], dtype=int64)

In [10]:
def display_topics(model,feature_names,no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('Topic #', topic_index)
        # components_ array에서 가장 값이 큰 순으로 정렬했을 때, 그 값의 array index를 반환. 
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes=topic_word_indexes[:no_top_words]
        # top_indexes대상인 index별로 feature_names에 해당하는 word feature 추출 후 join으로 concat
        feature_concat = ' '.join([feature_names[i] for i in top_indexes]) # feature_names를 공백으로 join
        print(feature_concat)

# CountVectorizer객체내의 전체 word들의 명칭을 get_features_names( )를 통해 추출        
feature_names = count_vect.get_feature_names()

# Topic별 가장 연관도가 높은 word를 15개만 추출
display_topics(lda, feature_names, 15)

# cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', \
#         'comp.windows.x', 'talk.politics.mideast', 'soc.religion.christian',\
#         'sci.electronics', 'sci.med'  ]

Topic # 0
year 10 game medical health team 12 20 disease cancer 1993 games years patients good
Topic # 1
don just like know people said think time ve didn right going say ll way
Topic # 2
image file jpeg program gif images output format files color entry 00 use bit 03
Topic # 3
like know don think use does just good time book read information people used post
Topic # 4
armenian israel armenians jews turkish people israeli jewish government war dos dos turkey arab armenia 000
Topic # 5
edu com available graphics ftp data pub motif mail widget software mit information version sun
Topic # 6
god people jesus church believe christ does christian say think christians bible faith sin life
Topic # 7
use dos thanks windows using window does display help like problem server need know run


In [30]:
news20_df = fetch_20newsgroups(subset='all', remove=('headers','footers','quotes'))
print(news20_df.keys())
news20_df.target_names

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

copied from: 
https://scikit-learn.org/stable/datasets/index.html#newsgroups-dataset

### The 20 newsgroups text dataset

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date.

This module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups, returns a list of the raw texts that can be fed to text feature extractors such as sklearn.feature_extraction.text.CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use a feature extractor.

#### Usage

The sklearn.datasets.fetch_20newsgroups function is a data fetching / caching functions that downloads the data archive from the original 20 newsgroups website, extracts the archive contents in the ~/scikit_learn_data/20news_home folder and calls the sklearn.datasets.load_files on either the training or testing set folder, or both of them:

>from sklearn.datasets import fetch_20newsgroups
>
> newsgroups_train = fetch_20newsgroups(subset='train')
>
> from pprint import pprint
>
> pprint(list(newsgroups_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [11]:
# 모토사이클, 야구, 그래픽스, 윈도우즈, 중동, 기독교, 전자공학, 의학 8개 주제를 추출.
dogs = ['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

news_df0 = fetch_20newsgroups(subset='all', remove=('headers','footers','quotes')
                             , categories = dogs, random_state=0)
# LDA 는 Count기반의 Vectorizer만 적용
count_vect = CountVectorizer(
    max_df=0.95
    , max_features=1000
    , min_df=2
    , stop_words='english'
    , ngram_range=(1,2)
)

feat_vect = count_vect.fit_transform(news_df0.data)
print('shape: ', feat_vect.shape); print()
print(news_df0.data[0])
#print(feat_vect[0].toarray())

shape:  (18846, 1000)

FOR SALE

                 1945 King Feature Syndicate
                 Jaymar Specialty Company
                 200 Fifth Avenue New York, NY

                 Cardboard puzzle - NO BOX
                 Pieces worn from use
                 NO MISSING PIECES
                 Size: 13 3/4 inches by 21 1/2 inches
                 60 Puzzle Pieces

   Puzzle depicts Dagwood, Blondie, the kids, and dog Daisey with her
   puppies on a picnic with Dagwood and Alexander trying to get
   a fishing line out of a tree.


In [12]:
lda = LatentDirichletAllocation(n_components=12, random_state=0)
lda.fit(feat_vect)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=12, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [13]:
print(lda.components_.shape)
lda.components_

(12, 1000)


array([[9.39469574e-01, 8.33348059e-02, 8.33358581e-02, ...,
        8.33362650e-02, 8.33352742e-02, 8.33351868e-02],
       [2.88621344e-01, 2.44540281e+00, 1.91765266e-01, ...,
        5.06485320e+01, 8.07463074e-01, 2.50887610e+00],
       [8.33343439e-02, 7.99417886e+00, 2.59375616e+00, ...,
        1.51646327e+02, 2.68679498e+00, 1.93496691e+02],
       ...,
       [1.73463667e-01, 1.16912548e+02, 9.80581779e+00, ...,
        1.73620118e+02, 8.33343608e-02, 8.33349021e-02],
       [4.79149104e+01, 1.05611128e+02, 3.95812615e+00, ...,
        8.33365761e-02, 1.20410251e+02, 1.82289497e+01],
       [8.33345558e-02, 5.11426974e+02, 8.47992925e-02, ...,
        1.05589750e+02, 1.87311405e+02, 9.08093391e+01]])

In [14]:
# argsort() 넘파이 배열의 원소를 오름차순으로 정렬하는 메소드임
import numpy as np
d1 = np.arange(10, 25)
print(d1)

d2 = d1.argsort() # 오름차순
print(d2)

topic_word_indexes = d1.argsort()[::-1] # 내림차순
print(topic_word_indexes)

top_indexes = topic_word_indexes[:10]
top_indexes

[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[14 13 12 11 10  9  8  7  6  5  4  3  2  1  0]


array([14, 13, 12, 11, 10,  9,  8,  7,  6,  5], dtype=int64)

In [15]:
def display_topics(model,feature_names,no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('Topic #', topic_index)
        # components_ array에서 가장 값이 큰 순으로 정렬했을 때, 그 값의 array index를 반환. 
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes=topic_word_indexes[:no_top_words]
        # top_indexes대상인 index별로 feature_names에 해당하는 word feature 추출 후 join으로 concat
        feature_concat = ' '.join([feature_names[i] for i in top_indexes]) # feature_names를 공백으로 join
        print(feature_concat)

# CountVectorizer객체내의 전체 word들의 명칭을 get_features_names( )를 통해 추출        
feature_names = count_vect.get_feature_names()

# Topic별 가장 연관도가 높은 word를 15개만 추출
display_topics(lda, feature_names, 15)

Topic # 0
edu image file color jpeg images format files gif version quality 24 display convert programs
Topic # 1
key data available edu software graphics use ftp com chip pub mail information encryption bit
Topic # 2
time just said didn year don did like think game know got years went going
Topic # 3
ax ax ax max max ax ax max g9v b8f a86 g9v g9v pl 145 1d9 a86 a86 b8f b8f 34u
Topic # 4
10 00 25 12 11 15 20 16 14 17 13 50 18 19 30
Topic # 5
windows file use program dos window using problem thanks run does set com screen application
Topic # 6
god people don think just does like say believe know way good jesus make point
Topic # 7
know don thanks does mr think president db like just don know going ve mail ll
Topic # 8
games new team game sale hockey price players best offer good edu league list shipping
Topic # 9
drive like just use car used problem scsi ve power good don hard work card
Topic # 10
space university research information nasa new center 1993 program earth dos national scie

### Q. fetch_20newsgroups로 다음 작업을 수행
- TfidfVectorizer 방식으로 벡터 처리, lr 알고리즘으로 precision 포함하여 평가
- precision 기준으로 평가지수 높은 순으로 5개 그룹을 선정하여 토픽 모델링 수행
- 텍스트 분류 예측 정밀도와 그룹별로 토픽 모델링 성능간의 상관관계 기술

In [36]:
import numpy as np
import pandas as pd
#from numpy import r_
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


newsgroups_train = fetch_20newsgroups(
    subset='train'
    , remove=('headers','footers','quotes')
    , random_state=156
)

newsgroups_test = fetch_20newsgroups(
    subset='test'
    , remove=('headers','footers','quotes')
    , random_state=156
)

tfidf = TfidfVectorizer(stop_words='english', min_df=0.001, max_df=0.20)
tfidf_vect = tfidf.fit_transform(newsgroups_train.data)
tfidf_vect_test = tfidf.fit_transform(newsgroups_test.data)

x_train = tfidf_vect
print('x_train: ', x_train.shape)

y_train = newsgroups_train.target
print('y_train:', y_train.shape)
x_test = tfidf_vect_test
print('x_test', x_test.shape)
y_test = newsgroups_test.target
print('y_test', y_test.shape)

# x = np.array(np.r_[x_train.todense(), x_test.todense()])
# y = np.r_[y_train, y_test]

x_train:  (18846, 8930)
y_train: (18846,)
x_test (18846, 8930)
y_test (18846,)


In [121]:
# TF-IDF Vect - Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

# TF-IDF Vect - Logistic Regression
lr = LogisticRegression()
lr.fit(x_train, y_train)
lr_pred = lr.predict(x_test)
lr_accuracy = accuracy_score(y_test, lr_pred)

print('accuracy=', lr_accuracy)
print()

from sklearn import metrics
report = metrics.classification_report(y_test, lr_pred, target_names = categories)
print(report)


# 'rec.sport.baseball',
# 'rec.sport.hockey',
# 'sci.crypt',
# 'sci.med',
# 'talk.politics.mideast'

accuracy= 0.8647458346598748

                          precision    recall  f1-score   support

             alt.atheism       0.86      0.81      0.83       799
           comp.graphics       0.85      0.83      0.84       973
 comp.os.ms-windows.misc       0.83      0.81      0.82       985
comp.sys.ibm.pc.hardware       0.84      0.83      0.84       982
   comp.sys.mac.hardware       0.90      0.85      0.87       963
          comp.windows.x       0.91      0.88      0.89       988
            misc.forsale       0.87      0.86      0.87       975
               rec.autos       0.89      0.85      0.87       990
         rec.motorcycles       0.59      0.92      0.72       996
      rec.sport.baseball       0.95      0.91      0.93       994
        rec.sport.hockey       0.98      0.93      0.95       999
               sci.crypt       0.95      0.87      0.91       991
         sci.electronics       0.85      0.85      0.85       984
                 sci.med       0.92      0.92

In [76]:
feature_names = np.asarray(tfidf.get_feature_names())

for i, category in enumerate(categories):
        top5 = np.argsort(lr.coef_[i])[-5:]
        print("%s: %s" % (category, " ".join(feature_names[top5])))

alt.atheism: atheist atheists atheism god religion
comp.graphics: pov images image 3d graphics
comp.os.ms-windows.misc: ax file cica microsoft windows
comp.sys.ibm.pc.hardware: ide drive card bios pc
comp.sys.mac.hardware: se centris lc apple mac
comp.windows.x: widget xterm window server motif
misc.forsale: asking sell shipping offer sale
rec.autos: oil engine ford cars car
rec.motorcycles: motorcycle bikes ride dod bike
rec.sport.baseball: ball year team game baseball
rec.sport.hockey: season nhl team game hockey
sci.crypt: government key nsa encryption clipper
sci.electronics: motorola voltage battery electronics circuit
sci.med: photography disease msg medical doctor
sci.space: sky launch shuttle orbit space
soc.religion.christian: christianity christians christian church god
talk.politics.guns: firearms fbi weapons guns gun
talk.politics.mideast: armenians jews arab israeli israel
talk.politics.misc: gay people clayton clinton government
talk.religion.misc: koresh jesus christian 

In [122]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

models = ['rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.med',
'talk.politics.mideast'  ]
# 위에서 cats 변수로 기재된 category만 추출. featch_20newsgroups( )의 
# categories에 cats 입력
news_df = fetch_20newsgroups(subset='all', remove=('headers','footers','quotes'),
                            categories = models, random_state=0)
# LDA 는 Count기반의 Vectorizer만 적용
count_vect = CountVectorizer(max_df=0.95, max_features=1000, min_df=2,\
                            stop_words='english', ngram_range=(1,2))
feat_vect = count_vect.fit_transform(news_df.data)
print(feat_vect.shape)
print(news_df.data[0])

(4914, 1000)
Elias' initial statement certain *is* hot air. But it seems to be
almost standard procedure around here to first throw out an absurb,
overstated image in order to add extra "meaning" to the posting's
*real point*. 

However, his second statement *is* quite real. The essential sealing off
of Gaza residents from the possibility of making a living *has happened*.
Certainly, the Israeli had a legitimate worry behind the action they took,
but isn't that action a little draconian?




In [123]:
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(feat_vect)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=5, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [124]:
print(lda.components_.shape)
lda.components_

(5, 1000)


array([[3.09476082e+00, 7.61316452e+00, 3.01051510e-01, ...,
        1.50868478e+00, 2.03580516e+01, 2.03428191e-01],
       [6.65688807e+01, 2.43798745e+01, 1.14414332e+02, ...,
        9.28108137e+01, 6.52216404e+01, 8.14314221e+01],
       [2.01543778e-01, 3.05785593e+02, 2.00676470e-01, ...,
        9.49873251e+01, 8.33073840e+01, 1.35670831e+01],
       [5.13728599e-01, 8.68011505e+00, 2.01243905e-01, ...,
        4.44716171e+00, 1.90355743e+00, 2.65921787e+01],
       [3.39621086e+02, 1.73541253e+02, 5.18826963e+01, ...,
        1.26246015e+02, 5.12093665e+01, 2.05887944e-01]])

In [126]:
# # argsort() 넘파이 배열의 원소를 오름차순으로 정렬하는 메소드임
# import numpy as np
# d1 = np.arange(10, 25)
# print(d1)

# d2 = d1.argsort() # 오름차순
# print(d2)

# topic_word_indexes = d1.argsort()[::-1] # 내림차순
# print(topic_word_indexes)

# top_indexes = topic_word_indexes[:10]
# top_indexes

[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[14 13 12 11 10  9  8  7  6  5  4  3  2  1  0]


array([14, 13, 12, 11, 10,  9,  8,  7,  6,  5], dtype=int64)

In [128]:
def display_topics(model,feature_names,no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('Topic #', topic_index)
        # components_ array에서 가장 값이 큰 순으로 정렬했을 때, 그 값의 array index를 반환. 
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes=topic_word_indexes[:no_top_words]
        # top_indexes대상인 index별로 feature_names에 해당하는 word feature 추출 후 join으로 concat
        feature_concat = ' '.join([feature_names[i] for i in top_indexes]) # feature_names를 공백으로 join
        print(feature_concat)

# CountVectorizer객체내의 전체 word들의 명칭을 get_features_names( )를 통해 추출        
feature_names = count_vect.get_feature_names()

# Topic별 가장 연관도가 높은 word를 15개만 추출
display_topics(lda, feature_names, 15)

"""
models = ['rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.med',
'talk.politics.mideast'  ]
"""

Topic # 0
don like edu know people just use time think good key does com used information
Topic # 1
game team year games play season hockey think players don good time just like league
Topic # 2
people armenian said armenians turkish did jews know turkey like just don children armenia killed
Topic # 3
israel government key chip encryption clipper law israeli people arab security keys use right state
Topic # 4
10 25 11 12 20 15 16 14 17 13 18 30 55 19 92


### 해답

In [1]:
from sklearn.datasets import fetch_20newsgroups
news_data = fetch_20newsgroups(subset='all', random_state=156)

In [2]:
import pandas as pd
print(news_data.target_names)
# 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'talk.politics.mideast'
# print(news_data.target[10])
# print(news_data.data[10])

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [3]:
train_news = fetch_20newsgroups(subset='train', remove=('headers','footers','quotes'),
                  random_state=156)
X_train = train_news.data
y_train = train_news.target
test_news = fetch_20newsgroups(subset='test',remove=('header','footers','quotes'),
                              random_state=156)
X_test = test_news.data
y_test = test_news.target

# 내 거는 튜플로 안 했으면 됐었을텐데 왜 튜플로 해가지고는..

In [6]:
print(len(X_test))
print(len(X_train))

7532
11314


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect =TfidfVectorizer()
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import sklearn.metrics as metrics
import warnings
warnings.filterwarnings('ignore')

lr_clf = LogisticRegression()
lr_clf.fit(X_train_tfidf_vect, y_train)
lr_pred = lr_clf.predict(X_test_tfidf_vect)
print(accuracy_score(y_test, lr_pred))
rp = metrics.classification_report(y_test,lr_pred)
print(rp)

0.710169941582581
              precision    recall  f1-score   support

           0       0.63      0.47      0.54       319
           1       0.57      0.76      0.65       389
           2       0.67      0.71      0.69       394
           3       0.73      0.61      0.66       392
           4       0.84      0.64      0.73       385
           5       0.73      0.70      0.71       395
           6       0.58      0.87      0.70       390
           7       0.90      0.66      0.76       396
           8       0.77      0.81      0.79       398
           9       0.87      0.79      0.83       397
          10       0.89      0.92      0.90       399
          11       0.86      0.79      0.82       396
          12       0.44      0.70      0.54       393
          13       0.81      0.72      0.76       396
          14       0.68      0.86      0.76       394
          15       0.71      0.74      0.72       398
          16       0.62      0.77      0.69       364
         

In [7]:
news_df = fetch_20newsgroups(subset='all', remove=('headers','footers','quotes'),
                            random_state=0)
news_df.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [8]:
# 풀이
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# cats = ['comp.sys.mac.hardware',
#         'comp.windows.x',
#         'rec.sport.baseball',
#         'rec.sport.hockey',
#         'misc.forsale']
cats = [ 'rec.autos','rec.sport.baseball','rec.sport.hockey','sci.crypt','comp.sys.mac.hardware']
news_df1 = fetch_20newsgroups(subset='all', remove=('headers','footers','quotes'),
                            categories = cats, random_state=0)
# LDA 는 Count기반의 Vectorizer만 적용
count_vect = CountVectorizer(max_df=0.95, max_features=1000, min_df=2,\
                            stop_words='english', ngram_range=(1,2))
feat_vect1 = count_vect.fit_transform(news_df1.data)
print(feat_vect1.shape)

(4937, 1000)


In [10]:
lda1 = LatentDirichletAllocation(n_components=5, random_state=0)
lda1.fit(feat_vect1)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=5, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [11]:
print(lda1.components_.shape)
lda1.components_

(5, 1000)


array([[9.81221751e-01, 8.17236876e+00, 1.13296386e+00, ...,
        2.01107811e-01, 2.02368776e-01, 3.36927648e+00],
       [2.32343648e-01, 3.91708640e+00, 2.22138431e-01, ...,
        6.54346854e+00, 2.11812125e-01, 2.36478489e-01],
       [1.48399734e+01, 1.13958651e+02, 2.09817782e+00, ...,
        7.39111572e+00, 5.33963134e+01, 7.13572392e+01],
       [4.09230018e+01, 5.98206306e+00, 1.25266756e+02, ...,
        1.24770028e+02, 1.81816786e+01, 2.08166468e-01],
       [3.47023459e+02, 9.09698310e+01, 3.92799641e+01, ...,
        1.09427979e+00, 3.10078271e+01, 1.68288393e+01]])

In [13]:
def display_topics(model,feature_names,no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('Topic #', topic_index)
        # components_ array에서 가장 값이 큰 순으로 정렬했을 때, 그 값의 array index를 반환. 
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes=topic_word_indexes[:no_top_words]
        # top_indexes대상인 index별로 feature_names에 해당하는 word feature 추출 후 join으로 concat
        feature_concat = ' '.join([feature_names[i] for i in top_indexes])
        print(feature_concat)
# CountVectorizer객체내의 전체 word들의 명칭을 get_features_names( )를 통해 추출        
feature_names = count_vect.get_feature_names()
# Topic별 가장 연관도가 높은 word를 15개만 추출
display_topics(lda1, feature_names, 15)

# cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', \
#         'comp.windows.x', 'talk.politics.mideast', 'soc.religion.christian',\
#         'sci.electronics', 'sci.med'  ]

Topic # 0
chip mac use apple bit know does drive like just problem new used thanks need
Topic # 1
key government encryption use people keys security public privacy information law message mail des edu
Topic # 2
don just like think car good year time game know db better team did right
Topic # 3
game hockey team games play period season nhl new gm st vs 03 02 chicago
Topic # 4
25 10 11 12 16 14 15 55 20 13 18 17 00 19 30


In [None]:
cats = [ 'rec.autos','rec.sport.baseball','rec.sport.hockey','sci.crypt','comp.sys.mac.hardware']