### Introduction to Machine Learning with Python
## Chapter 7. Working with Text Data
---
## IMDb Reviews - Document Clustering (Topic Modeling)

- Topic modeling: an unsupervised learning task that assigns each document to one or more topics
- LDA (Latent Dirichlet Allocation): a decomposition method similar to NMF
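The similarity to NMF shows up in the scikit-learn interface: both models factor a document-word matrix into a document-component matrix and a component-word matrix. The following is a minimal sketch on a hypothetical toy corpus (the `toy_*` names and sentences are illustrative assumptions, not part of the IMDb data).

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "the actor gave a great performance",
    "great acting and a great script",
    "the monster attacks the village",
    "a scary monster and a haunted village",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
n_words = toy_counts.shape[1]

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
toy_lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                    random_state=0)

# Both estimators return one row per document and one column per component
doc_nmf = toy_nmf.fit_transform(toy_counts)
doc_lda = toy_lda.fit_transform(toy_counts)

# Both expose components_ with one row per component and one column per word
print(doc_nmf.shape, toy_nmf.components_.shape)
print(doc_lda.shape, toy_lda.components_.shape)
```

In both cases the entries are non-negative, which is what makes the components readable as additive "parts" of a document.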

In [1]:
%pylab inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Populating the interactive namespace from numpy and matplotlib


In [2]:
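# The file stores pickled Python objects, so recent NumPy versions need allow_pickle=True
[imdb_train, imdb_test] = np.load('imdb.npy', allow_pickle=True)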

text_train = [s.decode().replace('<br />', '') for s in imdb_train.data]
y_train = imdb_train.target

- Of the words that appear in at most 15% of the documents, keep the 10,000 most frequent.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_features=10000, max_df=0.15)
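To see how `max_df` and `max_features` interact, here is a minimal sketch on a hypothetical four-document corpus (the documents and the `toy_vect` name are assumptions made for illustration). `max_df` first discards words that occur in too many documents, and `max_features` then keeps the most frequent of the remaining words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "movie" is in 4/4 documents, "good" in 3/4,
# "plot" in 2/4, "acting" and "slow" in 1/4 each
toy_docs = [
    "movie good plot",
    "movie good plot",
    "movie good acting",
    "movie slow",
]

# max_df=0.8 drops words found in more than 80% of the documents ("movie");
# max_features=2 keeps the two most frequent words among the rest
toy_vect = CountVectorizer(max_df=0.8, max_features=2)
toy_vect.fit(toy_docs)

kept = sorted(toy_vect.vocabulary_)
print(kept)
```

Here `kept` contains only `good` and `plot`: the very common word is removed by `max_df`, and the rare words are removed by `max_features`.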

In [4]:
X = vect.fit_transform(text_train)

In [5]:
fn = vect.get_feature_names()  # scikit-learn >= 1.0: use vect.get_feature_names_out() (returns an array)
fn[::1000]

['00',
 'bikini',
 'consider',
 'elegance',
 'gram',
 'karloff',
 'muscular',
 'prone',
 'shakespearean',
 'thelma']

In [6]:
X.shape

(25000, 10000)

- Use LDA to extract 10 main topics and transform X into the corresponding topic coordinates.

In [7]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, learning_method='batch', max_iter=25,
                               random_state=0)
topics = lda.fit_transform(X)

In [8]:
lda.components_.shape, topics.shape

((10, 10000), (25000, 10))
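The two shapes correspond to a topic-word matrix (`components_`) and a document-topic matrix (the output of `fit_transform`). As a rough sketch of how these can be read, the example below fits LDA on a hypothetical toy corpus (all `toy_*` names are assumptions) and checks that each document's topic weights sum to 1, and that each row of `components_` can be normalized into a word distribution.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "great actor great script",
    "funny script and funny actor",
    "scary monster in the village",
    "the monster haunts the village",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                    random_state=0)
toy_topics = toy_lda.fit_transform(toy_counts)

# Each row is one document's mixture over topics, so it sums to 1
print(toy_topics.sum(axis=1))

# Dividing each row of components_ by its sum gives a per-topic word distribution
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(topic_word.sum(axis=1))
```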

In [9]:
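# Sort each topic's words by descending weight: reverse along the word axis (axis 1).
# A plain [::-1] only reverses the topic order and leaves each row ascending, so it
# prints the lowest-weight words; the listing below shows such rare words.
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]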
fn = np.array(fn)

for i in range(len(lda.components_)):
    print('topic: %d' % i)
    print(fn[sorting[i]][:10])
    print('\n')

topic: 0
['caruso' 'ppv' 'undertaker' 'shaq' 'pumbaa' 'btk' 'timon' 'duryea'
 'simba' 'sabretooth']


topic: 1
['cena' 'deathtrap' 'deathstalker' 'mcbain' 'askey' 'hackenstein'
 'darkman' 'harlin' 'segal' 'batwoman']


topic: 2
['batwoman' 'macarthur' 'caruso' 'antwone' 'bathsheba' 'harilal' 'blaine'
 'jaffar' 'ahmad' 'brashear']


topic: 3
['leia' 'vivah' 'tashan' 'azumi' 'trelkovsky' 'dahl' 'deathstalker'
 'kapoor' 'crouse' 'lila']


topic: 4
['gannon' 'gypo' 'macarthur' 'venezuela' 'winchester' 'azumi' 'chavez'
 'zenia' 'creasy' 'khouri']


topic: 5
['aweigh' 'iturbi' 'gannon' 'gypo' 'fagin' 'ossessione' 'durbin'
 'pavarotti' 'blandings' 'newcombe']


topic: 6
['blandings' 'eustache' 'sarne' 'noam' 'kornbluth' 'tremors' 'ollie'
 'palestinian' 'muslims' 'ashraf']


topic: 7
['ossessione' 'harron' 'kells' 'ae' 'carla' 'harilal' 'trelkovsky'
 'visconti' 'jaffar' 'ahmad']


topic: 8
['pumbaa' 'simba' 'yokai' 'venezuela' 'azumi' 'ramones' 'gram' 'shaggy'
 'sematary' 'lordi']


topic: 9
[
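Ranking the words of each topic depends on sorting every row of `components_` in descending order, which means reversing the `argsort` result along the word axis. A minimal sketch on a hypothetical 2x4 weight matrix (the values and words are made up for illustration) shows the difference between reversing columns and reversing rows.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics (rows) x 4 words (columns)
weights = np.array([[0.1, 0.7, 0.2, 0.4],
                    [0.9, 0.6, 0.5, 0.3]])
words = np.array(["plot", "actor", "scary", "funny"])

# Reverse along axis 1: each row lists word indices from highest to lowest weight
by_word = np.argsort(weights, axis=1)[:, ::-1]
top_topic0 = words[by_word[0]][:2]
print(top_topic0)

# Reverse along axis 0: only the topic rows are swapped, words stay ascending,
# so row 0 now holds the LOWEST-weight words of the last topic
by_row = np.argsort(weights, axis=1)[::-1]
wrong_topic0 = words[by_row[0]][:2]
print(wrong_topic0)
```

With these weights, `top_topic0` is `actor`, `funny` (the two largest weights of topic 0), while `wrong_topic0` is `funny`, `scary` (the two smallest weights of topic 1).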

- Now let's try 100 topics.

In [None]:
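# The model must be created before fitting; with the definition commented out,
# the fit_transform call raises a NameError.
lda100 = LatentDirichletAllocation(n_components=100, learning_method='batch', max_iter=25,
                                   random_state=0)
X_topics100 = lda100.fit_transform(X)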

In [None]:
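### save results ###
# np.save may fail on a mixed list (sparse matrix, array, estimator); joblib, a scikit-learn dependency, pickles it as-is
import joblib
joblib.dump((X, X_topics100, lda100), 'lda100.pkl')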

In [None]:
lda100.components_[0]

In [None]:
# Reverse along the word axis so each row runs from highest to lowest weight
sorting = np.argsort(lda100.components_, axis=1)[:, ::-1]
#fn = np.array(fn)

for i in range(len(lda100.components_)):
    print('topic: %d' % i)
    print(fn[sorting[i]][:10])
    print('\n')
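The document-topic matrix (`X_topics100` above) can be read column-wise as well: sorting one column finds the documents most strongly associated with that topic. The sketch below uses a small hypothetical matrix rather than the real LDA output; `doc_topics` and `topic_id` are illustrative names.

```python
import numpy as np

# Hypothetical document-topic matrix: 5 documents (rows) x 3 topics (columns)
doc_topics = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
    [0.05, 0.90, 0.05],
    [0.60, 0.10, 0.30],
])

topic_id = 1

# Sort documents by their weight for the chosen topic, highest first,
# and keep the indices of the top two
top_docs = np.argsort(doc_topics[:, topic_id])[::-1][:2]
print(top_docs)
```

Applied to the real data, the resulting indices could be used to look up the matching entries of `text_train` and check whether a topic's top documents share a theme.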