### Introduction to Machine Learning with Python
## Chapter 7. Working with Text Data
---
## IMDb Reviews - Document Clustering (Topic Modeling)

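- Topic modeling: an unsupervised learning task that assigns each document to one or more topics
- LDA (Latent Dirichlet Allocation): much like PCA, it decomposes the data into components; here each component is a group of words (a topic) that tends to appear together in the same documents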

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
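# imdb.npy stores pickled Python objects, so NumPy 1.16.3+ needs allow_pickle=True
imdb_train, imdb_test = np.load('imdb.npy', allow_pickle=True)

# Decode the raw bytes and strip the HTML line breaks left in the reviews
text_train = [s.decode().replace('<br />', '') for s in imdb_train.data]
y_train = imdb_train.target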

- Among the words that appear in at most 15% of the documents, keep the 10,000 most frequent

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_features=10000, max_df=0.15)
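
The interaction of `max_df` and `max_features` can be checked on a small example. This is a sketch on a made-up corpus (`toy_docs` and `toy_vect` are hypothetical names), with `max_df=0.5` and `max_features=2` chosen so the effect is visible on four documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" and "was" occur in every document,
# "movie" and "great" in exactly half of them, the rest in one each.
toy_docs = [
    "the movie was great",
    "the movie was awful",
    "the plot was thin",
    "the cast was great",
]

# max_df=0.5 drops words found in MORE than 50% of the documents;
# max_features=2 then keeps the two most frequent remaining words.
toy_vect = CountVectorizer(max_df=0.5, max_features=2)
toy_vect.fit(toy_docs)

print(sorted(toy_vect.vocabulary_))  # ['great', 'movie']
```

Words sitting exactly at the threshold ("movie" and "great" are in 2 of 4 documents) are kept, which is why the cutoff reads as "at most" rather than "fewer than".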

In [7]:
X = vect.fit_transform(text_train)

In [8]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
fn = vect.get_feature_names_out().astype(str)
fn[::1000]

array(['00', 'bikini', 'consider', 'elegance', 'gram', 'karloff',
       'muscular', 'prone', 'shakespearean', 'thelma'], dtype='<U17')

In [9]:
X.shape

(25000, 10000)

In [10]:
# Words found in the most documents (the slice :-10:-1 yields nine items)
fn[np.argsort((X>0).sum(axis=0))[0,:-10:-1]]

array([['thing', 'now', 'real', 'years', 'doesn', 'actors', 'another',
        'before', 'though']], dtype='<U17')

In [11]:
# Words found in the fewest documents
fn[np.argsort((X>0).sum(axis=0))[0,:10]]

array([['zenia', 'hackenstein', 'khouri', 'hagar', 'darkman',
        'kriemhild', 'ae', 'sarne', 'newcombe', 'floriane']], dtype='<U17')

In [14]:
# Ten words starting at position 500 of the same ascending document-frequency ranking
fn[np.argsort((X>0).sum(axis=0))[0,500:510]]

array([['din', 'lin', 'beth', 'owl', 'therapist', 'manga', 'soles',
        'nanny', 'thorn', 'canon']], dtype='<U17')

- Use LDA to extract 10 topics and transform X into the corresponding topic coordinates.

In [32]:
# Disabled (wrapped in a string) so the fit is not rerun; the next cell loads the saved result.
'''from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, learning_method='batch', max_iter=25,
                               random_state=2019)
topics = lda.fit_transform(X)'''

In [15]:
# np.save('imdb_lda.npy', [lda.components_, topics])
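# The file holds an object array (two arrays of different shapes), so allow_pickle=True is required
lda_components_, topics = np.load('imdb_lda.npy', allow_pickle=True)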

In [17]:
lda_components_.shape, topics.shape

((10, 10000), (25000, 10))
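
The two shapes above correspond to a topic-by-word matrix and a document-by-topic matrix. As a rough sketch of what the second one contains, the example below fits LDA on a made-up count matrix (`toy_counts`, `toy_lda`, and `toy_topics` are hypothetical names) and checks that each document row is a set of topic proportions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 5 documents x 4 vocabulary words.
toy_counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 1, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
    [1, 1, 1, 1],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_topics = toy_lda.fit_transform(toy_counts)

print(toy_lda.components_.shape)  # (n_topics, n_words)
print(toy_topics.shape)           # (n_documents, n_topics)

# Each row holds topic proportions for one document and sums to 1.
print(toy_topics.sum(axis=1))

# The topic with the largest proportion in each document.
dominant = toy_topics.argmax(axis=1)
print(dominant)
```

Applied to the notebook's `topics` array, the same `argmax(axis=1)` call would give one dominant topic index per review.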

In [35]:
# For each topic, order the word indices from largest to smallest weight
sorting = np.argsort(lda_components_, axis=1)[:, ::-1]
fn = np.array(fn)

for i in range(len(lda_components_)):
    print('topic: %d' % i)
    print(fn[sorting[i]][:10])
    print('\n')

topic: 0
['role' 'performance' 'john' 'cast' 'played' 'plays' 'actor' 'robert'
 'young' 'performances']


topic: 1
['dvd' 'years' 'video' 'saw' 'tv' 'now' 'remember' 'am' 'again' 'since']


topic: 2
['actors' 'director' 'work' 'interesting' 'though' 'quite' 'script'
 'doesn' 'however' 'didn']


topic: 3
['worst' 'thing' 'guy' 'nothing' 'didn' 'stupid' 'minutes' 'funny'
 'actually' 'want']


topic: 4
['war' 'world' 'us' 'american' 'between' 'both' 'history' 'own' 'yet'
 'work']


topic: 5
['show' 'series' 'episode' 'action' 'episodes' 'tv' 'game' 'season' 'new'
 'original']


topic: 6
['funny' 'show' 'comedy' 'music' 'musical' 'song' 'old' 'songs' 'fun'
 'wonderful']


topic: 7
['family' 'world' 'us' 'kids' 'children' 'real' 'book' 'our' 'things'
 'old']


topic: 8
['horror' 'effects' 'gore' 'special' 'blood' 'scary' 'pretty' 'budget'
 'zombie' 'killer']


topic: 9
['woman' 'house' 'wife' 'girl' 'sex' 'gets' 'young' 'husband' 'women'
 'seems']


