### Python Machine Learning
## Working with Text Data
---
## IMDb Reviews - Document Clustering (Topic Modeling)

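- Topic modeling: an unsupervised learning task that assigns each document to one or more topics
- LDA (Latent Dirichlet Allocation): similar in spirit to PCA, it finds components made up of the words that documents grouped together tend to share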
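
To make the two definitions concrete, here is a minimal sketch on a hypothetical four-sentence corpus (not the IMDb data). The names `toy_docs`, `toy_vect`, `toy_lda` and `toy_topics` are made up for illustration. It only shows the shapes LDA produces: one topic mixture per document and one word-weight vector per topic.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus with two rough themes (horror, musical)
toy_docs = [
    "the zombie horror film had blood and gore",
    "scary horror with a zombie killer and blood",
    "the musical had wonderful songs and dance",
    "a fun musical comedy with songs and music",
]

toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, learning_method='batch',
                                    random_state=0)
toy_topics = toy_lda.fit_transform(toy_X)

# One row per document, one column per topic
print(toy_topics.shape)                            # (4, 2)
# Each row is a mixture over topics, so it sums to 1
print(np.allclose(toy_topics.sum(axis=1), 1.0))
# One row per topic, one column per vocabulary word
print(toy_lda.components_.shape)
```

With so little text the learned topics are not guaranteed to separate the two themes; the point is only the layout of the outputs.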

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
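# imdb.npy stores pickled objects, which NumPy >= 1.16.3 only loads with allow_pickle=True
imdb_train, imdb_test = np.load('imdb.npy', allow_pickle=True)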

text_train = [s.decode().replace('<br />', '') for s in imdb_train.data]
y_train = imdb_train.target

- Among the words that appear in at most 15% of the documents, keep the 10,000 most frequent ones

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_features=10000, max_df=0.15)

In [4]:
X = vect.fit_transform(text_train)
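
How `max_df` and `max_features` interact can be checked on a small case. The sketch below uses a hypothetical four-document corpus and a 50% threshold instead of 15% so the effect is visible; `toy_docs`, `toy_vect` and `top_vect` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_docs = [
    "good movie good plot",
    "good acting",
    "bad movie",
    "good soundtrack",
]

# max_df=0.5 drops every word found in more than 50% of the documents.
# "good" is in 3 of 4 documents (75%), so it is removed.
# "movie" is in 2 of 4 (exactly 50%), which does not exceed the limit, so it stays.
toy_vect = CountVectorizer(max_df=0.5)
toy_vect.fit(toy_docs)
print(sorted(toy_vect.vocabulary_))
# ['acting', 'bad', 'movie', 'plot', 'soundtrack']

# max_features is applied after the max_df filter and keeps the words
# with the highest total count among those that remain.
top_vect = CountVectorizer(max_df=0.5, max_features=1)
top_vect.fit(toy_docs)
print(list(top_vect.vocabulary_))
# ['movie']
```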

In [8]:
# On scikit-learn >= 1.2 use vect.get_feature_names_out(); get_feature_names() was removed there
fn = np.array(vect.get_feature_names())
fn[::1000]

array(['00', 'bikini', 'consider', 'elegance', 'gram', 'karloff',
       'muscular', 'prone', 'shakespearean', 'thelma'], dtype='<U17')

In [6]:
X.shape

(25000, 10000)

In [7]:
# Words contained in the most documents; [:-10:-1] stops before index -10, so 9 words are returned
fn[np.argsort((X>0).sum(axis=0))[0,:-10:-1]]

array([['thing', 'now', 'real', 'years', 'doesn', 'actors', 'another',
        'before', 'though']], dtype='<U17')

In [8]:
fn[np.argsort((X>0).sum(axis=0))[0,:10]]

array([['zenia', 'hackenstein', 'khouri', 'hagar', 'darkman',
        'kriemhild', 'ae', 'sarne', 'newcombe', 'floriane']], dtype='<U17')

In [9]:
fn[np.argsort((X>0).sum(axis=0))[0,500:510]]

array([['din', 'lin', 'beth', 'owl', 'therapist', 'manga', 'soles',
        'nanny', 'thorn', 'canon']], dtype='<U17')
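
The three cells above rank words by document frequency, meaning the number of reviews that contain each word. Summing a SciPy sparse matrix returns a 1 x n `np.matrix`, which is why the indexing needs the extra `[0, ...]` and the results come back nested. A minimal sketch of the same idea with a flattened array, assuming a hypothetical 4 x 4 count matrix (`toy_X`, `toy_fn` and `doc_freq` are illustrative names):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical count matrix: 4 documents x 4 words, standing in for X
toy_X = csr_matrix(np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [3, 1, 1, 0],
    [1, 0, 1, 1],
]))
toy_fn = np.array(['film', 'owl', 'plot', 'zenia'])

# (toy_X > 0) marks which documents contain each word; summing over the
# rows gives the document frequency. Flatten the result to a 1-D array.
doc_freq = np.asarray((toy_X > 0).sum(axis=0)).ravel()
print(doc_freq)                   # [4 2 3 1]

order = np.argsort(doc_freq)      # ascending: rarest word first
print(toy_fn[order])              # ['zenia' 'owl' 'plot' 'film']
print(toy_fn[order[::-1][:2]])    # ['film' 'plot']
```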

- Use LDA to extract 10 main topics and transform X into the corresponding topic coordinates.

In [5]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, learning_method='batch',
                                max_iter=25, random_state=2019)
topics = lda.fit_transform(X)
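
`topics` holds one row per review and one column per topic. One possible way to turn it into a hard clustering is to give each document the topic with the largest weight. The sketch below uses a random stand-in matrix, not the real `topics`; `toy_topics`, `dominant` and `counts` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for `topics`: 6 documents x 3 topics, rows sum to 1
toy_topics = rng.dirichlet(np.ones(3), size=6)

# Index of the heaviest topic for each document
dominant = toy_topics.argmax(axis=1)
# Number of documents assigned to each topic
counts = np.bincount(dominant, minlength=3)

print(dominant)
print(counts)
```

Applied to the real matrix, the same two lines would be `topics.argmax(axis=1)` and `np.bincount(..., minlength=10)`.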

- Use lda.components_ to print the most important words for each of the 10 topics

In [10]:
type(lda.components_)

numpy.ndarray

In [11]:
lda.components_.shape

(10, 10000)

In [13]:
best = np.argsort(lda.components_, axis=1)[:,::-1]
best

array([[7608, 6596, 4928, ..., 1403, 3792, 9946],
       [2902, 9958, 9565, ..., 3457, 7736,  872],
       [ 236, 2646, 9896, ..., 9374, 9532, 2761],
       ...,
       [3386, 9902, 9474, ..., 6048, 9374, 4113],
       [4425, 2975, 3973, ..., 4850,  769, 3807],
       [9875, 4439, 9811, ..., 1001, 5155, 9969]], dtype=int64)
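
Each row of `best` lists word indices for one topic, ordered from the heaviest weight to the lightest. A small sketch with a hypothetical 2 x 4 weight matrix shows what the `argsort` and `[:, ::-1]` combination does (`toy_components`, `toy_fn` and `toy_best` are illustrative names):

```python
import numpy as np

# Hypothetical 2-topic x 4-word weight matrix standing in for lda.components_
toy_components = np.array([
    [0.1, 5.0, 2.0, 0.3],
    [4.0, 0.2, 0.5, 3.0],
])
toy_fn = np.array(['gore', 'role', 'cast', 'zombie'])

# argsort sorts ascending; reversing the columns puts the heaviest word first
toy_best = np.argsort(toy_components, axis=1)[:, ::-1]
print(toy_best)
# [[1 2 3 0]
#  [0 3 2 1]]

# Top 2 words per topic
for i, row in enumerate(toy_best):
    print(i, toy_fn[row[:2]])
# 0 ['role' 'cast']
# 1 ['gore' 'zombie']
```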

In [17]:
for i in range(10):
    print('>>', i)
    print(fn[best[i,:10]])

>> 0
['role' 'performance' 'john' 'cast' 'played' 'plays' 'actor' 'robert'
 'young' 'performances']
>> 1
['dvd' 'years' 'video' 'saw' 'tv' 'now' 'remember' 'am' 'again' 'since']
>> 2
['actors' 'director' 'work' 'interesting' 'though' 'quite' 'script'
 'doesn' 'however' 'didn']
>> 3
['worst' 'thing' 'guy' 'nothing' 'didn' 'stupid' 'minutes' 'funny'
 'actually' 'want']
>> 4
['war' 'world' 'us' 'american' 'between' 'both' 'history' 'own' 'yet'
 'work']
>> 5
['show' 'series' 'episode' 'action' 'episodes' 'tv' 'game' 'season' 'new'
 'original']
>> 6
['funny' 'show' 'comedy' 'music' 'musical' 'song' 'old' 'songs' 'fun'
 'wonderful']
>> 7
['family' 'world' 'us' 'kids' 'children' 'real' 'book' 'our' 'things'
 'old']
>> 8
['horror' 'effects' 'gore' 'special' 'blood' 'scary' 'pretty' 'budget'
 'zombie' 'killer']
>> 9
['woman' 'house' 'wife' 'girl' 'sex' 'gets' 'young' 'husband' 'women'
 'seems']


In [15]:
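# The two arrays have different shapes, so wrap them in an object array;
# recent NumPy versions reject the implicit ragged list
np.save('imdb_lda.npy', np.array([lda.components_, topics], dtype=object))
lda_components_, topics = np.load('imdb_lda.npy', allow_pickle=True)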

In [17]:
lda_components_.shape, topics.shape

((10, 10000), (25000, 10))
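
As an alternative to packing two differently shaped arrays into a single `.npy` file, `np.savez` stores each array under its own name and needs no pickling. This is a sketch with small placeholder arrays and a temporary file; the file name `toy_lda.npz` and the keys `components` and `topics` are assumptions, not part of the notebook.

```python
import os
import tempfile
import numpy as np

# Placeholder arrays standing in for lda.components_ and topics
toy_components = np.ones((2, 5))
toy_topics = np.zeros((4, 2))

path = os.path.join(tempfile.mkdtemp(), 'toy_lda.npz')
np.savez(path, components=toy_components, topics=toy_topics)

# Arrays are read back by the keyword names used when saving
with np.load(path) as data:
    loaded_components = data['components']
    loaded_topics = data['topics']

print(loaded_components.shape, loaded_topics.shape)   # (2, 5) (4, 2)
```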

In [35]:
# Rank the words of each topic from heaviest to lightest weight
# (the reloaded lda_components_ can be used in place of lda.components_)
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]

for i in range(len(lda.components_)):
    print('topic: %d' % i)
    print(fn[sorting[i, :10]])  # top 10 words of topic i
    print('\n')

topic: 0
['role' 'performance' 'john' 'cast' 'played' 'plays' 'actor' 'robert'
 'young' 'performances']


topic: 1
['dvd' 'years' 'video' 'saw' 'tv' 'now' 'remember' 'am' 'again' 'since']


topic: 2
['actors' 'director' 'work' 'interesting' 'though' 'quite' 'script'
 'doesn' 'however' 'didn']


topic: 3
['worst' 'thing' 'guy' 'nothing' 'didn' 'stupid' 'minutes' 'funny'
 'actually' 'want']


topic: 4
['war' 'world' 'us' 'american' 'between' 'both' 'history' 'own' 'yet'
 'work']


topic: 5
['show' 'series' 'episode' 'action' 'episodes' 'tv' 'game' 'season' 'new'
 'original']


topic: 6
['funny' 'show' 'comedy' 'music' 'musical' 'song' 'old' 'songs' 'fun'
 'wonderful']


topic: 7
['family' 'world' 'us' 'kids' 'children' 'real' 'book' 'our' 'things'
 'old']


topic: 8
['horror' 'effects' 'gore' 'special' 'blood' 'scary' 'pretty' 'budget'
 'zombie' 'killer']


topic: 9
['woman' 'house' 'wife' 'girl' 'sex' 'gets' 'young' 'husband' 'women'
 'seems']




In [6]:
from sklearn.decomposition import LatentDirichletAllocation

# Refit with a different random_state to see how much the topics change
lda2 = LatentDirichletAllocation(n_components=10, learning_method='batch',
                                 max_iter=25, random_state=2020)
topics2 = lda2.fit_transform(X)

In [18]:
best2 = np.argsort(lda2.components_, axis=1)[:,::-1]
best2

array([[9958, 7778, 2902, ..., 1037, 5932, 7716],
       [4425, 2975, 5053, ...,  467, 3807,  769],
       [ 230, 2646, 7122, ..., 9987, 4118, 2761],
       ...,
       [3909, 9875, 3386, ..., 9986, 3875, 6517],
       [6002, 6003, 8327, ..., 1786, 4108, 1045],
       [6596, 7608, 1470, ..., 8899, 7996, 9616]], dtype=int64)

In [19]:
for i in range(10):
    print('>>', i)
    print(fn[best2[i,:10]])

>> 0
['years' 'saw' 'dvd' 'now' 'old' 'again' 'remember' 'tv' 'watched' 'am']
>> 1
['horror' 'effects' 'killer' 'budget' 'gore' 'pretty' 'blood' 'low'
 'scary' 'dead']
>> 2
['action' 'director' 'quite' 'work' 'however' 'style' 'though'
 'interesting' 'seems' 'rather']
>> 3
['show' 'comedy' 'funny' 'episode' 'series' 'cast' 'tv' 'shows' 'fun'
 'family']
>> 4
['didn' 'thing' 'nothing' 'actors' 'worst' 'actually' 'thought' 'want'
 'funny' '10']
>> 5
['series' 'original' 'book' 'version' 'animation' 'episode' 'disney'
 'show' 'new' 'star']
>> 6
['world' 'war' 'us' 'our' 'american' 'real' 'own' 'may' 'between' 'human']
>> 7
['girl' 'woman' 'family' 'gets' 'father' 'around' 'down' 'guy' 'doesn'
 'young']
>> 8
['music' 'musical' 'song' 'dance' 'songs' 'dancing' 'singing' 'kelly'
 'number' 'role']
>> 9
['performance' 'role' 'cast' 'john' 'plays' 'actor' 'played' 'young'
 'robert' 'director']
