To keep things simple, we use the FastText implementation provided by Gensim (you can also use the FastText released by Facebook Research).

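To apply FastText to Korean text, each Hangul syllable first has to be split into its initial, medial, and final jamo (초성/중성/종성). We use the decompose and compose functions from soynlp for this.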

In [1]:
import re
from soynlp.hangle import decompose, compose

def remove_doublespace(s):  # Collapse any run of whitespace characters into a single space.
    doublespace_pattern = re.compile(r'\s+')
    return doublespace_pattern.sub(' ', s).strip()

def findrepeat(text):  # If the same character repeats 3 or more times in a row, keep only 3 of them.
    for t in set(text):
        pattern = re.compile(re.escape(t) + '{3,}')  # escape t so regex metacharacters are safe
        for s, e in reversed([(m.start(), m.end()) for m in pattern.finditer(text)]):
            text = text[:s] + t*3 + text[e:]
    return text

def encode(s):  # Convert text into initial/medial/final jamo triples.
    def process(c):
        if re.compile('[0-9a-zA-Z.?!]').match(c):  # Letters, digits, '.', '?' and '!' become '-c-'.
            return '-' + c + '-'
        jamo = decompose(c)
        # jamo is None when decompose cannot split c (here: whitespace).
        if jamo is None:
            return ' '
        cho, jung, jong = jamo
        if jong == ' ':
            jong = '-'  # No final consonant: pad with '-'.
            if jung == ' ':
                return '-' + cho + '-'  # Standalone consonant.
            else:
                if cho == ' ':
                    cho = '-'  # Standalone vowel.
        return cho + jung + jong

    # Drop every character that is not listed in the pattern.
    # Note: '|' inside [...] is a literal pipe, not alternation, so it is not used here.
    s = ''.join(re.compile(r'[0-9a-zA-Zㄱ-ㅎㅏ-ㅣ가-힣.?!\s]+').findall(s))
    s = findrepeat(s)
    s = ''.join(process(c) for c in s)
    return remove_doublespace(s).strip()

def decode(s):  # Restore readable text from its jamo encoding.
    def process(t):
        assert len(t) % 3 == 0
        t_ = t.replace('-', ' ')
        chars = [tuple(t_[3*i:3*(i+1)]) for i in range(len(t_)//3)]
        recovered = list()
        for char in chars:
            try:
                recovered.append(compose(*char))
            except ValueError:
                recovered.append(''.join(char).replace('-', '').strip())
        recovered = ''.join(recovered)
        return recovered

    return ' '.join(process(t) for t in s.split())

In [2]:
text = 'ㄱㅏ나힣a0.?!@ 반복55555'
print(f'Encoded: {encode(text)}')
print(f'Decoded: {decode(encode(text))}')

Encoded: -ㄱ--ㅏ-ㄴㅏ-ㅎㅣㅎ-a--0--.--?--!- ㅂㅏㄴㅂㅗㄱ-5--5--5-
Decoded: ㄱㅏ나힣a0.?! 반복555
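
The cell above relies on soynlp for the actual jamo split. As a rough illustration of what such a split computes for precomposed syllables, the sketch below uses the standard Unicode arithmetic for the Hangul Syllables block. The names split_syllable and join_jamo are hypothetical and this is not soynlp's code; unlike soynlp's decompose, it does not handle standalone jamo such as 'ㄱ'.

```python
# Sketch: Hangul syllable <-> jamo using Unicode code point arithmetic.
# split_syllable / join_jamo are illustrative names, not part of soynlp.
CHO = 'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ'  # 19 initial consonants
JUNG = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'  # 21 vowels
JONG = ' ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ'  # blank + 27 finals
BASE = 0xAC00  # code point of '가', the first precomposed syllable


def split_syllable(ch):
    """Return (initial, medial, final) for a precomposed syllable, else None."""
    idx = ord(ch) - BASE
    if not 0 <= idx < 19 * 21 * 28:
        return None
    return CHO[idx // 588], JUNG[(idx % 588) // 28], JONG[idx % 28]


def join_jamo(cho, jung, jong=' '):
    """Inverse of split_syllable; raises ValueError for unknown jamo."""
    return chr(BASE + CHO.index(cho) * 588 + JUNG.index(jung) * 28 + JONG.index(jong))


print(split_syllable('한'))
print(join_jamo('ㅎ', 'ㅏ', 'ㄴ'))
```

A syllable without a final consonant gets a blank in the third slot, which is the slot encode pads with '-'.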


Load the review data to be processed.

In [3]:
import pandas as pd

path = './data/nm_review(dropna).tsv'
df = pd.read_csv(path, delimiter='\t', index_col=0)
df['review'][:10]

0                         한글지켜준건 고맙다. 하지만 영화는 엄청 지루하다.
1              우리나라 영화계는 반일이랑 518 없으면 영화를 만들지를 못해요....
2    어제 메박가서 보고왔는데 솔직히 지루했습니다. 내용전개도 스토리도 특별하지 않구요....
3                                      보지마라 존나 지루하고 잔인
4                                  지루하고 또 지루하고 그저그런 영화
5                                항일, 반일영화만 주구장창 찍어대니 원
6                                  재미가 없어용 ㅠ보다 나왔습니다 ㅠ
7                                   일케 많이 웃게 하는 영화 첨인듯
8                                  별점높아도 기대안했는데 재미있었어요
9                                  오랜만에 한국영화 재밌게 봤어요^^
Name: review, dtype: object

Encode the review data and tokenize it on whitespace.

In [4]:
%%time
corpus = list(map(lambda x:encode(x).split(' '), df['review']))

Wall time: 5min 42s
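
Part of this running time presumably goes to findrepeat, which scans the whole text once for every distinct character it contains. One possible alternative, shown here only as a sketch and not benchmarked on this data, is a single regular expression with a backreference. The name collapse_repeats is hypothetical.

```python
import re

# Sketch: one pass over the text instead of one pass per distinct character.
# (.) captures any character (DOTALL so newlines count too);
# \1{2,} requires at least two more copies of that same character.
_REPEAT = re.compile(r'(.)\1{2,}', re.DOTALL)


def collapse_repeats(text):
    """Shorten every run of 3 or more identical characters to exactly 3."""
    return _REPEAT.sub(r'\1\1\1', text)


print(collapse_repeats('ㅋㅋㅋㅋㅋ 55555 aa'))
```

Runs shorter than three characters are left untouched, so the function is intended as a drop-in replacement for findrepeat.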


In [5]:
for text in corpus[:10]:
    print(text)

['ㅎㅏㄴㄱㅡㄹㅈㅣ-ㅋㅕ-ㅈㅜㄴㄱㅓㄴ', 'ㄱㅗ-ㅁㅏㅂㄷㅏ--.-', 'ㅎㅏ-ㅈㅣ-ㅁㅏㄴ', 'ㅇㅕㅇㅎㅘ-ㄴㅡㄴ', 'ㅇㅓㅁㅊㅓㅇ', 'ㅈㅣ-ㄹㅜ-ㅎㅏ-ㄷㅏ--.-']
['ㅇㅜ-ㄹㅣ-ㄴㅏ-ㄹㅏ-', 'ㅇㅕㅇㅎㅘ-ㄱㅖ-ㄴㅡㄴ', 'ㅂㅏㄴㅇㅣㄹㅇㅣ-ㄹㅏㅇ', '-5--1--8-', 'ㅇㅓㅄㅇㅡ-ㅁㅕㄴ', 'ㅇㅕㅇㅎㅘ-ㄹㅡㄹ', 'ㅁㅏㄴㄷㅡㄹㅈㅣ-ㄹㅡㄹ', 'ㅁㅗㅅㅎㅐ-ㅇㅛ--.--.--.-']
['ㅇㅓ-ㅈㅔ-', 'ㅁㅔ-ㅂㅏㄱㄱㅏ-ㅅㅓ-', 'ㅂㅗ-ㄱㅗ-ㅇㅘㅆㄴㅡㄴㄷㅔ-', 'ㅅㅗㄹㅈㅣㄱㅎㅣ-', 'ㅈㅣ-ㄹㅜ-ㅎㅐㅆㅅㅡㅂㄴㅣ-ㄷㅏ--.-', 'ㄴㅐ-ㅇㅛㅇㅈㅓㄴㄱㅐ-ㄷㅗ-', 'ㅅㅡ-ㅌㅗ-ㄹㅣ-ㄷㅗ-', 'ㅌㅡㄱㅂㅕㄹㅎㅏ-ㅈㅣ-', 'ㅇㅏㄶㄱㅜ-ㅇㅛ--.-ㄱㅖ-ㅅㅗㄱ', 'ㅁㅏㄹㅁㅏㄹㅁㅏㄹㅎㅏ-ㄷㅏ-ㄱㅏ-', 'ㄲㅡㅌㄴㅏ-ㄴㅡㄴㄷㅔ-', 'ㄷㅓ-ㄷㅗ-', 'ㄷㅓㄹㄷㅗ-ㅁㅏㄹㄱㅗ-', 'ㄷㅓㄱㅎㅖ-ㅇㅗㅇㅈㅜ-', 'ㅂㅗ-ㄷㅏ-', 'ㅈㅗ-ㄱㅡㅁ', 'ㅇㅏ-ㄹㅐㅅㄱㅡㅂㅇㅢ-', 'ㅇㅕㅇㅎㅘ-ㅇㅕㅆㄷㅓㄴㄱㅓ-', 'ㄱㅏㅌㄴㅔ-ㅇㅛ--.-']
['ㅂㅗ-ㅈㅣ-ㅁㅏ-ㄹㅏ-', 'ㅈㅗㄴㄴㅏ-', 'ㅈㅣ-ㄹㅜ-ㅎㅏ-ㄱㅗ-', 'ㅈㅏㄴㅇㅣㄴ']
['ㅈㅣ-ㄹㅜ-ㅎㅏ-ㄱㅗ-', 'ㄸㅗ-', 'ㅈㅣ-ㄹㅜ-ㅎㅏ-ㄱㅗ-', 'ㄱㅡ-ㅈㅓ-ㄱㅡ-ㄹㅓㄴ', 'ㅇㅕㅇㅎㅘ-']
['ㅎㅏㅇㅇㅣㄹ', 'ㅂㅏㄴㅇㅣㄹㅇㅕㅇㅎㅘ-ㅁㅏㄴ', 'ㅈㅜ-ㄱㅜ-ㅈㅏㅇㅊㅏㅇ', 'ㅉㅣㄱㅇㅓ-ㄷㅐ-ㄴㅣ-', 'ㅇㅝㄴ']
['ㅈㅐ-ㅁㅣ-ㄱㅏ-', 'ㅇㅓㅄㅇㅓ-ㅇㅛㅇ', '-ㅠ-ㅂㅗ-ㄷㅏ-', 'ㄴㅏ-ㅇㅘㅆㅅㅡㅂㄴㅣ-ㄷㅏ-', '-ㅠ-']
['ㅇㅣㄹㅋㅔ-', 'ㅁㅏㄶㅇㅣ-', 'ㅇㅜㅅㄱㅔ-', 'ㅎㅏ-ㄴㅡㄴ', 'ㅇㅕㅇㅎㅘ-', 'ㅊㅓㅁㅇㅣㄴㄷㅡㅅ']
['ㅂㅕㄹㅈㅓㅁㄴㅗㅍㅇㅏ-ㄷㅗ-', 'ㄱㅣ-ㄷㅐ-ㅇㅏㄴㅎㅐㅆㄴㅡㄴㄷㅔ-', 'ㅈㅐ-ㅁㅣ-ㅇㅣㅆㅇㅓㅆㅇㅓ-ㅇㅛ-']
['ㅇㅗ-ㄹㅐㄴㅁㅏㄴㅇㅔ-', 'ㅎㅏㄴㄱㅜㄱㅇㅕㅇㅎㅘ-', 'ㅈㅐ-ㅁㅣㅆㄱㅔ-', 'ㅂㅘㅆㅇㅓ-ㅇㅛ-']


Training the FastText model

Configure logging so that the progress of FastText training is visible.

In [6]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level = logging.INFO)

Import FastText from gensim and train it on the data prepared above.

In [7]:
from gensim.models import FastText

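fasttext_model = FastText(
    corpus,  # The tokenized, jamo-encoded corpus built above.
    size = 128,  # Dimensionality of the word vectors (renamed to vector_size in gensim 4.0).
    window = 3,  # Words close together in a sentence are treated as related; this is how many neighboring words count.
    min_count = 10,  # Ignore words that occur fewer than this many times, which also speeds up processing.
    min_n = 3,  # FastText builds word vectors from character n-grams; this is the shortest n-gram (3 characters = one encoded syllable).
    max_n = 9  # Longest character n-gram (9 characters = three encoded syllables).
)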

2019-11-12 23:51:11,953 : INFO : resetting layer weights
2019-11-12 23:51:21,620 : INFO : collecting all words and their counts
2019-11-12 23:51:21,622 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-11-12 23:51:21,687 : INFO : PROGRESS: at sentence #10000, processed 65220 words, keeping 28736 word types
2019-11-12 23:51:21,738 : INFO : PROGRESS: at sentence #20000, processed 121855 words, keeping 45143 word types
2019-11-12 23:51:21,787 : INFO : PROGRESS: at sentence #30000, processed 174160 words, keeping 58158 word types
2019-11-12 23:51:21,837 : INFO : PROGRESS: at sentence #40000, processed 221632 words, keeping 69283 word types
2019-11-12 23:51:21,873 : INFO : PROGRESS: at sentence #50000, processed 279556 words, keeping 84308 word types
2019-11-12 23:51:21,922 : INFO : PROGRESS: at sentence #60000, processed 341720 words, keeping 99074 word types
2019-11-12 23:51:21,963 : INFO : PROGRESS: at sentence #70000, processed 404238 words, keeping 112549 

2019-11-12 23:51:25,992 : INFO : PROGRESS: at sentence #710000, processed 5045533 words, keeping 933254 word types
2019-11-12 23:51:26,042 : INFO : PROGRESS: at sentence #720000, processed 5105721 words, keeping 942799 word types
2019-11-12 23:51:26,104 : INFO : PROGRESS: at sentence #730000, processed 5173122 words, keeping 953709 word types
2019-11-12 23:51:26,153 : INFO : PROGRESS: at sentence #740000, processed 5244974 words, keeping 965125 word types
2019-11-12 23:51:26,199 : INFO : PROGRESS: at sentence #750000, processed 5315437 words, keeping 976344 word types
2019-11-12 23:51:26,260 : INFO : PROGRESS: at sentence #760000, processed 5393395 words, keeping 990023 word types
2019-11-12 23:51:26,325 : INFO : PROGRESS: at sentence #770000, processed 5472720 words, keeping 1002677 word types
2019-11-12 23:51:26,396 : INFO : PROGRESS: at sentence #780000, processed 5538587 words, keeping 1011713 word types
2019-11-12 23:51:26,452 : INFO : PROGRESS: at sentence #790000, processed 5613

2019-11-12 23:52:06,466 : INFO : EPOCH 1 - PROGRESS: at 27.71% examples, 120361 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:52:07,496 : INFO : EPOCH 1 - PROGRESS: at 29.96% examples, 120853 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:52:08,510 : INFO : EPOCH 1 - PROGRESS: at 32.25% examples, 121502 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:52:09,513 : INFO : EPOCH 1 - PROGRESS: at 34.62% examples, 122136 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:52:10,615 : INFO : EPOCH 1 - PROGRESS: at 36.79% examples, 122437 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:52:11,675 : INFO : EPOCH 1 - PROGRESS: at 38.78% examples, 122565 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:52:12,685 : INFO : EPOCH 1 - PROGRESS: at 40.58% examples, 122704 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:52:13,697 : INFO : EPOCH 1 - PROGRESS: at 42.36% examples, 122453 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:52:14,701 : INFO : EPOCH 1 - PROGRESS: at 44.53% examples, 122920 words/s, in_qsiz

2019-11-12 23:53:18,306 : INFO : EPOCH 2 - PROGRESS: at 62.59% examples, 118745 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:53:19,355 : INFO : EPOCH 2 - PROGRESS: at 64.55% examples, 118889 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:53:20,455 : INFO : EPOCH 2 - PROGRESS: at 66.55% examples, 118921 words/s, in_qsize 4, out_qsize 1
2019-11-12 23:53:21,592 : INFO : EPOCH 2 - PROGRESS: at 68.41% examples, 118927 words/s, in_qsize 6, out_qsize 1
2019-11-12 23:53:22,663 : INFO : EPOCH 2 - PROGRESS: at 70.38% examples, 119053 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:53:23,725 : INFO : EPOCH 2 - PROGRESS: at 72.50% examples, 119122 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:53:24,796 : INFO : EPOCH 2 - PROGRESS: at 74.53% examples, 119226 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:53:25,862 : INFO : EPOCH 2 - PROGRESS: at 76.64% examples, 119386 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:53:26,945 : INFO : EPOCH 2 - PROGRESS: at 78.42% examples, 119190 words/s, in_qsiz

2019-11-12 23:54:32,361 : INFO : EPOCH 3 - PROGRESS: at 71.82% examples, 89794 words/s, in_qsize 4, out_qsize 1
2019-11-12 23:54:33,458 : INFO : EPOCH 3 - PROGRESS: at 73.17% examples, 89560 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:54:34,467 : INFO : EPOCH 3 - PROGRESS: at 74.20% examples, 89100 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:54:35,576 : INFO : EPOCH 3 - PROGRESS: at 75.24% examples, 88497 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:54:36,679 : INFO : EPOCH 3 - PROGRESS: at 76.64% examples, 88346 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:54:37,788 : INFO : EPOCH 3 - PROGRESS: at 78.01% examples, 88224 words/s, in_qsize 5, out_qsize 1
2019-11-12 23:54:38,863 : INFO : EPOCH 3 - PROGRESS: at 79.02% examples, 87872 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:54:39,903 : INFO : EPOCH 3 - PROGRESS: at 79.82% examples, 87336 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:54:40,909 : INFO : EPOCH 3 - PROGRESS: at 80.87% examples, 87228 words/s, in_qsize 5, out_

2019-11-12 23:55:47,211 : INFO : EPOCH 4 - PROGRESS: at 67.79% examples, 90263 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:55:48,212 : INFO : EPOCH 4 - PROGRESS: at 69.54% examples, 91001 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:55:49,268 : INFO : EPOCH 4 - PROGRESS: at 71.58% examples, 91549 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:55:50,280 : INFO : EPOCH 4 - PROGRESS: at 73.51% examples, 92132 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:55:51,296 : INFO : EPOCH 4 - PROGRESS: at 75.48% examples, 92750 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:55:52,302 : INFO : EPOCH 4 - PROGRESS: at 77.35% examples, 93256 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:55:53,348 : INFO : EPOCH 4 - PROGRESS: at 79.12% examples, 93816 words/s, in_qsize 5, out_qsize 0
2019-11-12 23:55:54,384 : INFO : EPOCH 4 - PROGRESS: at 80.69% examples, 94224 words/s, in_qsize 6, out_qsize 0
2019-11-12 23:55:55,427 : INFO : EPOCH 4 - PROGRESS: at 82.45% examples, 94705 words/s, in_qsize 5, out_

2019-11-12 23:56:56,692 : INFO : EPOCH - 5 : training on 8774744 raw words (6496819 effective words) took 51.3s, 126590 effective words/s
2019-11-12 23:56:56,693 : INFO : training on a 43873720 raw words (32481588 effective words) took 304.7s, 106593 effective words/s
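
To make the min_n and max_n settings more concrete, the sketch below lists the character n-grams of one jamo-encoded token. It assumes the usual FastText convention of wrapping each word in '<' and '>' before slicing. The name char_ngrams is hypothetical, and the real implementation additionally hashes the n-grams into a fixed number of buckets, which is not reproduced here.

```python
def char_ngrams(token, min_n=3, max_n=9):
    """Return all character n-grams of '<token>' with min_n <= n <= max_n."""
    wrapped = '<' + token + '>'  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        # When n exceeds len(wrapped), this range is empty and nothing is added.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# '영화' encoded as jamo triples, as produced by encode() above
for gram in char_ngrams('ㅇㅕㅇㅎㅘ-', min_n=3, max_n=3):
    print(gram)
```

Because every syllable occupies exactly three characters, n-grams of length 3 to 9 cover spans of roughly one to three syllables, although a given n-gram may straddle a syllable boundary.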


Saving the FastText model

In [10]:
fasttext_model.save('./model/fasttext_model')

2019-11-12 23:58:12,591 : INFO : saving FastText object under ./model/fasttext_model_v2, separately None
2019-11-12 23:58:12,592 : INFO : storing np array 'vectors_ngrams' to ./model/fasttext_model_v2.wv.vectors_ngrams.npy
2019-11-12 23:58:14,266 : INFO : not storing attribute vectors_ngrams_norm
2019-11-12 23:58:14,267 : INFO : not storing attribute vectors_norm
2019-11-12 23:58:14,267 : INFO : not storing attribute vectors_vocab_norm
2019-11-12 23:58:14,268 : INFO : not storing attribute buckets_word
2019-11-12 23:58:14,269 : INFO : storing np array 'vectors_ngrams_lockf' to ./model/fasttext_model_v2.trainables.vectors_ngrams_lockf.npy
2019-11-12 23:58:17,390 : INFO : saved ./model/fasttext_model_v2


Loading the FastText model

In [7]:
from gensim.models import FastText
ft_model = FastText.load('./model/fasttext_model')

To query the FastText model, first encode the query string into initial/medial/final jamo form.

The results returned by FastText are then decoded back into readable words.

In [14]:
def most_similar2(query, topn=10):
    query_ = encode(query)
    similars = ft_model.wv.most_similar(query_, topn=topn)
    similars = [(decode(word), sim) for word, sim in similars]
    return similars

In [41]:
most_similar2('재미있음')

[('재미있음!', 0.9677058458328247),
 ('재미있엇음', 0.9672262072563171),
 ('재미있엇다', 0.9540884494781494),
 ('재미있음ㅋㅋ', 0.9521926641464233),
 ('재미있었음', 0.946642279624939),
 ('재밌음요', 0.9405887722969055),
 ('재미있당', 0.9381404519081116),
 ('재미있슴', 0.9351725578308105),
 ('재밌음ㅎ', 0.9349152445793152),
 ('재미있음!!', 0.9345504641532898)]

In [42]:
most_similar2('재미없음')

[('재미없엇음', 0.9805892109870911),
 ('재미없음!', 0.965717077255249),
 ('재미없어짐', 0.9599900245666504),
 ('재미없어여', 0.957050085067749),
 ('재미없었음', 0.9548401832580566),
 ('재미없슴', 0.9484813213348389),
 ('ㅈㄴ재미없음', 0.9432206153869629),
 ('개재미없음', 0.9382390975952148),
 ('재미없어', 0.9377038478851318),
 ('재미없음.', 0.9357426166534424)]

In [43]:
most_similar2('재미없진않음')

[('재미없진', 0.7632869482040405),
 ('재미없지', 0.743039608001709),
 ('재밌진', 0.7236950397491455),
 ('나쁘진않음', 0.723588228225708),
 ('재미없지도', 0.7176520228385925),
 ('재미없지는', 0.7032293677330017),
 ('재밌지', 0.6922109723091125),
 ('재밌지?', 0.6894176006317139),
 ('나쁘지않다', 0.6890566349029541),
 ('나쁘지않음', 0.6879357099533081)]
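
The scores returned by most_similar are cosine similarities between word vectors. As a minimal sketch of that measure in plain Python (cosine_similarity is a hypothetical helper, not a gensim function):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Only the direction of the vectors matters, not their length, which is why spelling variants sharing most of their n-grams score close to 1.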

In [23]:
def get_vector(query):
    query_ = encode(query)  # The model was trained on jamo-encoded tokens, so encode the query first.
    wv = ft_model.wv.get_vector(query_)
    return wv

In [24]:
get_vector('나쁘지않다고').shape

(128,)

### Creating word vectors

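This step converts every word of every review into a word vector and saves the result.

A more efficient approach is to build a dictionary that maps each word in the corpus to its FastText vector, and to store dictionary indices in place of the review words.

Treat the method shown here as a reference only.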
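
The dictionary-based approach mentioned above could look roughly like the following sketch. build_index and toy_vector are hypothetical names, and toy_vector is only a stand-in so the example runs without a trained model; in the notebook the vectorizer would be get_vector.

```python
def build_index(corpus, vectorize):
    """Vectorize each distinct word once and replace words by row indices."""
    vocab = {}    # word -> row index into table
    table = []    # row index -> vector
    indexed = []  # corpus with every word replaced by its index
    for words in corpus:
        row = []
        for word in words:
            if word not in vocab:
                vocab[word] = len(table)
                table.append(vectorize(word))
            row.append(vocab[word])
        indexed.append(row)
    return vocab, table, indexed


def toy_vector(word):
    # Stand-in for get_vector(word); not a real embedding.
    return [float(len(word)), float(ord(word[0]))]


corpus = [['그냥', '흔한', '영화'], ['그냥', '영화']]
vocab, table, indexed = build_index(corpus, toy_vector)
print(indexed)     # [[0, 1, 2], [0, 2]]
print(len(table))  # 3
```

Each distinct word is vectorized and stored once, so the saved data grows with the vocabulary size rather than with the total number of tokens.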

In [16]:
import pandas as pd

path = './data/nm_review(score_balanced).tsv'
df = pd.read_csv(path, delimiter='\t', index_col=0)
df[:10]

Unnamed: 0,code,uid,datetime,score,review,sympathy,notsympathy
0,167699,15108707,2019.01.11 05:36,1,우리나라 영화계는 반일이랑 518 없으면 영화를 만들지를 못해요....,10,19
1,167699,15112721,2019.01.12 09:42,4,어제 메박가서 보고왔는데 솔직히 지루했습니다. 내용전개도 스토리도 특별하지 않구요....,1,11
2,167699,15119773,2019.01.13 16:08,2,보지마라 존나 지루하고 잔인,2,12
3,167651,15189096,2019.01.30 14:48,10,지친 하루를 보냈는데 오랫만에 웃어보았어요.^^,1,0
4,164172,15405102,2019.03.12 04:52,8,재밋게 보았음 그냥 기분이 좋아지는 코믹물,5,3
5,164172,15340125,2019.02.28 21:34,5,그저 그랬다. 별로 웃기지도 않았고 나중에 박성웅 배우님의 연기에 딱 한 번 웃었다.,5,3
6,177374,15321353,2019.02.24 20:13,10,정우성 연기힘빼고하니까 진짜 좋네여 김향기도 정우성도 넘 감동적ㅜ,1,2
7,174050,15509814,2019.04.13 20:27,2,별로였다 전도연나온다길래 기대했는데 설경구 인위적인 눈물연기 봐주기 힘듦,6,7
8,174050,15508215,2019.04.13 08:11,4,세월호 사건 안타까운거 이미 다 알고 있으니까 제발 영화 소재로는 쓰지 말자.. 유...,6,7
9,156464,14907556,2018.11.27 08:09,5,그냥 흔하디 흔한 전기영화,5,6


word vectorizing (unigram -> fasttext)

In [17]:
ug_corpus = list()
for row, review in enumerate(df['review']):
    if row % 10000 == 0:
        print(f'{row} / {df.shape[0]}')
    ug_corpus.append(review.split(' '))

0 / 86750
10000 / 86750
20000 / 86750
30000 / 86750
40000 / 86750
50000 / 86750
60000 / 86750
70000 / 86750
80000 / 86750


In [18]:
for ugs in ug_corpus[:10]:
    print(ugs)

['우리나라', '영화계는', '반일이랑', '518', '없으면', '영화를', '만들지를', '못해요....']
['어제', '메박가서', '보고왔는데', '솔직히', '지루했습니다.', '내용전개도', '스토리도', '특별하지', '않구요.계속', '말말말하다가', '끝나는데', '더도', '덜도말고', '덕혜옹주', '보다', '조금', '아랫급의', '영화였던거', '같네요.']
['보지마라', '존나', '지루하고', '잔인']
['지친', '하루를', '보냈는데', '오랫만에', '웃어보았어요.^^']
['재밋게', '보았음', '그냥', '기분이', '좋아지는', '코믹물']
['그저', '그랬다.', '별로', '웃기지도', '않았고', '나중에', '박성웅', '배우님의', '연기에', '딱', '한', '번', '웃었다.']
['정우성', '연기힘빼고하니까', '진짜', '좋네여', '김향기도', '정우성도', '넘', '감동적ㅜ']
['별로였다', '전도연나온다길래', '기대했는데', '설경구', '인위적인', '눈물연기', '봐주기', '힘듦']
['세월호', '사건', '안타까운거', '이미', '다', '알고', '있으니까', '제발', '영화', '소재로는', '쓰지', '말자..', '유가족들', '두', '번', '죽인다는', '걸', '왜', '몰라', '왜..']
['그냥', '흔하디', '흔한', '전기영화']


In [20]:
import numpy as np

In [21]:
word_vectors = list()
for idx, ugs in enumerate(ug_corpus):
    if idx % 10000 == 0:
        print(f'{idx} / {len(ug_corpus)}')
    word_vectors.append([get_vector(ug) for ug in ugs])

0 / 86750
10000 / 86750
20000 / 86750
30000 / 86750
40000 / 86750
50000 / 86750
60000 / 86750
70000 / 86750
80000 / 86750


In [22]:
word_vectors[0][0]

array([-0.47300333,  1.30049   , -0.3641767 ,  1.8157151 , -0.85483354,
       -0.5684351 , -0.2055265 ,  0.9507764 ,  2.1394181 ,  0.8465369 ,
       -0.92357624, -1.2002548 ,  0.55566585, -0.06308649,  0.15278229,
        0.00917215,  0.8447812 , -0.54338545, -1.2037055 , -1.6620326 ,
       -1.1124866 , -1.0869331 , -0.7396704 , -1.2416005 ,  0.568634  ,
        0.4435347 , -0.03420123, -0.84977186, -0.0071111 ,  1.113092  ,
        1.654227  ,  0.8370427 ,  0.98035204, -0.92488986,  1.2195312 ,
        0.27514252,  0.01439372, -1.5257816 ,  0.35278195,  1.158323  ,
       -0.70516914, -1.0562598 , -1.8120387 , -2.3355322 ,  0.06909817,
       -0.02377743,  0.06588261,  0.4355864 ,  1.0005324 , -0.07388912,
       -0.07517002, -0.34457922,  0.37567556,  2.481846  , -0.22776406,
        0.8964096 , -0.5882707 , -0.4701803 , -0.36224553, -0.08499369,
       -0.29483414,  0.4539833 ,  0.86328936, -0.63339067,  1.0016291 ,
        0.56689465, -1.8571078 ,  0.68104553,  0.27385226, -1.58

In [24]:
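# Reviews differ in length, so the nested list is ragged; store it as an object array.
# Loading it back then requires np.load(..., allow_pickle=True).
np.save('./data/word_vec', np.array(word_vectors, dtype=object))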