# Model using word2vec

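- word2vec takes a list of tokenized sentences as input, where each sentence is itself a list of words
- After loading the preprocessed text, each review must therefore be split into a list of words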

In [1]:
import os
import re
import pandas as pd
import numpy as np

In [2]:
DATA_IN_PATH = './data_in/'
DATA_OUT_PATH = './data_out/'
TRAIN_CLEAN_DATA = 'train_clean.csv'

train_data = pd.read_csv(DATA_IN_PATH + TRAIN_CLEAN_DATA)
train_data.head()

Unnamed: 0,review,sentiment
0,stuff going moment mj started listening music ...,1
1,classic war worlds timothy hines entertaining ...,1
2,film starts manager nicholas bell giving welco...,0
3,must assumed praised film greatest filmed oper...,0
4,superbly trashy wondrously unpretentious explo...,1


In [3]:
# Pull the reviews and labels out as lists; each review is split into words in a later cell
reviews = list(train_data['review'])
sentiments = list(train_data['sentiment'])

In [4]:
print(len(reviews))

25000


In [5]:
print(reviews[0])

stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working

In [6]:
sentences = []
for review in reviews:
    sentences.append(review.split())

In [7]:
print(sentences[0])

['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 

conda install -c anaconda gensim

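- num_features : dimensionality of the embedding vector learned for each word (number of features)
- min_word_count : minimum frequency a word needs to be included in training; rarer words are dropped so the model learns only from words with enough occurrences  
- num_workers : number of worker threads used for training (gensim default: 3)  
- context : size of the context window used by word2vec  
a. Maximum distance between the current and predicted word within a sentence.  
b. With CBOW (Continuous Bag of Words, sg=0), the words before and after a center word are used to predict that center word.  
c. This value decides how far from the center word, in either direction, words are still taken into account.
- downsampling : threshold for randomly downsampling high-frequency words, which speeds up training  
a. 0.001 is commonly reported to work well  
b. 0.001 acts as a threshold: words whose relative frequency is higher than this are randomly downsampled  
c. The effect is that frequent words are trained on only occasionally (randomly skipped), while infrequent words are trained on every time they occur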
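
To make the downsampling threshold concrete, here is a minimal sketch of how a keep probability could be derived from a word's relative frequency. It assumes the formula from the original word2vec C code, which gensim appears to follow; `keep_probability` is a hypothetical helper and not part of gensim.

```python
import math


def keep_probability(word_freq: float, sample: float = 1e-3) -> float:
    """Chance that one occurrence of a word is kept during training.

    word_freq is the word's share of all tokens in the corpus (0 to 1).
    Assumption: formula taken from the original word2vec C code.
    """
    if word_freq <= 0:
        raise ValueError("word_freq must be positive")
    ratio = word_freq / sample
    # Frequent words (large ratio) get a probability below 1;
    # rare words are capped at 1, so they are always kept.
    return min(1.0, (math.sqrt(ratio) + 1) / ratio)


for freq in (0.05, 0.01, 0.001, 0.0001):
    print(f"relative frequency {freq:.4f} -> keep probability {keep_probability(freq):.3f}")
```

Under this formula a word is only thinned out once its relative frequency is a few times larger than the threshold, and the more frequent it is, the more of its occurrences are skipped.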

In [8]:
num_features = 300
min_word_count = 40
num_workers = 4
context = 10
downsampling = 1e-3
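
What the `context` value means can be illustrated with a small sketch that builds CBOW-style (context words, target word) pairs. It assumes a fixed window for simplicity, whereas real implementations may vary the effective window per word. `cbow_pairs` is a hypothetical helper, and the window is set to 2 instead of 10 to keep the output short.

```python
def cbow_pairs(tokens, window):
    """Yield (context_words, target_word) pairs for a fixed window size."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        yield left + right, target


# First few tokens of the first review shown above
tokens = "stuff going moment mj started listening music".split()
pairs = list(cbow_pairs(tokens, window=2))
for context_words, target in pairs:
    print(context_words, "->", target)
```

With `context = 10`, each target word would be predicted from up to 10 words on either side instead of 2.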

In [9]:
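# Set how verbose the logging should be (logging is used a lot in practice)

import logging
# The format placeholders must all use '%': '%(levelname)s', not '$(levelname)s'
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)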

In [11]:
import smart_open
from gensim.models import word2vec

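# Train the model with the hyperparameters defined above.
# Note: `size` is the gensim 3.x argument name; gensim 4.0+ renamed it to `vector_size`.
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)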

2021-10-08 11:27:43,698 : $(levelname)s : collecting all words and their counts
2021-10-08 11:27:43,699 : $(levelname)s : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-10-08 11:27:43,917 : $(levelname)s : PROGRESS: at sentence #10000, processed 1205223 words, keeping 51374 word types
2021-10-08 11:27:44,117 : $(levelname)s : PROGRESS: at sentence #20000, processed 2396605 words, keeping 67660 word types
2021-10-08 11:27:44,216 : $(levelname)s : collected 74065 word types from a corpus of 2988089 raw words and 25000 sentences
2021-10-08 11:27:44,217 : $(levelname)s : Loading a fresh vocabulary
2021-10-08 11:27:44,250 : $(levelname)s : effective_min_count=40 retains 8160 unique words (11% of original 74065, drops 65905)
2021-10-08 11:27:44,251 : $(levelname)s : effective_min_count=40 leaves 2627273 word corpus (87% of original 2988089, drops 360816)
2021-10-08 11:27:44,268 : $(levelname)s : deleting the raw counts dictionary of 74065 items
2021-10-08 11:27:44,270
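
The `effective_min_count=40` lines in the log above report how many word types and tokens survive the frequency cut-off. The following is a minimal sketch of that kind of filtering on a toy corpus, using only the standard library. `filter_vocab` and `toy_sentences` are hypothetical names, and this is not gensim's actual implementation.

```python
from collections import Counter


def filter_vocab(sentences, min_count):
    """Count every word and keep only those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept


toy_sentences = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie", "good", "acting"],
]
counts, kept = filter_vocab(toy_sentences, min_count=2)
print(f"retains {len(kept)} unique words of {len(counts)}")
print(f"leaves {sum(kept.values())} of {sum(counts.values())} tokens")
```

The same pattern explains the log: most of the word types are rare and get dropped, yet the retained words still cover the bulk of the corpus.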

In [13]:
# Save the model trained so far
model_name = "300features_40minwords_10context"
model.save(model_name)

2021-10-08 11:30:00,796 : $(levelname)s : saving Word2Vec object under 300features_40minwords_10context, separately None
2021-10-08 11:30:00,797 : $(levelname)s : not storing attribute vectors_norm
2021-10-08 11:30:00,798 : $(levelname)s : not storing attribute cum_table
2021-10-08 11:30:00,970 : $(levelname)s : saved 300features_40minwords_10context
