### Sentiment Analysis (SA)

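- `knowledge-based techniques`: build a lexicon that rates how strongly each word expresses a sentiment, then use it to score the sentiment of a text
    - e.g. SentiWordNet: a well-known database that annotates each word with positive/negative scores (available through Python's nltk)
- `statistical methods`: classify the sentiment of words using a variety of techniques such as SVM (Support Vector Machine) or BOW (Bag of Words)
    - e.g. Naive Bayes Classifier (included in Python's nltk)
- `hybrid approaches`: combine the two approaches above as appropriate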
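
To make the knowledge-based idea concrete before turning to SentiWordNet, here is a minimal sketch with a tiny hand-made lexicon. The entries and scores in `TOY_LEXICON` are invented for illustration only (they are not SentiWordNet values), and `score_text` and `predict_label` are hypothetical helper names.

```python
# Hypothetical mini-lexicon: word -> (positive score, negative score).
# The numbers are made up for illustration; they are NOT SentiWordNet values.
TOY_LEXICON = {
    "great": (0.75, 0.0),
    "funny": (0.5, 0.0),
    "boring": (0.0, 0.625),
    "awful": (0.0, 0.875),
}

def score_text(text, lexicon=TOY_LEXICON):
    """Sum positive and negative lexicon scores over whitespace-separated words."""
    pos_total, neg_total = 0.0, 0.0
    for raw in text.lower().split():
        word = raw.strip(".,!?\"'")  # crude punctuation stripping
        pos, neg = lexicon.get(word, (0.0, 0.0))  # unknown words score zero
        pos_total += pos
        neg_total += neg
    return pos_total, neg_total

def predict_label(text):
    """Return 1 (positive) if the positive total is at least the negative total."""
    pos_total, neg_total = score_text(text)
    return 1 if pos_total >= neg_total else 0

print(score_text("A great, funny film!"))
print(predict_label("Boring and awful."))
```

Ties (including texts with no lexicon words at all) fall on the positive side here, mirroring the `>=` comparison used later in this notebook.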

##### knowledge-based techniques
Dataset source: http://ai.stanford.edu/~amaas/data/sentiment/

In [12]:
import os
files = os.listdir('../../aclImdb/train/pos')

first_file = files[0]
with open('../../aclImdb/train/pos/{}'.format(first_file), 'r', encoding='utf-8') as f:
    review = f.read()

In [13]:
review

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

In [3]:
pos_train_list = []
for file in files:
    with open('../../aclImdb/train/pos/{}'.format(file), 'r', encoding='utf-8') as f:
        review = f.read()
    pos_train_list.append(review)
len(pos_train_list)

12500
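
One caveat about the loading loop: `os.listdir` returns entries in arbitrary order, so `files[0]` may differ from machine to machine. A possible sketch of a reproducible reader follows. `read_reviews` is a hypothetical helper, and the demonstration runs on a temporary directory as a stand-in for `aclImdb/train/pos`.

```python
import tempfile
from pathlib import Path

def read_reviews(directory, limit=None):
    """Read the .txt files in `directory` in sorted filename order."""
    paths = sorted(Path(directory).glob("*.txt"))
    if limit is not None:
        paths = paths[:limit]
    return [p.read_text(encoding="utf-8") for p in paths]

# Demonstration on a throwaway directory (a stand-in for the real dataset path).
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("1_7.txt", "second"), ("0_9.txt", "first")]:
        (Path(tmp) / name).write_text(text, encoding="utf-8")
    demo_reviews = read_reviews(tmp)

print(demo_reviews)
```

The sort is lexicographic on the file name, so `10_x.txt` comes before `2_x.txt`. That is enough for reproducibility, though not for numeric ordering.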

In [5]:
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download('sentiwordnet')

[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/macbook/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.


True

In [6]:
list(swn.senti_synsets('hate'))

[SentiSynset('hate.n.01'), SentiSynset('hate.v.01')]

In [7]:
list(swn.senti_synsets('hate', 'v'))[0].pos_score()

0.0

In [8]:
list(swn.senti_synsets('hate', 'v'))[0].neg_score()

0.75

In [9]:
# Use the average positive/negative score over all synsets of the word
def word_sentiment_calculator(word, tag):
    # Map Penn Treebank tag prefixes to WordNet POS codes
    # (noun, verb, adjective, adverb), checked in this order
    tag_to_pos = [('NN', 'n'), ('VB', 'v'), ('JJ', 'a'), ('RB', 'r')]

    # syn_set: the synsets found for the first matching POS
    syn_set = []
    for prefix, wn_pos in tag_to_pos:
        if prefix in tag:
            syn_set = list(swn.senti_synsets(word, wn_pos))
            if syn_set:
                break
    if not syn_set:
        return (0, 0)

    pos_score = sum(syn.pos_score() for syn in syn_set)
    neg_score = sum(syn.neg_score() for syn in syn_set)
    return (pos_score / len(syn_set), neg_score / len(syn_set))

In [10]:
word_sentiment_calculator('love', 'VB')

(0.625, 0.03125)

In [11]:
# Compute the positive/negative scores of a sentence from its (word, tag) pairs
def sentence_sentiment_calculator(pos_tags):
    pos_score = 0
    neg_score = 0
    for word, tag in pos_tags:
        # Look each word up once and reuse both scores
        word_pos, word_neg = word_sentiment_calculator(word, tag)
        pos_score += word_pos
        neg_score += word_neg
    return (pos_score, neg_score)

In [14]:
pos_files = os.listdir('../../aclImdb/train/pos/')[:10]
neg_files = os.listdir('../../aclImdb/train/neg/')[:10]

# Ten positive reviews are loaded first, then ten negative ones, so the labels are:
actual = [1] * 10 + [0] * 10
predicted = []

def sentence_sentiment_calculator2(text):
    # Tokenize and POS-tag the raw review text, then sum the word scores
    pos_score = 0
    neg_score = 0
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    for word, tag in pos_tags:
        word_pos, word_neg = word_sentiment_calculator(word, tag)
        pos_score += word_pos
        neg_score += word_neg
    return (pos_score, neg_score)

for file in pos_files:
    with open('../../aclImdb/train/pos/{}'.format(file), 'r', encoding='utf-8') as f:
        scores = sentence_sentiment_calculator2(f.read())
        
        if scores[0] >= scores[1]:
            predicted.append(1)
        else:
            predicted.append(0)
            
for file in neg_files:
    with open('../../aclImdb/train/neg/{}'.format(file), 'r', encoding='utf-8') as f:
        scores = sentence_sentiment_calculator2(f.read())
        
        if scores[0] >= scores[1]:
            predicted.append(1)
        else:
            predicted.append(0)
            
correct = 0
incorrect = 0
for i in range(20):
    if actual[i]  == predicted[i]:
        correct += 1
    else:
        incorrect += 1
        
print(actual)
print(predicted)

print('Number of correct instance:', correct)
print('Number of incorrect instance:', incorrect)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
Number of correct instance: 12
Number of incorrect instance: 8
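
The correct/incorrect counts can be broken down further. The sketch below recomputes a confusion matrix, accuracy, precision and recall from the two lists printed above. It adds no new data, and the variable names are only illustrative.

```python
# Labels and predictions copied from the output above.
actual = [1] * 10 + [0] * 10
predicted = [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # positive, predicted positive
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # positive, predicted negative
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # negative, predicted positive
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # negative, predicted negative

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print('TP, FN, FP, TN:', tp, fn, fp, tn)  # 7, 3, 5, 5
print('accuracy:', accuracy, 'precision:', round(precision, 3), 'recall:', recall)
```

With only 20 reviews these numbers are noisy, but they do show that half of the negative reviews were scored as positive.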


##### statistical methods
- Naive Bayes Classifier
    - TF-IDF(Term Frequency - Inverse Document Frequency)
        - Boolean model: represents each word by whether it appears (True/False)
        - Frequency model: represents each word by how many times it appears
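
The difference between the two models can be seen on a toy example. This is only a sketch: `vocabulary` and `tokens` are made-up inputs, and `boolean_features` / `frequency_features` are hypothetical helper names.

```python
from collections import Counter

def boolean_features(tokens, vocabulary):
    """Boolean model: True if the word occurs at least once."""
    token_set = set(tokens)
    return {w: (w in token_set) for w in vocabulary}

def frequency_features(tokens, vocabulary):
    """Frequency model: number of occurrences of each word."""
    counts = Counter(tokens)
    return {w: counts[w] for w in vocabulary}

vocabulary = ["good", "bad", "plot"]      # hypothetical feature words
tokens = "good plot , good cast".split()  # hypothetical tokenized review

print(boolean_features(tokens, vocabulary))
print(frequency_features(tokens, vocabulary))
```

The boolean representation is the one used for the Naive Bayes features in this section.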

In [15]:
import os
files = os.listdir('../../aclImdb/train/pos')

from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
from nltk.corpus import sentiwordnet as swn

# Collect every token from the positive training data in the list `words`
words = []
for file in files:
    with open('../../aclImdb/train/pos/{}'.format(file), 'r', encoding='utf-8') as f:
        review = nltk.word_tokenize(f.read())
        for token in review:
            # Remove stopwords
            if token not in stopWords:
                words.append(token)

len(words)

2293713


In [16]:
# Append the negative training data to `words` in the same way
files = os.listdir('../../aclImdb/train/neg')
for file in files:
    with open('../../aclImdb/train/neg/{}'.format(file), 'r', encoding='utf-8') as f:
        review = nltk.word_tokenize(f.read())
        for token in review:
            if token not in stopWords:
                words.append(token)
                
len(words)

4557794

In [17]:
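# Build a 3000-dimensional word feature list from `words`.
# Note: keys() follows insertion order, so this takes the first 3000 distinct
# tokens seen, not the 3000 most frequent (words.most_common(3000) would give those).
words = nltk.FreqDist(words)
word_features = list(words.keys())[:3000]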
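
Slicing `keys()` and taking the most frequent tokens are not the same thing. The sketch below shows the difference with `collections.Counter`, which `nltk.FreqDist` subclasses. The token list is made up for illustration.

```python
from collections import Counter

# Hypothetical token stream; nltk.FreqDist behaves like this Counter.
tokens = ["the", "film", "was", "good", "good", "good", "film"]
freq = Counter(tokens)

# First distinct tokens in order of first appearance
first_two = list(freq.keys())[:2]

# Tokens with the highest counts
top_two = [word for word, _ in freq.most_common(2)]

print(first_two)  # ['the', 'film']
print(top_two)    # ['good', 'film']
```

This explains why the feature list printed next starts with the words of the first review rather than with common sentiment-bearing words.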

In [20]:
word_features

['Bromwell',
 'High',
 'cartoon',
 'comedy',
 '.',
 'It',
 'ran',
 'time',
 'programs',
 'school',
 'life',
 ',',
 '``',
 'Teachers',
 "''",
 'My',
 '35',
 'years',
 'teaching',
 'profession',
 'lead',
 'believe',
 "'s",
 'satire',
 'much',
 'closer',
 'reality',
 'The',
 'scramble',
 'survive',
 'financially',
 'insightful',
 'students',
 'see',
 'right',
 'pathetic',
 'teachers',
 "'",
 'pomp',
 'pettiness',
 'whole',
 'situation',
 'remind',
 'schools',
 'I',
 'knew',
 'When',
 'saw',
 'episode',
 'student',
 'repeatedly',
 'tried',
 'burn',
 'immediately',
 'recalled',
 '...',
 'A',
 'classic',
 'line',
 ':',
 'INSPECTOR',
 "'m",
 'sack',
 'one',
 'STUDENT',
 'Welcome',
 'expect',
 'many',
 'adults',
 'age',
 'think',
 'far',
 'fetched',
 'What',
 'pity',
 "n't",
 '!',
 'Homelessness',
 '(',
 'Houselessness',
 'George',
 'Carlin',
 'stated',
 ')',
 'issue',
 'never',
 'plan',
 'help',
 'street',
 'considered',
 'human',
 'everything',
 'going',
 'work',
 'vote',
 'matter',
 'Most',

In [18]:
# Build the feature dict of a document from the word features
def find_features(doc):
    words = set(doc)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

In [19]:
# Build features from the first review in train/neg (`files` currently lists train/neg)
with open('../../aclImdb/train/neg/{}'.format(files[0]), 'r', encoding='utf-8') as f:
    review = nltk.word_tokenize(f.read())
find_features(review)

{'Bromwell': False,
 'High': False,
 'cartoon': False,
 'comedy': True,
 '.': True,
 'It': False,
 'ran': False,
 'time': True,
 'programs': False,
 'school': False,
 'life': False,
 ',': True,
 '``': False,
 'Teachers': False,
 "''": False,
 'My': False,
 '35': False,
 'years': False,
 'teaching': False,
 'profession': False,
 'lead': False,
 'believe': False,
 "'s": True,
 'satire': False,
 'much': False,
 'closer': False,
 'reality': False,
 'The': True,
 'scramble': False,
 'survive': False,
 'financially': False,
 'insightful': False,
 'students': False,
 'see': False,
 'right': False,
 'pathetic': False,
 'teachers': False,
 "'": False,
 'pomp': False,
 'pettiness': False,
 'whole': False,
 'situation': False,
 'remind': False,
 'schools': False,
 'I': False,
 'knew': False,
 'When': False,
 'saw': False,
 'episode': False,
 'student': False,
 'repeatedly': False,
 'tried': False,
 'burn': False,
 'immediately': False,
 'recalled': False,
 '...': False,
 'A': True,
 'classic': Fa

In [22]:
# Build feature sets for 2,000 training and 2,000 test reviews (1,000 positive and 1,000 negative each)
feature_sets = []

# train/pos
files = os.listdir('../../aclImdb/train/pos')[:1000]
for file in files:
    with open('../../aclImdb/train/pos/{}'.format(file), 'r', encoding='utf-8') as f:
        review = nltk.word_tokenize(f.read())
        feature_sets.append((find_features(review), 'pos'))

# train/neg
files = os.listdir('../../aclImdb/train/neg')[:1000]
for file in files:
    with open('../../aclImdb/train/neg/{}'.format(file), 'r', encoding='utf-8') as f:
        review = nltk.word_tokenize(f.read())
        feature_sets.append((find_features(review), 'neg'))

# test/pos
files = os.listdir('../../aclImdb/test/pos')[:1000]
for file in files:
    with open('../../aclImdb/test/pos/{}'.format(file), 'r', encoding='utf-8') as f:
        review = nltk.word_tokenize(f.read())
        feature_sets.append((find_features(review), 'pos'))

# test/neg
files = os.listdir('../../aclImdb/test/neg')[:1000]
for file in files:
    with open('../../aclImdb/test/neg/{}'.format(file), 'r', encoding='utf-8') as f:
        review = nltk.word_tokenize(f.read())
        feature_sets.append((find_features(review), 'neg'))

In [23]:
train_set = feature_sets[:2000]
test_set = feature_sets[2000:]

In [25]:
clf = nltk.NaiveBayesClassifier.train(train_set)

In [26]:
result = nltk.classify.accuracy(clf, test_set) * 100

In [27]:
print('Accuracy of the Naive Bayes Classification model:', round(result, 2), '%')

Accuracy of the Naive Bayes Classification model: 82.0 %
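
For intuition about what a Naive Bayes classifier does with boolean features, here is a small from-scratch sketch. It is not NLTK's implementation (NLTK uses its own probability estimator). It assumes add-one smoothing, and the function names and toy training data are invented for illustration.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(labeled_features):
    """Count, per label, how often each boolean feature is True."""
    label_counts = defaultdict(int)
    true_counts = defaultdict(lambda: defaultdict(int))
    feature_names = set()
    for features, label in labeled_features:
        label_counts[label] += 1
        for name, value in features.items():
            feature_names.add(name)
            if value:
                true_counts[label][name] += 1
    return dict(label_counts), true_counts, sorted(feature_names)

def classify(model, features):
    """Pick the label with the highest log prior + summed log likelihoods."""
    label_counts, true_counts, feature_names = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for name in feature_names:
            # Add-one smoothed estimate of P(feature is True | label)
            p_true = (true_counts[label][name] + 1) / (n + 2)
            score += math.log(p_true if features.get(name, False) else 1 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data in the same (feature dict, label) shape as above.
toy_train = [
    ({"great": True, "awful": False}, "pos"),
    ({"great": True, "awful": False}, "pos"),
    ({"great": False, "awful": True}, "neg"),
    ({"great": False, "awful": True}, "neg"),
]
toy_model = train_bernoulli_nb(toy_train)
print(classify(toy_model, {"great": True, "awful": False}))
print(classify(toy_model, {"great": False, "awful": True}))
```

The "naive" part is the inner loop: each feature contributes its log likelihood independently of all the others.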


##### Doc2Vec
- A kind of vector space model, built with a simple neural network
- Allows analysis that takes context into account

In [2]:
# gensim package, used to build the Doc2Vec model
import gensim
from gensim.models import Doc2Vec
import numpy as np
from random import shuffle
from sklearn.linear_model import LogisticRegression

In [3]:
# nltk package, used for text preprocessing
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
import multiprocessing
import os

In [4]:
stop_words = set(stopwords.words('english'))
lemm = WordNetLemmatizer()

In [5]:
# TaggedDocument replaces the deprecated LabeledSentence alias
TaggedDocument = gensim.models.doc2vec.TaggedDocument
class LabeledLineSentence(object):
    
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list
    
    def __iter__(self):
        # Yield one tagged document per review, tagged with its label
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=str(doc).split(), tags=[self.labels_list[idx]])

In [6]:
review_list = []
labels_list = []

# Build the training dataset
files = os.listdir('../../aclImdb/train/pos')[:1000]
for file in files:
    review = ''
    with open('../../aclImdb/train/pos/{}'.format(file), 'r', encoding = 'utf-8') as f:
        for word in word_tokenize(f.read()):
            if lemm.lemmatize(word) not in stop_words:
                review += ' ' + word
    review_list.append(review)
    labels_list.append('pos_' + file)
 
files = os.listdir('../../aclImdb/train/neg')[:1000]
for file in files:
    review = ''
    with open('../../aclImdb/train/neg/{}'.format(file), 'r', encoding = 'utf-8') as f:
        for word in word_tokenize(f.read()):
            if lemm.lemmatize(word) not in stop_words:
                review += ' ' + word
    review_list.append(review)
    labels_list.append('neg_' + file)

In [7]:
# Build the test dataset
files = os.listdir('../../aclImdb/test/pos')[:1000]
for file in files:
    review = ''
    with open('../../aclImdb/test/pos/{}'.format(file), 'r', encoding = 'utf-8') as f:
        for word in word_tokenize(f.read()):
            if lemm.lemmatize(word) not in stop_words:
                review += ' ' + word
    review_list.append(review)
    labels_list.append('pos_' + file)
 
files = os.listdir('../../aclImdb/test/neg')[:1000]
for file in files:
    review = ''
    with open('../../aclImdb/test/neg/{}'.format(file), 'r', encoding = 'utf-8') as f:
        for word in word_tokenize(f.read()):
            if lemm.lemmatize(word) not in stop_words:
                review += ' ' + word
    review_list.append(review)
    labels_list.append('neg_' + file)

In [8]:
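# Combine the review list and the label list into a LabeledLineSentence object
it = LabeledLineSentence(doc_list = review_list, labels_list = labels_list)

# Train the Doc2Vec model
'''
vector_size : dimensionality of the feature vectors (called `size` in older gensim releases)
window : maximum distance between the predicted word and the context words
dm : training algorithm (default = distributed memory; dm=0 selects distributed bag of words)
alpha : initial learning rate
min_alpha : the minimum value that alpha decreases to linearly during training
min_count : words with a total frequency lower than this are ignored
workers : number of worker threads used for parallel training (here, the CPU core count)
For the other options, see https://radimrehurek.com/gensim/models/doc2vec.html
'''

model = Doc2Vec(vector_size=3000, window=10, dm=0, alpha=0.025,
                min_alpha=0.025, min_count=5, workers=multiprocessing.cpu_count())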
model.build_vocab(it)
model.train(it, total_examples=4000, epochs=20)
model.save('partial_Doc2Vec.model')
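
The call above sets `alpha` and `min_alpha` to the same value, which means the learning rate has nowhere to decay to. The sketch below illustrates linear decay in a simplified per-epoch form. `linear_alpha_schedule` is a hypothetical helper, not a gensim function, and gensim's internal schedule is finer-grained than one value per epoch.

```python
def linear_alpha_schedule(alpha, min_alpha, epochs):
    """Learning rate at the start of each epoch under linear decay (simplified)."""
    if epochs < 2:
        return [alpha] * epochs
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * i for i in range(epochs)]

# Same values as the notebook: the rate stays constant at 0.025.
constant = linear_alpha_schedule(0.025, 0.025, 20)

# A smaller min_alpha (illustrative value) gives an actual decay.
decaying = linear_alpha_schedule(0.025, 0.005, 5)

print(constant[0], constant[-1])
print([round(a, 4) for a in decaying])
```

Whether a constant rate is intended here is not stated in the notebook. Lowering `min_alpha` is the usual way to get the decay that the parameter description mentions.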



In [9]:
model = Doc2Vec.load('partial_Doc2Vec.model')

x_train = np.zeros((2000, 3000))
y_train = np.zeros(2000)
x_test = np.zeros((2000, 3000))
y_test = np.zeros(2000)

In [10]:
files = os.listdir('../../aclImdb/train/pos')[:1000]
for i in range(1000):
    x_train[i] = model.docvecs['pos_' + files[i]]
    y_train[i] = 1
files = os.listdir('../../aclImdb/train/neg')[:1000]
for i in range(1000):
    x_train[i+1000] = model.docvecs['neg_' + files[i]]
    y_train[i+1000] = 0

In [11]:
files = os.listdir('../../aclImdb/test/pos')[:1000]
for i in range(1000):
    x_test[i] = model.docvecs['pos_' + files[i]]
    y_test[i] = 1
files = os.listdir('../../aclImdb/test/neg')[:1000]
for i in range(1000):
    x_test[i+1000] = model.docvecs['neg_' + files[i]]
    y_test[i+1000] = 0

In [None]:
clf = LogisticRegression()
clf.fit(x_train, y_train)

In [None]:
# The kernel kept crashing here; the accuracy is reportedly about 91.65%
clf.score(x_test, y_test)

##### Sentiment analysis with a Keras LSTM (Long Short-Term Memory)

In [1]:
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing import sequence
np.random.seed(7)

Using TensorFlow backend.


In [2]:
# Keras ships the IMDB data already preprocessed
# Load it as is, keeping only the 5,000 most frequent words
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = top_words)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [4]:
# Pad or truncate every review to max_len tokens and keep the first 2,000 samples
max_len = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_len)[:2000, :]
X_test = sequence.pad_sequences(X_test, maxlen=max_len)[:2000, :]
y_train = y_train[:2000]
y_test = y_test[:2000]
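
With its default settings, `pad_sequences` pads on the left with zeros and, for sequences that are too long, drops items from the front. The pure-Python sketch below mimics that default behaviour for a single sequence. `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # truncate from the front
    return [value] * (maxlen - len(seq)) + seq  # pad at the front

print(pad_pre([1, 2, 3], 5))              # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

Every review therefore becomes a fixed-length row of 500 word indices, which is what the Embedding layer expects.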

In [5]:
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length = max_len))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

In [6]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
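
The parameter counts in the summary can be checked by hand. The arithmetic below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer. The variable names are only illustrative.

```python
top_words = 5000     # vocabulary size
embedding_dim = 32   # embedding_vector_length
lstm_units = 100     # LSTM(100)

# One embedding vector per vocabulary entry
embedding_params = top_words * embedding_dim

# Four gates, each with (input + recurrent) weights and a bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 160000 53200 101 213301
```

These match the `Param #` column and the total reported by `model.summary()`.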


In [7]:
model.fit(X_train, y_train, epochs = 3, batch_size = 64)
scores = model.evaluate(X_test, y_test, verbose = 0)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [8]:
print(scores[1] * 100, '%')

71.6 %
