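- Sentiment analysis computes sentiment scores from the subjective words and context that appear in a document's text.
- The sentiment score consists of a positive score and a negative score; combining the two determines whether the text is positive or negative.
- Supervised learning trains a sentiment model on training data with target labels, then uses that model to predict the sentiment of other data.
- Unsupervised learning uses a sentiment vocabulary dictionary called a 'Lexicon'. The lexicon's information about terms and their context is used to judge whether a document is positive or negative.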
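
The lexicon idea in the last bullet can be sketched in a few lines. The `TOY_LEXICON` dictionary and its scores are made up for illustration (they are not taken from SentiWordNet or VADER), and the rule shown is the simplest possible one: sum the word scores and call the text positive when the total reaches a threshold of 0.

```python
# Minimal sketch of lexicon-based scoring. TOY_LEXICON is a hypothetical,
# hand-made lexicon; real lexicons are far larger and carry more context.
TOY_LEXICON = {"good": 1.0, "great": 1.5, "boring": -1.0, "bad": -1.5}

def lexicon_score(text):
    """Sum the lexicon scores of the words in text (unknown words count as 0)."""
    return sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())

def lexicon_label(text, threshold=0.0):
    """Return 1 (positive) if the summed score reaches the threshold, else 0."""
    return 1 if lexicon_score(text) >= threshold else 0

print(lexicon_score("a great movie with a boring ending"))  # 1.5 + (-1.0) = 0.5
print(lexicon_label("a bad and boring movie"))               # score -2.5 -> 0
```

No labels are needed to produce a prediction here, which is what makes the approach unsupervised; labels are only needed afterwards to evaluate it.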

In [1]:
# Supervised learning: IMDB movie reviews
# https://www.kaggle.com/c/word2vec-nlp-tutorial/data
import pandas as pd
review_df = pd.read_csv('./dataset/labeledTrainData.tsv',header=0,sep="\t",quoting=3)
print(review_df.head(3) )
review_df.shape
review_df.columns.values

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...


array(['id', 'sentiment', 'review'], dtype=object)

In [2]:
print(review_df.review[0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [3]:
# How to use re.sub
import re 
re.sub('apple|orange', 'fruit', 'apple box orange tree')    # replace apple or orange with fruit

'fruit box fruit tree'

In [None]:
# Apply string operations through the str accessor of a DataFrame/Series
import re
# Replace the <br /> HTML tag with a space using str.replace
review_df.review = review_df.review.str.replace('<br />',' ')
# Use Python's regular expression module re to replace every character
# that is not an English letter with a space
review_df.review = review_df.review.apply(lambda x : re.sub('[^a-zA-Z]', ' ',x))
review_df.review[0]
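
The two cleaning steps can be tried on a single string before running them on the whole DataFrame. This is a small sketch; `clean_review` and the sample text are illustrative names and data, not part of the notebook.

```python
import re

def clean_review(text):
    """Replace <br /> tags and every non-letter character with a space."""
    text = text.replace('<br />', ' ')
    return re.sub('[^a-zA-Z]', ' ', text)

sample = 'Great film!<br /><br />10/10, would watch again.'  # made-up review text
cleaned = clean_review(sample)
print(cleaned.split())  # ['Great', 'film', 'would', 'watch', 'again']
```

Note that digits and punctuation are dropped along with the tags, so a rating such as "10/10" disappears from the text.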

In [6]:
from sklearn.model_selection import train_test_split
class_df = review_df.sentiment
feature_df = review_df.drop(['id','sentiment'], axis=1, inplace=False)
X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, \
                                                    test_size=0.3, \
                                                    random_state=156)
X_train.shape, X_test.shape

((17500, 1), (7500, 1))

In [9]:
from sklearn.metrics import accuracy_score, precision_score , recall_score , confusion_matrix, f1_score, roc_auc_score
def get_clf_eval(y_test , pred):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
  
    print('Confusion matrix')
    print(confusion)
    
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, '
          'F1: {3:.4f}'.format(accuracy, precision, recall, f1))

In [7]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# Run count vectorization with English stop-word filtering and ngram_range=(1,2).
# Set C=10 for LogisticRegression.
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))])
# Train and predict through the Pipeline object with fit() and predict().
# predict_proba() is called because ROC-AUC needs class probabilities.
pipeline.fit(X_train.review, y_train)
pred = pipeline.predict(X_test.review)
pred_probs = pipeline.predict_proba(X_test.review)[:,1]
print('Prediction accuracy : {0:.4f}, ROC-AUC : {1:.4f}'.format(accuracy_score(y_test,pred), \
                                                   roc_auc_score(y_test,pred_probs)))

Prediction accuracy : 0.8860, ROC-AUC : 0.9503


In [10]:
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))])
pipeline.fit(X_train.review, y_train)
pred = pipeline.predict(X_test.review)
pred_probs = pipeline.predict_proba(X_test.review)[:,1]
# get_clf_eval() prints its results and returns None, so call it directly
get_clf_eval(y_test,pred)
print()
print('ROC-AUC : ',roc_auc_score(y_test, pred_probs))

Confusion matrix
[[3257  423]
 [ 375 3445]]
Accuracy: 0.8936, Precision: 0.8906, Recall: 0.9018, F1: 0.8962

ROC-AUC :  0.959799823582973
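
The reported metrics can be re-derived by hand from the confusion matrix above. The sketch assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Counts copied from the TF-IDF + LogisticRegression confusion matrix above
tn, fp, fn, tp = 3257, 423, 375, 3445

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all reviews predicted correctly
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, F1: {3:.4f}'
      .format(accuracy, precision, recall, f1))
# Accuracy: 0.8936, Precision: 0.8906, Recall: 0.9018, F1: 0.8962
```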


#### Q. Perform IMDB movie review sentiment analysis using decision tree (DT) and random forest (RF) models with GridSearchCV

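### Unsupervised sentiment analysis
#### NLTK WordNet
- NLTK (Natural Language Toolkit) is a Python toolkit for natural language processing (NLP). It provides corpora such as WordNet and other resources, along with NLP features such as classification, tokenization, stemming, tagging, and parsing.
- NLTK has many submodules, and these include sentiment lexicons.
- The WordNet module is a lexical database that supports semantic analysis. Put simply, semantics is the meaning a word has in context.
- A Synset is not just a single word; it is the core WordNet concept that carries the context and semantic information of a word.
- SentiWordNet builds on WordNet synsets.
- To use WordNet, the WordNet subpackage and its data set must be downloaded.
- SentiWordNet has a SentiSynset class similar to WordNet's Synset. Like synsets() in the WordNet module, senti_synsets() returns a collection of SentiSynset objects; wrap the result in list() to get a list.
- Although NLTK's sentiment lexicon is useful as a dictionary of sentiment, its prediction performance is poor, so other sentiment lexicons are generally used.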

#### VADER
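* A package intended mainly for sentiment analysis of social media text
* It gives good sentiment analysis results with a relatively short run time, so it is often used on large volumes of text
* VADER is available both as an NLTK submodule and as a standalone package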

In [1]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_

[nltk_data]    |   Unzipping corpora\pros_cons.zip.
[nltk_data]    | Downloading package qc to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\qc.zip.
[nltk_data]    | Downloading package reuters to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package rte to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\rte.zip.
[nltk_data]    | Downloading package semcor to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package senseval to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\senseval.zip.
[nltk_data]    | Downloading package sentiwordnet to
[nltk_data]    |     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\sentiwordnet.zip.
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |   

False

In [2]:
# Create the WordNet synsets for the word 'present'.
# synsets() returns a list of Synset objects.
# A Synset name such as 'present.n.01' consists of the word, the POS (part of speech) tag, and a sense index.
from nltk.corpus import wordnet as wn
term = 'present'
synsets = wn.synsets(term)
print('synsets() return type :', type(synsets))
print('synsets() number of values:', len(synsets))
print('synsets() return value :', synsets)

synsets() return type : <class 'list'>
synsets() number of values: 18
synsets() return value : [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]


In [3]:
# Attributes of a Synset object
# A Synset expresses semantics through its POS (shown here via lexname()), definition, lemmas, and so on
for synset in synsets :
    print('##### Synset name : ', synset.name(),'#####')
    print('POS :',synset.lexname())
    print('Definition:',synset.definition())
    print('Lemmas:',synset.lemma_names())

##### Synset name :  present.n.01 #####
POS : noun.time
Definition: the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas: ['present', 'nowadays']
##### Synset name :  present.n.02 #####
POS : noun.possession
Definition: something presented as a gift
Lemmas: ['present']
##### Synset name :  present.n.03 #####
POS : noun.communication
Definition: a verb tense that expresses actions or states at the time of speaking
Lemmas: ['present', 'present_tense']
##### Synset name :  show.v.01 #####
POS : verb.perception
Definition: give an exhibition of to an interested audience
Lemmas: ['show', 'demo', 'exhibit', 'present', 'demonstrate']
##### Synset name :  present.v.02 #####
POS : verb.communication
Definition: bring forward and present to the mind
Lemmas: ['present', 'represent', 'lay_out']
##### Synset name :  stage.v.01 #####
POS : verb.creation
Definition: perform (a play), especially on a stage
Lemmas: ['stage', 'present', 'represen

In [5]:
import pandas as pd
# WordNet can express the relationship between one word and another as a similarity
# The path_similarity() method is provided to measure this similarity
# Create a synset object for each word.
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

entities = [tree , lion , tiger , cat , dog]
similarities = []
entity_names = [ entity.name().split('.')[0] for entity in entities]

# Iterate over each word's synset and measure its similarity to every other word's synset
for entity in entities:
    similarity = [ round(entity.path_similarity(compared_entity), 2)  \
                  for compared_entity in entities ]
    similarities.append(similarity)
    
# Store the similarity between each word's synset and the other synsets in a DataFrame.
similarity_df = pd.DataFrame(similarities , columns=entity_names,index=entity_names)
similarity_df
# lion is least similar to tree and most similar to tiger

       tree  lion  tiger   cat   dog
tree   1.00  0.07   0.07  0.08  0.12
lion   0.07  1.00   0.33  0.25  0.17
tiger  0.07  0.33   1.00  0.25  0.17
cat    0.08  0.25   0.25  1.00  0.20
dog    0.12  0.17   0.17  0.20  1.00
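
The values in this table are consistent with path similarity being 1 / (shortest path length + 1) over the is-a (hypernym) links between synsets, which is how NLTK describes `path_similarity()`. The sketch below reproduces that calculation on a small hand-written taxonomy. The `PARENT` links are an assumption made for illustration, not data read from WordNet.

```python
from collections import deque

# Hypothetical, hand-written is-a links (child -> parent); a simplified
# stand-in for a taxonomy, not data loaded from WordNet.
PARENT = {
    "lion": "big_cat", "tiger": "big_cat", "big_cat": "feline",
    "cat": "feline", "feline": "carnivore",
    "dog": "canine", "canine": "carnivore",
}

def neighbors(node):
    """Nodes one edge away: all direct children plus the parent, if any."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked

def shortest_path_length(start, goal):
    """Breadth-first search for the number of edges between two nodes."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected

def path_similarity(a, b):
    """1 / (shortest path length + 1), or None when no path exists."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1.0 / (dist + 1)

print(round(path_similarity("lion", "tiger"), 2))  # 2 edges -> 0.33
print(round(path_similarity("cat", "dog"), 2))     # 4 edges -> 0.2
```

A word compared with itself has path length 0 and therefore similarity 1.0, which matches the diagonal of the table.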


In [None]:
# Exercise: find words that are highly similar to tree, dog, and cat.

In [18]:
from nltk.corpus import wordnet as wn
term = 'kitty'
synsets = wn.synsets(term)

print('synsets() return value :', synsets)

synsets() return value : [Synset('pool.n.07'), Synset('pot.n.06'), Synset('kitten.n.01'), Synset('kitty.n.04')]


In [19]:
import pandas as pd
tree = wn.synset('tree.n.01')
shrub = wn.synset('shrub.n.01')
cat = wn.synset('cat.n.01')
kitty = wn.synset('kitty.n.04')
dog = wn.synset('dog.n.01')
wolf = wn.synset('wolf.n.01')
entities = [tree , shrub, cat, kitty , dog, wolf]
similarities = []
entity_names = [ entity.name().split('.')[0] for entity in entities]

for entity in entities:
    similarity = [ round(entity.path_similarity(compared_entity), 2)  \
                  for compared_entity in entities ]
    similarities.append(similarity)
    
 
similarity_df = pd.DataFrame(similarities , columns=entity_names,index=entity_names)
similarity_df


       tree  shrub   cat  kitty   dog  wolf
tree   1.00   0.33  0.08   0.11  0.12  0.08
shrub  0.33   1.00  0.08   0.11  0.12  0.08
cat    0.08   0.08  1.00   0.33  0.20  0.20
kitty  0.11   0.11  0.33   1.00  0.25  0.14
dog    0.12   0.12  0.20   0.25  1.00  0.33
wolf   0.08   0.08  0.20   0.14  0.33  1.00


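#### Sentiment analysis of movie reviews with SentiWordNet
* Split the document into sentences
* Tokenize each sentence into words and tag each word's part of speech
* Create synset and senti_synset objects from the POS-tagged words
* Get the positive and negative sentiment scores from each senti_synset and sum them; if the total is at or above a threshold, classify the document as positive, otherwise as negative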

In [20]:
# SentiWordNet has a SentiSynset class similar to WordNet's Synset
import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() return type :', type(senti_synsets))
print('senti_synsets() number of values:', len(senti_synsets))
print('senti_synsets() return value :', senti_synsets)

senti_synsets() return type : <class 'list'>
senti_synsets() number of values: 11
senti_synsets() return value : [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]


In [21]:
# A SentiSynset object holds sentiment scores for a word and an objectivity score;
# the sentiment scores are split into a positive score and a negative score
# If a word carries no sentiment at all, its objectivity score is 1 and both sentiment scores are 0

import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print('father positive score: ', father.pos_score())
print('father negative score: ', father.neg_score())
print('father objectivity score: ', father.obj_score())
print('\n')
fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score: ',fabulous.pos_score())
print('fabulous negative score: ',fabulous.neg_score())
print('fabulous objectivity score: ', fabulous.obj_score())

father positive score:  0.0
father negative score:  0.0
father objectivity score:  1.0


fabulous positive score:  0.875
fabulous negative score:  0.125
fabulous objectivity score:  0.0
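
In the output above the three scores sum to 1 for each synset, which is consistent with the objectivity score being derived as 1 - (positive + negative). The sketch below recomputes it from the printed values, along with the net polarity (positive minus negative) that the later scoring function accumulates. The helper names are illustrative and not part of NLTK.

```python
def objectivity(pos_score, neg_score):
    """Objectivity as the remainder once positive and negative scores are removed."""
    return 1.0 - (pos_score + neg_score)

def net_polarity(pos_score, neg_score):
    """Positive score minus negative score; above 0 leans positive."""
    return pos_score - neg_score

# (positive, negative) pairs copied from the output above
scores = {"father": (0.0, 0.0), "fabulous": (0.875, 0.125)}
for word, (pos, neg) in scores.items():
    print(word, objectivity(pos, neg), net_polarity(pos, neg))
# father 1.0 0.0
# fabulous 0.0 0.75
```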


In [23]:
# Helper function for POS tag conversion
from nltk.corpus import wordnet as wn

# Convert a simple NLTK Penn Treebank tag into the corresponding WordNet POS tag
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    # Any other tag has no WordNet equivalent
    return None
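
To see what the mapping does without loading NLTK, here is a stand-alone sketch of the same prefix rule. It assumes that the WordNet POS constants are the single letters `'a'`, `'n'`, `'r'`, and `'v'`; `penn_to_wn_demo` is an illustrative name.

```python
# Assumption: NLTK's WordNet POS constants are these single letters
# (wn.ADJ == 'a', wn.NOUN == 'n', wn.ADV == 'r', wn.VERB == 'v').
ADJ, NOUN, ADV, VERB = 'a', 'n', 'r', 'v'

def penn_to_wn_demo(tag):
    """Map a Penn Treebank tag to a WordNet POS letter by its first character."""
    prefix_map = {'J': ADJ, 'N': NOUN, 'R': ADV, 'V': VERB}
    return prefix_map.get(tag[:1])  # None for tags with no WordNet equivalent

for tag in ['JJ', 'NNS', 'RB', 'VBD', 'DT']:
    print(tag, '->', penn_to_wn_demo(tag))
# JJ -> a
# NNS -> n
# RB -> r
# VBD -> v
# DT -> None
```

Because only the first letter is checked, a particle tag such as `RP` is also mapped to the adverb class, which is a known simplification of this kind of prefix rule.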

In [24]:
# Function that sums the polarity scores
# Returns 1 (positive) if the total score is 0 or higher, otherwise 0 (negative)
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

def swn_polarity(text):
    # Initialize the sentiment score
    sentiment = 0.0
    tokens_count = 0
    
    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)
    # For each sentence: word tokens -> POS tagging -> SentiSynset -> sum the sentiment scores
    for raw_sentence in raw_sentences:
        # POS-tag the tokenized sentence with NLTK's Penn Treebank tags
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        for word , tag in tagged_sentence:
            
            # Convert to a WordNet POS tag and lemmatize.
            # Only nouns, adjectives, and adverbs are scored; other tags are skipped
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN , wn.ADJ, wn.ADV):
                continue                   
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue
            # Look up the Synset objects for the lemma and its WordNet POS tag.
            # synsets() returns a list of Synset objects
            synsets = wn.synsets(lemma , pos=wn_tag)
            if not synsets:
                continue
            # Take the first synset and fetch its SentiSynset from SentiWordNet.
            # For every word, add the positive score and subtract the negative score.
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())           
            tokens_count += 1
    
    # No scorable tokens: treat the text as negative
    if not tokens_count:
        return 0
    
    # Return 1 (positive) if the total score is 0 or higher, otherwise 0 (negative)
    if sentiment >= 0 :
        return 1
    
    return 0

In [25]:
# Apply swn_polarity(text) to each IMDB review to predict positive or negative sentiment
# Add a new 'preds' column to review_df holding the sentiment returned by swn_polarity(text)
# Takes about 10 minutes
review_df['preds'] = review_df['review'].apply( lambda x : swn_polarity(x) )
y_target = review_df['sentiment'].values
preds = review_df['preds'].values

In [36]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score 
from sklearn.metrics import recall_score, f1_score, roc_auc_score

def get_clf_eval(y_test=None, pred=None):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
    # Add ROC-AUC (computed from the 0/1 predictions, not from probabilities)
    roc_auc = roc_auc_score(y_test, pred)
    print('Confusion matrix')
    print(confusion)
    # Print ROC-AUC as well
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, '
          'F1: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

In [27]:
# Measure accuracy, precision, and recall of preds against the actual labels in the sentiment column
print('#### SentiWordNet prediction performance ####')
get_clf_eval(y_target, preds)

#### SentiWordNet prediction performance ####
Confusion matrix
[[7668 4832]
 [3636 8864]]
Accuracy: 0.6613, Precision: 0.6472, Recall: 0.7091, F1: 0.6767, AUC: 0.6613
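
The AUC reported here equals the accuracy, and that is not a coincidence. When `roc_auc_score` receives hard 0/1 predictions instead of probabilities, the ROC curve has a single operating point, and the area under it reduces to the mean of the true positive rate and the true negative rate. With 12,500 reviews in each class, that mean is the same as accuracy. The sketch below checks this from the confusion matrix.

```python
# Counts copied from the SentiWordNet confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 7668, 4832, 3636, 8864

tpr = tp / (tp + fn)   # true positive rate (recall of the positive class)
tnr = tn / (tn + fp)   # true negative rate (recall of the negative class)

auc_from_labels = (tpr + tnr) / 2
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(auc_from_labels, 4), round(accuracy, 4))  # 0.6613 0.6613
```

A lexicon method that outputs a continuous score, such as the summed polarity before thresholding, would give a more informative AUC than the thresholded labels do.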


#### Sentiment analysis with the VADER lexicon
- The SentimentIntensityAnalyzer class makes sentiment analysis straightforward

In [40]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [41]:
import pandas as pd
review_df = pd.read_csv('./dataset/labeledTrainData.tsv',header=0,sep="\t",quoting=3)
print(review_df.head(3) )
review_df.shape
review_df.columns.values

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...


array(['id', 'sentiment', 'review'], dtype=object)

In [42]:
import re
# Replace the <br /> HTML tag with a space using str.replace
review_df.review = review_df.review.str.replace('<br />',' ')
# Use Python's regular expression module re to replace every character
# that is not an English letter with a space
review_df.review = review_df.review.apply(lambda x : re.sub('[^a-zA-Z]', ' ',x))
# review_df.review[0]

In [45]:
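# Import SentimentIntensityAnalyzer from the NLTK submodule and analyze an IMDB review
# neg is negative, neu is neutral, pos is positive, and compound is the combined sentiment score
# The compound score ranges from -1 to 1; this notebook treats 0.1 or higher as positive
# and anything below as negative, but the threshold can be tuned to adjust prediction performance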
from nltk.sentiment.vader import SentimentIntensityAnalyzer
senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df.review[1])
print(senti_scores)

{'neg': 0.082, 'neu': 0.691, 'pos': 0.227, 'compound': 0.9783}
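
The compound score stays between -1 and 1 because the summed word valences are normalized. The sketch below assumes the normalization used in VADER's reference implementation, x / sqrt(x² + α) with α = 15; treat both the formula and the constant as assumptions to verify against the installed version. The raw values in the loop are hypothetical.

```python
import math

def normalize(score, alpha=15):
    """Map an unbounded summed valence score into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

def label_from_compound(compound, threshold=0.1):
    """Return 1 (positive) if compound is at or above the threshold, else 0."""
    return 1 if compound >= threshold else 0

for raw in (-4.0, 0.0, 0.5, 4.0, 18.0):  # hypothetical summed valence values
    compound = normalize(raw)
    print(raw, round(compound, 4), label_from_compound(compound))
```

Under this normalization a long review with many mildly positive words quickly approaches 1, which helps explain compound values such as 0.9783 for a review whose `pos` share is only 0.227.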


In [46]:
review_df.review[1]

'   The Classic War of the Worlds   by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H  G  Wells  classic book  Mr  Hines succeeds in doing so  I  and those who watched his film with me  appreciated the fact that it was not the standard  predictable Hollywood fare that comes out every year  e g  the Spielberg version with Tom Cruise that had only the slightest resemblance to the book  Obviously  everyone looks for different things in a movie  Those who envision themselves as amateur   critics   look only to criticize everything they can  Others rate a movie on more important bases like being entertained  which is why most people never agree with the   critics    We enjoyed the effort Mr  Hines put into being faithful to H G  Wells  classic novel  and we found it to be very entertaining  This made it easy to overlook what the   critics   perceive to be its shortcomings  '

In [48]:
rv = review_df.review[1]

from nltk.sentiment.vader import SentimentIntensityAnalyzer
senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(rv)
print(senti_scores)

{'neg': 0.082, 'neu': 0.691, 'pos': 0.227, 'compound': 0.9783}


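Q. For one user review each of the top 3 ranked movies and the movies ranked 248 to 250 on 'https://www.imdb.com/chart/top/', compute the compound score with the unsupervised approach and state whether the sentiment is 'positive' or 'negative'.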

In [31]:
# Run on a review taken from https://www.imdb.com/chart/top/
review1 = '''
I've lost count of the number of times I have seen this movie, but it is more than 20. It has to be one of the best movies ever made. It made me take notice Morgan Freeman and Tim Robbins like I had never noticed any actors before.
I have from a very young age been a huge fan of anything Stephen King writes and had already read the short story that this movie is based on years prior to seeing this movie.
Not everything Stephen King has written that gets turned into a movie comes out well, but this is as close to perfection as it gets and has everything you could ever want in a movie.
Something that is outstanding is the fact that it has no real action, no special effects and no gimmicks. 99% of the movie is just men in a prison uniforms talking. Yet it absolutely hooks you almost from the beginning and has you glued to the screen to the end.
For me what really makes this film one of the best is the message of eternal hope it conveys throughout. The never ever give up hope attitude of the main character so well conveyed by Tim Robbins. The ending is just spine tingling every time I see it, no matter how many times I have seen it.
Brilliant, brilliant movie and a must see for everyone.
'''
from nltk.sentiment.vader import SentimentIntensityAnalyzer
senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review1)
print(senti_scores)

{'neg': 0.061, 'neu': 0.706, 'pos': 0.233, 'compound': 0.9943}


In [38]:
# User-defined evaluation function
from sklearn.metrics import accuracy_score, precision_score , recall_score , confusion_matrix, f1_score, roc_auc_score
def get_clf_eval(y_test , pred):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
  
    print('Confusion matrix')
    print(confusion)
    
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, '
          'F1: {3:.4f}'.format(accuracy, precision, recall, f1))

In [39]:
# IMDB sentiment analysis with VADER
# Create the analyzer once; building it inside the function reloads the lexicon for every review
analyzer = SentimentIntensityAnalyzer()

def vader_polarity(review, threshold=0.1):
    scores = analyzer.polarity_scores(review)
    # Return 1 if the compound score is at or above the threshold, otherwise 0
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment
# Use bracket assignment so that vader_preds becomes a real DataFrame column
review_df['vader_preds'] = review_df.review.apply(lambda x:vader_polarity(x,0.1))
y_target = review_df.sentiment.values
vader_preds = review_df.vader_preds.values
print('VADER prediction performance : ')
get_clf_eval(y_target, vader_preds)

VADER prediction performance : 
Confusion matrix
[[ 6736  5764]
 [ 1867 10633]]
Accuracy: 0.6948, Precision: 0.6485, Recall: 0.8506, F1: 0.7359
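
The high recall and the 5,764 false positives suggest that a threshold of 0.1 labels many negative reviews as positive. One way to explore this is to sweep the threshold and compare accuracy, as sketched below. The compound scores and labels here are hypothetical values for illustration, not results from the IMDB data.

```python
def accuracy_at(threshold, compounds, labels):
    """Accuracy when compound >= threshold is predicted as positive (1)."""
    preds = [1 if c >= threshold else 0 for c in compounds]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical compound scores and true labels, for illustration only
compounds = [0.95, 0.30, 0.08, -0.40, 0.60, -0.85]
labels    = [1,    1,    1,    0,     0,    0]

for threshold in (0.0, 0.1, 0.7):
    print(threshold, round(accuracy_at(threshold, compounds, labels), 3))
# 0.0 0.833
# 0.1 0.667
# 0.7 0.667
```

On real data the same loop could be run over the stored compound scores, computing them once per review rather than once per threshold.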
