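- Sentiment analysis computes a sentiment score from the subjective words and context that appear in a document's text.
- The sentiment score consists of a positive score and a negative score; the two are combined to decide whether the document is positive or negative.
- The supervised approach trains a model on training data with target labels, then uses that model to predict the sentiment of other data.
- The unsupervised approach relies on a sentiment vocabulary dictionary called a 'Lexicon'. It uses the lexicon's information about terms and their context to judge whether a document is positive or negative.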
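
To make the lexicon idea concrete, here is a minimal sketch. The word list `TOY_LEXICON`, its scores, and the function `lexicon_sentiment` are all hypothetical and exist only for illustration; "combining" the two scores is assumed here to mean subtracting the negative total from the positive total. Real lexicons are much larger and also account for context.

```python
# Hypothetical mini-lexicon: word -> (positive score, negative score).
# Real sentiment lexicons are far larger and also encode context.
TOY_LEXICON = {
    "great": (0.8, 0.0),
    "good": (0.6, 0.0),
    "boring": (0.0, 0.7),
    "disaster": (0.0, 0.9),
}

def lexicon_sentiment(text, lexicon=TOY_LEXICON):
    """Return (label, pos_total, neg_total); label is 1 for positive, 0 for negative."""
    pos_total = 0.0
    neg_total = 0.0
    for token in text.lower().split():
        word = token.strip(".,!?\"'")          # drop surrounding punctuation
        pos, neg = lexicon.get(word, (0.0, 0.0))  # unknown words contribute nothing
        pos_total += pos
        neg_total += neg
    label = 1 if pos_total - neg_total > 0 else 0
    return label, pos_total, neg_total

print(lexicon_sentiment("A great cast, but the plot is a boring disaster."))
```

No labels are needed for this approach, which is why it is called unsupervised; the rest of this notebook follows the supervised route instead.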

In [1]:
# Supervised approach: IMDB movie reviews
# https://www.kaggle.com/c/word2vec-nlp-tutorial/data

import pandas as pd
review_df = pd.read_csv('./dataset/labeledTrainData.tsv', header = 0, sep='\t', quoting=3)
print(review_df.head(3))
review_df.shape
review_df.columns.values

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...


array(['id', 'sentiment', 'review'], dtype=object)

In [2]:
print(review_df.review[3])

"It must be assumed that those who praised this film (\"the greatest filmed opera ever,\" didn't I read somewhere?) either don't care for opera, don't care for Wagner, or don't care about anything except their desire to appear Cultured. Either as a representation of Wagner's swan-song, or as a movie, this strikes me as an unmitigated disaster, with a leaden reading of the score matched to a tricksy, lugubrious realisation of the text.<br /><br />It's questionable that people with ideas as to what an opera (or, for that matter, a play, especially one by Shakespeare) is \"about\" should be allowed anywhere near a theatre or film studio; Syberberg, very fashionably, but without the smallest justification from Wagner's text, decided that Parsifal is \"about\" bisexual integration, so that the title character, in the latter stages, transmutes into a kind of beatnik babe, though one who continues to sing high tenor -- few if any of the actors in the film are the singers, and we get a double 

In [3]:
# How to use re.sub
import re
# Replace 'apple' or 'orange' with 'fruit'
re.sub('apple|orange', 'fruit', 'apple box orange tree')

'fruit box fruit tree'

In [4]:
# Apply string operations to a DataFrame/Series through the .str accessor
import re
# Replace the <br /> HTML tag with a space (regex=False matches it literally)
review_df.review = review_df.review.str.replace('<br />', ' ', regex=False)
# Use Python's regular expression module re to replace every character
# that is not an English letter with a space
review_df.review = review_df.review.apply(lambda x : re.sub('[^a-zA-Z]', ' ', x))
review_df.review[3]

' It must be assumed that those who praised this film    the greatest filmed opera ever    didn t I read somewhere   either don t care for opera  don t care for Wagner  or don t care about anything except their desire to appear Cultured  Either as a representation of Wagner s swan song  or as a movie  this strikes me as an unmitigated disaster  with a leaden reading of the score matched to a tricksy  lugubrious realisation of the text   It s questionable that people with ideas as to what an opera  or  for that matter  a play  especially one by Shakespeare  is   about   should be allowed anywhere near a theatre or film studio  Syberberg  very fashionably  but without the smallest justification from Wagner s text  decided that Parsifal is   about   bisexual integration  so that the title character  in the latter stages  transmutes into a kind of beatnik babe  though one who continues to sing high tenor    few if any of the actors in the film are the singers  and we get a double dose of A
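
The two cleaning steps can be tried on a single string without loading the dataset. This is a small sketch: `clean_review` mirrors the cell above, while `collapse_spaces` is an extra step that the notebook does not perform and is included only as an assumption about what one might do next with the runs of spaces.

```python
import re

def clean_review(text):
    # Same two steps as the cell above: drop <br /> tags, then blank out non-letters.
    text = text.replace('<br />', ' ')
    return re.sub('[^a-zA-Z]', ' ', text)

def collapse_spaces(text):
    # Extra step (an assumption, not in the notebook): squeeze whitespace runs.
    return ' '.join(text.split())

sample = '"Didn\'t like it.<br /><br />2/10"'
cleaned = clean_review(sample)
print(repr(cleaned))
print(collapse_spaces(cleaned))
```

Note that the apostrophe in "Didn't" also becomes a space, which is why the cleaned review above contains fragments such as "didn t" and "Wagner s".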

In [5]:
from sklearn.model_selection import train_test_split
class_df = review_df.sentiment
feature_df = review_df.drop(['id', 'sentiment'], axis = 1, inplace = False)
x_train, x_test, y_train, y_test = train_test_split(
    feature_df
    , class_df
    , test_size=0.3
    , random_state=156
)

x_train.shape, x_test.shape

((17500, 1), (7500, 1))

In [6]:
from sklearn.metrics import accuracy_score, precision_score , recall_score , confusion_matrix, f1_score, roc_auc_score

def get_clf_eval(y_test , pred):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
  
    print('Confusion matrix')
    print(confusion)
    # ROC-AUC is not printed here; the caller reports it separately
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f},\
    F1: {3:.4f}'.format(accuracy, precision, recall, f1))

In [7]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Run CountVectorizer with English stop-word filtering and ngram_range=(1,2).
# Set C=10 for LogisticRegression.
# A Pipeline chains both steps so they run with a single call.
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))])
# Use the Pipeline object's fit() and predict() to train and predict.
# predict_proba() is called because ROC-AUC needs class probabilities.

# Train
pipeline.fit(x_train.review, y_train)

# Predict
pred = pipeline.predict(x_test.review)

# predict_proba is needed for ROC-AUC
pred_probs = pipeline.predict_proba(x_test.review)[:,1]

print('Prediction accuracy : {0:.4f}, ROC-AUC : {1:.4f}'.format(accuracy_score(y_test,pred), roc_auc_score(y_test,pred_probs)))

Prediction accuracy : 0.8865, ROC-AUC : 0.9508
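
To see what `ngram_range=(1,2)` combined with stop-word filtering feeds into the classifier, the following pure-Python sketch counts unigrams and bigrams by hand. It is only an approximation: `count_ngrams` and `TOY_STOP_WORDS` are hypothetical names, the stop list is a tiny stand-in, and tokenization is a plain whitespace split rather than scikit-learn's own tokenizer.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption); scikit-learn's built-in
# English list is much longer.
TOY_STOP_WORDS = {"the", "was", "a", "and", "is"}

def count_ngrams(text, stop_words=TOY_STOP_WORDS, ngram_range=(1, 2)):
    """Count word n-grams after lowercasing and stop-word removal."""
    # Stop words are removed first, so bigrams span the remaining tokens.
    tokens = [t for t in text.lower().split() if t not in stop_words]
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = count_ngrams("the acting was good and the story was good")
print(counts)
```

Each distinct n-gram becomes one feature column, so adding bigrams enlarges the vocabulary considerably compared with unigrams alone.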


In [10]:
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))])
# Use the Pipeline object's fit() and predict() to train and predict.
# predict_proba() is called because ROC-AUC needs class probabilities.

# Train
pipeline.fit(x_train.review, y_train)

# Predict
y_pred = pipeline.predict(x_test.review)

# predict_proba is needed for ROC-AUC
pred_probs = pipeline.predict_proba(x_test.review)[:,1]

get_clf_eval(y_test, y_pred)
print()
print('ROC-AUC: ', roc_auc_score(y_test, pred_probs))

Confusion matrix
[[3257  423]
 [ 376 3444]]
Accuracy: 0.8935, Precision: 0.8906, Recall: 0.9016,    F1: 0.8961

ROC-AUC:  0.9597786962212611
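
As a sanity check, the reported metrics can be recomputed directly from the confusion matrix. The sketch below assumes scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 3257, 423
fn, tp = 376, 3444

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions over all samples
precision = tp / (tp + fp)                   # predicted positives that are right
recall = tp / (tp + fn)                      # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, "
      f"recall={recall:.4f}, f1={f1:.4f}")
```

The values agree with the figures printed by `get_clf_eval`. ROC-AUC cannot be derived this way because it depends on the predicted probabilities rather than on the thresholded labels.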
