## Sentiment Analysis
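A method for identifying the subjective sentiment, opinion, emotion, or mood expressed in a document.
It works by analyzing the subjective words in a document and their context.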

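Supervised learning: train on data with labels, then predict the sentiment of other data.
Unsupervised learning: use a sentiment dictionary called a "lexicon". It judges whether a document is positive or negative from various information about terms and their context.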
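
To make the lexicon-based idea concrete, here is a minimal sketch. The word scores in `TOY_LEXICON` and the function name `lexicon_polarity` are made up for illustration; real lexicons such as SentiWordNet or VADER are far larger and also take context into account.

```python
# Hypothetical miniature sentiment lexicon: word -> polarity score (made-up values).
TOY_LEXICON = {"good": 1.0, "great": 1.5, "boring": -1.0, "bad": -1.0, "awful": -1.5}


def lexicon_polarity(text):
    """Sum the lexicon scores of known words; return 1 (positive) or 0 (negative)."""
    total = sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())
    return 1 if total > 0 else 0


print(lexicon_polarity("a great film with a good cast"))
print(lexicon_polarity("boring plot and awful acting"))
```

No labels are needed here: the prediction comes entirely from the dictionary, which is what makes the approach unsupervised.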

### IMDB movie review exercise

In [1]:
import pandas as pd

review_df = pd.read_csv("../labeledTrainData.tsv", header=0, sep="\t", quoting=3)
review_df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [2]:
print(review_df["review"][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [4]:
import re

review_df["review"] = review_df["review"].str.replace("<br />", " ")

review_df["review"] = review_df["review"].apply(lambda x : re.sub("[^a-zA-Z]", " ", x))
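
The effect of these two cleaning steps is easier to see on a single string. The sketch below applies the same replacements with plain Python; `clean_review` and the sample text are hypothetical names and data chosen for illustration.

```python
import re


def clean_review(text):
    """Replace HTML line breaks, then every non-letter character, with a space."""
    text = text.replace("<br />", " ")
    return re.sub("[^a-zA-Z]", " ", text)


sample = "Visually impressive!<br /><br />10/10, it's great."
print(clean_review(sample))
```

Note that digits and all punctuation, including sentence-ending periods and apostrophes, are replaced as well, so a contraction such as "it's" becomes the two tokens "it" and "s".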

In [5]:
from sklearn.model_selection import train_test_split

class_df = review_df["sentiment"]
feature_df = review_df.drop(["id", "sentiment"], axis=1, inplace=False)

X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size=0.3,
                                                    random_state=156)
X_train.shape, X_test.shape

((17500, 1), (7500, 1))

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

pipeline = Pipeline([
    ("cnt_vect", CountVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("lr_clf", LogisticRegression(solver="liblinear", C=10))
])

# Fit on the training reviews, then predict labels and positive-class probabilities
pipeline.fit(X_train["review"], y_train)
prd = pipeline.predict(X_test["review"])
prd_probs = pipeline.predict_proba(X_test["review"])[:, 1]

print("accuracy: {0:.4f}, ROC-AUC score: {1:.4f}".format(accuracy_score(y_test, prd),
                                                         roc_auc_score(y_test, prd_probs)))

accuracy: 0.8859, ROC-AUC score: 0.9503
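
The `ngram_range=(1, 2)` setting means that both single words and pairs of adjacent words become features. A minimal sketch of that expansion follows; `word_ngrams` is a hypothetical helper, and it skips the lowercasing, tokenizing, and stop-word removal that `CountVectorizer` performs first.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous word n-grams for n in the inclusive range."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams(["visually", "impressive", "film"]))
# ['visually', 'impressive', 'film', 'visually impressive', 'impressive film']
```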


In [7]:
pipeline = Pipeline([
    ("tfidf_vect", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("lr_clf", LogisticRegression(solver="liblinear", C=10))
])

# Fit on the training reviews, then predict labels and positive-class probabilities
pipeline.fit(X_train["review"], y_train)
prd = pipeline.predict(X_test["review"])
prd_probs = pipeline.predict_proba(X_test["review"])[:, 1]

print("accuracy: {0:.4f}, ROC-AUC score: {1:.4f}".format(accuracy_score(y_test, prd),
                                                         roc_auc_score(y_test, prd_probs)))

accuracy: 0.8936, ROC-AUC score: 0.9598
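
The only change from the previous cell is the weighting of the term counts. As a rough sketch of what TF-IDF does, the code below assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the defaults of `TfidfVectorizer`. The helper `tfidf_rows` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalized TF-IDF weights with smoothed IDF for tokenized docs."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = [["great", "film"], ["boring", "film"], ["great", "great", "cast"]]
for row in tfidf_rows(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that occur in fewer documents, such as "boring" here, receive a larger weight than terms that occur in many, such as "film".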


### Lexicon based unsupervised learning

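Uses the lexicon modules included in NLTK (they carry numeric sentiment scores).

WordNet: a large English lexical database that supports semantic analysis.
Semantic? The meaning of a word changes with situation and context.
WordNet represents each individual word, organized by part of speech, using the concept of a Synset (set of cognitive synonyms).
Drawback: prediction performance is not very good, so other sentiment lexicons are generally used in practice.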

- SentiWordNet: assigns a positivity score, a negativity score, and an objectivity score
- VADER: a package for sentiment analysis of social media text
- Pattern: a package noted for its prediction performance

### Sentiment analysis with SentiWordNet

In [8]:
import nltk
nltk.download("all")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /Users/jinjae/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/jinjae/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/jinjae/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/jinjae/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/jinjae/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to
[nltk_data]    |     /Users/jinjae/nltk_data...
[nltk_data]    | Downloadi

True

In [9]:
from nltk.corpus import wordnet as wn

term = "present"

synsets = wn.synsets(term)
print("return type:", type(synsets))
print("counts:", len(synsets))
print("return value:", synsets)

return type: <class 'list'>
counts: 18
return value: [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]
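
Each synset name in this list has the form `lemma.pos.number`: a word, a one-letter part-of-speech code, and a sense number. The sketch below unpacks that naming pattern with plain string handling; `parse_synset_name` and `POS_LABELS` are hypothetical helpers and are not part of NLTK.

```python
# WordNet part-of-speech letters as they appear in synset names.
POS_LABELS = {"n": "noun", "v": "verb", "a": "adjective",
              "s": "adjective satellite", "r": "adverb"}


def parse_synset_name(name):
    """Split a name such as 'present.n.01' into (lemma, POS label, sense number)."""
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, POS_LABELS[pos], int(sense)


print(parse_synset_name("present.n.01"))  # ('present', 'noun', 1)
print(parse_synset_name("show.v.01"))     # ('show', 'verb', 1)
```

Synsets named after other words, such as `show.v.01`, are returned because "present" is one of their lemmas.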


In [11]:
for synset in synsets:
    print("##### Synset name:", synset.name())
    print("POS:", synset.lexname())
    print("Definition:", synset.definition())
    print("Lemmas:", synset.lemma_names())

##### Synset name: present.n.01
POS: noun.time
Definition: the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas: ['present', 'nowadays']
##### Synset name: present.n.02
POS: noun.possession
Definition: something presented as a gift
Lemmas: ['present']
##### Synset name: present.n.03
POS: noun.communication
Definition: a verb tense that expresses actions or states at the time of speaking
Lemmas: ['present', 'present_tense']
##### Synset name: show.v.01
POS: verb.perception
Definition: give an exhibition of to an interested audience
Lemmas: ['show', 'demo', 'exhibit', 'present', 'demonstrate']
##### Synset name: present.v.02
POS: verb.communication
Definition: bring forward and present to the mind
Lemmas: ['present', 'represent', 'lay_out']
##### Synset name: stage.v.01
POS: verb.creation
Definition: perform (a play), especially on a stage
Lemmas: ['stage', 'present', 'represent']
##### Synset name: present.v.04
POS: verb.possessi

In [12]:
tree = wn.synset("tree.n.01")
lion = wn.synset("lion.n.01")
tiger = wn.synset("tiger.n.02")
cat = wn.synset("cat.n.01")
dog = wn.synset("dog.n.01")

entities = [tree, lion, tiger, cat, dog]
similarities = []
entity_names = [entity.name().split('.')[0] for entity in entities]

# measure similarities
for entity in entities:
    similarity = [round(entity.path_similarity(compared_entity), 2)
                  for compared_entity in entities]
    similarities.append(similarity)

similarity_df = pd.DataFrame(similarities, columns=entity_names, index=entity_names)
similarity_df

Unnamed: 0,tree,lion,tiger,cat,dog
tree,1.0,0.07,0.07,0.08,0.12
lion,0.07,1.0,0.33,0.25,0.17
tiger,0.07,0.33,1.0,0.25,0.17
cat,0.08,0.25,0.25,1.0,0.2
dog,0.12,0.17,0.17,0.2,1.0
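
Path similarity is defined as `1 / (1 + d)`, where `d` is the length of the shortest path between two synsets in the hypernym ("is-a") hierarchy. The sketch below reproduces a few of the values in the table on a small hand-written hierarchy. `TOY_HYPERNYM` is an assumption made for illustration and is not read from WordNet.

```python
# Toy hypernym ("is-a") hierarchy written by hand; not loaded from WordNet.
TOY_HYPERNYM = {
    "lion": "big_cat",
    "tiger": "big_cat",
    "big_cat": "feline",
    "cat": "feline",
    "feline": "carnivore",
    "dog": "canine",
    "canine": "carnivore",
}


def ancestors_with_depth(node):
    """Map each node on the path from `node` up to the root to its distance from `node`."""
    depths = {}
    steps = 0
    while node is not None:
        depths[node] = steps
        node = TOY_HYPERNYM.get(node)
        steps += 1
    return depths


def toy_path_similarity(a, b):
    """Return 1 / (1 + shortest path length) between two nodes of the toy tree."""
    up_a = ancestors_with_depth(a)
    up_b = ancestors_with_depth(b)
    common = set(up_a) & set(up_b)
    if not common:
        return None  # no shared ancestor, so no path
    shortest = min(up_a[n] + up_b[n] for n in common)
    return 1.0 / (1 + shortest)


print(round(toy_path_similarity("lion", "tiger"), 2))  # 0.33
print(round(toy_path_similarity("cat", "dog"), 2))     # 0.2
```

Lion and tiger are two steps apart (via `big_cat`), giving 1/3, while cat and dog are four steps apart (via `carnivore`), giving 1/5.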


In [14]:
import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets("slow"))
print("return type:", type(senti_synsets))
print("return counts:", len(senti_synsets))
print("return value:", senti_synsets)

return type: <class 'list'>
return counts: 11
return value: [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]


In [17]:
import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset("father.n.01")
print(father.pos_score())
print(father.neg_score())
print(father.obj_score())
print("\n")
fabulous = swn.senti_synset("fabulous.a.01")
print(fabulous.pos_score())
print(fabulous.neg_score())
print(fabulous.obj_score())

0.0
0.0
1.0


0.875
0.125
0.0
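
In both outputs the three scores sum to 1, which is consistent with the objectivity score being whatever remains after the positive and negative scores. A small check of that relationship, using the numbers printed above (`objectivity` is a hypothetical helper):

```python
def objectivity(pos_score, neg_score):
    """Objectivity as the remainder after the positive and negative scores."""
    return 1.0 - (pos_score + neg_score)


print(objectivity(0.0, 0.0))      # father.n.01 -> 1.0
print(objectivity(0.875, 0.125))  # fabulous.a.01 -> 0.0
```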


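Sentiment analysis of the reviews based on the SentiWordNet lexicon:
1. Split the document into sentences
2. Split each sentence into words and tag parts of speech
3. Create synset and senti_synset objects
4. Compute the positive/negative scores and decide the sentiment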

In [18]:
from nltk.corpus import wordnet as wn

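# Map Penn Treebank POS tags to WordNet POS constants
def penn_to_wn(tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if unmapped."""
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None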

In [19]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag


def swn_polarity(text):
    sentiment = 0.0
    tokens_count = 0

    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)
    # For each sentence, look up a SentiSynset per token and accumulate the scores
    for raw_sentence in raw_sentences:
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        for word, tag in tagged_sentence:
            wn_tag = penn_to_wn(tag)
            # Score only nouns, adjectives, and adverbs; verbs and unmapped tags are skipped
            if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
                continue
            # lemmatize
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue
            # make Synset object based on word and part of speech
            synsets = wn.synsets(lemma, pos=wn_tag)
            if not synsets:
                continue
            # Take the first sense; add its positive score and subtract its negative score
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())
            tokens_count += 1

    if not tokens_count:
        return 0

    if sentiment >= 0 :
        return 1
    return 0
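
The decision rule at the end of `swn_polarity` can be looked at in isolation. The sketch below applies the same rule to hand-written score pairs; `aggregate_polarity` is a hypothetical helper and the pairs are made-up values, not SentiWordNet lookups.

```python
def aggregate_polarity(scored_tokens):
    """Sum (pos - neg) over tokens; label 1 if the total is >= 0, else 0."""
    if not scored_tokens:
        return 0  # no scorable tokens: same fallback as swn_polarity
    sentiment = sum(pos - neg for pos, neg in scored_tokens)
    return 1 if sentiment >= 0 else 0


# Hypothetical (pos_score, neg_score) pairs for the tokens of three reviews.
print(aggregate_polarity([(0.875, 0.125), (0.0, 0.5)]))  # total 0.25  -> 1
print(aggregate_polarity([(0.0, 0.0), (0.0, 0.0)]))      # total 0.0   -> 1
print(aggregate_polarity([(0.0, 0.625)]))                # total -0.625 -> 0
```

Because the comparison is `>= 0`, a review whose scored tokens are all fully objective is labeled positive, which may be one reason predictions lean toward the positive class.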

In [20]:
review_df["preds"] = review_df["review"].apply(lambda x: swn_polarity(x))
y_target = review_df["sentiment"].values
preds = review_df["preds"].values

In [22]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score, roc_auc_score
import numpy as np

print(confusion_matrix(y_target, preds))
print("accuracy:", np.round(accuracy_score(y_target, preds), 4))
print("precision:", np.round(precision_score(y_target, preds), 4))
print("recall:", np.round(recall_score(y_target, preds), 4))

[[7668 4832]
 [3636 8864]]
accuracy: 0.6613
precision: 0.6472
recall: 0.7091
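
The three metrics can be recomputed by hand from the confusion matrix, which helps when reading it. In scikit-learn's layout the rows are true labels and the columns are predicted labels, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. The variable names below are chosen for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 7668, 4832
fn, tp = 3636, 8864

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are truly positive
recall = tp / (tp + fn)     # share of true positives that were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The 4832 false positives against 3636 false negatives show that the SentiWordNet rule predicts "positive" more often than it should.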


### Sentiment analysis with VADER
Uses SentimentIntensityAnalyzer, which provides sentiment analysis through a simple interface

In [23]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df["review"][0])
print(senti_scores)

{'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}
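
In this output the `neg`, `neu`, and `pos` values are proportions that sum to about 1, while `compound` is a single overall score between -1 and 1. The sketch below checks the sum and turns the compound score into a label with a cutoff. `label_from_compound` is a hypothetical helper, and the 0.1 threshold is the one used in the next cell.

```python
# Scores copied from the output above for the first review.
example_scores = {'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}

proportion_total = example_scores['neg'] + example_scores['neu'] + example_scores['pos']
print(round(proportion_total, 3))


def label_from_compound(compound, threshold=0.1):
    """Map a compound score to 1 (positive) or 0 (negative) using a cutoff."""
    return 1 if compound >= threshold else 0


print(label_from_compound(example_scores['compound']))
```

The first review is labeled positive in the dataset (`sentiment` is 1), so with this cutoff it is an example of a review that VADER gets wrong.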


In [24]:
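# Build the analyzer once; constructing it inside the function would reload the lexicon for every review
vader_analyzer = SentimentIntensityAnalyzer()


def vader_polarity(review, threshold=0.1):
    scores = vader_analyzer.polarity_scores(review)

    # Label as positive (1) when the compound score reaches the threshold
    agg_score = scores["compound"]
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment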

review_df["vader_preds"] = review_df["review"].apply(lambda x: vader_polarity(x, 0.1))
y_target = review_df["sentiment"].values
vader_preds = review_df["vader_preds"].values

print(confusion_matrix(y_target, vader_preds))
print("accuracy:", np.round(accuracy_score(y_target, vader_preds), 4))
print("precision:", np.round(precision_score(y_target, vader_preds), 4))
print("recall:", np.round(recall_score(y_target, vader_preds), 4))

[[ 6747  5753]
 [ 1858 10642]]
accuracy: 0.6956
precision: 0.6491
recall: 0.8514
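
`f1_score` was imported earlier but not printed. As a rough way to compare the two lexicon approaches with a single number, F1 can be computed directly from the counts in the two confusion matrices. `f1_from_counts` is a hypothetical helper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# Counts read from the SentiWordNet and VADER confusion matrices above.
swn_f1 = f1_from_counts(tp=8864, fp=4832, fn=3636)
vader_f1 = f1_from_counts(tp=10642, fp=5753, fn=1858)

print(round(swn_f1, 4), round(vader_f1, 4))
```

VADER recovers many more of the positive reviews (higher recall) but also produces more false positives (5753 against 4832), so its precision is almost unchanged. Both lexicon approaches remain well below the supervised pipelines in accuracy.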
