# Text Analysis

***

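**Sentiment analysis** is a method for identifying the subjective sentiment, opinion, emotion, or mood expressed in a document, and it is used in many areas such as social media, opinion polls, online reviews, and customer feedback. In my view, it is essentially the same concept as text classification. That said, sentiment analysis splits broadly into two approaches: **supervised learning** and **unsupervised learning**.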
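To make the supervised/unsupervised distinction concrete, here is a minimal sketch of one common unsupervised flavor, lexicon-based scoring, which needs no labeled training data. The lexicon `TOY_LEXICON` and the helpers `lexicon_score` and `lexicon_label` are hypothetical names invented for this illustration; a real lexicon would be far larger and usually carries graded polarity values.

```python
import re

# Hypothetical toy lexicon for illustration only (not from any real resource).
TOY_LEXICON = {"great": 1, "good": 1, "superb": 1,
               "bad": -1, "terrible": -1, "disaster": -1}

def lexicon_score(text):
    """Sum the polarity of every known word; unknown words count as 0."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(TOY_LEXICON.get(tok, 0) for tok in tokens)

def lexicon_label(text):
    """Return 1 (positive) if the summed polarity is above zero, else 0."""
    return 1 if lexicon_score(text) > 0 else 0

print(lexicon_label("A superb, great movie"))     # → 1
print(lexicon_label("This movie is a disaster"))  # → 0
```

The supervised approach used in the rest of this notebook instead learns which words matter from the labeled reviews themselves.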

### Supervised Sentiment Analysis


In [1]:
import pandas as pd

In [15]:
train = pd.read_csv("labeledTrainData.tsv", header = 0, sep = '\t', quoting = 3)

In [16]:
train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


In [34]:
train['sentiment'].value_counts()

0    12500
1    12500
Name: sentiment, dtype: int64

In [86]:
test = pd.read_csv('TestData.tsv', header = 0, sep = '\t', quoting = 3)

In [87]:
test.head()

| | id | review |
|---|---|---|
| 0 | "12311_10" | "Naturally in a film who's main themes are of ... |
| 1 | "8348_2" | "This movie is a disaster within a disaster fi... |
| 2 | "5828_4" | "All in all, this is a movie for kids. We saw ... |
| 3 | "7186_2" | "Afraid of the Dark left me with the impressio... |
| 4 | "12128_7" | "A very accurate depiction of small time mob l... |


In [19]:
import re

In [88]:
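# Replace the HTML line-break tags left in the raw reviews with a space
train['review'] = train['review'].str.replace('<br />', ' ', regex = False)
test['review'] = test['review'].str.replace('<br />', ' ', regex = False)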

In [89]:
# Replace every non-letter character (digits, punctuation, quotes) with a space
train['review'] = train['review'].apply(lambda x : re.sub(r'[^a-zA-Z]', ' ', x))
test['review'] = test['review'].apply(lambda x : re.sub(r'[^a-zA-Z]', ' ', x))
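To see what the two cleaning steps above do to a single review, the sketch below applies them to one made-up string. The helper `clean_review` and the sample text are assumptions added for illustration; the notebook itself applies the same operations column-wise with pandas.

```python
import re

def clean_review(text):
    """Apply the same two cleaning steps to a single string."""
    text = text.replace('<br />', ' ')       # drop HTML line breaks
    return re.sub(r'[^a-zA-Z]', ' ', text)   # keep letters only

sample = 'Great movie!<br />10/10, would watch again.'
print(clean_review(sample).split())
# → ['Great', 'movie', 'would', 'watch', 'again']
```

Note that the rating "10/10" disappears entirely, so any signal carried by digits is lost with this cleaning scheme.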

In [70]:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

In [106]:
y = train['sentiment']

In [56]:
skf = StratifiedKFold(n_splits = 10, random_state = 42, shuffle = True)
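Because the training set is perfectly balanced (12,500 reviews per class), stratification keeps that balance inside every fold. The following sketch checks this on toy labels; `toy_X`, `toy_y`, and `toy_skf` are stand-ins invented for the example, not the review data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumed data): 50 negatives and 50 positives.
toy_y = np.array([0] * 50 + [1] * 50)
toy_X = np.zeros((len(toy_y), 1))  # features do not influence the split

toy_skf = StratifiedKFold(n_splits = 10, random_state = 42, shuffle = True)

# Count how many samples of each class land in every validation fold.
fold_counts = [np.bincount(toy_y[val_idx]).tolist()
               for _, val_idx in toy_skf.split(toy_X, toy_y)]
print(fold_counts)
```

Each validation fold should contain five samples of each class, which mirrors the 1,250 / 1,250 split per fold in the real data.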

In [92]:
gbc = GradientBoostingClassifier(random_state = 42)
lgbm = LGBMClassifier(random_state = 42)
xgb = XGBClassifier(random_state = 42)

In [107]:
def get_model_proba(model, train_x, test_x) :
    # Train `model` on each stratified fold, print the validation metrics,
    # and return the fold-averaged positive-class probability for test_x.
    print(f'{model.__class__.__name__} Train & Predict Start!\n')
    n_splits = skf.get_n_splits()
    model_pred = np.zeros(test_x.shape[0])
    for i, (tr_idx, val_idx) in enumerate(skf.split(train_x, y)) :
        tr_x, tr_y = train_x[tr_idx], y.iloc[tr_idx]
        val_x, val_y = train_x[val_idx], y.iloc[val_idx]

        model.fit(tr_x, tr_y)

        val_pred = model.predict_proba(val_x)[:, 1]
        val_cls = (val_pred > 0.5).astype(int)

        acc = accuracy_score(val_y, val_cls)
        roc_auc = roc_auc_score(val_y, val_pred)

        print(f'{i + 1} Fold accuracy = {acc} / roc_auc = {roc_auc}\n')

        # Each fold's model contributes an equal share of the test prediction.
        model_pred += model.predict_proba(test_x)[:, 1] / n_splits

    return model_pred
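The core idea of the function above is fold averaging: one model is fitted per fold, and the test-set probabilities of all fold models are averaged. A self-contained sketch of that idea follows, under clearly stated assumptions: the data is synthetic (`make_classification`), and `LogisticRegression` is swapped in as a lightweight stand-in for the boosted models so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): 200 labeled rows, 50 unlabeled rows.
X_all, y_all = make_classification(n_samples = 250, n_features = 10, random_state = 42)
X_lab, y_lab, X_unl = X_all[:200], y_all[:200], X_all[200:]

folds = StratifiedKFold(n_splits = 5, random_state = 42, shuffle = True)
avg_pred = np.zeros(len(X_unl))

for tr_idx, val_idx in folds.split(X_lab, y_lab):
    clf = LogisticRegression(max_iter = 1000)
    clf.fit(X_lab[tr_idx], y_lab[tr_idx])
    # Add this fold's share of the averaged positive-class probability.
    avg_pred += clf.predict_proba(X_unl)[:, 1] / folds.get_n_splits()

print(avg_pred[:5])
```

Averaging over folds tends to smooth out the variance of any single fit, at the cost of training the model once per fold.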

#### Applying CountVectorizer

In [27]:
cnt_vect = CountVectorizer(max_features = 3000, ngram_range = (1, 2), stop_words = 'english')

In [46]:
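# toarray() yields a plain ndarray; todense() returns np.matrix, which newer scikit-learn releases no longer accept as input
cnt_train = cnt_vect.fit_transform(train['review']).toarray()

In [90]:
cnt_test = cnt_vect.transform(test['review']).toarray()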

In [94]:
xgb_pred = get_model_proba(xgb, cnt_train, cnt_test)
lgbm_pred = get_model_proba(lgbm, cnt_train, cnt_test)

XGBClassifier Train & Predict Start!

1 Fold accuracy = 0.8484 / roc_auc = 0.92753216

2 Fold accuracy = 0.842 / roc_auc = 0.9209440000000002

3 Fold accuracy = 0.8424 / roc_auc = 0.9206528

4 Fold accuracy = 0.8508 / roc_auc = 0.9275577600000001

5 Fold accuracy = 0.8504 / roc_auc = 0.92776768

6 Fold accuracy = 0.8684 / roc_auc = 0.94001856

7 Fold accuracy = 0.85 / roc_auc = 0.9291811199999999

8 Fold accuracy = 0.8492 / roc_auc = 0.92730048

9 Fold accuracy = 0.8492 / roc_auc = 0.93323328

10 Fold accuracy = 0.8484 / roc_auc = 0.92427008

LGBMClassifier Train & Predict Start!

1 Fold accuracy = 0.8532 / roc_auc = 0.9316646399999999

2 Fold accuracy = 0.8464 / roc_auc = 0.9250291199999999

3 Fold accuracy = 0.8528 / roc_auc = 0.9302169600000001

4 Fold accuracy = 0.854 / roc_auc = 0.93231552

5 Fold accuracy = 0.8516 / roc_auc = 0.92992832

6 Fold accuracy = 0.87 / roc_auc = 0.9439526399999999

7 Fold accuracy = 0.8532 / roc_auc = 0.9338579199999999

8 Fold accuracy = 0.852 / roc_auc = 0.93272512

9 Fold accuracy = 0.86 / roc_auc = 0.93783232

10 Fold accuracy = 0.8496 / roc_auc = 0.93039616

In [97]:
cnt_pred = xgb_pred * .5 + lgbm_pred * .5
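The line above is an equal-weight blend of two probability vectors. As a small generalization, the sketch below wraps the same arithmetic in a hypothetical helper, `blend`, that also accepts unequal weights; the example vectors are made up and are not model outputs from this notebook.

```python
import numpy as np

def blend(preds, weights = None):
    """Weighted average of several probability vectors (equal weights by default)."""
    return np.average(np.vstack(preds), axis = 0, weights = weights)

a = np.array([0.9, 0.2, 0.6])
b = np.array([0.7, 0.4, 0.6])

print(blend([a, b]))                           # equal weights, same as a * .5 + b * .5
print(blend([a, b], weights = [0.75, 0.25]))   # lean toward the first model
```

Whether unequal weights help is an empirical question; one reasonable way to choose them is by comparing validation ROC AUC, not the leaderboard.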

In [102]:
submission = pd.read_csv("sampleSubmission.csv")

In [103]:
submission['sentiment'] = cnt_pred

In [105]:
submission.to_csv('countvect.csv', index = False)

On the competition's late leaderboard, this submission scored 0.93810, which ranks 201st.
***
#### Applying TF-IDF

In [108]:
tfidf = TfidfVectorizer(max_features = 5000, stop_words = 'english')

In [109]:
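# toarray() yields a plain ndarray; todense() returns np.matrix, which newer scikit-learn releases no longer accept as input
tfidf_train = tfidf.fit_transform(train['review']).toarray()

In [110]:
tfidf_test = tfidf.transform(test['review']).toarray()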
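Unlike raw counts, TF-IDF down-weights terms that occur in many documents. The following sketch shows this on a two-sentence toy corpus with scikit-learn's default settings; `toy_docs`, `toy_tfidf`, `weights`, and `vocab` are names invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["the movie was great", "the movie was terrible"]
toy_tfidf = TfidfVectorizer(stop_words = 'english')
weights = toy_tfidf.fit_transform(toy_docs).toarray()
vocab = toy_tfidf.vocabulary_   # maps each term to its column index

# 'movie' appears in both documents, 'great' only in the first one,
# so 'great' should receive the larger weight in the first document.
print(weights[0, vocab['movie']], weights[0, vocab['great']])
```

In the first document both words occur once, yet "great" ends up with the higher weight because it is rarer across the corpus.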

In [111]:
xgb_pred = get_model_proba(xgb, tfidf_train, tfidf_test)
lgbm_pred = get_model_proba(lgbm, tfidf_train, tfidf_test)

XGBClassifier Train & Predict Start!

1 Fold accuracy = 0.8476 / roc_auc = 0.9283974399999999

2 Fold accuracy = 0.8388 / roc_auc = 0.91958464

3 Fold accuracy = 0.8464 / roc_auc = 0.92166272

4 Fold accuracy = 0.8552 / roc_auc = 0.9279200000000001

5 Fold accuracy = 0.8428 / roc_auc = 0.92631328

6 Fold accuracy = 0.8628 / roc_auc = 0.93643552

7 Fold accuracy = 0.8616 / roc_auc = 0.9307664

8 Fold accuracy = 0.8404 / roc_auc = 0.9262195200000001

9 Fold accuracy = 0.8536 / roc_auc = 0.9330793599999999

10 Fold accuracy = 0.8448 / roc_auc = 0.9254508800000001

LGBMClassifier Train & Predict Start!

1 Fold accuracy = 0.8588 / roc_auc = 0.93398592

2 Fold accuracy = 0.8472 / roc_auc = 0.9284147199999999

3 Fold accuracy = 0.858 / roc_auc = 0.9305855999999999

4 Fold accuracy = 0.8588 / roc_auc = 0.93452608

5 Fold accuracy = 0.8444 / roc_auc = 0.92894976

6 Fold accuracy = 0.874 / roc_auc = 0.94466688

7 Fold accuracy = 0.8592 / roc_auc = 0.934288

8 Fold accuracy = 0.8496 / roc_auc = 0.93284608

9 Fold accuracy = 0.8604 / roc_auc = 0.93846336

10 Fold accuracy = 0.8528 / roc_auc = 0.93034752

In [112]:
tfidf_pred = xgb_pred * .5 + lgbm_pred * .5

In [103]:
submission['sentiment'] = tfidf_pred

In [113]:
submission.to_csv('tfidf.csv', index = False)