# IMDB movie review sentiment analysis, based on the following tutorial

* https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184

## Reading the data

In [1]:
# Read one review per line; `with` closes each file once it has been read
with open('./movie_data/full_train.txt', 'r') as f:
    review_trains = [line.strip() for line in f]
with open('./movie_data/full_test.txt', 'r') as f:
    review_tests = [line.strip() for line in f]
review_trains[0]

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

## Data preprocessing

#### Removing clutter tokens

In [2]:
import re

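# Raw strings avoid invalid-escape warnings for \s and \[ in the patterns
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")            # punctuation to delete
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")  # separators to turn into a space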

def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    return reviews

review_trains_clean = preprocess_reviews(review_trains)    
review_tests_clean = preprocess_reviews(review_tests)
review_trains_clean[0]    

'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt'
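To make the two-step cleaning easier to follow, here is a minimal sketch that applies the same two patterns to short strings. The sample texts are made up for illustration and do not come from the dataset, and `clean` is a hypothetical single-string helper, not a function from the notebook.

```python
import re

# Same two patterns as in the preprocessing cell, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def clean(line):
    """Lowercase, delete punctuation, then turn separators into spaces."""
    line = REPLACE_NO_SPACE.sub("", line.lower())
    return REPLACE_WITH_SPACE.sub(" ", line)

# Made-up sample: hyphen, apostrophe, a doubled <br /> and a slash.
cleaned = clean("Well-made, isn't it?<br /><br />A must/see!")
print(cleaned)

# A single <br /> is not matched as a whole; only its slash becomes a space.
single_br = clean("a<br />b")
print(repr(single_br))
```

Note that only the doubled `<br /><br />` form is replaced as a unit. A lone `<br />` keeps its angle brackets and `br`, so `br` can still end up as a token later.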

## Naive vectorization: one-hot encoding based
* Each vocabulary word is encoded as 1 if it appears in the review, 0 otherwise
* Uses sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
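
As a small illustration of the presence/absence encoding described above, the sketch below compares `CountVectorizer(binary=True)` with the default counting behaviour. The two-document corpus is a toy example made up for this purpose, not part of the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good movie", "bad movie"]  # toy corpus (assumption, not IMDB data)

binary_cv = CountVectorizer(binary=True)  # 1 if the word occurs, else 0
count_cv = CountVectorizer()              # default: raw occurrence counts

binary_matrix = binary_cv.fit_transform(docs).toarray()
count_matrix = count_cv.fit_transform(docs).toarray()

# Columns follow the alphabetically sorted vocabulary.
vocab = sorted(binary_cv.vocabulary_)
print(vocab)          # ['bad', 'good', 'movie']
print(binary_matrix)  # "good" is capped at 1 in the first row
print(count_matrix)   # "good" is counted twice in the first row
```

With `binary=True`, a word repeated within a review carries no more weight than a word used once.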

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

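cv = CountVectorizer(binary=True, max_features=1000)  # feature is presence/absence of a word, not its count
# Note: the vocabulary is fit on the raw reviews, while the transforms below use the cleaned reviews
cv.fit(review_trains)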
train_features = cv.transform(review_trains_clean)
test_features = cv.transform(review_tests_clean)

In [4]:
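# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = cv.get_feature_names_out().tolist()
print(f'voca size={len(feature_names)}')
print(f'some voca={feature_names[:10]}')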
print(train_features.toarray())

voca size=1000
some voca=['10', '20', '30', 'able', 'about', 'above', 'absolutely', 'across', 'act', 'acted']
[[0 0 0 ... 0 1 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]]


## Classification model: Baseline - LogisticRegression

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create labels: the first half of the reviews is 1, the rest is 0
labels = [1 if i < 12500 else 0 for i in range(25000)]  

# Split into train/validation sets
X_train, X_val, y_train, y_val = train_test_split(train_features, labels, test_size=0.25, shuffle=True)

# Train logistic regression with several values of C (a smaller C means stronger regularization)
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.8504
Accuracy for C=0.05: 0.85248
Accuracy for C=0.25: 0.85424
Accuracy for C=0.5: 0.85424
Accuracy for C=1: 0.85488
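
The loop above relies on the fact that a smaller `C` means stronger regularization. A minimal sketch of that effect follows. It assumes synthetic data from `make_classification` rather than the IMDB features, and scikit-learn's default L2 penalty. The weight-vector norm should shrink as `C` decreases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 500 samples, 20 features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

norms = {}
for c in [0.01, 1, 100]:
    model = LogisticRegression(C=c, max_iter=1000).fit(X, y)
    # L2 norm of the learned weights; stronger regularization pulls it toward 0.
    norms[c] = float(np.linalg.norm(model.coef_[0]))
    print(f"C={c}: ||w|| = {norms[c]:.3f}")
```

On the IMDB run above the validation accuracy changes only slightly across the tested values of `C`, so the choice of regularization strength is not very sensitive here.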


In [6]:
# The final model uses C=0.05 from the runs above
final_model = LogisticRegression(C=0.05)
final_model.fit(X_train, y_train)
print ("Accuracy = %s" % (accuracy_score(y_val, final_model.predict(X_val))))
final_model.coef_[0].shape

Accuracy = 0.85248


(1000,)

## XAI

* Most influential words (features) for determining positive/negative sentiment

In [7]:
feature_to_weight_map = {
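    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    word: weight for word, weight in zip(cv.get_feature_names_out().tolist(), final_model.coef_[0])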
}

print('most influential words for positive sentiments')
print(sorted(feature_to_weight_map.items(), key=lambda x:x[1], reverse=True)[:5])
print('most influential words for negative sentiments')
print(sorted(feature_to_weight_map.items(), key=lambda x:x[1], reverse=False)[:5])

most influential words for positive sentiments
[('excellent', 0.9543561684231702), ('perfect', 0.8278048997368097), ('amazing', 0.7256422481993934), ('great', 0.7255364414231789), ('superb', 0.7085847209845163)]
most influential words for negative sentiments
[('worst', -1.440431511724494), ('waste', -1.220476693310591), ('awful', -1.2203809497503864), ('poorly', -0.9629888521971506), ('dull', -0.9112251330265803)]
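
Since the features are binary, each word present in a review adds its weight to the model's score exactly once. The sketch below uses that to explain a single review. It assumes a hand-copied subset of the weights printed above (rounded to three decimals), a made-up review, and a hypothetical helper named `explain`. The intercept and all other vocabulary words are left out, so the sum is only a partial score.

```python
# Rounded weights copied from the output above (subset only; intercept omitted).
weights = {"excellent": 0.954, "great": 0.726, "worst": -1.440, "waste": -1.220}

def explain(review, weights):
    """Return (word, weight) for each distinct known word, largest |weight| first."""
    present = set(review.lower().split())  # binary features: each word counts once
    hits = [(word, weights[word]) for word in present if word in weights]
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)

# Made-up review for illustration.
contributions = explain("great cast but the worst script", weights)
partial_score = sum(weight for _, weight in contributions)

print(contributions)
print(f"partial score = {partial_score:.3f}")  # negative leans toward label 0
```

Here `worst` outweighs `great`, so the partial score is negative, which pushes the prediction toward the negative class.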
