# Text Mining Practice
## Sentiment Analysis of Movie Reviews

Dataset (Large Movie Review Dataset, IMDb): http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

* Goal: read the text of a movie review and decide whether its sentiment is positive or negative.

In [1]:
%matplotlib inline
from sklearn.datasets import load_files
import matplotlib.pyplot as plt

In [5]:
# Load the training set
reviews_train = load_files('../aclImdb/train/')
X_train, Y_train = reviews_train.data, reviews_train.target
# Keep only the first 1,000 reviews
X_train, Y_train = X_train[0:1000], Y_train[0:1000]

In [6]:
print X_train[0]

Full of (then) unknown actors TSF is a great big cuddly romp of a film.<br /><br />The idea of a bunch of bored teenagers ripping off the local sink factory is odd enough, but add in the black humour that Forsyth & Co are so good at and your in for a real treat.<br /><br />The comatose van driver by itself worth seeing, and the canal side chase is just too real to be anything but funny.<br /><br />And for anyone who lived in Glasgow it's a great "Oh I know where that is" film.


In [7]:
X_train = [x.replace(b"<br />", b" ") for x in X_train]

In [8]:
print X_train[0]

Full of (then) unknown actors TSF is a great big cuddly romp of a film.  The idea of a bunch of bored teenagers ripping off the local sink factory is odd enough, but add in the black humour that Forsyth & Co are so good at and your in for a real treat.  The comatose van driver by itself worth seeing, and the canal side chase is just too real to be anything but funny.  And for anyone who lived in Glasgow it's a great "Oh I know where that is" film.


In [9]:
# Load the test set
reviews_test = load_files('../aclImdb/test/')
X_test,Y_test = reviews_test.data, reviews_test.target
X_test = [x.replace(b"<br />", b" ") for x in X_test]

In [10]:
trial_texts = X_train[0:3]

In [11]:
trial_texts

['Full of (then) unknown actors TSF is a great big cuddly romp of a film.  The idea of a bunch of bored teenagers ripping off the local sink factory is odd enough, but add in the black humour that Forsyth & Co are so good at and your in for a real treat.  The comatose van driver by itself worth seeing, and the canal side chase is just too real to be anything but funny.  And for anyone who lived in Glasgow it\'s a great "Oh I know where that is" film.',
 "Amount of disappointment I am getting these days seeing movies like Partner, Jhoom Barabar and now, Heyy Babyy is gonna end my habit of seeing first day shows.  The movie is an utter disappointment because it had the potential to become a laugh riot only if the d\xc3\xa9butant director, Sajid Khan hadn't tried too many things. Only saving grace in the movie were the last thirty minutes, which were seriously funny elsewhere the movie fails miserably. First half was desperately been tried to look funny but wasn't. Next 45 minutes were em

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(trial_texts)
print "number of features :",len(cv.vocabulary_) # each distinct word becomes one feature (339 here)

bow = cv.transform(trial_texts) # build the bag-of-words count matrix
print bow.toarray().shape # one row per document, one count per vocabulary word: (3, 339)

number of features : 339
(3, 339)
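
To make the `(3, 339)` shape concrete, here is a minimal pure-Python sketch of what a bag-of-words count matrix contains. It is written for Python 3 (unlike the Python 2 cells in this notebook), the tokenizer only approximates `CountVectorizer`'s default (lowercase, runs of two or more word characters), and the toy sentences and the helper names `tokenize` and `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters
    # (an approximation of CountVectorizer's default tokenization)
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # Vocabulary: every distinct token across all documents, sorted alphabetically
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # One count per vocabulary word; words absent from the document get 0
        rows.append([counts[w] for w in vocab])
    return vocab, rows

# Made-up toy documents, not taken from the IMDb data
toy_docs = ["a great film, a great cast", "not a great film"]
vocab, rows = bag_of_words(toy_docs)
print(vocab)
print(rows)
```

Each row has one entry per vocabulary word, so the matrix shape is (number of documents, vocabulary size), just as in the cell above.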


In [13]:
cv = CountVectorizer()
X_train_bow = cv.fit_transform(X_train)

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(),X_train_bow,Y_train,cv=5)

scores

array([ 0.6119403 ,  0.65      ,  0.615     ,  0.59      ,  0.59798995])

In [14]:
cv = CountVectorizer(min_df=5)
X_train_bow = cv.fit_transform(X_train)

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(),X_train_bow,Y_train,cv=5)

scores

array([ 0.58208955,  0.61      ,  0.585     ,  0.565     ,  0.57286432])
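
As a rough illustration of what `min_df=5` does, the sketch below counts in how many documents each word appears and keeps only the words that reach the threshold. It is plain Python 3 on a made-up, pre-tokenized toy corpus; `document_frequency` and `filter_vocab` are hypothetical helper names, not scikit-learn functions.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    # Count, for each word, the number of documents that contain it at least once
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df

def filter_vocab(tokenized_docs, min_df):
    # Keep words whose document frequency is at least min_df
    df = document_frequency(tokenized_docs)
    return sorted(w for w, n in df.items() if n >= min_df)

# Made-up toy corpus, already split into tokens
toy = [["great", "film"], ["great", "cast", "great"], ["dull", "film"]]
print(filter_vocab(toy, 1))
print(filter_vocab(toy, 2))
```

Note that "great" occurs twice in the second document but still counts as one document, which is why the threshold is on document frequency rather than total occurrences.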

In [15]:
# Let's remove stop words.
# Stop words: words that contribute little to inferring the meaning of a sentence
# Examples: is, often, eight, all, amount, etc.

In [16]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [18]:
cv = CountVectorizer(min_df=5,stop_words='english') # stop_words also accepts a custom list of words
X_train_bow = cv.fit_transform(X_train)

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(),X_train_bow,Y_train,cv=5)

scores

array([ 0.61691542,  0.62      ,  0.615     ,  0.605     ,  0.57286432])
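
Since `stop_words` can also take a hand-built list, the following sketch shows one way to extend the built-in English list with domain words. It assumes Python 3 and a reasonably recent scikit-learn; the extra words `"film"` and `"movie"` and the two toy sentences are made up for illustration and are not part of the notebook's experiment.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical domain-specific additions: words that appear in almost every review
extra_stop_words = ["film", "movie"]

# stop_words expects the string 'english' or a list, so convert the frozenset to a list
custom_stop_words = list(ENGLISH_STOP_WORDS) + extra_stop_words

# Made-up toy sentences
toy_docs = ["The movie is a great film", "The film was not a great movie"]

cv_custom = CountVectorizer(stop_words=custom_stop_words)
cv_custom.fit(toy_docs)
print(sorted(cv_custom.vocabulary_))
```

Whether removing such domain words helps accuracy is an empirical question; the cross-validation cells above are the place to check it.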

In [69]:
cv = CountVectorizer(min_df=5,stop_words='english') # stop_words also accepts a custom list of words
X_train_bow = cv.fit_transform(X_train)

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C':[0.001,0.01,0.1,1,10]}
grid = GridSearchCV(LogisticRegression(),param_grid,cv=5)
grid.fit(X_train_bow,Y_train)

print grid.best_score_

print grid.best_params_

print len(cv.get_feature_names())

0.691
{'C': 0.001}
3428
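
The cell above fits `CountVectorizer` on all of `X_train` before the grid search, so every validation fold has already contributed to the vocabulary. One common alternative, sketched here as an assumption rather than as this notebook's method, is to put the vectorizer and the classifier in a single pipeline so the vocabulary is rebuilt on each training fold. The sketch uses Python 3, a tiny made-up corpus, and `cv=3` only because the toy data is so small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus for illustration only (1 = positive, 0 = negative)
toy_docs = [
    "great fun film", "awful boring film",
    "great cast, great story", "boring and awful plot",
    "fun story, great ending", "awful cast, boring ending",
] * 3
toy_labels = [1, 0, 1, 0, 1, 0] * 3

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(CountVectorizer(), LogisticRegression())

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

grid_pipe = GridSearchCV(pipe, param_grid, cv=3)
grid_pipe.fit(toy_docs, toy_labels)  # raw text goes in; vectorizing happens inside each fold

print(grid_pipe.best_params_)
print(grid_pipe.best_score_)
```

With this layout, vectorizer settings such as `countvectorizer__min_df` can be added to the same `param_grid` and tuned together with `C`.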


In [25]:
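# Build the BOW from lemmas: lemmatization reduces each word to its base form (e.g. a verb to its infinitive).
# pip install spacy 
# python -m spacy download en

import spacy
sp_en = spacy.load('en')
# Decode the raw bytes as UTF-8 so that non-ASCII reviews are handled as well
doc_sp = sp_en(X_train[0].decode("utf8"))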

In [50]:
# Result of lemmatization
' '.join([x.lemma_ for x in doc_sp])

u'full of ( then ) unknown actor tsf be a great big cuddly romp of a film .   the idea of a bunch of bore teenager rip off the local sink factory be odd enough , but add in the black humour that forsyth & co be so good at and -PRON- in for a real treat .   the comatose van driver by -PRON- worth see , and the canal side chase be just too real to be anything but funny .   and for anyone who live in glasgow -PRON- have a great " oh -PRON- know where that be " film .'

In [47]:
# Input sentence
X_train[0]

'Full of (then) unknown actors TSF is a great big cuddly romp of a film.  The idea of a bunch of bored teenagers ripping off the local sink factory is odd enough, but add in the black humour that Forsyth & Co are so good at and your in for a real treat.  The comatose van driver by itself worth seeing, and the canal side chase is just too real to be anything but funny.  And for anyone who lived in Glasgow it\'s a great "Oh I know where that is" film.'

In [63]:
# Lemmatize every review, then apply CountVectorizer
X_train_sp = []
for sentence in X_train:
    # decode() already returns a unicode string, so no extra unicode() call is needed
    X_train_sp.append(' '.join([x.lemma_ for x in sp_en(sentence.decode("utf8"))]))

In [65]:
X_train_sp[1]

u'amount of disappointment -PRON- be get this day see movie like partner , jhoom barabar and now , heyy babyy be go to end -PRON- habit of see \ufeff1 day show .   the movie be a utter disappointment because -PRON- have the potential to become a laugh riot only if the d\xe9butant director , sajid khan have not try too many thing . only save grace in the movie be the last thirty minute , which be seriously funny elsewhere the movie fail miserably . first half be desperately be try to look funny but be not . next 45 minute be emotional and look totally artificial and illogical .   ok , when -PRON- be out for a movie like this -PRON- do not expect much logic but all the flaw tend to appear when -PRON- do not enjoy the movie and that s the case with heyy babyy . acting be good but that s not enough to keep one interest .   for the positive , -PRON- can take hot actress , last 30 minute , some comic scene , good act by the lead cast and the baby . only problem be that this thing do not come

In [70]:
cv = CountVectorizer(min_df=5,stop_words='english') # stop_words also accepts a custom list of words
X_train_bow = cv.fit_transform(X_train_sp)

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C':[0.001,0.01,0.1,1,10]}
grid = GridSearchCV(LogisticRegression(),param_grid,cv=5)
grid.fit(X_train_bow,Y_train)

print grid.best_score_

print grid.best_params_

print len(cv.get_feature_names())

0.691
{'C': 0.001}
2930
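
The feature count drops from 3428 to 2930 because inflected forms collapse onto a single lemma. The toy sketch below shows the effect with a hand-written lookup table; this is a deliberately simplified stand-in in Python 3, not how spaCy computes lemmas, and `TOY_LEMMAS` and the example sentence are made up.

```python
# Hypothetical hand-written lemma table, for illustration only
TOY_LEMMAS = {
    "is": "be", "was": "be", "are": "be",
    "actors": "actor", "movies": "movie",
    "seeing": "see", "saw": "see",
}

def lemmatize(tokens):
    # Map each token to its lemma; tokens missing from the table are left unchanged
    return [TOY_LEMMAS.get(t, t) for t in tokens]

tokens = "the actors are seeing movies and the actor was seeing a movie".split()
lemmas = lemmatize(tokens)

print(" ".join(lemmas))
# Distinct word forms before and after lemmatization
print(len(set(tokens)), len(set(lemmas)))
```

In this toy sentence the number of distinct forms falls from 10 to 7, the same kind of shrinkage seen in the vocabulary size above.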
