### IMDB movie review sentiment analysis (binary classification)
- CountVectorizer + LogisticRegression

#### 1. Data exploration

In [139]:
import numpy as np
import pandas as pd

In [140]:
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t')
df.head()

   id      sentiment  review
0  5814_8  1          With all this stuff going down at the moment w...
1  2381_9  1          \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3  0          The film starts with a manager (Nicholas Bell)...
3  3630_4  0          It must be assumed that those who praised this...
4  9495_8  1          Superbly trashy and wondrously unpretentious 8...


In [141]:
# quoting=3 (csv.QUOTE_NONE): treat double quotes as ordinary characters rather than field quoting, so they stay in the text
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t', quoting=3)
df.head()

   id        sentiment  review
0  "5814_8"  1          "With all this stuff going down at the moment ...
1  "2381_9"  1          "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"  0          "The film starts with a manager (Nicholas Bell...
3  "3630_4"  0          "It must be assumed that those who praised thi...
4  "9495_8"  1          "Superbly trashy and wondrously unpretentious ...
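
The effect of `quoting=3` can be reproduced on a tiny in-memory file. The sketch below is only an illustration: the two-row TSV string is made up, and `io.StringIO` stands in for the real `labeledTrainData.tsv`. It relies on `3` being the value of `csv.QUOTE_NONE` in the standard library.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row TSV shaped like labeledTrainData.tsv (not the real data).
tsv = 'id\tsentiment\treview\n"1_8"\t1\t"Great fun"\n"2_3"\t0\t"Dull and slow"\n'

# Default parsing: double quotes are treated as field quoting and stripped.
default_df = pd.read_csv(io.StringIO(tsv), sep='\t')

# QUOTE_NONE (== 3): double quotes are kept as ordinary characters.
raw_df = pd.read_csv(io.StringIO(tsv), sep='\t', quoting=csv.QUOTE_NONE)

print(default_df.loc[1, 'review'])
print(raw_df.loc[1, 'review'])
```

With `quoting=3` the quotes remain part of the text, so they have to be removed later; the non-letter cleanup in the preprocessing step takes care of that.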


In [142]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [143]:
print(df.review[0][:1000])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [144]:
df.isna().sum().sum()

0

#### 2. Text preprocessing

In [145]:
# Remove the <br /> tags (they are replaced with an empty string here)
df.review = df.review.str.replace('<br />', '')

In [146]:
# Remove punctuation and digits: replace every non-letter character with a space
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True)

In [147]:
df.review[0][:200]

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want '
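
The two cleaning steps can be wrapped in one function. This is a minimal sketch, not the notebook's own code: `clean_review` and its `br_replacement` parameter are hypothetical names, and the sample sentence is invented. It also shows why the replacement string for `<br />` matters: an empty string can glue two words together when no punctuation sits next to the tag.

```python
import re


def clean_review(text, br_replacement=' '):
    """Strip <br /> tags, then replace every non-letter character with a space."""
    text = text.replace('<br />', br_replacement)
    return re.sub('[^A-Za-z]', ' ', text)


# Invented sample with no punctuation directly before the tags.
sample = 'Loved it<br /><br />Would watch again, 10/10!'

joined = clean_review(sample, br_replacement='').split()
separated = clean_review(sample, br_replacement=' ').split()

print(joined)     # 'it' and 'Would' end up as one token
print(separated)  # the two words stay separate
```

Whether this matters for the real reviews depends on how often a tag directly follows a word; switching the replacement would also change the vocabulary sizes reported later.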

#### 3. Splitting the dataset

In [148]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(
    df.review.values, df.sentiment.values, stratify=df.sentiment.values,
    test_size=0.2, random_state=2023
)
np.unique(y_train, return_counts=True)

(array([0, 1], dtype=int64), array([10000, 10000], dtype=int64))
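
The IMDB labels are already balanced, so the effect of `stratify` is hard to see in the output above. The following sketch uses made-up, deliberately imbalanced labels (80 negatives, 20 positives) to show that the class ratio is preserved in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives and 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=2023
)

# Count how many samples of each class landed in each split.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Both splits keep the 4:1 ratio of the full label array.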

#### 4. Feature extraction with CountVectorizer

In [149]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words='english')

In [150]:
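# Do NOT do this: fitting separately on train and test builds two different vocabularies,
# so the two matrices end up with different numbers of columns
cvect.fit_transform(X_train).shape, cvect.fit_transform(X_test).shape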

((20000, 67420), (5000, 37893))

In [151]:
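# Do this instead: fit on the training set only, then transform both sets
cvect.fit(X_train)
X_train_cv = cvect.transform(X_train)  # transform counts how often each vocabulary word occurs
X_test_cv = cvect.transform(X_test)
X_train_cv.shape, X_test_cv.shape
# Each shape is a (number of documents, vocabulary size) tuple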

((20000, 67420), (5000, 67420))
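
The same point can be checked on a toy corpus. The documents below are invented for illustration; the sketch shows that a vectorizer fitted on the training documents gives the test matrix the same columns, while a vectorizer refitted on the test documents learns a different vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy documents.
train_docs = ['good movie', 'bad movie', 'great acting']
test_docs = ['good acting', 'terrible plot']

vect = CountVectorizer()
X_tr = vect.fit_transform(train_docs)  # learn the vocabulary from train only
X_te = vect.transform(test_docs)       # reuse that vocabulary for test

# Refitting on the test documents yields an unrelated vocabulary.
refit = CountVectorizer().fit(test_docs)

print(X_tr.shape, X_te.shape)
print(sorted(vect.vocabulary_), sorted(refit.vocabulary_))
```

Words that never occurred in the training documents ('terrible', 'plot') are simply ignored by `transform`, so the second test row is all zeros.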

#### 5. Training and evaluation

In [152]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=2023, max_iter=500)

In [153]:
# This step takes a while, so time it with the %time magic command
%time lrc.fit(X_train_cv, y_train)

CPU times: total: 4.47 s
Wall time: 4.54 s


In [154]:
lrc.score(X_test_cv,y_test)

0.8774
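
One way to keep the vectorizer and the classifier together is scikit-learn's `make_pipeline`, which fits the vectorizer on the training data only and reuses it at prediction time. This is an alternative arrangement, not what the notebook does, and the four toy documents and labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = positive, 0 = negative.
docs = [
    'great film loved it',
    'wonderful acting great story',
    'boring film hated it',
    'terrible acting boring story',
]
labels = [1, 1, 0, 0]

# Same settings as the notebook's vectorizer and classifier.
pipe = make_pipeline(
    CountVectorizer(stop_words='english'),
    LogisticRegression(random_state=2023, max_iter=500),
)
pipe.fit(docs, labels)

# Raw text goes in; the pipeline vectorizes it with the fitted vocabulary.
preds = pipe.predict(['great story', 'boring story'])
print(preds)
```

With a pipeline, `pipe.score(X_test, y_test)` can be called on raw text directly, which removes the risk of transforming the test set with the wrong vectorizer.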

#### 6. Adding bigrams (ngram_range=(1, 2))

In [155]:
cvect2 = CountVectorizer(stop_words='english', ngram_range=(1, 2))
cvect2.fit(X_train)
X_train_cv2 = cvect2.transform(X_train)  # transform counts how often each unigram and bigram occurs
X_test_cv2 = cvect2.transform(X_test)
X_train_cv2.shape, X_test_cv2.shape

((20000, 1457330), (5000, 1457330))
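
The jump from 67,420 to 1,457,330 columns comes from every pair of adjacent words becoming its own feature. A small sketch on one invented sentence makes this visible; unlike the notebook, it leaves out `stop_words='english'` so that all four words are kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One invented sentence; stop words are kept on purpose in this sketch.
docs = ['not good at all']

uni = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(uni.vocabulary_))     # single words only
print(sorted(uni_bi.vocabulary_))  # single words plus adjacent word pairs
```

Bigrams such as 'not good' let the model see a little word order, which is a plausible reason for the accuracy gain reported below, at the cost of a much larger feature matrix and a longer fit.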

In [156]:
lrc2 = LogisticRegression(random_state=2023, max_iter=500)
%time lrc2.fit(X_train_cv2, y_train)

CPU times: total: 52.3 s
Wall time: 47.3 s


In [157]:
lrc2.score(X_test_cv2,y_test)

0.8896

#### 7. Saving and loading the model

In [158]:
import joblib

In [159]:
# Save the fitted vectorizer and the model (both are needed for prediction)
joblib.dump(cvect2, 'model/imdb_cvect_2.pkl')
joblib.dump(lrc2, 'model/imdb_lrc2.pkl')

['model/imdb_lrc2.pkl']

In [160]:
# Load the vectorizer and the model
new_cvect = joblib.load('model/imdb_cvect_2.pkl')
new_lrc = joblib.load('model/imdb_lrc2.pkl')
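
A quick round-trip check confirms that the loaded objects behave like the originals. The sketch below is self-contained: it trains a toy model on invented data and writes to a temporary directory instead of the notebook's `model/` folder, and the file names are placeholders.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data: 1 = positive, 0 = negative.
docs = ['great film', 'wonderful story', 'boring film', 'terrible story']
labels = [1, 1, 0, 0]

vect = CountVectorizer().fit(docs)
clf = LogisticRegression(max_iter=500).fit(vect.transform(docs), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    vect_path = os.path.join(tmp_dir, 'vect.pkl')  # placeholder file names
    clf_path = os.path.join(tmp_dir, 'clf.pkl')
    joblib.dump(vect, vect_path)
    joblib.dump(clf, clf_path)
    loaded_vect = joblib.load(vect_path)
    loaded_clf = joblib.load(clf_path)

# The loaded pair should reproduce the original predictions exactly.
before = clf.predict(vect.transform(docs))
after = loaded_clf.predict(loaded_vect.transform(docs))
same = bool((before == after).all())
print(same)
```

When saving into a fixed folder such as `model/`, the folder has to exist first; `os.makedirs('model', exist_ok=True)` is one way to ensure that.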

#### 8. Validating on real reviews

In [161]:
review =['''
It is just what you want for the best movie. Great story great acting, thrilling twist. Just watched Joker in 2019, I just has to come back and give dark knight a 10. And thanks to Heath Ledger for the exceptional performs.
''',
'''
No.1 Movie of all time? I think that's what ruined it for me, I sat down expecting to watch the most spectacular cinematic experience of my life - but instead I ended up sitting rather bored for the good part of 3hours.
Everything about it just lacked in moving me, the vengeance was untouched, parts of the story seemed to obvious and other parts just left untouched. Thoroughly this was the most disappointing film I've ever seen, not the worst, just the most disappointing. I'm not trying to change anyone's thoughts on the film or make out the film is bad - just in case anyone gets the impression.
'''
]

In [162]:
# Text preprocessing: same non-letter cleanup as in training
import re
# A list (rather than a one-shot map iterator) can be reused after transform
review = [re.sub('[^A-Za-z]', ' ', x) for x in review]

In [163]:
# Convert the text to features with the loaded vectorizer
review_cv = new_cvect.transform(review)
review_cv.shape

(2, 1457330)

In [164]:
# Predict (1 = positive, 0 = negative)
# 'positive' if new_lrc.predict(review_cv)[0] == 1 else 'negative'
new_lrc.predict(review_cv)

array([1, 0], dtype=int64)

In [165]:
rev='''
No.1 Movie of all time? I think that's what ruined it for me, I sat down expecting to watch the most spectacular cinematic experience of my life - but instead I ended up sitting rather bored for the good part of 3hours.
Everything about it just lacked in moving me, the vengeance was untouched, parts of the story seemed to obvious and other parts just left untouched. Thoroughly this was the most disappointing film I've ever seen, not the worst, just the most disappointing. I'm not trying to change anyone's thoughts on the film or make out the film is bad - just in case anyone gets the impression.
'''
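# Clean into a new variable: assigning the result to `re` would shadow the re module
rev_clean = re.sub('[^A-Za-z]', ' ', rev)
rev_cv = new_cvect.transform([rev_clean])
# Predict on rev_cv, not on the earlier review_cv batch
'positive' if new_lrc.predict(rev_cv)[0] == 1 else 'negative'

'negative'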
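
Keeping the cleaning, transforming, and predicting steps in one function reduces the chance of mixing up intermediate variables. The helper below is a sketch under assumptions: `predict_sentiment` is a hypothetical name, and it is demonstrated on a toy model trained on invented data rather than on the saved IMDB model.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def predict_sentiment(texts, vectorizer, model):
    """Clean raw reviews, vectorize them, and map 1/0 labels to words."""
    cleaned = [re.sub('[^A-Za-z]', ' ', text) for text in texts]
    features = vectorizer.transform(cleaned)
    return ['positive' if label == 1 else 'negative'
            for label in model.predict(features)]


# Invented toy data standing in for the real vectorizer and model.
docs = [
    'great film loved it',
    'wonderful acting great story',
    'boring film hated it',
    'terrible acting boring story',
]
labels = [1, 1, 0, 0]
toy_vect = CountVectorizer().fit(docs)
toy_clf = LogisticRegression(max_iter=500).fit(toy_vect.transform(docs), labels)

results = predict_sentiment(['Great story!', 'Boring story...'], toy_vect, toy_clf)
print(results)
```

With the objects loaded earlier, the call would be `predict_sentiment([rev], new_cvect, new_lrc)`.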