### IMDB Movie Review Sentiment Analysis
- GridSearchCV with a Pipeline
- TfidfVectorizer + Naive Bayes, LogisticRegression

In [16]:
import numpy as np
import pandas as pd
# quoting=3 is csv.QUOTE_NONE: quote characters are treated as ordinary text
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t', quoting=3)

In [17]:
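# Remove HTML line breaks (literal match), then replace every non-letter with a space
df.review = df.review.str.replace('<br />', '', regex=False)
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True)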
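
To make the effect of the two cleaning steps concrete, here is a minimal sketch that applies the same substitutions to a single string using only the standard library. The sample text is made up for illustration and does not come from the dataset.

```python
import re

# Made-up sample text for illustration (not taken from the dataset)
sample = 'Great movie!<br /><br />10/10, would watch again.'

no_breaks = sample.replace('<br />', '')        # drop HTML line breaks
cleaned = re.sub('[^A-Za-z]', ' ', no_breaks)   # every non-letter becomes a space

print(cleaned.split())
# ['Great', 'movie', 'would', 'watch', 'again']
```

Note that digits and punctuation are discarded along with the markup, so a rating such as "10/10" leaves no tokens, and a contraction such as "don't" would be split into "don" and "t".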

In [18]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(
    df.review.values, df.sentiment.values, stratify=df.sentiment.values,
    test_size=0.2, random_state=2023
)

#### Pipelining

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

In [20]:
tvect = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
nb = MultinomialNB()
pipeline = Pipeline([('TVECT', tvect), ('NB', nb)])

In [21]:
# Train
%time pipeline.fit(X_train,y_train)

CPU times: total: 12.2 s
Wall time: 12.8 s


In [22]:
pipeline.score(X_test,y_test)

0.8804

In [27]:
from sklearn.linear_model import LogisticRegression 
lrc = LogisticRegression(random_state=2023)
pipeline = Pipeline([('TVECT', tvect), ('LRC', lrc)]) 
%time pipeline.fit(X_train, y_train)

CPU times: total: 31.6 s
Wall time: 30.3 s


In [28]:
pipeline.score(X_test,y_test)

0.8824

#### Finding the Optimal Parameters

In [29]:
from sklearn.model_selection import GridSearchCV
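params = {
    # Keys follow the '<step name>__<parameter>' convention of Pipeline.
    # An integer max_df is an absolute document count:
    # terms that appear in more documents than this are dropped.
    'TVECT__max_df': [100, 500],
    # C is the inverse of the regularization strength.
    'LRC__C': [1, 10]
}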

In [30]:
grid_pipe = GridSearchCV(
    pipeline, params, scoring='accuracy', cv=3
)
%time grid_pipe.fit(X_train, y_train)

CPU times: total: 4min 34s
Wall time: 4min 31s


In [32]:
grid_pipe.best_params_

{'LRC__C': 10, 'TVECT__max_df': 500}

In [33]:
best_pipe = grid_pipe.best_estimator_
best_pipe.score(X_test,y_test)

0.8896
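
Beyond `best_params_`, the fitted search object also keeps the cross-validation score of every parameter combination in `cv_results_`. The sketch below shows one way to tabulate them. It runs on a tiny made-up corpus so that it is self-contained, and it uses fractional `max_df` values (an assumption; the notebook above uses the integer counts 100 and 500). The names `toy_pipe`, `toy_params`, and `toy_grid` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration (not the IMDB data)
texts = [
    "great story and great acting", "a thrilling and wonderful film",
    "loved every minute of it", "brilliant cast and superb direction",
    "an excellent and moving picture", "truly enjoyable from start to finish",
    "boring plot and weak acting", "a dull and disappointing film",
    "hated every minute of it", "terrible cast and poor direction",
    "an awful and tedious picture", "truly painful from start to finish",
]
labels = [1] * 6 + [0] * 6

toy_pipe = Pipeline([
    ("TVECT", TfidfVectorizer()),
    ("LRC", LogisticRegression(random_state=2023)),
])
# Fractional max_df is a proportion of documents (assumption for the toy data)
toy_params = {"TVECT__max_df": [0.5, 1.0], "LRC__C": [1, 10]}

toy_grid = GridSearchCV(toy_pipe, toy_params, scoring="accuracy", cv=3)
toy_grid.fit(texts, labels)

# One row per parameter combination, best-ranked first
cols = ["param_TVECT__max_df", "param_LRC__C",
        "mean_test_score", "std_test_score", "rank_test_score"]
summary = pd.DataFrame(toy_grid.cv_results_)[cols].sort_values("rank_test_score")
print(summary)
```

Applied to `grid_pipe` from the notebook, the same table would show how close the four combinations are to each other, which helps judge whether the grid is worth widening.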

- Applying the model to real data

In [34]:
review =['''
It is just what you want for the best movie. Great story great acting, thrilling twist. Just watched Joker in 2019, I just has to come back and give dark knight a 10. And thanks to Heath Ledger for the exceptional performs.
''',
'''
No.1 Movie of all time? I think that's what ruined it for me, I sat down expecting to watch the most spectacular cinematic experience of my life - but instead I ended up sitting rather bored for the good part of 3hours.
Everything about it just lacked in moving me, the vengeance was untouched, parts of the story seemed to obvious and other parts just left untouched. Thoroughly this was the most disappointing film I've ever seen, not the worst, just the most disappointing. I'm not trying to change anyone's thoughts on the film or make out the film is bad - just in case anyone gets the impression.
'''
]

In [35]:
import re
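# Use a list rather than map(): a map object is a one-shot iterator
# and would be empty if the cell below were run a second time.
review = [re.sub('[^A-Za-z]', ' ', x) for x in review]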

In [36]:
best_pipe.predict(review)


array([1, 0], dtype=int64)
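
The labels alone do not show how confident the model is. As a minimal sketch, assuming a pipeline that ends in a classifier with `predict_proba` (both `LogisticRegression` and `MultinomialNB` provide it), class probabilities can be read as follows. The training sentences, the `clean_review` helper, and `toy_pipe` are hypothetical and exist only to keep the example self-contained.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def clean_review(text):
    """Hypothetical helper that mirrors the notebook's cleaning steps."""
    text = text.replace('<br />', '')
    return re.sub('[^A-Za-z]', ' ', text)

# Toy training data, made up for illustration (not the IMDB data)
train_texts = [
    "Great story, great acting!", "A thrilling, wonderful film.",
    "Loved it from start to finish.",
    "Boring plot and weak acting.", "A dull, disappointing film.",
    "Hated it from start to finish.",
]
train_labels = [1, 1, 1, 0, 0, 0]

toy_pipe = Pipeline([
    ("TVECT", TfidfVectorizer()),
    ("LRC", LogisticRegression(random_state=2023)),
])
toy_pipe.fit([clean_review(t) for t in train_texts], train_labels)

# New text goes through the same cleaning as the training text
new_reviews = [clean_review(t) for t in
               ["Great acting and a wonderful story!<br />",
                "Boring and disappointing..."]]

pred = toy_pipe.predict(new_reviews)
proba = toy_pipe.predict_proba(new_reviews)       # shape: (n_reviews, n_classes)
classes = toy_pipe.named_steps["LRC"].classes_    # column order of proba

for label, row in zip(pred, proba):
    print(label, dict(zip(classes.tolist(), np.round(row, 3).tolist())))
```

With the notebook's objects, the equivalent call would be `best_pipe.predict_proba(review)`.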