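## Korean Language Processing
Why Korean is hard to process: inconsistent word spacing, particles (josa) attached to words, and similar issues.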

KoNLPy: a Python package for Korean morphological analysis
Morpheme: the smallest unit of language that still carries meaning
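To see why whitespace alone is a poor tokenizer for Korean, here is a minimal sketch in plain Python. The particle list `PARTICLES` and the helper `strip_particle` are hypothetical, invented for illustration only; a real morphological analyzer such as the ones KoNLPy wraps handles far more than this naive suffix check.

```python
# Hypothetical, tiny list of one-syllable particles (josa); real Korean has many more.
PARTICLES = ("가", "를", "는", "이", "을", "은")

def strip_particle(token):
    """Naively drop one trailing particle; for illustration only."""
    if len(token) > 1 and token.endswith(PARTICLES):
        return token[:-1]
    return token

sentences = ["영화가 재미있다", "영화를 봤다", "영화는 길다"]

# Whitespace splitting keeps the particle attached, so one noun shows up as three tokens.
surface_forms = {s.split()[0] for s in sentences}

# After stripping the particle, all three collapse to the same stem.
stems = {strip_particle(tok) for tok in surface_forms}

print(surface_forms)
print(stems)
```

A bag-of-words model built on the surface forms would treat the three variants as unrelated features, which is the motivation for tokenizing by morpheme instead.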

In [4]:
import pandas as pd

train_df = pd.read_csv("../ratings_train.txt", sep="\t")
train_df.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


In [6]:
train_df["label"].value_counts()

0    75173
1    74827
Name: label, dtype: int64

In [9]:
import re

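# Replace missing reviews with a blank so every value is a string
train_df = train_df.fillna(' ')
# Replace each run of digits with a single space
train_df["document"] = train_df["document"].apply(lambda x: re.sub(r"\d+", " ", x))

# Apply the same cleaning to the test set
test_df = pd.read_csv("../ratings_test.txt", sep="\t")
test_df = test_df.fillna(' ')
test_df["document"] = test_df["document"].apply(lambda x: re.sub(r"\d+", " ", x))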
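The two cleaning steps in the cell above can be tried in isolation. The sketch below uses only the standard library; the helper name `remove_digits` and the sample string are made up for illustration. It also shows why the `fillna(' ')` call comes first: pandas loads a missing value as NaN, which is a float, and `re.sub` only accepts strings.

```python
import re

def remove_digits(text):
    # Same substitution as in the cell above: each run of digits becomes one space.
    return re.sub(r"\d+", " ", text)

# Hypothetical sample review, not taken from the dataset.
cleaned = remove_digits("평점 10점 만점에 9점")
print(cleaned.split())

# A missing review arrives as NaN (a float), so the regex call fails on it.
try:
    remove_digits(float("nan"))
    nan_ok = True
except TypeError:
    nan_ok = False
print(nan_ok)
```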

In [10]:
train_df.drop("id", axis=1, inplace=True)
test_df.drop("id", axis=1, inplace=True)

Vectorizing the text with TF-IDF
Tokenize each review into morphemes, using the Okt class (named Twitter in older KoNLPy releases)

In [13]:
from konlpy.tag import Okt

okt = Okt()
def okt_tokenizer(text):
    # Split the text into a list of morpheme strings
    tokens_ko = okt.morphs(text)
    return tokens_ko

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

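# Unigrams and bigrams of morphemes; drop terms seen in fewer than 3 documents
# or in more than 90% of documents
tfidf_vect = TfidfVectorizer(tokenizer=okt_tokenizer, ngram_range=(1, 2), min_df=3, max_df=0.9)
tfidf_vect.fit(train_df["document"])
tfidf_matrix_train = tfidf_vect.transform(train_df["document"])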
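For intuition about the weights that end up in `tfidf_matrix_train`, the sketch below computes TF-IDF by hand in plain Python. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as its defaults. The function `tfidf` and the toy documents are hypothetical, and the sketch leaves out the n-gram expansion and the `min_df`/`max_df` filtering used above.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalization over pre-tokenized docs."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

# Hypothetical, already-tokenized toy reviews.
docs = [["영화", "재미"], ["영화", "지루"], ["영화", "재미", "추천"]]
rows = tfidf(docs)
print(rows[2])
```

In the third document, the token that appears in every review ("영화") receives the lowest weight and the token unique to that review ("추천") the highest, which is the behavior TF-IDF is meant to provide.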



In [19]:
from sklearn.linear_model import LogisticRegression

lg_clf = LogisticRegression(random_state=0, solver="liblinear")

params = {
    'C': [1, 3.5, 4.5, 5.5, 10]
}
grid_cv = GridSearchCV(lg_clf, param_grid=params, cv=3, scoring="accuracy", verbose=1)
grid_cv.fit(tfidf_matrix_train, train_df["label"])
print(grid_cv.best_params_, round(grid_cv.best_score_, 4))

Fitting 3 folds for each of 5 candidates, totalling 15 fits
{'C': 3.5} 0.8593
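The "15 fits" in the log above is the number of candidate values times the number of folds. The short sketch below enumerates those combinations with the standard library; the variable names are illustrative. With the default `refit=True`, `GridSearchCV` also trains one final model on the full training set using the best `C`, and that extra fit is not part of the 15.

```python
from itertools import product

c_values = [1, 3.5, 4.5, 5.5, 10]  # same grid as `params` above
n_folds = 3                        # cv=3

# Each fit trains on two folds and is scored on the held-out one.
fits = [(c, held_out) for c, held_out in product(c_values, range(n_folds))]
print(len(fits))  # 5 candidates x 3 folds
```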


Evaluating on the test set

In [22]:
from sklearn.metrics import accuracy_score

tfidf_matrix_test = tfidf_vect.transform(test_df["document"])

best_estimator = grid_cv.best_estimator_
preds = best_estimator.predict(tfidf_matrix_test)

print("Logistic Regression accuracy:", accuracy_score(test_df["label"], preds))

Logistic Regression accuracy: 0.86172
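The accuracy reported above is simply the fraction of test reviews whose predicted label matches the true label. A minimal plain-Python sketch of the same metric follows; the function `accuracy` and the sample lists are hypothetical and are not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels and predictions for illustration.
y_true = [0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1]
print(accuracy(y_true, y_pred))
```

Because the two classes in the training data are nearly balanced, as the `value_counts()` output earlier shows, accuracy is a reasonable headline metric here.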
