To compare classifiers, we use rating data from 10 movies, the dataset prepared on Day 1. Ratings of 1 to 3 form the negative class and ratings of 9 to 10 the positive class; reviews rated 4 to 8 were ignored.
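The labeling rule above can be sketched as a small function. This is only an illustration: `rating_to_label` and the sample ratings are hypothetical and not part of the dataset loader; the label values -1 and 1 follow the convention used in this notebook.

```python
def rating_to_label(rating):
    """Map a 1-10 rating to a class label; None means the review is dropped."""
    if 1 <= rating <= 3:
        return -1    # negative class
    if 9 <= rating <= 10:
        return 1     # positive class
    return None      # ratings 4-8 are ignored

ratings = [1, 3, 5, 8, 9, 10]  # made-up ratings for illustration
labeled = [(r, rating_to_label(r)) for r in ratings
           if rating_to_label(r) is not None]
print(labeled)  # [(1, -1), (3, -1), (9, 1), (10, 1)]
```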

In [1]:
import sys
import warnings
warnings.filterwarnings('ignore')

from config import dataset_dir
sys.path.append('{}/lovit_textmining_dataset/'.format(dataset_dir))

Dataset version
[navermovie_comments.data] is latest (0.0.1)
[navermovie_comments.models] is latest (0.0.1)
[navernews_10days.data] is latest (0.0.1)
[navernews_10days.models] is latest (0.0.1)


In [2]:
from navermovie_comments import load_sentiment_dataset

texts, x, y, idx_to_vocab = load_sentiment_dataset(data_name='10k', tokenize='komoran')
print(x.shape)

(10000, 4808)


numpy.unique shows the distinct label values: negative is -1 and positive is 1.

In [3]:
import numpy as np
np.unique(y)

array([-1,  1])
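If the class balance is also of interest, numpy.unique can return the count of each value as well. A minimal sketch on made-up labels (`y_toy` is hypothetical, not the notebook's `y`):

```python
import numpy as np

y_toy = np.array([-1, 1, 1, -1, 1])  # made-up labels for illustration

# return_counts=True also returns how often each distinct value occurs
labels, counts = np.unique(y_toy, return_counts=True)
class_counts = dict(zip(labels.tolist(), counts.tolist()))
print(class_counts)  # {-1: 2, 1: 3}
```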

For each depth of a decision tree, we check the classification performance and the number of features the tree uses.

All parameters other than the depth are fixed at their default values.

Generalization performance is estimated with 10-fold cross-validation. To count the features the tree uses, the model is then re-trained on all of the data.

As of scikit-learn 0.20.0, cross_val_score is imported from sklearn.model_selection; the old sklearn.cross_validation module, deprecated since 0.18, was removed in that release.
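To make the cross-validation step concrete, here is a sketch of what cross_val_score does for a classifier when given an integer cv: it splits the data with StratifiedKFold, fits a fresh copy of the model on each training part, and scores it on the held-out fold. The data is synthetic (`X_toy`, `y_toy` are assumptions for illustration, not the movie reviews), and the tree depth is arbitrary.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data, not the notebook's dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# manual 10-fold loop: train on 9 folds, score on the remaining one
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_toy, y_toy):
    model = clone(tree)  # fresh, unfitted copy for every fold
    model.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(model.score(X_toy[test_idx], y_toy[test_idx]))

# the library call performs the same procedure
library_scores = cross_val_score(tree, X_toy, y_toy, cv=10)
print(np.mean(manual_scores), library_scores.mean())
```

Fixing random_state makes the tree deterministic, so the manual loop and the library call should give the same fold scores.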

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

n_cv = 10

# for each depth
for depth in [10, 20, 30, 50, 100]:
    
    # build decision tree
    decision_tree = DecisionTreeClassifier(
        max_features=None,
        max_depth=depth
    )
    
    # train cross-validation
    scores = cross_val_score(
        decision_tree, x, y, cv=n_cv)
    average_score = sum(scores) / n_cv
    
    # re-train using all data
    decision_tree.fit(x,y)
    
    # number of used features
    useful_features = list(
        filter(lambda x:x[1]>0,
               enumerate(decision_tree.feature_importances_)
              )
    )
    n_useful_features = len(useful_features)

    print('depth = {}, cross-validation average = {:.4}, n useful features = {}'.format(depth, average_score, n_useful_features))

depth = 10, cross-validation average = 0.7656, n useful features = 100
depth = 20, cross-validation average = 0.7774, n useful features = 246
depth = 30, cross-validation average = 0.7802, n useful features = 359
depth = 50, cross-validation average = 0.7843, n useful features = 500
depth = 100, cross-validation average = 0.7821, n useful features = 729
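The feature count above relies on the fact that a feature never used in a split has an importance of 0 in feature_importances_. A more compact way to count the used features, sketched on synthetic data (`X_toy`, `y_toy`, and the depths are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data, not the notebook's dataset
X_toy, y_toy = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0)

n_used_by_depth = {}
for depth in [1, 3, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_toy, y_toy)
    # features with positive importance are the ones the tree split on
    n_used_by_depth[depth] = int(np.count_nonzero(tree.feature_importances_))
    print('depth = {}, n used features = {}'.format(depth, n_used_by_depth[depth]))
```

A tree of depth 1 has a single split, so it can use only one feature; deeper trees have room for more splits and can draw on more features.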


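Next we train lasso regression, here logistic regression with L1 regularization. Like a decision tree, lasso selects useful features, so we compare the performance of the two.

The number of features lasso uses can be controlled through the regularization cost. The code below passes it as C, the inverse of the regularization strength lambda (lambda = 1/C).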
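The effect of the regularization cost on sparsity can be seen on a small synthetic problem. This is a sketch under assumptions: `X_toy` and `y_toy` are generated data, and the C values are chosen only for illustration. The liblinear solver is named explicitly because it supports the L1 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data: 50 features, only a few of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0)

n_nonzero = {}
for C in [100, 1, 0.01]:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    model.fit(X_toy, y_toy)
    # coefficients driven exactly to zero correspond to discarded features
    n_nonzero[C] = int(np.count_nonzero(model.coef_[0]))
    print('C = {}, lambda = {}, non-zero coefficients = {}'.format(
        C, 1 / C, n_nonzero[C]))
```

A smaller C means a larger lambda, that is, a stronger penalty and fewer surviving coefficients.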

In [5]:
from sklearn.linear_model import LogisticRegression

# for each cost
for cost in [100, 10, 1, 0.1, 0.01]:

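    # build L1-regularized logistic regression (lasso)
    # liblinear supports the L1 penalty; lbfgs, the default solver
    # since scikit-learn 0.22, does not
    logistic_regression = LogisticRegression(
        penalty='l1', C=cost, solver='liblinear')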

    # train cross validation
    scores = cross_val_score(
        logistic_regression, x, y, cv=n_cv)
    average_score = sum(scores) / n_cv

    # re-train using all data
    logistic_regression.fit(x,y)

    # number of used features
    useful_features = list(
        filter(lambda x:abs(x[1])>0,
               enumerate(logistic_regression.coef_[0])
              )
    )

    n_useful_features = len(useful_features)

    print('L1 lambda = {}, cross-validation = {}, n useful features = {}'.format(1/cost, average_score, n_useful_features))

L1 lambda = 0.01, cross-validation = 0.8206, n useful features = 2496
L1 lambda = 0.1, cross-validation = 0.8477, n useful features = 2172
L1 lambda = 1.0, cross-validation = 0.858, n useful features = 1108
L1 lambda = 10.0, cross-validation = 0.8193000000000001, n useful features = 181
L1 lambda = 100.0, cross-validation = 0.7269999999999999, n useful features = 20


In place of lasso, we also run logistic regression with L2 regularization.

In [6]:
from sklearn.linear_model import LogisticRegression

# for each cost
for cost in [100, 10, 1, 0.1, 0.01]:

    # build L2-regularized logistic regression
    logistic_regression = LogisticRegression(
        penalty='l2', C=cost)

    # train cross validation
    scores = cross_val_score(
        logistic_regression, x, y, cv=n_cv)
    average_score = sum(scores) / n_cv

    # re-train using all data
    logistic_regression.fit(x,y)

    # number of used features
    useful_features = list(
        filter(lambda x:abs(x[1])>0.01,
               enumerate(logistic_regression.coef_[0])
              )
    )

    n_useful_features = len(useful_features)

    print('L2 lambda = {}, cross-validation = {}, n features abs > 0.01 = {}'.format(1/cost, average_score, n_useful_features))

L2 lambda = 0.01, cross-validation = 0.8423, n features abs > 0.01 = 4419
L2 lambda = 0.1, cross-validation = 0.8549, n features abs > 0.01 = 4556
L2 lambda = 1.0, cross-validation = 0.8638999999999999, n features abs > 0.01 = 4585
L2 lambda = 10.0, cross-validation = 0.8517000000000001, n features abs > 0.01 = 4261
L2 lambda = 100.0, cross-validation = 0.8221999999999999, n features abs > 0.01 = 2059
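The L2 loop above counts coefficients whose absolute value exceeds 0.01 rather than coefficients that are exactly non-zero. The sketch below illustrates why a threshold is needed: an L2 penalty shrinks coefficients toward zero but does not usually set them exactly to zero, whereas an L1 penalty does. The data is synthetic (`X_toy`, `y_toy` are assumptions), and C = 0.1 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data, not the notebook's dataset
X_toy, y_toy = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0)

zeros = {}
for penalty in ['l1', 'l2']:
    model = LogisticRegression(penalty=penalty, C=0.1, solver='liblinear')
    model.fit(X_toy, y_toy)
    coef = model.coef_[0]
    zeros[penalty] = int(np.sum(coef == 0))       # exactly zero
    n_small = int(np.sum(np.abs(coef) <= 0.01))   # zero or negligibly small
    print('{}: exactly zero = {}, abs <= 0.01 = {}'.format(
        penalty, zeros[penalty], n_small))
```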
