Classifiers 를 비교하기 위하여 10 개 영화의 평점 데이터를 이용합니다. 1 ~ 3 점은 negative, 9 ~ 10 은 positive class 이며, 4 ~ 8 점의 데이터는 무시하였습니다.

In [1]:
import config
import sklearn
import warnings
from navermovie_comments import load_sentiment_dataset

warnings.filterwarnings('ignore')
print(sklearn.__version__)

texts, x, y, idx_to_vocab = load_sentiment_dataset(data_name='10k', tokenize='komoran')
print(x.shape)

0.21.3
(10000, 4808)


numpy.unique 를 이용하면 실제 값들을 확인할 수 있습니다. negative 는 -1, positive 는 1 입니다.

In [2]:
import numpy as np
np.unique(y)

array([-1,  1])

Decision Tree 의 depth 별로 classification 의 성능과 decision tree 가 이용하는 features 의 개수를 확인합니다.

Depth 외의 다른 features 는 기본값으로 고정합니다.

cross validation 을 이용하여 일반화 성능을 추정합니다. 10-fold cross validation 을 이용하였고, 이용하는 features 의 개수를 확인하기 위하여 데이터 모두를 이용하여 학습시켰습니다.

scikit-learn = 0.20.0 부터 cross_val_score 는 sklearn.cross_validation.cross_val_score 에서 sklearn.mdoel_selection.cross_val_score 로 이동하였습니다.

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

n_cv = 10

# for each depth
for depth in [10, 20, 30, 50, 100]:
    
    # build decision tree
    decision_tree = DecisionTreeClassifier(
        max_features=None,
        max_depth=depth
    )
    
    # train cross-validation
    scores = cross_val_score(
        decision_tree, x, y, cv=n_cv)
    average_score = sum(scores) / n_cv
    
    # re-train using all data
    decision_tree.fit(x,y)
    
    # number of used features
    useful_features = list(
        filter(lambda x:x[1]>0,
               enumerate(decision_tree.feature_importances_)
              )
    )
    n_useful_features = len(useful_features)

    print('depth = {}, cross-validation average = {:.4}, n useful featuers = {}'.format(depth, average_score, n_useful_features))

depth = 10, cross-validation average = 0.7664, n useful featuers = 97
depth = 20, cross-validation average = 0.7777, n useful featuers = 240
depth = 30, cross-validation average = 0.7783, n useful featuers = 344
depth = 50, cross-validation average = 0.7838, n useful featuers = 518
depth = 100, cross-validation average = 0.7874, n useful featuers = 714


L1 regularization 을 적용하는 lasso regression 을 학습합니다. Lasso 도 decision tree 처럼 유용한 features 를 선택하는 특징이 있습니다. 이 둘의 성능을 비교합니다.

Lasso regression 이 이용하는 features 의 개수는 regularization cost 를 통하여 조절할 수 있습니다.

In [4]:
from sklearn.linear_model import LogisticRegression

# for each cost
for cost in [100, 10, 1, 0.1, 0.01]:

    # build lasso regression
    logistic_regression = LogisticRegression(
        penalty='l1', C=cost)

    # train cross validation
    scores = cross_val_score(
        logistic_regression, x, y, cv=n_cv)
    average_score = sum(scores) / n_cv

    # re-train using all data
    logistic_regression.fit(x,y)

    # number of used features
    useful_features = list(
        filter(lambda x:abs(x[1])>0,
               enumerate(logistic_regression.coef_[0])
              )
    )

    n_useful_features = len(useful_features)

    print('L1 lambda = {}, cross-validation = {}, n useful features = {}'.format(1/cost, average_score, n_useful_features))

L1 lambda = 0.01, cross-validation = 0.8205, n useful features = 2502
L1 lambda = 0.1, cross-validation = 0.8473, n useful features = 2170
L1 lambda = 1.0, cross-validation = 0.858, n useful features = 1108
L1 lambda = 10.0, cross-validation = 0.8194000000000001, n useful features = 181
L1 lambda = 100.0, cross-validation = 0.7269999999999999, n useful features = 20


Lasso 대신 L2 regularization 을 적용한 logistic regression 도 함께 돌려봅니다.

In [5]:
from sklearn.linear_model import LogisticRegression

# for each cost
for cost in [100, 10, 1, 0.1, 0.01]:

    # build lasso regression
    logistic_regression = LogisticRegression(
        penalty='l2', C=cost)

    # train cross validation
    scores = cross_val_score(
        logistic_regression, x, y, cv=n_cv)
    average_score = sum(scores) / n_cv

    # re-train using all data
    logistic_regression.fit(x,y)

    # number of used features
    useful_features = list(
        filter(lambda x:abs(x[1])>0.01,
               enumerate(logistic_regression.coef_[0])
              )
    )

    n_useful_features = len(useful_features)

    print('L2 lambda = {}, cross-validation = {}, n features abs > 0.01 = {}'.format(1/cost, average_score, n_useful_features))

L2 lambda = 0.01, cross-validation = 0.8422000000000001, n features abs > 0.01 = 4420
L2 lambda = 0.1, cross-validation = 0.8550000000000001, n features abs > 0.01 = 4556
L2 lambda = 1.0, cross-validation = 0.8638999999999999, n features abs > 0.01 = 4585
L2 lambda = 10.0, cross-validation = 0.8517000000000001, n features abs > 0.01 = 4261
L2 lambda = 100.0, cross-validation = 0.8221999999999999, n features abs > 0.01 = 2059


Random Forest 도 max depth 와 estimators 의 개수를 다르게 설정하여 확인합니다.

In [6]:
from sklearn.ensemble import RandomForestClassifier

# for each max-depth
for depth in [2, 4, 6, 8]:

    # for each n estimators
    for n_estimators in [100, 300, 500, 1000]:
        random_forest = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=depth,
        )

        # train cross validation
        scores = cross_val_score(
            random_forest, x, y, cv=n_cv)
        average_score = sum(scores) / n_cv

        print('Random forest max-depth = {}, n estimators = {}, cross-validation = {}'.format(
            depth, n_estimators, average_score))

Random forest max-depth = 2, n estimators = 100, cross-validation = 0.7665
Random forest max-depth = 2, n estimators = 300, cross-validation = 0.8013999999999999
Random forest max-depth = 2, n estimators = 500, cross-validation = 0.7855
Random forest max-depth = 2, n estimators = 1000, cross-validation = 0.7975
Random forest max-depth = 4, n estimators = 100, cross-validation = 0.7914
Random forest max-depth = 4, n estimators = 300, cross-validation = 0.7988999999999999
Random forest max-depth = 4, n estimators = 500, cross-validation = 0.8036000000000001
Random forest max-depth = 4, n estimators = 1000, cross-validation = 0.8039999999999999
Random forest max-depth = 6, n estimators = 100, cross-validation = 0.8029
Random forest max-depth = 6, n estimators = 300, cross-validation = 0.8074999999999999
Random forest max-depth = 6, n estimators = 500, cross-validation = 0.8129
Random forest max-depth = 6, n estimators = 1000, cross-validation = 0.8120999999999998
Random forest max-depth =

Gradient boosting 도 max depth 와 n estimators 를 다르게 하여 cross validation 을 수행합니다.

In [7]:
from sklearn.ensemble import GradientBoostingClassifier

# for each max-depth
for depth in [2, 4, 6, 8]:

    # for each n estimators
    for n_estimators in [100, 300, 500, 1000]:
        gbtree = GradientBoostingClassifier(
            n_estimators=n_estimators,
            max_depth=depth,
        )

        # train cross validation
        scores = cross_val_score(
            random_forest, x, y, cv=n_cv)
        average_score = sum(scores) / n_cv

        print('Gradient boosting tree max-depth = {}, n estimators = {}, cross-validation = {}'.format(
            depth, n_estimators, average_score))

Gradient boosting tree max-depth = 2, n estimators = 100, cross-validation = 0.8184999999999999
Gradient boosting tree max-depth = 2, n estimators = 300, cross-validation = 0.8192
Gradient boosting tree max-depth = 2, n estimators = 500, cross-validation = 0.8219
Gradient boosting tree max-depth = 2, n estimators = 1000, cross-validation = 0.8198000000000001
Gradient boosting tree max-depth = 4, n estimators = 100, cross-validation = 0.8201
Gradient boosting tree max-depth = 4, n estimators = 300, cross-validation = 0.8244999999999999
Gradient boosting tree max-depth = 4, n estimators = 500, cross-validation = 0.8262
Gradient boosting tree max-depth = 4, n estimators = 1000, cross-validation = 0.8259000000000001
Gradient boosting tree max-depth = 6, n estimators = 100, cross-validation = 0.8289
Gradient boosting tree max-depth = 6, n estimators = 300, cross-validation = 0.8164000000000001
Gradient boosting tree max-depth = 6, n estimators = 500, cross-validation = 0.8199000000000002
Gr

XGBoost 는 scikit-learn 의 패키지가 아니기 때문에 직접 cross validation 과정을 구현하였습니다.

In [8]:
import numpy as np
import xgboost as xgb

y_xgb = np.zeros(y.shape[0])
y_xgb[np.where(y == 1)[0]] = 1

# for each max-depth
for depth in [2, 4, 6, 8]:

    # for each n estimators
    for n_estimators in [100, 300, 500, 1000]:

        # cross - validation (manually)
        validation_scores = []

        for b in range(10):
            test_idx = [i for i in range(b*1000, (b+1)*1000)]
            train_idx = [i for i in range(10000) if not i in test_idx]
            train_idx = np.asarray(train_idx)
            test_idx = np.asarray(test_idx)

            dtrain = xgb.DMatrix(data=x[train_idx], label=y_xgb[train_idx])
            dtest = xgb.DMatrix(data=x[test_idx])

            param = {'max_depth': depth, 'eta': 0.3, 'silent': 0, 'objective': 'binary:logistic'}
            param['nthread'] = 4
            param['eval_metric'] = 'auc'
            bst = xgb.train(param, dtrain, n_estimators)

            pred_score = bst.predict(dtest)
            y_pred = np.zeros(pred_score.shape[0])
            y_pred[np.where(pred_score > 0.5)[0]] = 1
            n_correct = np.where(y_pred == y_xgb[test_idx])[0].shape[0]
            accuracy = n_correct / y_pred.shape[0]
            validation_scores.append(accuracy)

        average_score = sum(validation_scores) / len(validation_scores)

        print('XGBoost tree max-depth = {}, n estimators = {}, cross-validation = {}'.format(
            depth, n_estimators, average_score))

XGBoost tree max-depth = 2, n estimators = 100, cross-validation = 0.7551
XGBoost tree max-depth = 2, n estimators = 300, cross-validation = 0.7848
XGBoost tree max-depth = 2, n estimators = 500, cross-validation = 0.7958
XGBoost tree max-depth = 2, n estimators = 1000, cross-validation = 0.807
XGBoost tree max-depth = 4, n estimators = 100, cross-validation = 0.7779
XGBoost tree max-depth = 4, n estimators = 300, cross-validation = 0.8027999999999998
XGBoost tree max-depth = 4, n estimators = 500, cross-validation = 0.8130000000000001
XGBoost tree max-depth = 4, n estimators = 1000, cross-validation = 0.8166
XGBoost tree max-depth = 6, n estimators = 100, cross-validation = 0.7888
XGBoost tree max-depth = 6, n estimators = 300, cross-validation = 0.8087
XGBoost tree max-depth = 6, n estimators = 500, cross-validation = 0.8150000000000001
XGBoost tree max-depth = 6, n estimators = 1000, cross-validation = 0.8101999999999998
XGBoost tree max-depth = 8, n estimators = 100, cross-validati