### **IMDB 데이터셋 감정 분석을 위한 그라디언트 부스팅 기법 성능 탐구**

- 그라디언트 부스팅 기법을 활용하여 IMDB 데이터셋의 감정 분석 성능을 개선할 수 있는지 탐구하였다.
- 동일한 조건에서 성능을 비교하기 위해 TF-IDF 기반 Vectorizer를 사용해 문장들을 벡터화하였고, 그 결과 각 문장이 73,683차원의 실수 벡터로 변환되었다. (각 차원은 $[0,1]$의 값을 갖는다.)

**로지스틱 모형(Baseline)**
- 베이스라인 모형으로 로지스틱 모형을 사용하였고, 그 결과는 아래와 같다. *(Accuracy 기준)*
    
    - 기본 모형 (초모수 튜닝 X) : **(Train, Test) = (93.0%, 89.2%)**
    - Grid Search 적용 : **(Train, Test) = (96.8%, 89.5%)**

- Grid Search는 L1, L2 penalty의 크기를 탐색하기 위해 사용하였고, 규제화 크기만으로도 성능이 일정 수준 개선됨을 알 수 있다.

**XGBoost Classifier**
- 첫 번째 그라디언트 부스팅 방법으로 XGBoost [(original paper)](https://dl.acm.org/citation.cfm?id=2939785)를 선정하였고, 그 적용 결과는 아래와 같다. *(Accuracy 기준)*

    - 기본 모형 (초모수 튜닝 X) : **(Train, Test) = (83.3%, 81.2%)**
    - Grid Search 적용 : **(Train, Test) = (-, -)**

- XGBoost 방법을 사용한 결과 로지스틱 모형에 비해 성능이 다소 낮게 나타났으며, 학습시간 또한 길어 Grid Search를 수행할 수 없었다.

**LightGBM Classifier**
- 두 번째 그라디언트 부스팅 방법으로 LightGBM [(original paper)](http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradi)을 선정하였고, 그 적용 결과는 아래와 같다. *(Accuracy 기준)*

    - 기본 모형 (초모수 튜닝 X) : **(Train, Test) = (91.9%, 86.0%)**
    - Grid Search 적용 : **(Train, Test) = (100.0%, 88.2%)**

- LightGBM은 XGBoost에 비해 성능과 학습시간이 모두 우월하게 나타났으며, 초모수를 튜닝하지 않고도 비교적 좋은 성능을 보였다.
- 다만 성능을 개선하기 위해 Grid Search를 통해 초모수를 조정한 결과 과적합이 발생하였고 test accuracy 개선폭이 크지 않았다. 또한 부스팅 방법의 특성상 많은 계산이 필요하여 초모수를 미세 조정하는 것이 거의 불가능하였다.
    
    *※ 초모수 조정을 위해 Google Cloud Platform의 클라우드 컴퓨팅을 적용하여 16개의 CPU로 병렬연산을 수행하였음에도 약 1시간 30분이 소요되어 미세 조정에 실패하였다.*

- 세부적으로는, estimator의 숫자가 1,000을 넘어서면 과적합이 발생하였고, L1, L2 페널티의 크기인 $\alpha, \lambda$는 과적합 방지에 미미한 영향을 미쳤다.

    과적합을 가장 효과적으로 방지한 파라미터는 `num_leaves`와 `max_depth`, 즉 각 의사결정 나무의 복잡도를 직접적으로 줄이는 지표들이었다. 그러나 해당 지표를 통해 과적합을 방지하여도 test accuracy 개선에는 영향력이 없었다. 이는 각 estimator의 복잡도가 줄어들어 모형 전체의 설명력이 줄어드는 것이라 해석할 수 있을 것이다.

**(참조) Random Forest Classifier**
- 또다른 앙상블 학습 기법인 랜덤 포레스트를 적절히 튜닝하여 적용한 결과는 다음과 같다.

    - **(Train, Test) = (87.6%, 85.1%)**

- 초모수를 조정하지 않으면 과적합이 발생하나(training accuracy : 100.0%), 가지치기 파라미터 $\alpha$와 분기를 위한 최소 표본수 `min_samples_split`을 적절히 조정한 결과 과적합이 줄어드는 동시에 test accuracy가 소폭 개선되었다.

- 다만 랜덤 포레스트 단일 모형으로는 부스팅 기법에 비해 설명력이 다소 부족한 것으로 나타났다.

### **향후 개선점 및 탐구 과제**

***1. 그라디언트 부스팅 방법의 과적합 방지***
    
- estimator 수를 늘려 설명력을 늘리는 동시에, 모형의 복잡도를 크게 훼손하지 않는 범위에서 과적합을 방지할 수 있는 방안에 대해 탐구할 예정이다.
- 또한 Vectorize된 $X$ 변수들이 희소 데이터(sparse data)임을 감안하여, 희소 데이터에 부스팅 기법을 적용할 때 주의할 점에 대해 알아볼 것이다.

***2. 다른 특성 변수 변환을 고려***

- TF-IDF 기반 Vectorizer를 사용한 경우 고차원-희소 특성변수들이 생성되므로 계산량이 증가함과 동시에 최적화가 이루어지지 않을 가능성이 존재한다.
- 따라서 Tokenizer가 아닌 다른 특성 변수 변환 방법을 고려하고자 한다.
    - Word2Vec : 현재 자연어처리(NLP) 분야에서 보편적으로 사용되는 특성 변수 생성 기법이다.
    - TF-IDF Vector에 차원 축소 기법 적용
    - Neural Network를 활용한 특성 변수 생성 기법 적용
- 특히 Neural Network를 활용한 방법의 경우, IMDB 데이터를 포함한 많은 자연어처리 Task에서 SOTA(state-of-the-art)의 성능을 보이는 것으로 알려진 ***BERT(Bidirectional Encoder Representations from Transformers)***를 활용한 변수 변환을 적용 예정이다.

***3. 특성 변수에 대한 EDA***

- 딥러닝이 아닌 통계적 기계학습 기법을 활용한 분류 문제의 경우 EDA를 통해 특성변수의 특징을 분석하는 것이 성능 개선에 영향을 미치는 것으로 알려져있다.
- 그러므로 위에서 제시된 특성 변수들에 대한 EDA 및 그에 따른 적절한 분류 모형 선정을 진행할 예정이다.

### **Prerequisites**

In [4]:
# 코드 출력시 불필요한 warning 메시지가 출력되지 않도록 함
# 출력하지 않을 warining 메시지의 종류를 지정할 수 있음 ('DeprecationWarning' 등)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [5]:
import tarfile
tar = tarfile.open('C:/Users/ysp/Desktop/Deep Learning/aclImdb_v1.tar.gz') # gz파일이 저장된 경로로 변경
tar.extractall()


In [6]:
# 자료를 행렬 형태로 바꿈.
import pyprind
import pandas as pd
import os

basepath = 'C://Users//ysp//Desktop//Deep Learning//aclImdb_v1//aclImdb'
labels = {'pos':1, 'neg':0}
pbar = pyprind.ProgBar(50000)
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            df = df.append([[txt, labels[l]]], ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']    

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:03:30


In [7]:
# 자료의 순서를 임의로 뒤섞어 csv 파일로 저장. 
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('C://Users//ysp//Desktop//Deep Learning//movie_data.csv', index = False, encoding='utf-8')

In [8]:
# 저장된 파일 불러와 확인함.
df = pd.read_csv('C://Users//ysp//Desktop//Deep Learning//movie_data.csv', encoding='utf-8')
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


### **Data preprocessing**

In [9]:
# 정보를 가지지 않은 것으로 판단 되는 것을 사전에 정리
import re
def preprocessor(text):
    text = re.sub('<[^>]*>', '', text) # text에서 <[^>]*>과 일치하는 데이터를 공백으로 바꾸는 명령어
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
    text)
    text = (re.sub('[\W]+', ' ', text.lower()) # 단어가 아닌 모든 기호는 공백으로 대체
    +' '.join(emoticons).replace('-', '')) # emoticons을 빈공간 뒤에 배치
    return text

df['review'] = df['review'].apply(preprocessor)
df['review']

0        in 1974 the teenager martha moxley maggie grac...
1        ok so i really like kris kristofferson and his...
2         spoiler do not read this if you think about w...
3        hi for all the people who have seen this wonde...
4        i recently bought the dvd forgetting just how ...
                               ...                        
49995    ok lets start with the best the building altho...
49996    the british heritage film industry is out of c...
49997    i don t even know where to begin on this one i...
49998    richard tyler is a little boy who is scared of...
49999    i waited long to watch this movie also because...
Name: review, Length: 50000, dtype: object

Tokenizer settings

In [10]:
# TfidfVectorizer 클래스의 'tokenizer' 옵션을 위한 함수 정의

# 1) tokenizer
def tokenizer(text):
    return text.split()

# 2) tokenizer_porter (PorterStemmer 클래스 사용)
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [11]:
# TfidfVectorizer 클래스의 'stop_words' 옵션을 위한 함수 정의
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ysp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


data tokenizing (text to array)

In [12]:
# extract values from data frame
X = df['review'].values
y = df['sentiment'].values

In [13]:
# import tokenizer to data transform
from sklearn.feature_extraction.text import TfidfVectorizer

# tokenizer with porter stemmer
tfidf_porter = TfidfVectorizer(ngram_range = (1,1),
                        strip_accents = None,
                        lowercase = False,
                        preprocessor = None,
                        stop_words = 'english',
                        tokenizer = tokenizer_porter)

In [14]:
# tokenize reviews
X_tok = tfidf_porter.fit_transform(X).toarray()

In [0]:
tmp = tfidf_porter.fit_transform(X[:25000]).toarray()

In [17]:
print(X_tok.shape)

(50000, 73590)


In [15]:
# split train-test dataset (plain tokenizer)
X_train = X_tok[:25000]
X_test = X_tok[25000:]

# split train-test label
y_train = y[:25000]
y_test = y[25000:]

In [16]:
print(X_train.shape)
print(X_test.shape)

(25000, 73590)
(25000, 73590)


### **Logistic Regression Model**

plain model without CV

In [20]:
from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression(solver = 'liblinear', random_state=0)
%time clf_lr.fit(X_train, y_train)

Wall time: 9.47 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [21]:
print('Logistic Model Training Accuracy w/o CV: %.3f' % clf_lr.score(X_train, y_train))
print('Logistic Model Test Accuracy w/o CV: %.3f' % clf_lr.score(X_test, y_test))

Logistic Model Training Accuracy w/o CV: 0.930
Logistic Model Test Accuracy w/o CV: 0.890


Hyperparameter Tuning via CV

In [22]:
param_lr = [{'clf__penalty':['l2', 'l1'],
             'clf__C': [0.1, 0.5, 1.0, 5.0, 10.0]}]

In [23]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [28]:
lr = Pipeline([('clf', LogisticRegression(solver = 'liblinear', random_state=0))])
gs_lr = GridSearchCV(lr, param_grid = param_lr, scoring = 'accuracy', cv = 5, verbose = 2)
gs_lr.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] clf__C=0.1, clf__penalty=l2 .....................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ...................... clf__C=0.1, clf__penalty=l2, total=  27.4s
[CV] clf__C=0.1, clf__penalty=l2 .....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   28.4s remaining:    0.0s


[CV] ...................... clf__C=0.1, clf__penalty=l2, total=  12.3s
[CV] clf__C=0.1, clf__penalty=l2 .....................................
[CV] ...................... clf__C=0.1, clf__penalty=l2, total=  13.3s
[CV] clf__C=0.1, clf__penalty=l2 .....................................
[CV] ...................... clf__C=0.1, clf__penalty=l2, total=  11.8s
[CV] clf__C=0.1, clf__penalty=l2 .....................................
[CV] ...................... clf__C=0.1, clf__penalty=l2, total=  11.9s
[CV] clf__C=0.1, clf__penalty=l1 .....................................
[CV] ...................... clf__C=0.1, clf__penalty=l1, total=  13.8s
[CV] clf__C=0.1, clf__penalty=l1 .....................................
[CV] ...................... clf__C=0.1, clf__penalty=l1, total=  14.3s
[CV] clf__C=0.1, clf__penalty=l1 .....................................
[CV] ...................... clf__C=0.1, clf__penalty=l1, total=  14.5s
[CV] clf__C=0.1, clf__penalty=l1 .....................................
[CV] .

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed: 11.7min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('clf',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None,
                                                           max_iter=100,
                                                           multi_class='auto',
                                                           n_jobs=None,
                                                           penalty='l2',
                                                           random_state=0,
                                                    

In [29]:
print('Logistic Model Training Accuracy with CV: %.3f' % gs_lr.score(X_train, y_train))
print('Logistic Model Test Accuracy with CV: %.3f' % gs_lr.score(X_test, y_test))

Logistic Model Training Accuracy with CV: 0.969
Logistic Model Test Accuracy with CV: 0.892


### **XGBoost Classifier Model**

plain model without CV

In [32]:
#!pip install xgboost
%%time
from xgboost import XGBClassifier
clf_xgb = XGBClassifier(random_state=0)
clf_xgb.fit(X_train, y_train)

Wall time: 0 ns


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [33]:
print('XGBoost Model Training Accuracy w/o CV: %.3f' % clf_xgb.score(X_train, y_train))
print('XGBoost Model Test Accuracy w/o CV: %.3f' % clf_xgb.score(X_test, y_test))

XGBoost Model Training Accuracy w/o CV: 0.947
XGBoost Model Test Accuracy w/o CV: 0.853


Hyperparameter Tuning via CV

In [34]:
param_xgb = [{'reg_alpha': [0.5, 1.0],
              'reg_lambda': [0.5, 1.0],
              'n_estimators': [1000, 1500]}]

In [None]:
gs_xgb = GridSearchCV(XGBClassifier(random_state = 0), param_xgb, cv=5, verbose = 2)
gs_xgb.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] n_estimators=1000, reg_alpha=0.5, reg_lambda=0.5 ................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] . n_estimators=1000, reg_alpha=0.5, reg_lambda=0.5, total=69.0min
[CV] n_estimators=1000, reg_alpha=0.5, reg_lambda=0.5 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 69.0min remaining:    0.0s


### **LightGBM Classifier Model**

plain model without CV

In [17]:
#!pip install lightgbm

from lightgbm import LGBMClassifier
clf_lgb = LGBMClassifier(random_state=0, n_jobs=-1)
clf_lgb.fit(X_train, y_train)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=0, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [18]:
print('LightGBM Model Training Accuracy w/o CV: %.3f' % clf_lgb.score(X_train, y_train))
print('LightGBM Model Test Accuracy w/o CV: %.3f' % clf_lgb.score(X_test, y_test))

LightGBM Model Training Accuracy w/o CV: 0.917
LightGBM Model Test Accuracy w/o CV: 0.860


Hyperparameter Tuning via CV

In [0]:
param_lgb = [{'reg_alpha': [0.0, 0.25, 0.5, 0.75, 1.0],
              'reg_lambda': [0.0, 0.25, 0.5, 0.75, 1.0],
              'n_estimators': [100, 300, 500, 1000]}]

In [123]:
gs_lgb = GridSearchCV(LGBMClassifier(random_state = 0), param_lgb, cv=5, verbose = 2, n_jobs = 16)
gs_lgb.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done   9 tasks      | elapsed:  1.6min
[Parallel(n_jobs=16)]: Done 130 tasks      | elapsed: 11.3min
[Parallel(n_jobs=16)]: Done 333 tasks      | elapsed: 47.6min
[Parallel(n_jobs=16)]: Done 500 out of 500 | elapsed: 94.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                      colsample_bytree=1.0,
                                      importance_type='split',
                                      learning_rate=0.1, max_depth=-1,
                                      min_child_samples=20,
                                      min_child_weight=0.001,
                                      min_split_gain=0.0, n_estimators=100,
                                      n_jobs=-1, num_leaves=31, objective=None,
                                      random_state=0, reg_alpha=0.0,
                                      reg_lambda=0.0, silent=True,
                                      subsample=1.0, subsample_for_bin=200000,
                                      subsample_freq=0),
             iid='deprecated', n_jobs=16,
             param_grid=[{'n_estimators': [100, 300, 500, 1000],
                          'reg_alp

In [125]:
print(gs_lgb.best_params_)

{'reg_alpha': 0.0, 'reg_lambda': 0.5, 'n_estimators': 1000}


In [124]:
print('LightGBM Model Training Accuracy with CV: %.3f' % gs_lgb.score(X_train, y_train))
print('LightGBM Model Test Accuracy with CV: %.3f' % gs_lgb.score(X_test, y_test))

LightGBM Model Training Accuracy with CV: 1.000
LightGBM Model Test Accuracy with CV: 0.882


### **RandomForest Classifier Model**

plain model without CV

In [251]:
%%time
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(random_state=0, n_jobs = -1,
                                min_samples_split = 5,
                                ccp_alpha = 0.0005,
                                n_estimators = 500)
clf_rf.fit(X_train, y_train)

CPU times: user 2h 12min 41s, sys: 1.26 s, total: 2h 12min 43s
Wall time: 2min 14s


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0005, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [252]:
print('Random Forest Model Training Accuracy w/o CV: %.3f' % clf_rf.score(X_train, y_train))
print('Random Forest Model Test Accuracy w/o CV: %.3f' % clf_rf.score(X_test, y_test))

Random Forest Model Training Accuracy w/o CV: 0.876
Random Forest Model Test Accuracy w/o CV: 0.851


In [19]:
!pip install catboost

Collecting catboost
  Downloading catboost-0.24.2-cp37-none-win_amd64.whl (65.3 MB)
Collecting plotly
  Downloading plotly-4.12.0-py2.py3-none-any.whl (13.1 MB)
Collecting retrying>=1.3.3
  Downloading retrying-1.3.3.tar.gz (10 kB)
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py): started
  Building wheel for retrying (setup.py): finished with status 'done'
  Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11435 sha256=cd7ff7a5655bb6ebb56c3f96931782bc36cea917a6ed7ef970b9c36c982fb37e
  Stored in directory: c:\users\ysp\appdata\local\pip\cache\wheels\f9\8d\8d\f6af3f7f9eea3553bc2fe6d53e4b287dad18b06a861ac56ddf
Successfully built retrying
Installing collected packages: retrying, plotly, catboost
Successfully installed catboost-0.24.2 plotly-4.12.0 retrying-1.3.3


In [24]:
from catboost import CatBoostClassifier,Pool
train_dataset=Pool(data=X_train, label=y_train)
eval_dataset=Pool(data=X_test, label=y_test)
cat_cl=CatBoostClassifier()
cat_cl.fit(train_dataset,use_best_model=True, eval_set=eval_dataset)

Learning rate set to 0.070178
0:	learn: 0.6718801	test: 0.6719765	best: 0.6719765 (0)	total: 1.87s	remaining: 31m 13s
1:	learn: 0.6545883	test: 0.6544205	best: 0.6544205 (1)	total: 2.2s	remaining: 18m 16s
2:	learn: 0.6390135	test: 0.6388336	best: 0.6388336 (2)	total: 2.48s	remaining: 13m 45s
3:	learn: 0.6258050	test: 0.6259080	best: 0.6259080 (3)	total: 2.76s	remaining: 11m 27s
4:	learn: 0.6137331	test: 0.6138247	best: 0.6138247 (4)	total: 3.04s	remaining: 10m 4s
5:	learn: 0.6038948	test: 0.6043384	best: 0.6043384 (5)	total: 3.32s	remaining: 9m 9s
6:	learn: 0.5949607	test: 0.5953702	best: 0.5953702 (6)	total: 3.6s	remaining: 8m 30s
7:	learn: 0.5869366	test: 0.5873902	best: 0.5873902 (7)	total: 3.87s	remaining: 7m 59s
8:	learn: 0.5799687	test: 0.5805147	best: 0.5805147 (8)	total: 4.15s	remaining: 7m 36s
9:	learn: 0.5743894	test: 0.5751129	best: 0.5751129 (9)	total: 4.42s	remaining: 7m 18s
10:	learn: 0.5686241	test: 0.5694298	best: 0.5694298 (10)	total: 4.7s	remaining: 7m 2s
11:	learn: 0

93:	learn: 0.4095653	test: 0.4215495	best: 0.4215495 (93)	total: 27.4s	remaining: 4m 23s
94:	learn: 0.4088448	test: 0.4208086	best: 0.4208086 (94)	total: 27.6s	remaining: 4m 23s
95:	learn: 0.4080315	test: 0.4200745	best: 0.4200745 (95)	total: 27.9s	remaining: 4m 22s
96:	learn: 0.4071796	test: 0.4194291	best: 0.4194291 (96)	total: 28.2s	remaining: 4m 22s
97:	learn: 0.4064440	test: 0.4189138	best: 0.4189138 (97)	total: 28.5s	remaining: 4m 21s
98:	learn: 0.4056144	test: 0.4181019	best: 0.4181019 (98)	total: 28.7s	remaining: 4m 21s
99:	learn: 0.4048329	test: 0.4175547	best: 0.4175547 (99)	total: 29s	remaining: 4m 21s
100:	learn: 0.4040056	test: 0.4169212	best: 0.4169212 (100)	total: 29.3s	remaining: 4m 20s
101:	learn: 0.4033179	test: 0.4163268	best: 0.4163268 (101)	total: 29.6s	remaining: 4m 20s
102:	learn: 0.4025516	test: 0.4156199	best: 0.4156199 (102)	total: 29.8s	remaining: 4m 19s
103:	learn: 0.4017691	test: 0.4149699	best: 0.4149699 (103)	total: 30.1s	remaining: 4m 19s
104:	learn: 0.4

184:	learn: 0.3520300	test: 0.3754534	best: 0.3754534 (184)	total: 52.5s	remaining: 3m 51s
185:	learn: 0.3512284	test: 0.3747232	best: 0.3747232 (185)	total: 52.8s	remaining: 3m 51s
186:	learn: 0.3507121	test: 0.3743153	best: 0.3743153 (186)	total: 53.1s	remaining: 3m 50s
187:	learn: 0.3501479	test: 0.3738905	best: 0.3738905 (187)	total: 53.3s	remaining: 3m 50s
188:	learn: 0.3496423	test: 0.3735016	best: 0.3735016 (188)	total: 53.6s	remaining: 3m 49s
189:	learn: 0.3491650	test: 0.3732324	best: 0.3732324 (189)	total: 53.9s	remaining: 3m 49s
190:	learn: 0.3485741	test: 0.3729456	best: 0.3729456 (190)	total: 54.1s	remaining: 3m 49s
191:	learn: 0.3480630	test: 0.3726984	best: 0.3726984 (191)	total: 54.4s	remaining: 3m 48s
192:	learn: 0.3475337	test: 0.3723546	best: 0.3723546 (192)	total: 54.7s	remaining: 3m 48s
193:	learn: 0.3470372	test: 0.3720996	best: 0.3720996 (193)	total: 54.9s	remaining: 3m 48s
194:	learn: 0.3464348	test: 0.3715696	best: 0.3715696 (194)	total: 55.2s	remaining: 3m 47s

274:	learn: 0.3120868	test: 0.3504871	best: 0.3504871 (274)	total: 1m 16s	remaining: 3m 22s
275:	learn: 0.3116944	test: 0.3502949	best: 0.3502949 (275)	total: 1m 17s	remaining: 3m 22s
276:	learn: 0.3113340	test: 0.3501089	best: 0.3501089 (276)	total: 1m 17s	remaining: 3m 22s
277:	learn: 0.3109819	test: 0.3499565	best: 0.3499565 (277)	total: 1m 17s	remaining: 3m 22s
278:	learn: 0.3106688	test: 0.3497213	best: 0.3497213 (278)	total: 1m 18s	remaining: 3m 21s
279:	learn: 0.3103177	test: 0.3494752	best: 0.3494752 (279)	total: 1m 18s	remaining: 3m 21s
280:	learn: 0.3099794	test: 0.3493378	best: 0.3493378 (280)	total: 1m 18s	remaining: 3m 21s
281:	learn: 0.3096341	test: 0.3491815	best: 0.3491815 (281)	total: 1m 18s	remaining: 3m 20s
282:	learn: 0.3093083	test: 0.3489946	best: 0.3489946 (282)	total: 1m 19s	remaining: 3m 20s
283:	learn: 0.3089520	test: 0.3488782	best: 0.3488782 (283)	total: 1m 19s	remaining: 3m 20s
284:	learn: 0.3085958	test: 0.3486973	best: 0.3486973 (284)	total: 1m 19s	remain

364:	learn: 0.2845532	test: 0.3364761	best: 0.3364761 (364)	total: 1m 41s	remaining: 2m 56s
365:	learn: 0.2842918	test: 0.3363605	best: 0.3363605 (365)	total: 1m 41s	remaining: 2m 56s
366:	learn: 0.2839707	test: 0.3362050	best: 0.3362050 (366)	total: 1m 41s	remaining: 2m 55s
367:	learn: 0.2836866	test: 0.3360665	best: 0.3360665 (367)	total: 1m 42s	remaining: 2m 55s
368:	learn: 0.2834547	test: 0.3359638	best: 0.3359638 (368)	total: 1m 42s	remaining: 2m 55s
369:	learn: 0.2832011	test: 0.3359141	best: 0.3359141 (369)	total: 1m 42s	remaining: 2m 54s
370:	learn: 0.2829219	test: 0.3358396	best: 0.3358396 (370)	total: 1m 42s	remaining: 2m 54s
371:	learn: 0.2826692	test: 0.3357335	best: 0.3357335 (371)	total: 1m 43s	remaining: 2m 54s
372:	learn: 0.2824397	test: 0.3356055	best: 0.3356055 (372)	total: 1m 43s	remaining: 2m 53s
373:	learn: 0.2821228	test: 0.3354235	best: 0.3354235 (373)	total: 1m 43s	remaining: 2m 53s
374:	learn: 0.2818453	test: 0.3352963	best: 0.3352963 (374)	total: 1m 44s	remain

454:	learn: 0.2631164	test: 0.3278050	best: 0.3278050 (454)	total: 2m 5s	remaining: 2m 29s
455:	learn: 0.2629098	test: 0.3277407	best: 0.3277407 (455)	total: 2m 5s	remaining: 2m 29s
456:	learn: 0.2627592	test: 0.3276864	best: 0.3276864 (456)	total: 2m 5s	remaining: 2m 29s
457:	learn: 0.2625205	test: 0.3276088	best: 0.3276088 (457)	total: 2m 6s	remaining: 2m 29s
458:	learn: 0.2622937	test: 0.3275120	best: 0.3275120 (458)	total: 2m 6s	remaining: 2m 28s
459:	learn: 0.2621165	test: 0.3274717	best: 0.3274717 (459)	total: 2m 6s	remaining: 2m 28s
460:	learn: 0.2619884	test: 0.3274486	best: 0.3274486 (460)	total: 2m 6s	remaining: 2m 28s
461:	learn: 0.2617786	test: 0.3273583	best: 0.3273583 (461)	total: 2m 7s	remaining: 2m 27s
462:	learn: 0.2615369	test: 0.3272292	best: 0.3272292 (462)	total: 2m 7s	remaining: 2m 27s
463:	learn: 0.2612752	test: 0.3271274	best: 0.3271274 (463)	total: 2m 7s	remaining: 2m 27s
464:	learn: 0.2610687	test: 0.3270467	best: 0.3270467 (464)	total: 2m 7s	remaining: 2m 27s

544:	learn: 0.2457409	test: 0.3216719	best: 0.3216719 (544)	total: 2m 29s	remaining: 2m 4s
545:	learn: 0.2455824	test: 0.3215783	best: 0.3215783 (545)	total: 2m 29s	remaining: 2m 4s
546:	learn: 0.2453401	test: 0.3214780	best: 0.3214780 (546)	total: 2m 29s	remaining: 2m 3s
547:	learn: 0.2451391	test: 0.3214198	best: 0.3214198 (547)	total: 2m 29s	remaining: 2m 3s
548:	learn: 0.2449474	test: 0.3213542	best: 0.3213542 (548)	total: 2m 30s	remaining: 2m 3s
549:	learn: 0.2448667	test: 0.3213273	best: 0.3213273 (549)	total: 2m 30s	remaining: 2m 3s
550:	learn: 0.2446603	test: 0.3212858	best: 0.3212858 (550)	total: 2m 30s	remaining: 2m 2s
551:	learn: 0.2444474	test: 0.3212421	best: 0.3212421 (551)	total: 2m 30s	remaining: 2m 2s
552:	learn: 0.2442205	test: 0.3211573	best: 0.3211573 (552)	total: 2m 31s	remaining: 2m 2s
553:	learn: 0.2440313	test: 0.3211017	best: 0.3211017 (553)	total: 2m 31s	remaining: 2m 1s
554:	learn: 0.2438232	test: 0.3210416	best: 0.3210416 (554)	total: 2m 31s	remaining: 2m 1s

634:	learn: 0.2308915	test: 0.3169496	best: 0.3169386 (633)	total: 2m 54s	remaining: 1m 40s
635:	learn: 0.2307451	test: 0.3168512	best: 0.3168512 (635)	total: 2m 54s	remaining: 1m 39s
636:	learn: 0.2305835	test: 0.3168324	best: 0.3168324 (636)	total: 2m 54s	remaining: 1m 39s
637:	learn: 0.2304302	test: 0.3167655	best: 0.3167655 (637)	total: 2m 54s	remaining: 1m 39s
638:	learn: 0.2303639	test: 0.3167670	best: 0.3167655 (637)	total: 2m 55s	remaining: 1m 38s
639:	learn: 0.2301913	test: 0.3167794	best: 0.3167655 (637)	total: 2m 55s	remaining: 1m 38s
640:	learn: 0.2300154	test: 0.3167259	best: 0.3167259 (640)	total: 2m 55s	remaining: 1m 38s
641:	learn: 0.2298350	test: 0.3167378	best: 0.3167259 (640)	total: 2m 55s	remaining: 1m 38s
642:	learn: 0.2297486	test: 0.3167376	best: 0.3167259 (640)	total: 2m 56s	remaining: 1m 37s
643:	learn: 0.2295689	test: 0.3166471	best: 0.3166471 (643)	total: 2m 56s	remaining: 1m 37s
644:	learn: 0.2293267	test: 0.3165894	best: 0.3165894 (644)	total: 2m 56s	remain

724:	learn: 0.2179747	test: 0.3128449	best: 0.3128449 (724)	total: 3m 17s	remaining: 1m 15s
725:	learn: 0.2177884	test: 0.3127308	best: 0.3127308 (725)	total: 3m 18s	remaining: 1m 14s
726:	learn: 0.2177330	test: 0.3127181	best: 0.3127181 (726)	total: 3m 18s	remaining: 1m 14s
727:	learn: 0.2176916	test: 0.3127015	best: 0.3127015 (727)	total: 3m 18s	remaining: 1m 14s
728:	learn: 0.2175677	test: 0.3126936	best: 0.3126936 (728)	total: 3m 18s	remaining: 1m 13s
729:	learn: 0.2174040	test: 0.3126594	best: 0.3126594 (729)	total: 3m 19s	remaining: 1m 13s
730:	learn: 0.2172677	test: 0.3126625	best: 0.3126594 (729)	total: 3m 19s	remaining: 1m 13s
731:	learn: 0.2172339	test: 0.3126664	best: 0.3126594 (729)	total: 3m 19s	remaining: 1m 13s
732:	learn: 0.2170729	test: 0.3125992	best: 0.3125992 (732)	total: 3m 19s	remaining: 1m 12s
733:	learn: 0.2169357	test: 0.3125434	best: 0.3125434 (733)	total: 3m 20s	remaining: 1m 12s
734:	learn: 0.2168717	test: 0.3125213	best: 0.3125213 (734)	total: 3m 20s	remain

815:	learn: 0.2071742	test: 0.3098974	best: 0.3098974 (815)	total: 3m 41s	remaining: 50s
816:	learn: 0.2071265	test: 0.3098865	best: 0.3098865 (816)	total: 3m 42s	remaining: 49.7s
817:	learn: 0.2069640	test: 0.3098262	best: 0.3098262 (817)	total: 3m 42s	remaining: 49.5s
818:	learn: 0.2068290	test: 0.3098191	best: 0.3098191 (818)	total: 3m 42s	remaining: 49.2s
819:	learn: 0.2066878	test: 0.3097794	best: 0.3097794 (819)	total: 3m 42s	remaining: 48.9s
820:	learn: 0.2065516	test: 0.3097480	best: 0.3097480 (820)	total: 3m 43s	remaining: 48.7s
821:	learn: 0.2064619	test: 0.3097043	best: 0.3097043 (821)	total: 3m 43s	remaining: 48.4s
822:	learn: 0.2063067	test: 0.3096728	best: 0.3096728 (822)	total: 3m 43s	remaining: 48.1s
823:	learn: 0.2062186	test: 0.3096525	best: 0.3096525 (823)	total: 3m 43s	remaining: 47.8s
824:	learn: 0.2060929	test: 0.3096162	best: 0.3096162 (824)	total: 3m 44s	remaining: 47.6s
825:	learn: 0.2059597	test: 0.3096036	best: 0.3096036 (825)	total: 3m 44s	remaining: 47.3s
8

906:	learn: 0.1968860	test: 0.3068117	best: 0.3068074 (905)	total: 4m 5s	remaining: 25.2s
907:	learn: 0.1967394	test: 0.3067398	best: 0.3067398 (907)	total: 4m 6s	remaining: 24.9s
908:	learn: 0.1966167	test: 0.3067259	best: 0.3067259 (908)	total: 4m 6s	remaining: 24.7s
909:	learn: 0.1964408	test: 0.3067039	best: 0.3067039 (909)	total: 4m 6s	remaining: 24.4s
910:	learn: 0.1963346	test: 0.3066684	best: 0.3066684 (910)	total: 4m 6s	remaining: 24.1s
911:	learn: 0.1962996	test: 0.3066656	best: 0.3066656 (911)	total: 4m 7s	remaining: 23.8s
912:	learn: 0.1961738	test: 0.3066013	best: 0.3066013 (912)	total: 4m 7s	remaining: 23.6s
913:	learn: 0.1961121	test: 0.3065745	best: 0.3065745 (913)	total: 4m 7s	remaining: 23.3s
914:	learn: 0.1959262	test: 0.3065340	best: 0.3065340 (914)	total: 4m 7s	remaining: 23s
915:	learn: 0.1958351	test: 0.3065353	best: 0.3065340 (914)	total: 4m 8s	remaining: 22.8s
916:	learn: 0.1957321	test: 0.3064731	best: 0.3064731 (916)	total: 4m 8s	remaining: 22.5s
917:	learn: 

997:	learn: 0.1875474	test: 0.3044329	best: 0.3044329 (997)	total: 4m 29s	remaining: 540ms
998:	learn: 0.1874564	test: 0.3044083	best: 0.3044083 (998)	total: 4m 29s	remaining: 270ms
999:	learn: 0.1873317	test: 0.3043707	best: 0.3043707 (999)	total: 4m 30s	remaining: 0us

bestTest = 0.3043707278
bestIteration = 999



<catboost.core.CatBoostClassifier at 0x29092904208>

In [25]:
print('CatBoost Training Accuracy w/o CV: %.3f' % cat_cl.score(X_train, y_train))
print('Random Forest Model Test Accuracy w/o CV: %.3f' % cat_cl.score(X_test, y_test))

CatBoost Training Accuracy w/o CV: 0.961
Random Forest Model Test Accuracy w/o CV: 0.874
