# Text Similarity Analysis Model
## 1. XGBoost (eXtreme Gradient Boosting)
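- Ensemble techniques: 1) bagging, 2) boosting
    - Ensemble: combining several learned models to get better performance than any single one
    - 1) Bagging: train the models independently (in parallel) on bootstrap samples, then combine their predictions with equal weight. Example: a random forest, which averages (or majority-votes) the outputs of many decision trees
    - 2) Boosting: train the models sequentially; each new model gives more weight to the examples the previous models predicted incorrectly
    - \*Ensemble <-> single model: producing the result from just one model (e.g. a CNN or an RNN)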
    
    
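- XGBoost: uses "tree boosting"
    - Tree boosting: applies boosting to the same kind of decision-tree ensemble that a random forest builds by bagging. Instead of averaging the results of many independent trees, it puts more weight on the examples that were predicted incorrectly. Each new tree concentrates on those weighted errors and tries to correct them; the errors that remain in the updated result are then identified and the same step is repeated.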
    
    
- XGBoost: "tree boosting" + optimization by gradient descent + parallel processing while building each decision tree (reduces computation time)
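
The contrast between bagging and boosting described above can be sketched with one-split regression stumps on a toy curve. This is a minimal plain-Python illustration under simplifying assumptions (squared error, one feature), not how random forests or XGBoost are actually implemented; `fit_stump`, `bagging_fit`, `boosting_fit`, and `mse` are hypothetical helper names. For squared error the residuals are the negative gradient of the loss, so fitting each new stump to the residuals is a gradient descent step.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump: choose the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x is identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Bagging: train each stump independently on a bootstrap sample, then average."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

def boosting_fit(xs, ys, n_models=25, lr=0.5):
    """Boosting: train stumps sequentially, each one on the residuals left so far."""
    models = []
    preds = [0.0] * len(xs)
    for _ in range(n_models):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        models.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * m(x) for m in models)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]  # a curve that a single stump cannot fit

single = fit_stump(xs, ys)
bagged = bagging_fit(xs, ys)
boosted = boosting_fit(xs, ys)
print(mse(single, xs, ys), mse(bagged, xs, ys), mse(boosted, xs, ys))
```

The bagged stumps never see each other's mistakes, while each boosted stump is trained on what the previous ones left unexplained, which is why the boosted training error keeps shrinking as rounds are added.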

In [1]:
import pandas as pd
import numpy as np
import os
import json
from sklearn.model_selection import train_test_split

DATA_IN_PATH = './data_in/'
DATA_OUT_PATH = './data_out/'

TRAIN_Q1_DATA_FILE = 'train_q1.npy'
TRAIN_Q2_DATA_FILE = 'train_q2.npy'
TRAIN_LABEL_DATA_FILE = 'train_label.npy'

# np.load accepts a path and closes the file itself, so no open() handle is left dangling
train_q1_data = np.load(DATA_IN_PATH + TRAIN_Q1_DATA_FILE)
train_q2_data = np.load(DATA_IN_PATH + TRAIN_Q2_DATA_FILE)
train_labels = np.load(DATA_IN_PATH + TRAIN_LABEL_DATA_FILE)

# Stack along a new axis 1 so each row holds one (q1, q2) pair: (samples, 2, sequence length)
train_input = np.stack((train_q1_data, train_q2_data), axis=1)
print(train_input.shape)

(298526, 2, 31)


In [2]:
train_input

array([[[  58,    9,   15, ...,    0,    0,    0],
        [   4,    9,   15, ...,    0,    0,    0]],

       [[  38, 3912, 3674, ...,    0,    0,    0],
        [   2,   38,    3, ...,    0,    0,    0]],

       [[1729,  711,   10, ...,    0,    0,    0],
        [   2,    3,    1, ...,    0,    0,    0]],

       ...,

       [[   4,   21,    7, ...,    0,    0,    0],
        [   4,   11,  133, ...,    0,    0,    0]],

       [[   2,   21, 8595, ...,    0,    0,    0],
        [   2,   21, 8595, ...,    0,    0,    0]],

       [[   9,   15,  304, ...,    0,    0,    0],
        [   3,   19,  242, ...,    0,    0,    0]]])

In [4]:
train_q1_data.shape

(298526, 31)

In [5]:
train_q2_data.shape

(298526, 31)

In [7]:
np.stack([train_q1_data, train_q2_data], axis=0)

array([[[  58,    9,   15, ...,    0,    0,    0],
        [  38, 3912, 3674, ...,    0,    0,    0],
        [1729,  711,   10, ...,    0,    0,    0],
        ...,
        [   4,   21,    7, ...,    0,    0,    0],
        [   2,   21, 8595, ...,    0,    0,    0],
        [   9,   15,  304, ...,    0,    0,    0]],

       [[   4,    9,   15, ...,    0,    0,    0],
        [   2,   38,    3, ...,    0,    0,    0],
        [   2,    3,    1, ...,    0,    0,    0],
        ...,
        [   4,   11,  133, ...,    0,    0,    0],
        [   2,   21, 8595, ...,    0,    0,    0],
        [   3,   19,  242, ...,    0,    0,    0]]])

In [8]:
np.stack([train_q1_data, train_q2_data], axis=0).shape

(2, 298526, 31)
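
The two stacking orders compared in the cells above can be spelled out with plain nested lists. This is only a conceptual sketch of what the `axis` argument changes; `q1`, `q2`, and `shape` are illustrative names, and the tiny arrays stand in for the real data.

```python
# Three toy "questions" per side, each padded to length 4 (not the real dataset).
q1 = [[58, 9, 15, 0], [38, 3912, 3674, 0], [1729, 711, 10, 0]]
q2 = [[4, 9, 15, 0], [2, 38, 3, 0], [2, 3, 1, 0]]

# Counterpart of np.stack((q1, q2), axis=0): the new axis comes first,
# so the result is indexed as [which_question][sample][token].
stack_axis0 = [q1, q2]

# Counterpart of np.stack((q1, q2), axis=1): the new axis comes second,
# so the result is indexed as [sample][which_question][token].
stack_axis1 = [[a, b] for a, b in zip(q1, q2)]

def shape(nested):
    """Return the shape of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

print(shape(stack_axis0))  # (2, 3, 4)
print(shape(stack_axis1))  # (3, 2, 4)
```

With `axis=1` the sample index stays first, so each question pair travels together when the rows are later split into training and evaluation sets.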

In [9]:
train_labels

array([0, 0, 0, ..., 1, 1, 1])

In [10]:
train_input, eval_input, train_label, eval_label = \
      train_test_split(train_input, train_labels, 
                       test_size=0.2, random_state=4242)

### Model Setup & Training

In [15]:
import xgboost as xgb

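# Collapse each (q1, q2) pair into one feature vector by summing the two
# token-id sequences, then wrap the result in XGBoost's DMatrix format.
train_data = xgb.DMatrix(train_input.sum(axis=1), label=train_label)  # training set
eval_data = xgb.DMatrix(eval_input.sum(axis=1), label=eval_label)     # validation set

# Datasets evaluated after every boosting round; the last entry drives early stopping
data_list = [(train_data, 'train'), (eval_data, 'valid')]

params = {}  # parameters passed to the XGBoost model
params['objective'] = 'binary:logistic'  # binary classification, outputs probabilities
params['eval_metric'] = 'rmse'  # root mean squared error on the predicted probabilities

# Train for up to 1000 rounds; stop if 'valid' does not improve for 10 consecutive rounds
bst = xgb.train(params, train_data, num_boost_round=1000,
               evals=data_list, early_stopping_rounds=10)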

[0]	train-rmse:0.48380	valid-rmse:0.48427
[1]	train-rmse:0.47362	valid-rmse:0.47424
[2]	train-rmse:0.46694	valid-rmse:0.46797
[3]	train-rmse:0.46221	valid-rmse:0.46358
[4]	train-rmse:0.45808	valid-rmse:0.45970
[5]	train-rmse:0.45526	valid-rmse:0.45720
[6]	train-rmse:0.45300	valid-rmse:0.45512
[7]	train-rmse:0.45147	valid-rmse:0.45384
[8]	train-rmse:0.44963	valid-rmse:0.45204
[9]	train-rmse:0.44861	valid-rmse:0.45125
[10]	train-rmse:0.44717	valid-rmse:0.44997
[11]	train-rmse:0.44623	valid-rmse:0.44909
[12]	train-rmse:0.44552	valid-rmse:0.44856
[13]	train-rmse:0.44401	valid-rmse:0.44716
[14]	train-rmse:0.44254	valid-rmse:0.44581
[15]	train-rmse:0.44138	valid-rmse:0.44488
[16]	train-rmse:0.44095	valid-rmse:0.44454
[17]	train-rmse:0.44062	valid-rmse:0.44428
[18]	train-rmse:0.44017	valid-rmse:0.44396
[19]	train-rmse:0.43918	valid-rmse:0.44313
[20]	train-rmse:0.43886	valid-rmse:0.44287
[21]	train-rmse:0.43793	valid-rmse:0.44219
[22]	train-rmse:0.43761	valid-rmse:0.44192
[23]	train-rmse:0.437

[189]	train-rmse:0.40207	valid-rmse:0.42491
[190]	train-rmse:0.40198	valid-rmse:0.42490
[191]	train-rmse:0.40177	valid-rmse:0.42489
[192]	train-rmse:0.40159	valid-rmse:0.42483
[193]	train-rmse:0.40133	valid-rmse:0.42475
[194]	train-rmse:0.40116	valid-rmse:0.42474
[195]	train-rmse:0.40107	valid-rmse:0.42470
[196]	train-rmse:0.40096	valid-rmse:0.42468
[197]	train-rmse:0.40080	valid-rmse:0.42464
[198]	train-rmse:0.40065	valid-rmse:0.42458
[199]	train-rmse:0.40057	valid-rmse:0.42457
[200]	train-rmse:0.40036	valid-rmse:0.42452
[201]	train-rmse:0.40020	valid-rmse:0.42452
[202]	train-rmse:0.39992	valid-rmse:0.42445
[203]	train-rmse:0.39979	valid-rmse:0.42440
[204]	train-rmse:0.39953	valid-rmse:0.42430
[205]	train-rmse:0.39938	valid-rmse:0.42429
[206]	train-rmse:0.39924	valid-rmse:0.42425
[207]	train-rmse:0.39916	valid-rmse:0.42425
[208]	train-rmse:0.39900	valid-rmse:0.42427
[209]	train-rmse:0.39889	valid-rmse:0.42420
[210]	train-rmse:0.39876	valid-rmse:0.42416
[211]	train-rmse:0.39873	valid-r

[376]	train-rmse:0.37774	valid-rmse:0.41929
[377]	train-rmse:0.37766	valid-rmse:0.41927
[378]	train-rmse:0.37747	valid-rmse:0.41921
[379]	train-rmse:0.37733	valid-rmse:0.41921
[380]	train-rmse:0.37715	valid-rmse:0.41919
[381]	train-rmse:0.37696	valid-rmse:0.41919
[382]	train-rmse:0.37682	valid-rmse:0.41914
[383]	train-rmse:0.37672	valid-rmse:0.41912
[384]	train-rmse:0.37659	valid-rmse:0.41912
[385]	train-rmse:0.37649	valid-rmse:0.41911
[386]	train-rmse:0.37640	valid-rmse:0.41909
[387]	train-rmse:0.37632	valid-rmse:0.41907
[388]	train-rmse:0.37607	valid-rmse:0.41900
[389]	train-rmse:0.37595	valid-rmse:0.41900
[390]	train-rmse:0.37575	valid-rmse:0.41900
[391]	train-rmse:0.37556	valid-rmse:0.41897
[392]	train-rmse:0.37536	valid-rmse:0.41884
[393]	train-rmse:0.37525	valid-rmse:0.41881
[394]	train-rmse:0.37512	valid-rmse:0.41875
[395]	train-rmse:0.37495	valid-rmse:0.41873
[396]	train-rmse:0.37469	valid-rmse:0.41867
[397]	train-rmse:0.37467	valid-rmse:0.41865
[398]	train-rmse:0.37461	valid-r

[563]	train-rmse:0.35950	valid-rmse:0.41678
[564]	train-rmse:0.35939	valid-rmse:0.41676
[565]	train-rmse:0.35933	valid-rmse:0.41677
[566]	train-rmse:0.35929	valid-rmse:0.41677
[567]	train-rmse:0.35924	valid-rmse:0.41676
[568]	train-rmse:0.35907	valid-rmse:0.41675
[569]	train-rmse:0.35897	valid-rmse:0.41675
[570]	train-rmse:0.35890	valid-rmse:0.41674
[571]	train-rmse:0.35876	valid-rmse:0.41671
[572]	train-rmse:0.35868	valid-rmse:0.41670
[573]	train-rmse:0.35857	valid-rmse:0.41669
[574]	train-rmse:0.35854	valid-rmse:0.41669
[575]	train-rmse:0.35840	valid-rmse:0.41668
[576]	train-rmse:0.35838	valid-rmse:0.41667
[577]	train-rmse:0.35836	valid-rmse:0.41666
[578]	train-rmse:0.35831	valid-rmse:0.41666
[579]	train-rmse:0.35818	valid-rmse:0.41664
[580]	train-rmse:0.35812	valid-rmse:0.41663
[581]	train-rmse:0.35797	valid-rmse:0.41661
[582]	train-rmse:0.35788	valid-rmse:0.41662
[583]	train-rmse:0.35779	valid-rmse:0.41660
[584]	train-rmse:0.35774	valid-rmse:0.41659
[585]	train-rmse:0.35762	valid-r

[750]	train-rmse:0.34165	valid-rmse:0.41476
Collecting xgboost
  Downloading xgboost-1.6.2-py3-none-win_amd64.whl (125.4 MB)
     -------------------------------------- 125.4/125.4 MB 2.3 MB/s eta 0:00:00
Installing collected packages: xgboost
Successfully installed xgboost-1.6.2
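
The `early_stopping_rounds=10` argument in the training call stops boosting once the validation metric has gone 10 consecutive rounds without improving. The loop below is a minimal plain-Python sketch of that rule, not XGBoost's actual implementation; `early_stopping_round` is a hypothetical helper and the score list is made up.

```python
def early_stopping_round(valid_scores, patience):
    """Return (stop_round, best_round) for a metric where lower is better.

    Training stops at the first round where the best score is `patience`
    rounds old; if that never happens, it runs through the last round.
    """
    best_round, best_score = 0, float("inf")
    for rnd, score in enumerate(valid_scores):
        if score < best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return rnd, best_round
    return len(valid_scores) - 1, best_round

# Toy validation curve: improves for three rounds, then stalls.
scores = [0.48, 0.47, 0.46, 0.465, 0.47, 0.461, 0.462]
stop_round, best_round = early_stopping_round(scores, patience=3)
print(stop_round, best_round)  # 5 2
```

A larger patience tolerates longer plateaus at the cost of extra rounds, which is the trade-off behind choosing a value such as 10.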


In [16]:
train_input.sum(axis=1)

array([[   14,    82,   805, ...,     0,     0,     0],
       [    8,    22,    10, ...,     0,     0,     0],
       [    8,   154,    63, ...,     0,     0,     0],
       ...,
       [    4,     6, 84087, ...,     0,     0,     0],
       [    4,    22,   170, ...,     0,     0,     0],
       [    8,    18,    10, ...,     0,     0,     0]])

In [21]:
train_input.sum(axis=1).shape

(238820, 31)
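
Summing over axis 1 adds the two padded token-id sequences of each pair position by position, which is how the `(238820, 2, 31)` input becomes `(238820, 31)`. A plain-Python sketch with one shortened toy pair (the values are taken from the first row printed in this notebook, truncated to four positions):

```python
# One (q1, q2) pair of token ids, shortened to 4 positions for illustration.
pair = [[3, 1, 804, 0], [11, 81, 1, 0]]

# Counterpart of pair.sum(axis=0) for a single sample: add position by position.
summed = [a + b for a, b in zip(*pair)]
print(summed)  # [14, 82, 805, 0]

# The sum is symmetric, so swapping the two questions gives the same features.
swapped = [a + b for a, b in zip(*reversed(pair))]
print(swapped == summed)  # True
```

Because only the sums survive, the model no longer sees which token id came from which question; this keeps the feature very simple but discards information.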

In [17]:
train_input

array([[[    3,     1,   804, ...,     0,     0,     0],
        [   11,    81,     1, ...,     0,     0,     0]],

       [[    4,    13,     5, ...,     0,     0,     0],
        [    4,     9,     5, ...,     0,     0,     0]],

       [[    4,    77,    60, ...,     0,     0,     0],
        [    4,    77,     3, ...,     0,     0,     0]],

       ...,

       [[    2,     3, 17995, ...,     0,     0,     0],
        [    2,     3, 66092, ...,     0,     0,     0]],

       [[    2,    11,     1, ...,     0,     0,     0],
        [    2,    11,   169, ...,     0,     0,     0]],

       [[    4,     9,     5, ...,     0,     0,     0],
        [    4,     9,     5, ...,     0,     0,     0]]])

In [18]:
train_input.shape

(238820, 2, 31)

In [19]:
eval_input.shape

(59706, 2, 31)

In [20]:
train_label.shape

(238820,)

### Prediction

In [24]:
TEST_Q1_DATA_FILE = 'test_q1.npy'
TEST_Q2_DATA_FILE = 'test_q2.npy'
TEST_ID_DATA_FILE = 'test_id.npy'

# np.load accepts a path directly; allow_pickle=True is only needed if the saved arrays have object dtype
test_q1_data = np.load(DATA_IN_PATH + TEST_Q1_DATA_FILE, allow_pickle=True)
test_q2_data = np.load(DATA_IN_PATH + TEST_Q2_DATA_FILE, allow_pickle=True)
test_id_data = np.load(DATA_IN_PATH + TEST_ID_DATA_FILE, allow_pickle=True)

test_input = np.stack((test_q1_data, test_q2_data), axis=1)
test_data = xgb.DMatrix(test_input.sum(axis=1))
test_predict = bst.predict(test_data)

if not os.path.exists(DATA_OUT_PATH):
    os.makedirs(DATA_OUT_PATH)
    
output = pd.DataFrame({'test_id':test_id_data, 'is_duplicate':test_predict})
output.to_csv(DATA_OUT_PATH + 'simple_xgb.csv', index=False)

In [25]:
test_predict

array([0.42514443, 0.4822895 , 0.8435693 , ..., 0.25260812, 0.2535647 ,
       0.48707962], dtype=float32)

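Kaggle score: 0.57038 (the submission contains probabilities, not 0/1 labels, so this leaderboard figure is presumably a probability-based metric such as log loss rather than an accuracy)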
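
Because `test_predict` holds probabilities rather than hard 0/1 labels, a probability-based metric such as log loss is a natural way to check the model on a labeled held-out split. The sketch below computes log loss and thresholded accuracy in plain Python on made-up values; `log_loss` and `accuracy` are hypothetical helpers, and whether the Kaggle figure above is a log loss is an assumption, not something this notebook states.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)

labels = [0, 0, 1, 1]             # toy ground truth
probs = [0.43, 0.48, 0.84, 0.25]  # toy predicted probabilities
print(log_loss(labels, probs), accuracy(labels, probs))
```

Log loss penalizes confident wrong probabilities heavily, so two models with the same thresholded accuracy can still differ widely on it.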