There are two ideas for how to set up validation:
- StratifiedKFold
- TimeSeriesSplit

In [1]:
import feather
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn.model_selection import GroupKFold, TimeSeriesSplit, StratifiedKFold

In [2]:
train = feather.read_dataframe('../data/interim/train.ftr')
test = feather.read_dataframe('../data/interim/test.ftr')
train.shape, test.shape

train['totals.transactionRevenue'] = np.log1p(train['totals.transactionRevenue'].fillna(0).astype('float').values)
train['date'] = pd.to_datetime(train['date'], format='%Y%m%d')
test['date'] = pd.to_datetime(test['date'], format='%Y%m%d')

### StratifiedKFold

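- The visitors in train fall into two groups: those who made a purchase and those who did not.
- Revenue-generating customers make up only about 1.4% of all visitors (9,996 of 714,167 unique fullVisitorId values).
- We want to stratify on 1 (true customer) vs. 0 (non-customer).

- If test rows selected via bounces are also used as training data:
- Apply GroupKFold to the fullVisitorId values in that dataset and merge the result with the StratifiedKFold folds.
- This gets quite complicated, so it is better to split in advance and pass only the resulting indices.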

In [15]:
grp_result = train.groupby("fullVisitorId")['totals.transactionRevenue'].sum().reset_index()
len(grp_result)

714167

In [16]:
n_customers = (grp_result['totals.transactionRevenue'] != 0).sum()
print('Number of customer in train set:', n_customers,
      'out of rows:', len(grp_result),
      'and ratio is:', n_customers / len(grp_result))

Number of customer in train set: 9996 out of rows: 714167 and ratio is: 0.013996726255903731


In [17]:
# List of true customers (visitors with non-zero total revenue)
customer_list_in_train = grp_result[grp_result['totals.transactionRevenue']!=0]['fullVisitorId'].tolist()

# Flag each visitor as a true customer or not (1 = true customer)
group = pd.DataFrame()
group['fullVisitorId'] = train['fullVisitorId'].unique()
group['customer_flg'] = 0
customer_index_in_group = group.query('fullVisitorId in @customer_list_in_train').index
group.loc[customer_index_in_group, 'customer_flg'] = 1

(group.customer_flg==1).sum()

9996

In [6]:
group_skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=71)

for train_index, valid_index in group_skf.split(group.fullVisitorId, group.customer_flg):
    print('---')
    print('train_index:', len(train_index), 'valid_index:', len(valid_index))
    print('number of customer in train-set:', len(group.iloc[train_index].query('fullVisitorId in @customer_list_in_train')))
    print('number of customer in valid-set:', len(group.iloc[valid_index].query('fullVisitorId in @customer_list_in_train')))
print('---')

---
train_index: 571332 valid_index: 142835
number of customer in train-set: 7996
number of customer in valid-set: 2000
---
train_index: 571334 valid_index: 142833
number of customer in train-set: 7997
number of customer in valid-set: 1999
---
train_index: 571334 valid_index: 142833
number of customer in train-set: 7997
number of customer in valid-set: 1999
---
train_index: 571334 valid_index: 142833
number of customer in train-set: 7997
number of customer in valid-set: 1999
---
train_index: 571334 valid_index: 142833
number of customer in train-set: 7997
number of customer in valid-set: 1999
---


In [177]:
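# Assumption: folds_index_list is not defined in the cells shown here. It is taken to be
# the precomputed StratifiedKFold splits, which is consistent with the identical output below.
folds_index_list = list(group_skf.split(group.fullVisitorId, group.customer_flg))
for train_index, valid_index in folds_index_list: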
    print('---')
    print('train_index:', len(train_index), 'valid_index:', len(valid_index))
    print('number of customer in train-set:', len(group.iloc[train_index].query('fullVisitorId in @customer_list_in_train')))
    print('number of customer in valid-set:', len(group.iloc[valid_index].query('fullVisitorId in @customer_list_in_train')))
print('---')

---
train_index: 571332 valid_index: 142835
number of customer in train-set: 7996
number of customer in valid-set: 2000
---
train_index: 571334 valid_index: 142833
number of customer in train-set: 7997
number of customer in valid-set: 1999
---
train_index: 571334 valid_index: 142833
number of customer in train-set: 7997
number of customer in valid-set: 1999
---
train_index: 571334 valid_index: 142833
number of customer in train-set: 7997
number of customer in valid-set: 1999
---
train_index: 571334 valid_index: 142833
number of customer in train-set: 7997
number of customer in valid-set: 1999
---
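
The folds above index the visitor-level `group` table, not the session rows of `train`. The following is a minimal sketch, on toy data, of how precomputed visitor-level folds could be turned into row-level indices that are then passed to a training loop. The `sessions` and `visitors` frames, the `revenue` column, and `row_folds` are hypothetical names for illustration, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy session-level data standing in for `train` (hypothetical values).
sessions = pd.DataFrame({
    'fullVisitorId': ['a', 'a', 'b', 'c', 'c', 'd', 'e', 'f', 'g', 'h'],
    'revenue':       [0.0, 3.2, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0],
})

# One row per visitor, flagged 1 if the visitor ever generated revenue.
visitors = (
    (sessions.groupby('fullVisitorId')['revenue'].sum() > 0)
    .astype(int)
    .rename('customer_flg')
    .reset_index()
)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=71)

# Split at visitor level, then map each fold back to session-row positions.
row_folds = []
for _, valid_pos in skf.split(visitors['fullVisitorId'], visitors['customer_flg']):
    valid_ids = visitors['fullVisitorId'].iloc[valid_pos]
    is_valid = sessions['fullVisitorId'].isin(valid_ids).to_numpy()
    row_folds.append((np.flatnonzero(~is_valid), np.flatnonzero(is_valid)))

for train_rows, valid_rows in row_folds:
    print('train rows:', len(train_rows), 'valid rows:', len(valid_rows))
```

Because every session of a visitor follows that visitor into the same fold, no fullVisitorId appears on both sides of a split.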


- If test rows selected via bounces end up being used as training data

In [171]:
train_data_v2 = test[test['totals.bounces'].notnull()].copy()
train_data_v2.shape, len(train_data_v2['fullVisitorId'].unique())

((420948, 53), 357841)

In [161]:
grp_kfold = GroupKFold(n_splits=5)

for train_index, valid_index in grp_kfold.split(X=train_data_v2, y=train_data_v2['date'], groups=train_data_v2['fullVisitorId']):
    print('---')
    print('train_index:', len(train_index), 'valid_index:', len(valid_index))
    print('number of id in train-set:', len(train_data_v2.iloc[train_index]['fullVisitorId'].unique()))
    print('number of id in valid-set:', len(train_data_v2.iloc[valid_index]['fullVisitorId'].unique()))
    print('number of id in total:', len(train_data_v2.iloc[train_index]['fullVisitorId'].unique())+len(train_data_v2.iloc[valid_index]['fullVisitorId'].unique()))
print('---')

---
train_index: 336758 valid_index: 84190
number of id in train-set: 286274
number of id in valid-set: 71567
number of id in total: 357841
---
train_index: 336758 valid_index: 84190
number of id in train-set: 286272
number of id in valid-set: 71569
number of id in total: 357841
---
train_index: 336758 valid_index: 84190
number of id in train-set: 286272
number of id in valid-set: 71569
number of id in total: 357841
---
train_index: 336759 valid_index: 84189
number of id in train-set: 286273
number of id in valid-set: 71568
number of id in total: 357841
---
train_index: 336759 valid_index: 84189
number of id in train-set: 286273
number of id in valid-set: 71568
number of id in total: 357841
---
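
One way the GroupKFold folds could be merged with the StratifiedKFold folds is to pair them fold by fold and shift the second dataset's positions by the length of the first. The sketch below assumes both sets of folds are row-level positional indices and that the two datasets are concatenated in the order A, then B. `merge_folds` and the toy arrays are hypothetical; it also does not handle a visitor who appears in both datasets.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold


def merge_folds(folds_a, folds_b, offset):
    """Pair folds from two datasets; shift dataset B's positions by `offset`."""
    merged = []
    for (tr_a, va_a), (tr_b, va_b) in zip(folds_a, folds_b):
        merged.append((
            np.concatenate([tr_a, tr_b + offset]),
            np.concatenate([va_a, va_b + offset]),
        ))
    return merged


# Dataset A: 8 rows with a binary label, split with stratification.
y_a = np.array([0, 0, 0, 0, 0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=71)
folds_a = list(skf.split(np.zeros((len(y_a), 1)), y_a))

# Dataset B: 6 rows grouped by visitor id, split by group.
groups_b = np.array(['x', 'x', 'y', 'z', 'z', 'w'])
gkf = GroupKFold(n_splits=2)
folds_b = list(gkf.split(np.zeros((len(groups_b), 1)), groups=groups_b))

# Positions in `merged` refer to the concatenation of A followed by B.
merged = merge_folds(folds_a, folds_b, offset=len(y_a))
for train_rows, valid_rows in merged:
    print('train rows:', len(train_rows), 'valid rows:', len(valid_rows))
```

Both splitters must use the same `n_splits`, otherwise `zip` silently drops the extra folds.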


### TimeSeriesSplit

- Since this is time-series data, split it chronologically.

In [5]:
train.sort_values('date', ascending=True, inplace=True)
train.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device.browser,device.browserSize,...,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.campaign,trafficSource.campaignCode,trafficSource.isTrueDirect,trafficSource.keyword,trafficSource.medium,trafficSource.referralPath,trafficSource.source
538448,Direct,2016-08-01,1492602573213666603,1492602573213666603_1470044332,Not Socially Engaged,1470044332,1,1470044332,Chrome,not available in demo dataset,...,,,,(not set),,True,,(none),,(direct)
538277,Direct,2016-08-01,7394165545362887055,7394165545362887055_1470044425,Not Socially Engaged,1470044425,3,1470044425,Chrome,not available in demo dataset,...,,,,(not set),,True,,(none),,(direct)
538278,Referral,2016-08-01,6107229716178617930,6107229716178617930_1470094529,Not Socially Engaged,1470094529,1,1470094529,Chrome,not available in demo dataset,...,,,,(not set),,,,referral,/,mall.googleplex.com
538279,Direct,2016-08-01,9459384188253198762,9459384188253198762_1470079413,Not Socially Engaged,1470079413,1,1470079413,Chrome,not available in demo dataset,...,,,,(not set),,True,,(none),,(direct)
538280,Direct,2016-08-01,4052177266351383392,4052177266351383392_1470111093,Not Socially Engaged,1470111093,1,1470111093,Safari,not available in demo dataset,...,,,,(not set),,True,,(none),,(direct)


In [6]:
tscv = TimeSeriesSplit(n_splits=5)

for train_index, valid_index in tscv.split(train):
    print('---')
    print('train_index:', len(train_index), 'valid_index:', len(valid_index))
    print('min date in train-set:', min(train.iloc[train_index]['date']))
    print('max date in train-set:', max(train.iloc[train_index]['date']))
    print('min date in valid-set:', min(train.iloc[valid_index]['date']))
    print('max date in valid-set:', max(train.iloc[valid_index]['date']))
print('---')

---
train_index: 150613 valid_index: 150608
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2016-10-03 00:00:00
min date in valid-set: 2016-10-03 00:00:00
max date in valid-set: 2016-11-16 00:00:00
---
train_index: 301221 valid_index: 150608
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2016-11-16 00:00:00
min date in valid-set: 2016-11-16 00:00:00
max date in valid-set: 2017-01-09 00:00:00
---
train_index: 451829 valid_index: 150608
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2017-01-09 00:00:00
min date in valid-set: 2017-01-09 00:00:00
max date in valid-set: 2017-03-18 00:00:00
---
train_index: 602437 valid_index: 150608
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2017-03-18 00:00:00
min date in valid-set: 2017-03-18 00:00:00
max date in valid-set: 2017-05-25 00:00:00
---
train_index: 753045 valid_index: 150608
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2017-05-25 00:00:00
min da
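
In the output above, the max date of each train set equals the min date of its valid set. TimeSeriesSplit cuts by row position, so sessions from the boundary day land on both sides of the split. If that matters, one option is to split on the unique dates and map back to rows. The sketch below uses toy data; `df`, `unique_dates`, and `date_folds` are hypothetical names, and this is not what the notebook currently does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy frame: 3 sessions per day over 12 days (hypothetical data).
days = pd.date_range('2016-08-01', periods=12, freq='D')
df = pd.DataFrame({'date': np.repeat(days, 3)})

# Sorted unique dates; the splitter only needs to know how many there are.
unique_dates = pd.DatetimeIndex(df['date'].unique()).sort_values()
tscv = TimeSeriesSplit(n_splits=3)

date_folds = []
for train_d, valid_d in tscv.split(np.arange(len(unique_dates))):
    # Map the selected dates back to row positions in df.
    train_rows = np.flatnonzero(df['date'].isin(unique_dates[train_d]).to_numpy())
    valid_rows = np.flatnonzero(df['date'].isin(unique_dates[valid_d]).to_numpy())
    date_folds.append((train_rows, valid_rows))

for train_rows, valid_rows in date_folds:
    print('max train date:', df['date'].iloc[train_rows].max(),
          'min valid date:', df['date'].iloc[valid_rows].min())
```

The trade-off is that folds are balanced by number of days rather than number of rows, so fold sizes vary with daily traffic.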

- If the test rows selected via bounces are added to the data

In [175]:
total = pd.concat([train[['fullVisitorId', 'date']], train_data_v2[['fullVisitorId', 'date']]], axis=0)
total.sort_values('date', ascending=True, inplace=True)

tscv = TimeSeriesSplit(n_splits=5)
for train_index, valid_index in tscv.split(total):
    print('---')
    print('train_index:', len(train_index), 'valid_index:', len(valid_index))
    print('min date in train-set:', min(total.iloc[train_index]['date']))
    print('max date in train-set:', max(total.iloc[train_index]['date']))
    print('min date in valid-set:', min(total.iloc[valid_index]['date']))
    print('max date in valid-set:', max(total.iloc[valid_index]['date']))
print('---')

---
train_index: 220771 valid_index: 220766
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2016-10-25 00:00:00
min date in valid-set: 2016-10-25 00:00:00
max date in valid-set: 2017-01-03 00:00:00
---
train_index: 441537 valid_index: 220766
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2017-01-03 00:00:00
min date in valid-set: 2017-01-03 00:00:00
max date in valid-set: 2017-04-13 00:00:00
---
train_index: 662303 valid_index: 220766
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2017-04-13 00:00:00
min date in valid-set: 2017-04-13 00:00:00
max date in valid-set: 2017-07-24 00:00:00
---
train_index: 883069 valid_index: 220766
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2017-07-24 00:00:00
min date in valid-set: 2017-07-24 00:00:00
max date in valid-set: 2017-12-02 00:00:00
---
train_index: 1103835 valid_index: 220766
min date in train-set: 2016-08-01 00:00:00
max date in train-set: 2017-12-02 00:00:00
min d

In [35]:
train_date = pd.to_datetime(train['date'], format='%Y%m%d')

In [42]:
train_date

0        2016-09-02
1        2016-09-02
2        2016-09-02
3        2016-09-02
4        2016-09-02
5        2016-09-02
6        2016-09-02
7        2016-09-02
8        2016-09-02
9        2016-09-02
10       2016-09-02
11       2016-09-02
12       2016-09-02
13       2016-09-02
14       2016-09-02
15       2016-09-02
16       2016-09-02
17       2016-09-02
18       2016-09-02
19       2016-09-02
20       2016-09-02
21       2016-09-02
22       2016-09-02
23       2016-09-02
24       2016-09-02
25       2016-09-02
26       2016-09-02
27       2016-09-02
28       2016-09-02
29       2016-09-02
            ...    
903623   2017-01-04
903624   2017-01-04
903625   2017-01-04
903626   2017-01-04
903627   2017-01-04
903628   2017-01-04
903629   2017-01-04
903630   2017-01-04
903631   2017-01-04
903632   2017-01-04
903633   2017-01-04
903634   2017-01-04
903635   2017-01-04
903636   2017-01-04
903637   2017-01-04
903638   2017-01-04
903639   2017-01-04
903640   2017-01-04
903641   2017-01-04
