# Instacart - Shortest Path to Submission
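## 1. Problem Definition
- Each user's orders are recorded in chronological order
- When a user orders an item they have ordered before, the "reordered" flag is set on that item
- For each user's most recent order, predict the set of items that were reordered
    - The prediction should contain as many of the items actually reordered as possible, and as few items that were not reordered as possible
    - If you predict that nothing was reordered, output "None"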
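
The trade-off in the last two bullets can be made concrete with a small scoring function. This is a minimal sketch, assuming each order is scored by the F1 between the predicted and the actual reordered items (check the competition's Evaluation page for the exact definition). The name `f1_for_order` and the convention of representing an empty basket as `{"None"}` are illustrative assumptions, not part of the original notebook.

```python
def f1_for_order(predicted, actual):
    """F1 between two sets of product ids (as strings).

    Assumption: an order with no reorders is represented as {"None"},
    so "None" is scored like any other item.
    """
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the user actually reordered two items.
actual = {"196", "12427"}

# Predicting extra items keeps recall at 1 but lowers precision.
print(f1_for_order({"196", "12427", "10258", "25133"}, actual))
# Predicting exactly the reordered items is the ideal case.
print(f1_for_order({"196", "12427"}, actual))
```

Under this assumption, "predict everything ever bought" is penalized through precision, which is why the baseline in section 3 is only a starting point.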

(Reference) Relations between the tables

https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/33205

## 2. Look at the Data
Omitted. Skim the official Data page or a Kernel to get a rough feel for the data.
- https://www.kaggle.com/c/instacart-market-basket-analysis/data

## 3. First Submission
As a first step, build a submission that predicts, for each user, that every item they bought in the past will be reordered.

In [1]:
import pandas as pd
import numpy as np
import os
import time
from contextlib import contextmanager

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

@contextmanager
def timer(title):
    t0 = time.time()
    yield
    print("{} - done in {:.0f}s".format(title, time.time() - t0))

# Convert each CSV to feather on first use so that later loads are fast
def load(path):
    if not os.path.exists(path+'.f'):
        pd.read_csv(path).to_feather(path+'.f')
    return pd.read_feather(path+'.f')

with timer('load data'):
    aisles      = load('../input/aisles.csv')
    departments = load('../input/departments.csv')
    prior       = load('../input/order_products__prior.csv')
    train       = load('../input/order_products__train.csv')
    orders      = load('../input/orders.csv')
    products    = load('../input/products.csv')

load data - done in 1s


In [2]:
# Combine order_id / user_id / product_id into a single table
with timer('merge & drop duplicates'):
    prior_orders = pd.merge(prior, orders[['order_id','user_id']], on='order_id', how='left')
    print(prior_orders.shape)
    prior_orders.drop_duplicates(subset=['user_id','product_id'], inplace=True)
    print(prior_orders.shape)   
    
prior_orders.head()

(32434489, 5)
(13307953, 5)
merge & drop duplicates - done in 11s


| |order_id|product_id|add_to_cart_order|reordered|user_id|
|--|--|--|--|--|--|
|0|2|33120|1|1|202279|
|1|2|28985|2|1|202279|
|2|2|9327|3|0|202279|
|3|2|45918|4|1|202279|
|4|2|30035|5|0|202279|


In [3]:
# For each user, collect the items they bought in the past
with timer('aggregate prior products'):
    prior_orders['product_id_str'] = prior_orders['product_id'].astype(str)
    prior_products = prior_orders.groupby('user_id')['product_id_str'].apply(lambda x: ' '.join(x)).reset_index()
    
prior_products.head()

aggregate prior products - done in 16s


| |user_id|product_id_str|
|--|--|--|
|0|1|196 12427 10258 25133 10326 17122 41787 13176 ...|
|1|2|49451 32792 32139 34688 36735 37646 22829 2485...|
|2|3|38596 21903 248 40604 8021 17668 21137 23650 3...|
|3|4|22199 25146 1200 17769 43704 37646 11865 35469...|
|4|5|27344 24535 43693 40706 16168 21413 13988 3376...|


In [4]:
with timer('make 1st submission'):
    orders_in_test = orders[orders['eval_set'] == 'test']

    submission = pd.merge(orders_in_test[['user_id','order_id']], prior_products, on='user_id', how='left')
    submission.drop('user_id', axis=1, inplace=True)
    submission.columns = ['order_id', 'products']
    submission.to_csv('../output/submission_baseline.csv', index=False)

make 1st submission - done in 1s


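- Pandas stores data internally in a column-oriented layout, so it is faster to avoid operations that span columns and to process many rows in a single call
    - Do not convert the type of product_id inside the groupby; create an extra column with astype(str) beforehand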
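
The tip above can be illustrated on a toy frame. This is a minimal sketch with made-up values standing in for `prior_orders`; both patterns produce the same strings, and the second one is typically the faster of the two on large data because the conversion runs once over the whole column instead of once per value inside a Python callback.

```python
import pandas as pd

# Toy data standing in for prior_orders (values are made up).
toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 2],
    "product_id": [196, 12427, 49451, 32792, 196],
})

# Pattern to avoid: convert each value inside the per-group callback.
slow = toy.groupby("user_id")["product_id"].apply(
    lambda x: " ".join(str(v) for v in x)
)

# Preferred pattern: one vectorized astype(str) first, then a plain join.
toy["product_id_str"] = toy["product_id"].astype(str)
fast = toy.groupby("user_id")["product_id_str"].apply(" ".join)

print(fast)
```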

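## 4. Model Training

- Each user_id x item_id pair is one training row
    - Bought in prior & not bought in train: label 0
    - Bought in prior & bought in train: label 1
    
|eval_set|reordered|Meaning|
|--|--|--|
|prior|0,1|Bought in the past, not bought in train/test (label==0)|
|train/test|1|Bought in the past, reordered in train/test (label==1)|
|train/test|0|Not bought in the past, bought for the first time in train/test (excluded from training)|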
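
The labeling rule in the table can be checked on a toy example. This is a minimal sketch with made-up rows for a single train user; it relies on the rows already being in chronological order, which is also what the notebook assumes.

```python
import pandas as pd

# Toy rows for one train user, oldest first (values are made up).
rows = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 1],
    "product_id": [10, 20, 10, 30, 20],
    "eval_set":   ["prior", "prior", "prior", "train", "train"],
    "reordered":  [0, 0, 1, 0, 1],
})

# Keep only the most recent row for each (user, product) pair.
last = rows.drop_duplicates(subset=["user_id", "product_id"], keep="last")

# Drop first-time purchases in the train order; they are not reorder candidates.
pairs = last[(last["eval_set"] == "prior") | (last["reordered"] == 1)].copy()

# Label 1 when the pair's latest appearance is a reorder in the train order.
pairs["target"] = (pairs["eval_set"] == "train").astype("int8")
print(pairs[["user_id", "product_id", "target"]])
```

Here product 10 was last seen in a prior order (label 0), product 20 was reordered in the train order (label 1), and product 30 is a first-time purchase and is dropped.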

In [5]:
with timer('extract last order for each user-x-item'):
    # Combine all data into one table (named all_orders to avoid shadowing the built-in all)
    all_orders = pd.merge(orders, pd.concat([train,prior]), on='order_id', how='left')

    # Keep only the last row for each user x item pair
    last_order_by_user_x_item = all_orders.drop_duplicates(subset=['user_id','product_id'], keep='last')

extract last order for each user-x-item - done in 14s


In [None]:
import lightgbm as lgb

with timer('make data'):
    # Users are split into train and test; flag the train users
    train_users = orders.loc[orders['eval_set'] == 'train', ['user_id']].copy()
    train_users['is_train'] = 1

    # Rows whose last record is in "train" with reordered == 1 get label 1
    X = last_order_by_user_x_item.drop('order_id', axis=1)
    X = pd.merge(X, train_users, on='user_id', how='left')
    # Test users have no match in train_users; mark them 0 instead of NaN
    X['is_train'] = X['is_train'].fillna(0)

    X = X[(X['eval_set'] == 'prior') | (X['reordered'] == 1.0)].copy()
    X['target'] = (X['eval_set'] == 'train').astype(np.int8)
    
    # TODO: do feature engineering here

    print(X.head())
    droplist = ['reordered', 'eval_set', 'user_id', 'product_id', 'is_train']
    
    X_train = X[X['is_train'] == 1.0].drop(droplist, axis=1)
    X_test  = X[X['is_train'] == 0.0].drop(droplist, axis=1)

    dtrain = lgb.Dataset(X_train.drop('target', axis=1), X_train['target'])
    X_test = X_test.drop('target', axis=1)

   order_number  order_dow  order_hour_of_day  days_since_prior_order  \
0             1          2                  8                     NaN   
1             3          3                 12                    21.0   
2             5          4                 15                    28.0   
3             5          4                 15                    28.0   
4             5          4                 15                    28.0   

   add_to_cart_order  is_train  target  
0                2.0       1.0       0  
1                5.0       1.0       0  
2                5.0       1.0       0  
3                6.0       1.0       0  
4                7.0       1.0       0  
make data - done in 7s


In [None]:
with timer('lgbm training'):
    lgb_param = {
        'objective' : 'binary',
        'metric' : 'auc',
        'num_leaves' : 15,
        'seed' : 0,
        'learning_rate' : 0.1
    }

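    # Note: early_stopping_rounds / verbose_eval are arguments of older LightGBM releases;
    # newer releases pass early stopping and logging through callbacks instead
    result = lgb.cv(lgb_param, dtrain, num_boost_round=10000, early_stopping_rounds=100, verbose_eval=50)
    # With early stopping, the history is cut at the best round, so its length is the best round
    best_round = len(result['auc-mean'])
    
    print('best auc: {}, at round {}'.format(result['auc-mean'][-1], best_round))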
    
    booster = lgb.train(lgb_param, dtrain, num_boost_round=int(best_round*1.1))

[50]	cv_agg's auc: 0.741813 + 0.000356137
[100]	cv_agg's auc: 0.743675 + 0.000320226
[150]	cv_agg's auc: 0.744749 + 0.000346864
[200]	cv_agg's auc: 0.745684 + 0.000328303
[250]	cv_agg's auc: 0.746543 + 0.000274331
[300]	cv_agg's auc: 0.747296 + 0.000295992
[350]	cv_agg's auc: 0.748084 + 0.000320604
[400]	cv_agg's auc: 0.74881 + 0.000342262
[450]	cv_agg's auc: 0.749511 + 0.000322739
[500]	cv_agg's auc: 0.750225 + 0.000321
[550]	cv_agg's auc: 0.750822 + 0.000330886
[600]	cv_agg's auc: 0.751453 + 0.000328097
[650]	cv_agg's auc: 0.752118 + 0.00038093
[700]	cv_agg's auc: 0.752746 + 0.000402061
[750]	cv_agg's auc: 0.753343 + 0.000416048
[800]	cv_agg's auc: 0.753904 + 0.000433217
[850]	cv_agg's auc: 0.754446 + 0.000404878
[900]	cv_agg's auc: 0.754989 + 0.00037285


In [None]:
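y_test = booster.predict(X_test)

# X_test no longer has id columns; recover them from X through the shared index
predicted = X.loc[X_test.index, ['user_id', 'product_id']].copy()
predicted['pred'] = y_test
# The test order_id still has to be joined from orders by user_id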
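
One way the unfinished cell above might be completed is sketched below on toy data. It assumes each scored pair has already been joined to its test `order_id`; the frame `scored`, the list `test_order_ids`, and the `THRESHOLD` value are all hypothetical placeholders, not results from this notebook.

```python
import pandas as pd

# Toy stand-ins (made-up values): one score per candidate (order, product) pair.
scored = pd.DataFrame({
    "order_id":   [17, 17, 17, 34],
    "product_id": [196, 12427, 10258, 49451],
    "pred":       [0.81, 0.12, 0.55, 0.07],
})
test_order_ids = [17, 34]
THRESHOLD = 0.5  # placeholder, not a tuned value

# Keep pairs at or above the threshold and join the product ids per order.
kept = scored[scored["pred"] >= THRESHOLD].copy()
kept["product_id_str"] = kept["product_id"].astype(str)
products = kept.groupby("order_id")["product_id_str"].apply(" ".join)

# Orders with no surviving item must still appear, with the literal "None".
submission = (
    products.reindex(test_order_ids)
    .fillna("None")
    .rename("products")
    .rename_axis("order_id")
    .reset_index()
)
print(submission)
```

The `reindex` step matters: grouping alone silently drops orders for which every candidate fell below the threshold, and those orders would then be missing from the submission file.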

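## TODO:

- The LightGBM predictions are used as-is and are not optimized for the competition metric (the same situation as in Bosch). Take the predicted probabilities and optimize the threshold
    - The metric/objective were also chosen without much thought
- Feature engineering
- Shifting the prior/train boundary back one order at a time should make it possible to augment the training data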
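
The threshold optimization in the first item could look roughly like the sketch below. It assumes a held-out frame with one row per (order, product) pair, a predicted probability `pred`, and a 0/1 `target`, and it assumes the metric is the mean per-order F1 with "None" treated as a pseudo-item. The function name `mean_f1_at`, the toy frame `valid`, and the search grid are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def mean_f1_at(threshold, df):
    """Mean per-order F1 when predicting every pair with pred >= threshold.

    Assumption: an order with nothing predicted (or nothing reordered)
    is treated as containing the single pseudo-item "None".
    """
    scores = []
    for _, g in df.groupby("order_id"):
        pred = set(g.loc[g["pred"] >= threshold, "product_id"]) or {"None"}
        true = set(g.loc[g["target"] == 1, "product_id"]) or {"None"}
        hits = len(pred & true)
        # F1 = 2 * hits / (|pred| + |true|); both sets are non-empty here.
        scores.append(2 * hits / (len(pred) + len(true)))
    return float(np.mean(scores))


# Toy validation frame (made-up values).
valid = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2],
    "product_id": [10, 20, 30, 10, 40],
    "pred":       [0.9, 0.4, 0.1, 0.3, 0.2],
    "target":     [1, 1, 0, 0, 0],
})

# Simple grid search over a single global threshold.
grid = np.arange(0.05, 0.96, 0.05)
best = max(grid, key=lambda t: mean_f1_at(t, valid))
print("best threshold: {:.2f}, mean F1: {:.3f}".format(best, mean_f1_at(best, valid)))
```

A single global threshold is the simplest option; choosing a threshold per order is a possible refinement but is outside the scope of this sketch.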