## Chapter 2

### 2.1 Split the dataset

Note that for the exercise in this repo we will not use the test dataset provided by Kaggle. That dataset carries no information on whether the recommendations are good or bad (naturally, since scoring happens on Kaggle's side), and we are not taking part in any competition. Therefore, we will carry on as if the training dataset were all we have, and we will split it into training, validation and test sets.

As I mentioned in `Chapter 0`, there is a temporal component to this problem. As stated on the Kaggle site: *"You are provided with a year of transactional data for 22,873 users on the site ponpare.jp. The training set spans the dates 2011-07-01 to 2012-06-23. The test set spans the week after the end of the training set, 2012-06-24 to 2012-06-30. The goal of the competition is to recommend a ranked list of coupons for each user in the dataset (found in user_list.csv). Your predictions are scored against the actual coupon purchases, made during the test set week, of the 310 possible test set coupons."*

Therefore, we will split the dataset following this setup, that is: we will use the last week of data as the test dataset, the previous week as validation, and the rest of the data as training data. However, the code below is written so that one can choose the length of the testing period in days (referred to as `tp`); the validation period has the same length.

Before we move into the code, let's just clarify a few additional aspects and notation:

1. **Interactions**: I refer to coupon visits (views) and purchases as "interactions".


2. **Coupons**: in the test set provided by Kaggle, testing coupons are selected based on the `dispfrom` column.

    In addition, Kaggle provides coupon features for the testing coupons. Therefore I will assume that next week's coupons are received far enough in advance that we can use *ALL* coupons (train, validation and test coupons) to engineer the features and split the dataset afterwards. Programmatically, this scenario is easier since no new features (resulting from, for example, one-hot encoding) will appear during testing. Given the number of coupons in this example, this represents no limitation at all.

    A second possible scenario is one where we might need to recommend coupons as they come. In this case, we would only have time to do some fast processing (label encoding perhaps) and, since a new coupon has never been seen before, we would have to recommend based on some similarity metric between the new and old coupons (which would also need to be computed fast).


3. **Users**: for the exercise here we will concentrate only on users that have been seen during training. In the real world you need to recommend to everyone, existing and new users (i.e. you have to face the so-called cold start problem, for users in this case).

    For users that have not interacted at all in the past we know nothing, unless they fill in a form or register on your site, in which case we might have some demographic information. On top of that, in the example used here, remember that no one has interacted with the upcoming coupons. Altogether, it is easy to see why it is hard to recommend well to these users.

    An easy solution would be recommending new coupons that resemble the most popular coupons in the past. If, in addition, we have some demographic information (age, gender, location) we might want to constrain the definition of *"similar coupons"* to groups of similar demographics. In other words, recommend new coupons that are similar to the most popular coupons amongst users of similar age, gender, etc.
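
The popularity-based fallback described above could be sketched roughly as follows. This is a minimal illustration on toy data, not the approach used later in this repo: the coupon ids, the choice of feature columns, and the helper `rank_new_coupons` are all hypothetical, and cosine similarity on standardised features is just one possible similarity metric.

```python
import numpy as np
import pandas as pd

# Toy data: all ids and feature values below are made up for illustration.
old_coupons = pd.DataFrame({
    'coupon_id_hash': ['old_a', 'old_b', 'old_c'],
    'price_rate': [50, 80, 52],
    'discount_price': [1500, 9000, 1400],
})
new_coupons = pd.DataFrame({
    'coupon_id_hash': ['new_x', 'new_y'],
    'price_rate': [51, 82],
    'discount_price': [1450, 9500],
})
purchases = pd.DataFrame({
    'coupon_id_hash': ['old_a', 'old_a', 'old_c', 'old_a', 'old_c', 'old_b'],
})


def rank_new_coupons(old_coupons, new_coupons, purchases, feat_cols, top_k=2):
    """Rank new coupons by their best cosine similarity to the top_k
    most purchased old coupons (hypothetical helper)."""
    # most popular old coupons by purchase count
    popular_ids = purchases['coupon_id_hash'].value_counts().index[:top_k]
    popular = old_coupons[old_coupons['coupon_id_hash'].isin(popular_ids)]

    # standardise features with statistics from the old coupons only
    mu = old_coupons[feat_cols].mean()
    sd = old_coupons[feat_cols].std().replace(0, 1)
    pop_feats = ((popular[feat_cols] - mu) / sd).to_numpy(dtype=float)
    new_feats = ((new_coupons[feat_cols] - mu) / sd).to_numpy(dtype=float)

    # cosine similarity between every new coupon and every popular coupon
    # (assumes no all-zero feature rows after standardisation)
    pop_unit = pop_feats / np.linalg.norm(pop_feats, axis=1, keepdims=True)
    new_unit = new_feats / np.linalg.norm(new_feats, axis=1, keepdims=True)
    sim = new_unit @ pop_unit.T

    # score each new coupon by its closest popular coupon
    scores = pd.Series(sim.max(axis=1), index=new_coupons['coupon_id_hash'])
    return scores.sort_values(ascending=False)


ranking = rank_new_coupons(old_coupons, new_coupons, purchases,
                           ['price_rate', 'discount_price'])
print(ranking)
```

To restrict the notion of "popular" to a demographic group, one could filter `purchases` to users of similar age, gender or location before calling the helper.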

Let's start by defining some useful variables.

In [2]:
import numpy as np
import pandas as pd
import os

inp_dir = "../datasets/Ponpare/data_translated"
out_dir = "../datasets/Ponpare/data_processed"
# Testing period (in days)
tp = 7

In [3]:
# the master list of users in the dataset
df_users = pd.read_csv(os.path.join(inp_dir,"user_list.csv"))
df_users['reg_date'] =  pd.to_datetime(df_users.reg_date, infer_datetime_format=True)

df_users.head()

Unnamed: 0,reg_date,sex_id,age,withdraw_date,pref_name,user_id_hash
0,2012-03-28 14:14:18,f,25,,,d9dca3cb44bab12ba313eaa681f663eb
1,2011-05-18 00:41:48,f,34,,tokyo,560574a339f1b25e57b0221e486907ed
2,2011-06-13 16:36:58,m,41,,aichi,e66ae91b978b3229f8fd858c80615b73
3,2012-02-08 12:56:15,m,25,,,43fc18f32eafb05713ec02935e2c2825
4,2011-05-22 23:43:56,m,62,,kanagawa,dc6df8aa860f8db0d710ce9d4839840f


In [4]:
# master list of coupons which are considered part of the training set
df_coupons = pd.read_csv(os.path.join(inp_dir,"coupon_list_train.csv"))
df_coupons['dispfrom'] = pd.to_datetime(df_coupons.dispfrom, infer_datetime_format=True)
df_coupons['dispend'] = pd.to_datetime(df_coupons.dispend, infer_datetime_format=True)
df_coupons['validfrom'] = pd.to_datetime(df_coupons.validfrom, infer_datetime_format=True)
df_coupons['validend'] = pd.to_datetime(df_coupons.validend, infer_datetime_format=True)

df_coupons.head()

Unnamed: 0,capsule_text,genre_name,price_rate,catalog_price,discount_price,dispfrom,dispend,dispperiod,validfrom,validend,validperiod,usable_date_mon,usable_date_tue,usable_date_wed,usable_date_thu,usable_date_fri,usable_date_sat,usable_date_sun,usable_date_holiday,usable_date_before_holiday,large_area_name,ken_name,small_area_name,coupon_id_hash
0,Food,Food,50,3000,1500,2011-07-08 12:00:00,2011-07-09 12:00:00,1,2011-07-10,2011-12-08,151.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,kanto,saitama,saitama,6b263844241eea98c5a97f1335ea82af
1,Food,Food,51,2080,1000,2011-07-01 12:00:00,2011-07-02 12:00:00,1,2011-07-03,2011-12-04,154.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,kanto,chiba,chiba,cc031f250e8bad1e24060263b9fc0ddd
2,Food,Food,50,7000,3500,2011-07-12 12:00:00,2011-07-15 12:00:00,3,2011-07-16,2012-01-11,179.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,kanto,chiba,chiba,ba5e9b7453ca52ff711635a5d2e8102d
3,Food,Food,50,3000,1500,2011-07-09 12:00:00,2011-07-11 12:00:00,2,2011-07-12,2011-12-01,142.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,kanto,chiba,chiba,3e1ffbedca3569f9e8032d401e8cb4e6
4,Food,Food,50,2000,1000,2011-07-05 12:00:00,2011-07-06 12:00:00,1,2011-07-07,2011-12-30,176.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,kanto,chiba,chiba,782934b6c815b4030ea204eef7d4a734


In [5]:
# the viewing log of users browsing coupons during training
df_visits = pd.read_csv(os.path.join(inp_dir,"coupon_visit_train.csv"))
df_visits['i_date'] = pd.to_datetime(df_visits.i_date, infer_datetime_format=True)

df_visits.head()

Unnamed: 0,purchase_flg,i_date,page_serial,referrer_hash,view_coupon_id_hash,user_id_hash,session_id_hash,purchaseid_hash
0,0,2012-03-28 14:15:00,7,7d3892e54acb559ae36c459978489330,34c48f84026e08355dc3bd19b427f09a,d9dca3cb44bab12ba313eaa681f663eb,673af822615593249e7c6a9a1a6bbb1a,
1,0,2012-03-28 14:17:28,9,7d3892e54acb559ae36c459978489330,34c48f84026e08355dc3bd19b427f09a,d9dca3cb44bab12ba313eaa681f663eb,673af822615593249e7c6a9a1a6bbb1a,
2,0,2012-03-28 14:20:05,16,7d3892e54acb559ae36c459978489330,17c450c3b470c045d35ec22b02daa690,d9dca3cb44bab12ba313eaa681f663eb,673af822615593249e7c6a9a1a6bbb1a,
3,0,2012-03-28 14:23:16,18,7d3892e54acb559ae36c459978489330,91a15e6a95d09e5e01b50747833b317d,d9dca3cb44bab12ba313eaa681f663eb,673af822615593249e7c6a9a1a6bbb1a,
4,0,2012-03-28 14:26:25,20,7d3892e54acb559ae36c459978489330,96fcbc8f6e45d5a2de1661eb140c6e82,d9dca3cb44bab12ba313eaa681f663eb,673af822615593249e7c6a9a1a6bbb1a,


In [6]:
# the purchase log of users buying coupons during training
df_purchases = pd.read_csv(os.path.join(inp_dir,"coupon_detail_train.csv"))
df_purchases['i_date'] = pd.to_datetime(df_purchases.i_date, infer_datetime_format=True)

df_purchases.head()

Unnamed: 0,item_count,i_date,small_area_name,purchaseid_hash,user_id_hash,coupon_id_hash
0,1,2012-03-28 15:06:06,hyogo,c820a8882374a4e472f0984a8825893f,d9dca3cb44bab12ba313eaa681f663eb,34c48f84026e08355dc3bd19b427f09a
1,1,2011-07-04 23:52:54,tokyo_ginza,1b4eb2435421ede98c8931c42e8220ec,560574a339f1b25e57b0221e486907ed,767673b7a777854a92b73b0934ddfae7
2,1,2011-07-16 00:52:49,tokyo_ebisu,36b5f9ba46c44b65587d0b16f2e4c77f,560574a339f1b25e57b0221e486907ed,4f3b5b91d9831192557c056022fdc1f2
3,1,2011-07-16 00:54:53,tokyo_ebisu,2f30f46937cc9004774e576914b2aa1a,560574a339f1b25e57b0221e486907ed,4f3b5b91d9831192557c056022fdc1f2
4,1,2011-07-16 00:55:52,tokyo_ebisu,4d000c64a55ac573d0ae1a8f03677f50,560574a339f1b25e57b0221e486907ed,4f3b5b91d9831192557c056022fdc1f2


Let's find the most recent date during training using interactions.

In [7]:
df_interactions_l = [df_visits, df_purchases]
# most recent interaction timestamp across visits and purchases
present = max(df['i_date'].max() for df in df_interactions_l)

print(present)

2012-06-23 23:59:48


Let's compute a "days to present" column that will be used to split the datasets

In [8]:
# whole days between each interaction and the most recent one
df_visits['days_to_present'] = (present - df_visits['i_date']).dt.days
df_purchases['days_to_present'] = (present - df_purchases['i_date']).dt.days

As mentioned before, for the exercise here we will only use customers that were seen during training.

In [9]:
# same quantity for users (from `reg_date`) and coupons (from `dispfrom`)
df_users['days_to_present'] = (present - df_users['reg_date']).dt.days
df_coupons['days_to_present'] = (present - df_coupons['dispfrom']).dt.days
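
To make the "days to present" arithmetic concrete, here is a small worked example on made-up timestamps (only the value of `present` comes from the output above). `.dt.days` keeps just the whole-day part of each difference, so any interaction from 2012-06-17 00:00:00 onwards is fewer than 7 whole days away from `present`.

```python
import pandas as pd

present = pd.Timestamp('2012-06-23 23:59:48')  # value printed above
toy = pd.DataFrame({'i_date': pd.to_datetime([
    '2012-06-23 10:00:00',  # same day as `present`
    '2012-06-17 00:00:00',  # 6 days 23:59:48 before `present`
    '2012-06-16 23:59:48',  # exactly 7 days before `present`
    '2011-07-01 00:00:00',  # start of the training log
])})

# whole days elapsed between each interaction and `present`
toy['days_to_present'] = (present - toy['i_date']).dt.days
print(toy)
```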

We will explore a series of scenarios that I will discuss when we get there. For now and for convenience, let's split all datasets.

In [10]:
# Flag each row: 0 = test (last tp days), 1 = validation (previous tp days), 2 = train
df_visits['days_to_present_flag'] = df_visits.days_to_present.apply(
    lambda x: 0 if x<=tp-1 else 1 if ((x>tp-1) and (x<=(tp*2)-1)) else 2)
df_purchases['days_to_present_flag'] = df_purchases.days_to_present.apply(
    lambda x: 0 if x<=tp-1 else 1 if ((x>tp-1) and (x<=(tp*2)-1)) else 2)
df_users['days_to_present_flag'] = df_users.days_to_present.apply(
    lambda x: 0 if x<=tp-1 else 1 if ((x>tp-1) and (x<=(tp*2)-1)) else 2)
df_coupons['days_to_present_flag'] = df_coupons.days_to_present.apply(
    lambda x: 0 if x<=tp-1 else 1 if ((x>tp-1) and (x<=(tp*2)-1)) else 2)
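
For reference, the same three-way flag can be computed without `apply`. The sketch below uses `np.select` on toy values; `flag_period` is a hypothetical helper name, not something defined elsewhere in this repo.

```python
import numpy as np
import pandas as pd

tp = 7  # testing period in days, as above


def flag_period(days, tp):
    """0 -> test (last tp days), 1 -> validation (previous tp days), 2 -> train."""
    days = pd.Series(days)
    conditions = [days <= tp - 1, days <= 2 * tp - 1]
    # np.select picks the first condition that holds, else the default
    return pd.Series(np.select(conditions, [0, 1], default=2), index=days.index)


days = pd.Series([0, 6, 7, 13, 14, 200])
flags = flag_period(days, tp)
print(flags.tolist())
```

With `tp = 7`, days 0 to 6 fall in the test week, days 7 to 13 in the validation week, and everything older goes to training.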

Split and save

In [11]:
df_d = {'df_visits': df_visits, 'df_purchases': df_purchases,
        'df_users': df_users, 'df_coupons': df_coupons}
# flag values: 2 = train, 1 = valid, 0 = test
flag_d = {'train': 2, 'valid': 1, 'test': 0}
for name, df in df_d.items():
    print('INFO: splitting {}'.format(name.split('_')[1]))
    for split, flag in flag_d.items():
        # drop/reset_index return new objects, which avoids the
        # SettingWithCopyWarning raised by in-place edits on a slice
        tmp = (df[df['days_to_present_flag'] == flag]
               .drop('days_to_present_flag', axis=1)
               .reset_index(drop=True))
        # make sure the output directory exists before writing
        os.makedirs(os.path.join(out_dir, split), exist_ok=True)
        tmp.to_pickle(os.path.join(out_dir, split, name + '_' + split + '.p'))

INFO: splitting visits
INFO: splitting purchases
INFO: splitting users
INFO: splitting coupons


The following files should now be in `../datasets/Ponpare/data_processed/train/`:

```
df_coupons_train.p
df_purchases_train.p
df_visits_train.p
df_users_train.p
```

The same applies to the `test` and `valid` directories, with the corresponding `_test.p` and `_valid.p` suffixes.
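
A quick way to check a split is to read the pickles back and confirm that the three periods do not overlap in time. The sketch below does this on a toy frame written to a temporary directory, so the data and the temporary paths are illustrative assumptions; with the real files one would point `pd.read_pickle` at `out_dir` instead.

```python
import os
import tempfile

import pandas as pd

# Toy interactions: dates and flags are made up for illustration.
toy = pd.DataFrame({
    'i_date': pd.to_datetime(['2012-06-01', '2012-06-12', '2012-06-20']),
    'days_to_present_flag': [2, 1, 0],  # 2 = train, 1 = valid, 0 = test
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # write one pickle per split, mirroring the directory layout used above
    for split, flag in {'train': 2, 'valid': 1, 'test': 0}.items():
        os.makedirs(os.path.join(tmp_dir, split), exist_ok=True)
        part = toy[toy['days_to_present_flag'] == flag].reset_index(drop=True)
        part.to_pickle(os.path.join(tmp_dir, split, 'df_visits_' + split + '.p'))

    # read them back while the temporary directory still exists
    loaded = {
        split: pd.read_pickle(os.path.join(tmp_dir, split, 'df_visits_' + split + '.p'))
        for split in ['train', 'valid', 'test']
    }

# the periods should be strictly ordered in time
assert loaded['train']['i_date'].max() < loaded['valid']['i_date'].min()
assert loaded['valid']['i_date'].max() < loaded['test']['i_date'].min()
print({split: df.shape[0] for split, df in loaded.items()})
```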