#### Imports

In [1]:
import numpy as np
import pandas as pd
import gc
DATA_PATH = '../data/'

In [None]:
print('loading prior')
priors = pd.read_csv(DATA_PATH + 'order_products__prior.csv', dtype={
            'order_id': np.int32,
            'product_id': np.uint16,
            'add_to_cart_order': np.int16,
            'reordered': np.int8})

print('loading train')
train = pd.read_csv(DATA_PATH + 'order_products__train.csv', dtype={
            'order_id': np.int32,
            'product_id': np.uint16,
            'add_to_cart_order': np.int16,
            'reordered': np.int8})

print('loading orders')
orders = pd.read_csv(DATA_PATH + 'orders.csv', dtype={
        'order_id': np.int32,
        'user_id': np.int32,
        'eval_set': 'category',
        'order_number': np.int16,
        'order_dow': np.int8,
        'order_hour_of_day': np.int8,
        'days_since_prior_order': np.float32})

loading prior
loading train


In [2]:
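# Reshape to one row per order and one column per cart position.
# Note: DataFrame.to_sparse was removed in pandas 1.0, so this cell needs an older pandas.
priors = priors.groupby('order_id').apply(lambda x: pd.Series(x.product_id.values)).unstack().to_sparse(fill_value=0)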

In [31]:
# Same reshaping for the train set: one row per order, one column per cart position.
train = train.groupby('order_id').apply(lambda x: pd.Series(x.product_id.values)).unstack().to_sparse(fill_value=0)
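
The reshaping in the two cells above can be hard to picture, so here is a toy-sized sketch of the same long-to-wide step. It is not the code I ran: it uses `cumcount` and `pivot` instead of `groupby().apply()`, stays dense instead of sparse, and the `toy` frame and `slot` column are made up for illustration. It also assumes product IDs start at 1, so that 0 is free to use as padding.

```python
import pandas as pd

# Toy stand-in for order_products__prior.csv (values are made up).
toy = pd.DataFrame({
    'order_id':   [1, 1, 1, 2, 2],
    'product_id': [10, 20, 30, 40, 50],
})

# Position of each product within its order: 0, 1, 2, ...
toy['slot'] = toy.groupby('order_id').cumcount()

# One row per order, one column per slot.
# Orders shorter than the longest one are padded with 0 (assumed unused as an ID).
wide = (toy.pivot(index='order_id', columns='slot', values='product_id')
           .fillna(0)
           .astype('int64'))
print(wide)
```

On the full data, the number of columns equals the size of the largest order, which is where the 145 product slots mentioned below come from.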

#### Processing starts here

The idea is to prepare the data for input into an LSTM network. Usually, one would have one long time series and would use windowing to come up with input/output pairs, but here we have many shorter time series, one per user. We'll skip windowing and use each user's order sequence as its own input/output pair.

The first thing we'll need is a features array with dimensions `(sequence_length, n_users, n_features)`. Each time step combines the features provided for that order with a vector of the order's product IDs. We'll assume that the maximum prior order size is the maximum number of products we can predict, so `n_features` equals 148 (145 potential product slots plus 3 order features). We'll use the last order's products vector as the label, so the labels array has dimensions `(n_users, max_products_per_order)`.
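
To make the target shapes concrete, here is a minimal sketch that allocates both arrays and fills a single time step. The values of `n_users` and `sequence_length` are toy placeholders, the example IDs are made up, and the choice of the three order features (`order_dow`, `order_hour_of_day`, `days_since_prior_order`) and their position at the front of the feature vector are my assumptions, not something fixed above.

```python
import numpy as np

n_users = 4            # toy value, not the real user count
sequence_length = 5    # toy value: orders per user after padding
max_products = 145     # largest prior order, from the text
n_order_features = 3   # assumed: order_dow, order_hour_of_day, days_since_prior_order
n_features = max_products + n_order_features  # 148

features = np.zeros((sequence_length, n_users, n_features), dtype=np.float32)
labels = np.zeros((n_users, max_products), dtype=np.int32)

# Fill one time step for one user: order features first, then product IDs.
order_feats = np.array([2, 14, 7.0], dtype=np.float32)      # made-up values
product_ids = np.array([196, 14084, 12427], dtype=np.float32)  # made-up IDs
features[0, 0, :n_order_features] = order_feats
features[0, 0, n_order_features:n_order_features + len(product_ids)] = product_ids

print(features.shape, labels.shape)
```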

For those familiar with neural network sequence prediction, I know how unorthodox this is.  It's all very experimental!

Concatenate prior orders with training orders.  Note that some users who appear in `priors` do not appear in `train`.

In [40]:
df = pd.concat([priors,train], axis=0)

In [53]:
df.to_pickle('full_unpackedprodid')

In [54]:
del priors, train
gc.collect();

In [71]:
print('loading orders')
orders = pd.read_csv(DATA_PATH + 'orders.csv', dtype={
        'order_id': np.int32,
        'user_id': np.int32,
        'eval_set': 'category',
        'order_number': np.int16,
        'order_dow': np.int8,
        'order_hour_of_day': np.int8,
        'days_since_prior_order': np.float32})

loading orders


Separate training and test orders.

In [72]:
train_orders = orders[orders.eval_set!='test']
test_orders = orders[orders.eval_set=='test']
df = df.set_index("order_id")

In [73]:
del orders
gc.collect();

Join training orders with the ordered products dataframe.

In [74]:
train_final = train_orders.join(df, on="order_id", how="left")

In [2]:
#test_orders.to_pickle(DATA_PATH+"test.pickle")
#train_final.to_pickle(DATA_PATH+"train.pickle")
train_final = pd.read_pickle(DATA_PATH+'train.pickle')

In [3]:
train_final.reset_index(inplace=True)

Since this is a lot of data and we'll want to pad our sequences, we need to maintain reasonable data sizes. How many `user_id`s have `i` orders?

In [25]:
a = train_final.groupby(['user_id'])['order_number'].max()
for i in range(1,100+1):
    print(i,len(a[a==i])) # Note to self: make this into a plot.

1 0
2 0
3 8686
4 22451
5 18267
6 15334
7 13196
8 11075
9 9762
10 8660
11 7399
12 6787
13 5923
14 5464
15 4965
16 4487
17 3983
18 3692
19 3290
20 3108
21 2844
22 2683
23 2513
24 2208
25 2126
26 2049
27 1836
28 1695
29 1592
30 1495
31 1447
32 1338
33 1243
34 1174
35 1097
36 1030
37 1001
38 944
39 906
40 847
41 891
42 781
43 726
44 726
45 674
46 736
47 638
48 592
49 603
50 545
51 566
52 519
53 520
54 463
55 432
56 382
57 386
58 313
59 335
60 317
61 312
62 272
63 257
64 248
65 227
66 239
67 176
68 200
69 186
70 176
71 143
72 158
73 152
74 170
75 147
76 126
77 123
78 129
79 120
80 112
81 114
82 116
83 83
84 98
85 104
86 82
87 74
88 80
89 83
90 72
91 69
92 73
93 51
94 64
95 62
96 68
97 49
98 47
99 538
100 867


Looks like a long-tailed distribution.  Because every sequence gets padded to the same length, we need to cap the number of orders per `user_id`, but we don't want to throw away too many orders.  A well-chosen cap lets us operate on the data in memory while minimizing information loss.
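
One way to compare candidate caps is to compute the share of orders each one keeps. The sketch below is only an illustration: `orders_per_user` is a toy series standing in for `a`, `fraction_of_orders_kept` is a hypothetical helper, and it assumes each user's `order_number` runs from 1 up to that user's maximum without gaps.

```python
import pandas as pd

# Toy per-user order counts; the real series would be `a` from the cell above.
orders_per_user = pd.Series([3, 3, 4, 5, 5, 8, 20, 100])

def fraction_of_orders_kept(counts, cutoff):
    """Share of all orders that survive keeping order_number <= cutoff."""
    # A user with n orders keeps min(n, cutoff) of them.
    kept = counts.clip(upper=cutoff).sum()
    return kept / counts.sum()

for cutoff in (5, 10, 50):
    print(cutoff, round(fraction_of_orders_kept(orders_per_user, cutoff), 3))
```

The trade-off is that the padded array grows linearly with the cap, while the extra orders it retains come from a shrinking number of heavy users.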

In [4]:
# Number of orders lost by subsetting
train_final.shape[0]-train_final[train_final.order_number<=68].shape[0]

83082

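Padding the sequences.  Each user's order numbers are shifted so that the last order lands at position `total_per_user`; reindexing onto the full `(user_id, position)` grid then leaves all-NaN rows at the start of shorter sequences.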

In [5]:
total_per_user = 68

In [6]:
train_final = train_final[train_final.order_number<=total_per_user]
train_final = train_final.join(train_final.groupby(['user_id'])['order_number'].max(),on='user_id', rsuffix='_max')

In [7]:
train_final['order_number_margin'] = total_per_user-train_final['order_number_max']

In [8]:
train_final['new_order_number'] = train_final['order_number'] + train_final['order_number_margin']

In [9]:
train_final.set_index(['user_id','new_order_number'], inplace=True)

In [10]:
train_final.drop(['index','order_number_margin','order_number_max'],axis=1,inplace=True)

In [11]:
user_id_indices = train_final.index.get_level_values('user_id').unique().tolist()

In [12]:
new_multiindex_array = [(x,y) for x in user_id_indices for y in range(1,total_per_user+1)]

In [13]:
train_final = train_final.reindex(index=new_multiindex_array)
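
The padding cells above are easier to follow on a tiny example. This is a sketch of the same shift-then-reindex idea on a made-up frame; `toy`, `last`, and `full_index` are illustrative names, and it builds the grid with `pd.MultiIndex.from_product` rather than the list of tuples used above.

```python
import pandas as pd

total_per_user = 4  # toy cap; the notebook uses 68

toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 4, 1, 2],
    'order_dow':    [0, 3, 5, 1, 6, 2],
})

# Shift each user's orders so the last one sits at position total_per_user.
last = toy.groupby('user_id')['order_number'].transform('max')
toy['new_order_number'] = toy['order_number'] + (total_per_user - last)

# Reindex onto the full (user, position) grid; missing positions become NaN rows.
full_index = pd.MultiIndex.from_product(
    [toy['user_id'].unique(), range(1, total_per_user + 1)],
    names=['user_id', 'new_order_number'])
padded = toy.set_index(['user_id', 'new_order_number']).reindex(full_index)
print(padded)
```

User 2 has only two orders, so they end up at positions 3 and 4, with positions 1 and 2 left as NaN padding at the front of the sequence.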

In [14]:
del user_id_indices, new_multiindex_array
gc.collect();

In [15]:
train_final.to_pickle(DATA_PATH+'padded_train.pickle')

To be continued...