## Application to transactional data

Let's apply this to transactional data.

### Dataset Description

We're going to be using a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The original data can be found here: http://archive.ics.uci.edu/ml/datasets/online+retail

Let's examine the dataset.

In [23]:
import pandas as pd


# Load data
df = pd.read_csv('Online Retail.csv')
print(df.head())

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

    InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/10 8:26       2.55     17850.0  United Kingdom  
1  12/1/10 8:26       3.39     17850.0  United Kingdom  
2  12/1/10 8:26       2.75     17850.0  United Kingdom  
3  12/1/10 8:26       3.39     17850.0  United Kingdom  
4  12/1/10 8:26       3.39     17850.0  United Kingdom  


### Defining Dataset provider

Now let's format the data for the Autoencoder to use.

In [4]:
import numpy as np
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder


# Filter excessive positive and negative quantities
mask_1 = df['Quantity'] > 0
mask_2 = df['Quantity'] < 30
df = df[np.logical_and(mask_1, mask_2)]

# Encode labels
le = LabelEncoder()
df['ItemID'] = le.fit_transform(df['StockCode'])

# Process datetime
df['Date'] = pd.to_datetime(df['InvoiceDate'])

# Split into train and test
mask = df['Date'] < pd.to_datetime('2011-11-01')
print(mask.sum() / len(mask))
df_train = df[mask]
df_test = df[np.logical_not(mask)]

# Aggregate to session and item level
ids = ['CustomerID', 'ItemID']
df_train_agg = df_train[ids + ['Quantity']].groupby(ids).sum()
df_train_agg = df_train_agg.reset_index().sort_values(ids)
df_test_agg = df_test[ids + ['Quantity']].groupby(ids).sum()
df_test_agg = df_test_agg.reset_index().sort_values(ids)

# Process columns from float to int
df_train_agg['CustomerID'] = df_train_agg['CustomerID'].astype(np.int)
df_test_agg['CustomerID'] = df_test_agg['CustomerID'].astype(np.int)

# Process quantity
df_train_agg['Quantity'] = 1 * (df_train_agg['Quantity'] >= 1)
df_test_agg['Quantity'] = 1 * (df_test_agg['Quantity'] >= 1)

# Remove users and items that aren't in training set
mask_3 = np.in1d(df_test_agg['CustomerID'], df_train_agg['CustomerID'])
mask_4 = np.in1d(df_test_agg['ItemID'], df_train_agg['ItemID'])
df_test_agg = df_test_agg[np.logical_and(mask_3, mask_4)]

# Write data
folders = ['retail/TRAIN', 'retail/TEST']
for folder in folders:
    os.makedirs(folder, exist_ok=True)

df_train_agg.to_csv('retail/TRAIN/train.txt', sep='\t',
                    header=False, index=False)
df_test_agg.to_csv('retail/TEST/test.txt', sep='\t',
                   header=False, index=False)
print('Done!')

   Session ID    Item ID  Quantity
0          11  214821376         2
1          12  214717872         4
2          21  214548736         1
3          21  214838496         1
4          33  214706448         2


In [18]:
!head retail/TRAIN/train.txt

12347	29	1
12347	130	1
12347	167	1
12347	207	1
12347	209	1
12347	281	1
12347	323	1
12347	327	1
12347	340	1
12347	392	1


In [19]:
!head retail/TEST/test.txt

12347	167	1
12347	340	1
12347	473	1
12347	770	1
12347	1939	1
12347	2113	1
12347	2323	1
12347	2332	1
12347	2334	1
12347	3052	1


### Experiment and results

To be edited.

In [13]:
!python run.py --gpu_ids 0 \
--path_to_train_data retail/TRAIN \
--path_to_eval_data retail/TEST \
--hidden_layers 512,512,1024 \
--non_linearity_type selu \
--batch_size 128 \
--logdir model_save \
--drop_prob 0.8 \
--optimizer momentum \
--lr 0.005 \
--weight_decay 0 \
--aug_step 1 \
--noise_prob 0 \
--num_epochs 50 \
--summary_frequency 500

Namespace(aug_step=1, batch_size=128, constrained=False, drop_prob=0.8, gpu_ids='0', hidden_layers='512,512,1024', logdir='model_save', lr=0.005, noise_prob=0.0, non_linearity_type='selu', num_epochs=50, optimizer='momentum', path_to_eval_data='retail/TEST', path_to_train_data='retail/TRAIN', skip_last_layer_nl=False, summary_frequency=500, weight_decay=0.0)
Loading training data
Data loaded
Total items found: 3879
Vector dim: 3596
Loading eval data
******************************
******************************
[3596, 512, 512, 1024]
Dropout drop probability: 0.8
Encoder pass:
torch.Size([512, 3596])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([1024, 512])
torch.Size([1024])
Decoder pass:
torch.Size([512, 1024])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([3596, 512])
torch.Size([3596])
******************************
******************************
######################################################
##################################

Total epoch 39 finished in 0.270003080368042 seconds with TRAINING RMSE loss: 0.18442396016310184
Epoch 39 EVALUATION LOSS: 0.09746190868051324
Saving model to model_save/model.epoch_39
Doing epoch 40 of 50
[40,     0] RMSE: 0.1841024
Total epoch 40 finished in 0.2738964557647705 seconds with TRAINING RMSE loss: 0.18297309379921387
Doing epoch 41 of 50
[41,     0] RMSE: 0.1830710
Total epoch 41 finished in 0.27388429641723633 seconds with TRAINING RMSE loss: 0.18248006582122078
Doing epoch 42 of 50
[42,     0] RMSE: 0.1824773
Total epoch 42 finished in 0.2709999084472656 seconds with TRAINING RMSE loss: 0.18187371649240078
Epoch 42 EVALUATION LOSS: 0.09670318491883391
Saving model to model_save/model.epoch_42
Doing epoch 43 of 50
[43,     0] RMSE: 0.1822296
Total epoch 43 finished in 0.274259090423584 seconds with TRAINING RMSE loss: 0.18156921818708208
Doing epoch 44 of 50
[44,     0] RMSE: 0.1816645
Total epoch 44 finished in 0.2700827121734619 seconds with TRAINING RMSE loss: 0.1828

In [49]:
!tensorboard --logdir=model_save

TensorBoard 1.5.0 at http://432008e1172c:6006 (Press CTRL+C to quit)
^C


In [12]:
# !rm -rf model_save

In [26]:
!python infer.py \
--path_to_train_data retail/TRAIN \
--path_to_eval_data retail/TEST \
--hidden_layers 512,512,1024 \
--non_linearity_type selu \
--save_path model_save/model.epoch_49 \
--drop_prob 0.8 \
--predictions_path preds.txt

Namespace(constrained=False, drop_prob=0.8, hidden_layers='512,512,1024', non_linearity_type='selu', path_to_eval_data='retail/TRAIN', path_to_train_data='retail/TRAIN', predictions_path='preds_train.txt', save_path='model_save/model.epoch_49', skip_last_layer_nl=False)
Loading training data
Data loaded
Total items found: 3879
Vector dim: 3596
Loading eval data
******************************
******************************
[3596, 512, 512, 1024]
Dropout drop probability: 0.8
Encoder pass:
torch.Size([512, 3596])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([1024, 512])
torch.Size([1024])
Decoder pass:
torch.Size([512, 1024])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([3596, 512])
torch.Size([3596])
******************************
******************************
Loading model from: model_save/model.epoch_49
######################################################
######################################################
############# AutoEncod

In [28]:
!head preds_train.txt -n 50

12347	29	1.0022114515304565	1.0
12347	130	1.0056113004684448	1.0
12347	167	1.0526694059371948	1.0
12347	207	1.2262483835220337	1.0
12347	209	1.0388710498809814	1.0
12347	281	1.0841541290283203	1.0
12347	323	1.0782625675201416	1.0
12347	327	1.1885783672332764	1.0
12347	340	1.000394344329834	1.0
12347	392	1.061516284942627	1.0
12347	407	1.1897742748260498	1.0
12347	473	1.1478164196014404	1.0
12347	661	1.0947186946868896	1.0
12347	697	1.000368356704712	1.0
12347	770	1.0166997909545898	1.0
12347	806	1.039190649986267	1.0
12347	837	0.9917394518852234	1.0
12347	926	1.0237293243408203	1.0
12347	927	1.0889842510223389	1.0
12347	1047	1.0912761688232422	1.0
12347	1050	1.0489686727523804	1.0
12347	1105	0.9916511178016663	1.0
12347	1106	1.1202387809753418	1.0
12347	1121	0.9663246273994446	1.0
12347	1152	1.131880760192871	1.0
12347	1262	1.097755789756775	1.0
12347	1263	0.9538242816925049	1.0
12347	1264	0.9588323831558228	1.0
12347	1265	1.0248124599456787	1.0
12347	1266	

In [20]:
!head retail/TEST/test.txt -n 20

12347	167	1
12347	340	1
12347	473	1
12347	770	1
12347	1939	1
12347	2113	1
12347	2323	1
12347	2332	1
12347	2334	1
12347	3052	1
12352	720	1
12352	1088	1
12352	1497	1
12352	1500	1
12352	1508	1
12352	1538	1
12352	1839	1
12352	1843	1
12352	1943	1
12352	1944	1


In [21]:
!head preds.txt -n 20

12347	167	1.0526694059371948	1.0
12347	340	1.000394344329834	1.0
12347	473	1.1478164196014404	1.0
12347	770	1.0166997909545898	1.0
12347	1939	0.9950699806213379	1.0
12347	2332	1.1410807371139526	1.0
12347	2334	1.1010527610778809	1.0
12347	3052	1.010302186012268	1.0
12347	2113	1.0811024904251099	1.0
12347	2323	0.9664358496665955	1.0
12352	3918	0.987004280090332	1.0
12352	1497	0.987648606300354	1.0
12352	1508	1.1000040769577026	1.0
12352	1538	1.0199328660964966	1.0
12352	1500	1.0147424936294556	1.0
12352	1843	0.9649232029914856	1.0
12352	720	1.0109050273895264	1.0
12352	2385	1.102394938468933	1.0
12352	1839	1.0148576498031616	1.0
12352	1088	0.9791996479034424	1.0


In [16]:
!python compute_RMSE.py --path_to_predictions=preds.txt

Namespace(path_to_predictions='preds.txt', round=False)
####################
RMSE: 0.09381776951434737
####################


In [29]:
!python compute_RMSE.py --path_to_predictions=preds_train.txt

Namespace(path_to_predictions='preds_train.txt', round=False)
####################
RMSE: 0.0901337912503639
####################


In [19]:
import copy
from reco_encoder.data import input_layer
import torch

In [20]:
params = dict()
params['batch_size'] = 128
params['data_dir'] = 'yoochoose-data/ALL'
params['major'] = 'users'
params['itemIdInd'] = 1
params['userIdInd'] = 0
print("Loading all data")
all_data_layer = input_layer.UserItemRecDataProvider(params=params)

Loading all data


In [21]:
print("Loading training data")
train_params = copy.deepcopy(params)
params['data_dir'] = 'yoochoose-data/TRAIN'
data_layer = input_layer.UserItemRecDataProvider(params=train_params, 
                                                 user_id_map=all_data_layer.userIdMap, 
                                                 item_id_map=all_data_layer.itemIdMap)


Loading training data


In [23]:
eval_params = copy.deepcopy(train_params)
eval_params['data_dir'] = 'yoochoose-data/TEST'
data_layer_eval = input_layer.UserItemRecDataProvider(params=eval_params, 
                                                      user_id_map=all_data_layer.userIdMap, 
                                                      item_id_map=all_data_layer.itemIdMap)
data_layer_eval.src_data = data_layer.data

In [24]:
next(data_layer.iterate_one_epoch())

FloatTensor of size 128x7025 with indices:


Columns 0 to 12 
    0     1     2     3     3     4     5     5     5     5     6     6     6
  799  1036   669    50  1693   143    11   738  1062    59   646   782   146

Columns 13 to 25 
    6     7     8     8     9     9     9    10    10    11    11    12    13
   89  3229  1622   717   156   827    79   416   135  1610   848   510   329

Columns 26 to 38 
   14    15    16    16    16    17    18    18    19    20    20    20    20
  629     6  2389  1584   288   188    27    80  1551    95  3735  4962   286

Columns 39 to 51 
   20    20    20    21    21    22    22    23    24    25    25    26    27
    4   195  1075  2364  1940    23    68   224   877   770  1020   470   255

Columns 52 to 64 
   27    28    29    29    30    31    31    32    33    34    34    35    36
  302   364  1399  1103    29  1964   720    56  2733  2448   194  2898  3174

Columns 65 to 77 
   37    37    38    38    39    40    40    40    40    40    

In [28]:
next(data_layer_eval.iterate_one_epoch_eval())

(FloatTensor of size 1x7025 with indices:
 
    0    0
  159   19
 [torch.LongTensor of size 2x2]
 and values:
 
  0
  0
 [torch.FloatTensor of size 2], FloatTensor of size 1x7025 with indices:
 
    0    0    0
  117  159   19
 [torch.LongTensor of size 2x3]
 and values:
 
  0
  0
  0
 [torch.FloatTensor of size 3])

In [None]:
params = dict()
params['batch_size'] = 128
params['data_dir'] = 'yoochoose-data/ALL'
params['major'] = 'users'
params['itemIdInd'] = 1
params['userIdInd'] = 0
print("Loading all data")
all_data_layer = input_layer.UserItemRecDataProvider(params=params)
print("Loading training data")
train_params = copy.deepcopy(params)
params['data_dir'] = 'yoochoose-data/TRAIN'
data_layer = input_layer.UserItemRecDataProvider(params=train_params, 
                                                 user_id_map=all_data_layer.userIdMap, 
                                                 item_id_map=all_data_layer.itemIdMap)
eval_params = copy.deepcopy(train_params)
eval_params['data_dir'] = 'yoochoose-data/TEST'
data_layer_eval = input_layer.UserItemRecDataProvider(params=eval_params, 
                                                      user_id_map=all_data_layer.userIdMap, 
                                                      item_id_map=all_data_layer.itemIdMap)
