### Next purchase prediction

This is a competition notebook for the [SberMarket Inclass challenge](https://www.kaggle.com/c/sbermarket-internship-competition).

The dataset contains the purchase history of twenty thousand customers over a couple of years. Each purchase is described only by the list of categories it contains (881 categories in total); the quantity of goods bought in each category is not given.

The challenge is to predict the next order, **regardless of the time of the next order** and regardless of the number of purchased items of each category.<br>

The evaluation metric is F1.
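
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it by hand on made-up binary labels; the arrays and the helper name `f1_binary` are illustrative assumptions, not competition data or part of the original kernel.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for binary labels, computed from TP / FP / FN counts."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # predicted 1, actually 1
    fp = np.sum(~y_true & y_pred)   # predicted 1, actually 0
    fn = np.sum(y_true & ~y_pred)   # predicted 0, actually 1
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Toy labels: 1 = the category is present in the next order (illustrative only)
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0]
f1_toy = f1_binary(y_true_toy, y_pred_toy)  # precision = recall = 2/3
```

Because F1 ignores true negatives, it is sensitive to the probability cut-off used to turn scores into 0/1 predictions, which is why the threshold is tuned later in the notebook.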

In this kernel I demonstrate a fairly simple solution: an L2-regularized linear model (```linear_l2```) trained with [LightAutoML](https://github.com/sberbank-ai-lab/LightAutoML) on a handful of engineered features.

#### Pipeline

1. Training dataset creation and feature engineering
1. Model training
1. Test dataset creation
1. Preliminary predictions
1. Optimal probability threshold
1. Final predictions and submit

In [7]:
import os
import numpy as np
import pandas as pd
import time
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

In [8]:
path_train = os.path.join('..','data','train.csv')
path_submit = os.path.join('..','data','sample_submission.csv')

In [9]:
raw = pd.read_csv(path_train)
sub = pd.read_csv(path_submit)

#### 1. Training dataset creation and feature engineering

```order_number``` - sequential order number for each user<br>
```user_id``` - user ID<br>
```category``` - product category<br>
```ordered```  - number of the user's orders (excluding the last one) that contain the category<br>
```orders_total``` - total number of orders for each user (excluding the last one)<br>
```rating``` - share of the user's orders that contain the category (```ordered / orders_total```)<br>
```total_ordered``` - number of orders containing the category, summed over all users<br>
```id``` - ```user_id;category```, as in the submission file<br>
```target``` - target variable: whether the category is in the user's last known order<br>
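
To make the feature definitions concrete, here is a minimal sketch on a hypothetical three-row-per-user toy history. The frame `toy`, its category names, and the variable names are assumptions for illustration. Unlike the notebook, the sketch only builds rows for categories a user has already bought, rather than for every user/category pair.

```python
import pandas as pd

# Hypothetical toy history: one row per (user, order, category)
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 2, 2],
    'order_number': [0, 0, 1, 2, 0, 1],
    'category':     ['milk', 'bread', 'milk', 'bread', 'milk', 'milk'],
})

# Hold out each user's last order: it becomes the target
last = toy.groupby('user_id')['order_number'].transform('max')
history = toy[toy['order_number'] < last]
holdout = toy[toy['order_number'] == last]

# ordered: number of past orders of the user that contain the category
feats = (history.groupby(['user_id', 'category']).size()
         .rename('ordered').reset_index())

# orders_total: number of past orders per user
orders_total = history.groupby('user_id')['order_number'].nunique()
feats['orders_total'] = feats['user_id'].map(orders_total)

# rating: share of the user's past orders that contain the category
feats['rating'] = feats['ordered'] / feats['orders_total']

# total_ordered: category popularity summed over all users
feats['total_ordered'] = feats['category'].map(
    feats.groupby('category')['ordered'].sum())

# id and target, in the same "user_id;category" format as the submission file
feats['id'] = feats['user_id'].astype(str) + ';' + feats['category']
held = set(zip(holdout['user_id'], holdout['category']))
feats['target'] = [int(pair in held)
                   for pair in zip(feats['user_id'], feats['category'])]
```

In this toy data, user 1 bought milk in both past orders (```rating``` of 1.0) but only bread in the held-out order, so ```1;milk``` gets target 0 and ```1;bread``` gets target 1.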

In [10]:
%%time
# one-hot encode the categories: one boolean column per category (temporary wide table)
train_raw = pd.get_dummies(raw, columns = ['cart'], prefix='', prefix_sep='', dtype='bool')
train_raw = train_raw.groupby(['user_id', 'order_completed_at']).any().reset_index()

# order counter for each user
train_raw['order_number'] = train_raw.groupby(['user_id']).cumcount()
train_raw = train_raw.drop('order_completed_at', axis=1)

# split the data: all orders except the last go to train, the last order becomes the target
last_order = train_raw.groupby(['user_id'])['order_number'].transform('max') == train_raw['order_number']
train = train_raw[~last_order].groupby('user_id').sum().reset_index()
valid = train_raw[last_order].reset_index(drop=True)

# purchase counter for each category for each user
train_melt = pd.melt(train, id_vars=['user_id'], var_name='category', value_name='ordered')
valid_melt = pd.melt(valid, id_vars=['user_id'], var_name='category', value_name='target')

Train = train_melt.copy()

# total purchase counter for each user
order_number = valid[['user_id', 'order_number']].set_index('user_id').squeeze()
Train['orders_total']= Train['user_id'].map(order_number)

# share of the user's orders that contain the category
Train['rating'] = Train['ordered'] / Train['orders_total']

# user_id / category as in submission file
Train['id'] = Train['user_id'].astype(str) + ';' + Train['category']

# target variable (the last known purchase)
Train['target'] = valid_melt['target'].astype(int)

# keep only the user/category pairs that are present in the submission file
Train = Train[Train.id.isin(sub.id.unique())].reset_index(drop=True)
# sanity check: the remaining ids are exactly the ids of the submission file
print(np.array_equal(np.sort(Train['id'].values), np.sort(sub['id'].values)))

# purchase counter over all users (restricted to the users/categories kept above)
total_ordered = Train.groupby('category')['ordered'].sum()
Train['total_ordered'] = Train['category'].map(total_ordered)

Train.head(10)

True
CPU times: user 1min 51s, sys: 15.7 s, total: 2min 7s
Wall time: 1min 42s


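| | user_id | category | ordered | orders_total | rating | id | target | total_ordered |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 0 | 0 | 10 | 0.0 | 7;0 | 1 | 12922 |
| 1 | 8 | 0 | 1 | 7 | 0.142857 | 8;0 | 0 | 12922 |
| 2 | 9 | 0 | 1 | 45 | 0.022222 | 9;0 | 0 | 12922 |
| 3 | 12 | 0 | 1 | 20 | 0.05 | 12;0 | 1 | 12922 |
| 4 | 13 | 0 | 3 | 16 | 0.1875 | 13;0 | 0 | 12922 |
| 5 | 14 | 0 | 17 | 80 | 0.2125 | 14;0 | 0 | 12922 |
| 6 | 16 | 0 | 9 | 93 | 0.096774 | 16;0 | 0 | 12922 |
| 7 | 18 | 0 | 2 | 12 | 0.166667 | 18;0 | 0 | 12922 |
| 8 | 21 | 0 | 22 | 82 | 0.268293 | 21;0 | 0 | 12922 |
| 9 | 25 | 0 | 1 | 13 | 0.076923 | 25;0 | 0 | 12922 |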


In [11]:
Train_set, Valid_set = train_test_split(Train, test_size = 0.2,
                                        stratify = None, random_state = 17)

#### 2. Model training

In [12]:
# brew install libomp
# !pip install -U lightautoml
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

In [13]:
%%time 
# F1 at a fixed 0.5 probability cut-off, used as the metric for LightAutoML
def f1(real, pred, **kwargs):
    return f1_score(real, (pred > 0.5).astype(int), **kwargs)

roles = {'target': 'target', 'drop': ['user_id', 'category', 'id']}
task = Task('binary', metric = f1)

automl = TabularAutoML(task = task, 
                       timeout = 300,
                       cpu_limit = 4,
                       reader_params = {'n_jobs': 4, 'cv': 5, 'random_state': 17},
                       general_params = {'use_algos': [['linear_l2']]},
                      )
# fit_predict returns out-of-fold predictions for the training part
train_pred = automl.fit_predict(Train_set, roles = roles)
print('Out-of-fold score on train', "%.5f" % f1(Train_set.target, train_pred.data))

# Valid_set was not seen during training
valid_pred = automl.predict(Valid_set)
print('Score on hold-out validation', "%.5f" % f1(Valid_set.target, valid_pred.data))

Out-of-fold score on train 0.57657
Score on hold-out validation 0.57507
CPU times: user 23.3 s, sys: 2.45 s, total: 25.8 s
Wall time: 20.7 s


Search for the probability threshold that maximizes F1 on the hold-out set:

In [14]:
best_score = 0
for i in np.arange(0.01, 1.0, 0.01):
    score = f1_score(Valid_set.target, (valid_pred.data > i).astype(int))
    if score > best_score:
        best_score = score
        proba_split = i

print('At i =', "%.2f" % proba_split,'score is : ' "%.5f" % best_score)

At i = 0.29 score is : 0.61853
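
The same grid search can be wrapped in a reusable function. The sketch below is a NumPy-only variant run on made-up scores; the name `best_f1_threshold` and the toy arrays are assumptions for illustration and are not part of the original kernel.

```python
import numpy as np

def best_f1_threshold(y_true, proba, grid=None):
    """Return (threshold, f1) that maximizes F1 over a grid of cut-offs."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    proba = np.asarray(proba, dtype=float).ravel()
    if grid is None:
        grid = np.arange(0.01, 1.0, 0.01)  # same grid as the loop above
    best_t, best_f1 = float(grid[0]), -1.0
    for t in grid:
        pred = proba > t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        denom = 2 * tp + fp + fn
        score = 2 * tp / denom if denom else 0.0
        if score > best_f1:  # strict ">" keeps the lowest of tied thresholds
            best_t, best_f1 = float(t), float(score)
    return best_t, best_f1

# Toy scores (illustrative only): the positives mostly score higher
y_toy = [0, 0, 1, 1]
p_toy = [0.105, 0.405, 0.355, 0.805]
t_best, f1_best = best_f1_threshold(y_toy, p_toy)
```

A threshold tuned on the same hold-out set that is used to report the score tends to look slightly optimistic, so the 0.61853 above is best read as an upper estimate.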


#### 3. Test dataset creation

In [15]:
Test = Train.copy()

# the last known order becomes part of the history, so increment the order counter
Test['orders_total'] += 1 

# add the last order to the per-category purchase counts
Test['ordered'] = Test['ordered'] + Test['target']

# recompute the global category counts including the last order
test_total_ordered = Test.groupby('category')['ordered'].sum()
Test['total_ordered'] = Test['category'].map(test_total_ordered)

# recompute the rating including the last order
Test['rating'] = Test['ordered'] / Test['orders_total']

Test = Test.drop('target', axis=1)
Test.head(3)

| | user_id | category | ordered | orders_total | rating | id | total_ordered |
|---|---|---|---|---|---|---|---|
| 0 | 7 | 0 | 1 | 11 | 0.090909 | 7;0 | 14190 |
| 1 | 8 | 0 | 1 | 8 | 0.125 | 8;0 | 14190 |
| 2 | 9 | 0 | 1 | 46 | 0.021739 | 9;0 | 14190 |


#### 4. Preliminary predictions


In [16]:
predictions = automl.predict(Test)
print('Train target mean:', "%.5f" % Train.target.mean())
print('Test target mean:', "%.5f" % (predictions.data > 0.5).astype(int).mean())

Train target mean: 0.23596
Test target mean: 0.07531


#### 5. Optimal probability threshold
With a probability threshold of 0.5, the share of positive predictions on the test set (0.075) is much lower than the share of positives in the training target (0.236).<br>
Assuming that the total number of purchased categories across all customers is roughly the same from one order to the next, the threshold is lowered until the share of positive predictions matches the training target mean.

In [17]:
th = 0.5
train_mean = Train.target.mean()
test_mean = (predictions.data > th).astype(int).mean()

while test_mean < train_mean:
    th -= 0.005
    test_mean = (predictions.data > th).astype(int).mean()
    
print('Threshold:', "%.4f" % th)
print('Train mean:', "%.5f" % train_mean)
print('New Test mean:', "%.5f" % test_mean)

Threshold: 0.2450
Train mean: 0.23596
New Test mean: 0.23612
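
The loop above effectively looks for the cut-off at which a given share of the predictions is positive. Under that reading, a similar threshold can be obtained in one step from a quantile of the predicted probabilities. This is a sketch on synthetic scores, not the original method; `threshold_for_positive_rate` is a hypothetical helper name and the uniform random scores stand in for `predictions.data`.

```python
import numpy as np

def threshold_for_positive_rate(proba, target_rate):
    """Cut-off t such that roughly `target_rate` of `proba` lies above t."""
    proba = np.asarray(proba, dtype=float).ravel()
    return float(np.quantile(proba, 1.0 - target_rate))

# Synthetic scores standing in for the model output (illustrative only)
rng = np.random.default_rng(17)
proba_toy = rng.random(10_000)

target_rate_toy = 0.236  # roughly the training target mean reported above
th_toy = threshold_for_positive_rate(proba_toy, target_rate_toy)
rate_toy = float((proba_toy > th_toy).mean())
```

The quantile lands closer to the requested rate than a fixed 0.005 step can, but with many tied probabilities the achieved rate may still differ from the target, so it is worth printing the resulting mean as the notebook does.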


#### 6. Final predictions and submit

In [18]:
Test['target'] = (predictions.data > th).astype(int)
submit = pd.merge(sub['id'], Test[['id', 'target']], on='id')
#submit.to_csv('submission.csv', index = False)
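
Before writing the file it can help to confirm that the merge kept every submission id exactly once. The sketch below shows one way to do that on toy frames; `sub_toy` and `pred_toy` are made-up stand-ins for `sub` and `Test`, and the left join with `validate` is a suggested variation rather than what the cell above does.

```python
import pandas as pd

# Toy stand-ins for the sample submission and the predictions (illustrative only)
sub_toy = pd.DataFrame({'id': ['7;0', '8;0', '9;0']})
pred_toy = pd.DataFrame({'id': ['9;0', '7;0', '8;0'], 'target': [0, 1, 0]})

# A left join keeps the row order of the submission file;
# validate='one_to_one' raises if an id is duplicated on either side
submit_toy = sub_toy.merge(pred_toy, on='id', how='left', validate='one_to_one')

# Every submission id should have received a prediction
missing_toy = int(submit_toy['target'].isna().sum())
```

With the default inner join used in the cell above, ids without a prediction are dropped silently, so comparing `len(submit)` with `len(sub)` is a cheap equivalent check.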