  <h1 align="center">E-commerce behaviour predictions </h1> 



#Dataset description

The training data contains full e-commerce session information. The aim is to predict the `aid` values for each session type that occur after the last timestamp `ts` of each session in the test data. In other words, the test data contains sessions truncated by timestamp, and the model should predict what occurs after the point of truncation.

> train.csv - the training data, which contains full session data: 

`session` - the unique session id 

`aid` - the article id (product code) of the associated event 

`ts` - the Unix timestamp of the event 

`type` - the event type, i.e., whether a product was clicked, added to the user's cart, or ordered during the session: 

0. `clicks`
1. `carts`
2. `orders`

> test.csv - the test data, which contains truncated session data.

Your task is to predict the next `aid` clicked after the session truncation, as well as the remaining `aid`s that are added to carts and orders; you may predict up to 20 values for each session type.
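
To make the target concrete, here is a minimal sketch that derives the labels for one toy session cut at a chosen timestamp. The helper name `labels_after_cut`, the cut point, and all values are made up for illustration; they are not part of the dataset or its tooling.

```python
import pandas as pd

# Toy session (made-up values): type 0 = click, 1 = cart, 2 = order.
session = pd.DataFrame({
    "aid":  [11, 12, 12, 13, 14, 12],
    "ts":   [100, 110, 120, 130, 140, 150],
    "type": [0, 0, 1, 0, 1, 2],
})

def labels_after_cut(events: pd.DataFrame, cut_ts: int) -> dict:
    """Return what happens after cut_ts: the next click, plus all later carts and orders."""
    future = events[events["ts"] > cut_ts].sort_values("ts")
    clicks = future.loc[future["type"] == 0, "aid"]
    carts = future.loc[future["type"] == 1, "aid"]
    orders = future.loc[future["type"] == 2, "aid"]
    return {
        "clicks": clicks.iloc[:1].tolist(),        # only the next clicked aid
        "carts": sorted(carts.unique().tolist()),   # every aid carted later
        "orders": sorted(orders.unique().tolist()), # every aid ordered later
    }

labels = labels_after_cut(session, cut_ts=120)
print(labels)
```

Everything up to `cut_ts` plays the role of a test session, and `labels` is what a model would be asked to recover.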


> Acknowledgements:
> > Copyright (c) 2022 Otto (GmbH & Co KG), https://www.otto.de/jobs/technology/ueberblick/

#Loading and exploring dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread
import seaborn as sns

from datetime import datetime


import warnings
warnings.filterwarnings('ignore')

import gc

from scipy.sparse import csr_matrix

from sklearn.neighbors import NearestNeighbors

In [2]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Colab Notebooks/Na GITa/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks/Na GITa


In [3]:
train = pd.read_csv('data/onlineshop/train_colab.csv', usecols=[1, 2, 3, 4])
test = pd.read_csv('data/onlineshop/test_colab.csv', usecols=[1, 2, 3, 4])

In [4]:
train.head()

Unnamed: 0,session,aid,ts,type
0,0,1349536,1661634295,0
1,0,165096,1661634321,0
2,0,315914,1661634351,0
3,0,315914,1661634431,1
4,0,1680276,1661634664,0


In [5]:
train.tail()

Unnamed: 0,session,aid,ts,type
12941604,12899776,1737908,1661723987,0
12941605,12899777,384045,1661723976,0
12941606,12899777,384045,1661723986,0
12941607,12899778,561560,1661723983,0
12941608,12899778,32070,1661723994,0


In [6]:
test.head()

Unnamed: 0,session,aid,ts,type
0,12899779,59625,1661724000,0
1,12899780,1142000,1661724000,0
2,12899780,582732,1661724058,0
3,12899780,973453,1661724109,0
4,12899780,736515,1661724136,0


In [7]:
test.tail()

Unnamed: 0,session,aid,ts,type
6540533,14571577,1141710,1662328774,0
6540534,14571578,519105,1662328775,0
6540535,14571579,739876,1662328775,0
6540536,14571580,202353,1662328781,0
6540537,14571581,1100210,1662328791,0


Replacing `ts` with info about hour and day

In [8]:
#datetime.fromtimestamp(train.ts[1]).strftime('%a')

In [9]:
#datetime.fromtimestamp(train.ts[1]).strftime('%H%M')

In [10]:
train['ts'] = pd.to_datetime(train['ts'], unit='s')
test['ts'] = pd.to_datetime(test['ts'], unit='s')

In [11]:
train['day'] = train['ts'].dt.day_name()
test['day'] = test['ts'].dt.day_name()

In [12]:
train['hour'] = train['ts'].dt.hour
test['hour'] = test['ts'].dt.hour

In [13]:
train_time = train.drop(columns=['ts'])
test_time = test.drop(columns=['ts'])

In [14]:
del train
del test

In [15]:
gc.collect()

36

#KNN

Plan: concatenate train and test, use KNN to predict 20 more `aid` labels that were not previously used in the session, then use SVD for matrix factorization and try features other than `type`.
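
The "20 more labels not previously used in the session" step could look roughly like the sketch below. The helper `recommend_from_neighbours` is hypothetical, and ranking candidates by how many neighbour sessions contain them is an assumption, not something the notebook implements.

```python
from collections import Counter

def recommend_from_neighbours(query_aids, neighbour_sessions, k=20):
    """Rank aids seen in neighbour sessions, skipping aids already in the query session."""
    seen = set(query_aids)
    counts = Counter(
        aid
        for aids in neighbour_sessions
        for aid in set(aids)   # count each aid once per neighbour session
        if aid not in seen
    )
    # Most frequent first; ties broken by the smaller aid for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [aid for aid, _ in ranked[:k]]

# Toy data (made up): the query session and the aids of three neighbour sessions.
query = [1, 2]
neighbours = [[1, 3, 4], [2, 3], [3, 4, 5]]
recommended = recommend_from_neighbours(query, neighbours)
print(recommended)
```

In the notebook's setting, `neighbour_sessions` would come from looking up the aids of the sessions returned by `kneighbors`.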

In [16]:
data = pd.concat([train_time, test_time])

In [17]:
data['type'] = data['type'] + 1  # shift types to 1..3 so that 0 can mean "no interaction" once the pivot's NaNs are filled with 0

In [18]:
data.session.nunique()

3366233

In [19]:
data.aid.nunique()

1027688

In [22]:
#df.groupby(['userId','movieId'])['rating'].max().unstack()

In [23]:
first_chunk = data[data['aid'].isin(data.aid.unique()[:1000])]

In [40]:
first_chunk.head() 

Unnamed: 0,session,aid,type,day,hour
0,0,1349536,1,Saturday,21
1,0,165096,1,Saturday,21
2,0,315914,1,Saturday,21
3,0,315914,2,Saturday,21
4,0,1680276,1,Saturday,21


In [None]:
# chunk_size = 10000
# chunks = [x for x in range(0, df.shape[0], chunk_size)]
# type_2_df = pd.concat([df.iloc[chunks[i]:chunks[i + 1] - 1].pivot_table(index = 'session', columns = 'aid', values = 'type', aggfunc='mean').fillna(0) for i in range(0, len(chunks) - 1)])

In [35]:
first_chunk_df = first_chunk.pivot_table(index = 'session', columns = 'aid', values = 'type').fillna(0)

In [36]:
first_chunk_df.head()

session \ aid,2027,4322,4525,5606,6362,6851,7651,8017,9827,9891,...,1830578,1836610,1837737,1837818,1845526,1847491,1847685,1849394,1854762,1854872
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
first_chunk_matrix = csr_matrix(first_chunk_df.values)

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute', n_neighbors=20, n_jobs=-1)
model_knn.fit(first_chunk_matrix)

NearestNeighbors(algorithm='brute', metric='cosine', n_jobs=-1, n_neighbors=20)
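
The pivot above builds a dense frame before it is converted to CSR, which is presumably why the data is processed 1000 aids at a time. A possible alternative, sketched here on made-up data, is to build the sparse matrix directly from row and column codes. The variable names are illustrative, and the `mean` aggregation is chosen to match the default of `pivot_table`.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy interactions (made-up values); type is already shifted to 1..3.
events = pd.DataFrame({
    "session": [0, 0, 0, 7, 7],
    "aid":     [5, 9, 9, 5, 3],
    "type":    [1, 1, 2, 1, 3],
})

# Aggregate duplicate (session, aid) pairs the way pivot_table does by default.
agg = events.groupby(["session", "aid"], as_index=False)["type"].mean()

# Map raw ids to dense row/column positions.
row_codes, session_ids = pd.factorize(agg["session"], sort=True)
col_codes, aid_ids = pd.factorize(agg["aid"], sort=True)

matrix = csr_matrix(
    (agg["type"].to_numpy(), (row_codes, col_codes)),
    shape=(len(session_ids), len(aid_ids)),
)
print(matrix.toarray())
```

`session_ids[i]` gives the session behind row `i`, which replaces the `first_chunk_df.index` lookup used when printing neighbours.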

In [41]:
query_index = np.random.choice(first_chunk_df.shape[0])
print(query_index)
distances_1, indices_1 = model_knn.kneighbors(first_chunk_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 1000)

161037


In [43]:
for i in range(0, len(distances_1.flatten())):
  if i == 0:
    print('Recommendations for {0}:\n'.format(first_chunk_df.index[query_index]))
  else:
    print('{0}: {1}, with distance of {2}:'.format(i, first_chunk_df.index[indices_1.flatten()[i]], distances_1.flatten()[i]))

Recommendations for 13522519:

1: 13721221, with distance of 0.0:
2: 13452452, with distance of 0.0:
3: 6292056, with distance of 0.0:
4: 14466378, with distance of 0.0:
5: 13186570, with distance of 0.0:
6: 13855582, with distance of 0.0:
7: 12773292, with distance of 0.0:
8: 14095183, with distance of 0.0:
9: 13855637, with distance of 0.0:
10: 14414302, with distance of 0.0:
11: 13806762, with distance of 0.0:
12: 1153025, with distance of 0.0:
13: 2191959, with distance of 0.0:
14: 13931143, with distance of 0.0:
15: 10875497, with distance of 0.0:
16: 11350591, with distance of 0.0:
17: 13931513, with distance of 0.0:
18: 13523282, with distance of 0.0:
19: 686489, with distance of 0.0:
20: 14237504, with distance of 0.0:
21: 13721864, with distance of 0.0:
22: 1658439, with distance of 0.0:
23: 2658254, with distance of 0.0:
24: 14025029, with distance of 0.0:
25: 12177480, with distance of 0.0:
26: 965494, with distance of 0.0:
27: 9300848, with distance of 0.0:
28: 13720004, wi
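
A possible reason for the long run of 0.0 distances: cosine distance ignores vector length, so sessions that touched the same single aid within the chunk are indistinguishable, whatever the event type. The sketch below shows this on made-up data; it assumes the tied sessions each touched one aid of the chunk, which has not been checked against the real data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy session-by-aid matrix (made up): rows 0-2 touch only the first aid.
X = np.array([
    [1.0, 0.0, 0.0],   # one click
    [2.0, 0.0, 0.0],   # one cart on the same aid
    [1.5, 0.0, 0.0],   # click and cart averaged
    [0.0, 1.0, 1.0],   # different aids
])

knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(X)
distances, indices = knn.kneighbors(X[[0]], n_neighbors=4)
print(distances.round(3))
```

With ties like these, the order of the neighbours carries no information, so the ranking among them is arbitrary.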

In [44]:
second_chunk = data[data['aid'].isin(data.aid.unique()[1000:2000])]
second_chunk_df = second_chunk.pivot_table(index = 'session', columns = 'aid', values = 'type').fillna(0)
second_chunk_matrix = csr_matrix(second_chunk_df.values)

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute', n_neighbors=20, n_jobs=-1)
model_knn.fit(second_chunk_matrix)

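# NOTE: query_index is a row position taken from first_chunk_df, so here it selects whichever
# session sits at that position in second_chunk_df, not the session queried above.
distances_2, indices_2 = model_knn.kneighbors(second_chunk_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 1000)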

In [45]:
for i in range(0, len(distances_2.flatten())):
  if i == 0:
    print('Recommendations for {0}:\n'.format(second_chunk_df.index[query_index]))
  else:
    print('{0}: {1}, with distance of {2}:'.format(i, second_chunk_df.index[indices_2.flatten()[i]], distances_2.flatten()[i]))

Recommendations for 13899485:

1: 14288639, with distance of 0.0:
2: 13607394, with distance of 0.0:
3: 13981204, with distance of 0.0:
4: 13334292, with distance of 0.0:
5: 14170066, with distance of 0.0:
6: 12852846, with distance of 0.0:
7: 6168228, with distance of 0.0:
8: 5470245, with distance of 0.0:
9: 12852914, with distance of 0.0:
10: 13425276, with distance of 0.0:
11: 6164416, with distance of 0.0:
12: 14446647, with distance of 0.0:
13: 1502601, with distance of 0.0:
14: 13077416, with distance of 0.0:
15: 13077431, with distance of 0.0:
16: 13822468, with distance of 0.0:
17: 13334895, with distance of 0.0:
18: 12655859, with distance of 0.0:
19: 11628101, with distance of 0.0:
20: 13422516, with distance of 0.0:
21: 13424577, with distance of 0.0:
22: 14134796, with distance of 0.0:
23: 12780225, with distance of 0.0:
24: 4323598, with distance of 0.0:
25: 97472, with distance of 0.0:
26: 1283896, with distance of 0.0:
27: 14441620, with distance of 0.0:
28: 2503586, wi