  <h1 align="center">E-commerce behaviour predictions </h1> 



# Dataset description

The training data contains full e-commerce session information. The aim is to predict, for each session in the test data, the `aid` values of each event type that occur after the last timestamp `ts` in that session. In other words, the test data contains sessions truncated by timestamp, and the model should predict what occurs after the point of truncation.

> train.csv - the training data, which contains full session data: 

`session` - the unique session id 

`aid` - the article id (product code) of the associated event 

`ts` - the Unix timestamp of the event 

`type` - the event type, i.e., whether a product was clicked, added to the user's cart, or ordered during the session: 
0.  'clicks', 
1.  'carts', 
2. 'orders' 

> test.csv - the test data, which contains truncated session data
Your task is to predict the next `aid` clicked after the session truncation, as well as the remaining `aid`s that are added to carts and orders; you may predict up to 20 values for each session type.
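
As a rough illustration of the prediction format, the sketch below builds one row per session and event type, capped at 20 `aid` values. The key layout `<session>_<type>` mirrors the `str(k) + '_' + suffix` keys built later in this notebook; the helper name `format_submission_rows` and the sample predictions are hypothetical, and the exact layout of a submission file is an assumption.

```python
def format_submission_rows(session, predictions, limit=20):
    """Return (key, labels) pairs for one session (hypothetical helper)."""
    rows = []
    for event_type in ("clicks", "carts", "orders"):
        # keep at most `limit` predicted aids for this event type
        aids = predictions.get(event_type, [])[:limit]
        rows.append((f"{session}_{event_type}", " ".join(str(a) for a in aids)))
    return rows


# sample predictions, invented for the example
rows = format_submission_rows(
    12899779, {"clicks": [59625, 1142000], "carts": [59625], "orders": []}
)
for key, labels in rows:
    print(key, "->", labels)
```

An event type without predictions still gets a row with an empty label string, so every session contributes exactly three rows.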


> Acknowledgements:
> > Copyright (c) 2022 Otto (GmbH & Co KG), https://www.otto.de/jobs/technology/ueberblick/

# Loading and exploring dataset

In [None]:
!pip install pynvml
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
device_name = pynvml.nvmlDeviceGetName(handle)

if (device_name != b'Tesla T4') and (device_name != b'Tesla P4') and (device_name != b'Tesla P100-PCIE-16GB'):
  raise Exception("""
                     Unfortunately this instance does not have a T4, P4 or P100 GPU.\n
                     Please make sure you've configured Colab to request a GPU instance type.\n
                     Sometimes Colab allocates a Tesla K80 instead of a T4, P4 or P100. If you get a K80 GPU, try Runtime -> Reset all runtimes...""")
else:
  print('Woo! You got the right kind of GPU!')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Woo! You got the right kind of GPU!


In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla T4!
We will now install RAPIDS via pip!  Please stand by, should be quick...
***********************************************************************



In [None]:
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!

In [None]:
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:27
🔁 Restarting kernel...


In [None]:
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!


In [None]:
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'
!pip uninstall cupy -y

Installing RAPIDS Stable 22.12
Starting the RAPIDS install on Colab.  This will take about 15 minutes.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread
import seaborn as sns

from datetime import datetime


import warnings
warnings.filterwarnings('ignore')

import gc

from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

from sklearn.neighbors import KDTree #used in the KDTree section; NearestNeighbors is imported from cuml below
from sklearn import preprocessing
from sklearn.decomposition import PCA

import tqdm.notebook as tq

import joblib

#!pip install cuml
import cuml, cudf #; cuml.__version__
from cuml.neighbors import NearestNeighbors


In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Colab Notebooks/Na GITa/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks/Na GITa


In [None]:
# train = pd.read_csv('data/onlineshop/train_colab.csv', usecols=[1, 2, 3, 4])
# test = pd.read_csv('data/onlineshop/test_colab.csv', usecols=[1, 2, 3, 4])

In [None]:

train = cudf.read_csv('data/onlineshop/train_colab.csv', usecols=[1, 2, 3, 4])
test = cudf.read_csv('data/onlineshop/test_colab.csv', usecols=[1, 2, 3, 4])

In [None]:
train.head()

Unnamed: 0,session,aid,ts,type
0,0,1349536,1661634295,0
1,0,165096,1661634321,0
2,0,315914,1661634351,0
3,0,315914,1661634431,1
4,0,1680276,1661634664,0


In [None]:
train.tail()

Unnamed: 0,session,aid,ts,type
12941604,12899776,1737908,1661723987,0
12941605,12899777,384045,1661723976,0
12941606,12899777,384045,1661723986,0
12941607,12899778,561560,1661723983,0
12941608,12899778,32070,1661723994,0


In [None]:
test.head()

Unnamed: 0,session,aid,ts,type
0,12899779,59625,1661724000,0
1,12899780,1142000,1661724000,0
2,12899780,582732,1661724058,0
3,12899780,973453,1661724109,0
4,12899780,736515,1661724136,0


In [None]:
test.tail()

Unnamed: 0,session,aid,ts,type
6540533,14571577,1141710,1662328774,0
6540534,14571578,519105,1662328775,0
6540535,14571579,739876,1662328775,0
6540536,14571580,202353,1662328781,0
6540537,14571581,1100210,1662328791,0


Replacing `ts` with the day of the week and the hour
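
The conversion can be checked on a single value before running it on the whole dataset. The sketch below uses only the standard library and the first timestamp shown in `train.head()`; it assumes the timestamps are in seconds and are interpreted as UTC, and the helper name `day_and_hour` is hypothetical.

```python
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_and_hour(ts):
    """Return (day name, hour) for a Unix timestamp in seconds, in UTC."""
    moment = datetime.fromtimestamp(ts, tz=timezone.utc)
    # weekday() counts from Monday = 0
    return DAY_NAMES[moment.weekday()], moment.hour


print(day_and_hour(1661634295))  # ('Saturday', 21)
```

The result matches the `day` and `hour` values that appear for session 0 later in the notebook.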

In [None]:
#datetime.fromtimestamp(train.ts[1]).strftime('%a')

In [None]:
#datetime.fromtimestamp(train.ts[1]).strftime('%H%M')

In [None]:
train['ts'] = cudf.to_datetime(train['ts'], unit='s') #train and test were read with cudf
test['ts'] = cudf.to_datetime(test['ts'], unit='s')

In [None]:
train['day'] = train['ts'].dt.day_name()
test['day'] = test['ts'].dt.day_name()

In [None]:
train['hour'] = train['ts'].dt.hour
test['hour'] = test['ts'].dt.hour

In [None]:
train_time = train.drop(columns=['ts'])
test_time = test.drop(columns=['ts'])

In [None]:
del train
del test

In [None]:
gc.collect()

36

# KNN

In [None]:
#train and test were deleted above, so the frames without ts are concatenated
data = cudf.concat([train_time, test_time])

In [None]:
data['type'] = data['type'] + 1 #to make sparse matrix with pivot (NaN replaced by 0)
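
The reason for the shift can be seen on a toy example. The sketch below is a plain-Python stand-in for `pivot_table` with its default mean aggregation followed by `fillna(0)`; the function name `pivot_mean` and the four events are made up for illustration. Without the shift, a click (type 0) is indistinguishable from "no interaction".

```python
from collections import defaultdict

# (session, aid, type) with the original 0/1/2 encoding
events = [
    (0, 1349536, 0),
    (0, 315914, 0),
    (0, 315914, 1),
    (1, 165096, 2),
]


def pivot_mean(rows, shift):
    """Session x aid table of mean (type + shift); missing cells become 0.0."""
    cells = defaultdict(list)
    for session, aid, event_type in rows:
        cells[(session, aid)].append(event_type + shift)
    cells = dict(cells)
    sessions = sorted({s for s, _, _ in rows})
    aids = sorted({a for _, a, _ in rows})
    table = {}
    for s in sessions:
        table[s] = [
            sum(cells[(s, a)]) / len(cells[(s, a)]) if (s, a) in cells else 0.0
            for a in aids
        ]
    return aids, table


aids, raw = pivot_mean(events, shift=0)
_, shifted = pivot_mean(events, shift=1)
print(aids)     # [165096, 315914, 1349536]
print(raw)      # {0: [0.0, 0.5, 0.0], 1: [2.0, 0.0, 0.0]}
print(shifted)  # {0: [0.0, 1.5, 1.0], 1: [3.0, 0.0, 0.0]}
```

In the unshifted table the click on `1349536` in session 0 is stored as 0.0, the same value as an empty cell; after the shift it becomes 1.0. The example also shows that repeated events on the same product are averaged, for example a click and a cart give 1.5.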

In [None]:
data.session.nunique()

3366233

In [None]:
test_time.session.nunique()

1617733

In [None]:
data.aid.nunique()

1027688

In [None]:
#df.groupby(['userId','movieId'])['rating'].max().unstack()

In [None]:
first_chunk = data[data['aid'].isin(data.aid.unique()[:1000])]

In [None]:
first_chunk.head() 

Unnamed: 0,session,aid,type,day,hour
0,0,1349536,1,Saturday,21
1,0,165096,1,Saturday,21
2,0,315914,1,Saturday,21
3,0,315914,2,Saturday,21
4,0,1680276,1,Saturday,21


In [None]:
# chunk_size = 10000
# chunks = [x for x in range(0, df.shape[0], chunk_size)]
# type_2_df = pd.concat([df.iloc[chunks[i]:chunks[i + 1] - 1].pivot_table(index = 'session', columns = 'aid', values = 'type', aggfunc='mean').fillna(0) for i in range(0, len(chunks) - 1)])

In [None]:
first_chunk_df = first_chunk.pivot_table(index = 'session', columns = 'aid', values = 'type').fillna(0)

In [None]:
first_chunk_df.head()

aid,2027,4322,4525,5606,6362,6851,7651,8017,9827,9891,...,1830578,1836610,1837737,1837818,1845526,1847491,1847685,1849394,1854762,1854872
session,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
first_chunk_matrix = csr_matrix(first_chunk_df.values)

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute', n_neighbors=20, n_jobs=-1)
model_knn.fit(first_chunk_matrix)

NearestNeighbors(algorithm='brute', metric='cosine', n_jobs=-1, n_neighbors=20)

In [None]:
query_index = 105000 #random index
print(query_index)
distances_1, indices_1 = model_knn.kneighbors(first_chunk_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 20)

105000


In [None]:
for i in range(0, len(distances_1.flatten())):
  if i == 0:
    print('Recommendations for {0}:\n'.format(first_chunk_df.index[query_index]))
  else:
    print('{0}: {1}, with distance of {2}:'.format(i, first_chunk_df.index[indices_1.flatten()[i]], distances_1.flatten()[i]))

Recommendations for 12432611:

1: 6516776, with distance of 0.0:
2: 13698819, with distance of 0.0:
3: 1906444, with distance of 0.0:
4: 14339840, with distance of 0.0:
5: 13282071, with distance of 0.0:
6: 9827242, with distance of 0.0:
7: 13801181, with distance of 0.0:
8: 14276910, with distance of 0.0:
9: 12804110, with distance of 0.0:
10: 14406942, with distance of 0.0:
11: 2949144, with distance of 0.0:
12: 12759161, with distance of 0.0:
13: 12607492, with distance of 0.0:
14: 12641138, with distance of 0.0:
15: 5315225, with distance of 0.0:
16: 9828604, with distance of 0.0:
17: 7420120, with distance of 0.0:
18: 13411039, with distance of 0.0:
19: 13745515, with distance of 0.0:


In [None]:
del first_chunk_matrix
del first_chunk_df
gc.collect()

24

In [None]:
second_chunk = data[data['aid'].isin(data.aid.unique()[1000:2000])]
second_chunk_df = second_chunk.pivot_table(index = 'session', columns = 'aid', values = 'type').fillna(0)
second_chunk_matrix = csr_matrix(second_chunk_df.values)

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute', n_neighbors=20, n_jobs=-1)
model_knn.fit(second_chunk_matrix)

distances_2, indices_2 = model_knn.kneighbors(second_chunk_df[second_chunk_df.index == 12432611].values.reshape(1, -1), n_neighbors = 20)

for i in range(0, 20):
  if i == 0:
    print('Recommendations for {0}:\n'.format(second_chunk_df[second_chunk_df.index == 12432611].index[0]))
  else:
    print('{0}: {1}, with distance of {2}:'.format(i, second_chunk_df.index[indices_2.flatten()[i]], distances_2.flatten()[i]))

Recommendations for 12432611:

1: 13215372, with distance of 0.0:
2: 2888676, with distance of 0.0:
3: 1835561, with distance of 0.0:
4: 12694377, with distance of 0.0:
5: 12770847, with distance of 0.0:
6: 12770890, with distance of 0.0:
7: 10606496, with distance of 0.0:
8: 13553598, with distance of 0.0:
9: 13400137, with distance of 0.0:
10: 2886424, with distance of 0.0:
11: 12624609, with distance of 0.0:
12: 14281750, with distance of 0.0:
13: 12538073, with distance of 0.0:
14: 11740851, with distance of 0.0:
15: 1836195, with distance of 0.0:
16: 12872379, with distance of 0.0:
17: 12770758, with distance of 0.0:
18: 13553036, with distance of 0.0:
19: 13215418, with distance of 0.0:


In [None]:
A = model_knn.kneighbors_graph(second_chunk_df[second_chunk_df.index.isin([12432611, 14571363])].values)
B = A.toarray()

In [None]:
B != 0

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [None]:
res = second_chunk_df[B[0] != 0]
res

aid,2306,3923,6643,12782,14161,21885,24496,24614,24649,25530,...,1846140,1846519,1846802,1848540,1848943,1849385,1852263,1852609,1853288,1854775
session,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1835561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1836195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2886424,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2888676,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10606496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11740851,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12538073,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12624609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12694377,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12770758,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
res = second_chunk_df[B[1] != 0]
res

aid,2306,3923,6643,12782,14161,21885,24496,24614,24649,25530,...,1846140,1846519,1846802,1848540,1848943,1849385,1852263,1852609,1853288,1854775
session,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
102358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3073076,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3836139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5949764,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10611623,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12553176,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13007629,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13295022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13564154,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13564195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
m2 = (res != 0).any()
products = m2.index[m2].tolist()
products

[868327]

In [None]:
recommend = np.zeros(int(data.aid.max()) + 1) #one slot per aid, including the largest one
recommend[products] = recommend[products] + 1

In [None]:
np.where(recommend>0)

(array([868327]),)

In [None]:
m1 = (second_chunk_df[second_chunk_df.index == 12432611] != 0).any()
used_products = m1.index[m1].tolist()
used_products 

[496180]

In [None]:
second_chunk_df[B[1] != 0].index.to_list()

[102358,
 3073076,
 3836139,
 5949764,
 10611623,
 12553176,
 13007629,
 13295022,
 13564154,
 13564195,
 13747035,
 13901119,
 14170161,
 14260909,
 14302716,
 14303334,
 14337956,
 14368637,
 14416274,
 14539326]

In [None]:
c = second_chunk_df.index.values*B

In [None]:
c[0][c[0]>0]

array([ 1835561.,  1836195.,  2886424.,  2888676., 10606496., 11740851.,
       12538073., 12624609., 12694377., 12770758., 12770847., 12770890.,
       12872379., 13215372., 13215418., 13400137., 13553036., 13553598.,
       14281750., 14282589.])

In [None]:
c[1][c[1]>0]

array([  102358.,  3073076.,  3836139.,  5949764., 10611623., 12553176.,
       13007629., 13295022., 13564154., 13564195., 13747035., 13901119.,
       14170161., 14260909., 14302716., 14303334., 14337956., 14368637.,
       14416274., 14539326.])

In [None]:
del second_chunk_matrix
del second_chunk_df
gc.collect()

321

## Functions' definitions

In [None]:
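def KNN_chunk(chunk, targets, n_neighbors=20, metric='cosine'):
    """ KNN model for chunks
    Arguments:
        chunk: part of data (cudf DataFrame with session, aid and type columns)
        targets: sessions from test dataset in chunk
        n_neighbors: number of neighbours to find for every target
        metric: distance metric passed to NearestNeighbors
        
    Returns:
        csr_matrix: one row for every target; the non-zero entries are the session ids
        of the n_neighbors nearest sessions found in chunk
    """

    # pivot on the host: rows are sessions, columns are aids, values are the mean shifted type
    chunk_df = chunk.to_pandas().pivot_table(index='session', columns='aid', values='type').fillna(0)
    
    chunk_matrix = csr_matrix(chunk_df.values)
    # pca = PCA(n_components=10)
    # chunk_matrix = pca.fit_transform(chunk_df.values)
    

    model_knn = NearestNeighbors(metric=metric, algorithm='brute', n_neighbors=n_neighbors, n_jobs=-1)
    model_knn.fit(chunk_matrix)

    # chunk_df is a pandas DataFrame here, so .values is used (values_host exists only on cudf objects)
    nn = model_knn.kneighbors_graph(chunk_df[chunk_df.index.isin(targets)].values)
    # multiplying the neighbour graph by the session ids keeps the ids of the neighbours
    result = chunk_df.index.values*nn.toarray()

    del chunk_matrix
    del chunk_df
    gc.collect()
    return csr_matrix(result)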


def recommend_orders(data_orders, target, sessions):
  """
    Appends the ordered products of every target's neighbours
    to the global dict `recommend` (updated in place, nothing is returned)
  """
  for i in range(len(target)):
    sess = sessions.getrow(i).data
    products = data_orders[data_orders.session.isin(sess)].aid.values
    if len(products) > 0:
      recommend[target[i]].append(products.tolist())

def neigh_products(data, target, sessions, recommend):
  """
    Appends the product aids of every target's neighbours to `recommend`
    (dict of lists, updated in place, nothing is returned)
  """
  for i in range(len(target)):
    sess = sessions.getrow(i).data
    products = data[data.session.isin(sess)].aid.values
    if len(products) > 0:
      recommend[target[i]].append(products.tolist())

from collections import Counter

def recommend_products(data, recommend, suffix, recommendations):
  """
    Arguments:
      data: events of one type (data_orders, data_clicks etc.)
      recommend: dict mapping every target to lists of its neighbours' products
      suffix: event type name appended to the session id in the result key
      recommendations: dict updated in place

    Fills recommendations with up to 20 most repeated products for each target,
    stored as a space-separated string
  """
  for k, v in recommend.items():
    products = sum(v, []) #flatten results
    omit = set(data[data.session == k].aid.values.tolist()) #products used in target before
    rec = [p for p in products if p not in omit]
    #first 20 most repeated products, ties keep the order of first appearance
    rec = [p for p, _ in Counter(rec).most_common(20)]
    recommendations[str(k) + '_' + suffix] = " ".join(str(i) for i in rec)
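
The ranking step described in the `recommend_products` docstring (most repeated neighbour products, without products already used in the target session) can be tried on toy data. This is a minimal sketch with invented values; `top_products` is a hypothetical helper, not a function from this notebook.

```python
from collections import Counter


def top_products(candidates, already_seen, limit=20):
    """Most frequent candidate aids, excluding those the session already has."""
    seen = set(already_seen)
    counts = Counter(aid for aid in candidates if aid not in seen)
    # most_common keeps first-appearance order among equal counts
    return [aid for aid, _ in counts.most_common(limit)]


# products of three neighbouring sessions, invented for the example
neighbour_aids = [[5, 7, 7], [7, 9, 5], [3]]
flat = [aid for group in neighbour_aids for aid in group]

print(top_products(flat, already_seen=[3]))  # [7, 5, 9]
```

Counting once with `Counter` avoids calling `list.count` for every element, which matters when the flattened list is long.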

In [None]:
targets_all = test_time.session.unique() #all sessions in test dataset

In [None]:
test_time.session.nunique()

In [None]:
chunks_targets = list(range(0, test_time.session.nunique(), 10000)) + [test_time.session.nunique()]

### KNN for orders

In [None]:
data.aid.nunique()

1027688

In [None]:
list(range(0, data.aid.nunique(), 1000))[-1]

1027000

In [None]:
chunks_products = list(range(0, data.aid.nunique(), 1000)) + [data.aid.nunique()]
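
The boundary lists built this way are later consumed pairwise with `zip(chunks, chunks[1:])`. The sketch below wraps that pattern in a hypothetical helper, `chunk_bounds`, and applies it to the counts printed in this notebook to show how many chunks result.

```python
def chunk_bounds(total, step):
    """Return (start, stop) index pairs that cover range(total) in steps of `step`."""
    edges = list(range(0, total, step)) + [total]
    return list(zip(edges, edges[1:]))


product_chunks = chunk_bounds(1027688, 1000)    # unique aids, 1000 per chunk
target_chunks = chunk_bounds(1617733, 10000)    # test sessions, 10000 per chunk

print(len(product_chunks), product_chunks[0], product_chunks[-1])
# 1028 (0, 1000) (1027000, 1027688)
print(len(target_chunks))
# 162
```

Every pass over the test sessions therefore runs the inner loop over 1028 product chunks, which is where most of the run time goes.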

In [None]:
recomm_orders = {}

In [None]:
data_orders = data[data.type == 3]
for t1, t2 in tq.tqdm(zip(chunks_targets, chunks_targets[1:])):

  recommend = {key: [] for key in targets_all[t1:t2]} 

  for i1, i2 in zip(chunks_products, chunks_products[1:]):

    chunk = data[data['aid'].isin(data.aid.unique()[i1:i2])]
    targets = chunk[chunk.session.isin(targets_all[t1:t2])].session.unique() #check which test sessions are in chunk (to predict)
    if len(targets) > 0:
      res = KNN_chunk(chunk, targets, n_neighbors=5)
      recommend_orders(data_orders, targets, res)

      del chunk
      gc.collect()
    else:
      del chunk
      gc.collect()

  recommend_products(data_orders, recommend, 'orders', recomm_orders)
  joblib.dump(recomm_orders,'orders_part1.joblib');

  0%|          | 0/163 [00:00<?, ?it/s]

## KNN model

In [None]:
recomm_orders = {}
recomm_clicks = {}
recomm_carts = {}

targets_all = test_time.session.unique().to_pandas() #all sessions in test dataset (test itself was deleted above)

chunks_products = list(range(0, data.aid.nunique(), 1000)) + [data.aid.nunique()]
chunks_targets = list(range(0, test_time.session.nunique(), 10000)) + [test_time.session.nunique()]

In [None]:
data_orders = data[data.type == 3]
data_clicks = data[data.type == 1]
data_carts = data[data.type == 2]


for t1, t2 in tq.tqdm(list(zip(chunks_targets, chunks_targets[1:]))):
    recommend_orders = {key: [] for key in targets_all[t1:t2]} 
    recommend_clicks = {key: [] for key in targets_all[t1:t2]} 
    recommend_carts = {key: [] for key in targets_all[t1:t2]} 

    for i1, i2 in tq.tqdm(list(zip(chunks_products, chunks_products[1:]))):
      chunk = data[data['aid'].isin(data.aid.unique()[i1:i2])]
      targets = chunk[chunk.session.isin(targets_all[t1:t2])].session.unique() #check which test sessions are in chunk (to predict)
      if len(targets) > 0:
        res = KNN_chunk(chunk, targets, n_neighbors=5)
        neigh_products(data_orders, targets, res, recommend_orders)
        neigh_products(data_clicks, targets, res, recommend_clicks)
        neigh_products(data_carts, targets, res, recommend_carts)

        del chunk
        gc.collect()
      else:
        del chunk
        gc.collect()
    
    recommend_products(data_clicks, recommend_clicks, 'clicks', recomm_clicks)
    joblib.dump(recomm_clicks,'KNNclicks_part1.joblib');
    del recommend_clicks
    gc.collect()
    recommend_products(data_carts, recommend_carts, 'carts', recomm_carts)
    joblib.dump(recomm_carts,'KNNcarts_part1.joblib');
    del recommend_carts
    gc.collect()
    recommend_products(data_orders, recommend_orders, 'orders', recomm_orders)
    joblib.dump(recomm_orders,'KNNorders_part1.joblib');
    del recommend_orders
    gc.collect()

# KDTree

In [None]:
def KDTree_chunk(chunk, targets, n_neighbors=10, leaf_size=500, metric='cityblock'):
    """ KDTree model for chunks
    Arguments:
        chunk: part of data
        targets: sessions from test dataset in chunk
        
    Returns:
        list of lists: for every target, the session ids of its n_neighbors nearest sessions in chunk
    """

    chunk_df = chunk.pivot_table(index='session', columns='aid', values='type').fillna(0)
    #chunk_matrix = csr_matrix(chunk_df.values)
    
    kdt = KDTree(chunk_df, leaf_size=leaf_size, metric=metric)
    ind = kdt.query(chunk_df[chunk_df.index.isin(targets)].values, k=n_neighbors, return_distance=False)
    
    result = [[chunk_df.index.values[i].tolist() for i in ind[j]] for j in range(len(ind))]

    #del chunk_matrix
    del chunk_df
    gc.collect()
    return result

def neigh_products_KD(data, target, sessions, recommend):
  """         
    Returns:
      recommend: array with all products aids of targets' neighbours 
  """
  for i in range(len(target)):
    #sess = sessions.getrow(i).data
    products = data[data.session.isin(sessions[i])].aid.values
    if len(products) > 0:
      recommend[target[i]].append(products.tolist())

from collections import Counter

def recommend_products(data, recommend, suffix, recommendations):
  """
    Arguments:
      data: events of one type (data_orders, data_clicks etc.)
      recommend: dict mapping every target to lists of its neighbours' products
      suffix: event type name appended to the session id in the result key
      recommendations: dict updated in place

    Fills recommendations with up to 20 most repeated products for each target,
    stored as a space-separated string
  """
  for k, v in recommend.items():
    products = sum(v, []) #flatten results
    omit = set(data[data.session == k].aid.values.tolist()) #products used in target before
    rec = [p for p in products if p not in omit]
    #first 20 most repeated products, ties keep the order of first appearance
    rec = [p for p, _ in Counter(rec).most_common(20)]
    recommendations[str(k) + '_' + suffix] = " ".join(str(i) for i in rec)

In [None]:
recomm_orders = {}
recomm_clicks = {}
recomm_carts = {}

targets_all = test_time.session.unique() #all sessions in test dataset

chunks_products = list(range(0, data.aid.nunique(), 1000)) + [data.aid.nunique()]
chunks_targets = list(range(0, test_time.session.nunique(), 10000)) + [test_time.session.nunique()]

## Cityblock metric

In [None]:
KDTree.valid_metrics

['euclidean',
 'l2',
 'minkowski',
 'p',
 'manhattan',
 'cityblock',
 'l1',
 'chebyshev',
 'infinity']

In [None]:
data_orders = data[data.type == 3]
data_clicks = data[data.type == 1]
data_carts = data[data.type == 2]


for t1, t2 in tq.tqdm(list(zip(chunks_targets, chunks_targets[1:]))):
    recommend_orders = {key: [] for key in targets_all[t1:t2]} 
    recommend_clicks = {key: [] for key in targets_all[t1:t2]} 
    recommend_carts = {key: [] for key in targets_all[t1:t2]} 

    for i1, i2 in tq.tqdm(list(zip(chunks_products, chunks_products[1:]))):
        chunk = data[data['aid'].isin(data.aid.unique()[i1:i2])]
        targets = chunk[chunk.session.isin(targets_all[t1:t2])].session.unique() #check which test sessions are in chunk (to predict)
        if len(targets) > 0:
          res = KDTree_chunk(chunk, targets, n_neighbors=5, leaf_size=50000)
          neigh_products_KD(data_orders, targets, res, recommend_orders)
          neigh_products_KD(data_clicks, targets, res, recommend_clicks)
          neigh_products_KD(data_carts, targets, res, recommend_carts)

          del chunk
          gc.collect()
        else:
          del chunk
          gc.collect()
      
    recommend_products(data_clicks, recommend_clicks, 'clicks', recomm_clicks)
    joblib.dump(recomm_clicks,'KDTclicks_part1.joblib');
    del recommend_clicks
    gc.collect()
    recommend_products(data_carts, recommend_carts, 'carts', recomm_carts)
    joblib.dump(recomm_carts,'KDTcarts_part1.joblib');
    del recommend_carts
    gc.collect()
    recommend_products(data_orders, recommend_orders, 'orders', recomm_orders)
    joblib.dump(recomm_orders,'KDTorders_part1.joblib');
    del recommend_orders
    gc.collect()

  0%|          | 0/162 [00:00<?, ?it/s]

  0%|          | 0/1028 [00:00<?, ?it/s]

`leaf_size=50` - it takes over 10 h per iteration. 

`leaf_size=500` - it takes over 10 h per iteration. 

`leaf_size=5000` - it takes over 3 h per iteration.  

`leaf_size=50000` - it takes over 10 h per iteration. 

## Cosine metric

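The `KDTree` class does not support cosine distance as a metric, so the data need to be transformed to obtain an estimate of the cosine metric: 

$C(u, v) = {E(u_{norm}, v_{norm})^2}/2$, where $C$ is the cosine distance (1 - cosine similarity), $E$ is the Euclidean distance, and $u_{norm}$, $v_{norm}$ are the L2-normalised vectors. 

$C(u, v) \propto {E(u_{norm}, v_{norm})^2}$, so ranking neighbours by the Euclidean distance between normalised vectors gives the same order as ranking them by cosine distance. 

$C(u, v) \approx {E(arctanh(u_{norm}), arctanh(v_{norm}))}$ is the approximation used in the function below. 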
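
The first identity above can be checked numerically. The sketch below uses only the standard library and two arbitrary example vectors; it verifies $C = E^2/2$ for L2-normalised vectors and does not cover the `arctanh` approximation.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u, v = [1.0, 2.0, 0.0], [3.0, 0.0, 1.0]   # arbitrary example vectors
c = cosine_distance(u, v)
e = euclidean(l2_normalise(u), l2_normalise(v))

print(c, e ** 2 / 2)
print(math.isclose(c, e ** 2 / 2))
```

Because the squared Euclidean distance between unit vectors equals $2 - 2\cos(u, v)$, the two printed numbers agree up to floating-point error.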

In [None]:
def KDTree_chunk_trans(chunk, targets, n_neighbors=10, leaf_size=500, metric='euclidean'):
    """ KDTree model for chunks, with data transformed to approximate the cosine metric
    Arguments:
        chunk: part of data
        targets: sessions from test dataset in chunk
        
    Returns:
        list of lists: for every target, the session ids of its n_neighbors nearest sessions in chunk
    """

    def transform(matrix):
        # L2-normalise the rows, apply arctanh, then cap values beyond +-1 at +-0.99
        # (arctanh(1) is infinite)
        matrix = np.arctanh(preprocessing.normalize(matrix, norm='l2'))
        matrix[matrix >= 1] = 0.99
        matrix[matrix <= -1] = -0.99
        return matrix

    chunk_df = chunk.pivot_table(index='session', columns='aid', values='type').fillna(0)
    chunk_matrix = transform(chunk_df.values)

    kdt = KDTree(chunk_matrix, leaf_size=leaf_size, metric=metric)

    # queries go through the same transform as the indexed data
    neigh = transform(chunk_df[chunk_df.index.isin(targets)].values)

    ind = kdt.query(neigh, k=n_neighbors, return_distance=False)
    
    result = [[chunk_df.index.values[i].tolist() for i in ind[j]] for j in range(len(ind))]

    #del chunk_matrix
    del chunk_df
    gc.collect()
    return result


In [None]:
data_orders = data[data.type == 3]
data_clicks = data[data.type == 1]
data_carts = data[data.type == 2]


for t1, t2 in tq.tqdm(list(zip(chunks_targets, chunks_targets[1:]))):
    recommend_orders = {key: [] for key in targets_all[t1:t2]} 
    recommend_clicks = {key: [] for key in targets_all[t1:t2]} 
    recommend_carts = {key: [] for key in targets_all[t1:t2]} 

    for i1, i2 in tq.tqdm(list(zip(chunks_products, chunks_products[1:]))):
        chunk = data[data['aid'].isin(data.aid.unique()[i1:i2])]
        targets = chunk[chunk.session.isin(targets_all[t1:t2])].session.unique() #check which test sessions are in chunk (to predict)
        if len(targets) > 0:
          res = KDTree_chunk_trans(chunk, targets, n_neighbors=5, leaf_size=50000)
          neigh_products_KD(data_orders, targets, res, recommend_orders)
          neigh_products_KD(data_clicks, targets, res, recommend_clicks)
          neigh_products_KD(data_carts, targets, res, recommend_carts)

          del chunk
          gc.collect()
        else:
          del chunk
          gc.collect()
      
    recommend_products(data_clicks, recommend_clicks, 'clicks', recomm_clicks)
    joblib.dump(recomm_clicks,'clicks_part1.joblib');
    del recommend_clicks
    gc.collect()
    recommend_products(data_carts, recommend_carts, 'carts', recomm_carts)
    joblib.dump(recomm_carts,'carts_part1.joblib');
    del recommend_carts
    gc.collect()
    recommend_products(data_orders, recommend_orders, 'orders', recomm_orders)
    joblib.dump(recomm_orders,'orders_part1.joblib');
    del recommend_orders
    gc.collect()

  0%|          | 0/162 [00:00<?, ?it/s]

  0%|          | 0/1028 [00:00<?, ?it/s]

It takes over 5 h per iteration. 

In [None]:
#Hamming Distance 

# Matrix factorization with SVD

In [None]:
data = pd.concat([train, test])

In [None]:
data = data.drop(columns=['ts'])
data['type'] = data['type'] + 1 #to make sparse matrix with pivot (NaN replaced by 0)

In [None]:
def SVD(chunk, targets, recommend, n_products=20, k=10):
    """ Truncated SVD model for chunks
    Arguments:
        chunk: part of data
        targets: sessions from test dataset in chunk
        recommend: dict {session: [[[aid, score], ...]]}, updated in place
        n_products: number of products recommendations
        k: number of singular values and vectors to compute
        
    Returns:
        None: the best n_products [aid, score] pairs of every target are merged into recommend
    """

    chunk_df = chunk.pivot_table(index='session', columns='aid', values='type').fillna(0)
    chunk_matrix = csr_matrix(chunk_df.values)

    #svds needs a floating-point matrix; it returns U, the singular values and V transposed
    u, s, vt = svds(chunk_matrix.astype(float), k=k)

    del chunk_matrix
    gc.collect()

    pred = np.dot(np.dot(u, np.diag(s)), vt) #rank-k reconstruction used as scores
    #pred = normalize(pred)
    #replace used products values with 0
    pred[chunk_df.values > 0] = 0
    #swap values in pivot df
    chunk_df[:] = pred
  
    for t in targets:
        #order the products (columns) by the score in the row of session t
        sorted_df = chunk_df[chunk_df.index == t].sort_values(by = t, axis = 1, ascending = False)
        products = zip(sorted_df.columns.values[:n_products], sorted_df.values[0][:n_products])
        recommend[t].append(list(map(list, products)))
        recommend[t] = sum(recommend[t], []) #flatten array
        recommend[t] = [sorted(recommend[t], key=lambda x: x[1], reverse=True)[:n_products]]

    del chunk_df
    gc.collect()

def recommend_products(recommend, suffix, recommendations):
  """
    Arguments:
      recommend: dict {session: [[[aid, score], ...]]}, pairs sorted by score
      suffix: event type ('clicks', 'carts' or 'orders') appended to the session id
    Returns:
      recommendations: dict updated in place with the first 20 products for each target
  """
    
  for k, v in recommend.items():
    products = sum(v, []) #flatten results
    #keep the product ids of the first 20 pairs
    products = [i[0] for i in products[:20]]
    recommendations[str(k) + '_' + suffix] = " ".join(str(i) for i in products)
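
The idea behind `SVD` above can be tried on a toy matrix. The sketch below is an illustration under assumptions, not the notebook's code: it uses `np.linalg.svd` truncated to `k` components instead of `svds`, it hides already seen products with `-inf` rather than 0, and the helper names `low_rank_scores` and `top_unseen` as well as the product ids are made up.

```python
import numpy as np

# Toy session x product matrix: 0 = no interaction, 1 = interaction.
# Rows are sessions, columns are hypothetical product ids.
product_ids = np.array([101, 102, 103, 104])
interactions = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

def low_rank_scores(matrix, k):
    """Rank-k reconstruction of `matrix` from its k largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def top_unseen(matrix, scores, row, n):
    """Column indices of the n best-scored products the session has not seen."""
    masked = np.where(matrix[row] > 0, -np.inf, scores[row])
    order = np.argsort(-masked)
    return [int(i) for i in order if np.isfinite(masked[i])][:n]

scores = low_rank_scores(interactions, k=2)
best = top_unseen(interactions, scores, row=0, n=1)
print(product_ids[best])
```

Masking with `-inf` keeps a seen product from outranking an unseen one whose reconstructed score happens to be negative, which masking with 0 does not guarantee.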

In [None]:
data_orders = data[data.type == 3]
data_clicks = data[data.type == 1]
data_carts = data[data.type == 2]

del data
gc.collect()

data_type = [data_orders, data_clicks, data_carts]

recommendations = {}

targets_all = test.session.unique() #all sessions in test dataset

del train
del test
gc.collect()

0

In [None]:
# with np.load('clicks_values_part1.npz') as data:
#     values = data['arr_0'].tolist()
# with np.load('clicks_keys_part1.npz') as data:
#     keys = data['arr_0'].astype(int)
# recommend = {k: v for k, v in zip(keys, values)}

In [None]:
# del values, keys
# gc.collect()

11

In [None]:
#to resume an interrupted run, load the saved recommend and change the range below for the other event types

for d in range(2, len(data_type)): #index 2 = carts only (0 = orders, 1 = clicks)
  data = data_type[d]
  chunks_products = list(range(0, data.aid.nunique(), 1000)) + [data.aid.nunique()]
  chunks_list = list(zip(chunks_products, chunks_products[1:]))
  recommend = {key: [[[0.0, 0.0]]*20] for key in targets_all} 
  #recommend = np.load('clicks_part1.npy');


  for i1, i2 in tq.tqdm(chunks_list):
          chunk = data[data['aid'].isin(sorted(data.aid.unique())[i1:i2])] #sorted: assumption that similar products are numbered similarly
          targets = chunk[chunk.session.isin(targets_all)].session.unique() #test sessions that are present in the chunk;
                                                                            #for absent sessions SVD shows a constant in every column anyway
          if len(targets) > 0:
            SVD(chunk, targets, recommend, k=10)
          del chunk
          gc.collect()

          if i1%150000 == 0: #save every 150 chunks (i1 grows by 1000 per chunk)
            if d == 0:          
              joblib.dump(recommend, 'orders_part1.joblib');
            elif d == 1: 
              recommend_keys = np.array(list(recommend.keys()), dtype=int)
              np.savez_compressed('clicks_keys_part1.npz', recommend_keys, allow_pickle=False);
              recommend_values = np.array(list(recommend.values()), dtype=float)
              np.savez_compressed('clicks_values_part1.npz', recommend_values, allow_pickle=False);

              del recommend_keys, recommend_values
              gc.collect()
            else:
              recommend_keys = np.array(list(recommend.keys()), dtype=int)
              np.savez_compressed('carts_keys_part1.npz', recommend_keys, allow_pickle=False);
              recommend_values = np.array(list(recommend.values()), dtype=float)
              np.savez_compressed('carts_values_part1.npz', recommend_values, allow_pickle=False);

              del recommend_keys, recommend_values
              gc.collect()


  if d == 0:
          recommend_products(recommend, 'orders', recommendations)
          joblib.dump(recommendations, 'orders_recomm.joblib');
  elif d == 1:
          recommend_products(recommend, 'clicks', recommendations)
          joblib.dump(recommendations, 'clicks_recomm.joblib');
  else:
          recommend_products(recommend, 'carts', recommendations)
          joblib.dump(recommendations, 'carts_recomm.joblib');
      


  0%|          | 0/381 [00:00<?, ?it/s]
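
The loop above walks over product ranges and merges each chunk's candidates into a running top-20 per session. A minimal standard-library sketch of those two steps is given here, assuming candidates are `(product, score)` pairs; `chunk_bounds`, `merge_top_k` and the toy candidates are hypothetical names and data, not part of the notebook.

```python
import heapq

def chunk_bounds(n_items, chunk_size):
    """Return (start, stop) index pairs that cover range(n_items)."""
    starts = range(0, n_items, chunk_size)
    return [(start, min(start + chunk_size, n_items)) for start in starts]

def merge_top_k(current, new_candidates, k):
    """Keep the k highest-scored (product, score) pairs from both lists."""
    return heapq.nlargest(k, current + new_candidates, key=lambda pair: pair[1])

# Hypothetical per-chunk candidates for a single session.
per_chunk_candidates = [
    [(11, 0.40), (12, 0.10)],
    [(23, 0.90)],
    [(31, 0.05), (35, 0.60)],
]

best = []
for candidates in per_chunk_candidates:
    best = merge_top_k(best, candidates, k=3)

print(chunk_bounds(2500, 1000))  # [(0, 1000), (1000, 2000), (2000, 2500)]
print(best)                      # [(23, 0.9), (35, 0.6), (11, 0.4)]
```

Because only the current top `k` is kept between chunks, memory per session stays bounded no matter how many chunks are processed.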

**Fixing bugs**

In [None]:
with np.load('clicks_values_part1.npz') as data:
    values = data['arr_0'].astype(int).tolist()
with np.load('clicks_keys_part1.npz') as data:
    keys = data['arr_0'].astype(int)
#recommend = {k: v for k, v in zip(keys, values)}

In [None]:
targets_all = list(range(12899779, 14571582))

In [None]:
recommend = {key: [] for key in targets_all} 

In [None]:
for k, v in zip(keys, values):
  recommend[k] = v

In [None]:
recommendations = {}
recommend_products(recommend, 'clicks', recommendations)

In [None]:
len(recommendations)*3

5015409

In [None]:
joblib.dump(recommendations, 'clicks_recomm.joblib');

In [None]:
with np.load('carts_values_part1.npz') as data:
    values = data['arr_0'].astype(int).tolist()
with np.load('carts_keys_part1.npz') as data:
    keys = data['arr_0'].astype(int)

targets_all = list(range(12899779, 14571582))
recommend = {key: [] for key in targets_all} 

for k, v in zip(keys, values):
  recommend[k] = v

recommendations = {}
recommend_products(recommend, 'carts', recommendations)

In [None]:
joblib.dump(recommendations, 'carts_recomm.joblib');

In [None]:
orders_part = joblib.load('orders_part1.joblib') #renamed from ord, which shadows the Python builtin

targets_all = list(range(12899779, 14571582))
recommend = {key: [] for key in targets_all} 

for k, v in orders_part.items():
  recommend[k] = v

recommendations = {}
recommend_products(recommend, 'orders', recommendations)

In [None]:
joblib.dump(recommendations, 'orders_recomm.joblib');

##Recommendation of most popular products

Targets whose recommendation is empty or consists only of zero placeholders receive the most popular products of the given event type instead.

In [None]:
def most_popular(data, k):
  """Return the k most frequent product ids (aid) in data."""
  u, count = np.unique(data.aid.values, return_counts=True)
  count_sort_ind = np.argsort(-count)
  return u[count_sort_ind][:k]


def most_popular_fill(recommendations, popular):
  """Replace empty or all-zero recommendations (in place) with the popular products."""
  empty_labels = ('', ' '.join(['0'] * 20)) #no products, or only the 20 zero placeholders
  fallback = " ".join(str(i) for i in popular)
  for key, labels in recommendations.items():
    if labels in empty_labels:
      recommendations[key] = fallback
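
The fallback can be checked on toy data. The sketch below uses only the standard library and is an assumption-laden variant rather than the notebook's functions: `most_popular_items`, `is_empty_label` and `fill_empty` are hypothetical names, and a label counts as empty when every token is numerically zero, which is looser than the exact string comparison used above.

```python
from collections import Counter

def most_popular_items(events, k):
    """Return the k most frequent product ids in `events`."""
    return [item for item, _ in Counter(events).most_common(k)]

def is_empty_label(label):
    """True when a label string has no tokens or only zero tokens."""
    return all(float(token) == 0 for token in label.split())

def fill_empty(labels_by_key, popular):
    """Replace empty labels in place with the space-joined popular items."""
    fallback = " ".join(str(item) for item in popular)
    for key, label in labels_by_key.items():
        if is_empty_label(label):
            labels_by_key[key] = fallback

# Toy data: product 7 appears three times, product 3 twice, product 9 once.
events = [7, 3, 7, 9, 3, 7]
labels = {"1_orders": "", "2_orders": "0 0 0", "3_orders": "5 0 0"}
fill_empty(labels, most_popular_items(events, k=2))
print(labels)  # {'1_orders': '7 3', '2_orders': '7 3', '3_orders': '5 0 0'}
```

A label that mixes real products with zero placeholders, such as `"5 0 0"`, is left untouched by this check.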

In [None]:
orders_rec = joblib.load('orders_recomm.joblib');
most_ordered = most_popular(data_orders, k=20)
most_popular_fill(orders_rec, most_ordered)

In [None]:
clicks_rec = joblib.load('clicks_recomm.joblib');
most_clicked = most_popular(data_clicks, k=20)
most_popular_fill(clicks_rec, most_clicked)

In [None]:
carts_rec = joblib.load('carts_recomm.joblib');
most_carted = most_popular(data_carts, k=20)
most_popular_fill(carts_rec, most_carted)

#Submission file

In [None]:
len(carts_rec)

1671803

In [None]:
len(orders_rec)

1671803

In [None]:
len(clicks_rec)

1671803

Submission layout:

```
session_type,labels
12906577_clicks,135193 129431 119318 ...
12906577_carts,135193 129431 119318 ...
12906577_orders,135193 129431 119318 ...
12906578_clicks,135193 129431 119318 ...
etc.
```
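
As a rough illustration of this layout, the rows can be assembled by looking up each session's carts and orders by key, so the result does not depend on the three dicts sharing the same order. This is a standard-library sketch under assumptions: `build_rows` and the toy dicts are hypothetical stand-ins for `clicks_rec`, `carts_rec` and `orders_rec`.

```python
import csv
import io

def build_rows(clicks, carts, orders):
    """Interleave per-type label dicts into (session_type, labels) rows.

    Keys are expected to look like '<session>_<type>'; sessions are taken
    from `clicks` and looked up by name in the other two dicts.
    """
    rows = []
    for click_key, click_labels in clicks.items():
        session = click_key.rsplit("_", 1)[0]
        rows.append((click_key, click_labels))
        rows.append((f"{session}_carts", carts[f"{session}_carts"]))
        rows.append((f"{session}_orders", orders[f"{session}_orders"]))
    return rows

# Toy stand-ins for the three recommendation dicts.
clicks = {"1_clicks": "10 11", "2_clicks": "12"}
carts = {"2_carts": "20", "1_carts": "21 22"}   # deliberately out of order
orders = {"1_orders": "30", "2_orders": "31 32"}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["session_type", "labels"])
writer.writerows(build_rows(clicks, carts, orders))
print(buffer.getvalue())
```

A missing session raises a `KeyError` here instead of silently pairing a key with another session's labels.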



In [None]:
submission = pd.DataFrame(columns=['session_type', 'labels'], data=[[0, 0]]*3*1671803)

In [None]:
submission.iloc[::3, 0] = list(clicks_rec.keys())
submission.iloc[1::3, 0] = list(carts_rec.keys())
submission.iloc[2::3, 0] = list(orders_rec.keys())

In [None]:
submission.iloc[::3, 1] = list(clicks_rec.values())
submission.iloc[1::3, 1] = list(carts_rec.values())
submission.iloc[2::3, 1] = list(orders_rec.values())

In [None]:
submission.head()

Unnamed: 0,session_type,labels
0,12899779_clicks,59594 58211 58965 58619 58830 58317 58386 5856...
1,12899779_carts,122983 1460571 1116095 554660 166037 1006198 1...
2,12899779_orders,122983 1445562 1531805 1460571 1534690 332654 ...
3,12899780_clicks,736915 736915 974030 974968 1141175 736999 736...
4,12899780_carts,122983 1460571 1116095 554660 166037 1006198 1...


In [None]:
submission.tail()

Unnamed: 0,session_type,labels
5015404,14571580_carts,122983 1460571 1116095 554660 166037 1006198 1...
5015405,14571580_orders,122983 1445562 1531805 1460571 1534690 332654 ...
5015406,14571581_clicks,1099010 1100142 1099464 1098720 1098934 109941...
5015407,14571581_carts,122983 1460571 1116095 554660 166037 1006198 1...
5015408,14571581_orders,122983 1445562 1531805 1460571 1534690 332654 ...


In [None]:
submission.to_csv("submission_v1.csv", index=False) #1 671 803 sessions (ids 12899779..14571581) x 3 types = 5 015 409 rows