The MIT License (MIT)

Copyright (c) 2021 NVIDIA CORPORATION

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

In [1]:
!python --version

Python 3.7.10


In [2]:
# Check the CUDA toolkit version and the nvcc location
!nvcc -V && which nvcc

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
/usr/local/cuda/bin/nvcc


In [3]:
# Ubuntu version
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic


In [4]:
# Check GPU
!nvidia-smi

Mon Apr 12 09:42:25 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   67C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
device_name = pynvml.nvmlDeviceGetName(handle)
print(device_name)
if device_name != b'Tesla T4' and device_name != b'Tesla P100-PCIE-16GB':
  raise Exception("""
    Unfortunately this instance does not have a T4 or P100 GPU.

    Please make sure you've configured Colab to request a GPU instance type.

    Sometimes Colab allocates a Tesla K80 instead of a T4.
    If you get a K80 GPU, try Runtime -> Reset all runtimes...
  """)
else:
  print('Woo! You got the right kind of GPU!')

b'Tesla T4'
Woo! You got the right kind of GPU!
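
The check above compares the device name against byte strings. Depending on the installed pynvml version, `nvmlDeviceGetName` may return a `str` rather than `bytes`, in which case the comparison would fail even on a T4. A minimal sketch of a version-tolerant check follows; `SUPPORTED_GPUS` and `is_supported_gpu` are hypothetical names that are not part of the original notebook.

```python
# Hypothetical helper, not part of the original notebook.
# The two names are the ones accepted by the cell above.
SUPPORTED_GPUS = {"Tesla T4", "Tesla P100-PCIE-16GB"}

def is_supported_gpu(device_name):
    """Return True if the device name (bytes or str) is an accepted GPU."""
    if isinstance(device_name, bytes):
        # Normalize to str so both return types compare the same way.
        device_name = device_name.decode("utf-8")
    return device_name in SUPPORTED_GPUS

print(is_supported_gpu(b"Tesla T4"))
print(is_supported_gpu("Tesla K80"))
```

Passing the result of `pynvml.nvmlDeviceGetName(handle)` to such a helper would keep the rest of the cell unchanged.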


In [6]:
# Install RAPIDS
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.18

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 213 (delta 22), reused 3 (delta 0), pack-reused 171
Receiving objects: 100% (213/213), 64.29 KiB | 12.86 MiB/s, done.
Resolving deltas: 100% (84/84), done.
PLEASE READ
********************************************************************************************************
Changes:
1. IMPORTANT SCRIPT CHANGES: Colab has updated to Python 3.7, and now runs our STABLE and NIGHTLY versions (0.18 and 0.19)!  PLEASE update your older install script code as follows:
	!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.18

	import sys, os

	dist_package_index = sys.path.index('/usr/local/lib/python3.7/dist-packages')
	sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.7/site-packages'] + sys.path[dist_package_index:]
	sys.path
	exec(open('rapidsai-csp-utils/colab/update_modules.py').rea

In [12]:
# delete package cache
!conda clean -a -y

Cache location: 
There are no tarballs to remove
Cache location: /usr/local/pkgs
Will remove the following packages:
/usr/local/pkgs
---------------

python-3.8.5-h7579374_1                    111.3 MB
conda-4.9.2-py38h06a4308_0                  11.7 MB
libgfortran-ng-7.5.0-h14aa051_18              92 KB
pycosat-0.6.3-py38h7b6447c_1                 424 KB
libgcc-ng-9.1.0-hdf63c60_0                  25.6 MB
ca-certificates-2020.10.14-0                 227 KB
pyct-0.4.6-py_0                                7 KB
six-1.15.0-py38h06a4308_0                     88 KB
liblapack-3.9.0-8_openblas                    41 KB
_libgcc_mutex-0.1-conda_forge                  6 KB
arrow-cpp-proc-3.0.0-cuda                    134 KB
rapids-0.18.0-cuda11.0_py37_g334c31e_223      14 KB
dask-2021.4.0-pyhd8ed1ab_0                    11 KB
pip-20.2.4-py38h06a4308_0                    7.9 MB
openssl-1.1.1h-h7b6447c_0                   13.4 MB
cryptography-3.2.1-py38h3c74f83_1            3.5 MB
setuptools-50.3.1-

In [14]:
import sys, os

dist_package_index = sys.path.index('/usr/local/lib/python3.7/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.7/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

***********************************************************************
Let us check on those pyarrow and cffi versions...
***********************************************************************

You're don't have pyarrow.
unloaded cffi 1.14.5
loaded cffi 1.14.5
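
The `sys.path` cell above places the conda `site-packages` directory just in front of Colab's `dist-packages`. As written, it raises `ValueError` if `dist-packages` is not on the path and adds a duplicate entry each time the cell is re-run. A minimal sketch of a more forgiving version, operating on a plain list, is shown here; `insert_before` is a hypothetical helper, and the `/content` and `/usr/lib/python3` entries are illustrative only.

```python
def insert_before(paths, anchor, new_entry):
    """Return a copy of `paths` with `new_entry` placed just before `anchor`.

    If `new_entry` is already present the list is returned unchanged;
    if `anchor` is missing, `new_entry` is appended at the end.
    """
    paths = list(paths)
    if new_entry in paths:
        return paths
    try:
        idx = paths.index(anchor)
    except ValueError:
        idx = len(paths)
    return paths[:idx] + [new_entry] + paths[idx:]

# Illustrative path list; only the two python3.7 entries come from the notebook.
example = ["/content", "/usr/local/lib/python3.7/dist-packages", "/usr/lib/python3"]
updated = insert_before(
    example,
    "/usr/local/lib/python3.7/dist-packages",
    "/usr/local/lib/python3.7/site-packages",
)
print(updated)
```

In the notebook this would be applied as `sys.path = insert_before(sys.path, ...)`.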


In [None]:
# Critical imports
# import nvstrings, nvcategory
import cudf
import cuml
import os
import numpy as np
import pandas as pd

In [None]:
gdf = cudf.Series([1, 2, 3, 4, 5, 6])
print(gdf)
print(type(gdf))

In [16]:
import pandas as pd
import cudf
from sklearn.model_selection import GroupKFold

pd.__version__, cudf.__version__

('1.1.5', '0.18.1')

In [17]:
from numba import cuda

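def get_order_in_group(utrip_id_, order):
    # Kernel body run by apply_grouped once per group: each thread strides
    # over the group's rows and stores the row's position within the group.
    for i in range(cuda.threadIdx.x, len(utrip_id_), cuda.blockDim.x):
        order[i] = i

def add_cumcount(df, sort_col, outputname):
    # Sort first so that the position within each trip follows sort_col.
    df = df.sort_values(sort_col, ascending=True)
    tmp = df[['utrip_id_', 'checkin']].groupby(['utrip_id_']).apply_grouped(
        get_order_in_group, incols=['utrip_id_'],
        outcols={'order': 'int32'},
        tpb=32)
    tmp.columns = ['utrip_id_', 'checkin', outputname]
    # Join the per-trip position back onto the full frame.
    df = df.merge(tmp, how='left', on=['utrip_id_', 'checkin'])
    df = df.sort_values(sort_col, ascending=True)
    return df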
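
For reference, the value that `get_order_in_group` writes is each row's zero-based position within its trip, which is what pandas calls a cumulative count. The following CPU-only sketch reproduces the idea with `groupby(...).cumcount()` on a small made-up frame; it is an illustration of the technique, not the GPU code path used in this notebook.

```python
import pandas as pd

# Toy data (made up for illustration): two trips with 3 and 2 check-ins.
toy = pd.DataFrame({
    "utrip_id_": [0, 0, 0, 1, 1],
    "checkin": ["2016-07-09", "2016-07-11", "2016-07-13",
                "2016-05-15", "2016-05-16"],
})

# Sort so that the position within each trip follows the check-in date,
# then number the rows of each trip starting from 0.
toy = toy.sort_values(["utrip_id_", "checkin"])
toy["dcount"] = toy.groupby("utrip_id_").cumcount()

print(toy["dcount"].tolist())  # [0, 1, 2, 0, 1]
```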

In [65]:
# cudf.read_csv could not read all records of test_set.csv directly from the
# raw.githubusercontent.com URL, so download the files first
!wget https://raw.githubusercontent.com/louislung/deeplearning/main/WSDM2021/00_Data/test_set.csv
!wget https://raw.githubusercontent.com/louislung/deeplearning/main/WSDM2021/00_Data/train_set.csv

--2021-04-12 10:39:47--  https://raw.githubusercontent.com/louislung/deeplearning/main/WSDM2021/00_Data/test_set.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29150292 (28M) [text/plain]
Saving to: ‘test_set.csv.1’


2021-04-12 10:39:48 (240 MB/s) - ‘test_set.csv.1’ saved [29150292/29150292]

--2021-04-12 10:39:48--  https://raw.githubusercontent.com/louislung/deeplearning/main/WSDM2021/00_Data/train_set.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 92651064 (88M) [text/plain]
Saving to: ‘train_set.csv’


2021-04-12 10

In [66]:
# Prefer the repository layout; fall back to the files downloaded above.
try:
    train = cudf.read_csv('../00_Data/train_set.csv').sort_values(by=['user_id','checkin'])
    test = cudf.read_csv('../00_Data/test_set.csv').sort_values(by=['user_id','checkin'])
except Exception:
    train = cudf.read_csv('train_set.csv').sort_values(by=['user_id','checkin'])
    test = cudf.read_csv('test_set.csv').sort_values(by=['user_id','checkin'])

# del train['Unnamed: 0']
# del test['row_num'], test['total_rows']

print(train.shape, test.shape)
train.head()

(1166835, 9) (378667, 9)


Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id
413669,29,2016-07-09,2016-07-11,47054,desktop,1601,Elbonia,Elbonia,29_1
413670,29,2016-07-11,2016-07-13,34444,desktop,1601,Elbonia,Elbonia,29_1
413671,29,2016-07-13,2016-07-16,12291,desktop,1601,Elbonia,Elbonia,29_1
413672,29,2016-07-16,2016-07-18,16386,desktop,8132,Elbonia,Elbonia,29_1
1128910,81,2016-05-15,2016-05-16,33665,desktop,9924,Elbonia,Elbonia,81_1


In [67]:
print('train set got %d unique users ' % train.user_id.unique().shape[0])
print('test  set got %d unique users ' % test.user_id.unique().shape[0])
print()
print('train set got %d unique trips ' % train.utrip_id.unique().shape[0])
print('test  set got %d unique trips ' % test.utrip_id.unique().shape[0])
print()
print('%d users appear in both sets' % train.user_id.unique().isin(test.user_id).sum())
print('%d trips appear in both sets' % train.utrip_id.unique().isin(test.utrip_id).sum())

train set got 200153 unique users 
test  set got 68502 unique users 

train set got 217686 unique trips 
test  set got 70662 unique trips 

9496 users appear in both sets
0 trips appear in both sets


In [83]:
# add column "istest", concat both train and test set
train['istest'] = 0
test['istest'] = 1
raw = cudf.concat([train, test], sort=False)
raw = raw.sort_values(['user_id', 'checkin'], ascending=True)
raw.head()

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id,istest
413669,29,2016-07-09,2016-07-11,47054,desktop,1601,Elbonia,Elbonia,29_1,0
413670,29,2016-07-11,2016-07-13,34444,desktop,1601,Elbonia,Elbonia,29_1,0
413671,29,2016-07-13,2016-07-16,12291,desktop,1601,Elbonia,Elbonia,29_1,0
413672,29,2016-07-16,2016-07-18,16386,desktop,8132,Elbonia,Elbonia,29_1,0
355509,65,2016-09-26,2016-09-29,36403,desktop,3577,The Devilfire Empire,Cobra Island,65_1,1


In [84]:
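# split the data into 5 folds, keeping each trip in a single fold,
# and record the fold number in a new column
raw['fold'] = 0
group_kfold = GroupKFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(group_kfold.split(X=raw, y=raw, groups=raw['utrip_id'].to_pandas())):
    # 10 is the positional index of the 'fold' column
    raw.iloc[test_index, 10] = fold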

raw['fold'].value_counts()

1    309101
0    309101
3    309100
2    309100
4    309100
Name: fold, dtype: int32
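
Because `GroupKFold` is grouped by `utrip_id`, every row of a trip should land in the same fold. A quick sanity check for that property can be sketched with pandas as follows; `groups_span_one_fold` is a hypothetical helper and the fold numbers in the toy frame are made up.

```python
import pandas as pd

def groups_span_one_fold(df, group_col="utrip_id", fold_col="fold"):
    """Return True if every group has all of its rows in a single fold."""
    return bool((df.groupby(group_col)[fold_col].nunique() == 1).all())

# Toy frame with made-up fold assignments: each trip sits in one fold.
toy = pd.DataFrame({
    "utrip_id": ["29_1", "29_1", "81_1", "65_1", "65_1"],
    "fold":     [0, 0, 1, 2, 2],
})
print(groups_span_one_fold(toy))

# Moving one row of trip 29_1 to another fold breaks the property.
leaky = toy.copy()
leaky.loc[1, "fold"] = 3
print(groups_span_one_fold(leaky))
```

On the real data the same check could be run on `raw[['utrip_id', 'fold']].to_pandas()`.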

In [85]:
# This flag marks the rows that must be part of the submission file:
# test rows whose city_id is 0 (the destination to predict).

raw['submission'] = 0
raw.loc[(raw.city_id == 0) & (raw.istest == 1), 'submission'] = 1

raw.loc[ raw.submission==1 ]

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id,istest,fold,submission
355513,65,2016-10-03,2016-10-04,0,mobile,4132,The Devilfire Empire,,65_1,1,2,1
356899,67,2016-08-11,2016-08-14,0,desktop,9924,Tcherkistan,,67_1,1,1,1
10963,115,2016-04-06,2016-04-07,0,desktop,9924,Elbonia,,115_1,1,0,1
120565,279,2016-03-27,2016-04-01,0,desktop,2803,Tcherkistan,,279_1,1,3,1
139366,307,2016-06-02,2016-06-03,0,desktop,8132,Elbonia,,307_1,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...
353170,6257974,2016-07-05,2016-07-07,0,tablet,7974,The Devilfire Empire,,6257974_1,1,2,1
353174,6258010,2016-06-10,2016-06-12,0,mobile,359,The Devilfire Empire,,6258010_1,1,0,1
353181,6258104,2016-08-25,2016-08-27,0,mobile,359,Gondal,,6258104_4,1,1,1
353186,6258120,2016-07-24,2016-07-25,0,desktop,9924,Gondal,,6258120_1,1,4,1


In [86]:
# N = number of places visited in each trip

aggs = raw.groupby('utrip_id', as_index=False)['user_id'].count().reset_index()
aggs.columns = ['utrip_id', 'N']
raw = raw.merge(aggs, on=['utrip_id'], how='inner')
raw.head(2)

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id,istest,fold,submission,N
0,21257,2016-09-19,2016-09-20,15430,desktop,4568,Gondal,Santa Prisca,21257_4,1,2,0,14
1,21257,2016-09-20,2016-09-21,65063,desktop,4568,Gondal,Santa Prisca,21257_4,1,2,0,14
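
The cell above computes the trip length `N` by aggregating and merging back. Note that a join is not guaranteed to keep the original row order, which may be why `raw.head(2)` now starts with a different trip. As a CPU-only illustration of the same quantity, this pandas sketch compares the aggregate-and-merge route with `transform("size")`, which attaches the group size to every row without a join; the toy data is made up, and whether `transform` is available in cuDF 0.18 is not assumed here.

```python
import pandas as pd

# Toy data (made up): one trip with 3 rows and one with a single row.
toy = pd.DataFrame({
    "utrip_id": ["29_1", "29_1", "29_1", "81_1"],
    "user_id": [29, 29, 29, 81],
})

# Route 1: aggregate to one row per trip, then merge the count back.
sizes = toy.groupby("utrip_id").size().rename("N").reset_index()
merged = toy.merge(sizes, on="utrip_id", how="inner")

# Route 2: broadcast the group size to every row directly.
toy["N"] = toy.groupby("utrip_id")["user_id"].transform("size")

print(merged["N"].tolist())
print(toy["N"].tolist())
```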


In [87]:
raw['utrip_id_'], mp = raw['utrip_id'].factorize()

In [88]:
# dcount = how many destinations were already visited earlier in the trip
# the first checkin in a trip has dcount = 0
raw = add_cumcount(raw, ['utrip_id_','checkin'], 'dcount')

In [90]:
# icount = how many destinations are left in the trip
# the last checkin in a trip has icount = 0
raw['icount'] = raw['N']-raw['dcount']-1
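
The relation `icount = N - dcount - 1` means `icount` is simply the cumulative count taken from the end of the trip. A small CPU-only sketch with pandas, on made-up trips of 4 and 5 check-ins, shows that the formula agrees with `cumcount(ascending=False)`:

```python
import pandas as pd

# Toy data (made up): trip 0 has 4 check-ins, trip 1 has 5.
toy = pd.DataFrame({"utrip_id_": [0, 0, 0, 0, 1, 1, 1, 1, 1]})

toy["N"] = toy.groupby("utrip_id_")["utrip_id_"].transform("size")
toy["dcount"] = toy.groupby("utrip_id_").cumcount()
toy["icount"] = toy["N"] - toy["dcount"] - 1

# Counting from the end of each trip gives the same numbers.
reverse = toy.groupby("utrip_id_").cumcount(ascending=False)

print(toy["icount"].tolist())  # [3, 2, 1, 0, 4, 3, 2, 1, 0]
print((toy["icount"] == reverse).all())
```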

In [91]:
raw.head(50)

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id,istest,fold,submission,N,utrip_id_,dcount,icount
7648,1000027,2016-08-13,2016-08-14,8183,desktop,7168,Elbonia,Gondal,1000027_1,0,0,0,4,0,0,3
7649,1000027,2016-08-14,2016-08-16,15626,desktop,7168,Elbonia,Gondal,1000027_1,0,0,0,4,0,1,2
7650,1000027,2016-08-16,2016-08-18,60902,desktop,7168,Elbonia,Gondal,1000027_1,0,0,0,4,0,2,1
7651,1000027,2016-08-18,2016-08-21,30628,desktop,253,Elbonia,Gondal,1000027_1,0,0,0,4,0,3,0
7652,1000033,2016-04-09,2016-04-11,38677,mobile,359,Gondal,Cobra Island,1000033_1,0,0,0,5,1,0,4
7653,1000033,2016-04-11,2016-04-12,52089,desktop,384,Gondal,Cobra Island,1000033_1,0,0,0,5,1,1,3
7654,1000033,2016-04-12,2016-04-14,21328,desktop,384,Gondal,Cobra Island,1000033_1,0,0,0,5,1,2,2
7655,1000033,2016-04-14,2016-04-16,27485,desktop,384,Gondal,Cobra Island,1000033_1,0,0,0,5,1,3,1
7656,1000033,2016-04-16,2016-04-19,38677,desktop,384,Gondal,Cobra Island,1000033_1,0,0,0,5,1,4,0
7657,1000045,2016-06-18,2016-06-20,64876,desktop,2790,The Devilfire Empire,Fook Island,1000045_1,0,0,0,7,2,0,6


In [119]:
raw.to_csv('train_and_test_2.csv', index=False)