### Leave-Last-Out Splitting
A data-splitting strategy that holds out each user's two most recent item interactions for evaluation. This strategy is widely used in recommendation papers.

Specifically, given a chronological user interaction sequence of length N:

Training part: the first N-2 items;

Validation part: the (N-1)-th item;

Testing part: the N-th item. In this case N = 5. 
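
The split described above can be sketched for a single user as follows. This is a minimal sketch, assuming the sequence is already sorted chronologically and has at least three items; the function name `leave_last_out` and the toy item IDs are illustrative only and are not part of the recommenders library.

```python
def leave_last_out(sequence):
    """Split one chronologically ordered interaction list into train/valid/test."""
    if len(sequence) < 3:
        # Simplification for this sketch: refuse sequences too short to hold out two items
        raise ValueError("need at least 3 interactions to hold out two items")
    train = sequence[:-2]   # first N-2 items
    valid = sequence[-2]    # (N-1)-th item
    test = sequence[-1]     # N-th item
    return train, valid, test


items = ["a", "b", "c", "d", "e"]  # toy sequence with N = 5
train, valid, test = leave_last_out(items)
print(train, valid, test)
```

With N = 5, the first three items go to training and the last two are held out, one each for validation and testing.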


Using the pre-supplied train, validation, and test sets keeps RecSys benchmarks consistent.

Notes: The train, validation, and test sets must be combined in this order for the SASRecDataSet() initialization to work. When initialized with a filename, it expects the train, validation, and test data to be in one TSV file. For each user, it takes the last item as the test item and the second-to-last item as the validation item.
<br>
The recommenders package requires specific versions of certain packages, so use a virtual environment to get the right versions. For example, TensorFlow needs to be version 2.12.0, and Python should be no newer than 3.11 (the run below uses 3.11.9).

In [9]:
import pandas as pd

book_test_df = pd.read_csv('data/Books.test.csv.gz', compression='gzip', sep=',', header=0)
book_val_df = pd.read_csv('data/Books.valid.csv.gz', compression='gzip', sep=',', header=0)
book_train_df = pd.read_csv('data/Books.train.csv.gz', compression='gzip', sep=',', header=0)

book_train_df.head()

Unnamed: 0,user_id,parent_asin,rating,timestamp,history
0,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1446304000,5.0,1441260345000,
1,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1564770672,5.0,1441260365000,1446304000
2,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1442450703,5.0,1523093714024,1446304000 1564770672
3,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1780671067,1.0,1611623223325,1446304000 1564770672 1442450703
4,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1645671127,3.0,1612044209266,1446304000 1564770672 1442450703 1780671067


In [10]:
print("Training set: ", book_train_df.shape)
print("Val Set: ", book_val_df.shape)
print("Test Set: ", book_test_df.shape)
print(book_test_df.head())
print()
print(book_val_df.head())
print("Unique Users: ", book_train_df['user_id'].nunique())


Training set:  (7935557, 5)
Val Set:  (776370, 5)
Test Set:  (776370, 5)
                        user_id parent_asin  rating      timestamp  \
0  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  0593235657     5.0  1640629604904   
1  AGKASBHYZPGTEPO6LWZPVJWB2BVA  0803736800     4.0  1454676557000   
2  AGXFEGMNVCSTSYYA5UWXDV7AFSXA  1542046599     5.0  1605649719611   
3  AFWHJ6O3PV4JC7PVOJH6CPULO2KQ  0679450815     5.0  1638987703546   
4  AHXBL3QDWZGJYH7A5CMPFNUPMF7Q  1250866448     5.0  1669414969335   

                                             history  
0  1446304000 1564770672 1442450703 1780671067 16...  
1  0811849783 0803729952 0735336296 1508558884 08...  
2        1578052009 1477493395 1594747350 1594749310  
3  B00INIQVJA 1496407903 1974633225 B07KD27RHM 16...  
4  0920668372 1589255208 2764322836 2764330898 00...  

                        user_id parent_asin  rating      timestamp  \
0  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  1782490671     5.0  1640383495102   
1  AGKASBHYZPGTEPO6LWZPVJWB2BVA  08

In [11]:
import sys
import os
import pandas as pd 
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.datasets.amazon_reviews import get_review_data
from recommenders.datasets.split_utils import filter_k_core
from recommenders.models.sasrec.model import SASREC
from recommenders.models.sasrec.ssept import SSEPT
from recommenders.models.sasrec.sampler import WarpSampler
from recommenders.models.sasrec.util import SASRecDataSet
from recommenders.utils.notebook_utils import store_metadata
from recommenders.utils.timer import Timer


print(f"Python version: {sys.version}")
print(f"Tensorflow version: {tf.__version__}")
print("tensorflow version should be 2.12.0")

Python version: 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Tensorflow version: 2.12.0
tensorflow version should be 2.12.0


In [13]:
# Combine all interactions
merged_df = pd.concat([book_train_df, book_val_df, book_test_df], ignore_index=True)

# Make sure user/item IDs are integers starting from 1.
# unique() keeps first-appearance order, so the mapping is the same on every run
# (iterating over a set of strings is not guaranteed to be).
user_map = {user: u for u, user in enumerate(merged_df['user_id'].unique(), start=1)}
item_map = {item: i for i, item in enumerate(merged_df['parent_asin'].unique(), start=1)}

merged_df['user_id'] = merged_df['user_id'].map(user_map)
merged_df['parent_asin'] = merged_df['parent_asin'].map(item_map)

# Sort by user and timestamp
merged_df = merged_df.sort_values(['user_id', 'timestamp'])

# Keep only the columns SASRecDataSet expects
merged_df = merged_df[['user_id', 'parent_asin']]

# Save to a single TSV
merged_df.to_csv("data/book_all.tsv", sep="\t", header=False, index=False)
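
Because the validation and test items are derived from row order alone, it can be worth checking that each user's last two interactions in the merged data match the pre-supplied validation and test items. The following is a minimal sketch on toy data in plain Python; the helper name `last_two_match`, the tuple layout `(user_id, item_id, timestamp)`, and the two lookup dicts are assumptions made for illustration, and it assumes one validation row and one test row per user.

```python
from collections import defaultdict


def last_two_match(rows, valid_item, test_item):
    """Check that each user's last two items equal the held-out valid/test items.

    rows: iterable of (user_id, item_id, timestamp) covering train + valid + test.
    valid_item / test_item: dicts mapping user_id to the held-out item_id.
    """
    sequences = defaultdict(list)
    # sorted() is stable, so rows with equal timestamps keep their original order
    for user, item, _ts in sorted(rows, key=lambda r: (r[0], r[2])):
        sequences[user].append(item)
    return all(
        len(seq) >= 2
        and seq[-2] == valid_item.get(user)
        and seq[-1] == test_item.get(user)
        for user, seq in sequences.items()
    )


rows = [
    (1, 10, 1), (1, 11, 2), (2, 20, 1),   # train
    (1, 12, 3), (2, 21, 2),               # validation
    (1, 13, 4), (2, 22, 3),               # test
]
valid_item = {1: 12, 2: 21}
test_item = {1: 13, 2: 22}
print(last_two_match(rows, valid_item, test_item))
```

With the notebook's frames, the rows could be taken from `merged_df[['user_id', 'parent_asin', 'timestamp']].itertuples(index=False)` before the timestamp column is dropped.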

In [None]:
'''
# Disabled alternative: map user_id and parent_asin strings to integers, then write the train, validation, and test sets to separate TSV files
merged_df = pd.concat([book_train_df, book_val_df, book_test_df], ignore_index=True)

user_set, item_set = set(merged_df['user_id'].unique()), set(merged_df['parent_asin'].unique())

user_map = dict()
item_map = dict()

for u, user in enumerate(user_set):
    user_map[user] = u+1
for i, item in enumerate(item_set):
    item_map[item] = i+1

#changing each user_id column in each dataset to its integer mapping
book_train_df['user_id'] = book_train_df['user_id'].apply(lambda x: user_map[x])
book_train_df['parent_asin'] = book_train_df['parent_asin'].apply(lambda x: item_map[x])
book_val_df['user_id'] = book_val_df['user_id'].apply(lambda x: user_map[x])
book_val_df['parent_asin'] = book_val_df['parent_asin'].apply(lambda x: item_map[x])
book_test_df['user_id'] = book_test_df['user_id'].apply(lambda x: user_map[x])
book_test_df['parent_asin'] = book_test_df['parent_asin'].apply(lambda x: item_map[x])

book_train_df = book_train_df.sort_values(by=["user_id", "timestamp"])
book_val_df = book_val_df.sort_values(by=["user_id", "timestamp"])
book_test_df = book_test_df.sort_values(by=["user_id", "timestamp"])

book_train_df.drop(columns=["timestamp", "history", "rating"], inplace=True)
book_val_df.drop(columns=["timestamp", "history", "rating"], inplace=True)
book_test_df.drop(columns=["timestamp", "history", "rating"], inplace=True)

book_train_df.to_csv("data/book_train.tsv", sep = "\t", header=False, index=False)
book_val_df.to_csv("data/book_val.tsv", sep = "\t", header=False, index=False)
book_test_df.to_csv("data/book_test.tsv", sep = "\t", header=False, index=False)
'''



In [14]:
from recommenders.models.sasrec.util import SASRecDataSet
from recommenders.models.sasrec.model import SASREC

num_epochs = 10
batch_size = 128
seed = 100  # Set None for non-deterministic result

lr = 0.001             # learning rate
maxlen = 50            # maximum sequence length for each user
num_blocks = 2         # number of transformer blocks
hidden_units = 100     # number of units in the attention calculation
num_heads = 1          # number of attention heads
dropout_rate = 0.1     # dropout rate
l2_emb = 0.0           # L2 regularization coefficient
num_neg_test = 100     # number of negative examples per positive example

dataset = SASRecDataSet(filename="data/book_all.tsv", col_sep="\t")
dataset.split()

model = SASREC(
    item_num=dataset.itemnum,
    seq_max_len=maxlen,
    num_blocks=num_blocks,
    embedding_dim=hidden_units,
    attention_dim=hidden_units,
    attention_num_heads=num_heads,
    dropout_rate=dropout_rate,
    conv_dims=[100, 100],
    l2_reg=l2_emb,
    num_neg_test=num_neg_test,
)

print("Number of users:", dataset.usernum)
print("Number of items:", dataset.itemnum)
print("Number of valid users for evaluation:", len(dataset.user_valid))

Number of users: 776370
Number of items: 495063
Number of valid users for evaluation: 776370


In [15]:
sampler = WarpSampler(dataset.user_train, dataset.usernum, dataset.itemnum, batch_size=batch_size, maxlen=maxlen, n_workers=3)

In [16]:
with Timer() as train_time:
    t_test = model.train(dataset, sampler, num_epochs=num_epochs, batch_size=batch_size, lr=lr, val_epoch=6)

print('Time cost for training is {0:.2f} mins'.format(train_time.interval/60.0))

                                                                      

KeyboardInterrupt: 

In [17]:
res_syn = {"ndcg@10": t_test[0], "Hit@10": t_test[1]}
print(res_syn)

NameError: name 't_test' is not defined
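
For reference, the two metrics stored in `res_syn` can be sketched for a single user as below. This is a minimal sketch, assuming the usual SASRec-style protocol in which the held-out item is scored together with `num_neg_test` sampled negatives and `rank` is its 0-based position in that list. The function names `hit_and_ndcg_at_k` and `rank_of_positive` and the toy scores are illustrative and not part of the recommenders API.

```python
import math


def rank_of_positive(pos_score, neg_scores):
    """0-based rank of the positive item among the negatives (higher score is better)."""
    return sum(score > pos_score for score in neg_scores)


def hit_and_ndcg_at_k(rank, k=10):
    """Return (ndcg@k, hit@k) for one held-out item at 0-based position `rank`."""
    if rank < k:
        # A single relevant item contributes 1 / log2(position + 1), with 1-based position
        return 1.0 / math.log2(rank + 2), 1.0
    return 0.0, 0.0


# Toy scores: two negatives outrank the positive item
rank = rank_of_positive(0.7, [0.9, 0.8, 0.1, 0.3])
ndcg, hit = hit_and_ndcg_at_k(rank)
print(rank, round(ndcg, 3), hit)
```

The reported values would then be these per-user numbers averaged over all evaluated users.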