Just... only keep users/items with a sufficient number of transactions

https://github.com/rogerioxavier/X-Wines/blob/main/Tests/XWines_recommender01.ipynb 

Exploring Data Splitting Strategies for the Evaluation of Recommendation Models 

https://arxiv.org/pdf/2007.13237 

-> Temporal global 80/20 (stratification by user)

\+ looks simple 

\+ no leakage

\- a lot of data discarded (only keep users/items with a sufficient number of transactions that occur in both sets)

-> User split: each user is entirely in train or entirely in test

\+ evaluates the cold-start case
\- but can we do cold start?

-> Leave one out
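
The last two strategies in the notes above are not implemented in this notebook. As a rough sketch only, they could look like the following in pandas. The function names `user_split` and `leave_one_out_split`, the `test_frac` and `seed` parameters, and the toy frame are all hypothetical; the column names follow the X-Wines ratings file. The leave-one-out variant assumes the DataFrame index is unique and holds out each user's most recent rating.

```python
import numpy as np
import pandas as pd


def user_split(df, test_frac=0.2, seed=0, user_col="UserID"):
    """Assign each user entirely to train or to test (hypothetical helper)."""
    users = df[user_col].unique()
    rng = np.random.default_rng(seed)
    n_test = int(round(len(users) * test_frac))
    test_users = rng.choice(users, size=n_test, replace=False)
    in_test = df[user_col].isin(test_users)
    return df[~in_test], df[in_test]


def leave_one_out_split(df, user_col="UserID", date_col="Date"):
    """Hold out each user's latest rating (hypothetical helper).

    Users with a single rating stay in train, so every test user
    also has training history. Assumes a unique index.
    """
    counts = df.groupby(user_col)[user_col].transform("size")
    # index label of the most recent rating of each multi-rating user
    last_idx = df[counts > 1].groupby(user_col)[date_col].idxmax().to_numpy()
    return df.drop(index=last_idx), df.loc[last_idx]


# toy data for illustration only, not taken from X-Wines
toy = pd.DataFrame({
    "UserID": [1, 1, 1, 2, 2, 3],
    "WineID": [10, 11, 12, 10, 13, 11],
    "Rating": [4.0, 3.5, 5.0, 4.5, 4.0, 3.0],
    "Date": pd.to_datetime([
        "2020-01-01", "2020-02-01", "2020-03-01",
        "2020-01-15", "2020-04-01", "2020-05-01",
    ]),
})

u_train, u_test = user_split(toy, test_frac=0.34)
loo_train, loo_test = leave_one_out_split(toy)
```

Note that leave-one-out ignores the global timeline: a user's held-out rating may be older than ratings from other users that remain in train, so it does not share the "no leakage" property of the temporal global split.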



# Imports

In [1]:
import pandas as pd

# Load dataset

In [2]:
# load dataset into dataframe
ratings_df = pd.read_csv("Dataset/last/XWines_Slim_150K_ratings.csv") 
# ensure the Date column is in datetime format
ratings_df['Date'] = pd.to_datetime(ratings_df['Date'])
display(ratings_df.head(10))




Unnamed: 0,RatingID,UserID,WineID,Vintage,Rating,Date
0,143,1356810,103471,1950,4.5,2021-11-02 20:52:59
1,199,1173759,111415,1951,5.0,2015-08-20 17:46:26
2,348,1164877,111395,1952,5.0,2020-11-13 05:40:26
3,374,1207665,111433,1953,5.0,2017-05-05 06:44:13
4,834,1075841,111431,1955,5.0,2016-09-14 20:18:38
5,876,1211463,111395,1955,5.0,2021-12-02 23:12:49
6,1005,1076348,111433,1955,4.5,2021-06-19 19:53:56
7,1020,1147051,111429,1955,5.0,2018-07-08 20:09:46
8,1029,1225931,111431,1955,5.0,2017-04-24 01:41:52
9,1399,1197513,111415,1958,5.0,2014-07-04 01:07:16


# Temporal global split

In [3]:
# create the output folders for both variants of the split

from pathlib import Path

split_root = Path("Dataset/ratings_splits/temporal_global")
for variant in ("filtered", "unfiltered"):
    (split_root / variant).mkdir(parents=True, exist_ok=True)

## Split based on cut-off date

In [4]:
# take the 80th percentile of the Date column as the cutoff date
quantile = 0.80  
cutoff_date = ratings_df['Date'].quantile(quantile)

print(f"Cutoff date at {quantile*100}% quantile: {cutoff_date}")

# create the train and test sets based on the cutoff date
train = ratings_df[ratings_df['Date'] < cutoff_date]
test = ratings_df[ratings_df['Date'] >= cutoff_date]

# display the sizes of the train and test sets
print("Train set size:", train.shape)
print("Test set size:", test.shape)
print("Total number of entries:", train.shape[0]+test.shape[0])
print("Train-test ratio:", train.shape[0] / (train.shape[0]+test.shape[0]), "/", test.shape[0] / (train.shape[0]+test.shape[0]))

Cutoff date at 80.0% quantile: 2020-08-25 02:03:53.800000
Train set size: (120000, 6)
Test set size: (30000, 6)
Total number of entries: 150000
Train-test ratio: 0.8 / 0.2
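
The 80/20 ratio alone does not show whether the split is leak-free or how many test users are unseen in train. A minimal sanity check, assuming the same `Date` and `UserID` columns, might look like this. The helper name `check_temporal_split` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def check_temporal_split(train, test, date_col="Date", user_col="UserID"):
    """Summarise leakage and cold-start exposure of a split (hypothetical helper)."""
    cold_users = set(test[user_col]) - set(train[user_col])
    return {
        # True when every train rating predates every test rating
        "no_leak": bool(train[date_col].max() < test[date_col].min()),
        # test users without any rating in train
        "n_cold_start_users": len(cold_users),
        "cold_start_test_rows": int(test[user_col].isin(cold_users).sum()),
    }


# toy data for illustration only, not taken from X-Wines
toy = pd.DataFrame({
    "UserID": [1, 2, 1, 3],
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]),
})
toy_cutoff = toy["Date"].quantile(0.5)
report = check_temporal_split(
    toy[toy["Date"] < toy_cutoff],
    toy[toy["Date"] >= toy_cutoff],
)
print(report)
```

On the real data, calling `check_temporal_split(train, test)` would report how many test ratings belong to users with no training history, which is what the filtered split removes.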


In [5]:
# save the unfiltered train and test datasets
train.to_csv('Dataset/ratings_splits/temporal_global/unfiltered/train.csv', index=False)
test.to_csv('Dataset/ratings_splits/temporal_global/unfiltered/test.csv', index=False)

## Filtered split (including only users appearing in both Train and Test)

In [6]:
# find users who are present in both train and test sets
train_users = set(train['UserID'].unique())
test_users = set(test['UserID'].unique())
common_users = train_users.intersection(test_users)

# filter the train and test sets to only include the common users
train_filtered = train[train['UserID'].isin(common_users)]
test_filtered = test[test['UserID'].isin(common_users)]

# display the sizes of the filtered train and test sets
print("Filtered Train set size:", train_filtered.shape)
print("Filtered Test set size:", test_filtered.shape)
print("Total number of entries:", train_filtered.shape[0]+test_filtered.shape[0])
print("Train-test ratio:", train_filtered.shape[0] / (train_filtered.shape[0]+test_filtered.shape[0]), "/", test_filtered.shape[0] / (train_filtered.shape[0]+test_filtered.shape[0]))

Filtered Train set size: (82693, 6)
Filtered Test set size: (28218, 6)
Total number of entries: 110911
Train-test ratio: 0.7455797892003498 / 0.25442021079965016
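
The filter above works on users only, while the notes at the top also mention items and a minimum number of transactions. One possible extension, sketched under assumptions, is shown below. The helper name `filter_split` and the `min_train_ratings` threshold are hypothetical, and "occurring in both sets" is read here as "every test wine must also appear in train". Because dropping rows on one side can invalidate the other, the filter repeats until nothing changes.

```python
import pandas as pd


def filter_split(train, test, min_train_ratings=2,
                 user_col="UserID", item_col="WineID"):
    """Iteratively filter a split (hypothetical helper).

    Keeps users with at least `min_train_ratings` ratings in train and
    at least one rating in test, and keeps only test rows whose item
    also appears in train.
    """
    while True:
        n_before = len(train) + len(test)
        counts = train[user_col].value_counts()
        keep_users = set(counts[counts >= min_train_ratings].index) & set(test[user_col])
        train = train[train[user_col].isin(keep_users)]
        test = test[test[user_col].isin(keep_users)
                    & test[item_col].isin(train[item_col])]
        if len(train) + len(test) == n_before:  # nothing removed: stable
            return train, test


# toy data for illustration only, not taken from X-Wines
toy_train = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3],
    "WineID": [10, 11, 10, 12, 13],
})
toy_test = pd.DataFrame({
    "UserID": [1, 2, 3, 4],
    "WineID": [10, 11, 14, 10],
})
train_f, test_f = filter_split(toy_train, toy_test, min_train_ratings=2)
```

In the toy example only user 1 survives: user 2 has too few training ratings, user 3's only test wine is unseen in train, and user 4 never appears in train. Stricter thresholds discard more data, which is the trade-off listed in the notes.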


In [7]:
# save the filtered train and test datasets
train_filtered.to_csv('Dataset/ratings_splits/temporal_global/filtered/train.csv', index=False)
test_filtered.to_csv('Dataset/ratings_splits/temporal_global/filtered/test.csv', index=False)