# Data Split

## 0 Global Settings

In [8]:
import sys

import pyspark
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

from pyspark.sql.functions import col

from recommenders.utils.spark_utils import start_or_get_spark
from recommenders.datasets.download_utils import maybe_download
from recommenders.datasets.spark_splitters import (
    spark_random_split,
    spark_chrono_split,
    spark_stratified_split,
)

print("System version: {}".format(sys.version))
print("Pyspark version: {}".format(pyspark.__version__))

System version: 3.9.12 (main, May  8 2022, 14:00:45) 
[Clang 10.0.1 (clang-1001.0.46.4)]
Pyspark version: 3.2.1


## 1 Data Preparation

### 1.1 Data Understanding

In [9]:
DATA_PATH='../data/amazon_reviews_us_Electronics_v1_00.tsv'

COL_USER = "customer_id"
COL_ITEM = "product_id"
COL_RATING = "star_rating"
COL_PREDICTION = "star_rating"
COL_TIMESTAMP = "review_date"

In [10]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()
data = spark.read.option("delimiter", "\t").option("header", True).csv(DATA_PATH)

In [11]:
data.show(2)

+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|         US|   41409413|R2MTG1GCZLR2DK|B00428R89M|     112201306|yoomall 5M Antenn...|     Electronics|          5|            0|          0|   N|                Y|          Five Stars|       As described.| 2015-08-31|
|         US|   49668221|R2HBOEM8LE9928|B000068O48|     734576678|Hosa GPM-103 3.5m...|     Electronics|          5|    

In [101]:
data.describe()

                                                                                

DataFrame[summary: string, marketplace: string, customer_id: string, review_id: string, product_id: string, product_parent: string, product_title: string, product_category: string, star_rating: string, helpful_votes: string, total_votes: string, vine: string, verified_purchase: string, review_headline: string, review_body: string, review_date: string]

In [77]:
print(
    "Total number of ratings are\t{}".format(data.count()),
    "Total number of users are\t{}".format(data.select(col(COL_USER)).distinct().count()),
    "Total number of items are\t{}".format(data.select(col(COL_ITEM)).distinct().count()),
    sep="\n"
)



Total number of ratings are	3093869
Total number of users are	2154357
Total number of items are	185852



                                                                                

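### 1.2 Data Transformation

Parse the `review_date` strings, which are already in ISO `yyyy-MM-dd` form, into Spark's date type. The cell below only previews the result in a new `datetype` column; `data` itself is left unchanged.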

In [12]:
from pyspark.sql.functions import to_date

data.withColumn("datetype", to_date(col(COL_TIMESTAMP), "yyyy-MM-dd")).show(2)

+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|  datetype|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----------+
|         US|   41409413|R2MTG1GCZLR2DK|B00428R89M|     112201306|yoomall 5M Antenn...|     Electronics|          5|            0|          0|   N|                Y|          Five Stars|       As described.| 2015-08-31|2015-08-31|
|         US|   49668221|R2HBOEM8LE9928|B000068O48|     734576678|Hosa GPM-1

## 2 Experimentation Protocol

An experimentation protocol is usually set up to give a fair evaluation for a specific recommendation scenario. For example:

- Recommender-A recommends movies to people based on similarities in their collaborative ratings. To make sure the evaluation is statistically sound, the same set of users should be used for both model building and testing (to avoid cold-start users), so a stratified splitting strategy should be used.
- Recommender-B recommends fashion products to customers. Here the evaluation should account for the time dependency of customer purchases, because customers' tastes in fashion items may drift over time. In this case, a chronological split should be used.
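
The cold-start concern in the first scenario can be checked directly once a split exists. The following is a minimal pure-Python sketch, not part of the `recommenders` library: `cold_start_users` is a hypothetical helper, and the tiny `train` and `test` lists are made-up illustration data.

```python
def cold_start_users(train, test):
    """Return the users that appear in `test` but never in `train`.

    Both arguments are iterables of (user, item, rating) tuples.
    """
    train_users = {user for user, _, _ in train}
    return {user for user, _, _ in test if user not in train_users}


# Made-up illustration data: "u3" is only seen at test time.
train = [("u1", "i1", 5), ("u2", "i2", 3)]
test = [("u1", "i3", 4), ("u3", "i1", 2)]

print(cold_start_users(train, test))  # {'u3'}
```

An empty result means every test user was also seen during training, which is the property a stratified split by user is meant to provide.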

## 3 Data Split

### 3.1 Random split

A random split assigns each rating to one of the output sets at random, in proportion to the given split ratios.

In [104]:
type(data)

pyspark.sql.dataframe.DataFrame

In [9]:
data_train, data_test = spark_random_split(data, ratio=0.7)

In [10]:
data_train.count(), data_test.count()

                                                                                

(2166025, 927844)

Multi-split:

In [None]:
data_train, data_validate, data_test = spark_random_split(data, ratio=[0.6, 0.2, 0.2])
# or
data_train, data_validate, data_test = spark_random_split(data, ratio=[3, 1, 1])

In [None]:
# Spark DataFrames have no .shape attribute; use count() to get the number of rows.
data_train.count(), data_validate.count(), data_test.count()
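
The two `ratio` forms in the multi-split cell are meant to be equivalent, which suggests that a list of weights is rescaled to sum to 1 before splitting. A minimal sketch of that rescaling, assuming this is how the weights are interpreted (`normalize_ratios` is a hypothetical helper, not the library's function):

```python
def normalize_ratios(ratio):
    """Scale a list of positive split weights so that they sum to 1."""
    if any(r <= 0 for r in ratio):
        raise ValueError("every split weight must be positive")
    total = sum(ratio)
    return [r / total for r in ratio]


print(normalize_ratios([3, 1, 1]))  # [0.6, 0.2, 0.2]
```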

### 3.2 Chronological Split

#### 3.2.1 "Filter by"

A chronological split can be applied per "user" or per "item". For example, with `filter_by="user"` and a ratio of 0.7, the earliest 70% of each user's ratings go into one split and the remaining 30% into the other. Note that a chronological split is not random, because the assignment depends on the timestamps.

In [9]:
data_train, data_test = spark_chrono_split(
    data, ratio=0.7, filter_by="user",
    col_user=COL_USER, col_item=COL_ITEM, col_timestamp=COL_TIMESTAMP
)


In [10]:
data_train.count()

                                                                                

806056

In [28]:
data_train.where(col(COL_USER)=='9640864').select(col(COL_TIMESTAMP)).show(10)

+-----------+
|review_date|
+-----------+
| 2014-11-13|
| 2015-02-14|
+-----------+



In [27]:
data_test.where(col(COL_USER)=='9640864').select(col(COL_TIMESTAMP)).show(10)

+-----------+
|review_date|
+-----------+
| 2015-03-04|
+-----------+
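
The per-user behaviour seen above (two earlier reviews in training, the latest one in test) can be reproduced with a small pure-Python sketch. This is an illustration under stated assumptions, not the library's implementation: `chrono_split_by_user` is a hypothetical helper, the item ids are made up, and the rule "rank / count <= ratio goes to train" is an assumed way of applying the ratio.

```python
from collections import defaultdict


def chrono_split_by_user(rows, ratio=0.7):
    """Split (user, item, timestamp) rows so each user's earliest ratings go to train."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_rows in by_user.values():
        # ISO dates (yyyy-MM-dd) sort correctly as plain strings.
        user_rows.sort(key=lambda r: r[2])
        n = len(user_rows)
        for rank, row in enumerate(user_rows, start=1):
            # Assumed rule: a rating goes to train while rank / n stays within the ratio.
            (train if rank / n <= ratio else test).append(row)
    return train, test


# The three review dates of user 9640864 shown above; item ids are made up.
rows = [
    ("9640864", "item_c", "2015-03-04"),
    ("9640864", "item_a", "2014-11-13"),
    ("9640864", "item_b", "2015-02-14"),
]
train, test = chrono_split_by_user(rows)
print([r[2] for r in train])  # ['2014-11-13', '2015-02-14']
print([r[2] for r in test])   # ['2015-03-04']
```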



#### 3.2.2 Min-rating filter

A min-rating filter is applied to the data before the chronological splitter runs. The reason is that, especially for a multi-split, each user or item needs enough ratings to be distributed across the splits.

For example, the following applies the split only to users that have at least 2 ratings.

In [15]:
data_train, data_test = spark_chrono_split(
    data, filter_by="user", min_rating=2, ratio=0.7,
    col_user=COL_USER, col_item=COL_ITEM, col_timestamp=COL_TIMESTAMP
)


In [16]:
data_train.count() + data_test.count(), data.count()


                                                                                

(1405764, 3093869)
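
The gap between the two counts comes from the filter dropping users with too few ratings before the split. A minimal pure-Python sketch of such a filter, assuming it simply counts ratings per user (`min_rating_filter` here is a hypothetical stand-in, and the rows are made-up illustration data):

```python
from collections import Counter


def min_rating_filter(rows, min_rating=2):
    """Keep only the (user, item) rows whose user has at least `min_rating` ratings."""
    counts = Counter(user for user, _ in rows)
    return [row for row in rows if counts[row[0]] >= min_rating]


# "u2" has a single rating and is dropped; "u1" and "u3" have two each.
rows = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u3", "i3"), ("u3", "i1")]
kept = min_rating_filter(rows, min_rating=2)
print(len(kept), len(rows))  # 4 5
```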

### 3.3 Stratified Split

The stratified splitting method takes in a dataset and splits it by either user or item. The split is stratified so that the same set of users or items appears in both the training and the testing data sets.

Similar to the chronological splitter, `filter_by` and `min_rating` also apply to the stratified splitter.

The following example splits the data with a ratio of 0.7, keeping only users that have at least 5 ratings.

In [18]:
data_train, data_test = spark_stratified_split(
    data, filter_by="user", min_rating=5, ratio=0.7,
    col_user=COL_USER, col_item=COL_ITEM
)


In [20]:
data_train.count() + data_test.count(), data.count()


                                                                                

(411774, 3093869)
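
To make the "same users on both sides" property concrete, here is a minimal pure-Python sketch of a stratified split by user. It is an illustration only, not the library's implementation: `stratified_split_by_user` is a hypothetical helper, the rounding rule is an assumption, and it expects every user to have at least 2 ratings (for example after a min-rating filter).

```python
import random
from collections import defaultdict


def stratified_split_by_user(rows, ratio=0.7, seed=42):
    """Randomly split (user, item) rows per user so each user lands in both sets."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for user_rows in by_user.values():
        rng.shuffle(user_rows)
        # Round to the nearest count, but keep at least one rating on each side.
        cut = min(max(round(len(user_rows) * ratio), 1), len(user_rows) - 1)
        train.extend(user_rows[:cut])
        test.extend(user_rows[cut:])
    return train, test


# Two made-up users with 10 ratings each.
rows = [(user, f"i{k}") for user in ("u1", "u2") for k in range(10)]
train, test = stratified_split_by_user(rows)
print(len(train), len(test))  # 14 6
print({u for u, _ in train} == {u for u, _ in test})  # True
```

Unlike the chronological sketch, the order within each user is shuffled, so which ratings end up in training depends on the seed rather than on timestamps.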

### 3.4 Data Split at Scale

Spark DataFrames are used for scalable splitting. This allows the splitting operation to be performed on a large dataset that is distributed across a Spark cluster.

For example, the cell below performs a random split on a Spark DataFrame. For simplicity, the same Amazon Electronics reviews file is read again into a Spark DataFrame and used for splitting.

In [23]:
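spark = start_or_get_spark()
# Read with the same options as in section 1.1 so the header row is not treated as data.
data_spark = spark.read.option("delimiter", "\t").option("header", True).csv(DATA_PATH)
data_spark_train, data_spark_test = spark_random_split(data_spark, ratio=0.7)
# Print the counts explicitly; a bare expression is only displayed when it is the last line of a cell.
print(data_spark_train.count(), data_spark_test.count())
spark.stop()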


