# Data transformation (collaborative filtering)

Real-world datasets often record several types of interactions between users and items. In addition, the same type of interaction (e.g., clicking an item on a website or viewing a movie) may appear more than once in a user's history. Because this is a common problem in practical recommendation system design, this notebook shares data transformation techniques that can be used in different scenarios.

The discussion in this notebook applies only to collaborative filtering algorithms.

## 0 Global Settings

In [1]:
# import the libraries used in this notebook
import sys

import pyspark
import pandas as pd
import numpy as np
import datetime
import math

print("System version: {}".format(sys.version))

System version: 3.9.12 (main, May  8 2022, 14:00:45) 
[Clang 10.0.1 (clang-1001.0.46.4)]


In [4]:
DATA_PATH='../data/amazon_reviews_us_Electronics_v1_00.tsv'

COL_USER = "customer_id"
COL_ITEM = "product_id"
COL_RATING = "star_rating"
COL_PREDICTION = "star_rating"
COL_TIMESTAMP = "review_date"

In [5]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()
data = spark.read.option("delimiter", "\t").option("header", True).csv(DATA_PATH)



## 2 Data Transformation
Many collaborative filtering algorithms are built on a sparse user-item matrix, so the input data used to build the recommender must contain unique user-item pairs.
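
To make the uniqueness requirement concrete, here is a minimal sketch in plain Python. The toy interactions and the dictionary-of-keys matrix are hypothetical illustrations, not part of the dataset or of any particular library; the sketch only shows that a repeated user-item pair collapses into a single matrix cell.

```python
from collections import Counter

# Hypothetical toy interactions: (user, item, rating). Not taken from the dataset.
interactions = [
    ("u1", "i1", 5.0),
    ("u1", "i2", 3.0),
    ("u1", "i1", 5.0),  # exact repeat of the first row
    ("u2", "i1", 4.0),
]

# Count how often each (user, item) pair occurs.
pair_counts = Counter((user, item) for user, item, _ in interactions)
repeated_pairs = {pair: n for pair, n in pair_counts.items() if n > 1}

# Dictionary-of-keys sparse matrix: one cell per (user, item) pair.
# A repeated pair silently overwrites the earlier value, which is why
# duplicates have to be resolved before the matrix is built.
matrix = {}
for user, item, rating in interactions:
    matrix[(user, item)] = rating

print(repeated_pairs)
print(len(interactions), "rows ->", len(matrix), "cells")
```

Checking for repeated pairs before building the matrix makes the choice of how to resolve them explicit instead of leaving it to insertion order.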

For explicit feedback datasets, uniqueness can be achieved simply by removing repeated user-item-rating tuples.

In [6]:
data = data.drop_duplicates()

In [7]:
data.count()

3093869
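
Note that dropping duplicates without a column subset removes only rows that are identical in every column, so a user-item pair that was rated more than once with different values can still appear several times. The sketch below uses pandas on hypothetical toy data that reuses the notebook's column names. Keeping the most recent review per pair is an assumption made for illustration, not a rule stated in this notebook.

```python
import pandas as pd

# Hypothetical toy data reusing the notebook's column names.
toy = pd.DataFrame(
    {
        "customer_id": ["u1", "u1", "u1", "u2"],
        "product_id": ["i1", "i1", "i1", "i1"],
        "star_rating": [5, 5, 3, 4],
        "review_date": ["2015-01-01", "2015-01-01", "2015-02-01", "2015-01-15"],
    }
)

# Dropping fully identical rows removes the exact repeat only.
exact = toy.drop_duplicates()

# Assumption: when a pair has conflicting ratings, keep the latest review.
# ISO-formatted date strings sort chronologically.
latest = (
    toy.sort_values("review_date")
    .drop_duplicates(subset=["customer_id", "product_id"], keep="last")
)

print(len(toy), len(exact), len(latest))
```

Other resolution rules (for example, averaging the ratings of a pair) may suit some datasets better; the point is only that the subset of columns used for deduplication determines whether the user-item pairs end up unique.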