## Recommendation System Analysis and Modelling

------------------------------------------------------------------------
> # Introduction
>
> Recommendation systems are essential for delivering personalised user experiences across a variety of platforms, including e-commerce, streaming services, social media and news websites.
>
> This project aims to develop a recommendation system that leverages historical user data to provide tailored recommendations across different domains, such as product recommendations, content suggestions and service optimisation.
>
> **CRISP-DM Framework**
>
> The analysis followed the CRISP-DM methodology, which includes the following stages:
>
> ### 1. Business Understanding:
>
> The objectives are defined below, followed by the analytic questions that guide the modelling process.
>
> Key objectives of the project include:
>
> 1. Develop Personalized Recommendations: Tailor suggestions based on user behaviour and past interactions.
>
> 2. Address Diverse Use Cases: Implement systems for product, content and service recommendations.
>
> 3. Utilize Historical Data: Leverage past user actions to make accurate predictions.
>
> 4. Enhance User Engagement: Improve user satisfaction and retention through relevant suggestions.
>
> 5. Ensure Scalability & Real-Time Performance: Handle large data volumes and provide recommendations promptly.
>
> 6. Boost Business Metrics: Increase sales and conversion rates through better user personalization.
>
> 7. Balance Accuracy & Diversity: Provide relevant but varied recommendations to avoid monotony.
>
> 
> Analytic Questions:
> 
>
> ### 2. Data Understanding:
>
> The dataset consists of three files: events.csv, item_properties.csv and category_tree.csv, which collectively describe the interactions and properties of items on an e-commerce website. The data, collected over a 4.5-month period, is raw and contains hashed values due to confidentiality concerns. The goal of publishing this dataset is to support research in recommender systems using implicit feedback.
>
> 2.1 Behavior Data (events.csv)
>
> The behavior data includes a total of 2,756,101 events, with 2,664,312 views, 69,332 add-to-cart actions, and 22,457 transactions, recorded from 1,407,580 unique visitors. Each event corresponds to one of three types of interactions: "view", "addtocart", or "transaction". These implicit feedback signals are crucial for recommender systems:
>
> View: Represents a user showing interest in an item.
>
> Add to Cart: Indicates a higher level of intent to purchase.
>
> Transaction: Represents a completed purchase.
>
> 2.2 Item Properties (item_properties.csv)
>
> This file contains 20,275,902 rows, representing various properties of 417,053 unique items. Each property may change over time (e.g., price updates), with each row capturing a snapshot of an item’s property at a specific timestamp. For items with constant properties, only a single snapshot is recorded. The file is split into two due to its size, and it contains detailed item information, which is essential for building item profiles and understanding how item properties influence user behavior.
>
> 2.3 Category Tree (category_tree.csv)
>
> The category_tree.csv file outlines the hierarchical structure of item categories. It provides a category-based organization of the products, which can help in grouping items into broader categories or subcategories. This file is important for building models that recommend items within specific categories or using category-based clustering for recommendations.
>
> ### 3. Data Preparation:

In [1]:
library(ggplot2)

In [2]:
library(bit64)

Loading required package: bit


Attaching package: ‘bit’


The following object is masked from ‘package:base’:

    xor


Attaching package bit64

package:bit64 (c) 2011-2017 Jens Oehlschlaegel

creators: integer64 runif64 seq :

coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bitstring

logical operator: ! & | xor != == < <= >= >

arithmetic operator: + - * / %/% %% ^

math: sign abs sqrt log log2 log10

math: floor ceiling trunc round

querying: is.integer64 is.vector [is.atomic} [length] format print str

values: is.na is.nan is.finite is.infinite

aggregation: any all min max range sum prod

cumulation: diff cummin cummax cumsum cumprod

access: length<- [ [<- [[ [[<-

combine: c rep cbind rbind as.data.frame



for more help type ?bit64


Attaching package: ‘bit64’


The following object is masked from ‘package:utils’:

    hashtab


The following objects are masked from ‘package:base’:

    %in%, :, colSums, is.double, match, order, rank, rowSums


In [3]:
library(tidyr)

In [4]:
library(data.table)


Attaching package: ‘data.table’


The following object is masked from ‘package:bit’:

    setattr




In [5]:
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:data.table’:

    between, first, last


The following object is masked from ‘package:bit’:

    symdiff


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [6]:
library(janitor)


Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test




In [7]:
library(caret)

Loading required package: lattice



In [8]:
# get working directory
getwd()

In [9]:
# set working directory
setwd("/home/kojo/data-science-projects/Recommendation System Analysis and Modelling/")

In [10]:
read_in_chunks <- function(file_path, chunk_size = 10000, ...) {
  # Open the file connection for reading
  con <- file(file_path, open = "r")
  
  # Close the connection when the function exits, even if a chunk fails to parse
  on.exit(close(con), add = TRUE)
  
  # Read the header line (assumes CSV header is in the first line)
  header <- readLines(con, n = 1)
  
  # Prepare an empty list to store chunks
  chunks <- list()
  chunk_index <- 1
  
  repeat {
    # Read the next chunk_size lines
    lines <- readLines(con, n = chunk_size)
    
    # Exit the loop if no more lines are available
    if (length(lines) == 0) break
    
    # Combine header and the chunk's lines to form a valid CSV text
    csv_text <- paste(c(header, lines), collapse = "\n")
    
    # Read the combined text into a data frame
    chunk_df <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE, ...)
    
    # Store the chunk in the list
    chunks[[chunk_index]] <- chunk_df
    chunk_index <- chunk_index + 1
  }
  
  # Return the list of data frames, one per chunk
  return(chunks)
}
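
The chunked-reading pattern used by read_in_chunks can be tried on a small scale. The sketch below is only an illustration: it writes a hypothetical 25-row CSV to a temporary file (the columns `id` and `score` are made up) and reads it back 10 lines at a time, re-attaching the header to every chunk so that `read.csv` sees valid CSV text each time.

```r
# Write a small illustrative CSV to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, score = (1:25) / 10), tmp, row.names = FALSE)

con <- file(tmp, open = "r")
header <- readLines(con, n = 1)   # first line holds the column names
pieces <- list()

repeat {
  lines <- readLines(con, n = 10) # read at most 10 data lines per chunk
  if (length(lines) == 0) break   # stop at end of file
  csv_text <- paste(c(header, lines), collapse = "\n")
  pieces[[length(pieces) + 1]] <- read.csv(text = csv_text, stringsAsFactors = FALSE)
}

close(con)
unlink(tmp)

chunk_sizes <- vapply(pieces, nrow, integer(1))  # rows per chunk
combined <- do.call(rbind, pieces)               # stitch chunks back together
print(chunk_sizes)
print(dim(combined))
```

Because each chunk is parsed on its own, column types are inferred per chunk; a column that is entirely empty in one chunk may therefore get a different type than in another, which is worth checking after the chunks are combined.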

In [11]:
# Read category_tree.csv
category_tree <- fread("./00_raw_data/category_tree.csv")

View(head(category_tree, 10))

categoryid,parentid
<int>,<int>
1016,213.0
809,169.0
570,9.0
1691,885.0
536,1691.0
231,
542,378.0
1146,542.0
1140,542.0
1479,1537.0


In [12]:
# Read events.csv in chunks
events <- read_in_chunks("./00_raw_data/events.csv")

In [13]:
# Read item_properties_part1.1.csv in chunks
item_properties_1 <- read_in_chunks("./00_raw_data/item_properties_part1.1.csv")

In [14]:
# Read item_properties_part2.csv in chunks
item_properties_2 <- read_in_chunks("./00_raw_data/item_properties_part2.csv")

In [15]:
# Bind events to data frame
events <- rbindlist(events)

View(head(events, 10))

timestamp,visitorid,event,itemid,transactionid
<dbl>,<int>,<chr>,<int>,<int>
1433221000000.0,257597,view,355908,
1433224000000.0,992329,view,248676,
1433222000000.0,111016,view,318965,
1433222000000.0,483717,view,253185,
1433221000000.0,951259,view,367447,
1433224000000.0,972639,view,22556,
1433222000000.0,810725,view,443030,
1433223000000.0,794181,view,439202,
1433221000000.0,824915,view,428805,
1433221000000.0,339335,view,82389,


In [16]:
# Bind item properties 1 & 2 in a data frame
item_properties <- do.call(rbind, c(item_properties_1, item_properties_2))

View(head(item_properties, 10))

,timestamp,itemid,property,value
,<dbl>,<int>,<chr>,<chr>
1,1435460000000.0,460429,categoryid,1338
2,1441508000000.0,206783,888,1116713 960601 n277.200
3,1439089000000.0,395014,400,n552.000 639502 n720.000 424566
4,1431227000000.0,59481,790,n15360.000
5,1431832000000.0,156781,917,828513
6,1436065000000.0,285026,available,0
7,1434251000000.0,89534,213,1121373
8,1431832000000.0,264312,6,319724
9,1433646000000.0,229370,202,1330310
10,1434251000000.0,98113,451,1141052 n48.000


In [17]:
# Data cleaning
# Check for duplicates
sum(duplicated(category_tree))

In [18]:
# Check for duplicates in events
sum(duplicated(events))

# Remove duplicates 
events <- unique(events)

# Verify
sum(duplicated(events))

In [19]:
# Check for NA's 
colSums(is.na(category_tree) | category_tree == "")

In [20]:
# Check for NA's in events
colSums(is.na(events) | events == "")
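
The missing-value check above counts both `NA` and empty strings per column. A minimal sketch of how that expression behaves, using a made-up three-row data frame (the columns `id` and `label` are hypothetical):

```r
# Hypothetical data: one NA in 'id', one empty string in 'label'
df <- data.frame(id = c(1, 2, NA), label = c("a", "", "c"),
                 stringsAsFactors = FALSE)

# is.na(df) flags NA cells; df == "" flags empty strings.
# TRUE | NA evaluates to TRUE, so NA cells are still counted.
missing_counts <- colSums(is.na(df) | df == "")
print(missing_counts)
```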

In [21]:
# Convert 'event' to factor
events[, event := as.factor(event)]

In [22]:
# Check the datatype or class of the variable timestamp
class(events$timestamp)

# Convert to POSIXct
events$timestamp <- as.POSIXct(events$timestamp / 1000, origin = "1970-01-01", tz = "UTC")

In [23]:
# Convert to POSIXct
item_properties$timestamp <- as.POSIXct(item_properties$timestamp / 1000, origin = "1970-01-01", tz = "UTC")
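
The division by 1000 in the two conversions above treats the raw timestamps as Unix epoch values in milliseconds, which matches the 13-digit values shown in the earlier output. A small sketch with a single hypothetical value (`ms` is not taken from the dataset):

```r
# Hypothetical epoch timestamp in milliseconds
ms <- 1433221332117

# as.POSIXct expects seconds, so divide by 1000 first
ts <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")

# Format explicitly in UTC so the result does not depend on the local time zone
formatted <- format(ts, "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(formatted)  # "2015-06-02 05:02:12"
```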

In [24]:
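# Clean numeric values
# Some entries in the 'value' column start with "n" (e.g. "n15360.000").
# Strip that prefix with a string function, then convert to numeric.
# Values holding several space-separated tokens cannot be parsed as one number and become NA.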

# Check Class
class(item_properties)

# Set to data.table
setDT(item_properties)

# Apply the operation
item_properties[, value_clean := as.numeric(gsub("^n", "", value))]

# Round to 3 decimals for precision
# item_properties[, value_clean := round(value_clean, 3)]


# Check conversion
summary(item_properties$value_clean)

View(head(item_properties, 10))

“NAs introduced by coercion”


    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
-6034344    17920   519769      Inf   769062      Inf  7331908 

timestamp,itemid,property,value,value_clean
<dttm>,<int>,<chr>,<chr>,<dbl>
2015-06-28 03:00:00,460429,categoryid,1338,1338.0
2015-09-06 03:00:00,206783,888,1116713 960601 n277.200,
2015-08-09 03:00:00,395014,400,n552.000 639502 n720.000 424566,
2015-05-10 03:00:00,59481,790,n15360.000,15360.0
2015-05-17 03:00:00,156781,917,828513,828513.0
2015-07-05 03:00:00,285026,available,0,0.0
2015-06-14 03:00:00,89534,213,1121373,1121373.0
2015-05-17 03:00:00,264312,6,319724,319724.0
2015-06-07 03:00:00,229370,202,1330310,1330310.0
2015-06-14 03:00:00,98113,451,1141052 n48.000,
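
The coercion warning and the NA entries in the output above come from values that contain more than one token. A minimal sketch of the same cleaning step on three values copied from the earlier item_properties preview; the per-token split at the end is only one possible way to inspect such values, not something the notebook does:

```r
values <- c("1338", "n15360.000", "1116713 960601 n277.200")

# Same operation as in the cell above: drop a leading "n", then convert.
# suppressWarnings() hides the "NAs introduced by coercion" warning here.
cleaned <- suppressWarnings(as.numeric(gsub("^n", "", values)))
print(cleaned)

# Assumption for illustration: split multi-token values on spaces
# so each token can be examined separately.
tokens <- strsplit(values, " ", fixed = TRUE)
token_counts <- lengths(tokens)
print(token_counts)
```

Only single-token values survive the conversion, which is why `value_clean` is NA for rows such as itemid 206783 in the table above.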


In [25]:
# Merging datasets
# Perform a Rolling Join
# A rolling join allows you to match each event with the most recent (previous) item property snapshot.

# Set event to data.table
setDT(events)

# Order the data by itemid and timestamp
setorder(events, itemid, timestamp)
setorder(item_properties, itemid, timestamp)

# Set keys for a rolling join
setkey(events, itemid, timestamp)
setkey(item_properties, itemid, timestamp)

# Rolling join: for each event, get the most recent snapshot from item properties.
# This matches on 'itemid' and finds the snapshot with a timestamp less than or equal to the event timestamp.
merged_data <- item_properties[events, on = .(itemid, timestamp), roll = TRUE]

# Inspect the merged result
head(merged_data)

# Both events and item_properties are keyed by itemid and timestamp, which lets data.table perform the join efficiently.

# Rolling join (roll = TRUE): when item_properties is joined with events,
# roll = TRUE tells data.table to find, for each event, the item_properties row with the latest timestamp that does not exceed the event's timestamp. Because item_properties holds one row per property, the matched row describes a single property (e.g. 459, 283 or 888 in the output below) rather than a full item profile.

timestamp,itemid,property,value,value_clean,visitorid,event,transactionid
<dttm>,<int>,<chr>,<chr>,<dbl>,<int>,<fct>,<int>
2015-08-18 18:30:40,3,459.0,769062,769062.0,370720,view,
2015-08-31 14:39:02,3,283.0,1305767 150169 1182824 327918 261419,,639016,view,
2015-06-30 07:03:11,4,888.0,371058 71429,,1042455,view,
2015-08-31 18:06:00,4,888.0,371058 71429 508476,,905555,view,
2015-09-15 23:22:44,4,591.0,1116693,1116693.0,1010132,view,
2015-05-06 20:33:13,6,,,,330981,view,
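
The rolling join above can be checked on a toy example. The sketch below assumes the data.table package is installed; the tables `props` and `evts`, the `price` column and all values are hypothetical, and plain numbers stand in for timestamps.

```r
library(data.table)

# Hypothetical property snapshots: item 1 changes price at t = 10 and t = 20
props <- data.table(itemid = c(1L, 1L, 2L),
                    timestamp = c(10, 20, 15),
                    price = c(5, 6, 9))

# Hypothetical events to be matched against the snapshots
evts <- data.table(itemid = c(1L, 1L, 2L),
                   timestamp = c(12, 25, 14),
                   visitorid = c(101L, 102L, 103L))

# For each event, take the latest snapshot at or before the event time
joined <- props[evts, on = .(itemid, timestamp), roll = TRUE]
print(joined)
```

The event at t = 12 picks up the snapshot from t = 10, the event at t = 25 picks up the one from t = 20, and the event for item 2 at t = 14 gets NA because its only snapshot (t = 15) lies in the future. The same effect explains rows such as itemid 6 above, where no earlier snapshot exists.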


In [26]:
# Filter the item properties data to isolate the rows where the property is "categoryid". This gives you the actual category identifier for each item.

category_property <- item_properties[property == "categoryid"]

# Perform a rolling join. Since item properties are time-dependent, align each event with the most recent "categoryid" snapshot at or before the event time.

# Order and set keys for rolling join
setorder(category_property, itemid, timestamp)
setorder(events, itemid, timestamp)
setkey(category_property, itemid, timestamp)
setkey(events, itemid, timestamp)


# Rolling join: For each event, get the most recent "categoryid" snapshot
events_with_category <- category_property[events, on = .(itemid, timestamp), roll = TRUE]

head(events_with_category)

timestamp,itemid,property,value,value_clean,visitorid,event,transactionid
<dttm>,<int>,<chr>,<chr>,<dbl>,<int>,<fct>,<int>
2015-08-18 18:30:40,3,categoryid,1171.0,1171.0,370720,view,
2015-08-31 14:39:02,3,categoryid,1171.0,1171.0,639016,view,
2015-06-30 07:03:11,4,categoryid,1038.0,1038.0,1042455,view,
2015-08-31 18:06:00,4,categoryid,1038.0,1038.0,905555,view,
2015-09-15 23:22:44,4,categoryid,1038.0,1038.0,1010132,view,
2015-05-06 20:33:13,6,,,,330981,view,


In [27]:
# Prepare the category identifier
# For clarity, copy the cleaned value into a dedicated categoryid column.
events_with_category[, categoryid := value_clean]


# Merge with category tree
# Set keys on the join column
setkey(category_tree, categoryid)
setkey(events_with_category, categoryid)


# Merge the category tree with the events data
final_data <- merge(events_with_category, category_tree, by = "categoryid", all.x = TRUE)

head(final_data, 20)
# In merge(), all.x = TRUE requests a left join: every row from the left dataset is kept, and columns from the right dataset are filled with NA where there is no match.

categoryid,timestamp,itemid,property,value,value_clean,visitorid,event,transactionid,parentid
<int>,<dttm>,<int>,<chr>,<chr>,<dbl>,<int>,<fct>,<int>,<int>
,2015-05-06 20:33:13,6,,,,330981,view,,
,2015-05-06 20:35:35,6,,,,330981,view,,
,2015-05-13 22:53:14,9,,,,1205411,view,,
,2015-07-09 01:35:08,9,,,,949658,view,,
,2015-05-03 16:45:07,16,,,,1275279,view,,
,2015-05-12 02:12:50,16,,,,1103054,view,,
,2015-05-12 02:12:53,16,,,,1103054,view,,
,2015-06-10 07:27:33,16,,,,375506,view,,
,2015-06-10 22:45:47,16,,,,534240,view,,
,2015-06-11 00:30:03,16,,,,843371,view,,
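
The left join and the leading rows with an empty categoryid can be reproduced on a toy example. This sketch assumes the data.table package is installed; the item and category ids echo the output above, but the `parentid` values are hypothetical.

```r
library(data.table)

# Events with a resolved category, plus one event whose category is unknown
events_cat <- data.table(itemid = c(3L, 4L, 6L),
                         categoryid = c(1171L, 1038L, NA))

# Hypothetical slice of the category tree (parentid values are made up)
tree <- data.table(categoryid = c(1171L, 1038L),
                   parentid = c(938L, 1174L))

# Left join: keep every event, attach parentid where the category is known
result <- merge(events_cat, tree, by = "categoryid", all.x = TRUE)
print(result)
```

The merged table is sorted by the join column, and data.table places NA first, so events without a category appear at the top. This is consistent with the first rows of final_data shown above.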
