# Data Preprocessing and Feature Engineering

To obtain a learnable dataset, the raw features must be preprocessed and new ones engineered so that they carry useful signal. This notebook generates a PySpark RDD containing the preprocessed dataset, ready for feature selection algorithms.

More than 20 features must be engineered. Some ideas are:

Session-time related features:
* Weekday of the session ✅
* Month of the session ✅
* Season of the session ✅
* Duration of the session ✅
* Day period of the session ✅
* Is it during the weekend ✅

Session-item related features:
* Number of viewed items per session ✅
* Number of unique viewed items per session ✅
* Item with the most time spent on it. ✅
* Longest time spent on an item in the session. ✅
* The last item of the session. ✅
* Mean time per item. ✅
* Standard deviation of the time per item. ✅
* Item revisit?

### Starting the Spark engine and loading the dataset.

In [1]:
import os 
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# launch this cell if you have issues on windows with py4j (think about updating your PATH)
import sys
os.environ['PYSPARK_DRIVER_PYTHON_OPTS']= "notebook"
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_PYTHON'] = sys.executable

# starts a spark session from notebook

os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=4g  pyspark-shell"
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("load_explore") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/11 10:56:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# Load the relevant data into DataFrames
train_sessions = spark.read.load('../Data/train_sessions.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

train_purchases = spark.read.load('../Data/train_purchases.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

candidate_items = spark.read.load('../Data/candidate_items.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

item_features = spark.read.load('../Data/item_features.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

datas = [train_sessions, train_purchases, candidate_items, item_features]

## Group sessions and aggregate their duration.

Using a map-reduce operation, we start by computing the duration of each session in seconds. To do so, a helper function converts each timestamp into the number of seconds elapsed since 1 January 1970 (UNIX time).

In [3]:
import datetime

def timestamp_to_unix(timestamp: str) -> int:
    '''Converts a timestamp string to UNIX time (whole seconds).
    '''
    try:
        date = datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f')
    except ValueError:
        date = datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    return int(date.timestamp())

posix_date_mapped_sessions = train_sessions.rdd.map(lambda x: (x.session_id, timestamp_to_unix(x.date))).cache()
sampled_sessions = posix_date_mapped_sessions.sample(False, 0.0001, seed=42)

In [4]:
sampled_sessions.take(1)

22/05/11 10:56:54 WARN BlockManager: Task 23 already completed, not releasing lock for rdd_45_0

[(4163, 1606579200)]
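
Calling `timestamp()` on a naive `datetime` interprets it in the local time zone of the machine, so the absolute value above depends on where the notebook runs. Durations are differences and are mostly unaffected, but if reproducible absolute values matter, the time zone can be attached explicitly. A minimal sketch, assuming the timestamps are in UTC (the dataset does not state this here) and using the hypothetical name `timestamp_to_unix_utc`:

```python
import datetime

def timestamp_to_unix_utc(timestamp: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS[.fff]' as UTC (an assumption) and return UNIX seconds."""
    # Try the fractional-seconds layout first, then the plain one.
    for fmt in ('%Y-%m-%d %H:%M:%S.%f', '%Y-%m-%d %H:%M:%S'):
        try:
            parsed = datetime.datetime.strptime(timestamp, fmt)
            break
        except ValueError:
            continue
    else:
        raise ValueError(f'Unrecognised timestamp: {timestamp!r}')
    # Attach UTC explicitly so the result does not depend on the local time zone.
    return int(parsed.replace(tzinfo=datetime.timezone.utc).timestamp())

print(timestamp_to_unix_utc('1970-01-02 00:00:00'))  # 86400
```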

In [5]:
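def reduce_min_max(x, y):
    '''Merges two partial results into a (max, min) tuple.
    Each argument is either a raw timestamp (int) or an already reduced (max, min) tuple.
    '''
    x_max, x_min = x if isinstance(x, tuple) else (x, x)
    y_max, y_min = y if isinstance(y, tuple) else (y, y)
    return (max(x_max, y_max), min(x_min, y_min))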

reduced_sessions = posix_date_mapped_sessions.reduceByKey(lambda x, y: reduce_min_max(x, y))
reduced_sessions.take(1)

[(24, (1582741472, 1582737768))]

In [6]:
def get_duration_seconds(tupl):
    if type(tupl[1]) is int:
        return (tupl[0], 0)
    else:
        return (tupl[0], tupl[1][0] - tupl[1][1])

time_sessions = reduced_sessions.map(lambda x: get_duration_seconds(x))

In [7]:
time_sessions.take(10)

[(24, 3704),
 (48, 657),
 (184, 0),
 (208, 0),
 (232, 0),
 (248, 1716),
 (352, 190),
 (376, 171),
 (384, 409),
 (464, 0)]

A duration cannot be measured for sessions that contain a single click, so their duration is set to 0.
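
One way to avoid type checks in the reducer is to wrap every timestamp as a (latest, earliest) pair before reducing, so that both arguments always have the same shape. A minimal sketch in plain Python, where `to_pair`, `merge_min_max` and `session_duration` are hypothetical names that are not part of the notebook:

```python
from functools import reduce

def to_pair(timestamp: int) -> tuple:
    """Wrap a single UNIX timestamp as a (latest, earliest) pair."""
    return (timestamp, timestamp)

def merge_min_max(a: tuple, b: tuple) -> tuple:
    """Merge two (latest, earliest) pairs into one."""
    return (max(a[0], b[0]), min(a[1], b[1]))

def session_duration(timestamps: list) -> int:
    """Seconds between the first and the last click of one session."""
    latest, earliest = reduce(merge_min_max, map(to_pair, timestamps))
    return latest - earliest

print(session_duration([1582737768, 1582741472, 1582740000]))  # 3704
print(session_duration([1606579200]))  # 0
```

On the RDD, the same idea would presumably read `.mapValues(to_pair).reduceByKey(merge_min_max)`; a single-click session then naturally yields a duration of 0.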

## Group sessions and aggregate the number of unique AND total viewed items

We compute how many unique items, and how many items in total, were viewed during each session.

In [8]:
item_mapped_sessions = train_sessions.rdd.map(lambda x: (x.session_id, x.item_id)).cache()
uniquely_viewed_items = item_mapped_sessions.groupByKey().mapValues(lambda vals: len(set(vals)))

uniquely_viewed_items.take(10)

[(24, 8),
 (48, 2),
 (184, 1),
 (208, 1),
 (232, 1),
 (248, 10),
 (352, 2),
 (376, 4),
 (384, 10),
 (464, 1)]

In [9]:
total_viewed_items = item_mapped_sessions.groupByKey().mapValues(lambda vals: len(list(vals)))
total_viewed_items.take(10)

[(24, 9),
 (48, 2),
 (184, 1),
 (208, 1),
 (232, 1),
 (248, 15),
 (352, 2),
 (376, 5),
 (384, 10),
 (464, 1)]
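
Both counts can also be derived in a single pass over the grouped values, which would avoid a second `groupByKey`. A minimal sketch in plain Python; `count_items` is a hypothetical helper, not part of the notebook:

```python
def count_items(item_ids) -> tuple:
    """Return (total views, unique items) for the item ids of one session."""
    items = list(item_ids)  # materialise once so both counts use the same list
    return (len(items), len(set(items)))

print(count_items([2927, 2927, 16064, 11662]))  # (4, 3)
```

It could then be applied as `item_mapped_sessions.groupByKey().mapValues(count_items)`.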

## Time period sessions evaluation

For each session, we map its starting date to the period of the day, the day of the week, a weekend flag, the month, the season and the year.

In [10]:
# Computing the starting date for each session
starting_date_sessions = train_sessions.rdd.map(lambda x: (x.session_id, x.date)).reduceByKey(lambda x, y: min(x, y))

In [11]:
type(starting_date_sessions.take(1)[0])

tuple

In [12]:
def parse_datetime(timestamp):
    try:
        return datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f')
    except ValueError:
        return datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    

def get_season(date_time):
    '''Returns the season index of the date.

    0 -> Winter, 1 -> Spring, 2 -> Summer, 3 -> Autumn
    (northern hemisphere, approximate solstice and equinox dates)
    '''
    season = (date_time.month - 1) // 3
    season += int(date_time.month == 3 and date_time.day >= 20)
    season += int(date_time.month == 6 and date_time.day >= 21)
    season += int(date_time.month == 9 and date_time.day >= 23)
    season -= 3 * int(date_time.month == 12 and date_time.day >= 21)
    return season

def get_day_period(date_time):
    '''Converts the date_time into a period of the day.

    The day is split into six 4-hour buckets:
    0 -> 00:00 to 03:59
    1 -> 04:00 to 07:59
    2 -> 08:00 to 11:59
    3 -> 12:00 to 15:59
    4 -> 16:00 to 19:59
    5 -> 20:00 to 23:59
    '''
    return date_time.hour // 4
    
def map_day_period(timestamp):
    # Converts into datetime
    date_time = parse_datetime(timestamp)
    # Computes the season
    season = get_season(date_time)
    # Returns (day period, weekday, is_weekend, zero-based month, season, year)
    return (get_day_period(date_time), date_time.weekday(), int(date_time.weekday() > 4), date_time.month - 1, season, date_time.year)

date_period_sessions = starting_date_sessions.map(lambda x: (x[0], map_day_period(x[1])))
date_period_sessions.take(10)

[(24, (4, 2, 0, 1, 0, 2020)),
 (48, (4, 2, 0, 3, 1, 2020)),
 (184, (0, 6, 1, 3, 1, 2021)),
 (208, (2, 6, 1, 11, 3, 2020)),
 (232, (3, 4, 0, 8, 2, 2020)),
 (248, (3, 0, 0, 11, 3, 2020)),
 (352, (4, 0, 0, 5, 1, 2020)),
 (376, (2, 5, 1, 0, 0, 2020)),
 (384, (2, 4, 0, 11, 3, 2020)),
 (464, (2, 3, 0, 5, 1, 2020))]
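
If four six-hour periods (morning, afternoon, evening, night) are preferred over the `hour // 4` buckets, the hour can be shifted before dividing. This is only a sketch of an alternative encoding; `get_six_hour_period` and `PERIOD_NAMES` are hypothetical names, and using it would change the first element of the tuples shown above:

```python
import datetime

PERIOD_NAMES = ('morning', 'afternoon', 'evening', 'night')

def get_six_hour_period(date_time: datetime.datetime) -> int:
    """0 = morning (06-11h), 1 = afternoon (12-17h), 2 = evening (18-23h), 3 = night (00-05h)."""
    # Shift the clock by six hours so that 06:00 becomes the start of bucket 0.
    return ((date_time.hour - 6) % 24) // 6

sample = datetime.datetime(2020, 2, 26, 18, 24, 32)
print(PERIOD_NAMES[get_six_hour_period(sample)])  # evening
```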

## Time per item evaluation

We now compute the time spent on each item in order to find the items on which the most time was spent. The time spent on an item is taken to be the time elapsed between the click on that item and the next click.

Since this rule cannot give the time spent on the last item of a session, that value is computed from the train_purchases dataset, which also includes the date of the purchase.

In [13]:
def compute_item_duration(item_list):
    '''Return a new list of tuples, where the first element is the item_id and the second element
    is its related duration. The duration is computed from the time difference with the next visited item.
    The last item has no next click, so its raw timestamp is kept for later processing.
    
    :param item_list: A list containing tuples of (item_id, datetime), sorted by datetime
    '''
    new_list = []
    for idx in range(len(item_list)):
        if idx != len(item_list) - 1:
            time_begin = parse_datetime(item_list[idx][1])
            time_end = parse_datetime(item_list[idx+1][1])
            # total_seconds() also counts whole days, unlike the .seconds attribute
            delta_seconds = int((time_end - time_begin).total_seconds())
            new_list.append((item_list[idx][0], delta_seconds))
        else:
            new_list.append((item_list[idx][0], item_list[idx][1]))
    return new_list
            
    


time_per_item = (train_sessions.rdd
    .map(lambda x: (x.session_id, (x.item_id, x.date)))  # For every entry maps a tuple with key=session_id and values=(item_id, date)
    .groupByKey()  # Groups by session_id, result (key: session_id, values: Iterable[item_id, date])
    .mapValues(lambda x: sorted(list(x), key=lambda _x: _x[1]))  # For each key, maps the values to a sorted version
    .mapValues(lambda x: compute_item_duration(x)))  # For each value: Iterable[item_id, date] maps to a new Iterable[item_id, duration]



In [14]:
for it in time_per_item.take(1)[0][1]:
    print(it)



(2927, 15)
(2927, 180)
(16064, 23)
(11662, 42)
(434, 3096)
(18539, 183)
(10414, 42)
(28075, 118)
(18476, '2020-02-26 18:24:32.77')
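
The click-to-click rule can be illustrated on a toy session in plain Python. Note that `timedelta.seconds` only holds the seconds component and drops whole days, whereas `total_seconds()` returns the full gap. The names `parse_ts` and `pairwise_durations` and the sample events are made up for this sketch:

```python
import datetime

def parse_ts(timestamp: str) -> datetime.datetime:
    """Parse a timestamp with or without fractional seconds."""
    try:
        return datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f')
    except ValueError:
        return datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')

def pairwise_durations(events: list) -> list:
    """events: [(item_id, timestamp string)] sorted by time.
    Returns [(item_id, whole seconds until the next click)] for all but the last event."""
    result = []
    for (item_id, start), (_, end) in zip(events, events[1:]):
        delta = parse_ts(end) - parse_ts(start)
        result.append((item_id, int(delta.total_seconds())))
    return result

events = [
    (1, '2020-02-26 18:00:00'),
    (2, '2020-02-26 18:00:15.5'),
    (3, '2020-02-27 18:00:15.5'),  # exactly one day after the previous click
]
print(pairwise_durations(events))  # [(1, 15), (2, 86400)]
print(datetime.timedelta(days=1).seconds)  # 0
```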

In [15]:
def compute_last_item_time(item_list, purchase_time):
    '''Computes the time spent on the last item.
    :param item_list: A list of visited items in a session, already processed by the compute_item_duration function
    :param purchase_time: The timestamp of the purchase that ended the session
    '''
    new_list = []
    for idx in range(len(item_list)):
        if idx != len(item_list) - 1:
            new_list.append(item_list[idx])
        else:
            begin_time = parse_datetime(item_list[idx][1])
            end_time = parse_datetime(purchase_time)
            # total_seconds() also counts whole days, unlike the .seconds attribute
            delta_seconds = int((end_time - begin_time).total_seconds())
            new_list.append((item_list[idx][0], delta_seconds))
    return new_list


time_per_item = (train_purchases.rdd
    .map(lambda x: (x.session_id, x.date))  # For every entry maps a tuple with key=session_id and value=x.date
    .join(time_per_item)  #  Joins the purchase dataset with the time_per_item dataset from the key (session_id)
    .mapValues(lambda x: compute_last_item_time(x[1], x[0]))  # Sets the duration of the last viewed item.
)

In [16]:
time_per_item.take(10)

[(48, [(8398, 657), (26404, 35)]),
 (208, [(26257, 20)]),
 (352, [(8530, 189), (13653, 359)]),
 (384,
  [(9582, 14),
   (1271, 35),
   (13412, 49),
   (7474, 25),
   (12140, 31),
   (5692, 44),
   (14953, 106),
   (784, 29),
   (2991, 71),
   (27183, 27)]),
 (464, [(7607, 90)]),
 (480, [(2915, 240), (7548, 152), (16377, 59)]),
 (496,
  [(23462, 26),
   (5880, 52),
   (10892, 67),
   (6636, 66),
   (21197, 51),
   (17239, 129),
   (26180, 158)]),
 (512,
  [(26283, 26),
   (3382, 41),
   (14360, 71),
   (18765, 25),
   (2273, 36),
   (4061, 48),
   (19866, 367),
   (10017, 100),
   (9699, 28),
   (2366, 68),
   (9106, 45),
   (10531, 36),
   (5243, 67),
   (4612, 56)]),
 (544, [(26237, 2319), (13636, 186)]),
 (592, [(15170, 523), (1780, 322)])]

In [19]:
# First and second features: the item with the most time spent on it, and that time.
longest_items = time_per_item.mapValues(lambda x: sorted(x, key=lambda _x : _x[1])[-1])
longest_items.take(10)

[(48, (8398, 657)),
 (208, (26257, 20)),
 (352, (13653, 359)),
 (384, (14953, 106)),
 (464, (7607, 90)),
 (480, (2915, 240)),
 (496, (26180, 158)),
 (512, (19866, 367)),
 (544, (26237, 2319)),
 (592, (15170, 523))]

In [21]:
# Third feature: the last item visited in the session.
last_item_per_session  = time_per_item.mapValues(lambda x: x[-1][0])
last_item_per_session.take(10)

[(48, 26404),
 (208, 26257),
 (352, 13653),
 (384, 27183),
 (464, 7607),
 (480, 16377),
 (496, 26180),
 (512, 4612),
 (544, 13636),
 (592, 1780)]

In [72]:
import numpy as np

# Fourth and fifth features: the mean and standard deviation of the time spent on each item
mean_time_per_session = time_per_item.mapValues(lambda x: (np.array(x)[:, 1].mean(), np.array(x)[:, 1].std()))
mean_time_per_session.take(10)

[(48, (346.0, 311.0)),
 (208, (20.0, 0.0)),
 (352, (274.0, 85.0)),
 (384, (43.1, 25.719447894540817)),
 (464, (90.0, 0.0)),
 (480, (150.33333333333334, 73.90233795730386)),
 (496, (78.42857142857143, 43.709616930887165)),
 (512, (72.42857142857143, 84.18662456175589)),
 (544, (1252.5, 1066.5)),
 (592, (422.5, 100.5))]
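
NumPy's `std` defaults to the population standard deviation (`ddof=0`), which is why single-item sessions get 0.0. The values above can be cross-checked with the standard library alone; `time_stats` is a hypothetical name used only for this sketch:

```python
import statistics

def time_stats(item_durations: list) -> tuple:
    """Return (mean, population standard deviation) of the durations in [(item_id, seconds), ...]."""
    durations = [seconds for _, seconds in item_durations]
    return (statistics.fmean(durations), statistics.pstdev(durations))

# Session 48 from the output above.
print(time_stats([(8398, 657), (26404, 35)]))  # (346.0, 311.0)
```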

In [105]:
def count_different_items(x):
    '''Debug helper: prints the (item_id, 1) pairs of a session and returns them unchanged.'''
    values = list(x)
    print(values)
    print('----')
    return x

items_per_sessions = (train_sessions.rdd
                      .map(lambda x: (x.session_id, (x.item_id, 1)))
                      .groupByKey()
                      .sample(False, 0.000005, seed=42)
                      .mapValues(count_different_items)
)
items_per_sessions.take(10)

[(243, 1), (13322, 1), (14248, 1)]
----
[(22705, 1)]
----
[(2447, 1), (19150, 1), (18737, 1)]
----
[(22435, 1), (2314, 1)]
----
[(20086, 1), (222, 1), (19215, 1), (4542, 1)]
----
[(6373, 1)]
----
[(12427, 1)]
----
[(7360, 1)]
----
[(3254, 1), (24454, 1), (6651, 1), (15373, 1), (24799, 1), (4867, 1), (17495, 1), (20641, 1), (7159, 1), (10445, 1), (10440, 1), (27943, 1), (6392, 1), (27232, 1), (22474, 1)]
----
[(20556, 1), (19736, 1), (12945, 1), (19736, 1)]
----

[(747089, <pyspark.resultiterable.ResultIterable at 0x7fb0cc9bf640>),
 (910665, <pyspark.resultiterable.ResultIterable at 0x7fb0cc9bf5e0>),
 (1983049, <pyspark.resultiterable.ResultIterable at 0x7fb0cc9bf2b0>),
 (4385009, <pyspark.resultiterable.ResultIterable at 0x7fb0cc9bf730>),
 (2188762, <pyspark.resultiterable.ResultIterable at 0x7fb0cc9bf790>),
 (3994580, <pyspark.resultiterable.ResultIterable at 0x7fb0cc9bf7f0>),
 (2974789, <pyspark.resultiterable.ResultIterable at 0x7fb0cca28fa0>),
 (3069125, <pyspark.resultiterable.ResultIterable at 0x7fb0cca28850>),
 (1584439, <pyspark.resultiterable.ResultIterable at 0x7fb0cca28040>),
 (2908407, <pyspark.resultiterable.ResultIterable at 0x7fb0cca28dc0>)]
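
The printed groups suggest one possible definition of the "item revisit" idea: count every view of an item beyond its first view in the session. The sketch below is only one way to define it, with the hypothetical helper name `revisit_features`; it also counts two consecutive clicks on the same item as a revisit, which may or may not be desired:

```python
from collections import Counter

def revisit_features(item_ids) -> tuple:
    """Return (has_revisit, n_revisits) for the sequence of item ids of one session."""
    counts = Counter(item_ids)
    # Every view beyond the first view of a given item counts as a revisit.
    n_revisits = sum(count - 1 for count in counts.values())
    return (int(n_revisits > 0), n_revisits)

print(revisit_features([20556, 19736, 12945, 19736]))  # (1, 1)
print(revisit_features([22435, 2314]))  # (0, 0)
```

Applied to an RDD of `(session_id, item_id)` pairs, this could be used as `.groupByKey().mapValues(revisit_features)`.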