# Feature engineering

From the dataset explored in the 0_data_analysis notebook, we compute several features.

In order to obtain a learnable dataset, features must be preprocessed and engineered in order to contain valuable learning data. This notebook will generate a PySpark RDD that contains the preprocessed dataset, ready for feature selection algorithms.

### Configuring and launching the pyspark environment

In [1]:
import os 
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# launch this cell if you have issues on windows with py4j (think about updating your PATH)
import sys
os.environ['PYSPARK_DRIVER_PYTHON_OPTS']= "notebook"
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_PYTHON'] = sys.executable

# starts a spark session from notebook

os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=4g  pyspark-shell"
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("load_explore") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/27 09:07:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Loading the datasets inside spark

In [46]:
# loads relevant datas in DataFrames
train_sessions = spark.read.load('../Data/train_sessions.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

train_purchases = spark.read.load('../Data/train_purchases.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

candidate_items = spark.read.load('../Data/candidate_items.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

item_features = spark.read.load('../Data/item_features.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

test_sessions = spark.read.load('../Data/test_leaderboard_sessions.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

                                                                                

## Time related features.

As fashion purchases are highly dependant on seasonal trends, we will extract date features from the session item visit dates.

For each session, the first item visit date will be taken and used as the reference session date. (The average session duration is an hour)

Engineered features include:
* Month of the session ✅
* Season of the session ✅
* Year of the session ✅
* Duration of the session ✅
* Day period of the session ✅

In [47]:
import datetime

def parse_datetime(timestamp):
    try:
        return datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f')
    except ValueError:
        return datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')    

date_parsed_sessions = (train_sessions.rdd
                        .map(lambda x: (x.session_id, parse_datetime(x.date)))  # Maps rows to (key=session_id, values=(parsed_date)) tuples
                        .cache())

# Reduces by key from the MAX monoid
max_date_sessions = (
    date_parsed_sessions
    .reduceByKey(max)
)

# Reduces by key from the MIN monoid
min_date_sessions = (
    date_parsed_sessions
    .reduceByKey(min)
)

# Computes delta time in seconds
time_sessions_in_seconds = (
    max_date_sessions
    .join(min_date_sessions)  # Joins the max dates with the min dates from the session_id (on pair per session)
    .mapValues(lambda x: (x[0] - x[1]).seconds)  # Computes delta time for each session 
)

time_sessions_in_seconds.take(5)

                                                                                

[(24, 3703), (28, 87), (36, 43), (44, 33), (48, 657)]

In [48]:
def get_season(date_time):
    '''Converts the date_time into a season
    
    :returns: an integer
        0 -> Winter
        1 -> Spring
        2 -> Summer
        3 -> Autumn
    '''    
    season = (date_time.month - 1) // 3
    season += (date_time.month == 3)&(date_time.day>=20)
    season += (date_time.month == 6)&(date_time.day>=21)
    season += (date_time.month == 9)&(date_time.day>=23)
    season -= 3*int(((date_time.month == 12)&(date_time.day>=21)))
    return season

def get_day_period(date_time):
    '''Converts the date_time into the day of the week.
    
    0 -> Morning (from 6am to 12am)
    1 -> Afternoon (from 12am to 6pm)
    2 -> Evening (from 6pm to 12pm)
    3 -> Night (from 12pm to 6am)
    '''
    return date_time.hour // 6

# Assigns a season for each session
season_per_session = (
    min_date_sessions
    .mapValues(get_season)
)

# Assigns a day period (morning, afternoon, evening, night...) for each session
day_period_per_session = (
    min_date_sessions
    .mapValues(get_day_period)
)

# Assigns a month for each session
month_per_session = (
    min_date_sessions
    .mapValues(lambda x: x.month - 1)
)

# Assigns a year for each session
year_per_session = (
    min_date_sessions
    .mapValues(lambda x: x.year - 2020)
)

print(min_date_sessions.take(5))
print(season_per_session.take(5))
print(day_period_per_session.take(5))
print(month_per_session.take(5))
print(year_per_session.take(5))

[(24, datetime.datetime(2020, 2, 26, 17, 22, 48, 903000)), (28, datetime.datetime(2020, 5, 18, 12, 50, 24, 248000)), (36, datetime.datetime(2020, 6, 21, 10, 29, 8, 263000)), (44, datetime.datetime(2020, 11, 27, 20, 45, 10, 302000)), (48, datetime.datetime(2020, 4, 15, 17, 17, 42, 594000))]
[(24, 0), (28, 1), (36, 2), (44, 3), (48, 1)]
[(24, 2), (28, 2), (36, 1), (44, 3), (48, 2)]
[(24, 1), (28, 4), (36, 5), (44, 10), (48, 3)]
[(24, 0), (28, 0), (36, 0), (44, 0), (48, 0)]


## Item time related features

Computing the items where time was the most spent on. The time spent on an item is the difference between the visit of that item and the visit of the last item. For the last item, the time is computed from the session purchase date.

We will extract two feautures:
* The item on which the use has spent the most time on ✅
* The time spent on that item ✅

In [52]:
# Computing session purchase_date
session_purchase_date = (
    train_purchases.rdd
    .map(lambda x: (x.session_id, parse_datetime(x.date)))
)




def compute_time_per_item(session_info):
    '''Computes the time spent on each item (in seconds)
    
    :param session_info: The information of the session ([(item_id, date)...], purchase_date)
    :returns: Time per item information [(item_id, visit_time)...]
    '''
    # Unpacks the info
    visited_items, purchase_date = session_info
    item_time_list = []
    
    for idx in range(len(visited_items)):
        item_id, item_visit_date = visited_items[idx]
        
        # Last item, need to check the purchase date
        if idx == len(visited_items) - 1:
            item_time_list.append((item_id, (purchase_date - item_visit_date).seconds))
        # Not the last item, checks the next item 
        else:
            _, next_item_visit_date = visited_items[idx + 1]
            item_time_list.append((item_id, (next_item_visit_date - item_visit_date).seconds))
    
    return item_time_list


# Here, we group the tuples by their keys (session_id) using Map Reduce
items_time_per_session = (
    train_sessions.rdd
    .map(lambda x: (x.session_id, (x.item_id, parse_datetime(x.date))))
    .mapValues(lambda x: [(x[0], x[1])])  # Sets all the values inside a list in order to easily reduce and join the sessions
    .reduceByKey(lambda x, y: x + y)  # Joins all the sessions togheter by reducing and joining the lists
    .join(session_purchase_date)  # Each tuple has now the shape (session_id, ([(item_id, date)...], purchase_date))
    .mapValues(lambda x: (sorted(x[0], key=lambda t:t[1]), x[1]))  # For each session, sorts the (item_id, date) tuples by date 
    .mapValues(compute_time_per_item)
)

items_time_per_session.take(1)[0]


                                                                                

(24,
 [(2927, 15),
  (2927, 180),
  (16064, 23),
  (11662, 42),
  (434, 3096),
  (18539, 183),
  (10414, 42),
  (28075, 118),
  (18476, 191)])

In [53]:
TEST_items_time_per_session = (
    test_sessions.rdd
    .map(lambda x: (x.session_id, (x.item_id, parse_datetime(x.date))))
    .mapValues(lambda x: [(x[0], x[1])])  # Sets all the values inside a list in order to easily reduce and join the sessions
    .reduceByKey(lambda x, y: x + y)  # Joins all the sessions togheter by reducing and joining the lists
    .join(session_purchase_date)  # Each tuple has now the shape (session_id, ([(item_id, date)...], purchase_date))
    .mapValues(lambda x: (sorted(x[0], key=lambda t:t[1]), x[1]))  # For each session, sorts the (item_id, date) tuples by date 
    .mapValues(compute_time_per_item)
)

test_sessions.take(1)[0]

Row(session_id=26, item_id=19185, date='2021-06-16 09:53:54.158')

In [6]:
most_spent_time = (
    items_time_per_session
    .mapValues(lambda x: sorted(x, reverse=True, key=lambda t: t[1]))  # Sorts the item visits by time
    .mapValues(lambda x: x[0])  # Gets the item with the most time spent on
)

most_spent_time.take(1)[0]

                                                                                

(24, (434, 3096))

In [55]:
TEST_most_spent_time = (
    TEST_items_time_per_session
    .mapValues(lambda x: sorted(x, reverse=True, key=lambda t: t[1]))  # Sorts the item visits by time
    .mapValues(lambda x: x[0])  # Gets the item with the most time spent on
)

TEST_most_spent_time.take(1)[0]

                                                                                

IndexError: list index out of range

## Session mean and standard deviation time

We will extract two features:

* For each session, the mean visit time on each item ✅
* For each session, the standard time on each time ✅

In [7]:
import numpy as np

mean_std_time_session = (
    items_time_per_session
    .mapValues(lambda x: np.array(x, dtype=np.float32)[:, 1])  # Gets a numpy array for each time and takes only the visit time (drops the item id)
    .mapValues(lambda x: (x.mean(), x.std()))
)

mean_std_time_session.take(1)

                                                                                

[(24, (432.22223, 944.24854))]

In [None]:
import numpy as np

mean_std_time_session = (
    items_time_per_session
    .mapValues(lambda x: np.array(x, dtype=np.float32)[:, 1])  # Gets a numpy array for each time and takes only the visit time (drops the item id)
    .mapValues(lambda x: (x.mean(), x.std()))
)

mean_std_time_session.take(1)

## Item revisit time

Item that have been revisited are the most likely to catch the interest of the user, and therefore to be purchased.

We will extract three features:
* The item that has been revisited the most times ✅
* The number of times this item has been revisited ✅
* The number of items that have been revisited at least once ✅

In [8]:
item_visit_counts = (
    train_sessions.rdd
    .map(lambda x: (x.session_id, x.item_id))  # Maps the rows to a (key=session_id, values=item_id) tuple
    .mapValues(lambda x: [x])  # Puts the items inside a list, so they can be reduced easily.
    .reduceByKey(lambda x, y: x + y)  # Joins the lists togheter, grouping the tuples by keys
    .mapValues(lambda x: np.unique(x, return_counts=True))  # Transforms values into two arrays: an item_id array and an occurence_array
    .mapValues(lambda x: np.vstack((x[0], x[1])))
)

def get_revisited_items(count_array):
    '''Gets the number of items revisited at least once'''
    revisited_indices = count_array[1, :] > 1
    return np.count_nonzero(revisited_indices)

def get_item_revisits_info(count_array):
    '''Returns the item_id of the item that
    has been revisited the most as well as the 
    number of times it has been revisited
    
    If no item was revisited return -1, -1, 0
    '''
    number_of_revisits = get_revisited_items(count_array)
    if number_of_revisits == 0:
        return (-1, -1, number_of_revisits)
    
    most_revisited_item_idx = np.argmax(count_array[1, :])
    most_revisits = count_array[1, :].max()
    
    return (count_array[0, most_revisited_item_idx], most_revisits, number_of_revisits)
    

item_revisit_info = (
    item_visit_counts
    .mapValues(get_item_revisits_info) 
)


print(item_revisit_info.take(10))

[Stage 31:>                                                         (0 + 1) / 1]

[(24, (2927, 2, 1)), (28, (11529, 2, 1)), (36, (-1, -1, 0)), (44, (-1, -1, 0)), (48, (-1, -1, 0)), (52, (-1, -1, 0)), (108, (12735, 3, 2)), (124, (-1, -1, 0)), (140, (-1, -1, 0)), (156, (-1, -1, 0))]


                                                                                

# Engineering dataset features from item features

Each item in the dataset has a finite number of features. Presented in tuples of (feature_cateogy_id, feature_value_id), each item has multiple of those feauters.

As there are 73 unique features categories and 904 different unique (feature_category, feature_value) pairs. As most of those categories are not documented, it is difficult to interepret their meaning.

In this phase, we will engineer features by performing K-Means clustering using Map-Reduce.

Multiple features will be engineered:
* One feature telling in wich category cluster is the session belonging ✅
* One feature for each cluster on the item feature value clustering (25 clusters) ✅

Counting the number of occurences of each item category in each session

In [9]:
# Counts the number of unique category IDs
unique_categories_nb = len(
    item_features.rdd
    .map(lambda x: set([x.feature_category_id]))  # Maps the rows to a set containing the feature category
    .reduce(lambda x, y: x.union(y))  # Reduces by joining the sets, keeping only the unique categories values
)

def initialize_vector(category_nb, max_categories=unique_categories_nb):
    '''
    Returns a vector of categories, where each element is 0 except the one at the category's index.
    '''
    vector = np.zeros(max_categories, dtype=np.float32)
    vector[category_nb - 1] = 1.0
    return vector

def normalize_vector(category_vector):
    '''
    Noramlizes a category vector, dividing all the elements by the sum of of occurences
    '''
    total_occurences = np.sum(category_vector)
    return category_vector / total_occurences


normalized_categories_vector_per_session = (
    train_sessions.rdd
    .map(lambda x: (x.item_id, x.session_id))  # Maps the rows to (key=item_id,values=sessions_id) tuples
    .join(item_features.rdd.map(lambda x: (x.item_id, x.feature_category_id)))  # Joins the two datasets togheter by item_id
    .map(lambda x: x[1])  # Only keeps some parts of the tuples, now we have (key=session_id,values=feature_category_id)
    .mapValues(initialize_vector)  # Encodes the feature category inside a counter vector
    .reduceByKey(lambda x, y: x + y)   # Reduces by session_id, and adds the counter vectors
    .mapValues(normalize_vector)  # Noramlizes the vector, so that the sum of every element is equal to 1.0
)

normalized_categories_vector_per_session.take(1)

                                                                                

[(474192,
  array([0.        , 0.00381679, 0.01145038, 0.02290076, 0.01145038,
         0.01908397, 0.04961832, 0.        , 0.        , 0.        ,
         0.02290076, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.02290076, 0.02290076, 0.02290076, 0.        ,
         0.        , 0.        , 0.00381679, 0.01145038, 0.        ,
         0.02290076, 0.        , 0.00381679, 0.02671756, 0.04961832,
         0.00381679, 0.02290076, 0.00763359, 0.01145038, 0.        ,
         0.00381679, 0.00381679, 0.00381679, 0.00381679, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.02290076,
         0.01145038, 0.05725191, 0.        , 0.00381679, 0.05725191,
         0.        , 0.        , 0.01908397, 0.        , 0.02671756,
         0.05725191, 0.        , 0.        , 0.02290076, 0.01908397,
         0.05343511, 0.02671756, 0.02290076, 0.00381679, 0.02290076,
         0.        , 0.00381679, 0.05343511, 0.04961832, 0.        ,
         0.00381679, 0.0

## Clustering on item feature categories

Performing clustering from the Normalized Category Vectors

In [10]:
##### CLUSTERING WITH MAP/REDUCE
# inspired from https://uv.ulb.ac.be/pluginfile.php/3410436/mod_resource/content/1/Kmeans.html?fbclid=IwAR1h5xJKxQ1nrCHlgtQsofwjgc8B7oVl69GG6Mm8WxChTH3zBc4SKFo3Noo

from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()

def mapstep(x,br_clusters): 
    """
    This function returns the closest cluster of the current row x
    x : current row
    broadcast_clusters : current value of the clusters
    """
    M = br_clusters.value.shape[0]
    d=np.zeros(M) # distance of the current row with each cluster
    for m in range(M):
        d[m]=sum(abs(np.subtract(x,br_clusters.value[m]))) # compute distance between cluster m and row x
    return np.argmin(d)

def add_values(x, y):
    return x[0] + y[0], x[1] + y[1]

def k_means_MR(dataset, M, steps):
    """
    dataset : RDD data structure, all columns are used for distance computation (1xn dimension)
    M : number of clusters
    steps : number of steps
    """
    n = len(dataset.take(1)[0]) # input dimension
    clusters = np.array(dataset.takeSample(True,M)) # starts with random clusters
    broadcast_clusters = sc.broadcast(clusters) ## broadcast cluster position
    for i in range(steps):
        distance_set = dataset.map(lambda x : (mapstep(x, broadcast_clusters), (x, 1))) # set containing (closest_cluster,(row value),counter=1)
        new_clusters_set = distance_set.reduceByKey(add_values) # adds all cluster related entry (+ increments counter)
        new_clusters_set = new_clusters_set.map(lambda x : x[1][0] / x[1][1]) # apply mean operation
        new_clusters = np.array(new_clusters_set.take(M))
        broadcast_clusters = sc.broadcast(new_clusters)
    return broadcast_clusters.value


mapped_vectors = normalized_categories_vector_per_session.map(lambda x: x[1]).cache()  # Drops the session id

feature_category_clusters = k_means_MR(
    mapped_vectors,
    10,  # 10 Clusters are taken
    5  # Iterates over 5 steps
)

22/05/27 09:13:06 WARN BlockManager: Task 64 already completed, not releasing lock for rdd_105_0
                                                                                

In [11]:
broadcast_clusters_142 = sc.broadcast(feature_category_clusters) # broadcasting is a good practice for parallel computing
clustered_categories = normalized_categories_vector_per_session.map(lambda x : (x[0], mapstep(x[1],broadcast_clusters_142))) # compute associated clusters for each item

clustered_categories.take(10) # (example) OUTPUT [SESSION_ID | CLUSTER_ID]

                                                                                

[(474192, 8),
 (1119330, 8),
 (1309362, 0),
 (1468758, 8),
 (1910616, 0),
 (2691762, 1),
 (2933592, 8),
 (4141098, 8),
 (18, 0),
 (93486, 0)]

## Clustering on item feature values

We will perform clustering on the item feature values only. In total, 25 different clusters will be computed and each item will be classified in one of those clusters.

Then, for each session we will look at the visited items. For each of those items we will increment the session's counter for the class of that item. In total, 25 new features will be added to our sessions.

In [12]:
unique_feature_values = (
    item_features.rdd
    .map(lambda x: (set([x.feature_value_id])))  # Maps each row to a set containing the feature value
    .reduce(lambda x, y: x.union(y))  # Joins the sets, discarding duplicate values (only keeping unique values)
)

highest_feature_value = max(unique_feature_values)  # Takes the highest value in the set, used as vector sie


def initialize_value_vector(value_nb, max_values=highest_feature_value):
    '''
    Returns a vector of item feature values, where each element is 0 except the one at the values's index.
    '''
    vector = np.zeros(max_values, dtype=np.float32)
    vector[value_nb - 1] = 1.0
    return vector

encoded_item_values = (
    item_features.rdd
    .map(lambda x: (x.item_id, x.feature_value_id))  # Maps each row to a (key=item_id, values=feature_value_id) tuples
    .mapValues(initialize_value_vector)  # Initializes each value vector (one hot encoding type)
    .reduceByKey(lambda x, y: x + y)  # Reduces on each item by summing the feature value vector
)

                                                                                

In [13]:
# Performs clustering
mapped_item_values = encoded_item_values.map(lambda x: x[1])

feature_values_clusters = k_means_MR(
    mapped_item_values,
    25,  # As analysed in notebook 2b_feature_clustering, 25 is a good cluster compromise
    5
)

                                                                                

In [14]:
broadcast_clusters_47 = sc.broadcast(feature_values_clusters) # broadcasting is a good practice for parallel computing
clustered_item_values = encoded_item_values.map(lambda x : (x[0], mapstep(x[1],broadcast_clusters_47))) # compute associated clusters for each item

clustered_item_values.take(10) # (example) OUTPUT [ITEM_ID | CLUSTER_ID]

[(2, 13),
 (4, 0),
 (8, 14),
 (10, 1),
 (14, 15),
 (16, 0),
 (18, 2),
 (20, 16),
 (24, 1),
 (26, 16)]

In [15]:
# Encodes the clustered item values in one-hot encoding
encoded_item_values = (
    clustered_item_values
    .mapValues(lambda x: initialize_value_vector(x, max_values=25))
)

# Computes the vector for each session
item_features_session = (
    train_sessions.rdd
    .map(lambda x: (x.item_id, x.session_id))
    .join(encoded_item_values)
    .map(lambda x: (x[1][0], x[1][1]))
    .reduceByKey(lambda x, y: x + y)
    .mapValues(normalize_vector)
    .mapValues(tuple)
)

item_features_session.take(1)

                                                                                

[(474192,
  (0.06666667,
   0.0,
   0.0,
   0.06666667,
   0.0,
   0.13333334,
   0.0,
   0.2,
   0.0,
   0.0,
   0.0,
   0.0,
   0.2,
   0.06666667,
   0.0,
   0.0,
   0.2,
   0.0,
   0.0,
   0.06666667,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0))]

# Cluster items depending on their categories

As there is a too high amount of different items, one-hot encoding them would result in an undesirable explosion of features.

Items are therefore assigned to one of 50 clusters, based on their categories. Two similar items have a higher chance to have common categories, and item values are too unique to be useful.

In [16]:
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()

def mapstep(x,br_clusters): 
    """
    This function returns the closest cluster of the current row x
    x : current row
    broadcast_clusters_k : current value of the clusters
    """
    M = br_clusters.value.shape[0]
    d=np.zeros(M) # distance of the current row with each cluster
    for m in range(M):
        d[m]=sum(abs(np.subtract(x,br_clusters.value[m]))) # compute distance between cluster m and row x
    return(np.argmin(d),)    

def k_means_MR(dataset,M,steps):
    """
    dataset : RDD data structure, all columns are used for distance computation (1xn dimension)
    M : number of clusters
    steps : number of steps
    """
    n = len(dataset.take(1)[0]) # input dimension
    clusters = np.array(dataset.takeSample(True,M)) # starts with random clusters
    broadcast_clusters_k = sc.broadcast(clusters) ## broadcast cluster position
    for i in range(steps):
        distance_set = dataset.map(lambda x : mapstep(x,broadcast_clusters_k)+(x+(1,),)) # set containing (closest_cluster,(row value),counter=1)
        new_clusters_set = distance_set.reduceByKey(lambda x,y : np.add(x,y)) # adds all cluster related entry (+ increments counter)
        new_clusters_set = new_clusters_set.map(lambda x : x[1][:-1]/x[1][-1]) # apply mean operation
        new_clusters = np.array(new_clusters_set.take(M))
        broadcast_clusters_k = sc.broadcast(new_clusters)
    return broadcast_clusters_k.value

category_count = item_features.rdd.map(lambda x: x[1]).distinct().collect() # 904 different features value
renamed_set = item_features.rdd.map(lambda x: (x[0], x[1]))

dic_values = {elem:i for i, elem in enumerate(category_count)} # maps the features values in a dictionnary

def fill_column(x):
    value = dic_values.get(x[1])
    return (x[0],((0,)*(value)+(1,)+(0,)*(len(category_count)-value-1)))

columns_features = renamed_set.map(fill_column) # creates all empty columns with features values and fills it
reduced_map = columns_features.reduceByKey(lambda x,y : tuple(sum(x) for x in zip(x,y))) # reduce by key (item)

dataset = reduced_map.map(lambda x : x[1]) # remap everything in one single line of 904 elements (easier for clustering)

item_cat_clusters = k_means_MR(dataset,50,5) # compute cluster values

broadcast_clusters_pp = sc.broadcast(item_cat_clusters) # broadcasting is a good practice for parallel computing
item_clustered_by_cat = reduced_map.map(lambda x : (x[0],)+mapstep(x[1],broadcast_clusters_pp)) # compute associated clusters for each item

item_clustered_by_cat.take(20) # (example) OUTPUT [ITEM_ID | CLUSTER_ID]

                                                                                

[(2, 0),
 (4, 1),
 (8, 22),
 (10, 23),
 (14, 2),
 (16, 1),
 (18, 3),
 (20, 24),
 (24, 25),
 (26, 4),
 (28, 5),
 (30, 4),
 (32, 26),
 (36, 4),
 (38, 27),
 (40, 6),
 (42, 25),
 (44, 7),
 (46, 28),
 (50, 1)]

In [17]:
# Saves a dictionnary that assigns each item to its cluster (so no clustering operation must be done again)
item_cluster_dict = { item_id: cluster for item_id, cluster in item_clustered_by_cat.collect() }


                                                                                

In [18]:
import pickle

with open('../Data/item_dict.pd', 'wb') as handle:
    pickle.dump(item_cluster_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [19]:
def map_cluster_revisited(x):
    if x[0] != -1:
        return item_cluster_dict[x[0]], x[1], x[2]
    return x
    
most_spent_time_clustered = most_spent_time.mapValues(lambda x: (item_cluster_dict[x[0]], x[1]))  # Encodes the item id in its cluster
revisited_clustered = item_revisit_info.mapValues(map_cluster_revisited)  # Encodes the item id in its cluster

## First and last visited items

In each session, take the item that has been visited as first and revisited as last

Two features will be engineered:

* The item that has been visited as first (clustered) ✅
* The item that has been visites as last (clustered) ✅

In [20]:
first_last_item = (
    train_sessions.rdd
    .map(lambda x: (x.session_id, (x.item_id, parse_datetime(x.date))))  # Maps the rows into a (key=session_id, value=(item_id, date)) tuple
    .mapValues(lambda x: ([(x)]))  # Puts the (item_id, date) tuple into a list to easily merge the values
    .reduceByKey(lambda x, y: x + y)  # Groups the items per session, merging the session's items and their date
    .mapValues(lambda x: sorted(x, key=lambda x_: x_[1]))  # In each session, sorts the items by date (earliest to latest)
    .mapValues(lambda x: (x[0][0], x[-1][0]))  # Takes the first and last visited items
    .cache()
)
first_last_item.take(10)

22/05/27 09:24:20 WARN BlockManager: Task 219 already completed, not releasing lock for rdd_244_0
                                                                                

[(24, (2927, 18476)),
 (28, (16895, 21902)),
 (36, (25417, 26536)),
 (44, (22747, 17089)),
 (48, (8398, 26404)),
 (52, (24636, 12047)),
 (108, (4816, 15421)),
 (124, (26092, 26092)),
 (140, (18723, 13914)),
 (156, (3462, 3462))]

In [21]:
# Encodes the first and last visited item from their clusters
first_last_item_clustered = first_last_item.mapValues(lambda x: (item_cluster_dict[x[0]], item_cluster_dict[x[1]]))
first_last_item_clustered.take(10)

[(24, (17, 33)),
 (28, (7, 7)),
 (36, (32, 31)),
 (44, (5, 5)),
 (48, (30, 8)),
 (52, (9, 1)),
 (108, (36, 14)),
 (124, (32, 32)),
 (140, (5, 5)),
 (156, (38, 38))]

## Most and Least frequent bought item (clustered), based on the most visited item (and most time spend on item)

### FOR MOST VISITED ITEM

In [22]:
# join the most revisited item (clustered) with the bought items (clustered)
temp_clusters = (train_purchases
                 .rdd.map(lambda x: (x[0],(item_cluster_dict.get(x[1]))))
                 .join(revisited_clustered.map(lambda x: (x[0],(x[1][0]))))
                ).cache()

In [23]:
temp_clusters.take(1)

22/05/27 09:24:20 WARN DAGScheduler: Broadcasting large task binary with size 1045.3 KiB
22/05/27 09:24:44 WARN BlockManager: Task 229 already completed, not releasing lock for rdd_254_0
                                                                                

[(24, (5, 17))]

In [24]:
temp_clusters_m = (
    temp_clusters.map(lambda x : (x[1][1],(x[1][0],)))
    .reduceByKey(lambda x,y : x+y)
)# keep only most_visited cluster and bought cluster

In [25]:
#temp_clusters_m.take(1)

In [26]:
def mode_and_least(row):
    vals, counts = np.unique(np.array(row), return_counts=True)
    mode_value = np.argwhere(counts == np.max(counts))[0][0]
    least_value = np.argwhere(counts == np.min(counts))[0][0]
    return (row[mode_value],row[least_value])

temp_most_visited = temp_clusters_m.map(lambda x : (x[0], mode_and_least(x[1]))) # mapping : most visited clusted id -->TO--> most and least bought (cluster)

In [27]:
most_visited_dict = {k: v for k, v in temp_most_visited.collect()}

                                                                                

In [28]:
most_visited_dict.get(20)

(20, 9)

In [29]:
revisited_clustered = revisited_clustered.map(lambda x: (x[0],x[1]+most_visited_dict.get(x[1][0])))

In [30]:
revisited_clustered.take(1)

                                                                                

[(24, (17, 2, 1, 12, 28))]

### FOR MOST TIME SPENT ON ITEM

In [31]:
# join the most revisited item (clustered) with the bought items (clustered)
temp_clusters = (train_purchases
                 .rdd.map(lambda x: (x[0],(item_cluster_dict.get(x[1]))))
                 .join(most_spent_time_clustered.map(lambda x: (x[0],(x[1][0]))))
                ).cache()
temp_clusters.take(1)

22/05/27 09:26:07 WARN DAGScheduler: Broadcasting large task binary with size 1047.3 KiB
                                                                                

[(24, (5, 5))]

In [32]:
temp_clusters_m = (
    temp_clusters.map(lambda x : (x[1][1],(x[1][0],)))
    .reduceByKey(lambda x,y : x+y)
).cache()# keep only most_visited cluster and bought cluster

In [33]:
#temp_clusters_m.take(1)

In [34]:
temp_most_time = temp_clusters_m.map(lambda x : (x[0], mode_and_least(x[1]))) # mapping

In [35]:
temp_most_time.take(10)

most_time_dict = {k: v for k, v in temp_most_time.collect()}

                                                                                

In [36]:
most_spent_time_clustered = most_spent_time_clustered.map(lambda x: (x[0],x[1]+most_time_dict.get(x[1][0])))

In [37]:
most_spent_time_clustered.take(1)

                                                                                

[(24, (5, 3096, 22, 2))]

# Save the final dataset

The final engineered dataset, containing all the values, is created by joining the other engineered value datasets with the session_id as key.

In [38]:
#item_features_session.take(1)

In [39]:
def unpack_tuples(row):
    out = []
    for elem in row:
        if type(elem) == tuple:
            out.extend(unpack_tuples(elem))
        else:
            out.append(elem)
    return out

final_engineered_dataset = (
    time_sessions_in_seconds
    .join(season_per_session)
    .join(day_period_per_session)
    .join(month_per_session)
    .join(year_per_session)
    .join(most_spent_time_clustered)
    .join(mean_std_time_session)
    .join(revisited_clustered)
    .join(first_last_item_clustered)
    .join(clustered_categories)
    .join(item_features_session)
    .mapValues(unpack_tuples)
)


final_engineered_dataset.take(1)

                                                                                

[(28560,
  [272,
   3,
   2,
   8,
   0,
   26,
   73,
   26,
   7,
   47.5,
   19.94785,
   -1,
   -1,
   0,
   39,
   22,
   31,
   26,
   1,
   0.0,
   0.0,
   0.16666667,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.16666667,
   0.0,
   0.0,
   0.0,
   0.0,
   0.6666667,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0])]

0 - session id (int)<br>
<br>
1 - session total time (float)<br>
<br>
2 - day period (int), 4 categories 0,..,3<br>
3 - season (int), 4 categories 0,..,3<br>
4 - month (int), 12 categories 0,..,11<br>
5 - year (int), 2 categories 0,1<br>
<br>
6 - item with most time spent on (int), 50 categories<br>
7 - most time spent on an item (float) <br>
8 - most frequently bought item for given most time spent on item (int), 50 categories<br>
9 - least ""<br>
<br>
10 - mean time per item (float)<br>
11 - std time per item (float)<br>
<br>
12 - item revisited the most (int), 50 categories<br>
13 - number of time it has been revisited (int)<br>
14 - number of revisited items (int)<br>
15 - most frequently bought item when item, when X is the most revisited item (int), 50 categories<br>
16 - least ""<br>
<br>
17 - first visited item (int), 50 categories<br>
18 - last ""<br>
<br>
19 - normalized feature category vector (int), 10 categories<br>
<br>
20 -> 44 - normalized feature vector (float)<br>

In [40]:
tuple_final_engineered_dataset = (
    final_engineered_dataset.map(lambda x: ((x[0],) + tuple(x[1])))
)

#tuple_final_engineered_dataset.take(1)

In [41]:
columns = ['session_id', 'session_time', 'season', 'day_period', 'month', 'year', 'item_most_time_spent', 'most_time_spent_on_item', 'most_frequently_bought_for_time_spent', 
          'least_frequently_bought_for_time_spent', 'mean_time', 'std_time', 'item_most_visited', 'number_o_visit', 'number_o_revisited_items', 'most_frequently_bought_for_most_revisited',
          'first_item_visited', 'last_item_visited', 'normalized_features_vector']

columns += [str(i+1) for i in range(25)]

print(columns)

['session_id', 'session_time', 'season', 'day_period', 'month', 'year', 'item_most_time_spent', 'most_time_spent_on_item', 'most_frequently_bought_for_time_spent', 'least_frequently_bought_for_time_spent', 'mean_time', 'std_time', 'item_most_visited', 'number_o_visit', 'number_o_revisited_items', 'most_frequently_bought_for_most_revisited', 'first_item_visited', 'last_item_visited', 'normalized_features_vector', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25']


In [42]:
tuple_final_engineered_dataset.map(lambda x : tuple([float(xi) for xi in x])).coalesce(1).toDF(columns).write.option("header",True).csv('/PROJ/Data/session_engineered_features.csv')

22/05/27 09:32:11 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


AnalysisException: path file:/PROJ/Data/session_engineered_features.csv already exists.

In [None]:
test = spark.read.csv('../Data/session_engineered_features.csv',header=False,
                                          inferSchema=True)
summary = test.describe().toPandas().set_index('summary').transpose()
print(summary)

In [43]:
# Saves all the clusters array inside the data folder
import pickle

with open('../Data/clusters/feature_category_clusters.np', 'wb') as file:
    pickle.dump(feature_category_clusters, file, pickle.HIGHEST_PROTOCOL)
    
with open('../Data/clusters/feature_values_clusters.np', 'wb') as file:
    pickle.dump(feature_values_clusters, file, pickle.HIGHEST_PROTOCOL)

with open('../Data/clusters/item_cat_clusters.np', 'wb') as file:
    pickle.dump(item_cat_clusters, file, pickle.HIGHEST_PROTOCOL)



In [44]:
with open('../Data/most_visited_dict.pd', 'wb') as file:
    pickle.dump(most_visited_dict, file, pickle.HIGHEST_PROTOCOL)
    
with open('../Data/most_time_dict.pd', 'wb') as file:
    pickle.dump(most_time_dict, file, pickle.HIGHEST_PROTOCOL)