# Feature engineering

From the dataset explored in the 0_data_analysis notebook, we compute several features.

To make the dataset learnable, the raw session data must be preprocessed and turned into features that carry useful signal. This notebook generates a PySpark RDD containing the preprocessed dataset, ready for feature selection algorithms.

### Configuring and launching the pyspark environment

In [1]:
import os 
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# launch this cell if you have issues on windows with py4j (think about updating your PATH)
import sys
os.environ['PYSPARK_DRIVER_PYTHON_OPTS']= "notebook"
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_PYTHON'] = sys.executable

# starts a spark session from notebook

os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=4g  pyspark-shell"
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("load_explore") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/19 08:13:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/05/19 08:13:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/05/19 08:13:15 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


### Loading the datasets inside spark

In [2]:
# Loads the relevant data into DataFrames
train_sessions = spark.read.load('../Data/train_sessions.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

train_purchases = spark.read.load('../Data/train_purchases.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

candidate_items = spark.read.load('../Data/candidate_items.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

item_features = spark.read.load('../Data/item_features.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

                                                                                

## Time related features.

As fashion purchases are highly dependent on seasonal trends, we extract date features from the item visit dates of each session.

For each session, the date of the first item visit is used as the reference session date. (The average session duration is an hour.)

Engineered features include:
* Month of the session ✅
* Season of the session ✅
* Year of the session ✅
* Duration of the session ✅
* Day period of the session ✅

In [4]:
import datetime

def parse_datetime(timestamp):
    try:
        return datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f')
    except ValueError:
        return datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')    

date_parsed_sessions = (train_sessions.rdd
                        .map(lambda x: (x.session_id, parse_datetime(x.date)))  # Maps rows to (key=session_id, values=(parsed_date)) tuples
                        .cache())

# Reduces by key from the MAX monoid
max_date_sessions = (
    date_parsed_sessions
    .reduceByKey(max)
)

# Reduces by key from the MIN monoid
min_date_sessions = (
    date_parsed_sessions
    .reduceByKey(min)
)

# Computes the session duration in seconds
time_sessions_in_seconds = (
    max_date_sessions
    .join(min_date_sessions)  # Joins the max dates with the min dates on the session_id (one pair per session)
    .mapValues(lambda x: int((x[0] - x[1]).total_seconds()))  # total_seconds() keeps whole days, unlike the .seconds attribute
)

time_sessions_in_seconds.take(5)

                                                                                

[(24, 3703), (28, 87), (36, 43), (44, 33), (48, 657)]
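
When a duration in seconds is derived from two `datetime` values, it helps to remember that `timedelta.seconds` only holds the sub-day component, while `timedelta.total_seconds()` covers the whole interval. The two agree for sessions shorter than a day and differ otherwise. A minimal standalone sketch, with made-up dates:

```python
import datetime

start = datetime.datetime(2020, 2, 26, 17, 22, 48)
end = start + datetime.timedelta(days=1, seconds=90)  # toy session spanning more than a day

delta = end - start
seconds_component = delta.seconds            # sub-day part only, the full day is ignored
full_duration = int(delta.total_seconds())   # whole interval, truncated to an integer
```

Here `seconds_component` is 90 whereas `full_duration` is 86490, so the choice matters for any session that crosses the 24-hour mark.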

In [8]:
def get_season(date_time):
    '''Converts the date_time into a season
    
    :returns: an integer
        0 -> Winter
        1 -> Spring
        2 -> Summer
        3 -> Autumn
    '''    
    season = (date_time.month - 1) // 3
    season += (date_time.month == 3)&(date_time.day>=20)
    season += (date_time.month == 6)&(date_time.day>=21)
    season += (date_time.month == 9)&(date_time.day>=23)
    season -= 3*int(((date_time.month == 12)&(date_time.day>=21)))
    return season

def get_day_period(date_time):
    '''Converts the date_time into a period of the day.
    
    The hour is bucketed into six 4-hour periods:
    0 -> 00:00 to 03:59
    1 -> 04:00 to 07:59
    2 -> 08:00 to 11:59
    3 -> 12:00 to 15:59
    4 -> 16:00 to 19:59
    5 -> 20:00 to 23:59
    '''
    return date_time.hour // 4

# Assigns a season for each session
season_per_session = (
    min_date_sessions
    .mapValues(get_season)
)

# Assigns a day period (index of the 4-hour bucket) for each session
day_period_per_session = (
    min_date_sessions
    .mapValues(get_day_period)
)

# Assigns a month for each session
month_per_session = (
    min_date_sessions
    .mapValues(lambda x: x.month)
)

# Assigns a year for each session
year_per_session = (
    min_date_sessions
    .mapValues(lambda x: x.year)
)

print(min_date_sessions.take(5))
print(season_per_session.take(5))
print(day_period_per_session.take(5))
print(month_per_session.take(5))
print(year_per_session.take(5))

[(24, datetime.datetime(2020, 2, 26, 17, 22, 48, 903000)), (28, datetime.datetime(2020, 5, 18, 12, 50, 24, 248000)), (36, datetime.datetime(2020, 6, 21, 10, 29, 8, 263000)), (44, datetime.datetime(2020, 11, 27, 20, 45, 10, 302000)), (48, datetime.datetime(2020, 4, 15, 17, 17, 42, 594000))]
[(24, 0), (28, 1), (36, 2), (44, 3), (48, 1)]
[(24, 4), (28, 3), (36, 2), (44, 5), (48, 4)]
[(24, 2), (28, 5), (36, 6), (44, 11), (48, 4)]
[(24, 2020), (28, 2020), (36, 2020), (44, 2020), (48, 2020)]
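
The season encoding can be cross-checked with a table of start dates. The sketch below is a hypothetical alternative formulation (`season_of` and `SEASON_STARTS` are names introduced here, not part of the notebook); it assumes the same cut-off days as `get_season` (20 March, 21 June, 23 September, 21 December).

```python
import datetime

# (month, day) at which each season starts, with its integer code
SEASON_STARTS = [((3, 20), 1), ((6, 21), 2), ((9, 23), 3), ((12, 21), 0)]

def season_of(date_time):
    """Hypothetical helper: 0=Winter, 1=Spring, 2=Summer, 3=Autumn."""
    season = 0  # dates before 20 March belong to winter
    for start, code in SEASON_STARTS:
        # Tuples compare element by element, so this is a calendar comparison
        if (date_time.month, date_time.day) >= start:
            season = code
    return season

samples = [
    datetime.datetime(2020, 2, 26),
    datetime.datetime(2020, 5, 18),
    datetime.datetime(2020, 6, 21),
    datetime.datetime(2020, 11, 27),
]
sample_seasons = [season_of(d) for d in samples]
```

The four sample dates are the reference dates of sessions 24, 28, 36 and 44 printed above, and the sketch gives them the same season codes.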


## Item time related features

We compute the item on which the most time was spent. The time spent on an item is the difference between the visit date of that item and the visit date of the next item. For the last item, the time is computed from the session purchase date.

We will extract two features:
* The item on which the user has spent the most time ✅
* The time spent on that item ✅
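
To make the rule concrete, here is a plain-Python sketch on a toy session. The helper `time_per_item` and the item ids are hypothetical; the visits are assumed to be already sorted by date.

```python
import datetime

def time_per_item(visits, purchase_date):
    """Hypothetical helper: visits is a date-sorted list of (item_id, visit_date)."""
    # Each visit ends at the next visit, or at the purchase for the last item
    end_dates = [date for _, date in visits[1:]] + [purchase_date]
    return [
        (item_id, int((end - start).total_seconds()))
        for (item_id, start), end in zip(visits, end_dates)
    ]

t0 = datetime.datetime(2020, 1, 1, 12, 0, 0)
visits = [
    (101, t0),
    (102, t0 + datetime.timedelta(seconds=30)),
    (101, t0 + datetime.timedelta(seconds=100)),
]
purchase = t0 + datetime.timedelta(seconds=160)

times = time_per_item(visits, purchase)      # time spent on each visit
longest = max(times, key=lambda t: t[1])     # visit with the most time spent
```

In this toy session the visits last 30, 70 and 60 seconds, so item 102 is the one on which the most time was spent.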

In [15]:
# Computing session purchase_date
session_purchase_date = (
    train_purchases.rdd
    .map(lambda x: (x.session_id, parse_datetime(x.date)))
)

def compute_time_per_item(session_info):
    '''Computes the time spent on each item (in seconds)
    
    :param session_info: The information of the session ([(item_id, date)...], purchase_date),
        with the visits sorted by date
    :returns: Time per item information [(item_id, visit_time)...]
    '''
    # Unpacks the info
    visited_items, purchase_date = session_info
    item_time_list = []
    
    for idx, (item_id, item_visit_date) in enumerate(visited_items):
        # Last item: the visit ends at the purchase date
        if idx == len(visited_items) - 1:
            end_date = purchase_date
        # Not the last item: the visit ends when the next item is visited
        else:
            _, end_date = visited_items[idx + 1]
        # total_seconds() keeps whole days, unlike the .seconds attribute
        item_time_list.append((item_id, int((end_date - item_visit_date).total_seconds())))
    
    return item_time_list


# Here, we group the tuples by their key (session_id) using Map Reduce
items_time_per_session = (
    train_sessions.rdd
    .map(lambda x: (x.session_id, (x.item_id, parse_datetime(x.date))))
    .mapValues(lambda x: [x])  # Wraps each value in a list, so the sessions can be reduced by concatenation
    .reduceByKey(lambda x, y: x + y)  # Groups the visits of each session together by concatenating the lists
    .join(session_purchase_date)  # Each tuple now has the shape (session_id, ([(item_id, date)...], purchase_date))
    .mapValues(lambda x: (sorted(x[0], key=lambda t: t[1]), x[1]))  # For each session, sorts the (item_id, date) tuples by date 
    .mapValues(compute_time_per_item)
)

items_time_per_session.take(1)[0]

                                                                                

(24,
 [(2927, 15),
  (2927, 180),
  (16064, 23),
  (11662, 42),
  (434, 3096),
  (18539, 183),
  (10414, 42),
  (28075, 118),
  (18476, 191)])

In [19]:
most_spent_time = (
    items_time_per_session
    .mapValues(lambda x: max(x, key=lambda t: t[1]))  # Gets the (item_id, time) pair with the most time spent
)

most_spent_time.take(1)[0]

                                                                                

(24, (434, 3096))

## Session mean and standard deviation time

We will extract two features:

* For each session, the mean visit time per item ✅
* For each session, the standard deviation of the visit time per item ✅
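
As a small check of what these two statistics measure, the sketch below recomputes them for the visit times of session 24 printed earlier in this notebook. Note that NumPy's `std` defaults to the population formula (`ddof=0`); the sample estimate is shown only for comparison.

```python
import numpy as np

# Visit times (in seconds) of session 24, as printed earlier in this notebook
visit_times = np.array([15, 180, 23, 42, 3096, 183, 42, 118, 191], dtype=np.float64)

mean_time = visit_times.mean()
std_population = visit_times.std()        # ddof=0, NumPy's default
std_sample = visit_times.std(ddof=1)      # sample estimate, always larger for n > 1
```

The single long visit (3096 seconds) dominates both values, which is why the standard deviation is roughly twice the mean for this session.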

In [20]:
import numpy as np

mean_std_time_session = (
    items_time_per_session
    .mapValues(lambda x: np.array(x, dtype=np.float32)[:, 1])  # Gets a numpy array for each time and takes only the visit time (drops the item id)
    .mapValues(lambda x: (x.mean(), x.std()))
)

mean_std_time_session.take(1)

                                                                                

[(24, (432.22223, 944.24854))]

## Item revisit time

Items that have been revisited are the most likely to have caught the interest of the user, and therefore to be purchased.

We will extract three features:
* The item that has been revisited the most times ✅
* The number of times this item has been revisited ✅
* The number of items that have been revisited at least once ✅
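
The three features can be sketched in plain Python with a `Counter`. The helper `revisit_info` is hypothetical; it returns `(-1, -1, 0)` when no item was visited more than once. On ties, `Counter.most_common` keeps the first item encountered, which may differ from a NumPy `argmax` over sorted item ids.

```python
from collections import Counter

def revisit_info(item_ids):
    """Hypothetical helper: (most visited item, its visit count, number of revisited items)."""
    counts = Counter(item_ids)
    nb_revisited = sum(1 for c in counts.values() if c > 1)
    if nb_revisited == 0:
        return (-1, -1, 0)
    item, visits = counts.most_common(1)[0]
    return (item, visits, nb_revisited)

# Items visited in session 24, in visit order (taken from the output shown earlier)
session_24 = [2927, 2927, 16064, 11662, 434, 18539, 10414, 28075, 18476]
info = revisit_info(session_24)
```

For session 24 only item 2927 appears twice, so the sketch yields `(2927, 2, 1)`.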

In [40]:
item_visit_counts = (
    train_sessions.rdd
    .map(lambda x: (x.session_id, x.item_id))  # Maps the rows to (key=session_id, value=item_id) tuples
    .mapValues(lambda x: [x])  # Puts the items inside a list, so they can be reduced easily.
    .reduceByKey(lambda x, y: x + y)  # Concatenates the lists, grouping the items by session
    .mapValues(lambda x: np.unique(x, return_counts=True))  # Transforms values into two arrays: an item_id array and an occurrence array
    .mapValues(lambda x: np.vstack((x[0], x[1])))  # Stacks them into a 2 x n array (row 0: item ids, row 1: counts)
)

def get_revisited_items(count_array):
    '''Gets the number of items revisited at least once'''
    revisited_indices = count_array[1, :] > 1
    return np.count_nonzero(revisited_indices)

def get_item_revisits_info(count_array):
    '''Returns the item_id of the item that
    has been revisited the most as well as the 
    number of times it has been revisited
    
    If no item was revisited return -1, -1, 0
    '''
    number_of_revisits = get_revisited_items(count_array)
    if number_of_revisits == 0:
        return (-1, -1, number_of_revisits)
    
    most_revisited_item_idx = np.argmax(count_array[1, :])
    most_revisits = count_array[1, :].max()
    
    return (count_array[0, most_revisited_item_idx], most_revisits, number_of_revisits)
    

item_revisit_info = (
    item_visit_counts
    .mapValues(get_item_revisits_info) 
)


print(item_revisit_info.take(1))



[(24, (2927, 2, 1))]


                                                                                

# Engineering dataset features from item features

Each item in the dataset has a finite number of features. Each item carries several of them, presented as (feature_category_id, feature_value_id) tuples.

There are 73 unique feature categories and 904 unique (feature_category, feature_value) pairs. As most of those categories are not documented, it is difficult to interpret their meaning.

In this phase, we will engineer features by performing K-Means clustering using Map-Reduce.

Multiple features will be engineered:
* One feature telling which category cluster the session belongs to ✅
* One feature for each cluster of the item feature value clustering (25 clusters) ✅

Counting the number of occurrences of each item category in each session
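
As an illustration of the target representation, the sketch below builds a normalized category count vector for a toy session with NumPy alone. The helper `category_histogram` is hypothetical and assumes 1-based category ids.

```python
import numpy as np

def category_histogram(category_ids, nb_categories):
    """Hypothetical helper: normalized counts of 1-based category ids."""
    # Shift to 0-based indices, then count occurrences per category
    counts = np.bincount(np.asarray(category_ids) - 1, minlength=nb_categories)
    return counts / counts.sum()

# Toy session whose items carry categories 1, 3, 3 and 4, out of 5 possible categories
vector = category_histogram([1, 3, 3, 4], nb_categories=5)
```

The result sums to 1.0, with half of the mass on category 3, so sessions of different lengths become comparable.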

In [52]:
# Counts the number of unique category IDs
unique_categories_nb = len(
    item_features.rdd
    .map(lambda x: set([x.feature_category_id]))  # Maps the rows to a set containing the feature category
    .reduce(lambda x, y: x.union(y))  # Reduces by joining the sets, keeping only the unique categories values
)

def initialize_vector(category_nb, max_categories=unique_categories_nb):
    '''
    Returns a vector of categories, where each element is 0 except the one at the category's index.
    The category IDs are treated as 1-based, hence the index category_nb - 1.
    '''
    vector = np.zeros(max_categories, dtype=np.float32)
    vector[category_nb - 1] = 1.0
    return vector

def normalize_vector(category_vector):
    '''
    Normalizes a category vector, dividing all the elements by the total number of occurrences
    '''
    total_occurrences = np.sum(category_vector)
    return category_vector / total_occurrences


normalized_categories_vector_per_session = (
    train_sessions.rdd
    .map(lambda x: (x.item_id, x.session_id))  # Maps the rows to (key=item_id, value=session_id) tuples
    .join(item_features.rdd.map(lambda x: (x.item_id, x.feature_category_id)))  # Joins the two datasets together by item_id
    .map(lambda x: x[1])  # Drops the item_id, now we have (key=session_id, value=feature_category_id)
    .mapValues(initialize_vector)  # Encodes the feature category inside a counter vector
    .reduceByKey(lambda x, y: x + y)   # Reduces by session_id, and adds the counter vectors
    .mapValues(normalize_vector)  # Normalizes the vector, so that the sum of every element is equal to 1.0
)

normalized_categories_vector_per_session.take(1)

                                                                                

[(474192,
  array([0.        , 0.00381679, 0.01145038, 0.02290076, 0.01145038,
         0.01908397, 0.04961832, 0.        , 0.        , 0.        ,
         0.02290076, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.02290076, 0.02290076, 0.02290076, 0.        ,
         0.        , 0.        , 0.00381679, 0.01145038, 0.        ,
         0.02290076, 0.        , 0.00381679, 0.02671756, 0.04961832,
         0.00381679, 0.02290076, 0.00763359, 0.01145038, 0.        ,
         0.00381679, 0.00381679, 0.00381679, 0.00381679, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.02290076,
         0.01145038, 0.05725191, 0.        , 0.00381679, 0.05725191,
         0.        , 0.        , 0.01908397, 0.        , 0.02671756,
         0.05725191, 0.        , 0.        , 0.02290076, 0.01908397,
         0.05343511, 0.02671756, 0.02290076, 0.00381679, 0.02290076,
         0.        , 0.00381679, 0.05343511, 0.04961832, 0.        ,
         0.00381679, 0.0

## Clustering on item feature categories

Performing clustering from the Normalized Category Vectors
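
Before looking at the distributed version, it can help to see the same loop on a single machine. The sketch below is a hypothetical NumPy counterpart: points are assigned to the closest centroid with the L1 (Manhattan) distance, then each centroid is replaced by the mean of its points. Keeping the previous centroid for an empty cluster and starting from fixed points are choices made for this sketch only.

```python
import numpy as np

def assign(points, centroids):
    """Index of the closest centroid for each point, using the L1 distance."""
    distances = np.abs(points[:, None, :] - centroids[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)

def k_means_local(points, centroids, steps):
    """Hypothetical single-machine counterpart of a Map-Reduce K-Means loop."""
    for _ in range(steps):
        labels = assign(points, centroids)  # "map": closest centroid per point
        centroids = np.array([              # "reduce": mean of the points per cluster
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
    return centroids

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
initial = points[[0, 2]]  # fixed start, so the run is reproducible
centroids = k_means_local(points, initial, steps=5)
labels = assign(points, centroids)
```

On this toy data the two centroids settle at the middle of each pair of points after the first step.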

In [None]:
##### CLUSTERING WITH MAP/REDUCE
# inspired from https://uv.ulb.ac.be/pluginfile.php/3410436/mod_resource/content/1/Kmeans.html?fbclid=IwAR1h5xJKxQ1nrCHlgtQsofwjgc8B7oVl69GG6Mm8WxChTH3zBc4SKFo3Noo

from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()

def mapstep(x, br_clusters): 
    """
    This function returns the index of the closest cluster to the current row x
    x : current row
    br_clusters : broadcast variable holding the current value of the clusters
    """
    # L1 (Manhattan) distance between the row x and every cluster at once
    d = np.abs(br_clusters.value - x).sum(axis=1)
    return np.argmin(d)    

def add_values(x, y):
    return x[0] + y[0], x[1] + y[1]

def k_means_MR(dataset, M, steps):
    """
    dataset : RDD data structure, all columns are used for distance computation (1xn dimension)
    M : number of clusters
    steps : number of steps
    """
    clusters = np.array(dataset.takeSample(False, M)) # starts with M random rows, sampled without replacement so the same row is not picked twice
    broadcast_clusters = sc.broadcast(clusters) ## broadcast cluster position
    for i in range(steps):
        distance_set = dataset.map(lambda x : (mapstep(x, broadcast_clusters), (x, 1))) # set containing (closest_cluster,(row value),counter=1)
        new_clusters_set = distance_set.reduceByKey(add_values) # adds all cluster related entry (+ increments counter)
        new_clusters_set = new_clusters_set.map(lambda x : x[1][0] / x[1][1]) # apply mean operation
        new_clusters = np.array(new_clusters_set.take(M)) # at most M rows: a cluster that received no row is dropped
        broadcast_clusters = sc.broadcast(new_clusters)
    return broadcast_clusters.value


mapped_vectors = normalized_categories_vector_per_session.map(lambda x: x[1]).cache()  # Drops the session id

feature_category_clusters = k_means_MR(
    mapped_vectors,
    10,  # 10 Clusters are taken
    5  # Iterates over 5 steps
)

In [128]:
broadcast_clusters = sc.broadcast(feature_category_clusters) # broadcasting is a good practice for parallel computing
clustered_categories = normalized_categories_vector_per_session.map(lambda x : (x[0], mapstep(x[1],broadcast_clusters))) # compute the associated cluster for each session

clustered_categories.take(10) # (example) OUTPUT [SESSION_ID | CLUSTER_ID]

                                                                                

[(474192, 8),
 (1119330, 8),
 (1309362, 2),
 (1468758, 8),
 (1910616, 2),
 (2691762, 4),
 (2933592, 0),
 (4141098, 8),
 (18, 2),
 (93486, 2)]

## Clustering on item feature values

We will perform clustering on the item feature values only. In total, 25 different clusters will be computed and each item will be classified in one of those clusters.

Then, for each session we will look at the visited items. For each of those items we will increment the session's counter for the class of that item. In total, 25 new features will be added to our sessions.
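
The per-session counters can be sketched with `np.bincount`. The helper `session_cluster_profile` and the toy session are hypothetical; in this sketch the slot index equals the cluster id and the counters are normalized so that they sum to 1.0.

```python
import numpy as np

NB_CLUSTERS = 25

def session_cluster_profile(item_clusters, nb_clusters=NB_CLUSTERS):
    """Hypothetical helper: share of the session's visits falling in each item cluster."""
    counts = np.bincount(item_clusters, minlength=nb_clusters)
    return counts / counts.sum()

# Toy mapping item_id -> cluster_id, and one session visiting four items
item_to_cluster = {2: 12, 4: 3, 8: 0, 20: 3}
visited_items = [2, 4, 20, 8]
profile = session_cluster_profile([item_to_cluster[i] for i in visited_items])
```

Two of the four visited items fall in cluster 3, so `profile[3]` is 0.5 and the remaining mass is split between clusters 0 and 12.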

In [110]:
unique_feature_values = (
    item_features.rdd
    .map(lambda x: (set([x.feature_value_id])))  # Maps each row to a set containing the feature value
    .reduce(lambda x, y: x.union(y))  # Joins the sets, discarding duplicate values (only keeping unique values)
)

highest_feature_value = max(unique_feature_values)  # Takes the highest value in the set, used as vector size


def initialize_value_vector(value_nb, max_values=highest_feature_value):
    '''
    Returns a vector of item feature values, where each element is 0 except the one at the values's index.
    '''
    vector = np.zeros(max_values, dtype=np.float32)
    vector[value_nb - 1] = 1.0
    return vector

encoded_item_values = (
    item_features.rdd
    .map(lambda x: (x.item_id, x.feature_value_id))  # Maps each row to a (key=item_id, values=feature_value_id) tuples
    .mapValues(initialize_value_vector)  # Initializes each value vector (one hot encoding type)
    .reduceByKey(lambda x, y: x + y)  # Reduces on each item by summing the feature value vector
)

                                                                                

In [113]:
# Performs clustering
mapped_item_values = encoded_item_values.map(lambda x: x[1])

feature_values_clusters = k_means_MR(
    mapped_item_values,
    25,  # As analysed in notebook 2b_feature_clustering, 25 is a good cluster compromise
    5
)

0
1
2
3
4


In [115]:
broadcast_clusters = sc.broadcast(feature_values_clusters) # broadcasting is a good practice for parallel computing
clustered_item_values = encoded_item_values.map(lambda x : (x[0], mapstep(x[1],broadcast_clusters))) # compute associated clusters for each item

clustered_item_values.take(10) # (example) OUTPUT [ITEM_ID | CLUSTER_ID]

[(2, 12),
 (4, 3),
 (8, 0),
 (10, 1),
 (14, 14),
 (16, 13),
 (18, 2),
 (20, 3),
 (24, 1),
 (26, 15)]

In [121]:
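# Encodes the clustered item values in one-hot encoding
# Note: initialize_value_vector sets index (cluster_id - 1), so cluster 0 is stored in the last slot (index -1);
# every cluster still gets its own slot.
encoded_item_values = (
    clustered_item_values
    .mapValues(lambda x: initialize_value_vector(x, max_values=25))
)

# Computes the vector for each session
item_features_session = (
    train_sessions.rdd
    .map(lambda x: (x.item_id, x.session_id))  # Maps the rows to (key=item_id, value=session_id) tuples
    .join(encoded_item_values)  # Gives (item_id, (session_id, one_hot_cluster_vector))
    .map(lambda x: (x[1][0], x[1][1]))  # Keeps (key=session_id, value=one_hot_cluster_vector)
    .reduceByKey(lambda x, y: x + y)  # Sums the vectors of all the items visited in the session
    .mapValues(normalize_vector)  # Normalizes the vector, so that its elements sum to 1.0
    .mapValues(tuple)  # Converts the array to a tuple, so that it can be flattened with the other features
)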

item_features_session.take(1)

                                                                                

[(474192,
  (0.26666668,
   0.0,
   0.0,
   0.0,
   0.0,
   0.13333334,
   0.06666667,
   0.0,
   0.0,
   0.0,
   0.0,
   0.13333334,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.2,
   0.0,
   0.2))]

# Save the final dataset

The final engineered dataset, containing all the values, is created by joining the other engineered value datasets with the session_id as key.
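
Chaining pair-RDD joins nests the values: after three joins a value has the shape `(((a, b), c), d)`, and some of the joined values are tuples themselves. The pure-Python sketch below mimics that shape and flattens it. The helper name `flatten` is hypothetical and the numbers are the session 24 values shown earlier.

```python
def flatten(value):
    """Hypothetical helper: recursively flattens nested tuples into a flat list."""
    out = []
    for elem in value:
        if isinstance(elem, tuple):
            out.extend(flatten(elem))
        else:
            out.append(elem)
    return out

duration, season = 3703, 0
most_time = (434, 3096)    # (item_id, time spent)
revisits = (2927, 2, 1)    # (item_id, visit count, number of revisited items)

# Shape produced by joining four keyed datasets one after the other
nested = (((duration, season), most_time), revisits)
flat = flatten(nested)
```

The flat list keeps the join order, which is what fixes the column order of the final dataset.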

In [129]:
def unpack_tuples(row):
    '''Recursively flattens the nested tuples produced by the chained joins into a flat list'''
    out = []
    for elem in row:
        if isinstance(elem, tuple):
            out.extend(unpack_tuples(elem))
        else:
            out.append(elem)
    return out

final_engineered_dataset = (
    time_sessions_in_seconds
    .join(season_per_session)
    .join(day_period_per_session)
    .join(month_per_session)
    .join(year_per_session)
    .join(most_spent_time)
    .join(mean_std_time_session)
    .join(item_revisit_info)
    .join(clustered_categories)
    .join(item_features_session)
    .mapValues(unpack_tuples)
)


final_engineered_dataset.take(1)

                                                                                

[(720,
  [23943,
   1,
   2,
   4,
   2021,
   21890,
   20444,
   5988.25,
   8466.431,
   21890,
   4,
   1,
   6,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   1.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0])]

In [131]:
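tuple_final_engineered_dataset = (
    final_engineered_dataset.map(lambda x: ((x[0],)+tuple(x[1])))  # Flattens each row to (session_id, feature_1, feature_2, ...)
)

# Note: saveAsTextFile creates a directory of part files at this path, with one str(tuple) per line
tuple_final_engineered_dataset.saveAsTextFile('/PROJ/Data/session_engineered_features.csv')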
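
Since `saveAsTextFile` writes the string form of each element, a tuple is stored with its parentheses. If plain comma-separated lines are preferred, each row can be rendered as a string first. The helper `to_csv_line` below is hypothetical, and the row is a shortened copy of the session 720 output above.

```python
def to_csv_line(row):
    """Hypothetical helper: renders a flat tuple as one comma-separated line."""
    return ','.join(str(value) for value in row)

# First columns of the row printed above for session 720
row = (720, 23943, 1, 2, 4, 2021, 21890, 20444, 5988.25)
line = to_csv_line(row)
```

Mapping the RDD through such a function before saving would give lines that a CSV reader can parse directly; this is a suggestion under the assumption that no feature value contains a comma.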

                                                                                