# Feature engineering on item feature categories


Item features can hold a lot of interesting information, but there are a great many of them.

Some items share feature categories (such as color or size), but others don't.

Different features can be engineered from item features, such as:

* The most commonly visited (feature_category, feature_value) pair in each session. ✅
* The number of visits to that most common pair. ✅
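
As a minimal sketch of these two features, assume the visits of a single session have already been expanded into a list of (feature_category, feature_value) pairs. The toy values and the helper name `most_common_pair` are made up for illustration and do not come from the dataset:

```python
from collections import Counter

# Hypothetical toy data: the (feature_category, feature_value) pairs of every
# item visited in one session, one entry per (visit, pair).
session_pairs = [(56, 365), (47, 123), (56, 365), (68, 351), (56, 365), (47, 123)]


def most_common_pair(pairs):
    """Return (feature_category, feature_value, visit_count) for the most frequent pair."""
    (category, value), count = Counter(pairs).most_common(1)[0]
    return category, value, count


print(most_common_pair(session_pairs))  # (56, 365, 3)
```

The rest of the notebook computes the same triple for every session at once with Spark.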

In [1]:
import os 
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# Run this cell if you have py4j issues on Windows (consider updating your PATH)
import sys
os.environ['PYSPARK_DRIVER_PYTHON_OPTS']= "notebook"
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_PYTHON'] = sys.executable

# Start a Spark session from the notebook

os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=4g  pyspark-shell"
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("load_explore") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/14 13:55:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# Load the relevant data into DataFrames
train_sessions = spark.read.load('../Data/train_sessions.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

train_purchases = spark.read.load('../Data/train_purchases.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

candidate_items = spark.read.load('../Data/candidate_items.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

item_features = spark.read.load('../Data/item_features.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')


                                                                                

In [3]:
mapped_item_features = item_features.rdd.map(lambda x: (x.item_id, (x.feature_category_id, x.feature_value_id)))

item_session_and_features = (train_sessions.rdd
                             .map(lambda x: (x.item_id, x.session_id))  # Map each visit to an (item_id, session_id) pair
                             .join(mapped_item_features)  # Expensive: yields one row per (visit, feature pair) of the item
)

In [4]:
item_session_and_features.take(5)

                                                                                

[(4230, (31, (45, 559))),
 (4230, (31, (56, 153))),
 (4230, (31, (34, 275))),
 (4230, (31, (19, 769))),
 (4230, (31, (68, 864)))]
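
The rows above show why the join is heavy: every session visit is repeated once for each feature pair of the visited item. The sketch below reproduces the same inner-join semantics in plain Python, without Spark. The helper `inner_join` and the toy data are hypothetical and only mimic the shape of what `RDD.join` returns:

```python
from collections import defaultdict


def inner_join(left, right):
    """Mimic RDD.join on lists of (key, value): return (key, (left_value, right_value)) rows."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Keys missing on the right side produce no rows (inner join).
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]


# Made-up (item_id, session_id) visits and (item_id, (feature_cat_id, feature_val_id)) rows.
visits = [(4230, 31), (4230, 77), (9999, 31)]
features = [(4230, (45, 559)), (4230, (56, 153))]

joined = inner_join(visits, features)
print(len(joined))  # 4: two visits of item 4230 times two feature pairs
```

Item 9999 has no feature rows in this toy data, so its visit disappears from the result, just as it would with an inner join in Spark.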

In [15]:
from collections import Counter
# Grouping by session is too heavy, so let's perform a reduce operation instead

# At this point, item_session_and_features contains tuples of (item_id, (session_id, (feature_cat_id, feature_val_id)))
# Let's reorder each tuple to (session_id, (item_id, feature_cat_id, feature_val_id))

def reduce_count_item_pairs(x, y):
    '''Merges the y dictionary into the x dictionary.
    Expects both x and y to be dictionaries keyed by (feature_cat_id, feature_val_id)
    with the number of occurrences as the value.
    '''
    for feature_pair in y:
        if x.get(feature_pair) is None:
            x[feature_pair] = y[feature_pair]
        else:
            x[feature_pair] += y[feature_pair]
    return x
        
item_features_per_session = (item_session_and_features
                             .map(lambda x: (x[1][0], (x[0], x[1][1][0], x[1][1][1])))  # Reorder to (session_id, (item_id, feature_cat_id, feature_val_id))
                             .mapValues(lambda x: {(x[1], x[2]): 1})  # Transforms to (session_id, counter_of(feature_cat_id, feature_val_id))
                             .reduceByKey(reduce_count_item_pairs)  # Reduce by merging the dictionaries and their values
                             .mapValues(lambda x: Counter(x).most_common()[0])  # Keep the most common ((feature_cat_id, feature_val_id), count) entry
                             .mapValues(lambda x: (x[0][0], x[0][1], x[1]))  # Flatten to (feature_cat_id, feature_val_id, count)
)

In [16]:
item_features_per_session.take(5)

                                                                                

[(2019690, (55, 267, 3)),
 (1633440, (32, 286, 3)),
 (1501040, (72, 75, 7)),
 (490710, (56, 365, 10)),
 (2357160, (46, 825, 11))]
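
To check the merge logic without a cluster, the per-session reduction can be reproduced locally. This is only a sketch: it assumes the records are already reduced to (session_id, (feature_cat_id, feature_val_id)), the records are made up, and `merge_counts` is an illustrative stand-in for `reduce_count_item_pairs` that relies on `Counter.update` to add the counts:

```python
from collections import Counter
from functools import reduce


def merge_counts(x, y):
    """Merge counter y into counter x, adding the counts of shared keys."""
    x.update(y)
    return x


# Made-up (session_id, (feature_cat_id, feature_val_id)) records.
records = [(31, (45, 559)), (31, (56, 153)), (31, (45, 559)), (77, (19, 769))]

# Emulate mapValues: one single-entry counter per record, grouped by session.
per_session = {}
for session_id, pair in records:
    per_session.setdefault(session_id, []).append(Counter({pair: 1}))

# Emulate reduceByKey followed by the two final mapValues steps.
result = {}
for session_id, counters in per_session.items():
    merged = reduce(merge_counts, counters)
    (cat, val), count = merged.most_common(1)[0]
    result[session_id] = (cat, val, count)

print(result)  # {31: (45, 559, 2), 77: (19, 769, 1)}
```

Each value has the same (feature_cat_id, feature_val_id, count) shape as the Spark output above.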