# Feature engineering on item feature categories


Item features can hold a lot of useful information, but there are a very large number of them.

Some items share feature categories (such as color or size), but others don't.

Different features can be engineered from item features, such as:

* The most commonly visited (feature_category, feature_value) pair in each session. ✅
* The number of visits on that most common pair. ✅

In [1]:
import os 
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# Run this cell if you have issues with py4j on Windows (consider updating your PATH)
import sys
os.environ['PYSPARK_DRIVER_PYTHON_OPTS']= "notebook"
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_PYTHON'] = sys.executable

# Starts a Spark session from the notebook

os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=4g  pyspark-shell"
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("load_explore") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/16 09:35:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# Loads the relevant data into DataFrames
train_sessions = spark.read.load('../Data/train_sessions.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

train_purchases = spark.read.load('../Data/train_purchases.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

candidate_items = spark.read.load('../Data/candidate_items.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

item_features = spark.read.load('../Data/item_features.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')


                                                                                

In [3]:
mapped_item_features = item_features.rdd.map(lambda x: (x.item_id, (x.feature_category_id, x.feature_value_id)))

item_session_and_features = (train_sessions.rdd
                             .map(lambda x: (x.item_id, (x.session_id)))  # Maps each session row to an (item_id, session_id) pair
                             .join(mapped_item_features)  # Huge operation, might be too heavy
)

In [4]:
item_session_and_features.take(5)

                                                                                

[(4230, (31, (45, 559))),
 (4230, (31, (56, 153))),
 (4230, (31, (34, 275))),
 (4230, (31, (19, 769))),
 (4230, (31, (68, 864)))]

In [5]:
from collections import Counter
# Grouping by session is too heavy, so let's try a reduce operation instead

# At this point, item_session_and_features contains tuples of (item_id, (session_id, (feature_cat_id, feature_val_id)))
# Let's reorder each tuple to (session_id, (item_id, feature_cat_id, feature_val_id))

def reduce_count_item_pairs(x, y):
    '''Merges the y dictionary into the x dictionary.
    Expects both x and y to be dictionaries with (feature_cat_id, feature_val_id) keys
    and an occurrence count as the value.
    '''
    for feature_pair in y:
        if x.get(feature_pair) is None:
            x[feature_pair] = y[feature_pair]
        else:
            x[feature_pair] += y[feature_pair]
    return x
        
item_features_per_session = (item_session_and_features
                             .map(lambda x: (x[1][0], (x[0], x[1][1][0], x[1][1][1])))  # Reorder to (session_id, (item_id, feature_cat_id, feature_val_id))
                             .mapValues(lambda x: {(x[1], x[2]): 1})  # Transforms to (session_id, counter_of(feature_cat_id, feature_val_id))
                             .reduceByKey(reduce_count_item_pairs)  # Reduce by merging the dictionaries and their values
                             .mapValues(lambda x: Counter(x).most_common()[0])  # Takes the most common key tuple
                             .mapValues(lambda x: (x[0][0], x[0][1], x[1]))  # Flattens to (feature_cat_id, feature_val_id, count)
)

In [6]:
item_features_per_session.take(5)

                                                                                

[(2021250, (56, 365, 2)),
 (4373370, (56, 365, 3)),
 (1205080, (69, 639, 3)),
 (173500, (56, 365, 4)),
 (2596380, (56, 365, 1))]
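
The dictionary merge above can also be expressed with `collections.Counter`, whose `update` method adds counts key by key. The snippet below is a minimal plain-Python sketch of the same reduce step, using `functools.reduce` in place of Spark's `reduceByKey`. The `merge_counters` helper and the toy `visits` data are hypothetical and only serve as an illustration.

```python
from collections import Counter
from functools import reduce

def merge_counters(x, y):
    """Adds the counts of y into x and returns x (same contract as reduce_count_item_pairs)."""
    x.update(y)
    return x

# Hypothetical (feature_cat_id, feature_val_id) pairs seen in one session
visits = [(56, 365), (69, 639), (56, 365), (34, 275)]

# One single-entry Counter per visit, merged pairwise like a reduce
merged = reduce(merge_counters, (Counter({pair: 1}) for pair in visits))

# Most common pair and its count, flattened to (cat_id, val_id, count)
(cat_id, val_id), count = merged.most_common(1)[0]
top_pair = (cat_id, val_id, count)
print(top_pair)
```

Whether this is worth swapping into the Spark job is a judgment call; the hand-written merge and the `Counter` version should produce the same counts.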

## Counting the number of categories

We want to engineer features from the item feature categories. There are many of them, so let's count them first.

In [7]:
def reduce_unique(x, y):
    ''' Merges the two sets together, keeping the first item_id as a placeholder key '''
    return (x[0], x[1] | y[1])

unique_categories = (item_features.rdd
                    .map(lambda x: (x.item_id, (x.feature_category_id)))  # Maps every item feature entry to (item_id, (feature_category_id))
                    .mapValues(lambda x: set([x]))  # Wraps each value in a set so that duplicates are eliminated during the reduce operation.
                    .reduce(reduce_unique)  # Reduces the values by merging the sets, eliminating duplicate feature_category_ids
)

len(unique_categories[1])

                                                                                

73

There are 73 different feature categories. 
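
Because only the set of category ids matters here, the `item_id` key carried through the reduce is not actually needed. As a rough plain-Python illustration (the `rows` list is made-up sample data, not the real `item_features` table), the same count can be obtained with a union of single-element sets:

```python
from functools import reduce

# Hypothetical (item_id, feature_category_id, feature_value_id) rows
rows = [(2, 56, 365), (2, 62, 801), (3, 56, 365), (4, 68, 351), (4, 62, 6)]

# Wrap each category id in a set, then merge the sets with a union
category_sets = ({category_id} for _, category_id, _ in rows)
unique_ids = reduce(lambda x, y: x | y, category_sets)

nb_unique = len(unique_ids)
print(nb_unique)
```

On the Spark side, the DataFrame API offers `select(...).distinct().count()`, which should give the same number without a hand-written reducer.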

## Engineering our first 73 features

One set of features is the number of occurrences of each feature category within a session.
This alone gives us 73 features, one per category.

In [8]:
def count_categories(x, y):
    ''' Merges the y dictionary into x, summing the occurrence counts of each category '''
    for key in y:
        x[key] = x.get(key, 0) + y[key]
    return x

def fill_empties(x, nb_categories=len(unique_categories[1])):
    ''' Fills the dictionary with the categories that are not present,
    setting their occurrence count to 0.
    Keys run from 0 to nb_categories inclusive, so index 0 is an extra slot.'''
    for idx in range(0, nb_categories+1):
        x[idx] = x.get(idx, 0)
    return x

def get_occurences(x):
    ''' Remaps the sorted (category_id, count) pairs so that only the counts remain.
    The feature category id is given by the position in the tuple
    (0 to 73, where slot 0 stays unused if the ids start at 1).
    '''
    return tuple(count for _, count in x)

feature_categories_per_session = (item_session_and_features
                             .map(lambda x: (x[1][0], (x[1][1][0])))  # Projects to (session_id, feature_cat_id)
                             .mapValues(lambda x: {x: 1})  # Wraps each value in a dictionary before the reduce
                             .reduceByKey(count_categories)  # Counts feature category occurrences in a dictionary
                             .mapValues(fill_empties)  # Adds the unmet categories with an occurrence count of 0
                             .mapValues(lambda x: tuple(sorted(x.items(), key=lambda x: x[0])))  # Sorts by category id
                             .mapValues(get_occurences)  # Keeps only the counts, ordered by category id
)

feature_categories_per_session.take(1)

                                                                                

[(39520,
  (0,
   0,
   0,
   9,
   11,
   8,
   0,
   20,
   0,
   0,
   0,
   14,
   2,
   0,
   0,
   4,
   0,
   9,
   14,
   16,
   0,
   2,
   0,
   0,
   2,
   0,
   16,
   0,
   2,
   2,
   58,
   0,
   16,
   2,
   5,
   0,
   0,
   0,
   0,
   0,
   0,
   2,
   0,
   0,
   2,
   8,
   10,
   20,
   0,
   0,
   20,
   0,
   0,
   2,
   0,
   12,
   20,
   0,
   0,
   14,
   0,
   20,
   4,
   11,
   0,
   11,
   0,
   0,
   20,
   20,
   0,
   0,
   20,
   4))]
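
The output above has 74 entries rather than 73, because `fill_empties` creates keys from 0 to 73 inclusive. If the category ids really run from 1 to 73, as the `get_occurences` docstring assumes, slot 0 is always zero. The sketch below shows a hypothetical variant, `to_dense_counts`, that builds a vector with exactly one slot per category. It is a plain-Python illustration, not the code that produced the output above.

```python
def to_dense_counts(counts, nb_categories):
    """Returns a tuple of length nb_categories where position i - 1
    holds the count for category id i (ids assumed to run from 1 to nb_categories)."""
    return tuple(counts.get(category_id, 0) for category_id in range(1, nb_categories + 1))

# Hypothetical sparse counts for one session, with only 5 categories for brevity
sparse_counts = {2: 3, 5: 1}
dense = to_dense_counts(sparse_counts, nb_categories=5)
print(dense)
```

With this layout, any later code that reads the vector would need to map position `i - 1` back to category id `i`.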

## Feature grouping using frequent itemsets

The dimensionality can be reduced. For example, we can merge the feature categories into groups determined by frequent itemset analysis.

Spark's MLlib package provides frequent pattern mining functionality, including an implementation of FP-Growth.
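
To make the idea of a frequent itemset concrete: the support of a set of categories is the fraction of sessions that contain all of them, and an itemset is frequent when its support reaches a chosen threshold. The snippet below is a brute-force sketch in plain Python over made-up sessions. It is not FP-Growth, which reaches the same result without enumerating every candidate, and the `frequent_itemsets_bruteforce` name is hypothetical.

```python
from itertools import combinations

def frequent_itemsets_bruteforce(transactions, min_support, max_size=2):
    """Returns {itemset: support} for every itemset of up to max_size items
    whose support is at least min_support."""
    nb_transactions = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            # Count the transactions containing every item of the candidate
            hits = sum(1 for t in transactions if set(candidate) <= t)
            support = hits / nb_transactions
            if support >= min_support:
                result[candidate] = support
    return result

# Hypothetical sets of feature categories present in four sessions
sessions = [{56, 62, 68}, {56, 62}, {56, 69}, {62, 68}]
frequent = frequent_itemsets_bruteforce(sessions, min_support=0.5)
print(frequent)
```

Categories that end up together in a frequent itemset are the candidates for being merged into one group.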

In [9]:
def occurences_to_present_category(x):
    ''' Maps a tuple of occurrence counts (where the feature category id is given by the position)
    to a tuple containing the feature categories with a non-zero count.
    '''
    present_categories = set()
    assert len(x) == 74  # 74 slots: indices 0 to 73 (see fill_empties)
    for idx in range(74):
        if x[idx] != 0:
            present_categories.add(idx)
    return tuple(present_categories)
            

present_categories_per_session = feature_categories_per_session.mapValues(occurences_to_present_category)

In [10]:
present_categories_per_session.take(1)

                                                                                

[(39520,
  (3,
   4,
   5,
   7,
   11,
   12,
   15,
   17,
   18,
   19,
   21,
   24,
   26,
   28,
   29,
   30,
   32,
   33,
   34,
   41,
   44,
   45,
   46,
   47,
   50,
   53,
   55,
   56,
   59,
   61,
   62,
   63,
   65,
   68,
   69,
   72,
   73))]

In [11]:
present_categories_itemsets = present_categories_per_session.map(lambda x: list(x[1])).cache()  # Dropping the sessions data

In [12]:
from pyspark.mllib.fpm import FPGrowth

# Starts training the FPGrowth tree on the cached itemsets RDD built in the previous cell
model = FPGrowth.train(present_categories_itemsets, minSupport=0.2, numPartitions=10)

# Collects the frequent itemsets on the driver and prints them
frequent_itemsets = model.freqItemsets().collect()

for fi in frequent_itemsets:
    print(fi)

22/05/16 09:38:29 WARN FPGrowth: Input data is not cached.
22/05/16 09:40:42 ERROR Executor: Exception in task 2.0 in stage 30.0 (TID 101)]
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3236)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
	at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
	at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:242)
	at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:53)
	at org.apache.spark.scheduler.DirectTaskResult$$Lambda$2458/1018210965.apply$mcV$sp(Unknown Source)
	at scala.runtime.java8.JFunction0$mcV$sp.a

ConnectionRefusedError: [Errno 111] Connection refused
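
One plausible reason for this failure, offered as an assumption rather than a diagnosis: every non-empty subset of a frequent itemset is itself frequent, so the number of itemsets returned by `freqItemsets().collect()` can grow exponentially with the size of the largest frequent set. The session shown earlier contains 37 categories, so long co-occurring groups are likely. The arithmetic below only illustrates that growth; the sizes in the loop are hypothetical.

```python
def nb_subsets(k):
    """Number of non-empty subsets of a set with k items."""
    return 2 ** k - 1

# If k categories co-occur in at least minSupport of the sessions,
# all of their non-empty subsets are frequent as well
for k in (5, 10, 20, 30):
    print(k, nb_subsets(k))
```

Under that assumption, raising `minSupport` or dropping the categories that appear in almost every session before mining would shrink the result, at the cost of losing some groups.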