# Feature engineering on item feature categories


Item features carry a lot of useful information, but there are a great many of them.

Some items share feature categories (such as color or size), while others do not.

Several features can be engineered from the item features, such as:

* The most commonly visited (feature_category, feature_value) pair. ✅
* The number of visits on the most common pair. ✅
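
As a minimal local sketch of the two features listed above, the snippet below works on a plain Python list instead of a Spark RDD. The `session_pairs` data and the `most_common_pair` helper are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical toy data: the (feature_category_id, feature_value_id) pairs
# of every item viewed during a single session.
session_pairs = [(56, 365), (47, 516), (56, 365), (69, 780), (56, 365)]

def most_common_pair(pairs):
    """Return (feature_category_id, feature_value_id, visit_count)
    for the most frequent pair in the session."""
    (category_id, value_id), count = Counter(pairs).most_common(1)[0]
    return category_id, value_id, count

print(most_common_pair(session_pairs))  # (56, 365, 3)
```

The returned triple has the same shape as the per-session values produced further down in this notebook.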

In [1]:
import os 
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# launch this cell if you have issues on windows with py4j (think about updating your PATH)
import sys
os.environ['PYSPARK_DRIVER_PYTHON_OPTS']= "notebook"
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_PYTHON'] = sys.executable

# starts a spark session from notebook

os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=4g  pyspark-shell"
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("load_explore") \
    .getOrCreate()



In [2]:
# Loads the relevant data into DataFrames
train_sessions = spark.read.load('../Data/train_sessions.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

train_purchases = spark.read.load('../Data/train_purchases.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

candidate_items = spark.read.load('../Data/candidate_items.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

item_features = spark.read.load('../Data/item_features.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')


                                                                                

In [3]:
mapped_item_features = item_features.rdd.map(lambda x: (x.item_id, (x.feature_category_id, x.feature_value_id)))

item_session_and_features = (train_sessions.rdd
                             .map(lambda x: (x.item_id, (x.session_id)))  # Keys each viewed item by item_id: (item_id, session_id)
                             .join(mapped_item_features)  # Huge operation, might be too heavy
)

In [4]:
item_session_and_features.take(5)

                                                                                

[(15654, (13, (56, 365))),
 (15654, (13, (47, 516))),
 (15654, (13, (69, 780))),
 (15654, (13, (68, 351))),
 (15654, (13, (62, 801)))]

In [5]:
from collections import Counter
# Grouping by session is too heavy, so let's try a reduce operation instead

# At this point, item_session_and_features contains tuples of (item_id, (session_id, (feature_cat_id, feature_val_id)))
# Let's reorder the tuple to (session_id, (item_id, feature_cat_id, feature_val_id))

def reduce_count_item_pairs(x, y):
    '''Merges the y dictionary into the x dictionary.
    Expects both x and y to be dictionaries keyed by (feature_cat_id, feature_val_id),
    with a count of occurrences as the value.
    '''
    for feature_pair in y:
        if x.get(feature_pair) is None:
            x[feature_pair] = y[feature_pair]
        else:
            x[feature_pair] += y[feature_pair]
    return x
        
item_features_per_session = (item_session_and_features
                             .map(lambda x: (x[1][0], (x[0], x[1][1][0], x[1][1][1])))  # Reorder to (session_id, (item_id, feature_cat_id, feature_val_id))
                             .mapValues(lambda x: {(x[1], x[2]): 1})  # Transforms to (session_id, counter_of(feature_cat_id, feature_val_id))
                             .reduceByKey(reduce_count_item_pairs)  # Reduce by merging the dictionaries and their values
                             .mapValues(lambda x: Counter(x).most_common()[0])  # Takes the most common key tuple
                             .mapValues(lambda x: (x[0][0], x[0][1], x[1]))
)

In [6]:
item_features_per_session.take(5)

                                                                                

[(290712, (56, 365, 8)),
 (1264194, (56, 365, 5)),
 (3219234, (56, 365, 6)),
 (282840, (56, 365, 10)),
 (4209018, (56, 365, 7))]

## Counting the number of categories

We want to provide features engineered from the item feature categories. There are many feature categories. Let's count them.

In [7]:
def reduce_unique(x, y):
    ''' Merges the category sets of two (item_id, set) tuples; the item_id that is kept is arbitrary '''
    return (x[0], x[1] | y[1])

unique_categories = (item_features.rdd
                    .map(lambda x: (x.item_id, (x.feature_category_id)))  # Maps every item feature entry to (item_id, feature_category_id)
                    .mapValues(lambda x: set([x]))  # Wraps each value in a set, so duplicates collapse during the reduce operation
                    .reduce(reduce_unique)  # Reduces the values by merging the sets, eliminating duplicate feature_category_ids
)

len(unique_categories[1])

                                                                                

73

There are 73 different feature categories. 
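
The set-merging reduce used above can be checked on a small local sample. This is only a sketch: the `rows` list is hypothetical toy data standing in for the `(item_id, feature_category_id)` entries.

```python
from functools import reduce

# Hypothetical (item_id, feature_category_id) rows.
rows = [(1, 56), (1, 47), (2, 56), (3, 69), (3, 47)]

# Wrap every category id in a singleton set, then merge the sets pairwise.
# Duplicates disappear because set union keeps each id only once.
category_sets = [{category_id} for _, category_id in rows]
unique_ids = reduce(lambda left, right: left | right, category_sets)

print(len(unique_ids))  # 3
```

On the DataFrame itself, `item_features.select('feature_category_id').distinct().count()` should give the same number without going through the RDD API.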

## Engineering our first 73 features

One family of features counts, for each session, how many times each feature category occurs among the visited items.
This alone gives us 73 features, one per category.
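
To make the target layout concrete, here is a small local sketch of the per-session count vector. The `category_count_vector` helper and the sample ids are hypothetical; the vector keeps positions 0 to 73 so that a category id can be used directly as an index, which is the same layout this notebook uses.

```python
from collections import Counter

NB_CATEGORIES = 73  # number of distinct feature categories counted earlier

def category_count_vector(category_ids, nb_categories=NB_CATEGORIES):
    """Turn the category ids seen in one session into a fixed-length tuple
    where position i holds the number of occurrences of category i."""
    counts = Counter(category_ids)
    # Positions run from 0 to nb_categories inclusive; absent categories get 0.
    return tuple(counts.get(idx, 0) for idx in range(nb_categories + 1))

vector = category_count_vector([56, 47, 56, 69])
print(len(vector), vector[56], vector[47])  # 74 2 1
```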

In [8]:
def count_categories(x, y):
    ''' Merges the y dictionary into x, summing the occurrence counts of each category '''
    for key in y:
        x[key] = x.get(key, 0) + y[key]
    return x

def fill_empties(x, nb_categories=len(unique_categories[1])):
    ''' Fills the dictionary with the categories that are not present,
    setting their occurrence count to 0.
    Keys run from 0 to nb_categories inclusive, so the result has nb_categories + 1 entries. '''
    for idx in range(0, nb_categories+1):
        x[idx] = x.get(idx, 0)
    return x

def get_occurences(x):
    ''' Remaps the sorted (category_id, count) pairs so that only the counts remain.
    The feature category id is given by the position (from 0 to 73).
    '''
    return tuple(elem[1] for elem in x)

feature_categories_per_session = (item_session_and_features
                             .map(lambda x: (x[1][0], (x[1][1][0])))  # Project to (session_id, (feature_cat_id))
                             .mapValues(lambda x: {x: 1})  # Wraps each value in a dictionary before the reduce
                             .reduceByKey(count_categories)  # Counts feature category occurrences in a dictionary
                             .mapValues(fill_empties)  # Adds the categories that were never met, with an occurrence count of 0
                             .mapValues(lambda x: tuple(sorted(x.items(), key=lambda x: x[0])))  # Sorts by key value
                             .mapValues(get_occurences)
)

feature_categories_per_session.take(1)

                                                                                

[(2066478,
  (0,
   0,
   0,
   1,
   5,
   1,
   4,
   5,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   5,
   1,
   1,
   0,
   0,
   0,
   0,
   4,
   0,
   1,
   0,
   0,
   4,
   4,
   0,
   1,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   5,
   0,
   0,
   5,
   0,
   0,
   4,
   0,
   5,
   5,
   0,
   0,
   1,
   4,
   5,
   4,
   5,
   0,
   5,
   0,
   0,
   5,
   5,
   0,
   0,
   5,
   5))]

## Feature grouping using frequent itemsets

The dimensionality can be reduced, for example by merging the feature categories into groups determined by frequent itemset analysis.

Spark's MLlib package provides frequent pattern mining functionality, including an implementation of FP-Growth.
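
To illustrate what "frequent itemset" and "support" mean here, the sketch below enumerates itemsets by brute force on a tiny hypothetical dataset. It is not FP-Growth (which reaches the same result far more efficiently); the `frequent_itemsets` helper and the `transactions` list are illustrative names only.

```python
from itertools import combinations

# Hypothetical transactions: the set of categories present in each session.
transactions = [{3, 4, 5}, {3, 4}, {3, 5}, {4, 5}, {3, 4, 5}]

def frequent_itemsets(transactions, min_support):
    """Brute-force search: return {itemset: support} for every itemset whose
    support (share of transactions containing it) is at least min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            hits = sum(1 for t in transactions if set(candidate) <= t)
            if hits / n >= min_support:
                result[candidate] = hits / n
    return result

found = frequent_itemsets(transactions, min_support=0.5)
print(sorted(found, key=len))
```

With a minimum support of 0.5, every single category and every pair qualifies, while the triple `(3, 4, 5)` appears in only 2 of 5 transactions and is dropped.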

In [9]:
def occurences_to_present_category(x):
    ''' Maps a list of occurrences (where the feature category id is defined by the position)
    to a tuple containing the feature categories with a non-zero occurrence count.
    '''
    present_categories = set()
    assert(len(x) == 74) # Checks that the vector has 74 entries (category indices 0 to 73)
    for idx in range(74):
        if x[idx] != 0:
            present_categories.add(idx)
    return tuple(present_categories)
            

present_categories_per_session = feature_categories_per_session.mapValues(occurences_to_present_category)

In [10]:
present_categories_per_session.take(1)

                                                                                

[(2066478,
  (3,
   4,
   5,
   6,
   7,
   11,
   17,
   18,
   19,
   24,
   26,
   29,
   30,
   32,
   34,
   45,
   46,
   47,
   50,
   53,
   55,
   56,
   59,
   60,
   61,
   62,
   63,
   65,
   68,
   69,
   72,
   73))]

In [11]:
present_categories_itemsets = present_categories_per_session.map(lambda x: list(x[1])).cache()  # Dropping the sessions data

In [12]:
from pyspark.mllib.fpm import FPGrowth

# Trains the FP-Growth model on the cached itemsets RDD built in the previous cell
model = FPGrowth.train(present_categories_itemsets, minSupport=0.4, numPartitions=10)

22/05/16 14:44:40 WARN FPGrowth: Input data is not cached.
                                                                                

In [13]:
# model.freqItemsets().count()

# Selecting the biggest itemsets

Now that we have itemsets with a support of at least 40% (`minSupport=0.4`), let's select a number of large groups and use them as features.

For each group, the feature is set to 1 if the categories visited at least once in a session are equal to, or a superset of, that group; otherwise it is set to 0.
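
The superset rule can be sketched with Python sets. The `itemset_indicators` helper and the sample `groups` are hypothetical and only illustrate the rule; they are not taken from the trained model.

```python
def itemset_indicators(session_categories, groups):
    """Return one 0/1 flag per group: 1 when every category of the group
    was visited at least once in the session, 0 otherwise."""
    present = set(session_categories)
    return [1 if set(group) <= present else 0 for group in groups]

# Hypothetical groups, as they could come out of a frequent itemset analysis.
groups = [(3, 4), (4, 5, 6), (7,)]

print(itemset_indicators((3, 4, 5, 11), groups))  # [1, 0, 0]
```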

In [14]:
# Computes the frequent itemsets and sorts them by size (largest groups first)
frequent_itemsets = model.freqItemsets().sortBy(lambda x: len(x.items), ascending=False, numPartitions=4).cache()

                                                                                

In [15]:
# top_20_frequent = frequent_itemsets.take(20)

In [16]:
# top_20_frequent

We have a problem: the most frequent itemsets found here overlap heavily. Ideally, we would find frequent itemsets whose items are fairly different from one another.
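
One possible way to reduce that overlap, which this notebook does not pursue, is a greedy filter based on Jaccard similarity. The sketch below is an assumption made for illustration: the `select_diverse` helper, the `max_similarity` threshold, and the sample itemsets are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two non-empty itemsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def select_diverse(itemsets, max_similarity=0.5):
    """Greedily keep itemsets, largest first, skipping any candidate whose
    similarity to an already kept itemset reaches max_similarity."""
    selected = []
    for candidate in sorted(itemsets, key=len, reverse=True):
        if all(jaccard(candidate, kept) < max_similarity for kept in selected):
            selected.append(candidate)
    return selected

# Hypothetical frequent itemsets with heavy overlap.
itemsets = [(3, 4, 5, 6), (3, 4, 5), (7, 8), (3, 4), (7, 8, 9)]
print(select_diverse(itemsets))  # [(3, 4, 5, 6), (7, 8, 9)]
```

The threshold trades coverage against redundancy: a lower value keeps fewer, more distinct groups.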

# Item feature category clustering

One technique would be to cluster the sessions on their item feature categories. For each session, we take the feature category occurrence counts and normalize them.

An arbitrary number of clusters X could be created. We could then one-hot encode the cluster assigned to each session, which adds X new features to our dataset.
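
The two building blocks described above, normalization and one-hot encoding, can be sketched in plain Python. The `l2_normalize` and `one_hot` helpers are hypothetical, and the sketch assumes that cluster ids are 0-based and map directly to positions.

```python
import math

def l2_normalize(counts):
    """Scale a count vector to unit Euclidean length.
    An all-zero vector is returned unchanged to avoid dividing by zero."""
    norm = math.sqrt(sum(value * value for value in counts))
    if norm == 0:
        return [float(value) for value in counts]
    return [value / norm for value in counts]

def one_hot(cluster_id, nb_clusters):
    """Encode a 0-based cluster id as a list with a single 1 at that position."""
    encoded = [0] * nb_clusters
    encoded[cluster_id] = 1
    return encoded

print(l2_normalize([3, 4]))  # [0.6, 0.8]
print(one_hot(2, 4))         # [0, 0, 1, 0]
```

Normalizing first means that sessions are compared by the proportions of categories they visit rather than by their length.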

In [69]:
import numpy as np

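def normalize_vector(x):
    ''' Scales the occurrence vector to unit L2 norm; an all-zero vector is returned unchanged '''
    values = np.array(x, dtype=np.float64)
    norm = np.linalg.norm(values)
    if norm > 0:
        values = values / norm
    return values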

normalized_categories = feature_categories_per_session.mapValues(normalize_vector).map(lambda x: x[1])

In [71]:
from pyspark.mllib.clustering import KMeans, KMeansModel

number_of_clusters = 10

# Build the model (cluster the data)
clusters = KMeans.train(normalized_categories, number_of_clusters, maxIterations=10, initializationMode="random")

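# Evaluate clustering by summing each point's distance to its assigned cluster center.
# Note: because of the square root below, this sums Euclidean distances rather than
# squared distances, so it is not a true Within Set Sum of Squared Errors.
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return np.sqrt(sum([x**2 for x in (point - center)]))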

WSSSE = normalized_categories.map(lambda point: error(point)).reduce(lambda x, y: x + y)

print("Within Set Sum of Squared Error = " + str(WSSSE))



Within Set Sum of Squared Error = 300793.6195249505
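
For reference, a within-set sum of squared errors adds up squared distances, with no square root. The sketch below computes it on hypothetical toy points and centers; the `squared_distance` and `wssse` helpers are illustrative names, not part of Spark.

```python
def squared_distance(point, center):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((p - c) ** 2 for p, c in zip(point, center))

def wssse(points, centers):
    """Sum, over all points, of the squared distance to the closest center."""
    return sum(min(squared_distance(point, center) for center in centers)
               for point in points)

# Hypothetical 2-D points and cluster centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]

print(wssse(points, centers))  # 2.0
```

In `pyspark.mllib`, `KMeansModel.computeCost(rdd)` returns this squared-distance sum directly, which may be simpler than a hand-written map and reduce.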


                                                                                

In [85]:
# Set the seed so that the same vector is always drawn
np.random.seed(42)

# We try our clusters by creating a fake vector and predicting its cluster.
category_vector = np.random.normal(loc=10.0, scale=5.0, size=len(unique_categories[1])+1)

# We clip the negative values to 0 
negative_values = category_vector < 0.0
category_vector[negative_values] = 0.0

category_vector

array([12.48357077,  9.30867849, 13.23844269, 17.61514928,  8.82923313,
        8.82931522, 17.89606408, 13.83717365,  7.65262807, 12.71280022,
        7.68291154,  7.67135123, 11.20981136,  0.43359878,  1.37541084,
        7.18856235,  4.9358444 , 11.57123666,  5.45987962,  2.93848149,
       17.32824384,  8.8711185 , 10.33764102,  2.87625907,  7.27808638,
       10.55461295,  4.24503211, 11.87849009,  6.99680655,  8.54153125,
        6.99146694, 19.26139092,  9.93251388,  4.71144536, 14.11272456,
        3.89578175, 11.04431798,  0.20164938,  3.35906976, 10.98430618,
       13.6923329 , 10.85684141,  9.42175859,  8.49448152,  2.60739005,
        6.40077896,  7.69680615, 15.28561113, 11.71809145,  1.18479922,
       11.62041985,  8.0745886 ,  6.61539   , 13.05838144, 15.15499761,
       14.6564006 ,  5.80391238,  8.45393812, 11.65631716, 14.87772564,
        7.60412881,  9.07170512,  4.46832513,  4.01896688, 14.06262911,
       16.78120014,  9.63994939, 15.01766449, 11.80818013,  6.77

In [86]:
# We predict our vector values
clusters.predict(category_vector)

4

# Feature engineering on item feature categories

Now that we have our clusters, a simple mapValues operation assigns each session to a cluster and one-hot encodes the result.

In [90]:
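def one_hot_encode_integer(x, max_range=number_of_clusters):
    ''' One-hot encodes a cluster id into a vector of length max_range '''
    values = np.zeros(max_range, dtype=np.uint8)
    # clusters.predict returns 0-based ids, so x-1 places cluster 0 in the last slot (index -1):
    # the encoding stays one-hot, but slot i corresponds to cluster i+1 (modulo max_range)
    values[x-1] = 1
    return values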
    

clustered_sessions = feature_categories_per_session.mapValues(normalize_vector).mapValues(lambda x: clusters.predict(x)).mapValues(one_hot_encode_integer)

In [93]:
clustered_sessions.take(10)

                                                                                

[(2066478, array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=uint8)),
 (3582234, array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=uint8)),
 (1985388, array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=uint8)),
 (2130960, array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=uint8)),
 (2269938, array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=uint8)),
 (2700660, array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=uint8)),
 (2990472, array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=uint8)),
 (3730554, array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=uint8)),
 (4161570, array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=uint8)),
 (2393550, array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=uint8))]