# Feature primitives

https://featuretools.alteryx.com/en/stable/getting_started/primitives.html

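Primitives are the basic building blocks of feature engineering in Featuretools. Each primitive defines its input and output data types, so primitives can be stacked on top of one another. The principle is to break feature engineering calculations down into basic components that can be combined into increasingly complex features.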

For example, average time between events can be decomposed into the primitives `time_since_previous` and `mean`.
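
To make the decomposition concrete, here is a minimal plain-Python sketch of the two steps, without Featuretools. The timestamps, the helper names, and the choice of seconds as the unit are illustrative assumptions, not taken from the library.

```python
from datetime import datetime
from statistics import mean

# Hypothetical event timestamps for a single customer, in chronological order.
events = [
    datetime(2024, 1, 1, 9, 0, 0),
    datetime(2024, 1, 1, 9, 10, 0),
    datetime(2024, 1, 1, 9, 40, 0),
]


def time_since_previous(timestamps):
    """Transform step: seconds since the previous event (None for the first)."""
    gaps = [None]
    for previous, current in zip(timestamps, timestamps[1:]):
        gaps.append((current - previous).total_seconds())
    return gaps


def mean_ignoring_missing(values):
    """Aggregation step: mean of the values that are present."""
    present = [v for v in values if v is not None]
    return mean(present)


gaps = time_since_previous(events)
average_gap = mean_ignoring_missing(gaps)
print(gaps)         # [None, 600.0, 1800.0]
print(average_gap)  # 1200.0
```

The transform step keeps one value per event, and the aggregation step collapses those values into a single number per customer.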

In [1]:
import featuretools as ft
import pandas

# Display options
pandas.set_option('display.max_rows', 10)

es = ft.demo.load_mock_customer(return_entityset=True)

In [2]:
feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean"],
    trans_primitives=["time_since_previous"],
    features_only=True,
)

feature_defs

[<Feature: zip_code>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: TIME_SINCE_PREVIOUS(join_date)>,
 <Feature: MEAN(sessions.MEAN(transactions.amount))>,
 <Feature: MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))>]

Passing `features_only=True` returns the list of feature definitions without calculating the full feature matrix, which is useful for debugging.
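
A stacked name such as `MEAN(sessions.MEAN(transactions.amount))` reads from the inside out: first average the amounts within each session, then average those session means for each customer. The sketch below reproduces that reading in plain Python on hypothetical rows (the IDs and amounts are made up) and compares it with the direct `MEAN(transactions.amount)`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (customer_id, session_id, amount).
transactions = [
    (1, "s1", 10.0),
    (1, "s1", 30.0),
    (1, "s2", 50.0),
    (2, "s3", 8.0),
    (2, "s3", 12.0),
]

# Map each session to its customer and collect the amounts per session.
session_customer = {}
amounts_by_session = defaultdict(list)
for customer_id, session_id, amount in transactions:
    session_customer[session_id] = customer_id
    amounts_by_session[session_id].append(amount)

# Depth 1: MEAN(transactions.amount), computed per session.
session_mean = {s: mean(a) for s, a in amounts_by_session.items()}

# Depth 2: MEAN(sessions.MEAN(transactions.amount)), computed per customer.
means_by_customer = defaultdict(list)
for session_id, value in session_mean.items():
    means_by_customer[session_customer[session_id]].append(value)
stacked = {c: mean(v) for c, v in means_by_customer.items()}

# For comparison: the mean over all of a customer's transactions.
amounts_by_customer = defaultdict(list)
for customer_id, _, amount in transactions:
    amounts_by_customer[customer_id].append(amount)
direct = {c: mean(a) for c, a in amounts_by_customer.items()}

print(stacked)  # {1: 35.0, 2: 10.0}
print(direct)   # {1: 30.0, 2: 10.0}
```

The two values differ for customer 1 because the stacked feature weights every session equally, whereas the direct mean weights every transaction equally.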

DFS computes many potentially interesting features in very compact code.

In [3]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "max", "min", "std", "skew"],
    trans_primitives=["time_since_previous"],
)

feature_matrix

customer_id,zip_code,MAX(transactions.amount),MEAN(transactions.amount),MIN(transactions.amount),SKEW(transactions.amount),STD(transactions.amount),TIME_SINCE_PREVIOUS(join_date),MAX(sessions.MEAN(transactions.amount)),MAX(sessions.MIN(transactions.amount)),MAX(sessions.SKEW(transactions.amount)),...,SKEW(sessions.MAX(transactions.amount)),SKEW(sessions.MEAN(transactions.amount)),SKEW(sessions.MIN(transactions.amount)),SKEW(sessions.STD(transactions.amount)),SKEW(sessions.TIME_SINCE_PREVIOUS(session_start)),STD(sessions.MAX(transactions.amount)),STD(sessions.MEAN(transactions.amount)),STD(sessions.MIN(transactions.amount)),STD(sessions.SKEW(transactions.amount)),STD(sessions.TIME_SINCE_PREVIOUS(session_start))
5,60091,149.02,80.375443,7.55,-0.025941,44.09563,,94.481667,20.65,0.602209,...,-0.333796,0.335175,-0.47041,0.204548,-1.507217,7.928001,11.007471,4.961414,0.415426,157.884451
4,60091,149.95,80.070459,5.73,-0.036348,45.068765,22948824.0,110.45,54.83,0.382868,...,0.027256,1.980948,2.10351,-1.065663,1.065177,3.514421,13.027258,16.960575,0.387884,308.688904
1,60091,139.43,71.631905,5.81,0.019698,40.442059,744019.0,88.755625,26.36,0.640252,...,-0.780493,-0.424949,2.440005,-0.312355,-0.254557,7.322191,13.759314,6.954507,0.589386,171.754341
3,13244,149.15,67.06043,5.89,0.41823,43.683296,10212841.0,82.109444,20.06,0.854976,...,-0.941078,0.678544,1.000771,-0.245703,0.434581,10.724241,11.174282,5.424407,0.429374,177.613813
2,13244,146.81,77.422366,8.73,0.098259,37.705178,21282510.0,96.581,56.46,0.755711,...,-1.539467,0.235296,2.154929,0.013087,0.162631,17.221593,11.477071,15.874374,0.509798,194.638554


## Types of primitives

*Aggregation primitives* take multiple related inputs and produce a single output. They work across the parent-child relationships defined in an EntitySet.

*Transform primitives* take one or more columns from a single dataframe and transform them into a new output column.
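
The difference in output shape can be sketched in plain Python. This is only an illustration of the two definitions above, on hypothetical data with hypothetical helper names; it does not use the Featuretools API.

```python
# Hypothetical child table: one row per transaction, keyed by its parent session.
session_ids = ["a", "a", "b", "b", "b"]
amounts = [-5.0, 20.0, 7.5, -2.5, 10.0]


def absolute(column):
    """Transform: one output value per input row."""
    return [abs(v) for v in column]


def maximum_per_parent(keys, column):
    """Aggregation: one output value per parent key."""
    result = {}
    for key, value in zip(keys, column):
        result[key] = value if key not in result else max(result[key], value)
    return result


transformed = absolute(amounts)
aggregated = maximum_per_parent(session_ids, amounts)

print(transformed)  # [5.0, 20.0, 7.5, 2.5, 10.0]
print(aggregated)   # {'a': 20.0, 'b': 10.0}
```

The transform output has as many rows as the input table, while the aggregation output has one row per parent.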

It's easy to see all the built-in primitives:

In [35]:
ft.list_primitives()

,name,type,dask_compatible,spark_compatible,description,valid_inputs,return_type
0,skew,aggregation,False,False,Computes the extent to which a distribution di...,<ColumnSchema (Semantic Tags = ['numeric'])>,<ColumnSchema (Semantic Tags = ['numeric'])>
1,n_most_common_frequency,aggregation,False,False,Determines the frequency of the n most common ...,<ColumnSchema (Semantic Tags = ['category'])>,<ColumnSchema (Logical Type = Categorical) (Se...
2,min,aggregation,True,True,"Calculates the smallest value, ignoring `NaN` ...",<ColumnSchema (Semantic Tags = ['numeric'])>,<ColumnSchema (Semantic Tags = ['numeric'])>
3,max_consecutive_false,aggregation,False,False,Determines the maximum number of consecutive F...,<ColumnSchema (Logical Type = Boolean)>,<ColumnSchema (Logical Type = Integer) (Semant...
4,first,aggregation,False,False,Determines the first value in a list.,<ColumnSchema>,
...,...,...,...,...,...,...,...
198,percent_change,transform,False,False,Determines the percent difference between valu...,<ColumnSchema (Semantic Tags = ['numeric'])>,<ColumnSchema (Logical Type = Double) (Semanti...
199,median_word_length,transform,False,False,Determines the median word length.,<ColumnSchema (Logical Type = NaturalLanguage)>,<ColumnSchema (Logical Type = Double) (Semanti...
200,quarter,transform,True,True,Determines the quarter a datetime column falls...,<ColumnSchema (Logical Type = Datetime)>,"<ColumnSchema (Logical Type = Ordinal: [1, 2, ..."
201,multiply_numeric_boolean,transform,True,False,Performs element-wise multiplication of a nume...,"<ColumnSchema (Logical Type = Boolean)>, <Colu...",<ColumnSchema (Semantic Tags = ['numeric'])>


In [5]:
ft.summarize_primitives()

,Metric,Count
0,total_primitives,203
1,aggregation_primitives,65
2,transform_primitives,138
3,unique_input_types,23
4,unique_output_types,22
...,...,...
38,uses_time_index_tag_input,29
39,uses_date_of_birth_tag_input,1
40,uses_ignore_tag_input,0
41,uses_passthrough_tag_input,0


## Custom primitives

Users can define their own primitives:

* Specify the type
* Define the input and output data types
* Create a function to perform the calculation
* Add attributes to determine when it can be applied

In [6]:
from featuretools.primitives import AggregationPrimitive, TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, NaturalLanguage
import pandas as pd

Examples: two new primitives, one computing the absolute value and one computing the maximum (without relying on the built-in `max` primitive).

In [None]:
class Absolute(TransformPrimitive):
    name = "absolute"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def absolute(column):
            return abs(column)

        return absolute
    
class Maximum(AggregationPrimitive):
    name = "maximum"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def maximum(column):
            return max(column)

        return maximum

## Word count example

The following transform primitive counts the words in each row of a text column. When a primitive needs more than one input column, `input_types` can contain multiple elements; see the documentation for examples.

In [36]:
class WordCount(TransformPrimitive):
    """
    Counts the number of words in each row of the column. Returns a list
    of the counts for each row.
    """

    name = "word_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def word_count(column):
            word_counts = []
            for value in column:
                words = value.split()  # split on any run of whitespace
                word_counts.append(len(words))
            return word_counts

        return word_count

In [38]:
es = make_ecommerce_entityset()
es

Entityset: ecommerce
  DataFrames:
    régions [Rows: 2, Columns: 2]
    stores [Rows: 6, Columns: 3]
    products [Rows: 6, Columns: 4]
    customers [Rows: 3, Columns: 15]
    sessions [Rows: 6, Columns: 6]
    log [Rows: 17, Columns: 17]
    cohorts [Rows: 2, Columns: 3]
  Relationships:
    customers.cohort -> cohorts.cohort
    customers.région_id -> régions.id
    stores.région_id -> régions.id
    sessions.customer_id -> customers.id
    log.session_id -> sessions.id
    log.product_id -> products.id

In [52]:
pd.set_option('display.max_colwidth', None)
es["customers"][["id", "favorite_quote"]]

,id,favorite_quote
2,2,All members of the working classes must seize the means of production.
0,0,The proletariat have nothing to lose but their chains
1,1,Capitalism deprives us all of self-determination
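
As a quick sanity check, the counting logic inside `WordCount` can be exercised on the three quotes shown above without running DFS. The standalone function below is a condensed copy of the inner `word_count` function, written here only for illustration.

```python
def word_count(column):
    """Count whitespace-separated words in each value of the column."""
    return [len(value.split()) for value in column]


# The three favorite_quote values, in the order displayed above (ids 2, 0, 1).
quotes = [
    "All members of the working classes must seize the means of production.",
    "The proletariat have nothing to lose but their chains",
    "Capitalism deprives us all of self-determination",
]

counts = word_count(quotes)
print(counts)  # [12, 9, 6]
```

Note that punctuation stays attached to its word and that a hyphenated term such as `self-determination` counts as one word, because the split is on whitespace only.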


In [37]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["sum", "mean", "std"],
    trans_primitives=[WordCount],
)

feature_matrix[
    [
        "customers.WORD_COUNT(favorite_quote)",
        "STD(log.WORD_COUNT(comments))",
        "SUM(log.WORD_COUNT(comments))",
        "MEAN(log.WORD_COUNT(comments))",
    ]
]

id,customers.WORD_COUNT(favorite_quote),STD(log.WORD_COUNT(comments)),SUM(log.WORD_COUNT(comments)),MEAN(log.WORD_COUNT(comments))
0,9.0,540.43686,2500.0,500.0
1,9.0,583.70255,1732.0,433.0
2,9.0,,246.0,246.0
3,6.0,883.883476,1256.0,628.0
4,6.0,0.0,9.0,3.0
5,12.0,19.79899,68.0,34.0
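
The aggregated columns can be read as ordinary statistics over the per-comment word counts of each session. The sketch below uses hypothetical per-comment counts, chosen only to be consistent with the rows for sessions 2, 4, and 5 above, and assumes that `STD` is the sample standard deviation (one degree of freedom removed), which matches the numbers shown.

```python
import math
from statistics import mean, stdev


def sample_std(values):
    """Sample standard deviation; NaN when there are fewer than two values."""
    return stdev(values) if len(values) > 1 else math.nan


# Hypothetical per-comment word counts, consistent with the table above.
counts_by_session = {
    2: [246],      # a single comment: the sample std is undefined
    4: [3, 3, 3],  # identical counts: std is zero
    5: [20, 48],
}

summary = {
    session_id: (sample_std(counts), sum(counts), mean(counts))
    for session_id, counts in counts_by_session.items()
}

for session_id, (std, total, average) in summary.items():
    print(session_id, std, total, average)
```

Under these assumptions, the empty `STD` cell for session 2 is what one would expect when a session has only one logged comment, since `SUM` and `MEAN` are equal there.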
