# Explanation of Feature Generation

## Interface
The goal here is to generate a large number of features using a class called FeatureSynth. The interface for this class should be as follows:
* The user initializes the class with their training data.
     * The user specifies which columns of the training data are categorical, DateTime, numerical, etc.
     * /*User specifies how they want the cluster analysis to be performed. This may be by directly passing in a clustering function to fit, or it could be by passing an indicator variable that will indicate which clustering function to use.*/
     * The user specifies the picture size, as both the number of pixels and the number of channels. Only square images are supported, so the program must check this. ```total_number_of_features = number_of_pixels * number_of_channels```, so ```num_synth_feats = total_number_of_features - number_of_input_features```.
* The user calls the function ```synthesize_features()```.
    * This call first determines how many synthetic features it must create, ```num_synth_feats```. It then executes the following algorithm to generate deep features:
```
    function synthesize_features():
        numerical_summary_stats = ('mean', 'std', 'skew', 'median')
        mono_feature_operations = ('log', 'exp', 'sin', 'relu', 'sigmoid')
        dual_feature_operations = ('add', 'multiply', 'subtract')
        boolean_operations = ('AND', 'OR', 'XOR', 'NAND')
        level = 0
        while total number of features < desired number of features:
            for each categorical variable/feature in this level:
                make groups of examples that have the same value for that feature
                for each numerical variable in this level:
                    for each summary stat in numerical_summary_stats:
                        if current number features == desired number features:
                            break
                        else:
                            create a feature of the summary stat for the numerical variable
                            log the feature name in this level + 1, as a numerical variable
                for each other categorical variable in this level:
                    if current number features == desired number features:
                        break
                    else:
                        create a feature for the mode of the other categorical variable
                        log the feature name in this level + 1, as a categorical variable

            for each numerical variable/feature in this level:
                for each single feature operation in mono_feature_operations:
                    if current number features == desired number features:
                        break
                    else:
                        create a feature from the single feature operation
                        log the feature name in this level + 1, as a numerical variable
                for each other numerical variable/feature in this level:
                    for each two feature operation in dual_feature_operations:
                        if current number features == desired number features:
                            break
                        else if operation has not already been performed:    
                            make feature from two feature operation on numerical variable
                            log it as level + 1 numerical feature

            for each boolean variable/feature:
                for each other boolean variable/feature:
                    for boolean operation in boolean_operations:
                        if current number features == desired number features or operation has been performed already:
                            break
                        else:
                            make feature from boolean operation on two boolean features
                            log it as a level + 1 boolean feature

            level += 1
```
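
The feature-count arithmetic from the picture-size bullet above can be sketched as follows. This is a minimal sketch rather than part of the class: the helper name `num_synth_feats_needed` and its parameters are hypothetical, and it assumes that "number of pixels" means the total pixel count of the image, so a square image requires a perfect square.

```python
import math

def num_synth_feats_needed(number_of_pixels, number_of_channels, number_of_input_features):
    """Hypothetical helper: how many synthetic features are needed to fill a square image."""
    # Assumption: number_of_pixels is the total pixel count, so it must be a perfect square.
    side = math.isqrt(number_of_pixels)
    if side * side != number_of_pixels:
        raise ValueError(f'{number_of_pixels} pixels cannot form a square image')
    total_number_of_features = number_of_pixels * number_of_channels
    num_synth_feats = total_number_of_features - number_of_input_features
    if num_synth_feats < 0:
        raise ValueError('There are more input features than the image can hold')
    return num_synth_feats

# A 16x16 image with 3 channels, starting from 6 input features
print(num_synth_feats_needed(16 * 16, 3, 6))  # 762
```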
     
     

In [3]:
class FeatureTypes:
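    def __init__(self, categorical=None, numerical=None, date=None, boolean=None):
        ''' Sets specified instance variables. Each variable should be a list of strings.'''
        # None defaults avoid sharing one mutable list between instances;
        # list(...) also accepts other sequences such as a pandas Index.
        self.Categorical = list(categorical) if categorical is not None else []
        self.Numerical = list(numerical) if numerical is not None else []
        self.Date = list(date) if date is not None else []
        self.Boolean = list(boolean) if boolean is not None else []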
    
    def get_num_features(self):
        ''' Finds how many features total are present. '''
        feature_type_lens = [len(features_of_this_type) for 
                             features_of_this_type in (self.Categorical, self.Numerical, 
                                                       self.Date, self.Boolean)]
        return sum(feature_type_lens)
            

In [4]:
class SingleFeatureOperations:
    Relu = 'relu'
    Sigmoid = 'sigmoid'
    
class TwoFeatureOperations:
    Add = 'add'
    Subtract = 'subtract'
    Multiply = 'multiply'

class BooleanOperations:
    And = 'AND'
    Or = 'OR'
    Xor = 'XOR'
    Nand = 'NAND'

In [5]:
import pandas as pd
import numpy as np
import random as rd
import math

In [14]:
class SynthesizeFeatures:
    def __init__(self, feature_names, total_num_features):
        ''' 
        Set instance variables. total_num_features is the desired number of
        columns after synthesis, e.g. 16**2 * 3 = 768 for a 16x16 image with
        three pixel channels, such as a 16x16 RGB image.
        '''
        # TODO: build in type checking; feature_names must be of type FeatureTypes
        self._level_to_features = dict()
        self._level_to_features[0] = feature_names # level 0 features
        original_num_features = feature_names.get_num_features()
        self._total_num_features = total_num_features
        print(f'Aiming to have a total of {self._total_num_features} features.')
        
    
    def synthesize_features(self, df_inp):
        df = df_inp.copy()

        #def completed(current_num_features):
         #   return current_num_features >= self._total_num_features
        def completed(num_feats_to_create):
            return num_feats_to_create <= 0
        
        def _num_features_helper(old_df, new_features_inp):
            new_features_dfs = [new_features_inp[key] for key in new_features_inp.keys()]
            new_feature_lens = [len(nf.columns) for nf in new_features_dfs]
            return len(old_df.columns) + sum(new_feature_lens)

        def concat_new_features(old_df, new_features_inp):
            new_features_dfs = [new_features_inp[key] for key in new_features_inp.keys()]
            new_feature_lens = [len(nf.columns) for nf in new_features_dfs]
            print(f'Going to have {len(old_df.columns) + sum(new_feature_lens)} features after concat.')
            new_features_dfs.append(old_df)
            return pd.concat( new_features_dfs, axis=1)
        
        numerical_summary_stats = ('mean', 'std', 'skew', 'median')
        single_feature_operations = ('log', 'exp', 'sin', 'relu', 'sigmoid')
        two_numerical_feature_operations = ('add', 'multiply', 'subtract')
        boolean_operations = ('AND', 'OR', 'XOR', 'NAND')

        num_feats_to_create = self._total_num_features - len(df_inp.columns)
        print(f'Going to create {num_feats_to_create} synthetic features')
        level = 0
        while True:
            feature_names = self._level_to_features[level]
            new_features = {
                'num' : pd.DataFrame(index=df.index),
                'cat' : pd.DataFrame(index=df.index),
                'bool' : pd.DataFrame(index=df.index),
                'date' : pd.DataFrame(index=df.index)
            }
            
            # Populate with synthetic features related to groups and categorical variables
            for cat_feat_name in  feature_names.Categorical:
                category_groups = df.groupby(cat_feat_name)
                for num_feat_name in feature_names.Numerical:
                    for summary_stat in numerical_summary_stats:
                        if completed(num_feats_to_create): 
                            return concat_new_features(df, new_features)
                        new_feature_name = f'group|{cat_feat_name}|num|{num_feat_name}|op|{summary_stat}|'
                        new_feature = category_groups[num_feat_name].transform(summary_stat)
                        new_features['num'][new_feature_name] =  new_feature
                        num_feats_to_create -= 1

                # create mode of other categorical variables, while in this category's group
                # this is currently untested, and needs to be tested before uncommenting
                for other_cat_feat_name in feature_names.Categorical:
                    if cat_feat_name == other_cat_feat_name: continue
                    #if completed(num_feats_to_create): 
                     #   return concat_new_features(df, new_features)
                    #new_feature_name = f'group|{cat_feat_name}|cat|{other_cat_feat_name}|op|mode|'
                    #new_feature = category_groups[other_cat_feat_name].agg( lambda x: pd.Series.mode(x)[0] )
                    #new_features['cat'][new_feature_name] = new_feature
                    #num_feats_to_create -= 1

            # Populate with synthetic features related to numerical transformations
            for i, num_feat_name in enumerate(feature_names.Numerical):
                for single_feat_op in single_feature_operations:
                    if completed(num_feats_to_create): 
                        return concat_new_features(df, new_features) 
                    new_feature_name = f'num|{num_feat_name}|op|{single_feat_op}|'
                    new_feature = SynthesizeFeatures._apply_single_num_feat_operation(df[num_feat_name], single_feat_op)
                    new_features['num'][new_feature_name] = new_feature
                    num_feats_to_create -= 1
                    
                other_num_feature_names = feature_names.Numerical[i+1:] # avoids pairs of features from getting called twice; also avoids pairing with self
                for other_num_feat_name in other_num_feature_names:
                    for two_feature_operation in two_numerical_feature_operations:
                        if completed(num_feats_to_create): 
                            return concat_new_features(df, new_features)
                        new_feature_name = f'num1|{num_feat_name}|num2|{other_num_feat_name}|op|{two_feature_operation}|'
                        new_feature = SynthesizeFeatures._apply_two_num_feat_operation(df[num_feat_name], df[other_num_feat_name], two_feature_operation)
                        new_features['num'][new_feature_name] = new_feature
                        num_feats_to_create -= 1
            
            # Populate with synthetic features related to boolean transformations
            for i, bool_feat_name in enumerate(feature_names.Boolean):
                other_bool_feat_names = feature_names.Boolean[i+1:]
                for other_bool_feat_name in other_bool_feat_names:
                    for bool_op in boolean_operations:
                        if completed(num_feats_to_create):
                            return concat_new_features(df, new_features)
                        new_feature_name = f'bool_feat1|{bool_feat_name}|bool_feat2|{other_bool_feat_name}|op|{bool_op}|'
                        new_feature = SynthesizeFeatures._apply_two_bool_feat_operation(df[bool_feat_name], df[other_bool_feat_name], bool_op)
                        new_features['bool'][new_feature_name] = new_feature
                        num_feats_to_create -= 1

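            # Stop if this level produced nothing; otherwise the loop would never terminate
            if sum(len(nf.columns) for nf in new_features.values()) == 0:
                print('No new features could be created at this level; returning early.')
                return df

            # Update df and _level_to_features before looping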
            print(f'DF num features: {len(df.columns)}')
            df = concat_new_features(df, new_features)
            print(f'DF num features: {len(df.columns)}')
            self._level_to_features[level+1] = FeatureTypes( 
                numerical = new_features['num'].columns,
                categorical = new_features['cat'].columns,
                boolean = new_features['bool'].columns,
                date = new_features['date'].columns)
            level += 1    
    
    @staticmethod
    def _apply_single_num_feat_operation(feature, operation_str):
        _builtin_operation_strs = ('exp', 'log', 'sin', 'cos', 'tan', 
                                   'sinh', 'cosh', 'tanh')
        if operation_str in _builtin_operation_strs:
            return feature.apply(operation_str) # not safe, because may cause problems if out of range
        elif operation_str == SingleFeatureOperations.Relu:
            func = lambda x : 0 if x <= 0 else x
        elif operation_str == SingleFeatureOperations.Sigmoid:
            func = lambda x : 1 / (1 +  math.exp(-x))
        else:
            # Fail loudly instead of hitting an undefined func below
            raise ValueError(f'Unknown single feature operation: {operation_str}')
        return feature.apply(func)
    
    @staticmethod
    def _apply_two_num_feat_operation(feat1, feat2, operation_str):
        if operation_str == TwoFeatureOperations.Add:
            return feat1 + feat2
        elif operation_str == TwoFeatureOperations.Subtract:
            return feat1-feat2
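        elif operation_str == TwoFeatureOperations.Multiply:
            return feat1 * feat2
        else:
            # Fail loudly instead of silently returning None
            raise ValueError(f'Unknown two feature operation: {operation_str}')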

    @staticmethod
    def _apply_two_bool_feat_operation(feature1, feature2, operation_str):
        # Cast to bool so that 0/1 integer columns combine like boolean columns
        a, b = feature1.astype(bool), feature2.astype(bool)

        # Element-wise operations on the two Series
        if operation_str == BooleanOperations.Xor:
            return a ^ b
        elif operation_str == BooleanOperations.And:
            return a & b
        elif operation_str == BooleanOperations.Or:
            return a | b
        elif operation_str == BooleanOperations.Nand:
            return ~(a & b)
        else:
            raise ValueError(f'Unknown boolean operation: {operation_str}')

## An example
The following is an example of generating a large number of features from an input feature dataset. Notice that labels are never mentioned. This is because LABELS SHOULD NOT BE PRESENT AT ALL IN THE FEATURE SYNTHESIS PROCESS. Keeping them out is what prevents a data leak.
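
One way to enforce this is to split the label column off before any synthesis happens. The sketch below is illustrative only: the helper name `split_features_and_labels` and the column name `label` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

def split_features_and_labels(df, label_col):
    """Hypothetical helper: return (features, labels), with the label column removed from features."""
    if label_col not in df.columns:
        raise KeyError(f'Label column {label_col!r} not found')
    labels = df[label_col]
    features = df.drop(columns=[label_col])
    return features, labels

# Toy data; 'label' is an assumed name for the target column
raw = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0],
    'group': ['a', 'b', 'a'],
    'label': [0, 1, 0],
})
features, labels = split_features_and_labels(raw, 'label')
print(list(features.columns))  # ['x1', 'group']
```

Only `features` would then be handed to the synthesizer; `labels` stays aside until model training.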

In [15]:
n_rows = 100

categories_1 = ('a', 'b', 'c', 'd', 'e')
categories_2 = ('InstaMed', 'is', 'a', 'cool', 'company', 'check', 'it', 'out', 'sometime')
rand_cats_1 = [rd.choice(categories_1) for i in range(n_rows)]
rand_cats_2 = [rd.choice(categories_2) for i in range(n_rows)]

example_df = pd.DataFrame({
    'example_numerical_col_1': np.random.rand(n_rows)*50,
    'example_numerical_col_2': np.random.rand(n_rows)*20,
    'example_categorical_col_1': rand_cats_1,
    'example_categorical_col_2': rand_cats_2,
    'example_boolean_col_1': np.random.randint(low=0, high=2, size=n_rows),
    'example_boolean_col_2': np.random.randint(low=0, high=2, size=n_rows)
})
example_feature_names = FeatureTypes(
    categorical=['example_categorical_col_1', 'example_categorical_col_2'],
    numerical= ['example_numerical_col_1'],
    boolean = []
)

In [16]:
tnf = 16**2 * 3
sf = SynthesizeFeatures(example_feature_names, tnf)
new_feats = sf.synthesize_features(example_df)

Aiming to have a total of 768 features.
Going to create 762 synthetic features
DF num features: 6
Going to have 19 features after concat.
DF num features: 19
DF num features: 19
Going to have 318 features after concat.
DF num features: 318
Going to have 768 features after concat.
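
The counts printed above can be checked by hand. The sketch below is a reconstruction for illustration, not part of the notebook: the helper `features_created_in_level` is hypothetical. It assumes that each level only combines the features registered for that level, that the categorical mode features stay commented out, and that the level runs to completion without hitting the feature limit.

```python
from math import comb

def features_created_in_level(n_cat, n_num, n_bool,
                              n_stats=4, n_single_ops=5, n_pair_ops=3, n_bool_ops=4):
    """Hypothetical helper: count the features one full level would create."""
    group_feats = n_cat * n_num * n_stats        # summary stat per (categorical, numerical) pair
    single_feats = n_num * n_single_ops          # log, exp, sin, relu, sigmoid
    pair_feats = comb(n_num, 2) * n_pair_ops     # add, multiply, subtract on unordered pairs
    bool_feats = comb(n_bool, 2) * n_bool_ops    # AND, OR, XOR, NAND on unordered pairs
    return group_feats + single_feats + pair_feats + bool_feats

# Level 0: 2 categorical, 1 numerical, 0 boolean features were registered; the df has 6 columns
level_0 = features_created_in_level(n_cat=2, n_num=1, n_bool=0)
after_level_0 = 6 + level_0

# Level 1: every level-0 synthetic feature is numerical, and there are no new categoricals
level_1 = features_created_in_level(n_cat=0, n_num=level_0, n_bool=0)
after_level_1 = after_level_0 + level_1

print(after_level_0, after_level_1)  # 19 318
```

The third level would create far more than the remaining 450 features, so it stops early once the total reaches 768.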


In [17]:
new_feats.head()

Unnamed: 0,num|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||op|log|,num|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||op|exp|,num|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||op|sin|,num|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||op|relu|,num|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||op|sigmoid|,num1|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||num2|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|exp||op|add|,num1|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||num2|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|exp||op|multiply|,num1|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||num2|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|exp||op|subtract|,num1|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||num2|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|sin||op|add|,num1|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|log||num2|num|group|example_categorical_col_1|num|example_numerical_col_1|op|mean||op|sin||op|multiply|,...,num|example_numerical_col_1|op|exp|,num|example_numerical_col_1|op|sin|,num|example_numerical_col_1|op|relu|,num|example_numerical_col_1|op|sigmoid|,example_numerical_col_1,example_numerical_col_2,example_categorical_col_1,example_categorical_col_2,example_boolean_col_1,example_boolean_col_2
0,1.178425,25.771079,-0.107452,3.249253,0.962646,155681600000.0,505848900000.0,-155681600000.0,3.845114,1.936104,...,3166.051,0.978804,8.06024,0.999684,8.06024,17.829134,c,is,0,1
1,1.121334,21.519191,0.072584,3.068945,0.955593,2216490000.0,6802287000.0,-2216490000.0,3.523576,1.395237,...,431764000.0,0.859266,19.88339,1.0,19.88339,19.253873,b,is,1,0
2,1.212335,28.82734,-0.217968,3.361324,0.966474,3307917000000.0,11118980000000.0,-3307917000000.0,2.836077,-1.765527,...,23141820000000.0,-0.59981,30.772662,1.0,30.772662,0.145958,e,out,0,0
3,1.121334,21.519191,0.072584,3.068945,0.955593,2216490000.0,6802287000.0,-2216490000.0,3.523576,1.395237,...,27.49026,-0.171389,3.313832,0.9649,3.313832,2.508841,b,cool,1,0
4,1.212335,28.82734,-0.217968,3.361324,0.966474,3307917000000.0,11118980000000.0,-3307917000000.0,2.836077,-1.765527,...,612476.6,0.68812,13.325266,0.999998,13.325266,11.063081,e,sometime,1,0


In [85]:
# def concat_new_features(old_df, new_features_inp):
#     new_features_dfs = [new_features_inp[key] for key in new_features_inp.keys()]

#     new_feature_lens = [len(nf.columns) for nf in new_features_dfs]
#     print(f'Did have {len(old_df.columns)} columns')
#     print(f'Going to have {len(old_df.columns) + sum(new_feature_lens)} features after concat.')
#     return pd.concat( [old_df] + new_features_dfs, axis=1)
a = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
b = pd.DataFrame({'c':[7, 8, 9], 'd':[10, 11, 12]})
c = pd.DataFrame({'e':[13, 14, 15], 'f':[16, 17, 18]})
pd.concat([a, b, c], axis=1)

Unnamed: 0,a,b,c,d,e,f
0,1,4,7,10,13,16
1,2,5,8,11,14,17
2,3,6,9,12,15,18


In [327]:
# import featuretools as ft
# class FeatureSynth:  
#     # 'NumUnique', 'percenttrue',
#     non_temporal_agg_prims = ['mean', 'count', 'sum', 'min', 'max',
#                               'std', 'median', 'mode', 
#                                'all', 'any', 'skew',
#                              ]
    
#     def __init__(self):
#         pass
    
#     def synthesize_features(self, features_df, index_col_name, feature_limit=3072, max_depth=2):        
#         num_features = len(features_df.columns)
        
#         if num_features > feature_limit:
#             num_cols_to_drop = num_features - feature_limit
#             cols_to_drop = features_df.columns[-num_cols_to_drop:] # drop the last columns, since they're typically deepest
#             return features_df.drop(columns=cols_to_drop)
#         elif num_features == feature_limit:
#             return features_df
#         else:
            
#             es = ft.EntitySet(id = "features")
#             es = es.entity_from_dataframe(entity_id="base_features",
#                                           dataframe=features_df,
#                                           index=index_col_name)
            
#             cols_to_include = list(features_df.columns)
#             cols_to_include.remove(index_col_name)
#             print(cols_to_include)
            
#             es = es.normalize_entity(base_entity_id = "base_features",
#                                      new_entity_id = "child0",
#                                      index = index_col_name,
#                                      copy_variables = cols_to_include
#                                     )
#             for i in range(max_depth-1):
#                 es.normalize_entity(base_entity_id = f"child{i}",
#                                      new_entity_id = f"child{i+1}",
#                                      index = index_col_name,
#                                      copy_variables = cols_to_include
#                                     )
#             print(es)

#             features, info = ft.dfs(entityset = es,
#                                     agg_primitives= FeatureSynth.non_temporal_agg_prims,
#                                     target_entity = 'base_features',
#                                     max_depth=max_depth,
#                                     max_features = feature_limit
#                                    )
#             return pd.DataFrame(features)

In [328]:
# feature_synth = FeatureSynth()

In [330]:
# image16x16 = feature_synth.synthesize_features(X, 'the_indices', feature_limit=221, max_depth=4)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Entityset: features
  Entities:
    base_features [Rows: 178, Columns: 14]
    child0 [Rows: 178, Columns: 14]
    child1 [Rows: 178, Columns: 14]
    child2 [Rows: 178, Columns: 14]
    child3 [Rows: 178, Columns: 14]
  Relationships:
    base_features.the_indices -> child0.the_indices
    child0.the_indices -> child1.the_indices
    child1.the_indices -> child2.the_indices
    child2.the_indices -> child3.the_indices


In [333]:
# image16x16.head(1)

the_indices,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,...,child0.child1.SKEW(child0.alcalinity_of_ash),child0.child1.SKEW(child0.magnesium),child0.child1.SKEW(child0.total_phenols),child0.child1.SKEW(child0.flavanoids),child0.child1.SKEW(child0.nonflavanoid_phenols),child0.child1.SKEW(child0.proanthocyanins),child0.child1.SKEW(child0.color_intensity),child0.child1.SKEW(child0.hue),child0.child1.SKEW(child0.od280/od315_of_diluted_wines),child0.child1.SKEW(child0.proline)
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,...,,,,,,,,,,


In [334]:
# pd.Series(image16x16.columns)

0                                                alcohol
1                                             malic_acid
2                                                    ash
3                                      alcalinity_of_ash
4                                              magnesium
5                                          total_phenols
6                                             flavanoids
7                                   nonflavanoid_phenols
8                                        proanthocyanins
9                                        color_intensity
10                                                   hue
11                          od280/od315_of_diluted_wines
12                                               proline
13                                        child0.alcohol
14                                     child0.malic_acid
15                                            child0.ash
16                              child0.alcalinity_of_ash
17                             

In [288]:
# image16x16.columns

Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
       'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity',
       ...
       'SKEW(child.alcalinity_of_ash)', 'SKEW(child.magnesium)',
       'SKEW(child.total_phenols)', 'SKEW(child.flavanoids)',
       'SKEW(child.nonflavanoid_phenols)', 'SKEW(child.proanthocyanins)',
       'SKEW(child.color_intensity)', 'SKEW(child.hue)',
       'SKEW(child.od280/od315_of_diluted_wines)', 'SKEW(child.proline)'],
      dtype='object', length=104)

In [332]:
# pd.Series(X.columns)

0                          alcohol
1                       malic_acid
2                              ash
3                alcalinity_of_ash
4                        magnesium
5                    total_phenols
6                       flavanoids
7             nonflavanoid_phenols
8                  proanthocyanins
9                  color_intensity
10                             hue
11    od280/od315_of_diluted_wines
12                         proline
13                     the_indices
dtype: object