# Explanation of Feature Generation

## Interface
The goal is to generate synthetic features using a class called FeatureSynth (implemented below as `SynthesizeFeatures`). The intended interface for this class is as follows:
* User initializes the class with their training data.
     * User specifies which columns of the training data are categorical, DateTime, numerical, etc.
     * /*User specifies how they want the cluster analysis to be performed. This may be by directly passing in a clustering function to fit, or it could be by passing an indicator variable that will indicate which clustering function to use.*/
     * User specifies picture size: the number of pixels and the number of channels. Only square images are supported, so the program must verify that the requested image is square. ```total_number_of_features = number_of_pixels * number_of_channels```, so ```num_synth_feats = total_number_of_features - number_of_input_features```.
* User calls function ```synthesize_features()```
    * This function call will first look at how many synthetic features it must create, ```num_synth_feats```. From here, it will execute the following algorithm to generate deep features:
```
    function synthesize_features():
        numerical_summary_stats = ('mean', 'std', 'skew', 'median')
        mono_feature_operations = ('log', 'exp', 'sin', 'relu', 'sigmoid')
        dual_feature_operations = ('add', 'multiply', 'subtract')
        boolean_operations = ('AND', 'OR', 'XOR', 'NAND')
        level = 0
        while total number of features < desired number of features:
            for each categorical variable/feature in this level:
                make groups of examples that have the same value for that feature
                for each numerical variable in this level:
                    for each summary stat in numerical_summary_stats:
                        if current number features == desired number features:
                            break
                        else:
                            create a feature of the summary stat for the numerical variable
                            log the feature name in this level + 1, as a numerical variable
                for each other categorical variable in this level:
                    if current number features == desired number features:
                        break
                    else:
                        create a feature for the mode of the other categorical variable
                        log the feature name in this level + 1, as a categorical variable

            for each numerical variable/feature in this level:
                for each single feature operation in mono_feature_operations:
                    if current number features == desired number features:
                        break
                    else:
                        create a feature from the single feature operation
                        log the feature name in this level + 1, as a numerical variable
                for each other numerical variable/feature in this level:
                    for each two feature operation in dual_feature_operations:
                        if current number features == desired number features:
                            break
                        else if operation has not already been performed:    
                            make feature from two feature operation on numerical variable
                            log it as level + 1 numerical feature

            for each boolean variable/feature:
                for each other boolean variable/feature:
                    for boolean operation in boolean_operations:
                        if current number features == desired number features or operation has been performed already:
                            break
                        else:
                            make feature from boolean operation on two boolean features
                            log it as a level + 1 boolean feature

            level += 1
```
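
The grouped summary-stat step of the pseudocode above can be sketched with pandas. This is a minimal illustration, not the notebook's implementation: the column names `color` and `price` are made up, and only two of the four summary stats are shown. The feature-name pattern follows the one used later in `synthesize_features`.

```python
import pandas as pd

# Toy data; the column names are assumptions for illustration only.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "blue"],
    "price": [1.0, 3.0, 10.0, 20.0, 30.0],
})

# transform() returns one value per row (the statistic of that row's group),
# so the result lines up with the original index and can be added as a column.
groups = df.groupby("color")["price"]
for stat in ("mean", "median"):
    df[f"group|color|num|price|op|{stat}|"] = groups.transform(stat)

print(df)
```

Statistics such as `std` can come out as NaN for very small groups, which is why each new feature is passed through a null check before it is kept.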
     
     

In [86]:
class FeatureTypes:
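    def __init__(self, categorical=None, numerical=None, date=None, boolean=None):
        ''' Sets specified instance variables. Each variable should be a list of strings.'''
        # None defaults avoid sharing one mutable list between instances;
        # list() also accepts other iterables, such as a pandas Index.
        self.Categorical = [] if categorical is None else list(categorical)
        self.Numerical = [] if numerical is None else list(numerical)
        self.Date = [] if date is None else list(date)
        self.Boolean = [] if boolean is None else list(boolean)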
    
    def get_num_features(self):
        ''' Finds how many features total are present. '''
        feature_type_lens = [len(features_of_this_type) for 
                             features_of_this_type in (self.Categorical, self.Numerical, 
                                                       self.Date, self.Boolean)]
        return sum(feature_type_lens)     

In [87]:
class SingleFeatureOperations:
    Relu = 'relu'
    Sigmoid = 'sigmoid'
    Square = 'square'
    Cube = 'cube'
    
class TwoFeatureOperations:
    Add = 'add'
    Subtract = 'subtract'
    Multiply = 'multiply'

class BooleanOperations:
    # No trailing commas: a trailing comma would turn each value into a 1-tuple.
    And = 'AND'
    Or = 'OR'
    Xor = 'XOR'
    Nand = 'NAND'

In [88]:
import pandas as pd
import numpy as np
import random as rd
import math

In [89]:
class IncorrectTypeException(Exception):
    def __init__(self, variable_name, expected_type, actual_type):
        message = f'Expected type of {variable_name} to be {expected_type}' \
                      f', but it was actually of type {actual_type}.'
        super(IncorrectTypeException, self).__init__(message)

In [125]:
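class FeatureHandlingMethods:
    '''Strategies for handling a feature that contains some null values.'''
    ImputeMedian = 'imputeMedian'
    ImputeMean = 'imputeMean'
    Zero = 'zero'
    Remove = 'remove'

class FeatureStates:
    '''Quality states of a feature, based on its proportion of null values.'''
    Good = 'noNullValues'
    Ok = 'someNullValues'
    Bad = 'tooManyNullValues'

class FeatureOperations:
    '''Checks features for null values and applies operations to them.'''
    def __init__(self, max_proportion_null, handling_method):
        # max_proportion_null: largest share of nulls (0 to 1) a feature may have
        # handling_method: one of the FeatureHandlingMethods class variables
        self.max_proportion_null = max_proportion_null
        self.handling_method = handling_method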

    def get_feature_state(self, feature):
        '''
        summary
            Classifies a feature by its proportion of null values.
        parameters
            feature: pandas.Series
        returns
            the state of the feature (one of the FeatureStates class variables)
        '''
        if not isinstance(feature, pd.Series):
            raise IncorrectTypeException('feature', pd.Series, type(feature))
        else:
            num_null = feature.isnull().sum()
            prop_null = num_null / feature.size
            if prop_null > self.max_proportion_null:
                return FeatureStates.Bad
            elif prop_null > 0:
                return FeatureStates.Ok
            else:
                return FeatureStates.Good
    
    def handle_feature(self, feature, verbose=True):
        '''
        summary
            This method checks feature to see if it complies with standards
            for features given by instance variables.
        parameters
            feature: pandas.Series
            verbose: bool; if True, print a message when a feature is rejected
        returns
            pandas.Series of the new feature, unless the feature is ruled as bad
            (or has some nulls and the handling method is Remove), in which case
            the value None will be returned.
        '''
        if not isinstance(feature, pd.Series):
            raise IncorrectTypeException('feature', pd.Series, type(feature))
        feature_state = self.get_feature_state(feature)
        if feature_state == FeatureStates.Good:
            return feature
        elif feature_state == FeatureStates.Ok:
            if self.handling_method == FeatureHandlingMethods.ImputeMedian:
                return feature.fillna(feature.median())
            elif self.handling_method == FeatureHandlingMethods.ImputeMean:
                return feature.fillna(feature.mean())
            elif self.handling_method == FeatureHandlingMethods.Zero:
                return feature.fillna(0)
            elif self.handling_method == FeatureHandlingMethods.Remove:
                return None
            else:
                raise Exception('Invalid handling method; check initialization of this instance'
                                + ' of FeatureOperations; it may help to use a FeatureHandlingMethods'
                                + ' class variable, to avoid typographic mistakes.')
        elif feature_state == FeatureStates.Bad:
            if verbose:
                print('from FeatureOperations.handle_features: A feature was labeled as "tooManyNullValues", or'
                     + 'FeatureStates.Bad. Returning None')
            return None
        else:
            raise Exception("Invalid Feature State")
    
    def apply_single_num_feat_operation(self, feature, operation_str):
        '''Applies one element-wise operation to a numerical pandas.Series.'''
        builtin_operation_strs = ('exp', 'log', 'sin', 'cos', 'tan', 
                                   'sinh', 'cosh', 'tanh')
        if operation_str in builtin_operation_strs:
            # Look up the NumPy function by name. Out-of-domain inputs (e.g. log
            # of a negative number) yield NaN or inf rather than raising.
            return getattr(np, operation_str)(feature)
        elif operation_str == SingleFeatureOperations.Relu:
            return feature.clip(lower=0)
        elif operation_str == SingleFeatureOperations.Sigmoid:
            # Vectorized, so large magnitudes overflow to inf instead of raising.
            return 1 / (1 + np.exp(-feature))
        elif operation_str == SingleFeatureOperations.Square:
            return feature**2
        elif operation_str == SingleFeatureOperations.Cube:
            return feature**3
        else:
            raise ValueError(f'Unknown single-feature operation: {operation_str}')
    
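    def apply_two_num_feat_operation(self, feat1, feat2, operation_str):
        '''Combines two numerical pandas.Series element-wise.'''
        if operation_str == TwoFeatureOperations.Add:
            return feat1 + feat2
        elif operation_str == TwoFeatureOperations.Subtract:
            return feat1 - feat2
        elif operation_str == TwoFeatureOperations.Multiply:
            return feat1 * feat2
        else:
            # Fail loudly instead of silently returning None.
            raise ValueError(f'Unknown two-feature operation: {operation_str}')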

    def apply_two_bool_feat_operation(self, feature1, feature2, operation_str):
        '''Combines two boolean pandas.Series element-wise.'''
        # Cast first so that 0/1 integer columns behave as booleans
        # (note: astype(bool) maps NaN to True).
        a, b = feature1.astype(bool), feature2.astype(bool)

        if operation_str == BooleanOperations.Xor:
            return a ^ b
        elif operation_str == BooleanOperations.And:
            return a & b
        elif operation_str == BooleanOperations.Or:
            return a | b
        elif operation_str == BooleanOperations.Nand:
            return ~(a & b)
        else:
            raise ValueError(f'Unknown boolean operation: {operation_str}')
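
The four boolean operations named in `BooleanOperations` can be checked against a truth table. The sketch below is written independently of the `FeatureOperations` class and uses pandas' element-wise operators; the series `a` and `b` are illustrative inputs, not columns from the notebook.

```python
import pandas as pd

# All four combinations of two boolean inputs.
a = pd.Series([False, False, True, True])
b = pd.Series([False, True, False, True])

truth_table = pd.DataFrame({
    "a": a,
    "b": b,
    "AND": a & b,
    "OR": a | b,
    "XOR": a ^ b,
    "NAND": ~(a & b),
})
print(truth_table)
```

For columns stored as 0/1 integers, cast with `astype(bool)` first: on integers `~` is a bitwise NOT, so `~1` is `-2` rather than `False`.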
    

In [120]:
f = FeatureOperations(.1, FeatureHandlingMethods.ImputeMedian)
f.get_feature_state(example_df.example_numerical_col_1)

'noNullValues'

In [121]:
s = pd.Series({1:1, 2:4, 22: 2, 38:np.nan})  # np.nan: the np.NaN alias was removed in NumPy 2.0
s.fillna(s.mean())

1     1.000000
2     4.000000
22    2.000000
38    2.333333
dtype: float64
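
The same series can illustrate the three-state null policy. This is a minimal standalone sketch that mirrors the thresholds in `get_feature_state`; the function name `null_state` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def null_state(feature, max_proportion_null):
    """Classify a Series by its share of nulls (mirrors the three FeatureStates)."""
    prop_null = feature.isnull().mean()  # mean of a boolean mask = proportion of True
    if prop_null > max_proportion_null:
        return "tooManyNullValues"
    if prop_null > 0:
        return "someNullValues"
    return "noNullValues"

s = pd.Series({1: 1, 2: 4, 22: 2, 38: np.nan})  # 1 of 4 values is null (25%)

print(null_state(s, max_proportion_null=0.1))           # 25% exceeds the 10% limit
print(null_state(s, max_proportion_null=0.5))           # some nulls, within the limit
print(null_state(s.dropna(), max_proportion_null=0.1))  # no nulls left

# Median imputation, the notebook's default handling method.
imputed = s.fillna(s.median())
print(imputed)
```

With a limit of 0.1, as in the `FeatureOperations(.1, ...)` instance above, a series like this one would be rejected rather than imputed.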

In [122]:
class SynthesizeFeatures:
    _DEFAULT_FEATURE_HANDLING_METHOD = FeatureHandlingMethods.ImputeMedian
    _DEFAULT_MAX_PROPORTION_NULL_VALUES = 0.1
    
    def __init__(self, feature_names, total_num_features):
        ''' 
        Set instance variables.
        feature_names: FeatureTypes describing the columns of the input data.
        total_num_features: desired number of features after synthesis, e.g.
            16**2 * 3 = 768 for a 16x16 image with three channels (RGB).
        '''
        # TODO: build in type checking; features_names must be of type FeatureTypes
        self._level_to_features = dict()
        self._level_to_features[0] = feature_names # level 0 features
        self._total_num_features = total_num_features
        print(f'Aiming to have a total of {self._total_num_features} features.')
        
    
    def synthesize_features(self, df_inp, max_prop_null=_DEFAULT_MAX_PROPORTION_NULL_VALUES, 
                            null_handling_method=_DEFAULT_FEATURE_HANDLING_METHOD):
        df = df_inp.copy()
        def completed(num_feats_to_create):
            return num_feats_to_create <= 0
        
        def _num_features_helper(old_df, new_features_inp):
            new_features_dfs = [new_features_inp[key] for key in new_features_inp.keys()]
            new_feature_lens = [len(nf.columns) for nf in new_features_dfs]
            return len(old_df.columns) + sum(new_feature_lens)

        def concat_new_features(old_df, new_features_inp):
            new_features_dfs = [new_features_inp[key] for key in new_features_inp.keys()]
            new_feature_lens = [len(nf.columns) for nf in new_features_dfs]
            print(f'Going to have {len(old_df.columns) + sum(new_feature_lens)} features after concat.')
            new_features_dfs.append(old_df)
            return pd.concat( new_features_dfs, axis=1)
        
        numerical_summary_stats = ('mean', 'std', 'skew', 'median')
        single_feature_operations = ('log', 'exp', 'sin', 'relu', 'sigmoid', 'square', 'cube')
        two_numerical_feature_operations = ('add', 'multiply', 'subtract')
        boolean_operations = ('AND', 'OR', 'XOR', 'NAND')

        feat_ops = FeatureOperations(max_prop_null, null_handling_method)
        
        series_type = pd.Series # used a number of times for comparison; avoids building an empty Series

        num_feats_to_create = self._total_num_features - len(df_inp.columns)
        print(f'Going to create {num_feats_to_create} synthetic features')
        level = 0
        while True:
            feature_names = self._level_to_features[level]
            new_features = {
                'num' : pd.DataFrame(index=df.index),
                'cat' : pd.DataFrame(index=df.index),
                'bool' : pd.DataFrame(index=df.index),
                'date' : pd.DataFrame(index=df.index)
            }
            
            # Populate with synthetic features related to groups and categorical variables
            for cat_feat_name in  feature_names.Categorical:
                category_groups = df.groupby(cat_feat_name)
                for num_feat_name in feature_names.Numerical:
                    for summary_stat in numerical_summary_stats:
                        if completed(num_feats_to_create): 
                            return concat_new_features(df, new_features)
                        new_feature = category_groups[num_feat_name].transform(summary_stat)
                        handled_new_feature = feat_ops.handle_feature(new_feature)
                        if type(handled_new_feature) == series_type:
                            new_feature_name = f'group|{cat_feat_name}|num|{num_feat_name}|op|{summary_stat}|'
                            new_features['num'][new_feature_name] =  handled_new_feature
                            num_feats_to_create -= 1

                # create mode of other categorical variables, while in this category's group
                # this is currently untested, and needs to be tested before uncommenting
                for other_cat_feat_name in feature_names.Categorical:
                    if cat_feat_name == other_cat_feat_name: continue
                    #if completed(num_feats_to_create): 
                     #   return concat_new_features(df, new_features)
                    #new_feature_name = f'group|{cat_feat_name}|cat|{other_cat_feat_name}|op|mode|'
                    #new_feature = category_groups[other_cat_feat_name].agg( lambda x: pd.Series.mode(x)[0] )
                    #new_features['cat'][new_feature_name] = new_feature
                    #num_feats_to_create -= 1

            # Populate with synthetic features related to numerical transformations
            for i, num_feat_name in enumerate(feature_names.Numerical):
                for single_feat_op in single_feature_operations:
                    if completed(num_feats_to_create): 
                        return concat_new_features(df, new_features) 
                    new_feature = feat_ops.apply_single_num_feat_operation(df[num_feat_name], single_feat_op)
                    handled_new_feature = feat_ops.handle_feature(new_feature)
                    if type(handled_new_feature) == series_type:
                        new_feature_name = f'num|{num_feat_name}|op|{single_feat_op}|'
                        new_features['num'][new_feature_name] = handled_new_feature
                        num_feats_to_create -= 1
                    
                other_num_feature_names = feature_names.Numerical[i+1:] # avoids pairs of features from getting called twice; also avoids pairing with self
                for other_num_feat_name in other_num_feature_names:
                    for two_feature_operation in two_numerical_feature_operations:
                        if completed(num_feats_to_create): 
                            return concat_new_features(df, new_features)
                        new_feature = feat_ops.apply_two_num_feat_operation(df[num_feat_name], df[other_num_feat_name], two_feature_operation)
                        handled_new_feature = feat_ops.handle_feature(new_feature)
                        if type(handled_new_feature) == series_type:
                            new_feature_name = f'num1|{num_feat_name}|num2|{other_num_feat_name}|op|{two_feature_operation}|'
                            new_features['num'][new_feature_name] = handled_new_feature
                            num_feats_to_create -= 1
            
            # Populate with synthetic features related to boolean transformations
            for i, bool_feat_name in enumerate(feature_names.Boolean):
                other_bool_feat_names = feature_names.Boolean[i+1:]
                for other_bool_feat_name in other_bool_feat_names:
                    for bool_op in boolean_operations:
                        if completed(num_feats_to_create):
                            return concat_new_features(df, new_features)
                        new_feature = feat_ops.apply_two_bool_feat_operation(df[bool_feat_name], df[other_bool_feat_name], bool_op)
                        handled_new_feature = feat_ops.handle_feature(new_feature)
                        if type(handled_new_feature) == series_type:
                            new_feature_name = f'bool_feat1|{bool_feat_name}|bool_feat2|{other_bool_feat_name}|op|{bool_op}|'
                            new_features['bool'][new_feature_name] = handled_new_feature
                            num_feats_to_create -= 1

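            # Update df and _level_to_features before looping
            num_new_features = sum(len(nf.columns) for nf in new_features.values())
            print(f'DF num features: {len(df.columns)}')
            df = concat_new_features(df, new_features)
            print(f'DF num features: {len(df.columns)}')
            if num_new_features == 0:
                # Nothing could be created at this level, so every later level
                # would be empty too; stop here instead of looping forever.
                return df
            self._level_to_features[level+1] = FeatureTypes( 
                numerical = list(new_features['num'].columns),
                categorical = list(new_features['cat'].columns),
                boolean = list(new_features['bool'].columns),
                date = list(new_features['date'].columns))
            level += 1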

        
        
        
   

## An example
The following example generates synthetic features from an input feature dataset. Notice that there is no mention of labels. This is because LABELS SHOULD NOT BE PRESENT AT ALL IN THE FEATURE SYNTHESIS PROCESS: any feature derived from a label would leak the target into the model's inputs.
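
One way to keep labels out is to split them off before any synthesis happens. The snippet below is a sketch under assumptions: the table `raw` and the column name `label` are hypothetical and do not appear in the example that follows.

```python
import pandas as pd

# Hypothetical raw table; "label" is an assumed name for the target column.
raw = pd.DataFrame({
    "x1": [0.5, 1.5, 2.5],
    "x2": [10.0, 20.0, 30.0],
    "label": [0, 1, 0],
})

labels = raw["label"]                   # kept aside for model training only
features = raw.drop(columns=["label"])  # only this goes into feature synthesis

print(list(features.columns))
```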

In [126]:
n_rows = 100

categories_1 = ('a', 'b', 'c', 'd', 'e')
categories_2 = ('InstaMed', 'is', 'a', 'cool', 'company', 'check', 'it', 'out', 'sometime')
rand_cats_1 = [rd.choice(categories_1) for i in range(n_rows)]
rand_cats_2 = [rd.choice(categories_2) for i in range(n_rows)]

example_df = pd.DataFrame({
    'example_numerical_col_1': np.random.rand(n_rows)*50,
    'example_numerical_col_2': np.random.rand(n_rows)*20,
    'example_categorical_col_1': rand_cats_1,
    'example_categorical_col_2': rand_cats_2,
    'example_boolean_col_1': np.random.randint(low=0, high=2, size=n_rows),
    'example_boolean_col_2': np.random.randint(low=0, high=2, size=n_rows)
})
example_feature_names = FeatureTypes(
    categorical=['example_categorical_col_1', 'example_categorical_col_2'],
    numerical= ['example_numerical_col_1'],
    boolean = []
)

In [128]:
tnf = 16**2 * 3
sf = SynthesizeFeatures(example_feature_names, tnf)
new_feats = sf.synthesize_features(example_df)

Aiming to have a total of 768 features.
Going to create 762 synthetic features
DF num features: 6
Going to have 21 features after concat.
DF num features: 21
from FeatureOperations.handle_features: A feature was labeled as "tooManyNullValues", orFeatureStates.Bad. Returning None
from FeatureOperations.handle_features: A feature was labeled as "tooManyNullValues", orFeatureStates.Bad. Returning None
from FeatureOperations.handle_features: A feature was labeled as "tooManyNullValues", orFeatureStates.Bad. Returning None
DF num features: 21
Going to have 438 features after concat.
DF num features: 438
Going to have 768 features after concat.
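
The counts in this log can be reproduced by hand. The arithmetic below is a sketch that assumes the operation lists used in `synthesize_features` (4 summary stats, 7 single-feature operations, 3 two-feature operations) and assumes that each "tooManyNullValues" message corresponds to exactly one discarded candidate.

```python
from math import comb

total = 16**2 * 3      # 768 features for a 16x16 image with 3 channels
to_create = total - 6  # example_df has 6 columns, so 762 must be synthesized

n_stats, n_single_ops, n_pair_ops = 4, 7, 3

# Level 0: example_feature_names declares 2 categorical and 1 numerical feature.
n_cat, n_num = 2, 1
level0 = n_cat * n_num * n_stats + n_num * n_single_ops + comb(n_num, 2) * n_pair_ops
after_level0 = 6 + level0

# Level 1: the new level-0 features are all numerical; none are categorical.
n_num = level0
level1_candidates = n_num * n_single_ops + comb(n_num, 2) * n_pair_ops
level1 = level1_candidates - 3  # three candidates were rejected in the log
after_level1 = after_level0 + level1

print(total, to_create, after_level0, after_level1)
```

Under these assumptions the sketch gives 21 features after level 0 and 438 after level 1, matching the log; the third level stops as soon as the total reaches 768.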


In [118]:
sum(example_df.example_numerical_col_1)
example_df.example_numerical_col_1

0     32.817796
1     19.780416
2      2.430758
3     33.051457
4     19.729015
5     10.724396
6     21.695011
7      0.771400
8     40.054078
9     42.695037
10    35.527105
11    42.967288
12    33.306614
13    15.552081
14     4.732205
15    20.407633
16     7.121013
17    47.916773
18     7.420468
19    16.339080
20    22.277184
21     3.349100
22    26.825644
23    37.670575
24    14.679746
25    38.924181
26    23.985838
27     6.612167
28    12.657089
29    12.092826
        ...    
70    28.833634
71    41.199897
72    46.012133
73     1.958819
74    22.498034
75    29.603923
76    40.437914
77    25.895743
78    23.842856
79    45.546951
80    36.768485
81    35.248605
82    34.128038
83     6.095427
84    18.055611
85    32.991018
86    37.288151
87    21.697201
88    32.769313
89    31.489249
90    47.399106
91     5.177844
92    30.145177
93     0.432898
94    25.212211
95     5.536069
96     7.800222
97     8.331253
98    24.260259
99    44.163982
Name: example_numerical_

In [114]:
type(pd.Series())

pandas.core.series.Series