In this notebook, I'm going to experiment with creating features from the numeric variables in our data set using a few techniques, and then select the important ones using permutation importance. The techniques I'm going to try out are:


- means for particular categories
- standard deviations for particular categories
- standardised deviations from the unconditional mean (and maybe the conditional mean for a particular category)
- frequency counts for different categories
- interaction terms via cross products

At each stage, I'll evaluate each feature by fitting a logistic regression on that feature alone and scoring its held-out predictions with roc_auc_score. An AUC of 0.5 corresponds to random guessing, so useful features should score above 0.5.

In [1]:
import pandas as pd
import numpy as np
import time
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [2]:
#Defining feature evaluation function
def feature_auc(X,y,test_size):
    
    #Concatenating to drop nas
    temp = pd.concat([y,X],axis=1)
    temp = temp.dropna()
    y=temp.iloc[:,0:1]
    X=temp.iloc[:,1:]
    
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=test_size, shuffle=False) 
    model = LogisticRegression(solver='lbfgs').fit(X_train,np.array(y_train).ravel())
    preds = model.predict_proba(X_test)
    score = roc_auc_score(y_test,preds[:,1])
    return(score)

In [3]:
#Creating function to deal with NAs by shuffling and forward filling.

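def ffill(df):
    
    t0 = time.time()
    
    #Note: this loop never terminates if a column is entirely NaN
    na_count = df.isna().sum().sum()
    while na_count>0:
        #Shuffle the rows so each gap is filled from a randomly chosen row
        df = df.sample(frac=1)
        df = df.ffill(limit=10)
        na_count = df.isna().sum().sum()

    #Restore the original row order
    df = df.sort_index()
    t1 = time.time()

    #Report the elapsed time in seconds
    print(t1-t0)
    return(df)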

In [4]:
train_transaction = pd.read_csv('Data/train_transaction.csv')

In [5]:
fraud = train_transaction['isFraud']
train_transaction.drop('isFraud',axis=1,inplace=True)
strings = train_transaction.select_dtypes(include='object')
numerics = train_transaction.select_dtypes(exclude='object')

del train_transaction

numerics = ffill(numerics)
strings = strings.fillna('NaN')

train_transaction = pd.concat([fraud,numerics,strings],axis=1)

del numerics, strings

In [6]:
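#Taking an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
use = train_transaction.iloc[:200000,:].copy()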

The first features we are going to create are frequency counts and average transaction amounts for each card type (the card4 column).

In [7]:
fraud = use['isFraud']
temp = use['card4'].value_counts().to_dict()
use['card4_counts'] = use['card4'].map(temp)

feature_auc(use['card4_counts'],fraud,0.5)


0.5115938349609495

In [8]:
card_means = use.groupby('card4')['TransactionAmt'].agg(['mean']).to_dict()
use['card4_mean_spend'] = use['card4'].map(card_means['mean'])
feature_auc(use['card4_mean_spend'],fraud,0.5)


0.5091185459658263

Now I'm going to create a variable which calculates the deviation from the mean for a particular transaction. The means will be conditional on the card type e.g. Visa, Mastercard etc.

In [9]:
use['card4_spend_dev'] = use['TransactionAmt'] - use['card4_mean_spend']
feature_auc(use['card4_spend_dev'],fraud,0.5)


0.5100160501571082

Let's see if standardisation makes a difference. I will divide each deviation by the standard deviation of the transaction amounts, taken over the transactions for that particular card type.
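To make the arithmetic concrete, here is a minimal sketch on made-up toy data. It uses `groupby(...).transform(...)` rather than the dictionary-and-`map` approach used in this notebook; the two should give the same per-row values, and the toy column contents are assumptions chosen so the numbers are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Toy data: two card types with very different spending scales (values are made up)
toy = pd.DataFrame({
    'card4': ['visa', 'visa', 'mastercard', 'mastercard'],
    'TransactionAmt': [10.0, 30.0, 100.0, 300.0],
})

grouped = toy.groupby('card4')['TransactionAmt']

# transform broadcasts each group's statistic back onto the rows of that group
group_mean = grouped.transform('mean')
group_std = grouped.transform('std')  # sample standard deviation (ddof=1), the pandas default

# Deviation from the conditional mean, then the standardised version
toy['spend_dev'] = toy['TransactionAmt'] - group_mean
toy['spend_dev_std'] = toy['spend_dev'] / group_std

print(toy)
```

After standardisation both card types end up on the same scale (about ±0.707 here), even though the raw deviations differ by a factor of ten.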

In [10]:
card_stds = use.groupby('card4')['TransactionAmt'].agg(['std']).to_dict()
use['card4_spend_dev_std'] = use['card4_spend_dev']/use['card4'].map(card_stds['std'])
feature_auc(use['card4_spend_dev_std'],fraud,0.5)


0.5098732026849845

Doesn't seem to make much of a difference. How about the standard deviations themselves?

In [11]:
use['card4_spend_std'] = use['card4'].map(card_stds['std'])
feature_auc(use['card4_spend_std'],fraud,0.5)


0.5128556847091534

This looks somewhat promising. Note that these are just point estimates, so we have no idea whether the differences from 0.5 are actually statistically significant; scores this close to 0.5 could arise by random chance. However, this is probably not the case, since each score was calculated on a large number of observations.
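One way to put a rough uncertainty band around such a point estimate is a percentile bootstrap over the held-out rows. The sketch below is not part of the original workflow: `bootstrap_auc_interval` is a hypothetical helper, and the data is synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_interval(y_true, y_score, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (hypothetical helper, not from the notebook)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUC is undefined when only one class was drawn
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Synthetic demonstration: a score that is only weakly related to a rare label
rng = np.random.default_rng(1)
y = (rng.random(5000) < 0.05).astype(int)
score = 0.2 * y + rng.normal(size=5000)

lower, upper = bootstrap_auc_interval(y, score)
print(f"AUC = {roc_auc_score(y, score):.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

If the interval comfortably excludes 0.5, the feature is unlikely to be pure noise on that split; if it straddles 0.5, the point estimate alone should not be trusted.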

So, we've got five techniques to aggregate continuous data across different categories and a way to evaluate their usefulness. One way forward would be to exhaustively evaluate all possible combinations, but with my limited computational budget, I don't think that's practical. While I'm not very good at this, I'll have to think about the problem in more depth to come up with potentially useful combinations.

In the meantime, let's create a function that allows us to evaluate categorical-numerical pairs using the five techniques above.

In [12]:
def feature_aggregation_eval(categorical, numerical, df):
    
    counts_temp = df[categorical].value_counts().to_dict()
    counts = df[categorical].map(counts_temp)
    counts_auc = feature_auc(counts,fraud,0.5)
    
    means_temp = df.groupby(categorical)[numerical].agg(['mean']).to_dict()
    means = df[categorical].map(means_temp['mean'])
    means_auc = feature_auc(means,fraud,0.5)
    
    stds_temp = df.groupby(categorical)[numerical].agg(['std']).to_dict()
    stds = df[categorical].map(stds_temp['std'])
    stds_auc = feature_auc(stds,fraud,0.5)
    
    devs = df[numerical] - means
    devs_auc = feature_auc(devs,fraud,0.5)
    
    std_devs = devs/stds
    std_devs_auc = feature_auc(std_devs,fraud,0.5)
    
    scores = {
        'feature_type':['counts', 'means', 'stds', 'devs', 'std_devs'],
        'auc':[counts_auc, means_auc, stds_auc, devs_auc, std_devs_auc]
    }
    
    scores = pd.DataFrame.from_dict(scores)
    return(scores)

In [13]:
card4_transamt = feature_aggregation_eval(categorical='card4', numerical='TransactionAmt', df=use)

Now we're going to ramp things up a bit so that the feature evaluation function will loop through all possible combinations of selected numeric and categorical variables. That way we can identify good features more exhaustively.

In [14]:
def feature_aggregation_eval_2(categorical_variables, numerical_variables, df):
    
    combination_scores = {}
    
    for numerical in numerical_variables:
        for categorical in categorical_variables:

            #Reusing the single-pair evaluation function defined above
            scores = feature_aggregation_eval(categorical, numerical, df)

            name = categorical + '.'  + numerical
            combination_scores[name] = scores
            
    return(combination_scores)

In [15]:
scores = feature_aggregation_eval_2(
    categorical_variables = ['ProductCD'],
    numerical_variables = ['TransactionDT'],
    df = use)

for key in scores.keys():
    print(key)
    print(scores[key])

ProductCD.TransactionDT
  feature_type       auc
0       counts  0.635902
1        means  0.638352
2         stds  0.637701
3         devs  0.593297
4     std_devs  0.384229
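The std_devs score above is below 0.5. An AUC below 0.5 does not mean the ranking carries no information; it means the scores order the classes the wrong way round, and negating them gives 1 minus the AUC. A minimal sketch on synthetic data (the data here is an assumption, unrelated to the fraud set):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)

# A score that tends to be LOWER for the positive class
score = rng.normal(size=1000) - 0.5 * y

auc = roc_auc_score(y, score)
auc_flipped = roc_auc_score(y, -score)  # same information, opposite direction

print(auc, auc_flipped)
```

In the single-feature setup used here, one possible reading of a sub-0.5 score is that the direction the logistic regression learned on the first half of the rows does not hold on the second half, which would make such a feature unstable over time rather than merely weak.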


In [16]:
def combinations_filter(dict_of_dfs, lower_bound):
    scores = dict_of_dfs
    combinations = []
    for key in scores.keys():

        names = key.split('.')
        categorical = names[0]
        numerical = names[1]

        for i in np.arange(0,5):
            if scores[key]['auc'].iloc[i] > lower_bound:
                method = scores[key]['feature_type'].iloc[i]
                combination = [categorical, numerical, method]
                combinations.append(combination)

    return(combinations)

Now that we have a good idea of what features might be worth creating, I'm going to create a function that appends these new features to our dataframe of explanatory variables. The function will take a list of combinations and a dataframe as input. Each combination contains the categorical variable to be aggregated across, the numerical variable to be aggregated, and the aggregation method, in that order.

I will do this across two functions: one that creates the feature itself, and an outer loop that appends the feature for each categorical-numerical-method combination to the full dataframe.
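A variation worth keeping in mind: the group statistics can be learned from training rows only and then mapped onto validation rows, so the validation features never see their own data. This is a sketch under that assumption, not what the functions in this notebook do; `fit_group_stats` and `apply_group_stats` are hypothetical names and the toy frames are made up.

```python
import pandas as pd

def fit_group_stats(train_df, categorical, numerical):
    """Learn per-category mean and std from training rows only (hypothetical helper)."""
    return train_df.groupby(categorical)[numerical].agg(['mean', 'std'])

def apply_group_stats(df, stats, categorical, numerical):
    """Map the learned statistics onto any dataframe; unseen categories become NaN."""
    means = df[categorical].map(stats['mean'])
    stds = df[categorical].map(stats['std'])
    return pd.DataFrame({
        'means': means,
        'stds': stds,
        'devs': df[numerical] - means,
        'std_devs': (df[numerical] - means) / stds,
    }, index=df.index)

# Toy frames: category 'H' appears only in the validation rows
train = pd.DataFrame({'ProductCD': ['W', 'W', 'C', 'C'], 'amt': [1.0, 3.0, 10.0, 30.0]})
val = pd.DataFrame({'ProductCD': ['W', 'C', 'H'], 'amt': [5.0, 20.0, 7.0]})

stats = fit_group_stats(train, 'ProductCD', 'amt')
val_features = apply_group_stats(val, stats, 'ProductCD', 'amt')
print(val_features)
```

Categories that never occur in the training rows come out as NaN, so they would still need a fill strategy before modelling.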

In [17]:
def feature_creation(categorical, numerical, method, df):
    
    #Creating some features by default because they will probably be needed anyway
    means_temp = df.groupby(categorical)[numerical].agg(['mean']).to_dict()
    means = df[categorical].map(means_temp['mean'])
    
    stds_temp = df.groupby(categorical)[numerical].agg(['std']).to_dict()
    stds = df[categorical].map(stds_temp['std'])
    
    
    if method == 'counts':
        counts_temp = df[categorical].value_counts().to_dict()
        counts = df[categorical].map(counts_temp)
        return(counts)
    
    if method == 'means':
        return(means)
    
    if method == 'stds':
        return(stds)
    
    if method == "devs":
        devs = df[numerical] - means
        return(devs)
    
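    if method == "std_devs":
        devs = df[numerical] - means
        std_devs = devs/stds
        return(std_devs)

    #Fail loudly on an unrecognised method instead of silently returning None
    raise ValueError('Unknown method: ' + str(method))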

In [18]:
def feature_aggregation_creation(combination_list, df):
    #Sharing df's index so that each new feature aligns row by row
    out_df = pd.DataFrame(index=df.index)
    
    for combination in combination_list:
        
        print(combination)
        feature = feature_creation(
            categorical = combination[0],
            numerical = combination[1],
            method = combination[2],
            df=df)
        
        name = combination[0] + '.' + combination[1] + '.' + combination[2]
        out_df[name] = feature
        
    return(out_df)

In [19]:
numerics_rankings = pd.read_csv('Data/numerics_rankings.csv')

In [20]:
strong_numeric_features = numerics_rankings.iloc[:20]['feature'].tolist()
strong_categorical_features = ['M5','P_emaildomain','M4','ProductCD','card6','M6','R_emaildomain']

In [21]:
scores = feature_aggregation_eval_2(
    categorical_variables = strong_categorical_features,
    numerical_variables = strong_numeric_features,
    df = use)

In [22]:
combinations = combinations_filter(scores,0.65)

In [23]:
new_features = feature_aggregation_creation(combinations,use)

['R_emaildomain', 'V108', 'means']
['R_emaildomain', 'V108', 'stds']
['ProductCD', 'V278', 'stds']
['M6', 'V278', 'std_devs']
['R_emaildomain', 'V278', 'means']
['R_emaildomain', 'V278', 'stds']
['R_emaildomain', 'V278', 'devs']
['M4', 'V188', 'means']
['M6', 'V188', 'means']
['M6', 'V188', 'stds']
['M6', 'V188', 'devs']
['M6', 'V188', 'std_devs']
['M4', 'V153', 'stds']
['M6', 'V153', 'means']
['M6', 'V153', 'stds']
['ProductCD', 'V63', 'means']
['ProductCD', 'V63', 'stds']
['R_emaildomain', 'V63', 'means']
['R_emaildomain', 'V63', 'stds']
['M4', 'V198', 'means']
['M4', 'V198', 'devs']
['M4', 'V198', 'std_devs']
['ProductCD', 'V198', 'stds']
['M6', 'V198', 'means']
['ProductCD', 'V129', 'devs']
['ProductCD', 'V129', 'std_devs']
['M6', 'V129', 'means']
['M6', 'V129', 'stds']
['M6', 'V129', 'devs']
['R_emaildomain', 'V129', 'means']
['R_emaildomain', 'V129', 'stds']
['R_emaildomain', 'V129', 'devs']
['R_emaildomain', 'V129', 'std_devs']
['M4', 'V276', 'devs']
['ProductCD', 'V276', 'stds']

In [24]:
best_numerical_feats = use[strong_numeric_features]
best_categorical_feats = use[strong_categorical_features]
best_categorical_feats = best_categorical_feats.apply(LabelEncoder().fit_transform)

In [25]:
hybrid = pd.concat([best_numerical_feats,best_categorical_feats,new_features],axis=1)

hybrid = ffill(hybrid)

hybrid_train = hybrid.iloc[:100000,:]
hybrid_val = hybrid.iloc[100000:200000,:]

fraud_train = fraud.iloc[:100000]
fraud_val = fraud.iloc[100000:200000]

Now we've added these new features and encoded the strings, let's give this new data a test run.

In [26]:
import xgboost as xgb
import os

os.environ['KMP_DUPLICATE_LIB_OK']='True'

model = xgb.XGBClassifier(
    learning_rate = 0.2,
    n_estimators = 100,
    max_depth = 10,
    objective = 'binary:logistic'
) 

model.fit(hybrid_train, fraud_train, 
          eval_metric = "auc", 
          eval_set= [(hybrid_val, fraud_val)],
          early_stopping_rounds = 10
         )

[0]	validation_0-auc:0.798216
Will train until validation_0-auc hasn't improved in 10 rounds.
[1]	validation_0-auc:0.812628
[2]	validation_0-auc:0.823784
[3]	validation_0-auc:0.82734
[4]	validation_0-auc:0.82713
[5]	validation_0-auc:0.828047
[6]	validation_0-auc:0.829488
[7]	validation_0-auc:0.833554
[8]	validation_0-auc:0.838711
[9]	validation_0-auc:0.839557
[10]	validation_0-auc:0.840092
[11]	validation_0-auc:0.840577
[12]	validation_0-auc:0.842423
[13]	validation_0-auc:0.842807
[14]	validation_0-auc:0.842579
[15]	validation_0-auc:0.842959
[16]	validation_0-auc:0.841455
[17]	validation_0-auc:0.84131
[18]	validation_0-auc:0.840495
[19]	validation_0-auc:0.83901
[20]	validation_0-auc:0.838915
[21]	validation_0-auc:0.839463
[22]	validation_0-auc:0.83877
[23]	validation_0-auc:0.837916
[24]	validation_0-auc:0.837335
[25]	validation_0-auc:0.836469
Stopping. Best iteration:
[15]	validation_0-auc:0.842959



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.2,
       max_delta_step=0, max_depth=10, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

We've fitted our model. Now let's check whether the new variables actually contribute anything of significance, as opposed to restating information that already exists in the data.
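For intuition about what permutation importance measures, here is a minimal sketch on synthetic data. It swaps in scikit-learn's `permutation_importance` rather than the eli5 wrapper used in this notebook, and the model and data are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)  # drives the label
noise = rng.normal(size=n)   # unrelated to the label
X = np.column_stack([signal, noise])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

# Fit on the first half, measure importance on the second half
X_train, X_val = X[:1000], X[1000:]
y_train, y_val = y[:1000], y[1000:]
clf = LogisticRegression().fit(X_train, y_train)

# Shuffle one column at a time and record how much the validation AUC drops
result = permutation_importance(
    clf, X_val, y_val, scoring='roc_auc', n_repeats=5, random_state=1
)
print(result.importances_mean)
print(result.importances_std)
```

A feature whose shuffling barely moves the score adds little beyond what the other columns already provide, which is the check being applied to the new aggregated features.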

In [30]:
import eli5
from eli5.sklearn import PermutationImportance

perm_hybrid = PermutationImportance(model, random_state=1).fit(hybrid_val.iloc[:100000], fraud_val.iloc[:100000])
perm_hybrid_df = eli5.explain_weights_df(perm_hybrid, feature_names = hybrid.columns.tolist())

In [31]:
perm_hybrid_df['standardised_weight'] = perm_hybrid_df['weight']/perm_hybrid_df['std']
perm_hybrid_df = perm_hybrid_df.sort_values('standardised_weight',ascending=False)
perm_hybrid_df

                     feature    weight       std  standardised_weight
26   ProductCD.V129.std_devs  0.000070  0.000000                  inf
3                         C4  0.002896  0.000043            67.077305
0                        C14  0.005882  0.000088            67.049029
2             M6.C1.std_devs  0.004056  0.000155            26.124825
4               M4.V198.devs  0.001400  0.000054            25.732512
1                         C1  0.004640  0.000192            24.174537
7                        V58  0.000662  0.000038            17.592452
20       ProductCD.V63.means  0.000112  0.000007            14.966630
9   R_emaildomain.V108.means  0.000500  0.000033            14.940358
5                  ProductCD  0.000838  0.000062            13.495081


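OK, it seems that quite a few of the new features do make a difference. Note that ProductCD.V129.std_devs ranks first only because its standard deviation is zero, which makes its standardised weight infinite, so its position should not be read as meaningful. Let's save the rankings as a CSV and train our full model with the significant ones included.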
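The `inf` in the table comes from dividing by a standard deviation of zero. One way to stop such rows from jumping to the top of the ranking is to treat a zero spread as missing before dividing. This is a sketch on a made-up table with the same column names, not a change to the results above.

```python
import numpy as np
import pandas as pd

# Made-up importance table with the same columns as the eli5 output
imp = pd.DataFrame({
    'feature': ['a', 'b', 'c'],
    'weight': [0.004, 0.00007, 0.001],
    'std': [0.0002, 0.0, 0.0001],
})

# Treat a zero spread as "no estimate" instead of dividing by it
safe_std = imp['std'].replace(0, np.nan)
imp['standardised_weight'] = imp['weight'] / safe_std

# Rows without an estimate are placed last rather than first
imp = imp.sort_values('standardised_weight', ascending=False, na_position='last')
print(imp)
```

Whether such rows should be dropped, ranked by raw weight, or re-estimated with more permutation repeats is a judgment call; the point is only that they should not win by default.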

In [32]:
#to_csv returns None when given a path, so there is nothing to assign
perm_hybrid_df.to_csv('Data/new_rankings.csv')