# Feature Engineering
Feature engineering is the key to a good model.

  
References:  
https://www.kaggle.com/mjbahmani/statistical-analysis-for-elo  
https://www.kaggle.com/c/elo-merchant-category-recommendation/discussion/82055  
https://www.kaggle.com/chauhuynh/my-first-kernel-3-699  
https://www.kaggle.com/fabiendaniel/elo-world  
https://www.kaggle.com/raddar/target-true-meaning-revealed  
https://www.kaggle.com/c/elo-merchant-category-recommendation/discussion/82036#479038  
https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/  
https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering (this one is really helpful)

In [22]:
import pandas as pd

import os
print(os.listdir("../data"))

import numpy as np

import datetime
import gc
import matplotlib.pyplot as plt
import seaborn as sns
# import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
np.random.seed(4590)

['Data_Dictionary.xlsx', 'new_merchant_transactions.csv', 'test.csv', 'merchants.csv', 'historical_transactions.csv', 'train.csv', 'load_data.py', 'sample_submission.csv']


### Load Data

In [2]:
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
merchants = pd.read_csv('../data/merchants.csv')
historical_transactions = pd.read_csv('../data/historical_transactions.csv')
new_transactions = pd.read_csv('../data/new_merchant_transactions.csv')
sample_submission = pd.read_csv('../data/sample_submission.csv')

### Clean Data

In [3]:
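# drop rows with missing values; be careful, this discards training samples
train = train.dropna()

# fill missing values with fixed defaults; this may bias the model if many values are missing
for df in [historical_transactions, new_transactions]:
    # assign back rather than fillna(inplace=True) on a selected column,
    # which may not update the frame in recent pandas versions
    df['category_3'] = df['category_3'].fillna('A')
    df['category_2'] = df['category_2'].fillna(1.0)
    df['merchant_id'] = df['merchant_id'].fillna('M_ID_00a6ca8a8a')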
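
Whether a constant fill hurts the model depends largely on how much data is missing. The following is a minimal sketch on a small made-up frame (the values and the `_missing` column suffix are illustrative assumptions, not part of the competition data). It reports the missing fraction per column and keeps an indicator of which rows were filled:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a transactions table; values are made up for illustration.
toy = pd.DataFrame({
    'category_3': ['A', None, 'B', 'A'],
    'category_2': [1.0, np.nan, 3.0, np.nan],
})

# Fraction of missing values per column, checked before filling anything.
missing_fraction = toy.isna().mean()
print(missing_fraction)

for col, fill_value in [('category_3', 'A'), ('category_2', 1.0)]:
    # Remember which rows were imputed so a model can still see that signal.
    toy[col + '_missing'] = toy[col].isna().astype(int)
    # Assign the filled column back to the frame.
    toy[col] = toy[col].fillna(fill_value)
```

If a column turns out to be mostly missing, the indicator column may carry more information than the filled values themselves.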

### Feature Extraction

Manual feature engineering can be a tedious process (which is why we use automated feature engineering with featuretools!) and often relies on domain expertise. Since I have limited domain knowledge of loans and what makes a person likely to default, I will instead concentrate on getting as much information as possible into the final training dataframe. The idea is that the model will then pick up on which features are important, rather than us having to decide that. Basically, our approach is to make as many features as possible and then give them all to the model to use! Later, we can perform feature reduction using the feature importances from the model or other techniques such as PCA.
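
One common way to "make as many features as possible" from a transactions table is to aggregate each numeric column per card with several statistics. The sketch below assumes that approach and uses a made-up toy frame; the choice of statistics and the flattened column names are illustrative, not a fixed recipe:

```python
import pandas as pd

# Toy transactions; card IDs and amounts are made up for illustration.
toy = pd.DataFrame({
    'card_id': ['C_1', 'C_1', 'C_2', 'C_2', 'C_2'],
    'purchase_amount': [-0.5, -0.7, -0.6, -0.2, -0.1],
    'installments': [1, 2, 1, 1, 3],
})

# Several statistics per numeric column, computed per card.
aggs = {
    'purchase_amount': ['sum', 'mean', 'min', 'max'],
    'installments': ['sum', 'max'],
}
card_features = toy.groupby('card_id').agg(aggs)

# Flatten the two-level column index into names like 'purchase_amount_mean'.
card_features.columns = ['_'.join(col) for col in card_features.columns]
card_features = card_features.reset_index()
print(card_features)
```

The result has one row per `card_id`, so it can be merged onto `train` and `test`, which are also keyed by `card_id`.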

In [13]:
# print all current features
print(list(train))
print(list(test))
print(list(historical_transactions))
print(list(new_transactions))

['first_active_month', 'card_id', 'feature_1', 'feature_2', 'feature_3', 'target']
['first_active_month', 'card_id', 'feature_1', 'feature_2', 'feature_3']
['authorized_flag', 'card_id', 'city_id', 'category_1', 'installments', 'category_3', 'merchant_category_id', 'merchant_id', 'month_lag', 'purchase_amount', 'purchase_date', 'category_2', 'state_id', 'subsector_id']
['authorized_flag', 'card_id', 'city_id', 'category_1', 'installments', 'category_3', 'merchant_category_id', 'merchant_id', 'month_lag', 'purchase_amount', 'purchase_date', 'category_2', 'state_id', 'subsector_id']


In [24]:
## main data preprocessing block

# 1. purchase_date is an important feature; extract year, month,
#    week of year, day of week, weekend flag, and hour from it
# 2. compute the time difference to today, in months

# 3. normalize binary Y/N columns to 1/0 integers


for df in [historical_transactions, new_transactions]:
    
    # date conversion
    df['purchase_date'] = pd.to_datetime(df['purchase_date'])
    df['year'] = df['purchase_date'].dt.year
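    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the ISO week number
    df['weekofyear'] = df['purchase_date'].dt.isocalendar().week.astype(int)
    df['month'] = df['purchase_date'].dt.month
    df['dayofweek'] = df['purchase_date'].dt.dayofweek
    df['weekend'] = (df['purchase_date'].dt.dayofweek >= 5).astype(int)  # dayofweek 0-4 is Mon-Fri, 5-6 is Sat/Sun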
    df['hour'] = df['purchase_date'].dt.hour
    
    ## time difference
    # https://www.kaggle.com/c/elo-merchant-category-recommendation/discussion/73244
    df['month_diff'] = ((datetime.datetime.today() - df['purchase_date']).dt.days)//30
    df['month_diff'] += df['month_lag']
    
    # normalization
    # TODO still not well done here
    df['authorized_flag'] = df['authorized_flag'].map({'Y': 1, 'N': 0})
    df['category_1'] = df['category_1'].map({'Y': 1, 'N': 0})
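
Because `month_diff` is computed from `datetime.datetime.today()`, its values change each time the notebook is run. A minimal sketch of a reproducible variant follows, assuming a fixed reference date instead (the date `2018-12-31` and the toy rows are arbitrary illustrations, not values taken from the competition):

```python
import pandas as pd

# Toy transactions; dates and lags are made up for illustration.
toy = pd.DataFrame({
    'purchase_date': pd.to_datetime(['2018-03-11 14:57:36', '2018-04-26 14:08:44']),
    'month_lag': [1, 2],
})

# Hypothetical fixed reference date, used in place of datetime.datetime.today().
reference_date = pd.Timestamp('2018-12-31')

# Whole 30-day periods elapsed since the purchase, shifted by month_lag.
toy['month_diff'] = (reference_date - toy['purchase_date']).dt.days // 30
toy['month_diff'] += toy['month_lag']
print(toy)
```

With a fixed date the feature is identical across runs, which makes cross-validation scores comparable between sessions.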
    


In [25]:
## feature aggregation
# df still refers to new_transactions, the last frame in the loop above
df.head()

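| | authorized_flag | card_id | city_id | category_1 | installments | category_3 | merchant_category_id | merchant_id | month_lag | purchase_amount | ... | category_2 | state_id | subsector_id | year | weekofyear | month | dayofweek | weekend | hour | month_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | C_ID_415bb3a509 | 107 | 0 | 1 | B | 307 | M_ID_b0c793002c | 1 | -0.557574 | ... | 1.0 | 9 | 19 | 2018 | 10 | 3 | 6 | 1 | 14 | 14 |
| 1 | 1 | C_ID_415bb3a509 | 140 | 0 | 1 | B | 307 | M_ID_88920c89e8 | 1 | -0.56958 | ... | 1.0 | 9 | 19 | 2018 | 12 | 3 | 0 | 0 | 18 | 13 |
| 2 | 1 | C_ID_415bb3a509 | 330 | 0 | 1 | B | 507 | M_ID_ad5237ef6b | 2 | -0.551037 | ... | 1.0 | 9 | 14 | 2018 | 17 | 4 | 3 | 0 | 14 | 13 |
| 3 | 1 | C_ID_415bb3a509 | -1 | 1 | 1 | B | 661 | M_ID_9e84cda3b1 | 1 | -0.671925 | ... | 1.0 | -1 | 8 | 2018 | 10 | 3 | 2 | 0 | 9 | 14 |
| 4 | 1 | C_ID_ef55cf8d4b | -1 | 1 | 1 | B | 166 | M_ID_3c86fa3831 | 1 | -0.659904 | ... | 1.0 | -1 | 29 | 2018 | 12 | 3 | 3 | 0 | 21 | 13 |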
