# Advanced Feature Engineering

#### GPU vs CPU Performance Comparison
This notebook applies advanced feature engineering techniques to credit default prediction and compares the performance of GPU-accelerated RAPIDS (cuDF) against CPU-based pandas. The workflow processes several data sources, including bureau data, credit card balances, and payment histories, to build an enriched feature set.
Each implementation follows the same steps:
- loading raw parquet files
- combining train/test data
- applying feature engineering functions
- handling data type conversions
- managing missing values
- saving the processed features

The timings recorded below show where GPU acceleration pays off in this pipeline: the main feature engineering cell takes 13.3 s of wall time with cuDF and 1 min 4 s with pandas, while reading the parquet files takes about the same time with both libraries.

We'll first run the workflow with the RAPIDS libraries, then with pandas.
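
Both runs call the same functions from `feature_engineering` and pass the dataframe module in as `xd`. The internals of those functions are not shown in this notebook, so the following is only a minimal sketch of the pattern, assuming a hypothetical function `add_credit_ratio` and toy column values. It runs on pandas here; passing `cudf` instead would be the GPU variant.

```python
import pandas as pd


def add_credit_ratio(df, xd):
    """Hypothetical feature function; `xd` is the dataframe module (pandas or cudf)."""
    out = df.copy()
    # Plain column arithmetic is written the same way for both libraries.
    out["CREDIT_INCOME_RATIO"] = out["AMT_CREDIT"] / out["AMT_INCOME_TOTAL"]
    # Module-level calls go through `xd`, so the caller picks the backend.
    flags = xd.get_dummies(out["CONTRACT"], prefix="CONTRACT")
    return xd.concat([out.drop(columns=["CONTRACT"]), flags], axis=1)


# Toy data, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "AMT_CREDIT": [200.0, 300.0],
        "AMT_INCOME_TOTAL": [100.0, 150.0],
        "CONTRACT": ["cash", "revolving"],
    }
)
result = add_credit_ratio(toy, pd)  # pass `cudf` instead of `pd` on a GPU machine
print(result)
```

Because the function body only touches `xd` and dataframe methods, the notebook can switch backends by changing a single import.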

## RAPIDS cuDF

In [1]:
import os
os.chdir('/home/cdsw')

In [2]:
from feature_engineering import (
    pos_cash, process_unified, process_bureau_and_balance, 
    process_previous_applications, installments_payments,
    credit_card_balance
    )
import cudf as xd
import gc
import rmm
import numpy as np
import pandas as pd

In [3]:
rmm.reinitialize(managed_memory=True) # enable CUDA managed (unified) memory; no pool is created unless pool_allocator=True

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
%%time
bureau_balance = xd.read_parquet('raw_data/bureau_balance.parquet')
bureau = xd.read_parquet('raw_data/bureau.parquet')
cc_balance = xd.read_parquet('raw_data/cc_balance.parquet')
payments = xd.read_parquet('raw_data/payments.parquet')
pc_balance = xd.read_parquet('raw_data/pc_balance.parquet')
prev = xd.read_parquet('raw_data/prev.parquet')
train = xd.read_parquet('raw_data/train.parquet')
test = xd.read_parquet('raw_data/test.parquet')

CPU times: user 1.86 s, sys: 426 ms, total: 2.28 s
Wall time: 2.33 s


In [6]:
train_target = train['TARGET']

train_index = train.index
test_index = test.index

unified = xd.concat([train.drop('TARGET', axis=1), test])

In [9]:
del(train)
del(test)
gc.collect()

471

The next cell performs feature engineering across all data sources. First, `process_unified()` creates new features from the main application data, including financial ratios, aggregated indicators, and encoded categorical variables. Next, `process_bureau_and_balance()` generates features from credit bureau history by aggregating information about active and closed credits. Similar aggregations are applied to previous applications, POS loans, installment payments, and credit card balances. Finally, all of these feature sets are merged on `SK_ID_CURR` into a single feature set for modeling.
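
The aggregate-then-merge idea can be illustrated on toy data. This is a sketch rather than the code inside `process_bureau_and_balance()`: the `AMT_DEBT` column, the `IS_ACTIVE` flag, and the output feature names are assumptions made for the example, and pandas is used so it runs without a GPU.

```python
import pandas as pd

# Toy stand-ins for the application table and one history table.
applications = pd.DataFrame(
    {"SK_ID_CURR": [1, 2, 3], "AMT_INCOME_TOTAL": [100.0, 150.0, 90.0]}
)
bureau_toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "CREDIT_ACTIVE": ["Active", "Closed", "Active"],
        "AMT_DEBT": [50.0, 0.0, 30.0],  # hypothetical column
    }
)

# Turn the status into a 0/1 flag so it can be summed per applicant.
bureau_toy["IS_ACTIVE"] = (bureau_toy["CREDIT_ACTIVE"] == "Active").astype("int64")

# Collapse the history table to one row per applicant.
bureau_agg_toy = (
    bureau_toy.groupby("SK_ID_CURR")
    .agg(
        BUREAU_DEBT_SUM=("AMT_DEBT", "sum"),
        BUREAU_ACTIVE_COUNT=("IS_ACTIVE", "sum"),
    )
    .reset_index()
)

# A left merge keeps every applicant; those without history get NaN.
features = applications.merge(bureau_agg_toy, how="left", on="SK_ID_CURR")
print(features)
```

Applicant 3 has no bureau rows, so its aggregated columns are missing after the merge. That is why the workflow fills missing values after merging.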

In [11]:
%%time

unified_feat = process_unified(unified, xd)

bureau_agg = process_bureau_and_balance(bureau, bureau_balance, xd)
del unified, bureau, bureau_balance

prev_agg = process_previous_applications(prev, xd)
pos_agg = pos_cash(pc_balance, xd)
ins_agg = installments_payments(payments, xd)
cc_agg = credit_card_balance(cc_balance, xd)

del prev, pc_balance, payments, cc_balance
gc.collect()

unified_feat = unified_feat.merge(bureau_agg, how='left', on='SK_ID_CURR') \
    .merge(prev_agg, how='left', on='SK_ID_CURR') \
    .merge(pos_agg, how='left', on='SK_ID_CURR') \
    .merge(ins_agg, how='left', on='SK_ID_CURR') \
    .merge(cc_agg, how='left', on='SK_ID_CURR')

del bureau_agg, prev_agg, pos_agg, ins_agg, cc_agg
gc.collect()

# we can't use bool column types in xgb later on
bool_columns = [col for col in unified_feat.columns if (unified_feat[col].dtype in ['bool']) ]    
unified_feat[bool_columns] = unified_feat[bool_columns].astype('int64')

# We will label encode for xgb later on
from sklearn.preprocessing import LabelEncoder
# label encode cats
label_encode_dict = {}

categorical = unified_feat.select_dtypes(include=pd.CategoricalDtype).columns 
for column in categorical:
    label_encode_dict[column] = LabelEncoder()
    unified_feat[column] =  label_encode_dict[column].fit_transform(unified_feat[column])
    unified_feat[column] = unified_feat[column].astype('int64')

### Fix for Int64D
Int64D = unified_feat.select_dtypes(include=[pd.Int64Dtype]).columns
unified_feat[Int64D] = unified_feat[Int64D].fillna(0)
unified_feat[Int64D] = unified_feat[Int64D].astype('int64')

### fix uint8
uint8 = unified_feat.select_dtypes(include=['uint8']).columns
unified_feat[uint8] = unified_feat[uint8].astype('int64')

#unified_feat.replace([np.inf, -np.inf], np.nan, inplace=True)
na_cols = unified_feat.isna().any()[unified_feat.isna().any()==True].index.to_arrow().to_pylist()
unified_feat[na_cols] = unified_feat[na_cols].fillna(0)

train_feats = unified_feat.loc[train_index].merge(train_target, how='left', 
                                               left_index=True, right_index=True)
test_feats = unified_feat.loc[test_index]



CPU times: user 11.3 s, sys: 1.59 s, total: 12.9 s
Wall time: 13.3 s
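
The dtype cleanup at the end of the cell above can be sketched on a toy frame. The example uses pandas so it runs without a GPU, the column names are made up, and the nullable-integer columns are found with an explicit `isinstance` check, which is an assumption of this sketch and not the selection used in the cell.

```python
import pandas as pd

# Toy frame with the dtypes the cleanup step targets.
toy = pd.DataFrame(
    {
        "FLAG": [True, False, True],
        "COUNT": pd.array([1, None, 3], dtype="Int64"),  # nullable integer
        "SMALL": pd.Series([0, 1, 1], dtype="uint8"),
        "RATIO": [0.5, None, 1.5],
    }
)

# bool -> int64
bool_cols = toy.select_dtypes(include=["bool"]).columns
toy[bool_cols] = toy[bool_cols].astype("int64")

# Nullable Int64 -> plain int64; missing values must be filled before the cast.
nullable_cols = [c for c in toy.columns if isinstance(toy[c].dtype, pd.Int64Dtype)]
toy[nullable_cols] = toy[nullable_cols].fillna(0).astype("int64")

# uint8 -> int64
uint8_cols = toy.select_dtypes(include=["uint8"]).columns
toy[uint8_cols] = toy[uint8_cols].astype("int64")

# Fill whatever missing values remain in the other columns.
na_cols = toy.columns[toy.isna().any()].tolist()
toy[na_cols] = toy[na_cols].fillna(0)

print(toy.dtypes)
```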


In [17]:
%%time
train_feats.to_parquet('data_eng/feats/train_feats.parquet')
test_feats.to_parquet('data_eng/feats/test_feats.parquet')

CPU times: user 2.73 s, sys: 437 ms, total: 3.16 s
Wall time: 5.7 s


In [None]:
del train_feats

## Pandas

In [18]:
from feature_engineering import (
    pos_cash, process_unified, process_bureau_and_balance, 
    process_previous_applications, installments_payments,
    credit_card_balance 
    )
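import pandas as xd
import pandas as pd  # the dtype checks below reference `pd` directly
import numpy as np   # needed for np.inf in the missing-value step
import gc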

In [19]:
%%time
bureau_balance = xd.read_parquet('raw_data/bureau_balance.parquet')
bureau = xd.read_parquet('raw_data/bureau.parquet')
cc_balance = xd.read_parquet('raw_data/cc_balance.parquet')
payments = xd.read_parquet('raw_data/payments.parquet')
pc_balance = xd.read_parquet('raw_data/pc_balance.parquet')
prev = xd.read_parquet('raw_data/prev.parquet')
train = xd.read_parquet('raw_data/train.parquet')
test = xd.read_parquet('raw_data/test.parquet')

train_index = train.index
test_index = test.index

train_target = train['TARGET']
unified = xd.concat([train.drop('TARGET', axis=1), test])

del(train)
del(test)
gc.collect()

CPU times: user 8.29 s, sys: 3.37 s, total: 11.7 s
Wall time: 2.52 s


0

In [20]:
# fix for the process functions not working with columns of type `category`
bureau_balance['STATUS'] = bureau_balance['STATUS'].astype('object') 
bureau['CREDIT_ACTIVE'] = bureau['CREDIT_ACTIVE'].astype('object')
bureau['CREDIT_CURRENCY'] = bureau['CREDIT_CURRENCY'].astype('object')

prev['NAME_CONTRACT_STATUS'] = prev['NAME_CONTRACT_STATUS'].astype('object')
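
The notebook does not say why the process functions fail on `category` columns. One possible cause, offered here only as an assumption, is that grouping on a categorical column can return a row for every declared category, including ones that never occur. The sketch below uses toy data to show that effect and a way to cast all categorical columns at once.

```python
import pandas as pd

# Toy frame; "1" is a declared category that never occurs in the data.
status = pd.DataFrame(
    {
        "SK_ID_BUREAU": [10, 10, 11],
        "STATUS": pd.Categorical(["0", "C", "0"], categories=["0", "1", "C"]),
    }
)

# With observed=False, the unused category "1" still gets a row (with size 0).
sizes_cat = status.groupby("STATUS", observed=False).size()

# Find every categorical column and cast it to object in one step.
cat_cols = status.select_dtypes(include=["category"]).columns
status_obj = status.astype({col: "object" for col in cat_cols})

# Grouping on the object column only returns values that are present.
sizes_obj = status_obj.groupby("STATUS").size()

print(len(sizes_cat), len(sizes_obj))
```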

In [21]:
%%time

unified_feat = process_unified(unified, xd)

bureau_agg = process_bureau_and_balance(bureau, bureau_balance, xd)

prev_agg = process_previous_applications(prev, xd)
pos_agg = pos_cash(pc_balance, xd)
ins_agg = installments_payments(payments, xd)
cc_agg = credit_card_balance(cc_balance, xd)

unified_feat = unified_feat.merge(bureau_agg, how='left', on='SK_ID_CURR') \
    .merge(prev_agg, how='left', on='SK_ID_CURR') \
    .merge(pos_agg, how='left', on='SK_ID_CURR') \
    .merge(ins_agg, how='left', on='SK_ID_CURR') \
    .merge(cc_agg, how='left', on='SK_ID_CURR')

# we can't use bool column types in xgb later on
bool_columns = [col for col in unified_feat.columns if (unified_feat[col].dtype in ['bool']) ]
unified_feat[bool_columns] = unified_feat[bool_columns].astype('int64')

# We will label encode for xgb later on
from sklearn.preprocessing import LabelEncoder
# label encode cats
label_encode_dict = {}

categorical = unified_feat.select_dtypes(include=pd.CategoricalDtype).columns 
for column in categorical:
    label_encode_dict[column] = LabelEncoder()
    unified_feat[column] =  label_encode_dict[column].fit_transform(unified_feat[column])
    unified_feat[column] = unified_feat[column].astype('int64')

### Fix for Int64D
Int64D = unified_feat.select_dtypes(include=[pd.Int64Dtype]).columns
unified_feat[Int64D] = unified_feat[Int64D].fillna(0)
unified_feat[Int64D] = unified_feat[Int64D].astype('int64')

### fix uint8
uint8 = unified_feat.select_dtypes(include=['uint8']).columns
unified_feat[uint8] = unified_feat[uint8].astype('int64')

# replace infinities first so the NaN column list also covers them
unified_feat.replace([np.inf, -np.inf], np.nan, inplace=True)
nan_columns = unified_feat.columns[unified_feat.isna().any()].tolist()
unified_feat[nan_columns] = unified_feat[nan_columns].fillna(0)

train_feats = unified_feat.loc[train_index].merge(train_target, how='left', 
                                               left_index=True, right_index=True)
test_feats = unified_feat.loc[test_index]



CPU times: user 55.4 s, sys: 10.8 s, total: 1min 6s
Wall time: 1min 4s
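
The order of the infinity replacement and the NaN-column lookup matters in the missing-value step. A minimal sketch on toy data, with made-up column names, shows that a column holding only `inf` is not reported by `isna()` until the infinities have been replaced:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"A": [1.0, np.nan], "B": [np.inf, 2.0]})

# Listing NaN columns before the replacement misses "B", which only holds inf.
cols_before = toy.columns[toy.isna().any()].tolist()

# Replace infinities with NaN, then list the columns again.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
cols_after = cleaned.columns[cleaned.isna().any()].tolist()

# Filling the columns found after the replacement leaves no NaN behind.
cleaned[cols_after] = cleaned[cols_after].fillna(0)

print(cols_before, cols_after)
```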


In [22]:
%%time
train_feats.to_parquet('data_eng/feats/train_feats.parquet')
test_feats.to_parquet('data_eng/feats/test_feats.parquet')

CPU times: user 7.43 s, sys: 517 ms, total: 7.94 s
Wall time: 8.76 s
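
To put the recorded wall times side by side, the short calculation below computes pandas time divided by cuDF time for each step. The numbers are copied from the `%%time` outputs above. Note that the pandas loading cell also includes the train/test concatenation, so that comparison is not exactly like for like.

```python
# Wall times in seconds, copied from the %%time outputs in this notebook.
wall_times = {
    "read parquet": {"cudf": 2.33, "pandas": 2.52},
    "feature engineering": {"cudf": 13.3, "pandas": 64.0},  # 1min 4s
    "write parquet": {"cudf": 5.7, "pandas": 8.76},
}

# Ratio above 1 means the cuDF run was faster for that step.
speedups = {
    step: round(times["pandas"] / times["cudf"], 2)
    for step, times in wall_times.items()
}

for step, ratio in speedups.items():
    print(f"{step}: {ratio}x")
```

On this single run the feature engineering step is roughly 4.8 times faster with cuDF, while reading and writing parquet differ much less.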
