# Feature Transformation and Engineering

The EDA produced several actionable insights. In this notebook we use them to transform existing features and engineer new ones for further research. The three most important steps are:

1. Generating the job normalisation gazetteer

2. Computing customer spending behaviour on a rolling-window basis

3. Calculating the merchant risk factor on a rolling-window basis

## Table of Contents

1. [Feature Transformation and Engineering](#feature-transformation-and-engineering)
    1. [Feature Transformation](#feature-transformation)
        1. [Encode Gender](#encode-gender)
        2. [Encode Job](#encode-job)
        3. [Encode Merchant Category](#encode-merchant-category)
    2. [Feature Engineering](#feature-engineering)
        1. [Age Group](#age-group)
        2. [Customer Spending Behaviour](#customer-spending-behaviour)
        3. [City Size](#city-size)
        4. [Merchant Risk Factor](#merchant-risk-factor)
2. [Save Data](#save-data)
3. [Outlook](#outlook)

In [1]:
import warnings
from datetime import datetime
import os
import sys
import json
import pickle

import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from geopy.distance import geodesic
from matplotlib import pyplot as plt
from summarytools import dfSummary

# Add the src directory to the Python path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

from utils.plots import FraudMap
from features.feature_engineering import generic_customer_spending_behaviour, general_customer_bahaviour, get_merchant_risk_rolling_window
from features.feature_transformation import encode, categorize_jobs

warnings.filterwarnings("ignore")

%load_ext autoreload
%autoreload 2

In [2]:
print("Load transaction data")
%time df = pd.read_csv("../data/raw/tr_fincrime_train.csv")
print("{0} transaction data loaded, containing {1} fraudulent transactions".format(len(df),df['is_fraud'].sum()))

Load transaction data
CPU times: user 5.78 s, sys: 825 ms, total: 6.61 s
Wall time: 6.88 s
1296675 transaction data loaded, containing 7506 fraudulent transactions


## Feature Transformation

This section mainly covers the datetime transformation and the encoding of categorical features.

In [3]:
# Convert the 'trans_date_trans_time' column to datetime if not already done
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
# Extract the date part from the 'trans_date_trans_time' column
df['trans_date'] = df['trans_date_trans_time'].dt.date

### Encode Gender

In [4]:
# One-hot encode the gender column
df_enc_gender, _ = encode(df, "gender", "gender", encoding = "onehot")

In [5]:
df_enc_gender['gender_M'].value_counts()

0    709863
1    586812
Name: gender_M, dtype: int64

### Encode Job

The job column contains 494 distinct values, many of them ambiguous or near-duplicates of each other (e.g. "engineer, water" and "engineer, operation"). We start by normalising these jobs into predefined categories. With the help of Gemini 1.5 Pro, I divided the jobs into 11 categories:

1. Healthcare & Medical

2. Engineering & Technology

3. Finance, Banking & Insurance

4. Education & Research

5. Creative Arts, Design & Media

6. Legal & Public Sector

7. Business, Management & Consultancy

8. Science & Research

9. Logistics, Transport & Supply Chain

10. Construction & Property

11. Hospitality, Tourism & Leisure

In [6]:
# Load job normalisation gazetteer
with open('../jobs_by_category.json', 'r') as f:
    job_categories = json.load(f)
# normalize and categorize the job column 
df_cat_job = categorize_jobs(df_enc_gender, "job", job_categories)
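
As a rough illustration of the lookup step, here is a minimal sketch. It assumes the gazetteer maps each category to a list of job titles, which is only a guess based on the file name `jobs_by_category.json`. The helpers `build_job_lookup` and `categorize_job` are hypothetical and are not the project's `categorize_jobs`.

```python
def build_job_lookup(jobs_by_category):
    """Invert {category: [job titles]} into {normalised job title: category}."""
    lookup = {}
    for category, jobs in jobs_by_category.items():
        for job in jobs:
            lookup[job.strip().lower()] = category
    return lookup


def categorize_job(job, lookup, default=None):
    """Return the category of a raw job title, or `default` if it is not listed."""
    return lookup.get(job.strip().lower(), default)


# Toy gazetteer for illustration only (not the real file contents)
toy_gazetteer = {
    "Engineering & Technology": ["Engineer, water", "Engineer, operations"],
    "Healthcare & Medical": ["Nurse, adult"],
}
job_lookup = build_job_lookup(toy_gazetteer)
print(categorize_job("engineer, Water ", job_lookup))
```

Normalising case and whitespace before the lookup keeps trivially different spellings from ending up as unmatched jobs.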

In [7]:
# encode the job category
df_enc_job, job_encoder = encode(df_cat_job, "job_category", "job_encoded", encoding="ordinal")

In [8]:
mapping = {category: int(code) for category, code in zip(job_encoder.categories_[0], range(len(job_encoder.categories_[0])))}
print(mapping)

{'Business, Management & Consultancy': 0, 'Construction & Property': 1, 'Creative Arts, Design & Media': 2, 'Education & Research': 3, 'Engineering & Technology': 4, 'Finance, Banking & Insurance': 5, 'Healthcare & Medical': 6, 'Hospitality, Tourism & Leisure': 7, 'Legal & Public Sector': 8, 'Logistics, Transport & Supply Chain': 9, 'Science & Research': 10}


In [9]:
# Save the encoder
with open('../saved_model/encoders/job_encoder.pkl', 'wb') as file:
    pickle.dump(job_encoder, file)

You might wonder what happens if a new job such as "gardening engineer" shows up in the test data.

Job values in the test data are handled as follows: we first compare each job against the [job normalisation gazetteer](https://github.com/ichbinlan99/Fraud-Detection-TR/blob/eda/jobs_by_category.json) to normalise the jobs that are already known. For the jobs that do not match, we use fuzzy matching to map them to the closest existing job. If the similarity score is too low, we assign the job to a new category "Other" and let the ordinal encoder handle this added category.
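
The fallback logic for unseen jobs could look roughly like the sketch below. It uses `difflib.get_close_matches` from the standard library as a stand-in for whichever fuzzy matcher the project uses. The function name `match_unseen_job` and the cutoff of 0.8 are assumptions made for illustration.

```python
from difflib import get_close_matches


def match_unseen_job(job, known_jobs, cutoff=0.8):
    """Map a job title to a category via exact match, then fuzzy match, then 'Other'.

    known_jobs: dict of normalised job title -> category.
    cutoff: minimum similarity in [0, 1] (assumed value, to be tuned).
    """
    normalised = job.strip().lower()
    if normalised in known_jobs:
        return known_jobs[normalised]
    candidates = get_close_matches(normalised, list(known_jobs), n=1, cutoff=cutoff)
    if candidates:
        return known_jobs[candidates[0]]
    return "Other"


# Toy lookup for illustration only
known = {
    "engineer, water": "Engineering & Technology",
    "engineer, operations": "Engineering & Technology",
    "nurse, adult": "Healthcare & Medical",
}
print(match_unseen_job("Engineer, watr", known))      # small typo, fuzzy match
print(match_unseen_job("Gardening engineer", known))  # too dissimilar at this cutoff
```

With a strict cutoff, a title like "gardening engineer" is not similar enough to any listed job and falls into "Other"; a lower cutoff would trade fewer "Other" assignments for more wrong matches.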


### Encode Merchant Category

Since the merchant category has only 14 distinct values in the training set, we handle it similarly, with an ordinal encoder.

In [10]:
df_enc_cat, merchant_cat_encoder = encode(df_enc_job, "category", "category_encoded", encoding="ordinal")

In [11]:
# Create the mapping
mapping = {category: int(code) for category, code in zip(merchant_cat_encoder.categories_[0], range(len(merchant_cat_encoder.categories_[0])))}
print(mapping)

{'entertainment': 0, 'food_dining': 1, 'gas_transport': 2, 'grocery_net': 3, 'grocery_pos': 4, 'health_fitness': 5, 'home': 6, 'kids_pets': 7, 'misc_net': 8, 'misc_pos': 9, 'personal_care': 10, 'shopping_net': 11, 'shopping_pos': 12, 'travel': 13}


In [12]:
# Save the encoder
with open('../saved_model/encoders/merchant_cat_encoder.pkl', 'wb') as file:
    pickle.dump(merchant_cat_encoder, file)

## Feature Engineering

### 1. Age Group 

We calculate the age of the credit card holder at the time of each transaction, since the transaction history spans 536 days, and then divide the card holders into age groups. The group boundaries and bin widths can be chosen in several ways: by looking at quantiles of the age distribution, heuristically from domain knowledge, or by looking at the fraud risk of each age range.

In [13]:
df_enc_cat["dob"] = pd.to_datetime(df_enc_cat["dob"], errors="coerce")
df_enc_cat["trans_date_trans_time"] = pd.to_datetime(
    df_enc_cat["trans_date_trans_time"], errors="coerce"
)
df_enc_cat["age"] = (df_enc_cat["trans_date_trans_time"] - df_enc_cat["dob"]).dt.days // 365
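
Dividing the day difference by 365 ignores leap days, so the result can be one year too high shortly before a birthday. If exact ages matter near the bin edges, a calendar-based calculation is an alternative. The helper `age_at` below is a hypothetical sketch, not part of the project code.

```python
from datetime import date


def age_at(dob, on_date):
    """Completed years between a date of birth and a reference date."""
    had_birthday = (on_date.month, on_date.day) >= (dob.month, dob.day)
    return on_date.year - dob.year - (0 if had_birthday else 1)


dob = date(2000, 1, 1)
trans_date = date(2019, 12, 28)
print((trans_date - dob).days // 365)  # day-count approximation
print(age_at(dob, trans_date))         # calendar-based age
```

For this pair of dates the day-count approximation already reports the 20th year, although the birthday is still four days away.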

The average retirement age in the US is around 63-65 (hence the 65+ group), and people leave university at around 25 on average. More than 82% of adults aged 25-54 had a credit card as of 2023. Financially, most people hit their career peak in their late 40s or 50s. Considering these factors together with the plot from the EDA, I decided to group the card holders into 4 age groups: Under 25, 25-45, 45-65, 65+.

<p align="center">
    <img src="../out/age.png" width=500>
</p>

In [14]:
# Define age bins based on observed patterns
# A coarser alternative would be 0-25, 25-50, 50+
bins = [0, 25, 45, 65, 100]
labels = ["Under 25", "25-45", "45-65", "65+"]
true_labels = [1, 0, 2, 3]  # 0 has the lowest fraud rate (risk), 1 has the highest
df_enc_cat["age_group"] = pd.cut(df_enc_cat["age"], bins=bins, labels=true_labels)
df_enc_cat["age_group"] = df_enc_cat["age_group"].astype('int64')
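
Two details of `pd.cut` are easy to overlook here: by default the bins are closed on the right, so an age of exactly 25 lands in the "Under 25" group, and ages outside (0, 100] come back as missing values, which the `int64` cast cannot represent. The plain-Python sketch below mirrors that behaviour for a single age. The names `AGE_EDGES`, `RISK_CODES` and `age_group_code` are made up for this illustration.

```python
from bisect import bisect_left

AGE_EDGES = [0, 25, 45, 65, 100]  # same edges as the bins above
RISK_CODES = [1, 0, 2, 3]         # same codes as the labels passed to pd.cut


def age_group_code(age):
    """Return the code of the right-closed bin (lo, hi] containing `age`, or None."""
    if not (AGE_EDGES[0] < age <= AGE_EDGES[-1]):
        return None  # pd.cut would produce NaN here
    return RISK_CODES[bisect_left(AGE_EDGES, age) - 1]


for age in (18, 25, 26, 65, 66, 101):
    print(age, age_group_code(age))
```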

### 2. Customer Spending behaviour

Fraudsters typically try to make the most of a stolen card before the fraud is detected. Besides making several purchases within a short period, they also tend to go for high-value transactions, so an amount well above the card holder's historical average spending could be suspicious. On the other hand, individuals may exhibit their own patterned behaviours. While a deviation from these could be a good indicator that a certain transaction should be flagged as fraudulent, we also want to capture global patterns to combine it with. Thus, we quantify customer spending behaviour in the following 2 aspects:

1. Generic spending behaviour (indicating individual behaviours):

- `avg_amount_{window_length}_days`: Rolling average transaction amount for the past {window_length} days.
- `count_amount_{window_length}_days`: Number of transactions in the past {window_length} days.
- `amount_over_average_{window_length}_days`: Ratio of transaction amount to the rolling average amount.
- `inter_transaction_time_{window_length}_days`: Time (in seconds) since the last transaction.
- `card_count_sparsity`: Average transaction counts per card per day.
- `merchant_count_sparsity`: Average transaction counts per merchant per day.
- `card_amount_sparsity`: Average transaction amounts per card per day.
- `merchant_amount_sparsity`: Average transaction amounts per merchant per day.

Since the window size is a hyperparameter that can be tuned later, we generate these features for a list of window sizes: [1, 7, 30, 90, 180] days.
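
To make the rolling features concrete, here is a minimal pandas sketch of the first three of them. It is not the project's `generic_customer_spending_behaviour`. The function name `add_rolling_amount_features` is hypothetical, and the choice to include the current transaction in its own window is an assumption.

```python
import pandas as pd


def add_rolling_amount_features(df, window_lengths=(1, 7, 30)):
    """Add per-card rolling mean, count and amount/mean ratio for each window."""
    # Sort so that the grouped rolling output lines up row by row with `out`
    out = df.sort_values(["cc_num", "trans_date_trans_time"]).reset_index(drop=True)
    amounts = out.set_index("trans_date_trans_time").groupby("cc_num")["amt"]
    for days in window_lengths:
        rolling = amounts.rolling(f"{days}D")  # window (t - days, t], per card
        avg = rolling.mean().to_numpy()
        out[f"avg_amount_{days}_days"] = avg
        out[f"count_amount_{days}_days"] = rolling.count().to_numpy()
        out[f"amount_over_average_{days}_days"] = out["amt"].to_numpy() / avg
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "cc_num": [1, 1, 1, 2],
    "trans_date_trans_time": pd.to_datetime(
        ["2020-01-01 08:00", "2020-01-01 20:00", "2020-01-04 09:00", "2020-01-01 10:00"]
    ),
    "amt": [10.0, 30.0, 20.0, 5.0],
})
features = add_rolling_amount_features(toy, window_lengths=(1,))
print(features)
```

A ratio well above 1 marks a transaction that is large relative to the card's recent history, which is the signal described above.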

<p align="center">
    <img src="../out/customer_spending_behaviour.png" width=1000>
</p>

In addition, we consider geographical discrepancies, i.e. an unusually large distance between the cardholder's address and the merchant. When the cardholder's address and the merchant's geographical location are far apart, this could signal that the transaction is fraudulent. This is particularly relevant for Card Not Present (CNP) fraud, where fraudsters often use stolen card details to make purchases in distant locations or countries. Similarly, if the cardholder has typically made purchases in one region and suddenly makes a large purchase in a city far away, it could indicate that the card details have been stolen and used without authorization. The geodesic distances are calculated using the [GeoPy](https://geopy.readthedocs.io/en/stable/) library. We add

- `avg_distance_{window_length}_days`: Rolling average distance for the past {window_length} days.
- `distance_over_avg_{window_length}_days`: Ratio of transaction distance to the rolling average distance.

to the generic customer spending behaviour features.

<p align="center">
    <img src="../out/geodesic.png" width=180>
</p>

In [15]:
def calculate_distance(row):
    cust_location = (row["lat"], row["long"])
    merch_location = (row["merch_lat"], row["merch_long"])
    return geodesic(cust_location, merch_location).miles

In [16]:
df_enc_cat["distance"] = df_enc_cat.apply(calculate_distance, axis=1)
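
Applying `geodesic` row by row over roughly 1.3 million transactions is slow. If an approximation is acceptable, the haversine formula on a spherical Earth is a common substitute. It is a different technique from GeoPy's ellipsoidal geodesic and typically deviates from it by a fraction of a percent. The sketch below uses only the standard library; `haversine_miles` and the radius constant are illustrative choices.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, spherical approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# One degree of longitude along the equator
print(round(haversine_miles(0.0, 0.0, 0.0, 1.0), 2))
```

Because the formula only uses elementary functions, it can also be rewritten with NumPy to work on whole columns at once.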

In [17]:
df_dist = generic_customer_spending_behaviour(df_enc_cat, window_lengths=[1, 7, 30, 90, 180]) #define time windows in 1, 7, 30, 90, 180 days
df_dist = general_customer_bahaviour(df_dist)

In [18]:
fraud_transactions, non_fraud_transactions = df_dist[df_dist["is_fraud"] == 1], df_dist[df_dist["is_fraud"] == 0]
non_fraud_transactions['distance_over_avg_180_days'].mean(), fraud_transactions['distance_over_avg_180_days'].mean()

(1.0010983665420605, 1.0105628470572936)

2. General spending behaviour (indicating global patterns for the entire population):
- `transaction_hour`: Hour of the transaction.
- `transaction_day_of_week`: Day of the week of the transaction (0=Monday, 6=Sunday).
- `is_holiday`: 1 if the transaction occurred on a US holiday, 0 otherwise.
- `is_weekend`: 1 if the transaction occurred on a weekend, 0 otherwise.
- `trans_date`: Date of the transaction.
- `daily_trans_count`: Number of transactions for the given credit card on that day.
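
The calendar-based features in this list can be derived from the transaction timestamp alone. The following sketch shows the idea for a single timestamp. `calendar_features` is a hypothetical helper, and `is_holiday` is left out because it needs a holiday calendar that is not shown here.

```python
from datetime import datetime


def calendar_features(timestamp):
    """Derive hour, day of week and weekend flag from one transaction timestamp."""
    day_of_week = timestamp.weekday()  # 0=Monday, 6=Sunday
    return {
        "transaction_hour": timestamp.hour,
        "transaction_day_of_week": day_of_week,
        "is_weekend": int(day_of_week >= 5),
    }


print(calendar_features(datetime(2020, 6, 20, 23, 15)))  # a Saturday night
```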

Late-night or off-hours transactions: Transactions made outside typical shopping hours, such as late at night or during weekends or holidays, when fraud is less likely to be detected, may indicate that fraud is taking place, particularly if they are not in line with the cardholder's usual behaviour. In addition, multiple transactions in a short period (e.g. multiple small purchases in one hour) may also indicate a fraudster testing or running up charges on a stolen card.

From the EDA we noticed a large discrepancy between the transaction ratio and the fraud ratio at the hourly level, so we add a risk feature that indicates whether the transaction falls in a high-risk hour.

In [19]:
## add a column to indicate if the transaction is in a high risk hour
df_dist["is_high_risk_hour"] = df_dist["transaction_hour"].apply(
    lambda x: 2 if x in [22, 23] else (1 if x in [0, 1, 2, 3] else 0)
)

### 3. City Size

**Major Cities:** Often defined as cities with populations over 500,000 or even 1 million. Some set the bar higher and consider only the very largest metropolitan areas, such as New York, Los Angeles, and Chicago, as truly "big". Since the dataset doesn't contain much transaction history for cities with populations this large, I took the margin at 150k.

**Mid-Sized Cities**: This is where there's a lot of variation. A common range is between 100,000 and 500,000. Some definitions might go as low as 50,000 or as high as approaching 1 million.

In [20]:
# Define the bins and labels for city size
city_size_bins = [0, df_dist['city_pop'].quantile(0.85), df_dist['city_pop'].quantile(0.99), float('inf')]
city_size_labels = ['Small', 'Medium', 'Large']
city_size_labels_true = [0,1,2]

# Create the city_size column
df_dist['city_size'] = pd.cut(df_dist['city_pop'], bins=city_size_bins, labels=city_size_labels_true)
df_dist['city_size'] = df_dist['city_size'].astype('int64')

In [21]:
df_dist['city_pop'].quantile(0.85), df_dist['city_pop'].quantile(0.99)

(88735.0, 1577385.0)

### 4. Merchant Risk Factor

The main goal is to extract a risk score that assesses the exposure of a given merchant name to fraudulent transactions. The risk score is defined as the average fraud rate, i.e. the proportion of fraudulent transactions among the transactions that occurred on a merchant name over a time window.

The time windows do not directly precede a given transaction. Instead, they are shifted back by a delay period. The delay period accounts for the fact that, in practice, fraudulent transactions are only discovered after a fraud investigation or a customer complaint. Hence the fraud labels, which are needed to compute the risk score, only become available after this delay period. As a first approximation, the delay period is set to one week.

We compute the risk scores with the `get_merchant_risk_rolling_window` function. The function takes as inputs the DataFrame of transactions for a given merchant name, the delay period, and a list of window sizes. In the first stage, the numbers of transactions and fraudulent transactions are computed for the delay period. In the second stage, the same counts are computed for each window size plus the delay period. The counts for a given window size, shifted back by the delay period, are then obtained as the difference between the second-stage and the first-stage quantities. Finally, the risk score is the proportion of fraudulent transactions for each window size (or 0 if no transaction occurred in the given window). In addition to the risk score, the function also returns the number of transactions for each window size.

<p align="center">
    <img src="../out/merchant_risk_factor.png" width=1000>
</p>

The risk score $R$ is computed as:
$$
R = 
\begin{cases} 
\frac{\text{Number of fraudulent transactions in the window}}{\text{Total transactions in the window}}, & \text{if Total transactions > 0} \\
0, & \text{otherwise}
\end{cases}
$$

Where:

- **Window size**: The time period for which transactions are considered (e.g., 1 week, 1 month).
- **Delay period**: The time shift to account for the delay in fraud detection (e.g., 1 week).
- **Total transactions in the window**: The number of transactions during the delay period + window size, minus the transactions during the delay period.
- **Number of fraudulent transactions in the window**: The count of fraudulent transactions during the delay period + window size, minus those during the delay period.
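
The two-stage computation can be illustrated for a single merchant and a single point in time with plain Python. This is a sketch and not the project's `get_merchant_risk_rolling_window`. The function name `delayed_merchant_risk` and the interval conventions (left-open, right-closed) are assumptions.

```python
from datetime import datetime, timedelta


def delayed_merchant_risk(history, now, delay_days=7, window_days=7):
    """Return (risk score, transaction count) for the window shifted back by the delay.

    history: iterable of (timestamp, is_fraud) pairs for one merchant.
    """
    delay_start = now - timedelta(days=delay_days)
    window_start = delay_start - timedelta(days=window_days)

    delay_total = delay_fraud = 0  # stage 1: delay period (delay_start, now]
    full_total = full_fraud = 0    # stage 2: delay + window (window_start, now]
    for timestamp, is_fraud in history:
        if window_start < timestamp <= now:
            full_total += 1
            full_fraud += int(is_fraud)
            if timestamp > delay_start:
                delay_total += 1
                delay_fraud += int(is_fraud)

    # Difference of the two stages gives the counts for the shifted window
    window_total = full_total - delay_total
    window_fraud = full_fraud - delay_fraud
    risk = window_fraud / window_total if window_total > 0 else 0.0
    return risk, window_total


# Toy history for illustration only
toy_history = [
    (datetime(2020, 1, 7), 1),
    (datetime(2020, 1, 10), 0),
    (datetime(2020, 1, 12), 0),
    (datetime(2020, 1, 15), 1),  # inside the delay period, label not yet usable
    (datetime(2020, 1, 20), 0),
]
print(delayed_merchant_risk(toy_history, now=datetime(2020, 1, 20)))
```

In the toy history, the fraud on 15 January falls inside the delay period and is therefore ignored; only the three transactions between 6 and 13 January count, one of which is fraudulent.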

In [22]:
df_merch_risk = get_merchant_risk_rolling_window(transactions = df_dist, delay_period=7, window_size=[1, 7, 30, 90, 180])

In [23]:
df_merch_risk['merchant_risk_1_day_window'].mean(), df_merch_risk['merchant_risk_30_day_window'].mean(), df_merch_risk['merchant_risk_180_day_window'].mean()

(0.005483918793180499, 0.005806033249438283, 0.006133785097165681)

## Save Data

A quick overview of the correlations between the numerical variables:

<p align="center">
    <img src="../out/corr_map.png" width=1000>
</p>

In [24]:
df_merch_risk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 86 columns):
 #   Column                             Non-Null Count    Dtype         
---  ------                             --------------    -----         
 0   Unnamed: 0                         1296675 non-null  int64         
 1   trans_date_trans_time              1296675 non-null  datetime64[ns]
 2   cc_num                             1296675 non-null  int64         
 3   merchant                           1296675 non-null  object        
 4   category                           1296675 non-null  object        
 5   amt                                1296675 non-null  float64       
 6   first                              1296675 non-null  object        
 7   last                               1296675 non-null  object        
 8   street                             1296675 non-null  object        
 9   city                               1296675 non-null  object        
 10  state 

In [25]:
# Save the DataFrame to a CSV file
df_merch_risk.to_csv('../data/processed/train_data.csv', index=False, header=True)

## Outlook

- I later realized that an indicator of whether a credit card has been flagged before (within a time window) could be a very good feature.
- Ordinal encoding serves as a quick and easy encoding scheme for our categorical features; we could also try customized encoding schemes that incorporate domain knowledge.
- ...