## Credit Card Fraud Detection

In recent years, credit card fraud has caused significant financial losses worldwide, with an estimated $28.58 billion reported in 2020 (Nilson, 2021). Seeing the growing impact of fraudulent activity on both customers and businesses, I decided to work on this project to better understand how data can be used to detect and prevent such cases.

My project applies feature engineering techniques to raw credit card transaction data to uncover patterns that may indicate fraud. By creating new, more insightful variables from existing data, such as customer age, transaction distance, and domestic versus international activity, I aimed to make the dataset more informative and improve the ability of predictive models to flag suspicious transactions. This project reflects my interest in using data-driven approaches to solve real-world financial problems and enhance customer trust.

## Package Imports and Data Preparation
The following cells install and import the packages needed to load and manipulate the fraud detection data. They also load the example data and preview it.


In [7]:
%%capture
!pip install geopy
import numpy as np
import pandas as pd
from datetime import date
from geopy import distance

In [11]:
# Specify the transaction time column
trans_time = "trans_date_trans_time"

# Specify any additional date columns
date_cols = ["dob", trans_time]

fraud_df = pd.read_csv(
    r"C:\Users\mishi\Downloads\fraud_data.csv",
    parse_dates=date_cols,
    index_col=trans_time
).sort_index()

# Preview the data
fraud_df

trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,...,customer_country,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,merchant_country,is_fraud
2020-06-21 13:35:00,4.822370e+15,Conroy-Emard,food_dining,11.18,Christopher,Farrell,M,97070 Anderson Land,Haines City,FL,...,United States,33804,Exercise physiologist,1991-01-01,0183bebc44f97bd8ed49a27e7504ad0a,1371821701,28.944803,-81.819858,United States,0
2020-06-21 14:02:00,4.951650e+15,Kutch-Wilderman,home,18.76,Kimberly,Miller,F,75533 Tamara Valleys,Logan,IL,...,United States,324,"Scientist, research (physical sciences)",1976-06-15,e364cbb43064585db6fef8adad798ae3,1371823347,38.621529,-88.338930,United States,0
2020-06-21 14:12:00,6.011690e+15,"Heathcote, Yost and Kertzmann",shopping_net,5.48,Victoria,Fleming,F,2807 Parker Station Suite 080,Stanchfield,MN,...,United States,2607,"Lecturer, further education",1995-12-04,dccc50d067404415f5546ce81c6ee9e4,1371823965,45.339446,-92.880697,United States,0
2020-06-21 15:31:00,6.759900e+11,Hackett Group,travel,6.22,Amanda,Spencer,F,6682 Green Forks,Ogdensburg,NJ,...,United States,2456,Senior tax professional/tax inspector,1994-03-13,116301dce011c7a0190696720339620d,1371828712,41.041988,-73.652938,United States,0
2020-06-21 17:28:00,4.128730e+18,Abernathy and Sons,food_dining,36.04,Monique,Martin,F,68276 Matthew Springs,Ratcliff,TX,...,United States,43,"Engineer, production",1949-10-04,93079573b9ea6535a5361ad2af1237d8,1371835696,30.669891,-94.919879,United States,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-09-21 09:30:00,4.424340e+15,"Terry, Johns and Bins",misc_pos,37.83,Denise,Barnett,F,23220 Eaton Harbors,Kirby,OH,...,United States,118,Private music teacher,1957-11-12,54a277240ae5e40df89af08d55a77993,1379755858,41.499281,-83.222959,United States,0
2020-09-21 10:48:00,4.155020e+15,Huels-Nolan,gas_transport,70.23,Renee,Parrish,F,174 Jennifer Meadow Apt. 467,Mountain Park,OK,...,United States,540,Research scientist (life sciences),1983-10-12,286ef7d5ac19160c37d42b7d67b5ac53,1379760484,34.175115,-98.646250,United States,0
2020-09-21 12:03:00,4.822370e+15,Torphy-Kertzmann,health_fitness,11.83,Christopher,Farrell,M,97070 Anderson Land,Haines City,FL,...,United States,33804,Exercise physiologist,1991-01-01,7b6bac5fd68d47e7410f95901c599d00,1379765026,29.026640,-81.306019,United States,0
2020-09-21 12:04:00,2.131420e+14,"Crist, Jakubowski and Littel",home,101.39,Margaret,Curtis,F,742 Oneill Shore,Florence,MS,...,United States,19685,Fine artist,1984-12-24,9839f85921af06ce3521b18bd5c7b527,1379765078,32.170674,-90.335578,United States,0
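
The preview shows `cc_num` in scientific notation, which suggests pandas parsed it as `float64`. A 64-bit float represents integers exactly only up to 2^53 (about 9.0e15), so longer card numbers could in principle be rounded, and distinct cards could then be merged in later `groupby` steps. The sketch below uses a made-up two-row CSV (not the project data) to show one way to keep the identifier as text through the `dtype` argument of `pd.read_csv`. Whether this matters here depends on how the card numbers are stored in the source file, which is not shown.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the card numbers are invented for illustration
csv_text = "cc_num,amt\n4128730000000000123,36.04\n4128730000000000124,11.18\n"

# Casting to float64 mimics what a float column can hold: the two
# 19-digit identifiers collapse to the same floating-point value
as_float = pd.read_csv(io.StringIO(csv_text))["cc_num"].astype("float64")

# Reading the column as text keeps every digit
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"cc_num": "string"})["cc_num"]

print(as_float.nunique(), as_text.nunique())
```

If the identifiers do need this treatment, the same `dtype` argument could be added to the `pd.read_csv` call above.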


### Inspecting the features and data types
The first step is to inspect the available columns using the pandas method [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html).

In [13]:
# Print summary of the DataFrame
fraud_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1261 entries, 2020-06-21 13:35:00 to 2020-09-21 12:49:00
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cc_num            1261 non-null   float64       
 1   merchant          1261 non-null   object        
 2   category          1261 non-null   object        
 3   amt               1261 non-null   float64       
 4   first             1261 non-null   object        
 5   last              1261 non-null   object        
 6   gender            1261 non-null   object        
 7   street            1261 non-null   object        
 8   city              1261 non-null   object        
 9   state             1261 non-null   object        
 10  zip               1261 non-null   int64         
 11  lat               1261 non-null   float64       
 12  long              1261 non-null   float64       
 13  customer_country  1261 non-null   object  

## Customer and Transaction Details
### Extracting age from date of birth
As noted in Bahnsen et al., 2016 (referenced earlier), the customer's age is a common feature in raw credit card data. In the cell below, I add an age column based on the date of birth. Age is measured in whole years as of the day the notebook is run.


In [15]:
# Specify the customer date of birth column
dob = "dob"

# Define a function to extract age (in whole years, as of today) from date of birth
def age(date_of_birth):
    today = date.today()
    return (
        today.year
        - date_of_birth.year
        - ((today.month, today.day) < (date_of_birth.month, date_of_birth.day))
    )


# Create a new column
fraud_df["age"] = fraud_df[dob].apply(age)
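
Because the function above uses today's date, the `age` column changes depending on when the notebook is run. One alternative, sketched below on invented dates, is to measure age at the time of each transaction. The helper name `age_at_transaction` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def age_at_transaction(trans_times: pd.Series, dobs: pd.Series) -> pd.Series:
    """Age in whole years at each transaction time (vectorized)."""
    # True where the birthday has already occurred in the transaction's year
    had_birthday = (trans_times.dt.month > dobs.dt.month) | (
        (trans_times.dt.month == dobs.dt.month)
        & (trans_times.dt.day >= dobs.dt.day)
    )
    return trans_times.dt.year - dobs.dt.year - (~had_birthday).astype(int)


# Invented example: three customers transacting on the same day
trans_times = pd.Series(pd.to_datetime(["2020-06-21"] * 3))
dobs = pd.Series(pd.to_datetime(["1991-01-01", "1976-06-22", "1976-06-21"]))

ages = age_at_transaction(trans_times, dobs)
print(ages.tolist())
```

With the notebook's data, the inputs would be the transaction timestamps (for example `fraud_df.index.to_series()`) and `fraud_df[dob]`.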

### Distance from Merchant
The next feature I create is the distance between the customer and the merchant. The reasoning is that transactions at merchants far from the customer may be a sign of suspicious activity.

Since the data includes the latitude and longitude of both the customer and the merchant, this distance can be computed directly.

In [17]:
# Specify the customer and merchant coordinate columns
customer_lat, customer_long = "lat", "long"
merchant_lat, merchant_long = "merch_lat", "merch_long"

# Define a function to calculate the geodesic distance (in km) between two sets of coordinates
def calculate_distance(row):
    return distance.geodesic(
        (row[customer_lat], row[customer_long]),
        (row[merchant_lat], row[merchant_long]),
    ).km

# Create a new column for the distance
fraud_df["distance_from_merchant"] = fraud_df.apply(calculate_distance, axis=1)
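
Applying a geodesic calculation row by row is fine for about a thousand rows but can be slow on larger datasets. A common vectorized substitute is the haversine formula, sketched below with NumPy. It treats the Earth as a sphere, so its results differ slightly from geopy's ellipsoidal `geodesic` distance. The function name `haversine_km` and the radius constant are assumptions made for this illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between arrays of coordinates in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Invented coordinates: one degree of longitude along the equator, then identical points
d = haversine_km(
    np.array([0.0, 10.0]),
    np.array([0.0, 20.0]),
    np.array([0.0, 10.0]),
    np.array([1.0, 20.0]),
)
print(d)
```

The function accepts whole columns at once, so it could be called with `fraud_df[customer_lat]`, `fraud_df[customer_long]`, `fraud_df[merchant_lat]`, and `fraud_df[merchant_long]`.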

### Domestic Transactions
I can also check whether a transaction is domestic or not by comparing the country of the customer and the country of the merchant. 



In [19]:
# Specify the customer country and merchant country
customer_country, merchant_country = "customer_country", "merchant_country"

# Create new column indicating domestic transactions
fraud_df["is_domestic"] = fraud_df[customer_country] == fraud_df[merchant_country]


It can also be useful to flag transactions whose amount is a whole number, which may indicate suspicious activity.



In [21]:
# Specify the transaction amount column
amt = "amt"

# Create a column indicating whether the transaction amount is a whole number
fraud_df["is_whole_number"] = fraud_df[amt] == fraud_df[amt].astype(int)

### Time of Transaction
Although the data contains the time of the transaction, it may be useful to extract components of this for predictive purposes. For example, fraud may be more likely at different times of the day or year.

In [23]:
# Extract the hour, day, and month from the datetime index
fraud_df["hour"] = fraud_df.index.hour
fraud_df["day"] = fraud_df.index.day
fraud_df["month"] = fraud_df.index.month
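
Raw hour values place 23:00 and 00:00 far apart even though they are adjacent on the clock. Some models may benefit from a cyclical encoding that maps the hour onto a circle using sine and cosine. This encoding is not part of the original notebook; the sketch below shows the idea on a few invented hours.

```python
import numpy as np
import pandas as pd

# Invented hours for illustration
hours = pd.Series([0, 6, 12, 23])

# Map each hour to an angle on a 24-hour circle
angle = 2 * np.pi * hours / 24
hour_sin = np.sin(angle)
hour_cos = np.cos(angle)


def circle_distance(i, j):
    """Euclidean distance between two encoded hours (by position)."""
    return np.hypot(
        hour_sin.iloc[i] - hour_sin.iloc[j], hour_cos.iloc[i] - hour_cos.iloc[j]
    )


# 23:00 ends up much closer to 00:00 than 12:00 does
print(circle_distance(3, 0), circle_distance(2, 0))
```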

### Previewing new customer and transaction details features


In [25]:
fraud_df[
    [
        "age",
        "distance_from_merchant",
        "is_domestic",
        "is_whole_number",
        "hour",
        "day",
        "month",
    ]
]

trans_date_trans_time,age,distance_from_merchant,is_domestic,is_whole_number,hour,day,month
2020-06-21 13:35:00,35,98.839094,True,False,13,21,6
2020-06-21 14:02:00,49,87.331893,True,False,14,21,6
2020-06-21 14:12:00,30,46.178250,True,False,14,21,6
2020-06-21 15:31:00,31,79.551483,True,False,15,21,6
2020-06-21 17:28:00,76,80.249480,True,False,17,21,6
...,...,...,...,...,...,...,...
2020-09-21 09:30:00,68,77.982903,True,False,9,21,9
2020-09-21 10:48:00,42,65.256515,True,False,10,21,9
2020-09-21 12:03:00,35,109.055196,True,False,12,21,9
2020-09-21 12:04:00,41,20.269394,True,False,12,21,9


## Customer History
Customer history plays a crucial role in identifying suspicious transactions. In my project, I focused on analyzing each customer’s spending behavior over time to detect unusual activity patterns. For instance, if a customer suddenly increases the frequency or value of their transactions, it could suggest that their card information has been compromised.

Following the approach discussed by Bhattacharyya et al. (2011), I engineered several historical features that capture variations in transaction behavior, such as spending trends and merchant interactions. These features help provide additional context to each transaction and improve the reliability of fraud detection. Some of these calculations require unique merchant and category identifiers, so the steps can be adjusted or skipped depending on the available data.

### Average transaction amount over time
The first historical feature I engineer is the average amount a customer spends per transaction on their credit card.



In [27]:
# Specify the credit card number column
cc_num = "cc_num"

# Calculate the average transaction amount for the credit card across all purchases
# (the string "mean" avoids the FutureWarning that recent pandas versions raise for np.mean)
fraud_df["avg_amount"] = fraud_df.groupby(cc_num)[amt].transform("mean")
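
To make the behavior of `transform` concrete, the toy example below (invented data, rows assumed to be in chronological order) shows how the per-card mean is broadcast back to every row. Because that mean includes a card's later transactions, the sketch also shows a prior-only variant built from `expanding().mean().shift(1)`, which may be preferable when the feature must reflect only what was known at transaction time. The column name `avg_amount_prior` is hypothetical.

```python
import pandas as pd

# Invented transactions for two cards, in chronological order
toy = pd.DataFrame(
    {"cc_num": [1, 1, 1, 2, 2], "amt": [10.0, 20.0, 60.0, 5.0, 15.0]}
)

# Mean over all of a card's transactions, repeated on each of its rows
toy["avg_amount"] = toy.groupby("cc_num")["amt"].transform("mean")

# Mean over a card's earlier transactions only (NaN for the first one)
toy["avg_amount_prior"] = toy.groupby("cc_num")["amt"].transform(
    lambda s: s.expanding().mean().shift(1)
)

print(toy)
```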



### Number of transactions at a merchant over time
Tracking how frequently a customer makes purchases from a specific merchant provides valuable behavioral insight. In my project, I used this feature to identify unusual spending activity, particularly transactions occurring at merchants the customer has never interacted with before. A sudden purchase at a completely new merchant can serve as a strong indicator of potential fraud or unauthorized card use.

In [29]:
# Specify the merchant column
merchant = "merchant"

# Calculate the total number of purchases each card made at a merchant
fraud_df["merchant_qty"] = fraud_df.groupby([cc_num, merchant])[amt].transform(
    "size"
)
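
`merchant_qty` counts every purchase a card makes at a merchant across the whole dataset, so on its own it does not say whether a given transaction was the first at that merchant. One way to flag first-time merchants, sketched on invented data and assuming rows are sorted chronologically, is `cumcount`, which numbers each row within its card and merchant group starting from zero. The column names below are hypothetical.

```python
import pandas as pd

# Invented transactions, assumed to be in chronological order
toy = pd.DataFrame(
    {"cc_num": [1, 1, 1, 2], "merchant": ["A", "B", "A", "A"]}
)

# Number of earlier purchases by the same card at the same merchant
toy["prior_merchant_qty"] = toy.groupby(["cc_num", "merchant"]).cumcount()

# A transaction with no earlier purchases is at a new merchant for that card
toy["is_new_merchant"] = toy["prior_merchant_qty"] == 0

print(toy)
```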

### Average Transaction Amount in the Past 30 Days

Monitoring a customer’s average spending over recent transactions can highlight changes in their usual behavior. In my project, I calculated the average transaction amount for each customer over the past 30 days. Sudden increases or unusually high spending compared to their typical pattern can serve as an indicator of potential fraud or unauthorized card use.

In [41]:
# Calculate the average amount per credit card for the past 30 days
fraud_monthly_amt = fraud_df.groupby(cc_num)[amt].rolling("30D").mean().reset_index()

# Rename the column
fraud_monthly_amt.rename(columns={amt: "avg_30_day"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_monthly_amt, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)
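
The toy example below (invented data) illustrates what the time-based window does: for each transaction, `rolling("30D")` averages that card's amounts over the preceding 30 days, including the current transaction. It also passes `validate="one_to_one"` to `merge`, an optional safeguard not used in the original cell, which raises an error if a card has two transactions with the same timestamp and would otherwise silently duplicate rows.

```python
import pandas as pd

# Invented transactions for two cards
toy = pd.DataFrame(
    {"cc_num": [1, 2, 2, 1, 1], "amt": [10.0, 5.0, 15.0, 20.0, 60.0]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-05", "2020-01-10", "2020-01-15", "2020-02-20"]
    ),
)
toy.index.name = "trans_time"

# 30-day rolling mean per card (the window ends at the current transaction)
rolled = (
    toy.groupby("cc_num")["amt"].rolling("30D").mean().rename("avg_30_day").reset_index()
)

# Merge back on card and time; validate guards against duplicated keys
merged = (
    toy.reset_index()
    .merge(rolled, on=["cc_num", "trans_time"], validate="one_to_one")
    .set_index("trans_time")
    .sort_index()
)

print(merged)
```

The last row averages only the 60.00 transaction because the card's earlier purchases fall outside its 30-day window.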

### Frequency of Transactions with the Same Merchant (Last 30 Days)


Understanding how often a customer interacts with a particular merchant within a recent time frame can reveal meaningful spending patterns. In my project, I analyzed the number of transactions each customer made with the same merchant over the past 30 days to identify sudden behavioral changes. A sharp increase or decrease in this frequency may signal unusual activity, such as a compromised account or unauthorized use at unfamiliar merchants.

In [45]:
# Calculate the number of transactions per credit card and merchant for the past 30 days
fraud_monthly_qty = (
    fraud_df.groupby([cc_num, merchant])[amt]
    .rolling("30D")
    .count()
    .reset_index()
    .drop(columns=merchant)
)

# Rename the column
fraud_monthly_qty.rename(columns={amt: "merchant_qty_30_day"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_monthly_qty, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)

### Average Daily Transaction Amount in the Past 30 Days

To gain further insight into a customer’s spending behavior, I calculated the average daily transaction amount over the past 30 days. This metric helps identify unusual spending patterns by showing whether a customer’s daily expenditures have significantly increased or decreased, which can be a potential signal of fraud or unauthorized activity.

In [47]:
# Calculate the total amount spent per day
fraud_df["date"] = fraud_df.index.strftime("%d-%m-%Y")
fraud_df["day_amt"] = fraud_df.groupby([cc_num, "date"])[amt].transform("sum")

# Calculate the average daily amount per credit card for the past 30 days
fraud_daily_avg = (
    fraud_df.groupby([cc_num])["day_amt"].rolling("30D").mean().reset_index()
)

# Rename the column
fraud_daily_avg.rename(columns={"day_amt": "daily_avg_30_day"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_daily_avg, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)
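
One subtlety of the cell above is that the rolling mean is taken over transactions, so a day with several transactions contributes its daily total several times. The sketch below uses invented data to contrast that result with an average taken over one total per day. The per-day variant is an alternative reading of "average daily amount", not the notebook's method, and it averages only over days that have at least one transaction.

```python
import pandas as pd

# Invented transactions: two on the first day, one on the second
toy = pd.DataFrame(
    {"cc_num": [1, 1, 1], "amt": [10.0, 20.0, 60.0]},
    index=pd.to_datetime(
        ["2020-01-01 09:00", "2020-01-01 18:00", "2020-01-02 12:00"]
    ),
)
toy.index.name = "trans_time"
day = toy.index.normalize()  # midnight of each transaction's calendar day

# Transaction-weighted: daily total repeated on every row, then rolled
toy["day_amt"] = toy.groupby(["cc_num", day])["amt"].transform("sum")
per_transaction = toy.groupby("cc_num")["day_amt"].rolling("30D").mean()

# Day-weighted: one total per card and day, then rolled over days
daily_totals = toy.groupby(["cc_num", day])["amt"].sum()
per_day = (
    daily_totals.reset_index(level="cc_num")
    .groupby("cc_num")["amt"]
    .rolling("30D")
    .mean()
)

print(per_transaction.to_numpy(), per_day.to_numpy())
```

On the second day the two approaches give 40.0 and 45.0 respectively, because the first day's total of 30.00 is counted twice in the transaction-weighted version.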



### Average Spending by Category in the Past 30 Days

To better understand a customer’s purchasing patterns, I calculated the average amount spent in each category over the past 30 days. For example, if a customer who typically avoids health and fitness suddenly makes multiple large purchases at a fitness store, it could indicate unusual or potentially fraudulent activity.

In [50]:
# Specify the transaction category column
category = "category"

# Calculate the average amount spent in a category for the past 30 days
fraud_category_avg_month = (
    fraud_df.groupby([cc_num, category])[amt]
    .rolling("30D")
    .mean()
    .reset_index()
    .drop(columns=category)
)

# Rename the column
fraud_category_avg_month.rename(columns={amt: "category_avg_month"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_category_avg_month, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)

### Cumulative Transactions on the Day of the Transaction

To capture sudden spikes in activity, I counted the transactions a customer made in the 24 hours before the current transaction. This is a rolling one-day window that excludes the current transaction, rather than the calendar day. An unusually high number of transactions within a short period can indicate potential fraud or unauthorized card use.

In [53]:
# Count the transactions in the 24 hours before the current transaction
fraud_day_transactions = (
    fraud_df.groupby(cc_num)[amt]
    .rolling("1D", closed="left")
    .count()
    .replace({np.nan: 0})
    .reset_index()
)

# Rename the column
fraud_day_transactions.rename(columns={amt: "day_qty"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_day_transactions, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)
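
The cell above uses `rolling("1D", closed="left")`, which looks back 24 hours from each transaction and excludes the transaction itself. That differs from counting earlier transactions on the same calendar day when activity spans midnight. The sketch below, on invented timestamps, shows both counts side by side; the calendar-day variant using `cumcount` is an alternative and not part of the original notebook.

```python
import pandas as pd

# Invented transactions for one card, spanning midnight
toy = pd.DataFrame(
    {"cc_num": [1, 1, 1], "amt": [10.0, 20.0, 30.0]},
    index=pd.to_datetime(
        ["2020-01-01 23:00", "2020-01-02 01:00", "2020-01-02 02:00"]
    ),
)
toy.index.name = "trans_time"

# Transactions in the preceding 24 hours (current one excluded)
prev_24h = (
    toy.groupby("cc_num")["amt"].rolling("1D", closed="left").count().fillna(0)
)

# Earlier transactions on the same calendar day
same_day = toy.groupby(["cc_num", toy.index.normalize()]).cumcount()

print(prev_24h.to_numpy(), same_day.to_numpy())
```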

### Previewing the new customer history features


In [55]:
# Preview new columns
fraud_df[
    [
        amt,
        category,
        merchant,
        "avg_amount",
        "merchant_qty",
        "avg_30_day",
        "merchant_qty_30_day",
        "daily_avg_30_day",
        "category_avg_month",
        "day_qty",
    ]
]

trans_date_trans_time,amt,category,merchant,avg_amount,merchant_qty,avg_30_day,merchant_qty_30_day,daily_avg_30_day,category_avg_month,day_qty
2020-06-21 13:35:00,11.18,food_dining,Conroy-Emard,36.516667,1,11.1800,1.0,11.1800,11.18,0.0
2020-06-21 14:02:00,18.76,home,Kutch-Wilderman,13.015000,1,18.7600,1.0,18.7600,18.76,0.0
2020-06-21 14:12:00,5.48,shopping_net,"Heathcote, Yost and Kertzmann",36.360000,1,5.4800,1.0,5.4800,5.48,0.0
2020-06-21 15:31:00,6.22,travel,Hackett Group,15.350000,1,6.2200,1.0,6.2200,6.22,0.0
2020-06-21 17:28:00,36.04,food_dining,Abernathy and Sons,98.365000,1,36.0400,1.0,36.0400,36.04,0.0
...,...,...,...,...,...,...,...,...,...,...
2020-09-21 09:30:00,37.83,misc_pos,"Terry, Johns and Bins",43.495000,1,37.8300,1.0,37.8300,37.83,0.0
2020-09-21 10:48:00,70.23,gas_transport,Huels-Nolan,63.150000,1,69.3200,1.0,69.3200,69.32,0.0
2020-09-21 12:03:00,11.83,health_fitness,Torphy-Kertzmann,36.516667,1,11.8300,1.0,11.8300,11.83,0.0
2020-09-21 12:04:00,101.39,home,"Crist, Jakubowski and Littel",57.625000,1,101.3900,1.0,101.3900,101.39,0.0


## Merchant and Category History
Some merchants and categories may be at a higher risk of fraud than others. I used the following code to create a column based on the proportion of fraud for each merchant and category.


In [58]:
# Specify the fraud label column
is_fraud = "is_fraud"

# Calculate the proportion of fraud per merchant
fraud_df["merch_fraud_prop"] = fraud_df.groupby(merchant)[is_fraud].transform("mean")

# Calculate the proportion of fraud per category
fraud_df["category_fraud_prop"] = fraud_df.groupby(category)[is_fraud].transform("mean")
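
These proportions are computed over the full dataset, so each row's feature includes its own fraud label and the labels of later transactions. If the features feed a predictive model, that can leak the target into the inputs. One possible mitigation, sketched below on invented data with rows assumed to be in chronological order, is to use only the labels of earlier transactions at the same merchant. The column name `merch_fraud_prop_prior` is hypothetical.

```python
import pandas as pd

# Invented transactions in chronological order
toy = pd.DataFrame(
    {"merchant": ["A", "B", "A", "A"], "is_fraud": [0, 1, 1, 0]}
)

# Full-data proportion, as in the cell above
toy["merch_fraud_prop"] = toy.groupby("merchant")["is_fraud"].transform("mean")

# Proportion of fraud among earlier transactions at the same merchant
# (NaN when the merchant has no earlier transactions)
toy["merch_fraud_prop_prior"] = toy.groupby("merchant")["is_fraud"].transform(
    lambda s: s.shift(1).expanding().mean()
)

print(toy)
```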

### Previewing the new merchant and category history features


In [60]:
fraud_df[[merchant, category, "merch_fraud_prop", "category_fraud_prop"]]

trans_date_trans_time,merchant,category,merch_fraud_prop,category_fraud_prop
2020-06-21 13:35:00,Conroy-Emard,food_dining,0.0,0.000000
2020-06-21 14:02:00,Kutch-Wilderman,home,0.0,0.000000
2020-06-21 14:12:00,"Heathcote, Yost and Kertzmann",shopping_net,0.0,0.009901
2020-06-21 15:31:00,Hackett Group,travel,0.0,0.000000
2020-06-21 17:28:00,Abernathy and Sons,food_dining,0.0,0.000000
...,...,...,...,...
2020-09-21 09:30:00,"Terry, Johns and Bins",misc_pos,0.0,0.013333
2020-09-21 10:48:00,Huels-Nolan,gas_transport,0.0,0.000000
2020-09-21 12:03:00,Torphy-Kertzmann,health_fitness,0.0,0.000000
2020-09-21 12:04:00,"Crist, Jakubowski and Littel",home,0.0,0.000000
