# Feature Engineering for Fraud Detection
Recent estimates suggest that credit card fraud was responsible for losses totaling 28.58 billion dollars in 2020 ([Nilson, 2021](https://nilsonreport.com/upload/content_promo/NilsonReport_Issue1209.pdf)). Accurate fraud detection protects customers' peace of mind and can prevent massive financial losses.

The quality of predictions depends heavily on the data and features used. This template takes raw credit card data with standard features and engineers additional features to assist with fraud prediction.

## Imports and data preparation
The following cells install and import the packages needed to load and manipulate fraud detection data. They also load the example data and preview it.

💡&nbsp;&nbsp;_Be sure to add and remove imports (if you find you don't need them) as per your requirements._

In [None]:
%%capture
!pip install geopy
import numpy as np
import pandas as pd
from datetime import date
from geopy import distance

### Load in your data
The cell below imports the credit card data, which contains typical raw credit card transaction features such as the transaction time, the merchant, the amount, the credit card, and customer details (see [Bahnsen et al., 2016](https://www.sciencedirect.com/science/article/abs/pii/S0957417415008386) for a list of common features).

👇&nbsp;&nbsp;_To use your data, you will need to:_
- _Upload a file and update the `path` variable._
    - _Alternatively, if you have data in a database, you can add a SQL cell and connect to a custom [database](https://workspace-docs.datacamp.com/connect-to-data/connect-your-data-to-workspace)._
- _Set the column that contains the transaction time._
- _Set any other columns that contain date data (you may need to update this after loading the data in)_.

In [None]:
# Set path to data
path = "data/fraud_data.csv"

# Specify the transaction time column
trans_time = "trans_date_trans_time"

# Specify any additional date columns
date_cols = ["dob", trans_time]

# Read in the data as a DataFrame and set the index
fraud_df = pd.read_csv(path, parse_dates=date_cols, index_col=trans_time).sort_index()

# Preview the data
fraud_df

### Inspect the features and data types
The first step is to inspect the available columns using the pandas method [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html). You can also review the data type of each column.

In [None]:
# Print summary of the DataFrame
fraud_df.info()

## Customer and Transaction Details
### Extracting age from date of birth
As noted in Bahnsen et al., 2016 (referenced earlier), the customer's age is a common feature in raw credit card data. In the cell below, we add an age column based on the date of birth. The function calculates each customer's age as of today's date.

If your data already contains the correct age information, you can skip this step.

👇&nbsp;&nbsp;_Make sure to update the `dob` label below._

In [None]:
# Specify the customer date of birth column
dob = "dob"

# Define a function to extract age from date of birth
def age(date_of_birth):
    # Age as of today; subtract a year if this year's birthday has not occurred yet
    today = date.today()
    return (
        today.year
        - date_of_birth.year
        - ((today.month, today.day) < (date_of_birth.month, date_of_birth.day))
    )


# Create a new column
fraud_df["age"] = fraud_df[dob].apply(age)
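Because the function above measures age as of today, the value shifts depending on when the notebook is run. The following is a minimal sketch of one possible variation that measures age at the time of each transaction instead. It assumes the transaction time is the index and the date of birth is a datetime column, as in this template; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: transaction time as the index, date of birth as a column
toy_df = pd.DataFrame(
    {"dob": pd.to_datetime(["1990-06-15", "1990-06-15", "2000-02-29"])},
    index=pd.to_datetime(
        ["2020-06-14 10:00", "2020-06-15 09:30", "2021-02-28 12:00"]
    ),
)

# Pull the date parts out as plain NumPy arrays to avoid index alignment issues
t_year = toy_df.index.year.to_numpy()
t_month = toy_df.index.month.to_numpy()
t_day = toy_df.index.day.to_numpy()
b_year = toy_df["dob"].dt.year.to_numpy()
b_month = toy_df["dob"].dt.month.to_numpy()
b_day = toy_df["dob"].dt.day.to_numpy()

# True where the birthday has not yet occurred in the transaction year
before_birthday = (t_month < b_month) | ((t_month == b_month) & (t_day < b_day))

# Age in whole years at the moment of each transaction
toy_df["age_at_transaction"] = t_year - b_year - before_birthday.astype(int)
```

This version is vectorized, so it avoids calling a Python function once per row.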

### Distance from merchant
The next feature we will create is the distance between the customer and the merchant. The reasoning is that a transaction at a merchant far away from the customer may be a sign of suspicious activity.

This step assumes you have the coordinates (latitude and longitude) of both the customer and the merchant. If you don't possess this information, you can skip this step.

👇&nbsp;&nbsp;_Make sure to update the latitude and longitude labels below._

In [None]:
# Specify the customer and merchant coordinate columns
customer_lat, customer_long = "lat", "long"
merchant_lat, merchant_long = "merch_lat", "merch_long"

# Create function to calculate the distance between two sets of coordinates
calculate_distance = lambda x: distance.geodesic(
    (x[customer_lat], x[customer_long]), (x[merchant_lat], x[merchant_long])
).km

# Create a new column for the distance
fraud_df["distance_from_merchant"] = fraud_df.apply(calculate_distance, axis=1)
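Applying `distance.geodesic` row by row can be slow on large datasets. As a rough sketch of an alternative, the haversine formula can be computed on whole columns with NumPy. Note that this is a different technique from the geodesic calculation above: it treats the Earth as a sphere, so the results differ slightly from the ellipsoidal distances that `geopy` returns. The function name `haversine_km`, the radius constant, and the toy coordinates are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical toy data using the same column labels as the cell above
toy_df = pd.DataFrame(
    {
        "lat": [0.0, 0.0],
        "long": [0.0, 0.0],
        "merch_lat": [0.0, 1.0],
        "merch_long": [0.0, 0.0],
    }
)

# Operates on entire columns at once instead of one row at a time
toy_df["distance_from_merchant"] = haversine_km(
    toy_df["lat"], toy_df["long"], toy_df["merch_lat"], toy_df["merch_long"]
)
```

If you need distances that match `geopy` exactly, keep the row-wise geodesic calculation.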

### Domestic transactions
We can also check whether a transaction is domestic or not by comparing the country of the customer and the country of the merchant. 

Note that if you do not have this data but do have latitude and longitude information, you can calculate the country using the `geopy` library. See [this tutorial](https://www.geeksforgeeks.org/get-the-city-state-and-country-names-from-latitude-and-longitude-using-python/) for how to extract the country based on latitude and longitude.

👇&nbsp;&nbsp;_Make sure to update the country labels below._

In [None]:
# Specify the customer country and merchant country
customer_country, merchant_country = "customer_country", "merchant_country"

# Create new column indicating domestic transactions
fraud_df["is_domestic"] = fraud_df[customer_country] == fraud_df[merchant_country]

### Whole number transactions
It can also be useful to determine whether the transaction amount is a whole number (which may indicate a suspicious transaction).

👇&nbsp;&nbsp;_Make sure to update the transaction amount label below._

In [None]:
# Specify the transaction amount column
amt = "amt"

# Create a column indicating whether the transaction amount is a whole number
fraud_df["is_whole_number"] = fraud_df[amt] == fraud_df[amt].astype(int)
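The cast with `.astype(int)` raises an error if the amount column contains missing values. A small sketch of one alternative, assuming you would rather flag missing amounts as `False`, checks the remainder after dividing by 1. The series `toy_amt` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, including a missing value
toy_amt = pd.Series([20.0, 19.99, 100.0, np.nan], name="amt")

# A whole number leaves no remainder when divided by 1; NaN compares as False
is_whole_number = toy_amt.mod(1).eq(0)
```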

### Time and date of transaction
Although the data contains the time of the transaction, it may be useful to extract components of this for predictive purposes. For example, fraud may be more likely at different times of the day or year.

In [None]:
# Extract the hour, day, and month from the datetime index
fraud_df["hour"] = fraud_df.index.hour
fraud_df["day"] = fraud_df.index.day
fraud_df["month"] = fraud_df.index.month
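Note that `day` above is the day of the month. If weekly patterns matter for your data, a day-of-week feature can be extracted in the same way. The sketch below uses hypothetical timestamps, and the `is_weekend` flag is an illustrative assumption rather than part of the template.

```python
import pandas as pd

# Hypothetical transactions indexed by time (13 June 2020 was a Saturday)
toy_df = pd.DataFrame(
    {"amt": [12.5, 80.0]},
    index=pd.to_datetime(["2020-06-13 23:15", "2020-06-15 08:00"]),
)

# Monday is 0 and Sunday is 6
toy_df["day_of_week"] = toy_df.index.dayofweek

# Flag Saturday and Sunday transactions
toy_df["is_weekend"] = toy_df["day_of_week"] >= 5
```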

### Preview the new customer and transaction details features
❗&nbsp;&nbsp;_If you omit or add other features, adapt the list of columns below accordingly._

In [None]:
fraud_df[
    [
        "age",
        "distance_from_merchant",
        "is_domestic",
        "is_whole_number",
        "hour",
        "day",
        "month",
    ]
]

## Customer History
Customer history is an important indicator of a suspicious transaction. For example, if a customer suddenly increases the volume or cost of their transactions, this may indicate their card has become compromised. The cells below create a number of features that reflect a customer's transaction history.

[Bhattacharyya et al., 2011](https://www.sciencedirect.com/science/article/abs/pii/S0167923610001326) identify a number of different types of historical features that can be engineered. Adapt the existing code to create other features as needed.

Note that a number of these features rely upon a unique merchant identifier and a unique category (or type of transaction) identifier. If you do not have these, you can skip these steps.

### Average transaction amount over time
The first general feature we will engineer will be the average amount a customer spends on their credit card.

👇&nbsp;&nbsp;_Make sure to update the credit card number label below._

In [None]:
# Specify the credit card number column
cc_num = "cc_num"

# Calculate the average transaction amount for the credit card across all purchases
fraud_df["avg_amount"] = fraud_df.groupby(cc_num)[amt].transform("mean")

### Total number of transactions at a merchant over time
Knowing how often a customer shops at a particular merchant is also helpful. If they have never shopped at a specific merchant, it may indicate that it is not them making the transaction.

👇&nbsp;&nbsp;_Make sure to update the merchant label below._

In [None]:
# Specify the merchant column
merchant = "merchant"

# Calculate the total number of purchases made at a merchant
fraud_df["merchant_qty"] = fraud_df.groupby([cc_num, merchant])[amt].transform(
    "size"
)

### Average transaction amount for the past 30 days
Customer spending habits may change over time. This cell calculates each credit card's average amount per transaction over the 30 days up to and including the current transaction.

In [None]:
# Calculate the average amount per credit card for the past 30 days
fraud_monthly_amt = fraud_df.groupby(cc_num)[amt].rolling("30D").mean().reset_index()

# Rename the column
fraud_monthly_amt.rename(columns={amt: "avg_30_day"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_monthly_amt, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)
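To see what the time-based rolling window produces, here is a small sketch on hypothetical data. It also checks for repeated card and timestamp pairs, because merging on those two keys returns extra rows when a pair occurs more than once. The names `toy_df`, `rolled`, and `has_duplicate_keys` are illustrative.

```python
import pandas as pd

# Hypothetical transactions for two cards, indexed by transaction time
toy_df = pd.DataFrame(
    {"cc_num": [1, 1, 1, 2], "amt": [10.0, 30.0, 50.0, 7.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-15", "2020-03-01", "2020-01-02"]),
)
toy_df.index.name = "trans_date_trans_time"
toy_df = toy_df.sort_index()

# Mean amount per card over the 30 days ending at each transaction.
# The result is indexed by (cc_num, trans_date_trans_time).
rolled = toy_df.groupby("cc_num")["amt"].rolling("30D").mean()

# Card 1: 10.0, then (10 + 30) / 2 = 20.0, then 50.0 because the earlier
# transactions are more than 30 days old. Card 2: 7.0.
print(rolled)

# A merge on (cc_num, transaction time) duplicates rows if the pair repeats
has_duplicate_keys = (
    toy_df.reset_index().duplicated(["cc_num", "trans_date_trans_time"]).any()
)
```

If `has_duplicate_keys` is `True` for your data, consider adding a unique transaction identifier to the merge keys.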

### Total number of transactions with the same merchant over the past 30 days
Shopping habits also change at the merchant level: customers may discover new shopping places or stop shopping at existing merchants. This cell calculates the total number of transactions a credit card has made at a merchant over the past 30 days.

In [None]:
# Calculate the total number of transactions per credit card and merchant for the past 30 days
fraud_monthly_qty = (
    fraud_df.groupby([cc_num, merchant])[amt]
    .rolling("30D")
    .count()
    .reset_index()
    .drop(columns=merchant)
)

# Rename the column
fraud_monthly_qty.rename(columns={amt: "merchant_qty_30_day"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_monthly_qty, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)

### Average daily transaction amount for past 30 days
This cell provides another view of spending history: how much the customer spends per day on average, using data from the past 30 days.

In [None]:
# Calculate the total amount spent per day
fraud_df["date"] = fraud_df.index.strftime("%d-%m-%Y")
fraud_df["day_amt"] = fraud_df.groupby([cc_num, "date"])[amt].transform("sum")

# Calculate the average daily amount per credit card for the past 30 days
fraud_daily_avg = (
    fraud_df.groupby([cc_num])["day_amt"].rolling("30D").mean().reset_index()
)

# Rename the column
fraud_daily_avg.rename(columns={"day_amt": "daily_avg_30_day"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_daily_avg, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)

### Average amount by category for past 30 days
If a customer is not into health and fitness and suddenly makes many large purchases at a fitness warehouse, it may be a sign of suspicious activity. This cell calculates the average amount spent in a category over the past 30 days.

👇&nbsp;&nbsp;_Make sure to update the category label below._

In [None]:
# Specify the transaction category column
category = "category"

# Calculate the average amount spent in a category for the past 30 days
fraud_category_avg_month = (
    fraud_df.groupby([cc_num, category])[amt]
    .rolling("30D")
    .mean()
    .reset_index()
    .drop(columns=category)
)

# Rename the column
fraud_category_avg_month.rename(columns={amt: "category_avg_month"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_category_avg_month, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)

### Total number of transactions in the 24 hours before this transaction
If a credit card is used for fraudulent activity, there may be a spike in the frequency of transactions over a short period. This cell counts the card's transactions in the 24 hours before the current transaction, excluding the current transaction itself.

In [None]:
# Count the transactions in the 24 hours before the current transaction
fraud_day_transactions = (
    fraud_df.groupby(cc_num)[amt]
    .rolling("1D", closed="left")
    .count()
    .replace({np.nan: 0})
    .reset_index()
)

# Rename the column
fraud_day_transactions.rename(columns={amt: "day_qty"}, inplace=True)

# Merge in the new column by cc number and transaction time
fraud_df = (
    fraud_df.merge(fraud_day_transactions, on=[cc_num, trans_time])
    .set_index(trans_time)
    .sort_index()
)
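The `closed="left"` argument is what excludes the current transaction from its own window. The sketch below compares it with the default behavior on hypothetical data; `with_current` and `prior_only` are illustrative names.

```python
import pandas as pd

# Hypothetical transactions for a single card
toy_df = pd.DataFrame(
    {"cc_num": [1, 1, 1], "amt": [5.0, 6.0, 7.0]},
    index=pd.to_datetime(
        ["2020-01-01 08:00", "2020-01-01 20:00", "2020-01-02 09:00"]
    ),
)

# Default window: the previous 24 hours including the current transaction
with_current = toy_df.groupby("cc_num")["amt"].rolling("1D").count()

# closed="left": the previous 24 hours excluding the current transaction.
# An empty window is treated as a count of zero.
prior_only = (
    toy_df.groupby("cc_num")["amt"].rolling("1D", closed="left").count().fillna(0)
)
```

The first transaction has no history, so its prior count is zero, while the default window counts the transaction itself.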

### Preview the new customer history features
❗&nbsp;&nbsp;_If you omit or add other features, adapt the list of columns below accordingly._

In [None]:
# Preview new columns
fraud_df[
    [
        amt,
        category,
        merchant,
        "avg_amount",
        "merchant_qty",
        "avg_30_day",
        "merchant_qty_30_day",
        "daily_avg_30_day",
        "category_avg_month",
        "day_qty",
    ]
]

## Merchant and Category History
Some merchants and categories may be at a higher risk of fraud than others. You can use the following code to create a column based on the proportion of fraud for each merchant and category.

👇&nbsp;&nbsp;_Make sure to update the fraud label below._

In [None]:
# Specify the fraud label column
is_fraud = "is_fraud"

# Calculate the proportion of fraud per merchant
fraud_df["merch_fraud_prop"] = fraud_df.groupby(merchant)[is_fraud].transform("mean")

# Calculate the proportion of fraud per category
fraud_df["category_fraud_prop"] = fraud_df.groupby(category)[is_fraud].transform("mean")
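The proportions above are calculated over the whole dataset, so each row's value includes its own fraud label. If you plan to train a model on these features, one possible variation is to use only earlier transactions. The sketch below is an assumption about how that could be done, not the template's method, and it uses hypothetical data and column values.

```python
import pandas as pd

# Hypothetical labeled transactions, already sorted by transaction time
toy_df = pd.DataFrame(
    {
        "merchant": ["A", "B", "A", "A"],
        "is_fraud": [0, 1, 1, 0],
    },
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
)

# For each merchant, shift the labels by one row so that a transaction never
# sees its own label, then take the running mean of the earlier labels
toy_df["merch_fraud_prop_prior"] = toy_df.groupby("merchant")["is_fraud"].transform(
    lambda labels: labels.shift().expanding().mean()
)
```

The first transaction at each merchant has no history, so its value is missing and needs to be filled or handled by the model.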

### Preview the new merchant and category history features
❗&nbsp;&nbsp;_If you omit or add other features, adapt the list of columns below accordingly._

In [None]:
fraud_df[[merchant, category, "merch_fraud_prop", "category_fraud_prop"]]

## Next steps
Make sure to create further features as needed for your data. Via exploratory analysis, you may uncover trends that merit new features, or you may have access to other types of raw data that could be beneficial for predicting fraud.

If you're interested in proceeding to the modeling stage of fraud detection, you may want to check out:
- DataCamp's course [Fraud Detection in Python](https://app.datacamp.com/learn/courses/fraud-detection-in-python), which covers this topic in greater depth.
- [Workspace templates](https://app.datacamp.com/workspace/templates?selectedLabels=%5B%22classification%22%5D) focused on classification problems.