# Day 11: Payment Fraud Risk Detection in Online Transactions

You are a data analyst on Stripe's risk management team, investigating transaction patterns to identify potential fraud. The team needs a systematic way to screen transactions for financial risk. Your goal is to build an initial risk assessment methodology from transaction characteristics.

In [2]:
import pandas as pd
import numpy as np

dim_risk_flags_data = [
  {
    "risk_level": "Low",
    "risk_flag_id": 1,
    "transaction_id": 2
  },
  {
    "risk_level": "Medium",
    "risk_flag_id": 2,
    "transaction_id": 7
  },
  {
    "risk_level": "High",
    "risk_flag_id": 3,
    "transaction_id": 11
  },
  {
    "risk_level": "High",
    "risk_flag_id": 4,
    "transaction_id": 12
  },
  {
    "risk_level": "High",
    "risk_flag_id": 5,
    "transaction_id": 13
  },
  {
    "risk_level": "Medium",
    "risk_flag_id": 6,
    "transaction_id": 14
  },
  {
    "risk_level": "High",
    "risk_flag_id": 7,
    "transaction_id": 15
  },
  {
    "risk_level": "Low",
    "risk_flag_id": 8,
    "transaction_id": 1
  },
  {
    "risk_level": "Medium",
    "risk_flag_id": 9,
    "transaction_id": 6
  },
  {
    "risk_level": "Low",
    "risk_flag_id": 10,
    "transaction_id": 3
  }
]
dim_risk_flags = pd.DataFrame(dim_risk_flags_data)

fct_transactions_data = [
  {
    "customer_email": "alice@gmail.com",
    "transaction_id": 1,
    "transaction_date": "2024-10-05",
    "transaction_amount": 120,
    "fraud_detection_score": 10
  },
  {
    "customer_email": "bob@customdomain.com",
    "transaction_id": 2,
    "transaction_date": "2024-10-15",
    "transaction_amount": 250.5,
    "fraud_detection_score": 20
  },
  {
    "customer_email": "charlie@yahoo.com",
    "transaction_id": 3,
    "transaction_date": "2024-10-20",
    "transaction_amount": 75.25,
    "fraud_detection_score": 15
  },
  {
    "customer_email": "dana@hotmail.com",
    "transaction_id": 4,
    "transaction_date": "2024-10-25",
    "transaction_amount": 100,
    "fraud_detection_score": 30
  },
  {
    "customer_email": "eve@biz.org",
    "transaction_id": 5,
    "transaction_date": "2024-10-30",
    "transaction_amount": 300,
    "fraud_detection_score": 40
  },
  {
    "customer_email": "frank@gmail.com",
    "transaction_id": 6,
    "transaction_date": "2024-11-03",
    "transaction_amount": 150.75,
    "fraud_detection_score": 25
  },
  {
    "customer_email": "grace@outlook.com",
    "transaction_id": 7,
    "transaction_date": "2024-11-10",
    "transaction_amount": None,
    "fraud_detection_score": 50
  },
  {
    "customer_email": "ivan@yahoo.com",
    "transaction_id": 8,
    "transaction_date": "2024-11-15",
    "transaction_amount": 200,
    "fraud_detection_score": 35
  },
  {
    "customer_email": "judy@hotmail.com",
    "transaction_id": 9,
    "transaction_date": "2024-11-21",
    "transaction_amount": 250,
    "fraud_detection_score": 45
  },
  {
    "customer_email": "ken@domain.net",
    "transaction_id": 10,
    "transaction_date": "2024-11-29",
    "transaction_amount": 300,
    "fraud_detection_score": 55
  },
  {
    "customer_email": "laura@riskmail.com",
    "transaction_id": 11,
    "transaction_date": "2024-12-02",
    "transaction_amount": 100,
    "fraud_detection_score": 80
  },
  {
    "customer_email": "mike@securepay.com",
    "transaction_id": 12,
    "transaction_date": "2024-12-03",
    "transaction_amount": 180,
    "fraud_detection_score": 85
  },
  {
    "customer_email": "nina@trusthub.com",
    "transaction_id": 13,
    "transaction_date": "2024-12-09",
    "transaction_amount": 220,
    "fraud_detection_score": 90
  },
  {
    "customer_email": "oscar@fintech.com",
    "transaction_id": 14,
    "transaction_date": "2024-12-16",
    "transaction_amount": 140,
    "fraud_detection_score": 70
  },
  {
    "customer_email": "paula@alertsys.com",
    "transaction_id": 15,
    "transaction_date": "2024-12-23",
    "transaction_amount": 260,
    "fraud_detection_score": 95
  }
]
fct_transactions = pd.DataFrame(fct_transactions_data)


## Question 1

How many transactions in October 2024 have a customer email ending with a domain other than 'gmail.com', 'yahoo.com', or 'hotmail.com'? This metric will help us identify transactions associated with less common email providers that may indicate emerging risk patterns.

In [3]:
# filter to October 2024 (ISO 'YYYY-MM-DD' strings sort chronologically, so text comparison works)
oct_2024 = fct_transactions[
    (fct_transactions['transaction_date'] >= '2024-10-01') &
    (fct_transactions['transaction_date'] <  '2024-11-01')
].copy()

# normalize the email column and extract the domain
# - astype(str) turns missing values into the literal string 'nan'
# - strip whitespace and lowercase
oct_2024['email_norm'] = oct_2024['customer_email'].astype(str).str.strip().str.lower()

# extract domain after '@'; if no '@' is present, the domain equals the whole string (treated as "other")
oct_2024['email_domain'] = oct_2024['email_norm'].str.split('@').str[-1]

# common providers to exclude from the count
common_domains = ['gmail.com', 'yahoo.com', 'hotmail.com']

# boolean mask for domains outside the common list (a missing email, now 'nan', also counts as "other")
mask_other_domains = ~oct_2024['email_domain'].isin(common_domains)

# optional: exclude records with a missing email instead of counting them as "other".
# To include missing emails as "other", skip this step; to exclude them, uncomment:
# mask_valid_email_present = oct_2024['customer_email'].notna()
# mask_other_domains = mask_other_domains & mask_valid_email_present

# count
count_other_domains = mask_other_domains.sum()

print("Transactions in Oct 2024 with non-gmail/yahoo/hotmail domains:", count_other_domains)

# If you want to inspect them:
other_domain_rows = oct_2024[mask_other_domains][['transaction_id', 'transaction_date', 'customer_email', 'email_domain']]
print(other_domain_rows.head(20))

Transactions in Oct 2024 with non-gmail/yahoo/hotmail domains: 2
   transaction_id transaction_date        customer_email      email_domain
1               2       2024-10-15  bob@customdomain.com  customdomain.com
4               5       2024-10-30           eve@biz.org           biz.org
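
The same count can be wrapped in a small reusable helper. The sketch below is one possible shape, not the notebook's method: the function name `count_uncommon_domains`, the `COMMON_DOMAINS` constant, and the `sample` frame are all hypothetical, and unlike the cell above it drops missing emails rather than counting them as "other".

```python
import pandas as pd

# Assumed list of "common" providers, taken from the question text
COMMON_DOMAINS = ("gmail.com", "yahoo.com", "hotmail.com")

def count_uncommon_domains(df, start, end, common=COMMON_DOMAINS):
    """Count rows dated in [start, end) whose email domain is not in `common`.

    Rows with a missing email are dropped (an assumption of this sketch).
    """
    dates = pd.to_datetime(df["transaction_date"])
    in_window = (dates >= pd.Timestamp(start)) & (dates < pd.Timestamp(end))
    domains = (
        df.loc[in_window, "customer_email"]
        .dropna()                 # ignore missing emails
        .str.strip()
        .str.lower()
        .str.rsplit("@", n=1)     # split once, from the right
        .str[-1]                  # keep the part after the last '@'
    )
    return int((~domains.isin(list(common))).sum())

# Hypothetical sample rows for illustration only
sample = pd.DataFrame({
    "transaction_date": ["2024-10-05", "2024-10-15", "2024-10-30", "2024-11-03", "2024-10-12"],
    "customer_email": ["alice@gmail.com", "Bob@CustomDomain.com ", "eve@biz.org", "frank@other.io", None],
})
print(count_uncommon_domains(sample, "2024-10-01", "2024-11-01"))  # prints 2
```

Passing the window as a half-open interval avoids having to know the last day of the month.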


## Question 2

For transactions occurring in November 2024, what is the average transaction amount, using 0 as a default for any missing values? This calculation will help us detect abnormal transaction amounts that could be related to fraudulent activity.

In [7]:
# Parse dates once so the .dt accessor is available
fct_transactions['transaction_date'] = pd.to_datetime(fct_transactions['transaction_date'])

# Filter for November 2024; .copy() makes november_df an independent frame,
# so the assignment below does not raise SettingWithCopyWarning
november_df = fct_transactions[
    (fct_transactions['transaction_date'].dt.year == 2024) &
    (fct_transactions['transaction_date'].dt.month == 11)
].copy()

# Fill missing transaction_amount with 0
november_df['transaction_amount'] = november_df['transaction_amount'].fillna(0)

# Calculate average
avg_transaction_amount = november_df['transaction_amount'].mean()

print(f"Average transaction amount (Nov 2024): {avg_transaction_amount}")

Average transaction amount (Nov 2024): 180.15
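
The choice to treat a missing amount as 0 changes the result, because pandas' `mean()` skips missing values by default. A minimal sketch of the difference, using the five November 2024 amounts from the data above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# November 2024 amounts from fct_transactions; transaction 7 has no amount
amounts = pd.Series([150.75, np.nan, 200.0, 250.0, 300.0], name="transaction_amount")

# Missing value counted as 0: the sum is divided by all 5 rows
mean_missing_as_zero = amounts.fillna(0).mean()

# Default behaviour (skipna=True): the sum is divided by the 4 known rows
mean_missing_skipped = amounts.mean()

print(mean_missing_as_zero)
print(mean_missing_skipped)
```

The first figure (about 180.15) matches what the question asks for; the second (225.1875) is what the same data would give if the `fillna(0)` step were left out.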


## Question 3

Among transactions flagged as 'High' risk in December 2024, which day of the week recorded the highest number of such transactions? This analysis is intended to pinpoint specific days with concentrated high-risk activity and support the development of our preliminary fraud detection score.

In [9]:
# Merge datasets on transaction_id (default inner join: only flagged transactions are kept)
merged_df = fct_transactions.merge(dim_risk_flags, on='transaction_id')

# Filter for High risk and December 2024
# (transaction_date was converted to datetime in the Question 2 cell)
# .copy() makes filtered_df an independent frame, so adding a column
# below does not raise SettingWithCopyWarning
filtered_df = merged_df[
    (merged_df['risk_level'] == 'High') &
    (merged_df['transaction_date'].dt.month == 12) &
    (merged_df['transaction_date'].dt.year == 2024)
].copy()

# Extract day of the week
filtered_df['weekday'] = filtered_df['transaction_date'].dt.day_name()

# Count transactions per weekday
weekday_counts = filtered_df['weekday'].value_counts()

# Get the weekday with highest High-risk transactions
top_day = weekday_counts.idxmax()
top_count = weekday_counts.max()

print("Day with highest High-risk transactions in Dec 2024:", top_day)
print("Number of transactions:", top_count)

Day with highest High-risk transactions in Dec 2024: Monday
Number of transactions: 3
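
One caveat with `idxmax()` is that it returns only the first label when several weekdays share the top count. The sketch below is a possible tie-aware variant; the helper name `busiest_weekdays` is hypothetical, and it assumes at least one date is passed in. The input dates are the four High-risk December 2024 transactions from the data above.

```python
import pandas as pd

def busiest_weekdays(dates):
    """Return (weekday names tied for the highest count, that count).

    Assumes `dates` is a non-empty list of parseable date strings.
    """
    counts = pd.to_datetime(pd.Series(dates)).dt.day_name().value_counts()
    top = int(counts.max())
    # Keep every weekday whose count equals the maximum, not just the first
    return sorted(counts[counts == top].index), top

# High-risk transactions 11, 12, 13 and 15 from December 2024
high_risk_dates = ["2024-12-02", "2024-12-03", "2024-12-09", "2024-12-23"]
days, n = busiest_weekdays(high_risk_dates)
print(days, n)  # prints ['Monday'] 3
```

With this dataset there is no tie, so the result agrees with the cell above.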


Made with ❤️ by [Interview Master](https://www.interviewmaster.ai)