In [1]:
import pandas as pd

DATA = '/kaggle/input/comprehensive-indian-online-fraud-dataset/updated_fraudulent_online_fraud_data_in_India.csv'

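# Load the data indexed by transaction_id; purchase_category is dropped up front.
df = pd.read_csv(filepath_or_buffer=DATA, index_col=['transaction_id']).drop(columns=['purchase_category'])
# transaction_time looks like '11/24/2023 22:39'; keep only the date part and parse it.
df['date'] = pd.to_datetime(df['transaction_time'].apply(func=lambda x: x.split()[0]))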

df.head()

transaction_id,customer_id,merchant_id,amount,transaction_time,is_fraudulent,card_type,location,customer_age,transaction_description,fraud_type,date
1,684415,2028,1262.77,11/24/2023 22:39,0,Rupay,Bangalore,28,Order from Restaurant-6203,Identity theft,2023-11-24
2,447448,2046,1852.44,03/30/2024 16:18,0,MasterCard,Surat,62,Payment to Restaurant-6134,Malware,2024-03-30
3,975001,2067,6827.12,03/07/2024 18:27,1,MasterCard,Hyderabad,24,Payment to Restaurant-2890,Malware,2024-03-07
4,976547,2075,1855.31,02/01/2024 00:58,1,Rupay,Hyderabad,62,Transaction at Retailer-4205,Payment card fraud,2024-02-01
5,935741,2044,5275.38,12/22/2023 18:42,1,Rupay,Bangalore,19,Order from Online Store-9669,scam,2023-12-22


We already know this is synthetic data, so let's look for some of its telltale signs.

In [2]:
import warnings
from plotly import express

warnings.filterwarnings(action='ignore', category=FutureWarning)

express.histogram(data_frame=df, x='amount', facet_col='is_fraudulent')

Fraudulent transactions outnumber legitimate ones here, which is a clue: in a real transaction dataset, fraud would be rare to very rare.
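The class balance can also be read off as a single number rather than from the histogram. The following is a minimal sketch, not part of the original notebook: `fraud_share` is a hypothetical helper, and `toy` is a made-up four-row frame standing in for `df` so the snippet runs without the Kaggle file.

```python
import pandas as pd


def fraud_share(frame: pd.DataFrame, label: str = 'is_fraudulent') -> float:
    """Return the fraction of rows whose label column equals 1."""
    return float((frame[label] == 1).mean())


# Hypothetical toy data (not from the dataset): three fraudulent rows out of four.
toy = pd.DataFrame({'is_fraudulent': [1, 1, 0, 1]})
share = fraud_share(toy)
print(f'fraud share: {share:.2f}')
```

Calling `fraud_share(df)` on the notebook's frame should give the same kind of figure; a value anywhere near or above one half would be far higher than the rare fraud rate described above.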

In [3]:
express.histogram(data_frame=df, x='customer_age', facet_col='is_fraudulent')

We would also expect fraudulent transactions to be unevenly distributed across customer ages, but they are not.
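One way to put numbers on that visual impression is to summarize age separately for each class. This is a sketch under assumptions: `age_summary_by_class` is a hypothetical helper, and `toy` is invented data standing in for `df`.

```python
import pandas as pd


def age_summary_by_class(frame: pd.DataFrame,
                         label: str = 'is_fraudulent',
                         age: str = 'customer_age') -> pd.DataFrame:
    """Summarize the age column separately for each value of the label."""
    return frame.groupby(label)[age].agg(['count', 'mean', 'median', 'std'])


# Hypothetical toy data (not from the dataset): both classes share the same mean age.
toy = pd.DataFrame({
    'is_fraudulent': [0, 0, 1, 1],
    'customer_age': [20, 40, 25, 35],
})
summary = age_summary_by_class(toy)
print(summary)
```

If the rows of `age_summary_by_class(df)` are nearly identical, that agrees with the histograms: age carries no information about fraud in this data.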

In [4]:
express.histogram(data_frame=df, x='merchant_id', facet_col='is_fraudulent')

In [5]:
express.imshow(img=df.corr(numeric_only=True))

The absence of any correlation among our numeric columns is another big clue that the data is synthetic.
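The heatmap can be condensed into one figure: the largest absolute correlation between two different columns. The sketch below is illustrative only; `max_abs_offdiag_corr` is a hypothetical helper, and `toy` is invented data in which two columns are deliberately perfectly correlated, so the function has something to find.

```python
import numpy as np
import pandas as pd


def max_abs_offdiag_corr(frame: pd.DataFrame) -> float:
    """Return the largest absolute pairwise correlation, ignoring the diagonal."""
    corr = frame.corr(numeric_only=True).abs()
    values = corr.to_numpy().copy()   # copy so the diagonal can be overwritten safely
    np.fill_diagonal(values, 0.0)     # a column always correlates perfectly with itself
    return float(np.nanmax(values))


# Hypothetical toy data (not from the dataset): b is exactly 2 * a, c is unrelated to a.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 8],
    'c': [1, -1, -1, 1],
})
strongest = max_abs_offdiag_corr(toy)
print(f'strongest pairwise correlation: {strongest:.2f}')
```

On the notebook's `df`, a result close to zero would match what the heatmap shows.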

In [6]:
express.scatter(data_frame=df, x='date', y='amount')

But this is really the clincher: our amounts are not just random but uniformly distributed across time, with no visible trend or clustering.
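A rough numeric counterpart to the scatter plot is to summarize `amount` month by month and see whether the summaries move at all. This is a sketch under assumptions: `monthly_amount_summary` is a hypothetical helper, grouping by calendar month is an arbitrary choice, and `toy` is invented data standing in for `df`.

```python
import pandas as pd


def monthly_amount_summary(frame: pd.DataFrame,
                           date: str = 'date',
                           amount: str = 'amount') -> pd.DataFrame:
    """Summarize the amount column per calendar month of the date column."""
    months = frame[date].dt.to_period('M')
    return frame.groupby(months)[amount].agg(['count', 'mean', 'min', 'max'])


# Hypothetical toy data (not from the dataset): two months with the same mean amount.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2023-11-24', '2023-11-30', '2023-12-22']),
    'amount': [100.0, 300.0, 200.0],
})
monthly = monthly_amount_summary(toy)
print(monthly)
```

If `monthly_amount_summary(df)` shows nearly the same mean, minimum, and maximum in every month, that is consistent with amounts drawn from one fixed range regardless of date.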