### EDA of IEEE-CIS Fraud Detection dataset

In [1]:
import pandas as pd

# Load datasets
transactions = pd.read_csv("../data/raw/train_transaction.csv")
identity = pd.read_csv("../data/raw/train_identity.csv")

# Merge datasets
df = transactions.merge(identity, on="TransactionID", how="left")

# Quick overview
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
print("Target distribution:")
print(df['isFraud'].value_counts(normalize=True))

Shape: (590540, 434)
Columns: ['TransactionID', 'isFraud', 'TransactionDT', 'TransactionAmt', 'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68', 'V69', 'V70', 'V71', 'V72', 'V73', 'V74', 'V75', 'V76', 'V77', 

> Large, high-dimensional dataset: 590,540 rows and 434 columns. It is severely imbalanced, with only ~3.5% fraud cases, so accuracy is misleading: a model that always predicts "not fraud" reaches about 96.5% accuracy without detecting any fraud.
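
> The accuracy trap described above can be reproduced with a minimal sketch. The labels below are synthetic (1,000 rows with 35 positives, chosen to mirror the ~3.5% rate) and the names `y_true` and `y_pred` are illustrative, not part of the notebook.

```python
# Synthetic labels mirroring the ~3.5% fraud rate (illustrative, not the real data).
y_true = [1] * 35 + [0] * 965
# Trivial baseline: always predict "not fraud".
y_pred = [0] * len(y_true)

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall = true_pos / sum(y_true)  # share of fraud cases actually caught

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

> Accuracy comes out at 0.965 while recall is 0, which is why metrics that focus on the positive class (recall, precision, ROC-AUC, PR-AUC) are usually preferred on data like this.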

In [2]:
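# Structure, summary statistics, and first rows.
# Note: a notebook cell renders only its last expression, so the
# df.describe() result is computed but not displayed here.
df.info()
df.describe()
df.head()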

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 434 entries, TransactionID to DeviceInfo
dtypes: float64(399), int64(4), object(31)
memory usage: 1.9+ GB


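|   | TransactionID | isFraud | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | ... | id_31 | id_32 | id_33 | id_34 | id_35 | id_36 | id_37 | id_38 | DeviceType | DeviceInfo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2987000 | 0 | 86400 | 68.5 | W | 13926 | NaN | 150.0 | discover | 142.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2987001 | 0 | 86401 | 29.0 | W | 2755 | 404.0 | 150.0 | mastercard | 102.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2987002 | 0 | 86469 | 59.0 | W | 4663 | 490.0 | 150.0 | visa | 166.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2987003 | 0 | 86499 | 50.0 | W | 18132 | 567.0 | 150.0 | mastercard | 117.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2987004 | 0 | 86506 | 50.0 | H | 4497 | 514.0 | 150.0 | mastercard | 102.0 | ... | samsung browser 6.2 | 32.0 | 2220x1080 | match_status:2 | T | F | T | T | mobile | SAMSUNG SM-G892A Build/NRD90M |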


> Mix of numerical and categorical features. Most are numerical (399 float64 and 4 int64 columns); only 31 of the 434 columns are object-typed categoricals. The large memory footprint (1.9+ GB) makes cross-validation with heavy models expensive and slows down pandas operations.
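
> One common way to shrink a footprint like this is to downcast numeric columns. The sketch below is an assumption about how that could be done, not a step the notebook performs; `downcast_numeric` and the toy frame are hypothetical, and float32 trades some precision for memory.

```python
import numpy as np
import pandas as pd


def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with float64/int64 columns downcast to smaller dtypes."""
    out = frame.copy()
    for col in out.select_dtypes(include="float64").columns:
        # Smallest float dtype pandas considers safe (float32 at minimum).
        out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include="int64").columns:
        # Smallest signed integer dtype that can hold the values.
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Toy frame standing in for the real data (values are illustrative).
toy = pd.DataFrame({
    "TransactionAmt": np.array([68.5, 29.0, np.nan, 50.0], dtype="float64"),
    "isFraud": np.array([0, 0, 1, 0], dtype="int64"),
})
small = downcast_numeric(toy)

before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(before, "->", after)
```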

In [27]:
# Missing values analysis
df.isnull().mean().sort_values(ascending=False).head(20)

id_24    0.991962
id_25    0.991310
id_07    0.991271
id_08    0.991271
id_21    0.991264
id_26    0.991257
id_22    0.991247
id_27    0.991247
id_23    0.991247
dist2    0.936284
D7       0.934099
id_18    0.923607
D13      0.895093
D14      0.894695
D12      0.890410
id_04    0.887689
id_03    0.887689
D6       0.876068
id_33    0.875895
D8       0.873123
dtype: float64

> Identity features exist only for a small subset of transactions, so their absence may itself be informative (e.g. a user not providing device or identity information could correlate with fraud risk).
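
> To test whether absence is informative, a flag for "has an identity row" can be derived at merge time. This is a sketch on toy frames; the column name `has_identity` is hypothetical and the toy fraud rates say nothing about the real data.

```python
import pandas as pd

# Toy stand-ins for train_transaction.csv and train_identity.csv.
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4, 5, 6],
    "isFraud": [0, 0, 1, 0, 1, 0],
})
identity = pd.DataFrame({
    "TransactionID": [3, 4, 5],
    "DeviceType": ["mobile", "desktop", "mobile"],
})

# indicator=True adds a "_merge" column: "both" when an identity row matched,
# "left_only" when the transaction has no identity record.
merged = transactions.merge(identity, on="TransactionID", how="left", indicator=True)
merged["has_identity"] = (merged["_merge"] == "both").astype("int8")
merged = merged.drop(columns="_merge")

# Fraud rate and row count with and without identity information.
print(merged.groupby("has_identity")["isFraud"].agg(["mean", "count"]))
```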

In [29]:
# Categorical cardinality
df.select_dtypes('object').nunique().sort_values(ascending=False)

DeviceInfo       1786
id_33             260
id_31             130
id_30              75
R_emaildomain      60
P_emaildomain      59
ProductCD           5
card6               4
card4               4
id_34               4
id_15               3
M4                  3
id_23               3
M2                  2
M1                  2
id_12               2
M9                  2
M8                  2
M7                  2
M6                  2
M5                  2
M3                  2
id_27               2
id_28               2
id_29               2
id_16               2
id_35               2
id_36               2
id_37               2
id_38               2
DeviceType          2
dtype: int64

> One-hot encoding is not suitable for the high-cardinality categoricals (e.g. DeviceInfo with 1,786 distinct values), since it would explode the feature space and slow down training. Low-cardinality categoricals work well with tree splits and do not need heavy encoding.
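
> Frequency encoding is one common alternative for high-cardinality columns: each category is replaced by its share of rows, giving a single numeric column. The notebook does not prescribe this technique; the sketch below uses a toy column and the hypothetical name `DeviceInfo_freq`.

```python
import pandas as pd

# Toy column standing in for a high-cardinality categorical.
toy = pd.DataFrame(
    {"DeviceInfo": ["Windows", "iOS Device", "Windows", None, "MacOS", "Windows"]}
)

# Share of non-null rows per category (value_counts ignores NaN by default).
freq = toy["DeviceInfo"].value_counts(normalize=True)

# Map each value to its frequency; missing values get 0.0 here,
# which is one possible convention, not the only one.
toy["DeviceInfo_freq"] = toy["DeviceInfo"].map(freq).fillna(0.0)

print(toy)
```

> In a modelling pipeline the frequencies would normally be computed on the training split only and then applied to validation data.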

In [None]:
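# Missingness distribution
# Assumed definition: fraction of missing values per column
# (`missing` is not defined in the cells shown).
missing = df.isnull().mean()
missing_bins = pd.cut(
    missing,
    bins=[0, 0.1, 0.3, 0.6, 0.9, 1.0],
    labels=["<10%", "10–30%", "30–60%", "60–90%", ">90%"]
)

# Bins are right-closed, so columns with exactly 0% missing fall outside
# the first bin (0, 0.1] and are not counted below.
missing_bins.value_counts()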

60–90%    196
<10%       92
10–30%     90
30–60%     24
>90%       12
Name: count, dtype: int64

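> Most features have moderate to high missingness: 196 columns are 60–90% missing and 12 are more than 90% missing, while 92 columns are less than 10% missing. The bins cover 414 of the 434 columns; the remaining 20 have no missing values and fall outside the first bin. Missingness may reflect user behaviour, device availability, or system constraints.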
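
> The binning above relies on `pd.cut` with its default right-closed intervals, which leaves a value of exactly 0 unbinned. A small sketch with made-up fractions shows the difference `include_lowest=True` makes; the series and its index labels are illustrative.

```python
import pandas as pd

# Made-up missing fractions, one per hypothetical column.
fractions = pd.Series(
    [0.0, 0.05, 0.5, 0.95],
    index=["no_missing", "low", "mid", "high"],
)
bins = [0, 0.1, 0.3, 0.6, 0.9, 1.0]
labels = ["<10%", "10-30%", "30-60%", "60-90%", ">90%"]

# Default: intervals are (0, 0.1], (0.1, 0.3], ... so 0.0 maps to NaN.
default_cut = pd.cut(fractions, bins=bins, labels=labels)

# include_lowest=True makes the first interval include 0.0.
inclusive_cut = pd.cut(fractions, bins=bins, labels=labels, include_lowest=True)

print(default_cut)
print(inclusive_cut)
```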

In [None]:
# Identify constant features (at most one distinct non-null value)
constant_features = [col for col in df.columns if df[col].nunique() <= 1]
print("Constant features:", constant_features)

Constant features: []


> No constant or empty features in the dataset to remove.
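
> The check above only catches strictly constant columns. A near-constant check could look at the share of the most frequent value instead. The helper `near_constant_columns` and its threshold are hypothetical, shown on a toy frame.

```python
import pandas as pd


def near_constant_columns(frame: pd.DataFrame, threshold: float = 0.95) -> list:
    """Columns whose most frequent value (NaN included) covers >= threshold of rows."""
    flagged = []
    for col in frame.columns:
        # value_counts sorts by frequency, so the first entry is the top share.
        top_share = frame[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged


# Toy frame: two dominated columns and one balanced column.
toy = pd.DataFrame({
    "mostly_zero": [0] * 99 + [1],
    "balanced": [0, 1] * 50,
    "mostly_nan": [None] * 99 + [1.0],
})
flagged = near_constant_columns(toy, threshold=0.95)
print(flagged)
```

> Because NaN is counted as a value here, a column such as id_24 (about 99.2% missing) would be flagged at this threshold, so the result needs to be read together with the missingness analysis rather than used to drop columns blindly.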

In [None]:
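# Temporal features
# TransactionDT is treated as seconds elapsed from an unspecified reference time,
# so the hour is relative to that reference, not a confirmed local clock hour.
df["Transaction_hour"] = (df["TransactionDT"] // 3600) % 24
df["Transaction_day"] = (df["TransactionDT"] // 86400)
df.groupby("Transaction_hour")["isFraud"].mean()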

Transaction_hour
0     0.031380
1     0.031314
2     0.037483
3     0.038314
4     0.051890
5     0.070302
6     0.077743
7     0.106102
8     0.093014
9     0.089956
10    0.053212
11    0.038816
12    0.030439
13    0.022889
14    0.024216
15    0.025399
16    0.029511
17    0.031530
18    0.035231
19    0.034738
20    0.034273
21    0.034005
22    0.032694
23    0.036997
Name: isFraud, dtype: float64

> TransactionDT is decomposed into Transaction_hour and Transaction_day. The fraud rate rises above 7% in the early morning hours (5 to 9), peaking at 10.6% at hour 7, compared with about 2 to 4% from hour 11 through hour 3. Transaction_hour therefore looks like a useful signal, while Transaction_day can capture longer-term trends. Fraud patterns may change over time, so concept drift is possible and time-aware validation is important.
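
> A simple form of time-aware validation is a chronological holdout: train on the earliest transactions and validate on the latest ones. This is a sketch, not the notebook's validation scheme; `time_based_split` and the 20% holdout are assumptions, demonstrated on a toy frame.

```python
import pandas as pd


def time_based_split(frame: pd.DataFrame, time_col: str = "TransactionDT",
                     valid_fraction: float = 0.2):
    """Split rows chronologically: earliest rows for training, latest for validation."""
    ordered = frame.sort_values(time_col, kind="stable")
    n_valid = int(round(len(ordered) * valid_fraction))
    cutoff = len(ordered) - n_valid
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Toy frame with increasing time offsets (values are illustrative).
toy = pd.DataFrame({
    "TransactionDT": [86400, 90000, 172800, 200000, 259200,
                      300000, 345600, 400000, 432000, 500000],
    "isFraud": [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
})

# Shuffle first to show that the split does not depend on row order.
shuffled = toy.sample(frac=1.0, random_state=0)
train, valid = time_based_split(shuffled)

print(len(train), len(valid))
print(train["TransactionDT"].max(), "<", valid["TransactionDT"].min())
```

> Unlike a random split, no validation row precedes a training row, so the score reflects how the model handles transactions from a later period.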