In [1]:
!pip install duckdb

Collecting duckdb
  Downloading duckdb-1.4.3-cp39-cp39-win_amd64.whl.metadata (4.3 kB)
Downloading duckdb-1.4.3-cp39-cp39-win_amd64.whl (12.3 MB)

In [2]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import os
import duckdb
import kagglehub



### Use the following bash command to find the cached file path
find ~ -name "financial_fraud_detection_dataset.csv"

If the file cannot be found, download a copy of the dataset to your machine, unzip it, and set the absolute path of the CSV file in the next cell.

- Download link:
https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/data
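
Where `find` is not available (for example on Windows), the same search can be sketched in Python. This is a minimal sketch: `find_dataset_csv` is a hypothetical helper, not part of `kagglehub`, and it simply walks a directory tree looking for a file with the given name.

```python
from pathlib import Path
from typing import Optional


def find_dataset_csv(root: Path, filename: str) -> Optional[Path]:
    """Return the first file named `filename` found under `root`, or None."""
    # rglob walks the tree recursively, similar to `find <root> -name <filename>`
    for match in root.rglob(filename):
        if match.is_file():
            return match
    return None
```

Calling `find_dataset_csv(Path.home(), "financial_fraud_detection_dataset.csv")` mirrors the bash command above. Searching the whole home directory can be slow, so passing a narrower root such as the kagglehub cache folder is usually preferable.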


In [5]:
# Copy and paste the file path of the cached dataset below to read it into a pandas DataFrame
df = pd.read_csv("/Users/User/.cache/kagglehub/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/versions/1/financial_fraud_detection_dataset.csv")
print(df.head())

  transaction_id                   timestamp sender_account receiver_account  \
0        T100000  2023-08-22T09:22:43.516168      ACC877572        ACC388389   
1        T100001  2023-08-04T01:58:02.606711      ACC895667        ACC944962   
2        T100002  2023-05-12T11:39:33.742963      ACC733052        ACC377370   
3        T100003  2023-10-10T06:04:43.195112      ACC996865        ACC344098   
4        T100004  2023-09-24T08:09:02.700162      ACC584714        ACC497887   

    amount transaction_type merchant_category location device_used  is_fraud  \
0   343.78       withdrawal         utilities    Tokyo      mobile     False   
1   419.65       withdrawal            online  Toronto         atm     False   
2  2773.86          deposit             other   London         pos     False   
3  1666.22          deposit            online   Sydney         pos     False   
4    24.43         transfer         utilities  Toronto      mobile     False   

  fraud_type  time_since_last_transact

## Data Exploration and Cleaning Using SQL Queries in DuckDB

METHOD:
- Using a local SQL engine (DuckDB)
    - For complex SQL queries, loading the data into a local analytical database such as DuckDB is very effective. It is fast and can query pandas DataFrames and CSV files directly.
    - DuckDB can query CSV/Parquet files directly and apply filters in SQL, which is more memory efficient than loading the full dataset into pandas first.
    - DuckDB is optimized for analytical queries and can be faster than pandas for complex operations.

   **Workflow**
    - Download the dataset to the local machine.
    - Point the notebook at the CSV path and convert the CSV to Parquet (a columnar format that is much faster to query).
    - Store the Parquet files in a folder within the repo (Parquet files are smaller: here 0.19 GB versus 0.74 GB for the CSV).
    - Run SQL queries directly on the Parquet files without importing them into memory.
    - Do feature engineering in DuckDB SQL or pandas.
    - Save the cleaned and processed Parquet files (shards) for use in other notebooks.


In [6]:
# Define file paths
csv_path = "/Users/User/.cache/kagglehub/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/versions/1/financial_fraud_detection_dataset.csv"
parquet_path = "./raw_data/financial_fraud_detection_dataset.parquet"
cleaned_parquet_path = "./cleaned_data/cleaned_fraud.parquet"

# 1. Check if source CSV exists
if not os.path.exists(csv_path):
    raise FileNotFoundError(f"CSV file not found at {csv_path}")

print(f"📁 Source CSV: {csv_path}")
print(f"📁 Target Parquet: {parquet_path}")
print(f"📊 Original size: {os.path.getsize(csv_path) / (1024**3):.2f} GB")

📁 Source CSV: /Users/User/.cache/kagglehub/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/versions/1/financial_fraud_detection_dataset.csv
📁 Target Parquet: ./raw_data/financial_fraud_detection_dataset.parquet
📊 Original size: 0.74 GB


In [None]:
# 2. Connect to DuckDB and convert the CSV to Parquet

# Create the target directory for parquet_path first
os.makedirs(os.path.dirname(parquet_path), exist_ok=True)

# The f-string interpolates the Python path variables into the SQL text.
# Triple quotes let the SQL statement span multiple lines.
con = duckdb.connect()
con.execute(f"""
COPY (SELECT * FROM read_csv_auto('{csv_path}'))
TO '{parquet_path}' (FORMAT 'parquet', COMPRESSION 'zstd');
""")

# 3. Verify the result
if os.path.exists(parquet_path):
    parquet_size = os.path.getsize(parquet_path) / (1024**3)
    compression_ratio = (1 - parquet_size / (os.path.getsize(csv_path) / (1024**3))) * 100
    print(f"📊 Parquet size: {parquet_size:.2f} GB")
    print(f"🎯 Compression ratio: {compression_ratio:.1f}% reduction")
    
    # Quick verification query
    row_count = con.execute(f"SELECT COUNT(*) FROM '{parquet_path}'").fetchone()[0]
    print(f"🔢 Row count in Parquet: {row_count:,}")
else:
    print("❌ Parquet file was not created")

# The connection is left open for the queries below; close it with con.close() when done
# con.close()


📊 Parquet size: 0.19 GB
🎯 Compression ratio: 74.6% reduction
🔢 Row count in Parquet: 5,000,000


In [8]:
# con = duckdb.connect()

# using DESCRIBE instead of pandas dtypes to avoid loading data into memory
print(con.execute(f"DESCRIBE SELECT * FROM read_parquet('{parquet_path}')").fetch_df())


                    column_name column_type null   key default extra
0                transaction_id     VARCHAR  YES  None    None  None
1                     timestamp   TIMESTAMP  YES  None    None  None
2                sender_account     VARCHAR  YES  None    None  None
3              receiver_account     VARCHAR  YES  None    None  None
4                        amount      DOUBLE  YES  None    None  None
5              transaction_type     VARCHAR  YES  None    None  None
6             merchant_category     VARCHAR  YES  None    None  None
7                      location     VARCHAR  YES  None    None  None
8                   device_used     VARCHAR  YES  None    None  None
9                      is_fraud     BOOLEAN  YES  None    None  None
10                   fraud_type     VARCHAR  YES  None    None  None
11  time_since_last_transaction      DOUBLE  YES  None    None  None
12     spending_deviation_score      DOUBLE  YES  None    None  None
13               velocity_score   

In [14]:
con.execute(f"SELECT * FROM read_parquet('{parquet_path}') LIMIT 5").fetch_df()

Unnamed: 0,transaction_id,timestamp,sender_account,receiver_account,amount,transaction_type,merchant_category,location,device_used,is_fraud,fraud_type,time_since_last_transaction,spending_deviation_score,velocity_score,geo_anomaly_score,payment_channel,ip_address,device_hash
0,T100000,2023-08-22 09:22:43.516168,ACC877572,ACC388389,343.78,withdrawal,utilities,Tokyo,mobile,False,,,-0.21,3,0.22,card,13.101.214.112,D8536477
1,T100001,2023-08-04 01:58:02.606711,ACC895667,ACC944962,419.65,withdrawal,online,Toronto,atm,False,,,-0.14,7,0.96,ACH,172.52.47.194,D2622631
2,T100002,2023-05-12 11:39:33.742963,ACC733052,ACC377370,2773.86,deposit,other,London,pos,False,,,-1.78,20,0.89,card,185.98.35.23,D4823498
3,T100003,2023-10-10 06:04:43.195112,ACC996865,ACC344098,1666.22,deposit,online,Sydney,pos,False,,,-0.6,6,0.37,wire_transfer,107.136.36.87,D9961380
4,T100004,2023-09-24 08:09:02.700162,ACC584714,ACC497887,24.43,transfer,utilities,Toronto,mobile,False,,,0.79,13,0.27,ACH,108.161.108.255,D7637601


In [None]:
# Determine the data collection period
# Check for rows with erroneous timestamps set in the future (> 2025)
# Check that the timestamp format is consistent

# Sort by timestamp, newest first; include transaction_id to identify the row if an error is found
con.execute(f"SELECT transaction_id, timestamp FROM read_parquet('{parquet_path}') ORDER BY timestamp DESC").fetch_df()

Unnamed: 0,transaction_id,timestamp
0,T1280251,2024-01-01 22:58:30.131850
1,T4841484,2024-01-01 22:54:21.281089
2,T2469382,2024-01-01 22:53:53.515483
3,T341139,2024-01-01 22:52:56.620090
4,T681385,2024-01-01 22:50:49.475634
...,...,...
4999995,T3517687,2023-01-01 00:23:15.259766
4999996,T648800,2023-01-01 00:21:19.560899
4999997,T3001064,2023-01-01 00:12:48.028557
4999998,T114745,2023-01-01 00:11:36.452582


The dataset spans one year, from 2023-01-01 to 2024-01-01.

There are no future dates in the timestamps.

The timestamp column does not need to be adjusted. A column of type TIMESTAMP, DATETIME, or TIMESTAMPTZ is already stored in a consistent internal format, and in this dataset the column is stored as TIMESTAMP. If the column were stored as text (VARCHAR/CHAR), it would need to be converted first.

In [None]:
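# Check for logic relating to duplicate/distinct values
# Run a loop to query the count of distinct values for each column, and return the results in a dictionary.
# Note: `df` in the FROM clause is the pandas DataFrame loaded earlier; DuckDB can query it by name.
# COUNT(DISTINCT ...) ignores NULLs.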


results = {}

for col in df.columns:
    query = f"""
        SELECT COUNT(DISTINCT {col}) AS distinct_count
        FROM df
    """
    count = duckdb.query(query).fetchone()[0]
    results[col] = count

results


{'transaction_id': 5000000,
 'timestamp': 4999998,
 'sender_account': 896513,
 'receiver_account': 896639,
 'amount': 217069,
 'transaction_type': 4,
 'merchant_category': 8,
 'location': 8,
 'device_used': 4,
 'is_fraud': 2,
 'fraud_type': 1,
 'time_since_last_transaction': 4103487,
 'spending_deviation_score': 917,
 'velocity_score': 20,
 'geo_anomaly_score': 101,
 'payment_channel': 4,
 'ip_address': 4997068,
 'device_hash': 3835723}
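
The loop above issues one query per column. As an alternative sketch, the same counts can be requested in a single statement by generating one `COUNT(DISTINCT ...)` expression per column. `build_distinct_count_query` is a hypothetical helper, not part of DuckDB; it only builds the SQL string. Column names are double-quoted so that a name such as `timestamp` is always treated as an identifier.

```python
from typing import Iterable


def build_distinct_count_query(columns: Iterable[str], source: str) -> str:
    """Build one SELECT that returns COUNT(DISTINCT col) for every column."""
    select_list = ",\n    ".join(
        # COUNT(DISTINCT ...) ignores NULLs, so an all-NULL column reports 0
        f'COUNT(DISTINCT "{col}") AS "{col}"' for col in columns
    )
    return f"SELECT\n    {select_list}\nFROM {source}"


query = build_distinct_count_query(
    ["transaction_id", "timestamp", "is_fraud"],
    "read_parquet('./raw_data/financial_fraud_detection_dataset.parquet')",
)
print(query)
```

The resulting string can be passed to `con.execute(query).fetch_df()`, which returns a single row with one column per input column.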

In [None]:
# Check for logic relating to duplicate/distinct values
# Run a loop to query the distinct values for each column, and return the results in a dictionary.

unique_values = {}

for col in df.columns:
    query = f"""
        SELECT DISTINCT {col}
        FROM read_parquet('{parquet_path}')
        ORDER BY {col}
    """
    unique_values[col] = con.execute(query).fetch_df()

unique_values


{'transaction_id':         transaction_id
 0              T100000
 1             T1000000
 2             T1000001
 3             T1000002
 4             T1000003
 ...                ...
 4999995        T999995
 4999996        T999996
 4999997        T999997
 4999998        T999998
 4999999        T999999
 
 [5000000 rows x 1 columns],
 'timestamp':                          timestamp
 0       2023-01-01 00:09:26.241974
 1       2023-01-01 00:11:36.452582
 2       2023-01-01 00:12:48.028557
 3       2023-01-01 00:21:19.560899
 4       2023-01-01 00:23:15.259766
 ...                            ...
 4999993 2024-01-01 22:50:49.475634
 4999994 2024-01-01 22:52:56.620090
 4999995 2024-01-01 22:53:53.515483
 4999996 2024-01-01 22:54:21.281089
 4999997 2024-01-01 22:58:30.131850
 
 [4999998 rows x 1 columns],
 'sender_account':        sender_account
 0           ACC100000
 1           ACC100001
 2           ACC100002
 3           ACC100003
 4           ACC100004
 ...               ...
 896508 

#### **Assessment of distinct values by column:**

**transaction_id:** 5 million unique values. This is expected: there are 5 million rows and each transaction should have its own ID.

**timestamp:** 4,999,998 unique values and no nulls, so two transactions share a timestamp with another transaction. To be broken down into month, day of week, and hour during feature engineering.

**sender_account:** 896,513 unique values (may be hashed for PII reasons)

**receiver_account:** 896,639 unique values (may be hashed for PII reasons)

**amount:** 217,069 unique values ranging from 0.01 to 3520.57. We may want to consider converting amount into ranges or categories during feature engineering.

**transaction_type:** 4 unique values; deposit, payment, transfer, withdrawal

**merchant_category:** 8 unique values; entertainment, grocery, online, other, restaurant, retail, travel, utilities

**location:** 8 unique values; Berlin, Dubai, London, New York, Singapore, Sydney, Tokyo, Toronto

**device_used:** 4 unique values; atm, mobile, pos, web

**is_fraud:** 2 unique values; False (0) and True (1), stored as BOOLEAN

**fraud_type:** 2 distinct entries; card_not_present and NULL (none). The distinct count above reports 1 because `COUNT(DISTINCT ...)` ignores NULLs. This column offers little value and is to be deleted.

**time_since_last_transaction:** 4,103,488 distinct entries including NULL (4,103,487 non-null values). Ranges from -8777.814182 to 8757.758483. We may want to convert it into ranges or categories during feature engineering (e.g. less than one minute, less than 5 minutes).

**spending_deviation_score:** 917 unique values; ranging from -5.26 to 5.02

**velocity_score:** 20 unique values; ranges from 1-20

**geo_anomaly_score:** 101 unique values; ranges from 0-1 (decimal values)

**payment_channel:** 4 unique values; ACH, UPI, card, wire_transfer

**ip_address:** 4,997,068 unique values (to be hashed for PII reasons)

**device_hash:** 3,835,723 unique values


In [None]:
# Count of true and false fraud values by payment channel type

query = f"""
    SELECT 
        payment_channel,
        COUNT(*) FILTER (WHERE is_fraud) AS is_fraud_count,
        COUNT(*) FILTER (WHERE NOT is_fraud) AS not_fraud_count
    FROM read_parquet('{parquet_path}')
    GROUP BY payment_channel
    ORDER BY payment_channel DESC
"""

result_df = con.execute(query).fetch_df()
print(result_df)


  payment_channel  is_fraud_count  not_fraud_count
0   wire_transfer           45034          1206185
1            card           44885          1204808
2             UPI           44896          1203951
3             ACH           44738          1205503


In [36]:
# Count of fraud and non-fraud rows where time_since_last_transaction is NULL

query = f"""
    SELECT 
        COUNT(*) FILTER (WHERE is_fraud AND time_since_last_transaction IS NULL) AS is_fraud_count,
        COUNT(*) FILTER (WHERE NOT is_fraud AND time_since_last_transaction IS NULL) AS not_fraud_count
    FROM read_parquet('{parquet_path}')
"""

result_df = con.execute(query).fetch_df()
print(result_df)


   is_fraud_count  not_fraud_count
0               0           896513


In [43]:
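# Total number of rows where time_since_last_transaction is NULL
# (the alias is null_count because the query counts all such rows, not only fraud rows)
query = f"""
    SELECT 
        COUNT(*) AS null_count
    FROM read_parquet('{parquet_path}')
    WHERE time_since_last_transaction IS NULL
"""

result_df = con.execute(query).fetch_df()
print(result_df)


   null_count
0      896513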


## Data Exploration - TODO

- Get the number of columns, column names, and data types.
- Check for type mismatches (e.g. numeric values stored as text).
- Check for rows with missing values (NA/NULL).
- Check which columns contain missing values.
- Check the ratio of fraud to non-fraud cases.
- If rows with missing values (NA, NULL) are excluded, count how many rows remain.
- Check for duplicates.
- Check for outliers.
- Check timestamp consistency.
- Check class imbalance, data leakage/PII, and downstream sample sizes after each filter.

## Notes on EDA

- Row/column completeness impact: Compute how many rows remain after dropping rows with any NA, and after dropping only rows missing critical fields (e.g., is_fraud, amount), so we can plan sample sizes for training.

- Class imbalance and sampling: Measure the fraud:non-fraud ratio and per-group rates (by merchant category, device, location). This informs evaluation metrics and resampling strategies (class weights, SMOTE, stratified sampling).

- Duplicates and identity checks: Look for duplicate transaction_id values or repeated (sender, receiver, timestamp, amount) tuples. Duplicates can bias counts and model training.

- Outliers and distributions: Inspect amount, time_since_last_transaction, and the anomaly scores for extreme values and skew. Decide between winsorizing, log transforms, or robust scaling. Visualize with histograms or quantile summaries.

- Timestamp and temporal integrity: Check for timezone issues, future dates, or inconsistent formats. Verify monotonicity for per-account sequences if we'll use time-based features.

- For merchant_category, location, and device_used, check unique counts and frequency tails. Rare categories may need grouping into "other" or target encoding.

- Compute a correlation matrix for the numeric features and check for highly correlated predictors that may harm some models.

- Validate is_fraud and fraud_type consistency; ensure no features leak the label (e.g., fraud_flag derived from is_fraud). Check that features available at prediction time won't include future info.

- PII and privacy: Identify columns with PII (account IDs, IPs, device hashes). Decide on hashing/anonymization and access controls before sharing data.

## Next, Find Patterns and Relationships in the Dataset
- Find patterns and relationships: bivariate analysis, correlations, time-series patterns per account, and group-level fraud rates.
- Run feature importance checks to see which features/variables are most important to the target variable (is_fraud).

## Then We Proceed to Feature Engineering
- Feature Engineering is the process of creating new, more informative columns (features) from our raw data to help machine learning models detect patterns better.

- A model looking at raw transaction data might miss subtle fraud patterns. But engineered features can make those patterns obvious.

## Examples
- hour_of_day (from timestamp)
- is_weekend (1 if Saturday/Sunday, otherwise 0)
- log_amount (logarithm of transaction amount)
- merchant_risk_score (categorize merchants as high/medium/low risk)
- amount_deviation = (amount - customer_avg_amount) / customer_avg_amount
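
As a rough illustration, most of these features can be derived from a single transaction in plain Python. `engineer_features` is a hypothetical helper, and the `customer_avg_amount` of 300.0 in the usage line is an assumed value, not taken from the dataset. `merchant_risk_score` is left out because it needs a risk mapping that has not been defined yet.

```python
import math
from datetime import datetime


def engineer_features(timestamp: datetime, amount: float, customer_avg_amount: float) -> dict:
    """Derive example features from one transaction (illustrative only)."""
    return {
        "hour_of_day": timestamp.hour,
        # weekday(): Monday = 0 ... Sunday = 6, so 5 and 6 are the weekend
        "is_weekend": int(timestamp.weekday() >= 5),
        # Natural log; requires amount > 0 (the minimum observed amount is 0.01)
        "log_amount": math.log(amount),
        # Relative deviation from the customer's average; requires a non-zero average
        "amount_deviation": (amount - customer_avg_amount) / customer_avg_amount,
    }


# First row of the dataset, with an assumed customer average of 300.0
features = engineer_features(datetime(2023, 8, 22, 9, 22, 43), amount=343.78, customer_avg_amount=300.0)
print(features)
```

At full scale the same logic would more likely be expressed in DuckDB SQL or vectorized pandas operations than row by row.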