In [None]:
!pip install duckdb

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import os
import duckdb
import kagglehub

### Use the following bash command to find the cached file path
find ~ -name "financial_fraud_detection_dataset.csv"

If the file is not found, download a copy of the dataset to your machine, unzip it, and set the absolute path of the CSV file in the next cell.

- Download link:
https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/data


In [None]:
# Copy and paste the file path of the cached dataset below to read it into a pandas DataFrame
df = pd.read_csv("/Users/joshuaokojie/.cache/kagglehub/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/versions/1/financial_fraud_detection_dataset.csv")
print(df.head())

## Data Exploration and Cleaning Using SQL Queries in DuckDB

METHOD:
- Using a local SQL engine (DuckDB)
    - For complex SQL queries, loading the data into a local analytical database like DuckDB is very effective. It is fast and can query pandas DataFrames, CSV files, and Parquet files directly.
    - Filtering in SQL directly against the CSV/Parquet file is more memory efficient than loading the full dataset into pandas first.
    - DuckDB is optimized for analytical queries and can be faster than pandas for complex operations.

**Workflow**
- Download the dataset to the local machine
- Point the notebook at the CSV path and convert the CSV to Parquet (a columnar format that is much faster to query)
- Store the Parquet files in a folder within the repo (Parquet files are smaller)
- Run SQL queries directly on the Parquet files without loading them fully into memory
- Feature engineering (DuckDB SQL or pandas)
- Save the cleaned and processed Parquet shards/files for use in other notebooks


In [None]:
# Define file paths
csv_path = "/Users/joshuaokojie/.cache/kagglehub/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/versions/1/financial_fraud_detection_dataset.csv"
parquet_path = "./raw_data/financial_fraud_detection_dataset.parquet"
cleaned_parquet_path = "./cleaned_data/cleaned_fraud.parquet"

# 1. Check if source CSV exists
if not os.path.exists(csv_path):
    raise FileNotFoundError(f"CSV file not found at {csv_path}")

print(f"üìÅ Source CSV: {csv_path}")
print(f"üìÅ Target Parquet: {parquet_path}")
print(f"üìä Original size: {os.path.getsize(csv_path) / (1024**3):.2f} GB")

In [None]:
# 2. Connect to DuckDB and convert the CSV to Parquet

# Create the output directory for parquet_path first
os.makedirs(os.path.dirname(parquet_path), exist_ok=True)

con = duckdb.connect()
con.execute(f"""
COPY (SELECT * FROM read_csv_auto('{csv_path}'))
TO '{parquet_path}' (FORMAT 'parquet', COMPRESSION 'zstd');
""")

# 3. Verify the result
if os.path.exists(parquet_path):
    parquet_size = os.path.getsize(parquet_path) / (1024**3)
    # Percentage by which the Parquet file is smaller than the source CSV
    compression_ratio = (1 - parquet_size / (os.path.getsize(csv_path) / (1024**3))) * 100
    print(f"📊 Parquet size: {parquet_size:.2f} GB")
    print(f"🎯 Compression ratio: {compression_ratio:.1f}% reduction")

    # Quick verification query
    row_count = con.execute(f"SELECT COUNT(*) FROM '{parquet_path}'").fetchone()[0]
    print(f"🔢 Row count in Parquet: {row_count:,}")
else:
    print("❌ Parquet file was not created")

# To Close DB connection, but can be left open for further queries
# con.close()


📊 Parquet size: 0.19 GB
🎯 Compression ratio: 74.6% reduction
🔢 Row count in Parquet: 5,000,000


In [None]:
# con = duckdb.connect()

# using DESCRIBE instead of pandas dtypes to avoid loading data into memory
print(con.execute(f"DESCRIBE SELECT * FROM read_parquet('{parquet_path}')").fetch_df())


## Data Exploration - TODO

- Get the number of columns, the column names, and their data types.
- Check for type mismatches (e.g. numeric values stored as text).
- Check for rows with missing values/NA.
- Check which columns contain missing values/NA.
- Check the ratio of fraud to non-fraud cases.
- If rows with missing values (NA, null) are excluded, how many rows would be left?
- Check for duplicates.
- Check for outliers.
- Check timestamp consistency.
- Check class imbalance, data leakage/PII, and downstream sample sizes after each filter.

## Notes on EDA

- Row/column completeness impact: Compute how many rows remain after dropping rows with any NA and after dropping only rows missing critical fields (e.g., is_fraud, amount) so you can plan sample sizes for training.

- Class imbalance and sampling: Measure the fraud:non-fraud ratio and per-group rates (by merchant, device, country). This informs evaluation metrics and resampling strategies (class weights, SMOTE, stratified sampling).

- Duplicates and identity checks: Look for duplicate transaction_id values or repeated (sender, receiver, timestamp, amount) tuples. Duplicates can bias counts and model training.

- Outliers and distributions: Inspect amount, time_since_last_transaction, and anomaly scores for extreme values and skew. Decide between winsorizing, log transforms, or robust scaling. Visualize with histograms or quantile summaries.

- Timestamp and temporal integrity: Check for timezone issues, future dates, or inconsistent formats. Verify monotonicity for per-account sequences if we'll use time-based features.

- For merchant_category, location, and device_used, check unique counts and frequency tails. Rare categories may need grouping into "other" or target encoding.

- Compute a correlation matrix for numeric features and check for highly correlated predictors that may harm some models.

- Validate is_fraud and fraud_type consistency; ensure no features leak the label (e.g., fraud_flag derived from is_fraud). Check that features available at prediction time won't include future info.

- PII and privacy: Identify columns with PII (account IDs, IPs, device hashes). Decide on hashing/anonymization and access controls before sharing data.
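
Three of the notes above (duplicates, outliers, and rare categories) can be sketched in pandas as follows. This is a minimal sketch on hypothetical toy data; the 1.5 × IQR fence and the `min_count = 2` threshold are illustrative choices, not settings taken from this notebook.

```python
import pandas as pd

# Hypothetical toy data; column names follow the notes above
tx = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T2", "T3", "T4", "T5"],
    "amount": [20.0, 35.0, 35.0, 50.0, 10_000.0, 15.0],
    "merchant_category": ["grocery", "grocery", "grocery",
                          "travel", "jewelry", "grocery"],
})

# Duplicates: flag every row whose transaction_id appears more than once
dup_mask = tx.duplicated(subset="transaction_id", keep=False)
n_duplicate_rows = int(dup_mask.sum())
deduped = tx.drop_duplicates(subset="transaction_id", keep="first")

# Outliers: quantile summary plus a simple upper IQR fence (assumed rule)
q1, q3 = deduped["amount"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = deduped[deduped["amount"] > upper_fence]

# Rare categories: group anything seen fewer than min_count times into "other"
min_count = 2  # assumed threshold for illustration
counts = deduped["merchant_category"].value_counts()
rare = counts[counts < min_count].index
deduped = deduped.assign(
    merchant_category_grouped=deduped["merchant_category"].where(
        ~deduped["merchant_category"].isin(rare), "other"
    )
)

print(n_duplicate_rows, upper_fence)
print(outliers[["transaction_id", "amount"]])
print(deduped["merchant_category_grouped"].value_counts())
```

On the full dataset the same logic could be expressed in DuckDB SQL instead, so that only the summary results are pulled into pandas.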

## Next, We Find Patterns and Relationships in the Dataset
- Look at bivariate analysis, correlations, time-series patterns per account, and group-level fraud rates.
- Run feature importance checks to see which features/variables are most important to the target variable (is_fraud).
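
Group-level fraud rates and a simple correlation check might look like the sketch below. The data is a hypothetical six-row frame; `device_used`, `amount`, and `is_fraud` are column names assumed from the notes in this notebook, with `is_fraud` encoded as 0/1.

```python
import pandas as pd

# Hypothetical toy data
tx = pd.DataFrame({
    "device_used": ["mobile", "mobile", "web", "web", "atm", "atm"],
    "amount": [20.0, 900.0, 35.0, 60.0, 1500.0, 1200.0],
    "is_fraud": [0, 1, 0, 0, 1, 1],
})

# Fraud rate per device, with the group size kept alongside so that
# rates computed on very small groups are easy to spot
fraud_by_device = (
    tx.groupby("device_used")["is_fraud"]
      .agg(fraud_rate="mean", n_transactions="count")
      .sort_values("fraud_rate", ascending=False)
)

# Pearson correlation between numeric features and the target
corr = tx[["amount", "is_fraud"]].corr()

print(fraud_by_device)
print(corr)
```

A correlation with a binary target only captures linear association, so it is a first pass rather than a substitute for the feature importance checks mentioned above.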

## Then We Proceed to Feature Engineering
- Feature engineering is the process of creating new, more informative columns (features) from our raw data to help machine learning models detect patterns better.

- A model looking at raw transaction data might miss subtle fraud patterns, but engineered features can make those patterns obvious.

### Examples
- hour_of_day (from timestamp)
- is_weekend (1 if Saturday/Sunday)
- log_amount (logarithm of transaction amount)
- merchant_risk_score (categorize merchants as high/medium/low risk)
- amount_deviation = (amount - customer_avg_amount) / customer_avg_amount
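
Most of these example features can be sketched in pandas as shown below. This is a minimal sketch on hypothetical toy data: the `sender_account` column name is an assumption, `np.log1p` (log of 1 + amount) is used instead of a plain logarithm so that zero amounts do not produce `-inf`, and `merchant_risk_score` is left out because it needs a risk mapping that this notebook does not define.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; `sender_account` is an assumed column name
tx = pd.DataFrame({
    "sender_account": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2024-01-05 09:15:00",  # Friday
        "2024-01-06 23:40:00",  # Saturday
        "2024-01-08 14:00:00",  # Monday
        "2024-01-07 03:30:00",  # Sunday
        "2024-01-09 12:00:00",  # Tuesday
    ]),
    "amount": [100.0, 100.0, 400.0, 50.0, 150.0],
})

# Time-based features
tx["hour_of_day"] = tx["timestamp"].dt.hour
tx["is_weekend"] = (tx["timestamp"].dt.dayofweek >= 5).astype(int)  # 5=Sat, 6=Sun

# Log-transformed amount to reduce right skew
tx["log_amount"] = np.log1p(tx["amount"])

# Relative deviation from each customer's average amount
tx["customer_avg_amount"] = tx.groupby("sender_account")["amount"].transform("mean")
tx["amount_deviation"] = (
    (tx["amount"] - tx["customer_avg_amount"]) / tx["customer_avg_amount"]
)

print(tx[["hour_of_day", "is_weekend", "log_amount", "amount_deviation"]])
```

Note that `customer_avg_amount` here averages over all of a customer's transactions, including later ones. For a model that must score transactions as they arrive, an average over past transactions only would avoid leaking future information.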