# **1.0 - Data Ingestion and Anonymization**

_by Michael Joshua Vargas_

This document outlines, at a conceptual level, the initial data ingestion and anonymization process for the Bank Fraud Detection System. The source code for this step is not included because the data is sensitive and the anonymization and hashing procedures are proprietary.

## Purpose

This stage is responsible for securely ingesting raw transaction data and transforming it into an anonymized format suitable for further analysis and model development. It ensures that sensitive customer information is protected while retaining the necessary data integrity and relationships for fraud detection.

## Input

The process begins with raw, unanonymized transaction data, typically in a structured format (e.g., Parquet files) containing various customer and transaction identifiers.

## Key Processes

1.  **Data Loading:** Raw data is loaded from its source.
2.  **Identifier Anonymization:**
    *   Sensitive identifier columns (e.g., `profile_id`, `account_no`, `full_name`, `username`, `gr_card_no`, `cellphone`, `account_number`, `source_account_number`, `destination_account_number`) are anonymized.
    *   A one-way cryptographic hash function, often combined with a private salt, is applied to these fields. Each original value is replaced with a fixed-length digest string that cannot feasibly be reversed. Because the same input always produces the same digest, records that shared an identifier before anonymization can still be linked afterwards.
3.  **Initial Data Type Correction:** Basic data type adjustments are performed to ensure consistency (e.g., converting date/time strings to datetime objects).
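
Because the actual anonymization code is proprietary, the following is only a minimal sketch of how steps 2 and 3 could look, not the procedure used in this project. It assumes HMAC-SHA256 as the salted one-way hash, uses a placeholder `SECRET_SALT`, and invents a `transaction_date` column for the type correction; the helper names `anonymize_value` and `process_record` are likewise hypothetical.

```python
import hashlib
import hmac
from datetime import datetime

# Hypothetical placeholder. A real salt would be loaded from a secrets store,
# never hard-coded or committed to the repository.
SECRET_SALT = b"example-salt-not-for-production"

# A subset of the identifier columns listed above, for illustration only.
ID_COLUMNS = ["profile_id", "account_no", "full_name"]


def anonymize_value(value, salt=SECRET_SALT):
    """Return the HMAC-SHA256 hex digest of a value, keyed with the salt."""
    if value is None:
        return None  # keep missing values missing
    return hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def process_record(record):
    """Hash the identifier fields and parse the timestamp of one record."""
    out = dict(record)  # copy so the raw record is left untouched
    for col in ID_COLUMNS:
        if col in out:
            out[col] = anonymize_value(out[col])
    # `transaction_date` is an assumed column name holding an ISO 8601 string.
    if isinstance(out.get("transaction_date"), str):
        out["transaction_date"] = datetime.fromisoformat(out["transaction_date"])
    return out


# Made-up sample rows: two transactions by the same customer.
raw = [
    {"profile_id": 101, "full_name": "Jane Doe",
     "transaction_date": "2024-01-15 09:30:00", "amount": 250.0},
    {"profile_id": 101, "full_name": "Jane Doe",
     "transaction_date": "2024-01-16 14:05:00", "amount": 75.5},
]
anonymized = [process_record(r) for r in raw]

# The same customer maps to the same digest in both rows, so linking still works.
print(anonymized[0]["profile_id"] == anonymized[1]["profile_id"])
```

Keying the hash with a secret salt means that someone who knows the hash function but not the salt cannot rebuild the mapping by hashing guessed identifiers. The same salt must be reused across runs if digests need to stay comparable between datasets.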

## Output

The primary output of this stage is the `data/interim/1.0_initial_eda_dataset.parquet` file. This file contains the anonymized and initially processed data, ready to be consumed by subsequent data preparation and exploration stages.

This conceptual overview is intended to keep the data pipeline transparent while adhering to data privacy and security requirements.