# Enhanced Fraud Detection using Graph Neural Networks
## Introduction
Learn to boost fraud detection accuracy and developer efficiency with Intel's end-to-end, no-code, graph-neural-network-boosted, multi-node distributed workflows.
Check out more workflow examples and reference implementations in the [Developer Catalog](https://developer.intel.com/aireferenceimplementations).
## Solution Technical Overview
Fraud detection has traditionally been tackled with classical machine learning algorithms such as gradient boosted machines. However, such supervised machine learning algorithms can yield unsatisfactory precision and recall for several reasons:
- Severe class imbalance: fraudulent transactions typically make up less than 1% of all transactions
- Complex fraudster behavior that evolves over time: user behavior is difficult to capture with traditional ML techniques
- Scale of data: credit card transaction datasets can contain billions of transactions, which requires distributed preprocessing and training
- Latency of fraud detection: fraud must be detected quickly to minimize losses, which highlights the need for distributed inference
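
To illustrate why the class imbalance listed above makes plain accuracy misleading, here is a minimal sketch. The counts are illustrative and are not taken from the dataset used in this kit. It evaluates a trivial classifier that labels every transaction as non-fraud:

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Illustrative data: 5 fraudulent transactions out of 1,000 (0.5%).
y_true = [1] * 5 + [0] * 995
# A trivial "model" that never flags fraud.
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

The classifier reaches 99.5% accuracy while catching no fraud at all, which is why precision and recall are the metrics that matter here.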

In Intel's Enhanced Fraud Detection reference kit, we employ Graph Neural Networks (GNNs), which are popular for their ability to capture complex behavioral patterns (e.g., fraudsters performing multiple small transactions from different cards to avoid getting caught). We also demonstrate a boost in accuracy from using GNN-boosted features over a baseline trained on traditional ML-only features.

## Validated Hardware Details
The hardware and software setup requirements depend on how the workflow
is run. A bare metal development system and a locally running Docker\* image
have the same system requirements.

| Component | Requirement |
| --------- | ----------- |
| Processor | Intel® 1st, 2nd, 3rd, and 4th Gen Xeon® Scalable Performance processors (FP32 precision) |
| Memory    | >200GB |
| Storage   | >50GB |

## How It Works
The high-level architecture of the reference use case is shown in the diagram below. To demonstrate the capabilities outlined in the Solution Technical Overview section, this reference use case uses a credit card transaction dataset open-sourced by IBM (commonly known as the TabFormer dataset).

![architecture](assets/architecture.png)


### Task 1: Feature Engineering (Edge Featurization)
The feature engineering stage ingests the raw data, encodes each column into features using the logic defined in the feature engineering YAML config file, and saves the processed data.
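
As a rough illustration of config-driven column encoding, the sketch below applies one rule per column. The rule names, column names, and sample values are all hypothetical; the actual schema of the kit's YAML config file is not shown here and may differ.

```python
# Hypothetical config: column name -> encoding rule. The real YAML schema
# used by the reference kit may differ.
FEATURE_CONFIG = {
    "amount": "currency",     # "$12.50" -> 12.5
    "time": "minutes",        # "06:21"  -> 381 (minutes since midnight)
    "use_chip": "category",   # string   -> integer id
}


def encode_rows(rows, config):
    """Encode each configured column of every row into a numeric feature."""
    category_ids = {col: {} for col, rule in config.items() if rule == "category"}
    encoded = []
    for row in rows:
        features = {}
        for col, rule in config.items():
            value = row[col]
            if rule == "currency":
                features[col] = float(value.replace("$", "").replace(",", ""))
            elif rule == "minutes":
                hours, minutes = value.split(":")
                features[col] = int(hours) * 60 + int(minutes)
            elif rule == "category":
                ids = category_ids[col]
                # Reuse the id of a known category or assign the next free one.
                features[col] = ids.setdefault(value, len(ids))
            else:
                raise ValueError(f"Unknown encoding rule: {rule}")
        encoded.append(features)
    return encoded


# Illustrative rows, not real data.
raw = [
    {"amount": "$12.50", "time": "06:21", "use_chip": "Swipe Transaction"},
    {"amount": "$1,200.00", "time": "23:05", "use_chip": "Online Transaction"},
    {"amount": "$3.99", "time": "00:10", "use_chip": "Swipe Transaction"},
]
for features in encode_rows(raw, FEATURE_CONFIG):
    print(features)
```

Keeping the rules in a config mapping rather than in code is what lets the encoding be changed without editing the pipeline itself.
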
### Task 2: GNN Training (Node Featurization)
The GNN training stage creates homogeneous graphs from the processed data generated by Task 1 and trains a GraphSAGE model on a self-supervised link prediction task to learn latent representations of the nodes (cards and merchants). Once the GNN model is trained, the GNN workflow concatenates the card and merchant features generated by the model to the corresponding transaction features and saves the GNN-boosted features to a CSV file.
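
The concatenation step can be sketched as follows. This is a simplified illustration, not the kit's implementation: the helper names are hypothetical, cards and merchants share one id space to mirror a homogeneous graph, and the embeddings are placeholder vectors standing in for what a trained GraphSAGE model would produce.

```python
def build_node_index(transactions):
    """Assign a unique integer id to each card and each merchant."""
    node_ids = {}
    for txn in transactions:
        for key in (("card", txn["card"]), ("merchant", txn["merchant"])):
            node_ids.setdefault(key, len(node_ids))
    return node_ids


def boost_features(transactions, node_ids, embeddings):
    """Append card and merchant embeddings to each transaction's edge features."""
    boosted = []
    for txn in transactions:
        card_vec = embeddings[node_ids[("card", txn["card"])]]
        merchant_vec = embeddings[node_ids[("merchant", txn["merchant"])]]
        boosted.append(list(txn["edge_features"]) + list(card_vec) + list(merchant_vec))
    return boosted


# Illustrative transactions: each one is an edge between a card and a merchant.
transactions = [
    {"card": "c1", "merchant": "m1", "edge_features": [12.5, 381.0]},
    {"card": "c2", "merchant": "m1", "edge_features": [1200.0, 1385.0]},
    {"card": "c1", "merchant": "m2", "edge_features": [3.99, 10.0]},
]
node_ids = build_node_index(transactions)

# Placeholder 2-dimensional embeddings; a trained GNN would supply real ones.
embeddings = {node_id: [float(node_id), node_id + 0.5] for node_id in node_ids.values()}

for row in boost_features(transactions, node_ids, embeddings):
    print(row)
```

Each output row holds the original edge features followed by the card vector and then the merchant vector, which is the shape of input the downstream classifier would consume.
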
### Task 3: XGBoost Training (Fraud Classification)
The XGBoost training stage trains a binary classification model using the data splitting, model parameters, and runtime parameters set in the XGB training YAML config file. AUCPR (Area Under the Precision-Recall Curve) is used as the evaluation metric because it is robust when evaluating highly imbalanced datasets. The data is split by temporal sequence to simulate a real-life scenario. The model's performance on the TabFormer dataset is reported in the table in the results section of README.md.
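
The two ideas in this task, a time-based split and a precision-recall metric, can be sketched in plain Python. The year boundaries and scores below are illustrative assumptions; the kit's actual split is defined in its YAML config. The metric shown is average precision, a common approximation of AUCPR; XGBoost's built-in `aucpr` metric may interpolate differently, so values can differ slightly.

```python
def temporal_split(rows, train_end_year, valid_end_year):
    """Split rows by year so that training data precedes validation and test data."""
    train = [r for r in rows if r["year"] <= train_end_year]
    valid = [r for r in rows if train_end_year < r["year"] <= valid_end_year]
    test = [r for r in rows if r["year"] > valid_end_year]
    return train, valid, test


def average_precision(y_true, scores):
    """Approximate AUCPR as the mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(y_true)
    if total_pos == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_pos


# Illustrative rows and boundaries, not the kit's actual split.
rows = [{"year": 2015}, {"year": 2018}, {"year": 2019}, {"year": 2020}]
train, valid, test = temporal_split(rows, train_end_year=2017, valid_end_year=2018)
print(len(train), len(valid), len(test))

# Illustrative labels (1 = fraud) and model scores.
y_true = [0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3, 0.05, 0.4]
print(round(average_precision(y_true, scores), 4))
```

Splitting by time rather than at random prevents the model from being evaluated on transactions that precede its training data.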

## Run Using Docker
Follow these instructions to set up and run a single node pipeline with our provided Docker image.
To run the distributed pipeline on bare metal, see the bare metal instructions in README.md.

Before running the next cell, follow the instructions in the ```Getting Started``` section of README.md. Once you have completed those steps and
declared the environment variables in the same terminal that will launch Jupyter Lab, continue with this notebook. Environment variables declared in that terminal are available to this notebook, and the next cell assigns them to Python variables.

In [None]:
import os
WORKSPACE = os.environ['WORKSPACE']
DATASET_DIR = os.environ['DATASET_DIR']
print("Work dir: {}".format(WORKSPACE))
print("Dataset dir: {}".format(DATASET_DIR))
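
If one of these variables is missing, the cell above fails with a bare `KeyError`. As an optional alternative, the sketch below reports every missing variable at once. The helper name `read_required_env` is hypothetical and not part of the reference kit, and the demonstration uses an explicit mapping with placeholder paths instead of the real environment.

```python
import os


def read_required_env(names, environ=None):
    """Return {name: value} for all names, or raise listing every missing one."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Declare them in the terminal that launches Jupyter Lab."
        )
    return {name: environ[name] for name in names}


# Demonstration with an explicit mapping (placeholder paths).
example_env = {"WORKSPACE": "/tmp/work", "DATASET_DIR": "/tmp/data"}
settings = read_required_env(["WORKSPACE", "DATASET_DIR"], example_env)
print(settings["WORKSPACE"], settings["DATASET_DIR"])
```

In the notebook itself, the call would be `read_required_env(["WORKSPACE", "DATASET_DIR"])`, which reads from `os.environ`.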

#### Set Up Docker Engine
You'll need to install Docker Engine on your development system.
Note that while **Docker Engine** is free to use, **Docker Desktop** may require
you to purchase a license.  See the [Docker Engine Server installation
instructions](https://docs.docker.com/engine/install/#server) for details.

#### Set Up Docker Compose
Ensure you have Docker Compose installed on your machine. If you don't have this tool installed, consult the official [Docker Compose installation documentation](https://docs.docker.com/compose/install/linux/#install-the-plugin-manually).

In [None]:
import os
# DOCKER_CONFIG must name a single directory; fall back to ~/.docker when it is unset.
DOCKER_CONFIG = os.getenv('DOCKER_CONFIG') or os.path.join(os.path.expanduser('~'), '.docker')
!mkdir -p $DOCKER_CONFIG/cli-plugins
!curl -SL https://github.com/docker/compose/releases/download/v2.7.0/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
!chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
!docker compose version

### Set Up Docker Image
Build or pull the provided docker image.

In [None]:
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker compose build

OR

In [None]:
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker pull intel/ai-workflows:beta-fraud-detection-classical-ml
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker pull intel/ai-workflows:beta-fraud-detection-gnn

### Run Pipeline with Docker Compose
#### Run feature engineering to get edge features
The `preprocess` workflow will ingest the raw data in the ```$DATASET_DIR/raw_data/``` directory, generate a preprocessed CSV file, and save it in the ```$OUTPUT_DIR/data/edge_data/``` directory.

Run the `preprocess` workflow with the following command:

In [None]:
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker compose run preprocess 2>&1 | tee preprocess.log

The table below shows some of the environment variables you can control according to your needs.

| Environment Variable Name | Default Value | Description |
| --- | --- | --- |
| CONFIG_DIR | `${WORKSPACE}/credit-card-fraud-detection/configs`       | Configurations directory |
| OUTPUT_DIR | `${WORKSPACE}/credit-card-fraud-detection/docker/output` | Logfile and Checkpoint output |
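
To override one of these variables for a single run from Python, one option is to pass a modified environment to the `docker compose` process. The sketch below only builds the command and environment and does not start any container. The helper name `compose_invocation` and all paths are hypothetical, and it assumes the compose file reads `OUTPUT_DIR` and `CONFIG_DIR` from the calling environment, as the table suggests.

```python
import os


def compose_invocation(service, workdir, overrides=None, base_env=None):
    """Build the command, working directory, and environment for one workflow run."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(overrides or {})
    command = ["docker", "compose", "run", service]
    return command, workdir, env


command, cwd, env = compose_invocation(
    "preprocess",
    workdir="/path/to/workspace/credit-card-fraud-detection/docker",  # placeholder
    overrides={"OUTPUT_DIR": "/tmp/fraud-output"},  # illustrative path
    base_env={"WORKSPACE": "/path/to/workspace"},  # placeholder
)
print(" ".join(command))
print(env["OUTPUT_DIR"])
# To execute for real: subprocess.run(command, cwd=cwd, env=env, check=True)
```

Copying the environment before updating it keeps the override local to that one run, so other cells in the notebook are unaffected.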

#### Train and evaluate XGBoost model with edge features only

The `preprocess` workflow must complete successfully before you run `baseline-training`.

The `baseline-training` workflow consumes the CSV file generated by the `preprocess` workflow above and trains an XGBoost model. It also prints the AUCPR (Area Under the Precision-Recall Curve) results to the console.

Run the `baseline-training` workflow with the command below.

In [None]:
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker compose run baseline-training 2>&1 | tee baseline-training.log

The table below shows some of the environment variables you can control according to your needs.

| Environment Variable Name | Default Value | Description |
| --- | --- | --- |
| CONFIG_DIR | `${WORKSPACE}/credit-card-fraud-detection/configs`       | Configurations directory |
| OUTPUT_DIR | `${WORKSPACE}/credit-card-fraud-detection/docker/output` | Logfile and Checkpoint output |

#### Train and Evaluate XGBoost model with both edge features and GNN generated node features

To see the improvement over the baseline training, run the `xgb-training` workflow. The `preprocess` workflow must complete successfully before you run `xgb-training`.

The `xgb-training` workflow consumes the CSV file generated by `preprocess` above, runs the `gnn-analytics` pipeline to generate optimized features, and trains an XGBoost model using these features. It also prints the AUCPR (Area Under the Precision-Recall Curve) results to the console.

Note: Because this step runs GNN training first, no output is expected for a while. Once GNN training finishes, output from XGBoost training starts to appear. You can also follow the GNN log while `xgb-training` runs by using the following command from a terminal inside `$WORKSPACE/credit-card-fraud-detection/docker`:
```bash
docker compose logs gnn-analytics -f
```
After running the command from terminal, skip the next cell and run `xgb-training`. Alternatively, you can check GNN log from this notebook by running the following cell.

In [None]:
# To run this cell, first run the cell below; once it has finished, run this cell.
# The log stream keeps running until you stop it with "Interrupt the kernel".
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker compose logs gnn-analytics -f

Run the `xgb-training` container with the command below.

In [None]:
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker compose run xgb-training 2>&1 | tee xgb-training.log

This command implicitly runs the `gnn-analytics` workflow to generate the node features, then combines them with the edge features generated by the [`preprocess` workflow](#run-feature-engineering-to-get-edge-features) to train the XGBoost model. It prints the AUCPR (Area Under the Precision-Recall Curve) results to the console.

Note: This step runs the GNN training first, which can take several hours to finish.

The table below shows some of the environment variables you can control according to your needs.

| Environment Variable Name | Default Value | Description |
| --- | --- | --- |
| CONFIG_DIR | `${WORKSPACE}/credit-card-fraud-detection/configs`       | Configurations directory |
| OUTPUT_DIR | `${WORKSPACE}/credit-card-fraud-detection/docker/output` | Logfile and Checkpoint output |

#### View Logs
Run these commands to check the `preprocess`, `baseline-training`, and `xgb-training` logs:

In [None]:
!echo $'========================Preprocess========================'
!cat $WORKSPACE/credit-card-fraud-detection/docker/preprocess.log
!echo $'\n====================Baseline Training====================='
!cat $WORKSPACE/credit-card-fraud-detection/docker/baseline-training.log
!echo $'\n=======================XGB Training======================='
!cat $WORKSPACE/credit-card-fraud-detection/docker/xgb-training.log

You can also check GNN log using the following commands.

In [None]:
#The process will keep running until interrupted by stopping it with "Interrupt the kernel".
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker compose logs gnn-analytics -f

#### Reproducing results
After running hyperparameter optimization (HPO), we found the best parameters for the baseline and final models. You can use them to reproduce the results reported in README.md by running the following cells.

In [None]:
# Comments out the `hpo_spec` section of the config files and uncomments the `model_spec` section.
# Passing "hpo_spec" as the argument restores the config files to their original state.
def change_case(case):
    file_list = ["baseline-xgb-training.yaml", "xgb-training.yaml"]
    config_path = "/credit-card-fraud-detection/configs/single-node/"
    compose_file = WORKSPACE+"/credit-card-fraud-detection/docker/docker-compose.yml"
    
    with open(compose_file, "r") as f:
        content_list = f.readlines()

    #74:77 is the range of lines to comment or uncomment from the docker-compose.yml file.
    if case == "model_spec":
        content_list[74:77] = ["#"+elem if elem[0] !="#" else elem for elem in content_list[74:77]]
    elif case == "hpo_spec":
        content_list[74:77] = [elem[1:] if elem[0] =="#" else elem for elem in content_list[74:77]]

    with open(compose_file, "w") as f:
        f.writelines(content_list)
    
    for file in file_list:
        path = WORKSPACE+config_path+file
        yml_info = []
        with open(path, 'r') as f:
            for line in f:
                if "data_spec:" in line or "hpo_spec:" in line or "model_spec:" in line:
                    yml_info.append([])

                if "#" == line[0]:
                    yml_info[-1].append(line[2:])
                else:
                    yml_info[-1].append(line)

        with open(path, 'w') as f:            
            for config in yml_info:
                if  "data_spec:" in config[0] or case in config[0]:
                    for elem in config:
                        f.write(elem)
                else:
                    for elem in config:
                        f.write("# "+elem)
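
To preview what the section toggling does without touching any file, the same logic can be applied to an in-memory list of lines. This is a simplified sketch: the helper name `toggle_sections` and the YAML keys under each section are hypothetical. Like `change_case`, it assumes the first line of the file is a section header.

```python
SECTION_KEYS = ("data_spec:", "hpo_spec:", "model_spec:")


def toggle_sections(lines, active):
    """Uncomment data_spec and the active section; comment out the other one."""
    sections = []
    for line in lines:
        if any(key in line for key in SECTION_KEYS):
            sections.append([])
        # Strip an existing "# " prefix so every line is stored uncommented.
        sections[-1].append(line[2:] if line.startswith("# ") else line)
    output = []
    for section in sections:
        keep = "data_spec:" in section[0] or active in section[0]
        output.extend(section if keep else ["# " + line for line in section])
    return output


# Hypothetical config content; the real files contain different keys.
original = [
    "data_spec:",
    "  target_col: is_fraud",
    "hpo_spec:",
    "  num_trials: 10",
    "# model_spec:",
    "#   learning_rate: 0.1",
]
for line in toggle_sections(original, "model_spec"):
    print(line)
```

Calling `toggle_sections` on the result with `"hpo_spec"` restores the original lines, which mirrors how `change_case("hpo_spec")` returns the files to their default state.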

In [None]:
change_case("model_spec") #Activates model_spec

!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker compose run baseline-training 2>&1 | tee baseline-training-model_spec.log

In [None]:
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker compose run xgb-training 2>&1 | tee xgb-training-model_spec.log

change_case("hpo_spec") #Returns to default (hpo_spec) configuration

---
We take the AUCPR value on the last line (the one starting with [999]) as the final result. Earlier lines are intermediate evaluations that can be used to track the model's progress. Final results should closely match our reported numbers, although intermediate evaluations may differ.
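
To pull that final value out of a saved log programmatically, a small parser along the following lines could be used. The log line layout in the pattern is an assumption about XGBoost's per-round evaluation output, and the sample log uses made-up placeholder values, not real results. If a line reports several `aucpr` metrics, this sketch takes the first one.

```python
import re

# Matches lines such as "[999]<tab>eval-aucpr:0.12345"; the exact layout is
# an assumption and may need adjusting to the actual log.
ROUND_PATTERN = re.compile(r"^\[(\d+)\]\s+.*?aucpr:([0-9]*\.?[0-9]+(?:[eE][-+]?\d+)?)")


def final_aucpr(log_text):
    """Return (round, aucpr) from the last evaluation line, or None if absent."""
    result = None
    for line in log_text.splitlines():
        match = ROUND_PATTERN.match(line.strip())
        if match:
            result = (int(match.group(1)), float(match.group(2)))
    return result


# Made-up log excerpt with placeholder values, not real results.
sample_log = "\n".join([
    "[0]\teval-aucpr:0.10000",
    "[1]\teval-aucpr:0.20000",
    "some other log line",
    "[999]\teval-aucpr:0.50000",
])
print(final_aucpr(sample_log))
```

In the notebook, `log_text` would be the contents of `baseline-training-model_spec.log` or `xgb-training-model_spec.log`.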

You can also check the logs of these runs by running the following cell.

In [None]:
!echo $'\n====================Baseline Training====================='
!cat $WORKSPACE/credit-card-fraud-detection/docker/baseline-training-model_spec.log
!echo $'\n=======================XGB Training======================='
!cat $WORKSPACE/credit-card-fraud-detection/docker/xgb-training-model_spec.log


Run the following command to stop all services and containers created by docker compose and remove them.

In [None]:
!cd $WORKSPACE/credit-card-fraud-detection/docker && \
docker compose down