# Graph Neural Network Demo

## Introduction
Learn to run Graph Neural Network (GNN) training on CPUs in single-node and distributed modes. The workflow reads tabular data, converts it into a graph, and then uses a GNN to learn embeddings that serve as rich features in a downstream task.

This workflow is used by the [Fraud Detection Reference Kit](https://github.com/intel/credit-card-fraud-detection).

Check out more workflow examples and reference implementations in the [Dev
Catalog](https://developer.intel.com/aireferenceimplementations).

## Solution Technical Overview
Graph Neural Networks are effective models for generating node and edge embeddings that can be used as rich features to improve the accuracy of downstream tasks.
This workflow provides a step-by-step example of how GNNs can be used in fraud detection to extract node embeddings for all entities (credit cards and merchants) based on the graph structure defined by the transactions between them.

General steps:
- **Graph Construction/Graph Partition**: Conversion of tabular data into a set of files (nodes.csv, edges.csv and meta.yaml). These files form a CSVDataset for ingestion into a [Deep Graph Library (DGL)](https://www.dgl.ai/) graph.
- **GNN Training**: Training of a GraphSAGE GNN model on a self-supervised, transductive link prediction task. When training runs on a cluster of machines, it is preceded by graph partitioning.
- **Embedding Mapping**: Mapping of the generated node embeddings back onto the original tabular dataset.
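
The graph construction step can be pictured with a minimal standard-library sketch. The column names (`card_id`, `merchant_id`, `amount`) and the helper `build_graph_tables` are hypothetical and chosen for illustration; the workflow's own scripts and schema may differ.

```python
import csv
import io

# Hypothetical miniature of the tabular input: one row per transaction.
# Column names are illustrative, not the workflow's actual schema.
SAMPLE_CSV = """card_id,merchant_id,amount
c1,m1,12.50
c1,m2,99.00
c2,m1,5.25
"""

def build_graph_tables(rows, src_col="card_id", dst_col="merchant_id"):
    """Assign integer node IDs to entities and emit node and edge tables."""
    node_ids = {}  # (entity type, entity key) -> integer node ID
    edges = []
    for row in rows:
        src = node_ids.setdefault(("card", row[src_col]), len(node_ids))
        dst = node_ids.setdefault(("merchant", row[dst_col]), len(node_ids))
        # Every remaining column becomes an edge (transaction) feature.
        features = {k: v for k, v in row.items() if k not in (src_col, dst_col)}
        edges.append({"src_id": src, "dst_id": dst, **features})
    nodes = [{"node_id": i, "type": t, "key": k} for (t, k), i in node_ids.items()]
    return nodes, edges

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
nodes, edges = build_graph_tables(rows)
for edge in edges:
    print(edge)
```

Nodes carry only an ID and a type here, which matches the featureless nodes described in this workflow; all transaction attributes stay on the edges.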

## Solution Technical Details
Use cases such as fraud detection are characterized by class imbalance in their datasets, which makes it difficult to train predictor models directly on those labels. This GNN workflow shows how a self-supervised task can be formulated (instead of using the hard, imbalanced labels) to learn entity features that capture the graph structure for use by a downstream predictor model such as XGBoost. In this workflow, the self-supervised task is link prediction, where the edges in the graph are used as positive examples and non-existent edges as negative examples.
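
To make the positive/negative formulation concrete, the following plain-Python sketch draws negative examples by rejection sampling. It is an illustration only: the helper `sample_negative_edges` is hypothetical, and the workflow itself relies on DGL for sampling rather than on this code.

```python
import random

def sample_negative_edges(positive_edges, cards, merchants, num_samples, seed=0):
    """Draw (card, merchant) pairs that do not appear among the observed edges."""
    rng = random.Random(seed)
    existing = set(positive_edges)
    # Assumes every observed edge connects a listed card to a listed merchant.
    max_negatives = len(cards) * len(merchants) - len(existing)
    if num_samples > max_negatives:
        raise ValueError("not enough non-existent edges to sample from")
    negatives = set()
    while len(negatives) < num_samples:
        candidate = (rng.choice(cards), rng.choice(merchants))
        if candidate not in existing:
            negatives.add(candidate)
    return sorted(negatives)

positives = [("c1", "m1"), ("c1", "m2"), ("c2", "m1")]
negatives = sample_negative_edges(positives, ["c1", "c2"], ["m1", "m2"], num_samples=1)

# Label 1 marks an observed transaction, label 0 a sampled non-edge.
examples = [(edge, 1) for edge in positives] + [(edge, 0) for edge in negatives]
print(examples)
```

No fraud label appears anywhere in this construction, which is what makes the task self-supervised.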

The GNN workflow ingests tabular data in which each row corresponds to a transaction between two types of entities, cards and merchants. It generates a graph in which the entities are featureless nodes and the transactions are edges that carry the associated transaction features.

<center>
<img src="./docs/GNN_WF.png" width="800"/><figure>GNN Workflow</figure>
</center>

The GNN model consists of a learnable embedding layer followed by an encoder, implemented as a 2-layer GraphSAGE model, and a decoder, implemented as a 3-layer multilayer perceptron (MLP) with a single output for link prediction. During training, positive and negative neighbor sampling is used to generate the training examples. We use the Area Under the Receiver Operating Characteristic Curve (ROC AUC) as the metric to evaluate how well the embeddings predict whether two entities should be connected. Because this workflow does not use the fraud labels directly, how useful these embeddings are for predicting fraud must ultimately be assessed with the downstream predictor model.
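
As a reminder of what the ROC AUC metric measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The function name and the sample scores are made up for illustration; this is not the metric implementation used by the workflow.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Two real edges (label 1) and two sampled non-edges (label 0).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ranked correctly -> 0.75
```

The quadratic loop is fine for a toy example; library implementations obtain the same value from sorted ranks.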

<center>
<img src="./docs/GraphSAGE.png" width="800"/><figure>GraphSAGE Model</figure>
</center>
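
The neighborhood aggregation performed by a GraphSAGE layer can be sketched in plain Python. This sketch assumes the mean aggregator and a hand-picked weight matrix; the aggregator type, layer sizes and trained weights actually used by the workflow are not shown in this document, so every value below is hypothetical.

```python
def sage_mean_layer(features, neighbors, weight):
    """One GraphSAGE-style layer: ReLU(W @ concat(h_v, mean of neighbor h_u))."""
    out = {}
    for node, h_self in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            # Element-wise mean over the neighbors' feature vectors.
            h_nbr = [sum(features[u][i] for u in nbrs) / len(nbrs)
                     for i in range(len(h_self))]
        else:
            h_nbr = [0.0] * len(h_self)
        concat = h_self + h_nbr
        # Matrix-vector product followed by ReLU.
        out[node] = [max(0.0, sum(w * x for w, x in zip(row, concat)))
                     for row in weight]
    return out

# Toy graph: card c1 transacted with merchants m1 and m2.
features = {"c1": [1.0, 0.0], "m1": [0.0, 1.0], "m2": [0.0, 3.0]}
neighbors = {"c1": ["m1", "m2"], "m1": ["c1"], "m2": ["c1"]}
# Hypothetical 2x4 weight matrix that adds the self and neighbor halves.
weight = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0]]

hidden = sage_mean_layer(features, neighbors, weight)
print(hidden["c1"])  # [1.0, 2.0]
```

Stacking two such layers lets each node's representation depend on its two-hop neighborhood, which is the role of the 2-layer encoder described above.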

After several epochs of training, we run GNN inference on the entire graph without neighbor sampling and use the last layer activations generated by the model as node embeddings for nodes of the graph. These embeddings can be mapped to the entities in the tabular data input and used as node features for a downstream prediction task.
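
Mapping embeddings back to the tabular data amounts to a lookup per entity. A minimal sketch, assuming embeddings are available as a dictionary keyed by entity type and ID; the column names and the helper `attach_embeddings` are hypothetical and not taken from the repository.

```python
def attach_embeddings(rows, embeddings, card_col="card_id", merchant_col="merchant_id"):
    """Append each entity's embedding to the transactions it takes part in."""
    enriched = []
    for row in rows:
        out = dict(row)  # copy so the input rows stay unchanged
        for kind, col in (("card", card_col), ("merchant", merchant_col)):
            vector = embeddings[(kind, row[col])]
            for i, value in enumerate(vector):
                out[f"{kind}_emb_{i}"] = value
        enriched.append(out)
    return enriched

rows = [
    {"card_id": "c1", "merchant_id": "m1", "amount": 12.5},
    {"card_id": "c2", "merchant_id": "m1", "amount": 5.25},
]
# Made-up 2-dimensional embeddings for illustration.
embeddings = {
    ("card", "c1"): [0.1, 0.2],
    ("card", "c2"): [0.3, 0.4],
    ("merchant", "m1"): [0.5, 0.6],
}
enriched = attach_embeddings(rows, embeddings)
print(enriched[0])
```

Each output row keeps its original transaction columns and gains one column per embedding dimension for both the card and the merchant, ready to be fed to a downstream model.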

## Validated Hardware Details
The hardware and software setup requirements depend on how the workflow is run. A bare metal development system and a Docker image running locally have the same system requirements.

| Recommended Hardware         | Precision  |
| ---------------------------- | ---------- |
| Intel® 1st, 2nd, 3rd, and 4th Gen Xeon® Scalable Performance processors| FP32 |

For distributed training a high-speed fabric across nodes (e.g., OPA, Mellanox) is recommended.

The workflow has been tested on Rocky Linux 8.7 and Ubuntu 20.04.

## How it Works
This GNN workflow reads the tabular data, ingests it into a graph, and then uses a GNN to learn embeddings that enrich the features for the downstream task. The whole process is configured through YAML configuration files and can run in several modes:
- Run single node bare metal
- Run single node using Docker
- Run bare metal on a cluster of machines
- Run Docker on a cluster of machines

The selection between these different modes can be done in the `workflow-config.yaml`.

The following sections explain how to update the YAML configuration files to run this workflow.
### Update workflow-config.yaml
`workflow-config.yaml` is the main configuration file, where the user specifies:
1. The runtime environment (e.g., number of nodes in the cluster, IPs, bare metal or Docker)
2. Directories for inputs, outputs and configuration files
3. Which stages of the workflow to execute. A user may run all stages the first time but skip building or partitioning the graph in later training experiments to save time.

Please refer to the `workflow-config.yaml` for a detailed description of input configurations.
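
The effect of the stage switches can be illustrated with a small sketch. The stage names mirror the `single:` section of `workflow-config.yaml` shown later in this document; the helper `stages_to_run` is hypothetical and only shows the intent, not how `run-workflow.sh` is implemented.

```python
# Stage names mirror the `single:` section of workflow-config.yaml;
# the helper itself is hypothetical and not part of the repository.
STAGE_ORDER = ("build_graph", "gnn_training", "map_save")

def stages_to_run(single_cfg):
    """Return the enabled stages in execution order."""
    return [stage for stage in STAGE_ORDER if single_cfg.get(stage, False)]

first_run = {"build_graph": True, "gnn_training": True, "map_save": True}
later_run = {"build_graph": False, "gnn_training": True, "map_save": True}

print(stages_to_run(first_run))  # ['build_graph', 'gnn_training', 'map_save']
print(stages_to_run(later_run))  # ['gnn_training', 'map_save']
```

In the second configuration the graph files from the earlier run are reused, so only training and embedding mapping are repeated.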

### Update model-training.yaml
In `model-training.yaml` the user can specify:
1. Dataloader, sampler and model parameters (e.g., batch size, sampling fanout, learning rate)
2. Training hyperparameters (e.g., number of epochs)
3. DGL-specific parameters for distributed training
3. DGL specific parameters for distributed training

Please refer to the `model-training.yaml` for a detailed description of input configurations.

## Get Started

### Download the Workflow Repository

In [1]:
%%bash
## Clone the repo
mkdir -p ~/work && cd ~/work
git clone https://github.com/intel/graph-neural-networks-and-analytics
cd graph-neural-networks-and-analytics
export WORKSPACE=~/work/graph-neural-networks-and-analytics

## Create the conda env
./script/build_dgl1_env.sh
## Load conda's shell functions so that `conda activate` works in a non-interactive shell
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate dgl1.0
conda install -y -c conda-forge category_encoders


### Download the Datasets
The input to this workflow is tabular data in CSV format where each row corresponds to a transaction between two entities. The [IBM TabFormer](https://github.com/IBM/TabFormer/blob/main/data/credit_card/transactions.tgz) dataset used by the Fraud Detection Reference Kit consists of credit card transactions. Each transaction includes the IDs of the entities involved (card and merchant) and the edge features (amount of the transaction, date, etc.).

<center>
<img src="./docs/cc_trans_dataset.png" width="800"/><figure>IBM TabFormer Dataset</figure>
</center>

In [6]:
%%bash
## Download dataset
[[ ! -d ~/dataset/transactions ]] && mkdir -p ~/dataset/transactions && wget https://github.com/IBM/TabFormer/raw/main/data/credit_card/transactions.tgz -O ~/dataset/transactions.tgz && tar -zxvf ~/dataset/transactions.tgz -C ~/dataset/transactions

## Preprocess the dataset
export WORKSPACE=~/work/graph-neural-networks-and-analytics
DATA_IN=~/dataset/transactions/card_transaction.v1.csv
PROCESSED_DATA=${WORKSPACE}/cfg_valid/processed_data.csv
cd $WORKSPACE && mkdir -p ${WORKSPACE}/cfg_valid && ./script/run_data_prep.sh $DATA_IN $PROCESSED_DATA


Namespace(edge_feature_data_path='/root/work/graph-neural-networks-and-analytics/cfg_valid/processed_data.csv', raw_transaction_data_path='/root/dataset/transactions/card_transaction.v1.csv')
Time to read the dataframe = 1 seconds
Time for featurization = 1 seconds
Writing edge features to csv file takes 1 seconds
(49857, 26)


### Run Using Bare Metal

#### Set Up Workflow

1. Edit the workflow configuration:
```yaml
env:
  num_node: 1
  node_ips: #make sure the IPs have no trailing spaces
    - 127.0.0.1
  tmp_path: /localdisk/${USER}/cfg_valid
  #tmp_path used to save model, embeddings, partitions...
  data_path: /localdisk/${USER}/cfg_valid
  in_data_filename: processed_data.csv
  #data_path should contain the in_data_filename (processed_data.csv)
  out_path: /localdisk/${USER}/cfg_valid
  #out_path will contain the output csv with the tabular data and new node embeddings
  config_path: /localdisk/${USER}/applications.ai.appliedml.workflow.GNNandAnalytics/configs
  #for single node docker exec paths need to be on /localdisk (or NFS with full permissions)
  #for distributed exec paths need to be on NFS along with code repo
  bare_metal: True
  #bare_metal=False means run using docker container
  #docker_image: intel/ai-workflows:eap-fraud-detection-gnn
  docker_image: intel/ai-workflows:pa-fraud-detection-gnn
  train_config_file: model-training.yaml
  tabular2graph_config_file: tabular2graph.yaml

#the first time, run all stages; later you can set stages to False to reuse prior results
#e.g., skip building the graph and partitions to save time and jump directly to training
single:
  build_graph: True
  #build_graph stage generates CSVDataset files for DGL to ingest data as graph
  gnn_training: True
  map_save: True
  #map_save stage performs the mapping of the computed node embeddings to the input tabular data file

graph:
  #provide a name for the graph
  CSVDataset_name: sym_tabformer_hetero_CSVDatasets
  name: tabformer_full_homo
```
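Several fields in the `env` section depend on each other (for example, `num_node` and the length of `node_ips`), and the comments warn about trailing spaces in IPs. The following is a minimal sketch of a sanity check for these fields. It assumes the `env` section has already been parsed into a Python dict by a YAML parser; `validate_env` is a hypothetical helper and is not part of the workflow.

```python
import ipaddress


def validate_env(env: dict) -> list:
    """Return a list of human-readable problems found in the `env` section."""
    problems = []
    ips = env.get("node_ips") or []

    # num_node should agree with the number of listed IPs.
    if env.get("num_node") != len(ips):
        problems.append(
            f"num_node={env.get('num_node')} but {len(ips)} node_ips listed"
        )

    for ip in ips:
        # The config comment warns about spaces at the end of an IP.
        if ip != ip.strip():
            problems.append(f"node IP {ip!r} has leading/trailing whitespace")
            continue
        try:
            ipaddress.ip_address(ip)
        except ValueError:
            problems.append(f"node IP {ip!r} is not a valid IP address")

    # Every path key used by the workflow must be present and non-empty.
    for key in ("tmp_path", "data_path", "out_path", "config_path"):
        if not env.get(key):
            problems.append(f"{key} is missing or empty")
    return problems


# Example values mirroring the configuration above (paths are illustrative).
env = {
    "num_node": 1,
    "node_ips": ["127.0.0.1"],
    "tmp_path": "/localdisk/user/cfg_valid",
    "data_path": "/localdisk/user/cfg_valid",
    "out_path": "/localdisk/user/cfg_valid",
    "config_path": "/localdisk/user/configs",
}
print(validate_env(env))  # []
```

Running a check like this before launching the workflow surfaces configuration mistakes early instead of partway through a long training run.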

#### (Optional) Configuration on a Cluster of Machines

1. Set up [passwordless ssh](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/) access across machines.
2. Set up a [Distributed File System](https://github.com/dmlc/dgl/tree/1.0.0/examples/pytorch/graphsage/dist) for data and files that must be accessible across multiple machines.
3. Edit the workflow configuration:

```bash
## Edit configuration file
export WORKSPACE=~/work/graph-neural-networks-and-analytics
export NODE_IP=$(hostname -i)
export TMP_PATH=$(echo ${WORKSPACE}/cfg_valid |sed -e 's/\//\\\//g')
export DATA_PATH=$(echo ${WORKSPACE}/cfg_valid |sed -e 's/\//\\\//g')
export OUT_PATH=$(echo ${WORKSPACE}/cfg_valid |sed -e 's/\//\\\//g')
export CONFIG_PATH=$(echo ${WORKSPACE}/configs |sed -e 's/\//\\\//g')

cd $WORKSPACE
sed -i "s/    - 123.1.2.3/    - ${NODE_IP}/g" ./configs/workflow-config.yaml
sed -i "s/    - 123.1.2.4/    - ${NODE_IP}/g" ./configs/workflow-config.yaml
sed -i "/tmp_path:/ s/:.*/: ${TMP_PATH}/g" ./configs/workflow-config.yaml
sed -i "/data_path:/ s/:.*/: ${DATA_PATH}/g" ./configs/workflow-config.yaml
sed -i "/out_path:/ s/:.*/: ${OUT_PATH}/g" ./configs/workflow-config.yaml
sed -i "/config_path:/ s/:.*/: ${CONFIG_PATH}/g" ./configs/workflow-config.yaml
```
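The `sed` commands above rewrite the value of each path key in place, which requires escaping every `/` in the new path. As an alternative, the same kind of substitution can be sketched in Python without any escaping. This is an illustrative equivalent, not part of the workflow; `set_yaml_value` is a hypothetical helper, and unlike the `sed` address `/tmp_path:/` it only matches lines where the key starts the line (after indentation), so comment lines are left alone.

```python
import re


def set_yaml_value(text: str, key: str, value: str) -> str:
    """Replace the value of every `key: ...` line, keeping its indentation."""
    pattern = re.compile(rf"^([ \t]*){re.escape(key)}:.*$", flags=re.MULTILINE)
    # A function replacement avoids backslash and group-reference surprises
    # if `value` happens to contain characters such as `\` or `\1`.
    return pattern.sub(lambda m: f"{m.group(1)}{key}: {value}", text)


# A small excerpt of the configuration shown earlier.
config = (
    "env:\n"
    "  tmp_path: /localdisk/${USER}/cfg_valid\n"
    "  #tmp_path used to save model, embeddings, partitions...\n"
    "  data_path: /localdisk/${USER}/cfg_valid\n"
)

workspace = "/root/work/graph-neural-networks-and-analytics"
for key in ("tmp_path", "data_path"):
    config = set_yaml_value(config, key, f"{workspace}/cfg_valid")

print(config)
```

To apply this to the real file, read `./configs/workflow-config.yaml` into a string, call the helper for each key, and write the result back.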

#### Run Workflow

##### Launch Single Node Bare Metal Training

The single-node GNN workflow consists of three main steps:
* **build graph**: converts the pre-processed IBM TabFormer CSV dataset into the data format that DGL requires (i.e., meta.yaml plus node and edge CSV files).
* **gnn training**: trains and evaluates the GNN (GraphSAGE+3MLPs) network.
* **map save**: maps the computed node embeddings to the input tabular data file.
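The core of the build graph step is turning raw entity IDs into contiguous node indices and emitting edge lists; the stage log reports this as "Node renumbering" with the column map `card_id` to `card_id_Idx` and `merchant_id` to `merchant_id_Idx`, and lists both a `transaction` and a `sym_transaction` edge type. The sketch below illustrates the idea on toy data. The ID values, the `renumber` helper, and the first-appearance ordering are assumptions for illustration; the workflow's actual implementation may differ.

```python
def renumber(values):
    """Map each distinct ID to a contiguous index in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


# Toy transactions; the IDs are made up for illustration.
transactions = [
    {"card_id": "c9", "merchant_id": "m42"},
    {"card_id": "c3", "merchant_id": "m42"},
    {"card_id": "c9", "merchant_id": "m7"},
]

card_idx = renumber(t["card_id"] for t in transactions)
merchant_idx = renumber(t["merchant_id"] for t in transactions)

# Forward edges (card -> merchant), one per transaction.
edges = [
    (card_idx[t["card_id"]], merchant_idx[t["merchant_id"]])
    for t in transactions
]
# Symmetric reverse edges (merchant -> card), so messages flow both ways.
sym_edges = [(dst, src) for src, dst in edges]

print(card_idx)      # {'c9': 0, 'c3': 1}
print(merchant_idx)  # {'m42': 0, 'm7': 1}
print(edges)         # [(0, 0), (1, 0), (0, 1)]
print(sym_edges)     # [(0, 0), (0, 1), (1, 0)]
```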

```bash
# build graph + graph training + map save
export WORKSPACE=~/work/graph-neural-networks-and-analytics
cd $WORKSPACE && ./run-workflow.sh ./configs/workflow-config.yaml
```

dgl1.0                   /root/miniconda3/envs/dgl1.0

dgl1.0 conda env already exists, activating environment

Starting single node workflow...

Building graph...
/root/work/graph-neural-networks-and-analytics/configs/tabular2graph.yaml
Namespace(CSVDataset_name='sym_tabformer_hetero_CSVDatasets', data_in='/root/work/graph-neural-networks-and-analytics/cfg_valid/processed_data.csv', gnn_tmp='/root/work/graph-neural-networks-and-analytics/cfg_valid', tab2graph_cfg='/root/work/graph-neural-networks-and-analytics/configs/tabular2graph.yaml')
/root/work/graph-neural-networks-and-analytics/cfg_valid/sym_tabformer_hetero_CSVDatasets
loading processed data
time lo load processed data 0.08425259590148926
Node renumbering
re-enumerated column map:  {'card_id': 'card_id_Idx', 'merchant_id': 'merchant_id_Idx'}
time to renumerate 0.006252288818359375
Writting data into set of CSV files (nodes/edges)
['card_id_Idx', 'transaction', 'merchant_id_Idx']
['merchant_id_Idx', 'sym_transaction', 'card_id_

Namespace(model_emb_path='/root/work/graph-neural-networks-and-analytics/cfg_valid', node_emb_name='node_emb', out_data_path='/root/work/graph-neural-networks-and-analytics/cfg_valid/tabular_with_gnn_emb.csv', processed_data_path='/root/work/graph-neural-networks-and-analytics/cfg_valid/processed_data.csv', tab2graph_cfg='/root/work/graph-neural-networks-and-analytics/configs/tabular2graph.yaml')
loading processed data
time lo load processed data 0.0812215805053711
re-enumerated column map (homogeneous mapping):  {'card_id': 'card_id_Idx', 'merchant_id': 'merchant_id_Idx'}
time to renumerate 0.0060384273529052734
Loading embeddings from file and adding to preprocessed CSV file
CSV output shape:  (49857, 154)
Time to append node embeddings to edge features CSV 5.799329042434692


### Expected Output

#### Build Graph
Successful execution of this stage creates the following contents under the `${env_tmp_path}` directory specified in `workflow-config.yaml`:

```bash
ls ~/work/graph-neural-networks-and-analytics/cfg_valid/sym_tabformer_hetero_CSVDatasets/
```

```
edges_0.csv
edges_1.csv
meta.yaml
nodes_0.csv
nodes_1.csv
sym_tabformer_hetero_CSVDatasets
```
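Before setting `build_graph: False` to reuse earlier results, it can help to confirm that the files listed above are actually present. The following is a minimal sketch of such a check; `missing_dataset_files` is a hypothetical helper, and the expected file names are taken from the listing above. The example uses a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path

# File names taken from the directory listing of the build graph stage.
EXPECTED_FILES = (
    "edges_0.csv",
    "edges_1.csv",
    "meta.yaml",
    "nodes_0.csv",
    "nodes_1.csv",
)


def missing_dataset_files(dataset_dir, expected=EXPECTED_FILES):
    """Return the expected CSVDataset files that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration: create all files except the last one, then run the check.
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_FILES[:-1]:
        (Path(tmp) / name).write_text("")
    missing = missing_dataset_files(tmp)

print(missing)  # ['nodes_1.csv']
```

In practice, `dataset_dir` would point at the CSVDataset directory under your `tmp_path`, and an empty result means the build graph stage can be skipped.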


#### GNN Training
Successful training prints the epoch times and roc_auc scores as it progresses. The training logs are saved under:

```bash
ls ~/work/graph-neural-networks-and-analytics/logs
```

```
log_tabformer_full_homo_1n_20671.txt
log_tabformer_full_homo_1n_272.txt
```


#### Mapping Embeddings to Input Tabular Data
The "map_save" stage generates a CSV file that combines the input tabular data with the node embeddings generated by this GNN workflow. For the TabFormer dataset, this file has 154 features per transaction and can be used for the downstream fraud prediction task.

The output file (~35GB) can be found in your host system's output directory, indicated by `${env_out_path}` in `workflow-config.yaml`:

```bash
ls ~/work/graph-neural-networks-and-analytics/cfg_valid
```

```
model_graphsage_2L_64.pt
node_emb.pt
processed_data.csv
sym_tabformer_hetero_CSVDatasets
tabular_with_gnn_emb.csv
```
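The column counts in the logs are consistent with a simple concatenation: the processed data has 26 columns, the output has 154, and 154 = 26 + 2 × 64, i.e. one 64-dimensional embedding for the card and one for the merchant of each transaction. The embedding size of 64 is an assumption inferred from the file name `model_graphsage_2L_64.pt`. The sketch below illustrates this mapping on toy data with plain Python lists; `append_embeddings` is a hypothetical helper, and the real stage operates on the saved `node_emb.pt` and `processed_data.csv`.

```python
EMB_DIM = 64     # assumed from the file name 'model_graphsage_2L_64.pt'
N_FEATURES = 26  # width of the processed data reported in the log


def append_embeddings(rows, card_emb, merchant_emb):
    """Append the card and merchant embeddings to each transaction's features."""
    return [
        row["features"]
        + card_emb[row["card_idx"]]
        + merchant_emb[row["merchant_idx"]]
        for row in rows
    ]


# Toy embeddings keyed by renumbered node index (values are made up).
card_emb = {0: [0.1] * EMB_DIM, 1: [0.2] * EMB_DIM}
merchant_emb = {0: [0.3] * EMB_DIM}

# Two toy transactions, each with 26 tabular features.
rows = [
    {"card_idx": 0, "merchant_idx": 0, "features": [0.0] * N_FEATURES},
    {"card_idx": 1, "merchant_idx": 0, "features": [1.0] * N_FEATURES},
]

table = append_embeddings(rows, card_emb, merchant_emb)
print(len(table), len(table[0]))  # 2 154
```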
