<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/tigergraph/anti_money_laundering_graph_schema.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Graph Schema Anti Money Laundeuring

The AML project is intended to provide a multi-agent based simulator that generates synthetic banking transaction data together with a set of known money laundering patterns - mainly for the purpose of testing machine learning models and graph algorithms.

**AMLSim Official GitHub**
https://github.com/IBM/AMLSim

**AMLSim WIKI** https://github.com/IBM/AMLSim/wiki

**AMLSim Data** https://github.com/IBM/AMLSim/wiki/Download-Example-Data-Se

In [1]:
from IPython.display import clear_output
!pip install -qq watermark

################################
# Packages to install
################################
!pip install -U -qq pyTigerGraph
!pip install -qq flat-table

################################

clear_output()
%reload_ext watermark
%watermark -v -p numpy,pyTigerGraph,flat-table

Python implementation: CPython
Python version       : 3.7.12
IPython version      : 5.5.0

numpy       : 1.21.5
pyTigerGraph: 0.0.9.9.2
flat-table  : not installed



In [2]:
import json

import flat_table
import pandas as pd
import pyTigerGraph as tg
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [3]:
host_name = "http://20.112.15.231"
user_name = "tigergraph"
password = "tigergraph"
graph_name = "AMLGraph"

conn = tg.TigerGraphConnection(host=host_name, username=user_name, password=password, graphname=graph_name)

## Fetch data

In [4]:
!git clone https://github.com/TigerGraph-DevLabs/AMLSim_Python_Lab

Cloning into 'AMLSim_Python_Lab'...
remote: Enumerating objects: 84, done.[K
remote: Counting objects: 100% (84/84), done.[K
remote: Compressing objects: 100% (68/68), done.[K
remote: Total 84 (delta 26), reused 48 (delta 10), pack-reused 0[K
Unpacking objects: 100% (84/84), done.


## Loading DataFrames

Taking a look at our data from IBM's AMLSim(ulated) data generator.

*   `accounts.csv`
*   `alerts.csv`
*   `transactions.csv` 

In [5]:
accounts = pd.read_csv("/content/AMLSim_Python_Lab/data/accounts.csv")
alerts = pd.read_csv("/content/AMLSim_Python_Lab/data/alerts.csv")
transactions = pd.read_csv("/content/AMLSim_Python_Lab/data/transactions.csv")

print("Accounts shape:", accounts.shape)
print("Alers shape:", alerts.shape)
print("Transactions shape:", transactions.shape)

Accounts shape: (10000, 7)
Alers shape: (1719, 9)
Transactions shape: (1323234, 8)


In [6]:
accounts["ACCOUNT_ID"] = accounts["ACCOUNT_ID"].astype(str)
accounts.head(3)

Unnamed: 0,ACCOUNT_ID,CUSTOMER_ID,INIT_BALANCE,COUNTRY,ACCOUNT_TYPE,IS_FRAUD,TX_BEHAVIOR_ID
0,0,C_0,184.44,US,I,False,1
1,1,C_1,175.8,US,I,False,1
2,2,C_2,142.06,US,I,False,1


In [7]:
alerts["ALERT_ID"] = alerts["ALERT_ID"].astype(str)
alerts["SENDER_ACCOUNT_ID"] = alerts["SENDER_ACCOUNT_ID"].astype(str)
alerts["RECEIVER_ACCOUNT_ID"] = alerts["RECEIVER_ACCOUNT_ID"].astype(str)
alerts.head(3)

Unnamed: 0,ALERT_ID,ALERT_TYPE,IS_FRAUD,TX_ID,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TX_TYPE,TX_AMOUNT,TIMESTAMP
0,193,fan_in,True,82,6976,9739,TRANSFER,4.85,0
1,377,cycle,True,949,5776,2570,TRANSFER,10.27,0
2,189,fan_in,True,6280,9999,9530,TRANSFER,2.74,1


In [8]:
transactions["TX_ID"] = transactions["TX_ID"].astype(str)
transactions["SENDER_ACCOUNT_ID"] = transactions["SENDER_ACCOUNT_ID"].astype(
    str
)
transactions["RECEIVER_ACCOUNT_ID"] = transactions[
    "RECEIVER_ACCOUNT_ID"
].astype(str)
transactions.head(3)

Unnamed: 0,TX_ID,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TX_TYPE,TX_AMOUNT,TIMESTAMP,IS_FRAUD,ALERT_ID
0,1,6456,9069,TRANSFER,465.05,0,False,-1
1,2,7516,9543,TRANSFER,564.64,0,False,-1
2,3,2445,9356,TRANSFER,598.94,0,False,-1


## Create AMLSim Graph Solution

### Create Schema

In [9]:
schema_output = conn.gsql(
    """
USE GLOBAL
DROP GRAPH AMLGraph
CREATE GRAPH AMLGraph()
USE GRAPH AMLGraph 

CREATE SCHEMA_CHANGE JOB AMLGraph_change_job FOR GRAPH AMLGraph {
    ADD VERTEX Country (PRIMARY_ID id STRING) 
                           WITH primary_id_as_attribute="true";
    ADD VERTEX Customer (PRIMARY_ID id STRING) 
                            WITH primary_id_as_attribute="true";
    ADD VERTEX Account (PRIMARY_ID id STRING, init_balance DOUBLE, 
                       account_type STRING, tx_behavior INT, 
                       pagerank FLOAT, label INT, current_balance DOUBLE, 
                       min_send_tx DOUBLE, min_receive_tx DOUBLE, 
                       max_send_tx DOUBLE, max_receive_tx DOUBLE, 
                       avg_send_tx DOUBLE, avg_receive_tx DOUBLE, 
                       cnt_receive_tx INT, cnt_send_tx INT) 
                       WITH primary_id_as_attribute="true";
    ADD VERTEX Transaction (PRIMARY_ID id STRING, tx_behavior_id INT, 
                           amount DOUBLE, is_fraud BOOL) 
                           WITH primary_id_as_attribute="true";
    ADD VERTEX Alert (PRIMARY_ID id STRING, alert_type STRING, ts INT) 
                     WITH primary_id_as_attribute="true";

    ADD UNDIRECTED EDGE BASED_IN (From Customer, To Country);
    ADD UNDIRECTED EDGE CUSTOMER_ACCOUNT (From Customer, To Account);
    ADD UNDIRECTED EDGE TRANSACTION_FLAGGED (From Transaction, To Alert);
    ADD DIRECTED EDGE SEND_TO (From Account, To Account)
                                  WITH REVERSE_EDGE="reverse_SEND_TO";
    ADD DIRECTED EDGE SEND_TRANSACTION (From Account, To Transaction, ts INT, 
                                           tx_type STRING) 
                                           WITH REVERSE_EDGE="reverse_SEND_TRANSACTION";
    ADD DIRECTED EDGE RECEIVE_TRANSACTION (From Transaction, To Account, 
                                              ts INT, tx_type STRING) 
                                              WITH REVERSE_EDGE="reverse_RECEIVE_TRANSACTION";
}

RUN SCHEMA_CHANGE JOB AMLGraph_change_job
DROP JOB AMLGraph_change_job
"""
)

print(schema_output)

Semantic Check Fails: The job load_job_accounts depends on graph AMLGraph!
Please drop the job first!
The graph AMLGraph could not be dropped!
Semantic Check Fails: The graph name conflicts with another type or existing graph names! Please use a different name.
The graph AMLGraph could not be created!


***Print Graph Components***

In [10]:
graph_components = conn.gsql("""LS""", options=[])
print(graph_components)

---- Graph AMLGraph
Vertex Types:
- VERTEX Country(PRIMARY_ID id STRING) WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
- VERTEX Customer(PRIMARY_ID id STRING) WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
- VERTEX Account(PRIMARY_ID id STRING, init_balance DOUBLE, account_type STRING, tx_behavior INT, pagerank FLOAT, label INT, current_balance DOUBLE, min_send_tx DOUBLE, min_receive_tx DOUBLE, max_send_tx DOUBLE, max_receive_tx DOUBLE, avg_send_tx DOUBLE, avg_receive_tx DOUBLE, cnt_receive_tx INT, cnt_send_tx INT) WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
- VERTEX Transaction(PRIMARY_ID id STRING, tx_behavior_id INT, amount DOUBLE, is_fraud BOOL) WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
- VERTEX Alert(PRIMARY_ID id STRING, alert_type STRING, ts INT) WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
Edge Types:
- UNDIRECTED EDGE BASED_IN(FROM Customer, TO Country)
- UNDIRECTED E

***Reconnect to the Graph***

In [11]:
secret = conn.createSecret()
conn.apiToken = conn.getToken(secret)
print("Connection token:", conn.apiToken)

Connection token: ('7akgvktm8krm98bt83ibdackoph84mp5', 1650711476, '2022-04-23 10:57:56')


## Data Loading

***Loading Accounts***

In [12]:
_ = conn.gsql("""
DROP JOB load_job_accounts
CREATE LOADING JOB load_job_accounts FOR GRAPH AMLGraph {
    DEFINE FILENAME MyDataSource;
    LOAD MyDataSource TO VERTEX Customer VALUES($1) USING SEPARATOR=",", HEADER="true", EOL="\n";
    LOAD MyDataSource TO VERTEX Account VALUES($0, $2, $4, $6, _, _, _, _, _, _, _, _, _, _, _) USING SEPARATOR=",", HEADER="true", EOL="\n";
    LOAD MyDataSource TO EDGE BASED_IN VALUES($1, $3) USING SEPARATOR=",", HEADER="true", EOL="\n";
    LOAD MyDataSource TO EDGE CUSTOMER_ACCOUNT VALUES($1, $4) USING SEPARATOR=",", HEADER="true", EOL="\n";
}
""")
print(_)

Successfully dropped jobs on the graph 'AMLGraph': [load_job_accounts].
Successfully created loading jobs: [load_job_accounts].


In [13]:
_ = conn.uploadFile(
    "AMLSim_Python_Lab/data/accounts.csv",
    "MyDataSource",
    "load_job_accounts",
    timeout=32000,
)
print(_)

[{'sourceFileName': 'Online_POST', 'statistics': {'validLine': 10001, 'rejectLine': 0, 'failedConditionLine': 0, 'notEnoughToken': 0, 'invalidJson': 0, 'oversizeToken': 0, 'vertex': [{'typeName': 'Customer', 'validObject': 10001, 'noIdFound': 0, 'invalidAttribute': 0, 'invalidVertexType': 0, 'invalidPrimaryId': 0, 'invalidSecondaryId': 0, 'incorrectFixedBinaryLength': 0}, {'typeName': 'Account', 'validObject': 10000, 'noIdFound': 0, 'invalidAttribute': 1, 'invalidAttributeLines': ['1:init_balance'], 'invalidAttributeLinesData': ['ACCOUNT_ID,CUSTOMER_ID,INIT_BALANCE,COUNTRY,ACCOUNT_TYPE,IS_FRAUD,TX_BEHAVIOR_ID\r\n'], 'invalidVertexType': 0, 'invalidPrimaryId': 0, 'invalidSecondaryId': 0, 'incorrectFixedBinaryLength': 0}], 'edge': [{'typeName': 'BASED_IN', 'validObject': 10001, 'noIdFound': 0, 'invalidAttribute': 0, 'invalidVertexType': 0, 'invalidPrimaryId': 0, 'invalidSecondaryId': 0, 'incorrectFixedBinaryLength': 0}, {'typeName': 'CUSTOMER_ACCOUNT', 'validObject': 10001, 'noIdFound': 

***Loading Transactions***

In [14]:
_ = conn.gsql("""
DROP JOB load_job_transactions
CREATE LOADING JOB load_job_transactions FOR GRAPH AMLGraph {
    DEFINE FILENAME F1;
    LOAD F1 TO VERTEX Transaction VALUES($0, _, $4, $6) USING SEPARATOR=",", HEADER="true", EOL="\n";
    LOAD F1 TO EDGE SEND_TO VALUES($1, $2) USING SEPARATOR=",", HEADER="true", EOL="\n";
    LOAD F1 TO EDGE SEND_TRANSACTION VALUES($1, $0, $5, $3) USING SEPARATOR=",", HEADER="true", EOL="\n";
    LOAD F1 TO EDGE RECEIVE_TRANSACTION VALUES($0, $2, $5, $3) USING SEPARATOR=",", HEADER="true", EOL="\n";
}
""")
print(_)

Successfully dropped jobs on the graph 'AMLGraph': [load_job_transactions].
Successfully created loading jobs: [load_job_transactions].


In [15]:
_ = conn.uploadFile(
    "AMLSim_Python_Lab/data/transactions.csv",
    "F1",
    "load_job_transactions",
    timeout=32000,
)
print(_)

[{'sourceFileName': 'Online_POST', 'statistics': {'validLine': 1323235, 'rejectLine': 0, 'failedConditionLine': 0, 'notEnoughToken': 0, 'invalidJson': 0, 'oversizeToken': 0, 'vertex': [{'typeName': 'Transaction', 'validObject': 1323234, 'noIdFound': 0, 'invalidAttribute': 1, 'invalidAttributeLines': ['1:amount'], 'invalidAttributeLinesData': ['TX_ID,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TX_TYPE,TX_AMOUNT,TIMESTAMP,IS_FRAUD,ALERT_ID\r\n'], 'invalidVertexType': 0, 'invalidPrimaryId': 0, 'invalidSecondaryId': 0, 'incorrectFixedBinaryLength': 0}], 'edge': [{'typeName': 'SEND_TO', 'validObject': 1323235, 'noIdFound': 0, 'invalidAttribute': 0, 'invalidVertexType': 0, 'invalidPrimaryId': 0, 'invalidSecondaryId': 0, 'incorrectFixedBinaryLength': 0}, {'typeName': 'SEND_TRANSACTION', 'validObject': 1323234, 'noIdFound': 0, 'invalidAttribute': 1, 'invalidAttributeLines': ['1:ts'], 'invalidAttributeLinesData': ['TX_ID,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TX_TYPE,TX_AMOUNT,TIMESTAMP,IS_FRAUD,ALERT

***Load Alerts***

In [16]:
_ = conn.gsql("""
DROP JOB load_job_alerts
CREATE LOADING JOB load_job_alerts FOR GRAPH AMLGraph {
    DEFINE FILENAME F1;
    LOAD F1 TO VERTEX Alert VALUES($0, $1, $8) USING SEPARATOR=",", HEADER="true", EOL="\n";
    LOAD F1 TO EDGE TRANSACTION_FLAGGED VALUES($3, $0) USING SEPARATOR=",", HEADER="true", EOL="\n";
}
""")
print(_)

Successfully dropped jobs on the graph 'AMLGraph': [load_job_alerts].
Successfully created loading jobs: [load_job_alerts].


In [17]:
_ = conn.uploadFile(
    "AMLSim_Python_Lab/data/alerts.csv",
    "F1",
    "load_job_alerts",
    timeout=32000
)
print(_)

[{'sourceFileName': 'Online_POST', 'statistics': {'validLine': 1720, 'rejectLine': 0, 'failedConditionLine': 0, 'notEnoughToken': 0, 'invalidJson': 0, 'oversizeToken': 0, 'vertex': [{'typeName': 'Alert', 'validObject': 1719, 'noIdFound': 0, 'invalidAttribute': 1, 'invalidAttributeLines': ['1:ts'], 'invalidAttributeLinesData': ['ALERT_ID,ALERT_TYPE,IS_FRAUD,TX_ID,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TX_TYPE,TX_AMOUNT,TIMESTAMP\r\n'], 'invalidVertexType': 0, 'invalidPrimaryId': 0, 'invalidSecondaryId': 0, 'incorrectFixedBinaryLength': 0}], 'edge': [{'typeName': 'TRANSACTION_FLAGGED', 'validObject': 1720, 'noIdFound': 0, 'invalidAttribute': 0, 'invalidVertexType': 0, 'invalidPrimaryId': 0, 'invalidSecondaryId': 0, 'incorrectFixedBinaryLength': 0}], 'deleteVertex': [], 'deleteEdge': []}}]


## Install Queries

In [18]:
endpoints = conn.getEndpoints()
print("Sample default endpoint")
print(json.dumps(endpoints["DELETE /graph/{graph_name}/delete_by_type/vertices/{vertex_type}/"], indent=2))

Sample default endpoint
{
  "parameters": {
    "ack": {
      "default": "all",
      "max_count": 1,
      "min_count": 1,
      "options": [
        "all",
        "none"
      ],
      "type": "STRING"
    },
    "graph_name": {
      "default": "",
      "max_count": 1,
      "min_count": 0,
      "type": "STRING"
    },
    "permanent": {
      "default": false,
      "max_count": 1,
      "min_count": 1,
      "type": "BOOL"
    },
    "vertex_type": {
      "type": "TYPENAME"
    }
  }
}


### Create Account Activity
Account activity will traverse transactions to "accumulate" statistics from each user in the solution.

In [19]:
_ = conn.gsql("""
DROP QUERY accountActivity
CREATE QUERY accountActivity() FOR GRAPH AMLGraph { 
  
  SumAccum<DOUBLE> @s_sumAmt, @r_sumAmt;
  SumAccum<DOUBLE> @s_txCnt, @r_txCnt;
  MinAccum<DOUBLE> @s_minAmt, @r_minAmt;
  MaxAccum<DOUBLE> @s_maxAmt, @r_maxAmt;
  AvgAccum @s_avgAmt, @r_avgAmt;
  
  Seed = {Account.*};
  
  acctSend = SELECT tgt FROM Seed:s -(SEND_TRANSACTION:e)-> Transaction:tgt
             ACCUM s.@s_sumAmt += tgt.amount, 
                   s.@s_txCnt += 1,
                   s.@s_minAmt += tgt.amount, 
                   s.@s_maxAmt += tgt.amount,
                   s.@s_avgAmt += tgt.amount
            POST-ACCUM
                s.current_balance = s.@s_sumAmt - s.init_balance,
                s.min_send_tx = s.@s_minAmt,
                s.max_send_tx = s.@s_maxAmt,
                s.avg_send_tx = s.@s_avgAmt,
                s.cnt_send_tx = s.@s_txCnt;
                
  acctreceive = SELECT tgt FROM Seed:s -(reverse_RECEIVE_TRANSACTION:e)-> Transaction:tgt
                ACCUM s.@r_sumAmt += tgt.amount, 
                      s.@r_txCnt += 1,
                      s.@r_minAmt += tgt.amount, 
                      s.@r_maxAmt += tgt.amount,
                      s.@r_avgAmt += tgt.amount
                      
                POST-ACCUM
                  s.current_balance = s.@r_sumAmt + s.init_balance,
                  s.min_receive_tx = s.@r_minAmt,
                  s.max_receive_tx = s.@r_maxAmt,
                  s.avg_receive_tx = s.@r_avgAmt,
                  s.cnt_receive_tx = s.@r_txCnt;
              
  PRINT "Features Have Been Calculated";
}
""")
print(_)

Successfully dropped queries on the graph 'AMLGraph': [accountActivity].
Successfully created queries: [accountActivity].


#### Create Account Info
A basic query that is looking pulls information from x number of accounts. 

In [20]:
_ = conn.gsql("""
CREATE QUERY accountInfo(INT limit_x) FOR GRAPH AMLGraph {  
  /*
    Grabs the account info for limit_x accounts
  */
  
  Seed = {Account.*};
  
  S3 = SELECT a FROM Seed:a 
         LIMIT limit_x;
  
  PRINT S3;   
}
""")
print(_)

Semantic Check Fails: The query name accountInfo is used by another object! Please use a different name.
Failed to create queries: [accountInfo].


In [21]:
_ = conn.gsql("INSTALL QUERY accountInfo")
print(_)

Semantic Check Fails: Graph AMLGraph: all queries in this catalog have been installed already.
Query installation finished.


In [22]:
info = conn.runInstalledQuery("accountInfo", {"limit_x":"1"})
print(json.dumps(info, indent=2))

[
  {
    "S3": [
      {
        "v_id": "9988",
        "v_type": "Account",
        "attributes": {
          "id": "9988",
          "init_balance": 340.28,
          "account_type": "I",
          "tx_behavior": 5,
          "pagerank": 1,
          "label": 705691648,
          "current_balance": 325468059.41,
          "min_send_tx": 2.71,
          "min_receive_tx": 4.75,
          "max_send_tx": 340.27,
          "max_receive_tx": 21474836.47,
          "avg_send_tx": 336.80067,
          "avg_receive_tx": 233143.06528,
          "cnt_receive_tx": 1396,
          "cnt_send_tx": 194
        }
      }
    ]
  }
]


### Create Label Prop

Label Propagation is a heuristic method for determining communities. The idea is simple: If the plurality of your neighbors all bear the label X, then you should label yourself as also a member of X. The algorithm begins with each vertex having its own unique label. Then we iteratively update labels based on the neighbor influence described above. It is important that the order for updating the vertices be random. The algorithm is favored for its efficiency and simplicity, but it is not guaranteed to produce the same results every time. In a variant version, some vertices could initially be known to belong to the same community. If they are well-connected to one another, they are likely to preserve their common membership and influence their neighbors.

In [23]:
_ = conn.gsql("""
CREATE QUERY label_prop (SET<STRING> v_type, SET<STRING> e_type, INT max_iter, INT output_limit, BOOL print_accum = TRUE, STRING file_path = "", STRING attr = "") FOR GRAPH AMLGraph{
# Partition the vertices into communities, according to the Label Propagation method.
# Indicate community membership by assigning each vertex a community ID.

        OrAccum @@changed = true;
        MapAccum<INT, INT> @map;     # <communityId, numNeighbors>
        MapAccum<INT, INT> @@commSizes;   # <communityId, members>
        SumAccum<INT> @label, @num;  
        FILE f (file_path);
        Start = {v_type};

# Assign unique labels to each vertex
        Start = SELECT s FROM Start:s ACCUM s.@label = getvid(s);

# Propagate labels to neighbors until labels converge or the max iterations is reached
        WHILE @@changed == true LIMIT max_iter DO
                @@changed = false;
                Start = SELECT s 
                        FROM Start:s -(e_type:e)-> :t
                        ACCUM t.@map += (s.@label -> 1)  # count the occurrences of neighbor's labels
                        POST-ACCUM
                                INT maxV = 0,
                                INT label = 0,
                                # Iterate over the map to get the neighbor label that occurs most often
                                FOREACH (k,v) IN t.@map DO
                                        CASE WHEN v > maxV THEN
                                                maxV = v,
                                                label = k
                                        END
                                END,
                                # When the neighbor search finds a label AND it is a new label
                                # AND the label's count has increased, update the label.
                                CASE WHEN label != 0 AND t.@label != label AND maxV > t.@num THEN
                                        @@changed += true,
                                        t.@label = label,
                                        t.@num = maxV
                                END,
                                t.@map.clear();
        END;

        Start = {v_type};
        Start =  SELECT s FROM Start:s
                  POST-ACCUM 
                        IF attr != "" THEN s.setAttr(attr, s.@label) END,
                        IF file_path != "" THEN f.println(s, s.@label) END,
                        IF print_accum THEN @@commSizes += (s.@label -> 1) END
                  LIMIT output_limit;

        IF print_accum THEN 
           PRINT @@commSizes;
           PRINT Start[Start.@label];
        END;
}
""")
print(_)

Semantic Check Fails: The query name label_prop is used by another object! Please use a different name.
Failed to create queries: [label_prop].


#### Create Page Rank
The PageRank algorithm measures the influence of each vertex on every other vertex. PageRank influence is defined recursively: a vertex's influence is based on the influence of the vertices which refer to it. A vertex's influence tends to increase if (1) it has more referring vertices or if (2) its referring vertices have higher influence. The analogy to social influence is clear.

In [24]:
_ = conn.gsql("""
CREATE QUERY pageRank (STRING v_type, STRING e_type,
 FLOAT max_change=0.001, INT max_iter=25, FLOAT damping=0.85, INT top_k = 100,
 BOOL print_accum = TRUE, STRING result_attr =  "", STRING file_path = "",
 BOOL display_edges = FALSE) FOR GRAPH AMLGraph{
/*
 Compute the pageRank score for each vertex in the GRAPH
 In each iteration, compute a score for each vertex:
     score = (1-damping) + damping*sum(received scores FROM its neighbors).
 The pageRank algorithm stops when either of the following is true:
 a) it reaches max_iter iterations;
 b) the max score change for any vertex compared to the last iteration <= max_change.
 v_type: vertex types to traverse          print_accum: print JSON output
 e_type: edge types to traverse            result_attr: INT attr to store results to
 max_iter; max #iterations                 file_path: file to write CSV output to
 top_k: #top scores to output              display_edges: output edges for visualization
 max_change: max allowed change between iterations to achieve convergence
 damping: importance of traversal vs. random teleport

 This query supports only taking in a single edge for the time being (8/13/2020).
*/
	TYPEDEF TUPLE<VERTEX Vertex_ID, FLOAT score> Vertex_Score;
	HeapAccum<Vertex_Score>(top_k, score DESC) @@topScores;
	MaxAccum<FLOAT> @@max_diff = 9999;    # max score change in an iteration
	SumAccum<FLOAT> @recvd_score = 0; # sum of scores each vertex receives FROM neighbors
	SumAccum<FLOAT> @score = 1;           # initial score for every vertex is 1.
	SetAccum<EDGE> @@edgeSet;             # list of all edges, if display is needed
	FILE f (file_path);

# PageRank iterations	
	Start = {v_type};                     # Start with all vertices of specified type(s)
	WHILE @@max_diff > max_change LIMIT max_iter DO
			@@max_diff = 0;
			V = SELECT s
				FROM Start:s -(e_type:e)-> v_type:t
				ACCUM t.@recvd_score += s.@score/(s.outdegree(e_type)) 
				POST-ACCUM s.@score = (1.0-damping) + damping * s.@recvd_score,
						   s.@recvd_score = 0,
						   @@max_diff += abs(s.@score - s.@score');
	END; # END WHILE loop

# Output
	IF file_path != "" THEN
	  f.println("Vertex_ID", "PageRank");
	END;

	V = SELECT s FROM Start:s
		POST-ACCUM 
			IF result_attr != "" THEN s.setAttr(result_attr, s.@score) END,
			IF file_path != "" THEN f.println(s, s.@score) END,
			IF print_accum THEN @@topScores += Vertex_Score(s, s.@score) END;

	IF print_accum THEN
		PRINT @@topScores;
		IF display_edges THEN
			PRINT Start[Start.@score];
			Start = SELECT s
					FROM Start:s -(e_type:e)-> v_type:t
					ACCUM @@edgeSet += e;
		   PRINT @@edgeSet;
		END;
	END;
}
""")
print(_)

Semantic Check Fails: The query name pageRank is used by another object! Please use a different name.
Failed to create queries: [pageRank].


#### Create Transaction Features in Tuple
Using a Tuple we can fetch all the data needed for our SKLearn Decision Tree Classifier. 

In [25]:
_ = conn.gsql("""
DROP QUERY txMultiHopLimit
CREATE QUERY txMultiHopLimit() FOR GRAPH AMLGraph SYNTAX v2 { 

  /*
    This query grabs all of the features and returns it.
  */
  
  TYPEDEF TUPLE <BOOL tx_fraud, DOUBLE tx_amount, FLOAT s_pagerank, INT s_label, DOUBLE s_min_send_tx, DOUBLE s_min_receieve_tx, DOUBLE s_max_send_tx, DOUBLE s_max_receive_tx, DOUBLE s_avg_send_tx, DOUBLE s_avg_receive_tx, INT s_cnt_receive_tx, INT s_cnt_send_tx, INT s_timestamp, FLOAT r_pagerank, INT r_label, DOUBLE r_min_send_tx, DOUBLE r_min_receieve_tx, DOUBLE r_max_send_tx, DOUBLE r_max_receive_tx, DOUBLE r_avg_send_tx, DOUBLE r_avg_receive_tx, INT r_cnt_receive_tx, INT r_cnt_send_tx, INT r_timestamp> ORDER_TX;
  ListAccum<ORDER_TX> @@txRecords;
  
  Seed = {Transaction.*};
  
  acctSend = SELECT tgt FROM Seed:s -( (reverse_SEND_TRANSACTION> |  RECEIVE_TRANSACTION>):e)- Account:tgt
      ACCUM @@txRecords += ORDER_TX(s.is_fraud, s.amount, tgt.pagerank, tgt.label, tgt.min_send_tx, tgt.min_receive_tx, tgt.max_send_tx, tgt.max_receive_tx, tgt.avg_send_tx, tgt.avg_receive_tx, tgt.cnt_receive_tx, tgt.cnt_send_tx, e.ts, tgt.pagerank, tgt.label, tgt.min_send_tx, tgt.min_receive_tx, tgt.max_send_tx, tgt.max_receive_tx, tgt.avg_send_tx, tgt.avg_receive_tx, tgt.cnt_receive_tx, tgt.cnt_send_tx, e.ts);

  PRINT @@txRecords;
}
""")
print(_)

Successfully dropped queries on the graph 'AMLGraph': [txMultiHopLimit].
Successfully created queries: [txMultiHopLimit].


#### Install all Queries

In [26]:
_ = conn.gsql("INSTALL QUERY ALL")
print(_)

Start installing queries, about 1 minute ...
accountActivity query: curl -X GET 'http://127.0.0.1:9000/query/AMLGraph/accountActivity'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.
txMultiHopLimit query: curl -X GET 'http://127.0.0.1:9000/query/AMLGraph/txMultiHopLimit'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.
Select 'm1' as compile server, now connecting ...
Node 'm1' is prepared as compile server.

Query installation finished.


## Run Queries for Feature Extraction

Check attributes of accounts.

In [27]:
info = conn.runInstalledQuery("accountInfo", {"limit_x":"1"})
print(json.dumps(info, indent=2))

[
  {
    "S3": [
      {
        "v_id": "9988",
        "v_type": "Account",
        "attributes": {
          "id": "9988",
          "init_balance": 340.28,
          "account_type": "I",
          "tx_behavior": 5,
          "pagerank": 1,
          "label": 705691648,
          "current_balance": 325468059.41,
          "min_send_tx": 2.71,
          "min_receive_tx": 4.75,
          "max_send_tx": 340.27,
          "max_receive_tx": 21474836.47,
          "avg_send_tx": 336.80067,
          "avg_receive_tx": 233143.06528,
          "cnt_receive_tx": 1396,
          "cnt_send_tx": 194
        }
      }
    ]
  }
]


Run `accountActivity` Query

In [28]:
conn.runInstalledQuery("accountActivity", {})

[{'"Features Have Been Calculated"': 'Features Have Been Calculated'}]

Check attributes of accounts (check if accountActivity attrbutes are populated).

In [29]:
info = conn.runInstalledQuery("accountInfo", {"limit_x":"1"})
print(json.dumps(info, indent=2))

[
  {
    "S3": [
      {
        "v_id": "9988",
        "v_type": "Account",
        "attributes": {
          "id": "9988",
          "init_balance": 340.28,
          "account_type": "I",
          "tx_behavior": 5,
          "pagerank": 1,
          "label": 705691648,
          "current_balance": 325468059.41,
          "min_send_tx": 2.71,
          "min_receive_tx": 4.75,
          "max_send_tx": 340.27,
          "max_receive_tx": 21474836.47,
          "avg_send_tx": 336.80067,
          "avg_receive_tx": 233143.06528,
          "cnt_receive_tx": 1396,
          "cnt_send_tx": 194
        }
      }
    ]
  }
]


Run `pageRank` Query

In [30]:
pagerank = conn.runInstalledQuery(
    "pageRank",
    {
        "v_type": "Account",
        "e_type": "Send_To",
        "max_change": "0.001",
        "max_iter": "25",
        "damping": "0.85",
        "top_k": "10",
        "print_accum": "TRUE",
        "result_attr": "pagerank",
        "file_path": "",
        "display_edges": "FALSE",
    },
)
print(json.dumps(pagerank, indent=2))

[
  {
    "@@topScores": [
      {
        "Vertex_ID": "9747",
        "score": 1
      },
      {
        "Vertex_ID": "9961",
        "score": 1
      },
      {
        "Vertex_ID": "9691",
        "score": 1
      },
      {
        "Vertex_ID": "9525",
        "score": 1
      },
      {
        "Vertex_ID": "9651",
        "score": 1
      },
      {
        "Vertex_ID": "9507",
        "score": 1
      },
      {
        "Vertex_ID": "9479",
        "score": 1
      },
      {
        "Vertex_ID": "9580",
        "score": 1
      },
      {
        "Vertex_ID": "9767",
        "score": 1
      },
      {
        "Vertex_ID": "9988",
        "score": 1
      }
    ]
  }
]


Check attributes of accounts (check if pageRank attrbutes are populated.

In [31]:
info = conn.runInstalledQuery("accountInfo", {"limit_x":"1"})
print(json.dumps(info, indent=2))

[
  {
    "S3": [
      {
        "v_id": "9988",
        "v_type": "Account",
        "attributes": {
          "id": "9988",
          "init_balance": 340.28,
          "account_type": "I",
          "tx_behavior": 5,
          "pagerank": 1,
          "label": 705691648,
          "current_balance": 325468059.41,
          "min_send_tx": 2.71,
          "min_receive_tx": 4.75,
          "max_send_tx": 340.27,
          "max_receive_tx": 21474836.47,
          "avg_send_tx": 336.80067,
          "avg_receive_tx": 233143.06528,
          "cnt_receive_tx": 1396,
          "cnt_send_tx": 194
        }
      }
    ]
  }
]


Run `label_prop` Query

In [32]:
label_prop = conn.runInstalledQuery(
    "label_prop",
    {
        "v_type": "Account",
        "e_type": "Send_To",
        "max_iter": "10",
        "output_limit": "10",
        "print_accum": "TRUE",
        "file_path": "",
        "attr": "label",
    },
)
json.dumps(label_prop, indent=1)

'[\n {\n  "@@commSizes": {\n   "771752133": 1,\n   "771752132": 1,\n   "771752131": 1,\n   "771752130": 1,\n   "771752129": 1,\n   "771752128": 1,\n   "771752127": 1,\n   "771752126": 1,\n   "771752125": 1,\n   "771752124": 1,\n   "771752123": 1,\n   "771752122": 1,\n   "771752121": 1,\n   "771752120": 1,\n   "771752119": 1,\n   "771752118": 1,\n   "771752117": 1,\n   "771752116": 1,\n   "771752115": 1,\n   "771752114": 1,\n   "771752113": 1,\n   "771752112": 1,\n   "771752111": 1,\n   "771752110": 1,\n   "771752109": 1,\n   "771752108": 1,\n   "771752107": 1,\n   "771752106": 1,\n   "771752105": 1,\n   "771752104": 1,\n   "771752103": 1,\n   "771752102": 1,\n   "771752101": 1,\n   "771752100": 1,\n   "771752099": 1,\n   "771752098": 1,\n   "771752097": 1,\n   "771752096": 1,\n   "771752095": 1,\n   "771752094": 1,\n   "771752093": 1,\n   "771752092": 1,\n   "771752091": 1,\n   "771752090": 1,\n   "771752089": 1,\n   "771752088": 1,\n   "771752087": 1,\n   "771752086": 1,\n   "77175208

Check attributes of accounts (check if label_prop attrbutes are populated).

In [33]:
info = conn.runInstalledQuery("accountInfo", {"limit_x":"1"})
print(json.dumps(info, indent=2))

[
  {
    "S3": [
      {
        "v_id": "9988",
        "v_type": "Account",
        "attributes": {
          "id": "9988",
          "init_balance": 340.28,
          "account_type": "I",
          "tx_behavior": 5,
          "pagerank": 1,
          "label": 705691648,
          "current_balance": 325468059.41,
          "min_send_tx": 2.71,
          "min_receive_tx": 4.75,
          "max_send_tx": 340.27,
          "max_receive_tx": 21474836.47,
          "avg_send_tx": 336.80067,
          "avg_receive_tx": 233143.06528,
          "cnt_receive_tx": 1396,
          "cnt_send_tx": 194
        }
      }
    ]
  }
]


Run `txMultiHopLimit` Query

In [34]:
tx_hop = conn.runInstalledQuery(
    "txMultiHopLimit", {}, timeout="10000000000000000", sizeLimit="1500000000"
)

In [35]:
df_tx_hop = pd.DataFrame(tx_hop[0]["@@txRecords"])
df_tx_hop = flat_table.normalize(df_tx_hop)
df_tx_hop.head()

Unnamed: 0,index,tx_fraud,tx_amount,s_pagerank,s_label,s_min_send_tx,s_min_receieve_tx,s_max_send_tx,s_max_receive_tx,s_avg_send_tx,...,r_label,r_min_send_tx,r_min_receieve_tx,r_max_send_tx,r_max_receive_tx,r_avg_send_tx,r_avg_receive_tx,r_cnt_receive_tx,r_cnt_send_tx,r_timestamp
0,0,False,428.79,1,793772108,10.94,6.47,428.79,558.46,424.18685,...,793772108,10.94,6.47,428.79,558.46,424.18685,404.52105,267,181,199
1,1,False,428.79,1,790626380,11.64,5.26,292.85,18629974.0,284.35621,...,790626380,11.64,5.26,292.85,18629974.0,284.35621,17013.09206,2610,66,199
2,2,False,6.11,1,787480598,6.11,8.51,18.54,21474836.47,6.17316,...,787480598,6.11,8.51,18.54,21474836.47,6.17316,181257.51613,284,266,199
3,3,False,6.11,1,751829086,3.9,6.11,416.02,8945538.0,413.69164,...,751829086,3.9,6.11,416.02,8945538.0,413.69164,166276.9644,116,177,199
4,4,False,6.11,1,787480598,6.11,8.51,18.54,21474836.47,6.17316,...,787480598,6.11,8.51,18.54,21474836.47,6.17316,181257.51613,284,266,199


## Run SKLearn DecisionTreeClassifier

Assign all the attributes that will be used as features

In [36]:
features = [
    "tx_amount",
    "s_pagerank",
    "s_label",
    "s_min_send_tx",
    "s_min_receieve_tx",
    "s_max_send_tx",
    "s_max_receive_tx",
    "s_avg_send_tx",
    "s_avg_receive_tx",
    "s_cnt_receive_tx",
    "s_cnt_send_tx",
    "s_timestamp",
    "r_pagerank",
    "r_label",
    "r_min_send_tx",
    "r_min_receieve_tx",
    "r_max_send_tx",
    "r_max_receive_tx",
    "r_avg_send_tx",
    "r_avg_receive_tx",
    "r_cnt_receive_tx",
    "r_cnt_send_tx",
    "r_timestamp",
]

Split our data into a test data set and train data set 

In [37]:
X_train, X_test, y_train, y_test = train_test_split(
    df_tx_hop[features], df_tx_hop["tx_fraud"], test_size=0.2, random_state=42
)

Assign clf to SKlearns Decision Tree Classifier

In [38]:
clf = DecisionTreeClassifier()

Train the Decision Tree Classifier with our training data

In [39]:
clf.fit(X_train, y_train)

DecisionTreeClassifier()

Run a prediction using our test data set and output confusion matrix results

In [40]:
preds = clf.predict(X_test)
confusion_matrix(y_test, preds)

array([[528631,     38],
       [    58,    567]])

Print out a report to see results

In [41]:
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00    528669
        True       0.94      0.91      0.92       625

    accuracy                           1.00    529294
   macro avg       0.97      0.95      0.96    529294
weighted avg       1.00      1.00      1.00    529294



# WARNING: DROP ALL - Will Delete everything in your graph!

In [42]:
# print(conn.gsql("""
#     USE GLOBAL
#     DROP GRAPH AMLGraph
# """))

## Additional Queries for To Explore with GraphStudio


### Select Account Query
```
CREATE QUERY selectAccount(STRING acct) FOR GRAPH AMLSim { 
  seed = {Account.*};
  S1 = SELECT s FROM seed:s WHERE s.id == acct; 
  PRINT S1; 
}
```
### Select Account Query
```
CREATE QUERY selectTopPageRank() FOR GRAPH AMLSim { 
  seed = {Account.*};
  S1 = SELECT s FROM seed:s ORDER BY s.pagerank DESC LIMIT 10;
  PRINT S1; 
}
```
### Select Account Transactions
```
CREATE QUERY selectAccountTx(STRING acct) FOR GRAPH AMLSim { 
  ListAccum<EDGE> @@txSend, @@txreceive;
  seed = {Account.*};
  
  SendTx = SELECT tgt FROM seed:s -(Send_Transaction:e)-> Transaction:tgt
           WHERE s.id == acct
           ACCUM @@txSend +=  e;
  
  receiveTx = SELECT tgt FROM seed:s -(reverse_receive_Transaction:e)-> Transaction:tgt
              WHERE s.id == acct
              ACCUM @@txreceive += e;
  
  PRINT @@txSend, @@txreceive; 
}
```

## Additional Queries To INTERPRET or CREATE

In [43]:
# FETCH DATA FOR ACCOUNT 9913 TO CHECK DATA POPLULATION
print(conn.gsql("""
INTERPRET QUERY () FOR GRAPH AMLGraph { 
  seed = {Account.*};
  S1 = SELECT s FROM seed:s WHERE s.id == "9913"; 
  PRINT S1; 
}
"""))

{
"error": false,
"message": "",
"version": {
"schema": 13,
"edition": "enterprise",
"api": "v2"
},
"results": [{"S1": [{
"v_id": "9913",
"attributes": {
"account_type": "I",
"max_receive_tx": 2.147483647E7,
"label": 871366657,
"tx_behavior": 5,
"cnt_receive_tx": 522,
"cnt_send_tx": 70,
"avg_receive_tx": 87608.81203,
"avg_send_tx": 259.85971,
"min_receive_tx": 5.38,
"init_balance": 263.35,
"min_send_tx": 19.03,
"current_balance": 4.573206323E7,
"max_send_tx": 263.35,
"id": "9913",
"pagerank": 1
},
"v_type": "Account"
}]}]
}


In [44]:
# CREATE/INSTALL QUERY FOR ACCOUNT INFO
print(conn.gsql("""
CREATE QUERY selectAccount(STRING acct) FOR GRAPH AMLGraph { 
  seed = {Account.*};
  S1 = SELECT s FROM seed:s WHERE s.id == acct; 
  PRINT S1; 
}
INSTALL QUERY selectAccount
"""))

Semantic Check Fails: The query name selectAccount is used by another object! Please use a different name.
Failed to create queries: [selectAccount].


In [45]:
# RUN INSTALLED QUERY selectAccount
print(conn.runInstalledQuery("selectAccount", {"acct": "9913"}))

[{'S1': [{'v_id': '9913', 'v_type': 'Account', 'attributes': {'id': '9913', 'init_balance': 263.35, 'account_type': 'I', 'tx_behavior': 5, 'pagerank': 1, 'label': 871366657, 'current_balance': 45732063.23, 'min_send_tx': 19.03, 'min_receive_tx': 5.38, 'max_send_tx': 263.35, 'max_receive_tx': 21474836.47, 'avg_send_tx': 259.85971, 'avg_receive_tx': 87608.81203, 'cnt_receive_tx': 522, 'cnt_send_tx': 70}}]}]


In [46]:
# CREATE/INSTALL QUERY FOR ACCOUNTS TOP PAGERANK SCORES
print(conn.gsql("""
CREATE QUERY selectTopPageRank() FOR GRAPH AMLGraph { 
  seed = {Account.*};
  S1 = SELECT s FROM seed:s ORDER BY s.pagerank DESC LIMIT 10;
  PRINT S1; 
}
INSTALL QUERY selectTopPageRank
"""))

Semantic Check Fails: The query name selectTopPageRank is used by another object! Please use a different name.
Failed to create queries: [selectTopPageRank].


In [47]:
# RUN INSTALLED QUERY selectAccount
print(conn.runInstalledQuery("selectTopPageRank", {}))

[{'S1': [{'v_id': '9593', 'v_type': 'Account', 'attributes': {'id': '9593', 'init_balance': 573.08, 'account_type': 'I', 'tx_behavior': 3, 'pagerank': 1, 'label': 778043395, 'current_balance': 39047124.74, 'min_send_tx': 573.08, 'min_receive_tx': 10.34, 'max_send_tx': 573.08, 'max_receive_tx': 19653748, 'avg_send_tx': 573.08, 'avg_receive_tx': 205508.16663, 'cnt_receive_tx': 190, 'cnt_send_tx': 180}}, {'v_id': '9751', 'v_type': 'Account', 'attributes': {'id': '9751', 'init_balance': 128.79, 'account_type': 'I', 'tx_behavior': 4, 'pagerank': 1, 'label': 778043393, 'current_balance': 30231087.71, 'min_send_tx': 128.78, 'min_receive_tx': 2.92, 'max_send_tx': 128.78, 'max_receive_tx': 21474836.47, 'avg_send_tx': 128.78, 'avg_receive_tx': 25382.83704, 'cnt_receive_tx': 1191, 'cnt_send_tx': 800}}, {'v_id': '9584', 'v_type': 'Account', 'attributes': {'id': '9584', 'init_balance': 563.7, 'account_type': 'I', 'tx_behavior': 3, 'pagerank': 1, 'label': 778043396, 'current_balance': 1605352.54, 'm

In [50]:
# # CREATE/INSTALL QUERY FOR ACCOUNTS TRANSACTIONS
# print(conn.gsql("""
# CREATE QUERY selectAccountTx(STRING acct) FOR GRAPH AMLGraph { 
#   ListAccum<EDGE> @@txSend, @@txreceive;
#   seed = {Account.*};
  
#   SendTx = SELECT tgt FROM seed:s -(Send_Transaction:e)-> Transaction:tgt
#            WHERE s.id == acct
#            ACCUM @@txSend +=  e;
  
#   receiveTx = SELECT tgt FROM seed:s -(reverse_receive_Transaction:e)-> Transaction:tgt
#               WHERE s.id == acct
#               ACCUM @@txreceive += e;
  
#   PRINT @@txSend, @@txreceive; 
# }
# """))

In [51]:
# # RUN INSTALLED QUERY selectAccountTx
# print(conn.runInstalledQuery("selectAccountTx", {"acct": "9913"}))

## Additional Colab Projects to Explore


### An Introduction to Graph Machine Learning with Keras and TigerGraph
blog: https://medium.com/@dezhouc2/an-introduction-to-graph-machine-learning-with-keras-and-tigergraph-ab050f3be78b

colabnotebook: https://colab.research.google.com/drive/1Q3LDwceG1ZKEzxlYL52hyFPdO2S14HbI?authuser=1#scrollTo=aNUxboCOZAFc

### An Introduction to Graph Machine Learning with Tensorflow and TigerGraph
blog:
https://medium.com/@dezhouc2/an-introduction-to-graph-machine-learning-with-tensorflow-and-tigergraph-1e8aa7dd4386

colabnotebook:
https://colab.research.google.com/drive/14zmslH-QL7MYtwWaP5X7-BvasihWQt3K

### An Introduction to Graph Machine Learning with Pytorch and TigerGraph
blog: https://medium.com/@dezhouc2/an-introduction-to-graph-machine-learning-with-pytorch-and-tigergraph-b243f3906170

colabnotebook: https://colab.research.google.com/drive/1iKqxBVbwTruK8JTP_bSiLKMOelrBLZ5m#scrollTo=XxdsIUtXWEXh

### An Introduction to Graph Machine Learning with Scikit-Learn and TigerGraph
blog: https://medium.com/@dezhouc2/an-introduction-to-graph-machine-learning-with-sklearn-recommendation-engine-and-tigergraph-d6bbf2f93380

colabnotebook: https://colab.research.google.com/drive/170LpEAYK25uPWKdxNarJAJNnAVfokF-j?authuser=2

### The PageRank Algorithm and its Application
blog: https://medium.com/@dezhouc2/the-pagerank-algorithm-and-its-application-e75a8a6d5ef2

### Label Propagation Algorithm and its Application
blog: https://medium.com/@dezhouc2/label-propagation-algorithm-and-its-application-162d03f10d3a