# Graph Data Science workshop with Neo4j

Click the link below to open this notebook in Colab, where you can save and edit your own copy.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/neo4j-field/graph-summit-apac-2023/blob/main/GDS_Workshop.ipynb" target="_blank">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo">Run your own notebook in Colab
    </a>
  </td>
</table>

---
## Target

Perform fraud analysis on a set of people and transactions using graphs and data science.  

## Context

This notebook lets you load a dataset based on an updated version of [Paysim](https://www.sisu.io/posts/paysim/).  
PaySim uses an agent-based model, together with anonymized, aggregate transactional data from a real mobile money network operator, to create synthetic financial datasets that academics and hackers can use to explore ways of detecting fraudulent behavior.  
Using this [code](https://github.com/voutilad/paysim), you can generate your own dataset with different characteristics (size, fraud occurrences...).     

## Dataset

The dataset used in this notebook represents money transfers between around 2,500 clients, 75 merchants and 5 banks, with 175,000 transactions across 30 days.  
There are 5 types of transactions:  
* CashIn: a client moves money into the network via a merchant
* CashOut: a client moves money out of the network via a merchant
* Debit: a client moves money into a bank
* Transfer: a client sends money to another client
* Payment: a client exchanges money for something via a merchant

We will try to identify clients who are fraudsters, targeting other clients by taking their money in amounts kept below limits so as to go unnoticed.  
To the original PaySim data we added some client details (phone, email, SSN) so that fake profiles can be identified too.  

---

## Let's get a graph database

We will use a Neo4j graph database created on the [Neo4j sandbox](https://neo4j.com/sandbox/).  
Once connected, on the _Select a project_ page, go to the section _Your own data_ and select the _Blank Sandbox_.  
Click on the _Create_ button at the bottom of the page.  
After a few seconds, you should see the screen below.  
<img src="../img/sandbox_start.png" alt="Sandbox Start" width="50%" height="50%" title="Sandbox Start">  

Once it is up and running, you can access the connection details by clicking the down arrow at the top right and selecting the *Connection details* tab.  
You will need two things:
* Password  
* Bolt URL   

<img src="../img/sandbox_details.png" alt="Sandbox Details" width="50%" height="50%" title="Sandbox Details">  

---

## Let's code

First we will install and import the [Neo4j Python driver](https://pypi.org/project/neo4j-driver/).  

In [None]:
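# Install the Neo4j Python driver
# (the `neo4j-driver` package name is deprecated on PyPI; the same driver is published as `neo4j`)
import sys
!{sys.executable} -m pip install neo4j-driver

# Import the driver entry point
from neo4j import GraphDatabase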

### Instantiate your driver

Use the Neo4j/Bolt URI and credentials that match your setup.  

For a local standalone instance, Bolt connection without auth:    
`driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)`  

For a local standalone instance, Bolt connection with auth:    
`driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "<password>"))`  

For a remote cluster, Neo4j connection with auth:  
`driver = GraphDatabase.driver("neo4j://<FQDN or IP Address>:7687", auth=("neo4j", "<password>"))`  

For a remote standalone instance, Bolt connection with auth:   
`driver = GraphDatabase.driver("bolt://<FQDN or IP Address>:7687", auth=("neo4j", "<password>"))` 

In [None]:
driver = GraphDatabase.driver("bolt://100.26.208.7:7687", auth=("neo4j", "sting-paygrade-fiction")) # >> Update the password and the URL here <<
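
Hardcoding the URL and password is fine for a throwaway sandbox, but for anything longer-lived you may prefer to read them from the environment. The sketch below is one possible way to do that; the helper name `connection_settings` and the variable names `NEO4J_URI`, `NEO4J_USERNAME` and `NEO4J_PASSWORD` are assumptions made for this example, not something the sandbox sets for you.

```python
import os


def connection_settings(env=os.environ):
    """Return (uri, auth) for GraphDatabase.driver, read from a mapping of settings."""
    # NEO4J_URI, NEO4J_USERNAME and NEO4J_PASSWORD are hypothetical variable names
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USERNAME", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    # No password set: fall back to an unauthenticated connection (auth=None)
    auth = (user, password) if password else None
    return uri, auth


# Example with an explicit mapping instead of the real environment
uri, auth = connection_settings({
    "NEO4J_URI": "bolt://<FQDN or IP Address>:7687",
    "NEO4J_PASSWORD": "<password>",
})
print(uri, auth)
```

The returned values could then be passed on as `GraphDatabase.driver(uri, auth=auth)`.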

### Check the server details

In [None]:
server_info = driver.get_server_info()
print("IP:Port", server_info.address, "Protocol Version", server_info.protocol_version)

In [None]:
DATABASE = "neo4j"

### Cleaning the database to make it ready for a rerun of the notebook
We will then use the driver to load the data from CSV files by running [Cypher](https://neo4j.com/developer/cypher/) queries. Set the RELOAD_DATA flag to False to skip the cleaning and loading steps when experimenting with different algorithms later on.

In [None]:
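# Run each command as an implicit (auto-commit) transaction,
# which Cypher requires for CALL { ... } IN TRANSACTIONS
def execute(command_list, driver):
    with driver.session(database=DATABASE) as session:
        for c in command_list:
            # consume() waits for the query to finish and returns its summary
            summary = session.run(c).consume()
            print(summary.counters)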

In [None]:
# Set flag to control reloading of all data
RELOAD_DATA = True

if RELOAD_DATA: # Delete everything; takes a few minutes on a full database
    execute(["""
            MATCH (n) CALL {
              WITH n
              DETACH DELETE n
            } IN TRANSACTIONS OF 10 ROWS;
            """], driver)
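
The `IN TRANSACTIONS OF 10 ROWS` clause splits the work into batches and commits each batch separately, so a large delete does not have to fit into a single transaction. As a rough mental model only (this is plain Python, not how the database implements it), the batching can be sketched as follows; `batches` and `node_ids` are illustrative names.

```python
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# 25 pretend node ids, processed 10 at a time
node_ids = list(range(25))
batch_sizes = [len(b) for b in batches(node_ids, 10)]
print(batch_sizes)  # [10, 10, 5]
```

Each yielded batch corresponds to one committed transaction in the Cypher query above.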

### Test reading some data

Using [LOAD CSV](https://neo4j.com/docs/cypher-manual/current/clauses/load-csv/), we load CSV files into the database, creating the graph on the fly.  
The first cell tests file access by reading the clients file and returning only the first 5 rows.  

In [None]:
# Checking if we can access the data
if RELOAD_DATA:
    records, _, _ = driver.execute_query(
        """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/clients.csv" AS row
        RETURN row.NAME as Name, row.PHONENUMBER as phoneNumber, row.SSN as SSN, row.EMAIL as email LIMIT 5
        """
    , database_=DATABASE)
    # execute_query returns a list of Record objects; print each one as a dict
    for record in records:
        print(record.data())
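
With `WITH HEADERS`, each CSV line becomes a map keyed by the column names, and every value arrives as a string. Python's `csv.DictReader` behaves in a comparable way, so the preview query can be approximated locally as below. The two sample rows are made up for illustration; only the column names come from the query above.

```python
import csv
import io

# Made-up sample rows using the column names referenced in the query
SAMPLE = """ID,NAME,PHONENUMBER,SSN,EMAIL
1,Alice Example,555-0100,000-00-0001,alice@example.com
2,Bob Example,555-0101,000-00-0002,bob@example.com
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Equivalent of: RETURN row.NAME as Name, ... LIMIT 5
preview = [
    {
        "Name": row["NAME"],
        "phoneNumber": row["PHONENUMBER"],
        "SSN": row["SSN"],
        "email": row["EMAIL"],
    }
    for row in rows[:5]
]

for line in preview:
    print(line)
```

Because all values are strings, the load queries later in the notebook convert numeric columns explicitly with `toInteger` and `toFloat`.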

### Creating constraints and indexes

For data integrity, we will create [constraints](https://neo4j.com/docs/cypher-manual/current/constraints/) to have a robust graph data model. Each constraint enforces uniqueness of an identifier for a given label. An index is also created on the name property of Client nodes, which allows fast lookups when querying clients by name.

In [None]:
if RELOAD_DATA:
    # Create the uniqueness constraints, plus an index on Client.name
    CONSTRAINTS = [
      "CREATE CONSTRAINT ClientConstraint IF NOT EXISTS FOR (p:Client) REQUIRE p.id IS UNIQUE;",
      "CREATE CONSTRAINT EmailConstraint IF NOT EXISTS FOR (p:Email) REQUIRE p.email IS UNIQUE;",
      "CREATE CONSTRAINT PhoneConstraint IF NOT EXISTS FOR (p:Phone) REQUIRE p.phoneNumber IS UNIQUE;",
      "CREATE CONSTRAINT SSNConstraint IF NOT EXISTS FOR (p:SSN) REQUIRE p.ssn IS UNIQUE;",
      "CREATE CONSTRAINT MerchantConstraint IF NOT EXISTS FOR (p:Merchant) REQUIRE p.id IS UNIQUE;",
      "CREATE CONSTRAINT BankConstraint IF NOT EXISTS FOR (p:Bank) REQUIRE p.id IS UNIQUE;",
      "CREATE CONSTRAINT TransactionConstraint IF NOT EXISTS FOR (p:Transaction) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT DebitConstraint IF NOT EXISTS FOR (p:Debit) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT CashInConstraint IF NOT EXISTS FOR (p:CashIn) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT CashOutConstraint IF NOT EXISTS FOR (p:CashOut) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT TransferConstraint IF NOT EXISTS FOR (p:Transfer) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT PaymentConstraint IF NOT EXISTS FOR (p:Payment) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE INDEX      ClientNameIndex IF NOT EXISTS FOR (n:Client) ON (n.name)"
    ]
    for c in CONSTRAINTS:
        records, _, _ = driver.execute_query(c, database_=DATABASE)
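
The constraint statements all follow one pattern: a label, a property, and a name derived from the label. One way to avoid copy-and-paste slips is to generate them from a small mapping. This is only a sketch: `UNIQUE_KEYS` and `constraint_statement` are names invented here, and the mapping simply restates the label/property pairs used in this notebook.

```python
# Label -> property that must be unique for that label
UNIQUE_KEYS = {
    "Client": "id",
    "Email": "email",
    "Phone": "phoneNumber",
    "SSN": "ssn",
    "Merchant": "id",
    "Bank": "id",
    "Transaction": "globalStep",
    "Debit": "globalStep",
    "CashIn": "globalStep",
    "CashOut": "globalStep",
    "Transfer": "globalStep",
    "Payment": "globalStep",
}


def constraint_statement(label, prop):
    """Build one uniqueness-constraint statement for the given label and property."""
    return (
        f"CREATE CONSTRAINT {label}Constraint IF NOT EXISTS "
        f"FOR (p:{label}) REQUIRE p.{prop} IS UNIQUE;"
    )


statements = [constraint_statement(label, prop) for label, prop in UNIQUE_KEYS.items()]
for s in statements:
    print(s)
```

Each generated string could be passed to `driver.execute_query` in the same loop as the hand-written list.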

### Loading all the data

We will load 7 CSV files:  
* one for clients   
* one for merchants  
* five for transactions  

We can see how each node is created with a label and at least one property.  
We also see the relationships created between the nodes, which represent the money exchanges between all entities.  

In [None]:
COMMANDS = ["""
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/clients.csv" AS row
        WITH row
        MERGE (c:Client { id: row.ID })
        SET c.name = row.NAME
        MERGE (p:Phone { phoneNumber: row.PHONENUMBER })
        MERGE (c)-[:HAS_PHONE]->(p)
        MERGE (s:SSN { ssn: row.SSN })
        MERGE (c)-[:HAS_SSN]->(s)
        MERGE (e:Email { email: row.EMAIL })
        MERGE (c)-[:HAS_EMAIL]->(e);
    """,
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/merchants.csv" AS row
        WITH row
        MERGE (m:Merchant { id: row.ID })
        SET m.name = row.NAME, m.highRisk = toBoolean(row.HIGHRISK);
    """,
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/debit.csv" AS row
        WITH row
        MERGE (b:Bank { id: row.IDDEST })
        SET b.name = row.NAMEDEST
        MERGE (c:Client { id: row.IDORIG })
        MERGE (t:Transaction:Debit { globalStep: toInteger(row.GLOBALSTEP) })
        SET t.amount = toFloat(row.AMOUNT)
        MERGE (t)-[:TO]->(b)
        MERGE (c)-[:PERFORMED]->(t);
    """,
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/cashin.csv" AS row
        CALL {
            WITH row
            MERGE (m:Merchant { id: row.IDDEST })
            SET m.name = row.NAMEDEST
            MERGE (c:Client { id: row.IDORIG })
            MERGE (t:Transaction:CashIn { globalStep: toInteger(row.GLOBALSTEP) })
            SET t.amount = toFloat(row.AMOUNT)
            MERGE (t)-[:TO]->(m)
            MERGE (c)-[:PERFORMED]->(t)
        } IN TRANSACTIONS OF 10 ROWS;
    """,
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/cashout.csv" AS row
        CALL {
            WITH row
            MERGE (m:Merchant { id: row.IDDEST })
            SET m.name = row.NAMEDEST
            MERGE (c:Client { id: row.IDORIG })
            SET c.name = row.NAMEORIG
            MERGE (t:Transaction:CashOut { globalStep: toInteger(row.GLOBALSTEP) })
            SET t.amount = toFloat(row.AMOUNT)
            MERGE (t)-[:TO]->(m)
            MERGE (c)-[:PERFORMED]->(t)
        } IN TRANSACTIONS OF 10 ROWS;
    """,
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/payment.csv" AS row
        CALL {
            WITH row
            MERGE (m:Merchant { id: row.IDDEST })
            SET m.name = row.NAMEDEST
            MERGE (c:Client { id: row.IDORIG })
            SET c.name = row.NAMEORIG
            MERGE (t:Transaction:Payment { globalStep: toInteger(row.GLOBALSTEP) })
            SET t.amount = toFloat(row.AMOUNT)
            MERGE (t)-[:TO]->(m)
            MERGE (c)-[:PERFORMED]->(t)
        } IN TRANSACTIONS OF 5 ROWS;
    """,
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/transfer.csv" AS row
        CALL {
            WITH row
            MERGE (cd:Client { id: row.IDDEST })
            SET cd.name = row.NAMEDEST
            MERGE (co:Client { id: row.IDORIG })
            SET co.name = row.NAMEORIG
            MERGE (t:Transaction:Transfer { globalStep: toInteger(row.GLOBALSTEP) })
            SET t.amount = toFloat(row.AMOUNT)
            MERGE (t)-[:TO]->(cd)
            MERGE (co)-[:PERFORMED]->(t)
        } IN TRANSACTIONS OF 5 ROWS;
    """
]
        
if RELOAD_DATA:
    execute(COMMANDS, driver)
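
The load queries use `MERGE` rather than `CREATE`: a node is created only if no node with that label and identifier exists yet, which is why a client appearing in several files ends up as a single node. As a simplified mental model (single label, single key property, in-memory dict; not the database's actual mechanism), this get-or-create behaviour can be sketched as below. `nodes` and `merge_node` are illustrative names.

```python
nodes = {}  # (label, key, value) -> property dict


def merge_node(label, key, value):
    """Return the existing node for this identifier, creating it on first sight."""
    return nodes.setdefault((label, key, value), {key: value})


first = merge_node("Client", "id", "42")
first["name"] = "Example Client"          # comparable to SET c.name = ...
again = merge_node("Client", "id", "42")  # matched, not created a second time

print(len(nodes), again)
```

The uniqueness constraints created earlier also back this lookup with an index, which helps `MERGE` find existing nodes quickly.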

We have now taken a series of flat data sources and constructed a rich graph representation of the connections present in the sample dataset. At this point we have the following data model:
 
<img src="../img/initial_data_model.png" alt="Initial graph data model" width="100%" height="100%" title="Initial Graph Data Model">  

---
### Enriching the graph

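Using the transaction details, we can enrich the model by linking each client to their first and last transaction (FIRST_TX, LAST_TX) and by chaining each client's transactions in chronological order (NEXT), based on the global step.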

In [None]:
if RELOAD_DATA:
    # Update data model with new relationships
    execute([
    """
    MATCH (c:Client) with c.id as clientId
    CALL {
        WITH clientId
        MATCH (c:Client {id: clientId})-[:PERFORMED]->(tx:Transaction)
        WITH c, tx ORDER BY tx.globalStep
        WITH c, collect(tx) AS txs
        WITH c, txs, head(txs) AS _start, last(txs) AS _last

        MERGE (c)-[:FIRST_TX]->(_start)
        MERGE (c)-[:LAST_TX]->(_last)
        WITH c, apoc.coll.pairsMin(txs) AS pairs

        UNWIND pairs AS pair
          WITH pair[0] AS a, pair[1] AS b
          MERGE (a)-[n:NEXT]->(b)
    } IN TRANSACTIONS OF 10 ROWS;
    """
    ], driver)
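
The per-client logic of this query is: sort the transactions by global step, take the first and the last, then pair each transaction with its successor. The same steps can be sketched in plain Python, with integers standing in for transactions; `order_transactions` is a name made up for this example.

```python
def order_transactions(global_steps):
    """Return (first, last, consecutive pairs) for one client's transactions."""
    txs = sorted(global_steps)              # ORDER BY tx.globalStep
    if not txs:
        return None, None, []
    pairs = list(zip(txs, txs[1:]))         # same idea as apoc.coll.pairsMin(txs)
    return txs[0], txs[-1], pairs           # head(txs), last(txs), NEXT pairs


first_tx, last_tx, next_pairs = order_transactions([30, 10, 20])
print(first_tx, last_tx, next_pairs)  # 10 30 [(10, 20), (20, 30)]
```

A client with a single transaction gets FIRST_TX and LAST_TX pointing at the same node and no NEXT relationships.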

These changes have created a new layer in the graph, where relationships show transactions in chronological order:

<img src="../img/enhanced_data_model.png" alt="Enhanced graph data model" width="100%" height="100%" title="Enhanced Graph Data Model"> 

This allows us to query and view transaction data in different ways. For example, we can show John's transactions as a group or in their order, as below:

Performed transactions | Ordered transactions
- | - 
![alt](../img/performed_relationships.png) | ![alt](../img/ordered_relationships.png)




### Having a first look at the dataset

In [None]:
result, _, _ = driver.execute_query(
    """
    CALL apoc.meta.stats() YIELD nodeCount, labels
    UNWIND keys(labels) as label
    RETURN label as nodeLabel, 
        labels[label] as frequency,
        round(toFloat(labels[label])/nodeCount, 3) as relativeFrequency
    ORDER BY frequency DESC
    """
, database_=DATABASE)
result
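
The query divides each label's count by the total node count and rounds to three decimals. A small Python sketch of the same arithmetic is shown below; the counts are invented for illustration and `label_frequencies` is a hypothetical helper name.

```python
def label_frequencies(label_counts, node_count):
    """Return (label, frequency, relativeFrequency) rows, most frequent first."""
    rows = [
        (label, count, round(count / node_count, 3))
        for label, count in label_counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Invented counts: 4 nodes in total, 3 labelled A and 1 labelled B
frequencies = label_frequencies({"B": 1, "A": 3}, 4)
print(frequencies)  # [('A', 3, 0.75), ('B', 1, 0.25)]
```

In the real graph a node can carry more than one label (every transaction is both a `Transaction` and one of the five types), so the relative frequencies are not expected to sum to 1.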

### Let's look at how the money is exchanged across entities

In [None]:
result, _, _ = driver.execute_query(
    """
    MATCH (t:Transaction)
    WITH sum(t.amount) AS globalSum, count(t) AS globalCnt 
    MATCH (t:Transaction)
    WITH labels(t)[1] as txType, count(t) as txCnt, sum(t.amount) as txTotal, globalSum, globalCnt
    RETURN
        txType,
        toInteger(round(txTotal/1000000)) + 'M' AS TotalMarketValue,
        round(100 * txTotal / globalSum, 1) AS `%MarketValue`,
        round(100 * toFloat(txCnt) / globalCnt, 1) AS `%MarketTransactions`,
        toInteger(txTotal / txCnt) AS AvgTransactionValue,
        txCnt AS NumberOfTransactions
    ORDER BY `%MarketTransactions` DESC
    """
, database_=DATABASE)
result
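
The query first computes the global totals and then, per transaction type, each type's share of the total value and of the total count. The sketch below reproduces that aggregation in plain Python on three invented transactions; `summarize` and the dictionary keys are illustrative names, and Python's tie-breaking when rounding may differ from Cypher's.

```python
from collections import defaultdict


def summarize(transactions):
    """Aggregate (txType, amount) pairs into per-type market shares."""
    total_amount = sum(amount for _, amount in transactions)
    total_count = len(transactions)
    by_type = defaultdict(lambda: [0, 0.0])  # txType -> [count, sum of amounts]
    for tx_type, amount in transactions:
        by_type[tx_type][0] += 1
        by_type[tx_type][1] += amount
    rows = [
        {
            "txType": tx_type,
            "pctMarketValue": round(100 * total / total_amount, 1),
            "pctMarketTransactions": round(100 * count / total_count, 1),
            "avgTransactionValue": int(total / count),
            "numberOfTransactions": count,
        }
        for tx_type, (count, total) in by_type.items()
    ]
    return sorted(rows, key=lambda r: r["pctMarketTransactions"], reverse=True)


# Invented sample data
summary_rows = summarize([("Payment", 100.0), ("Payment", 300.0), ("Transfer", 600.0)])
for row in summary_rows:
    print(row)
```

In this sample, Payment accounts for two thirds of the transactions but only 40% of the value, which is the kind of contrast the query is meant to surface.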